I'm working in a Google Colab notebook and set up via
!wget http://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash
import nlu
a quick version check nlu.version() confirms 3.4.2
Several of the official tutorial notebooks (for ex.: XLNet) create a multi-model pipeline that includes both 'sentiment' and 'emotion'.
Direct copy of content from the notebook:
import pandas as pd
# Download the dataset
!wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sarcasm/train-balanced-sarcasm.csv -P /tmp
# Load dataset to Pandas
df = pd.read_csv('/tmp/train-balanced-sarcasm.csv')
pipe = nlu.load('sentiment pos xlnet emotion')
df['text'] = df['comment']
max_rows = 200
predictions = pipe.predict(df.iloc[0:100][['comment','label']], output_level='token')
predictions
However, running a prediction on this pipe results in the following error:
sentimentdl_glove_imdb download started this may take some time.
Approximate size to download 8.7 MB
[OK!]
pos_anc download started this may take some time.
Approximate size to download 3.9 MB
[OK!]
xlnet_base_cased download started this may take some time.
Approximate size to download 417.5 MB
[OK!]
classifierdl_use_emotion download started this may take some time.
Approximate size to download 21.3 MB
[OK!]
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
---------------------------------------------------------------------------
IllegalArgumentException Traceback (most recent call last)
<ipython-input-1-9b2e4a06bf65> in <module>()
34
35 # NLU to gives us one row per embedded word by specifying the output level
---> 36 predictions = pipe.predict( df.iloc[0:5][['text','label']], output_level='token' )
37
38 display(predictions)
9 frames
/usr/local/lib/python3.7/dist-packages/pyspark/sql/utils.py in raise_from(e)
IllegalArgumentException: requirement failed: Wrong or missing inputCols annotators in SentimentDLModel_6c1a68f3f2c7.
Current inputCols: sentence_embeddings#glove_100d. Dataset's columns:
(column_name=text,is_nlp_annotator=false)
(column_name=document,is_nlp_annotator=true,type=document)
(column_name=sentence,is_nlp_annotator=true,type=document)
(column_name=sentence_embeddings#tfhub_use,is_nlp_annotator=true,type=sentence_embeddings).
Make sure such annotators exist in your pipeline, with the right output names and that they have following annotator types: sentence_embeddings
Having experimented with various combinations of models, it turns out that the problem is caused whenever 'sentiment' and 'emotion' models are specified in the same pipeline (regardless of pipeline order or what other models are listed).
Running pipe = nlu.load('emotion ANY OTHER MODELS') or pipe = nlu.load('sentiment ANY OTHER MODELS') will be successful, so it really appears to be only a result of combining 'sentiment' and 'emotion'
Is this a known bug? Does anyone have any suggestions for fixing?
My temporary solution has been to run emoPipe = nlu.load('emotion').predict() in isolation, then inner join the resulting dataframe to the the resulting df of pipe = nlu.load('sentiment pos xlnet').predict().
However, I would like to understand better what is failing and to know if there is a way to streamline the inclusion of all models.
Thanks
Related
Hi all I have been researching this challenge for some time now and all my efforts are futile. What I am trying to do?
I am running YOLOV5 and this is working fine in the training stage and the detection stages. The operations is outputting multiple .txt files per video for each detection per frame. This is the command I am using:
python3 detect.py --weights /Users/YOLO2ClassOnly/yolov5/runs/train/exp11/weights/best.pt --source /Users/YOLO2ClassOnly/yolov5/data/videos --conf 0.1 --line-thickness 1 --save-txt SAVE_TXT --save-conf
This command produces multiple text files, for example vid0_walking.txt, vid1_walking.txt, vid2_walking.txt...n/ etc.
This is depleting my storage resources and I am trying to avoid this.
What I would like to do?
Store the files in one .csv file in this format, please.
# xmin ymin xmax ymax confidence class name
# 0 749.50 43.50 1148.0 704.5 0.874023 0 person
# 2 114.75 195.75 1095.0 708.0 0.624512 0 person
# 3 986.00 304.00 1028.0 420.0 0.286865 27 tie
I have been following Glen Jorcher Links Here:
https://github.com/ultralytics/yolov5/issues/7499
But this is futile, this function print(results.pandas().xyxy[0])
is not working to generate the output for video as per above.
Please help, this is challenging me due to my lack of understanding.
Thanx in advance for acknowledging my digital presence and I am grateful for your guidance!
I have a dataframe(df) with 1 million rows and two columns (ID (long int) and description(String)). After transforming them into tfidf (using Tokenizer, HashingTF, and IDF), the dataframe, df has two columns (ID and features (sparse vector).
I computed the item-item similarity matrix using udf and dot function.
Computing the similarities is done successfully.
However, when I'm calling the show() function getting
"raise EOFError"
I read so many questions on this issue but did not get right answer yet.
Remember, if I apply my solution on a small dataset (like 100 rows), everything is working successfully.
Is it related to the out of memory issue?
I checked my dataset and description information, I don't see any records with null or unsupported text messages
dist_mat = data.alias("i").join(data.alias("j"), psf.col("i.ID") < psf.col("j.ID")) \
.select(psf.col("i.ID").alias("i"), psf.col("j.ID").alias("j"),
dot_udf("i.features", "j.features").alias("score"))
dist_mat = dist_mat.filter(psf.col('score') > 0.05)
dist_mat.show(1)```
If I removed the last line dist_mat.show(), it is working without error. However, when I used this line, got the error like
.......
```Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded```
...
Here is the part of the error message:
```[Stage 6:=======================================================> (38 + 1) / 39]Traceback (most recent call last):
File "/usr/local/Cellar/apache-spark/2.4.0/libexec/python/lib/pyspark.zip/pyspark/daemon.py", line 170, in manager
File "/usr/local/Cellar/apache-spark/2.4.0/libexec/python/lib/pyspark.zip/pyspark/daemon.py", line 73, in worker
File "/usr/local/Cellar/apache-spark/2.4.0/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 397, in main
if read_int(infile) == SpecialLengths.END_OF_STREAM:
File "/usr/local/Cellar/apache-spark/2.4.0/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 714, in read_int
raise EOFError
EOFError```
I increased the cluster size and run it again. It is working without errors. So, the error message is true
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
However, computing the pairwise similarities for such a large scale matrix, I found an alternative solution, Large scale matrix multiplication with pyspark
In fact, it is very efficient and much more faster, even better than the use of BlockMatrix
My question is, if I want to create one tfrecords file for my data , it will take approximately 15 days to finish it, it has 500000 pairs of template , and each template is 32 frames( images). In order to save the time, I have 3 GPUs, so I thought I can create three tfrocords file each one file on one GPUs and then I can finish creating the tfrecords in 5 days. But then I searched about a way to merge these three files in one file and couldn't find proper solution.
So Is there any way to merge these three files in one file, OR is there any way that I can train my network by feeding batch of example extracted form the three tfrecords files, knowing I am using Dataset API.
As the question is asked two months ago, I thought you already find the solution. For the follows, the answer is NO, you do not need to create a single HUGE tfrecord file. Just use the new DataSet API:
dataset = tf.data.TFRecordDataset(filenames_to_read,
compression_type=None, # or 'GZIP', 'ZLIB' if compress you data.
buffer_size=10240, # any buffer size you want or 0 means no buffering
num_parallel_reads=os.cpu_count() # or 0 means sequentially reading
)
# Maybe you want to prefetch some data first.
dataset = dataset.prefetch(buffer_size=batch_size)
# Decode the example
dataset = dataset.map(single_example_parser, num_parallel_calls=os.cpu_count())
dataset = dataset.shuffle(buffer_size=number_larger_than_batch_size)
dataset = dataset.batch(batch_size).repeat(num_epochs)
...
For details, check the document.
Addressing the question title directly for anyone looking to merge multiple .tfrecord files:
The most convenient approach would be to use the tf.Data API:
(adapting an example from the docs)
# Create dataset from multiple .tfrecord files
list_of_tfrecord_files = [dir1, dir2, dir3, dir4]
dataset = tf.data.TFRecordDataset(list_of_tfrecord_files)
# Save dataset to .tfrecord file
filename = 'test.tfrecord'
writer = tf.data.experimental.TFRecordWriter(filename)
writer.write(dataset)
However, as pointed out by holmescn, you'd likely be better off leaving the .tfrecord files as separate files and reading them together as a single tensorflow dataset.
You may also refer to a longer discussion regarding multiple .tfrecord files on Data Science Stackexchange
The answer by MoltenMuffins works for higher versions of tensorflow. However, if you are using lower versions, you have to iterate through the three tfrecords and save them them into a new record file as follows. This works for tf versions 1.0 and above.
def comb_tfrecord(tfrecords_path, save_path, batch_size=128):
with tf.Graph().as_default(), tf.Session() as sess:
ds = tf.data.TFRecordDataset(tfrecords_path).batch(batch_size)
batch = ds.make_one_shot_iterator().get_next()
writer = tf.python_io.TFRecordWriter(save_path)
while True:
try:
records = sess.run(batch)
for record in records:
writer.write(record)
except tf.errors.OutOfRangeError:
break
Customizing the above the script for better tfrecords listing
import os
import glob
import tensorflow as tf
save_path = 'data/tf_serving_warmup_requests'
tfrecords_path = glob.glob('data/*.tfrecords')
dataset = tf.data.TFRecordDataset(tfrecords_path)
writer = tf.data.experimental.TFRecordWriter(save_path)
writer.write(dataset)
( please pardon my long post, dearly appreciate your help )
I am training the squeezeDet model for the pascal VOC style custom data as per the training code from the repository HERE
train.py
model_definition and HERE
the saved model checkpoint performs well as I can see acceptable performance.
Now i am trying to freeze the model for deployment using coreML to see how the performance is in a mobile platform. The authors of the script only report performance in a GPU environment in their research paper.
I follow the recommended steps as per tensorflow, my commands are as below
First,
I write the graph out from the checkpoint meta file
path_to_ckpt_meta = rootdir + "model.ckpt-355000.meta"
path_to_ckpt_data = rootdir + "model.ckpt-355000"
sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))
saver = tf.train.import_meta_graph(path_to_ckpt_meta)
saver.restore(sess, path_to_ckpt_data)
tf.train.write_graph(tf.get_default_graph().as_graph_def(), rootdir, "model_ckpt_355000_graph_V2.pb", False)
Now
I check the graph summary as see all the tensors in the model . The output summary file is HERE.
However, when I check the checkpoint file using the inspect_checkpoint.py function from tensorflow I see no image_input nodes. The output of inspection is HERE.
Second
I freeze the graph using the tensorflow freeze_graph.py function
python ./tensorflow/python/tools/freeze_graph.py \
--input_graph=path-to-dir/train/model_ckpt_355000_graph.pb \
--input_checkpoint=path-to-dir/train/model.ckpt-355000 \
--output_graph=path-to-dir/train/frozen_sqdt_ckpt_355000.pb \
--output_node_names=bbox/trimming/bbox,probability/score,probability/class_idx
the freeze_graph call completes without error and results in the frozen graph as per the command above.
Now,
when I check the frozen graph using the summarize_graph function call
bazel-bin/tensorflow/tools/graph_transforms/summarize_graph --in_graph=/tmp/logs/squeezeDet_NewDataset_test01_March02/train/frozen_sqdt_ckpt_355000.pb
I get the following
No inputs spotted.
No variables spotted.
Found 3 possible outputs: (name=bbox/trimming/bbox, op=Transpose) (name=probability/score, op=Max) (name=probability/class_idx, op=ArgMax)
Found 2703452 (2.70M) const parameters, 0 (0) variable parameters, and 0 control_edges
Op types used: 130 Const, 68 Identity, 32 BiasAdd, 32 Conv2D, 31 Relu, 15 Mul, 14 Add, 10 ConcatV2, 9 Sub, 5 RealDiv, 5 Reshape, 4 Maximum, 4 Minimum, 3 StridedSlice, 3 MaxPool, 2 Exp, 2 Greater, 2 Cast, 2 Select, 1 Transpose, 1 Softmax, 1 Sigmoid, 1 Unpack, 1 RandomUniform, 1 QueueDequeueManyV2, 1 Pack, 1 Max, 1 Floor, 1 FIFOQueueV2, 1 ArgMax
To use with tensorflow/tools/benchmark:benchmark_model try these arguments:
bazel run tensorflow/tools/benchmark:benchmark_model -- --graph=/tmp/logs/squeezeDet_NewDataset_test01_March02/train/frozen_sqdt_ckpt_355000.pb --show_flops --input_layer= --input_layer_type= --input_layer_shape= --output_layer=bbox/trimming/bbox,probability/score,probability/class_idx
this output above suggests that there is no input detected from the frozen graph. I check the summary of the frozen graph and find no image_input tensor. HERE
When I check my original graph ( written in step 1 ) with summarize graph, It does show inputs.
My troubleshooting
Suggests there is some mixup in the original authors code where the image_input is not provided as an input tensor. Though, the confusing part is that I can see the input image tensor in the summary of the output graph from the checkpoint meta file.
My question is,
-- why is the frozen graph removing the input nodes, when the original graph has the inputs ?
-- And, what can I do to change this and be able to successfully freeze_graph correctly.
Is there a transformation that need to perform in order to make this freeze model compatible with the coreML format.?
All your help is much appreciated.
Best
Aman
I have an application for which I need to import U.S. National Weather Service surface analyses, which are distributed as grib2 files. I want to pull those into PostGIS 2.0 rasters, do some calculations and modeling, and display the data and model results in GeoServer.
Since grib2 is a GDAL-supported format, the supplied raster2pgsql utility should be able to slurp a grib2 right into PostGIS-compatible SQL, and once it's there, GeoServer ought to be able to handle it. However, I'm running into problems which have no obvious solutions -- not obvious to me, at any rate! Raster2pgsql runs, apparently without errors, producing SQL, and running the SQL creates what looks very much like a raster. But GeoServer can't display it -- the bounds, in particular, come out looking weird (0,0 -1,-1) and "preview layer" just throws a NullPointerException.
Has anyone been down this road already? I've got issues as basic as not knowing what the SRID should be for the data (4326, perhaps?). I don't expect anyone to debug my problems for me but if someone has already got this toolchain working, or part of it, I can plug known-good things in and see what I can discover.
TIA,
rw
Updated: Per Mike, here is the coordinate-system stuff from one of the files; I elided the other 749 bands in the output from "gdalinfo". Note that the filename is different -- I found out by running "gdalinfo" on my original file that something was wrong with it, gdalinfo couldn't read it. New (35MB!) file here.
Gdalinfo output:
Driver: GRIB/GRIdded Binary (.grb)
Files: ruc2.t00z.bgrb13anl.grib2
Size is 451, 337
Coordinate System is:
PROJCS["unnamed",
GEOGCS["Coordinate System imported from GRIB file",
DATUM["unknown",
SPHEROID["Sphere",6371229,0]],
PRIMEM["Greenwich",0],
UNIT["degree",0.0174532925199433]],
PROJECTION["Lambert_Conformal_Conic_2SP"],
PARAMETER["standard_parallel_1",25],
PARAMETER["standard_parallel_2",25],
PARAMETER["latitude_of_origin",0],
PARAMETER["central_meridian",265],
PARAMETER["false_easting",0],
PARAMETER["false_northing",0]]
Origin = (-3332155.288903323933482,6830293.833488883450627)
Pixel Size = (13545.000000000000000,-13545.000000000000000)
Corner Coordinates:
Upper Left (-3332155.289, 6830293.833) (139d51'22.04"W, 54d10'20.71"N)
Lower Left (-3332155.289, 2265628.833) (126d 6'34.06"W, 16d 9'49.48"N)
Upper Right ( 2776639.711, 6830293.833) ( 57d12'21.76"W, 55d27'10.73"N)
Lower Right ( 2776639.711, 2265628.833) ( 68d56'16.73"W, 17d11'55.33"N)
Center ( -277757.789, 4547961.333) ( 98d 8'30.73"W, 39d54'5.40"N)
Band 1 Block=451x1 Type=Float64, ColorInterp=Undefined
Description = 1[-] HYBL="Hybrid level"
Metadata:
GRIB_UNIT=[Pa]
GRIB_COMMENT=Pressure [Pa]
GRIB_ELEMENT=PRES
[Etc., Etc., for all 750 bands]
I hope this helps, at least those comming to this thread.
Bear in mind that GeoServer, while being capable of loading Raster data from PostGIS, the default PostGIS "importing" module is ONLY available for vector data, that's why you get those odd bounds (-1 -1 0 0).
You'll have to add ImageMosaicJDBC plugin to your geoserver installation, follow steps here!
http://docs.geoserver.org/latest/en/user/tutorials/imagemosaic-jdbc/imagemosaic-jdbc_tutorial.html
Got an excellent answer to my problem here. Putting it in as a separate answer.
He recommended using gdalwarp to pull the GRIB2 file into a known SRID, thus:
gdalwarp -t_srs EPSG:4326 original_file.grib2 4326_file.grib2
Then, raster2pgsql works just fine, e.g.
raster2pgsql -M -a 4326_file.grib2 some_sql.sql