What attributes are missing as defined in the HDF5 specification, and what metadata is missing in the h5md group?

I have an HDF5-format data file containing molecular dynamics simulation data. For quick inspection, the h5ls tool is handy. For example:
h5ls -d xaa.h5/particles/lipids/positions/time | less
Now, my question is based on a comment I received on the data format: which attributes are missing according to the HDF5 specification, and which metadata is missing in the group?

Are you trying to get the value of the Time attribute from a dataset? If so, you need to use h5dump, not h5ls. Also, the attributes are attached to each dataset, so you have to include the dataset name in the path. Finally, attribute names are case sensitive: Time != time. Here is the required command for dataset_0000 (repeat for 0001 through 0074):
h5dump -d /particles/lipids/positions/dataset_0000/Time xaa.h5
You can also get attributes with Python code. Simple example below:
import h5py

with h5py.File('xaa.h5', 'r') as h5f:
    for ds, h5obj in h5f['/particles/lipids/positions'].items():
        print(f'For dataset={ds}; Time={h5obj.attrs["Time"]}')
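If the goal is also to see which attributes are actually present (so you can compare them against what the specification expects), a small follow-up sketch like the one below lists every attribute attached to the objects under that group. It only assumes the same file and group path used above; everything else is standard h5py.
import h5py

def show_attrs(name, obj):
    # Print every attribute attached to this group or dataset.
    for key, value in obj.attrs.items():
        print(f'{name}: {key} = {value}')

with h5py.File('xaa.h5', 'r') as h5f:
    # visititems walks all sub-groups and datasets below the given group.
    h5f['/particles/lipids/positions'].visititems(show_attrs)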

Related

What does the Time (0040,A122) tag mean in a DICOM header?

I have some trouble understanding the value of the Time (0040,A122) tag. I am trying to update anonymization software, but I can't seem to find any example of the actual tag.
The DICOM standard (PS 3.3) mentions that:
This is the Value component of a Name/Value pair when the Concept implied by Concept Name Code Sequence (0040,A043) is a time.
Note
The purpose or role of the date value could be specified in Concept Name Code Sequence (0040,A043).
Required if the value that Concept Name Code Sequence (0040,A043) requires (implies) is a time. Shall not be present otherwise.
So basically, Concept Name Code Sequence (0040,A043) specifies what type of time it is? I would like to know what some examples of Concept Name Code Sequence are.
I would suggest having a look at the SR sample given in the DICOM standard, section PS 3.20:
A.7.2 Target DICOM SR "Measurement Report" (TID 1500)
In particular:
>>>>1.5.1.1.4: HAS ACQ CONTEXT: TIME: (111061,DCM,"Study Time") = "070844"
You may also want to check PS 3.16 for the definition of TID 1500:
TID 1500 Measurement Report
Just as a reminder, Enhanced SR objects are defined in PS 3.3:
A.35.2 Enhanced SR IOD
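If it helps to see this programmatically, here is a minimal, hedged sketch using pydicom (the file name is a placeholder): it walks an SR document's content tree and prints each TIME content item together with the Concept Name Code Sequence that says which kind of time it carries, mirroring the (111061,DCM,"Study Time") example above.
import pydicom

def print_time_items(items, depth=0):
    # Recursively walk the SR content tree looking for TIME content items.
    for item in items:
        if item.ValueType == 'TIME':
            name = item.ConceptNameCodeSequence[0]
            print('    ' * depth +
                  f'TIME: ({name.CodeValue},{name.CodingSchemeDesignator},'
                  f'"{name.CodeMeaning}") = "{item.Time}"')
        if 'ContentSequence' in item:
            print_time_items(item.ContentSequence, depth + 1)

ds = pydicom.dcmread('sr_measurement_report.dcm')  # placeholder file name
print_time_items(ds.ContentSequence)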

How can I merge multiple tfrecords files into one file?

My problem is that if I want to create one tfrecords file for my data, it will take approximately 15 days to finish: it has 500,000 pairs of templates, and each template is 32 frames (images). To save time, since I have 3 GPUs, I thought I could create three tfrecords files, one file per GPU, and finish creating the tfrecords in 5 days. But then I searched for a way to merge these three files into one and couldn't find a proper solution.
So, is there any way to merge these three files into one file, OR is there any way to train my network by feeding batches of examples extracted from the three tfrecords files, knowing I am using the Dataset API?
As the question was asked two months ago, I assume you have already found a solution. For those who come after, the answer is NO, you do not need to create a single HUGE tfrecords file. Just use the new Dataset API:
dataset = tf.data.TFRecordDataset(filenames_to_read,
                                  compression_type=None,   # or 'GZIP', 'ZLIB' if your data is compressed
                                  buffer_size=10240,        # any read-buffer size you want; 0 means no buffering
                                  num_parallel_reads=os.cpu_count())  # or None to read the files sequentially
# Maybe you want to prefetch some data first.
dataset = dataset.prefetch(buffer_size=batch_size)
# Decode the serialized examples.
dataset = dataset.map(single_example_parser, num_parallel_calls=os.cpu_count())
dataset = dataset.shuffle(buffer_size=number_larger_than_batch_size)
dataset = dataset.batch(batch_size).repeat(num_epochs)
...
For details, check the documentation.
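The single_example_parser above is whatever decoding function fits your records. As a hedged illustration only (the feature keys 'image' and 'label' are placeholders, not something defined by the original question), it could look like this:
import tensorflow as tf

def single_example_parser(serialized_example):
    # Describe the features that were written into each tf.train.Example.
    features = tf.io.parse_single_example(
        serialized_example,
        features={
            'image': tf.io.FixedLenFeature([], tf.string),  # raw image bytes (placeholder key)
            'label': tf.io.FixedLenFeature([], tf.int64),   # integer class label (placeholder key)
        })
    image = tf.io.decode_raw(features['image'], tf.uint8)
    return image, features['label']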
Addressing the question title directly for anyone looking to merge multiple .tfrecord files:
The most convenient approach would be to use the tf.data API:
(adapting an example from the docs)
# Create dataset from multiple .tfrecord files
list_of_tfrecord_files = [dir1, dir2, dir3, dir4]
dataset = tf.data.TFRecordDataset(list_of_tfrecord_files)
# Save dataset to .tfrecord file
filename = 'test.tfrecord'
writer = tf.data.experimental.TFRecordWriter(filename)
writer.write(dataset)
However, as pointed out by holmescn, you'd likely be better off leaving the .tfrecord files as separate files and reading them together as a single tensorflow dataset.
You may also refer to a longer discussion regarding multiple .tfrecord files on Data Science Stack Exchange.
The answer by MoltenMuffins works for higher versions of TensorFlow. However, if you are using a lower version, you have to iterate through the three tfrecords files and save them into a new record file as follows. This works for TF versions 1.0 and above.
import tensorflow as tf

def comb_tfrecord(tfrecords_path, save_path, batch_size=128):
    with tf.Graph().as_default(), tf.Session() as sess:
        ds = tf.data.TFRecordDataset(tfrecords_path).batch(batch_size)
        batch = ds.make_one_shot_iterator().get_next()
        writer = tf.python_io.TFRecordWriter(save_path)
        while True:
            try:
                records = sess.run(batch)
                for record in records:
                    writer.write(record)
            except tf.errors.OutOfRangeError:
                break
        writer.close()
Customizing the above script for better tfrecords listing:
import os
import glob
import tensorflow as tf
save_path = 'data/tf_serving_warmup_requests'
tfrecords_path = glob.glob('data/*.tfrecords')
dataset = tf.data.TFRecordDataset(tfrecords_path)
writer = tf.data.experimental.TFRecordWriter(save_path)
writer.write(dataset)

Why does the Open XML API import text-formatted column cells differently for every row?

I am working on an ingestion feature that will take a strongly formatted .xlsx file and import the records to a temp storage table and then process the rows to create db records.
One of the columns is strictly formatted as "Text", but it seems like the Open XML API handles the column's cells differently on a row-by-row basis. Some of the values, while appearing to be numeric, are truly not (which is why we format the column as Text).
Some examples are "211377", "211727.01", "209395.388", "209395.435".
What these values represent is not important, but what happens is that some values (using the Open XML API v2.5 library) are read in properly as text, whether retrieved from the Shared Strings collection or simply from the InnerXML property, while others get pulled in as numbers with what appears to be appended rounding or precision.
For example, "211377", "211727.01" and "209395.435" all come in exactly as they are in the spreadsheet, but the value "209395.388" is being pulled in as "209395.38800000001" (this happens to other values as well).
There seems to be no rhyme or reason to which values get messed up and which ones import fine. What is really frustrating is that if I use the native Import feature in SQL Server Management Studio to ingest the same spreadsheet into a temp table, this does not happen. So how is it that the SSMS import can handle these values as purely text for all rows but the Open XML API cannot?
To begin the answer, your main problem seems to be this value:
"209395.388" value is being pulled in as "209395.38800000001"
Yes, in the .xlsx file the value is stored as 209395.38800000001 instead of 209395.388, and that is a correct way to store a floating-point number; there is nothing wrong with it. You can simply confirm it with the following code snippet:
string val = "209395.38800000001"; // <= What we extract from Open Xml
Console.WriteLine(double.Parse(val)); // <= Simply pass it to double.Parse and print
The output is:
209395.388 // <= yes, the expected value
So there is nothing wrong with the value you extract from the .xlsx file using the Open XML SDK.
Now to cells: yes, a cell can have a variety of formats: numbers, text, booleans, or shared-string text. And you can apply styles to a cell, which will format your value to the desired output in Excel (e.g., date/time formats, forced strings, etc.). This is the way Excel handles its wide variety of data; it needs this kind of formatting, and the .xlsx file format had to be a little complex to support it all.
My advice is to apply a proper parsing step to the extracted values, to identify what format each one represents (for example, whether it is a number or text), and then parse it accordingly.
For example:
string val = "209395.38800000001";
Console.WriteLine(float.Parse(val)); // <= float.Parse will produce a different value: 209395.4
Update:
Here's how the value is saved in the internal XML.
Try it for yourself:
Make an .xlsx file with the value 209395.388 -> change the extension to .zip -> unzip it -> go to the xl/worksheets folder -> open sheet1.xml.
You will notice that the value is stored as 209395.38800000001. So there is nothing wrong with the API extracting the stored number; it is your responsibility to decide what format to apply.
But if you format the whole column as Text before adding the data, you will see that the .xlsx file holds the data as-is, that is, simply as a string.
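If you prefer to do the same inspection programmatically instead of renaming the file, here is a minimal sketch in Python (the workbook name is a placeholder); it reads the sheet XML straight out of the archive and also shows that the long literal and the short one denote the same double:
import zipfile

# An .xlsx file is just a zip archive; read the first sheet's XML directly.
with zipfile.ZipFile('numbers.xlsx') as xlsx:          # placeholder file name
    sheet_xml = xlsx.read('xl/worksheets/sheet1.xml').decode('utf-8')
    print(sheet_xml)                                   # look for <v>209395.38800000001</v>

# Both spellings parse to the same IEEE-754 double, so no precision is lost.
print(float('209395.38800000001') == 209395.388)      # True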

Generate XML from an XML Schema (XSD) in 4GL Progress OpenEdge?

I am using 4GL in Progress OpenEdge 11.3 and I want to write an XML file from an XSD schema file.
Can I generate an XML file from an XML Schema (XSD) with 4GL Progress OpenEdge?
Thanks.
Well, you can use a method called READ-XMLSCHEMA (and its counterpart WRITE-XMLSCHEMA).
These can be applied to both TEMP-TABLES and ProDataSets (depending on the complexity of the XML).
The ProDataSet documentation contains quite a lot of information about this. There's also a book called Working with XML that can help you.
This is the basic syntax of READ-XMLSCHEMA (when working with datasets):
READ-XMLSCHEMA ( source-type, { file | memptr | handle | longchar },
override-default-mapping [, field-type-mapping [, verify-schema-mode ] ] ).
A basic example would be:
DATASET ds:READ-XMLSCHEMA("file", "c:\temp\file.xsd", FALSE).
However, since you need to work with the actual XML, you will also have to handle data. That data is handled in the TEMP-TABLES contained within the DATASET. It might be easier to start by creating a static ProDataSet that corresponds to the schema and then handle its data in whatever way you want.

Text classification using Weka

I'm a beginner with Weka and I'm trying to use it for text classification. I have seen how to use the StringToWordVector filter for classification. My question is, is there any way to add more features to the text I'm classifying? For example, if I wanted to add POS tags and named-entity tags to the text, how would I use these features in a classifier?
It depends on the format of your dataset and the preprocessing steps you perform. For instance, let us suppose that you have pre-POS-tagged your texts, so they look like:
The_det dog_n barks_v ._p
So you can build a specific tokenizer (see weka.core.tokenizers) to generate two tokens per word: one would be "The" and the other would be "The_det", so you keep the tag information.
If you want only tagged words, then you can just ensure that "_" is not a delimiter in the weka.core.tokenizers.WordTokenizer.
My advice is to keep both the plain words and the tagged words, so a simpler way would be to write a script that joins the texts and the tagged texts. From a file containing "The dog barks" and another one containing "The_det dog_n barks_v ._p", it would generate a file with "The The_det dog dog_n barks barks_v . ._p". You may even forget about the order unless you are going to make use of n-grams.
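A minimal sketch of such a joining script (Python; the file names are placeholders) could look like this; it interleaves each plain word with its tagged counterpart, line by line:
# Combine 'texts.txt' and 'tagged_texts.txt' into a file with both token forms.
with open('texts.txt') as plain, open('tagged_texts.txt') as tagged, \
        open('joined_texts.txt', 'w') as out:
    for plain_line, tagged_line in zip(plain, tagged):
        joined = []
        for word, tagged_word in zip(plain_line.split(), tagged_line.split()):
            joined.extend([word, tagged_word])
        out.write(' '.join(joined) + '\n')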