Is there a way to save each HDF5 data set as a .csv column? - h5py

I'm struggling with an H5 file, trying to extract its data and save it as a multi-column CSV. As shown in the picture, the structure of the H5 file consists of main groups (Genotypes, Positions, and taxa). The main group Genotypes contains more than 1,500 subgroups (partial genotype names), and each subgroup contains sub-subgroups (complete genotype names). There are about 1 million datasets (named "calls"), one in each sub-subgroup, and I need each one written to a separate column. The problem is that when I use h5py (the group.get function) I have to supply the path to each "calls" dataset. I extracted all the paths ending in "calls", but I can't reach all 1 million of them to get them into a CSV file.
Could anybody help me extract the "calls" datasets, which are 8-bit integers, as separate columns in a CSV file?
By running the code in the first answer I get this error:
Traceback (most recent call last):
  File "path/file.py", line 32, in <module>
    h5r.visititems(dump_calls2csv)  # NOTE: function name is NOT a string!
  File "path/file.py", line 565, in visititems
    return h5o.visit(self.id, proxy)
  File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py\h5o.pyx", line 355, in h5py.h5o.visit
  File "h5py\defs.pyx", line 1641, in h5py.defs.H5Ovisit_by_name
  File "h5py\h5o.pyx", line 302, in h5py.h5o.cb_obj_simple
  File "path/file.py", line 564, in proxy
    return func(name, self[name])
  File "path/file.py", line 10, in dump_calls2csv
    np.savetxt(csvfname, arr, fmt='%5d', delimiter=',')
  File "<__array_function__ internals>", line 6, in savetxt
  File "path/file.py", line 1377, in savetxt
    open(fname, 'wt').close()
OSError: [Errno 22] Invalid argument: 'Genotypes_ArgentineFlintyComposite-C(1)-37-B-B-B2-1-B25-B2-B?-1-B:100000977_calls.csv'

16-May-2020 Update:
Added a second example that reads and exports using PyTables (aka tables) with .walk_nodes(). I prefer this method over h5py's .visititems().
For clarity, I separated the code that creates the example file from the 2 examples that read and export the CSV data.
Enclosed below are 2 simple examples that show how to recursively loop over all top-level objects. For completeness, the code to create the test file is at the end of this post.
Example 1: with h5py
This example uses the .visititems() method with a callable function (dump_calls2csv).
Summary of this procedure:
1) Checks for dataset objects with calls in the name.
2) When it finds a matching object it does the following:
a) reads the data into a Numpy array,
b) creates a unique file name (using string substitution on the H5 group/dataset path name to ensure uniqueness),
c) writes the data to the file with numpy.savetxt().
import h5py
import numpy as np

def dump_calls2csv(name, node):
    if isinstance(node, h5py.Dataset) and 'calls' in node.name:
        print('visiting object:', node.name, ', exporting data to CSV')
        csvfname = node.name[1:].replace('/', '_') + '.csv'
        arr = node[:]
        np.savetxt(csvfname, arr, fmt='%5d', delimiter=',')

##########################

with h5py.File('SO_61725716.h5', 'r') as h5r:
    h5r.visititems(dump_calls2csv)  # NOTE: function name is NOT a string!
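A note on the OSError in the question: the CSV file name is built from the H5 path, and some of your group names contain characters such as ? and : that Windows does not allow in file names. A minimal sketch of one way to sanitize the name first (safe_csv_name is a hypothetical helper, and the '-' replacement character is an arbitrary choice):

import re

def safe_csv_name(h5path):
    # build a file name from the H5 path, then replace characters
    # that Windows does not allow in file names
    name = h5path.lstrip('/').replace('/', '_')
    return re.sub(r'[:*?"<>|]', '-', name) + '.csv'

In dump_calls2csv above, you would then use csvfname = safe_csv_name(node.name).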
If you want to get fancy, you can replace arr in np.savetxt() with node[:].
Also, if you want headers in your CSV, extract and reference the dtype field names from the dataset (I did not create any in this example), as sketched below.
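A minimal sketch of that variant (dump_calls2csv_with_header is a hypothetical name; it assumes the dataset has a compound dtype with named fields, which the plain integer arrays in the test file at the end of this post do not):

def dump_calls2csv_with_header(name, node):
    if isinstance(node, h5py.Dataset) and 'calls' in node.name:
        csvfname = node.name[1:].replace('/', '_') + '.csv'
        # for a compound dtype, dtype.names holds the field names
        hdr = ','.join(node.dtype.names) if node.dtype.names else ''
        np.savetxt(csvfname, node[:], fmt='%5d', delimiter=',',
                   header=hdr, comments='')

Passing comments='' keeps np.savetxt() from prefixing the header line with '#'.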
Example 2: with PyTables (tables)
This example uses the .walk_nodes() method with a filter: classname='Leaf'. In PyTables, a leaf can be any of the storage classes (Arrays and Tables).
The procedure is similar to the method above. walk_nodes() simplifies the process to find datasets and does NOT require a call to a separate function.
import tables as tb
import numpy as np

with tb.File('SO_61725716.h5', 'r') as h5r:
    for node in h5r.walk_nodes('/', classname='Leaf'):
        print('visiting object:', node._v_pathname, 'export data to CSV')
        csvfname = node._v_pathname[1:].replace('/', '_') + '.csv'
        np.savetxt(csvfname, node.read(), fmt='%d', delimiter=',')
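If some leaves are Table nodes (named columns), you can also write a header row from .colnames; a minimal sketch of that variant (illustrative only, since the test file below holds plain Arrays, not Tables):

with tb.File('SO_61725716.h5', 'r') as h5r:
    for node in h5r.walk_nodes('/', classname='Table'):
        csvfname = node._v_pathname[1:].replace('/', '_') + '.csv'
        with open(csvfname, 'w') as csvf:
            # Table.colnames lists the column (field) names
            csvf.write(','.join(node.colnames) + '\n')
            np.savetxt(csvf, node.read(), fmt='%d', delimiter=',')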
For completeness, use the code below to create the test file used in the examples.
import h5py
import numpy as np

ngrps = 2
nsgrps = 3
nds = 4
nrows = 10
ncols = 2

with h5py.File('SO_61725716.h5', 'w') as h5w:
    for gcnt in range(ngrps):
        grp1 = h5w.create_group('Group_' + str(gcnt))
        for scnt in range(nsgrps):
            grp2 = grp1.create_group('SubGroup_' + str(scnt))
            for dcnt in range(nds):
                i_arr = np.random.randint(1, 100, (nrows, ncols))
                ds = grp2.create_dataset('calls_' + str(dcnt), data=i_arr)

Related

Import a variable from .mat and export to CSV

I have a .mat file which contains a struct (called wiki),
in which there is a field called full_path containing data as follows:
ans =
Columns 1 through 4
{'17/10000217_198…'} {'48/10000548_192…'} {'12/100012_1948-…'} {'65/10001965_193…'}
Columns 5 through 8
{'16/10002116_197…'} {'02/10002702_196…'} {'41/10003541_193…'} {'39/100039_1904-…'}
and so on
How can I create a .csv file with the data present in the curly braces?
This is quite a common problem, which requires very basic functions:
wiki = struct2array(load('wiki.mat', 'wiki'));
fid = fopen('q52688399.csv', 'w');
fprintf(fid,'%s\n', wiki.full_path{:});
fclose(fid);
The above will produce a ~2 MB text file containing a single column of strings.

pyspark : Categorical variables preparation for kmeans

I know k-means is not a good choice for categorical data, but we don't have many options in Spark 1.4 for clustering categorical data.
Regardless of the above issue, I'm getting an error when running the code below.
I read my table from Hive, use OneHotEncoder in a pipeline, and then send the result into KMeans.
Could the error be in the datatype fed to KMeans? Does it expect numpy array data? If so, how can I transfer my indexed data to a numpy array?
All comments are appreciated, and thanks for your help!
The error I'm getting:
Traceback (most recent call last):
  File "/usr/hdp/2.3.2.0-2950/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 157, in manager
  File "/usr/hdp/2.3.2.0-2950/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 61, in worker
  File "/usr/hdp/2.3.2.0-2950/spark/python/lib/pyspark.zip/pyspark/worker.py", line 136, in main
    if read_int(infile) == SpecialLengths.END_OF_STREAM:
  File "/usr/hdp/2.3.2.0-2950/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 544, in read_int
    raise EOFError
EOFError
My code:
# aline will be passed in from another rdd
aline = ["xxx", "yyy"]
# get data from Hive table, select the columns, and convert back to an RDD
rddRes2 = hCtx.sql("select XXX, YYY from table1 where xxx <> ''")
rdd3 = rddRes2.rdd
# fill the NA values with "none"
Rdd4 = rdd3.map(lambda line: [x if len(x) else 'none' for x in line])
# convert it back to a DataFrame
DataDF = Rdd4.toDF(aline)
# Indexers encode strings with doubles
string_indexers = [
    StringIndexer(inputCol=x, outputCol="idx_{0}".format(x))
    for x in DataDF.columns if x not in ''
]
encoders = [
    OneHotEncoder(inputCol="idx_{0}".format(x), outputCol="enc_{0}".format(x))
    for x in DataDF.columns if x not in ''
]
# Assemble multiple columns into a single vector
assembler = VectorAssembler(
    inputCols=["enc_{0}".format(x) for x in DataDF.columns if x not in ''],
    outputCol="features")
pipeline = Pipeline(stages=string_indexers + encoders + [assembler])
model = pipeline.fit(DataDF)
indexed = model.transform(DataDF)
labeled_points = indexed.select("features").map(lambda row: LabeledPoint(row.features))
# Build the model (cluster the data)
clusters = KMeans.train(labeled_points, 3, maxIterations=10, runs=10,
                        initializationMode="random")
I guess that correction alone would not solve the problem.
You can convert dense vectors to arrays using XXX.toArray().
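For what it's worth, a minimal sketch of that direction (untested, written against the Spark 1.4 MLlib API): KMeans.train() expects an RDD of vectors rather than LabeledPoints (and LabeledPoint requires a label argument anyway), so mapping out the raw feature vectors may be all that is needed:

# feed the assembled feature vectors straight into KMeans
feature_rdd = indexed.select("features").map(lambda row: row.features)
clusters = KMeans.train(feature_rdd, 3, maxIterations=10, runs=10,
                        initializationMode="random")
# if a plain array is ever needed, a DenseVector exposes .toArray()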

How to import only some data files but ignore others?

I have 48 1000x28 data files (no headers, strings, or special characters) which I'd like to import in 4 batches of 12.
In the first batch the files have names:
spread_YAB_4ACH_caretype_??_model_1 where ??=1:6
The second batch
spread_YAB_4ACH_caretype_??_MC_model_1 where again ??=1:6
I'm not sure where to put the wildcard *
D = dir('spread_YAB_4ACH_caretype_*_model_1.txt');
dummy = zeros(1000, length(D));
for k = 1:length(D)
    file = num2str(D(k).name);
    fid = fopen(file);
    myCell = textscan(fid, '%f');
    dummydummy = reshape(cell2mat(myCell(:,end)), 1000, 28); %# cell makes one column vector, why?
    dummy(:,k) = dummydummy(:,end); %# only want last column
    fclose(fid);
end
This script looks an awful mess; surely you don't need this much bumpf to import a group of simple data files. Any thoughts?
d = dir(foldername); %# that is where your files are
for i = 3:1:length(d) %# ignore the . and ..
    if strfind(d(i,1).name, 'MC_model')
        %# some code to do with the files of the second batch
    else
        %# some code to do with the files of the first batch
    end
end

Import csv file in Matlab

I need your help to import some CSV files into MATLAB. They have the following format:
#CONTENT
Class,Category,Level,Form
xxxxx,xxxxx,1.0,1
#DATA_GENERATION
Date,Agency,Version,ScientificAuthority
2010-04-08,INME,1.0,XXX xxx xxxx
#PLATFORM
Type,ID,Name,Country,GAW_ID
STN,308,xxxx,xxx
#INSTRUMENT
Name,Model,Number
ECC,6A,6A23500
#LOCATION
Latitude,Longitude,Height
25,-3,631.0
#TIMESTAMP
UTCOffset,Date,Time
+00:00:00,2010-04-07,10:51:00
* SOFTWARE: SNDPRO 1.321
* TROPOPAUSE IN MB 184
*
#FLIGHT_SUMMARY
IntegratedO3,CorrectionCode,SondeTotalO3,CorrectionFactor,TotalO3,WLCode,ObsType,Instrument,Number
328.4,0,379.9
#AUXILIARY_DATA
MeteoSonde,ib1,ib2,PumpRate,BackgroundCorr,SampleTemperatureType,MinutesGroundO3
RS92-SGPW,,,,Pressure,Pump
#PUMP_CORRECTION
Pressure,Correction
2.0,1.171
3.0,1.131
5.0,1.092
10.0,1.055
20.0,1.032
30.0,1.022
50.0,1.015
100.0,1.011
200.0,1.008
300.0,1.006
500.0,1.004
1000.0,1.000
#PROFILE
Pressure,O3PartialPressure,Temperature,WindSpeed,WindDirection,LevelCode,Duration,GPHeight,RelativeHumidity,SampleTemperature
945.36,4.590,14.6,10.0,30,2,0,631,43,22.8
944.90,4.620,14.3,7.8,20,0,2,635,44,22.8
943.51,4.630,13.9,7.6,17,0,4,647,44,22.8
942.13,4.620,13.4,8.1,16,0,6,660,45,22.8
940.98,4.590,13.0,9.0,16,0,8,670,45,22.8
939.83,4.590,12.6,9.8,17,0,10,680,46,22.8
938.69,4.600,12.1,10.3,18,2,12,691,46,22.8
937.77,4.600,12.2,10.9,18,0,14,699,47,22.9
936.63,4.600,12.1,11.4,19,0,16,709,47,22.9
935.48,4.600,11.8,11.9,19,0,18,719,47,22.9
934.12,4.600,11.7,12.3,19,0,20,731,47,22.9
932.98,4.590,11.6,12.6,19,0,22,742,48,22.9
931.84,4.590,11.6,12.9,18,0,24,752,48,22.9
930.93,4.600,11.6,13.2,18,0,26,760,48,22.9
929.79,4.600,11.4,13.4,17,0,28,770,49,22.9
928.88,4.610,11.5,13.6,16,0,30,778,49,22.9
927.98,4.620,11.4,13.7,15,0,32,787,49,23.0
927.30,4.620,11.3,13.8,14,0,34,793,49,23.0
The first line of the file is empty and the second line contains #CONTENT. I would like to have in a matrix all the data under the line Pressure,O3PartialPressure,Temperature,WindSpeed,WindDirection,LevelCode,Duration,GPHeight,RelativeHumidity,SampleTemperature.
Use the csvread() function. From the documentation:
csvread Read a comma separated value file.
M = csvread('FILENAME') reads a comma separated value formatted file
FILENAME. The result is returned in M. The file can only contain
numeric values.
In your case, since you want to exclude all of the content up until the #PROFILE data, you would have to know in advance the line number of the data you're interested in, then use one of the following forms (again from the documentation):
M = csvread('FILENAME',R,C) reads data from the comma separated value
formatted file starting at row R and column C. R and C are zero-
based so that R=0 and C=0 specifies the first value in the file.
M = csvread('FILENAME',R,C,RNG) reads only the range specified
by RNG = [R1 C1 R2 C2] where (R1,C1) is the upper-left corner of
the data to be read and (R2,C2) is the lower-right corner. RNG
can also be specified using spreadsheet notation as in RNG = 'A1..B7'.

Matlab: finding/writing data from multiple files by column header text

I have an array which I read into Matlab with importdata. It has 5 header lines
file = 'aoao.csv';
s = importdata(file,',', 5);
Matlab automatically treats the last line as the column header. I can then call up the column number that I want with
s.data(:,n); %n is desired column number
I want to be able to load many similar files at once, and then call up the columns in the different files that have the same column header name (which is not necessarily the same column number). I want to write and export all of these columns together into a new matrix, preferably with each column labelled with its file name.
What should I do?
samp = 'len-c.mp3'; %# define desired sample/column header name
file = dir('*.csv');
Have the directory ready as the current folder. This creates a detailed description of the files.
for i = 1:length(file)
    set(i) = importdata(file(i).name, ',', 5);
end
This imports the data from each of the files (comma-delimited, 5 header lines) and stores it in a struct array called 'set'.
for k = 1:14
    for i = 1:length(set(k).colheaders)
        TF = strcmp(set(k).colheaders(i), samp); %# compares strings for match
        if TF == 1 %# if match is true
            group(:,k) = set(k).data(:,i); %# save matching column# to 'group'
        end
    end
end
This retrieves the data from the named column header within each file.