Accessing array contents inside .mat file in python - matlab

I want to read a .mat file available at http://www.eigenvector.com/data/tablets/index.html. To access the data inside this file, I am trying the following:
import scipy.io as spio
import numpy as np
import matplotlib.pyplot as plt
mat = spio.loadmat('nir_shootout_2002.mat')
# http://pyhogs.github.io/reading-mat-files.html
def print_mat_nested(d, indent=0, nkeys=0):
    # Optionally limit output to the first nkeys keys of the dict
    if nkeys > 0:
        d = {k: d[k] for k in list(d.keys())[:nkeys]}
    if isinstance(d, dict):
        for key, value in d.items():
            print('\t' * indent + 'Key: ' + str(key))
            print_mat_nested(value, indent + 1)
    if isinstance(d, np.ndarray) and d.dtype.names is not None:
        # Structured array: recurse into each named field
        for n in d.dtype.names:
            print('\t' * indent + 'Field: ' + str(n))
            print_mat_nested(d[n], indent + 1)

print_mat_nested(mat, nkeys=1)
The above command shows that the first key in the dictionary is "validate_1" and that it has a field "data". To access this field, I try:
t = mat['validate_1']
print(t['data'])
It prints an array, but when I use np.shape(t['data']) it just returns (1,1), whereas the data seems to be larger. I am not sure how to access the array inside t['data'].
Thanks for the help.

I found that the following works:
t = mat['validate_1']['data'][0,0]
print(np.shape(t))
It returns that t is an array of shape (40,650).
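This is because scipy.io.loadmat represents a MATLAB struct as a (1,1) NumPy structured array, so each field access returns another (1,1) object array whose single element holds the actual data; indexing with [0,0] unwraps it. As a sketch of an alternative (assuming the same file), loadmat's squeeze_me and struct_as_record options collapse those wrapper dimensions up front:
import scipy.io as spio

# squeeze_me=True drops the unit (1,1) dimensions; struct_as_record=False
# loads MATLAB structs as objects with attribute-style field access
mat = spio.loadmat('nir_shootout_2002.mat', squeeze_me=True,
                   struct_as_record=False)
t = mat['validate_1'].data
print(t.shape)  # expected: (40, 650)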

Related

Pyspark - How to calculate file hashes

I have a bunch of CSV files in a mounted blob container, and I need to calculate SHA-1 hash values for every file to store as an inventory. I'm very new to Azure cloud and PySpark, so I'm not sure how this can be achieved efficiently. I have written the following code in Python with Pandas, and I'm trying to use it in PySpark. It seems to work, but it takes quite a while to run as there are thousands of CSV files. I understand that things work differently in PySpark, so can someone please guide me on whether my approach is correct, or whether there is a better piece of code I can use to accomplish this task?
import os
import hashlib
import pandas as pd

class File:
    def __init__(self, path):
        self.path = path

    def get_hash(self):
        # Read the file in 4 KB chunks so large files aren't loaded into memory at once
        hash = hashlib.sha1()
        with open(self.path, "rb") as f:
            for chunk in iter(lambda: f.read(4096), b""):
                hash.update(chunk)
        self.sha1hash = hash.hexdigest()
        return self.sha1hash

path = '/dbfs/mnt/data/My_Folder'  # path to CSV files
cnt = 0
rlist = []
for root, subdirs, files in os.walk(path):
    for fi in files:
        if cnt < 10:  # check only 10 files for now as it takes ages!
            f = File(os.path.join(root, fi))
            cnt += 1
            hash_value = f.get_hash()
            results = {'File_Name': fi, 'File_Path': f.path, 'SHA1_Hash_Value': hash_value}
            rlist.append(results)
            print(fi)

df = pd.DataFrame(rlist)
print(str(cnt) + ' files processed')
#df.to_csv('/dbfs/mnt/workspace/Inventory/File_Hashes.csv', mode='a', header=False) #not sure how to write files in pyspark!
display(df)
Thanks
Since you want to treat the files as blobs and not read them into a table, I would recommend using spark.sparkContext.binaryFiles. This lands you an RDD of pairs where the key is the file path and the value is the file's contents as bytes, on which you can calculate the hash in a map function (rdd.mapValues(calculate_hash_of_bytes)).
For more information, refer to the documentation: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.binaryFiles.html#pyspark.SparkContext.binaryFiles
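A minimal sketch of that approach, assuming a Databricks notebook where spark and display() are available; the mount path and column names are carried over from the question, and the output directory is hypothetical:
import hashlib

# Each record is (file_path, file_contents_as_bytes)
rdd = spark.sparkContext.binaryFiles('/mnt/data/My_Folder/*.csv')

def sha1_of_bytes(content):
    # Hash the raw bytes of one file
    return hashlib.sha1(content).hexdigest()

# Keep the file path as the key, replace the value with its SHA-1 hash
hashes = rdd.mapValues(sha1_of_bytes)

df = hashes.toDF(['File_Path', 'SHA1_Hash_Value'])
df.write.mode('append').csv('/mnt/workspace/Inventory/File_Hashes')  # hypothetical output path
display(df)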

Table fields in format in PDF report generator - Matlab

When I save a table as PDF (using Report Generator), the numeric fields (Index1, Index2, Index3) do not keep the shortG format (5 significant digits in total). What is the problem and how can I fix it?
The code:
function ButtonPushed(app, event)
    import mlreportgen.dom.*;
    import mlreportgen.report.*
    format shortG;
    ID = [1;2;3;4;5];
    Name = {'San';'John';'Lee';'Boo';'Jay'};
    Index1 = [71.1252;69.2343245;64.345345;67.345322;64.235235];
    Index2 = [176.23423;163.123423654;131.45364572;133.5789435;119.63575647];
    Index3 = [176.234;16.123423654;31.45364572;33.5789435;11.6647];
    mt = table(ID,Name,Index1,Index2,Index3);
    d = Document('myPDF','pdf');
    d.OutputPath = ['E:/','temp'];
    append(d,'Table 1: ');
    append(d,mt);
    close(d);
    rptview(d.OutputPath);
end
To fix it, convert your numeric arrays to character arrays with 5 significant digits before writing to the PDF.
mt = table(ID,Name,f(Index1),f(Index2),f(Index3));
where,
function FivDigsStr = f(x)
    % Format to a character array with 5 significant digits, then split
    % at each tab. categorical is needed to remove the ' 's that appear
    % around char in the output PDF with newer MATLAB versions
    % (e.g. with R2018a there are no ' ' in the output file, but ' ' appears with R2020a).
    FivDigsStr = categorical(split(sprintf('%0.5G\t',x)));
    % Remove the last (<undefined>) value (which is included due to the trailing \t)
    FivDigsStr = FivDigsStr(1:end-1);
end
The above change gives the table rendered with 5 significant digits.
Edit:
To bring back the headers:
mt.Properties.VariableNames(3:end) = {'Index1', 'Index2', 'Index3'};
or, in a more general way, to extract the variable names instead of hardcoding them, you can use inputname:
V = @(x) inputname(1);
mt.Properties.VariableNames(3:end) = {V(Index1), V(Index2), V(Index3)};
which gives the same table with its original column headers.

Is there a way to save each HDF5 data set as a .csv column?

I'm struggling with an H5 file, trying to extract and save data as a multi-column CSV. As shown in the picture, the structure of the h5 file consists of main groups (Genotypes, Positions, and taxa). The main group Genotypes contains more than 1500 subgroups (partial genotype names), and each subgroup contains sub-sub-groups (complete genotype names). There are about 1 million datasets (named "calls"), each laid in one sub-sub-group, and I need each one written to a separate column. The problem is that when I use h5py (the group.get function) I have to use the path of each "calls" dataset. I extracted all the paths ending in "calls", but I can't reach all 1 million calls to get them into a CSV file.
Could anybody help me extract the "calls" datasets, which are 8-bit integers, as separate columns in a CSV file?
By running the code in the first answer, I get this error:
Traceback (most recent call last):
  File "path/file.py", line 32, in <module>
    h5r.visititems(dump_calls2csv)  #NOTE: function name is NOT a string!
  File "path/file.py", line 565, in visititems
    return h5o.visit(self.id, proxy)
  File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py\h5o.pyx", line 355, in h5py.h5o.visit
  File "h5py\defs.pyx", line 1641, in h5py.defs.H5Ovisit_by_name
  File "h5py\h5o.pyx", line 302, in h5py.h5o.cb_obj_simple
  File "path/file.py", line 564, in proxy
    return func(name, self[name])
  File "path/file.py", line 10, in dump_calls2csv
    np.savetxt(csvfname, arr, fmt='%5d', delimiter=',')
  File "<__array_function__ internals>", line 6, in savetxt
  File "path/file.py", line 1377, in savetxt
    open(fname, 'wt').close()
OSError: [Errno 22] Invalid argument: 'Genotypes_ArgentineFlintyComposite-C(1)-37-B-B-B2-1-B25-B2-B?-1-B:100000977_calls.csv'
16-May-2020 Update:
Added a second example that reads and exports using PyTables (aka tables) and its .walk_nodes() method. I prefer this method over h5py's .visititems().
For clarity, I separated the code that creates the example file from the
2 examples that read and export the CSV data.
Enclosed below are 2 simple examples that show how to recursively loop on all top level objects. For completeness, the code to create the test file is at the end of this post.
Example 1: with h5py
This example uses the .visititems() method with a callable function (dump_calls2csv).
Summary of this procedure:
1) Checks for dataset objects with calls in the name.
2) When it finds a matching object it does the following:
a) reads the data into a Numpy array,
b) creates a unique file name (using string substitution on the H5 group/dataset path name to ensure uniqueness),
c) writes the data to the file with numpy.savetxt().
import h5py
import numpy as np

def dump_calls2csv(name, node):
    # Only export datasets whose path contains 'calls'
    if isinstance(node, h5py.Dataset) and 'calls' in node.name:
        print('visiting object:', node.name, ', exporting data to CSV')
        # Build a unique CSV name from the full group/dataset path
        csvfname = node.name[1:].replace('/', '_') + '.csv'
        arr = node[:]
        np.savetxt(csvfname, arr, fmt='%5d', delimiter=',')

##########################
with h5py.File('SO_61725716.h5', 'r') as h5r:
    h5r.visititems(dump_calls2csv)  # NOTE: function name is NOT a string!
If you want to get fancy, you can replace arr in np.savetxt() with node[:].
Also, if you want headers in your CSV, extract and reference the dtype field names from the dataset (I did not create any in this example).
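Side note on the OSError in the question: some group names in that file contain characters such as '?' and ':' that are not valid in Windows file names. A small tweak to dump_calls2csv (a sketch, reusing the imports above) strips them before saving:
import re

def dump_calls2csv(name, node):
    if isinstance(node, h5py.Dataset) and 'calls' in node.name:
        # Replace characters Windows does not allow in file names (e.g. '?', ':')
        safe = re.sub(r'[\\/:*?"<>|]', '_', node.name[1:].replace('/', '_'))
        np.savetxt(safe + '.csv', node[:], fmt='%5d', delimiter=',')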
Example 2: with PyTables (tables)
This example uses the .walk_nodes() method with a filter: classname='Leaf'. In PyTables, a leaf can be any of the storage classes (Array and Table).
The procedure is similar to the method above. walk_nodes() simplifies the process of finding the datasets and does NOT require a call to a separate function.
import tables as tb
import numpy as np

with tb.File('SO_61725716.h5', 'r') as h5r:
    # Walk every leaf (dataset) under the root group
    for node in h5r.walk_nodes('/', classname='Leaf'):
        print('visiting object:', node._v_pathname, ', exporting data to CSV')
        csvfname = node._v_pathname[1:].replace('/', '_') + '.csv'
        np.savetxt(csvfname, node.read(), fmt='%d', delimiter=',')
For completeness, use the code below to create the test file used in the examples.
import h5py
import numpy as np

ngrps = 2
nsgrps = 3
nds = 4
nrows = 10
ncols = 2

with h5py.File('SO_61725716.h5', 'w') as h5w:
    for gcnt in range(ngrps):
        grp1 = h5w.create_group('Group_' + str(gcnt))
        for scnt in range(nsgrps):
            grp2 = grp1.create_group('SubGroup_' + str(scnt))
            for dcnt in range(nds):
                # Small random integer array stands in for a 'calls' dataset
                i_arr = np.random.randint(1, 100, (nrows, ncols))
                ds = grp2.create_dataset('calls_' + str(dcnt), data=i_arr)

Retrieve stored image from mongodb using python

from pymongo import MongoClient
from bson.objectid import ObjectId
import numpy as np
import gridfs
import os, os.path

i = 0
try:
    for file in os.listdir("/Users/sarthakgupta/Desktop/sae/Images"):
        if file.endswith(".png") or file.endswith(".jpg"):
            filename = "/Users/sarthakgupta/Desktop/sae/Images/" + file
            datafile = open(filename, "rb")
            thedata = datafile.read()
            datafile.close()
            c = MongoClient()
            i = i + 1
            db = c.trial5
            fs = gridfs.GridFS(db)
            t = "class" + str(i)
            stored = fs.put(thedata, filename=t)
except IOError:
    print("Image file %s not found" % datafile)
    raise SystemExit
I stored images in a MongoDB. Now I want to retrieve those images from the database by filename and store the image (or its pixels) for each filename in an array or list. For example, if there are 2 images with the filename "class1", then they should end up in one array.
Create your fs variable like before, and:
data = fs.get_last_version(filename).read()
You could also query for a list of files like:
from bson import Regex
for f in fs.find({'filename': Regex(r'.*\.(png|jpg)')}):
    data = f.read()
Also, a comment about your code: it's very slow to recreate the MongoClient and GridFS instances for every iteration of your loop. Create them once before you start looping, and reuse them.
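A sketch of the retrieval side, grouping decoded pixel arrays by filename; it assumes Pillow is installed for decoding and reuses the trial5 database from the question:
import io
from collections import defaultdict

import gridfs
import numpy as np
from PIL import Image
from pymongo import MongoClient

client = MongoClient()               # create the client and GridFS once
fs = gridfs.GridFS(client.trial5)

images_by_name = defaultdict(list)
for f in fs.find():
    # Decode the stored bytes into a (height, width, channels) pixel array
    pixels = np.array(Image.open(io.BytesIO(f.read())))
    images_by_name[f.filename].append(pixels)

# All images stored under the filename "class1" end up in one list:
class1_images = images_by_name['class1']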

Read data to matlab with for loop

I want to read the data of a file with size about 60 MB into matlab in some variables, but I get errors. This is my code:
clear all;
clc;
% Reading Input File
Dataz = importdata('leak0.lis');
%Dataz = load('leak0.lis');
for k = 1:1370
    foundPosition = 1;
    for i = 1:size(Dataz,1)
        strp = sprintf('I%dz=',k);
        fprintf(strp);
        findValue = strfind(Dataz{i}, strp);
        if ~isempty(findValue)
            eval_param = [strp '(foundPosition) = sscanf(Dataz{i},''%*c%*c%*f%*c%*c%f'');'];
            disp(eval_param);
            % str(foundPosition) = sscanf(Dataz{i},'%*c%*c%*f%*c%*c%f');
            eval(eval_param);
            foundPosition = foundPosition + 1;
        end
    end
end
When I debugged it, I found that Dataz is empty, so it doesn't proceed to the next lines. I replaced importdata with fopen, load, etc., but it didn't work.
From the MATLAB help files, importdata is likely failing because it doesn't understand your file format.
From the help files
Name and extension of the file to import, specified as a string. If importdata recognizes the file extension, it calls the MATLAB helper function designed to import the associated file format (such as load for MAT-files or xlsread for spreadsheets). Otherwise, importdata interprets the file as a delimited ASCII file.
For ASCII files and spreadsheets, importdata expects to find numeric
data in a rectangular form (that is, like a matrix). Text headers can
appear above or to the left of the numeric data, as follows:
Assuming that your .lis files actually contain delimited text, you should adjust the delimiter in the importdata call so that MATLAB can understand your file:
filename = 'myfile01.txt';
delimiterIn = ' ';
headerlinesIn = 1;
A = importdata(filename,delimiterIn,headerlinesIn);