I want to use the Python 3 package h5py to read a MATLAB .mat file saved as version 7.3.
It contains a variable named results, which is a 1x1 cell; the values in the struct inside that cell are what I need.
In MATLAB, I can get at the data with the following code:
load('.mat PATH');
results{1}.res
How should I read this data in h5py?
An example .mat file can be obtained from here.
While h5py can read the HDF5 files that MATLAB v7.3 produces, figuring out what is there takes some exploring: looking at keys, groups, and datasets (and possibly attrs). There's nothing in scipy that will help you here (scipy.io.loadmat only handles the older, pre-7.3 MATLAB mat formats).
With the downloaded file:
In [61]: f = h5py.File('Downloads/Basketball_ECO_HC.mat','r')
In [62]: f
Out[62]: <HDF5 file "Basketball_ECO_HC.mat" (mode r)>
In [63]: f.keys()
Out[63]: <KeysViewHDF5 ['#refs#', 'results']>
In [65]: f['results']
Out[65]: <HDF5 dataset "results": shape (1, 1), type "|O">
In [66]: arr = f['results'][:]
In [67]: arr
Out[67]: array([[<HDF5 object reference>]], dtype=object)
In [68]: arr.item()
Out[68]: <HDF5 object reference>
I'd have to check the h5py docs to see how to follow that object reference further; I'm not that familiar with it.
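As it turns out, h5py can dereference such a reference directly: indexing the File object with it returns the target group or dataset. A minimal sketch with the same file:
ref = arr.item()        # <HDF5 object reference>
obj = f[ref]            # dereferences to the target, here the struct group under '#refs#'
list(obj.keys())        # the struct fields, e.g. ['annoBegin', 'fps', ...]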
But exploring the other key:
In [69]: list(f.keys())[0]
Out[69]: '#refs#'
In [70]: f[list(f.keys())[0]]
Out[70]: <HDF5 group "/#refs#" (2 members)>
In [71]: f[list(f.keys())[0]].keys()
Out[71]: <KeysViewHDF5 ['a', 'b']>
In [72]: f[list(f.keys())[0]]['a']
Out[72]: <HDF5 dataset "a": shape (2,), type "<u8">
In [73]: _[:]
Out[73]: array([0, 0], dtype=uint64)
In [74]: f[list(f.keys())[0]]['b']
Out[74]: <HDF5 group "/#refs#/b" (7 members)>
In [75]: f[list(f.keys())[0]]['b'].keys()
Out[75]: <KeysViewHDF5 ['annoBegin', 'fps', 'fps_no_ftr', 'len', 'res', 'startFrame', 'type']>
In [76]: f[list(f.keys())[0]]['b']['fps']
Out[76]: <HDF5 dataset "fps": shape (1, 1), type "<f8">
In [77]: f[list(f.keys())[0]]['b']['fps'][:]
Out[77]: array([[22.36617883]])
In the OS shell, I can look at the file with h5dump. From that it looks like the res dataset has the most data. The datasets also have attributes. That may be a better way of getting an overview, which you can then use to guide the h5py loads.
In [80]: f[list(f.keys())[0]]['b']['res'][:]
Out[80]:
array([[198., 196., 195., ..., 330., 328., 326.],
[214., 214., 216., ..., 197., 196., 192.],
[ 34., 34., 34., ..., 34., 34., 34.],
[ 81., 81., 81., ..., 81., 80., 80.]])
In [81]: f[list(f.keys())[0]]['b']['res'][:].shape
Out[81]: (4, 725)
In [82]: f[list(f.keys())[0]]['b']['res'][:].dtype
Out[82]: dtype('<f8')
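Putting it together: f['#refs#'] is a shorter spelling of f[list(f.keys())[0]], and MATLAB writes arrays to HDF5 in column-major order, so the (4, 725) seen here is the transpose of MATLAB's 725x4 res. A compact sketch, under those assumptions, of pulling out what MATLAB calls results{1}.res:
import h5py
with h5py.File('Basketball_ECO_HC.mat', 'r') as f:
    ref = f['results'][0, 0]    # object reference into '#refs#'
    res = f[ref]['res'][:].T    # transpose back to MATLAB's layout
print(res.shape)                # (725, 4)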
If your question is asking generally how to read v7.3 matfiles in Python, the hdf5storage package provides some utilities that might work for you. For your file (after installing the package) you would run:
In [0]: import hdf5storage as hdf5
In [1]: pyIn = hdf5.loadmat('Basketball_ECO_HC.mat')
In [2]: type(pyIn)
Out[2]: dict
In [3]: pyIn.keys()
Out[3]: dict_keys(['results'])
In [4]: type(pyIn['results'])
Out[4]: numpy.ndarray
In [5]: pyIn['results'].shape
Out[5]: (1, 1)
In [6]: pyIn['results'].dtype
Out[6]: dtype('O')
In [7]: pyIn['results'][0,0].dtype
Out[7]: dtype([('type', '<U4', (1, 1)), ('res', '<f8', (725, 4)), ('fps', '<f8', (1, 1)), ('fps_no_ftr', '<f8', (1, 1)), ('len', '<f8', (1, 1)), ('annoBegin', '<f8', (1, 1)), ('startFrame', '<f8', (1, 1))])
You can see it does a good job of parsing the input array, though it does things like collapsing the cell-in-a-cell that you would access in Matlab with results{1}{1} into a 2D numpy array you access with pyIn['results'][0,0] instead. Another odd thing I ran into with this data is an extra dimension added to the deeper structure fields, as below:
In [8]: pyIn['results'][0,0]['res'].shape
Out[8]: (1, 725, 4)
In [9]: pyIn['results'][0,0]['res'][0,0,:]
Out[9]: array([198., 214., 34., 81.])
I'm not entirely sure why this happens, but in general the package should work well.
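If that extra leading dimension gets in the way, squeezing it out afterwards is a one-liner (a cleanup step of my own, not something hdf5storage does for you):
import numpy as np
res = np.squeeze(pyIn['results'][0, 0]['res'])   # (1, 725, 4) -> (725, 4)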
That said, I did run into an issue with the latest version (0.2) of this package: for really deep array/cell/structure combinations it became incredibly slow. The nice thing is that the package is still being maintained, so fixes might be in the pipeline. Nevertheless, this prompted me to write my own h5py reader for matfiles, which is faster in these cases; I discuss it in another answer.
As I mentioned in my other post on the hdf5storage package, I have run into problems with it being far too slow when loading deep arrays. So I implemented my own matfile loader, whose code might also be useful (because it's compact) if you care about the specifics of how a v7.3 matfile gets read into Python. (That said, the code currently has very few comments, so maybe not that useful.)
For the case of my library, the outputs are very similar to hdf5storage, as shown here.
In [0]: from MatFileMethods import LoadMatFile
In [1]: pyIn = LoadMatFile('/Users/emilio/Downloads/Basketball_ECO_HC.mat')
In [2]: type(pyIn)
Out[2]: dict
In [3]: pyIn.keys()
Out[3]: dict_keys(['results'])
In [4]: type(pyIn['results'])
Out[4]: numpy.ndarray
In [5]: pyIn['results'].shape
Out[5]: (1, 1)
Note that, as with the hdf5storage package, the cell-within-a-cell in Matlab, which you would access with results{1}{1}, becomes a two-dimensional numpy.ndarray accessed with pyIn['results'][0,0], as below.
In [6]: type(pyIn['results'][0,0])
Out[6]: dict
In [7]: pyIn['results'][0,0].keys()
Out[7]: dict_keys(['annoBegin', 'fps', 'fps_no_ftr', 'len', 'res', 'startFrame', 'type'])
In [8]: pyIn['results'][0,0]['res'].shape
Out[8]: (725, 4)
In [9]: pyIn['results'][0,0]['res'][0,:]
Out[9]: array([198., 214., 34., 81.])
In contrast with hdf5storage, I opt to make Matlab structures into Python dicts, so that the fields of the structures are the keys of the dictionaries.
In any case, this module is by no means fully tested, but it has served me well for loading mat files of ~500 MB and larger that version 0.2 of hdf5storage doesn't seem to handle: ~1.5 minutes for my own loader versus more than 10 minutes for hdf5storage (it hadn't finished loading when I gave up at the 10-minute mark). (I'll note that 1.5 minutes still pales in comparison to Matlab's own <15 s load times, so there's room for improvement...)
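To give a flavor of what such a loader does internally, here is a heavily simplified recursive sketch. This is not my MatFileMethods code: it handles only groups, numeric datasets, and cell arrays of object references, and ignores MATLAB strings, logicals, and the other storage classes.
import h5py
import numpy as np

def _read(f, node):
    # MATLAB structs become HDF5 groups; map their fields to dict keys.
    if isinstance(node, h5py.Group):
        return {key: _read(f, node[key]) for key in node.keys()}
    data = node[()]
    if data.dtype == object:
        # MATLAB cell arrays are stored as arrays of object references.
        out = np.empty(data.shape, dtype=object)
        for idx in np.ndindex(data.shape):
            out[idx] = _read(f, f[data[idx]])
        return out
    return data.T  # MATLAB writes column-major; transpose back

def load_mat73(path):
    # Real variables live at the root; '#refs#' is internal bookkeeping.
    with h5py.File(path, 'r') as f:
        return {k: _read(f, f[k]) for k in f.keys() if not k.startswith('#')}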
I really like the heatmap, but what I need are the numbers behind the heatmap (AKA correlation matrix).
Is there an easy way to extract the numbers?
It was a bit hard to track down, but starting from the documentation, specifically the report structure, then digging into the function get_correlation_items(summary) and following its usage in the source, we get to a call that essentially loops over each of the correlation types in the summary. To obtain the summary object, we look up the caller and find that it is get_report_structure(summary); tracing where its summary argument comes from shows that it is simply the description_set property.
Given the above, we can now do the following using version 2.9.0:
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
df = pd.DataFrame(
np.random.rand(100, 5),
columns=["a", "b", "c", "d", "e"]
)
profile = ProfileReport(df, title="StackOverflow", explorative=True)
correlations = profile.description_set["correlations"]
print(correlations.keys())
dict_keys(['pearson', 'spearman', 'kendall', 'phi_k'])
To see a specific correlation do:
correlations["phi_k"]["e"]
a 0.000000
b 0.112446
c 0.289983
d 0.000000
e 1.000000
Name: e, dtype: float64
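Each entry in that dict is, as far as I can tell in 2.9.0, an ordinary pandas DataFrame, so the full matrix can be exported with standard pandas I/O, for example:
correlations["phi_k"].to_csv("phi_k_correlations.csv")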
Sample Notebook
Using networkx I created a graph G and added the edge (1,2). Strangely,
(1,2) in G.edges() == False
but
G.has_edge(1,2) == True
I'm baffled. This is a concise summary of the problem. The actual graph data is large and cannot be reproduced here in full.
The issue is that G.edges() lists each edge only once. Since the graph is undirected, a given edge can appear as either (1,2) or (2,1), but not both (you wouldn't want both to appear: you might be iterating and delete the edge the first time you see it). You can't be sure which order you'll get because the edges come from a Python dict, which does not have a predictable order. You could test (1,2) in G.edges() or (2,1) in G.edges(), but you really don't want to do that: building the edge list just for a membership test is inefficient.
So the test should use G.has_edge(1,2), which checks the edge correctly regardless of order (and much more efficiently).
Just to show the lack of predictability across Python implementations, this is what I get:
In [3]: G=nx.Graph()
In [4]: G.add_edge(1,2)
In [5]: G.edges()
Out[5]: [(1, 2)]
In [6]: (1,2) in G.edges()
Out[6]: True
In [7]: G.add_edge(4,3)
In [8]: G.edges()
Out[8]: [(1, 2), (3, 4)]
In [9]: (4,3) in G.edges()
Out[9]: False
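For completeness, has_edge itself is order-insensitive on an undirected graph:
In [10]: G.has_edge(4,3), G.has_edge(3,4)
Out[10]: (True, True)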
My data consists of 50 columns and most of them are strings. I have a single multi-class variable which I have to predict. I tried using LabelEncoder in scikit-learn to convert the features (not classes) into whole numbers and feed them as input to the RandomForest model I am using. I am using RandomForest for classification.
Now, when new test data comes in (a stream of new data), how will I know, for each column, what the label for each string should be? Using LabelEncoder again now would give me new labels independent of the labels I generated before. Am I doing this wrong? Is there anything else I should use for consistent encoding?
The LabelEncoder class has two methods that handle this distinction: fit and transform. Typically you call fit first to map some data to a set of integers:
>>> le = LabelEncoder()
>>> le.fit(['a', 'e', 'b', 'z'])
>>> le.classes_
array(['a', 'b', 'e', 'z'], dtype='<U1')
Once you've fit your encoder, you can transform any data to the label space, without changing the existing mapping:
>>> le.transform(['a', 'e', 'a', 'z', 'a', 'b'])
array([0, 2, 0, 3, 0, 1])
>>> le.transform(['e', 'e', 'e'])
array([2, 2, 2])
The use of this encoder basically assumes that you know beforehand what all the labels are in all of your data. If you have labels that might show up later (e.g., in an online learning scenario), you'll need to decide how to handle those outside the encoder.
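One way to handle that outside the encoder, sketched below: reserve an explicit unknown token at fit time (the '<unknown>' name is my own choice, not a scikit-learn convention) and map anything unseen to it before transforming:
>>> UNK = '<unknown>'                  # hypothetical reserved token
>>> le = LabelEncoder()
>>> le.fit(['a', 'e', 'b', 'z', UNK])  # reserve a slot for unseen strings
>>> known = set(le.classes_)
>>> le.transform([x if x in known else UNK for x in ['a', 'q', 'z']])
array([1, 0, 4])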
You could save the string -> label mapping for each column, built from the training data.
>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> col_1 = ["paris", "paris", "tokyo", "amsterdam"]
>>> set_col_1 = list(set(col_1))
>>> le.fit(col_1)
>>> dict(zip(set_col_1, le.transform(set_col_1)))
{'amsterdam': 0, 'paris': 1, 'tokyo': 2}
When the test data comes, you can use those mappings to encode the corresponding columns in the test data. You do not have to fit the encoder again on the test data.
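Applying a saved mapping to a test column is then just a lookup (assuming every test value was seen during training):
>>> mapping = {'amsterdam': 0, 'paris': 1, 'tokyo': 2}  # from the training step above
>>> col_1_test = ["tokyo", "paris", "paris"]
>>> [mapping[v] for v in col_1_test]
[2, 1, 1]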
If each process creates its own nx.Graph() and adds/removes nodes/edges to it, is there any reason for them to collide? I am noticing some weird phenomena and trying to debug them.
The general problem is that I am dumping a single graph as an edge list and recreating it from a subset in each process into a new graph. For some reason those new graphs are missing edges.
EDIT:
I think I found the part of code which causes the problems for me, the question is whether the following is the intended behaviour of NetworkX or not:
>>> import networkx as nx
>>> g = nx.Graph()
>>> g.add_path([0,1,2,3])
>>> g.nodes()
[0, 1, 2, 3]
>>> g.edges()
[(0, 1), (1, 2), (2, 3)]
>>> g[1][0]
{}
>>> g[0][1] = {"test":1}
>>> g.edges(data=True)
[(0, 1, {'test': 1}), (1, 2, {}), (2, 3, {})]
>>> g[1][0]
{}
>>> g[0][1]
{'test': 1}
>>>
Since the graph is an undirected one, I would expect the edge data to appear regardless of the order of the node ids in the request; is that an incorrect assumption?
In general there should be no issue with multiple processes as long as they each have their own Graph() object.
EDIT:
In your case you are explicitly assigning data to the internal NetworkX Graph structure with the line
>>> g[0][1] = {"test":1}
While that is technically allowed, it breaks the API and the data structure. You should instead use
>>> g.add_edge(0,1,test=1)
which won't add a new edge here, only a new attribute. Doing it that way assigns the data to both g[0][1] and g[1][0] correctly.
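On a fresh graph you can see that add_edge keeps both views of the edge data in sync:
>>> h = nx.Graph()
>>> h.add_edge(0, 1, test=1)
>>> h[0][1]
{'test': 1}
>>> h[1][0]
{'test': 1}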
I have used read_graphml to load a graph, and it looks as if it returned a MultiGraph object, which I can't run the PageRank method on (it raises an exception saying the graph must not be a multigraph).
Is there a way to convert my graph to a non-multigraph type? (I don't think I have parallel edges in the graph I loaded...)
Thanks
If the read_graphml() function returned a MultiGraph object, it probably found parallel (multiple) edges in the input file. But you can convert that to a graph without parallel edges simply by passing it into a new Graph(), e.g.
In [1]: import networkx as nx
In [2]: G = nx.MultiGraph([(1,2),(1,2)])
In [3]: G.edges()
Out[3]: [(1, 2), (1, 2)]
In [4]: H = nx.Graph(G)
In [5]: H.edges()
Out[5]: [(1, 2)]
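One caveat: if the parallel edges carry attributes, passing into Graph() keeps only one of the attribute dicts (effectively arbitrarily). If you care about that data, combine it yourself; a sketch that sums a hypothetical 'w' attribute:
In [6]: GM = nx.MultiGraph([(1,2,{'w':10}), (1,2,{'w':20})])
In [7]: H2 = nx.Graph()
In [8]: for u,v,d in GM.edges(data=True):
   ...:     if H2.has_edge(u,v):
   ...:         H2[u][v]['w'] += d['w']
   ...:     else:
   ...:         H2.add_edge(u, v, w=d['w'])
In [9]: H2.edges(data=True)
Out[9]: [(1, 2, {'w': 30})]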