How to store hierarchical data in scipy? - scipy

I am working with data analysis using the Scipy stack, and often have data with complicated hierarchies, e.g. dictionaries with elements as lists and again these lists have multiple dictionaries within them...
And therefore I have the need to store those data in a file. I have been using scipy.io.savemat, but I read that HDF5 is more suitable and less platform-specific (savemat is for Matlab). The drawback is that with h5py, instead of saving variables in the workspace directly, I have to manually replicated the complicate structures (i.e. for each dictionary, I need to manually create_group accordingly).
Would there be a standard way to do this, like how we save ".mat" files in Matlab?
Thank you!
-Shawn

If you only need to read your data back from python and don't need to change some bits of your data, it is much easier to save your things using pickle protocol.
E.g
import cPickle
f=open('something.pickle','w+')
cPickle.dump(whatever_object_youd_like_to_save,f,protocol=2)
and then you can load it using
your_object = cPickle.load(open('something.pickle'))

joblib is another tool that will let you dump arbitrary Python objects, with the added advantage of dedicated storage for NumPy arrays.

Related

How to pre-populate a Matlab data store with timeseries data rather than just Initial Value

I have a Simulink model which requires the use of data store read/write blocks.
Normally, the read blocks will read from a data store that is being written-to from another part of the model.
However, for modular testing purposes, sometimes some of the data store write blocks are not available.
In this case, I would like to be able to pre-populate the data store with timeseries data for the whole simulation (rather than just setting the initial value).
Note that I am creating the data store with Simulink.Signal objects in the base workspace (not using data store memory blocks)
I know that I could temporarily replace the data store read block with a block such as From Workspace.
However, this is not an ideal solution because it would be a very manual process to find and replace all of Data Store Read blocks back and forth between testing and production every time, and using variant switches or similar would add alot of clutter.
Any suggestions would be greatly appreciated!
Many thanks in advance.

Memory issues with large amounts of data stored as nested cells in MATLAB

I have large amounts of data stored as nested cells in .mat files. My biggest problem right now is the load times for accessing these files, but I'm wondering if the underlying problem is that I came up with an inefficient way for storing the data and I should restructure it to be smaller.
The full file consists of a cell aray:
Hemi{1,h} where there are 52 versions of h
.{n,p} where there are 85 versions of n and up to ~100 versions of p
.Variable where there are 10 variables, each with ~2500 values
This full file ate up all my memory, so I saved it in parts, aka:
Hemi1.mat=Hemi{1,1}
Hemi2.mat=Hemi{1,2}
etc.
The next step for this application is to load each file, determine which part of it is an appropriate solution (I need Hemi{1,h}.{n,p}.Var1, Hemi{1,h}.{n,p}.Var2, and Hemi{1,h}.{n,p}.Var3 for this, but I still need to keep track of the other Variables), save the solution, then close the file and move to the next one.
Is there a faster way to load these files?
Is the problem less my dataset and more how I've chosen to store it? Is there a better alternative?
That is quite a lot of data. I have a few suggestions that you could look into. The first is to see if you can change the datatypes to something like Categorical Objects. They are way more memory efficient. Also if you are storing strings as your final data this can be quite heavy storage wise.
Second you could look into HDF5 file storage. I hear it is a nice way to store structured data.
You could finally try to convert your {n,p} arrays into Table structures. I am not sure if this is better for memory, but tables are nice to work with and it may help you out. (Depending on your version of Matlab you may not have tables :P).
I hope this helps!
-Kyle

What is the Storable module used for?

I am having a hard time understanding what Storable does.
I know that it "stores" a variable into your disk, but why would I need to do that? What would I use this module for, and how would I do it?
Reasons that spring to mind:
Persist memory across script calls
Sharing variables across different processes (sometimes it isn't possible to pipe stuff)
Of course, that's not all that Storable does. It also:
Makes it possible to create deep clones of data structures
Serializes the data structure stored, which implies a smaller file footprint than output from Data::Dump
Is optimized for speed (so it's faster to retrieve than to require a file containing Data::Dump output
One example:
Your program spends a long time populating your data structure, a graph, or trie, and if the program crashes then you'd lose it all and have to start again from square one. To avoid losing this data and be able to continue where it stopped last time you can save a snapshot of the data to a file manually or just simply use Storable.

Efficient disk access of large number of small .mat files containing objects

I'm trying to determine the best way to store large numbers of small .mat files, around 9000 objects with sizes ranging from 2k to 100k, for a total of around half a gig.
The typical use case is that I only need to pull a small number (say 10) of the files from disk at a time.
What I've tried:
Method 1: If I save each file individually, I get performance problems (very slow save times and system sluggishness for some time after) as Windows 7 has difficulty handling so may files in a folder (And I think my SSD is having a rough time of it, too). However, the end result is fine, I can load what I need very quickly. This is using '-v6' save.
Method 2: If I save all of the files in one .mat file and then load just the variables I need, access is very slow (loading takes around three quarters of the time it takes to load the whole file, with small variation depending on the ordering of the save). This is using '-v6' save, too.
I know I could split the files up into many folders but it seems like such a nasty hack (and won't fix the SSD's dislike of writing many small files), is there a better way?
Edit:
The objects are consist mainly of a numeric matrix of double data and an accompanying vector of uint32 identifiers, plus a bunch of small identifying properties (char and numeric).
Five ideas to consider:
Try storing in an HDF5 object - take a look at http://www.mathworks.com/help/techdoc/ref/hdf5.html - you may find that this solves all of your problems. It will also be compatible with many other systems (e.g. Python, Java, R).
A variation on your method #2 is to store them in one or more files, but to turn off compression.
Different datatypes: It may also be the case that you have some objects that compress or decompress inexplicably poorly. I have had such issues with either cell arrays or struct arrays. I eventually found a way around it, but it's been awhile & I can't remember how to reproduce this particular problem. The solution was to use a different data structure.
#SB proposed a database. If all else fails, try that. I don't like building external dependencies and additional interfaces, but it should work (the primary problem is that if the DB starts to groan or corrupts your data, then you're back at square 1). For this purpose consider SQLite, which doesn't require a separate server/client framework. There is an interface available on Matlab Central: http://www.mathworks.com/matlabcentral/linkexchange/links/1549-matlab-sqlite
(New) Considering that the objects are less than 1GB, it may be easier to just copy the entire set to a RAM disk and then access through that. Just remember to copy from the RAM disk if anything is saved (or wrap save to save objects in two places).
Update: The OP has mentioned custom objects. There are two methods to consider for serializing these:
Two serialization program from Matlab Central: http://www.mathworks.com/matlabcentral/fileexchange/29457 - which was inspired by: http://www.mathworks.com/matlabcentral/fileexchange/12063-serialize
Google's Protocol Buffers. Take a look here: http://code.google.com/p/protobuf-matlab/
Try storing them as blobs in a database.
I would also try the multiple folders method as well - it might perform better than you think. It might also help with organization of the files if that's something you need.
The solution I have come up with is to save object arrays of around 100 of the objects each. These files tend to be 5-6 meg so loading is not prohibitive and access is just a matter of loading the right array(s) and then subsetting them to the desired entry(ies). This compromise avoids writing too many small files, still allows for fast access of single objects and avoids any extra database or serialization overhead.

How to load data into Core Data?

thanks for you help.
I'm attempting to add core data to my project and I'm stuck at where and how to add the actual data into the persistent store (I'm assuming this is the place for the raw data).
I will have 1000 < objects so I don't want to use a plist approach. From my searches, there seems to be xml and csv approaches. Is there a way I can use SQL for input?
The data will not be changed by the user and the data file will be typed in by hand, so I won't need to update these files during runtime, and at this point I am not limited in any type of file - the lightest on syntax is preferred.
Thanks again for any help.
You could load your data from an xml/csv/json file and create the DB on the first lunch of your application (if the DB is not there, then read the data and create it).
A better/faster approach might be to ship your sqllite DB within your application. You can parse the file in any format you want on the simulator, create a DB with all your entities, then take it from the ApplicationData and just add it to your app as a resource.
Although I'm sure there are lighter file types that could be used, I would include a JSON file into the app bundle from which you import the initial dataset.
Update: some folks are recommending XML. NSXMLParser is almost as fast as JSONKit (but much faster than most other parsers), but the XML syntax is heavier than JSON. So an XML bundled file that holds the initial dataset would weight more than if it was in JSON.
Considering Apple considers the format of its persistent stores implementation details, shipping a prefabricated SQLite database is not a very good idea. I.e. the names of fields and tables may change between iOS versions/phones/whatever hidden variable you can think of. You should, in general, not concern yourself with how this serialization of your data is formatted.
There's a brief article about importing data on Apple's developer site: Efficiently Importing Data
You should ship initial data in whatever format you're comfortable with (XML allows you to do incremental parsing efficiently, which reduces memory footprint) and write an import routine to run if you need to import data.
Edit: With EliBud's comment in mind, I still consider the approach a bit "iffy"... The format of the SQLite database used by Core Data is not something you'd want to generate by yourself (it's weird, simply put, and still not something you should really rely on).
So you'd want to use a mock app running on the Simulator and use Core Data to create the database (as per EliBud's answer). But you'd still have to import the data into that mock-app! And while it might make sense to do this once on a "real" computer instead of a lot of times on a mobile device (i.e. copying a file is easy, importing data is hard), you're essentially using the Simulator as an administration tool.
But hey, if it works...