Alternatives to Matlab's Mat File Format

I'm finding that writing and reading the native MAT file format becomes very, very slow with larger data structures, around 1 GB in size. In addition, we have other, non-Matlab software that should be able to read and write these files. So I would like to find an alternative format to use to serialize Matlab data structures. Ideally this format would ...
be able to represent an arbitrary matlab structure to a file.
have faster I/O than mat files.
have I/O libraries for other languages like Java, Python and C++.

Simplifying your data structures and using the new v7.3 MAT file format, which is a variant of HDF5, might actually be the best approach. The HDF5 format is open and already has I/O libraries for your other languages. And depending on your data structure, they may be faster than the old binary mat files.
Simplify the data structures you're saving, preferring large arrays of primitives to complex container structures.
Try turning off compression if your data structures are still complex.
Try the v7.3 MAT file format using "-v7.3" (see the sketch after this list)
If using a network file system, consider saving and loading to a temporary dir on a fast local drive and copying to/from the network
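A minimal sketch of the version flags mentioned above; data is just a placeholder variable:

data = rand(5000);                       % placeholder for whatever you need to serialize
save('out_v73.mat', 'data', '-v7.3');    % HDF5-based v7.3 format
save('out_v7.mat',  'data', '-v7');      % default compressed v7 format
save('out_v6.mat',  'data', '-v6');      % older v6 format, stored without compression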
For large data structures, your MAT file I/O speed may be determined more by the internal structure of the data you're writing out than the size of the resulting MAT file itself. (In my experience, this has usually been the major factor in slow MAT files.) When you say "arbitrary Matlab structure", that suggests you might be using cells, structs, or objects to make complex data structures. That slows down MAT I/O because there is per-array overhead in MAT file I/O, and the members of cell and struct arrays (container types) all count as separate arrays. For example, 5,000 strings stored in a cellstr are much, much slower than the same 5,000 strings stored in a 2-D char array. And objects have even more overhead. As a test, try writing out a 1 GB file that contains just a 1 GB primitive array of random uint8s, and see how long that takes. From there, see if you can simplify your data to reduce the total mxarray count, even if that means reshaping it for serialization. (My experience with this is mostly with the v7 format; the newer HDF5 format may have less per element overhead.)
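A rough version of that test, assuming you just want to time one large primitive array (the file name is made up):

x = randi(255, 1, 2^30, 'uint8');                    % roughly 1 GB of random uint8 values
tic; save('big_primitive.mat', 'x', '-v7.3'); toc    % time the write
tic; S = load('big_primitive.mat'); toc              % time the read back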
If your data files live on the network, you could also try doing the save and load operations on temporary files on fast local drives, and separately using copy operations to move them back and forth between the network. At least on Windows networks, I've seen speedups of up to 2x from doing this. Possibly due to optimizations the full-file copy operation can do that the MAT I/O code can't.
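A hedged sketch of that save-locally-then-copy idea; the network path and the variable name are made up:

localFile   = fullfile(tempdir, 'results.mat');
networkFile = '\\server\share\results.mat';   % made-up UNC path

save(localFile, 'results', '-v7.3');   % fast write to the local drive
copyfile(localFile, networkFile);      % bulk copy to the network share

copyfile(networkFile, localFile);      % and the reverse when loading
S = load(localFile);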
It would probably be a substantial effort to come up with an alternate file format that supported fully arbitrary Matlab data structures and was portable to other languages. I'd try making smaller changes around your use of the existing format first.

The MAT format has changed across Matlab versions. v7.3 uses the HDF5 format, which has built-in compression and other features, and it can take a long time to read/write. However, you can force Matlab to use previous formats, which are faster (but might take more space).
See here:
http://www.mathworks.com/help/matlab/import_export/mat-file-versions.html

Related

Decrease .mat file size

I have a lot of .txt files with sizes from 100 MB to 300 MB and I want to transfer all the data that is in them to .mat files. The data is mostly numbers and I am looking to make the storage space as small as possible. Right now what I've done is read the data in the .txt files, put it in a struct, and then save it using the v7.3 compression scheme, but each variable then comes out to almost 9 GB. Would anyone have an idea about how I can make this better?
You can save to MAT file v7 as noted in the documentation:
Version 7.3 MAT-files use an HDF5 based format that requires some overhead storage to describe the contents of the file. For cell arrays, structure arrays, or other containers that can store heterogeneous data types, Version 7.3 MAT-files are sometimes larger than Version 7 MAT-files.
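A hedged sketch of that, combined with the point above about containers: store the numbers as one plain numeric matrix rather than a struct, and save with -v7. File names are placeholders, and readmatrix is just one of several ways to parse the text:

M = readmatrix('data_001.txt');     % parse one text file into a plain numeric matrix
M = single(M);                      % optional: halves the size if the precision is acceptable
save('data_001.mat', 'M', '-v7');   % v7 is zlib-compressed and often smaller than v7.3 here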

Optimizing compression using HDF5/H5 in Matlab

Using Matlab, I am going to generate several data files and store them in H5 format as 20x1500xN, where N is an integer that can vary, but typically around 2300. Each file will have 4 different data sets with equal structure. Thus, I will quickly achieve a storage problem. My two questions:
Is there any reason not to split the 4 different data sets, and just save them as 4x20x1500xN instead? I would prefer having them split, since they are different signal modalities, but if there is any computational/compression advantage to not having them separated, I will join them.
Using Matlab's built-in compression, I set deflate=9 (and DataType=single). However, I have now realized that using deflate multiplies my computational time with 5. I realize this could have something to do with my ChunkSize, which I just put to 20x1500x5 - without any reasoning behind it. Is there a strategic way to optimize computational load w.r.t. deflation and compression time?
Thank you.
1- Splitting or merging? It won't make a difference in the compression procedure, since it is performed in blocks.
2- Your choice of chunkshape does, indeed, seem bad. Chunksize determines the shape and size of each block that will be compressed independently. The problem is that each chunk is 600 kB (20 x 1500 x 5 singles at 4 bytes each), which is much larger than the L2 cache, so your CPU is likely twiddling its thumbs, waiting for data to come in. Depending on the nature of your data and the usage pattern you will use the most (read the whole array at once, random reads, sequential reads...) you may want to target the L1 or L2 size, or something in between. Here are some experiments done with a Python library that may serve you as a guide.
Once you have selected your chunksize (how many bytes your compression blocks will have), you have to choose a chunkshape. I'd recommend the shape that most closely fits your reading pattern if you are doing partial reads, or filling it in fastest-axis-first order if you want to read the whole array at once. In your case, this will be something like 1x1500x10, I think (the second axis being the fastest, the last one the second fastest, and the first the slowest; change this if I am mistaken).
Lastly, keep in mind that the details are quite dependent on the specific machine you run it on: the CPU, the quality and load of the hard drive or SSD, the speed of the RAM... so the fine tuning will always require some experimentation.
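A minimal sketch of that chunking advice using Matlab's h5create/h5write; the file name, dataset name, and deflate level are only example choices:

N = 2300;                                  % example size from the question
data = rand(20, 1500, N, 'single');        % placeholder signal data

h5create('signals.h5', '/modality1', size(data), ...
    'Datatype', 'single', ...
    'ChunkSize', [1 1500 10], ...          % ~60 kB chunks, much closer to cache sizes
    'Deflate', 4);                         % a moderate compression level instead of 9
h5write('signals.h5', '/modality1', data);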

using mapreduce programming technique in matlab

I am studying rat ultrasonic vocalisations (their speech in ultrasound). I have several audio wav files of the rats' calls. Ideally, I would import a whole file into Matlab and just process it, but I run into memory issues even with the smallest 70 MB file. This is what I want help with.
[y, Fs] = audioread('T0000201.wav');       % audioread returns [y, Fs]; nbits was a wavread output
[S, F, T] = spectrogram(y, 100, [], 256, Fs);   % 'yaxis' is only a plotting option, so drop it here
..
..
..rest of program
I could consider breaking the audio (in one file) into blocks, and process the block before considering the next block, but I'm not sure what I would do for cases where rat calls are cut off half way through, at the end of the blocks (this might have a negative impact on the STFT spectrogram).
I came across another technique called "Mapreduce" which seems to allow me to use the entirety of my data without actually reading it in. While this seems most ideal, I don't quite understand how it works or can be implemented. "Hadoop" has also been mentioned. Can anyone provide any assistance?
I am currently using this (http://uk.mathworks.com/help/matlab/import_export/find-maximum-value-with-mapreduce.html) for reference. My first step was trying to use the wav file as the data store (like the csv file in the example) but that didn't work.
Since you're working primarily with a repository of audio (.wav) files, mapreduce might not be your best option. The datastore function only works with text files or key-value files.
Use the memory function to explore what the limits of memory are for MATLAB, and try processing the audio files in smaller blocks as you mentioned. Using a combination of audioread(), audioinfo(), and audiowrite(), you can break your collection of audio files up into a larger collection of smaller files that can then be individually processed.
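A hedged sketch of that block-wise approach using audioread's sample-range argument, with a small overlap so calls at block edges appear whole in at least one block (the block and overlap lengths are arbitrary choices):

info     = audioinfo('T0000201.wav');
Fs       = info.SampleRate;
blockLen = 10 * Fs;                  % 10-second blocks
overlap  = Fs;                       % 1 second of overlap between blocks

startIdx = 1;
while startIdx <= info.TotalSamples
    stopIdx = min(startIdx + blockLen - 1, info.TotalSamples);
    y = audioread('T0000201.wav', [startIdx stopIdx]);   % read only this block
    [S, F, T] = spectrogram(y(:,1), 100, [], 256, Fs);
    % ...process S for this block...
    if stopIdx == info.TotalSamples
        break
    end
    startIdx = stopIdx - overlap + 1;
end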
If you have a small number of files to work with, then you can manually inspect the smaller blocks to make sure no important rat calls are cut off between blocks. Of course if you have thousands of files to work with then that approach won't be feasible.

Matlab .mat file saving

I have identical code in Matlab and identical data that was analyzed on two different computers. Both are Win 7 64 bit. Both run Matlab R2014a. After the code finishes its run, I save the variables using the save command and it outputs a .mat file.
Is it possible to have two very different file sizes for these files? Like one being 170 MB and the other being 2.4 GB? This is absurd because when I check the variables in Matlab they add up to maybe 1.5 GB at most. What can be the reason for this?
Does saving to .mat file compress the variables (still with the regular .mat extension)? I think it does because when I check the individual variables they add up to around 1.5 GB.
So why would one output smaller file size, but the other just so huge?
The MAT format in recent versions is HDF5-based, which includes gzip compression. Probably on one PC the default MAT format has been changed to an old version which does not support compression. Try saving with an explicit version flag; then both PCs should produce the same size.
I found the reason for this based on the following stackoverflow thread: MATLAB: Differences between .mat versions
Apparently one of the computers was using the -v7 format, which produces much smaller files; -v7.3 just inflates the files significantly. This is ironic in my opinion, since -v7.3 is the format that enables saving variables larger than 2 GB, which means those already-large variables end up much, much larger when saved to a .mat file.
Anyway this link is very useful.
Update:
I implemented the serialization mentioned in the above link, and it increased the file size. In my case the best option will be the -v7 format, since it provides the smallest file size and is also able to save the structures and cell arrays that I use a lot.

Saving an array in matlab efficiently

I want to save an array efficiently in Matlab. I have an array of size 3000 by 9000. If I save this array in a mat file it consumes around 214 MB using just the save function. If I use fwrite with the float data type, it comes to around 112 MB. Is there any other way I can further reduce the hard disk space consumed when I save this array in Matlab?
I would suggest writing in binary mode and then using a compression algorithm such as bzip2.
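bzip2 isn't built into Matlab, but a rough sketch of the same idea with the built-in gzip function might look like this (the file name is made up, and A is the 3000-by-9000 array from the question):

fid = fopen('arr.bin', 'w');
fwrite(fid, single(A), 'single');   % raw binary write of the single-precision values
fclose(fid);
gzip('arr.bin');                    % produces arr.bin.gz alongside the original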
There are a few ways to reduce the required memory:
1. Reducing precision
Rather than the double you normally have, consider using a single, or perhaps even a uint8 or logical. Writing with fwrite will also do this, but you may want to compress further afterwards, as fwrite does not create a compressed file.
2. Utilizing a pattern
If your matrix has a certain pattern, this can sometimes be leveraged to store the data more efficiently, or at least to store only the information needed to recreate the data. The most common example is when your matrix is storable as a few vectors, for example when it is sparse.
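A short sketch of both ideas, with made-up data at the size from the question:

A = rand(3000, 9000);            % double precision: about 206 MB in memory

As = single(A);                  % half the bytes per element
save('arr_single.mat', 'As');    % the default v7 save also applies compression

A(A < 0.99) = 0;                 % pretend the data is mostly zeros
S = sparse(A);                   % sparse storage keeps only the non-zero entries
save('arr_sparse.mat', 'S');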