Using the mapreduce programming technique in MATLAB

I am studying rat ultrasonic vocalisations (their "speech" in the ultrasound range). I have several audio WAV files of the rats' calls. Ideally, I would import a whole file into MATLAB and just process it, but I run into memory issues even with the smallest 70 MB file. This is what I want help with.
[y, Fs] = audioread('T0000201.wav');            % audioread returns only the samples and the sample rate
[S, F, T] = spectrogram(y, 100, [], 256, Fs);   % 'yaxis' only affects plotting, so drop it when requesting outputs
..
..
..rest of program
I could consider breaking the audio (in one file) into blocks and processing each block before moving on to the next, but I'm not sure what to do about rat calls that are cut off halfway through at the end of a block (this might have a negative impact on the STFT spectrogram).
I came across another technique called "mapreduce", which seems to allow me to use the entirety of my data without actually reading it all in. While this seems ideal, I don't quite understand how it works or how it can be implemented. "Hadoop" has also been mentioned. Can anyone provide any assistance?
I am currently using this example (http://uk.mathworks.com/help/matlab/import_export/find-maximum-value-with-mapreduce.html) for reference. My first step was to try using the WAV file as the datastore (like the CSV file in the example), but that didn't work.

Since you're working primarily with a repository of audio (.wav) files, mapreduce might not be your best option. The datastore function only works with text files or key-value files.
Use the memory function to explore MATLAB's memory limits, and try processing the audio files in smaller blocks as you mentioned. Using a combination of audioread(), audioinfo(), and audiowrite(), you can break your collection of audio files up into a larger collection of smaller files that can then be processed individually.
If you have a small number of files to work with, then you can manually inspect the smaller blocks to make sure no important rat calls are cut off between blocks. Of course if you have thousands of files to work with then that approach won't be feasible.
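One way to avoid losing calls at block boundaries is to overlap consecutive blocks. Here is a minimal sketch of block-wise processing with audioinfo()/audioread(); the block length, overlap, and the analysis step are placeholders to adapt to your data:
% Read one wav file in overlapping chunks so that a call near a block edge
% appears complete in at least one of the blocks.
info     = audioinfo('T0000201.wav');            % file metadata, no audio loaded yet
blockLen = 10 * info.SampleRate;                 % e.g. 10-second blocks (tune to your RAM)
overlap  = round(0.5 * info.SampleRate);         % 0.5 s overlap between blocks
startIdx = 1;
while startIdx <= info.TotalSamples
    stopIdx   = min(startIdx + blockLen - 1, info.TotalSamples);
    [y, Fs]   = audioread('T0000201.wav', [startIdx stopIdx]);   % read only this block
    [S, F, T] = spectrogram(y, 100, [], 256, Fs);
    % ... analyse S, F, T for this block ...
    if stopIdx == info.TotalSamples
        break;                                   % last block done
    end
    startIdx = stopIdx - overlap + 1;            % step forward, keeping the overlap
end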

Related

Optimizing compression using HDF5/H5 in Matlab

Using MATLAB, I am going to generate several data files and store them in HDF5 format as 20x1500xN, where N is an integer that can vary but is typically around 2300. Each file will have 4 different datasets with the same structure. Thus, I will quickly run into a storage problem. My two questions:
Is there any reason not to split the 4 different datasets, and just save them as 4x20x1500xN instead? I would prefer having them split, since they are different signal modalities, but if there is any computational/compression advantage to not having them separated, I will join them.
Using MATLAB's built-in compression, I set Deflate=9 (and Datatype=single). However, I have now realized that using deflate multiplies my computation time by 5. I realize this could have something to do with my ChunkSize, which I just set to 20x1500x5 without any reasoning behind it. Is there a strategic way to optimize computational load with respect to deflation and compression time?
Thank you.
1. Splitting or merging? It won't make a difference to the compression itself, since compression is performed in blocks.
2. Your choice of chunk shape does indeed seem bad. ChunkSize determines the shape and size of each block that will be compressed independently. The problem is that each chunk is 600 kB, which is much larger than the L2 cache, so your CPU is likely twiddling its thumbs waiting for data to come in. Depending on the nature of your data and your most common access pattern (reading the whole array at once, random reads, sequential reads...), you may want to target the L1 or L2 cache size, or something in between. Here are some experiments done with a Python library that may serve you as a guide.
Once you have selected your chunk size (how many bytes each compression block will have), you have to choose a chunk shape. I'd recommend the shape that most closely fits your reading pattern if you are doing partial reads, or a fastest-axis-first fill if you want to read the whole array at once. In your case, this will be something like 1x1500x10, I think (the second axis being the fastest, the last one the second fastest, and the first the slowest; correct me if I am mistaken).
Lastly, keep in mind that the details are quite dependent on the specific machine you run it on: the CPU, the quality and load of the hard drive or SSD, the speed of the RAM... so fine tuning will always require some experimentation.
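If it helps, here is a rough timing sketch (the file names and the dataset name '/ds1' are made up) using MATLAB's high-level HDF5 functions to compare chunk shapes before committing to one:
% Compare write time for the original chunk shape vs. a smaller,
% read-pattern-oriented one; both use Deflate level 9 on single data.
data = single(rand(20, 1500, 230));                 % small stand-in for one dataset
for chunk = {[20 1500 5], [1 1500 10]}
    fname = sprintf('chunk_test_%d_%d_%d.h5', chunk{1});
    h5create(fname, '/ds1', size(data), ...
             'Datatype', 'single', ...
             'ChunkSize', chunk{1}, ...
             'Deflate', 9);
    tic; h5write(fname, '/ds1', data); toc          % compare the elapsed times
end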

How to write "Big Data" to a text file using Matlab

I am getting readings from an accelerometer connected to an Arduino, which is in turn connected to MATLAB through serial communication. I would like to write the readings to a text file. A 10-second reading will write around 1000 entries, which makes the text file around 1 kbyte in size.
I will be using the following code:
%%%%%// Communication %%%%%
arduino = serial('COM6', 'BaudRate', 9600);
fopen(arduino);                          % open the serial port
fileID = fopen('Readings.txt', 'w');     % open the output text file
%%%%%// Reading from Serial %%%%%
Samples = 1000;                          % roughly 10 s of data, as described above
vib = [];                                % readings kept in memory (grows each loop)
for i = 1:Samples
    scan = fscanf(arduino, '%f');
    if isfloat(scan)
        vib = [vib; scan];               %#ok<AGROW>
        fprintf(fileID, '%0.3f\r\n', scan);
    end
end
fclose(fileID);                          % close the file and the port when done
fclose(arduino);
Any suggestions on improving this code? Will it have a time or size limit? The code is to be run for 3 days.
Do not use text files, use binary files. 42718123229.123123 is 18 bytes in ASCII, but only 8 bytes as a binary double (4 as a single). Don't waste space unnecessarily. If your data is going to be used later in MATLAB, then I suggest you simply save it in .mat files.
Do not use a single file! Choose a reasonable file size (e.g. 100 MB) and make sure that when you reach that much data you switch to another file. You could do this by, e.g., saving one file per hour. This way you minimize the data lost if the software crashes 2 minutes before finishing.
Now, knowing the real dimensions of your problem, writing a text file is totally fine; nothing special is required to process such small data. But there is a problem with your code: you are growing the variable vib over time. That may cause poor performance because you are not preallocating, and it may consume a lot of memory. I strongly recommend not keeping this variable in memory at all; if you need the data, read it back from the file afterwards.
Another thing you should consider is verification of your data. What do you do when you receive fewer samples than you expect? Include timestamps! Be aware that these timestamps are not precise, because you add them afterwards, but they allow you to identify whether just some random samples are missing (which may be interpolated afterwards) or a consecutive run of maybe 100 samples is missing.
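As a rough sketch of these points combined (the file naming and the timestamp layout are just assumptions, and 'arduino' is the serial object opened as in the question), the writing loop could look like this:
% Write each sample as binary doubles, prefixed with a timestamp, and start a
% new file every hour so a crash only loses the current hour.
fileID = fopen(sprintf('readings_%s.bin', datestr(now, 'yyyymmdd_HHMMSS')), 'w');
tRun   = tic;                                     % total acquisition time
tFile  = tic;                                     % age of the current file
while toc(tRun) < 3*24*3600                       % run for three days
    scan = fscanf(arduino, '%f');
    if isfloat(scan) && ~isempty(scan)
        fwrite(fileID, [now; scan], 'double');    % 8 bytes per value, timestamp first
    end
    if toc(tFile) > 3600                          % hourly file rotation
        fclose(fileID);
        fileID = fopen(sprintf('readings_%s.bin', datestr(now, 'yyyymmdd_HHMMSS')), 'w');
        tFile  = tic;
    end
end
fclose(fileID);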

'Out of memory' in MATLAB. A slow but permanent solution?

I am wondering whether my suggested solution to the 'Out of Memory' problem is impossible. Here is my suggestion:
The idea is to seamlessly save huge matrices (say BIG = rand(10^6)) to the HDD as a .mat (-v7.3) file when it is not possible to keep them in memory, and to read them back seamlessly whenever required. Then, when you want to use them like:
a = BIG(3678,2222);
s = size(BIG);
it would seamlessly do this behind the scenes:
m = matfile('BIG.mat');
a = m.BIG(3678,2222);
s = size(m,'BIG');
I know that speed is important, but suppose I have enough time but not enough memory. Also, it's better to write a memory-efficient program, but again suppose that I need to use someone else's function which can't be optimized. I do actually have some more related questions: Can this be implemented using objects? Or does it require an infrastructural change in MATLAB?
Seems to me like this is certainly possible, as this is essentially what many operating systems do in the form of paging.
Moreover, something similar is provided by MATLAB Distributed Computing Server. This allows you (among other things) to store the data for a single matrix on multiple machines, and access it seamlessly in the way that you propose.
IMHO, allowing data to be paged to file/swap should be a setting in MATLAB. Unfortunately, that is not how MATLAB's memory model works, and I suspect it is very difficult to implement on their side. Plus, with such a setting enabled, users would no longer be protected against silly mistakes like zeros(1e7) instead of zeros(1e7,1); it would simply seem to hang the system while MATLAB is busy filling your entire drive with zeros.
Anyway, I think it is possible using MATLAB classes, but I wouldn't recommend it. Note that implementing a proper subsref and subsasgn is *ahem* challenging; plus, you would likely have to re-implement many algorithms (like mldivide). This will most likely mean you lose a great deal of performance; think factors in the thousands.
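For illustration only, here is a minimal and deliberately incomplete sketch of such a class (file and variable names are assumptions; only plain '()' reads are handled, and assignment, 'end' indexing, and arithmetic are all missing):
% DiskMatrix: '()' indexing reads from a -v7.3 MAT-file instead of RAM.
classdef DiskMatrix
    properties (Access = private)
        m        % matfile handle
        varname  % name of the variable inside the MAT-file
    end
    methods
        function obj = DiskMatrix(fname, varname)
            obj.m       = matfile(fname);
            obj.varname = varname;
        end
        function out = subsref(obj, s)
            if strcmp(s(1).type, '()')
                % BIG(i,j) -> read only those elements from disk
                out = obj.m.(obj.varname)(s(1).subs{:});
            else
                out = builtin('subsref', obj, s);
            end
        end
        function sz = size(obj)
            sz = size(obj.m, obj.varname);
        end
    end
end
Usage would then be something like B = DiskMatrix('BIG.mat', 'BIG'); a = B(3678, 2222); — with everything else (assignment, mldivide, ...) still to be written, which is exactly where the effort and the slowdown come in.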
Here's an interesting random relevant paper I found while googling around a bit.
What could be a solution to your problem is memory-mapped I/O (which MATLAB supports).
There, a file is mapped into the address space, and all reads/writes to those addresses are actually reads/writes to the file.
This only reserves address space; it does not consume physical memory.
However, I would only advise it with 64-bit MATLAB, since with 32-bit MATLAB the address space is simply not large enough to hold your data, MATLAB's own code and DLLs, and the memory-mapped I/O.
Check out the examples on the documentation page of memmapfile(), e.g.,
m = memmapfile('records.dat', ...
               'Offset', 1024, ...
               'Format', {'uint32' [4 10 18] 'x'});
A = m.Data(1).x;
whos A
  Name      Size                Bytes  Class
  A         4x10x18              2880  uint32 array
Note that accesses to m.Data(1).x are redirected to file I/O, i.e. the array is not loaded into memory up front. So this provides efficient random access to parts of possibly very large data files residing on disk. Also note that more complex data structures, as in the example, can be realized.
It seems to me that this provides the "implementation with objects" you had in mind.
Unfortunately, this does not let you memory-map MAT-files directly, which would be really useful. This is probably difficult because of the compression.
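If the data starts life in MATLAB, a workaround (sketch; the file name and sizes are assumptions, and in practice you would build the file blockwise rather than hold the matrix in RAM) is to dump the matrix once as a flat binary file and map that instead of a MAT-file:
% One-time export: write the matrix as raw column-major doubles ...
BIG = rand(1e4);                       % stand-in for the real data
fid = fopen('BIG.dat', 'w');
fwrite(fid, BIG, 'double');
fclose(fid);
clear BIG

% ... then map it and index without loading the whole matrix into RAM.
m = memmapfile('BIG.dat', 'Format', {'double', [1e4 1e4], 'BIG'});
a = m.Data(1).BIG(3678, 2222);         % touches only the pages it needs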
Write a function:
function a = BIG(x, y)
    % Read only the requested elements of BIG from BIG.mat (saved with -v7.3)
    m = matfile('BIG.mat');
    a = m.BIG(x, y);
end
Every time you write BIG(a,b) the function is called (as long as no variable named BIG exists in the workspace to shadow it).
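A one-time setup along these lines (sketch; names assumed) makes that work:
% Save the matrix once in -v7.3 format so matfile() can index it lazily ...
BIG = rand(1e4);                       % stand-in for the real, huge matrix
save('BIG.mat', 'BIG', '-v7.3');
clear BIG                              % remove the in-memory copy

% ... afterwards this calls the BIG(x,y) function above and reads from disk.
a = BIG(3678, 2222);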

Alternatives to Matlab's Mat File Format

I'm finding that writing and reading the native MAT file format becomes very, very slow with larger data structures, around 1 GB in size. In addition, we have other, non-MATLAB software that should be able to read and write these files. So I would like to find an alternative format to use to serialize MATLAB data structures. Ideally this format would ...
be able to represent an arbitrary matlab structure to a file.
have faster I/O than MAT files.
have I/O libraries for other languages like Java, Python and C++.
Simplifying your data structures and using the new v7.3 MAT file format, which is a variant of HDF5, might actually be the best approach. The HDF5 format is open and already has I/O libraries for your other languages. And depending on your data structure, it may be faster than the old binary MAT formats.
Simplify the data structures you're saving, preferring large arrays of primitives to complex container structures.
Try turning off compression if your data structures are still complex.
Try the v7.3 MAT file format using "-v7.3"
If using a network file system, consider saving and loading to a temporary dir on a fast local drive and copying to/from the network
For large data structures, your MAT file I/O speed may be determined more by the internal structure of the data you're writing out than by the size of the resulting MAT file itself. (In my experience, this has usually been the major factor in slow MAT files.) When you say "arbitrary Matlab structure", that suggests you might be using cells, structs, or objects to build complex data structures. That slows down MAT I/O because there is per-array overhead in MAT file I/O, and the members of cell and struct arrays (container types) all count as separate arrays. For example, 5,000 strings stored in a cellstr are much, much slower to save than the same 5,000 strings stored in a 2-D char array. And objects have even more overhead. As a test, try writing out a 1 GB file that contains just a 1 GB primitive array of random uint8s, and see how long that takes. From there, see if you can simplify your data to reduce the total mxArray count, even if that means reshaping it for serialization. (My experience with this is mostly with the v7 format; the newer HDF5-based format may have less per-element overhead.)
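A quick, rough version of that test (file names made up) might be:
% ~1 GB of primitive data saved as one array vs. as 65536 small cell elements;
% the second save is expected to be much slower due to per-array overhead.
raw = randi(255, 2^30, 1, 'uint8');
tic; save('raw_v73.mat', 'raw', '-v7.3'); toc

chunks = mat2cell(raw, repmat(2^14, 2^16, 1), 1);   % same bytes, 65536 mxArrays
tic; save('chunks_v73.mat', 'chunks', '-v7.3'); toc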
If your data files live on the network, you could also try doing the save and load operations on temporary files on fast local drives, and separately using copy operations to move them back and forth between the network. At least on Windows networks, I've seen speedups of up to 2x from doing this. Possibly due to optimizations the full-file copy operation can do that the MAT I/O code can't.
It would probably be a substantial effort to come up with an alternate file format that supported fully arbitrary Matlab data structures and was portable to other languages. I'd try making smaller changes around your use of the existing format first.
The MAT format has changed across MATLAB versions. v7.3 uses the HDF5 format, which has built-in compression and other features, and it can take a long time to read/write. However, you can force MATLAB to use the previous formats, which are faster (but might take more space).
See here:
http://www.mathworks.com/help/matlab/import_export/mat-file-versions.html

Need a method to store a lot of data in Matlab

I've asked this before, but I feel I wasn't clear enough so I'll try again.
I am running a network simulation, and I have several hundred output files. Each file holds the simulation's test results for a different set of parameters.
There are 5 different parameters and 16 different tests for each simulation. I need a method to store all this information (and again, there's a lot of it) in MATLAB so that I can plot graphs from it using a script. Suppose the script input is parameter_1 and test_2; then I get a graph where parameter_1 is the X axis and test_2 is the Y axis.
My problem is that I'm not very familiar with MATLAB, and I need to be pointed in the right direction so it doesn't take me forever (I'm short on time).
How do I store this information in Matlab? I was thinking of two options:
Each output file is imported separately to a different variable (matrix)
All output files are merged into one file and imported together. In the resulting matrix each row is a different output file, and each column is a different test. Problem is, I don't know how to store the simulation parameters.
Edit: maybe I can use a dataset?
So, I would appreciate any suggestions on how to store the information, and on which functions might help me fetch only the data I need.
If you're still looking to give matlab a try with this problem, you can iterate through all the files and import them one by one. You can create a list of the contents of a folder with the function
ls(name)
and you can import data like this:
A = importdata(filename)
If your data is in .txt files, you should consider this previous question.
A good strategy to avoid cluttering your workspace is to import everything into a single cell array. So if you have a cell array called VAR, then VAR{1,1}{1,1} could be where you put the test results of the first file and VAR{1,1}{2,1} could be where you put its simulation parameters (as sketched below). I think that is simpler than making a dedicated data structure. Just make sure you place the information uniformly in the same indices for every file. You could also organize VAR with rows by parameter and columns by test.
This is more along the lines of your first suggestion:
"Each output file is imported separately to a different variable (matrix)"
Your second suggestion seems unnecessary since you can just iterate through your files.
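A rough sketch of that loop (the folder name, file pattern, and the way parameters are encoded in the file name are all assumptions; dir is used here instead of ls because it returns a struct that is easier to loop over):
% Import every output file into one cell array: VAR{k}{1} holds the 16 test
% results of file k, VAR{k}{2} holds its 5 simulation parameters.
files = dir(fullfile('results', '*.txt'));
VAR   = cell(numel(files), 1);
for k = 1:numel(files)
    VAR{k}{1} = importdata(fullfile('results', files(k).name));    % test results
    VAR{k}{2} = sscanf(files(k).name, 'run_%d_%d_%d_%d_%d');       % hypothetical naming
end

% Plot parameter_1 (X axis) against test_2 (Y axis) across all files
p1 = cellfun(@(v) v{2}(1), VAR);
t2 = cellfun(@(v) v{1}(2), VAR);
plot(p1, t2, 'o');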
You can use the command save to store your data.
It is very convenient, and can store as much data as your hard disk can bear.
The documentation is here:
http://www.mathworks.fr/help/techdoc/ref/save.html
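For example (variable and file names assumed):
% Persist whatever holds your imported results, then reload it in the
% plotting script.
results = rand(16, 500);                        % stand-in for your imported data
save('simulation_results.mat', 'results');      % add '-v7.3' if it exceeds 2 GB
S = load('simulation_results.mat');             % S.results holds the data again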
Describe the format of your text files. If they have a systematic format, then you can use dlmread or similar commands in MATLAB to read a text file into a matrix. From there, you can plot easily. If you try to do it in Excel, it will be much slower than reading from a text file, so if speed is an issue for you, I suggest you don't go with Excel.
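For example, if the files turn out to be plain delimited numbers (file name and delimiter assumed here):
% Read one numeric, tab-delimited output file into a matrix and plot two
% of its columns against each other.
M = dlmread('output_01.txt', '\t');
plot(M(:, 1), M(:, 2));
xlabel('parameter\_1'); ylabel('test\_2');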