MATLAB: How to create multiple memory-mapped files with a simple "iterator"?

I have files (>100) that each contain recorded sets of data like this:
file0: [no. of data sets in file, no. of data points for recording1, related data to recording1, no. of data points for recording2, related data to recording2, ... , no. of data points for recordingM, related data to recordingM]
file1: [no. of data sets in file, ...] (same as above)
All of the data together may exceed 20 GB, so loading all of it into memory is not an option. Hence, I would like to create memory-mapped files for each of the files BUT hiding from the "user" the complexity of the underlying data, e.g., I would like to be able to operate on the data like this:
for i = 1:TotalNumberOfRecordings
    recording(i) = recording(i) * 10; % some stupid data operation
    % or, even better:
    recording(i).relatedData = 2000;
end
So, no matter whether recording(i) is in file0, file1, or some other file, and no matter its position within the file, I have a list that allows me to access the related data via a memory map.
What I have so far is a list of all files within a certain directory. My idea was to simply create a list like this:
entry1: [memoryMappedFileHandle, dataRangeOfRecording]
entry2: [memoryMappedFileHandle, dataRangeOfRecording]
And then use this list to further abstract files and recordings. I started with this code:
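fileList = getAllFiles(directoryName);
list = []; n = 0;
for file = 1:length(fileList)
    m = memmapfile(fileList(file));
    for trace = 1:numberOfTracesInFile % placeholder: number of recordings in this file
        n = n + 1;
        list = [list; [n, m]]; % this concatenation raises the error below
    end
end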
But I do get the error:
Memmapfile objects cannot be concatenated
I'm quite new to MATLAB, so this is probably a bad idea after all. How can I do it better? Is it possible to create a memory-mapped table that spans multiple files?

I'm not sure whether the core of your question is specifically about memory-mapped files, or about whether there is a way to seamlessly process data from multiple large files without the user needing to bother with the details of where the data is.
To address the second question, MATLAB R2014b introduced a new datastore object that is designed to do pretty much this. Essentially, you create a datastore object that refers to your files, and you can then pull data from the datastore without needing to worry about which file it's in. datastore is also designed to work very closely with the new mapreduce functionality that was introduced at the same time, which allows you to easily parallelize map-reduce programming patterns, and even tie in with Hadoop.
To answer the first question: I'm afraid you've already found your answer, which is that memmapfile objects cannot be concatenated, so no, it's not straightforward. I think your best approach would be to build your own class, which would hold multiple memmapfile objects in a cell array, together with information about which data is in which file and some sort of getData method that retrieves the appropriate data from the appropriate file. (This would basically be like writing your own datastore class, but one that works with memory-mapped files rather than ordinary files, so you might be able to copy much of the design and/or implementation details from datastore itself.)
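As a rough illustration of the "cell array plus index" idea, here is a minimal script-level sketch rather than a full class. It assumes that every value in the files is stored as a double (the question does not state the element type) and that the layout is exactly [number of sets, n1, data1, n2, data2, ...]. The demo files, the index layout, and the getRecording helper are all hypothetical names made up for this example.

```matlab
% Demo input: two small files in the layout described in the question,
% written as doubles (the element type is an assumption).
fileList = {[tempname '.bin'], [tempname '.bin']};
contents = {[2 3 10 20 30 2 7 8], [1 4 1 2 3 4]};
for f = 1:numel(fileList)
    fid = fopen(fileList{f}, 'w');
    fwrite(fid, contents{f}, 'double');
    fclose(fid);
end

% One memmapfile object per file, kept in a cell array because the
% objects cannot be concatenated into an ordinary array.
maps  = cell(numel(fileList), 1);
index = zeros(0, 3);   % one row per recording: [fileNo, firstElement, count]
for f = 1:numel(fileList)
    m = memmapfile(fileList{f}, 'Format', 'double');
    numSets = m.Data(1);            % first value: number of recordings
    pos = 2;                        % position of the first length field
    for r = 1:numSets
        count = m.Data(pos);        % number of data points in recording r
        index(end+1, :) = [f, pos + 1, count]; %#ok<SAGROW>
        pos = pos + 1 + count;      % jump to the next length field
    end
    maps{f} = m;
end

% Hypothetical accessor: recording i, regardless of which file holds it.
getRecording = @(i) maps{index(i,1)}.Data(index(i,2) : index(i,2) + index(i,3) - 1);

% Iterate over all recordings without caring about the file boundaries.
for i = 1:size(index, 1)
    scaled = getRecording(i) * 10;  % only this recording is read into memory
end
```

Only the small index matrix lives in memory; the data itself is read on demand through the maps. Wrapping maps, index, and the accessor in a class would give the recording(i) syntax from the question.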

As Horchler said, you could put the memmapfile objects in a cell array:
list = cell(1,10); % preallocate the cell array
for it = 1:10
    memmapfile_object = memmapfile('/path/to/file');
    list{it} = memmapfile_object;
end

Related

KubeFlow, handling large dynamic arrays and ParallelFor with current size limitations

I've been struggling to find a good solution to this problem for the past day and would like to hear your thoughts.
I have a pipeline which receives a large & dynamic JSON array (containing only stringified objects),
I need to be able to create a ContainerOp for each entry in that array (using dsl.ParallelFor).
This works fine for small inputs.
Right now the array comes in as a file http url due to the pipeline input argument size limitations of argo and Kubernetes (or that is what I understood from the current open issues), but when I try to read the file from one Op to use as input for the ParallelFor, I run into the output size limitation.
What would be a good & reusable solution for such a scenario?
Thanks!
"the array comes in as a file http url due to pipeline input arguments size limitations of argo and Kubernetes"
Usually the external data is first imported into the pipeline (downloaded and output). Then the components use inputPath and outputPath to pass big data pieces as files.
The size limitation only applies for the data that you consume as value instead of file using inputValue.
The loops consume the data by value, so the size limit applies to them.
What you can do is make this data smaller. For example, if your data is a JSON list of big objects [{obj1}, {obj2}, ... , {objN}], you can transform it into a list of indexes [1, 2, ... , N] and pass that list to the loop. Inside the loop, a component then uses the index together with the data file to select the single piece to work on: N -> {objN}.

Working with many inputs (Matlab)

I'm new to Matlab and I need some suggestions on how to deal with having many inputs to a function.
The program reads data for multiple elements and stores them in an array, which I'm doing in a loop. The problem is that if I enter the wrong information for one element, I must re-enter all of the data. I believe there must be a better way to input these data, such as reading them from an external file.
The problem with an external file, as far as I know, is reading multiple arrays from a single file, hence the need for multiple external files. I believe there must be a better way to do that as well.
As noted by @beaker, you can use save and load to store the data. You can store multiple variables in a given file without a problem.
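A minimal sketch of that approach follows. The variable names (elements, loads) and the file name are invented for the example; the point is only that several variables go into one .mat file and a single wrong entry can be corrected without re-entering everything.

```matlab
% Hypothetical input data: a struct array of elements and a load matrix.
elements = struct('name', {'beam1', 'beam2'}, 'length', {2.5, 4.0});
loads    = [10 20; 30 40];

% Store both variables in one MAT-file.
dataFile = fullfile(tempdir, 'inputs_demo.mat');
save(dataFile, 'elements', 'loads');
clear elements loads

% Later: load everything back into a struct with one field per variable.
S = load(dataFile);

% Fix a single wrong entry and write the corrected data back.
S.elements(2).length = 4.2;
elements = S.elements;
loads    = S.loads;
save(dataFile, 'elements', 'loads');
```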

Create a trie in perl

The objective is to create a trie data structure. I have seen Tree::Trie and used it. It converts the data into a trie structure only after the file (database) is read, so processing is slow: each time a lookup is needed, the entire data set is converted into a trie again.
Is there a way to create the trie once and then reuse that trie structure for lookups?
If you look again at the synopsis in the Tree::Trie documentation that you linked to, you will see that the trie is only created once (my($trie) = new Tree::Trie; - although this should be written as my $trie = Tree::Trie->new; rather than using indirect object notation) and data is only added to it once ($trie->add(...);), and then the trie is used for multiple lookups (my(@all) = $trie->lookup(""); and my(@ms) = $trie->lookup("m");).
The way to create the trie once and then use it for lookups, then, is to simply keep the $trie variable around (in scope) and use it for all your lookups instead of creating new Tree::Trie instances each time.
If this answer isn't useful to you, please update your question to include a small, self-contained, runnable example program showing how you're using Tree::Trie and we can show you how to modify it so that the trie only gets built once.

Clear all application data from a MATLAB figure

I'd like to clear all application data from a single figure, without using the names of individual application data variables.
Is there any function in MATLAB that will do the above?
No, you can't do this in a simple way.
The application data for a figure is used to store lots of things by MATLAB itself (such as the zoom and pan status of the figure), not just things that you set yourself - so just "removing" it all is a bad idea.
You can get the full set of application data using getappdata(f), where f is the handle to the figure (as opposed to the more usual getappdata(f, 'varname'), which would get a specific variable that you'd stored in the application data).
The result is a structure, and you can then go through the field names and delete anything you've stored.
To make this easier, you can use a consistent prefix for the names of any variables you store. Then just go through the field names and call rmappdata for any field that starts with your prefix.
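A short sketch of the prefix approach, assuming you consistently store your own values under a prefix such as 'myapp_' (the prefix and the variable names are made up for this example):

```matlab
% Create a figure and store some application data under a common prefix.
f = figure('Visible', 'off');
setappdata(f, 'myapp_counter', 1);
setappdata(f, 'myapp_state', 'idle');

% Remove only the entries whose names start with the prefix, leaving any
% application data that MATLAB itself may have stored untouched.
prefix = 'myapp_';
names  = fieldnames(getappdata(f));
for k = 1:numel(names)
    if strncmp(names{k}, prefix, numel(prefix))
        rmappdata(f, names{k});
    end
end
```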

How should I store my large MATLAB data files during analysis?

I am having issues with 'data overload' while processing point cloud data in MATLAB. This is what I am currently doing:
I begin with my raw data files, each on the order of ~30 MB.
I then do initial processing on them to extract n individual objects and remove outlying points, which are all combined into a 1 x n structure, testset, saved into testset.mat (~100 MB).
So far so good. Now things become complicated:
For each point in each object in testset, I will compute one of a number of features, which ends up being a matrix of some size (for each point). The size of the matrix, and some other properties of the computation, are parameters of the calculations. I save these computed features in a 1 x n cell array, each cell of which contains an array of the matrices for each point.
I then save this cell array in a .mat file, where the name specifies the parameters, the name of the test data used, and the types of features extracted. For example:
testset_feature_type_A_5x5_0.2x0.2_alpha_3_beta_4.mat
Now for each of these files, I then do some further processing (using a classification algorithm). Again there are more parameters to set.
So now I am in a tricky situation, where each final piece of data has come from the initial data through some path, but the path taken (and the parameters set along that path) is not intrinsically held with the data itself.
So my question is:
Is there a better way to do this? Can anyone who has experience in working with large datasets in MATLAB suggest a way to store the data and the parameter settings more efficiently, and more integrally?
Ideally, I would be able to look up a certain piece of data without having to use regex on the file strings—but there is also an incentive to keep individually processed files separate to save system memory when loading them in (and to help prevent corruption).
The time taken for each calculation (some ~2 hours) prohibits computing data 'on the fly'.
For a similar problem, I have created a class structure that does the following:
Each object is linked to a raw data file.
For each processing step, there is a property.
The set method of each property saves the data to file (in a directory with the same name as the raw data file), stores the file name, and updates a "status" property to indicate that this step is done.
The get method of each property loads the data if the file name has been stored and the status indicates "done".
Finally, the objects can be saved/loaded, so that I can do some processing now, save the object, load it later, and immediately know how far along the particular data set is in the processing pipeline.
Thus, the only data in memory is the data that is currently being worked on, and you can easily know which data set is at which processing stage. Furthermore, if you set up your methods to accept arrays of objects, you can do very convenient batch processing.
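The design described above might look roughly like the following. This is only a sketch with a single processing step; the class name ProcessedRecord, the property names, and the file name features.mat are invented for illustration and are not the original author's code. It would be saved as ProcessedRecord.m.

```matlab
classdef ProcessedRecord < handle
    % Sketch: one object per raw data file; processed data lives on disk.
    properties
        rawFile                              % path to the raw data file
        featureFile = ''                     % where the 'features' step was saved
        status = struct('features', false)   % which steps are done
    end
    properties (Dependent)
        features                             % loaded from disk on access
    end
    methods
        function obj = ProcessedRecord(rawFile)
            obj.rawFile = rawFile;
        end

        function set.features(obj, value)
            % Save into a directory named after the raw file (extension dropped).
            [folder, name] = fileparts(obj.rawFile);
            outDir = fullfile(folder, name);
            if ~exist(outDir, 'dir')
                mkdir(outDir);
            end
            obj.featureFile = fullfile(outDir, 'features.mat');
            save(obj.featureFile, 'value');
            obj.status.features = true;      % mark this step as done
        end

        function value = get.features(obj)
            % Load from disk only if the step has been completed.
            value = [];
            if obj.status.features
                S = load(obj.featureFile);
                value = S.value;
            end
        end
    end
end
```

With this in place, rec = ProcessedRecord('scan01.bin'); rec.features = computedFeatures; writes the result to disk, and a later rec.features reads it back, so the object itself stays small enough to save and reload freely.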
I'm not completely sure if this is what you need, but the save command allows you to store multiple variables inside a single .mat file. If your parameter settings are, for example, stored in an array, then you can save this together with the data set in a single .mat file. Upon loading the file, both the dataset and the array with parameters are restored.
Or do you want to be able to load the parameters without loading the file? Then I would personally opt for the cheap solution of having a second set of files with just the parameters (but similar filenames).
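Note that load accepts a list of variable names, so parameters stored in the same .mat file can be read without loading the large data variable. A small sketch, with made-up parameter names and a made-up file name:

```matlab
% Hypothetical parameter set and (small stand-in for) computed features.
params   = struct('featureType', 'A', 'window', [5 5], 'alpha', 3, 'beta', 4);
features = {rand(5), rand(5)};

% Store the parameters alongside the data they produced.
resultFile = fullfile(tempdir, 'testset_features_demo.mat');
save(resultFile, 'features', 'params');

% Later: read only the parameters to decide whether this file is wanted.
P = load(resultFile, 'params');
if P.params.alpha == 3 && P.params.beta == 4
    D = load(resultFile, 'features');   % load the heavy part only when needed
end
```

This keeps the parameters attached to the data, so looking up a result no longer depends on parsing the file name.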