How should I store my large MATLAB data files during analysis? - matlab

I am having issues with 'data overload' while processing point cloud data in MATLAB. This is what I am currently doing:
I begin with my raw data files, each on the order of ~30 MB.
I then do initial processing on them to extract n individual objects and remove outlying points, which are all combined into a 1 x n structure, testset, saved into testset.mat (~100 MB).
So far so good. Now things become complicated:
For each point in each object in testset, I will compute one of a number of features, which ends up being a matrix of some size (for each point). The size of the matrix, and some other properties of the computation, are parameters of the calculations. I save these computed features in a 1 x n cell array, each cell of which contains an array of the matrices for each point.
I then save this cell array in a .mat file, where the name specifies the parameters, the name of the test data used and the types of features extracted. For example:
testset_feature_type_A_5x5_0.2x0.2_alpha_3_beta_4.mat
Now for each of these files, I then do some further processing (using a classification algorithm). Again there are more parameters to set.
So now I am in a tricky situation, where each final piece of the initial data has come through some path, but the path taken (and the parameters set along that path) are not intrinsically held with the data itself.
So my question is:
Is there a better way to do this? Can anyone who has experience in working with large datasets in MATLAB suggest a way to store the data and the parameter settings more efficiently, and more integrally?
Ideally, I would be able to look up a certain piece of data without having to use regex on the file strings, but there is also an incentive to keep individually processed files separate to save system memory when loading them in (and to help prevent corruption).
The time taken for each calculation (some ~2 hours) prohibits computing data 'on the fly'.

For a similar problem, I have created a class structure that does the following:
Each object is linked to a raw data file
For each processing step, there is a property
The set method of the properties saves the data to file (in a directory with the same name as the raw data file), stores the file name, and updates a "status" property to indicate that this step is done.
The get method of the properties loads the data if the file name has been stored and the status indicates "done".
Finally, the objects can be saved/loaded, so that I can do some processing now, save the object, later load it and I immediately know how far along the particular data set is in the processing pipeline.
Thus, the only data in memory is the data that is currently being worked on, and you can easily know which data set is at which processing stage. Furthermore, if you set up your methods to accept arrays of objects, you can do very convenient batch processing.
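The class structure described above could look roughly like the following minimal sketch. The class name PipelineItem, the property names, and the file name features.mat are all hypothetical, and only a single processing step is shown; a real pipeline would repeat the pattern for each step.

```matlab
% Hypothetical sketch; save as PipelineItem.m. All names are illustrative.
classdef PipelineItem < handle
    properties
        rawFile                     % path to the raw data file
        featureFile = ''            % where the computed features were saved
        featureStatus = 'pending'   % becomes 'done' once the step has run
    end
    properties (Dependent)
        features                    % kept on disk, loaded only on request
    end
    methods
        function obj = PipelineItem(rawFile)
            obj.rawFile = rawFile;
        end
        function set.features(obj, value)
            % Save into a directory named after the raw data file.
            [folder, name] = fileparts(obj.rawFile);
            outDir = fullfile(folder, name);
            if ~exist(outDir, 'dir')
                mkdir(outDir);
            end
            obj.featureFile = fullfile(outDir, 'features.mat');
            features = value; %#ok<NASGU>
            save(obj.featureFile, 'features');
            obj.featureStatus = 'done';
        end
        function value = get.features(obj)
            % Load from disk only if this step has been completed.
            value = [];
            if strcmp(obj.featureStatus, 'done')
                s = load(obj.featureFile, 'features');
                value = s.features;
            end
        end
    end
end
```

With this sketch, assigning to item.features writes the data to disk and reading item.features loads it back, so the object itself stays small and can be saved and reloaded with save and load to record how far the data set has progressed.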

I'm not completely sure if this is what you need, but the save command allows you to store multiple variables inside a single .mat file. If your parameter settings are, for example, stored in an array, then you can save this together with the data set in a single .mat file. Upon loading the file, both the dataset and the array with parameters are restored.
Or do you want to be able to load the parameters without loading the file? Then I would personally opt for the cheap solution of having a second set of files with just the parameters (but similar filenames).
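As a minimal sketch of the single-file approach, the parameters can be kept in a struct and saved next to the data. The field names, the placeholder data, and the file name below are assumptions made for illustration. Note that load accepts a list of variable names, so the parameters can be read back without pulling the large data variable into memory.

```matlab
% Hypothetical parameter set and placeholder data.
params = struct('featureType', 'A', 'windowSize', [5 5], 'alpha', 3, 'beta', 4);
features = {rand(5, 5, 10), rand(5, 5, 20)};

% Store data and parameters together in one .mat file.
outFile = fullfile(tempdir, 'testset_features_demo.mat');
save(outFile, 'features', 'params');

% Read back only the parameters; 'features' is not loaded.
s = load(outFile, 'params');
disp(s.params.alpha)
```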

Related

How to use VTK to efficiently write time-varying field data on a fixed mesh?

I am working on physics simulation research. I have a large fixed grid in one of my projects that does not vary with time. The fields on the grid, on the other hand, vary with time in the simulation. I need to use VTK to record the field data in each step for visualization (Paraview).
The method I am using is to write a separate *.vtu file to disk at each time step. This basically serves the purpose, but actually writes a lot of duplicate data (re-recording the geometry of the mesh at each step), which not only consumes more disk space, but also wastes time on encoding and parsing.
I would like to have a way to write the mesh information only once, and the rest of the time only new field data is written, while being able to guarantee the same visualization. Please let me know if VTK and Paraview provide such an interface and how to implement it.
Using a .pvtu file and referring to the same .vtu as the Piece for each step should do the trick.
See this similar post on the ParaView discourse, and the pvtu doc
EDIT
This seems to be a side effect of the format; it is not supported by the writer.
The correct solution is to use another file format ...
Let me provide my own research findings for reference.
As Nico said, with the combination of pvtu/vtu files, we could theoretically implement a geometry structure stored in a separate vtu file, referenced by a pvtu file. Setting the NumberOfPieces attribute of the pvtu file to 1 would enable the construction of only one separate vtu file.
However, the VTK library does not expose a dedicated operation interface to control the writing process of vtu files. No matter how it is set, as long as the writer's input contains geometry structures, the writer will write geometry information to disk, and this process cannot be skipped through the exposed interface.
However, it is indeed possible to make multiple pvtu files point to the same vtu file by manually editing the piece node in the pvtu file, and ParaView can recognize and visualize such a file group properly.
I did not proceed to try adding arrays to the unstructured grid and using pvtu output.
So, I think the conclusion is:
If you don't want to dive into VTK's library code and XML implementation, then this approach doesn't make sense.
If you are willing to write a series of files, delete most of the vtu files, and then point the piece node of every pvtu file to the only surviving vtu file by editing the pvtu files, you can save a lot of disk space, but you will not shorten the write, read, and parse times.
If you implement an XML writer yourself, you can in theory meet all the requirements, but it requires a lot of coding work.

KubeFlow, handling large dynamic arrays and ParallelFor with current size limitations

I've been struggling to find a good solution for this matter for the past day and would like to hear your thoughts.
I have a pipeline which receives a large & dynamic JSON array (containing only stringified objects),
I need to be able to create a ContainerOp for each entry in that array (using dsl.ParallelFor).
This works fine for small inputs.
Right now the array comes in as a file http url due to pipeline input arguments size limitations of Argo and Kubernetes (or that is what I understood from the current open issues). However, when I try to read the file in one Op to use it as input for the ParallelFor, I hit the output size limitation.
What would be a good & reusable solution for such a scenario?
Thanks!
the array comes in as a file http url due to pipeline input arguments size limitations of Argo and Kubernetes
Usually the external data is first imported into the pipeline (downloaded and output). Then the components use inputPath and outputPath to pass big data pieces as files.
The size limitation applies only to data that you consume as a value using inputValue, not to data that you pass as a file.
The loops consume the data by value, so the size limit applies to them.
What you can do is make this data smaller. For example if your data is a JSON list of big objects [{obj1}, {obj2}, ... , {objN}], you can transform it to list of indexes [1, 2, ... , N], pass that list to the loop and then inside the loop you can have a component that uses the index and the data to select a single piece to work on N ->{objN}.

Working with many inputs (Matlab)

I'm new to Matlab and I need some suggestions on how to deal with having many inputs to a function.
The program reads data from multiple elements and stores them in an array, which I'm doing in a loop. The problem is that if I input the wrong information about one element, I must re-input the data all over again. I believe that there must exist a better way to input these data, like reading it from an external file, for example.
The problem with an external file would be, as far as I know, reading multiple arrays from a single file, hence the need for multiple external files. I believe there must be a better way for that too.
As noted by @beaker, you can use save and load to store the data. You can store multiple variables in a given file without a problem.
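A minimal sketch of that workflow follows. The variable names, values, and file name are made up for illustration. Several arrays go into one .mat file, and a single wrong entry can be corrected later without re-entering everything.

```matlab
% Hypothetical element data; the variable names are placeholders.
lengths = [1.2 0.8 2.5];
areas = [0.01 0.02 0.015];
materials = {'steel', 'steel', 'aluminium'};

% Store all arrays in a single .mat file.
inputFile = fullfile(tempdir, 'elements_demo.mat');
save(inputFile, 'lengths', 'areas', 'materials');

% Later: reload everything, fix one wrong entry, and save again.
s = load(inputFile);
s.areas(2) = 0.025;
save(inputFile, '-struct', 's');   % writes each field back as its own variable
```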

Clear all application data from a MATLAB figure

I'd like to clear all application data from a single figure, without using the names of individual application data variables.
Is there any function in MATLAB that will do the above?
No, you can't do this in a simple way.
The application data for a figure is used to store lots of things by MATLAB itself (such as the zoom and pan status of the figure), not just things that you set yourself - so just "removing" it all is a bad idea.
You can get the full set of application data using getappdata(f), where f is the handle to the figure (as opposed to the more usual getappdata(f, 'varname'), which would get a specific variable that you'd stored in the application data).
The result is a structure, and you can then go through the field names and delete anything you've stored.
To make this easier, you can use a consistent prefix for the names of any variables you store. Then just go through the field names and call rmappdata for any field that starts with your prefix.
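The prefix approach could be sketched as follows. The prefix myapp_ and the stored variable names are assumptions chosen for the example; anything MATLAB itself has stored on the figure is left untouched.

```matlab
% Create a figure and store some application data under a common prefix.
f = figure('Visible', 'off');
setappdata(f, 'myapp_counter', 1);
setappdata(f, 'myapp_state', 'idle');

% Remove every application data entry whose name starts with the prefix.
prefix = 'myapp_';
names = fieldnames(getappdata(f));
for k = 1:numel(names)
    if strncmp(names{k}, prefix, numel(prefix))
        rmappdata(f, names{k});
    end
end
```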

MATLAB: How to create multiple mapped memory files with a simple "iterator"?

I have files (>100) that each contain recorded sets of data like this:
file0: [no. of data sets in file, no. of data points for recording1, related data to recording1, no. of data points for recording2, related data to recording2, ... , no. of data points for recordingM, related data to recordingM]
file1: [no. of data sets in file, ...] (same as above)
All of the data together may exceed 20 GB, so loading all of it into memory is not an option. Hence, I would like to create memory-mapped files for each of the files BUT hiding from the "user" the complexity of the underlying data, e.g., I would like to be able to operate on the data like this:
for i=1:TotalNumberOfRecordings
recording(i) = recording(i) * 10; % some stupid data operation
% or even more advanced better:
recording(i).relatedData = 2000;
end
So, no matter if recording(i) is in file0, file1, or some other file, and no matter its position within the file, I have a list that allows me to access the related data via a memory map.
What I have so far, is a list of all files within a certain directory, my idea now was to simply create a list like this:
entry1: [memoryMappedFileHandle, dataRangeOfRecording]
entry2: [memoryMappedFileHandle, dataRangeOfRecording]
And then use this list to further abstract files and recordings. I started with this code:
fileList = getAllFiles(directoryName);
list = []; n = 0;
for file = 1:length(fileList);
m = memmapfile(fileList(file));
for numberOfTracesInFile
n = n+1;
list = [list; [n, m]];
end
end
But I do get the error:
Memmapfile objects cannot be concatenated
I'm quite new to MATLAB so this is probably a bad idea after all. How to do it better? Is it possible to create a memorymapped table that contains multiple files?
I'm not sure whether the core of your question is specifically about memory-mapped files, or about whether there is a way to seamlessly process data from multiple large files without the user needing to bother with the details of where the data is.
To address the second question, MATLAB R2014b introduced a new datastore object that is designed to do pretty much this. Essentially, you create a datastore object that refers to your files, and you can then pull data from the datastore without needing to worry about which file it's in. datastore is also designed to work very closely with the new mapreduce functionality that was introduced at the same time, which allows you to easily parallelize map-reduce programming patterns, and even tie in with Hadoop.
To answer the first question - I'm afraid I think you've found your answer, which is that memmapfile objects can not be concatenated, so no, not straightforward. I think your best approach would be to build your own class, which would contain multiple memmapfile objects in a cell array along with information about which data was in which file, along with some sort of getData method that would retrieve the appropriate data from the appropriate file. (This would be basically like writing your own datastore class, but which worked with memory-mapped files rather than files, so you might be able to copy much of the design and/or implementation details from datastore itself).
As Horchler said, you could put the memmapfile objects in a cell array:
list = cell(1,10); % preallocate the cell array
for it = 1:10
    % memmapfile objects cannot be concatenated, but a cell array can hold them
    memmapfile_object = memmapfile('/path/to/file');
    list{it} = memmapfile_object;
end
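Combining the cell array of memmapfile objects with an index of where each recording lives could look like the sketch below. It assumes that every value in the files is stored as a double, which the question does not state, and it writes two tiny demo files so that it can run on its own. The file names and variable names are hypothetical.

```matlab
% Write two small demo files in the layout [M, n1, data1, n2, data2, ...],
% assuming every value is stored as a double (an assumption for this sketch).
files = {fullfile(tempdir, 'rec_demo0.bin'), fullfile(tempdir, 'rec_demo1.bin')};
contents = {[2, 3, 1 2 3, 2, 4 5], [1, 4, 6 7 8 9]};
for k = 1:numel(files)
    fid = fopen(files{k}, 'w');
    fwrite(fid, contents{k}, 'double');
    fclose(fid);
end

% Build one map per file plus an index: [file number, first element, count].
maps = cell(1, numel(files));
index = zeros(0, 3);
for k = 1:numel(files)
    maps{k} = memmapfile(files{k}, 'Format', 'double');
    pos = 2;                                % element 1 holds the recording count
    for r = 1:maps{k}.Data(1)
        n = maps{k}.Data(pos);              % number of points in this recording
        index(end+1, :) = [k, pos + 1, n];  %#ok<AGROW>
        pos = pos + 1 + n;
    end
end

% Look up a recording by its global number, whichever file it lives in.
recNo = 3;
m = maps{index(recNo, 1)};
rec = m.Data(index(recNo, 2) : index(recNo, 2) + index(recNo, 3) - 1);
```

Here the third recording overall is the first one in the second file. Wrapping the maps, the index, and the lookup in a class with a getData method would hide these details from the user, as suggested above.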