I have an issue with saving and loading a huge dataset in Matlab.
My dataset contains properties of series of images using Matlab's regionprops.
I currently have a MAT-file of about 21GB and this takes a while to load.
This MAT-file has one cell array containing structure arrays of the properties of ellipses on each slice.
Are they any suggestions as to how to go around this?
Is there any better and efficient way of saving MAT-files than the -v7.3 formats?
One solution could be to use the 'table' argument to regionprops. This causes the output to be a table rather than a struct array. This format is more efficient for storage than the struct array.
Better yet, if you don't mind manually keeping track of what data is where, is to create a numeric array with the relevant data:
BW = imread('text.png'); % Example image used in the docs
s = regionprops(BW,{'MajorAxisLength','MinorAxisLength','Orientation'});
t = regionprops('table',BW,{'MajorAxisLength','MinorAxisLength','Orientation'});
m = [s.MajorAxisLength; s.MinorAxisLength; s.Orientation];
whos
Name Size Bytes Class Attributes
BW 256x256 65536 logical
m 3x88 2112 double
s 88x1 31872 struct
t 88x3 3496 table
A numeric array is a much more efficient way of storing data than a struct array, because each element in the struct array is a separate matrix, which needs its own header. The header (114 bytes I believe) in this case is far larger than the value stored in the array (8 bytes in this case), hence the overhead of 31872 / 2112 = 15.1.
The table stores each column in a separate array, so there you have a much smaller overhead. Instead of having 3 x 88 (number of features x number of objects) arrays, you have only 3.
If each image is guaranteed to have the same number of objects, you could consider putting these matrices into a single 3D array instead of a cell array. The gain here would be smaller.
Related
Up until recently, I have been storing time series data in structs in MATLAB by placing the index after the field name, e.g.:
Structure.fieldA(1) = 23423
So, the struct has a set of fields, and each field is a vector.
I've seen a lot of other programs use a different format, where the structure itself is indexed, and each index of the structure contains a set of fields, e.g.:
Structure(1).fieldA
Which is the most efficient method? Should I stick with the top option or should I be switching my programs across to using the bottom method?
A struct where each field is an array is more performant since you have fewer data elements (one array per field) whereas a struct array has more flexibility at the cost of performance and memory usage (on element per struct per field).
From MATLAB's own documentation
Structures require a similar amount of overhead per field. Structures with many fields and small contents have a large overhead and should be avoided. A large array of structures with numeric scalar fields requires much more memory than a structure with fields containing large numeric arrays.
We can check the memory usage with a simple example
S = struct('field1', {1, 2}, 'field2', {3, 4});
SArray = struct('field1', {[1,2]}, 'field2', {[3,4]});
whos S*
% Name Size Bytes Class Attributes
%
% S 1x2 608 struct
% SArray 1x1 384 struct
Some of the flexibility afforded by a struct array includes being able to easily grab a subset of the data:
subset = SArray(1:3);
% Compared to
subset.field1 = S.field1(1:3);
subset.field2 = S.field2(1:3);
Also, being able to store data of different sizes that may not easily fit into an array.
S(1).field1 = [1,2];
S(2).field1 = 3;
What solution is better, really depends on the data and how you're using it. If you have a large amount of data, the first option is likely going to be preferable due to it's smaller memory footprint.
If you're code is working for you, I wouldn't worry about converting it just for the sake of using a different convention unless you're having issues with performance (in which case use a struct of arrays) or difficulty accessing/modifying the data (use the array of struct).
I am fairly new to matlab and I am trying to figure out when it is best to use cells, tables, or matrixes to store sets of data and then work with the data.
What I want is to store data that has multiple lines that include strings and numbers and then want to work with the numbers.
For example a line would look like
'string 1' , time, number1, number 2
. I know a matrix works best if al elements are numbers, but when I use a cell I keep having to convert the numbers or strings to a matrix in order to work with them. I am running matlab 2012 so maybe that is a part of the problem. Any help is appreciated. Thanks!
Use a matrix when :
the tabular data has a uniform type (all are floating points like double, or integers like int32);
& either the amount of data is small, or is big and has static (predefined) size;
& you care about the speed of accessing data, or you need matrix operations performed on data, or some function requires the data organized as such.
Use a cell array when:
the tabular data has heterogeneous type (mixed element types, "jagged" arrays etc.);
| there's a lot of data and has dynamic size;
| you need only indexing the data numerically (no algebraic operations);
| a function requires the data as such.
Same argument for structs, only the indexing is by name, not by number.
Not sure about tables, I don't think is offered by the language itself; might be an UDT that I don't know of...
Later edit
These three types may be combined, in the sense that cell arrays and structs may have matrices and cell arrays and structs as elements (because thy're heterogeneous containers). In your case, you might have 2 approaches, depending on how you need to access the data:
if you access the data mostly by row, then an array of N structs (one struct per row) with 4 fields (one field per column) would be the most effective in terms of performance;
if you access the data mostly by column, then a single struct with 4 fields (one field per column) would do; first field would be a cell array of strings for the first column, second field would be a cell array of strings or a 1D matrix of doubles depending on how you want to store you dates, the rest of the fields are 1D matrices of doubles.
Concerning tables: I always used matrices or cell arrays until I
had to do database related things such as joining datasets by a unique key; the only way I found to do this in was by using tables. It takes a while to get used to them and it's a bit annoying that some functions that work on cell arrays don't work on tables vice versa. MATLAB could have done a better job explaining when to use one or the other because it's not super clear from the documentation.
The situation that you describe, seems to be as follows:
You have several columns. Entire columns consist of 1 datatype each, and all columns have an equal number of rows.
This seems to match exactly with the recommended situation for using a [table][1]
T = table(var1,...,varN) creates a table from the input variables,
var1,...,varN . Variables can be of different sizes and data types,
but all variables must have the same number of rows.
Actually I don't have much experience with tables, but if you can't figure it out you can always switch to using 1 cell array for the first column, and a matrix for all others (in your example).
So I've got a .tcl file with data representing a large three-dimensional matrix, and all values associated with the matrix appended in a single column, like this:
128 128 512
3.2867
4.0731
5.2104
4.114
2.6472
1.0059
0.68474
...
If I load the file into the command window and whos the variable, I have this:
whos K
Name Size Bytes Class Attributes
K 8388810x3 201331440 double
The two additional columns seem to be filled with NaNs, which don't appear in the original file. Is this a standard way for MATLAB to store a three-dimensional matrix? I'm more familiar with the .mat way of storing a matrix, and I'm curious if there's a quick command I can run to revert it to a more friendly format.
The file's first line (128 128 512) gives it 3 columns. I don't know why there are 2so many extra rows (128*128*512 = 8388608), but your 3d matrix can probably be constructed like this:
N = 128*128*512;
mat = reshape(tab(2:N+1,1),[128 128 512]);
What's on the last hundred lines of the table that gets loaded?
I'm trying to create a dataset from a double matrix and cell array of labels.
I don't have access to the mat2dataset function so I'm trying to write something similar.
>> whos data feature_labels
Name Size Bytes Class Attributes
data 2x208 3328 double
feature_labels 1x208 50776 cell
In actual use the data will have ~2million rows and always be double format. The number of columns will range from 20 up to 2000, so doing something like;
>> D = dataset([],[],[],[],[],...[], 'VarNames', feature_labels);
isn't really feasible.
Any suggestions?
edit:
Currently using a for loop and horzcat to concatenate new dataset columns on each loop. I don't see a way to pre-allocate the dataset size is this way so I imagine performance will chug with the larger datasets though..
Have you considered using a struct? I use these all the time in MATLAB for database things. I know it works absolutely fantastic for up to 20,000 elements with about 15 fields each, so I think it would still work just as well as anything else for 2 million items with 2 fields.
Alternatively, can't you just put it in a cell array?
DataBase{rowNum,1}=dataVector(rowNum,:);
DataBase{rowNum,2}=label{rowNum};
To preallocate a struct or cell, its relatively easy, with a struct, once you make your first one to initialize the fields, just say Struct(2000000).fieldName =[]
TO preallocate your cell array, just do
DataBase={[]}
DataBase{2000000,2}=[]
This preallocates all of it and fills it with empty values.
I am using MATLAB to load a text file that I want to make a sparse matrix out of. The columns in the text file refer to the row indices and are double type. I need them to be integers to be able to use them as indices for rows and columns. I tried using uint8, int32 and int64 to convert them to integers to use them to build a sparse matrix as so:
??? Undefined function or method 'sparse' for input
arguments of type 'int64'.
Error in ==> make_network at 5
graph =sparse(int64(listedges(:,1)),int64(listedges(:,2)),ones(size(listedges,1),1));
How can I convert the text file entries loaded as double so as to be used by the sparse function?
There is no need for any conversion, keep the indices double:
r = round(listedges);
graph = sparse(r(:, 1), r(:, 2), ones(size(listedges, 1), 1));
There are two reasons why one might want to convert to int:
The first, because you have data type restrictions.
The second, your inputs may contain fractions and are un-fit to be used as integers.
If you want to convert because of the first reason - then there's no need to: Matlab works with double type by default and often treats doubles as ints (for example, when used as indices).
However, if you want to convert to integers becuase of the second reason (numbers may be fractionals), then you should use round(), ceil() or floor() - whatever suits your purpose best.
There is another very good reason ( and really the primary one..) why one may want to convert indices of any structure (array, matrix, etc.) to int.
If you ever program in any language other than Matlab, you would be familiar with wanting to save memory space, especially with large structures. Being able to address elements in such structures with indices other than double is key.
One major issue with Matlab is the inability to more finely control the size of multidimensional structures in this way. There are sparse matrix solutions, but those are not adequate for many cases. Cell arrays will preserve the data types upon access, however the storage for every element in the cell array is extremely wasteful in terms of storage (113 bytes for a single uint8 encapsulated in a cell).