Creating dataset from matrix in Matlab - matlab

I'm trying to create a dataset from a double matrix and cell array of labels.
I don't have access to the mat2dataset function so I'm trying to write something similar.
>> whos data feature_labels
Name Size Bytes Class Attributes
data 2x208 3328 double
feature_labels 1x208 50776 cell
In actual use the data will have ~2million rows and always be double format. The number of columns will range from 20 up to 2000, so doing something like;
>> D = dataset([],[],[],[],[],...[], 'VarNames', feature_labels);
isn't really feasible.
Any suggestions?
edit:
Currently using a for loop and horzcat to concatenate new dataset columns on each iteration. I don't see a way to pre-allocate the dataset size this way, so I imagine performance will chug with the larger datasets.

Have you considered using a struct? I use these all the time in MATLAB for database things. I know it works absolutely fantastic for up to 20,000 elements with about 15 fields each, so I think it would still work just as well as anything else for 2 million items with 2 fields.
Alternatively, can't you just put it in a cell array?
DataBase{rowNum,1}=dataVector(rowNum,:);
DataBase{rowNum,2}=label{rowNum};
Preallocating a struct or cell array is relatively easy. With a struct, once you have made your first element to initialize the fields, just say Struct(2000000).fieldName = []
To preallocate your cell array, just do
DataBase={[]}
DataBase{2000000,2}=[]
This preallocates all of it and fills it with empty values.
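As a minimal sketch of the preallocation described above (the size n and the field names dataVector and label are assumptions chosen for illustration, not taken from the question), the same effect can be had with the built-in cell function and by indexing past the end of a struct array:

```matlab
n = 1000;                          % small stand-in for the ~2 million rows

% n-by-2 cell array; every cell starts out as an empty matrix
DataBase = cell(n, 2);

% The first element defines the fields (hypothetical names)
S = struct('dataVector', [], 'label', '');

% Assigning to element n grows the array; the new elements get empty fields
S(n).dataVector = [];
```

Both containers only reserve the outer array; the memory for each row's contents is still allocated when the row is filled.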

Related

Loading huge cell array containing structures

I have an issue with saving and loading a huge dataset in Matlab.
My dataset contains properties of series of images using Matlab's regionprops.
I currently have a MAT-file of about 21GB and this takes a while to load.
This MAT-file has one cell array containing structure arrays of the properties of ellipses on each slice.
Are there any suggestions as to how to get around this?
Is there a better, more efficient way of saving MAT-files than the -v7.3 format?
One solution could be to use the 'table' argument to regionprops. This causes the output to be a table rather than a struct array. This format is more efficient for storage than the struct array.
Better yet, if you don't mind manually keeping track of what data is where, is to create a numeric array with the relevant data:
BW = imread('text.png'); % Example image used in the docs
s = regionprops(BW,{'MajorAxisLength','MinorAxisLength','Orientation'});
t = regionprops('table',BW,{'MajorAxisLength','MinorAxisLength','Orientation'});
m = [s.MajorAxisLength; s.MinorAxisLength; s.Orientation];
whos
Name Size Bytes Class Attributes
BW 256x256 65536 logical
m 3x88 2112 double
s 88x1 31872 struct
t 88x3 3496 table
A numeric array is a much more efficient way of storing data than a struct array, because each element in the struct array is a separate matrix, which needs its own header. The header (114 bytes I believe) in this case is far larger than the value stored in the array (8 bytes in this case), hence the overhead of 31872 / 2112 = 15.1.
The table stores each column in a separate array, so there you have a much smaller overhead. Instead of having 3 x 88 (number of features x number of objects) arrays, you have only 3.
If each image is guaranteed to have the same number of objects, you could consider putting these matrices into a single 3D array instead of a cell array. The gain here would be smaller.
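A rough sketch of that last idea, assuming the per-slice matrices already sit in a cell array (C, nSlices and nObjects are made-up names, and random numbers stand in for the measured properties):

```matlab
nSlices  = 4;                      % hypothetical number of images
nObjects = 88;                     % assumed identical for every image

% One 3-by-nObjects matrix per slice, as built in the example above
C = cell(1, nSlices);
for k = 1:nSlices
    C{k} = rand(3, nObjects);      % placeholder for real measurements
end

% Stack the matrices along the third dimension
A = cat(3, C{:});                  % size: 3 x nObjects x nSlices
```

A(:,:,k) then holds the same data as C{k}, without the per-cell header.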

Build a matrix starting from instances of structure fields in MATLAB

I'm really sorry to bother so I hope it is not a silly or repetitive question.
I have been scraping a website, saving the results as a collection in MongoDB, exporting it as a JSON file and importing it in MATLAB.
In the end I obtained a struct object organised like the one in the picture.
What I'm interested in are the last two cell arrays (which can be easily converted to string arrays with string()). The first cell array is a collection of keys (think unique products) and the second cell array is a collection of values (think prices), like a dictionary. Each field is an instance of possible values for a set of these keys (think daily prices). My goal is to build a matrix made like this:
KEYS VALUES_OF_FIELD_1 VALUES_OF_FIELD2 ... VALUES_OF_FIELDn
A x x x
B x z NaN
C z x y
D NaN y x
E y x z
The main problem is that, as shown in the image and as I tried to explain in the example matrix, I don't always have a value for all the keys in every field (as you can see sometimes they are 321, other times 319 or 320 or 317) and so the key is missing from the first array. In that case I should fill the missing value with a NaN. The keys can be ordered alphabetically and are all unique.
What would you think would be the best and most scalable way to approach this problem in MATLAB?
Thank you very much for your time, I hope I explained myself clearly.
EDIT:
Both arrays are made of strings in my case, so types are not a problem (I've modified the example). The main problem is that, since the keys vary in each field, firstly I have to find all the (unique) keys in the structure, to build the rows, and then for each column (field) I have to fill the values putting NaN where the key is missing.
One thing to remember: you can't simply mix strings and numbers in one matrix. If you combine them, the elements must be either all strings or all numbers. I think all strings will work for you.
Before building the matrix, make sure that all the cell arrays have the same number of elements.
new_matrix = horzcat(keys,values1,...valuesn);
This will provide a matrix for each row (according to your image). Now you can use a for loop to get matrices for all the rows.
For now, I've solved it by considering the longest array of keys in the structure as the complete set of keys, let's call it keys_set.
Then I've created for each field in the structure a Map object in this way:
for i=1:length(structure)
structure(i).myMap = containers.Map(structure(i).key_field, structure(i).value_field);
end
Then I've built my matrix (M) by checking every map against the keys_set array:
for i=1:length(keys_set)
for j=1:length(structure)
if isKey(structure(j).myMap,char(keys_set(i)))
M(i,j) = string(structure(j).myMap(char(keys_set(i))));
else
M(i,j) = string('MISSING');
end
end
end
This works, but it would be ideal to also be able to check that keys_set is really complete.
EDIT: I've solved my problem by using this function and building the correct set of all the possible keys:
%% Finding the maximum number of keys in all the fields
maxnk = length(structure(1).key_field);
for i=2:length(structure)
if length(structure(i).key_field) > maxnk
maxnk = length(structure(i).key_field);
end
end
%% Initializing the matrix containing all the possible sets of keys
keys_set=string(zeros(maxnk,length(structure)));
%% Filling the matrix by putting "0" if the dimension is smaller
for i=1:length(structure)
d = length(string(structure(i).key_field));
if d == maxnk
keys_set(:,i) = string(structure(i).key_field);
else
clear tmp
tmp = [string(structure(i).key_field); string(zeros(maxnk-d,1))];
keys_set(:,i) = tmp;
end
end
%% Merging without duplication and removing the "0" element
keys_set = union_several(keys_set);
keys_set = keys_set(keys_set ~= string(0));
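For comparison, here is a sketch of the same idea using only the built-in unique and ismember, so no padding with "0" is needed. It is an alternative to the approach above, not the code actually used; the two-element struct below is invented sample data, and the keys and values are assumed to be cell arrays of character vectors:

```matlab
% Invented stand-in for the imported struct array
structure(1).key_field   = {'A'; 'B'; 'C'; 'E'};
structure(1).value_field = {'x'; 'x'; 'z'; 'y'};
structure(2).key_field   = {'A'; 'B'; 'C'; 'D'; 'E'};
structure(2).value_field = {'x'; 'z'; 'x'; 'y'; 'x'};

% Collect every key as a column, then take the sorted union
allKeys  = cellfun(@(k) k(:), {structure.key_field}, 'UniformOutput', false);
keys_set = unique(vertcat(allKeys{:}));

% Start with every entry marked as missing, then fill in what exists
M = repmat({'MISSING'}, numel(keys_set), numel(structure));
for j = 1:numel(structure)
    [found, pos] = ismember(structure(j).key_field, keys_set);
    M(pos(found), j) = structure(j).value_field(found);
end
```

Because keys_set is built from all fields, it is complete by construction, which addresses the concern about checking the key set.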

how to store data without dynamically naming variables

I have 40 variables. The 40 variables names are in a cell array (40 x 1).
Each variable will have a matrix. The matrix is of type of double and will be size 5000 x 150. It will also have a vector of size 1 x 150 & one last vector 1 x 4.
Initially I was going to dynamically name each struct with its variable name in the cell array. So would look like something like below (assuming variable name is ABC),
ABC.dataMatrix
ABC.dataVec
ABC.summaryData
All the variables would be saved to a directory.
However I've read using eval isn't a very good idea so guessing my idea isn't the best way to go about this problem. What would be the best way to go about this problem?
You can use struct arrays with dynamic field names, as @Shai and @RobertStettler suggest.
However, another option is a table. This might be more appealing if you want to see your data in one big matrix, and you can give each table row the name of your variables too! Note that the rows in a table would then be what you call your variables, but MATLAB calls the table columns its variables.
Also note that using a table can be more difficult than using struct or cell arrays, but if you know how to use them, you can handle a table too.
An example:
% create dummy data
rowNames = {'a';'b';'c'};
M = {ones(3); zeros(3); nan(3)}; % a cell, one element per item in rowNames
V = [1 2; 3 4; 5 6]; % a matrix of vectors, each row containing a vector for every item in rowNames
% create a table
T = table(M,V,'RowNames',rowNames); % this is where your variable names could go
Now, to access data you could use (some examples):
T(2,:) or T('b',:), return a table for all data on the 'b' row.
T(:,2) or T(:,'V'), return a table of variable V for all rows.
T.V or T{:,2} or T{:,'V'} or T.(2), return matrix V for all rows. This syntax is similar to accessing a (dynamic) field name of a struct.
T{3,1} or T{'c',1} or T.M('c'), return cell M for row 'c'. This syntax is similar to accessing a cell, but with more advanced possibilities, i.e. the ability to access the table via row or variable names.
T{3,1}{:} or T{'c',1}{:} or T.M{'c'}, return cell contents M for row 'c'.
And even more complex: T('a',:).M{:} is a complex way of accessing the cell content of M for row 'a', which can be done with T{1,1}{:} or T.M{'a'} or T{'a','M'}{:} or T.M{1} as well.
In your case you would end up with a 40x3 table, where every row is what you call a variable, the first column holds the matrices (in cell arrays), and the last two columns hold the vectors (also in cell arrays, or as a 40xm double, m being the length of your vector).
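For completeness, the dynamic-field-name option mentioned at the start of this answer might look roughly like the sketch below. The names in varNames and the zero-filled arrays are placeholders; only the sizes come from the question:

```matlab
varNames = {'ABC'; 'DEF'; 'GHI'};  % placeholder for the 40x1 cell of names

data = struct();
for k = 1:numel(varNames)
    name = varNames{k};
    % data.(name) uses the string in name as the field name, no eval needed
    data.(name).dataMatrix  = zeros(5000, 150);
    data.(name).dataVec     = zeros(1, 150);
    data.(name).summaryData = zeros(1, 4);
end
```

Everything then lives in one variable, data, which can be saved to a single MAT-file, and data.ABC.dataMatrix reads much like the layout originally planned.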

When to use a cell, matrix, or table in Matlab

I am fairly new to matlab and I am trying to figure out when it is best to use cells, tables, or matrixes to store sets of data and then work with the data.
What I want is to store data that has multiple lines that include strings and numbers and then want to work with the numbers.
For example a line would look like
'string 1', time, number1, number2
I know a matrix works best if all elements are numbers, but when I use a cell I keep having to convert the numbers or strings to a matrix in order to work with them. I am running matlab 2012 so maybe that is a part of the problem. Any help is appreciated. Thanks!
Use a matrix when :
the tabular data has a uniform type (all are floating points like double, or integers like int32);
& either the amount of data is small, or is big and has static (predefined) size;
& you care about the speed of accessing data, or you need matrix operations performed on data, or some function requires the data organized as such.
Use a cell array when:
the tabular data has heterogeneous type (mixed element types, "jagged" arrays etc.);
| there's a lot of data and has dynamic size;
| you need only indexing the data numerically (no algebraic operations);
| a function requires the data as such.
Same argument for structs, only the indexing is by name, not by number.
Not sure about tables; I don't think they are offered by the language itself. They might be a UDT that I don't know of...
Later edit
These three types may be combined, in the sense that cell arrays and structs may have matrices, cell arrays and structs as elements (because they're heterogeneous containers). In your case, you might have 2 approaches, depending on how you need to access the data:
if you access the data mostly by row, then an array of N structs (one struct per row) with 4 fields (one field per column) would be the most effective in terms of performance;
if you access the data mostly by column, then a single struct with 4 fields (one field per column) would do; the first field would be a cell array of strings for the first column, the second field would be a cell array of strings or a 1D matrix of doubles depending on how you want to store your dates, and the rest of the fields are 1D matrices of doubles.
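The column-oriented layout from the second option could be sketched like this. The field names and values are invented for illustration, and the time column is stored as plain doubles:

```matlab
% One struct, one field per column (sample values only)
rows.name    = {'string 1'; 'string 2'; 'string 3'};  % cell array of strings
rows.time    = [0.0; 0.5; 1.0];
rows.number1 = [10; 20; 30];
rows.number2 = [1.5; 2.5; 3.5];

% Numeric columns are ordinary vectors, so no conversion is needed
total = rows.number1 + rows.number2;

% Select rows by comparing against the string column
idx = strcmp(rows.name, 'string 2');
t2  = rows.time(idx);
```

This uses only structs, cell arrays and matrices, so it should also run on older releases.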
Concerning tables: I always used matrices or cell arrays until I had to do database-related things such as joining datasets by a unique key; the only way I found to do this was by using tables. It takes a while to get used to them, and it's a bit annoying that some functions that work on cell arrays don't work on tables and vice versa. MATLAB could have done a better job explaining when to use one or the other, because it's not super clear from the documentation.
The situation that you describe seems to be as follows:
You have several columns. Each column consists of a single datatype, and all columns have an equal number of rows.
This seems to match exactly with the recommended situation for using a table:
T = table(var1,...,varN) creates a table from the input variables,
var1,...,varN . Variables can be of different sizes and data types,
but all variables must have the same number of rows.
Actually I don't have much experience with tables, but if you can't figure it out you can always switch to using 1 cell array for the first column, and a matrix for all others (in your example).
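A small sketch of the table approach for the layout in the question, with invented sample values. Note that table was introduced in R2013b, so this would not run on the 2012 release mentioned in the question:

```matlab
% Sample columns (values are placeholders)
name    = {'string 1'; 'string 2'; 'string 3'};
time    = [0.0; 0.5; 1.0];
number1 = [10; 20; 30];
number2 = [1.5; 2.5; 3.5];

% Variable names are taken from the workspace variable names
T = table(name, time, number1, number2);

% Numeric columns come back as ordinary vectors via dot notation
T.total = T.number1 + T.number2;
```

The string column stays a cell array while the numeric columns can be used directly in arithmetic, which avoids the repeated conversions described in the question.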

calculating the number of columns in a row of a cell array in matlab

I've got a cell array full of numbers, with 44 rows and a different number of columns in each row.
How can I calculate the number of columns in each row (the columns whose contents are not empty)?
I've tried 2 different ways, both of which were wrong.
The 1st one:
%a is the cell array
s=length(a)
It gives 44, which is the number of rows.
The 2nd one:
[row, columns]=size(a)
But it doesn't work either, because the number of columns is different in each row.
At least, I mean the number of columns which are not empty.
For example, I need the number of columns in row one, which is 43 (a{1,1:43}), but it gives the number of columns for each element, like a{1,1} which is 384, or a{1,2}, a{1,3} and so on.
You need to access each member of the cell array separately, you are looking for the size of the data contained in the cell - the cell is the container. Two methods
for loop:
cell_content_lengths=zeros(1,length(a));
for v=1:length(a)
cell_content_lengths(v)=length(a{v});
end
cellfun:
cell_content_lengths=cellfun(@length,a);
Any empty cells will just have length 0. Extending the for-loop to matrices is trivial, and you can extend the cellfun part to cells containing matrices by using something like this, if you are interested:
cell_content_sizes=cell2mat(cellfun(@size,a,'uniformoutput',false));
(Note for the above: each element of a needs to have the same number of dimensions, otherwise it will give errors about concatenating matrices of different sizes.)
EDIT
Based on your comment I think I understand what you are looking for:
non_empty_cols = sum(~cellfun(@isempty,a),2);
With thanks to @MZimmerman6 who understood it before me.
So what you're really asking, is "How many non-empty elements are in each row of my cell array?"
filledCells = ~cellfun(@isempty,a);
columns = sum(filledCells,2);
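A tiny worked example of the two lines above, on a made-up 3x4 cell array in which the rows have 3, 1 and 4 filled cells:

```matlab
% Small stand-in for the 44-row cell array; unassigned cells stay empty
a = cell(3, 4);
a(1, 1:3) = {1, [2 3], 4};
a(2, 1)   = {5};
a(3, 1:4) = {6, 7, 8, 9};

filledCells = ~cellfun(@isempty, a);   % logical mask of non-empty cells
columns     = sum(filledCells, 2);     % count per row, here [3; 1; 4]
```

The count depends only on whether a cell is empty, not on how many numbers each cell contains, so the two-element cell in row one still counts once.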