Preallocation of cell arrays in MATLAB

This is more a question to understand a behavior than to solve a specific problem.
MathWorks states that numerical arrays are stored contiguously, which makes preallocation important. This is not the case for cell arrays.
Are they something similar to a vector or array of pointers in C++?
This would mean that preallocation is less important, since a pointer is half the size of a double (according to whos, though there is surely overhead somewhere to store the datatype of the mxArray).
Running this code:
clear all
n = 1e6;

tic
A = [];
for i = 1:n
    A(end + 1) = 1;
end
fprintf('Numerical without preallocation %f s\n', toc)

clear A
tic
A = zeros(1, n);
for i = 1:n
    A(i) = 1;
end
fprintf('Numerical with preallocation %f s\n', toc)

clear A
tic
A = cell(0);
for i = 1:n
    A{end + 1} = 1;
end
fprintf('Cell without preallocation %f s\n', toc)

tic
A = cell(1, n);
for i = 1:n
    A{i} = 1;
end
fprintf('Cell with preallocation %f s\n', toc)
returns:
Numerical without preallocation 0.429240 s
Numerical with preallocation 0.025236 s
Cell without preallocation 4.960297 s
Cell with preallocation 0.554257 s
There is no surprise for the numerical values, but the cell timings did surprise me, since only the container of pointers, and not the data itself, should need reallocation. That should (since a pointer is smaller than a double) lead to a difference of less than 0.2 s. Where does this overhead come from?
A related question: what if I want to build a data container for heterogeneous data in MATLAB (preallocation is not possible, since the final size is not known at the beginning)? I think handle classes are not a good fit, since they also have huge overhead.
Already looking forward to learning something,
magu_
Edit:
I tried out the linked list proposed by Eitan T, but I think the overhead from MATLAB is still rather big. I tried it with a double array as data (rand(200000,1)).
I made a little plot to illustrate:
Code for the graph (I used the dlnode class from the MATLAB homepage, as stated in the answering post):
D = rand(200000, 1);
s = linspace(10, 20000, 50);
nC = zeros(50, 1);
nL = zeros(50, 1);
for i = 1:50
    a = cell(0);
    tic
    for ii = 1:s(i)
        a{end + 1} = D;
    end
    nC(i) = toc;

    a = list([]);
    tic
    for ii = 1:s(i)
        a.insertAfter(list(D));
    end
    nL(i) = toc;
end
figure
plot(s, nC, 'r', s, nL, 'g')
xlabel('#iter')
ylabel('time (s)')
legend({'cell', 'list'})
Don't get me wrong, I love the idea of linked lists, since they are rather flexible, but I think the overhead might be too big.

Are cell arrays something similar to a vector or an array of pointers in C++?
Cell arrays do allow storing data of different types and sizes, but each cell also adds a constant overhead of 112 bytes (see this other answer of mine). That is far more than an 8-byte double, and it is non-negligible, especially when dealing with large cell arrays as in your example.
It is reasonable to assume that a cell array is implemented as a contiguous array of pointers, each pointing to the actual content of the cell.
This means that you can modify the content of each cell individually without resizing the cell array container itself. However, it also means that adding new cells to the cell array requires dynamic storage allocation, and this is why preallocating memory for a cell array improves performance.
A related question would be, if I would like to make a data container for heterogeneous data in Matlab (preallocation is not possible since the final size is not known in the beginning)
Not knowing the final size may indeed be a problem, but you could always preallocate a cell array with the maximum size you might need (if such a bound exists) and remove the empty cells at the end. I also suggest that you look into implementing linked lists in MATLAB.
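For instance, a minimal sketch of the overallocate-and-trim idea (the bound maxN and the data-producing functions are hypothetical placeholders):
maxN = 1e5;                  % assumed upper bound on the number of items
A = cell(1, maxN);           % preallocate once
k = 0;
while thereIsMoreData()      % hypothetical predicate
    k = k + 1;
    A{k} = getNextItem();    % hypothetical data source
end
A = A(1:k);                  % remove the unused cells at the end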

Related

How can I prevent memory churn in MATLAB?

I'm putting together a hypothesis tree type algorithm in MATLAB and it is being slowed terribly by memory issues. The profiler shows all time being spent just writing into arrays.
The algorithm keeps a list of hypotheses, with information about them, in an array of structs. The issue relates to the (not especially big) 3D arrays within each hypothesis:
H(x).someInfo(a,b,c)
Each iteration, some hypotheses are discarded:
H = H(keepIndices);
And the ones that remain are expanded and updated:
Hin = H;
H(N * length(H)) = H(1); % Pre-alloc?
count = 0;
for x = 1:length(Hin)
    for y = 1:N
        count = count + 1;
        H(count) = Hin(x);
        ... % Computations
        H(count).someInfo(:,:,a) = M; % Much time spent here
    end
end
The profiler indicates huge amounts of time spent just doing the write (note the comment). someInfo is preallocated, so it is not itself growing dynamically, but it is getting copied around.
Can anyone suggest a way to achieve this type of functionality without running up against inefficiencies in the way MATLAB deals with memory? I am not blaming MATLAB, but its flexibility makes this harder than it would be in C++.
If the access pattern to someInfo is always the same, you could turn it into a cell array of 2D matrices. You'll find that
H(count).someInfo{a} = M;
is faster than
H(count).someInfo(:,:,a) = M;
because the array data is not copied over, only the reference to the data is.
...and if that is the case, you might want to do
H{count,a} = M;
Note that the fewer levels of indexing (you have 3!), the faster it is.
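For illustration, a rough benchmark along these lines (the array sizes and counts are arbitrary choices for this sketch, not taken from the original code):
N = 200; a = 3;
M = rand(50, 50);

% Field is a 3D numeric array: the slice assignment copies M into it
H1(N).someInfo = [];
for x = 1:N, H1(x).someInfo = zeros(50, 50, 5); end
tic
for x = 1:N
    H1(x).someInfo(:,:,a) = M;
end
toc

% Field is a cell array: the assignment only stores a reference to M
H2(N).someInfo = [];
for x = 1:N, H2(x).someInfo = cell(1, 5); end
tic
for x = 1:N
    H2(x).someInfo{a} = M;
end
toc

% Flat cell array: a single level of indexing
H3 = cell(N, 5);
tic
for x = 1:N
    H3{x, a} = M;
end
toc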

Incremental appending: How to avoid performance penalty of struct arrays

If you must incrementally append data to arrays, it seems that using individual vectors of basic data types is orders of magnitude faster than an array of structs (with one vector element per record). Even trying to collect the individual vectors into a struct seems to double the time. The tests are:
N = 5e4;

fprintf('\nStruct array (array of structs):\n')
clear x y;
y = struct('a', [], 'b', [], 'c', [], 'd', []);
tic
for iIns = 1:N
    x.a = rand; x.b = rand; x.c = rand; x.d = rand;
    y(end+1) = x;
end % for iIns
toc

fprintf('\nSeparate arrays of scalars:\n')
clear a b c d;
a = []; b = []; c = []; d = [];
tic
for iIns = 1:N
    a(end+1) = rand;
    b(end+1) = rand;
    c(end+1) = rand;
    d(end+1) = rand;
end % for iIns
toc

fprintf('\nA struct with arrays of scalars for fields:\n')
clear a b c d x y
x.a = []; x.b = []; x.c = []; x.d = [];
tic
for iIns = 1:N
    x.a(end+1) = rand;
    x.b(end+1) = rand;
    x.c(end+1) = rand;
    x.d(end+1) = rand;
end % for iIns
toc
The results:
struct array (array of structs):
Elapsed time is 24.127274 seconds.
Separate arrays of scalars:
Elapsed time is 0.048190 seconds.
A struct with arrays of scalars for fields:
Elapsed time is 0.084624 seconds.
Even though collecting individual vectors of basic data types into a struct (the third scenario above) imposes such a penalty, it may be preferable to simply using individual vectors (the second scenario above) because the variables are better organized. Your variable namespace isn't cluttered with many variables that are in fact conceptually grouped.
That's quite a significant penalty, however, to pay for such organization. I don't suppose there is a way to avoid this?
There are two ways to avoid this performance penalty: (1) pre-allocate, and (2) rethink your stance on "organizing" variables. I suggest both. Oh, and if you can, don't use arrays of structs where each field holds only a scalar: if your application suddenly has to handle a couple of orders of magnitude more data, the memory overhead will force you to rewrite everything.
Pre-allocation
You often know how many elements your array will end up having, so initialize your arrays as s = struct('a',NaN(1,N),'b',NaN(1,N));. If you don't know ahead of time how many entries there will be, but you can estimate an upper limit, initialize with the upper limit and either remove the excess elements afterwards or use functions (e.g. nanmean) that do not care if the array has a few extra NaNs at the end. If you truly know nothing about the final size (except that N will be large enough to matter), pre-allocate with a nice round number (e.g. N=1337) and extend the array in chunks. MathWorks has sped up dynamic growing of numeric arrays in a recent release, but as you demonstrate in your answer, the optimization has not been applied to structs yet. Don't count on MathWorks' optimization team fixing your code.
Nice variables
Why worry about your variable space? As long as you use explicitVariableNames, your code remains readable and you will have an easy time picking out the right variable. But ok, let's say you want to clean up. The first way to keep the number of active variables low is to use clear or keep at strategic points in your code, so that you only keep around what's needed. The second (assuming you want to optimize for performance) is to put contextually linked vectors into the same array: objectDimensions = [lengthOfObject, widthOfObject, heightOfObject]. This keeps everything in numeric arrays (which are fastest) and allows easy vectorization, such as objectVolume = prod(objectDimensions,2);.
/aside: I should disclose that I used to use structures frequently for assembling results (so that I could return a lot of information in a single variable and have the field names be part of the documentation). I have since switched to object-oriented programming (usually handle objects), which not only collects related variables but also the associated functionality, and which facilitates code re-use. I do take a performance hit, but the time it saves me coding more than makes up for it. Note that I do pre-allocate if at all possible (and if it's not just growing an array three times).
Example
Assume you have a function getDimensions that reads the dimensions (length, height, width) of objects. Sometimes the object is 2D, sometimes it is 3D. Thus, you want to fill the following variables: twoD.length, twoD.width, threeD.length, threeD.width, threeD.height, ideally as arrays of structs, so that each element of a struct array corresponds to an object. You do not know ahead of time how many objects there are; all you can do is poll the function thereAreMoreObjects, which returns true or false, until there are no more objects.
Here's how you can do this with reasonable efficiency and growing arrays by chunks:
%// preassign the temporary variable, and some others
chunkSize = 1000;
numObjects = 0;
idAndDimensions = zeros(chunkSize, 4);

while thereAreMoreObjects()
    objectId = getCurrentObjectId();
    %// hi==-1 if it's flat
    [len, wid, hi] = getObjectDimensions(objectId);

    %// allocate more, if needed
    numObjects = numObjects + 1;
    if numObjects > size(idAndDimensions, 1)
        %// grow array by another chunk
        idAndDimensions(end+chunkSize, 1) = 0;
    end
    idAndDimensions(numObjects, :) = [objectId, len, wid, hi];
end

%// throw away excess
idAndDimensions = idAndDimensions(1:numObjects, :);

%// split into 2D and 3D objects
isTwoD = idAndDimensions(:, end) == -1;

%// assign twoD struct
twoD = struct(...
    'id', num2cell(idAndDimensions(isTwoD, 1)), ...
    'length', num2cell(idAndDimensions(isTwoD, 2)), ...
    'width', num2cell(idAndDimensions(isTwoD, 3)));

%// assign threeD struct analogously
threeD = struct(...
    'id', num2cell(idAndDimensions(~isTwoD, 1)), ...
    'length', num2cell(idAndDimensions(~isTwoD, 2)), ...
    'width', num2cell(idAndDimensions(~isTwoD, 3)), ...
    'height', num2cell(idAndDimensions(~isTwoD, 4)));

%// clean up - we need only the two structs
%// I use keep from the File Exchange instead of clearvars
clearvars -except twoD threeD

How to preallocate a list of external data structure in matlab?

My problem is related to an externally defined data structure: tensor. A tensor is a multidimensional array. In the MATLAB Tensor Toolbox 2.5, tensor is a class with two fields, t.data and t.size:
% Create the tensor
t.data = data;
t.size = siz;
t = class(t, 'tensor');
return;
Like the built-in function zeros() in MATLAB, I can use tenzeros() to create a tensor full of zeros, e.g. tenzeros([2,3,4]). There are also other tensor data structures in this toolbox: tensor, sptensor, ktensor, ttensor, etc.
My question is: how can I preallocate 200 tenzeros or other tensor types, where each tensor has the same size [100,200,300]? That is, how do I preallocate memory for 200 tensors? Currently I use a for loop to create the 200 tensors one by one, and the memory requirements go up very, very high. Some people advised me to preallocate memory for large data structures before I actually compute them.
Thus, I want to preallocate an array of 200 tensors at the beginning; then, in a for loop (a parfor loop specifically), compute the actual result of each tensor and write it to the preallocated space.
Why can't I use:
c=repmat(tenzeros([100, 200, 300]),200,1)
which throws:
Error using tensor.size
Too many output arguments.
Error in repmat (line 73)
[m,n] = size(A);
Update:
I pre-allocate the memory for the 200 tensors just because I heard memory preallocation can make the data contiguous in memory and thus alleviate the OutOfMemory problem. Actually, I only need each computed tensor to be written to its own text file in a for loop, which means I do not need all 200 tensors together as my final result.
So currently I am using @Andrew Janke's third piece of code to pre-allocate the memory for the 200 tensors at the beginning:
% Memory pre-allocation
c = cell([200, 1]);
parfor i = 1:numel(c)
    c{i} = tenrand([100,200,300]); % just a tensor with random values to fill the memory space
end
Then I actually compute the 200 tensors in a parfor loop, filling in the pre-allocated space (i.e. c):
% Compute the 200 tensors in a parfor loop
parfor i = 1:200
    c{i} = computeTensorFunction(...);
    aTensor = c{i};
    % ... write aTensor (i.e. c{i}) into a text file
end
Will the second part overwrite the pre-allocated space in c?
The expression aTensor = c{i}: it doesn't make a duplicated copy, right? (I do not make changes to aTensor.)
You can preallocate a cell array of initialized tensor objects by using repmat basically the way you are, but by sticking each tensor inside a cell.
c=repmat( { tenzeros([100, 200, 300]) }, 200, 1);
The { } curly braces surrounding the tenzeros call enclose it in a 1-by-1 cell.
If repmat is blowing up, you may be able to work around it by assigning the cell contents yourself from a re-used temporary variable. This will be basically as fast as repmat, and have the same memory usage characteristics.
sz = [200, 1];
c = cell(sz);
% Construct the initial value *once*, outside the loop
tmp = tensor(...);
for i = 1:numel(c)
    c{i} = tmp;
end
Note that this isn't going to do as much for performance as preallocating primitive arrays, because only the top "container" level of composite types gets preallocated and possibly modified in place. The arrays stored in the fields of objects (like tensors) will still be copied when their values are changed inside functions, and probably even in the local workspace that first created them.
This will help a little with peak memory usage, because all of the initial zero tensors share their memory via the copy-on-write optimization. So it is more efficient than initializing the cell array with new tensors in a loop over multiple constructor calls. But since you're going to discard those initial zero values anyway, the most memory-efficient approach is to initialize the cell array with empty cells:
sz = [200, 1];
c = cell(sz);
parfor i = 1:numel(c)
    c{i} = calculate_your_result(...);
end
Because tensor is a composite type (an object), preallocation won't help much with the space the tensors themselves consume. You should work out an estimate of how much memory your data set requires in the best case and compare that to the actual usage you're seeing. You might just need more memory for this application.
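As a rough back-of-the-envelope check using the sizes from the question (a calculation, not a measurement):
% One dense [100, 200, 300] tensor of doubles:
bytesPerTensor = 100 * 200 * 300 * 8;   % 48e6 bytes, i.e. about 48 MB
% 200 such tensors held in memory at once:
totalBytes = 200 * bytesPerTensor;      % 9.6e9 bytes, i.e. about 9.6 GB
If roughly 10 GB for the raw data alone is near or beyond your available RAM, then writing each tensor to its text file inside the loop and not retaining it in c, as described in the update above, is the more practical route.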

Guide to Optimizing MATLAB Code

I have noticed many individual questions on SO but no one good guide to MATLAB optimization.
Common Questions:
Optimize this code for me
How do I vectorize this?
I don't think that these questions will stop, but I'm hoping that the ideas presented here will give them something centralized to refer to.
Optimizing MATLAB code is something of a black art; there is always a better way to do it, and sometimes it is simply impossible to vectorize your code.
So my question is: when vectorization is impossible or extremely complicated, what are some of your tips and tricks for optimizing MATLAB code? Also, if you have any common vectorization tricks, I wouldn't mind seeing them either.
Preface
All of these tests are performed on a machine that is shared with others, so it is not a perfectly clean environment. Between each test I clear the workspace to free up memory.
Please don't pay attention to the individual numbers; just look at the differences between the before and after optimisation times.
Note: The tic and toc calls I have placed in the code are to show where I am measuring the time taken.
Pre-allocation
The simple act of pre-allocating arrays in Matlab can give a huge speed advantage.
tic;
for i = 1:100000
    my_array(i) = 5 * i;
end
toc;
This takes 47 seconds
tic;
length = 100000;
my_array = zeros(1, length);
for i = 1:length
    my_array(i) = 5 * i;
end
toc;
This takes 0.1018 seconds
47 seconds down to 0.1 seconds for a single added line of code is an amazing improvement. Obviously in this simple example you could vectorize it to my_array = 5 * (1:100000) (which took 0.000423 seconds), but I am trying to represent the more complicated cases where vectorization isn't an option.
I recently found that the zeros function (and others of the same nature) is not as fast at pre-allocating as simply assigning to the last element:
tic;
length = 100000;
my_array(length) = 0;
for i = 1:length
    my_array(i) = 5 * i;
end
toc;
This takes 0.0991 seconds
Now, obviously this tiny difference doesn't prove much, but you'll have to believe me that over a large file with many of these optimisations the difference becomes a lot more apparent.
Why does this work?
The pre-allocation methods allocate a chunk of memory for you to work with. This memory is contiguous and can be pre-fetched, just like an array in C++ or Java. However, if you do not pre-allocate, MATLAB has to dynamically find more and more memory for you to use. As I understand it, this behaves differently from a Java ArrayList and is more like a LinkedList, where different chunks of the array are scattered all over the place in memory.
Not only is this slower when you write data to it (47 seconds!), but it is also slower every time you access it from then on. In fact, if you absolutely CAN'T pre-allocate, it is still useful to copy your matrix into a new pre-allocated one before you start using it.
What if I don't know how much space to allocate?
This is a common question and there are a few different solutions:
Overestimation - It is better to grossly overestimate the size of your matrix and allocate too much space than it is to under-allocate.
Deal with it and fix later - I have seen this a lot, where the developer puts up with the slow population time and then copies the matrix into a new pre-allocated space (see the sketch after this list). Usually this is then saved as a .mat file or similar, so that it can be read back quickly at a later date.
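A minimal sketch of the "deal with it and fix later" approach (the data-producing functions are hypothetical placeholders):
% Put up with the slow, growing population while the final size is unknown
data = [];
while there_is_more_data()              % hypothetical predicate
    data(end + 1) = get_next_value();   % hypothetical data source
end
% Fix later: copy into a fresh pre-allocated array and save it,
% so later runs can load the contiguous version quickly
fixed_data = zeros(size(data));
fixed_data(:) = data;                   % indexed assignment forces a real copy
save('data.mat', 'fixed_data');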
How do I pre-allocate a complicated structure?
Pre-allocating space for simple data-types is easy, as we have already seen, but what if it is a very complex data type such as a struct of structs?
I could never work out how to explicitly pre-allocate these (I am hoping someone can suggest a better method), so I came up with this simple hack:
tic;
length = 100000;
for i = 1:length
    complicated_structure(i) = read_from_file(i);
end
toc;
This takes 1.5 minutes
tic;
length = 100000;
% Reverse the for-loop to start from the last element;
% the first assignment (to element length) allocates the whole array at once
for i = length:-1:1
    complicated_structure(i) = read_from_file(i);
end
toc;
This takes 6 seconds
This is obviously not perfect pre-allocation, but the time improvement speaks for itself. I'm hoping someone has a better way to do this, but it is a pretty good hack in the meantime.
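For what it's worth, explicit pre-allocation of a struct array is possible when the field names are known up front; a sketch (the field names here are placeholders):
n = 100000;
% Option 1: replicate a template struct
template = struct('fieldA', [], 'fieldB', []);
s = repmat(template, 1, n);
% Option 2: assign to the last element; MATLAB allocates the whole
% array at once and default-fills the earlier elements
clear s2
s2(n) = struct('fieldA', [], 'fieldB', []);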
Data Structures
In terms of memory usage, an Array of Structs is considerably worse than a Struct of Arrays:
% Array of Structs
a(1).a = 1;
a(1).b = 2;
a(2).a = 3;
a(2).b = 4;
Uses 624 Bytes
% Struct of Arrays
a.a(1) = 1;
a.b(1) = 2;
a.a(2) = 3;
a.b(2) = 4;
Uses 384 Bytes
As you can see, even in this simple/small example the Array of Structs uses a lot more memory than the Struct of Arrays. Also the Struct of Arrays is in a more useful format if you want to plot the data.
Each Struct has a large header, and as you can see an array of structs repeats this header multiple times where the struct of arrays only has the one header and therefore uses less space. This difference is more obvious with larger arrays.
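The byte counts above can be checked with whos (the exact numbers vary by MATLAB version and platform):
clear a
a(1).a = 1; a(1).b = 2;
a(2).a = 3; a(2).b = 4;
w = whos('a');
fprintf('Array of structs: %d bytes\n', w.bytes)

clear a
a.a = [1 3]; a.b = [2 4];
w = whos('a');
fprintf('Struct of arrays: %d bytes\n', w.bytes)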
File Reads
The fewer freads (or any system calls, for that matter) you have in your code, the better.
tic;
for i = 1:100
    fread(fid, 1, '*int32');
end
toc;
The previous code is a lot slower than the following:
tic;
fread(fid, 100, '*int32');
toc;
You might think that's obvious, but the same principle can be applied to more complicated cases:
tic;
for i = 1:100
    val1(i) = fread(fid, 1, '*float32');
    val2(i) = fread(fid, 1, '*float32');
end
toc;
This problem is no longer simple, because in the file the floats are interleaved like this:
val1 val2 val1 val2 etc.
However, you can use the skip parameter of fread to achieve the same optimization as before:
tic;
% Get the current position in the file
initial_position = ftell(fid);
% Read 100 float32 values, skipping 4 bytes after each one
val1 = fread(fid, 100, '*float32', 4);
% Seek back to the start, plus the size of one float32
fseek(fid, initial_position + 4, 'bof');
% Read 100 float32 values, skipping 4 bytes after each one
val2 = fread(fid, 100, '*float32', 4);
toc;
So this file read was accomplished using two freads instead of 200, a massive improvement.
Function Calls
I recently worked on some code that used many function calls, all of which were located in separate files. So let's say there were 100 separate files, all calling each other. By "inlining" this code into one function I saw a 20% improvement in execution speed on a run that took about 9 seconds.
Obviously you would not do this at the expense of re-usability, but in my case the functions were automatically generated and not reused at all. But we can still learn from this and avoid excessive function calls where they are not really needed.
External MEX functions incur an overhead for being called. Therefore one call to a large MEX function is a lot more efficient than many calls to smaller MEX functions.
Plotting Many Disconnected Lines
When plotting disconnected data, such as a set of vertical lines, the traditional way in MATLAB is to make repeated calls to line or plot using hold on. However, if you have a large number of individual lines to plot, this becomes very slow.
The technique I have found uses the fact that you can introduce NaN values into the data you plot, and a NaN causes a break in the line.
The contrived example below converts a set of x_values, y1_values, and y2_values (where each line runs from [x, y1] to [x, y2]) into a format appropriate for a single call to plot.
For example:
% Where x is 1:1000, draw vertical lines from 5 to 10.
x_values = 1:1000;
y1_values = ones(1, 1000) * 5;
y2_values = ones(1, 1000) * 10;
% Set x_plot_values to [1, 1, NaN, 2, 2, NaN, ...];
x_plot_values = zeros(1, length(x_values) * 3);
x_plot_values(1:3:end) = x_values;
x_plot_values(2:3:end) = x_values;
x_plot_values(3:3:end) = NaN;
% Set y_plot_values to [5, 10, NaN, 5, 10, NaN, ...];
y_plot_values = zeros(1, length(x_values) * 3);
y_plot_values(1:3:end) = y1_values;
y_plot_values(2:3:end) = y2_values;
y_plot_values(3:3:end) = NaN;
figure; plot(x_plot_values, y_plot_values);
I have used this method to print thousands of tiny lines and the performance improvements were immense. Not only in the initial plot, but the performance of subsequent manipulations such as zoom or pan operations improved as well.

Growable data structure in MATLAB

I need to create a queue in MATLAB that holds structs which are very large. I don't know how large this queue will get. MATLAB doesn't have linked lists, and I'm worried that repeated allocation and copying will really slow down this code, which must be run thousands of times. I need some sort of growable data structure. I've found a couple of entries on linked lists in the MATLAB help, but I can't understand what's going on there. Can someone help me with this problem?
I posted a solution a while back to a similar problem. The way I did it was to allocate the array with an initial size BLOCK_SIZE and then keep growing it by BLOCK_SIZE as needed (whenever fewer than 10%*BLOCK_SIZE free slots remain).
Note that with an adequate block size, performance is comparable to pre-allocating the entire array from the beginning. Please see the other post for a simple benchmark I did.
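A minimal sketch of that block-growing pattern (BLOCK_SIZE and the data-producing functions are arbitrary placeholders):
BLOCK_SIZE = 1000;
listSize = BLOCK_SIZE;              % current allocated capacity
data = zeros(1, listSize);          % the growing array
listPtr = 0;                        % number of elements actually in use
while some_condition()              % hypothetical producer loop
    listPtr = listPtr + 1;
    data(listPtr) = next_value();   % hypothetical data source
    % grow by another block when fewer than 10%*BLOCK_SIZE free slots remain
    if listSize - listPtr < 0.1 * BLOCK_SIZE
        listSize = listSize + BLOCK_SIZE;
        data(listSize) = 0;         % extend in a single allocation
    end
end
data = data(1:listPtr);             % trim the unused tail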
Just create an array of structs and double the size of the array when it hits the limit. This scales well.
If you're worried that repeated allocation and copying is going to slow the code down, try it. It may in fact be very slow, but you may be pleasantly surprised.
Beware of premature optimization.
Well, I found the easy answer:
L = java.util.LinkedList;
I think the built-in cell array would be suitable for storing growable structures.
I made a comparison among:
Dynamic size cell, size of the cell changes every loop
Pre-allocated cell
Java LinkedList
Code:
clear;
scale = 1000;

% dynamic size cell: the cell array grows every iteration
tic;
dynamic_cell = cell(0);
for ii = 1:scale
    dynamic_cell{end + 1} = magic(20);
end
toc

% preallocated cell
tic;
fixed_cell = cell(1, scale);
for ii = 1:scale
    fixed_cell{ii} = magic(20);
end
toc

% Java LinkedList
tic;
linked_list = java.util.LinkedList;
for ii = 1:scale
    linked_list.add(magic(20));
end
toc;
Results:
Elapsed time is 0.102684 seconds. % dynamic
Elapsed time is 0.091507 seconds. % pre-allocated
Elapsed time is 0.189757 seconds. % Java LinkedList
I varied scale and the size of magic(20) and found the dynamic and pre-allocated versions very close in speed. Perhaps a cell array stores only pointer-like references, which makes resizing it cheap.
The Java way is slower, and I find it sometimes unstable (it crashes my MATLAB when scale is very large).