Matlab Preallocation, guess a large matrix or a small one?

According to this question, I should try to use preallocation in MATLAB.
Now I have a situation where I cannot calculate the exact size of the matrix to preallocate; I can only guess it.
Suppose the actual size of the matrix is 100, but I don't know it in advance.
Which scenario is more efficient:
Should I be lavish? Guess a large matrix and remove the extra rows at the end.
Should I be stingy? Guess a small size and, if it turns out to be wrong, add new rows.
Thanks.

In my opinion, the answer is a bit more complex than portrayed by @natan.
I think there are two factors his answer does not take into account:
Possible necessary copies of memory: when you under-estimate a matrix size and re-allocate it, all its old values must be copied to the newly allocated location.
Continuity of memory chunks: sometimes Matlab is able to allocate the new memory contiguously at the end of the old matrix. In principle, in such a scenario the old values need not be copied to the new location, since it is the same as the old one, just bigger. However, if you add rows to a 2D matrix, the content needs to be copied even in this scenario, since Matlab stores matrices column-major in memory (a sketch of this effect follows below).
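A minimal, hedged sketch of this effect (my addition, not part of the original answer; exact timings vary by machine and MATLAB version):
n = 2000;
A = zeros(n);
tic; A(:, end+(1:100)) = 0; toc  % grow by columns: old data stays contiguous, one straight copy at most
A = zeros(n);
tic; A(end+(1:100), :) = 0; toc  % grow by rows: the old columns must be re-interleaved in memory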
So, my answer is this:
First of all, what exactly don't you know about the size of the matrix? If you know one dimension, make it the number of rows of your matrix, so that you only need to change the number of columns. This way, if your already stored data does need to be copied, it will be copied in larger chunks.
Second, it depends on how much free RAM you have at your disposal.
If you are not short on RAM, then there's nothing wrong with over-estimating.
However, if you are short on RAM, consider under-estimating, BUT grow the block size each time you re-allocate:
BASIC_SIZE = X;  % first estimate
NEW_SIZE = Y;    % if we need more, add this amount
factor = 2;      % growth factor for NEW_SIZE, should be > 1
arr = zeros( m, BASIC_SIZE ); % first allocation, assuming we know the number of rows
while someCondition
    % process arr ...
    if needMoreCols
        arr(:, size(arr,2) + (1:NEW_SIZE)) = 0; % allocate another block of columns
        NEW_SIZE = round(NEW_SIZE * factor);    % estimate was off; try a larger chunk next time
    end
end
arr = arr(:, 1:actualNumOfCols); % trim to actual size, discarding the unnecessary columns

+1 for the interesting question.
EDITED Answer:
From a little experimental study, at first it seemed better to add rows later, but it now seems more efficient to overestimate and preallocate again once you have the information about the correct size. I started with a matrix of size 3000 and assumed a 10% error in the size estimate; see below:
clear all
clc
guess_size = 3000;
m = zeros(guess_size);
%1. oops, overestimated: take out rows
tic
m(end-300:end,:) = [];
toc
%1b. oops, overestimated: preallocate again
tic
m = zeros(guess_size-300, guess_size);
toc
%2. oops, overestimated: take out cols
m = zeros(guess_size);
tic
m(:,end-300:end) = [];
toc
%2b. oops, overestimated: preallocate again
m = zeros(guess_size);
tic
m = zeros(guess_size, guess_size-300);
toc
%3. oops, underestimated: add rows
m = zeros(guess_size);
tic
m = zeros(guess_size+300, guess_size);
toc
%4. oops, underestimated: add cols
m = zeros(guess_size);
tic
m = zeros(guess_size, guess_size+300);
toc
Elapsed time is 0.041893 seconds. % 1.  take out rows
Elapsed time is 0.026925 seconds. % 1b. preallocate again
Elapsed time is 0.041818 seconds. % 2.  take out cols
Elapsed time is 0.023425 seconds. % 2b. preallocate again
Elapsed time is 0.027523 seconds. % 3.  add rows
Elapsed time is 0.029509 seconds. % 4.  add cols
Options 1b and 2b are slightly faster than underestimating, so if you can, it is better to overestimate and then preallocate again. It is never efficient to delete rows from an array. Adding columns also seems slightly more efficient than adding rows, but this is just a quick and dirty job. See @Shai's detailed answer for the inner workings...

In addition to the other educational answers, the short short version:
There are three cases:
The size of the array is relatively small (up to thousands of bytes) -> it doesn't really matter.
The array is big, but you are not bounded by the amount of memory your system has -> overestimate.
The array is big, and you are bounded by the amount of memory your system has -> do what Shai suggested.


Saving time and memory using parfor?

Consider prova.mat in MATLAB, obtained in the following way:
for w=1:100
    for p=1:9
        A{p}=randn(100,1);
    end
    baseA_.A=A;
    eval(['baseA.A' num2str(w) '= baseA_;'])
end
save(sprintf('prova.mat'),'-v7.3', 'baseA')
To have an idea of the actual dimensions in my data, the 1x9 cell in A1 is composed by the following 9 arrays: 904x5, 913x5, 1722x5, 4136x5, 9180x5, 3174x5, 5970x5, 4455x5, 340068x5. The other Aj's have a similar composition.
Consider the following code:
clear all
load prova
tic
parfor w=1:100
    indA=sprintf('A%d', w);
    Aarr=baseA.(indA).A;
    Boot=[];
    for p=1:9
        C=randn(100,1).*Aarr{p};
        Boot=[Boot; C];
    end
    D{w}=Boot;
end
toc
If I run the parfor loop with 4 local workers on my MacBook Pro, it takes 1.2 sec. Replacing parfor with for, it takes 0.01 sec.
With my actual data, the difference in time is 31 sec versus 7 sec [the creation of the matrix C is also more complicated].
If I have understood correctly, the problem is that the computer has to send baseA to each local worker, and this takes time and memory.
Could you suggest a solution that is able to make parfor more convenient than for? I thought that saving all the cells in baseA was a way to save time by loading once at the beginning, but maybe I'm wrong.
General information
A lot of functions have implicit multi-threading built in. When using these functions, a parfor loop is no more efficient than a serial for loop, since all cores are already being used. parfor will actually be a detriment in this case, since it has allocation overhead whilst being only as parallel as the function you are trying to use. A quick sketch of this effect follows.
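As a rough, machine-dependent illustration (my sketch, not from the original answer): capping MATLAB's own computational threads with maxNumCompThreads shows how much speed a builtin like matrix multiplication already gets from implicit multithreading, which is speed parfor cannot add.
A = randn(3000);              % big enough for the multithreaded BLAS to matter
nOld = maxNumCompThreads(1);  % force single-threaded execution, remember the old setting
tic; B = A*A; toc             % multiply on one core
maxNumCompThreads(nOld);      % restore the previous thread count
tic; B = A*A; toc             % same multiply, implicit multithreading back on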
When not using one of the implicitly multithreaded functions, parfor is basically recommended in two cases: lots of iterations in your loop (e.g., 1e10), or iterations that each take a very long time (e.g., eig(magic(1e4))). In the second case you might want to consider using spmd (slower than parfor in my experience). The reason parfor is slower than a for loop for short ranges or fast iterations is the overhead needed to manage all workers correctly, as opposed to just doing the calculation.
Check this question for information on splitting data between separate workers.
Benchmarking
Code
Consider the following example to see the behaviour of for as opposed to that of parfor. First open the parallel pool if you've not already done so:
gcp; % Opens a parallel pool using your current settings
Then execute a couple of large loops:
n = 1000;                % Iteration number
EigenValues = cell(n,1); % Prepare to store the data
Time = zeros(n,1);
for ii = 1:n
    tic
    EigenValues{ii,1} = eig(magic(1e3)); % Might want to lower the magic if it takes too long
    Time(ii,1) = toc;    % Collect time after each iteration
end
figure; % Create a plot of the results
plot(1:n, Time)
title 'Time per iteration'
ylabel 'Time [s]'
xlabel 'Iteration number [-]';
Then do the same with parfor instead of for. You will notice that the average time per iteration goes up (0.27s to 0.39s for my case). Do realise however that the parfor used all available workers, thus the total time (sum(Time)) has to be divided by the number of cores in your computer. So for my case the total time went down from around 270s to 49s, since I have an octacore processor.
So, whilst the time to do each separate iteration goes up using parfor with respect to using for, the total time goes down considerably.
Results
This picture shows the results of the test as I just ran it on my home PC. I used n=1000 and eig(magic(500)); my computer has an i5-750 2.66GHz processor with four cores and runs MATLAB R2012a. As you can see, the average of the parallel test hovers around 0.29s with a lot of spread, whilst the serial code is quite steady around 0.24s. The total time, however, went down from 234s to 72s, which is a speed-up of 3.25 times. The reason that this is not exactly 4 is the overhead, as expressed in the extra time each iteration takes: MATLAB has to check what each core is doing, make sure that each loop iteration is performed only once, and put the data into the correct storage location.
Slice broadcasted data into a cell array
The following approach works for data which is looped by group. It does not matter what the grouping variable is, as long as it is determined before the loop. The speed advantage is huge.
A simplified example of such data is the following, with the first column containing a grouping variable:
ngroups = 1000;
nrows = 1e6;
data = [randi(ngroups,[nrows,1]), randn(nrows,1)];
data(1:5,:)
ans =
620 -0.10696
586 -1.1771
625 2.2021
858 0.86064
78 1.7456
Now, suppose for simplicity that I am interested in the sum() by group of the values in the second column. I can loop by group, index the elements of interest and sum them up. I will perform this task with a for loop, a plain parfor and a parfor with sliced data, and will compare the timings.
Keep in mind that this is a toy example and I am not interested in alternative solutions like bsxfun(), this is not the point of the analysis.
Results
Borrowing the same type of plot from Adriaan, I first confirm the same findings about plain parfor vs for. Second, both methods are completely outperformed by the parfor on sliced data which takes a bit more than 2 seconds to complete on a dataset with 10 million rows (the slicing operation is included in the timing). The plain parfor takes 24s to complete and the for almost twice that amount of time (I am on Win7 64, R2016a and i5-3570 with 4 cores).
The main point of slicing the data before starting the parfor is to avoid:
the overhead from the whole data being broadcast to the workers,
indexing operations into ever growing datasets.
The code
ngroups = 1000;
nrows = 1e7;
data = [randi(ngroups,[nrows,1]), randn(nrows,1)];
% Simple for
[out,t] = deal(NaN(ngroups,1));
overall = tic;
for ii = 1:ngroups
    tic
    idx = data(:,1) == ii;
    out(ii) = sum(data(idx,2));
    t(ii) = toc;
end
s.OverallFor = toc(overall);
s.TimeFor = t;
s.OutFor = out;

% Parfor
try parpool(4); catch, end
[out,t] = deal(NaN(ngroups,1));
overall = tic;
parfor ii = 1:ngroups
    tic
    idx = data(:,1) == ii;
    out(ii) = sum(data(idx,2));
    t(ii) = toc;
end
s.OverallParfor = toc(overall);
s.TimeParfor = t;
s.OutParfor = out;

% Sliced parfor
[out,t] = deal(NaN(ngroups,1));
overall = tic;
c = cache2cell(data, data(:,1));
s.TimeDataSlicing = toc(overall);
parfor ii = 1:ngroups
    tic
    out(ii) = sum(c{ii}(:,2));
    t(ii) = toc;
end
s.OverallParforSliced = toc(overall);
s.TimeParforSliced = t;
s.OutParforSliced = out;

x = 1:ngroups;
h = plot(x, s.TimeFor,'xb', x, s.TimeParfor,'+r', x, s.TimeParforSliced,'.g');
set(h,'MarkerSize',1)
title 'Time per iteration'
ylabel 'Time [s]'
xlabel 'Iteration number [-]';
legend({sprintf('for : %5.2fs',s.OverallFor),...
        sprintf('parfor : %5.2fs',s.OverallParfor),...
        sprintf('parfor_sliced: %5.2fs',s.OverallParforSliced)},...
       'interpreter', 'none','fontname','courier')
You can find cache2cell() on my github repo.
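If you prefer to avoid that dependency, here is a rough stand-in (my sketch, not the author's cache2cell implementation) that slices the rows by the integer group ids in the first column using accumarray:
group = data(:,1);                            % integer group labels 1..ngroups
c = accumarray(group, (1:size(data,1)).', [], ...
    @(idx) {data(sort(idx),:)});              % sort(idx): accumarray does not guarantee element order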
Simple for on sliced data
You might wonder what happens if we run the simple for on the sliced data. For this simple toy example, if we take away the indexing operation by slicing the data, we remove the only bottleneck of the code, and the for will actually be slightly faster than the parfor.
However, this is a toy example where the cost of the inner loop is completely taken by the indexing operation. Hence, for the parfor to be worthwhile, the inner loop should be more complex and/or spread out.
Saving memory with sliced parfor
Now, assuming that your inner loop is more complex and the simple for loop is slower, let's look at how much memory we save by avoiding broadcasted data in a parfor with 4 workers and a dataset with 50 million rows (for about 760 MB in RAM).
As you can see, almost 3 GB of additional memory are sent to the workers. The slice operation needs some memory to be completed, but still much less than the broadcasting operation and can in principle overwrite the initial dataset, hence bearing negligible RAM cost once completed. Finally, the parfor on the sliced data will only use a small fraction of memory, i.e. that amount that corresponds to slices being used.
Sliced into a cell
The raw data is sliced by group and each section is stored into a cell. Since a cell array is an array of references we basically partitioned the contiguous data in memory into independent blocks.
While our sample data looked like this
data(1:5,:)
ans =
620 -0.10696
586 -1.1771
625 2.2021
858 0.86064
78 1.7456
our sliced c looks like
c(1:5)
ans =
[ 969x2 double]
[ 970x2 double]
[ 949x2 double]
[ 986x2 double]
[1013x2 double]
where c{1} is
c{1}(1:5,:)
ans =
1 0.58205
1 0.80183
1 -0.73783
1 0.79723
1 1.0414

Recursive loop optimization

Is there a way to rewrite my code to make it faster?
for i = 2:length(ECG)
    u(i) = max([a*abs(ECG(i)) b*u(i-1)]);
end
My problem is the length of ECG.
You should pre-allocate u like this
>> u = zeros(size(ECG));
or possibly like this
>> u = NaN(size(ECG));
or maybe even like this
>> u = -Inf(size(ECG));
depending on what behaviour you want.
When you pre-allocate a vector, MATLAB knows how big the vector is going to be and reserves an appropriately sized block of memory.
If you don't pre-allocate, then MATLAB has no way of knowing how large the final vector is going to be. Initially it will allocate a short block of memory. If you run out of space in that block, it has to find a bigger block of memory somewhere and copy all the old values into the new block. This happens every time you run out of space in the allocated block (which may not be every time you grow the array, because the MATLAB runtime is probably smart enough to ask for a bit more memory than it immediately needs, but it still happens far more often than with pre-allocation). All this unnecessary reallocating and copying is what takes a long time.
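A quick, hedged illustration of that cost (my addition; absolute times are machine-dependent, and recent MATLAB versions amortize array growth, so the gap is smaller than it once was):
n = 1e5;  % kept modest so the grown version finishes quickly even on older MATLAB versions
tic; u1 = [];         for i = 1:n, u1(i) = i; end; toc  % grown element by element
tic; u2 = zeros(1,n); for i = 1:n, u2(i) = i; end; toc  % preallocated once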
There are several ways to optimize this for loop, but, surprisingly, memory pre-allocation is not the part that saves the most time. Not by far. You're using max to find the largest element of a 1-by-2 vector, and you build that vector on each iteration. However, all you're doing is comparing two scalars. Using the two-argument form of max and passing it two scalars is MUCH faster: 75+ times faster on my machine for large ECG vectors!
% Set the parameters and create a vector with a million elements
a = 2;
b = 3;
n = 1e6;
ECG = randn(1,n);
ECG2 = a*abs(ECG); % This can be done outside the loop if you have the memory
u(1,n) = 0;        % Fast zero allocation
for i = 2:length(ECG)
    u(i) = max(ECG2(i), b*u(i-1)); % Compare two scalars
end
For the single input form of max (not including creation of random ECG data):
Elapsed time is 1.314308 seconds.
For my code above:
Elapsed time is 0.017174 seconds.
FYI, the code above assumes u(1) = 0. If that's not true, then u(1) should be set to its value after preallocation.

Smart way to extend matrix in matlab

For example, I have a 2x2 matrix. Now I have to extend it by one column on each of the left and right sides, and by one row on each of the top and bottom, giving a 4x4 matrix with the old matrix located at the center of the new one. Is there any way to do this fast, rather than creating a new matrix and transferring the values from the old one to the new one?
Thank you very much.
You will always have to allocate new memory for the new array, no matter what you do.
Also, if your matrix is only 2x2, the speed of any approach is good enough. Or do you want to handle larger matrices as well? Then, consider the following tests of two methods you can use:
A = rand(5000);
% explicitly add zero vectors on all sides of A
tic;
B = [zeros(1, size(A,2)+2);
     zeros(size(A,1), 1) A zeros(size(A,1), 1);
     zeros(1, size(A,2)+2)];
toc
Elapsed time is 0.204940 seconds.
% create the output array and assign A to the correct sub-matrix
tic
B = zeros(size(A)+2);
B(2:end-1,2:end-1) = A;
toc
Elapsed time is 0.102501 seconds.
Another option is
B = padarray(A,[1,1],'both');
For speed (at least for my computer), this is between the two methods suggested by angainor, and it has the advantage that you don't have to create a new variable if you prefer not to.
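As a quick sanity check of that call (my addition; padarray ships with the Image Processing Toolbox and pads with zeros by default):
A = magic(2);                   % 2x2 input
B = padarray(A, [1 1], 'both'); % 4x4 output with A in the center
isequal(B(2:3,2:3), A)          % returns true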

Growable data structure in MATLAB

I need to create a queue in matlab that holds structs which are very large. I don't know how large this queue will get. Matlab doesn't have linked lists, and I'm worried that repeated allocation and copying is really going to slow down this code which must be run thousands of times. I need some sort of way to use a growable data structure. I've found a couple of entries for linked lists in the matlab help but I can't understand what's going on. Can someone help me with this problem?
I posted a solution a while back to a similar problem. The way I tried it is by allocating the array with an initial size BLOCK_SIZE, and then I keep growing it by BLOCK_SIZE as needed (whenever there's less than 10%*BLOCK_SIZE free slots).
Note that with an adequate block size, performance is comparable to pre-allocating the entire array from the beginning. Please see the other post for a simple benchmark I did.
Just create an array of structs and double the size of the array when it hits the limit. This scales well.
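A minimal sketch of that doubling strategy (the names buf/payload are illustrative, not from the original answer). Appends are amortized O(1) because the struct array is only re-grown when the capacity doubles:
capacity = 16;
buf = struct('payload', cell(1, capacity));  % preallocated 1x16 struct array
count = 0;
for k = 1:1000
    if count == capacity                     % buffer full: double it
        capacity = 2*capacity;
        buf(capacity).payload = [];          % a single assignment grows the array
    end
    count = count + 1;
    buf(count).payload = rand(10);           % enqueue the k-th element
end
buf = buf(1:count);                          % trim the unused tail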
If you're worried that repeated allocation and copying is going to slow the code down, try it. It may in fact be very slow, but you may be pleasantly surprised.
Beware of premature optimization.
Well, I found the easy answer:
L = java.util.LinkedList;
I think the built-in cell structure would be suitable for storing growable structures.
I made a comparison among:
Dynamic size cell, size of the cell changes every loop
Pre-allocated cell
Java LinkedList
Code:
clear;
scale = 1000;
% dynamic size cell
tic;
dynamic_cell = cell(0);
for ii = 1:scale
    dynamic_cell{end + 1} = magic(20);
end
toc
% preallocated cell
tic;
fixed_cell = cell(1, scale);
for ii = 1:scale
    fixed_cell{ii} = magic(20);
end
toc
% java linked list
tic;
linked_list = java.util.LinkedList;
for ii = 1:scale
    linked_list.add(magic(20));
end
toc;
Results:
Elapsed time is 0.102684 seconds. % dynamic
Elapsed time is 0.091507 seconds. % pre-allocated
Elapsed time is 0.189757 seconds. % Java LinkedList
I changed scale and magic(20) and found that the dynamic and pre-allocated versions are very close in speed. Maybe a cell only stores pointer-like structures, so it is efficient at resizing.
The Java way is slower. And I find it sometimes unstable (it crashed my MATLAB when the scale was very large).

Turning a binary matrix into a vector of the last nonzero index in a fast, vectorized fashion

Suppose, in MATLAB, that I have a matrix, A, whose elements are either 0 or 1.
How do I get a vector of the index of the last non-zero element of each column in a faster, vectorized way?
I could do
[B, I] = max(cumsum(A));
and use I, but is there a faster way? (I'm assuming cumsum would cost a bit of time even summing 0's and 1's.)
Edit: I guess I asked for more vectorization than I really need for speed. Mr Fooz's loop is great, but each loop in MATLAB seems to cost me a lot in debugging time, even if it is fast.
Fast is what you should worry about, not necessarily full vectorization. Recent versions of Matlab are much smarter about handling loops efficiently. If there's a compact vectorized way of expressing something, it's usually faster, but loops should not (always) be feared like they used to be.
clc
A = rand(5000) > 0.5;
A(1, find(sum(A,1)==0)) = 1; % make sure there is at least one match
% Slow because it is doing too much work
tic; [B,I1] = max(cumsum(A)); toc
% Fast because FIND is fast and it runs the inner loop
tic;
I3 = zeros(1,5000);
for i = 1:5000
    I3(i) = find(A(:,i), 1, 'last');
end
toc;
assert(all(I1==I3));
% Even faster because the JIT in Matlab is smart enough now
tic;
I2 = zeros(1,5000);
for i = 1:5000
    I2(i) = 0;
    for j = 5000:-1:1
        if A(j,i)
            I2(i) = j;
            break;
        end
    end
end
toc;
assert(all(I1==I2));
On R2008a, Windows, x64, the cumsum version takes 0.9 seconds. The loop and find version takes 0.02 seconds. The double loop version takes a mere 0.001 seconds.
EDIT: Which one is fastest depends on the actual data. The double-loop takes 0.05 seconds when you change the 0.5 to 0.999 (because, on average, it takes longer to hit the break). cumsum and the loop & find implementations have more consistent speeds.
EDIT 2: gnovice's flipud solution is clever. Unfortunately, on my test machine it takes 0.1 seconds, so it's much faster than cumsum, but slower than the looped versions.
As shown by Mr Fooz, for loops can be pretty fast now with newer versions of MATLAB. However, if you really want to have compact vectorized code, I would suggest trying this:
[B,I] = max(flipud(A));
I = size(A,1)-I+1;
This is faster than your CUMSUM-based answer, but still not quite as fast as Mr Fooz's looping options.
Two additional things to consider:
What results do you want to get for a column that has no ones in it at all? With the option I gave above, I believe you will get an index of size(A,1) (i.e. the number of rows in A) in such a case. With your option, I believe you will get 1, while the nested-for-loops option from Mr Fooz will give you 0 (see the sketch after this list).
The relative speed of these different options will likely vary based on the size of A and the number of non-zeroes you expect it to have.
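If the 0-for-empty-columns behaviour of the nested loops is the one you want, here is a small hedged patch of the flipud approach (my addition, not part of the original answers):
[B, I] = max(flipud(A));
I = size(A,1) - I + 1;
I(~any(A,1)) = 0;  % columns with no ones report 0, matching the double-loop version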