Guide to Optimizing MATLAB Code - matlab

I have noticed many individual questions on SO but no one good guide to MATLAB optimization.
Common Questions:
Optimize this code for me
How do I vectorize this?
I don't think that these questions will stop, but I'm hoping that the ideas presented here will them something centralized to refer to.
Optimizing Matlab code is kind of a black-art, there is always a better way to do it. And sometimes it is straight-up impossible to vectorize your code.
So my question is: when vectorization is impossible or extremely complicated, what are some of your tips and tricks to optimize MATLAB code? Also if you have any common vectorization tricks I wouldn't mind seeing them either.

Preface
All of these tests are performed on a machine that is shared with others, so it is not a perfectly clean environment. Between each test I clear the workspace to free up memory.
Please don't pay attention to the individual numbers, just look at the differences between the before and after optimisation times.
Note: The tic and toc calls I have placed in the code are to show where I am measuring the time taken.
Pre-allocation
The simple act of pre-allocating arrays in Matlab can give a huge speed advantage.
tic;
for i = 1:100000
my_array(i) = 5 * i;
end
toc;
This takes 47 seconds
tic;
length = 100000;
my_array = zeros(1, length);
for i = 1:length
my_array(i) = 5 * i;
end
toc;
This takes 0.1018 seconds
47 seconds to 0.1 seconds for a single line of code added is an amazing improvement. Obviously in this simple example you could vectorize it to my_array = 5 * 1:100000 (which took 0.000423 seconds) but I am trying to represent the more complicated times when vectorization isn't an option.
I recently found that the zeros function (and others of the same nature) are not as fast at pre-allocating as simply setting the last value to 0:
tic;
length = 100000;
my_array(length) = 0;
for i = 1:length
my_array(i) = 5 * i;
end
toc;
This takes 0.0991 seconds
Now obviously this tiny difference doesn't prove much but you'll have to believe me over a large file with many of these optimisations the difference becomes a lot more apparent.
Why does this work?
The pre-allocation methods allocate a chunk of memory for you to work with. This memory is contiguous and can be pre-fetched, just like an Array in C++ or Java. However if you do not pre-allocate then MATLAB will have to dynamically find more and more memory for you to use. As I understand it, this behaves differently to a Java ArrayList and is more like a LinkedList where different chunks of the array are split all over the place in memory.
Not only is this slower when you write data to it (47 seconds!) but it is also slower every time you access it from then on. In fact, if you absolutely CAN'T pre-allocate then it is still useful to copy your matrix to a new pre-allocated one before you start using it.
What if I don't know how much space to allocate?
This is a common question and there are a few different solutions:
Overestimation - It is better to grossly overestimate the size of your matrix and allocate too much space, than it is to under-allocate space.
Deal with it and fix later - I have seen this a lot where the developer has put up with the slow population time, and then copied the matrix into a new pre-allocated space. Usually this is saved as a .mat file or similar so that it could be read quickly at a later date.
How do I pre-allocate a complicated structure?
Pre-allocating space for simple data-types is easy, as we have already seen, but what if it is a very complex data type such as a struct of structs?
I could never work out to explicitly pre-allocate these (I am hoping someone can suggest a better method) so I came up with this simple hack:
tic;
length = 100000;
% Reverse the for-loop to start from the last element
for i = 1:length
complicated_structure = read_from_file(i);
end
toc;
This takes 1.5 minutes
tic;
length = 100000;
% Reverse the for-loop to start from the last element
for i = length:-1:1
complicated_structure = read_from_file(i);
end
% Flip the array back to the right way
complicated_structure = fliplr(complicated_structure);
toc;
This takes 6 seconds
This is obviously not perfect pre-allocation, and it takes a little while to flip the array afterwards, but the time improvements speak for themselves. I'm hoping someone has a better way to do this, but this is a pretty good hack in the mean time.
Data Structures
In terms of memory usage, an Array of Structs is orders of magnitude worse than a Struct of Arrays:
% Array of Structs
a(1).a = 1;
a(1).b = 2;
a(2).a = 3;
a(2).b = 4;
Uses 624 Bytes
% Struct of Arrays
a.a(1) = 1;
a.b(1) = 2;
a.a(2) = 3;
a.b(2) = 4;
Uses 384 Bytes
As you can see, even in this simple/small example the Array of Structs uses a lot more memory than the Struct of Arrays. Also the Struct of Arrays is in a more useful format if you want to plot the data.
Each Struct has a large header, and as you can see an array of structs repeats this header multiple times where the struct of arrays only has the one header and therefore uses less space. This difference is more obvious with larger arrays.
File Reads
The less number of freads (or any system call for that matter) you have in your code, the better.
tic;
for i = 1:100
fread(fid, 1, '*int32');
end
toc;
The previous code is a lot slower than the following:
tic;
fread(fid, 100, '*int32');
toc;
You might think that's obvious, but the same principle can be applied to more complicated cases:
tic;
for i = 1:100
val1(i) = fread(fid, 1, '*float32');
val2(i) = fread(fid, 1, '*float32');
end
toc;
This problem is no longer simple because in memory the floats are represented like this:
val1 val2 val1 val2 etc.
However you can use the skip value of fread to achieve the same optimizations as before:
tic;
% Get the current position in the file
initial_position = ftell(fid);
% Read 100 float32 values, and skip 4 bytes after each one
val1 = fread(fid, 100, '*float32', 4);
% Set the file position back to the start (plus the size of the initial float32)
fseek(fid, position + 4, 'bof');
% Read 100 float32 values, and skip 4 bytes after each one
val2 = fread(fid, 100, '*float32', 4);
toc;
So this file read was accomplished using two freads instead of 200, a massive improvement.
Function Calls
I recently worked on some code that used many function calls, all of which were located in separate files. So lets say there were 100 separate files, all calling each other. By "inlining" this code into one function I saw a 20% improvement in execution speed from 9 seconds.
Obviously you would not do this at the expense of re-usability, but in my case the functions were automatically generated and not reused at all. But we can still learn from this and avoid excessive function calls where they are not really needed.
External MEX functions incur an overhead for being called. Therefore one call to a large MEX function is a lot more efficient than many calls to smaller MEX functions.
Plotting Many Disconnected Lines
When plotting disconnected data such as a set of vertical lines, the traditional way to do this in Matlab is to iterate multiple calls to line or plot using hold on. However if you have a large number of individual lines to plot, this becomes very slow.
The technique I have found uses the fact that you can introduce NaN values into data to plot and it will cause a break in the data.
The below contrived example converts a set of x_values, y1_values, and y2_values (where the line is from [x, y1] to [x, y2]) to a format appropriate for a single call to plot.
For example:
% Where x is 1:1000, draw vertical lines from 5 to 10.
x_values = 1:1000;
y1_values = ones(1, 1000) * 5;
y2_values = ones(1, 1000) * 10;
% Set x_plot_values to [1, 1, NaN, 2, 2, NaN, ...];
x_plot_values = zeros(1, length(x_values) * 3);
x_plot_values(1:3:end) = x_values;
x_plot_values(2:3:end) = x_values;
x_plot_values(3:3:end) = NaN;
% Set y_plot_values to [5, 10, NaN, 5, 10, NaN, ...];
y_plot_values = zeros(1, length(x_values) * 3);
y_plot_values(1:3:end) = y1_values;
y_plot_values(2:3:end) = y2_values;
y_plot_values(3:3:end) = NaN;
figure; plot(x_plot_values, y_plot_values);
I have used this method to print thousands of tiny lines and the performance improvements were immense. Not only in the initial plot, but the performance of subsequent manipulations such as zoom or pan operations improved as well.

Related

Large workspace variables affect multiplication runtime of unrelated variables

Old Title: *Small matrix multiplication much slower in R2016b than R2016a*
(update below)
I find that multiplication of small matrices seems much smaller in R2016b than R2016a. Here's a minimal example:
r = rand(50,100);
s = rand(100,100);
tic; r * s; toc
This takes about 0.0012s in R2016a and 0.018s R2016b.
Creating an artificial loop to make sure this isn't just some initial overhead or something leads to the same loss factor:
tic; for i = 1:1000, a = r*s; end, toc
This takes about 0.18s in R2016a and 2.1s R2016b.
Once I make the matrices much bigger, say r = rand(500,1000); and s = rand(1000,1000), the version behave similarly (R2016b even seems to be ~15% faster). Anyone have any insight as to why this is, or can verify this behavior on another system?
I wonder if it has to do with the new arithmetic expansions implementation (if this feature has some cost for small matrix multiplication): http://blogs.mathworks.com/loren/2016/10/24/matlab-arithmetic-expands-in-r2016b/
update
After many tests, I discovered that this difference was not between MATLAB versions (my apologies). Instead, it seems to be a difference of what's in my base workspace... and worse, the type of variable that's in the base workspace.
I cleared a huge workspace (which had many large cell arrays with many small, differently sized matrix entries). If I clear the variables and do the timing of r*s, I get much faster runtime (x10-x100) than before the workspace was loaded.
So the question is, why does having variables in the workspace affect the matrix multiplication of two small variables? And even more, why does having certain types of variables slow down the workspace dramatically.
Here's an example where a large variable in cell form in the workspace affects the runtime of the matrix multiplication or two unrelated matrices. If I collapse this cell to a matrix, the effect goes away.
clear;
ticReps = 10000;
nCells = 100;
aa = rand(50,100);
bb = rand(100, 100);
% test original timing
tic; for i = 1:ticReps, aa * bb; end
fprintf('original: %3.3f\n', toc);
% make some matrices inside a large number of cells
q = cell(nCells, nCells);
for i = 1:nCells * nCells
q{i} = sprand(10000,10000, 0.0001);
end
% the timing again
tic; for i = 1:ticReps, aa * bb; end
fprintf('after large q cell: %3.3f\n', toc);
% make q into a matrix
q = cat(2, q{:});
% the timing again
tic; for i = 1:ticReps, aa * bb; end
fprintf('after large q matrix: %3.3f\n', toc);
clear q
% the timing again
tic; for i = 1:ticReps, aa * bb; end
fprintf('after clear q: %3.3f\n', toc);
In both staged, q takes up about 2Gb. Result:
original: 0.183
after large q cell: 0.320
after large q matrix: 0.175
after clear q: 0.184
I've received an update from mathworks.
As far as I understand it, they say that this is the fault of the Windows memory manager, which allots memory to large cell arrays in a fairly fragmented manner. Since the (unrelated) multiplication needs memory (for the output), getting this piece of memory now takes longer time due to the memory fragmentation caused by the cell. Linux (as tested) does not have this issue.

Incremental appending: How to avoid performance penalty of struct arrays

If you must incrementally append data to arrays, it seems that using individual vectors of basic data types is orders of magnitude faster than an array of structs (with one vector element per record). Even trying to collect the individual vectors into a struct seems to double the time. The tests are:
N=5e4;
fprintf('\nstruct array (array of structs):\n')
clear x y;
y=struct( 'a',[], 'b',[], 'c',[], 'd',[] );
tic
for iIns = 1 : N
x.a=rand; x.b=rand; x.c=rand; x.d=rand;
y(end+1)=x;
end % for iIns
toc
fprintf('\nSeparate arrays of scalars:\n')
clear a b c d;
a=[]; b=[]; c=[]; d=[];
tic
for iIns = 1 : N
a(end+1) = rand;
b(end+1) = rand;
c(end+1) = rand;
d(end+1) = rand;
end % for iIns
toc
fprintf('\nA struct with arrays of scalars for fields:\n')
clear a b c d x y
x.a=[]; x.b=[]; x.c=[]; x.d=[];
tic
for iIns = 1:N
x.a(end+1)=rand;
x.b(end+1)=rand;
x.c(end+1)=rand;
x.d(end+1)=rand;
end % for iIns
toc
The results:
struct array (array of structs):
Elapsed time is 24.127274 seconds.
Separate arrays of scalars:
Elapsed time is 0.048190 seconds.
A struct with arrays of scalars for fields:
Elapsed time is 0.084624 seconds.
Even though collecting individual vectors of basic data types into a struct (3rd scenario above) imposes such a penalty, it may be preferrable to simply using individual vectors (second scenario above) because the variables are more organized. Your variable name space isn't filled up with so many variables which are in fact conceptually grouped.
That's quite a significant penalty, however, to pay for such organization. I don't suppose there is way to avoid this?
There are two ways to avoid this performance penalty: (1) pre-allocate, and (2) rethink your stance on "organizing" variables. I suggest both. Oh, and if you can, don't use arrays of structs where each field only uses scalars - if your application suddenly has to handle a couple of orders of magnitude more data, the memory overhead will force you to rewrite everything.
Pre-allocation
You often know how many elements your array will end up having. Thus, initialize your arrays as s = struct('a',NaN(1:N),'b',NaN(1:N)); If you don't know ahead of time how many entries there will be, but you can estimate an upper limit, initialize with the upper limit, and either remove the elements, or use functions (e.g. nanmean) that do not care if the array has a few extra NaNs in the end. If you truly know nothing about the final size (except that N will be large enough to matter), pre-allocate with a nice number (e.g. N=1337), and extend the array in chunks. MathWorks have sped up dynamic growing of numeric arrays in a recent release, but as you demonstrate in your answer, the optimization has not been applied to structs yet. Don't count MathWorks' optimization team to fix your code.
Nice variables
Why worry about your variable space? As long as you use explicitVariableNames, your code remains readable and you will have an easy time picking out the right variable. But ok, let's say you want to clean up: The first way to keeping the number of active variables low is to use clear or keep at strategic points in your code to make sure you only keep around what's needed. The second (assuming you want to optimize for performance), is to put contextually linked vectors into the same array: objectDimensions = [lengthOfObject, widthOfObject, heightOfObject]. This keeps everything as numeric arrays (which are fastest), and allows easy vectorization such as objectVolume = prod(objectDimensions,2);.
/aside: I should disclose that I used to use structures frequently for assembling results (so that I could return a lot of information a single variable and have the field names be part of the documentation). I have since switched to use object-oriented-programming (usually handle-objects), which no only collect related variables, but also the associated functionality, and which facilitate code re-use. I do take a performance hit, but the time it saves me coding makes more than up for it. Note that I do pre-allocate if at all possible (and if it's not just growing an array three times).
Example
Assume you have a function getDimensions that reads dimensions (length, height, width) of objects. However, sometimes, the object is 2D, sometimes it is 3D. Thus, you want to fill the following variables: twoD.length, twoD.width, threeD.length, threeD.width, threeD.height, ideally as arrays of structs, so that each element of a struct corresponds to an object. You do not know ahead of time how many objects there are, all you can do is poll the function thereAreMoreObjects, which returns true or false, until there are no more objects.
Here's how you can do this with reasonable efficiency and growing arrays by chunks:
%// preassign the temporary variable, and some others
chunkSize = 1000;
numObjects = 0;
idAndDimensions = zeros(chunkSize,4);
while thereAreMoreObjects()
objectId = getCurrentObjectId();
%// hi==-1 if it's flat
[len,wid,hi] = getObjectDimensions(objectId);
%// allocate more, if needed
numObjects = numObjects + 1;
if numObjects > size(idAndDimensions,1)
%// grow array
idAndDimensions(end+chunkSize,1) = 0;
end
idAndDimensions(numObjects,:) = [objectId, len, wid, hi];
end
%// throw away excess
idAndDimensions = idAndDimensions(1:numObjects,:);
%// split into 2D and 3D objects
isTwoD = numObjects(:,end) == -1;
%// assign twoD struct
twoD = struct('id',num2cell(idAndDimensions(isTwoD,1),...
'length',num2cell(idAndDimensions(isTwoD,2),...
'width',num2cell(idAndDimensions(isTwoD,3));
%// assign threeD struct
%// clean up - we need only the two structs
%// I use keep from the File Exchange instead of clearvars
clearvars -except twoD threeD

Saving time and memory using parfor?

Consider prova.mat in MATLAB obtained in the following way
for w=1:100
for p=1:9
A{p}=randn(100,1);
end
baseA_.A=A;
eval(['baseA.A' num2str(w) '= baseA_;'])
end
save(sprintf('prova.mat'),'-v7.3', 'baseA')
To have an idea of the actual dimensions in my data, the 1x9 cell in A1 is composed by the following 9 arrays: 904x5, 913x5, 1722x5, 4136x5, 9180x5, 3174x5, 5970x5, 4455x5, 340068x5. The other Aj's have a similar composition.
Consider the following code
clear all
load prova
tic
parfor w=1:100
indA=sprintf('A%d', w);
Aarr=baseA.(indA).A;
Boot=[];
for p=1:9
C=randn(100,1).*Aarr{p};
Boot=[Boot; C];
end
D{w}=Boot;
end
toc
If I run the parfor loop with 4 local workers in my Macbook Pro it takes 1.2 sec. Replacing parfor with for it takes 0.01 sec.
With my actual data, the difference of time is 31 sec versus 7 sec [the creation of the matrix C is also more complicated].
If have understood correctly the problem is that the computer has to send baseAto each local worker and this takes time and memory.
Could you suggest a solution that is able to make parfor more convenient than for? I thought that saving all cells in baseA was a way to save time by loading once at the beginning, but maybe I'm wrong.
General information
A lot of functions have implicit multi-threading built-in, making a parfor loop not more efficient, when using these functions, than a serial for loop, since all cores are already being used. parfor will actually be a detriment in this case, since it has the allocation overhead, whilst being as parallel as the function you are trying to use.
When not using one of the implicitly multithreaded functions parfor is basically recommended in two cases: lots of iterations in your loop (i.e., like 1e10), or if each iteration takes a very long time (e.g., eig(magic(1e4))). In the second case you might want to consider using spmd (slower than parfor in my experience). The reason parfor is slower than a for loop for short ranges or fast iterations is the overhead needed to manage all workers correctly, as opposed to just doing the calculation.
Check this question for information on splitting data between separate workers.
Benchmarking
Code
Consider the following example to see the behaviour of for as opposed to that of parfor. First open the parallel pool if you've not already done so:
gcp; % Opens a parallel pool using your current settings
Then execute a couple of large loops:
n = 1000; % Iteration number
EigenValues = cell(n,1); % Prepare to store the data
Time = zeros(n,1);
for ii = 1:n
tic
EigenValues{ii,1} = eig(magic(1e3)); % Might want to lower the magic if it takes too long
Time(ii,1) = toc; % Collect time after each iteration
end
figure; % Create a plot of results
plot(1:n,t)
title 'Time per iteration'
ylabel 'Time [s]'
xlabel 'Iteration number[-]';
Then do the same with parfor instead of for. You will notice that the average time per iteration goes up (0.27s to 0.39s for my case). Do realise however that the parfor used all available workers, thus the total time (sum(Time)) has to be divided by the number of cores in your computer. So for my case the total time went down from around 270s to 49s, since I have an octacore processor.
So, whilst the time to do each separate iteration goes up using parfor with respect to using for, the total time goes down considerably.
Results
This picture shows the results of the test as I just ran it on my home PC. I used n=1000 and eig(500); my computer has an I5-750 2.66GHz processor with four cores and runs MATLAB R2012a. As you can see the average of the parallel test hovers around 0.29s with a lot of spread, whilst the serial code is quite steady around 0.24s. The total time, however, went down from 234s to 72s, which is a speed up of 3.25 times. The reason that this is not exactly 4 is the memory overhead, as expressed in the extra time each iteration takes. The memory overhead is due to MATLAB having to check what each core is doing and making sure that each loop iteration is performed only once and that the data is put into the correct storage location.
Slice broadcasted data into a cell array
The following approach works for data which is looped by group. It does not matter what the grouping variable is, as long as it is determined before the loop. The speed advantage is huge.
A simplified example of such data is the following, with the first column containing a grouping variable:
ngroups = 1000;
nrows = 1e6;
data = [randi(ngroups,[nrows,1]), randn(nrows,1)];
data(1:5,:)
ans =
620 -0.10696
586 -1.1771
625 2.2021
858 0.86064
78 1.7456
Now, suppose for simplicity that I am interested in the sum() by group of the values in the second column. I can loop by group, index the elements of interest and sum them up. I will perform this task with a for loop, a plain parfor and a parfor with sliced data, and will compare the timings.
Keep in mind that this is a toy example and I am not interested in alternative solutions like bsxfun(), this is not the point of the analysis.
Results
Borrowing the same type of plot from Adriaan, I first confirm the same findings about plain parfor vs for. Second, both methods are completely outperformed by the parfor on sliced data which takes a bit more than 2 seconds to complete on a dataset with 10 million rows (the slicing operation is included in the timing). The plain parfor takes 24s to complete and the for almost twice that amount of time (I am on Win7 64, R2016a and i5-3570 with 4 cores).
The main point of slicing the data before starting the parfor is to avoid:
the overhead from the whole data being broadcast to the workers,
indexing operations into ever growing datasets.
The code
ngroups = 1000;
nrows = 1e7;
data = [randi(ngroups,[nrows,1]), randn(nrows,1)];
% Simple for
[out,t] = deal(NaN(ngroups,1));
overall = tic;
for ii = 1:ngroups
tic
idx = data(:,1) == ii;
out(ii) = sum(data(idx,2));
t(ii) = toc;
end
s.OverallFor = toc(overall);
s.TimeFor = t;
s.OutFor = out;
% Parfor
try parpool(4); catch, end
[out,t] = deal(NaN(ngroups,1));
overall = tic;
parfor ii = 1:ngroups
tic
idx = data(:,1) == ii;
out(ii) = sum(data(idx,2));
t(ii) = toc;
end
s.OverallParfor = toc(overall);
s.TimeParfor = t;
s.OutParfor = out;
% Sliced parfor
[out,t] = deal(NaN(ngroups,1));
overall = tic;
c = cache2cell(data,data(:,1));
s.TimeDataSlicing = toc(overall);
parfor ii = 1:ngroups
tic
out(ii) = sum(c{ii}(:,2));
t(ii) = toc;
end
s.OverallParforSliced = toc(overall);
s.TimeParforSliced = t;
s.OutParforSliced = out;
x = 1:ngroups;
h = plot(x, s.TimeFor,'xb',x,s.TimeParfor,'+r',x,s.TimeParforSliced,'.g');
set(h,'MarkerSize',1)
title 'Time per iteration'
ylabel 'Time [s]'
xlabel 'Iteration number[-]';
legend({sprintf('for : %5.2fs',s.OverallFor),...
sprintf('parfor : %5.2fs',s.OverallParfor),...
sprintf('parfor_sliced: %5.2fs',s.OverallParforSliced)},...
'interpreter', 'none','fontname','courier')
You can find cache2cell() on my github repo.
Simple for on sliced data
You might wonder what happens if we run the simple for on the sliced data? For this simple toy example, if we take away the indexing operation by slicing the data, we remove the only bottleneck of the code, and the for will actually be slighlty faster than the parfor.
However, this is a toy example where the cost of the inner loop is completely taken by the indexing operation. Hence, for the parfor to be worthwhile, the inner loop should be more complex and/or spread out.
Saving memory with sliced parfor
Now, assuming that your inner loop is more complex and the simple for loop is slower, let's look at how much memory we save by avoiding broadcasted data in a parfor with 4 workers and a dataset with 50 million rows (for about 760 MB in RAM).
As you can see, almost 3 GB of additional memory are sent to the workers. The slice operation needs some memory to be completed, but still much less than the broadcasting operation and can in principle overwrite the initial dataset, hence bearing negligible RAM cost once completed. Finally, the parfor on the sliced data will only use a small fraction of memory, i.e. that amount that corresponds to slices being used.
Sliced into a cell
The raw data is sliced by group and each section is stored into a cell. Since a cell array is an array of references we basically partitioned the contiguous data in memory into independent blocks.
While our sample data looked like this
data(1:5,:)
ans =
620 -0.10696
586 -1.1771
625 2.2021
858 0.86064
78 1.7456
out sliced c looks like
c(1:5)
ans =
[ 969x2 double]
[ 970x2 double]
[ 949x2 double]
[ 986x2 double]
[1013x2 double]
where c{1} is
c{1}(1:5,:)
ans =
1 0.58205
1 0.80183
1 -0.73783
1 0.79723
1 1.0414

Recursive loop optimization

Is there a way to rewrite my code to make it faster?
for i = 2:length(ECG)
u(i) = max([a*abs(ECG(i)) b*u(i-1)]);
end;
My problem is the length of ECG.
You should pre-allocate u like this
>> u = zeros(size(ECG));
or possibly like this
>> u = NaN(size(ECG));
or maybe even like this
>> u = -Inf(size(ECG));
depending on what behaviour you want.
When you pre-allocate a vector, MATLAB knows how big the vector is going to be and reserves an appropriately sized block of memory.
If you don't pre-allocate, then MATLAB has no way of knowing how large the final vector is going to be. Initially it will allocate a short block of memory. If you run out of space in that block, then it has to find a bigger block of memory somewhere, and copy all the old values into the new memory block. This happens every time you run out of space in the allocated block (which may not be every time you grow the array, because the MATLAB runtime is probably smart enough to ask for a bit more memory than it needs, but it is still more than necessary). All this unnecessary reallocating and copying is what takes a long time.
There are several several ways to optimize this for loop, but, surprisingly memory pre-allocation is not the part that saves the most time. By far. You're using max to find the largest element of a 1-by-2 vector. On each iteration you build this vector. However, all you're doing is comparing two scalars. Using the two argument form of max and passing it two scalar is MUCH faster: 75+ times faster on my machine for large ECG vectors!
% Set the parameters and create a vector with million elements
a = 2;
b = 3;
n = 1e6;
ECG = randn(1,n);
ECG2 = a*abs(ECG); % This can be done outside the loop if you have the memory
u(1,n) = 0; % Fast zero allocation
for i = 2:length(ECG)
u(i) = max(ECG2(i),b*u(i-1)); % Compare two scalars
end
For the single input form of max (not including creation of random ECG data):
Elapsed time is 1.314308 seconds.
For my code above:
Elapsed time is 0.017174 seconds.
FYI, the code above assumes u(1) = 0. If that's not true, then u(1) should be set to it's value after preallocation.

vectorizing loops in Matlab - performance issues

This question is related to these two:
Introduction to vectorizing in MATLAB - any good tutorials?
filter that uses elements from two arrays at the same time
Basing on the tutorials I read, I was trying to vectorize some procedure that takes really a lot of time.
I've rewritten this:
function B = bfltGray(A,w,sigma_r)
dim = size(A);
B = zeros(dim);
for i = 1:dim(1)
for j = 1:dim(2)
% Extract local region.
iMin = max(i-w,1);
iMax = min(i+w,dim(1));
jMin = max(j-w,1);
jMax = min(j+w,dim(2));
I = A(iMin:iMax,jMin:jMax);
% Compute Gaussian intensity weights.
F = exp(-0.5*(abs(I-A(i,j))/sigma_r).^2);
B(i,j) = sum(F(:).*I(:))/sum(F(:));
end
end
into this:
function B = rngVect(A, w, sigma)
W = 2*w+1;
I = padarray(A, [w,w],'symmetric');
I = im2col(I, [W,W]);
H = exp(-0.5*(abs(I-repmat(A(:)', size(I,1),1))/sigma).^2);
B = reshape(sum(H.*I,1)./sum(H,1), size(A, 1), []);
Where
A is a matrix 512x512
w is half of the window size, usually equal 5
sigma is a parameter in range [0 1] (usually one of: 0.1, 0.2 or 0.3)
So the I matrix would have 512x512x121 = 31719424 elements
But this version seems to be as slow as the first one, but in addition it uses a lot of memory and sometimes causes memory problems.
I suppose I've made something wrong. Probably some logic mistake regarding vectorizing. Well, in fact I'm not surprised - this method creates really big matrices and probably the computations are proportionally longer.
I have also tried to write it using nlfilter (similar to the second solution given by Jonas) but it seems to be hard since I use Matlab 6.5 (R13) (there are no sophisticated function handles available).
So once again, I'm asking not for ready solution, but for some ideas that would help me to solve this in reasonable time. Maybe you will point me what I did wrong.
Edit:
As Mikhail suggested, the results of profiling are as follows:
65% of time was spent in the line H= exp(...)
25% of time was used by im2col
How big are I and H (i.e. numel(I)*8 bytes)? If you start paging, then the performance of your second solution is going to be affected very badly.
To test whether you really have a problem due to too large arrays, you can try and measure the speed of the calculation using tic and toc for arrays A of increasing size. If the execution time increases faster than by the square of the size of A, or if the execution time jumps at some size of A, you can try and split the padded I into a number of sub-arrays and perform the calculations like that.
Otherwise, I don't see any obvious places where you could be losing lots of time. Well, maybe you could skip the reshape, by replacing B with A in your function (saves a little memory as well), and writing
A(:) = sum(H.*I,1)./sum(H,1);
You may also want to look into upgrading to a more recent version of Matlab - they've worked hard on improving performance.