How to optimize a for loop? - matlab

I have this piece of code that takes very long to run, and I would like to know whether it can be optimized or vectorized to run faster:
if intersect(pt, coord, 'rows')
    for t = 1:size(pt,1)
        for u = 1:size(Mbb,1)
            if pt(t,1) == Mbb(u,1)
                img(pt(t,1), Mbb(u,2)) = 1;
            end
        end
    end
end

Try multi-threading. Even on a single core, multi-threading may increase the efficiency of your core. If you have a multi-core system, multi-threading will yield even more benefit. In MATLAB this is done using parfor. Note that this can only be done when there are no dependencies between loop iterations. Your code would look something like the following. Sometimes the MATLAB interpreter is over-conservative in detecting dependencies, so you may have to write your loops in such a way that the interpreter doesn't see dependencies between iterations:
if intersect(pt, coord, 'rows')
    loopsize = size(pt,1);
    parfor t = 1:loopsize
        for u = 1:size(Mbb,1)
            if pt(t,1) == Mbb(u,1)
                img(pt(t,1), Mbb(u,2)) = 1;
            end
        end
    end
end
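One usage note, hedged: parfor needs a parallel pool (Parallel Computing Toolbox) to actually run in parallel. Recent releases open one automatically on first use, or you can open it explicitly with a chosen worker count:
parpool(4);   % open a pool with 4 workers; the worker count here is illustrative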

You spend much time comparing pt(t,1) and Mbb(u,1) to find matches in a double loop. If the respective sizes are large, this can be costly: O(NM) operations.
What you can do is pre-sort these arrays and search for equal values with a merge-like pass, which takes only O(N+M) operations.
Note, however, that if the arrays pt and Mbb include many repeated elements that are also equal across the two arrays, the problem can degenerate to NM matches. In that case, the sorting trick can't help.
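Alternatively, here is a hedged sketch of a fully vectorized replacement for the double loop, using ismember (which sorts internally) and sub2ind; it assumes img, pt, Mbb, and coord are as in the question:
if ~isempty(intersect(pt, coord, 'rows'))   % ~isempty is the safer "any common row" test
    keep = ismember(Mbb(:,1), pt(:,1));     % Mbb rows whose first column appears in pt
    img(sub2ind(size(img), Mbb(keep,1), Mbb(keep,2))) = 1;
end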

Related

faster way to add many large matrices in matlab

Say I have many (around 1000) large matrices (about 1000 by 1000) and I want to add them together element-wise. The naive way is to use a temporary variable and accumulate in a loop. For example,
summ = 0;
for ii = 1:20
    for jj = 1:20
        summ = summ + rand(400);
    end
end
After searching the Internet for a while, I found a suggestion that it is better to do this with the help of sum(). For example,
sump = zeros(400,400,400);
count = 0;
for ii = 1:20
    for jj = 1:20
        count = count + 1;
        sump(:,:,count) = rand(400);
    end
end
sum(sump,3);
However, after testing both approaches, the results are
Elapsed time is 0.780819 seconds.
Elapsed time is 1.085279 seconds.
which means the second method is actually worse.
So I am just wondering: is there any more effective way to do the addition? Assume that I am working on a computer with very large memory and a GTX 1080 (CUDA might be helpful, but I don't know whether it's worth it, since communication also takes time).
Thanks for your time! Any reply will be highly appreciated!
The fastest way is to not use any loops in MATLAB at all.
In many cases, the internal functions of MATLAB are well optimized to use SIMD or other acceleration techniques.
An example of using the built-in functionality to create matrices of the desired size is X = rand(sz1,...,szN).
In your specific case, sum(rand(400,400,400),3) should then give you the fastest result.
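Since the question mentions a GTX 1080, a hedged aside: the same one-liner also works on the GPU (Parallel Computing Toolbox required), with gather() called once at the end to avoid repeated host-device transfers:
g = gpuArray.rand(400,400,400);   % generate the pages directly on the GPU
s = gather(sum(g, 3));            % sum along the third dimension, then copy back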

MATLAB: Why is double looping so much slower than squaring?

I wonder why it is faster to square a matrix with the A2 = A.^2 command (A being an LxL matrix) than to just do a double for loop and assign the values to a zero-initialized matrix.
I have run the following code to check the first case
tic
psi2=psi.^2;
T1=toc;
and the following for the second
psi2 = zeros(L,L);
tic
for i = 1:L
    for j = 1:L
        psi2(i,j) = psi(i,j)^2;
    end
end
T2 = toc;
In this figure, the elapsed times for several matrix sizes (L) are shown, and the speedup is clear.
I would not be surprised to see that MATLAB has a very efficient implementation of matrix multiplication, as that is what it is made for, but I can't understand how there can be a faster way to do element-wise operations than just looping over them.
Thanks for your time.
There are several things that make the vector operation faster than your loop.
First, a loop compiled into C/C++ code is faster than a scripted loop, which MATLAB has to interpret (and JIT-compile) at run time.
Secondly, the C or C++ compiler can use Single Instruction, Multiple Data (SIMD) instructions to apply the operation to multiple matrix elements at once, and can additionally spread the work over multiple threads.
Finally, it is possible to push the operation to the GPU, which can process even more elements simultaneously (hundreds of cores, compared to 4-8 on a CPU). Your scripted loops cannot do this.
.^2 takes advantage of parallel operation on the CPU. With a nested-loop (double-loop) solution, everything is done in sequence. In addition, there is overhead from incrementing the loop control variables and checking the loop conditions.
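As a hedged illustration of the GPU route mentioned above (requires Parallel Computing Toolbox and a supported GPU; the matrix size is illustrative):
psi  = rand(2048);                   % toy data on the CPU
psi2 = gather(gpuArray(psi).^2);     % element-wise square on the GPU, result copied back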

Fast DP in Matlab (Viterbi for profile HMMs)

I've got efficiency problems with the Viterbi log-odds computation in MATLAB.
Basically, my problem is that the nested loops seem mandatory, and they slow the code down a lot. This is the expensive part:
for i = 1:input_len
    for j = 1:num_states
        v_m = emission_value + max_over_3_elements;  % V_M (match)
        v_i = max_over_2_elements;                   % V_I (insert)
        v_d = max_over_2_elements;                   % V_D (delete)
    end
end
I believe I'm not the first to implement Viterbi for profile HMMs, so maybe you have some advice. I also took a look at MATLAB's own hmmviterbi, but there were no revelations (it also uses nested loops). I also tried replacing max with some primitive operations, but there was no noticeable difference (it was actually a little slower).
Unfortunately, loops are just slow in MATLAB (it gets better with more recent versions, though), and I don't think this can be easily vectorized/parallelized, as the operations inside the loops are not independent of other iterations.
This seems like a task for MEX: it should not be too much work to write this in C, and the expected speedup is probably quite large.
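That said, a hedged aside: the inner loop over states can sometimes be vectorized even when the loop over sequence positions cannot, because within one position the scores depend only on the previous column. A minimal sketch for a plain Viterbi recursion (not the profile-HMM variant, where delete states add a within-column dependency), with toy data and illustrative names:
num_states = 4; input_len = 100;
logA = log(rand(num_states));             % toy log-transition matrix
logE = log(rand(num_states, input_len));  % toy log-emission scores
V = -inf(num_states, input_len);
V(:,1) = logE(:,1);
for i = 2:input_len
    scores = V(:,i-1) + logA;                    % implicit expansion (R2016b+)
    V(:,i) = max(scores, [], 1).' + logE(:,i);   % best predecessor per state
end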

Why is arrayfun much faster than a for-loop when using the GPU?

Could someone tell me why arrayfun is much faster than a for loop on the GPU? (Not on the CPU; a for loop is actually faster on the CPU.)
Arrayfun:
x = parallel.gpu.GPUArray(rand(512,512,64));
count = arrayfun(@(x) x^2, x);
And the equivalent for loop:
for i = 1:size(x,1)*size(x,2)*size(x,3)
    z(i) = x(i).^2;
end
Is it perhaps because a for loop is not multithreaded on the GPU?
Thanks.
I don't think your loops are equivalent. It seems you're squaring every element of the array in your for-loop implementation, but performing some sort of count with arrayfun.
Regardless, I think the explanation you're looking for is as follows:
When run on the GPU, your code can be functionally decomposed -- into each array cell, in this case -- and each cell squared separately. This is okay because, for a given i, the value of cell_i squared doesn't depend on any of the values in the other cells. What most likely happens is that the array gets decomposed into S buffers, where S is the number of stream processing units your GPU has. Each unit then computes the square of the data in each cell of its buffer. The result is copied back to the original array and returned to count.
Now don't worry: if you're counting things, as it seems arrayfun is doing here, a similar thing happens. The algorithm most likely partitions the array into similar buffers and, instead of squaring each cell, adds the values together. You can think of the result of this first step as a smaller array to which the same process can be applied recursively to count the new sums.
As per the reference page at http://www.mathworks.co.uk/help/toolbox/distcomp/arrayfun.html, "the MATLAB function passed in for evaluation is compiled for the GPU, and then executed on the GPU". In the explicit for-loop version, each operation is executed separately on the GPU, and this incurs overhead; the arrayfun version is a single GPU kernel invocation.
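A hedged sketch of how the two single-kernel variants might be timed fairly (gpuArray.rand and wait are from the Parallel Computing Toolbox; the array size follows the question):
x = gpuArray.rand(512, 512, 64);   % create the data directly on the GPU
tic; y1 = arrayfun(@(v) v^2, x); wait(gpuDevice); t1 = toc;   % one compiled kernel
tic; y2 = x.^2;                  wait(gpuDevice); t2 = toc;   % built-in element-wise kernel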
These are the times I got for the same code. arrayfun on the CPU takes approximately 17 seconds, which is much slower, but on the GPU arrayfun is much faster:
parfor time = 0.4379
for time = 0.7237
gpu arrayfun time = 0.1685

optimization, reduction variables, and MATLAB parfor

I'm trying to write simple, generic parallel code for minimizing a function in MATLAB. The idea is very simple; essentially:
parfor k = 1:N
    % (...find a good solution xcurrent with cost fxcurrent...)
    fmin = min(fmin, fxcurrent);   % keep the best value so far
end
This works fine, because fmin is a reduction variable, and thus I can use this construction to update the current best value.
However, I couldn't find a nice elegant way of keeping (or storing) the best current solution ("xcurrent").
How do I keep track of the best solution found so far?
In other words, if the current value is strictly smaller than fmin, how can I save xcurrent (subject to the constraints that parallel loops impose in MATLAB)?
[Of course, the serial version is trivial; just prepend
if fxcurrent < fmin
    xbest = xcurrent;
end
but this does not work in a parfor loop.]
A few approaches that come to mind:
I could just store all solutions and costs (using sliced variables), but this is hugely memory-inefficient (the number of iterations N is very large, and the solutions themselves are very big).
Similarly, I could use a (set or matrix) reduction variable and do:
solutionset = [solutionset, xcurrent];
but this is almost as bad in terms of memory requirements.
I could also save xcurrent to disk every time the solution is improved.
I tried looking around for a simpler solution, but found nothing very useful.
The question seems to be well-defined (so it's not like in other problems, where the output could depend on iteration order), but I couldn't find an elegant way of doing this.
Apologies if I'm missing something obvious, and thanks a lot in advance!
Thanks! I'm copying the suggestion down here:
Just an idea: what if you write your own reduction function, basically just containing the if block and a save or output?
You will presumably need to maintain multiple xcurrent structures in memory anyway, since there has to be a separate copy for each worker executing the loop body. I would try splitting your loop into an outer parallel part and an inner serial part; this lets you adjust the number of copies of xcurrent independently of the total iteration count.
The inner (serial) loop can use the normal if fxcurrent < fmin; xmin = xcurrent; end construct to update its best solution, and the outer (parallel) loop can just store all solutions using slicing. As a final step you select the best solution from your (small) set.
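A hedged sketch of this outer-parallel / inner-serial split; findCandidate is a hypothetical stand-in for the asker's solution generator, and the chunk count is illustrative:
N = 1000;                                   % total iterations, as in the question
nOuter = 8;                                 % number of parallel chunks
nInner = ceil(N / nOuter);
findCandidate = @() deal(rand(5), rand());  % hypothetical: returns (solution, cost)
fminSlice  = inf(nOuter, 1);                % sliced outputs: one best cost per chunk
xbestSlice = cell(nOuter, 1);               % sliced outputs: one best solution per chunk
parfor k = 1:nOuter
    fbest = inf; xbest = [];
    for m = 1:nInner                        % serial inner loop: one xcurrent copy per worker
        [xcurrent, fxcurrent] = findCandidate();
        if fxcurrent < fbest
            fbest = fxcurrent;
            xbest = xcurrent;
        end
    end
    fminSlice(k)  = fbest;
    xbestSlice{k} = xbest;
end
[fmin, kbest] = min(fminSlice);             % final serial reduction over the small set
xmin = xbestSlice{kbest};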