I wonder why it is faster to square a matrix with the A2=A^2 command (A being a LxL matrix) than to just do a double for loop and assign the value to a zeroed matrix.
I have run the following code to check the first case
tic
psi2=psi.^2;
T1=toc;
and the following for the second
psi2=zeros(L,L);
tic
for i=1:L
for j=1:L
psi2(i,j)=psi(i,j)^2;
end
end
T2=toc;
In this figure the elapsed time for several matrices sizes (L) are shown and the speedup is clear.
I would not be surprised to see that MATLAB has a very efficient implementation of matrix multiplication as it what is it made for, but I can't understand how there's a faster way to do element-wise operations than just looping over it.
Thanks for time.
There are several things that make the vector operation faster than your loop.
First, a loop compiled into C++ code is faster than a script loop which is interpreted / converted and compiled as Java.
Secondly, the C or C++ compiler can use Single Instruction, Multiple Data instructions (SIMD) to do the operation on multiple matrix elements in a single operation. And then do this in multiple threads.
Finally, it's possible to push the operation to the GPU which can process even more elements simultaneously (hundreds of cores, compared to 4-8 on the CPU). Your scripted loops cannot do this.
.^2 take the advantage of doing parallel operation using CPU. For a nested loop (double loop) solution, the entire solution is done in sequence. In addition it also have over head to increment of the loop control variable and condition checking.
Related
Say I have many (around 1000) large matrices (about 1000 by 1000) and I want to add them together element-wise. The very naive way is using a temp variable and accumulates in a loop. For example,
summ=0;
for ii=1:20
for jj=1:20
summ=summ+ rand(400);
end
end
After searching on the Internet for some while, someone said it's better to do with the help of sum(). For example,
sump=zeros(400,400,400);
count=0;
for ii=1:20
for j=1:20
count=count+1;
sump(:,:,count)=rand(400);
end
end
sum(sump,3);
However, after I tested two ways, the result is
Elapsed time is 0.780819 seconds.
Elapsed time is 1.085279 seconds.
which means the second method is even worse.
So I am just wondering if there any effective way to do addition? Assume that I am working on a computer with very large memory and a GTX 1080 (CUDA might be helpful but I don't know whether it's worthy to do so since communication also takes time.)
Thanks for your time! Any reply will be highly appreciated!.
The fastes way is to not use any loops in matlab at all.
In many cases, the internal functions of matlab all well optimized to use SIMD or other acceleration techniques.
An example for using the build in functionalities to create matrices of the desired size is X = rand(sz1,...,szN).
In your explicit case sum(rand(400,400,400),3) should give you then the fastest result.
I have this part of code which is very long to run and I would like to know if it is possible to do an optimization or a vectorization for a faster running?
if intersect(pt, coord,'rows')
for t=1:size(pt,1)
for u=1:size(Mbb,1)
if pt(t,1)==Mbb(u,1)
img(pt(t,1),Mbb(u,2))=1;
end
end
end
end
Try multi-threading. Even on a single core multi-threading may increase the efficiency of your core. If you have a multi-core system then multi-threading will yield even more benefit. In MATLAB this is done using parfor. Note that this can only be done when there is no dependencies between loop iteration. Your code will have to look something like this. Sometimes the MATLAB interpreter will be over conservative in detecting dependencies and hence you have to write your loops in such a way that the interpreter doesnt see dependencies in iterations
if intersect(pt, coord,'rows')
loopsize=size(pt,1);
parfor t=1:loopsize
for u=1:size(Mbb,1)
if pt(t,1)==Mbb(u,1)
img(pt(t,1),Mbb(u,2))=1;
end
end
end
end
You spend much time comparing pt(t,1) and Mbb(u,1) to find matches, in a double loop. If the respective sizes are large, this can be costly (O(NM)).
What you can do is to pre-sort these arrays and search for equal values by a merge-like process, taking only O(N+M) operations.
Anyway, note that if the arrays pt and Mbb include many equal elements, which are also equal between the arrays, the problem can degenerate to NM matches. In this case, the sorting trick can't help.
the code I'm dealing with has loops like the following:
bistar = zeros(numdims,numcases);
parfor hh=1:nt
bistar = bistar + A(:,:,hh)*data(:,:,hh+1)' ;
end
for small nt (10).
After timing it, it is actually 100 times slower than using the regular loop!!! I know that parfor can do parallel sums, so I'm not sure why this isn't working.
I run
matlabpool
with the out-of-the-box configurations before running my code.
I'm relatively new to matlab, and just started to use the parallel features, so please don't assume that I'm am not doing something stupid.
Thanks!
PS: I'm running the code on a quad core so I would expect to see some improvements.
Making the partitioning and grouping the results (overhead in dividing the work and gathering results from the several threads/cores) is high for small values of nt. This is normal, you would not partition data for easy tasks that can be performed quickly in a simple loop.
Always perform something challenging inside the loop that is worth the partitioning overhead. Here is a nice introduction to parallel programming.
The threads come from a thread pool so the overhead of creating the threads should not be there. But in order to create the partial results n matrices from the bistar size must be created, all the partial results computed and then all these partial results have to be added (recombining). In a straight loop, this is with a high probability done in-place, no allocations take place.
The complete statement in the help (thanks for your link hereunder) is:
If the time to compute f, g, and h is
large, parfor will be significantly
faster than the corresponding for
statement, even if n is relatively
small.
So you see they mean exactly the same as what I mean, the overhead for small n values is only worth the effort if what you do in the loop is complex/time consuming enough.
Parforcomes with a bit of overhead. Thus, if nt is really small, and if the computation in the loop is done very quickly (like an addition), the parfor solution is slower. Furthermore, if you run parforon a quad-core, speed gain will be close to linear for 1-3 cores, but less if you use 4 cores, since the last core also needs to run system processes.
For example, if parfor comes with 100ms of overhead, and the computation in the loop takes 5ms, and if we assume that speed gain is linear up to 4 cores with a coefficient of 1 (i.e. using 4 cores makes the computation 4 times faster), nt needs to be about 30 for you to achieve a speed gain with parfor (150ms with for, 132ms with parfor). If you were to run only 10 iterations, parfor would be slower (50ms with for, 112ms with parfor).
You can calculate the overhead on your machine by comparing execution time with 1 worker vs 0 workers, and you can estimate speed gain by making a liner fit through the execution times with 1 to 4 workers. Then you'll know when it's useful to use parfor.
Besides the bad performance because of the communication overhead (see other answers), there is another reason not to use parfor in this case. Everything which is done within the parfor in this case uses built-in multithreading. Assuming all workers are running on the same PC there is no advantage because a single call already uses all cores of your processor.
Could someone tell why Arrayfun is much faster than a for loop on GPU? (not on CPU, actually a For loop is faster on CPU)
Arrayfun:
x = parallel.gpu.GPUArray(rand(512,512,64));
count = arrayfun(#(x) x^2, x);
And equivalent For loop:
for i=1:size(x,1)*size(x,2)*size(x,3)
z(i)=x(i).^2;
end
Is it probably because a For loop is not multithreaded on GPU?
Thanks.
I don't think your loops are equivalent. It seems you're squaring every element in an array with your CPU implementation, but performing some sort of count for arrayfun.
Regardless, I think the explanation you're looking for is as follows:
When run on the GPU, you code can be functionally decomposed -- into each array cell in this case -- and squared separately. This is okay because for a given i, the value of [cell_i]^2 doesn't depend on any of the other values in other cells. What most likely happens is the array get's decomposed into S buffers where S is the number of stream processing units your GPU has. Each unit then computes the square of the data in each cell of its buffer. The result is copied back to the original array and the result is returned to count.
Now don't worry, if you're counting things as it seems *array_fun* is actually doing, a similar thing is happening. The algorithm most likely partitions the array off into similar buffers, and, instead of squaring each cell, add the values together. You can think of the result of this first step as a smaller array which the same process can be applied to recursively to count the new sums.
As per the reference page here http://www.mathworks.co.uk/help/toolbox/distcomp/arrayfun.html, "the MATLAB function passed in for evaluation is compiled for the GPU, and then executed on the GPU". In the explicit for loop version, each operation is executed separately on the GPU, and this incurs overhead - the arrayfun version is one single GPU kernel invocation.
This is the time i got for the same code. Arrayfun in CPU take approx 17 sec which is much higher but in GPU, Arrayfun is much faster.
parfor time = 0.4379
for time = 0.7237
gpu arrayfun time = 0.1685
I have an m x n matrix of integers and where n is a fairly big number m and n ~1000. I want to iterate through all of these and perform a some operations, like accessing a particular cell and assigning a value of a particular cells.
However, at least in my implementation, this is rather inefficient as I have two for loops with Matrix(a,b) = Matrix(a,b+1) or something along these lines. Is there any other way to do this seeing as my current implementation takes a long time to traverse through about 100,000 cells and perform some operations.
Thank you
In matlab, it's almost always possible to avoid loops.
If you want to do Matrix(a,b)=Matrix(a,b+1), you should just do Matrix2=Matrix(:,2:end);
If you are more precise about what you do inside the loop, I can help you more.
Matlab uses column major ordering of matrixes in memory (unlike C). Are you sure you are iterating the indexes in the correct order? If not, try switching them and see if performance improves..
If you can't get rid of the for loops, one possibility would be to rewrite the expensive operations in C and create a MEX file as described here.