parfor slows down computation - matlab

I ran this code in a computer with 44 workers. However, each iteration in parallel is slower than in serial mode, though the total execution time for the loop as a whole goes down.
template=cell(31,1);
for i=1:31
template{i}=rand(i+24);
end
parfor i=1:5
img=rand(800,1280+i); % It's an other function that gives me the values of img ,here it's just an example
tic
cellfun (#(t) normxcorr2 ( t ,img),template,'UniformOutput',0);
toc
end
As a result, elapsed time in each loop is approximately 18s. When I change the parfor to for, the time elapsed is approximately 6.7s in each loop.
Can you explain me why in this case the parfor loop is slower than the for loop?
I checked the MATLAB documentation and also similar questions, but it didn't help me.
Note : the total time of execution of the script is faster for the parfor version, I just want to understand why the function cellfun is 3 times slower in parallel version.

Check CPU usage.
I believe main reason here is that stuff like fft (which is a part of xcorr most likely) will already use more than a single core. I can't test parfor right now, but ordinary for loop already has about 70% CPU utilization on my 4C/4T CPU with your code. So, parfor can at best fill the remaining 30% (on my computer), but will obviously then run each instance slower.

Related

Wildly different iteration times in parfor loop

I have an embarrassingly-parallel Monte Carlo code for which I am using parfor. Each parfor loop iteration does around 100 seconds of calculation into a temporary array, before adding that array to the master array, which is an reduction variable. I'm timing using a sliced variable.
Like this:
master_array=zeros(1000,1000,50)
iteration_times=zeros(500,1);
parfor i=1:500
iteration_start=datetime;
temp_array=zeros(1000,1000,50);
**do about 100 seconds of work building temp_array**
master_array=master_array+temp_array;
iteration_times(i)=seconds(datetime-iteration_start);
end
Sometimes when the code is run, but not all the time, the code takes far longer. When it does, it seems that some loop iterations are taking way way longer than they should, as shown in the graph iteration number vs runtime. (This is not caused by variations in the amount of 'work done' - the variation disappears without the parfor.) This was running with a pool of 4 workers.
The arrays are large, and can be 100MB or more. I don't seem to be running out of memory - the hard drive is not being used, and I get this problem still with smaller arrays.
'Slow' iteration times seem oddly quantised, like some loop interations are waiting for others to complete before completing themselves.
Any ideas what i can check?

Parfor limitations [duplicate]

the code I'm dealing with has loops like the following:
bistar = zeros(numdims,numcases);
parfor hh=1:nt
bistar = bistar + A(:,:,hh)*data(:,:,hh+1)' ;
end
for small nt (10).
After timing it, it is actually 100 times slower than using the regular loop!!! I know that parfor can do parallel sums, so I'm not sure why this isn't working.
I run
matlabpool
with the out-of-the-box configurations before running my code.
I'm relatively new to matlab, and just started to use the parallel features, so please don't assume that I'm am not doing something stupid.
Thanks!
PS: I'm running the code on a quad core so I would expect to see some improvements.
Making the partitioning and grouping the results (overhead in dividing the work and gathering results from the several threads/cores) is high for small values of nt. This is normal, you would not partition data for easy tasks that can be performed quickly in a simple loop.
Always perform something challenging inside the loop that is worth the partitioning overhead. Here is a nice introduction to parallel programming.
The threads come from a thread pool so the overhead of creating the threads should not be there. But in order to create the partial results n matrices from the bistar size must be created, all the partial results computed and then all these partial results have to be added (recombining). In a straight loop, this is with a high probability done in-place, no allocations take place.
The complete statement in the help (thanks for your link hereunder) is:
If the time to compute f, g, and h is
large, parfor will be significantly
faster than the corresponding for
statement, even if n is relatively
small.
So you see they mean exactly the same as what I mean, the overhead for small n values is only worth the effort if what you do in the loop is complex/time consuming enough.
Parforcomes with a bit of overhead. Thus, if nt is really small, and if the computation in the loop is done very quickly (like an addition), the parfor solution is slower. Furthermore, if you run parforon a quad-core, speed gain will be close to linear for 1-3 cores, but less if you use 4 cores, since the last core also needs to run system processes.
For example, if parfor comes with 100ms of overhead, and the computation in the loop takes 5ms, and if we assume that speed gain is linear up to 4 cores with a coefficient of 1 (i.e. using 4 cores makes the computation 4 times faster), nt needs to be about 30 for you to achieve a speed gain with parfor (150ms with for, 132ms with parfor). If you were to run only 10 iterations, parfor would be slower (50ms with for, 112ms with parfor).
You can calculate the overhead on your machine by comparing execution time with 1 worker vs 0 workers, and you can estimate speed gain by making a liner fit through the execution times with 1 to 4 workers. Then you'll know when it's useful to use parfor.
Besides the bad performance because of the communication overhead (see other answers), there is another reason not to use parfor in this case. Everything which is done within the parfor in this case uses built-in multithreading. Assuming all workers are running on the same PC there is no advantage because a single call already uses all cores of your processor.

Matlab parfor execution speed [duplicate]

the code I'm dealing with has loops like the following:
bistar = zeros(numdims,numcases);
parfor hh=1:nt
bistar = bistar + A(:,:,hh)*data(:,:,hh+1)' ;
end
for small nt (10).
After timing it, it is actually 100 times slower than using the regular loop!!! I know that parfor can do parallel sums, so I'm not sure why this isn't working.
I run
matlabpool
with the out-of-the-box configurations before running my code.
I'm relatively new to matlab, and just started to use the parallel features, so please don't assume that I'm am not doing something stupid.
Thanks!
PS: I'm running the code on a quad core so I would expect to see some improvements.
Making the partitioning and grouping the results (overhead in dividing the work and gathering results from the several threads/cores) is high for small values of nt. This is normal, you would not partition data for easy tasks that can be performed quickly in a simple loop.
Always perform something challenging inside the loop that is worth the partitioning overhead. Here is a nice introduction to parallel programming.
The threads come from a thread pool so the overhead of creating the threads should not be there. But in order to create the partial results n matrices from the bistar size must be created, all the partial results computed and then all these partial results have to be added (recombining). In a straight loop, this is with a high probability done in-place, no allocations take place.
The complete statement in the help (thanks for your link hereunder) is:
If the time to compute f, g, and h is
large, parfor will be significantly
faster than the corresponding for
statement, even if n is relatively
small.
So you see they mean exactly the same as what I mean, the overhead for small n values is only worth the effort if what you do in the loop is complex/time consuming enough.
Parforcomes with a bit of overhead. Thus, if nt is really small, and if the computation in the loop is done very quickly (like an addition), the parfor solution is slower. Furthermore, if you run parforon a quad-core, speed gain will be close to linear for 1-3 cores, but less if you use 4 cores, since the last core also needs to run system processes.
For example, if parfor comes with 100ms of overhead, and the computation in the loop takes 5ms, and if we assume that speed gain is linear up to 4 cores with a coefficient of 1 (i.e. using 4 cores makes the computation 4 times faster), nt needs to be about 30 for you to achieve a speed gain with parfor (150ms with for, 132ms with parfor). If you were to run only 10 iterations, parfor would be slower (50ms with for, 112ms with parfor).
You can calculate the overhead on your machine by comparing execution time with 1 worker vs 0 workers, and you can estimate speed gain by making a liner fit through the execution times with 1 to 4 workers. Then you'll know when it's useful to use parfor.
Besides the bad performance because of the communication overhead (see other answers), there is another reason not to use parfor in this case. Everything which is done within the parfor in this case uses built-in multithreading. Assuming all workers are running on the same PC there is no advantage because a single call already uses all cores of your processor.

parfor not giving speed ups [duplicate]

the code I'm dealing with has loops like the following:
bistar = zeros(numdims,numcases);
parfor hh=1:nt
bistar = bistar + A(:,:,hh)*data(:,:,hh+1)' ;
end
for small nt (10).
After timing it, it is actually 100 times slower than using the regular loop!!! I know that parfor can do parallel sums, so I'm not sure why this isn't working.
I run
matlabpool
with the out-of-the-box configurations before running my code.
I'm relatively new to matlab, and just started to use the parallel features, so please don't assume that I'm am not doing something stupid.
Thanks!
PS: I'm running the code on a quad core so I would expect to see some improvements.
Making the partitioning and grouping the results (overhead in dividing the work and gathering results from the several threads/cores) is high for small values of nt. This is normal, you would not partition data for easy tasks that can be performed quickly in a simple loop.
Always perform something challenging inside the loop that is worth the partitioning overhead. Here is a nice introduction to parallel programming.
The threads come from a thread pool so the overhead of creating the threads should not be there. But in order to create the partial results n matrices from the bistar size must be created, all the partial results computed and then all these partial results have to be added (recombining). In a straight loop, this is with a high probability done in-place, no allocations take place.
The complete statement in the help (thanks for your link hereunder) is:
If the time to compute f, g, and h is
large, parfor will be significantly
faster than the corresponding for
statement, even if n is relatively
small.
So you see they mean exactly the same as what I mean, the overhead for small n values is only worth the effort if what you do in the loop is complex/time consuming enough.
Parforcomes with a bit of overhead. Thus, if nt is really small, and if the computation in the loop is done very quickly (like an addition), the parfor solution is slower. Furthermore, if you run parforon a quad-core, speed gain will be close to linear for 1-3 cores, but less if you use 4 cores, since the last core also needs to run system processes.
For example, if parfor comes with 100ms of overhead, and the computation in the loop takes 5ms, and if we assume that speed gain is linear up to 4 cores with a coefficient of 1 (i.e. using 4 cores makes the computation 4 times faster), nt needs to be about 30 for you to achieve a speed gain with parfor (150ms with for, 132ms with parfor). If you were to run only 10 iterations, parfor would be slower (50ms with for, 112ms with parfor).
You can calculate the overhead on your machine by comparing execution time with 1 worker vs 0 workers, and you can estimate speed gain by making a liner fit through the execution times with 1 to 4 workers. Then you'll know when it's useful to use parfor.
Besides the bad performance because of the communication overhead (see other answers), there is another reason not to use parfor in this case. Everything which is done within the parfor in this case uses built-in multithreading. Assuming all workers are running on the same PC there is no advantage because a single call already uses all cores of your processor.

fread() optimisation matlab

I am reading a 40 MB file with three different ways. But the first one is way faster than the 2 others. Do you guys have an idea why ? I would rather implement condition in loops or whiles to separate data than load everything with the first quick method and separate them then - memory saving -
LL=10000000;
fseek(fid,startbytes, 'bof');
%% Read all at once %%%%%%%%%%%%%%%%%%%%%%
tic
AA(:,1)=int32(fread(fid,LL,'int32'));
toc
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
fseek(fid,startbytes,'bof');
%% Read all using WHILE loop %%%%%%%%%%%%%
tic
i=0;
AA2=int32(zeros(LL,1));
while i<LL
i=i+1;
AA2(i,1)=fread(fid,1,'int32');
end
toc
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
fseek(fid,startbytes,'bof');
%% Read all using FOR loop %%%%%%%%%%%%%%%
tic
AA3=int32(zeros(LL,1));
for i=1:LL
AA3(i,1)=fread(fid,1,'int32');
end
toc
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Elapsed time is 0.312916 seconds.
Elapsed time is 138.811520 seconds.
Elapsed time is 116.799286 seconds.
Here are my two cents on this:
Is the JIT accelerator enabled?
Since MATLAB is an interpreted language, for loops can be quite slow. while loops may be even slower, because the termination condition is re-evaluted in each iteration (unlike for loops that iterate a predetermined number of times). Nevertheless, this is not so true with JIT acceleration, which can significantly boost their performance.
I'm not near MATLAB at the moment, so I cannot reproduce this scenario myself, but you can check yourself whether you have JIT acceleration turned on by typing the following in the command window:
feature accel
If the result is 0 it means that it's disabled, and this is probably the reason for the huge reduction in performance.
Too many system calls?
I'm not familiar with the internals of fread, but I can only assume that one fread call to read the entire file invokes less system calls than multiple fread calls. System calls are usually expensive, so this can, to some extent, account for the slowdown.