matlab parfor is very slow with operation on a large matrix - matlab

I am writing a matlab code, which does some operations on a large matrix. First I create three 3D array
dw2 = 0.001;
W2 = [0:dw2:1];
dp = 0.001;
P1 = [dp:dp:1];
dI = 0.001;
I = [1:-dI:0];
[II,p1,ww2] = ndgrid(I,P1,W2);
Then my code basically does the following
G = [0:0.1:10]
Y = zeros(length(G),1)
for i = 1:1:length(G)
g = G(i);
Y(i) = myfunction(II,p1,ww2,g)
end
This code roughly takes about 100s, with each iteration being nearly 10s.
However, after I start parfor
ProcessPool with properties:
Connected: true
NumWorkers: 48
Cluster: local
AttachedFiles: {}
AutoAddClientPath: true
IdleTimeout: 30 minutes (30 minutes remaining)
SpmdEnabled: true
Then it is like running forever. The maximum number of workers is 48. I've also tried 2, 5, 10. All of these are slower than non-parallel computing. Is that because matlab copied II,p1,ww2 48 times and that causes the problem? Also myfunction involves a lot of vectorization. I have already optimized the myfunction. Will that lead to slow performance of parfor? Is there a way to utilize (some of) the 48 workers to speed up the code? Any comments are highly appreciated. I need to run millions of cases. So I really hope that I can utilize the 48 workers in some way.

It seems that you have large data, and a lot of cores. It is likely that you simply run out of memory, which is why things get so slow.
I would suggest that you set up your workers to be threads, not separate processes.
You can do this with parpool('threads'). Your code must conform to some limitations, not all code can be run this way, see here.
In thread-based parallelism, you have shared memory (arrays are not copied). In process-based parallelism, you have 48 copies of MATLAB running on your computer at the same time, each needing their own copy of your data. That latter system was originally designed to work on a compute cluster, and was later retrofitted to work on a single machine with two or four cores. I don’t think it was ever meant for 48 cores.
If you cannot use threads with your code, configured your parallel pool to have fewer workers. For example parpool('local',8).
For more information, see this documentation page.

Related

Why is the Matlab parfor scheduler leaving workers idle?

I have a fairly long-running parfor loop (let's say 100,000 iterations where each iterations takes about a minute) that I'm running with 36 cores. I've noticed that near the end of the job, a large number of the cores go idle while a few finish what I think has to be multiple iterations per worker. This leads to a lot of wasted computing time waiting for one worker to finish several jobs while the others are sitting idle.
The following script shows the issue (using the File Exchange utility Par.m):
% Set up parallel pool
nLoop = 480;
p1 = gcp;
% Run a loop
pclock = Par(nLoop);
parfor iLoop = 1:nLoop
Par.tic;
pause(0.1);
pclock(iLoop) = Par.toc;
end
stop(pclock);
plot(pclock);
% Process the timing info:
runs = [[pclock.Worker]' [pclock.ItStart]' [pclock.ItStop]'];
nRuns = arrayfun(#(x) sum(runs(:,1) == x), 1:max(runs));
starts = nan(max(nRuns), p1.NumWorkers);
ends = nan(max(nRuns), p1.NumWorkers);
for iS = 1:p1.NumWorkers
starts(1:nRuns(iS), iS) = sort(runs(runs(:, 1) == iS, 2));
ends(1:nRuns(iS), iS) = sort(runs(runs(:, 1) == iS, 3));
end
firstWorkerStops = min(max(ends));
badRuns = starts > firstWorkerStops;
nBadRuns = sum(sum(badRuns)) - (p1.NumWorkers-1);
fprintf('At least %d (%3.1f%%) iterations run inefficiently.\n', ...
nBadRuns, nBadRuns/nLoop * 100);
The way I'm looking at it, every worker should be busy until the queue is empty, after which all workers sit idle. But here it looks like that's not happening - with 480 iterations, I'm getting between 6-20 iterations that start on a worker after a different worker has been sitting idle for a full cycle. This number appears to scale linearly with the number of loop iterations coming in near 2% of the total. With limited testing this appears to be consistent across Matlab 2016b and 2014b.
Is there any reason that this is the expected behavior or is this simply a poorly written scheduler in the parfor implementation? If so, how can I structure this so I'm not sitting around with idle workers so long?
I think this explains what you are observing.
If there are more iterations than workers, some workers perform more than one loop iteration; in this case, a worker might receive multiple iterations at once to reduce communication time. (From "When to Use parfor")
Towards the end of a loop, a two workers may finish their iterations at around the same time. If there is only one group of iterations left to be assigned, then one worker will get them all, and the other will remain idle. It sounds like it is expected behavior, and it is probably because the underlying implementation tries to reduce the communication cost associated with a worker pool. I've looked around the web and the Matlab settings and it doesn't seem like there is a way to adjust the communication strategy.
The parfor scheduler attempts to load-balance for loops where the iterations do not take a uniform amount of time. Unfortunately, as you observe, this can lead to workers becoming idle at the end of the loop. With parfor, you don't have control over the division of work; but you could use parfeval to divide your work up into even chunks - that might give you better utilisation. Or, you could even use spmd in conjunction with a for-drange loop.

Parfor limitations [duplicate]

the code I'm dealing with has loops like the following:
bistar = zeros(numdims,numcases);
parfor hh=1:nt
bistar = bistar + A(:,:,hh)*data(:,:,hh+1)' ;
end
for small nt (10).
After timing it, it is actually 100 times slower than using the regular loop!!! I know that parfor can do parallel sums, so I'm not sure why this isn't working.
I run
matlabpool
with the out-of-the-box configurations before running my code.
I'm relatively new to matlab, and just started to use the parallel features, so please don't assume that I'm am not doing something stupid.
Thanks!
PS: I'm running the code on a quad core so I would expect to see some improvements.
Making the partitioning and grouping the results (overhead in dividing the work and gathering results from the several threads/cores) is high for small values of nt. This is normal, you would not partition data for easy tasks that can be performed quickly in a simple loop.
Always perform something challenging inside the loop that is worth the partitioning overhead. Here is a nice introduction to parallel programming.
The threads come from a thread pool so the overhead of creating the threads should not be there. But in order to create the partial results n matrices from the bistar size must be created, all the partial results computed and then all these partial results have to be added (recombining). In a straight loop, this is with a high probability done in-place, no allocations take place.
The complete statement in the help (thanks for your link hereunder) is:
If the time to compute f, g, and h is
large, parfor will be significantly
faster than the corresponding for
statement, even if n is relatively
small.
So you see they mean exactly the same as what I mean, the overhead for small n values is only worth the effort if what you do in the loop is complex/time consuming enough.
Parforcomes with a bit of overhead. Thus, if nt is really small, and if the computation in the loop is done very quickly (like an addition), the parfor solution is slower. Furthermore, if you run parforon a quad-core, speed gain will be close to linear for 1-3 cores, but less if you use 4 cores, since the last core also needs to run system processes.
For example, if parfor comes with 100ms of overhead, and the computation in the loop takes 5ms, and if we assume that speed gain is linear up to 4 cores with a coefficient of 1 (i.e. using 4 cores makes the computation 4 times faster), nt needs to be about 30 for you to achieve a speed gain with parfor (150ms with for, 132ms with parfor). If you were to run only 10 iterations, parfor would be slower (50ms with for, 112ms with parfor).
You can calculate the overhead on your machine by comparing execution time with 1 worker vs 0 workers, and you can estimate speed gain by making a liner fit through the execution times with 1 to 4 workers. Then you'll know when it's useful to use parfor.
Besides the bad performance because of the communication overhead (see other answers), there is another reason not to use parfor in this case. Everything which is done within the parfor in this case uses built-in multithreading. Assuming all workers are running on the same PC there is no advantage because a single call already uses all cores of your processor.

work stealing with Parallel Computing Toolbox

I have implemented a combinatorial search algorithm (for comparison to a more efficient optimization technique) and tried to improve its runtime with parfor.
Unfortunately, the work assignments appear to be very badly unbalanced.
Each subitem i has complexity of approximately nCr(N - i, 3). As you can see, the tasks i < N/4 involve significantly more work than i > 3*N/4, yet it seems MATLAB is assigning all of i < N/4 to a single worker.
Is it true that MATLAB divides the work based on equally sized subsets of the loop range?
No, this question cites the documentation saying it does not.
Is there a convenient way to rebalance this without hardcoding the number of workers (e.g. if I require exactly 4 workers in the pool, then I could swap the two lowest bits of i with two higher bits in order to ensure each worker received some mix of easy and hard tasks)?
I don't think a full "work-stealing" implementation is necessary, perhaps just assigning 1, 2, 3, 4 to the workers, then when 4 completes first, its worker begins on item 5, and so on. The size of each item is sufficiently larger than the number of iterations that I'm not too worried about the increased communication overhead.
If the loop iterations are indeed distributed ahead of time (which would mean that in the end, there is a single worker that will have to complete several iterations while the other workers are idle - is this really the case?), the easiest way to ensure a mix is to randomly permute the loop iterations:
permutedIterations = randperm(nIterations);
permutedResults = cell(nIterations,1); %# or whatever else you use for storing results
%# run the parfor loop, completing iterations in permuted order
parfor iIter = 1:nIterations
permutedResults(iIter) = f(permutedIterations(iIter));
end
%# reorder results for easier subsequent analysis
results = permutedResults(permutedIterations);

Matlab parfor execution speed [duplicate]

the code I'm dealing with has loops like the following:
bistar = zeros(numdims,numcases);
parfor hh=1:nt
bistar = bistar + A(:,:,hh)*data(:,:,hh+1)' ;
end
for small nt (10).
After timing it, it is actually 100 times slower than using the regular loop!!! I know that parfor can do parallel sums, so I'm not sure why this isn't working.
I run
matlabpool
with the out-of-the-box configurations before running my code.
I'm relatively new to matlab, and just started to use the parallel features, so please don't assume that I'm am not doing something stupid.
Thanks!
PS: I'm running the code on a quad core so I would expect to see some improvements.
Making the partitioning and grouping the results (overhead in dividing the work and gathering results from the several threads/cores) is high for small values of nt. This is normal, you would not partition data for easy tasks that can be performed quickly in a simple loop.
Always perform something challenging inside the loop that is worth the partitioning overhead. Here is a nice introduction to parallel programming.
The threads come from a thread pool so the overhead of creating the threads should not be there. But in order to create the partial results n matrices from the bistar size must be created, all the partial results computed and then all these partial results have to be added (recombining). In a straight loop, this is with a high probability done in-place, no allocations take place.
The complete statement in the help (thanks for your link hereunder) is:
If the time to compute f, g, and h is
large, parfor will be significantly
faster than the corresponding for
statement, even if n is relatively
small.
So you see they mean exactly the same as what I mean, the overhead for small n values is only worth the effort if what you do in the loop is complex/time consuming enough.
Parforcomes with a bit of overhead. Thus, if nt is really small, and if the computation in the loop is done very quickly (like an addition), the parfor solution is slower. Furthermore, if you run parforon a quad-core, speed gain will be close to linear for 1-3 cores, but less if you use 4 cores, since the last core also needs to run system processes.
For example, if parfor comes with 100ms of overhead, and the computation in the loop takes 5ms, and if we assume that speed gain is linear up to 4 cores with a coefficient of 1 (i.e. using 4 cores makes the computation 4 times faster), nt needs to be about 30 for you to achieve a speed gain with parfor (150ms with for, 132ms with parfor). If you were to run only 10 iterations, parfor would be slower (50ms with for, 112ms with parfor).
You can calculate the overhead on your machine by comparing execution time with 1 worker vs 0 workers, and you can estimate speed gain by making a liner fit through the execution times with 1 to 4 workers. Then you'll know when it's useful to use parfor.
Besides the bad performance because of the communication overhead (see other answers), there is another reason not to use parfor in this case. Everything which is done within the parfor in this case uses built-in multithreading. Assuming all workers are running on the same PC there is no advantage because a single call already uses all cores of your processor.

parfor not giving speed ups [duplicate]

the code I'm dealing with has loops like the following:
bistar = zeros(numdims,numcases);
parfor hh=1:nt
bistar = bistar + A(:,:,hh)*data(:,:,hh+1)' ;
end
for small nt (10).
After timing it, it is actually 100 times slower than using the regular loop!!! I know that parfor can do parallel sums, so I'm not sure why this isn't working.
I run
matlabpool
with the out-of-the-box configurations before running my code.
I'm relatively new to matlab, and just started to use the parallel features, so please don't assume that I'm am not doing something stupid.
Thanks!
PS: I'm running the code on a quad core so I would expect to see some improvements.
Making the partitioning and grouping the results (overhead in dividing the work and gathering results from the several threads/cores) is high for small values of nt. This is normal, you would not partition data for easy tasks that can be performed quickly in a simple loop.
Always perform something challenging inside the loop that is worth the partitioning overhead. Here is a nice introduction to parallel programming.
The threads come from a thread pool so the overhead of creating the threads should not be there. But in order to create the partial results n matrices from the bistar size must be created, all the partial results computed and then all these partial results have to be added (recombining). In a straight loop, this is with a high probability done in-place, no allocations take place.
The complete statement in the help (thanks for your link hereunder) is:
If the time to compute f, g, and h is
large, parfor will be significantly
faster than the corresponding for
statement, even if n is relatively
small.
So you see they mean exactly the same as what I mean, the overhead for small n values is only worth the effort if what you do in the loop is complex/time consuming enough.
Parforcomes with a bit of overhead. Thus, if nt is really small, and if the computation in the loop is done very quickly (like an addition), the parfor solution is slower. Furthermore, if you run parforon a quad-core, speed gain will be close to linear for 1-3 cores, but less if you use 4 cores, since the last core also needs to run system processes.
For example, if parfor comes with 100ms of overhead, and the computation in the loop takes 5ms, and if we assume that speed gain is linear up to 4 cores with a coefficient of 1 (i.e. using 4 cores makes the computation 4 times faster), nt needs to be about 30 for you to achieve a speed gain with parfor (150ms with for, 132ms with parfor). If you were to run only 10 iterations, parfor would be slower (50ms with for, 112ms with parfor).
You can calculate the overhead on your machine by comparing execution time with 1 worker vs 0 workers, and you can estimate speed gain by making a liner fit through the execution times with 1 to 4 workers. Then you'll know when it's useful to use parfor.
Besides the bad performance because of the communication overhead (see other answers), there is another reason not to use parfor in this case. Everything which is done within the parfor in this case uses built-in multithreading. Assuming all workers are running on the same PC there is no advantage because a single call already uses all cores of your processor.