How to use SPMD for different input variables and save the output in order?

I am using a simulated annealing algorithm to optimize my problem. I have to run it for 100 different input variables and save the output for all of them in order. The problem is that I don't know how to use spmd in my code so that each input runs on one CPU core and the final results are stored in a matrix with 100 rows. I've tried putting spmd before the first for loop, but it only returns a Composite consisting of 4 elements, since my CPU has 4 cores. Here is my code:
spmd
for v=1:100
posmat=loading_param(Matrix,v);
nvar=size(posmat,2);
popsize=50;
maxiter=20;
T0=1000;
Tf=1;
Tdamp=((T0-Tf)/maxiter);
nn=5;
T=T0;
%% initial population
tic
emp.var=[];
emp.fit=inf;
pop=repmat(emp,popsize,1);
for i=1:popsize
pop(i).var=randperm(nvar);
pop_double=pop(i).var;
posmat_new=tabdil(nvar,pop_double,posmat);
dis=cij(posmat_new);
pop(i).fit=fittness(dis);
end
[value,index]=min([pop.fit]);
gpop=pop(index);
%% algorithm main loop
BEST=zeros(maxiter,1);
for iter=1:maxiter
for i=1:popsize
bnpop=emp;
for j=1:nn
npop=create_new_pop(pop(j),nvar,posmat);
if npop.fit<bnpop.fit
bnpop=npop;
end
end
if bnpop.fit<pop(i).fit
pop(i)=bnpop;
else
E=bnpop.fit-pop(i).fit;
pr=exp(-E/T);
if rand<pr
pop(i)=bnpop;
end
end
end
T=T-Tdamp;
[value,index]=min([pop.fit]);
if value<gpop.fit
gpop=pop(index);
BEST(iter)=gpop.fit;
disp([ 'iter= ' num2str(iter) 'BEST=' num2str(BEST(iter))])
end
end
%% algorithm results
disp([ ' Best solution=' num2str(gpop.var)])
disp([ ' Best fittness=' num2str(gpop.fit)])
disp([ ' Best time=' num2str(toc)])
bnpop_all(d,:)=bnpop.var;
d=d+1;
end %end of main for loop
end % end of spmd

From the documentation on spmd:
Values returning from the body of an spmd statement are converted to Composite objects on the MATLAB client. A Composite object contains references to the values stored on the remote MATLAB workers, and those values can be retrieved using cell-array indexing. The actual data on the workers remains available on the workers for subsequent spmd execution, so long as the Composite exists on the client and the parallel pool remains open.
Thus the output is a Composite with 4 elements, because your pool has 4 workers (one per CPU core): output{1} gives you the first worker's value, output{2} the second, and so on. Concatenate those to get your output in a single matrix.
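A minimal sketch of that concatenation step, assuming an open parallel pool. The names part, blocks and combined are made up for illustration, and the 2-by-3 block is a stand-in for whatever each worker really computes:

```matlab
% Each worker builds its own block; labindex is the worker's index in the pool.
spmd
    part = labindex * ones(2, 3);   % placeholder for the real per-worker result
end

% Outside spmd, part is a Composite; part{w} fetches the value held by worker w.
blocks = cell(length(part), 1);
for w = 1:length(part)
    blocks{w} = part{w};
end

% Stack the per-worker blocks in worker order to get one ordinary matrix.
combined = vertcat(blocks{:});
```

The rows of combined are ordered by worker index, so the ordering of the final matrix depends on which inputs each worker was given.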
As written, your code simply runs four times: each worker executes the complete 100-iteration for loop. An easier way to solve this is to use parfor instead of spmd, since you can leave your loop body the same. If you want to use spmd, first cut your v into four pieces (of 25 elements each), then on each worker iterate over just those 25 elements.
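The parfor route could look roughly like the sketch below. It assumes every run returns a row of the same length nvar; run_one is a hypothetical placeholder for one complete simulated annealing run and is not part of the original code:

```matlab
numInputs = 100;
nvar = 8;                              % assumed fixed result length per input

% Hypothetical stand-in for one optimisation run on input v.
run_one = @(v) v * ones(1, nvar);

results = zeros(numInputs, nvar);      % row v will hold the result for input v
parfor v = 1:numInputs
    % results is a sliced output variable, so each row lands in its own
    % position regardless of the order in which the iterations execute.
    results(v, :) = feval(run_one, v);
end
```

If the result length differs between inputs, a cell array indexed by v (results{v} = ...) would be the natural replacement for the preallocated matrix.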
Seeing your code, with its three nested loops, I suggest not parallelising now. Instead, profile your code, find out where the bottlenecks are, and try to speed those up. Vectorising your nested loops will probably improve things a lot already.

Related

Why accessing 2d matrix in parfor so slow?

Let's say I have a large matrix A:
A = rand(10000,10000);
The following serial code took around 0.5 seconds
tic
for i=1:5
r=9999*rand(1);
disp(A(round(r)+1, round(r)+1))
end
toc
Whereas the following code with parfor took around 47 seconds
tic
parfor i=1:5
r=9999*rand(1);
disp(A(round(r)+1, round(r)+1))
end
toc
How can I speed up the parfor code?
EDIT: If instead of using disp, I try to compute the sum with the following code
sum=0;
tic
for i=1:5000
r=9999*rand(1);
sum=sum+(A(round(r)+1, round(r)+1));
end
toc
This takes .025 sec
But parfor it takes 42.5 sec:
tic
parfor i=1:5000
r=9999*rand(1);
sum=sum+(A(round(r)+1, round(r)+1));
end
toc

Your issue is in not considering node communication overheads.
When you use a parfor to loop using parallel computation, you have to think about the structure of several worker nodes doing small tasks for the client node.
Here are some issues with the tests you present:
The function disp is serial, since you can only display results one at a time to the client node. Communication between nodes is needed to schedule this task.
Creating a summation external to the loop means all of the nodes have to communicate the current value back to the client node.
A is a broadcast variable in all of your examples. From the docs:
This type of variable can be useful or even essential for particular tasks. However, large broadcast variables can cause significant communication between client and workers and increase parallel overhead.
The MATLAB editor warns you about this, underlining the variable in orange with the following tooltip:
The entire array or structure 'A' is a broadcast variable. This might result in unnecessary communication overhead.
Instead, we can calculate some random indices up front and slice A into temporary variables to use in the loop. Then do gathering operations (like summing all of the parts) after the loop.
k = 50;
sumA = zeros( k, 1 ); % Output values for each loop index
idx = randi( [1,1e4], k, 1 ); % Calculate our indices outside the loop
randA = A( sub2ind( size(A), idx, idx ) ); % Pick the elements A(idx(ii),idx(ii)) outside the loop
parfor ii = 1:k
sumA( ii ) = randA( ii ); % All that's left to do in the loop
end
sumA = sum( sumA ); % Collate results from all nodes
I did a quick benchmark to compare your 2 summation tests with the above code, using R2017b and 12 workers, here are my results:
Serial loop: ~ 0.001 secs
Parallel with broadcasting: ~ 100 secs
Parallel no broadcasting: ~ 0.1 secs
Parallel loops are overkill for this operation, the overhead isn't justified, but it's clear that with some pre-allocation and avoiding of broadcast variables, they are at least not 5 orders of magnitude slower!
See how the version of the code without broadcast variables uses more vectorisation too, which will speed up the code without even having to use parfor. Optimising your code before using parallel computation will not only speed things up for serial computation, but often make the transition easier too!
Side note: sum and i are bad variable names because they are the names of built-in functions.
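As a rough illustration of the vectorisation point, the random-diagonal sum from the question can be written with no loop at all. This is only a sketch: the matrix is smaller than the one in the question, and the names idx, total and loopTotal are chosen here for illustration:

```matlab
A = rand(1000);                          % smaller than the 10000 x 10000 original
k = 5000;                                % number of random samples
idx = randi(size(A, 1), k, 1);           % random row/column positions

% Vectorised: convert (idx, idx) pairs to linear indices and sum in one call.
total = sum(A(sub2ind(size(A), idx, idx)));

% Equivalent serial loop, kept only to check the vectorised result.
loopTotal = 0;
for n = 1:k
    loopTotal = loopTotal + A(idx(n), idx(n));
end
```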

So there are a few main causes,
MATLAB parallel toolbox sucks. It just does unless you're using the GPU portion.
The only time it's beneficial is if the individual tasks are large enough. Your computer has to dedicate a core to assigning jobs to all the other cores. This is expensive and has a lot of overhead unless the jobs are of sufficient size. Your computer is running overtime assigning small jobs. If you were assigning jobs that would each take a minute it would be a different story.
You're running too few jobs. You're only looping 5 times over very small jobs. Why would you even bother trying to multithread this? When I loop 500,000 times, it finally gains a speedup with parfor, provided I reduce the matrix size to 1000 x 1000.
When you run parfor, MATLAB has to duplicate memory across all of the threads. You have a 10,000 x 10,000 matrix, which takes up 800 MB. Duplicated across a 4-core machine, that is 3,200 MB, or probably half of your RAM. Operating on these arrays costs extra memory, potentially doubling the size to 6,400 MB, which is probably more than you can afford to use.
Simply put, "how do I speed up this parfor code?"
You don't
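The memory figures quoted in this answer follow from 8 bytes per double-precision element. A small sketch of the arithmetic, assuming one full copy of the matrix per worker (the variable names are illustrative):

```matlab
n = 10000;                                % matrix is n-by-n
bytesPerDouble = 8;                       % size of one double in bytes
matrixMB = n * n * bytesPerDouble / 1e6;  % 800 MB for a single copy
numCopies = 4;                            % assumed: one copy per worker
totalMB = numCopies * matrixMB;           % 3200 MB across four workers
```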

How to save intermediate iterations during SPMD in MATLAB?

I am experimenting with MATLAB SPMD. However, I have the following problem to solve:
I am running a quite long algorithm and I would like to save the progress along the way in case the power gets cut, someone unplugs the power plug, or a memory error occurs.
The loop has 144 iterations that take each around 30 minutes to complete => 72h. A lot of problems can occur in that interval.
Of course, I have the distributed computing toolbox on my machine. The computer has 4 physical cores. I run MATLAB R2016a.
I do not really want to use a parfor loop because I concatenate results and have dependency across iterations. I think SPMD is the best choice for what I want to do.
I'll try to describe what I want as best as I can:
I want to be able to save at a set iteration of the loop the results so far, and I want to save the results by worker.
Below is a Minimum (non)-Working Example. The last four lines should be put in a different .m file. This function, called within a parfor loop, allows to save intermediate iterations. It is working properly in other routines that I use. The error is at line 45 (output_save). Somehow, I would like to "pull" the composite object into a "regular" object (cell/structure).
My hunch is that I do not quite understand how Composite objects work and especially how they can be saved into "regular" objects (cells, structures, etc).
% SPMD MWE
% Clear necessary things
clear output output2 output_temp iter kk
% Useful thing that will be used later on
Rorder=perms(1:4);
% Stem of the file to save the data to
stem='MWE_MATLAB_spmd';
% Create empty cells where the results of the kk loop will be stored
output1{1,1}=[];
output2{1,2}=[];
% Start the parpool
poolobj=gcp;
% Define which worker/lab will do which iteration
iterperworker=ceil(size(Rorder,1)/poolobj.NumWorkers);
for i=1:poolobj.NumWorkers
if i<poolobj.NumWorkers
itertodo{1,i}=1+(iterperworker)*(i-1):iterperworker*i;
else
itertodo{1,i}=1+(iterperworker)*(i-1):size(Rorder,1);
end
end
%Start the spmd
% try
spmd
iter=1;
for kk=itertodo{1,labindex}
% Print which iteration is done at the moment
fprintf('\n');
fprintf('Ordering %d/%d \r',kk,size(Rorder,1));
for j=1:size(Rorder,2)
output_temp(1,j)=Rorder(kk,j).^j; % just to populate a structure
end
output.output1{1,1}=cat(2,output.output1{1,1},output_temp); % Concatenate the results
output.output2{1,2}=cat(2,output.output1{1,2},0.5*output_temp); % Concatenate the results
labindex_save=labindex;
if mod(iter,2)==0
output2.output=output; % manually put output in a structure
dosave(stem,labindex_save,output2); % Calls the function that allows me to save in parallel computing
end
iter=iter+1;
end
end
% catch me
% end
% Function to paste in another m-file
% function dosave(stem,i,vars)
% save(sprintf([stem '%d.mat'],i),'-struct','vars')
% end

A Composite is created only outside an spmd block. In particular, variables that you define inside an spmd block exist as a Composite outside that block. When the same variable is used back inside an spmd block, it is transformed back into the original value. Like so:
spmd
x = labindex;
end
isa(x, 'Composite') % true
spmd
isa(x, 'Composite') % false
isequal(x, labindex) % true
end
So, you should not be transforming output using {:} indexing - it is not a Composite. I think you should simply be able to use
dosave(stem, labindex, output);
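A stripped-down sketch of per-worker checkpointing along these lines, assuming an open parallel pool. The accumulator acc, the file stem in tempdir and the four-iteration loop are placeholders, not the asker's real computation. The helper is written as a local function at the end of a script; in older releases it would live in its own dosave.m file, as in the question:

```matlab
% File name stem for the checkpoints (placed in tempdir for this sketch).
stem = fullfile(tempdir, 'spmd_checkpoint_');

spmd
    acc = [];                              % plain array inside spmd, not a Composite
    for iter = 1:4
        acc = [acc, labindex * iter];      % placeholder for the real work
        if mod(iter, 2) == 0
            % Save this worker's progress every second iteration.
            dosave(stem, labindex, struct('acc', acc));
        end
    end
end

function dosave(stem, workerId, vars)
    % save is called from a helper function rather than directly in the spmd body.
    save(sprintf('%s%d.mat', stem, workerId), '-struct', 'vars');
end
```

Each worker writes to its own file (stem followed by its index), so no two workers write to the same file.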

How to nest multiple parfor loops

parfor is a convenient way to distribute independent iterations of intensive computations among several "workers". One meaningful restriction is that parfor-loops cannot be nested, and invariably, that is the answer to similar questions like there and there.
Why parallelization across loop boundaries is so desirable
Consider the following piece of code where iterations take a highly variable amount of time on a machine that allows 4 workers. Both loops iterate over 6 values, clearly hard to share among 4.
for row = 1:6
parfor col = 1:6
somefun(row, col);
end
end
It seems like a good idea to choose the inner loop for parfor because individual calls to somefun are more variable than iterations of the outer loop. But what if the run time for each call to somefun is very similar? What if there are trends in run time and we have three nested loops? These questions come up regularly, and people go to extremes.
Pattern needed for combining loops
Ideally, somefun is run for all pairs of row and col, and workers should get busy irrespectively of which iterand is being varied. The solution should look like
parfor p = allpairs(1:6, 1:6)
somefun(p(1), p(2));
end
Unfortunately, even if I knew which builtin function creates a matrix with all combinations of row and col, MATLAB would complain with the error "The range of a parfor statement must be a row vector." Yet for would not complain and would nicely iterate over columns. An easy workaround would be to create that matrix and then index it with parfor:
p = allpairs(1:6, 1:6);
parfor k = 1:size(p, 1)
row = p(k, 1);
col = p(k, 2);
somefun(row, col);
end
What is the builtin function in place of allpairs that I am looking for? Is there a convenient idiomatic pattern that someone has come up with?

MrAzzaman already pointed out how to linearise nested loops. Here is a general solution to linearise n nested loops.
1) Assuming you have a simple nested loop structure like this:
%dummy function for demonstration purposes
f=@(a,b,c)([a,b,c]);
%three loops
X=cell(4,5,6);
for a=1:size(X,1);
for b=1:size(X,2);
for c=1:size(X,3);
X{a,b,c}=f(a,b,c);
end
end
end
2) Basic linearisation using a for loop:
%linearized conventional loop
X=cell(4,5,6);
iterations=size(X);
for ix=1:prod(iterations)
[a,b,c]=ind2sub(iterations,ix);
X{a,b,c}=f(a,b,c);
end
3) Linearisation using a parfor loop.
%linearized parfor loop
X=cell(4,5,6);
iterations=size(X);
parfor ix=1:prod(iterations)
[a,b,c]=ind2sub(iterations,ix);
X{ix}=f(a,b,c);
end
4) In the linearised for loop from step 2, the order in which the iterations are executed differs from the original nested loops (the first index now varies fastest). If anything relies on the original order, you have to reverse the order of the indices:
%linearized conventional loop
X=cell(4,5,6);
iterations=fliplr(size(X));
for ix=1:prod(iterations)
[c,b,a]=ind2sub(iterations,ix);
X{a,b,c}=f(a,b,c);
end
Reversing the order when using a parfor loop is irrelevant. You can not rely on the order of execution at all. If you think it makes a difference, you can not use parfor.
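The same idea can be written without hard-coding the number of loops by collecting the outputs of ind2sub in a cell array. This is a sketch with made-up names (dims, subs, s), shown with a plain for loop:

```matlab
dims = [4, 5, 6];                 % iteration counts of the n nested loops
n = numel(dims);
subs = zeros(prod(dims), n);      % row ix holds the n loop indices for step ix

for ix = 1:prod(dims)
    s = cell(1, n);
    [s{:}] = ind2sub(dims, ix);   % one output per loop level
    subs(ix, :) = [s{:}];         % e.g. pass s{:} on to the loop body instead
end
```

Here the first index varies fastest, matching the behaviour of ind2sub described above.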

You should be able to do this with bsxfun. I believe that bsxfun will parallelise code where possible (see here for more information), in which case you should be able to do the following:
bsxfun(@somefun,(1:6)',1:6);
You would probably want to benchmark this though.
Alternatively, you could do something like the following:
function parfor_allpairs(fun, num_rows, num_cols)
parfor i=1:(num_rows*num_cols)
fun(mod(i-1,num_rows)+1,floor((i-1)/num_rows)+1);
end
then call with:
parfor_allpairs(@somefun,6,6);
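One way to check an index mapping of this kind is to enumerate it and confirm that every (row, col) pair appears exactly once. The sketch below uses mod(k-1,num_rows)+1 for the row and floor((k-1)/num_rows)+1 for the column, and compares the result with ind2sub; the variable names are illustrative:

```matlab
num_rows = 6;
num_cols = 6;
total = num_rows * num_cols;

% Enumerate the mapping from linear index k to a (row, col) pair.
pairs = zeros(total, 2);
for k = 1:total
    pairs(k, :) = [mod(k - 1, num_rows) + 1, floor((k - 1) / num_rows) + 1];
end

% Reference mapping from the built-in ind2sub (column-major order).
[rr, cc] = ind2sub([num_rows, num_cols], (1:total).');
```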

Based on the answers from @DanielR and @MrAzzaman, I am posting two functions, iterlin and iterget, to use in place of prod and ind2sub. They allow iteration over ranges even if those do not start from one. An example of the pattern becomes
rng = [1, 4; 2, 7; 3, 10];
parfor k = iterlin(rng)
[plate, row, col] = iterget(rng, k);
% time-consuming computations here %
end
The script will process the wells in rows 2 to 7 and columns 3 to 10 on plates 1 to 4 without any workers idling while more wells are waiting to be processed. In hope that this helps someone, I deposited iterlin and iterget at the MATLAB File Exchange.
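For readers without the File Exchange functions, a similar effect can be approximated with prod, ind2sub and an offset. The sketch below is not the implementation of iterlin or iterget; ranges, counts, offset and visited are names invented here:

```matlab
ranges = [1, 4; 2, 7; 3, 10];            % rows: [first, last] for plate, row, col
counts = diff(ranges, 1, 2).' + 1;       % iterations per level: [4 6 8]
offset = ranges(:, 1).' - 1;             % shift from 1-based subscripts to ranges
total = prod(counts);

visited = zeros(total, 3);               % records which (plate, row, col) ran
parfor k = 1:total
    [p, r, c] = ind2sub(counts, k);      % 1-based subscripts for step k
    visited(k, :) = [p, r, c] + offset;  % actual plate, row and column
    % time-consuming computations would go here
end
```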

Using parfor in matlab for nested loops

I want to parallelize block2 for each block1 and parallelize the outer loop too.
previous code:
for i=rangei
<block1>
for j=rangej
<block2> dependent on <block1>
end
end
changed code:
parfor i=rangei
<block1>
parfor j=rangej
<block2> dependent on <block1>
end
end
How efficient can this get, and will the changed code do the right thing?
Is the changed code valid for my requirements?

In MATLAB, parfor cannot be nested. Which means, in your code, you should replace one parfor by a for (the outer loop most likely). More generally, I advise you to look at this tutorial on parfor.

parfor cannot be nested. In nested parfor statements, only the outermost call to parfor is parallelized, which means that the inner call to parfor only adds unnecessary overhead.
To get high efficiency with parfor, the number of iterations should be much higher than the number of workers (or an exact multiple, in case each iteration takes the same time), and you want a single iteration to take more than just a few milliseconds to avoid feeling the overhead from parallelization.
parfor i=rangei
<block1>
for j=rangej
<block2> dependent on <block1>
end
end
may actually fit that description, depending on the size of rangei. Alternatively, you may want to try unrolling the nested loop into a single loop, where you iterate across linear indices.
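A concrete, runnable version of the parfor-outside, for-inside pattern might look like this. The arithmetic is a placeholder for block1 and block2, and the names out, base and rowvals are made up for the sketch:

```matlab
rangei = 1:6;
rangej = 1:4;
out = zeros(numel(rangei), numel(rangej));

parfor a = 1:numel(rangei)
    base = rangei(a)^2;                    % stands in for <block1>
    rowvals = zeros(1, numel(rangej));     % temporary, private to this iteration
    for b = 1:numel(rangej)
        rowvals(b) = base + rangej(b);     % stands in for <block2>, uses block1's result
    end
    out(a, :) = rowvals;                   % sliced output: one row per iteration
end
```

Collecting the inner loop's results in a temporary row and assigning it once keeps out a sliced variable, which parfor can classify.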

The following code uses a single parfor loop to implicitly manage two nested loops. The loop1_index and loop2_index are the ranges, and the loop1_counter and loop2_counter are the actual loop iterators. Also, the iterators are put in reverse order in order to have a better load balance, because usually the load of higher range values is bigger than those of smaller values.
loop1_index = [1:5]
loop2_index = [1:4]
parfor temp_label_parfor = 1 : numel(loop1_index) * numel(loop2_index)
[loop1_counter, loop2_counter] = ind2sub([numel(loop1_index), numel(loop2_index)], temp_label_parfor)
loop1_counter = numel(loop1_index) - loop1_counter + 1;
loop2_counter = numel(loop2_index) - loop2_counter + 1;
end

You can't nest parfor. From your question, it seems that you are working on a matrix (with indices i, j), so try using blockproc; see the blockproc documentation.

Parfor in MATLAB Problem

Why can't I use the parfor in this piece of code?
parfor i=1:r
for j=1:N/r
xr(j + (N/r) * (i-1)) = x(i + r * (j-1));
end
end
This is the error:
Error: The variable xr in a parfor cannot be classified.
See Parallel for Loops in MATLAB, "Overview".

The issue here is that of improper indexing of the sliced array. parfor loops are run asynchronously, meaning the order in which each iteration is executed is random. From the documentation:
MATLAB workers evaluate iterations in no particular order, and independently of each other. Because each iteration is independent, there is no guarantee that the iterations are synchronized in any way, nor is there any need for this.
You can easily verify the above statement by typing the following in the command line:
parfor i=1:100
i
end
You'll see that the ordering is arbitrary. Hence if you split a parallel job between different workers, one worker has no way of telling if a different iteration has finished or not. Hence, your variable indexing cannot depend on past/future values of the iterator.
Let me demonstrate this with a simple example. Consider the Fibonacci series 1,1,2,3,5,8,.... You can generate the first 10 terms of the series easily (in a naïve for loop) as:
f=zeros(1,10);
f(1:2)=1;
for i=3:10
f(i)=f(i-1)+f(i-2);
end
Now let's do the same with a parfor loop.
f=zeros(1,10);
f(1:2)=1;
parfor i=3:10
f(i)=f(i-1)+f(i-2);
end
??? Error: The variable f in a parfor cannot be classified.
See Parallel for Loops in MATLAB, "Overview"
But why does this give an error?
I've shown that iterations are executed in an arbitrary order. So let's say that a worker gets the loop index i=7 and the expression f(i)=f(i-1)+f(i-2);. It is now supposed to execute the expression and return the results to the master node. Now, has iteration i=6 finished? Is the value stored in f(6) reliable? What about f(5)? Do you see what I'm getting at? Supposing f(5) and f(6) are not done, you'll incorrectly calculate that the 7th term in the Fibonacci series is 0!
Since MATLAB has no way of telling if your calculation can be guaranteed to run correctly and reproduce the same result each time, such ambiguous assignments are explicitly disallowed.
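For the loop in the question, where the iterations are in fact independent, one possible workaround is to write into a matrix indexed directly by the parfor variable and reshape afterwards, or to skip the loop entirely. This is a sketch with small illustrative sizes; rows, tmp, xr_par and xr_vec are names introduced here:

```matlab
N = 12;  r = 3;  Q = N / r;               % small sizes chosen for illustration
x = 1:N;

% Reference result from the original serial double loop.
xr_ref = zeros(1, N);
for ii = 1:r
    for jj = 1:Q
        xr_ref(jj + Q * (ii - 1)) = x(ii + r * (jj - 1));
    end
end

% parfor version: rows(ii, :) is indexed by the loop variable alone,
% so parfor can classify it as a sliced output.
rows = zeros(r, Q);
parfor ii = 1:r
    tmp = zeros(1, Q);
    for jj = 1:Q
        tmp(jj) = x(ii + r * (jj - 1));
    end
    rows(ii, :) = tmp;
end
xr_par = reshape(rows.', 1, []);

% Loop-free version: the same reordering as a reshape and a transpose.
xr_vec = reshape(reshape(x, r, Q).', 1, []);
```

For a pure reindexing like this, the loop-free version is likely to be the fastest of the three.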
