Generate random numbers inside spmd in MATLAB

I am running a Monte Carlo simulation in MATLAB, using parallelisation because the simulation takes a very long time to run.
The main objective is to create a really big panel data set and use it to estimate some regressions.
The problem is that running the simulation without parallelisation takes A LOT of time, so I decided to use the spmd option. However, the results from the parallelised code are very different from those of the normal one.
rng(3857);
for r=1:MCREP
Ycom=[];
Xcom=[];
YLcom=[];
spmd
for it=labindex:numlabs:NT
(code to generate different components, alpha, delta, x_it, eps_it)
%e.g. x_it=2+1*randn(TT,1);
(uses random number generator: randn)
% Create different time periods observations for each individual
for t=2:TT
yi(t)=xi*alpha+mu*delta+rho*yi(t-1)+beta*x_it(t)+eps_it(t);
yLi(t)=yi(t-1);
end
% Concatenate each individual in a big matrix: create panel
Ycom=[Ycom yi];
Xcom=[Xcom x_it];
YLcom=[YLcom yLi];
end
end
% Retrieve data stored in composite form
mm=matlabpool('size');
for i=1:mm
Y(:,(i-1)*(NT/mm)+1:i*(NT/mm))=Ycom{i};
X(:,(i-1)*(NT/mm)+1:i*(NT/mm))=Xcom{i};
YL(:,(i-1)*(NT/mm)+1:i*(NT/mm))=YLcom{i};
end
(rest of the code, run regressions)
end
The intensive part of the code is the one parallelised with spmd: it creates a really large panel data set in which columns are independent individuals and rows are dependent time periods.
My main problem is that the results when I run the code in parallel differ from the results when I don't, and moreover the results differ between 8 workers and 16 workers. However, time-wise it is unfeasible to run the code without parallelisation.
I believe the problem comes from the random number generation, but I cannot fix the seed inside the spmd block because that would mean fixing the seed inside the Monte Carlo loop, so all the repetitions would get the same numbers.
I would like to know how I can set up the random number generator so that it gives me the same results no matter how many workers I use.
PS. Another solution would be to apply spmd to the outermost loop (the Monte Carlo loop); however, I see no performance gain when I parallelise that way.
Thank you very much for your help.

Heh... random number generation in MATLAB's parallel execution is indeed an issue.
MATLAB's web page about random number generators (http://www.mathworks.com/help/matlab/math/creating-and-controlling-a-random-number-stream.html) states that only two generator types support multiple independent streams. These two have a limited period (see the table at that link).
BUT!!! The default generator (mt19937ar) can be seeded differently on each worker in order to produce different results :)
Thus, what you can do is start with mrg32k3a, draw a random number on each worker, and then use that number together with the worker index to seed an mt19937ar generator.
E.g.
spmd
% randStream is assumed to be a cell array (or Composite) holding one
% mrg32k3a stream per worker, created beforehand (e.g. with RandStream.create)
r1 = rand(randStream{labindex}, [1 1]);
r2 = rand(randStream{labindex}, [1 1]);
% rng needs a nonnegative integer seed, so round the combination
rng(mod(round(labindex + 1e6*r1/r2), 2^32), 'twister');
% Do your stuff
end
Of course, r1 and r2 can be modified (or, maybe, more r's added) in order to have a more elaborate seeding.
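If the goal is that the results do not depend on the number of workers at all, one option (a sketch of my own, not part of the answer above, assuming MCREP, NT and TT are as defined in the question) is to key an mrg32k3a substream on the individual index rather than on labindex, so each individual always gets the same numbers no matter which worker draws them:
rng(3857);
seeds = randi(2^31-1, MCREP, 1);            % one seed per Monte Carlo repetition
for r = 1:MCREP
    spmd
        s = RandStream('mrg32k3a', 'Seed', seeds(r));
        for it = labindex:numlabs:NT
            s.Substream = it;               % substream keyed on the individual
            x_it   = 2 + randn(s, TT, 1);
            eps_it = randn(s, TT, 1);
            % ... build yi, yLi and concatenate as in the question ...
        end
    end
end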

Related

One-time randomization

I have a matrix, ECGsig, with each row containing a 1-second-long ECG signal.
I will classify them later, but first I want to randomly shuffle the rows, like this:
idx = randperm(size(ECGsig,1));
ECGsig = ECGsig(idx,:);
However, I want this to happen just once and not every time I run the program.
In other words, I want the random numbers to be generated only once,
because if they change every time, I will get different classification results.
Is there any way to do this besides doing it in a separate m-file and saving the result in a MAT file?
Thanks,
You can set the random number generator's seed so that it produces the same random sequence every time you run the program. You can do this through rng. This way, even though you run the program multiple times, it will still generate the same random sequence. As such, try doing something like:
rng(1234);
The input into rng would be the seed. However, as per Luis Mendo's comment, rng is only available with newer versions of MATLAB. Should rng not be available with your distribution of MATLAB, do this instead:
rand('seed', 1234);
You can also take a look at RandStream, but that's a bit too advanced, so let's not look at it right now. To reset the generator to the state it had when you opened MATLAB, choose a seed of 0. Therefore:
rng(0); %// or
rand('seed', 0);
By calling this, any random results you generate from this point on will follow a predetermined sequence. The seed can be any non-negative integer you want really, but use something that you'll remember. Place this at the very beginning of your code, before you do anything. The main reason we have control over how random numbers are generated is that it encourages reproducible results and research. This way, other people can regenerate the results you created should you decide to do anything involving randomization.
Even though you said you only want to run this randomization once, this will save you the headache of saving your results to a separate file before running the program multiple times. By setting the seed, even though you're running the program multiple times, you're guaranteed to generate the same random sequence each time.
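As a small side note (a sketch of my own, not from the original answer): if you want only the permutation itself to be repeatable, without pinning down the randomness of the rest of your program, you can save and restore the generator state around the shuffle:
s = rng;                        % save the current generator state
rng(1234);                      % fixed seed: the same permutation every run
idx = randperm(size(ECGsig,1));
ECGsig = ECGsig(idx,:);
rng(s);                         % restore the previous state for the rest of the code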

Predicting runtime of parallel loop using a-priori estimate of effort per iterand (for given number of workers)

I am working on a MATLAB implementation of an adaptive Matrix-Vector Multiplication for very large sparse matrices coming from a particular discretisation of a PDE (with known sparsity structure).
After a lot of pre-processing, I end up with a number of different blocks (greater than, say, 200), for which I want to calculate selected entries.
One of the pre-processing steps is to determine the (number of) entries per block I want to calculate, which gives me an almost perfect measure of the amount of time each block will take (for all intents and purposes the quadrature effort is the same for each entry).
Thanks to https://stackoverflow.com/a/9938666/2965879, I was able to make use of this by ordering the blocks in reverse order, thus goading MATLAB into starting with the biggest ones first.
However, the number of entries differs so wildly from block to block, that directly running parfor is limited severely by the blocks with the largest number of entries, even if they are fed into the loop in reverse.
My solution is to do the biggest blocks serially (but parallelised on the level of entries!), which is fine as long as the overhead per iterand doesn't matter too much, i.e. as long as the blocks don't get too small. The rest of the blocks I then do with parfor. Ideally, I'd let MATLAB decide how to handle this, but since a nested parfor-loop loses its parallelism, this doesn't work. Also, packaging both loops into one is (nigh) impossible.
My question now is about how to best determine this cut-off between the serial and the parallel regime, taking into account the information I have on the number of entries (the shape of the curve of ordered entries may differ for different problems), as well as the number of workers I have available.
So far, I had been working with the 12 workers available under the standard PCT license, but since I've now started working on a cluster, determining this cut-off becomes more and more crucial (since for many cores the overhead of the serial loop becomes more and more costly compared to the parallel loop, but similarly, blocks which hold up the rest become even more costly).
For 12 cores (or rather, the configuration of the compute server I was working with), I had figured out a reasonable parameter of 100 entries per worker as a cut-off, but this doesn't work well when the number of cores is no longer small in relation to the number of blocks (e.g. 64 vs 200).
I have tried to deflate the number of cores with different powers (e.g. 1/2, 3/4), but this also doesn't work consistently. Next I tried to group the blocks into batches and determine the cut-off when entries are larger than the mean per batch, or by the number of batches they are away from the end:
logical_sml = true(1,num_core); i = 0;
while all(logical_sml)
i = i+1;
m = mean(num_entr_asc(1:min(i*num_core,end))); % "asc" ~ ascending order
logical_sml = num_entr_asc(i*num_core+(1:num_core)) < i^(3/4)*m;
% if the small blocks were parallelised perfectly, i.e. all
% cores take the same time, the time would be proportional to
% i*m. To try to discount the different sizes (and imperfect
% parallelisation), we only scale with a power of i less than
% one to not end up with a few blocks which hold up the rest
end
num_block_big = num_block - (i+1)*num_core + sum(~logical_sml);
(Note: This code doesn't work for vectors num_entr_asc whose length is not a multiple of num_core, but I decided to omit the min(...,end) constructions for legibility.)
I have also omitted the < max(...,...) for combining both conditions (i.e. together with minimum entries per worker), which is necessary so that the cut-off isn't found too early. I thought a little about somehow using the variance as well, but so far all attempts have been unsatisfactory.
I would be very grateful if someone has a good idea for how to solve this.
I came up with a somewhat satisfactory solution, so in case anyone's interested I thought I'd share it. I would still appreciate comments on how to improve/fine-tune the approach.
Basically, I decided that the only sensible way is to build a (very) rudimentary model of the scheduler for the parallel loop:
function c=est_cost_para(cost_blocks,cost_it,num_cores)
% Estimate cost of parallel computation
% Inputs:
% cost_blocks: Estimate of cost per block in arbitrary units. For
% consistency with the other code this must be in the reverse order
% that the scheduler is fed, i.e. cost should be ascending!
% cost_it: Base cost of iteration (regardless of number of entries)
% in the same units as cost_blocks.
% num_cores: Number of cores
%
% Output:
% c: Estimated cost of parallel computation
num_blocks=numel(cost_blocks);
c=zeros(num_cores,1);
i=min(num_blocks,num_cores);
c(1:i)=cost_blocks(end-i+1:end)+cost_it;
while i<num_blocks
i=i+1;
[~,i_min]=min(c); % which core finished first; is fed with next block
c(i_min)=c(i_min)+cost_blocks(end-i+1)+cost_it;
end
c=max(c);
end
The parameter cost_it for an empty iteration is a crude blend of many different side effects which could conceivably be separated: the cost of an empty iteration in a for/parfor-loop (which could also differ per block), as well as the start-up time and data transmission of the parfor-loop (and probably more). My main reason to throw everything together is that I don't want to have to estimate/determine the more granular costs.
I use the above routine to determine the cut-off in the following way:
function i=cutoff_ser_para(cost_blocks,cost_it,num_cores)
% Determine cut-off between serial and parallel regime
% Inputs:
% cost_blocks: Estimate of cost per block in arbitrary units. For
% consistency with the other code this must be in the reverse order
% that the scheduler is fed, i.e. cost should be ascending!
% cost_it: Base cost of iteration (regardless of number of entries)
% in the same units as cost_blocks.
% num_cores: Number of cores
%
% Output:
% i: Number of blocks to be calculated serially
num_blocks=numel(cost_blocks);
cost=zeros(num_blocks+1,2);
for i=0:num_blocks
cost(i+1,1)=sum(cost_blocks(end-i+1:end))/num_cores + i*cost_it;
cost(i+1,2)=est_cost_para(cost_blocks(1:end-i),cost_it,num_cores);
end
[~,i]=min(sum(cost,2));
i=i-1;
end
In particular, I don't inflate/change the value of est_cost_para which assumes (aside from cost_it) the most optimistic scheduling possible. I leave it as is mainly because I don't know what would work best. To be conservative (i.e. avoid feeding too large blocks to the parallel loop), one could of course add some percentage as a buffer or even use a power > 1 to inflate the parallel cost.
Note also that est_cost_para is called with successively fewer blocks (although I use the variable name cost_blocks for both routines, one is a subset of the other).
Compared to the approach in my wordy question I see two main advantages:
The relatively intricate dependence between the data (both the number of blocks as well as their cost) and the number of cores is captured much better with the simulated scheduler than would be possible with a single formula.
By calculating the cost for all possible combinations of serial/parallel distribution and then taking the minimum, one cannot get "stuck" too early while reading in the data from one side (e.g. by a jump which is large relative to the data so far, but small in comparison to the total).
Of course, the asymptotic complexity is higher by calling est_cost_para with its while-loop all the time, but in my case (num_blocks<500) this is absolutely negligible.
Finally, if a decent value of cost_it does not readily present itself, one can try to calculate it by measuring the actual execution time of each block, as well as the purely parallel part of it, and then fitting the resulting data to the cost prediction to get an updated value of cost_it for the next call of the routine (by using the difference between total cost and parallel cost, or by inserting a cost of zero into the fitted formula). This should hopefully "converge" to the most useful value of cost_it for the problem in question.
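As a rough illustration of the difference-based variant (a sketch of my own, assuming you log the timings of a previous call; all variable names here are hypothetical):
t_ser   = t_total - t_para;         % measured time spent in the serial regime
cost_it = (t_ser - c_ser) / n_ser;  % leftover per-iteration overhead
% t_total: measured wall-clock time of the whole call
% t_para : measured wall-clock time of the parfor part alone
% c_ser  : predicted cost (same units) of the serial blocks without overhead
% n_ser  : number of blocks that were done serially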

Random seed across different PBS jobs

I am trying to create random numbers in Matlab which will be different across multiple PBS jobs (I am using a job array). Each Matlab job uses a parallel parfor loop in which random numbers are generated, something like this:
parfor k = 1:10
tmp = randi(100, [1 200]);
end
However when I plot my results, I see that the results from different jobs are not completely random. I cannot quantify it, e.g. by saying that the numbers are exactly the same, since my results are a function of the random numbers, but it is unmistakable when plotting them.
I tried to initialize the random seed in each job, using the process id and/or the clock:
rngSeed = feature('getpid'); % OR: rngSeed = RandStream.shuffleSeed;
rng(rngSeed);
But this didn't solve the problem. I also tried to pause for a different number of seconds in each job, before using the shuffleSeed (which is clock based).
All this made me think the parfor is somehow messing with the random seed - and it makes sense, if the parfor needs to make sure you get different random numbers across different iterations of the parfor.
My questions are, is it really the case, and how can I solve it and get randomness across different PBS jobs?
EDIT Running 4 jobs, each using parfor with 2 workers, I verified that although each job has its own seed (set outside the parfor), the numbers generated are identical across jobs (not across iterations of the parfor - that is handled by Matlab).
EDIT 2 Trying what was suggested by @Sam Roberts, I use the following code:
matlabpool open local 2
st = RandStream('mlfg6331_64');
RandStream.setGlobalStream(st);
rng('shuffle');
parfor n = 1:4
x=randi(100,[1 10]);
fprintf('%d ',x(:)');
fprintf('\n')
end
matlabpool close
but I still get the same numbers on different calls to the above script.
You may want to look into using random substreams, for correct randomness and reproducibility when running in parallel.
The RandStream class allows you to create a pseudorandom number stream - numbers drawn from this stream have the properties you'd hope for (independence etc) and, if you control the seed, you also have reproducibility.
But it may not be the case that, for example, every second or every fourth number drawn from the stream has the same properties. In addition, when you use parfor you have no control over the order in which the loop iterations are run, which means that you will lose reproducibility. You can use a different substream on each worker within a parfor loop.
Some RNGs, for example mlfg6331_64, a multiplicative lagged Fibonacci generator, or mrg32k3a, a combined multiple recursive generator, support substreams - independent streams that are generated by the same RNG, but which retain the same pseudorandom properties and can be selected from separately, retaining reproducibility. In addition, many MATLAB and Toolbox functions have an option 'UseParallel' and 'UseSubstreams', which will tell them to do this stuff for you automatically.
Although the above is documented at a technical level within the MATLAB documentation, it's kind of hard to find. There's a much more explanatory guide within Statistics Toolbox documentation (should really be moved to MATLAB if you ask me). You can read it online here.
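A minimal sketch of what this can look like for the job-array case (my own example, assuming the PBS array index is exposed via the PBS_ARRAYID environment variable - adjust the variable name to your scheduler):
jobID = str2double(getenv('PBS_ARRAYID'));  % assumption: set by the PBS job array
parfor k = 1:10
    % same generator seed in every iteration of a job, but a different substream
    % per iteration: reproducible within a job, different across jobs
    s = RandStream('mrg32k3a', 'Seed', jobID);
    s.Substream = k;
    tmp = randi(s, 100, [1 200]);
end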
Hope that helps!

Can I run a script on multiple MATLAB sessions instead of parallelizing the script?

I have a script which solves a system of differential equations for many parameters in a for loop (the iterations are completely independent, but at the end of each iteration a large matrix, mat, is modified according to the results of the computation). Here is the code (B is a matrix containing the parameters):
mat=zeros(20000,1);
for n=1:20000
prop=B(n,:); % B is a (20000 * 2 ) matrix that contains U and V parameters
U=prop(1);
V=prop(2);
options=odeset('RelTol',1e-6,'AbsTol',1e-20);
[T,X]=ode45(@acceleration,tspan,x0,options);
rad=X(:,1);
if max(rad)<radius % radius is a constant
mat(n)=1;
end
end % close the parameter loop
function xprime=acceleration(T,X)
.
.
.
end
First I tried to use parfor, but because the acceleration function (the ode45 input) was defined as an inline function (to achieve better performance), I couldn't do that.
Can I open 4 MATLAB sessions (my CPU has 4 cores) and run the code separately in each session, instead of modifying the code to implement acceleration as a separate function and therefore using parfor? Does it give 4x the performance of running in one session? (Or does it give the same performance as the parallelised code? In parallel code I can't define inline functions.)
(on Windows)
If you're prepared do to the work of separating out the problem to run separately in 4 sessions, then reassemble the results, sure you can do that. In my experience (on Windows) it actually runs faster to run code in four separate sessions than to have a parfor loop with 4 workers. Not quite as fast as 4x performance of a single session, because the operating system will have other work to do... so for example if you have no other processor-heavy applications running, the OS itself might take up 25% of one core, leaving you with maybe 3.75x performance of a single session. However, this assumes you have enough memory for that not be the limiting factor.
If you wanted to do this regularly you might need to create some file-based signalling/data passing system.
This is obviously not as elegant as a parfor, but is workable for your situation, or if you can't afford the license fee for the parallel toolbox.
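For example (just a sketch of one way to wire this up - the file names and the run_slice runner are my own invention, and the loop body is the one from the question), each session could handle an interleaved slice of the 20000 parameter rows and save its piece of mat to its own MAT-file, to be merged afterwards:
% run_slice.m -- hypothetical runner; start one MATLAB session per slice, e.g.
%   matlab -nodesktop -r "run_slice(1,4); exit"   (and 2..4 in the other sessions)
function run_slice(sliceID, numSlices)
    load('params.mat', 'B', 'tspan', 'x0', 'radius');  % shared inputs, saved beforehand
    idx = sliceID:numSlices:size(B,1);    % interleaved slice of the parameter rows
    mat_part = zeros(numel(idx),1);
    for j = 1:numel(idx)
        n = idx(j);
        % ... same body as the loop in the question (ode45 call etc.) ...
        % set mat_part(j) = 1 if max(rad) < radius
    end
    save(sprintf('mat_part_%d.mat', sliceID), 'mat_part', 'idx');
end
A short script run afterwards in a single session can then load the four mat_part_*.mat files and scatter them back into the full vector with mat(idx) = mat_part.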

Matlab parallel computing toolbox, dynamic allocation of work in parfor loops

I'm working with a long-running parfor loop in MATLAB.
parfor iter=1:1000
chunk_of_work(iter);
end
There are generally about 2-3 timing outliers per run. That is to say for every 1000 chunks of work performed there are 2-3 that take about 100 times longer than the rest. As the loop nears completion, the workers that evaluated the outliers continue to run while the rest of the workers have no computational load.
This is consistent with the parfor loop distributing work statically. This is in contrast with the documentation for the parallel computing toolbox found here:
"Work distribution is dynamic. Instead of being allocated a fixed
iteration range, the workers are allocated a new iteration only after
they finish processing their current iteration, which results in an
even work load distribution."
Any ideas about what's going on?
I think the doc you quote has a pretty good description of what is considered a static allocation of work: each worker "being allocated a fixed iteration range". For 4 workers, this would mean the first being assigned iter 1:250, the second iter 251:500,... or 1:4:1000 for the first, 2:4:1000 for the second, and so on.
You did not say exactly what you observe, but what you describe is well consistent with dynamic workload distribution: First, the four (example) workers work on one iter each, the first one that is finished works on a fifth, the next one that is done (which may well be the same if three of the first four take somewhat longer) works on a sixth, and so on. Now if your outliers are number 20, 850 and 900 in the order MATLAB chooses to process the loop iterations and each take 100 times as long, this only means that the 21st to 320th iterations will be solved by three of the four workers while one is busy with the 20th (by 320 it will be done, now assuming roughly even distribution of non-outlier calculation time). The worker being assigned the 850th iteration will, however, continue to run even after another has solved #1000, and the same for #900. In fact, if there were about 1100 iterations, the one working on #900 should be finished roughly at the time when the others are.
[edited as the orginal wording implied MATLAB would still assign the iterations of the parfor loop in order from 1 to 1000, which should not be assumed]
So long story short, unless you find a way to process your outliers first (which of course requires you to know a priori which ones are the outliers, and to find a way to make MATLAB start the parfor loop processing with these), dynamic workload distribution alone cannot avoid the effect you observe.
Addition: I think, however, that your observation that as "the loop nears completion, the workers that evaluated the outliers continue to run" seems to imply at least one of the following:
The outliers somehow are among the last iterations MATLAB starts to process
You have many workers, in the order of magnitude of the number of iterations
Your estimate of the number of outliers (2-3) or your estimate of their computation time penalty (factor 100) is too low
The work distribution in PARFOR is somewhat deterministic. You can observe precisely what's going on by having each worker log to disk how things go, but basically it turns out that PARFOR divides your loop up into chunks in a deterministic way, but farms them out dynamically. Unfortunately, there's currently no way to control that chunking.
However, if you cannot predict which of your 1000 cases are going to be outliers, it's hard to imagine an efficient scheme for distributing the work.
If you can predict your outliers, you might be able to take advantage of the fact that roughly speaking, PARFOR executes loop iterations in reverse order, so you could put them at the "end" of the loop so work starts on them immediately.
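For instance (a sketch of my own, assuming a logical vector isOutlier marking the known-slow iterations), you could reorder the indices so the outliers sit at the end of the range, where PARFOR picks up work first:
% isOutlier is a hypothetical logical vector of length 1000
order = [find(~isOutlier), find(isOutlier)];   % slow cases last, i.e. started first
parfor k = 1:numel(order)
    chunk_of_work(order(k));
end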
The problem you face is well described in @arne.b's answer; I have nothing to add to that.
But the Parallel Computing Toolbox does contain functions for decomposing a job into tasks for independent execution. From your question it's not possible to conclude whether this is suitable for your application. If it is, the general strategy is to break the job into tasks of some size and have each processor tackle a task; when finished, it goes back to the stack of unfinished tasks and starts on another.
You might be able to decompose your problem such that one task replaces one loop iteration (lots of tasks, lots of overhead in managing the computation but best load-balancing) or so that one task replaces N loop iterations (fewer tasks, less overhead, poorer load-balancing). Jobs and tasks are a little trickier to implement than parfor too.
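A rough sketch of that route (my own example; it uses the createJob/createTask API, and chunk_of_work_range is a hypothetical wrapper around the loop body - with pre-R2016b MATLAB it would live in its own file):
% One task per group of N loop iterations.
N = 10;                                   % iterations per task (tuning knob)
c = parcluster('local');                  % or your cluster profile
job = createJob(c);
for startIdx = 1:N:1000
    createTask(job, @chunk_of_work_range, 0, {startIdx, min(startIdx+N-1, 1000)});
end
submit(job);
wait(job);                                % tasks are handed out as workers free up
delete(job);

function chunk_of_work_range(a, b)
% hypothetical wrapper: run a contiguous range of the original iterations
for iter = a:b
    chunk_of_work(iter);
end
end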
As an alternative to PARFOR, in R2013b and later, you can use PARFEVAL and divide up the work any way you see fit. You could even cancel the 'timing outliers' once you've got sufficient results, if that's appropriate. There is, of course, overhead when dividing up your existing loop into 1000 individual remote PARFEVAL calls. Perhaps that's a problem, perhaps not. Here's the sort of thing I'm imagining:
for idx = 1:1000
futures(idx) = parfeval(@chunk_of_work, 1, idx);
end
done = false; numComplete = 0;
timer = tic();
while ~done
[idx, result] = fetchNext(futures, 10); % wait up to 10 seconds
if ~isempty(idx)
numComplete = numComplete + 1;
% stash result
end
done = (numComplete == 1000) || (toc(timer) > 100);
end
% cancel outstanding work, has no effect on completed futures
cancel(futures);