Predicting runtime of parallel loop using a-priori estimate of effort per iterand (for given number of workers) - matlab

I am working on a MATLAB implementation of an adaptive Matrix-Vector Multiplication for very large sparse matrices coming from a particular discretisation of a PDE (with known sparsity structure).
After a lot of pre-processing, I end up with a number of different blocks (greater than, say, 200), for which I want to calculate selected entries.
One of the pre-processing steps is to determine the (number of) entries per block I want to calculate, which gives me an almost perfect measure of the amount of time each block will take (for all intents and purposes the quadrature effort is the same for each entry).
Thanks to https://stackoverflow.com/a/9938666/2965879, I was able to make use of this by ordering the blocks in reverse order, thus goading MATLAB into starting with the biggest ones first.
However, the number of entries differs so wildly from block to block, that directly running parfor is limited severely by the blocks with the largest number of entries, even if they are fed into the loop in reverse.
My solution is to do the biggest blocks serially (but parallelised on the level of entries!), which is fine as long as the overhead per iterand doesn't matter too much, resp. the blocks don't get too small. The rest of the blocks I then do with parfor. Ideally, I'd let MATLAB decide how to handle this, but since a nested parfor-loop loses its parallelism, this doesn't work. Also, packaging both loops into one is (nigh) impossible.
My question now is about how to best determine this cut-off between the serial and the parallel regime, taking into account the information I have on the number of entries (the shape of the curve of ordered entries may differ for different problems), as well as the number of workers I have available.
So far, I had been working with the 12 workers available under a the standard PCT license, but since I've now started working on a cluster, determining this cut-off becomes more and more crucial (since for many cores the overhead of the serial loop becomes more and more costly in comparison to the parallel loop, but similarly, having blocks which hold up the rest are even more costly).
For 12 cores (resp. the configuration of the compute server I was working with), I had figured out a reasonable parameter of 100 entries per worker as a cut off, but this doesn't work well when the number of cores isn't small anymore in relation to the number of blocks (e.g 64 vs 200).
I have tried to deflate the number of cores with different powers (e.g. 1/2, 3/4), but this also doesn't work consistently. Next I tried to group the blocks into batches and determine the cut-off when entries are larger than the mean per batch, resp. the number of batches they are away from the end:
logical_sml = true(1,num_core); i = 0;
while all(logical_sml)
i = i+1;
m = mean(num_entr_asc(1:min(i*num_core,end))); % "asc" ~ ascending order
logical_sml = num_entr_asc(i*num_core+(1:num_core)) < i^(3/4)*m;
% if the small blocks were parallelised perfectly, i.e. all
% cores take the same time, the time would be proportional to
% i*m. To try to discount the different sizes (and imperfect
% parallelisation), we only scale with a power of i less than
% one to not end up with a few blocks which hold up the rest
end
num_block_big = num_block - (i+1)*num_core + sum(~logical_sml);
(Note: This code doesn't work for vectors num_entr_asc whose length is not a multiple of num_core, but I decided to omit the min(...,end) constructions for legibility.)
I have also omitted the < max(...,...) for combining both conditions (i.e. together with minimum entries per worker), which is necessary so that the cut-off isn't found too early. I thought a little about somehow using the variance as well, but so far all attempts have been unsatisfactory.
I would be very grateful if someone has a good idea for how to solve this.

I came up with a somewhat satisfactory solution, so in case anyone's interested I thought I'd share it. I would still appreciate comments on how to improve/fine-tune the approach.
Basically, I decided that the only sensible way is to build a (very) rudimentary model of the scheduler for the parallel loop:
function c=est_cost_para(cost_blocks,cost_it,num_cores)
% Estimate cost of parallel computation
% Inputs:
% cost_blocks: Estimate of cost per block in arbitrary units. For
% consistency with the other code this must be in the reverse order
% that the scheduler is fed, i.e. cost should be ascending!
% cost_it: Base cost of iteration (regardless of number of entries)
% in the same units as cost_blocks.
% num_cores: Number of cores
%
% Output:
% c: Estimated cost of parallel computation
num_blocks=numel(cost_blocks);
c=zeros(num_cores,1);
i=min(num_blocks,num_cores);
c(1:i)=cost_blocks(end-i+1:end)+cost_it;
while i<num_blocks
i=i+1;
[~,i_min]=min(c); % which core finished first; is fed with next block
c(i_min)=c(i_min)+cost_blocks(end-i+1)+cost_it;
end
c=max(c);
end
The parameter cost_it for an empty iteration is a crude blend of many different side effects, which could conceivably be separated: The cost of an empty iteration in a for/parfor-loop (could also be different per block), as well as the start-up time resp. transmission of data of the parfor-loop (and probably more). My main reason to throw everything together is that I don't want to have to estimate/determine the more granular costs.
I use the above routine to determine the cut-off in the following way:
% function i=cutoff_ser_para(cost_blocks,cost_it,num_cores)
% Determine cut-off between serial an parallel regime
% Inputs:
% cost_blocks: Estimate of cost per block in arbitrary units. For
% consistency with the other code this must be in the reverse order
% that the scheduler is fed, i.e. cost should be ascending!
% cost_it: Base cost of iteration (regardless of number of entries)
% in the same units as cost_blocks.
% num_cores: Number of cores
%
% Output:
% i: Number of blocks to be calculated serially
num_blocks=numel(cost_blocks);
cost=zeros(num_blocks+1,2);
for i=0:num_blocks
cost(i+1,1)=sum(cost_blocks(end-i+1:end))/num_cores + i*cost_it;
cost(i+1,2)=est_cost_para(cost_blocks(1:end-i),cost_it,num_cores);
end
[~,i]=min(sum(cost,2));
i=i-1;
end
In particular, I don't inflate/change the value of est_cost_para which assumes (aside from cost_it) the most optimistic scheduling possible. I leave it as is mainly because I don't know what would work best. To be conservative (i.e. avoid feeding too large blocks to the parallel loop), one could of course add some percentage as a buffer or even use a power > 1 to inflate the parallel cost.
Note also that est_cost_para is called with successively less blocks (although I use the variable name cost_blocks for both routines, one is a subset of the other).
Compared to the approach in my wordy question I see two main advantages:
The relatively intricate dependence between the data (both the number of blocks as well as their cost) and the number of cores is captured much better with the simulated scheduler than would be possible with a single formula.
By calculating the cost for all possible combinations of serial/parallel distribution and then taking the minimum, one cannot get "stuck" too early while reading in the data from one side (e.g. by a jump which is large relative to the data so far, but small in comparison to the total).
Of course, the asymptotic complexity is higher by calling est_cost_para with its while-loop all the time, but in my case (num_blocks<500) this is absolutely negligible.
Finally, if a decent value of cost_it does not readily present itself, one can try to calculate it by measuring the actual execution time of each block, as well as the purely parallel part of it, and then trying to fit the resulting data to the cost prediction and get an updated value of cost_it for the next call of the routine (by using the difference between total cost and parallel cost or by inserting a cost of zero into the fitted formula). This should hopefully "converge" to the most useful value of cost_it for the problem in question.

Related

Set Batch Size *and* Number of Training Iterations for a neural network?

I am using the KNIME Doc2Vec Learner node to build a Word Embedding. I know how Doc2Vec works. In KNIME I have the option to set the parameters
Batch Size: The number of words to use for each batch.
Number of Epochs: The number of epochs to train.
Number of Training Iterations: The number of updates done for each batch.
From Neural Networks I know that (lazily copied from https://stats.stackexchange.com/questions/153531/what-is-batch-size-in-neural-network):
one epoch = one forward pass and one backward pass of all the training examples
batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you'll need.
number of iterations = number of passes, each pass using [batch size] number of examples. To be clear, one pass = one forward pass + one backward pass (we do not count the forward pass and backward pass as two different passes).
As far as I understand it makes little sense to set batch size and iterations, because one is determined by the other (given the data size, which is given by the circumstances). So why can I change both parameters?
This is not necessarily the case. You can also train "half epochs". For example, in Google's inceptionV3 pretrained script, you usually set the number of iterations and the batch size at the same time. This can lead to "partial epochs", which can be fine.
If it is a good idea or not to train half epochs may depend on your data. There is a thread about this but not a concluding answer.
I am not familiar with KNIME Doc2Vec, so I am not sure if the meaning is somewhat different there. But from the definitions you gave setting batch size + iterations seems fine. Also setting number of epochs could cause conflicts though leading to situations where numbers don't add up to reasonable combinations.

Generate random numbers inside spmd in matlab

I am running a Monte carlo simulation in Matlab using parallelisation due to the extensive time that the simulation takes to run.
The main objective is create a really big panel data set and use that to estimate some regressions.
The problem is that when I run the simulation without parallelise they take A LOT of time to run, so I decided to use spmd option. However, results are very different running the parallelised code compared to the normal one.
rng(3857);
for r=1:MCREP
Ycom=[];
Xcom=[];
YLcom=[];
spmd
for it=labindex:numlabs:NT
(code to generate different components, alpha, delta, x_it, eps_it)
%e.g. x_it=2+1*randn(TT,1);
(uses random number generator: rndn)
% Create different time periods observations for each individual
for t=2:TT
yi(t)=xi*alpha+mu*delta+rho*yi(t-1)+beta*x_it(t)+eps_it(t);
yLi(t)=yi(t-1);
end
% Concatenate each individual in a big matrix: create panel
Ycom=[Ycom yi];
Xcom=[Xcom x_it];
YLcom=[YLcom yLi];
end
end
% Retrieve data stored in composite form
mm=matlabpool('size');
for i=1:mm
Y(:,(i-1)*(NT/mm)+1:i*(NT/mm))=Ycom{i};
X(:,(i-1)*(NT/mm)+1:i*(NT/mm))=Xcom{i};
YL(:,(i-1)*(NT/mm)+1:i*(NT/mm))=YLcom{i};
end
(rest of the code, run regressions)
end
The intensive part of the code is the one that is parallelised with the spmd, it creates a really large panel data set in where columns are independent individuals, and rows are dependent time periods.
My main problem is that when I run the code using the parallel then results are different than when I don't use it, moreover results are different if I use 8 workers or 16 workers. However for a matter of time is unfeasible to run the code without parallelisation.
I believe problem is coming from the random numbers generation, but I can not fix the seed inside the spmd because that mean fixing the seed inside the Monte Carlo loop, so all the repetitions are going to have the same numbers.
I would want to know how can I fix the random number generator in such a way that it does not matter how many workers I use it will give me the same results.
PS. Another solution would be to do the spmd in the most outer loop (the Monte Carlo loop), however I can not see a performance gain when I use the parallelisation in that way.
Thank you very much for your help.
Heh... the random generators in MATLAB's parallel execution is indeed an issue.
The MATLAB's web page about random generators (http://www.mathworks.com/help/matlab/math/creating-and-controlling-a-random-number-stream.html) states that only two streams/generators can have multiple streams. These two have a limited period (see the Table at the previous link).
BUT!!! The default generator (mt19937ar) can be seeded in order to have different results :)
Thus, what you can do is to start with the mrg32k3a, obtain a random number in each worker and then use this random number along with the worker index to seed an mt19937ar generator.
E.g.
spmd
r1 = rand(randStream{labindex}, [1 1]);
r2 = rand(randStream{labindex}, [1 1]);
rng(labindex+(r1/r2), 'twister');
% Do you stuff
end
Of course, the r1 and r2 can be modified (or, maybe, add more r's) in order to have more complicated seeding.

work stealing with Parallel Computing Toolbox

I have implemented a combinatorial search algorithm (for comparison to a more efficient optimization technique) and tried to improve its runtime with parfor.
Unfortunately, the work assignments appear to be very badly unbalanced.
Each subitem i has complexity of approximately nCr(N - i, 3). As you can see, the tasks i < N/4 involve significantly more work than i > 3*N/4, yet it seems MATLAB is assigning all of i < N/4 to a single worker.
Is it true that MATLAB divides the work based on equally sized subsets of the loop range?
No, this question cites the documentation saying it does not.
Is there a convenient way to rebalance this without hardcoding the number of workers (e.g. if I require exactly 4 workers in the pool, then I could swap the two lowest bits of i with two higher bits in order to ensure each worker received some mix of easy and hard tasks)?
I don't think a full "work-stealing" implementation is necessary, perhaps just assigning 1, 2, 3, 4 to the workers, then when 4 completes first, its worker begins on item 5, and so on. The size of each item is sufficiently larger than the number of iterations that I'm not too worried about the increased communication overhead.
If the loop iterations are indeed distributed ahead of time (which would mean that in the end, there is a single worker that will have to complete several iterations while the other workers are idle - is this really the case?), the easiest way to ensure a mix is to randomly permute the loop iterations:
permutedIterations = randperm(nIterations);
permutedResults = cell(nIterations,1); %# or whatever else you use for storing results
%# run the parfor loop, completing iterations in permuted order
parfor iIter = 1:nIterations
permutedResults(iIter) = f(permutedIterations(iIter));
end
%# reorder results for easier subsequent analysis
results = permutedResults(permutedIterations);

Mahout K-means has different behavior based on the number of mapping tasks

I experience a strange situation when running Mahout K-means:
Using the a pre-selected set of initial centroids, I run K-means on a SequenceFile generated by lucene.vector. The run is for testing purposes, so the file is small (around 10MB~10000 vectors).
When K-means is executed with a single mapper (the default considering the Hadoop split size which in my cluster is 128MB), it reaches a given clustering result in 2 iterations (Case A).
However, I wanted to test if there would be any improvement/deterioration in the algorithm's execution speed by firing more mapping tasks (the Hadoop cluster has in total 6 nodes).
I therefore set the -Dmapred.max.split.size parameter to 5242880 bytes, in order to make mahout fire 2 mapping tasks (Case B).
I indeed succeeded in starting two mappers, but the strange thing was that the job finished after 5 iterations instead of 2, and that even at the first assignment of points to clusters, the mappers made different choices compared to the single-map execution . What I mean is that after close inspection of the clusterDump for the first iteration for both two cases, I found that in case B some points were not assigned to their closest cluster.
Could this behavior be justified by the existing K-means Mahout implementation?
From a quick look at the sources, I see two problems with the Mahout k-means implementation.
First of all, the way the S0, S1, S2 statistics are kept is probably not numerically stable for large data sets. Oh, and since k-means actually does not even use S2, it is also unnecessary slow. I bet a good implementation can beat this version of k-means by a factor of 2-5 at least.
For small data sets split onto multiple machines, there seems to be an error in the way they compute their means. Ouch. This will amplify if the reducer is applied to more than one input, in particular when the partitions are small. To be more verbose, the cluster mean apparently is initialized with the previous mean instead of the 0 vector. Now if you if you reduce 't' copies of it, the resulting vector will be off by 't' times the previous mean.
Initialization of AbstractCluster:
setS1(center.like());
Update of the mean:
getS1().assign(x, Functions.PLUS);
Merge of multiple copies of a cluster:
setS1(getS1().plus(cl.getS1()));
Finalization to new center:
setCenter(getS1().divide(getS0()));
So with this approach, the center will be offset from the proper value by the previous center times t / n where t is the number of splits, and n the number of objects.
To fix the numerical instability (which arises whenever the data set is not centered on the 0 vector), I recommend replacing the S1 statistic by the true mean, not S0*mean. Both S1 and S2 can be incrementally updated at little cost using the incremental mean formula which AFAICT was used in the original "k-means" publication by MacQueen (which actually is an online kmeans, while this is Lloyd style batch iterations). Well, for an incremental k-means you obviously need the updatable mean vector anyway... I believe the formula was also discussed by Knuth in his essential books. I'm surprised that Mahout does not seem to use it. It's fairly cheap (just a few CPU instructions more, no additional data, so it all happens in the CPU cache line) and gives you extra precision when you are dealing with large data sets.

Matlab parallel computing toolbox, dynamic allocation of work in parfor loops

I'm working with a long running parfor loop in matlab.
parfor iter=1:1000
chunk_of_work(iter);
end
There are generally about 2-3 timing outliers per run. That is to say for every 1000 chunks of work performed there are 2-3 that take about 100 times longer than the rest. As the loop nears completion, the workers that evaluated the outliers continue to run while the rest of the workers have no computational load.
This is consistent with the parfor loop distributing work statically. This is in contrast with the documentation for the parallel computing toolbox found here:
"Work distribution is dynamic. Instead of being allocated a fixed
iteration range, the workers are allocated a new iteration only after
they finish processing their current iteration, which results in an
even work load distribution."
Any ideas about what's going on?
I think the doc you quote has a pretty good description what is considered a static allocation of work: each worker "being allocated a fixed iteration range". For 4 workers, this would mean the first being assigned iter 1:250, the second iter 251:500,... or the 1:4:100 for the first, 2:4:1000 for the second and so on.
You did not say exactly what you observe, but what you describe is well consistent with dynamic workload distribution: First, the four (example) workers work on one iter each, the first one that is finished works on a fifth, the next one that is done (which may well be the same if three of the first four take somewhat longer) works on a sixth, and so on. Now if your outliers are number 20, 850 and 900 in the order MATLAB chooses to process the loop iterations and each take 100 times as long, this only means that the 21st to 320th iterations will be solved by three of the four workers while one is busy with the 20th (by 320 it will be done, now assuming roughly even distribution of non-outlier calculation time). The worker being assigned the 850th iteration will, however, continue to run even after another has solved #1000, and the same for #900. In fact, if there were about 1100 iterations, the one working on #900 should be finished roughly at the time when the others are.
[edited as the orginal wording implied MATLAB would still assign the iterations of the parfor loop in order from 1 to 1000, which should not be assumed]
So long story short, unless you find a way to process your outliers first (which of course requires you to know a priori which ones are the outliers, and to find a way to make MATLAB start the parfor loop processing with these), dynamic workload distribution alone cannot avoid the effect you observe.
Addition: I think, however, that your observation that as "the loop nears completion, the worker*s* that evaluated the outliers continue to run" seems to imply at least one of the following
The outliers somehow are among the last iterations MATLAB starts to process
You have many workers, in the order of magnitude of the number of iterations
Your estimate of the number of outliers (2-3) or your estimate of their computation time penalty (factor 100) is too low
The work distribution in PARFOR is somewhat deterministic. You can observe precisely what's going on by having each worker log to disk how things go, but basically it turns out that PARFOR divides your loop up into chunks in a deterministic way, but farms them out dynamically. Unfortunately, there's currently no way to control that chunking.
However, if you cannot predict which of your 1000 cases are going to be outliers, it's hard to imagine an efficient scheme for distributing the work.
If you can predict your outliers, you might be able to take advantage of the fact that roughly speaking, PARFOR executes loop iterations in reverse order, so you could put them at the "end" of the loop so work starts on them immediately.
The problem you face is well described in #arne.b's answer, I have nothing to add to that.
But, the parallel compute toolbox does contain functions for decomposing a job into tasks for independent execution. From your question it's not possible to conclude either that this is suitable or that this is not suitable for your application. If it is, the general strategy is to break the job into tasks of some size and have each processor tackle a task, when finished go back to the stack of unfinished tasks and start on another.
You might be able to decompose your problem such that one task replaces one loop iteration (lots of tasks, lots of overhead in managing the computation but best load-balancing) or so that one task replaces N loop iterations (fewer tasks, less overhead, poorer load-balancing). Jobs and tasks are a little trickier to implement than parfor too.
As an alternative to PARFOR, in R2013b and later, you can use PARFEVAL and divide up the work any way you see fit. You could even cancel the 'timing outliers' once you've got sufficient results, if that's appropriate. There is, of course, overhead when dividing up your existing loop into 1000 individual remote PARFEVAL calls. Perhaps that's a problem, perhaps not. Here's the sort of thing I'm imagining:
for idx = 1:1000
futures(idx) = parfeval(#chunk_of_work, 1, idx);
end
done = false; numComplete = 0;
timer = tic();
while ~done
[idx, result] = fetchNext(futures, 10); % wait up to 10 seconds
if ~isempty(idx)
numComplete = numComplete + 1;
% stash result
end
done = (numComplete == 1000) || (toc(timer) > 100);
end
% cancel outstanding work, has no effect on completed futures
cancel(futures);