TAGE prediction accuracy improves with loop over larger array? - cpu-architecture

The code snippet iterates through a 1D matrix. (N is the size of the matrix).
for (i=0; i< N; i++) // outer loop for Rows
When I run this piece of code on a processor simulator to measure TAGE accuracy, I notice that the TAGE accuracy increases as the size of the array (N) increases.
What is the reason for this?

Loop branches typically mispredict only on the last iteration when execution falls through out of the loop instead of jumping to the top. (For fairly obvious reasons: they quickly learn that the branch is always-taken, and predict that way.)
The more iterations your loops run, the more correctly-predicted taken branches you have for the same number of not-taken special cases that mispredict.
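To make the ratio concrete, here is a minimal Python sketch. It does not model TAGE; it assumes a single 2-bit saturating counter as a stand-in predictor and a loop-closing branch that is taken on every iteration except the last. The function name and the starting counter value are illustrative assumptions.

```python
def loop_branch_accuracy(n_iterations, counter=2):
    """Prediction accuracy of a 2-bit saturating counter on one loop branch.

    The branch is taken (jump back to the top) n_iterations - 1 times and
    falls through once at loop exit. counter >= 2 means "predict taken".
    """
    outcomes = [True] * (n_iterations - 1) + [False]
    correct = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken == taken:
            correct += 1
        # Saturating update: move towards 3 on taken, towards 0 on not-taken.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(outcomes)


for n in (10, 100, 1000):
    print(n, loop_branch_accuracy(n))
```

Under these assumptions the only misprediction is the loop exit, so accuracy is (N-1)/N and approaches 1 as N grows.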
Fun fact: on modern Intel CPUs (like Haswell / Skylake), their IT-TAGE branch predictors can "learn" a pattern up to about 22 iterations, correctly predicting the loop exit. With a very long outer loop to give the CPU time to learn the pattern, an inner loop that runs only 22 or fewer iterations tends to predict even the loop-exit branches correctly. So there's a significant dropoff in performance (and instruction throughput) when the inner-loop size grows past that point, if the loop body is pretty simple.
But it probably takes quite a few outer-loop iterations to train the predictors with that much history. I was testing 10 million outer-loop iterations or so, to average out noise and startup overhead for a whole process with perf stat on real hardware under Linux. So the startup / learning phase was negligible.
With older simpler branch predictors (before TAGE), I think some CPUs did implement loop-pattern prediction with a counter to predict loop exits for inner loops that ran a constant number of iterations every time they were reached. https://danluu.com/branch-prediction/ says the same, that "modern CPUs" "often" have such predictors.
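As a rough illustration of the counter-based idea, here is a toy Python model. It is not how any particular CPU implements it: the class, its fields, and the "same trip count seen twice in a row" confidence rule are all assumptions made for this sketch.

```python
class LoopExitPredictor:
    """Toy trip-count predictor for a single loop branch (illustrative only)."""

    def __init__(self):
        self.trip_count = None  # taken outcomes seen before the last exit
        self.confident = False  # True once the same trip count repeats
        self.current = 0        # taken outcomes so far in this loop execution

    def predict(self):
        # Predict the exit (not taken) only when a stable trip count is known.
        if self.confident and self.current == self.trip_count:
            return False
        return True

    def update(self, taken):
        if taken:
            self.current += 1
            return
        # Loop exited: check whether the trip count matched the previous one.
        self.confident = (self.current == self.trip_count)
        self.trip_count = self.current
        self.current = 0


def count_mispredictions(predictor, inner_iterations, outer_iterations):
    """Run a constant-trip-count inner loop repeatedly and count mispredictions."""
    mispredictions = 0
    for _ in range(outer_iterations):
        outcomes = [True] * (inner_iterations - 1) + [False]
        for taken in outcomes:
            if predictor.predict() != taken:
                mispredictions += 1
            predictor.update(taken)
    return mispredictions


print(count_mispredictions(LoopExitPredictor(), 8, 100))
```

In this model the exit is mispredicted only during the first two executions of the inner loop, while the trip count is being learned; an always-taken prediction would miss the exit every time.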

Related

Parfor limitations [duplicate]

the code I'm dealing with has loops like the following:
bistar = zeros(numdims,numcases);
parfor hh=1:nt
bistar = bistar + A(:,:,hh)*data(:,:,hh+1)' ;
end
for small nt (10).
After timing it, it is actually 100 times slower than using the regular loop!!! I know that parfor can do parallel sums, so I'm not sure why this isn't working.
I run
matlabpool
with the out-of-the-box configurations before running my code.
I'm relatively new to matlab, and just started to use the parallel features, so please don't assume that I am not doing something stupid.
Thanks!
PS: I'm running the code on a quad core so I would expect to see some improvements.
The overhead of partitioning the work and gathering the results from the several threads/cores is high for small values of nt. This is normal: you would not partition data for easy tasks that can be performed quickly in a simple loop.
Always perform something challenging inside the loop that is worth the partitioning overhead. Here is a nice introduction to parallel programming.
The threads come from a thread pool, so the overhead of creating the threads should not be there. But in order to create the partial results, n matrices of the same size as bistar must be created, all the partial results computed, and then all these partial results have to be added (recombining). In a straight loop, this is with high probability done in-place, and no allocations take place.
The complete statement in the help (thanks for your link hereunder) is:
If the time to compute f, g, and h is
large, parfor will be significantly
faster than the corresponding for
statement, even if n is relatively
small.
So you see they mean exactly the same as what I mean, the overhead for small n values is only worth the effort if what you do in the loop is complex/time consuming enough.
Parfor comes with a bit of overhead. Thus, if nt is really small, and if the computation in the loop is done very quickly (like an addition), the parfor solution is slower. Furthermore, if you run parfor on a quad-core, speed gain will be close to linear for 1-3 cores, but less if you use 4 cores, since the last core also needs to run system processes.
For example, if parfor comes with 100ms of overhead, and the computation in the loop takes 5ms, and if we assume that speed gain is linear up to 4 cores with a coefficient of 1 (i.e. using 4 cores makes the computation 4 times faster), nt needs to be about 30 for you to achieve a speed gain with parfor (150ms with for, 137.5ms with parfor). If you were to run only 10 iterations, parfor would be slower (50ms with for, 112.5ms with parfor).
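The break-even point in this example follows from setting the serial time equal to the parallel time. A minimal Python sketch of that arithmetic, using the figures assumed above (100 ms overhead, 5 ms per iteration, ideal scaling on 4 workers); the function names are made up for illustration:

```python
def for_time_ms(nt, per_iter_ms=5.0):
    """Serial loop: every iteration runs one after another."""
    return nt * per_iter_ms


def parfor_time_ms(nt, per_iter_ms=5.0, overhead_ms=100.0, workers=4):
    """Idealised parfor: fixed overhead plus perfectly divided work."""
    return overhead_ms + nt * per_iter_ms / workers


def break_even_nt(per_iter_ms=5.0, overhead_ms=100.0, workers=4):
    """Solve nt * t = overhead + nt * t / workers for nt."""
    return overhead_ms / (per_iter_ms * (1.0 - 1.0 / workers))


for nt in (10, 30):
    print(nt, for_time_ms(nt), parfor_time_ms(nt))
print("break-even nt:", break_even_nt())
```

With these numbers the break-even lies a little below 27 iterations, which is why roughly 30 iterations are needed before parfor pays off in this model.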
You can calculate the overhead on your machine by comparing execution time with 1 worker vs 0 workers, and you can estimate speed gain by making a linear fit through the execution times with 1 to 4 workers. Then you'll know when it's useful to use parfor.
Besides the bad performance because of the communication overhead (see other answers), there is another reason not to use parfor in this case. Everything which is done within the parfor in this case uses built-in multithreading. Assuming all workers are running on the same PC there is no advantage because a single call already uses all cores of your processor.

Predicting runtime of parallel loop using a-priori estimate of effort per iterand (for given number of workers)

I am working on a MATLAB implementation of an adaptive Matrix-Vector Multiplication for very large sparse matrices coming from a particular discretisation of a PDE (with known sparsity structure).
After a lot of pre-processing, I end up with a number of different blocks (greater than, say, 200), for which I want to calculate selected entries.
One of the pre-processing steps is to determine the (number of) entries per block I want to calculate, which gives me an almost perfect measure of the amount of time each block will take (for all intents and purposes the quadrature effort is the same for each entry).
Thanks to https://stackoverflow.com/a/9938666/2965879, I was able to make use of this by ordering the blocks in reverse order, thus goading MATLAB into starting with the biggest ones first.
However, the number of entries differs so wildly from block to block, that directly running parfor is limited severely by the blocks with the largest number of entries, even if they are fed into the loop in reverse.
My solution is to do the biggest blocks serially (but parallelised on the level of entries!), which is fine as long as the overhead per iterand doesn't matter too much, resp. the blocks don't get too small. The rest of the blocks I then do with parfor. Ideally, I'd let MATLAB decide how to handle this, but since a nested parfor-loop loses its parallelism, this doesn't work. Also, packaging both loops into one is (nigh) impossible.
My question now is about how to best determine this cut-off between the serial and the parallel regime, taking into account the information I have on the number of entries (the shape of the curve of ordered entries may differ for different problems), as well as the number of workers I have available.
So far, I had been working with the 12 workers available under the standard PCT license, but since I've now started working on a cluster, determining this cut-off becomes more and more crucial (since for many cores the overhead of the serial loop becomes more and more costly in comparison to the parallel loop, but similarly, having blocks which hold up the rest is even more costly).
For 12 cores (resp. the configuration of the compute server I was working with), I had figured out a reasonable parameter of 100 entries per worker as a cut-off, but this doesn't work well when the number of cores isn't small anymore in relation to the number of blocks (e.g. 64 vs 200).
I have tried to deflate the number of cores with different powers (e.g. 1/2, 3/4), but this also doesn't work consistently. Next I tried to group the blocks into batches and determine the cut-off when entries are larger than the mean per batch, resp. the number of batches they are away from the end:
logical_sml = true(1,num_core); i = 0;
while all(logical_sml)
i = i+1;
m = mean(num_entr_asc(1:min(i*num_core,end))); % "asc" ~ ascending order
logical_sml = num_entr_asc(i*num_core+(1:num_core)) < i^(3/4)*m;
% if the small blocks were parallelised perfectly, i.e. all
% cores take the same time, the time would be proportional to
% i*m. To try to discount the different sizes (and imperfect
% parallelisation), we only scale with a power of i less than
% one to not end up with a few blocks which hold up the rest
end
num_block_big = num_block - (i+1)*num_core + sum(~logical_sml);
(Note: This code doesn't work for vectors num_entr_asc whose length is not a multiple of num_core, but I decided to omit the min(...,end) constructions for legibility.)
I have also omitted the < max(...,...) for combining both conditions (i.e. together with minimum entries per worker), which is necessary so that the cut-off isn't found too early. I thought a little about somehow using the variance as well, but so far all attempts have been unsatisfactory.
I would be very grateful if someone has a good idea for how to solve this.
I came up with a somewhat satisfactory solution, so in case anyone's interested I thought I'd share it. I would still appreciate comments on how to improve/fine-tune the approach.
Basically, I decided that the only sensible way is to build a (very) rudimentary model of the scheduler for the parallel loop:
function c=est_cost_para(cost_blocks,cost_it,num_cores)
% Estimate cost of parallel computation
% Inputs:
% cost_blocks: Estimate of cost per block in arbitrary units. For
% consistency with the other code this must be in the reverse order
% that the scheduler is fed, i.e. cost should be ascending!
% cost_it: Base cost of iteration (regardless of number of entries)
% in the same units as cost_blocks.
% num_cores: Number of cores
%
% Output:
% c: Estimated cost of parallel computation
num_blocks=numel(cost_blocks);
c=zeros(num_cores,1);
i=min(num_blocks,num_cores);
c(1:i)=cost_blocks(end-i+1:end)+cost_it;
while i<num_blocks
i=i+1;
[~,i_min]=min(c); % which core finished first; is fed with next block
c(i_min)=c(i_min)+cost_blocks(end-i+1)+cost_it;
end
c=max(c);
end
The parameter cost_it for an empty iteration is a crude blend of many different side effects, which could conceivably be separated: The cost of an empty iteration in a for/parfor-loop (could also be different per block), as well as the start-up time resp. transmission of data of the parfor-loop (and probably more). My main reason to throw everything together is that I don't want to have to estimate/determine the more granular costs.
I use the above routine to determine the cut-off in the following way:
function i=cutoff_ser_para(cost_blocks,cost_it,num_cores)
% Determine cut-off between serial and parallel regime
% Inputs:
% cost_blocks: Estimate of cost per block in arbitrary units. For
% consistency with the other code this must be in the reverse order
% that the scheduler is fed, i.e. cost should be ascending!
% cost_it: Base cost of iteration (regardless of number of entries)
% in the same units as cost_blocks.
% num_cores: Number of cores
%
% Output:
% i: Number of blocks to be calculated serially
num_blocks=numel(cost_blocks);
cost=zeros(num_blocks+1,2);
for i=0:num_blocks
cost(i+1,1)=sum(cost_blocks(end-i+1:end))/num_cores + i*cost_it;
cost(i+1,2)=est_cost_para(cost_blocks(1:end-i),cost_it,num_cores);
end
[~,i]=min(sum(cost,2));
i=i-1;
end
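For readers who want to experiment with the model outside MATLAB, here is a Python port of the two routines above. It is a sketch, not the author's code: it keeps the function and argument names, uses a heap instead of repeated min() over the per-core cost vector, and assumes cost_blocks is a plain list sorted in ascending order.

```python
import heapq


def est_cost_para(cost_blocks, cost_it, num_cores):
    """Greedy scheduler model: largest block first, next block goes to the
    core that becomes free first. Returns the estimated parallel cost."""
    loads = [0.0] * num_cores
    heapq.heapify(loads)
    for cost in reversed(cost_blocks):  # ascending input, fed in reverse
        earliest = heapq.heappop(loads)
        heapq.heappush(loads, earliest + cost + cost_it)
    return max(loads)


def cutoff_ser_para(cost_blocks, cost_it, num_cores):
    """Number of (largest) blocks to compute serially, chosen by trying every
    split and keeping the first one with minimal total estimated cost."""
    num_blocks = len(cost_blocks)
    best_i, best_cost = 0, float("inf")
    for i in range(num_blocks + 1):
        split = num_blocks - i
        # Serial regime: each block is parallelised over its entries.
        serial = sum(cost_blocks[split:]) / num_cores + i * cost_it
        # Parallel regime: remaining blocks go through the simulated scheduler.
        parallel = est_cost_para(cost_blocks[:split], cost_it, num_cores)
        if serial + parallel < best_cost:
            best_i, best_cost = i, serial + parallel
    return best_i


# Hypothetical data: eight small blocks and one very large one, 4 cores.
print(cutoff_ser_para([1] * 8 + [100], 0.5, 4))
```

For this made-up input, the model moves only the single large block into the serial regime; with eight equally sized blocks and no outlier it keeps everything in the parallel loop.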
In particular, I don't inflate/change the value of est_cost_para which assumes (aside from cost_it) the most optimistic scheduling possible. I leave it as is mainly because I don't know what would work best. To be conservative (i.e. avoid feeding too large blocks to the parallel loop), one could of course add some percentage as a buffer or even use a power > 1 to inflate the parallel cost.
Note also that est_cost_para is called with successively fewer blocks (although I use the variable name cost_blocks for both routines, one is a subset of the other).
Compared to the approach in my wordy question I see two main advantages:
The relatively intricate dependence between the data (both the number of blocks as well as their cost) and the number of cores is captured much better with the simulated scheduler than would be possible with a single formula.
By calculating the cost for all possible combinations of serial/parallel distribution and then taking the minimum, one cannot get "stuck" too early while reading in the data from one side (e.g. by a jump which is large relative to the data so far, but small in comparison to the total).
Of course, the asymptotic complexity is higher by calling est_cost_para with its while-loop all the time, but in my case (num_blocks<500) this is absolutely negligible.
Finally, if a decent value of cost_it does not readily present itself, one can try to calculate it by measuring the actual execution time of each block, as well as the purely parallel part of it, and then trying to fit the resulting data to the cost prediction and get an updated value of cost_it for the next call of the routine (by using the difference between total cost and parallel cost or by inserting a cost of zero into the fitted formula). This should hopefully "converge" to the most useful value of cost_it for the problem in question.

Matlab parallel computing toolbox, dynamic allocation of work in parfor loops

I'm working with a long running parfor loop in matlab.
parfor iter=1:1000
chunk_of_work(iter);
end
There are generally about 2-3 timing outliers per run. That is to say for every 1000 chunks of work performed there are 2-3 that take about 100 times longer than the rest. As the loop nears completion, the workers that evaluated the outliers continue to run while the rest of the workers have no computational load.
This is consistent with the parfor loop distributing work statically. This is in contrast with the documentation for the parallel computing toolbox found here:
"Work distribution is dynamic. Instead of being allocated a fixed
iteration range, the workers are allocated a new iteration only after
they finish processing their current iteration, which results in an
even work load distribution."
Any ideas about what's going on?
I think the doc you quote has a pretty good description of what is considered a static allocation of work: each worker "being allocated a fixed iteration range". For 4 workers, this would mean the first being assigned iter 1:250, the second iter 251:500,... or 1:4:1000 for the first, 2:4:1000 for the second and so on.
You did not say exactly what you observe, but what you describe is well consistent with dynamic workload distribution: First, the four (example) workers work on one iter each, the first one that is finished works on a fifth, the next one that is done (which may well be the same if three of the first four take somewhat longer) works on a sixth, and so on. Now if your outliers are number 20, 850 and 900 in the order MATLAB chooses to process the loop iterations and each take 100 times as long, this only means that the 21st to 320th iterations will be solved by three of the four workers while one is busy with the 20th (by 320 it will be done, now assuming roughly even distribution of non-outlier calculation time). The worker being assigned the 850th iteration will, however, continue to run even after another has solved #1000, and the same for #900. In fact, if there were about 1100 iterations, the one working on #900 should be finished roughly at the time when the others are.
[edited as the original wording implied MATLAB would still assign the iterations of the parfor loop in order from 1 to 1000, which should not be assumed]
So long story short, unless you find a way to process your outliers first (which of course requires you to know a priori which ones are the outliers, and to find a way to make MATLAB start the parfor loop processing with these), dynamic workload distribution alone cannot avoid the effect you observe.
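The effect can be reproduced with a small Python simulation of dynamic work distribution. It is only a model under stated assumptions: 4 workers, 1000 iterations of unit cost, three outliers costing 100 units each, and iterations handed out strictly in list order (which, as noted above, MATLAB does not guarantee). The function names are invented for this sketch.

```python
import heapq


def makespan(costs, num_workers):
    """Dynamic distribution: whichever worker becomes free first takes the
    next iteration. Returns the time at which the last worker finishes."""
    free_at = [0.0] * num_workers
    heapq.heapify(free_at)
    for cost in costs:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + cost)
    return max(free_at)


def workload(outlier_positions, n=1000, outlier_cost=100.0):
    """Unit-cost iterations with expensive outliers at 1-based positions."""
    costs = [1.0] * n
    for pos in outlier_positions:
        costs[pos - 1] = outlier_cost
    return costs


late = makespan(workload([20, 850, 900]), 4)
first = makespan(workload([1, 2, 3]), 4)
print("outliers at 20, 850, 900:", late)
print("outliers processed first:", first)
```

In this model, outliers at positions 20, 850 and 900 give a total run time of 353 time units, with one worker still busy on iteration 900 long after the others have run out of work. Processing the same three outliers first brings it down to 325, close to the ideal of total work divided by the number of workers.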
Addition: I think, however, that your observation that as "the loop nears completion, the worker*s* that evaluated the outliers continue to run" seems to imply at least one of the following
The outliers somehow are among the last iterations MATLAB starts to process
You have many workers, in the order of magnitude of the number of iterations
Your estimate of the number of outliers (2-3) or your estimate of their computation time penalty (factor 100) is too low
The work distribution in PARFOR is somewhat deterministic. You can observe precisely what's going on by having each worker log to disk how things go, but basically it turns out that PARFOR divides your loop up into chunks in a deterministic way, but farms them out dynamically. Unfortunately, there's currently no way to control that chunking.
However, if you cannot predict which of your 1000 cases are going to be outliers, it's hard to imagine an efficient scheme for distributing the work.
If you can predict your outliers, you might be able to take advantage of the fact that roughly speaking, PARFOR executes loop iterations in reverse order, so you could put them at the "end" of the loop so work starts on them immediately.
The problem you face is well described in @arne.b's answer, I have nothing to add to that.
But, the parallel compute toolbox does contain functions for decomposing a job into tasks for independent execution. From your question it's not possible to conclude either that this is suitable or that this is not suitable for your application. If it is, the general strategy is to break the job into tasks of some size and have each processor tackle a task, when finished go back to the stack of unfinished tasks and start on another.
You might be able to decompose your problem such that one task replaces one loop iteration (lots of tasks, lots of overhead in managing the computation but best load-balancing) or so that one task replaces N loop iterations (fewer tasks, less overhead, poorer load-balancing). Jobs and tasks are a little trickier to implement than parfor too.
As an alternative to PARFOR, in R2013b and later, you can use PARFEVAL and divide up the work any way you see fit. You could even cancel the 'timing outliers' once you've got sufficient results, if that's appropriate. There is, of course, overhead when dividing up your existing loop into 1000 individual remote PARFEVAL calls. Perhaps that's a problem, perhaps not. Here's the sort of thing I'm imagining:
for idx = 1:1000
futures(idx) = parfeval(@chunk_of_work, 1, idx);
end
done = false; numComplete = 0;
timer = tic();
while ~done
[idx, result] = fetchNext(futures, 10); % wait up to 10 seconds
if ~isempty(idx)
numComplete = numComplete + 1;
% stash result
end
done = (numComplete == 1000) || (toc(timer) > 100);
end
% cancel outstanding work, has no effect on completed futures
cancel(futures);