What’s the performance difference between moving(sum, X, 10) and msum(X, 10) and what causes the difference? - database-performance

For a vector of length 1 million, what’s the performance difference between moving(sum, X, 10) and msum(X, 10) and what causes the difference?

For a vector of 1 million elements, msum is typically 50 to 200 times faster than moving(sum, X, 10); the exact ratio varies with the data volume. The reasons are as follows:
The two functions manage memory differently: msum loads the data into memory once and needs no per-window allocation, whereas moving creates a sub-object for every window, allocates memory for it, and reclaims that memory after each calculation completes.
msum also computes incrementally: each step takes the previous result, adds the value entering the window, and subtracts the value leaving it. moving, by contrast, sums every value in the window from scratch at each step.
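The difference is easy to reproduce outside the database. Below is an illustrative MATLAB sketch of the two strategies (it mimics the algorithms, not DolphinDB's actual implementation):
X = rand(1e6, 1);   % vector of length 1 million
w = 10;             % window size
n = numel(X);

% moving(sum, X, w)-style: re-sum the whole window at every position, O(n*w)
s_naive = nan(n, 1);
for i = w:n
    s_naive(i) = sum(X(i-w+1:i));
end

% msum(X, w)-style: incremental, O(n) -- add the entering value,
% subtract the leaving one
s_incr = nan(n, 1);
s_incr(w) = sum(X(1:w));
for i = w+1:n
    s_incr(i) = s_incr(i-1) + X(i) - X(i-w);
end
The incremental version performs a constant amount of work per output element regardless of the window size and allocates no per-window sub-object, which accounts for both effects described above.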

Related

Terminate a benchmark early because of long score calculation

What I want to achieve
Currently I am running large inputs in my OptaPlanner project, and with the current implementation of the constraints they take a long time even to calculate the initial score. A given solver can therefore ruin the whole benchmark because it gets stuck and cannot terminate. As the score calculation type I am using Drools.
I am trying to achieve early termination of a solver that, after a certain amount of time, still has not passed the initial score calculation (no "Solving started" is displayed). So in a single benchmark I want to run multiple different inputs, give each of them a timer, and if that timer expires before the initial score calculation is done, I want the solver to be terminated immediately. A desirable option would be to have a percentage of how much of the score calculation was completed.
The reason why I'm not jumping straight into optimizations is that I want a baseline for comparison and to keep track of the results as the optimizations go on. So knowing what percentage of the initial score calculation has passed is vital for me.
What I have/know currently
The version of OptaPlanner I'm using is the one from GitHub with the whole source code open (it is not the official version from the website, which ships as compiled JARs with a non-editable core).
I have implemented timers for each solver of the benchmark which, after a given time period, call the solver.terminateEarly() method.
Each solver runs on its own thread, so the solver : thread relation is 1:1. I find out which solver is currently executing the code by doing a lookup in a Map<Integer,Solver> solverMap whose key is the hash code of the thread executing the solver, i.e. Thread.currentThread().hashCode(). This Map is updated as the solvers start and finish. This way I am able to do the lookup from all the places that need it (the optaplanner-examples, optaplanner-core and optaplanner-benchmark projects, and Drools rules; example below).
I found out about kcontext.getKieRuntime().halt() from the Drools documentation; it is used to terminate rule execution immediately.
I have implemented specialized rules that reach the then part after each change of a planning/shadow entity; the then part first checks whether the solver has been terminated early (by the corresponding Timer) and, if so, calls kcontext.getKieRuntime().halt(). For example:
In the rule below, the then part is reached after each change to a ShiftAssignment instance, and rule execution is stopped if the solver is set to be terminated early.
rule "ShiftAssignmentChange"
salience 1 // so that it is triggered first
when
    ShiftAssignment()
then
    if (TerminateBenchmarkEarly.solverMap.get(Thread.currentThread().hashCode()).isTerminateEarly()) {
        kcontext.getKieRuntime().halt(); // terminate the fire loop early
    }
end
The intention with these rules is that, having salience 1 as opposed to the default of 0, they are executed first, so rule execution is stopped immediately.
The kieSession.fireAllRules() call in the calculateScore method of org.optaplanner.core.impl.score.director.drools.DroolsScoreDirector returns the number of rules that were executed. I can use this measure as a baseline for how far the initial score calculation got. As the optimizations go on, this number is expected to grow while the time taken shrinks.
The problem I'm facing currently
The problem is that even with all this implemented, it still takes a long time to reach the check in the rules, and in some cases it crashes with an OutOfMemory error. Turning on the Trace option for Drools, I could see that a smaller part of the time was spent inserting the facts into working memory; after that it constantly outputs TRACE BetaNode stagedInsertWasEmpty=false. The problem lies in the kieSession.fireAllRules() call in the calculateScore method of org.optaplanner.core.impl.score.director.drools.DroolsScoreDirector; the code of fireAllRules is part of the Drools core, which is compiled into a JAR, so it cannot be edited.
Conclusion
Anyway, I know this is something of a hack, but as I said above, I need this information as a baseline to know where my current solution stands and to keep track of the benchmark information as the optimizations go on.
If there is a different (smarter) way to achieve this, I would be happy to use it.
Results from a benchmark
Input 1
Entity count: 12,870
Variable count: 7,515
Maximum value count: 21
Problem scale: 22,068
Memory usage after loading the inputSolution (before creating the Solver): 44,830,840 bytes on average.
Average score calculation speed after Construction Heuristic = 1965/sec
Average score calculation speed after Local Search = 1165/sec
Average score calculation speed after Solver is finished = 1177/sec
Input 2
Entity count: 17,559
Variable count: 7,515
Maximum value count: 8
Problem scale: 21,474
Memory usage after loading the inputSolution (before creating the Solver): 5,964,200 bytes on average.
Average score calculation speed after Construction Heuristic = 1048/sec
Average score calculation speed after Local Search = 1075/sec
Average score calculation speed after Solver is finished = 1075/sec
Input 3
Entity count: 34,311
Variable count: 14,751
Maximum value count: 8
Problem scale: 43,358
Memory usage after loading the inputSolution (before creating the Solver): 43,178,536 bytes on average.
Average score calculation speed after Construction Heuristic = 1134/sec
Average score calculation speed after Local Search = 450/sec
Average score calculation speed after Solver is finished = 452/sec
Input 4
Entity count: 175,590
Variable count: 75,150
Maximum value count: 11
Problem scale: 240,390
Memory usage after loading the inputSolution (before creating the Solver): 36,089,240 bytes on average.
Average score calculation speed after Construction Heuristic = 739/sec
Average score calculation speed after Local Search = 115/sec
Average score calculation speed after Solver is finished = 123/sec
Input 5
Entity count: 231,000
Variable count: 91,800
Maximum value count: 31
Problem scale: 360,150
Memory usage after loading the inputSolution (before creating the Solver): 136,651,744 bytes on average.
Average score calculation speed after Construction Heuristic = 142/sec
Average score calculation speed after Local Search = 11/sec
Average score calculation speed after Solver is finished = 26/sec
Input 6
Entity count: 770,000
Variable count: 306,000
Maximum value count: 51
Problem scale: 1,370,500
Memory usage after loading the inputSolution (before creating the Solver): 114,488,056 bytes on average.
Average score calculation speed after Construction Heuristic = 33/sec
Average score calculation speed after Local Search = 1/sec
Average score calculation speed after Solver is finished = 17/sec
When commenting out the rules in Drools I get the following average score calculation speed (for Input 6):
After Construction Heuristic = 17800/sec
After Local Search = 22557/sec
After Solver is finished = 21690/sec
If possible, I'd first focus on making the DRL faster instead of these hacks. That comes down to figuring out which score rules are slow: comment out score rules one by one and watch the impact on the score calculation speed (in the last INFO log line).
That being said, normally I'd advise looking at unimprovedSecondsSpentLimit or a custom Termination, but that indeed won't help here, as those aren't checked while the initial score is calculated from scratch: they are only checked between moves (so between every fireAllRules() call, usually 10k/sec).

Predicting runtime of parallel loop using a-priori estimate of effort per iterand (for given number of workers)

I am working on a MATLAB implementation of an adaptive Matrix-Vector Multiplication for very large sparse matrices coming from a particular discretisation of a PDE (with known sparsity structure).
After a lot of pre-processing, I end up with a number of different blocks (greater than, say, 200), for which I want to calculate selected entries.
One of the pre-processing steps is to determine the (number of) entries per block I want to calculate, which gives me an almost perfect measure of the amount of time each block will take (for all intents and purposes the quadrature effort is the same for each entry).
Thanks to https://stackoverflow.com/a/9938666/2965879, I was able to make use of this by ordering the blocks in reverse order, thus goading MATLAB into starting with the biggest ones first.
However, the number of entries differs so wildly from block to block, that directly running parfor is limited severely by the blocks with the largest number of entries, even if they are fed into the loop in reverse.
My solution is to do the biggest blocks serially (but parallelised on the level of entries!), which is fine as long as the overhead per iterand doesn't matter too much, resp. the blocks don't get too small. The rest of the blocks I then do with parfor. Ideally, I'd let MATLAB decide how to handle this, but since a nested parfor-loop loses its parallelism, this doesn't work. Also, packaging both loops into one is (nigh) impossible.
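To make this split concrete, here is a minimal sketch with illustrative names (blocks is assumed to be a cell array of block data sorted by ascending cost, num_block_big the number of big blocks, and the two compute_* helpers are hypothetical stand-ins for the actual per-block work):
num_small = num_block - num_block_big;
result = cell(num_block, 1);
parfor b = 1:num_small                        % small blocks: parallelised over blocks
    result{b} = compute_block(blocks{b});     % plain serial work per block
end
for b = num_small+1:num_block                 % biggest blocks: serial over blocks,
    result{b} = compute_block_entries_par(blocks{b}); % parallelised over entries inside
end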
My question now is about how to best determine this cut-off between the serial and the parallel regime, taking into account the information I have on the number of entries (the shape of the curve of ordered entries may differ for different problems), as well as the number of workers I have available.
So far, I had been working with the 12 workers available under the standard PCT license, but since I've now started working on a cluster, determining this cut-off has become more and more crucial (for many cores, the overhead of the serial loop becomes ever more costly in comparison to the parallel loop, but blocks which hold up the rest are similarly even more costly).
For 12 cores (resp. the configuration of the compute server I was working with), I had figured out a reasonable cut-off of 100 entries per worker, but this doesn't work well when the number of cores is no longer small in relation to the number of blocks (e.g. 64 vs 200).
I have tried to deflate the number of cores with different powers (e.g. 1/2, 3/4), but this also doesn't work consistently. Next I tried to group the blocks into batches and determine the cut-off when entries are larger than the mean per batch, resp. the number of batches they are away from the end:
logical_sml = true(1,num_core); i = 0;
while all(logical_sml)
    i = i+1;
    m = mean(num_entr_asc(1:min(i*num_core,end))); % "asc" ~ ascending order
    logical_sml = num_entr_asc(i*num_core+(1:num_core)) < i^(3/4)*m;
    % If the small blocks were parallelised perfectly, i.e. all cores
    % took the same time, the time would be proportional to i*m. To try
    % to discount the different sizes (and imperfect parallelisation),
    % we only scale with a power of i less than one, so as not to end
    % up with a few blocks which hold up the rest.
end
num_block_big = num_block - (i+1)*num_core + sum(~logical_sml);
(Note: This code doesn't work for vectors num_entr_asc whose length is not a multiple of num_core, but I decided to omit the min(...,end) constructions for legibility.)
I have also omitted the < max(...,...) for combining both conditions (i.e. together with minimum entries per worker), which is necessary so that the cut-off isn't found too early. I thought a little about somehow using the variance as well, but so far all attempts have been unsatisfactory.
I would be very grateful if someone has a good idea for how to solve this.
I came up with a somewhat satisfactory solution, so in case anyone's interested I thought I'd share it. I would still appreciate comments on how to improve/fine-tune the approach.
Basically, I decided that the only sensible way is to build a (very) rudimentary model of the scheduler for the parallel loop:
function c=est_cost_para(cost_blocks,cost_it,num_cores)
% Estimate cost of parallel computation
% Inputs:
%   cost_blocks: Estimate of cost per block in arbitrary units. For
%     consistency with the other code this must be in the reverse order
%     that the scheduler is fed, i.e. cost should be ascending!
%   cost_it: Base cost of iteration (regardless of number of entries)
%     in the same units as cost_blocks.
%   num_cores: Number of cores
%
% Output:
%   c: Estimated cost of parallel computation

num_blocks = numel(cost_blocks);
c = zeros(num_cores,1);

% the biggest blocks go to the cores first
i = min(num_blocks,num_cores);
c(1:i) = cost_blocks(end-i+1:end) + cost_it;

% each further block is fed to whichever core finishes first
while i < num_blocks
    i = i+1;
    [~,i_min] = min(c); % which core finished first; it is fed the next block
    c(i_min) = c(i_min) + cost_blocks(end-i+1) + cost_it;
end
c = max(c);
end
The parameter cost_it for an empty iteration is a crude blend of many different side effects, which could conceivably be separated: The cost of an empty iteration in a for/parfor-loop (could also be different per block), as well as the start-up time resp. transmission of data of the parfor-loop (and probably more). My main reason to throw everything together is that I don't want to have to estimate/determine the more granular costs.
I use the above routine to determine the cut-off in the following way:
function i=cutoff_ser_para(cost_blocks,cost_it,num_cores)
% Determine cut-off between serial and parallel regime
% Inputs:
%   cost_blocks: Estimate of cost per block in arbitrary units. For
%     consistency with the other code this must be in the reverse order
%     that the scheduler is fed, i.e. cost should be ascending!
%   cost_it: Base cost of iteration (regardless of number of entries)
%     in the same units as cost_blocks.
%   num_cores: Number of cores
%
% Output:
%   i: Number of blocks to be calculated serially

num_blocks = numel(cost_blocks);
cost = zeros(num_blocks+1,2);
for i = 0:num_blocks
    % serial cost of the i biggest blocks (parallelised over their entries)
    cost(i+1,1) = sum(cost_blocks(end-i+1:end))/num_cores + i*cost_it;
    % simulated parallel cost of the remaining blocks
    cost(i+1,2) = est_cost_para(cost_blocks(1:end-i),cost_it,num_cores);
end
[~,i] = min(sum(cost,2));
i = i-1;
end
In particular, I don't inflate/change the value of est_cost_para, which (aside from cost_it) assumes the most optimistic scheduling possible. I leave it as is mainly because I don't know what would work best. To be conservative (i.e. to avoid feeding too-large blocks to the parallel loop), one could of course add some percentage as a buffer, or even use a power > 1 to inflate the parallel cost.
Note also that est_cost_para is called with successively fewer blocks (although I use the variable name cost_blocks for both routines, one is a subset of the other).
Compared to the approach in my wordy question I see two main advantages:
The relatively intricate dependence between the data (both the number of blocks as well as their cost) and the number of cores is captured much better with the simulated scheduler than would be possible with a single formula.
By calculating the cost for all possible combinations of serial/parallel distribution and then taking the minimum, one cannot get "stuck" too early while reading in the data from one side (e.g. by a jump which is large relative to the data so far, but small in comparison to the total).
Of course, the asymptotic complexity is higher because est_cost_para, with its while-loop, is called every time, but in my case (num_blocks < 500) this is absolutely negligible.
Finally, if a decent value of cost_it does not readily present itself, one can try to calculate it by measuring the actual execution time of each block, as well as the purely parallel part of it, and then fitting the resulting data to the cost prediction to obtain an updated value of cost_it for the next call of the routine (either by using the difference between total and parallel cost, or by inserting a cost of zero into the fitted formula). This should hopefully "converge" to the most useful value of cost_it for the problem in question.
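As a sketch of the second variant, suppose one has measured vectors t_meas (wall time per block) and num_entr (entries per block); these names are illustrative, not from the code above. A linear fit then yields cost_it as the extrapolated cost of a block with zero entries:
p = polyfit(num_entr(:), t_meas(:), 1); % fit t = p(1)*entries + p(2)
cost_it = p(2);                         % cost of a hypothetical block with zero entries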

Making knnsearch fast when one argument remains constant

I have the following problem.
for i=1:3000
    selectedIndices = (i-1)*100 + (1:100);           % the i-th set of 100 rows
    [~, dist(:,i)] = knnsearch(C(selectedIndices,:), C);
end
Let me explain the code above. Matrix C is a huge matrix (300,000 x 1984). C(selectedIndices,:) is a subset of 100 rows of C that depends on the value of i: for i==1 the first 100 points of C are selected, for i==2 C(101:200,:), and so on. As you can see, the second argument remains constant.
Is there any way to make this run faster? I have tried the following:
- [~,dist]=knnsearch(C,C); % obviously goes out of memory
- Sending a bigger chunk of selectedIndices instead of just 100 rows. This adds a bit of post-processing, which I am not worried about, but it doesn't help: it takes an equivalent amount of time. For example, if I send 100 points of C at a time, it takes 60 seconds; if I send 500, it takes 380 seconds including the post-processing.
- Using parfor so that different sets of selectedIndices are executed in parallel. This doesn't work, as two copies of the big matrix C may get created (I am not sure how parfor works), but I am sure the computer becomes very slow, in turn negating the advantage of parfor.
- Haven't tried yet: breaking both arguments into smaller chunks and sending them into parfor (a sketch of this idea follows below). Do you think this will make any difference?
I am open to any suggestion, i.e. if you feel breaking the matrix up in some different way may speed up the computation, do suggest it. In the end I only care about finding, for each point in C, the closest point from a set of points (here each set has 100 points).
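A minimal sketch of the untried parfor idea, with sizes taken from the question and assuming the Statistics Toolbox for knnsearch. The min(d) reduction is a placeholder for whatever per-set quantity is actually needed, and note that C is a broadcast variable here, so each worker receives its own copy -- exactly the memory concern raised above:
setSize = 100;
numSets = 3000;
distSummary = zeros(numSets, 1);
parfor i = 1:numSets
    subset = C((i-1)*setSize + (1:setSize), :);  % the i-th set of 100 points
    [~, d] = knnsearch(subset, C);               % nearest point in the set, for every point of C
    distSummary(i) = min(d);                     % placeholder reduction; replace as needed
end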

MATLAB: Slow convergence of convex optimization algorithm

I want to speed up the convergence of a convex optimization problem in MATLAB.
My objective function is convex having three parameters and I am using gradient ascent for the maximization.
Right now I am writing the iteration manually, with the termination condition being that the difference between the new and old parameter values is very small (around 0.0000001). I cannot terminate based on the number of iterations because that doesn't guarantee it has converged to the optimum solution.
So, it takes a lot of time to converge - almost 2 days! Is there any way to speed this up?
Actually my objective function has only three parameters. I know that my first parameter's value should be greater than that of the second.
So starting with the initial condition, the second parameter's value starts increasing rapidly. After it has reached a certain point, the first parameter's value starts increasing rapidly. While the first parameter's value starts increasing, the second parameter's value starts decreasing slowly. Eventually, I have the first parameter's value greater than that of second.
Is there any way to speed up the process? 2 days is a very long time. Furthermore, calculating the gradient is also time consuming. It needs a lot of matrix computations.
I don't want to start from predefined parameter values like the first parameter's value greater than the second's. Also, it's not necessary that the first parameter always has to be greater than the second; I just know which parameter value should end up greater. Any suggestions?
If the calculation of gradients is very slow and you still want to do a manual implementation, you could try the following; it will take more steps, but each step is so simple that it could be a lot quicker overall:
Define a stepsize.
Try all the points where each of your variables moves -1, 0 or 1 times the stepsize (3^3 = 27 possibilities).
Pick the best one.
If the best one is your previous point, multiply the stepsize by a factor of 0.5.
Of course the success of this process depends on the properties of your function. Furthermore, it should be noted that a much simpler solution could be to relax the desired difference to something like 0.0001.
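A minimal MATLAB sketch of this scheme, assuming a three-parameter objective f to maximise (all names here are illustrative):
function x = pattern_search_max(f, x0)
% Derivative-free maximisation: try all 27 one-step moves and halve the
% stepsize whenever no move improves on the current point.
x    = x0(:);                                 % 3x1 starting point
step = 1;                                     % initial stepsize
tol  = 1e-4;                                  % relaxed tolerance, as suggested above
[d1, d2, d3] = ndgrid(-1:1, -1:1, -1:1);
moves = [d1(:) d2(:) d3(:)];                  % the 3^3 = 27 candidate moves
while step > tol
    cand = x.' + step*moves;                  % 27 candidate points (one per row)
    vals = zeros(27, 1);
    for k = 1:27
        vals(k) = f(cand(k, :).');
    end
    [~, best] = max(vals);
    if all(moves(best, :) == 0)               % best candidate is the current point
        step = step/2;                        % halve the stepsize
    else
        x = cand(best, :).';
    end
end
end
Each iteration costs 27 objective evaluations but no gradient evaluations, which is the point when computing the gradient is the expensive part.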

Why is Arrayfun much faster than a for-loop when using GPU?

Could someone tell me why arrayfun is much faster than a for loop on the GPU? (Not on the CPU: there, a for loop is actually faster.)
Arrayfun:
x = parallel.gpu.GPUArray(rand(512,512,64));
count = arrayfun(@(x) x^2, x);
And the equivalent for loop:
for i=1:size(x,1)*size(x,2)*size(x,3)
    z(i) = x(i).^2;
end
Is it perhaps because a for loop is not multithreaded on the GPU?
Thanks.
I don't think your loops are equivalent. It seems you're squaring every element in an array with your CPU implementation, but performing some sort of count with arrayfun.
Regardless, I think the explanation you're looking for is as follows:
When run on the GPU, your code can be functionally decomposed -- into each array cell in this case -- and squared separately. This is okay because, for a given i, the value of [cell_i]^2 doesn't depend on any of the values in the other cells. What most likely happens is that the array gets decomposed into S buffers, where S is the number of stream processing units your GPU has. Each unit then computes the square of the data in each cell of its buffer. The result is copied back to the original array and returned to count.
Now don't worry: if you're counting things, as it seems arrayfun is actually doing here, a similar thing happens. The algorithm most likely partitions the array into similar buffers and, instead of squaring each cell, adds the values together. You can think of the result of this first step as a smaller array to which the same process can be applied recursively to count the new sums.
As per the reference page (http://www.mathworks.co.uk/help/toolbox/distcomp/arrayfun.html), "the MATLAB function passed in for evaluation is compiled for the GPU, and then executed on the GPU". In the explicit for-loop version, each operation is executed separately on the GPU, and each incurs overhead; the arrayfun version is a single GPU kernel invocation.
These are the times I got for the same code. arrayfun on the CPU takes approx. 17 sec, which is much higher, but on the GPU arrayfun is much faster:
parfor time = 0.4379
for time = 0.7237
gpu arrayfun time = 0.1685
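For what it's worth, a minimal way to time the two GPU variants fairly (assuming a GPU and the Parallel Computing Toolbox; gpuArray.rand is the modern replacement for parallel.gpu.GPUArray):
x = gpuArray.rand(512, 512, 64);
t_kernel = gputimeit(@() arrayfun(@(v) v^2, x)); % one fused GPU kernel
t_vector = gputimeit(@() x.^2);                  % single vectorised GPU operation
fprintf('arrayfun: %.4f s, vectorised: %.4f s\n', t_kernel, t_vector);
gputimeit synchronises with the device before and after each run, so it avoids the usual pitfall of timing asynchronous GPU calls with tic/toc.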