MATLAB parfor work distribution

I have a parfor loop of, say, 100 iterations. The workload differs per iteration and decreases linearly, so the first iteration takes the longest and the last is the fastest. When I run the loop on my four workers (labs), only one lab is active during the last few hours, working through the first few iterations on its own.
So I know which iterations are the slow ones. How can I balance the workload between cores more evenly? For example, could I force all labs to start on the four slowest iterations and then proceed in order, or do something similar so that a single core is not left running the slow ones alone?

MATLAB's parfor does nothing more than split up the indices and distribute them to the workers. It does this by creating contiguous chunks of indices. I don't know the exact algorithm, but this means that iterations with similar indices get computed in the same chunk and by the same worker.
The simplest solution is a stochastic one: shuffle your indices so that the work-intensive steps are spread evenly across the chunks. This gives no performance guarantee, but it is simple and works most of the time.
Some example code:
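% dummy data: iteration i takes data(i) seconds, slowest first
N = 12; data = linspace(1, 0.1, N);
% generate the permuted indices
permIdx = randperm(N);
% permute the data
dataPerm = data(permIdx); out = zeros(1, N);
% run the loop
parfor i = 1:N
    % do something, e.g. pause for the time specified by the data
    pause(dataPerm(i)); out(i) = dataPerm(i);
end
% invert the index permutation
out(permIdx) = out;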
I used pause to simulate the different computation times.

I don't think this is documented anywhere, but you can quickly deduce that PARFOR runs iterations in reverse loop order (use pause and disp if you want to see it in action). So you should simply reverse your loop, so that the slow iterations get the highest indices and start first. PARFOR gives you no means to explicitly control execution order, but SPMD with for-drange does (PARFOR is significantly easier to use, though).
@denahiro's suggestion is also a good one.
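A minimal sketch of the reversal, assuming the reverse-order observation above holds for your release (it is not documented behaviour). The names cost, costRev and out are made up for illustration, and pause stands in for the real work.

```matlab
% Hypothetical workload: iteration 1 is the slowest, iteration N the fastest.
N = 8;
cost = linspace(0.2, 0.02, N);

% Flip the work list so the expensive items sit at the END of the loop range.
% Assumption (undocumented): parfor hands out the high indices first.
costRev = fliplr(cost);

out = zeros(1, N);
parfor k = 1:N
    pause(costRev(k));        % stand-in for the real computation
    out(k) = 2 * costRev(k);  % any per-iteration result
end

% Undo the flip so that out(i) again belongs to original iteration i.
out = fliplr(out);
```

Flipping the data up front (rather than computing N - k + 1 inside the loop) keeps costRev a sliced variable, so each worker only receives the elements it needs.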


How can we run two parfor loops simultaneously in MATLAB to reduce idle time?

Suppose I want to run my code using a parfor loop in MATLAB R2016a. The first four iterations of this loop (1:4) have a higher computational load than the second four (5:8), so the two parts differ greatly in calculation time.
How can we run two parfor loops simultaneously in this case? For example, could the second part (5:8) do its job without waiting for the first part (1:4) to complete? I want to reduce the idle time using a different code structure or any other trick.

Relationship between the number of iterations and the number of parallel workers in MATLAB

I have a question regarding the use of parfor in Matlab: should the number of parallel workers be proportional to the number of iterations in the loop?
MATLAB divides the iterations of your parfor loop so that all workers stay similarly busy. Do not worry about it; you can easily run parfor over 1:100 on 6 cores.
To expand a bit: MATLAB sends chunks of different sizes to the workers, bigger at the beginning and smaller at the end. So at first it might send, for example, 10 iterations to each worker, and as they finish it sends 5, 3, ... 1 iterations to each of them (I just invented the numbers).

MATLAB: Is it inefficient to use parfor (parallel for loop) within a while loop?

I'm having trouble running MCMC (Markov chain Monte Carlo). Say I run 10000 iterations, and within each iteration I draw some parameters. In each iteration I also have some individual data that are independent of each other, so I can use parfor over them. The problem is that the time to finish one iteration seems to grow quickly as the MCMC goes on, and it soon becomes extremely time-consuming.
My question is: is there an efficient way to combine parfor and a while loop?
I have the following pseudo-code:
while r < 10000
    parfor i = 1:I
        % make draws from the proposal distribution
        % calculate the acceptance rate
        % accept or reject the current draw
    end
    r = r + 1;
end
Launching lots of separate parfor loops can be inefficient if each loop duration is small. Unfortunately, as you are probably aware, you cannot break out of a parfor loop. One alternative might be to use parfeval. The idea would be to make many parfeval calls (but not too many), and then you can terminate when you have sufficient results.
This (fairly long) blog article shows an example of using parfeval in a situation where you might wish to terminate the computations early.
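As a rough sketch of the parfeval idea (not the code from the blog article): submit a batch of tasks, collect results as they finish with fetchNext, and cancel the rest once enough have arrived. The task function, the acceptance rule and all variable names here are hypothetical, and the code assumes Parallel Computing Toolbox with a pool available.

```matlab
numTasks = 20;   % hypothetical batch size
needed   = 5;    % stop once this many results have been accepted

% Submit every task up front; each call returns one output (k^2).
futures(1:numTasks) = parallel.FevalFuture;
for k = 1:numTasks
    futures(k) = parfeval(@(x) x^2, 1, k);
end

accepted = zeros(1, 0);
for k = 1:numTasks
    [~, value] = fetchNext(futures);   % blocks until any one task finishes
    if mod(value, 2) == 0              % hypothetical acceptance rule
        accepted(end + 1) = value;
    end
    if numel(accepted) >= needed
        break;                         % enough results, stop collecting
    end
end

cancel(futures);   % discard whatever is still queued or running
```

Results arrive in completion order, not submission order, so which five values end up in accepted can vary between runs.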

How to speed up MATLAB nested for loops when I cannot vectorize the calculations?

I have three big 3D arrays of the same size [41 x 141 x 12403], named alpha, beta and ni in the MATLAB code below. From them I need to calculate another 3D array of the same size, obtained elementwise from the original arrays through a calculation that combines an infinite sum and a definite integral, using the value of each element. It therefore seems inevitable that I need several nested loops for this calculation. The code has already been running for several hours and is still in the first iteration of the outer loop (which has to be performed 41 times; by my calculation the program would have to run for more than two years). I don't know how to optimize the code. Please help!
the code I use:
z_len=size(KELDYSH_PARAM_r_z_t,1); % 41 rows
r_len=size(KELDYSH_PARAM_r_z_t,2); % 141 columns
t_len=size(KELDYSH_PARAM_r_z_t,3); % 12403 slices
for z_ind=1:z_len
z_ind % in order to track the advancement of the calculation
for r_ind=1:r_len
for t_ind=1:t_len
while abs(sumPrevious-sumCurrent)>1e-6
kapa=kapa_0+s; %some scalar
sumCurrent=sumCurrent+exp(-alpha(z_ind,r_ind,t_ind).* ...
(kapa-ni(z_ind,r_ind,t_ind))).*(x_of_w.^(2*abs(m)+1)/2).* ...
function res=w_m_integral(x_of_w,m)
function y=integrandFun(t)
Option 1 - more vectorising
It's a pretty complex model you're working with and not all the terms are explained, but some parts can still be further vectorised. Your alpha, beta and ni arrays are presumably static and precomputed? Your s value is a scalar and kapa could be either, so you can probably precompute the x_of_w array all in one go too. This would give you a slight speedup on its own, though you'd be spending memory to get it: 71 million points (roughly 570 MB per double-precision array) is doable these days but still calls for a lot of memory. Doing it once for each of your 41 rows would reduce the burden neatly.
That leaves the integral itself. The quad function doesn't accept vector inputs - it would be a nightmare, wouldn't it? - and integral, which MathWorks recommends you use instead, only handles array-valued integrands through its 'ArrayValued' option. But if your integration limits are the same in each case, why not do the integral the old-fashioned way? If you can find an antiderivative, compute an array of its values at the upper limit 1, another array of its values at the lower limit 0, and take the difference.
Then you can write a single loop that computes the integral for the whole input space then tests the convergence for all the matrix elements. Make a mask that notes the ones that have not converged and recalculate those with the increased s. Repeat until all have converged (or you hit a threshold for iterations).
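The masked convergence loop could look roughly like this. It is a sketch under stated assumptions: the series term exp(-A.*s) is a stand-in for the real summand (which is not fully shown in the question), and A, total, active, tol and maxIter are illustrative names.

```matlab
A = 0.5 + rand(4, 5, 3);   % hypothetical positive parameters, one per element
total  = zeros(size(A));   % running sum for every element
active = true(size(A));    % mask of elements that have not converged yet
tol = 1e-6;
maxIter = 1000;            % safety cap on the number of terms

s = 0;
while any(active(:)) && s < maxIter
    % Evaluate the next term only for the unconverged elements.
    term = exp(-A(active) .* s);
    total(active) = total(active) + term;

    % An element has converged once its latest term no longer changes the sum
    % by more than tol; drop it from the mask.
    active(active) = abs(term) > tol;
    s = s + 1;
end
```

Because the mask shrinks as elements converge, later iterations touch fewer and fewer elements, which is where the saving over a fixed number of full-array passes comes from.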
Option 2 - parallelise it
It used to be the case that MATLAB was much faster with vectorised operations than with loops. I can't find a source for it now, but I think I've read that for loops have become a lot faster recently, so depending on the resources you have available you might get better results by parallelising the code you currently have. That's going to need a bit of refactoring too. The big problems are the overhead of copying data to the workers (which you can fix by chopping the inputs up into chunks and feeding in only the relevant one) and the parfor loop not allowing you to use certain variables, usually ones which cover the whole space. Again, chopping them up helps.
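One way to do the chopping, sketched with made-up data: index the big array with the loop variable in one dimension so that parfor treats it as a sliced variable and ships only the matching slab to each worker. The array size and the exp(-slab) body are placeholders, not the real model.

```matlab
alpha = rand(4, 6, 10);           % small stand-in for the 41 x 141 x 12403 array
result = zeros(size(alpha));

parfor z = 1:size(alpha, 1)
    slab = alpha(z, :, :);        % sliced input: only this slab is sent to the worker
    result(z, :, :) = exp(-slab); % placeholder for the real per-element computation
end
```

With only 41 slabs in the outer dimension, the number of workers that can be kept busy is limited to 41; slicing along a longer dimension is an option if more workers are available.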
But if you have a 2 year runtime you will need a factor of at least 100 I'm guessing, so that means a cluster! If you're at a university or somewhere where you might be able to get a few days on a 500-core cluster then go for that...
If you can write the integral in a closed form then it might be amenable to GPU computation. Those things can do certain classes of computation very fast but you have to be able to parallelise the job and reduce the actual computation to something basic comprised mainly of addition and multiplication. The CUDA libraries have done a lot of the legwork and matlab has an interface to them so have a read about those.
Option 3 - reduce the scope
Finally, if neither of the above two results in sufficient speedups, then you may have to reduce the scope of your calculation. Trim the input space as much as you can and perhaps accept a looser convergence threshold. If you know how many iterations you tend to need inside the innermost while loop (the one with the s counter in it), it might turn out that relaxing the convergence criterion reduces the number of iterations you need, which could speed it up. The profiler can help you see where you're spending your time.
The bottom line, though, is that 71 million points are going to take some time to compute. You can optimise the computation only so far; the odds are that for a problem of this size you will have to throw hardware at it.

Parallelizing a for loop to run simultaneously on multiple GPU cores?

I understand that you can use a matlabpool and parfor to run for loop iterations in parallel. However, I want to take advantage of the large number of cores in my GPU to run more iterations simultaneously. Is there any built-in functionality to do this?
To my understanding, MATLAB runs code on the GPU through a gpuArray, but that does not seem to parallelize a loop, only certain functions inside the loop.
For the loop that I am running, each iteration can run independently and the only variables that need to exist outside of the loop is the data to be processed (a 3-D array, where the first index is time, and each iteration is operating on a different time) and a 2-D output array where each iteration is storing the result for a particular time. Each time is independent.
With a gpuArray, you can run elementwise operations in parallel by structuring your algorithm in terms of MATLAB's arrayfun. Effectively, this implicitly loops over each element of your arrays and applies the body of a MATLAB function to each element. The doc is: here.
There's a simple demo: here.
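A small sketch of the arrayfun pattern, assuming Parallel Computing Toolbox is installed. The element function and the variable names are hypothetical, and the CPU branch exists only so the example still runs on a machine without a supported GPU.

```matlab
% Hypothetical per-element operation; it captures no workspace variables.
elemFcn = @(a, b) sqrt(a .* a + b .* b);

x = rand(1000, 1);
y = rand(1000, 1);

if gpuDeviceCount > 0
    % With gpuArray inputs, arrayfun applies elemFcn to every element on the GPU.
    r = gather(arrayfun(elemFcn, gpuArray(x), gpuArray(y)));
else
    % Fallback without a GPU: same result, no speedup.
    r = arrayfun(elemFcn, x, y);
end
```

For the 3-D data described in the question, the per-time-step computation would have to be expressible as a scalar function of each element for this approach to apply directly.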