parfor extremely slow with system call - matlab

I have a MATLAB loop that looks something like this:
S = zeros(20, 1);
for i = 1:1:20
system(command)
S(i) = fetch_results_of_command_from_file()
end
where command is a string for a system call that runs a Python command, fetch... gets the results of the system call, and the result is stored in the return vector S.
Some details: 'command' calls a Python script. That script reads some data from a file, performs an optimization procedure, then writes the results back to a file. The Python script is small (less than 1,000 lines) and sits on a shared disk. All commands call this script, but use different input data: one input file per worker. The optimization itself is single-threaded. The cores all reside on a local machine which has over 40 CPUs, 2 cores each.
Each iteration of this loop runs in about 2 minutes, and everything is fine, but slow.
I am on a local machine with over 20 cores, and so I want to parallelize my code as such:
S = zeros(20, 1);
parfor i = 1:1:20
system(command)
S(i) = fetch_results_of_command_from_file()
end
It seems each system call deploys to a different processor just fine. I would expect this loop to run in roughly 2 minutes plus a little overhead for the parfor loop, since it is embarrassingly parallel.
Unfortunately, no system call finishes even after more than 10 minutes (it never even gets to the fetch command). Somehow, parfor is slowing every command down, as though they were being called in serial. I do have the parallel toolbox installed, so that shouldn't be the problem. What could be going on?
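One way to narrow this down is to time each system call separately inside the parfor loop and keep its exit status. The following is only a minimal sketch: the `echo hello` command is a hypothetical stand-in for the real Python call, and `nRuns`, `elapsed` and `status` are names introduced here for illustration.

```matlab
nRuns   = 4;                 % small number of test iterations (assumption)
command = 'echo hello';      % hypothetical stand-in for the real Python command
elapsed = zeros(nRuns, 1);   % wall-clock seconds per system call
status  = zeros(nRuns, 1);   % exit status per system call (0 means success)

parfor i = 1:nRuns
    t = tic;
    [st, ~] = system(command);   % capture the status, discard the printed output
    status(i)  = st;
    elapsed(i) = toc(t);
end

disp([status elapsed]);
```

If the per-call times grow with the number of workers, the slowdown comes from the external processes competing with each other rather than from parfor itself.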

Related

Wildly different iteration times in parfor loop

I have an embarrassingly parallel Monte Carlo code for which I am using parfor. Each parfor loop iteration does around 100 seconds of calculation into a temporary array, before adding that array to the master array, which is a reduction variable. I'm timing the iterations using a sliced variable.
Like this:
master_array=zeros(1000,1000,50);
iteration_times=zeros(500,1);
parfor i=1:500
iteration_start=datetime;
temp_array=zeros(1000,1000,50);
% ... do about 100 seconds of work building temp_array ...
master_array=master_array+temp_array;
iteration_times(i)=seconds(datetime-iteration_start);
end
Sometimes, but not always, the code takes far longer to run. When it does, some loop iterations seem to take far longer than they should, as shown in the graph of iteration number vs runtime. (This is not caused by variations in the amount of 'work done': the variation disappears without the parfor.) This was running with a pool of 4 workers.
The arrays are large, and can be 100MB or more. I don't seem to be running out of memory: the hard drive is not being used, and I still get this problem with smaller arrays.
'Slow' iteration times seem oddly quantised, as if some loop iterations are waiting for others to complete before completing themselves.
Any ideas what I can check?

Parallel computing data extraction from a SQL database in Matlab

In my current setup I have a for loop in which I extract different types of data from a SQL database hosted on Amazon EC2. This extraction is done in the function extractData(variableName). After that, the data is parsed and stored as a mat file in parsestoreData(data):
variables = {'A','B','C','D','E'}
for i = 1:length(variables)
data = extractData(variables{i});
parsestoreData(data);
end
I would like to parallelize this extraction and parsing of the data to speed up the process. I think I could do this by using a parfor instead of the for in the above example.
However, I am worried that the extraction will not improve, as the SQL database will slow down when multiple requests are made to it at the same time.
I am therefore wondering whether Matlab can handle this issue in a smart way when parallelizing?
The workers in a parallel pool running parfor are basically full MATLAB processes without a UI, and they default to running in "single computational thread" mode. I'm not sure whether parfor will benefit you in this case: the parfor loop simply arranges for the MATLAB workers to execute the iterations of your loop in parallel. You can estimate for yourself how well your problem will parallelise by launching multiple full desktop MATLABs and setting them off running your problem simultaneously. I would run something like this:
maxNumCompThreads(1);
while true
t = tic();
data = extractData(...);
parsestoreData(data);
toc(t)
end
and then check how the times reported by toc vary as the number of MATLAB clients varies. If the times remain constant, you could reasonably expect parfor to give you benefit (because it means the body can be parallelised effectively). If, however, the times increase significantly as you run more MATLAB clients, then it's almost certain that parfor would experience the same (relative) slow-down.

Why is the Matlab parfor scheduler leaving workers idle?

I have a fairly long-running parfor loop (let's say 100,000 iterations where each iteration takes about a minute) that I'm running with 36 cores. I've noticed that near the end of the job, a large number of the cores go idle while a few finish what I think has to be multiple iterations per worker. This leads to a lot of wasted computing time waiting for one worker to finish several jobs while the others are sitting idle.
The following script shows the issue (using the File Exchange utility Par.m):
% Set up parallel pool
nLoop = 480;
p1 = gcp;
% Run a loop
pclock = Par(nLoop);
parfor iLoop = 1:nLoop
Par.tic;
pause(0.1);
pclock(iLoop) = Par.toc;
end
stop(pclock);
plot(pclock);
% Process the timing info:
runs = [[pclock.Worker]' [pclock.ItStart]' [pclock.ItStop]'];
nRuns = arrayfun(@(x) sum(runs(:,1) == x), 1:max(runs(:,1)));
starts = nan(max(nRuns), p1.NumWorkers);
ends = nan(max(nRuns), p1.NumWorkers);
for iS = 1:p1.NumWorkers
starts(1:nRuns(iS), iS) = sort(runs(runs(:, 1) == iS, 2));
ends(1:nRuns(iS), iS) = sort(runs(runs(:, 1) == iS, 3));
end
firstWorkerStops = min(max(ends));
badRuns = starts > firstWorkerStops;
nBadRuns = sum(sum(badRuns)) - (p1.NumWorkers-1);
fprintf('At least %d (%3.1f%%) iterations run inefficiently.\n', ...
nBadRuns, nBadRuns/nLoop * 100);
The way I'm looking at it, every worker should be busy until the queue is empty, after which all workers sit idle. But here it looks like that's not happening: with 480 iterations, I'm getting between 6 and 20 iterations that start on a worker after a different worker has been sitting idle for a full cycle. This number appears to scale linearly with the number of loop iterations, coming in near 2% of the total. With limited testing this appears to be consistent across Matlab 2016b and 2014b.
Is there any reason that this is the expected behavior or is this simply a poorly written scheduler in the parfor implementation? If so, how can I structure this so I'm not sitting around with idle workers so long?
I think this explains what you are observing.
If there are more iterations than workers, some workers perform more than one loop iteration; in this case, a worker might receive multiple iterations at once to reduce communication time. (From "When to Use parfor")
Towards the end of a loop, two workers may finish their iterations at around the same time. If there is only one group of iterations left to be assigned, then one worker will get them all, and the other will remain idle. It sounds like this is expected behavior, probably because the underlying implementation tries to reduce the communication cost associated with a worker pool. I've looked around the web and the Matlab settings, and there doesn't seem to be a way to adjust the communication strategy.
The parfor scheduler attempts to load-balance for loops where the iterations do not take a uniform amount of time. Unfortunately, as you observe, this can lead to workers becoming idle at the end of the loop. With parfor, you don't have control over the division of work; but you could use parfeval to divide your work up into even chunks - that might give you better utilisation. Or, you could even use spmd in conjunction with a for-drange loop.
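The parfeval suggestion could be sketched roughly as follows. This is only an illustration under stated assumptions: the loop is split into one chunk per worker, and the anonymous function `work` (squaring each index) is a hypothetical placeholder for the real per-iteration computation. The names `edges`, `futures` and `results` are introduced here.

```matlab
nLoop   = 480;
pool    = gcp;                       % start or reuse a parallel pool
nChunks = pool.NumWorkers;           % assumption: one even chunk per worker
edges   = round(linspace(0, nLoop, nChunks + 1));

% Hypothetical placeholder for the real work: returns one value per index.
work = @(idx) arrayfun(@(i) i.^2, idx);

% Submit one asynchronous task per chunk.
futures(1:nChunks) = parallel.FevalFuture;
for c = 1:nChunks
    idx = (edges(c) + 1):edges(c + 1);
    futures(c) = parfeval(pool, work, 1, idx);
end

% Collect the chunks in whatever order they finish.
results = zeros(1, nLoop);
for c = 1:nChunks
    [k, out] = fetchNext(futures);   % k is the position of the finished future
    results((edges(k) + 1):edges(k + 1)) = out;
end
```

Even chunks only balance well when iterations take similar amounts of time; with uneven iterations, using more and smaller chunks than workers is a possible compromise.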

parfor problems: only one process is running

I have a parfor loop in Matlab. While it is running, only one process is using the CPU (top and the system monitor show the same CPU usage, see attached screenshot), and the parfor doesn't run any faster. Why???
ubuntu 12.04 LTS, 64bits matlab 2012b
pools = matlabpool('size');
if pools ~= 10
if pools > 0
matlabpool('close');
end
matlabpool local 10; %10+ the one I am using = 11 matlab process in system monitor
end
parfor i = 1:num_utt
dojob();
end
Thanks, Marcin & Edric,
I ran a small test case as you suggested, and then noticed that the problem is caused by the code inside the loop accessing data from outside the loop. In http://www.mathworks.com/help/distcomp/advanced-topics.html these are called broadcast variables.
At the start of a parfor-loop, the values of any broadcast variables
are sent to all workers. Although this type of variable can be useful
or even essential, broadcast variables that are large can cause a lot
of communication between client and workers. In some cases it might be
more efficient to use temporary variables for this purpose, creating
and assigning them inside the loop.
In my case the broadcast variable holds a lot of data, so passing it to the workers is a problem.
After I removed some of the data, the parfor loop worked fine.
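As an alternative to removing data, it is sometimes possible to index the large variable by the loop variable so that it is sliced instead of broadcast. This is a minimal sketch of that swapped-in technique, not the fix used above; `bigData`, `numUtt` and `rowSums` are hypothetical names, and the row sum stands in for the real job.

```matlab
numUtt  = 8;                      % hypothetical number of loop iterations
bigData = rand(numUtt, 1000);     % hypothetical large array, one row per iteration
rowSums = zeros(numUtt, 1);

parfor i = 1:numUtt
    % bigData is indexed only as bigData(i, :), so it is a sliced variable:
    % each worker receives just the rows for its own iterations.
    row = bigData(i, :);
    rowSums(i) = sum(row);        % placeholder for the real per-iteration job
end
```

Slicing only applies when each iteration needs its own part of the data; a variable that every iteration reads in full is still broadcast.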

Can I run a script on multiple MATLAB sessions instead of parallelizing the script?

I have a script which solves a system of differential equations for many parameters in a for loop. (The iterations are completely independent, but at the end of each iteration a large matrix (mat) is modified according to the results of the computation.) Here is the code (B is a matrix containing the parameters):
mat=zeros(20000,1);
for n=1:20000
prop=B(n,:); % B is a (20000 * 2 ) matrix that contains U and V parameters
U=prop(1);
V=prop(2);
options=odeset('RelTol',1e-6,'AbsTol',1e-20);
[T,X]=ode45(@acceleration,tspan,x0,options);
rad=X(:,1);
if max(rad)<radius % radius is a constant
mat(n)=1;
end
end
function xprime=acceleration(T,X)
.
.
.
end
First I tried to use parfor, but because the acceleration function (the ode45 input) was defined as an inline function (to achieve better performance), I couldn't do that.
Can I open 4 MATLAB sessions (my CPU has 4 cores) and run the code separately in each session, instead of modifying the code to implement acceleration as a separate function and then using parfor? Does it give 4X the performance of running in one session? (Or does it give the same performance as the parallelized code? In parallel code I can't define inline functions.)
(on Windows)
If you're prepared to do the work of separating out the problem to run separately in 4 sessions, then reassemble the results, sure you can do that. In my experience (on Windows) it actually runs faster to run code in four separate sessions than to have a parfor loop with 4 workers. It is not quite as fast as 4x the performance of a single session, because the operating system will have other work to do. For example, if you have no other processor-heavy applications running, the OS itself might take up 25% of one core, leaving you with maybe 3.75x the performance of a single session. However, this assumes you have enough memory for that not to be the limiting factor.
If you wanted to do this regularly you might need to create some file-based signalling/data passing system.
This is obviously not as elegant as a parfor, but is workable for your situation, or if you can't afford the license fee for the parallel toolbox.
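A file-based data passing scheme could look roughly like the sketch below. It rests on several assumptions: `sessionId` is set by hand to a different value (1 to 4) in each MATLAB session, the file names `part_1.mat` etc. are hypothetical, and `mod(n, 2)` is only a placeholder for the real ODE computation.

```matlab
sessionId = 1;        % assumption: set by hand to 1, 2, 3 or 4 in each session
nSessions = 4;
N         = 20000;

% Each session handles every nSessions-th iteration.
idx  = sessionId:nSessions:N;
part = zeros(numel(idx), 1);
for k = 1:numel(idx)
    n = idx(k);
    part(k) = mod(n, 2);          % placeholder for the real per-iteration result
end
save(sprintf('part_%d.mat', sessionId), 'idx', 'part');

% Reassemble once every session has written its file (run in any one session).
files = arrayfun(@(s) sprintf('part_%d.mat', s), 1:nSessions, ...
                 'UniformOutput', false);
if all(cellfun(@(f) exist(f, 'file') == 2, files))
    mat = zeros(N, 1);
    for s = 1:nSessions
        d = load(files{s});
        mat(d.idx) = d.part;      % put each session's results back in place
    end
end
```

The existence of the result file doubles as the "finished" signal, so no session needs to talk to another directly.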