What is happening when Matlab is starting a "parallel pool"? - matlab

Running parallel CPU processes in Matlab starts with the command
parpool()
According to the documentation, that function:
[creates] a special job on a pool of workers, and [connects] the MATLAB client to the parallel pool.
This function usually takes a bit of time to execute, on the order of 30 seconds. But in other multi-CPU paradigms like OpenMP, parallel execution seems totally transparent -- I've never noticed any behavior analogous to what MATLAB does (granted, I'm not very experienced with parallel programming).
So, what is actually happening between the time that parpool() is called and when it finishes executing? What takes so long?

Parallel Computing Toolbox enables you to run MATLAB code in parallel using several different paradigms (e.g. jobs and tasks, parfor, spmd, parfeval, batch processing), and to run it either locally (parallelised across cores in your local machine) or remotely (parallelised across machines in a cluster - either one that you own, or one in the cloud).
In any of these cases, the code is run on MATLAB workers, which are basically copies of MATLAB without an interactive desktop.
If you're intending to run on a remote cluster, it's likely that these workers will already be started up and ready to run code. If you're intending to run locally, it's possible that you might already have started workers, but maybe you haven't.
Some of the constructs above (e.g. jobs and tasks, batch processing) just run the thing you asked for, and the workers then go back to being available for other things (possibly from a different user).
But some of the constructs (e.g. parfor, spmd) require that the workers on which you intend to run are reserved for you for a period of time - partly because they might lie idle for some time and you don't want them taken over by another user, and partly because (unlike with jobs and tasks, or batch processing) they might need to communicate with each other. This is called creating a worker pool.
When you run parpool, you're telling MATLAB that you want to reserve a pool of workers for yourself, because you're intending to run a construct that requires a worker pool. You can specify as an input argument a cluster profile, which would tell it whether you want to run on a remote cluster or locally.
If you're running on a cluster, parpool will send a message to the cluster to reserve some of its (already running) workers for your use.
If you're running locally, parpool will ensure that there are enough workers running locally, and then connect them into a pool for you.
The thing that takes 30 seconds is the part where it needs to start up workers, if they're not already running. On Windows, if you watch Task Manager while running parpool, you'll see additional copies of MATLAB popping up over those 30 seconds as the workers start (they're actually not MATLAB itself, they're MATLAB workers - you can distinguish them as they'll be using less memory with no desktop).
To compare what MATLAB is doing to OpenMP, note that these MATLAB workers are separate processes, whereas OpenMP creates multiple threads within an existing process.
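The same worker-pool pattern exists in other ecosystems. As a rough analogue (a sketch of the general idea, not of MATLAB's actual mechanism), Python's `multiprocessing.Pool` also pays its cost up front when the worker processes are started, after which dispatching work to the pool is cheap:

```python
from multiprocessing import Pool

def square(x):
    # Work that each pool worker executes in its own process.
    return x * x

if __name__ == "__main__":
    # Creating the pool starts the worker processes -- this is the
    # slow, parpool-like step. Reusing the pool afterwards is cheap.
    with Pool(processes=2) as pool:
        results = pool.map(square, [1, 2, 3])
    print(results)  # [1, 4, 9]
```

As with parpool, the startup cost is paid once; subsequent `pool.map` calls reuse the already-running workers.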

To be honest, I do not think we will ever know exactly what MATLAB does.
However, to give you some answer: MATLAB basically opens additional instances of itself to execute the code on. To do this, it first needs to check where the instances should be opened (you can change the cluster from local to whatever else you have access to, e.g. an Amazon EC2 cluster). Once the new instances have been opened, MATLAB sets up the connection from your main session to those instances.
Notes:
1) It is not recommended to call parpool inside a function or script, because it will throw an error if a parallel pool is already open. Using a parallel construct such as parfor will automatically open a pool if none exists.
2) parpool only has to be executed once (until the pool is shut down), i.e. if you run the code again, the instances are already open.
3) If you want to avoid the overhead in your code, you can create a file called startup.m on the MATLAB search path containing the command parpool; this will automatically start a parallel pool when MATLAB starts.
4) Vectorizing your code lets MATLAB parallelise many operations intrinsically, without this overhead.

A few further details to follow up on @Nicky's answer. Creating a parallel pool involves:
Submitting a communicating job to the appropriate cluster
This recruits MATLAB worker processes. These processes might be already running (in the case of MJS), or they might need to be started (in the case of 'local', and all other cluster types).
MPI communication is set up among the workers to support spmd (unless you specify "'SpmdEnabled', false" while starting the pool - however, this stage isn't usually a performance bottleneck).
The MATLAB workers are then connected up to the client so they can do its bidding.
The difference in overhead between parpool and something like OpenMP is because parpool generally launches additional MATLAB processes - a relatively heavy-weight operation - whereas OpenMP simply creates additional threads within a single process, which is comparatively light-weight. Also, as @Nicky points out, MATLAB can intrinsically multi-thread some/most vectorised operations - parpool is useful for cases where this doesn't happen, or where you have a real multi-node cluster available to run on.
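The practical consequence of the process-vs-thread distinction can be sketched in Python (illustrating the general principle, not MATLAB internals): a thread shares its parent's memory, while a separate process works on its own copy, just as a MATLAB worker does:

```python
import threading
import multiprocessing

counter = [0]

def bump():
    counter[0] += 1

if __name__ == "__main__":
    # A thread shares the parent's address space: its change is visible.
    t = threading.Thread(target=bump)
    t.start(); t.join()
    print(counter[0])  # 1

    # A separate process (like a MATLAB worker) gets its own copy of the
    # data: its increment happens in the child and is NOT visible here.
    p = multiprocessing.Process(target=bump)
    p.start(); p.join()
    print(counter[0])  # still 1
```

This is also why data must be explicitly transferred to and from MATLAB workers, whereas OpenMP threads can read the same arrays directly.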

Related

Multiple inputs running in parallel in the same application vs. starting a new application for each input

As I can see when running the application in debug mode, there are 4 different threads on which Drools is running. When I start the application with two threads running two inputs in parallel, the same 4 threads remain (they don't become 8, for example).
My question is: is there a benefit to running the inputs in a separate application (and with this starting a separate Drools application), or is all of this covered, so that we end up with the same results using one application and starting the inputs in parallel?
Advantages of using 1 JVM:
heap pooling
JIT compilation only once for Java code
JIT compilation only once for DRL code, if and only if you reuse your kiebases
Ability to implement queueing and round-robin solving. See the SolverManager issue and runnablePartThreadLimit's purpose in Partitioned Search (which will also be supported on SolverManager at some point).
Advantages of using multiple JVMs:
The garbage collector usually works more efficiently
Ability to link OS process IDs (especially on Linux) to specific cores. This allows you to do round-robin solving through OS configuration.
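The core-pinning point above is usually done with `taskset` on Linux; programmatically, the same syscalls are exposed in Python (a hedged sketch - this API is Linux-only, and the choice of core is illustrative):

```python
import os

# Linux-only: os.sched_getaffinity/os.sched_setaffinity wrap the
# sched_getaffinity/sched_setaffinity syscalls used by taskset.
if hasattr(os, "sched_setaffinity"):
    allowed = sorted(os.sched_getaffinity(0))  # CPUs this process may use
    os.sched_setaffinity(0, {allowed[0]})      # pin to the first allowed core
    print(os.sched_getaffinity(0))             # now a single-CPU set
```

Pinning one JVM (or solver process) per core in this way gives each one a dedicated core without the OS scheduler migrating it around.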

Why do my results become un-reproducible when run from qsub?

I'm running MATLAB on a cluster. When I run my .m script from an interactive MATLAB session on the cluster, my results are reproducible. But when I run the same script from a qsub command, as part of an array job away from my watchful eye, I get believable but unreproducible results. The .m files are doing exactly the same thing, including saving the results as .mat files.
Does anyone know why the scripts give reproducible results run one way, but become unreproducible run the other way?
Is this only a problem with reproducibility, or is it indicative of inaccurate results?
%%%%%
Thanks to spuder for a helpful answer. Just in case anyone stumbles upon this and is interested here is some further information.
If you use more than one thread in MATLAB jobs, this may steal resources from other jobs, which plays havoc with the results. So you have 2 options:
1. Request exclusive access to a node. The cluster I am using does not currently allow parallel array jobs, so doing this was very wasteful for me - I took a whole node but used it serially.
2. Ask MATLAB to run with a single computational thread (the -singleCompThread flag). This may make your script take longer to complete, but it gets jobs through the queue faster.
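For option 2, a minimal PBS submission sketch might look like this (the script name and resource request are illustrative - adjust for your cluster):

```shell
#PBS -l nodes=1:ppn=1
# Run MATLAB non-interactively with a single computational thread,
# so the job never competes for cores it was not allocated.
matlab -nodisplay -singleCompThread -r "my_script; exit"
```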
There are a lot of variables at play. Ruling out transient issues such as network performance, and load here are a couple of possible explanations:
You are getting assigned a different batch of nodes when you run an interactive job from when you use qsub.
I've seen some sites that assign a policy of 'exclusive' to the nodes that run interactive jobs and 'shared' to nodes that run queued 'qsub' jobs. If that is the case, then you will almost always see better performance on the exclusive nodes.
Another answer might be that the interactive jobs are assigned to nodes that have less network congestion.
Additionally, if you are requesting multiple nodes, and you happen to land on nodes that traverse multiple hops, then you could be seeing significant network slowdowns. The solution would be for the cluster administrator to setup nodesets.
Are you using multiple nodes for the job? How are you requesting the resources?

Celery: limit memory usage (large number of django installations)

We have a setup with a large number of separate Django installations on a single box. Each of these has its own code base and Linux user.
We're using Celery for some asynchronous tasks.
Each of the installations has its own setup for Celery, i.e. its own celeryd and worker.
The amount of asynchronous tasks per installation is limited, and not time-critical.
When a worker starts, it takes about 30 MB of memory. When it has run for a while, this amount may grow (presumably due to fragmentation).
The last bullet point has already been (somewhat) solved by setting --maxtasksperchild to a low number (say 10). This ensures a restart after 10 tasks, after which the memory at least goes back to 30 MB.
However, each celeryd is still taking up a lot of memory, since the minimum number of workers appears to be 1 rather than 0. I also imagine that running python manage.py celery worker does not lead to the smallest possible footprint for the celeryd, since the full stack is loaded even if the only thing that happens is checking for tasks.
In an ideal setup, I'd like to see the following: a process with a very small memory footprint (100 kB or so) watches the queue for new tasks. When such a task arises, it spins up the (heavy) full Django stack in a separate process, and when the worker is done, the heavy process is spun down.
Is such a setup configurable using (somewhat) standard Celery? If not, what points of extension are there?
We're (currently) using Celery 3.0.17 and the associated django-celery.
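For reference, the knobs described above look roughly like this on the Celery 3.x CLI (flag spellings per that version; the log level is illustrative):

```shell
# One worker process, recycled after every 10 tasks to cap
# memory growth from fragmentation or leaks.
python manage.py celery worker \
    --concurrency=1 \
    --maxtasksperchild=10 \
    --loglevel=INFO
```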
Just to make sure I understand - you have a lot of different django codebases, each with their own celery, and they take up too much memory when running on a single box simultaneously, all waiting for a celery job to come down the pipe? How many celery instances are we talking about here?
In my experience, you're using django-celery in a very different way than it was designed for - all of your different Django projects should be condensed into a few (or a single) project(s), composed of multiple applications. Then you set up a small number of queues to field Celery tasks from the different apps - this way, you only have as many dormant Celery threads taking up 30 MB as you have queues, and a single queue can handle multiple tasks (from multiple apps if you want). The memory issue should go away.
To reiterate - you only need one celeryd, driving multiple workers. This way your bottleneck is job concurrency, not dormant memory needs.
Why do you need so many django installations? Please let me know if I'm missing something, or if you need clarification.

Run time memory of perl script

I have a Perl script which is killed by an automated job whenever a high-priority process comes in, because my script runs ~300 parallel jobs for downloading data and consumes a lot of memory. I want to figure out how much memory it takes at run time, so that I can ask for more memory before scheduling the script; alternatively, if some tool can show me which portion of my code takes up the most memory, I can optimize the code.
Regarding OP's comment on the question, if you want to minimize memory use, definitely collect and append the data one row/line at a time. If you collect all of it into a variable at once, that means you need to have all of it in memory at once.
Regarding the question itself, you may want to look into whether it's possible to have the Perl code just run once (rather than running 300 separate instances) and then fork to create your individual worker processes. When you fork, the child processes will share memory with the parent much more efficiently than is possible for unrelated processes, so you will, e.g., only need to have one copy of the Perl binary in memory rather than 300 copies.
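The run-once-then-fork pattern can be sketched like this (shown in Python for brevity; Perl's fork works the same way, and the worker count and work are placeholders):

```python
import os

if __name__ == "__main__" and hasattr(os, "fork"):
    pids = []
    for job_id in range(4):       # 4 workers instead of 300 for the sketch
        pid = os.fork()           # POSIX-only; child shares pages copy-on-write
        if pid == 0:
            # Child: do the download/work for this job here, then exit
            # without running the parent's cleanup handlers.
            os._exit(0)
        pids.append(pid)
    for pid in pids:              # parent waits for all workers
        os.waitpid(pid, 0)
    print("all workers done")
```

Because the children are forked from one parent, the interpreter and any data loaded before the fork exist in memory once, shared copy-on-write, instead of 300 times.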

Perl Threads faster than Sequentially processing?

Just wanted to ask whether it's true that parallel processing is faster than sequential processing.
I've always thought that parallel processing is faster, so I did an experiment.
I benchmarked my scripts and found that, after repeatedly calling
sub add {
    for ( my $x = 0; $x <= 200000; $x++ ) {
        $data[$x] = $x / ( $x + 2 );
    }
}
threading seems to be slower by about 0.5 CPU seconds on average. Is this normal, or is it really true that sequential processing is faster?
Whether parallel vs. sequential processing is better is highly task-dependent and you've already done the right thing: You benchmarked both and determined for your task (the one you benchmarked, not necessarily the one you actually want to do) which one is faster.
As a general rule, on a single processor, sequential processing tends to be better for tasks which are CPU-bound, because if you have two tasks each needing five seconds of CPU time to complete, then you'll need ten seconds of CPU time regardless of whether you do them sequentially or in parallel. Setting up multiple threads/processes will, therefore, provide no benefit, but it will create additional task-switching overhead while also preventing you from having any results until all results are available.
CPU-bound tasks on a multi-processor system tend to do better when run in parallel, provided that they can run independently of each other. If not, or if you're using a language/threading model/IPC model/etc. which forces all tasks to run on the same processor, then see "on a single processor" above.
Parallel processing is generally better for tasks which are I/O-bound, regardless of the number of processors available, because CPUs are fast and I/O is slow, so working in parallel allows one task to process its data while the other is waiting for I/O operations to complete. (This is why make -j2 tends to be significantly faster than a plain make, even on single-processor machines.)
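A small sketch of the I/O-bound case (in Python rather than Perl for brevity; `time.sleep` stands in for a blocking read or download):

```python
import threading
import time

def fake_io():
    time.sleep(0.2)   # stands in for a blocking read/download

start = time.monotonic()
threads = [threading.Thread(target=fake_io) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - start
# The four 0.2 s waits overlap, so this finishes in roughly 0.2 s
# rather than the 0.8 s that sequential execution would take.
print(f"{elapsed:.2f}s")
```

Replace `fake_io` with CPU-bound work and the advantage disappears on a single core, exactly as described above.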
But, again, these are all generalities and all have cases where they'll be incorrect. Only benchmarking will reveal the truth with certainty.
Perl threads are extremely bad. You are better off in every case forking several processes.
When you create a new thread in perl, it does the following:
Make a copy - yes, a real copy - of every single perl data structure in scope, including those belonging to modules you didn't write
Start up what is almost a new, independent instance of perl in a new OS thread
If you then want to share anything (as it has now copied everything), you have to use the share function from the threads::shared module. This is incredibly sucky, as it replaces your variable with some tie() nonsense, which adds much-too-fine-grained locking around it to prevent concurrent access. Accessing a shared variable then causes a massive amount of implicit locking, and is incredibly slow.
So in short, Perl threads:
Take a long time to start
Waste loads of memory
Cannot share data efficiently anyway.
You are much better off with fork(), which does not copy every variable (the kernel does copy-on-write) unless you're on Windows.
There's no reason to assume that parallel processing will be faster on a single-core CPU.
Consider this example (the original answer illustrated it with a PNG timing diagram):
The red and blue lines at the top represent two tasks running sequentially on a single core.
The alternating red and blue lines at the bottom represent two tasks running in parallel on a single core.