Why does TreeBagger in Matlab 2014a/b use only a few workers from a parallel pool?

I'm using the TreeBagger class provided by Matlab (R2014a and R2014b), in conjunction with the distributed computing toolbox. I have a local cluster running, with 30 workers, on a Windows 7 machine with 40 cores.
I call the TreeBagger constructor to generate a regression forest (an ensemble containing 32 trees), passing an options structure with 'UseParallel' set to 'always'.
However, TreeBagger seems to only make use of 8 or so workers, out of the 30 available (judging by CPU usage per process, observed using the Task Manager). When I try to test the pool with a simple parfor loop:
parfor i = 1:30
    a = fft(rand(20000));
end
Then all 30 workers are engaged.
My question is:
(How) can I force TreeBagger to use all available resources?

Based on the documentation for the TreeBagger class it would appear that the operations required are quite memory intensive. Without knowing more about the internal scheduling system used by Matlab it seems likely that distributing the workload across fewer workers with more memory for each worker is what the scheduler believes will be the most efficient way to solve the problem.
The number of workers used or available may also depend on the number of physical cores on the system (which is different from the number of hyper-threaded cores), as well as on the resources Matlab is allowed to consume.
Splitting memory intensive tasks across a less than maximum number of workers is a common technique in HPC for some types of problems.
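A rough way to check how many distinct workers a loop actually reaches, instead of reading per-process CPU usage in Task Manager, is to record the task ID inside the loop. This is only a sketch: it assumes a pool is already open (or can be opened automatically), and `nIter` and `workerId` are names made up for the example. It exercises a plain parfor loop, not TreeBagger's internal scheduling, which is not described here.

```matlab
% Sketch: count how many distinct workers take part in a parfor loop.
% nIter matches the 32 trees from the question; the workload is a stand-in.
nIter = 32;
workerId = zeros(1, nIter);

parfor i = 1:nIter
    t = getCurrentTask();        % empty when the body runs on the client
    if isempty(t)
        workerId(i) = 0;         % 0 marks "ran on the client, not a worker"
    else
        workerId(i) = t.ID;      % ID of the task hosted by this worker
    end
    a = fft(rand(2000));         % placeholder work so iterations take time
end

fprintf('Iterations ran on %d distinct worker(s).\n', numel(unique(workerId)));
```

If the count is close to the pool size here but CPU usage stays low during TreeBagger training, that would point at how the ensemble's work is split rather than at the pool itself.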


Number of workers in Matlab's parfor

I am running a for loop using MATLAB's parfor function. My CPU's specs are
I set the preferred number of workers to 24. However, MATLAB sets this number to 6. Is the number of workers bounded by the number of cores or by (number of cores) x (number of processors) = 6 x 12?
Matlab prefers to limit the number of workers to the number of cores (six in your case).
Your CPU (Intel i7-9750H) has hyperthreading, i.e. you can run multiple (here 2) threads per core. However, this is of no use if you want to run them under full load, because then there are simply no resources left to switch to a different task (which is what the additional threads effectively are).
See the documentation:
"Restricting to one worker per physical core ensures that each worker has exclusive access to a floating point unit, which generally optimizes performance of computational code. If your code is not computationally intensive, for example, it is input/output (I/O) intensive, then consider using up to two workers per physical core. Running too many workers on too few resources may impact performance and stability of your machine."
Note that Matlab needs to stream data to every worker in order to run the distributed code. This is a kind of initialization overhead, and it is the reason why you won't be able to cut the runtime in half by doubling the number of cores/workers. It also explains why there is no benefit for Matlab in using hyperthreading: it would only increase the initial streaming effort without any speed-up. In fact, the core would probably force Matlab to save intermediate results and switch to the other task from time to time, which is the same kind of task as before.
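To see the limit MATLAB has chosen, and to raise it for one session, the local cluster object can be inspected. The sketch below assumes the default profile is named 'local' and that Parallel Computing Toolbox is installed; `requested` is a made-up variable holding the 24 workers from the question. Whether oversubscribing the cores helps depends on the workload, as the quoted documentation warns.

```matlab
% Sketch: inspect and override the worker limit of the local cluster.
c = parcluster('local');                 % assumes the default profile name
fprintf('Default worker limit: %d\n', c.NumWorkers);

requested = 24;                          % value taken from the question
c.NumWorkers = requested;                % changes this cluster object only
% saveProfile(c);                        % uncomment to make the change persistent

% Opening the pool is left commented out because 24 workers on 6 cores
% may slow the machine down considerably.
% pool = parpool(c, requested);
fprintf('Worker limit now: %d\n', c.NumWorkers);
```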

Matlab use of vCPUs of VM for parallel loop

I have a big nested calculation in MATLAB and changed the code to use parallel processing (PARFOR). My CPU only has 4 cores, so I thought I could use an Azure VM with 16 cores to provide more workers and reduce computing time. So the question is: do the vCPUs of the VM count as additional workers for the pool, or will the code still run with only 4 workers?
thanks a lot for the help!
The maximum number of local workers on a machine depends on the version and license of your Matlab installation.
Please see the marked answer to "Matlabpool Maximum Number of Local Workers on one computer - Parallel Computing" on the MathWorks forum.

Difference between MATLAB parallel computing terminologies

I want to know the differences between
1. labs
2. workers
3. cores
4. processes
Is it just semantics, or are they all different?
labs and workers are MathWorks terminologies, and they mean roughly the same thing.
A lab or a worker is essentially an instance of MATLAB (without a front-end). You run several of them, and you can run them either on your own machine (requires only Parallel Computing Toolbox) or remotely on a cluster (requires Distributed Computing Server). When you execute parallel code (such as a parfor loop, an spmd block, or a parfeval command), the code is executed in parallel by the workers, rather than by your main MATLAB.
Parallel Computing Toolbox has changed and developed its functionality quite a lot over recent releases, and has also changed and developed the terminologies it uses to describe the way it works. At some point it was convenient to refer to them as labs when running an spmd block, but workers when running a parfor loop, or working on jobs and tasks. I believe they are moving now toward always calling them workers (although there's a legacy in the commands labSend, labReceive, labBroadcast, labindex and numlabs).
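As a small illustration of the "lab" vocabulary, the sketch below prints each worker's index inside an spmd block using the legacy names listed above. It assumes a pool is available; `myIndex` is a name invented for the example.

```matlab
% Sketch: each worker ("lab") reports its own index and the total count.
spmd
    % labindex is this worker's index; numlabs is the number of workers
    % taking part in the spmd block.
    fprintf('Worker %d of %d\n', labindex, numlabs);
    myIndex = labindex;   % kept so the client can inspect it afterwards
end
```

The same workers would also serve a parfor loop; only the name used for them differs between the two constructs.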
cores and processes are different, and are not themselves anything to do with MATLAB.
A core is a physical part of your processor - you might have a dual-core or quad-core processor in your desktop computer, or you might have access to a really big computer with many more than that. By having multiple cores, your processor can do multiple things at once.
A process is (roughly) a program that your operating system is running. Although the OS runs multiple programs simultaneously, it typically does this by interleaving operations from each process. But if you have access to a multiple-core machine, those operations can be done in parallel.
So you would typically want to tell MATLAB to start one worker for each of the cores you have on your machine. Each of those workers will be run as a process by the OS, and will end up being run one worker per core in parallel.
The above is quite simplified, but I hope gives a roughly accurate picture.
Edit: moved description of threads from a comment to the answer.
Threads are something different again. Threads are also not in themselves anything to do with MATLAB.
Let's go back to processes for a moment. One thing I didn't mention above is that the OS allocates each process a specific block of memory which other processes shouldn't be able to touch, so that it's difficult for them to interact with each other and mess things up.
A thread is like a process within a process - it's a stream of operations that the process runs. Typically, operations from each thread would be interleaved, but if you have multiple cores, they can also be parallelized across the cores.
However, unlike processes, they all share a memory block, which is OK because they're all managed by the same program so it should matter less if they're allowed to interact.
Regular MATLAB automatically uses multiple threads to parallelize many built-in operations (such as matrix multiplication, svd, eig, linear algebra etc) - that's without you doing anything, and whether or not you have Parallel Computing Toolbox.
However, MATLAB workers are each run as a single process with a single thread, so you have full control over how to parallelize.
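The difference between the multithreaded client and the workers can be checked with maxNumCompThreads. This is a sketch under the assumption that a pool is available; `clientThreads` and `workerThreads` are illustrative names, and the values reported will depend on the machine and release.

```matlab
% Sketch: compare computational thread counts on the client and on workers.
clientThreads = maxNumCompThreads;       % threads available to the client session

workerThreads = zeros(1, 4);
parfor k = 1:4
    workerThreads(k) = maxNumCompThreads;   % queried on whichever worker runs k
end

fprintf('Client threads: %d\n', clientThreads);
fprintf('Worker threads: %s\n', mat2str(workerThreads));
```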
I think workers are synonyms for processes. The term "cores" is related to the hardware. Labs is a mechanism which allows workers to communicate with each other. Each worker has at least one lab but can own more.
This piece of a discussion may be useful
I hope someone here will deliver more information in a more rigorous way

How to distribute MATLAB's parfor workers between GPUs AND CPUs (cores)?

I have a computation that has the structure of a binary tree, where at each node a bunch of highly vectorized functions take the output of the previous branches to produce new branch(es) (nodes on the same level are independent). Since the functions are vectorized, they run well both on CPU or GPU, the latter naturally giving substantially faster execution.
I will soon have access to a 4-GPU, 2-CPU workstation to run my code on, and I would like to use it as efficiently as I can. I understand how to use parfor on the GPUs only or on the CPU cores only, but I would like to distribute the workload sensibly between the GPUs and the CPUs, since GPU-only execution leaves many CPU cores idle, and even though they are much slower than the GPUs, they are still fast enough to have a noticeable impact on the total execution time.
(Q1) Since the functions in each node are vectorized, is it actually reasonable to run independent nodes in 1-node-per-core mode? Or does that strictly depend on the particular case? Is there a rule of thumb for such dilemmas?
(Q2) Assuming in (Q1) that simultaneous execution of 1 node per core is suboptimal, is there a way to assign several CPU cores to one worker?
(Q3) Is there a way to distribute the parfor workers between GPUs and CPUs in an efficient way?
Here is what I don't consider particularly efficient in (Q3): depending on the loop index, the loop instance can execute the GPU code on a given gpuDevice or on a CPU core. Knowing the performance difference between GPU and CPU execution, one can deduce a suitable proportion of the indices to assign to CPU execution. The problem with this is that parfor does not pick the loop indices in any particular order, which can easily lead to cases where it tries to execute two independent tasks on the same GPU, which is inefficient since the tasks will have to be serialized.
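One possible way around the ordering problem is to tie the device to the worker rather than to the loop index. The sketch below is a hypothetical illustration, not a tuned solution: it uses spmd with a fixed round-robin split, binds the first `nGpu` workers to one GPU each, and lets the remaining workers compute on the CPU. `nNodes`, `useGpu`, `localSum` and the matrix product are placeholders, and the equal split ignores the GPU/CPU speed difference discussed above.

```matlab
% Sketch: bind the first nGpu workers to one GPU each; the rest use the CPU.
nGpu = gpuDeviceCount;       % number of GPUs visible to MATLAB
nNodes = 16;                 % hypothetical number of independent nodes

spmd
    useGpu = labindex <= nGpu;
    if useGpu
        gpuDevice(labindex);             % this worker uses GPU number labindex
    end

    localSum = 0;
    % Static round-robin split: worker w handles nodes w, w+numlabs, ...
    for i = labindex:numlabs:nNodes
        x = rand(500);                   % placeholder for the node's input
        if useGpu
            x = gpuArray(x);             % move data to this worker's GPU
            y = gather(sum(sum(x * x))); % compute on GPU, bring scalar back
        else
            y = sum(sum(x * x));         % same computation on the CPU
        end
        localSum = localSum + y;
    end
end
```

Because every worker owns at most one GPU, two tasks never contend for the same device; giving the GPU workers a larger share of the nodes would need a weighted split instead of the round-robin shown here.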

performance issues with parallel MATLAB on a NUMA machine

I'm running memory-intensive parallel computations in MATLAB on a 64-core NUMA machine under Windows 7, with 8 cores per socket. I'm using Parallel Computing Toolbox to do that. I've noticed a very strange CPU load pattern: when running, say, 36 parallel MATLABs, the cores on the 1st socket are fully loaded, the 2nd socket is almost fully loaded too, the third socket is at about 50%, and so on. The last socket is usually almost completely idle. Running more than 12 parallel workers simultaneously seems to affect the performance of all workers very adversely.
I tried to experiment with CPU affinity, pinning different workers to different cores. While it helps in simple tests (i.e. the CPU load pattern becomes uniform across all cores), it doesn't help in our real-life memory-intensive computations.
I suspect the problem is with memory locality, i.e. all memory is allocated on the 1st and 2nd sockets. This would explain the strange CPU load: the OS tries to run computational threads closer to the data. But I know neither how to confirm this suspicion directly nor how to fix it if it's true.
I use maxNumCompThreads(4) in all my parallel workers, if that's important. Hyperthreading is off.
You should only be able to run 12 local workers using Parallel Computing Toolbox. See the data sheet.
Please note that in R2014a the limit on the number of local workers was removed. See the release notes.