Using matlabpool with a specified number of workers - matlab

I've been using the command matlabpool open 8 for a while in order to speed up things.
However I just tried using it and am denied 8 cores and now limited to 4.
My laptop is an i7 with 4 cores but hyperthreaded which meant I had no issue making matlab working on 8 virtual cores.
Simultaneously I noticed the following warning message:
Warning: matlabpool will be removed in a future release.
Use parpool instead.
Seems like MathsWorks decided this was a great update for some reason.
Any ideas how I can get my code running on 8 cores again?
Note: I was using R2010b (I think) and now using R2014b.

It looks like #horchler has provided you with a direct solution to your question in the comments.
However, I would recommend sticking to the default 4 workers suggested by MATLAB, and not using 8. You're very unlikely to get significant speedup by moving to 8, and you're even likely to slow things down a bit.
You have four physical cores, and they can only do so much work. Hyperthreading enables the operating system to pretend that there are 8 cores, by interleaving operations done on pairs of virtual cores.
This is great for applications such as Outlook, which are not compute-intensive, but require lots of operations to appear simultaneous in order, for example, to keep a GUI responsive while checking for email over a network connection.
But for compute-intensive applications such as MATLAB, it will not give you any sort of real speed up, as the operations are just interleaved - you haven't increased the amount of work that the 4 real, physical cores can do. In addition, there's a small overhead in performing the hyperthreading.
In my experience, MATLAB will benefit slightly by turning hyperthreading off. (Of course other things, such as Outlook, won't: your choice).

Related

How to fully use the CPU in Matlab [Improving performance of a repetitive, time-consuming program]

I'm working on an adaptive and Fully automatic segmentation algorithm under varying light condition , the core of this algorithm uses Particle swarm Optimization(PSO) to tune the fuzzy system and believe me it's very time consuming :| for only 5 particles and 100 iterations I have to wait 2 to 3 hours ! and it's just processing one image from my data set containing over 100 photos !
I'm using matlab R2013 ,with a intel coer i7-2670Qm # 2.2GHz //8.00GB RAM//64-bit operating system
the problem is : when starting the program it uses only 12%-16% of my CPU and only one core is working !!
I've searched a lot and came into matlabpool so I added this line to my code :
matlabpool open 8
now when I start the program the task manger shows 98% CPU usage, but it's just for a few seconds ! after that it came back to 12-13% CPU usage :|
Do you have any idea how can I get this code run faster ?!
12 Percent sounds like Matlab is using only one Thread/Core and this one with with full load, which is normal.
matlabpool open 8 is not enough, this simply opens workers. You have to use commands like parfor, to assign work to them.
Further to Daniel's suggestion, ideally to apply PARFOR you'd find a time-consuming FOR loop in you algorithm where the iterations are independent and convert that to PARFOR. Generally, PARFOR works best when applied at the outermost level possible. It's also definitely worth using the MATLAB profiler to help you optimise your serial code before you start adding parallelism.
With my own simulations I find that I cannot recode them using Parfor, the for loops I have are too intertwined to take advantage of multiple cores.
HOWEVER:
You can open a second (and third, and fourth etc) instance of Matlab and tell this additional instance to run another job. Each instance of matlab open will use a different core. So if you have a quadcore, you can have 4 instances open and get 100% efficiency by running code in all 4.
So, I gained efficiency by having multiple instances of matlab open at the same time and running a job. My jobs took 8 to 27 hours at a time, and as one might imagine without liquid cooling I burnt out my cpu fan and had to replace it.
Also do look into optimizing your matlab code, I recently optimized my code and now it runs 40% faster.

performance issues with parallel MATLAB on a NUMA machine

I'm running memory-intensive parallel computations in MATLAB on a 64-core NUMA machine under Windows 7, 8 cores per socket. I'm using parallel computing toolbox to do that. I've noticed a very strange cpu load pattern: then running say 36 parallel MATLABs, the cores on the 1st socket are fully loaded, 2nd socket is almost fully loaded too, third socket is about 50% and so on. The last socket is usually almost completely free and doing nothing. Running more than 12 parallel workers simultaneously seem to very adversely affect performance of all workers.
I tried to experiment with cpu affinity, pinning different workers to different cores. While it helps in simple tests (i.e. cpu load pattern becomes uniform across all cores), it doesn't help in our real-life memory-intensive computations.
I suspect the problem is with memory locality. I.e. all memory is allocated on 1st and 2nd sockets. This would explain strange cpu load: OS tires to run computational threads closer to the data. But I don't know neither how to confirm this suspicion directly, nor how to fix it, if it's true.
I use maxNumCompThreads(4) in all my parallel workers, if that's important. Hyperthreading is off.
You should only be able to run 12 local workers using Parallel Computing Toolbox. See the data sheet.
Please note that in R2014a the limit on the number of local workers was removed. See the release notes.

How can I query the number of physical cores from MATLAB?

Does anyone know of a way to query the number of physical cores from MATLAB? I would specifically like to get the number of physical rather than logical cores (which can differ when hyperthreading is enabled).
I need the method to be cross-platform (Windows and Linux, don't care about Mac), but I'd be happy to use two separate methods with a switch statement based on the output of computer.
So far I've tried:
java.lang.Runtime.getRuntime().availableProcessors
System.Environment.ProcessorCount
!wmic cpu get NumberOfCores and !wmic cpu get NumberOfLogicalProcessors.
1 is cross-platform, but returns the number of logical rather than physical processors.
2 is Windows only, and also returns logical rather than physical processors.
3 gives both physical and logical processors, but is also Windows only, and although I can use it successfully from the DOS command window, for some reason it seems to hang for an eternity when run from MATLAB.
You need to use the undocumented command
feature('numcores')
as explained here: http://undocumentedmatlab.com/blog/undocumented-feature-function/
This will work
getenv('NUMBER_OF_PROCESSORS')
You can use the function maxNumCompThreads. However it's deprecated. Still it works on Matlab 2011a.
maxNumCompThreads
Warning: maxNumCompThreads will be removed in a future release. Please remove any
instances of this function from your code.
> In maxNumCompThreads at 27
ans =
4

Is Scala doing anything in parallel on its own?

I have little program creating a maze. It uses lots of collections (the default variant, which is immutable, or at least used as an immutable).
The program calculates 30 mazes with increasing dimensions. Using a for comprehension over (1 to 30)
Since with the latest versions the parallel collections framework became available I thought to give it a spin, hoping for some performance gain.
This failed and when I investigated a little, I found the following:
When run without any call to anything remotely parallel it still showed a processor load of about 30% on each of the 4 cores of my machine.
When I replaced the Range 1 to 30 with (1 to 30).par CPU load went up to about 80% on all cores (which I expected). The order in which the mazes completed became more or less random (which I expected). The total time for all mazes stayed the same.
Replacing some of the internally used collections with their parallel counter parts did seem to have an effect.
I now have 2 questions:
Why do I have all 4 cores spinning, although there isn't anything that runs in parallel.
What might be likely reasons for the program to still take the same time, no matter if running in parallel or not. There are no obvious other bottlenecks but CPU cycles (no IO, no Network, plenty of Memory via -Xmx setting)
Any ideas on this?
The 30% per core version is just a poor scheduler (sounds like Windows 7) migrating the process from core to core very frequently. It's probably closer to 25% per core (1/4) for your process plus misc other load making 30%. If you run the same example under Linux you would probably see one core pegged.
When you converted to (1 to 30).par, you started really using threads across all cores but the synchronization overhead of distributing such a small amount of work and then collecting the results cancelled out the parallelism gains. You need to break your work into larger independent chunks.
EDIT: If each of 1..30 represents some larger amount of work (solving a maze, say) then automatic parallelization will work much better if each unit of work is about the same. Imagine you had 29 easy mazes and one very very hard maze. The 30th maze will still run serially (or very nearly) with everything else). If your mazes increase in complexity by number try spawning them in the order 30 to 1 by -1 so that the biggest tasks will go first. Think of it as a braindead solution to the knapsack problem.

Parallel programming on a Quad-Core and a VM?

I'm thinking of slowly picking up Parallel Programming. I've seen people use clusters with OpenMPI installed to learn this stuff. I do not have access to a cluster but have a Quad-Core machine. Will I be able to experience any benefit here? Also, if I'm running linux inside a Virtual machine, does it make sense in using OpenMPI inside a VM?
If your target is to learn, you don't need a cluster at all. Your quad-core (or any dual-core or even a single-cored) computer will be more than enough. The main point is to learn how to think "in parallel" and how to design your application.
Some important points are to:
Exploit different parallelism paradigms like divide-and-conquer, master-worker, SPMD, ... depending on data and tasks dependencies of what you want to do.
Chose different data division granularities to check the computation/communication ratio (in case of message passing), or to check the amount of serial execution because of mutual exclusion to memory regions.
Having a quad-core you can measure your approach speedup (the gain on performance attained because of the parallelization) which is normally given by the division between the time of the non parallelized execution and the time of the parallel execution.
The closer you get to 4 (four cores meaning 1/4th the execution time), the better your parallelization strategy was (once you could evenly distribute work and data).