Is it beneficial to run Matlab calculations in parallel on a multi-core computer? - matlab

I have a laptop with a multi-core processor and I would like to run a lengthy loop in which Simulink simulations are performed. Is it beneficial to split the loop into two parts (it is possible in my case), open the Matlab application twice, and run a Matlab script in each of them?
Someone told me that Matlab/Simulink always uses one core per opened Matlab application. Is that correct?

MATLAB splits some built-in functions across multiple cores, but standard MATLAB code uses just one core. Generally, if you are running several independent iterations, then the computation time can benefit from parallelization. You can do this easily using either parfor (if you have the Parallel Computing Toolbox) or batch_job.
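For example, a minimal parfor sketch (assuming the Parallel Computing Toolbox is installed; run_one_case is a hypothetical function wrapping one lengthy iteration, e.g. a Simulink run):

parpool(2);                          % matlabpool(2) on older releases

nCases = 10;                         % number of independent iterations
results = cell(1, nCases);
parfor k = 1:nCases
    results{k} = run_one_case(k);    % hypothetical: each iteration is independent
end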

Related

Parallel scripts on MATLAB

I have two systems running on MATLAB: the control system and the computer vision system.
The control system needs to receive three variables generated by the computer vision system periodically. However, I can't run both systems in a single thread, because the computer vision system's latency is too high compared with the control system's latency.
I tried to run each program in a different MATLAB session and use a .mat file as interface between both sessions, but it did not work.
I'm not familiar with the Parallel Computing Toolbox, so I was wondering if someone can help with this, or at least give me a starting idea, because, as I've said, I'm only now starting to learn the Parallel Computing Toolbox.
I think the function in the Parallel Computing Toolbox you may be looking for is parfeval. It lets you spawn an asynchronous task, and get its result whenever it is ready.
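A minimal sketch of that pattern (computerVisionStep and cameraFrame are hypothetical names standing in for your vision code and its input):

% start a pool if one is not already running (Parallel Computing Toolbox)
if isempty(gcp('nocreate'))
    parpool(2);
end

% run the vision function asynchronously; here it is assumed to return 3 values
f = parfeval(@computerVisionStep, 3, cameraFrame);

% ... the control loop keeps running in the meantime ...

% collect the three outputs once the vision task has finished
[v1, v2, v3] = fetchOutputs(f);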
In addition to parfeval as suggested by @Dima, you might also want to look into labSendReceive and associated functions like labSend and labReceive, which allow you to share data between individual workers in your parallel pool. I guess which one is best for you depends on the type of calculation you want to do.
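A minimal sketch of worker-to-worker communication inside spmd, assuming a pool with at least two workers:

spmd
    if labindex == 1
        data = rand(3, 1);        % worker 1 produces some data
        labSend(data, 2);         % ... and sends it to worker 2
    elseif labindex == 2
        data = labReceive(1);     % blocks until worker 1 has sent
        % ... worker 2 uses data here ...
    end
end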

Simulink running on multiple cores (or not) for a given model

I have a huge Simulink model that takes about an hour to execute. On my computer (HP Z210), the execution uses all CPU cores at 100%. What intrigues me is that the same model running on my colleague's computer (Dell Precision T3600) uses only ~50% of the CPU "power" (some cores at 100% while others remain idle).
My question is:
I always thought Simulink runs on a single core. I'm not using the Parallel Computing Toolbox or any other toolbox; I'm only using MATLAB and Simulink licenses.
Why is the execution on my computer different from my colleague's? Does it have anything to do with hyper-threading?
While Simulink itself is single-threaded when executing a single model, an individual block might be multi-threaded. For example, a Simulink block performing a matrix multiplication will use the same multi-threaded implementation that MATLAB uses.
Simulink is definitely a single-threaded application. The exception to this is if you are using Rapid Accelerator mode and have multiple cores available; then the standalone executable runs on a separate core. See How Acceleration Modes Work for more details.
If you are running multiple simulations, then you can distribute these across multiple cores with the Parallel Computing Toolbox, or even across multiple workers (machines) with the MATLAB Distributed Computing Server. However, this is for multiple simulations of a model (e.g. a Monte-Carlo study), not for breaking up a large model into several chunks (currently not possible as far as I know). See Run Parallel Simulations for more details; a sketch is shown below.
Not sure why the execution would be different from one machine to the other. Are you both using the same release of MATLAB? The same OS? There are so many things that could be different. With regards to speeding up the execution of the model, you could try running the model in Accelerator mode, using the Simulink profiler to see where the bottlenecks are, changing some of the solver settings (e.g. variable-step vs. fixed-step), etc.
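A minimal sketch of the parallel-simulation approach mentioned above, using parsim (which requires a newer release than the matlabpool era of this question); the model name 'myModel' and the tunable variable 'gain' are hypothetical placeholders:

% build one SimulationInput object per independent run
in(1:10) = Simulink.SimulationInput('myModel');
for k = 1:10
    in(k) = in(k).setVariable('gain', k);   % vary a (hypothetical) model variable
end

% parsim distributes the runs across the parallel pool if the
% Parallel Computing Toolbox is available, and runs serially otherwise
out = parsim(in, 'ShowProgress', 'on');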
If your model can be built with Simulink Coder, you can use xMOD software (www.xmodsoftware.com) to execute your model on multiple cores (on a subsystem-per-thread basis, with a dedicated solver and step size for each subsystem).

SPMD vs. Parfor

I'm new to parallel computing in MATLAB. I have a function which creates a classifier (SVM) and I'd like to test it with several datasets. I've got a 2-core workstation, so I'd like to run the tests in parallel. Can someone explain the difference between:
dataset_array = {dataset1, dataset2};
matlabpool open 2
spmd
    my_function(dataset_array{labindex});
end
and
dataset_array = {dataset1, dataset2};
matlabpool open 2
parfor i = 1:2
    my_function(dataset_array{i});
end
spmd is a parallel region, while parfor is a parallel for loop. The difference is that in an spmd region you have much greater flexibility in the tasks you can perform in parallel. You can write a for loop, you can operate on distributed arrays and vectors, and you can program an entire workflow, which in general consists of more than loops. This comes at a price: you need to know more about distributing the work and the data among your workers. Parallelizing a loop, for example, requires explicitly dividing the loop index ranges amongst the workers (which you did in your code by using labindex), and maybe creating distributed arrays.
parfor, on the other hand, does only that: it runs a parallelized for loop. The work is divided between the workers automatically by MATLAB.
If you only want to run a single loop in parallel and later work on the result on your local client, you should use parfor. If you want to parallelize your entire MATLAB program, you will have to deal with the complexities of spmd and work distribution.
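To illustrate the manual work division mentioned above, a minimal sketch that splits N independent tasks across the workers inside spmd (assuming dataset_array now holds N datasets rather than two):

N = numel(dataset_array);
spmd
    % each worker handles every numlabs-th index, offset by its own labindex
    for i = labindex:numlabs:N
        my_function(dataset_array{i});
    end
end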

Matlab and GPU/CUDA programming

I need to run several independent analyses on the same data set.
Specifically, I need to run batches of 100 GLM (generalized linear model) analyses and was thinking of taking advantage of my video card (GTX 580).
As I have access to Matlab and the Parallel Computing Toolbox (and I'm not good with C++), I decided to give it a try.
I understand that a single GLM is not ideal for parallel computing, but as I need to run 100-200 in parallel, I thought that using parfor could be a solution.
My problem is that it is not clear to me which approach I should follow. I wrote a gpuArray version of the MATLAB function glmfit, but using parfor doesn't give any advantage over a standard for loop.
Does this have anything to do with the matlabpool setting? It is not even clear to me how to set it to "see" the GPU card. By default, it is set to the number of CPU cores (4 in my case), if I'm not mistaken.
Am I completely wrong on the approach?
Any suggestion would be highly appreciated.
Edit
Thanks. I'm aware of GPUmat and Jacket, and I could start writing in C without too much effort, but I'm testing the GPU computing possibilities for a department where everybody uses MATLAB or R. The final goal would be a cluster based on C2050 cards and the MATLAB Distributed Computing Server (or at least that was the first project).
Reading the ads from MathWorks, I was under the impression that parallel computing was possible even without C skills. It is impossible to ask the researchers in my department to learn C, so I'm guessing that GPUmat and Jacket are the better solutions, even though the limitations are quite big and support for several commonly used routines like glm is non-existent.
How can they be interfaced with a cluster? Do they work with some job distribution system?
I would recommend you try either GPUMat (free) or AccelerEyes Jacket (buy, but has free trial) rather than the Parallel Computing Toolbox. The toolbox doesn't have as much functionality.
To get the most performance, you may want to learn some C (no need for C++) and code in raw CUDA yourself. Many of these high level tools may not be smart enough about how they manage memory transfers (you could lose all your computational benefits from needlessly shuffling data across the PCI-E bus).
Parfor will help you utilize multiple GPUs, but not a single GPU. The thing is that a single GPU can only do one thing at a time, so parfor on a single GPU and for on a single GPU achieve exactly the same effect (as you are seeing).
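For example, a minimal sketch of spreading independent GPU work across several devices with parfor (assuming one worker per GPU; recent Parallel Computing Toolbox releases assign each worker a different default GPU, otherwise gpuDevice can be used to select one explicitly):

parpool(gpuDeviceCount);                 % one worker per available GPU

results = zeros(1, 8);
parfor k = 1:8
    A = gpuArray.rand(2000);             % data created directly on that worker's GPU
    results(k) = gather(sum(A(:)));      % each iteration runs on a different worker/GPU
end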
Jacket tends to be more efficient, as it can combine multiple operations and run them together, and it has more features; but most departments already have the Parallel Computing Toolbox and not Jacket, so that can be an issue. You can try the demo to check.
No experience with gpumat.
The Parallel Computing Toolbox is getting better; what you need is some large matrix operations. GPUs are good at doing the same thing many times over, so you need to either combine your code somehow into one operation or make each operation big enough. We are talking about at least ~10,000 things in parallel, although it's not a set of 1e4 matrices but rather one large matrix with at least 1e4 elements.
I do find that with the Parallel Computing Toolbox you still need quite a bit of inline CUDA code to be effective (it's still pretty limited). It does, however, let you inline kernels and turn MATLAB code into kernels.
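As a rough illustration of the "make each operation big enough" point, a minimal gpuArray sketch: element-wise work on one large matrix, with a single transfer back to the host at the end:

A = gpuArray.rand(10000, 1000);    % a large matrix living in GPU memory
B = exp(-A) .* A.^2;               % element-wise work, all of it on the GPU
s = gather(sum(B(:)));             % one scalar transferred back to the host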

CUDA and MATLAB for loop optimization

I'm going to attempt to optimize some code written in MATLAB by using CUDA. I recently started programming in CUDA, but I've got a general idea of how it works.
So, say I want to add two matrices together. In CUDA, I could write an algorithm that uses one thread to calculate the answer for each element of the result matrix. However, isn't this technique probably similar to what MATLAB already does? In that case, wouldn't the efficiency be independent of the technique and attributable only to the hardware?
The technique might be similar, but remember that with CUDA you have hundreds of threads running simultaneously. If MATLAB is using threads and those threads are running on a quad core, you are only going to get 4 threads executed per clock cycle, while you might have a couple of hundred threads running on CUDA in that same clock cycle.
So to answer your question: YES, the efficiency in this example is independent of the technique and attributable only to the hardware.
The answer is unequivocally yes: all the efficiencies are at the hardware level. I don't know exactly how MATLAB works, but the advantage of CUDA is that multiple threads can be executed simultaneously, unlike in MATLAB.
On a side note, if your problem is small, or requires many read write operations, CUDA will probably only be an additional headache.
CUDA has official support for matlab.
[need link]
You can make use of MEX files to run code on the GPU from MATLAB.
The bottleneck is the speed at which data is transferred from CPU RAM to the GPU, so if the transfers are minimized and done in large chunks, the speedup is great.
For simple things, it's better to use the gpuArray support in the MATLAB PCT. You can check it here:
http://www.mathworks.de/de/help/distcomp/using-gpuarray.html
For things like adding gpuArrays, multiplications, mins, maxes, etc., the implementation they use tends to be OK. I did find that for batch operations on small matrices, such as abs(y-Hx).^2, you're better off writing a small kernel that does it for you.
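For the single-large-instance case, a minimal gpuArray sketch that evaluates abs(y - H*x).^2 entirely on the device (H, x, and y are random placeholders); for batches of many small matrices, a hand-written kernel as suggested above may still win:

H = gpuArray(randn(5000, 200));
x = gpuArray(randn(200, 1));
y = gpuArray(randn(5000, 1));

r = abs(y - H*x).^2;               % evaluated entirely on the GPU
cost = gather(sum(r));             % a single scalar transferred back to the CPU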