CUDA and MATLAB for loop optimization

CUDA and MATLAB for loop optimization - matlab

I'm going to attempt to optimize some code written in MATLAB, by using CUDA. I recently started programming CUDA, but I've got a general idea of how it works.
So, say I want to add two matrices together. In CUDA, I could write an algorithm that would utilize a thread to calculate the answer for each element in the result matrix. However, isn't this technique probably similar to what MATLAB already does? In that case, wouldn't the efficiency be independent of the technique and attributable only to the hardware level?

The technique might be similar but remember with CUDA you have hundreds of threads running simultaneously. If MATLAB is using threads and those threads are running on a Quad core, you are only going to get 4 threads excuted per clock cycle while you might achieve a couple of hundred threads to run on CUDA with that same clock cycle.
So to answer you question, YES, the efficiency in this example is independent of the technique and attributable only to the hardware.

The answer is unequivocally yes, all the efficiencies are hardware level. I don't how exactly matlab works, but the advantage of CUDA is that mutltiple threads can be executed simultaneously, unlike matlab.
On a side note, if your problem is small, or requires many read write operations, CUDA will probably only be an additional headache.

CUDA has official support for matlab.
[need link]
You can make use of mex files to run on GPU from MATLAB.
The bottleneck is the speed at which data is transfered from CPU-RAM to GPU. So if the transfer is minimized and done in large chunks, the speedup is great.

For simple things, it's better to use the gpuArray support in the Matlab PCT. You can check it here
http://www.mathworks.de/de/help/distcomp/using-gpuarray.html
For things like adding gpuArrays, multiplications, mins, maxs, etc., the implementation they use tends to be OK. I did find out that for making things like batch operations of small matrices like abs(y-Hx).^2, you're better off writing a small Kernel that does it for you.

Related

Difference in Matlab and Octave computation

I have implemented a Naive Bayes classifier. On Matlab, my classify function takes 2 minutes to run while octave takes 25 minutes to run the same code. Does anyone know what causes ocatve to run slower so that I can tweak my code accordingly?
PS: I have to submit to a server which runs octave and not Matlab.

Matlab does a lot of "hidden" optimization when running your code (Octave probably, too, but different ones). Many of these optimizations e.g. concern that parameters to functions are not copied if you do not modify these parameters in the function, but instead passed by reference. This can significantly speed up calculations when you e.g. pass around large matrices, since otherwise most of your computational time is spend on copying. There are many, many similar optimizations, and not all of them are documented at all.
Without specific knowledge of what you are computing, it's hard to guess where the difference comes from. I am not aware if octave has an equivalence to the matlab profiler, but if, I would use this to find out where octave spends all the time. For debugging, I would also recommend to download Octave to your PC and debug there.

Can I measure the speedup from parallelization in matlab?

If I assume that a problem is a candidate for parallization e.g. matrix multiplication or some other problem and I use an Intel i7 haswell dualcore, is there some way I can compare a parallel execution to a sequential version of the same program or will matlab optimize a program to my architecture (dualcore, quadcore..)? I would like to know the speedup from adding more processors from a good benchmark parallell program.

Unfortunately there is no such thing as a benchmark parallel program. If you measure a speedup for a benchmark algorithm that does not mean that all the algorithms will benefit from parallelization
Since your target architecture has only 2 cores you might be better off avoiding parallelization at all and let Matlab and the operative system to optimize the execution. Anyway, here are the steps I followed.
Determine if your problem is apt for parallelization by calculating the theoretical speedup. Some problems like matrix multiplication or Gauss elimination are well studied. Since I assume your problem is more complicated than that, try to decompose your algorithm into simple blocks and determine, block-wise, the advantages of parallelization.
If you find that several parts of your algorithms could profit from parallelization, study those part separately.
Obtain statistical information of the runtime of your sequential algorithm. That is, run your program X number of times under similar conditions (and similar inputs) and average the running time.
Obtain statistical information of the runtime of your parallel algorithm.
Measure with the profiler. Many people recommends to use function like tic or toc. The profiler will give you a more accurate picture of your running times, as well as detailed information per function. See the documentation for detailed information on how to use the profiler.
Don't make the mistake of not taking into account the time Matlab takes to open the pool of workers (I assume you are working with the Parallel Computing Toolbox). Depending on your number of workers, the pool takes more/less time and in some occasions it could be up to 1 minute (2011b)!

You can try "Run and time" feature on MATLAB.
Or simply put some tic and toc to the first and end of your code, respectively.

Matlab provides a number of timing functions to help you assess the performance of your code: go read the documentation here and select the function that you deem most appropriate in your case! In particular, be aware of the difference between tic toc and the cputime function.

Can Matlab use multiple processors for plots and interactions in plots?

I am analyzing EMG data in my research lab. One of the steps is to calculate a continuous wavelet transformation of the dataset (size ~80000). Therefore, I use Matlab with the wavelet toolbox and "cwt" to plot a 3D-scalogram.
The calculation takes a lot of time and any interaction like 3D-rotation (which is very important to see different aspects of the data) is nearly impossible.
The resource-monitor shows that only one of my hexa-core processor is working. I use parallel computing for other calculations and haven't found any solution or even a similiar question like this.
Is there anything I can do to activate multicore support for plots?

I'll hazard an educated guess and plump for the answer No to your question Is there anything I can do to activate multicore support for plots?
Matlab can certainly use multiple cores for its computations. Many of its intrinsic functions are already multi-threaded and will use any available cores without the programmer (or user) having to do take any special measures. For your own computations you can use the Parallel Compute Toolbox.
However, unless you have some very special graphics hardware (and if you do why didn't you mention it ?) your display shows you why only one processor is being used when you interact with your 3D plots -- somewhere between the screen and the hardware of your computer there is a bottleneck through which the outputs of all those cores are squeezed into one stream of bits and bytes for presentation.
Your experience is consistent with that bottleneck being the Matlab visualisation routines, I think it is safe to conclude from the evidence you present, that the Mathworks haven't multi-threaded the routines which compute the new screen positions of each element in a plot as you rotate it, or any of the other processing that goes on to turn the results of your analyses into a picture or pictures. If they did parallelise those routines, that would shift the bottleneck but not remove it.
To remove the bottleneck you would have to have a way for different Matlab threads to separately address different parts of your screen; I see no evidence that Matlab has that capability. Google will find you a ton of references to parallel rendering but I see no sign that Matlab currently implements any aspect of this.
I'll just add, in response to your comment where you write unfortunately I can't resample my data that you should be mindful that Matlab's visualisation routines are resampling your data for presentation unless you are only visualising datasets with numbers of samples less than the number of pixels available. If you visualise a time series with 80000 samples on a display with 2000 pixels horizontally, something has got to give.
You might get better graphics performance and superior understanding if you take charge of that resampling yourself.

Matlab plotting performance is pretty bad, it's more focused on customizability than performance. Using MEX to run some native C++ code to plot the data with OpenGL will likely be much much faster.

Dedicated distributed system to handle matlab jobs

I'm a software engineer and currently looking forward to setup a distributed system at my laboratory so that I can process some matlab jobs over them. I have looked into MATLAB MPI but I want to know if there is some way so that I can setup a system here without any FEE or AMOUNT.

I have spent a lot of time looking at that very issue, and the short answer is: nope, not possible.
There are two long answers. First, if you're constrained to using Matlab, then all roads lead back to MathWorks. One possibility is that you could compile your code, you'd need to buy the compiler from Mathworks, though, and then you could run the compiled code on whatever grid infrastructure you wish, such as Hadoop.
Second, for this reason, I have found it much better to simply port code to another language, usually one in open source. For the work I tend to do, Octave is a poor replacement for Matlab. Instead, R and Python are great for most of the same functionality. Personally, I lean a lot more toward R than Python, but that's because R is a better fit for these applications (i.e. they're very statistical in nature).
I've ported a lot of Matlab code to R and it's not too bad. Porting to Python would be easier in general, though, and there is a very large Matlab refugee community that has switched to Python.
Once you're in either Python or R, there are a lot of options for MPI, multicore tools, distributed systems, GPU tools, and more. In fact, you may find the migration easier by writing some of the distributed functions in Python or R, loading up an easy to use grid system, and then have Matlab submit the job to the server. Your local code could be the same, but then you can work on porting only the gridded parts, which you'd probably have to devote some time to write in Matlab, anyway.

I wouldn't say it's completely impossible; You can use TCP/IP sockets to build a client/server application (you will find many MEX implementations for BSD sockets on File Exchange).
The architecture is simple: your main MATLAB client script sends jobs (code along with any needed data serialized) to nodes to evaluate and send back results when done. These nodes would be distributed MATLAB instances running the server part which listens for connections, and runs anything it receive through the EVAL function.
Obviously it is up to write code that can be divided into breakable tasks.
This is not as sophisticated as what is offered by the Distributed Computing Toolbox, but basically does the same thing...

Matlab and GPU/CUDA programming

I need to run several independent analyses on the same data set.
Specifically, I need to run bunches of 100 glm (generalized linear models) analyses and was thinking to take advantage of my video card (GTX580).
As I have access to Matlab and the Parallel Computing Toolbox (and I'm not good with C++), I decided to give it a try.
I understand that a single GLM is not ideal for parallel computing, but as I need to run 100-200 in parallel, I thought that using parfor could be a solution.
My problem is that it is not clear to me which approach I should follow. I wrote a gpuArray version of the matlab function glmfit, but using parfor doesn't have any advantage over a standard "for" loop.
Has this anything to do with the matlabpool setting? It is not even clear to me how to set this to "see" the GPU card. By default, it is set to the number of cores in the CPU (4 in my case), if I'm not wrong.
Am I completely wrong on the approach?
Any suggestion would be highly appreciated.
Edit
Thanks. I'm aware of GPUmat and Jacket, and I could start writing in C without too much effort, but I'm testing the GPU computing possibilities for a department where everybody uses Matlab or R. The final goal would be a cluster based on C2050 and the Matlab Distribution Server (or at least this was the first project).
Reading the ADs from Mathworks I was under the impression that parallel computing was possible even without C skills. It is impossible to ask the researchers in my department to learn C, so I'm guessing that GPUmat and Jacket are the better solutions, even if the limitations are quite big and the support to several commonly used routines like glm is non-existent.
How can they be interfaced with a cluster? Do they work with some job distribution system?

I would recommend you try either GPUMat (free) or AccelerEyes Jacket (buy, but has free trial) rather than the Parallel Computing Toolbox. The toolbox doesn't have as much functionality.
To get the most performance, you may want to learn some C (no need for C++) and code in raw CUDA yourself. Many of these high level tools may not be smart enough about how they manage memory transfers (you could lose all your computational benefits from needlessly shuffling data across the PCI-E bus).

Parfor will help you for utilizing multiple GPUs, but not a single GPU. The thing is that a single GPU can do only one thing at a time, so parfor on a single GPU or for on a single GPU will achieve the exact same effect (as you are seeing).
Jacket tends to be more efficient as it can combine multiple operations and run them more efficiently and has more features, but most departments already have parallel computing toolbox and not jacket so that can be an issue. You can try the demo to check.
No experience with gpumat.
The parallel computing toolbox is getting better, what you need is some large matrix operations. GPUs are good at doing the same thing multiple times, so you need to either combine your code somehow into one operation or make each operation big enough. We are talking a need for ~10000 things in parallel at least, although it's not a set of 1e4 matrices but rather a large matrix with at least 1e4 elements.
I do find that with the parallel computing toolbox you still need quite a bit of inline CUDA code to be effective (it's still pretty limited). It does better allow you to inline kernels and transform matlab code into kernels though, something that

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse