Do more toolboxes slow down Matlab? - matlab

Does installing more and more toolboxes slow down Matlab's startup and make it run more slowly? I believe that if one has many toolboxes installed, Matlab will take more time to load them into RAM, and when a script is run it will take more time to search for each function. Am I correct?

I have all MathWorks products installed on my machine and there's no significant slowdown.
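If you want to check this on your own machine, you can list the installed toolboxes and time how long a name lookup on the search path takes. This is only a rough sketch; the function name and loop count are arbitrary.

% List every installed MathWorks product and its version
ver
% Time how long Matlab takes to resolve a function name on the search path.
% The function name (svd) and the loop count are arbitrary.
tic
for k = 1:1000
    s = which('svd');
end
toc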

Related

trainAutoencoder slows down when using GPU?

I am trying to get into deep learning using the neural network library in matlab. A good starting step seems to be training an autoencoder. In that respect, it would be good to see whether I am getting the most out of my gpu.
With that in mind, when I run
tic
autoenc1 = trainAutoencoder(allSets, 5, ...
    'L2WeightRegularization', 0.001, ...
    'SparsityRegularization', 1, ...
    'SparsityProportion', 0.2, ...
    'DecoderTransferFunction', 'logsig', ...
    'useGPU', true)
toc
I get "Elapsed time is 19.680823 seconds.".
However, when not using the gpu (setting 'useGPU' to false), it takes only 8.272708 seconds.
I am puzzled by this, since I am assuming that using the gpu for neural networks will speed things up? Does anyone know of any way to check whether matlab and cuda are properly interfacing, or see how matlab is actually using the resources?
I have cuda 8.1 installed, and am using a GeForce GTX 960M (compute capability 5.0). The matlab version is 2016b.
EDIT: as has been pointed out, there is as of yet no cuda 8.1. What I do have is 8.0, and cudnn 5.1.
As pointed out in the comments, performing computations on the GPU is not necessarily faster; the impact on performance depends on the additional overhead of data conversion and transfer.
Usually, the overhead can be influenced via the batch size, but the trainAutoencoder function does not provide that option.
For general measurement and improvement of GPU performance in MATLAB, see this link.
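As a rough way to check that MATLAB and CUDA are interfacing at all, and to see how much of the elapsed time is transfer overhead rather than computation, something along these lines can help. It is only a sketch; the matrix size is arbitrary and the operation is a plain matrix product, not trainAutoencoder.

% Query the selected GPU; this errors if MATLAB cannot talk to the CUDA device,
% and otherwise reports the name, compute capability, driver version, etc.
gpu = gpuDevice
% Compare a CPU operation with its GPU counterpart, including the transfer cost.
A = rand(2000);                                      % arbitrary test matrix
tCPU = timeit(@() A * A);
tGPU = gputimeit(@() gpuArray(A) * gpuArray(A));     % includes host-to-device transfer
fprintf('CPU: %.4f s, GPU with transfer: %.4f s\n', tCPU, tGPU);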

Difference in Matlab and Octave computation

I have implemented a Naive Bayes classifier. In Matlab, my classify function takes 2 minutes to run, while Octave takes 25 minutes to run the same code. Does anyone know what causes Octave to run slower, so that I can tweak my code accordingly?
PS: I have to submit to a server which runs octave and not Matlab.
Matlab does a lot of "hidden" optimization when running your code (Octave probably does too, but different ones). Many of these optimizations concern, for example, how function arguments are handled: parameters are not copied if you do not modify them inside the function, but are effectively passed by reference. This can significantly speed up calculations when you pass around large matrices, since otherwise most of your computational time is spent on copying. There are many similar optimizations, and not all of them are documented.
Without specific knowledge of what you are computing, it is hard to guess where the difference comes from. If Octave has an equivalent of the Matlab profiler, I would use it to find out where Octave spends all the time. For debugging, I would also recommend downloading Octave to your own PC and working there.
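Octave does in fact ship a profiler (profile/profshow), so one way to narrow it down looks like the sketch below. my_classify, trainData and testData are placeholders for your own classifier and data.

% Profile the classifier; the same profile on/off commands work in Matlab.
profile on
labels = my_classify(trainData, testData);   % placeholder for your classify function
profile off
profshow                                     % Octave: text summary of the hottest functions
% In Matlab, use "profile viewer" instead of profshow to open the report.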

Will upgrading matlab from 2009b to 2015 improve performance of mex Functions?

I am currently using Matlab 2009b to run optimizations based on mex code. More than 95% of the time is spent in numerous calls to mex functions. I am currently using the cygwin64 compiler on a windows machine, which is set up using gnumex.
I want to know if there will be noticeable decrease in optimization time if I upgrade my matlab to 2015.
The Matlab release notes (http://in.mathworks.com/help/matlab/release-notes.html) mention several performance improvements over the course of various releases, but if somebody can quantify them, it would help me make my decision.
I tested it with 2015a and the difference in performance was less than 2 seconds in a 90-minute test. It appears that mex performance does not depend on the matlab version.
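If you want to quantify it on your own code before upgrading, a simple harness like the one below, run under both releases, makes the comparison concrete. mymexfun and its input are placeholders for your actual mex function.

% Benchmark repeated calls to a mex function; mymexfun and x are placeholders.
x = rand(1000, 1);
nCalls = 1e4;
tic
for k = 1:nCalls
    y = mymexfun(x);
end
t = toc;
fprintf('%d calls took %.2f s (%.3f ms per call)\n', nCalls, t, 1e3 * t / nCalls);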

Matlab and GPU/CUDA programming

I need to run several independent analyses on the same data set.
Specifically, I need to run bunches of 100 glm (generalized linear models) analyses and was thinking to take advantage of my video card (GTX580).
As I have access to Matlab and the Parallel Computing Toolbox (and I'm not good with C++), I decided to give it a try.
I understand that a single GLM is not ideal for parallel computing, but as I need to run 100-200 in parallel, I thought that using parfor could be a solution.
My problem is that it is not clear to me which approach I should follow. I wrote a gpuArray version of the matlab function glmfit, but using parfor doesn't have any advantage over a standard "for" loop.
Has this anything to do with the matlabpool setting? It is not even clear to me how to set this to "see" the GPU card. By default, it is set to the number of cores in the CPU (4 in my case), if I'm not wrong.
Am I completely wrong on the approach?
Any suggestion would be highly appreciated.
Edit
Thanks. I'm aware of GPUmat and Jacket, and I could start writing in C without too much effort, but I'm testing the GPU computing possibilities for a department where everybody uses Matlab or R. The final goal would be a cluster based on C2050 and the Matlab Distribution Server (or at least this was the first project).
Reading the ads from Mathworks I was under the impression that parallel computing was possible even without C skills. It is impossible to ask the researchers in my department to learn C, so I'm guessing that GPUmat and Jacket are the better solutions, even if the limitations are quite big and support for several commonly used routines like glm is non-existent.
How can they be interfaced with a cluster? Do they work with some job distribution system?
I would recommend you try either GPUMat (free) or AccelerEyes Jacket (buy, but has free trial) rather than the Parallel Computing Toolbox. The toolbox doesn't have as much functionality.
To get the most performance, you may want to learn some C (no need for C++) and code in raw CUDA yourself. Many of these high level tools may not be smart enough about how they manage memory transfers (you could lose all your computational benefits from needlessly shuffling data across the PCI-E bus).
Parfor will help you utilize multiple GPUs, but not a single GPU. A single GPU can only do one thing at a time, so parfor on a single GPU and for on a single GPU achieve exactly the same effect (as you are seeing).
Jacket tends to be more efficient, as it can combine multiple operations and run them together, and it has more features; but most departments already have the Parallel Computing Toolbox and not Jacket, so that can be an issue. You can try the demo to check.
No experience with gpumat.
The parallel computing toolbox is getting better; what you need is some large matrix operations. GPUs are good at doing the same thing many times over, so you need to either combine your code somehow into one operation or make each operation big enough. We are talking about roughly 10,000 things in parallel at least, although it is not a set of 1e4 separate matrices but rather a single large matrix with at least 1e4 elements.
I do find that with the parallel computing toolbox you still need quite a bit of inline CUDA code to be effective (it's still pretty limited). It does, however, make it easier to inline kernels and to transform matlab code into kernels.
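As an illustration of "making each operation big enough": rather than looping over 100-200 small problems, stack them into one large gpuArray and apply a single vectorised operation to all of them at once. The sizes below are arbitrary and this is only a sketch of the pattern, not a GLM fit.

% Stack many small problems column-wise and hit them with one GPU operation.
nProblems = 200;                               % number of independent analyses
n = 1000;                                      % observations per problem
X  = rand(n, nProblems, 'gpuArray');           % each column is one data set
mu = rand(1, nProblems, 'gpuArray');           % one parameter per problem
resid = abs(bsxfun(@minus, X, mu)).^2;         % one elementwise op covers all problems
sse   = gather(sum(resid, 1));                 % 1-by-nProblems result back on the CPU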

CUDA and MATLAB for loop optimization

I'm going to attempt to optimize some code written in MATLAB, by using CUDA. I recently started programming CUDA, but I've got a general idea of how it works.
So, say I want to add two matrices together. In CUDA, I could write an algorithm that would utilize a thread to calculate the answer for each element in the result matrix. However, isn't this technique probably similar to what MATLAB already does? In that case, wouldn't the efficiency be independent of the technique and attributable only to the hardware level?
The technique might be similar, but remember that with CUDA you have hundreds of threads running simultaneously. If MATLAB is using threads and those threads are running on a quad core, you are only going to get 4 threads executed per clock cycle, while you might have a couple of hundred threads running on CUDA in that same clock cycle.
So to answer your question: YES, the efficiency in this example is independent of the technique and attributable only to the hardware.
The answer is unequivocally yes: the efficiencies are all at the hardware level. I don't know exactly how matlab works internally, but the advantage of CUDA is that multiple threads can be executed simultaneously, unlike in matlab.
On a side note, if your problem is small, or requires many read write operations, CUDA will probably only be an additional headache.
CUDA has official support for matlab.
[need link]
You can make use of mex files to run on GPU from MATLAB.
The bottleneck is the speed at which data is transferred from CPU RAM to the GPU. So if the transfer is minimized and done in large chunks, the speedup is great.
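One way to see that transfer cost in isolation is to time the host-to-device copy separately from the arithmetic. The matrix size is arbitrary and this is only a rough sketch.

% Separate the cost of moving data to the GPU from the cost of the arithmetic.
A = rand(4000);                                % arbitrary size
tic; Ag = gpuArray(A); wait(gpuDevice); tTransfer = toc;
tic; Bg = Ag + Ag;     wait(gpuDevice); tCompute  = toc;
fprintf('transfer: %.4f s, compute: %.4f s\n', tTransfer, tCompute);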
For simple things, it's better to use the gpuArray support in the Matlab PCT. You can check it here
http://www.mathworks.de/de/help/distcomp/using-gpuarray.html
For things like adding gpuArrays, multiplications, mins, maxes, etc., the implementation they use tends to be OK. I did find that for things like batch operations on small matrices, for example abs(y-Hx).^2, you're better off writing a small kernel that does it for you.
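For that kind of elementwise expression, gpuArray plus arrayfun already gives you a fused kernel without writing CUDA yourself. The names and sizes below are made up; it is only a sketch of the pattern.

% Fused elementwise kernel via arrayfun on gpuArrays; names and sizes are made up.
H = gpuArray(rand(512, 256));
x = gpuArray(rand(256, 1));
y = gpuArray(rand(512, 1));
r = y - H * x;                           % matrix product runs on the GPU
e = arrayfun(@(v) abs(v).^2, r);         % elementwise part fused into one kernel
err = gather(sum(e));                    % scalar result back on the CPU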