Using the GPU for parallel computing in Matlab - matlab

I've reached a stage where my arrays have become massive and a single function takes about 2 days to compute.
I am working with image processing and using kmeans and gmm - fitgmdist.
I have a workstation with Nvidia Tesla GPU's which are on the supported list and I would like to use their processing power to help speed up my work.
Looking into the documentation, I understand that in order to use the GPU functions all I have to do is to pass the array that is being passed to the functions to the GPU first. i.e.
model_feats = get_feats(all_imges);
kmeans = kmeans(model_feats, gaussians, 'EmptyAction','singleton', 'MaxIter',1000);
gmm{i} = fitgmdist(model_feats, 128, 'Options',statset('MaxIter',1000), ...
'CovType','diagonal', 'SharedCov',false, 'Regularize',0.01, 'Start',cInd);
All of my processing time is taken up by these two functions. So if I am to use the GPU cores, is all that I have to do is use the gpuArray function? For example the above will become:
temp_feats = get_feats(all_imges);
model_feats = gpuArray(temp_feats);
kmeans = kmeans(model_feats, gaussians, 'EmptyAction','singleton', 'MaxIter',1000);
gmm{i} = fitgmdist(model_feats, 128, 'Options',statset('MaxIter',1000), ...
'CovType','diagonal', 'SharedCov',false, 'Regularize',0.01, 'Start',cInd);
Will this work? Will it work for any function by first passing the array to gpuArray?
P.S. Sorry I have to ask here rather than just try it myself, but I do
not have access to the workstation as of now, but I can request access
to it. Before I request access to it I wanted to make sure if my
script will work with gpuArray.

Unfortunately, in short the answer to your question is No, it wont work.
The matlab GPU support is all but partial. The currently supported functions which accepts gpuArray inputs are given in: http://de.mathworks.com/help/distcomp/run-built-in-functions-on-a-gpu.html
So in my understanding, since kmeans is not in the list, it should not work. Someone please correct me if I'm wrong.
But on the other hand, if you do a google search, you can see 3rd party matlab implementations of kmeans on GPUs. Since I cant grantee the quality of the code, I wouldn't be posting a link.
Good luck!

Related

How to create a "Denoising Autoencoder" in Matlab?

I know Matlab has the function TrainAutoencoder(input, settings) to create and train an autoencoder. The result is capable of running the two functions of "Encode" and "Decode".
But this is only applicable to the case of normal autoencoders. What if you want to have a denoising autoencoder? I searched and found some sample codes, where they used the "Network" function to convert the autoencoder to a normal network and then Train(network, noisyInput, smoothOutput)like a denoising autoencoder.
But there are multiple missing parts:
How to use this new network object to "encode" new data points? it doesn't support the encode().
How to get the "latent" variables to the features, out of this "network'?
I appreciate if anyone could help me resolve this issue.
Thanks,
-Moein
At present (2019a), MATALAB does not permit users to add layers manually in autoencoder. If you want to build up your own, you will have start from the scratch by using layers provided by MATLAB;
In order to to use TrainNetwork(...) to train your model, you will have you find out a way to insert your data into an object called imDatastore. The difficulty for autoencoder's data is that there is NO label, which is required by imDatastore, hence you will have to find out a smart way to avoid it--essentially you are to deal with a so-called OCC (One Class Classification) problem.
https://www.mathworks.com/help/matlab/ref/matlab.io.datastore.imagedatastore.html
Use activations(...) to dump outputs from intermediate (hidden) layers
https://www.mathworks.com/help/deeplearning/ref/activations.html?searchHighlight=activations&s_tid=doc_srchtitle
I swang between using MATLAB and Python (Keras) for deep learning for a couple of weeks, eventually I chose the latter, albeit I am a long-term and loyal user to MATLAB and a rookie to Python. My two cents are that there are too many restrictions in the former regarding deep learning.
Good luck.:-)
If you 'simulation' means prediction/inference, simply use activations(...) to dump outputs from any intermediate (hidden) layers as I mentioned earlier so that you can check them.
Another way is that you construct an identical network but with the encoding part only, copy your trained parameters into it, and feed your simulated signals.

matlab running all linprog algortithms (is there a matlab-list of algorithms?)

Matlab offers multiple algorithms for solving Linear Programs.
For example Matlab R2012b offers: 'active-set', 'trust-region-reflective', 'interior-point', 'interior-point-convex', 'levenberg-marquardt', 'trust-region-dogleg', 'lm-line-search', or 'sqp'.
But other versions of Matlab support different algorithms.
I would like to run a loop over all algorithms that are supported by the users Matlab-Version. And I would like them to be ordered like the recommendation order of Matlab.
I would like to implement something like this:
i=1;
x=[];
while (isempty(x))
options=optimset(options,'Algorithm',Here_I_need_a_list_of_Algorithms(i))
x = linprog(f,A,b,Aeq,beq,lb,ub,x0,options);
end
In 99% this code should be equivalent to
x = linprog(f,A,b,Aeq,beq,lb,ub,x0,options);
but sometimes the algorithm gives back an empty array because of numerical problems (exitflag -4). If there is a chance that one of the other algorithms can find a solution I would like to try them too.
So my question is:
Is there a possibility to automatically get a list of all linprog-algorithms that are supported by the installed Matlab-version ordered like Matlab recommends them.
I think looping through all algorithms can make sense in other scenarios too. For example when you need very precise data and have a lot of time, you could run them all and than evaluate which gives the best results.
Or one would like to loop through all algorithms, if one wants to find which algorithms is the best for LPs with a certain structure.
There's no automatic way to do this as far as I know. If you really want to do it, the easiest thing to do would be to go to the online documentation, and check through previous versions (online documentation is available for old versions, not just the most recent release), and construct some variables like this:
r2012balgos = {'active-set', 'trust-region-reflective', 'interior-point', 'interior-point-convex', 'levenberg-marquardt', 'trust-region-dogleg', 'lm-line-search', 'sqp'};
...
r2017aalgos = {...};
v = ver('matlab');
switch v.Release
case '(R2012b)'
algos = r2012balgos;
....
case '(R2017a)'
algos = r2017aalgos;
end
% loop through each of the algorithms
Seems boring, but it should only take you about 30 minutes.
There's a reason MathWorks aren't making this as easy as you might hope, though, because what you're asking for isn't a great idea.
It is possible to construct artificial problems where one algorithm finds a solution and the others don't. But in practice, typically if the recommended algorithm doesn't find a solution this doesn't indicate that you should switch algorithms, it indicates that your problem wasn't well-formulated, and you should consider modifying it, perhaps by modifying some constraints, or reformulating the objective function.
And after all, why stop with just looping through the alternative algorithms? Why not also loop through lots of values for other options such as constraint tolerances, optimality tolerances, maximum number of function evaluations, etc.? These may have just as much likelihood of affecting things as a choice of algorithm. And soon you're running an optimisation algorithm to search through the space of meta-parameters for your original optimisation.
That's not a great plan - probably better to just choose one of the recommended algorithms, stick to that, and if things don't work out then focus on improving your formulation of the problems rather than over-tweaking the optimisation itself.

How to use parallel 'for' loop in Octave or Scilab?

I have two for loops running in my Matlab code. The inner loop is parallelized using Matlabpool in 12 processors (which is maximum Matlab allows in a single machine).
I dont have Distributed computing license. Please help me how to do it using Octave or Scilab. I just want to parallelize 'for' loop ONLY.
There are some broken links given while I searched for it in google.
parfor is not really implemented in octave yet. The keyword is accepted, but is a mere synonym of for (http://octave.1599824.n4.nabble.com/Parfor-td4630575.html).
The pararrayfun and parcellfun functions of the parallel package are handy on multicore machines.
They are often a good replacement to a parfor loop.
For examples, see
http://wiki.octave.org/Parallel_package.
To install, issue (just once)
pkg install -forge parallel
And then, once on each session
pkg load parallel
before using the functions
In Scilab you can use parallel_run:
function a=g(arg1)
a=arg1*arg1
endfunction
res=parallel_run(1:10, g);
Limitations
uses only one core on Windows platforms.
For now, parallel_run only handles arguments and results of scalar matrices of real values and the types argument is not used
one should not rely on side effects such as modifying variables from outer scope : only the data stored into the result variables will be copied back into the calling environment.
macros called by parallel_run are not allowed to use the JVM
no stack resizing (via gstacksize() or via stacksize()) should take place during a call to parallel_run
In GNU Octave you can use the parfor construct:
parfor i=1:10
# do stuff that may run in parallel
endparfor
For more info: help parfor
To see a list of Free and Open Source alternatives to MATLAB-SIMULINK please check its Alternativeto page or my answer here. Specifically for SIMULINK alternatives see this post.
something you should consider is the difference between vectorized, parallel, concurrent, asynchronous and multithreaded computing. Without going much into the details vectorized programing is a way to avoid ugly for-loops. For example map function and list comprehension on Python is vectorised computation. It is the way you write the code not necesarily how it is being handled by the computer. Parallel computation, mostly used for GPU computing (data paralleism), is when you run massive amount of arithmetic on big arrays, using GPU computational units. There is also task parallelism which mostly refers to ruing a task on multiple threads, each processed by a separate CPU core. Concurrent or asynchronous is when you have just one computational unit, but it does multiple jobs at the same time, without blocking the processor unconditionally. Basically like a mom cooking and cleaning and taking care of its kid at the same time but doing only one job at the time :)
Given the above description there are lot in the FOSS world for each one of these. For Scilab specifically check this page. There is MPI interface for distributed computation (multithreading/parallelism on multiple computers). OpenCL interfaces for GPU/data-parallel computation. OpenMP interface for multithreading/task-parallelism. The feval functions is not parallelism but a way to vectorize a conventional function.Scilab matrix arithmetic and parallel_run are vectorized or parallel depending to the platform, hardware and version of the Scilab.

Fast Algorithms for Finding Pairwise Euclidean Distance (Distance Matrix)

I know matlab has a built in pdist function that will calculate pairwise distances. However, my matrix is so large that its 60000 by 300 and matlab runs out of memory.
This question is a follow up on Matlab euclidean pairwise square distance function.
Is there any workaround for this computational inefficiency. I tried manually coding the pairwise distance calculations and it usually takes a full day to run (sometimes 6 to 7 hours).
Any help is greatly appreciated!
Well, I couldn't resist playing around. I created a Matlab mex C file called pdistc that implements pairwise Euclidean distance for single and double precision. On my machine using Matlab R2012b and R2015a it's 20–25% faster than pdist(and the underlying pdistmex helper function) for large inputs (e.g., 60,000-by-300).
As has been pointed out, this problem is fundamentally bounded by memory and you're asking for a lot of it. My mex C code uses minimal memory beyond that needed for the output. In comparing its memory usage to that of pdist, it looks like the two are virtually the same. In other words, pdist is not using lots of extra memory. Your memory problem is likely in the memory used up before calling pdist (can you use clear to remove any large arrays?) or simply because you're trying to solve a big computational problem on tiny hardware.
So, my pdistc function likely won't be able to save you memory overall, but you may be able to use another feature I built in. You can calculate chunks of your overall pairwise distance vector. Something like this:
m = 6e3;
n = 3e2;
X = rand(m,n);
sz = m*(m-1)/2;
for i = 1:m:sz-m
D = pdistc(X', i, i+m); % mex C function, X is transposed relative to pdist
... % Process chunk of pairwise distances
end
This is considerably slower (10 times or so) and this part of my C code is not optimized well, but it will allow much less memory use – assuming that you don't need the entire array at one time. Note that you could do the same thing much more efficiently with pdist (or pdistc) by creating a loop where you passed in subsets of X directly, rather than all of it.
If you have a 64-bit Intel Mac, you won't need to compile as I've included the .mexmaci64 binary, but otherwise you'll need to figure out how to compile the code for your machine. I can't help you with that. It's possible that you may not be able to get it to compile or that there will be compatibility issues that you'll need to solve by editing the code yourself. It's also possible that there are bugs and the code will crash Matlab. Also, note that you may get slightly different outputs relative to pdist with differences between the two in the range of machine epsilon (eps). pdist may or may not do fancy things to avoid overflows for large inputs and other numeric issues, but be aware that my code does not.
Additionally, I created a simple pure Matlab implementation. It is massively slower than the mex code, but still faster than a naïve implementation or the code found in pdist.
All of the files can be found here. The ZIP archive includes all of the files. It's BSD licensed. Feel free to optimize (I tried BLAS calls and OpenMP in the C code to no avail – maybe some pointer magic or GPU/OpenCL could further speed it up). I hope that it can be helpful to you or someone else.
On my system the following is the fastest (Even faster than the C code pdistc by #horchler):
function [ mD ] = CalcDistMtx ( mX )
vSsqX = sum(mX .^ 2);
mD = sqrt(bsxfun(#plus, vSsqX.', vSsqX) - (2 * (mX.' * mX)));
end
You'll need a very well tuned C code to beat this, I think.
Update
Since MATLAB R2016b MATLAB supports implicit broadcasting without the use of bsxfun().
Hence the code can be written:
function [ mD ] = CalcDistMtx ( mX )
vSsqX = sum(mX .^ 2, 1);
mD = sqrt(vSsqX.'+ vSsqX - (2 * (mX.' * mX)));
end
A generalization is given in my Calculate Distance Matrix project.
P. S.
Using MATLAB's pdist for comparison: squareform(pdist(mX.')) is equivalent to CalcDistMtx(mX).
Namely the input should be transposed.
Computers are not infinitely large, or infinitely fast. People think that they have a lot of memory, a fast CPU, so they just create larger and larger problems, and then eventually wonder why their problem runs slowly. The fact is, this is NOT computational inefficiency. It is JUST an overloaded CPU.
As Oli points out in a comment, there are something like 2e9 values to compute, even assuming you only compute the upper or lower half of the distance matrix. (6e4^2/2 is approximately 2e9.) This will require roughly 16 gigabytes of RAM to store, assuming that only ONE copy of the array is created in memory. If your code is sloppy, you might easily double or triple that. As soon as you go into virtual memory, things get much slower.
Wanting a big problem to run fast is not enough. To really help you, we need to know how much RAM is available. Is this a virtual memory issue? Are you using 64 bit MATLAB, on a CPU that can handle all the needed RAM?

MATLAB Parallel Parfor Memory Usage Interrogation

I am new here for asking questions though I have found solutions to many problems here before.
This particular problem I cannot seem to find an answer for so I though I would join up and ask.
I am using the Parallel computing toolbox to run multiple simulations at once, the code I am developing is to be deployed on a single core so there is no need for converting the algorithm to parallel.
The data structures being created by each of the simulations are large, and running 8 simulations at once is using all of the available RAM in my machine (4GB).
I am currently looking at reducing the memory used by each simulation, and was wondering if anybody knew how to get memory usage info from each of the instances of the function.
So far I have been calling:
parfor i=1:8
[IR(:, i) Data(i)] = feval(F, NX, NY(i), SR, NS, i);
end
And inside function F
[usr, sys] = memory;
format short eng;
TEST.Mem = usr.MemUsedMATLAB;
But this understandably is returning the memory being used by all 8 instances of F.
I would like to get the information from each instance of F.
Note: The data structure TEST is returned as Data to the top level function.
Thanks in advance for any help.
you can use the matlab profiler to get a hint at the memory usage:
% start profiler
profile -memory on;
% start your simulation
my_sim();
% look at profiler results
profile viewer