GPU perfomance request, what's the best solution?

GPU perfomance request, what's the best solution? - matlab

I work on an audio processing project that needs to do a lot of basic computations (+, -, *) like a FFT (Fast Fourier Transform) calculation.
We're considering using a graphics card to accelerate these computations. But we don't know if this is the best solution. Our desired solution needs to be a good computation system costing less than $500.
We use Matlab programming, and we have a sound card acquisition which have to be plug in the system.
Do you know a solution other than graphics card + motherboard to do lot of calculus?

You can use the free Matlab CUDA library to perform the computations on the GPU. $500 will give you a very decent NVIDIA GPU. Beware that GPU's have limited video memory and will run out of memory with large data volumes even faster than Matlab.
I have benchmarked an 8core intel CPU against an 8800 Nvidia GPU (128streams) with GPUMat , for 512Kb datasets the GPU spun out at the same speed as the 8 core intel at 2Ghz, including transfer times to the GPU memory. For serious GPU work I recommend a dedicated card compared to the one you are using to drive the monitor. Use the motherboard cheapie intel video to drive the monitor and pass the array computes to the Nvidia.

Parallel Computing Toolbox from MathWorks now includes GPU support. In particular, elementwise operations and arithmetic are supported, as well as 1- and 2-dimensional FFTs (along with a whole bunch of other stuff to support hand-written CUDA code if you have that). If you're interested in performing calculations in double-precision, the recent Tesla and Quadro branded cards will give you the best performance.
Here's a trivial example showing how you might use the GPU in MATLAB using Parallel Computing Toolbox:
gA = gpuArray( rand(1000) );
gB = fft( 1 + gA * 3 );

Related

Training a neural network using CPU only

I'm working on a virtual machine on a remote server and I want to train a neural network on it but I don't have GPUs to use in this VM. is it possible to train the net on this VM using CPU only? and if that the case, does it work with a large dataset or that will be a problem?

I have used Tensorflow for training a deep neural network. I have used it with GPU and CPU only. The rest of my response is in context of Tensorflow.
Please be aware that Convolution Neural Nets are generally more resource hungry than standard regular feed forward neural networks because CNNs deal with much higher dimensional data. If you are not working deep CNNs then you may be all right to use CPU to and restrict to smaller datasets.
In my scenario, initially I was training with CPU only and then moved on to GPU mode because of speed improvements.
Example of speed
I was able to train the entire MNIST with when using GPU in under 15 minutes. Training on CPU was much slower but you can still learn by cutting down the on the size of the training data set.
Tensorflow with GPU
https://www.tensorflow.org/install/gpu
You will need to go through all the installation steps. This involves not only installing Tensorflow but also CUDA libraries.
What is CUDA?
CUDA is the specification developed by NVIDIA for programming with GPU. They provide their native libraries which talk to the underlying hardware.
https://docs.nvidia.com/cuda/
How to use TensorFlow GPU?

trainAutoencoder slows down when using GPU?

I am trying to get into deep learning using the neural network library in matlab. A good starting step seems to be training an autoencoder. In that respect, it would be good to see whether I am getting the msot out of my gpu.
In this connection, When I run
tic
autoenc1 = trainAutoencoder(allSets,5,...
'L2WeightRegularization',0.001,...
'SparsityRegularization',1,...
'SparsityProportion',0.2,...
'DecoderTransferFunction','logsig',...
'useGPU',true)
toc
I get "Elapsed time is 19.680823 seconds.".
However, not using the gpu (setting 'useGPU' to false) it only takes 8.272708 seconds.
I am puzzled by this, since I am assuming that using the gpu for neural networks will speed things up? Does anyone know of any way to check whether matlab and cuda are properly interfacing, or see how matlab is actually using the resources?
I have cuda 8.1 installed, and am using a GeForce GTX 960M (compute capability 5.0). The matlab version is 2016b.
EDIT: as has been pointed out, there is as of yet no cuda 8.1. What I do have is 8.0, and cudnn 5.1.

As pointed out in the comments, performing computations on the GPU is not necessarily faster. Instead, the impact on performance depends on the additional overhead of data conversion and transfer.
Usually, the overhead can be influenced via the batch size, but the trainAutoencoder function does not provide that option.
For general measurement and improvement of GPU performance in MATLAB, see this link.

Ways to accelerate reduce operation on Xeon CPU, GPU and Xeon Phi

I have an application where reduce operations (like sum, max) on a large matrix are bottleneck. I need to make this as fast as possible. Are there vector instructions in mkl to do that?
Is there a special hardware unit to deal with it on xeon cpu, gpu or mic?
How are reduce operations implemented in these hardware in general?

You can implement your own simple reductions using the KNC vpermd and vpermf32x4 instructions as well as the swizzle modifiers to do cross lane operations inside the vector units.
The C intrinsic function equivalents of these would be the mm512{mask}permute* and mm512{mask}swizzle* family.
However, I recommend that you first look at the array notation reduce operations, that already have high performance implementations on the MIC.
Look at the reduction operations available here and also check out this video by Taylor Kidd from Intel talking about array notation reductions on the Xeon Phi starting at 20mins 30s.
EDIT: I noticed you are also looking for CPU based solutions. The array notation reductions will work very well on the Xeon also.

Turns out none of the hardware have reduce operation circuit built-in.
I imagined a sixteen 17 bit adders attached to 128 bit vector register for reduce-sum operation. Maybe this is because no one has encountered a significant bottleneck with reduce operation. Well, the best solution i found is #pragma omp parallel for reduction in openmp. I am yet to test its performance though.

This operation is going to be bandwidth-limited and thus vectorization almost certainly doesn't matter. You want the hardware with the most memory bandwidth. An Intel Xeon Phi processor has more aggregate bandwidth (but not bandwidth-per-core) than a Xeon processor.

Accelerating MATLAB code using GPUs?

AccelerEyes announced in December 2012 that it works with Mathworks on the GPU code and has discontinued its product Jacket for MATLAB:
http://blog.accelereyes.com/blog/2012/12/12/exciting-updates-from-accelereyes/
Unfortunately they do not sell Jacket licences anymore.
As far as I understand, the Jacket GPU Array solution based on ArrayFire was much faster than the gpuArray solution provided by MATLAB.
I started working with gpuArray, but I see that many functions are implemented poorly. For example a simple
myArray(:) = 0
is very slow. I have written some custom CUDA-Kernels, but the poorly-implemented standard MATLAB functionality adds a lot of overhead, even if working with gpuArrays consistently throughout the code. I fixed some issues by replacing MATLAB code with hand written CUDA code - but I do not want to reimplement the MATLAB standard functionality.
Another feature I am missing is sparse GPU matrices.
So my questions are:
How do is speed up the badly implemented default GPU implementations provided by MATLAB? In particular, how do I speed up sparse matrix operations in MATLAB using the GPU?

MATLAB does support CUDA based GPU. You have to access it from the "Parallel Computing Toolbox". Hope these 2 links also help:
Parallel Computing Toolbox Features
Key Features
Parallel for-loops (parfor) for running task-parallel algorithms on multiple processors
Support for CUDA-enabled NVIDIA GPUs
Full use of multicore processors on the desktop via workers that run locally
Computer cluster and grid support (with MATLAB Distributed Computing Server)
Interactive and batch execution of parallel applications
Distributed arrays and single program multiple data (spmd) construct for large dataset handling and data-parallel algorithms
MATLAB GPU Computing Support for NVIDIA CUDA-Enabled GPUs
Using MATLAB for GPU computing lets you accelerate your applications with GPUs more easily than by using C or Fortran. With the familiar MATLAB language you an take advantage of the CUDA GPU computing technology without having to learn the intricacies of GPU architectures or low-level GPU computing libraries.
You can use GPUs with MATLAB through Parallel Computing Toolbox, which supports:
CUDA-enabled NVIDIA GPUs with compute capability 2.0 or higher. For releases 14a and earlier, compute capability 1.3 is sufficient.
GPU use directly from MATLAB
GPU-enabled MATLAB functions such as fft, filter, and several linear algebra operations
GPU-enabled functions in toolboxes: Image Processing Toolbox, Communications System Toolbox, Statistics and Machine Learning Toolbox, Neural Network Toolbox, Phased Array Systems Toolbox, and Signal Processing Toolbox (Learn more about GPU support for signal processing algorithms)
CUDA kernel integration in MATLAB applications, using only a single line of MATLAB code
Multiple GPUs on the desktop and computer clusters using MATLAB workers in Parallel Computing Toolbox and MATLAB Distributed Computing Server

I had the pleasure of attending a talk by John, the founder of AccelerEyes. They did not get the speedup because they just removed poorly written code and replaced it with code that saved a few bits here and there. Their speedup was mostly from exploiting the availability of cache and doing a lot of operations in-memory (GPU's). Matlab relied on transferring data between GPU and CPU, if I remember correctly, and hence the speedup was crazy.

Matlab and GPU/CUDA programming

I need to run several independent analyses on the same data set.
Specifically, I need to run bunches of 100 glm (generalized linear models) analyses and was thinking to take advantage of my video card (GTX580).
As I have access to Matlab and the Parallel Computing Toolbox (and I'm not good with C++), I decided to give it a try.
I understand that a single GLM is not ideal for parallel computing, but as I need to run 100-200 in parallel, I thought that using parfor could be a solution.
My problem is that it is not clear to me which approach I should follow. I wrote a gpuArray version of the matlab function glmfit, but using parfor doesn't have any advantage over a standard "for" loop.
Has this anything to do with the matlabpool setting? It is not even clear to me how to set this to "see" the GPU card. By default, it is set to the number of cores in the CPU (4 in my case), if I'm not wrong.
Am I completely wrong on the approach?
Any suggestion would be highly appreciated.
Edit
Thanks. I'm aware of GPUmat and Jacket, and I could start writing in C without too much effort, but I'm testing the GPU computing possibilities for a department where everybody uses Matlab or R. The final goal would be a cluster based on C2050 and the Matlab Distribution Server (or at least this was the first project).
Reading the ADs from Mathworks I was under the impression that parallel computing was possible even without C skills. It is impossible to ask the researchers in my department to learn C, so I'm guessing that GPUmat and Jacket are the better solutions, even if the limitations are quite big and the support to several commonly used routines like glm is non-existent.
How can they be interfaced with a cluster? Do they work with some job distribution system?

I would recommend you try either GPUMat (free) or AccelerEyes Jacket (buy, but has free trial) rather than the Parallel Computing Toolbox. The toolbox doesn't have as much functionality.
To get the most performance, you may want to learn some C (no need for C++) and code in raw CUDA yourself. Many of these high level tools may not be smart enough about how they manage memory transfers (you could lose all your computational benefits from needlessly shuffling data across the PCI-E bus).

Parfor will help you for utilizing multiple GPUs, but not a single GPU. The thing is that a single GPU can do only one thing at a time, so parfor on a single GPU or for on a single GPU will achieve the exact same effect (as you are seeing).
Jacket tends to be more efficient as it can combine multiple operations and run them more efficiently and has more features, but most departments already have parallel computing toolbox and not jacket so that can be an issue. You can try the demo to check.
No experience with gpumat.
The parallel computing toolbox is getting better, what you need is some large matrix operations. GPUs are good at doing the same thing multiple times, so you need to either combine your code somehow into one operation or make each operation big enough. We are talking a need for ~10000 things in parallel at least, although it's not a set of 1e4 matrices but rather a large matrix with at least 1e4 elements.
I do find that with the parallel computing toolbox you still need quite a bit of inline CUDA code to be effective (it's still pretty limited). It does better allow you to inline kernels and transform matlab code into kernels though, something that

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse