Using more than one GPU in MATLAB

This is the output of ginfo using Jacket/MATLAB:
Detected CUDA-capable GPUs:
CUDA driver 270.81, CUDA toolkit 4.0
GPU0 Tesla C1060, 4096 MB, Compute 1.3 (single,double) (in use)
GPU1 Tesla C1060, 4096 MB, Compute 1.3 (single,double)
GPU2 Quadro FX 1800, 742 MB, Compute 1.1 (single)
Display Device: GPU2 Quadro FX 1800
My questions are:
Can I use the two Teslas at the same time (e.g. with parfor)? How?
How can I find out how many cores are currently executing the program?
After running the following code with the Quadro selected (in use), I found that it takes less time than on the Tesla, even though the Tesla has 240 cores and the Quadro has only 64. Is that because the Quadro is the display device? Or because it only does single precision while the Tesla does double precision?
clc; clear all;close all;
addpath ('C:/Program Files/AccelerEyes/Jacket/engine');
i = im2double(imread('cameraman.tif'));
i_gpu=gdouble(i);
h=fspecial('motion',50,45);% Create predefined 2-D filter
h_gpu=gdouble(h);
tic;
for j=1:500
x_gpu = imfilter( i_gpu,h_gpu );
end
i2 = double(x_gpu); %memory transfer
t=toc
figure(2), imshow(i2);
Any help with the code will be appreciated. As you can see, it's a very trivial example, meant only to demonstrate the power of the GPU.

Using two Teslas at the same time: write a MEX file and call cudaSetDevice(0), launch one kernel, then call cudaSetDevice(1) and launch another kernel. Kernel launches and asynchronous memory copies (i.e., cudaMemcpyAsync and cudaMemcpyPeerAsync) return without waiting for the GPU, so both devices can work concurrently. I've given an example of how to write a MEX file (i.e., a DLL) in one of my other answers. Just add a second kernel to that example. FYI, you don't need Jacket if you can do some C/C++ programming. On the other hand, if you don't want to spend your time learning the CUDA SDK, or you don't have a C/C++ compiler, then you're stuck with Jacket or gp-you or GPUlib until MATLAB changes the way that parfor works.
An alternative is to call OpenCL from Matlab (again through a MEX file). Then you could launch kernels on all the GPUs and CPUs. Again, this requires some C/C++ programming.

As of MATLAB 2012, gpuArray and the related GPU functions are built into MATLAB (via the Parallel Computing Toolbox), so you might not need Jacket to achieve what you are trying to do.
In short, call gpuDevice(deviceID); before running GPU commands; the code that follows then runs on that GPU. For instance:
gpuDevice(1);
a = gpuArray(rand(3)); % a lives in the memory of the first GPU
gpuDevice(2);
b = gpuArray(rand(4)); % b lives in the memory of the second GPU
To use multiple GPUs at once, simply run:
num_device = gpuDeviceCount; % number of GPUs visible to MATLAB
c = cell(1,num_device);
parfor i = 1:num_device
gpuDevice(i);
a = gpuArray(magic(3));
b = gpuArray(rand(3));
c{i} = gather(a*b);
end
You can see the GPU memory usage by typing nvidia-smi on the system command line.
This way of setting the GPU ID may seem strange, but it is the conventional way. In CUDA, if you want to use a specific GPU, you call cudaSetDevice(gpuId) and the code that follows runs on that GPU (0-based indexing).
----------------------EDIT----------------
Tested and confirmed on MATLAB 2012b, MATLAB 2013b.
I checked with nvidia-smi that the code actually uses memory on different GPUs. You may have to scale the problem up considerably (e.g. rand(5000)) and check quickly, since the temporary variables a and b disappear once the parfor loop ends.
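The parfor loop above only spreads work over several GPUs if the parallel pool has at least one worker per device. A minimal sketch of that setup, assuming the Parallel Computing Toolbox and a release that has parpool (R2013b or later; older releases used matlabpool). The variable names are illustrative, and parpool may refuse the request if the cluster profile allows fewer workers than there are GPUs:

```matlab
numDevices = gpuDeviceCount;            % number of GPUs MATLAB can see
if numDevices > 0
    pool = gcp('nocreate');             % reuse an existing pool if there is one
    if isempty(pool)
        pool = parpool(numDevices);     % assumption: one worker per GPU is allowed
    end
    results = cell(1, numDevices);
    parfor k = 1:numDevices
        gpuDevice(k);                   % select the k-th GPU on this worker (1-based)
        a = gpuArray(magic(3));
        b = gpuArray(rand(3));
        results{k} = gather(a * b);     % copy the product back to host memory
    end
end
```

If a pool with fewer workers already exists, the loop still runs, but some iterations share a worker and therefore execute one after another.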

Related

Why is it faster to transfer data from CPU to GPU rather than GPU to CPU?

I've noticed that transferring data to recent high-end GPUs is faster than gathering it back to the CPU. Here are the results using a benchmarking function provided to me by MathWorks technical support, running on an older Nvidia K20 and a recent Nvidia P100 with PCIe:
Using a Tesla P100-PCIE-12GB GPU.
Achieved peak send speed of 11.042 GB/s
Achieved peak gather speed of 4.20609 GB/s
Using a Tesla K20m GPU.
Achieved peak send speed of 2.5269 GB/s
Achieved peak gather speed of 2.52399 GB/s
I've attached the benchmark function below for reference. What is the reason for the asymmetry on the P100? Is this system dependent or is it the norm on recent high end GPUs? Can the gather speed be increased?
gpu = gpuDevice();
fprintf('Using a %s GPU.\n', gpu.Name)
sizeOfDouble = 8; % Each double-precision number needs 8 bytes of storage
sizes = power(2, 14:28);
sendTimes = inf(size(sizes));
gatherTimes = inf(size(sizes));
for ii=1:numel(sizes)
numElements = sizes(ii)/sizeOfDouble;
hostData = randi([0 9], numElements, 1);
gpuData = randi([0 9], numElements, 1, 'gpuArray');
% Time sending to GPU
sendFcn = @() gpuArray(hostData);
sendTimes(ii) = gputimeit(sendFcn);
% Time gathering back from GPU
gatherFcn = @() gather(gpuData);
gatherTimes(ii) = gputimeit(gatherFcn);
end
sendBandwidth = (sizes./sendTimes)/1e9;
[maxSendBandwidth,maxSendIdx] = max(sendBandwidth);
fprintf('Achieved peak send speed of %g GB/s\n',maxSendBandwidth)
gatherBandwidth = (sizes./gatherTimes)/1e9;
[maxGatherBandwidth,maxGatherIdx] = max(gatherBandwidth);
fprintf('Achieved peak gather speed of %g GB/s\n',max(gatherBandwidth))
Edit: we now know it is not system dependent (see comments). I still want to know the reason for the asymmetry and whether it can be changed.
This is a CW for anybody interested in posting benchmarks from their machine. Contributors are encouraged to leave their details in case some future question arises regarding their results.
System: Win10, 32GB DDR4-2400Mhz RAM, i7 6700K. MATLAB: R2018a.
Using a GeForce GTX 660 GPU.
Achieved peak send speed of 7.04747 GB/s
Achieved peak gather speed of 3.11048 GB/s
Warning: The measured time for F may be inaccurate because it is running too fast. Try measuring something that takes longer.
Contributor: Dev-iL
System: Win7, 32GB RAM, i7 4790K. MATLAB: R2018a.
Using a Quadro P6000 GPU.
Achieved peak send speed of 1.43346 GB/s
Achieved peak gather speed of 1.32355 GB/s
Contributor: Dev-iL
I am not familiar with Matlab GPU toolboxes, but I suspect that the second transfer (that gets data back from GPU) starts before the first has ended.
% Time sending to GPU
sendFcn = @() gpuArray(hostData);
sendTimes(ii) = gputimeit(sendFcn);
%
%No synchronization here
%
% Time gathering back from GPU
gatherFcn = @() gather(gpuData);
gatherTimes(ii) = gputimeit(gatherFcn);
A similar question, for a C program, was posted here:
copy from GPU to CPU is slower than copying CPU to GPU
In that case, there is no explicit synchronization between launching a kernel on the GPU and copying the results back.
So the function that copies the data back, cudaMemcpy() in C, has to wait for the previously launched kernel to finish before it can transfer anything, which inflates the measured transfer time.
With the CUDA C API, you can force the CPU to wait until the previously launched kernel(s) have finished with:
cudaDeviceSynchronize();
Only then start timing the transfer back.
Maybe MATLAB has a similar synchronization primitive.
The same answer also recommends measuring time with CUDA events.
This post on optimizing data transfers (also in C, sorry) uses events to measure transfer times:
https://devblogs.nvidia.com/how-optimize-data-transfers-cuda-cc/
In that post, the transfer time is the same in both directions.
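On the MATLAB side, the Parallel Computing Toolbox does offer a synchronization call: wait(gpu) blocks until the device has finished its pending work. A minimal sketch of timing a gather with explicit synchronization, assuming a supported GPU is present; the buffer size and variable names are arbitrary choices for illustration. gputimeit, as used in the question's benchmark, is documented to synchronize on its own, so this mainly matters when timing with tic/toc:

```matlab
gpu = gpuDevice();                       % currently selected GPU
numBytes = 2^26;                         % 64 MiB test buffer (arbitrary size)
hostData = rand(numBytes / 8, 1);        % doubles take 8 bytes each
gpuData = gpuArray(hostData);            % upload to the device

wait(gpu);                               % block until the upload and any queued work finish
tic;
backOnHost = gather(gpuData);            % download to host memory
tGather = toc;

fprintf('Gather bandwidth: %g GB/s\n', numBytes / tGather / 1e9);
```

A single tic/toc measurement is noisy; repeating the timed section and taking the minimum gives a steadier figure.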

Matlab out of memory error while solving ODE

I have to integrate a system of ODEs with 8 variables in MATLAB. My simulation time is 5e9 with a time step of 0.1, but I get an out-of-memory error. I am working on a 2.6 GHz Core i7 with 8 GB of RAM. How can I simulate ODEs over such a large number of time samples?
Assuming you're working on a 64-bit version of MATLAB, you can let MATLAB push memory use to the limit via Preferences -> MATLAB -> Workspace -> MATLAB Array Size Limit.
If you are getting this error because you really have exhausted the system's memory, do the following:
Make sure you're using a 64-bit OS and a 64-bit version of MATLAB.
Before you call the ODE function, manually clear (using the clear() function) variables you no longer need (or can recreate once the function finishes).
Increase the swap file of your system. It will help with larger memory consumption but might make things much slower.
You can find more tips and tricks in Resolve "Out of Memory" Errors and memory().
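To see how far this particular run is from fitting in 8 GB, it helps to estimate the size of the output. A rough calculation, assuming the solver returns every step in double precision (8 bytes per value) and ignoring the solver's own work arrays:

```matlab
tEnd = 5e9;                 % simulation time from the question
dt = 0.1;                   % time step from the question
nVars = 8;                  % number of state variables
bytesPerDouble = 8;

nSteps = tEnd / dt;                                   % 5e10 time points
stateBytes = nSteps * nVars * bytesPerDouble;         % solution matrix
timeBytes = nSteps * bytesPerDouble;                  % time vector
totalTB = (stateBytes + timeBytes) / 1e12;            % total in terabytes

fprintf('Estimated output size: %.1f TB\n', totalTB);
```

Under these assumptions the output alone is about 3.6 TB, several hundred times the installed RAM, so the memory settings above are unlikely to be enough on their own; keeping the solution at far fewer time points, or integrating in segments and saving each segment to disk, would reduce what has to be held in memory.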

How to enable multithreading in MATLAB?

Some of MATLAB's functions support multithreading and will make use of your multi-core architecture when available. To be clear, I'm not referring to MATLAB's support for parallel execution when you explicitly invoke it, e.g. using parfor.
In my code I'm running imregtform. My issue is that on one machine (Win 8, x64, MATLAB 2014b) the function (called thousands of times) maxes out all my CPUs, whereas on another machine (Win 7, x64, MATLAB 2014a) it uses barely half of my CPUs, and those only at about 20%. Why is that? Is there a switch somewhere?
I've tried some of the suggestions found in:
Checking if MATLAB is running in multithread mode and Matlab 2011a Use all Cores Available on 64 bit Linux?.
Any other suggestions?
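One thing worth comparing between the two machines is the number of computational threads MATLAB itself reports. A small sketch using the documented maxNumCompThreads function; whether imregtform honors this setting on a given release is an assumption, not something established above:

```matlab
nThreads = maxNumCompThreads;   % current maximum number of computational threads
fprintf('MATLAB may use up to %d computational threads.\n', nThreads);
```

A value of 1 can mean that MATLAB was started with the -singleCompThread option.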

Why does complex Matlab gpuArray take twice as much memory than it should?

I noticed that a large complex array takes twice as much memory on GPU than on CPU.
Here is a minimal example:
%-- First Try: Complex Single
gpu = gpuDevice(1);
m1 = gpu.FreeMemory;
test = complex(single(zeros(600000/8,1000))); % 600 MByte complex single
whos('test')
test = gpuArray(test);
fprintf(' Used memory on GPU: %e\n', m1-gpu.FreeMemory);
Now I do the same with a real (non-complex) array that has twice as many elements:
%-- Second Try: Real Single
gpu = gpuDevice(1);
m1 = gpu.FreeMemory;
test = single(zeros(600000/4,1000)); % 600 MByte real single
whos('test')
test = gpuArray(test);
fprintf(' Used memory on GPU: %e\n', m1-gpu.FreeMemory);
The output is:
Name      Size            Bytes        Class     Attributes
test      75000x1000      600000000    single    complex
 Used memory on GPU: 1.200095e+09
Name      Size            Bytes        Class     Attributes
test      150000x1000     600000000    single
 Used memory on GPU: 6.000476e+08
On the CPU both arrays are 600MB - on the GPU the complex array uses 1.2 GByte.
I tested this on two graphics cards: GeForce GTX 680 and Tesla K20 using Matlab 2013a.
How can I avoid this? Is this a bug in Matlab?
This was answered on MATLAB central. To summarize MathWorks developer Edric Ellis's answer:
gpu.FreeMemory may not be an accurate measure of the available GPU memory because MATLAB does not immediately free up memory when it's done using it. gpu.AvailableMemory is a more accurate measure of available memory.
Transferring complex data to/from the GPU still requires 2x the memory because there is a format conversion that is done on the GPU. Specifically, complex arrays in CPU host memory are stored with the real/imaginary parts split into 2 separate vectors, whereas complex arrays on the GPU device are stored in interleaved format.
Testing on R2017a, I confirmed that:
Switching from gpu.FreeMemory to gpu.AvailableMemory indeed addresses this discrepancy in reported memory usage that prompted the original question.
With 8 GB of GPU memory, copying...
6 GB real array: success
6 GB complex array: "Out of memory on device" error
6 separate 1 GB complex arrays: success
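These three results are consistent with a simple model of the transfer. A back-of-the-envelope check, assuming the format conversion needs a temporary buffer exactly as large as the array being copied and that the full 8 GB is usable (both are simplifying assumptions for illustration):

```matlab
gpuMemGB = 8;       % total GPU memory in the test above
arrayGB = 6;        % one large complex array
chunkGB = 1;        % size of each smaller complex array
numChunks = 6;

% Peak while copying the single array: the data plus an equally large temporary.
peakSingleGB = 2 * arrayGB;
% Peak while copying the last chunk: the earlier chunks plus 2x the current one.
peakChunkedGB = (numChunks - 1) * chunkGB + 2 * chunkGB;

fprintf('Peak for one 6 GB copy: %d GB, peak for six 1 GB copies: %d GB\n', ...
    peakSingleGB, peakChunkedGB);
```

In this model the single copy peaks at 12 GB, which exceeds the 8 GB available, while the chunked copies peak at 7 GB and fit.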

Very slow execution of Matlab code under ubuntu

I was using MATLAB 2012a under Windows 7 to run some intensive code (intensive in both memory usage and processing time), and it worked fine. I have now switched to Ubuntu 12.04 and installed MATLAB 2013a. Memory use is considerably lower than it was on Windows, but the same code takes far longer to execute.
I should mention that my code contains nothing that should take this long, except for a call to sparse with a symbolic substitution as one of its arguments:
K=zeros(Np,Np);
for i=1:ord
K=K+sparse(t(1:ord,:),repmat(t(i,:),ord,1),double(subs(Kv(:,i),Arg(Kv,1,1,6),Arg(Kv,1,2,6))),Np,Np);
end
Note that Kv is a symbolic matrix, and Arg is a function that supplies the OLD and NEW arguments of subs; it depends on a number of global variables.
I have the feeling that I forgot to install something on Ubuntu that would speed up MATLAB code.
Any ideas ?
I had a similar problem on Windows, but I believe the solution is the same on Ubuntu LTS.
If you increase MATLAB's Java heap memory, MATLAB will consume more of your system's memory, but it will run faster.
To do that, go to:
File -> Preferences -> General -> Java Heap Memory and increase it to the maximum.
The default value is 128 MB, which is too little.
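Before changing the setting, the current limit can be checked from the command window. A small sketch that queries the JVM through MATLAB's Java interface; it assumes MATLAB was started with its JVM (that is, not with -nojvm), and the variable names are illustrative:

```matlab
rt = java.lang.Runtime.getRuntime();                        % JVM runtime object
maxHeapMB = rt.maxMemory() / 2^20;                          % limit the heap may grow to
usedHeapMB = (rt.totalMemory() - rt.freeMemory()) / 2^20;   % heap currently in use

fprintf('Java heap: %.0f MB used of %.0f MB maximum\n', usedHeapMB, maxHeapMB);
```

If the used figure stays close to the maximum while the code runs, the heap limit is a plausible bottleneck.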
If raising the heap memory limit doesn't fix the issue, try raising the priority of the MATLAB process.
First start MATLAB, then run:
ps aux|grep MATLAB
In my case the result is:
comtom 9769 28.2 19.8 4360632 761808 tty2 S<l+ 14:00 1:50 /usr/local/MATLAB/MATLAB_Production_Server/R2015a/bin/glnxa64/MATLAB -desktop
Look at the first number (the PID). Then use it with the renice command to change the process priority (a negative value raises the priority and normally requires root):
renice -3 -p 9769
That's it. The GUI is very slow because it's built against outdated Xorg libraries. Changing the priority helps; you may notice some tearing in GNOME effects, but MATLAB's interface will respond a lot better.