Multiple Tesla K80 GPUs and parfor loops - matlab

I received a computer with four Tesla K80 GPU cards, and I am trying to use parfor loops from the MATLAB PCT to speed up FFT calculations, but it is actually slower.
Here is what I am trying:
% pupil is a 512x512 array
parfor zz = 1:4
gd = gpuDevice;
d{zz} = gd.Index;
probe{zz} = gpuArray(pupil);
Essai{zz} = gpuArray(pupil);
end
tic;
parfor ii = 1:4
gd2 = gpuDevice;
d2{ii} = gd2.Index;
for i = 1:100
[Essai{ii}] = fftn(probe{ii});
end
end
toc
%%
Starting parallel pool (parpool) using the 'local' profile ... connected to 4 workers.
Elapsed time is 1.805763 seconds.
Elapsed time is 1.412928 seconds.
Elapsed time is 1.409559 seconds.
Starting parallel pool (parpool) using the 'local' profile ... connected to 8 workers.
Elapsed time is 0.606602 seconds.
Elapsed time is 0.297850 seconds.
Elapsed time is 0.294365 seconds.
%%
tic; for i = 1:400; Essai{1} = fftn( probe{1} ); end; toc
Elapsed time is 0.193579 seconds !!!
Why is opening 8 workers faster, when in principle I stored my variables on only 4 GPUs (out of 8)?
Also, how can I use a Tesla K80 as a single GPU?
Thanks, Nicolas

I doubt that parfor works well for multi-GPU systems. If speed is critical and you want to take full advantage of your GPUs, I suggest writing your own little CUDA script using the cuFFT library:
http://docs.nvidia.com/cuda/cufft/#multiple-GPU-cufft-transforms
Here is how to write your mex file containing CUDA code:
http://www.mathworks.com/help/distcomp/run-mex-functions-containing-cuda-code.html

Many thanks for your quick reply and for the links! It is true that I was trying to avoid CUDA, but it seems like the best option for spreading FFTs across several GPUs.
Although I thought that parfor and spmd were great tools for multiple GPUs...
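For what it's worth, here is a minimal sketch (untested, assuming a parallel pool with one worker per GPU device) of how spmd can pin each worker to its own GPU for the FFT loop above; pupil is the 512x512 array from the question:
% Minimal sketch: requires a pool with one worker per GPU device.
spmd
gpuDevice(labindex); % each worker selects its own GPU
probe = gpuArray(pupil); % pupil is broadcast to the worker and copied onto that GPU
Essai = probe;
for k = 1:100
Essai = fftn(probe); % the FFTs now run on the GPUs concurrently
end
Essai = gather(Essai); % bring each result back to the host
end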

Related

Why is the Matlab parfeval function slower than a normal call? [duplicate]

This question already has an answer here:
Why is this simple parallel Matlab program much slower than the non-parallel version?
When I call the function normally, execution time is much faster than with parfeval.
tic
f = parfeval(@magic,1,10000);
value = fetchOutputs(f);
toc
Elapsed time is 2.244390 seconds.
With parfeval, the magic call takes 2.24 seconds.
tic
magic(10000);
toc
Elapsed time is 0.592743 seconds.
But when I call it normally, it runs much faster. What is the reason for this, and how can I speed up parfeval?
In general, there is some overhead that needs to be considered when setting up parallel workers and transferring data to and from them (which is what parfeval does). This is the main reason for the time discrepancy.
When using any kind of parallel processing, you first have to determine whether the work runs long enough that the overhead from spawning and communicating with the workers is negligible. In this case, it isn't.
Testing a longer run case:
tic
test(1E10)
toc
tic
f = parfeval(@test, 1, 1E10);
value = fetchOutputs(f);
toc
function x = test(n)
x = 1;
for i = 1:n
x = x * 1;
end
end
This gives times (on my computer) of 5.51 and 5.49 seconds, respectively.
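To see how much of the original example's time is data movement rather than computation, one rough sketch (assuming a local pool is already running; ticBytes/tocBytes need a reasonably recent Parallel Computing Toolbox) is:
p = gcp; % start (or reuse) the pool outside the timed region
ticBytes(p); % start counting bytes sent to and from the workers
tic
f = parfeval(@magic, 1, 10000);
value = fetchOutputs(f); % blocks until the worker finishes, then copies the result back
toc
tocBytes(p) % the 10000x10000 double result alone is roughly 800 MB of transfer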

How to end loop if value does not change for X consecutive seconds in Matlab

I am collecting data from a potentiometer connected to an Arduino. In the script, I tell matlab to keep collecting data for 2 minutes. But I need to tell it that if the user does not move the potentiometer for 10 consecutive seconds, then it should stop the loop and move to the next session (write the data to an excel file). Does anybody have ideas on how to achieve this?
Probably tic and toc can help you.
tic starts a stopwatch timer. The function records the internal time at execution of the tic command.
toc reads the elapsed time from the stopwatch timer started by the tic function.
tic;
while toc < 10
% Do your loopy things
if variable_changed
tic; % Restart stopwatch
end
end
Furthermore, to be sure this tic won't interact with other tic/toc calls elsewhere in your code, you should store its value like this:
% First start stopwatch
time_since_last_movement = tic;
while toc(time_since_last_movement) < 10
% Do your loopy things
if variable_changed
time_since_last_movement = tic; % Restart stopwatch
end
end
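Putting the two limits together, a rough sketch of the acquisition loop could look like the following (the arduino object and readVoltage call are assumed to come from the MATLAB Support Package for Arduino; the 'A0' pin and the 0.01 V change threshold are only illustrative):
a = arduino; % assumed: Arduino support package is installed
session_timer = tic; % 2-minute overall limit
last_change = tic; % restarted whenever the potentiometer moves
prev_value = readVoltage(a, 'A0');
data = [];
while toc(session_timer) < 120 && toc(last_change) < 10
v = readVoltage(a, 'A0');
data(end+1) = v; %#ok<AGROW>
if abs(v - prev_value) > 0.01 % treat a noticeable change as "moved"
last_change = tic; % restart the 10-second stopwatch
end
prev_value = v;
end
writematrix(data.', 'session.xlsx'); % or xlswrite on older releases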

matlab filter execution time

I need to filter 6 signals with 60000000 samples each, so the data are stored in a matrix data(60000000,6). There are several approaches to do that:
data=randn(60000000,6);
b=ones(1,1000)/1000;
tic
R=filter(b,1,data);
toc
tic
for i=1:6
R2(:,i)=filter(b,1,data(:,i));
end
toc
tic
parfor i=1:6
R2(:,i)=filter(b,1,data(:,i));
end
toc
According to the documentation, the 1st form is recommended as the fastest one, but in my case it is the slowest.
Elapsed time is 172.235919 seconds.
Elapsed time is 45.354810 seconds.
Elapsed time is 59.250638 seconds.
In Process Explorer, the 1st form uses only 1 CPU thread. According to the documentation it should run on multiple threads by default. Have you experienced the same problem?
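One quick sanity check (just a sketch, reusing b and data from the snippet above) is to compare against an explicitly single-threaded run, since MATLAB's built-in multithreading respects maxNumCompThreads:
nthreads = maxNumCompThreads % how many computational threads MATLAB may currently use
old = maxNumCompThreads(1); % force single-threaded execution, remember the old setting
tic; R = filter(b, 1, data); toc % if this is not slower, filter was not multithreading anyway
maxNumCompThreads(old); % restore the previous setting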

Matlab CUDA basic experiment

(correctly and instructively answered, see below)
I'm beginning to do experiments with MATLAB and the GPU (an NVIDIA GTX 660).
Now, I wrote this simple Monte Carlo algorithm to calculate pi. The following is the CPU version:
function pig = mc1vecnocuda(n)
countr=0;
A=rand(n,2);
for i=1:n
if norm(A(i,:))<1
countr=countr+1;
end
end
pig=(countr/n)*4;
end
This takes very little time to execute on the CPU with 100000 points "thrown" into the unit circle:
>> tic; mc1vecnocuda(100000);toc;
Elapsed time is 0.092473 seconds.
See, instead, what happens with the GPU-ized version of the algorithm:
function pig = mc1veccuda(n)
countr=0;
gpucountr=gpuArray(countr);
A=gpuArray.rand(n,2);
parfor (i=1:n,1024)
if norm(A(i,:))<1
gpucountr=gpucountr+1;
end
end
pig=(gpucountr/n)*4;
end
Now, this takes a LONG time to be executed:
>> tic; mc1veccuda(100000);toc;
Elapsed time is 21.137954 seconds.
I don't understand why. I used a parfor loop with 1024 workers because, when querying my NVIDIA card with gpuDevice, 1024 is the maximum number of simultaneous threads allowed on the GTX 660.
Can someone help me? Thanks.
Edit: this is the updated version that avoids IF:
function pig = mc2veccuda(n)
countr=0;
gpucountr=gpuArray(countr);
A=gpuArray.rand(n,2);
parfor (i=1:n,1024)
gpucountr = gpucountr+nnz(norm(A(i,:))<1);
end
pig=(gpucountr/n)*4;
end
And this is the code written following Bichoy's guidelines (the right code to achieve the result):
function pig = mc3veccuda(n)
countr=0;
gpucountr=gpuArray(countr);
A=gpuArray.rand(n,2);
Asq = A.^2;
Asqsum_big_column = Asq(:,1)+Asq(:,2);
Anorms=Asqsum_big_column.^(1/2);
gpucountr=gpucountr+nnz(Anorms<1);
pig=(gpucountr/n)*4;
end
Please note the execution time with n = 10 million:
>> tic; mc3veccuda(10000000); toc;
Elapsed time is 0.131348 seconds.
>> tic; mc1vecnocuda(10000000); toc;
Elapsed time is 8.108907 seconds.
I didn't test my original CUDA version (for/parfor), because its execution would require hours with n=10000000.
Great Bichoy! ;)
I guess the problem is with parfor!
parfor is supposed to run on MATLAB workers, that is, on your host, not on the GPU!
I guess what is actually happening is that you are starting 1024 threads on your host (not on your GPU) and each of them is trying to call the GPU. This results in the tremendous time your code is taking.
Try to re-write your code to use matrix and array operations, not for-loops! This will show some speed-up. Also, remember that you should have much more computation to do on the GPU; otherwise, memory transfer will just dominate your code.
Code:
This is the final code after including all corrections and suggestions from several people:
function pig = mc2veccuda(n)
A = gpuArray.rand(n,2); % An nx2 random matrix on the GPU
Asq = A.^2; % Square each element
Anormsq = Asq(:,1)+Asq(:,2); % Squared norm of each point
gpucountr = nnz(Anormsq<1); % Count points with squared norm < 1 (equivalent to norm < 1)
pig = (gpucountr/n)*4;
end
There are many possible reasons, for example:
Movement of data between host & device
Computation within each loop is very small
Call to rand on GPU may not be parallel
if condition within the loop can cause divergence
Accumulation to a common variable may run in serial, with overhead
It is difficult to profile Matlab+CUDA code. You should probably try in native C++/CUDA and use parallel Nsight to find the bottleneck.
As Bichoy said, CUDA code should always be vectorized. In MATLAB, unless you're writing a CUDA kernel, the only large speedup you get is that the vectorized operations are executed on the GPU, which has thousands of (slow) cores. If you don't have large vectors and vectorized code, it won't help.
Another thing that hasn't been mentioned is that for highly parallel architectures like GPUs you want to use different random number generating algorithms than the "standard" ones. So to add to Bichoy's answer, adding the parameter 'Threefry4x64' (64-bit) or 'Philox4x32-10' (32-bit and a lot faster! Super fast!) can lead to large speedups in CUDA code. MATLAB explains this here: http://www.mathworks.com/help/distcomp/examples/generating-random-numbers-on-a-gpu.html
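For instance, a minimal sketch of switching the GPU generator (the generator name is taken from that MathWorks page; the seed and n are arbitrary):
parallel.gpu.rng(0, 'Philox4x32-10'); % fast counter-based generator for GPU random numbers
n = 1e7;
A = gpuArray.rand(n, 2); % subsequent gpuArray.rand calls use the Philox generator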

Several time counters in MATLAB

I have a program running a loop. I want two time counters: one for the loop, which will tell me how long one iteration of the loop took, and one for the entire program. To the best of my knowledge, tic and toc only work once.
You're probably only familiar with this tic/toc syntax:
tic; someCode; elapsed = toc;
But there is another syntax:
start = tic; someCode; elapsed = toc(start);
The second syntax makes the same time measurement, but allows you the option of running more than one stopwatch timer concurrently. You assign the output of tic to a variable tStart and then use that same variable when calling toc. MATLAB measures the time elapsed between the tic and its related toc command and displays the time elapsed in seconds. This syntax enables you to time multiple concurrent operations, including the timing of nested operations (matlab documentation of tic toc).
Here's how to use it in your case. Let's say that this is your code:
for i = 1:N
someCode;
end
Insert the tic and toc like this:
startLoop = tic;
for i = 1:N
startIteration = tic;
someCode;
endIteration = toc(startIteration);
end
endLoop = toc(startLoop);
You can also use the above syntax to create a vector for which the ith element is the time measurement for the ith iteration. Like this:
startLoop = tic;
for i = 1:N
startIteration(i) = tic;
someCode;
endIteration(i) = toc(startIteration(i));
end
endLoop = toc(startLoop);
You can use tic and toc to time nested operations, from the Matlab help for tic:
tStart=tic; any_statements; toc(tStart); makes the same time measurement, but allows you the option of running more than one stopwatch timer concurrently. You assign the output of tic to a variable tStart and then use that same variable when calling toc. MATLAB measures the time elapsed between the tic and its related toc command and displays the time elapsed in seconds. This syntax enables you to time multiple concurrent operations, including the timing of nested operations
I'm not able to try this right now, but you should be able to use multiple tic and toc statements if you store the tic values into variables.
Read Matlab's documentation on this; there is even a section on nesting them. Here is a rough example:
tStartOverall = tic;
...
tStartLoop = tic;
<your loop code here>
tEndLoop = toc(tStartLoop);
...
tEndOverall = toc(tStartOverall);