GPU timing in Matlab

I am using a GeForce GT 720 to do some basic calculations in Matlab.
I am simply doing matrix multiplication:
A = rand(3000,3000); % Define array using CPU
tic; % Start clock
Agpu = gpuArray(A); % Transfer data to GPU
Bgpu = Agpu*Agpu; % Perform computation on GPU
time = toc; % Stop clock
In this code, my clock is timing the data transfer to the GPU and matrix multiplication on the GPU, and I get time ~ 4 seconds. I suspect that the data transfer is taking much more time than the multiplication, so I isolate it with my timer:
A = rand(3000,3000); % Define array using CPU
tic; % Start clock
Agpu = gpuArray(A); % Transfer data to GPU
time = toc; % Stop clock
Bgpu = Agpu*Agpu; % Perform computation on GPU
and indeed it takes ~ 4 seconds. However, if I comment out the last line of the code, so that no multiplication is done, my code speeds up to ~0.02 seconds.
Does performing a computation with the GPU after transferring data to the GPU alter the speed of the data transfer?

I don't see this behaviour at all (R2017b, Tesla K20c) - for me, either way, the transfer takes 0.012 seconds. Note that if you're running this in a fresh MATLAB session each time, the very first time you run anything at all on the GPU takes a few seconds - perhaps that accounts for the 4 seconds?
In general, use gputimeit to time stuff on the GPU to ensure you don't see strange results from the asynchronous nature of some GPU operations.
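For example, here is a minimal sketch of timing the transfer and the multiply separately with gputimeit (this assumes a release that has gputimeit and that the GPU has already been warmed up once):
A = rand(3000,3000);
tTransfer = gputimeit(@() gpuArray(A));   % host-to-GPU copy only
Agpu = gpuArray(A);
tMultiply = gputimeit(@() Agpu*Agpu);     % multiply only; GPU synchronization is handled for you
fprintf('transfer: %.4f s, multiply: %.4f s\n', tTransfer, tMultiply);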

Related

Detecting the duration of simulation

I have a simulation which will run for several hours in Simulink. Is it possible to see the total duration of the simulation in Matlab or Simulink?
tic/toc does this. tic starts the timer, toc ends it. Time is given in seconds, adjust according to your needs.
tic
% code here
time = toc;
fprintf('It took %f s', time)
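Applied to a Simulink run, the same pattern might look like this ('myModel' is a placeholder for your model name):
tic
simOut = sim('myModel');   % run the simulation programmatically
time = toc;                % wall-clock duration of the whole run, in seconds
fprintf('Simulation took %f s\n', time)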

Fast approximation/implementation of power function in Matlab

I was able to find a similar answer for C/C++ in the following question:
Fast implementation/approximation of pow() function in C/C++
According to the profiler, my code's bottleneck is something similar to the following code:
a=rand(100,1); % a is an array of doubles
b=1.2; % power can be double instead of only integers.
temp=power(a,b); % this line is taking maximum time
For my requirements, only the first 3-4 digits are significant, so any fast approximation of power would be very useful to me. Any suggestions?
More Information:
This is how I calculated the power using exp and log; it gives an improvement of roughly 50%. Now I need to find approximate versions of log and exp and check whether that is faster still.
a=rand(1000,1);
tic;
for i=1:100000
powA=power(a,1.1);
end
toc;
tic;
for i=1:100000
powB=exp(1.1*log(a));
end
toc;
Elapsed time is 4.098334 seconds.
Elapsed time is 1.994894 seconds.
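As an aside, if your release has timeit, it usually gives steadier numbers than a hand-rolled tic/toc loop, since it handles warm-up and repetition for you; a minimal sketch of the same comparison:
a = rand(1000,1);
tPow = timeit(@() power(a, 1.1));     % built-in power
tExp = timeit(@() exp(1.1*log(a)));   % exp/log reformulation
fprintf('power: %.3g s, exp/log: %.3g s per call\n', tPow, tExp);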

How to make Matlab create matrices faster during iteration?

I need to create a square matrix $V$ iteratively, 100000+ times per pack.
Done the straightforward way, the computation takes around 70 s (over 1 minute), and I need to repeat this process for over 100 packs, which adds up to about 1 hour of extra time.
It turned out that when the matrix is filled with a double for loop over $V(x,y)$, Matlab only uses a single thread. However, there are 12 threads in the computer, and there should be a way to use all of them to build the matrix faster.
The type of function is
$V(x,y)=\exp\big((x-\text{variation}_1)^2+(y-\text{variation}_2)^2\big)$
I thought about using the GPU. However, as it turned out, the gpuArray version calculates it much more slowly than the CPU.
I also thought about using parpool. However, not only does it cost extra time to send the matrix to the parallel pool, the workers are also denied access to $V$ itself.
How can I tell the CPU to calculate the matrix using all the threads, for a faster result?
You should always use matrix and vector operations rather than for loops.
If x and y are constant for all cases, you can use meshgrid to generate x and y once.
For example, consider the following code which uses a double for loop:
v = zeros(10000,10000);
tic;
for x=1:10000
for y = 1:10000
v(x,y) = exp((x/10000).^2+(y/10000).^2);
end
end
toc
On my computer it runs about 11 seconds.
Now by using meshgrid:
%This is done only once
[x,y] = meshgrid((1:10000)/10000,(1:10000)/10000);
tic;
v = exp(x.^2+y.^2);
toc
Which takes about 4 seconds, not including the meshgrid.
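On R2016b or later you can also skip meshgrid entirely and rely on implicit expansion of a column vector against a row vector, which avoids building the two 10000x10000 coordinate matrices; a sketch of the same computation:
x = (1:10000)'/10000;   % column vector
y = (1:10000)/10000;    % row vector
tic;
v = exp(x.^2 + y.^2);   % implicit expansion produces the full 10000x10000 result directly
toc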

matlab if statements with CUDA

I have the following matlab code:
N = 1000;
randarray = gpuArray(rand(N,1));
tic
g=0;
for i=1:N
if randarray(i)>10
g=g+1;
end
end
toc
secondrandarray = rand(N,1);
g=0;
tic
for i=1:N
if secondrandarray(i)>10
g=g+1;
end
end
toc
Elapsed time is 0.221710 seconds.
Elapsed time is 0.000012 seconds.
1) Why is the if clause so slow on the GPU? It is slowing down all my attempts at optimisation
2) What can I do to get around this limitation?
Thanks
This is typically a bad thing to do, whether you are doing it on the CPU or the GPU.
The following would be a good way to do the operation you are looking at.
N = 1000;
randarray = gpuArray(100 * rand(N,1));
tic
g = nnz(randarray > 10);
toc
I do not have PCT and cannot verify whether this actually works (the number of functions supported on the GPU is fairly limited).
However if you had Jacket, you would definitely be able to do the following.
N = 1000;
randarray = gdouble(100 * rand(N, 1));
tic
g = nnz(randarray > 10);
toc
Full disclosure: I am one of the engineers developing Jacket.
No expert on the Matlab gpuArray implementation, but I would suspect that each randarray(i) access in the first loop triggers a PCI-e transaction to retrieve a value from GPU memory, which will incur a very large latency penalty. You might be better served by calling gather to transfer the whole array in a single transaction instead and then loop over a local copy in host memory.
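A minimal sketch of that idea, reusing randarray and N from the question:
randarray_host = gather(randarray);   % one device-to-host transfer instead of N scalar reads
g = 0;
for i = 1:N
if randarray_host(i) > 10   % same test as in the question, now on a local copy
g = g + 1;
end
end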
Using MATLAB R2011b and Parallel Computing Toolbox on a now rather old GPU (Tesla C1060), here's what I see:
>> g = 100*parallel.gpu.GPUArray.rand(1, 1000);
>> tic, sum(g>10); toc
Elapsed time is 0.000474 seconds.
Operating on scalar elements of a gpuArray one at a time is always going to be slow, so using the sum method is much quicker.
I cannot comment on a prior solution because I'm too new, so I will extend the solution from Pavan here. The nnz function is not (yet) implemented for gpuArrays, at least on the Matlab version I'm using (R2012a).
In general, it is much better to vectorize Matlab code. However, in some cases looped code can run fast in Matlab because of the JIT compilation.
Check the results from
N = 1000;
randarray_cpu = rand(N,1);
randarray_gpu = gpuArray(randarray_cpu);
threshold = 0.5;
% CPU: looped
g=0;
tic
for i=1:N
if randarray_cpu(i)>threshold
g=g+1;
end
end
toc
% CPU: vectorized
tic
g = nnz(randarray_cpu>threshold);
toc
% GPU: looped
tic
g=0;
for i=1:N
if randarray_gpu(i)>threshold
g=g+1;
end
end
toc
% GPU: vectorized
tic
g_d = sum(randarray_gpu > threshold);
g = gather(g_d); % I'm assuming that you want this in the CPU at some point
toc
The results (on my Core i7 + GeForce 560 Ti) are:
Elapsed time is 0.000014 seconds.
Elapsed time is 0.000580 seconds.
Elapsed time is 0.310218 seconds.
Elapsed time is 0.000558 seconds.
So what we see from this case is:
Loops in Matlab are not considered good practice, but in your particular case the loop runs fast because Matlab somehow "precompiles" it internally. I changed your threshold from 10 to 0.5, as rand will never give you a value higher than 1.
The looped GPU version performs horribly because at each loop iteration, a kernel is launched (or data is read from the GPU, however TMW implemented that...), which is slow. A lot of small memory transfers while calculating basically nothing are the worst thing one could do on the GPU.
From the last (best) GPU result the answer would be: unless the data is already on the GPU, it doesn't make sense to calculate this on the GPU. Since the arithmetic complexity of your operation is basically nonexistent, the memory transfer overhead does not pay off in any way. If this is part of a bigger GPU calculation, it's OK. If not... better stick to the CPU ;)

Speeding up sparse FFT computations

I'm hoping someone can review my code below and offer hints how to speed up the section between tic and toc. The function below attempts to perform an IFFT faster than Matlab's built-in function since (1) almost all of the fft-coefficient bins are zero (i.e. 10 to 1000 bins out of 10M to 300M bins are non-zero), and (2) only the central third of the IFFT results are retained (the first and last third are discarded -- so no need to compute them in the first place).
The input variables are:
fftcoef = complex fft-coef 1D array (10 to 1000 pts long)
bins = index of fft coefficients corresponding to fftcoef (10 to 1000 pts long)
DATAn = # of pts in data before zero padding and fft (in range of 10M to 260M)
FFTn = DATAn + # of pts used to zero pad before taking fft (in range of 16M to 268M) (e.g. FFTn = 2^nextpow2(DATAn))
Currently, this code takes a few orders of magnitude longer than Matlab's ifft approach, which computes the entire spectrum and then discards two-thirds of it. For example, if the input data for fftcoef and bins are 9x1 arrays (i.e. only 9 complex fft coefficients per sideband; 18 pts when considering both sidebands), and DATAn=32781534, FFTn=33554432 (i.e. 2^25), then the ifft approach takes 1.6 seconds whereas the loop below takes over 700 seconds.
I've avoided using a matrix to vectorize the nn loop since sometimes the array size for fftcoef and bins could be up to 1000 pts long, and a 260Mx1K matrix would be too large for memory unless it could be broken up somehow.
Any advice is much appreciated! Thanks in advance.
function fn_fft_v1p0(fftcoef, bins, DATAn, FFTn)
fftcoef = [fftcoef; (conj(flipud(fftcoef)))]; % fft coefficients
bins = [bins; (FFTn - flipud(bins) +2)]; % corresponding fft indices for fftcoef array
ttrend = zeros( (round(2*DATAn/3) - round(DATAn/3) + 1), 1); % preallocate
start = round(DATAn/3)-1;
tic;
for nn = start+1 : round(2*DATAn/3) % loop over desired time indices
% sum over all fft indices having non-zero coefficients
arg = 2*pi*(bins-1)*(nn-1)/FFTn;
ttrend(nn-start) = sum( fftcoef.*(cos(arg) + 1j*sin(arg)) );
end
toc;
end
You have to keep in mind that Matlab uses a compiled FFT library (http://www.fftw.org/) for its fft functions, which, besides running much faster than a Matlab script, is well optimized for many use cases. So a first step might be writing your code in C/C++ and compiling it as a MEX file you can call from Matlab. That will surely speed up your code by at least an order of magnitude (probably more).
Besides that, there is a simple optimization you can make by considering two things:
You assume your time series is real valued, so you can use the symmetry of the fft coefficients.
Your time series is typically much longer than your fft coefficient vector, so it is better to iterate over bins instead of time points (thus vectorizing the longer vector).
These two points are translated to the following loop:
nn=(start+1 : round(2*DATAn/3))';
ttrend2 = zeros( (round(2*DATAn/3) - round(DATAn/3) + 1), 1);
tic;
for bn = 1:length(bins)
arg = 2*pi*(bins(bn)-1)*(nn-1)/FFTn;
ttrend2 = ttrend2 + 2*real(fftcoef(bn) * exp(i*arg));
end
toc;
Note that you have to use this loop before you expand bins and fftcoef, since the symmetry is already taken into account. This loop takes 8.3 seconds to run with the parameters from your question, while your code takes 141.3 seconds on my PC.
I have posted a question/answer at Accelerating FFTW pruning to avoid massive zero padding which solves the problem for the C++ case using FFTW. You can use this solution by exploiting mex-files.