I have the following matlab code:
N = 1000;
randarray = gpuArray(rand(N,1));
tic
g=0;
for i=1:N
if randarray(i)>10
g=g+1;
end
end
toc
secondrandarray = rand(N,1);
g=0;
tic
for i=1:N
if secondrandarray(i)>10
g=g+1;
end
end
toc
Elapsed time is 0.221710 seconds.
Elapsed time is 0.000012 seconds.
1) Why is the if clause so slow on the GPU? It is slowing down all my attempts at optimisation
2) What can I do to get around this limitation?
Thanks
This is typically a bad thing to do whether you are doing it on the CPU or the GPU.
The following would be a good way to do the operation you are looking at.
N = 1000;
randarray = gpuArray(100 * rand(N,1));
tic
g = nnz(randarray > 10);
toc
I do not have PCT and I cannot verify whether this actually works (the number of functions supported on the GPU is fairly limited).
However if you had Jacket, you would definitely be able to do the following.
N = 1000;
randarray = gdouble(100 * rand(N, 1));
tic
g = nnz(randarray > 10);
toc
Full disclosure: I am one of the engineers developing Jacket.
No expert on the Matlab gpuArray implementation, but I would suspect that each randarray(i) access in the first loop triggers a PCI-e transaction to retrieve a value from GPU memory, which will incur a very large latency penalty. You might be better served by calling gather to transfer the whole array in a single transaction instead and then loop over a local copy in host memory.
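A minimal sketch of that suggestion, assuming PCT's gather works as documented (I have not benchmarked this): pull the whole array back to host memory in one transfer, then loop over the local copy so each element access stays on the CPU.
N = 1000;
randarray = gpuArray(rand(N,1));
tic
localcopy = gather(randarray);   % single device-to-host transfer
g = 0;
for i = 1:N
    if localcopy(i) > 10         % same test as in the question
        g = g + 1;
    end
end
toc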
Using MATLAB R2011b and Parallel Computing Toolbox on a now rather old GPU (Tesla C1060), here's what I see:
>> g = 100*parallel.gpu.GPUArray.rand(1, 1000);
>> tic, sum(g>10); toc
Elapsed time is 0.000474 seconds.
Operating on scalar elements of a gpuArray one at a time is always going to be slow, so using the sum method is much quicker.
I cannot comment on the earlier answers because I'm too new, but to extend Pavan's solution: the nnz function is not (yet) implemented for gpuArrays, at least in the MATLAB version I'm using (R2012a).
In general, it is much better to vectorize MATLAB code. However, in some cases looped code can run fast in MATLAB because of JIT compilation.
Check the results from
N = 1000;
randarray_cpu = rand(N,1);
randarray_gpu = gpuArray(randarray_cpu);
threshold = 0.5;
% CPU: looped
g=0;
tic
for i=1:N
if randarray_cpu(i)>threshold
g=g+1;
end
end
toc
% CPU: vectorized
tic
g = nnz(randarray_cpu>threshold);
toc
% GPU: looped
tic
g=0;
for i=1:N
if randarray_gpu(i)>threshold
g=g+1;
end
end
toc
% GPU: vectorized
tic
g_d = sum(randarray_gpu > threshold);
g = gather(g_d); % I'm assuming that you want this in the CPU at some point
toc
Which is (on my core i7+ GeForce 560Ti):
Elapsed time is 0.000014 seconds.
Elapsed time is 0.000580 seconds.
Elapsed time is 0.310218 seconds.
Elapsed time is 0.000558 seconds.
So what we see from this case is:
Loops in MATLAB are not considered good practice, but in your particular case the loop does run fast because MATLAB JIT-compiles it internally. I changed your threshold from 10 to 0.5, since rand will never give you a value higher than 1.
The looped GPU version performs horribly because a kernel is launched (or data is read back from the GPU, however TMW implemented it) at each loop iteration, which is slow. Many small memory transfers while computing almost nothing are the worst thing you can do on a GPU.
From the last (best) GPU result the answer would be: unless the data is already on the GPU, it doesn't make sense to calculate this on the GPU. Since the arithmetic complexity of your operation is basically nonexistent, the memory transfer overhead does not pay off in any way. If this is part of a bigger GPU calculation, it's OK. If not... better stick to the CPU ;)
Related
I was able to find a similar answer for C/C++ in the following question:
Fast implementation/approximation of pow() function in C/C++
According to the profiler, my code's bottleneck is something similar to the following:
a=rand(100,1); % a is an array of doubles
b=1.2; % power can be double instead of only integers.
temp=power(a,b); % this line is taking maximum time
For my requirements, only the first 3-4 digits are significant, so any fast approximation of power would be very useful to me. Any suggestions?
More Information:
This is how I calculated the power using exp and log; it gives an improvement of roughly 50%. Now I need to find approximate versions of log and exp and check whether that is faster still.
a=rand(1000,1);
tic;
for i=1:100000
powA=power(a,1.1);
end
toc;
tic;
for i=1:100000
powB=exp(1.1*log(a));
end
toc;
Elapsed time is 4.098334 seconds.
Elapsed time is 1.994894 seconds.
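One further hedged idea, not from the original post: since only 3-4 significant digits are needed, trying the same exp/log form in single precision may give an additional speed-up on some machines. This needs benchmarking and an accuracy check before relying on it.
% Hedged sketch: single-precision variant of the exp/log trick above.
% Single precision carries roughly 7 significant digits, which may still
% satisfy the 3-4 digit requirement; verify on the target machine.
a = rand(1000,1);
a_single = single(a);
tic;
for i = 1:100000
    powC = exp(1.1*log(a_single));
end
toc;
powErr = max(abs(double(powC) - power(a,1.1)));   % sanity-check the accuracy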
I need to create a square matrix $V$ iteratively, over 100000+ times per pack.
When doing it the traditional way, the computation takes around 70 s (over 1 minute), and I need to repeat this process for over 100 packs. That's about 1 hour of extra time.
It turned out that when the matrix $V(x,y)$ is calculated with a double for loop, MATLAB only uses a single thread. However, the computer has 12 threads, and there should be a way to use all of them to fill the matrix faster.
The function is of the form
$V(x,y) = \exp((x - variation_1)^2 + (y - variation_2)^2)$
I thought about using the GPU. However, as it turned out, gpuArray calculates it much more slowly than the CPU.
I also thought about using parpool. However, not only does it cost more time to send the matrix into the parallel pool, it is also denied access to $V$ itself.
How can I tell the CPU to calculate the matrix using all of its threads, at a faster speed?
You should always use matrix and vector operations rather than for loops.
If x and y are constant for all cases, you can use meshgrid to generate x and y once.
For example, consider the following code, which uses a double for loop:
v = zeros(10000,10000);
tic;
for x=1:10000
for y = 1:10000
v(x,y) = exp((x/10000).^2+(y/10000).^2);
end
end
toc
On my computer it runs in about 11 seconds.
Now by using meshgrid:
%This is done only once
[x,y] = meshgrid((1:10000)/10000,(1:10000)/10000);
tic;
v = exp(x.^2+y.^2);
toc
Which takes about 4 seconds, not including the meshgrid.
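For reference, a minimal sketch applying the same meshgrid approach to the formula from the question; variation1 and variation2 are placeholder scalars here, and the (1:N)/N coordinate range is just an assumption.
N = 10000;
variation1 = 0.3;    % placeholder value
variation2 = 0.7;    % placeholder value
[x, y] = meshgrid((1:N)/N, (1:N)/N);   % build the coordinates once
tic;
V = exp((x - variation1).^2 + (y - variation2).^2);
toc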
I am using a GeForce GT 720 to do some basic calculations in Matlab.
I am simply doing matrix multiplication:
A = rand(3000,3000); % Define array using CPU
tic; % Start clock
Agpu = gpuArray(A); % Transfer data to GPU
Bgpu = Agpu*Agpu; % Perform computation on GPU
time = toc; % Stop clock
In this code, my clock is timing the data transfer to the GPU and matrix multiplication on the GPU, and I get time ~ 4 seconds. I suspect that the data transfer is taking much more time than the multiplication, so I isolate it with my timer:
A = rand(3000,3000); % Define array using CPU
tic; % Start clock
Agpu = gpuArray(A); % Transfer data to GPU
time = toc; % Stop clock
Bgpu = Agpu*Agpu; % Perform computation on GPU
and indeed it takes ~ 4 seconds. However, if I comment out the last line of the code, so that no multiplication is done, my code speeds up to ~0.02 seconds.
Does performing a computation with the GPU after transferring data to the GPU alter the speed of the data transfer?
I don't see this behaviour at all (R2017b, Tesla K20c) - for me, either way, the transfer takes 0.012 seconds. Note that if you're running this in a fresh MATLAB session each time, the very first time you run anything at all on the GPU takes a few seconds - perhaps that accounts for the 4 seconds?
In general, use gputimeit to time stuff on the GPU to ensure you don't see strange results from the asynchronous nature of some GPU operations.
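For example, a minimal sketch separating the two costs with gputimeit (it takes a function handle and synchronizes the GPU before and after the timed code):
A = rand(3000,3000);
tTransfer = gputimeit(@() gpuArray(A));   % time the host-to-GPU copy
Agpu = gpuArray(A);
tMultiply = gputimeit(@() Agpu*Agpu);     % time the multiplication itself
fprintf('transfer: %.4f s, multiply: %.4f s\n', tTransfer, tMultiply);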
I'm working on optimizing my code. I'm going through MATLAB's own optimization webpage (https://www.mathworks.com/company/newsletters/articles/programming-patterns-maximizing-code-performance-by-optimizing-memory-access.html) and writing code to make sure I understand how to increase performance. Point 2 suggests that when you store and access matrix data, you should do so down columns. However, I can't construct my own code that shows this performance boost. If I'm reading it right, the two simple snippets below should differ in performance by about 30%, but I'm getting far, far less than that:
x=zeros(10000);
tic
for i = 1:10000
for j = 1:10000
if x(i,j) == 0
x(i,j)=randn(1);
end
end
end
toc
Elapsed time is 41.102294 seconds.
x=zeros(10000);
tic
for j = 1:10000
for i = 1:10000
if x(i,j) == 0
x(i,j)=randn(1);
end
end
end
toc
Elapsed time is 39.654574 seconds.
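As an aside (not part of the original post), one hedged way to expose the access pattern itself is to drop the randn call and the branch, which likely dominate the runtime above; with a bare increment in the loop body, any row-order vs column-order difference should be easier to see.
x = zeros(10000);
tic
for i = 1:10000            % outer loop over rows: strided memory access
    for j = 1:10000
        x(i,j) = x(i,j) + 1;
    end
end
toc
x = zeros(10000);
tic
for j = 1:10000            % outer loop over columns: sequential memory access
    for i = 1:10000
        x(i,j) = x(i,j) + 1;
    end
end
toc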
I've run into a weird fact! I wrote a MATLAB script that compares the running time of two algorithms. I wrap the code of each algorithm in tic/toc calls so I can compare their running times. But here is the weird thing:
If I run the script all at once, algorithm 1 takes longer than algorithm 2, but if I put breakpoints after each toc, algorithm 1 takes a shorter time!
To better understand what I mean, consider the following:
tic
% some matlab codes implementing algorithm1
time1 = toc;
disp(['t1 = ',num2str(time1)]) % here is the place for first breakpoint
tic
% some matlab codes implementing algorithm2
time2 = toc;
disp(['t2 = ',num2str(time2)]) % here is the place for second breakpoint
Can anyone please explain to me why this happens? Thanks
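As a side note, a hedged suggestion rather than an explanation: wrapping each algorithm in a function and timing it with timeit usually gives more stable numbers than a single tic/toc pair, since timeit runs the code several times and discards warm-up effects.
% Hypothetical sketch: algorithm1/algorithm2 and data are placeholders
% standing in for the real code from the question.
t1 = timeit(@() algorithm1(data));
t2 = timeit(@() algorithm2(data));
disp(['t1 = ', num2str(t1)])
disp(['t2 = ', num2str(t2)])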