Here is a toy example that I put together to explore the parfor function, which uses the CPU to speed up execution. Even after reviewing the parallel computing documentation, however, I am unsure how to adapt it to run on my GPU (Nvidia 980 Ti).
Would appreciate any pointers on how to update this code to run on GPU.
Cheers.
% toy example--monte carlo estimation of pi using for loops
tic;
N = 1000000000;
hitcounter = 0;
for i = 1:N
x = rand;
y = rand;
if ( y < sqrt(1-x*x) )
hitcounter = hitcounter + 1;
end
end
disp(hitcounter/N*4)
toc;
% toy example--monte carlo estimation of pi using parfor loops
tic;
N = 1000000000;
hitcounter = 0;
parfor i = 1:N
x = rand;
y = rand;
if ( y < sqrt(1-x*x) )
hitcounter = hitcounter + 1;
end
end
disp(hitcounter/N*4)
toc;
The main thing you need to do is to vectorise your code - this is always a good idea, and especially so on the GPU. Then, you simply need to build x and y directly on the GPU using the trailing argument to rand.
N = 1000000;
x = rand(1, N, 'gpuArray');
y = rand(1, N, 'gpuArray');
pi_est = sum(y < sqrt(1 - x.*x)) / N * 4;
Note that I scaled back N so that the arrays fit on the GPU. If you want to run with a higher value of N, I would suggest adding an outer loop and performing the computation in "chunks" that fit in the limited memory of the GPU.
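The chunking idea above could be sketched roughly as follows. This is only an illustration, not the answerer's code: chunkSize, numChunks and hits are names made up here, the chunk size of 1e7 is an assumed value that should be tuned to the available GPU memory, and N is assumed to be an exact multiple of chunkSize.

```matlab
% Monte Carlo estimate of pi on the GPU, processed in chunks.
N = 1e9;              % total number of samples (as in the question)
chunkSize = 1e7;      % assumed chunk size; pick one that fits in GPU memory
numChunks = N / chunkSize;   % assumes N is an exact multiple of chunkSize
hits = 0;             % running hit count, kept on the CPU
for c = 1:numChunks
    % Build this chunk's samples directly on the GPU
    x = rand(1, chunkSize, 'gpuArray');
    y = rand(1, chunkSize, 'gpuArray');
    % Count hits on the GPU, then bring only the scalar back
    hits = hits + gather(sum(y < sqrt(1 - x.*x)));
end
pi_est = hits / N * 4;
disp(pi_est)
```

Only one scalar per chunk is transferred back with gather, so the host-device traffic stays small compared with the work done on the GPU.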
I want to run this specific nested for loop on the GPU using MATLAB. Can anybody help me?
Phi=rand(100,100); FluxD=rand(100,100); FluxC=rand(100,100);
Ima = 100;
Jma = 100;
for i=1:Ima-1
for j=1:Jma-1
Phi(i,j) =Phi(i,j)+dt*(FluxD(i,j)-FluxC(i,j));
end
end
You need to do two things here - firstly, build your data on the GPU, and then for best performance, operate on it in a vectorised manner, like this:
% Build input data arrays directly on the GPU
Phi = rand(100, 'gpuArray');
FluxD = rand(100, 'gpuArray');
FluxC = rand(100, 'gpuArray');
Ima = 100;
Jma = 100;
% For convenience, make index vectors for i and j
ii = 1:Ima-1;
jj = 1:Jma-1;
% Compute Phi in a vectorised manner
Phi(ii, jj) = Phi(ii, jj) + dt * (FluxD(ii,jj) - FluxC(ii,jj));
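As a sanity check, the loop and the vectorised update can be compared on the CPU before moving the data to the GPU. The sketch below is an illustration under assumptions: dt is not defined anywhere in the question, so the value 0.1 is made up, and PhiLoop, PhiVec and maxDiff are hypothetical names.

```matlab
% Compare the nested-loop update with the vectorised update (CPU only).
dt = 0.1;             % assumed value; dt is not defined in the question
Phi0  = rand(100);
FluxD = rand(100);
FluxC = rand(100);
Ima = 100;
Jma = 100;

% Original nested-loop version
PhiLoop = Phi0;
for i = 1:Ima-1
    for j = 1:Jma-1
        PhiLoop(i,j) = PhiLoop(i,j) + dt*(FluxD(i,j) - FluxC(i,j));
    end
end

% Vectorised version using index vectors
ii = 1:Ima-1;
jj = 1:Jma-1;
PhiVec = Phi0;
PhiVec(ii,jj) = PhiVec(ii,jj) + dt*(FluxD(ii,jj) - FluxC(ii,jj));

% Largest absolute difference between the two results
maxDiff = max(abs(PhiLoop(:) - PhiVec(:)));
disp(maxDiff)
```

Note that the last row and column are left untouched by both versions, because the loops stop at Ima-1 and Jma-1.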
I'm using code that calculates the expectation value of probabilities. It contains a while loop that finds all possible combinations and adds up the products of the probability combinations. However, when the number of elements becomes large (over 40) it takes too much time, and I want to make the code faster.
The code is as follows:
function pcs = combsum(N,K,prbv)
nprbv=1-prbv; %prbv: probability vector
WV = 1:K; % Working vector.
lim = K; % Sets the limit for working index.
inc = 0; % Controls which element of WV is being worked on.
pcs = 0;
stopp=0;
while stopp==0
if logical((inc+lim)-N)
stp = inc; % This is where the for loop below stops.
flg = 0; % Used for resetting inc.
else
stp = 1;
flg = 1;
end
for jj = 1:stp
WV(K + jj - inc) = lim + jj; % Faster than a vector assignment.
end
PV=nprbv;
PV(WV)=prbv(WV);
pcs=prod(PV)+pcs;
inc = inc*flg + 1; % Increment the counter.
lim = WV(K - inc + 1 ); % lim for next run.
if (inc==K)&&(lim==N-K)
stopp=1;
WV = (N-K+1):N;
PV=nprbv;
PV(WV)=prbv(WV);
pcs=prod(PV)+pcs;
end
end
Is there a way to reduce calculation time? I wonder if parallel computing using GPU would help.
I tried to remove dependent variables in a loop for parallel computing, and I made a matrix of possible combinations using 'combnk' function. This worked faster.
nprbv=1-prbv; %prbv : a probability vector
N = 40;
K = 4;
n_combnk = size(combnk(1:N,K),1);
PV_mat = repmat(nprbv,n_combnk,1);
cnt = 0;
tic;
for i = 1:N-K+1
for j = i+1:N-K+2
for k = j+1:N-K+3
for l = k+1:N-K+4
cnt = cnt+1;
PV_mat(cnt,i) = prbv(i);
PV_mat(cnt,j) = prbv(j);
PV_mat(cnt,k) = prbv(k);
PV_mat(cnt,l) = prbv(l);
end
end
end
end
toc;
tic;
pcs_rr = sum(prod(PV_mat,2));
toc;
However, when K gets larger, building the combination matrix (PV_mat) runs out of memory. How can I break up the big matrix into smaller ones to avoid the memory problem?
Given an nxn matrix A and an nx1 vector x, is there any smart way to compute
J = x_1*A^1 + x_2*A^2 + ... + x_n*A^n
using Matlab? x_i are the elements of the vector x, therefore J is a sum of matrices. So far I have used a for loop, but I was wondering if there was a smarter way.
Short answer: you can use the built-in MATLAB function polyvalm for matrix polynomial evaluation as follows:
x = x(end:-1:1); % flip the order of the elements
x(end+1) = 0; % append 0
J = polyvalm(x, A);
Long answer: Matlab uses a loop internally, so you don't gain much from the builtin function. It can even be slower than an optimised implementation of your own (see my calcJ_loopOptimised function):
% construct random input
n = 100;
A = rand(n);
x = rand(n, 1);
% calculate the result using different methods
Jbuiltin = calcJ_builtin(A, x);
Jloop = calcJ_loop(A, x);
JloopOptimised = calcJ_loopOptimised(A, x);
% check if the functions are mathematically equivalent (should be in the order of `eps`)
relativeError1 = max(max(abs(Jbuiltin - Jloop)))/max(max(Jbuiltin))
relativeError2 = max(max(abs(Jloop - JloopOptimised)))/max(max(Jloop))
% measure the execution time
t_loopOptimised = timeit(@() calcJ_loopOptimised(A, x))
t_builtin = timeit(@() calcJ_builtin(A, x))
t_loop = timeit(@() calcJ_loop(A, x))
% check if builtin function is faster
builtinFaster = t_builtin < t_loopOptimised
% calculate J using Matlab builtin function
function J = calcJ_builtin(A, x)
x = x(end:-1:1);
x(end+1) = 0;
J = polyvalm(x, A);
end
% naive loop implementation
function J = calcJ_loop(A, x)
n = size(A, 1);
J = zeros(n,n);
for i=1:n
J = J + A^i * x(i);
end
end
% optimised loop implementation (cache result of matrix power)
function J = calcJ_loopOptimised(A, x)
n = size(A, 1);
J = zeros(n,n);
A_ = eye(n);
for i=1:n
A_ = A_*A;
J = J + A_ * x(i);
end
end
For n=100, I get the following:
t_loopOptimised = 0.0077
t_builtin = 0.0084
t_loop = 0.0295
For n=5, I get the following:
t_loopOptimised = 7.4425e-06
t_builtin = 4.7399e-05
t_loop = 1.0496e-04
Note that my timings fluctuate somewhat between runs, but the optimised loop is almost always faster than the builtin function (up to 6x for small n).
I'm following a Numerical Methods course and I made a small MATLAB script to compute integrals using the trapezoidal method. However my script uses a FOR loop and my friend told me I'm doing something wrong if I use a FOR loop in Matlab. Is there a way to convert this script to a Matlab-friendly one?
%Number of points to use
N = 4;
%Integration interval
a = 0;
b = 0.5;
%Width of the integration segments
h = (b-a) / N;
F = exp(a);
for i = 1:N-1
F = F + 2*exp(a+i*h);
end
F = F + exp(b);
F = h/2*F
Vectorization is important for speed and clarity, but so is using built-in functions whenever possible. Matlab has a built-in function for trapezoidal numerical integration called trapz. Here is an example.
x = 0:.125:.5
y = exp(x)
F = trapz(x,y)
It is recommended to vectorize your code.
%Number of points to use
N = 4;
%Integration interval
a = 0;
b = 0.5;
%Width of the integration segments
h = (b-a) / N;
x = 1:1:N-1;
F = h/2*(exp(a) + sum(2*exp(a+x*h)) + exp(b));
However, I've read that Matlab is no longer slow at for loops.
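To see that the three approaches in this thread agree, they can be run side by side. The sketch below is only an illustration: F_loop, F_vec, F_trapz, xs and exact are names made up here, and the exact value exp(b) - exp(a) is used purely as a reference.

```matlab
% Trapezoidal rule for exp(x) on [a, b]: loop, vectorised, and trapz.
N = 4;                % number of segments
a = 0;
b = 0.5;
h = (b - a) / N;      % width of each segment

% For-loop version (as in the question)
F_loop = exp(a);
for i = 1:N-1
    F_loop = F_loop + 2*exp(a + i*h);
end
F_loop = h/2 * (F_loop + exp(b));

% Vectorised version
k = 1:N-1;
F_vec = h/2 * (exp(a) + sum(2*exp(a + k*h)) + exp(b));

% Built-in trapz on the same N+1 points
xs = linspace(a, b, N+1);
F_trapz = trapz(xs, exp(xs));

% Exact value of the integral, for reference
exact = exp(b) - exp(a);
disp([F_loop, F_vec, F_trapz, exact])
```

All three estimates use the same N+1 sample points, so they should differ only by rounding error, while each differs from the exact value by the discretisation error of the trapezoidal rule.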
I've searched a lot but didn't find any solution to my problem. Could you please help me vectorize these loops (or just find a way to make them much faster)?
% n is the size of C
h = 1/(n-1)
dt = 1e-6;
a = 1e-2;
F=zeros(n,n);
F2=zeros(n,n);
C2=zeros(n,n);
t = 0.0;
for iter=1:12000
F2=F.^3-F;
for i=1:n
for j=1:n
F2(i,j)=F2(i,j)-(C(ij(i-1),j)+C(ij(i+1),j)+C(i,ij(j-1))+C(i,ij(j+1))-4*C(i,j)).*(a.^2)./(h.^2);
end
end
F=F2;
for i=1:n
for j=1:n
C2(i,j)=C(i,j)+(F(ij(i-1),j)+F(ij(i+1),j)+F(i,ij(j-1))+F(i,ij(j+1))-4*F(i,j)).*dt./(h^2);
end
end
C=C2;
t = t + dt;
end
function i=ij(i) % Wrap the index periodically: index n+1 maps to 1 and index 0 maps to n
if i==0
i=n;
return
elseif i==n+1
i=1;
end
return
end
thanks a lot
EDIT: Found an answer, it was totally ridiculous and I was searching way too far
%n is still the size of C
h = 1/((n-1))
dt = 1e-6;
a = 1e-2;
F=zeros(n,n);
var1=(a^2)/(h^2); %to make a bit less calculus
var2=dt/(h^2); % the same
t = 0.0;
for iter=1:12000
F=C.^3-C-var1*(C([n 1:n-1],1:n) + C([2:n 1], 1:n) + C(1:n, [n 1:n-1]) + C(1:n, [2:n 1]) - 4*C);
C = C + var2*(F([n 1:n-1], 1:n) + F([2:n 1], 1:n) + F(1:n, [n 1:n-1]) + F(1:n,[2:n 1]) - 4*F);
t = t + dt;
end
Found an answer, it was totally ridiculous and I was searching way too far
%n is still the size of C
h = 1/((n-1))
dt = 1e-6;
a = 1e-2;
F=zeros(n,n);
var1=(a^2)/(h^2); %to make a bit less calculus
var2=dt/(h^2); % the same
prev = [n 1:n-1];
next = [2:n 1];
t = 0.0;
for iter=1:12000
F = C.*C.*C - C - var1*(C(:,next)+C(:,prev)+C(next,:)+C(prev,:)-4*C);
C = C + var2*(F(:,next)+F(:,prev)+F(next,:)+F(prev,:)-4*F);
t = t + dt;
end
The behavior of the inner loop looks like a 2-dimensional circular convolution. That's the same as multiplication in the FFT domain. Subtraction is invariant across a linear operation such as FFT.
You'll want to use the fft2 and ifft2 functions.
Once you do that, I think you'll find that the repeated convolution can be eliminated by raising the convolution kernel (element-wise) to the power iter. If that optimization is correct, I'm predicting a speedup of 5 orders of magnitude.
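A minimal sketch of the FFT idea, restricted to the linear five-point stencil, might look like this. It is an illustration under assumptions, not a full replacement for the loop: the kernel K, lapShift and lapFFT are names made up here, and the cubic term C.^3 is nonlinear, so it would still have to be applied element-wise in the spatial domain at every iteration.

```matlab
% Periodic five-point Laplacian: index shifting versus FFT-based
% circular convolution.
n = 8;
C = rand(n);

% Shift-based stencil with periodic wrap-around
prev = [n 1:n-1];
next = [2:n 1];
lapShift = C(:,next) + C(:,prev) + C(next,:) + C(prev,:) - 4*C;

% The same stencil written as a circular convolution kernel
K = zeros(n);
K(1,1) = -4;          % centre point
K(2,1) = 1;           % neighbour at row offset +1
K(n,1) = 1;           % neighbour at row offset -1 (wraps around)
K(1,2) = 1;           % neighbour at column offset +1
K(1,n) = 1;           % neighbour at column offset -1 (wraps around)

% Circular convolution is a pointwise product in the FFT domain
lapFFT = real(ifft2(fft2(K) .* fft2(C)));

maxDiff = max(abs(lapShift(:) - lapFFT(:)));
disp(maxDiff)
```

The two results should agree up to rounding error. Whether the FFT route is actually faster than plain indexing depends on n and would need to be measured.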
You can replace, for example, C(ij(i-1),j) with circshift(C,[1,0]); likewise, C(ij(i+1),j) corresponds to circshift(C,[-1,0]).
http://www.mathworks.com/help/matlab/ref/circshift.htm
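The direction of the shift can be checked on a small matrix against the index-vector form used in the other answer. This is just a quick sketch; shiftedPrev, shiftedNext, okPrev and okNext are names made up for the illustration.

```matlab
% Check which circshift direction matches the periodic index vectors.
n = 5;
C = magic(n);
prev = [n 1:n-1];     % row i takes its value from row i-1 (with wrap)
next = [2:n 1];       % row i takes its value from row i+1 (with wrap)

shiftedPrev = circshift(C, [1 0]);    % rows move down by one
shiftedNext = circshift(C, [-1 0]);   % rows move up by one

okPrev = isequal(shiftedPrev, C(prev,:));
okNext = isequal(shiftedNext, C(next,:));
disp([okPrev, okNext])
```

The same check works for columns by shifting along the second dimension, for example circshift(C, [0 1]) against C(:,prev).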