Simple Monte Carlo on GPU using Parallel toolbox - matlab

Here is a toy example that I put together exploring the parfor function, using the CPU to speed up execution. Even after reviewing the Parallel Computing Toolbox documentation, however, I am confused about how to upgrade this to run on my GPU (Nvidia 980ti).
I would appreciate any pointers on how to update this code to run on the GPU.
Cheers.
% toy example--monte carlo estimation of pi using for loops
tic;
N = 1000000000;
hitcounter = 0;
for i = 1:N
    x = rand;
    y = rand;
    if ( y < sqrt(1-x*x) )
        hitcounter = hitcounter + 1;
    end
end
disp(hitcounter/N*4)
toc;
% toy example--monte carlo estimation of pi using parfor loops
tic;
N = 1000000000;
hitcounter = 0;
parfor i = 1:N
    x = rand;
    y = rand;
    if ( y < sqrt(1-x*x) )
        hitcounter = hitcounter + 1;
    end
end
disp(hitcounter/N*4)
toc;

The main thing you need to do is to vectorise your code - this is always a good idea, and especially so on the GPU. Then, you simply need to build x and y directly on the GPU using the trailing argument to rand.
N = 1000000;
x = rand(1, N, 'gpuArray');
y = rand(1, N, 'gpuArray');
pi_est = sum(y < sqrt(1 - x.*x)) / N * 4;
Note that I scaled back N so that this fits in GPU memory. If you want to run with a higher value of N, I would suggest adding an outer loop and performing the computation in "chunks" that fit in the limited memory of the GPU, as sketched below.
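For instance, a minimal sketch of that chunked approach might look like this (the total count and chunk size below are arbitrary choices, not tuned to any particular card):
% Chunked Monte Carlo on the GPU: accumulate hits over GPU-sized batches
Ntotal    = 1e9;    % total number of samples (arbitrary)
chunkSize = 1e7;    % samples per chunk (arbitrary, pick to fit GPU memory)
hits = 0;
for c = 1:(Ntotal / chunkSize)
    x = rand(1, chunkSize, 'gpuArray');
    y = rand(1, chunkSize, 'gpuArray');
    hits = hits + sum(y < sqrt(1 - x.*x));
end
pi_est = gather(hits) / Ntotal * 4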

Related

Nested for loops in Matlab GPU programming

I want to run this specific nested for loop on the GPU using Matlab. Can anybody help me?
Phi = rand(100,100); FluxD = rand(100,100); FluxC = rand(100,100);
dt = 0.1; % (assumed) time step; dt is used below but was not defined in the original snippet
Ima = 100;
Jma = 100;
for i = 1:Ima-1
    for j = 1:Jma-1
        Phi(i,j) = Phi(i,j) + dt*(FluxD(i,j) - FluxC(i,j));
    end
end
You need to do two things here - firstly, build your data on the GPU, and then for best performance, operate on it in a vectorised manner, like this:
% Build input data arrays directly on the GPU
Phi = rand(100, 'gpuArray');
FluxD = rand(100, 'gpuArray');
FluxC = rand(100, 'gpuArray');
Ima = 100;
Jma = 100;
% For convenience, make index vectors for i and j
ii = 1:Ima-1;
jj = 1:Jma-1;
% Compute Phi in a vectorised manner
Phi(ii, jj) = Phi(ii, jj) + dt * (FluxD(ii,jj) - FluxC(ii,jj));

How can I speed up this MATLAB code with a while-loop?

I'm using code that calculates the expectation value of probabilities. This code contains a while-loop that finds all possible combinations and adds up products of probability combinations. However, when the number of elements becomes large (over 40), it takes too much time, and I want to make the code faster.
The code is as follows:
function pcs = combsum(N, K, prbv)
nprbv = 1 - prbv; % prbv: probability vector
WV = 1:K;   % Working vector.
lim = K;    % Sets the limit for working index.
inc = 0;    % Controls which element of WV is being worked on.
pcs = 0;
stopp = 0;
while stopp == 0
    if logical((inc+lim) - N)
        stp = inc;  % This is where the for loop below stops.
        flg = 0;    % Used for resetting inc.
    else
        stp = 1;
        flg = 1;
    end
    for jj = 1:stp
        WV(K + jj - inc) = lim + jj; % Faster than a vector assignment.
    end
    PV = nprbv;
    PV(WV) = prbv(WV);
    pcs = prod(PV) + pcs;
    inc = inc*flg + 1;      % Increment the counter.
    lim = WV(K - inc + 1);  % lim for next run.
    if (inc == K) && (lim == N-K)
        stopp = 1;
        WV = (N-K+1):N;
        PV = nprbv;
        PV(WV) = prbv(WV);
        pcs = prod(PV) + pcs;
    end
end
Is there a way to reduce the calculation time? I wonder if parallel computing on the GPU would help.
I tried to remove the loop-carried dependencies so the work could be parallelised, and I built a matrix of all possible combinations using the 'combnk' function. This worked faster:
nprbv=1-prbv; %prbv : a probability vector
N = 40;
K = 4;
n_combnk = size(combnk(1:N,K),1);
PV_mat = repmat(nprbv,n_combnk,1);
cnt = 0;
tic;
for i = 1:N-K+1
    for j = i+1:N-K+2
        for k = j+1:N-K+3
            for l = k+1:N-K+4
                cnt = cnt+1;
                PV_mat(cnt,i) = prbv(i);
                PV_mat(cnt,j) = prbv(j);
                PV_mat(cnt,k) = prbv(k);
                PV_mat(cnt,l) = prbv(l);
            end
        end
    end
end
toc;
tic;
pcs_rr = sum(prod(PV_mat,2));
toc;
However, when the value of K gets larger, an out-of-memory problem occurs when building the combination matrix (PV_mat). How can I break the big matrix up into smaller ones to avoid the memory problem?
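One option (a rough sketch of my own, not from the original post) is to accumulate pcs block by block, fixing the first chosen index i per block, so that only nchoosek(N-i, K-1) rows of the PV matrix exist in memory at a time. It assumes prbv is a 1-by-N probability vector and K >= 3 (as in the K = 4 example above), so nchoosek always receives a vector with more than one element; for very large K you would have to fix more than one leading index in the same way.
nprbv = 1 - prbv;
pcs = 0;
for i = 1:(N - K + 1)
    rest = nchoosek(i+1:N, K-1);        % remaining K-1 indices, all larger than i
    m = size(rest, 1);
    PV_blk = repmat(nprbv, m, 1);       % one block of the former PV_mat
    PV_blk(:, i) = prbv(i);             % column i is fixed for this block
    for c = 1:K-1                       % fill in the other K-1 chosen columns
        lin = sub2ind(size(PV_blk), (1:m).', rest(:, c));
        PV_blk(lin) = reshape(prbv(rest(:, c)), [], 1);
    end
    pcs = pcs + sum(prod(PV_blk, 2));
end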

Compute weighted summation of matrix power (matrix polynomial) in Matlab

Given an n-by-n matrix A and an n-by-1 vector x, is there any smart way in Matlab to compute the matrix polynomial
J = x(1)*A + x(2)*A^2 + ... + x(n)*A^n
where the x(i) are the elements of the vector x, so that J is a sum of matrices? So far I have used a for loop, but I was wondering if there was a smarter way.
Short answer: you can use the builtin matlab function polyvalm for matrix polynomial evaluation as follows:
x = x(end:-1:1); % flip the order of the elements
x(end+1) = 0; % append 0
J = polyvalm(x, A);
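The flip and the appended zero are needed because polyvalm evaluates its coefficient vector in descending powers of A, with the last coefficient as the constant term. A quick sanity check on a small random example (my own, not part of the original answer):
A = rand(3);                 % small random test matrix
x = rand(3, 1);
p = [x(end:-1:1); 0].';      % descending-power coefficients, zero constant term
J1 = polyvalm(p, A);         % x(3)*A^3 + x(2)*A^2 + x(1)*A
J2 = x(1)*A + x(2)*A^2 + x(3)*A^3;
max(abs(J1(:) - J2(:)))      % should be on the order of eps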
Long answer: polyvalm uses a loop internally, so you don't gain much over a hand-written loop, and you can even beat the builtin function if you optimise your own implementation (see my calcJ_loopOptimised function below):
% construct random input
n = 100;
A = rand(n);
x = rand(n, 1);
% calculate the result using different methods
Jbuiltin = calcJ_builtin(A, x);
Jloop = calcJ_loop(A, x);
JloopOptimised = calcJ_loopOptimised(A, x);
% check if the functions are mathematically equivalent (should be in the order of `eps`)
relativeError1 = max(max(abs(Jbuiltin - Jloop)))/max(max(Jbuiltin))
relativeError2 = max(max(abs(Jloop - JloopOptimised)))/max(max(Jloop))
% measure the execution time
t_loopOptimised = timeit(@() calcJ_loopOptimised(A, x))
t_builtin = timeit(@() calcJ_builtin(A, x))
t_loop = timeit(@() calcJ_loop(A, x))
% check if builtin function is faster
builtinFaster = t_builtin < t_loopOptimised
% calculate J using Matlab builtin function
function J = calcJ_builtin(A, x)
    x = x(end:-1:1);
    x(end+1) = 0;
    J = polyvalm(x, A);
end
% naive loop implementation
function J = calcJ_loop(A, x)
    n = size(A, 1);
    J = zeros(n, n);
    for i = 1:n
        J = J + A^i * x(i);
    end
end
% optimised loop implementation (cache result of matrix power)
function J = calcJ_loopOptimised(A, x)
    n = size(A, 1);
    J = zeros(n, n);
    A_ = eye(n);
    for i = 1:n
        A_ = A_*A;
        J = J + A_ * x(i);
    end
end
For n=100, I get the following:
t_loopOptimised = 0.0077
t_builtin = 0.0084
t_loop = 0.0295
For n=5, I get the following:
t_loopOptimised = 7.4425e-06
t_builtin = 4.7399e-05
t_loop = 1.0496e-04
Note that my timings fluctuate somewhat between runs, but the optimised loop is almost always faster (up to 6x for small n) than the builtin function.

Trapezoidal Numerical Integration in MATLAB without using a FOR loop?

I'm following a Numerical Methods course, and I made a small MATLAB script to compute integrals using the trapezoidal method. However, my script uses a FOR loop, and my friend told me I'm doing something wrong if I use a FOR loop in Matlab. Is there a way to convert this script to a Matlab-friendly one?
%Number of points to use
N = 4;
%Integration interval
a = 0;
b = 0.5;
%Width of the integration segments
h = (b-a) / N;
F = exp(a);
for i = 1:N-1
    F = F + 2*exp(a+i*h);
end
F = F + exp(b);
F = h/2*F
Vectorization is important for speed and clarity, but so is using built-in functions whenever possible. Matlab has a built-in function for trapezoidal numerical integration called trapz. Here is an example:
x = 0:.125:.5
y = exp(x)
F = trapz(x,y)
It is recommended to vectorize your code.
%Number of points to use
N = 4;
%Integration interval
a = 0;
b = 0.5;
%Width of the integration segments
h = (b-a) / N;
x = 1:1:N-1;
F = h/2*(exp(a) + sum(2*exp(a+x*h)) + exp(b));
However, I've read that Matlab is no longer slow at for loops.

MATLAB How to vectorize these for loops?

I've searched a lot but didn't find any solution to my problem. Could you please help me vectorize these loops (or suggest any other way to make them much faster)?
% n is the size of C
h = 1/(n-1)
dt = 1e-6;
a = 1e-2;
F=zeros(n,n);
F2=zeros(n,n);
C2=zeros(n,n);
t = 0.0;
for iter = 1:12000
    F2 = F.^3 - F;
    for i = 1:n
        for j = 1:n
            F2(i,j) = F2(i,j) - (C(ij(i-1),j) + C(ij(i+1),j) + C(i,ij(j-1)) + C(i,ij(j+1)) - 4*C(i,j)).*(a.^2)./(h.^2);
        end
    end
    F = F2;
    for i = 1:n
        for j = 1:n
            C2(i,j) = C(i,j) + (F(ij(i-1),j) + F(ij(i+1),j) + F(i,ij(j-1)) + F(i,ij(j+1)) - 4*F(i,j)).*dt./(h^2);
        end
    end
    C = C2;
    t = t + dt;
end
function i = ij(i) % wrap the indices periodically: index 0 maps to n and index n+1 maps to 1
    if i == 0
        i = n;
        return
    elseif i == n+1
        i = 1;
    end
    return
end
thanks a lot
EDIT: Found an answer; it was ridiculously simple and I was searching way too far:
%n is still the size of C
h = 1/((n-1))
dt = 1e-6;
a = 1e-2;
F=zeros(n,n);
var1=(a^2)/(h^2); %to make a bit less calculus
var2=dt/(h^2); % the same
t = 0.0;
for iter = 1:12000
    F = C.^3 - C - var1*(C([n 1:n-1],1:n) + C([2:n 1],1:n) + C(1:n,[n 1:n-1]) + C(1:n,[2:n 1]) - 4*C);
    C = C + var2*(F([n 1:n-1],1:n) + F([2:n 1],1:n) + F(1:n,[n 1:n-1]) + F(1:n,[2:n 1]) - 4*F);
    t = t + dt;
end
Found an answer; it was ridiculously simple and I was searching way too far. Here is a slightly refined version of the same fix, with the wrap-around index vectors precomputed:
%n is still the size of C
h = 1/((n-1))
dt = 1e-6;
a = 1e-2;
F=zeros(n,n);
var1=(a^2)/(h^2); %to make a bit less calculus
var2=dt/(h^2); % the same
prev = [n 1:n-1];
next = [2:n 1];
t = 0.0;
for iter = 1:12000
    F = C.*C.*C - C - var1*(C(:,next) + C(:,prev) + C(next,:) + C(prev,:) - 4*C);
    C = C + var2*(F(:,next) + F(:,prev) + F(next,:) + F(prev,:) - 4*F);
    t = t + dt;
end
The behavior of the inner loop looks like a 2-dimensional circular convolution. That's the same as multiplication in the FFT domain, and subtraction commutes with a linear operation such as the FFT.
You'll want to use the fft2 and ifft2 functions.
Once you do that, I think you'll find that the repeated convolution can be eliminated by raising the convolution kernel (element-wise) to the power iter. If that optimization is correct, I'm predicting a speedup of 5 orders of magnitude.
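As a rough sketch of that idea (the kernel below is my own reading of the wrap-around stencil in the loops; note that the cubic term, F.^3 - F in the question's code, is nonlinear, so the time loop itself still has to run), the periodic 5-point stencil can be computed via fft2/ifft2 like this:
n = size(C, 1);
kernel = zeros(n);
kernel(1, 1) = -4;                   % centre of the stencil
kernel(2, 1) = 1;  kernel(n, 1) = 1; % up/down neighbours, with wrap-around
kernel(1, 2) = 1;  kernel(1, n) = 1; % left/right neighbours, with wrap-around
Kf = fft2(kernel);                   % transform once, reuse inside the time loop
lapC = real(ifft2(fft2(C) .* Kf));   % same as C(prev,:)+C(next,:)+C(:,prev)+C(:,next)-4*C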
You can replace, for example, C(ij(i-1),j) by using circshift(C,[1,0]) or circshift(C,[-1,0]) (I can't figure out which of the two is correct).
http://www.mathworks.com/help/matlab/ref/circshift.htm
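For what it's worth, a quick check on a small example (my own) shows which shift direction matches the wrapped indexing used above: circshift(C,[1,0]) is the one that corresponds to C(ij(i-1),j).
n = 5;
M = magic(n);                                      % any small test matrix
isequal(circshift(M, [ 1, 0]), M([n 1:n-1], :))    % true: element (i,j) becomes M(i-1,j)
isequal(circshift(M, [-1, 0]), M([2:n 1], :))      % true: element (i,j) becomes M(i+1,j)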