CUDA loop on matlab

CUDA loop on matlab - matlab

I have been playing around with parallelization both using ACC and OpenMP in Fortran. I am now trying to do the same in matlab. I find it very interesting that it seems to be very hard to paralelize a loop using GPUs in matlab. Apparently the only way to do it is to by using arrayfun function. But I might be wrong.
At a conceptual level, I am wondering why is the GPU usage in matlab not more straightforward than in fortran. At a more practical level, I am wondering how to use GPUs on the simple code below.
Below, I am sharing three codes and benchmarks:
Fortran OpenMP code
Fortran ACC code
Matlab parfor code
Matlab CUDA (?) this is the one I don't know how to do.
Fortran OpenMP:
program rbc
use omp_lib ! For timing
use tools
implicit none
real, parameter :: beta = 0.984, eta = 2, alpha = 0.35, delta = 0.01, &
rho = 0.95, sigma = 0.005, zmin=-0.0480384, zmax=0.0480384;
integer, parameter :: nz = 4, nk=4800;
real :: zgrid(nz), kgrid(nk), t_tran_z(nz,nz), tran_z(nz,nz);
real :: kmax, kmin, tol, dif, c(nk), r(nk), w(nk);
real, dimension(nk,nz) :: v=0., v0=0., ev=0., c0=0.;
integer :: i, iz, ik, cnt;
logical :: ind(nk);
real(kind=8) :: start, finish ! For timing
real :: tmpmax, c1
call omp_set_num_threads(12)
!Grid for productivity z
! [1 x 4] grid of values for z
call linspace(zmin,zmax,nz,zgrid)
zgrid = exp(zgrid)
! [4 x 4] Markov transition matrix of z
tran_z(1,1) = 0.996757
tran_z(1,2) = 0.00324265
tran_z(1,3) = 0
tran_z(1,4) = 0
tran_z(2,1) = 0.000385933
tran_z(2,2) = 0.998441
tran_z(2,3) = 0.00117336
tran_z(2,4) = 0
tran_z(3,1) = 0
tran_z(3,2) = 0.00117336
tran_z(3,3) = 0.998441
tran_z(3,4) = 0.000385933
tran_z(4,1) = 0
tran_z(4,2) = 0
tran_z(4,3) = 0.00324265
tran_z(4,4) = 0.996757
! Grid for capital k
kmin = 0.95*(1/(alpha*zgrid(1)))*((1/beta)-1+delta)**(1/(alpha-1));
kmax = 1.05*(1/(alpha*zgrid(nz)))*((1/beta)-1+delta)**(1/(alpha-1));
! [1 x 4800] grid of possible values of k
call linspace(kmin, kmax, nk, kgrid)
! Compute initial wealth c0(k,z)
do iz=1,nz
c0(:,iz) = zgrid(iz)*kgrid**alpha + (1-delta)*kgrid;
end do
dif = 10000
tol = 1e-8
cnt = 1
do while(dif>tol)
!$omp parallel do default(shared) private(ik,iz,i,tmpmax,c1)
do ik=1,nk;
do iz = 1,nz;
tmpmax = -huge(0.)
do i = 1,nk
c1 = c0(ik,iz) - kgrid(i)
if(c1<0) exit
c1 = c1**(1-eta)/(1-eta)+ev(i,iz)
if(tmpmax<c1) tmpmax = c1
end do
v(ik,iz) = tmpmax
end do
end do
!$omp end parallel do
ev = beta*matmul(v,tran_z)
dif = maxval(abs(v-v0))
v0 = v
if(mod(cnt,1)==0) write(*,*) cnt, ':', dif
cnt = cnt+1
end do
end program
Fortran ACC:
Just replace the mainloop syntax on the above code with:
do while(dif>tol)
!$acc kernels
!$acc loop gang
do ik=1,nk;
!$acc loop gang
do iz = 1,nz;
tmpmax = -huge(0.)
do i = 1,nk
c1 = c0(ik,iz) - kgrid(i)
if(c1<0) exit
c1 = c1**(1-eta)/(1-eta)+ev(i,iz)
if(tmpmax<c1) tmpmax = c1
end do
v(ik,iz) = tmpmax
end do
end do
!$acc end kernels
ev = beta*matmul(v,tran_z)
dif = maxval(abs(v-v0))
v0 = v
if(mod(cnt,1)==0) write(*,*) cnt, ':', dif
cnt = cnt+1
end do
Matlab parfor:
(I know the code below could be made faster by using vectorized syntax, but the whole point of the exercise is to compare loop speeds).
tic;
beta = 0.984;
eta = 2;
alpha = 0.35;
delta = 0.01;
rho = 0.95;
sigma = 0.005;
zmin=-0.0480384;
zmax=0.0480384;
nz = 4;
nk=4800;
v=zeros(nk,nz);
v0=zeros(nk,nz);
ev=zeros(nk,nz);
c0=zeros(nk,nz);
%Grid for productivity z
%[1 x 4] grid of values for z
zgrid = linspace(zmin,zmax,nz);
zgrid = exp(zgrid);
% [4 x 4] Markov transition matrix of z
tran_z(1,1) = 0.996757;
tran_z(1,2) = 0.00324265;
tran_z(1,3) = 0;
tran_z(1,4) = 0;
tran_z(2,1) = 0.000385933;
tran_z(2,2) = 0.998441;
tran_z(2,3) = 0.00117336;
tran_z(2,4) = 0;
tran_z(3,1) = 0;
tran_z(3,2) = 0.00117336;
tran_z(3,3) = 0.998441;
tran_z(3,4) = 0.000385933;
tran_z(4,1) = 0;
tran_z(4,2) = 0;
tran_z(4,3) = 0.00324265;
tran_z(4,4) = 0.996757;
% Grid for capital k
kmin = 0.95*(1/(alpha*zgrid(1)))*((1/beta)-1+delta)^(1/(alpha-1));
kmax = 1.05*(1/(alpha*zgrid(nz)))*((1/beta)-1+delta)^(1/(alpha-1));
% [1 x 4800] grid of possible values of k
kgrid = linspace(kmin, kmax, nk);
% Compute initial wealth c0(k,z)
for iz=1:nz
c0(:,iz) = zgrid(iz)*kgrid.^alpha + (1-delta)*kgrid;
end
dif = 10000;
tol = 1e-8;
cnt = 1;
while dif>tol
parfor ik=1:nk
for iz = 1:nz
tmpmax = -intmax;
for i = 1:nk
c1 = c0(ik,iz) - kgrid(i);
if (c1<0)
continue
end
c1 = c1^(1-eta)/(1-eta)+ev(i,iz);
if tmpmax<c1
tmpmax = c1;
end
end
v(ik,iz) = tmpmax;
end
end
ev = beta*v*tran_z;
dif = max(max(abs(v-v0)));
v0 = v;
if mod(cnt,1)==0
fprintf('%1.5f : %1.5f \n', [cnt dif])
end
cnt = cnt+1;
end
toc
Matlab CUDA:
This is what I have no clue how to code. Is using arrayfun the only way of doing this? In fortran is so simple to move from OpenMP to OpenACC. Isn't there an easy way in Matlab of going from parfor to GPUs loops?
The time comparison between codes:
Fortran OpenMP: 83.1 seconds
Fortran ACC: 2.4 seconds
Matlab parfor: 1182 seconds
Final remark, I should say the codes above solve a simple Real Business Cycle Model and were written based on this.

Matlab Coder
First, as Dev-iL already mentioned, you can use GPU coder.
It (I use R2019a) would only require minor changes in your code:
function cdapted()
beta = 0.984;
eta = 2;
alpha = 0.35;
delta = 0.01;
rho = 0.95;
sigma = 0.005;
zmin=-0.0480384;
zmax=0.0480384;
nz = 4;
nk=4800;
v=zeros(nk,nz);
v0=zeros(nk,nz);
ev=zeros(nk,nz);
c0=zeros(nk,nz);
%Grid for productivity z
%[1 x 4] grid of values for z
zgrid = linspace(zmin,zmax,nz);
zgrid = exp(zgrid);
% [4 x 4] Markov transition matrix of z
tran_z = zeros([4,4]);
tran_z(1,1) = 0.996757;
tran_z(1,2) = 0.00324265;
tran_z(1,3) = 0;
tran_z(1,4) = 0;
tran_z(2,1) = 0.000385933;
tran_z(2,2) = 0.998441;
tran_z(2,3) = 0.00117336;
tran_z(2,4) = 0;
tran_z(3,1) = 0;
tran_z(3,2) = 0.00117336;
tran_z(3,3) = 0.998441;
tran_z(3,4) = 0.000385933;
tran_z(4,1) = 0;
tran_z(4,2) = 0;
tran_z(4,3) = 0.00324265;
tran_z(4,4) = 0.996757;
% Grid for capital k
kmin = 0.95*(1/(alpha*zgrid(1)))*((1/beta)-1+delta)^(1/(alpha-1));
kmax = 1.05*(1/(alpha*zgrid(nz)))*((1/beta)-1+delta)^(1/(alpha-1));
% [1 x 4800] grid of possible values of k
kgrid = linspace(kmin, kmax, nk);
% Compute initial wealth c0(k,z)
for iz=1:nz
c0(:,iz) = zgrid(iz)*kgrid.^alpha + (1-delta)*kgrid;
end
dif = 10000;
tol = 1e-8;
cnt = 1;
while dif>tol
for ik=1:nk
for iz = 1:nz
tmpmax = double(intmin);
for i = 1:nk
c1 = c0(ik,iz) - kgrid(i);
if (c1<0)
continue
end
c1 = c1^(1-eta)/(1-eta)+ev(i,iz);
if tmpmax<c1
tmpmax = c1;
end
end
v(ik,iz) = tmpmax;
end
end
ev = beta*v*tran_z;
dif = max(max(abs(v-v0)));
v0 = v;
% I've commented out fprintf because double2single cannot handle it
% (could be manually uncommented in the converted version if needed)
% ------------
% if mod(cnt,1)==0
% fprintf('%1.5f : %1.5f \n', cnt, dif);
% end
cnt = cnt+1;
end
end
The script to build this is:
% unload mex files
clear mex
%% Build for gpu, float64
% Produces ".\codegen\mex\cdapted" folder and "cdapted_mex.mexw64"
cfg = coder.gpuConfig('mex');
codegen -config cfg cdapted
% benchmark it (~7.14s on my GTX1080Ti)
timeit(#() cdapted_mex,0)
%% Build for gpu, float32:
% Produces ".\codegen\cdapted\single" folder
scfg = coder.config('single');
codegen -double2single scfg cdapted
% Produces ".\codegen\mex\cdapted_single" folder and "cdapted_single_mex.mexw64"
cfg = coder.gpuConfig('mex');
codegen -config cfg .\codegen\cdapted\single\cdapted_single.m
% benchmark it (~2.09s on my GTX1080Ti)
timeit(#() cdapted_single_mex,0)
So, if your Fortran binary is using float32 precision (I suspect so), this Matlab Coder result is on par with it. That does not mean that both are highly efficient, though. The code, generated by Matlab Coder is still far from being efficient. And it does not fully utilize the GPU (even TDP is ~50%).
Vectorization and gpuArray
Next, I agree with user10597469 and Nicky Mattsson that your Matlab code does not look like normal "native" vectorized Matlab code.
There are many things to adjust. (But arrayfun is hardly better than for). Firstly, let's remove for loops:
function vertorized1()
t_tot = tic();
beta = 0.984;
eta = 2;
alpha = 0.35;
delta = 0.01;
rho = 0.95;
sigma = 0.005;
zmin=-0.0480384;
zmax=0.0480384;
nz = 4;
nk=4800;
v=zeros(nk,nz);
v0=zeros(nk,nz);
ev=zeros(nk,nz);
c0=zeros(nk,nz);
%Grid for productivity z
%[1 x 4] grid of values for z
zgrid = linspace(zmin,zmax,nz);
zgrid = exp(zgrid);
% [4 x 4] Markov transition matrix of z
tran_z = zeros([4,4]);
tran_z(1,1) = 0.996757;
tran_z(1,2) = 0.00324265;
tran_z(1,3) = 0;
tran_z(1,4) = 0;
tran_z(2,1) = 0.000385933;
tran_z(2,2) = 0.998441;
tran_z(2,3) = 0.00117336;
tran_z(2,4) = 0;
tran_z(3,1) = 0;
tran_z(3,2) = 0.00117336;
tran_z(3,3) = 0.998441;
tran_z(3,4) = 0.000385933;
tran_z(4,1) = 0;
tran_z(4,2) = 0;
tran_z(4,3) = 0.00324265;
tran_z(4,4) = 0.996757;
% Grid for capital k
kmin = 0.95*(1/(alpha*zgrid(1)))*((1/beta)-1+delta)^(1/(alpha-1));
kmax = 1.05*(1/(alpha*zgrid(nz)))*((1/beta)-1+delta)^(1/(alpha-1));
% [1 x 4800] grid of possible values of k
kgrid = linspace(kmin, kmax, nk);
% Compute initial wealth c0(k,z)
for iz=1:nz
c0(:,iz) = zgrid(iz)*kgrid.^alpha + (1-delta)*kgrid;
end
dif = 10000;
tol = 0.4;
tol = 1e-8;
cnt = 1;
t_acc=zeros([1,2]);
while dif>tol
%% orig-noparfor:
t=tic();
for ik=1:nk
for iz = 1:nz
tmpmax = -intmax;
for i = 1:nk
c1 = c0(ik,iz) - kgrid(i);
if (c1<0)
continue
end
c1 = c1^(1-eta)/(1-eta)+ev(i,iz);
if tmpmax<c1
tmpmax = c1;
end
end
v(ik,iz) = tmpmax;
end
end
t_acc(1) = t_acc(1) + toc(t);
%% better:
t=tic();
kgrid_ = reshape(kgrid,[1 1 numel(kgrid)]);
c1_ = c0 - kgrid_;
c1_x = c1_.^(1-eta)/(1-eta);
c2 = c1_x + reshape(ev', [1 nz nk]);
c2(c1_<0) = -Inf;
v_ = max(c2,[],3);
t_acc(2) = t_acc(2) + toc(t);
%% compare
assert(isequal(v_,v));
v=v_;
%% other
ev = beta*v*tran_z;
dif = max(max(abs(v-v0)));
v0 = v;
if mod(cnt,1)==0
fprintf('%1.5f : %1.5f \n', cnt, dif);
end
cnt = cnt+1;
end
disp(t_acc);
disp(toc(t_tot));
end
% toc result:
% tol = 0.4 -> 12 iterations :: t_acc = [ 17.7 9.8]
% tol = 1e-8 -> 1124 iterations :: t_acc = [1758.6 972.0]
%
% (all 1124 iterations) with commented-out orig :: t_tot = 931.7443
Now, it is strikingly evident that most computationally intense calculations inside the while loop (e.g. ^(1-eta)/(1-eta)) actually produce constants that could be pre-calculated. Once we fix that, the result would be already a bit faster than the original parfor-based version (on my 2xE5-2630v3):
function vertorized2()
t_tot = tic();
beta = 0.984;
eta = 2;
alpha = 0.35;
delta = 0.01;
rho = 0.95;
sigma = 0.005;
zmin=-0.0480384;
zmax=0.0480384;
nz = 4;
nk=4800;
v=zeros(nk,nz);
v0=zeros(nk,nz);
ev=zeros(nk,nz);
c0=zeros(nk,nz);
%Grid for productivity z
%[1 x 4] grid of values for z
zgrid = linspace(zmin,zmax,nz);
zgrid = exp(zgrid);
% [4 x 4] Markov transition matrix of z
tran_z = zeros([4,4]);
tran_z(1,1) = 0.996757;
tran_z(1,2) = 0.00324265;
tran_z(1,3) = 0;
tran_z(1,4) = 0;
tran_z(2,1) = 0.000385933;
tran_z(2,2) = 0.998441;
tran_z(2,3) = 0.00117336;
tran_z(2,4) = 0;
tran_z(3,1) = 0;
tran_z(3,2) = 0.00117336;
tran_z(3,3) = 0.998441;
tran_z(3,4) = 0.000385933;
tran_z(4,1) = 0;
tran_z(4,2) = 0;
tran_z(4,3) = 0.00324265;
tran_z(4,4) = 0.996757;
% Grid for capital k
kmin = 0.95*(1/(alpha*zgrid(1)))*((1/beta)-1+delta)^(1/(alpha-1));
kmax = 1.05*(1/(alpha*zgrid(nz)))*((1/beta)-1+delta)^(1/(alpha-1));
% [1 x 4800] grid of possible values of k
kgrid = linspace(kmin, kmax, nk);
% Compute initial wealth c0(k,z)
for iz=1:nz
c0(:,iz) = zgrid(iz)*kgrid.^alpha + (1-delta)*kgrid;
end
dif = 10000;
tol = 0.4;
tol = 1e-8;
cnt = 1;
t_acc=zeros([1,2]);
%% constants:
kgrid_ = reshape(kgrid,[1 1 numel(kgrid)]);
c1_ = c0 - kgrid_;
mask=zeros(size(c1_));
mask(c1_<0)=-Inf;
c1_x = c1_.^(1-eta)/(1-eta);
while dif>tol
%% orig:
t=tic();
parfor ik=1:nk
for iz = 1:nz
tmpmax = -intmax;
for i = 1:nk
c1 = c0(ik,iz) - kgrid(i);
if (c1<0)
continue
end
c1 = c1^(1-eta)/(1-eta)+ev(i,iz);
if tmpmax<c1
tmpmax = c1;
end
end
v(ik,iz) = tmpmax;
end
end
t_acc(1) = t_acc(1) + toc(t);
%% better:
t=tic();
c2 = c1_x + reshape(ev', [1 nz nk]);
c2 = c2 + mask;
v_ = max(c2,[],3);
t_acc(2) = t_acc(2) + toc(t);
%% compare
assert(isequal(v_,v));
v=v_;
%% other
ev = beta*v*tran_z;
dif = max(max(abs(v-v0)));
v0 = v;
if mod(cnt,1)==0
fprintf('%1.5f : %1.5f \n', cnt, dif);
end
cnt = cnt+1;
end
disp(t_acc);
disp(toc(t_tot));
end
% toc result:
% tol = 0.4 -> 12 iterations :: t_acc = [ 2.4 1.7]
% tol = 1e-8 -> 1124 iterations :: t_acc = [188.3 115.9]
%
% (all 1124 iterations) with commented-out orig :: t_tot = 117.6217
This vectorized code is still inefficient (e.g. reshape(ev',...), which eats ~60% of time, could be easily avoided by re-ordering dimensions), but it is somewhat suitable for gpuArray():
function vectorized3g()
t0 = tic();
beta = 0.984;
eta = 2;
alpha = 0.35;
delta = 0.01;
rho = 0.95;
sigma = 0.005;
zmin=-0.0480384;
zmax=0.0480384;
nz = 4;
nk=4800;
v=zeros(nk,nz);
v0=zeros(nk,nz);
ev=gpuArray(zeros(nk,nz,'single'));
c0=zeros(nk,nz);
%Grid for productivity z
%[1 x 4] grid of values for z
zgrid = linspace(zmin,zmax,nz);
zgrid = exp(zgrid);
% [4 x 4] Markov transition matrix of z
tran_z = zeros([4,4]);
tran_z(1,1) = 0.996757;
tran_z(1,2) = 0.00324265;
tran_z(1,3) = 0;
tran_z(1,4) = 0;
tran_z(2,1) = 0.000385933;
tran_z(2,2) = 0.998441;
tran_z(2,3) = 0.00117336;
tran_z(2,4) = 0;
tran_z(3,1) = 0;
tran_z(3,2) = 0.00117336;
tran_z(3,3) = 0.998441;
tran_z(3,4) = 0.000385933;
tran_z(4,1) = 0;
tran_z(4,2) = 0;
tran_z(4,3) = 0.00324265;
tran_z(4,4) = 0.996757;
% Grid for capital k
kmin = 0.95*(1/(alpha*zgrid(1)))*((1/beta)-1+delta)^(1/(alpha-1));
kmax = 1.05*(1/(alpha*zgrid(nz)))*((1/beta)-1+delta)^(1/(alpha-1));
% [1 x 4800] grid of possible values of k
kgrid = linspace(kmin, kmax, nk);
% Compute initial wealth c0(k,z)
for iz=1:nz
c0(:,iz) = zgrid(iz)*kgrid.^alpha + (1-delta)*kgrid;
end
dif = 10000;
tol = 1e-8;
cnt = 1;
t_acc=zeros([1,2]);
%% constants:
kgrid_ = reshape(kgrid,[1 1 numel(kgrid)]);
c1_ = c0 - kgrid_;
mask=gpuArray(zeros(size(c1_),'single'));
mask(c1_<0)=-Inf;
c1_x = c1_.^(1-eta)/(1-eta);
c1_x = gpuArray(single(c1_x));
while dif>tol
%% orig:
% t=tic();
% parfor ik=1:nk
% for iz = 1:nz
% tmpmax = -intmax;
%
% for i = 1:nk
% c1 = c0(ik,iz) - kgrid(i);
% if (c1<0)
% continue
% end
% c1 = c1^(1-eta)/(1-eta)+ev(i,iz);
% if tmpmax<c1
% tmpmax = c1;
% end
% end
% v(ik,iz) = tmpmax;
% end
%
% end
% t_acc(1) = t_acc(1) + toc(t);
%% better:
t=tic();
c2 = c1_x + reshape(ev', [1 nz nk]);
c2 = c2 + mask;
v_ = max(c2,[],3);
t_acc(2) = t_acc(2) + toc(t);
%% compare
% assert(isequal(v_,v));
v = v_;
%% other
ev = beta*v*tran_z;
dif = max(max(abs(v-v0)));
v0 = v;
if mod(cnt,1)==0
fprintf('%1.5f : %1.5f \n', cnt, dif);
end
cnt = cnt+1;
end
disp(t_acc);
disp(toc(t0));
end
% (all 849 iterations) with commented-out orig :: t_tot = 14.9040
This ~15 sec result is ~7x worse than those (~2sec) we get from Matlab Coder. But this option requires fewer toolboxes. In practice, gpuArray is most handy when you start from writing "native Matlab code". Including interactive use.
Finally, if you build this final vectorized version with Matlab Coder (you would have to do some trivial adjustments), it won't be faster than the first one. It would be 2x-3x slower.

So, this bit is what is going to mess you up on this project. MATLAB stands for Matrix Laboratory. Vectors and matrices are kind of its thing. The number 1 way to optimize anything in MATLAB is to vectorize it. For this reason, when using performance enhancing tools like CUDA, MATLAB assumes that you are going to vectorize your inputs if possible. Given the primacy of vectorizing inputs in the MATLAB coding style, it is not a fair comparison to assess its performance using only loops. It would be like assessing the performance of C++ while refusing to use pointers. If you want to use CUDA with MATLAB, the main way to go about it is to vectorize your inputs and use gpuarray. Honestly, I haven't looked too hard at your code but it kind of looks like your inputs are already mostly vectorized. You may be able to get away with something as simple as gpuarray(1:nk) or kgrid=gpuarray(linspace(...).

Related

tangent tangent correlation calculation in matrix form MATLAB

Consider the following calculation of the tangent tangent correlation which is performed in a for loop
v1=rand(25,1);
v2=rand(25,1);
n=25;
nSteps=10;
mean_theta = zeros(nSteps,1);
for j=1:nSteps
theta=[];
for i=1:(n-j)
d = dot([v1(i) v2(i)],[v1(i+j) v2(i+j)]);
n1 = norm([v1(i) v2(i)]);
n2 = norm([v1(i+j) v2(i+j)]);
theta = [theta acosd(d/n1/n2)];
end
mean_theta(j)=mean(theta);
end
plot(mean_theta)
How can matlab matrix calculations be utilized to make this performance better?

There are several things you can do to speed up your code. First, always preallocate. This converts:
theta = [];
for i = 1:(n-j)
%...
theta = [theta acosd(d/n1/n2)];
end
into:
theta = zeros(1,n-j);
for i = 1:(n-j)
%...
theta(i) = acosd(d/n1/n2);
end
Next, move the normalization out of the loops. There is no need to normalize over and over again, just normalize the input:
v = [v1,v2];
v = v./sqrt(sum(v.^2,2)); % Can use VECNORM in newest MATLAB
%...
theta(i) = acosd(dot(v(i,:),v(i+j,:)));
This does change the output very slightly, within numerical precision, because the different order of operations leads to different floating-point rounding error.
Finally, you can remove the inner loop by vectorizing that computation:
i = 1:(n-j);
theta = acosd(dot(v(i,:),v(i+j,:),2));
Timings (n=25):
Original: 0.0019 s
Preallocate: 0.0013 s
Normalize once: 0.0011 s
Vectorize: 1.4176e-04 s
Timings (n=250):
Original: 0.0185 s
Preallocate: 0.0146 s
Normalize once: 0.0118 s
Vectorize: 2.5694e-04 s
Note how the vectorized code is the only one whose timing doesn't grow linearly with n.
Timing code:
function so
n = 25;
v1 = rand(n,1);
v2 = rand(n,1);
nSteps = 10;
mean_theta1 = method1(v1,v2,nSteps);
mean_theta2 = method2(v1,v2,nSteps);
fprintf('diff method1 vs method2: %g\n',max(abs(mean_theta1(:)-mean_theta2(:))));
mean_theta3 = method3(v1,v2,nSteps);
fprintf('diff method1 vs method3: %g\n',max(abs(mean_theta1(:)-mean_theta3(:))));
mean_theta4 = method4(v1,v2,nSteps);
fprintf('diff method1 vs method4: %g\n',max(abs(mean_theta1(:)-mean_theta4(:))));
timeit(#()method1(v1,v2,nSteps))
timeit(#()method2(v1,v2,nSteps))
timeit(#()method3(v1,v2,nSteps))
timeit(#()method4(v1,v2,nSteps))
function mean_theta = method1(v1,v2,nSteps)
n = numel(v1);
mean_theta = zeros(nSteps,1);
for j = 1:nSteps
theta=[];
for i=1:(n-j)
d = dot([v1(i) v2(i)],[v1(i+j) v2(i+j)]);
n1 = norm([v1(i) v2(i)]);
n2 = norm([v1(i+j) v2(i+j)]);
theta = [theta acosd(d/n1/n2)];
end
mean_theta(j) = mean(theta);
end
function mean_theta = method2(v1,v2,nSteps)
n = numel(v1);
mean_theta = zeros(nSteps,1);
for j = 1:nSteps
theta = zeros(1,n-j);
for i = 1:(n-j)
d = dot([v1(i) v2(i)],[v1(i+j) v2(i+j)]);
n1 = norm([v1(i) v2(i)]);
n2 = norm([v1(i+j) v2(i+j)]);
theta(i) = acosd(d/n1/n2);
end
mean_theta(j) = mean(theta);
end
function mean_theta = method3(v1,v2,nSteps)
v = [v1,v2];
v = v./sqrt(sum(v.^2,2)); % Can use VECNORM in newest MATLAB
n = numel(v1);
mean_theta = zeros(nSteps,1);
for j = 1:nSteps
theta = zeros(1,n-j);
for i = 1:(n-j)
theta(i) = acosd(dot(v(i,:),v(i+j,:)));
end
mean_theta(j) = mean(theta);
end
function mean_theta = method4(v1,v2,nSteps)
v = [v1,v2];
v = v./sqrt(sum(v.^2,2)); % Can use VECNORM in newest MATLAB
n = numel(v1);
mean_theta = zeros(nSteps,1);
for j = 1:nSteps
i = 1:(n-j);
theta = acosd(dot(v(i,:),v(i+j,:),2));
mean_theta(j) = mean(theta);
end

Here is a full vectorized solution:
i = 1:n-1;
j = (1:nSteps).';
ij= min(i+j,n);
a = cat(3, v1(i).', v2(i).');
b = cat(3, v1(ij), v2(ij));
d = sum(a .* b, 3);
n1 = sum(a .^ 2, 3);
n2 = sum(b .^ 2, 3);
theta = acosd(d./sqrt(n1.*n2));
idx = (1:nSteps).' <= (n-1:-1:1);
mean_theta = sum(theta .* idx ,2) ./ sum(idx,2);
Result of Octave timings for my method,method4 from the answer provided by #CrisLuengo and the original method (n=250):
Full vectorized : 0.000864983 seconds
Method4(Vectorize) : 0.002774 seconds
Original(loop) : 0.340693 seconds

Finding the distribution of '1' in rows/columns from a parity-check matrix

My question this time concerns the obtention of the degree distribution of a LDPC matrix through linear programming, under the following statement:
My code is the following:
function [v] = LP_Irr_LDPC(k,Ebn0)
options = optimoptions('fmincon','Display','iter','Algorithm','interior-point','MaxIter', 4000, 'MaxFunEvals', 70000);
fun = #(v) -sum(v(1:k)./(1:k));
A = [];
b = [];
Aeq = [0, ones(1,k-1)];
beq = 1;
lb = zeros(1,k);
ub = [0, ones(1,k-1)];
nonlcon = #(v)DensEv_SP(v,Ebn0);
l0 = [0 rand(1,k-1)];
l0 = l0./sum(l0);
v = fmincon(fun,l0,A,b,Aeq,beq,lb,ub,nonlcon,options)
end
Definition of nonlinear constraints:
function [c, ceq] = DensEv_SP(v,Ebn0)
% It is also needed to modify this function, as you cannot pass parameters from others to it.
h = [0 rand(1,19)];
h = h./sum(h); % This is where h comes from
syms x;
X = x.^(0:(length(h)-1));
R = h*transpose(X);
ebn0 = 10^(Ebn0/10);
Rm = 1;
LLR = (-50:50);
p03 = 0.3;
LLR03 = log((1-p03)/p03);
r03 = 1 - p03;
noise03 = (2*r03*Rm*ebn0)^-1;
pf03 = normpdf(LLR, LLR03, noise03);
sumpf03 = sum(pf03(1:length(pf03)/2));
divisions = 100;
Aj = zeros(1, divisions);
rho = zeros(1, divisions);
xj = zeros(1, divisions);
k = 10; % Length(v) -> Same value as in 'Complete.m'
for j=1:1:divisions
xj(j) = sumpf03*j/divisions;
rho(j) = subs(R,x,1-xj(j));
Aj(j) = 1 - rho(j);
end
c = zeros(1, length(xj));
lambda = zeros(1, length(Aj));
for j = 1:1:length(xj)
lambda(j) = sum(v(2:k).*(Aj(j).^(1:(k-1))));
c(j) = sumpf03*lambda(j) - xj(j);
end
save Almacen
ceq = [];
%ceq = sum(v)-1;
end
This question is linked to the one posted here. My problem is that I need that each element from vectors v and h resulting from this optimization problem is a fraction of x/N and x/(N(1-r) respectively.
How could I ensure that condition without losing convergence capability?

I came up with one possible solution, at least for vector h, within the function DensEv_SP:
function [c, ceq] = DensEv_SP(v,Ebn0)
% It is also needed to modify this function, as you cannot pass parameters from others to it.
k = 10; % Same as in Complete.m, desired sum of h
M = 19; % Number of integers
h = [0 diff([0,sort(randperm(k+M-1,M-1)),k+M])-ones(1,M)];
h = h./sum(h);
syms x;
X = x.^(0:(length(h)-1));
R = h*transpose(X);
ebn0 = 10^(Ebn0/10);
Rm = 1;
LLR = (-50:50);
p03 = 0.3;
LLR03 = log((1-p03)/p03);
r03 = 1 - p03;
noise03 = (2*r03*Rm*ebn0)^-1;
pf03 = normpdf(LLR, LLR03, noise03);
sumpf03 = sum(pf03(1:length(pf03)/2));
divisions = 100;
Aj = zeros(1, divisions);
rho = zeros(1, divisions);
xj = zeros(1, divisions);
N = 20; % Length(v) -> Same value as in 'Complete.m'
for j=1:1:divisions
xj(j) = sumpf03*j/divisions;
rho(j) = subs(R,x,1-xj(j));
Aj(j) = 1 - rho(j);
end
c = zeros(1, length(xj));
lambda = zeros(1, length(Aj));
for j = 1:1:length(xj)
lambda(j) = sum(v(2:k).*(Aj(j).^(1:(k-1))));
c(j) = sumpf03*lambda(j) - xj(j);
end
save Almacen
ceq = (N*v)-floor(N*v);
%ceq = sum(v)-1;
end
As above stated, there is no longer any problem with vector h; nevertheless the way I defined ceq value seemed to be insufficient to make the optimization work out (the problems with v have not diminished at all). Does anybody know how to find the solution?

Vectorizing 4 nested for loops

I'm trying to vectorize the 2 inner nested for loops, but I can't come up with a way to do this. The FS1 and FS2 functions have been written to accept argument for N_theta and N_e, which is what the loops are iterating over
%% generate regions
for raw_r=1:visual_field_width
for raw_c=1:visual_field_width
r = raw_r - center_r;
c = raw_c - center_c;
% convert (r,c) to polar: (eccentricity, angle)
e = sqrt(r^2+c^2)*deg_per_pixel;
a = mod(atan2(r,c),2*pi);
for nt=1:N_theta
for ne=1:N_e
regions(raw_r, raw_c, nt, ne) = ...
FS_1(nt-1,a,N_theta) * ...
FS_2(ne-1,e,N_e,e0_in_deg, e_max);
end
end
end
end
Ideally, I could replace the two inner nested for loops by:
regions(raw_r,raw_c,:,:) = FS_1(:,a,N_theta) * FS_2(:,N_e,e0_in_deg,e_max);
But this isn't possible. Maybe I'm missing an easy fix or vectorization technique? e0_in_deg and e_max are parameters.
The FS_1 function is
function h = FS_1(n,theta,N,t)
if nargin==2
N = 9;
t=1/2;
elseif nargin==3
t=1/2;
end
w = (2*pi)/N;
theta = theta + w/4;
if n==0 && theta>(3/2)*pi
theta = theta - 2*pi;
end
h = FS_f((theta - (w*n + 0.5*w*(1-t)))/w);
the FS_2 function is
function g = FS_gne(n,e,N,e0, e_max)
if nargin==2
N = 10;
e0 = .5;
elseif nargin==3
e0 = .5;
end
w = (log(e_max) - log(e0))/N;
g = FS_f((log(e)-log(e0)-w*(n+1))/w);
and the FS_f function is
function f = FS_f(x, t)
if nargin<2
t = 0.5;
end
f = zeros(size(x));
% case 1
idx = x>-(1+t)/2 & x<=(t-1)/2;
f(idx) = (cos(0.5*pi*((x(idx)-(t-1)/2)/t))).^2;
% case 2
idx = x>(t-1)/2 & x<=(1-t)/2;
f(idx) = 1;
% case 3
idx = x>(1-t)/2 & x<=(1+t)/2;
f(idx) = -(cos(0.5*pi*((x(idx)-(1+t)/2)/t))).^2+1;

I had to assume values for the constants, and then used ndgrid to find the possible configurations and sub2ind to get the indices. Doing this I removed all loops. Let me know if this produced the correct values.
function RunningFunction
%% generate regions
visual_field_width = 10;
center_r = 2;
center_c = 3;
deg_per_pixel = 17;
N_theta = 2;
N_e = 5;
e0_in_deg = 35;
e_max = 17;
[raw_r, raw_c, nt, ne] = ndgrid(1:visual_field_width, 1:visual_field_width, 1:N_theta, 1:N_e);
ind = sub2ind(size(raw_r), raw_r, raw_c, nt, ne);
r = raw_r - center_r;
c = raw_c - center_c;
% convert (r,c) to polar: (eccentricity, angle)
e = sqrt(r.^2+c.^2)*deg_per_pixel;
a = mod(atan2(r,c),2*pi);
regions(ind) = ...
FS_1(nt-1,a,N_theta) .* ...
FS_2(ne-1,e,N_e,e0_in_deg, e_max);
regions = reshape(regions, size(raw_r));
end
function h = FS_1(n,theta,N,t)
if nargin==2
N = 9;
t=1/2;
elseif nargin==3
t=1/2;
end
w = (2*pi)./N;
theta = theta + w/4;
theta(n==0 & theta>(3/2)*pi) = theta(n==0 & theta>(3/2)*pi) - 2*pi;
h = FS_f((theta - (w*n + 0.5*w*(1-t)))/w);
end
function g = FS_2(n,e,N,e0, e_max)
if nargin==2
N = 10;
e0 = .5;
elseif nargin==3
e0 = .5;
end
w = (log(e_max) - log(e0))/N;
g = FS_f((log(e)-log(e0)-w*(n+1))/w);
end
function f = FS_f(x, t)
if nargin<2
t = 0.5;
end
f = zeros(size(x));
% case 1
idx = x>-(1+t)/2 & x<=(t-1)/2;
f(idx) = (cos(0.5*pi*((x(idx)-(t-1)/2)/t))).^2;
% case 2
idx = x>(t-1)/2 & x<=(1-t)/2;
f(idx) = 1;
% case 3
idx = x>(1-t)/2 & x<=(1+t)/2;
f(idx) = -(cos(0.5*pi*((x(idx)-(1+t)/2)/t))).^2+1;
end

Undefined variable in 'if' statement

I'm writing a script for an aerodynamics class and I'm getting the following error:
Undefined function or variable 'dCt_dx'.
Error in Project2_Iteration (line 81)
Ct = trapz(x,dCt_dx)
I'm not sure what the cause is. It's something to do with my if statement. My script is below:
clear all
clc
global dr a n Vinf Vr w rho k x c cl dr B R beta t
%Environmental Parameters
n = 2400; %rpm
Vinf = 154; %KTAS
rho = 0.07647 * (.7429/.9450); %from mattingly for 8kft
a = 1084; %speed of sound, ft/s, 8000 ft
n = n/60; %convert to rps
w = 2*pi*n;
Vinf = (Vinf*6076.12)/3600; %convert from KTAS to ft/s
k = length(c);
dr = R/k; %length of each blade element
for i = 1:k
r(i) = i*dr - (.5*dr); %radius at center of blade element
dA = 2*pi*r*dr; %Planform area of blade element
x(i) = r(i)/R;
if x(i) > .15 && x(i-1) < .15
i_15 = i;
end
if x(i) > .75 && x(i-1) < .75
i_75h = i;
i_75l = i-1;
end
Vr(i) = w*r(i) + Vinf;
%Aerodynamic Parameters
M = Vr(i)/a;
if M > 0.9
M = 0.9;
end
m0 = 0.9*(2*pi/(1-M^2)^0.5); %lift-curve slope (2pi/rad)
%1: Calculate phi
phi = atan(Vinf/(2*pi*n*r(i)));
%2: Choose Vo
Vo = .00175*Vinf;
%3: Calculate Theta
theta = atan((Vinf + Vo)/(2*pi*n*r(i)))-phi;
%4:
if option == 1
%calculate cl(i) from c(i)
sigma = (B*c(i))/(pi*R);
if sigma > 0
cl(i) = (8*x(i)*theta*cos(phi)*tan(phi+theta))/sigma;
else
cl(i) = 0;
end
else %option == 2
%calculate c(i) from cl(i)
if cl(i) ~= 0
sigma = (8*x(i)*theta*cos(phi)*tan(phi+theta))/cl(i);
else
sigma = 0;
end
c(i) = (sigma*pi*R)/B;
if c(i) < 0
c(i) = 0;
end
end
%5: Calculate cd
cd(i) = 0.0090 + 0.0055*(cl(i)-0.1)^2;
%6: calculate alpha
alpha = cl(i)/m0;
%7: calculate beta
beta(i) = phi + alpha + theta;
%8: calculate dCt/dx and dCq/dx
phi0 = phi+theta;
lambda_t = (1/(cos(phi)^2))*(cl(i)*cos(phi0) - cd(i)*sin(phi0));
lambda_q = (1/(cos(phi)^2))*(cl(i)*sin(phi0) + cd(i)*cos(phi0));
if x(i) >= 0.15
dCt_dx(i) = ((pi^3)*(x(i)^2)*sigma*lambda_t)/8; %Roskam eq. 7.47, pg. 280
dCq_dx(i) = ((pi^3)*(x(i)^3)*sigma*lambda_q)/16; %Roskam eq. 7.48, pg 280
else
dCt_dx(i) = 0;
dCq_dx(i) = 0;
end
%calculate Mdd
t(i) = (0.04/(x(i)^1.2))*c(i);
Mdd(i) = 0.94 - (t(i)/c(i)) - cl(i)/10;
end
%9: calculate Ct, Cq, Cd
Ct = trapz(x,dCt_dx)
Cq = trapz(x,dCq_dx)
D = 2*R;
Q=(rho*(n^2)*(D^5)*Cq)
T=(rho*(n^2)*(D^4)*Ct)

When I step through your script, I see that the the entire for i = 1:k loop is skipped because k=0. You set k = length(c), but c was never initialized to a value, so it has length zero.
Because of this, dCt_dx is never given a value--and more importantly the majority of your script is never run.
If you're going to be using MATLAB in the future, I really suggest learning how to do this. It makes it a lot easier to find bugs. Try looking at this video.

The Euler-Maruyama uses timestep Dt multiple of a step size dt

I wonder if you could help me with this?
why the Euler-Maruyama uses timestep Dt multiple of a step size of the increment dt for the Brownian path ?
http://personal.strath.ac.uk/d.j.higham/em.m
%EM Euler-Maruyama method on linear SDE
%
% SDE is dX = lambda*X dt + mu*X dW, X(0) = Xzero,
% where lambda = 2, mu = 1 and Xzero = 1.
%
% Discretized Brownian path over [0,1] has dt = 2^(-8).
% Euler-Maruyama uses timestep R*dt.
randn('state',100)
lambda = 2; mu = 1; Xzero = 1; % problem parameters
T = 1; N = 2^8; dt = T/N;
dW = sqrt(dt)*randn(1,N); % Brownian increments
W = cumsum(dW); % discretized Brownian path
Xtrue = Xzero*exp((lambda-0.5*mu^2)*([dt:dt:T])+mu*W);
plot([0:dt:T],[Xzero,Xtrue],'m-'), hold on
R = 4; Dt = R*dt; L = N/R; % L EM steps of size Dt = R*dt
Xem = zeros(1,L); % preallocate for efficiency
Xtemp = Xzero;
for j = 1:L
Winc = sum(dW(R*(j-1)+1:R*j));
Xtemp = Xtemp + Dt*lambda*Xtemp + mu*Xtemp*Winc;
Xem(j) = Xtemp;
end
plot([0:Dt:T],[Xzero,Xem],'r--*'), hold off
xlabel('t','FontSize',12)
ylabel('X','FontSize',16,'Rotation',0,'HorizontalAlignment','right')
emerr = abs(Xem(end)-Xtrue(end))

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

CUDA loop on matlab - matlab

Related

tangent tangent correlation calculation in matrix form MATLAB

Finding the distribution of '1' in rows/columns from a parity-check matrix

Vectorizing 4 nested for loops

Undefined variable in 'if' statement

The Euler-Maruyama uses timestep Dt multiple of a step size dt

Categories

Resources