Segmentation Fault when using glmnet mex with MATLAB - matlab

My jobs have been suffering due to segmentation faults when calling glmnet (downloaded from here:http://web.stanford.edu/~hastie/glmnet_matlab/download.html) from my MATLAB code. I call the glmnet routine thousands of times. I have noticed the following peculiarities about the problem occurence:
The problem is more frequent when the size of my input matrices are larger.
I use both gaussian and poisson distribution in separate jobs, and I notice that the problem is more frequent when fitting the Poisson distribution (which also takes usually longer to converge, so might involve more loops internally?)
Since there haven't been reports of segmentation faults for the R version for these two distributions, my suspicion is that the problem, likely a memory leak, might lie in the mex interface rather than the core glmnet Fortran code, which I am pasting below. Any insights into where a memory leak might be happening is greatly appreciated! Apologies for the lengthy code dump.
Thanks!
subroutine mexFunction(nlhs, plhs, nrhs, prhs)
C-----------------------------------------------------------------------
mwpointer plhs(*), prhs(*)
mwpointer mxCreateDoubleMatrix, mxGetPr, mxCreateNumericArray
integer nlhs, nrhs
mwsize mxGetM, mxGetN, mxGetNzmax
integer mxIsNumeric
integer mxIsSparse
C-----------------------------------------------------------------------
C Input
real parm,flmin,thr, intr
integer ka,no,ni,nr,nc,ne,nx,nlam,isd,maxit,kopt,isparse,nnz,jsd
real, dimension (:), allocatable :: x,y,w,vp,ulam,cl,sr,xs,o,d,
$ flog,a
integer, dimension (:), allocatable :: ix,jx,jd,irs,jcs
mwpointer pr
C Output
integer lmu,nlp,jerr
real dev0
real, dimension (:), allocatable :: a0,ca,alm,dev,rsq
integer, dimension (:), allocatable :: ia,nin
C Temporary
mwpointer temp_pr
mwsize temp_m, temp_n, temp_nzmax, dims(3)
integer task,i
C For internal parameters
real fdev, devmax, eps, big, pmin, prec, exmx
integer mnlam, mxit
C Check for proper number of arguments.
if (nrhs .eq. 0) then
task = -1;
else
temp_pr = mxGetPr(prhs(1))
call getinteger(temp_pr,task,1)
endif
C Get input
if (task .eq. -1) then
call get_int_parms(fdev,eps,big,mnlam,devmax,pmin,exmx)
call get_bnorm(prec,mxit)
plhs(1) = mxCreateDoubleMatrix(1,1,0)
temp_pr = mxGetPr(plhs(1))
call putreal(fdev,temp_pr,1)
plhs(2) = mxCreateDoubleMatrix(1,1,0)
temp_pr = mxGetPr(plhs(2))
call putreal(devmax,temp_pr,1)
plhs(3) = mxCreateDoubleMatrix(1,1,0)
temp_pr = mxGetPr(plhs(3))
call putreal(eps,temp_pr,1)
plhs(4) = mxCreateDoubleMatrix(1,1,0)
temp_pr = mxGetPr(plhs(4))
call putreal(big,temp_pr,1)
plhs(5) = mxCreateDoubleMatrix(1,1,0)
temp_pr = mxGetPr(plhs(5))
call putinteger(mnlam,temp_pr,1)
plhs(6) = mxCreateDoubleMatrix(1,1,0)
temp_pr = mxGetPr(plhs(6))
call putreal(pmin,temp_pr,1)
plhs(7) = mxCreateDoubleMatrix(1,1,0)
temp_pr = mxGetPr(plhs(7))
call putreal(exmx,temp_pr,1)
plhs(8) = mxCreateDoubleMatrix(1,1,0)
temp_pr = mxGetPr(plhs(8))
call putreal(prec,temp_pr,1)
plhs(9) = mxCreateDoubleMatrix(1,1,0)
temp_pr = mxGetPr(plhs(9))
call putinteger(mxit,temp_pr,1)
return
endif
if (task .eq. 0) then
temp_pr = mxGetPr(prhs(2))
call getreal(temp_pr,fdev,1)
temp_pr = mxGetPr(prhs(3))
call getreal(temp_pr,devmax,1)
temp_pr = mxGetPr(prhs(4))
call getreal(temp_pr,eps,1)
temp_pr = mxGetPr(prhs(5))
call getreal(temp_pr,big,1)
temp_pr = mxGetPr(prhs(6))
call getinteger(temp_pr,mnlam,1)
temp_pr = mxGetPr(prhs(7))
call getreal(temp_pr,pmin,1)
temp_pr = mxGetPr(prhs(8))
call getreal(temp_pr,exmx,1)
temp_pr = mxGetPr(prhs(9))
call getreal(temp_pr,prec,1)
temp_pr = mxGetPr(prhs(10))
call getinteger(temp_pr,mxit,1)
call chg_fract_dev(fdev)
call chg_dev_max(devmax)
call chg_min_flmin(eps)
call chg_big(big)
call chg_min_lambdas(mnlam)
call chg_min_null_prob(pmin)
call chg_max_exp(exmx)
call chg_bnorm(prec, mxit)
return
endif
c$$$ -----------------Gaussian--------------------
c$$$ ---input---
if (task .eq. 10 .or. task .eq. 11) then
if (task .eq. 11) then
temp_pr = mxGetPr(prhs(3))
temp_m = mxGetM(prhs(3))
no = temp_m
temp_n = mxGetN(prhs(3))
ni = temp_n
allocate(x(1:no*ni))
call getreal(temp_pr,x,no*ni)
else
temp_m = mxGetM(prhs(4))
no = temp_m
temp_pr = mxGetPr(prhs(3))
temp_m = mxGetM(prhs(3))
nnz = temp_m
allocate(xs(1:nnz))
call getreal(temp_pr,xs,nnz)
temp_pr = mxGetPr(prhs(19))
allocate(irs(1:nnz))
call getinteger(temp_pr,irs,nnz)
temp_pr = mxGetPr(prhs(20))
temp_n = mxGetM(prhs(20))
ni = temp_n - 1
allocate(jcs(1:(ni+1)))
call getinteger(temp_pr,jcs,(ni+1))
endif
temp_pr = mxGetPr(prhs(2))
call getreal(temp_pr,parm,1)
temp_pr = mxGetPr(prhs(4))
allocate(y(1:no))
call getreal(temp_pr,y,no)
temp_pr = mxGetPr(prhs(5))
temp_m = mxGetM(prhs(5))
temp_n = mxGetN(prhs(5))
allocate(jd(temp_m*temp_n))
call getinteger(temp_pr,jd,temp_m*temp_n)
temp_pr = mxGetPr(prhs(6))
allocate(vp(1:ni))
call getreal(temp_pr,vp,ni)
temp_pr = mxGetPr(prhs(7))
call getinteger(temp_pr,ne,1)
temp_pr = mxGetPr(prhs(8))
call getinteger(temp_pr,nx,1)
temp_pr = mxGetPr(prhs(9))
call getinteger(temp_pr,nlam,1)
temp_pr = mxGetPr(prhs(10))
call getreal(temp_pr,flmin,1)
temp_pr = mxGetPr(prhs(11))
temp_m = mxGetM(prhs(11))
temp_n = mxGetN(prhs(11))
allocate(ulam(1:temp_m * temp_n))
call getreal(temp_pr,ulam,temp_m * temp_n)
temp_pr = mxGetPr(prhs(12))
call getreal(temp_pr,thr,1)
temp_pr = mxGetPr(prhs(13))
call getinteger(temp_pr,isd,1)
temp_pr = mxGetPr(prhs(14))
allocate(w(1:no))
call getreal(temp_pr,w,no)
temp_pr = mxGetPr(prhs(15))
call getinteger(temp_pr,ka,1)
temp_pr = mxGetPr(prhs(16))
allocate(cl(1:2*ni))
call getreal(temp_pr,cl,2*ni)
temp_pr = mxGetPr(prhs(17))
call getinteger(temp_pr,intr,1)
temp_pr = mxGetPr(prhs(18))
call getinteger(temp_pr,maxit,1)
c$$$ ---prepare output---
allocate(ia(1:nx))
call zerointeger(ia,nx)
allocate(nin(1:nlam))
call zerointeger(nin,nlam)
allocate(alm(1:nlam))
call zeroreal(alm,nlam)
allocate(a0(1:nlam))
call zeroreal(a0,nlam)
allocate(ca(1:nx*nlam))
call zeroreal(ca,nx*nlam)
allocate(rsq(1:nlam))
call zeroreal(rsq,nlam)
c$$$ ---computation----
if (task .eq. 11) then
call elnet(ka,parm,no,ni,x,y,w,jd,vp,cl,ne,nx,nlam,flmin,
$ ulam,thr,isd,intr,maxit,lmu,a0,ca,ia,nin,rsq,alm,
$ nlp,jerr)
else
call spelnet(ka,parm,no,ni,xs,jcs,irs,y,w,jd,vp,cl,ne,nx,
$ nlam,flmin,ulam,thr,isd,intr,maxit,lmu,a0,ca,ia,nin,
$ rsq,alm,nlp,jerr)
endif
c$$$ ----output-----
plhs(1) = mxCreateDoubleMatrix(1,1,0)
temp_pr = mxGetPr(plhs(1))
call putinteger(lmu,temp_pr,1)
plhs(4) = mxCreateDoubleMatrix(nx,1,0)
temp_pr = mxGetPr(plhs(4))
call putinteger(ia,temp_pr,nx)
plhs(5) = mxCreateDoubleMatrix(lmu,1,0)
temp_pr = mxGetPr(plhs(5))
call putinteger(nin,temp_pr,lmu)
plhs(7) = mxCreateDoubleMatrix(lmu,1,0)
temp_pr = mxGetPr(plhs(7))
call putreal(alm,temp_pr,lmu)
plhs(8) = mxCreateDoubleMatrix(1,1,0)
temp_pr = mxGetPr(plhs(8))
call putinteger(nlp,temp_pr,1)
plhs(9) = mxCreateDoubleMatrix(1,1,0)
temp_pr = mxGetPr(plhs(9))
call putinteger(jerr,temp_pr,1)
plhs(2) = mxCreateDoubleMatrix(lmu,1,0)
temp_pr = mxGetPr(plhs(2))
call putreal(a0,temp_pr,lmu)
plhs(3) = mxCreateDoubleMatrix(nx,lmu,0)
temp_pr = mxGetPr(plhs(3))
call putreal(ca,temp_pr,nx*lmu)
plhs(6) = mxCreateDoubleMatrix(lmu,1,0)
temp_pr = mxGetPr(plhs(6))
call putreal(rsq,temp_pr,lmu)
deallocate(y)
deallocate(jd)
deallocate(vp)
deallocate(ulam)
deallocate(a0)
deallocate(ca)
deallocate(ia)
deallocate(nin)
deallocate(alm)
deallocate(w)
deallocate(rsq)
deallocate(cl)
if (task .eq. 11) then
deallocate(x)
else
deallocate(xs)
deallocate(irs)
deallocate(jcs)
endif
return
endif
c$$$ --------------end of Gaussian---------------------------
c$$$ ---------------Poisson--------------------------
c$$$ ---input---
if (task .eq. 50 .or. task .eq. 51) then
if (task .eq. 51) then
temp_pr = mxGetPr(prhs(3))
temp_m = mxGetM(prhs(3))
no = temp_m
temp_n = mxGetN(prhs(3))
ni = temp_n
allocate(x(1:no*ni))
call getreal(temp_pr,x,no*ni)
else
temp_m = mxGetM(prhs(4))
no = temp_m
temp_pr = mxGetPr(prhs(3))
temp_m = mxGetM(prhs(3))
nnz = temp_m
allocate(xs(1:nnz))
call getreal(temp_pr,xs,nnz)
temp_pr = mxGetPr(prhs(19))
allocate(irs(1:nnz))
call getinteger(temp_pr,irs,nnz)
temp_pr = mxGetPr(prhs(20))
temp_n = mxGetM(prhs(20))
ni = temp_n - 1
allocate(jcs(1:(ni+1)))
call getinteger(temp_pr,jcs,(ni+1))
endif
temp_pr = mxGetPr(prhs(2))
call getreal(temp_pr,parm,1)
temp_pr = mxGetPr(prhs(4))
allocate(y(1:no))
call getreal(temp_pr,y,no)
temp_pr = mxGetPr(prhs(5))
temp_m = mxGetM(prhs(5))
temp_n = mxGetN(prhs(5))
allocate(jd(temp_m*temp_n))
call getinteger(temp_pr,jd,temp_m*temp_n)
temp_pr = mxGetPr(prhs(6))
allocate(vp(1:ni))
call getreal(temp_pr,vp,ni)
temp_pr = mxGetPr(prhs(7))
call getinteger(temp_pr,ne,1)
temp_pr = mxGetPr(prhs(8))
call getinteger(temp_pr,nx,1)
temp_pr = mxGetPr(prhs(9))
call getinteger(temp_pr,nlam,1)
temp_pr = mxGetPr(prhs(10))
call getreal(temp_pr,flmin,1)
temp_pr = mxGetPr(prhs(11))
temp_m = mxGetM(prhs(11))
temp_n = mxGetN(prhs(11))
allocate(ulam(1:temp_m * temp_n))
call getreal(temp_pr,ulam,temp_m * temp_n)
temp_pr = mxGetPr(prhs(12))
call getreal(temp_pr,thr,1)
temp_pr = mxGetPr(prhs(13))
call getinteger(temp_pr,isd,1)
temp_pr = mxGetPr(prhs(14))
allocate(w(1:no))
call getreal(temp_pr,w,no)
temp_pr = mxGetPr(prhs(15))
allocate(cl(1:2*ni))
call getreal(temp_pr,cl,2*ni)
temp_pr = mxGetPr(prhs(16))
call getinteger(temp_pr,intr,1)
temp_pr = mxGetPr(prhs(17))
call getinteger(temp_pr,maxit,1)
temp_pr = mxGetPr(prhs(18))
allocate(o(1:no))
call getreal(temp_pr,o,no)
c$$$ ---prepare output---
allocate(ia(1:nx))
call zerointeger(ia,nx)
allocate(nin(1:nlam))
call zerointeger(nin,nlam)
allocate(alm(1:nlam))
call zeroreal(alm,nlam)
allocate(a0(1:nlam))
call zeroreal(a0,nlam)
allocate(ca(1:nx*nlam))
call zeroreal(ca,nx*nlam)
allocate(dev(1:nlam))
call zeroreal(dev,nlam)
c$$$ ---computation----
if (task .eq. 51) then
call fishnet(parm,no,ni,x,y,o,w,jd,vp,cl,ne,nx,nlam,flmin,
$ ulam,thr,isd,intr,maxit,lmu,a0,ca,ia,nin,dev0,dev,alm,
$ nlp,jerr)
else
call spfishnet(parm,no,ni,xs,jcs,irs,y,o,w,jd,vp,cl,ne,nx,
$ nlam,flmin,ulam,thr,isd,intr,maxit,lmu,a0,ca,ia,
$ nin,dev0,dev,alm,nlp,jerr)
endif
c$$$ ----output-----
plhs(1) = mxCreateDoubleMatrix(1,1,0)
temp_pr = mxGetPr(plhs(1))
call putinteger(lmu,temp_pr,1)
plhs(4) = mxCreateDoubleMatrix(nx,1,0)
temp_pr = mxGetPr(plhs(4))
call putinteger(ia,temp_pr,nx)
plhs(5) = mxCreateDoubleMatrix(lmu,1,0)
temp_pr = mxGetPr(plhs(5))
call putinteger(nin,temp_pr,lmu)
plhs(7) = mxCreateDoubleMatrix(lmu,1,0)
temp_pr = mxGetPr(plhs(7))
call putreal(alm,temp_pr,lmu)
plhs(8) = mxCreateDoubleMatrix(1,1,0)
temp_pr = mxGetPr(plhs(8))
call putinteger(nlp,temp_pr,1)
plhs(9) = mxCreateDoubleMatrix(1,1,0)
temp_pr = mxGetPr(plhs(9))
call putinteger(jerr,temp_pr,1)
plhs(2) = mxCreateDoubleMatrix(lmu,1,0)
temp_pr = mxGetPr(plhs(2))
call putreal(a0,temp_pr,lmu)
plhs(3) = mxCreateDoubleMatrix(nx,lmu,0)
temp_pr = mxGetPr(plhs(3))
call putreal(ca,temp_pr,nx*lmu)
plhs(6) = mxCreateDoubleMatrix(lmu,1,0)
temp_pr = mxGetPr(plhs(6))
call putreal(dev,temp_pr,lmu)
plhs(10) = mxCreateDoubleMatrix(1,1,0)
temp_pr = mxGetPr(plhs(10))
call putreal(dev0,temp_pr,1)
plhs(11) = mxCreateDoubleMatrix(no,1,0)
temp_pr = mxGetPr(plhs(11))
call putreal(o,temp_pr,no)
deallocate(y)
deallocate(jd)
deallocate(vp)
deallocate(ulam)
deallocate(a0)
deallocate(ca)
deallocate(ia)
deallocate(nin)
deallocate(alm)
deallocate(cl)
deallocate(o)
deallocate(dev)
if (task .eq. 51) then
deallocate(x)
else
deallocate(xs)
deallocate(irs)
deallocate(jcs)
endif
return
endif
c$$$ --------------------end of Poisson------------------
return
end
C End of subroutine mexFunction
subroutine real8toreal(x, y, size)
integer size
real*8 x(size)
real y(size)
do 10 i=1,size
y(i)= x(i)
10 continue
return
end
subroutine realtoreal8(x, y, size)
integer size
real x(size)
real*8 y(size)
do 20 i=1,size
y(i)= x(i)
20 continue
return
end
subroutine real8tointeger(x, y, size)
integer size
real*8 x(size)
integer y(size)
do 30 i=1,size
y(i)= x(i)
30 continue
return
end
subroutine integertoreal8(x, y, size)
integer size
integer x(size)
real*8 y(size)
do 40 i=1,size
y(i)= x(i)
40 continue
return
end
subroutine getreal(pr,x,size)
mwpointer pr
integer size
real x(size)
real*8, dimension (:), allocatable :: temp
allocate(temp(1:size))
call mxCopyPtrToReal8(pr,temp,size)
call real8toreal(temp,x,size)
deallocate(temp)
return
end
subroutine getinteger(pr,x,size)
mwpointer pr
integer size
integer x(size)
real*8, dimension (:), allocatable :: temp
allocate(temp(1:size))
call mxCopyPtrToReal8(pr,temp,size)
call real8tointeger(temp,x,size)
deallocate(temp)
return
end
subroutine putreal(x,pr,size)
mwpointer pr
integer size
real x(size)
real*8, dimension (:), allocatable :: temp
allocate(temp(1:size))
call realtoreal8(x,temp,size)
call mxCopyReal8ToPtr(temp,pr,size)
deallocate(temp)
return
end
subroutine putinteger(x,pr,size)
mwpointer pr
integer size
integer x(size)
real*8, dimension (:), allocatable :: temp
allocate(temp(1:size))
call integertoreal8(x,temp,size)
call mxCopyReal8ToPtr(temp,pr,size)
deallocate(temp)
return
end
subroutine zeroreal(x,size)
integer size
real x(size)
do 90 i=1,size
x(i) = 0
90 continue
return
end
subroutine zerointeger(x,size)
integer size
integer x(size)
do 100 i=1,size
x(i) = 0
100 continue
return
end

First thing I would do is clean up the MATLAB API interface stuff. Remember that in Fortran you do not get automatic type promotion in function/subroutine argument lists like you do in C/C++. So it is important to get the signatures exact. You should NEVER be passing literal integers to MATLAB API functions. You should be passing variables that are typed exactly as the API specifies to ensure that there is not a mismatch. E.g., take this code:
subroutine getreal(pr,x,size)
mwpointer pr
integer size
real x(size)
real*8, dimension (:), allocatable :: temp
allocate(temp(1:size))
call mxCopyPtrToReal8(pr,temp,size)
call real8toreal(temp,x,size)
deallocate(temp)
return
end
The signature for mxCopyPtrToReal8 in the API is:
subroutine mxCopyPtrToReal8(px, y, n)
mwPointer px
real*8 y(n)
mwSize n
So you have a potential mismatch because the default Fortran integer might not match mwSize. Also, size is the name of a Fortran instrinsic function, so a different name for your variable would probably be more appropriate.
I would change that subroutine to:
subroutine getreal(pr,x,sizex)
mwpointer pr
mwSize sizex
real x(sizex)
real*8, dimension (:), allocatable :: temp
allocate(temp(1:sizex))
call mxCopyPtrToReal8(pr,temp,sizex)
call real8toreal(temp,x,sizex)
deallocate(temp)
return
end
Now you are ensured that sizex is the appropriate type. You would also need to change the types of the variables in the calling routine.
(SIDE NOTE: Actually, I wouldn't do any of what you are doing ... I would simply write a loop to copy the values directly from the mxArray into your real array without extra copies and memory allocation/deallocation)
Another example is this:
integer ...,nx,...
:
integer lmu,...
:
plhs(3) = mxCreateDoubleMatrix(nx,lmu,0)
Should be replaced with:
mwSize nx, lmu
integer*4 :: ComplexFlag = 0
:
plhs(3) = mxCreateDoubleMatrix(nx,lmu,ComplexFlag)
And, frankly, you have lots of assignment loops that can be replace by simple statements. E.g.,
call real8toreal(temp,x,sizex)
can be replaced with:
x = temp
And this:
allocate(ia(1:nx))
call zerointeger(ia,nx)
allocate(nin(1:nlam))
call zerointeger(nin,nlam)
allocate(alm(1:nlam))
call zeroreal(alm,nlam)
allocate(a0(1:nlam))
call zeroreal(a0,nlam)
allocate(ca(1:nx*nlam))
call zeroreal(ca,nx*nlam)
allocate(dev(1:nlam))
call zeroreal(dev,nlam)
can be replaced with this:
allocate(ia(1:nx))
ia = 0
allocate(nin(1:nlam))
nin = 0
allocate(alm(1:nlam))
alm = 0.0
allocate(a0(1:nlam))
a0 = 0.0
allocate(ca(1:nx*nlam))
ca = 0.0
allocate(dev(1:nlam))
dev = 0.0
etc.

Related

Speed in Matlab vs. Julia vs. Fortran

I am playing around with different languages to solve a simple value function iteration problem where I loop over a state-space grid. I am trying to understand the performance differences and how I could tweak each code. For posterity I have posted full length working examples for each language below. However, I believe that most of the tweaking is to be done in the while loop. I am a bit confused what I am doing wrong in Fortran as the speed seems subpar.
Matlab ~2.7secs : I am avoiding a more efficient solution using the repmat function for now to keep the codes comparable. Code seems to be automatically multithreaded onto 4 threads
beta = 0.98;
sigma = 0.5;
R = 1/beta;
a_grid = linspace(0,100,1001);
tic
[V_mat, next_mat] = valfun(beta, sigma, R ,a_grid);
toc
where valfun()
function [V_mat, next_mat] = valfun(beta, sigma, R, a_grid)
zeta = 1-1/sigma;
len = length(a_grid);
V_mat = zeros(2,len);
next_mat = zeros(2,len);
u = zeros(2,len,len);
c = zeros(2,len,len);
for i = 1:len
c(1,:,i) = a_grid(i) - a_grid/R + 20.0;
c(2,:,i) = a_grid(i) - a_grid/R;
end
u = c.^zeta * zeta^(-1);
u(c<=0) = -1e8;
tol = 1e-4;
outeriter = 0;
diff = 1000.0;
while (diff>tol) %&& (outeriter<20000)
outeriter = outeriter + 1;
V_last = V_mat;
for i = 1:len
[V_mat(1,i), next_mat(1,i)] = max( u(1,:,i) + beta*V_last(2,:));
[V_mat(2,i), next_mat(2,i)] = max( u(2,:,i) + beta*V_last(1,:));
end
diff = max(abs(V_mat - V_last));
end
fprintf("\n Value Function converged in %i steps. \n", outeriter)
end
Julia (after compilation) ~5.4secs (4 threads (9425469 allocations: 22.43 GiB)), ~7.8secs (1 thread (2912564 allocations: 22.29 GiB))
[EDIT: after adding correct broadcasting and #views its only 1.8-2.1seconds now, see below!]
using LinearAlgebra, UnPack, BenchmarkTools
struct paramsnew
β::Float64
σ::Float64
R::Float64
end
function valfun(params, a_grid)
#unpack β,σ, R = params
ζ = 1-1/σ
len = length(a_grid)
V_mat = zeros(2,len)
next_mat = zeros(2,len)
u = zeros(2,len,len)
c = zeros(2,len,len)
#inbounds for i in 1:len
c[1,:,i] = #. a_grid[i] - a_grid/R .+ 20.0
c[2,:,i] = #. a_grid[i] - a_grid/R
end
u = c.^ζ * ζ^(-1)
u[c.<=0] .= typemin(Float64)
tol = 1e-4
outeriter = 0
test = 1000.0
while test>tol
outeriter += 1
V_last = deepcopy(V_mat)
#inbounds Threads.#threads for i in 1:len # loop over grid points
V_mat[1,i], next_mat[1,i] = findmax( u[1,:,i] .+ β*V_last[2,:])
V_mat[2,i], next_mat[2,i] = findmax( u[2,:,i] .+ β*V_last[1,:])
end
test = maximum( abs.(V_mat - V_last)[.!isnan.( V_mat - V_last )])
end
print("\n Value Function converged in ", outeriter, " steps.")
return V_mat, next_mat
end
a_grid = collect(0:0.1:100)
p1 = paramsnew(0.98, 1/2, 1/0.98);
#time valfun(p1,a_grid)
print("\n should be compiled now \n")
#btime valfun(p1,a_grid)
Fortran (O3, mkl, qopenmp) ~9.2secs: I also must be doing something wrong when declaring the openmp variables as the compilation will crash for some grid sizes when using openmp (SIGSEGV error).
module mod_calc
use omp_lib
implicit none
integer, parameter :: dp = selected_real_kind(33,4931), len = 1001
public :: dp, len
contains
subroutine linspace(from, to, array)
real(dp), intent(in) :: from, to
real(dp), intent(out) :: array(:)
real(dp) :: range
integer :: n, i
n = size(array)
range = to - from
if (n == 0) return
if (n == 1) then
array(1) = from
return
end if
do i=1, n
array(i) = from + range * (i - 1) / (n - 1)
end do
end subroutine
subroutine calc_val()
real(dp):: bbeta, sigma, R, zeta, tol, test
real(dp):: a_grid(len), V_mat(2,len), V_last(2,len), &
u(len,len,2), c(len,len,2)
integer :: outeriter, i, sss, next_mat(2,len), fu
character(len=*), parameter :: FILE_NAME = 'data.txt' ! File name.
call linspace(from=0._dp, to=100._dp, array=a_grid)
bbeta = 0.98
sigma = 0.5
R = 1.0/0.98
zeta = 1.0 - 1.0/sigma
tol = 1e-4
test = 1000.0
outeriter = 0
do i = 1,len
c(:,i,1) = a_grid(i) - a_grid/R + 20.0
c(:,i,2) = a_grid(i) - a_grid/R
end do
u = c**zeta * 1.0/zeta
where (c<=0)
u = -1e6
end where
V_mat = 0.0
next_mat = 0.0
do while (test>tol .and. outeriter<20000)
outeriter = outeriter+1
V_last = V_mat
!$OMP PARALLEL DEFAULT(NONE) &
!$OMP SHARED(V_mat, next_mat,V_last, u, bbeta) &
!$OMP PRIVATE(i)
!$OMP DO SCHEDULE(static)
do i=1,len
V_mat(1,i) = maxval(u(:,i,1) + bbeta*V_last(2,:))
next_mat(1,i) = maxloc(u(:,i,1) + bbeta*V_last(2,:),1)
V_mat(2,i) = maxval(u(:,i,2) + bbeta*V_last(1,:))
next_mat(2,i) = maxloc(u(:,i,2) + bbeta*V_last(1,:),1)
end do
!$OMP END DO
!$OMP END PARALLEL
test = maxval(abs(log(V_last/V_mat)))
end do
end subroutine
end module mod_calc
program main
use mod_calc
implicit none
integer:: clck_counts_beg,clck_rate,clck_counts_end
call omp_set_num_threads(4)
call system_clock ( clck_counts_beg, clck_rate )
call calc_val()
call system_clock ( clck_counts_end, clck_rate )
write (*, '("Time = ",f6.3," seconds.")') (clck_counts_end - clck_counts_beg) / real(clck_rate)
end program main
There should be ways to reduce the amount of allocations (Julia reports 32-45% gc time!) but for now I am too novice to see them, so any comments and tipps are welcome.
Edit:
Adding #views and correct broadcasting to the while loop improved the Julia speed considerably (as expected, I guess) and hence beats the Matlab loop now. With 4 threads the code now takes only 1.97secs. Specifically,
#inbounds for i in 1:len
c[1,:,i] = #views #. a_grid[i] - a_grid/R .+ 20.0
c[2,:,i] = #views #. a_grid[i] - a_grid/R
end
u = #. c^ζ * ζ^(-1)
#. u[c<=0] = typemin(Float64)
while test>tol && outeriter<20000
outeriter += 1
V_last = deepcopy(V_mat)
#inbounds Threads.#threads for i in 1:len # loop over grid points
V_mat[1,i], next_mat[1,i] = #views findmax( #. u[1,:,i] + β*V_last[2,:])
V_mat[2,i], next_mat[2,i] = #views findmax( #. u[2,:,i] + β*V_last[1,:])
end
test = #views maximum( #. abs(V_mat - V_last)[!isnan( V_mat - V_last )])
end
The reason the fortran is so slow is that it is using quadruple precision - I don't know Julia or Matlab but it looks as though double precision is being used in that case. Further as noted in the comments some of the loop orders are incorrect for Fortran, and also you are not consistent in your use of precision in the Fortran code, most of your constants are single precision. Correcting all these leads to the following:
Original: test = 9.83440674663232047922921588613472439E-0005 Time =
31.413 seconds.
Optimised: test = 9.8343643237979391E-005 Time = 0.912 seconds.
Note I have turned off parallelisation for these, all results are single threaded. Code is below:
module mod_calc
!!$ use omp_lib
implicit none
!!$ integer, parameter :: dp = selected_real_kind(33,4931), len = 1001
integer, parameter :: dp = selected_real_kind(15), len = 1001
public :: dp, len
contains
subroutine linspace(from, to, array)
real(dp), intent(in) :: from, to
real(dp), intent(out) :: array(:)
real(dp) :: range
integer :: n, i
n = size(array)
range = to - from
if (n == 0) return
if (n == 1) then
array(1) = from
return
end if
do i=1, n
array(i) = from + range * (i - 1) / (n - 1)
end do
end subroutine
subroutine calc_val()
real(dp):: bbeta, sigma, R, zeta, tol, test
real(dp):: a_grid(len), V_mat(len,2), V_last(len,2), &
u(len,len,2), c(len,len,2)
integer :: outeriter, i, sss, next_mat(2,len), fu
character(len=*), parameter :: FILE_NAME = 'data.txt' ! File name.
call linspace(from=0._dp, to=100._dp, array=a_grid)
bbeta = 0.98_dp
sigma = 0.5_dp
R = 1.0_dp/0.98_dp
zeta = 1.0_dp - 1.0_dp/sigma
tol = 1e-4_dp
test = 1000.0_dp
outeriter = 0
do i = 1,len
c(:,i,1) = a_grid(i) - a_grid/R + 20.0_dp
c(:,i,2) = a_grid(i) - a_grid/R
end do
u = c**zeta * 1.0_dp/zeta
where (c<=0)
u = -1e6_dp
end where
V_mat = 0.0_dp
next_mat = 0.0_dp
do while (test>tol .and. outeriter<20000)
outeriter = outeriter+1
V_last = V_mat
!$OMP PARALLEL DEFAULT(NONE) &
!$OMP SHARED(V_mat, next_mat,V_last, u, bbeta) &
!$OMP PRIVATE(i)
!$OMP DO SCHEDULE(static)
do i=1,len
V_mat(i,1) = maxval(u(:,i,1) + bbeta*V_last(:, 2))
next_mat(i,1) = maxloc(u(:,i,1) + bbeta*V_last(:, 2),1)
V_mat(i,2) = maxval(u(:,i,2) + bbeta*V_last(:, 1))
next_mat(i,2) = maxloc(u(:,i,2) + bbeta*V_last(:, 1),1)
end do
!$OMP END DO
!$OMP END PARALLEL
test = maxval(abs(log(V_last/V_mat)))
end do
Write( *, * ) test
end subroutine
end module mod_calc
program main
use mod_calc
implicit none
integer:: clck_counts_beg,clck_rate,clck_counts_end
!!$ call omp_set_num_threads(2)
call system_clock ( clck_counts_beg, clck_rate )
call calc_val()
call system_clock ( clck_counts_end, clck_rate )
write (*, '("Time = ",f6.3," seconds.")') (clck_counts_end - clck_counts_beg) / real(clck_rate)
end program main
Compilation / linking:
ian#eris:~/work/stack$ gfortran --version
GNU Fortran (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
ian#eris:~/work/stack$ gfortran -Wall -Wextra -O3 jul.f90
jul.f90:36:48:
character(len=*), parameter :: FILE_NAME = 'data.txt' ! File name.
1
Warning: Unused parameter ‘file_name’ declared at (1) [-Wunused-parameter]
jul.f90:35:57:
integer :: outeriter, i, sss, next_mat(2,len), fu
1
Warning: Unused variable ‘fu’ declared at (1) [-Wunused-variable]
jul.f90:35:36:
integer :: outeriter, i, sss, next_mat(2,len), fu
1
Warning: Unused variable ‘sss’ declared at (1) [-Wunused-variable]
Running:
ian#eris:~/work/stack$ ./a.out
9.8343643237979391E-005
Time = 0.908 seconds.
What #Ian Bush says in his answer about the dual precision is correct. Moreover,
You will likely not need openmp for the kind of parallelization you have done in your code. The Fortran's intrinsic do concurrent() will automatically parallelize the loop for you (when the code is compiled with the parallel flag of the respective compiler).
Also, the where elsewhere construct is slow as it often requires the creation of a logical mask array and then applying it in a do-loop. You can use do concurrent() in place of where to both avoid the extra temporary array creation and parallelize the computation on multiple cores.
Also, when comparing 64bit precision numbers, it's good to make sure both values are the same type and kind to avoid an implicit type/kind conversion before the comparison is made.
Also, the calculation of a_grid(i) - a_grid/R in computing the c array is redundant and can be avoided in the subsequent line.
Here is the modified optimized parallel Fortran code without any OpenMP,
module mod_calc
use iso_fortran_env, only: dp => real64
implicit none
integer, parameter :: len = 1001
public :: dp, len
contains
subroutine linspace(from, to, array)
real(dp), intent(in) :: from, to
real(dp), intent(out) :: array(:)
real(dp) :: range
integer :: n, i
n = size(array)
range = to - from
if (n == 0) return
if (n == 1) then
array(1) = from
return
end if
do concurrent(i=1:n)
array(i) = from + range * (i - 1) / (n - 1)
end do
end subroutine
subroutine calc_val()
implicit none
real(dp) :: bbeta, sigma, R, zeta, tol, test
real(dp) :: a_grid(len), V_mat(len,2), V_last(len,2), u(len,len,2), c(len,len,2)
integer :: outeriter, i, j, k, sss, next_mat(2,len), fu
character(len=*), parameter :: FILE_NAME = 'data.txt' ! File name.
call linspace(from=0._dp, to=100._dp, array=a_grid)
bbeta = 0.98_dp
sigma = 0.5_dp
R = 1.0_dp/0.98_dp
zeta = 1.0_dp - 1.0_dp/sigma
tol = 1e-4_dp
test = 1000.0_dp
outeriter = 0
do concurrent(i=1:len)
c(1:len,i,2) = a_grid(i) - a_grid/R
c(1:len,i,1) = c(1:len,i,2) + 20.0_dp
end do
u = c**zeta * 1.0_dp/zeta
do concurrent(i=1:len, j=1:len, k=1:2)
if (c(i,j,k)<=0._dp) u(i,j,k) = -1e6_dp
end do
V_mat = 0.0_dp
next_mat = 0.0_dp
do while (test>tol .and. outeriter<20000)
outeriter = outeriter + 1
V_last = V_mat
do concurrent(i=1:len)
V_mat(i,1) = maxval(u(:,i,1) + bbeta*V_last(:, 2))
next_mat(i,1) = maxloc(u(:,i,1) + bbeta*V_last(:, 2),1)
V_mat(i,2) = maxval(u(:,i,2) + bbeta*V_last(:, 1))
next_mat(i,2) = maxloc(u(:,i,2) + bbeta*V_last(:, 1),1)
end do
test = maxval(abs(log(V_last/V_mat)))
end do
Write( *, * ) test
end subroutine
end module mod_calc
program main
use mod_calc
implicit none
integer:: clck_counts_beg,clck_rate,clck_counts_end
call system_clock ( clck_counts_beg, clck_rate )
call calc_val()
call system_clock ( clck_counts_end, clck_rate )
write (*, '("Time = ",f6.3," seconds.")') (clck_counts_end - clck_counts_beg) / real(clck_rate)
end program main
Compiling your original code with /standard-semantics /F0x1000000000 /O3 /Qip /Qipo /Qunroll /Qunroll-aggressive /inline:all /Ob2 /Qparallel Intel Fortran compiler flags, yields the following timing,
original.exe
Time = 37.284 seconds.
compiling and running the parallel concurrent Fortran code in the above (on at most 4 cores, if any at all is used) yields,
concurrent.exe
Time = 0.149 seconds.
For comparison, this MATLAB's timing,
Value Function converged in 362 steps.
Elapsed time is 3.575691 seconds.
One last tip: There are several vectorized array computations and loops in the above code that can still be merged together to even further improve the speed of your Fortran code. For example,
u = c**zeta * 1.0_dp/zeta
do concurrent(i=1:len, j=1:len, k=1:2)
if (c(i,j,k)<=0._dp) u(i,j,k) = -1e6_dp
end do
in the above code can be all merged with the do concurrent loop appearing before it,
do concurrent(i=1:len)
c(1:len,i,2) = a_grid(i) - a_grid/R
c(1:len,i,1) = c(1:len,i,2) + 20.0_dp
end do
If you decide to do so, then you can define an auxiliary variable inverse_zeta = 1.0_dp / zeta to use in the computation of u inside the loop instead of using * 1.0_dp / zeta, thus avoiding the extra division (which is more costly than multiplication), without degrading the readability of the code.

fsolve issue with initial condition for ODE

I am trying to solve the following ODE:
function [eta, sol] = compressible_similarity_wo
global Gamm Ma Pr omega;
Gamm = 1.4;
Ma = 2;
Pr = 0.7;
omega=0.76;
global eta_max_ode;
eta_max_ode = 20;
opt = optimset('Display','off','TolFun',1E-20);
F = fsolve(#(F) eval_boundary(F),[0,0,0.4,1,0],opt);
[eta_ode, fg_ode] = solve_ode(F);
sol = [fg_ode];
eta = eta_ode;
end
function [eta_ode, fg_ode] = solve_ode(F)
global eta_max_ode
options = odeset('RelTol',1e-9,'AbsTol',1e-9);
[eta_ode, fg_ode] = ode45(#BLFunc,[0,eta_max_ode],F,options);
end
function [g] = eval_boundary(F)
% Get the solution to the ODE with inital condition F
[eta_ode, fg_ode] = solve_ode(F);
% Get the function values (for BCs) at the starting/end points
f_start = fg_ode(1,1); %f(0) = 0
df_start = fg_ode(1,2); %f'(0) = 0
df_end = fg_ode(end,2); %f'(inf) - 1 = 0
t_end = fg_ode(end,4); %T(inf) - 1 = 0
dt_start = fg_ode(1,5); %T'(0) = 0
% Evaluate the boundary function
g = [f_start
df_start
df_end - 1
t_end - 1
dt_start];
end
function [df] = BLFunc(f)
global Gamm Ma Pr omega;
df = zeros(5,1);
df(1) = f(2);
df(2) = f(3);
df(3) = -f(1)*f(3)/(f(4)^(omega-1))-(omega-1)*f(3)/f(4);
df(4) = f(5);
df(5) = -Pr*f(1)*f(5)/(f(4)^(omega-1)) - Pr*(Gamm - 1.0)*Ma*Ma*f(3)*f(3) - (omega-1)*f(5)/f(4);
end
but fsolve returns the following problem
Error using BLFunc
Too many input arguments.
Error in odearguments (line 90)
f0 = feval(ode,t0,y0,args{:}); % ODE15I sets args{1} to yp0.
Error in ode45 (line 115)
odearguments(FcnHandlesUsed, solver_name, ode, tspan, y0, options, varargin);
Error in solve_ode (line 5)
[eta_ode, fg_ode] = ode45(#BLFunc,[0,eta_max_ode],F,options);
Error in eval_boundary (line 3)
[eta_ode, fg_ode] = solve_ode(F);
Error in compressible_similarity_wo>#(F)eval_boundary(F) (line 15)
F = fsolve(#(F) eval_boundary(F),[0,0,0.4,1,0],opt);
Error in fsolve (line 230)
fuser = feval(funfcn{3},x,varargin{:});
Error in compressible_similarity_wo (line 15)
F = fsolve(#(F) eval_boundary(F),[0,0,0.4,1,0],opt);
Error in launch (line 3)
[eta, sol] = compressible_similarity_wo;
Caused by:
Failure in initial objective function evaluation. FSOLVE cannot continue.
Do you have an idea of what's going on?
I'll cite you the friendly manual page
The function dydt = odefun(t,y), for a scalar t and a column vector y, must return a column vector dydt of data type single or double that corresponds to f(t,y). odefun must accept both input arguments, t and y, even if one of the arguments is not used in the function.
That is, you simply need to change to
function [df] = BLFunc(t,f)
to get a result (no guarantee that it is THE result).
Try to replace BLFunc signature to
function [df] = BLFunc(t, f)
You need to provide odefun to ode45, which takes 2 arguments, as stated in documentation:
The function dydt = odefun(t,y), for a scalar t and a column vector y, must return a column vector dydt of data type single or double that corresponds to f(t,y). odefun must accept both input arguments, t and y, even if one of the arguments is not used in the function.

Trying to return array from Fortran to Matlab with mex, getting empty array instead

So, I'm trying to return a an array of numbers from 1-n.
#include "fintrf.h"
C Gateway routine
subroutine mexFunction(nlhs, plhs, nrhs, prhs)
C Declarations
implicit none
C mexFunction arguments:
mwPointer plhs(*), prhs(*)
integer nlhs, nrhs
mwPointer mxGetPr
mwPointer mxCreateDoubleMatrix
mwPointer mxGetM, mxGetN
mwPointer mrows, ncols
mwSize size
mwPointer x_ptr, y_ptr
integer x_input,i
real*8, allocatable :: vec(:)
x_ptr = mxGetPr(prhs(1))
mrows = mxGetM(prhs(1))
ncols = mxGetN(prhs(1))
size = mrows*ncols
x_ptr=mxGetPr(prhs(1))
call mxCopyPtrToReal8(x_ptr,x_input,size)
allocate (vec(x_input))
do i=1,x_input
vec(i)=i
end do
plhs(1) = mxCreateDoubleMatrix(1, x_input, 0)
y_ptr = mxGetPr(plhs(1))
call mxCopyReal8ToPtr(vec,y_ptr,x_input)
deallocate ( vec )
return
end
I then call the mex file in fortran here
mex testingvec.F
Building with 'gfortran'.
MEX completed successfully.
a=testingvec(10);
and then find
a=[]
Can someone give me some help on this? If someone can give me some example code how to return a matrix as well, that would be sweet.
Thanks guys.
edit: new installment of the code. Still trying to get some help.
#include "fintrf.h"
C Gateway routine
subroutine mexFunction(nlhs, plhs, nrhs, prhs)
C Declarations
implicit none
C mexFunction arguments:
mwPointer plhs(*), prhs(*)
integer nlhs, nrhs
mwPointer mxGetPr
mwPointer mxCreateDoubleMatrix
mwPointer mxGetM, mxGetN
mwPointer mrows, ncols
mwSize size
mwPointer x_ptr, y_ptr
integer i
mwSize sizeone, x_input
integer*4 izero
real*8, allocatable :: vec(:)
x_ptr = mxGetPr(prhs(1))
mrows = mxGetM(prhs(1))
ncols = mxGetN(prhs(1))
size = mrows*ncols
sizeone=1
izero=0
x_ptr=mxGetPr(prhs(1))
call mxCopyPtrToReal8(x_ptr,x_input,size)
allocate (vec(x_input))
do i=1,x_input
vec(i)=i
end do
plhs(1) = mxCreateDoubleMatrix(sizeone,x_input,izero)
call mxCopyReal8ToPtr(vec,mxGetPr(plhs(1)),x_input)
deallocate ( vec )
return
end
There were both declaration problems and issues with calls to mex functions. Here's a solution, which assumes that the input is an integer-valued double giving you the length of the output vector (assuming that this is what you wanted).
#include "fintrf.h"
C Gateway routine
subroutine mexFunction(nlhs, plhs, nrhs, prhs)
C Declarations
implicit none
C mexFunction arguments:
mwPointer plhs(*), prhs(*)
integer*4 nlhs, nrhs
mwPointer mxGetPr
mwPointer mxCreateDoubleMatrix
mwSize mxGetM, mxGetN
mwSignedIndex mrows, ncols
mwSize size, x_input, sizeone
mwPointer x_ptr, y_ptr
integer*4 i, izero, x_int
real*8, allocatable :: vec(:)
real*8 :: x_dbl
sizeone = 1
izero = 0
!check input/output syntax
if (nrhs /= 1) then
call mexErrMsgIdAndTxt("MATLAB:testingvec:rhs",
> "Exactly 1 input variable required.")
end if
if (nlhs /= 1) then
call mexErrMsgIdAndTxt("MATLAB:testingvec:lhs",
> "Exactly 1 output matrix required.")
end if
x_ptr = mxGetPr(prhs(1))
mrows = mxGetM(prhs(1))
ncols = mxGetN(prhs(1))
size = mrows*ncols
call mxCopyPtrToReal8(x_ptr,x_dbl,sizeone)
x_input = int(x_dbl)
allocate (vec(x_input))
do i=1,x_input
vec(i)=i
end do
plhs(1) = mxCreateDoubleMatrix(sizeone, x_input, izero)
y_ptr = mxGetPr(plhs(1))
call mxCopyReal8ToPtr(vec,y_ptr,x_input)
deallocate ( vec )
return
end
I introduced a check for the number of input/output variables (to be updated in the actual program). And I introduced an auxiliary x_dbl which might or might not be necessary. This version reads the double input given to your function, and truncates it to get x_input.

Error in using a function with more that one argument

Here is the code of the functions I'm using.
function max = find_max( matrix )
a = -1;
for i = 1:numel( matrix ),
if ( matrix(i) > a),
a = matrix(i);
end
end
max = a;
end
function maxs = find_maxs( matrix, count )
maxs = [];
while (count > 0),
a = -1;
for i = 1:numel( matrix ),
if ( matrix(i) > a && ~is_present(maxs, matrix(i))),
a = matrix(i);
end
end
maxs = [maxs a];
count = count - 1;
end
end
function present = is_present( vector, element )
for i = 1:numel( vector ),
if ( vector(i) == element),
present = TRUE;
return;
end
end
end
When I try to call:
m = [1 2 3 4];
disp(is_present(m, 1));
Or the function find_maxs, I get this error.
??? Undefined function or method 'is_present' for input arguments of type 'double'.
I'm new to matlab and I don't understand why I'm getting this error. The name of the file is find_max.m, the same name of the first function (which works fine).
Just to expand on eigenchris's comment (I'd put it as a comment but I don't have the privilege yet),
Each function should have it's own m file, and the m file should have the same name as the function.
Ex:
function max = find_max( matrix )
should be in a file named 'find_max.m'

ode45 error, index out of bounds because numel(x)=1

function xx = test(x, t)
xx(1) = x(2);
xx(2) = x(3) * cos(x(4) + x(1));
xx(3) = 0;
xx(4) = 0;
end
That is my script, confused as to why I am getting this error.
The right-hand side given to the ODE suites pass the arguments as (t,x) where t is always scalar.
So your function signature should be
function xx = test(t,x)