Run openblas on multicore

I am implementing a simple matrix-matrix multiplication and matrix-vector multiplication with openblas, using dgemm and dgemv. I see that openblas is only running on one core.
I tried adding -lpthread at compile time, but that did not help.
The way I am calling dgemm and dgemv is simple:
cblas_dgemv(order, trans, m, n, alpha, a, lda, x, incx, beta, y, incy);
cblas_dgemm(M, N, K, alpha, A, 1, M, B, 1, K, beta, C, 1, M);
Has anyone successfully run openblas on multiple cores?

Have you tried setting the number of threads with an environment variable?
export OMP_NUM_THREADS=4
If this does not work, you can set the number of threads openblas uses with the following function:
void openblas_set_num_threads(int num_threads);
cf. https://github.com/xianyi/OpenBLAS#set-the-number-of-threads-on-runtime
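If you would rather not export the variable globally, the thread count can be set for a single run only. Below is a minimal Python sketch of that idea. The helper name run_with_blas_threads is hypothetical, and the child command is a stand-in that only echoes the variable; in practice it would be your compiled dgemm/dgemv program. OPENBLAS_NUM_THREADS is the OpenBLAS-specific counterpart of OMP_NUM_THREADS. Neither variable will help if the library itself was built without threading support.

```python
import os
import subprocess
import sys


def run_with_blas_threads(cmd, num_threads):
    """Run cmd with the BLAS thread-count variables set for the child only."""
    env = dict(os.environ)  # copy, so the parent environment is left untouched
    env["OMP_NUM_THREADS"] = str(num_threads)
    env["OPENBLAS_NUM_THREADS"] = str(num_threads)
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)


# Stand-in child process (assumption): it only prints the variable it received.
child = [sys.executable, "-c",
         "import os; print(os.environ['OPENBLAS_NUM_THREADS'])"]
result = run_with_blas_threads(child, 4)
print(result.stdout.strip())
```

This only controls what the child process sees; whether all cores are actually used still depends on how the linked openblas was built.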

Related

MATLAB: atan2 breaking ode15s

I have a program that runs ode15s a few thousand times in order to find a particular solution. However, I'm getting many integration tolerance errors such as the following:
"Warning: Failure at t=5.144337e+02. Unable to meet integration tolerances without reducing the step size below the smallest value allowed (1.818989e-12) at time t."
Such warnings cause the program to slow down drastically, and sometimes even grind to a complete halt. The following is some test code that reproduces the error:
%Simulation constants
G = 6.672e-11; %Gravitational constant
M = 6.39e23; %Mass of Mars (kg)
g = 9.81; %Gravitational acceleration on Earth (m/s^2);
T1 = 845000/3; %Total engine thrust, 1 engine (N)
Isp = 282; %Engine specific impulse (s)
mdot1 = T1/(g*Isp); %Engine mass flow rate (kg/s)
xinit_on2 = [72368.8347685214;
3384891.40103322;
-598.36623436025;
-1440.49702235844;
16330.430678033]
tspan_on2 = [436.600093957202, 520.311296453027];
[t3,x3] = ode15s(@(t,x) engine_on_2(t, x, G, g, M, Isp, T1), tspan_on2, xinit_on2)
where the function engine_on_2 contains the system of ODEs that model the descent of a rocket, and is given by,
function xdot = engine_on_2(t, x, G, g, M, Isp, T1)
gamma = atan2(x(4),x(3)); %flight-path angle
xdot = [x(3); %xdot1: x-velocity
x(4); %xdot2: y-velocity
-(G*M*x(1))/((x(1)^2+x(2)^2)^(3/2))-(T1/x(5))*cos(gamma); %xdot3: x-acceleration
-(G*M*x(2))/((x(1)^2+x(2)^2)^(3/2))-(T1/x(5))*sin(gamma); %xdot4: y-acceleration
-T1/(g*Isp)]; %xdot5: engine mass flow rate
end
Having done some testing, it seems that I am getting the integration tolerance warnings because of the use of the atan2 function in gamma = atan2(x(4),x(3)) which is used to calculate the flight-path angle of the rocket. If I change atan2 to another function (for example cos or sin) I don't get any integration tolerance warnings anymore (although, due to such a change, my solutions are obviously incorrect). As such, I was wondering if I am using atan2 incorrectly, or if there is a way to implement it differently so that I do not get the integration tolerance errors anymore. Furthermore, could it be that I am incorrect and that it is something other than atan2 that is causing the errors?
Use the odeset function to create an options structure that you then pass to the solver. RelTol and AbsTol can be adjusted in the ODE solver to eliminate your error. I was able to run your code using this addition without any errors:
options = odeset('RelTol',1e-13,'AbsTol',1e-20)
[t3,x3] = ode15s(@(t,x) engine_on_2(t, x, G, g, M, Isp, T1), tspan_on2, xinit_on2, options)
As you can see, the options are passed to the ODE solver as a 4th input parameter. Note the RelTol maxes out just above 1e-13 but hopefully that's fine for your application. You can also try any of the other ODE solvers, which may get rid of your error, but from my playing around ode15s seems quite fast.
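To get a feel for what these two settings ask of the solver, here is a small Python sketch. It assumes the usual form of the error test described for the MATLAB ODE solvers, where the local error in component i is kept below max(RelTol*abs(y(i)), AbsTol(i)); that formula is my reading of the documentation, not something taken from the solver source. The helper name error_bound is made up for the illustration, and the state values are the initial conditions from the question.

```python
import math

# Initial state copied from the question (positions, velocities, mass).
x_init = [72368.8347685214, 3384891.40103322, -598.36623436025,
          -1440.49702235844, 16330.430678033]

rel_tol = 1e-13   # 'RelTol' from the odeset call above
abs_tol = 1e-20   # 'AbsTol' from the odeset call above


def error_bound(value, rel_tol, abs_tol):
    """Allowed local error for one component (assumed form of the test)."""
    return max(rel_tol * abs(value), abs_tol)


bounds = [error_bound(v, rel_tol, abs_tol) for v in x_init]
for v, b in zip(x_init, bounds):
    print(f"{v: .6e} -> allowed local error {b:.3e}")
```

For state components of this size the relative term dominates everywhere, so the AbsTol of 1e-20 would only matter if a component came very close to zero.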

Decrease the computation time for DGELSD subroutine

I am writing Fortran code that involves computing a linear least-squares solution (A^(-1)*b). For this I have used the subroutine "DGELSD". My code gives the correct answer; I have cross-checked my solution against the results from MATLAB.
When it comes to computation time, however, MATLAB takes much less time than my .f90 code. My main motivation for writing the .f90 code was to do heavy computations, as MATLAB was unable to handle this problem (as the matrix size increases, it takes more and more time). I am talking about matrices of order around (10^6 x 10^6).
I know that I may be missing something regarding vectorization or parallelization of the code (I am new to it). But would it make any difference, given that the subroutine "DGELSD" is already highly optimized? I am using the Intel ifort compiler, with Visual Studio as the editor.
I have attached the relevant part of the main code (.f90) below for reference. Please suggest what can be done to decrease the computation time. I have a good hardware configuration for running heavy computations.
Workstation specification: Intel Xeon 32 core - 64 bit processor, 64 GB RAM.
Code information:
a) I have used the 'DGELSD' example available on the Intel MKL example site.
b) I am using x64 architecture.
c) This is the code I am using in MATLAB for comparison.
function min_norm_check(m,n)
tic
a=15*m*n;
b=15*m*n;
c=6;
A=rand(a,b);
B=rand(b,c);
C=A\B;
toc
end
This is the Fortran code:
! Program to check the time required to
! calculate linear least square solution
! using DGELSD subroutine
PROGRAM MAIN
IMPLICIT NONE
INTEGER :: M,N,LDA,LDB,NRHS,NB,NG
REAL :: T1,T2,START,FINISH
DOUBLE PRECISION, DIMENSION (:,:), ALLOCATABLE :: D_CAP,A,B,TG,DG
INTEGER :: I=0
NB=10
NG=10
M = 15*NB*NG
N = 15*NB*NG
NRHS = 6
!!
LDA = MAX(M,N)
LDB = MAX(M,NRHS)
!
ALLOCATE (A(LDA,N))
ALLOCATE (B(LDB,NRHS))
A = 0
B = 0
CALL RANDOM_NUMBER(A)
CALL RANDOM_NUMBER(B)
CALL CPU_TIME(START)
DO I=1,1
WRITE(*,*) I
CALL CALC_MIN_NORM_D(M,N,LDA,LDB,NRHS,A,B,D_CAP)
ENDDO
CALL CPU_TIME(FINISH)
WRITE(*,*)'("TIME =',FINISH-START,'SECONDS.")'
END
SUBROUTINE CALC_MIN_NORM_D(M,N,LDA,LDB,NRHS,A,B,D_CAP)
!
! SUBROUTINE DEFINITION TO CALCULATE THE
! LINEAR LEAST SQUARE SOLUTION OF ([A]^-1*B)
!
IMPLICIT NONE
INTEGER :: M,N,NRHS,LDA,LDB,LWMAX,INFO,LWORK,RANK
DOUBLE PRECISION RCOND
INTEGER, ALLOCATABLE, DIMENSION(:) :: IWORK
DOUBLE PRECISION :: A(LDA,N),B(LDB,NRHS),D_CAP(LDB,NRHS)
DOUBLE PRECISION, ALLOCATABLE, DIMENSION(:) :: S,WORK
!
WRITE(*,*)'IN CALC MIN NORM BEGINING'
WRITE(*,*)'DGELSD EXAMPLE PROGRAM RESULTS'
LWMAX = 1E8
ALLOCATE(S(M))
ALLOCATE(IWORK(3*M*0+11*M))
ALLOCATE(WORK(LWMAX))
! NEGATIVE RCOND MEANS USING DEFAULT (MACHINE PRECISION) VALUE
RCOND = -1.0
!
! QUERY THE OPTIMAL WORKSPACE.
!
LWORK = -1
CALL DGELSD( M, N, NRHS, A, LDA, B, LDB, S, RCOND, RANK, WORK,LWORK, IWORK, INFO )
LWORK = MIN( LWMAX, INT( WORK( 1 ) ) )
!WRITE(*,*) 'AFTER FIRST DGELSD'
!
! SOLVE THE EQUATIONS A*X = B.
!!!
CALL DGELSD( M, N, NRHS, A, LDA, B, LDB, S, RCOND, RANK, WORK,LWORK, IWORK, INFO )
!
! CHECK FOR CONVERGENCE.
!
IF( INFO.GT.0 ) THEN
WRITE(*,*)'THE ALGORITHM COMPUTING SVD FAILED TO CONVERGE;'
WRITE(*,*)'THE LEAST SQUARES SOLUTION COULD NOT BE COMPUTED.'
STOP
END IF
!
!
WRITE(*,*)' EFFECTIVE RANK = ', RANK
!D_CAP = B
END
This is the build log for a successful compilation of the file. As I am using Visual Studio with Intel Visual Fortran, I can compile my program using the Compile option available in the editor; that is, I don't have to use the command line interface to build my program.
Compiling with Intel(R) Visual Fortran Compiler 17.0.2.187 [IA-32]...
ifort /nologo /debug:full /Od /warn:interfaces /module:"Debug\\" /object:"Debug\\" /Fd"Debug\vc110.pdb" /traceback /check:bounds /check:stack /libs:dll /threads /dbglibs /c /Qlocation,link,"C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\\bin" /Qm32 "D:\Google Drive\Friday_May_27_2016\Mtech Thesis Job\All new work\Fortran coding\fortran hfgmc\Console6\Console6\Console6.f90"
Linking...
Link /OUT:"Debug\Console6.exe" /INCREMENTAL:NO /NOLOGO /MANIFEST /MANIFESTFILE:"Debug\Console6.exe.intermediate.manifest" /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /DEBUG /PDB:"D:\Google Drive\Friday_May_27_2016\Mtech Thesis Job\All new work\Fortran coding\Console6\Console6\Debug\Console6.pdb" /SUBSYSTEM:CONSOLE /IMPLIB:"D:\Google Drive\Friday_May_27_2016\Mtech Thesis Job\All new work\Fortran coding\fortran hfgmc\Console6\Console6\Debug\Console6.lib" -qm32 "Debug\Console6.obj"
Embedding manifest...
mt.exe /nologo /outputresource:"D:\Google Drive\Friday_May_27_2016\Mtech Thesis Job\All new work\Fortran coding\fortran hfgmc\Console6\Console6\Debug\Console6.exe;#1" /manifest "Debug\Console6.exe.intermediate.manifest"
Console6 - 0 error(s), 0 warning(s)
Also I have included the intel mkl library, shown in the figure below:
I have compiled the program and the output has been written in file 'OUTPUT_FILE.TXT'.
IN CALC MIN NORM BEGINING
DGELSD EXAMPLE PROGRAM RESULTS
EFFECTIVE RANK = 1500
IN CALC MIN NORM ENDING
("TIME TAKEN TO SOLVE DGELSD= 4.290028 SECONDS.")
On the other hand MATLAB gives the following result in the output command window:
min_norm_check(10,10)
Elapsed time is 0.224172 seconds.
Also, I don't want to outperform MATLAB; it's easy and simple. But as the size of the problem increases, MATLAB stops responding. I have left my program running in MATLAB for more than two days, and it still hasn't produced any results.

Computing mixed derivatives in MATLAB using syms and diff

I'm using MATLAB 2012b.
I want to get d²/dxdy of a simple function:
f(x,y) = (x-1)² + 2y²
The documentation states that I can use syms and diff as in the following example:
> syms x y
> diff(x*sin(x*y), x, y)
ans =
2*x*cos(x*y) - x^2*y*sin(x*y)
But doing the same I got the wrong answer:
> syms x y
> f = (x-1)^2 + 2*y^2;
> diff(f,x,y)
ans =
4*y
The answer is right if I use diff like this:
diff(diff(f,x),y)
Well, it's not a problem for me to use it this way, but nevertheless why is the first variant not working? Is it a version issue?
The actual documentation from R2010a:
diff(expr) differentiates a symbolic expression expr with respect to its free variable as determined by symvar.
diff(expr, v) and diff(expr, sym('v')) differentiate expr with respect to v.
diff(expr, n) differentiates expr n times. n is a positive integer.
diff(expr, v, n) and diff(expr, n, v) differentiate expr with respect to v n times.
So, the command diff(f,x,y) is the last case. It would be equal to differentiating f w.r.t. x, y times, or w.r.t y, x times.
For some reason I don't quite understand, you don't get a warning or error, but one of the syms variables gets interpreted as n = 1, and then the differentiation is carried out. In this case, what diff seems to do is basically diff(f, y, 1).
In any case, it seems that the behavior changed from version to version, because in the documentation you link to (R2016b), there is an additional case:
diff(F,var1,...,varN) differentiates F with respect to the variables var1,...,varN
So I suspect you're running into a version issue.
If you want to differentiate twice, both w.r.t x and y, your second attempt is indeed the correct and most portable way to do that:
diff( diff(f,x), y )
or equivalently
diff( diff(f,y), x )
NB
I checked the R2010a code for symbolic/symbolic/@sym/diff.m and indeed, n is defaulted to 1 and only changed if one of the input variables is a double, and the variable to differentiate over is set equal to the last syms variable in the argument list. The multiple syms variable call is not supported, nor detected and error-trapped.
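As a sanity check that is independent of the Symbolic Toolbox, the two candidate results can be compared numerically. The following Python sketch uses central finite differences on the function from the question; the helper names d_dy and d2_dxdy, the step size h and the sample point are all choices made for this illustration. By hand, df/dx = 2(x-1), so the mixed derivative d²f/dxdy is 0, while df/dy = 4y.

```python
def f(x, y):
    """The function from the question: (x-1)^2 + 2*y^2."""
    return (x - 1) ** 2 + 2 * y ** 2


def d_dy(func, x, y, h=1e-3):
    """Central-difference estimate of the first derivative w.r.t. y."""
    return (func(x, y + h) - func(x, y - h)) / (2 * h)


def d2_dxdy(func, x, y, h=1e-3):
    """Central-difference estimate of the mixed derivative d^2/dxdy."""
    return (func(x + h, y + h) - func(x + h, y - h)
            - func(x - h, y + h) + func(x - h, y - h)) / (4 * h * h)


x0, y0 = 0.7, 1.3                 # arbitrary sample point
first_in_y = d_dy(f, x0, y0)      # should be close to 4*y0
mixed = d2_dxdy(f, x0, y0)        # should be close to 0

print("df/dy     ~", first_in_y, "(4*y0 =", 4 * y0, ")")
print("d2f/dxdy  ~", mixed)
```

The first estimate matches the 4*y that diff(f,x,y) returned in the question, which is consistent with that call having been treated as a single derivative with respect to y.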
Syms is only creating symbolic variables.
The first code you execute is only a single derivative. The second code you provided differentiates two times. So I think you forgot to differentiate a second time in the first piece of code you provided.
I am also wondering what answer you expect. If you want 4*y as the answer, then you should use
diff(f,y)
and not
diff(f,x,y)
Performing the second derivative is giving me zero?
diff(diff(f,x),y)
If you want 4 as the answer, then you have to do the following:
diff(diff(f,y),y)

Julia vs. Matlab benchmarking eigenvector calculations

I'm a new Julia user and I need to find eigenvectors of large matrices as quickly as possible*. I'm having trouble getting Julia to run as fast as Matlab for the following example:
Julia
const j = 1000 ::Int
A = Array{Float64}(j,j)
B = Array{Float64}(j,j)
f(x) = eigvecs(x)
A = randn(j,j)
B = f(A)
@time f(A)
output for time: 2.950973 seconds (12.31 k allocations: 76.445 MB, 0.11% gc time)
Matlab
j = 1000;
A = randn(j,j);
tic
[v, d] = eig(A);
toc
Elapsed time is 1.161133 seconds.
I have also checked Matlab with 1 thread to compare using maxNumCompThreads = 1 but it still gives a similar time (1.16s) to before. I've also tried to speed up Julia by running twice to precompile, and also setting blas_set_num_threads(4) but this isn't helping.
I'd really appreciate any advice about how to improve my Julia code!
*(I am using Matlab 2015b and Julia 0.4.7 on OSX El Capitan 10.11.6)
Kind of a duplicate of this discussion.
Usually when talking about Julia performance, you're talking about how the language actually works. In this case, both Julia and MATLAB are just calling well-optimized C/Fortran libraries for doing the eigenvalue calculation. This is reliant on the BLAS configuration. MATLAB ships with a version of MKL, so it's also just using a different library which in many cases is faster than OpenBLAS, but you can build Julia with MKL using the instructions in the README on the Julia Github repo. Maybe rebuilding your sysimg could help:
include(joinpath(dirname(JULIA_HOME),"share","julia","build_sysimg.jl")); build_sysimg(force=true)
If you are using a pre-built binary then it's not optimized for your system, and this will enable the optimizations.

order of the outputs for the MATLAB solve function

I have been tinkering with the MATLAB solve function for a while, but cannot seem to figure out how it determines the order in which it outputs the symbolic variables.
Specifically, I have a system of equations that I want to solve simultaneously.
a = f(a, b, c, d)
b = f(a, b, c, d)
c = f(a, b, c, d)
d = f(a, b, c, d)
and these equations are symbolic and have other symbolic variables (aside from a, b, c, and d). (so the solution outputs aren't numeric, but are symbolic).
For example, when I am solving for the equations of motion of an inverted spring pendulum, I have two equations that are both dependent on phiDDot and lenDDot. I use the solve function to solve for phiDDot and lenDDot separately using this call:
[eom2, eom1] = solve(Lag(1)==0, Lag(2)==0, ddphi, ddlen);
The solution for ddphi corresponds to the second term of the matrix outputted, while ddlen corresponds to the first term of the matrix. I was wondering whether there was some way to tell MATLAB to output ddphi first and ddlen second, or at least determine what order they are outputted. Not knowing the order of the variables becomes a big problem when I am solving for more than 4 variables, and trying to solve the differential equations using ode45.
Any advice would be helpful!!
I believe that it's alphabetical based on the ASCII values of the variable names in your equations. As per the documentation for solve, sym/symvar is used to parse the equations in the case where you don't supply the names of output variables. The help for sym/symvar indicates that it returns variables in lexicographical order, i.e. alphabetical (symvar does the same, even though it doesn't say so, by making calls to setdiff). If you look at the actual code for solve.m (type edit solve in your command window) and examine the sub-function called assignOutputs (line 190 in R2012b) you'll see that it makes a call to sort and that there's a comment about lexicographical order.
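To make the ordering concrete for the two names in the question, here is a tiny Python sketch. It only illustrates lexicographical (ASCII) ordering of the names; it is not how solve.m is implemented, and the variable names requested, alphabetical and position exist only for this example.

```python
# Names in the order they were passed to solve in the question.
requested = ["ddphi", "ddlen"]

# Lexicographical order: the names first differ at the third character,
# and 'l' sorts before 'p', so ddlen comes before ddphi.
alphabetical = sorted(requested)

# Map each name to its position in the sorted output list.
position = {name: index for index, name in enumerate(alphabetical)}

print(alphabetical)
print(position)
```

This agrees with what the question reports: ddlen shows up as the first output and ddphi as the second, regardless of the order in which they were listed.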
In R2012b (and likely earlier) the documentation differs from that of R2013a in a way that seems relevant to your issue. In R2013a, this sentence is added:
If you explicitly specify independent variables vars, then the solver uses the same order
to return the solutions.
I'm still running R2012b, so I can't confirm this different behavior.