Do loop to be parallelized in a Matlab's Mex function (Fortran) with OpenMP - matlab

I have a do loop in a Matlab's Mex function (written in Fortran) that performs some calculations for each element of a FEM mesh. My mesh consists of 250k elements, so I thought it was worth parallelizing it. This is my first attempt at parallelizing this code with OpenMP (I'm a beginner at coding). I used the reduction command to avoid the race condition in fintk(dofele) = fintk(dofele) + fintele. Is it correct? I can compile it in Matlab without any problem. However, when I use it (in Matlab), it produces correct results for a 12k element mesh and it is faster than the serialized one, but when I try to use it for the 250k element mesh Matlab crashes. Thank you for helping me
subroutine loop_over_elements( &
! OUT
fintk,Sxyz,&
! IN
Elem,Bemesh,Dofelemat,u,dt,NE,NDOF)
use omp_lib
implicit none
mwSize NE, NDOF, ele
integer, parameter :: dp = selected_real_kind(15,307)
real(dp) :: fintk(NDOF), Sxyz(6,NE), Elemat(4,NE), Bemesh(6,12,NE), Dofelemat(12,NE)
real(dp) :: u(NDOF)
real(dp) :: Bele(6,12), fintele(12), uele(12), si(6), dt
integer*4 :: nodes(4), dofele(12)
fintk = 0.D0
!$OMP PARALLEL DO REDUCTION(+:fintk(:)) PRIVATE(ele,nodes,Bele,dofele,uele,si,fintele)
DO ele = 1, NE
nodes = Elemat(1:4,ele)
Bele = Bemesh(1:6,1:12,ele)
dofele = Dofelemat(1:12,ele)
uele = u(dofele)
call comput_subroutine( &
! IN
Bele,uele,dt, &
! OUT
si)
Sxyz(:,ele) = si
fintele = MATMUL(TRANSPOSE(Bele),si)
fintk(dofele) = fintk(dofele) + fintele
END DO
!$OMP END PARALLEL DO
return
end

I solved the "crashing" of Matlab that I was experiencing by adding this line in the general mexFunction subroutine before calling the loop_over_elements subroutine:
call KMP_SET_STACKSIZE(100000000). I figured that, since Matlab didn't crash when I was using the large model with the non-parallelized subroutine, maybe it was a memory problem. After that I discovered the well-known (not to me unfortunately) segmentation fault problem when using OpenMp with large arrays (see for example this). I'm still a bit confused about the difference between setting OMP_STACKSIZE(which I don't know how to do in Mex functions) and KMP_SET_STACKSIZE, but now the parallelized code works with the large model.

Related

Why does geometryFromEdges give a parse error when used in parfor on a thread-based pool?

I want to solve multiple PDEs in parallel on threads using the Parallel Computing Toolbox in Matlab. I have the following code:
delete(gcp('nocreate'));
parpool('threads')
parfor i = 1:1
model = createpde();
R1 = [3,4,0,1,1,0,1,1,0,0]';
sf = 'R1';
ns = char('R1');
ns = ns';
gm = R1;
g = decsg(gm,sf,ns);
geometryFromEdges(model, g);
end
This results in the following error:
Error using pde.EquationModel/geometryFromEdges
The specified superclass 'pde.GeometricModel' contains a parse error,
cannot be found on MATLAB's search path, or is shadowed by another file
with the same name.
Error in codeFile (line 4)
parfor i = 1:1
When I change the parfor into a for, the code runs (and consequently I am able to numerically solve PDEs). The problem does not happen when I leave out parpool('threads'), and the parallel computations are done on processes.
I have already restored the default Matlab paths, the solution suggested for this type of error.
What could the problem be?
MATLAB only supports a subset of functions in thread-based parallel computation, less than is supported in process-based parallel computation. You can find the list here. It appears that geometryFromEdges isn't one of them.

indexing within parfor loop - matlab

I'm trying to run this code in Matlab
a = ones(4,4);
b=[1,0,0,1;0,0,0,1;0,1,0,0;0,0,0,0];
b(:,:,2)=[0,1,1,0;1,1,1,0;1,0,1,1;1,1,1,1];
parfor i = 1:size(b,3)
c = b(:,:,i)
a(c) = i;
end
but get the error:
Error: The variable a in a parfor cannot be classified.
See Parallel for Loops in MATLAB, "Overview".
There are restrictions in how you can write into arrays inside the body of a parfor loop. In general, you will need to use sliced arrays.
The reason behind this issue is that Matlab needs to prevent that different worksers access the same data, leading to unpredictable results (as the timely order in which the parfor loops through i is not detemined).
So, although in your example the workers don't operate on the same entries of a, due to the way how you index a (with an array of logicals), it is currently not possible for Matlab to decide if this is the case or not (in other words, Matlab cannot classify a).
Edit: For completeness I add some code that is equivalent to your example, although I assume that your actual problem involves more complicated logical indexing?
a = ones(4,4,4);
parfor i = 1:size(a,1)
a(i, :, :) = zeros(4, 4) + i; % this is sliced indexing
end
Edit: As the OP example was modified, the above code is not equivalent to the example anymore.

why does a*b*a take longer than (a'*(a*b)')' when using gpuArray in Matlab scripts?

The code below performs the operation the same operation on gpuArrays a and b in two different ways. The first part computes (a'*(a*b)')' , while the second part computes a*b*a. The results are then verified to be the same.
%function test
clear
rng('default');rng(1);
a=sprand(3000,3000,0.1);
b=rand(3000,3000);
a=gpuArray(a);
b=gpuArray(b);
tic;
c1=gather(transpose(transpose(a)*transpose(a*b)));
disp(['time for (a''*(a*b)'')'': ' , num2str(toc),'s'])
clearvars -except c1
rng('default');
rng(1)
a=sprand(3000,3000,0.1);
b=rand(3000,3000);
a=gpuArray(a);
b=gpuArray(b);
tic;
c2=gather(a*b*a);
disp(['time for a*b*a: ' , num2str(toc),'s'])
disp(['error = ',num2str(max(max(abs(c1-c2))))])
%end
However, computing (a'*(a*b)')' is roughly 4 times faster than computing a*b*a. Here is the output of the above script in R2018a on an Nvidia K20 (I've tried different versions and different GPUs with the similar behaviour).
>> test
time for (a'*(a*b)')': 0.43234s
time for a*b*a: 1.7175s
error = 2.0009e-11
Even more strangely, if the first and last lines of the above script are uncommented (to turn it into a function), then both take the longer amount of time (~1.7s instead of ~0.4s). Below is the output for this case:
>> test
time for (a'*(a*b)')': 1.717s
time for a*b*a: 1.7153s
error = 1.0914e-11
I'd like to know what is causing this behaviour, and how to perform a*b*a or (a'*(a*b)')' or both in the shorter amount of time (i.e. ~0.4s rather than ~1.7s) inside a matlab function rather than inside a script.
There seem to be an issue with multiplication of two sparse matrices on GPU. time for sparse by full matrix is more than 1000 times faster than sparse by sparse. A simple example:
str={'sparse*sparse','sparse*full'};
for ii=1:2
rng(1);
a=sprand(3000,3000,0.1);
b=sprand(3000,3000,0.1);
if ii==2
b=full(b);
end
a=gpuArray(a);
b=gpuArray(b);
tic
c=a*b;
disp(['time for ',str{ii},': ' , num2str(toc),'s'])
end
In your context, it is the last multiplication which does it. to demonstrate I replace a with a duplicate c, and multiply by it twice, once as sparse and once as full matrix.
str={'a*b*a','a*b*full(a)'};
for ii=1:2
%rng('default');
rng(1)
a=sprand(3000,3000,0.1);
b=rand(3000,3000);
rng(1)
c=sprand(3000,3000,0.1);
if ii==2
c=full(c);
end
a=gpuArray(a);
b=gpuArray(b);
c=gpuArray(c);
tic;
c1{ii}=a*b*c;
disp(['time for ',str{ii},': ' , num2str(toc),'s'])
end
disp(['error = ',num2str(max(max(abs(c1{1}-c1{2}))))])
I may be wrong, but my conclusion is that a * b * a involves multiplication of two sparse matrices (a and a again) and is not treated well, while using transpose() approach divides the process to two stage multiplication, in none of which there are two sparse matrices.
I got in touch with Mathworks tech support and Rylan finally shed some light on this issue. (Thanks Rylan!) His full response is below. The function vs script issue appears to be related to certain optimizations matlab applies automatically to functions (but not scripts) not working as expected.
Rylan's response:
Thank you for your patience on this issue. I have consulted with the MATLAB GPU computing developers to understand this better.
This issue is caused by internal optimizations done by MATLAB when encountering some specific operations like matrix-matrix multiplication and transpose. Some of these optimizations may be enabled specifically when executing a MATLAB function (or anonymous function) rather than a script.
When your initial code was being executed from a script, a particular matrix transpose optimization is not performed, which results in the 'res2' expression being faster than the 'res1' expression:
n = 2000;
a=gpuArray(sprand(n,n,0.01));
b=gpuArray(rand(n));
tic;res1=a*b*a;wait(gpuDevice);toc % Elapsed time is 0.884099 seconds.
tic;res2=transpose(transpose(a)*transpose(a*b));wait(gpuDevice);toc % Elapsed time is 0.068855 seconds.
However when the above code is placed in a MATLAB function file, an additional matrix transpose-times optimization is done which causes the 'res2' expression to go through a different code path (and different CUDA library function call) compared to the same line being called from a script. Therefore this optimization generates slower results for the 'res2' line when called from a function file.
To avoid this issue from occurring in a function file, the transpose and multiply operations would need to be split in a manner that stops MATLAB from applying this optimization. Separating each clause within the 'res2' statement seems to be sufficient for this:
tic;i1=transpose(a);i2=transpose(a*b);res3=transpose(i1*i2);wait(gpuDevice);toc % Elapsed time is 0.066446 seconds.
In the above line, 'res3' is being generated from two intermediate matrices: 'i1' and 'i2'. The performance (on my system) seems to be on par with that of the 'res2' expression when executed from a script; in addition the 'res3' expression also shows similar performance when executed from a MATLAB function file. Note however that additional memory may be used to store the transposed copy of the initial array. Please let me know if you see different performance behavior on your system, and I can investigate this further.
Additionally, the 'res3' operation shows faster performance when measured with the 'gputimeit' function too. Please refer to the attached 'testscript2.m' file for more information on this. I have also attached 'test_v2.m' which is a modification of the 'test.m' function in your Stack Overflow post.
Thank you for reporting this issue to me. I would like to apologize for any inconvenience caused by this issue. I have created an internal bug report to notify the MATLAB developers about this behavior. They may provide a fix for this in a future release of MATLAB.
Since you had an additional question about comparing the performance of GPU code using 'gputimeit' vs. using 'tic' and 'toc', I just wanted to provide one suggestion which the MATLAB GPU computing developers had mentioned earlier. It is generally good to also call 'wait(gpuDevice)' before the 'tic' statements to ensure that GPU operations from the previous lines don't overlap in the measurement for the next line. For example, in the following lines:
b=gpuArray(rand(n));
tic; res1=a*b*a; wait(gpuDevice); toc
if the 'wait(gpuDevice)' is not called before the 'tic', some of the time taken to construct the 'b' array from the previous line may overlap and get counted in the time taken to execute the 'res1' expression. This would be preferred instead:
b=gpuArray(rand(n));
wait(gpuDevice); tic; res1=a*b*a; wait(gpuDevice); toc
Apart from this, I am not seeing any specific issues in the way that you are using the 'tic' and 'toc' functions. However note that using 'gputimeit' is generally recommended over using 'tic' and 'toc' directly for GPU-related profiling.
I will go ahead and close this case for now, but please let me know if you have any further questions about this.
%testscript2.m
n = 2000;
a = gpuArray(sprand(n, n, 0.01));
b = gpuArray(rand(n));
gputimeit(#()transpose_mult_fun(a, b))
gputimeit(#()transpose_mult_fun_2(a, b))
function out = transpose_mult_fun(in1, in2)
i1 = transpose(in1);
i2 = transpose(in1*in2);
out = transpose(i1*i2);
end
function out = transpose_mult_fun_2(in1, in2)
out = transpose(transpose(in1)*transpose(in1*in2));
end
.
function test_v2
clear
%% transposed expression
n = 2000;
rng('default');rng(1);
a = sprand(n, n, 0.1);
b = rand(n, n);
a = gpuArray(a);
b = gpuArray(b);
tic;
c1 = gather(transpose( transpose(a) * transpose(a * b) ));
disp(['time for (a''*(a*b)'')'': ' , num2str(toc),'s'])
clearvars -except c1
%% non-transposed expression
rng('default');
rng(1)
n = 2000;
a = sprand(n, n, 0.1);
b = rand(n, n);
a = gpuArray(a);
b = gpuArray(b);
tic;
c2 = gather(a * b * a);
disp(['time for a*b*a: ' , num2str(toc),'s'])
disp(['error = ',num2str(max(max(abs(c1-c2))))])
%% sliced equivalent
rng('default');
rng(1)
n = 2000;
a = sprand(n, n, 0.1);
b = rand(n, n);
a = gpuArray(a);
b = gpuArray(b);
tic;
intermediate1 = transpose(a);
intermediate2 = transpose(a * b);
c3 = gather(transpose( intermediate1 * intermediate2 ));
disp(['time for split equivalent: ' , num2str(toc),'s'])
disp(['error = ',num2str(max(max(abs(c1-c3))))])
end
EDIT 2 I might have been right, see this other answer
EDIT: They use MAGMA, which is column major. My answer does not hold, however I will leave it here for a while in case it can help crack this strange behavior.
The below answer is wrong
This is my guess, I can not 100% tell you without knowing the code under MATLAB's hood.
Hypothesis: MATLABs parallel computing code uses CUDA libraries, not their own.
Important information
MATLAB is column major and CUDA is row major.
There is no such things as 2D matrices, only 1D matrices with 2 indices
Why does this matter? Well because CUDA is highly optimized code that uses memory structure to maximize cache hits per kernel (the slowest operation on GPUs is reading memory). This means a standard CUDA matrix multiplication code will exploit the order of memory reads to make sure they are adjacent. However, what is adjacent memory in row-major is not in column-major.
So, there are 2 solutions to this as someone writing software
Write your own column-major algebra libraries in CUDA
Take every input/output from MATLAB and transpose it (i.e. convert from column-major to row major)
They have done point 2, and assuming that there is a smart JIT compiler for MATLAB parallel processing toolbox (reasonable assumption), for the second case, it takes a and b, transposes them, does the maths, and transposes the output when you gather.
In the first case however, you already do not need to transpose the output, as it is internally already transposed and the JIT catches this, so instead of calling gather(transpose( XX )) it just skips the output transposition is side. The same with transpose(a*b). Note that transpose(a*b)=transpose(b)*transpose(a), so suddenly no transposes are needed (they are all internally skipped). A transposition is a costly operation.
Indeed there is a weird thing here: making the code a function suddenly makes it slow. My best guess is that because the JIT behaves differently in different situations, it doesn't catch all this transpose stuff inside and just does all the operations anyway, losing the speed up.
Interesting observation: It takes the same time in CPU than GPU to do a*b*a in my PC.

Decrease the computation time for DGELSD subroutine

I am writing a code in Fortran that involve computation of linear least squares (A^(-1)*b). For this I have used the subroutine "DGELSD". I am able to get the correct answer from my code. I have cross checked my solution with that of the data available from MATLAB.
Now when it comes to the computation time, MATLAB is taking much less time than that of my .f90 code. The main motivation for me to write the .f90 code was to do heavy computations, as MATLAB was unable to handle this problem (as the matrix size is increased, it takes more and more time). I am talking of the order of matrix around (10^6 x 10^6).
I know that I may be lacking somewhere around the vectorization or parallelization of code (as I am new to it). But would it make any difference? As the subroutine "DGELSD" is already highly optimized. Also I am using Intel ifort compiler and visual studio as an editor.
I have attached the part of the main code (.f90) below for reference. Please suggest what can be done to decrease the computation time. I am having good hardware configuration to run heavy computation.
Workstation specification: Intel Xeon 32 core - 64 bit processor, 64 GB RAM.
Code information:
a) I have used 'DGELSD' example available on Intel MKL example site..
b) I am using x64 architecture.
c) This is the code I am using in MATLAB for comparison.
function min_norm_check(m,n)
tic
a=15*m*n;
b=15*m*n;
c=6;
A=rand(a,b);
B=rand(b,c);
C=A\B;
toc
end
This is the Fotran code given below:
! Program to check the time required to
! calculate linear least square solution
! using DGELSD subroutine
PROGRAM MAIN
IMPLICIT NONE
INTEGER :: M,N,LDA,LDB,NRHS,NB,NG
REAL :: T1,T2,START,FINISH
DOUBLE PRECISION, DIMENSION (:,:), ALLOCATABLE :: D_CAP,A,B,TG,DG
INTEGER :: I=0
NB=10
NG=10
M = 15*NB*NG
N = 15*NB*NG
NRHS = 6
!!
LDA = MAX(M,N)
LDB = MAX(M,NRHS)
!
ALLOCATE (A(LDA,N))
ALLOCATE (B(LDB,NRHS))
A = 0
B = 0
CALL RANDOM_NUMBER(A)
CALL RANDOM_NUMBER(B)
CALL CPU_TIME(START)
DO I=1,1
WRITE(*,*) I
CALL CALC_MIN_NORM_D(M,N,LDA,LDB,NRHS,A,B,D_CAP)
ENDDO
CALL CPU_TIME(FINISH)
WRITE(*,*)'("TIME =',FINISH-START,'SECONDS.")'
END
SUBROUTINE CALC_MIN_NORM_D(M,N,LDA,LDB,NRHS,A,B,D_CAP)
!
! SUBROUTINE DEFINITION TO CALCULATE THE
! LINEAR LEAST SQUARE SOLUTION OF ([A]^-1*B)
!
IMPLICIT NONE
INTEGER :: M,N,NRHS,LDA,LDB,LWMAX,INFO,LWORK,RANK
DOUBLE PRECISION RCOND
INTEGER, ALLOCATABLE, DIMENSION(:) :: IWORK
DOUBLE PRECISION :: A(LDA,N),B(LDB,NRHS),D_CAP(LDB,NRHS)
DOUBLE PRECISION, ALLOCATABLE, DIMENSION(:) :: S,WORK
!
WRITE(*,*)'IN CALC MIN NORM BEGINING'
WRITE(*,*)'DGELSD EXAMPLE PROGRAM RESULTS'
LWMAX = 1E8
ALLOCATE(S(M))
ALLOCATE(IWORK(3*M*0+11*M))
ALLOCATE(WORK(LWMAX))
! NEGATIVE RCOND MEANS USING DEFAULT (MACHINE PRECISION) VALUE
RCOND = -1.0
!
! QUERY THE OPTIMAL WORKSPACE.
!
LWORK = -1
CALL DGELSD( M, N, NRHS, A, LDA, B, LDB, S, RCOND, RANK, WORK,LWORK, IWORK, INFO )
LWORK = MIN( LWMAX, INT( WORK( 1 ) ) )
!WRITE(*,*) 'AFTER FIRST DGELSD'
!
! SOLVE THE EQUATIONS A*X = B.
!!!
CALL DGELSD( M, N, NRHS, A, LDA, B, LDB, S, RCOND, RANK, WORK,LWORK, IWORK, INFO )
!
! CHECK FOR CONVERGENCE.
!
IF( INFO.GT.0 ) THEN
WRITE(*,*)'THE ALGORITHM COMPUTING SVD FAILED TO CONVERGE;'
WRITE(*,*)'THE LEAST SQUARES SOLUTION COULD NOT BE COMPUTED.'
STOP
END IF
!
!
WRITE(*,*)' EFFECTIVE RANK = ', RANK
!D_CAP = B
END
This is the build log for successful compilation of the file. As I am using Visual studio with intel visual fortran, I can compile my program using Compile option available in the editor. I mean to say I don't have to use command line interface to run my program.
Compiling with Intel(R) Visual Fortran Compiler 17.0.2.187 [IA-32]...
ifort /nologo /debug:full /Od /warn:interfaces /module:&quotDebug\\&quot /object:&quotDebug\\&quot /Fd&quotDebug\vc110.pdb&quot /traceback /check:bounds /check:stack /libs:dll /threads /dbglibs /c /Qlocation,link,&quotC:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\\bin&quot /Qm32 &quotD:\Google Drive\Friday_May_27_2016\Mtech Thesis Job\All new work\Fortran coding\fortran hfgmc\Console6\Console6\Console6.f90&quot
Linking...
Link /OUT:&quotDebug\Console6.exe&quot /INCREMENTAL:NO /NOLOGO /MANIFEST /MANIFESTFILE:&quotDebug\Console6.exe.intermediate.manifest&quot /MANIFESTUAC:&quotlevel='asInvoker' uiAccess='false'&quot /DEBUG /PDB:&quotD:\Google Drive\Friday_May_27_2016\Mtech Thesis Job\All new work\Fortran coding\Console6\Console6\Debug\Console6.pdb&quot /SUBSYSTEM:CONSOLE /IMPLIB:&quotD:\Google Drive\Friday_May_27_2016\Mtech Thesis Job\All new work\Fortran coding\fortran hfgmc\Console6\Console6\Debug\Console6.lib&quot -qm32 &quotDebug\Console6.obj&quot
Embedding manifest...
mt.exe /nologo /outputresource:&quotD:\Google Drive\Friday_May_27_2016\Mtech Thesis Job\All new work\Fortran coding\fortran hfgmc\Console6\Console6\Debug\Console6.exe;#1&quot /manifest &quotDebug\Console6.exe.intermediate.manifest&quot
Console6 - 0 error(s), 0 warning(s)
Also I have included the intel mkl library, shown in the figure below:
I have compiled the program and the output has been written in file 'OUTPUT_FILE.TXT'.
IN CALC MIN NORM BEGINING
DGELSD EXAMPLE PROGRAM RESULTS
EFFECTIVE RANK = 1500
IN CALC MIN NORM ENDING
("TIME TAKEN TO SOLVE DGELSD= 4.290028 SECONDS.")
On the other hand MATLAB gives the following result in the output command window:
min_norm_check(10,10)
Elapsed time is 0.224172 seconds.
Also I don't want to outperform MATLAB, its easy and simple. But with increase in the size of the problem, MATLAB stops responding. I have left my program to run on MATLAB for more than two days. It still haven't produced any results.

Out of memory in Matlab - how to do in-place operation on matrix elements?

I am loading a quite large matrix into Matlab. Loading this matrix already pushes Matlab to its limits - but it fits.
Then I do the following and I get an out-of-memory error.
data( :, 2:2:end, :, : ) = - data( :, 2:2:end, :, : );
Is Matlab allocating a new matrix for this operation? I would assume this operation would not need extra memory. How do I force Matlab to be more efficient for this?
Bonus question:
'data = permute(data,[1 2 3 4 5 12 8 7 6 9 10 11]);'
Can matlab do this in-place?
There are a few constraints (further to those from Loren's block cited by John):
Your code must be running inside a function
You must have no other aliases to 'data'
The 'aliases' thing is both important and potentially hard to get right. MATLAB uses copy-on-write, which means that when you call a function, the argument you pass isn't duplicated immediately, but might be copied if you modify it within the function. For example, consider
x = rand(100);
y = myfcn(x);
% with myfcn.m containing:
function out = myfcn(in)
in(1) = 3;
out = in * 2;
end
In that case, the variable x is passed to myfcn. MATLAB has value semantics, so any modifications to the input argument in must not be seen in the calling workspace. So, the first line of myfcn causes the argument in to become a copy of x, rather than simply an alias to it. Consider what happens with try/catch - this can be an in-place killer, because MATLAB has to be able to preserve values if you error out. In the following:
% consider this function
function myouterfcn()
x = rand(100);
x = myfcn(x);
end
% with myfcn.m containing
function arg = myfcn( arg )
arg = -arg;
end
then, that should get in-place optimisation for x in myouterfcn. But the following can't:
% consider this function
function myouterfcn()
x = rand(100);
x = myfcn(x);
end
% with myfcn.m containing
function arg = myfcn( arg )
try
arg = -arg;
catch E
disp( 'Oops.' );
end
end
Hope some of this info helps...
Matlab does support in place operations. Here's a discussion: http://blogs.mathworks.com/loren/2007/03/22/in-place-operations-on-data/. Sorry I can't be more help.
I don't think there's a good way to know what MATLAB is really doing under the sheets. I would recommend that you:
Make sure you clear any variables that aren't in use before trying the operation.
If you're still running out of memory, write and compile a simple mex file to do the operation in place. With a massive array and your specialized requirement, it'd probably be faster than MATLAB's approach too.
Likewise, you could write your own permutation algorithm as a .mex file in C if you require that it be done in-place (i.e. you're running out of memory again).