Decrease the computation time for the DGELSD subroutine - MATLAB

I am writing code in Fortran that involves computing a linear least-squares solution (effectively A^(-1)*b). For this I have used the LAPACK subroutine DGELSD. My code produces the correct answer; I have cross-checked the solution against MATLAB.
When it comes to computation time, however, MATLAB takes much less time than my .f90 code. The main motivation for writing the .f90 code was to do heavy computations that MATLAB cannot handle (as the matrix size increases, it takes more and more time). I am talking about matrices on the order of 10^6 x 10^6.
I know I may be missing something around vectorization or parallelization of the code (I am new to both), but would that make any difference, given that the DGELSD subroutine is already highly optimized? I am using the Intel ifort compiler with Visual Studio as the editor.
I have attached the relevant part of the main code (.f90) below for reference. Please suggest what can be done to decrease the computation time. I have good hardware for heavy computation.
Workstation specification: Intel Xeon 32-core, 64-bit processor, 64 GB RAM.
Code information:
a) I have used the DGELSD example available on the Intel MKL examples site.
b) I am using x64 architecture.
c) This is the code I am using in MATLAB for comparison.
function min_norm_check(m,n)
tic
a=15*m*n;
b=15*m*n;
c=6;
A=rand(a,b);
B=rand(b,c);
C=A\B;
toc
end
This is the Fortran code:
! Program to check the time required to
! calculate the linear least squares solution
! using the DGELSD subroutine
PROGRAM MAIN
  IMPLICIT NONE
  INTEGER :: M, N, LDA, LDB, NRHS, NB, NG
  REAL :: T1, T2, START, FINISH
  DOUBLE PRECISION, DIMENSION (:,:), ALLOCATABLE :: D_CAP, A, B, TG, DG
  INTEGER :: I = 0
  NB = 10
  NG = 10
  M = 15*NB*NG
  N = 15*NB*NG
  NRHS = 6
  !!
  LDA = MAX(M,N)
  LDB = MAX(M,NRHS)
  !
  ALLOCATE (A(LDA,N))
  ALLOCATE (B(LDB,NRHS))
  ALLOCATE (D_CAP(LDB,NRHS))   ! D_CAP MUST BE ALLOCATED BEFORE IT IS PASSED BELOW
  A = 0
  B = 0
  CALL RANDOM_NUMBER(A)
  CALL RANDOM_NUMBER(B)
  CALL CPU_TIME(START)
  DO I = 1, 1
    WRITE(*,*) I
    CALL CALC_MIN_NORM_D(M,N,LDA,LDB,NRHS,A,B,D_CAP)
  ENDDO
  CALL CPU_TIME(FINISH)
  WRITE(*,*)'("TIME =',FINISH-START,'SECONDS.")'
END
SUBROUTINE CALC_MIN_NORM_D(M,N,LDA,LDB,NRHS,A,B,D_CAP)
  !
  ! SUBROUTINE DEFINITION TO CALCULATE THE
  ! LINEAR LEAST SQUARES SOLUTION OF ([A]^-1 * B)
  !
  IMPLICIT NONE
  INTEGER :: M,N,NRHS,LDA,LDB,LWMAX,INFO,LWORK,RANK,NLVL
  DOUBLE PRECISION RCOND
  INTEGER, ALLOCATABLE, DIMENSION(:) :: IWORK
  DOUBLE PRECISION :: A(LDA,N),B(LDB,NRHS),D_CAP(LDB,NRHS)
  DOUBLE PRECISION, ALLOCATABLE, DIMENSION(:) :: S,WORK
  !
  WRITE(*,*)'IN CALC MIN NORM BEGINNING'
  WRITE(*,*)'DGELSD EXAMPLE PROGRAM RESULTS'
  LWMAX = 1E8
  ALLOCATE(S(M))
  ! IWORK NEEDS AT LEAST 3*MIN(M,N)*NLVL + 11*MIN(M,N) ENTRIES (SEE THE DGELSD
  ! DOCUMENTATION); NLVL DEPENDS ON THE BLOCKING SIZE SMLSIZ (TYPICALLY 25).
  NLVL = MAX(0, INT(LOG(DBLE(MIN(M,N))/26.0D0)/LOG(2.0D0)) + 1)
  ALLOCATE(IWORK(3*MIN(M,N)*NLVL + 11*MIN(M,N)))
  ALLOCATE(WORK(LWMAX))
  ! NEGATIVE RCOND MEANS USING DEFAULT (MACHINE PRECISION) VALUE
  RCOND = -1.0
  !
  ! QUERY THE OPTIMAL WORKSPACE.
  !
  LWORK = -1
  CALL DGELSD( M, N, NRHS, A, LDA, B, LDB, S, RCOND, RANK, WORK, LWORK, IWORK, INFO )
  LWORK = MIN( LWMAX, INT( WORK( 1 ) ) )
  !WRITE(*,*) 'AFTER FIRST DGELSD'
  !
  ! SOLVE THE EQUATIONS A*X = B.
  !
  CALL DGELSD( M, N, NRHS, A, LDA, B, LDB, S, RCOND, RANK, WORK, LWORK, IWORK, INFO )
  !
  ! CHECK FOR CONVERGENCE.
  !
  IF( INFO.GT.0 ) THEN
    WRITE(*,*)'THE ALGORITHM COMPUTING SVD FAILED TO CONVERGE;'
    WRITE(*,*)'THE LEAST SQUARES SOLUTION COULD NOT BE COMPUTED.'
    STOP
  END IF
  !
  WRITE(*,*)' EFFECTIVE RANK = ', RANK
  !D_CAP = B
END
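One thing I am not sure about: CPU_TIME reports processor time, which with a multithreaded MKL can be the sum over all threads, while MATLAB's tic/toc reports wall-clock time. A minimal wall-clock timing sketch using SYSTEM_CLOCK instead (C_START, C_END and C_RATE are new variables added only for this illustration) would be:
INTEGER :: C_START, C_END, C_RATE
! SYSTEM_CLOCK measures elapsed (wall-clock) time, comparable to MATLAB's tic/toc
CALL SYSTEM_CLOCK(C_START, C_RATE)
CALL CALC_MIN_NORM_D(M, N, LDA, LDB, NRHS, A, B, D_CAP)
CALL SYSTEM_CLOCK(C_END)
WRITE(*,*) 'WALL TIME =', DBLE(C_END - C_START) / DBLE(C_RATE), 'SECONDS'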
This is the build log for a successful compilation. Since I am using Visual Studio with Intel Visual Fortran, I compile my program using the Compile option in the IDE; that is, I don't have to use the command line to build and run it.
Compiling with Intel(R) Visual Fortran Compiler 17.0.2.187 [IA-32]...
ifort /nologo /debug:full /Od /warn:interfaces /module:"Debug\\" /object:"Debug\\" /Fd"Debug\vc110.pdb" /traceback /check:bounds /check:stack /libs:dll /threads /dbglibs /c /Qlocation,link,"C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\\bin" /Qm32 "D:\Google Drive\Friday_May_27_2016\Mtech Thesis Job\All new work\Fortran coding\fortran hfgmc\Console6\Console6\Console6.f90"
Linking...
Link /OUT:"Debug\Console6.exe" /INCREMENTAL:NO /NOLOGO /MANIFEST /MANIFESTFILE:"Debug\Console6.exe.intermediate.manifest" /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /DEBUG /PDB:"D:\Google Drive\Friday_May_27_2016\Mtech Thesis Job\All new work\Fortran coding\Console6\Console6\Debug\Console6.pdb" /SUBSYSTEM:CONSOLE /IMPLIB:"D:\Google Drive\Friday_May_27_2016\Mtech Thesis Job\All new work\Fortran coding\fortran hfgmc\Console6\Console6\Debug\Console6.lib" -qm32 "Debug\Console6.obj"
Embedding manifest...
mt.exe /nologo /outputresource:"D:\Google Drive\Friday_May_27_2016\Mtech Thesis Job\All new work\Fortran coding\fortran hfgmc\Console6\Console6\Debug\Console6.exe;#1" /manifest "Debug\Console6.exe.intermediate.manifest"
Console6 - 0 error(s), 0 warning(s)
Also, I have included the Intel MKL library, as shown in the figure below:
I have compiled the program and the output has been written to the file 'OUTPUT_FILE.TXT'.
IN CALC MIN NORM BEGINNING
DGELSD EXAMPLE PROGRAM RESULTS
EFFECTIVE RANK = 1500
IN CALC MIN NORM ENDING
("TIME TAKEN TO SOLVE DGELSD= 4.290028 SECONDS.")
On the other hand, MATLAB gives the following result in the command window:
min_norm_check(10,10)
Elapsed time is 0.224172 seconds.
Also, I am not trying to outperform MATLAB; it is easy and simple to use. But as the size of the problem increases, MATLAB stops responding. I have left my program running in MATLAB for more than two days and it still hasn't produced any results.

Related

Do loop to be parallelized in a Matlab's Mex function (Fortran) with OpenMP

I have a do loop in a MATLAB MEX function (written in Fortran) that performs some calculations for each element of a FEM mesh. My mesh consists of 250k elements, so I thought it was worth parallelizing. This is my first attempt at parallelizing code with OpenMP (I'm a beginner at coding). I used the reduction clause to avoid the race condition in fintk(dofele) = fintk(dofele) + fintele. Is it correct? I can compile it in MATLAB without any problem, and it produces correct results for a 12k-element mesh and is faster than the serial version, but when I try to use it for the 250k-element mesh MATLAB crashes. Thank you for helping me.
subroutine loop_over_elements( &
   ! OUT
   fintk, Sxyz, &
   ! IN
   Elemat, Bemesh, Dofelemat, u, dt, NE, NDOF)
   use omp_lib
   implicit none
   mwSize NE, NDOF, ele
   integer, parameter :: dp = selected_real_kind(15,307)
   real(dp) :: fintk(NDOF), Sxyz(6,NE), Elemat(4,NE), Bemesh(6,12,NE), Dofelemat(12,NE)
   real(dp) :: u(NDOF)
   real(dp) :: Bele(6,12), fintele(12), uele(12), si(6), dt
   integer*4 :: nodes(4), dofele(12)
   fintk = 0.D0
   !$OMP PARALLEL DO REDUCTION(+:fintk(:)) PRIVATE(ele,nodes,Bele,dofele,uele,si,fintele)
   DO ele = 1, NE
      nodes = Elemat(1:4,ele)
      Bele = Bemesh(1:6,1:12,ele)
      dofele = Dofelemat(1:12,ele)
      uele = u(dofele)
      call comput_subroutine( &
         ! IN
         Bele, uele, dt, &
         ! OUT
         si)
      Sxyz(:,ele) = si
      fintele = MATMUL(TRANSPOSE(Bele), si)
      fintk(dofele) = fintk(dofele) + fintele
   END DO
   !$OMP END PARALLEL DO
   return
end
I solved the "crashing" of MATLAB that I was experiencing by adding this line in the general mexFunction subroutine, before calling the loop_over_elements subroutine:
call KMP_SET_STACKSIZE(100000000). I figured that, since MATLAB didn't crash when I used the large model with the non-parallelized subroutine, it might be a memory problem. After that I discovered the well-known (not to me, unfortunately) segmentation-fault problem when using OpenMP with large arrays (see for example this). I'm still a bit confused about the difference between setting OMP_STACKSIZE (which I don't know how to set from a MEX function) and calling KMP_SET_STACKSIZE, but now the parallelized code works with the large model.
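For reference, this is roughly where the call sits in my gateway routine (everything else stripped out; the pointer handling is omitted and the 100 MB value is just what happened to work for my mesh):
#include "fintrf.h"
subroutine mexFunction(nlhs, plhs, nrhs, prhs)
   implicit none
   integer nlhs, nrhs
   mwPointer plhs(*), prhs(*)
   ! Enlarge the stack handed to each OpenMP worker thread *before* the first
   ! parallel region is created; the default was too small for the private
   ! arrays of the 250k-element mesh.
   call KMP_SET_STACKSIZE(100000000)
   ! ... extract the inputs from prhs and call loop_over_elements here ...
end subroutine mexFunction
As far as I understand it, OMP_STACKSIZE is an environment variable that the OpenMP runtime reads once at start-up (awkward to control from inside MATLAB), while KMP_SET_STACKSIZE is Intel's runtime routine that sets the same worker-thread stack size programmatically, provided it is called before the first parallel region.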

why does a*b*a take longer than (a'*(a*b)')' when using gpuArray in Matlab scripts?

The code below performs the same operation on gpuArrays a and b in two different ways. The first part computes (a'*(a*b)')', while the second part computes a*b*a. The results are then verified to be the same.
%function test
clear
rng('default');rng(1);
a=sprand(3000,3000,0.1);
b=rand(3000,3000);
a=gpuArray(a);
b=gpuArray(b);
tic;
c1=gather(transpose(transpose(a)*transpose(a*b)));
disp(['time for (a''*(a*b)'')'': ' , num2str(toc),'s'])
clearvars -except c1
rng('default');
rng(1)
a=sprand(3000,3000,0.1);
b=rand(3000,3000);
a=gpuArray(a);
b=gpuArray(b);
tic;
c2=gather(a*b*a);
disp(['time for a*b*a: ' , num2str(toc),'s'])
disp(['error = ',num2str(max(max(abs(c1-c2))))])
%end
However, computing (a'*(a*b)')' is roughly 4 times faster than computing a*b*a. Here is the output of the above script in R2018a on an Nvidia K20 (I've tried different versions and different GPUs with similar behaviour).
>> test
time for (a'*(a*b)')': 0.43234s
time for a*b*a: 1.7175s
error = 2.0009e-11
Even more strangely, if the first and last lines of the above script are uncommented (to turn it into a function), then both take the longer amount of time (~1.7s instead of ~0.4s). Below is the output for this case:
>> test
time for (a'*(a*b)')': 1.717s
time for a*b*a: 1.7153s
error = 1.0914e-11
I'd like to know what is causing this behaviour, and how to perform a*b*a or (a'*(a*b)')' or both in the shorter amount of time (i.e. ~0.4s rather than ~1.7s) inside a MATLAB function rather than inside a script.
There seems to be an issue with multiplying two sparse matrices on the GPU: sparse-by-full multiplication is more than 1000 times faster than sparse-by-sparse. A simple example:
str={'sparse*sparse','sparse*full'};
for ii=1:2
rng(1);
a=sprand(3000,3000,0.1);
b=sprand(3000,3000,0.1);
if ii==2
b=full(b);
end
a=gpuArray(a);
b=gpuArray(b);
tic
c=a*b;
disp(['time for ',str{ii},': ' , num2str(toc),'s'])
end
In your context, it is the last multiplication which does it. To demonstrate, I replace the final a with a duplicate c and multiply by it twice, once as a sparse and once as a full matrix.
str={'a*b*a','a*b*full(a)'};
for ii=1:2
%rng('default');
rng(1)
a=sprand(3000,3000,0.1);
b=rand(3000,3000);
rng(1)
c=sprand(3000,3000,0.1);
if ii==2
c=full(c);
end
a=gpuArray(a);
b=gpuArray(b);
c=gpuArray(c);
tic;
c1{ii}=a*b*c;
disp(['time for ',str{ii},': ' , num2str(toc),'s'])
end
disp(['error = ',num2str(max(max(abs(c1{1}-c1{2}))))])
I may be wrong, but my conclusion is that a*b*a involves a multiplication of two sparse matrices (a and a again), which is not handled well, while the transpose() approach splits the process into two multiplications, neither of which involves two sparse matrices.
I got in touch with MathWorks tech support and Rylan finally shed some light on this issue. (Thanks Rylan!) His full response is below. The function vs script issue appears to be related to certain optimizations MATLAB applies automatically to functions (but not scripts) not working as expected.
Rylan's response:
Thank you for your patience on this issue. I have consulted with the MATLAB GPU computing developers to understand this better.
This issue is caused by internal optimizations done by MATLAB when encountering some specific operations like matrix-matrix multiplication and transpose. Some of these optimizations may be enabled specifically when executing a MATLAB function (or anonymous function) rather than a script.
When your initial code was being executed from a script, a particular matrix transpose optimization is not performed, which results in the 'res2' expression being faster than the 'res1' expression:
n = 2000;
a=gpuArray(sprand(n,n,0.01));
b=gpuArray(rand(n));
tic;res1=a*b*a;wait(gpuDevice);toc % Elapsed time is 0.884099 seconds.
tic;res2=transpose(transpose(a)*transpose(a*b));wait(gpuDevice);toc % Elapsed time is 0.068855 seconds.
However when the above code is placed in a MATLAB function file, an additional matrix transpose-times optimization is done which causes the 'res2' expression to go through a different code path (and different CUDA library function call) compared to the same line being called from a script. Therefore this optimization generates slower results for the 'res2' line when called from a function file.
To avoid this issue from occurring in a function file, the transpose and multiply operations would need to be split in a manner that stops MATLAB from applying this optimization. Separating each clause within the 'res2' statement seems to be sufficient for this:
tic;i1=transpose(a);i2=transpose(a*b);res3=transpose(i1*i2);wait(gpuDevice);toc % Elapsed time is 0.066446 seconds.
In the above line, 'res3' is being generated from two intermediate matrices: 'i1' and 'i2'. The performance (on my system) seems to be on par with that of the 'res2' expression when executed from a script; in addition the 'res3' expression also shows similar performance when executed from a MATLAB function file. Note however that additional memory may be used to store the transposed copy of the initial array. Please let me know if you see different performance behavior on your system, and I can investigate this further.
Additionally, the 'res3' operation shows faster performance when measured with the 'gputimeit' function too. Please refer to the attached 'testscript2.m' file for more information on this. I have also attached 'test_v2.m' which is a modification of the 'test.m' function in your Stack Overflow post.
Thank you for reporting this issue to me. I would like to apologize for any inconvenience caused by this issue. I have created an internal bug report to notify the MATLAB developers about this behavior. They may provide a fix for this in a future release of MATLAB.
Since you had an additional question about comparing the performance of GPU code using 'gputimeit' vs. using 'tic' and 'toc', I just wanted to provide one suggestion which the MATLAB GPU computing developers had mentioned earlier. It is generally good to also call 'wait(gpuDevice)' before the 'tic' statements to ensure that GPU operations from the previous lines don't overlap in the measurement for the next line. For example, in the following lines:
b=gpuArray(rand(n));
tic; res1=a*b*a; wait(gpuDevice); toc
if the 'wait(gpuDevice)' is not called before the 'tic', some of the time taken to construct the 'b' array from the previous line may overlap and get counted in the time taken to execute the 'res1' expression. This would be preferred instead:
b=gpuArray(rand(n));
wait(gpuDevice); tic; res1=a*b*a; wait(gpuDevice); toc
Apart from this, I am not seeing any specific issues in the way that you are using the 'tic' and 'toc' functions. However note that using 'gputimeit' is generally recommended over using 'tic' and 'toc' directly for GPU-related profiling.
I will go ahead and close this case for now, but please let me know if you have any further questions about this.
%testscript2.m
n = 2000;
a = gpuArray(sprand(n, n, 0.01));
b = gpuArray(rand(n));
gputimeit(@()transpose_mult_fun(a, b))
gputimeit(@()transpose_mult_fun_2(a, b))
function out = transpose_mult_fun(in1, in2)
i1 = transpose(in1);
i2 = transpose(in1*in2);
out = transpose(i1*i2);
end
function out = transpose_mult_fun_2(in1, in2)
out = transpose(transpose(in1)*transpose(in1*in2));
end
And the modified test_v2.m:
function test_v2
clear
%% transposed expression
n = 2000;
rng('default');rng(1);
a = sprand(n, n, 0.1);
b = rand(n, n);
a = gpuArray(a);
b = gpuArray(b);
tic;
c1 = gather(transpose( transpose(a) * transpose(a * b) ));
disp(['time for (a''*(a*b)'')'': ' , num2str(toc),'s'])
clearvars -except c1
%% non-transposed expression
rng('default');
rng(1)
n = 2000;
a = sprand(n, n, 0.1);
b = rand(n, n);
a = gpuArray(a);
b = gpuArray(b);
tic;
c2 = gather(a * b * a);
disp(['time for a*b*a: ' , num2str(toc),'s'])
disp(['error = ',num2str(max(max(abs(c1-c2))))])
%% sliced equivalent
rng('default');
rng(1)
n = 2000;
a = sprand(n, n, 0.1);
b = rand(n, n);
a = gpuArray(a);
b = gpuArray(b);
tic;
intermediate1 = transpose(a);
intermediate2 = transpose(a * b);
c3 = gather(transpose( intermediate1 * intermediate2 ));
disp(['time for split equivalent: ' , num2str(toc),'s'])
disp(['error = ',num2str(max(max(abs(c1-c3))))])
end
EDIT 2: I might have been right; see this other answer.
EDIT: They use MAGMA, which is column-major. My answer does not hold; however, I will leave it here for a while in case it can help crack this strange behavior.
The below answer is wrong
This is my guess; I cannot tell you 100% without knowing the code under MATLAB's hood.
Hypothesis: MATLAB's parallel computing code uses CUDA libraries, not their own.
Important information
MATLAB is column major and CUDA is row major.
There is no such thing as a 2D matrix, only a 1D array with 2 indices
Why does this matter? Well because CUDA is highly optimized code that uses memory structure to maximize cache hits per kernel (the slowest operation on GPUs is reading memory). This means a standard CUDA matrix multiplication code will exploit the order of memory reads to make sure they are adjacent. However, what is adjacent memory in row-major is not in column-major.
So, there are 2 solutions to this as someone writing software
Write your own column-major algebra libraries in CUDA
Take every input/output from MATLAB and transpose it (i.e. convert from column-major to row major)
They have done point 2, and assuming that there is a smart JIT compiler for the MATLAB parallel processing toolbox (a reasonable assumption), in the second case it takes a and b, transposes them, does the maths, and transposes the output when you gather.
In the first case, however, you do not need to transpose the output, as it is internally already transposed and the JIT catches this, so instead of calling gather(transpose( XX )) it just skips the output transposition inside. The same goes for transpose(a*b). Note that transpose(a*b) = transpose(b)*transpose(a), so suddenly no transposes are needed (they are all internally skipped). A transposition is a costly operation.
Indeed there is a weird thing here: making the code a function suddenly makes it slow. My best guess is that, because the JIT behaves differently in different situations, it doesn't catch all this transpose stuff and just does all the operations anyway, losing the speed-up.
Interesting observation: it takes the same time on the CPU as on the GPU to do a*b*a on my PC.

Julia vs. Matlab benchmarking eigenvector calculations

I'm a new Julia user and I need to find eigenvectors of large matrices as quickly as possible*. I'm having trouble getting Julia to run as fast as Matlab for the following example:
Julia
const j = 1000 ::Int
A = Array{Float64}(j,j)
B = Array{Float64}(j,j)
f(x) = eigvecs(x)
A = randn(j,j)
B = f(A)
@time f(A)
output for time: 2.950973 seconds (12.31 k allocations: 76.445 MB, 0.11% gc time)
Matlab
j = 1000;
A = randn(j,j);
tic
[v, d] = eig(A);
toc
Elapsed time is 1.161133 seconds.
I have also checked MATLAB restricted to 1 thread (using maxNumCompThreads = 1), but it still gives a similar time (1.16 s) as before. I've also tried to speed up Julia by running the function twice to precompile it, and by setting blas_set_num_threads(4), but this isn't helping.
I'd really appreciate any advice about how to improve my Julia code!
*(I am using Matlab 2015b and Julia 0.4.7 on OSX El Capitan 10.11.6)
Kind of a duplicate of this discussion.
Usually when talking about Julia performance, you're talking about how the language itself works. In this case, both Julia and MATLAB are just calling well-optimized C/Fortran libraries for the eigenvalue calculation, so this is reliant on the BLAS/LAPACK configuration. MATLAB ships with a version of MKL, so it's simply using a different library, which in many cases is faster than OpenBLAS. You can build Julia with MKL using the instructions in the README on the Julia GitHub repo. Maybe rebuilding your sysimg could help:
include(joinpath(dirname(JULIA_HOME),"share","julia","build_sysimg.jl")); build_sysimg(force=true)
If you are using a pre-built binary then it's not optimized for your system, and this will enable the optimizations.

Function returns different answers with same arguments

I'm transitioning from MATLAB to Fortran and encountering all sorts of weird behaviors I'd never expect from MATLAB. Here's one that's got me puzzled:
Program pruebanormal
  double precision :: g01eaf, x, y
  character :: T*1
  integer :: Iffail
  Iffail = 0
  T = 'L'
  x = 0.0
  y = g01eaf('L',0.0,Iffail)
  write(*,*) 'y = ',y
end program pruebanormal
I have this fairly simple program in which I'm trying to find the lower-tail (cumulative) probability at x = 0 of a standard N(0,1) variable (which should be 0.5). g01eaf() is the NAG library function that does this for me. I'm using gfortran to compile.
Leaving the rest of the program unchanged, depending on how I write the arguments in g01eaf(), I get different answers:
a) g01eaf(T,x,Iffail)
b) g01eaf(T,0.0,Iffail)
c) g01eaf(T,x,0)
Now, under MATLAB, I would get the same (correct) answer either way: y = 0.500000. Under Fortran, however, I get:
a) y = 0.500000
b) y = 1.000000
c) Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0xB766C163
#1 0xB766C800
#2 0xB77763FF
#3 0x804982E in g01eafn_
Violación de segmento (`core' generado) [segmentation fault (core dumped)]
I have no words for the answer in (b) and no clue what (c) even means.
The very quick answer for the "wrong result" is that in
y = g01eaf('L',0.0,Iffail)
you are passing a different type of real variable than in
double precision x
x = 0.0 ! x is still double precision.
y = g01eaf('L',x,Iffail)
The function g01eaf probably expects double precision: you should very carefully read NAG's documentation.
y = g01eaf('L', 0d0, Iffail)
Now to elaborate.
You don't want these problems to come about too often. You want to ensure an interface is available for the function call to g01eaf. Your compiler would then complain about passing a real of default kind to the function.
Assuming you have an up-to-date version of the library, you want to do something like
use nag_library, only : g01eaf, nag_wp
implicit none
integer Iffail
real(kind=nag_wp) y
y = g01eaf('L', 0._nag_wp, Iffail)
end
Again, see the documentation. Both for the library, and for meaning of modules, etc.
For older versions, one should still have a module available, but it may be called something different, and nag_wp may not be defined (meaning you must carefully choose kinds).
The module will also lead to a complaint that Iffail must be definable, so it must be a variable, not the literal 0. This explains (c).
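If no vendor module is available at all, you can still write the interface block yourself. A sketch (check the exact argument list and kinds against the NAG documentation for your Mark of the library):
program pruebanormal
  implicit none
  interface
    function g01eaf(tail, x, ifail)
      character(1), intent(in)     :: tail
      double precision, intent(in) :: x
      integer, intent(inout)       :: ifail
      double precision             :: g01eaf
    end function g01eaf
  end interface
  integer :: Iffail
  double precision :: y
  Iffail = 0
  ! The single precision literal 0.0, or a literal 0 for Iffail, would now be
  ! rejected at compile time instead of silently giving wrong answers.
  y = g01eaf('L', 0.0d0, Iffail)
  write(*,*) 'y = ', y
end program pruebanormal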

Does Fortran have inherent limitations on numerical accuracy compared to other languages?

While working on a simple programming exercise, I produced a while loop (DO loop in Fortran) that was meant to exit when a real variable had reached a precise value.
I noticed that due to the precision being used, the equality was never met and the loop became infinite. This is, of course, not unheard of, and one is advised that, rather than comparing two numbers for equality, it is best to check whether the absolute difference between them is less than a set threshold.
What I found disappointing was how loose I had to set this threshold, even with variables at double precision, for my loop to exit properly. Furthermore, when I rewrote a "distilled" version of this loop in Perl, I had no problems with numerical accuracy and the loop exited fine.
Since the code to produce the problem is so small, in both Perl and Fortran, I'd like to reproduce it here in case I am glossing over an important detail:
Fortran Code
PROGRAM precision_test
  IMPLICIT NONE
  ! Data Dictionary
  INTEGER :: count = 0 ! Number of times the loop has iterated
  REAL(KIND=8) :: velocity
  REAL(KIND=8), PARAMETER :: MACH_2_METERS_PER_SEC = 340.0
  velocity = 0.5 * MACH_2_METERS_PER_SEC ! Initial Velocity
  DO
    WRITE (*, 300) velocity
    300 FORMAT (F20.8)
    IF (count == 50) EXIT
    IF (velocity == 5.0 * MACH_2_METERS_PER_SEC) EXIT
    ! IF (abs(velocity - (5.0 * MACH_2_METERS_PER_SEC)) < 1E-4) EXIT
    velocity = velocity + 0.1 * MACH_2_METERS_PER_SEC
    count = count + 1
  END DO
END PROGRAM precision_test
Perl Code
#! /usr/bin/perl -w
use strict;
my $mach_2_meters_per_sec = 340.0;
my $velocity = 0.5 * $mach_2_meters_per_sec;
while (1) {
printf "%20.8f\n", $velocity;
exit if ($velocity == 5.0 * $mach_2_meters_per_sec);
$velocity = $velocity + 0.1 * $mach_2_meters_per_sec;
}
The commented-out line in Fortran is what I would need to use for the loop to exit normally. Notice that the threshold is set to 1E-4, which I feel is quite pathetic.
The names of the variables come from the self-study-based programming exercise I was performing and don't have any relevance.
The intent is that the loop stops when the velocity variable reaches 1700.
Here are the truncated outputs:
Perl Output
170.00000000
204.00000000
238.00000000
272.00000000
306.00000000
340.00000000
...
1564.00000000
1598.00000000
1632.00000000
1666.00000000
1700.00000000
Fortran Output
170.00000000
204.00000051
238.00000101
272.00000152
306.00000203
340.00000253
...
1564.00002077
1598.00002128
1632.00002179
1666.00002229
1700.00002280
What good is Fortran's speed and ease of parallelization if its accuracy stinks? Reminds me of the three ways to do things:
The Right Way
The Wrong Way
The Max Power Way
"Isn't that just the wrong way?"
"Yeah! But faster!"
All kidding aside, I must be doing something wrong.
Does Fortran have inherent limitations on numerical accuracy compared to other languages, or am I (quite likely) the one at fault?
My compiler is gfortran (gcc version 4.1.2), Perl v5.12.1, on a Dual Core AMD Opteron @ 1 GHz.
Your assignment is accidentally converting the value to single precision and then back to double.
Try making your 0.1 * be 0.1D0 * and you should see your problem fixed.
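In other words, with the constants written as double precision throughout, something like this (a trimmed-down sketch of the same loop):
PROGRAM precision_test_fixed
  IMPLICIT NONE
  INTEGER :: count = 0
  ! The D0 suffix keeps every constant in double precision
  REAL(KIND=8), PARAMETER :: MACH_2_METERS_PER_SEC = 340.0D0
  REAL(KIND=8) :: velocity
  velocity = 0.5D0 * MACH_2_METERS_PER_SEC
  DO
    WRITE (*, '(F20.8)') velocity
    IF (count == 50) EXIT
    IF (velocity == 5.0D0 * MACH_2_METERS_PER_SEC) EXIT
    velocity = velocity + 0.1D0 * MACH_2_METERS_PER_SEC
    count = count + 1
  END DO
END PROGRAM precision_test_fixed
With these particular values, 0.1D0 * 340.0D0 happens to round to exactly 34.0, so the repeated addition stays exact and the equality test at 1700 is met, just as in the Perl version.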
As already answered, "plain" floating point constants in Fortran will default to the default real type, which will likely be single-precision. This is an almost classic mistake.
Also, using "kind=8" is not portable -- it will give you double precision with gfortran, but not with some other compilers. The safe, portable way to specify precisions for both variables and constants in Fortran >= 90 is to use the intrinsic functions, and request the precision that you need. Then specify "kinds" on the constants where precision is important. A convenient method is to define your own symbols. For example:
integer, parameter :: DR_K = selected_real_kind (14)
REAL(DR_K), PARAMETER :: MACH_2_METERS_PER_SEC = 340.0_DR_K
real (DR_K) :: mass, velocity, energy
energy = 0.5_DR_K * mass * velocity**2
This can also be important for integers, e.g., if large values are needed. For related questions for integers, see Fortran: integer*4 vs integer(4) vs integer(kind=4) and Long ints in Fortran
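As a small illustration of the integer case, one can request a kind by the range needed (a sketch using selected_int_kind):
program big_int_demo
  implicit none
  ! Ask for an integer kind with at least 12 decimal digits (64-bit on most systems)
  integer, parameter :: LI_K = selected_int_kind(12)
  integer(LI_K) :: n
  n = 10_LI_K ** 10   ! 10**10 would overflow a default 32-bit integer
  print *, n
end program big_int_demo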