Make double precision the default in g77 (GNU Fortran 77 compiler)

Is there an analog of the "-fdefault-real-8" gfortran (the GNU Fortran 95 compiler) option in g77 (the GNU Fortran 77 compiler)? This option sets the default real type to an 8-byte wide type.
I currently have code where single-precision arithmetic is limiting my accuracy, and so I need double-precision. (It's not just intermediate values that I want to be in double-precision, which is an FPU flag; I want everything to be in double-precision.) I know that I have some other approaches (using gfortran, using other compilers, or changing all REALs to DOUBLE PRECISIONs), but they're not ideal for my situation.
So, is there any way to set the default real type to be double precision, namely 8 bytes wide, in g77?

If you can't find a flag in the man pages, you might try a preprocessor macro (name your source file with a .F extension so that g77 runs it through the C preprocessor first). Be warned that this is a blunt instrument: it will also rewrite any use of the REAL intrinsic.
#define REAL DOUBLE PRECISION

Since most FORTRAN 77 code is still legal in Fortran 95, is it possible to compile your FORTRAN 77 code with gfortran and supply the -fdefault-real-8 option?

Related

Avoiding long integers (32 bit) in Simulink-generated code

I'm working with some signal-processing code (C language) generated by Matlab Simulink, targeting a DSP with 24-bit integers. The code that Simulink generated relies on the existence of 32-bit integers and uses them in calculations, only truncating the higher-order bits into the 24-bit result at the end. Unfortunately, the compiler for this architecture targets a limited subset of C and doesn't currently support 32-bit longs; short/int/long are all the same 24-bit integer.
We've tried specifying the bit-widths of the integer types for the custom processor target as 24 bits, however this gave errors and the documentation for hardware targets appears to confirm that this is not permitted (3rd bullet):
The Number of bits parameters describe the native word size of the microprocessor and the bit lengths of char, short, int, and long data. For code generation to succeed:
The bit lengths must be such that char <= short <= int <= long.
Bit lengths must be multiples of 8, with a maximum of 32.
The bit length for long data must not be less than 32.
However, as a Simulink neophyte, it's quite possible I'm looking in the wrong places. Is it in fact possible to have Simulink target a device with only 24-bit integers?

Representing double values in Kaitai

Some of the values I need to read in my ksy file are doubles, which I assume means a binary64 structure. The native data types for a float won't stretch that far. Has anyone managed to represent this datatype in Kaitai?
"binary64" is a normal IEEE 754 double-precision float, occupying 64 bits = 8 bytes.
It is supported by the vast majority of languages, and Kaitai Struct accordingly offers built-in support for it as type: f8 (float, 8 bytes long).
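A minimal .ksy sketch using the built-in f8 type (the format id and field name here are made up for illustration):

```yaml
meta:
  id: double_example   # hypothetical format name
  endian: le           # little-endian byte order
seq:
  - id: my_value       # hypothetical field: one IEEE 754 binary64
    type: f8           # 8-byte double-precision float
```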
If you're instead interested in larger floating-point values (binary128 or binary256, i.e. quad or octuple precision), there is no built-in support for them in KS due to the lack of standard support for these types in most target languages. If you want something like that, the recommended approach is to implement one as an opaque type in the target language of your choice. That will likely require bringing in an external library that implements the type using some kind of software emulation, as hardware support is almost non-existent in commodity CPUs (like Intel or ARM) as of 2020.
For more details on these, see issue #101.

Why is there no fused multiply-add for general-purpose registers on x86_64 CPUs?

On Intel and AMD x86_64 processors, SIMD vectorized registers have specific fused-multiply-add capabilities, but general-purpose (scalar, integer) registers don't - you basically need to multiply, then add (unless you can fit things into an lea).
Why is that? I mean, is it so useless as to not be worth the overhead?
Integer multiplication is common, but not among the most frequent integer operations. With floating-point numbers, however, multiply-then-add is used all the time, and FMA provides major speedups for lots of FP ALU-bound code.
Also, floating point actually avoids precision loss with an FMA (the x*y internal temporary isn't rounded off at all before adding). This is why the ISO C99 / C++ fma() math library function exists, and why it's slow to implement without hardware FMA support.
Integer FMA (or multiply-accumulate, aka MAC) doesn't have any precision benefit vs. separate multiply and add.
Some non-x86 ISAs do provide integer FMA. It's not useless, but Intel and AMD both haven't bothered to include it until AVX512-IFMA (and that's still only for SIMD, basically exposing the 52-bit mantissa multiplier circuits needed for double-precision FMA/vmulpd for use by integer instructions).
Non-x86 examples include:
MIPS32: madd / maddu (unsigned), which multiply-accumulate into the hi / lo registers (the special registers used as a destination by regular multiply and divide instructions).
ARM smlal and friends (32x32=>64 bit MAC, or 16x16=>32 bit), also available for unsigned integer. Operands are regular R0..R15 general purpose registers.
An integer register FMA would be useful on x86, but uops that have 3 integer inputs are rare. CMOV and ADC have 3 inputs, but one of those is flags. Even then, they didn't decode to a single uop on Intel until Broadwell, after 3-input uop support was added for FP FMA in Haswell.
Haswell and later can track fused-domain uops with 3 integer inputs, though, for (some) micro-fused instructions with indexed addressing modes. Sandybridge/Ivybridge un-laminate instructions like add eax, [rdx+rcx]. (But Nehalem could keep them micro-fused, like Haswell; SnB simplified the fused-domain uop format). Anyway, that's fused domain, not in the scheduler. Only Broadwell/Skylake can track 3-input integer uops in the scheduler, and that's only for 2 integer + flags, not 3 integer registers.
Intel does use a "unified" scheduler, where FP and integer ops use the same scheduler, and it can track proper 3-input FP FMA. So IDK if there's a technical obstacle. If not, IDK why Intel didn't include integer FMA as part of BMI2 or something, which added stuff like mulx (2-input 2-output mul with mostly explicit operands, unlike legacy mul that uses rdx:rax.)
SSE2/SSSE3 does have integer mul-add instructions for vector registers, but only horizontal add after widening 16x16 => 32-bit (SSE2 pmaddwd) or (unsigned)8x(signed)8=>16-bit (SSSE3 pmaddubsw).
But those are only 2-input instructions, so even though there's a multiply and an add, it's very different from FMA.
Footnote: The question title originally said there was no FMA "for scalars". There is scalar FP FMA with the same FMA3 extension that added the packed versions of these: VFMADD231SD and friends operate on scalar double-precision, and the same flavours of vfmaddXXXss are available for scalar float in XMM registers.

double_t in C99

I just read that C99 has double_t, which should be at least as wide as double. Does this imply that it gives more precision digits after the decimal place? More than the usual 15 digits for double?
Secondly, how do I use it? Is including
#include <float.h>
enough? I read that one has to set FLT_EVAL_METHOD to 2 for long double. How do I do this? As I work with numerical methods, I would like maximum precision without using an arbitrary-precision library.
Thanks a lot...
No. double_t is at least as wide as double; i.e., it might be the same as double. Footnote 190 in the C99 standard makes the intent clear:
The types float_t and double_t are intended to be the implementation's most efficient types at least as wide as float and double, respectively.
As Michael Burr noted, you can't set FLT_EVAL_METHOD.
If you want the widest floating-point type on any system available using only C99, use long double. Just be aware that on some platforms it will be the same as double (and could even be the same as float).
Also, if you "work with numerical methods", you should be aware that for many (even most) numerical methods, the approximation error of the method is vastly larger than the rounding error of double precision, so there's often no benefit to using wider types. Exceptions exist, of course. What type of numerical methods are you working on, specifically?
Edit: seriously, either (a) just use long double and call it a day or (b) take a few weeks to learn about how floating-point is actually implemented on the platforms that you're targeting, and what the actual accuracy requirements are for the algorithms that you're implementing.
Note that you don't get to set FLT_EVAL_METHOD - it is set by the compiler's headers to let you determine how the library does certain things with floating point.
If your code is very sensitive to exactly how floating point operations are performed, you can use the value of that macro to conditionally compile code to handle those differences that might be important to you.
So for example, in general you know that double_t will be at least a double in all cases. If you want your code to do something different if double_t is a long double then your code can test if FLT_EVAL_METHOD == 2 and act accordingly.
Note that if FLT_EVAL_METHOD is something other than 0, 1, or 2 you'll need to look at the compiler's documentation to know exactly what type double_t is.
double_t may be defined by typedef double double_t; — of course, if you plan to rely on implementation specifics, you need to look at your own implementation.

double precision in Ada?

I'm very new to Ada and was trying to see if it offers a double-precision type. I see that we have Float, and
Put( Integer'Image( Float'digits ) );
on my machine gives a value of 6, which is not enough precision for my numerical computations.
Does Ada have double and long double types as in C?
Thanks a lot...
It is a wee bit more complicated than that.
The only predefined floating-point type that compilers have to support is Float. Compilers may optionally support Short_Float and Long_Float. You should be able to look in appendix F of your compiler documentation to see what it supports.
In practice, your compiler almost certainly defines Float as a 32-bit IEEE float and Long_Float as a 64-bit one. Note that C pretty much works this way too with its float and double; C doesn't actually define the sizes of those either.
If you absolutely must have a certain precision (e.g. you are sharing the data with something external that must use IEEE 64-bit), then you should probably define your own floating-point type with exactly that precision. That ensures your code is either portable to any platform or compiler you move it to, or produces a compile-time error so you can fix the issue.
You can create a floating-point type with any precision you like. For a double-sized type it would be:
type My_Long_Float is digits 11;
Wikibooks is a good reference for things like this.