Double vs Float80 speed in Swift

Double vs Float80 speed in Swift - swift

I've heard that the x87 FPU works with 80-bit floats, so even if I want to calculate using 64-bit numbers, it would calculate it with 80-bit and then convert it. But which is fastest in Swift on x86-64, Double or Float80 (when calculating arithmetic)?

While it's true that the x87 FPU operates internally at 80-bit "extended" precision (at least, by default; this is customizable, and in fact 32-bit builds following the macOS ABI set 64-bit internal precision), binaries targeting x86-64 no longer use x87 FPU instructions. All x86 chips that implement the 64-bit long mode extension also support SSE2 (in fact, this was required by the AMD64 specification), so a 64-bit binary can always assume SSE2 support. As such, this is what is used to implement floating-point operations, because it's much more efficient and easier to optimize around for a compiler.
Even 32-bit builds in the modern era assume SSE2 as a minimum, and certainly on the Macintosh platform, since SSE2 was introduced with the Pentium 4, which predated the Macintosh platform's switch to Intel x86 chips. All x86 chips ever used in an Apple machine support SSE2.
So no, you aren't going to see any performance improvement by using an 80-bit extended precision type. You weren't going to see any performance improvement from x87 instructions, even if they were generated by the compiler. And you certainly aren't going to see any performance improvement on x86-64, because SSE2 supports a maximum of 64-bit precision in hardware. Any 80-bit precision operations are going to have to be implemented in software, or force a smart compiler to emit x87 instructions, which means you don't benefit from any of the nice features and tangible performance improvements of SSE2.

Double will almost always[1] be at least as fast on Float80 on modern Intel processors, in almost any language. There are some situations in which it will be significantly faster:
Double uses less memory; it's possible for an algorithm's working set to fit in cache when using Double, but fail to fit when using Float80, causing significant performance hazards.
Double can take advantage for FMA instructions (exposed in Swift as .add[ing]Product(x,y) and the fma() free function), which effectively doubles the attainable floating-point throughput on recent cores.
Double can be auto-vectorized by the compiler. There are no vector instructions on Float80. When possible, this can give you up to a 4x speedup.
Math functions like sin, cos, pow, etc. are faster on Double than they are on Float80.
There are some other reasons to use Double: it's portable to non-x86 hardware, whereas Float80 is not, and interoperability with C interfaces is easier with Double than it is with Float80. You should only use Float80 when necessary, and default to using Double otherwise.
[1] There are a few niche cases where Float80 can be faster--if an algorithm repeatedly underflows in Double, but remains in normal range in Float80, for example. These are rare, and usually not worth worrying about; more commonly your algorithm will also underflow in Float80, just do it a few iterations later.

Related

Orthogonality of Instruction Set Architecture

I am studying the difference between CISC and RISC recently, and I've encountered into the term "Orthogonality". After doing some research, my understanding so far is that there are two "axes", addressing modes & operations, which are independent of each other, so it produces a maximum number of (#addressing modes * #operations) instructions in the ISA.
For CISC machine, which is a register-memory architecture, operands may come from register or memory and RISC a register-register(or load-store) one on the contrary.
So, what's the role of orthogonality playing in these two ISA? Is CISC more orthogonal than RISC or vice versa?
As the wiki describes, "Modern CPUs often simulate orthogonality in a preprocessing step before performing the actual tasks in a RISC-like core. This "simulated orthogonality" in general is a broader concept, encompassing the notions of decoupling and completeness in function libraries, like in the mathematical concept: an orthogonal function set is easy to use as a basis into expanded functions, ensuring that parts don’t affect another if we change one part." What does this paragraph mean? What is the preprocessing step, does it have anything to do with the microcode?
Any explanation are appreciated! Thanks a lot!

Maximizing total choices of possible instructions like a CISC is generally not what's meant. Instead it's more about being a simpler compiler target, without complex interactions in what makes an instruction legal or not. RISC machines are often highly orthogonal, and designed with being a compiler target in mind, not human programmers.
My understanding of the term is that orthogonality is more about any register being usable in any case where any other register is usable. Unlike x86 shl reg, cl where variable-count shifts require a specific register. (I know this is a RISC-V question, but the examples of non-orthogonality I know of come from other ISAs, primarily x86.)
And definitely not like 8086 (before 386), where if you needed to multiply, one of the operands had to be in the accumulator, AL or AX. And sign-extension was also only available there. 386 introduced movsx reg, r/m8 and r/m16. (And movzx, allowing easy and more efficient zero-extending of a byte from memory into SI or DI, without having to load 2 bytes and and si, 0x00ff.)
Even worse, 16-bit addressing modes only allow a few registers in very limited ways: [bp|bx] + [si|di] + disp0/8/16, vs. 32-bit addressing modes allowing stuff like lea eax, [ecx + ecx + 3] to use the same register twice, or address memory relative to the stack pointer without having to copy it to the base pointer (BP) register.
Or if some memory operands can use a certain addressing mode, can all memory operands use it? AArch64 ldp/stp (load-pair/store-pair of registers) I think has fewer available addressing modes than single-register loads, because it needs 5 extra bits for a second register number. Unlike ARM32 ldrd where the pair of registers is two contiguous registers, starting with an even number.
In general, the less interaction there is between a choice of one thing (like instruction) and the possible choices for another (a register), the more orthogonal.
One of the major benefits with this is being a simple compiler target. The most optimal code can more often be found with a greedy algorithm that only takes into a account one thing at a time, not interlocking tradeoffs. Not like x86-64 "if I use ECX instead of R9d for this variable, that'll save bytes in multiple instructions not needing REX prefix, but later mean I need an extra mov to copy a register for a shift count". (x86 BMI2 introduced variable-count shifts that can use a count from any register, like shlx ebx, eax, r15d)
Or far worse targeting 8086 or 286, where 16-bit addressing modes impose a lot more constraints on register allocation. And you'd more often you'd want to use instructions that needed their operands in specific registers, especially the accumulator.
But if you're not worried about every byte of code size, x86-64 is a fairly orthogonal ISA, usually you don't need to care about which register you use for what. One change in that direction beyond 386's important changes was making the low byte of every register addressable, like bpl, spl, sil, dil as the low bytes of RBP, RSP, RSI, RDI. (But those require REX prefixes, overlapping encodings with AH/CH/DH/BH which are only usable in instructions without REX prefixes.)
Another example of non-orthogonality is x86's notorious integer SIMD extensions, MMX and SSE2. Want to do minimum of unsigned integers 16 bytes at a time? In SSE2 we have pminub for unsigned byte elements. And pminsw, signed 16-bit elements. But no other combination of size and signedness until SSE4.1, several years later, which filled in the gaps allowing signed bytes and u16, as well as i32 and u32. And then AVX-512 added i64 and u64. Every min available always had a corresponding max, but other than that, SSE2 was highly non-orthogonal in that and many other ways, including signed/unsigned saturating add/sub, and pack of wider to narrower elements with signed or unsigned saturation. And FP vs. integer shuffles, e.g. there's no integer equivalent to shufps that takes two elements from one vector, two from another, using an immediate control operand. Fortunately for shuffles you can use FP shuffles on integer data.
x86 SIMD is still not very orthogonal in many ways, for example in integer multiply where not all combinations of element size are available for everything; 16-bit has 16x16 => 16-bit low half, signed high half, or unsigned high half. (And a widening multiply and horizontal-add, pmaddwd). 32-bit has signed and unsigned widening 32x32 => 64-bit, and with SSE4.1 also non-widening. 8-bit only has a multiply and horizontal-add where one operand is treated as signed, the other as unsigned.
Again, if I'm picking on x86 a lot, it's because it's what I know. And Intel painted a huge "kick me" sign on their back when they designed MMX and SSE2, only taking some steps to fix things later with SSE4.1. (I'm sure there are reasons for some of those choices, including transistor budget and opcode coding-space in x86's notoriously cramped machine-code.) But a lot of programs don't want to assume SSE4.1 as a requirement to run at all, even now, over a decade since the first SSE4.1 CPUs. Most other SIMD ISAs are more orthogonal than x86, like ARM NEON or PowerPC AltiVec.
Anyway, in general, it's more orthogonal if all operations are available in all combinations of size and signedness that exist for any operation. This isn't always a big deal for compilers per-se, more for humans not realizing that a compiler could make their code faster if this variable was unsigned or something.
Modern CPUs often simulate orthogonality in a preprocessing step before performing the actual tasks in a RISC-like core
That sounds like they're talking about decoding to uops, but I don't see how that would gain orthogonality.
Unless they're counting the concept of any instruction allowing a memory source operand as being more orthogonal. Normally you wouldn't, being a load/store architecture is basically a fixed constraint that doesn't make other things harder.
But if you do consider that more orthogonal, then yes, decoding add eax, [rdi] to 2 uops lets it run on a back-end that separates the load work from the store work, like a RISC.

I hadn't heard this term orthogonal instruction set before, however:
The VAX is perhaps the epitome of CISC.  The VAX supports many addressing modes, ranging from register itself, to memory specified by various indexing computations (some including pointer advancement, so as to do *p++ or *--p).
The VAX allows all addressing modes for all operands of any instruction.  Further, the VAX allows both 2 operand and 3 operand instructions, so addl2 is operand2 += operand1, and addl3 is operand3 = operand1 + operand2.
Basically it can encode a lot of stuff in a single instruction, so we can do for example, a[i] = *p++ + b[j]; in one instruction, assuming a, b, i, j, and p are in registers.
Other CISC-style processors limit the encoding, for example, so that we can only do two-operand instructions (no 3 operand), and some even limit the 2nd operand to a register, so only one memory operand.  I believe this is what they're getting at with the term orthogonal or not.
Meanwhile, a RISC processor instead follows a load/store architecture.  Access to memory is not allowed for any operand, but rather only via load and store instructions, and only with those instructions are there addressing modes.  Most all arithmetic operations (except the add for addressing) happen between registers alone.  (In some sense the RISC philosophy has an orthogonality since all arithmetic operations work on registers alone.)
I don't think the term orthogonality is of high value.  I wouldn't dwell on the term itself, but rather take away from that article the comparison between CISC ala VAX, vs. others CISC, vs. RISC.
#Peter also makes a good points, such as that certain registers being hard code (i.e. an implicit source/target) in some architectures for some instructions, which reduces orthogonality.
By that point I might stress that RISC architectures generally don't hard code registers, though MIPS hard codes the return address register ($31) for the jal instruction whereas RISC V does not ($sp and $ra are hard coded but only in the compressed instruction extension).  Whereas some CISC architectures (except VAX) hard code more registers. 
The MC68000 divides the registers into two sets of 8: addressing registers and data registers, which helps encoding by providing 16 registers with only 3-bit register fields, but also limits what you can do with them (and there aren't enough address registers, since one is the stack pointer and another the global pointer, leaving only 6).
CISC architecture often support byte vs. word sized arithmetic, whereas RISC architectures usually support only word sized arithmetic, so if you want byte, you have to simulate it (i.e. with range check or other).

Why does SIMD have single data instructions when it's called SIMD?

I've been wondering.. It's called SIMD as in single instruction multiple data. So why does it have single data instructions?
For example, vaddss is the single data equivalent of the multiple data vaddps. Just about every SIMD instruction have a single data version.
Why?
Why does SIMD have single data instructions when it's called SIMD?

It isn't a SIMD instruction in that sense
vaddss is a scalar FP math instruction that operates on data in the FP/SIMD registers (XMM0..15). It exists because x87 is not a very convenient compiler target with its stack-based registers that often need fxch, and other quirks. Intel added a new way to do scalar FP math along with SSE1 (float) and SSE2 (double), which is fortunately baseline for x86-64 so everyone can just use it.
People who call that a SIMD instruction are talking about one of:
Which registers it operates on. (XMM0 is 16 bytes wide and clearly a SIMD register, even when you only care about the low element holding a scalar value.)
The fact that it's an AVX instruction, so it was introduced with an ISA extension that was primarily aimed at SIMD usage, and thus is called a SIMD extension or instruction set.
Which also means it uses the MXCSR for rounding mode and FP exception recording / unmasking, and the kinds of exceptions it can take are the same as other SSE/AVX instructions which Intel documents as "SIMD Floating-Point Exceptions" as concise terminology to distinguish it from legacy x87.
Or they're talking about the use-case of doing something to just the low element when the high elements have actual data. (Quite rare, but something you could do. Maybe more likely with sd scalar double, where the low double is one half of an XMM register.)
Or they're just plain wrong if they actually mean it in terms of Flynn's taxonomy of SISD vs. SIMD vs. MIMD etc. I highly doubt anyone would actually mean that, though. The ss and sd scalar FP math instructions are SISD, single-instruction single-data. And BTW, they only exist for FP math; x86 already has instructions like add eax, ecx for scalar integer math, and doesn't have scalar versions of paddb or even xorps.
One reason for having separate scalar FP math instructions is that using addps would also operate on whatever garbage might be in the high elements of XMM registers. This can raise extra FP exceptions (usually masked, so only recorded in MXCSR (fenv.h), but if unmasked would trap to the OS.)
With the upper elements all 0.0 (which isn't required by the calling convention, BTW), addps wouldn't raise any extra exceptions, but divps would divide by zero.
With non-zero garbage like small integers, it might be a bit-pattern for a subnormal float, or a result might be subnormal, causing huge slowdowns (factor of ~100) as the CPU takes a microcode assist to get handle subnormal input or output in many cases (or when SSE1 was new in Pentium III, probably all cases of subnormals). Unless you set FTZ and DAZ (flush to zero, denormal are zero) like gcc -ffast-math does.
For instructions like xorps or paddq which don't do actual FP math, no FP exceptions or microcode assists are possible. You can just use them even if you only care about the low 32 or 64 bits of an XMM.
MMX or SSE2 had occasional uses in 32-bit code for doing scalar 64-bit integer math, with zeros or garbage in the upper bytes. MMX paddq mm0, mm1 is a SISD instruction, but SSE2 paddq xmm0, xmm1 is a SIMD instruction.
SSE1 was new in Pentium 3, where the SIMD execution units and registers were only 64 bits wide. addps decoded to 2 uops; addss decoded to 1. So there was a performance motivation, too, even in the best case.
This is also likely the reason for Intel's unfortunate design where sqrtss and cvtsi2ss and others merge into the destination, requiring either spending extra front-end bandwidth on xor-zeroing, or risking false dependencies: Why does adding an xorps instruction make this function using cvtsi2ss and addss ~5x faster? . It's a short-sighted design decision to make them single-uop on Pentium 3, which they unfortunately followed in SSE2 for double precision, and stuck to for AVX and AVX-512 when they had a chance to introduce better versions with different semantics. At least the AVX versions take a 2nd source register to merge with, so you can pick a "cold" reg as a workaround, see my answer on the linked duplicate.
It's normal for scalar FP to share registers with SIMD
It isn't necessary or useful to have yet another set of registers for scalar FP, and sharing with the x87 FPU or the general-purpose integer registers would each be worse for separate reasons.
It's totally normal on other ISAs for the SIMD registers to overlap or be the same as the scalar FP registers; Some ISAs (like ARM) that didn't have weirdo designs like x87 didn't need new architectural state to introduce SIMD. e.g. ARM's NEON q0..q15 16-byte registers map to pairs of d0..d31 double-precision FP registers that existed with VFPv3.
(I'm not sure if the partial-register aliasing was actually common in SIMD extensions for other ISAs, though. Probably some introduced new architectural state, or just used FP double-precision registers as 64-bit integer SIMD instead of 128-bit.)
In an OS kernel you often talk about saving "FPU state" on context switch (as opposed to just the general-purpose integer registers), and these days that's short-hand for FPU and SIMD state. e.g. in the Linux kernel, you need to use kernel_fpu_begin() before running instructions that use XMM/YMM/ZMM registers. (e.g. in the RAID5 / RAID6 drivers).

representing Double values in Katai

Some of the values I need to read in my ksy file are double's which I assume is a binary64 structure. The native data-types for a float won't stretch that far. Has anyone managed to represent this datatype in Kaitai ?

"binary64" is a normal IEEE 754 double-precision floats, occupying 64 bits = 8 bytes.
They're perfectly supported by vast majority of languages and, subsequently, Kaitai Struct offers built-in supports for them as type: f8 (float, 8 bytes long).
If you're rather interested in larger floating point values (binary128, binary256 — i.e. quad or octuple precision), there is no built-in support for them in KS due to lack of standard support for these types in most target languages. If you want something like that, the recommended way would be implementing one as opaque type in a target language of your choice. That will likely require you to bringing in some external library which implements this type using some kind of software emulation / complex arithmetics — as hardware support seems to be almost non-existent in commodity CPUs (like Intel or ARM) as of 2020.
For more details on these, see issue #101.

Why is there no fused multiply-add for general-purpose registers on x86_64 CPUs?

On Intel and AMD x86_64 processors, SIMD vectorized registers have specific fused-multiply-add capabilities, but general-purpose (scalar, integer) registers don't - you basically need to multiply, then add (unless you can fit things into an lea).
Why is that? I mean, is it that useless so as to not be worth the overhead?

Integer multiply is common, but not one of the most common things to do with integers. But with floating point numbers, multiplying and adding is used all the time, and FMA provides major speedups for lots of ALU-bound FP code.
Also, floating point actually avoids precision loss with an FMA (the x*y internal temporary isn't rounded off at all before adding). This is why the ISO C99 / C++ fma() math library function exists, and why it's slow to implement without hardware FMA support.
Integer FMA (or multiply-accumulate, aka MAC) doesn't have any precision benefit vs. separate multiply and add.
Some non-x86 ISAs do provide integer FMA. It's not useless, but Intel and AMD both haven't bothered to include it until AVX512-IFMA (and that's still only for SIMD, basically exposing the 52-bit mantissa multiplier circuits needed for double-precision FMA/vmulpd for use by integer instructions).
Non-x86 examples include:
MIPS32, madd / maddu (unsigned) to multiply-accumulate into the hi / lo registers (the special registers used as a destination by regular multiply and divide instructions).
ARM smlal and friends (32x32=>64 bit MAC, or 16x16=>32 bit), also available for unsigned integer. Operands are regular R0..R15 general purpose registers.
An integer register FMA would be useful on x86, but uops that have 3 integer inputs are rare. CMOV and ADC have 3 inputs, but one of those is flags. Even then, they didn't decode to a single uop on Intel until Broadwell, after 3-input uop support was added for FP FMA in Haswell.
Haswell and later can track fused-domain uops with 3 integer inputs, though, for (some) micro-fused instructions with indexed addressing modes. Sandybridge/Ivybridge un-laminate instructions like add eax, [rdx+rcx]. (But Nehalem could keep them micro-fused, like Haswell; SnB simplified the fused-domain uop format). Anyway, that's fused domain, not in the scheduler. Only Broadwell/Skylake can track 3-input integer uops in the scheduler, and that's only for 2 integer + flags, not 3 integer registers.
Intel does use a "unified" scheduler, where FP and integer ops use the same scheduler, and it can track proper 3-input FP FMA. So IDK if there's a technical obstacle. If not, IDK why Intel didn't include integer FMA as part of BMI2 or something, which added stuff like mulx (2-input 2-output mul with mostly explicit operands, unlike legacy mul that uses rdx:rax.)
SSE2/SSSE3 does have integer mul-add instructions for vector registers, but only horizontal add after widening 16x16 => 32-bit (SSE2 pmaddwd) or (unsigned)8x(signed)8=>16-bit (SSSE3 pmaddubsw).
But those are only 2-input instructions, so even though there's a multiply and an add, it's very different from FMA.
Footnote: The question title originally said there was no FMA "for scalars". There is scalar FP FMA with the same FMA3 extension that added the packed versions of these: VFMADD231SD and friends operate on scalar double-precision, and the same flavours of vfmaddXXXss are available for scalar float in XMM registers.

64-bit Advantages for Discrete Event Simulation

As I understand it, Intel 64-bit CPUs offer the ability to address a larger address space (>4GB), which is useful for a large simulation. Interesting architectural hardware advantages::
16 general purpose registers instead of 8
Additional SSE registers
A no execute (NX) bit to prevent buffer overrun attacks
BACKGROUND
Historically, the simulations have been performed on 32-bit IA (Intel Architecture) systems. I am wondering if where (if any) is opportunity to reduce simulation times with 64-bit CPUs: I expect that software should be recompiled to take advantage of 64-bit capability. This type of simulation would not benefit from a MAC (multiply and accumulate) nor does it use floating point calculations.
QUESTION
That being said, is there an Intel 64-bit instruction or capability that offers an appreciable advantage over the 32-bit instructions set that would accelerate simulation (computationally intensive and lengthy 32-BIT algorithms)?
If you have experience implementing simulations and have transitioned from 32 to 64 bit CPUs, please state this in your response (relevant experience is important). I look forward to insightful responses from the community

The most immediate computational benefits to expect regarding CPU instructions I can think of would be AVX although this is only loosely related to x86_64, but more of an CPU-generational issue.
In our company, we developed multiple, highly-complex discrete event simulations, simulating aircraft (including electrics, hydraulics, avionics software and everything related). They are all built with or ported to x86_64. The reasons are mostly due to memory addressing, allowing for larger caches and wider choice of algorithms (e.g. data-centric design, concurrency), graphics content also tends to be huge nowadays. However, optimizations regarding x86_64 instructions themselves, such as AVX, are left to compilers. I never saw code written in assembler or using compiler intrinsics to actually refer to specific x86_64 instructions explicitly.
To summarize, based on my experience, x86_64 CPUs allow for certain optimizations, often sacrificing memory consumption in favor of CPU processing:
Wider choice of algorithms, especially regarding concurrency, where data may need to be laid out in a way favoring parallel processing at the cost of occupied memory
Intermediate results or other processing output may be cached more easily in memory to avoid recomputation or to optimize for temporal or state-related coherence
AVX instructions may help compilers to vectorize more code than with MMX/SSE

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse