Why does SIMD have single data instructions when it's called SIMD? - cpu-architecture

I've been wondering.. It's called SIMD as in single instruction multiple data. So why does it have single data instructions?
For example, vaddss is the single data equivalent of the multiple data vaddps. Just about every SIMD instruction have a single data version.
Why?
Why does SIMD have single data instructions when it's called SIMD?

It isn't a SIMD instruction in that sense
vaddss is a scalar FP math instruction that operates on data in the FP/SIMD registers (XMM0..15). It exists because x87 is not a very convenient compiler target with its stack-based registers that often need fxch, and other quirks. Intel added a new way to do scalar FP math along with SSE1 (float) and SSE2 (double), which is fortunately baseline for x86-64 so everyone can just use it.
People who call that a SIMD instruction are talking about one of:
Which registers it operates on. (XMM0 is 16 bytes wide and clearly a SIMD register, even when you only care about the low element holding a scalar value.)
The fact that it's an AVX instruction, so it was introduced with an ISA extension that was primarily aimed at SIMD usage, and thus is called a SIMD extension or instruction set.
Which also means it uses the MXCSR for rounding mode and FP exception recording / unmasking, and the kinds of exceptions it can take are the same as other SSE/AVX instructions which Intel documents as "SIMD Floating-Point Exceptions" as concise terminology to distinguish it from legacy x87.
Or they're talking about the use-case of doing something to just the low element when the high elements have actual data. (Quite rare, but something you could do. Maybe more likely with sd scalar double, where the low double is one half of an XMM register.)
Or they're just plain wrong if they actually mean it in terms of Flynn's taxonomy of SISD vs. SIMD vs. MIMD etc. I highly doubt anyone would actually mean that, though. The ss and sd scalar FP math instructions are SISD, single-instruction single-data. And BTW, they only exist for FP math; x86 already has instructions like add eax, ecx for scalar integer math, and doesn't have scalar versions of paddb or even xorps.
One reason for having separate scalar FP math instructions is that using addps would also operate on whatever garbage might be in the high elements of XMM registers. This can raise extra FP exceptions (usually masked, so only recorded in MXCSR (fenv.h), but if unmasked would trap to the OS.)
With the upper elements all 0.0 (which isn't required by the calling convention, BTW), addps wouldn't raise any extra exceptions, but divps would divide by zero.
With non-zero garbage like small integers, it might be a bit-pattern for a subnormal float, or a result might be subnormal, causing huge slowdowns (factor of ~100) as the CPU takes a microcode assist to get handle subnormal input or output in many cases (or when SSE1 was new in Pentium III, probably all cases of subnormals). Unless you set FTZ and DAZ (flush to zero, denormal are zero) like gcc -ffast-math does.
For instructions like xorps or paddq which don't do actual FP math, no FP exceptions or microcode assists are possible. You can just use them even if you only care about the low 32 or 64 bits of an XMM.
MMX or SSE2 had occasional uses in 32-bit code for doing scalar 64-bit integer math, with zeros or garbage in the upper bytes. MMX paddq mm0, mm1 is a SISD instruction, but SSE2 paddq xmm0, xmm1 is a SIMD instruction.
SSE1 was new in Pentium 3, where the SIMD execution units and registers were only 64 bits wide. addps decoded to 2 uops; addss decoded to 1. So there was a performance motivation, too, even in the best case.
This is also likely the reason for Intel's unfortunate design where sqrtss and cvtsi2ss and others merge into the destination, requiring either spending extra front-end bandwidth on xor-zeroing, or risking false dependencies: Why does adding an xorps instruction make this function using cvtsi2ss and addss ~5x faster? . It's a short-sighted design decision to make them single-uop on Pentium 3, which they unfortunately followed in SSE2 for double precision, and stuck to for AVX and AVX-512 when they had a chance to introduce better versions with different semantics. At least the AVX versions take a 2nd source register to merge with, so you can pick a "cold" reg as a workaround, see my answer on the linked duplicate.
It's normal for scalar FP to share registers with SIMD
It isn't necessary or useful to have yet another set of registers for scalar FP, and sharing with the x87 FPU or the general-purpose integer registers would each be worse for separate reasons.
It's totally normal on other ISAs for the SIMD registers to overlap or be the same as the scalar FP registers; Some ISAs (like ARM) that didn't have weirdo designs like x87 didn't need new architectural state to introduce SIMD. e.g. ARM's NEON q0..q15 16-byte registers map to pairs of d0..d31 double-precision FP registers that existed with VFPv3.
(I'm not sure if the partial-register aliasing was actually common in SIMD extensions for other ISAs, though. Probably some introduced new architectural state, or just used FP double-precision registers as 64-bit integer SIMD instead of 128-bit.)
In an OS kernel you often talk about saving "FPU state" on context switch (as opposed to just the general-purpose integer registers), and these days that's short-hand for FPU and SIMD state. e.g. in the Linux kernel, you need to use kernel_fpu_begin() before running instructions that use XMM/YMM/ZMM registers. (e.g. in the RAID5 / RAID6 drivers).

Related

Orthogonality of Instruction Set Architecture

I am studying the difference between CISC and RISC recently, and I've encountered into the term "Orthogonality". After doing some research, my understanding so far is that there are two "axes", addressing modes & operations, which are independent of each other, so it produces a maximum number of (#addressing modes * #operations) instructions in the ISA.
For CISC machine, which is a register-memory architecture, operands may come from register or memory and RISC a register-register(or load-store) one on the contrary.
So, what's the role of orthogonality playing in these two ISA? Is CISC more orthogonal than RISC or vice versa?
As the wiki describes, "Modern CPUs often simulate orthogonality in a preprocessing step before performing the actual tasks in a RISC-like core. This "simulated orthogonality" in general is a broader concept, encompassing the notions of decoupling and completeness in function libraries, like in the mathematical concept: an orthogonal function set is easy to use as a basis into expanded functions, ensuring that parts don’t affect another if we change one part." What does this paragraph mean? What is the preprocessing step, does it have anything to do with the microcode?
Any explanation are appreciated! Thanks a lot!
Maximizing total choices of possible instructions like a CISC is generally not what's meant. Instead it's more about being a simpler compiler target, without complex interactions in what makes an instruction legal or not. RISC machines are often highly orthogonal, and designed with being a compiler target in mind, not human programmers.
My understanding of the term is that orthogonality is more about any register being usable in any case where any other register is usable. Unlike x86 shl reg, cl where variable-count shifts require a specific register. (I know this is a RISC-V question, but the examples of non-orthogonality I know of come from other ISAs, primarily x86.)
And definitely not like 8086 (before 386), where if you needed to multiply, one of the operands had to be in the accumulator, AL or AX. And sign-extension was also only available there. 386 introduced movsx reg, r/m8 and r/m16. (And movzx, allowing easy and more efficient zero-extending of a byte from memory into SI or DI, without having to load 2 bytes and and si, 0x00ff.)
Even worse, 16-bit addressing modes only allow a few registers in very limited ways: [bp|bx] + [si|di] + disp0/8/16, vs. 32-bit addressing modes allowing stuff like lea eax, [ecx + ecx + 3] to use the same register twice, or address memory relative to the stack pointer without having to copy it to the base pointer (BP) register.
Or if some memory operands can use a certain addressing mode, can all memory operands use it? AArch64 ldp/stp (load-pair/store-pair of registers) I think has fewer available addressing modes than single-register loads, because it needs 5 extra bits for a second register number. Unlike ARM32 ldrd where the pair of registers is two contiguous registers, starting with an even number.
In general, the less interaction there is between a choice of one thing (like instruction) and the possible choices for another (a register), the more orthogonal.
One of the major benefits with this is being a simple compiler target. The most optimal code can more often be found with a greedy algorithm that only takes into a account one thing at a time, not interlocking tradeoffs. Not like x86-64 "if I use ECX instead of R9d for this variable, that'll save bytes in multiple instructions not needing REX prefix, but later mean I need an extra mov to copy a register for a shift count". (x86 BMI2 introduced variable-count shifts that can use a count from any register, like shlx ebx, eax, r15d)
Or far worse targeting 8086 or 286, where 16-bit addressing modes impose a lot more constraints on register allocation. And you'd more often you'd want to use instructions that needed their operands in specific registers, especially the accumulator.
But if you're not worried about every byte of code size, x86-64 is a fairly orthogonal ISA, usually you don't need to care about which register you use for what. One change in that direction beyond 386's important changes was making the low byte of every register addressable, like bpl, spl, sil, dil as the low bytes of RBP, RSP, RSI, RDI. (But those require REX prefixes, overlapping encodings with AH/CH/DH/BH which are only usable in instructions without REX prefixes.)
Another example of non-orthogonality is x86's notorious integer SIMD extensions, MMX and SSE2. Want to do minimum of unsigned integers 16 bytes at a time? In SSE2 we have pminub for unsigned byte elements. And pminsw, signed 16-bit elements. But no other combination of size and signedness until SSE4.1, several years later, which filled in the gaps allowing signed bytes and u16, as well as i32 and u32. And then AVX-512 added i64 and u64. Every min available always had a corresponding max, but other than that, SSE2 was highly non-orthogonal in that and many other ways, including signed/unsigned saturating add/sub, and pack of wider to narrower elements with signed or unsigned saturation. And FP vs. integer shuffles, e.g. there's no integer equivalent to shufps that takes two elements from one vector, two from another, using an immediate control operand. Fortunately for shuffles you can use FP shuffles on integer data.
x86 SIMD is still not very orthogonal in many ways, for example in integer multiply where not all combinations of element size are available for everything; 16-bit has 16x16 => 16-bit low half, signed high half, or unsigned high half. (And a widening multiply and horizontal-add, pmaddwd). 32-bit has signed and unsigned widening 32x32 => 64-bit, and with SSE4.1 also non-widening. 8-bit only has a multiply and horizontal-add where one operand is treated as signed, the other as unsigned.
Again, if I'm picking on x86 a lot, it's because it's what I know. And Intel painted a huge "kick me" sign on their back when they designed MMX and SSE2, only taking some steps to fix things later with SSE4.1. (I'm sure there are reasons for some of those choices, including transistor budget and opcode coding-space in x86's notoriously cramped machine-code.) But a lot of programs don't want to assume SSE4.1 as a requirement to run at all, even now, over a decade since the first SSE4.1 CPUs. Most other SIMD ISAs are more orthogonal than x86, like ARM NEON or PowerPC AltiVec.
Anyway, in general, it's more orthogonal if all operations are available in all combinations of size and signedness that exist for any operation. This isn't always a big deal for compilers per-se, more for humans not realizing that a compiler could make their code faster if this variable was unsigned or something.
Modern CPUs often simulate orthogonality in a preprocessing step before performing the actual tasks in a RISC-like core
That sounds like they're talking about decoding to uops, but I don't see how that would gain orthogonality.
Unless they're counting the concept of any instruction allowing a memory source operand as being more orthogonal. Normally you wouldn't, being a load/store architecture is basically a fixed constraint that doesn't make other things harder.
But if you do consider that more orthogonal, then yes, decoding add eax, [rdi] to 2 uops lets it run on a back-end that separates the load work from the store work, like a RISC.
I hadn't heard this term orthogonal instruction set before, however:
The VAX is perhaps the epitome of CISC.  The VAX supports many addressing modes, ranging from register itself, to memory specified by various indexing computations (some including pointer advancement, so as to do *p++ or *--p).
The VAX allows all addressing modes for all operands of any instruction.  Further, the VAX allows both 2 operand and 3 operand instructions, so addl2 is operand2 += operand1, and addl3 is operand3 = operand1 + operand2.
Basically it can encode a lot of stuff in a single instruction, so we can do for example, a[i] = *p++ + b[j]; in one instruction, assuming a, b, i, j, and p are in registers.
Other CISC-style processors limit the encoding, for example, so that we can only do two-operand instructions (no 3 operand), and some even limit the 2nd operand to a register, so only one memory operand.  I believe this is what they're getting at with the term orthogonal or not.
Meanwhile, a RISC processor instead follows a load/store architecture.  Access to memory is not allowed for any operand, but rather only via load and store instructions, and only with those instructions are there addressing modes.  Most all arithmetic operations (except the add for addressing) happen between registers alone.  (In some sense the RISC philosophy has an orthogonality since all arithmetic operations work on registers alone.)
I don't think the term orthogonality is of high value.  I wouldn't dwell on the term itself, but rather take away from that article the comparison between CISC ala VAX, vs. others CISC, vs. RISC.
#Peter also makes a good points, such as that certain registers being hard code (i.e. an implicit source/target) in some architectures for some instructions, which reduces orthogonality.
By that point I might stress that RISC architectures generally don't hard code registers, though MIPS hard codes the return address register ($31) for the jal instruction whereas RISC V does not ($sp and $ra are hard coded but only in the compressed instruction extension).  Whereas some CISC architectures (except VAX) hard code more registers. 
The MC68000 divides the registers into two sets of 8: addressing registers and data registers, which helps encoding by providing 16 registers with only 3-bit register fields, but also limits what you can do with them (and there aren't enough address registers, since one is the stack pointer and another the global pointer, leaving only 6).
CISC architecture often support byte vs. word sized arithmetic, whereas RISC architectures usually support only word sized arithmetic, so if you want byte, you have to simulate it (i.e. with range check or other).

Why is there no fused multiply-add for general-purpose registers on x86_64 CPUs?

On Intel and AMD x86_64 processors, SIMD vectorized registers have specific fused-multiply-add capabilities, but general-purpose (scalar, integer) registers don't - you basically need to multiply, then add (unless you can fit things into an lea).
Why is that? I mean, is it that useless so as to not be worth the overhead?
Integer multiply is common, but not one of the most common things to do with integers. But with floating point numbers, multiplying and adding is used all the time, and FMA provides major speedups for lots of ALU-bound FP code.
Also, floating point actually avoids precision loss with an FMA (the x*y internal temporary isn't rounded off at all before adding). This is why the ISO C99 / C++ fma() math library function exists, and why it's slow to implement without hardware FMA support.
Integer FMA (or multiply-accumulate, aka MAC) doesn't have any precision benefit vs. separate multiply and add.
Some non-x86 ISAs do provide integer FMA. It's not useless, but Intel and AMD both haven't bothered to include it until AVX512-IFMA (and that's still only for SIMD, basically exposing the 52-bit mantissa multiplier circuits needed for double-precision FMA/vmulpd for use by integer instructions).
Non-x86 examples include:
MIPS32, madd / maddu (unsigned) to multiply-accumulate into the hi / lo registers (the special registers used as a destination by regular multiply and divide instructions).
ARM smlal and friends (32x32=>64 bit MAC, or 16x16=>32 bit), also available for unsigned integer. Operands are regular R0..R15 general purpose registers.
An integer register FMA would be useful on x86, but uops that have 3 integer inputs are rare. CMOV and ADC have 3 inputs, but one of those is flags. Even then, they didn't decode to a single uop on Intel until Broadwell, after 3-input uop support was added for FP FMA in Haswell.
Haswell and later can track fused-domain uops with 3 integer inputs, though, for (some) micro-fused instructions with indexed addressing modes. Sandybridge/Ivybridge un-laminate instructions like add eax, [rdx+rcx]. (But Nehalem could keep them micro-fused, like Haswell; SnB simplified the fused-domain uop format). Anyway, that's fused domain, not in the scheduler. Only Broadwell/Skylake can track 3-input integer uops in the scheduler, and that's only for 2 integer + flags, not 3 integer registers.
Intel does use a "unified" scheduler, where FP and integer ops use the same scheduler, and it can track proper 3-input FP FMA. So IDK if there's a technical obstacle. If not, IDK why Intel didn't include integer FMA as part of BMI2 or something, which added stuff like mulx (2-input 2-output mul with mostly explicit operands, unlike legacy mul that uses rdx:rax.)
SSE2/SSSE3 does have integer mul-add instructions for vector registers, but only horizontal add after widening 16x16 => 32-bit (SSE2 pmaddwd) or (unsigned)8x(signed)8=>16-bit (SSSE3 pmaddubsw).
But those are only 2-input instructions, so even though there's a multiply and an add, it's very different from FMA.
Footnote: The question title originally said there was no FMA "for scalars". There is scalar FP FMA with the same FMA3 extension that added the packed versions of these: VFMADD231SD and friends operate on scalar double-precision, and the same flavours of vfmaddXXXss are available for scalar float in XMM registers.

Indexed addressing mode and implied addressing mode

Indexed addressing mode is usually used for accessing arrays as arrays are stored contiguosly. We have a index register which gets incremented in every iteration which when added to base address gives the array element address.
I don't understand the actual need of this addressing mode. Why can't we do this with direct addressing ? We have the base address and we can just add 1 to it every time when accessing. Why do we need indexed addressing mode which has a overhead of index register ?
I am not sure about the instruction format for implied addressing mode. Suppose we have a instruction INC AC. Is the address of AC specified in the instruction or is there a special opcode which means 'INC AC' and we don't include the address of AC (accumulator)?
I don't understand the actual need of this addressing mode. Why can't we do this with direct addressing?
You can; MIPS only has one addressing mode and compilers can still generate code for it just fine. But sometimes it has to use an extra shift + add instruction to calculate an address (if it's not just looping through an array).
The point of addressing modes is to save instructions and save registers, especially in 2-operand instruction sets like x86, where add eax, ecx overwrites eax with the result (eax += ecx), unlike MIPS or other 3-instruction ISAs where addu $t2, $t1, $t0 does t2 = t1 + t0. On x86, that would require a copy (mov) and an add. (Or in that special case, lea edx, [eax+ecx]: x86 can copy-and-add (and shift) using the same instruction-encoding it uses for memory operands.)
Consider a histogram problem: you generate array indices in unpredictable order, and have to index an array. On x86-64, add dword [rbx + rdi*4], 1 will increment a 32-bit counter in memory using a single 4-byte instruction, which decodes to only 2 uops for the front-end to issue into the out-of-order core on modern Intel CPUs. (http://agner.org/optimize/). (rbx is the base register, rdi is a scaled index). Having a scaled index is very powerful; x86 16-bit addressing modes support 2 registers, but not a scaled index.
Classic MIPS only has separate shift and add instructions, although MIPS32 did add a scaled-add instruction for address calculation. That would save an instruction here. Being a load-store machine, the loads and stores always have to be separate instructions (unlike on x86 where that add decodes as a micro-fused load+add and a store. See INC instruction vs ADD 1: Does it matter?).
Probably ARM would be a better comparison for MIPS: It's also a load-store RISC machine. But it does have a selection of addressing modes, including scaled index using the barrel shifter. So instead of needing a separate shift / add for each array index, you'd use LDR R0, [R1, R2, LSL #2], add r0, r0, #1 / str with the same addressing mode.
Often when looping through an array, it is best to just increment pointers on x86. But it's also an option to use an index, especially for loops with multiple arrays using the same index, like C[i] = A[i] + B[i]. Indexed addressing mode can sometimes be slightly less efficient in hardware, though, so when a compiler is unrolling a loop it usually should use pointers, even though it has to increment all 3 pointers separately instead of one index.
The point of instruction-set design is not merely to be Turing complete, it's to enable efficient code that gets more work done with fewer clock cycles and/or smaller code-size, or give programmers the option of aiming for either of those goals.
The minimum threshold for a computer to be programmable is extremely low, see for example various One instruction set computer architectures. (None implemented for real, just designed on paper to show that it's possible to write programs with nothing but a subtract-and-branch-if-less-than-zero instruction, with memory operands encoded in the instruction.
There's a tradeoff between easy to decode (especially to decode in parallel) vs. compact. x86 is horrible because it evolved as a series of extensions, often without a lot of planning to leave room for future extensions. If you're interested in ISA design decisions, have a look at Agner Fog's blog for interesting discussion about designing an ISA for high-performance CPUs that combines the best of x86 (lots of work with one instruction, e.g. memory operand as part of an ALU instruction) with the best features of RISC (easy to decode, lots of registers): Proposal for an ideal extensible instruction set.
There's also a tradeoff in how you spend the bits in an instruction word, especially in a fixed instruction width ISA like most RISCs. Different ISAs made different choices.
PowerPC uses lots of the coding space for powerful bitfield instructions like rlwinm (rotate left and mask off a window of bits), and lots of opcodes. IDK if the generally unpronounceable and hard-to-remember mnemonics are related to that...
ARM uses the high 4 bits for predicated execution of any instruction based on condition codes. It uses more bits for the barrel shifter (the 2nd source operand is optionally shifted or rotated by an immediate or a count from another register).
MIPS has relatively large immediate operands, and is basically simple.
x86 32/64-bit addressing modes use a variable-length encoding, with an extra byte SIB (scale/index/base) byte when there's an index, and an optional disp8 or disp32 immediate displacement. (e.g. add esi, [rax + rdx + 12340] takes 2 + 1 + 4 bytes to encode, vs. 2 bytes for add esi, [rax].
x86 16-bit addressing modes are much more limited, and pack everything except the optional disp8/disp16 displacement into the ModR/M byte.
Suppose we have a instruction INC AC. Is the address of AC specified in the instruction or is there a special opcode which means 'INC AC' and we don't include the address of AC (accumulator)?
Yes, the machine-code format for some instructions in some ISAs includes implicit operands. Many machines have push / pop instructions that implicitly use a specific register as the stack pointer. For example, in x86-64's push rax, RAX is an explicit register operand (encoded in the low 3 bits of the one-byte opcode using the push r64 short form), while RSP is an implicit operand.
Older 8-bit CPUs often had instructions like DECA (to decrement the accumulator, A). i.e. there was a specific opcode for that register. This could be the same thing as having a DEC instruction with some bits in the opcode byte specifying which register (like x86 does before x86-64 repurposed the short INC/DEC encodings as REX prefixes: note the "N.E" (Not Encodeable) in the 64-bit mode column for dec r32). But if there's no regular pattern then it can definitely be considered an implicit operand.
Sometimes putting things into neat categories breaks down, so don't worry too much about whether using bits with the opcode byte counts as implicit or explicit for x86. It's a way of spending more opcode space to save code-size for commonly used instructions while still allowing use with different registers.
Some ISAs only use a certain register as the stack pointer by convention, with no implicit uses. MIPS is like this.
ARM32 (in ARM, not Thumb mode) also uses explicit operands in push/pop. Its push/pop mnemonics are just aliases for store-multiple decrement-before / load-multiple increment-after (LDMIA / STMDB) to implement a full-descending stack. See ARM's docs for LDM/STM which explains this, and what you can do with the general case of these instructions, e.g. LDMDB to decrement a pointer and then load (in the opposite direction of POP).

Double vs Float80 speed in Swift

I've heard that the x87 FPU works with 80-bit floats, so even if I want to calculate using 64-bit numbers, it would calculate it with 80-bit and then convert it. But which is fastest in Swift on x86-64, Double or Float80 (when calculating arithmetic)?
While it's true that the x87 FPU operates internally at 80-bit "extended" precision (at least, by default; this is customizable, and in fact 32-bit builds following the macOS ABI set 64-bit internal precision), binaries targeting x86-64 no longer use x87 FPU instructions. All x86 chips that implement the 64-bit long mode extension also support SSE2 (in fact, this was required by the AMD64 specification), so a 64-bit binary can always assume SSE2 support. As such, this is what is used to implement floating-point operations, because it's much more efficient and easier to optimize around for a compiler.
Even 32-bit builds in the modern era assume SSE2 as a minimum, and certainly on the Macintosh platform, since SSE2 was introduced with the Pentium 4, which predated the Macintosh platform's switch to Intel x86 chips. All x86 chips ever used in an Apple machine support SSE2.
So no, you aren't going to see any performance improvement by using an 80-bit extended precision type. You weren't going to see any performance improvement from x87 instructions, even if they were generated by the compiler. And you certainly aren't going to see any performance improvement on x86-64, because SSE2 supports a maximum of 64-bit precision in hardware. Any 80-bit precision operations are going to have to be implemented in software, or force a smart compiler to emit x87 instructions, which means you don't benefit from any of the nice features and tangible performance improvements of SSE2.
Double will almost always[1] be at least as fast on Float80 on modern Intel processors, in almost any language. There are some situations in which it will be significantly faster:
Double uses less memory; it's possible for an algorithm's working set to fit in cache when using Double, but fail to fit when using Float80, causing significant performance hazards.
Double can take advantage for FMA instructions (exposed in Swift as .add[ing]Product(x,y) and the fma() free function), which effectively doubles the attainable floating-point throughput on recent cores.
Double can be auto-vectorized by the compiler. There are no vector instructions on Float80. When possible, this can give you up to a 4x speedup.
Math functions like sin, cos, pow, etc. are faster on Double than they are on Float80.
There are some other reasons to use Double: it's portable to non-x86 hardware, whereas Float80 is not, and interoperability with C interfaces is easier with Double than it is with Float80. You should only use Float80 when necessary, and default to using Double otherwise.
[1] There are a few niche cases where Float80 can be faster--if an algorithm repeatedly underflows in Double, but remains in normal range in Float80, for example. These are rare, and usually not worth worrying about; more commonly your algorithm will also underflow in Float80, just do it a few iterations later.

Why doesn't my processor have built-in BigInt support?

As far as I understood it, BigInts are usually implemented in most programming languages as arrays containing digits, where, eg.: when adding two of them, each digit is added one after another like we know it from school, e.g.:
246
816
* *
----
1062
Where * marks that there was an overflow. I learned it this way at school and all BigInt adding functions I've implemented work similar to the example above.
So we all know that our processors can only natively manage ints from 0 to 2^32 / 2^64.
That means that most scripting languages in order to be high-level and offer arithmetics with big integers, have to implement/use BigInt libraries that work with integers as arrays like above.
But of course this means that they'll be far slower than the processor.
So what I've asked myself is:
Why doesn't my processor have a built-in BigInt function?
It would work like any other BigInt library, only (a lot) faster and at a lower level: Processor fetches one digit from the cache/RAM, adds it, and writes the result back again.
Seems like a fine idea to me, so why isn't there something like that?
There are simply too many issues that require the processor to deal with a ton of stuff which isn't its job.
Suppose that the processor DID have that feature. We can work out a system where we know how many bytes are used by a given BigInt - just use the same principle as most string libraries and record the length.
But what would happen if the result of a BigInt operation exceeded the amount of space reserved?
There are two options:
It'll wrap around inside the space it does have
or
It'll use more memory.
The thing is, if it did 1), then it's useless - you'd have to know how much space was required beforehand, and that's part of the reason you'd want to use a BigInt - so you're not limited by those things.
If it did 2), then it'll have to allocate that memory somehow. Memory allocation is not done in the same way across OSes, but even if it were, it would still have to update all pointers to the old value. How would it know what were pointers to the value, and what were simply integer values containing the same value as the memory address in question?
Binary Coded Decimal is a form of string math. The Intel x86 processors have opcodes for direct BCD arthmetic operations.
It would work like any other BigInt library, only (a lot) faster and at a lower level: Processor fetches one digit from the cache/RAM, adds it, and writes the result back again.
Almost all CPUs do have this built-in. You have to use a software loop around the relevant instructions, but that doesn't make it slower if the loop is efficient. (That's non-trivial on x86, due to partial-flag stalls, see below)
e.g. if x86 provided rep adc to do src += dst, taking 2 pointers and a length as input (like rep movsd to memcpy), it would still be implemented as a loop in microcode.
It would be possible for a 32bit x86 CPU to have an internal implementation of rep adc that used 64bit adds internally, since 32bit CPUs probably still have a 64bit adder. However, 64bit CPUs probably don't have a single-cycle latency 128b adder. So I don't expect that having a special instruction for this would give a speedup over what you can do with software, at least on a 64bit CPU.
Maybe a special wide-add instruction would be useful on a low-power low-clock-speed CPU where a really wide adder with single-cycle latency is possible.
The x86 instructions you're looking for are:
adc: add with carry / sbb: subtract with borrow
mul: full multiply, producing upper and lower halves of the result: e.g. 64b*64b => 128b
div: dividend is twice as wide as the other operands, e.g. 128b / 64b => 64b division.
Of course, adc works on binary integers, not single decimal digits. x86 can adc in 8, 16, 32, or 64bit chunks, unlike RISC CPUs which typically only adc at full register width. (GMP calls each chunk a "limb"). (x86 has some instructions for working with BCD or ASCII, but those instructions were dropped for x86-64.)
imul / idiv are the signed equivalents. Add works the same for signed 2's complement as for unsigned, so there's no separate instruction; just look at the relevant flags to detect signed vs. unsigned overflow. But for adc, remember that only the most-significant chunk has the sign bit; the rest are essential unsigned.
ADX and BMI/BMI2 add some instructions like mulx: full-multiply without touching flags, so it can be interleaved with an adc chain to create more instruction-level parallelism for superscalar CPUs to exploit.
In x86, adc is even available with a memory destination, so it performs exactly like you describe: one instruction triggers the whole read / modify / write of a chunk of the BigInteger. See example below.
Most high-level languages (including C/C++) don't expose a "carry" flag
Usually there aren't intrinsics add-with-carry directly in C. BigInteger libraries usually have to be written in asm for good performance.
However, Intel actually has defined intrinsics for adc (and adcx / adox).
unsigned char _addcarry_u64 (unsigned char c_in, unsigned __int64 a, \
unsigned __int64 b, unsigned __int64 * out);
So the carry result is handled as an unsigned char in C. For the _addcarryx_u64 intrinsic, it's up to the compiler to analyse the dependency chains and decide which adds to do with adcx and which to do with adox, and how to string them together to implement the C source.
IDK what the point of _addcarryx intrinsics are, instead of just having the compiler use adcx/adox for the existing _addcarry_u64 intrinsic, when there are parallel dep chains that can take advantage of it. Maybe some compilers aren't smart enough for that.
Here's an example of a BigInteger add function, in NASM syntax:
;;;;;;;;;;;; UNTESTED ;;;;;;;;;;;;
; C prototype:
; void bigint_add(uint64_t *dst, uint64_t *src, size_t len);
; len is an element-count, not byte-count
global bigint_add
bigint_add: ; AMD64 SysV ABI: dst=rdi, src=rsi, len=rdx
; set up for using dst as an index for src
sub rsi, rdi ; rsi -= dst. So orig_src = rsi + rdi
clc ; CF=0 to set up for the first adc
; alternative: peel the first iteration and use add instead of adc
.loop:
mov rax, [rsi + rdi] ; load from src
adc rax, [rdi] ; <================= ADC with dst
mov [rdi], rax ; store back into dst. This appears to be cheaper than adc [rdi], rax since we're using a non-indexed addressing mode that can micro-fuse
lea rdi, [rdi + 8] ; pointer-increment without clobbering CF
dec rdx ; preserves CF
jnz .loop ; loop while(--len)
ret
On older CPUs, especially pre-Sandybridge, adc will cause a partial-flag stall when reading CF after dec writes other flags. Looping with a different instruction will help for old CPUs which stall while merging partial-flag writes, but not be worth it on SnB-family.
Loop unrolling is also very important for adc loops. adc decodes to multiple uops on Intel, so loop overhead is a problem, esp if you have extra loop overhead from avoiding partial-flag stalls. If len is a small known constant, a fully-unrolled loop is usually good. (e.g. compilers just use add/adc to do a uint128_t on x86-64.)
adc with a memory destination appears not to be the most efficient way, since the pointer-difference trick lets us use a single-register addressing mode for dst. (Without that trick, memory-operands wouldn't micro-fuse).
According to Agner Fog's instruction tables for Haswell and Skylake, adc r,m is 2 uops (fused-domain) with one per 1 clock throughput, while adc m, r/i is 4 uops (fused-domain), with one per 2 clocks throughput. Apparently it doesn't help that Broadwell/Skylake run adc r,r/i as a single-uop instruction (taking advantage of ability to have uops with 3 input dependencies, introduced with Haswell for FMA). I'm also not 100% sure Agner's results are right here, since he didn't realize that SnB-family CPUs only micro-fuse indexed addressing modes in the decoders / uop-cache, not in the out-of-order core.
Anyway, this simple not-unrolled-at-all loop is 6 uops, and should run at one iteration per 2 cycles on Intel SnB-family CPUs. Even if it takes an extra uop for partial-flag merging, that's still easily less than the 8 fused-domain uops that can be issued in 2 cycles.
Some minor unrolling could get this close to 1 adc per cycle, since that part is only 4 uops. However, 2 loads and one store per cycle isn't quite sustainable.
Extended-precision multiply and divide are also possible, taking advantage of the widening / narrowing multiply and divide instructions. It's much more complicated, of course, due to the nature of multiplication.
It's not really helpful to use SSE for add-with carry, or AFAIK any other BigInteger operations.
If you're designing a new instruction-set, you can do BigInteger adds in vector registers if you have the right instructions to efficiently generate and propagate carry. That thread has some back-and-forth discussion on the costs and benefits of supporting carry flags in hardware, vs. having software generate carry-out like MIPS does: compare to detect unsigned wraparound, putting the result in another integer register.
Suppose the result of the multiplication needed 3 times the space (memory) to be stored - where would the processor store that result ? How would users of that result, including all pointers to it know that its size suddenly changed - and changing the size might need it to relocate it in memory cause extending the current location would clash with another variable.
This would create a lot of interaction between the processor, OS memory managment, and the compiler that would be hard to make both general and efficient.
Managing the memory of application types is not something the processor should do.
As I think, the main idea behind not including the bigint support in modern processors is the desire to reduce ISA and leave as few instructions as possible, that are fetched, decoded and executed at full throttle.
By the way, in x86 family processors there is a set of instructions that make writing big int library a single-day's matter.
Another reason, I think, is price. It's much more efficient to save some space on the wafer dropping the redundant operations, that can be easily implemented on the higher level.
Seems Intel is Adding (or has added as # time of this post - 2015) new Instructions Support for Large Integer Arithmetic.
New instructions are being introduced on Intel® Architecture
Processors to enable fast implementations of large integer arithmetic.
Large Integer Arithmetic is widely used in multi-precision libraries
for high-performance technical computing, as well as for public key
cryptography (e.g., RSA). In this paper, we describe the critical
operations required in large integer arithmetic and their efficient
implementations using the new instructions.
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/ia-large-integer-arithmetic-paper.html
There are so many instructions and functionalities jockeying for area on a CPU chip that in the end those that are used more often/deemed more useful will push out those that aren't. The instructions necessary for implementing BigInt functionality are there and the math is straight-forward.
BigInt: the fundamental function required is:
Unsigned Integer Multiplication, add previous high order
I wrote one in Intel 16bit assembler, then 32 bit...
C code is usually fast enough .. ie for BigInt you use a software library.
CPUs (and GPUs) are not designed with unsigned Integer as top priority.
If you want to write your own BigInt...
Division is done via Knuths Vol 2 (its a bunch of multiply and subtract, with some tricky add-backs)
Add with carry and subtract are easier. etc etc
I just posted this in Intel:
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
SSE4 is there a BigInt LIbrary?
i5 2410M processor I suppose can NOT use AVX [AVX is only on very recent Intel CPUs]
but can use SSE4.2
Is there a BigInt Library for SSE?
I Guess I am looking for something that implements unsigned integer
PMULUDQ (with 128-Bit operands)
PMULUDQ __m128i _mm_mul_epu32 ( __m128i a, __m128i b)
and does the carries.
Its a Laptop so I cant buy an NVIDIA GTX 550, which isnt so grand on unsigned Ints, I hear.
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx