I am studying the difference between CISC and RISC recently, and I've encountered into the term "Orthogonality". After doing some research, my understanding so far is that there are two "axes", addressing modes & operations, which are independent of each other, so it produces a maximum number of (#addressing modes * #operations) instructions in the ISA.
For CISC machine, which is a register-memory architecture, operands may come from register or memory and RISC a register-register(or load-store) one on the contrary.
So, what's the role of orthogonality playing in these two ISA? Is CISC more orthogonal than RISC or vice versa?
As the wiki describes, "Modern CPUs often simulate orthogonality in a preprocessing step before performing the actual tasks in a RISC-like core. This "simulated orthogonality" in general is a broader concept, encompassing the notions of decoupling and completeness in function libraries, like in the mathematical concept: an orthogonal function set is easy to use as a basis into expanded functions, ensuring that parts don’t affect another if we change one part." What does this paragraph mean? What is the preprocessing step, does it have anything to do with the microcode?
Any explanation are appreciated! Thanks a lot!
Maximizing total choices of possible instructions like a CISC is generally not what's meant. Instead it's more about being a simpler compiler target, without complex interactions in what makes an instruction legal or not. RISC machines are often highly orthogonal, and designed with being a compiler target in mind, not human programmers.
My understanding of the term is that orthogonality is more about any register being usable in any case where any other register is usable. Unlike x86 shl reg, cl where variable-count shifts require a specific register. (I know this is a RISC-V question, but the examples of non-orthogonality I know of come from other ISAs, primarily x86.)
And definitely not like 8086 (before 386), where if you needed to multiply, one of the operands had to be in the accumulator, AL or AX. And sign-extension was also only available there. 386 introduced movsx reg, r/m8 and r/m16. (And movzx, allowing easy and more efficient zero-extending of a byte from memory into SI or DI, without having to load 2 bytes and and si, 0x00ff.)
Even worse, 16-bit addressing modes only allow a few registers in very limited ways: [bp|bx] + [si|di] + disp0/8/16, vs. 32-bit addressing modes allowing stuff like lea eax, [ecx + ecx + 3] to use the same register twice, or address memory relative to the stack pointer without having to copy it to the base pointer (BP) register.
Or if some memory operands can use a certain addressing mode, can all memory operands use it? AArch64 ldp/stp (load-pair/store-pair of registers) I think has fewer available addressing modes than single-register loads, because it needs 5 extra bits for a second register number. Unlike ARM32 ldrd where the pair of registers is two contiguous registers, starting with an even number.
In general, the less interaction there is between a choice of one thing (like instruction) and the possible choices for another (a register), the more orthogonal.
One of the major benefits with this is being a simple compiler target. The most optimal code can more often be found with a greedy algorithm that only takes into a account one thing at a time, not interlocking tradeoffs. Not like x86-64 "if I use ECX instead of R9d for this variable, that'll save bytes in multiple instructions not needing REX prefix, but later mean I need an extra mov to copy a register for a shift count". (x86 BMI2 introduced variable-count shifts that can use a count from any register, like shlx ebx, eax, r15d)
Or far worse targeting 8086 or 286, where 16-bit addressing modes impose a lot more constraints on register allocation. And you'd more often you'd want to use instructions that needed their operands in specific registers, especially the accumulator.
But if you're not worried about every byte of code size, x86-64 is a fairly orthogonal ISA, usually you don't need to care about which register you use for what. One change in that direction beyond 386's important changes was making the low byte of every register addressable, like bpl, spl, sil, dil as the low bytes of RBP, RSP, RSI, RDI. (But those require REX prefixes, overlapping encodings with AH/CH/DH/BH which are only usable in instructions without REX prefixes.)
Another example of non-orthogonality is x86's notorious integer SIMD extensions, MMX and SSE2. Want to do minimum of unsigned integers 16 bytes at a time? In SSE2 we have pminub for unsigned byte elements. And pminsw, signed 16-bit elements. But no other combination of size and signedness until SSE4.1, several years later, which filled in the gaps allowing signed bytes and u16, as well as i32 and u32. And then AVX-512 added i64 and u64. Every min available always had a corresponding max, but other than that, SSE2 was highly non-orthogonal in that and many other ways, including signed/unsigned saturating add/sub, and pack of wider to narrower elements with signed or unsigned saturation. And FP vs. integer shuffles, e.g. there's no integer equivalent to shufps that takes two elements from one vector, two from another, using an immediate control operand. Fortunately for shuffles you can use FP shuffles on integer data.
x86 SIMD is still not very orthogonal in many ways, for example in integer multiply where not all combinations of element size are available for everything; 16-bit has 16x16 => 16-bit low half, signed high half, or unsigned high half. (And a widening multiply and horizontal-add, pmaddwd). 32-bit has signed and unsigned widening 32x32 => 64-bit, and with SSE4.1 also non-widening. 8-bit only has a multiply and horizontal-add where one operand is treated as signed, the other as unsigned.
Again, if I'm picking on x86 a lot, it's because it's what I know. And Intel painted a huge "kick me" sign on their back when they designed MMX and SSE2, only taking some steps to fix things later with SSE4.1. (I'm sure there are reasons for some of those choices, including transistor budget and opcode coding-space in x86's notoriously cramped machine-code.) But a lot of programs don't want to assume SSE4.1 as a requirement to run at all, even now, over a decade since the first SSE4.1 CPUs. Most other SIMD ISAs are more orthogonal than x86, like ARM NEON or PowerPC AltiVec.
Anyway, in general, it's more orthogonal if all operations are available in all combinations of size and signedness that exist for any operation. This isn't always a big deal for compilers per-se, more for humans not realizing that a compiler could make their code faster if this variable was unsigned or something.
Modern CPUs often simulate orthogonality in a preprocessing step before performing the actual tasks in a RISC-like core
That sounds like they're talking about decoding to uops, but I don't see how that would gain orthogonality.
Unless they're counting the concept of any instruction allowing a memory source operand as being more orthogonal. Normally you wouldn't, being a load/store architecture is basically a fixed constraint that doesn't make other things harder.
But if you do consider that more orthogonal, then yes, decoding add eax, [rdi] to 2 uops lets it run on a back-end that separates the load work from the store work, like a RISC.
I hadn't heard this term orthogonal instruction set before, however:
The VAX is perhaps the epitome of CISC. The VAX supports many addressing modes, ranging from register itself, to memory specified by various indexing computations (some including pointer advancement, so as to do *p++ or *--p).
The VAX allows all addressing modes for all operands of any instruction. Further, the VAX allows both 2 operand and 3 operand instructions, so addl2 is operand2 += operand1, and addl3 is operand3 = operand1 + operand2.
Basically it can encode a lot of stuff in a single instruction, so we can do for example, a[i] = *p++ + b[j]; in one instruction, assuming a, b, i, j, and p are in registers.
Other CISC-style processors limit the encoding, for example, so that we can only do two-operand instructions (no 3 operand), and some even limit the 2nd operand to a register, so only one memory operand. I believe this is what they're getting at with the term orthogonal or not.
Meanwhile, a RISC processor instead follows a load/store architecture. Access to memory is not allowed for any operand, but rather only via load and store instructions, and only with those instructions are there addressing modes. Most all arithmetic operations (except the add for addressing) happen between registers alone. (In some sense the RISC philosophy has an orthogonality since all arithmetic operations work on registers alone.)
I don't think the term orthogonality is of high value. I wouldn't dwell on the term itself, but rather take away from that article the comparison between CISC ala VAX, vs. others CISC, vs. RISC.
#Peter also makes a good points, such as that certain registers being hard code (i.e. an implicit source/target) in some architectures for some instructions, which reduces orthogonality.
By that point I might stress that RISC architectures generally don't hard code registers, though MIPS hard codes the return address register ($31) for the jal instruction whereas RISC V does not ($sp and $ra are hard coded but only in the compressed instruction extension). Whereas some CISC architectures (except VAX) hard code more registers.
The MC68000 divides the registers into two sets of 8: addressing registers and data registers, which helps encoding by providing 16 registers with only 3-bit register fields, but also limits what you can do with them (and there aren't enough address registers, since one is the stack pointer and another the global pointer, leaving only 6).
CISC architecture often support byte vs. word sized arithmetic, whereas RISC architectures usually support only word sized arithmetic, so if you want byte, you have to simulate it (i.e. with range check or other).
Related
I've been wondering.. It's called SIMD as in single instruction multiple data. So why does it have single data instructions?
For example, vaddss is the single data equivalent of the multiple data vaddps. Just about every SIMD instruction have a single data version.
Why?
Why does SIMD have single data instructions when it's called SIMD?
It isn't a SIMD instruction in that sense
vaddss is a scalar FP math instruction that operates on data in the FP/SIMD registers (XMM0..15). It exists because x87 is not a very convenient compiler target with its stack-based registers that often need fxch, and other quirks. Intel added a new way to do scalar FP math along with SSE1 (float) and SSE2 (double), which is fortunately baseline for x86-64 so everyone can just use it.
People who call that a SIMD instruction are talking about one of:
Which registers it operates on. (XMM0 is 16 bytes wide and clearly a SIMD register, even when you only care about the low element holding a scalar value.)
The fact that it's an AVX instruction, so it was introduced with an ISA extension that was primarily aimed at SIMD usage, and thus is called a SIMD extension or instruction set.
Which also means it uses the MXCSR for rounding mode and FP exception recording / unmasking, and the kinds of exceptions it can take are the same as other SSE/AVX instructions which Intel documents as "SIMD Floating-Point Exceptions" as concise terminology to distinguish it from legacy x87.
Or they're talking about the use-case of doing something to just the low element when the high elements have actual data. (Quite rare, but something you could do. Maybe more likely with sd scalar double, where the low double is one half of an XMM register.)
Or they're just plain wrong if they actually mean it in terms of Flynn's taxonomy of SISD vs. SIMD vs. MIMD etc. I highly doubt anyone would actually mean that, though. The ss and sd scalar FP math instructions are SISD, single-instruction single-data. And BTW, they only exist for FP math; x86 already has instructions like add eax, ecx for scalar integer math, and doesn't have scalar versions of paddb or even xorps.
One reason for having separate scalar FP math instructions is that using addps would also operate on whatever garbage might be in the high elements of XMM registers. This can raise extra FP exceptions (usually masked, so only recorded in MXCSR (fenv.h), but if unmasked would trap to the OS.)
With the upper elements all 0.0 (which isn't required by the calling convention, BTW), addps wouldn't raise any extra exceptions, but divps would divide by zero.
With non-zero garbage like small integers, it might be a bit-pattern for a subnormal float, or a result might be subnormal, causing huge slowdowns (factor of ~100) as the CPU takes a microcode assist to get handle subnormal input or output in many cases (or when SSE1 was new in Pentium III, probably all cases of subnormals). Unless you set FTZ and DAZ (flush to zero, denormal are zero) like gcc -ffast-math does.
For instructions like xorps or paddq which don't do actual FP math, no FP exceptions or microcode assists are possible. You can just use them even if you only care about the low 32 or 64 bits of an XMM.
MMX or SSE2 had occasional uses in 32-bit code for doing scalar 64-bit integer math, with zeros or garbage in the upper bytes. MMX paddq mm0, mm1 is a SISD instruction, but SSE2 paddq xmm0, xmm1 is a SIMD instruction.
SSE1 was new in Pentium 3, where the SIMD execution units and registers were only 64 bits wide. addps decoded to 2 uops; addss decoded to 1. So there was a performance motivation, too, even in the best case.
This is also likely the reason for Intel's unfortunate design where sqrtss and cvtsi2ss and others merge into the destination, requiring either spending extra front-end bandwidth on xor-zeroing, or risking false dependencies: Why does adding an xorps instruction make this function using cvtsi2ss and addss ~5x faster? . It's a short-sighted design decision to make them single-uop on Pentium 3, which they unfortunately followed in SSE2 for double precision, and stuck to for AVX and AVX-512 when they had a chance to introduce better versions with different semantics. At least the AVX versions take a 2nd source register to merge with, so you can pick a "cold" reg as a workaround, see my answer on the linked duplicate.
It's normal for scalar FP to share registers with SIMD
It isn't necessary or useful to have yet another set of registers for scalar FP, and sharing with the x87 FPU or the general-purpose integer registers would each be worse for separate reasons.
It's totally normal on other ISAs for the SIMD registers to overlap or be the same as the scalar FP registers; Some ISAs (like ARM) that didn't have weirdo designs like x87 didn't need new architectural state to introduce SIMD. e.g. ARM's NEON q0..q15 16-byte registers map to pairs of d0..d31 double-precision FP registers that existed with VFPv3.
(I'm not sure if the partial-register aliasing was actually common in SIMD extensions for other ISAs, though. Probably some introduced new architectural state, or just used FP double-precision registers as 64-bit integer SIMD instead of 128-bit.)
In an OS kernel you often talk about saving "FPU state" on context switch (as opposed to just the general-purpose integer registers), and these days that's short-hand for FPU and SIMD state. e.g. in the Linux kernel, you need to use kernel_fpu_begin() before running instructions that use XMM/YMM/ZMM registers. (e.g. in the RAID5 / RAID6 drivers).
Indexed addressing mode is usually used for accessing arrays as arrays are stored contiguosly. We have a index register which gets incremented in every iteration which when added to base address gives the array element address.
I don't understand the actual need of this addressing mode. Why can't we do this with direct addressing ? We have the base address and we can just add 1 to it every time when accessing. Why do we need indexed addressing mode which has a overhead of index register ?
I am not sure about the instruction format for implied addressing mode. Suppose we have a instruction INC AC. Is the address of AC specified in the instruction or is there a special opcode which means 'INC AC' and we don't include the address of AC (accumulator)?
I don't understand the actual need of this addressing mode. Why can't we do this with direct addressing?
You can; MIPS only has one addressing mode and compilers can still generate code for it just fine. But sometimes it has to use an extra shift + add instruction to calculate an address (if it's not just looping through an array).
The point of addressing modes is to save instructions and save registers, especially in 2-operand instruction sets like x86, where add eax, ecx overwrites eax with the result (eax += ecx), unlike MIPS or other 3-instruction ISAs where addu $t2, $t1, $t0 does t2 = t1 + t0. On x86, that would require a copy (mov) and an add. (Or in that special case, lea edx, [eax+ecx]: x86 can copy-and-add (and shift) using the same instruction-encoding it uses for memory operands.)
Consider a histogram problem: you generate array indices in unpredictable order, and have to index an array. On x86-64, add dword [rbx + rdi*4], 1 will increment a 32-bit counter in memory using a single 4-byte instruction, which decodes to only 2 uops for the front-end to issue into the out-of-order core on modern Intel CPUs. (http://agner.org/optimize/). (rbx is the base register, rdi is a scaled index). Having a scaled index is very powerful; x86 16-bit addressing modes support 2 registers, but not a scaled index.
Classic MIPS only has separate shift and add instructions, although MIPS32 did add a scaled-add instruction for address calculation. That would save an instruction here. Being a load-store machine, the loads and stores always have to be separate instructions (unlike on x86 where that add decodes as a micro-fused load+add and a store. See INC instruction vs ADD 1: Does it matter?).
Probably ARM would be a better comparison for MIPS: It's also a load-store RISC machine. But it does have a selection of addressing modes, including scaled index using the barrel shifter. So instead of needing a separate shift / add for each array index, you'd use LDR R0, [R1, R2, LSL #2], add r0, r0, #1 / str with the same addressing mode.
Often when looping through an array, it is best to just increment pointers on x86. But it's also an option to use an index, especially for loops with multiple arrays using the same index, like C[i] = A[i] + B[i]. Indexed addressing mode can sometimes be slightly less efficient in hardware, though, so when a compiler is unrolling a loop it usually should use pointers, even though it has to increment all 3 pointers separately instead of one index.
The point of instruction-set design is not merely to be Turing complete, it's to enable efficient code that gets more work done with fewer clock cycles and/or smaller code-size, or give programmers the option of aiming for either of those goals.
The minimum threshold for a computer to be programmable is extremely low, see for example various One instruction set computer architectures. (None implemented for real, just designed on paper to show that it's possible to write programs with nothing but a subtract-and-branch-if-less-than-zero instruction, with memory operands encoded in the instruction.
There's a tradeoff between easy to decode (especially to decode in parallel) vs. compact. x86 is horrible because it evolved as a series of extensions, often without a lot of planning to leave room for future extensions. If you're interested in ISA design decisions, have a look at Agner Fog's blog for interesting discussion about designing an ISA for high-performance CPUs that combines the best of x86 (lots of work with one instruction, e.g. memory operand as part of an ALU instruction) with the best features of RISC (easy to decode, lots of registers): Proposal for an ideal extensible instruction set.
There's also a tradeoff in how you spend the bits in an instruction word, especially in a fixed instruction width ISA like most RISCs. Different ISAs made different choices.
PowerPC uses lots of the coding space for powerful bitfield instructions like rlwinm (rotate left and mask off a window of bits), and lots of opcodes. IDK if the generally unpronounceable and hard-to-remember mnemonics are related to that...
ARM uses the high 4 bits for predicated execution of any instruction based on condition codes. It uses more bits for the barrel shifter (the 2nd source operand is optionally shifted or rotated by an immediate or a count from another register).
MIPS has relatively large immediate operands, and is basically simple.
x86 32/64-bit addressing modes use a variable-length encoding, with an extra byte SIB (scale/index/base) byte when there's an index, and an optional disp8 or disp32 immediate displacement. (e.g. add esi, [rax + rdx + 12340] takes 2 + 1 + 4 bytes to encode, vs. 2 bytes for add esi, [rax].
x86 16-bit addressing modes are much more limited, and pack everything except the optional disp8/disp16 displacement into the ModR/M byte.
Suppose we have a instruction INC AC. Is the address of AC specified in the instruction or is there a special opcode which means 'INC AC' and we don't include the address of AC (accumulator)?
Yes, the machine-code format for some instructions in some ISAs includes implicit operands. Many machines have push / pop instructions that implicitly use a specific register as the stack pointer. For example, in x86-64's push rax, RAX is an explicit register operand (encoded in the low 3 bits of the one-byte opcode using the push r64 short form), while RSP is an implicit operand.
Older 8-bit CPUs often had instructions like DECA (to decrement the accumulator, A). i.e. there was a specific opcode for that register. This could be the same thing as having a DEC instruction with some bits in the opcode byte specifying which register (like x86 does before x86-64 repurposed the short INC/DEC encodings as REX prefixes: note the "N.E" (Not Encodeable) in the 64-bit mode column for dec r32). But if there's no regular pattern then it can definitely be considered an implicit operand.
Sometimes putting things into neat categories breaks down, so don't worry too much about whether using bits with the opcode byte counts as implicit or explicit for x86. It's a way of spending more opcode space to save code-size for commonly used instructions while still allowing use with different registers.
Some ISAs only use a certain register as the stack pointer by convention, with no implicit uses. MIPS is like this.
ARM32 (in ARM, not Thumb mode) also uses explicit operands in push/pop. Its push/pop mnemonics are just aliases for store-multiple decrement-before / load-multiple increment-after (LDMIA / STMDB) to implement a full-descending stack. See ARM's docs for LDM/STM which explains this, and what you can do with the general case of these instructions, e.g. LDMDB to decrement a pointer and then load (in the opposite direction of POP).
In the RISC-V Instruction Set Manual, User-Level ISA, I couldn't understand section 2.3 Immediate Encoding Variants page 11.
There is four types of instruction formats R, I, S, and U, then there is a variants of S and U types which are SB and UJ which I suppose mean Branch and Jump as shown in figure 2.3. Then there is the types of Immediate produced by RISC-V instructions shown in figure 2.4.
So my questions are, why the SB and UJ are needed? and why shuffle the Immediate bits in that way? what does it mean to say "the Immediate produced by RISC-V instructions"? and how are they produced in this manner?
To speed up decoding, the base RISC-V ISA puts the most important fields in the same place in every instruction. As you can see in the instruction formats table,
The major opcode is always in bits 0-6.
The destination register, when present, is always in bits 7-11.
The first source register, when present, is always in bits 15-19.
The second source register, when present, is always in bits 20-24.
The other bits are used for the minor opcode or other data for the instruction (funct3 in bits 12-14 and funct7 in bits 25-31), and for the immediate. How many bits can be used for the immediate depends on how many register numbers are present in the instruction:
Instructions with one destination and two source registers (R-type) have no immediate, for instance adding two registers (ADD);
Instructions with one destination and one source register (I-type) have 12 bits for the immediate, for instance adding one register with an immediate (ADDI);
Instructions with two source registers and no destination register (S-type), for instance the store instructions, have also 12 bits for the immediate, but they have to be in a different place since the register numbers are also in a different place;
Finally, instructions with only a destination register and no minor opcode (U-type), for instance LUI, can use 20 bits for the immediate (the major opcode and the destination register number together need 12 bits).
Now think from the other point of view, of the instructions which will use these immediate values. The simplest users, I-immediate and S-immediate, need only a sign-extended 12-bit value. The U-immediate instructions need the immediate in the upper 20 bits of a 32-bit value. Finally, the branch/jump instructions need the sign-extended immediate in the lower bits of the value, except for the lowest bit which will always be zero, since RISC-V instructions are always aligned to even addresses.
But why are the immediate bits shuffled? Think this time about the physical circuit which decodes the immediate field. Since it's a hardware implementation, the bits will be decoded in parallel; each bit in the output immediate will have a multiplexer to select which input bit it comes from. The bigger the multiplexer, the costlier and slower it is.
The "shuffling" of the immediate bits in the instruction encoding, therefore, is to make each output immediate bit have as little input instruction bit options as possible. For instance, immediate bit 1 can only come from instruction bits 8 (S-immediate or B-immediate), 21 (I-immediate or J-immediate), or constant zero (U-immediate or R-type instruction which has no immediate). Immediate bit 0 can come from instruction bits 7 (S-immediate), 20 (I-immediate), or constant zero. Immediate bit 5 can only come from instruction bit 25 or constant zero. And so on.
Instruction bit 31 is a special case: for RV-64, bits 32-63 of the immediate are always copies of instruction bit 31. This high fan-out adds a delay, which would be even bigger if it also needed a multiplexer, so it only has one option (other than constant zero, which can be treated later in the pipeline by ignoring the whole immediate).
It's also interesting to note that only the major opcode (bits 0-6) is needed to know how to decode the immediate, so immediate decoding can be done in parallel with decoding the rest of the instruction.
So, answering the questions:
SB-type doubles the range of branches, since instructions are always aligned to even addresses;
UJ-type has the same overall instruction format as U-type, but the immediate value is in the lower bits instead of the upper bits;
The immediate bits are shuffled to reduce the cost of decoding the immediate value, by reducing the number of choices for each output immediate bit;
The "immediate produced by RISC-V instructions" table shows the different kinds of immediate values which can be decoded from a RISC-V instruction, and from where in the instruction each bit comes from;
They are produced by, for each output immediate bit, using the major opcode (bits 0-6) to chose an input instruction bit.
The encoding is done to try and make the actual hardware implementation as simple as possible, rather than make it easy for the reader to understand at a glance.
In practice the compiler will generate the output and so it does not matter if it is not easy for the user to understand.
When possible the SB type tries to use the same bits for the same immediate bit positions as type S, that minimizes the hardware design complexity. So imm[4:1] and imm[10:5] are in the same place for both. The top most bit of the immediate values is always at position 31 so that you can use that bit to decide if a sign extension is needed. Again, this makes the hardware easier because for multiple types of instruction the top bit is used to decide on sign extension.
The RISC-V instruction encoding is chosen to simplify the decoder
2.2 Base Instruction Formats
The RISC-V ISA keeps the source (rs1 and rs2) and destination (rd) registers at the same position in all formats to simplify decoding. Except for the 5-bit immediates used in CSR instructions(Chapter 9), immediates are always sign-extended, and are generally packed towards the left most available bits in the instruction and have been allocated to reduce hardware complexity. In particular, the sign bit for all immediates is always in bit 31 of the instruction to speed sign-extension circuitry.
2.3 Immediate Encoding Variants
The only difference between the S and B formats is that the 12-bit immediate field is used to encode branch offsets in multiples of 2 in the B format. Instead of shifting all bits in the instruction-encoded immediate left by one in hardware as is conventionally done, the middle bits (imm[10:1]) and sign bit stay in fixed positions, while the lowest bit in S format (inst[7]) encodes a high-order bit in B format.
Similarly, the only difference between the U and J formats is that the 20-bit immediate is shiftedleft by 12 bits to form U immediates and by 1 bit to form J immediates. The location of instructionbits in the U and J format immediates is chosen to maximize overlap with the other formats andwith each other.
https://riscv.org/technical/specifications/
The reason for the shuffling of the immediate in SB/UL formats has also been explained in the RISC-V spec
Although more complex implementations might have separate adders for branch and jump calculations and so would not benefit from keeping the location of immediate bits constant across types of instruction, we wanted to reduce the hardware cost of the simplest implementations. By rotating bits in the instruction encoding of B and J immediates instead of using dynamic hard-ware muxes to multiply the immediate by 2, we reduce instruction signal fanout and immediate mux costs by around a factor of 2. The scrambled immediate encoding will add negligible timeto static or ahead-of-time compilation. For dynamic generation of instructions, there is some small additional overhead, but the most common short forward branches have straight forward immediate encodings.
Many processors have instructions which are of uniform format and width such as the ARM where all instructions are 32-bit long. other processors have instructions in multiple widths of say 2, 3, or 4 bytes long, such as 8086.
What is the advantage of having all instructions the same width and in a uniform format?
What is the advantage of having instructions in multiple widths?
Fixed Length Instruction Trade-offs
The advantages of fixed length instructions with a relatively uniform formatting is that fetching and parsing the instructions is substantially simpler.
For an implementation that fetches a single instruction per cycle, a single aligned memory (cache) access of the fixed size is guaranteed to provide one (and only one) instruction, so no buffering or shifting is required. There is also no concern about crossing a cache line or page boundary within a single instruction.
The instruction pointer is incremented by a fixed amount (except when executing control flow instructions--jumps and branches) independent of the instruction type, so the location of the next sequential instruction can be available early with minimal extra work (compared to having to at least partially decode the instruction). This also makes fetching and parsing more than one instruction per cycle relatively simple.
Having a uniform format for each instruction allows trivial parsing of the instruction into its components (immediate value, opcode, source register names, destination register name). Parsing out the source register names is the most timing critical; with these in fixed positions it is possible to begin reading the register values before the type of instruction has been determined. (This register reading is speculative since the operation might not actually use the values, but this speculation does not require any special recovery in the case of mistaken speculation but does take extra energy.) In the MIPS R2000's classic 5-stage pipeline, this allowed reading of the register values to be started immediately after instruction fetch providing half of a cycle to compare register values and resolve a branch's direction; with a (filled) branch delay slot this avoided stalls without branch prediction.
(Parsing out the opcode is generally a little less timing critical than source register names, but the sooner the opcode is extracted the sooner execution can begin. Simple parsing out of the destination register name makes detecting dependencies across instructions simpler; this is perhaps mainly helpful when attempting to execute more than one instruction per cycle.)
In addition to providing the parsing sooner, simpler encoding makes parsing less work (energy use and transistor logic).
A minor advantage of fixed length instructions compared to typical variable length encodings is that instruction addresses (and branch offsets) use fewer bits. This has been exploited in some ISAs to provide a small amount of extra storage for mode information. (Ironically, in cases like MIPS/MIPS16, to indicate a mode with smaller or variable length instructions.)
Fixed length instruction encoding and uniform formatting do have disadvantages. The most obvious disadvantage is relatively low code density. Instruction length cannot be set according to frequency of use or how much distinct information is required. Strict uniform formatting would also tend to exclude implicit operands (though even MIPS uses an implicit destination register name for the link register) and variable-sized operands (most RISC variable length encodings have short instructions that can only access a subset of the total number of registers).
(In a RISC-oriented ISA, this has the additional minor issue of not allowing more work to be bundled into an instruction to equalize the amount of information required by the instruction.)
Fixed length instructions also make using large immediates (constant operands included in the instruction) more difficult. Classic RISCs limited immediate lengths to 16-bits. If the constant is larger, it must either be loaded as data (which means an extra load instruction with its overhead of address calculation, register use, address translation, tag check, etc.) or a second instruction must provide the rest of the constant. (MIPS provides a load high immediate instruction, partially under the assumption that large constants are mainly used to load addresses which will later be used for accessing data in memory. PowerPC provides several operations using high immediates, allowing, e.g., the addition of a 32-bit immediate in two instructions.) Using two instructions is obviously more overhead than using a single instruction (though a clever implementation could fuse the two instructions in the front-end [What Intel calls macro-op fusion]).
Fixed length instructions also makes it more difficult to extend an instruction set while retaining binary compatibility (and not requiring addition modes of operation). Even strictly uniform formatting can hinder extension of an instruction set, particularly for increasing the number of registers available.
Fujitsu's SPARC64 VIIIfx is an interesting example. It uses a two-bit opcode (in its 32-bit instructions) to indicate a loading of a special register with two 15-bit instruction extensions for the next two instructions. These extensions provide extra register bits and indication of SIMD operation (i.e., extending the opcode space of the instruction to which the extension is applied). This means that the full register name of an instruction not only is not entirely in a fixed position, but not even in the same "instruction". (Similarities to x86's REX prefix--which provides bits to extend register names encoded in the main part of the instruction--might be noted.)
(One aspect of fixed length encodings is the tyranny of powers of two. Although it is possible to used non-power-of-two instruction lengths [Tensilica's XTensa now has fixed 24-bit instructions as its base ISA--with 16-bit short instruction support being an extension, previously they were part of the base ISA; IBM had an experimental ISA with 40-bit instructions.], such adds a little complexity. If one size, e.g., 32bits, is a little too short, the next available size, e.g., 64 bits, is likely to be too long, sacrificing too much code density.)
For implementations with deep pipelines the extra time required for parsing instructions is less significant. The extra dynamic work done by hardware and the extra design complexity are reduced in significance for high performance implementations which add sophisticated branch prediction, out-of-order execution, and other features.
Variable Length Instruction Trade-offs
For variable length instructions, the trade-offs are essentially reversed.
Greater code density is the most obvious advantage. Greater code density can improve static code size (the amount of storage needed for a given program). This is particularly important for some embedded systems, especially microcontrollers, since it can be a large fraction of the system cost and influence the system's physical size (which has impact on fitness for purpose and manufacturing cost).
Improving dynamic code size reduces the amount of bandwidth used to fetch instructions (both from memory and from cache). This can reduce cost and energy use and can improve performance. Smaller dynamic code size also reduces the size of caches needed for a given hit rate; smaller caches can use less energy and less chip area and can have lower access latency.
(In a non- or minimally pipelined implementation with a narrow memory interface, fetching only a portion of an instruction in a cycle in some cases does not hurt performance as much as it would in a more pipelined design less limited by fetch bandwidth.)
With variable length instructions, large constants can be used in instructions without requiring all instructions to be large. Using an immediate rather than loading a constant from data memory exploits spatial locality, provides the value earlier in the pipeline, avoids an extra instruction, and removed a data cache access. (A wider access is simpler than multiple accesses of the same total size.)
Extending the instruction set is also generally easier given support for variable length instructions. Addition information can be included by using extra long instructions. (In the case of some encoding techniques--particularly using prefixes--, it is also possible to add hint information to existing instructions allowing backward compatibility with additional new information. x86 has exploited this not only to provide branch hints [which are mostly unused] but also the Hardware Lock Elision extension. For a fixed length encoding, it would be difficult to choose in advance which operations should have additional opcodes reserved for possible future addition of hint information.)
Variable length encoding clearly makes finding the start of the next sequential instruction more difficult. This is somewhat less of a problem for implementations
that only decode one instruction per cycle, but even in that case it adds extra
work for the hardware (which can increase cycle time or pipeline length as well as use more energy). For wider decode several tricks are available to reduce the cost of parsing out individual instructions from a block of instruction memory.
One technique that has mainly been used microarchitecturally (i.e., not included in the interface exposed to software but only an implementation technique) is to use marker bits to indicate the start or end of an instruction. Such marker bits would be set for each parcel of instruction encoding and stored in the instruction cache. Such delays the availability of such information on a instruction cache miss, but this delay is typically small compared to the ordinary delay in filling a cache miss. The extra (pre)decoding work is only needed on a cache miss, so time and energy is saved in the common case of a cache hit (at the cost of some extra storage and bandwidth which has some energy cost).
(Several AMD x86 implementations have used marker bit techniques.)
Alternatively, marker bits could be included in the instruction encoding. This places some constrains on opcode assignment and placement since the marker bits effectively become part of the opcode.
Another technique, used by the IBM zSeries (S/360 and descendants), is to encode the instruction length in a simple way in the opcode in the first parcel. The zSeries uses two bits to encode three different instruction lengths (16, 32, and 48 bits) with two encodings used for 16 bit length. By placing this in a fixed position, it is relatively easy to quickly determine where the next sequential instruction begins.
(More aggressive predecoding is also possible. The Pentium 4 used a trace cache containing fixed-length micro-ops and recent Intel processors use a micro-op cache with [presumably] fixed-length micro-ops.)
Obviously, variable length encodings require addressing at the granularity of a parcel which is typically smaller than an instruction for a fixed-length ISA. This means that branch offsets either lose some range or must use more bits. This can be compensated by support for more different immediate sizes.
Likewise, fetching a single instruction can be more complex since the start of the instruction is likely to not be aligned to a larger power of two. Buffering instruction fetch reduces the impact of this, but adds (trivial) delay and complexity.
With variable length instructions it is also more difficult to have uniform encoding. This means that part of the opcode must often be decoded before the basic parsing of the instruction can be started. This tends to delay the availability of register names and other, less critical information. Significant uniformity can still be obtained, but it requires more careful design and weighing of trade-offs (which are likely to change over the lifetime of the ISA).
As noted earlier, with more complex implementations (deeper pipelines, out-of-order execution, etc.), the extra relative complexity of handling variable length instructions is reduced. After instruction decode, a sophisticated implementation of an ISA with variable length instructions tends to look very similar to one of an ISA with fixed length instructions.
It might also be noted that much of the design complexity for variable length instructions is a one-time cost; once an organization has learned techniques (including the development of validation software) to handle the quirks, the cost of this complexity is lower for later implementations.
Because of the code density concerns for many embedded systems, several RISC ISAs provide variable length encodings (e.g., microMIPS, Thumb2). These generally only have two instruction lengths, so the additional complexity is constrained.
Bundling as a Compromise Design
One (sort of intermediate) alternative chosen for some ISAs is to use a fixed length bundle of instructions with different length instructions. By containing instructions in a bundle, each bundle has the advantages of a fixed length instruction and the first instruction in each bundle has a fixed, aligned starting position. The CDC 6600 used 60-bit bundles with 15-bit and 30-bit operations. The M32R uses 32-bit bundles with 16-bit and 32-bit instructions.
(Itanium uses fixed length power-of-two bundles to support non-power of two [41-bit] instructions and has a few cases where two "instructions" are joined to allow 64-bit immediates. Heidi Pan's [academic] Heads and Tails encoding used fixed length bundles to encode fixed length base instruction parts from left to right and variable length chunks from right to left.)
Some VLIW instruction sets use a fixed size instruction word but individual operation slots within the word can be a different (but fixed for the particular slot) length. Because different operation types (corresponding to slots) have different information requirements, using different sizes for different slots is sensible. This provides the advantages of fixed size instructions with some code density benefit. (In addition, a slot might be allocated to optionally provide an immediate to one of the operations in the instruction word.)
As far as I understood it, BigInts are usually implemented in most programming languages as arrays containing digits, where, eg.: when adding two of them, each digit is added one after another like we know it from school, e.g.:
246
816
* *
----
1062
Where * marks that there was an overflow. I learned it this way at school and all BigInt adding functions I've implemented work similar to the example above.
So we all know that our processors can only natively manage ints from 0 to 2^32 / 2^64.
That means that most scripting languages in order to be high-level and offer arithmetics with big integers, have to implement/use BigInt libraries that work with integers as arrays like above.
But of course this means that they'll be far slower than the processor.
So what I've asked myself is:
Why doesn't my processor have a built-in BigInt function?
It would work like any other BigInt library, only (a lot) faster and at a lower level: Processor fetches one digit from the cache/RAM, adds it, and writes the result back again.
Seems like a fine idea to me, so why isn't there something like that?
There are simply too many issues that require the processor to deal with a ton of stuff which isn't its job.
Suppose that the processor DID have that feature. We can work out a system where we know how many bytes are used by a given BigInt - just use the same principle as most string libraries and record the length.
But what would happen if the result of a BigInt operation exceeded the amount of space reserved?
There are two options:
It'll wrap around inside the space it does have
or
It'll use more memory.
The thing is, if it did 1), then it's useless - you'd have to know how much space was required beforehand, and that's part of the reason you'd want to use a BigInt - so you're not limited by those things.
If it did 2), then it'll have to allocate that memory somehow. Memory allocation is not done in the same way across OSes, but even if it were, it would still have to update all pointers to the old value. How would it know what were pointers to the value, and what were simply integer values containing the same value as the memory address in question?
Binary Coded Decimal is a form of string math. The Intel x86 processors have opcodes for direct BCD arthmetic operations.
It would work like any other BigInt library, only (a lot) faster and at a lower level: Processor fetches one digit from the cache/RAM, adds it, and writes the result back again.
Almost all CPUs do have this built-in. You have to use a software loop around the relevant instructions, but that doesn't make it slower if the loop is efficient. (That's non-trivial on x86, due to partial-flag stalls, see below)
e.g. if x86 provided rep adc to do src += dst, taking 2 pointers and a length as input (like rep movsd to memcpy), it would still be implemented as a loop in microcode.
It would be possible for a 32bit x86 CPU to have an internal implementation of rep adc that used 64bit adds internally, since 32bit CPUs probably still have a 64bit adder. However, 64bit CPUs probably don't have a single-cycle latency 128b adder. So I don't expect that having a special instruction for this would give a speedup over what you can do with software, at least on a 64bit CPU.
Maybe a special wide-add instruction would be useful on a low-power low-clock-speed CPU where a really wide adder with single-cycle latency is possible.
The x86 instructions you're looking for are:
adc: add with carry / sbb: subtract with borrow
mul: full multiply, producing upper and lower halves of the result: e.g. 64b*64b => 128b
div: dividend is twice as wide as the other operands, e.g. 128b / 64b => 64b division.
Of course, adc works on binary integers, not single decimal digits. x86 can adc in 8, 16, 32, or 64bit chunks, unlike RISC CPUs which typically only adc at full register width. (GMP calls each chunk a "limb"). (x86 has some instructions for working with BCD or ASCII, but those instructions were dropped for x86-64.)
imul / idiv are the signed equivalents. Add works the same for signed 2's complement as for unsigned, so there's no separate instruction; just look at the relevant flags to detect signed vs. unsigned overflow. But for adc, remember that only the most-significant chunk has the sign bit; the rest are essential unsigned.
ADX and BMI/BMI2 add some instructions like mulx: full-multiply without touching flags, so it can be interleaved with an adc chain to create more instruction-level parallelism for superscalar CPUs to exploit.
In x86, adc is even available with a memory destination, so it performs exactly like you describe: one instruction triggers the whole read / modify / write of a chunk of the BigInteger. See example below.
Most high-level languages (including C/C++) don't expose a "carry" flag
Usually there aren't intrinsics add-with-carry directly in C. BigInteger libraries usually have to be written in asm for good performance.
However, Intel actually has defined intrinsics for adc (and adcx / adox).
unsigned char _addcarry_u64 (unsigned char c_in, unsigned __int64 a, \
unsigned __int64 b, unsigned __int64 * out);
So the carry result is handled as an unsigned char in C. For the _addcarryx_u64 intrinsic, it's up to the compiler to analyse the dependency chains and decide which adds to do with adcx and which to do with adox, and how to string them together to implement the C source.
IDK what the point of _addcarryx intrinsics are, instead of just having the compiler use adcx/adox for the existing _addcarry_u64 intrinsic, when there are parallel dep chains that can take advantage of it. Maybe some compilers aren't smart enough for that.
Here's an example of a BigInteger add function, in NASM syntax:
;;;;;;;;;;;; UNTESTED ;;;;;;;;;;;;
; C prototype:
; void bigint_add(uint64_t *dst, uint64_t *src, size_t len);
; len is an element-count, not byte-count
global bigint_add
bigint_add: ; AMD64 SysV ABI: dst=rdi, src=rsi, len=rdx
; set up for using dst as an index for src
sub rsi, rdi ; rsi -= dst. So orig_src = rsi + rdi
clc ; CF=0 to set up for the first adc
; alternative: peel the first iteration and use add instead of adc
.loop:
mov rax, [rsi + rdi] ; load from src
adc rax, [rdi] ; <================= ADC with dst
mov [rdi], rax ; store back into dst. This appears to be cheaper than adc [rdi], rax since we're using a non-indexed addressing mode that can micro-fuse
lea rdi, [rdi + 8] ; pointer-increment without clobbering CF
dec rdx ; preserves CF
jnz .loop ; loop while(--len)
ret
On older CPUs, especially pre-Sandybridge, adc will cause a partial-flag stall when reading CF after dec writes other flags. Looping with a different instruction will help for old CPUs which stall while merging partial-flag writes, but not be worth it on SnB-family.
Loop unrolling is also very important for adc loops. adc decodes to multiple uops on Intel, so loop overhead is a problem, esp if you have extra loop overhead from avoiding partial-flag stalls. If len is a small known constant, a fully-unrolled loop is usually good. (e.g. compilers just use add/adc to do a uint128_t on x86-64.)
adc with a memory destination appears not to be the most efficient way, since the pointer-difference trick lets us use a single-register addressing mode for dst. (Without that trick, memory-operands wouldn't micro-fuse).
According to Agner Fog's instruction tables for Haswell and Skylake, adc r,m is 2 uops (fused-domain) with one per 1 clock throughput, while adc m, r/i is 4 uops (fused-domain), with one per 2 clocks throughput. Apparently it doesn't help that Broadwell/Skylake run adc r,r/i as a single-uop instruction (taking advantage of ability to have uops with 3 input dependencies, introduced with Haswell for FMA). I'm also not 100% sure Agner's results are right here, since he didn't realize that SnB-family CPUs only micro-fuse indexed addressing modes in the decoders / uop-cache, not in the out-of-order core.
Anyway, this simple not-unrolled-at-all loop is 6 uops, and should run at one iteration per 2 cycles on Intel SnB-family CPUs. Even if it takes an extra uop for partial-flag merging, that's still easily less than the 8 fused-domain uops that can be issued in 2 cycles.
Some minor unrolling could get this close to 1 adc per cycle, since that part is only 4 uops. However, 2 loads and one store per cycle isn't quite sustainable.
Extended-precision multiply and divide are also possible, taking advantage of the widening / narrowing multiply and divide instructions. It's much more complicated, of course, due to the nature of multiplication.
It's not really helpful to use SSE for add-with carry, or AFAIK any other BigInteger operations.
If you're designing a new instruction-set, you can do BigInteger adds in vector registers if you have the right instructions to efficiently generate and propagate carry. That thread has some back-and-forth discussion on the costs and benefits of supporting carry flags in hardware, vs. having software generate carry-out like MIPS does: compare to detect unsigned wraparound, putting the result in another integer register.
Suppose the result of the multiplication needed 3 times the space (memory) to be stored - where would the processor store that result ? How would users of that result, including all pointers to it know that its size suddenly changed - and changing the size might need it to relocate it in memory cause extending the current location would clash with another variable.
This would create a lot of interaction between the processor, OS memory managment, and the compiler that would be hard to make both general and efficient.
Managing the memory of application types is not something the processor should do.
As I think, the main idea behind not including the bigint support in modern processors is the desire to reduce ISA and leave as few instructions as possible, that are fetched, decoded and executed at full throttle.
By the way, in x86 family processors there is a set of instructions that make writing big int library a single-day's matter.
Another reason, I think, is price. It's much more efficient to save some space on the wafer dropping the redundant operations, that can be easily implemented on the higher level.
Seems Intel is Adding (or has added as # time of this post - 2015) new Instructions Support for Large Integer Arithmetic.
New instructions are being introduced on Intel® Architecture
Processors to enable fast implementations of large integer arithmetic.
Large Integer Arithmetic is widely used in multi-precision libraries
for high-performance technical computing, as well as for public key
cryptography (e.g., RSA). In this paper, we describe the critical
operations required in large integer arithmetic and their efficient
implementations using the new instructions.
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/ia-large-integer-arithmetic-paper.html
There are so many instructions and functionalities jockeying for area on a CPU chip that in the end those that are used more often/deemed more useful will push out those that aren't. The instructions necessary for implementing BigInt functionality are there and the math is straight-forward.
BigInt: the fundamental function required is:
Unsigned Integer Multiplication, add previous high order
I wrote one in Intel 16bit assembler, then 32 bit...
C code is usually fast enough .. ie for BigInt you use a software library.
CPUs (and GPUs) are not designed with unsigned Integer as top priority.
If you want to write your own BigInt...
Division is done via Knuths Vol 2 (its a bunch of multiply and subtract, with some tricky add-backs)
Add with carry and subtract are easier. etc etc
I just posted this in Intel:
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
SSE4 is there a BigInt LIbrary?
i5 2410M processor I suppose can NOT use AVX [AVX is only on very recent Intel CPUs]
but can use SSE4.2
Is there a BigInt Library for SSE?
I Guess I am looking for something that implements unsigned integer
PMULUDQ (with 128-Bit operands)
PMULUDQ __m128i _mm_mul_epu32 ( __m128i a, __m128i b)
and does the carries.
Its a Laptop so I cant buy an NVIDIA GTX 550, which isnt so grand on unsigned Ints, I hear.
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx