Why does instruction cache alignment improve performance in set associative cache implementations? - cpu-architecture

I have a question regarding instruction cache alignment. I've heard that for micro-optimizations, aligning loops so that they fit inside a cache line can slightly improve performance. I don't see why that would do anything.
I understand the concept of cache hits and their importance in computing speed.
But it seems that in set associative caches, adjacent blocks of code will not be mapped to the same cache set. So if the loop crosses a code block the CPU should still get a cache hit since that adjacent block has not been evicted by the execution of the previous block. Both blocks are likely to remain cached during the loop.
So all I can figure is if there is truth in the claim that alignment can help, it must be from some sort of other effect.
Is there a cost in switching cache lines?
Is there a difference in cache hits, one where you get a hit and one where you hit the same cache line you're currently reading from?

Keeping a whole function (or the hot parts of a function, i.e. the fast path through it) in fewer cache lines reduces I-cache footprint. So it can reduce the number of cache misses, including on startup when most of the cache is cold. Having a loop end before the end of a cache line could give HW prefetching time to fetch the next one.
Accessing any line that's present in L1i cache takes takes the same amount of time. (Unless your cache uses way-prediction: that introduces the possibility of a "slow hit". See these slides for a mention and brief description of the idea. Apparently MIPS r10k's L2 cache used it, and so did Alpha 21264's L1 instruction cache with "branch target" vs. "sequential" ways in its 2-way associative 64kiB L1i. Or see any of the academic papers that come up when you google cache way prediction like I did.)
Other than that, the effects aren't so much about cache-line boundaries but rather aligned instruction-fetch blocks in superscalar CPUs. You were correct that the effects are not from things you were considering.
See Modern Microprocessors
A 90-Minute Guide! for an intro to superscalar (and out-of-order) execution.
Many superscalar CPUs do their first stage of instruction fetch using aligned accesses to their I-cache. Lets simplify by considering a RISC ISA with 4-byte instruction width1 and 4-wide fetch/decode/exec. (e.g. MIPS r10k, although IDK if some of the other stuff I'm going to make up reflects that microarch exactly).
...
.top_of_loop:
insn1 ; at address 16*n + 12
; 16-byte boundary here
insn2 ; at address 16*n + 0
insn3 ; at address 16*n + 4
b .top_of_loop ; at address 16*n + 8
... after loop ; at address 16*n + 12
... after loop ; at address 16*n + 0
Without any kind of loop buffer, the fetch stage has to fetch the loop instructions from I-cache one for every time it executes. But this takes a minimum of 2 cycles per iteration because the loop spans two 16-byte aligned fetch blocks. It's not capable of fetching the 16 bytes of instructions in one unaligned fetch.
But if we align the top of the loop, it can be fetched in a single cycle, allowing the loop to run at 1 cycle / iteration if the loop body doesn't have other bottlenecks.
...
nop ; at address 16*n + 12 ; NOP padding for alignment
.top_of_loop: ; 16-byte boundary here
insn1 ; at address 16*n + 0
insn2 ; at address 16*n + 4
insn3 ; at address 16*n + 8
b .top_of_loop ; at address 16*n + 12
... after loop ; at address 16*n + 0
... after loop ; at address 16*n + 4
With a larger loop that's not a multiple of 4 instructions, there's still going to a partially-wasted fetch somewhere. It's generally best that it's not the top of the loop, though. Getting more instructions into the pipeline sooner rather than later helps the CPU find and exploit more instruction-level parallelism, for code that isn't purely bottlenecked on instruction-fetch.
In general, aligning branch targets (including function entry points) by 16 can be a win (at the cost of greater I-cache pressure from lower code density). A useful tradeoff can be padding to the next multiple of 16 if you're within 1 or 2 instructions. e.g. so in the worst case, a fetch block contains at least 2 or 3 useful instructions, not just 1.
This is why the GNU assembler supports .p2align 4,,8 : pad to the next 2^4 boundary if it's 8 bytes away or closer. GCC does in fact use that directive for some targets / architectures, depending on tuning options / defaults.
In the general case for non-loop branches, you also don't want to jump near the end of a cache line. Then you might have another I-cache miss right away.
Footnote 1:
The principle also applies to modern x86 with its variable-width instructions, at least when they have decoded-uop cache misses forcing them to actually fetch x86 machine code from L1I-cache. And applies to older superscalar x86 like Pentium III or K8 without uop caches or loopback buffers (that can make loops efficient regardless of alignment).
But x86 decoding is so hard that it takes multiple pipeline stages, e.g. to some to simple find instruction boundaries and then feed groups of instructions to the decoders. Only the initial fetch-blocks are aligned and buffers between stages can hide bubbles from the decoders if pre-decode can catch up.
https://www.realworldtech.com/merom/4/ shows the details of Core2's front-end: 16-byte fetch blocks, same as PPro/PII/PIII, feeding a pre-decode stage that can scan up to 32 bytes and find boundaries between up to 6 instructions IIRC. That then feeds another buffer leading to the full decode stage which can decode up to 4 instructions (5 with macro-fusion of test or cmp + jcc) into up to 7 uops...
Agner Fog's microarch guide has some detailed info about optimizing x86 asm for fetch/decode bottlenecks on Pentium Pro/II vs. Core2 / Nehalem vs. Sandybridge-family, and AMD K8/K10 vs. Bulldozer vs. Ryzen.
Modern x86 doesn't always benefit from alignment. There are effects from code alignment but they're not usually simple and not always beneficial. Relative alignment of things can matter, but usually for things like which branches alias each other in branch predictor entries, or for how uops pack into the uop cache.

Related

How many clock cycles do the stages of a simple 5 stage processor take?

A 5 stage pipelined CPU has the following sequence of stages:
IF – Instruction fetch from instruction memory.
RD – Instruction decode and register read.
EX – Execute: ALU operation for data and address computation.
MA – Data memory access – for write access, the register read at RD state is
used.
WB – Register write back.
Now I know that an instruction fetch, for example, is from memory which can take 4 cycles (L1 cache) or up to ~150 cycles (RAM). However, in every pipelining diagram, I see something like this, where each stage is assigned a single cycle.
Now, I know of course real processors have complex pipelines with over 19 stages and every architecture is different. However, am I missing something here? With memory accesses in IF and MA, can this 5 stage pipeline take dozens of cycles?
Classic 5-stage RISC pipelines are designed around single-cycle latency L1d / L1i, allowing 1 IPC (instruction per clock) in code without cache misses or other stalls. i.e. the hopefully common / good case. Every stage must have a worst-case critical path latency of 1 cycle, or trigger a stall.
Clock speeds were lower back then (even relative to 1 gate delay) so you could get more done in a single cycle, and the caches were simpler, often 8k direct-mapped, single port, sometimes even virtually tagged (VIVT) so TLB lookup wasn't part of the access latency.
First-gen MIPS, the R2000 (and R3000), had on-chip controllers1 for its direct-mapped PIPT split L1i/L1d write-through caches, but the actual tags+data were off-chip, from 4K to 64K. Achieving the required single-cycle latency with this setup limited clock speeds to 15 MHz (R2000) or 33 MHz (R3000) with available SRAM technology. The TLB was fully on-chip.
vs. modern Intel/AMD using 32kiB 8-way VIPT L1d/L1i caches, with at least 2 read + 1 write port for L1d, at such high clock speed that access latency is 4 cycles best-case on Intel SnB-family, or 5 cycles including address-generation. Modern CPUs have larger TLBs, too, which also adds to the latency. This is ok when out-of-order execution and/or other techniques can usually hide that latency, but classic 5-stage RISCs just had one single pipeline, not separately pipelined memory access. See also Cycles/cost for L1 Cache hit vs. Register on x86? for some more links about how performance on modern superscalar out-of-order exec x86 CPUs differs from classic-RISC CPUs.
If you wanted to raise clock speeds for the same transistor performance (gate delay), you'd divide the fetch and mem stages into multiple pipeline stages (i.e. pipeline them more heavily), if cache access was even on the critical path (i.e. if cache access could no longer be done in one clock period). The downside of lengthening the pipeline is raising branch latency (cost of a mispredict, and the amount of latency a correct prediction has to hide), as well as raising total transistor cost.
Note that classic-RISC pipelines do address-generation in the EX stage, using the ALU there to calculate register + immediate, the only addressing mode supported by most RISC ISAs build around such a pipeline. So load-use latency is effectively 2 cycles for pointer-chasing, due to the load delay for forwarding back to EX.)
On a cache miss, the entire pipeline would just stall: those early pipelines lacked scoreboarding of loads to allow hit-under-miss or miss-under-miss for loads from L1d cache.
MIPS R2000 did have a 4-entry store buffer to decouple execution from cache-miss stores. (Apparently built from 4 separate R2020 write-buffer chips, according to wikipedia.) The LSI datasheet says the write-buffer chips were optional, but with write-through caches, every store has to go to DRAM and would create a stall without write buffering. Most modern CPUs use write-back caches, allowing multiple writes of the same line without creating DRAM traffic.
Also remember that CPU speed wasn't as high relative to memory for early CPUs like MIPS R2000, and single-core machines didn't need an interconnect between cores and memory controllers. (Although they maybe did have a frontside bus to a memory controller on a separate chip, a "northbridge".) But anyway, back then a cache miss to DRAM cost a lot fewer core clock cycles. It sucks to fully stall on every miss, but it wasn't like modern CPUs where it can be in the 150 to 350 cycles range (70 ns * 5 GHz). DRAM latency hasn't improved nearly as much as bandwidth and CPU clocks. See also http://www.lighterra.com/papers/modernmicroprocessors/ which has a "memory wall" section, and Why is the size of L1 cache smaller than that of the L2 cache in most of the processors? re: why modern CPUs need multi-level caches as the mismatch between CPU speed and memory latency has grown.
Later CPUs allowed progressively more memory-level parallelism by doing things like allowing execution to continue after a non-faulting load (successful TLB lookup), only stalling when you actually read a register that was last written by a load, if the load result isn't ready yet. This allows hiding load latency on a still-short and fairly simple in-order pipeline, with some number of load buffers to track outstanding loads. And with register renaming + OoO exec, the ROB size is basically the "window" over which you can hide cache-miss latency: https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/
Modern x86 CPUs even have buffers between pipeline stages in the front-end to hide or partially absorb fetch bubbles (caused by L1i misses, decode stalls, low-density code, e.g. a jump to another jump, or even just failure to predict a simple always-taken branch. i.e. only detecting it when it's eventually decoded, after fetching something other than the correct path. That's right, even unconditional branches like jmp foo need some prediction for the fetch stage.)
https://www.realworldtech.com/haswell-cpu/2/ has some good diagrams. Of course, Intel SnB-family and AMD Zen-family use a decoded-uop cache because x86 machine code is hard to decode in parallel, so often they can bypass some of that front-end complexity, effectively shortening the pipeline. (wikichip has block diagrams and microarchitecture details for Zen 2.)
See also Modern Microprocessors
A 90-Minute Guide! re: modern CPUs and the "memory wall": the increasing mismatch between DRAM latency and core clock cycle time. DRAM latency has only dropped a little bit (in absolute nanoseconds) as bandwidth has continued to climb tremendously in recent years.
Footnote 1: MIPS R2000 cache details:
An R2000 datasheet shows the D-cache was write-through, and various other interesting things.
According to a 1992 usenet message from an SGI engineer, the control logic just sends 18 index bits, receiving a word of data + 8 tags bits to determine hit or not. The CPU is oblivious to the cache size; you connect up the right number of index lines to SRAM address lines. (So I guess a line-size of one 4-byte word?)
You have to use at least 10 index bits because the tag is only 20 bits wide, and you need tag+index+2(byte-in-word) to be 32, the physical address-space size. That sets a minimum cache size of 4K.
20 bits of tag for every 32 bits of data is very inefficient. With a larger cache, fewer tag bits are actually needed, since more of the address is used up as part of the index. But Paul Ries posted that R2000/R3000 does not support comparing fewer tag bits. IDK if you could wire up some of the address output lines to the tag input lines, to generate matching bits instead of storing them in SRAMs.
A 32-byte cache line would still only need 20-bit tags (at most), but would have one tag per 8 words, a factor of 8 improvement in tag overhead. CPUs with larger caches, especially L2 caches, would definitely want to use larger line sizes.
But you're probably more likely to get conflict misses with fewer larger lines, especially with a direct-mapped cache. And the memory bus can still be busy filling a previous line when you encounter another miss, even if you have critical-word-first / early-restart so the miss latency wasn't worse if the memory bus was idle to start with.

What is the benefit of having the registers as a part of memory in AVR microcontrollers?

Larger memories have higher decoding delay; why is the register file a part of the memory then?
Does it only mean that the registers are "mapped" SRAM registers that are stored inside the microprocessor?
If not, what would be the benefit of using registers as they won't be any faster than accessing RAM? Furthermore, what would be the use of them at all? I mean these are just a part of the memory so I don't see the point of having them anymore. Having them would be just as costly as referencing memory.
The picture is taken from Avr Microcontroller And Embedded Systems The: Using Assembly and C by Muhammad Ali Mazidi, Sarmad Naimi, and Sepehr Naimi
AVR has some instructions with indirect addressing, for example LD (LDD) – Load Indirect From Data Space to Register using Z:
Loads one byte indirect with or without displacement from the data space to a register. [...]
The data location is pointed to by the Z (16-bit) Pointer Register in the Register File.
So now you can move from a register by loading its data-space address into Z, allowing indirect or indexed register-to-register moves. Certainly one can think of some usage where such indirect access would save the odd instruction.
what would be the benefit of using registers as they won't be any faster than accessing RAM?
accessing General purpose Registers is faster than accessing Ram
first of all let us define how fast measured in microControllers .... fast mean how many cycle the instruction will take to excute ... LOOk at the avr architecture
See the General Purpose Registers GPRs are input for the ALU , and the GPRs are controlled by instruction register (2 byte width) which holds the next instruction from the code memory.
Let us examine simple instruction ADD Rd , Rr; where Rd,Rr are any two register in GPRs so 0<=r,d<=31 so each of r and d could be rebresented in 5 bit,now open "AVR Instruction Set Manual" page number 32 look at the op-code for this simple add instraction is 000011rdddddrrrr and becuse this op-code is two byte(code memory width) this will fetched , Decoded and excuit in one cycle (under consept of pipline ofcourse) jajajajjj only one cycle seems cool to me
I mean these are just a part of the memory so I don't see the point of having them anymore. Having them would be just as costly as referencing memory
You suggest to make the all ram as input for the ALU; this is a very bad idea: a memory address takes 2 bytes.
If you have 2 operands per instruction as in Add instruction you will need 4 Byte for saving only the operands .. and 1 more byte for the op-code of the operator itself in total 5 byte which is waste of memory!
And furthermore this architecture could only fetch 2 bytes at a time (instruction register width) so you need to spend more cycles on fetching the code from code memory which is waste of cycles >> more slower system
Register numbers are only 4 or 5 bits wide, depending on the instruction, allowing 2 per instruction with room to spare in a 16-bit instruction word.
conclusion GPRs' existence are crucial for saving code memory and program execution time
Larger memories have higher decoding delay; why is the register file a part of the memory then?
When cpu deal with GPRs it only access the first 32 position not all the data space
Final comment
don't disturb yourself by time diagram for different ram technology because you don't have control on it ,so who has control? the architecture designers , and they put the limit of the maximum crystal frequency you can use with there architecture and everything will be fine .. you only concern about cycles consuming with your application

Indexed addressing mode and implied addressing mode

Indexed addressing mode is usually used for accessing arrays as arrays are stored contiguosly. We have a index register which gets incremented in every iteration which when added to base address gives the array element address.
I don't understand the actual need of this addressing mode. Why can't we do this with direct addressing ? We have the base address and we can just add 1 to it every time when accessing. Why do we need indexed addressing mode which has a overhead of index register ?
I am not sure about the instruction format for implied addressing mode. Suppose we have a instruction INC AC. Is the address of AC specified in the instruction or is there a special opcode which means 'INC AC' and we don't include the address of AC (accumulator)?
I don't understand the actual need of this addressing mode. Why can't we do this with direct addressing?
You can; MIPS only has one addressing mode and compilers can still generate code for it just fine. But sometimes it has to use an extra shift + add instruction to calculate an address (if it's not just looping through an array).
The point of addressing modes is to save instructions and save registers, especially in 2-operand instruction sets like x86, where add eax, ecx overwrites eax with the result (eax += ecx), unlike MIPS or other 3-instruction ISAs where addu $t2, $t1, $t0 does t2 = t1 + t0. On x86, that would require a copy (mov) and an add. (Or in that special case, lea edx, [eax+ecx]: x86 can copy-and-add (and shift) using the same instruction-encoding it uses for memory operands.)
Consider a histogram problem: you generate array indices in unpredictable order, and have to index an array. On x86-64, add dword [rbx + rdi*4], 1 will increment a 32-bit counter in memory using a single 4-byte instruction, which decodes to only 2 uops for the front-end to issue into the out-of-order core on modern Intel CPUs. (http://agner.org/optimize/). (rbx is the base register, rdi is a scaled index). Having a scaled index is very powerful; x86 16-bit addressing modes support 2 registers, but not a scaled index.
Classic MIPS only has separate shift and add instructions, although MIPS32 did add a scaled-add instruction for address calculation. That would save an instruction here. Being a load-store machine, the loads and stores always have to be separate instructions (unlike on x86 where that add decodes as a micro-fused load+add and a store. See INC instruction vs ADD 1: Does it matter?).
Probably ARM would be a better comparison for MIPS: It's also a load-store RISC machine. But it does have a selection of addressing modes, including scaled index using the barrel shifter. So instead of needing a separate shift / add for each array index, you'd use LDR R0, [R1, R2, LSL #2], add r0, r0, #1 / str with the same addressing mode.
Often when looping through an array, it is best to just increment pointers on x86. But it's also an option to use an index, especially for loops with multiple arrays using the same index, like C[i] = A[i] + B[i]. Indexed addressing mode can sometimes be slightly less efficient in hardware, though, so when a compiler is unrolling a loop it usually should use pointers, even though it has to increment all 3 pointers separately instead of one index.
The point of instruction-set design is not merely to be Turing complete, it's to enable efficient code that gets more work done with fewer clock cycles and/or smaller code-size, or give programmers the option of aiming for either of those goals.
The minimum threshold for a computer to be programmable is extremely low, see for example various One instruction set computer architectures. (None implemented for real, just designed on paper to show that it's possible to write programs with nothing but a subtract-and-branch-if-less-than-zero instruction, with memory operands encoded in the instruction.
There's a tradeoff between easy to decode (especially to decode in parallel) vs. compact. x86 is horrible because it evolved as a series of extensions, often without a lot of planning to leave room for future extensions. If you're interested in ISA design decisions, have a look at Agner Fog's blog for interesting discussion about designing an ISA for high-performance CPUs that combines the best of x86 (lots of work with one instruction, e.g. memory operand as part of an ALU instruction) with the best features of RISC (easy to decode, lots of registers): Proposal for an ideal extensible instruction set.
There's also a tradeoff in how you spend the bits in an instruction word, especially in a fixed instruction width ISA like most RISCs. Different ISAs made different choices.
PowerPC uses lots of the coding space for powerful bitfield instructions like rlwinm (rotate left and mask off a window of bits), and lots of opcodes. IDK if the generally unpronounceable and hard-to-remember mnemonics are related to that...
ARM uses the high 4 bits for predicated execution of any instruction based on condition codes. It uses more bits for the barrel shifter (the 2nd source operand is optionally shifted or rotated by an immediate or a count from another register).
MIPS has relatively large immediate operands, and is basically simple.
x86 32/64-bit addressing modes use a variable-length encoding, with an extra byte SIB (scale/index/base) byte when there's an index, and an optional disp8 or disp32 immediate displacement. (e.g. add esi, [rax + rdx + 12340] takes 2 + 1 + 4 bytes to encode, vs. 2 bytes for add esi, [rax].
x86 16-bit addressing modes are much more limited, and pack everything except the optional disp8/disp16 displacement into the ModR/M byte.
Suppose we have a instruction INC AC. Is the address of AC specified in the instruction or is there a special opcode which means 'INC AC' and we don't include the address of AC (accumulator)?
Yes, the machine-code format for some instructions in some ISAs includes implicit operands. Many machines have push / pop instructions that implicitly use a specific register as the stack pointer. For example, in x86-64's push rax, RAX is an explicit register operand (encoded in the low 3 bits of the one-byte opcode using the push r64 short form), while RSP is an implicit operand.
Older 8-bit CPUs often had instructions like DECA (to decrement the accumulator, A). i.e. there was a specific opcode for that register. This could be the same thing as having a DEC instruction with some bits in the opcode byte specifying which register (like x86 does before x86-64 repurposed the short INC/DEC encodings as REX prefixes: note the "N.E" (Not Encodeable) in the 64-bit mode column for dec r32). But if there's no regular pattern then it can definitely be considered an implicit operand.
Sometimes putting things into neat categories breaks down, so don't worry too much about whether using bits with the opcode byte counts as implicit or explicit for x86. It's a way of spending more opcode space to save code-size for commonly used instructions while still allowing use with different registers.
Some ISAs only use a certain register as the stack pointer by convention, with no implicit uses. MIPS is like this.
ARM32 (in ARM, not Thumb mode) also uses explicit operands in push/pop. Its push/pop mnemonics are just aliases for store-multiple decrement-before / load-multiple increment-after (LDMIA / STMDB) to implement a full-descending stack. See ARM's docs for LDM/STM which explains this, and what you can do with the general case of these instructions, e.g. LDMDB to decrement a pointer and then load (in the opposite direction of POP).

What is the advantage of having instructions in a uniform format?

Many processors have instructions which are of uniform format and width such as the ARM where all instructions are 32-bit long. other processors have instructions in multiple widths of say 2, 3, or 4 bytes long, such as 8086.
What is the advantage of having all instructions the same width and in a uniform format?
What is the advantage of having instructions in multiple widths?
Fixed Length Instruction Trade-offs
The advantages of fixed length instructions with a relatively uniform formatting is that fetching and parsing the instructions is substantially simpler.
For an implementation that fetches a single instruction per cycle, a single aligned memory (cache) access of the fixed size is guaranteed to provide one (and only one) instruction, so no buffering or shifting is required. There is also no concern about crossing a cache line or page boundary within a single instruction.
The instruction pointer is incremented by a fixed amount (except when executing control flow instructions--jumps and branches) independent of the instruction type, so the location of the next sequential instruction can be available early with minimal extra work (compared to having to at least partially decode the instruction). This also makes fetching and parsing more than one instruction per cycle relatively simple.
Having a uniform format for each instruction allows trivial parsing of the instruction into its components (immediate value, opcode, source register names, destination register name). Parsing out the source register names is the most timing critical; with these in fixed positions it is possible to begin reading the register values before the type of instruction has been determined. (This register reading is speculative since the operation might not actually use the values, but this speculation does not require any special recovery in the case of mistaken speculation but does take extra energy.) In the MIPS R2000's classic 5-stage pipeline, this allowed reading of the register values to be started immediately after instruction fetch providing half of a cycle to compare register values and resolve a branch's direction; with a (filled) branch delay slot this avoided stalls without branch prediction.
(Parsing out the opcode is generally a little less timing critical than source register names, but the sooner the opcode is extracted the sooner execution can begin. Simple parsing out of the destination register name makes detecting dependencies across instructions simpler; this is perhaps mainly helpful when attempting to execute more than one instruction per cycle.)
In addition to providing the parsing sooner, simpler encoding makes parsing less work (energy use and transistor logic).
A minor advantage of fixed length instructions compared to typical variable length encodings is that instruction addresses (and branch offsets) use fewer bits. This has been exploited in some ISAs to provide a small amount of extra storage for mode information. (Ironically, in cases like MIPS/MIPS16, to indicate a mode with smaller or variable length instructions.)
Fixed length instruction encoding and uniform formatting do have disadvantages. The most obvious disadvantage is relatively low code density. Instruction length cannot be set according to frequency of use or how much distinct information is required. Strict uniform formatting would also tend to exclude implicit operands (though even MIPS uses an implicit destination register name for the link register) and variable-sized operands (most RISC variable length encodings have short instructions that can only access a subset of the total number of registers).
(In a RISC-oriented ISA, this has the additional minor issue of not allowing more work to be bundled into an instruction to equalize the amount of information required by the instruction.)
Fixed length instructions also make using large immediates (constant operands included in the instruction) more difficult. Classic RISCs limited immediate lengths to 16-bits. If the constant is larger, it must either be loaded as data (which means an extra load instruction with its overhead of address calculation, register use, address translation, tag check, etc.) or a second instruction must provide the rest of the constant. (MIPS provides a load high immediate instruction, partially under the assumption that large constants are mainly used to load addresses which will later be used for accessing data in memory. PowerPC provides several operations using high immediates, allowing, e.g., the addition of a 32-bit immediate in two instructions.) Using two instructions is obviously more overhead than using a single instruction (though a clever implementation could fuse the two instructions in the front-end [What Intel calls macro-op fusion]).
Fixed length instructions also makes it more difficult to extend an instruction set while retaining binary compatibility (and not requiring addition modes of operation). Even strictly uniform formatting can hinder extension of an instruction set, particularly for increasing the number of registers available.
Fujitsu's SPARC64 VIIIfx is an interesting example. It uses a two-bit opcode (in its 32-bit instructions) to indicate a loading of a special register with two 15-bit instruction extensions for the next two instructions. These extensions provide extra register bits and indication of SIMD operation (i.e., extending the opcode space of the instruction to which the extension is applied). This means that the full register name of an instruction not only is not entirely in a fixed position, but not even in the same "instruction". (Similarities to x86's REX prefix--which provides bits to extend register names encoded in the main part of the instruction--might be noted.)
(One aspect of fixed length encodings is the tyranny of powers of two. Although it is possible to used non-power-of-two instruction lengths [Tensilica's XTensa now has fixed 24-bit instructions as its base ISA--with 16-bit short instruction support being an extension, previously they were part of the base ISA; IBM had an experimental ISA with 40-bit instructions.], such adds a little complexity. If one size, e.g., 32bits, is a little too short, the next available size, e.g., 64 bits, is likely to be too long, sacrificing too much code density.)
For implementations with deep pipelines the extra time required for parsing instructions is less significant. The extra dynamic work done by hardware and the extra design complexity are reduced in significance for high performance implementations which add sophisticated branch prediction, out-of-order execution, and other features.
Variable Length Instruction Trade-offs
For variable length instructions, the trade-offs are essentially reversed.
Greater code density is the most obvious advantage. Greater code density can improve static code size (the amount of storage needed for a given program). This is particularly important for some embedded systems, especially microcontrollers, since it can be a large fraction of the system cost and influence the system's physical size (which has impact on fitness for purpose and manufacturing cost).
Improving dynamic code size reduces the amount of bandwidth used to fetch instructions (both from memory and from cache). This can reduce cost and energy use and can improve performance. Smaller dynamic code size also reduces the size of caches needed for a given hit rate; smaller caches can use less energy and less chip area and can have lower access latency.
(In a non- or minimally pipelined implementation with a narrow memory interface, fetching only a portion of an instruction in a cycle in some cases does not hurt performance as much as it would in a more pipelined design less limited by fetch bandwidth.)
With variable length instructions, large constants can be used in instructions without requiring all instructions to be large. Using an immediate rather than loading a constant from data memory exploits spatial locality, provides the value earlier in the pipeline, avoids an extra instruction, and removed a data cache access. (A wider access is simpler than multiple accesses of the same total size.)
Extending the instruction set is also generally easier given support for variable length instructions. Addition information can be included by using extra long instructions. (In the case of some encoding techniques--particularly using prefixes--, it is also possible to add hint information to existing instructions allowing backward compatibility with additional new information. x86 has exploited this not only to provide branch hints [which are mostly unused] but also the Hardware Lock Elision extension. For a fixed length encoding, it would be difficult to choose in advance which operations should have additional opcodes reserved for possible future addition of hint information.)
Variable length encoding clearly makes finding the start of the next sequential instruction more difficult. This is somewhat less of a problem for implementations
that only decode one instruction per cycle, but even in that case it adds extra
work for the hardware (which can increase cycle time or pipeline length as well as use more energy). For wider decode several tricks are available to reduce the cost of parsing out individual instructions from a block of instruction memory.
One technique that has mainly been used microarchitecturally (i.e., not included in the interface exposed to software but only an implementation technique) is to use marker bits to indicate the start or end of an instruction. Such marker bits would be set for each parcel of instruction encoding and stored in the instruction cache. Such delays the availability of such information on a instruction cache miss, but this delay is typically small compared to the ordinary delay in filling a cache miss. The extra (pre)decoding work is only needed on a cache miss, so time and energy is saved in the common case of a cache hit (at the cost of some extra storage and bandwidth which has some energy cost).
(Several AMD x86 implementations have used marker bit techniques.)
Alternatively, marker bits could be included in the instruction encoding. This places some constrains on opcode assignment and placement since the marker bits effectively become part of the opcode.
Another technique, used by the IBM zSeries (S/360 and descendants), is to encode the instruction length in a simple way in the opcode in the first parcel. The zSeries uses two bits to encode three different instruction lengths (16, 32, and 48 bits) with two encodings used for 16 bit length. By placing this in a fixed position, it is relatively easy to quickly determine where the next sequential instruction begins.
(More aggressive predecoding is also possible. The Pentium 4 used a trace cache containing fixed-length micro-ops and recent Intel processors use a micro-op cache with [presumably] fixed-length micro-ops.)
Obviously, variable length encodings require addressing at the granularity of a parcel which is typically smaller than an instruction for a fixed-length ISA. This means that branch offsets either lose some range or must use more bits. This can be compensated by support for more different immediate sizes.
Likewise, fetching a single instruction can be more complex since the start of the instruction is likely to not be aligned to a larger power of two. Buffering instruction fetch reduces the impact of this, but adds (trivial) delay and complexity.
With variable length instructions it is also more difficult to have uniform encoding. This means that part of the opcode must often be decoded before the basic parsing of the instruction can be started. This tends to delay the availability of register names and other, less critical information. Significant uniformity can still be obtained, but it requires more careful design and weighing of trade-offs (which are likely to change over the lifetime of the ISA).
As noted earlier, with more complex implementations (deeper pipelines, out-of-order execution, etc.), the extra relative complexity of handling variable length instructions is reduced. After instruction decode, a sophisticated implementation of an ISA with variable length instructions tends to look very similar to one of an ISA with fixed length instructions.
It might also be noted that much of the design complexity for variable length instructions is a one-time cost; once an organization has learned techniques (including the development of validation software) to handle the quirks, the cost of this complexity is lower for later implementations.
Because of the code density concerns for many embedded systems, several RISC ISAs provide variable length encodings (e.g., microMIPS, Thumb2). These generally only have two instruction lengths, so the additional complexity is constrained.
Bundling as a Compromise Design
One (sort of intermediate) alternative chosen for some ISAs is to use a fixed length bundle of instructions with different length instructions. By containing instructions in a bundle, each bundle has the advantages of a fixed length instruction and the first instruction in each bundle has a fixed, aligned starting position. The CDC 6600 used 60-bit bundles with 15-bit and 30-bit operations. The M32R uses 32-bit bundles with 16-bit and 32-bit instructions.
(Itanium uses fixed length power-of-two bundles to support non-power of two [41-bit] instructions and has a few cases where two "instructions" are joined to allow 64-bit immediates. Heidi Pan's [academic] Heads and Tails encoding used fixed length bundles to encode fixed length base instruction parts from left to right and variable length chunks from right to left.)
Some VLIW instruction sets use a fixed size instruction word but individual operation slots within the word can be a different (but fixed for the particular slot) length. Because different operation types (corresponding to slots) have different information requirements, using different sizes for different slots is sensible. This provides the advantages of fixed size instructions with some code density benefit. (In addition, a slot might be allocated to optionally provide an immediate to one of the operations in the instruction word.)

Why doesn't my processor have built-in BigInt support?

As far as I understood it, BigInts are usually implemented in most programming languages as arrays containing digits, where, eg.: when adding two of them, each digit is added one after another like we know it from school, e.g.:
246
816
* *
----
1062
Where * marks that there was an overflow. I learned it this way at school and all BigInt adding functions I've implemented work similar to the example above.
So we all know that our processors can only natively manage ints from 0 to 2^32 / 2^64.
That means that most scripting languages in order to be high-level and offer arithmetics with big integers, have to implement/use BigInt libraries that work with integers as arrays like above.
But of course this means that they'll be far slower than the processor.
So what I've asked myself is:
Why doesn't my processor have a built-in BigInt function?
It would work like any other BigInt library, only (a lot) faster and at a lower level: Processor fetches one digit from the cache/RAM, adds it, and writes the result back again.
Seems like a fine idea to me, so why isn't there something like that?
There are simply too many issues that require the processor to deal with a ton of stuff which isn't its job.
Suppose that the processor DID have that feature. We can work out a system where we know how many bytes are used by a given BigInt - just use the same principle as most string libraries and record the length.
But what would happen if the result of a BigInt operation exceeded the amount of space reserved?
There are two options:
It'll wrap around inside the space it does have
or
It'll use more memory.
The thing is, if it did 1), then it's useless - you'd have to know how much space was required beforehand, and that's part of the reason you'd want to use a BigInt - so you're not limited by those things.
If it did 2), then it'll have to allocate that memory somehow. Memory allocation is not done in the same way across OSes, but even if it were, it would still have to update all pointers to the old value. How would it know what were pointers to the value, and what were simply integer values containing the same value as the memory address in question?
Binary Coded Decimal is a form of string math. The Intel x86 processors have opcodes for direct BCD arthmetic operations.
It would work like any other BigInt library, only (a lot) faster and at a lower level: Processor fetches one digit from the cache/RAM, adds it, and writes the result back again.
Almost all CPUs do have this built-in. You have to use a software loop around the relevant instructions, but that doesn't make it slower if the loop is efficient. (That's non-trivial on x86, due to partial-flag stalls, see below)
e.g. if x86 provided rep adc to do src += dst, taking 2 pointers and a length as input (like rep movsd to memcpy), it would still be implemented as a loop in microcode.
It would be possible for a 32bit x86 CPU to have an internal implementation of rep adc that used 64bit adds internally, since 32bit CPUs probably still have a 64bit adder. However, 64bit CPUs probably don't have a single-cycle latency 128b adder. So I don't expect that having a special instruction for this would give a speedup over what you can do with software, at least on a 64bit CPU.
Maybe a special wide-add instruction would be useful on a low-power low-clock-speed CPU where a really wide adder with single-cycle latency is possible.
The x86 instructions you're looking for are:
adc: add with carry / sbb: subtract with borrow
mul: full multiply, producing upper and lower halves of the result: e.g. 64b*64b => 128b
div: dividend is twice as wide as the other operands, e.g. 128b / 64b => 64b division.
Of course, adc works on binary integers, not single decimal digits. x86 can adc in 8, 16, 32, or 64bit chunks, unlike RISC CPUs which typically only adc at full register width. (GMP calls each chunk a "limb"). (x86 has some instructions for working with BCD or ASCII, but those instructions were dropped for x86-64.)
imul / idiv are the signed equivalents. Add works the same for signed 2's complement as for unsigned, so there's no separate instruction; just look at the relevant flags to detect signed vs. unsigned overflow. But for adc, remember that only the most-significant chunk has the sign bit; the rest are essential unsigned.
ADX and BMI/BMI2 add some instructions like mulx: full-multiply without touching flags, so it can be interleaved with an adc chain to create more instruction-level parallelism for superscalar CPUs to exploit.
In x86, adc is even available with a memory destination, so it performs exactly like you describe: one instruction triggers the whole read / modify / write of a chunk of the BigInteger. See example below.
Most high-level languages (including C/C++) don't expose a "carry" flag
Usually there aren't intrinsics add-with-carry directly in C. BigInteger libraries usually have to be written in asm for good performance.
However, Intel actually has defined intrinsics for adc (and adcx / adox).
unsigned char _addcarry_u64 (unsigned char c_in, unsigned __int64 a, \
unsigned __int64 b, unsigned __int64 * out);
So the carry result is handled as an unsigned char in C. For the _addcarryx_u64 intrinsic, it's up to the compiler to analyse the dependency chains and decide which adds to do with adcx and which to do with adox, and how to string them together to implement the C source.
IDK what the point of _addcarryx intrinsics are, instead of just having the compiler use adcx/adox for the existing _addcarry_u64 intrinsic, when there are parallel dep chains that can take advantage of it. Maybe some compilers aren't smart enough for that.
Here's an example of a BigInteger add function, in NASM syntax:
;;;;;;;;;;;; UNTESTED ;;;;;;;;;;;;
; C prototype:
; void bigint_add(uint64_t *dst, uint64_t *src, size_t len);
; len is an element-count, not byte-count
global bigint_add
bigint_add: ; AMD64 SysV ABI: dst=rdi, src=rsi, len=rdx
; set up for using dst as an index for src
sub rsi, rdi ; rsi -= dst. So orig_src = rsi + rdi
clc ; CF=0 to set up for the first adc
; alternative: peel the first iteration and use add instead of adc
.loop:
mov rax, [rsi + rdi] ; load from src
adc rax, [rdi] ; <================= ADC with dst
mov [rdi], rax ; store back into dst. This appears to be cheaper than adc [rdi], rax since we're using a non-indexed addressing mode that can micro-fuse
lea rdi, [rdi + 8] ; pointer-increment without clobbering CF
dec rdx ; preserves CF
jnz .loop ; loop while(--len)
ret
On older CPUs, especially pre-Sandybridge, adc will cause a partial-flag stall when reading CF after dec writes other flags. Looping with a different instruction will help for old CPUs which stall while merging partial-flag writes, but not be worth it on SnB-family.
Loop unrolling is also very important for adc loops. adc decodes to multiple uops on Intel, so loop overhead is a problem, esp if you have extra loop overhead from avoiding partial-flag stalls. If len is a small known constant, a fully-unrolled loop is usually good. (e.g. compilers just use add/adc to do a uint128_t on x86-64.)
adc with a memory destination appears not to be the most efficient way, since the pointer-difference trick lets us use a single-register addressing mode for dst. (Without that trick, memory-operands wouldn't micro-fuse).
According to Agner Fog's instruction tables for Haswell and Skylake, adc r,m is 2 uops (fused-domain) with one per 1 clock throughput, while adc m, r/i is 4 uops (fused-domain), with one per 2 clocks throughput. Apparently it doesn't help that Broadwell/Skylake run adc r,r/i as a single-uop instruction (taking advantage of ability to have uops with 3 input dependencies, introduced with Haswell for FMA). I'm also not 100% sure Agner's results are right here, since he didn't realize that SnB-family CPUs only micro-fuse indexed addressing modes in the decoders / uop-cache, not in the out-of-order core.
Anyway, this simple not-unrolled-at-all loop is 6 uops, and should run at one iteration per 2 cycles on Intel SnB-family CPUs. Even if it takes an extra uop for partial-flag merging, that's still easily less than the 8 fused-domain uops that can be issued in 2 cycles.
Some minor unrolling could get this close to 1 adc per cycle, since that part is only 4 uops. However, 2 loads and one store per cycle isn't quite sustainable.
Extended-precision multiply and divide are also possible, taking advantage of the widening / narrowing multiply and divide instructions. It's much more complicated, of course, due to the nature of multiplication.
It's not really helpful to use SSE for add-with carry, or AFAIK any other BigInteger operations.
If you're designing a new instruction-set, you can do BigInteger adds in vector registers if you have the right instructions to efficiently generate and propagate carry. That thread has some back-and-forth discussion on the costs and benefits of supporting carry flags in hardware, vs. having software generate carry-out like MIPS does: compare to detect unsigned wraparound, putting the result in another integer register.
Suppose the result of the multiplication needed 3 times the space (memory) to be stored - where would the processor store that result ? How would users of that result, including all pointers to it know that its size suddenly changed - and changing the size might need it to relocate it in memory cause extending the current location would clash with another variable.
This would create a lot of interaction between the processor, OS memory managment, and the compiler that would be hard to make both general and efficient.
Managing the memory of application types is not something the processor should do.
As I think, the main idea behind not including the bigint support in modern processors is the desire to reduce ISA and leave as few instructions as possible, that are fetched, decoded and executed at full throttle.
By the way, in x86 family processors there is a set of instructions that make writing big int library a single-day's matter.
Another reason, I think, is price. It's much more efficient to save some space on the wafer dropping the redundant operations, that can be easily implemented on the higher level.
Seems Intel is Adding (or has added as # time of this post - 2015) new Instructions Support for Large Integer Arithmetic.
New instructions are being introduced on Intel® Architecture
Processors to enable fast implementations of large integer arithmetic.
Large Integer Arithmetic is widely used in multi-precision libraries
for high-performance technical computing, as well as for public key
cryptography (e.g., RSA). In this paper, we describe the critical
operations required in large integer arithmetic and their efficient
implementations using the new instructions.
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/ia-large-integer-arithmetic-paper.html
There are so many instructions and functionalities jockeying for area on a CPU chip that in the end those that are used more often/deemed more useful will push out those that aren't. The instructions necessary for implementing BigInt functionality are there and the math is straight-forward.
BigInt: the fundamental function required is:
Unsigned Integer Multiplication, add previous high order
I wrote one in Intel 16bit assembler, then 32 bit...
C code is usually fast enough .. ie for BigInt you use a software library.
CPUs (and GPUs) are not designed with unsigned Integer as top priority.
If you want to write your own BigInt...
Division is done via Knuths Vol 2 (its a bunch of multiply and subtract, with some tricky add-backs)
Add with carry and subtract are easier. etc etc
I just posted this in Intel:
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
SSE4 is there a BigInt LIbrary?
i5 2410M processor I suppose can NOT use AVX [AVX is only on very recent Intel CPUs]
but can use SSE4.2
Is there a BigInt Library for SSE?
I Guess I am looking for something that implements unsigned integer
PMULUDQ (with 128-Bit operands)
PMULUDQ __m128i _mm_mul_epu32 ( __m128i a, __m128i b)
and does the carries.
Its a Laptop so I cant buy an NVIDIA GTX 550, which isnt so grand on unsigned Ints, I hear.
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx