Hardware Prefetcher prefetches single block or multiple blocks? - operating-system

From the information related to hardware prefetching here, hardware prefetching schemes
there are 3 types of hardware prefetching,
Prefetcher on miss : If there is a miss for block n, then it prefetches block (n+1). As per name reflects, if there is no miss for block n, this prefetcher will not prefetch block (n+1). [Also ONLY single block is prefetched].
Tagged Prefetcher : With every block , a tag is associated and as contrary to above prefetcher, this prefetcher will always prefetch block (n+1) whenever block n is accessed.
Prefetcher with degree K : Prefetch n+1, n+2,..... n+k block.
In some other link, definition of hardware prefetcher says it activates when hardware detects a stride, it prefetch the block according to the stride in advance to stop stall.
Now, my doubt is as follow
According to Hardware prefetcher on stride detection, Hardware prefetcher will prefetcher block at stride distance .
Question is will the hardware prefetcher, prefetch 1 block or 2 block or any no of blocks ?
Let me take one example. Suppose I am accessing 0,8,16,24,....Hardware prefetcher will detect a stride of 8 .
Now will it prefetch only block no 8,16,24 OR will it prefetch all blocks 0,1,2,...8 according to Prefetcher with degree K=8 [ 3rd type of prefetcher above]
If hardware prefetcher only prefetch 0,8,16,24 then at later time access of other blocks not effected due to hardware prefetching, otherwise there will be a impact on access time of other blocks [1,2,3,,,,,7 ] [9,10,11,....]
Here I will access randomly any blocks after access of 0,8,16,24 so there will be no stride detectable.
Any link or help will highly appreciated. Thanks in advance.

A non-unit stride-based prefetcher will fetch at the given stride and not prefetch intermediate blocks. The point of non-unit stride prefetch is to avoid excess cache pollution and bandwidth waste from prefetching unused blocks, so prefetching as if unit-stride was detected would be inappropriate.
A prefetcher sophisticated enough to handle strides is almost certain to provide more than one stream of stride sequences, so a second sequence (e.g., 1, 9, ...) could be detected and prefetch started while still prefetching along the first sequence. Hardware predicts the future based on past behavior. If behavior outside the first sequence is random, the hardware cannot accurately predict that the other blocks will be accessed soon. (Software prefetch could inform the hardware of such expected behavior.)
In addition, different prefetch engines and policies are likely to exist at different levels of cache. Such would mainly influence fetch ahead distance (to compensate for greater latency of accesses closer to memory), but prefetch engines farther from the processor are also likely to be more tolerant of latency (so more cleverness and greater storage and logic overhead can be applied; cache pollution is also somewhat less critical because outer levels of cache have greater capacity and associativity). (At the memory controller, prefetch within an active DRAM row can be less expensive than random accesses, especially when such avoids using burst chop which reads only half of a full burst.)

Related

Why are most cache line sizes designed to be 64 byte instead of 32/128byte now?

i found in linux, it shows my cpu's cache line size is 64 byte, and i realized 16/32/128 byte is existing, but most cpu are designed to 64 byte cache line size now. why not bigger or smaller?
It's a trade-off. Wider caches are more efficient (in terms of area/power for a given cache size), but result in more memory traffic for random (non-sequential/strided) access, and more false sharing contention between parallel caches.
if you have a memory access pattern that only needs a few bytes from each cache line (eg, iterating along a linked list that is scattered widely across memory), each access will need to pull an entire line into the cache. So doubling the line size will double the memory traffic.
if different CPUs, each with its own cache, are accessing memory on the same cache line, that line will have to "bounce" back and forth between the caches. Avoiding this means putting more padding between objects.
In both cases, these problems can be avoided by tuning the software to want memory in chunks that are multiples of the cache line size. The bigger the cache line size, the more work that is.
As Chris Dodd's answer points out, the sizing of cache lines involves trade-offs.
Larger cache lines reduce the number of tag bits per data byte, provide prefetching, and facilitate higher bandwidth (particularly at the memory and the L1 interfaces) at the cost of excessive prefetch (wasting bandwidth and cache capacity), false sharing, higher miss latency (especially without critical word first/early restart), and higher conflict misses (for smaller caches, with fewer sets the probability that more accesses than the associativity will map to a particular set increases). (Larger cache lines can also provide greater performance predictability by guaranteeing a cache hit within a larger address range and number of bytes.)
Modern systems would not noticeably benefit from such prefetching; configurable static prefetcher logic would provide the same behavior and dynamic prefetching can exploit variable resource availability (e.g., cache capacity and memory channel occupancy) and utility as well as provide more flexible prefetching (such as non-unit stride).
Tag overhead is not as significant a concern in terms of area for modern caches using SRAM for data as well as tags. (IBM's Power and zArchitecture implementations use eDRAM for outer cache data storage and SRAM for tags, which more than doubles the area cost of tags relative to data.) However, access latency and access energy are effected by the size of the tag arrays. For L1 caches, way prediction is more effective with larger cache lines both because there are fewer cache lines for a given cache capacity and because spatial locality tends to apply even beyond reasonable cache line sizes; only having to check one set of tags for a wider or larger number of accesses reduces the cost of higher bandwidth (this is most noticeable in GPUs which exploit spatial locality and sacrifice latency for bandwidth). For outer cache levels, phased tag-data access is often used (tags are checked before data access begins, saving energy — especially given higher associativity and miss rates); smaller tag arrays for a given capacity both reduce access energy and latency (especially for misses — 50% hit rates are not unheardof). (Note that one can use partial tags to provide early miss detection in the common case where a miss has no matching partial tags. Other filtering mechanisms are also possible.)
False sharing can be countered by using sectored caches, where more than one validity (or coherence state) entry is provided for each address portion of the tag. This provides an intermediate design point between larger cache lines with more frequent false sharing and smaller cache lines with higher tag overhead. Such also inherently supports reducing cache line fill delay. For traditional layouts, this has the substantial effective capacity cost of large cache lines when false sharing or less spatial locality is more common. For designs using indirection, such as proposed Non-Uniform Cache Architectures and V-Way caches, the capacity utilization issue can be reduced by allocating data storage at a finer granularity at the cost of more indirection pointer storage.
Larger cache lines provide three bandwidth benefits. The command overhead is less (address and action information is nearly constant — address is one bit smaller for each doubling in size) so the bandwidth overhead per data byte is lower; this is more significant for coherence traffic where many messages carry only metadata. (Obviously with more numerous coherence nodes false sharing can be more problematic acting against this advantage.) Other per-request overheads also do not scale with request size (e.g., DRAM row activation with random-access, close-on-completion row management). ECC (or check codes with retransmission) also have less per payload byte overhead with larger payloads (this can be used to store extra metadata while using commodity width memory modules).
A larger cache line also facilitates wider memory interfaces when burst length is fixed. Increasing DRAM burst length facilitates higher bandwidth; DDR5 moved to a burst length of 16, pushing DIMMs into using two 32-bit wide channels to be compatible with x86's de facto standardization on 64-byte cache lines. While this chance can be viewed positively as increasing available memory level parallelism (MLP)— doubling the number of channels and reducing DRAM bank conflicts — MLP is more important when relative latency of memory is greater (large on-chip caches and faster processing), thread-level parallelism is available (multicore and multithreading), and out-of-order execution (and multithreading) expose more memory accesses to hide latency. Multicore (when used with significant data memory sharing as opposed to multiprogramming or large chunk/stream communication such as pipeline-style mulithreading) also increases the importance of false sharing, further reducing the benefits of larger cache lines (beyond MLP benefits of narrower channels). With lower (on-chip) communication latency and near-unavoidability of multicore processors, multithreaded programming becomes more attractive.
For L1 caches the microarchitecture (and somewhat the ISA) can be influence cache line sizes. Higher frequency designs favor smaller capacity L1 caches both for latency and access energy, especially with less latency tolerance from out-of-order execution or skewed pipelines (where the execution pipeline phase is one or more stages delayed from the address generation phase).
The relative sizes of the various tradeoffs also depends on the workload and software design. Larger cache capacity caches targeting workloads that benefit from such reduce the excessive prefetch and conflict disadvantages of larger cache lines; higher associativity reduces the conflict disadvantage, but workloads that are more likely to have conflicts are less likely (in general) to benefit from spatial locality (and the conflict disadvantage is typically less important for outer cache levels). Pointer-chasing workloads tend to favor lower latency and thus lower capacity, favoring smaller cache lines (at least in L1).
Software design is a significant factor. Avoiding false sharing tends to increase padding as cache line size increases, discouraging larger cache lines. Once a cache line size assumption is established in a software community (which is somewhat segregated according to ISA, OS, and hardware/system vendor) the effect of legacy code and legacy conceptualizations constrains cache line size.
Speculation: x86's orientation toward generic software and personal computer uses (cost and workload characteristics biasing toward smaller caches and workload perhaps generally having lower spatial locality) probably biased the choice toward a smaller cache line than ISA/hardware vendors targeting workstation and server workloads with a higher expectation of software development effort. x86 has standardized on 64-byte cache lines, IBM POWER9 uses 128-byte cache blocks (divided into four sectors for L1 caches) and IBM z15 uses 256-byte cache blocks.
(Latency vs. hit rate, access energy, and other tradeoffs as well as software and programmer legacy seem to lead to less strict standardization on 32KiB L1 cache capacity. The performance impact for a smaller or larger cache can be less significant than for false sharing, so the software constraints are less significant than for cache line size.)

How many clock cycles do the stages of a simple 5 stage processor take?

A 5 stage pipelined CPU has the following sequence of stages:
IF – Instruction fetch from instruction memory.
RD – Instruction decode and register read.
EX – Execute: ALU operation for data and address computation.
MA – Data memory access – for write access, the register read at RD state is
used.
WB – Register write back.
Now I know that an instruction fetch, for example, is from memory which can take 4 cycles (L1 cache) or up to ~150 cycles (RAM). However, in every pipelining diagram, I see something like this, where each stage is assigned a single cycle.
Now, I know of course real processors have complex pipelines with over 19 stages and every architecture is different. However, am I missing something here? With memory accesses in IF and MA, can this 5 stage pipeline take dozens of cycles?
Classic 5-stage RISC pipelines are designed around single-cycle latency L1d / L1i, allowing 1 IPC (instruction per clock) in code without cache misses or other stalls. i.e. the hopefully common / good case. Every stage must have a worst-case critical path latency of 1 cycle, or trigger a stall.
Clock speeds were lower back then (even relative to 1 gate delay) so you could get more done in a single cycle, and the caches were simpler, often 8k direct-mapped, single port, sometimes even virtually tagged (VIVT) so TLB lookup wasn't part of the access latency.
First-gen MIPS, the R2000 (and R3000), had on-chip controllers1 for its direct-mapped PIPT split L1i/L1d write-through caches, but the actual tags+data were off-chip, from 4K to 64K. Achieving the required single-cycle latency with this setup limited clock speeds to 15 MHz (R2000) or 33 MHz (R3000) with available SRAM technology. The TLB was fully on-chip.
vs. modern Intel/AMD using 32kiB 8-way VIPT L1d/L1i caches, with at least 2 read + 1 write port for L1d, at such high clock speed that access latency is 4 cycles best-case on Intel SnB-family, or 5 cycles including address-generation. Modern CPUs have larger TLBs, too, which also adds to the latency. This is ok when out-of-order execution and/or other techniques can usually hide that latency, but classic 5-stage RISCs just had one single pipeline, not separately pipelined memory access. See also Cycles/cost for L1 Cache hit vs. Register on x86? for some more links about how performance on modern superscalar out-of-order exec x86 CPUs differs from classic-RISC CPUs.
If you wanted to raise clock speeds for the same transistor performance (gate delay), you'd divide the fetch and mem stages into multiple pipeline stages (i.e. pipeline them more heavily), if cache access was even on the critical path (i.e. if cache access could no longer be done in one clock period). The downside of lengthening the pipeline is raising branch latency (cost of a mispredict, and the amount of latency a correct prediction has to hide), as well as raising total transistor cost.
Note that classic-RISC pipelines do address-generation in the EX stage, using the ALU there to calculate register + immediate, the only addressing mode supported by most RISC ISAs build around such a pipeline. So load-use latency is effectively 2 cycles for pointer-chasing, due to the load delay for forwarding back to EX.)
On a cache miss, the entire pipeline would just stall: those early pipelines lacked scoreboarding of loads to allow hit-under-miss or miss-under-miss for loads from L1d cache.
MIPS R2000 did have a 4-entry store buffer to decouple execution from cache-miss stores. (Apparently built from 4 separate R2020 write-buffer chips, according to wikipedia.) The LSI datasheet says the write-buffer chips were optional, but with write-through caches, every store has to go to DRAM and would create a stall without write buffering. Most modern CPUs use write-back caches, allowing multiple writes of the same line without creating DRAM traffic.
Also remember that CPU speed wasn't as high relative to memory for early CPUs like MIPS R2000, and single-core machines didn't need an interconnect between cores and memory controllers. (Although they maybe did have a frontside bus to a memory controller on a separate chip, a "northbridge".) But anyway, back then a cache miss to DRAM cost a lot fewer core clock cycles. It sucks to fully stall on every miss, but it wasn't like modern CPUs where it can be in the 150 to 350 cycles range (70 ns * 5 GHz). DRAM latency hasn't improved nearly as much as bandwidth and CPU clocks. See also http://www.lighterra.com/papers/modernmicroprocessors/ which has a "memory wall" section, and Why is the size of L1 cache smaller than that of the L2 cache in most of the processors? re: why modern CPUs need multi-level caches as the mismatch between CPU speed and memory latency has grown.
Later CPUs allowed progressively more memory-level parallelism by doing things like allowing execution to continue after a non-faulting load (successful TLB lookup), only stalling when you actually read a register that was last written by a load, if the load result isn't ready yet. This allows hiding load latency on a still-short and fairly simple in-order pipeline, with some number of load buffers to track outstanding loads. And with register renaming + OoO exec, the ROB size is basically the "window" over which you can hide cache-miss latency: https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/
Modern x86 CPUs even have buffers between pipeline stages in the front-end to hide or partially absorb fetch bubbles (caused by L1i misses, decode stalls, low-density code, e.g. a jump to another jump, or even just failure to predict a simple always-taken branch. i.e. only detecting it when it's eventually decoded, after fetching something other than the correct path. That's right, even unconditional branches like jmp foo need some prediction for the fetch stage.)
https://www.realworldtech.com/haswell-cpu/2/ has some good diagrams. Of course, Intel SnB-family and AMD Zen-family use a decoded-uop cache because x86 machine code is hard to decode in parallel, so often they can bypass some of that front-end complexity, effectively shortening the pipeline. (wikichip has block diagrams and microarchitecture details for Zen 2.)
See also Modern Microprocessors
A 90-Minute Guide! re: modern CPUs and the "memory wall": the increasing mismatch between DRAM latency and core clock cycle time. DRAM latency has only dropped a little bit (in absolute nanoseconds) as bandwidth has continued to climb tremendously in recent years.
Footnote 1: MIPS R2000 cache details:
An R2000 datasheet shows the D-cache was write-through, and various other interesting things.
According to a 1992 usenet message from an SGI engineer, the control logic just sends 18 index bits, receiving a word of data + 8 tags bits to determine hit or not. The CPU is oblivious to the cache size; you connect up the right number of index lines to SRAM address lines. (So I guess a line-size of one 4-byte word?)
You have to use at least 10 index bits because the tag is only 20 bits wide, and you need tag+index+2(byte-in-word) to be 32, the physical address-space size. That sets a minimum cache size of 4K.
20 bits of tag for every 32 bits of data is very inefficient. With a larger cache, fewer tag bits are actually needed, since more of the address is used up as part of the index. But Paul Ries posted that R2000/R3000 does not support comparing fewer tag bits. IDK if you could wire up some of the address output lines to the tag input lines, to generate matching bits instead of storing them in SRAMs.
A 32-byte cache line would still only need 20-bit tags (at most), but would have one tag per 8 words, a factor of 8 improvement in tag overhead. CPUs with larger caches, especially L2 caches, would definitely want to use larger line sizes.
But you're probably more likely to get conflict misses with fewer larger lines, especially with a direct-mapped cache. And the memory bus can still be busy filling a previous line when you encounter another miss, even if you have critical-word-first / early-restart so the miss latency wasn't worse if the memory bus was idle to start with.

Why does instruction cache alignment improve performance in set associative cache implementations?

I have a question regarding instruction cache alignment. I've heard that for micro-optimizations, aligning loops so that they fit inside a cache line can slightly improve performance. I don't see why that would do anything.
I understand the concept of cache hits and their importance in computing speed.
But it seems that in set associative caches, adjacent blocks of code will not be mapped to the same cache set. So if the loop crosses a code block the CPU should still get a cache hit since that adjacent block has not been evicted by the execution of the previous block. Both blocks are likely to remain cached during the loop.
So all I can figure is if there is truth in the claim that alignment can help, it must be from some sort of other effect.
Is there a cost in switching cache lines?
Is there a difference in cache hits, one where you get a hit and one where you hit the same cache line you're currently reading from?
Keeping a whole function (or the hot parts of a function, i.e. the fast path through it) in fewer cache lines reduces I-cache footprint. So it can reduce the number of cache misses, including on startup when most of the cache is cold. Having a loop end before the end of a cache line could give HW prefetching time to fetch the next one.
Accessing any line that's present in L1i cache takes takes the same amount of time. (Unless your cache uses way-prediction: that introduces the possibility of a "slow hit". See these slides for a mention and brief description of the idea. Apparently MIPS r10k's L2 cache used it, and so did Alpha 21264's L1 instruction cache with "branch target" vs. "sequential" ways in its 2-way associative 64kiB L1i. Or see any of the academic papers that come up when you google cache way prediction like I did.)
Other than that, the effects aren't so much about cache-line boundaries but rather aligned instruction-fetch blocks in superscalar CPUs. You were correct that the effects are not from things you were considering.
See Modern Microprocessors
A 90-Minute Guide! for an intro to superscalar (and out-of-order) execution.
Many superscalar CPUs do their first stage of instruction fetch using aligned accesses to their I-cache. Lets simplify by considering a RISC ISA with 4-byte instruction width1 and 4-wide fetch/decode/exec. (e.g. MIPS r10k, although IDK if some of the other stuff I'm going to make up reflects that microarch exactly).
...
.top_of_loop:
insn1 ; at address 16*n + 12
; 16-byte boundary here
insn2 ; at address 16*n + 0
insn3 ; at address 16*n + 4
b .top_of_loop ; at address 16*n + 8
... after loop ; at address 16*n + 12
... after loop ; at address 16*n + 0
Without any kind of loop buffer, the fetch stage has to fetch the loop instructions from I-cache one for every time it executes. But this takes a minimum of 2 cycles per iteration because the loop spans two 16-byte aligned fetch blocks. It's not capable of fetching the 16 bytes of instructions in one unaligned fetch.
But if we align the top of the loop, it can be fetched in a single cycle, allowing the loop to run at 1 cycle / iteration if the loop body doesn't have other bottlenecks.
...
nop ; at address 16*n + 12 ; NOP padding for alignment
.top_of_loop: ; 16-byte boundary here
insn1 ; at address 16*n + 0
insn2 ; at address 16*n + 4
insn3 ; at address 16*n + 8
b .top_of_loop ; at address 16*n + 12
... after loop ; at address 16*n + 0
... after loop ; at address 16*n + 4
With a larger loop that's not a multiple of 4 instructions, there's still going to a partially-wasted fetch somewhere. It's generally best that it's not the top of the loop, though. Getting more instructions into the pipeline sooner rather than later helps the CPU find and exploit more instruction-level parallelism, for code that isn't purely bottlenecked on instruction-fetch.
In general, aligning branch targets (including function entry points) by 16 can be a win (at the cost of greater I-cache pressure from lower code density). A useful tradeoff can be padding to the next multiple of 16 if you're within 1 or 2 instructions. e.g. so in the worst case, a fetch block contains at least 2 or 3 useful instructions, not just 1.
This is why the GNU assembler supports .p2align 4,,8 : pad to the next 2^4 boundary if it's 8 bytes away or closer. GCC does in fact use that directive for some targets / architectures, depending on tuning options / defaults.
In the general case for non-loop branches, you also don't want to jump near the end of a cache line. Then you might have another I-cache miss right away.
Footnote 1:
The principle also applies to modern x86 with its variable-width instructions, at least when they have decoded-uop cache misses forcing them to actually fetch x86 machine code from L1I-cache. And applies to older superscalar x86 like Pentium III or K8 without uop caches or loopback buffers (that can make loops efficient regardless of alignment).
But x86 decoding is so hard that it takes multiple pipeline stages, e.g. to some to simple find instruction boundaries and then feed groups of instructions to the decoders. Only the initial fetch-blocks are aligned and buffers between stages can hide bubbles from the decoders if pre-decode can catch up.
https://www.realworldtech.com/merom/4/ shows the details of Core2's front-end: 16-byte fetch blocks, same as PPro/PII/PIII, feeding a pre-decode stage that can scan up to 32 bytes and find boundaries between up to 6 instructions IIRC. That then feeds another buffer leading to the full decode stage which can decode up to 4 instructions (5 with macro-fusion of test or cmp + jcc) into up to 7 uops...
Agner Fog's microarch guide has some detailed info about optimizing x86 asm for fetch/decode bottlenecks on Pentium Pro/II vs. Core2 / Nehalem vs. Sandybridge-family, and AMD K8/K10 vs. Bulldozer vs. Ryzen.
Modern x86 doesn't always benefit from alignment. There are effects from code alignment but they're not usually simple and not always beneficial. Relative alignment of things can matter, but usually for things like which branches alias each other in branch predictor entries, or for how uops pack into the uop cache.

Prefetch distance and degree of prefetch

what is the difference between prefetch distance and degree of prefetching?
Prefetching typically deals with entire cache lines. So a given prefetch request will bring in the cache line that would hold the specified address.
Due to the huge differences in memory speeds, it can take many cycles to bring data into the cache. Some latencies are in the dozens of cycles if not longer. Now, the only way to really benefit from a prefetch is to issue it far enough ahead of the actual use of the data so that there's enough time for the machine to pull the data into the cache. This implies that data access be predictable so one can anticipate what memory needs to be in the cache. The simplest case is marching through a linear array. Now, a common scenario (in 'scientific code') is a loop that reads the data then processes it. The cache miss penalty may be high and the processor may be very fast, and simply prefetching the next cache line may not be sufficient as we may be finished processing the array corresponding to the current cache line and waiting for the data in the neighbouring cache line before the data has arrived at the cache. So we may have to fetch further away than the next cache line.
How far ahead you prefetch is the distance e.g. 512 bytes. The degree of prefetching is the distance in terms of cache lines i.e. if your cache line is 256 bytes, the degree of prefetching is 2.
Prefetching degree is the number of cache lines to prefetch at each trigger.
Prefetching distance is the concept from array within loop. D = ceil(l / s), l is average memory latency in terms of number of cycles, s is cycle time of shortest execution path. D is be number of iterations ahead for a certain array element, so that memory latency is covered.

Why are CPU registers fast to access?

Register variables are a well-known way to get fast access (register int i). But why are registers on the top of hierarchy (registers, cache, main memory, secondary memory)? What are all the things that make accessing registers so fast?
Registers are circuits which are literally wired directly to the ALU, which contains the circuits for arithmetic. Every clock cycle, the register unit of the CPU core can feed a half-dozen or so variables into the other circuits. Actually, the units within the datapath (ALU, etc.) can feed data to each other directly, via the bypass network, which in a way forms a hierarchy level above registers — but they still use register-numbers to address each other. (The control section of a fully pipelined CPU dynamically maps datapath units to register numbers.)
The register keyword in C does nothing useful and you shouldn't use it. The compiler decides what variables should be in registers and when.
Registers are a core part of the CPU, and much of the instruction set of a CPU will be tailored for working against registers rather than memory locations. Accessing a register's value will typically require very few clock cycles (likely just 1), as soon as memory is accessed, things get more complex and cache controllers / memory buses get involved and the operation is going to take considerably more time.
Several factors lead to registers being faster than cache.
Direct vs. Indirect Addressing
First, registers are directly addressed based on bits in the instruction. Many ISAs encode the source register addresses in a constant location, allowing them to be sent to the register file before the instruction has been decoded, speculating that one or both values will be used. The most common memory addressing modes indirect through a register. Because of the frequency of base+offset addressing, many implementations optimize the pipeline for this case. (Accessing the cache at different stages adds complexity.) Caches also use tagging and typically use set associativity, which tends to increase access latency. Not having to handle the possibility of a miss also reduces the complexity of register access.
Complicating Factors
Out-of-order implementations and ISAs with stacked or rotating registers (e.g., SPARC, Itanium, XTensa) do rename registers. Specialized caches such as Todd Austin's Knapsack Cache (which directly indexes the cache with the offset) and some stack cache designs (e.g., using a small stack frame number and directly indexing a chunk of the specialized stack cache using that frame number and the offset) avoid register read and addition. Signature caches associate a register name and offset with a small chunk of storage, providing lower latency for accesses to the lower members of a structure. Index prediction (e.g., XORing offset and base, avoiding carry propagation delay) can reduce latency (at the cost of handling mispredictions). One could also provide memory addresses earlier for simpler addressing modes like register indirect, but accessing the cache in two different pipeline stages adds complexity. (Itanium only provided register indirect addressing — with option post increment.) Way prediction (and hit speculation in the case of direct mapped caches) can reduce latency (again with misprediction handling costs). Scratchpad (a.k.a. tightly coupled) memories do not have tags or associativity and so can be slightly faster (as well as have lower access energy) and once an access is determined to be to that region a miss is impossible. The contents of a Knapsack Cache can be treated as part of the context and the context not be considered ready until that cache is filled. Registers could also be loaded lazily (particularly for Itanium stacked registers), theoretically, and so have to handle the possibility of a register miss.
Fixed vs. Variable Size
Registers are usually fixed size. This avoids the need to shift the data retrieved from aligned storage to place the actual least significant bit into its proper place for the execution unit. In addition, many load instructions sign extend the loaded value, which can add latency. (Zero extension is not dependent on the data value.)
Complicating Factors
Some ISAs do support sub-registers, notable x86 and zArchitecture (descended from S/360), which can require pre-shifting. One could also provide fully aligned loads at lower latency (likely at the cost of one cycle of extra latency for other loads); subword loads are common enough and the added latency small enough that special casing is not common. Sign extension latency could be hidden behind carry propagation latency; alternatively sign prediction could be used (likely just speculative zero extension) or sign extension treated as a slow case. (Support for unaligned loads can further complicate cache access.)
Small Capacity
A typical register file for an in-order 64-bit RISC will be only about 256 bytes (32 8-byte registers). 8KiB is considered small for a modern cache. This means that multiplying the physical size and static power to increase speed has a much smaller effect on the total area and static power. Larger transistors have higher drive strength and other area-increasing design factors can improve speed.
Complicating Factors
Some ISAs have a large number of architected registers and may have very wide SIMD registers. In addition, some implementations add additional registers for renaming or to support multithreading. GPUs, which use SIMD and support multithreading, can have especially high capacity register files; GPU register files are also different from CPU register files in typically being single ported, accessing four times as many vector elements of one operand/result per cycle as can be used in execution (e.g., with 512-bit wide multiply-accumulate execution, reading 2KiB of each of three operands and writing 2KiB of the result).
Common Case Optimization
Because register access is intended to be the common case, area, power, and design effort is more profitably spent to improve performance of this function. If 5% of instructions use no source registers (direct jumps and calls, register clearing, etc.), 70% use one source register (simple loads, operations with an immediate, etc.), 25% use two source registers, and 75% use a destination register, while 50% access data memory (40% loads, 10% stores) — a rough approximation loosely based on data from SPEC CPU2000 for MIPS —, then more than three times as many of the (more timing-critical) reads are from registers than memory (1.3 per instruction vs. 0.4) and
Complicating Factors
Not all processors are design for "general purpose" workloads. E.g., processor using in-memory vectors and targeting dot product performance using registers for vector start address, vector length, and an accumulator might have little reason to optimize register latency (extreme parallelism simplifies hiding latency) and memory bandwidth would be more important than register bandwidth.
Small Address Space
A last, somewhat minor advantage of registers is that the address space is small. This reduces the latency for address decode when indexing a storage array. One can conceive of address decode as a sequence of binary decisions (this half of a chunk of storage or the other). A typical cache SRAM array has about 256 wordlines (columns, index addresses) — 8 bits to decode — and the selection of the SRAM array will typically also involve address decode. A simple in-order RISC will typically have 32 registers — 5 bits to decode.
Complicating Factors
Modern high-performance processors can easily have 8 bit register addresses (Itanium had more than 128 general purpose registers in a context and higher-end out-of-order processors can have even more registers). This is also a less important consideration relative to those above, but it should not be ignored.
Conclusion
Many of the above considerations overlap, which is to be expected for an optimized design. If a particular function is expected to be common, not only will the implementation be optimized but the interface as well. Limiting flexibility (direct addressing, fixed size) naturally aids optimization and smaller is easier to make faster.
Registers are essentially internal CPU memory. So accesses to registers are easier and quicker than any other kind of memory accesses.
Smaller memories are generally faster than larger ones; they can also require fewer bits to address. A 32-bit instruction word can hold three four-bit register addresses and have lots of room for the opcode and other things; one 32-bit memory address would completely fill up an instruction word leaving no room for anything else. Further, the time required to address a memory increases at a rate more than proportional to the log of the memory size. Accessing a word from a 4 gig memory space will take dozens if not hundreds of times longer than accessing one from a 16-word register file.
A machine that can handle most information requests from a small fast register file will be faster than one which uses a slower memory for everything.
Every microcontroller has a CPU as Bill mentioned, that has the basic components of ALU, some RAM as well as other forms of memory to assist with its operations. The RAM is what you are referring to as Main memory.
The ALU handles all of the arthimetic logical operations and to operate on any operands to perform these calculations, it loads the operands into registers, performs the operations on these, and then your program accesses the stored result in these registers directly or indirectly.
Since registers are closest to the heart of the CPU (a.k.a the brain of your processor), they are higher up in the chain and ofcourse operations performed directly on registers take the least amount of clock cycles.