What is the benefit of having the registers as a part of memory in AVR microcontrollers? - cpu-architecture

Larger memories have higher decoding delay; why is the register file a part of the memory then?
Does it only mean that the registers are "mapped" SRAM registers that are stored inside the microprocessor?
If not, what would be the benefit of using registers as they won't be any faster than accessing RAM? Furthermore, what would be the use of them at all? I mean these are just a part of the memory so I don't see the point of having them anymore. Having them would be just as costly as referencing memory.
The picture is taken from Avr Microcontroller And Embedded Systems The: Using Assembly and C by Muhammad Ali Mazidi, Sarmad Naimi, and Sepehr Naimi

AVR has some instructions with indirect addressing, for example LD (LDD) – Load Indirect From Data Space to Register using Z:
Loads one byte indirect with or without displacement from the data space to a register. [...]
The data location is pointed to by the Z (16-bit) Pointer Register in the Register File.
So now you can move from a register by loading its data-space address into Z, allowing indirect or indexed register-to-register moves. Certainly one can think of some usage where such indirect access would save the odd instruction.

what would be the benefit of using registers as they won't be any faster than accessing RAM?
accessing General purpose Registers is faster than accessing Ram
first of all let us define how fast measured in microControllers .... fast mean how many cycle the instruction will take to excute ... LOOk at the avr architecture
See the General Purpose Registers GPRs are input for the ALU , and the GPRs are controlled by instruction register (2 byte width) which holds the next instruction from the code memory.
Let us examine simple instruction ADD Rd , Rr; where Rd,Rr are any two register in GPRs so 0<=r,d<=31 so each of r and d could be rebresented in 5 bit,now open "AVR Instruction Set Manual" page number 32 look at the op-code for this simple add instraction is 000011rdddddrrrr and becuse this op-code is two byte(code memory width) this will fetched , Decoded and excuit in one cycle (under consept of pipline ofcourse) jajajajjj only one cycle seems cool to me
I mean these are just a part of the memory so I don't see the point of having them anymore. Having them would be just as costly as referencing memory
You suggest to make the all ram as input for the ALU; this is a very bad idea: a memory address takes 2 bytes.
If you have 2 operands per instruction as in Add instruction you will need 4 Byte for saving only the operands .. and 1 more byte for the op-code of the operator itself in total 5 byte which is waste of memory!
And furthermore this architecture could only fetch 2 bytes at a time (instruction register width) so you need to spend more cycles on fetching the code from code memory which is waste of cycles >> more slower system
Register numbers are only 4 or 5 bits wide, depending on the instruction, allowing 2 per instruction with room to spare in a 16-bit instruction word.
conclusion GPRs' existence are crucial for saving code memory and program execution time
Larger memories have higher decoding delay; why is the register file a part of the memory then?
When cpu deal with GPRs it only access the first 32 position not all the data space
Final comment
don't disturb yourself by time diagram for different ram technology because you don't have control on it ,so who has control? the architecture designers , and they put the limit of the maximum crystal frequency you can use with there architecture and everything will be fine .. you only concern about cycles consuming with your application

Related

What is the purpose of the CIR if I have the MDR in Von Neumann Architecture?

From the fetch decode execute cycle of the Von Neumann Architecture, at a basic level, here is what I understand:
Memory address in PC is copied to MAR.
PC +=1
The instruction / data in the address of the MAR is stored in the MDR after being fetched from main memory.
Instruction from MDR is copied to CIR
Instruction / data in memory is decoded & executed by the CU .
Result from the calculation stored in ACC.
Repeat
Now if the MDR value is copied to the CIR, why are they both necessary. I am quite new to systems architecture so I may have gotten the wrong end of the stick, but I've tried my best :)
Think about what happens if the current instruction is a load or store: does anything need to happen after the MDR? If so, how is the CPU going to remember what it's in the middle of doing if it doesn't keep track of that somehow.
Whether that requires the original instruction bits the whole time or not depends on the design.
A CPU that needs to do a lot of decoding (e.g. a CISC with a compact variable-length instruction set, like original 8086) may not keep the actual instruction itself around, but instead just some decode results. For example, actual 8086 decoded incrementally, scanning through prefixes one byte at a time until reaching the opcode. And modern x86 decodes to uops which it sends down the pipeline; the original machine-code bytes aren't needed.
But CPUs like MIPS were specifically designed so parts of the instruction word could be used directly as internal control signals. Still, it's not always necessary to keep the whole instruction around in one piece.
It might make more sense to look at CIR as being the input latches of the decoding process that produces the necessary internal control signals, or sequence of microcode depending on the design. Having a truly physical CIR separate from that is ok if you don't mind redoing decoding at any step you need to consult it to figure out what step to do next.

How many clock cycles do the stages of a simple 5 stage processor take?

A 5 stage pipelined CPU has the following sequence of stages:
IF – Instruction fetch from instruction memory.
RD – Instruction decode and register read.
EX – Execute: ALU operation for data and address computation.
MA – Data memory access – for write access, the register read at RD state is
used.
WB – Register write back.
Now I know that an instruction fetch, for example, is from memory which can take 4 cycles (L1 cache) or up to ~150 cycles (RAM). However, in every pipelining diagram, I see something like this, where each stage is assigned a single cycle.
Now, I know of course real processors have complex pipelines with over 19 stages and every architecture is different. However, am I missing something here? With memory accesses in IF and MA, can this 5 stage pipeline take dozens of cycles?
Classic 5-stage RISC pipelines are designed around single-cycle latency L1d / L1i, allowing 1 IPC (instruction per clock) in code without cache misses or other stalls. i.e. the hopefully common / good case. Every stage must have a worst-case critical path latency of 1 cycle, or trigger a stall.
Clock speeds were lower back then (even relative to 1 gate delay) so you could get more done in a single cycle, and the caches were simpler, often 8k direct-mapped, single port, sometimes even virtually tagged (VIVT) so TLB lookup wasn't part of the access latency.
First-gen MIPS, the R2000 (and R3000), had on-chip controllers1 for its direct-mapped PIPT split L1i/L1d write-through caches, but the actual tags+data were off-chip, from 4K to 64K. Achieving the required single-cycle latency with this setup limited clock speeds to 15 MHz (R2000) or 33 MHz (R3000) with available SRAM technology. The TLB was fully on-chip.
vs. modern Intel/AMD using 32kiB 8-way VIPT L1d/L1i caches, with at least 2 read + 1 write port for L1d, at such high clock speed that access latency is 4 cycles best-case on Intel SnB-family, or 5 cycles including address-generation. Modern CPUs have larger TLBs, too, which also adds to the latency. This is ok when out-of-order execution and/or other techniques can usually hide that latency, but classic 5-stage RISCs just had one single pipeline, not separately pipelined memory access. See also Cycles/cost for L1 Cache hit vs. Register on x86? for some more links about how performance on modern superscalar out-of-order exec x86 CPUs differs from classic-RISC CPUs.
If you wanted to raise clock speeds for the same transistor performance (gate delay), you'd divide the fetch and mem stages into multiple pipeline stages (i.e. pipeline them more heavily), if cache access was even on the critical path (i.e. if cache access could no longer be done in one clock period). The downside of lengthening the pipeline is raising branch latency (cost of a mispredict, and the amount of latency a correct prediction has to hide), as well as raising total transistor cost.
Note that classic-RISC pipelines do address-generation in the EX stage, using the ALU there to calculate register + immediate, the only addressing mode supported by most RISC ISAs build around such a pipeline. So load-use latency is effectively 2 cycles for pointer-chasing, due to the load delay for forwarding back to EX.)
On a cache miss, the entire pipeline would just stall: those early pipelines lacked scoreboarding of loads to allow hit-under-miss or miss-under-miss for loads from L1d cache.
MIPS R2000 did have a 4-entry store buffer to decouple execution from cache-miss stores. (Apparently built from 4 separate R2020 write-buffer chips, according to wikipedia.) The LSI datasheet says the write-buffer chips were optional, but with write-through caches, every store has to go to DRAM and would create a stall without write buffering. Most modern CPUs use write-back caches, allowing multiple writes of the same line without creating DRAM traffic.
Also remember that CPU speed wasn't as high relative to memory for early CPUs like MIPS R2000, and single-core machines didn't need an interconnect between cores and memory controllers. (Although they maybe did have a frontside bus to a memory controller on a separate chip, a "northbridge".) But anyway, back then a cache miss to DRAM cost a lot fewer core clock cycles. It sucks to fully stall on every miss, but it wasn't like modern CPUs where it can be in the 150 to 350 cycles range (70 ns * 5 GHz). DRAM latency hasn't improved nearly as much as bandwidth and CPU clocks. See also http://www.lighterra.com/papers/modernmicroprocessors/ which has a "memory wall" section, and Why is the size of L1 cache smaller than that of the L2 cache in most of the processors? re: why modern CPUs need multi-level caches as the mismatch between CPU speed and memory latency has grown.
Later CPUs allowed progressively more memory-level parallelism by doing things like allowing execution to continue after a non-faulting load (successful TLB lookup), only stalling when you actually read a register that was last written by a load, if the load result isn't ready yet. This allows hiding load latency on a still-short and fairly simple in-order pipeline, with some number of load buffers to track outstanding loads. And with register renaming + OoO exec, the ROB size is basically the "window" over which you can hide cache-miss latency: https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/
Modern x86 CPUs even have buffers between pipeline stages in the front-end to hide or partially absorb fetch bubbles (caused by L1i misses, decode stalls, low-density code, e.g. a jump to another jump, or even just failure to predict a simple always-taken branch. i.e. only detecting it when it's eventually decoded, after fetching something other than the correct path. That's right, even unconditional branches like jmp foo need some prediction for the fetch stage.)
https://www.realworldtech.com/haswell-cpu/2/ has some good diagrams. Of course, Intel SnB-family and AMD Zen-family use a decoded-uop cache because x86 machine code is hard to decode in parallel, so often they can bypass some of that front-end complexity, effectively shortening the pipeline. (wikichip has block diagrams and microarchitecture details for Zen 2.)
See also Modern Microprocessors
A 90-Minute Guide! re: modern CPUs and the "memory wall": the increasing mismatch between DRAM latency and core clock cycle time. DRAM latency has only dropped a little bit (in absolute nanoseconds) as bandwidth has continued to climb tremendously in recent years.
Footnote 1: MIPS R2000 cache details:
An R2000 datasheet shows the D-cache was write-through, and various other interesting things.
According to a 1992 usenet message from an SGI engineer, the control logic just sends 18 index bits, receiving a word of data + 8 tags bits to determine hit or not. The CPU is oblivious to the cache size; you connect up the right number of index lines to SRAM address lines. (So I guess a line-size of one 4-byte word?)
You have to use at least 10 index bits because the tag is only 20 bits wide, and you need tag+index+2(byte-in-word) to be 32, the physical address-space size. That sets a minimum cache size of 4K.
20 bits of tag for every 32 bits of data is very inefficient. With a larger cache, fewer tag bits are actually needed, since more of the address is used up as part of the index. But Paul Ries posted that R2000/R3000 does not support comparing fewer tag bits. IDK if you could wire up some of the address output lines to the tag input lines, to generate matching bits instead of storing them in SRAMs.
A 32-byte cache line would still only need 20-bit tags (at most), but would have one tag per 8 words, a factor of 8 improvement in tag overhead. CPUs with larger caches, especially L2 caches, would definitely want to use larger line sizes.
But you're probably more likely to get conflict misses with fewer larger lines, especially with a direct-mapped cache. And the memory bus can still be busy filling a previous line when you encounter another miss, even if you have critical-word-first / early-restart so the miss latency wasn't worse if the memory bus was idle to start with.

Why does registers exists and how they work together with cpu?

So I am currently learning Operating Systems and Programming.
I want how the registers work in detail.
All I know is there is the main memory and our CPU which takes address and instruction from the main memory by the help of the address bus.
And also there is something MCC (Memory Controller Chip which helps in fetching the memory location from RAM.)
On the internet, it shows register is temporary storage and data can be accessed faster than ram for registers.
But I want to really understand the deep-down process on how they work. As they are also of 32 bits and 16 bits something like that. I am really confused.!!!
I'm not a native english speaker, pardon me for some perhaps incorrect terminology. Hope this will be a little bit helpful.
Why does registers exists
When user program is running on CPU, it works in a 'dynamic' sense. That is, we should store incoming source data or any intermediate data, and do specific calculation upon them. Memory devices are needed. We have a choice among flip-flop, on-chip RAM/ROM, and off-chip RAM/ROM.
The term register for programmer's model is actually a D flip-flop in the physical circuit, which is a memory device and can hold a single bit. An IC design consists of standard cell part (including the register mentioned before, and and/or/etc. gates) and hard macro (like SRAM). As the technology node advances, the standard cells' delay are getting smaller and smaller. Auto Place-n-Route tool will place the register and the related surrounding logic nearby, to make sure the logic can run at the specified 3.0/4.0GHz speed target. For some practical reasons (which I'm not quite sure because I don't do layout), we tend to place hard macros around, leading to much longer metal wire. This plus SRAM's own characteristics, on-chip SRAM is normally slower than D flip-flop. If the memory device is off the chip, say an external Flash chip or KGD (known good die), it will be further slower since the signals should traverse through 2 more IO devices which have much larger delay.
how they work together with cpu
Each register is assigned a different 'address' (which maybe not open to programmer). That is implemented by adding address decode logic. For instance, when CPU is going to execute an instruction mov R1, 0x12, the address decode logic sees the binary code of R1, and selects only those flip-flops corresponding to R1. Then data 0x12 is stored (written) into those flip-flops. Same for read process.
Regarding "they are also of 32 bits and 16 bits something like that", the bit width is not a problem. Both flip-flops and a word in RAM can have a bit width of N, as long as the same address can select N flip-flops or N bits in RAM at one time.
Registers are small memories which resides inside the processor (what you called CPU). Their role is to hold the operands for fast processor calculations and to store the results. A register is usually designated by a name (AL, BX, ECX, RDX, cr3, RIP, R0, R8, R15, etc.) and has a size which is the number of bits it can store (4, 8, 16, 32, 64, 128 bits). Other registers have special meanings, and their bits control the state or provide information about the state of the processor.
There are not many registers (because they are very expensive). All of them have a capacity of only a few kilobytes, so they can't store all the code and data of your program, which can go up to gigabytes. This is the role of the central memory (what you call RAM). This big memory can hold gigabytes of data and each byte has its address. However, it only holds data when the computer is turned on. The RAM reside outside of the CPU Chip and interacts with him via Memory Controller Chip which stands as interface between CPU and RAM.
On top of that, there is the hard drive that stores your data when you turn off your computer.
That is a very simple view to get you started.

Why do we need to specify the number of flash wait cycles?

Especially when working with "faster" devices like STMF4xx/F7xx we need to specify the number of flash wait cycles, based on the supply voltage and the sys-clock frequency.
When the CPU fetches instructions/or constants this is done over the FLITF. Am I right with the assumption that the FLITF holds a CPU request as long as it can provide the requested data, making it impossible for other Bus-Masters to access flash meanwhile.
If this was true, why should it be important to any interface to know flash wait cycles. Like Cache does preload instructions so or so, independent if it knows how long to wait, no?
Because the flash interface isn't magic.
It has to meet the necessary setup and hold times for addressing and reading out the flash cells, which will vary somewhat depending on voltage. Taking the STM32F411 as an example (because I have that TRM handy), doing some maths with the voltage/frequency/wait-state table implies that a read from flash on one of those takes in the order of ~30ns above 2.7V, down to ~60ns below 2.1V.
Since the flash interface doesn't have its own asynchronous nanosecond-precision timekeeping ability (because that would be needlessly complicated, power-hungry, and silly), that translates to asserting its signals for n clock cycles, after which it can assume the data signals from the cells are stable enough to read back*. How does it know what the clock frequency is, and therefore what n should be? Simple: you, as the programmer who set the clock, tell it. Some hardware things are just infinitely easier to let software deal with.
* and then going through the further shenanigans of extracting the relevant 8, 16 or 32 bits out of the 128-bit line it's read, to finally spit that out the other side onto the AHB bus to the waiting CPU, obviously.

Why are CPU registers fast to access?

Register variables are a well-known way to get fast access (register int i). But why are registers on the top of hierarchy (registers, cache, main memory, secondary memory)? What are all the things that make accessing registers so fast?
Registers are circuits which are literally wired directly to the ALU, which contains the circuits for arithmetic. Every clock cycle, the register unit of the CPU core can feed a half-dozen or so variables into the other circuits. Actually, the units within the datapath (ALU, etc.) can feed data to each other directly, via the bypass network, which in a way forms a hierarchy level above registers — but they still use register-numbers to address each other. (The control section of a fully pipelined CPU dynamically maps datapath units to register numbers.)
The register keyword in C does nothing useful and you shouldn't use it. The compiler decides what variables should be in registers and when.
Registers are a core part of the CPU, and much of the instruction set of a CPU will be tailored for working against registers rather than memory locations. Accessing a register's value will typically require very few clock cycles (likely just 1), as soon as memory is accessed, things get more complex and cache controllers / memory buses get involved and the operation is going to take considerably more time.
Several factors lead to registers being faster than cache.
Direct vs. Indirect Addressing
First, registers are directly addressed based on bits in the instruction. Many ISAs encode the source register addresses in a constant location, allowing them to be sent to the register file before the instruction has been decoded, speculating that one or both values will be used. The most common memory addressing modes indirect through a register. Because of the frequency of base+offset addressing, many implementations optimize the pipeline for this case. (Accessing the cache at different stages adds complexity.) Caches also use tagging and typically use set associativity, which tends to increase access latency. Not having to handle the possibility of a miss also reduces the complexity of register access.
Complicating Factors
Out-of-order implementations and ISAs with stacked or rotating registers (e.g., SPARC, Itanium, XTensa) do rename registers. Specialized caches such as Todd Austin's Knapsack Cache (which directly indexes the cache with the offset) and some stack cache designs (e.g., using a small stack frame number and directly indexing a chunk of the specialized stack cache using that frame number and the offset) avoid register read and addition. Signature caches associate a register name and offset with a small chunk of storage, providing lower latency for accesses to the lower members of a structure. Index prediction (e.g., XORing offset and base, avoiding carry propagation delay) can reduce latency (at the cost of handling mispredictions). One could also provide memory addresses earlier for simpler addressing modes like register indirect, but accessing the cache in two different pipeline stages adds complexity. (Itanium only provided register indirect addressing — with option post increment.) Way prediction (and hit speculation in the case of direct mapped caches) can reduce latency (again with misprediction handling costs). Scratchpad (a.k.a. tightly coupled) memories do not have tags or associativity and so can be slightly faster (as well as have lower access energy) and once an access is determined to be to that region a miss is impossible. The contents of a Knapsack Cache can be treated as part of the context and the context not be considered ready until that cache is filled. Registers could also be loaded lazily (particularly for Itanium stacked registers), theoretically, and so have to handle the possibility of a register miss.
Fixed vs. Variable Size
Registers are usually fixed size. This avoids the need to shift the data retrieved from aligned storage to place the actual least significant bit into its proper place for the execution unit. In addition, many load instructions sign extend the loaded value, which can add latency. (Zero extension is not dependent on the data value.)
Complicating Factors
Some ISAs do support sub-registers, notable x86 and zArchitecture (descended from S/360), which can require pre-shifting. One could also provide fully aligned loads at lower latency (likely at the cost of one cycle of extra latency for other loads); subword loads are common enough and the added latency small enough that special casing is not common. Sign extension latency could be hidden behind carry propagation latency; alternatively sign prediction could be used (likely just speculative zero extension) or sign extension treated as a slow case. (Support for unaligned loads can further complicate cache access.)
Small Capacity
A typical register file for an in-order 64-bit RISC will be only about 256 bytes (32 8-byte registers). 8KiB is considered small for a modern cache. This means that multiplying the physical size and static power to increase speed has a much smaller effect on the total area and static power. Larger transistors have higher drive strength and other area-increasing design factors can improve speed.
Complicating Factors
Some ISAs have a large number of architected registers and may have very wide SIMD registers. In addition, some implementations add additional registers for renaming or to support multithreading. GPUs, which use SIMD and support multithreading, can have especially high capacity register files; GPU register files are also different from CPU register files in typically being single ported, accessing four times as many vector elements of one operand/result per cycle as can be used in execution (e.g., with 512-bit wide multiply-accumulate execution, reading 2KiB of each of three operands and writing 2KiB of the result).
Common Case Optimization
Because register access is intended to be the common case, area, power, and design effort is more profitably spent to improve performance of this function. If 5% of instructions use no source registers (direct jumps and calls, register clearing, etc.), 70% use one source register (simple loads, operations with an immediate, etc.), 25% use two source registers, and 75% use a destination register, while 50% access data memory (40% loads, 10% stores) — a rough approximation loosely based on data from SPEC CPU2000 for MIPS —, then more than three times as many of the (more timing-critical) reads are from registers than memory (1.3 per instruction vs. 0.4) and
Complicating Factors
Not all processors are design for "general purpose" workloads. E.g., processor using in-memory vectors and targeting dot product performance using registers for vector start address, vector length, and an accumulator might have little reason to optimize register latency (extreme parallelism simplifies hiding latency) and memory bandwidth would be more important than register bandwidth.
Small Address Space
A last, somewhat minor advantage of registers is that the address space is small. This reduces the latency for address decode when indexing a storage array. One can conceive of address decode as a sequence of binary decisions (this half of a chunk of storage or the other). A typical cache SRAM array has about 256 wordlines (columns, index addresses) — 8 bits to decode — and the selection of the SRAM array will typically also involve address decode. A simple in-order RISC will typically have 32 registers — 5 bits to decode.
Complicating Factors
Modern high-performance processors can easily have 8 bit register addresses (Itanium had more than 128 general purpose registers in a context and higher-end out-of-order processors can have even more registers). This is also a less important consideration relative to those above, but it should not be ignored.
Conclusion
Many of the above considerations overlap, which is to be expected for an optimized design. If a particular function is expected to be common, not only will the implementation be optimized but the interface as well. Limiting flexibility (direct addressing, fixed size) naturally aids optimization and smaller is easier to make faster.
Registers are essentially internal CPU memory. So accesses to registers are easier and quicker than any other kind of memory accesses.
Smaller memories are generally faster than larger ones; they can also require fewer bits to address. A 32-bit instruction word can hold three four-bit register addresses and have lots of room for the opcode and other things; one 32-bit memory address would completely fill up an instruction word leaving no room for anything else. Further, the time required to address a memory increases at a rate more than proportional to the log of the memory size. Accessing a word from a 4 gig memory space will take dozens if not hundreds of times longer than accessing one from a 16-word register file.
A machine that can handle most information requests from a small fast register file will be faster than one which uses a slower memory for everything.
Every microcontroller has a CPU as Bill mentioned, that has the basic components of ALU, some RAM as well as other forms of memory to assist with its operations. The RAM is what you are referring to as Main memory.
The ALU handles all of the arthimetic logical operations and to operate on any operands to perform these calculations, it loads the operands into registers, performs the operations on these, and then your program accesses the stored result in these registers directly or indirectly.
Since registers are closest to the heart of the CPU (a.k.a the brain of your processor), they are higher up in the chain and ofcourse operations performed directly on registers take the least amount of clock cycles.