Out of Order Execution, How to Solve True Dependency? - cpu-architecture

I was reading about OOOE (Out of Order Execution) and read about how we can solve false dependencies (by using register renaming).
But my question is, how can we solve true dependency (RAW - read after write)?
For example:
R1=R2+R3 #1
R1=R4+R5 #2
R9=R1 #3
Renaming won't be helpful here if the CPU chooses to run #2 before #1.

There is no way to really avoid them, that's why RAW hazards are called true dependencies. Instructions have to wait for their inputs to be ready before they can execute. (With OoO exec, normally CPUs will dispatch the oldest-ready-first instructions / uops to execution units, for example on Intel CPUs.)
True dependencies aren't something you "solve" in the sense of making them go away, they're the essence of computation, the way multiple computations on the same numbers are glued together to form an algorithm. Other hazards (WAR and WAW) are just implementation details, reusing the same architectural register for something different.
Sometimes you can structure an algorithm to have shorter dependency chains, but once things are already nailed down into machine code, the CPU pretty much just has to respect them, at best using out-of-order exec to overlap independent dep chains.
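For instance (a toy illustration of my own in C, not from the question): summing four values as ((a+b)+c)+d is a chain of three dependent adds, while (a+b)+(c+d) does the two inner adds independently, so the critical path is only two adds deep.

    /* Hypothetical example: same result, different dependency-chain depth. */
    double sum_serial(double a, double b, double c, double d) {
        return ((a + b) + c) + d;      /* each add waits for the previous one: depth 3 */
    }

    double sum_tree(double a, double b, double c, double d) {
        return (a + b) + (c + d);      /* (a+b) and (c+d) are independent: depth 2 */
    }

(For floating point, a compiler will normally only do this kind of re-association for you with something like -ffast-math, since FP addition isn't associative.)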
For loads, in theory there's value prediction, but AFAIK no real CPU does it. Mispredictions are expensive, just like for branches, so you'd only consider it for high-latency stuff like loads that miss in cache; even then it isn't done, because the gains don't outweigh the costs (including the power / area cost of building a predictor). As Paul Clayton points out, branch prediction is a form of value prediction (predicting the condition the branch was based on). The more instructions you can keep in flight at once with OoO exec, the more you stand to lose from mispredicts; but real CPUs do predict / speculate for memory disambiguation (whether a load reloads an earlier store to an unknown address or not), and (on CPUs like x86 with strongly-ordered memory models) they speculate that early loads will turn out to be allowed, as well as the well-known case of control dependencies (branch prediction + speculative execution).
The only thing that helps directly with dependency chains is keeping instruction latencies low. e.g. in the case of your example, #3 is just a register copy, which modern x86 CPUs can do with zero latency (mov-elimination), so another instruction dependent on R9 wouldn't have to wait an extra cycle beyond #2 producing a result, because it's handled during register renaming instead of by an execution unit reading an input and producing an output the normal way.
Obviously bypass forwarding from the outputs of execution units to the inputs of the same or others is essential to keep latency low, same as in an in-order classic RISC pipeline.
Or more conventionally, by improving execution units: the AMD Bulldozer family had 2-cycle latency for most SIMD-integer instructions, but that improved to 1 cycle in AMD's next design, Zen. (Scalar integer stuff like add was always 1 cycle on any sane high-performance CPU.)
OoO exec with a large enough window size lets you overlap multiple dep chains in parallel (as in this experiment), and of course software should aim to have enough instruction-level parallelism (ILP) for the CPU to find and exploit. (See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for an example of doing that by summing into multiple accumulators for a dot-product.)
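As a rough sketch of that multiple-accumulator idea in C (my own illustration, not taken from the linked Q&A): the number of accumulators you want is roughly FP latency times per-clock throughput, e.g. around 8 for 4-cycle-latency, 2-per-clock FMAs; 4 are shown here just to keep it short.

    #include <stddef.h>

    /* Dot product with 4 independent accumulators so separate dependency
     * chains can overlap in the out-of-order window. */
    double dot(const double *x, const double *y, size_t n) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i;
        for (i = 0; i + 4 <= n; i += 4) {
            s0 += x[i]     * y[i];
            s1 += x[i + 1] * y[i + 1];
            s2 += x[i + 2] * y[i + 2];
            s3 += x[i + 3] * y[i + 3];
        }
        double s = (s0 + s1) + (s2 + s3);
        for (; i < n; ++i)             /* cleanup for leftover elements */
            s += x[i] * y[i];
        return s;
    }

With a single accumulator, the loop-carried add chain limits you to one element per FP-add (or FMA) latency; with enough accumulators you approach the throughput limit of the execution units instead. (As with any FP re-association, the result can differ in the last bits from a strictly serial sum.)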
Finding that overlap is also useful for in-order CPUs if done statically by a compiler, where techniques like "software pipelining" are a big deal to overlap execution of multiple loop iterations because the HW isn't finding that parallelism for you. Or for OoO exec CPUs with a limited window size, for loops with long but not loop-carried dependency chains within each iteration.
As long as you're bottlenecked on something other than latency / dependency chains, true dependencies aren't the problem. e.g. with a front-end bottleneck (ideally maxed out at the pipeline width), and/or all relevant back-end execution units busy every cycle, you couldn't get more work through the pipeline even if it were independent.
Of course, in a lot of code there are enough dependencies, including through memory, to not reach that ideal situation.
Simultaneous Multithreading (SMT) can help to keep the back-end fed with work to do, increasing throughput by having the front-end of one physical core read multiple instruction streams, acting as multiple logical cores. This effectively creates ILP out of thread-level parallelism, which is useful if software can scale efficiently to more threads, exposing enough TLP to keep all the logical cores busy.
Related:
Modern Microprocessors A 90-Minute Guide!
How many CPU cycles are needed for each assembly instruction? - that's not how it works on superscalar OoO exec CPUs; latency or throughput or a front-end bottleneck might be the relevant thing.
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?

Related

How many clock cycles do the stages of a simple 5 stage processor take?

A 5 stage pipelined CPU has the following sequence of stages:
IF – Instruction fetch from instruction memory.
RD – Instruction decode and register read.
EX – Execute: ALU operation for data and address computation.
MA – Data memory access – for write access, the register read in the RD stage is used.
WB – Register write back.
Now I know that an instruction fetch, for example, is from memory, which can take 4 cycles (L1 cache) or up to ~150 cycles (RAM). However, in every pipelining diagram I see, each stage is assigned a single cycle.
Now, I know of course real processors have complex pipelines with over 19 stages and every architecture is different. However, am I missing something here? With memory accesses in IF and MA, can this 5 stage pipeline take dozens of cycles?
Classic 5-stage RISC pipelines are designed around single-cycle latency L1d / L1i, allowing 1 IPC (instruction per clock) in code without cache misses or other stalls. i.e. the hopefully common / good case. Every stage must have a worst-case critical path latency of 1 cycle, or trigger a stall.
Clock speeds were lower back then (even relative to 1 gate delay) so you could get more done in a single cycle, and the caches were simpler, often 8k direct-mapped, single port, sometimes even virtually tagged (VIVT) so TLB lookup wasn't part of the access latency.
First-gen MIPS, the R2000 (and R3000), had on-chip controllers (see footnote 1) for its direct-mapped PIPT split L1i/L1d write-through caches, but the actual tags+data were off-chip, from 4K to 64K. Achieving the required single-cycle latency with this setup limited clock speeds to 15 MHz (R2000) or 33 MHz (R3000) with available SRAM technology. The TLB was fully on-chip.
vs. modern Intel/AMD using 32kiB 8-way VIPT L1d/L1i caches, with at least 2 read + 1 write port for L1d, at such high clock speed that access latency is 4 cycles best-case on Intel SnB-family, or 5 cycles including address-generation. Modern CPUs have larger TLBs, too, which also adds to the latency. This is ok when out-of-order execution and/or other techniques can usually hide that latency, but classic 5-stage RISCs just had one single pipeline, not separately pipelined memory access. See also Cycles/cost for L1 Cache hit vs. Register on x86? for some more links about how performance on modern superscalar out-of-order exec x86 CPUs differs from classic-RISC CPUs.
If you wanted to raise clock speeds for the same transistor performance (gate delay), you'd divide the fetch and mem stages into multiple pipeline stages (i.e. pipeline them more heavily), if cache access was even on the critical path (i.e. if cache access could no longer be done in one clock period). The downside of lengthening the pipeline is raising branch latency (cost of a mispredict, and the amount of latency a correct prediction has to hide), as well as raising total transistor cost.
Note that classic-RISC pipelines do address-generation in the EX stage, using the ALU there to calculate register + immediate, the only addressing mode supported by most RISC ISAs built around such a pipeline. So load-use latency is effectively 2 cycles for pointer-chasing, due to the load delay for forwarding back to EX.
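Pointer-chasing is the classic case where that load-use latency is fully exposed; here is a minimal C illustration of my own (not from the original answer), where each load depends on the previous load's result so nothing can be overlapped within the chain:

    #include <stddef.h>

    struct node { struct node *next; };

    /* Each iteration's load address comes from the previous load, so the
     * traversal advances one node per load-use latency, whether the pipeline
     * is a classic 5-stage RISC or a wide out-of-order core. */
    size_t list_length(const struct node *p) {
        size_t n = 0;
        while (p) {
            p = p->next;    /* serially dependent load */
            ++n;
        }
        return n;
    }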
On a cache miss, the entire pipeline would just stall: those early pipelines lacked scoreboarding of loads to allow hit-under-miss or miss-under-miss for loads from L1d cache.
MIPS R2000 did have a 4-entry store buffer to decouple execution from cache-miss stores. (Apparently built from 4 separate R2020 write-buffer chips, according to Wikipedia.) The LSI datasheet says the write-buffer chips were optional, but with write-through caches, every store has to go to DRAM and would create a stall without write buffering. Most modern CPUs use write-back caches, allowing multiple writes of the same line without creating DRAM traffic.
Also remember that CPU speed wasn't as high relative to memory for early CPUs like MIPS R2000, and single-core machines didn't need an interconnect between cores and memory controllers. (Although they maybe did have a frontside bus to a memory controller on a separate chip, a "northbridge".) But anyway, back then a cache miss to DRAM cost a lot fewer core clock cycles. It sucks to fully stall on every miss, but it wasn't like modern CPUs where it can be in the 150 to 350 cycles range (70 ns * 5 GHz). DRAM latency hasn't improved nearly as much as bandwidth and CPU clocks. See also http://www.lighterra.com/papers/modernmicroprocessors/ which has a "memory wall" section, and Why is the size of L1 cache smaller than that of the L2 cache in most of the processors? re: why modern CPUs need multi-level caches as the mismatch between CPU speed and memory latency has grown.
Later CPUs allowed progressively more memory-level parallelism by doing things like allowing execution to continue after a non-faulting load (successful TLB lookup), only stalling when you actually read a register that was last written by a load, if the load result isn't ready yet. This allows hiding load latency on a still-short and fairly simple in-order pipeline, with some number of load buffers to track outstanding loads. And with register renaming + OoO exec, the ROB size is basically the "window" over which you can hide cache-miss latency: https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/
Modern x86 CPUs even have buffers between pipeline stages in the front-end to hide or partially absorb fetch bubbles (caused by L1i misses, decode stalls, low-density code, e.g. a jump to another jump, or even just failure to predict a simple always-taken branch. i.e. only detecting it when it's eventually decoded, after fetching something other than the correct path. That's right, even unconditional branches like jmp foo need some prediction for the fetch stage.)
https://www.realworldtech.com/haswell-cpu/2/ has some good diagrams. Of course, Intel SnB-family and AMD Zen-family use a decoded-uop cache because x86 machine code is hard to decode in parallel, so often they can bypass some of that front-end complexity, effectively shortening the pipeline. (wikichip has block diagrams and microarchitecture details for Zen 2.)
See also Modern Microprocessors: A 90-Minute Guide! re: modern CPUs and the "memory wall": the increasing mismatch between DRAM latency and core clock cycle time. DRAM latency has only dropped a little bit (in absolute nanoseconds) while bandwidth has continued to climb tremendously in recent years.
Footnote 1: MIPS R2000 cache details:
An R2000 datasheet shows the D-cache was write-through, and various other interesting things.
According to a 1992 usenet message from an SGI engineer, the control logic just sends 18 index bits, receiving a word of data + 8 tag bits to determine hit or not. The CPU is oblivious to the cache size; you connect up the right number of index lines to the SRAM address lines. (So I guess a line size of one 4-byte word?)
You have to use at least 10 index bits because the tag is only 20 bits wide, and you need tag+index+2(byte-in-word) to be 32, the physical address-space size. That sets a minimum cache size of 4K.
20 bits of tag for every 32 bits of data is very inefficient. With a larger cache, fewer tag bits are actually needed, since more of the address is used up as part of the index. But Paul Ries posted that R2000/R3000 does not support comparing fewer tag bits. IDK if you could wire up some of the address output lines to the tag input lines, to generate matching bits instead of storing them in SRAMs.
A 32-byte cache line would still only need 20-bit tags (at most), but would have one tag per 8 words, a factor of 8 improvement in tag overhead. CPUs with larger caches, especially L2 caches, would definitely want to use larger line sizes.
But you're probably more likely to get conflict misses with fewer, larger lines, especially with a direct-mapped cache. And the memory bus can still be busy filling a previous line when you encounter another miss, even if you have critical-word-first / early-restart, which only keeps the miss latency from being worse when the memory bus was idle to start with.
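As a toy C illustration of that conflict-miss point (my own sketch, assuming a hypothetical 4K direct-mapped cache like the minimum R2000 configuration): two locations exactly one cache-size apart share the same index bits, so alternating accesses evict each other on every iteration, even though the working set is only two lines.

    #include <stdlib.h>

    #define CACHE_SIZE 4096     /* assumed direct-mapped cache size in bytes */

    long ping_pong(long iters) {
        char *buf = malloc(2 * CACHE_SIZE);
        volatile char *a = buf;                /* volatile: keep the loads */
        volatile char *b = buf + CACHE_SIZE;   /* same index bits as a */
        long sum = 0;
        for (long i = 0; i < iters; i++)
            sum += a[0] + b[0];                /* each access evicts the other line */
        free(buf);
        return sum;
    }

On a set-associative cache the two lines would simply coexist; the thrashing only happens because a direct-mapped cache has exactly one place each address can go.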

What is the point of on-chip hardware accelerators, instead of that functionality being added as an instruction to the ISA?

I get that if a specialized operation is known to be common, it makes sense to do it in hardware. But at that point, why not make it a part of the ISA so it can be even faster?
Is there a benefit to making it a co-processor that communicates through shared memory?
This is a bit hand-wavy because I don't actually design hardware, but I think I know enough to say something that's at least plausible.
Adding it to the ISA means it has to be fairly tightly coupled to the pipeline, which doesn't fit well for things like integrated GPUs, which have some specialized hardware of their own and can filter out which pixels even need to be processed using dedicated hardware instead of software branching.
Even considering less complicated accelerators (e.g. for crypto):
Especially on simpler CPUs without out-of-order exec and large reordering windows, high-latency HW accelerators could stall the pipeline and stop it from getting other work done while waiting for a result.
Intel does tend to add things to the ISA, such as AES and SHA, because mainstream x86 CPUs do have the instruction throughput and vector registers to feed data to execution units that do one round of AES, for example.
If the accelerator is physically large but usually not needed by multiple cores at once, having groups of cores share one is more natural with some kind of co-processor arrangement to insulate the core from the round-trip latency of going off-core to compute something.
Also for GPUs: a GPU has more computational throughput than you can feed down the superscalar pipeline of a normal CPU. The FLOPS of an integrated GPU is typically much greater than that of a single core of a modern Intel CPU, even one with 2x 256-bit FMA units. So you'd need to have a CPU instruction like "run shader" that runs a GPU program using its own separately-programmable machine code. GPU instruction scheduling is lighter weight than even that of a normal in-order CPU.

Why predict a branch, instead of simply executing both in parallel?

I believe that when designing CPUs, branch prediction is a major slowdown when the wrong branch is chosen. So why do CPU designers choose a branch instead of simply executing both branches, then cutting one off once it's known for sure which one was taken?
I realize that this could only go 2 or 3 branches deep within a short number of instructions or the number of parallel stages would get ridiculously large, so at some point you would still need some branch prediction since you definitely will run across larger branches, but wouldn't a couple stages like this make sense? Seems to me like it would significantly speed things up and be worth a little added complexity.
Even just a single branch deep would almost halve the time eaten up by wrong branches, right?
Or maybe it is already somewhat done like this? Branches usually only choose between two choices when you get down to assembly, correct?
You're right to be afraid of exponentially filling the machine, but you underestimate just how quickly that exponential growth bites.
A common rule of thumb says you can expect ~20% branches on average in your dynamic code. This means one branch in every 5 instructions. Most CPUs today have a deep out-of-order core that fetches and executes hundreds of instructions ahead - take Intel's Haswell for example: it has a 192-entry ROB, meaning you can hold at most 4 levels of branches (at that point you'd have 16 "fronts" and 31 "blocks", each containing a single bifurcating branch - assuming each block has 5 instructions, you've almost filled your ROB, and another level would exceed it). At that point you would have progressed only to an effective depth of ~20 instructions, squandering most of the instruction-level parallelism a window that size could otherwise exploit.
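To make that back-of-the-envelope arithmetic explicit, here is a small C sketch of the same model (my own, assuming ~5 instructions per block and a 192-entry Haswell-like ROB):

    #include <stdio.h>

    int main(void) {
        const int insns_per_block = 5, rob = 192;
        for (int levels = 1; levels <= 5; levels++) {
            int paths  = 1 << levels;             /* leaf "fronts" */
            int blocks = (1 << (levels + 1)) - 1; /* all blocks in the fork tree */
            int insns  = blocks * insns_per_block;
            printf("%d levels: %2d paths, %2d blocks, ~%3d insns (%s)\n",
                   levels, paths, blocks, insns,
                   insns <= rob ? "fits in ROB" : "exceeds ROB");
        }
        return 0;
    }

Running it shows that 4 levels (16 paths, 31 blocks, ~155 instructions) just about fits, while a 5th level (~315 instructions) blows past the 192 entries.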
If you want to diverge on 3 levels of branches, it means you're going to have 8 parallel contexts, each with only 24 entries available to run ahead. And even that's only when you ignore the overhead of rolling back 7/8 of your work, the need to duplicate all state-saving HW (like registers, of which you have dozens), and the need to split other resources into 8 parts like you did with the ROB. Also, that's not counting memory management, which would have to handle complicated versioning, forwarding, coherency, etc.
Forget about power consumption: even if you could support that wasteful parallelism, spreading your resources that thin would choke you before you could advance more than a few instructions down each path.
Now, let's examine the more reasonable option of splitting over a single branch - this is beginning to look like Hyper-Threading: you split/share your core resources over 2 contexts. This feature has some performance benefits, granted, but only because both contexts are non-speculative. As it is, I believe the common estimate is a 10-30% gain over running the 2 contexts one after the other, depending on the workload combination (numbers from an AnandTech review) - that's nice if you indeed intended to run both tasks one after the other, but not when you're about to throw away the results of one of them. Even if you ignore the mode-switch overhead here, you're gaining 30% only to lose 50% - no sense in that.
On the other hand, you have the option of predicting the branches (modern predictors can reach over 95% success rates on average) and paying the penalty of misprediction, which is already partially hidden by the out-of-order engine (some instructions predating the branch may execute after it's cleared; most OoO machines support that). This leaves any deep out-of-order engine free to roam ahead, speculating up to its full potential depth, and being right most of the time. The odds of keeping all that work do decrease geometrically (95% after the first branch, ~90% after the second, etc.), but the penalty of each flush also decreases. It's still far better than a global efficiency of 1/n (for n levels of bifurcation).

Is there any difference in execution times between two different processors with same amount of gigaflops?

I have a hardware related question that I debated with a friend.
Consider two processors from two different manufacturers with the same number of gigaflops, put into otherwise identical computers (i.e. RAM and so on are the same for both).
Now, given a simple program, will there be any difference in execution times between the two computers? I.e., will the two computers handle the code differently (for-loops, while-loops, if-statements and such)?
And if so, is that difference noticeable, or can one say that the computers would perform approximately the same?
Short answer: Yes, they will be different, possibly very much so.
FLOPS only count floating-point operations, so they are a very crude measure of CPU performance. They are, in general, a decent proxy for performance on certain kinds of scientific computation, but not for general performance.
There are CPUs which are strong in FLOPS - the Alpha is a historical example - but which have more moderate performance in integer computations. This means that an Alpha and an x86 CPU with similar FLOPS would have very different MIPS (integer) performance.
The truth is that it is very hard to make a good generic benchmark, though many have tried.
Another critical factor in comparing the performance of two processors with the same FLOPS rating is the rate at which they can move data between CPU and RAM. Add the memory-cache hierarchy into your thinking to complicate matters further.

Parallel programming on a Quad-Core and a VM?

I'm thinking of slowly picking up parallel programming. I've seen people use clusters with OpenMPI installed to learn this stuff. I do not have access to a cluster but have a quad-core machine. Will I be able to experience any benefit here? Also, if I'm running Linux inside a virtual machine, does it make sense to use OpenMPI inside a VM?
If your target is to learn, you don't need a cluster at all. Your quad-core (or any dual-core, or even a single-core) computer will be more than enough. The main point is to learn how to think "in parallel" and how to design your application accordingly.
Some important points are to:
Exploit different parallelism paradigms like divide-and-conquer, master-worker, SPMD, ... depending on the data and task dependencies of what you want to do.
Choose different data-division granularities to check the computation/communication ratio (in the case of message passing), or to check the amount of serial execution caused by mutual exclusion on shared memory regions.
Having a quad-core, you can measure the speedup of your approach (the performance gain attained from parallelization), which is normally given by dividing the time of the non-parallelized execution by the time of the parallel execution.
The closer you get to 4 (four cores meaning 1/4 of the execution time), the better your parallelization strategy was (provided you could evenly distribute work and data).
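If you want something concrete to start with on a quad-core, here is a minimal sketch of my own in C using OpenMP (just one easy option; MPI, as mentioned in the question, works too) that times a toy reduction serially and in parallel and reports the speedup:

    #include <omp.h>
    #include <stdio.h>

    #define N 100000000L

    /* Toy workload: sum of squares, once serially and once with an OpenMP
     * parallel-for reduction; speedup = serial time / parallel time. */
    int main(void) {
        double t0 = omp_get_wtime();
        double ssum = 0.0;
        for (long i = 0; i < N; i++)
            ssum += (double)i * i;
        double t1 = omp_get_wtime();

        double psum = 0.0;
        #pragma omp parallel for reduction(+:psum)
        for (long i = 0; i < N; i++)
            psum += (double)i * i;
        double t2 = omp_get_wtime();

        printf("sums: %.6g %.6g\n", ssum, psum);   /* keep the results live */
        printf("serial %.3fs  parallel %.3fs  speedup %.2fx\n",
               t1 - t0, t2 - t1, (t1 - t0) / (t2 - t1));
        return 0;
    }

Compile with something like gcc -O2 -fopenmp; the parallel sum may differ from the serial one in the last bits because the reduction re-associates the FP additions. Running this inside a VM works fine as long as the VM is given several virtual CPUs.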