Why is the Branch Target Buffer designed as a cache? - cpu-architecture

The BHT is not a cache and it doesn't need to be because it is okay if a mistake is made when accessing it. The BTB, however, is designed as a cache because it always has to return either a hit or a miss. Why can't the BTB make a mistake?

The BTB can make mistakes, and mis-speculation will be detected when the jump instruction actually executes (or decodes, for direct branches), resulting in a bubble as the front-end re-steers to fetch from the right location.
Or, with out-of-order speculative execution of indirect branches, the core may have to roll back to the last known-good state, just as when recovering from a mispredicted direction on a conditional branch.

A BTB can present a false hit and this is currently exploited by some implementations through the use of partial tags. (Similarly, one could have one entry per set be completely untagged; in a direct-mapped BTB, no entries would be tagged. A traditional set-associative design, having (partial) tags for each way, gives "free" miss detection as part of way selection.) As Peter Cordes' answer notes, this mistake can be detected and corrected later in the pipeline.
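To make the partial-tag idea concrete, here is a minimal sketch of a direct-mapped BTB with partial tags; the table sizes and bit widths are made up for illustration, not taken from any real design. Because only some of the PC bits are kept as a tag, two different branch addresses can collide into a "false hit" that must be caught later in the pipeline.

```python
# Hypothetical sizes, purely for illustration.
INDEX_BITS = 10  # 1024 entries, direct-mapped
TAG_BITS = 8     # partial tag: only 8 of the remaining PC bits are stored

class PartialTagBTB:
    def __init__(self):
        self.table = {}  # index -> (partial_tag, predicted_target)

    def _split(self, pc):
        index = (pc >> 2) & ((1 << INDEX_BITS) - 1)
        tag = (pc >> (2 + INDEX_BITS)) & ((1 << TAG_BITS) - 1)
        return index, tag

    def install(self, pc, target):
        index, tag = self._split(pc)
        self.table[index] = (tag, target)

    def lookup(self, pc):
        index, tag = self._split(pc)
        entry = self.table.get(index)
        if entry is not None and entry[0] == tag:
            return entry[1]  # a hit -- possibly a false one
        return None          # genuine (detected) miss

btb = PartialTagBTB()
btb.install(0x0010_0004, 0x0000_2000)
# Any PC differing only above bit 2 + 10 + 8 = 20 aliases into the same
# index and partial tag, so the lookup reports a (false) hit:
assert btb.lookup(0x0010_0004 + (1 << 20)) == 0x0000_2000
```

A full-tag design would turn that last lookup into a miss; the partial tag trades that guarantee for storage.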
Recognizing a BTB miss does allow speculation to be throttled. If the BTB is used to prefetch the instruction stream past an instruction cache miss, avoiding cache-polluting, bandwidth-wasting misspeculation can have a performance impact. When performance is limited by power or thermal considerations, avoiding misspeculation, even misspeculation that would be detected and corrected quickly, can save some power/heat generation and so potentially improve performance.
With a two-level BTB, a hit indication could allow the L2 BTB not to be accessed for that branch. Aside from energy efficiency, the L2 BTB may have been designed to provide lower bandwidth or to be shared with another closely coupled fetch engine (so bandwidth unused by one fetch engine could be used by another).
In addition, an indication of a BTB miss can be used to improve branch direction prediction. A miss indicates that the branch was likely not taken in recent history (whether not recently executed or not taken during recent execution); the branch direction predictor may choose to override a taken prediction (with the target calculated at decode) or may choose to treat the prediction as low confidence (e.g., using dynamic predication or giving priority to fetch from other threads). The former effectively filters out never taken branches from the predictor (which is allowed to have a destructive alias that predicts taken); both uses of a miss indication exploit the likelihood that old branch information is less likely to be accurate.
A BTB can also provide a simple method of branch identification. A BTB miss predicts that the fetch does not contain a potentially taken branch (filtering out non-branches and never taken branches). This avoids the branch direction predictor having to predict not taken for non-branch instructions (or redirecting fetch after instruction decode on a BTB false hit when the branch direction predictor predicts taken). This adds non-branches to the filtering to avoid destructive aliasing. (A separate branch identifier could be used to filter non-branch instructions and to distinguish non-conditional, indirect, and return instructions, which might use different target predictors and might not need direction prediction.)
If the BTB provides a per-address direction prediction or other information used for direction prediction, a miss indication could allow the direction predictor to use other methods to provide such information (e.g., static branch prediction). A static prediction may not be particularly accurate, but it is likely to be more accurate than a "random" prediction with a taken bias (since never-taken branches might never enter the BTB and replacement might be based on least recently taken); a "static" predictor could also exploit the fact that there was a BTB miss. If an agree predictor is used (where the prediction is XORed with a static bias to reduce destructive aliasing: biased-taken branches that are taken update the predictor the same way as biased-not-taken branches that are not taken), a per-address bias is needed.
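The agree-predictor mechanism can be sketched in a few lines (this is an illustration of the general idea, not any particular implementation): the shared counter predicts whether the branch will agree with its per-address static bias rather than predicting the direction itself, so aliased branches that each follow their own bias train a shared counter in the same direction.

```python
def agree_predict(counter_says_agree, bias_taken):
    """Final direction prediction from the agree bit and the static bias."""
    return bias_taken if counter_says_agree else not bias_taken

def agree_train_target(outcome_taken, bias_taken):
    """What the counter is trained toward: did the outcome match the bias?"""
    return outcome_taken == bias_taken

# Two aliased branches with opposite biases, each following its own bias,
# push the shared counter the SAME way (toward "agree"), so the alias is
# constructive rather than destructive:
assert agree_train_target(True, True) == agree_train_target(False, False)
```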
An L1 BTB might also be integrated with the L1 instruction cache, particularly for branch address relative targets, such that not only is miss detection free (the tags for all ways are present) but the BTB provided target may not even be a prediction (avoiding the need to recalculate the target). This would require additional prediction resources for indirect branches (and an L2 BTB might be used to support prefetching under instruction cache misses) but can avoid significant redundant storage (as such branch instructions already store the offset).
Even though BTB miss determination is not necessary, it can be useful.

Related

Estimating Noise Budget without Secret Key

In the latest version of SEAL, the Simulation class has been removed.
Has it been shifted to some other file?
If it has been completely removed from the library, then how can I estimate noise growth, and choose appropriate parameters for my application?
Simulation class and related tools are indeed removed from the most recent SEAL. There were several reasons for this:
updating them was simply too much work with all the recent API changes;
the parameter selection and noise growth estimation tools worked well only in extremely restricted settings, e.g. very shallow computations;
the design of these tools didn't necessarily correspond very well to how we envision parameter selection happening in the future.
Obviously it's really important to have developer tools in SEAL (e.g. parameter selection, circuit analysis/optimization, etc.) so these will surely be coming back in some form.
Even without explicit tools for parameter selection you can still test different parameters and see how much noise budget you have left after the computation. Repeat this experiment multiple times to account for probabilistic effects and convince yourself that the parameters indeed work.

For a Single-Cycle CPU, How Much Energy Is Required to Execute an ADD Command?

The question is just what the title says; I have been wondering about this. Can any expert help?
OK, this was going to be a long answer, so long that I may write an article about it instead. Strangely enough, I've been working on experiments that are closely related to your question -- determining performance per watt for a modern processor. As Paul and Sneftel indicated, it's not really possible with any real architecture today. You can probably compute this if you are looking at only the execution of that instruction given a certain silicon technology and a certain ALU design, by calculating gate leakage and switching currents, voltages, etc. But that isn't a useful value because there is always something going on (from a HW perspective) in any processor newer than an 8086, and instructions haven't been executed in isolation since the pipeline first came into being.
Today, we have multi-function ALUs, out-of-order execution, multiple pipelines, hyperthreading, branch prediction, memory hierarchies, etc. What does this have to do with the execution of one ADD command? The energy used to execute one ADD command is different from the execution of multiple ADD commands. And if you wrap a program around it, then it gets really complicated.
SORT-OF-AN-ANSWER:
So let's look at what you can do.
Statistically measure running a given add over and over again. Remember that there are many different types of adds: integer adds, floating-point, double precision, adds with carries, and even simultaneous adds (SIMD), to name a few. Limits: OSes and other apps are always running, though you may be able to run on bare metal if you know how; results vary with hardware, silicon technology, architecture, etc.; the value is probably not useful because it is so far from reality that it means little; measurement equipment has its own limits (in-processor PMUs, at-the-wall meters, interposer sockets, etc.); the memory hierarchy interferes; and more.
Statistically measure an integer/floating-point/double-based workload kernel. This is beginning to have some meaning because it means something to the community. Limits: still not real; still varies with architecture, silicon technology, hardware, etc.; measurement equipment limits; etc.
Statistically measure a real application. Limits: same as above, but at least it means something to the community; power states come into play during periods of idle; potentially cluster issues come into play.
When I say "Limits", that just means you need to well define the constraints of your answer / experiment, not that it isn't useful.
SUMMARY: it is possible to come up with a value for one add, but it doesn't really mean anything anymore. A value that means anything is far more complicated to obtain, but it is useful and requires a lot of work to find.
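The statistical approach above boils down to very simple arithmetic: attribute the marginal power drawn under load to the operations executed. The numbers below are made up purely for illustration; real measurements would come from PMUs or external meters as discussed.

```python
def energy_per_op_joules(power_loaded_w, power_idle_w, ops_per_second):
    """Extra power drawn under load, divided across the operations executed."""
    return (power_loaded_w - power_idle_w) / ops_per_second

# Illustrative (made-up) numbers: a core drawing 5 W extra while retiring
# 4e9 adds per second attributes 1.25 nJ to each add -- with everything
# else the core happens to be doing folded into that figure.
marginal = energy_per_op_joules(25.0, 20.0, 4e9)
assert abs(marginal - 1.25e-9) < 1e-18
```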
By the way, I do think it is a good and important question -- in part because it is so deceptively simple.

Why predict a branch, instead of simply executing both in parallel?

I believe that when creating CPUs, branch prediction is a major slowdown when the wrong branch is chosen. So why do CPU designers choose a branch instead of simply executing both branches, then cutting one off once you know for sure which one was chosen?
I realize that this could only go 2 or 3 branches deep within a short number of instructions or the number of parallel stages would get ridiculously large, so at some point you would still need some branch prediction since you definitely will run across larger branches, but wouldn't a couple stages like this make sense? Seems to me like it would significantly speed things up and be worth a little added complexity.
Even just a single branch deep would almost halve the time eaten up by wrong branches, right?
Or maybe it is already somewhat done like this? Branches usually only choose between two choices when you get down to assembly, correct?
You're right in being afraid of exponentially filling the machine, but you underestimate the power of that.
A common rule-of-thumb says you can expect to have ~20% branches on average in your dynamic code. This means one branch in every 5 instructions. Most CPUs today have a deep out-of-order core that fetches and executes hundreds of instructions ahead - take Intel's Haswell for example: it has a 192-entry ROB, meaning you can hold at most 4 levels of branches (at that point you'll have 16 "fronts" and 31 "blocks", each including a single bifurcating branch; assuming each block has 5 instructions, you've almost filled your ROB, and another level would exceed it). At that point you would have progressed only to an effective depth of ~20 instructions, rendering any instruction-level parallelism useless.
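The exponential blow-up above is easy to check with a sketch: eager execution over d branch levels is a binary tree with 2^d leaf paths ("fronts") and 2^(d+1)-1 basic blocks in flight.

```python
def eager_execution_footprint(depth, insns_per_block=5):
    """ROB footprint of speculating down every path for `depth` branch levels."""
    fronts = 2 ** depth            # paths being fetched simultaneously
    blocks = 2 ** (depth + 1) - 1  # basic blocks (tree nodes) held in flight
    return fronts, blocks, blocks * insns_per_block

# 4 levels: 16 fronts, 31 blocks, 155 instructions -- nearly fills a
# 192-entry ROB; a fifth level would exceed it.
assert eager_execution_footprint(4) == (16, 31, 155)
assert eager_execution_footprint(5)[2] > 192
```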
If you want to diverge on 3 levels of branches, it means you're going to have 8 parallel contexts, each with only 24 entries available to run ahead. And even that's only when you ignore the overhead of rolling back 7/8 of your work, the need to duplicate all state-saving HW (like registers, of which you have dozens), and the need to split other resources into 8 parts like you did with the ROB. Also, that's not counting memory management, which would have to manage complicated versioning, forwarding, coherency, etc.
Forget about power consumption, even if you could support that wasteful parallelism, spreading your resources that thin would literally choke you before you could advance more than a few instructions on each path.
Now, let's examine the more reasonable option of splitting over a single branch - this is beginning to look like Hyperthreading - you split/share your core resources over 2 contexts. This feature has some performance benefits, granted, but only because both contexts are non-speculative. As it is, I believe the common estimation is around 10-30% over running the 2 contexts one after the other, depending on the workload combination (numbers from a review by AnandTech here) - that's nice if you indeed intended to run both tasks one after the other, but not when you're about to throw away the results of one of them. Even if you ignore the mode switch overhead here, you're gaining 30% only to lose 50% - no sense in that.
On the other hand, you have the option of predicting the branches (modern predictors today can reach over 95% success rate on average), and paying the penalty of misprediction, which is already partially hidden by the out-of-order engine (some instructions predating the branch may execute after it's cleared; most OOO machines support that). This leaves any deep out-of-order engine free to roam ahead, speculating up to its full potential depth, and being right most of the time. The odds of flushing some of the work here do decrease geometrically (95% after the first branch, ~90% after the second, etc.), but the flush penalty also decreases. It's still far better than a global efficiency of 1/n (for n levels of bifurcation).
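The geometric decay mentioned above is just repeated multiplication of the per-branch accuracy:

```python
def on_correct_path(accuracy, k):
    """Probability of still being on the correct path after k predicted branches."""
    return accuracy ** k

assert abs(on_correct_path(0.95, 1) - 0.95) < 1e-12
assert abs(on_correct_path(0.95, 2) - 0.9025) < 1e-12  # ~90% after two branches
```

Compare that with eager execution's guaranteed 1/2^k useful work after k levels: prediction stays above 77% even eight branches deep at 95% accuracy.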

What is the advantage of having instructions in a uniform format?

Many processors have instructions of a uniform format and width, such as the ARM, where all instructions are 32 bits long. Other processors have instructions in multiple widths of, say, 2, 3, or 4 bytes, such as the 8086.
What is the advantage of having all instructions the same width and in a uniform format?
What is the advantage of having instructions in multiple widths?
Fixed Length Instruction Trade-offs
The advantage of fixed length instructions with relatively uniform formatting is that fetching and parsing the instructions is substantially simpler.
For an implementation that fetches a single instruction per cycle, a single aligned memory (cache) access of the fixed size is guaranteed to provide one (and only one) instruction, so no buffering or shifting is required. There is also no concern about crossing a cache line or page boundary within a single instruction.
The instruction pointer is incremented by a fixed amount (except when executing control flow instructions--jumps and branches) independent of the instruction type, so the location of the next sequential instruction can be available early with minimal extra work (compared to having to at least partially decode the instruction). This also makes fetching and parsing more than one instruction per cycle relatively simple.
Having a uniform format for each instruction allows trivial parsing of the instruction into its components (immediate value, opcode, source register names, destination register name). Parsing out the source register names is the most timing critical; with these in fixed positions it is possible to begin reading the register values before the type of instruction has been determined. (This register reading is speculative since the operation might not actually use the values, but this speculation does not require any special recovery in the case of mistaken speculation but does take extra energy.) In the MIPS R2000's classic 5-stage pipeline, this allowed reading of the register values to be started immediately after instruction fetch providing half of a cycle to compare register values and resolve a branch's direction; with a (filled) branch delay slot this avoided stalls without branch prediction.
(Parsing out the opcode is generally a little less timing critical than source register names, but the sooner the opcode is extracted the sooner execution can begin. Simple parsing out of the destination register name makes detecting dependencies across instructions simpler; this is perhaps mainly helpful when attempting to execute more than one instruction per cycle.)
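The "trivial parsing" claim can be made concrete with the 32-bit MIPS format (this sketch uses the classic field positions; rs and rt occupy the same bits in the R-type and I-type formats, which is what lets register reads start before the opcode is fully decoded):

```python
def mips_fields(word):
    """Extract the fixed-position fields of a 32-bit MIPS instruction."""
    opcode = (word >> 26) & 0x3F  # always bits 31..26
    rs     = (word >> 21) & 0x1F  # first source register, fixed position
    rt     = (word >> 16) & 0x1F  # second source (R-type) / destination (I-type)
    rd     = (word >> 11) & 0x1F  # destination (R-type)
    return opcode, rs, rt, rd

# "add $3, $1, $2" encodes as 0x00221820 (opcode 0, rs=1, rt=2, rd=3):
assert mips_fields(0x00221820) == (0, 1, 2, 3)
```

No decoding of the opcode is needed before the shifts and masks above can run; that is the timing advantage described.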
In addition to providing the parsing sooner, simpler encoding makes parsing less work (energy use and transistor logic).
A minor advantage of fixed length instructions compared to typical variable length encodings is that instruction addresses (and branch offsets) use fewer bits. This has been exploited in some ISAs to provide a small amount of extra storage for mode information. (Ironically, in cases like MIPS/MIPS16, to indicate a mode with smaller or variable length instructions.)
Fixed length instruction encoding and uniform formatting do have disadvantages. The most obvious disadvantage is relatively low code density. Instruction length cannot be set according to frequency of use or how much distinct information is required. Strict uniform formatting would also tend to exclude implicit operands (though even MIPS uses an implicit destination register name for the link register) and variable-sized operands (most RISC variable length encodings have short instructions that can only access a subset of the total number of registers).
(In a RISC-oriented ISA, this has the additional minor issue of not allowing more work to be bundled into an instruction to equalize the amount of information required by the instruction.)
Fixed length instructions also make using large immediates (constant operands included in the instruction) more difficult. Classic RISCs limited immediate lengths to 16-bits. If the constant is larger, it must either be loaded as data (which means an extra load instruction with its overhead of address calculation, register use, address translation, tag check, etc.) or a second instruction must provide the rest of the constant. (MIPS provides a load high immediate instruction, partially under the assumption that large constants are mainly used to load addresses which will later be used for accessing data in memory. PowerPC provides several operations using high immediates, allowing, e.g., the addition of a 32-bit immediate in two instructions.) Using two instructions is obviously more overhead than using a single instruction (though a clever implementation could fuse the two instructions in the front-end [What Intel calls macro-op fusion]).
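The two-instruction constant build described above (MIPS lui + ori style) amounts to a simple split; a sketch follows. Since ori zero-extends, this split is exact; a sign-extending add-immediate would instead need a carry adjustment when the low half's top bit is set.

```python
def split32(value):
    """Split a 32-bit constant into (lui half, ori half)."""
    return (value >> 16) & 0xFFFF, value & 0xFFFF

hi, lo = split32(0x12345678)
# lui writes hi into the upper half; ori merges in the lower half:
assert (hi << 16) | lo == 0x12345678
```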
Fixed length instructions also make it more difficult to extend an instruction set while retaining binary compatibility (and not requiring additional modes of operation). Even strictly uniform formatting can hinder extension of an instruction set, particularly for increasing the number of registers available.
Fujitsu's SPARC64 VIIIfx is an interesting example. It uses a two-bit opcode (in its 32-bit instructions) to indicate a loading of a special register with two 15-bit instruction extensions for the next two instructions. These extensions provide extra register bits and indication of SIMD operation (i.e., extending the opcode space of the instruction to which the extension is applied). This means that the full register name of an instruction not only is not entirely in a fixed position, but not even in the same "instruction". (Similarities to x86's REX prefix--which provides bits to extend register names encoded in the main part of the instruction--might be noted.)
(One aspect of fixed length encodings is the tyranny of powers of two. Although it is possible to use non-power-of-two instruction lengths [Tensilica's Xtensa now has fixed 24-bit instructions as its base ISA--with 16-bit short instruction support being an extension, previously part of the base ISA; IBM had an experimental ISA with 40-bit instructions.], this adds a little complexity. If one size, e.g., 32 bits, is a little too short, the next available size, e.g., 64 bits, is likely to be too long, sacrificing too much code density.)
For implementations with deep pipelines the extra time required for parsing instructions is less significant. The extra dynamic work done by hardware and the extra design complexity are reduced in significance for high performance implementations which add sophisticated branch prediction, out-of-order execution, and other features.
Variable Length Instruction Trade-offs
For variable length instructions, the trade-offs are essentially reversed.
Greater code density is the most obvious advantage. Greater code density can improve static code size (the amount of storage needed for a given program). This is particularly important for some embedded systems, especially microcontrollers, since it can be a large fraction of the system cost and influence the system's physical size (which has impact on fitness for purpose and manufacturing cost).
Improving dynamic code size reduces the amount of bandwidth used to fetch instructions (both from memory and from cache). This can reduce cost and energy use and can improve performance. Smaller dynamic code size also reduces the size of caches needed for a given hit rate; smaller caches can use less energy and less chip area and can have lower access latency.
(In a non- or minimally pipelined implementation with a narrow memory interface, fetching only a portion of an instruction in a cycle in some cases does not hurt performance as much as it would in a more pipelined design less limited by fetch bandwidth.)
With variable length instructions, large constants can be used in instructions without requiring all instructions to be large. Using an immediate rather than loading a constant from data memory exploits spatial locality, provides the value earlier in the pipeline, avoids an extra instruction, and removes a data cache access. (A wider access is simpler than multiple accesses of the same total size.)
Extending the instruction set is also generally easier given support for variable length instructions. Additional information can be included by using extra-long instructions. (In the case of some encoding techniques--particularly the use of prefixes--it is also possible to add hint information to existing instructions, allowing backward compatibility with additional new information. x86 has exploited this not only to provide branch hints [which are mostly unused] but also the Hardware Lock Elision extension. For a fixed length encoding, it would be difficult to choose in advance which operations should have additional opcodes reserved for possible future addition of hint information.)
Variable length encoding clearly makes finding the start of the next sequential instruction more difficult. This is somewhat less of a problem for implementations that only decode one instruction per cycle, but even in that case it adds extra work for the hardware (which can increase cycle time or pipeline length as well as use more energy). For wider decode several tricks are available to reduce the cost of parsing out individual instructions from a block of instruction memory.
One technique that has mainly been used microarchitecturally (i.e., not included in the interface exposed to software but only as an implementation technique) is to use marker bits to indicate the start or end of an instruction. Such marker bits would be set for each parcel of instruction encoding and stored in the instruction cache. This delays the availability of the information on an instruction cache miss, but that delay is typically small compared to the ordinary delay in filling a cache miss. The extra (pre)decoding work is only needed on a cache miss, so time and energy are saved in the common case of a cache hit (at the cost of some extra storage and bandwidth, which has some energy cost).
(Several AMD x86 implementations have used marker bit techniques.)
Alternatively, marker bits could be included in the instruction encoding. This places some constraints on opcode assignment and placement since the marker bits effectively become part of the opcode.
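The marker-bit predecode pass can be sketched as follows. The walk itself is generic; the toy length rule at the bottom is invented purely for the example and does not correspond to any real ISA.

```python
def mark_starts(code, length_of):
    """Predecode on cache fill: set a marker at each byte that begins an instruction."""
    marks = [0] * len(code)
    i = 0
    while i < len(code):
        marks[i] = 1              # this byte begins an instruction
        i += length_of(code[i])   # skip to the next instruction start
    return marks

# Toy length rule (purely illustrative): "opcodes" below 0x80 are 2 bytes,
# the rest are 4 bytes.
toy_len = lambda b: 2 if b < 0x80 else 4
assert mark_starts(bytes([0x10, 0, 0x90, 0, 0, 0]), toy_len) == [1, 0, 1, 0, 0, 0]
```

Decoders downstream can then find instruction boundaries by scanning the marker bits alone, without re-running the length logic on every fetch.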
Another technique, used by the IBM zSeries (the S/360 and its descendants), is to encode the instruction length in a simple way in the opcode in the first parcel. The zSeries uses two bits to encode three different instruction lengths (16, 32, and 48 bits), with two encodings used for the 32-bit length. By placing this in a fixed position, it is relatively easy to quickly determine where the next sequential instruction begins.
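That length rule is simple enough to sketch directly; the top two bits of the first opcode byte give the length (00 for 2 bytes, 01 or 10 for 4 bytes, 11 for 6 bytes, per the S/360 convention):

```python
def s360_length_bytes(first_byte):
    """Instruction length from the top two bits of the first opcode byte."""
    top2 = (first_byte >> 6) & 0b11
    return {0b00: 2, 0b01: 4, 0b10: 4, 0b11: 6}[top2]

assert s360_length_bytes(0x1A) == 2  # AR  (RR format)
assert s360_length_bytes(0x5A) == 4  # A   (RX format)
assert s360_length_bytes(0xD2) == 6  # MVC (SS format)
```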
(More aggressive predecoding is also possible. The Pentium 4 used a trace cache containing fixed-length micro-ops and recent Intel processors use a micro-op cache with [presumably] fixed-length micro-ops.)
Obviously, variable length encodings require addressing at the granularity of a parcel, which is typically smaller than an instruction for a fixed-length ISA. This means that branch offsets either lose some range or must use more bits. This can be compensated for by supporting more immediate sizes.
Likewise, fetching a single instruction can be more complex since the start of the instruction is likely to not be aligned to a larger power of two. Buffering instruction fetch reduces the impact of this, but adds (trivial) delay and complexity.
With variable length instructions it is also more difficult to have uniform encoding. This means that part of the opcode must often be decoded before the basic parsing of the instruction can be started. This tends to delay the availability of register names and other, less critical information. Significant uniformity can still be obtained, but it requires more careful design and weighing of trade-offs (which are likely to change over the lifetime of the ISA).
As noted earlier, with more complex implementations (deeper pipelines, out-of-order execution, etc.), the extra relative complexity of handling variable length instructions is reduced. After instruction decode, a sophisticated implementation of an ISA with variable length instructions tends to look very similar to one of an ISA with fixed length instructions.
It might also be noted that much of the design complexity for variable length instructions is a one-time cost; once an organization has learned techniques (including the development of validation software) to handle the quirks, the cost of this complexity is lower for later implementations.
Because of the code density concerns for many embedded systems, several RISC ISAs provide variable length encodings (e.g., microMIPS, Thumb2). These generally only have two instruction lengths, so the additional complexity is constrained.
Bundling as a Compromise Design
One (sort of intermediate) alternative chosen for some ISAs is to use a fixed length bundle of instructions with different length instructions. By containing instructions in a bundle, each bundle has the advantages of a fixed length instruction and the first instruction in each bundle has a fixed, aligned starting position. The CDC 6600 used 60-bit bundles with 15-bit and 30-bit operations. The M32R uses 32-bit bundles with 16-bit and 32-bit instructions.
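A small counting sketch illustrates the compromise: however 15- and 30-bit operations are packed into a 60-bit bundle (CDC 6600 style), the next bundle always starts at a fixed, aligned boundary, so only intra-bundle parsing is variable. The enumeration below just counts the ordered fillings; it assumes nothing about the real opcode encodings.

```python
def bundle_fills(bits=60, widths=(15, 30)):
    """Count the distinct ordered ways the given operation widths can fill a bundle."""
    if bits == 0:
        return 1
    return sum(bundle_fills(bits - w, widths) for w in widths if w <= bits)

# Five ways to fill a 60-bit bundle:
# 4x15, 15+15+30, 15+30+15, 30+15+15, 30+30
assert bundle_fills(60) == 5
```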
(Itanium uses fixed length power-of-two bundles to support non-power of two [41-bit] instructions and has a few cases where two "instructions" are joined to allow 64-bit immediates. Heidi Pan's [academic] Heads and Tails encoding used fixed length bundles to encode fixed length base instruction parts from left to right and variable length chunks from right to left.)
Some VLIW instruction sets use a fixed size instruction word but individual operation slots within the word can be a different (but fixed for the particular slot) length. Because different operation types (corresponding to slots) have different information requirements, using different sizes for different slots is sensible. This provides the advantages of fixed size instructions with some code density benefit. (In addition, a slot might be allocated to optionally provide an immediate to one of the operations in the instruction word.)

Why does speedup decrease with an increase in the number of pipeline stages?

I am watching a video tutorial on pipelining at link.
At time 4:30, the instructor says that with increase in number of stages, we need to also add pipeline registers, which creates overhead, and due to this speedup cannot increase beyond an optimal value with increase in number of stages.
Can someone please elaborate on this? My doubt is that the pipeline register might be adding some delay to the individual stage cycle time, so why does it become a problem when the number of stages is large compared to a few?
Thanks.
The latches themselves do have a small delay (they are, after all, "doing work", i.e., switching). By itself, this would only produce an asymptotic approach to a fixed peak performance value. E.g., starting from the (already unrealistically tiny) case of each stage's actual work time equaling the latch delay, doubling the pipeline depth (excluding other, real-world constraints) would reduce the cycle time to the latch delay plus 1/2 latch delay (increasing clock speed by just over 33%), but doubling the pipeline depth again would only reduce the cycle time to the latch delay plus 1/4 latch delay.
Even at an infinite number of pipeline stages with each stage (somehow) doing infinitesimal work, the minimum cycle time would equal one latch delay, doubling the clock speed relative to a pipeline depth where latch delay equals actual work time. On a slightly practical level, one transistor switch delay of real work is a relatively hard constraint.
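The asymptote above follows from a one-line model: cycle time is the per-stage work plus the latch delay, so speedup over the unpipelined design is bounded by (work + latch) / latch.

```python
def pipeline_speedup(depth, total_work, latch):
    """Speedup over unpipelined execution, with latch delay per stage."""
    return (total_work + latch) / (total_work / depth + latch)

# With 16 units of work and a 1-unit latch delay, the bound is 17x:
assert abs(pipeline_speedup(16, 16.0, 1.0) - 8.5) < 1e-9  # depth 16: cycle = 2
assert pipeline_speedup(1024, 16.0, 1.0) < 17.0           # never reaches 17
```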
However, before latch delay itself prevents further improvement, other real-world factors limit the benefit of increased pipeline depth.
On the more physical level, excluding area and power/thermal density constraints, getting the clock signal to transition uniformly at very high precision across an entire design is challenging at such high clock speeds. Clock skew and jitter become more significant when there is less margin in work time to absorb variation. (This is even excluding variation in manufacturing or environmental conditions like temperature.)
Besides such more physical constraints, dependence constraints tend to prevent a deeper pipeline from increasing performance. While control dependencies (e.g., condition evaluation of a branch) can often be hidden by prediction, as Gabe notes in his answer, a branch misprediction can require a pipeline flush. Even at 99% prediction accuracy and one branch every ten instructions (95% and one branch every five instructions are more likely), a thousand-stage branch resolution delay (i.e., excluding stages after branch resolution and assuming branch target is available no later than branch direction) would mean half of the performance is taken by branch mispredictions.
Instruction cache misses would also be a problem. If one had perfect control flow prediction, one could use prefetching to hide the delay. This effectively becomes a part of the branch prediction problem. Also, note that increasing cache size to reduce miss rate (or branch predictor size to reduce misprediction rate) increases access latency (pipeline stage count).
Data value dependencies are more difficult to handle. If the execution takes two cycles, then two sequential instructions with a data dependence will not be able to execute back-to-back. While value prediction could, theoretically, help in some cases, it is most helpful in relatively limited cases. It is also possible for some operations to be width-pipelined (e.g., add, subtract, bitwise logical operations, and left shifts). However, such tricks have limits.
Data cache misses become part of this data dependence issue. Data memory addresses tend to be much more difficult to predict than instruction addresses.
This Google Scholar search provides some more detailed (and technical) reading on this subject.
Without watching the hour-long video, I would say that the actual problem when there are a large number of stages is that pipeline stalls are worse. If you have a 14-stage pipeline and mispredict a branch, that's 14 stages that you have to fill up again before issuing another instruction.