Bottleneck when using indexed addressing modes - x86-64

I performed the following experiments both on a Haswell and a Coffee Lake machine.
The instruction
cmp rbx, qword ptr [r14+rax]
has a throughput of 0.5 (i.e., 2 instructions per cycle). This is as expected. The instruction is decoded to one µop that is later unlaminated (see https://stackoverflow.com/a/31027695/10461973) and, thus, requires two retire slots.
If we add a nop instruction
cmp rbx, qword ptr [r14+rax]; nop
I would expect a throughput of 0.75, as this sequence requires 3 retire slots, and there also seem to be no other bottlenecks in the back-end. This is also the throughput that IACA reports. However, the actual throughput is 1 (this is independent of whether the µops come from the decoders or the DSB). What is the bottleneck in this case?
Without the indexed addressing mode,
cmp rbx, qword ptr [r14]; nop
has a throughput of 0.5, as expected.

It seems you've uncovered a downside to unlamination vs. regular multi-uop instructions, perhaps in the interaction with 4-wide issue/rename/allocate when a micro-fused uop reaches the head of the IDQ.
Hypothesis: maybe both uops resulting from un-lamination have to be part of the same issue group, so unlaminated; nop repeated only achieves a front-end throughput of 3 fused-domain uops per clock.
That might make sense if un-lamination only happens at the head of the IDQ, as uops reach the alloc/rename stage, rather than as they're added to the IDQ. To test this, we could check whether LSD (loop buffer) capacity on Haswell depends on the uop count before or after unlamination - #AndreasAbel's testing shows that a loop containing 55x cmp rbx, [r14+rax] runs from the LSD on Haswell, so that's strong evidence that unlamination happens during alloc/rename, not taking multiple entries in the IDQ itself.
For comparison, cmp dword [rip+rel32], 1 won't micro-fuse in the first place, in the decoders, so it won't un-laminate. If it achieves 0.75c throughput, that would be evidence in support of un-lamination requiring room in the same issue group.
Perhaps times 2 nop followed by the un-laminating cmp, or times 3 nop, could also be an interesting test to see if the unlaminated uop ever issues by itself, or can reliably grab 2 more NOPs after it's delayed from whatever position in an issue group. From your test with back-to-back un-laminating cmp instructions, I expect we'd still see mostly full 4-uop issue groups.
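For anyone who wants to try these sequences without a dedicated harness, something like the following C skeleton works (a minimal sketch, assuming x86-64 Linux with GCC/Clang; rdtsc counts reference cycles, so for real numbers you'd pin the core clock or read core-cycle perf counters, and unroll much more to amortize the loop-branch uops):

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

int main(void) {
    static uint64_t buf[8];                 /* dummy memory for the cmp to read */
    const uint64_t iters = 100000000ULL;
    uint64_t reg = 0, idx = 0;

    uint64_t t0 = __rdtsc();
    for (uint64_t i = 0; i < iters; i++) {
        /* 4x the sequence under test: cmp reg, [base+idx] ; nop (indexed addressing mode) */
        __asm__ volatile(
            "cmpq (%[base],%[idx]), %[r]\n\t" "nop\n\t"
            "cmpq (%[base],%[idx]), %[r]\n\t" "nop\n\t"
            "cmpq (%[base],%[idx]), %[r]\n\t" "nop\n\t"
            "cmpq (%[base],%[idx]), %[r]\n\t" "nop\n\t"
            : : [r] "r"(reg), [base] "r"(buf), [idx] "r"(idx) : "cc");
    }
    uint64_t dt = __rdtsc() - t0;

    printf("~%.2f reference cycles per cmp+nop pair\n", (double)dt / (iters * 4));
    return 0;
}

Dropping the nop, or switching the addressing mode to (%[base]) alone, gives the other data points from the question.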
Your question mentions retirement but not issue.
Retire is at least as wide as issue (4-wide from Core2 to Skylake, 5-wide in Ice Lake).
Sandybridge / Haswell retire 4 fused-domain uops/clock. Skylake can retire 4 fused-domain uops per clock per hyperthread, allowing quicker release of resources like load-buffer entries after one old stalled uop finally completes, if both logical cores are busy. It's not 100% clear whether it can retire 8/clock when running in single-thread mode; I found conflicting claims, and no clear statement in Intel's optimization manual.
It's very hard if not impossible to actually create a bottleneck on retirement (but not issue). Any sustained stream has to get through the issue stage, which is not wider than retirement. (Performance counters for uops_issued.any indicate that un-lamination happens at some point before issue, so that doesn't help us jam more uops through the front-end than retirement can handle. Unless that's misleading; running the same loop on both logical cores of the same physical core should have the same overall bottleneck, but if Skylake runs it faster, that would tell us that parallel SMT retirement helped. Unlikely, but something to check if anyone wants to rule it out.)
This is also the throughput that IACA reports
IACA's pipeline model seems pretty naive; I don't think it knows about Sandybridge's multiple-of-4-uop issue effect (e.g. a 6 uop loop costs the same as 8). IACA also doesn't know that Haswell can keep add eax, [rdi+rdx] micro-fused throughout the pipeline, so any analysis of indexed uops that don't un-laminate is wrong.
I wouldn't trust IACA to do more than count uops and make some wild guesses about how they will allocate to ports.

Related

Throughput vs latency in computer architecture

I've come across articles on "throughput vs latency" in contexts like networking, e.g. https://homepage.cs.uri.edu/~thenry/resources/unix_art/ch12s04.html. But in the context of computer architecture / operating systems, I don't understand why there would be a trade-off between latency (response time of a program) and throughput (how many programs we're able to complete in a unit of time, say per hour). Is this solely because we can choose to parallelize processing of multiple programs / requests, leading to overheads like context switches and sharing of caches, which make the start-to-end response time per process worse? Or am I missing something here?
In terms of single instructions in a superscalar pipelined out-of-order exec CPU, throughput vs. latency is very important because the CPU is trying to extract parallelism from an instruction stream that has to be executed as if in serial program order. See Assembly - How to score a CPU instruction by latency and throughput and the bottom of my answer on latency vs throughput in intel intrinsics for example.
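To make the instruction-level trade-off concrete, here's a hedged C sketch (iteration count and constants are arbitrary): the first loop is bound by the latency of a single FP-multiply dependency chain, while the second gives the out-of-order core independent chains it can overlap.

#include <stdio.h>

int main(void) {
    const long n = 100000000;
    double a = 1.0, b = 1.0, c = 1.0, d = 1.0, e = 1.0;

    /* Latency-bound: every multiply depends on the previous result,
       so this runs at about 1 multiply per FP-multiply latency (4-5 cycles). */
    for (long i = 0; i < n; i++)
        a *= 1.0000000001;

    /* Four independent chains per iteration: the out-of-order core overlaps
       them, so each multiply effectively costs much less; with enough chains
       the limit becomes the multiplier's throughput rather than its latency. */
    for (long i = 0; i < n; i++) {
        b *= 1.0000000001;
        c *= 1.0000000001;
        d *= 1.0000000001;
        e *= 1.0000000001;
    }

    printf("%g %g %g %g %g\n", a, b, c, d, e);   /* keep the results live */
    return 0;
}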
In terms of OS decisions that affect throughput vs. latency on a much longer timescale than a few clock cycles, that's a totally separate question.
One of the major factors there is choosing how to use the available physical RAM, and whether to page out (to a swap file) infrequently used code / data to make more room to cache disk files. (e.g. Linux's vm.swappiness is widely considered a key tunable, often set differently for servers vs. desktops. https://unix.stackexchange.com/questions/88693/why-is-swappiness-set-to-60-by-default).
If you alt-tab to a window belonging to a process whose pages have largely been paged out, it will take some time before the process can redraw its window. (Multiple hard page faults can be quite slow, especially when paging from a rotational disk rather than an SSD.) So to optimize for latency, you want the kernel not to aggressively swap out pages from running processes, even if they've been idle for a few hours. Those pages, if they'd been freed, could have improved throughput for other processes by acting as buffers / cache.
A related factor is I/O scheduling: trying to group IO requests together to minimize HD seek times (for higher throughput and lower average latency), but sometimes at the expense of delaying a few requests for a longer time (higher worst-case latency). Linux for example has many to choose from, including deadline, Completely Fair Queuing (CFQ), and the original elevator (just grouping requests by locality without consideration of fairness or latency). https://wiki.archlinux.org/title/improving_performance#Input/output_schedulers
CPU scheduling is also a factor: a context-switch hurts throughput, as it takes time itself and caches will likely be cold for the new task on this CPU. You also have to run the kernel's schedule() function to decide which task to run next, so that takes away some time from real work.
To minimize latency (for example between a socket message being sent to a process and it waking up when its poll or select system call returns), you want a short timeslice, like Linux HZ=1000. (Timer interrupts every 1 ms to run the scheduler). And you want to be able to pre-empt even the kernel itself, instead of waiting until the kernel is ready to return to the old user-space to consider the possibility of running a different user-space task.
But neither of these helps throughput, and in fact they hurt it (assuming the workload has enough parallelism to not bottleneck on latency). So HZ=100 was the default for "server" Linux builds, vs. 1000 on "desktop" builds tuned for interactive use. (Modern Linux can be "tickless", not using a fixed timer interrupt on every core at all, instead deciding when to schedule the next interrupt on a case-by-case basis.)
Real-time kernels take this even further, spending more time on finer-grained locking and stuff like that to enable pausing work and coming back to it later to minimize interrupt latency and other latencies between it being time to do something and actually starting to do that thing. (There are real-time patches for Linux, and there are also totally separate kernels built from the ground up for real-time operation.)
If you have an embedded system controlling a motor or something, you absolutely need hard real-time latency guarantees that it will never take longer than say 1 millisecond from an interrupt pin being asserted to the interrupt handler starting to run.
(Designing the system to make these guarantees possible often comes at the cost of throughput. e.g. obviously you have to pin some memory to make it not swappable, if we're talking about user-space, making it unavailable for cache even if it goes untouched for days.)

What does a 'split' cache mean? And how is it useful (if it is)?

I was doing a question on Computer Architecture and in it, it was mentioned that the cache is a split cache with no hazards. What does this exactly mean?
A summary and additional discussion can be found at: L1 caches usually have split design, but L2, L3 caches have unified design, why?.
Introduction
A split cache is a cache that consists of two physically separate parts, where one part, called the instruction cache, is dedicated to holding instructions and the other, called the data cache, is dedicated to holding data (i.e., the memory operands of instructions). The instruction cache and the data cache together are logically considered to be a single cache, described as a split cache, because both are hardware-managed caches for the same physical address space at the same level of the memory hierarchy. Instruction fetch requests are handled only by the instruction cache and memory operand read and write requests are handled only by the data cache. A cache that is not split is called a unified cache.
The Harvard vs. von Neumann architecture distinction originally applied to main memory. However, most modern computer systems implement the modified Harvard architecture, whereby the L1 cache implements the Harvard architecture and the rest of the memory hierarchy implements the von Neumann architecture. Therefore, in modern systems, the Harvard vs. von Neumann distinction mostly applies to the L1 cache design. That's why the split cache design is also called the Harvard cache design and the unified cache design is also called von Neumann. The Wikipedia article on the modified Harvard architecture discusses three variants of the architecture, one of which is the split cache design.
To my knowledge, the idea of the split cache design was first proposed and evaluated by James Bell, David Casasent, and C. Gordon Bell in their paper entitled An Investigation of Alternative Cache Organizations, which was published in 1974 in the IEEE TC journal (the IEEE version is a bit clearer). The authors found using a simulator that, for almost all cache capacities considered in the study, an equal split results in the best performance (see Figure 5). From the paper:
Typically, the best performance occurs with half of the cache devoted
to instructions and half to data.
They also provided a comparison with a unified cache design of the same capacity and their initial conclusion was that the split design has no advantage over the unified design.
As shown in Fig. 6, the performance of the best dedicated cache CUXD
(half allotted to instructions and half to data) in general is quite
similar to that of a homogeneous cache (CUX); the extra complexity of
a dedicated cache control is thus not justifiable.
It's not clear to me actually whether the paper evaluated the split design or a cache that is partitioned between instructions and data. One paragraph says:
Thus far, the cache memory has been assumed to be composed of
homogeneous cells. But conceivably a functionally specialized
partitioning of the cache could give higher performance. For example,
perhaps a cache devoted exactly half to instructions and half to data
would be more effective than a homogeneous one; alternatively, one
that holds just instructions could be better than one holding just
data. To test these hypotheses, the effects of dividing the cache into
sections dedicated to specific uses were investigated.
It seems to me that the authors are talking about both the split and partitioned designs. But it's not clear what design was implemented in the simulator and how the simulator was configured for evaluation.
Note that the paper didn't discuss why the split design may have a better or worse performance than the unified design. Also note how the authors used the terms "dedicated cache" and "homogeneous cache." The terms "split" and "unified" appeared in later works, and I believe they were first used by Alan Jay Smith in Directions for memory hierarchies and their components: research and development in 1978. But I'm not sure because the way Alan used these terms gives the impression that they were already well-known. It appears to me from Alan's paper that the first processor that used the split cache design was the IBM 801 around 1975, and probably the second processor was the S-1 (around 1976). It's possible that the engineers of these processors came up with the split design idea independently.
Advantages of the Split Cache Design
The split cache design was then extensively studied over the next two decades. See, for example, Section 2.8 of this highly influential paper. But it was quickly recognized that the split design is useful for pipelined processors, where the instruction fetch unit and the memory access unit are physically located in different parts of the chip. With the unified design, it is impossible to place the cache simultaneously close to the instruction fetch unit and the memory unit, resulting in high cache access latency from one or both units. The split design enables us to place the instruction cache close to the instruction fetch unit and the data cache close to the memory unit, thereby simultaneously reducing the latencies of both. (See what this looks like in the S-1 processor in Figure 3 of this document.) This is the primary advantage of the split design over the unified design. This is also the crucial difference between the split design and a unified design that supports cache partitioning. That's why it makes sense to have a split data cache, as proposed in several research works, such as Cache resident data locality analysis and Partitioned first-level cache design for clustered microarchitectures.
Another advantage of the split design is that it allows instruction and data accesses to occur in parallel without contention. Essentially, a split cache can have double the bandwidth of a unified cache. This improves performance in pipelined processors because instruction and data accesses can occur in the same cycle in different stages of the pipeline. Alternatively, the bandwidth of a unified cache can be doubled or improved using multiple access ports or multiple banks. In fact, using two ports provides twice the bandwidth to the whole cache (in contrast, in the split design, the bandwidth is split in half between the instruction cache and the data cache), but adding another port is more expensive in terms of area and power and may impact latency. A third alternative to improve the bandwidth is to add more wires to the same port so that more bits can be accessed in the same cycle, but this would probably be restricted to the same cache line (in contrast to the two other approaches). If the cache is off-chip, then the wires that connect it to the pipeline become pins and the impact of the number of wires on area, power, and latency becomes more significant.
In addition, processors that use a unified (L1) cache typically include arbitration logic that prioritizes data accesses over instruction accesses; this logic can be eliminated in the split design. (See the discussion of the Z80000 processor below for a unified design that avoids arbitration.) Similarly, if there is another cache level that implements the unified design, there will be a need for arbitration logic at the unified L2 cache. Simple arbitration policies may reduce performance and better policies may increase area. [TODO: Add examples of policies.]
Another potential advantage is that the split design allows us to employ different replacement policies for the instruction cache and data cache that may be more suitable for the access patterns of each cache. All Intel Itanium processors use the LRU policy for the L1I and the NRU policy for the L1D (I know for sure that this applies to the Itanium 2 and later, but I'm not sure about the first Itanium). Moreover, starting with the Itanium 9500, the L1 ITLB uses NRU but the L1 DTLB uses LRU. Intel didn't disclose why they decided to use different replacement policies in these processors. In general, it seems to me that it's uncommon for the L1I and L1D to use different replacement policies. I couldn't find a single research paper on this (all papers on replacement policies focus only on data or unified caches). Even for a unified cache, it may be useful for the replacement policy to distinguish between instruction and data lines. In a split design, a cache line fetched into the data cache can never displace a line in the instruction cache. Similarly, a line filled into the instruction cache can never displace a line in the data cache. This issue, however, may occur in the unified design.
The last sub-section of the section on the differences between the modified Harvard architecture and the Harvard and von Neumann architectures in the Wikipedia article mentions that the Mark I machine used different memory technologies for the instruction and data memories. This made me wonder whether this can constitute an advantage for the split design in modern computer systems. Here are some of the papers that show that this is indeed the case:
LASIC: Loop-Aware Sleepy Instruction Caches Based on STT-RAM Technology: The instruction cache is mostly read-only, except when there is a miss, in which case the line must be fetched and filled into the cache. This means that, when using STT-RAM (or really any other NVRAM technology), the expensive write operations occur less frequently compared to using STT-RAM for the data cache. The paper shows that by using an SRAM loop cache (like the LSD in Intel processors) and an STT-RAM instruction cache, energy consumption can be significantly reduced, especially when a loop is being executed that fits entirely in the loop cache. The non-volatile property of STT-RAM enables the authors to completely power-gate the instruction cache without losing its contents. In contrast, with an SRAM instruction cache, the static energy consumption is much larger, and power-gating it results in losing its contents. There is, however, a performance penalty with the proposed design (compared to a pure SRAM cache hierarchy).
Feasibility exploration of NVM based I-cache through MSHR enhancements: This paper also proposes using STT-RAM for the instruction cache while the data cache and the L2 cache remain based on SRAM. There is no loop cache here. This paper instead targets the high write latency issue of STT-RAM, which is incurred when a line is filled in the cache. The idea is that when a requested line is received from the L2 cache, the L1 cache first buffers the line in the MSHR allocated for its request. The MSHRs are still SRAM-based. Then the instruction cache line can be fed into the pipeline directly from the MSHR without having to potentially stall until it gets written in the STT-RAM cache. Similar to the previous work, the proposed architecture improves energy consumption at the expense of reduced performance.
System level exploration of a STT-MRAM based level 1 data-cache: Proposes using STT-RAM for the L1 data cache while keeping all other caches SRAM-based. This reduces area overhead and energy consumption, but performance is penalized.
Loop optimization in presence of STT-MRAM caches: A study of performance-energy tradeoffs: Compares the energy consumption and performance of pure (only SRAM or only STT-RAM) and hybrid (the L2 and instruction cache are STT-RAM-based) hierarchies. The hybrid cache hierarchy offers a performance-energy tradeoff that lies between the pure SRAM and pure STT-RAM hierarchies.
So I think we can say that one advantage of the split design is that we can use different memory technologies for the instruction and data caches.
There are two other advantages, which will be discussed later in this answer.
Disadvantages of the Split Cache Design
The split design has its problems, though. First, the combined space of the instruction and data caches may not be efficiently utilized. A cache line that contains both instructions and data may exist in both caches at the same time. In contrast, in a unified cache, only a single copy of the line would exist in the cache. In addition, the size of the instruction cache and/or the data cache may not be optimal for all applications or different phases of the same application. Simulations have shown that a unified cache of the same total size has a higher hit rate (see the VSC paper discussed later). This is the primary disadvantage of the split design. (If there is a placement contention on a single cache set in the split design, this contention may still occur in the unified design and it may have a worse impact on performance. In such a scenario, the split design would have a lower overall miss rate.)
Second, self-modifying code leads to consistency issues that need to be considered at the microarchitecture-level and/or software-level. (An inconsistency may be allowed between the two caches for a small number of cycles, but if the ISA does not allow such inconsistencies to be observable, they have to be detected before the instruction that got modified permanently changes the architectural state.) Maintaining instruction consistency requires more logic and has a higher performance impact in the split design than the unified one.
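As a concrete illustration of the consistency issue (a hedged sketch, assuming Linux/x86-64 and that the OS permits a writable+executable mapping): the code bytes below are written through the data side of the hierarchy and then fetched through the instruction side. x86 keeps the two coherent in hardware; most other ISAs require the explicit cache-maintenance call.

#include <string.h>
#include <sys/mman.h>

int main(void) {
    unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00,   /* mov eax, 42 */
                             0xC3 };                          /* ret         */

    void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    memcpy(buf, code, sizeof code);                           /* stores: D-cache side */
    __builtin___clear_cache((char *)buf, (char *)buf + sizeof code);

    int (*fn)(void) = (int (*)(void))buf;
    return fn();                                              /* fetch: I-cache side  */
}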
Third, the design and hardware complexity of a split cache, compared against a single-ported unified cache, a fully dual-ported unified cache, and a dual-ported banked cache of the same overall organization parameters, is an important consideration. According to the cache area model proposed in CACTI 3.0: An Integrated Cache Timing, Power, and Area Model, the fully dual-ported design has the biggest area. This holds true irrespective of the types of the two ports (exclusive-read, exclusive-write, read/write). The dual-ported banked cache has a higher area than the single-ported unified cache. How these two compare against the split design is less obvious to me. My understanding is that the split design has a higher area than the single-ported unified design [TODO: Explain why]. It may be important to consider the cache organization details, the lengths of the cache buses to the pipeline, and the process technology. One thing to note here is that a single-ported instruction cache has a lower area than a single-ported data cache or unified cache, because the instruction cache requires only an exclusive-read port while the others require a read/write port.
Unified L1 and Split L2 Caches in Real Processors
I'm not aware of any processor designed in the last 15 years that has a unified (L1) cache. In modern processors, the unified design is mostly used for higher-numbered cache levels, which makes sense because they are not directly connected to the pipeline. An interesting example where the L2 cache follows the split design is the Intel Itanium 2 9000 processor. This processor has a 3-level cache hierarchy where both the L1 and L2 caches are split and private to each core and the L3 cache is unified and shared between all the cores. The L2D and L2I caches are 256 KB and 1 MB in size, respectively. Later Itanium processors reduced the L2I size to 512 KB. The Itanium 2 9000 manual explains why the L2 was made split:
The separate instruction and data L2 caches provide more efficient
access to the caches compared to Itanium 2 processors where
instruction requests would contend against data accesses for L2
bandwidth against data accesses and potentially impact core execution
as well as L2 throughput.
[...]
The L3 receives requests from both the
L2I and L2D but gives priority to the L2I request in the rare case of
a conflict. Moving the arbitration point from the L1-L2 in the Itanium
2 processor to the L2-L3 cache greatly reduces conflicts thanks to the
high hit rates of the L2.
(I think "against data accesses" was written twice by mistake.)
The second paragraph from that quote mentions an advantage that I have missed earlier. A split L2 cache moves the data-instruction conflict point from the L2 to the L3. In addition, some/many requests that miss in the L1 caches may hit in the L2, thereby making contention at the L3 less likely.
By the way, the L2I and L2D in the Itanium 2 9000 both use the NRU replacement policy.
Unified L1 Cache Partitioning
James Bell et al. mentioned in their 1974 paper the idea of partitioning a unified cache between instructions and data. The only paper that I'm aware of that proposed and evaluated such a design is Virtually Split Cache: An Efficient Mechanism to Distribute Instructions and Data, which was published in 2013. The main disadvantage of the split design is that one of the L1 caches may be underutilized while the other may be over-utilized. A split cache doesn't allow one cache to essentially take space from the other when needed. This is why the unified design has a lower L1 miss rate than the overall miss rate of the split caches (as the paper shows using simulation). However, the combined impact on performance of the higher latency and lower miss rate still makes the system with the unified L1 cache slower than the one with the split cache.
The Virtually Split Cache (VSC) design is the middle point between the split and unified designs. The VSC dynamically partitions (way-wise) the L1 cache between instructions and data depending on demand. This enables better utilization of the L1 cache, similar to the unified design. However, the VSC has an even lower miss rate because partitioning reduces potential space conflicts between lines holding instructions and lines holding data. According to the experimental results (all cache designs have the same overall capacity), even if the VSC has the same latency as the unified cache, the VSC has about the same performance as the split design on a single-core system and has higher performance on a multi-core system, because the lower miss rate results in less contention on accessing the shared L2 cache. In addition, in both the single-core and multi-core system configurations, the VSC reduces energy consumption due to the lower miss rate.
A VSC could have a lower latency than a unified cache. Although both are dual-ported (to have the same bandwidth as the single-ported split cache), in the VSC design only the interface needs to be dual-ported, because no part of the cache can be accessed more than once at the same time. (The paper doesn't explicitly say so, but I think the VSC allows the same line to be present in both partitions if it holds both instructions and data, so it still has the consistency problem that exists in the split design.) Assuming that each bank of the cache represents one cache way, each bank can be single-ported in the VSC. This leads to a simpler design (see: Fast quadratic increase of multiport-storage-cell area with port number) and may allow reducing the latency. Moreover, assuming that the difference in latency between the unified design and the split design is small (because the instruction cache and data cache in the split design are physically close to each other), the VSC design can store instructions and data in banks that are physically close to where they are needed in the pipeline and support variable-latency access depending on how many banks are allocated for each. The larger the number of banks, the higher the latency, up to the latency of the unified design. This would require, however, a pipeline design that can handle such a variable-latency cache.
I think one important thing that this paper is missing is evaluating the VSC design with higher access latencies with respect to the split design (not just 2 cycles vs. 3 cycles). I think increasing the latency by even only one cycle would make VSC slower than split.
The Case of the Z80000 Unified Cache
The Zilog Z80000 processor has a scalar 6-stage pipeline with an on-chip single-ported unified cache. The cache is 16-way fully associative and sectored. Each stage of the pipeline takes at least two clock cycles (loads that miss in the cache and other complex instructions may take more cycles). Each pair of consecutive clock cycles constitutes a processor cycle. The cache design of the Z80000 has a number of unique properties that I've not seen anywhere else:
There can be up to two cache accesses in a single processor cycle, including up to one instruction fetch and up to one data access. However, the cache, despite being unified and single-ported, is designed in such a way as to have no contention between instruction fetches and data accesses. The unified cache has an access latency of a single clock cycle (which is equal to half a processor cycle). In each processor cycle, an instruction fetch is performed in the first clock cycle and a data access is performed in the second clock cycle. There is no latency benefit from splitting the cache in this case; time-multiplexing accesses to the cache provides the same bandwidth, and the downsides of the split design don't apply. The full associativity minimizes space contention between instruction and data lines. This design was made possible by the small cache size and the relatively shallow pipeline with respect to the cache latency.
The System Configuration Control Longword (SCCL) offers Cache Instruction (CI) and Cache Data (CD) control bits. If CI is 1, instruction fetches that miss in the cache can be filled in the cache. If CD is 1, data loads that miss in the cache can be filled in the cache. The cache uses a write-no-allocate policy so write misses never allocate in the cache. If both CI and CD are set to 1, the cache effectively works like a unified cache. If only one of the flags is 1, the cache effectively works like a data-only or instruction-only cache. Applications can tune these flags to improve performance.
This property is not relevant to the question, but I found it interesting. The SCCL also offers a Cache Replacement (CR) control bit. Setting this bit to zero disables replacement on a miss, so lines are never replaced. If all entries in a set are occupied and a load/fetch miss occurs in that set, the line is simply not filled in the cache.
The Cases of the R3000, 80486, and Pentium P5
I came across the following complementary question on SE Retrocomputing: Why did Intel abandon unified CPU cache?. There are a number of issues with the accepted answer on that question. I'll address these issues here and explain why the 80486 and Pentium caches were designed like that based on information from Intel.
The 80386 does have an external cache controller with an external unified cache. However, just because the cache is external doesn't necessarily mean that it's likely to be unified. Consider the R3000 processor, which was released three years after the 80386 and is of the same generation as the 80486. The designers of the R3000 opted for a large external cache instead of a small on-chip cache to improve performance, according to Section 1.8 of PaceMips R3000 32-Bit, 25 MHz RISC CPU with Integrated Memory Management Unit. The first section of Chapter 1 of the R3000 Software Reference Manual says that the external cache uses the split design so that it can perform an instruction fetch and a data read or write in the same "clock phase." It's not clear to me how this exactly works, though. My understanding is that the external data and address buses are shared between the two caches and with memory as well. (Also, some of the address wires are used to provide cache line tags to the on-chip cache controller for tag matching.) Both caches are direct-mapped, maybe to achieve a single-cycle access latency. A unified external cache design with the same bandwidth, associativity, and capacity requires the cache to be fully dual-ported (or the VSC design could be used, but VSC was invented many years later). Such a unified cache would be more expensive and may have a latency larger than the required single cycle to keep the pipeline filled with instructions.
Another issue with the linked Retrocomputing answer is that just because the 80486 evolved directly from the 80386 doesn't necessarily mean that it also had to use the unified design. According to the Intel paper titled The i486 CPU: executing instructions in one clock cycle, Intel evaluated both designs and deliberately chose the unified on-chip design. Compared to the same-generation R3000, both processors have similar frequency ranges and the off-chip data width is 32 bits in both processors. However, the unified cache of the 80486 is much smaller than the total cache capacity of the R3000 (up to 16KB vs. up to 256KB+256KB). On the other hand, being on-chip made it more feasible for the 80486 to have wider cache buses. In particular, the 80486 cache has a 16-byte instruction fetch bus, a 4-byte data load bus, and a 4-byte data load/store bus. The two data buses could be used at the same time to load a single 8-byte operand (a double-precision FP operand or a segment descriptor) in one access. The R3000 caches share a single 4-byte bus. The relatively small size of the 80486 cache may have allowed making it 4-way associative with a single-cycle latency. This means that a load instruction that hits in the cache can supply the data to a dependent instruction in the next cycle without any stalls. On the R3000, if an instruction depends on an immediately preceding load instruction, it has to stall for one cycle even in the best-case scenario of a cache hit.
The 80486 cache is single-ported, but the instruction prefetch buffer and the wide 16-byte instruction fetch bus help keep contention between instruction fetches and data accesses to a minimum. Intel mentions that simulation results showed that the unified design provides a hit rate enough higher than that of a split cache to compensate for the bandwidth contention.
Intel explained in another paper, titled Design of the Intel Pentium processor, why they decided to change the Pentium's cache to a split design. There are two reasons: (1) the 2-wide superscalar Pentium requires the ability to perform up to two data accesses in a single cycle, and (2) branch prediction increases cache bandwidth demand. The paper doesn't mention whether Intel considered using a triple-ported banked unified cache, but they probably did and found out that it wasn't feasible at that time, so they went for a split cache with a dual-ported 8-banked data cache and a single-ported instruction cache. With today's fab technology, the triple-ported unified design may be better.
Wider pipelines in later microarchitectures required higher parallelism at the data cache. Now we're at 4 64-byte ports in Sunny Cove.
Answering the Second Part of the Question
it was mentioned that the cache is a split cache with no hazards. What does this exactly mean?
It's probably about the structural hazard mentioned in Paul's comment. That is, a unified single-ported cache cannot be accessed by the instruction fetch unit and the memory unit at the same time.

Out-of-order execution vs. speculative execution

I have read the wikipedia page about out-of-order execution and speculative execution.
What I fail to understand though are the similarities and differences. It seems to me that speculative execution uses out-of-order execution when it has not determined the value of a condition for example.
The confusion came when I read the papers of Meltdown and Spectre and did additional research. It is stated in the Meltdown paper that Meltdown is based on out-of-order execution, while some other resources including the wiki page about speculative execution state that Meltdown is based on speculative execution.
I'd like to get some clarification about this.
Speculative execution and out-of-order execution are orthogonal. One could design a processor that is OoO but not speculative, or speculative but in-order. OoO execution is an execution model in which instructions can be dispatched to execution units in an order that is potentially different from the program order. However, the instructions are still retired in program order so that the program's observed behavior is the same as the one intuitively expected by the programmer. (Although it's possible to design an OoO processor that retires instructions in some unnatural order with certain constraints. See the simulation-based study on this idea: Maximizing Limited Resources: a Limit-Based Study and Taxonomy of Out-of-Order Commit.)
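As a small illustration of the difference (a sketch; exactly how much overlap happens depends on the microarchitecture):

/* The two counter updates are independent of the divide. On an out-of-order
   core they can execute, and even complete, while the long-latency divide is
   still in flight, even though they come later in program order; the results
   still become architecturally visible in program order at retirement. An
   in-order core issues them only after the divide has issued, and a simple
   in-order pipeline may stall until the divide finishes. */
double scale_and_count(double a, double b, long *counter) {
    double q = a / b;    /* long latency: typically tens of cycles */
    counter[0] += 1;     /* independent of the divide */
    counter[1] += 1;     /* independent of the divide */
    return q;
}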
Speculative execution is an execution model in which instructions can be fetched and enter the pipeline and begin execution without knowing for sure that they will indeed be required to execute (according to the control flow of the program). The term is often used to specifically refer to speculative execution in the execution stage of the pipeline. The Meltdown paper does define these terms on page 3:
In this paper, we refer to speculative execution in a more
restricted meaning, where it refers to an instruction sequence
following a branch, and use the term out-of-order execution to refer
to any way of getting an operation executed before the processor has
committed the results of all prior instructions.
The authors here specifically refer to having branch prediction with executing instructions past predicted branches in the execution units. This is commonly the intended meaning of the term. Although it's possible to design a processor that executes instructions speculatively without any branch prediction by using other techniques such as value prediction and speculative memory disambiguation. This would be speculation on data or memory dependencies rather than on control. An instruction could be dispatched to an execution unit with an incorrect operand, or a load could return the wrong value. Speculation can also occur on the availability of execution resources, on the latency of an earlier instruction, or on the presence of a needed value in a particular unit in the memory hierarchy.
Note that instructions can be executed speculatively, yet in-order. When the decoding stage of the pipeline identifies a conditional branch instruction, it can speculate on the branch and its target and fetch instructions from the predicted target location. But still, instructions can also be executed in-order. However, note that once the speculated conditional branch instruction and the instructions fetched from the predicted path (or both paths) reach the issue stage, none of them will be issued until all earlier instructions are issued. The Intel Bonnell microarchitecture is an example of a real processor that is in-order and supports branch prediction.
Processors designed to carry out simple tasks and used in embedded systems or IoT devices are typically neither speculative nor OoO. Desktop and server processors are both speculative and OoO. Speculative execution is particularly beneficial when used with OoO.
The confusion came when I read the papers of Meltdown and Spectre and
did additional research. It is stated in the Meltdown paper that
Meltdown is based on out-of-order execution, while some other
resources including the wiki page about speculative execution state
that Meltdown is based on speculative execution.
The Meltdown vulnerability as described in the paper requires both speculative and out-of-order execution. However, this is somewhat a vague statement since there are many different speculative and out-of-order execution implementations. Meltdown doesn't work with just any type of OoO or speculative execution. For example, ARM11 (used in Raspberry Pis) supports some limited OoO and speculative execution, but it's not vulnerable.
See Peter's answer for more details on Meltdown and his other answer.
Related: What is the difference between Superscalar and OoO execution?.
I'm still having a hard time figuring out how Meltdown uses speculative execution. The example in the paper (the same one I mentioned here earlier) uses, IMO, only OoO - #Name in a comment
Meltdown is based on Intel CPUs optimistically speculating that loads won't fault, and that if a faulting load reaches the load ports, that it was the result of an earlier mispredicted branch. So the load uop gets marked so it will fault if it reaches retirement, but execution continues speculatively using data the page table entry says you aren't allowed to read from user-space.
Instead of triggering a costly exception-recovery when the load executes, it waits until it definitely reaches retirement, because that's a cheap way for the machinery to handle the branch miss -> bad load case. In hardware, it's easier for the pipe to keep piping unless you need it to stop / stall for correctness. e.g. A load where there's no page-table entry at all, and thus a TLB miss, has to wait. But waiting even on a TLB hit (for an entry with permissions that block using it) would be added complexity. Normally a page-fault is only ever raised after a failed page walk (which doesn't find an entry for the virtual address), or at retirement of a load or store that failed the permissions of the TLB entry it hit.
In a modern OoO pipelined CPU, all instructions are treated as speculative until retirement. Only at retirement do instructions become non-speculative. The Out-of-Order machinery doesn't really know or care whether it's speculating down one side of a branch that was predicted but not executed yet, or speculating past potentially-faulting loads. "Speculating" that loads don't fault or ALU instructions don't raise exceptions happens even in CPUs that aren't really considered speculative, but fully out-of-order execution turns that into just another kind of speculation.
I'm not too worried about an exact definition for "speculative execution", and what counts / what doesn't. I'm more interested in how modern out-of-order designs actually work, and that it's actually simpler to not even try to distinguish speculative from non-speculative until the end of the pipeline. This answer isn't even trying to address simpler in-order pipelines with speculative instruction-fetch (based on branch prediction) but not execution, or anywhere in between that and full-blown Tomasulo's algorithm with a ROB + scheduler with OoO exec + in-order retirement for precise exceptions.
For example, only after retirement can a store ever commit from the store buffer to L1d cache, not before. And to absorb short bursts and cache misses, it doesn't have to happen as part of retirement either. So one of the only non-speculative out-of-order things is committing stores to L1d; they have definitely happened as far as the architectural state is concerned, so they have to be completed even if an interrupt / exception happens.
The fault-if-reaching-retirement mechanism is a good way to avoid expensive work in the shadow of a branch mispredict. It also gives the CPU the right architectural state (register values, etc.) if the exception does fire. You do need that whether or not you let the OoO machinery keep churning on instructions beyond a point where you've detected an exception.
Branch-misses are special: there are buffers that record micro-architectural state (like register-allocation) on branches, so branch-recovery can roll back to that instead of flushing the pipeline and restarting from the last known-good retirement state. Branches do mispredict a fair amount in real code. Other exceptions are very rare.
Modern high-performance CPUs can keep (out-of-order) executing uops from before a branch miss, while discarding uops and execution results from after that point. Fast recovery is a lot cheaper than discarding and restarting everything from a retirement state that's potentially far behind the point where the mispredict was discovered.
E.g. in a loop, the instructions that handle the loop counter might get far ahead of the rest of the loop body, and detect the mispredict at the end soon enough to redirect the front-end and maybe not lose much real throughput, especially if the bottleneck was the latency of a dependency chain or something other than uop throughput.
This optimized recovery mechanism is only used for branches (because the state-snapshot buffers are limited), which is why branch misses are relatively cheap compared to full pipeline flushes. (e.g. on Intel, memory-ordering machine clears, performance counter machine_clears.memory_ordering: What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?)
Exceptions are not unheard-of, though; page faults do happen in the normal course of operation. e.g. a store to a read-only page triggers copy-on-write. A load or store to an unmapped page triggers page-in or handling of the lazy mapping. But thousands to millions of instructions usually run between page faults, even in a process that's allocating new memory frequently (one fault per microsecond to millisecond on a 1GHz CPU). In code that doesn't map new memory, you can go far longer without exceptions. Mostly just a timer interrupt occasionally in pure number crunching without I/O.
But anyway, you don't want to trigger a pipeline flush or anything expensive until you're sure that an exception will really fire. And that you're sure you have the right exception. e.g. maybe the load address for an earlier faulting load wasn't ready as soon, so the first faulting load to execute wasn't the first in program order. Waiting until retirement is a cheap way to get precise exceptions. Cheap in terms of additional transistors to handle this case, and letting the usual in-order retirement machinery figure out exactly which exception fires is fast.
The useless work done executing instructions after an instruction marked to fault on retirement costs a tiny bit of power, and isn't worth blocking because exceptions are so rare.
This explains why it makes sense to design hardware that was vulnerable to Meltdown in the first place. Obviously it's not safe to keep doing this, now that Meltdown has been thought of.
Fixing Meltdown cheaply
We don't need to block speculative execution after a faulting load; we just need to make sure it doesn't actually use sensitive data. It's not the load succeeding speculatively that's the problem, Meltdown is based on the following instructions using that data to produce data-dependent microarchitectural effects. (e.g. touching a cache line based on the data).
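A hedged, heavily simplified sketch of that data-dependent access pattern (the real attack additionally needs exception suppression or recovery, flushing the probe array beforehand, and timing its lines afterwards, all omitted here; the names are illustrative):

#include <stdint.h>

static uint8_t probe[256 * 4096];          /* one page per possible byte value */

void leak_one_byte(const uint8_t *kernel_addr) {
    /* Architecturally this load faults (no permission), but on affected CPUs
       the fault is only raised at retirement, so execution continues
       speculatively with the loaded value. */
    uint8_t secret = *(volatile const uint8_t *)kernel_addr;

    /* Data-dependent touch: which line of 'probe' ends up cached encodes the
       secret byte, which the attacker later recovers by timing accesses. */
    volatile uint8_t sink = probe[secret * 4096];
    (void)sink;
}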
So if the load ports mask the loaded data to zero or something as well as setting the fault-on-retirement flag, execution continues but can't gain any info about the secret data. This should take about 1 extra gate delay of critical path, which is probably possible in the load ports without limiting the clock speed or adding an extra cycle of latency. (1 clock cycle is long enough for logic to propagate through many AND/OR gates within a pipeline stage, e.g. a full 64-bit adder).
Related: I suggested the same mechanism for a HW fix for Meltdown in Why are AMD processors not/less vulnerable to Meltdown and Spectre?.

Efficiently checking for a rare occurrence

I have to process many millions of data records. A data record has a record-type string at the beginning of the record. Processing is record-type-dependent but does not require 'if'/'elsif'-ing the type; it just selects an array-slice mask from a hash.
However, on the order of once per million records I might encounter a record type that requires a totally different kind of processing.
I hate to insert an 'if' testing for this record type that will return 'true' so rarely.
Any suggestions?
Thanks
Meir
The answer is: Don't worry about it.
The speed of your CPU is considerably higher than that of your disk IO, so an if test is just not going to make a lot of difference - even before you consider things like branch prediction.
An SSD will do about 1500 IO operations per second, and to quote Borodin from the comments:
A reasonable average disk read speed is 100MB per second. Say your records are 100 bytes each, that means you can read 1 million records per second, or 1μs per record. A 2011 Intel Core i5 processor runs at 83,000 MIPS, and so can
execute 83,000 instructions in the time taken to read one record. It is pointless to avoid a few test and branch instructions amongst all that.
Basically this is true in any code - your IO to storage is almost always your limiting factor, because CPUs have followed Moore's law, but the actual rotational speed of a spinning disk hasn't really changed in 15+ years. SSDs are something of a revolutionary change, but they're still too expensive to use as bulk storage options (and even if that wasn't true, they're still going to be the bottleneck on a sustained data transfer/processing operation).
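If the extra branch still bothers you, the lookup you already do for the common record types can absorb the rare one too. A hedged sketch (in C rather than Perl; the one-byte type code, the 'X' rare type, and the handler names are all invented for illustration):

#include <stdio.h>

typedef void (*handler_t)(const char *record);

static void handle_common(const char *record) { (void)record; /* normal mask-based processing */ }
static void handle_rare(const char *record)   { (void)record; /* the once-per-million case    */ }

static handler_t dispatch[256];

int main(void) {
    for (int i = 0; i < 256; i++)
        dispatch[i] = handle_common;
    dispatch['X'] = handle_rare;                              /* assumed rare record-type code */

    const char *records[] = { "A...", "B...", "X..." };
    for (int i = 0; i < 3; i++)
        dispatch[(unsigned char)records[i][0]](records[i]);   /* no if/elsif chain on the hot path */
    return 0;
}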

What are the differences between multi-CPU, multi-core and hyper-thread?

Could anyone explain to me the differences between multi-CPU, multi-core, and hyper-thread? I am always confused about these differences, and about the pros/cons of each architecture in different scenarios.
Here is my current understanding after learning online and learning from others' comments.
I think hyper-threading is the most inferior technology among them, but cheap. Its main idea is to duplicate registers to save context-switch time;
multi-processor is better than hyper-threading, but since different CPUs are on different chips, the communication between CPUs has longer latency than with multi-core, and using multiple chips costs more and consumes more power than multi-core;
multi-core integrates all the CPUs on a single chip, so the latency of communication between CPUs is greatly reduced compared with multi-processor. Since it uses one single chip to contain all the CPUs, it consumes less power and is less expensive than a multi-processor system.
Is this correct?
Multi-CPU was the first version: you'd have one or more mainboards with one or more CPU chips on them. The main problem here was that the CPUs would have to expose some of their internal data to the other CPUs so they wouldn't get in each other's way.
The next step was hyper-threading. One chip on the mainboard but it had some parts twice internally so it could execute two instructions at the same time.
The current development is multi-core. It's basically the original idea (several complete CPUs) but in a single chip. The advantage: Chip designers can easily put the additional wires for the sync signals into the chip (instead of having to route them out on a pin, then over the crowded mainboard and up into a second chip).
Supercomputers today are multi-CPU, multi-core: they have lots of mainboards with usually 2-4 CPUs on them, each CPU is multi-core, and each has its own RAM.
[EDIT] You got that pretty much right. Just a few minor points:
Hyper-threading keeps track of two contexts at once in a single core, exposing more parallelism to the out-of-order CPU core. This keeps the execution units fed with work, even when one thread is stalled on a cache miss, branch mispredict, or waiting for results from high-latency instructions. It's a way to get more total throughput without replicating much hardware, but if anything it slows down each thread individually. See this Q&A for more details, and an explanation of what was wrong with the previous wording of this paragraph.
The main problem with multi-CPU is that code running on them will eventually access the RAM. There are N CPUs but only one bus to access the RAM. So you must have some hardware which makes sure that a) each CPU gets a fair amount of RAM access, b) accesses to the same part of the RAM don't cause problems, and c) most importantly, that CPU 2 will be notified when CPU 1 writes to some memory address which CPU 2 has in its internal cache. If that doesn't happen, CPU 2 will happily use the cached value, oblivious to the fact that it is outdated.
Just imagine you have tasks in a list and you want to spread them to all available CPUs. So CPU 1 will fetch the first element from the list and update the pointers. CPU 2 will do the same. For efficiency reasons, both CPUs will not only copy the few bytes into the cache but a whole "cache line" (whatever that may be). The assumption is that, when you read byte X, you'll soon read X+1, too.
Now both CPUs have a copy of the memory in their cache. CPU 1 will then fetch the next item from the list. Without cache sync, it won't have noticed that CPU 2 has changed the list, too, and it will start to work on the same item as CPU 2.
This is what effectively makes multi-CPU so complicated. Side effects of this can lead to a performance which is worse than what you'd get if the whole code ran only on a single CPU. The solution was multi-core: You can easily add as many wires as you need to synchronize the caches; you could even copy data from one cache to another (updating parts of a cache line without having to flush and reload it), etc. Or the cache logic could make sure that all CPUs get the same cache line when they access the same part of real RAM, simply blocking CPU 2 for a few nanoseconds until CPU 1 has made its changes.
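For reference, the work-list scenario above looks roughly like this in code (a hedged C sketch; names and sizes are illustrative). It's the combination of cache coherence and an atomic read-modify-write that keeps two CPUs from claiming the same item:

#include <stdatomic.h>
#include <stddef.h>

enum { NTASKS = 1000 };

static int tasks[NTASKS];
static atomic_size_t next_task;          /* shared cursor into the task list */

/* Each CPU/thread calls this to claim an item. The atomic increment forces
   the coherence protocol to serialize updates to the cache line holding
   'next_task', so no two threads receive the same index. */
int *claim_next_task(void) {
    size_t i = atomic_fetch_add(&next_task, 1);
    return (i < NTASKS) ? &tasks[i] : NULL;   /* NULL when the list is exhausted */
}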
[EDIT2] The main reason why multi-core is simpler than multi-cpu is that on a mainboard, you simply can't run all wires between the two chips which you'd need to make sync effective. Plus a signal only travels 30cm/ns tops (speed of light; in a wire, you usually have much less). And don't forget that, on a multi-layer mainboard, signals start to influence each other (crosstalk). We like to think that 0 is 0V and 1 is 5V but in reality, "0" is something between -0.5V (overdrive when dropping a line from 1->0) and .5V and "1" is anything above 0.8V.
If you have everything inside of a single chip, signals run much faster and you can have as many as you like (well, almost :). Also, signal crosstalk is much easier to control.
You can find some interesting articles about dual CPU, multi-core and hyper-threading on Intel's website or in a short article from Yale University.
I hope you find here all the information you need.
In a nutshell: a multi-CPU or multi-processor system has several processors. A multi-core system is a multi-processor system with several processors on the same die. With hyperthreading, multiple threads can run on the same processor (that is, the context-switch time between these threads is very small).
Multi-processors have been around for 30 years now, but mostly in labs. Multi-core is the new popular form of multi-processor. Server processors nowadays implement hyperthreading along with multiple processors.
The wikipedia articles on these topics are quite illustrative.
Hyperthreading is a cheaper and slower alternative to having multiple cores
The Intel Manual Volume 3 System Programming Guide - 325384-056US September 2015 - Section 8.7 "INTEL HYPER-THREADING TECHNOLOGY ARCHITECTURE" describes HT briefly. It includes a block diagram showing which resources are duplicated per logical processor (the architectural state) and which are shared (the execution engine and caches).
TODO: by how much (in percent) is it slower, on average, in real applications?
Hyperthreading is possible because modern single CPU cores already execute multiple instructions at once through the instruction pipeline: https://en.wikipedia.org/wiki/Instruction_pipelining
The instruction pipeline is a separation of functions inside of a single core to ensure that each part of the circuit is used at any given time: reading memory, decoding instructions, executing instructions, etc.
Hyperthreading separates functions further by using:
a single backend, which actually runs the instructions with its pipeline.
Dual core has two backends, which explains the greater cost and performance.
two front-ends, which take two streams of instructions and order them in a way to maximize pipelining usage of the single backend by avoiding hazards.
Dual core would also have 2 front-ends, one for each backend.
There are edge cases where instruction reordering produces no benefit, making hyperthreading useless. But it produces a significant improvement on average.
Two hyperthreads in a single core share more cache levels (TODO: how many? L1?) than two different cores, which share only L3; see:
Multiple threads and CPU cache
How are cache memories shared in multicore Intel CPUs?
The interface that each hyperthread exposes to the operating system is similar to that of an actual core, and both can be controlled separately. Thus cat /proc/cpuinfo shows me 4 processors, even though I only have 2 cores with 2 hyperthreads each.
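A portable way to query the same count from a program (a minimal sketch; figuring out which logical CPUs share a physical core needs platform-specific information, e.g. /sys/devices/system/cpu/*/topology/ on Linux):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* On a 2-core machine with 2-way hyperthreading this prints 4. */
    long logical = sysconf(_SC_NPROCESSORS_ONLN);
    printf("logical CPUs visible to the OS: %ld\n", logical);
    return 0;
}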
Operating systems can however take advantage of knowing which hyperthreads are on the same core to run multiple threads of a given program on a single core, which might improve cache usage.
This LinusTechTips video contains a light-hearted non-technical explanation: https://www.youtube.com/watch?v=wnS50lJicXc
Multi-CPU is a bit like multicore, but communication can only happen through RAM, not L3 cache
This means that if possible, you want to partition tasks that use the same memory a lot for each separate CPU.
E.g. the SBI-7228R-T2X blade server contains 4 CPUs, 2 on each node.
In the product photo, there seem to be 4 sockets for the CPUs, each covered by a heat sink, with one open.
I think that across the nodes they don't even share RAM and must communicate through some kind of networking, thus representing one further step up the hyperthread/multicore/multi-CPU hierarchy. TODO: confirm:
https://scicomp.stackexchange.com/questions/7530/difference-between-nodes-and-cpus-when-running-software-on-a-cluster
SLURM nodes, tasks, cores, and cpus
https://www.quora.com/In-High-Performance-Computing-what-exactly-is-the-difference-between-the-terms-%E2%80%9Ccores-%E2%80%9D-%E2%80%9Cprocessors-%E2%80%9D-%E2%80%9Cnodes-%E2%80%9D-and-%E2%80%9Cclusters%E2%80%9D