What is the impact of locality principles on the pipelining techniques? - cpu-architecture

I know what are both locality principles and pipelining techniques. But I can't see any sort of interconnection between the two things.
How can locality principles impact pipelining techniques?

One case I could think of is the instruction cache hit during IF stage. Spatial locality is likely to increase instruction cache hits when there is less flow control. Also when there is flow control, like a loop, temporal locality will increase the hit rate of the instruction cache. If the IF is not succeed due to a cache miss, pipeline has to be stalled, which can be solved by locality principles.

Related

Why are most cache line sizes designed to be 64 byte instead of 32/128byte now?

i found in linux, it shows my cpu's cache line size is 64 byte, and i realized 16/32/128 byte is existing, but most cpu are designed to 64 byte cache line size now. why not bigger or smaller?
It's a trade-off. Wider caches are more efficient (in terms of area/power for a given cache size), but result in more memory traffic for random (non-sequential/strided) access, and more false sharing contention between parallel caches.
if you have a memory access pattern that only needs a few bytes from each cache line (eg, iterating along a linked list that is scattered widely across memory), each access will need to pull an entire line into the cache. So doubling the line size will double the memory traffic.
if different CPUs, each with its own cache, are accessing memory on the same cache line, that line will have to "bounce" back and forth between the caches. Avoiding this means putting more padding between objects.
In both cases, these problems can be avoided by tuning the software to want memory in chunks that are multiples of the cache line size. The bigger the cache line size, the more work that is.
As Chris Dodd's answer points out, the sizing of cache lines involves trade-offs.
Larger cache lines reduce the number of tag bits per data byte, provide prefetching, and facilitate higher bandwidth (particularly at the memory and the L1 interfaces) at the cost of excessive prefetch (wasting bandwidth and cache capacity), false sharing, higher miss latency (especially without critical word first/early restart), and higher conflict misses (for smaller caches, with fewer sets the probability that more accesses than the associativity will map to a particular set increases). (Larger cache lines can also provide greater performance predictability by guaranteeing a cache hit within a larger address range and number of bytes.)
Modern systems would not noticeably benefit from such prefetching; configurable static prefetcher logic would provide the same behavior and dynamic prefetching can exploit variable resource availability (e.g., cache capacity and memory channel occupancy) and utility as well as provide more flexible prefetching (such as non-unit stride).
Tag overhead is not as significant a concern in terms of area for modern caches using SRAM for data as well as tags. (IBM's Power and zArchitecture implementations use eDRAM for outer cache data storage and SRAM for tags, which more than doubles the area cost of tags relative to data.) However, access latency and access energy are effected by the size of the tag arrays. For L1 caches, way prediction is more effective with larger cache lines both because there are fewer cache lines for a given cache capacity and because spatial locality tends to apply even beyond reasonable cache line sizes; only having to check one set of tags for a wider or larger number of accesses reduces the cost of higher bandwidth (this is most noticeable in GPUs which exploit spatial locality and sacrifice latency for bandwidth). For outer cache levels, phased tag-data access is often used (tags are checked before data access begins, saving energy — especially given higher associativity and miss rates); smaller tag arrays for a given capacity both reduce access energy and latency (especially for misses — 50% hit rates are not unheardof). (Note that one can use partial tags to provide early miss detection in the common case where a miss has no matching partial tags. Other filtering mechanisms are also possible.)
False sharing can be countered by using sectored caches, where more than one validity (or coherence state) entry is provided for each address portion of the tag. This provides an intermediate design point between larger cache lines with more frequent false sharing and smaller cache lines with higher tag overhead. Such also inherently supports reducing cache line fill delay. For traditional layouts, this has the substantial effective capacity cost of large cache lines when false sharing or less spatial locality is more common. For designs using indirection, such as proposed Non-Uniform Cache Architectures and V-Way caches, the capacity utilization issue can be reduced by allocating data storage at a finer granularity at the cost of more indirection pointer storage.
Larger cache lines provide three bandwidth benefits. The command overhead is less (address and action information is nearly constant — address is one bit smaller for each doubling in size) so the bandwidth overhead per data byte is lower; this is more significant for coherence traffic where many messages carry only metadata. (Obviously with more numerous coherence nodes false sharing can be more problematic acting against this advantage.) Other per-request overheads also do not scale with request size (e.g., DRAM row activation with random-access, close-on-completion row management). ECC (or check codes with retransmission) also have less per payload byte overhead with larger payloads (this can be used to store extra metadata while using commodity width memory modules).
A larger cache line also facilitates wider memory interfaces when burst length is fixed. Increasing DRAM burst length facilitates higher bandwidth; DDR5 moved to a burst length of 16, pushing DIMMs into using two 32-bit wide channels to be compatible with x86's de facto standardization on 64-byte cache lines. While this chance can be viewed positively as increasing available memory level parallelism (MLP)— doubling the number of channels and reducing DRAM bank conflicts — MLP is more important when relative latency of memory is greater (large on-chip caches and faster processing), thread-level parallelism is available (multicore and multithreading), and out-of-order execution (and multithreading) expose more memory accesses to hide latency. Multicore (when used with significant data memory sharing as opposed to multiprogramming or large chunk/stream communication such as pipeline-style mulithreading) also increases the importance of false sharing, further reducing the benefits of larger cache lines (beyond MLP benefits of narrower channels). With lower (on-chip) communication latency and near-unavoidability of multicore processors, multithreaded programming becomes more attractive.
For L1 caches the microarchitecture (and somewhat the ISA) can be influence cache line sizes. Higher frequency designs favor smaller capacity L1 caches both for latency and access energy, especially with less latency tolerance from out-of-order execution or skewed pipelines (where the execution pipeline phase is one or more stages delayed from the address generation phase).
The relative sizes of the various tradeoffs also depends on the workload and software design. Larger cache capacity caches targeting workloads that benefit from such reduce the excessive prefetch and conflict disadvantages of larger cache lines; higher associativity reduces the conflict disadvantage, but workloads that are more likely to have conflicts are less likely (in general) to benefit from spatial locality (and the conflict disadvantage is typically less important for outer cache levels). Pointer-chasing workloads tend to favor lower latency and thus lower capacity, favoring smaller cache lines (at least in L1).
Software design is a significant factor. Avoiding false sharing tends to increase padding as cache line size increases, discouraging larger cache lines. Once a cache line size assumption is established in a software community (which is somewhat segregated according to ISA, OS, and hardware/system vendor) the effect of legacy code and legacy conceptualizations constrains cache line size.
Speculation: x86's orientation toward generic software and personal computer uses (cost and workload characteristics biasing toward smaller caches and workload perhaps generally having lower spatial locality) probably biased the choice toward a smaller cache line than ISA/hardware vendors targeting workstation and server workloads with a higher expectation of software development effort. x86 has standardized on 64-byte cache lines, IBM POWER9 uses 128-byte cache blocks (divided into four sectors for L1 caches) and IBM z15 uses 256-byte cache blocks.
(Latency vs. hit rate, access energy, and other tradeoffs as well as software and programmer legacy seem to lead to less strict standardization on 32KiB L1 cache capacity. The performance impact for a smaller or larger cache can be less significant than for false sharing, so the software constraints are less significant than for cache line size.)

Does the Harvard architecture have the von Neumann bottleneck?

From the naming and this article I feel the answer is no, but I don't understand why. The bottleneck is how fast you can fetch data from memory. Whether you can fetch instruction at the same time doesn't seem to matter. Don't you still have to wait until the data arrive? Suppose fetching data takes 100 cpu cycles and executing instruction takes 1, the ability of doing that 1 cycle in advance doesn't seem to be a huge improvement. What am I missing here?
Context: I came across this article saying the Spectre bug is not going to be fixed because of speculative execution. I think speculative execution, for example branch prediction, makes sense for Harvard architecture too. Am I right? I understand speculative execution is more beneficial for von Neumann architecture, but by how much? Can someone give a rough number? On what extent can we say the Spectre will stay because of von Neumann architecture?
The term "von Neumann bottleneck" isn't just talking about Harvard vs. von Neumann architectures. It's talking about the entire idea of stored-program computers, which John von Neumann invented.
(Depending on context, some people may use it to mean the competition between code-fetch and data access; that does exacerbate the overall memory bottleneck without split caches. Or perhaps I'm mixing up terminology and the more general memory bottleneck for processors I discuss in the rest of this answer shouldn't be called the von Neumann bottleneck, although it is a real thing. See the memory wall section in Modern Microprocessors
A 90-Minute Guide! )
The von Neumann bottleneck applies equally to both kinds of stored-program computers. And even to fixed-function (not stored-program) processors that keep data in RAM. (Old GPUs without programmable shaders are basically fixed-function but can still have memory bottlenecks accessing data).
Usually it's most relevant when looping over big arrays or pointer-based data structures like linked lists, so the code fits in an instruction cache and doesn't have to be fetched during data access anyway. (Computers too old to even have caches were just plain slow, and I'm not interested in arguing semantics of whether slowness even when there is temporal and/or spatial locality is a von Neumann bottleneck for them or not.)
https://whatis.techtarget.com/definition/von-Neumann-bottleneck points out that caching and prefetching is part of how we work around the von Neumann bottleneck, and that faster / wider busses make the bottleneck wider. But only stuff like Processor-in-Memory / https://en.wikipedia.org/wiki/Computational_RAM truly solves it, where an ALU is attached to memory cells directly, so there is no central bottleneck between computation and storage, and computational capacity scales with storage size. But von Neumann with a CPU and separate RAM works well enough for most things that it's not going away any time soon (given large caches and smart hardware prefetching, and out-of-order execution and/or SMT to hide memory latency.)
John von Neumann was a pioneer in early computing, and it's not surprising his name is attached to two different concepts.
Harvard vs. von Neumann is about whether program memory is in a separate address space (and a separate bus); that's an implementation detail for stored-program computers.
Spectre: yes, Spectre is just about data access and branch prediction, not accessing code as data or vice versa. If you can get a Spectre attack into program memory in a Harvard architecture in the first place (e.g. by running a normal program that makes system calls), then it can run the same as on a von Neumann.
I understand speculative execution is more beneficial for von Neumann architecture, but by how much?
What? No. There's no connection here at all. Of course, all high-performance modern CPUs are von Neumann. (With split L1i / L1d caches, but program and data memory are not separate, sharing the same address space and physical storage. Split L1 caches is often called "modified Harvard", which makes some sense on ISAs other than x86 where L1i isn't coherent with data caches so you need special flushing instructions before you can execute newly-stored bytes as code. x86 has coherent instruction caches, so it's very much an implementation detail.)
Some embedded CPUs are true Harvard, with program memory connected to Flash and data address space mapped to RAM. But often those CPUs are pretty low performance. Pipelined but in-order, and only using branch prediction for instruction prefetch.
But if you did build a very high performance CPU with fully separate program and data memories (so copying from one to the other would have to go through the CPU), there'd be basically zero different from modern high-performance CPUs. L1i cache misses are rare, and whether they compete with data access is not very significant.
I guess you'd have split caches all the way down, though; normally modern CPUs have unified L2 and L3 caches, so depending on the workload (big code size or not) more or less of L2 and L3 can end up holding code. Maybe you'd still use unified caches with one extra bit in the tag to distinguish code addresses from data addresses, allowing your big outer caches to be competitively shared between the two address-spaces.
The Harvard Architecture, separated instruction and data memories, is a mitigation of the von Neumann bottleneck. Backus' original definition of the bottleneck addresses a slightly more general problem than just instruction or data fetch and talks about the CPU/memory interface. In the paragraph before the money quote Backus talks about looking at the actual traffic on this bus,
Ironically, a large part of the traffic is not useful data but merely
names of data that most of it consists of names as well as operations
and data used only to compute such names.
In a Harvard architecture with a separated I/D bus, that will not change. It will still largely consist of names.
So the answer is a hard no. The Harvard architecture mitigates the von Neumann bottleneck but it doesn't solve it. Bluntly, it's a faster von Neumann bottleneck.

Are Real time systems Hard/Soft or the RTOS itself?

I just wanted to ask if there exists anything like a Hard-time RTOS or Soft-Time RTOS itself or is it the designer who defines a system as Hard-time or Real-time irrespective of the RTOS used?
"Hard" or "Soft" is a characteristic of the system requirement. It is unrelated to the RTOS used.
See this related question for more information.
Most people implicitly have an informal mental model that considers information or an event as being “real-time”
--if, or to the extent that, it is manifest to them with a delay (latency) that can be related to its perceived currency
--i.e., in a time frame that the information or event has satisfactory usefulness to them.
Note that the magnitude of the delay is irrelevant, it may be from microseconds to megaseconds. Well known examples in the real world include real-time computing systems, automated financial trading, and adverse weather alerts.
Any particular real-time system (i.e., according to the above informal mental model, it has satisfactory timeliness), which includes an OS, depends on that OS to be real-time enough--i.e., have latencies short enough that result in it providing satisfactory usefulness to the rest of the system. Some systems may be real-time enough even though the OS is Microsoft Windows (numerous such systems are deployed); other systems cannot be real-time enough unless their OS is designed and implemented to have very low latencies.
The informal mental model refers to timeliness but lacks the second essential property of something being "real-time:" predictability of timeliness.
Usually an OS which is intended for real-time systems is designed and implemented to have sufficiently low latencies (needed for timeliness) AND sufficiently high predictability of latencies (and hence timeliness). Again, note that the magnitudes of the latencies and the degrees of predictability are application-specific. An OS or a system can have latencies in (say) seconds or minutes, and predictability of timeliness which is stochastic (e.g., long low value tails after the mean value, which is common in many real-time systems and RTOSs).
Predictability is an extremely deep topic, especially in real-time systems.
It is discussed elsewhere.

Flynn's Bottleneck - maximum speedup 2

According to Flynn's Bottleneck, the speedup due to instruction level parallelism (ILP) can be at best 2. Why is it so?
That version of Flynn's Bottleneck originates in Detection and Parallel Execution of Independent Instructions where the authors empirically conclude that ILP for most programs is less than 2. That was 1970 technology and that was an empirical conclusion. You can contrast it with Fisher's Optimism which said there was lots of ILP out there and proposed trace scheduling and VLIW to exploit it.
So the literal answer to your question is because that's what they measured within basic blocks back then.
The ILP less than 2 meaning isn't really used anymore because superscalars and better compilers have blown past the number 2. So instead, over time Flynn's Bottleneck has come to mean You cannot retire more than you fetch which stems from his earlier paper Some Computer Organizations and Their Effectiveness.
The execution bandwidth of a system is usually referred to as being
the maximum number of operations that can be performed per unit time
by the execution area. Notice that due to bottlenecks in issuing
instructions, for example, the execution bandwidth is usually
substantially in excess of the maximum performance of a system.

Data Scrambling Purpose

Can someone please explain to me what data scrambling is when it comes to a memory controller? According to Wikipedia, it somehow masks the user data with random patterns to prevent reverse engineering of a DRAM. But, it is also is used to finding electrical problems. Can someone please elaborate on these features of data scrambling? Thanks!
The Wikipedia article claimed:
Memory controllers integrated into certain Intel Core processors also
provide memory scrambling as a feature that turns user data written to
the main memory into pseudo-random patterns.[6][7] As such, memory
scrambling prevents forensic and reverse-engineering analysis based on
DRAM data remanence, by effectively rendering various types of cold
boot attacks ineffective. However, this feature has been designed to
address DRAM-related electrical problems, not to prevent security
issues, so it may not be rigorously cryptographically secure.[8]
However, I think that this claim is somewhat misleading because it implies that the purpose of data scrambling is to prevent reverse engineering. In fact the cited sources (listed as [6][7] in the quote) say the following:
The memory controller incorporates a DDR3 Data Scrambling feature to
minimize the impact of excessive di/dt on the platform DDR3 VRs due to
successive 1s and 0s on the data bus. Past experience has demonstrated
that traffic on the data bus is not random and can have energy
concentrated at specific spectral harmonics creating high di/dt that
is generally limited by data patterns that excite resonance between
the package inductance and on-die capacitances. As a result, the
memory controller uses a data scrambling feature to create
pseudo-random patterns on the DDR3 data bus to reduce the impact of
any excessive di/dt.
Basically the purpose of scrambling is to limit fluctuations in the current draw that is used on the DRAM data bus. There is nothing in the cited source to support the claim that it is designed to prevent reverse-engineering, though I suppose it is reasonable to assume that it might make reverse engineering more difficult. I'm not an expert in this area so I don't know for sure.
I have edited the Wikipedia article to remove the improperly sourced claim. Though I suppose someone could add it back it, but if so hopefully they can provide better sourcing.
It's not reverse engineering of the DRAM, it's reverse engineering of the data in the DRAM that scrambing is designed to prevent (e.g. forensics like cold-boot attacks), according to that article.
The electrical properties thing made me think of Row Hammer. Scrambling might make that harder, but IDK if that's what the author of that paragraph had in mind.