MSI: Why do we need to write the line back when the other CPU is going to overwrite it? - cpu-architecture

In the book "Computer Architecture", by Hennessy/Patterson, 5th ed, on page 360 they describe MSI protocol, and write something like:
If the line is in state "Exclusive" (Modified), then on receiving "Write Miss" from the bus the current CPU 1) writes back the line into the bus, and then 2) goes into "Invalid" state.
Why do we need to write-back the line, if it will be overwritten anyway by the successive write by the other CPU?
Is it connected with the fact that every CPU should see the same writes? (But I don't see why it would be a problem for some other CPU not to see this particular write.)
Here is the protocol from their book (question in green, in purple it is clear: we need to write-back in order to supply the line to requesting CPU):

Writing back the modified data to memory is not strictly necessary in an MSI protocol. The state diagrams also seem to assume a system with low-cost memory access (data is supplied by memory even when found in the shared state in another cache) and a shared bus connecting to the memory interface.
However, the modified data cannot simply be dropped as in shared state since the requesting processor might only be modifying part of the cache block (e.g., only one byte). Whatever portions of the block that are not modified by the requesting processor must still be available either in memory or at the requesting processor (the other processor has already invalidated its copy). With a shared bus and low cost memory access the cost difference of adding a write-back to memory over just communicating the data to the other processor is small.
In addition, even on a word-addressed system with word-sized cache blocks, keeping the old data available allows the write miss request to be sent speculatively (as with out-of-order execution or prefetch-for-write) without correctness issues.
(Per-byte tracking of modified [or valid as a superset of modified] state would allow some data communication to be avoided at the cost of extra state bits and a more complex communication system.)
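The partial-write argument above can be illustrated with a minimal sketch. It assumes a toy model with one 4-byte block, a single owner cache in the Modified state, and a requester that overwrites only one byte; the names `Cache`, `write_miss`, and `demo` are hypothetical and not taken from the book.

```python
BLOCK_SIZE = 4  # bytes per cache block (tiny, for illustration only)


class Cache:
    """A cache holding a single block, in state "M", "S" or "I"."""

    def __init__(self):
        self.state = "I"
        self.data = None


def write_miss(requester, owner, memory, offset, value, write_back=True):
    """Requester writes one byte of a block that `owner` may hold in M."""
    if owner.state == "M":
        if write_back:
            memory[:] = owner.data      # dirty block is put on the bus / into memory
        owner.state, owner.data = "I", None
    requester.data = bytearray(memory)  # block fill comes from memory
    requester.data[offset] = value      # only this one byte is overwritten
    requester.state = "M"
    return bytes(requester.data)


def demo(write_back):
    memory = bytearray(BLOCK_SIZE)                 # stale copy: all zeros
    owner, requester = Cache(), Cache()
    owner.state = "M"
    owner.data = bytearray([0x11, 0x22, 0x33, 0x44])  # newest values live only here
    return write_miss(requester, owner, memory, 0, 0xAA, write_back)


print(demo(True).hex())   # bytes 1..3 carry the owner's modifications
print(demo(False).hex())  # bytes 1..3 fall back to the stale memory contents
```

With the write-back, the requester ends up with its own byte plus the owner's three modified bytes; without it, those three bytes silently revert to the stale memory contents.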

Related

What is the TCM's connection with the Icache in this RISC-V core?

In the middle of this page (https://github.com/ultraembedded/riscv) there is a block diagram of the core. I really do not know what the TCM is doing in the same block as the Icache. Is it an optional component inside the CPU?
Some embedded systems provide dedicated memory for code and/or for data.  On some of these systems, Tightly-Coupled Memory serves as a replacement for the (instruction) cache, while on other such systems this memory is in addition to and along side a cache, applying to a certain portion of the address space.  This dedicated memory may be on the chip of the processor.
This memory could be some kind of ROM, or other memory that is initialized somehow prior to boot.  In any case, TCM typically isn't backed by main memory, so it doesn't suffer cache misses or need the associated circuitry; it usually also has high performance, comparable to a cache when a hit occurs.
Some systems refer to this as Instruction Tightly Integrated Memory, ITIM, or Data Tightly Integrated Memory, DTIM.
When a system uses ITIM or DTIM, it performs more like a Harvard architecture than the Modified Harvard architecture of laptops and desktops.
The cache has no address space of its own. The CPU does not ask the cache for data; it just asks for data at an address, and the memory controller first checks whether that data is present in the cache. If it is, the data is fetched from the cache; if not, the controller goes to RAM. All the processor does is ask for data; it does not care where the data came from. In the case of TCM, the CPU can directly write data to the TCM and read data from it, since the TCM has a specific address range. Think of TCM as a RAM that is close to the CPU.
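As a rough sketch of that distinction: the model below assumes a made-up TCM address window (`TCM_BASE`, `TCM_SIZE`) and a simple write-through cache in front of RAM. None of these names or values come from the linked core; they are only there to show that the TCM is selected by address while the cache is transparent.

```python
TCM_BASE, TCM_SIZE = 0x0000_0000, 0x1000  # hypothetical address window


class MemorySystem:
    def __init__(self):
        self.tcm = {}    # directly addressed, not backed by RAM
        self.ram = {}
        self.cache = {}  # transparent copy of some RAM contents
        self.log = []    # where each read was served from

    def _in_tcm(self, addr):
        return TCM_BASE <= addr < TCM_BASE + TCM_SIZE

    def write(self, addr, value):
        if self._in_tcm(addr):
            self.tcm[addr] = value
        else:
            self.ram[addr] = value
            self.cache[addr] = value  # write-through, for simplicity

    def read(self, addr):
        if self._in_tcm(addr):
            self.log.append("tcm")    # no hit/miss concept at all
            return self.tcm.get(addr, 0)
        if addr in self.cache:
            self.log.append("cache hit")
            return self.cache[addr]
        self.log.append("cache miss")
        value = self.ram.get(addr, 0)
        self.cache[addr] = value      # fill the cache on a miss
        return value


ms = MemorySystem()
ms.write(0x10, 7)      # lands in the TCM because of its address
ms.ram[0x2000] = 9     # pre-existing RAM contents, not yet cached
values = [ms.read(0x10), ms.read(0x2000), ms.read(0x2000)]
print(values, ms.log)
```

The program never names the cache; it only names addresses. The TCM, by contrast, is reached because the address falls in its window.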

Why place Mem[MA] in MB and then copy from MB to IR rather than going straight from Mem[MA] to IR?

During the fetch stage of the fetch-execute cycle, why are the contents of the cell whose address is in the MA (memory address register) placed in MB (memory buffer) then copied to IR (instruction register), rather than placing the contents of address of MA directly in the IR?
In theory it would be possible to send instruction fetch memory data directly to the IR (or to both the MB and the IR) — this would require extra hardware: wires and muxes.
You may notice that the architecture (depending on which one it is) makes use of few (one or two) busses, and this would effectively add another bus.  So, I think that all we can say is that simplicity is the reason.  Back in the day when processors were this simple, transistor counts were very limited for integrated circuits.
Going in the direction of making things more efficient, nowadays, even simple processors separate instruction (usually cache) memory from data (usually cache) memory.  This independence accomplishes a number of improvements.  MIPS, even the unpipelined single cycle processor, for example:
First, the PC (program counter) register replaces the MA for the instruction fetch side of things and the IR replaces the MB (as if loading directly into that register as you're suggesting), but let's also note that the IR can be reduced from being a true register to being wires whose output is stable for the cycle and thus can be worked on by a decode unit directly.  (Stability is gained by not sharing the instruction memory hardware with the data memory hardware; whereas with only a single memory interface, data has to be copied around and stored somewhere so the interface can be shared for both code & data.)
That saves both the cycle you're referring to: to transfer data from MB to IR, but also the cycle before to capture the data in the MB register in the first place.  (Generally speaking, enregistering some data requires a cycle, so if you can feed wires without enregistering, that's better, all other factors being the same.)
(Also, depending on the architecture you're looking at, the PC on MIPS uses a dedicated increment unit (adder) rather than attempting to share the main ALU and/or the busses for that increment; that could also save a cycle or two.)
Second, the data memory can meanwhile run concurrently with the instruction memory (a nice win), executing a data load from memory or store to memory in parallel with the fetch of the next instruction.  The data side also forgoes the MB register as a temporary parking place, and instead can load memory data directly into a processor register (the one specified by the load instruction).
Having two dedicated memories creates an independence that reduces the need for register capture while also allowing for parallelism, of course requiring more hardware for the design.
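To make the difference in register transfers concrete, here is a small sketch that lists the fetch steps for each organization. It is only a register-transfer-level illustration: the step counts are not cycle counts for any particular machine, and the function names are made up for this example.

```python
def fetch_shared_bus(mem, pc):
    """Single memory interface: MA <- PC, MB <- Mem[MA], IR <- MB."""
    steps = []
    ma = pc
    steps.append("MA <- PC")
    mb = mem[ma]
    steps.append("MB <- Mem[MA]")
    ir = mb
    steps.append("IR <- MB")
    return ir, steps


def fetch_split_memory(imem, pc):
    """Dedicated instruction memory: its output wires play the role of the IR."""
    ir = imem[pc]
    return ir, ["IR = IMem[PC]"]


memory = {0: 0x1234}  # one instruction word at address 0
ir_a, steps_a = fetch_shared_bus(memory, 0)
ir_b, steps_b = fetch_split_memory(memory, 0)
print(steps_a)
print(steps_b)
```

Both versions deliver the same instruction word; the split-memory version simply has no intermediate registers to load.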

Does a user process have any control over paging?

A program might have some data that, when needed, it wants to access very fast. Let's call this VIP data. It would like to reduce the likelihood that the page in memory that the VIP data resides on gets swapped to disk when memory utilization is high on the system. What types of control/influence does it have over this?
For example, I think it can consider the page replacement policy and try to influence the OS to not swap this VIP data to disk. If the policy is LRU, the program can periodically read the VIP data to ensure that the page has always been accessed fairly recently. A program can also use a very small amount of memory in total, making it likely that all its pages are recently accessed when it runs and therefore the VIP data is not likely swapped to disk.
Can it exert any more explicit control over paging?
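To do this, you might consider:
Prioritising the process using the renice command (note that this changes CPU scheduling priority and does not by itself stop pages from being swapped out), or
Locking the process's pages in main memory using mlock(2).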
This is entirely operating system dependent. On some systems, if you have appropriate privileges you can lock pages in physical memory.
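As one possible illustration of page locking, the sketch below calls mlock(2) through `ctypes` on a Unix-like system. It assumes the C library exposes `mlock` and that the process is allowed to lock at least one page (subject to `RLIMIT_MEMLOCK` or privileges); the function name `lock_vip_buffer` is hypothetical, and a real program would more likely call `mlock` directly from C.

```python
import ctypes
import ctypes.util
import mmap


def lock_vip_buffer(size):
    """Allocate a page-aligned anonymous buffer and try to pin it with mlock(2).

    Returns (buffer, locked, errno). Assumes a Unix-like libc that provides mlock.
    """
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    libc.mlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    libc.mlock.restype = ctypes.c_int

    buf = mmap.mmap(-1, size)               # anonymous mapping, page aligned
    view = ctypes.c_char.from_buffer(buf)   # used only to obtain the address
    rc = libc.mlock(ctypes.addressof(view), size)
    err = ctypes.get_errno() if rc != 0 else 0
    return buf, rc == 0, err


buf, locked, err = lock_vip_buffer(mmap.PAGESIZE)
print("locked" if locked else f"mlock failed, errno={err}")
```

If the call fails, typically because of the locked-memory limit or missing privileges, the buffer is still usable but remains subject to normal paging.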

When does the CPU flush a value from the store buffer to the L1 cache?

Core A writes value x to its store buffer, waits for the invalidate ack, and then flushes x to the cache. Does it wait for only one ack or for all acks? And how does it know how many acks to expect from all the CPUs?
It isn't clear to me what you mean by "invalid ack", but let's assume you mean a snoop/invalidation originating from another core which is requesting ownership of the same line.
In this case, the stores in the store buffer are generally free to ignore such invalidations from other cores, since the stores in the store buffer are not yet globally visible. A store only becomes globally visible when it commits to L1 at some point after it has retired. At this point[1] the cache controller will make an RFO (request for ownership) for the associated line if it isn't already in the cache. It is essentially at this point that the store becomes globally visible. The L1 cache controller doesn't need to know how many other invalidations are in flight, because they are being mediated by some higher-level components in the system as part of the MESI protocol, and when it gets the line in the E state, it is guaranteed to be the exclusive owner.
In short, invalidations from other cores have little effect on stores in the store buffer[2], since they become globally visible at a single point based on an RFO request. It is loads that have already executed that are more likely to be affected by invalidation activity on another core, especially on strongly ordered platforms such as x86, which doesn't allow visible load-load reordering. The so-called MOB (memory order buffer) on x86, for example, is responsible for tracking whether invalidations potentially break the ordering rules.
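The behavior described above can be sketched with a toy model. It makes several simplifying assumptions that are not from any real design: one address stands for one cache line, a plain dictionary stands in for the coherent cache hierarchy, and the class and method names (`Core`, `snoop_invalidate`, `commit_oldest`) are invented for illustration.

```python
from collections import deque


class Core:
    def __init__(self, shared):
        self.store_buffer = deque()  # (addr, value), oldest first
        self.shared = shared         # stands in for the coherent cache hierarchy
        self.owned = set()           # lines this core currently owns
        self.rfo_count = 0

    def store(self, addr, value):
        self.store_buffer.append((addr, value))  # not yet globally visible

    def load(self, addr):
        # Store-to-load forwarding: the youngest matching buffered store wins.
        for a, v in reversed(self.store_buffer):
            if a == addr:
                return v
        return self.shared.get(addr, 0)

    def snoop_invalidate(self, addr):
        self.owned.discard(addr)     # buffered stores are left untouched

    def commit_oldest(self):
        addr, value = self.store_buffer.popleft()
        if addr not in self.owned:
            self.rfo_count += 1      # request for ownership before the write
            self.owned.add(addr)
        self.shared[addr] = value    # the store is now globally visible


shared = {}
a, b = Core(shared), Core(shared)
a.store(0x40, 1)
before = (a.load(0x40), b.load(0x40))  # A sees its own store, B does not
a.snoop_invalidate(0x40)               # invalidation arrives; store stays buffered
pending = len(a.store_buffer)
a.commit_oldest()
after = b.load(0x40)
print(before, pending, after, a.rfo_count)
```

The invalidation has no effect on the buffered store; visibility to the other core changes only at commit time, after the single RFO.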
RFO Response
Perhaps the "acks" you were talking about are the responses from other cores to the writing core's request to obtain or upgrade its ownership of the line so that it can write to it: i.e., invalidating copies of the line in the other CPUs and so on.
This is commonly known as issuing an RFO which when successful leaves the line in the E state in the requesting core.
Most CPUs are layered, with a variety of different agents working together to ensure coherency. In practice, this means that a CPU doesn't need to wait for up to N-1 "acks" from the other N-1 cores on an N-CPU system, but rather just a single reply from a higher-level component which itself is in charge of sending and collecting responses from other CPUs.
One example could be a single-socket multi-core CPU with a private L1 and L2, and a shared L3. A core might send its RFO down to the L3, which might send invalidate requests to all cores, wait for their responses and then acknowledge the RFO request to the requesting core. Alternatively, the L3 may store some bits which indicate which cores could possibly have a copy of the line, and then it only needs to send the requests to those cores (the role the L3 is taking in that case is sometimes referred to as a snoop filter).
Since all communication between agents passes through the L3, it is able to keep everything consistent. In the case of a multi-socket system, things get more complicated: the L3 on the local socket may again get the request and may pass it over to the other socket to do the same type of invalidation there. Again there might exist the concept of a snoop filter, or other concepts may exist and the behavior may even be configurable!
For example, in Intel's Broadwell Xeon architecture, there are four different configurable snoop modes:
Broadwell offers four different snoop modes a reintroduction of Home
Snoop with Directory and Opportunistic Snoop Broadcast (HS with DIR +
OSB) previously available on Ivy Bridge, and three snoop modes that
were available on Haswell, Early Snoop, Home Snoop, and Cluster on Die
Mode (COD). Table 5 maps the memory bandwidth and latency trade-offs
that will vary across each of the different modes. Most workloads will
find that Home Snoop with Directory and Opportunistic Snoop Broadcast
will be the best choice.
... with different performance tradeoffs:
The rest of that document goes into some detail about how the various modes work.
So I guess the short answer is "it's complicated and depends on the detailed design and possibly even user-configurable settings".
[1] Or potentially at some earlier point, since an optimized implementation might "look ahead" in the store buffer and issue RFOs (so-called "RFO prefetches") for upcoming stores even before they become the most senior store.
[2] Invalidations may, however, complicate the RFO prefetches mentioned in the first footnote, since it means there is a window where the line can be "stolen back" by another core, making the RFO prefetch wasted work. A sophisticated implementation may have a predictor that varies the RFO prefetch aggressiveness based on monitoring whether this occurs.
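The single-socket example in this answer, where a shared L3 acts as a snoop filter and the requesting core sees only one reply, could be sketched roughly as follows. This is a deliberately simplified model, not a description of any specific Intel design; `SharedL3`, `sharers`, and `invalidations_sent` are names made up for the sketch.

```python
class SharedL3:
    """Toy shared L3 that tracks which cores may hold each line."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.sharers = {}            # line address -> set of core ids
        self.invalidations_sent = 0

    def read(self, core, line):
        self.sharers.setdefault(line, set()).add(core)

    def rfo(self, core, line):
        """Handle a request for ownership and return the one ack the core sees."""
        holders = self.sharers.get(line, set()) - {core}
        # Only cores that may hold the line are contacted; their responses
        # are collected here rather than by the requesting core.
        self.invalidations_sent += len(holders)
        self.sharers[line] = {core}
        return "ack"


l3 = SharedL3(num_cores=8)
l3.read(1, 0x80)
l3.read(5, 0x80)
reply = l3.rfo(0, 0x80)
print(reply, l3.invalidations_sent, l3.sharers[0x80])
```

On this 8-core toy system, the writer waits for one reply while the L3 contacts just the two cores that had read the line.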

What happens when a core writes to its L1 cache while another core has the same line in its L1 too?

What happens when a core writes to its L1 cache while another core has the same line in its L1 too?
Let's say for an Intel Skylake CPU.
How does the cache system preserve consistency? Does it update in real time, does it stop one of the cores?
What's the performance cost of continuously writing to the same cache line with two cores?
In general, modern CPUs use some variant[1] of the MESI protocol to preserve cache coherency.
In your scenario of an L1 write, the details depend on the existing state of the cache lines: is the cache line already in the cache of the writing core? In the other core, in what state is the cache line, e.g., has it been modified?
Let's take the simple case where the line isn't already in the writing core (C1), and it is in the "exclusive" state in the other core (C2). At the point where the address for the write is known, C1 will issue an RFO (request for ownership) transaction onto the "bus" with the address of the line, and the other cores will snoop the bus and notice the transaction. The other core that has the line will then transition its line from the exclusive to the invalid state, and the value of the line will be provided to the requesting core, which will have it in the modified state, at which point the write can proceed.
Note that at this point, further writes to that line from the writing core proceed quickly, since it is in the M state which means no bus transaction needs to take place. That will be the case until the line is evicted or some other core requests access.
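The scenario just described can be traced with a minimal MESI sketch for a single cache line. It assumes a simple snooping bus and omits data movement and write-backs entirely; `Bus` and `MesiCore` are hypothetical names, and real implementations add many details not modeled here.

```python
class Bus:
    def __init__(self):
        self.cores = []
        self.transactions = 0


class MesiCore:
    """Tracks the MESI state of one cache line in one core."""

    def __init__(self, bus):
        self.state = "I"
        self.bus = bus
        bus.cores.append(self)

    def read(self):
        if self.state == "I":
            self.bus.transactions += 1
            others = [c for c in self.bus.cores
                      if c is not self and c.state != "I"]
            for c in others:
                c.state = "S"   # M/E/S holders drop to S (write-back of M omitted)
            self.state = "S" if others else "E"

    def write(self):
        if self.state in ("I", "S"):
            self.bus.transactions += 1   # RFO (or upgrade) on the bus
            for c in self.bus.cores:
                if c is not self:
                    c.state = "I"
        self.state = "M"   # E -> M and M -> M need no bus transaction


bus = Bus()
c1, c2 = MesiCore(bus), MesiCore(bus)
c2.read()                       # C2 gets the line in E
state_c2_before = c2.state
c1.write()                      # C1 issues an RFO; C2 is invalidated
states_after = (c1.state, c2.state)
count_after_rfo = bus.transactions
c1.write()
c1.write()                      # further writes hit the line in M
print(state_c2_before, states_after, count_after_rfo, bus.transactions)
```

The transaction counter stops increasing once C1 holds the line in M, which matches the point about later writes proceeding without bus traffic.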
Now, there are a lot of additional details in actual implementations which aren't covered above or even in the Wikipedia description of the protocol.
For example, the basic model involves a single private cache per CPU, and shared main memory. In this model, core C2 would usually provide the value of the shared line onto the bus, even though it has not modified it, since that would be much faster than waiting to read the value from main memory. In all recent x86 implementations, however, there is a shared last-level L3 cache which sits between all the private L2 and L1 caches and main memory. This cache has typically been inclusive so it can provide the value directly to C1 without needing to do a cache-to-cache transfer from C2. Furthermore, having this shared cache means that each CPU may not actually need to snoop the "bus" since the L3 cache can be consulted first to determine which, if any, cores actually have the line. Only the cores that have the line will then be asked to make a state transition. Kind of a push model rather than pull.
Despite all these implementation details, the basics are the same: each cache line has some "per core" state (even though this state may be stored or duplicated in some central place like the LLC), and this state atomically undergoes logical transitions that ensure that the cache line remains consistent at all times.
Given that background, here are some specific answers to your final two sub-questions:
Does it update in real time, does it stop one of the cores?
Any modern core is going to do this in real time, and also in parallel for different cache lines. That doesn't mean it is free! For example, in the description above, the write by C1 is stalled until the cache coherence protocol is complete, which is likely dozens of cycles. Contrast that with a normal write which takes only a couple cycles. There are also possible bandwidth issues: the requests and responses used to implement the protocol use shared resources that may have a maximum throughput; if the rate of coherence transactions passes some limit, all requests may slow down even if they are independent.
In the past, when there was truly a shared bus, there may have been some partial "stop the world" behavior in some cases. For example, the lock prefix for x86 atomic instructions is apparently named based on the lock signal that a CPU would assert on the bus while it was doing an atomic transaction. During that entire period other CPUs are not able to fully use the bus (but presumably they could still proceed with CPU-local instructions).
What's the performance cost of continuously writing to the same cache line with two cores?
The cost is very high because the line will continuously ping-pong between the two cores as described above (at the end of the process described, just reverse the roles of C1 and C2 and restart). The exact details vary a lot by CPU and even by platform (e.g., a 2-socket configuration will change this behavior a lot), but basically you are probably looking at a penalty of tens of cycles per write, versus an unshared throughput of 1 write per cycle.
You can find some specific numbers in the answers to this question which covers both the "two threads on the same physical core" case and the "separate cores" case.
If you want more details about specific performance scenarios, you should probably ask a separate specific question that lays out the particular behavior you are interested in.
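As a back-of-the-envelope illustration of the ping-pong effect, the sketch below counts how often ownership of a line has to move for a given sequence of writers. The cycle costs (`local_cost=1`, `transfer_cost=50`) are placeholder assumptions chosen only to match the "1 per cycle" versus "tens of cycles" orders of magnitude mentioned above; they are not measurements of any CPU.

```python
def ownership_transfers(writers):
    """Count how often the line must change owner for a sequence of writing cores."""
    owner = None
    transfers = 0
    for core in writers:
        if core != owner:
            transfers += 1   # an RFO is needed to take the line
            owner = core
    return transfers


def estimated_cycles(writers, local_cost=1, transfer_cost=50):
    # local_cost and transfer_cost are placeholder values, not measurements.
    return len(writers) * local_cost + ownership_transfers(writers) * transfer_cost


single_core = [0] * 8       # one core writes 8 times
alternating = [0, 1] * 4    # two cores take turns, 8 writes in total
print(ownership_transfers(single_core), estimated_cycles(single_core))
print(ownership_transfers(alternating), estimated_cycles(alternating))
```

Under these assumed costs, the alternating pattern pays the transfer penalty on every write, while the single writer pays it once.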
[1] The variations on MESI often introduce new states, such as the "Owned" state in MOESI or the "Forwarded" state in MESIF. The idea is usually to make certain transitions or usage patterns more efficient than the plain MESI protocol, but the basic idea is largely the same.