How is load->store reordering possible with in-order commit -- follow up - cpu-architecture

I would like to ask for some clarification about what was discussed in this thread: How is load->store reordering possible with in-order commit? -- sorry, I don't have enough reputation to add comments directly there.
A load can't fault after you've checked the TLB and/or whatever
memory-region stuff for it. That part has to be complete before it
retires, or before it reaches the end of an in-order pipeline. Just
like a retired store sitting in the store buffer waiting to commit, a
retired load sitting in a load buffer is definitely happening at some
point.
So the sequence on an in-order pipeline is:
lw r0, [r1] TLB hit, but misses in L1d cache. Load execution unit writes the address (r1) into a load buffer. Any later instruction that tries to read r0 will stall, but we know for sure that the load didn't fault.
With r0 tied to waiting for that load buffer to be ready, the lw instruction itself can leave the pipeline (retire), and so can later
instructions.
So you're basically saying that the lw instruction (or its micro-op) can retire even though, after it retires, the load itself can stay 'in flight' inside the load buffer (load queue) waiting for the data to arrive from somewhere.
sw r2, [r3] store execution unit writes address + data to the store buffer / queue. Then this instruction can retire.
Probing the load buffers finds that this store doesn't overlap with the pending load, so it can commit to L1d. (If it had overlapped, you couldn't commit it until a MESI RFO completed anyway, and fast restart would forward the incoming data to the load buffer).
Here you are saying that when the store executes (or actually after it retires?) it will search the load buffer to check whether there is any older in-flight load, still waiting for data, that happens to overlap with that store's target address.
Does that make sense? Thank you.
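For intuition, here is a toy model in C of that "store probes the load buffers before committing" step. It is purely illustrative, nothing like a real implementation, and every name in it is invented:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: one entry per retired-but-still-pending load. */
typedef struct {
    bool      valid;   /* load has retired but its data hasn't arrived yet */
    uintptr_t addr;    /* address the load is waiting on                   */
    size_t    size;    /* access size in bytes                             */
} load_buffer_entry;

#define LOAD_BUFFER_ENTRIES 8
static load_buffer_entry load_buffer[LOAD_BUFFER_ENTRIES];

/* Checked before a retired store commits from the store buffer to L1d:
 * does the store overlap any older load that is still in flight? */
static bool store_overlaps_pending_load(uintptr_t st_addr, size_t st_size)
{
    for (int i = 0; i < LOAD_BUFFER_ENTRIES; i++) {
        if (!load_buffer[i].valid)
            continue;
        uintptr_t lo = load_buffer[i].addr;
        uintptr_t hi = lo + load_buffer[i].size;
        if (st_addr < hi && st_addr + st_size > lo)
            return true;   /* overlap: hold the store (or forward the incoming data) */
    }
    return false;          /* no overlap: safe to commit the store to L1d now */
}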

Related

Do memory barriers prevent branch prediction?

This question does not assume any specific architecture.
Assume that we have a multicore processor with cache coherence, out-of-order execution, and branch prediction logic. We also assume that stores to memory are strictly in program order.
We have two threads running in parallel, each on a separate core.
Below are the threads’ pseudo-code. data and flag are initially 0.
Thread #1 code:
data=10;
flag=1;
Thread #2 code:
while(!flag);
print data;
With proper synchronization, Thread #2 would eventually print 10. However, the branch predictor could potentially predict that the loop is not entered, and thus perform a speculative read of data, which contains 0 at that time (before Thread #1 sets data). The prediction turns out to be correct, i.e. 'flag' is eventually set to 1. In this case the print data instruction can be retired, but it prints the incorrect value of 0.
The question is whether a memory barrier would somehow prevent the speculative read of data and cause the CPU to execute the busy-wait properly. An alternative solution could be to let the branch predictor do its work, but snoop the writes done by the other core, and in case a write to data is detected, use the ROB to undo the premature read (and its dependent instructions) and then re-execute with the proper data.
Arch-specific answers are also welcome.
No, branch prediction + speculative execution is fine in an ISA with memory barriers, as long as mis-speculation is killed properly.
thus perform a speculative read of data, which contains 0 at that time
When the CPU detects the misprediction, instructions from the mis-speculated path of execution are discarded, along with their effects on architectural registers.
When the correct path of execution does eventually exit the loop, then the memory barrier will run (again), then the load of data will run (again). The fact that they earlier ran in the shadow of a mis-predicted branch has no effect.
Your pseudo-code assembly isn't very clear because it makes print data look like a single operation. In fact it will involve a load into a register and then a call to print.
When the data load runs on the correct path, it will have to redo the work of reading a value from cache, and cache is coherent across cores. It doesn't matter if the mis-speculated load brought the cache line into this core's L1d cache; a store by another core will have to invalidate it before that store can become globally visible.
The loop exits after seeing flag != 0; the barrier after that makes sure that later loads haven't already happened, giving acquire semantics to the load of flag (assuming the barrier includes blocking LoadLoad reordering).
The barrier executing on the correct path makes sure that this core waits for that invalidation instead of using an early load.
A store / release barrier in the writer makes sure that the new data value is globally visible before flag = 1 is visible to any other thread on any core.
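In C11-atomics terms, the release/acquire pairing described above looks like the sketch below (thread creation is omitted; this assumes data is a plain variable whose ordering is provided entirely by the flag):

#include <stdatomic.h>
#include <stdio.h>

static int data = 0;                 /* plain variable, protected by the flag ordering */
static atomic_int flag = 0;

void writer(void)                    /* Thread #1 */
{
    data = 10;
    /* release: data is globally visible before flag = 1 becomes visible */
    atomic_store_explicit(&flag, 1, memory_order_release);
}

void reader(void)                    /* Thread #2 */
{
    /* acquire: later loads (of data) can't be hoisted before the flag load */
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                            /* busy-wait */
    printf("%d\n", data);            /* guaranteed to print 10 */
}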

What happens with nested branches and speculative execution?

Alright, so I know that if a particular conditional branch has a condition that takes time to compute (memory access, for instance), the CPU assumes a condition result and speculatively executes along that path. However, what would happen if, along that path, yet another slow conditional branch pops up (assuming, of course, that the first condition hasn't been resolved yet and the CPU can't just commit the changes)? Does the CPU just speculate inside the speculation? What happens if the last condition is mispredicted but the first wasn't? Does it just rollback all the way?
I'm talking about something like this:
if (value_in_memory == y){
    // computations
    if (another_val_memory == x){
        // computations
    }
}
Speculative execution is the regular state of execution, not a special mode that an out of order CPU enters when it sees a branch and then leaves when the branch is no longer in flight.
This is easier to see if you consider that it's not just branches that can go wrong: many instructions can fault, including those that access memory, those that have restrictions on their input values, etc. So any substantial out-of-order execution implies constant speculation, and CPUs are built around that idea.
So "nested branches" doesn't end up being special in that sense.
Now, modern CPUs have a variety of methods for quick branch misprediction recovery, faster than recovery from other types of faults1. For example they may snapshot the state of the register mapping at some branches, to allow recovery to start before the branch is at the head of the reorder buffer. Since it is not always feasible to snapshot at all branches, there might be complicated heuristics involved to decide where to take snapshots.
I mention this last part because it is one way in which nested branches might matter: when there are lots of branches in flight, you might hit some microarchitectural limits related to the tracking of these branches for recovery purposes. For more details, you can look through patents for "branch order buffer" (for Intel techniques, but there are no doubt others).
1 The basic recovery method is to keep executing until the faulting instruction is the next to retire, and then throw away all younger instructions. In the context of branch mispredictions, this means you could actually suffer two or more mispredictions, only the oldest of which actually takes effect: e.g., a younger branch mispredicts, and while executing up to that branch (at which point recovery can occur), another mispredict occurs on an older branch, so the younger one ends up getting discarded.
(Maybe not a complete answer, but I had some of this written when @BeeOnRope posted an answer. Posting this anyway for some more links and technical details in case anyone's curious.)
Everything is always speculative until it reaches retirement and becomes non-speculative, definitely happened, part of the architectural state.
e.g. any load might fault with a bad address, any div might trap on divide by zero. See also Out-of-order execution vs. speculative execution. That and What exactly happens when a skylake CPU mispredicts a branch? mention that branch mispredicts are handled specially, because they're expected to be frequent. Fast recovery can start before a mis-predicted branch reaches retirement, unlike the behaviour for a faulting load, for example. (That's part of why Meltdown is exploitable.)
So even "regular" instructions are executed speculatively before being committed, and the only distinction between them is a human-made distinction, not computer-made? I presume, then, that the CPU stores multiple possible rollback points? For instance, if I have load instructions that may lead to page faults or simply use stale values inside a conditional branch, does the CPU identify such instructions and scenarios and save a state for each of them? I feel like I misunderstood, because this may lead to a lot of stored register states and complicated dependencies.
The retirement state is always consistent so you can always roll back to there and discard all in-flight work, e.g. if an external interrupt arrives you want to handle it without waiting for a chain of a dozen cache miss loads to all execute. When an interrupt occurs, what happens to instructions in the pipeline?
This tracking basically happens for free or is something you need to do anyway to be able to detect which instruction faulted, not just that there was a problem somewhere. (This is called "precise exceptions")
The real distinction humans can usefully make is speculation that has a real chance of being wrong during execution of non-error cases. If your code gets a bad pointer, it doesn't really matter how it performs; it's going to page-fault and that's going to be very slow compared to local OoO exec details.
You're talking about a modern out-of-order (OoO) execution (not just fetch) CPU, like modern Intel or AMD x86, high-end ARM, MIPS r10000, etc.
The front-end is in-order (with speculation down predicted paths), and so is commit (aka retirement) from the out-of-order back-end into non-speculative retirement state. (aka known-good architectural state).
The CPU uses two major structures to track instructions (or on x86, uops = parts of instructions) in the back-end. The last stage of the front-end (after fetch / decode) allocates/renames instructions and adds them into both of these structures at once.
RS = Reservation Station = scheduler: not-yet-executed instructions, waiting for an execution unit. The RS tracks dependencies and sends the oldest-ready uops to execution units that are ready.
ROB = ReOrder Buffer: not-yet-retired instructions. Instructions enter and leave in-order so it can just be a circular buffer.
Includes a flag to mark each entry as executed or not, set once the RS has sent it to an execution unit which reports success. The oldest instructions in the ROB that all have their done-executing bit set can "retire".
Also includes a flag which indicates "fault if this reaches retirement". This avoids spending time handling page faults from a load instruction on the wrong path of execution (which might well have a pointer into an unmapped page), for example. Either in the shadow of a branch mispredict, or just after another instruction (in program order) that should have faulted first but OoO exec got to it later.
(I'm also leaving out register-renaming onto a large physical register file.
That's the "rename" part. Allocate includes choosing which execution port an instruction will use, and reserving a load or store buffer entry for memory instructions.)
(There's also a store-buffer; stores don't write directly to L1d cache, they write to the store buffer. This makes it possible to speculatively execute stores and still roll back without them becoming visible to other cores. It also decouples cache-miss stores from execution. Once a store instruction retires, the store-buffer entry "graduates" and is eligible to commit to L1d cache, once MESI gets exclusive access to the cache line, and once memory-ordering rules are satisfied.)
Execution units detect whether an instruction should fault, or was mis-speculated and should roll back, but don't necessarily act on that until the instruction reaches retirement.
In-order retirement is the step that recovers program-order after OoO exec, including the case of exceptions of mis-speculation.
Terminology: Intel calls it "issue" when instructions are sent from the front-end into the ROB + RS. Other computer architecture people often call that "dispatch".
Sending uops from the RS to execution units is called "dispatch" by Intel, "issue" by other people.
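To make the ROB / store-buffer interaction above concrete, here is a toy retirement step in C. It is only a mental model (every structure and name is invented), not a description of real hardware:

#include <stdbool.h>

typedef struct {
    bool executed;   /* set when an execution unit reports completion        */
    bool fault;      /* "fault if this reaches retirement"                   */
    bool is_store;
    int  sb_index;   /* store-buffer entry allocated at rename, if a store   */
} rob_entry;

typedef struct {
    bool graduated;  /* store has retired: eligible to commit to L1d, in order */
    /* address and data omitted */
} sb_entry;

#define ROB_SIZE 16
#define SB_SIZE  8
static rob_entry rob[ROB_SIZE];  /* circular buffer, entries allocated in program order */
static sb_entry  sb[SB_SIZE];
static int rob_head;             /* oldest instruction still in flight */

void retire_one(void)
{
    rob_entry *e = &rob[rob_head];
    if (!e->executed)
        return;                      /* oldest not done yet: nothing retires this cycle */
    if (e->fault) {
        /* take the exception now and discard all younger in-flight work */
        return;
    }
    if (e->is_store)
        sb[e->sb_index].graduated = true;  /* store buffer may commit it to L1d later */
    rob_head = (rob_head + 1) % ROB_SIZE;  /* retirement is strictly in program order */
}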

How do speculative loads and stores happen in a modern Intel processor? [duplicate]

Given the small program shown below (handcrafted to look the same from a sequential consistency / TSO perspective), and assuming it's being run by a superscalar out-of-order x86 cpu:
Load A <-- A is in main memory
Load B <-- B is in L2
Store C, 123 <-- C is in L1
I have a few questions:
Assuming a big enough instruction-window, will the three instructions be fetched, decoded, executed at the same time? I assume not, as that would break execution in program order.
Fetching A from main memory is going to take longer than fetching B from L2. Will the latter load have to wait until the first is fully executed? Will the fetching of B only start after Load A is fully executed, or how long does it have to wait?
Why would the store have to wait for the loads? If it does, will the instruction just wait in the store buffer to be committed until the loads finish, or will it have to sit and wait for the loads right after decoding?
Thanks
Terminology: "instruction-window" normally means out-of-order execution window, over which the CPU can find ILP. i.e. ROB or RS size. See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths
The term for how many instructions can go through the pipeline in a single cycle is pipeline width. e.g. Skylake is 4-wide superscalar out-of-order. (Parts of its pipeline, like decode, uop-cache fetch, and retirement, are wider than 4 uops, but issue/rename is the narrowest point.)
Terminology: "wait to be committed in the store buffer": store data + address get written into the store buffer when a store executes, and the store commits from the store buffer to L1d at any point after retirement, when it's known to be non-speculative.
(In program order, to maintain the TSO memory model of no store reordering. A store buffer allows stores to execute inside this core out of order but still commit to L1d (and become globally visible) in-order. Executing a store = writing address + data to the store buffer.)
Can a speculatively executed CPU branch contain opcodes that access RAM?
Also what is a store buffer? and
Size of store buffers on Intel hardware? What exactly is a store buffer?
The front-end is irrelevant. 3 consecutive instructions might well be fetched in the same 16-byte fetch block, and might go through pre-decode and decode in the same cycle as a group. And (also or instead) issue into the out-of-order back-end as part of a group of 3 or 4 uops. IDK why you think any of that would cause any potential problem.
The front end (from fetch to issue/rename) processes instructions in program order. Processing simultaneously doesn't put later instructions before earlier ones, it puts them at the same time. And more importantly, it preserves the information of what program order is; that's not lost or discarded because it matters for instructions that depend on the previous one1!
There are queues between most pipeline stages, so (for example on Intel Sandybridge) instructions that pre-decode as part of a group of up-to-6 instructions might not hit the decoders as part of the same group of up-to-4 (or more with macro-fusion). See https://www.realworldtech.com/sandy-bridge/3/ for fetch, and the next page for decode. (And the uop cache.)
Executing (dispatching uops to execution ports from the out-of-order scheduler) is where ordering matters. The out-of-order scheduler has to avoid breaking single threaded code.2
Usually issue/rename is far ahead of execution, unless you're bottlenecked on the front-end. So there's normally no reason to expect that uops that issued together will execute together. (For the sake of argument, let's assume that the 2 loads you show do get dispatched for execution in the same cycle, regardless of how they got there via the front-end.)
But anyway, there's no problem here starting both loads and the store the same time. The uop scheduler doesn't know whether a load will hit or miss in L1d. It just sends 2 load uops to the load execution units in a cycle, and a store-address + store-data uop to those ports.
[load ordering]
This is the tricky part.
As I explained in an answer + comments on your last question, modern x86 CPUs will speculatively use the L2 hit result from Load B for later instructions, even though the memory model requires that this load happens after Load A.
But if no other cores write to cache line B before Load A completes, then nothing can tell the difference. The Memory-Order Buffer takes care of detecting invalidations of cache lines that were loaded from before earlier loads complete, and doing a memory-order mis-speculation pipeline flush (rollback to retirement state) in the rare case that allowing load re-ordering could change the result.
Why would the store have to wait for the loads?
It won't, unless the store-address depends on a load value. The uop scheduler will dispatch the store-address and store-data uops to execution units when their inputs are ready.
It's after the loads in program order, and the store buffer will make it even farther after the loads as far as global memory order is concerned. The store buffer won't commit the store data to L1d (making it globally visible) until after the store has retired. Since it's after the loads, they'll have also retired.
(Retirement is in-order to allow precise exceptions, and to make sure no previous instructions took an exception or were a mispredicted branch. In-order retirement allows us to say for sure that an instruction is non-speculative after it retires.)
So yes, this mechanism does ensure that the store can't commit to L1d until after both loads have taken data from memory (via L1d cache which provides a coherent view of memory to all cores). So this prevents LoadStore reordering (of earlier loads with later stores).
I'm not sure if any weakly-ordered OoO CPUs do LoadStore reordering. It is possible on in-order CPUs when a cache-miss load comes before a cache-hit store, if the CPU uses scoreboarding so it only stalls when the not-yet-ready load result register is actually read, not when the load executes. (LoadStore is a weird one: see also Jeff Preshing's Memory Barriers Are Like Source Control Operations). Maybe some OoO exec CPUs can also track cache-miss stores post retirement when they're known to be definitely happening, but the data just still hasn't arrived yet. x86 doesn't do this because it would violate the TSO memory model.
Footnote 1: There are some architectures (typically VLIW) where bundles of simultaneous instructions are part of the architecture in a way that's visible to software. So if software can't fill all 3 slots with instructions that can execute simultaneously, it has to fill them with NOPs. It might even be allowed to swap 2 registers with a bundle that contained mov r0, r1 and mov r1, r0, depending on whether the ISA allows instructions in the same bundle to read and write the same registers.
But x86 is not like that: superscalar out-of-order execution must always preserve the illusion of running instructions one at a time in program order. The cardinal rule of OoO exec is: don't break single-threaded code.
Anything that would violate this can only be done with checking for hazards, or speculatively with rollback upon detection of mistakes.
Footnote 2: (continued from footnote 1)
You can fetch / decode / issue two back-to-back inc eax instructions, but they can't execute in the same cycle because register renaming + the OoO scheduler has to detect that the 2nd one reads the output of the first.

FreeRTOS blocking on multiple events/objects

In the UDP/IP Stack solution example, here, there is a proposed solution for blocking on a single event queue.
What would be the go-to solution for protecting the data that the pointer points to, until it has been handled by the task waiting on the queue?
Say for example that the queue is filled from an ISR. The ISR should not write to *pvData if it has not been processed by the appropriate task. But since there can be several event sources, the queue should probably be longer than one item. Should the struct be made:
typedef struct IP_TASK_COMMANDS
{
    eIPEvent_t eEventType;
    SemaphoreHandle_t data_read;
    void *pvData;
} xIPStackEvent_t;
With the semaphore taken in the ISR and given in the task that processes the data when it's done with it.
If you take the UDP example - normally you would have a pool of buffers (or dynamically allocate a buffer) from which a buffer would be obtained and given to the DMA. When the DMA fills the buffer with received data a pointer to the buffer goes into the UDP stack - at which point only the UDP stack knows where the buffer is and is responsible for it. At some point the data in the buffer may get passed from the UDP stack to the application where it can be consumed. The application then returns the buffer to the pool (or frees the allocated buffer) so it is available to the DMA again. The reverse is also true - the application may allocate a buffer that is filled with data to be sent, via the UDP stack, to the Tx function where it is actually placed onto the wire - in which case it is the Tx end interrupt that returns the buffer to the pool.
So, in short, there is only one thing that has a reference to the buffer at a time, so there is no problem.
[note above where it says the application allocates or frees a buffer, that would be inside the UDP/IP stack API called by the application rather than by the application directly - this is in fact partly at least how our own TCP/IP stack is implemented.]
You don't want your ISR to block and wait for the data buffer to become available. If it's appropriate for your ISR to just skip the update and move on when the buffer is not available then perhaps a semaphore makes sense. But the ISR should not block on the semaphore.
Here's an alternative to consider. Make a memory pool containing multiple appropriately sized data buffers. The ISR allocates the next available buffer from the pool, writes the data to it and puts the pointer to it on the queue. The task reads the pointer from the queue, uses the data, and then frees the buffer back to the pool. If the ISR runs again before the task uses the data, the ISR will be allocating a fresh buffer so it won't be overwriting the previous data.
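A minimal sketch of that pool-plus-queue idea using standard FreeRTOS queue calls (the buffer count, buffer size and all names here are illustrative, and error handling is omitted):

#include <stdint.h>
#include "FreeRTOS.h"
#include "queue.h"

#define NUM_BUFFERS   4
#define BUFFER_BYTES  128

static uint8_t ucBuffers[NUM_BUFFERS][BUFFER_BYTES];
static QueueHandle_t xFreePool;     /* pointers to idle buffers                   */
static QueueHandle_t xEventQueue;   /* pointers to buffers the ISR has filled     */

void vInitBufferPool(void)
{
    xFreePool   = xQueueCreate(NUM_BUFFERS, sizeof(uint8_t *));
    xEventQueue = xQueueCreate(NUM_BUFFERS, sizeof(uint8_t *));
    for (int i = 0; i < NUM_BUFFERS; i++) {
        uint8_t *p = ucBuffers[i];
        xQueueSend(xFreePool, &p, 0);
    }
}

void vMyISR(void)   /* called from the real interrupt handler */
{
    BaseType_t xWoken = pdFALSE;
    uint8_t *p;
    /* Never block in an ISR: if no buffer is free, drop this event. */
    if (xQueueReceiveFromISR(xFreePool, &p, &xWoken) == pdPASS) {
        /* ... fill *p with the received data ... */
        xQueueSendFromISR(xEventQueue, &p, &xWoken);
    }
    portYIELD_FROM_ISR(xWoken);
}

void vProcessingTask(void *pvParameters)
{
    (void) pvParameters;
    uint8_t *p;
    for (;;) {
        xQueueReceive(xEventQueue, &p, portMAX_DELAY);
        /* ... consume the data in *p ... */
        xQueueSend(xFreePool, &p, 0);   /* return the buffer to the pool */
    }
}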
Here's another consideration. The FreeRTOS Queue passes items by copy. If the data buffer is relatively small then perhaps it makes sense to just pass the data structure rather than a pointer to the data structure. If you pass the data structure then the queue service will make a copy to provide to the task and the ISR is free to update its original buffer.
Now that I think about it, using the Queue service copy feature may be simpler than creating your own separate memory pool. When you create the Queue and specify the queue length and item size, I believe the queue service creates a memory pool to be used by the queue.
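And a sketch of that pass-by-copy version, reusing the xIPStackEvent_t struct from the question (the event value and payload here are placeholders):

/* The queue is created to hold whole events, not pointers:
 *   xEventQueue = xQueueCreate(uxQueueLength, sizeof(xIPStackEvent_t));
 * xQueueSendFromISR() copies the item into the queue's own storage, so the
 * ISR's local xEvent can be reused immediately and there is nothing for the
 * task to hand back. */
void vMyCopyingISR(void)
{
    BaseType_t xWoken = pdFALSE;
    xIPStackEvent_t xEvent;
    xEvent.eEventType = eSomeEvent;   /* placeholder event value */
    xEvent.pvData     = NULL;         /* small payloads could live in the struct itself */
    xQueueSendFromISR(xEventQueue, &xEvent, &xWoken);
    portYIELD_FROM_ISR(xWoken);
}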

MSI/MESI: How can we get "read miss" in shared state?

In The Cache Memory Book by Jim Handy (excerpt is below), the author gives a table describing the MESI protocol. The table looks very unclear to me, and unfortunately the text does not help.
The first question (in green on the picture):
Is this right? -- a data block is in the cache of a CPU,
and it is in the shared state, but when the CPU reads it,
the CPU gets read miss.
How is this possible?
The second question (in purple):
Who and when create all these messages "Read miss", etc.?
(afaik, the system bus just translates messages of others)
And finally, the third question (not on the pic):
Do all these cache coherency protocols
(MSI,MESI,MOESI,Firefly,Dragon...)
maintain sequential consistency memory model?
Are there protocols that maintain other consistency models?
For the first question, while there is a data block (in shared state) in the cache, it is the wrong block (i.e., the tag does not match), so there is a cache miss. On a cache miss, the cache still needs to writeback data to memory if it is in the modified state.
For the second question, each bus is providing address and request information (read or write) to the cache; the system bus is providing this from a remote processor. The hit or miss is whether the address provided is a hit in the cache. Requests on the system bus are filtered by the remote processor's cache.
On a read by processor 1 (P1), if P1's cache has a hit, no signal needs to be sent to the system bus. On a write by P1, if P1's cache has a hit and the state is exclusive or modified, then no signals need to be sent to the system bus; but if the state in P1's cache is shared, then the other (P0) cache must be told (via the system bus) that P1 is performing a write and that any entry in P0's cache for that address must be invalidated.
(Note that "shared" state does not necessarily mean that the block of memory is present in some other cache. Most snoop-based protocols allow silent [no system bus communication] dropping of cache blocks in shared state.)
(By the way, the MESI protocol given by that book is a bit unusual. It is more common for Exclusive state to be entered by a read miss that found no other caches with the block [this allows silent update to modified on a later write] and for writes not to use write through to memory on a hit that was in shared state.)
For the third question, cache coherence protocols only address coherence, not consistency. The consistency model that the hardware provides determines what kinds of activity can be buffered and completed out of order (in a way that is visible to software). E.g., it helps performance if a write miss does not force any following read hits to wait until the write notification has been seen by all remote caches, but such buffering can violate sequential consistency.
Here is an example how sequential consistency can prevent buffering of writes:
P0 reads "B_old" at location B (cache hit), writes "A_new" to location A (a cache miss), then reads location B.
P1 reads "A_old" at location A (cache hit), writes "B_new" to location B (a cache miss), then reads location A.
If writes are buffered and reads are allowed to proceed without waiting for the preceding write to be recognized by the other cache, then P0's second read of B can read "B_old" (since it will still be a cache hit) and P1's second read of A can read "A_old" (since P0's write has not yet been seen). There is no way to construct a sequential ordering of these memory accesses, so sequential consistency would be violated.
However, if P0's second read of B waited until P0's write to A was recognized by P1 (and P1's second read of A waited until P1's write to B was recognized by P0), then a sequential ordering would be possible.
If P1 saw P0's write to A before its write to B, then sequential ordering is possible, including:
P0.read "B_old", P1.read "A_old", P0.write "A_new", P1.write "B_new", P0.read "B_new", P1.read "A_new"
P0.read "B_old", P1.read "A_old", P0.write "A_new", P0.read "B_old", P1.write "B_new", P1.read "A_new"
Note that hardware can speculate that ordering requirements are not violated and reorder the completion of memory accesses; however, it needs to be able to handle incorrect speculation (e.g., by restoration to a checkpoint). (A classic early paper on this is "Is SC + ILP = RC?" [PDF].)
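The P0/P1 example above is essentially the classic "store buffering" litmus test. In C11 atomics it can be sketched like this (dropping the initial cache-hit reads, which don't change the outcome):

#include <stdatomic.h>

static atomic_int A = 0, B = 0;   /* 0 plays the role of A_old / B_old */
static int r0, r1;

void p0(void)
{
    atomic_store(&A, 1);          /* write "A_new"                */
    r0 = atomic_load(&B);         /* P0's second read of B        */
}

void p1(void)
{
    atomic_store(&B, 1);          /* write "B_new"                */
    r1 = atomic_load(&A);         /* P1's second read of A        */
}

/* With the default memory_order_seq_cst, r0 == 0 && r1 == 0 is impossible
 * after both run: that is the sequential-consistency guarantee. If the
 * stores were allowed to sit in a store buffer past the following loads
 * (e.g. with memory_order_relaxed), both reads could return the old values. */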
There is a one-to-many mapping between cache lines and blocks of main memory: many different blocks of memory map to the same cache line (this is a direct-mapped cache). A cache line that is in the shared state may not hold the block of memory your program is looking for. What actually happened is that the block of memory your program wants and the block of memory present in the cache, though different, have the same cache line index. Consecutive blocks of memory are mapped to consecutive cache lines until the cache lines run out, and subsequent blocks of memory wrap around and are mapped starting with the first cache line again.
I found an intuitive explanation here: https://medium.com/breaktheloop/direct-mapping-map-cache-and-main-memory-d5e4c1cbf73e
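A tiny C sketch of that wrap-around (direct-mapped) index calculation, with example values for the line size and line count:

#include <stdint.h>

#define LINE_SIZE 64u     /* bytes per cache line (example value) */
#define NUM_LINES 1024u   /* lines in the cache (example value)   */

static unsigned cache_index(uintptr_t addr)
{
    return (unsigned)((addr / LINE_SIZE) % NUM_LINES);   /* consecutive blocks wrap around */
}

static uintptr_t cache_tag(uintptr_t addr)
{
    return addr / (LINE_SIZE * NUM_LINES);   /* distinguishes blocks sharing an index */
}

/* Two addresses with the same cache_index() but different cache_tag() collide:
 * the line can be valid (even in the Shared state) and the access is still a miss. */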