How do weak ISAs resolve WAW memory hazards using the store buffer? - cpu-architecture

Modern CPUs use a store buffer to delay commit into cache until retirement, which also avoids WAR and WAW memory hazards. I'm wondering how weak ISAs resolve WAW hazards using a store buffer that is not a FIFO and therefore allows StoreStore reordering. Do they insert an implicit barrier?
More specifically, if two stores to the same memory address retire in-order on a weak ISA, e.g. ARM/POWER, they could theoretically commit to cache out-of-order, since the store buffer is not FIFO, thus breaking the WAW dependency.
According to Wikipedia:
...the store instructions, including the memory address and store data, are buffered in a store queue until they reach the retirement point. When a store retires, it then writes its value to the memory system. This avoids the WAR and WAW dependence problems...
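To make the property concrete, here is a minimal litmus sketch (using C++ std::atomic only as a stand-in for the hardware guarantee; this just states what must not be broken, not how hardware implements it):

#include <atomic>

std::atomic<int> x{0};

void writer() {
    // Two stores to the same address. Even with relaxed ordering on a weak
    // ISA, per-location coherence requires the final value to be 2, and x
    // must never be observed going from 2 back to 1.
    x.store(1, std::memory_order_relaxed);
    x.store(2, std::memory_order_relaxed);
}

void reader() {
    int a = x.load(std::memory_order_relaxed);
    int b = x.load(std::memory_order_relaxed);
    // Coherence forbids observing a == 2 && b == 1.
}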

My guess (I'm not familiar with the details of any real-world designs):
Even if the store buffer is a full scheduler that can "grab" any graduated store for commit to L1d, I'd assume it would use an oldest-ready first order. (Like an instruction / uop scheduler aka RS Reservation Station.)
"Ready" would mean the cache line is exclusively owned (Modified or Exclusive state). Every graduated store itself is implicitly ready to commit because by definition the associated store instruction has retired.
In-order retirement means that stores become eligible for commit in program-order, so you can't have an older store that's temporarily hidden from the oldest-ready-first scheduling.
Together, those things would ensure that for any given byte, stores overlapping it commit in program order, and thus maintain a consistent global-visibility order and final value for any given group of bytes within a cache line.
A memory barrier might work by fencing off the store buffer like a divider on a grocery-store checkout conveyor belt, preventing stores past it from being grabbed for commit while older stores are still committing.
We do know real-world weakly-ordered store buffers like PowerPC RS64-III (in-order exec) and Alpha 21264 (OoO exec) do merging to help them create whole 4-byte or 8-byte aligned commits to L1d, e.g. out of multiple byte stores. That's also fine, assuming your merge algorithm respects order for any given byte e.g. by putting the data from a younger store into an older SB entry or vice versa and marking the other entry as "already committed". Obviously this must respect store barriers.
I think this is all fine even with unaligned stores, although preserving atomicity guarantees for unaligned stores could be tricky with merging. (Intel P6-family and later does provide atomicity guarantees for unaligned cached stores that don't cross a cache-line boundary, but we don't think Intel does merging in the store buffer proper; maybe just some stuff with LFBs for cache-miss back-to-back stores to the same line.)
It's likely that real hardware is not a full scheduler that can merge any 2 SB entries, e.g. maybe only over a limited range, to reduce the number of different addresses (and sizes) to compare at once. Also, you'd probably still only free up SB entries in program order, so it can basically be a circular buffer (unlike the RS). Allocating in program order, with the order tracked by the layout of the SB itself, makes it much cheaper for memory barriers to work, and to track where the youngest "graduated" store is.
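As a rough illustration of the guessed scheme (a toy model with invented names, not a description of any real microarchitecture), here's what oldest-ready-first commit with a barrier "divider" might look like:

#include <cstdint>
#include <vector>

// Toy model only: invented structure, not any real design.
struct SBEntry {
    uint64_t addr;        // store address
    uint64_t data;
    bool graduated;       // the store (or barrier) instruction has retired
    bool is_barrier;      // StoreStore barrier "divider"
    bool committed;
};

bool line_exclusively_owned(uint64_t) { return true; }  // stub: pretend we own the line
void write_to_l1d(uint64_t, uint64_t) {}                // stub: stand-in for an L1d write

// One commit attempt per cycle, scanning oldest-first through the (circular) buffer.
void try_commit(std::vector<SBEntry>& sb /* oldest entry first */) {
    bool all_older_committed = true;
    for (auto& e : sb) {
        if (e.committed) continue;
        if (!e.graduated) return;          // retirement is in-order: stop here
        if (e.is_barrier) {
            if (all_older_committed) { e.committed = true; continue; }
            return;                        // divider: nothing younger may be grabbed
        }
        if (line_exclusively_owned(e.addr)) {
            write_to_l1d(e.addr, e.data);  // commit this graduated store
            e.committed = true;
            return;                        // one commit per call
        }
        // Not ready (line not owned yet): a younger graduated store to a
        // *different* line may be grabbed instead -- the allowed StoreStore
        // reordering. A younger store to the *same* line is equally
        // not-ready, so per-byte commit order falls out automatically.
        all_older_committed = false;
    }
}

Merging of adjacent byte stores (as in RS64-III / 21264) is omitted here; the point is just that oldest-ready-first plus in-order graduation keeps same-address stores in program order without an explicit WAW check.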
Disclaimer: IDK if this is exactly how real HW works
Possible corner case: unaligned 4-byte store to [cache_line+63] (split across a CL boundary) and then to [cache_line+60] (fully contained in the lower cache line). If the older store-buffer entry can't commit right away because we don't yet own the next cache line, but we do own cache_line, we still can't let the younger store to cache_line+60 commit first, if we're depending on that not happening to avoid WAW hazards.
So you'd probably want a line-split SB entry to be able to commit the data to one line but not the other, allowing oldest-ready-first to happen for each location separately, not tying together order across 2 cache lines.
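In the toy model above, that could mean giving a line-split entry two independently committable halves (again, purely illustrative):

#include <cstdint>

struct SplitSBEntry {
    struct Half {
        uint64_t line_addr;   // which cache line this chunk belongs to
        uint32_t data;        // the bytes of the store that fall in that line
        bool     committed;   // this half can commit as soon as its line is owned
    };
    Half lo, hi;              // e.g. the byte at +63 and the bytes at +64..66
};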
Related: I wrote my own answer explaining what a store buffer is. I tried to avoid mistakes like Wikipedia makes ("when a store retires, it writes its value to the memory system": In fact retirement just makes it eligible to commit; such stores are called "graduated" stores.)

Related

Why data is fetched from main memory in Write Allocate cache policy

With a write-allocate policy, when a write miss occurs, the line is fetched from main memory and then the write is performed as a hit.
My question is, assuming a write-back policy on write hits, why is the data read from main memory if it is immediately going to be updated by the CPU? Can't we just write to the cache without fetching the data from main memory?
On a store that hits in L1d cache in Modified or Exclusive state, you don't need to fetch or RFO anything, because the line is already exclusively owned.
Normally you're only storing to one part of the line, so you need a copy of the full line in order to have it in Modified state. If you don't have any valid copy of the line, you need to do a Read For Ownership (RFO); if you only have a Shared copy, you can promote it to Exclusive and then Modified just by invalidating other copies (MESI).
A full-line store (like x86 AVX-512 vmovdqa [rdi], zmm0 64-byte store) can just invalidate instead of Read For Ownership, and just wait for an acknowledgement that no other cores have a valid copy of the line. IDK if that actually happens for AVX-512 stores specifically in current x86 microarchitectures.
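A rough sketch of that decision logic (hedged: the state names are textbook MESI states, and real coherence controllers are far more involved than this):

enum class MesiState { Modified, Exclusive, Shared, Invalid };

// What a write-allocate, write-back cache conceptually does for a store.
// full_line means the store overwrites every byte of the line (e.g. a
// 64-byte vmovdqa zmm store), so the old contents are never needed.
void handle_store(MesiState state, bool full_line) {
    switch (state) {
    case MesiState::Modified:
    case MesiState::Exclusive:
        // Hit with ownership: merge the store data into the line, mark Modified.
        break;
    case MesiState::Shared:
        // Data already present; only ownership is missing. Send an
        // invalidate/upgrade request -- no data read needed.
        break;
    case MesiState::Invalid:
        if (full_line) {
            // Can skip the read: invalidate other copies, wait for the
            // acknowledgement, then install the all-new line as Modified.
        } else {
            // Partial-line store: must RFO (read + gain ownership) so the
            // untouched bytes are valid once the line is Modified.
        }
        break;
    }
}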
Skipping the read (and just invalidating any other copies) definitely does happen in practice in some CPUs in some cases, e.g. in the store protocol used by microcode to implement x86 rep stos and rep movs, which are basically memset / memcpy. For large counts they are definitely storing full lines, and it's worth it to avoid the memory traffic of reading first. See Andy Glew's comments, which I quoted in What setup does REP do? - he designed P6's (Pentium Pro) fast-strings microcode when he was working at Intel, and says it included a no-RFO store protocol.
See also Enhanced REP MOVSB for memcpy

Atomicity of small PCIE TLP writes

Are there any guarantees about how card-to-host writes from a PCIe device targeting regular memory are implemented from a software process' perspective, where a single TLP write is fully contained within a single CPU cache line?
I'm wondering about a case where my device may write some number of words of data followed by a byte to indicate that the structure is now valid (such as an event completion), for example:
struct alignas(SYSTEM_CACHE_LINE_SIZE) PCIE_COMPLETION_T {
    uint64_t data_a;
    uint64_t data_b;
    uint64_t data_c;
    uint64_t data_d;
    uint8_t  valid;
};
Can I use a single TLP to write this structure, such that when software sees the valid member change to 1 (having been previously cleared to zero by software), the other data members will also reflect the values that I had written and not a previous value?
Currently I'm performing 2 writes, first writing the data and secondly marking it as valid, which doesn't have any apparent race conditions but does of course add unwanted overhead.
The most relevant question I can see on this site seems to be Are writes on the PCIe bus atomic? although this appears to relate to the relative ordering of TLPs.
Perusing the PCIe 3.0 specification, I didn't find anything that seemed to explicitly cover my concerns; I don't think I need AtomicOps in particular. Given that I'm only concerned about interactions with x86-64 systems, I also dug through the Intel architecture guide but came up no clearer.
Instinctively it seems that it should be possible for such a write to be perceived atomically -- especially as it is said to be a transaction -- but equally I can't find much in the way of documentation explicitly confirming that view (nor am I quite sure what I'd need to look at, probably the CPU vendor?). I also wonder if such a scheme can be extended over multiple cache lines -- i.e. if the valid byte sits on a second cache line written by the same TLP transaction, can I be assured that the first will be perceived no later than the second?
The write may be broken into smaller units, as small as dwords, but if it is, they must be observed in increasing address order.
PCIe revision 4, section 2.4.3:
If a single write transaction containing multiple DWs and the Relaxed Ordering bit Clear is accepted by a Completer, the observed ordering of the updates to locations within the Completer's data buffer must be in increasing address order. This semantic is required in case a PCI or PCI-X Bridge along the path combines multiple write transactions into the single one. However, the observed granularity of the updates to the Completer's data buffer is outside the scope of this specification.
While not required by this specification, it is strongly recommended that host platforms guarantee that when a PCI Express write updates host memory, the update granularity observed by a host CPU will not be smaller than a DW.
As an example of update ordering and granularity, if a Requester writes a QW to host memory, in some cases a host CPU reading that QW from host memory could observe the first DW updated and the second DW containing the old value.
I don't have a copy of revision 3, but I suspect this language is in that revision as well. To help you find it, Section 2.4 is "Transaction Ordering" and section 2.4.3 is "Update Ordering and Granularity Provided by a Write Transaction".
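Assuming the device really does emit the whole structure as one write TLP with Relaxed Ordering clear, and the valid byte sits at the highest address of the payload, the host-side consumer could rely on the increasing-address-order rule roughly like this (a sketch, not a verified implementation; the struct is repeated from the question with an assumed 64-byte line size, and the acquire fence is there to stop the compiler/CPU from hoisting the data reads above the flag read):

#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t SYSTEM_CACHE_LINE_SIZE = 64;  // assumed for this sketch

struct alignas(SYSTEM_CACHE_LINE_SIZE) PCIE_COMPLETION_T {
    uint64_t data_a;
    uint64_t data_b;
    uint64_t data_c;
    uint64_t data_d;
    uint8_t  valid;   // highest-addressed field in the written payload
};

// If valid == 1 is observed, the increasing-address-order rule for a single
// non-Relaxed-Ordering write TLP implies data_a..data_d from that same TLP
// are already visible.
bool poll_completion(volatile PCIE_COMPLETION_T* c, uint64_t out[4]) {
    if (c->valid != 1)
        return false;
    std::atomic_thread_fence(std::memory_order_acquire);  // keep data reads after the flag read
    out[0] = c->data_a;
    out[1] = c->data_b;
    out[2] = c->data_c;
    out[3] = c->data_d;
    return true;
}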

What is the difference between a store queue and a store buffer?

I am reading a number of papers and they either use store buffer and store queue interchangeably or use them to refer to different structures, and I just cannot follow along. This is what I thought a store queue was:
It is an associatively searchable FIFO queue that keeps information about store instructions in fetch order.
It keeps store addresses and data.
It keeps store instructions' data until the instructions become non-speculative, i.e. they reach retirement stage. Data of a store instruction is sent to the memory (L1 cache in this case) from the store queue only when it reaches retirement stage. This is important since we do not want speculative store data to be written to the memory, because it would mess with the in-order memory state, and we would not be able to fix the memory state in case of a misprediction.
Upon a misprediction, information in the store queue corresponding to store instructions that were fetched after the mispredicted instruction is removed.
Load instructions send a read request to both L1 cache and the store queue. If data with the same address is found in the store queue, it is forwarded to the load instruction. Otherwise, data fetched from L1 is used.
I am not sure what a store buffer is, but I was thinking it was just some buffer space to keep data of retired store instructions waiting to be written to the memory (again, L1).
Now, here is why I am getting confused. In this paper, it is stated that "we propose the scalable store buffer [SSB], which places private/speculative values directly into the L1 cache, thereby eliminating the non-scalable associative search of conventional store buffers." I am thinking that the non-scalable associatively searchable conventional structure they are talking about is what I know as a store queue, because they also say that
SSB eliminates the non-scalable associative search of conventional store buffers by forwarding processor-visible/speculative values to loads directly from the L1 cache.
As I mentioned above, as far as I know data forwarding to loads is done through the store queue. In the footnote on the first page, it is also stated that
We use "store queue" to refer to storage that holds stores’ values
prior to retirement and "store buffer" to refer to storage containing
retired store values prior to their release to memory.
This is in line with what I explained above, but then it conflicts with the 'store buffer' in the first quote. The footnote corresponds to one of the references in the paper. In that reference, they say
a store buffer is a mechanism that exists in many current processors to accomplish one or more of the following: store access ordering, latency hiding and data forwarding.
Again, I thought the mechanism accomplishing those is called a store queue. In the same paper they later say
non-blocking caches and buffering structures such as write buffers, store buffers, store queues, and load queues are typically employed.
So, they mention store buffer and store queue separately, but store queue is not mentioned again later. They say
the store buffer maintains the ordering of the stores and allows stores to be performed only after all previous instructions have been completed
and their store buffer model is the same as Mike Johnson's model. In Johnson's book (Superscalar Microprocessor Design), stores first go to store reservation station in fetch order. From there, they are sent to the address unit and from the address unit they are written into a "store buffer" along with their corresponding data. Load forwarding is handled through this store buffer. Once again, I thought this structure was called a store queue. In reference #2, authors also mention that
The Alpha 21264 microprocessor has a 32-entry speculative store buffer where a store remains until it is retired.
I looked at a paper about Alpha 21264, which states that
Stores first transfer their data across the data buses into the speculative store buffer. Store data remains in the speculative store buffer until the stores retire. Once they retire, the data is written into the data cache on idle cache cycles.
Also,
The internal memory system maintains a 32-entry load queue (LDQ) and a 32-entry store queue (STQ) that manages the references while they are in-flight. [...] Stores exit the STQ in fetch order after they retire and dump into the data cache. [...] The STQ CAM logic controls the speculative data buffer. It enables the bypass of speculative store data to loads when a younger load issues after an older store.
So, it sounds like in Alpha 21264 there is a store queue that keeps some information about store instructions in fetch order, but it does not keep data of store instructions. Store instructions' data are kept in the store buffer.
So, after all of this I am not sure what a store buffer is. Is it just an auxiliary structure for a store queue, or is it a completely different structure that holds data waiting to be written to L1? Or is it something else? I feel like some authors mean "store queue" when they say "store buffer". Any ideas?
Your initial understanding is correct - Store Buffer and Store Queue are distinct terms and distinct hardware structures with different uses. If some authors use them interchangeably, it is plain incorrect.
Store Buffer:
A store buffer is a hardware structure closer to the memory hierarchy that "buffers" up the write traffic (stores) from the processor so that the write-back stage of the pipeline completes as soon as possible.
Depending on whether the cache is write-allocate/write-no-allocate, a write to the cache may take a variable number of cycles. The store buffer essentially decouples the processor pipeline from the memory pipeline. You can read some more info here.
Another use of a store buffer is in speculative execution. When the processor speculatively executes and commits (as in hardware transactional memory systems, or others like the Transmeta Crusoe), the hardware must keep track of the speculative writes and undo them in case of misspeculation. This is what such a processor would use the store buffer for.
Store Queue:
A store queue is an associative array where the processor keeps the data and addresses of in-flight stores. These are typically used in out-of-order processors for memory disambiguation. The processor really needs a Load-Store Queue (LSQ) to perform memory disambiguation, because it must see all the memory accesses to the same address before deciding to schedule one memory operation before another.
All the memory disambiguation logic is accomplished via the Load-Store Queues in an out-of-order processor. Read more about memory disambiguation here.
If your confusion is solely because of the paper you are referring to, consider asking the authors - it is likely that their use of terminology is mixed up.
You seem to be making a big deal out of names; it's not that critical. A buffer is just some generic storage, which in this particular case should be managed as a queue (to maintain program order, as you stated). So it could be described as a store buffer (I'm more familiar with this one actually, and see also here), but in other cases it could be described as a store queue (some designs combine it with the load queue, forming an LSQ).
The names don't matter that much because as you see in your second quote - people may overload them to describe new things. In this particular case, they chose to split the store buffer into 2 parts, divided by the retirement pointer, since they believe they could use it to avoid certain store related stalls in some consistency models. Hey, it's their paper, for the remainder of it they get to define what they want.
One note though: the last bullet of your description of the store buffer/queue seems very architecture-specific. Forwarding local stores to loads at highest priority may miss later stores to the same address from other threads, and break most memory ordering models except the most relaxed ones (unless you protect against that in some other way).
This is in line with what I explained above, but then it conflicts with the 'store buffer' in the first quote.
There is really no conflict, and your understanding seems to be consistent with the way these terms are used in the paper. Let's carefully go through what the authors have said.
SSB eliminates the non-scalable associative search of conventional store buffers...
The store buffer holds stores that have been retired but are yet to be written into the L1 cache. This necessarily implies that any later issued load is younger in program order with respect to any of the stores in the store buffer. So to check whether the most recent value of the target cache line of the load is still in the store buffer, all that needs to be done is to search the store buffer by the load address. There can be either zero stores that match the load or exactly one store. That is, there cannot be more than one matching store. In the store buffer, for the purpose of forwarding, you only need to keep track of the last store to a cache line (if any) and only compare against that. This is in contrast to the store queue as I will discuss shortly.
...by forwarding processor-visible/speculative values to loads directly from the L1 cache.
In the architecture proposed by the authors, the store buffer and the L1 cache are not in the coherence domain. The L2 is the first structure that is in the coherence domain. Therefore, the L1 contains private values and the authors use it to forward data.
We use "store queue" to refer to storage that holds stores’ values
prior to retirement and "store buffer" to refer to storage containing
retired store values prior to their release to memory.
Since the store queue holds stores that have not yet been retired, when comparing a load with the store queue, both the address and the age of each store in the queue need to be checked. Then the value is forwarded from the youngest store that is older than the load targeting the same location.
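To make the contrast concrete, here is a toy C++ sketch of the two lookups described above (invented structures; a real store buffer/queue is a CAM, not a hash map or a linear scan):

#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

// Store buffer: only the most recent retired store per cache line matters
// for forwarding, so a load has at most one possible match.
std::unordered_map<uint64_t, uint64_t> last_retired_store_per_line;  // line -> data

std::optional<uint64_t> store_buffer_lookup(uint64_t line_addr) {
    auto it = last_retired_store_per_line.find(line_addr);
    if (it == last_retired_store_per_line.end())
        return std::nullopt;      // no pending retired store: read the cache
    return it->second;            // the single possible match: forward it
}

// Store queue: entries are not yet retired, so both address *and* age
// matter -- forward from the youngest store that is older than the load.
struct SQEntry { uint64_t addr, data, age; };

std::optional<uint64_t> store_queue_lookup(const std::vector<SQEntry>& sq,
                                           uint64_t load_addr, uint64_t load_age) {
    const SQEntry* best = nullptr;
    for (const auto& e : sq)
        if (e.addr == load_addr && e.age < load_age && (!best || e.age > best->age))
            best = &e;
    return best ? std::optional<uint64_t>(best->data) : std::nullopt;
}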
The goal of the paper you cited is to find an efficient way to increase the capacity of the store buffer. It just doesn't make any changes to the store queue because that is not in the scope of the work. However, there is another paper that targets the store queue instead.
a store buffer is a mechanism that exists in many current processors to accomplish one or more of the following: store access ordering, latency hiding and data forwarding.
These features apply to both store buffers and store queues. Using store buffers (and queues) is the most common way to provide these features, but there are others.
In general, though, these terms might be used by different authors or vendors to refer to different things. For example, in the Intel manual, only the store buffer term is used and it holds both non-retired and retired-but-yet-to-be-committed stores (obviously the implementation is much more complicated than just a buffer). In fact, it's possible to have a single buffer for both kinds of stores and use a flag to distinguish between them. In the AMD manual, the terms store buffer, store queue, and write buffer are used interchangeably to refer to the same thing as what Intel calls the store buffer. Although the term write buffer does have a specific meaning in other contexts. If you are reading a document that uses any of these terms without defining them, you'll have to figure out from the context how they are used. In that particular paper you cited, the two terms have been defined precisely. Anyway, I understand that it's easy to get confused because I've been there.

Memory Mapped files and atomic writes of single blocks

If I read and write a single file using normal IO APIs, writes are guaranteed to be atomic on a per-block basis. That is, if my write only modifies a single block, the operating system guarantees that either the whole block is written, or nothing at all.
How do I achieve the same effect on a memory mapped file?
Memory mapped files are simply byte arrays, so if I modify the byte array, the operating system has no way of knowing when I consider a write "done", so it might (even if that is unlikely) swap out the memory just in the middle of my block-writing operation, and in effect I write half a block.
I'd need some sort of a "enter/leave critical section", or some method of "pinning" the page of a file into memory while I'm writing to it. Does something like that exist? If so, is that portable across common POSIX systems & Windows?
The technique of keeping a journal seems to be the only way. I don't know how this works with multiple apps writing to the same file. The Cassandra project has a good article on how to get performance with a journal. The key thing to make sure of is that the journal only records positive actions (my first approach was to write the pre-image of each write to the journal, allowing you to roll back, but it got overly complicated).
So basically your memory-mapped file has a transactionId in the header. If your header fits into one block you know it won't get corrupted, though many people seem to write it twice with a checksum: [header[cksum]] [header[cksum]]. If the first checksum fails, use the second.
The journal looks something like this:
[beginTxn[txnid]] [offset, length, data...] [commitTxn[txnid]]
You just keep appending journal records until it gets too big, then roll it over at some point. When you start up your program, you check whether the transaction id in the file's header matches the last transaction id of the journal -- if not, you play back all the transactions in the journal to sync up.
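A rough C++ sketch of that record layout and the startup replay (names and framing invented to match the description above; checksumming and msync/fsync details omitted):

#include <cstdint>
#include <cstring>
#include <vector>

// One parsed journal record: [beginTxn[txnid]] [offset, length, data...] [commitTxn[txnid]]
struct JournalRecord {
    uint64_t txn_id;
    uint64_t offset;               // where in the mapped file the data goes
    std::vector<uint8_t> data;
    bool committed;                // a matching commitTxn marker was found
};

// On startup: replay every committed transaction newer than the header's
// transactionId into the memory-mapped file.
void replay(uint8_t* mapped_file, uint64_t header_txn_id,
            const std::vector<JournalRecord>& journal) {
    for (const auto& rec : journal) {
        if (!rec.committed || rec.txn_id <= header_txn_id)
            continue;              // skip torn or already-applied transactions
        std::memcpy(mapped_file + rec.offset, rec.data.data(), rec.data.size());
    }
    // After replay (and an msync / FlushViewOfFile), the header's
    // transactionId would be advanced to the last committed txn_id.
}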
If I read and write a single file using normal IO APIs, writes are guaranteed to be atomic on a per-block basis. That is, if my write only modifies a single block, the operating system guarantees that either the whole block is written, or nothing at all.
In the general case, the OS does not guarantee "writes of a block" done with "normal IO APIs" are atomic:
Blocks are more of a filesystem concept - a filesystem's block size may actually map to multiple disk sectors...
Assuming you meant sector, how do you know your write only mapped to one sector? There's nothing saying the I/O was aligned to a sector boundary once it has gone through the indirection of a filesystem.
There's nothing saying your disk HAS to implement sector atomicity. A "real disk" usually does, but it's not mandatory or a guaranteed property. Sadly your program can't "check" for this property unless it's an NVMe disk and you have access to the raw device, or you're sending raw commands that have atomicity guarantees to a raw device.
Further, you're usually concerned with durability over multiple sectors (e.g. if power loss happens was the data I sent before this sector definitely on stable storage?). If there's any buffering going on, your write may have still only been in RAM/disk cache unless you used another command to check first / opened the file/device with flags requesting cache bypass and said flags were actually honoured.

operating systems - TLBs

I'm trying to get my head round this (okay, tbh cramming the night before the exams :) but I can't figure out (nor find a good high-level overview on the net) the following:
'Page table entries can be mapped to more than one TLB entry... if, for example, every page table entry is mapped to two TLB entries, this is known as a 2-way set associative TLB.'
My question is, why would we want to map this more than once? Surely we want the maximum possible number of entries represented in the TLB, and duplication would waste space, right? What am I missing?
Many thanks
It doesn't mean you would load the same entry into two places into the table -- it means a particular entry can be loaded to either of two places in the table. The alternative where you can only map an entry to one place in the table is a direct mapped TLB.
The primary disadvantage of a direct-mapped TLB arises if you're copying from one part of memory to another, and (by whatever direct-mapping scheme the CPU uses) the translations for both have to be mapped to the same spot in the TLB. In this case, you end up re-loading the TLB entry every time, so the TLB is doing little or no good at all. By having a two-way set associative TLB, you can guarantee that any two entries can be in the TLB at the same time, so (for example) a block move from point A to point B can't ruin your day -- but if you read from two areas, combine them, and write results to a third, it could (if all three used translations that map to the same set of TLB entries).
The shortcoming of having a multiway TLB (like any other multiway cache) is that you can't directly compute which position might hold a particular entry at a given time -- you basically search across the ways to find the right entry. For two-way, that's rarely a problem -- but four ways is typically about the useful limit; 8-way set associative TLBs (or caches) aren't common at all, partly because searching across 8 possible locations for the data starts to become excessive.
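A small sketch of the indexing math behind the direct-mapped conflict described above (the geometry is made up for illustration):

#include <cstdint>

constexpr uint64_t PAGE_SHIFT   = 12;  // 4 KiB pages (assumed)
constexpr uint64_t SETS_DIRECT  = 64;  // 64-entry direct-mapped: 64 sets of 1
constexpr uint64_t SETS_TWO_WAY = 32;  // same capacity, 2-way: 32 sets of 2

uint64_t tlb_set_index(uint64_t vaddr, uint64_t num_sets) {
    uint64_t vpn = vaddr >> PAGE_SHIFT;   // virtual page number
    return vpn % num_sets;                // which set must hold its translation
}

// Example: if a copy's source and destination pages have VPNs that differ by
// a multiple of 64, both map to the same single slot in the direct-mapped
// TLB (thrashing on every access), while the 2-way TLB can hold both
// translations in the two ways of one set.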
Over time, though, the number of ways it makes sense to use in a cache or TLB tends to rise. The differential in speed between memory and processors continues to grow; the greater the differential, the more cycles the CPU can spend searching and still produce a result within a single memory clock cycle (or a specified number of memory clock cycles, even if that's more than one).