In MESI cache coherence protocol, when exactly does the state of a cache line change if the data needs to be fetched from memory? - cpu-architecture

In MESI protocol when a CPU:
Performs a read operation
Finds out the cache line is in Invalid state
There are no other non-invalid copies in other caches
It will need to fetch the data from memory, which takes a certain number of cycles. So does the state of the cache line change from (I) to (E) instantly, or only after the data is fetched from memory?

I think a cache would normally wait for the data to arrive; when it's not there yet you can't actually get a hit in cache for other requests to the same line, only to other lines that actually are present (hit under miss). Therefore the state for that line is still Invalid; the data for that tag isn't valid, so you can't set it to a valid state yet.
You'd want another miss to the same line (miss under miss) to notice that there is already an outstanding request for that line and attach itself to that line-request buffer (e.g. Intel x86 LFB = line fill buffer). Since finding Invalid triggers a look at the fill buffers but finding Exclusive doesn't, this reasoning also favors leaving the line Invalid.
e.g. the Skylake perf-counter event mem_load_retired.fb_hit counts, from perf list output:
[Retired load instructions which data sources were load missed L1 but
hit FB due to preceding miss to the same cache line with data not
ready.
Supports address when precise (Precise event)]
In a cache in a very old / simple or toy CPU with no memory-level parallelism (whole pipeline or just memory access totally stalls execution until the data arrives), the distinction is meaningless; nothing else happens to cache while the requested data is in-flight.
In such a CPU it's just an implementation detail. (Except it should still process MESI requests from other cores while a load is in flight so again tags need to reflect the correct state, otherwise it's extra stuff to check when deciding how to reply.)
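To make the miss-under-miss reasoning concrete, here is a minimal toy model in Python. It is only a sketch: the class names (ToyL1, FillBuffer), the 64-byte line size and the "fb_hit" return value are made up for illustration and do not describe any real Intel structure. The point it shows is that the tag stays Invalid while the request is outstanding, and a second load to the same line attaches to the existing fill buffer instead of sending another request.

```python
from dataclasses import dataclass, field

LINE_SIZE = 64  # bytes per cache line (assumed for this sketch)


@dataclass
class FillBuffer:
    """One outstanding line request; later misses to the same line attach here."""
    line_addr: int
    waiters: list = field(default_factory=list)


class ToyL1:
    def __init__(self):
        self.state = {}          # line address -> "M"/"E"/"S"; absent means Invalid
        self.fill_buffers = {}   # line address -> FillBuffer
        self.memory_requests = 0

    def load(self, addr, load_id):
        line = addr // LINE_SIZE * LINE_SIZE
        if line in self.state:
            return "hit"
        # The tag says Invalid, so check the fill buffers before going to memory.
        fb = self.fill_buffers.get(line)
        if fb is not None:
            fb.waiters.append(load_id)   # miss under miss: share the pending request
            return "fb_hit"
        self.fill_buffers[line] = FillBuffer(line, [load_id])
        self.memory_requests += 1
        return "miss"

    def data_arrived(self, line, other_sharers=False):
        """Only now does the line leave Invalid; returns the loads to wake up."""
        fb = self.fill_buffers.pop(line)
        self.state[line] = "S" if other_sharers else "E"
        return fb.waiters
```

In this model two loads to 0x1000 and 0x1008 produce one memory request, and the line only becomes E when data_arrived is called.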

After data is fetched from memory.
In practice, MESI (or any other protocol) has many transient states in addition to the main states of M/E/S/I. In your example, the coherence protocol would transition to a "Wait for Data Fill" state and will transition to E only after the data is fetched and the valid bit is set.
Reference: the description of cache coherence protocols in gem5/Ruby at http://learning.gem5.org/book/part3/MSI/cache-transitions.html (search for "was invalid, going to shared") may be useful.
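A transient state like this can be sketched as a small transition table. This is a hedged illustration in Python, not gem5 code: the state name I_WAIT_DATA and the event names are hypothetical, and a real protocol has many more states and events (snoops, evictions, upgrades).

```python
from enum import Enum, auto


class State(Enum):
    I = auto()
    I_WAIT_DATA = auto()  # transient: request sent, data not here yet (hypothetical name)
    S = auto()
    E = auto()
    M = auto()


class Event(Enum):
    LOAD = auto()
    DATA_EXCLUSIVE = auto()  # fill arrived, no other cache holds the line
    DATA_SHARED = auto()     # fill arrived, another cache also holds the line


TRANSITIONS = {
    (State.I, Event.LOAD): State.I_WAIT_DATA,
    (State.I_WAIT_DATA, Event.LOAD): State.I_WAIT_DATA,  # still waiting
    (State.I_WAIT_DATA, Event.DATA_EXCLUSIVE): State.E,
    (State.I_WAIT_DATA, Event.DATA_SHARED): State.S,
    (State.S, Event.LOAD): State.S,
    (State.E, Event.LOAD): State.E,
    (State.M, Event.LOAD): State.M,
}


def step(state, event):
    """Return the next state, or raise if the table has no such transition."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition for {event.name} in state {state.name}") from None
```

A load in I goes to the transient state first, and only the arrival of data moves the line to E (or S).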

Related

Page Fault - How does os search for the page in secondary storage?

My question is: when a page fault occurs and the required page is not in RAM, how does the OS know where to look for the given page in the entire secondary memory in order to bring it into RAM? Is the logical address the address in secondary storage, is the required secondary storage address stored in the page table itself, or is it done some other way?
I feel like I am probably missing something very basic here, but this doubt came to my mind and a quick Google search is not providing any answers.
My question is: when a page fault occurs and the required page is not in RAM, how does the OS know where to look for the given page in the entire secondary memory in order to bring it into RAM?
If there were 50 different operating systems that supported an average of 10 different architectures each, there would be up to a maximum of 500 different answers; where one of the answers would be "all software uses physical addresses and there is no virtual memory and there is no secondary memory" and another answer would be "a virtual address is a location on the disk and RAM is just used as disk cache to speed it up" (see https://en.wikipedia.org/wiki/Single-level_store ).
For most typical modern operating systems running on most typical architectures; if you worked out all of the information the kernel needs to know about each virtual page (e.g. what the page is pretending to be, what the page actually is, location on disk if any, location in RAM if any, something to keep track of "least recently used", something to keep track of "number of copy-on-write copies", etc); then you could scatter all the information across multiple different data structures such that:
some of the data structures are used/required by the CPU itself and some aren't
the same information may or may not be in 2 or more data structures at the same time
some data structures have an entry for each virtual page and some just have an entry for each range of multiple pages
some data structures are arrays/tables, some are trees, some are trees of tables, and some are something else.
some use "virtual address" or "virtual page number" as a key to find information; and some use something else (e.g. inverted page tables on PowerPC and Itanium use "physical address" as an index because using what you're trying to find as an index is the least intelligent thing you could possibly do, so why not?).
some of the data structures may be in the kernel and some may not be (e.g. the L4 micro-kernel manages virtual memory mapping purely in user-space via an "abstract hierarchical address space" model).
In general; the information about where a page's data is in (each different piece of?) secondary memory (if there is secondary memory) will be stored in one or more places in one or more things.
Note that when a page fault occurs the page fault handler typically needs to make multiple decisions; possibly starting with figuring out what made the access (a process, the kernel itself?) and whether the access should be allowed or denied, then figuring out what to do about it (send SIGSEGV? do a kernel panic? fetch data into the CPU's TLB? invalidate stale data from CPU's TLB? do copy-on-write cloning? fetch data from swap space? fetch data from file?); so the page fault handler ends up finding multiple different pieces of data from (potentially) multiple different places.
A Concrete Example
For my OS designs (which are based on asynchronous message passing and use micro-kernels); a micro-kernel is small enough that it can be custom designed and optimised for a specific architecture (without any regard to portability). The operating system design is intended for distributed systems, and for that reason shared memory (and fork()) are not supported (you don't want page fault handler to have to fetch data from a remote computer over a congested network connection to do a "copy on write"); and the only case for "copy on write" is memory mapped files where the page is shared by one or more processes and the (local) VFS cache.
For 64-bit 80x86, the CPU requires a tree of 4 levels of tables (page tables, page directories, page directory pointer tables and page map level 4), and to improve efficiency (reduce memory consumption and reduce cache misses, etc) I use these tables as much as possible.
For page table entries (or page directory entries if 2 MiB pages are being used); if the page is not present there are 63 bits that are ignored by the CPU that the OS can use for its own purposes; and if the page is present then (depending on which features CPU supports) there are at least 9 bits that the OS can use for its own purposes and flags that the CPU uses (e.g. the "read, write, no-execute" flags) can be used to augment the OS's own information.
When a page is not present, the 63 bits are split into 2 fields - one 8 bit field to keep track of the virtual type of the page (if it's supposed to act like RAM, if it's supposed to be executable, if it's supposed to use "write-back" caching, etc), and one 55 bit "where" field. If the highest bit in the "where" field is set the page was sent to swap space and the other 54 bits are a "swap space handle" (allowing for a maximum of "2**54 * 4 KiB" of swap space); and if the highest bit in the "where" field is clear then the other 54 bits are a "memory mapped file handle".
If a page fault occurs because of a "not present" page, the page fault handler uses the 8-bit field to determine if the access should be allowed or denied (or if it's already being handled due to a different thread accessing it already), then (if the access should be allowed) the page fault handler tells the scheduler to put the thread in a "WAITING FOR PAGE" state and marks the page as "being fetched" (so that other threads that belong to the same process know that it's being fetched already), then uses the "where" field to either send a request message asking for the page's data to the Swap Manager (which is a process in user-space), or to find a "memory mapped file descriptor" structure in kernel space that contains more information (that didn't fit in the page table entry) to determine the offset of the page within the file and a file handle, and send a request to the VFS for the page's data (the VFS or Virtual File System is another process in user-space).
Later; when Swap Manager or VFS send a reply message containing the page's data back to the kernel, the kernel fixes up the page table entry (putting the page of data from the message into the virtual address space) and tells scheduler to unblock the thread/s (shift them from the "WAITING FOR PAGE" state to the "READY TO RUN" state).
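As a rough illustration of the "not present" entry layout described above, here is a Python sketch that packs and unpacks the 8-bit type field and the 55-bit "where" field. The exact bit positions are an assumption made for the example (bit 0 as the CPU's present flag, type in bits 1 to 8, "where" in bits 9 to 63); the text only gives the field widths, and the function names are hypothetical.

```python
TYPE_BITS = 8
WHERE_BITS = 55
HANDLE_BITS = WHERE_BITS - 1           # 54 bits of handle
TYPE_SHIFT = 1                         # assumed: bit 0 is the CPU's "present" flag
WHERE_SHIFT = TYPE_SHIFT + TYPE_BITS   # assumed: "where" starts at bit 9
SWAP_FLAG = 1 << (WHERE_BITS - 1)      # highest bit of the "where" field


def encode_not_present(page_type, handle, in_swap):
    """Build a 64-bit not-present entry from its fields."""
    if not 0 <= page_type < (1 << TYPE_BITS):
        raise ValueError("page_type does not fit in 8 bits")
    if not 0 <= handle < (1 << HANDLE_BITS):
        raise ValueError("handle does not fit in 54 bits")
    where = handle | (SWAP_FLAG if in_swap else 0)
    return (where << WHERE_SHIFT) | (page_type << TYPE_SHIFT)  # present bit stays 0


def decode_not_present(pte):
    """Return (page_type, handle, in_swap) for a not-present entry."""
    if pte & 1:
        raise ValueError("entry is marked present")
    page_type = (pte >> TYPE_SHIFT) & ((1 << TYPE_BITS) - 1)
    where = (pte >> WHERE_SHIFT) & ((1 << WHERE_BITS) - 1)
    in_swap = bool(where & SWAP_FLAG)   # set: swap space handle; clear: file handle
    handle = where & (SWAP_FLAG - 1)
    return page_type, handle, in_swap
```

With 4 KiB pages, 2**54 handles cover 2**66 bytes (64 EiB) of swap space, which is where the "2**54 * 4 KiB" limit comes from.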
For both of these cases (memory mapped file and swap space) if the access was an "allowed read" then the page is mapped as read only (regardless of whether the page is supposed to be writeable). If the access was an "allowed write", or if a later "allowed write" is done to a page that was previously fetched and mapped as read only; then if the page's data came from swap space the page fault handler informs the Swap Manager that the copy of the page in swap space can be discarded (can't be re-used if the same page is sent to swap space later), and if the page's data came from a memory mapped file the page fault handler informs the VFS that there's one less process with a copy of that page and copies the "copy on write" page's data to a newly allocated page.
When a page is "present", it may still be part of a memory mapped file and there may still be a copy in swap space; but there isn't enough space in the page table entry to store the "where" field. In this case, if the page is in swap space and in RAM, the Swap Manager has to accept "Process ID + virtual address" instead of a "swap space handle" (which causes a little extra overhead in Swap Manager because it has to convert "Process ID + virtual address" into "swap space handle" itself). If the page is a "copy on write" memory mapped file, then the page fault handler searches the process' list of "memory mapped file descriptors" (which causes a little extra overhead).
Note that (in theory) when an OS is running low on free RAM it wants to select a "least likely to be needed soon" page to send to swap space, but this isn't easy/practical so most operating systems use "least recently used" instead.
My kernels don't do this at all. Instead they just send "random" pages to the Swap Manager, and (initially) the Swap Manager keeps the data in RAM and doesn't send it to any of the swap providers to store; and the Swap Manager uses "least recently sent to swap manager" to figure out which pages to send to a swap provider to store. A page that is used often may be sent to swap manager many times without ever actually being sent to a swap provider (and without causing slow disk IO for frequently used pages). Also note that, because "copy on write memory mapped file" is the only case that "copy on write" is used and because there is no other form of shared memory, the VFS can keep track of how many processes are sharing a copy of pages itself and the kernel never needs to keep track of how many processes are sharing a copy of any page (like most kernels for most operating systems do).
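The "least recently sent to swap manager" idea can be sketched with an ordered dictionary. This is a toy Python model under simplifying assumptions: ToySwapManager, its methods and the single on_provider dictionary are invented for the example, and fetch simply hands the data back and forgets it (the design described above keeps the swap copy until the page is written).

```python
from collections import OrderedDict


class ToySwapManager:
    def __init__(self, ram_capacity):
        self.ram_capacity = ram_capacity
        self.in_ram = OrderedDict()   # key -> data, least recently sent first
        self.on_provider = {}         # key -> data, stands in for slow storage

    def send(self, key, data):
        """The kernel hands over a page; keep it in RAM if possible."""
        self.in_ram.pop(key, None)
        self.in_ram[key] = data       # now the most recently sent page
        while len(self.in_ram) > self.ram_capacity:
            # Evict the page that was sent the longest time ago (slow IO here).
            old_key, old_data = self.in_ram.popitem(last=False)
            self.on_provider[old_key] = old_data

    def fetch(self, key):
        """Return (data, source) so a caller can see whether slow IO was needed."""
        if key in self.in_ram:
            return self.in_ram.pop(key), "ram"
        return self.on_provider.pop(key), "provider"
```

A page that keeps bouncing between the kernel and the swap manager keeps moving to the "recent" end, so it is never written to a provider in this model.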

Does parent process lose write ability during copy on write?

Say we have a parent process with some arbitrary amount of data stored in memory, and we use fork to spawn a child process. I understand that in order for the OS to perform copy on write, the page in memory that contains the data we are modifying will have its read-only bit set, and the OS will use the exception that results when the child tries to modify the data to copy the entire page into another area of memory so that the child gets its own copy. What I don't understand is this: if that section of memory is marked read-only, then the parent process, to whom the data originally belonged, would not be able to modify the data either. So how can this whole scheme work? Does the parent lose ownership of its data, so that copy on write has to be performed even when the parent itself tries to modify the data?
Right, if either process writes a COW page, it triggers a page fault.
In the page fault handler, if the page is supposed to be writeable, it allocates a new physical page and does a memcpy(newpage, shared_page, pagesize), then updates the page table of whichever process faulted to map the newpage to that virtual address. It then returns to user-space so the store instruction can re-run.
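Here is a toy simulation of that handler in Python, just to show that the parent faults exactly like the child does. Everything in it (PhysPage, Mapping, Process, the refcount field) is a made-up model and not how any real kernel lays out its data. One extra assumption beyond the text: when the faulting process is the last one still mapping the page, the sketch just re-enables writing instead of copying.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096


class PhysPage:
    def __init__(self, data=bytes(PAGE_SIZE)):
        self.data = bytearray(data)   # always a private copy of the bytes
        self.refcount = 1


@dataclass
class Mapping:
    page: PhysPage
    hw_writable: bool   # what the page table entry currently allows
    may_write: bool     # what the process is supposed to be allowed to do


class Process:
    def __init__(self):
        self.page_table = {}   # virtual page number -> Mapping

    def fork(self):
        child = Process()
        for vpn, m in self.page_table.items():
            m.hw_writable = False        # the parent loses direct write access too
            m.page.refcount += 1
            child.page_table[vpn] = Mapping(m.page, False, m.may_write)
        return child

    def write(self, vpn, offset, value):
        if not self.page_table[vpn].hw_writable:
            self._page_fault(vpn)        # afterwards the store "re-runs" below
        self.page_table[vpn].page.data[offset] = value

    def _page_fault(self, vpn):
        m = self.page_table[vpn]
        if not m.may_write:
            raise PermissionError("write to a genuinely read-only page")
        if m.page.refcount > 1:
            m.page.refcount -= 1
            m.page = PhysPage(m.page.data)   # the memcpy into a new physical page
        # Assumption: a sole remaining owner keeps the page and just regains write access.
        m.hw_writable = True
```

After a fork, whichever process writes first gets the new copy; the other one ends up as the sole owner of the original page.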
This is a win for something like fork, because one process typically makes an execve system call right away, after touching typically one page (of stack memory). execve destroys all memory mappings for that process, effectively replacing it with a new process. The parent once again has the only copy of every page. (Except pages that were already copy-on-write, e.g. memory allocated with mmap is typically COW-mapped to a single physical page of zeros, so reads can hit in L1d cache).
A smart optimization would be for fork to actually copy the page containing the top of the stack, but still do lazy COW for all the other pages, on the assumption that the child process will normally execve right away and thus drop its references to all the other pages. It still costs a TLB invalidation in the parent to temporarily flip all the pages to read-only and back, though.
Some UNIX implementations share the program text between the two since that cannot be modified. Alternatively, the child may share all of the parent’s memory, but in that case the memory is shared copy-on-write, which means that whenever either of the two wants to modify part of the memory, that chunk of memory is explicitly copied first to make sure the modification occurs in a private memory area.
Excerpted from: Modern Operating Systems (4th Edition), Tanenbaum

arm caches flush to uncached region

I am wondering how the cache subsystem will behave in the following situation.
Let's consider a Cortex-A8 with a VIPT L1 cache.
Suppose a cacheable virtual memory mapping [A,B] was created, and code worked with this memory region for some time, so the data cache filled up with its lines. Then the mapping [A,B] was replaced by an uncacheable one. After that we perform a flush operation (DCCIMVAC) on the virtual region [A,B]. What happens in this situation? Is the cached data just discarded (invalidated)? Is it written back to the now non-cacheable pages anyway? Or something else?
Update:
The main reason I'm asking is that my board hangs immediately after one of these flush instructions in the context described above. I have no idea why this could happen. More precisely, it hangs in the middle of the page, while flushing the cache line at offset 0xd80. If I skip this cache line flush, it goes further. If I change the attributes from uncacheable to cacheable, it works OK. It looks like they should be changed to cacheable, but I am trying to figure out what this code was doing. I'm mostly interested in whether the scenario I described is legal, or whether it may lead to undefined behavior.

How do MemReq and MemResp exactly work in RoccIO - RISCV

I'm trying to figure out how I can read from and write to memory in RISC-V when I'm using RoCCIO, but I can't work out what is happening, especially how to address the memory and how to work with the memory tag.
Are there any resources that explain how to transfer data between the Rocket core and my accelerator?
The file uncore/src/main/scala/consts.scala lists the different types of memory cmd. But what else is there?
For example, I want to pass the starting address of an array and the number of elements I plan to fetch to the accelerator, and then start fetching them. What signalling should I use?
Thanks
Within the RoCC interface, the mem field is a connection to the L1 cache. The dmem field is a connection to the L2 cache. Which one you want to use depends upon the memory bandwidth requirements of your accelerator.
Rocket and the RoCC accelerator can either share data through the caches (remember to use a fence instruction on the Rocket core so the memory ordering is correct) or you can directly give data to Rocket through the resp field in the RoCCIO.
The L1 cache's IO can be found in Rocket's nbdcache.scala (https://github.com/ucb-bar/rocket/blob/master/src/main/scala/nbdcache.scala), whereas the L2 IO can be found in the uncore's tilelink.scala (https://github.com/ucb-bar/uncore/blob/master/src/main/scala/tilelink.scala).
Although I don't know which memory tag you are referring to, typically the tag is passed through the memory system and returned to you with the response untouched (if you have multiple requests inflight, this returning tag helps you identify which is which).
I suspect if you want to fetch an array of data, you will need a state machine to request each individual address in your accelerator. Unless you go through the L2 cache interface, in which case I believe it comes in cache-line sizes.
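To show the idea of the tag and the per-address state machine in software terms, here is a small behavioural model in Python. It is not Chisel and does not use the real RoCC signal names: ToyMemory, fetch_array, the 8-byte element size and the limit of 4 requests in flight are all assumptions for the sketch. The toy memory deliberately answers in reverse order so that the tag is what identifies each response.

```python
class ToyMemory:
    """Stands in for the cache port: takes (addr, tag), answers later with (tag, data)."""

    def __init__(self, contents):
        self.contents = contents   # addr -> value
        self.pending = []

    def request(self, addr, tag):
        self.pending.append((tag, self.contents[addr]))

    def response(self):
        # Newest request answers first, so responses arrive out of order.
        return self.pending.pop() if self.pending else None


def fetch_array(mem, base, count, elem_size=8, max_inflight=4):
    inflight = {}                           # tag -> element index
    free_tags = list(range(max_inflight))
    result = [None] * count
    next_index = 0
    received = 0
    while received < count:
        # Issue phase: send one request per element while tags are free.
        while next_index < count and free_tags:
            tag = free_tags.pop()
            inflight[tag] = next_index
            mem.request(base + next_index * elem_size, tag)
            next_index += 1
        # Response phase: the returned tag says which element this is.
        resp = mem.response()
        if resp is not None:
            tag, data = resp
            result[inflight.pop(tag)] = data
            free_tags.append(tag)
            received += 1
    return result
```

In hardware the two phases would be states (or concurrent parts) of the accelerator's controller, with the base address and count passed in through the RoCC command registers.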

What does 'Mutex lock' exactly do?

You can see an interesting table at this link. http://norvig.com/21-days.html#answers
The table says:
Mutex lock/unlock 25 nanosec
fetch from main memory 100 nanosec
Nanosec?
I was surprised, because a mutex lock is faster than fetching data from memory. If so, what exactly does a mutex lock do? And what does "Mutex lock" mean in the table?
Let's say that ten people had to share a pen (maybe they work at a really cash-strapped company). Since they have to write long documents with the pen, but most of the work in writing a document is just thinking of what to say, they agree that each person gets to use the pen to write one sentence of the document, and then has to make it available to the rest of the group.
Now we have a problem: what if two people are done thinking about the next sentence, and both want to use the pen at once? We could just say that both people can grab the pen, but this is a fragile old pen, so if two people grab it then it breaks. Instead, we draw a chalk line around the pen. First you put your hand across the chalk line, then you grab the pen. If one person's hand is inside the chalk line, then nobody else is allowed to put their hand inside the chalk line. If two people try to put their hand across the chalk line at the same time, under these rules only one of them will get inside the chalk line first, so the other has to pull back their hand and keep it just outside the chalk line until the pen is available again.
Let's relate this back to mutexes. A mutex is a way to protect a shared resource (the pen) for a short period of time called the critical section (the time to write one sentence of a document). Whenever you want to use the resource, you agree to call mutex_lock first (put your hand inside the chalk line). Whenever you're done with the resource, you agree to call mutex_unlock (take your hand out from the chalk line area).
Now to how mutexes are implemented. A mutex is usually implemented with shared memory. There is some shared opaque data object called a mutex, and the mutex_lock and mutex_unlock functions both take a pointer to one of these. The mutex_lock function checks and modifies data inside the mutex using an atomic test-and-set or load-linked/store-conditional instruction sequence (on x86, xchg is often used), and either "acquires the mutex" - sets the contents of the mutex object to indicate to other threads that the critical section is locked - or has to wait. Eventually, the thread gets the mutex, does the work inside the critical section, and calls mutex_unlock. This function sets the data inside the mutex to mark it as available, and possibly wakes up sleeping threads that have been trying to acquire the mutex (this depends on the mutex implementation - some implementations of mutex_lock just spin in a tight loop on xchg until the mutex is available, so there is no need for mutex_unlock to notify anybody).
Why would locking a mutex be faster than going out to memory? In short, caching. The CPU has a cache that can be accessed very quickly, so the xchg operation doesn't need to go all the way out to memory as long as the processor can ensure that there is no other processor accessing that data. x86 ensures this with a notion of "owning" a cache line - if processor 0 owns a cache line, any other processor that wants to use data in that cache line has to go through processor 0. This way, there is no need for the xchg operation to look at any data beyond the cache, and cache access tends to be very fast, so acquiring an uncontested mutex is faster than a memory access.
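The spinning variant can be sketched in a few lines of Python. Python has no atomic exchange instruction, so the AtomicFlag class below fakes one with a hidden lock; that class, SpinMutex and run_demo are all invented for this illustration. On real hardware the exchange would be a single instruction such as xchg.

```python
import threading
import time


class AtomicFlag:
    """Stand-in for a machine word with an atomic exchange (like x86 xchg).
    The hidden lock only fakes the atomicity that hardware would provide."""

    def __init__(self):
        self._value = 0
        self._guard = threading.Lock()

    def exchange(self, new):
        with self._guard:
            old, self._value = self._value, new
            return old


class SpinMutex:
    def __init__(self):
        self._flag = AtomicFlag()

    def lock(self):
        # Writing 1 and reading back 0 means we were the one who took the mutex.
        while self._flag.exchange(1) == 1:
            time.sleep(0)   # let another thread run while we spin

    def unlock(self):
        self._flag.exchange(0)   # mark it available; spinners notice by themselves


def run_demo(threads=4, iterations=1000):
    mutex = SpinMutex()
    counter = [0]

    def worker():
        for _ in range(iterations):
            mutex.lock()
            counter[0] += 1      # the critical section
            mutex.unlock()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter[0]
```

run_demo(4, 1000) returns 4000, because each increment happens inside the critical section. A sleeping mutex would replace the spin with a wait queue that unlock has to wake.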
There is one caveat to that last paragraph, though: the speed benefit only holds for an uncontested mutex lock. If two threads try to lock the same mutex at the same time, the processors that are running those threads have to communicate and deal with ownership of the relevant cache line, which greatly slows down the mutex acquisition. Also, one of the two threads will have to wait for the other thread to execute the code in the critical section and then release the mutex, further slowing down the mutex acquisition for one of the threads.
The article you linked does not mention the architecture, but judging by the mentions of L1 and L2 cache it's Intel. If so, then I think that by mutex they meant the LOCK instruction prefix. In this respect this post seems relevant: Intel 64 and IA-32 | Atomic operations including acquire / release semantic
The Intel software developer's manual can also help if you know what you are looking for. I'd read anything relevant I could find about the LOCK prefix.