Flush TLB on a context switch

This may depend on the OS, but in general, as I understand it, when a page fault occurs (the desired page is not in main memory), the OS instructs the CPU to read the page from disk. I am wondering: does the OS dispatch to another process during the disk I/O? If it does, will there be a complete flush of the TLB on the context switch?

More or less, but a page fault doesn't always mean the page is on disk (the page could also not exist at all, be a lazily allocated page, be a copy-on-write page that was written to, exist but be marked unreadable/unwritable, etc.). But if that is the situation, the OS is probably going to schedule another thread at the very least, because disk I/O takes approximately forever.
How much needs to change depends on what the OS switches to: switching between threads of the same process doesn't imply a TLB flush. If a TLB flush is necessary, it's probably not a complete flush, because of global pages (so typically, you're not flushing out TLB entries for kernel pages). There is also PCID to avoid complete flushes (flushing can be limited to specified process-context IDs), but that's quite recent, and tricky to use since there are only 4096 distinct IDs.
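As an illustration of why the small ID space is awkward: a kernel typically keeps an allocator that hands out ASIDs/PCIDs and falls back to a full flush once the space wraps. Here is a minimal sketch of that idea in C, assuming a hypothetical tlb_flush_all() primitive; it is not how any particular kernel implements it:

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_ASIDS 4096u          /* e.g. the 12-bit PCID space on x86 */

/* Stand-in for the hardware primitive that flushes all non-global
 * TLB entries (hypothetical; real kernels use INVPCID & co.). */
static void tlb_flush_all(void)
{
    puts("full non-global TLB flush");
}

static uint32_t next_asid = 1;   /* 0 is often reserved for the kernel */

/* Hand out a fresh ASID for a new address space.  When the ID space
 * wraps, previously issued IDs may still be cached in the TLB, so the
 * safe fallback is one complete flush before recycling them. */
static uint32_t asid_alloc(void)
{
    if (next_asid == NUM_ASIDS) {
        tlb_flush_all();
        next_asid = 1;
    }
    return next_asid++;
}

int main(void)
{
    for (unsigned i = 0; i < NUM_ASIDS + 2; i++)
        asid_alloc();            /* forces one wrap-around flush */
    return 0;
}
```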

Process-specific pages are marked as non-global entries via the nG (non-global) bit in the TLB entry; such an entry also stores a process identifier (the Address Space ID, in ARM's terminology).
The ARM documentation lays this concept out clearly:
"For non-global entries, when the TLB is updated and the entry is marked as non-global, a value is stored in the TLB entry in addition to the normal translation information. This value is called the Address Space ID (ASID), which is a number assigned by the OS to each individual task. Subsequent TLB look-ups only match on that entry if the current ASID matches with the ASID that is stored in the entry. This permits multiple valid TLB entries to be present for a particular page marked as non-global, but with different ASID values. In other words, we do not necessarily need to flush the TLBs when we context switch."
Source: https://developer.arm.com/documentation/den0024/a/The-Memory-Management-Unit/Context-switching
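To make the quoted mechanism concrete, here is a minimal software sketch (the types and names are invented, not ARM's actual hardware interface) of a fully associative TLB whose non-global entries only hit when their stored ASID matches the current task's:

```c
#include <stdint.h>
#include <stdbool.h>

struct tlb_entry {
    uint64_t vpn;        /* virtual page number */
    uint64_t pfn;        /* physical frame number */
    uint16_t asid;       /* Address Space ID stored with the entry */
    bool     non_global; /* nG bit: entry belongs to one task only */
    bool     valid;
};

#define TLB_SIZE 16

/* Look up a VPN under the currently running task's ASID.  A non-global
 * entry only matches if its stored ASID equals the current one, which
 * is what lets entries from several tasks coexist without a flush. */
bool tlb_lookup(const struct tlb_entry tlb[TLB_SIZE],
                uint64_t vpn, uint16_t current_asid, uint64_t *pfn_out)
{
    for (int i = 0; i < TLB_SIZE; i++) {
        if (!tlb[i].valid || tlb[i].vpn != vpn)
            continue;
        if (tlb[i].non_global && tlb[i].asid != current_asid)
            continue;    /* same VPN, different task: not a hit */
        *pfn_out = tlb[i].pfn;
        return true;     /* TLB hit */
    }
    return false;        /* TLB miss: fall back to a page-table walk */
}
```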

Related

TLB Flushing in case of process context switching

Regarding TLB flushing on a process context switch: why does each process start from scratch in the TLB when it is given the CPU?
Why don't we pre-fill the first few page-table entries into the TLB? This could exploit locality of reference in the same way memory management does, i.e. when a process starts executing, it is very likely to begin with the first instruction of the first few pages that are loaded in main memory.
That could reduce the work of filling up the TLB during execution and speed up the system.
When the CPU generates a virtual address, the corresponding page is first searched for in the TLB. If it is not present there, it is searched for in the next level of memory, and the translation is then placed in the TLB according to a suitable replacement algorithm.
The system can't predict in which frame the page containing the so-called 'instruction 1' is placed. If it could, there would be no need for page replacement algorithms; instead it could load all the required pages sequentially: the page with the first instruction, the page with the second instruction, and so on.
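A minimal sketch of that miss path, with a round-robin policy standing in for whatever replacement algorithm the hardware actually uses (the walk function is a made-up stub):

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_SIZE 4

struct tlb_entry { uint64_t vpn, pfn; bool valid; };

/* Stand-in for searching "the next level memory" (the page-table
 * walk); the mapping returned here is purely illustrative. */
static uint64_t page_table_walk(uint64_t vpn) { return vpn + 0x100; }

/* On a hit, return the cached translation; on a miss, walk the page
 * table and install the result, evicting a victim chosen by a simple
 * round-robin policy standing in for the replacement algorithm. */
uint64_t translate(struct tlb_entry tlb[TLB_SIZE], uint64_t vpn)
{
    static int victim = 0;

    for (int i = 0; i < TLB_SIZE; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return tlb[i].pfn;              /* TLB hit */

    uint64_t pfn = page_table_walk(vpn);    /* TLB miss: walk */
    tlb[victim] = (struct tlb_entry){ vpn, pfn, true };
    victim = (victim + 1) % TLB_SIZE;       /* replacement step */
    return pfn;
}
```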

Converting Virtual addresses to Physical addresses with a TLB miss

Suppose that you have a 64-bit system and that your OS is scheduling two processes on it. Assume that the core has access to a 4-entry, fully associative TLB with a 4 KB page size. Furthermore, assume that the core has a 64-byte direct-mapped cache with 16-byte cache lines. Now suppose that your processes, A and B, have the following page tables:
Process A Page Table
Now suppose that your OS schedules process A and that, within it, a memory reference to the following virtual address is made:
0x2002
For the memory reference presented above, detail all TLB accesses (whether they are hits or misses) and all cache accesses (whether they are hits or misses). Assume hardware page-table walks and a physically addressed cache.
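The page tables themselves are not reproduced above, so the physical frame number below is an assumed placeholder (VPN 0x2 -> PPN 0x5); the point is only to show how the address fields fall out for 4 KB pages and a 64-byte direct-mapped cache with 16-byte lines:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t vaddr = 0x2002;

    /* 4 KB pages: 12 offset bits. */
    uint64_t vpn    = vaddr >> 12;          /* 0x2   */
    uint64_t offset = vaddr & 0xFFF;        /* 0x002 */

    /* Hypothetical mapping VPN 0x2 -> PPN 0x5 (the exercise's real
     * page table is not shown here). */
    uint64_t ppn   = 0x5;
    uint64_t paddr = (ppn << 12) | offset;  /* 0x5002 */

    /* 64-byte direct-mapped cache, 16-byte lines:
     * 4 lines -> 2 index bits, 4 block-offset bits. */
    uint64_t blk_off = paddr & 0xF;         /* bits [3:0] */
    uint64_t index   = (paddr >> 4) & 0x3;  /* bits [5:4] */
    uint64_t tag     = paddr >> 6;          /* remaining bits */

    printf("VPN=%#llx offset=%#llx paddr=%#llx index=%llu tag=%#llx off=%llu\n",
           (unsigned long long)vpn, (unsigned long long)offset,
           (unsigned long long)paddr, (unsigned long long)index,
           (unsigned long long)tag, (unsigned long long)blk_off);
    return 0;
}
```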

Does a user process have any control over paging?

A program might have some data that, when needed, it wants to access very fast. Let's call this VIP data. It would like to reduce the likelihood that the page on which the VIP data resides gets swapped to disk when memory utilization is high on the system. What types of control/influence does it have over this?
For example, I think it can take the page replacement policy into account and try to influence the OS not to swap the VIP data to disk. If the policy is LRU, the program can periodically read the VIP data to ensure that its page has always been accessed fairly recently. A program can also use a very small amount of memory in total, making it likely that all of its pages have been accessed recently whenever it runs, so the VIP data is unlikely to be swapped to disk.
Can it exert any more explicit control over paging?
In order to do this, you might consider:
- prioritising the process using the renice command, or
- locking the process's pages in main memory using mlock(2).
This is entirely operating system dependent. On some systems, if you have appropriate privileges you can lock pages in physical memory.
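For example, on POSIX systems something along these lines pins the pages holding the data (the vip_data name is made up; the call fails without sufficient privileges or past RLIMIT_MEMLOCK):

```c
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t len = (size_t)page;
    void *vip_data;                  /* illustrative name */

    /* Page-aligned allocation, since some systems require the
     * argument to mlock(2) to be page aligned. */
    if (posix_memalign(&vip_data, (size_t)page, len) != 0)
        return 1;
    memset(vip_data, 0, len);

    /* Pin the page in physical memory: while locked it will not be
     * swapped out. */
    if (mlock(vip_data, len) != 0) {
        perror("mlock");
        return 1;
    }

    /* ... work with the VIP data ... */

    munlock(vip_data, len);
    free(vip_data);
    return 0;
}
```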

LRU Clock Replacement Algorithm - What are reference bits initialized to?

Suppose I have a 2-entry TLB and am using LRU clock replacement. Further suppose that I have a TLB miss and it's a page fault, so I load a page into memory and update the TLB; now my TLB has one entry. Next, I have another TLB miss and it's a page fault, so I load a page into memory and update the TLB; now my TLB is full.
Now, let us say there is yet another TLB miss and it's a page fault, and I need to evict an entry from my TLB using the LRU clock.
My question is: are the reference bits for the two entries in the TLB both 1? (That is, when we add a new TLB entry, do we set the reference bit to 1 initially, or is it initialized to 0?) If not, is the first TLB entry's reference bit 0 and the second's 1? (Would that be because, during the second page fault, we set the reference bit of the first entry to 0?)
Lastly, which entry would be evicted (the first or the second)?
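Conventions differ between textbooks, so take this only as one possibility: a minimal clock-replacement sketch that assumes the common convention of setting the reference bit to 1 when an entry is installed:

```c
#include <stdint.h>
#include <stdbool.h>

#define NENTRIES 2

struct entry { uint64_t vpn; bool valid; bool ref; };

static struct entry tlb[NENTRIES];
static int hand = 0;    /* the clock hand */

/* Install a new entry.  Under the convention assumed here, a freshly
 * installed entry gets ref = 1; some texts instead initialise it to 0,
 * which changes how quickly the sweep settles on a victim. */
void clock_insert(uint64_t vpn)
{
    /* Use a free slot if one exists. */
    for (int i = 0; i < NENTRIES; i++) {
        if (!tlb[i].valid) {
            tlb[i] = (struct entry){ vpn, true, true };
            return;
        }
    }
    /* Otherwise sweep: clear ref bits until an entry with ref == 0
     * is found, giving each entry a "second chance". */
    while (tlb[hand].ref) {
        tlb[hand].ref = false;
        hand = (hand + 1) % NENTRIES;
    }
    tlb[hand] = (struct entry){ vpn, true, true };  /* evict + replace */
    hand = (hand + 1) % NENTRIES;
}
```

Under this convention, both entries in the scenario above have ref = 1 at the third miss; the sweep clears both bits, the hand comes back around, and the first entry is evicted. With the ref = 0-on-insert convention the first entry would also be evicted here, just without the sweep; the two conventions diverge once intervening hits set reference bits.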

MSI: Why do we need to write the line back when another CPU is going to overwrite it?

In the book Computer Architecture by Hennessy and Patterson, 5th ed., on page 360, they describe the MSI protocol and write something like:
If the line is in the "Exclusive" (Modified) state, then on receiving a "Write Miss" from the bus, the current CPU 1) writes the line back onto the bus, and then 2) goes into the "Invalid" state.
Why do we need to write the line back if it will be overwritten anyway by the subsequent write from the other CPU?
Is it connected with the fact that every CPU should see the same writes? (But I don't see why it is a problem for some other CPU not to see this particular write.)
Here is the protocol diagram from their book (the question concerns the transition marked in green; the one in purple is clear: we need to write back in order to supply the line to the requesting CPU):
Writing the modified data back to memory is not strictly necessary in an MSI protocol. The state diagrams also seem to assume a system with low-cost memory access (data is supplied by memory even when found in the shared state in another cache) and a shared bus connecting to the memory interface.
However, the modified data cannot simply be dropped as in shared state since the requesting processor might only be modifying part of the cache block (e.g., only one byte). Whatever portions of the block that are not modified by the requesting processor must still be available either in memory or at the requesting processor (the other processor has already invalidated its copy). With a shared bus and low cost memory access the cost difference of adding a write-back to memory over just communicating the data to the other processor is small.
In addition, even on a word-addressed system with word-sized cache blocks, keeping the old data available allows the write miss request to be sent speculatively (as with out-of-order execution or prefetch-for-write) without correctness issues.
(Per-byte tracking of modified [or valid as a superset of modified] state would allow some data communication to be avoided at the cost of extra state bits and a more complex communication system.)
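As a toy illustration of the transition the question is about, here is a sketch (invented names, not a real coherence controller) of a snoop handler for a line in Modified state reacting to a remote write miss:

```c
#include <stdio.h>

enum msi_state { INVALID, SHARED, MODIFIED };

struct cache_line {
    enum msi_state state;
    unsigned char  data[16];    /* one cache block */
};

/* Stand-in for putting the block on the bus so that memory (and/or
 * the requesting CPU) captures the only up-to-date copy. */
static void bus_write_back(const unsigned char *data, unsigned len)
{
    (void)data;
    printf("write back %u bytes\n", len);
}

/* Snoop handler: another CPU issued a Write Miss for this block. */
void on_remote_write_miss(struct cache_line *l)
{
    if (l->state == MODIFIED) {
        /* We hold the only current copy.  The requester may modify
         * just one byte of the block, so the unmodified bytes must
         * survive somewhere -- hence the write-back before dropping
         * the line. */
        bus_write_back(l->data, sizeof l->data);
    }
    l->state = INVALID;         /* M -> I (and S -> I) on a remote write */
}
```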