When accessing memory, will the page table accessed/dirty bit be set under a cache hit situation?

As far as I know, a CPU memory access involves both the CPU cache and the MMU. The CPU first tries to find its target in the cache; if a cache miss happens, it turns to the MMU. During an access through the MMU, the accessed/dirty bits of the corresponding page table entry are set by hardware.
However, to the best of my knowledge, most CPU designs won't involve the MMU unless there's a cache miss. So my question is: will the accessed/dirty bit of the page table entry still be set on a cache hit? Or is this architecture-dependent?

I think you can assume these bits are cached in the TLB, and that if there is any inconsistency between the values in the TLB and the accesses done by the core, a microcode assist will be taken and the bits will be updated. For example, if the A¹ or D bits are zero and an access or store happens, this condition will be detected and the appropriate bits will be set.
You can also assume that the fast path for TLB hits doesn't go out to memory to check whether the cached TLB bits are consistent with the PTEs in RAM. Furthermore, on x86, changes to PTEs are not pushed, cache-invalidation style, to the TLBs by hardware; that is, the TLB is not coherent.
This implies that if the bits are out of sync in certain ways, they will probably not be updated correctly. E.g., if the A (resp. D) bit is set in the TLB, and an access (resp. store) occurs, nothing will happen, even if the A (resp. D) bit is actually unset in the PTE. The entity making changes to the bits is responsible for flushing TLBs so that the bits are correctly updated in the future.
¹ Having a TLB entry with A == 0 is weird: you'd expect the entry to be there as a result of an access, and hence to have the A bit set from the start. Perhaps there are some scenarios where this might occur, such as a page brought in by a speculative access or prefetch.
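To make the above concrete, here is a minimal C sketch of that per-access decision (the structure, field names, and the on_access helper are all hypothetical; a real CPU does this in hardware/microcode, not software):

    #include <stdint.h>

    #define PTE_A (1u << 5)   /* "accessed" flag; bit 5 in an x86 PTE */
    #define PTE_D (1u << 6)   /* "dirty" flag; bit 6 in an x86 PTE */

    /* Hypothetical model of one cached translation. */
    typedef struct {
        uint64_t *pte;   /* pointer to the in-memory PTE this entry caches */
        int a, d;        /* cached copies of the accessed/dirty flags */
    } tlb_entry;

    void on_access(tlb_entry *e, int is_store)
    {
        /* Fast path: the cached flags already cover this access, so no
           memory traffic occurs -- even if the in-memory PTE disagrees. */
        if (e->a && (e->d || !is_store))
            return;

        /* Slow path (the "assist"): set A, and D for stores, in the
           in-memory PTE; real hardware uses a locked read-modify-write. */
        *e->pte |= PTE_A | (is_store ? PTE_D : 0u);
        e->a = 1;
        if (is_store)
            e->d = 1;
    }

The point of the fast path is the early return: once the cached flags cover the access, the in-memory PTE is never re-checked.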

Most caches are virtually indexed and physically tagged (VIPT), for faster access. The CPU issues the virtual address, and the index bits of the address are used to locate the cache entry. During this time the address is also sent to the TLB to get the physical address. By the time the cache has located the entry, the TLB has returned the physical address, which is then used for the tag comparison. Now two things can happen:
1. The TLB does not have the entry (TLB miss).
2. The cache tag does not match (cache miss).
In case 1, you need to access the page table entry (PTE) to get the correct physical address.
In case 2, if the TLB returned a valid mapping, you just need to fetch the data. If the TLB also missed (i.e., both 1 and 2), then you need to get the physical address from the PTE and then fetch the data.
So to answer your question: in the case of a hit, the PTE doesn't need to be consulted at all.
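Here is a minimal C sketch of that flow (a direct-mapped cache and a single-entry TLB, both hypothetical simplifications; in hardware steps 1 and 2 genuinely run in parallel):

    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_BITS  6                  /* 64-byte lines (hypothetical) */
    #define INDEX_BITS 6                  /* 64 sets -> index + offset = 12 bits */
    #define NSETS      (1u << INDEX_BITS)

    typedef struct { bool valid; uint64_t ptag; /* + data */ } cache_line;
    static cache_line cache[NSETS];

    /* Hypothetical single-entry TLB, enough for the sketch. */
    static struct { bool valid; uint64_t vpage, ppage; } tlb;

    static bool tlb_lookup(uint64_t vaddr, uint64_t *paddr)
    {
        if (tlb.valid && tlb.vpage == vaddr >> 12) {
            *paddr = (tlb.ppage << 12) | (vaddr & 0xfff);
            return true;
        }
        return false;
    }

    bool load_hits(uint64_t vaddr)
    {
        /* 1. Index the cache with (virtual) address bits... */
        cache_line *line = &cache[(vaddr >> LINE_BITS) & (NSETS - 1)];

        /* 2. ...while the TLB translates the address. */
        uint64_t paddr;
        if (!tlb_lookup(vaddr, &paddr))
            return false;   /* case 1: TLB miss -> walk the page tables */

        /* 3. Compare the physical tag supplied by the TLB. */
        if (!line->valid || line->ptag != paddr >> (LINE_BITS + INDEX_BITS))
            return false;   /* case 2: cache miss -> fetch from memory */

        return true;        /* hit: the PTE is never consulted */
    }

Note that with 6 index bits and 6 offset bits the index falls entirely within the 4 KiB page offset, which is what lets the virtual index be used before translation completes.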

You usually can't have a cache hit if the page was never accessed in the first place, so that part of the question is moot. (Edit: come to think of it, it may be possible in some bizarre cases of page aliasing, but the same answer as for the dirty bit applies there.)
It is possible to have a cached line from a clean page (one never written to). It's a little uncommon, since you usually need to initialize data before accessing it, but the page could have been swapped out previously and then reinstalled into the page map (the exact behavior is OS-dependent, but it is possible).
In that case, the line is cached (let's say exclusively) and you write to it. The CPU accesses the cache and the TLB in parallel, attempting to look up the line in the cache while also doing a TLB access to verify the full physical address, assuming your system is virtually indexed, physically tagged, as most CPUs are these days. The TLB lookup may complete either through a TLB hit, or through a miss followed by a page walk that installs a TLB entry from the actual page map in memory.
The cache access cannot complete until the TLB access (and page walk if necessary) is done, at which point you will know the value of the access/dirty bits.
If you are trying to write to a page without the dirty bit set (or access a page without the accessed bit set), you will get a fault or a hardware assist (which of the two is architecture-dependent), triggering an update of those bits in the page table. The OS may choose to do various optimizations at this point, but it will eventually result in these bits being corrected.

Related

How do page replacement policies work on systems with a TLB?

If we have a system with a TLB, how is page replacement done? Does the computer directly replace an entry in the TLB and evict it from memory, or do we somehow keep the entry consistent between the TLB and the page table? If we synchronize the PTEs somehow, do we prevent PTEs that are currently in the TLB from being evicted even if the reference bit is 0?

Different Page Sizes for Processes

As part of the virtual-to-physical address conversion, a table of mappings between virtual and physical addresses is stored for each process. If a process is scheduled next, the content of its page table is loaded into the MMU.
1) Where is the page table for each process stored? As part of the process control block?
2) Does the page table contain entries for non-allocated memory, so that a segfault can be detected (more easily)?
3) Is it possible (and used in any known relevant OS) for one process to have multiple page frame sizes? Especially if question 2 is answered yes, it would be very convenient to map huge pages to non-existing memory to keep the page table as small as possible, while still allowing high precision in mapping smaller frames to the memory to keep external (and internal) fragmentation as small as possible. This of course requires an extra field storing the frame size for each entry. Please point out the reason(s) if my "idea" cannot exist.
1) They could be, but most OSes have a notion of an address space to which a process is attached. The address space typically contains a description of the sorts of mappings that have been established, and pointers to the page structure(s). If you consider the operation of exec(2), at a certain level of abstraction it merely involves creating a new address space, populating it, and then attaching the process to it. Once the operation is known to succeed, the old address space can simply be discarded.
2) It depends upon the MMU architecture of the machine. In a forward-mapped arrangement (x86, armv[78]), the page tables form a sort of tree structure, but instead of having the conventional 2 or 3 items per node, there are hundreds or thousands of them. x86-classic has a 2-level structure, where each of the 1024 entries in the first level points to a page table which covers 2^22 bytes (4 MiB) of address space. Invalid entries, at either the inner or the leaf level, can represent unmapped space; so in x86-classic, if you have a very small address space, you only need a root table and a single leaf-level table.
3) Yes, multiple page sizes have been supported by most OSes since the early 2000s. Again, in forward-mapped arrangements, each level of the tree can be replaced by a single large page covering the same address span as that table level. x86-classic had only one size; later editions supported many more.
3a) There is no need to use large pages to do this; simply having an invalid page table entry is sufficient. In x86-classic, the least significant bit of the page table/directory entry indicates the validity of the entry.
Your idea exists.
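To make the x86-classic structure concrete, here is a minimal C sketch of that 2-level walk (the phys_read32 helper and the toy memory array are stand-ins so the sketch is self-contained; a real walk reads physical RAM):

    #include <stdint.h>

    #define P_BIT 1u   /* least significant bit = present/valid */

    /* Toy physical memory so the sketch is self-contained. */
    static uint32_t mem[1u << 20];
    static uint32_t phys_read32(uint32_t paddr) { return mem[paddr / 4]; }

    /* x86-classic walk: 10-bit directory index | 10-bit table index |
       12-bit offset. Returns 0 on an invalid entry (the CPU would
       raise a fault instead). */
    uint32_t translate(uint32_t cr3, uint32_t vaddr)
    {
        uint32_t pde = phys_read32((cr3 & ~0xfffu) + ((vaddr >> 22) & 0x3ffu) * 4);
        if (!(pde & P_BIT))
            return 0;   /* whole 4 MiB region unmapped: no leaf table needed */

        uint32_t pte = phys_read32((pde & ~0xfffu) + ((vaddr >> 12) & 0x3ffu) * 4);
        if (!(pte & P_BIT))
            return 0;   /* this 4 KiB page unmapped */

        return (pte & ~0xfffu) | (vaddr & 0xfffu);
    }

The early return at the directory level is exactly the "small address space needs only a root table and one leaf table" property described above.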
1) Where is the page table for each process stored? As part of the process control block?
Usually it's not "a page table". For some CPUs there are only TLB entries (Translation Lookaside Buffer entries, like a cache of what the translations are), where software has to handle a "TLB miss" by loading whatever it feels like into the TLB itself, and where the OS might not use tables at all (e.g. it could use a "list of arbitrary-length zones"). For some CPUs it's a hierarchy of multiple levels (e.g. for modern 64-bit 80x86 there are 4 levels); in this case some of the levels may be in physical memory, some may be in swap space or somewhere else, and some may be generated as needed from other data (a little like it would have been for "software handling of TLB miss"). In any case, if each process has its own virtual address space (i.e. it's not some kind of "single address space shared by many processes" scheme), it's likely that the process control block (directly or indirectly) contains a reference to whatever the OS uses (maybe a single "physical address of the highest-level page table", maybe a virtual address of a "list of arbitrary-length zones", maybe anything else).
2) Does the page table contain entries for not allocated memory so a segfault can be detected (more easily)?
If there are page tables then there must be a way to indicate "page not present", where "page not present" may mean that the memory isn't allocated, but could also mean that the (virtual) memory was allocated but the entry for it hasn't been set yet (either because the OS is generating the tables on demand, or because the actual data is in swap space, or ...).
3) Is it possible (and used in any known relevant OS) that one process does have multiple page frame sizes?
Yes. It's relatively common on 64-bit 80x86, where there are 4 KiB pages and 2 MiB (or 4 MiB) "large pages" (plus maybe 1 GiB "huge pages"); this is done to reduce the chance of TLB misses (while also reducing the memory consumed by page tables). Note that this is mostly an artifact of having multiple levels of page tables: an entry in a higher-level table can say "this entry is a large page" or it can say "this entry points to a lower-level page table that might contain smaller pages". So in this case it's not "multiple page sizes in the same table", but "a fixed page size for each level".
Especially if question 2 is answered yes, it would be very convenient to map huge pages to non-existing memory to keep the page table as small as possible, while still allowing high precision in mapping smaller frames to the memory to keep external (and internal) fragmentation as small as possible. This of course requires an extra field storing the frame size for each entry. Please point out the reason(s) if my "idea" cannot exist.
Converting a virtual address into a physical address (or some kind of fault to indicate that the translation doesn't exist) needs to be very fast, because it happens extremely often. When you have "a fixed page size for each level", you can extract some bits of the virtual address and use them directly as the index into the table at each level, which is fast.
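As a sketch, here is that index extraction for 64-bit 80x86 4-level paging (the macro and function names are mine, not any official interface):

    #include <stdint.h>

    /* 64-bit 80x86, 4-level paging: 9 index bits per level on top of a
       12-bit offset within a 4 KiB page. Level 3 = PML4, 2 = PDPT,
       1 = PD, 0 = PT. */
    #define IDX(vaddr, level) (((vaddr) >> (12 + 9 * (level))) & 0x1ffu)

    uint64_t pml4_index(uint64_t v) { return IDX(v, 3); }
    uint64_t pdpt_index(uint64_t v) { return IDX(v, 2); }
    uint64_t pd_index(uint64_t v)   { return IDX(v, 1); }
    uint64_t pt_index(uint64_t v)   { return IDX(v, 0); }

A "large page" entry at the PD level simply terminates the walk two levels early, yielding a 2 MiB translation, without changing how any of these indexes are computed.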
When you have "multiple page sizes in the same table" there are two options. The first option is to duplicate entries in the page table so that you can still extract some bits of the virtual address and use them as the index into the table; apart from minor differences in the way TLBs are managed (e.g. auto-detecting adjacent translations vs. being told explicitly), this is effectively identical to not bothering at all. There are some CPUs (ARM, I think) that do this.
The other alternative is searching multiple entries in the page table to find the right one, where the cost of searching reduces performance. I don't know of any CPU that supports this; performance is too important.

Number of memory accesses with Demand Paging

I have been studying operating systems concepts; the book I am referring to is Operating System Concepts by Peter B. Galvin, Greg Gagne and Abraham Silberschatz.
In the chapter on virtual memory, the book starts to talk about paging and the number of memory accesses the system would need to read data stored in a particular frame, given a logical address. The author states that when the page table is kept in main memory, the system needs two memory accesses to read data stored in a frame: the first access reads the correct frame number from the page table, and the second reads the byte/word from that frame.
A few sections later, the book talks about demand paging and page faults. The author states that in the case of no page fault, one memory access is needed, and that in the case of a page fault we must consider the page fault service time (which comprises swap-in time, swap-out time, one memory access, etc.), and presents readers with the formula
Effective Access Time = (1-p) x one memory access time + p x page fault service time
where p = page fault rate
I cannot wrap my head around why the author suggests that, in the case of no page fault, only one memory access is needed. Applying the line of thought used with the standard paging scheme introduced earlier by the same author(s), we should need one memory access to read the page table and another to read the data from the frame.
Is it because we are talking about the time frame after the access to the page table has already been made? Then why doesn't the same standard of calculation apply to the standard version of paging?
Note: I haven't read/seen this book.
For educational material: if the author described reality accurately, with all the details, the reader would just get confused and wouldn't be able to learn. To work around that, authors simplify (omit details and ignore reality) while introducing different concepts, so that the reader is able to learn each concept one at a time while building up the knowledge needed to comprehend the complexity of reality.
The problem is that different simplifications make sense at different stages, and authors are human (imperfect), so sometimes the simplifications that were beneficial at one point (in one chapter) conflict with simplifications that are beneficial at a later point (in a different chapter).
For an example, I might (initially) tell someone "each access to virtual memory involves a second memory fetch from RAM to determine the translation" to help them understand how page tables work and that there are (potential) performance problems involved (twice as many memory accesses). Then I might introduce the concept of translation lookaside buffers (after the reader understands how page tables work and knows about the problem that TLBs are designed to solve). Then I might explain that real systems often have multiple levels of page tables (e.g. on 64-bit 80x86 it's four levels, potentially involving 4 memory accesses to determine a translation) and that there might be higher-level caches/buffers involved (not just TLBs that cache final translations). In this case, my original statement ("each access to virtual memory involves a second memory fetch from RAM to determine the translation") is a deliberate lie (a simplification) to avoid the complexity of a statement like "each access to virtual memory may or may not involve one or more additional fetches from some or all levels of page tables" (which is too confusing for beginners initially, because it creates lots of questions that they don't have answers to yet).
I cannot wrap my head around why the author suggests that, in case of no page fault, only one memory access will be needed.
One reality (for one real 80x86 CPU in long mode, but not all 80x86 CPUs in long mode and no 80x86 in other modes, and assuming virtualisation is not being used) is the following, for a read from virtual memory that does not lead to a page fault, where the access is not misaligned/split across a page boundary (in which case the CPU would have to do it all twice, fetching bytes from 2 different pages and merging them):
* if the translation is not in the TLB, then:
    * if the area is not in the "page directory cache":
        * fetch the PML4 entry to determine the address of the PDPT (try L1 cache, then L2 cache, then L3 cache, then RAM)
        * do access checks based on flags in the PML4 entry
        * fetch the PDPT entry to determine the address of the PD (try L1 cache, then L2 cache, then L3 cache, then RAM)
        * do access checks based on flags in the PDPT entry
        * insert data into the "page directory cache"
    * if the area is in the "page directory cache":
        * do access checks based on flags in the "page directory cache" entry
    * fetch the PD entry to determine the address of the PT (try L1 cache, then L2 cache, then L3 cache, then RAM)
    * do access checks based on flags in the PD entry
    * fetch the PT entry to determine the address of the page (try L1 cache, then L2 cache, then L3 cache, then RAM)
    * do access checks based on flags in the PT entry
    * insert data into the TLB (including setting the "accessed" flag in the page table entry)
* if the translation is in the TLB:
    * do access checks based on flags in the TLB entry
* do the "physical address = physical address of page + offset in page" calculation
* read the data for the physical address (try L1 cache, then L2 cache, then L3 cache, then RAM)
For this reality (with the restrictions mentioned), the number of fetches from RAM can be anything from zero to 5.
Can you see why the author (while trying to explain page faults and not trying to explain translation costs) might want to avoid showing something like this and might simplify (by assuming that only one fetch is needed because the translation is in the TLB) instead?
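For what it's worth, here is the book's simplified formula evaluated with illustrative numbers (100 ns per memory access, 8 ms fault service time, a fault rate of 1 in 1,000; none of these come from the book):

    #include <stdio.h>

    int main(void)
    {
        double mem_ns   = 100.0;       /* one memory access: 100 ns */
        double fault_ns = 8e6;         /* page fault service time: 8 ms */
        double p        = 0.001;       /* page fault rate */

        /* Effective Access Time = (1-p) x memory access + p x fault service */
        double eat = (1.0 - p) * mem_ns + p * fault_ns;
        printf("EAT = %.1f ns\n", eat);   /* prints: EAT = 8099.9 ns */
        return 0;
    }

Even at that low fault rate, the fault term dominates (8000 ns of the total), which is the point the formula is meant to teach.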
The fundamental source of your problem is that you are reading a book that is only fit for lining a cat box. What you are describing is nonsensical gibberish that textbooks use to create confusion among students. This is not a case of oversimplification, because the authors apparently throw in a nonsensical formula for access times.
A formula like this
Effective Access Time = (1-p) x one memory access time + p x page fault service time
is total bovine fecal waste matter with no basis in reality.
The author states that when Page Table is present in Main Memory, system would need two memory accesses to read data stored in a frame.
The processor has to translate logical addresses to physical addresses using the page tables. Assuming that there is no caching in the CPU, the CPU has to read the page table for each memory access.
The number of reads depends upon the page table format used by the CPU.
Suppose your process has a multi-level page table. In that case the CPU has to make a read for each level of the table.
If you have a CPU that has separate linear system and user page tables, with the user tables in logical addresses, each access to the system space requires one memory read and each access to the user space requires at least two memory reads, and might in fact trigger a page fault. The first read is to the system page table, to find the user page table entry. The second read is to the user page table. The third is to the data itself.
In reality, every CPU on the planet caches page table entries, so separate reads are not required (all the time).
I cannot wrap my head around why the author suggests that, in case of no page fault, only one memory access will be needed.
It sounds like the book is not being consistent in its BS.
The reality is that logical memory translation requires a number of steps, and what those steps are depends upon the state of the processor, which is unpredictable. These steps take place transparently behind the scenes, and you do not need to grasp all of them to understand operating systems.
What you need to know in the real world is that the CPU translates logical addresses to physical addresses. If the CPU is unable to make that translation, it triggers a page fault.

Is the TLB used at all in the instruction fetching pipeline?

Is a TLB used at all in the instruction fetching pipeline?
Is this architecture- or microarchitecture-dependent?
Typically, a processor that supports paging (which typically includes a mechanism for denying execute permission, even if not separately from read permission) will access a TLB as part of instruction fetch.
A virtually tagged instruction cache would not require such a lookup, even for permission checks, as long as: permissions were checked when a block is inserted into the instruction cache (which would typically involve a TLB access, though a permission cache could be used with a virtually tagged L2 cache; this includes prefetches into the instruction cache); the permission domain was included with the virtual tag (typically the same as an address space identifier, which is useful anyway to avoid cache flushing); and system software ensured that blocks were removed when execute permission was revoked (or when the permission domain/address space identifier was reused for a different permission domain/address space).
(In general, virtually tagged caches do not need a translation lookaside buffer; a cache of permission mappings is sufficient, or permissions can be cached with the tag along with an indication of the permission domain. Before accessing memory a TLB would be used, but cache hits would not require translation. Permission caching is less expensive than translation caching, both because the granularity can be larger and because fewer bits are needed to express permission information.)
A physically tagged instruction cache would require address translation for hit determination, but this can be delayed significantly by speculating that the access was a hit (likely using way prediction). Hit determination can be delayed even to the time of instruction commit/result writeback, though earlier handling is typically better.
Because instruction accesses typically have substantial spatial locality, a very small TLB can provide decent hit rates and a reasonably fast, larger back-up TLB can reduce miss costs. Such a microTLB can facilitate sharing a TLB between data and instruction accesses by filtering out most instruction accesses.
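A minimal C sketch of such a micro-TLB in front of a larger TLB (the sizes, names, and the single-entry backing stub are all hypothetical):

    #include <stdbool.h>
    #include <stdint.h>

    #define UTLB_ENTRIES 4   /* tiny and fully associative (hypothetical size) */

    typedef struct { bool valid; uint64_t vpage, ppage; } tlb_ent;
    static tlb_ent utlb[UTLB_ENTRIES];

    /* Stub for the sketch: a single-entry "main TLB"; a real one would
       be larger and fall back to a page walk on a miss. */
    static tlb_ent main_tlb;
    static bool main_tlb_lookup(uint64_t vpage, uint64_t *ppage)
    {
        if (main_tlb.valid && main_tlb.vpage == vpage) {
            *ppage = main_tlb.ppage;
            return true;
        }
        return false;
    }

    bool itlb_translate(uint64_t vaddr, uint64_t *paddr)
    {
        uint64_t vpage = vaddr >> 12;

        /* Most fetches hit here, thanks to spatial locality in code. */
        for (int i = 0; i < UTLB_ENTRIES; i++) {
            if (utlb[i].valid && utlb[i].vpage == vpage) {
                *paddr = (utlb[i].ppage << 12) | (vaddr & 0xfff);
                return true;
            }
        }

        /* Miss: consult the larger backing TLB, then refill the micro-TLB
           (replacement policy omitted; slot 0 reused for brevity). */
        uint64_t ppage;
        if (!main_tlb_lookup(vpage, &ppage))
            return false;   /* a page walk / fault would be handled here */

        utlb[0] = (tlb_ent){ true, vpage, ppage };
        *paddr = (ppage << 12) | (vaddr & 0xfff);
        return true;
    }

Because the micro-TLB filters out most instruction-side lookups, the larger backing TLB can be shared with data accesses without becoming a bottleneck.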
Obviously, an architecture that does not support paging would not use a TLB (though it might use a memory protection unit to check that an access is permitted or use a different translation mechanism such as adding an offset possibly with a bounds check). An architecture oriented toward single address space operating systems would probably use virtually tagged caches and so access a TLB only on cache misses.

Is it true that the CPU never fetches anything from memory directly?

I hear that the CPU just fetches instructions from the EIP register and never fetches from memory directly.
But AFAIK, EIP just stores the address of the next instruction; the instruction itself is still in memory. If the CPU never fetches from memory, how can it know what the next instruction actually is?
UPDATE
BTW, I know there are x86, x64 and x87 architectures, but which does x86-64 belong to, x86 or x64?
The simple answer to your question is "no, it's not true".
The picture isn't very simple due to caching, instruction pipelining, branch prediction, etc. However, the instruction pointer is just that, a pointer. It doesn't store opcodes.
EIP (Extended Instruction Pointer) holds the address of the instruction. It's just a way of keeping track of which instruction is currently being processed (or, sometimes, which instruction to process next).
The instructions themselves are stored in memory (disk, RAM, cache) and need to be fetched by the CPU.
Maybe what you heard meant that, since so many levels of caches are generally used, it's quite rare that a fetch needs to access RAM.
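To see why an instruction pointer can't itself hold instructions, here is a toy fetch-decode-execute loop in C (a deliberately simplified model, not how real silicon is wired):

    #include <stdint.h>
    #include <stdio.h>

    /* Toy machine with one-byte opcodes (values borrowed from x86). */
    enum { OP_NOP = 0x90, OP_HLT = 0xF4 };

    int main(void)
    {
        uint8_t memory[] = { OP_NOP, OP_NOP, OP_HLT };
        size_t ip = 0;   /* the "EIP": just an address/index, not an instruction */

        for (;;) {
            uint8_t opcode = memory[ip++];   /* the fetch: a read FROM memory */
            if (opcode == OP_HLT)
                break;
            /* decode/execute would go here; NOP does nothing */
        }
        printf("halted, ip = %zu\n", ip);
        return 0;
    }

The register only ever holds a location; the bytes that say what to do always come out of memory (via the caches, on real hardware).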
Well, I don't know the point of your question.
Yes, the CPU (in a broad sense of the word) does fetch from memory. It has a number of memory management devices (for cache line handling and pipelining). In fact, the pipeline puts the instructions into the L1 cache, and the instruction processor itself only fetches from there. In reality the processor probably never even looks at EIP (unless an instruction uses it directly as an operand).
So the real answer would be: find yourself a Wikipedia article on x86 processor design, and have a ball. You'll be able to see exactly what happens where.
Cheers
Not true in that way. The CPU accesses memory through the cache, so you could kind of say that it does not do it directly. (Also, a DMA channel can transfer data between memory and I/O without ever touching the CPU.)
Yes, CS:EIP points into memory, to the next instruction to execute, but you can also use direct addresses, for example (load the content of address 0x0800 into the AX register; by default this is relative to the DS segment):
MOV AX,[0x0800]
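In C-like terms that MOV is roughly the following (illustrative only; ds_base stands in for the DS segment base, and a real load also goes through segmentation and paging as discussed above):

    #include <stdint.h>

    uint16_t ax;

    /* ds_base stands in for the DS segment base (hypothetical parameter). */
    void demo(const uint8_t *ds_base)
    {
        /* MOV AX,[0x0800]: load the 16-bit word at DS:0x0800 into AX. */
        ax = *(const uint16_t *)(ds_base + 0x0800);
    }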