Different Page Sizes for Processes

Different Page Sizes for Processes - operating-system

As part of the virtual to physical address conversion, for each process a table of mappings between virtual to physical addresses is stored. If a process is scheduled next the content of the page table is loaded into the MMU.
1) Where is the page table for each process stored? As part of the process control block?
2) Does the page table contain entries for not allocated memory so a segfault can be detected (more easily)?
3) Is it possible (and used in any known relevant OS) that one process does have multiple page frame sizes? Especially if question 2 is true it is very convenient to map huge page tables to non existing memory to keep the page table as small as possible. It will still allow high precision in mapping smaller frames to the memory to keep external (and internal) fragmentation as small as possible? This of course requires an extra field storing the frame size for each entry. Please point out the reason(s) if my "idea" cannot exist.

1) They could be, but most OS's have a notion of an address space which a process is attached to. The address space typically contains a description of the sorts of mappings that have been established, and pointers to the page structure(s). If you consider the operation of exec(2), at a certain level of abstraction it merely involves creating a new address space, populating it, then attaching the process to it. Once the operation is known to succeed, the old address space can simply be discarded.
2) It depends upon the mmu architecture of the machine. In a forward mapped arrangement (x86, armv[78]), the page tables form a sort of tree structure, but instead of having the conventional 2 or 3 items per node, there are hundreds or thousands of them. The x86-classic has a 2 level structure, where each of the 1024 entries in the first level points to a pagetable which covers 2^20 bytes of address space. Invalid entries, either at the inner or leaf level, can represent unmapped space; so in x86-classic, if you have a very small address space, you only need a root table, and a single leaf level table.
3) Yes, multiple page size has been supported by most OSes since the early 2000s. Again, in forward mapped ones, each of the levels of the tree can be replaced by a single large page for the same address space as that table level. x86-classic only had one size; later editions supported many more.
3a) There is no need to use large pages to do this -- simply having an invalid page table is sufficient. In x86-classic, the least significant bit of the page table/descriptor entry indicates the validity of the entry.
Your idea exists.

1) Where is the page table for each process stored? As part of the process control block?
Usually it's not "a page table". For some CPUs there's only TLB entries (Translation Lookaside Buffer entries - like a cache of what the translations are) where software has to handle "TLB miss" by loading whatever it feels like into the TLB itself, and where the OS might not use tables at all (e.g. could use "list of arbitrary length zones"). For some CPUs it's a hierarchy of multiple levels (e.g. for modern 64-bit 80x86 there's 4 levels); and in this case some of the levels may be in physical memory and some may be in swap space or somewhere else and some may be generated as needed from other data (a little bit like it would've been for "software handling of TLB miss"). In any case, if each process has its own virtual address space (e.g. and it's not some kind of "single-address space shared by many processes" scheme) its likely that the process control block (directly or indirectly) contains a reference to whatever the OS uses (e.g. maybe a single "physical address for the highest level page table", but maybe a virtual address of a "list of arbitrary length zones" and maybe anything else).
2) Does the page table contain entries for not allocated memory so a segfault can be detected (more easily)?
If there are page tables then there must be a way to indicate "page not present", where "page not present" may mean that the memory isn't allocated but could also mean that the (virtual) memory was allocated but the entry for it hasn't been set (either because OS is generating the tables on demand, or because the actual data is in swap space, or...).
3) Is it possible (and used in any known relevant OS) that one process does have multiple page frame sizes?
Yes. It's relatively common for 64-bit 80x86 where there's 4 KiB pages, 2 MiB (or 4 MiB) "large pages" (plus maybe 1 GiB "huge pages"); and done to reduce the chance of TLB misses (while also reducing memory consumed by page tables). Note that this is mostly an artifact of having multiple levels of page tables - an entry in a higher level table can say "this entry is a large page" or it can say "this entry is a lower level page table that might contain smaller pages". Note that in this case it's not "multiple page sizes in the same table", but is "fixed page size for each level".
Especially if question 2 is true it is very convenient to map huge page tables to non existing memory to keep the page table as small as possible. It will still allow high precision in mapping smaller frames to the memory to keep external (and internal) fragmentation as small as possible? This of course requires an extra field storing the frame size for each entry. Please point out the reason(s) if my "idea" cannot exist.
Converting a virtual address into a physical address (or some kind of fault to indicate the translation doesn't exist) needs to be very fast (because it happens extremely often). When you have "fixed page size for each level" it means you can extract some bits of the virtual address and use them as the index into the table; which is fast.
When you have "multiple page sizes in the same table" there's 2 options. The first option is to duplicate entries in the page table so that you can still extract some bits of the virtual address and use them as the index into the table; which (apart from minor differences in the way TLBs are managed - e.g. auto-detecting adjacent translations vs. being manually told) is effectively identical to not bothering at all; but there are some CPUs (ARM I think) that do this.
The other alternative is searching multiple entries in the page table to find the right entry, where the cost of searching reduces performance. I don't know of any CPU that supports this - performance is too important.

Related

Number of memory access with Demand Paging

I have been studying Operating Systems Concepts and the book I am referring to is Operating System Concepts by Peter B. Galvin, Greg Gagne and Abraham Silberschatz.
In the chapter of Virtual Memory, book starts to talk about Paging and number of memory access it would require for the system to read data stored in a particular frame in memory given a logical address. The author states that when Page Table is present in Main Memory, system would need two memory accesses to read data stored in a frame. The first access is made to the page table to read the correct frame number and the next access is for reading the byte/word from the frame.
After a few sections, the book talks about Demand Paging and page fault. Author state that in case of no page fault, one memory access is needed and in case of a page fault, we will consider Page Fault Service time (which comprises of swap in time, swap out time, one memory access etc.) and presents readers with the formula
Effective Access Time = (1-p) x one memory access time + p x page fault service time
where p = page fault rate
I cannot wrap my head around why the author suggests that, in case of no page fault, only one memory access will be needed. Applying the line of thought used with standard paging scheme earlier introduced by same author(s), we should need one memory access to read page table and another to read the data from frame.
Is it because we are talking about the time frame after the access to page table is made? Then why the same standard of calculation not applies to standard version of paging?

Note: I haven't read/seen this book.
For educational material; if the author describes reality accurately with all the details the reader will just get confused and won't be able to learn. To work around that, authors simplify (omit details and ignore reality) while introducing different concepts, so that the reader is able to learn each concept one at a time while building up the knowledge needed to comprehend the complexity of reality.
The problem is that different simplifications make sense at different stages, and authors are human (imperfect), so sometimes the simplifications that were beneficial at one point (in one chapter) conflict with simplifications that are beneficial at a later point (in a different chapter).
For an example, I might (initially) tell someone "each access from virtual memory involves a second memory fetch from RAM to determine the translation" to help them understand how page tables work and that there's (potential) performance problems involved (twice as many memory accesses). Then I might introduce the concept of "translation look-aside buffers" (after the reader understands the how page tables work and knows about the problem that TLBs are designed to solve). Then I might explain that often real systems have multiple levels of page tables (e.g. on 64-bit 80x86 it's four levels, potentially involving 4 memory accesses to determine a translation) and that there might be higher level caches/buffers involved (and not just TLBs that cache final translations). In this case, my original statement ("each access from virtual memory involves a second memory fetch from RAM to determine the translation") is a deliberate lie (a simplification) to avoid the complexity of a statement like "each access from virtual memory may or may not involve one or more additional fetches from some or all levels of page tables" (which is too confusing for beginners initially, because it creates lots of questions that they don't have answers to yet).
I cannot wrap my head around why the author suggests that, in case of no page fault, only one memory access will be needed.
One reality is (for one real 80x86 CPU in long mode but not all 80x86 CPUs in long mode and not any 80x86 in other modes, if virtualisation is not being used), for a read from virtual memory that does not lead to a page fault, if the access is not misaligned/split across page boundaries (where CPU would have to do it all twice to fetch bytes from 2 different pages and merge the bytes):
* if the translation is not in the TLB, then:
* if the area is not in the "page directory cache"
* fetch the PML4 entry to determine address of PDPT (try L1 cache, then L2 cache, then L3 cache, then RAM)
* do access checks based on flags in PML4 entry
* fetch the PDPT entry to determine address of PD (try L1 cache, then L2 cache, then L3 cache, then RAM)
* do access checks based on flags in PDPT entry
* insert data into "page directory cache"
* if the area is in the "page directory cache"
* do access checks based on flags in "page directory cache entry"
* fetch the PD entry to determine address of PT (try L1 cache, then L2 cache, then L3 cache, then RAM)
* do access checks based on flags in PD entry
* fetch the PT entry to determine address of page (try L1 cache, then L2 cache, then L3 cache, then RAM)
* do access checks based on flags in PT entry
* insert data into TLB (including setting the "accessed" flag in the page table entry)
* if the translation is in the TLB, then:
* do access checks based on flags in "TLB entry"
* do the "physical address = physical address of page + offset in page" calculation
* read the data for the physical address (try L1 cache, then L2 cache, then L3 cache, then RAM)
For this reality (with the restrictions mentioned); the number of fetches from RAM can be anything from zero to 5.
Can you see why the author (while trying to explain page faults and not trying to explain translation costs) might want to avoid showing something like this and might simplify (by assuming that only one fetch is needed because the translation is in the TLB) instead?

The fundamental source of your problem is that you are reading a book that is only fit for lining a cat box. What you are describing is nonsensical gibberish that textbooks use to create confusion among students. This is not a case of over simplification because the authors apparently throw in a nonsensical formula for access times.
A formula like this
Effective Access Time = (1-p) x one memory access time + p x page fault service time
is total bovine fecal waste matter with no basis in reality.
The author states that when Page Table is present in Main Memory, system would need two memory accesses to read data stored in a frame.
The processor has to translate logical addresses to physical addresses using the page tables. Assuming that there is no caching in the CPU, the CPU has read the page table for each memory access.
The number reads depends upon the page table format used by the CPU.
Let's suppose your process has a multi-level page table. In that case the CPU has to make a read for each level of the table.
If you have a CPU that has separate linear system and user page tables, with the user tables in logical addresses, each access to the system space requires one memory read and each access to the user space requires at least two memory accesses and might, in fact, trigger a page fault. The first read is to system page table to find the user page table entry. The second read is to the user page table. The third is to the data.
In reality, every CPU on the planet does page table caching so separate reads are not required (all the time).
I cannot wrap my head around why the author suggests that, in case of no page fault, only one memory access will be needed.
It sounds like the book is not being consistent in its BS.
The reality is that logical memory translation requires a number of steps. However, what those steps are depends upon the state of the processor, something that is unpredictable. These steps take place transparently behind the scenes and you do not even need to grasp all of them to understand operating systems.
What you need to know in the real world is that the CPU translates logical addresses to physical addresses. If the CPU is unable to make that translation, it triggers a page fault.

Can the page table be paged out?

According to my understanding it shouldn't be, since it's in kernel space and kernel space is non pageable. But with 64 bit address space I don't see how it can hold the full page table since it would be prohibitively large. Any ideas on how this achieved?
Also I guess even holding it on disk fully would take a lot of space. Since most of the VM space would be unused is there way to limit the page table to contain only used VM address ranges?

The page table is actually a tree: it consists of multiple child tables. The head (root) table stores pointers to child tables, the child tables may also store the pointers to their child tables, and so on (the last table in chain stores the actual page table entry, of course). As the most memory in 64-bit address space is unused, it is not necessary to actually allocate memory for all the tables. The root table just sets most of its pointers to null.
On x86_64, there are 4-5 levels of such indirection.

The page table only needs to hold some subset of the pages that are available. There is no rule that it be complete. When an attempt is made to access a virtual address not mapped by the page table, the kernel is invoked. It can then put that mapping into the page table, removing other mappings that haven't been used recently if desired.
The OS is free to keep every possible mapping that a process might access in the page tables if it wants. Alternatively, it can keep only those recently used if it would prefer that arrangement.

Yes.
Quote from wiki - page table
It was mentioned that creating a page table structure that contained mappings for every virtual page in the virtual address space could end up being wasteful. But, we can get around the excessive space concerns by putting the page table in virtual memory, and letting the virtual memory system manage the memory for the page table.
However, part of this linear page table structure must always stay resident in physical memory, in order to prevent against circular page faults, that look for a key part of the page table that is not present in the page table, which is not present in the page table, etc.

Hierachical Per-Process Page Tables: why don't we use simple linear array?

I would like to know why we need hierachical page tables in OS that handle per-process page tables, using PTBR and PTLR registers in CPU (tipically stored in PCB).
Thanks to PTLR I can check the limit of page table size for the current process, so its page table will contain just entries for its address memory space (that will be not so large as system address memory space).
If virtual address space of a process isn't sparse (its virtual page numbers are 0, 1, 2, ...) I will have a process page table of at most some K entries: totally its size will be at most some MBs, and I think it would be better to use a simple contiguous array.
So, why a lot of real solutions (ie x86 and x64) are based on multi-level page tables (or Hashed Page Tables)?
Thanks.

Because sparse virtual address space is good. Sparse address space allows the OS to crash a program that chases (some) wild pointers, and it makes prelinked shared libraries practical, and perhaps most useful of all, it allows your stack to grow from the "top" end of memory and your heap from the "bottom" end. You could of course define the page table index as a signed integer, which would allow you to implement the latter feature with just a simple array.
Also, think of "memory overcommit" allocation - when you malloc a few gigabytes the OS might say, "sure, fine!", knowing that most programs that ask for a few gigabytes turn out to use only a small fraction thereof. You could have problems supporting things like that with a simple array that isn't unnecessarily large.

Paging: Basic, Hierarchical, Hashed, and Inverted

With respect to operating systems and page tables, it seems there are 4 general methods to paging and page tables
Basic - A single page table which stores the page number and the offset
Hierarchical - A multi-tiered table which breaks up the virtual address into multiple parts
Hashed - A hashed page table which may often include multiple hashings mapping to the same entry
Inverted - The logical address also includes the PID, page number and offset. Then the PID is used to find the page in to the table and the number of rows down the table is added to the offset to find the physical address for main memory. (Rough, and probably terrible definition)
I am just wondering what are the pros and cons of each method? It seems like basic is the easier method but may also take up more space in memory for a larger address space.
What else?

The key to building a usable page model is minimizing the unused space for entries that are not necessary. You want to minimize the amount of memory needed while keeping the computation cost of a memory lookup low.
Basic can take up a lot of memory (for a modern system using 4GB of memory, that might amount to 300 MB only for the table) and is therefore impractical.
Hierarchical reduces that memory a lot by only adding subtables that are actually in use. Still, every process has a root page table. And if the memory footprint of the processes is scattered, there may still be a lot of unnecessary entries in secondary tables. This is a far better solution regarding memory than Basic and introduces only a marginal computation increase.
Hashed does not work because of hash collisions
Inverted is the solution to make Hashed work. The memory use is very small (as big as a Basic table for a single process, plus some PID and chaining overhead). The problem is, if there is a hash collision (several processes use the same virtual address) you will have to follow the chain information (just as in a linked list) until you find the entry with a matching PID. This may produce a lot of computing overhead in addition to the hash computing, but will keep the memory footprint as small as possible.

operating systems - TLBs

I'm trying to get my head round this (okay, tbh cramming a night before the exams :) but i can't figure out (nor find a good high level overview on the net) of this:
'page table entries can be mapped to more than one TLB entry.. if for example every page table entry is mappped to two TLB entries, this is know as 2-way set associative TLB'
My question is, why would we want to map this more than once? surely we want to have the maximum number of possible entries represented in the TLB, and duplication would waste space right ? What am i missing?
Many thanks

It doesn't mean you would load the same entry into two places into the table -- it means a particular entry can be loaded to either of two places in the table. The alternative where you can only map an entry to one place in the table is a direct mapped TLB.
The primary disadvantage of a direct-mapped TLB arises if you're copying from one part of memory to another, and (by whatever direct-mapping scheme the CPU uses) the translations for both have to be mapped to the same spot in the TLB. In this case, you end up re-loading the TLB entry every time, so the TLB is doing little or no good at all. By having a two-way set associative TLB, you can guarantee that any two entries can be in the TLB at the same time so (for example) a block move from point A to point B can't ruin your day -- but if you read from two areas, combine them, and write results to a third it could (if all three used translations that map map to the same set of TLB entries).
The shortcoming of having a multiway TLB (like any other multiway cache) is that you can't directly compute which position might hold a particular entry at a given time -- you basically search across the ways to find the right entry. For two-way, that's rarely a problem -- but four ways is typically about the useful limit; 8-way set associative (TLBs | caches) aren't common at all, partly because searching across 8 possible locations for the data starts to become excessive.
Over time, the number of ways it makes sense to use in a cache or tlb tends to rise though. The differential in speed between memory and processors continues to rise. The greater the differential, the more cycles the CPU can use and still produce a result within a single memory clock cycle (or a specified number of memory clock cycles, even if that's more than one).

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse