I know that page size is fixed in some operation system, such as pg size is 4K in i386. However, how does memory manager know the size of the page? Does it store it somewhere of memory so that MMU can read it when translating address?
Page size has a direct impact in processor architecture. It defines how a page address is interpreted by hardware for the virtual-physical translation.
In-page part of the address (often called offset or displacement) is not translated and is sent unchanged to the cache, while the upper bits (page virtual address) are translated by the TLB, and modifying page size (and offset) would require changes in datapath width. Depending on the size of this offset and on the L1 cache characteristics (size and associativity), the cache can or not use a virtual index which can have a direct performance impact and would imply a redesign.
The virtual address size also determines the way page tables are organized and accessed after a TLB miss (page walk). The MMU and cache are a highly critical part of processor design that has a direct performance impact, and they need to be optimized that generally exclude flexibility.
So changing page size requires major changes in the processor architecture and page sizes are generally constant or have a limited number of values. Recent Pentium can have regular 4K or huge 4G pages. Older arm versions (v4 and v5) add subpages that allowed to divide page size by 4. On Arm v8, you can also have 64kB pages. But besides that processor are generally designed for fixed page size,
and the operating system must adapt to the processor pages.
There a three ways I am aware of that processors define page sizes:
The page size is constant and never changes.
The page size is the same but is configurable. In this case the page size is set in a system register. Usually, the page size had to be certain values so it is a bit setting rather than a numeric value. This seems to be what you are asking. The readers digest version on the intel chips is that setting a bit in the CR4 register switches between 4KB and 4 MB pages.
There are some systems that can have variable page sizes. In that case, the page size is usually set in the page table.
Related
As part of the virtual to physical address conversion, for each process a table of mappings between virtual to physical addresses is stored. If a process is scheduled next the content of the page table is loaded into the MMU.
1) Where is the page table for each process stored? As part of the process control block?
2) Does the page table contain entries for not allocated memory so a segfault can be detected (more easily)?
3) Is it possible (and used in any known relevant OS) that one process does have multiple page frame sizes? Especially if question 2 is true it is very convenient to map huge page tables to non existing memory to keep the page table as small as possible. It will still allow high precision in mapping smaller frames to the memory to keep external (and internal) fragmentation as small as possible? This of course requires an extra field storing the frame size for each entry. Please point out the reason(s) if my "idea" cannot exist.
1) They could be, but most OS's have a notion of an address space which a process is attached to. The address space typically contains a description of the sorts of mappings that have been established, and pointers to the page structure(s). If you consider the operation of exec(2), at a certain level of abstraction it merely involves creating a new address space, populating it, then attaching the process to it. Once the operation is known to succeed, the old address space can simply be discarded.
2) It depends upon the mmu architecture of the machine. In a forward mapped arrangement (x86, armv[78]), the page tables form a sort of tree structure, but instead of having the conventional 2 or 3 items per node, there are hundreds or thousands of them. The x86-classic has a 2 level structure, where each of the 1024 entries in the first level points to a pagetable which covers 2^20 bytes of address space. Invalid entries, either at the inner or leaf level, can represent unmapped space; so in x86-classic, if you have a very small address space, you only need a root table, and a single leaf level table.
3) Yes, multiple page size has been supported by most OSes since the early 2000s. Again, in forward mapped ones, each of the levels of the tree can be replaced by a single large page for the same address space as that table level. x86-classic only had one size; later editions supported many more.
3a) There is no need to use large pages to do this -- simply having an invalid page table is sufficient. In x86-classic, the least significant bit of the page table/descriptor entry indicates the validity of the entry.
Your idea exists.
1) Where is the page table for each process stored? As part of the process control block?
Usually it's not "a page table". For some CPUs there's only TLB entries (Translation Lookaside Buffer entries - like a cache of what the translations are) where software has to handle "TLB miss" by loading whatever it feels like into the TLB itself, and where the OS might not use tables at all (e.g. could use "list of arbitrary length zones"). For some CPUs it's a hierarchy of multiple levels (e.g. for modern 64-bit 80x86 there's 4 levels); and in this case some of the levels may be in physical memory and some may be in swap space or somewhere else and some may be generated as needed from other data (a little bit like it would've been for "software handling of TLB miss"). In any case, if each process has its own virtual address space (e.g. and it's not some kind of "single-address space shared by many processes" scheme) its likely that the process control block (directly or indirectly) contains a reference to whatever the OS uses (e.g. maybe a single "physical address for the highest level page table", but maybe a virtual address of a "list of arbitrary length zones" and maybe anything else).
2) Does the page table contain entries for not allocated memory so a segfault can be detected (more easily)?
If there are page tables then there must be a way to indicate "page not present", where "page not present" may mean that the memory isn't allocated but could also mean that the (virtual) memory was allocated but the entry for it hasn't been set (either because OS is generating the tables on demand, or because the actual data is in swap space, or...).
3) Is it possible (and used in any known relevant OS) that one process does have multiple page frame sizes?
Yes. It's relatively common for 64-bit 80x86 where there's 4 KiB pages, 2 MiB (or 4 MiB) "large pages" (plus maybe 1 GiB "huge pages"); and done to reduce the chance of TLB misses (while also reducing memory consumed by page tables). Note that this is mostly an artifact of having multiple levels of page tables - an entry in a higher level table can say "this entry is a large page" or it can say "this entry is a lower level page table that might contain smaller pages". Note that in this case it's not "multiple page sizes in the same table", but is "fixed page size for each level".
Especially if question 2 is true it is very convenient to map huge page tables to non existing memory to keep the page table as small as possible. It will still allow high precision in mapping smaller frames to the memory to keep external (and internal) fragmentation as small as possible? This of course requires an extra field storing the frame size for each entry. Please point out the reason(s) if my "idea" cannot exist.
Converting a virtual address into a physical address (or some kind of fault to indicate the translation doesn't exist) needs to be very fast (because it happens extremely often). When you have "fixed page size for each level" it means you can extract some bits of the virtual address and use them as the index into the table; which is fast.
When you have "multiple page sizes in the same table" there's 2 options. The first option is to duplicate entries in the page table so that you can still extract some bits of the virtual address and use them as the index into the table; which (apart from minor differences in the way TLBs are managed - e.g. auto-detecting adjacent translations vs. being manually told) is effectively identical to not bothering at all; but there are some CPUs (ARM I think) that do this.
The other alternative is searching multiple entries in the page table to find the right entry, where the cost of searching reduces performance. I don't know of any CPU that supports this - performance is too important.
I have been studying Operating Systems Concepts and the book I am referring to is Operating System Concepts by Peter B. Galvin, Greg Gagne and Abraham Silberschatz.
In the chapter of Virtual Memory, book starts to talk about Paging and number of memory access it would require for the system to read data stored in a particular frame in memory given a logical address. The author states that when Page Table is present in Main Memory, system would need two memory accesses to read data stored in a frame. The first access is made to the page table to read the correct frame number and the next access is for reading the byte/word from the frame.
After a few sections, the book talks about Demand Paging and page fault. Author state that in case of no page fault, one memory access is needed and in case of a page fault, we will consider Page Fault Service time (which comprises of swap in time, swap out time, one memory access etc.) and presents readers with the formula
Effective Access Time = (1-p) x one memory access time + p x page fault service time
where p = page fault rate
I cannot wrap my head around why the author suggests that, in case of no page fault, only one memory access will be needed. Applying the line of thought used with standard paging scheme earlier introduced by same author(s), we should need one memory access to read page table and another to read the data from frame.
Is it because we are talking about the time frame after the access to page table is made? Then why the same standard of calculation not applies to standard version of paging?
Note: I haven't read/seen this book.
For educational material; if the author describes reality accurately with all the details the reader will just get confused and won't be able to learn. To work around that, authors simplify (omit details and ignore reality) while introducing different concepts, so that the reader is able to learn each concept one at a time while building up the knowledge needed to comprehend the complexity of reality.
The problem is that different simplifications make sense at different stages, and authors are human (imperfect), so sometimes the simplifications that were beneficial at one point (in one chapter) conflict with simplifications that are beneficial at a later point (in a different chapter).
For an example, I might (initially) tell someone "each access from virtual memory involves a second memory fetch from RAM to determine the translation" to help them understand how page tables work and that there's (potential) performance problems involved (twice as many memory accesses). Then I might introduce the concept of "translation look-aside buffers" (after the reader understands the how page tables work and knows about the problem that TLBs are designed to solve). Then I might explain that often real systems have multiple levels of page tables (e.g. on 64-bit 80x86 it's four levels, potentially involving 4 memory accesses to determine a translation) and that there might be higher level caches/buffers involved (and not just TLBs that cache final translations). In this case, my original statement ("each access from virtual memory involves a second memory fetch from RAM to determine the translation") is a deliberate lie (a simplification) to avoid the complexity of a statement like "each access from virtual memory may or may not involve one or more additional fetches from some or all levels of page tables" (which is too confusing for beginners initially, because it creates lots of questions that they don't have answers to yet).
I cannot wrap my head around why the author suggests that, in case of no page fault, only one memory access will be needed.
One reality is (for one real 80x86 CPU in long mode but not all 80x86 CPUs in long mode and not any 80x86 in other modes, if virtualisation is not being used), for a read from virtual memory that does not lead to a page fault, if the access is not misaligned/split across page boundaries (where CPU would have to do it all twice to fetch bytes from 2 different pages and merge the bytes):
* if the translation is not in the TLB, then:
* if the area is not in the "page directory cache"
* fetch the PML4 entry to determine address of PDPT (try L1 cache, then L2 cache, then L3 cache, then RAM)
* do access checks based on flags in PML4 entry
* fetch the PDPT entry to determine address of PD (try L1 cache, then L2 cache, then L3 cache, then RAM)
* do access checks based on flags in PDPT entry
* insert data into "page directory cache"
* if the area is in the "page directory cache"
* do access checks based on flags in "page directory cache entry"
* fetch the PD entry to determine address of PT (try L1 cache, then L2 cache, then L3 cache, then RAM)
* do access checks based on flags in PD entry
* fetch the PT entry to determine address of page (try L1 cache, then L2 cache, then L3 cache, then RAM)
* do access checks based on flags in PT entry
* insert data into TLB (including setting the "accessed" flag in the page table entry)
* if the translation is in the TLB, then:
* do access checks based on flags in "TLB entry"
* do the "physical address = physical address of page + offset in page" calculation
* read the data for the physical address (try L1 cache, then L2 cache, then L3 cache, then RAM)
For this reality (with the restrictions mentioned); the number of fetches from RAM can be anything from zero to 5.
Can you see why the author (while trying to explain page faults and not trying to explain translation costs) might want to avoid showing something like this and might simplify (by assuming that only one fetch is needed because the translation is in the TLB) instead?
The fundamental source of your problem is that you are reading a book that is only fit for lining a cat box. What you are describing is nonsensical gibberish that textbooks use to create confusion among students. This is not a case of over simplification because the authors apparently throw in a nonsensical formula for access times.
A formula like this
Effective Access Time = (1-p) x one memory access time + p x page fault service time
is total bovine fecal waste matter with no basis in reality.
The author states that when Page Table is present in Main Memory, system would need two memory accesses to read data stored in a frame.
The processor has to translate logical addresses to physical addresses using the page tables. Assuming that there is no caching in the CPU, the CPU has read the page table for each memory access.
The number reads depends upon the page table format used by the CPU.
Let's suppose your process has a multi-level page table. In that case the CPU has to make a read for each level of the table.
If you have a CPU that has separate linear system and user page tables, with the user tables in logical addresses, each access to the system space requires one memory read and each access to the user space requires at least two memory accesses and might, in fact, trigger a page fault. The first read is to system page table to find the user page table entry. The second read is to the user page table. The third is to the data.
In reality, every CPU on the planet does page table caching so separate reads are not required (all the time).
I cannot wrap my head around why the author suggests that, in case of no page fault, only one memory access will be needed.
It sounds like the book is not being consistent in its BS.
The reality is that logical memory translation requires a number of steps. However, what those steps are depends upon the state of the processor, something that is unpredictable. These steps take place transparently behind the scenes and you do not even need to grasp all of them to understand operating systems.
What you need to know in the real world is that the CPU translates logical addresses to physical addresses. If the CPU is unable to make that translation, it triggers a page fault.
As far as I know, a memory access of CPU involves CPU cache and MMU. CPU will try to find its target in cache and if a cache miss happens, CPU will turn to MMU. During accessing by MMU, the accessed/dirty bit of correspondent page table entry will be set by hardware.
However to the best of my knowledge, most CPU design won't trigger the MMU unless there's a cache miss, and here my problem is, will the accessed/dirty bit of page table entry still be set under a cache hit? Or it's architecture related?
I think you can assume these bits are cached in the TLB, and if there is any inconsistency with the values in the TLB and accesses done by the core, a microcode assist will be taken and the bits will be updated. For example, if the A1 or D bits are zero and an access or store happens, this condition will be detected and the appropriate bits will be set.
You can also assume that the fast path for TLB hits can't go to memory and see if the cached TLB bits are consistent with the PTEs in RAM. Furthermore, on x86 changes to PTE are not pushed, cache-invalidation style, to TLBs by hardware; that is, the TLB is not coherent.
This implies that if the bits are out of sync in certain ways, they will probably not be updated correctly. E.g., if the A (resp. D) bit is set in the TLB, and an access (resp. store) occurs, nothing will happen, even if the A (resp. D) bit is actually unset in the PTE. The entity making changes to the bits is responsible for flushing TLBs so that the bits are correctly updated in the future.
1 Having a TLB entry with A == 0 is weird: you'd expect the entry to be there as a result of an access, so having the A bit set from the start. Perhaps there are some scenarios where this might occur, such as a page brought in by a speculative access or prefetch.
Most caches are virtually indexed and physically tagged, for faster access. So the CPU issues the virtual address and index bits of the address is used to locate the entry. During this time the address is sent to TLB for getting the physical address. By the time cache has located the entry, TLB will return with the physical address which is then used for TAG comparison. Now two things can happen.
TLB could not have the entry (TLB miss)
Cache TAG mismatch (Cache miss)
In the case of 1, you need to access the page table entry (PTE) to get the correct physical address.
In the case of 2, if TLB has returned a valid mapping, you just need to fetch it. If TLB also has a miss (i.e, 1 and 2), then you need to get the physical address from PTE and fetch the data.
So to answer your question, in case of a HIT, PTE doesn't need to know about it all.
You usually can't have a cache hit if the page was never accessed in the first place, so that question is irrelevant. (Edit: come to think of it, it may be possible in some bizarre cases of page aliasing, but the same answer for the dirty bit applies there)
It is possible to have a cached line from a clean page (never written to previously). It's a little uncommon since you usually need to initialize data before accessing it, but the page could have been swapped out previously and then reinstalled into the page map (the exact behavior would be OS dependent but it is possible).
In that case, the line is cached (let's say exclusively), and you write to it. The CPU would access the cache and the TLB in parallel, attempting to lookup the line in the cache while also doing a TLB access to verify the full physical address, assuming your system is virtually indexed - physically tagged as most CPUs are these days. The TLB process may complete either through a TLB hit, or a miss followed by a page walk to install a TLB entry from the actual page map in the memory.
The cache access cannot complete until the TLB access (and page walk if necessary) is done, at which point you will know the value of the access/dirty bits.
If you are trying to write to a page without the dirty bit set (or access a page without the access bit) - you will receive a page fault, triggering the OS to go and update the page in page table. The OS may choose to do various optimizations at this point, but it will eventually result in correcting these bits.
I have learned that in an operating system (Linux), the memory management unit (MMU) can translate a virtual address (VA) to a physical address (PA) via the page table data structure. It seems that page is the smallest data unit that is managed by the VM. But how about the block? Is it also the smallest data unit transfered between the disk and the system memory?
What is the difference between pages and blocks?
A block is the smallest unit of data that an operating system can either write to a file or read from a file.
What exactly is a page?
Pages are used by some operating systems instead of blocks. A page is basically a virtual block. And, pages have a fixed size – 4K and 2K are the most commonly used sizes. So, the two key points to remember about pages is that they are virtual blocks and they have fixed sizes.
Why pages may be used instead of blocks
Pages are used because they make processing easier when there are many storage devices, because each device may support a different block size. With pages the operating system can deal with just a fixed size page, rather than try to figure out how to deal with blocks that are all different sizes. So, pages act as sort of a middleman between operating systems and hardware drivers, which translate the pages to the appropriate blocks. But, both pages and blocks are used as a unit of data storage.
http://www.programmerinterview.com/index.php/database-sql/page-versus-block/
Generally speaking, the hard-disk is one of those devices called "block-devices" as opposed to "character-devices" because the unit of transferring data is in the block.
Even if you want only a single character from a file, the OS and the drive will get you a block and then give you access only to what you asked for while the rest remains in a specific cache/buffer.
Note: The block size, however, can differ from one system to another.
To clear a point:
Yes, any data transferred between the hard disk and the RAM is usually sent in blocks rather than actual bytes.
Data which is stored in RAM is managed, typically, by pages yes; of course the assembly instructions only know byte addresses.
I would like to know why we need hierachical page tables in OS that handle per-process page tables, using PTBR and PTLR registers in CPU (tipically stored in PCB).
Thanks to PTLR I can check the limit of page table size for the current process, so its page table will contain just entries for its address memory space (that will be not so large as system address memory space).
If virtual address space of a process isn't sparse (its virtual page numbers are 0, 1, 2, ...) I will have a process page table of at most some K entries: totally its size will be at most some MBs, and I think it would be better to use a simple contiguous array.
So, why a lot of real solutions (ie x86 and x64) are based on multi-level page tables (or Hashed Page Tables)?
Thanks.
Because sparse virtual address space is good. Sparse address space allows the OS to crash a program that chases (some) wild pointers, and it makes prelinked shared libraries practical, and perhaps most useful of all, it allows your stack to grow from the "top" end of memory and your heap from the "bottom" end. You could of course define the page table index as a signed integer, which would allow you to implement the latter feature with just a simple array.
Also, think of "memory overcommit" allocation - when you malloc a few gigabytes the OS might say, "sure, fine!", knowing that most programs that ask for a few gigabytes turn out to use only a small fraction thereof. You could have problems supporting things like that with a simple array that isn't unnecessarily large.