I'm unsure of my understanding of paging. I'd like to check, by creating a hypothetical page directory, and asking the community to point out my mistakes along the way.
Let's suppose that the kernel code fits into the first frame - 0x0000 to 0x1000, and that the page directory will be put at 0x10000. It will be an identity map.
So first, I went to the Intel 80386 manual and found this illustration:
╔══════════════════════════════════════╤═══════╤═══╤═╤═╤═══╤═╤═╤═╗
║ │ │ │ │ │ │U│R│ ║
║ PAGE FRAME ADDRESS 31..12 │ AVAIL │0 0│D│A│0 0│/│/│P║
║ │ │ │ │ │ │S│W│ ║
╚══════════════════════════════════════╧═══════╧═══╧═╧═╧═══╧═╧═╧═╝
The page directory will have only one entry (the present bit won't be set on the others). Here it is:
11000000 00001000 10000000 00000000
So this means that it is present, and that the corresponding page table is at 00000000 00000001 0001 shifted left by 12 bits, which evaluates to 0x11000. (This is what I was aiming for, but does it actually mean this?) Now I understand the present bit, but what does read/write mean in this case? It isn't referring to a frame, so...
For the page table located at 0x11000:
everything is almost the same, except for addresses.
11000000 00000000 00000000 00000000
11000000 00001000 00000000 00000000
etc.
11000000 00001000 10000000 00000000
And that is the complete page directory and page table for the kernel. Is this correct? Are there any mistakes along the way? What does the read/write bit mean in a page directory entry?
See Table 4-5 in Intel's System Programming Guide, as well as Section 4.6. In a page directory entry, the R/W bit controls writability for all the pages in the corresponding page table, i.e. for the whole 4 MB region of virtual memory. So if you clear the bit, all the pages in that region are read-only, regardless of what their individual R/W bits may be set to. If you set the bit, then the CPU consults the R/W bit on the page itself to decide whether writing is allowed.
In other words, the effective read-write status of a page is the logical AND of the R/W bits in all the paging structures that you walk to reach it. If you are trying to write, and the CPU encounters a cleared R/W bit at any stage in its page table walk, it can bail out early with a page fault. This principle continues to hold in 64-bit mode when there are more levels.
This may be convenient if you need to set a large region of memory to be read-only; you can just clear the R/W bit in the page directory entry, without iterating over all the pages in the table. A common example would be a Unix process that calls fork(); all its writable memory needs to become read-only so that it can be copy-on-write.
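If it helps, here is a minimal C sketch of building the structures described in the question (the addresses 0x10000 and 0x11000 are the question's; it assumes paging is not yet enabled so physical memory can be written to directly, and for simplicity it identity-maps the whole first 4 MiB rather than only the first few pages):

    #include <stdint.h>

    #define PTE_PRESENT 0x1u   /* bit 0: P   */
    #define PTE_RW      0x2u   /* bit 1: R/W */

    /* Hypothetical physical addresses taken from the question. */
    #define PAGE_DIR_PHYS   0x10000u
    #define PAGE_TABLE_PHYS 0x11000u

    void build_identity_map(void)
    {
        uint32_t *page_dir   = (uint32_t *)(uintptr_t)PAGE_DIR_PHYS;
        uint32_t *page_table = (uint32_t *)(uintptr_t)PAGE_TABLE_PHYS;

        /* Directory entry 0 points at the page table: 0x11000 | flags = 0x11003.
           Clearing PTE_RW here would make the whole 4 MB region read-only. */
        page_dir[0] = PAGE_TABLE_PHYS | PTE_PRESENT | PTE_RW;
        for (int i = 1; i < 1024; i++)
            page_dir[i] = 0;                 /* not present */

        /* Entry i maps virtual i*0x1000 to physical i*0x1000. */
        for (int i = 0; i < 1024; i++)
            page_table[i] = (uint32_t)(i * 0x1000) | PTE_PRESENT | PTE_RW;
    }

The R/W question then resolves as above: a write to one of these pages is allowed only if PTE_RW is set both in page_dir[0] and in the corresponding page_table[i] entry.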
While reading about demand paging, I can see it mentioned in several sources (e.g. http://www.expertsmind.com/questions/name-the-hardware-to-support-demand-paging-30176232.aspx) that we need hardware support for valid / invalid bit for each entry in the page table. However, I'm unable to wrap my head around what that hardware support would look like. As per my understanding,
The page table itself is a software-based construct, i.e. it has 4-byte / 8-byte entries (depending on the addressing scheme / architecture, etc.) which are present in RAM.
The valid / invalid bit is separate from the 4 bytes / 8 bytes used for each entry of the page table, so it's not as if, out of the 4 bytes of a page table entry, we're using 31 bits to store the frame number and 1 bit for the valid / invalid bit.
So in summary my question is - what does hardware support for valid / invalid bit look like? If it can vary across implementations / architectures, could you share any particular implementation's details?
The page table itself is a software-based construct, i.e. it has 4-byte / 8-byte entries (depending on the addressing scheme / architecture, etc.) which are present in RAM.
That's possible and is the case for some rare CPUs (which just ask the OS for a translation when they get a "TLB/translation look-aside buffer miss").
However, for most CPUs the page table is a data structure used directly by the CPU itself, and software (the kernel) must provide the data in the format that the CPU understands. For these cases the page table entry format required by the CPU typically has multiple bits used for a variety of purposes, including "valid/not valid" (and "accessed/not accessed", and read/write/executable permissions, and user/supervisor permission, and ...).
The valid / invalid bit is separate from the 4 bytes / 8 bytes used for each entry of the page table, so it's not as if, out of the 4 bytes of a page table entry, we're using 31 bits to store the frame number and 1 bit for the valid / invalid bit.
A page table entry (when valid) doesn't need to store the whole physical address, because the address of the page must be aligned to the start of a page (and therefore the lowest bits can be assumed to be zero and not stored). E.g. if pages are 4 KiB (with 12 bits for "offset in page") and physical addresses are 32 bits, then only 20 bits are needed for the physical address in the page table entries (the lowest 12 bits of the physical address can be "assumed zero"), and with 32-bit page table entries those remaining 12 bits can be used for other things ("valid/not valid", ...).
There are also "less regular" formats where some of the highest bits are repurposed for other things. For example, you might have 64-bit page table entries where the highest 4 bits are used for other things, then the middle 48 bits are used for physical address (without the "assumed zero" lower bits), then the lowest 12 bits are used for other things.
So in summary my question is - what does hardware support for valid / invalid bit look like?
Because it's different for each type of CPU, the best place to look for the page table entry format(s) is the CPU manufacturer's manual. You can find a common example (for 32-bit 80x86) in the diagram here: https://wiki.osdev.org/Paging#Page_Table
Note that Intel calls it a "present/not present" flag (labelled P), and it's the lowest bit in the page table entry.
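To see it from the software side, here is a rough C sketch of testing that flag for the common 32-bit 80x86 format without PAE (constants follow the diagram on that wiki page):

    #include <stdbool.h>
    #include <stdint.h>

    #define PTE_PRESENT  (1u << 0)   /* "valid/not valid" - what Intel calls P   */
    #define PTE_RW       (1u << 1)   /* writable                                 */
    #define PTE_USER     (1u << 2)   /* accessible from user mode                */
    #define PTE_ACCESSED (1u << 5)   /* set by the CPU when the page is used     */
    #define PTE_DIRTY    (1u << 6)   /* set by the CPU when the page is written  */

    static bool pte_is_present(uint32_t pte)
    {
        return (pte & PTE_PRESENT) != 0;
    }

    static uint32_t pte_frame_address(uint32_t pte)
    {
        /* The low 12 bits are flags; the frame address is "assumed zero" there. */
        return pte & 0xFFFFF000u;
    }

The "hardware support" is simply that the CPU's page walker checks that bottom bit itself and raises a page fault when it is clear; there is no extra storage outside the 4-byte entry.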
As part of the virtual-to-physical address conversion, a table of mappings between virtual and physical addresses is stored for each process. If a process is scheduled next, the content of the page table is loaded into the MMU.
1) Where is the page table for each process stored? As part of the process control block?
2) Does the page table contain entries for not allocated memory so a segfault can be detected (more easily)?
3) Is it possible (and used in any known relevant OS) that one process has multiple page frame sizes? Especially if question 2 is true, it would be very convenient to map huge pages to non-existent memory to keep the page table as small as possible, while still allowing high precision in mapping smaller frames to memory to keep external (and internal) fragmentation as small as possible. This of course requires an extra field storing the frame size for each entry. Please point out the reason(s) if my "idea" cannot exist.
1) They could be, but most OSes have a notion of an address space to which a process is attached. The address space typically contains a description of the sorts of mappings that have been established, and pointers to the page structure(s). If you consider the operation of exec(2), at a certain level of abstraction it merely involves creating a new address space, populating it, then attaching the process to it. Once the operation is known to succeed, the old address space can simply be discarded.
2) It depends upon the MMU architecture of the machine. In a forward-mapped arrangement (x86, armv[78]), the page tables form a sort of tree structure, but instead of having the conventional 2 or 3 items per node, there are hundreds or thousands of them. The x86-classic has a 2-level structure, where each of the 1024 entries in the first level points to a page table which covers 2^22 bytes (4 MB) of address space. Invalid entries, either at the inner or leaf level, can represent unmapped space; so in x86-classic, if you have a very small address space, you only need a root table and a single leaf-level table (see the sketch after this answer).
3) Yes, multiple page sizes have been supported by most OSes since the early 2000s. Again, in forward-mapped ones, an entry at a given level of the tree can be replaced by a single large page covering the same address range that the next-level table would have covered. x86-classic only had one size; later editions supported more.
3a) There is no need to use large pages to do this -- simply having an invalid page table entry is sufficient. In x86-classic, the least significant bit of the page table/descriptor entry indicates the validity of the entry.
Your idea exists.
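To make 2) concrete, here is a small C sketch of the forward-mapped walk (x86-classic flavour, names hypothetical; it treats physical addresses as directly dereferenceable, as a simulator or an identity-mapped kernel could):

    #include <stdint.h>

    #define ENTRY_PRESENT 0x1u   /* least significant bit = "valid" */

    /* Returns 0 on success, -1 where real hardware would raise a page fault. */
    int translate(const uint32_t *page_dir, uint32_t vaddr, uint32_t *paddr)
    {
        uint32_t pde = page_dir[(vaddr >> 22) & 0x3FFu];        /* top 10 bits: directory index */
        if (!(pde & ENTRY_PRESENT))
            return -1;                                          /* whole 4 MB region unmapped   */

        const uint32_t *page_table =
            (const uint32_t *)(uintptr_t)(pde & 0xFFFFF000u);
        uint32_t pte = page_table[(vaddr >> 12) & 0x3FFu];      /* next 10 bits: table index    */
        if (!(pte & ENTRY_PRESENT))
            return -1;                                          /* single 4 KB page unmapped    */

        *paddr = (pte & 0xFFFFF000u) | (vaddr & 0xFFFu);        /* frame + 12-bit offset        */
        return 0;
    }

A very small address space then needs only the root table plus one leaf table; every other root entry simply stays invalid.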
1) Where is the page table for each process stored? As part of the process control block?
Usually it's not "a page table". For some CPUs there's only TLB entries (Translation Lookaside Buffer entries - like a cache of what the translations are) where software has to handle a "TLB miss" by loading whatever it feels like into the TLB itself, and where the OS might not use tables at all (e.g. it could use a "list of arbitrary length zones"). For some CPUs it's a hierarchy of multiple levels (e.g. for modern 64-bit 80x86 there's 4 levels); and in this case some of the levels may be in physical memory, some may be in swap space or somewhere else, and some may be generated as needed from other data (a little like it would've been for "software handling of TLB miss").
In any case, if each process has its own virtual address space (and it's not some kind of "single address space shared by many processes" scheme), it's likely that the process control block (directly or indirectly) contains a reference to whatever the OS uses (e.g. maybe a single "physical address of the highest-level page table", but maybe a virtual address of a "list of arbitrary length zones", and maybe anything else); a tiny illustrative sketch follows at the end of this answer.
2) Does the page table contain entries for not allocated memory so a segfault can be detected (more easily)?
If there are page tables then there must be a way to indicate "page not present", where "page not present" may mean that the memory isn't allocated, but could also mean that the (virtual) memory was allocated but the entry for it hasn't been set (either because the OS is generating the tables on demand, or because the actual data is in swap space, or ...).
3) Is it possible (and used in any known relevant OS) that one process has multiple page frame sizes?
Yes. It's relatively common for 64-bit 80x86, where there are 4 KiB pages, 2 MiB (or 4 MiB) "large pages" (plus maybe 1 GiB "huge pages"), done to reduce the chance of TLB misses (while also reducing the memory consumed by page tables). Note that this is mostly an artifact of having multiple levels of page tables - an entry in a higher-level table can say "this entry is a large page" or it can say "this entry is a lower-level page table that might contain smaller pages". Note that in this case it's not "multiple page sizes in the same table", but "a fixed page size for each level".
Especially if question 2 is true, it would be very convenient to map huge pages to non-existent memory to keep the page table as small as possible, while still allowing high precision in mapping smaller frames to memory to keep external (and internal) fragmentation as small as possible. This of course requires an extra field storing the frame size for each entry. Please point out the reason(s) if my "idea" cannot exist.
Converting a virtual address into a physical address (or some kind of fault to indicate the translation doesn't exist) needs to be very fast (because it happens extremely often). When you have "fixed page size for each level" it means you can extract some bits of the virtual address and use them as the index into the table; which is fast.
When you have "multiple page sizes in the same table" there's 2 options. The first option is to duplicate entries in the page table so that you can still extract some bits of the virtual address and use them as the index into the table; which (apart from minor differences in the way TLBs are managed - e.g. auto-detecting adjacent translations vs. being manually told) is effectively identical to not bothering at all; but there are some CPUs (ARM I think) that do this.
The other alternative is searching multiple entries in the page table to find the right entry, where the cost of searching reduces performance. I don't know of any CPU that supports this - performance is too important.
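As a purely hypothetical illustration of the "reference from the process control block" mentioned in the answer to question 1 (names made up; real kernels differ considerably):

    #include <stdint.h>

    struct address_space {
        uint64_t top_level_table_phys;   /* e.g. the value loaded into CR3 on 80x86        */
        /* ... list of mapped regions, swap information, reference count, ...               */
    };

    struct process_control_block {
        int pid;
        struct address_space *aspace;    /* could be shared by several threads or processes */
        /* ... saved registers, scheduling state, ...                                       */
    };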
I've been trying to do some research on why multi-level page tables save space, and I think I'm a little confused about how a page table itself works. I found the following from Cornell:
The page table needs one entry per page. Assuming a 4GB (2^32 byte) virtual and physical address space and a page size of 4kB (2^12 bytes), we see that the 2^32 byte address space must be split into 2^20 pages.
It is my understanding that each process has its own page table. Does this mean that each process has 4GB of virtual address space? What is the point of the virtual address space being so huge? Why not allocate virtual pages as needed? Is it because the OS wants every possible address that can be made in the word size to map to a virtual page? Why not just prevent the program from dereferencing any virtual page number that is not a valid index for the page table?
I have read that one of the advantages of the multi-level page table is that it saves space by not having page table entries for virtual pages that are not in use. See below from Carnegie Mellon (figure not reproduced here; it shows a page table where only a few PTEs, e.g. 1, 2 and 8, are in use):
But why not just have a single-level page table that has contiguous entries - why would the process need PTEs 1 and 2 and then skip to 8? Why allow that? Even so, why do all the trailing, unused PTEs exist? Why not cut the page table short?
Consider a system with a 32-bit logical address space. If the page size in such a system is 4 KB (2^12), then a page table may consist of over 1 million entries (2^20 = 2^32/2^12). Assuming that each entry consists of 4 bytes, each process may need up to 4 MB of physical address space for the page table alone.
Clearly, we would not want to allocate the page table contiguously in main memory. One simple solution to this problem is to divide the page table into smaller pieces. We can accomplish this division in several ways.
One way is to use a two-level paging algorithm, in which the page table itself is also paged. For example, consider again the system with a 32-bit logical address space and a page size of 4 KB. A logical address is divided into a page number consisting of 20 bits and a page offset consisting of 12 bits. Because we page the page table, the page number is further divided into a 10-bit page number and a 10-bit page offset.
[Figure: Two-level paging]
--Silberschatz A., Galvin P.B.
So for the case proposed above, we would use 2^10 * 4 B = 4 KB for the outer page table (p1) and only N * 2^10 * 4 B = N * 4 KB for the inner page tables, where N is the number of inner page tables the process actually needs, which in total is less than the space needed for a single-level page table (4 MB).
You should also notice that the process only occupies the number of pages (and the amount of page table memory) it actually needs; the maximum virtual address space just determines the maximum addressable, and thus occupiable, memory for a process given a system configuration (32/64-bit address space).
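A quick back-of-the-envelope check of those numbers in C (N is hypothetical; here a process needing 8 inner tables, i.e. up to 32 MB of mapped address space):

    #include <stdio.h>

    int main(void)
    {
        const unsigned long entry_size   = 4;          /* bytes per entry         */
        const unsigned long per_table    = 1UL << 10;  /* 2^10 entries per table  */
        const unsigned long flat_entries = 1UL << 20;  /* 2^20 entries, one level */
        const unsigned long N            = 8;          /* inner tables needed     */

        printf("single-level table: %lu KB\n", flat_entries * entry_size / 1024);        /* 4096 KB */
        printf("two-level tables:   %lu KB\n", (1 + N) * per_table * entry_size / 1024); /* 36 KB   */
        return 0;
    }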
This saves memory (the word "space" is not specific enough) because the level-1 page table is much smaller when using two-level page tables instead of one. We can keep the level-1 page table in memory and page some of the level-2 page tables out to disk.
I am just beginning to learn the concept of Direct mapped and Set Associative Caches.
I have some very elementary doubts. Here goes.
Supposing addresses are 32 bits long and I have a 32KB cache with a 64-byte block size and 512 frames, how much data is actually stored inside a "block"? If I have an instruction which loads a value from a memory location, and that value is a 16-bit integer, does one of the 64-byte blocks now store only a 16-bit (2-byte) integer value? What of the other 62 bytes within the block? If I now have another load instruction which also loads a 16-bit integer value, this value goes into another block of another frame depending on the load address (and if the address maps to the same frame as the previous instruction, then the previous value is evicted and the block again stores only 2 bytes out of 64). Correct?
Please forgive me if this seems like a very stupid doubt; it's just that I want to get my concepts right.
I typed up this email for someone to explain caches, but I think you might find it useful as well.
You have 32-bit addresses that can refer to bytes in RAM.
You want to be able to cache the data that you access, to use them later.
Let's say you want a 1-MiB (2^20 bytes) cache.
What do you do?
You have 2 restrictions you need to meet:
Caching should be as uniform as possible across all addresses. i.e. you don't want to bias toward any particular kind of address.
How do you do this? Use remainder! With mod, you can evenly distribute any integer over whatever range you want.
You want to help minimize bookkeeping costs. That means e.g. if you're caching in blocks of 1 byte, you don't want to store 4 bytes of data just to keep track of where 1 byte belongs to.
How do you do that? You store blocks that are bigger than just 1 byte.
Let's say you choose 16-byte (2^4-byte) blocks. That means you can cache 2^20 / 2^4 = 2^16 = 65,536 blocks of data.
You now have a few options:
You can design the cache so that data from any memory block could be stored in any of the cache blocks. This would be called a fully-associative cache.
The benefit is that it's the "fairest" kind of cache: all blocks are treated completely equally.
The tradeoff is speed: To find where to put the memory block, you have to search every cache block for a free space. This is really slow.
You can design the cache so that data from any memory block could only be stored in a single cache block. This would be called a direct-mapped cache.
The benefit is that it's the fastest kind of cache: you do only 1 check to see if the item is in the cache or not.
The tradeoff is that, now, if you happen to have a bad memory access pattern, you can have 2 blocks kicking each other out successively, with unused blocks still remaining in the cache.
You can do a mixture of both: map each memory block to a small set of cache blocks. This is what real processors do -- they have N-way set-associative caches.
Direct-mapped cache:
Now you have 65,536 blocks of data, each block being of 16 bytes.
You store it as 65,536 "rows" inside your cache, with each "row" consisting of the data itself, along with the metadata (regarding where the block belongs, whether it's valid, whether it's been written to, etc.).
Question:
How does each block in memory get mapped to each block in the cache?
Answer:
Well, you're using a direct-mapped cache, using mod. That means addresses 0 to 15 will be mapped to block 0 in the cache; 16 to 31 get mapped to block 1, etc... and it wraps around as you reach the 1-MiB mark.
So, given memory address M, how do you find the row number N? Easy: N = (M mod 2^20) / 2^4.
But that only tells you where to store the data, not how to retrieve it. Once you've stored it, and try to access it again, you have to know which 1-MB portion of memory was stored here, right?
So that's one piece of metadata: the tag bits. If it's in row N, all you need to know is what the quotient was, during the mod operation. Which, for a 32-bit address, is 12 bits big (since the remainder is 20 bits).
So your tag becomes 12 bits long -- specifically, the topmost 12 bits of any memory address.
And you already knew that the lowermost 4 bits are used for the offset within a block (since memory is byte-addressed, and a block is 16 bytes).
That leaves 16 bits for the "index" bits of a memory address, which can be used to find which row the address belongs to. (It's just a division + remainder operation, but in binary.)
You also need other bits: e.g. you need to know whether a block is in fact valid or not, because when the CPU is turned on, it contains invalid data. So you add 1 bit of metadata: the Valid bit.
There are other bits you'll learn about, used for optimization, synchronization, etc... but these are the basic ones. :)
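Here's a rough C sketch of that address split for this 1-MiB, 16-byte-block, direct-mapped example (illustrative only; real hardware does the tag comparison in parallel logic rather than in software):

    #include <stdbool.h>
    #include <stdint.h>

    #define BLOCK_BITS 4                        /* 16-byte blocks -> 4 offset bits */
    #define INDEX_BITS 16                       /* 65,536 rows    -> 16 index bits */

    struct cache_row {
        bool     valid;                         /* anything stored here yet?       */
        uint16_t tag;                           /* top 12 bits of the address      */
        uint8_t  data[1 << BLOCK_BITS];         /* the 16 cached bytes             */
    };

    static struct cache_row cache[1u << INDEX_BITS];

    /* Returns true on a hit; on a hit the byte is cache[*row].data[*offset]. */
    static bool cache_lookup(uint32_t addr, uint32_t *row, uint32_t *offset)
    {
        *offset      = addr & ((1u << BLOCK_BITS) - 1);                 /* low 4 bits   */
        *row         = (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1); /* next 16 bits */
        uint32_t tag = addr >> (BLOCK_BITS + INDEX_BITS);               /* top 12 bits  */

        return cache[*row].valid && cache[*row].tag == tag;
    }

On a miss you fetch the whole 16-byte block from memory, store it in cache[row] with the new tag, and set valid; this is also why two addresses that differ only in their tag bits keep evicting each other in a direct-mapped cache.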
I'm assuming you know the basics of tag, index, and offset, but here's a short explanation as I learned it in my computer architecture class. Data is replaced in 64-byte blocks, so every time a new block is put into the cache it replaces all 64 bytes, regardless of whether you only need one byte. That's why, when addressing the cache, there is an offset that specifies the byte you want to get from the block. Take your example: if only a 16-bit integer is being loaded, the cache will find the block by the index, check the tag to make sure it's the right data, and then get the bytes according to the offset. Now if you load another 16-bit value, let's say with the same index but a different tag, it will replace the 64-byte block with the new block and get the value from the specified offset (assuming direct mapped).
I hope this helps! If you need more info or this is still fuzzy, let me know; I know a couple of good sites that do a good job of teaching this.
I would like to know why we need hierarchical page tables in an OS that handles per-process page tables using PTBR and PTLR registers in the CPU (typically stored in the PCB).
Thanks to the PTLR I can check the page table size limit for the current process, so its page table will contain just the entries for its memory address space (which will not be as large as the system address space).
If the virtual address space of a process isn't sparse (its virtual page numbers are 0, 1, 2, ...), I will have a process page table of at most some thousands of entries: in total its size will be at most a few MB, and I think it would be better to use a simple contiguous array.
So why are a lot of real solutions (i.e. x86 and x64) based on multi-level page tables (or hashed page tables)?
Thanks.
Because a sparse virtual address space is good. A sparse address space allows the OS to crash a program that chases (some) wild pointers, it makes prelinked shared libraries practical, and, perhaps most useful of all, it allows your stack to grow from the "top" end of memory and your heap from the "bottom" end. You could of course define the page table index as a signed integer, which would allow you to implement the latter feature with just a simple array.
Also, think of "memory overcommit" allocation - when you malloc a few gigabytes the OS might say, "sure, fine!", knowing that most programs that ask for a few gigabytes turn out to use only a small fraction thereof. You could have problems supporting things like that with a simple array that isn't unnecessarily large.