MMU and TLB miss - operating-system

Suppose the following. I have a system with virtual memory and one-level paging, an MMU, and a TLB that is managed by software.
Ok, so imagine I'm a process, and I want to read a word in RAM at virtual address vaddr.
So the CPU gives the MMU vaddr, and the MMU checks the TLB for an entry keyed on (say) the 5 most significant bits of vaddr. If it's there, all is well: it computes the physical address and the access goes through.
Now suppose it wasn't in the TLB. In this case, the MMU raises an exception (a page fault).
Ok, now I'm in the page-fault handler.
In the PBR (page base register), I have the address of the start of the page table.
My question here: is this address a physical one? I guess yes, because if it were virtual, that would mean two things:
1) It would have to be reserved somehow in the virtual address space of the process (I've never heard of something like that).
2) If this address is not in the TLB, it would cause another page fault and I'd have an infinite loop.
Same question about addresses inside the tables. If I had two-level paging, is the address in a first-level table entry (the one that points to the second-level table) virtual or physical?
Thanks.

Homework?
In any case, such things are described in detail in the CPU's architecture manual (and you haven't even said which CPU you're talking about - x86 does not generate page faults on TLB misses).

As #zvrba pointed out, that behaviour is defined by the CPU implementation, so this question is not answerable per se.
General things to consider though:
In the PBR (page base register), I have the address of the start of the page table. My question here. Is this address a physical one?
Yes, that has to be a physical address; otherwise the MMU would first need to resolve the virtual-to-physical translation for the translation table itself.
1) Must be reserved somehow in the virtual adress space of the process
Wrong: a virtual address would have to be translated, not merely reserved, so reservation doesn't make sense here.
2) If this address is not in the TLB, would couse again a pagefault and I'll have an infinite loop
Good try, but still wrong thinking. Even if the addresses holding the translation tables are mapped, you still need a physical address first in order to read the tables out of memory; otherwise you are stuck resolving virtual-to-physical for the first-level table and never reach the point of your ordinary RAM access. By the way, the translation tables are usually also mapped into the virtual address space - that is how the OS itself maps and unmaps memory pages for applications.
If I had a two level paging. The address in an entry in the first level table (that points to the second level table), is somehow virtual or physical?
The page walk uses physical addresses, no matter how many levels there are. See above.
I have a system with virtual memory with one lever paging... So, the CPU gives the MMU vaddr, the MMU checks in the TLB if there's an entry with the (suppose) 5 most significant bits of vaddr
5 most significant bits... hmm... Let's say it's a 32-bit system, so 5 most significant bits means only bits [31..27] have effect. That implies a page size of 2^27 = 128 MB. Why bother with an MMU if you can only have 32 pages mapped? Keep the MMU off!
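
To make the software-managed case concrete, here is a rough sketch (plain C, not for any particular CPU, and assuming 4 KiB pages rather than the 5-bit split from the question) of a TLB-miss handler walking a one-level page table. The PBR holds a physical address, so the table itself is read with physical accesses; the helper names read_phys, tlb_insert and raise_page_fault are made up for illustration.

    #include <stdint.h>

    #define PAGE_SHIFT   12u          /* assume 4 KiB pages                  */
    #define PTE_PRESENT  0x1u         /* hypothetical "present" bit in a PTE */

    /* Hypothetical platform helpers. */
    extern uint32_t page_table_base;                  /* PBR: PHYSICAL address of the table */
    extern uint32_t read_phys(uint32_t paddr);        /* read a word by physical address    */
    extern void     tlb_insert(uint32_t vpn, uint32_t pfn);
    extern void     raise_page_fault(uint32_t vaddr);

    /* Software TLB-miss handler: the one-level page table is walked using
     * PHYSICAL addresses only, so no translation is needed to reach the table. */
    void tlb_miss_handler(uint32_t vaddr)
    {
        uint32_t vpn = vaddr >> PAGE_SHIFT;           /* virtual page number  */
        uint32_t pte = read_phys(page_table_base + vpn * sizeof(uint32_t));

        if (!(pte & PTE_PRESENT)) {
            raise_page_fault(vaddr);                  /* a genuine page fault */
            return;
        }
        tlb_insert(vpn, pte >> PAGE_SHIFT);           /* load the translation */
        /* The faulting instruction is then retried and now hits in the TLB. */
    }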

Related

Number of RAM Accesses using Paging on a 32bit processor

Paging Architecture
Assume we use a 32-bit processor with paging as its memory management scheme. My question is: what is the range of possible accesses it might need in order to find the physical address?
My answer is 0 to 3.
First, it will look for the virtual address in the TLB; if it is there, then we have 0 accesses.
If it is not there, then with the first 10 bits it will do 1 access into the page directory. Afterwards it will check the L1 cache with this PDE; if it finds it, then we have 1 access in total.
Otherwise, it will move to the page table (according to the PDE) and do another memory access based on the second 10 bits of the virtual address. Then it will check the L2 cache with this PTE; if it finds it, then we have 2 RAM accesses in total.
At last, it will use the last 12 bits (the offset) to access RAM at the desired physical address, giving us 3 accesses in total.
Is the above rationale correct? I am not familiar with how we use the L1 and L2 caches.
Thanks
Is the above rationale correct?
It's mostly correct, there's just one subtlety you've overlooked...
My question is what is the range of possible accesses it might need in order to find the physical address.
Otherwise, it will move to the page table (according to the PDE) and it will do another memory access based on the second 10 bits of the virtual address.
At this point it has the physical address; so the answer is "0 to 2 accesses".
What happens after it already has the physical address (if the data at that address is cached or becomes a 3rd access) doesn't change the answer.
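
For illustration, here is a rough sketch of the counting (classic 32-bit x86 two-level paging with a 10/10/12 split; tlb_lookup, cache_lookup and ram_read are invented stand-ins for the hardware):

    #include <stdint.h>
    #include <stdbool.h>

    /* Split a 32-bit virtual address for classic two-level x86 paging. */
    #define PDE_INDEX(va)   (((va) >> 22) & 0x3FFu)   /* top 10 bits    */
    #define PTE_INDEX(va)   (((va) >> 12) & 0x3FFu)   /* middle 10 bits */
    #define PAGE_OFFSET(va) ((va) & 0xFFFu)           /* low 12 bits    */

    /* Hypothetical lookups: return true on a hit and fill in the value. */
    extern bool     tlb_lookup(uint32_t va, uint32_t *pa);
    extern bool     cache_lookup(uint32_t pa, uint32_t *word);   /* L1/L2 caches      */
    extern uint32_t ram_read(uint32_t pa);                       /* a real RAM access */

    /* Count the RAM accesses needed only to FIND the physical address (0..2).
     * The access to the data itself is a separate, later access.              */
    int translate(uint32_t va, uint32_t page_dir_base, uint32_t *pa)
    {
        int ram_accesses = 0;
        uint32_t pde, pte;

        if (tlb_lookup(va, pa))
            return 0;                                  /* TLB hit: 0 accesses */

        uint32_t pde_addr = page_dir_base + PDE_INDEX(va) * 4;
        if (!cache_lookup(pde_addr, &pde)) {
            pde = ram_read(pde_addr);                  /* +1 RAM access       */
            ram_accesses++;
        }

        uint32_t pte_addr = (pde & ~0xFFFu) + PTE_INDEX(va) * 4;
        if (!cache_lookup(pte_addr, &pte)) {
            pte = ram_read(pte_addr);                  /* +1 RAM access       */
            ram_accesses++;
        }

        *pa = (pte & ~0xFFFu) | PAGE_OFFSET(va);
        return ram_accesses;                           /* 0, 1 or 2           */
    }

The return value is the 0 to 2 accesses needed for the translation itself; whatever happens when the data at *pa is finally read is a separate, additional access.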

Why is address translation needed?

I am taking an operating systems class, and they introduced the concept of address translation. While a program is running, all memory accesses will be translated from virtual to physical. My question is: a lot of memory addresses are given to the program by the OS, and thus can be the physical addresses themselves. What types of memory requests does the program initiate all by itself (without an address given by the OS), and which would thus be virtual addresses?
The address of the stack is pre-set by the OS in the stack pointer before the program starts running, and so the stack pointer can hold the physical address. Heap addresses returned by malloc are returned by the OS -- thus, the OS can return the physical address, since that address is stored in some variable and is transparent to the program. So what addresses does the program itself access that need to be translated to physical addresses? So far, the only examples I can think of are: 1) instruction addresses (jump commands have the instruction address hardcoded in the program code) and 2) maybe static variable addresses (if it's not stored in a register by the OS). Are there any more examples/am I missing something?
Thanks!
Maybe the simplest example of why address translation is extremely useful:
The physical address space is usually partitioned into pages, e.g. 4k large.
Processes have their own virtual address space, that is also partitioned into pages of the same size.
Address translation, which is done by a memory management unit under control of the operating system, maps any virtual page to any physical page, for every page independently.
It is thus possible to combine arbitrary pages of a fragmented physical memory into contiguous virtual pages.
This allows the physical memory to be used much more effectively than would be possible without address translation.
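
A toy example of that per-page independence (one-level table, 4 KiB pages, made-up frame numbers): three virtually contiguous pages are backed by scattered physical frames.

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12u                     /* 4 KiB pages */

    /* A toy one-level page table: virtual page number -> physical frame number. */
    static uint32_t page_table[8] = {
        /* vpn 0 */ 7,  /* vpn 1 */ 2,  /* vpn 2 */ 5,  /* vpn 3 */ 0,
        /* vpn 4 */ 1,  /* vpn 5 */ 6,  /* vpn 6 */ 3,  /* vpn 7 */ 4,
    };

    static uint32_t translate(uint32_t vaddr)
    {
        uint32_t vpn    = vaddr >> PAGE_SHIFT;
        uint32_t offset = vaddr & ((1u << PAGE_SHIFT) - 1);
        return (page_table[vpn] << PAGE_SHIFT) | offset;
    }

    int main(void)
    {
        /* Three virtually contiguous pages map to scattered physical frames. */
        for (uint32_t vpn = 0; vpn < 3; vpn++)
            printf("virtual page %u -> physical frame %u\n", vpn, page_table[vpn]);
        printf("vaddr 0x1234 -> paddr 0x%x\n", (unsigned)translate(0x1234));
        return 0;
    }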

OS: how does kernel virtual memory help in making swapping pages of the page table easier?

Upon reading this chapter from "Operating Systems: Three Easy Pieces" book, I'm confused of this excerpt:
If in contrast the kernel were located entirely in physical memory, it would be quite hard to do things like swap pages of the page table to disk;
I've been trying to make sense of it for days but still can't see how kernel virtual memory makes swapping pages of the page table easier. Wouldn't it be the same if the kernel lived completely in physical memory, since the pages of different processes' page tables end up in physical memory in the end anyway (and can thus be swapped to disk if needed)? How is it different when page tables reside in kernel virtual memory vs. in kernel-owned physical memory?
Let's assume the kernel needs to access a string of characters from user-space (e.g. maybe a file name passed to an open() system call).
With the kernel in virtual memory, the kernel would have to check that the virtual address of the string is sane (e.g. not an address for the kernel's own data) and also guard against a different thread in user-space modifying the data the pointer points to (so that the string can't change while the kernel is using it, possibly after the kernel checked that the string is valid but before the kernel uses it). In theory (and likely in practice) this could all be hidden by a "check virtual address range" function. Apart from those things, the kernel can just use the string like normal. If the data is in swap space, the kernel can simply let its own page-fault handler fetch the data when the kernel attempts to access it, and not worry about it.
With the kernel in physical memory, the kernel would still have to do those things (check that the virtual address is sane, guard against another thread modifying the data the pointer points to). In addition, it would have to convert the virtual address to a physical address itself, and ensure the data is actually in RAM itself. In theory this could still be hidden by a "check virtual address range" function that also converts the virtual address into physical address(es).
However, the "contiguous in virtual memory" data (the string) may not be contiguous in physical memory (e.g. first half of the string in one page with the second half of the string in a different page with a completely unrelated physical address). This means that kernel would have to deal with this problem too (e.g. for a string, even things like strlen() can't work), and it can't be hidden in a "check (and convert) virtual address range" function to make it easier for the rest of the kernel.
To deal with the "not contiguous in physical memory" problem there's mostly only 2 possibilities:
a) A set of "get char/short/int at offset N using this list of physical addresses" functions (see the sketch after this list); or
b) Refuse to support it in the kernel, which mostly just shifts the unwanted burden to user-space (e.g. the open() function in a C library copying the file-name string to a new page before calling the kernel's open() system call, if the original string crossed a page boundary).
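
A rough sketch of what option a) might look like (the 4 KiB page size and the phys_to_kernel_ptr helper are assumptions, not any real kernel's API): every byte access has to pick the right physical page from the list, and even strlen() has to be rewritten in those terms.

    #include <stdint.h>
    #include <stddef.h>

    #define PAGE_SIZE  4096u
    #define PAGE_MASK  (PAGE_SIZE - 1)

    /* Hypothetical: make a physical page readable from kernel code. */
    extern const uint8_t *phys_to_kernel_ptr(uint64_t phys_page);

    /* Option a): fetch the byte at offset n of a user buffer, given the list of
     * physical pages the buffer spans and its offset within the first page.    */
    uint8_t get_byte_at(const uint64_t *phys_pages, size_t first_page_offset, size_t n)
    {
        size_t abs     = first_page_offset + n;   /* offset from start of first page */
        size_t page_ix = abs / PAGE_SIZE;         /* which physical page             */
        size_t in_page = abs & PAGE_MASK;         /* offset within that page         */

        return phys_to_kernel_ptr(phys_pages[page_ix])[in_page];
    }

    /* Even strlen() has to be rewritten in terms of such accessors, which is
     * exactly the burden described above.                                       */
    size_t user_strlen(const uint64_t *phys_pages, size_t first_page_offset)
    {
        size_t n = 0;
        while (get_byte_at(phys_pages, first_page_offset, n) != '\0')
            n++;
        return n;
    }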
In an ideal world (which is what we live in now, with kernels that are hundreds of megabytes in size running on machines with gigabytes of physical RAM) the kernel would never swap even parts of itself. But in the old days, when physical memory was a constraint, the less of the kernel was in physical memory, the more of the application could be in physical memory. The more of the application is in physical memory, the fewer page faults in user space.
The Linux kernel has been worked over fairly extensively to keep it compact. Case in point: kernel modules. You can load a module using insmod or modprobe, and that module will become resident; but if nothing uses it, after a while it will get swapped out, and that's no big deal because nothing is using it.

Different Page Sizes for Processes

As part of virtual-to-physical address conversion, a table of mappings between virtual and physical addresses is stored for each process. When a process is scheduled next, the contents of its page table are loaded into the MMU.
1) Where is the page table for each process stored? As part of the process control block?
2) Does the page table contain entries for unallocated memory, so that a segfault can be detected (more easily)?
3) Is it possible (and used in any known relevant OS) that one process has multiple page frame sizes? Especially if question 2 is true, it would be very convenient to map huge pages to non-existent memory to keep the page table as small as possible, while still allowing high precision in mapping smaller frames to memory so as to keep external (and internal) fragmentation as small as possible. This of course requires an extra field storing the frame size for each entry. Please point out the reason(s) if my "idea" cannot exist.
1) They could be, but most OS's have a notion of an address space which a process is attached to. The address space typically contains a description of the sorts of mappings that have been established, and pointers to the page structure(s). If you consider the operation of exec(2), at a certain level of abstraction it merely involves creating a new address space, populating it, then attaching the process to it. Once the operation is known to succeed, the old address space can simply be discarded.
2) It depends upon the MMU architecture of the machine. In a forward-mapped arrangement (x86, armv[78]), the page tables form a sort of tree structure, but instead of having the conventional 2 or 3 items per node, there are hundreds or thousands of them. The x86-classic has a 2-level structure, where each of the 1024 entries in the first level points to a page table which covers 2^22 bytes (4 MiB) of address space. Invalid entries, either at the inner or leaf level, can represent unmapped space; so in x86-classic, if you have a very small address space, you only need a root table and a single leaf-level table.
3) Yes, multiple page sizes have been supported by most OSes since the early 2000s. Again, in forward-mapped MMUs, each level of the tree can be replaced by a single large page covering the same address space as that table level. x86-classic only had one size; later editions supported many more.
3a) There is no need to use large pages to do this -- simply having an invalid page-table entry is sufficient. In x86-classic, the least significant bit of the page table/descriptor entry indicates the validity of the entry.
Your idea exists.
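
For concreteness, here is a minimal sketch of the x86-classic two-level walk described above (simplified by pretending the tables are directly addressable): an invalid entry at the first level makes a whole 4 MiB region unmapped without needing a leaf table, and an invalid leaf entry unmaps a single 4 KiB page.

    #include <stdint.h>
    #include <stdbool.h>

    #define ENTRY_PRESENT 0x1u   /* x86-classic: bit 0 of a PDE/PTE marks it valid */

    /* Walk a classic 32-bit x86 two-level structure.  Simplification: physical
     * memory is assumed to be identity-mapped so table pointers can be
     * dereferenced directly.                                                    */
    bool lookup(const uint32_t *page_dir, uint32_t va, uint32_t *pa)
    {
        uint32_t pde = page_dir[(va >> 22) & 0x3FF];
        if (!(pde & ENTRY_PRESENT))
            return false;                            /* 4 MiB hole: no leaf table  */

        const uint32_t *page_table = (const uint32_t *)(uintptr_t)(pde & ~0xFFFu);
        uint32_t pte = page_table[(va >> 12) & 0x3FF];
        if (!(pte & ENTRY_PRESENT))
            return false;                            /* single 4 KiB page unmapped */

        *pa = (pte & ~0xFFFu) | (va & 0xFFFu);
        return true;
    }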
1) Where is the page table for each process stored? As part of the process control block?
Usually it's not "a page table". For some CPUs there are only TLB entries (Translation Lookaside Buffer entries - like a cache of what the translations are), where software has to handle a "TLB miss" by loading whatever it feels like into the TLB itself, and where the OS might not use tables at all (e.g. it could use a "list of arbitrary-length zones"). For some CPUs it's a hierarchy of multiple levels (e.g. for modern 64-bit 80x86 there are 4 levels); in this case some of the levels may be in physical memory, some may be in swap space or somewhere else, and some may be generated as needed from other data (a little bit like it would've been for "software handling of TLB miss"). In any case, if each process has its own virtual address space (i.e. it's not some kind of "single address space shared by many processes" scheme), it's likely that the process control block (directly or indirectly) contains a reference to whatever the OS uses (e.g. maybe a single "physical address of the highest-level page table", but maybe a virtual address of a "list of arbitrary-length zones", and maybe anything else).
2) Does the page table contain entries for not allocated memory so a segfault can be detected (more easily)?
If there are page tables then there must be a way to indicate "page not present", where "page not present" may mean that the memory isn't allocated, but could also mean that the (virtual) memory was allocated but the entry for it hasn't been set (either because the OS is generating the tables on demand, or because the actual data is in swap space, or...).
3) Is it possible (and used in any known relevant OS) that one process does have multiple page frame sizes?
Yes. It's relatively common for 64-bit 80x86, where there are 4 KiB pages and 2 MiB (or 4 MiB) "large pages" (plus maybe 1 GiB "huge pages"); this is done to reduce the chance of TLB misses (while also reducing the memory consumed by page tables). Note that this is mostly an artifact of having multiple levels of page tables - an entry in a higher-level table can say "this entry is a large page" or it can say "this entry points to a lower-level page table that might contain smaller pages". Note that in this case it's not "multiple page sizes in the same table", but rather "a fixed page size for each level".
Especially if question 2 is true it is very convenient to map huge page tables to non existing memory to keep the page table as small as possible. It will still allow high precision in mapping smaller frames to the memory to keep external (and internal) fragmentation as small as possible? This of course requires an extra field storing the frame size for each entry. Please point out the reason(s) if my "idea" cannot exist.
Converting a virtual address into a physical address (or some kind of fault to indicate the translation doesn't exist) needs to be very fast (because it happens extremely often). When you have "fixed page size for each level" it means you can extract some bits of the virtual address and use them as the index into the table; which is fast.
When you have "multiple page sizes in the same table" there are 2 options. The first option is to duplicate entries in the page table so that you can still extract some bits of the virtual address and use them as the index into the table, which (apart from minor differences in the way TLBs are managed - e.g. auto-detecting adjacent translations vs. being told manually) is effectively identical to not bothering at all; there are some CPUs (ARM, I think) that do this.
The other alternative is searching multiple entries in the page table to find the right entry, where the cost of searching reduces performance. I don't know of any CPU that supports this - performance is too important.
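
A sketch of the "fixed page size for each level" idea for a 64-bit 80x86-style 4-level walk (simplified: tables treated as directly dereferenceable, and the bit names ENTRY_PRESENT / ENTRY_BIG_PAGE are invented): at each level the index is just a 9-bit slice of the virtual address, and an entry either points to the next table or terminates the walk as a 2 MiB or 1 GiB page.

    #include <stdint.h>
    #include <stdbool.h>

    #define ENTRY_PRESENT  0x1u
    #define ENTRY_BIG_PAGE 0x80u   /* "this entry is a large page", not a table */

    /* 4-level walk, 9 index bits per level, 12-bit page offset (x86-64 style).
     * Simplified: table addresses are treated as directly dereferenceable, and
     * only the present/big-page bits are modelled.                             */
    bool walk(const uint64_t *top_table, uint64_t va, uint64_t *pa)
    {
        const uint64_t *table = top_table;

        for (int level = 3; level >= 0; level--) {
            unsigned shift = 12 + 9 * (unsigned)level;         /* 39, 30, 21, 12 */
            uint64_t entry = table[(va >> shift) & 0x1FF];     /* 9-bit index    */

            if (!(entry & ENTRY_PRESENT))
                return false;                                  /* unmapped       */

            bool is_leaf = (level == 0) ||                     /* 4 KiB page     */
                           ((level == 1 || level == 2) &&      /* 2 MiB / 1 GiB  */
                            (entry & ENTRY_BIG_PAGE));
            if (is_leaf) {
                uint64_t page_mask = (1ULL << shift) - 1;      /* page size - 1  */
                *pa = (entry & ~page_mask) | (va & page_mask);
                return true;
            }
            table = (const uint64_t *)(uintptr_t)(entry & ~0xFFFULL);
        }
        return false;                                          /* not reached    */
    }

Because the index at each level is a fixed bit slice, no searching is ever needed, which is the performance point made above.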

When accessing memory, will the page table accessed/dirty bit be set under a cache hit situation?

As far as I know, a CPU memory access involves the CPU cache and the MMU. The CPU will try to find its target in the cache, and if a cache miss happens, it will turn to the MMU. During the access through the MMU, the accessed/dirty bit of the corresponding page table entry will be set by the hardware.
However, to the best of my knowledge, most CPU designs won't involve the MMU unless there's a cache miss, so my question is: will the accessed/dirty bit of the page table entry still be set on a cache hit? Or is it architecture dependent?
I think you can assume these bits are cached in the TLB, and if there is any inconsistency between the values in the TLB and the accesses done by the core, a microcode assist will be taken and the bits will be updated. For example, if the A [1] or D bits are zero and an access or store happens, this condition will be detected and the appropriate bits will be set.
You can also assume that the fast path for TLB hits can't go to memory and see if the cached TLB bits are consistent with the PTEs in RAM. Furthermore, on x86, changes to PTEs are not pushed, cache-invalidation style, to the TLBs by hardware; that is, the TLB is not coherent.
This implies that if the bits are out of sync in certain ways, they will probably not be updated correctly. E.g., if the A (resp. D) bit is set in the TLB, and an access (resp. store) occurs, nothing will happen, even if the A (resp. D) bit is actually unset in the PTE. The entity making changes to the bits is responsible for flushing TLBs so that the bits are correctly updated in the future.
[1] Having a TLB entry with A == 0 is weird: you'd expect the entry to be there as a result of an access, and so to have the A bit set from the start. Perhaps there are some scenarios where this might occur, such as a page brought in by a speculative access or prefetch.
Most caches are virtually indexed and physically tagged, for faster access. So the CPU issues the virtual address, and the index bits of the address are used to locate the entry. At the same time the address is sent to the TLB to obtain the physical address. By the time the cache has located the entry, the TLB will have returned the physical address, which is then used for the tag comparison. Now two things can happen:
1. The TLB may not have the entry (TLB miss)
2. The cache tag may not match (cache miss)
In the case of 1, you need to access the page table entry (PTE) to get the correct physical address.
In the case of 2, if the TLB has returned a valid mapping, you just need to fetch the data from memory. If the TLB also misses (i.e. both 1 and 2), then you need to get the physical address from the PTE and then fetch the data.
So to answer your question: in the case of a HIT, the PTE doesn't need to know about it at all.
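
Schematically, the virtually-indexed, physically-tagged lookup described above could look like this (toy sizes; tlb_lookup and cache_tag_of are invented helpers). The set index comes from virtual-address bits that lie inside the page offset, so it can be computed while the TLB lookup is still in flight; the tag comparison needs the physical address from the TLB.

    #include <stdint.h>
    #include <stdbool.h>

    #define LINE_SHIFT  6u    /* 64-byte cache lines (assumed)               */
    #define SET_BITS    6u    /* 64 sets: index bits fall inside page offset */
    #define PAGE_SHIFT  12u
    #define WAYS        8u    /* associativity (assumed)                     */

    /* Hypothetical helpers. */
    extern bool     tlb_lookup(uint32_t vpn, uint32_t *pfn);   /* miss -> walk the page table */
    extern uint32_t cache_tag_of(uint32_t set, unsigned way);  /* physical tag stored per way */

    /* Virtually indexed, physically tagged: the set is selected with virtual
     * address bits while the TLB (in parallel) supplies the physical tag.     */
    bool cache_hit(uint32_t vaddr)
    {
        uint32_t set = (vaddr >> LINE_SHIFT) & ((1u << SET_BITS) - 1); /* virtual index */

        uint32_t pfn;
        if (!tlb_lookup(vaddr >> PAGE_SHIFT, &pfn))
            return false;   /* case 1: TLB miss - walk the page table, then retry */

        uint32_t paddr = (pfn << PAGE_SHIFT) | (vaddr & ((1u << PAGE_SHIFT) - 1));
        uint32_t ptag  = paddr >> (LINE_SHIFT + SET_BITS);             /* physical tag */

        for (unsigned way = 0; way < WAYS; way++)
            if (cache_tag_of(set, way) == ptag)
                return true;                                           /* cache hit    */
        return false;                                                  /* case 2: miss */
    }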
You usually can't have a cache hit if the page was never accessed in the first place, so that question is irrelevant. (Edit: come to think of it, it may be possible in some bizarre cases of page aliasing, but the same answer for the dirty bit applies there)
It is possible to have a cached line from a clean page (never written to previously). It's a little uncommon since you usually need to initialize data before accessing it, but the page could have been swapped out previously and then reinstalled into the page map (the exact behavior would be OS dependent but it is possible).
In that case, the line is cached (let's say exclusively), and you write to it. The CPU would access the cache and the TLB in parallel, attempting to look up the line in the cache while also doing a TLB access to obtain the full physical address, assuming your system is virtually indexed / physically tagged, as most CPUs are these days. The TLB process may complete either through a TLB hit, or through a miss followed by a page walk to install a TLB entry from the actual page map in memory.
The cache access cannot complete until the TLB access (and page walk if necessary) is done, at which point you will know the value of the access/dirty bits.
If you are trying to write to a page without the dirty bit set (or access a page without the accessed bit), you will receive a page fault, triggering the OS to go and update the page table entry. The OS may choose to do various optimizations at this point, but it will eventually result in these bits being corrected.
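
On architectures where the hardware does not maintain these bits itself, the fault handler's job is roughly the following (a minimal sketch with an invented PTE layout and helpers, not any particular OS):

    #include <stdint.h>
    #include <stdbool.h>

    #define PTE_PRESENT   0x1u
    #define PTE_WRITABLE  0x2u
    #define PTE_ACCESSED  0x20u
    #define PTE_DIRTY     0x40u

    extern uint32_t *pte_for(uint32_t vaddr);          /* hypothetical lookup      */
    extern void      tlb_invalidate(uint32_t vaddr);   /* drop the stale entry     */
    extern void      real_page_fault(uint32_t vaddr);  /* not just an A/D update   */

    /* Software maintenance of accessed/dirty bits in a fault handler. */
    void fault_handler(uint32_t vaddr, bool is_write)
    {
        uint32_t *pte = pte_for(vaddr);

        if (!(*pte & PTE_PRESENT) || (is_write && !(*pte & PTE_WRITABLE))) {
            real_page_fault(vaddr);                    /* genuinely unmapped or read-only */
            return;
        }

        *pte |= PTE_ACCESSED;                          /* the page has been touched */
        if (is_write)
            *pte |= PTE_DIRTY;                         /* ...and modified           */

        tlb_invalidate(vaddr);                         /* so the new bits are seen  */
        /* Return and retry the faulting instruction. */
    }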