Is the Intel EPT table a 4-level page table? - virtualization

Q1. It seems that the EPT table keeps a whole copy of the guest page table, making it a 4-level page table. Is that correct?
Q2. Isn't that a bit of a waste of space?
Q3. What exactly is an EPT violation? Does it mean this: the guest tries to access a new guest virtual address (gVA), the EPT table does not have a record for it yet, so it traps into the VMM, which adds the gVA and gPA entries to the EPT table. Is that correct?

EPT maps guest physical address to host physical address.
Before EPT (hardware support for GPA<->HPA translation) was introduced, the hypervisor had to manually maintain a shadow copy of the guest page table entries.
The PTEs in the actual guest page table would have lowered access permissions, i.e. if the actual permission was write, it would be lowered to read. A write would then cause a page fault, which would be intercepted by the hypervisor.
The hypervisor would in turn update the corresponding shadow page table entries. This entire process was painfully slow. That's why EPT was introduced: the GPA to HPA translation is done by the hardware itself, which is much faster.
So now, answering your first question: it does not. If you want to virtualize an OS without EPT support, you still need to maintain additional shadow page table structures apart from the guest OS's page tables.
Q3: The guest virtual address (GVA) is translated normally by the hardware, by traversing the guest OS's page tables just as would be done on native hardware. Once this translation yields the guest physical address (GPA), EPT comes into the picture: the hardware translates the GPA to an HPA, since HPAs are the addresses the real CPU knows about.
An EPT violation VM exit happens when the EPT does not have an existing mapping for a guest physical address (GPA) to host physical address (HPA). This causes a VM exit to the VMM, which then creates a new mapping. (An EPT violation is analogous to a page fault in a normal OS, the only difference being the type of mapping being created.)
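To make the two-stage translation concrete, here is a toy C sketch of what the hardware does per guest access with EPT enabled. The addresses and the single-entry lookup tables are invented for illustration; the real walks are multi-level, and on real hardware each guest page-table fetch is itself translated through EPT, which is omitted here.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

// Toy single-entry "tables" standing in for the real multi-level walks;
// all addresses here are made up.
static bool guest_pt_lookup(uint64_t gva, uint64_t *gpa) {  // guest page tables: GVA -> GPA
    if (gva == 0x400000) { *gpa = 0x1000; return true; }
    return false;   // guest page fault, delivered to the *guest* OS
}
static bool ept_lookup(uint64_t gpa, uint64_t *hpa) {       // EPT: GPA -> HPA
    if (gpa == 0x1000) { *hpa = 0x7f000; return true; }
    return false;   // EPT violation: VM exit, the VMM must build the mapping
}

// One guest memory access with EPT enabled, as the hardware sees it.
static bool translate(uint64_t gva, uint64_t *hpa) {
    uint64_t gpa;
    if (!guest_pt_lookup(gva, &gpa)) return false;
    if (!ept_lookup(gpa, hpa))       return false;
    return true;    // *hpa is an address the real CPU can put on the bus
}

int main(void) {
    uint64_t hpa;
    if (translate(0x400000, &hpa))
        printf("GVA 0x400000 -> HPA 0x%llx\n", (unsigned long long)hpa);
    return 0;
}
```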

My two cents, please correct me if my memory went wrong.
Q1: No, the EPT stores the GPA to HPA mapping. With EPT, the guest page table is maintained only in the guest.
Q2: Without EPT, the VMM must maintain a shadow page table instead, so I don't think EPT wastes space.
Q3: The EPT stores the GPA to HPA mapping. The GVA to GPA mapping is maintained in the guest in this case.

My little contribution, one year late...
Q1: Yes, the EPT is like an MMU translation tree (4-level or less), but it translates GPA to HPA (guest physical addresses to host physical ones).
Q2: For virtualization, a translation tree (shadow or EPT) is necessary, so it's not a waste of space. Hardware translation is faster than the shadow (software) approach and avoids VM exits, which slow things down.
Q3: Yes, an EPT violation occurs like a page fault, but it occurs for access violations too. EPT allows fine-grained control of page access (read, write, execute).

Related

OS: how does kernel virtual memory help in making swap pages of the page table easier?

Upon reading this chapter from the "Operating Systems: Three Easy Pieces" book, I'm confused by this excerpt:
If in contrast the kernel were located entirely in physical memory, it would be quite hard to do things like swap pages of the page table to disk;
I've been trying to make sense of it for days but still can't see how kernel virtual memory makes swapping pages of the page table easier. Wouldn't it be the same if the kernel lived completely in physical memory, since the pages of different processes' page tables end up in physical memory in the end anyway (and can thus be swapped to disk if needed)? How is it different when page tables reside in kernel virtual memory vs. in kernel-owned physical memory?
Let's assume the kernel needs to access a string of characters from user-space (e.g. maybe a file name passed to an open() system call).
With the kernel in virtual memory; the kernel would have to check that the virtual address of the string is sane (e.g. not an address for the kernel's own data) and also guard against a different thread in user-space modifying the data the pointer points to (so that the string can't change while the kernel is using it, possibly after the kernel checked that the string is valid but before the kernel uses it). In theory (and likely in practice) this could all be hidden by a "check virtual address range" function. Apart from those things; the kernel can just use the string like normal. If the data is in swap space, then the kernel can just let its own page fault handler fetch the data when the kernel attempts to access it, and not worry about it.
With the kernel in physical memory; the kernel would still have to do those things (check the virtual address is sane, guard against another thread modifying the data the pointer points to). In addition; it would have to convert the virtual address to a physical address itself, and ensure the data is actually in RAM itself. In theory this could still be hidden by a "check virtual address range" function that also converts the virtual address into physical address(es).
However, the "contiguous in virtual memory" data (the string) may not be contiguous in physical memory (e.g. first half of the string in one page with the second half of the string in a different page with a completely unrelated physical address). This means that kernel would have to deal with this problem too (e.g. for a string, even things like strlen() can't work), and it can't be hidden in a "check (and convert) virtual address range" function to make it easier for the rest of the kernel.
To deal with the "not contiguous in physical memory" problem there are really only two possibilities:
a) a set of "get char/short/int... at offset N using this list of physical addresses" functions (sketched below); or
b) refuse to support it in the kernel; which mostly just shifts unwanted burden to user-space (e.g. the open() function in a C library copying the file name string to a new page, if the original string crossed a page boundary, before calling the kernel's open() system call).
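A minimal sketch of option (a), using a simulated physical memory; the helper names and page layout are invented for illustration:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096

static uint8_t fake_ram[3 * PAGE_SIZE];                 // stand-in for physical memory
static uint8_t phys_read_byte(uint64_t paddr) { return fake_ram[paddr]; }

// Option (a): fetch the byte at offset n of a user buffer, given the list of
// physical pages backing it and the buffer's offset within the first page.
static uint8_t user_byte_at(const uint64_t *pages, size_t first_off, size_t n) {
    size_t linear = first_off + n;                      // position within the mapping
    return phys_read_byte(pages[linear / PAGE_SIZE] + linear % PAGE_SIZE);
}

// Even strlen() has to be rewritten in these terms:
static size_t user_strlen(const uint64_t *pages, size_t first_off) {
    size_t n = 0;
    while (user_byte_at(pages, first_off, n) != '\0') n++;
    return n;
}

int main(void) {
    // A string crossing a page boundary: virtually contiguous, but backed by
    // physical page 2 followed by physical page 0.
    uint64_t pages[] = { 2 * PAGE_SIZE, 0 * PAGE_SIZE };
    size_t first_off = PAGE_SIZE - 3;                   // "foo" ends the first page
    memcpy(&fake_ram[2 * PAGE_SIZE + first_off], "foo", 3);
    memcpy(&fake_ram[0], "bar", 4);                     // "bar\0" starts the second
    printf("user_strlen = %zu\n", user_strlen(pages, first_off));  // prints 6
    return 0;
}
```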
In an ideal world (which is what we live in now with kernels that are hundreds of megabytes in size running on machines with gigabytes of physical RAM) the kernel would never swap even parts of itself. But in the old days, when physical memory was a constraint, the less of the kernel in physical memory, the more the application could be in physical memory. The more the application is in physical memory, the fewer page faults in user space.
The Linux kernel has been worked over fairly extensively to keep it compact. Case in point: kernel modules. You can load a module using insmod or modprobe, and that module will become resident; but if nothing uses it, after a while it will get swapped out, and that's no big deal because nothing is using it.

Reading and writing memory, but having trouble writing to a virtual address

I am trying to write a program where I scan a process's memory and can also write to those addresses (just like Cheat Engine). However, I did some research and found out that the memory I was reading is virtual memory: I can read this memory, but I can't write to it, and to translate it I need page tables. So my question is: where can I find these page tables, and is there any other way to write using the virtual address I get?
Virtual memory is an elaborate illusion. What you think is read/write RAM may actually be data in swap space, or "read only, copy on write", or something else.
To maintain the illusion, and for security, and for compatibility (e.g. 32-bit program running on a 64-bit CPU with a 64-bit kernel); user-space is not given access to page tables.
An OS or kernel might provide an abstract interface to some of the information (with suitable restrictions and limitations for security). One example of this would be the VirtualQuery() and VirtualQueryEx() functions in Windows (see https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-virtualqueryex ).
In a similar way, an OS or kernel might provide an abstract interface to alter a page's permissions (with suitable restrictions and limitations for security). One example of this would be the VirtualProtect() function in Windows (see https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-virtualprotect ).
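As a rough sketch of how those calls fit together for another process (using the Ex variants, since the target is a different process; the PID and address are placeholders, and error handling is minimal):

```c
#include <windows.h>
#include <stdio.h>

int main(void) {
    DWORD pid = 1234;                      // hypothetical target process ID
    LPVOID addr = (LPVOID)0x12340000;      // hypothetical address found by scanning

    HANDLE h = OpenProcess(PROCESS_QUERY_INFORMATION | PROCESS_VM_READ |
                           PROCESS_VM_WRITE | PROCESS_VM_OPERATION, FALSE, pid);
    if (!h) return 1;

    // Ask the kernel what this region is (committed? writable? copy-on-write?).
    MEMORY_BASIC_INFORMATION mbi;
    if (VirtualQueryEx(h, addr, &mbi, sizeof(mbi)))
        printf("base=%p size=%llu protect=0x%lx\n",
               mbi.BaseAddress, (unsigned long long)mbi.RegionSize, mbi.Protect);

    // Make the page writable, write the new value, then restore the protection.
    DWORD old;
    int value = 42;
    SIZE_T written;
    if (VirtualProtectEx(h, addr, sizeof value, PAGE_READWRITE, &old)) {
        WriteProcessMemory(h, addr, &value, sizeof value, &written);
        VirtualProtectEx(h, addr, sizeof value, old, &old);
    }
    CloseHandle(h);
    return 0;
}
```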
... and is there any other way to write using the virtual address I get?
If your CPU is an 80x86 CPU that supports Intel's transactional extensions (TSX); you can misuse "transactions" to suppress page faults (make them cause a "transaction abort" instead of triggering a page fault).
This won't allow you to write to a read-only or "not present" page; but it will allow you to attempt the write without being detected by the OS.
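A sketch of that trick using the RTM intrinsics (a probe, not a definitive implementation; transactions can also abort for unrelated reasons such as interrupts or cache conflicts, so a real probe would retry a few times):

```c
#include <immintrin.h>   // RTM intrinsics; compile with -mrtm on a TSX-capable CPU

// Probe whether *addr is readable without the OS ever seeing a page fault:
// inside a transaction, a faulting access aborts the transaction instead.
int probe_readable(const volatile char *addr) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        (void)*addr;               // a faulting read aborts the transaction instead
        _xend();
        return 1;                  // the read succeeded
    }
    return 0;                      // aborted: not present, not readable, or spurious
}
```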

Number of memory accesses with Demand Paging

I have been studying Operating Systems Concepts and the book I am referring to is Operating System Concepts by Peter B. Galvin, Greg Gagne and Abraham Silberschatz.
In the chapter on Virtual Memory, the book starts to talk about paging and the number of memory accesses the system would require to read data stored in a particular frame, given a logical address. The author states that when the page table is kept in main memory, the system needs two memory accesses to read data stored in a frame. The first access is made to the page table to read the correct frame number, and the next access reads the byte/word from the frame.
A few sections later, the book talks about demand paging and page faults. The author states that in the case of no page fault, one memory access is needed, and that in the case of a page fault we must consider the page fault service time (which comprises swap-in time, swap-out time, one memory access, etc.), and presents readers with the formula
Effective Access Time = (1-p) x one memory access time + p x page fault service time
where p = page fault rate
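To make the formula concrete, here is a tiny computation with made-up timings (100 ns per memory access, 8 ms per fault; both numbers are invented for illustration). Note how even one fault per 100,000 accesses nearly doubles the effective access time:

```c
#include <stdio.h>

int main(void) {
    const double mem_access    = 100e-9;  // illustrative: 100 ns per memory access
    const double fault_service = 8e-3;    // illustrative: 8 ms per page fault
    for (double p = 0.0; p <= 1.05e-5; p += 1e-6) {
        double eat = (1.0 - p) * mem_access + p * fault_service;
        printf("p = %.0e  ->  EAT = %6.1f ns\n", p, eat * 1e9);
    }
    return 0;   // at p = 1e-5, EAT is already ~180 ns instead of 100 ns
}
```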
I cannot wrap my head around why the author suggests that, in the case of no page fault, only one memory access is needed. Applying the line of thought used with the standard paging scheme introduced earlier by the same author(s), we should need one memory access to read the page table and another to read the data from the frame.
Is it because we are talking about the time frame after the access to the page table has already been made? Then why does the same standard of calculation not apply to the standard version of paging?
Note: I haven't read/seen this book.
For educational material; if the author describes reality accurately with all the details the reader will just get confused and won't be able to learn. To work around that, authors simplify (omit details and ignore reality) while introducing different concepts, so that the reader is able to learn each concept one at a time while building up the knowledge needed to comprehend the complexity of reality.
The problem is that different simplifications make sense at different stages, and authors are human (imperfect), so sometimes the simplifications that were beneficial at one point (in one chapter) conflict with simplifications that are beneficial at a later point (in a different chapter).
For an example, I might (initially) tell someone "each access from virtual memory involves a second memory fetch from RAM to determine the translation" to help them understand how page tables work and that there are (potential) performance problems involved (twice as many memory accesses). Then I might introduce the concept of "translation look-aside buffers" (after the reader understands how page tables work and knows about the problem that TLBs are designed to solve). Then I might explain that real systems often have multiple levels of page tables (e.g. on 64-bit 80x86 it's four levels, potentially involving 4 memory accesses to determine a translation) and that there might be higher level caches/buffers involved (and not just TLBs that cache final translations). In this case, my original statement ("each access from virtual memory involves a second memory fetch from RAM to determine the translation") is a deliberate lie (a simplification) to avoid the complexity of a statement like "each access from virtual memory may or may not involve one or more additional fetches from some or all levels of page tables" (which is too confusing for beginners initially, because it creates lots of questions that they don't have answers to yet).
I cannot wrap my head around why the author suggests that, in case of no page fault, only one memory access will be needed.
One reality is (for one real 80x86 CPU in long mode, but not all 80x86 CPUs in long mode and not any 80x86 in other modes, and assuming virtualisation is not being used), for a read from virtual memory that does not lead to a page fault, where the access is not misaligned/split across a page boundary (where the CPU would have to do it all twice, to fetch bytes from 2 different pages and merge them):
* if the translation is not in the TLB, then:
  * if the area is not in the "page directory cache":
    * fetch the PML4 entry to determine the address of the PDPT (try L1 cache, then L2 cache, then L3 cache, then RAM)
    * do access checks based on flags in the PML4 entry
    * fetch the PDPT entry to determine the address of the PD (try L1 cache, then L2 cache, then L3 cache, then RAM)
    * do access checks based on flags in the PDPT entry
    * insert the data into the "page directory cache"
  * if the area is in the "page directory cache":
    * do access checks based on flags in the "page directory cache" entry
  * fetch the PD entry to determine the address of the PT (try L1 cache, then L2 cache, then L3 cache, then RAM)
  * do access checks based on flags in the PD entry
  * fetch the PT entry to determine the address of the page (try L1 cache, then L2 cache, then L3 cache, then RAM)
  * do access checks based on flags in the PT entry
  * insert the data into the TLB (including setting the "accessed" flag in the page table entry)
* if the translation is in the TLB, then:
  * do access checks based on flags in the "TLB entry"
* do the "physical address = physical address of page + offset in page" calculation
* read the data for the physical address (try L1 cache, then L2 cache, then L3 cache, then RAM)
For this reality (with the restrictions mentioned); the number of fetches from RAM can be anything from zero to 5.
Can you see why the author (while trying to explain page faults and not trying to explain translation costs) might want to avoid showing something like this and might simplify (by assuming that only one fetch is needed because the translation is in the TLB) instead?
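If it helps, here is the 4-level walk itself as a toy C model (simulated physical memory, 4 KiB pages only; the access checks, A/D-bit updates and "page directory cache" from the list above are deliberately ignored, and the table layout is invented for the demo):

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define PRESENT   0x1ULL
#define ADDR_MASK 0x000FFFFFFFFFF000ULL  // bits 51:12 hold the next table / page address

static uint64_t fake_ram[5 * 512];       // five 4 KiB "frames" of simulated physical memory
static uint64_t phys_read64(uint64_t pa) { return fake_ram[pa / 8]; }  // one fetch per call

// The 4-level x86-64 walk: PML4 -> PDPT -> PD -> PT, then the data access.
static bool walk(uint64_t cr3, uint64_t vaddr, uint64_t *paddr) {
    uint64_t table = cr3 & ADDR_MASK;
    for (int level = 3; level >= 0; level--) {           // 4 levels, 9 index bits each
        uint64_t index = (vaddr >> (12 + 9 * level)) & 0x1FF;
        uint64_t entry = phys_read64(table + index * 8); // one fetch per level
        if (!(entry & PRESENT))
            return false;                                // page fault on real hardware
        table = entry & ADDR_MASK;
    }
    *paddr = table | (vaddr & 0xFFF);                    // add the 12-bit page offset
    return true;                                         // a 5th fetch would read the data
}

int main(void) {
    // Chain frames 0..3 as PML4/PDPT/PD/PT for vaddr 0, mapping it to frame 4.
    for (int i = 0; i < 4; i++)
        fake_ram[i * 512] = ((uint64_t)(i + 1) * 4096) | PRESENT;
    uint64_t pa;
    if (walk(0, 0x123, &pa))
        printf("vaddr 0x123 -> paddr 0x%llx\n", (unsigned long long)pa);  // 0x4123
    return 0;
}
```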
The fundamental source of your problem is that you are reading a book that is only fit for lining a cat box. What you are describing is the nonsensical gibberish that textbooks use to create confusion among students. This is not a case of oversimplification, because the authors apparently throw in a nonsensical formula for access times.
A formula like this
Effective Access Time = (1-p) x one memory access time + p x page fault service time
is total bovine fecal waste matter with no basis in reality.
The author states that when Page Table is present in Main Memory, system would need two memory accesses to read data stored in a frame.
The processor has to translate logical addresses to physical addresses using the page tables. Assuming there is no caching in the CPU, the CPU has to read the page tables for each memory access.
The number of reads depends upon the page table format used by the CPU.
Let's suppose your process has a multi-level page table. In that case the CPU has to make a read for each level of the table.
If you have a CPU that has separate linear system and user page tables, with the user tables in logical addresses, each access to system space requires one extra memory read and each access to user space requires at least two extra memory reads and might, in fact, trigger a page fault: the first read is of the system page table to find the user page table entry, the second is of the user page table, and the third is of the data itself.
In reality, every CPU on the planet does page table caching so separate reads are not required (all the time).
I cannot wrap my head around why the author suggests that, in case of no page fault, only one memory access will be needed.
It sounds like the book is not being consistent in its BS.
The reality is that logical memory translation requires a number of steps. However, what those steps are depends upon the state of the processor, something that is unpredictable. These steps take place transparently behind the scenes and you do not even need to grasp all of them to understand operating systems.
What you need to know in the real world is that the CPU translates logical addresses to physical addresses. If the CPU is unable to make that translation, it triggers a page fault.

When accessing memory, will the page table accessed/dirty bit be set under a cache hit situation?

As far as I know, a memory access by the CPU involves the CPU cache and the MMU. The CPU will try to find its target in the cache, and if a cache miss happens, the CPU turns to the MMU. During an access through the MMU, the accessed/dirty bit of the corresponding page table entry is set by hardware.
However, to the best of my knowledge, most CPU designs won't trigger the MMU unless there's a cache miss. So my question is: will the accessed/dirty bit of the page table entry still be set on a cache hit? Or is it architecture dependent?
I think you can assume these bits are cached in the TLB, and that if there is any inconsistency between the values in the TLB and the accesses done by the core, a microcode assist will be taken and the bits will be updated. For example, if the A¹ or D bit is zero and an access or store happens, this condition will be detected and the appropriate bits will be set.
You can also assume that the fast path for TLB hits doesn't go out to memory to check whether the cached TLB bits are consistent with the PTEs in RAM. Furthermore, on x86, changes to a PTE are not pushed, cache-invalidation style, to the TLBs by hardware; that is, the TLB is not coherent.
This implies that if the bits are out of sync in certain ways, they will probably not be updated correctly. E.g., if the A (resp. D) bit is set in the TLB, and an access (resp. store) occurs, nothing will happen, even if the A (resp. D) bit is actually unset in the PTE. The entity making changes to the bits is responsible for flushing TLBs so that the bits are correctly updated in the future.
¹ Having a TLB entry with A == 0 is weird: you'd expect the entry to be there as the result of an access, so the A bit would be set from the start. Perhaps there are some scenarios where this might occur, such as a page brought in by a speculative access or prefetch.
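A sketch of the assist behaviour described in this answer, with a made-up TLB entry type (real hardware does this in microcode, atomically, not in C):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PTE_A 0x20ULL   // accessed flag, bit 5 of an x86 PTE
#define PTE_D 0x40ULL   // dirty flag,    bit 6 of an x86 PTE

// A made-up TLB entry: the cached flags plus a pointer back to the in-memory PTE.
typedef struct { uint64_t flags; uint64_t *pte; } tlb_entry;

// Fast path vs. assist: on a hit, only the *cached* flags are consulted;
// the slow "assist" path updates the PTE in memory.
static void on_access(tlb_entry *e, bool is_store) {
    uint64_t needed = PTE_A | (is_store ? PTE_D : 0);
    if ((e->flags & needed) == needed)
        return;             // bits already set in the TLB: nothing happens, even
                            // if the in-memory PTE was cleared behind our back
    *e->pte |= needed;      // assist: set A (and D for a store) in the PTE...
    e->flags |= needed;     // ...and refresh the cached copy
}

int main(void) {
    uint64_t pte = 0;                        // A=0, D=0 in memory
    tlb_entry e = { 0, &pte };
    on_access(&e, true);                     // a store: the assist sets A and D
    printf("pte flags now: 0x%llx\n", (unsigned long long)pte);  // 0x60
    return 0;
}
```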
Most caches are virtually indexed and physically tagged, for faster access. The CPU issues the virtual address, and the index bits of the address are used to locate the cache entry. During this time the address is sent to the TLB to get the physical address. By the time the cache has located the entry, the TLB will have returned the physical address, which is then used for the tag comparison. Now two things can happen:
1. The TLB does not have the entry (TLB miss)
2. The cache tag does not match (cache miss)
In case 1, you need to access the page table entry (PTE) to get the correct physical address.
In case 2, if the TLB returned a valid mapping, you just need to fetch the data. If the TLB also missed (i.e., both 1 and 2), then you need to get the physical address from the PTE and then fetch the data.
So, to answer your question: in the case of a hit, the PTE doesn't need to know about it at all.
You usually can't have a cache hit if the page was never accessed in the first place, so that question is mostly irrelevant. (Edit: come to think of it, it may be possible in some bizarre cases of page aliasing, but the same answer as for the dirty bit applies there.)
It is possible to have a cached line from a clean page (one never written to previously). It's a little uncommon, since you usually need to initialize data before accessing it, but the page could have been swapped out previously and then reinstalled into the page map (the exact behavior would be OS dependent, but it is possible).
In that case, the line is cached (let's say exclusively), and you write to it. The CPU would access the cache and the TLB in parallel, attempting to look up the line in the cache while also doing a TLB access to verify the full physical address, assuming your system is virtually indexed, physically tagged, as most CPUs are these days. The TLB lookup may complete either through a TLB hit, or through a miss followed by a page walk that installs a TLB entry from the actual page map in memory.
The cache access cannot complete until the TLB access (and page walk if necessary) is done, at which point you will know the value of the access/dirty bits.
If you are trying to write to a page without the dirty bit set (or access a page without the accessed bit set), you will receive a page fault, triggering the OS to go and update the entry in the page table. The OS may choose to do various optimizations at this point, but it will eventually result in these bits being corrected.

MMU and TLB miss

Suppose the following: I have a system with virtual memory and one-level paging, I have an MMU, and the TLB is controlled by software.
OK... so imagine I'm a process, and I want to read a word in RAM at virtual address vaddr.
So the CPU gives the MMU vaddr, and the MMU checks the TLB for an entry matching the (say) 5 most significant bits of vaddr. If it's there, all is OK: it calculates the physical address and everything goes fine.
Now suppose it wasn't in the TLB. In this case, the MMU raises an interrupt (a page fault).
OK, now I'm in the page fault handler.
In the PBR (page base register), I have the address of the start of the page table.
My question: is this address a physical one? I guess yes, because if it were virtual, that would mean two things:
1) It must be reserved somehow in the virtual address space of the process (I've never heard of something like that).
2) If this address is not in the TLB, it would cause another page fault and I'd have an infinite loop.
Same question about the addresses in the tables. If I had two-level paging, is the address in an entry of the first-level table (the one that points to the second-level table) virtual or physical?
Thanks.
Homework?
In any case, such things are described in detail in the CPU's architecture manual (and you haven't even said which CPU you're talking about - x86 does not generate page faults on TLB misses).
As #zvrba pointed out, that behaviour is defined by the CPU implementation, so this question is not answerable per se.
General things to consider though:
In the PBR (page base register), I have the address of the start of the page table. My question here. Is this address a physical one?
Yes, that has to be a physical address; otherwise the MMU would first need to resolve a virtual-to-physical translation for the translation table itself.
1) Must be reserved somehow in the virtual adress space of the process
Wrong; a virtual address should be 'translated', so a reservation doesn't make sense.
2) If this address is not in the TLB, would couse again a pagefault and I'll have an infinite loop
Good try, but still wrong thinking: even if the addresses holding the translation tables are mapped, you still need to get a physical address first in order to read them (the translation tables, I mean) out of memory. Otherwise you are stuck getting the virtual-to-physical translation for the first-level table, never reaching the point of your actual RAM access. By the way, the translation tables are usually also mapped into virtual address space; that's how the OS maps/unmaps memory pages for applications.
If I had a two level paging. The address in an entry in the first level table (that points to the second level table), is somehow virtual or physical?
The page walk uses physical addresses, no matter how many levels there are. See above.
I have a system with virtual memory and one-level paging... So the CPU gives the MMU vaddr, and the MMU checks the TLB for an entry matching the (say) 5 most significant bits of vaddr
5 most significant bits... hmm... Let's say it's a 32-bit system, so the 5 most significant bits mean only bits [31..27] have an effect. That implies a page size of 2^27 = 128 MB. Why bother with an MMU if you can have only 32 pages mapped??? Keep the MMU off!!!
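The arithmetic, for anyone who wants to check it:

```c
#include <stdio.h>

int main(void) {
    const int vaddr_bits = 32, vpn_bits = 5;       // the numbers from the question
    unsigned long page_size = 1UL << (vaddr_bits - vpn_bits);
    printf("pages = %d, page size = 2^%d bytes = %lu MB\n",
           1 << vpn_bits, vaddr_bits - vpn_bits, page_size >> 20);
    // prints: pages = 32, page size = 2^27 bytes = 128 MB
    return 0;
}
```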