How to enable x86 paging only in ring 3 - operating-system

I understand that setting the PG bit of CR0 enables paging in x86 and further all addresses generated will be logical and translated using page directories and tables. However I want that logical addresses be generated only for ring 3 and wish to keep all generated addresses in ring 0 as physical. Can this be achieved somehow? I had thought of setting the PG bit just before IRET. However I doubt that would help because IRET uses ESP to pop out CS, EIP, EFLAGS, SS, ESP etc. And if the PG bit is set, then ESP would point to a logical address and the command would fail, right?

The basic problem with what you want is that x86 isn't very good at pc-relative addressing; so you need to link your kernel at some address. What will that address be?
1M (2^20) is a pretty good choice.
Next up is how much memory do you have? 4G? Thats a bit ugly, your "one true physical map" leaves no address space for user programs. Look at [2] for a fix.
Supposing you don't actually need all of memory mapped, just the kernel-y bits, and your kernel is likely less than 1G, you could keep kernel mapped pages 1:1 for the lower 1G of address space. User programs would then be in virtual addresses > 1G, with arbitrary mappings.
Not really sure the point of it all; but it is quite doable.
[2] for the tougher case, you can do something funky that has a side effect of stomping on a few of the spectre derived bugs. When an application is running, only provide kernel mappings for a very small area - the code to handle initial kernel/trap entry and the in-kernel user context. As part of the initial register save code, switch to a kernel only cr3 which is a unity mapping of all memory. If you wanted, you could actually disable paging instead, the re-enable it before re-entering user mode. Again, not sure the point, but feasible.

Related

paging and binding schemes in memory management

The concept of paging in memory management can be used with which all schemes of binding?
By binding, I mean "mapping logical addresses to physical addresses". In my knowledge there are three types of binding schemes compile time, load time and execution time binding.
Paging is not involved in compiling, so we can rule that out.
Load time can have to meanings - combining the object modules of a program and libraries to produce an executable image (program) with no unresolved symbols (unix definition) OR transferring a program into memory so it may execute (non-unix).
What unix calls loading, some other systems call link editting.
Unix loading/link-editting is really part of compiling so doesn't involve paging at all. This operation does need to know the valid program addresses it can assign, which will permit the program to load. Conventionally these are from 0 to a very large number like 2^31 or 2^47.
Transferring an image to memory and executing can be considered either phases of the same thing, or in demand loading environments, exactly the same thing. Either way, the bit of the system that prepares the program address space has to fill out a set of tables which relate a program address to a physical address.
The program address of main() might be 0x12345; which might be viewed as offset 0x345 from page 0x12. The operating system might attach that to physical page 0x100, meaning that main() might temporarily be at 0x100345. Temporarily, because the operating system is free to change this relation (conventionally called a mapping) at any time.
The dynamic nature of these mappings is a positive attribute of paging, as it permits the system to reformulate its use of physical memory to meet changing demands.

How do operating systems allow userspace programs to interact with kernelspace programs?

This isn't quite a question about a specific OS, but let's take Windows as an example. A userspace program uses the Windows API to communicate with kernelspace. However, I don't understand how that's possible. The API, according to MS websites, lives in userspace. In order to access kernelspace it has to be in kernelspace, if I understand it correctly. So what is the mechanism by which the windows API gets extra privileges to speak to kernelspace? In which space does that mechanism operate? Is this sort of thing universal to all modern PC OS's?
As you're already aware there a bunch of facilities exposed to userspace programs by the Windows kernel. (If you're curious there's a list of system calls). These system calls are all identified by a unique number, which isn't part of the publicly documented interface given by Microsoft. Instead when you call a publicly exposed function from your program there's a DLL installed when you install (or update) Windows that has an entry point which is just a normal, unprivileged user mode function call. This DLL knows the mappings between public interfaces and the available system calls in the currently running kernel. These mappings are not always 1:1 which allows for tweaks and enhancements without breaking existing code using stable interfaces.
When some userland code calls one of these functions its role is to prepare arguments for the system call and then initiate the jump into kernel mode. How exactly that jump occurs is specific to the architecture that Windows is currently running on. In fact it varies not just between x86 and Arm but between AMD and Intel x86 systems even. I'll talk just about the modern Intel x86 32-bit case (using the SYSENTER instruction) here for simplicity. On x86 most of the other variations are relatively minor, for instance int 2Eh was used prior to SYSENTER support.
Early in boot up the operating system does a bunch of work to prepare for enabling a userland and system calls from it. Understanding this is critical to understanding how system calls really work.
First let's rewind a little and consider what exactly we mean by userland and kernelmode. On x86 when we talk about privileged vs un-privileged code we talk about "rings". There are actually 4 (ignoring hypervisors) but for various reasons nobody really used anything but ring0 (kernel) and ring3 (userland). When we run code on x86 the address that's being executed (EIP) and data that's being read/written come from segments.
Segments are mostly just a historical accident left over from the days before virtual addressing on x86 was a thing. They are however important for us here because there are special registers that define which segments are currently being used when we execute instructions or otherwise reference memory. Segments on x86 are all defined in a big table, called the Global Descriptor Table or GDT. (There's also a local descriptor table, LDT, but that's not going to further the current discussion here). The important point for our discussion here is that the (arcane) layout of the table entries include 2 bits, called DPL which define the privilege level of the currently active segment. You'll notice that 2 bits is exactly enough to define 4 levels of privilege.
So in short when we talk about "executing in kernel mode" we really just mean that our active code segment (CS) and data segment selectors point to entries in the GDT which have DPL set to 0. Likewise for userland we have a CS and data segment selectors pointing to GDT entries with DPL set to 3 and no access to kernel addresses. (There are other selectors too, but to keep it simple we'll just consider "code" and "data" for now).
Back to early on during kernel boot up: during start up the kernel creates the GDT entries we need. (These have to be laid out in a specific order for SYSENTER to work, but that's mostly just an implementation detail). There are also some "machine specific registers" that control how our processor behaves. These can only be set by privileged code. Three of them that are important here are:
IA32_SYSENTER_ESP
IA32_SYSENTER_EIP
IA32_SYSENTER_CS
Recall that we've got some code runnig in userland (ring3) that wants to transition to ring0. Let's assume that it has saved any registers that it needs to per the calling convention and put arguments into the right registers that the call expects. We then hit the SYSENTER instruction. (Actually it uses KiFastSystemCall I think). The SYSENTER instruction is special. It modifies the current code and data segment selectors based on the value that the kernel setup in the machine specific register IA32_SYSENTER_CS. (The stack/data segument values are computed as an offset of IA32_SYSENTER_CS). Subsequently the stack pointer itself (ESP) is set to the kernel stack that was setup for handling system calls earlier on and saved into the MSR IA32_SYSENTER_ESP and likewise for EIP the instruction pointer from IA32_SYSENTER_EIP.
Since the CS selector now points to a GDT entry with DPL set to 0 and EIP points to kernel mode code on a kernel stack we're running in the kernel at this point.
From here onwards the kernel mode code can read and write memory from both kernel and userspace (with some appropriate caution) to undertake the actual work needed to perform the system call. The arguments to the system call can be read from registers etc. according to the calling convention, but any arguments that are actually pointers back to userland or handles to kernel objects can be accessed to read larger blocks of data too.
When the system call is over the process is basically reversed and we end up back in userland with DPL 3 for the selectors.
Its the CPU that is acts as intermediate for transfer of information between user memory space(accessible in user mode) to protected memory space(accessible in kernel mode), via CPU registers.
Here's an Example:
Suppose a user writes a program in higher level language. Now when execution of the program happens, CPU generates the virtual addresses.
Now before any read/write operation occurs, the virtual address, is converted to physical address. Because the translation mechanism(memory management unit), is only accessible in kernel mode, cause its stored in protected memory, the translation occurs in kernel mode and the physical address is finally saved into some register of the CPU, and only then a read/write operation occurs.

What exactly happens inside the OS to cause the segmentation fault

I have read a lot about the virtual address and paging.
Let me first tell you people what i understood. When a process wants to execute something it tries to load the data from the hard disk to memory. To do this it uses a virtual address. So our MMU validates the virtual address looks into the TLB to find the corresponding physical page, if it doesn't find there it looks into Inverted Page Table and at the end it looks into Page table if it doesn't find an entry over there it generates a page fault and all the swapping of page is done and all the tables will be updated.
And as I read all the processes have different page tables and same virtual address. so if I try access an array element a[1000] which was defined as int a[100] I am sure that am gonna get a segmentation fault cause that instruction might be trying to access a memory that doesn't belong to it. but how OS comes to know that a[1000] doesn't belong to the running process by just using the concept of virtual address and physical pages. Am I missing something here or my entire understanding is wrong?
I know we can say a memory access is illegal if a process is trying to access a read only or sup true memory segment.
at the end the boiling question is how OS decides which memory is allocated to which process and how it decides that this access of memory is illegal.
What is a segmentation fault on Linux?
this link didn't help much .
thanks a lot in advance for all you lovely people's inputs :)
Actually each and every process in the Linux kernel has some well defined structure associated with it called task_struct which stores all the information about the corresponding process like its parent, its PID, its child, its address space, pending signals, threads associated with it etc. Now the address space entries tell the kernel while executing the process the legal address space for that process.Every process has its own address space allotted to it by the kernel right from the very beginning when it is created. So when a executing process tries to access the space outside its legal memory a fault is generated(called segmentation fault in Unix/Linux) and the process is terminated by giving a signal to it by the kernel. Its important for memory protection to be achieved in OS.
On x86, linux uses a combination of segmentation and paging, so the address generated by program first looks up for the corresponding segments base and limit registers values. This gives the virtual address which is then translated using the page table. When you try to access a memory which has not been allocated, the accessed page is beyond what the limit register allows, hence generating a segmentation fault.

Providing physical address directly to the TLB in x86-64

Is it possible to provide physical address for a given virtual address in a direct way to the TLB on x86-64 architectures in long mode?
For example, lets say, I put zeros in PML4E, so a page fault exception will be triggered because an invalid address will be found, during the exception can the CPU tell the TLB by using some instruction that this virtual address is located at X physical page frame?
I want to do this because by code I can easily tell where the physical address would be, and this way avoid expensive page walk.
No, you need to put a page to the TLB. To be precise, you need to create/update appropriate PTE (with PDE and PDPE if needed). Everything around MMU management is somehow based on page tables and TLB. Even user/supervisor protection mode is done as a special flag of mapped page.
Why do you think that "page walk" is expensive operation? It is not expensive at all. To determine the PTE that must be updated you need to dereference only 4 pointers: PML4E -> PDPE -> PDE -> PTE. These entries are just indices in related tables. To get PML4E you need to use 39-47 bits of address taken during page fault handling and use the value as an index in PML4 table. To get PDPE you need 30-39 bits of an address as an index in PDE table and so on. It's not the thing that can slow down your system. I think allocation of a physical page takes more time than that.

Does the MMU mediate everything between the operating system and physical memory or is it just an address translator?

I'm trying to understand how does an operating system work when we want to assign some value to a particular virtual memory address.
My first question concerns whether the MMU handles everything between the CPU and the RAM. Is this true? From what one can read from Wikipedia, I'd say so:
A memory management unit (MMU), sometimes called paged memory
management unit (PMMU), is a computer
hardware component responsible for
handling accesses to memory requested
by the CPU.
If that is the case, how can one tell the MMU I want to get 8 bytes, 64 or 128bytes, for example? What about writing?
If that is not the case, I'm guessing the MMU just translates virtual addresses to physical ones?
What happens when the MMU detects there will be what we call a page-fault? I guess it has to tell it to the CPU so the CPU loads the page itself off disk, or is the MMU able to do this?
Thanks
Devoured Elysium,
I'll attempt to answer your questions one by one but note, it might be a good idea to get your hands on a textbook for an OS course or an introductory computer architecture course.
The MMU consists of some hardware logic and state whose purpose is, indeed, to produce a physical address and provide/receive data to and from the memory controller. Actually, the job of memory translation is one that is taken care of by cooperating hardware and software (OS) mechanisms (at least in modern PCs). Once the physical address is obtained, the CPU has essentially done its job and now sends the address out on a bus which is at some point connected to the actual memory chips. In many systems this bus is called the Front-Side Bus (FSB), which is in turn connected to a memory controller. This controller takes the physical address supplied by the CPU and uses it to interact with the DRAM chips, and ultimately extract the bits in the correct rows and columns of the memory array. The data is then sent back to the CPU, which can now operate on it. Note that I'm not including caching in this description.
So no, the MMU does not interact directly with RAM, which I assume you are using to mean the physical DRAM chips. And you cannot tell the MMU that you want 8 bytes, or 24 bytes, or whatever, you can only supply it with an address. How many bytes that gets you depends on the machine you're on and whether it's byte-addressable or word-addressable.
Your last question urges me to remind you: the MMU is actually a part of the CPU--it sits on the same silicon die (although this was not always the case).
Now, let's take your example with the page fault. Suppose our user-level application wants to, like you said, set someAddress = 10, I'll take it in steps. Let's assume someAddress is 0xDEADBEEF and let's ignore caches for now.
1) The application issues a store instruction to 0xsomeAddress, which, in x86 might look something like
mov %eax, 0xDEADBEEF
where 10 is the value in the eax register.
2) 0xDEADBEEF in this case is a virtual address, which must be translated. Most of the time, the virtual to physical address translation will be available in a hardware structure called the Translation Lookaside Buffer (TLB), which will provide this translation to us very fast. Typically, it can do so in one clock cycle. If the translation is in the TLB, called a TLB hit, execution can continue immediately (i.e. the physical address corresponding to 0xDEADBEEF and the value 10 are sent out to the memory controller to be written).
3) Let's suppose though, that the translation wasn't available in the TLB (called a TLB miss). Then we must find the translation in the page tables, which are structures in memory whose structure is defined by the hardware and managed by the OS. They simply contain entries that map a virtual address to a physical one (more accurately, a virtual page number to a physical page number). But these structures also reside in memory, and so must have addresses! The hardware contains a special register called cr3 which contains the physical address of the current page table. We can index into this page table using our virtual address, so the hardware takes the value in cr3, computes an address by adding an offset, and goes off to memory to fetch the page table entry (PTE). This PTE will (hopefully) contain the physical address corresponding to 0xDEADBEEF, in which case we put this mapping in the TLB (so we don't have to walk the page table again) and continue on our way.
4) But oh no! What if there is no PTE in the page tables for 0xDEADBEEF? This is a page fault, and this is where the Operating System comes into play. The PTE we got out of the page table existed, as in it was (let's assume) a valid memory address to access, but the OS had not created a VA->PA mapping for it yet, so it would have had a bit set to indicate that it is invalid. The hardware is programmed in such a way that when it sees this invalid bit upon an access, it generates an exception, in this case a page fault.
5) The exception causes the hardware to invoke the OS by jumping to a well known location--a piece of code called a handler. There can be many exception handlers, and a page fault handler is one of them. The page fault handler will know the address that caused the fault because it's stored in a register somewhere, and so will create a new mapping for our virtual address 0xDEADBEEF. It will do so by allocating a free page of physical memory and then saying "all virtual addresses between VA x and VA y will map to some address within this newly allocated page of physical memory". 0xDEADBEEF will be somewhere in that range, so the mapping is now securely in the page tables, and we can restart the instruction that caused the page fault (the mov).
6) Now, when we go through the page tables again, we will find a mapping and the PTE we pull out will have a nice physical address, the one we want to store to. We provide this with the value 10 to the memory controller and we're done!
Caches will change this game quite a bit, but I hope this serves to illustrate how paging works. Again, it would benefit you greatly to check out some OS/Computer Architecture books. I hope this was clear.
There are data structures that describe which virtual addresses correspond to which physical addresses. The OS creates and manages these data structures, and the CPU uses them to translate virtual addresses into physical addresses.
For example, the OS might use these data structures to say "virtual addresses in the range from 0x00000000 to 0x00000FFF correspond to physical addresses 0x12340000 to 0x12340FFFF"; and if software tries to read 4 bytes from the virtual address 0x00000468 then the CPU will actually read 4 bytes from the physical address 0x12340468.
Typically everything is effected by the virtual->physical translation (except for when the CPU is accessing the data structures that describe the translation). Also, usually there's some sort of translation cache build into the CPU to help reduce the overhead involved.