Why are page faults usually handled by the OS, not hardware?

I find that when handling a TLB miss, some architectures use hardware while others use the OS. But when it comes to page faults, most of them use the OS instead of hardware.
I tried to find the answer but couldn't find any article that explains why.
Could anyone help with this?
Thanks.

If the hardware could handle it on its own, it wouldn't need to fault.
The whole point is that the OS hasn't wired the page into the hardware page tables, e.g. because it's not actually in memory at all, or because the OS needs to catch an attempt to write so the OS can implement copy-on-write.
Page faults come in three categories. The first two are valid (the process logically has the memory mapped, but the OS was lazy or playing tricks); the third is invalid:
valid, hard: the page needs to be paged in from disk, either from swap space or from a disk file (e.g. a memory mapped file, like a page of an executable or shared library). Usually the OS will schedule another task while waiting for I/O.
valid, soft: no disk access required, just for example allocating + zeroing a new physical page to back a virtual page that user-space just tried to write. Or copy-on-write of a writeable page that multiple processes had mapped, but where changes by one shouldn't be visible to the other (like mmap(MAP_PRIVATE)). This turns a shared page into a private dirty page.
invalid: There wasn't even a logical mapping for that page. A POSIX OS like Linux will deliver a SIGSEGV signal to the offending process/thread.
The hardware doesn't know which is which. All it knows is that a page walk didn't find a valid page-table entry for that virtual address, so it's time to let the OS decide what to do next (i.e. raise a page-fault exception which runs the OS's page-fault handler). Valid/invalid are purely software/OS concepts.
These example reasons are not an exhaustive list. e.g. an OS might remove the hardware mapping for a page without actually paging it out, just to see if the process touches it again soon. (In which case it's just a cheap soft page fault. But if not, then it might actually page it out to disk. Or drop it if it's clean.)
For HW to be able to fully handle a page fault, we'd need data structures with a hardware-specified layout that somehow lets hardware know what to do in some possible situations. Unless you build a whole kernel into the CPU microcode, it's not possible to have it handle every page fault, especially not invalid ones which require reading the OS's process / task-management data structures and delivering a signal to user-space. Either to a signal handler if there is one, or killing the process.
And especially not hard page faults, where a multi-tasking OS will let some other process run while waiting for the disk to DMA the page(s) into memory, before wiring up the page tables for this process and letting it retry the faulting load or store instruction.
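To make the three categories concrete, here is a minimal sketch of the decision the OS's page-fault handler has to make. The `Region` record and its fields are hypothetical stand-ins for whatever bookkeeping a real kernel keeps about a process's logical mappings; nothing here is taken from an actual OS. The point is only that the classification depends on OS-side data the hardware never sees.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Fault(Enum):
    HARD = auto()     # valid, but the data must come from disk
    SOFT = auto()     # valid, no disk I/O needed
    INVALID = auto()  # no logical mapping, or access not allowed (SIGSEGV on POSIX)


@dataclass
class Region:
    """Hypothetical record of one logical mapping owned by a process."""
    start: int
    end: int                  # exclusive
    writable: bool            # is writing logically allowed?
    on_disk: bool = False     # contents live in a file or in swap
    resident: bool = False    # contents already sit somewhere in RAM


def classify_fault(regions, addr, is_write):
    """Decide what a fault at `addr` means, using only OS-side bookkeeping."""
    for r in regions:
        if r.start <= addr < r.end:
            if is_write and not r.writable:
                return Fault.INVALID
            if r.on_disk and not r.resident:
                return Fault.HARD
            return Fault.SOFT   # e.g. zero-fill a new page, or copy-on-write
    return Fault.INVALID
```

A real handler would go on to act on the result (start I/O and block the task, allocate a frame, or deliver a signal); this sketch stops at the classification step.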

Related

Does a page fault mean the CPU is blocked until the page is brought into RAM?

I am not quite sure about which work will be done by CPU and which will be done by the OS when a page fault occurs. That's why I'm asking the following questions.
Consider a single-core CPU, with several processes running. When a page fault occurs, the OS would try to fetch the required page from disk to RAM, which will cost a long time. During this time period, can the CPU keep on executing? Or the CPU has to wait until the required page is loaded into RAM?
If the CPU can keep on executing without waiting for the required page, then thrashing may occur when there are too many processes. At some point, most of the instructions that the CPU executes will cause page faults, and then most of the time will be spent waiting for the OS to load pages from disk into RAM. That's why thrashing occurs. Is my understanding correct?
Thanks in advance.
Update: this website describes thrashing very well.
The CPU doesn't know that it's "in" a page fault. CPUs aren't recursive!
When a 32-bit x86 CPU (for example) encounters a page fault, here's what it does (slightly simplified):
Set the value of CR2 to the address which caused the page fault.
Look at the Interrupt Descriptor Table and some other tables and find the address of the page fault handler (new CS, new EIP) and the kernel stack (new SS, new ESP).
Set the values of CS, EIP, SS, and ESP to the ones it just read.
Push the old SS, old ESP, EFLAGS, old CS and old EIP onto that (kernel) stack.
Update the flags to say we're now in kernel mode.
That's all it does. Now, there is some data on the stack that the kernel uses when it wants to make the CPU go back to what it was doing before the page fault happened. But the kernel isn't obligated to use that data. It could go back somewhere entirely different, or it could never go back. It's up to the kernel. The CPU doesn't care.
A usual kernel will first save all the other registers (important!), look at the address, decide where to get the page, tell the disk to start fetching the page, make a note that the process is stopped because of a page fault, and then it will go and do something entirely different until the data comes back from the disk. It might run a different process, for example. If there are no processes left to run, it might turn off the CPU (yes, really).
Eventually the data comes back from the disk and the kernel sees that there's a process waiting for that data because of a page fault, and it updates the page table so the process can see the data, and it restores all the registers, including SS, ESP, EFLAGS, CS, and EIP. Now the CPU is doing whatever it was doing before.
The key point to notice is: the CPU only cares what's in its registers right now! It doesn't have a long-term memory. If you save the register values somewhere, you can make it stop doing whatever it was doing, and resume it later as if nothing ever happened. For example, there is absolutely no requirement that you have to return from function calls in the order they happened. The CPU doesn't care if you have a function that returns twice, for example (see setjmp), or if you have two coroutines and calling yield in one coroutine causes yield to return in the other one. You don't have to do things in stack order like you do in C.
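The "save the state, do something else, resume later" idea can be sketched with Python generators standing in for processes. This is a toy model, not how any kernel is written: the `task` and `run` names, the `"fault"` status string, and the `disk_latency` tick count are all made up for illustration. A generator's suspended frame plays the role of the saved registers.

```python
from collections import deque


def task(name, steps, fault_at, log):
    """A toy process: runs `steps` steps and 'page-faults' once before step `fault_at`."""
    for i in range(steps):
        if i == fault_at:
            log.append(f"{name}: fault")
            yield "fault"        # control returns to the 'kernel'; our state is kept
        log.append(f"{name}: step {i}")
        yield "ok"


def run(tasks, disk_latency=2):
    """Round-robin over ready tasks; a faulting task waits `disk_latency` ticks."""
    ready = deque(tasks)
    blocked = []                 # entries are [ticks_left, task]
    while ready or blocked:
        # The 'disk' makes progress; finished requests unblock their tasks.
        for entry in blocked:
            entry[0] -= 1
        for entry in [e for e in blocked if e[0] <= 0]:
            blocked.remove(entry)
            ready.append(entry[1])
        if not ready:
            continue             # nothing runnable: a real kernel would idle the CPU
        t = ready.popleft()
        try:
            status = next(t)     # resume exactly where the task left off
        except StopIteration:
            continue             # task finished
        if status == "fault":
            blocked.append([disk_latency, t])
        else:
            ready.append(t)
```

With two tasks where only the first one faults, the log shows the second task running while the first waits for its page, and the first one later resuming at the step that faulted.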
In a cooperative multitasking OS the OS cannot initiate a context switch, so the CPU must wait for the page to be brought in.
Modern systems are preemptive multitasking systems. In this case the OS will most likely initiate a context switch and so other threads/processes will run on the CPU.
Thrashing is a concern when the amount of memory used far exceeds the capacity of the RAM. "Download more RAM" is a meme for a reason.
The CPU can keep on executing.
The CPU cannot, however, carry on execution of the thread that incurred the fault. That thread needs the fault to be resolved before the very next instruction can be executed. That is, it must block on the fault.
That many threads/processes may be blocked on fault handling is not in itself thrashing. Thrashing occurs when, in order to bring a page in, there are insufficient free page frames, so it is necessary to write a page out. But then, when the OS tries to find another thread to run, it picks the thread that owned a page it just wrote out, so it has to fault that page back in.
Thrashing is therefore a symptom of insufficient available real memory.
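A small simulation can illustrate why insufficient real memory is the trigger. The sketch below counts page faults under LRU replacement, which is an assumption chosen for simplicity (real kernels use approximations of it), and the two-process reference pattern is invented for the example.

```python
from collections import OrderedDict


def count_faults(reference_string, num_frames):
    """Count page faults for an LRU-managed set of `num_frames` physical frames."""
    frames = OrderedDict()       # pages ordered from least to most recently used
    faults = 0
    for page in reference_string:
        if page in frames:
            frames.move_to_end(page)
        else:
            faults += 1
            if len(frames) >= num_frames:
                frames.popitem(last=False)   # evict the least recently used page
            frames[page] = None
    return faults


# Two hypothetical processes, A and B, each looping over its own 3 pages,
# scheduled alternately: 60 references to 6 distinct pages.
refs = []
for i in range(30):
    refs.append(("A", i % 3))
    refs.append(("B", i % 3))

print(count_faults(refs, 6))   # enough frames for both working sets
print(count_faults(refs, 4))   # too few frames: pages are evicted just before reuse
```

With 6 frames only the first touch of each page faults. With 4 frames every single reference faults, because each page is evicted shortly before it is needed again, which is the pattern described above.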

Stored Program Computer in modern computing

I was given this exact question on a quiz.
Question
Answer
Does the question make any sense? My understanding is that the OS schedules a process and manages what instructions it needs the processor to execute next. This is because the OS is liable to pull all sorts of memory management tricks, especially in main memory where fragmentation is a way of life. I remember that there is supposed to be a special register on the processor called the program counter. In light of the scheduler and memory management done by the OS I have trouble figuring out the purpose of this register unless it is just for the OS. Is the concept of the Stored Program Computer really relevant to how a modern computer operates?
Hardware fetches machine code from main memory, at the address in the program counter (which increments on its own as instructions execute, or is modified by executing a jump or call instruction).
Software has to load the code into RAM (main memory) and start the process with its program counter pointing into that memory.
And yes, if the OS wants to page that memory out to disk (or lazily load it in the first place), hardware will trigger a page fault when the CPU tries to fetch code from an unmapped page.
But no, the OS does not feed instructions to the CPU one at a time.
(Unless you're debugging a program by putting the CPU into "single step" mode when returning to user-space for that process, so it traps after executing one instruction. Like x86's trap flag, for example. Some ISAs only have software breakpoints, not HW support for single stepping.)
But anyway, the OS itself is made up of machine code that runs on the CPU. CPU hardware knows how to fetch and execute instructions from memory. An OS is just a fancy program that can load and manage other programs. (Remember, in a von Neumann architecture, code is data.)
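As a rough illustration of the hardware fetching instructions on its own, here is a fetch-decode-execute loop for a made-up machine. The five opcodes and the tuple encoding are invented for this sketch; the only thing it is meant to show is that code and data share one memory and that the program counter, not the OS, selects the next instruction.

```python
def run(memory, pc=0):
    """Execute a program stored in `memory`, starting at address `pc`.

    `memory` is one list holding both instructions (opcode, operand) and data.
    """
    acc = 0
    while True:
        op, arg = memory[pc]     # fetch from main memory at the address in the PC
        pc += 1                  # the PC advances on its own...
        if op == "LOAD":
            acc = memory[arg]
        elif op == "ADD":
            acc += memory[arg]
        elif op == "STORE":
            memory[arg] = acc
        elif op == "JMP":
            pc = arg             # ...unless a jump overwrites it
        elif op == "HALT":
            return memory
        else:
            raise ValueError(f"unknown opcode {op!r}")


program = [
    ("LOAD", 6),    # 0: acc = memory[6]
    ("ADD", 7),     # 1: acc += memory[7]
    ("JMP", 4),     # 2: skip the next instruction
    ("ADD", 7),     # 3: never executed
    ("STORE", 8),   # 4: memory[8] = acc
    ("HALT", 0),    # 5
    2, 3, 0,        # 6, 7, 8: data words
]
```

Calling `run(program)` leaves the sum of the two data words in `memory[8]`; no outside software hands over instructions one at a time.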
Even the OS has to depend on the processing architecture. Memory today is often virtualized. That means the memory location seen by the program is not the real physical location, but is indirected through one or more tables describing the actual location and some attributes (e.g. whether read/write/execute is allowed) for memory accesses. If the accessed virtual memory has not been loaded into main memory (these tables say so), an exception is generated, and the address of an exception handler is loaded into the program counter. This exception handler is provided by the OS and resides in main memory. So the program counter is quite relevant in today's computers, but the next instruction can be changed on the fly by exceptions (interrupts and exceptions are also what trigger thread or process switching in preemptive multitasking systems).
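The table lookup described above can be sketched as follows. This assumes a single-level table and 4 KiB pages purely to keep it short; real hardware uses multi-level tables, and the `PageFault` class and the `(frame, present, writable)` entry layout are hypothetical.

```python
PAGE_SIZE = 4096


class PageFault(Exception):
    """Stands in for the hardware exception; carries the faulting address."""

    def __init__(self, vaddr, reason):
        super().__init__(f"page fault at {vaddr:#x}: {reason}")
        self.vaddr = vaddr
        self.reason = reason


def translate(page_table, vaddr, write=False):
    """Map a virtual address to a physical one via a one-level page table.

    `page_table` maps virtual page number -> (frame number, present, writable).
    """
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    entry = page_table.get(vpn)
    if entry is None or not entry[1]:
        raise PageFault(vaddr, "not present")
    frame, _present, writable = entry
    if write and not writable:
        raise PageFault(vaddr, "write to read-only page")
    return frame * PAGE_SIZE + offset
```

In this model, raising `PageFault` corresponds to the hardware loading the handler's address into the program counter; what happens next is entirely up to the OS code that catches it.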
Does the question make any sense?
Yes. It makes sense to me. It is a bit imprecise, but the meanings of each of the alternatives are sufficiently distinct to be able to say that D) is the best answer.
(In theory, you could create a von Neumann computer which was able to execute instructions out of secondary storage, registers or even the internet ... but it would be highly impractical for various reasons.)
My understanding is that the OS schedules a process and manages what instructions it needs the processor to execute next. This is because the OS is liable to pull all sorts of memory management tricks, especially in main memory where fragmentation is a way of life.
Fragmentation of main memory is not actually relevant. A modern machine uses special hardware (and page tables) to deal with that. From the perspective of executing code (application or kernel) this is all hidden. The code uses virtual addresses, and the hardware maps them to physical addresses. (This is even true when dealing with page faults, though special care will be taken to ensure that the code and page table entries for the page fault handler are in RAM pages that are never swapped out.)
I remember that there is supposed to be a special register on the processor called the program counter. In light of the scheduler and memory management done by the OS I have trouble figuring out the purpose of this register unless it is just for the OS.
The PC is fundamental. It contains the virtual memory address of the next instruction that the CPU is to execute. For application code AND for OS kernel code. When you switch between the application and kernel code, the value in the PC is updated as part of the context switch.
Is the concept of the Stored Program Computer really relevant to how a modern computer operates?
Yes. Unless you are working on a special custom machine where (say) the program has been transformed into custom silicon.

If using Pure Demand Paging, how does the CPU know where the first instruction is in the executable?

I am reading Chapter 9 of Operating System Concepts and the concept of pure demand paging is described as follows:
In the extreme case, we can start executing a process with no pages in memory. When the operating system sets the instruction pointer to the first instruction of the process, which is on a non-memory-resident page, the process immediately faults for the page....
But if NONE of the pages, particularly the pages containing code, are in memory, how does the OS know where the program counter should point in the first place? Is the program counter set as part of process creation by inspecting the program image on disk? If so, I would assume the OS knows the format of the binary image and can directly access that info on disk. And it will only make sense if somehow this info is stored in a part of the program image not needed during program execution, if the OS decides not to bring the page containing this info into memory.
To summarize, I would like to know:
How is program counter set for a new process if using pure demand paging?
Is any real OS using pure demand paging and what benefit does it have?
How does an executable's binary format (e.g. ELF, PE formats) help the OS do demand paging (OS needs to know where the first page is at least?)
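For the ELF case specifically, the entry address is a field (`e_entry`) in the fixed-size header at the very start of the file, so it can be read without loading any code pages. The sketch below parses just that field from the first bytes of a file; the function name is made up, and it ignores everything else a real loader does (program headers, dynamic linking, address randomization).

```python
import struct


def elf_entry_point(header: bytes) -> int:
    """Return e_entry given at least the first 32 bytes of an ELF file."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is_64bit = header[4] == 2                  # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if header[5] == 1 else ">"    # EI_DATA: 1 = little, 2 = big endian
    # e_entry follows e_ident (16 bytes), e_type (2), e_machine (2), e_version (4),
    # so it starts at byte offset 24 in both the 32-bit and 64-bit layouts.
    fmt = endian + ("Q" if is_64bit else "I")
    (entry,) = struct.unpack_from(fmt, header, 24)
    return entry
```

On a system with ELF binaries, something like `elf_entry_point(open(path, "rb").read(32))` for some executable `path` would return the virtual address execution would start at for a statically linked program (for dynamically linked programs the kernel typically starts in the dynamic linker instead). Whether the page holding that address is resident is a separate matter that the first instruction fetch sorts out by faulting.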

Are the following user-only or OS-only instructions?

I have these options in my homework. I will explain my reason and I hope someone can critique them?
Indicate whether each of the following CPU instructions is user-only, O/S-only, or both:
Execution of 'sleep' instruction that halts CPU execution
user-only because I've only seen programmers writing sleep
Loading the 'program counter' PC register with a new memory address
I think it's O/S only.
Reading of disk controller register
O/S only.
'trap' that generates interrupt
From what I understand trap is usually a user-program fault and since O/S is a software application, so probably BOTH
Loading of alarm timeout value into clock register
O/S only
Reading the processor status word PSW register
O/S only.
Loading the memory lower bounds register
O/S only
Adding the contents of two memory locations
both. O/S needs to do computation too.
I don't really understand how to make a distinction between user and O/S specific instructions. They are all essentially "user" programs..
Can someone verify these answers, tell me why I am wrong, and how to tackle these questions?
I don't really understand how to make a distinction between user and O/S specific instructions. They are all essentially "user" programs.
Here's the difference: Did you start a task to have that happen, or did it happen on its own?
Did you start a task to read from the hard drive, or did you merely instruct the OS to do so? (all device access is an OS instruction, for the most part)
Sometimes professors want you to say that "reading the hard drive is user initiated" but "preemptive multitasking by the OS is always OS initiated" or "user actions may remain in a limited state while waiting on a device to finish responding and the OS to return control in a pre-emptive multitasking OS"
These are how I interpret the answers, but if you can't find these answers in the coursework then adopting my answers won't help you any. Notice that I gave a short blurb after each to explain why I chose these things. I am not your professor and have no way to know what he/she intends, so be sure that you can understand my responses. Also, having programmed in ASM helps to answer some of these ...
Execution of 'sleep' instruction that halts CPU execution
O/S. Sleep is actually just a counter that says to skip execution for one or more cycles, and is most often modeled by an API call. This can allow the scheduler access to delay reloading the pre-empted task until many rounds later. Once again, many very basic platforms would require a NOP loop counter to even come close to emulating a sleep command.
Loading the 'program counter' PC register with a new memory address
O/S. The Program Counter register is intended to be used by the system to keep track of the current execution of a program, and during multi-process pre-emption may be used to save the current execution point of the program.
Reading of disk controller register
O/S. In general User commands do not interface the disk subsystem, although on older systems they may be accessed, often by direct register access. In more modern systems, the disk is accessed only by the O/S, and is only accessed by the User via API.
'trap' that generates interrupt
User, O/S. This is when we generate a request for the O/S to handle a situation for us, so we give up control to the internal kernel. It can also result in something returning a faulted condition.
Loading of alarm timeout value into clock register
O/S. These timers are often regarded as having system-only level access, as they are used to monitor the rest of the system. Will be generally protected in CPUs that support such protection (such as those that support ring-level execution prevention).
Reading the processor status word PSW register
User, O/S. Notably the PSW registers are system-level controlled ONLY. On rare occasion one may find a system which allows one, two or merely some of the PSW registers to be read by a user. Since these are status fields for program execution, they aren't normally required to be user readable.
Loading the memory lower bounds register
User, O/S. All memory register assignment is done through CPU commands which are directly received from the binary executable loaded into the CPUs registers. There are no restrictions (aside from changing execution ring level, in participating processors) which are particularly prevented from happening at the application level. Some device interaction may or may not be permitted, and often registers are how devices are interacted with on older hardware. Note that the base memory address may not be 0, and the O/S may intercept memory calls specifically to sandbox the application.
Adding the contents of two memory locations
User, O/S. This is a fundamental requirement of algorithm design, and is often one of the first and most basic commands designed into a CPU unit.

Memory mapped files and "soft" page faults. Unavoidable?

I have two applications (processes) running under Windows XP that share data via a memory mapped file. Despite all my efforts to eliminate per-iteration memory allocations, I still get about 10 soft page faults per data transfer. I've tried every flag there is in CreateFileMapping() and MapViewOfFile() and it still happens. I'm beginning to wonder if it's just the way memory mapped files work.
If anyone there knows the O/S implementation details behind memory mapped files I would appreciate comments on the following theory: If two processes share a memory mapped file and one process writes to it while another reads it, then the O/S marks the pages written to as invalid. When the other process goes to read the memory areas that now belong to invalidated pages, this causes a soft page fault (by design) and the O/S knows to reload the invalidated page. Also, the number of soft page faults is therefore directly proportional to the size of the data write.
My experiments seem to bear out the above theory. When I share data I write one contiguous block of data. In other words, the entire shared memory area is overwritten each time. If I make the block bigger the number of soft page faults goes up correspondingly. So, if my theory is true, there is nothing I can do to eliminate the soft page faults short of not using memory mapped files because that is how they work (using soft page faults to maintain page consistency). What is ironic is that I chose to use a memory mapped file instead of a TCP socket connection because I thought it would be more efficient.
Also, if soft page faults are harmless, please say so. I've heard that at some point, if the number is excessive, the system's performance can suffer. If soft page faults are not intrinsically harmful, I'd like to hear any guidelines on what number per second counts as "excessive".
Thanks.
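For anyone wanting to reproduce this kind of measurement, here is a rough sketch of counting soft (minor) faults around a block of memory touches. Note the assumptions: it is Unix-only (it relies on Python's `resource` module and `ru_minflt`), it uses a fresh anonymous mapping in a single process rather than a file shared between two processes, and it is not a Windows XP equivalent. It only shows the measurement technique, not an explanation of the behavior described above.

```python
import mmap
import resource


def minor_faults_touching(num_pages):
    """Return how many minor (soft) faults this process took while first
    writing to `num_pages` pages of a fresh anonymous mapping (Unix only)."""
    page = mmap.PAGESIZE
    buf = mmap.mmap(-1, num_pages * page)     # anonymous mapping, not yet backed by RAM
    before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    for i in range(num_pages):
        buf[i * page] = 1                     # first write to each page
    after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    buf.close()
    return after - before
```

Calling it with increasing `num_pages` is one way to check whether the fault count scales with the amount of memory touched on a given system; the exact numbers depend on the OS and its settings.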