Memory mapped files and "soft" page faults. Unavoidable? - memory-mapped-files

I have two applications (processes) running under Windows XP that share data via a memory mapped file. Despite all my efforts to eliminate per-iteration memory allocations, I still get about 10 soft page faults per data transfer. I've tried every flag there is in CreateFileMapping() and MapViewOfFile() and it still happens. I'm beginning to wonder if it's just the way memory mapped files work.
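For reference, the writer side of this kind of setup looks roughly like the sketch below; the mapping name and size are placeholders rather than the actual application code, and the reader would call OpenFileMappingA() with the same name and map its own view.

```c
/* Writer side: minimal sketch of a pagefile-backed shared section.
 * "Local\\SharedBlock" and the 64 KB size are hypothetical placeholders. */
#include <windows.h>
#include <string.h>

int main(void) {
    const DWORD size = 64 * 1024;
    HANDLE hMap = CreateFileMappingA(
        INVALID_HANDLE_VALUE,       /* back the section with the pagefile */
        NULL, PAGE_READWRITE,
        0, size,                    /* high/low parts of the section size */
        "Local\\SharedBlock");
    if (hMap == NULL) return 1;

    char *view = (char *)MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, size);
    if (view == NULL) { CloseHandle(hMap); return 1; }

    memset(view, 0xAB, size);       /* the write that dirties the shared pages */

    UnmapViewOfFile(view);
    CloseHandle(hMap);
    return 0;
}
```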
If anyone here knows the O/S implementation details behind memory mapped files, I would appreciate comments on the following theory: if two processes share a memory mapped file and one process writes to it while another reads it, then the O/S marks the pages written to as invalid. When the other process goes to read the memory areas that now belong to invalidated pages, this causes a soft page fault (by design) and the O/S knows to reload the invalidated page. The number of soft page faults would therefore be directly proportional to the size of the data written.
My experiments seem to bear out the above theory. When I share data I write one contiguous block of data. In other words, the entire shared memory area is overwritten each time. If I make the block bigger the number of soft page faults goes up correspondingly. So, if my theory is true, there is nothing I can do to eliminate the soft page faults short of not using memory mapped files because that is how they work (using soft page faults to maintain page consistency). What is ironic is that I chose to use a memory mapped file instead of a TCP socket connection because I thought it would be more efficient.
If the soft page faults are harmless, please say so. I've heard that at some point, if their number is excessive, the system's performance can suffer. If soft page faults are not intrinsically harmful, I'd also like to hear any guidelines as to what number per second counts as "excessive".
Thanks.

Related

Does a page fault mean the CPU is blocked until the page is brought into RAM?

I am not quite sure which work is done by the CPU and which by the OS when a page fault occurs. That's why I'm asking the following questions.
Consider a single-core CPU, with several processes running. When a page fault occurs, the OS tries to fetch the required page from disk into RAM, which takes a long time. During this period, can the CPU keep on executing, or does it have to wait until the required page is loaded into RAM?
If the CPU can keep on executing without waiting for the required page, then thrashing may occur when there are too many processes. At some point, most of the instructions the CPU executes will cause page faults, and most of the time will be spent waiting for the OS to load pages from disk into RAM. That's why thrashing occurs. May I know if my understanding is correct?
Thanks in advance.
Update: this website describes thrashing very well.
The CPU doesn't know that it's "in" a page fault. CPUs aren't recursive!
When a 32-bit x86 CPU (for example) encounters a page fault, here's what it does (slightly simplified):
Set the value of CR2 to the address which caused the page fault.
Look at the Interrupt Descriptor Table and some other tables and find the address of the page fault handler (new CS, new EIP) and the kernel stack (new SS, new ESP).
Switch SS and ESP to the kernel stack it just read.
Push the old SS, old ESP, EFLAGS, old CS and old EIP (plus an error code) onto that kernel stack.
Set CS and EIP to the page fault handler's address, so execution continues in the handler.
Update the flags to say we're now in kernel mode.
That's all it does. Now, there is some data on the stack that the kernel uses when it wants to make the CPU go back to what it was doing before the page fault happened. But the kernel isn't obligated to use that data. It could go back somewhere entirely different, or it could never go back. It's up to the kernel. The CPU doesn't care.
A typical kernel will first save all the other registers (important!), look at the faulting address, decide where to get the page, tell the disk to start fetching the page, make a note that the process is stopped because of a page fault, and then it will go and do something entirely different until the data comes back from the disk. It might run a different process, for example. If there are no processes left to run, it might halt the CPU (yes, really).
Eventually the data comes back from the disk, the kernel sees that there's a process waiting for that data because of a page fault, updates the page table so the process can see the data, and restores all the registers, including SS, ESP, EFLAGS, CS, and EIP. Now the CPU is doing whatever it was doing before.
The key point to notice is: the CPU only cares what's in its registers right now! It doesn't have a long-term memory. If you save the register values somewhere, you can make it stop doing whatever it was doing, and resume it later as if nothing ever happened. For example, there is absolutely no requirement that you have to return from function calls in the order they happened. The CPU doesn't care if you have a function that returns twice, for example (see setjmp), or if you have two coroutines and calling yield in one coroutine causes yield to return in the other one. You don't have to do things in stack order like you do in C.
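As a small illustration of a call "returning twice", here is a minimal setjmp/longjmp sketch (unrelated to page faults, just to show that control flow does not have to follow stack order):

```c
#include <setjmp.h>
#include <stdio.h>

static jmp_buf env;

static void worker(void) {
    /* Never returns normally: setjmp() in main "returns" a second time, with value 42. */
    longjmp(env, 42);
}

int main(void) {
    int rc = setjmp(env);              /* returns 0 on the first, direct return */
    if (rc == 0) {
        puts("first return from setjmp");
        worker();
    } else {
        printf("second return from setjmp, value %d\n", rc);
    }
    return 0;
}
```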
In a cooperative multitasking OS the OS cannot initiate a context switch, so the CPU must wait for the page to be brought in.
Modern systems are preemptive multitasking systems. In this case the OS will most likely initiate a context switch and so other threads/processes will run on the CPU.
Thrashing is a concern when the amount of memory used far exceeds the capacity of the RAM. "Download more RAM" is a meme for a reason.
The CPU can keep on executing.
The CPU cannot, however, carry on execution of the thread that incurred the fault. That thread needs the fault to be resolved before the very next instruction can be executed. That is, it must block on the fault.
That many threads/processes may be blocked on fault handling is not in itself thrashing. Thrashing occurs when, in order to bring a page in, there are insufficient free page frames, so it is necessary to write a page out. But then, when the OS tries to find another thread to run, it picks the thread that owned a page it just wrote out, so it has to fault that page back in.
Thrashing is therefore a symptom of insufficient available real memory.

Why are page faults usually handled by the OS, not hardware?

I find that for TLB misses, some architectures handle them in hardware while some use the OS. But when it comes to page faults, most architectures use the OS instead of hardware.
I tried to find the answer but didn't find any article that explains why.
Could anyone help with this?
Thanks.
If the hardware could handle it on its own, it wouldn't need to fault.
The whole point is that the OS hasn't wired the page into the hardware page tables, e.g. because it's not actually in memory at all, or because the OS needs to catch an attempt to write so the OS can implement copy-on-write.
Page faults come in three categories:
valid (the process logically has the memory mapped, but the OS was lazy or playing tricks):
hard: the page needs to be paged in from disk, either from swap space or from a disk file (e.g. a memory mapped file, like a page of an executable or shared library). Usually the OS will schedule another task while waiting for I/O.
soft: no disk access required, just for example allocating + zeroing a new physical page to back a virtual page that user-space just tried to write. Or copy-on-write of a writeable page that multiple processes had mapped, but where changes by one shouldn't be visible to the other (like mmap(MAP_PRIVATE)). This turns a shared page into a private dirty page.
invalid: There wasn't even a logical mapping for that page. A POSIX OS like Linux will deliver a SIGSEGV signal to the offending process/thread.
The hardware doesn't know which is which, all it knows was that a page walk didn't find a valid page-table entry for that virtual address, so it's time to let the OS decide what to do next. (i.e. raise a page-fault exception which runs the OS's page-fault handler.) valid/invalid are purely software/OS concepts.
These example reasons are not an exhaustive list. e.g. an OS might remove the hardware mapping for a page without actually paging it out, just to see if the process touches it again soon. (In which case it's just a cheap soft page fault. But if not, then it might actually page it out to disk. Or drop it if it's clean.)
For HW to be able to fully handle a page fault, we'd need data structures with a hardware-specified layout that somehow let hardware know what to do in every possible situation. Unless you build a whole kernel into the CPU microcode, it's not possible to have it handle every page fault, especially not invalid ones, which require reading the OS's process / task-management data structures and delivering a signal to user-space (to a signal handler if there is one, or by killing the process).
And especially not hard page faults, where a multi-tasking OS will let some other process run while waiting for the disk to DMA the page(s) into memory, before wiring up the page tables for this process and letting it retry the faulting load or store instruction.
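If you want to watch the soft/hard split yourself on Linux, getrusage() reports minor (soft) and major (hard) fault counts. The sketch below is illustrative only: it maps demand-zero pages and counts the soft faults taken on first touch.

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void) {
    struct rusage before, after;
    size_t len = 64 * 1024 * 1024;            /* 64 MiB of demand-zero pages */

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;

    getrusage(RUSAGE_SELF, &before);
    memset(p, 1, len);                        /* first touch of each page => soft fault */
    getrusage(RUSAGE_SELF, &after);

    printf("soft (minor) faults: %ld\n", after.ru_minflt - before.ru_minflt);
    printf("hard (major) faults: %ld\n", after.ru_majflt - before.ru_majflt);

    munmap(p, len);
    return 0;
}
```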

Can OS processes share a single CPU stack?

Can processes share a single stack?
I'm currently thinking yes and no. They "share" the stack, but the information already there needs to be copied and saved elsewhere before the other process uses it, and restored when the first process is picked up by the CPU again. But I might be confusing this with registers in general.
Can someone help me shed some light on this?
Processes do not share CPU stacks.
While processes can potentially share memory using shared-memory facilities, processes do not share memory by default. Operating systems try to minimize the amount of sharing between processes as a way to ensure security.
Sharing CPU stack between processes A and B would be detrimental to security, because process A would be able to poke around "junk" left on the stack by process B, and vice versa. Hackers managed to exploit an indirect sharing on a much smaller scale to create a major security vulnerability (you can read more about Meltdown and Spectre here). Sharing CPU stacks would create a much bigger problem.
It goes without saying that an attempt to share stacks would require a degree of inter-process synchronization that would be prohibitive to overall performance. An ability to make independent operations on CPU stack is so critical to concurrency that threads inside the same process are allocated separate stacks, even though they already share all the memory allocated to the process, so security is not a concern. Sharing stacks would effectively kill concurrency, because maintaining a shared stack would require frequent exclusive access with lots of synchronization.
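To see the per-thread stacks mentioned above, here is a small pthread sketch (illustrative; build with -pthread): each thread prints the address of a local variable, and the addresses fall in different stack regions.

```c
#include <pthread.h>
#include <stdio.h>

/* Each thread's local variable lives on that thread's own stack. */
static void *report_stack(void *arg) {
    int local = 0;
    printf("thread %ld: local variable at %p\n", (long)(size_t)arg, (void *)&local);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, report_stack, (void *)(size_t)1);
    pthread_create(&t2, NULL, report_stack, (void *)(size_t)2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```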
Some systems use an interrupt stack shared by all processes. Generally, there is one interrupt stack per processor.
User stacks (and there is usually one for each processor mode used by the system) are unique to each process (or thread).
The difference between the registers and the stack is that the latter can be anywhere in memory (it is indirectly referenced by the appropriate registers) while the former are fixed (there is only one set of architecturally visible registers).
The stack is part of the state of a program: just as it makes no sense to mix the instructions, data and context of two programs, mixing two stacks makes no sense.
If program A pushes X, it expects to pop X, not whatever value program B pushed in the meantime.
It's possible to make all programs share the same memory area for the stack, but this is, in general, counterproductive.
As you correctly noted, the stack would have to be swapped in and out; thus, in the case of two programs A and B, two additional memory areas are needed: one for saving the stack of A and one for saving the stack of B.
In the end, three memory areas are used instead of two.
There are cases where swapping is necessary: when the shared resource is at a fixed location.
This is the case, in a degenerate form, for registers, but other structures can also have a fixed location.
One simple example is page table entries: if a program A is used to create two processes A1 and A2, most OSs will mark their pages copy-on-write.
Under these circumstances, the two processes end up sharing most of their pages, perhaps all but a few. For the OS it may be easier to swap the few differing pages in and out than to make the page table (or part of it) point to two different locations.
In general, if we cannot afford to have multiple instances of a resource, we need to time-share it.
Since we can afford to have more than one instance of the stack, we prefer to not share it.
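The copy-on-write sharing mentioned above is easy to observe with fork() on a POSIX system (illustrative sketch): the parent and child start out sharing pages, and the child's write gives it a private copy, so the parent never sees the change.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int value = 1;                       /* sits on a page shared copy-on-write after fork() */
    pid_t pid = fork();
    if (pid == 0) {
        value = 2;                       /* write triggers a soft fault; the page is copied */
        printf("child sees  %d\n", value);
        return 0;
    }
    waitpid(pid, NULL, 0);
    printf("parent sees %d\n", value);   /* still 1: the processes' pages have diverged */
    return 0;
}
```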

What is Page fault service time?

I am reading about operating systems and have a doubt regarding page fault service time.
Average memory access time = P(no page fault) × (memory access time) + P(page fault) × (page fault service time)
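(With purely illustrative numbers: a 100 ns memory access time, an 8 ms page fault service time, and a fault probability of 1 in 1,000,000 give 0.999999 × 100 ns + 0.000001 × 8,000,000 ns ≈ 108 ns average access time.)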
My doubt is: what does the page fault service time include?
As I understand it,
first the address translation is looked up in the TLB or the page table; when the entry is not found in the page table, a page fault has occurred. So the page has to be fetched from disk, and the entries are updated in the TLB as well as the page table.
Hence, page fault service time = TLB time + page table time + page fetch from disk
Can someone please confirm this?
What you are describing is academic Bulls____. There are so many factors involved that a simple equation like that cannot describe the access time. Nonetheless, there are certain idiotic operating systems books that put out stuff like that to sound intellectual (and professors like it for exam questions).
What these idiots are trying to say is that a page reference will be in memory or not in memory with the two probabilities adding up to 1.0. This is entirely meaningless because the relative probabilities are dynamic. If other processes start using memory, the likelihood of a page fault increases and if other processes stop using memory, the probability declines.
Then you have memory access times. Those are not constant either: accessing a cached memory location is faster than a non-cached one, and accessing memory that is shared by multiple processors and interlocked is slower still.
Then you have page fault service time. There are soft and hard page faults. A page fault on a demand-zero page takes a different amount of time than one that has to be loaded from disk. Is the disk access cached or not cached? How much activity is there on the disk?
Oh, is the page table paged? If so, is the page fault on the page table or on the page itself? It could even be both.
Servicing a page fault:
The process enters the exception and interrupt handler.
The interrupt handler dispatches to the page fault handler.
The page fault handler has to find where the page is stored.
If the page is in memory (has been paged out but not written to disk), the handler just has to update the page table.
If the page is not in memory, the handler has to look up where the page is stored (this is system and type of memory specific).
The system has to allocate a physical page frame for the memory.
If this is a first reference to a demand zero page, there is no need to read from disk, just set everything to zero.
If the page is in a disk cache, get the page from that.
Otherwise read the page from disk to the page frame.
Restore the process's registers as appropriate.
Return to user mode.
Restart the instruction causing the fault.
(All of the above have gross simplifications.)
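A user-space analogue of this fault/resolve/restart cycle can be sketched on Linux with mprotect() and a SIGSEGV handler (illustrative only; real kernels do far more, and mprotect() in a signal handler is not strictly async-signal-safe):

```c
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static size_t page_size;

/* Find the faulting page (si_addr), "resolve" the fault by granting access,
 * and let the kernel restart the faulting instruction when the handler returns. */
static void fault_handler(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    uintptr_t addr = (uintptr_t)info->si_addr;
    char *page = (char *)(addr & ~(uintptr_t)(page_size - 1));
    mprotect(page, page_size, PROT_READ | PROT_WRITE);
}

int main(void) {
    page_size = (size_t)sysconf(_SC_PAGESIZE);

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sigemptyset(&sa.sa_mask);
    sa.sa_sigaction = fault_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    /* Map one page with no access rights, so the first touch "faults". */
    char *region = mmap(NULL, page_size, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) return 1;

    region[0] = 'x';   /* SIGSEGV -> handler grants access -> the store restarts */
    printf("after the 'fault': %c\n", region[0]);

    munmap(region, page_size);
    return 0;
}
```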
The TLB has nothing really to do with this except that the servicing time is marginally faster if the page table entry in question is in the TLB.
Hence, page fault service time = TLB time + page table time + page fetch from disk
Not at all.

How can we reduce page faults?

I learned that with virtual memory, the penalty caused by a page fault is expensive. How do we reduce page faults? I saw one argument that says a smaller page size reduces page faults. Why is this true?
To consider why smaller page sizes might reduce fault rates, consider an extreme example in the other direction. Assume you have 2GB of physical memory and pages that are 1GB in size. As soon as you allocate more than 2GB of virtual memory, you will have at least 3 pages, of which only 2 will fit in memory. More than 1-in-3 memory accesses would cause a page fault.
Having smaller page sizes means you have more granularity, allowing the OS to perform more targeted swapping.
Of course (isn't it always that way), there are trade-offs. For one, smaller page sizes means more pages, which means more overhead to manage pages.
One method to reduce page faults is to use a memory allocator that is smart about allocating memory likely to be used at the same time on the same pages.
For example, at the application level, bucket allocators (example) allow an application to request a chunk of memory that the application then allocates from. The application can use the bucket for specific phases of program execution and then release the bucket as a unit. This helps to minimize the memory fragmentation that might otherwise cause active and inactive parts of the program to receive memory allocations from the same physical page.
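For illustration, here is a toy bump ("arena") allocator in the spirit of the bucket allocators described above (hypothetical API, not the linked library): allocations made during one phase come from one contiguous block, so data used together tends to share pages, and the whole phase's memory is released at once.

```c
#include <stdlib.h>
#include <stddef.h>

/* Minimal bump ("arena") allocator: everything allocated from one arena sits
 * in one contiguous block, and the whole phase is freed with a single call. */
typedef struct {
    char  *base;
    size_t size;
    size_t used;
} arena_t;

static int arena_init(arena_t *a, size_t size) {
    a->base = malloc(size);
    a->size = size;
    a->used = 0;
    return a->base ? 0 : -1;
}

static void *arena_alloc(arena_t *a, size_t n) {
    n = (n + 15) & ~(size_t)15;            /* keep allocations 16-byte aligned */
    if (a->used + n > a->size) return NULL;
    void *p = a->base + a->used;
    a->used += n;
    return p;
}

static void arena_release(arena_t *a) {    /* release the whole phase at once */
    free(a->base);
    a->base = NULL;
    a->size = a->used = 0;
}
```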