Regarding TLB flushing on a process context switch: why does each process start from scratch in the TLB when it is given the CPU?
Why don't we pre-fill the first few page table entries into the TLB? This could work the same way we exploit locality of reference in memory management: when a process starts executing, it is very likely to begin with the first instruction, or with the first few pages that are loaded in main memory.
That could reduce the cost of filling up the TLB during execution and speed up the system.
When the CPU generates a virtual address, the corresponding page is first looked up in the TLB. If it is not present in the TLB, the translation is fetched from the next level of memory (the page table) and then placed in the TLB, following a suitable replacement algorithm.
The system cannot predict in which frame the page containing the so-called 'instruction 1' is placed. If it could, there would be no need for page replacement algorithms; instead, it could load all the required pages sequentially: the page with the first instruction, the page with the second instruction, and so on.
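To make that flow concrete, here is a minimal sketch in C of the lookup sequence described above. The structure layout, the fully associative search, and the 4 KB page size are illustrative assumptions, not any particular hardware's design.

    #include <stdint.h>
    #include <stdbool.h>

    #define TLB_ENTRIES 16

    struct tlb_entry {
        bool     valid;
        uint64_t vpn;   /* virtual page number */
        uint64_t pfn;   /* physical frame number */
    };

    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Stub page-table walk: identity-map every page, for illustration. */
    static uint64_t page_table_walk(uint64_t vpn) { return vpn; }

    /* Translate a virtual address: try the TLB first, then fall back
       to a page-table walk and cache the result in the TLB. */
    uint64_t translate(uint64_t vaddr)
    {
        uint64_t vpn    = vaddr >> 12;     /* 4 KB pages assumed */
        uint64_t offset = vaddr & 0xfff;

        for (int i = 0; i < TLB_ENTRIES; i++)
            if (tlb[i].valid && tlb[i].vpn == vpn)        /* TLB hit */
                return (tlb[i].pfn << 12) | offset;

        /* TLB miss: walk the page table, then install the translation.
           A real design would pick the victim slot with a replacement
           algorithm; slot 0 is used here only to keep the sketch short. */
        uint64_t pfn = page_table_walk(vpn);
        tlb[0] = (struct tlb_entry){ .valid = true, .vpn = vpn, .pfn = pfn };
        return (pfn << 12) | offset;
    }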
I can't understand where the virtual address space resides: is it in RAM or on the hard disk?
If it is in RAM, then how can its address space be larger than the physical address space?
Virtual addresses, as their name implies, are virtual. They are only manipulated by the processor and do not correspond to actual addresses until they are translated.
Translation is done by the hardware, using tables that are filled in by the operating system. These tables indicate, for every potential virtual page address, which physical page address it corresponds to. So, mostly, virtual addresses are mapped to physical (RAM) addresses.
"I can't understand where the virtual address space resides: is it in RAM or on the hard disk? If it is in RAM, then how can its address space be larger than the physical address space?"
A process always has the same kind of memory structure in terms of virtual addresses. At the lower end of the address space are the instructions, global data, and the heap, organized in several sections. At the upper end are the program parameters (argv) and the stack. In between there is free space that allows the stack and the heap to grow.
So there are addresses near 0 (the first instructions of a program) and near 0xffffffffffffffff (the start of the stack).
Obviously this is far beyond the capacity of most (all?) present RAM. With 64-bit virtual addresses and 4 GB of RAM (32-bit physical addresses), at most one virtual page in about four billion can be backed by RAM, since 2^64 / 2^32 = 2^32 ≈ 4 billion.
But the mapping mechanism makes this possible thanks to page-based translation. In the free space between the heap and the stack, most addresses will never be used, and in that case no page table entry for the translation is created by the OS.
If you generate a random address in a program, it most likely will not correspond to an address mapped by the system to RAM. If you try to access that address, the processor will detect that no page exists and will raise an exception, which is handled by the system. Most probably, the system will stop your program and display an error message like "access violation".
The same mechanism is used to map part of the memory to the disk. To somehow increase the memory size, the system may swap to disk part of the physical memory assigned to a process, in order to allocate it to another process. If the first process tries to access it, again an exception will be raised, but the OS will detect that the address corresponds to a memory zone that is stored on disk. It will read the disk, determine a physical address for this page, fill in the corresponding page table entry, restore the memory content, and return to the program, which can now perform the memory access.
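In outline, the fault-handling sequence just described might look like the following sketch; all of the helper functions are hypothetical stand-ins for real kernel services.

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical helpers standing in for real kernel services. */
    extern bool     page_is_in_swap(uint64_t vpn);
    extern uint64_t alloc_physical_frame(void);
    extern void     read_page_from_disk(uint64_t vpn, uint64_t pfn);
    extern void     map_page(uint64_t vpn, uint64_t pfn);
    extern void     kill_process(const char *reason);

    /* Outline of the page-fault path described above. */
    void page_fault_handler(uint64_t fault_vaddr)
    {
        uint64_t vpn = fault_vaddr >> 12;       /* 4 KB pages assumed */

        if (!page_is_in_swap(vpn)) {
            /* No mapping in RAM or on disk: the "access violation" case. */
            kill_process("access violation");
            return;
        }

        /* The page was swapped out: bring it back, map it, and return
           so the faulting instruction can be re-executed. */
        uint64_t pfn = alloc_physical_frame();
        read_page_from_disk(vpn, pfn);
        map_page(vpn, pfn);
    }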
The virtual address space is stored on the hard disk. Okay, but where? There are two files that the process uses for this: the page file and the swap file.
But what is the difference between the page file and the swap file?
The swap file is used when memory is completely used up and the process needs more memory; the swap file is thus an extension of memory. Its other use is context switching between processes. For instance, imagine process A is running and process B is waiting to run. Before process B runs, process A is removed from the current context and stored in the swap file to wait for its turn to run again.
Then we have the page file. The page file is used to store the pages that the process is not using at the moment. It saves memory, because memory is a limited resource in the PC; thus only recently used pages are present in actual memory. When the CPU tries to access a page that is not present in memory, the CPU raises an exception, which is handled by Windows: it restores the page to memory, updates the page table, and lets the CPU continue executing the process.
The virtual address space is kept in secondary storage (disk). The virtual part of virtual memory means that the operating system maintains an image of the address space in secondary storage. Because an image of the address space is kept in secondary storage, it can be larger than the physical memory.
The second piece of implementing virtual memory is logical address translation, which takes place entirely in memory. In the logical address space, memory is subdivided into pages (somewhere from 512 bytes to 1 MB). Physical memory is subdivided into page frames. The size of a page frame has to match the size of the logical page on most systems.
The operating system maintains a page table for each process. The page table maps pages in the logical address space to physical page frames. An address consists of an index into the page table and an offset into the page, which is used once the page is located.
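As a small illustration of that split, assuming 4 KB pages (the page size and the example address here are arbitrary choices, not from the answer above):

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12                    /* 4 KB pages assumed */
    #define PAGE_SIZE  (1u << PAGE_SHIFT)

    int main(void)
    {
        uint64_t vaddr  = 0x12345;                  /* arbitrary example */
        uint64_t index  = vaddr >> PAGE_SHIFT;      /* page-table index */
        uint64_t offset = vaddr & (PAGE_SIZE - 1);  /* offset in page */

        /* If the page table maps page 'index' to frame 'pfn', the
           physical address is (pfn << PAGE_SHIFT) | offset. */
        printf("index = 0x%llx, offset = 0x%llx\n",
               (unsigned long long)index, (unsigned long long)offset);
        return 0;
    }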
In most cases a logical address has no mapping to a physical address. If you access a page that has no mapping, the processor generates a page fault. Once the logical translation fails, the operating system has to do a virtual translation of the page: it checks whether the page in question is located in secondary storage.
If the page does not exist, the operating system raises an access violation exception. If the page does exist, the operating system loads it into a free physical page frame, updates the page table to map the page to that frame, then restarts the instruction that caused the fault.
A virtual memory implementation has to maintain a copy of each process's virtual address space in secondary storage. It has to be able to translate logical page references into the virtual pages stored on disk, and to copy logical pages in memory to and from virtual pages on disk.
You can have logical memory translation without virtual memory translation but you cannot have virtual memory translation without logical memory translation.
Suppose that you have a 64-bit system and that your OS is scheduling two processes on it. Assume that the core has access to a 4-entry, fully associative TLB for a 4 KB page size. Furthermore, assume that the core has a 64-byte direct-mapped cache with 16-byte cache lines. Now suppose that your processes, A and B, have the following page tables:
Process A Page Table
Now suppose that your OS schedules process A, and in it, a memory reference to the following virtual address is made:
0x2002
For the memory reference presented above, detail all TLB accesses (whether they are hits or misses) and all cache accesses (whether they are hits or misses). Assume hardware page-table walks and a physically addressed cache.
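As a starting point, here is a sketch of the bit arithmetic only; it does not answer the hit/miss questions, because those depend on the page-table contents the exercise supplies, and the physical frame number used below (0x5) is purely a made-up placeholder.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t vaddr = 0x2002;

        /* 4 KB pages: the low 12 bits are the page offset. */
        uint64_t vpn  = vaddr >> 12;      /* = 0x2 */
        uint64_t poff = vaddr & 0xfff;    /* = 0x002 */

        /* The cache is physically addressed, so its fields come from
           the physical address produced by the page table; a frame
           number of 0x5 is assumed here purely for illustration. */
        uint64_t pfn   = 0x5;
        uint64_t paddr = (pfn << 12) | poff;

        /* 64-byte direct-mapped cache with 16-byte lines: 4 bits of
           block offset, 64/16 = 4 sets so 2 bits of index, rest tag. */
        uint64_t boff = paddr & 0xf;
        uint64_t set  = (paddr >> 4) & 0x3;
        uint64_t tag  = paddr >> 6;

        printf("VPN=0x%llx poff=0x%llx  set=%llu tag=0x%llx boff=%llu\n",
               (unsigned long long)vpn, (unsigned long long)poff,
               (unsigned long long)set, (unsigned long long)tag,
               (unsigned long long)boff);
        return 0;
    }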
Suppose I have a 2-entry TLB and am using LRU clock replacement. Further suppose that I get a TLB miss that is also a page fault, so I load a page into memory and update the TLB; now my TLB holds one entry. Next, I get another TLB miss that is also a page fault, so I load another page into memory and update the TLB; now my TLB is full.
Now, say there is yet another TLB miss that is also a page fault, and I need to evict an entry from my TLB using the LRU clock algorithm.
My question is: are the reference bits for the two entries in the TLB both 1? (That is, when we add a new TLB entry, do we initially set the reference bit to 1, or is it initialized to 0?) If not, is the first TLB entry's reference bit 0 and the second one's 1? (Would this be because, during the second page fault, we cleared the reference bit of the first entry?)
Lastly, which entry would be evicted, the first or the second?
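For reference, this depends on the implementation, but one common convention is to set the reference bit to 1 on insertion and on every hit. A minimal sketch of a 2-entry clock under that assumed convention:

    #include <stdint.h>
    #include <stdbool.h>

    #define ENTRIES 2

    struct entry { bool valid; bool ref; uint64_t vpn; };

    static struct entry tlb[ENTRIES];
    static int hand;                        /* the clock hand */

    /* On a hit, mark the entry referenced (assumed convention). */
    bool tlb_lookup(uint64_t vpn)
    {
        for (int i = 0; i < ENTRIES; i++)
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                tlb[i].ref = true;
                return true;
            }
        return false;
    }

    /* Insert a translation, evicting with the clock algorithm.
       Assumed convention: ref = 1 on insertion as well. */
    void tlb_insert(uint64_t vpn)
    {
        /* Advance the hand, clearing ref bits, until one is 0. */
        while (tlb[hand].valid && tlb[hand].ref) {
            tlb[hand].ref = false;          /* second chance given */
            hand = (hand + 1) % ENTRIES;
        }
        tlb[hand] = (struct entry){ .valid = true, .ref = true, .vpn = vpn };
        hand = (hand + 1) % ENTRIES;
    }

In the sequence described, neither entry is referenced again after insertion, so the victim is the first entry either way: with ref initialized to 1, the hand clears both bits in turn and comes back to evict the first entry; with ref initialized to 0, it evicts the first entry immediately. The two conventions only diverge once hits occur between misses.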
This may depend on the OS, but in general, as I understand it, when a page fault occurs (the desired page is not in main memory), the OS will instruct the CPU to read the page from disk. I am wondering: does the OS dispatch to another process during the disk I/O? If it does, will there be a complete flush of the TLB on the context switch?
More or less, but a page fault doesn't always mean the page is on the disk (the page could also not exist at all, be a lazily allocated page, be a copy-on-write page that was written to, or exist but be marked unreadable/unwritable, etc.). But if that is the situation, the OS is probably going to schedule another thread at least, because disk I/O takes approximately forever.
The amount of flushing necessary depends on what it switches to: switching between threads from the same context doesn't imply a TLB flush. If a TLB flush is necessary, it's probably not a complete flush, because of global pages (so typically you're not flushing out TLB entries for kernel pages). There is also PCID to avoid complete flushes (flushing can be limited to specified process-context IDs), but that's quite recent, and tricky to use since there are only 4096 distinct IDs.
Process-specific pages are marked as non-global entries via the nG (non-global) bit in the TLB entry, which also stores the process's identifier (the Address Space ID, in ARM's terminology).
The ARM documentation lays out this concept clearly:
"For non-global entries, when the TLB is updated and the entry is marked as non-global, a value is stored in the TLB entry in addition to the normal translation information. This value is called the Address Space ID (ASID), which is a number assigned by the OS to each individual task. Subsequent TLB look-ups only match on that entry if the current ASID matches with the ASID that is stored in the entry. This permits multiple valid TLB entries to be present for a particular page marked as non-global, but with different ASID values. In other words, we do not necessarily need to flush the TLBs when we context switch."
Source: https://developer.arm.com/documentation/den0024/a/The-Memory-Management-Unit/Context-switching
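In code terms, the matching rule the ARM document describes might look like the following sketch; the struct layout is illustrative, not the actual hardware format.

    #include <stdint.h>
    #include <stdbool.h>

    struct tlb_entry {
        bool     valid;
        bool     nG;       /* non-global: entry belongs to one task */
        uint16_t asid;     /* Address Space ID stored with the entry */
        uint64_t vpn;
        uint64_t pfn;
    };

    /* An entry matches only if the VPN matches AND (the entry is global
       OR its ASID equals the current task's ASID). Two tasks can thus
       keep valid entries for the same virtual page simultaneously,
       which is why a context switch needn't flush the TLB. */
    bool tlb_match(const struct tlb_entry *e, uint64_t vpn, uint16_t cur_asid)
    {
        return e->valid && e->vpn == vpn && (!e->nG || e->asid == cur_asid);
    }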
I'm writing a kernel and need (and want) to put multiple stacks and heaps into virtual memory, but I can't figure out how to place them efficiently. How do normal programs do it?
How (or where) are stacks and heaps placed into the limited virtual memory provided by a 32-bit system, such that they have as much growing space as possible?
For example, when a trivial program is loaded into memory, the layout of its address space might look like this:
[ Code Data BSS Heap-> ... <-Stack ]
In this case the heap can grow as big as virtual memory allows (e.g. up to the stack), and I believe this is how the heap works for most programs. There is no predefined upper bound.
Many programs have shared libraries that are put somewhere in the virtual address space.
Then there are multi-threaded programs that have multiple stacks, one for each thread. And .NET programs have multiple heaps, all of which have to be able to grow one way or another.
I just don't see how this can be done reasonably efficiently without putting a predefined limit on the size of every heap and stack.
I'll assume you have the basics in your kernel done: a trap handler for page faults that can map a virtual memory page to RAM. Next level up, you need a virtual memory address space manager from which user-mode code can request address space. Pick a segment granularity that prevents excessive fragmentation; 64 KB (16 pages) is a good number. Allow user-mode code to both reserve and commit space. A simple bitmap of 4 GB / 64 KB = 64K segments x 2 bits to keep track of segment state gets the job done. The page fault trap handler also needs to consult this bitmap to know whether a page request is valid or not.
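A minimal sketch of that 2-bit-per-segment bitmap (the particular state encoding is an arbitrary choice):

    #include <stdint.h>

    /* 4 GB address space / 64 KB segments = 65536 segments, 2 bits each. */
    #define SEGMENTS      (4ull * 1024 * 1024 * 1024 / (64 * 1024))
    #define SEG_FREE      0u
    #define SEG_RESERVED  1u
    #define SEG_COMMITTED 2u

    static uint8_t seg_map[SEGMENTS / 4];   /* four 2-bit states per byte */

    static unsigned seg_get(uint32_t seg)
    {
        return (seg_map[seg / 4] >> ((seg % 4) * 2)) & 3u;
    }

    static void seg_set(uint32_t seg, unsigned state)
    {
        unsigned shift = (seg % 4) * 2;
        seg_map[seg / 4] = (uint8_t)((seg_map[seg / 4] & ~(3u << shift))
                                     | (state << shift));
    }

    /* The page-fault trap handler calls seg_get() to decide whether a
       faulting address falls in committed space (map the page), reserved
       space (commit policy applies), or free space (invalid access). */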
A stack is a fixed-size VM allocation, typically 1 megabyte. A thread usually needs only a handful of its pages, depending on function nesting depth, so reserve the 1 MB and commit only the top few pages. When the thread nests deeper, it trips a page fault and the kernel can simply map the extra page to RAM to allow the thread to continue. You'll also want to mark the bottom few pages as special: when the thread page-faults on those, you declare a stack overflow.
The most important job of the heap manager is to prevent fragmentation. The best way to do that is to create a lookaside list that partitions heap requests by size. Everything less than 8 bytes comes from the first list of segments, 8 to 16 bytes from the second, 16 to 32 bytes from the third, and so on, with the bucket size increasing as you go up. You'll have to play with the bucket sizes to get the best balance. Very large allocations come directly from the VM address manager.
The first time an entry in the lookaside list is hit, you allocate a new VM segment and subdivide it into smaller blocks with a linked list. When such an allocation is released, you add the block back to the list of free blocks. All blocks in a segment have the same size regardless of the program's request, so there won't be any fragmentation. When the segment is fully used and no free blocks are available, you allocate a new segment. When a segment contains nothing but free blocks, you can return it to the VM manager.
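A stripped-down sketch of such a size-bucketed lookaside scheme follows; real heap managers also keep per-block headers (so free doesn't need the size passed in), locking, and per-segment bookkeeping for returning empty segments, all omitted here. vm_alloc_segment() is a hypothetical call into the VM address manager described above.

    #include <stddef.h>
    #include <stdint.h>

    #define NUM_BUCKETS  8
    #define SEGMENT_SIZE (64 * 1024)

    /* Bucket b serves allocations up to 8 << b bytes: 8, 16, 32, ... */
    static void *free_list[NUM_BUCKETS];

    extern void *vm_alloc_segment(void);  /* hypothetical VM manager call */

    static int bucket_for(size_t size)
    {
        /* Requests larger than the biggest bucket would go straight to
           the VM address manager instead (not shown). */
        int b = 0;
        while ((size_t)(8u << b) < size)
            b++;
        return b;
    }

    void *heap_alloc(size_t size)
    {
        int    b     = bucket_for(size);
        size_t block = (size_t)(8u << b);

        if (!free_list[b]) {
            /* First hit on this bucket: carve a fresh segment into
               equal-size blocks threaded onto the free list. */
            char *seg = vm_alloc_segment();
            for (size_t off = 0; off + block <= SEGMENT_SIZE; off += block) {
                *(void **)(seg + off) = free_list[b];
                free_list[b] = seg + off;
            }
        }
        void *p = free_list[b];
        free_list[b] = *(void **)p;
        return p;
    }

    void heap_free(void *p, size_t size)
    {
        /* All blocks in a bucket are the same size, so a freed block
           just goes back on its bucket's list: no fragmentation. */
        int b = bucket_for(size);
        *(void **)p = free_list[b];
        free_list[b] = p;
    }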
This scheme allows you to create any number of stacks and heaps.
Simply put, as your system resources are always finite, you can't go limitless.
Memory management always consists of several layers, each with its own well-defined responsibility. From the program's perspective, only the application-level manager is visible, and it is usually concerned with just its own single allocated heap. A level above that could deal with creating multiple heaps, if needed, out of its one global heap and assigning them to subprograms (each with its own memory manager). Above that could sit the standard malloc()/free() it builds on, and above those the operating system, dealing with pages and actual per-process memory allocation (it is basically unconcerned not only with multiple heaps, but even with user-level heaps in general).
Memory management is costly, and so is trapping into the kernel. Combining the two could impose a severe performance hit, so what appears to be the actual heap management from the application's point of view is actually implemented in user space (in the C runtime library) for the sake of performance (and other reasons out of scope for now).
When a shared library (DLL) is loaded at program startup, it will most probably be mapped alongside the CODE/DATA/etc. sections, so no heap fragmentation occurs. On the other hand, if it is loaded at runtime, there's pretty much no choice other than using up heap space.
Static libraries are, of course, simply linked into the CODE/DATA/BSS/etc sections.
At the end of the day, you'll need to impose limits on heaps and stacks so that they're not likely to overflow, but you can allocate additional ones.
If one needs to grow beyond that limit, you can either
terminate the application with an error, or
have the memory manager allocate/resize/move the memory block for that stack/heap and most probably defragment the heap (at its own level) afterwards; that's why free() usually performs poorly.
Considering a rather large 1 KB stack frame per call as the average (which might happen if the application developer is inexperienced), a 10 MB stack would be sufficient for 10 MB / 1 KB = 10240 nested calls. By the way, besides that, there's pretty much no need for more than one stack and one heap per thread.