Operating systems - TLBs

I'm trying to get my head round this (okay, tbh, cramming the night before the exams :) but I can't figure out (nor find a good high-level overview on the net) the following:
'Page table entries can be mapped to more than one TLB entry... if, for example, every page table entry is mapped to two TLB entries, this is known as a 2-way set associative TLB.'
My question is, why would we want to map this more than once? Surely we want to have the maximum number of possible entries represented in the TLB, and duplication would waste space, right? What am I missing?
Many thanks

It doesn't mean you would load the same entry into two places in the table -- it means a particular entry can be loaded into either of two places in the table. The alternative, where you can only map an entry to one place in the table, is a direct-mapped TLB.
The primary disadvantage of a direct-mapped TLB arises if you're copying from one part of memory to another, and (by whatever direct-mapping scheme the CPU uses) the translations for both have to be mapped to the same spot in the TLB. In this case, you end up re-loading the TLB entry every time, so the TLB is doing little or no good at all. By having a two-way set associative TLB, you can guarantee that any two entries can be in the TLB at the same time, so (for example) a block move from point A to point B can't ruin your day -- but if you read from two areas, combine them, and write the results to a third, it could (if all three used translations that map to the same set of TLB entries).
The shortcoming of having a multiway TLB (like any other multiway cache) is that you can't directly compute which position might hold a particular entry at a given time -- you basically search across the ways to find the right entry. For two-way, that's rarely a problem -- but four ways is typically about the useful limit; 8-way set associative TLBs (or caches) aren't common at all, partly because searching across 8 possible locations for the data starts to become excessive.
Over time, the number of ways it makes sense to use in a cache or TLB tends to rise, though. The differential in speed between memory and processors continues to grow. The greater the differential, the more cycles the CPU can use and still produce a result within a single memory clock cycle (or a specified number of memory clock cycles, even if that's more than one).
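As a rough sketch of the lookup (the set count, field widths, and replacement choice below are made up for illustration, not taken from any particular CPU): a set-associative TLB uses some bits of the virtual page number to pick a set, then compares the tag in each of that set's ways.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 64          /* hypothetical: 64 sets x 2 ways = 128 entries */
#define NUM_WAYS 2

struct tlb_entry {
    bool     valid;
    uint64_t vpn;            /* virtual page number (the tag) */
    uint64_t pfn;            /* physical frame number */
};

static struct tlb_entry tlb[NUM_SETS][NUM_WAYS];

/* Look up a virtual page number. Returns true on a hit. */
bool tlb_lookup(uint64_t vpn, uint64_t *pfn_out)
{
    unsigned set = vpn % NUM_SETS;              /* direct-mapped part: pick the set */
    for (int way = 0; way < NUM_WAYS; way++) {  /* associative part: search the ways */
        if (tlb[set][way].valid && tlb[set][way].vpn == vpn) {
            *pfn_out = tlb[set][way].pfn;
            return true;
        }
    }
    return false;                               /* TLB miss: walk the page tables */
}

/* On a miss, the new translation may go into either way of its set, so two pages
 * whose VPNs collide on the set index can still coexist (one per way). */
void tlb_fill(uint64_t vpn, uint64_t pfn)
{
    unsigned set = vpn % NUM_SETS;
    int victim = 0;                              /* prefer an empty way, else evict way 0 */
    if (tlb[set][0].valid && !tlb[set][1].valid) /* (a real TLB would use LRU or similar) */
        victim = 1;
    tlb[set][victim] = (struct tlb_entry){ .valid = true, .vpn = vpn, .pfn = pfn };
}
```

The point of the two ways is visible in tlb_fill: two translations that collide on the set index can both stay resident, which is exactly what saves the block-move case described above.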

How do weak ISAs resolve WAW memory hazards using the store buffer?

Modern CPUs use a store buffer to delay commit into cache until retirement, also avoiding WAR and WAW memory hazards. I'm wondering how weak ISAs resolve WAW hazards using the store buffer, which is otherwise not a FIFO, allowing StoreStore reordering? Do they insert an implicit barrier?
More specifically, if two stores to the same memory address retire in-order on a weak ISA, e.g. ARM/POWER, they could theoretically commit to cache out-of-order, since the store buffer is not FIFO, thus breaking the WAW dependency.
According to Wikipedia:
...the store instructions, including the memory address and store data, are buffered in a store queue until they reach the retirement point. When a store retires, it then writes its value to the memory system. This avoids the WAR and WAW dependence problems...
My guess (I'm not familiar with the details of any real-world designs):
Even if the store buffer is a full scheduler that can "grab" any graduated store for commit to L1d, I'd assume it would use an oldest-ready first order. (Like an instruction / uop scheduler aka RS Reservation Station.)
"Ready" would mean the cache line is exclusively owned (Modified or Exclusive state). Every graduated store itself is implicitly ready to commit because by definition the associated store instruction has retired.
In-order retirement means that stores become eligible for commit in program-order, so you can't have an older store that's temporarily hidden from the oldest-ready-first scheduling.
Together, those things would ensure that for any given byte, stores overlapping it are in program order and thus maintain consistency of global-visibility order and final value for any given group of bytes within a cache line.
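Sketching that policy as code (this is purely my guess at a plausible mechanism, not a description of any real core; line_owned and l1d_write are hypothetical helpers): entries are allocated in program order, and each commit cycle grabs the oldest graduated entry whose cache line is owned.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sb_entry {
    uint64_t addr;        /* byte address of the store */
    uint64_t data;
    bool     graduated;   /* the store instruction has retired */
    bool     committed;   /* already written to L1d */
};

/* Hypothetical helpers: do we own this line in M/E state, and write to L1d. */
extern bool line_owned(uint64_t addr);
extern void l1d_write(uint64_t addr, uint64_t data);

/* Entries are allocated in program order, so index order == age order. */
void commit_one(struct sb_entry *sb, size_t n)
{
    for (size_t i = 0; i < n; i++) {          /* scan oldest -> youngest */
        struct sb_entry *e = &sb[i];
        if (e->committed || !e->graduated)
            continue;
        if (!line_owned(e->addr))
            continue;   /* Not ready: we don't own the line yet. A younger store to
                         * the same address targets the same line, so it is blocked
                         * too -- per-address (WAW) order is preserved. Line-split
                         * stores need extra care; see the corner case below. */
        l1d_write(e->addr, e->data);          /* oldest ready store wins */
        e->committed = true;
        return;
    }
}
```

Because a younger store to a given address necessarily targets the same cache line as an older store to that address, it can never become "ready" while the older one is still blocked, so the commit order for any one address stays in program order even though unrelated stores may commit out of order.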
A memory barrier might work by fencing off the store buffer like a divider on a grocery-store checkout conveyor belt, preventing stores after the barrier from being grabbed for commit until the ones before it have committed.
We do know real-world weakly-ordered store buffers like PowerPC RS64-III (in-order exec) and Alpha 21264 (OoO exec) do merging to help them create whole 4-byte or 8-byte aligned commits to L1d, e.g. out of multiple byte stores. That's also fine, assuming your merge algorithm respects order for any given byte e.g. by putting the data from a younger store into an older SB entry or vice versa and marking the other entry as "already committed". Obviously this must respect store barriers.
I think this is all fine even with unaligned stores, although preserving atomicity guarantees for unaligned stores could be tricky with merging. (Intel P6-family and later does provide atomicity guarantees for unaligned cached stores that don't cross a cache-line boundary, but we don't think Intel does merging in the store buffer proper; maybe just some stuff with LFBs for cache-miss back-to-back stores to the same line.)
It's likely that real hardware might not be a full scheduler that can merge any 2 SB entries, e.g. maybe only over limited range to reduce the amount of different addresses (and sizes) to compare at once. Also, you'd probably still only free up SB entries in program order, so it can basically be a circular buffer (unlike the RS). Alloc in program order, and having the order be tracked by the layout of the SB itself, makes it much cheaper for memory barriers to work, and to track where the youngest "graduated" store is.
Disclaimer: IDK if this is exactly how real HW works
Possible corner case: unaligned 4-byte store to [cache_line+63] (split across a CL boundary) and then to [cache_line+60] (fully contained in the lower cache line). If the older store-buffer entry can't commit right away because we don't yet own the next cache line, but we do own cache_line, we still can't let the younger store to cache_line+60 commit first, if we're depending on that not happening to avoid WAW hazards.
So you'd probably want a line-split SB entry to be able to commit the data to one line but not the other, allowing oldest-ready-first to happen for each location separately, not tying together order across 2 cache lines.
Related: I wrote my own answer explaining what a store buffer is. I tried to avoid mistakes like Wikipedia makes ("when a store retires, it writes its value to the memory system": In fact retirement just makes it eligible to commit; such stores are called "graduated" stores.)

Different Page Sizes for Processes

As part of the virtual-to-physical address conversion, a table of mappings between virtual and physical addresses is stored for each process. If a process is scheduled next, the content of its page table is loaded into the MMU.
1) Where is the page table for each process stored? As part of the process control block?
2) Does the page table contain entries for unallocated memory so that a segfault can be detected (more easily)?
3) Is it possible (and used in any known relevant OS) that one process has multiple page frame sizes? Especially if question 2 is true, it would be very convenient to map huge pages to non-existent memory to keep the page table as small as possible, while still allowing high precision in mapping smaller frames to memory to keep external (and internal) fragmentation as small as possible. This of course requires an extra field storing the frame size for each entry. Please point out the reason(s) if my "idea" cannot exist.
1) They could be, but most OS's have a notion of an address space which a process is attached to. The address space typically contains a description of the sorts of mappings that have been established, and pointers to the page structure(s). If you consider the operation of exec(2), at a certain level of abstraction it merely involves creating a new address space, populating it, then attaching the process to it. Once the operation is known to succeed, the old address space can simply be discarded.
2) It depends upon the MMU architecture of the machine. In a forward-mapped arrangement (x86, armv[78]), the page tables form a sort of tree structure, but instead of having the conventional 2 or 3 items per node, there are hundreds or thousands of them. The x86-classic has a 2-level structure, where each of the 1024 entries in the first level points to a page table which covers 2^22 bytes (4 MiB) of address space. Invalid entries, either at the inner or leaf level, can represent unmapped space; so in x86-classic, if you have a very small address space, you only need a root table and a single leaf-level table (a small walk sketch appears at the end of this answer).
3) Yes, multiple page sizes have been supported by most OSes since the early 2000s. Again, in forward-mapped ones, each of the levels of the tree can be replaced by a single large page covering the same address space as that table level. x86-classic only had one size; later editions supported many more.
3a) There is no need to use large pages to do this -- simply having an invalid entry at an upper level (so no lower-level table exists for that range) is sufficient. In x86-classic, the least significant bit of the page table/descriptor entry indicates the validity of the entry.
Your idea exists.
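A minimal software model of the classic x86 2-level walk mentioned in 2) (the phys_read32 helper and the exact flag handling are simplified assumptions, not real kernel code) shows how an invalid entry at either level stands in for unmapped space, so a tiny address space really does need only the root table plus one leaf table:

```c
#include <stdbool.h>
#include <stdint.h>

#define PRESENT 0x1u                  /* least significant bit = entry is valid */

/* Hypothetical helper: read one 32-bit word of physical memory. */
extern uint32_t phys_read32(uint32_t paddr);

/* Translate a 32-bit virtual address with classic x86 2-level paging.
 * Returns true and fills *paddr on success, false for unmapped space
 * (where the CPU would raise a page fault and the OS might deliver a segfault). */
bool translate(uint32_t page_dir_base, uint32_t vaddr, uint32_t *paddr)
{
    uint32_t pde = phys_read32(page_dir_base + ((vaddr >> 22) & 0x3FF) * 4);
    if (!(pde & PRESENT))
        return false;                 /* whole 4 MiB region unmapped: no leaf table needed */

    uint32_t pt_base = pde & 0xFFFFF000;
    uint32_t pte = phys_read32(pt_base + ((vaddr >> 12) & 0x3FF) * 4);
    if (!(pte & PRESENT))
        return false;                 /* this 4 KiB page unmapped */

    *paddr = (pte & 0xFFFFF000) | (vaddr & 0xFFF);
    return true;
}
```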
1) Where is the page table for each process stored? As part of the process control block?
Usually it's not "a page table". For some CPUs there are only TLB entries (Translation Lookaside Buffer entries - like a cache of what the translations are), where software has to handle a "TLB miss" by loading whatever it feels like into the TLB itself, and where the OS might not use tables at all (e.g. it could use a "list of arbitrary-length zones").
For some CPUs it's a hierarchy of multiple levels (e.g. for modern 64-bit 80x86 there are 4 levels); in this case some of the levels may be in physical memory, some may be in swap space or somewhere else, and some may be generated as needed from other data (a little like it would've been for "software handling of TLB miss").
In any case, if each process has its own virtual address space (i.e. it's not some kind of "single address space shared by many processes" scheme), it's likely that the process control block (directly or indirectly) contains a reference to whatever the OS uses (e.g. maybe a single "physical address of the highest-level page table", but maybe a virtual address of a "list of arbitrary-length zones", and maybe anything else).
2) Does the page table contain entries for unallocated memory so that a segfault can be detected (more easily)?
If there are page tables then there must be a way to indicate "page not present", where "page not present" may mean that the memory isn't allocated, but could also mean that the (virtual) memory was allocated but the entry for it hasn't been set (either because the OS is generating the tables on demand, or because the actual data is in swap space, or ...).
3) Is it possible (and used in any known relevant OS) that one process has multiple page frame sizes?
Yes. It's relatively common for 64-bit 80x86, where there are 4 KiB pages and 2 MiB (or 4 MiB) "large pages" (plus maybe 1 GiB "huge pages"); it's done to reduce the chance of TLB misses (while also reducing the memory consumed by page tables). Note that this is mostly an artifact of having multiple levels of page tables - an entry in a higher-level table can say "this entry is a large page" or it can say "this entry is a lower-level page table that might contain smaller pages". Note that in this case it's not "multiple page sizes in the same table", but "fixed page size for each level".
Especially if question 2 is true, it would be very convenient to map huge pages to non-existent memory to keep the page table as small as possible, while still allowing high precision in mapping smaller frames to memory to keep external (and internal) fragmentation as small as possible. This of course requires an extra field storing the frame size for each entry. Please point out the reason(s) if my "idea" cannot exist.
Converting a virtual address into a physical address (or some kind of fault to indicate the translation doesn't exist) needs to be very fast (because it happens extremely often). When you have "fixed page size for each level" it means you can extract some bits of the virtual address and use them as the index into the table; which is fast.
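For example, with x86-64's 4-level, 4 KiB-page layout (a sketch; the helper names are my own), each level's index is just a fixed 9-bit slice of the virtual address, so picking the right entry is one shift and one mask, and a "large page" flag at a higher level simply ends the walk early with a bigger in-page offset:

```c
#include <stdint.h>

/* 9-bit index into each level's 512-entry table (x86-64, 4 KiB paging layout). */
static inline unsigned pml4_index(uint64_t va) { return (va >> 39) & 0x1FF; }
static inline unsigned pdpt_index(uint64_t va) { return (va >> 30) & 0x1FF; }
static inline unsigned pd_index(uint64_t va)   { return (va >> 21) & 0x1FF; }
static inline unsigned pt_index(uint64_t va)   { return (va >> 12) & 0x1FF; }

#define PAGE_SIZE_FLAG (1ull << 7)   /* PS bit: "this entry IS a large/huge page" */

/* The in-page offset depends on which level the walk stopped at:
 * 4 KiB page -> low 12 bits, 2 MiB large page -> low 21 bits,
 * 1 GiB huge page -> low 30 bits. Fixed page size per level. */
static inline uint64_t offset_4k(uint64_t va) { return va & 0xFFFull; }
static inline uint64_t offset_2m(uint64_t va) { return va & 0x1FFFFFull; }
static inline uint64_t offset_1g(uint64_t va) { return va & 0x3FFFFFFFull; }
```

Because every index is a fixed bit-field, the hardware never has to search within a table; it goes straight to the one candidate entry at each level.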
When you have "multiple page sizes in the same table" there are two options. The first option is to duplicate entries in the page table so that you can still extract some bits of the virtual address and use them as the index into the table; which (apart from minor differences in the way TLBs are managed - e.g. auto-detecting adjacent translations vs. being manually told) is effectively identical to not bothering at all; but there are some CPUs (ARM, I think) that do this.
The other alternative is searching multiple entries in the page table to find the right entry, where the cost of searching reduces performance. I don't know of any CPU that supports this - performance is too important.

Number of memory access with Demand Paging

I have been studying Operating Systems Concepts and the book I am referring to is Operating System Concepts by Peter B. Galvin, Greg Gagne and Abraham Silberschatz.
In the chapter on Virtual Memory, the book starts to talk about Paging and the number of memory accesses the system would need to read data stored in a particular frame in memory, given a logical address. The author states that when the Page Table is present in Main Memory, the system needs two memory accesses to read data stored in a frame. The first access is made to the page table to read the correct frame number, and the next access is for reading the byte/word from the frame.
A few sections later, the book talks about Demand Paging and page faults. The author states that in the case of no page fault, one memory access is needed; in the case of a page fault, we consider the Page Fault Service time (which comprises swap-in time, swap-out time, one memory access, etc.) and presents readers with the formula
Effective Access Time = (1-p) x one memory access time + p x page fault service time
where p = page fault rate
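For a sense of scale (the numbers below are illustrative assumptions, not taken from the book), even a tiny page-fault rate dominates the effective access time, because servicing a fault is orders of magnitude slower than a plain memory access:

```c
#include <stdio.h>

int main(void)
{
    double mem_access_ns = 200.0;      /* assumed: one memory access */
    double fault_service_ns = 8e6;     /* assumed: 8 ms page-fault service time */
    double p = 1.0 / 1000.0;           /* assumed: 1 fault per 1000 accesses */

    /* Effective Access Time = (1-p) * memory access time + p * fault service time */
    double eat = (1.0 - p) * mem_access_ns + p * fault_service_ns;
    printf("EAT = %.1f ns (about a %.0fx slowdown)\n", eat, eat / mem_access_ns);
    return 0;
}
```

With these assumed numbers the result is roughly 8200 ns, i.e. about a 40x slowdown over a single memory access.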
I cannot wrap my head around why the author suggests that, in the case of no page fault, only one memory access will be needed. Applying the line of thought used with the standard paging scheme introduced earlier by the same author(s), we should need one memory access to read the page table and another to read the data from the frame.
Is it because we are talking about the time frame after the access to the page table has already been made? Then why does the same standard of calculation not apply to the standard version of paging?
Note: I haven't read/seen this book.
For educational material: if the author describes reality accurately, with all the details, the reader will just get confused and won't be able to learn. To work around that, authors simplify (omit details and ignore reality) while introducing different concepts, so that the reader is able to learn each concept one at a time while building up the knowledge needed to comprehend the complexity of reality.
The problem is that different simplifications make sense at different stages, and authors are human (imperfect), so sometimes the simplifications that were beneficial at one point (in one chapter) conflict with simplifications that are beneficial at a later point (in a different chapter).
For example, I might (initially) tell someone "each access from virtual memory involves a second memory fetch from RAM to determine the translation" to help them understand how page tables work and that there are (potential) performance problems involved (twice as many memory accesses). Then I might introduce the concept of "translation look-aside buffers" (after the reader understands how page tables work and knows about the problem that TLBs are designed to solve). Then I might explain that real systems often have multiple levels of page tables (e.g. on 64-bit 80x86 it's four levels, potentially involving 4 memory accesses to determine a translation) and that there might be higher-level caches/buffers involved (and not just TLBs that cache final translations). In this case, my original statement ("each access from virtual memory involves a second memory fetch from RAM to determine the translation") is a deliberate lie (a simplification) to avoid the complexity of a statement like "each access from virtual memory may or may not involve one or more additional fetches from some or all levels of page tables" (which is too confusing for beginners initially, because it creates lots of questions that they don't have answers to yet).
I cannot wrap my head around why the author suggests that, in case of no page fault, only one memory access will be needed.
One reality is this (for one real 80x86 CPU in long mode, but not all 80x86 CPUs in long mode and not any 80x86 in other modes, and assuming virtualisation is not being used), for a read from virtual memory that does not lead to a page fault, where the access is not misaligned/split across a page boundary (in which case the CPU would have to do it all twice, to fetch bytes from 2 different pages and merge them):
* if the translation is not in the TLB, then:
    * if the area is not in the "page directory cache":
        * fetch the PML4 entry to determine the address of the PDPT (try L1 cache, then L2 cache, then L3 cache, then RAM)
        * do access checks based on flags in the PML4 entry
        * fetch the PDPT entry to determine the address of the PD (try L1 cache, then L2 cache, then L3 cache, then RAM)
        * do access checks based on flags in the PDPT entry
        * insert data into the "page directory cache"
    * if the area is in the "page directory cache":
        * do access checks based on flags in the "page directory cache" entry
    * fetch the PD entry to determine the address of the PT (try L1 cache, then L2 cache, then L3 cache, then RAM)
    * do access checks based on flags in the PD entry
    * fetch the PT entry to determine the address of the page (try L1 cache, then L2 cache, then L3 cache, then RAM)
    * do access checks based on flags in the PT entry
    * insert data into the TLB (including setting the "accessed" flag in the page table entry)
* if the translation is in the TLB, then:
    * do access checks based on flags in the "TLB entry"
* do the "physical address = physical address of page + offset in page" calculation
* read the data for the physical address (try L1 cache, then L2 cache, then L3 cache, then RAM)
For this reality (with the restrictions mentioned), the number of fetches from RAM can be anything from zero to five.
Can you see why the author (while trying to explain page faults and not trying to explain translation costs) might want to avoid showing something like this and might simplify (by assuming that only one fetch is needed because the translation is in the TLB) instead?
The fundamental source of your problem is that you are reading a book that is only fit for lining a cat box. What you are describing is nonsensical gibberish that textbooks use to create confusion among students. This is not a case of oversimplification, because the authors apparently throw in a nonsensical formula for access times.
A formula like this
Effective Access Time = (1-p) x one memory access time + p x page fault service time
is total bovine fecal waste matter with no basis in reality.
The author states that when Page Table is present in Main Memory, system would need two memory accesses to read data stored in a frame.
The processor has to translate logical addresses to physical addresses using the page tables. Assuming that there is no caching in the CPU, the CPU has to read the page table for each memory access.
The number of reads depends upon the page table format used by the CPU.
Let's suppose your process has a multi-level page table. In that case the CPU has to make a read for each level of the table.
If you have a CPU that has separate linear system and user page tables, with the user tables in logical addresses, each access to system space requires one extra memory read, and each access to user space requires at least two extra memory reads and might, in fact, trigger a page fault. The first read is to the system page table to find the user page table entry. The second read is to the user page table. The third is to the data.
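A sketch of that kind of scheme (the layout, page size, and helpers below are simplified assumptions loosely modeled on the description above, not any specific architecture's definitions): because the user page table lives at a virtual address in system space, fetching a user PTE itself requires a system page table lookup first, which gives the three reads described.

```c
#include <stdbool.h>
#include <stdint.h>

#define VALID      0x1u
#define PAGE_SHIFT 12                     /* assumed 4 KiB pages for simplicity */

extern uint32_t phys_read32(uint32_t paddr);  /* hypothetical physical-memory read */
extern uint32_t sys_pt_phys_base;             /* system page table (physically addressed) */
extern uint32_t user_pt_sys_vaddr;            /* user page table's *system virtual* base */

/* Translate a system-space virtual address: one read of the system page table. */
static bool sys_translate(uint32_t sys_vaddr, uint32_t *paddr)
{
    uint32_t pte = phys_read32(sys_pt_phys_base + (sys_vaddr >> PAGE_SHIFT) * 4);
    if (!(pte & VALID)) return false;
    *paddr = (pte & ~0xFFFu) | (sys_vaddr & 0xFFF);
    return true;
}

/* Translate a user-space virtual address: the user PTE itself lives in system
 * virtual memory, so translate *its* address first, then read it, then the data. */
bool user_translate(uint32_t user_vaddr, uint32_t *paddr)
{
    uint32_t pte_sys_vaddr = user_pt_sys_vaddr + (user_vaddr >> PAGE_SHIFT) * 4;
    uint32_t pte_paddr;
    if (!sys_translate(pte_sys_vaddr, &pte_paddr))    /* read #1: system page table */
        return false;                                 /* could page-fault here */
    uint32_t pte = phys_read32(pte_paddr);            /* read #2: user page table entry */
    if (!(pte & VALID)) return false;
    *paddr = (pte & ~0xFFFu) | (user_vaddr & 0xFFF);  /* read #3 would fetch the data */
    return true;
}
```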
In reality, every CPU on the planet does page table caching so separate reads are not required (all the time).
I cannot wrap my head around why the author suggests that, in case of no page fault, only one memory access will be needed.
It sounds like the book is not being consistent in its BS.
The reality is that logical memory translation requires a number of steps. However, what those steps are depends upon the state of the processor, something that is unpredictable. These steps take place transparently behind the scenes and you do not even need to grasp all of them to understand operating systems.
What you need to know in the real world is that the CPU translates logical addresses to physical addresses. If the CPU is unable to make that translation, it triggers a page fault.

Lucene: Loading Index files while searching?

Can anyone explain how index files are loaded in memory while searching?
Is the whole file (fnm, tis, fdt, etc.) loaded at once or in chunks?
How are individual segments loaded, and in which order?
How to encrypt Lucene index?
The main point of having the index segments is that you can rarely load the whole index into memory.
The most important limitation taken into account while designing the index format is that disk seek time is relatively long (on platter-based hard drives, which are still the most widely used). A good estimate is that the transfer time per byte is about 0.01 to 0.02 μs, while the average seek time of the disk head is about 5 ms!
So the part that is kept in memory is typically only the dictionary, used to find out the beginning block of the postings list on the disk*. The other parts are loaded only on-demand and then purged from the memory to make room for other searches.
As for encryption, it depends on whether you need to keep the index encrypted all the time (even when in memory) or if it suffices to encrypt only the index files. As for the latter, I think that an encrypted file system will be enough. As for the former, it is also certainly possible, as different index compression techniques are already in place. However, I don't think it's widely used, as the first and foremost requirement for full-text engine is speed.
[*] It's not really that simple: since we're performing binary searches against the dictionary, we need to ensure that all entries in the first structure have equal length. As that's clearly not the case with normal words in a dictionary, and applying padding is too costly (think of word lengths for some chemical substances), we actually maintain two levels of dictionary. The first one (which needs to fit in memory and is stored in .tii files) keeps a sorted list of starting positions of terms in the second index (.tis files). The second index is a concatenated array of all terms in increasing order, along with a pointer to the sector in the .frq file. The second index often fits in memory and is loaded at the start, but that can be impossible, e.g. for bigram indexes. Also note that for some time now Lucene has, by default, used so-called compound files (with the .cfs extension) instead of individual files, to cut down the number of open files.
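A rough sketch of that two-level lookup (the record layout and names below are invented for illustration; real Lucene term dictionaries use prefix/delta encoding and variable-length records): binary-search the small in-memory index to find the right block, then seek and scan the on-disk term file from that block's start.

```c
#include <stdio.h>
#include <string.h>

/* In-memory index (.tii-like): every Nth term plus its byte offset in the term file. */
struct index_entry {
    const char *term;       /* the term at the start of a block */
    long        tis_offset; /* where that block begins in the on-disk term file */
};

/* On-disk term file (.tis-like), simplified to fixed-size records: all terms in
 * sorted order, each with a pointer into the postings (.frq-like) file. */
struct term_entry {
    char term[64];
    long frq_offset;
};

/* Binary search the in-memory index for the last block whose first term <= target. */
static long find_block(const struct index_entry *idx, int n, const char *target)
{
    int lo = 0, hi = n - 1, best = 0;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (strcmp(idx[mid].term, target) <= 0) { best = mid; lo = mid + 1; }
        else hi = mid - 1;
    }
    return idx[best].tis_offset;
}

/* Seek to the block (one disk seek) and scan forward until the term is found or passed. */
int find_term(FILE *tis, const struct index_entry *idx, int n,
              const char *target, struct term_entry *out)
{
    fseek(tis, find_block(idx, n, target), SEEK_SET);
    while (fread(out, sizeof *out, 1, tis) == 1) {
        int cmp = strcmp(out->term, target);
        if (cmp == 0) return 1;   /* found: out->frq_offset points at the postings */
        if (cmp > 0)  return 0;   /* passed it: term not in the index */
    }
    return 0;
}
```

The design keeps the expensive operation (the seek) down to roughly one per term lookup, while the memory-resident part stays small because it only holds every Nth term.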

Paging: Basic, Hierarchical, Hashed, and Inverted

With respect to operating systems and page tables, it seems there are 4 general methods to paging and page tables
Basic - A single page table which stores the page number and the offset
Hierarchical - A multi-tiered table which breaks up the virtual address into multiple parts
Hashed - A hashed page table which may often include multiple hashings mapping to the same entry
Inverted - The logical address also includes the PID, page number and offset. Then the PID is used to find the page in the table, and the number of rows down the table is added to the offset to find the physical address in main memory. (Rough, and probably terrible, definition)
I am just wondering what are the pros and cons of each method? It seems like basic is the easier method but may also take up more space in memory for a larger address space.
What else?
The key to building a usable paging model is minimizing the space wasted on entries that are not needed: you want to minimize the amount of memory used by the tables while keeping the computational cost of a memory lookup low.
Basic can take up a lot of memory (a single-level table covering a full 32-bit address space with 4 KiB pages already needs about 4 MB per process, and across many processes that quickly runs into hundreds of MB just for tables) and is therefore impractical.
Hierarchical reduces that memory a lot by only adding subtables that are actually in use. Still, every process has a root page table. And if the memory footprint of the processes is scattered, there may still be a lot of unnecessary entries in secondary tables. This is a far better solution regarding memory than Basic and introduces only a marginal computation increase.
Hashed does not work on its own because of hash collisions.
Inverted is the solution that makes Hashed work. The memory use is very small (as big as a Basic table for a single process, plus some PID and chaining overhead). The problem is that if there is a hash collision (several processes use the same virtual address), you have to follow the chain information (just as in a linked list) until you find the entry with a matching PID. This may produce a lot of computing overhead on top of the hash computation, but it keeps the memory footprint as small as possible.
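A sketch of that lookup (the table size, hash function, and field names are invented for illustration): hash the (PID, virtual page number) pair, then follow the collision chain until an entry matches both; the matching entry's index in the table is the physical frame number.

```c
#include <stdint.h>

#define NUM_FRAMES 4096            /* hypothetical: one entry per physical frame */
#define NO_ENTRY   (-1)

struct ipt_entry {
    int      pid;                  /* which process owns this frame */
    uint64_t vpn;                  /* which virtual page is mapped here */
    int      next;                 /* index of the next entry in the collision chain */
};

static struct ipt_entry ipt[NUM_FRAMES];
static int hash_anchor[NUM_FRAMES];   /* hash bucket -> first chained entry */

void ipt_init(void)
{
    for (int i = 0; i < NUM_FRAMES; i++) {
        hash_anchor[i] = NO_ENTRY;
        ipt[i].next = NO_ENTRY;
        ipt[i].pid = -1;
    }
}

static unsigned ipt_hash(int pid, uint64_t vpn)
{
    return (unsigned)((vpn * 2654435761u) ^ (unsigned)pid) % NUM_FRAMES;
}

/* Returns the physical frame number for (pid, vpn), or NO_ENTRY if unmapped. */
int ipt_lookup(int pid, uint64_t vpn)
{
    for (int i = hash_anchor[ipt_hash(pid, vpn)]; i != NO_ENTRY; i = ipt[i].next) {
        if (ipt[i].pid == pid && ipt[i].vpn == vpn)
            return i;              /* the entry's index IS the physical frame number */
    }
    return NO_ENTRY;               /* miss: the OS must consult its own structures / fault */
}
```

The table size is bounded by physical memory (one entry per frame) regardless of how many processes exist or how large their virtual address spaces are, which is where the memory savings come from.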