Address translation with multiple pagesize-specific TLBs - cpu-architecture

For Intel 64 and IA-32 processors, for both data and code independently, there may be both a 4KB TLB, and a Large Page (2MB, 1GB) TLB (LTLB). How does address translation work in this case?
Would the hardware simply be able to access both in parallel, knowing that a double-hit can't occur?
In the LTLBs, how would the entries be organized? I suppose, when the entry is originally filled from a page-structure entry, the LTLB entry could include information about how a hit on this entry would proceed?
Does anyone have a reference to a current microarchitecture?

There are many possible designs for a TLB that supports multiple page sizes and the trade-offs are significant. However, I'll only briefly discuss those designs used in commercial processors (see this and this for more).
One immediate issue is how to know the page size before accessing a set-associative TLB. A given virtual address to be mapped to a physical address has to be partitioned as follows:
-----------------------------------------
| page number | page offset |
-----------------------------------------
| tag | index | page offset |
-----------------------------------------
The index is used to determine which set of the TLB to lookup and the tag is used to determine whether there is a matching entry in that set. But given only a virtual address, the page size cannot be known without accessing the page table entry. And if the page size is not known, the size of the page offset cannot be determined. This means that the location of the bits that constitute the index and the tag are not known.
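The ambiguity can be illustrated with a minimal Python sketch. The helper name `split_virtual_address` and the 16-set TLB geometry are assumptions made for illustration, not a description of any real processor. The same address yields a different index and tag depending on which page size is assumed:

```python
def split_virtual_address(va, page_offset_bits, index_bits):
    """Split va into (tag, index, page_offset) for a set-associative TLB."""
    page_offset = va & ((1 << page_offset_bits) - 1)
    vpn = va >> page_offset_bits             # virtual page number
    index = vpn & ((1 << index_bits) - 1)    # low bits of the page number
    tag = vpn >> index_bits                  # remaining high bits
    return tag, index, page_offset


va = 0x7F1234567ABC
INDEX_BITS = 4  # hypothetical TLB with 16 sets

as_4k = split_virtual_address(va, 12, INDEX_BITS)  # assume a 4 KB page
as_2m = split_virtual_address(va, 21, INDEX_BITS)  # assume a 2 MB page

print("4 KB:", [hex(x) for x in as_4k])
print("2 MB:", [hex(x) for x in as_2m])
```

The two calls select different sets, which is why the page size has to be known (or guessed) before a set-associative lookup can start.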
Most commercial processors use one of two designs (or both) to deal with this issue. The first is by using a parallel TLB structure where each TLB is designated for page entries of a particular size only (this is not precise, see below). All TLBs are looked up in parallel. There can either be a single hit or all misses. There are also situations where multiple hits can occur. In such cases the processor may choose one of the cached entries.
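As a rough software model of the parallel design (an assumption-laden sketch: real hardware probes all structures simultaneously and each one is set-associative, whereas here each per-size TLB is a plain dictionary and the name `parallel_lookup` is hypothetical):

```python
PAGE_OFFSET_BITS = {"4K": 12, "2M": 21, "1G": 30}


def parallel_lookup(tlbs, va):
    """Probe every per-page-size TLB with the page number that size implies.

    tlbs maps a size name to {virtual page number: physical frame number}.
    Returns a list of (size, physical address) hits.
    """
    hits = []
    for size, tlb in tlbs.items():
        shift = PAGE_OFFSET_BITS[size]
        frame = tlb.get(va >> shift)
        if frame is not None:
            offset = va & ((1 << shift) - 1)
            hits.append((size, (frame << shift) | offset))
    return hits


tlbs = {
    "4K": {0x12345: 0xABC},  # one cached 4 KB translation
    "2M": {0x3: 0x40},       # one cached 2 MB translation
    "1G": {},
}
print(parallel_lookup(tlbs, 0x12345678))
print(parallel_lookup(tlbs, 0x00601234))
```

Each structure uses its own fixed page-offset width, so no structure needs to know the page size in advance.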
The second is by using a fully-associative TLB, which is designed as follows. Let POmin denote the size of the page offset for the smallest page size supported by the architecture. Let VA denote the size of a virtual address. In a fully-associative cache, an address is partitioned into a page offset and a tag; there is no index. Let Tmin denote VA - POmin. The TLB is designed so that each entry holds a tag of size Tmin irrespective of the page size of the page table entry cached in that TLB entry.
The Tmin most significant bits of the virtual address are supplied to the comparator at each entry of the fully-associative TLB to compare the tags (if the entry is valid). The comparison is performed as follows.
| M |
|11|0000| | the mask of the cached entry
-----------------------------------------
| T(x) |M(x)| | some bits of the offset needs to be masked out
-----------------------------------------
| T(x) | PO(x) | partitioning according to actual page size
-----------------------------------------
| T(min) | PO(min) | partitioning before tag comparison
-----------------------------------------
Each entry in the TLB contains a field called the tag mask. Let Tmax denote the size of the tag of the largest page size supported by the architecture. Then the size of the tag mask, M, is Tmin - Tmax. When a page table entry gets cached in the TLB, the mask is set in a way so that when it's bitwise-and'ed with the corresponding least significant bits of a given tag (of size Tmin), any remaining bits that belong to the page offset field become all zeros. In addition, the tag stored in the entry is appended with a sufficient number of zeros so that its size is Tmin. So some bits of the mask would be zeros while others would be ones, as shown in the figure above.
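The masked comparison can be sketched in Python as follows. The concrete widths (48-bit virtual addresses, 4 KB smallest and 1 GB largest page) and the names `make_entry` and `matches` are assumptions chosen for illustration; only the masking scheme itself comes from the description above.

```python
VA_BITS = 48
PO_MIN = 12                  # page offset bits of the smallest page (4 KB)
PO_MAX = 30                  # page offset bits of the largest page (1 GB)
T_MIN = VA_BITS - PO_MIN     # tag width stored in every entry
T_MAX = VA_BITS - PO_MAX     # tag width of the largest page
M_BITS = T_MIN - T_MAX       # width of the tag mask


def make_entry(va, page_offset_bits, frame):
    """Build a TLB entry for the page containing va."""
    extra = page_offset_bits - PO_MIN          # low tag bits that are really offset
    low_zeros = (1 << extra) - 1
    mask = ((1 << M_BITS) - 1) & ~low_zeros    # ones above the offset bits, zeros below
    tag = (va >> PO_MIN) & ~low_zeros          # tag padded with zeros to T_MIN bits
    return {"tag": tag, "mask": mask, "frame": frame}


def matches(entry, va):
    """Compare the T_MIN upper bits of va against the entry, under its mask."""
    candidate = va >> PO_MIN
    always_compared = ((1 << T_MAX) - 1) << M_BITS  # the upper T_MAX bits
    return (candidate & (always_compared | entry["mask"])) == entry["tag"]


entry_2m = make_entry(0x40200000, 21, frame=0x55)
entry_4k = make_entry(0x1000, 12, frame=0x77)
print(matches(entry_2m, 0x40200000 + 0x1ABCDE))  # inside the same 2 MB page
print(matches(entry_2m, 0x40400000))             # the next 2 MB page
```

Every entry runs the same comparator; only the per-entry mask differs, which is what lets one fully-associative structure hold pages of any size.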
Now I'll discuss a couple of examples. For simplicity, I'll assume there is no hyperthreading (possible design options include sharing, static partitioning, and dynamic partitioning). Intel Skylake uses the parallel TLB design for both the L1 D/I TLB and the L2 TLB. In Intel Haswell, 1 GB pages are not supported by the L2 TLB. Note that 4 MB pages use two TLB entries (with replicated tags). I think that the 4 MB page table entries can only be cached in the 2 MB page entry TLB. The AMD 10h and 12h processors use a fully-associative L1 DTLB, a parallel L2 DTLB, a fully-associative parallel L1 ITLB, and an L2 ITLB that supports only 4 KB pages. The Sparc T4 processor uses a fully-associative L1 ITLB and a fully-associative L1 DTLB. There is no L2 TLB in Sparc T4.

Related

How do I calculate the size of a virtual page in a system?

Given a virtual memory system which utilises a 32-bit virtual address.
A page table that takes 1 MiB of memory per process.
Each PTE (page table entry) requires 4 bytes.
The system has a total of 256 Megabytes of memory available.
I understand that a Page table is essentially a list of entries(PTE) that provide a mapping of the virtual addresses to a physical address.
I need to calculate the size of each virtual page. But I have no clue how.
So far all I've got is 2^20(page-table size)/2^2(PTE size)=2^18 this gives me the total amount of entries I can have in a page table. I'm not even sure if this is useful to find the size of each virtual page.
Could anyone point me in the right direction? Perhaps I'm misunderstanding in how these metrics relate to the size of a virtual page.
Edit: I've found out the size of the page is determined by the following.
A virtual address consists of bits for a page pointer and an offset.
The last bits of the virtual address are called the offset, which is the location difference between the byte address you want and the start of the page. You require enough bits in the offset to be able to get to any byte in the page. For a 4K page you require (4K == (4 * 1024) == 4096 == 2^12 ==) 12 bits of offset.
The page pointer can be determined by the number of entries in the table. This was simply my formula from before 2^20(page-table size)/2^2(PTE size)=2^18 entries. Which means I have 18 bits being used in my virtual address for my page pointer. I can determine the offset by 2^32(virtual address size)/2^18 which gives me 2^14. Therefore my page size for my virtual address is 2^14 or 16KiB.
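The arithmetic in this edit can be checked with a short Python sketch. It assumes a single-level (flat) page table that covers the whole virtual address space, which is an assumption about the exercise rather than something it states; the function name is hypothetical.

```python
def page_size_single_level(va_bits, table_bytes, pte_bytes):
    """Derive the page size, assuming one flat page table maps the whole space."""
    entries = table_bytes // pte_bytes
    page_number_bits = entries.bit_length() - 1   # log2; entries is a power of two
    offset_bits = va_bits - page_number_bits
    return entries, offset_bits, 1 << offset_bits


entries, offset_bits, page_bytes = page_size_single_level(32, 1 << 20, 4)
print(entries, offset_bits, page_bytes)
```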
The problem as you describe is under-specified. You need to know the width of the page offset field within the virtual address (or, how many levels of indirection the VM system is using). For example, (as in ONE of the modes that x86 system uses), if you have two levels of indirection, then you will have 10x2 bits used for levels of indirection and remaining 12 bits for offset within the page. That gives you a page size (= frame size) of 4KB.
If you instead use one level of indirection (as ANOTHER x86 mode allowed, but is found less often), then you can have a division of 10 bits for the only level of indirection and remaining 22 bits as offset within the page. That gives a page size of 4MB.
You see above that same 32 bit virtual address can follow different levels of indirection for paging and end up with different page sizes.
Page offset size in the virtual address determines the page size.
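A small Python sketch of the two splits described above (the function name is hypothetical; the bit widths are the ones given in this answer):

```python
def page_size_from_split(va_bits, index_bits_per_level):
    """Page size implied by the index bits consumed at each paging level."""
    offset_bits = va_bits - sum(index_bits_per_level)
    return 1 << offset_bits


two_level = page_size_from_split(32, [10, 10])  # 10 + 10 + 12
one_level = page_size_from_split(32, [10])      # 10 + 22
print(two_level, one_level)
```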
There is no answer under those facts. You have that the page table entry is 32 bits. That puts a theoretical upper bound on the page size of 2^32. However, some bits are going to be used for control, so the size will be smaller.
The 1MB size of the page table and the 32-bit virtual address are irrelevant to the page size.

Finding minimum page size to allow TLB access to overlap with tag fetch [duplicate]

This question already has an answer here:
Minimum associativity for a PIPT L1 cache to also be VIPT, accessing a set without translating the index to physical
Homework question, so please just nudge me in the right direction.
Consider a system with physically-addressed caches, and assume that 40-bit virtual addresses and 32-bit physical addresses are used, and the memory is byte-addressable. Further assume that the cache is 4-way set-associative, the cache line size is 64 Bytes and the total size of the cache is 64 KBytes.
What should be the minimum page size in this system to allow for the overlap of the TLB access and the cache access?
I've been stuck on this question and have no idea how to even begin. Can someone give me a hint towards finding the solution?
I think the most important piece of information in the question is
overlap of the TLB access and the cache access
This means we access the cache at the same time as we access the TLB. In practice, we index the cache with the index bits from the virtual address, and by the time we have located the set in the cache, we will have the physical address from the TLB. Then we can do the tag comparison with the physical address. In other words, the cache acts as a virtually indexed, physically tagged (VIPT) cache.
Even though the scheme sounds efficient, the thing to look out for is that the bits used to address a byte within one way of the cache (the index bits plus the line-offset bits) must all lie within the page offset, because only those bits are identical in the virtual and the physical address. Simply put, the page size puts an upper limit on the size of one cache way.
Now coming back to your question,
it's a 64 KByte cache, 4-way set associative, with a cache line of 64 Bytes.
Number of sets = (64 KBytes / 4) / 64 Bytes = 2^8 sets, so 8 index bits. Each line is 64 Bytes, so there are 6 line-offset bits. Together that is 14 bits.
That means if a page is 2^14 Bytes = 16 KBytes or bigger, we can use this mechanism. If a page is smaller than 16 KBytes, then we cannot assume the index bits of the virtual address and the physical address are going to be the same.
What should be the minimum page size in this system to allow for the
overlap of the TLB access and the cache access?
16 KBytes
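The constraint for a virtually indexed, physically tagged cache is that the set-index bits together with the line-offset bits must fit inside the page offset, since only page-offset bits are unchanged by translation. A minimal Python sketch of that calculation (the function name is hypothetical):

```python
def min_page_size_for_vipt(cache_bytes, ways, line_bytes):
    """Smallest page size for which index + line-offset bits fit in the page offset."""
    sets = cache_bytes // (ways * line_bytes)
    index_bits = sets.bit_length() - 1              # log2 of the number of sets
    line_offset_bits = line_bytes.bit_length() - 1  # log2 of the line size
    return 1 << (index_bits + line_offset_bits)


result = min_page_size_for_vipt(64 * 1024, 4, 64)
print(result)  # equals cache size / associativity
```

For the cache in the question this gives 8 index bits plus 6 line-offset bits, i.e. 14 bits, which corresponds to a 16 KByte page.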

How PAGE table size is calculated here

I am having a hard time calculating the page table size as described at the link below:
http://www.embedded-bits.co.uk/2011/mmucode/
As we know page table entries in this table are 4 bytes long and that there is a maximum of 4096 entries (one for each 1MB of the address space) we can calculate the size of the table as 16KB
Now total size of page table is 4096 entries * 4 bytes wide entry = 16384 bytes = 16kb
But as from above statement each of the 4096 entry corresponds to 1 Mb of address space, that means 1 entry = 1MB .
Since there are 4096 entries, space required to store it is 4096MB but we have page table size of 16kb only.
Also, how many virtual address this 1mb of section has, 250000?
EDIT:
Sorry if this is an even more stupid question on my part. I tried to understand it again. This 1 MB section is part of physical memory, not of the virtual memory/page table (as I understood earlier).
Now each entry is 4 bytes long; does that mean 4 virtual addresses are going to cover a 1 MB physical memory section?
Not quite sure what the question is asking but in a typical AArch32 MMU (barring the newer extensions), you would have first and second level translation tables in a typical configuration.
The first level translation table splits the entire address space into 1MB sections and typically contain pointers to level 2 tables which split those 1MB sections into granular pages (traditionally 4KB on most machines). 2nd level translation tables store physical addresses that correspond to those virtual addresses (each entry stores a physical address along with some flags). The virtual address is determined by its position within the L1 and L2 tables. To further clarify, all addresses stored in page tables (including addresses to L2 tables in the L1 table) are physical.
The first level translation table has a fixed size but depending on which architecture you're running you can change its size (this is used by a lot of ARM kernels to provide a user/kernel address space split). Each gigabyte of virtual memory space requires 4096 bytes in the L1 table at 4096 byte page size.
To even further clarify as far as I remember you can use L1 entries to map 1MB chunks directly without using L2 tables, the type of L1 entry is denoted by lower bits of it.
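To make the relationship between entries and address space concrete, here is a hedged Python sketch of the first-level arithmetic. It models only the numbers quoted in this thread (1 MB sections, 4-byte entries, 32-bit addresses); the function names are made up for illustration and it does not decode real descriptor formats.

```python
SECTION_BITS = 20   # each first-level entry covers 1 MB = 2**20 bytes
ENTRY_BYTES = 4     # each entry is itself only 4 bytes long


def l1_table_bytes(va_bits=32):
    """Memory occupied by the first-level table itself."""
    return (1 << (va_bits - SECTION_BITS)) * ENTRY_BYTES


def l1_entry_offset(va):
    """Byte offset of the table entry that describes the section containing va."""
    return (va >> SECTION_BITS) * ENTRY_BYTES


print(l1_table_bytes())             # 4096 entries * 4 bytes
print(hex(l1_entry_offset(0xC0100000)))
print(1 << SECTION_BITS)            # byte addresses covered by one entry
```

The table is small because each 4-byte entry only describes a 1 MB region; it does not contain that region. One entry covers 2^20 = 1,048,576 byte addresses.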
(Sorry if I got the unit suffixes wrong)

What is page table entry size?

I found this example.
Consider a system with a 32-bit logical address space. If the page
size in such a system is 4 KB (2^12), then a page table may consist of
up to 1 million entries (2^32/2^12). Assuming that
each entry consists of 4 bytes, each process may need up to 4 MB of physical address space for the page table alone.
What is the meaning of each entry consists of 4 bytes and why each process may need up to 4 MB of physical address space for the page table?
A page table is a table of conversions from virtual to physical addresses that the OS uses to artificially increase the total amount of main memory available in a system.
Physical memory is the actual bits located at addresses in memory (DRAM), while virtual memory is where the OS "lies" to processes by telling them where it's at, in order to do things like allow for 2^64 bytes of address space, despite the fact that 2^32 bytes is the most RAM normally used. (2^32 bytes is 4 GiB; 2^64 bytes is 16 EiB.)
Most default page table sizes are 4096 KB for each process, but the number of page table entries can increase if the process needs more process space. Page tables can also initially be allocated smaller or larger amounts of memory; it's just that 4096 KB is usually the best size for most processes.
Note that a page table is a table of page entries. Both can have different sizes, but page table sizes are most commonly 4096 KB (4 MB), and the page table size is increased by adding more entries.
As for why a PTE(page table entry) is 4 bytes:
Several answers say it's because the address space is 32 bits and the PTE needs 32 bits to hold the address.
But a PTE doesn't contain the complete address of a byte, only the physical page number. The rest of the bits contain flags or are left unused. It need not be 4 bytes exactly.
1) Because 4 bytes (32 bits) is exactly the right amount of space to hold any address in a 32-bit address space.
2) Because 1 million entries of 4 bytes each makes 4MB.
Your first doubt is in the line, "Each entry in the Page Table Entry, also called PTE, consists of 4 bytes". To understand this, first let's discuss what a page table contains. The answer is PTEs. So these 4 bytes are the size of each PTE, which consists of the physical frame number (and maybe 1-2 other fields, such as flag bits, if required/desired).
So, now you know what page table contains, you can easily calculate the memory space it will take, that is: Total no. of PTEs times the size of a PTE.
Which will be: 1 M * 4 bytes = 4 MB
Hope this clears your doubt. :)
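The same calculation as a short Python sketch, assuming a single flat page table with one entry per virtual page (the function name is hypothetical):

```python
def flat_page_table_bytes(va_bits, page_bytes, pte_bytes):
    """Entries and total size of a flat page table covering the whole space."""
    entries = (1 << va_bits) // page_bytes
    return entries, entries * pte_bytes


entries, table_bytes = flat_page_table_bytes(32, 4096, 4)
print(entries, table_bytes)
```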
The page table entry size is the number of bits required to represent any frame number. For example, if you have a physical memory with 2^32 frames, then you would need 32 bits to represent it. These 32 bits are stored in the page table in 4 bytes (32/8).
Now, since the number of pages is 1 million, the total size of the page table =
page table entry size * number of pages
= 4 B * 1 million
= 4 MB.
Hence, 4 MB would be required to store the table in main memory (physical memory).
So, the entry refers to page table entry (PTE). The data stored in each entry is the physical memory address (PFN). The underlying assumption here is the physical memory also uses a 32-bit address space. Therefore, PTE will be at least 4 bytes (4 * 8 = 32 bits).
In a 32-bit system with memory page size of 4KB (2^2 * 2^10 B), the maximum number of pages a process could have will be 2^(32-12) = 1M. Each process thinks it has access to all physical memory. In order to translate all 1M virtual memory addresses to physical memory addresses, a process may need to store 1 M PTEs, that is 4MB.
Honestly a bit new to this myself, but to keep things short it looks like 4MB comes from the fact that there are 1 million entries (each PTE stores a physical page number, assuming it exists); therefore, there are 2^20 = 1 M PTEs. 1 M * 4 bytes = 4 MB, so each process will require that for its page tables.
The size of a page table entry depends on the number of frames in physical memory. Since this text is from "Operating System Concepts" by Galvin, it is assumed here that the number of pages and the number of frames are the same. Under that assumption, the number of pages/frames comes out to 2^20. Since the page table only stores the frame number of the respective page, each page table entry has to be at least 20 bits wide to map 2^20 frame numbers to pages. Here 4 bytes (32 bits) are used as an upper limit, because the page table stores not only the frame number but also additional bits for protection and security, e.g. the valid/invalid bit. So to map pages to frames we need only 20 bits; the rest are extra bits that store protection and security information.
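The bit budget of an entry can be sketched in Python. This assumes, as the answer does, a 32-bit physical address space with 4 KB frames; the function name is hypothetical.

```python
def pte_bit_budget(pa_bits, page_bytes, pte_bytes):
    """Bits needed for the frame number and bits left over for flags."""
    offset_bits = page_bytes.bit_length() - 1   # log2 of the page size
    frame_bits = pa_bits - offset_bits
    spare_bits = pte_bytes * 8 - frame_bits
    return frame_bits, spare_bits


frame_bits, spare_bits = pte_bit_budget(32, 4096, 4)
print(frame_bits, spare_bits)
```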

Multilevel Paging Operating System

I had this problem in an exam today:
Suppose you have a computer system with a 38-bit logical address, page size of 16K, and 4 bytes per page table entry.
How many pages are there in the logical address space? Suppose we use two level paging and each page table can fit completely in a frame.
For the above mentioned system, give the breakup of logical address bits clearly indicating number of offset bits, page table index bits and page directory index bits.
Suppose we have a 32MB program such that the entire program and all necessary page tables (using two level paging) are in memory. How much memory (in number of frames) is used by the program, including its page tables?
How do I go about solving such a problem? Until now I would have thought page size = frame size, but that won't happen in this case.
Here is what I think:
Since the page size is 16K, my offset will be 17 bits (2^17 = 16K). Now how do I divide the rest of the bits, and what will be the frame size? Do I divide the rest of the bits in half?
2^38 / 16384 = 16777216 pages.
On one hand, the remaining 38 - log2(16384) = 24 bits of address may be reasonable to divide equally between the page directory and page table portions of the logical address, since such a symmetry will simplify the design. On the other hand, each page table should have the same size as a page so they can be offloaded to the disk in exactly the same way as normal/leaf pages containing program code and data. Fortunately, in this case using 12 bits for page directory indices and page table indices gets us both, since 2^12 * 4 bytes of page table entry size = 16384. Also, since a page address always has its 14 least significant bits set to zero due to the natural page alignment, only 38 - 14 = 24 bits of the page address need to be stored in a page table entry, and that gives you 32 - 24 = 8 bits for the rest of the control data (present, supervisor/user, writable/non-writable, dirty, accessed, etc. bits). This is what we get assuming that the physical address is also not longer than 38 bits. The system may have slightly more than 38 bits of physical address at the expense of having fewer control bits. Anyway, everything fits. So, there, 38 = 12 (page directory index) + 12 (page table index) + 14 (offset).
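The breakdown can be reproduced with a small Python sketch, assuming each page table exactly fills one page (as the exam question states) and that the page directory takes whatever bits remain. The function name is hypothetical.

```python
def two_level_split(va_bits, page_bytes, pte_bytes):
    """Return (directory index bits, table index bits, offset bits)."""
    offset_bits = page_bytes.bit_length() - 1
    entries_per_table = page_bytes // pte_bytes        # one table fills one page
    table_index_bits = entries_per_table.bit_length() - 1
    dir_index_bits = va_bits - offset_bits - table_index_bits
    return dir_index_bits, table_index_bits, offset_bits


split = two_level_split(38, 16 * 1024, 4)
total_pages = 1 << (38 - split[2])
print(split, total_pages)
```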
32MB/16KB = 2048 pages for the program itself. Each page table covers 2^12 = 4096 pages, so you will need 2048/4096 = 0.5 page tables for this program. We round this up to 1 page table. Then there's also the page directory. 2048+1+1 = 2050 is how many pages are necessary to contain the entire program with its related page tables in memory.
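The frame count can be checked with a Python sketch. It assumes the program occupies one contiguous range of virtual pages (so it needs the minimum number of page tables) and that there is a single page directory; the function name is hypothetical.

```python
def frames_for_program(program_bytes, page_bytes, pte_bytes):
    """Frames for the program's pages, its page tables, and one page directory."""
    data_pages = -(-program_bytes // page_bytes)        # ceiling division
    entries_per_table = page_bytes // pte_bytes
    page_tables = -(-data_pages // entries_per_table)   # round up to whole tables
    page_directory = 1
    return data_pages + page_tables + page_directory


frames = frames_for_program(32 * 2 ** 20, 16 * 1024, 4)
print(frames)
```

If the program were scattered across regions covered by different page tables, more than one page table would be needed, so the figure is a lower bound under the contiguity assumption.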