Who places the data into the cache? - cpu-architecture

We have locality of reference, on the basis of which data is copied into the cache, but who is responsible for doing this?
Is there hardware or software that performs this action?

The CPU reads/writes data into the cache when an instruction that accesses memory is executed.
So it's an on-demand service, data is moved upon a request.
It then tries to keep the data in the cache as long as possible, until there is no more space and a replacement policy is used to evict a line in favor of new data.
The minimal unit of data transferred is called a line, and it is usually bigger than the register size (to exploit locality).
Some CPUs have a prefetcher that, upon recognizing specific memory access patterns, tries to automatically move data into the cache before it is actually requested by the program.
Some architectures have instructions that act as hints for the CPU to prefetch data from a specific address.
This lets the software have minimal control over the prefetching circuitry; however, if the software just wants to move data into the cache, it only has to read the data (the CPU will cache it, if caching is enabled in that region).
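The on-demand fill and eviction described above can be sketched as a toy model. This is only an illustration: the class name `ToyCache`, the 64-byte line size, full associativity and exact LRU are all assumptions made for the sketch; real caches are usually set-associative and use cheaper approximations of LRU.

```python
from collections import OrderedDict

LINE_SIZE = 64  # bytes per cache line (assumed for this sketch)


class ToyCache:
    """Hypothetical fully associative cache with LRU replacement."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()  # line number -> True, least recently used first
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Simulate one load or store; return True on a hit."""
        line = address // LINE_SIZE
        if line in self.lines:
            self.lines.move_to_end(line)    # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)  # evict the least recently used line
        self.lines[line] = True             # fill the line on demand
        return False


cache = ToyCache(num_lines=2)
# Addresses 0 and 8 share a line; 64 and 128 are in different lines.
results = [cache.access(a) for a in (0, 8, 64, 128, 0)]
```

Only the second access hits, because address 8 lies in the line that the access to address 0 already brought in. The final access to address 0 misses again, since that line was evicted to make room for the line holding address 128.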

Related

Why data is fetched from main memory in Write Allocate cache policy

For write allocate policy of a cache, when a write miss occurs, data is fetched from main memory and then updated with a write hit.
My question is, assuming a write-back policy on write hits, why is the data read from main memory if it is immediately being updated by the CPU? Can't we just write to the cache without fetching the data from main memory?
On a store that hits in L1d cache, you don't need to fetch or RFO anything because the line is already exclusively owned.
Normally you're only storing to one part of the full line, thus you need a copy of the full line to have it in Modified state. You need to do a Read For Ownership (RFO) if you don't already have a valid Shared copy of the line. (A Shared copy could be promoted to Exclusive and then Modified just by invalidating other copies, as in MESI.)
A full-line store (like x86 AVX-512 vmovdqa [rdi], zmm0 64-byte store) can just invalidate instead of Read For Ownership, and just wait for an acknowledgement that no other cores have a valid copy of the line. IDK if that actually happens for AVX-512 stores specifically in current x86 microarchitectures.
Skipping the read (and just invalidating any other copies) definitely does happen in practice in some CPUs in some cases, e.g. in the store protocol used by microcode to implement x86 rep stos and rep movs, which are basically memset / memcpy. For large counts they are definitely storing full lines, and it's worth it to avoid the memory traffic of reading first. See Andy Glew's comments, which I quoted in What setup does REP do? He designed P6's (Pentium Pro) fast-strings microcode when he was working at Intel, and says it included a no-RFO store protocol.
See also Enhanced REP MOVSB for memcpy
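The difference between a partial store and a full-line store can be sketched with a toy model. The class `WriteAllocateCache` and its `memory_reads` counter are hypothetical names for this illustration; it models a single core only, so the coherence traffic (invalidating other copies) is left out.

```python
LINE_SIZE = 64  # bytes per cache line (assumed for this sketch)


class WriteAllocateCache:
    """Hypothetical write-back, write-allocate cache over a bytearray 'memory'."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}        # line number -> bytearray(LINE_SIZE)
        self.memory_reads = 0  # number of lines fetched from memory

    def store(self, address, data):
        line, offset = divmod(address, LINE_SIZE)
        if offset + len(data) > LINE_SIZE:
            raise ValueError("store must not cross a line boundary")
        if line not in self.lines:
            if len(data) == LINE_SIZE:
                # Full-line store: every byte is overwritten, so no fetch is needed.
                self.lines[line] = bytearray(LINE_SIZE)
            else:
                # Partial store: fetch the line so the untouched bytes are valid.
                base = line * LINE_SIZE
                self.lines[line] = bytearray(self.memory[base:base + LINE_SIZE])
                self.memory_reads += 1
        self.lines[line][offset:offset + len(data)] = data


memory = bytearray(b"A" * 128)
cache = WriteAllocateCache(memory)
cache.store(4, b"zz")        # partial store to line 0: the line is read first
cache.store(64, b"B" * 64)   # full-line store to line 1: no read needed
```

After the partial store, line 0 holds the two new bytes surrounded by the old contents fetched from memory, which is why the fetch could not be skipped.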

Page Replacement and LRU

If a page fault occurs, do we have to replace the least recently used page of the process that requests the frame, or do we have to replace the page that is least recently used across all of main memory?
Thank you.
Theory
Assume that there are N pages of data, which includes:
all data belonging to all processes
all file data on disk (that could be pre-fetched into a virtual file system cache)
all DNS lookup information (that could be pre-fetched into some kind of DNS cache)
all static HTML pages, images, etc (that could be pre-fetched into some kind of web page cache)
anything else you could possibly pre-fetch before software could want it
all data that can be pre-generated by software (e.g. things like prime number sieves, cached pixel data generated from fonts, mipmaps, ...)
The goal is to fill RAM with the "most likely to be needed next" data from all possible sources. Note that this can include (e.g.) sending recently used data belonging to a process from RAM to swap space so that you can use that RAM to pre-fetch data from the internet that has not been requested (if you know the data is more likely to be needed sooner than the data from the process).
There are 3 major problems:
some of the data is controlled by normal processes and not the OS; and there's no standard way of allowing normal processes to participate in the operating system's "keep RAM filled with the most likely to be needed next data" scheme.
often you can't accurately predict the future. Note that you can look at things like when a process will wake up after calling "sleep()" to accurately predict a tiny part of the future; and you can track statistics to inaccurately predict other things (e.g. if you know that the user checked a certain web site at lunch time on 9 of the previous 10 days then you can predict that there's a 90% chance they will check that web site at lunch time today). Of course (for some cases) "most recently used" is a reasonable predictor of "most likely to be needed again soon"; which leads to "keep the most recently used in RAM", which is where "evict the least recently used" (LRU) comes from.
there is cost associated with transferring data, where the cost depends on where the data is now and how busy the hardware needed to fetch the data currently is (e.g. fetching data from a fast Internet connection might be cheap when the network card is doing nothing anyway but expensive when the network card is busy doing a lot of other stuff)
Practice
You can try to solve all the problems (e.g. keep track of lots of things and have fancy prediction algorithms; and take the cost of transferring and/or generating data into account when deciding what to do; and provide some kind of "current memory pressure notification" that normal processes can use to participate in the operating system's "keep RAM filled with the most likely to be needed data" scheme); but it's all complicated and difficult (e.g. you'd want to ensure that the overhead of figuring out what should/shouldn't be in RAM doesn't cost more performance than you gain), so operating systems often do something much simpler and less effective.
Specifically, a very simple OS might only do "evict the least recently used" (with no pre-fetching, no consideration for the cost of transfers, and without normal processes participating at all); and this might be considered "good enough" despite being horribly bad.
If a page fault occurs, do we have to replace the least recently used page of the process that requests the frame, or do we have to replace the page that is least recently used across all of main memory?
Ideally, you'd try to evict the "least likely to be needed soon" data from all of memory (possibly including data belonging to the kernel itself); but compromises are unavoidable and there's nothing to say a "good enough despite being horribly bad" OS can't just evict the least recently used page from the current process.
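The two choices in the question (per-process versus system-wide LRU) can be sketched as follows. The function `choose_victim` and the `(pid, page)` frame representation are hypothetical, chosen only to make the contrast concrete; a real kernel typically approximates LRU rather than tracking exact recency.

```python
from collections import OrderedDict


def choose_victim(frames, faulting_pid, policy):
    """Pick a page to evict.

    frames: OrderedDict with (pid, page) keys, least recently used first.
    policy: "global" evicts the LRU page in all of memory;
            "local" evicts the LRU page owned by the faulting process.
    """
    if policy == "global":
        return next(iter(frames))
    for pid, page in frames:
        if pid == faulting_pid:
            return (pid, page)
    return next(iter(frames))  # process owns no frames yet: fall back to global


# Recency order, oldest first: process 1 touched "a" longest ago.
frames = OrderedDict.fromkeys([(1, "a"), (2, "x"), (2, "y"), (1, "b")])
global_victim = choose_victim(frames, faulting_pid=2, policy="global")
local_victim = choose_victim(frames, faulting_pid=2, policy="local")
```

With process 2 faulting, the global policy evicts page "a" of process 1, while the local policy evicts page "x" of process 2 even though it was used more recently than "a".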

Why can't DBMSes rely on the OS buffer pool?

Stonebraker's paper (Operating System Support for Database Management) explains that "the overhead to fetch a block from the buffer pool manager usually includes that of a system call and a core-to-core move." Forget about the buffer-replacement strategy, etc. The only point I question is the one quoted.
My understanding is that when a DBMS wants to read a block x it issues a common read instruction. There should be no difference from that of any other application requesting a read.
I'm not looking for generic answers (I got them, and read papers). I seek a detailed answer of the described problem.
See Does a file read from a Java application invoke a system call?
Reading from your other question, and working forward:
When the DBMS must bring a page in from disk, it will involve at least one system call. At this point most DBMSs place the page into their own buffer. (It also ends up in the OS's buffer, but that's unimportant.)
So, we have one system call. However, we can avoid any further system calls. This is possible because the DBMS is caching pages in its own memory space. The first thing the DBMS will do when it decides it needs a page is check and see if it has it in its cache. If it does, it retrieves it from there without ever invoking a system call.
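A minimal sketch of that idea, assuming a Unix-like system where `os.pread` is available. The class `BufferPool`, the 4096-byte page size and the `syscalls` counter are hypothetical names for this illustration, not how any particular DBMS is implemented.

```python
import os
import tempfile
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size for this sketch


class BufferPool:
    """Hypothetical user-space page cache in front of os.pread()."""

    def __init__(self, fd, capacity):
        self.fd = fd
        self.capacity = capacity
        self.pages = OrderedDict()  # page number -> bytes, least recently used first
        self.syscalls = 0           # number of pread() calls issued

    def get_page(self, page_no):
        if page_no in self.pages:           # hit: served from our own memory
            self.pages.move_to_end(page_no)
            return self.pages[page_no]
        self.syscalls += 1                  # miss: one read system call
        data = os.pread(self.fd, PAGE_SIZE, page_no * PAGE_SIZE)
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)  # the pool picks its own victim
        self.pages[page_no] = data
        return data


with tempfile.TemporaryFile() as f:
    f.write(b"\x01" * PAGE_SIZE + b"\x02" * PAGE_SIZE)
    f.flush()
    pool = BufferPool(f.fileno(), capacity=8)
    first = pool.get_page(1)    # goes to the kernel
    again = pool.get_page(1)    # answered from the pool
    syscalls_after_two_reads = pool.syscalls
```

The second `get_page(1)` returns the cached bytes without entering the kernel, so only one read system call is issued for the two requests.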
The DBMS is free to expire pages in its cache in whatever way is most beneficial for its IO needs. The OS's cache is expired in a more general way since the OS has other things to worry about. One example of this is that a DBMS will typically use a great deal of memory to cache pages as it knows that disk IO is one of the most expensive things it can do. The OS won't do this as it has to balance the cost of disk IO against having memory for other applications to use.
The operating system disk i/o must be generalised to work for a variety of situations. The DBMS can sometimes gain significant performance using less general code that is optimised to its own needs.
The DBMS does its own caching, so doesn't want to work through the O/S caching. It "owns" the patch of disk, so it doesn't need to worry about sharing with other processes.
Update
The link to the paper is a help.
Firstly, the paper is almost thirty years old and is referring to long-obsolete hardware. Notwithstanding that, it makes quite interesting reading.
Secondly, understand that disk i/o is a layered process. It was in 1981 and is even more so now. At the lowest point, a device driver will issue physical read/write instructions to the hardware. Above that may be the o/s kernel code, then the o/s user space code, then the application. Between a C program's fread() and the disk heads moving, there are at least three or four levels and might be considerably more. A DBMS seeking to improve performance might bypass some layers and talk directly with the kernel, or even lower.
I recall some years ago installing Oracle on a Sun box. It had an option to dedicate a disk as a "raw" partition, where Oracle would format the disk in its own manner and then talk straight to the device driver. The O/S had no access to the disk at all.
It's mainly a performance issue. A dbms has highly specific and unusual I/O demands.
The OS may have any number of processes doing I/O and filling its buffers with the assorted cached data that this produces.
And of course there is the issue of size and what gets cached (a dbms may be able to cache better for its needs than the more generic device buffer caching).
And then there is the issue that a generic “block” may in fact amount to a considerably larger I/O burden (this depends on partitioning and such like) than what a dbms ideally would like to bear; its own cache may be tuned to work better with the layout of the data on the disk and thereby able to minimise I/O.
A further thing is the issue of indexes and similar means to speed up queries, which of course works rather better if the cache actually knows what these mean in the first place.
The real issue is that the file buffer cache is not in the filesystem used by the DBMS; it's in the kernel and shared by all of the filesystems resident in the system. Any memory read out of the kernel must be copied into user space: this is the core-to-core move you read about.
Beyond this, some other reasons you can't rely on the system buffer pool:
Often, a DBMS has a really good idea about its upcoming access patterns, but it can't communicate these patterns to the kernel. This can lead to lower performance.
The buffer cache is traditionally stored in a fixed-size kernel memory range, so it cannot grow or shrink. That also means the cache is much smaller than main memory, so by using the buffer cache a DBMS would be unable to take advantage of system resources.
I know this is old, but it came up as unanswered.
Essentially:
The OS uses a separate address space for every process.
Retrieving information from any other address space requires a system call or page fault. **(see below)
The DBMS is a process with its own address space.
The OS buffer pool Stonebraker describes is in the kernel address space.
So ... to get data from the kernel address space to the DBMS's address space, a system call or page fault is unavoidable.
You're correct that accessing data from the OS buffer pool manager is no more expensive than a normal read() call. (In fact, it's done with a normal read call.) However, Stonebraker is not talking about that. He's specifically discussing the caching needs of DBMSes, after the data has been read from the disk and is present in RAM.
In essence, he's saying that the OS's buffer pool cache is too slow for the DBMS to use because it's stored in a different address space. He's suggesting using a local cache in the same process (and therefore same address space), which can give you a significant speedup for applications like DBMSes which hit the cache heavily, because it will eliminate that syscall overhead.
Here's the exact paragraph where he discusses using a local cache in the same process:
"However, many DBMSs including INGRES [20] and System R [4] choose to put a DBMS managed buffer pool in user space to reduce overhead. Hence, each of these systems has gone to the trouble of constructing its own buffer pool manager to enhance performance."
He also mentions multi-core issues in the excerpt you quote above. Similar effects apply here, because if you can have just one cache per core, you may be able to avoid the slowdowns from CPU cache flushes when multiple CPUs are reading and writing the same data.
** BTW, I believe Stonebraker's 1981 paper is actually pre-mmap. He mentions it as future work. "The trend toward providing the file system as a part of shared virtual memory (e.g., Pilot [16]) may provide a solution to this problem."

mmap writes to file on disk(synchronous/asynchronous)

I have a question regarding mmap functionality. When mmap is used in asynchronous mode, where the kernel takes care of persisting the data to the mapped file on disk, is it possible for earlier updates to overwrite later updates?
Let's say at time T we modify a location in memory that is memory mapped to a file on disk, and again at time T+1 we modify the same location in memory. As the writes to the file are not synchronous, is it possible that the kernel first picks up the modifications made at time T+1 and then picks up the modifications made at time T, resulting in inconsistency in the memory mapped file?
It's not exactly possible. The file is allowed to be inconsistent until msync(2) or munmap(2); when that happens, dirty (modified) pages are written to disk page by page (sometimes more at a time, depending on the filesystem in newer kernels). msync() lets you request synchronous operation and invalidation of caches after the write finishes, which lets you ensure that the data in the cache is the same as the data in the file. Without that, it's possible that your program will see newer data while the file contains older data; the exact specifics of this rather hairy situation depend on the CPU architecture and the specific OS implementation of those routines.
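A small sketch of the scenario from the question, using Python's `mmap` module (whose `flush()` wraps msync() on Unix). The temporary file and variable names are made up for the illustration. Both stores modify the same in-memory page, and writeback copies the page as it currently is rather than replaying individual stores, so the earlier value has nothing left to overwrite the later one with.

```python
import mmap
import os
import tempfile

# Create a one-page file to map.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00" * mmap.PAGESIZE)
    path = f.name

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as m:
        m[0:5] = b"first"   # modification at time T
        m[0:5] = b"later"   # modification at time T+1, same location
        m.flush()           # ask for dirty pages to be written back

# Read the file back through the ordinary read path.
with open(path, "rb") as f:
    on_disk = f.read(5)
os.remove(path)
```

Here `on_disk` ends up holding the bytes written at time T+1.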

mmap() internals

It's widely known that the most significant mmap() feature is that a file mapping can be shared between many processes. But it is no less widely known that every process has its own address space.
The question is where the data of memory-mapped files is actually kept, and how processes get access to this memory.
I mean not *(pa+i) and other high-level stuff, but I mean the internals of the process.
This happens at the virtual memory management layer in the operating system. When you memory map a file, the memory manager basically treats the file as if it were swap space for the process. As you access pages in your virtual memory address space, the memory mapper has to interpret them and map them to physical memory. When you cross a page boundary, this may cause a page fault, at which time the OS must map a chunk of disk space to a chunk of physical memory and resolve the memory mapping. With mmap, it simply does so from your file instead of its own swap space.
If you want lots of details of how this happens, you'll have to tell us which operating system you're using, as implementation details vary.
This is very implementation-dependent, but the following is one possible implementation:
When a file is first memory-mapped, the data isn't stored anywhere in memory yet; it's still on disk. The virtual memory manager (VMM) allocates a range of virtual memory addresses to the process for the file, but those addresses aren't immediately added to the page table.
When the program first tries to read or write to one of those addresses, a page fault occurs. The OS catches the page fault, figures out that that address corresponds to a memory-mapped file, and reads the appropriate disk sector into an internal kernel buffer. Then, it maps the kernel buffer into the process's address space, and restarts the user instruction that caused the page fault. If the faulting instruction was a read, we're all done for now. If it was a write, the data is written to memory, and the page is marked as dirty. Subsequent reads or writes to data within the same page do not require reading/writing to/from disk, since the data is in memory.
When the file is flushed or closed, any pages which have been marked dirty are written back to disk.
Using memory-mapped files is advantageous for programs which read or write disk sectors in a very haphazard manner. You only read disk sectors which are actually used, instead of reading the entire file.
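The fault, load, mark-dirty and write-back steps of that possible implementation can be sketched as a toy model. Everything here is hypothetical (the class `ToyMapping`, the 4096-byte page size, a bytearray standing in for the disk); it mirrors the description above and is not how any specific kernel is written.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class ToyMapping:
    """Hypothetical model of a file mapping with demand paging."""

    def __init__(self, disk):
        self.disk = disk      # bytearray standing in for the file on disk
        self.resident = {}    # page number -> bytearray held in "RAM"
        self.dirty = set()    # pages modified since they were loaded
        self.page_faults = 0

    def _page(self, address):
        page_no = address // PAGE_SIZE
        if page_no not in self.resident:  # page fault: load the page from disk
            self.page_faults += 1
            start = page_no * PAGE_SIZE
            self.resident[page_no] = bytearray(self.disk[start:start + PAGE_SIZE])
        return page_no, self.resident[page_no]

    def read(self, address):
        _, page = self._page(address)
        return page[address % PAGE_SIZE]

    def write(self, address, value):
        page_no, page = self._page(address)
        page[address % PAGE_SIZE] = value
        self.dirty.add(page_no)           # remember to write this page back

    def flush(self):
        """Write dirty pages back to disk, as on a flush or close."""
        for page_no in sorted(self.dirty):
            start = page_no * PAGE_SIZE
            self.disk[start:start + PAGE_SIZE] = self.resident[page_no]
        self.dirty.clear()


disk = bytearray(3 * PAGE_SIZE)
m = ToyMapping(disk)
m.write(5000, 7)          # faults in page 1 and marks it dirty
value = m.read(5001)      # same page: no new fault, no disk access
m.flush()                 # page 1 is written back
```

Only page 1 is ever loaded, which matches the point that a mapping reads just the parts of the file that are actually touched.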
I'm not really sure what you are asking, but mmap() sets aside a chunk of virtual memory to hold the given amount of data (usually. It can be file-backed sometimes).
A process is an OS entity, and it gains access to memory mapped areas through the OS-prescribed method: calling mmap().
The kernel has internal buffers representing chunks of memory. Any given process is assigned a memory mapping in its own address space which refers to that buffer. A number of processes may have their own mappings, but they all end up resolving to the same chunk (via the kernel buffer).
This is a simple enough concept, but it can get a little tricky when processes write. To keep things simple in the read-only case there's usually a copy-on-write functionality that's only used as needed.
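The "separate mappings resolving to the same chunk" idea can be observed with two independent mappings of one file. This sketch assumes a Unix-like system, where Python's `mmap.mmap` creates a shared mapping by default; for brevity both mappings live in one process here, standing in for two processes that each call mmap() on the same file.

```python
import mmap
import os
import tempfile

# Create a one-page file to map.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00" * mmap.PAGESIZE)
    path = f.name

with open(path, "r+b") as f1, open(path, "r+b") as f2:
    view_a = mmap.mmap(f1.fileno(), 0)   # first mapping of the file
    view_b = mmap.mmap(f2.fileno(), 0)   # second, independent mapping
    view_a[0:5] = b"hello"               # write through the first mapping
    seen_by_b = bytes(view_b[0:5])       # read through the second mapping
    view_a.close()
    view_b.close()
os.remove(path)
```

The second mapping sees the bytes written through the first one without any explicit flush, because both mappings are backed by the same pages in the kernel.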
Any data lives in some form of storage: in some cases on an HDD, in embedded systems perhaps in flash memory, or even in RAM itself (initramfs). Except in the last case, the data is frequently cached in RAM. RAM is logically divided into pages, and the kernel maintains a list of descriptors, each of which uniquely identifies a page.
So at best, accessing data means accessing physical pages. Each process gets its own address space, which consists of many vm_area_struct instances, each identifying a mapped region in the address space. In a call to mmap, a new vm_area_struct may be created, or the new region may be merged with an existing one if the addresses are adjacent.
A new virtual address is returned from the call to mmap. New page table entries are also created, which hold the mapping from the newly created virtual addresses to the physical addresses where the real data resides. The mapping can be backed by a file, or be anonymous, as used by malloc. The process address space structure mm_struct uses its pgd_t pointer (page global directory) to reach the physical page and access the data.