Why can't DMBSes rely on the OS buffer pool? - operating-system

Stonebraker's paper (Operating System Support for Database Management) explains that, "the overhead to fetch a block from the buffer pool manager usually includes that of a system call and a core-to-core move." Forget about the buffer-replacement strategy, etc. The only point I question is the quoted.
My understanding is that when a DBMS wants to read a block x it issues a common read instruction. There should be no difference from that of any other application requesting a read.
I'm not looking for generic answers (I got them, and read papers). I seek a detailed answer of the described problem.
See Does a file read from a Java application invoke a system call?

Reading from your other question, and working forward:
When the DBMS must bring a page from disk it will involve at least one system call. At his point most DBMSs place the page into their own buffer. (They also end up in the OS' buffer, but that's unimportant).
So, we have one system call. However, we can avoid any further system calls. This is possible because the DBMS is caching pages in its own memory space. The first thing the DBMS will do when it decides it needs a page is check and see if it has it in its cache. If it does, it retrieves it from there without ever invoking a system call.
The DBMS is free to expire pages in its cache in whatever way is most beneficial for its IO needs. The OS's cache is expired in a more general way since the OS has other things to worry about. One example of this is that a DBMS will typically use a great deal of memory to cache pages as it knows that disk IO is one of the most expensive things it can do. The OS won't do this as it has to balance the cost of disk IO against having memory for other applications to use.

The operating system disk i/o must be generalised to work for a variety of situations. The DBMS can sometimes gain significant performance using less general code that is optimised to its own needs.
The DBMS does its own caching, so doesn't want to work through the O/S caching. It "owns" the patch of disk, so it doesn't need to worry about sharing with other processes.
Update
The link to the paper is a help.
Firstly, the paper is almost thirty years old and is referring to long-obsolete hardware. Notwithstanding that, it makes quite interesting reading.
Firstly, understand that disk i/o is a layered process. It was in 1981 and is even more so now. At the lowest point, a device driver will issue physical read/write instructions to the hardware. Above that may be the o/s kernel code then the o/s user space code then the application. Between a C program's fread() and the disk heads moving, there are at least three or four levels and might be considerably more. The DBMS may seek to improve performance might seek to bypass some layers and talk directly with the kernel, or even lower.
I recall some years ago installing Oracle on a Sun box. It had an option to dedicate a disk as a "raw" partition, where Oracle would format the disk in its own manner and then talk straight to the device driver. The O/S had no access to the disk at all.

It's mainly a performance issue. A dbms has highly specific and unusual I/O demands.
The OS may have any number of processes doing I/O and filling its buffers with the assorted cached data that this produces.
And of course there is the issue of size and what gets cached (a dbms may be able to peform better cache for its needs than the more generic device buffer caching).
And then there is the issue that a generic “block” may in fact amount to a considerably larger I/O burden (this depends on partitioning and such like) than what a dbms ideally would like to bear; its own cache may be tuned to work better with the layout of the data on the disk and thereby able to minimise I/O.
A further thing is the issue of indexes and similar means to speed up queries, which of course works rather better if the cache actually knows what these mean in the first place.

The real issue is that the file buffer cache is not in the filesystem used by the DBMS; it's in the kernel and shared by all of the filesystems resident in the system. Any memory read out of the kernel must be copied into user space: this is the core-to-core move you read about.
Beyond this, some other reasons you can't rely on the system buffer pool:
Often, DBMS's have a really good idea about its upcoming access patterns, and it can't communicate these patterns to the kernel. This can lead to lower performance.
The buffer cache is traditional stored in a fixed-size kernel memory range, so it cannot grow or shrink. That also means the cache is much smaller than main memory, so by using the buffer cache a DBMS would be unable to take advantage of system resources.

I know this is old, but it came up as unanswered.
Essentially:
The OS uses a separate address spaces for every process.
Retrieving information from any other address space requires a system call or page fault. **(see below)
The DBMS is a process with its own address space.
The OS buffer pool Stonebraker describes is in the kernel address space.
So ... to get data from the kernel address space to the DBMS's address space, a system call or page fault is unavoidable.
You're correct that accessing data from the OS buffer pool manager is no more expensive than a normal read() call. (In fact, it's done with a normal read call.) However, Stonebraker is not talking about that. He's specifically discussing the caching needs of DBMSes, after the data has been read from the disk and is present in RAM.
In essence, he's saying that the OS's buffer pool cache is too slow for the DBMS to use because it's stored in a different address space. He's suggesting using a local cache in the same process (and therefore same address space), which can give you a significant speedup for applications like DBMSes which hit the cache heavily, because it will eliminate that syscall overhead.
Here's the exact paragraph where he discusses using a local cache in the same process:
However, many DBMSs including INGRES
[20] and System R [4] choose to put a
DBMS managed buffer pool in user space
to reduce overhead. Hence, each of
these systems has gone to the
trouble of constructing its own
buffer pool manager to enhance
performance.
He also mentions multi-core issues in the excerpt you quote above. Similar effects apply here, because if you can have just one cache per core, you may be able to avoid the slowdowns from CPU cache flushes when multiple CPUs are reading and writing the same data.
** BTW, I believe Stonebraker's 1981 paper is actually pre-mmap. He mentions it as future work. "The trend toward providing the file system as a part of shared virtual memory (e.g., Pilot [16]) may provide a solution to this problem."

Related

Page Replacement and LRU

If a page fault occured, then we have to replace the least recently used page of the process that request the frame or we have to replace the page that is least recently used all over the main memory?
Thank you.
Theory
Assume that there are N pages of data, which includes:
all data belonging to all processes
all file data on disk (that could be pre-fetched into a virtual file system cache)
all DNS lookup information (that could be pre-fetched into some kind of DNS cache)
all static HTML pages, images, etc (that could be pre-fetched into some kind of web page cache)
anything else you could possibly pre-fetch before software could want it
all data that can be pre-generated by software (e.g. things like prime number sieves, cached pixel data generated from fonts, mipmaps, ...)
The goal is to fill RAM with the "most likely to be needed next" data from all possible sources. Note that this can include (e.g.) sending recently used data belonging to a process from RAM to swap space so that you can use that RAM to pre-fetch data from the internet that has not been requested (if you know the data is more likely to be needed sooner than the data from the process).
There are 3 major problems:
some of the data is controlled by normal processes and not the OS; and there's no standard way of allowing normal processes to participate in the operating system's "keep RAM filled with the most likely to be needed next data" scheme.
often you can't accurately predict the future. Note that you can look at things like when a process will wake up after calling "sleep()" to accurately predict a tiny part of the future; and you can track statistics to inaccurately predict other things (e.g. if you know that the user checked a certain web site at lunch time on 9 of the previous 10 days then you can predict that there's a 90% chance they will check that web site at lunch time today). Of course (for some cases) "most recently used" is a reasonable predictor of "most likely to be needed again soon"; which leads to "keep the most recently used in RAM", which is where "evict the least recently used" (LRU) comes from.
there is cost associated with transferring data, where the cost depends on where the data is now and how busy the hardware needed to fetch the data currently is (e.g. fetching data from a fast Internet connection might be cheap when the network card is doing nothing anyway but expensive when the network card is busy doing a lot of other stuff)
Practice
You can try to solve all the problems (e.g. keep track of lots of things and have fancy prediction algorithms; and take the cost of transferring and/or generating data into account when deciding what to do; and provide some kind of "current memory pressure notification" that normal processes can use to participate in the operating system's "keep RAM filled with the most likely to be needed data" scheme); but it's all complicated and difficult (e.g. you'd want to ensure that the overhead of figuring out what should/shouldn't be in RAM doesn't cost more performance than you gain), so operating systems often do something much simpler and less effective.
Specifically; a very simple OS might only do "evict the least recently used" (with no pre-fetching, no consideration for the cost of transfers and and without normal processes participating at all); and this might be considered "good enough" despite being horribly bad.
If a page fault occured, then we have to replace the least recently used page of the process that request the frame or we have to replace the page that is least recently used all over the main memory?
Ideally, you'd try to evict the "least likely to be needed soon" data from all of memory (possibly including data belonging to the kernel itself); but compromises are unavoidable and there's nothing to say a "good enough despite being horribly bad" OS can't just evict the least recently used page from the current process.

What does "mostly-memory" mean?

ArangoDB describus itself as a "mostly-memory" database, but I am not clear on the implications. The FAQ gives very little detail:
ArangoDB is a “mostly memory” database, which means that it
appreciates RAM very much and is most performing when it is not forced
to swap data to the hard disk.
I am looking at running ArangoDB on a Raspberry Pi to serve two or three users. What are the implications of "mostly-memory" in such a context?
If it is unplugged for some reason will I lose data?
It's mostly working in memory, but is also doing some work on disks.
More precisely, it works with memory-mapped files, so all the operations will eventually be saved to disk (or equivalent long-term storage), but because it doesn't wait (by default) for the persistence to disk to happen, it can benefit in performance from this.
The implications is that if you use this default you get better performance than you might expect otherwise, but if something brings it down before the save has happened (especially a sudden power failure) then you could lose data or have a corrupt database.
If you configure a collection for immediate synchronisation you protect against this, but the performance is affected.

Shared buffer in postgres

I am curious about the role played by shared buffer in postgres. Shared buffer maintains all the recently accessed disk pages and dirty pages. If a new page needs to be brought in and there is no space left in shared buffer, a victim dirty page is written back to the disk.
However, I am confused about this statement-
PostgreSQL depends on the OS for caching. (http://www.varlena.com/GeneralBits/Tidbits/perf.html#shbuf)"
How does postgres depends on the OS for caching? And how does it change the behavior of shared buffer?
Postgresql uses the OS cache and its own data cache. The two are useful according to your database usage.
OS cache is very fast but basic: it removes older data with the new one. It is useful for very versatile query results.
PG cache is slower (still much faster than disk) but it keeps usage counters of the most used data. Useful for recurrent results/index.
I think this link is clearer (and more up-to-date).
My understanding is that shared_buffers is where PostgreSQL processes work and share information, but above a certain limit (15% to 25% of the server RAM) diminishing returns makes it more interesting to leave more RAM to the OS to perform caching itself.

Is it true that CPU never fetches anything from memory directly?

I hear that cpu just fetches instruction from the EIP register,never fetches from memory directly.
But AFAIK,EIP just stores the address of the next instruction,the instruction itself is still in the memory.If CPU never fetches memory,how can it know what the next instruction actually is?
UPDATE
BTW,I know there're x86,x64,x87 architectures,but which does x86-64 belong to,x86 or x64??
The simple answer to your question is "no, it's not true".
The picture isn't very simple due to caching, instruction pipeline, branch prediction etc. However, the instruction pointer is just that, a pointer. It doesn't store opcodes.
EIP (Extended Instruction Pointer) should hold the address of the instruction. It's just a way to keep a tab of which instruction is being processed currently (or sometimes, which instruction to process next).
The instructions themselves are stored in the Memory (HDD, RAM, Cache) and need to be fetched by the CPU.
Maybe what you heard meant that since so many levels of caches are used generally it's quite rare that the fetch needs to access the RAM..
Well I don't know the point to your question.
Yes the CPU (in a broad sense of the word) does fetch from memory. It has a number of memory management devices (for cache line handling and pipelining). In fact, the 'pipeline' puts the instructions in L1 cache. Indeed, the instruction processor itself only fetches from there. The processor in reality probably never even looks at EIP (unless an instruction uses it directly as an operand).
So the real answer would be, find yourself a wikipedia articale on i86 processor design, and have a ball. You'll be able to know exactly what happens where.
Cheers
Not true in that way. CPU accesses memory thru the cache, so you can kinda say that it does not do it directly. (Also DMA cahnnel can transfer data between memory and IO without ever touching CPU).
Yes, CS:EIP points to the memory, to the next instruction to execute, but you can use direct addresses too for example (load the content of the address 0x0800 to the AX register, by default this is relative to DS segment):
MOV AX,[0x0800]

Do any common OS file systems use hashes to avoid storing the same content data more than once?

Many file storage systems use hashes to avoid duplication of the same file content data (among other reasons), e.g., Git and Dropbox both use SHA256. The file names and dates can be different, but as long as the content gets the same hash generated, it never gets stored more than once.
It seems this would be a sensible thing to do in a OS file system in order to save space. Are there any file systems for Windows or *nix that do this, or is there a good reason why none of them do?
This would, for the most part, eliminate the need for duplicate file finder utilities, because at that point the only space you would be saving would be for the file entry in the file system, which for most users is not enough to matter.
Edit: Arguably this could go on serverfault, but I feel developers are more likely to understand the issues and trade-offs involved.
ZFS supports deduplication since last month: http://blogs.oracle.com/bonwick/en_US/entry/zfs_dedup
Though I wouldn't call this a "common" filesystem (afaik, it is currently only supported by *BSD), it is definitely one worth looking at.
It would save space, but the time cost is prohibitive. The products you mention are already io bound, so the computational cost of hashing is not a bottleneck. If you hashed at the filesystem level, all io operations which are already slow will get worse.
NTFS has single instance storage.
NetApp has supported deduplication (that's what its called in the storage industry) in the WAFL filesystem (yeah, not your common filesystem) for a few years now. This is one of the most important features found in the enterprise filesystems today (and NetApp stands out because they support this on their primary storage also as compared to other similar products which support it only on their backup or secondary storage; they are too slow for primary storage).
The amount of data which is duplicate in a large enterprise with thousands of users is staggering. A lot of those users store the same documents, source code, etc. across their home directories. Reports of 50-70% data deduplicated have been seen often, saving lots of space and tons of money for large enterprises.
All of this means that if you create any common filesystem on a LUN exported by a NetApp filer, then you get deduplication for free, no matter what the filesystem created in that LUN. Cheers. Find out how it works here and here.
btrfs supports online de-duplication of data at the block level. I'd recommend duperemove as an external tool is needed.
It would require a fair amount of work to make this work in a file system. First of all, a user might be creating a copy of a file, planning to edit one copy, while the other remains intact -- so when you eliminate the duplication, the hard link you created that way would have to give COW semantics.
Second, the permissions on a file are often based on the directory into which that file's name is placed. You'd have to ensure that when you create your hidden hard link, that the permissions were correctly applied based on the link, not just the location of the actual content.
Third, users are likely to be upset if they make (say) three copies of a file on physically separate media to ensure against data loss from hardware failure, then find out that there was really only one copy of the file, so when that hardware failed, all three copies disappeared.
This strikes me as a bit like a second-system effect -- a solution to a problem long after the problem ceased to exist (or at least matter). With hard drives current running less than $100US/terabyte, I find it hard to believe that this would save most people a whole dollar worth of hard drive space. At that point, it's hard to imagine most people caring much.
There are file systems that do deduplication, which is sort of like this, but still noticeably different. In particular, deduplication is typically done on a basis of relatively small blocks of a file, not on complete files. Under such a system, a "file" basically just becomes a collection of pointers to de-duplicated blocks. Along with the data, each block will typically have some metadata for the block itself, that's separate from the metadata for the file(s) that refer to that block (e.g., it'll typically include at least a reference count). Any block that has a reference count greater than 1 will be treated as copy on write. That is, any attempt at writing to that block will typically create a copy, write to the copy, then store the copy of the block to the pool (so if the result comes out the same as some other block, deduplication will coalesce it with the existing block with the same content).
Many of the same considerations still apply though--most people don't have enough duplication to start with for deduplication to help a lot.
At the same time, especially on servers, deduplication at a block level can serve a real purpose. One really common case is dealing with multiple VM images, each running one of only a few choices of operating systems. If we look at the VM image as a whole, each is usually unique, so file-level deduplication would do no good. But they still frequently have a large chunk of data devoted to storing the operating system for that VM, and it's pretty common to have many VMs running only a few operating systems. With block-level deduplication, we can eliminate most of that redundancy. For a cloud server system like AWS or Azure, this can produce really serious savings.