Shared buffers in Postgres - postgresql

I am curious about the role played by shared buffers in Postgres. The shared buffer pool holds recently accessed disk pages, including dirty ones. If a new page needs to be brought in and there is no space left in shared buffers, a victim page is evicted, and if it is dirty it is first written back to disk.
However, I am confused by this statement: "PostgreSQL depends on the OS for caching" (http://www.varlena.com/GeneralBits/Tidbits/perf.html#shbuf).
How does Postgres depend on the OS for caching? And how does that change the behavior of the shared buffers?

PostgreSQL uses both the OS cache and its own data cache; how useful each one is depends on your database usage.
The OS cache is very fast but basic: it evicts older data to make room for newer data. It is useful when queries touch a wide variety of data.
The PostgreSQL cache (shared_buffers) has a little more overhead per access (though it is still much faster than disk), but it keeps usage counters for the most used data, which makes it good for recurring results and hot indexes (see the sketch below).
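If you are curious, you can look at those usage counters yourself with the pg_buffercache extension. A minimal sketch, assuming psycopg2 is installed and the connection details (dbname, user) are placeholders you adjust to your setup:

```python
# Sketch: peek at shared_buffers usage counts via the pg_buffercache extension.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")   # placeholder connection details
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS pg_buffercache")
    # Buffers per relation, grouped by usage count (0-5 under the clock-sweep algorithm).
    cur.execute("""
        SELECT c.relname, b.usagecount, count(*) AS buffers
        FROM pg_buffercache b
        JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
        GROUP BY c.relname, b.usagecount
        ORDER BY buffers DESC
        LIMIT 10
    """)
    for relname, usagecount, buffers in cur.fetchall():
        print(relname, usagecount, buffers)
conn.close()
```

Pages with a high usage count are the "most used data" mentioned above and are the last to be evicted.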

I think this link is clearer (and more up-to-date).
My understanding is that shared_buffers is where PostgreSQL processes work and share information, but above a certain limit (15% to 25% of the server's RAM) diminishing returns make it better to leave the remaining RAM to the OS for its own caching.

Related

What is the downside of increasing shared_buffers in PostgreSQL?

I've noticed a significant performance drop when querying PostgreSQL if the data is not already loaded into shared_buffers; the difference can be almost 100 times. So, in the process of optimizing the query, I was wondering if there is any way to increase performance by increasing shared_buffers.
Then I started to investigate shared_buffers in PostgreSQL, and I found that the recommended value is 25% of system memory, with PostgreSQL taking advantage of the OS cache to accelerate queries. But from what I've seen with my own db, reading from disk vs. shared_buffers makes a huge difference, so I would like queries to be served from shared_buffers most of the time.
So I wondered: what's the downside if I increase shared_buffers in PostgreSQL? And what if I only increase shared_buffers on my read-only instance?
A downside of increasing the buffer cache is double buffering. When you need to read a page into shared_buffers, it might first be necessary to evict an existing page to make room for it. But then the OS cache might need to evict a page as well, to make room to read the page from the actual disk. You then end up with the same page held in both places, which wastes cache space. So instead of reading a page from the OS cache, you are more likely to need to read it from the actual disk, which is far slower. From a double-buffering perspective, you probably want shared_buffers to be either much less than half of the system RAM (using the OS cache as the main cache) or much larger than half (using shared_buffers as the main cache).
Another downside is that if it is too large, you might start to get out-of-memory errors or invoke the OOM killer or otherwise destabilize the system.
Another problem is that after some operations, like DROP TABLE, TRUNCATE, or the ending of a COPY in some circumstances, PostgreSQL needs to invalidate a lot of buffers and chooses to do so by scouring the entire buffer cache. If you do a lot of those operations, that time can really add up with large buffer cache settings.
Some workloads (I know about DROP TABLE, but there may be others) perform better with a smaller shared_buffers. But essentially, it is a matter of trial and error (or better yet: reproducible performance tests).
If you can make shared_buffers big enough that it can hold everything you need from the database, that is probably a good choice.
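If you want to know how much of a particular query is actually served from shared_buffers rather than the OS cache or disk, EXPLAIN (ANALYZE, BUFFERS) reports "shared hit" versus "read" counts. A small sketch, assuming psycopg2 and a placeholder table name my_table:

```python
# Sketch: see how many pages a query gets from shared_buffers vs. the OS cache/disk.
import psycopg2

conn = psycopg2.connect("dbname=mydb")   # placeholder connection details
with conn, conn.cursor() as cur:
    # my_table is a placeholder; "shared hit" = found in shared_buffers,
    # "read" = fetched from the OS cache or the actual disk.
    cur.execute("EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM my_table")
    for (line,) in cur.fetchall():
        print(line)          # look for lines like "Buffers: shared hit=1234 read=56"
conn.close()
```

Running the same query twice and comparing the hit/read numbers is an easy reproducible performance test in the spirit suggested above.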

What does "mostly-memory" mean?

ArangoDB describes itself as a "mostly-memory" database, but I am not clear on the implications. The FAQ gives very little detail:
ArangoDB is a “mostly memory” database, which means that it
appreciates RAM very much and is most performing when it is not forced
to swap data to the hard disk.
I am looking at running ArangoDB on a Raspberry Pi to serve two or three users. What are the implications of "mostly-memory" in such a context?
If it is unplugged for some reason will I lose data?
It mostly works in memory, but it also does some work on disk.
More precisely, it works with memory-mapped files, so all operations will eventually be saved to disk (or equivalent long-term storage); but because it does not wait (by default) for that persistence to happen, it gains performance.
The implication is that with this default you get better performance than you might otherwise expect, but if something brings the database down before the save has happened (especially a sudden power failure) then you could lose data or end up with a corrupt database.
If you configure a collection for immediate synchronisation you protect against this, but the performance is affected.
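To illustrate the general idea of memory-mapped persistence (a standalone conceptual sketch in Python, not ArangoDB code): writes land in RAM immediately, and only an explicit sync, the equivalent of a collection's immediate-synchronisation setting, guarantees they are on disk.

```python
# Conceptual sketch of "mostly-memory" behaviour with a memory-mapped file.
import mmap

path = "data.bin"                         # throwaway example file
with open(path, "wb") as f:
    f.truncate(4096)                      # pre-size the file to one page

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 4096)      # map the file into memory
    mm[0:5] = b"hello"                    # this write hits RAM immediately...
    # ...but is only guaranteed to be on disk after an explicit sync.
    mm.flush()                            # the durable-but-slower path
    mm.close()
```

Skip the flush() and the write is fast but lives only in memory until the OS eventually writes the dirty page back, which is exactly the trade-off described above.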

design mongodb to load entire content in memory

I am involved in a project where they get enough RAM to store the entire database in memory. According to the manager, that is what 10gen recommended. This is counterintuitive. Is that really the way you want to use MongoDB?
It is not counterintuitive... I find it quite intuitive, actually.
In How much faster is the memory usually than the disk? you can read:
(...) memory is only about 6 times faster when you're doing sequential
access (350 Mvalues/sec for memory compared with 58 Mvalues/sec for
disk); but it's about 100,000 times faster when you're doing random
access.
So if you can fit all your data in RAM, that is quite good, because reading your data is going to be really fast.
Regarding MongoDB, from the FAQ:
It’s certainly possible to run MongoDB on a machine with a small
amount of free RAM.
MongoDB automatically uses all free memory on the machine as its
cache. System resource monitors show that MongoDB uses a lot of
memory, but its usage is dynamic. If another process suddenly needs
half the server’s RAM, MongoDB will yield cached memory to the other
process.
Technically, the operating system’s virtual memory subsystem manages
MongoDB’s memory. This means that MongoDB will use as much free memory
as it can, swapping to disk as needed. Deployments with enough memory
to fit the application’s working data set in RAM will achieve the best
performance.
The problem is that you usually have much more data than available memory. Then you have to go to disk, and disk I/O is slow. Regarding database performance, avoiding full-scan queries is key (and much more important when going to disk). Therefore, if your data set does not fit in memory, you should aim at having indexes for the vast majority of your access patterns and try to fit those indexes in memory:
If you have created indexes for your queries and your working data set
fits in RAM, MongoDB serves all queries from memory.
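A quick way to check whether your indexes (and data) have any chance of fitting in RAM is to compare the collection statistics with the machine's memory. A rough sketch with pymongo, where mydb and mycoll are placeholder names:

```python
# Sketch: compare data and index sizes against available RAM (names are placeholders).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
stats = client.mydb.command("collstats", "mycoll")   # per-collection statistics

print("data size (bytes):  ", stats["size"])
print("index size (bytes): ", stats["totalIndexSize"])
# If totalIndexSize (and ideally the whole working set) fits comfortably in RAM,
# most indexed queries can be answered without touching the disk.
```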
It all depends on the size of your database. I am guessing your database is actually quite small; otherwise I cannot see how someone at 10gen could have given such advice. I mean, not even @Stennie gives such advice (he is from 10gen, by the way).
Even if your database is small, I don't see how the manager could recommend that. MongoDB does not do memory management of its own; as such, it does not "pin" data into pages like memcached or other memory-based databases do.
This means that the paging of mongod's data can be quite unpredictable, i.e. you could spend more time trying to keep things in RAM than paging data in. This is why it is better to just make sure your working set fits and can be loaded quickly; such things depend upon your hardware and queries.
@Stennie's comment pretty much sums up the stance you should take with MongoDB.

Heroku Postgres RAM for cache vs Memcache RAM

I have a web app on Heroku and I'm trying to understand the difference/trade-offs between adding a Memcached instance with 1GB of RAM and adding 1GB of RAM to my Postgres server.
If I added a Memcached instance, I would probably use Johnny Cache (for Django - http://packages.python.org/johnny-cache/).
Should I expect a similar performance improvement from the two options? In general, what is the advantage of using memcache vs. increasing the size of the Postgres cache? (I understand that people often run memcache on the DB server, so there must be one.)
I appreciate this is probably a very naive question but I haven't been able to find anything to clear up my confusion through Google.
For best performance Postgres needs enough cache to keep its most frequently used objects (indexes, tables). So there is a tipping point when setting shared_buffers: beyond that point, increasing shared_buffers does not help much.
It is good to leave some part of RAM for filesystem-level caching.
For much more see http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
As for memcache, it's a totally different beast... It can be used directly from the application as ultra-fast, non-persistent, key-value storage (a minimal sketch follows the list below).
Those three traits are what distinguish memcached from a relational database (RDB):
ultra-fast (an RDB is not)
non-persistent (an RDB is)
key-value only (an RDB offers much richer querying)
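To make the contrast concrete, here is a minimal sketch of using memcached directly from application code, with the python-memcached client; the key names, TTL, and the database helper are placeholders:

```python
# Sketch: memcached as an ultra-fast, non-persistent key-value cache in front of Postgres.
import memcache

mc = memcache.Client(["127.0.0.1:11211"])

def load_user_from_postgres(user_id):
    # placeholder for the real database query
    return {"id": user_id, "name": "example"}

def get_user(user_id):
    key = "user:%s" % user_id
    user = mc.get(key)                        # in-memory lookup, no disk involved
    if user is None:                          # cache miss: fall back to the database
        user = load_user_from_postgres(user_id)
        mc.set(key, user, time=300)           # cache for 5 minutes; gone if memcached restarts
    return user
```

Unlike giving Postgres more shared_buffers, this cache can hold computed results rather than raw pages, but the application has to invalidate it itself (which is what Johnny Cache automates for Django).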

Why can't DBMSes rely on the OS buffer pool?

Stonebraker's paper (Operating System Support for Database Management) explains that "the overhead to fetch a block from the buffer pool manager usually includes that of a system call and a core-to-core move." Forget about the buffer-replacement strategy, etc. The only point I question is the one quoted.
My understanding is that when a DBMS wants to read a block x it issues a common read instruction. There should be no difference from that of any other application requesting a read.
I'm not looking for generic answers (I got them, and read papers). I seek a detailed answer of the described problem.
See Does a file read from a Java application invoke a system call?
Reading from your other question, and working forward:
When the DBMS must bring a page in from disk, it will involve at least one system call. At this point most DBMSs place the page into their own buffer pool. (The page also ends up in the OS's buffer cache, but that's unimportant here.)
So, we have one system call. However, we can avoid any further system calls. This is possible because the DBMS is caching pages in its own memory space. The first thing the DBMS will do when it decides it needs a page is check and see if it has it in its cache. If it does, it retrieves it from there without ever invoking a system call.
The DBMS is free to expire pages in its cache in whatever way is most beneficial for its IO needs. The OS's cache is expired in a more general way since the OS has other things to worry about. One example of this is that a DBMS will typically use a great deal of memory to cache pages as it knows that disk IO is one of the most expensive things it can do. The OS won't do this as it has to balance the cost of disk IO against having memory for other applications to use.
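As a rough illustration of that in-process cache (a toy sketch, not how any real DBMS is implemented): a fixed number of buffers, a hash table of cached pages, and a simple clock sweep over usage counters for eviction. Only a cache miss ever reaches the read() system call.

```python
# Toy sketch of a user-space buffer pool with clock-sweep eviction.
import os

PAGE_SIZE = 8192

class BufferPool:
    def __init__(self, fd, capacity):
        self.fd = fd               # file descriptor of the data file
        self.capacity = capacity   # maximum number of cached pages
        self.pages = {}            # page_no -> page bytes
        self.usage = {}            # page_no -> usage counter
        self.clock = []            # page numbers in clock order
        self.hand = 0

    def get_page(self, page_no):
        if page_no in self.pages:                 # cache hit: no system call at all
            self.usage[page_no] = min(self.usage[page_no] + 1, 5)
            return self.pages[page_no]
        if len(self.pages) >= self.capacity:      # pool full: evict a victim first
            self._evict()
        data = os.pread(self.fd, PAGE_SIZE, page_no * PAGE_SIZE)   # the one system call
        self.pages[page_no] = data
        self.usage[page_no] = 1
        self.clock.append(page_no)
        return data

    def _evict(self):
        while True:                               # sweep until a cold page is found
            victim = self.clock[self.hand % len(self.clock)]
            if self.usage[victim] == 0:
                del self.pages[victim], self.usage[victim]
                self.clock.remove(victim)
                return
            self.usage[victim] -= 1               # give the page another chance
            self.hand += 1
```

A real buffer pool also tracks dirty pages, pins, and write-back, but the point stands: a hit is served from the process's own memory, in whatever eviction order suits the DBMS.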
The operating system disk i/o must be generalised to work for a variety of situations. The DBMS can sometimes gain significant performance using less general code that is optimised to its own needs.
The DBMS does its own caching, so doesn't want to work through the O/S caching. It "owns" the patch of disk, so it doesn't need to worry about sharing with other processes.
Update
The link to the paper is a help.
Firstly, the paper is almost thirty years old and is referring to long-obsolete hardware. Notwithstanding that, it makes quite interesting reading.
Secondly, understand that disk I/O is a layered process. It was in 1981 and is even more so now. At the lowest level, a device driver issues physical read/write instructions to the hardware. Above that may be the O/S kernel code, then the O/S user-space code, then the application. Between a C program's fread() and the disk heads moving there are at least three or four levels, and there might be considerably more. A DBMS seeking to improve performance might try to bypass some of those layers and talk directly to the kernel, or even lower.
I recall some years ago installing Oracle on a Sun box. It had an option to dedicate a disk as a "raw" partition, where Oracle would format the disk in its own manner and then talk straight to the device driver. The O/S had no access to the disk at all.
It's mainly a performance issue. A dbms has highly specific and unusual I/O demands.
The OS may have any number of processes doing I/O and filling its buffers with the assorted cached data that this produces.
And of course there is the issue of size and of what gets cached (a DBMS may be able to cache better for its own needs than the more generic buffer caching can).
And then there is the issue that a generic “block” may in fact amount to a considerably larger I/O burden (this depends on partitioning and such like) than what a dbms ideally would like to bear; its own cache may be tuned to work better with the layout of the data on the disk and thereby able to minimise I/O.
A further thing is the issue of indexes and similar means to speed up queries, which of course works rather better if the cache actually knows what these mean in the first place.
The real issue is that the file buffer cache is not in the filesystem used by the DBMS; it's in the kernel and shared by all of the filesystems resident in the system. Any memory read out of the kernel must be copied into user space: this is the core-to-core move you read about.
Beyond this, some other reasons you can't rely on the system buffer pool:
Often, a DBMS has a really good idea about its upcoming access patterns, and it can't communicate those patterns to the kernel. This can lead to lower performance.
The buffer cache is traditionally stored in a fixed-size kernel memory range, so it cannot grow or shrink. That also means the cache is much smaller than main memory, so by relying on it a DBMS would be unable to take full advantage of the system's resources.
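A crude way to see the cost being described (a sketch with a throwaway file; numbers will vary by machine): compare fetching a page with a read() system call, even when the page is already sitting in the OS cache, against fetching it from an in-process dictionary.

```python
# Crude micro-benchmark: syscall into the OS page cache vs. an in-process cache lookup.
import os, time

PAGE = 8192
path = "bench.dat"                            # throwaway test file
with open(path, "wb") as f:
    f.write(os.urandom(PAGE * 100))

fd = os.open(path, os.O_RDONLY)
os.pread(fd, PAGE, 0)                         # warm the OS cache for page 0

N = 100_000
t0 = time.perf_counter()
for _ in range(N):
    os.pread(fd, PAGE, 0)                     # every fetch pays a system call plus a copy
t1 = time.perf_counter()

cache = {0: os.pread(fd, PAGE, 0)}            # the same page, cached in user space
t2 = time.perf_counter()
for _ in range(N):
    cache[0]                                  # no system call, no copy
t3 = time.perf_counter()

os.close(fd)
print("read() via OS cache: %.3fs  in-process cache: %.3fs" % (t1 - t0, t3 - t2))
```

Both paths avoid the disk entirely; the remaining gap is the kernel crossing and copy that Stonebraker calls the core-to-core move.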
I know this is old, but it came up as unanswered.
Essentially:
The OS uses a separate address space for every process.
Retrieving information from any other address space requires a system call or page fault. **(see below)
The DBMS is a process with its own address space.
The OS buffer pool Stonebraker describes is in the kernel address space.
So ... to get data from the kernel address space to the DBMS's address space, a system call or page fault is unavoidable.
You're correct that accessing data from the OS buffer pool manager is no more expensive than a normal read() call. (In fact, it's done with a normal read call.) However, Stonebraker is not talking about that. He's specifically discussing the caching needs of DBMSes, after the data has been read from the disk and is present in RAM.
In essence, he's saying that the OS's buffer pool cache is too slow for the DBMS to use because it's stored in a different address space. He's suggesting using a local cache in the same process (and therefore same address space), which can give you a significant speedup for applications like DBMSes which hit the cache heavily, because it will eliminate that syscall overhead.
Here's the exact paragraph where he discusses using a local cache in the same process:
However, many DBMSs including INGRES
[20] and System R [4] choose to put a
DBMS managed buffer pool in user space
to reduce overhead. Hence, each of
these systems has gone to the
trouble of constructing its own
buffer pool manager to enhance
performance.
He also mentions multi-core issues in the excerpt you quote above. Similar effects apply here, because if you can have just one cache per core, you may be able to avoid the slowdowns from CPU cache flushes when multiple CPUs are reading and writing the same data.
** BTW, I believe Stonebraker's 1981 paper is actually pre-mmap. He mentions it as future work. "The trend toward providing the file system as a part of shared virtual memory (e.g., Pilot [16]) may provide a solution to this problem."