How to handle buffer and secondary storage with PostgreSQL Server Programming (SPI)? - postgresql

I am wondering where/how to let PostgreSQL (9.6) handle memory issues between secondary storage (e.g. Hard Drives) and memory buffers?
For example, how to load relevant data into memory when some tuples being queried are not in the buffer; and how to flush some data to disk when the memory buffer is full?
I haven't done server programming before. But when I looked at the Server Programming Interface and the section about memory management, I can't find any mention of "secondary storage" or "buffer" etc. Where are such issues handled?
Can anyone give some pointers about this?

I think you are confused here.
The memory management functions you reference above are to allocate and manage memory that remains allocated after your function has finished (but is freed when the calling statement ends), e.g. to contain results to return to the caller of the function.
Storage management and data buffering happen on a different, much lower, level, and you cannot influence that via SPI. SPI is just an interface for C code running in the server to run SQL statements. As far as shared buffers are concerned, it does not make a difference whether you issue a query from psql or via SPI.

Related

Why postgres database using double buffer method?

Postgres uses double buffering when performing operations. But why? What is the need for it?
Please explain.
I guess you are referring to shared buffers and the kernel cache, both of which cache PostgreSQL data since PostgreSQL uses buffered I/O rather than direct I/O.
The reason is probably that PostgreSQL aims to be very portable, so using buffered I/O with its higher level of abstraction from the system was the easier choice.
There is definitely a desire to change that, but that is a big project, if it ever comes to pass.

Memory usage of zfs for mapped files

I read the following on https://blogs.oracle.com/roch/entry/does_zfs_really_use_more
There is one peculiar workload that does lead ZFS to consume more
memory: writing (using syscalls) to pages that are also mmaped. ZFS
does not use the regular paging system to manage data that passes
through reads and writes syscalls. However mmaped I/O which is closely
tied to the Virtual Memory subsystem still goes through the regular
paging code . So syscall writting to mmaped pages, means we will keep
2 copies of the associated data at least until we manage to get the
data to disk. We don't expect that type of load to commonly use large
amount of ram
What does this mean exactly? Does this mean that zfs will "uselessly" double cache any memory region that is backed by a memory mapped file? Or does "using syscalls" mean writing using some other method of writing that I am not familiar with?
If so, am I better off keeping the working directories of files written this way on a ufs partition?
Does this mean that zfs will "uselessly" double cache any memory region that is backed by a memory mapped file?
Hopefully, no.
or does "using syscalls" mean writing using some other method of writing that I am not familiar with.
That method is just the regular low-level write(fd, buf, nbytes) system call and similar calls. It is not what memory mapped files are designed to support, which is accessing file content simply by reading and writing memory through pointers, treating the file data as a byte array or whatever.
If so, am I better off keeping the working directories of files written this way on a ufs partition?
No, unless memory mapped files that are also written to using system calls sum to a significant part of your RAM workload, which is quite unlikely to happen.
PS: Note that this blog is almost ten years old. There might have been changes in the implementation since that time.
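To make the distinction between the two write paths concrete, here is a minimal C sketch. The function names store_with_syscall and store_with_mmap are made up for illustration, and the code only shows the two access styles on an ordinary POSIX system; it says nothing about how ZFS caches either one.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Path 1: hand the bytes to the kernel with an explicit write() system call.
 * Creates or truncates the file. Returns 0 on success, -1 on failure. */
int store_with_syscall(const char *path, const char *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, data, len);
    close(fd);
    return n == (ssize_t)len ? 0 : -1;
}

/* Path 2: map the file and modify it with ordinary memory stores.
 * The file must already exist and be at least `len` bytes long.
 * Returns 0 on success, -1 on failure. */
int store_with_mmap(const char *path, const char *data, size_t len)
{
    assert(len > 0);                    /* mmap() rejects a zero length */
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                          /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;
    memcpy(p, data, len);               /* plain pointer access, no write() call */
    int rc = msync(p, len, MS_SYNC);    /* ask for the dirty pages to be flushed */
    munmap(p, len);
    return rc;
}
```

The workload the blog describes is the mix of both: calling something like store_with_syscall on a file whose pages are also mapped as in store_with_mmap.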

Writing to hard disk from contiguous physical memory

I have an ARM based device, running linux, which is connected to a camera, and I'm trying to store captured frames to HD efficiently.
I'm developing in user space, but can modify drivers at will
I'm coding in C
Frames are written into memory using DMA, and I have their physical memory pointer.
I am able to control all the frame capturing flow, and I can tell when the frame buffers are stable (dequeued from the video4linux driver)
Linux version is 3.0.35
I'm familiar with kernel source code, not an expert, but I'm able to find my way in it and figure out things, as long as I get some hints...
I believe I have 2 alternatives:
Find the optimal configuration for my filesystem, for opening the file and writing into it. I'm now using ext4, and standard fopen() fwrite() functions. I understand I can also use mmap, or add O_DIRECT flag when calling open(), but didn't try it yet.
Find a way to pass the physical address of the buffer (I can get it
from my Video4Linux driver) directly to the filesystem/hard drive driver,
so the data will be transferred directly from there.
I found method 1 to be slow, having memory transactions as my bottleneck, since fwrite involves copying data from userspace to kernel space, and then again into some sort of cache, and then on to DMA. Too many memory transactions for a simple store...
Regarding method 2 - I don't know if that's possible, but if I was the one designing this system from scratch, this is what I would do.
Any thoughts?
Regarding method 1 (using open() and write(), mmap() and/or O_DIRECT)
can you recommend optimal settings for my purpose?
Is method 2 (storing to HD directly from an existing DMA buffer) possible? If so - can you point me to an example?
The only problem with writing into a file via mmap on Unix systems is that you either have to deal with signals in case the disk runs out of space,
or you have to make certain that the file is not sparse,
so that all the needed disk space is already allocated.
I think an up-to-date G++ provides a way of converting signals into C++ exceptions,
but I'm not certain how well this is supported on systems other than macOS.
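A minimal sketch of the "make sure the file is not sparse" approach, assuming a POSIX system that provides posix_fallocate(). The function name write_frame_mmap and its parameters are hypothetical, and the frame is taken from an ordinary user-space pointer here, not from a DMA buffer.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Store one frame in `path` through a shared mapping.
 * The disk space is reserved up front, so a full disk shows up as an
 * error from posix_fallocate() and not as a signal during the copy.
 * Returns 0 on success, -1 on failure. */
int write_frame_mmap(const char *path, const void *frame, size_t len)
{
    assert(len > 0);                            /* mmap() rejects a zero length */
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* posix_fallocate() returns an error number directly (0 on success). */
    if (posix_fallocate(fd, 0, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    void *dst = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                                  /* the mapping stays valid */
    if (dst == MAP_FAILED)
        return -1;

    memcpy(dst, frame, len);                    /* the only copy made in user space */
    int rc = msync(dst, len, MS_SYNC);          /* flush before reporting success */
    if (munmap(dst, len) != 0)
        rc = -1;
    return rc;
}
```

This still copies each frame once. It does not address method 2 from the question (passing the DMA buffer's physical address to the storage driver).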

Shared buffer in postgres

I am curious about the role played by shared buffer in postgres. Shared buffer maintains all the recently accessed disk pages and dirty pages. If a new page needs to be brought in and there is no space left in shared buffer, a victim dirty page is written back to the disk.
However, I am confused about this statement-
"PostgreSQL depends on the OS for caching." (http://www.varlena.com/GeneralBits/Tidbits/perf.html#shbuf)
How does postgres depend on the OS for caching? And how does it change the behavior of shared buffer?
Postgresql uses the OS cache and its own data cache. The two are useful according to your database usage.
The OS cache is very fast but basic: it evicts older data to make room for new data. It is useful when query results vary widely.
The PG cache is slower (still much faster than disk), but it keeps usage counters for the most-used data. It is useful for recurring results and indexes.
I think this link is clearer (and more up-to-date).
My understanding is that shared_buffers is where PostgreSQL processes work and share information, but above a certain limit (15% to 25% of the server RAM) diminishing returns make it more worthwhile to leave the remaining RAM to the OS so that it can do the caching itself.
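The "usage counters" mentioned above can be illustrated with a toy clock-sweep replacement policy. This is a simplified sketch and not PostgreSQL's actual code: the pool size, the type names and the function names are invented for the example. The cap of 5 on the counter matches PostgreSQL's limit; everything else is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NBUFFERS  4   /* toy pool size, chosen for the example */
#define MAX_USAGE 5   /* cap on the per-buffer usage counter */

typedef struct {
    int page;    /* page number held in this slot, -1 if empty */
    int usage;   /* bumped on every access, decremented by the sweep */
} Buffer;

typedef struct {
    Buffer buf[NBUFFERS];
    int hand;    /* next slot the sweep will examine */
} Pool;

void pool_init(Pool *p)
{
    for (int i = 0; i < NBUFFERS; i++) {
        p->buf[i].page = -1;
        p->buf[i].usage = 0;
    }
    p->hand = 0;
}

/* Return the slot holding `page`, loading it into a victim slot on a miss.
 * *hit is set to 1 if the page was already cached, 0 otherwise. */
int pool_access(Pool *p, int page, int *hit)
{
    assert(page >= 0);
    for (int i = 0; i < NBUFFERS; i++) {
        if (p->buf[i].page == page) {
            if (p->buf[i].usage < MAX_USAGE)
                p->buf[i].usage++;
            *hit = 1;
            return i;
        }
    }
    *hit = 0;
    for (;;) {                            /* sweep until a slot with usage 0 turns up */
        int slot = p->hand;
        p->hand = (p->hand + 1) % NBUFFERS;
        if (p->buf[slot].usage == 0) {
            p->buf[slot].page = page;     /* a real pool would write back a dirty victim first */
            p->buf[slot].usage = 1;
            return slot;
        }
        p->buf[slot].usage--;             /* frequently used pages survive more sweeps */
    }
}
```

With this policy a page that is touched often keeps a high counter and survives several passes of the sweep, while a page that was read once is evicted quickly.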

Why can't DBMSes rely on the OS buffer pool?

Stonebraker's paper (Operating System Support for Database Management) explains that, "the overhead to fetch a block from the buffer pool manager usually includes that of a system call and a core-to-core move." Forget about the buffer-replacement strategy, etc. The only point I question is the quoted one.
My understanding is that when a DBMS wants to read a block x it issues a common read instruction. There should be no difference from that of any other application requesting a read.
I'm not looking for generic answers (I got them, and read papers). I seek a detailed answer of the described problem.
See Does a file read from a Java application invoke a system call?
Reading from your other question, and working forward:
When the DBMS must bring a page from disk it will involve at least one system call. At this point most DBMSs place the page into their own buffer. (The pages also end up in the OS's buffer, but that's unimportant.)
So, we have one system call. However, we can avoid any further system calls. This is possible because the DBMS is caching pages in its own memory space. The first thing the DBMS will do when it decides it needs a page is check and see if it has it in its cache. If it does, it retrieves it from there without ever invoking a system call.
The DBMS is free to expire pages in its cache in whatever way is most beneficial for its IO needs. The OS's cache is expired in a more general way since the OS has other things to worry about. One example of this is that a DBMS will typically use a great deal of memory to cache pages as it knows that disk IO is one of the most expensive things it can do. The OS won't do this as it has to balance the cost of disk IO against having memory for other applications to use.
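The "check the own cache first, only then make a system call" logic can be sketched in C as follows. This is a toy illustration and not how any particular DBMS does it: the names BufferPool, bufpool_init and bufpool_get, the 8192-byte page size, the 8-slot pool and the round-robin eviction are all assumptions made for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define DB_PAGE_SIZE 8192   /* page size assumed for the example */
#define POOL_PAGES   8      /* toy pool size */

typedef struct {
    int fd;                                /* data file the pool reads from */
    long page_no[POOL_PAGES];              /* page held in each slot, -1 if empty */
    char data[POOL_PAGES][DB_PAGE_SIZE];   /* the cached pages, in user space */
    unsigned next_victim;                  /* naive round-robin eviction */
    unsigned long syscalls;                /* number of pread() calls issued */
} BufferPool;

void bufpool_init(BufferPool *bp, int fd)
{
    bp->fd = fd;
    for (int i = 0; i < POOL_PAGES; i++)
        bp->page_no[i] = -1;
    bp->next_victim = 0;
    bp->syscalls = 0;
}

/* Return a pointer to the cached copy of `page_no`, or NULL on read failure. */
const char *bufpool_get(BufferPool *bp, long page_no)
{
    assert(page_no >= 0);
    for (int i = 0; i < POOL_PAGES; i++)
        if (bp->page_no[i] == page_no)
            return bp->data[i];            /* hit: served without any system call */

    unsigned slot = bp->next_victim;       /* miss: pick a slot and read from the file */
    bp->next_victim = (slot + 1) % POOL_PAGES;
    ssize_t n = pread(bp->fd, bp->data[slot], DB_PAGE_SIZE,
                      (off_t)page_no * DB_PAGE_SIZE);
    bp->syscalls++;
    if (n != DB_PAGE_SIZE) {
        bp->page_no[slot] = -1;            /* do not keep a partial page */
        return NULL;
    }
    bp->page_no[slot] = page_no;
    return bp->data[slot];
}
```

Asking for the same page twice issues one pread() and then returns the same pointer, so the syscalls counter stays at 1. A real buffer manager would replace the round-robin choice with a policy tuned to its access patterns, which is the point made in the paragraph above.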
The operating system disk i/o must be generalised to work for a variety of situations. The DBMS can sometimes gain significant performance using less general code that is optimised to its own needs.
The DBMS does its own caching, so doesn't want to work through the O/S caching. It "owns" the patch of disk, so it doesn't need to worry about sharing with other processes.
Update
The link to the paper is a help.
Firstly, the paper is almost thirty years old and is referring to long-obsolete hardware. Notwithstanding that, it makes quite interesting reading.
Secondly, understand that disk I/O is a layered process. It was in 1981 and is even more so now. At the lowest point, a device driver will issue physical read/write instructions to the hardware. Above that may be the O/S kernel code, then the O/S user-space code, then the application. Between a C program's fread() and the disk heads moving, there are at least three or four levels, and there might be considerably more. To improve performance, the DBMS may seek to bypass some layers and talk directly with the kernel, or even lower.
I recall some years ago installing Oracle on a Sun box. It had an option to dedicate a disk as a "raw" partition, where Oracle would format the disk in its own manner and then talk straight to the device driver. The O/S had no access to the disk at all.
It's mainly a performance issue. A dbms has highly specific and unusual I/O demands.
The OS may have any number of processes doing I/O and filling its buffers with the assorted cached data that this produces.
And of course there is the issue of size and what gets cached (a DBMS may be able to perform caching better suited to its needs than the more generic device buffer caching).
And then there is the issue that a generic “block” may in fact amount to a considerably larger I/O burden (this depends on partitioning and such like) than what a dbms ideally would like to bear; its own cache may be tuned to work better with the layout of the data on the disk and thereby able to minimise I/O.
A further thing is the issue of indexes and similar means to speed up queries, which of course works rather better if the cache actually knows what these mean in the first place.
The real issue is that the file buffer cache is not in the filesystem used by the DBMS; it's in the kernel and shared by all of the filesystems resident in the system. Any memory read out of the kernel must be copied into user space: this is the core-to-core move you read about.
Beyond this, some other reasons you can't rely on the system buffer pool:
Often, a DBMS has a really good idea about its upcoming access patterns, and it can't communicate these patterns to the kernel. This can lead to lower performance.
The buffer cache is traditionally stored in a fixed-size kernel memory range, so it cannot grow or shrink. That also means the cache is much smaller than main memory, so by using the buffer cache a DBMS would be unable to take advantage of system resources.
I know this is old, but it came up as unanswered.
Essentially:
The OS uses a separate address space for every process.
Retrieving information from any other address space requires a system call or page fault. **(see below)
The DBMS is a process with its own address space.
The OS buffer pool Stonebraker describes is in the kernel address space.
So ... to get data from the kernel address space to the DBMS's address space, a system call or page fault is unavoidable.
You're correct that accessing data from the OS buffer pool manager is no more expensive than a normal read() call. (In fact, it's done with a normal read call.) However, Stonebraker is not talking about that. He's specifically discussing the caching needs of DBMSes, after the data has been read from the disk and is present in RAM.
In essence, he's saying that the OS's buffer pool cache is too slow for the DBMS to use because it's stored in a different address space. He's suggesting using a local cache in the same process (and therefore same address space), which can give you a significant speedup for applications like DBMSes which hit the cache heavily, because it will eliminate that syscall overhead.
Here's the exact paragraph where he discusses using a local cache in the same process:
However, many DBMSs including INGRES
[20] and System R [4] choose to put a
DBMS managed buffer pool in user space
to reduce overhead. Hence, each of
these systems has gone to the
trouble of constructing its own
buffer pool manager to enhance
performance.
He also mentions multi-core issues in the excerpt you quote above. Similar effects apply here, because if you can have just one cache per core, you may be able to avoid the slowdowns from CPU cache flushes when multiple CPUs are reading and writing the same data.
** BTW, I believe Stonebraker's 1981 paper is actually pre-mmap. He mentions it as future work. "The trend toward providing the file system as a part of shared virtual memory (e.g., Pilot [16]) may provide a solution to this problem."