Are direct reads possible in PostgreSQL?

Oracle has this concept of direct reads, where a session reads data from a table directly into its session memory, bypassing the buffer cache. Is something similar possible in Postgres? Does a session always get data from the shared buffers?

You are mixing up two things:
the kernel cache, where the kernel caches file data to serve reads and writes more efficiently
the database's shared memory cache (shared buffers), where the database caches table and index blocks
All databases use the latter (Oracle calls it the “database buffer cache”), because without caching, performance would be abysmal.
With direct I/O you avoid the kernel cache, that is, all read and write requests go directly to disk.
There is no way in PostgreSQL to use direct I/O.
However, it has been recognized that buffered I/O comes with its own set of problems. For example, a write request may succeed, a sync request that tells the kernel to flush the data to disk may fail, but the next sync request for the same (still unpersisted!) data may not return an error any more. Some PostgreSQL developers hold the opinion that it might be a good idea to move to direct I/O eventually to avoid having to deal with such problems, but that would be a major change, and I wouldn't hold my breath until it happens.
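To see the shared-buffer behaviour the answer describes, you can watch the per-table I/O counters. A minimal sketch, assuming a local database reachable with psycopg2 and an existing table called mytable (both the connection string and the table name are placeholders):

```python
# Sketch: observe shared-buffer usage for one table via pg_statio_user_tables.
import psycopg2

conn = psycopg2.connect("dbname=test")  # placeholder connection string
cur = conn.cursor()

cur.execute("SELECT count(*) FROM mytable")  # force the table to be read
cur.fetchone()

cur.execute("""
    SELECT heap_blks_read,  -- blocks read into shared buffers from the kernel/disk
           heap_blks_hit    -- blocks that were already in shared buffers
    FROM pg_statio_user_tables
    WHERE relname = 'mytable'
""")
print(cur.fetchone())
conn.close()
```

Note that whether a "read" involved actual disk I/O or was served from the kernel cache is invisible to PostgreSQL; the counters only distinguish shared-buffer hits from reads.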

Related

PostgreSQL resource consumption, segregation and scheduling

How does PostgreSQL protect sessions from each other from the
resource consumption perspective?
For example, I write some stored procedures:
a stored procedure that executes a highly CPU-bound tight loop: how does PostgreSQL keep it from sucking up a big portion of the available CPU?
a stored procedure that triggers a lot of I/O: how does PostgreSQL keep it from sucking up most of the I/O bandwidth?
a stored procedure that reads widely scattered pages that no other session references: how does PostgreSQL keep it from filling up the buffer pool?
Also, as I understand it, each PostgreSQL session corresponds to a different OS process, so I also wonder which aspects of resource segregation PostgreSQL handles explicitly and which it relies on the OS to perform (as part of the OS's scheduling mechanisms).
Thanks much.
There is no resource throttling for processes in PostgreSQL, each process will consume as much CPU and I/O as it can.
This is somewhat mitigated by the fact that PostgreSQL backends run single-threaded, so a single backend cannot consume all the resources of the database server. Note, however, that PostgreSQL has parallel query, so (with the default configuration) up to three processes can work on a single statement. You can reduce that by setting max_parallel_workers_per_gather to 0.
There is also no limit on how many pages a statement can evict from shared buffers. But unless the statement touches the same page multiple times, the usage count of the pages it reads in will remain low, and those buffers can get evicted from the cache again. There is also an optimization for large sequential scans: if the table is estimated to be bigger than a quarter of shared buffers, it is scanned using a "ring buffer" consisting of only a small part of shared buffers.
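As a sketch of the parallel-query point above: you can turn parallelism off for a single session and check that the plan no longer schedules worker processes. The connection string and table name below are placeholders.

```python
# Sketch: disable parallel query for this session only, then inspect a plan.
import psycopg2

conn = psycopg2.connect("dbname=test")  # placeholder connection string
cur = conn.cursor()

cur.execute("SET max_parallel_workers_per_gather = 0")  # affects this session only
cur.execute("EXPLAIN SELECT count(*) FROM big_table")   # placeholder table
for (line,) in cur.fetchall():
    print(line)  # with the setting above, no "Gather" / "Workers Planned" nodes appear
conn.close()
```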

How does write ahead logging improve IO performance in Postgres?

I've been reading through the WAL chapter of the Postgres manual and was confused by a portion of the chapter:
Using WAL results in a significantly reduced number of disk writes, because only the log file needs to be flushed to disk to guarantee that a transaction is committed, rather than every data file changed by the transaction.
How is it that continuously writing to the WAL is more performant than simply writing to the table/index data itself?
As I see it (forgetting for now the resiliency benefits of WAL), Postgres needs to complete two disk operations: first it needs to commit to the WAL on disk, and then it still needs to change the table data to be consistent with the WAL. I'm sure there's a fundamental aspect of this I've misunderstood, but it seems like adding an additional step between a client transaction and the final state of the table data couldn't actually increase overall performance. Thanks in advance!
You are fundamentally right: the extra writes to the transaction log will per se not reduce the I/O load.
But a transaction will normally touch several files (tables, indexes etc.). If you force all these files out to storage (“sync”), you will incur more I/O load than if you sync just a single file.
Of course all these files will have to be written and sync'ed eventually (during a checkpoint), but often the same data are modified several times between two checkpoints, and then the corresponding files will have to be sync'ed only once.
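The effect is easy to picture with a toy comparison: a commit that flushes a single sequential log file versus one that has to flush every touched data file. The sketch below is not PostgreSQL, just an illustration of the relative fsync cost; file names and counts are arbitrary.

```python
# Sketch: flush one "WAL" file vs. flush every touched "data" file on commit.
import os, time, tempfile

def write_and_flush(paths, payload):
    start = time.perf_counter()
    for p in paths:
        fd = os.open(p, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
        os.write(fd, payload)
        os.fsync(fd)      # wait until the data reaches stable storage
        os.close(fd)
    return time.perf_counter() - start

with tempfile.TemporaryDirectory() as d:
    record = b"x" * 8192
    wal_only   = [os.path.join(d, "wal")]                        # 1 flush per commit
    data_files = [os.path.join(d, f"rel{i}") for i in range(8)]  # 8 flushes per commit
    print("flush WAL only     :", write_and_flush(wal_only, record))
    print("flush 8 data files :", write_and_flush(data_files, record))
```

The data files still have to be flushed eventually, but as noted above, that happens at checkpoint time and is amortized over many transactions.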

How to make APC cache based on distributed hash tables (like memcache)?

I've read an article about Distributed Hash Tables, and it seems it's possible to implement something like memcache with APC. As you know, APC is much faster than memcache if we're fetching keys from a single server. So if we make APC distributed, we get both performance and distribution. I need some thoughts on how to start. Could someone who is familiar with hash tables explain how to do that? How can we make APC work like memcache?
If you know something about keyspace partitioning and overlay networks, that would be even better.
Although on the surface both pieces of software provide a comparable service, their underpinnings are entirely different, and that explains the dramatic difference in performance.
APC is basically a system that allows you to store objects (be it user objects or parsed opcode chunks) in shared memory. Shared memory, in all systems I know, is as fast as local RAM once you have obtained a pointer to it.
So, in short, what APC has to do to write or read an object is:
request shm access and obtain a pointer to it
calculate object offset and size in the shm
memcpy that memory zone into a buffer or vice versa
done
Simple, and considering that memory bandwidth nowadays is tens of gigabytes per second, quick.
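A minimal sketch of that local access pattern, using Python's multiprocessing.shared_memory as a stand-in for APC's shm segment (the segment name, offset and size are made up; a real cache keeps an index mapping keys to offsets):

```python
# Sketch: reading an object out of an existing shared-memory segment, APC-style.
from multiprocessing import shared_memory

shm = shared_memory.SharedMemory(name="apc_like_segment")  # attach, no network involved
offset, size = 128, 64                            # would come from the key index
obj_bytes = bytes(shm.buf[offset:offset + size])  # a single local memory copy
shm.close()
```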
Due to its distributed nature, more needs to be done in a memcache scenario:
client encodes and transmits request
server receives and decodes request
server calculates object offset and size in memcached's memory
server memcpy's that memory zone into a buffer or vice versa
server transmits the buffer
client receives and decodes buffer
Now, if we want to distribute APC, the client and the server will need to talk to each other. And all of a sudden we find ourselves in a scenario that, with the exception of a few less important details, is identical to the one used by memcache. All the expensive operations become necessary again, i.e. all the copying around, including sending data through the network stack.
That also explains why, even with a memcache instance running on localhost, so without the (comparatively) horribly slow gigabit Ethernet between the nodes, there is considerable overhead in what needs to be done to make a distributed system work.
And that's why I'm convinced you're looking at the wrong suspect here: make APC distributed and it will end up in the same performance/throughput category as memcache.
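To make the contrast concrete, here is a rough sketch of a single networked cache lookup using memcached's text protocol (it assumes a memcached instance listening on localhost:11211 and an arbitrary key name):

```python
# Sketch: one key lookup over the network, showing the encode/send/receive/decode
# steps that a purely local shared-memory cache never has to perform.
import socket

with socket.create_connection(("127.0.0.1", 11211)) as s:
    s.sendall(b"get somekey\r\n")            # client encodes and transmits the request
    reply = b""
    while not reply.endswith(b"END\r\n"):    # server answers with VALUE ... / END
        chunk = s.recv(4096)
        if not chunk:
            break
        reply += chunk
print(reply)
```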

Does it make sense to cache data obtained from a memory mapped file?

Or it would be faster to re-read that data from mapped memory once again, since the OS might implement its own cache?
The nature of data is not known in advance, it is assumed that file reads are random.
I wanted to mention a few things I've read on the subject. The answer is no, you don't want to second-guess the operating system's memory manager.
The first addresses the idea that your program (e.g. MongoDB, SQL Server) should try to limit its memory use based on a percentage of free RAM:
Don't try to allocate memory until there is only x% free
Occasionally, a customer will ask for a way to design their program so it continues consuming RAM until there is only x% free. The idea is that their program should use RAM aggressively, while still leaving enough RAM available (x%) for other use. Unless you are designing a system where you are the only program running on the computer, this is a bad idea.
(read the article for the explanation of why it's bad, including pictures)
Next come some notes from the author of Varnish, the caching reverse proxy:
Varnish Cache - Notes from the architect
So what happens with squid's elaborate memory management is that it gets into fights with the kernel's elaborate memory management, and like any civil war, that never gets anything done.
What happens is this: Squid creates an HTTP object in "RAM" and it gets used some times rapidly after creation. Then after some time it gets no more hits and the kernel notices this. Then somebody tries to get memory from the kernel for something, and the kernel decides to push those unused pages of memory out to swap space and use the (cache-RAM) more sensibly for some data which is actually used by a program. This, however, is done without squid knowing about it. Squid still thinks that these http objects are in RAM, and they will be, the very second it tries to access them, but until then, the RAM is used for something productive.
Imagine you cache something from a memory-mapped file. At some point in the future, the memory holding that "cache" will be swapped out to disk.
the OS has written to the hard-drive something which already exists on the hard drive
Next comes a time when you want to perform a lookup from your "cache" memory, rather than the "real" memory. You attempt to access the "cache", and since it has been swapped out of RAM the hardware raises a PAGE FAULT, and cache is swapped back into RAM.
your cache memory is just as slow as the "real" memory, since both are no longer in RAM
Finally, you want to free your cache (perhaps your program is shutting down). If the "cache" has been swapped out, the OS must first swap it back in so that it can be freed. If instead you just unmapped your memory-mapped file, everything is gone (nothing needs to be swapped in).
in this case your cache makes things slower
Again from Raymond Chen: If your application is closing, close already:
When DLL_PROCESS_DETACH tells you that the process is exiting, your best bet is just to return without doing anything
I regularly use a program that doesn't follow this rule. The program
allocates a lot of memory during the course of its life, and when I
exit the program, it just sits there for several minutes, sometimes
spinning at 100% CPU, sometimes churning the hard drive (sometimes
both). When I break in with the debugger to see what's going on, I
discover that the program isn't doing anything productive. It's just
methodically freeing every last byte of memory it had allocated during
its lifetime.
If my computer wasn't under a lot of memory pressure, then most of the
memory the program had allocated during its lifetime hasn't yet been
paged out, so freeing every last drop of memory is a CPU-bound
operation. On the other hand, if I had kicked off a build or done
something else memory-intensive, then most of the memory the program
had allocated during its lifetime has been paged out, which means that
the program pages all that memory back in from the hard drive, just so
it could call free on it. Sounds kind of spiteful, actually. "Come
here so I can tell you to go away."
All this anal-retentive memory management is pointless. The process
is exiting. All that memory will be freed when the address space is
destroyed. Stop wasting time and just exit already.
The reality is that programs no longer run in "RAM", they run in memory - virtual memory.
You can make use of a cache, but you have to work with the operating system's virtual memory manager:
you want to keep your cache within as few pages as possible
you want to ensure they stay in RAM, by the virtue of them being accessed a lot (i.e. actually being a useful cache)
Accessing:
a thousand 1-byte locations around a 400GB file
is much more expensive than accessing
a single 1000-byte location in a 400GB file
In other words: you don't really need to cache data, you need a more localized data structure.
If you keep your important data confined to a single 4k page, you will play much nicer with the VMM; Windows is your cache.
Once you take 64-byte cache lines into account, there's even more incentive to adjust your data structure layout. But you don't want it too compact either, or you'll start suffering the performance penalty of cache flushes caused by false sharing.
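To make the mapped-file point concrete: reads through a memory mapping already go through the OS page cache, so keeping your own copy of the same bytes mostly duplicates what the VMM does for free. A minimal sketch (the file name is a placeholder):

```python
# Sketch: random reads from a memory-mapped file; repeated access to the same
# region is served from the OS page cache without any application-level cache.
import mmap

with open("data.bin", "rb") as f:  # placeholder file
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        first = mm[4096:4096 + 64]   # may fault the page in from disk
        again = mm[4096:4096 + 64]   # served from RAM via the page cache
        assert first == again
```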
The answer is highly OS-specific. Generally speaking, there is no sense in caching this data: both the "cached" data and the memory-mapped data can be paged out at any time.
If there is any difference, it will be specific to the OS; unless you need that granularity, there is no sense in caching the data.

How can I handle multiple sockets within a Perl daemon with large memory usage?

I have created a client-server program in Perl using IO::Socket::INET. I access the server through a CGI-based site. My server program runs as a daemon and accepts multiple simultaneous connections. The server process consumes about 100 MB of memory (9 large hashes, among other data). I want these hashes to reside in memory and be shared, so that I don't have to create them for every connection. Hash creation takes 10-15 seconds.
Whenever a new connection is accepted, I fork a new process to handle that connection. Since the parent process is huge, every fork has to allocate that much memory for the new child, and with limited memory it takes a long time to spawn a child, which increases the response time. Often it hangs even for a single connection.
The parent process creates 9 large hashes. Each child needs to refer to one or more of these hashes in read-only mode; children never update them. I want to use something like copy-on-write, so that the whole 100 MB (or all the global variables created by the parent) can be shared with all children, or some other mechanism such as threads. I expect the server to get a minimum of 100 requests per second, and it should be able to process all of them in parallel. On average, a child will exit within 2 seconds.
I am using Cygwin on Windows XP with only 1 GB of RAM. I cannot find any way to overcome this issue. Can you suggest something? How can I share variables, and also create 100 child processes per second, manage them, and synchronize them?
Thanks.
Instead of forking, there are two other approaches to handling concurrent connections: threads, or a polling approach.
In the thread approach, a new thread is created for each connection and handles the I/O on that socket. A thread runs in the same virtual memory as the creating process and can access all of its data. Make sure to use locks properly to synchronize write access to your data.
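A minimal sketch of the thread-per-connection idea (shown in Python for brevity; Perl threads follow the same structure): the large hashes are built once and every handler thread reads them without copying. The port, key handling and data are placeholders.

```python
# Sketch: thread-per-connection server; big_hashes is built once and is readable
# by all handler threads because they share the process's address space.
import socket, threading

big_hashes = {"example_key": "example_value"}  # stand-in for the 9 large hashes

def handle(conn):
    with conn:
        key = conn.recv(1024).decode().strip()
        conn.sendall(big_hashes.get(key, "not found").encode())

srv = socket.create_server(("0.0.0.0", 5000))  # placeholder port
while True:
    conn, _addr = srv.accept()
    threading.Thread(target=handle, args=(conn,), daemon=True).start()
```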
An even more efficient approach is to use polling via select(). In this case a single process/thread handles all sockets. This works under the assumption that most of the work is I/O, and that the time spent waiting for I/O requests to finish is spent handling other sockets.
Go research further on those two options and decide which one suits you best.
See for example: http://www.perlfect.com/articles/select.shtml
If you have that much data, I wonder why you don't simply use a database?
This architecture is unsuitable for Cygwin. Forking on real unix systems is cheap, but on fake unix systems like Cygwin it's terribly expensive, because all data has to be copied (real unices use copy-on-write). Using threads changes the memory usage pattern (higher base usage, but smaller increase per thread), but odds are it will still be inefficient.
I would advise you to use a single-process approach using polling, and maybe non-blocking I/O too.
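A minimal sketch of that single-process polling approach (shown with Python's selectors module for brevity; Perl's IO::Select follows the same pattern): one process owns the big read-only data and multiplexes all client sockets, so nothing has to be copied at connection time. The port and data are placeholders.

```python
# Sketch: single-process polling server; one process holds the shared data and
# waits on all sockets, handling whichever become readable.
import selectors, socket

big_hashes = {"example_key": "example_value"}  # stand-in for the shared data
sel = selectors.DefaultSelector()

def accept(srv):
    conn, _addr = srv.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, handle)

def handle(conn):
    data = conn.recv(1024)
    if data:
        conn.sendall(big_hashes.get(data.decode().strip(), "not found").encode())
    else:
        sel.unregister(conn)
        conn.close()

srv = socket.create_server(("0.0.0.0", 5000))  # placeholder port
srv.setblocking(False)
sel.register(srv, selectors.EVENT_READ, accept)

while True:
    for key, _events in sel.select():
        key.data(key.fileobj)  # invoke the callback registered for this socket
```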