Will pg_dump and pg_restore of PostgreSQL affect the buffer cache and kernel file system cache? - postgresql

May I ask about the buffer cache behavior of PostgreSQL during pg_dump and pg_restore?
As we know, PostgreSQL has a buffer cache to hold its recent working set, and Linux also has its file-system-level cache.
When we use pg_dump to back up the database, would the backup operation affect the PostgreSQL buffer cache and the system file cache?
And what about the pg_restore operation?

Since these operations read or write files on the machine, they will certainly affect the kernel's file system cache, potentially blowing out some data that were previously cached.
The same is true for the PostgreSQL shared buffers, although there is an optimization that avoids overwriting all of shared buffers during a large sequential scan: if a table is bigger than a quarter of shared buffers, a ring buffer of 256 kB is used rather than evicting a major part of the cache.
See the following quotes from src/backend/access/heap/heapam.c and src/backend/storage/buffer/README:
/*
 * If the table is large relative to NBuffers, use a bulk-read access
 * strategy and enable synchronized scanning (see syncscan.c). Although
 * the thresholds for these features could be different, we make them the
 * same so that there are only two behaviors to tune rather than four.
 * (However, some callers need to be able to disable one or both of these
 * behaviors, independently of the size of the table; also there is a GUC
 * variable that can disable synchronized scanning.)
 *
 * Note that table_block_parallelscan_initialize has a very similar test;
 * if you change this, consider changing that one, too.
 */
if (!RelationUsesLocalBuffers(scan->rs_base.rs_rd) &&
    scan->rs_nblocks > NBuffers / 4)
{
    allow_strat = (scan->rs_base.rs_flags & SO_ALLOW_STRAT) != 0;
    allow_sync = (scan->rs_base.rs_flags & SO_ALLOW_SYNC) != 0;
}
else
    allow_strat = allow_sync = false;
For sequential scans, a 256KB ring is used. That's small enough to fit in L2
cache, which makes transferring pages from OS cache to shared buffer cache
efficient. Even less would often be enough, but the ring must be big enough
to accommodate all pages in the scan that are pinned concurrently. 256KB
should also be enough to leave a small cache trail for other backends to
join in a synchronized seq scan. If a ring buffer is dirtied and its LSN
updated, we would normally have to write and flush WAL before we could
re-use the buffer; in this case we instead discard the buffer from the ring
and (later) choose a replacement using the normal clock-sweep algorithm.
Hence this strategy works best for scans that are read-only (or at worst
update hint bits). In a scan that modifies every page in the scan, like a
bulk UPDATE or DELETE, the buffers in the ring will always be dirtied and
the ring strategy effectively degrades to the normal strategy.
As the README indicates, that strategy is probably not very effective for bulk writes.
Still, a pg_dump or pg_restore will affect many tables, so you can expect that it will blow out a significant portion of shared buffers.
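If you want to see the effect for yourself, a minimal sketch (assuming you can install the pg_buffercache contrib extension in the database) is to run a query like the following before and after the dump and compare which relations occupy shared buffers:

CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Show which relations of the current database occupy the most shared buffers.
SELECT c.relname,
       count(*) AS buffers,
       pg_size_pretty(count(*) * current_setting('block_size')::bigint) AS cached
FROM pg_buffercache b
JOIN pg_class c
  ON b.relfilenode = pg_relation_filenode(c.oid)
 AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                           WHERE datname = current_database()))
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;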

Related

What is the downside to increase shared buffer in PostgreSQL

I've noticed a significant performance drop when data is not loaded into shared_buffers when querying PostgreSQL; the difference can be almost 100 times. So in the process of optimizing the query, I was wondering if there is any way to increase performance by increasing shared_buffers.
Then I started to investigate shared_buffers in PostgreSQL, and I found that the recommended value is 25% of the OS memory and that PostgreSQL will take advantage of the OS cache to accelerate queries. But from what I've seen with my own database, reading from disk vs. shared_buffers makes a huge difference, so I would like to read from shared_buffers most of the time.
So I wondered, what's the downside if I increase shared_buffers in PostgreSQL? What if I only increase shared_buffers on my read-only instance?
A downside of increasing the buffer cache is double buffering. When you need to read a page into shared_buffers, it might first be necessary to evict an existing page to make room for it. But then the OS cache might need to evict a page as well, to make room to read the page from the actual disk. You end up with the same page being held in both places, which wastes cache space. So instead of reading a page from the OS cache, you are more likely to need to read it from the actual disk, which is far slower. From a double-buffering perspective, you probably want shared_buffers to be either much less than half of the system RAM (using the OS cache as the main cache) or much larger than half (using shared_buffers as the main cache).
Another downside is that if it is too large, you might start to get out-of-memory errors or invoke the OOM killer or otherwise destabilize the system.
Another problem is that after some operations, like DROP TABLE, TRUNCATE, or the ending of a COPY in some circumstances, PostgreSQL needs to invalidate a lot of buffers and chooses to do so by scouring the entire buffer cache. If you do a lot of those operations, that time can really add up with large buffer cache settings.
Some workloads (I know about DROP TABLE, but there may be others) perform better with a smaller shared_buffers. But essentially, it is a matter of trial and error (or better yet: reproducible performance tests).
If you can make shared_buffers big enough that it can hold everything you need from the database, that is probably a good choice.
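If you do want to experiment, a minimal sketch (the 8GB figure is only a placeholder, not a recommendation):

-- shared_buffers only takes effect after a server restart.
ALTER SYSTEM SET shared_buffers = '8GB';  -- placeholder value; size it for your RAM and workload
-- ... restart the server, then verify:
SHOW shared_buffers;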

Does postgres support in memory temp table

I know that Postgres does not allow for in-memory structures. But for a temp table to work efficiently, it is important to have an in-memory structure; otherwise it would have to spill to disk and would not be that efficient. So my question is: does Postgres allow in-memory storage for temp tables? My hunch is that it does not. Just wanted to confirm it with someone.
Thanks in advance!
Yes, Postgres can keep temp tables in memory. The amount of memory available for that is configured through the parameter temp_buffers.
Quote from the manual:
Sets the maximum number of temporary buffers used by each database session. These are session-local buffers used only for access to temporary tables. The default is eight megabytes (8MB). The setting can be changed within individual sessions, but only before the first use of temporary tables within the session; subsequent attempts to change the value will have no effect on that session.
A session will allocate temporary buffers as needed up to the limit given by temp_buffers. The cost of setting a large value in sessions that do not actually need many temporary buffers is only a buffer descriptor, or about 64 bytes, per increment in temp_buffers. However if a buffer is actually used an additional 8192 bytes will be consumed for it (or in general, BLCKSZ bytes).
So if you really need that, you can increase temp_buffers.
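A small sketch of how this looks in practice (the table name and the 256MB value are just examples); note that temp_buffers must be set before the session touches any temporary table:

-- Must be set before the first use of a temporary table in this session.
SET temp_buffers = '256MB';

CREATE TEMPORARY TABLE tmp_orders (id bigint, payload text);
-- As long as the temp table fits in temp_buffers, it is served from
-- session-local memory; beyond that it spills to disk.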

Postgresql binary column free memory

We have a database in which we store some small files temporarily before they are pushed to S3. The problem I'm having at the moment is that once we clear the binary in PostgreSQL (setting the binary column value = null), it does not seem to free up the space. Are we missing anything?
You would need to perform a vacuum full in order to reclaim the free space, or just a vacuum to be able to re-use the space.
The doc says:
Plain VACUUM (without FULL) simply reclaims space and makes it
available for re-use. This form of the command can operate in parallel
with normal reading and writing of the table, as an exclusive lock is
not obtained. However, extra space is not returned to the operating
system (in most cases); it's just kept available for re-use within the
same table. VACUUM FULL rewrites the entire contents of the table into
a new disk file with no extra space, allowing unused space to be
returned to the operating system. This form is much slower and
requires an exclusive lock on each table while it is being processed.
Let's emphasize that this is true for both DELETE and UPDATE commands.
The FULL option is not recommended for routine use, but might be
useful in special cases. An example is when you have deleted or
updated most of the rows in a table and would like the table to
physically shrink to occupy less disk space and allow faster table
scans. VACUUM FULL will usually shrink the table more than a plain
VACUUM would.
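For illustration, a minimal sketch with a hypothetical table called files that holds the binary column:

-- Mark the space of the cleared values as reusable within the table
-- (runs alongside normal reads and writes):
VACUUM files;

-- Rewrite the table and return the freed space to the operating system
-- (takes an exclusive lock on the table while it runs):
VACUUM FULL files;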

When should one vacuum a database, and when analyze?

I just want to check that my understanding of these two things is correct. If it's relevant, I am using Postgres 9.4.
I believe that one should vacuum a database when looking to reclaim space from the filesystem, e.g. periodically after deleting tables or large numbers of rows.
I believe that one should analyse a database after creating new indexes, or (periodically) after adding or deleting large numbers of rows from a table, so that the query planner can make good calls.
Does that sound right?
vacuum analyze;
collects statistics and should be run as often as the data changes (especially after bulk inserts). It does not take exclusive locks on objects. It puts load on the system, but is worth it. It does not reduce the size of the table, but marks scattered freed-up space (e.g. from deleted rows) for reuse.
vacuum full;
reorganises the table by creating a copy of it and switching over to it. This vacuum requires additional disk space to run, but reclaims all unused space in the object. It therefore requires an exclusive lock on the object (other sessions have to wait for it to complete). It should be run as often as the data is changed (deletes, updates) and when you can afford to make other sessions wait.
Both are very important on a dynamic database.
Correct.
I would add that you can change the value of the default_statistics_target parameter (default 100) in the postgresql.conf file to a higher number, after which you should reload the configuration and run ANALYZE to obtain more accurate statistics.
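A quick sketch of how that fits together, using ALTER SYSTEM as an alternative to editing postgresql.conf by hand (the orders table and customer_id column are hypothetical examples):

-- Raise the statistics target for the whole cluster (a reload is enough, no restart):
ALTER SYSTEM SET default_statistics_target = 500;
SELECT pg_reload_conf();

-- Or raise it only for one column that the planner misestimates:
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 500;

-- Then refresh the statistics so the planner can use them:
ANALYZE orders;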

How to get the Cassandra Table/Columnfamily size in MB

I want to design my cluster and want to set a proper size for key_cache and row_cache,
depending on the size of the tables/column families.
Similar to MySQL, do we have something like this in Cassandra/CQL?
SELECT table_name AS "Tables",
round(((data_length + index_length) / 1024 / 1024), 2) "Size in MB"
FROM information_schema.TABLES
WHERE table_schema = "$DB_NAME";
Or is there any other way to look at the data size and the index size separately?
Or what configuration would each node need to hold my table completely in memory,
without considering any replication factor?
The key cache and row caches work rather differently. It's important to understand the difference for calculating sizes.
The key cache is a cache of offsets within data files for the locations of rows. It is basically a map from (key, file) to offset. Therefore the key cache size scales with the number of rows, not the overall data size. You can find the number of rows from the 'Number of keys' parameter in 'nodetool cfstats'. Note this is per node, not a total, but that's what you want for deciding on cache sizes. The default size is min(5% of Heap (in MB), 100MB), which is probably sufficient for most applications. A subtlety here is that rows may exist in multiple files (SSTables), the number depending on your write pattern. However, this duplication is accounted for (approximately) in the estimated count from nodetool.
The row cache caches the actual row. To get a size estimate for this you can use the 'Space used' parameter in 'nodetool cfstats'. However, the row cache caches deserialized data and only the latest copy so the size could be quite different (higher or lower).
There is also a third less configurable cache - your OS filesystem cache. In most cases this is actually better than the row cache. It avoids duplicating data in memory, because when using the row cache most likely data will be in the filesystem cache too. And reading from an SSTable in the filesystem cache is only 30% slower than the row cache in my experiments (a while ago, probably not valid any more but unlikely to be significantly different). The main use case for the row cache is when you have one relatively small CF that you want to ensure is cached. Otherwise using the filesystem cache is probably the best.
In conclusion, the Cassandra defaults of a large key cache and no row cache are the best for most setups. You should only play with the caches if you know your access pattern won't work with the defaults or if you're having performance issues.
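If you do want a CQL-level figure rather than nodetool output, a sketch that may help (assuming a Cassandra version that ships the system.size_estimates table, roughly 2.1.5 and later; the keyspace name is a placeholder, and the numbers are per-node estimates, not cluster totals):

-- Per-node estimates of partition count and mean partition size (bytes)
-- for each token range of a table; sum over the ranges for a rough total.
SELECT table_name, range_start, range_end,
       partitions_count, mean_partition_size
FROM system.size_estimates
WHERE keyspace_name = 'my_keyspace';   -- placeholder keyspace name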