How does PostgreSQL cache statements and data? - postgresql

In Oracle, SQL statements are cached in the shared_pool, and frequently selected data is cached in the db_cache.
What does PostgreSQL do? Will SQL statements and data be cached in shared_buffers?

Generally, only the contents of table and index files will be cached in the shared buffer space.
Query plans are cached in some circumstances. The best way to ensure this is to PREPARE the query once, then EXECUTE it each time.
The results of a query are not automatically cached. If you rerun the same query -- even if it's letter-for-letter identical, and no updates have been performed on the DB -- it will still execute the whole plan. It will, of course, make use of any table/index data that's already in the shared buffers cache; so it will not necessarily have to read all the data from disk again.
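For example, a prepared statement is planned once and then reused for the rest of the session. A minimal sketch (the statement name, table, and parameter type are only illustrative):

-- Plan the query once for this session; $1 is a parameter placeholder.
PREPARE get_user (integer) AS
    SELECT * FROM users WHERE id = $1;

-- Each EXECUTE reuses the prepared statement instead of parsing and planning from scratch.
EXECUTE get_user(42);
EXECUTE get_user(43);

-- The prepared statement exists only for this session; drop it explicitly if you like.
DEALLOCATE get_user;

Note that on recent versions the server may still build per-call custom plans for the first few executions before settling on a generic plan, but the parse and analysis work is saved either way.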
Update on plan caching
Plan caching is generally done per session. This means only the connection that created the plan can use the cached version; other connections have to create and use their own. This isn't really a performance issue, because the saving you get from reusing a plan is almost always minuscule compared to the cost of connecting anyway (unless your queries are really complicated).
It does cache if you use PREPARE: http://www.postgresql.org/docs/current/static/sql-prepare.html
It does cache when the query is in a PL/pgSQL function (sketched below): http://www.postgresql.org/docs/current/static/plpgsql-implementation.html#PLPGSQL-PLAN-CACHING
It does not cache ad-hoc queries entered in psql.
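To illustrate the PL/pgSQL case: queries embedded in a function body are planned the first time they run in a session, and the plans are reused on later calls. A rough sketch, assuming a table called users exists:

-- The SELECT inside the function is planned on first execution in a session
-- and the plan is cached for subsequent calls.
CREATE FUNCTION count_users() RETURNS bigint AS $$
BEGIN
    RETURN (SELECT count(*) FROM users);
END;
$$ LANGUAGE plpgsql;

SELECT count_users();  -- first call: plans the embedded query
SELECT count_users();  -- later calls in the same session reuse the cached plan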
Hopefully someone else can elaborate on any other cases of query plan caching.

Related

How to do caching if you can't afford misses at all?

I'm developing an app that is processing incoming data and currently needs to hit the database for each incoming datapoint. The problem is twofold:
the database can't keep up with the load
the database returns results for less than 5% of the queries
The first idea is to cache the data from the relational database into something like Redis to improve lookup speed. But all the regular caching strategies rely on the fact that you can fall back to the database if needed and fetch data from there. This is problematic in my case because for 95% of the queries there is nothing in the database and I don't have anything to store in the cache. I can of course store the empty results in the cache but that would mean that 95% (or even more, depending on the composition of data) of my cache storage would be rubbish.
The preferred way to do it would be to implement a caching system that doesn't have any misses: everything from the database is always present in the cache and therefore if it's not in the cache, then it's not in the database. After looking around though I found that the consistency of Redis does not seem reliable enough to always make that assumption - if the key doesn't exist in Redis, how can I be 100% sure that it doesn't exist in the database (assuming that we're not in the midst of an update)? It is a strong requirement that if there is a row in the database about an incoming datapoint, then it needs to be found and can't just be missed out on.
How do I go about designing a caching system that will always have the same data as the relational database - without having a fallback to look the data up in the database? Redis might not be the tool but what would you recommend? Is there a pattern or a keyword that I should look up that I haven't thought of?
There already is such a cache in the database: shared buffers. So all you have to do is to set shared_buffers big enough to contain the whole database and restart. Soon the whole database will be cached, and reading will cause no more I/O and will be fast.
That also works if you cannot cache the whole database, as long as you only need to access part of it: PostgreSQL will then simply cache those 8kB pages that are in use.
In my opinion, adding another external caching system can never do better than that. That is particularly true if data are ever modified: any external caching system would have to make sure that its data are not stale, which would introduce an additional overhead.
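A rough sketch of how that might look (the 8GB value is only an example and assumes the whole database fits in that much memory; ALTER SYSTEM needs 9.4 or later, otherwise edit postgresql.conf directly):

-- See how big the database actually is.
SELECT pg_size_pretty(pg_database_size(current_database()));

-- Size shared_buffers to hold it all; this takes effect only after a restart.
ALTER SYSTEM SET shared_buffers = '8GB';

-- Optionally pre-load a table right after the restart instead of waiting for
-- normal traffic to pull its pages in (the table name is just an example).
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('my_table');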

Is there any way we can monitor all data modifications in intersystems cache?

I'm a newbie to InterSystems Caché. We have an old system using a Caché database. We want to extract and transform all of its data into another database such as PostgreSQL first, and then monitor all modifications of the original Caché data so we can apply them (inserts or updates) to our transformed data in PostgreSQL in a timely manner.
Is there any way to monitor all data modifications in Caché?
Does Caché have any modification/replication log, like MongoDB's oplog?
Any idea would be appreciated, thanks!
In short, yes, there is a way to monitor data modifications. InterSystems Caché uses journaling, mostly to keep data consistent: journals are used to roll back transactions, to restore data after unexpected shutdowns, for backups, for mirroring, and so on. Depending on the configuration, journals may contain more records in some situations than in others.
I think it can help in your situation. Many older Caché applications do not use Objects or SQL tables at all and work directly with globals. Journals record changes only at the global level, so if your application does use objects and tables, you need to know where and how it stores that data in globals. With the Journal API you can then read every change to the data. If the application uses transactions, the journal records carry flags for that as well. Each journal record describes a change to a single value, either a set or a kill.
Also be aware that Caché purges outdated journal files according to its settings, so you may have to increase the number of days after which they are deleted.

When should one vacuum a database, and when analyze?

I just want to check that my understanding of these two things is correct. If it's relevant, I am using Postgres 9.4.
I believe that one should vacuum a database when looking to reclaim space from the filesystem, e.g. periodically after deleting tables or large numbers of rows.
I believe that one should analyse a database after creating new indexes, or (periodically) after adding or deleting large numbers of rows from a table, so that the query planner can make good calls.
Does that sound right?
vacuum analyze;
collects statistics and should be run as often as the data changes (especially after bulk inserts). It does not take exclusive locks on objects. It puts some load on the system, but it is worth it. It does not reduce the size of the table, but it marks scattered freed-up space (e.g. from deleted rows) for reuse.
vacuum full;
reorganises the table by creating a copy of it and switching to it. This form of vacuum requires additional disk space to run, but it reclaims all unused space in the object. It therefore requires an exclusive lock on the object (other sessions have to wait for it to complete). Run it when a lot of data has been deleted or updated and you can afford to make others wait.
Both are very important on a dynamic database.
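A minimal illustration of the difference (the table name is just a placeholder):

-- Routine maintenance: marks dead row versions reusable and refreshes planner
-- statistics; no exclusive lock, and the file is not shrunk.
VACUUM ANALYZE my_table;

-- Heavyweight variant: rewrites the table into a new file and returns the freed
-- space to the operating system, but holds an exclusive lock while it runs.
VACUUM FULL my_table;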
Correct.
I would add that you can raise the default_statistics_target parameter (default 100) in the postgresql.conf file; after a configuration reload (no restart is required for this parameter), run ANALYZE again to obtain more detailed statistics.
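A sketch of how that might be done on 9.4 (the values, table, and column names are illustrative; the target can also be set per column):

-- Raise the global target; a reload is enough, no restart needed.
ALTER SYSTEM SET default_statistics_target = 500;
SELECT pg_reload_conf();

-- Or override it for a single column that needs finer histograms.
ALTER TABLE my_table ALTER COLUMN my_column SET STATISTICS 500;

-- Recollect statistics so the planner sees the extra detail.
ANALYZE my_table;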

Using memcache infront of a mongodb server

I am trying to understand how Mongo's internal cache works and whether it eliminates the need for memcache. Our database size is around 200G and the indexes fit in memory, but beyond the indexes there is not much free memory left on the server.
One of my colleagues says Mongo's internal cache will be as fast as memcache, so there is no need to introduce another level of complexity by using memcache.
The scenario in my head is that when we read data from the db, it's saved in memcache, and the next time it's read directly from the cache instead of going back to the db server. If the data is changed and needs to be saved/updated, it's done on both the memcache server and the database server.
I have been reading about this but couldn't convince myself yet. So I'd really appreciate if someone could shed some light on this.
The first thing to understand is that a cache store is different from a database, so MongoDB and SQL differ in purpose and usage when compared to Memcache.
Memcache is really good at lowering the working set size of queries. For example: imagine a huge aggregated query with subselects and CASE statements and what not in SQL (think of the most complex query you can); running that query in real time all the time could cause the computer(s) to "thrash" (not to mention the problems client side).
However, as everyone knows, you need only summarise this query into another collection/table for it to be instantly faster. The real speed of memcache comes from the fact that it is an in-memory key-value store. This is where MongoDB could lose out on speed, because it is memory-mapped rather than stored in memory.
MongoDB does not cache query results itself, but provided the query is "hot" and its data is in the LRU cache (this is where your working set comes in), you shouldn't notice much of a difference in response times. A good way to ensure a query is "hot" is to run it; some people keep a script of their biggest queries that they run to warm up the cache.
As I said, memcache is a cache layer; this is why the statement:
If the data is changed and needs to be saved/updated, it's done on both memcache server and database server.
makes me die a little inside. Many people blur the line between the DB and the cache layer.

Is it necessary to vacuum a SQLite3 database to prevent data-loss?

In PostgreSQL it is necessary to vacuum periodically to prevent data loss of very old data due to transaction ID wraparound. I am concerned that data loss might be an issue with SQLite3 databases as well if they are not vacuumed routinely.
Additionally, does the workload that the SQLite3 database experiences matter? I am currently thinking of using SQLite3 in a few scenarios including:
as a file format for a program where people might share files and use them across different machines
to store application settings
to store logs for an application which might log multiple times per second (queries on recent data might be performed every hour)
Also would the frequency of updates and deletes matter?
VACUUM
removes fragmentation, so it helps when you have both lots of deletions and insertions and many read-only queries that scan entire tables, and
frees unused pages, so it helps when you have deleted lots of data and have very few insertions afterwards.
But these are merely optimizations.
Fragmentation typically matters only on rotating disks, and freeing space is not necessary unless you're running out of space.
SQLite uses a different mechanism to implement transactions (which is much simpler and faster, but not scalable); there is no transaction ID counter to wrap around, so this kind of maintenance is not required to prevent data loss.
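If you do want to reclaim free pages, a quick sketch (note that changing auto_vacuum on an existing database only takes effect after a subsequent VACUUM):

-- One-off: rebuild the database file, defragmenting it and releasing free pages.
VACUUM;

-- Or let SQLite release freed pages automatically from now on.
PRAGMA auto_vacuum = FULL;
VACUUM;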