PostgreSQL Query Caching Logic

So I have a database with 140 million entries. The first query takes around 500 ms; repeating the same query takes 8 ms, so there is clearly some caching going on.
I would like to learn more about how long this cache lives and under what conditions it gets refreshed - time, or the number of new entries made to the table.
Adding ~1500 new entries to the table still leaves queries at around 10 ms, but adding 150k resets the cache. So it could be both, or only the transaction count.
Does anyone have a good resource so that I could get more information on this?
I made this gist to run the load tests.
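For reference, a minimal timing sketch along the lines of that gist might look like this (Python with psycopg2; the connection string, table and query are placeholders, not the ones from the gist):

    # Rough timing sketch (assumed connection string, table and query - not the original gist).
    # Requires: pip install psycopg2-binary
    import time
    import psycopg2

    conn = psycopg2.connect("dbname=test user=postgres")   # adjust for your setup
    cur = conn.cursor()

    QUERY = "SELECT count(*) FROM big_table WHERE some_column = %s"   # placeholder query

    def timed_run(value):
        start = time.perf_counter()
        cur.execute(QUERY, (value,))
        cur.fetchall()
        return (time.perf_counter() - start) * 1000.0   # milliseconds

    print("cold run: %.1f ms" % timed_run(42))   # first run may have to read from disk
    print("warm run: %.1f ms" % timed_run(42))   # repeat is usually served from cache

The warm run being served from shared buffers (or the OS page cache) is exactly the 500 ms vs. 8 ms difference described above.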

PostgreSQL uses a simple clock sweep algorithm to manage the cache (“shared buffers” in PostgreSQL lingo).
Shared buffers are in shared memory, and all access to data is via this cache. An 8kB block that is cached in RAM is also called a buffer.
Whenever a buffer is used, its usage count is increased, up to a maximum of 5.
There is a free list of buffers with usage count 0. If the free list is empty, any backend that is looking for a "victim" buffer to replace goes through shared buffers in a circular fashion and decreases the usage count of each buffer it encounters. If the usage count is 0, the buffer gets evicted and reused.
If a buffer is "dirty" (it has been modified, but not written to disk yet), it has to be written out before it can be evicted. The majority of dirty buffers get written out during one of the regular checkpoints, and a few of them get written out by the background writer between checkpoints. Occasionally a normal worker process has to do that as well, but that should be an exception.
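If you want to watch this mechanism, the pg_buffercache contrib extension exposes the usage count and dirty flag of every buffer. A minimal sketch, assuming CREATE EXTENSION pg_buffercache has been run and you connect with sufficient privileges (the query is adapted from the extension's documentation):

    # Summarize shared buffers per relation for the current database.
    # Assumes "CREATE EXTENSION pg_buffercache;" has already been run.
    import psycopg2

    conn = psycopg2.connect("dbname=test user=postgres")   # adjust for your setup
    cur = conn.cursor()
    cur.execute("""
        SELECT c.relname,
               count(*)                     AS buffers,
               sum((b.usagecount > 3)::int) AS high_usage,
               sum(b.isdirty::int)          AS dirty
        FROM pg_buffercache b
        JOIN pg_class c
          ON b.relfilenode = pg_relation_filenode(c.oid)
         AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                                   WHERE datname = current_database()))
        GROUP BY c.relname
        ORDER BY buffers DESC
        LIMIT 10
    """)
    for relname, buffers, high_usage, dirty in cur.fetchall():
        print(relname, buffers, high_usage, dirty)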

Related

Big transaction (multiple inserts) exceed RAM size : what happens?

I do multiple inserts in a single transaction, and they exceed RAM size.
I learned that inserts (and deletes) are first written in RAM and that modified pages become dirty pages.
But what happens if the dirty pages exceed RAM size: does a checkpoint write the dirty pages to disk before the end of the transaction?
Don't worry.
PostgreSQL caches 8kB pages of disk storage in RAM when the data are read or written, but modified (“dirty”) pages in the cache (shared buffers) are automatically written out to disk by the database system.
Normally, this happens during a checkpoint, but there is also a special background process, the background writer, that keeps writing out dirty pages from the cache so that there are always some clean pages in the cache that clients can use.
In the unlikely event that despite all that there is still no clean page to be found, the backend process that needs some cache space can clean out a dirty page itself. So, no matter what, you will always get some cache space eventually.
It also doesn't matter if your transaction is large or not. PostgreSQL doesn't wait for the transaction to be committed before it writes out data to disk, it will happily persist uncommitted data modifications from an active transaction. If the transaction fails, these data will just be rendered invisible (“dead”).
Owing to PostgreSQL's architecture, your transaction will never fail just because it changes too much data.
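If you want to verify which process ends up writing the dirty buffers, the pg_stat_bgwriter view keeps cumulative counters (column names as of PostgreSQL 16; newer releases move these counters to pg_stat_checkpointer and pg_stat_io). A rough sketch:

    # Cumulative counts of dirty buffers written by the checkpointer, the background
    # writer, and ordinary backends (the last should stay comparatively small).
    # Column names as of PostgreSQL 16; later versions split these across
    # pg_stat_checkpointer and pg_stat_io.
    import psycopg2

    conn = psycopg2.connect("dbname=test user=postgres")   # adjust for your setup
    cur = conn.cursor()
    cur.execute("""
        SELECT buffers_checkpoint, buffers_clean, buffers_backend
        FROM pg_stat_bgwriter
    """)
    checkpoint, bgwriter, backend = cur.fetchone()
    print("checkpointer: %s, background writer: %s, backends: %s"
          % (checkpoint, bgwriter, backend))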

How does Mongo's eventual consistency work with a large number of data writes?

I have a flow like this:
I have a Worker that's processing a "large" batch (say, 1M records) and storing the results in Mongo.
Once the batch is complete, a notification message is sent to Publish, which then pulls all the records from Mongo for final publication.
Let's say the Worker write process is done, i.e. it has sent all 1M records to Mongo through a driver. Mongo is "eventually consistent" so I'm not 100% guaranteed all records are written to physical storage at the time the Notify Publish happens.
When Publish does a 'find' and gets a cursor on the collection holding the batch records, is the cursor smart enough to handle the eventual consistency?
So in practical terms, let's imagine 750,000 records are actually physically written by Mongo when Notify Publish happens and Publish does its find. Will the cursor traverse 750,000 records and stop, or will it block or otherwise handle the remaining 250,000 as they are eventually written to disk (which presumably is very likely to happen while the first 750K are being published)?
As @BlakesSeven already noted in the comments, "eventual consistency" refers to the fact that in a replicated environment, when a write is finished on the primary, it will only be written to the secondaries eventually. You can modify this behavior, at the cost of reduced write performance, by setting the write concern to > 1. Setting it to "majority" basically guarantees that a write operation is durable even in case of a failover - though at a (in some cases) drastically reduced performance.
In general here is what happens when you do a write (simplified) with journaling enabled:
The operation is checked for being syntactically correct.
The query optimizer kicks in and does its stuff. (Irrelevant for this question, so I'll spare the details.)
The write operation is applied to the in memory representation of the data set called "private view".
Every commitIntervalMs, the private view is synced to the journal, with a median of 15 or 50ms, depending on the write concern.
On sync, the operation is applied to the shared view. Iirc, this is the point where a new connection would be provided with the new data.
So in order to ensure that the data will be readable by the new connection, simply delay the publish notification by commitIntervalMs + 1, which, given your batch size, is hardly noticeable.
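If you would rather not rely on timing at all, you can have the driver acknowledge the write only after it has been journaled (and, on a replica set, replicated to a majority). A hedged pymongo sketch; the database and collection names are placeholders:

    # Make the batch durable before sending the publish notification.
    # Database and collection names are placeholders.
    from pymongo import MongoClient, WriteConcern

    client = MongoClient("mongodb://localhost:27017")
    db = client.get_database("batchdb")

    # j=True waits for the journal sync; w="majority" additionally waits for the
    # write to reach a majority of replica-set members (slower, but durable).
    records = db.get_collection(
        "records",
        write_concern=WriteConcern(w="majority", j=True),
    )

    records.insert_many([{"batch_id": 1, "seq": i} for i in range(1000)])
    # insert_many has returned, so it is now safe to notify the publisher.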

Memcache flush all does not empty slabs?

I am using the flush_all command to delete all the key/value pairs on my Memcache server.
While the values get deleted, there are two things I can't understand when looking at the Memcache server through phpMemcachedAdmin
The memory usage is not reset to 0 after flushing it all. I still have 77% used and 22% wasted (just an example, but you get the spirit). How is that possible?
All the previous slab with the previous items are still there. For example, looking at a specific slab would show all the previous key/value pairs, despite the flush all command. How is that possible?
Thanks
This happens because memcache flushes on read, not on write. When you flush_all, that operation is designed for performance: it just means anything read beyond that time will be instantly expired, even though it is still in the cache. It just updates a single number and then checks that on each fetch.
This optimization significantly improves memcache performance. Imagine if several other people are searching or inserting at the same time as a flush.
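You can observe this from a client: after flush_all the items can no longer be read, even though the slab statistics still account for them. A small sketch using the pymemcache library (assumed to be installed and pointed at a local server):

    # flush_all expires items logically; the slab memory is not freed immediately.
    from pymemcache.client.base import Client

    client = Client(("localhost", 11211))
    client.set("greeting", b"hello")
    print(client.get("greeting"))     # b'hello'

    client.flush_all()
    print(client.get("greeting"))     # None - expired when read, not when flushed

    # The slab stats still show memory accounted to the old slabs.
    print(client.stats("slabs"))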

Memcache or Queue for Hits Logging

We have a busy website which needs to log 'hits' about certain pages or API endpoints which are visited, to help populate stats, popularity grids, etc. The hits we're logging aren't simple page hits, so we can't use log parsing.
In the past, we've just directly logged to the database with an update query, but under heavy concurrency, this creates a database load that we don't want.
We are currently using Memcache but experiencing some issues with the stats not being quite accurate due to non-atomic updates.
So my question:
Should we continue to use Memcache, but improve it with atomic increments:
1) When a page is hit, create a memcache key such as "stats:pageid:3" and increment it atomically on each hit
2) Write a batch script to cycle through all the memcache keys and create a batch update to database once every 10 mins
PROS: Less database hits, as we're only updating once per page per 10 mins (with however many hits in that 10 min period)
CONS: We can atomically increment the individual counters, but would still need a memcache key to store which pageids have had hits, to loop through and log. This won't be atomic, so when we flush the data to DB and reset everything, things may linger in this key. We could lose up to 10 mins of data.
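For illustration, a rough sketch of option 1 with pymemcache (the key scheme, table name and db_execute helper are just placeholders):

    # Sketch of option 1: atomic per-page counters in memcache, flushed to the
    # database by a periodic job. Key scheme, table name and db_execute are placeholders.
    from pymemcache.client.base import Client

    mc = Client(("localhost", 11211))

    def record_hit(page_id):
        key = "stats:pageid:%d" % page_id
        # incr is atomic but returns None if the key does not exist yet.
        if mc.incr(key, 1) is None:
            if not mc.add(key, b"1", noreply=False):   # someone else created it first
                mc.incr(key, 1)

    def flush_to_db(page_ids, db_execute):
        # Run from a cron job every ~10 minutes; page_ids has to be tracked
        # separately, which is the non-atomic part mentioned in the CONS above.
        for page_id in page_ids:
            key = "stats:pageid:%d" % page_id
            hits = mc.get(key)
            if hits:
                db_execute("UPDATE page_stats SET hits = hits + %s WHERE page_id = %s",
                           (int(hits), page_id))
                mc.delete(key)   # hits arriving between get() and delete() are lost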
OR Use a queue/task system:
1) When page is hit, add a job to the task queue
2) Task queue can then be rate limited and in the background process these 'hits' to the database.
PROS: Easy to code, we can scale up queue workers if required.
CONS: We're still hitting the database once per hit as each task would be processed individually, rather than 'summing' up all the hits.
Or any other suggestions?
OR: use something designed for recording stats at high-traffic levels, such as StatsD & Graphite. The original StatsD is written in JavaScript on top of Node.js, which can be a little complex to set up (but there are easier ways to install it, e.g. with a Docker container), or you can use a work-alike (not using Node.js) that does the same job, such as one written in Go.
I've used the original StatsD and Graphite pair to great effect, plus it makes the pretty graphs (this was for tens of millions of events per day).
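For completeness, emitting a counter to StatsD from the application is about as small as instrumentation gets. A sketch with the statsd Python package; the metric names are made up:

    # Fire-and-forget counter increments over UDP; StatsD aggregates them and
    # flushes to Graphite on its own interval, so requests never wait on the DB.
    import statsd

    stats = statsd.StatsClient("localhost", 8125, prefix="mysite")

    def on_page_hit(page_id):
        stats.incr("pages.%d.hits" % page_id)      # metric naming is just an example

    def on_api_hit(endpoint):
        stats.incr("api.%s.hits" % endpoint)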

How to keep 32 bit mongodb memory usage down on changing dataset

I'm using MongoDB on a 32 bit production system, which sucks but it's out of my control right now. The challenge is to keep the memory usage under ~2.5GB since going over this will cause 32 bit systems to crash.
According to the mongoDB team, the best way to track the memory usage is to use your operating system's process tracking tools (i.e. ps or htop on Unix systems, or Process Explorer on Windows) and watch the virtual memory size.
The DB mainly consists of one table which is continually cycling data, i.e. receiving data at regular intervals from sensors, and every day a cron job wipes all data from before the last 3 days. Over a period of time, the memory usage slowly increases. I took some notes over time using db.serverStats(), db.lectura.totalSize() and ps, shown in the chart below. Note that the size of the table in question has reduced in the last month but the memory usage increased nonetheless.
Now, there is some scope for adjustment in how many days of data I store. Today I deleted basically half of the data, and then restarted mongodb, and yet the mem virtual / mem mapped and most importantly memory usage according to ps have hardly changed! Why do these not reduce when I wipe data (and restart)? I read some other questions where people said that mongo isn't really using all the memory that it might appear to be using, and that you can't clear the cache or limit memory use. But then how can I ensure I stay under the 2.5GB limit?
Unless there is a way to stem this gradual increase in memory usage irrespective of dataset size, it seems to me that the 32-bit version of Mongo is unusable. Note: I don't mind losing a bit of performance if it solves the problem.
To answer regarding why the mapped and virtual memory usage does not decrease with the deletes, the mapped number is actually what you get when you mmap() the entire set of data files. This does not shrink when you delete records, because although the space is freed up inside the data files, they are not themselves reduced in size - the files are just more empty afterwards.
Virtual will include journal files, and connections, and other non-data related memory usage also, but the same principle applies there. This, and more, is described here:
http://www.mongodb.org/display/DOCS/Checking+Server+Memory+Usage
So, the 2GB storage size limitation on 32-bit will actually apply to the data files whether or not there is data in them. To reclaim deleted space, you will have to run a repair. This is a blocking operation and will require the database to be offline/unavailable while it runs. It will also need up to 2x the original size in terms of free disk space to be able to run, since it essentially represents writing out the files again from scratch.
This limitation, and the problems it causes, is why the 32-bit version should not be run in production, it is just not suitable. I would recommend getting onto a 64-bit version as soon as possible.
By the way, neither of these figures (mapped or virtual) actually represents your resident memory usage, which is what you really want to look at. The best way to do this over time is via MMS, which is the free monitoring service provided by 10gen - it will graph virtual, mapped and resident memory for you over time as well as plenty of other stats.
If you want an immediate view, run mongostat and check out the corresponding memory columns (res, mapped, virtual).
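If you prefer to read the same numbers from inside your application, serverStatus reports them too. A small pymongo sketch (the "mapped" field only exists on the MMAPv1-era servers this question is about, hence the fallback):

    # Read resident / virtual / mapped memory (in MB) from serverStatus.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    mem = client.admin.command("serverStatus")["mem"]

    print("resident: %s MB" % mem.get("resident"))
    print("virtual:  %s MB" % mem.get("virtual"))
    print("mapped:   %s MB" % mem.get("mapped", "n/a"))   # MMAPv1-era servers only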
In general, when using 64-bit builds with essentially unlimited storage, the data will usually greatly exceed the available memory. Therefore, mongod will use all of the available memory it can in terms of resident memory (which is why you should always have swap configured, so that the OOM killer does not come into play).
Once that is used, the OS does not stop allocating memory, it will just have the oldest items paged out to make room for the new data (LRU). In other words, the recycling of memory will be done for you, and the resident memory level will remain fairly constant.
Your options for stretching 32-bit are limited, but you can try some things. The thing that you run out of is address space, and the increases in the sizes of additional database files mean that you would like to avoid crossing over the boundary from "n" files to "n+1". It may be worth structuring your data into more or fewer databases so that you can get the maximum amount of actual data into memory and as little as possible "dead space".
For example, if your database named "mydatabase" consists of the files mydatabase.ns (the namespace file) at 16 MB, mydatabase.0 at 64 MB, mydatabase.1 at 128 MB and mydatabase.2 at 256 MB, then the next file created for this database will be mydatabase.3 at 512 MB. If instead of adding to mydatabase you instead created an additional database "mynewdatabase" it would start life with mynewdatabase.ns at 16 MB and mynewdatabase.0 at 64 MB ... quite a bit smaller than the 512 MB that adding to the original database would be. In fact, you could create 4 new databases for less space than would be consumed by adding a new file to the original database, and because the files are smaller they would be easier to fit into contiguous blocks of memory.
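To make that arithmetic concrete, here is the comparison computed out in a few lines (the sizes simply follow the doubling pattern from the example above; treat them as illustrative defaults):

    # Address space consumed: growing the existing database vs. starting a new one.
    # Sizes (MB) follow the doubling pattern from the example above.
    NS_FILE = 16   # namespace file per database

    def database_size(n_data_files):
        """Total MB for a database with n data files (64, 128, 256, ... doubling)."""
        return NS_FILE + sum(64 * 2 ** i for i in range(n_data_files))

    existing = database_size(3)                # .ns/.0/.1/.2 -> 464 MB
    with_next_file = database_size(4)          # adds a 512 MB file -> 976 MB
    with_new_db = existing + database_size(1)  # plus a fresh 80 MB database -> 544 MB

    print(existing, with_next_file, with_new_db)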
It is well known that the 32-bit build should not be used in production.
Use 64-bit systems.
Period.