Our Postgres DB (hosted on Google Cloud SQL with 1 CPU, 3.7 GB of RAM, see below) consists mostly of one big ~90 GB table with about 60 million rows. The usage pattern consists almost exclusively of appends, plus a few indexed reads near the end of the table. From time to time a few users get deleted, which removes a small percentage of rows scattered across the table.
This all works fine, but every few months an autovacuum gets triggered on that table, which significantly impacts our service's performance for ~8 hours:
Storage usage increases by ~1GB for the duration of the autovacuum (several hours), then slowly returns to the previous value (might eventually drop below it, due to the autovacuum freeing pages)
Database CPU utilization jumps from <10% to ~20%
Disk Read/Write Ops increases from near zero to ~50/second
Database Memory increases slightly, but stays below 2GB
Transactions/sec and ingress/egress bytes are also fairly unaffected, as would be expected
This has the effect of increasing our service's 95th-percentile latency from ~100ms to ~0.5-1s during the autovacuum, which in turn triggers our monitoring. The service serves around ten requests per second, with each request consisting of a few simple DB reads/writes that normally have a latency of 2-3ms each.
The DB configuration is fairly vanilla.
The log entry documenting this autovacuum process reads as follows:
automatic vacuum of table "XXX": index scans: 1
pages: 0 removed, 6482261 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 5959839 removed, 57732135 remain, 4574 are dead but not yet removable
buffer usage: 8480213 hits, 12117505 misses, 10930449 dirtied
avg read rate: 2.491 MB/s, avg write rate: 2.247 MB/s
system usage: CPU 470.10s/358.74u sec elapsed 38004.58 sec
Any suggestions as to what we could tune to reduce the impact of future autovacuums on our service? Or are we doing something wrong?
If you increase autovacuum_vacuum_cost_delay, autovacuum will run slower and be less invasive.
However, the better solution is usually the opposite: make autovacuum faster by setting autovacuum_vacuum_cost_limit to 2000 or so, so that it finishes sooner.
You could also try to schedule VACUUMs of the table yourself at times when it hurts least.
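For concreteness, a minimal sketch of both tunings, assuming the big table is called big_table (a placeholder name) and that your hosting lets you change these settings (on managed services such as Cloud SQL they may have to be set through the provider's database flags instead):

-- raise the autovacuum cost budget globally ...
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;
SELECT pg_reload_conf();

-- ... or only for the one big table
ALTER TABLE big_table SET (autovacuum_vacuum_cost_limit = 2000);

-- or run/schedule a manual VACUUM in a quiet window (e.g. from cron);
-- it cleans up the dead rows, so a threshold-based autovacuum is less
-- likely to kick in during busy hours
VACUUM (VERBOSE) big_table;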
But frankly, if a single innocuous autovacuum is enough to disturb your operation, you need more I/O bandwidth.
Related
I have a use case where I insert 100,000 rows per minute, while at the other end a few threads pick up those rows and delete them from the table. So it will definitely create a lot of dead tuples in my table.
My autovacuum configuration is:
autovacuum_max_workers = 3
autovacuum_naptime = 1min
autovacuum_vacuum_scale_factor = 0.2
autovacuum_analyze_scale_factor = 0.1
autovacuum_vacuum_cost_delay = 20ms
autovacuum_vacuum_cost_limit = -1
From "pg_stat_user_tables" I can find auto-vacuum is running on my table but within a few hours my disk will be full (500 GB) and I can't able to insert any new row.
on the second try, I changed the following configuration
autovacuum_naptime = 60min
autovacuum_vacuum_cost_delay = 0
This time my simulation and autovacuum run well, and the maximum disk usage is 180 GB.
My doubt is: if I change autovacuum_vacuum_cost_delay to 0 ms, how does autovacuum free the dead tuple space so that PG can reuse it? And why is it not working as intended when I set the value to 20 ms?
If I change autovacuum_vacuum_cost_delay to 0 ms, how does autovacuum free the dead tuple space so that PG can reuse it?
The space freed up by vacuum is recorded in the free space map, from where it gets handed out for re-use by future INSERTs.
Another detail to add: in 9.6 the free space map is only vacuumed once the entire table itself has been completely vacuumed, so the freed-up space is not findable until then. If the VACUUM never makes it to the very end, because it is too slow or gets interrupted, the space it is freeing up will not be reused for INSERTs. This was improved in v11.
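If you want to see how much reusable space the free space map is actually advertising for a table, the pg_freespacemap contrib extension can show it; a small sketch, with my_table as a placeholder name:

-- inspect the free space map of one table
CREATE EXTENSION IF NOT EXISTS pg_freespacemap;

SELECT count(*) AS pages_with_free_space,
       sum(avail) AS approx_free_bytes
FROM pg_freespace('my_table')
WHERE avail > 0;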
Why is it not working as intended when I set the value to 20 ms?
Because vacuum can't keep up at that value. The default values for PostgreSQL are often suitable only for smaller servers, which yours doesn't seem to be; it is appropriate and advisable to change the defaults in this situation. Note that in v12 the default was lowered from 20 to 2 (and its type was correspondingly changed from int to float, so you can now specify the value with more precision).
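If upgrading is not an option, the same effect can be had for just this one table by overriding the setting as a storage parameter; a sketch, with queue_table as a placeholder name:

-- disable cost-based throttling of autovacuum for this table only
ALTER TABLE queue_table SET (autovacuum_vacuum_cost_delay = 0);

-- optionally also make autovacuum trigger earlier on it
ALTER TABLE queue_table SET (autovacuum_vacuum_scale_factor = 0.01);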
To summarize, your app creates tons of dead tuples and autovacuum can't keep up. Possible solutions:
This sounds more like a task queue than a regular table. Perhaps a PostgreSQL table is not ideal for this specific use case; use a dedicated solution such as RabbitMQ or Redis instead.
Create time-based range partitions and purge old partitions once they're empty, while disabling autovacuum on this table alone. Better yet, if you can identify which partitions have been fully handled, don't delete rows at all and just drop those partitions (see the sketch after this list).
Tweak the autovacuum settings so that it works constantly, without any naps or throttling. Increasing maintenance_work_mem could help speed autovacuum up too. Perhaps you'll find out that you've reached your hard drive's limits; in that case, you will have to optimize the storage so that it can accommodate those expensive INSERT+DELETE+autovacuum operations.
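A rough sketch of the partitioning idea above, assuming PostgreSQL 10+ declarative partitioning and a hypothetical jobs table partitioned by insertion time:

-- parent table, partitioned by time
CREATE TABLE jobs (
    payload    jsonb,
    created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

-- one partition per day, with autovacuum disabled on it
CREATE TABLE jobs_2024_01_01 PARTITION OF jobs
    FOR VALUES FROM ('2024-01-01') TO ('2024-01-02')
    WITH (autovacuum_enabled = off);

-- instead of DELETEing processed rows, drop the whole partition
-- once every row in it has been handled (no dead tuples, no vacuum)
DROP TABLE jobs_2024_01_01;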
Well, the default value is 2 ms (see the Autovacuum documentation), so your 20 ms value is high:
autovacuum_vacuum_cost_delay (floating point)
"Specifies the cost delay value that will be used in automatic VACUUM operations. If -1 is specified, the regular vacuum_cost_delay value will be used. If this value is specified without units, it is taken as milliseconds. The default value is 2 milliseconds. This parameter can only be set in the postgresql.conf file or on the server command line; but the setting can be overridden for individual tables by changing table storage parameters."
As explained in the Vacuum documentation:
"
vacuum_cost_delay (floating point)
The amount of time that the process will sleep when the cost limit has been exceeded. If this value is specified without units, it is taken as milliseconds. The default value is zero, which disables the cost-based vacuum delay feature. Positive values enable cost-based vacuuming.
When using cost-based vacuuming, appropriate values for vacuum_cost_delay are usually quite small, perhaps less than 1 millisecond. While vacuum_cost_delay can be set to fractional-millisecond values, such delays may not be measured accurately on older platforms. On such platforms, increasing VACUUM's throttled resource consumption above what you get at 1ms will require changing the other vacuum cost parameters. You should, nonetheless, keep vacuum_cost_delay as small as your platform will consistently measure; large delays are not helpful.
"
Overview:
We are comparing performance of create/read/write/rw over two different architectures: Single Database vs Multiple Databases (15k-25k).
We prefer to use the Multi-DB architecture because that makes it easier to separate customers (customer = 1 company). However, due to performance degradation we fear this may not be a good solution.
Server Specification:
Single instance MongoDB server; 64GB RAM; 16 core; SSD HD
Test results:
Both test scenarios have the same total number of documents (and documents are roughly the same size). The variables are number of databases, collections per database and documents per collection.
All tests are conducted in parallel using 50 client threads (separate machine), with the exception of Read/Write, which uses 100 (50R/50W). 'directoryPerDB' is enabled.
(All times are in milliseconds per doc operation)
Test                        Creation  Read     Write    Read/Write  Notes
25000 DB, 4 Coll, 250 Doc   23ms      1-10ms   1-4ms    2-10ms      Max 1400% CPU, noticeable "pauses" (CPU drops to 100%)
15000 DB, 4 Coll, 420 Doc   23ms      0.7-4ms  0.9-4ms  2-9ms       Max 1400% CPU, noticeable "pauses" (CPU drops to 100%)
1 DB, 4 Coll, 125000 Doc    0.8ms     0.6ms    0.8ms    1.2-1.6ms   Max 600% CPU, no pauses
Conclusion:
There seems to be noticeable performance degradation on a regular interval when the DB count is very high. It may be due to the sheer number of files (25000 DBs * 4 Colls * 2 files = 200k files) or some other bottleneck.
In the Single-DB test, the CPU stays around 600% and maintains that until completion. In the multi DB tests, the CPU (at peak performance) is somewhere between 800-1400%, but every so often the CPU drops to 100% and all operations are paused. This can be verified by watching the mongo log, as well as the logs from the test clients that are issuing R/W commands.
If it wasn't for these pauses, the Multi-DB architecture would be ~2x faster than Single-DB, however it appears there is some global contention that cannot be avoided.
I'm hoping someone might know what this global contention is and (if possible) how to solve it.
I ran two deletes on a PostgreSQL 9.3.12 database against a fairly large table. Each one required a table scan and took about 10 minutes to complete.
While they were running clients weren't impacted. Disk I/O was high, upwards of 70%, but that's fine.
After the second delete finished Disk I/O went to near zero and Load Average shot through the roof. Requests were not being completed in a timely manner and since new requests continued to arrive they all stacked up.
My two theories are:
Something with the underlying I/O layer that caused all I/O requests to block for some period of time, or
Postgres acquired (and held for a non-trivial period of time) a lock needed by clients. Either a global one or one related to the table from which rows were deleted. This table is frequently inserted into by clients; if someone were holding a lock that blocked inserts it would definitely explain this behavior.
Any ideas? Load was in excess of 40, which never happens in our environment even during periods of heavy load. Network I/O was high during/after the deletes but only because they were being streamed to our replication server.
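For reference, theory 2 could be checked directly the next time this happens with something like the following (9.3 syntax; later versions replace the waiting column with wait_event):

-- lock requests that are not being granted
SELECT pid, locktype, relation::regclass AS relation, mode, granted
FROM pg_locks
WHERE NOT granted;

-- sessions currently blocked, with the statement they are stuck on
SELECT pid, waiting, now() - query_start AS waited, query
FROM pg_stat_activity
WHERE waiting;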
I created a database containing a total of 3 tables for a specific purpose. The total size of all tables is about 850 MB - very lean... out of which one single table contains about 800 MB (including index) of data and 5 million records (daily addition of about 6000 records).
The system is PostgreSQL on Windows: a Windows 7 laptop with 8 GB RAM and an SSD.
I allocated 2048MB as shared_buffers, 256MB as temp_buffers and 128MB as work_mem.
I execute a single query multiple times against the single table - hoping that the table stays in RAM (hence the above parameters).
But, although I see a spike in memory usage during execution (by about 200 MB), I do not see memory consumption stay at 500 MB or more (which it would if the data stayed in memory). All postgres executables show 2-6 MB in Task Manager. Hence, I suspect the LRU does not keep the data in memory.
Average query execution time is about 2 seconds (a very simple single-table query), but I need to get it down to about 10-20 ms or even less if possible, purely because the same query is going to be executed far too many times, and that can only be achieved by keeping the data in memory.
Any advice?
Regards,
Kapil
You should not expect postgres processes to show large memory use, even if the whole database is cached in RAM.
That is because PostgreSQL relies on buffered reads from the operating system buffer cache. In simplified terms, when PostgreSQL does a read(), the OS looks to see whether the requested blocks are cached in the "free" RAM that it uses for disk cache. If the block is in cache, the OS returns it almost instantly. If the block is not in cache the OS reads it from disk, adds it to the disk cache, and returns the block. Subsequent reads will fetch it from the cache unless it's displaced from the cache by other blocks.
That means that if you have enough free memory to fit the whole database in "free" operating system memory, you won't tend to hit the disk for reads.
Depending on the OS, behaviour for disk writes may differ. Linux will write-back cache "dirty" buffers, and will still return blocks from cache even if they've been written to. It'll write these back to the disk lazily unless forced to write them immediately by an fsync() as Pg uses at COMMIT time. When it does that it marks the cached blocks clean, but doesn't flush them. I don't know how Windows behaves here.
The point is that PostgreSQL can be running entirely out of RAM with a 1GB database, even though no PostgreSQL process seems to be using much RAM. Having shared_buffers too high just leads to double-caching and can reduce the amount of RAM available for the OS to cache blocks.
It isn't easy to see exactly what's cached in RAM because Pg relies on the OS cache. That's why I referred you to pg_fincore.
If you're on Windows and this won't work, you really just have to rely on observing disk activity. Does performance monitor show lots of uncached disk reads? Does operating system memory monitoring show lots of memory used for disk cache in the OS?
Make sure that effective_cache_size correctly reflects the RAM used for disk cache. It will help PostgreSQL choose appropriate query plans.
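For example, on an 8 GB machine where most of the RAM ends up as OS disk cache, something in this ballpark might be reasonable (the exact value is a guess, and ALTER SYSTEM needs 9.4+; otherwise edit postgresql.conf):

ALTER SYSTEM SET effective_cache_size = '6GB';
SELECT pg_reload_conf();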
You are making the assumption, without apparent evidence, that the query performance you are experiencing is explained by disk read delays, and that it can be improved by in-memory caching. This may not be the case at all. You need to look at explain analyze output and system performance metrics to see what's going on.
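For instance (table and column names are placeholders), the BUFFERS option shows whether blocks came from shared_buffers or had to be read in:

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM my_big_table
WHERE some_indexed_column = 42;

-- In the output, "Buffers: shared hit=... read=..." tells you how many
-- blocks were already in shared_buffers (hit) versus fetched from the
-- OS cache or disk (read).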
MongoDB background flushing blocks all requests:
Server: Windows server 2008 R2
CPU Usage: 10 %
Memory: 64G, Used 7%, 250MB for Mongod
Disk % Read/Write Time: less than 5% (According to Perfmon)
Mongodb Version: 2.4.6
Mongostat Normally:
insert:509 query:608 update:331 delete:*0 command:852|0 flushes:0 mapped:63.1g vsize:127g faults:6449 locked db:Radius:12.0%
Mongostat Before(maybe while) Flushing:
insert:1 query:4 update:3 delete:*0 command:7|0 flushes:0 mapped:63.1g vsize:127g faults:313 locked db:local:0.0%
And Mongostat After Flushing:
insert:1572 query:1849 update:1028 delete:*0 command:2673|0 flushes:1 mapped:63.1g vsize:127g faults:21065 locked db:.:99.0%
As you can see, when the flush happens the lock is at 99%, and just at this point mongod stops responding to any read/write operation (mongotop and mongostat also stop). The flush takes about 7 to 8 seconds to complete, which does not increase disk load by more than 10%.
Are there any suggestions?
Under Windows Server 2008 R2 (and, I would suspect, other versions of Windows, although I don't know for sure), MongoDB's (2.4 and older) background flush process imposes a global lock that substantially blocks reads and writes, and the length of the flush tends to be proportional to the amount of memory MongoDB is using (both resident and system cache for memory-mapped files), even if very little actual write activity is going on. This is a phenomenon we ran into at our shop.
In one replica set where we were using MongoDB version 2.2.2, on a host with some 128 GBs of RAM, when most of the RAM was in use either as resident memory or as standby system cache, the flush time was reliably between 10 and 15 seconds under almost no load and could go as high as 30 to 40 seconds under load. This could cause Mongo to go into long pauses of unresponsiveness every minute. Our storage did not show signs of being stressed.
The basic problem, it seems, is that Windows handles flushing to memory-mapped files differently than Linux. Apparently, the process is synchronous under Windows and this has a number of side effects, although I don't understand the technical details well enough to comment.
MongoDB, Inc., is aware of this issue and is working on optimizations to address it. The problem is documented in a couple of tickets:
https://jira.mongodb.org/browse/SERVER-13444
https://jira.mongodb.org/browse/SERVER-12401
What to do?
The phenomenon is tied, to some degree, to the minimum latency of the disk subsystem as measured under low stress, so you might try experimenting with faster disks, if you can. Some improvements have been reported with this approach.
A strategy that worked for us in some limited degree is avoiding provisioning too much RAM. It happened that we really didn't need 128 GBs of RAM, so by dialing back on the RAM, we were able to reduce the flush time. Naturally, that wouldn't work for everyone.
The latest versions of MongoDB (2.6.0 and later) seem to handle the situation better, in that writes are still blocked during the long flush but reads are able to proceed.
If you are working with a sharded cluster, you could try dividing the RAM by putting multiple shards on the same host. We didn't try this ourselves, but it seems like it might have worked. On the other hand, careful design and testing would be highly recommended in any such scenario to avoid compromising performance and/or high availability.
We tried playing with syncdelay. Reducing it didn't help (the long flush times just happened more frequently). Increasing it helped a little (there was more time between flushes to get work done), but increasing it too much can exacerbate the problem severely. We boosted the syncdelay to five minutes (300 seconds), at one point, and were rewarded with a background flush of 20 minutes.
Some optimizations are in the works at MongoDB, Inc. These may be available soon.
In our case, to relieve the pressure on the primary host, we periodically rebooted one of the secondaries (clearing all memory) and then failed over to it. Naturally, there is some performance hit due to re-caching, and I think this only worked for us because our workload is write-heavy. Moreover, this technique is not in any sense a solution. But if high flush times are causing serious disruption, this may be one way to "reduce the fever", so to speak.
Consider running on Linux... :-)
Background flushing does not block reads/writes by default. mongod flushes every 60 s, unless otherwise specified with the syncdelay parameter. The flush uses an fsync() operation, which can be set to block writes while in-memory pages are flushed to disk. A blocked write can in turn block reads as well. Read more: http://docs.mongodb.org/manual/reference/command/fsync/
However, normally a flush should not take more than 1000 ms (1 second). If it does, it is likely that the amount of data being flushed to disk is too large for your disk to handle.
Solution: upgrade to a faster disk such as an SSD, or decrease the flush interval (try 30 s rather than the default 60 s).