Slow transaction processing in PostgreSQL

I have been noticing bad behavior in my PostgreSQL database, but I still can't find any solution or improvement to apply.
The context is simple. Say I have two tables: CUSTOMERS and ITEMS. On certain days the number of concurrent customers increases, and with it the number of item requests: customers can consult items, or add or remove quantities of them. However, in the APM I can see each new request running slower than the previous one, with the query response from those tables flagged as the biggest time consumer.
If a normal execution of the query takes about 200 milliseconds, a few moments later it can take about 20 seconds.
I understand the locking process in PostgreSQL: many users can be checking the same item, and they may even be changing its values, but the response from the database is still far too slow.
So I would like to know if there are ways to improve the performance of the database.
I initially used PGTune to get starting settings, and that worked well. I am on version 11 with 20 GB of RAM, 4 vCPUs, and SAN storage; simultaneous customers (not sessions) can exceed 500.

Related

Why Select queries on partitioned tables take lock and get stuck

We designed a database so that it can accept lots of data.
For that we used partitioned tables quite a lot (the way we handle traffic information in the database can take advantage of the partitioning system).
To be more precise, we have tables with partitions, and partitions that themselves have partitions (up to 4 levels):
main table
-> sub tables partitioned by column 1 (list)
   -> sub tables partitioned by column 2 (list)
...
There are 4 main partitioned tables. Each one has from 40 to 120 (sub)partitions behind it.
The query that takes locks and is blocked by others is a SELECT that works on these 4 tables, joined (so, counting partitions, it works over about 250 tables).
We had no problem until now, perhaps due to a traffic increase. Now SELECT queries that use these tables, which normally execute in 20 ms, can wait up to 10 seconds, locked and waiting.
When querying pg_stat_activity I see that these queries show:
wait_event_type : LWLock
wait_event : lock_manager
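A filtered look at pg_stat_activity like the following (a sketch; these columns exist as of PostgreSQL 9.6+) surfaces which backends are stuck on this wait:

```sql
-- Which sessions are currently waiting on an LWLock, and on what?
SELECT pid, wait_event_type, wait_event, state, query
FROM pg_stat_activity
WHERE wait_event_type = 'LWLock';
```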
I asked the dev team and also confirmed by reading the logs (primary and replica): there was nothing else running except SELECT and INSERT/UPDATE queries on these tables.
These SELECT queries are running on the replica servers.
I searched online first, but everything I find says: yes, there are exclusive locks on partitioned tables, but only during operations like DROP or ATTACH/DETACH PARTITION, and none of that is happening on my server while the problem occurs.
Server is version 12.4, running on AWS Aurora.
What can make these queries locked and waiting on this LWLock?
What are my options to improve this behaviour (i.e., to keep my queries from being blocked)?
EDIT:
Adding some details I've been asked for, or that could be interesting:
Number of connections:
usually: 10 connections opened per second
at peak (when the problem appears): 40 to 100 connections opened per second
during the problem, the number of open connections varies from 100 to 200.
Size of the database: 30 GB; currently lots of partitions are empty.
You are probably suffering from internal contention on database resources caused by too many connections that all compete to use the same shared data structures. It is very hard to pinpoint the exact resource with the little information provided, but the high number of connections is a strong indication.
What you need is a connection pool that maintains a small number of persistent database connections. That will both reduce the problematic contention and do away with the performance wasted on opening lots of short-lived database connections. Your overall throughput will increase.
If your application has no connection pool built in, use pgBouncer.
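As a sketch, a minimal PgBouncer configuration in transaction pooling mode might look like this (database name, host, and pool sizes are placeholders to adapt):

```ini
[databases]
; hypothetical database; point this at your real server
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_port = 6432
; transaction pooling lets many clients share few server connections
pool_mode = transaction
; keep the number of real server connections small
default_pool_size = 20
; but accept all the application's clients
max_client_conn = 500
```

The application then connects to port 6432 instead of 5432; PgBouncer multiplexes those client connections over the small persistent pool.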

Slow bulk read from Postgres Read replica while updating the rows we read

We have on RDS a main Postgres server and a read replica.
We constantly write new data and keep updating rows from the last couple of days.
Reading from the read replica works fine for older data, but reading data from the last couple of days, where we keep updating rows on the main server, is painfully slow.
Queries that take 2-3 minutes on old data can timeout after 20 minutes when querying data from the last day or two.
Looking at the monitors like CPU I don't see any extra load on the read replica.
Is there a solution for this?
You are accessing over 65 buffers for every 1 visible row found in the index scan (and over 500 buffers for each row which is returned by the index scan, since 90% are filtered out by the mmsi criterion).
One issue is that your index is not as selective as it could be. If you had the index on (day, mmsi) rather than just (day), it should be about 10 times faster.
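A sketch of that composite index, using the table and column names that appear in this answer (simplified_blips, day, mmsi); CONCURRENTLY avoids blocking writes while it builds:

```sql
-- Composite index so the mmsi filter is resolved inside the index scan
-- instead of discarding 90% of fetched rows afterwards.
CREATE INDEX CONCURRENTLY idx_simplified_blips_day_mmsi
    ON simplified_blips (day, mmsi);
```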
But it also looks like you have a massive amount of bloat.
You are probably not vacuuming the table often enough. With your described UPDATE pattern, all the vacuum needs are accumulating in the newest data, but the activity counters are evaluated based on the full table size, so autovacuum is not done often enough to suit the needs of the new data. You could lower the scale factor for this table:
alter table simplified_blips set (autovacuum_vacuum_scale_factor = 0.01)
Or if you partition the data based on "day", then the partitions for newer days will naturally get vacuumed more often because the occurrence of updates will be judged against the size of each partition, it won't get diluted out by the size of all the older inactive partitions. Also, each vacuum run will take less work, as it won't have to scan all of the indexes of the entire table, just the indexes of the active partitions.
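A minimal sketch of that partitioning layout, assuming a date column named day (the other columns are placeholders for whatever the real table holds):

```sql
-- Range-partition by day so autovacuum judges update activity
-- against each partition's size, not the whole table's.
CREATE TABLE simplified_blips_part (
    day   date NOT NULL,
    mmsi  bigint
    -- ...remaining columns from the original table...
) PARTITION BY RANGE (day);

-- One partition per day; hot recent days get vacuumed on their own schedule.
CREATE TABLE simplified_blips_2020_01_01
    PARTITION OF simplified_blips_part
    FOR VALUES FROM ('2020-01-01') TO ('2020-01-02');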
As suggested, the problem was bloat.
When you update a record in PostgreSQL (an MVCC database), the database creates a new version of the row containing the updated values.
After the update you end up with a "dead record" (AKA dead tuple).
Every once in a while the database will run autovacuum and clean the dead tuples out of the table.
Usually autovacuum is fine, but if your table is really large and updated often, you should consider making the autovacuum analyze and vacuum thresholds more aggressive.
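As a sketch, those thresholds can be tightened per table with storage parameters (the table name and exact values here are illustrative; tune them to your update rate):

```sql
-- Vacuum/analyze after a small fraction of rows change, instead of the
-- default 20%/10% of the whole (large) table.
ALTER TABLE big_busy_table SET (
    autovacuum_vacuum_scale_factor  = 0.01,
    autovacuum_analyze_scale_factor = 0.02,
    autovacuum_vacuum_threshold     = 1000
);
```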

Redshift cluster: queries getting hang and filling up space

I have a Redshift cluster with 3 nodes. Every now and then, with users running queries against it, we end up in an unpleasant situation where some queries run for way longer than expected (even simple ones, exceeding 15 minutes), and the cluster storage starts increasing, to the point that if you don't terminate the long-standing queries it reaches 100% storage occupied.
I wonder why this may happen. My experience is varied: sometimes it's been a single query doing this, and sometimes it's been different concurrent queries being run at the same time.
One specific scenario where we saw this happen related to LISTAGG. The type of LISTAGG is varchar(65535), and while Redshift optimizes away the implicit trailing blanks when stored to disk, the full width is required in memory during processing.
If you have a query that returns a million rows, you end up with 1,000,000 rows times 65,535 bytes per LISTAGG, which is 65 gigabytes. That can quickly get you into a situation like what you describe, with queries taking unexpectedly long or failing with “Disk Full” errors.
My team discussed this a bit more on our team blog the other day.
This typically happens when a poorly constructed query spills too much data to disk. For instance, the user accidentally specifies a Cartesian product (every row of tblA joined to every row of tblB).
If this happens regularly you can implement a QMR rule that limits the amount of disk spill before a query is aborted.
QMR Documentation: https://docs.aws.amazon.com/redshift/latest/dg/cm-c-wlm-query-monitoring-rules.html
QMR Rule Candidates query: https://github.com/awslabs/amazon-redshift-utils/blob/master/src/AdminScripts/wlm_qmr_rule_candidates.sql
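As a sketch, a QMR rule of that kind lives in the WLM JSON configuration; something like the following aborts any query spilling more than the threshold (queue settings and the 100000 MB spill limit are placeholders to adapt):

```json
[
  {
    "query_group": [],
    "user_group": [],
    "query_concurrency": 5,
    "rules": [
      {
        "rule_name": "abort_heavy_spill",
        "predicate": [
          {
            "metric_name": "query_temp_blocks_to_disk",
            "operator": ">",
            "value": 100000
          }
        ],
        "action": "abort"
      }
    ]
  }
]
```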

pathological spike in postgresql load average after large delete

I ran two deletes on a PostgreSQL 9.3.12 database against a fairly large table. Each one required a table scan and took about 10 minutes to complete.
While they were running clients weren't impacted. Disk I/O was high, upwards of 70%, but that's fine.
After the second delete finished Disk I/O went to near zero and Load Average shot through the roof. Requests were not being completed in a timely manner and since new requests continued to arrive they all stacked up.
My two theories are:
Something with the underlying I/O layer that caused all I/O requests to block for some period of time, or
Postgres acquired (and held for a non-trivial period of time) a lock needed by clients. Either a global one or one related to the table from which rows were deleted. This table is frequently inserted into by clients; if someone were holding a lock that blocked inserts it would definitely explain this behavior.
Any ideas? Load was in excess of 40, which never happens in our environment even during periods of heavy load. Network I/O was high during/after the deletes but only because they were being streamed to our replication server.
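If it happens again, theory 2 can be checked directly while the pile-up is occurring; a sketch for 9.3, where pg_stat_activity still exposes a boolean waiting column (replaced by wait_event in 9.6):

```sql
-- Which backends are blocked on a lock right now (PostgreSQL 9.3)?
SELECT pid, waiting, state, query
FROM pg_stat_activity
WHERE waiting;

-- And which lock requests have not been granted, on which relations:
SELECT locktype, relation::regclass, mode, pid
FROM pg_locks
WHERE NOT granted;
```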

Memcache or Queue for Hits Logging

We have a busy website that needs to log 'hits' for certain pages or API endpoints which are visited, to help populate stats, popularity grids, etc. The hits we're logging aren't simple page hits, so we can't use log parsing.
In the past, we've just directly logged to the database with an update query, but under heavy concurrency, this creates a database load that we don't want.
We are currently using Memcache but experiencing some issues with the stats not being quite accurate due to non-atomic updates.
So my question:
Should we continue to use Memcache but improve atomic increments:
1) When a page is hit, create a memcache key such as "stats:pageid:3" and increment it atomically on each hit
2) Write a batch script to cycle through all the memcache keys and issue a batch update to the database once every 10 minutes
PROS: Fewer database hits, as we're only updating once per page per 10 minutes (with however many hits in that 10-minute period)
CONS: We can atomically increment the individual counters, but we would still need a memcache key to store which pageids have had hits, so we can loop through and log them. This won't be atomic, so when we flush the data to the DB and reset everything, things may linger in this key. We could lose up to 10 minutes of data.
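The race in option 1 comes down to reading and resetting the counters in one step. A minimal in-process sketch of that flush semantics, using a plain dict plus a lock as a stand-in for Memcache (which offers atomic incr per key but no atomic read-and-reset across keys):

```python
import threading

class HitCounter:
    """Sketch of atomic hit counting with a lose-nothing flush.

    A lock guards both the per-page counts and the set of touched pages,
    so flush() can hand back a snapshot and reset in one atomic step --
    the operation the Memcache-based design struggles to get right.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}  # pageid -> hits since last flush

    def hit(self, pageid):
        with self._lock:
            self._counts[pageid] = self._counts.get(pageid, 0) + 1

    def flush(self):
        """Return accumulated counts and reset atomically, so no hit lingers."""
        with self._lock:
            snapshot, self._counts = self._counts, {}
        return snapshot
```

The batch script would call flush() every 10 minutes and write the snapshot to the database in one pass.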
OR Use a queue/task system:
1) When a page is hit, add a job to the task queue
2) The task queue can then be rate limited and process these 'hits' to the database in the background.
PROS: Easy to code, and we can scale up queue workers if required.
CONS: We're still hitting the database once per hit, as each task is processed individually rather than 'summing' up all the hits.
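That last con goes away if the worker drains the queue in batches and sums hits per page before touching the database; a sketch using Python's standard queue module:

```python
import queue
from collections import Counter

def drain_and_sum(q, max_items=1000):
    """Pull up to max_items hit events off the queue and sum them per page,
    so the database sees one UPDATE per page instead of one per hit."""
    totals = Counter()
    for _ in range(max_items):
        try:
            pageid = q.get_nowait()
        except queue.Empty:
            break  # queue drained before reaching the batch limit
        totals[pageid] += 1
    return totals
```

Each worker cycle then issues one summed UPDATE per page, keeping the easy-to-code queue design without the per-hit database cost.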
Or any other suggestions?
OR: use something designed for recording stats at high traffic levels, such as StatsD & Graphite. The original StatsD is written in JavaScript on top of Node.js, which can be a little complex to set up (though there are easier ways to install it, e.g. with a Docker container), or you can use a work-alike (not using Node.js) that does the same job, such as one written in Go.
I've used the original StatsD and Graphite pair to great effect, and it makes pretty graphs too (this was for tens of millions of events per day).
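The StatsD wire format is simple enough to speak directly if you'd rather not pull in a client library: a counter is just "&lt;key&gt;:&lt;value&gt;|c" fired over UDP. A minimal sketch (the default StatsD host/port are assumed; adjust for your deployment):

```python
import socket

def format_counter(key, value=1, sample_rate=None):
    """Build a StatsD counter line: "<key>:<value>|c" plus an
    optional "|@<rate>" suffix when sampling."""
    line = f"{key}:{value}|c"
    if sample_rate is not None:
        line += f"|@{sample_rate}"
    return line

def send_counter(key, host="127.0.0.1", port=8125):
    """Fire-and-forget increment over UDP: no reply, no blocking,
    so a dropped packet never slows down the page hit itself."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(format_counter(key).encode("ascii"), (host, port))
    finally:
        sock.close()
```

Because it's UDP, logging a hit costs the web request almost nothing even if the StatsD daemon is down.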