Pathological spike in PostgreSQL load average after large delete

I ran two deletes on a PostgreSQL 9.3.12 database against a fairly large table. Each one required a table scan and took roughly 10 minutes to complete.
While they were running, clients weren't impacted. Disk I/O was high, upwards of 70%, but that's fine.
After the second delete finished, disk I/O went to near zero and the load average shot through the roof. Requests were not being completed in a timely manner, and since new requests continued to arrive, they all stacked up.
My two theories are:
Something with the underlying I/O layer that caused all I/O requests to block for some period of time, or
Postgres acquired (and held for a non-trivial period of time) a lock needed by clients, either a global one or one related to the table from which rows were deleted. This table is frequently inserted into by clients; if something were holding a lock that blocked inserts, it would definitely explain this behavior.
Any ideas? Load was in excess of 40, which never happens in our environment even during periods of heavy load. Network I/O was high during/after the deletes but only because they were being streamed to our replication server.
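If this happens again, I plan to check the lock queue with something like the following while the stall is in progress (a rough sketch for 9.3, where pg_blocking_pids() does not exist yet, so it joins pg_locks to pg_stat_activity directly and can over-match on some lock types):

    -- Rough sketch: sessions waiting on a lock and the sessions currently holding it.
    SELECT blocked.pid        AS blocked_pid,
           blocked_act.query  AS blocked_query,
           blocking.pid       AS blocking_pid,
           blocking_act.query AS blocking_query
    FROM   pg_locks blocked
    JOIN   pg_stat_activity blocked_act ON blocked_act.pid = blocked.pid
    JOIN   pg_locks blocking
           ON  blocking.locktype = blocked.locktype
           AND blocking.database      IS NOT DISTINCT FROM blocked.database
           AND blocking.relation      IS NOT DISTINCT FROM blocked.relation
           AND blocking.transactionid IS NOT DISTINCT FROM blocked.transactionid
           AND blocking.pid <> blocked.pid
           AND blocking.granted
    JOIN   pg_stat_activity blocking_act ON blocking_act.pid = blocking.pid
    WHERE  NOT blocked.granted;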

Related

Sudden Increase in row exclusive locks and connection exhaustion in PostgreSQL

I have a scenario that repeats itself every few hours: there is a sudden increase in row exclusive locks in the PostgreSQL DB. At the same time some queries are not answered in time, which causes connection exhaustion, so PostgreSQL does not accept new clients anymore. After 2-3 minutes the lock and connection numbers drop and the system comes back to its normal state.
I wonder if autovacuum could be the root cause of this? I see that analyze and vacuum (NOT VACUUM FULL) take about 20 seconds to complete on one of the tables. I have INSERT, SELECT, UPDATE and DELETE operations going on from my application, and I don't have DDL commands (ALTER TABLE, DROP TABLE, CREATE INDEX, ...) going on. Can the autovacuum procedure conflict with queries from my application and cause them to wait until the vacuum has completed? Or is it all the application's and my bad design's fault? I should mention that one of my tables has a jsonb field that keeps relatively large data in each row (roughly 10 MB).
I have attached an image from the monitoring application that shows the sudden increase in row exclusive locks.
ROW EXCLUSIVE locks are perfectly harmless; they are taken on tables against which DML statements run, so your graph reveals nothing. You should set log_lock_waits = on and set log_min_duration_statement to a reasonable value; perhaps you can then spot something in the logs. Also watch out for long-running transactions.
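A minimal way to do that, assuming superuser access and PostgreSQL 9.4 or later for ALTER SYSTEM (on older versions, set the same parameters in postgresql.conf and reload):

    -- Illustrative values only: log lock waits that exceed deadlock_timeout and
    -- any statement that runs longer than one second, then reload the config.
    ALTER SYSTEM SET log_lock_waits = on;
    ALTER SYSTEM SET log_min_duration_statement = 1000;  -- milliseconds
    SELECT pg_reload_conf();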

Slow transaction processing in PostgreSQL

I have been noticing bad behavior in my Postgres database, but I still can't find any solution or improvement to apply.
The context is simple: let's say I have two tables, CUSTOMERS and ITEMS. On certain days the number of concurrent customers increases, and with it the requests against items; customers can consult them, or add and remove quantities. In the APM, however, I can see that each new request is slower than the previous one, and it points at the query response from those tables as the biggest time consumer.
If the normal execution time of the query is about 200 milliseconds, a few moments later it can be about 20 seconds.
I understand how locking works in PostgreSQL, since many users can be checking the same item and may even be changing its values, but the response from the database is far too slow.
So I would like to know if there are ways to improve the performance of the database.
The first time around I used PGTune to get the initial settings and it worked well. I am on version 11 with 20 GB of RAM, 4 vCPUs and SAN storage; the number of simultaneous customers (not sessions) can reach over 500.
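To see whether sessions are actually queuing behind locks while it is slow, a query along these lines should help (a sketch using pg_blocking_pids(), available from 9.6 on, so it works on version 11):

    -- Rough sketch: active sessions that are waiting on a lock, and which PIDs block them.
    SELECT pid,
           pg_blocking_pids(pid) AS blocked_by,
           wait_event_type,
           wait_event,
           state,
           now() - query_start   AS runtime,
           query
    FROM   pg_stat_activity
    WHERE  cardinality(pg_blocking_pids(pid)) > 0
    ORDER  BY runtime DESC;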

PostgreSQL limits itself to a single core CPU usage (debugging lock issue?)

Update: after some research, it seems this question was incorrect; the 100% was representing all cores, not a single core, making the whole question moot. My sincere apologies to the community.
On PostgreSQL 10, PostGIS 2.5.2, without any data modifications (SELECT queries only), I have 40 identical GIS queries running in parallel (with different params), each taking ~20-500ms. Server has lots of RAM, NVME SSDs.
The CPU usage consistently shows 100% of a single core, implying that all queries are stuck waiting for something that cannot execute in parallel, but I am not sure how to find it.
Examining pg_stat_activity multiple times shows that all queries are active, and their wait_event could be one of these cases:
wait_event is NULL for all
a few ClientRead and lock_manager, NULL for everything else
a lot of lock_manager, and a few ClientRead and NULLs.
Is there a way to figure out what may be causing this?
That is surprising, as reading queries never block on anything short of an ACCESS EXCLUSIVE lock, which is required by operations like DROP TABLE, TRUNCATE, ALTER TABLE and similar statements.
Perhaps the locks are “light-weight locks” on internal PostgreSQL data structures, which are usually held only for a very short time. I don't know what in a PostGIS query could cause high contention on such internal locks, but then you didn't show the statement or its execution plan, nor the exact lock events.
If you have several concurrent queries that each take as long as 500 ms, they should definitely be running in parallel.
Apart from the possibility of some internal lock contention, I can think of two explanations:
Most of the queries are short enough that a single core suffices to process them all; each connection spends most of its time waiting for the client.
The system is I/O bound, so most of the CPUs just twiddle their thumbs. That would show up as a CPU iowait% of 10 or more.
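To narrow that down, a rough sketch that samples what the active backends are waiting on (the wait columns in pg_stat_activity exist from 9.6 on); run it a few times during the slow period:

    -- Snapshot of wait events across active backends. A NULL wait_event for an
    -- active backend usually means it is running on CPU.
    SELECT wait_event_type,
           wait_event,
           count(*) AS backends
    FROM   pg_stat_activity
    WHERE  state = 'active'
    GROUP  BY wait_event_type, wait_event
    ORDER  BY backends DESC;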

Redshift cluster: queries getting hang and filling up space

I have a Redshift cluster with 3 nodes. Every now and then, with users running queries against it, we end up in this unpleasant situation where some queries run for way longer than expected (even simple ones, exceeding 15 minutes), and the cluster storage keeps increasing to the point that, if you don't terminate the long-running queries, it reaches 100% storage occupied.
I wonder why this may happen. My experience is varied: sometimes it has been a single query doing this, and sometimes it has been several concurrent queries running at the same time.
One specific scenario where we saw this happen related to LISTAGG. The type of LISTAGG is varchar(65535), and while Redshift optimizes away the implicit trailing blanks when stored to disk, the full width is required in memory during processing.
If you have a query that returns a million rows, you end up with 1,000,000 rows times 65,535 bytes per LISTAGG, which is about 65 gigabytes. That can quickly get you into a situation like the one you describe, with queries taking unexpectedly long or failing with “Disk Full” errors.
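To illustrate the sizing (the table and column names here are invented), this is the shape of query that hits it: a LISTAGG over a grouped result of about a million rows reserves the full 65,535-byte width per output row while the query runs, which is where the 65 GB figure comes from.

    -- Hypothetical example: each output row carries a VARCHAR(65535)-typed LISTAGG,
    -- so ~1,000,000 rows * 65,535 bytes is roughly 65 GB of working memory.
    SELECT customer_id,
           LISTAGG(item_sku, ',') WITHIN GROUP (ORDER BY item_sku) AS item_list
    FROM   order_items
    GROUP  BY customer_id;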
My team discussed this a bit more on our team blog the other day.
This typically happens when a poorly constructed query spills too much data to disk, for instance when the user accidentally specifies a Cartesian product (every row from tblA joined to every row of tblB).
If this happens regularly, you can implement a QMR rule that limits the amount of disk spill before a query is aborted.
QMR Documentation: https://docs.aws.amazon.com/redshift/latest/dg/cm-c-wlm-query-monitoring-rules.html
QMR Rule Candidates query: https://github.com/awslabs/amazon-redshift-utils/blob/master/src/AdminScripts/wlm_qmr_rule_candidates.sql
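As a starting point for picking a threshold, something like this against the SVL_QUERY_METRICS_SUMMARY system view should surface the worst spillers (a rough sketch; query_temp_blocks_to_disk is reported in 1 MB blocks):

    -- Rough sketch: recent queries ranked by intermediate-result spill to disk.
    SELECT query,
           query_temp_blocks_to_disk AS spill_mb
    FROM   svl_query_metrics_summary
    ORDER  BY query_temp_blocks_to_disk DESC
    LIMIT  20;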

Mongo continues to insert documents, slowly, long after script is quit

Do I have a zombie somewhere?
My script finished inserting a massive amount of new data.
However, the server continues with high lock rates and keeps slowly inserting new records. It's been about an hour since the script that did the inserts finished, and the documents are still trickling in.
Where are these coming from, and how do I purge the queue? (I refactored the code to use an index and want to redo the process to avoid the 100-200% lock rate.)
This could be because of one of the following scenarios:
1. Throughput-bound disk I/O
You can look into the following metrics using mongostat and the MongoDB Management Service:
Average flush time (how long MongoDB's periodic sync to disk is taking)
IOStats in the hardware tab (look specifically at IOWait)
Since disk I/O is slower than CPU processing, all the inserts get queued up, and this can continue for a long time. You can check the server status using db.serverStatus() and look at the "globalLock" field (writes acquire the global lock); its "currentQueue" entry shows the number of writers waiting in the queue.
2. The managed sharded cluster balancer is on (which it is by default)
If you are working in a clustered environment, the balancer kicks in whenever write operations start, in order to keep the cluster in a balanced state, and it can keep moving chunks from one shard to another even after your script has completed. In that case I would suggest keeping the balancer off during the bulk load; all your documents then go to a single shard, but the balancer can be turned back on during any downtime.
3. Write concern
This may also contribute slightly to the problem if it is set to Replica Acknowledged or Acknowledged mode; it is up to you, based on your type of data, to decide on these concerns.