Debug long-running 'RELEASE SAVEPOINT <savepoint_name>' - PostgreSQL

I have a savepoint which has been running for almost 24 hrs now. It's causing other issues, like long-running queries that refresh materialized views concurrently.
Is there a way to know which query is causing the RELEASE SAVEPOINT <savepoint-name> to be idle in transaction? Is it safe to use SELECT pg_cancel_backend(__pid__); against its pid?

If the session is “idle in transaction”, it is not running.
What you see in pg_stat_activity is the last statement executed in that session.
There is a bug in your application that causes a transaction to remain open, and the locks held by that transaction can block concurrent sessions.
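As a starting point, a query along these lines against pg_stat_activity shows sessions that are idle in transaction, how long each has been idle, and the last statement each one ran (this is only a sketch; pg_blocking_pids() needs PostgreSQL 9.6 or later):

SELECT pid,
       state,
       xact_start,                          -- when the open transaction began
       now() - state_change AS idle_for,    -- how long the session has been idle
       query                                -- last statement executed, e.g. RELEASE SAVEPOINT ...
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;

-- And to see which sessions are currently blocked and by whom:
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;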

Related

Short INSERT query hangs on DataFileRead

I have a short insert query with just a few joins. It takes around 1 second to run.
Recently I have been facing an issue where the query hangs and never finishes.
Querying pg_stat_activity I see that it is in an active state. When I re-run my query against pg_stat_activity, I see the backend's wait_event and wait_event_type alternate between both being NULL and being DataFileRead and IO, respectively, and back again.
I wanted to see how long the query would hang; the longest I waited was an hour and 30 minutes before I terminated it so I could continue working.
Also, it doesn't always hang. It will sometimes finish successfully (and fast) and sometimes hang. No other user is querying this table, and the DB is used only by me and one other person, so there are no locks or heavy load (I checked multiple times).
Any ideas on how to better investigate what is blocking this query from finishing?
According to that wait_event, the cause is a slow disk. You'll have to figure out if the problem is in the hardware or configuration or whether it is just overloaded by concurrent activity.
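To watch what the backend is waiting on while the statement runs, you can poll pg_stat_activity for its pid (a sketch; 12345 is a placeholder pid, and in psql you can append \watch 1 to re-run the query every second):

SELECT pid,
       state,
       wait_event_type,                 -- e.g. IO
       wait_event,                      -- e.g. DataFileRead
       now() - query_start AS runtime,
       query
FROM pg_stat_activity
WHERE pid = 12345;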

How to avoid long delay before finally getting "40001 could not serialize access due to concurrent update"

We have a Postgres 12 system running one master and two async hot-standby replica servers, and we use SERIALIZABLE transactions. All the database servers have very fast SSD storage for Postgres and 64 GB of RAM. Clients connect directly to the master server if they cannot accept delayed data for a transaction. Read-only clients that accept data up to 5 seconds old use the replica servers for querying data. Read-only clients use REPEATABLE READ transactions.
I'm aware that because we use SERIALIZABLE transactions Postgres might give us false positive matches and force us to repeat transactions. This is fine and expected.
However, the problem I'm seeing is that randomly a single line INSERT or UPDATE query stalls for a very long time. As an example, one error case was as follows (speaking directly to master to allow modifying table data):
A simple single row insert
insert into restservices (id, parent_id, ...) values ('...', '...', ...);
stalled for 74.62 seconds before finally emitting error
ERROR 40001 could not serialize access due to concurrent update
with error context
SQL statement "SELECT 1 FROM ONLY "public"."restservices" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
We log all queries exceeding 40 ms so I know this kind of stall is rare. Like maybe a couple of queries a day. We average around 200-400 transactions per second during normal load with 5-40 queries per transaction.
After finally getting the above error, the client code automatically released two savepoints, rolled back the transaction and disconnected from the database (this cleanup took 2 ms total). It then reconnected to the database 2 ms later and replayed the whole transaction from the start, finishing in 66 ms including the time to connect to the database. So I think this is not about the performance of the client or the master server as a whole. The expected transaction time is between 5-90 ms depending on the transaction.
Is there some PostgreSQL connection or master configuration setting that I can use to make PostgreSQL return the error 40001 faster, even if it causes more transactions to be rolled back? Does anybody know if setting
set local statement_timeout='250'
within the transaction has dangerous side-effects? According to the documentation https://www.postgresql.org/docs/12/runtime-config-client.html, "Setting statement_timeout in postgresql.conf is not recommended because it would affect all sessions", but I could set the timeout only for transactions run by this client, which is able to automatically retry the transaction very quickly.
Is there anything else to try?
It looks like someone had the parent row to the one you were trying to insert locked. PostgreSQL doesn't know what to do about that until the lock is released, so it blocks. If you failed rather than blocking, and upon failure retried the exact same thing, the same parent row would (most likely) still be locked and so would just fail again, and you would busy-wait. Busy-waiting is not good, so blocking rather than failing is generally a good thing here. It blocks and then unblocks only to fail, but once it does fail a retry should succeed.
An obvious exception to blocking-better-than-failing is when, on retry, you can pick a different parent row to retry with, if that makes sense in your context. In this case, maybe the best thing to do is explicitly lock the parent row with NOWAIT before attempting the insert, as sketched below. That way you can perhaps deal with failures in a more nuanced way.
If you must retry with the same parent_id, then I think the only real solution is to figure out who is holding the parent row lock for so long, and fix that. I don't think that setting statement_timeout would be hazardous, but it also wouldn't solve your problem, as you would probably just keep retrying until the lock on the offending row is released. (Setting it on the other session, the one holding the lock, might be helpful, depending on what that session is doing while the lock is held.)
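A minimal sketch of the NOWAIT idea, reusing the restservices table from the question (the literal values are placeholders, and the exact lock mode you need may differ):

BEGIN;
-- Try to take the same kind of lock the foreign-key check needs,
-- but fail immediately (SQLSTATE 55P03, lock_not_available) instead of blocking.
SELECT 1 FROM restservices WHERE id = 'the-parent-id' FOR KEY SHARE NOWAIT;
-- If the lock was granted, the insert's FK check on the parent row should not block.
INSERT INTO restservices (id, parent_id) VALUES ('new-id', 'the-parent-id');
COMMIT;

On a 55P03 error the client can roll back and back off, retry later, or pick a different parent row, instead of sitting blocked inside the insert.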

When are locks released in Postgres

I have some problems understanding locks. Naturally, locks are released when everything goes smoothly. But I'm unsure about the exact logic for when locks are released when things break down. How long can a lock persist? Can I kill all processes and thereby release all locks? Do I have to explicitly call ROLLBACK?
In general, locks are released when the transaction ends with COMMIT or ROLLBACK.
There are exceptions:
1. Once acquired, a lock is normally held till end of transaction. But if a lock is acquired after establishing a savepoint, the lock is released immediately if the savepoint is rolled back to. This is consistent with the principle that ROLLBACK cancels all effects of the commands since the savepoint. The same holds for locks acquired within a PL/pgSQL exception block: an error escape from the block releases locks acquired within it.
2. There are two ways to acquire an advisory lock in PostgreSQL: at session level or at transaction level. Once acquired at session level, an advisory lock is held until explicitly released or the session ends.
Killing backend processes should release the locks, but it is not the right way to release them: it should only be used as a last resort if you cannot end the client application in a clean way.
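A small illustration of the first exception and of session-level advisory locks (the table name and advisory lock key are made up):

BEGIN;
SAVEPOINT before_lock;
-- Row lock acquired after the savepoint
SELECT * FROM accounts WHERE id = 1 FOR UPDATE;
-- Rolling back to the savepoint releases that row lock immediately,
-- even though the surrounding transaction is still open.
ROLLBACK TO SAVEPOINT before_lock;
COMMIT;

-- Session-level advisory locks, by contrast, survive transaction end
-- and are held until explicitly released or the session ends:
SELECT pg_advisory_lock(42);
SELECT pg_advisory_unlock(42);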

select pg_database_size('name') hangs and can't be killed

Our PostgreSQL 10.1 server ran out of connections today because a monitor process that was calling
select pg_database_size('databasename');
was getting stuck. It was NOT waiting on an obvious lock. It just never returned. The monitor dutifully logged in every few minutes, over and over, until we ran out of connections. When I run the query for other databases it works, but not for our main database.
Killing the calling process did not clear the query.
select pg_cancel_backend(1234)
doesn't kill the query. Nor does
select pg_terminate_backend(1234)
Ditto if I run the query by hand, nothing kills it in the database.
I will probably have to restart the database server to recover from this. However I'd like to prevent it from happening again.
What is this function doing that would resist signals and never return (like 8 hours after being invoked)? Is there any way to clear these backends from the process table without restarting the database and breaking the users who still have the few remaining active connections in the system?

How can I check that a PostgreSQL function is still running over a long period of time

A program I developed uses PostgreSQL. The program runs a PL/pgSQL function that takes a very long time (hours or days). I want to be sure that the function is still running during that long time.
How can I know that? I don't want to use "raise notice" in a loop in the function because that would extend the running time.
You can see if it's running by examining pg_stat_activity for the process. However, this won't tell you if the function is progressing.
You can check whether that backend is blocked on any locks by joining pg_stat_activity against pg_locks to see if it is waiting on any ungranted (granted = false) locks. Again, this won't tell you if it's progressing, just that if it isn't, it's not stuck on a lock.
If you want to monitor a function's progress you will need to emit log messages or use one of the other hacks for monitoring progress. You can (ab)use NOTIFY with a payload and LISTEN for progress messages. Alternatively, you could create a sequence that you call nextval on each time you process an item in your procedure; you can then SELECT * FROM the_sequence_name; in another transaction to see the approximate progress, as sketched below.
In general I'd recommend setting client_min_messages to notice or above then RAISE LOG so you record messages that appear only in the logs, without being sent to the client. To reduce overhead, keep a counter and log every 100 or 1000 or whatever iterations of your loop so you only log occasionally. There's a cost to updating the counter, for sure, but it's pretty low compared to the cost of a big, slow PL/PgSQL procedure like this.
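A minimal sketch of the sequence trick combined with occasional RAISE LOG calls (all object names here are made up; sequence updates are non-transactional, so other sessions see them right away):

CREATE SEQUENCE job_progress;

CREATE OR REPLACE FUNCTION long_job() RETURNS void
LANGUAGE plpgsql AS $$
BEGIN
    FOR i IN 1..1000000 LOOP
        -- ... the real per-item work would go here ...
        PERFORM nextval('job_progress');          -- visible from other sessions immediately
        IF i % 1000 = 0 THEN
            RAISE LOG 'long_job has processed % items', i;
        END IF;
    END LOOP;
END;
$$;

-- From another session, check approximate progress without blocking:
SELECT last_value FROM job_progress;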