Debug Postgres 'too many notifications in the NOTIFY queue'

I am using a Postgres table which gets 2000-3000 updates per second.
The updates are performed with queries generated by the update helper of the pg-promise library.
Each update triggers a notification via the pg_notify() function, and some Node.js scripts handle these notifications. For some reason, 'too many notifications in the NOTIFY queue' messages keep appearing in the Postgres logs, along with the notify queue usage, which keeps climbing towards 100%.
I read some posts like: https://postgrespro.com/list/thread-id/1557124
or https://github.com/hasura/graphql-engine/issues/6263
but I cannot find a way to debug this issue.
What would be a good way to approach this situation?

Your listener doesn't seem to be consuming the notifications fast enough, or possibly not at all. So the first step would be to log the processing of each notification in your app code, to figure out what is actually going on.
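Your handlers are in Node.js, but the shape of the fix is the same in any client. Here is a minimal sketch in Python with psycopg2 (the DSN and channel name are made up) that logs every notification as it is consumed, so you can compare the consumption rate against the 2000-3000 updates/second arriving:

import select
import time
import psycopg2

conn = psycopg2.connect("dbname=app")   # hypothetical DSN
conn.autocommit = True                  # don't hold LISTEN inside an open transaction

cur = conn.cursor()
cur.execute("LISTEN table_updates;")    # hypothetical channel name

while True:
    # Block until the server socket becomes readable, or 5 seconds pass.
    if select.select([conn], [], [], 5) == ([], [], []):
        continue
    conn.poll()                         # read pending notifications into conn.notifies
    while conn.notifies:
        n = conn.notifies.pop(0)
        # Log arrival time, sending backend PID, channel, and payload.
        # Keep this loop fast -- a slow handler is exactly what fills the queue.
        print(time.time(), n.pid, n.channel, n.payload)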

This might be because a long-running transaction is blocking the release of older messages from the queue. The process is explained in the manual and is somewhat analogous to vacuuming: old transactions need to finish before old data can be cleaned up.
A gotcha here is that any long-running query can hold up the cleanup; in my case it was the process running the LISTEN itself, which was designed to just keep running forever. The PG server log includes a backend PID that might be the culprit, so you can look it up in pg_stat_activity and proceed from there.
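For reference, a sketch of the kind of diagnostic query meant here, wrapped in Python/psycopg2 (DSN made up); it lists backends with long-open transactions, oldest first:

import psycopg2

conn = psycopg2.connect("dbname=app")  # hypothetical DSN
cur = conn.cursor()
# Backends with an open transaction, oldest first; an old xact_start on a
# listening or idle-in-transaction backend is the typical culprit.
cur.execute("""
    SELECT pid, state, xact_start, now() - xact_start AS xact_age, query
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
    ORDER BY xact_start
    LIMIT 10;
""")
for row in cur.fetchall():
    print(row)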

Related

Identify which service listens for notifications but doesn't consume them

I have a huge database where, in some places, we use Postgres notifications. We noticed that the queue size is increasing. The way we check is by executing this simple command: select pg_notification_queue_usage();.
When it reaches 100%, all messages are gone. The problem is that I don't know what is listening to notifications or which channels we have. I identified only two services that listen for those notifications, but it seems they are not the only ones.
My task is to find the other places where we use notifications (consuming or producing) in order to find the root cause. How can I do that?
The only thing I found is the query select pg_notification_queue_usage();, but it seems that Postgres doesn't provide other useful functions related to this feature.
I did some experiments. I launched a local Postgres instance and started publishing notifications there. Everything worked as expected. When I did it again, but without actually consuming the notifications, the queue size started to grow. That's what I expected, though.
Then, I restarted the process and the queue size dropped to 0. That's exactly what the docs say about it.
A session's listen registrations are automatically cleared when the session ends.
In production we did exactly the same: we restarted the known services, but the notification queue didn't drop to 0 as we expected.
That means something else is listening on one of the channels but either doesn't consume notifications or consumes them too slowly.
Is there any way of identifying such listeners?
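One possible heuristic, since Postgres exposes no cross-session list of LISTEN registrations (pg_listening_channels() only covers the current session): pg_stat_activity.query shows each backend's most recent statement, and a dedicated listener often still shows its LISTEN command there. A sketch (DSN made up):

import psycopg2

conn = psycopg2.connect("dbname=app")  # hypothetical DSN
cur = conn.cursor()
# Sessions whose last statement was a LISTEN are likely listeners;
# check usename / application_name / client_addr to identify the service.
cur.execute("""
    SELECT pid, usename, application_name, client_addr, backend_start, query
    FROM pg_stat_activity
    WHERE query ILIKE 'listen%'
""")
for row in cur.fetchall():
    print(row)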

POSTGRES: pg_cancel_backend does not always work (reason behind it)

I'm currently using Postgres as my database engine, hooked up to a web application.
I have noticed on some occasions that locks accumulate in the database, mainly AccessShareLocks (visible when running the query select * from pg_locks).
One thing I have noticed is that you can use pg_cancel_backend(pid) to cancel a process holding a lock, but sometimes this doesn't work, and I'm curious to know why. Is it that this function sends a SIGINT to the backend to stop it gracefully, meaning that it won't stop immediately?
There is pg_terminate_backend, but I'd prefer not to use it.
Any advice on why pg_cancel_backend only works intermittently (or at least some explanation) would be appreciated.
Thanks.
pg_cancel_backend and pg_terminate_backend send signals to the process.
The backend checks for pending interrupts every so often, but execution can be at a point where it takes a while before that check happens.
Canceling a query won't get rid of the locks until the transaction is closed.
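A sketch of how you might observe this from Python (PID and DSN are placeholders): pg_cancel_backend() returning true only means the SIGINT was sent, and any locks the backend's open transaction holds remain until that transaction ends:

import psycopg2

conn = psycopg2.connect("dbname=app")  # hypothetical DSN
conn.autocommit = True
cur = conn.cursor()

pid = 12345  # backend PID found via pg_locks / pg_stat_activity
cur.execute("SELECT pg_cancel_backend(%s);", (pid,))
print("signal sent:", cur.fetchone()[0])  # true = signal delivered, not "query gone"

# The backend may not notice the interrupt immediately, and its open
# transaction keeps its locks until it commits or rolls back:
cur.execute("SELECT locktype, mode, granted FROM pg_locks WHERE pid = %s;", (pid,))
for row in cur.fetchall():
    print(row)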

Is there a way to rely on Postgres Notify/Listen mechanism?

I have implemented a Notify/Listen mechanism, so that when a special request is sent to the web server, I can use NOTIFY to tell the workers (in Python) that there's a pending request waiting to be processed.
The implementation works fine, but the problem is that if the worker server is restarting, the notification gets lost, since at that particular moment there's no listener.
I could set up a service like RabbitMQ or similar, but my needs are so simple that deploying such a monster would be overkill.
Is there any way, a configuration variable perhaps, to give some persistence to the notification mechanism?
Thanks in advance
I don't think there is a way to persist notification channels, but you can simply store the pending requests in a table and have the worker check for any missed work on startup.
Either a timestamp or a pending/completed flag would work, depending on what kind of work it's doing.
For consistency, you can have the NOTIFY fired by an INSERT trigger on the queue table, and have the worker always check for any remaining work (not just the specific request) when notified.
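A minimal sketch of that pattern with psycopg2 (all table, channel, and column names are made up): an INSERT trigger fires the NOTIFY, and the worker drains every pending row both at startup and on each notification, so nothing that arrived while it was down is lost:

import select
import psycopg2

SETUP_SQL = """
CREATE TABLE IF NOT EXISTS pending_requests (
    id      bigserial PRIMARY KEY,
    payload text NOT NULL,
    done    boolean NOT NULL DEFAULT false,
    created timestamptz NOT NULL DEFAULT now()
);
CREATE OR REPLACE FUNCTION notify_pending() RETURNS trigger AS $$
BEGIN
    PERFORM pg_notify('pending_requests', NEW.id::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
DROP TRIGGER IF EXISTS pending_requests_notify ON pending_requests;
CREATE TRIGGER pending_requests_notify
    AFTER INSERT ON pending_requests
    FOR EACH ROW EXECUTE PROCEDURE notify_pending();
"""

def drain(cur):
    # Process *all* unfinished work, not just the row named in the payload,
    # so anything missed while the worker was down is picked up too.
    cur.execute("SELECT id, payload FROM pending_requests WHERE NOT done ORDER BY id;")
    for req_id, payload in cur.fetchall():
        print("processing", req_id, payload)  # real work goes here
        cur.execute("UPDATE pending_requests SET done = true WHERE id = %s;", (req_id,))

conn = psycopg2.connect("dbname=app")  # hypothetical DSN
conn.autocommit = True
cur = conn.cursor()
cur.execute(SETUP_SQL)
cur.execute("LISTEN pending_requests;")

drain(cur)  # catch up on anything that arrived while we were offline
while True:
    if select.select([conn], [], [], 10) != ([], [], []):
        conn.poll()
        conn.notifies.clear()  # the payload doesn't matter; just re-scan the table
        drain(cur)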

PostgreSQL backend behavior upon receiving "Terminate" ('X') after "COMMIT"

We run a Postgres server v9.2.8 and use epgsql (Erlang) as a client library. In some cases, which occurred in production but which we weren't able to reproduce in a dev environment, we're losing data.
A function in our application (it should be killed) allows an operator to change session parameters on a running connection. Since the connection is usually busy in production, a "SET SESSION bla-bla" query always crashes the pgsql_connection process.
Before crashing, pgsql_connection sends a "Terminate" ('X') message via pgsql_sock (a wrapper around the TCP socket) to the backend. At the same time, another Erlang process (let's call it the "worker") is waiting for a response from the Postgres backend on the same socket.
Now the question: is it possible that, upon receiving a "Terminate" message from a client, the backend cancels the last transaction even though it has already sent an "OK" for the "COMMIT" statement?
Because if that is possible, the worker could report a successfully written transaction to the main application process while the transaction has in fact been cancelled.
Or, where can I read more details about this? Documentation says (http://www.postgresql.org/docs/9.2/static/protocol-flow.html):
For either normal or abnormal termination, any open transaction is
rolled back, not committed. One should note however that if a frontend
disconnects while a non-SELECT query is being processed, the backend
will probably finish the query before noticing the disconnection. If
the query is outside any transaction block (BEGIN ... COMMIT sequence)
then its results might be committed before the disconnection is
recognized.
– not a crystal clear statement.
Now the question: is it possible that upon receiving a "Terminate" message from a client, the backend can cancel the last transaction even if it has already sent an "OK" for the "COMMIT" statement?
No, that is fundamentally impossible. If it's committed, it's committed, and there's no going back. That's what "commit" means.
The only time Pg might return success before the commit hits disk and is persistent is if you told it to by setting synchronous_commit = off.
If you're seeing anything different happening then most likely it's a result of attempting to share a single connection between multiple processes (as you establish the connection before fork()) without proper locking or other mutual exclusion to ensure that the connection is locked while a command is in-flight.
Note that the reverse isn't true, which might be what you're thinking of with the quoted documentation passage. A transaction can get committed without returning a successful OK to the client if the client goes away (crashes, loses connection, etc) after issuing the commit command.
What the application is doing, where it sends out-of-sync messages on the wire protocol, is totally broken. It's guaranteed to cause unpredictable problems. The protocol is somewhat robust, so you're not likely to get things like an unintended commit, but you're very likely to get transactions aborted or whole sessions disconnected suddenly.
If you need to be able to roll back/abort committed transactions, then your application design has problems. You're not really ready to commit when you say COMMIT. You would have the same problem if the app process crashed or the whole server crashed between Pg committing the transaction and you doing whatever you need to do.
If you cannot fix the app design to avoid this then you will have to use two-phase transactions, either directly using PREPARE TRANSACTION then COMMIT PREPARED, or indirectly via the XA API. This has significant costs in performance and management overhead, but it's the only option if you need to do special work after database commit but before you're really "done".
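For illustration, the raw SQL shape of that two-phase sequence, driven from Python (requires max_prepared_transactions > 0 on the server; the gid, table, and work are made up):

import psycopg2

conn = psycopg2.connect("dbname=app")  # hypothetical DSN
conn.autocommit = True                 # we issue BEGIN/PREPARE/COMMIT ourselves
cur = conn.cursor()

cur.execute("BEGIN;")
cur.execute("INSERT INTO outbox(message) VALUES ('hello');")  # hypothetical work
cur.execute("PREPARE TRANSACTION 'job-42';")  # phase I: durable, survives crashes,
                                              # but not yet visible to other sessions

# ... do the post-commit work here (move the file, send the email, ...) ...

cur.execute("COMMIT PREPARED 'job-42';")      # phase II: make it visible
# If the external work fails instead: cur.execute("ROLLBACK PREPARED 'job-42';")
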
The docs you quote are talking about the case where the app has sent a COMMIT but then disconnects before receiving the backend's acknowledgement of the commit. Because TCP/IP is buffered there's no guarantee the COMMIT got flushed to Pg, and if it did there's no guarantee it doesn't accompany the RST that terminates the connection. So in this specific case it's somewhat uncertain whether the transaction will commit or not. An application for which this is a problem would need to have a way of checking whether the last unit of work committed or not when it resumes work, or if it can't do that use two-phase transactions. The docs you quote say nothing about being able to cancel a commit after it's completed, because you can't. Ever.
Assuming that the app has to do some kind of extra work after commit, like moving a file or sending an email or doing work on another data store, then you're probably going to need two-phase transactions. Even then you're vulnerable to issues unless all parties in the distributed transaction support two phase commit, because your "other bit" could get done then your worker or server could crash before the confirmation of its completion is sent to the database to finish phase II of the commit.
You can keep your own two phase commit log of sorts in the DB instead of using true 2PC:
Do the main database work and write a record to the work log table that says "I've done the work in the database and I'm about to do the next part".
Do the next part; and
Update the work log to say the next part is done.
... but this has the same problem, where a crash between parts 2 and 3 causes the app to forget that it did part 2 and repeat it on startup. If you can't live with that, you need to find a way to make part 2 commit completion verifiable, so you can tell if it's done or not, or find a way to make it capable of doing 2-phase commit.
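The same work-log pattern, sketched with made-up names, makes that failure window concrete; a crash between steps 2 and 3 leaves the state at 'db_done' and the external work is repeated on restart:

import psycopg2

def do_external_work():
    pass  # hypothetical: move a file, send an email, ...

conn = psycopg2.connect("dbname=app")  # hypothetical DSN
cur = conn.cursor()

# Step 1: the main database work plus the intent record, in one transaction.
cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1;")
cur.execute("INSERT INTO work_log(job, state) VALUES ('job-42', 'db_done');")
conn.commit()

# Step 2: the external side effect. A crash here means it repeats on restart.
do_external_work()

# Step 3: record completion.
cur.execute("UPDATE work_log SET state = 'all_done' WHERE job = 'job-42';")
conn.commit()
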
To learn more about this topic, read about XA, distributed transactions, two-phase commit, etc.

CQRS/EventStore: How are failures to deliver events handled?

I'm getting into CQRS, and I understand that you have commands (app layer) and events (from the domain).
In the simple case where events update the read model, can read model updates fail? If there is no "bug" then I cannot see them failing, and since I am using EventStore, I know there is a commit flag that will retry failures.
So my question is: do I have to do anything in addition to EventStore to handle failures?
Coming from a world where you do everything in one transaction, doing things separately worries me.
Of course there may be cases where a published event will fail in the read models.
You have to make sure you can detect that and solve it.
The nice thing is that you can replay all the events again and again, so you not only have the chance to fix the error; you can also test the fix by replaying every single event if you want.
I use NServiceBus as my publishing mechanism which allows me to use an error queue. Using my other logging tools together with the error queue I can easily determine what happened since I have the error log and the actual message that caused the error in the first place.