We run a Postgres server v9.2.8 and use epgsql (Erlang) as the client library. In some cases, which occurred on production but which we weren't able to reproduce in the dev environment, we're losing data.
A function in our application (it really should be removed) allows an operator to change session parameters on a running connection. Since the connection is almost always busy on production, a "SET SESSION bla-bla" query always crashes the pgsql_connection process.
Before crashing, pgsql_connection sends a "Terminate" ('X') message via pgsql_sock (a wrapper around the TCP socket) to the backend. At the same time, another Erlang process (let's call it the "worker") is waiting for a response from the Postgres backend on the same socket.
Now the question: is it possible that, upon receiving a "Terminate" message from a client, the backend can cancel the last transaction even if it has already sent an "OK" for the "COMMIT" statement?
If that is possible, the worker has a chance to report a successfully written transaction to the main application process while the transaction has in fact been rolled back.
And if not, where can I read more details about this? The documentation (http://www.postgresql.org/docs/9.2/static/protocol-flow.html) says:
For either normal or abnormal termination, any open transaction is rolled back, not committed. One should note however that if a frontend disconnects while a non-SELECT query is being processed, the backend will probably finish the query before noticing the disconnection. If the query is outside any transaction block (BEGIN ... COMMIT sequence) then its results might be committed before the disconnection is recognized.
– not a crystal clear statement.
Now the question: is it possible that, upon receiving a "Terminate" message from a client, the backend can cancel the last transaction even if it has already sent an "OK" for the "COMMIT" statement?
No, that is fundamentally impossible. If it's committed, it's committed, and there's no going back. That's what "commit" means.
The only time Pg might return success before the commit hits disk and is persistent is if you told it to by setting synchronous_commit = off.
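As a concrete illustration (psycopg2 and the events table here are placeholders, not part of the original setup), that opt-in window looks roughly like this:

```python
# Minimal sketch: with synchronous_commit = off, COMMIT is acknowledged before
# the WAL record is flushed, so a *server* crash right afterwards can lose the
# transaction even though the client saw success.
import psycopg2

conn = psycopg2.connect("dbname=app")             # placeholder connection string
with conn, conn.cursor() as cur:
    cur.execute("SET synchronous_commit TO off")  # session-level; the default is on
    cur.execute("INSERT INTO events (payload) VALUES (%s)", ("hello",))
# the context manager commits here; the acknowledgement may arrive before the
# commit record reaches disk
```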
If you're seeing anything different happening, it's most likely the result of attempting to share a single connection between multiple processes (as you establish the connection before fork()) without proper locking or other mutual exclusion to ensure that nobody else touches the connection while a command is in flight.
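For illustration only (the question is about Erlang/epgsql, but the idea is language-independent), a shared connection has to be serialized around the whole request/response cycle, roughly like this hedged Python/psycopg2 sketch:

```python
# Illustrative sketch only, not the OP's code: every thread that shares the
# connection must hold a lock for the whole command/response cycle so protocol
# messages (including a stray SET or Terminate) cannot interleave.
import threading
import psycopg2

conn = psycopg2.connect("dbname=app")    # placeholder connection string
conn_lock = threading.Lock()

def run_query(sql, params=None):
    with conn_lock:                      # nobody else may touch the socket now
        with conn.cursor() as cur:
            cur.execute(sql, params)
            rows = cur.fetchall() if cur.description else None
        conn.commit()
        return rows
```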
Note that the reverse isn't true, which might be what you're thinking of with the quoted documentation passage. A transaction can get committed without returning a successful OK to the client if the client goes away (crashes, loses connection, etc) after issuing the commit command.
What the application is doing, sending out-of-sync messages on the wire protocol, is totally broken. It's guaranteed to cause unpredictable problems. The protocol is somewhat robust, so you're not likely to get something like an unintended commit, but you're very likely to get transactions aborted or whole sessions disconnected suddenly.
If you need to be able to roll back/abort committed transactions, then your application design has problems. You're not really ready to commit when you say COMMIT. You would have the same problem if the app process crashed or the whole server crashed between Pg committing the transaction and you doing whatever you need to do.
If you cannot fix the app design to avoid this then you will have to use two-phase transactions, either directly using PREPARE TRANSACTION then COMMIT PREPARED, or indirectly via the XA API. This has significant costs in performance and management overhead, but it's the only option if you need to do special work after database commit but before you're really "done".
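As a rough sketch of the direct PREPARE TRANSACTION / COMMIT PREPARED route (shown here with psycopg2's two-phase helpers purely for illustration; it requires max_prepared_transactions > 0 on the server):

```python
# Hedged sketch of two-phase commit; identifiers and table are illustrative.
import psycopg2

conn = psycopg2.connect("dbname=app")
xid = conn.xid(42, "order-12345", "app-worker-1")   # arbitrary example ids

conn.tpc_begin(xid)
with conn.cursor() as cur:
    cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = %s", (1,))
conn.tpc_prepare()      # issues PREPARE TRANSACTION; survives client crashes

# ... do the external, non-database work here ...

conn.tpc_commit()       # issues COMMIT PREPARED (or tpc_rollback() to abort)
```

After a crash, a new connection can list the still-prepared transactions with tpc_recover() and finish each one with tpc_commit(xid) or tpc_rollback(xid).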
The docs you quote are talking about the case where the app has sent a COMMIT but then disconnects before receiving the backend's acknowledgement of the commit. Because TCP/IP is buffered there's no guarantee the COMMIT got flushed to Pg, and if it did there's no guarantee it doesn't accompany the RST that terminates the connection. So in this specific case it's somewhat uncertain whether the transaction will commit or not. An application for which this is a problem would need to have a way of checking whether the last unit of work committed or not when it resumes work, or if it can't do that use two-phase transactions. The docs you quote say nothing about being able to cancel a commit after it's completed, because you can't. Ever.
Assuming that the app has to do some kind of extra work after commit, like moving a file or sending an email or doing work on another data store, then you're probably going to need two-phase transactions. Even then you're vulnerable to issues unless all parties in the distributed transaction support two phase commit, because your "other bit" could get done then your worker or server could crash before the confirmation of its completion is sent to the database to finish phase II of the commit.
You can keep your own two-phase-commit log of sorts in the DB instead of using true 2PC (a rough sketch follows below):
Do the main database work and write a record to the work log table that says "I've done the work in the database and I'm about to do the next part".
Do the next part; and
Update the work log to say the next part is done.
... but this has the same problem: a crash between parts 2 and 3 causes the app to forget that it did part 2 and repeat it on startup. If you can't live with that, you need to find a way to make the completion of part 2 verifiable, so you can tell whether it's done or not, or find a way to make it capable of doing 2-phase commit.
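A hedged sketch of that work-log pattern (all table, column and helper names are made up for illustration):

```python
# Sketch of the "work log" pattern described above.
import psycopg2

conn = psycopg2.connect("dbname=app")

# 1. Do the main database work and record the intent in the same transaction.
with conn, conn.cursor() as cur:
    cur.execute("UPDATE orders SET status = 'paid' WHERE id = %s", (12345,))
    cur.execute(
        "INSERT INTO work_log (order_id, step, state) VALUES (%s, %s, %s)",
        (12345, "send_email", "pending"),
    )

# 2. Do the next part (hypothetical helper standing in for the external work).
send_confirmation_email(12345)

# 3. Record completion.  A crash between 2 and 3 means the email is re-sent on
#    startup -- exactly the caveat noted above.
with conn, conn.cursor() as cur:
    cur.execute(
        "UPDATE work_log SET state = 'done' WHERE order_id = %s AND step = %s",
        (12345, "send_email"),
    )
```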
To learn more about this topic, read about XA, distributed transactions, two-phase commit, etc.
I am using a Postgres table which gets 2000-3000 updates per second.
To update this table I am using queries generated with the update helper of the pg-promise library.
Each update triggers a notification via the pg_notify() function, and some Node.js scripts handle these notifications. For some reason, 'too many notifications in the NOTIFY queue' messages keep appearing in the Postgres logs, together with a report of the notify queue size, which keeps increasing up to 100%.
I read some posts like: https://postgrespro.com/list/thread-id/1557124
or https://github.com/hasura/graphql-engine/issues/6263
but I cannot find a way to debug this issue.
What would be a good way to approach this situation?
Your listener doesn't seem to be consuming the notifications fast enough, or possibly not at all. So the first step would be something like logging the processing of each notification in your app code, to figure out what is actually going on.
This might be because there is a long-running transaction that is blocking the release of older messages from the buffer. The process is explained in the manual and is somewhat analogous to vacuuming: old transactions need to finish in order to clean up old data.
A gotcha here is that any long-running query can hold up the cleanup; for me it was the process that was running the LISTEN, which was designed to just keep running forever. The PG server log includes a backend PID that might be the culprit, so you can look it up in pg_stat_activity and proceed from there.
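A hedged psycopg2 sketch of a listener that avoids both problems: it runs in autocommit, so it never holds a transaction open (which would block cleanup of the notification queue), and it drains notifications as soon as they arrive (channel and handler names are placeholders):

```python
# Sketch of a well-behaved LISTEN loop: autocommit, no open transaction,
# notifications drained promptly.
import select
import psycopg2
import psycopg2.extensions

conn = psycopg2.connect("dbname=app")
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

with conn.cursor() as cur:
    cur.execute("LISTEN work_channel")        # channel name is illustrative

while True:
    # wait up to 5 seconds for socket activity, then poll the connection
    if select.select([conn], [], [], 5) == ([], [], []):
        continue
    conn.poll()
    while conn.notifies:
        note = conn.notifies.pop(0)
        handle(note.payload)                  # hypothetical handler; keep it fast
```

If your server version has it, pg_notification_queue_usage() is also handy for watching how full the queue actually is.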
I am making an application which will need to use NestJS' CQRS module, as the requirements naturally lend themselves to that pattern.
Updates to the application logic are expected to be frequent and to happen during busy hours (that's just how my management works...), so the application needs to be able to restart gracefully. However, this means that events started just before the shutdown may not finish, or even if they do, some sagas may not trigger due to some events having happened before the restart... I'd like to ensure that doesn't happen.
I'm aware of NestJS' OnApplicationShutdown and OnApplicationBootstrap hooks, which exist exactly for this purpose, but what I'm not sure about is what I should do in them. How can I capture all events that have unfinished handlers and sagas? And after a restart, how can I make the event bus aware of the events monitored by sagas, without re-executing the handlers that have already run?
I guess the second part could be worked around with a random ID per event/handler combo that is looked up in a log: if it's present, the handler is skipped; if not, the handler is executed and added to the log... But even with such a workaround, I don't see how I could do the first part. There will be a lot of events, and sagas (by definition) execute commands, meaning they have side effects... Even if all commands can be made idempotent, the sheer quantity of events and the frequent restarts mean that restarting from the very first command is a no-go.
I've seen this package but I'm not sure if it solves this particular use case, or if it's really just logging the events, and pretty much nothing more.
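For what it's worth, the dedup idea above boils down to something like this language-agnostic sketch (written in Python rather than TypeScript; every name is hypothetical):

```python
# Skip a handler if the (event_id, handler_name) pair was already processed.
processed = set()   # in practice a durable table, not an in-memory set

def handle_once(event_id, handler_name, handler, event):
    key = (event_id, handler_name)
    if key in processed:
        return                    # already handled before the restart
    handler(event)                # the handler itself must be safe to retry
    processed.add(key)            # persist only after the handler succeeds
```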
I have implemented a NOTIFY/LISTEN mechanism, so when a special request is sent to the web server, I can use NOTIFY to tell the workers (written in Python) that there's a pending request waiting to be processed.
The implementation works fine, but the problem is that if the worker server is restarting, the notification gets lost, since at that particular time there's no listener.
I could implement a service like RabbitMQ or similar, but my needs are so simple that setting up such a monster is too much.
Is there any way, a configuration variable perhaps, that can give some persistence to the notification mechanism?
Thanks in advance
I don't think there is a way to persist notification channels, but you can simply store the pending requests in a table and have the worker check for any missed work on startup.
Either a timestamp or a pending/completed flag would work, depending on what kind of work it's doing.
For consistency, you can have the NOTIFY fire from an INSERT trigger on the queue table, and have the worker always check for any remaining work (not just a specific request) when notified.
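A hedged sketch of that worker side (table, column and channel names are illustrative; process() is a hypothetical handler):

```python
# The worker re-scans the queue table instead of trusting notification payloads,
# so requests that arrived while it was down are not lost.
import psycopg2

conn = psycopg2.connect("dbname=app")
conn.autocommit = True

def drain_pending(cur):
    cur.execute("SELECT id, payload FROM requests WHERE done = false ORDER BY id")
    for req_id, payload in cur.fetchall():
        process(payload)                                   # hypothetical handler
        cur.execute("UPDATE requests SET done = true WHERE id = %s", (req_id,))

with conn.cursor() as cur:
    cur.execute("LISTEN new_request")
    drain_pending(cur)   # catch up on anything missed while the worker was down
    # ...then block on the socket as usual and call drain_pending() on each wake-up
```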
I'm getting into CQRS, and I understand that you have commands (application layer) and events (from the domain).
In the simple case where events are used to update the read model, can read-model updates fail? If there is no bug then I cannot see them failing, and as I am using EventStore, I know there is a commit flag which will retry failures.
So my question is do I have to do anything in addition to EventStore to handle failures?
Coming from a world where you do everything in one transaction, the fact that these things are now done separately is worrying me.
Of course there may be cases where a published event will fail in the read models.
You have to make sure you can detect that and solve it.
The nice thing is that you can replay all the events again and again, so you have the chance not only to fix the error but also to test the fix by replaying every single event if you want.
I use NServiceBus as my publishing mechanism which allows me to use an error queue. Using my other logging tools together with the error queue I can easily determine what happened since I have the error log and the actual message that caused the error in the first place.
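The replay part boils down to something like this store-agnostic sketch (not EventStore's or NServiceBus's actual API; all names are illustrative):

```python
# Rebuilding a read model is just re-applying every event, in order, to an
# empty model with deterministic handlers.
def rebuild_read_model(events, handlers):
    read_model = {}
    for event in events:                      # events in their original order
        handler = handlers.get(type(event).__name__)
        if handler:
            handler(read_model, event)
    return read_model
```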
I am using C# and .NET Framework 1.1 (yes, it's old, but I inherited this stuff and can't upgrade). I place messages on a transactional queue, but they do not get onto the queue about 50% of the time. We're running in a workgroup on Windows XP Professional with all service packs installed. I don't see any messages in the dead-letter queue either.
Any ideas where to look?
If it isn't hitting the queue at all and isn't going to the dead-letter queue, it suggests the item isn't being sent to the queue. You should be able to confirm that this is the case by switching on the journal for the queue.
Assuming it isn't hitting the queue, it is probably a transaction issue. I would check that you are definitely committing the message to the queue every time. Make sure there aren't any exceptions being thrown and swallowed that cause the transaction to roll back or never be committed (essentially the same thing). Also make sure there aren't any conditional statements that mean the commit gets skipped.
I would add some logging around every location where a transaction is started, committed or rolled back, and also around any location where you create a message. You can then review your log to see the order of events and see what's going astray.
Another option would be to remove all of the transaction code and test the code against a non-transactional queue. If the messages all appear then it is a transactional problem. If not, the issue is elsewhere.
I use MSMQ a lot and the one thing I have learned through experience is that it works really well and the weak point is me :-)