We have a Postgres 12 system running one master master and two async hot-standby replica servers and we use SERIALIZABLE transactions. All the database servers have very fast SSD storage for Postgres and 64 GB of RAM. Clients connect directly to master server if they cannot accept delayed data for a transaction. Read-only clients that accept data up to 5 seconds old use the replica servers for querying data. Read-only clients use REPEATABLE READ transactions.
I'm aware that because we use SERIALIZABLE transactions Postgres might give us false positive matches and force us to repeat transactions. This is fine and expected.
However, the problem I'm seeing is that randomly a single line INSERT or UPDATE query stalls for a very long time. As an example, one error case was as follows (speaking directly to master to allow modifying table data):
A simple single row insert
insert into restservices (id, parent_id, ...) values ('...', '...', ...);
stalled for 74.62 seconds before finally emitting error
ERROR 40001 could not serialize access due to concurrent update
with error context
SQL statement "SELECT 1 FROM ONLY "public"."restservices" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
We log all queries exceeding 40 ms so I know this kind of stall is rare. Like maybe a couple of queries a day. We average around 200-400 transactions per second during normal load with 5-40 queries per transaction.
After finally getting the above error, the client code automatically released two savepoints, rolled back the transaction and disconnected from database (this cleanup took 2 ms total). It then reconnected to database 2 ms later and replayed the whole transaction from the start and finished in 66 ms including the time to connect to the database. So I think this is not about performance of the client or the master server as a whole. The expected transaction time is between 5-90 ms depending on transaction.
Is there some PostgreSQL connection or master configuration setting that I can use to make PostgreSQL to return the error 40001 faster even if it caused more transactions to be rolled back? Does anybody know if setting
set local statement_timeout='250'
within the transaction has dangerous side-effects? According to the documentation https://www.postgresql.org/docs/12/runtime-config-client.html "Setting statement_timeout in postgresql.conf is not recommended because it would affect all sessions" but I could set the timeout only for transactions by this client that's able to automatically retry the transaction very fast.
Is there anything else to try?
It looks like someone had the parent row to the one you were trying to insert locked. PostgreSQL doesn't know what to do about that until the lock is released, so it blocks. If you failed rather than blocking, and upon failure retried the exact same thing, the same parent row would (most likely) still be locked and so would just fail again, and you would busy-wait. Busy-waiting is not good, so blocking rather than failing is generally a good thing here. It blocks and then unblocks only to fail, but once it does fail a retry should succeed.
An obvious exception to blocking-better-than-failing being if when you retry, you can pick a different parent row to retry with, if that make sense in your context. In this case, maybe the best thing to do is explicitly lock the parent row with NOWAIT before attempting the insert. That way you can perhaps deal with failures in a more nuanced way.
If you must retry with the same parent_id, then I think the only real solution is to figure out who is holding the parent row lock for so long, and fix that. I don't think that setting statement_timeout would be hazardous, but it also wouldn't solve your problem, as you would probably just keep retrying until the lock on the offending row is released. (Setting it on the other session, the one holding the lock, might be helpful, depending on what that session is doing while the lock is held.)
I have a publisher and a subscriber. Every so often I get:
ERROR: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally before or while processing the request.
I can guess why it terminates abnormally, one of the computers turn off. However, when the two computers are connected again, it doesn't restart automatically.
The only thing that works is to truncate all the tables in the subscription, delete the subscription and publication and create the subscription and publication again.
I tried looking at the WAL, they're very nice. Not sure what to do otherwise.
Here are some pictures:
It should not be necessary to re-initialize logical replication just because there was a connection problem. The logical replication slot on the primary will make sure that all required information is retained on the server so that replication can be resumed later on.
Reading you primary log, it looks like you are just hitting a timeout because there is nothing to replicate. That shouldn't be a problem, but you can set wal_sender_timeout = 0 on the primary to disable the timeout.
We have a micro-services architecture, with Kafka used as the communication mechanism between the services. Some of the services have their own databases. Say the user makes a call to Service A, which should result in a record (or set of records) being created in that service’s database. Additionally, this event should be reported to other services, as an item on a Kafka topic. What is the best way of ensuring that the database record(s) are only written if the Kafka topic is successfully updated (essentially creating a distributed transaction around the database update and the Kafka update)?
We are thinking of using spring-kafka (in a Spring Boot WebFlux service), and I can see that it has a KafkaTransactionManager, but from what I understand this is more about Kafka transactions themselves (ensuring consistency across the Kafka producers and consumers), rather than synchronising transactions across two systems (see here: “Kafka doesn't support XA and you have to deal with the possibility that the DB tx might commit while the Kafka tx rolls back.”). Additionally, I think this class relies on Spring’s transaction framework which, at least as far as I currently understand, is thread-bound, and won’t work if using a reactive approach (e.g. WebFlux) where different parts of an operation may execute on different threads. (We are using reactive-pg-client, so are manually handling transactions, rather than using Spring’s framework.)
Some options I can think of:
Don’t write the data to the database: only write it to Kafka. Then use a consumer (in Service A) to update the database. This seems like it might not be the most efficient, and will have problems in that the service which the user called cannot immediately see the database changes it should have just created.
Don’t write directly to Kafka: write to the database only, and use something like Debezium to report the change to Kafka. The problem here is that the changes are based on individual database records, whereas the business significant event to store in Kafka might involve a combination of data from multiple tables.
Write to the database first (if that fails, do nothing and just throw the exception). Then, when writing to Kafka, assume that the write might fail. Use the built-in auto-retry functionality to get it to keep trying for a while. If that eventually completely fails, try to write to a dead letter queue and create some sort of manual mechanism for admins to sort it out. And if writing to the DLQ fails (i.e. Kafka is completely down), just log it some other way (e.g. to the database), and again create some sort of manual mechanism for admins to sort it out.
Anyone got any thoughts or advice on the above, or able to correct any mistakes in my assumptions above?
Thanks in advance!
I'd suggest to use a slightly altered variant of approach 2.
Write into your database only, but in addition to the actual table writes, also write "events" into a special table within that same database; these event records would contain the aggregations you need. In the easiest way, you'd simply insert another entity e.g. mapped by JPA, which contains a JSON property with the aggregate payload. Of course this could be automated by some means of transaction listener / framework component.
Then use Debezium to capture the changes just from that table and stream them into Kafka. That way you have both: eventually consistent state in Kafka (the events in Kafka may trail behind or you might see a few events a second time after a restart, but eventually they'll reflect the database state) without the need for distributed transactions, and the business level event semantics you're after.
(Disclaimer: I'm the lead of Debezium; funnily enough I'm just in the process of writing a blog post discussing this approach in more detail)
Here are the posts
https://debezium.io/blog/2018/09/20/materializing-aggregate-views-with-hibernate-and-debezium/
https://debezium.io/blog/2019/02/19/reliable-microservices-data-exchange-with-the-outbox-pattern/
first of all, I have to say that I’m no Kafka, nor a Spring expert but I think that it’s more a conceptual challenge when writing to independent resources and the solution should be adaptable to your technology stack. Furthermore, I should say that this solution tries to solve the problem without an external component like Debezium, because in my opinion each additional component brings challenges in testing, maintaining and running an application which is often underestimated when choosing such an option. Also not every database can be used as a Debezium-source.
To make sure that we are talking about the same goals, let’s clarify the situation in an simplified airline example, where customers can buy tickets. After a successful order the customer will receive a message (mail, push-notification, …) that is sent by an external messaging system (the system we have to talk with).
In a traditional JMS world with an XA transaction between our database (where we store orders) and the JMS provider it would look like the following: The client sets the order to our app where we start a transaction. The app stores the order in its database. Then the message is sent to JMS and you can commit the transaction. Both operations participate at the transaction even when they’re talking to their own resources. As the XA transaction guarantees ACID we’re fine.
Let’s bring Kafka (or any other resource that is not able to participate at the XA transaction) in the game. As there is no coordinator that syncs both transactions anymore the main idea of the following is to split processing in two parts with a persistent state.
When you store the order in your database you can also store the message (with aggregated data) in the same database (e.g. as JSON in a CLOB-column) that you want to send to Kafka afterwards. Same resource – ACID guaranteed, everything fine so far. Now you need a mechanism that polls your “KafkaTasks”-Table for new tasks that should be send to a Kafka-Topic (e.g. with a timer service, maybe #Scheduled annotation can be used in Spring). After the message has been successfully sent to Kafka you can delete the task entry. This ensures that the message to Kafka is only sent when the order is also successfully stored in application database. Did we achieve the same guarantees as we have when using a XA transaction? Unfortunately, no, as there is still the chance that writing to Kafka works but the deletion of the task fails. In this case the retry-mechanism (you would need one as mentioned in your question) would reprocess the task an sends the message twice. If your business case is happy with this “at-least-once”-guarantee you’re done here with a imho semi-complex solution that could be easily implemented as framework functionality so not everyone has to bother with the details.
If you need “exactly-once” then you cannot store your state in the application database (in this case “deletion of a task” is the “state”) but instead you must store it in Kafka (assuming that you have ACID guarantees between two Kafka topics). An example: Let’s say you have 100 tasks in the table (IDs 1 to 100) and the task job processes the first 10. You write your Kafka messages to their topic and another message with the ID 10 to “your topic”. All in the same Kafka-transaction. In the next cycle you consume your topic (value is 10) and take this value to get the next 10 tasks (and delete the already processed tasks).
If there are easier (in-application) solutions with the same guarantees I’m looking forward to hear from you!
Sorry for the long answer but I hope it helps.
All the approach described above are the best way to approach the problem and are well defined pattern. You can explore these in the links provided below.
Pattern: Transactional outbox
Publish an event or message as part of a database transaction by saving it in an OUTBOX in the database.
http://microservices.io/patterns/data/transactional-outbox.html
Pattern: Polling publisher
Publish messages by polling the outbox in the database.
http://microservices.io/patterns/data/polling-publisher.html
Pattern: Transaction log tailing
Publish changes made to the database by tailing the transaction log.
http://microservices.io/patterns/data/transaction-log-tailing.html
Debezium is a valid answer but (as I've experienced) it can require some extra overhead of running an extra pod and making sure that pod doesn't fall over. This could just be me griping about a few back to back instances where pods OOM errored and didn't come back up, networking rule rollouts dropped some messages, WAL access to an aws aurora db started behaving oddly... It seems that everything that could have gone wrong, did. Not saying Debezium is bad, it's fantastically stable, but often for devs running it becomes a networking skill rather than a coding skill.
As a KISS solution using normal coding solutions that will work 99.99% of the time (and inform you of the .01%) would be:
Start Transaction
Sync save to DB
-> If fail, then bail out.
Async send message to kafka.
Block until the topic reports that it has received the
message.
-> if it times out or fails Abort Transaction.
-> if it succeeds Commit Transaction.
I'd suggest to use a new approach 2-phase message. In this new approach, much less codes are needed, and you don't need Debeziums any more.
https://betterprogramming.pub/an-alternative-to-outbox-pattern-7564562843ae
For this new approach, what you need to do is:
When writing your database, write an event record to an auxiliary table.
Submit a 2-phase message to DTM
Write a service to query whether an event is saved in the auxiliary table.
With the help of DTM SDK, you can accomplish the above 3 steps with 8 lines in Go, much less codes than other solutions.
msg := dtmcli.NewMsg(DtmServer, gid).
Add(busi.Busi+"/TransIn", &TransReq{Amount: 30})
err := msg.DoAndSubmitDB(busi.Busi+"/QueryPrepared", db, func(tx *sql.Tx) error {
return AdjustBalance(tx, busi.TransOutUID, -req.Amount)
})
app.GET(BusiAPI+"/QueryPrepared", dtmutil.WrapHandler2(func(c *gin.Context) interface{} {
return MustBarrierFromGin(c).QueryPrepared(db)
}))
Each of your origin options has its disadvantage:
The user cannot immediately see the database changes it have just created.
Debezium will capture the log of the database, which may be much larger than the events you wanted. Also deployment and maintenance of Debezium is not an easy job.
"built-in auto-retry functionality" is not cheap, it may require much codes or maintenance efforts.
Is there a way for a replication client to say whenever they it was able to successfully store the data, or is it that PostgreSQL is streaming pending data to the client and the moment data leave network interface it is considered delivered?
I'd think that client has a chance to say "ACK - I got the data", but I can't seem to find this anywhere... I'm simply wondering what if the client fails to store the data (e.g. due to power failure) - isn't there a way to get it again from Postgres?
General info here https://www.postgresql.org/docs/9.5/static/logicaldecoding.html
I'll answer my own Q.
After doing much more reading, I can say there is ACK-like mechanism there.
Under some conditions (e.g. on interval) server will ask logical replication consumer to report what was the last piece of data that was persisted (i.e. flushed to disk or similar). Then and only then server will treat data up to that reported point delivered for given replication channel.
I am working out a master/slave architecture for my web application in which frontends reading from slaves must only do so if the slave is consistent up to the time of the last known write triggered by the requesting client. Slaves can be inconsistent with respect to the master as long as they are inconsistent only regarding writes by other users, not by the requesting user.
All writes are sent to the master which is easy enough, but the trick is knowing when to send reads to the master versus a slave.
What I would like to do is:
On a write request, at the end of the request processing phase after all writes are committed, take some kind of reading of the database's current transaction pointer and stash it in a cookie on the client response.
On a read request, take the value from this cookie and first check if the slave is caught up to this transaction pointer location. If it's caught up, delete the cookie and read from the slave happily. If not, read from the master and leave the cookie in place.
I am not sure what specific functions to use to achieve this on the master and slave or if they exist at all. I would like to avoid the overhead of a dedicated counter in a table that I have to explicitly update and query, since I presume PG is already doing this for me in some fashion. However, I could do that if necessary.
pg_current_xlog_location on the master and pg_last_xlog_replay_location on the slave look promising, however, I do not know enough to know if these will reliably do the job:
Will an idle master and a caught-up slave always report the exact same values for these functions?
The syntax of their return value is confusing to me, for instance 0/6466270 - how do I convert this string into an integer in a way that I can reliably do a simple greater- or less-than comparison?
Note: I am planning to use streaming replication with slaves in hot standby mode, if that affects the available solutions. I am currently using 9.1, but would entertain an upgrade if that helped.
take some kind of reading of the database's current transaction pointer and stash it in a cookie on the client response.
You can use:
SELECT pg_xlog_location_diff(pg_current_xlog_location(), '0/00000000');
to get an absolute position, but in this case you actually only need to store pg_current_xlog_location(), because:
On a read request, take the value from this cookie and first check if the slave is caught up to this transaction pointer location.
Compare the saved pg_current_xlog_location() with the slave's pg_last_xlog_replay_location() using pg_xlog_location_diff.
Will an idle master and a caught-up slave always report the exact same values for these functions?
If you're using streaming replication, yes. If you're doing archive based replication, no.
You shouldn't rely on the same value anyway. You just need to know if the slave is new enough.
The syntax of their return value is confusing to me, for instance 0/6466270 - how do I convert this string into an integer in a way that I can reliably do a simple greater- or less-than comparison?
Use pg_xlog_location_diff. It might not be in 9.1, so you may need to upgrade.