PostgreSQL synchronous replication consistency

If we compare multiple types of replication (Single-leader, Multi-leader or Leaderless), Single-leader replication has the possibility to be linearizable. In my understanding, linearizability means that once a write completes, all later reads should return that value, or the value of a later write. In other words, it should appear as if there is only a single copy of the database, and nothing more. So I guess: no stale reads.
PostgreSQL, in its streaming replication, has the ability to make all its replicas synchronous using synchronous_standby_names, and it can be fine-tuned with the synchronous_commit option, which can be set to remote_apply, so that the leader waits until the transaction is replayed on the standby (making it visible to queries). In the documentation, in the paragraph about the remote_apply option, it states that this allows load-balancing in simple cases with causal consistency.
A few pages back, it says this:
"Some solutions are synchronous, meaning that a data-modifying transaction is not considered committed until all servers have committed the transaction. This guarantees that a failover will not lose any data and that all load-balanced servers will return consistent results no matter which server is queried."
So I'm struggling to understand what exactly is guaranteed, and what anomalies can happen if we load-balance read queries across the read replicas. Can there still be stale reads? Can querying different replicas return different results even though no write happened on the leader in between? My impression is yes, but I'm not really sure.
If not, how does PostgreSQL prevent stale reads? I did not find any details on how this fully works under the hood. Does it use two-phase commit, some modification of it, or some other algorithm to prevent stale reads?
If it does not provide an option that rules out stale reads, is there a way to accomplish that? I saw that PgPool has an option to load-balance to replicas that lag by no more than a defined threshold, but I did not understand whether it can be configured to load-balance only to replicas that are fully caught up with the leader.
I find it genuinely confusing to work out whether anomalies can happen under fully synchronous replication in PostgreSQL.
I understand that a setup like this has availability problems, but that is not a concern right now.
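For reference, this is roughly the setup I mean; a minimal sketch, assuming the psycopg2 driver and a standby registered under the made-up application_name standby1:

```python
# Sketch only: make 'standby1' a synchronous standby and wait for
# remote apply. 'standby1' and the connection string are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=postgres user=postgres")
conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction
cur = conn.cursor()

# Normally set in postgresql.conf; shown here via ALTER SYSTEM.
cur.execute("ALTER SYSTEM SET synchronous_standby_names = 'standby1'")
# Wait until the standby has *replayed* each commit, not just flushed it.
cur.execute("ALTER SYSTEM SET synchronous_commit = 'remote_apply'")
cur.execute("SELECT pg_reload_conf()")  # apply without a restart
```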

If you use synchronous replication with synchronous_commit = remote_apply, you can be certain that you will see the modified data on the standby as soon as the modifying transaction has committed.
Synchronous replication does not use two-phase commit; the primary server first commits locally and then simply waits for feedback from the synchronous standby server before COMMIT returns. So the following is possible:
An observer can see the modified data on the primary before COMMIT returns, and before the data have propagated to the standby.
An observer can see the modified data on the standby before the COMMIT on the primary returns.
If the committing transaction is interrupted on the primary at the right moment before COMMIT returns, the transaction can end up committed on the primary but not yet on the standby. There is always a certain time window between the moment a commit happens on the server and the moment it is reported to the client, but that window grows considerably with synchronous streaming replication.
Eventually, though, the data modification will always make it to the standby.
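To make those windows concrete, here is a small sketch (assuming psycopg2, a table t(v int), and hypothetical hosts primary and standby configured with synchronous_commit = remote_apply):

```python
# Sketch of what remote_apply does and does not guarantee.
import psycopg2

primary = psycopg2.connect("host=primary dbname=app")
standby = psycopg2.connect("host=standby dbname=app")

with primary.cursor() as cur:
    cur.execute("INSERT INTO t (v) VALUES (1)")

# commit() blocks until the standby reports the WAL as applied.
# Another session may already see the row on the primary (or even
# on the standby) *before* this call returns.
primary.commit()

# After commit() returns, the row is guaranteed to be visible here:
with standby.cursor() as cur:
    cur.execute("SELECT count(*) FROM t WHERE v = 1")
    print(cur.fetchone()[0])  # >= 1
```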

Related

Postgresql logical replication protocol, "standby status update" message impact

I'm trying to figure out the logical replication protocol. My attention was drawn to the message "standby status update" (byte('r')). Unfortunately, the documentation does not seem to describe the expected server behavior for this command.
If I send an LSN from the past, will the server resend transactions that came after that LSN? Or, as far as I could determine experimentally, does this message only affect the metadata of the replication slot (in the pg_replication_slots view)?
That information is used to move the replication slot ahead and to provide the data that can be seen in pg_stat_replication (sent_lsn, write_lsn, flush_lsn and replay_lsn). With synchronous replication, it also provides the data that the primary has to wait for before it can report a transaction as committed.
Sending old, incorrect log sequence numbers will not change the data that the primary streams to the standby, but it might make the primary retain old WAL, because the replication slot is not moved ahead. It will also confuse monitoring. With synchronous replication, it will lead to hanging DML transactions.
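For illustration, here is roughly what sending that message looks like from a client, sketched with psycopg2's replication support (the slot myslot is a placeholder and must already exist, created with a logical decoding plugin):

```python
# Sketch: consuming a logical replication stream and sending the
# 'r' (standby status update) message back to the server.
import psycopg2
from psycopg2.extras import LogicalReplicationConnection

conn = psycopg2.connect("dbname=app",
                        connection_factory=LogicalReplicationConnection)
cur = conn.cursor()
cur.start_replication(slot_name='myslot', decode=True)

def consume(msg):
    print(msg.payload)
    # send_feedback() emits the status update: it moves the slot's
    # confirmed LSN forward and feeds pg_stat_replication, but it does
    # not make the server resend already-streamed changes.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(consume)
```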

Synchronous vs asynchronous streaming replication for Postgres with PgPool

After reading the PgPool documentation I was left confused about which option would suit my use case best. I need a main database instance that serves the queries, and one or more replicas (standbys) of the main one to be used for disaster recovery scenarios.
What is very important for me is that all transactions committed successfully on the master node are guaranteed to eventually be replicated to the replicas, such that when a failover occurs, the replica database instance has all transactions up to and including the latest one applied to it.
Regarding asynchronous replication, I have not seen any mention in the PgPool documentation of whether that is the case; it does, however, mention some potential data loss, which is a bit too vague for me to draw any conclusions from.
To combat this data loss, the documentation suggests using synchronous streaming replication, which, before committing a transaction on the main node, ensures that all replicas have applied that change as well. This method is therefore slower than the asynchronous one, but if there is no data loss, it could be viable.
Is synchronous replication the only method that achieves my use case, or would asynchronous replication also do the trick? And what exactly constitutes the potential data loss in asynchronous replication?
Asynchronous replication means that the primary server does not wait for a standby server before reporting a successful COMMIT to the client. As a consequence, if the primary server fails, it is possible that the client believes that a certain transaction is committed, but none of the standby servers has yet received the WAL information. In a high availability setup, where you promote a standby in case of loss of the primary server, that means that you could potentially lose committed transactions, although it typically takes only a split second for the information to reach the standby.
With synchronous replication, the primary waits until the first available synchronous standby server has reported that it has received and persisted the WAL information before reporting a successful COMMIT to the client (the details of this, like which standby server is chosen, how many of them have to report back and when exactly WAL counts as received by the standby are configurable). So no transaction that has been reported committed to the client can get lost, even if the primary server is gone for good.
While it is technically simple to configure synchronous replication, it poses an architectural and design problem, so that asynchronous replication is often the better choice:
Synchronous replication slows down all data modification drastically. To work reasonably well, the network latency between primary and standby has to be very low. You usually cannot reasonably use synchronous replication between different data centers.
Synchronous replication reduces the availability of the whole system, because failure of the standby server prevents any data modification from succeeding. For that reason, you need to have at least two synchronous standby servers, one that is kept synchronized and one as a stand-in.
Even with synchronous replication, it is not guaranteed that reading from the standby after writing to the primary will give you the new data, because by default the primary doesn't wait for WAL to be replayed on the standby. If you want that, you need to set synchronous_commit = remote_apply on the primary server. But since queries on the standby can conflict with WAL replay, you will either have to deal with replication (and commit) delay or canceled queries on the standby. So using synchronous replication as a tool for horizontal scaling is reasonably possible only if you can deal with data modifications not being immediately visible on the standby.
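If only some transactions need that guarantee, one possible middle ground is to set synchronous_commit per transaction rather than globally. A sketch, assuming psycopg2, a configured synchronous standby, and made-up table names:

```python
# Sketch: per-transaction durability instead of a global setting.
import psycopg2

conn = psycopg2.connect("host=primary dbname=app")
cur = conn.cursor()

# Critical write: wait until the synchronous standby has applied it,
# so an immediate read on the standby will see it.
cur.execute("SET LOCAL synchronous_commit = 'remote_apply'")
cur.execute("INSERT INTO orders (amount) VALUES (42)")
conn.commit()

# Low-value write: local durability only, no waiting on the standby.
cur.execute("SET LOCAL synchronous_commit = 'local'")
cur.execute("INSERT INTO audit_log (note) VALUES ('batch import')")
conn.commit()
```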

In MongoDB, are causally consistent sessions subject to reading data that may be rolled back?

If I'm using a causally consistent session in MongoDB, I can execute an acknowledged write that gets acknowledged before the write is written to a majority. Because I get acknowledgement so quickly, I can then do a read and see the results of my write. However, since I haven't waited for 'majority' acknowledgement, is it possible that my write could be rolled back, but I wouldn't know it because the 'read' that I did didn't depend on the majority members of the cluster actually getting the data?
"I can execute an acknowledged write that gets acknowledged before the write is written to a majority ... is it possible that my write could be rolled back?"
If you are not specifying write concern majority, then yes, it is possible that your write could be rolled back. Unacknowledged writes are a different matter: MongoDB does not return the operation time and the cluster time for unacknowledged write operations, so unacknowledged writes do not imply any causal relationship.
In addition, if you specify retryable writes true, the client will attempt one retry. This helps address transient network errors and replica set elections, but not persistent network errors.
"the 'read' that I did didn't depend on the majority members of the cluster actually getting the data?"
Within a causally consistent session, the read operation waits until it observes a cluster time at least as recent as the write operation's.
See more information on Implementation of Cluster Write Causal Consistency in MongoDB
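A sketch of that combination in pymongo (majority write and read concerns inside a causally consistent session; the URL and collection name are placeholders):

```python
# Sketch: a causally consistent session where the read cannot observe
# data that might later be rolled back.
from pymongo import MongoClient
from pymongo.read_concern import ReadConcern
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
coll = client.test.get_collection(
    "events",
    write_concern=WriteConcern(w="majority"),
    read_concern=ReadConcern("majority"),
)

with client.start_session(causal_consistency=True) as s:
    coll.insert_one({"k": 1}, session=s)   # acknowledged by a majority
    # The read waits for a cluster time >= the write's operation time,
    # and with read concern majority it only sees durable data.
    print(coll.find_one({"k": 1}, session=s))
```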

Can MongoDB manage a rollback procedure for more than 300 MB of streaming data?

I am dealing with rollback procedures in MongoDB. The problem is that the rollback for huge data sets may be 300 MB or more.
Is there any solution for this problem? The error log is:
replSet syncThread: replSet too much data to roll back
In the official MongoDB documentation, I could not find a solution.
Thanks for the answers.
The cause
The page Rollbacks During Replica Set Failover states:
A rollback is necessary only if the primary had accepted write operations that the secondaries had not successfully replicated before the primary stepped down. When the primary rejoins the set as a secondary, it reverts, or “rolls back,” its write operations to maintain database consistency with the other members.
and:
When a rollback does occur, it is often the result of a network partition.
In other words, a rollback scenario typically occurs like this:
1. You have a 3-node replica set of primary-secondary-secondary.
2. There is a network partition, separating the current primary from the secondaries.
3. The two secondaries cannot see the former primary and elect one of them to be the new primary. Applications that are replica-set aware are now writing to the new primary.
4. However, some writes keep coming into the old primary before it realizes that it cannot see the rest of the set and steps down.
The data written to the old primary in step 4 above is the data that gets rolled back, since for a period of time it was acting as a "false" primary (the "real" primary being the secondary elected in step 3).
The fix
MongoDB will not perform a rollback if there is more than 300 MB of data to roll back. The message you are seeing (replSet too much data to roll back) means that you are hitting this situation, and you will either have to save the rollback directory under the node's dbpath or perform an initial sync.
Preventing rollbacks
Configuring your application to use w: majority (see Write Concern and Write Concern for Replica Sets) would prevent rollbacks. See Avoid Replica Set Rollbacks for more details.
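A sketch in pymongo (the database and collection names are made up):

```python
# Sketch: writes acknowledged by a majority cannot be rolled back.
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
coll = client.app.get_collection(
    "orders",
    write_concern=WriteConcern(w="majority", wtimeout=5000),
)

# insert_one() returns only after a majority of members have persisted
# the write, so a failover cannot roll it back.
coll.insert_one({"total": 42})
```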

wait for transactional replication in ADO.NET or TSQL

My web app uses ADO.NET against SQL Server 2008. Database writes happen against a primary (publisher) database, but reads are load balanced across the primary and a secondary (subscriber) database. We use SQL Server's built-in transactional replication to keep the secondary up-to-date. Most of the time, the couple of seconds of latency is not a problem.
However, I do have a case where I'd like to block until the transaction is committed at the secondary site. Blocking for a few seconds is OK, but returning a stale page to the user is not. Is there any way in ADO.NET or TSQL to specify that I want to wait for the replication to complete? Or can I, from the publisher, check the replication status of the transaction without manually connecting to the secondary server?
[edit]
99.9% of the time, the data in the subscriber is "fresh enough". But there is one operation that invalidates it. I can't read from the publisher every time on the off chance that it's become invalid. If I can't solve this problem under transactional replication, can you suggest an alternate architecture?
There's no such solution for SQL Server, but here's how I've worked around it in other environments.
Use three separate connection strings in your application, and choose the right one based on the needs of your query:
Realtime - Points directly at the one master server. All writes go to this connection string, and only the most mission-critical reads go here.
Near-Realtime - Points at a load balanced pool of subscribers. No writes go here, only reads. Used for the vast majority of OLTP reads.
Delayed Reporting - In your environment right now, it's going to point to the same load-balanced pool of subscribers, but down the road you can use a technology like log shipping to have a pool of servers 8-24 hours behind. These scale out really well, but the data's far behind. It's great for reporting, search, long-term history, and other non-realtime needs.
If you design your app to use those 3 connection strings from the start, scaling is a lot easier, especially in the case you're experiencing.
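A sketch of that routing idea (shown in Python rather than ADO.NET for brevity; the connection strings are placeholders):

```python
# Sketch: each query declares how fresh its data must be, and the
# application picks the connection string accordingly.
REALTIME      = "Server=master;Database=app;Integrated Security=true"
NEAR_REALTIME = "Server=replica-pool;Database=app;Integrated Security=true"
DELAYED       = "Server=report-pool;Database=app;Integrated Security=true"

POOLS = {
    "write":         REALTIME,       # all writes
    "critical_read": REALTIME,       # reads that must never be stale
    "read":          NEAR_REALTIME,  # the vast majority of OLTP reads
    "report":        DELAYED,        # reporting, search, long-term history
}

def connection_string(kind: str) -> str:
    return POOLS[kind]
```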
You are describing a synchronous mirroring situation. Replication cannot, by definition, support your requirement. Replication must wait for a transaction to commit before reading it from the log and delivering it to the distributor and from there to the subscriber, which means replication by definition has a window of opportunity for data to be out of sync.
If you have a requirement that an operation read the authoritative copy of the data, then you should make that decision in the client and ensure you read from the publisher in that case.
While you can, in theory, validate whether a certain transaction was distributed to the subscriber or not, you should not base your design on it. Transactional replication makes no latency guarantee, by design, so you cannot rely on a 'perfect day' mode of operation.