After reading the PgPool documentation I was left confused about which option would best suit my use case. I need a main database instance that serves the queries and one or more replicas (standbys) of the main one to be used in disaster recovery scenarios.
What is very important for me is that all transactions committed successfully to the master node are guaranteed to eventually be replicated to the replicas, so that when a failover occurs, the replica database instance has all transactions up to and including the latest one applied to it.
For asynchronous replication, I have not seen any mention in the PgPool documentation of whether that is the case; it does, however, mention some potential data loss, which is a bit too vague for me to draw any conclusions.
To combat this data loss, the documentation suggests using synchronous streaming replication, which ensures, before a transaction is committed on the main node, that all replicas have also applied that change. This method is therefore slower than the asynchronous one, but if there is no data loss it could be viable.
Is synchronous replication the only method that satisfies my use case, or would asynchronous replication also do the trick? Also, what constitutes the potential data loss in asynchronous replication?
Asynchronous replication means that the primary server does not wait for a standby server before reporting a successful COMMIT to the client. As a consequence, if the primary server fails, it is possible that the client believes that a certain transaction is committed, but none of the standby servers has yet received the WAL information. In a high availability setup, where you promote a standby in case of loss of the primary server, that means that you could potentially lose committed transactions, although it typically takes only a split second for the information to reach the standby.
With synchronous replication, the primary waits until the first available synchronous standby server has reported that it has received and persisted the WAL information before reporting a successful COMMIT to the client (the details of this, like which standby server is chosen, how many of them have to report back and when exactly WAL counts as received by the standby are configurable). So no transaction that has been reported committed to the client can get lost, even if the primary server is gone for good.
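As an illustration, here is a minimal sketch (psycopg2) of checking and enabling synchronous replication from a client session; the host, credentials, and the standby name standby1 are assumptions, and in practice you would usually set these values in postgresql.conf:

    import psycopg2

    # Connection details and the standby name are illustrative assumptions.
    conn = psycopg2.connect("host=primary.example.com dbname=postgres user=postgres")
    conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
    cur = conn.cursor()

    # Which standbys are connected, and are they sync or async?
    cur.execute("SELECT application_name, sync_state FROM pg_stat_replication")
    print(cur.fetchall())

    # Make 'standby1' the synchronous standby; COMMIT will then wait until it
    # has received and flushed the WAL. A configuration reload is required.
    cur.execute("ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (standby1)'")
    cur.execute("SELECT pg_reload_conf()")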
While it is technically simple to configure synchronous replication, it poses an architectural and design problem, so that asynchronous replication is often the better choice:
Synchronous replication slows down all data modification drastically. To work reasonably well, the network latency between primary and standby has to be very low. You usually cannot reasonably use synchronous replication between different data centers.
Synchronous replication reduces the availability of the whole system, because failure of the standby server prevents any data modification from succeeding. For that reason, you need to have at least two synchronous standby servers, one that is kept synchronized and one as a stand-in.
Even with synchronous replication, it is not guaranteed that reading from the standby after writing to the primary will give you the new data, because by default the primary doesn't wait for WAL to be replayed on the standby. If you want that, you need to set synchronous_commit = remote_apply on the primary server. But since queries on the standby can conflict with WAL replay, you will either have to deal with replication (and commit) delay or canceled queries on the standby. So using synchronous replication as a tool for horizontal scaling is reasonably possible only if you can deal with data modifications not being immediately visible on the standby.
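A small sketch of how that can be used selectively: synchronous_commit can be raised to remote_apply per session (or per transaction), so only the writes that must be immediately visible on the standby pay the extra latency. The table and connection string are illustrative assumptions:

    import psycopg2

    conn = psycopg2.connect("host=primary.example.com dbname=app user=app")
    cur = conn.cursor()

    # Session-level override; assumes a synchronous standby is configured.
    cur.execute("SET synchronous_commit TO 'remote_apply'")
    cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = %s", (42,))
    conn.commit()  # returns only after the standby has replayed this change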
I'm trying to figure out the logical replication protocol. My attention was drawn to the message "standby status update" (byte('r')). Unfortunately, the documentation does not seem to describe the expected server behavior for this command.
If I send an LSN from the past, will the server resend the transactions that come after this LSN? Or, as far as I could determine experimentally, does this message only affect the metadata of the replication slot (as shown in the pg_replication_slots view)?
That information is used to move the replication slot ahead and to provide the data that can be seen in pg_stat_replication (sent_lsn, write_lsn, flush_lsn and replay_lsn). With synchronous replication, it also provides the data that the primary has to wait for before it can report a transaction as committed.
Sending old, incorrect log sequence numbers will not change the data that the primary streams to the standby, but it might make the primary retain old WAL, because the replication slot is not moved ahead. It will also confuse monitoring. With synchronous replication, it will lead to hanging DML transactions.
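If you want to see what the standby status update actually moves, you can watch these views on the primary; a sketch with illustrative connection details:

    import psycopg2

    conn = psycopg2.connect("host=primary.example.com dbname=postgres user=postgres")
    cur = conn.cursor()

    # Positions the client reported back in its standby status updates
    cur.execute("""SELECT application_name, sent_lsn, write_lsn, flush_lsn, replay_lsn
                   FROM pg_stat_replication""")
    print(cur.fetchall())

    # How far each replication slot has been confirmed, i.e. which WAL
    # the primary is still obliged to retain
    cur.execute("""SELECT slot_name, restart_lsn, confirmed_flush_lsn
                   FROM pg_replication_slots""")
    print(cur.fetchall())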
If we compare multiple types of replication (single-leader, multi-leader, or leaderless), single-leader replication has the possibility to be linearizable. In my understanding, linearizability means that once a write completes, all later reads should return that value, or the value of a later write. In other words, it should appear as if there is only one copy of the database, and no more. So I guess: no stale reads.
PostgreSQL, in its streaming replication, has the ability to make all its replicas synchronous using synchronous_standby_names, and it can also be fine-tuned with the synchronous_commit option, which can be set to remote_apply so that the leader waits until the transaction is replayed on the standby (making it visible to queries). In the documentation, the paragraph that discusses the remote_apply option states that this allows load balancing in simple cases with causal consistency.
A few pages back, it says this:
"Some solutions are synchronous, meaning that a data-modifying transaction is not considered committed until all servers have committed the transaction. This guarantees that a failover will not lose any data and that all load-balanced servers will return consistent results no matter which server is queried."
So I'm struggling to understand what is actually guaranteed and what anomalies can happen if we load-balance read queries to the read replicas. Can there still be stale reads? Can it happen that I query different replicas and get different results, even though no write happened on the leader in between? My impression is yes, but I'm not really sure.
If not, how does PostgreSQL prevent stale reads? I did not find any details on how it fully works under the hood. Does it use two-phase commit, some modification of it, or some other algorithm to prevent stale reads?
If it does not provide an option that rules out stale reads, is there a way to accomplish that? I saw that PgPool has an option to load-balance only to replicas that are behind by no more than a defined threshold, but I did not understand whether it can be configured to load-balance only to replicas that are fully caught up with the leader.
It's really confusing to me whether anomalies can happen with fully synchronous replication in PostgreSQL.
I understand that a setup like this has problems with availability, but that is not a concern right now.
If you use synchronous replication with synchronous_commit = remote_apply, you can be certain that you will see the modified data on the standby as soon as the modifying transaction has committed.
Synchronous replication does not use two-phase commit; the primary server first commits locally and then simply waits for the feedback of the synchronous standby server before COMMIT returns. So the following is possible:
An observer will see the modified data on the primary before COMMIT returns, and before the data are propagated to the standby.
An observer will see the modified data on the standby before the COMMIT on the primary returns.
If the committing transaction is interrupted on the primary at the proper moment before COMMIT returns, the transaction can end up committed only on the primary. There is always a certain time window between the time the commit happens on the server and the time it is reported to the client, but that window increases considerably with streaming replication.
Eventually, though, the data modification will always make it to the standby.
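A sketch of the read-your-writes pattern this gives you, assuming standby.example.com is the (synchronous) standby and the events table exists; these names are illustrative:

    import psycopg2

    primary = psycopg2.connect("host=primary.example.com dbname=app user=app")
    standby = psycopg2.connect("host=standby.example.com dbname=app user=app")

    with primary.cursor() as cur:
        cur.execute("SET synchronous_commit TO 'remote_apply'")
        cur.execute("INSERT INTO events (payload) VALUES (%s)", ("hello",))
    primary.commit()  # waits until the standby has replayed the INSERT

    with standby.cursor() as cur:
        cur.execute("SELECT count(*) FROM events WHERE payload = %s", ("hello",))
        print(cur.fetchone())  # the new row is already visible on the standby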
In general, MongoDB replicates from a primary to secondaries asynchronously, by shipping the oplog from the primary to the secondaries; the timing depends on the number of write operations, elapsed time, and other factors.
When describing WriteConcern options, MongoDB documentation states "...primary waits until the required number of secondaries acknowledge the write before returning write concern acknowledgment". This seems to suggest that a WriteConcern other than "w:1" would replicate to at least some of the members of the replica set in a blocking manner, potentially avoiding log shipping.
The basic question I'm trying to answer is this: if every write uses a WriteConcern of "majority", would MongoDB ever have to use log shipment? In other words, does using a WriteConcern of "majority" also control replication timing?
I would like to better understand how MongoDB handles WriteConcern of "majority". A few obvious options:
Primary sends write requests to every Secondary, and blocks the thread until a majority respond with acknowledgment
or
Primary pre-selects Secondaries first and sends requests to only those secondaries, blocking the thread until all chosen secondaries respond with acknowledgment
or
Something much smarter than either of these options
If Option 1 is used, in most cases (assuming equidistant placement of secondaries) all secondaries will have received the write operation by the time the write completes, and there is a high probability (although not a guarantee) that all secondaries will have applied it. If true, this behavior enables use cases where writes need to be reflected on Secondaries more quickly than the typical asynchronous replication process allows.
Obviously WriteConcern of "majority" will incur performance penalty, but this may be acceptable for specific use cases where read operations may target Secondaries (e.g. ReadPreference of "nearest") and desire more recent data.
if every write is using WriteConcern of "majority", would MongoDB ever have to use log shipment?
Replication in MongoDB uses what is termed the oplog. This is a record of all operations on the primary (the only node that accepts writes).
Instead of the primary pushing the oplog to the secondaries, the secondaries pull the oplog from the primary. If replication chaining is allowed (the default), a secondary can also pull the oplog from another secondary. So scenarios 1 and 2 that you posted are not how MongoDB replication works as of MongoDB 4.0.
The details of the replication process are described in the MongoDB GitHub wiki page: Replication Internals.
To quote the relevant parts regarding your question:
If a command includes a write concern, the command will just block in its own thread until the oplog entries it generates have been replicated to the requested number of nodes. The primary keeps track of how up-to-date the secondaries are to know when to return. A write concern can specify a number of nodes to wait for, or majority.
In other words, the secondaries continually report back to the primary how far along they are in applying the oplog to their own data sets. Since the primary knows the timestamp at which the write took place, once a secondary has applied that timestamp, the primary can tell that the write has propagated to that secondary. To satisfy the write concern, the primary simply waits until the required number of secondaries have applied the write timestamp.
Note that only the thread specifying the write concern is waiting for this acknowledgment. All other threads are not blocked due to this waiting at all.
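For illustration, this is roughly what a majority write looks like from a driver; only this call blocks until a majority of the data-bearing nodes have acknowledged the write (database, collection, and host names are illustrative assumptions):

    from pymongo import MongoClient
    from pymongo.write_concern import WriteConcern

    client = MongoClient("mongodb://node1.example.com:27017/?replicaSet=rs0")
    orders = client.shop.get_collection(
        "orders", write_concern=WriteConcern(w="majority", wtimeout=5000)
    )
    # Returns once a majority has acknowledged; other threads are unaffected.
    orders.insert_one({"sku": "abc", "qty": 1})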
Regarding your other question:
Obviously WriteConcern of "majority" will incur performance penalty, but this may be acceptable for specific use cases where read operations may target Secondaries (e.g. ReadPreference of "nearest") and desire more recent data.
To achieve what you described, you need a combination of read and write concerns. See Causal Consistency and Read and Write Concerns for more details on this subject.
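A sketch of that combination: a causally consistent session plus majority read/write concerns gives read-your-writes even when the read is served by a secondary (names are illustrative assumptions):

    from pymongo import MongoClient, ReadPreference
    from pymongo.read_concern import ReadConcern
    from pymongo.write_concern import WriteConcern

    client = MongoClient("mongodb://node1.example.com:27017/?replicaSet=rs0")
    coll = client.shop.get_collection(
        "orders",
        write_concern=WriteConcern(w="majority"),
        read_concern=ReadConcern("majority"),
        read_preference=ReadPreference.SECONDARY_PREFERRED,
    )

    with client.start_session(causal_consistency=True) as session:
        coll.insert_one({"sku": "abc"}, session=session)
        # This read may be routed to a secondary, but within the session it is
        # guaranteed to observe the write above (no stale read).
        print(coll.find_one({"sku": "abc"}, session=session))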
Write majority is typically used for:
Ensuring that the write will not be rolled back in the event of the primary failure.
Ensuring that the application is not writing so fast that the provisioned hardware of the replica set cannot cope with the traffic; i.e. it can act as a backpressure mechanism.
In combination with read concern, provide the client with differing levels of consistency guarantees.
These points assume that the write majority was acknowledged and the acknowledgment was received by the client. There are multiple different failure scenarios that are possible (as expected with a distributed system that has to cope with an unreliable network), but those are beyond the scope of this discussion.
I am dealing with MongoDB rollback procedures. The problem is that the data to roll back may be huge, 300 MB or more.
Is there any solution for this problem? The error log is:
replSet syncThread: replSet too much data to roll back
In the official MongoDB documentation, I could not find a solution.
Thanks for the answers.
The cause
The page Rollbacks During Replica Set Failover states:
A rollback is necessary only if the primary had accepted write operations that the secondaries had not successfully replicated before the primary stepped down. When the primary rejoins the set as a secondary, it reverts, or “rolls back,” its write operations to maintain database consistency with the other members.
and:
When a rollback does occur, it is often the result of a network partition.
In other words, a rollback scenario typically occurs like this:
1. You have a 3-node replica set set up as primary-secondary-secondary.
2. There is a network partition, separating the current primary from the secondaries.
3. The two secondaries cannot see the former primary and elect one of them to be the new primary. Applications that are replica-set aware are now writing to the new primary.
4. However, some writes keep coming into the old primary before it realizes that it cannot see the rest of the set and steps down.
5. The data written to the old primary in step 4 above are the data that get rolled back, since for a period of time the old primary was acting as a "false" primary (i.e., the "real" primary is supposed to be the secondary elected in step 3).
The fix
MongoDB will not perform a rollback if there are more than 300 MB of data to be rolled back. The message you are seeing (replSet too much data to roll back) means that you are hitting this situation, and would either have to save the rollback directory under the node's dbpath, or perform an initial sync.
Preventing rollbacks
Configuring your application using w: majority (see Write Concern and Write Concern for Replica Sets) would prevent rollbacks. See Avoid Replica Set Rollbacks for more details.
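A sketch of making that the application-wide default through the connection string, so that every acknowledged write is safe from rollback (the URI is an illustrative assumption):

    from pymongo import MongoClient

    client = MongoClient(
        "mongodb://node1.example.com,node2.example.com,node3.example.com"
        "/?replicaSet=rs0&w=majority&wtimeoutMS=5000"
    )
    # Every write through this client waits for a majority acknowledgment, so a
    # write that has been acknowledged cannot be rolled back after a failover.
    client.shop.orders.insert_one({"sku": "abc", "qty": 1})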
My web app uses ADO.NET against SQL Server 2008. Database writes happen against a primary (publisher) database, but reads are load balanced across the primary and a secondary (subscriber) database. We use SQL Server's built-in transactional replication to keep the secondary up-to-date. Most of the time, the couple of seconds of latency is not a problem.
However, I do have a case where I'd like to block until the transaction is committed at the secondary site. Blocking for a few seconds is OK, but returning a stale page to the user is not. Is there any way in ADO.NET or TSQL to specify that I want to wait for the replication to complete? Or can I, from the publisher, check the replication status of the transaction without manually connecting to the secondary server?
[edit]
99.9% of the time, the data in the subscriber is "fresh enough". But there is one operation that invalidates it, and I can't read from the publisher every time on the off chance that it has become invalid. If I can't solve this problem under transactional replication, can you suggest an alternate architecture?
There's no such solution for SQL Server, but here's how I've worked around it in other environments.
Use three separate connection strings in your application, and choose the right one based on the needs of your query:
Realtime - Points directly at the one master server. All writes go to this connection string, and only the most mission-critical reads go here.
Near-Realtime - Points at a load balanced pool of subscribers. No writes go here, only reads. Used for the vast majority of OLTP reads.
Delayed Reporting - In your environment right now, it's going to point to the same load-balanced pool of subscribers, but down the road you can use a technology like log shipping to have a pool of servers 8-24 hours behind. These scale out really well, but the data's far behind. It's great for reporting, search, long-term history, and other non-realtime needs.
If you design your app to use those 3 connection strings from the start, scaling is a lot easier, especially in the case you're experiencing.
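A sketch of what that routing can look like in application code (pyodbc; server names, the driver string, and the freshness labels are illustrative assumptions):

    import pyodbc

    CONN = {
        "realtime":      "DRIVER={ODBC Driver 18 for SQL Server};SERVER=publisher;DATABASE=app;Trusted_Connection=yes",
        "near_realtime": "DRIVER={ODBC Driver 18 for SQL Server};SERVER=subscriber-pool;DATABASE=app;Trusted_Connection=yes",
        "delayed":       "DRIVER={ODBC Driver 18 for SQL Server};SERVER=reporting-pool;DATABASE=app;Trusted_Connection=yes",
    }

    def run_query(sql, params, freshness):
        # Writes and must-be-current reads use "realtime"; most OLTP reads use
        # "near_realtime"; reporting and search use "delayed".
        conn = pyodbc.connect(CONN[freshness])
        try:
            cur = conn.cursor()
            cur.execute(sql, params)
            return cur.fetchall()
        finally:
            conn.close()

    # A read that must never be stale goes straight to the publisher:
    rows = run_query("SELECT Balance FROM Accounts WHERE Id = ?", (42,), "realtime")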
You are describing a synchronous mirroring situation. Replication cannot, by definition, support your requirement. Replication must wait for a transaction to commit before reading it from the log and delivering it to the distributor and from there to the subscriber, which means replication by definition has a window of opportunity for data to be out of sync.
If you have a requirement that an operation read the authoritative copy of the data, then you should make that decision in the client and ensure that you read from the publisher in that case.
While you can, in theory, validate whether a certain transaction was distributed to the subscriber or not, you should not base your design on it. Transactional replication makes no latency guarantees, by design, so you cannot rely on a 'perfect day' mode of operation.
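If you still want to observe how far behind the subscriber is (for monitoring, not correctness), transactional replication's tracer tokens can be used; a rough sketch with pyodbc, assuming a publication named 'AppPub' in a database 'PublishedDb' (names are illustrative, and the tracer-token procedures should be verified against your SQL Server version):

    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};SERVER=publisher;"
        "DATABASE=PublishedDb;Trusted_Connection=yes",
        autocommit=True,
    )
    cur = conn.cursor()

    # Insert a tracer token into the publication's log stream.
    cur.execute("EXEC sys.sp_posttracertoken @publication = ?", ("AppPub",))

    # Later: list posted tokens; a token id can be passed to
    # sys.sp_helptracertokenhistory to see distributor and subscriber latency.
    cur.execute("EXEC sys.sp_helptracertokens @publication = ?", ("AppPub",))
    print(cur.fetchall())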