How to stop collections from being cloned when all the data is already restored on the recovered instance using oplog replay
Replication Scenario
I have a 3-node replica set.
Load
There is continuous load; data keeps being added every day, and we take oplog backups every 2 hours. In spite of the 2-hour backup schedule, some transactions roll off the oplog, which means we might miss some records when we replay those oplogs.
Scenario
In a replication scenario, one of the secondaries stops responding, and by the time we rejoin it to the replica set, the minimum oplog timestamp has moved past the last oplog entry on the failed instance. The failed instance tries to catch up but ends up stuck in RECOVERING mode, as the log messages on the recovering instance show:
2019-02-13T15:49:42.346-0500 I REPL [replication-0] We are too stale to use primaryserver3:27012 as a sync source. Blacklisting this sync source because our last fetched timestamp: Timestamp(1550090168, 1) is before their earliest timestamp: Timestamp(1550090897, 28907) for 1min until: 2019-02-13T15:50:42.346-0500
2019-02-13T15:49:42.347-0500 I REPL [replication-0] sync source candidate: primaryserver3:27012
2019-02-13T15:49:42.347-0500 I ASIO [RS] Connecting to primaryserver3:27012
2019-02-13T15:49:42.348-0500 I REPL [replication-0] We are too stale to use primaryserver3:27012 as a sync source. Blacklisting this sync source because our last fetched timestamp: Timestamp(1550090168, 1) is before their earliest timestamp: Timestamp(1550090897, 22809) for 1min until: 2019-02-13T15:50:42.348-0500
2019-02-13T15:49:42.348-0500 I REPL [replication-0] could not find member to sync from
To bring this instance up to par with the primary, we make the RECOVERING instance a new primary and apply all oplog backups taken up to the present. After the oplogs are applied, the record counts on both servers match. Now, when I join the recovering instance (i.e., the new primary) back to the replica set, I see log entries showing an initial sync, which is expected, and then the lines below:
2019-03-01T12:11:58.327-0500 I REPL [repl writer worker 4] CollectionCloner ns:datagen_it_test.test finished cloning with status: OK
2019-03-01T12:12:40.518-0500 I REPL [repl writer worker 8] CollectionCloner ns:datagen_it_test.link finished cloning with status: OK
Here the collections are cloned again. My question is: why does it clone the data again? The data is already restored on the recovering instance, and the record counts all match. How can I stop the cloning from happening?
As per the MongoDB documentation:
A replica set member becomes “stale” when its replication process
falls so far behind that the primary overwrites oplog entries the
member has not yet replicated. The member cannot catch up and becomes
“stale.” When this occurs, you must completely resynchronize the
member by removing its data and performing an initial sync.
This tutorial addresses both resyncing a stale member and creating a
new member using seed data from another member, both of which can be
used to restore a replica set member. When syncing a member, choose a
time when the system has the bandwidth to move a large amount of data.
Schedule the synchronization during a time of low usage or during a
maintenance window.
MongoDB provides two options for performing an initial sync:
Restart the mongod with an empty data directory and let MongoDB’s
normal initial syncing feature restore the data. This is the more
simple option but may take longer to replace the data.
See Automatically Sync a Member.
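For the first option, the procedure is essentially: stop the member, empty its data directory, and restart. A minimal sketch, assuming a systemd-managed mongod with its dbPath at /var/lib/mongo (both are placeholders for your actual setup):

    # Stop the stale member (systemd service name is a placeholder)
    sudo systemctl stop mongod

    # Move the old data files aside so mongod starts with an empty data directory
    sudo mv /var/lib/mongo /var/lib/mongo.stale
    sudo mkdir /var/lib/mongo
    sudo chown mongod:mongod /var/lib/mongo

    # On restart the member rejoins the set and performs a full initial sync
    sudo systemctl start mongod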
Restart the machine with a copy of a recent data directory from
another member in the replica set. This procedure can replace the data
more quickly but requires more manual steps.
See Sync by Copying Data Files from Another Member.
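For the second option, a rough sketch, assuming you can briefly stop a healthy secondary to get a consistent copy (hostnames and paths are placeholders, and rsync stands in for whatever copy mechanism you prefer):

    # On a healthy donor secondary: stop mongod (or use a filesystem snapshot)
    # so the copied files are consistent
    sudo systemctl stop mongod
    rsync -a /var/lib/mongo/ stale-host:/var/lib/mongo/
    sudo systemctl start mongod

    # On the stale member: start mongod on the copied files; it resumes
    # replicating from the point in time captured in the copy
    sudo systemctl start mongod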
A step-by-step procedure is available in Resync a Member of a Replica Set.
Related
I'm running a master & replica on PG 13.3. I decided to use delayed replication (30 minutes, configured via the recovery_min_apply_delay parameter). On top of that, WAL archiving is configured and working well.
When the load on the master is very high for a long time, replication falls behind until max_slot_wal_keep_size is exceeded (see my other, related question: Replication lag - exceeding max_slot_wal_keep_size, WAL segments not removed). Once it falls too far behind, the slot is "lost" and the replica falls back to restoring WAL from the archive. So far so good. The problem is, it never tries replication again. Restarting the replica does not help.
There are two ways I managed to restore replication:
Restarts & config edits
Remove the delay config from the replica
Restart Postgres. It then restores all the WAL from the archive, and once there is nothing left it starts replication again, but without any delay. Then I edit the config again to reintroduce the delay, and it sometimes works, sometimes doesn't. I think it depends on the load.
Removing a WAL segment from archive
Look at the currently restored WAL segments in the postgresql log and temporarily move the following segment out of the WAL archive. When PG tries to restore it, it fails and falls back to replication.
This doesn't seem like the right way to do it, does it?
Thanks,
-- Marcin
As far as I can see, this is a non-problem.
If you want replication delayed by 30 minutes, and you archive more than one 16MB WAL segment per half hour, there is no need to replicate. The information can just as well be read from the archive. If the latest entry in the latest archived WAL segment happens to be older than recovery_min_apply_delay, the standby will contact the primary and replicate.
If you insist on replication rather than archive recovery, remove restore_command and max_slot_wal_keep_size from the configuration. But I don't see the point.
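If you do go that route, a hedged sketch (this assumes the settings were applied with ALTER SYSTEM; if they live in postgresql.conf, edit the file instead):

    # On the standby: stop falling back to the archive
    psql -c "ALTER SYSTEM RESET restore_command;"
    # restore_command cannot be changed with a reload on PG 13, so restart
    sudo systemctl restart postgresql

    # On the primary: remove the cap on WAL retained for the slot
    psql -c "ALTER SYSTEM RESET max_slot_wal_keep_size;"
    psql -c "SELECT pg_reload_conf();"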
If you are concerned about losing the active WAL segment in case of a catastrophe on the primary, use pg_receivewal rather than archive_command to populate the WAL archive.
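For example (host, user, slot name, and archive path are all placeholders):

    # One-time: create a physical replication slot so the primary keeps WAL
    # until the archiver has received it
    pg_receivewal -h primary-host -U repl_user --slot=archiver_slot --create-slot

    # Run continuously, streaming WAL into the archive directory
    pg_receivewal -h primary-host -U repl_user --slot=archiver_slot -D /path/to/wal_archive

Unlike archive_command, which only ships completed 16MB segments, pg_receivewal streams WAL as it is generated, so the archive stays current to within seconds.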
I am trying to set up logical replication between 2 cloud instances, both with Debian 9 and PG 11.1. The command CREATE PUBLICATION on the master was successful, but when I start the command CREATE SUBSCRIPTION on the intended logical replica, the command hangs indefinitely.
On the master I can see that the replication slot was created and is active, I can see a new walsender process created and "waiting", and in the master's log I see these lines:
2019-01-14 14:20:39.924 UTC [8349] repl_user#db LOG: logical decoding found initial starting point at 7B0/6C777D10
2019-01-14 14:20:39.924 UTC [8349] repl_user#db DETAIL: Waiting for transactions (approximately 2) older than 827339177 to end.
But that is all. The command CREATE SUBSCRIPTION never ends.
The master is a DB with heavy inserts, hundreds per minute, but they are all committed promptly, so there should not be any long-running uncommitted transactions.
I tried to google for this problem but did not find anything. What am I missing?
Since the databases are “in the cloud”, you don't know where they really are.
Odds are that they are actually in the same database cluster, which would explain the deadlock you see: CREATE SUBSCRIPTION waits until all concurrent transactions on the cluster that contains the replication source database are finished before it can create its replication slot, but since both databases are in the same cluster, it waits for itself to finish, which obviously won't happen.
The solution is to explicitly create a logical replication slot in the source database and to use that existing slot when you create the subscription.
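A sketch of that workaround (the subscription, slot, publication, and connection details are placeholders):

    # On the source (publisher) database: create the slot manually
    psql -d db -c "SELECT pg_create_logical_replication_slot('my_sub_slot', 'pgoutput');"

    # On the subscriber: reuse the existing slot instead of letting
    # CREATE SUBSCRIPTION create one (the step that blocks here)
    psql -d db -c "
      CREATE SUBSCRIPTION my_sub
        CONNECTION 'host=master-host dbname=db user=repl_user'
        PUBLICATION my_pub
        WITH (create_slot = false, slot_name = 'my_sub_slot');"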
Aerospike supports ACID in a clustered environment with a replication factor greater than 1: every write is applied to the master and the replica, and only then is it acknowledged as a success to the client.
But we can change this default behaviour by changing write.commit_level from all to master.
In that case, suppose the write/update succeeds on the master node and the client is notified, but the write fails on the replica node. What happens then?
Will Aerospike have inconsistent data for the same key in the cluster?
Or will it be retried at Replica?
Or will the write on the Master be rolled back?
Note that the replica node is not down; the write just failed for some reason, such as the stop-writes-pct threshold being breached on the replica node.
If you choose write.commit_level=master and the prole (replica) write fails, the client will not be notified of the failure. The replica will be left inconsistent with the master, and the master write will not be rolled back. The replica gets fixed on the next write that replicates successfully, i.e., it is overwritten with the latest record.
BTW, an important thing to note is that stop-writes is honored on the master, not on the replica; it would be a bad idea to fail the replica write because of it. As long as you have some headroom in memory (no malloc failures) and disk, there is hardly any chance of a replica write failing when the node itself did not go down.
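If you want to check how close a node is to stop-writes, the namespace statistics expose a stop_writes flag; a quick sketch with asinfo (the host and the namespace name test are placeholders, and the exact stat names should be verified against your server version):

    # Dump namespace statistics on a node and look at the stop_writes flag
    asinfo -h node-host -v 'namespace/test' | tr ';' '\n' | grep stop_writes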
I have a 3-member replica set in MongoDB which fell apart when I was reconfiguring the host names of the server instances. I had to reconfigure the replica set, but I am curious how MongoDB handles records that are not synced across all the members.
Case 1) There is a new record on the MongoDB server that I access to reconfigure the set.
Case 2) There is a new record on another MongoDB server that is added later to the replica set.
Each replica-set has one primary node and one or more secondary nodes.
All writes happen on the primary. The primary then sends these changes to the secondaries (the list of changes is referred to as "the oplog"). That means the primary is always the member with the most up-to-date data.
When the primary is suddenly unreachable, the replica-set is put into read-only mode and an election takes place to find a new primary. Usually the secondary which is most up-to-date is selected (more details on replica-set election). Any writes which were not propagated to that secondary yet are lost.
When the old primary goes back online, it re-joins the replica-set as a secondary. Its data gets synchronized to the state of the new primary. Any writes which only happened on the old primary which weren't propagated to the new primary before the crash are rolled back.
The rolled-back writes are backed up as .bson files in the rollback directory under the member's data path and can be re-added to the replica set using bsondump and mongorestore. Details about this procedure can be found in the article Rollbacks During Replica Set Failover.
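A sketch of that procedure (the file names and paths are illustrative; the actual .bson files sit in the rollback directory under the member's data path):

    # Inspect what was rolled back
    bsondump /data/db/rollback/mydb.mycoll.2019-03-01T12-00-00.0.bson

    # After review, re-apply the documents through the current primary
    mongorestore --host primary-host --db mydb --collection mycoll \
        /data/db/rollback/mydb.mycoll.2019-03-01T12-00-00.0.bson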
I am trying to shrink the size of my MongoDB replica set (the collections are the same size but disk usage keeps growing). According to the MongoDB website, I should just run mongod --repair on the master node to compact all collections. The problem would be downtime for the website. So I have two options (that I know about):
Take the secondary node out of the replica set, run mongod --repair on it, and restart it back in the replica set. I tried this and couldn't get past permission errors on the 'local' collection.
Shut down the secondary node and delete all files in its data directory. Restart mongo and let it recover from the master. This actually worked for me, but my only concern is: what if the journal collection is full? Since it's a capped collection, will you only receive the data that is in the journal, or will you actually copy over all of the master's data?
Has anyone else run into this scenario? I'm surprised by the lack of information when trying to search for this.
Take the secondary node out of the replica set, run mongod --repair on it, and restart it back in the replica set.
This is a common practice, usually referred to as a "rolling repair": you take each secondary out of the replica set and repair it, and eventually step down the primary for repair as the last step. As long as you always have a majority of your replica set nodes available, this approach minimizes potential downtime.
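A sketch of one pass of a rolling repair (service name, paths, and the mongod user are placeholders):

    # On each secondary in turn: stop it, repair it offline, bring it back,
    # and wait for it to return to SECONDARY before moving on
    sudo systemctl stop mongod
    sudo -u mongod mongod --repair --dbpath /var/lib/mongo
    sudo systemctl start mongod

    # Last step: ask the primary to step down, then repair it the same way
    mongo --eval 'rs.stepDown()'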
If you are frequently deleting data, you should consider the new PowerOf2Sizes collection option in MongoDB 2.2. It changes the allocation method to allocate document space in powers of two (e.g., a 500-byte document is allocated 512 bytes), which allows more effective reuse of the space left by deleted documents (at the slight expense of a few extra bytes per document).
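For example, to switch an existing collection over (database and collection names are placeholders; the new allocation strategy applies to documents inserted or moved after the change):

    # Enable power-of-two record allocation on an existing collection (MongoDB 2.2+)
    mongo mydb --eval 'db.runCommand({ collMod: "mycoll", usePowerOf2Sizes: true })'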
I tried this and couldn't get past permission errors on the 'local' collection.
Permission errors on the 'local' collection sound like file system permissions (i.e., based on the user you were running your mongod as). You should run the repair process as the same user, as in the sketch above.
Shut down the secondary node and delete all files in its data directory. Restart mongo and let it recover from the master. This actually worked for me, but my only concern is: what if the journal collection is full? Since it's a capped collection, will you only receive the data that is in the journal, or will you actually copy over all of the master's data?
It sounds like you are conflating the journal, which is used for durability and crash recovery, with the oplog, which is used for replication.
If you resync a node from the primary, all data will be copied over. During this initial period the node will be in the RECOVERING state and is not considered a "healthy" node (i.e., available for queries). Once the node has caught up, it changes to the normal SECONDARY state, at which point the oplog is used for ongoing sync.
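You can watch that transition from the mongo shell; a quick sketch:

    # Print each member's name and replication state (RECOVERING -> SECONDARY)
    mongo --eval 'rs.status().members.forEach(function(m) { print(m.name + " " + m.stateStr); })'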
Some further reading:
Replication fundamentals
Replica set status reference