I am dealing with MongoDB rollback procedures. The problem is that the data to be rolled back can be 300 MB or more.
Is there any solution for this problem? The error log shows:
replSet syncThread: replSet too much data to roll back
In the official MongoDB documentation, I could not find a solution.
Thanks for the answers.
The cause
The page Rollbacks During Replica Set Failover states:
A rollback is necessary only if the primary had accepted write operations that the secondaries had not successfully replicated before the primary stepped down. When the primary rejoins the set as a secondary, it reverts, or “rolls back,” its write operations to maintain database consistency with the other members.
and:
When a rollback does occur, it is often the result of a network partition.
In other words, a rollback scenario typically occurs like this:
1. You have a three-node replica set: primary-secondary-secondary.
2. A network partition separates the current primary from the two secondaries.
3. The two secondaries cannot see the former primary and elect one of themselves as the new primary. Applications that are replica-set aware now write to the new primary.
4. However, some writes keep arriving at the old primary before it realizes it cannot see the rest of the set and steps down.
The data written to the old primary in step 4 are the data that get rolled back, since for a period of time it was acting as a "false" primary (i.e., the "real" primary is the secondary elected in step 3).
The fix
MongoDB will not perform a rollback if there is more than 300 MB of data to roll back. The message you are seeing (replSet too much data to roll back) means you are hitting this limit, and you would either have to manually save the data in the rollback directory under the node's dbpath, or resync the node with an initial sync.
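When MongoDB does perform a rollback, the reverted documents are written as BSON files under the node's dbpath, where they can be reviewed with the bsondump tool. A sketch of inspecting them (paths and file names below are illustrative; yours will differ):

```shell
# Rollback files live in the "rollback" directory under the dbpath,
# one BSON file per affected collection:
ls /data/db/rollback/
# e.g. mydb.mycoll.2020-01-01T00-00-00.0.bson

# bsondump (from the MongoDB database tools) converts a file to JSON
# so a DBA can review and, if needed, re-import the rolled-back documents:
bsondump /data/db/rollback/mydb.mycoll.2020-01-01T00-00-00.0.bson
```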
Preventing rollbacks
Configuring your application to write with w: majority (see Write Concern and Write Concern for Replica Sets) would prevent rollbacks. See Avoid Replica Set Rollbacks for more details.
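A minimal sketch of such a write in the mongo shell (the collection name and document are illustrative). The insert is acknowledged only after a majority of the voting, data-bearing members have it, so an acknowledged write cannot later be rolled back:

```javascript
// Shell sketch: require majority acknowledgement, with a timeout so the
// call fails fast instead of blocking forever if a majority is unreachable.
db.orders.insertOne(
  { item: "abc", qty: 1 },
  { writeConcern: { w: "majority", wtimeout: 5000 } }
);
```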
Related
Consider these circumstances when a client writes to a server in replica set mode:
Successful write & acknowledgement
Unsuccessful write & error.
If 1 happens but the primary goes down right afterwards, before the data reaches the secondary nodes, there is trouble. When the old primary rejoins, it rolls back: even though the client got an acknowledgement, the data is discarded.
Question
Why does it roll back instead of sending the data to the remaining nodes when the old primary comes back? Does this happen because of an election? And what if the result of the election is the same node?
Conjecture: the server goes down, an election is triggered, and a different server takes its place. When the old primary catches up with the new primary, the written operation is not in the new primary's oplog, so I guess the two continue with diverging oplogs?
I know we can change this behaviour using majority, but I would like to understand why this rollback happens.
Any ideas?
MongoDB implements single-master replication, which mandates that only one server is the authoritative source of replication at any time. If MongoDB were to replicate the rolled-back data instead, it would have to merge it into the new primary, which is complicated and error-prone, as the data could have been changed multiple times while the old primary was down.
When a primary goes down and later rejoins the cluster, it reconciles its own copy of the oplog with the one that is in the server that is currently the primary. Since other write operations could have happened in the meantime on the new primary, the new authoritative source of replication is the oplog of the new primary. So, the old primary has to purge its oplog of any operations that are not present in the oplog of the new primary and these are rolled back.
If no primary was available in the cluster when the server rejoins, election takes care of selecting the server with the newest copy of the data (based on the timestamp of the last operation in the oplog). This becomes the new primary and all other servers will sync their oplog to this. So, if the old primary becomes primary again and no newer writes happened in the cluster, then it will not rollback.
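The reconciliation described above can be sketched as a toy model (this is an illustration, not MongoDB's actual algorithm): treat each oplog as an ordered list of operation ids, find the last common point, and everything the old primary has after that point is rolled back, while everything the new primary has after it must be synced:

```javascript
// Toy model of oplog reconciliation between an old and a new primary.
function commonPrefixLength(a, b) {
  let i = 0;
  while (i < a.length && i < b.length && a[i] === b[i]) i++;
  return i;
}

function reconcile(oldPrimaryOplog, newPrimaryOplog) {
  const common = commonPrefixLength(oldPrimaryOplog, newPrimaryOplog);
  return {
    rolledBack: oldPrimaryOplog.slice(common), // ops only the old primary has
    toApply: newPrimaryOplog.slice(common),    // ops it must sync from the new primary
  };
}

// Ops 1-3 replicated everywhere; op4 reached only the old primary before
// the partition, while the new primary accepted op5 and op6 afterwards.
const result = reconcile(
  ["op1", "op2", "op3", "op4"],
  ["op1", "op2", "op3", "op5", "op6"]
);
console.log(result.rolledBack); // [ 'op4' ]
console.log(result.toApply);    // [ 'op5', 'op6' ]
```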
Rolled-back data is not lost: it is set aside in a file so that it can be examined and, if needed, recovered by DBAs. However, you should consider the nature of the data you are storing and, if it is crucial that rollbacks never happen, use an appropriate write concern to ensure additional copies are made so the data is never lost.
My application is essentially a bunch of microservices deployed across Node.js instances. One service might write some data while a different service will read those updates. (specific example, I'm processing data that is inbound to my solution using a processing pipeline. Stage 1 does something, stage 2 does something else to the same data, etc. It's a fairly common pattern)
So, I have a large data set (~250 GB now, and I've read that once a DB gets much larger than this, it is impossible to introduce sharding, at least not without some major hoop-jumping). I want a highly available DB, so I'm planning a replica set with at least one secondary and an arbiter.
I am still researching my 'sharding' options, but I think that I can shard my data by the 'client' that it belongs to and so I think it makes sense for me to have 3 shards.
First question, if I am correct, if I have 3 shards and my replica set is Primary/Secondary/Arbiter (with Arbiter running on the Primary), I will have 6 instances of MongoDB running. There will be three primaries and three secondaries (with the Arbiter running on each Primary). Is this correct?
Second question. I've read conflicting info about what 'majority' means... If I have a Primary and Secondary and I'm writing using the 'majority' write acknowledgement, what happens when either the Primary or Secondary goes down? If the Arbiter is still there, the election can happen and I'll still have a Primary. But, does Majority refer to members of the replication set? Or to Secondaries? So, if I only have a Primary and I try to write with 'majority' option, will I ever get an acknowledgement? If there is only a Primary, then 'majority' would mean a write to the Primary alone triggers the acknowledgement. Or, would this just block until my timeout was reached and then I would get an error?
Third question... I'm assuming that as long as I do writes with 'majority' acknowledgement and do reads from all the Primaries, I don't need to worry about causally consistent data? I've read that doing reads from 'Secondary' nodes is not worth the effort. If reading from a Secondary, you have to worry about 'eventual consistency' and since writes are getting synchronized, the Secondaries are essentially seeing the same amount of traffic that the Primaries are. So there isn't any benefit to reading from the Secondaries. If that is the case, I can do all reads from the Primaries (using 'majority' read concern) and be sure that I'm always getting consistent data and the sharding I'm doing is giving me some benefits from distributing the load across the shards. Is this correct?
Fourth (and last) question... When are causally consistent sessions worthwhile? If I understand correctly, and I'm not sure that I do, then I think it is when I have a case like a typical web app (not some distributed application, like my current one), where there is just one (or two) nodes doing the reading and writing. In that case, I would use causally consistent sessions and do my writes to the Primary and reads from the Secondary. But, in that case, what would the benefit of reading from the Secondaries be, anyway? What am I missing? What is the use case for causally consistent sessions?
if I have 3 shards and my replica set is Primary/Secondary/Arbiter (with Arbiter running on the Primary), I will have 6 instances of MongoDB running. There will be three primaries and three secondaries (with the Arbiter running on each Primary). Is this correct?
A replica set Arbiter is still an instance of mongod. It's just that an Arbiter does not have a copy of the data and cannot become a Primary. You should have 3 instances per shard, which means 9 instances in total.
Since you mentioned that you would like a highly available database deployment, please note that the minimum recommended replica set configuration for a production deployment is a Primary with two Secondaries.
If I have a Primary and Secondary and I'm writing using the 'majority' write acknowledgement, what happens when either the Primary or Secondary goes down?
When either the Primary or the Secondary becomes unavailable, a w: majority write will either:
Wait indefinitely,
Wait until the unavailable node is restored, or
Fail with a timeout.
This is because an Arbiter carries no data and cannot acknowledge writes, yet it is still counted as a voting member. See also Write Concern for Replica Sets.
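The arithmetic behind this can be sketched as follows (an illustration of how the majority is counted, not driver code): the required majority is computed over all voting members, arbiter included, but only data-bearing members can actually acknowledge a write.

```javascript
// Can a w:"majority" write be acknowledged, given the number of voting
// members and the number of data-bearing members currently up?
function majorityWriteSucceeds(votingMembers, dataBearingUp) {
  const needed = Math.floor(votingMembers / 2) + 1; // majority of voters
  return dataBearingUp >= needed; // only data-bearing nodes can ack
}

// PSA (Primary-Secondary-Arbiter): 3 voting members, so 2 acks are needed.
console.log(majorityWriteSucceeds(3, 2)); // true  - both data nodes up
console.log(majorityWriteSucceeds(3, 1)); // false - secondary down; the
// arbiter cannot acknowledge, so the write waits or times out
```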
I can do all reads from the Primaries (using 'majority' read concern) and be sure that I'm always getting consistent data and the sharding I'm doing is giving me some benefits from distributing the load across the shards
Correct. MongoDB sharding scales horizontally by distributing load across shards, while MongoDB replication provides high availability.
If you read only from the Primary and also specify readConcern: majority, the application reads data that has been acknowledged by a majority of the replica set members. This data is durable in the event of a partition (i.e., it will not be rolled back). See also Read Concern "majority".
What is the use case for causally consistent sessions?
Causal Consistency is used if the application requires an operation to be logically dependent on a preceding operation (causal). For example, a write operation that deletes all documents based on a specified condition and a subsequent read operation that verifies the delete operation have a causal relationship. This is especially important in a sharded cluster environment, where write operations may go to different replica sets.
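A minimal mongosh sketch of such a session (the database and collection names are illustrative). Operations in the session are causally ordered, so the read is guaranteed to observe the preceding delete even if the two operations hit different shards or members:

```javascript
// mongosh sketch: tie a read to the write that precedes it.
const session = db.getMongo().startSession({ causalConsistency: true });
const orders = session.getDatabase("mydb").orders;

orders.deleteMany({ status: "cancelled" });
// Within the session, this count reflects the delete above.
const leftover = orders.countDocuments({ status: "cancelled" });

session.endSession();
```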
I have been reading the docs and from my understanding I could see a scenario whereby a rollback could still occur:
1. A write goes to the primary, which confirms that the journal has been written to disk.
2. A majority of the secondaries confirm the write but have not yet written it to disk.
3. Power fails on the entire cluster.
4. The primary, for some reason, does not start back up when power is restored.
5. A secondary takes the primary role.
6. The original primary finally starts, rejoins the set as a secondary, and rolls back.
Is this scenario plausible?
This could be a plausible case for a rollback, yes, if the power fails between the other members receiving the write and writing it to disk.
In this case, as you point out, the primary could not start back up, and so, once back up, it would contain operations that the rest of the set could not validate, causing a rollback.
It is also worth noting, as a curve ball, that if the primary were not to go down, it would return a successful write, and the application would be none the wiser that the set went down and its {w: majority} write wasn't written to disk. This is, of course, an edge case.
I don't think it will happen in MongoDB 3.2+, as documented here:
Changed in version 3.2: With j: true, MongoDB returns only after the requested number of members, including the primary, have written to the journal. Previously j: true write concern in a replica set only requires the primary to write to the journal, regardless of the w: write concern.
Based on the docs, my understanding is that if you set j: 1, then w > 1 doesn't matter: your app will have the write acknowledged as soon as the primary has committed the write to its own journal. Writes to the replicas will happen but don't factor into your write concern.
In light of this, the scenario of "the primary commits to its journal, acks the write, and the cluster goes down before the secondaries commit to their journals, rolling back the primary when a secondary comes back up as primary" is more likely (though still very unlikely) than the original question implies.
From the docs:
Requiring journaled write concern in a replica set only requires a journal commit of the write operation to the primary of the set regardless of the level of replica acknowledged write concern.
http://docs.mongodb.org/manual/core/write-concern/
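On 3.2+, the two concerns can be combined so that a majority of members must journal the write before it is acknowledged. A shell sketch (the collection name and document are illustrative):

```javascript
// Require both majority replication and journal durability, with a
// timeout so the call errors out rather than blocking indefinitely.
db.events.insertOne(
  { type: "audit" },
  { writeConcern: { w: "majority", j: true, wtimeout: 5000 } }
);
```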
The MongoDB docs about concurrency state that the DB is "write greedy". That much I understand. However, they do not say what locks do to secondaries in a replica set.
In my use case, which sees about 40 writes per 100 queries, I do not need the most recent document at all times; a lag of 5-10 seconds is fine with me, which is how far the secondaries in a replica set would be behind the primary. Now, if the write lock locks the primary as well as the replicas, then I am locked out of reads on the secondaries too.
I wanted to know whether writes will lock read operations on the secondaries as well.
In a replica set, SECONDARY servers are not affected by the write lock on the PRIMARY.
You can see the status of your servers by using mongotop or mongostat.
The locks are per mongod instance. That means the read/write locks affect operations only on the primary. The secondaries read the oplog from the primary and replay the primary's operations.
You can read much more details on their manual about concurrency.
In my test environment:
node1: shard1 primary, shard2 primary
node2: shard1 secondary, shard2 secondary
node3: shard1 arbiter, shard2 arbiter
I wrote a multi-threaded client to write concurrently to the sharded replica set. After one hour (the primary had 6 GB of data),
I found the secondary's status was: RECOVERING.
I checked the secondary's log, which said: stale data from primary oplog.
So was the reason that my write requests were too frequent, so that the secondary could not replicate in time?
Or is there another reason?
I'm puzzled...
Thanks in advance
This situation can happen if the size of the OpLog is not sufficient to keep a record of all the operations occurring on the primary, or if the secondary simply cannot keep up with the primary. What happens in that case is that the position in the OpLog where the secondary is will be overwritten by new inserts from the primary. At that point the secondary will report that its status is Recovering, and you will see an RS102 message in the log, indicating that it is too stale to catch up.
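The mechanism can be sketched as a toy model (an illustration, not MongoDB's implementation): the oplog behaves like a fixed-size ring buffer, so once the primary has written more operations than the buffer holds, the oldest entries are discarded, and a secondary still pointing at a discarded entry has gone stale:

```javascript
// Toy model of the oplog as a ring buffer of operation positions.
function oldestRetainedOp(totalOpsWritten, oplogCapacity) {
  return Math.max(0, totalOpsWritten - oplogCapacity);
}

// A secondary is stale if the entry it needs next was already overwritten.
function secondaryIsStale(secondaryPosition, totalOpsWritten, oplogCapacity) {
  return secondaryPosition < oldestRetainedOp(totalOpsWritten, oplogCapacity);
}

// Oplog holds 1000 ops; the primary has written 1500, so ops 0-499 are
// gone. A secondary at op 200 is stale; one at op 600 can still catch up.
console.log(secondaryIsStale(200, 1500, 1000)); // true  -> Recovering (RS102)
console.log(secondaryIsStale(600, 1500, 1000)); // false -> can still catch up
```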
To fix the issue you would need to follow the steps outlined in the documentation.
In order to prevent the problem from happening in the future, you would need to tune the size of the OpLog and make sure that the secondaries run on hardware equivalent to the primary's.
To help tune the OpLog you can look at the output of db.printReplicationInfo() which will tell you how much time you have in your OpLog. The documentation outlines how to resize the OpLog if it is too small.
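The sizing arithmetic is simple, in the spirit of what db.printReplicationInfo() reports (the numbers below are illustrative, not from a real deployment): the oplog window is how long the oldest entry survives at a given rate of oplog churn, and a secondary lagging longer than that can no longer catch up.

```javascript
// Estimate the oplog window: oplog size divided by write (churn) rate.
function oplogWindowHours(oplogSizeMB, writeRateMBPerHour) {
  return oplogSizeMB / writeRateMBPerHour;
}

// A 10 GB oplog with ~500 MB/h of churn keeps roughly 20 hours of history.
console.log(oplogWindowHours(10 * 1024, 500)); // 20.48
```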