mongodb rollback of primary when write concern is 1 - mongodb

Take this circumstances when a client writes to a server in replica set mode:
Successful write & acknowledgement
Unsuccessful and error.
If 1. happens but right after that the primary goes down - before sending data to secondary nodes, there will be troubles. When going back in, it will roll back and although the client got an acknowledgement, the data is dismissed.
Question
why does it roll back instead of sending the data to the remaining nodes when the primary is back in? Does this happen because of an election? And what if the result of the election is the same node?
Conjecture: The server goes down, triggers an election, and a different server takes place. When it catches up the new primary, the written message it's not in the oplog, and I guess they continue with different oplogs?
I know we can change this behaviour using majority but would like to understand why this roll back happens.
Any ideas?

MongoDB implements single-master replication, which mandates that only one server is the authoritative source of replication at any time. If it were to replicate the data that is rolled back, it would have to merge it into the new primary and this is complicated and error-prone as the data could have been changed multiple times while the old primary was down.
When a primary goes down and later rejoins the cluster, it reconciles its own copy of the oplog with the one that is in the server that is currently the primary. Since other write operations could have happened in the meantime on the new primary, the new authoritative source of replication is the oplog of the new primary. So, the old primary has to purge its oplog of any operations that are not present in the oplog of the new primary and these are rolled back.
If no primary was available in the cluster when the server rejoins, election takes care of selecting the server with the newest copy of the data (based on the timestamp of the last operation in the oplog). This becomes the new primary and all other servers will sync their oplog to this. So, if the old primary becomes primary again and no newer writes happened in the cluster, then it will not rollback.
Rolled back data is not lost but put aside on a file so that it can be examined and eventually recovered by DBAs if needed. However, you should consider the nature of the data you are storing and, if it is crucial that rollbacks never happen, then use the appropriate write concern to ensure additional copies are made to guarantee it is never lost.

Related

Can MongoDB manage RollBack procedure more then 300 MB Streaming Data?

I am dealing with rollback procedures of MongoDB. Problem is rollback for huge data may be bigger than 300 MB or more.
Is there any solution for this problem? Error log is
replSet syncThread: replSet too much data to roll back
In official MongoDB document, I could not see a solution.
Thanks for the answers.
The cause
The page Rollbacks During Replica Set Failover states:
A rollback is necessary only if the primary had accepted write operations that the secondaries had not successfully replicated before the primary stepped down. When the primary rejoins the set as a secondary, it reverts, or “rolls back,” its write operations to maintain database consistency with the other members.
and:
When a rollback does occur, it is often the result of a network partition.
In other words, rollback scenario typically occur like this:
You have a 3-nodes replica set setup of primary-secondary-secondary.
There is a network partition, separating the current primary and the secondaries.
The two secondaries cannot see the former primary, and elected one of them to be the new primary. Applications that are replica-set aware are now writing to the new primary.
However, some writes keep coming into the old primary before it realized that it cannot see the rest of the set and stepped down.
The data written to the old primary in step 4 above are the data that are rolled back, since for a period of time, it was acting as a "false" primary (i.e., the "real" primary is supposed to be the elected secondary in step 3)
The fix
MongoDB will not perform a rollback if there are more than 300 MB of data to be rolled back. The message you are seeing (replSet too much data to roll back) means that you are hitting this situation, and would either have to save the rollback directory under the node's dbpath, or perform an initial sync.
Preventing rollbacks
Configuring your application using w: majority (see Write Concern and Write Concern for Replica Sets) would prevent rollbacks. See Avoid Replica Set Rollbacks for more details.

Why is MongoDB supposed to Consistent & Partition tolerant but not Available [duplicate]

Everywhere I look, I see that MongoDB is CP.
But when I dig in I see it is eventually consistent.
Is it CP when you use safe=true? If so, does that mean that when I write with safe=true, all replicas will be updated before getting the result?
MongoDB is strongly consistent by default - if you do a write and then do a read, assuming the write was successful you will always be able to read the result of the write you just read. This is because MongoDB is a single-master system and all reads go to the primary by default. If you optionally enable reading from the secondaries then MongoDB becomes eventually consistent where it's possible to read out-of-date results.
MongoDB also gets high-availability through automatic failover in replica sets: http://www.mongodb.org/display/DOCS/Replica+Sets
I agree with Luccas post. You can't just say that MongoDB is CP/AP/CA, because it actually is a trade-off between C, A and P, depending on both database/driver configuration and type of disaster: here's a visual recap, and below a more detailed explanation.
Scenario
Main Focus
Description
No partition
CA
The system is available and provides strong consistency
partition, majority connected
AP
Not synchronized writes from the old primary are ignored
partition, majority not connected
CP
only read access is provided to avoid separated and inconsistent systems
Consistency:
MongoDB is strongly consistent when you use a single connection or the correct Write/Read Concern Level (Which will cost you execution speed). As soon as you don't meet those conditions (especially when you are reading from a secondary-replica) MongoDB becomes Eventually Consistent.
Availability:
MongoDB gets high availability through Replica-Sets. As soon as the primary goes down or gets unavailable else, then the secondaries will determine a new primary to become available again. There is an disadvantage to this: Every write that was performed by the old primary, but not synchronized to the secondaries will be rolled back and saved to a rollback-file, as soon as it reconnects to the set(the old primary is a secondary now). So in this case some consistency is sacrificed for the sake of availability.
Partition Tolerance:
Through the use of said Replica-Sets MongoDB also achieves the partition tolerance: As long as more than half of the servers of a Replica-Set is connected to each other, a new primary can be chosen. Why? To ensure two separated networks can not both choose a new primary. When not enough secondaries are connected to each other you can still read from them (but consistency is not ensured), but not write. The set is practically unavailable for the sake of consistency.
As a brilliant new article showed up and also some awesome experiments by Kyle in this field, you should be careful when labeling MongoDB, and other databases, as C or A.
Of course CAP helps to track down without much words what the database prevails about it, but people often forget that C in CAP means atomic consistency (linearizability), for example. And this caused me lots of pain to understand when trying to classify. So, besides MongoDB give strong consistency, that doesn't mean that is C. In this way, if one make this classifications, I recommend to also give more depth in how it actually works to not leave doubts.
Yes, it is CP when using safe=true. This simply means, the data made it to the masters disk.
If you want to make sure it also arrived on some replica, look into the 'w=N' parameter where N is the number of replicas the data has to be saved on.
see this and this for more information.
MongoDB selects Consistency over Availability whenever there is a Partition. What it means is that when there's a partition(P) it chooses Consistency(C) over Availability(A).
To understand this, Let's understand how MongoDB does replica set works. A Replica Set has a single Primary node. The only "safe" way to commit data is to write to that node and then wait for that data to commit to a majority of nodes in the set. (you will see that flag for w=majority when sending writes)
Partition can occur in two scenarios as follows :
When Primary node goes down: system becomes unavailable until a new
primary is selected.
When Primary node looses connection from too many
Secondary nodes: system becomes unavailable. Other secondaries will try to
elect a new Primary and current primary will step down.
Basically, whenever a partition happens and MongoDB needs to decide what to do, it will choose Consistency over Availability. It will stop accepting writes to the system until it believes that it can safely complete those writes.
Mongodb never allows write to secondary. It allows optional reads from secondary but not writes. So if your primary goes down, you can't write till a secondary becomes primary again. That is how, you sacrifice High Availability in CAP theorem. By keeping your reads only from primary you can have strong consistency.
I'm not sure about P for Mongo. Imagine situation:
Your replica gets split into two partitions.
Writes continue to both sides as new masters were elected
Partition is resolved - all servers are now connected again
What happens is that new master is elected - the one that has highest oplog, but the data from the other master gets reverted to the common state before partition and it is dumped to a file for manual recovery
all secondaries catch up with the new master
The problem here is that the dump file size is limited and if you had a partition for a long time you can loose your data forever.
You can say that it's unlikely to happen - yes, unless in the cloud where it is more common than one may think.
This example is why I would be very careful before assigning any letter to any database. There's so many scenarios and implementations are not perfect.
If anyone knows if this scenario has been addressed in later releases of Mongo please comment! (I haven't been following everything that was happening for some time..)
Mongodb gives up availability. When we talk about availability in the context of the CAP theorem, it is about avoiding single points of failure that can go down. In mongodb. there is a primary router host. and if that goes down,there is gonna be some downtime in the time that it takes for it to elect a new replacement server to take its place. In practical, that is gonna happen very qucikly. we do have a couple of hot standbys sitting there ready to go. So as soon as the system detects that primary routing host went down, it is gonna switch over to a new one pretty much right away. Technically speaking it is still single point of failure. There is still a chance of downtime when that happens.
There is a config server, that is the primary and we have an app server, that is primary at any given time. even though we have multiple backups, there is gonna be a brief period of downtime if any of those servers go down. the system has to first detect that there was an outage and then remaining servers need to reelect a new primary host to take its place. that might take a few seconds and this is enough to say that mongodb is trading off the availability

What happens if a failed server return back from failover?

I was going to replication procedure in mongodb, which has the feature of automatic failover. I tried to make my question in a picture format.
My question is what happens if the failed server is back online after failover? This seems like a common practise in database. This part is totally new to me, so please share your suggestions?
Before we jump to the answer, will setup some background:
Each server in a replica set maintains oplog. It is log for changes happening on the server. It is a capped collection. Which means that it only maintains recent changes.
Primary should have latest oplog. Each secondary syncs itself using oplog of other servers. So it fetches oplog of primary or other secondaries to update itself. It could happen that it lags too much behind, i.e. other secondaries or primary doesn't have intermediate logs. This could happen because of capped nature of oplogs. In such a scenario, secondary does Initial Sync, i.e. full data synch with other servers.
Now coming to your question, what is the state of primary when it comes back. Answer will depend on the state of primary before it fails. Below could be the two cases:
Case 1. Primary, which is about to fail, has the latest changes. Also, at least one of the secondaries has the latest changes. Now Primary fails. The secondary which has the latest changes, will become primary. It will take all changes from mongos or driver. Now what if the earlier primary comes back. It will be a secondary. And will sync itself based on the point 2 mentioned above.
Case 2. Primary, which is about to fail, has the latest changes. And ALL secondaries lag behind primary by couple of changes. Now Primary fails. The newly elected primary will not have the latest changes in this case. But it assumes that it has the latest changes. And it starts behaving as per that. Now if earlier primary comes back, it will be secondary. It will realise that its oplog is behind as well ahead. Behind because changes happened while it was down. Ahead because it has some changes which are not replicated on other servers. So it will remove the changes which makes its ahead. It will use oplog for this process. This process is called Rollback. But before rolling back, it will log reverted changes in a system rollback file. Now it is not ahead. But behind others. And will sync itself based on the point 2 mentioned above. There is a caveat that administrator will have to manually apply changes from system rollback file. Rollback details can be found at: http://docs.mongodb.org/manual/core/replica-set-rollbacks/

what if mongodb failover take place when replication havnt not done yet

Im new to mongodb, I have one question:
Im setting up a mongoDB test env, with one mongos, 3 conf server, 2 shards ( 3 server as replication set for a shard )
let's say, for reasons I have a big lag of replication (like secondary is backing up, or network issue. or something else happening. in case it happens)
during this time, the primary server is down,
what will happen ? Auto fail-over select one secondary db as new primary, how about these data havnt replicated yet ?
are we gonna lost data ?
if so, how can we do to get data back and what need to be done, to avoid such issue.
thanks a lot.
during this time, the primary server is down, what will happen ?
No writes will be allowed
Auto fail-over select one secondary db as new primary, how about these data havnt replicated yet ?
If data has no replicated from the primary to the secondary which then becomes primary then a rollback will occur when the primary comes back into the set as a secondary: http://docs.mongodb.org/manual/core/replica-set-rollbacks/
Of course as to whether you lose data or not depends on whether the write went to journal and/or data files and whether the member just left the set or crashed. If the member crashed before the write could go to journal then there is a chance that the write could be lost, yes.
how can we do to get data back and what need to be done, to avoid such issue.
You can use w=majority for most cases but this will still have certain pitfalls within edge cases that just cannot be taken care of, for example if the primary fails over before the write can be propogated to other members etc.

Where does mongodb stand in the CAP theorem?

Everywhere I look, I see that MongoDB is CP.
But when I dig in I see it is eventually consistent.
Is it CP when you use safe=true? If so, does that mean that when I write with safe=true, all replicas will be updated before getting the result?
MongoDB is strongly consistent by default - if you do a write and then do a read, assuming the write was successful you will always be able to read the result of the write you just read. This is because MongoDB is a single-master system and all reads go to the primary by default. If you optionally enable reading from the secondaries then MongoDB becomes eventually consistent where it's possible to read out-of-date results.
MongoDB also gets high-availability through automatic failover in replica sets: http://www.mongodb.org/display/DOCS/Replica+Sets
I agree with Luccas post. You can't just say that MongoDB is CP/AP/CA, because it actually is a trade-off between C, A and P, depending on both database/driver configuration and type of disaster: here's a visual recap, and below a more detailed explanation.
Scenario
Main Focus
Description
No partition
CA
The system is available and provides strong consistency
partition, majority connected
AP
Not synchronized writes from the old primary are ignored
partition, majority not connected
CP
only read access is provided to avoid separated and inconsistent systems
Consistency:
MongoDB is strongly consistent when you use a single connection or the correct Write/Read Concern Level (Which will cost you execution speed). As soon as you don't meet those conditions (especially when you are reading from a secondary-replica) MongoDB becomes Eventually Consistent.
Availability:
MongoDB gets high availability through Replica-Sets. As soon as the primary goes down or gets unavailable else, then the secondaries will determine a new primary to become available again. There is an disadvantage to this: Every write that was performed by the old primary, but not synchronized to the secondaries will be rolled back and saved to a rollback-file, as soon as it reconnects to the set(the old primary is a secondary now). So in this case some consistency is sacrificed for the sake of availability.
Partition Tolerance:
Through the use of said Replica-Sets MongoDB also achieves the partition tolerance: As long as more than half of the servers of a Replica-Set is connected to each other, a new primary can be chosen. Why? To ensure two separated networks can not both choose a new primary. When not enough secondaries are connected to each other you can still read from them (but consistency is not ensured), but not write. The set is practically unavailable for the sake of consistency.
As a brilliant new article showed up and also some awesome experiments by Kyle in this field, you should be careful when labeling MongoDB, and other databases, as C or A.
Of course CAP helps to track down without much words what the database prevails about it, but people often forget that C in CAP means atomic consistency (linearizability), for example. And this caused me lots of pain to understand when trying to classify. So, besides MongoDB give strong consistency, that doesn't mean that is C. In this way, if one make this classifications, I recommend to also give more depth in how it actually works to not leave doubts.
Yes, it is CP when using safe=true. This simply means, the data made it to the masters disk.
If you want to make sure it also arrived on some replica, look into the 'w=N' parameter where N is the number of replicas the data has to be saved on.
see this and this for more information.
MongoDB selects Consistency over Availability whenever there is a Partition. What it means is that when there's a partition(P) it chooses Consistency(C) over Availability(A).
To understand this, Let's understand how MongoDB does replica set works. A Replica Set has a single Primary node. The only "safe" way to commit data is to write to that node and then wait for that data to commit to a majority of nodes in the set. (you will see that flag for w=majority when sending writes)
Partition can occur in two scenarios as follows :
When Primary node goes down: system becomes unavailable until a new
primary is selected.
When Primary node looses connection from too many
Secondary nodes: system becomes unavailable. Other secondaries will try to
elect a new Primary and current primary will step down.
Basically, whenever a partition happens and MongoDB needs to decide what to do, it will choose Consistency over Availability. It will stop accepting writes to the system until it believes that it can safely complete those writes.
Mongodb never allows write to secondary. It allows optional reads from secondary but not writes. So if your primary goes down, you can't write till a secondary becomes primary again. That is how, you sacrifice High Availability in CAP theorem. By keeping your reads only from primary you can have strong consistency.
I'm not sure about P for Mongo. Imagine situation:
Your replica gets split into two partitions.
Writes continue to both sides as new masters were elected
Partition is resolved - all servers are now connected again
What happens is that new master is elected - the one that has highest oplog, but the data from the other master gets reverted to the common state before partition and it is dumped to a file for manual recovery
all secondaries catch up with the new master
The problem here is that the dump file size is limited and if you had a partition for a long time you can loose your data forever.
You can say that it's unlikely to happen - yes, unless in the cloud where it is more common than one may think.
This example is why I would be very careful before assigning any letter to any database. There's so many scenarios and implementations are not perfect.
If anyone knows if this scenario has been addressed in later releases of Mongo please comment! (I haven't been following everything that was happening for some time..)
Mongodb gives up availability. When we talk about availability in the context of the CAP theorem, it is about avoiding single points of failure that can go down. In mongodb. there is a primary router host. and if that goes down,there is gonna be some downtime in the time that it takes for it to elect a new replacement server to take its place. In practical, that is gonna happen very qucikly. we do have a couple of hot standbys sitting there ready to go. So as soon as the system detects that primary routing host went down, it is gonna switch over to a new one pretty much right away. Technically speaking it is still single point of failure. There is still a chance of downtime when that happens.
There is a config server, that is the primary and we have an app server, that is primary at any given time. even though we have multiple backups, there is gonna be a brief period of downtime if any of those servers go down. the system has to first detect that there was an outage and then remaining servers need to reelect a new primary host to take its place. that might take a few seconds and this is enough to say that mongodb is trading off the availability