How to prevent two masters from being active at the same time when doing master election with a registry like Zookeeper, Consul, or etcd? - apache-zookeeper

tl;dr
Even if you implement master election with one of the registries such as Zookeeper, Consul, or etcd, there always seems to be a race condition where an old master does not realize it is no longer master and tries to write, while a new master is unblocked from writing, resulting in two instances of the service writing at the same time, which we want to avoid. How can we implement master election without this race condition?
Detailed problem statement
Suppose we want to implement master election for failover using one of the registries such as Zookeeper, Consul, or etcd.
Suppose there are three instances of the service S1, S2, S3, each one with a corresponding registry node on the same machine, and currently S1 is master and S2 and S3 are slaves.
Furthermore, S1, S2, S3 all store shared state in the registry, but we do not want to have multiple instances write that state at the same time because concurrent access might make the state inconsistent.
Suppose S1 is in the middle of write operations to the shared state stored in the registry. That is, it will execute more write operations before checking again whether it is leader.
Suppose there is a network partition at this point. On one side of the partition is S1. On the other side are S2 and S3, so that side has the quorum.
The registry correctly identifies S2 as the new leader and invalidates S1 as the leader.
S2 activates because it is the new leader and begins a series of write operations to the shared state stored in the registry.
The partition is healed.
At this point both S1 and S2 will execute write operations to the registry concurrently and both their write operations will succeed since the partition is healed, which might result in inconsistent state.
The sample trace is:
1. S1 is notified it is master, and begins write operations to shared state in the registry
2. Partition happens, with S1 in one partition and S2 & S3 in the other
3. S2 is identified as the new master
4. S2 is notified it is the new master, and begins write operations to shared state in the registry
5. Partition is repaired
6. S1 writes to shared state in the registry and succeeds, since there is no longer a partition
7. S2 also writes to shared state in the registry and succeeds, causing arbitrary interleaving of its writes with S1's
8. S1 is notified it is no longer master and stops writing
Thoughts
Thinking in terms of Consul's sessions: would an API write call that also takes a session ID, and only succeeds if that session still holds the leadership, solve this problem?
Is there such a call in Consul or one of the other registries?
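Consul's KV API does appear to offer something close to this: a write can be tied to a session via an acquire operation, which is rejected if a different session currently holds the key. A minimal sketch, assuming the standard github.com/hashicorp/consul/api Go client (the key name, session name, and TTL below are made up for illustration):

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Create a session that is invalidated if our health checks fail or the
	// TTL expires (e.g. while we sit on the wrong side of a partition).
	sessionID, _, err := client.Session().Create(&api.SessionEntry{
		Name: "service-leader", // hypothetical session name
		TTL:  "15s",
	}, nil)
	if err != nil {
		log.Fatal(err)
	}

	// Acquire-style write: it is rejected if a different session currently
	// holds the key, so a deposed master cannot clobber the new master's state.
	ok, _, err := client.KV().Acquire(&api.KVPair{
		Key:     "service/shared-state", // hypothetical key
		Value:   []byte("new state"),
		Session: sessionID,
	}, nil)
	if err != nil {
		log.Fatal(err)
	}
	if !ok {
		fmt.Println("write rejected: we are no longer master, stop writing")
		return
	}
	fmt.Println("write accepted while still holding the session")
}
```

A rejected acquire (or an invalidated session) is the signal for the old master to stop writing; etcd offers a similar pattern with leases and transactions, and ZooKeeper with ephemeral znodes.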

I think the problem lies between steps 5 & 6. S1 shouldn't attempt to write immediately after healing the partition since it should have lost its leadership status during the partition.
S1 should be aware of this, since it would have to re-establish connectivity to ZooKeeper.
http://zookeeper.apache.org/doc/trunk/recipes.html#sc_leaderElection has a description of the basic process. I'd look at Apache Curator for an implementation.
Or, for something non-ZooKeeper, this could be handled via timeout values.
https://github.com/Netflix/edda/blob/48de14fc185d8b2d8605c51630c0906c7e923925/src/main/scala/com/netflix/edda/aws/DynamoDBElector.scala has a nice implementation of that approach.
I haven't tried to use Consul for this yet, but I suspect it should fall into one of these two categories.
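For reference, a rough Go sketch of that leader-election recipe (assuming the go-zookeeper/zk client; a complete implementation such as Curator's also watches its predecessor znode rather than re-listing, and the "/election" parent path here is hypothetical):

```go
package main

import (
	"fmt"
	"log"
	"sort"
	"time"

	"github.com/go-zookeeper/zk"
)

func main() {
	conn, events, err := zk.Connect([]string{"127.0.0.1:2181"}, 5*time.Second)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Each candidate creates an ephemeral sequential znode under the
	// (hypothetical, pre-created) "/election" parent. The znode vanishes
	// automatically when the candidate's session is lost.
	me, err := conn.Create("/election/candidate-", nil,
		zk.FlagEphemeral|zk.FlagSequence, zk.WorldACL(zk.PermAll))
	if err != nil {
		log.Fatal(err)
	}

	// The candidate with the lowest sequence number is the leader.
	children, _, err := conn.Children("/election")
	if err != nil {
		log.Fatal(err)
	}
	sort.Strings(children)
	fmt.Println("leader?", "/election/"+children[0] == me)

	// If the session is lost (e.g. during a partition), stop acting as
	// leader immediately: another candidate may already have taken over.
	for ev := range events {
		if ev.State == zk.StateExpired || ev.State == zk.StateDisconnected {
			fmt.Println("session lost: stop writing shared state")
			return
		}
	}
}
```

The important part for the race described above is that leadership is tied to the ZooKeeper session: once the partitioned master's session expires or disconnects, it must stop writing before it ever talks to the registry again.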

Related

How does Raft prevent committed log entries from being overwritten

Figure 8 in the Raft paper
Consider a situation like Figure 8 in the Raft paper, except that in (c) the log entry from term 2 has been committed. Then S1 crashes and S5 becomes leader, and S5 sends AppendEntries RPCs to S2, S3, and S4. According to the rules, S2, S3, and S4 must replace the log entry from term 2 with the log entry from term 3, causing a committed log entry to be overwritten. How can we avoid that?
I ran into this kind of situation in the 6.824 labs; it sometimes causes me to fail the test (very infrequently, only one or two times out of hundreds of runs).
The issue is handled by the voting rules: if there is a committed item X, then a node can be elected only if it has item X in its log. Basically, committed items will never be overwritten.
In your case, if S5 doesn't have the latest committed value, it won't be able to get the majority of votes to become a leader.
Quick edit: the key property of Raft is that only sufficiently up-to-date nodes may become leaders. If a leader committed a value and died (even before other nodes learned about the committed index), the protocol guarantees that a majority of nodes have the value, so the next leader will be elected from that set.
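For concreteness, a minimal Go sketch of that voting check (the "up-to-date" comparison from §5.4.1 of the paper; the specific term/index numbers in main are illustrative, not taken from the figure):

```go
package main

import "fmt"

// logUpToDate implements the election restriction (§5.4.1): a voter grants
// its vote only if the candidate's log is at least as up-to-date as its own,
// i.e. the candidate's last term is higher, or the terms are equal and the
// candidate's log is at least as long.
func logUpToDate(candLastTerm, candLastIndex, voterLastTerm, voterLastIndex int) bool {
	if candLastTerm != voterLastTerm {
		return candLastTerm > voterLastTerm
	}
	return candLastIndex >= voterLastIndex
}

func main() {
	// Hypothetical numbers: a candidate missing the committed entry
	// (last term 3, last index 2) asks a voter that holds it
	// (last term 4, last index 3) for its vote, and is refused.
	fmt.Println(logUpToDate(3, 2, 4, 3)) // false
}
```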

How to avoid loss of internal state of a master during fail-over to new master during a network partition

I was trying to implement a simple system with a single master node and multiple backup nodes, to learn about distributed and fault-tolerant architecture.
Currently this is what my system looks like:
1. N different nodes, each one identical. 1 master node running a simple webserver.
2. All nodes communicate with each other using a simple heartbeat protocol, and each maintains global state (count of nodes available, who is master, downtime and uptime of each other).
3. If any node does not hear from the master for some set time, it raises an alarm. If a consensus is reached that the master is down, a new master is elected.
4. If the network of nodes gets partitioned and the master is in the minority partition, it will stop serving requests and go down by itself after a set period of time. The minority group cannot elect a master (a minimum number of nodes is required to make the decision).
5. A new master gets elected in the majority partition after a set time of not hearing from the old master.
Now I am stuck on a problem: in step 4 above, there is a time window where the old master is still serving requests while the new master is being elected in the majority partition.
This seems like it can cause inconsistent data across the system if some client decides to write new data to the old master. How do we avoid this issue? I would be glad if someone could point me in the right direction.
Rather than accepting writes to the minority master, what you want is to simply reject writes to the old master in that case, and you can do so by attempting to verify its mastership with a majority of the cluster on each write. If the master is on the minority side of a partition, it will no longer be able to contact a majority of the cluster and so will not be able to acknowledge clients’ requests. This brief period of unavailability is preferable to losing acknowledged writes in quorum based systems.
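A minimal sketch of that idea in Go, with a hypothetical confirm callback standing in for the real RPC (in practice this round trip is usually folded into replicating the write itself, as Raft does with AppendEntries):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// handleWrite acknowledges a client write only after a majority of the
// cluster (peers + self) confirms this node is still master. The confirm
// callback is a stand-in for a real RPC.
func handleWrite(ctx context.Context, peers []string,
	confirm func(ctx context.Context, peer string) bool) error {

	ctx, cancel := context.WithTimeout(ctx, 500*time.Millisecond)
	defer cancel()

	acks := 1 // count ourselves
	for _, p := range peers {
		if confirm(ctx, p) {
			acks++
		}
	}
	if acks <= (len(peers)+1)/2 {
		// On the minority side of a partition: refuse the write rather
		// than risk diverging from the new master.
		return errors.New("no quorum: not master any more, rejecting write")
	}
	// Safe to apply and acknowledge the write here.
	return nil
}

func main() {
	peers := []string{"node-b", "node-c"} // hypothetical peer addresses
	err := handleWrite(context.Background(), peers,
		func(ctx context.Context, peer string) bool {
			return false // simulate being cut off from both peers
		})
	fmt.Println(err)
}
```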
You should read the Raft paper. You’re slowly moving towards an implementation of the Raft protocol, and it will probably answer many of the questions you might have along the way.

How can a node with a complete log be elected if another becomes a candidate first?

I've been watching the Raft algorithm video at https://youtu.be/vYp4LYbnnW8?t=3244, but I am not clear about one circumstance.
In the leader election for term 4, if node s1 broadcasts RequestVote before s3 does, then nodes s2, s4, and s5 would vote for it, while s3 would not. If node s3 then broadcasts RequestVote to the others, how can it get their votes?
One possible way I can see to handle the situation is:
1. If node s1 receives the rejection from s3 and finds out that s3's log is newer than its own, it does not set itself as leader even though it received a majority of votes.
2. As for the other nodes: they remember which candidate they voted for, and if a new vote request comes in with a bigger <lastTerm, lastIndex>, they vote for the node with the bigger <lastTerm, lastIndex>.
In both scenarios, node s3 eventually gets the others' votes and sets itself as leader. I'm not sure if my guess is correct.
(Before I comment, be aware that there is NO possible way for entry #9 to be committed. There is no indication of which log entries are committed, but this discussion works with any of #s 1-8 as being committed.)
In short, s3 does not become the leader, s1 does because it gets a majority of the votes. If your concern is that entry #9 will be lost, that is true, but it wasn't committed anyway.
From §5.3:
In Raft, the leader handles inconsistencies by forcing
the followers’ logs to duplicate its own. This means that
conflicting entries in follower logs will be overwritten
with entries from the leader’s log.
To comment on your handling of the situation.
1. If node s1 receives the rejection from s3 and finds out that s3's log is newer than its own, it does not set itself as leader even though it received a majority of votes.
It could do this, but it will make failover take longer because s3 would have to try again with a different timeout, and you come into a race condition where s1 always broadcasts RequestVote before s3 does. But again, it is always safe to delete the excess entries that s3 has.
The last paragraph of §5.3 talks about how this easy, timeout-based election process was used instead of ranking the nodes and selecting the best. I agree with the outcome. Simpler protocols are more robust.
2. As for the other nodes: they remember which candidate they voted for, and if a new vote request comes in with a bigger <lastTerm, lastIndex>, they vote for the node with the bigger <lastTerm, lastIndex>.
This is strictly forbidden because it destroys leader election. That is, if you have this in place you will very often elect multiple leaders. This is bad. I cannot stress enough how bad this is. Bad, bad, bad.
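To make the quoted §5.3 rule concrete, here is a small sketch (0-based indices, simplified from real Raft) of how a follower discards conflicting, uncommitted entries when the new leader replicates its log:

```go
package main

import "fmt"

type entry struct {
	Term    int
	Command string
}

// appendLeaderEntries applies the §5.3 rule on the follower side: an entry
// that conflicts with one of the leader's entries (same index, different
// term) is deleted together with everything after it, and the leader's
// entries are appended instead.
func appendLeaderEntries(log []entry, prevIndex int, leaderEntries []entry) []entry {
	for i, e := range leaderEntries {
		idx := prevIndex + 1 + i
		if idx < len(log) {
			if log[idx].Term == e.Term {
				continue // already present
			}
			log = log[:idx] // conflict: truncate the follower's log here
		}
		log = append(log, e)
	}
	return log
}

func main() {
	// s3's log has an extra, uncommitted entry at index 2; the new leader
	// replicates its own entry for that slot.
	follower := []entry{{1, "a"}, {1, "b"}, {2, "x"}}
	leader := []entry{{3, "y"}}
	fmt.Println(appendLeaderEntries(follower, 1, leader)) // [{1 a} {1 b} {3 y}]
}
```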

RAFT consensus protocol - Should entries be durable before committing

I have the following query about implementing RAFT:
Consider the following scenario/implementation:
1. The RAFT leader receives a command entry and appends it to an in-memory array. It then sends the entries to the followers (with the heartbeat).
2. The followers receive the entry, append it to their in-memory array, and then send a response that they have received the entry.
3. The leader then commits the entry by writing it to a durable store (file) and sends the latest commit index in the heartbeat.
4. The followers then commit the entries based on the leader's commit index by storing the entry to their durable store (file).
One of the implementations of RAFT (link: https://github.com/peterbourgon/raft/) seems to implement it this way. I wanted to confirm if this is fine.
Is it OK if entries are maintained "in memory" by the leader and the followers until it is committed? In what circumstances might this scenario fail?
I disagree with the accepted answer.
A disk isn't mystically durable. Assuming the disk is local to the server, it can permanently fail, so clearly writing to disk doesn't save you from that. Replication is durability, provided that the replicas live in different failure domains, which, if you are serious about durability, they will. Of course there are many hazards for a process that disks don't suffer from (the Linux OOM killer, running out of memory in general, power loss, etc.), but a dedicated process on a dedicated machine can do pretty well, especially if the log store is, say, ramfs, so a process restart isn't an issue.
If log storage is lost, then host identity should be lost as well. A, B, C identify logs: new log, new identity. B "rejoining" after (potential) loss of storage is simply a buggy implementation. The new process can't claim the identity of B because it can't be sure that it has all the information that B had. Just as in the case of always flushing to disk: if we replaced the disk of the machine hosting B, we couldn't just restart the process configured with B's identity. That would be nonsense. It should restart as D in both cases and then ask to join the cluster, at which point the problem of losing committed writes disappears in a puff of smoke.
I found the answer to the question by posting to raft-dev google group. I have added the answer for reference.
Please reference: https://groups.google.com/forum/#!msg/raft-dev/_lav2NeiypQ/1QbUB52fkggJ
Quoting Diego's answer:
For safety even in the face of correlated power outages, a majority of servers needs to have persisted the log entry before its effects are externalized. Any less than a majority and those servers could permanently fail, resulting in data loss/corruption.
Quoting from Ben Johnson's answer to my email regarding the same:
No, a server has to flush entries to disk before being considered part of the quorum.
For example, let's say you have a cluster of nodes called A, B, & C where A is the leader.
Node A replicates an entry to Node B.
Node B stores entry in memory and responds to Node A.
Node A now has a quorum and commits the entry.
Node A then gets partitioned away from Node B & C.
Node B then dies and loses the in-memory copy of the entry.
Node B comes back up.
When Node B & C then go to elect a leader, the "committed" entry will not be in their log.
When Node A rejoins the cluster, it will have an inconsistent log. The entry will have been committed and applied to the state machine, so it can't be rolled back.
Ben
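A small sketch of the point Ben is making, in Go: the follower must write and fsync new entries before it acknowledges them, so a crash like Node B's cannot lose an entry that was already counted towards the quorum (the file path and entry encoding here are made up):

```go
package main

import (
	"fmt"
	"log"
	"os"
)

// appendAndAck writes new entries to a write-ahead log and fsyncs them
// before returning true. Only after this returns may the follower reply
// "success" to the leader's AppendEntries and be counted towards the quorum.
func appendAndAck(walPath string, entries [][]byte) bool {
	f, err := os.OpenFile(walPath, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		log.Print(err)
		return false
	}
	defer f.Close()

	for _, e := range entries {
		if _, err := f.Write(append(e, '\n')); err != nil {
			log.Print(err)
			return false
		}
	}
	// fsync: without this, a crash like Node B's can lose an entry that
	// the leader already counted as replicated.
	if err := f.Sync(); err != nil {
		log.Print(err)
		return false
	}
	return true
}

func main() {
	fmt.Println(appendAndAck("raft-wal.log", [][]byte{[]byte(`{"term":1,"cmd":"x"}`)}))
}
```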
I think entries should be durable before committing.
Let's take Figure 8(e) of the extended Raft paper as an example. If entries are made durable only when they are committed, then:
1. S1 replicates entry 4 to S2 and S3, then commits entries 2 and 4.
2. All servers crash. Because S2 and S3 don't know that S1 has committed 2 and 4, they won't commit 2 and 4. Therefore S1 has committed 1, 2, and 4, while S2, S3, S4, and S5 have committed only 1.
3. All servers restart except S1.
4. Because only committed entries are durable, S2, S3, S4, and S5 have the same single entry: 1.
5. S2 is elected as the leader.
6. S2 replicates a new entry to all other servers except the crashed S1.
7. S1 restarts. Because S2's entries are newer than S1's, S1's entries 2 and 4 are replaced by the new entry.
As a result, the committed entries 2 and 4 are lost. So I think un-committed entries should also be made durable.

Why does ZooKeeper need a majority to run?

I've been wondering why ZooKeeper needs a majority of the machines in the ensemble to work at all. Let's say we have a very simple ensemble of 3 machines: A, B, C.
When A fails, a new leader is elected; fine, everything works. When another one dies, let's say B, the service becomes unavailable. Does that make sense? Why can't machine C handle everything alone until A and B are up again?
Since one machine is enough to do all the work (for example, a single-machine ensemble works fine)...
Is there any particular reason why ZooKeeper is designed in this way? Is there a way to configure ZooKeeper so that, for example, the ensemble is always available when at least one of N nodes is up?
Edit:
Maybe there is a way to apply a custom leader-election algorithm? Or to define the size of the quorum?
Thanks in advance.
Zookeeper is intended to distribute things reliably. If the network of systems becomes segmented, then you don't want the two halves operating independently and potentially getting out of sync, because when the failure is resolved, it won't know what to do. If you have it refuse to operate when it's got less than a majority, then you can be assured that when a failure is resolved, everything will come right back up without further intervention.
The reason to get a majority vote is to avoid a problem called "split-brain".
Basically, in a network failure you don't want the two parts of the system to continue as usual. You want one to continue and the other to understand that it is not part of the cluster.
There are two main ways to achieve that. One is to hold a shared resource, for instance a shared disk where the leader holds a lock: if you can see the lock, you are part of the cluster; if you can't, you're out. If you are holding the lock you're the leader, and if you're not, you aren't. The problem with this approach is that you need that shared resource.
The other way to prevent a split-brain is majority count: if you get enough votes, you are the leader. This still works with two nodes out of an ensemble of three, where the leader says it is the leader and the other node, acting as a "witness", agrees. This method is preferable as it can work in a shared-nothing architecture, and indeed that is what ZooKeeper uses.
As Michael mentioned, a node cannot know if the reason it doesn't see the other nodes in the cluster is because these nodes are down or there's a network problem - the safe bet is to say there's no quorum.
Let’s look at an example that shows how things can go wrong if the quorum (majority of running servers) is too small.
Say we have five servers and a quorum can be any set of two servers. Now say that servers s1 and s2 acknowledge that they have replicated a request to create a znode /z. The service returns to the client saying that the znode has been created. Now suppose servers s1 and s2 are partitioned away from the other servers and from clients for an arbitrarily long time, before they have a chance to replicate the new znode to the other servers. The service in this state is able to make progress because there are three servers available and it really needs only two according to our assumptions, but these three servers have never seen the new znode /z. Consequently, the request to create /z is not durable.
This is an example of the split-brain scenario. To avoid this problem, in this example the size of the quorum must be at least three, which is a majority out of the five servers in the ensemble. To make progress, the ensemble needs at least three servers available. To confirm that a request to update the state has completed successfully, this ensemble also requires that at least three servers acknowledge that they have replicated it.
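A tiny sketch of the quorum arithmetic behind this example: requiring a majority guarantees that any two quorums overlap in at least one server, which is what keeps acknowledged updates from being lost across leader changes.

```go
package main

import "fmt"

// majority returns the minimum quorum size that guarantees any two quorums
// intersect in at least one server.
func majority(ensembleSize int) int {
	return ensembleSize/2 + 1
}

func main() {
	for _, n := range []int{3, 5} {
		q := majority(n)
		// Two quorums of size q drawn from n servers must share at least
		// 2*q - n servers, which is >= 1 when q is a majority.
		fmt.Printf("ensemble of %d: quorum = %d, minimum overlap = %d\n", n, q, 2*q-n)
	}
}
```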