How does Raft prevent committed log entries from being overwritten?

Figure 8 in the Raft paper
Consider a situation like Figure 8 in the Raft paper, except that in (c) the log entry from term 2 has been committed. Then S1 crashes and S5 becomes leader. S5 sends AppendEntries RPCs to S2, S3, and S4, and according to the rules they must replace the term-2 entry with the term-3 entry, so an entry that has already been committed gets overwritten. How can we avoid that?
I ran into this kind of situation in the 6.824 labs; it occasionally makes me fail a test (very infrequently, only one or two times out of hundreds of runs).

The key is in the voting rules: if there is a committed entry X, a node can be elected leader only if it has entry X in its log. As a result, committed entries are never overwritten.
In your case, if S5 doesn't have the latest committed entry, it won't be able to gather a majority of votes to become leader.
Quick edit: the key property of Raft is that only sufficiently up-to-date nodes may become leaders. If a leader committed a value and then died (even before the other nodes learned the new commit index), committing guarantees that a majority of nodes store that value, so the next leader will be elected from that set.
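A minimal Go sketch of that election restriction (§5.4.1 of the paper); the type and field names here are illustrative, not taken from any particular implementation:

```go
package raft

// LogEntry and Raft are deliberately minimal; real implementations carry
// more state (commitIndex, nextIndex[], locks, and so on).
type LogEntry struct {
	Term    int
	Command interface{}
}

type Raft struct {
	currentTerm int
	votedFor    int        // -1 means "voted for nobody this term"
	log         []LogEntry // log[0] is a dummy entry, so the slice is never empty
}

// RequestVoteArgs mirrors the fields listed in Figure 2 of the paper.
type RequestVoteArgs struct {
	Term         int // candidate's term
	CandidateId  int
	LastLogIndex int // index of the candidate's last log entry
	LastLogTerm  int // term of the candidate's last log entry
}

// candidateIsUpToDate implements the election restriction of §5.4.1: grant a
// vote only if the candidate's log is at least as up-to-date as ours
// (compare the terms of the last entries first, then the log lengths).
// Because a committed entry is stored on a majority of servers, a candidate
// that is missing it cannot collect a majority of votes.
func (rf *Raft) candidateIsUpToDate(args *RequestVoteArgs) bool {
	lastIndex := len(rf.log) - 1
	lastTerm := rf.log[lastIndex].Term
	if args.LastLogTerm != lastTerm {
		return args.LastLogTerm > lastTerm
	}
	return args.LastLogIndex >= lastIndex
}
```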

Related

Why does the Raft protocol reject RequestVote with a lower term?

In Raft, every node rejects any request whose term number is lower than its own. But why do we need this check for the RequestVote RPC? If the Leader Completeness Property holds, then a node can safely vote for such a candidate, right? So why reject the request? I can't come up with an example where, without this additional check on RequestVote, we would lose consistency or safety.
Can someone help, please?
A candidate whose RequestVote RPC carries a lower term may have a log that is not up to date with the other nodes, because a leader could already have been elected in a higher term and replicated entries to a majority of servers.
Even if this candidate were elected leader, the rules of Raft would prevent it from doing anything, for safety reasons: its RPCs would be rejected by the other servers because of its lower term.
I think there are two reasons:
Terms in Raft are monotonically increasing. Any request from a previous term was made based on stale state; it is simply rejected, and the reply carries the current term (and leader information) so the sender can catch up.
Any elected leader must hold all entries committed before the election. A candidate from a previous term cannot have any entries from the current term, so it cannot be guaranteed to hold all committed entries.
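To make the first point concrete, the term check usually sits at the very top of the RPC handler. Continuing the types from the sketch earlier on this page (again, the names are illustrative, loosely modeled on the Figure 2 summary in the paper):

```go
// RequestVoteReply carries the voter's term so a stale candidate can
// update itself and step down.
type RequestVoteReply struct {
	Term        int
	VoteGranted bool
}

// RequestVote is the voter-side handler (locking omitted for brevity).
func (rf *Raft) RequestVote(args *RequestVoteArgs, reply *RequestVoteReply) {
	// Figure 2 rule: reject any request from a lower term outright; the
	// reply tells the stale candidate what the current term is.
	if args.Term < rf.currentTerm {
		reply.Term = rf.currentTerm
		reply.VoteGranted = false
		return
	}
	// A higher term means our own state is stale: adopt it and clear our vote.
	if args.Term > rf.currentTerm {
		rf.currentTerm = args.Term
		rf.votedFor = -1
	}
	reply.Term = rf.currentTerm
	// Grant the vote only if we haven't voted in this term (or already voted
	// for this candidate) and its log is at least as up-to-date as ours (§5.4.1).
	if (rf.votedFor == -1 || rf.votedFor == args.CandidateId) && rf.candidateIsUpToDate(args) {
		rf.votedFor = args.CandidateId
		reply.VoteGranted = true
	}
}
```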

How can a node with a complete log be elected if another becomes a candidate first?

I've been watching the Raft Algorithm video at https://youtu.be/vYp4LYbnnW8?t=3244, but I'm not clear about one circumstance.
In the leader election for term 4, if node s1 broadcasts RequestVote before s3 does, then nodes s2, s4, and s5 would vote for it, while s3 wouldn't. If node s3 then broadcasts RequestVote to the others, how can it get their votes?
One possible way I can see to handle the situation is:
if node s1 receives the rejection from s3 and finds out that s3's log is newer than its own, it does not set itself as leader even though it received a majority of votes
As for the other nodes, they remember which candidate they voted for; if a new vote request comes with a bigger <lastTerm, lastIndex>, they vote for the node with the bigger <lastTerm, lastIndex>.
In both scenarios, node s3 eventually gets the others' votes and sets itself as leader. I'm not sure if my guess is correct.
(Before I comment, be aware that there is NO possible way for entry #9 to be committed. There is no indication of which log entries are committed, but this discussion works with any of #s 1-8 as being committed.)
In short, s3 does not become the leader, s1 does because it gets a majority of the votes. If your concern is that entry #9 will be lost, that is true, but it wasn't committed anyway.
From §5.3:
In Raft, the leader handles inconsistencies by forcing the followers' logs to duplicate its own. This means that conflicting entries in follower logs will be overwritten with entries from the leader's log.
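A sketch of what that overwriting looks like on the follower side, continuing the Go types from the sketches above; AppendEntriesArgs and the method name are illustrative, and the AppendEntries consistency check on the preceding entry is assumed to have already passed:

```go
// AppendEntriesArgs mirrors Figure 2; only the fields used below are needed.
type AppendEntriesArgs struct {
	Term         int
	LeaderId     int
	PrevLogIndex int        // index of the entry immediately preceding Entries
	PrevLogTerm  int
	Entries      []LogEntry // empty for heartbeats
	LeaderCommit int
}

// appendNewEntries assumes the consistency check on PrevLogIndex/PrevLogTerm
// has already passed. Any existing entry that conflicts with a new one (same
// index, different term) is deleted along with everything that follows it,
// exactly as §5.3 describes, and the leader's entries are appended.
func (rf *Raft) appendNewEntries(args *AppendEntriesArgs) {
	for i, e := range args.Entries {
		idx := args.PrevLogIndex + 1 + i
		if idx < len(rf.log) && rf.log[idx].Term != e.Term {
			rf.log = rf.log[:idx] // truncate the conflicting suffix
		}
		if idx >= len(rf.log) {
			rf.log = append(rf.log, e)
		}
	}
}
```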
To comment on your handling of the situation.
1. if node s1 receives the rejection from s3 and finds out that s3's log is newer than its own, it does not set itself as leader even though it received a majority of votes
It could do this, but it would make failover take longer because s3 would have to try again with a different timeout, and you could end up in a race where s1 always broadcasts RequestVote before s3 does. But again, it is always safe to delete the excess entries that s3 has.
The last paragraph of §5.3 talks about how this easy, timeout-based election process was used instead of ranking the nodes and selecting the best. I agree with the outcome. Simpler protocols are more robust.
2. As for the other nodes, they remember which candidate they voted for; if a new vote request comes with a bigger <lastTerm, lastIndex>, they vote for the node with the bigger <lastTerm, lastIndex>.
This is strictly forbidden because it destroys leader election. That is, if you have this in place you will very often elect multiple leaders. This is bad. I cannot stress enough how bad this is. Bad, bad, bad.

Is it practical to keep logging messages in a group communication service or Paxos?

In the case of a network partition or a node crash, most distributed atomic broadcast protocols (like Extended Virtual Synchrony or Paxos) require the running nodes to keep logging messages until the crashed or partitioned node rejoins the cluster. When a node rejoins the cluster, replaying the logged messages is enough for it to regain the current state.
My question is: if the partitioned/crashed node takes a really long time to rejoin the cluster, the logs will eventually overflow. This seems like a very practical issue, yet none of the papers talk about it. Is there an obvious solution I am missing, or is my understanding incorrect?
You don't really need to remember the whole log. Imagine, for example, that the state you were synchronizing between the nodes was something like an SQL table with rows of the form (id: int, name: string), and the commands written into the log were of the form "insert row with id=x and name=y", "delete row where id=z", "set name=a where id=1000", and so on.
Once such commands are committed, all you really care about is the final table. A node that was offline for a long time would then only need to download the table plus the few entries from the log that were committed while the table was being downloaded.
This is called "log compaction"; check out Section 7 of the Raft paper for more info.
There are a few potential solutions to the infinite-log problem, but one of the more popular ones for replicated state machines is to periodically snapshot the full replicated state machine and delete all history prior to that point. A node that has been offline too long would then just discard all of its information, download the snapshot, and start replaying the replicated log from that point.
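A minimal sketch of that snapshot-and-truncate step, continuing the Go types from the Raft sketches above. The Snapshot type and compactUpTo are illustrative names; the paper's InstallSnapshot RPC is the mechanism for shipping such a snapshot to a lagging follower.

```go
// Snapshot is what a slow or restarted node downloads instead of the
// discarded log prefix.
type Snapshot struct {
	LastIncludedIndex int
	LastIncludedTerm  int
	StateMachineState []byte // serialized application state, e.g. the SQL table above
}

// compactUpTo discards every log entry up to and including index, keeping a
// snapshot of the applied state instead. For simplicity this sketch assumes
// no prior compaction, so log indices equal slice indices; a real
// implementation offsets every index by LastIncludedIndex afterwards.
func (rf *Raft) compactUpTo(index int, state []byte) Snapshot {
	snap := Snapshot{
		LastIncludedIndex: index,
		LastIncludedTerm:  rf.log[index].Term,
		StateMachineState: state,
	}
	// Keep a dummy entry holding the snapshot's term so consistency checks
	// against the first remaining entry still work.
	rf.log = append([]LogEntry{{Term: snap.LastIncludedTerm}}, rf.log[index+1:]...)
	return snap
}
```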

How to prevent two masters from being active at the same time when doing master election with a registry like Zookeeper, Consul, or etcd?

tl;dr
Even if you implement master election with one of the registries such as Zookeeper, Consul, or etcd, there always seems to be a race condition where an old master does not realize it is no longer master and tries to write, while a new master is unblocked from writing, resulting in two instances of the service writing at the same time, which we want to avoid. How can we implement master election without this race condition?
Detailed problem statement
Suppose we want to implement master election for failover using one of the registries such as Zookeeper, Consul, or etcd.
Suppose there are three instances of the service S1, S2, S3, each one with a corresponding registry node on the same machine, and currently S1 is master and S2 and S3 are slaves.
Furthermore, S1, S2, S3 all store shared state in the registry, but we do not want to have multiple instances write that state at the same time because concurrent access might make the state inconsistent.
Suppose S1 is in the middle of write operations to the shared state stored in the registry. That is, it will execute more write operations before checking again whether it is leader.
Suppose there is a network partition at this point. On one side of the partition is S1. On the other side are S2 and S3, so that side has the quorum.
The registry correctly identifies S2 as the new leader and invalidates S1 as the leader.
S2 activates because it is the new leader and begins a series of write operations to the shared state stored in the registry.
The partition is healed.
At this point both S1 and S2 will execute write operations to the registry concurrently and both their write operations will succeed since the partition is healed, which might result in inconsistent state.
The sample trace is:
S1 is notified it is master, and begins write operations to shared state in registry
Partition happens, with S1 in one partition, S2 & S3 in the other
S2 is identified as new master
S2 is notified it is new master, and begins write operations to shared state in registry
Partition is repaired
S1 writes to shared state in registry and succeeds since no partition
S2 writes to shared state in registry and succeeds also, causing arbitrary interleaving of writes with S1
S1 is notified it is no longer master and stops writing
Thoughts
Thinking in terms of Consul's sessions: would an API write call that also takes a session ID, and only succeeds if that session ID still belongs to the current master, solve this problem?
Is there such a call in Consul or one of the other registries?
I think the problem lies between steps 5 & 6. S1 shouldn't attempt to write immediately after healing the partition since it should have lost its leadership status during the partition.
S1 should be aware of this since it would have to re-establish connectivity to ZooKeeper.
http://zookeeper.apache.org/doc/trunk/recipes.html#sc_leaderElection has a description of the basic process. I'd look at Apache Curator for an implementation.
For a non-ZooKeeper approach, this could be handled via timeout values.
https://github.com/Netflix/edda/blob/48de14fc185d8b2d8605c51630c0906c7e923925/src/main/scala/com/netflix/edda/aws/DynamoDBElector.scala has a nice implementation of that approach.
I haven't tried to use Consul for this yet, but I suspect it should fall into one of these two categories.
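To make the session-conditioned write from the question concrete, here is a minimal Go sketch against a hypothetical registry client. The Registry interface and CheckAndSet call are assumptions for illustration, not a real Consul or ZooKeeper API; the real mechanisms to look at are Consul KV acquire/release bound to a session, ZooKeeper's setData with an expected version, and etcd transactions that compare a lease.

```go
package election

import "errors"

// Registry is a hypothetical client interface; real registries expose
// comparable primitives (see the lead-in above).
type Registry interface {
	// CheckAndSet writes value to key only if leaderKey is still held by
	// sessionID; the check and the write happen atomically inside the
	// registry's own consensus group.
	CheckAndSet(leaderKey, sessionID, key, value string) error
}

// ErrNotLeader is what CheckAndSet returns when the session no longer
// holds the leader key.
var ErrNotLeader = errors.New("session no longer holds leadership")

// WriteSharedState refuses to write unless this instance's session still
// owns the leader key, so a stale master's write (step 6 in the trace
// above) fails at the registry instead of interleaving with the new
// master's writes.
func WriteSharedState(r Registry, sessionID, key, value string) (stillLeader bool, err error) {
	err = r.CheckAndSet("service/leader", sessionID, key, value)
	if errors.Is(err, ErrNotLeader) {
		return false, err // stop writing; another instance is master now
	}
	return true, err
}
```

The important property is that the leadership check and the write form a single atomic operation inside the registry, rather than a local "am I still leader?" check followed by a separate write.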

Raft consensus protocol - Should entries be durable before committing?

I have the following query about implementing Raft.
Consider the following scenario/implementation:
The Raft leader receives a command entry and appends it to an in-memory array.
It then sends the entry to the followers (with the heartbeat).
The followers receive the entry, append it to their own in-memory arrays, and then send a response saying they have received the entry.
The leader then commits the entry by writing it to a durable store (a file).
The leader sends the latest commit index in the heartbeat.
The followers then commit the entry, based on the leader's commit index, by writing it to their own durable stores (files).
One of the implementations of Raft (link: https://github.com/peterbourgon/raft/) seems to do it this way. I wanted to confirm whether this is fine.
Is it OK if entries are kept "in memory" by the leader and the followers until they are committed? In what circumstances might this scenario fail?
I disagree with the accepted answer.
A disk isn't mystically durable. Assuming the disk is local to the server, it can fail permanently, so writing to disk clearly doesn't save you from that. Replication is durability, provided that the replicas live in different failure domains (which, if you are serious about durability, they will). Of course there are many hazards a process faces that disks don't suffer from (the Linux OOM killer, out-of-memory conditions in general, power loss, etc.), but a dedicated process on a dedicated machine can do pretty well, especially if the log store is, say, ramfs, so that a process restart isn't an issue.
If log storage is lost, then the host's identity should be lost as well. A, B, C identify logs: new log, new id. B "rejoining" after a (potential) loss of storage is simply a buggy implementation. The new process can't claim B's identity because it can't be sure it has all the information B had. Likewise, in the always-flush-to-disk case, if we replaced the disk of the machine hosting B we couldn't just restart the process configured with B's identity; that would be nonsense. In both cases it should restart as D and ask to join the cluster, at which point the problem of losing committed writes disappears in a puff of smoke.
I found the answer to the question by posting to raft-dev google group. I have added the answer for reference.
Please reference: https://groups.google.com/forum/#!msg/raft-dev/_lav2NeiypQ/1QbUB52fkggJ
Quoting Diego's answer:
For safety even in the face of correlated power outages, a majority of servers needs to have persisted the log entry before its effects are externalized. Any less than a majority and those servers could permanently fail, resulting in data loss/corruption.
Quoting from Ben Johnson's answer to my email regarding the same:
No, a server has to flush entries to disk before being considered part of the quorum.
For example, let's say you have a cluster of nodes called A, B, & C, where A is the leader.
Node A replicates an entry to Node B.
Node B stores entry in memory and responds to Node A.
Node A now has a quorum and commits the entry.
Node A then gets partitioned away from Node B & C.
Node B then dies and loses the in-memory copy of the entry.
Node B comes back up.
When Node B & C then go to elect a leader, the "committed" entry will not be in their log.
When Node A rejoins the cluster, it will have an inconsistent log. The entry will have been committed and applied to the state machine, so it can't be rolled back.
Ben
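A sketch of the ordering Ben describes, continuing the Go types from the earlier sketches. Persister is a stand-in for whatever stable storage an implementation uses (an fsynced file, the 6.824 labs' Persister, and so on), and error handling is elided.

```go
import (
	"bytes"
	"encoding/gob"
)

// Persister stands in for stable storage; SaveRaftState must not return
// until the bytes are actually durable (e.g. after an fsync).
type Persister interface {
	SaveRaftState(state []byte)
	ReadRaftState() []byte
}

// encodeState serializes the fields Figure 2 marks as persistent. Concrete
// command types stored in LogEntry.Command must be registered with
// gob.Register before this is called.
func (rf *Raft) encodeState() []byte {
	var buf bytes.Buffer
	enc := gob.NewEncoder(&buf)
	enc.Encode(rf.currentTerm)
	enc.Encode(rf.votedFor)
	enc.Encode(rf.log)
	return buf.Bytes()
}

// handleAppendEntries shows only the ordering that matters here: the new
// entries are flushed to stable storage *before* the follower acknowledges,
// so the leader never counts toward its quorum an entry that a crash (like
// Node B's above) could erase.
func (rf *Raft) handleAppendEntries(args *AppendEntriesArgs, p Persister) bool {
	rf.appendNewEntries(args)         // from the earlier sketch
	p.SaveRaftState(rf.encodeState()) // durable first...
	return true                       // ...acknowledge second
}
```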
I think entries should be durable before committing.
Let's take Figure 8(e) of the extended Raft paper as an example. If entries are made durable only when they are committed, then:
S1 replicates entry 4 to S2 and S3, then commits entries 2 and 4.
All servers crash. Because S2 and S3 don't know that S1 has committed 2 and 4, they won't commit them. Therefore S1 has committed 1, 2, and 4, while S2, S3, S4, and S5 have committed only 1.
All servers restart except S1.
Because only committed entries are durable, S2, S3, S4, and S5 have the same single entry: 1.
S2 is elected as the leader.
S2 replicates a new entry to all other servers except the crashed S1.
S1 restarts. Because S2's entries are newer than S1's, S1's entries 2 and 4 are replaced by the new entry.
As a result, the committed entries 2 and 4 are lost. So I think uncommitted entries should also be made durable.
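This matches the "persistent state on all servers" box in Figure 2 of the paper: currentTerm, votedFor, and the entire log, committed or not, must be on stable storage before a server responds to RPCs. A sketch of the matching restart path, continuing the earlier sketches and the encodeState helper above:

```go
import (
	"bytes"
	"encoding/gob"
)

// readPersist restores the persistent state on reboot. Because uncommitted
// entries were flushed as well, a server that restarts in the Figure 8(e)
// scenario above comes back with its full log, not just the entries it knew
// to be committed, so entries 2 and 4 remain on a majority (S1, S2, S3) and
// any electable leader must have them.
func (rf *Raft) readPersist(data []byte) {
	if len(data) == 0 {
		return // first boot: nothing persisted yet
	}
	dec := gob.NewDecoder(bytes.NewBuffer(data))
	dec.Decode(&rf.currentTerm)
	dec.Decode(&rf.votedFor)
	dec.Decode(&rf.log)
}
```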