How can a node with a complete log be elected if another becomes a candidate first?

I've been watching a Raft algorithm video at https://youtu.be/vYp4LYbnnW8?t=3244, but I'm not clear about one circumstance.
In the leader election for term 4, if node s1 broadcasts RequestVote before s3 does, then nodes s2, s4 and s5 would vote for it, while s3 doesn't. When node s3 then broadcasts RequestVote to the others, how can it get their votes?
A couple of ways I can think of to handle this situation:
1, if node s1 receives the rejection from s3 and finds out that s3's log is newer than its own, it does not make itself leader even though it received a majority of the votes.
2, as for the other nodes, they remember whom they voted for; if a new vote request arrives with a bigger <lastTerm, lastIndex>, they vote for the node with the bigger <lastTerm, lastIndex>.
In both scenarios node s3 eventually gets everyone else's votes and makes itself leader. I'm not sure whether my guess is correct.

(Before I comment, be aware that there is NO possible way for entry #9 to be committed. There is no indication of which log entries are committed, but this discussion works with any of #s 1-8 as being committed.)
In short, s3 does not become the leader; s1 does, because it gets a majority of the votes. If your concern is that entry #9 will be lost, that is true, but it wasn't committed anyway.
From §5.3:
In Raft, the leader handles inconsistencies by forcing the followers’ logs to duplicate its own. This means that conflicting entries in follower logs will be overwritten with entries from the leader’s log.
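For concreteness, here is a minimal sketch (in Go, with invented type and function names, not the paper's reference code) of how a follower might enforce that rule while handling AppendEntries: any entry that conflicts with the leader's is dropped along with everything after it, and the leader's entries are appended in its place.

```go
// Entry is a single log entry (invented type for this sketch).
type Entry struct {
	Term    int
	Command string
}

// applyLeaderEntries enforces the §5.3 rule on a follower's log. It assumes
// the consistency check has already passed, i.e. the follower's log matches
// the leader's up to and including prevLogIndex (0-based indexing here,
// simplified from the paper's 1-based convention).
func applyLeaderEntries(log []Entry, prevLogIndex int, leaderEntries []Entry) []Entry {
	for i, e := range leaderEntries {
		idx := prevLogIndex + 1 + i
		if idx < len(log) && log[idx].Term != e.Term {
			// Conflicting entry: delete it and everything that follows.
			log = log[:idx]
		}
		if idx >= len(log) {
			// Append any entries not already in the log.
			log = append(log, e)
		}
		// If idx < len(log) and the terms match, the entry is already present.
	}
	return log
}
```

This is exactly what would happen to s3's extra entry #9 once s1 starts sending AppendEntries for term 4.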
To comment on your proposed ways of handling the situation:
1, if node s1 receives the rejection from s3 and finds out that s3's log is newer than its own, it does not make itself leader even though it received a majority of the votes
It could do this, but it would make failover take longer, because s3 would have to try again with a different timeout, and you can run into a race where s1 always happens to broadcast RequestVote before s3 does. But again, it is always safe to delete the excess entries that s3 has.
The last paragraph of §5.3 talks about how this easy, timeout-based election process was used instead of ranking the nodes and selecting the best. I agree with the outcome. Simpler protocols are more robust.
2, as for the other nodes, they remember whom they voted for; if a new vote request arrives with a bigger <lastTerm, lastIndex>, they vote for the node with the bigger <lastTerm, lastIndex>
This is strictly forbidden because it destroys leader election: with re-voting in place you can easily end up with two leaders elected for the same term. This is bad. I cannot stress enough how bad this is. Bad, bad, bad.
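To make the one-vote-per-term rule concrete, here is a rough sketch of a RequestVote handler (all field and type names are my own, not taken from any particular implementation). Note that votedFor is only reset when the term changes; within a term a vote, once granted, is never transferred to a later candidate, no matter how long that candidate's log is. That is exactly the property proposal 2 would break.

```go
// All names here are invented for the sketch.
type voteArgs struct {
	Term         int
	CandidateID  string
	LastLogIndex int
	LastLogTerm  int
}

type voteReply struct {
	Term        int
	VoteGranted bool
}

type node struct {
	currentTerm  int
	votedFor     string // "" means no vote granted yet in currentTerm
	lastLogIndex int
	lastLogTerm  int
}

func (n *node) handleRequestVote(a voteArgs) voteReply {
	if a.Term < n.currentTerm {
		return voteReply{Term: n.currentTerm, VoteGranted: false}
	}
	if a.Term > n.currentTerm {
		n.currentTerm = a.Term
		n.votedFor = "" // the vote only becomes available again in a NEW term
	}
	upToDate := a.LastLogTerm > n.lastLogTerm ||
		(a.LastLogTerm == n.lastLogTerm && a.LastLogIndex >= n.lastLogIndex)
	// At most one vote per term. The vote is never revoked or transferred to
	// a later candidate, even one with a longer log; relaxing this is what
	// allows two leaders to be elected in the same term.
	if (n.votedFor == "" || n.votedFor == a.CandidateID) && upToDate {
		n.votedFor = a.CandidateID
		return voteReply{Term: n.currentTerm, VoteGranted: true}
	}
	return voteReply{Term: n.currentTerm, VoteGranted: false}
}
```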

How does raft prevent submitted logs from being overwritten

Consider a situation like Figure 8 in the Raft paper, but suppose that in (c) the log entry from term 2 has been committed, and then s1 crashes and s5 becomes leader. s5 sends AppendEntries RPCs to s2, s3 and s4, and according to the rules they must replace the log entry from term 2 with the log entry from term 3, so a log entry that has already been committed gets overwritten. How can we avoid that?
I ran into this kind of situation in the 6.824 labs; it occasionally makes me fail the test (very infrequently, only one or two times out of hundreds of runs).
The issue is with the voting: if there is a committed item X, then a node can be elected only if it has item X in its log. Basically, committed items will never be overwritten.
In your case, if S5 doesn't have the latest committed value, it won't be able to get a majority of votes to become a leader.
Quick edit: the key property of Raft is that only sufficiently up-to-date nodes may become leaders. If a leader committed a value and died (even before the other nodes learned about the new commit index), committing it guarantees that a majority of nodes have the value. So the next leader will be elected from that set.
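As a sketch of that voting rule (the election restriction in §5.4.1 of the Raft paper), a voter compares the candidate's last log entry with its own before granting a vote; the names below are illustrative.

```go
// logPosition is the <lastTerm, lastIndex> pair mentioned in the question
// above (invented name).
type logPosition struct {
	LastTerm  int
	LastIndex int
}

// candidateIsUpToDate is the §5.4.1 election restriction: a voter only grants
// its vote if the candidate's log is at least as up-to-date as its own, where
// a later last term wins and equal last terms are compared by log length.
func candidateIsUpToDate(candidate, voter logPosition) bool {
	if candidate.LastTerm != voter.LastTerm {
		return candidate.LastTerm > voter.LastTerm
	}
	return candidate.LastIndex >= voter.LastIndex
}
```

Since a committed entry is stored on a majority of servers, and winning an election also requires a majority, any majority that elects a leader contains at least one voter that has the entry and will refuse a candidate that lacks it.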

When does the Raft consensus algorithm apply the commit log to the leader and followers?

In RAFT, I understand that a leader receives a request and federates it out to its peer list to commit to their respective logs.
My question is, is there a distinction between committing the action to the log and actually applying the action? If the answer is yes, then at what point does the action get applied?
My understanding is that once the leader receives, from the majority, "hey, I wrote this to my log", it applies the change, then federates an "Apply" command to the peers that wrote the change to their respective logs, and then the ack is sent to the client.
I would say there is a distinction between committing an entry and applying it to the state machine. Once an entry is committed (i.e. the commitIndex is >= the entry index) it can be applied at any time. In practice, you want the leader to apply committed entries as soon as possible to reduce latency, so entries will usually be applied to an in-memory state machine immediately.
In the case of in-memory state machines the distinction is not very obvious, but other use cases for Raft do necessitate it. For example, the distinction becomes particularly important with persistent state machines. If the state machine is persisting changes to e.g. an underlying database, it's critical that each entry be applied to the state machine only once, so that the underlying store does not go back in time when the node replays entries to the state machine while recovering from a failure. To make applying entries to a persistent state machine idempotent, the state machine on each node needs to persist which entries have been applied on that node as part of its persistent state. In this case, applying an entry is indeed a separate step, and a critical one.
State machine replication is also only one use case for Raft. There are others as well. It’s perfectly valid, for example, to use the protocol for simple log replication, in which case entries wouldn’t be applied at all.
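To illustrate the commit/apply split described above, here is a rough sketch; commitIndex and lastApplied follow the paper's names, everything else is invented for the example.

```go
// Invented names except commitIndex/lastApplied, which follow the paper.
type stateMachine interface {
	Apply(index int, command []byte)
}

type server struct {
	log         [][]byte // log[i] holds the command of entry i (0-based, simplified)
	commitIndex int      // highest entry known to be committed (-1 = none)
	lastApplied int      // highest entry handed to the state machine (-1 = none)
	sm          stateMachine
}

// applyCommitted applies every committed-but-unapplied entry, in order,
// exactly once. For a persistent state machine, lastApplied (or an applied
// index stored alongside the data) must be persisted too, so that replaying
// the log after a restart does not apply the same entry twice.
func (s *server) applyCommitted() {
	for s.lastApplied < s.commitIndex {
		s.lastApplied++
		s.sm.Apply(s.lastApplied, s.log[s.lastApplied])
	}
}
```

Raft advances commitIndex; how eagerly applyCommitted runs is up to the application, which is the distinction the answer draws between committing and applying.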

Why does the RAFT protocol reject RequestVote with a lower term?

In Raft, every node rejects any request with a term number lower than its own. But why do we need this for the RequestVote RPC? If the Leader Completeness Property holds, then a node can safely vote for this candidate, right? So why reject the request? I can't come up with an example where, without this additional check on receiving RequestVote, we lose consistency, safety, etc.
Maybe someone can help, please?
A candidate whose RequestVote RPC carries a lower term may have a log that is not up to date with the other nodes, because a leader could have been elected in a higher term and already replicated an entry to a majority of servers.
Even if this candidate were elected leader, the rules of Raft would prevent it from doing anything, for safety reasons: its RPCs would be rejected by the other servers due to the lower term.
I think there are two reasons.
The term in Raft is monotonically increasing. Any request from a previous term was made based on stale state; it will simply be rejected, and the reply will carry the current term (and leader information).
Any elected leader must have all entries committed before the election; a candidate still on a previous term is unlikely to have them all, since it cannot have any entries committed in the current term.
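As a small sketch of the check being discussed, this is the "term gate" implementations typically apply at the top of every RPC handler, RequestVote included (the names and state constants here are made up for the example).

```go
// Invented names; the point is the check applied before any other processing.
const (
	follower = iota
	candidate
	leader
)

type peer struct {
	currentTerm int
	votedFor    string
	state       int
}

// acceptTerm runs before the rest of the RequestVote (or AppendEntries)
// handler. A request carrying a term lower than ours was issued from stale
// state: a newer election has already started, and entries may already have
// been committed in that newer term. The request is rejected and the reply
// carries our term so the sender can step back to follower and catch up.
// A higher term makes us adopt it and fall back to follower ourselves.
func (p *peer) acceptTerm(rpcTerm int) bool {
	if rpcTerm < p.currentTerm {
		return false
	}
	if rpcTerm > p.currentTerm {
		p.currentTerm = rpcTerm
		p.votedFor = "" // no vote granted yet in the newly adopted term
		p.state = follower
	}
	return true
}
```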

Is it practical to keep logging messages in a group communication service or Paxos?

In the case of a network partition or a node crash, most distributed atomic broadcast protocols (like Extended Virtual Synchrony or Paxos) require the running nodes to keep logging messages until the crashed or partitioned node rejoins the cluster. When a node rejoins the cluster, replaying the logged messages is enough to regain the current state.
My question is: if the partitioned/crashed node takes a really long time to join the cluster again, the logs will eventually overflow. This seems like a very practical issue, yet the papers don't talk about it. Is there an obvious solution to this that I am missing, or is my understanding incorrect?
You don't really need to remember the whole log. Imagine for example that the state you were synchronizing between the nodes was something like an SQL table with a row of the form (id: int, name: string) and the commands that would be written into the logs were in a form "insert row with id=x and name=y", "delete row where id=z", "set name=a where id=1000",...
Once such commands were committed, all you really care about is the final table. Then once a node which was offline for a long time goes online, it would only need to download the table + few entries from the log that were committed while the table was being downloaded.
This is called "log compaction"; check out chapter 7 of the Raft paper for more info.
There are a few potential solutions to the infinite log problem, but one of the more popular ones for replicated state machines is to periodically snapshot the full replicated state machine and delete all history prior to that point. A node that has been offline too long would then just discard all of its information, download the snapshot, and start replaying the replicated log from that point.
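As a rough sketch of that snapshot-and-truncate approach (the types below and the key/value "table" standing in for the SQL example are invented), compaction replaces the applied prefix of the log with a snapshot of the state:

```go
// Invented types; the "state" here stands in for the SQL table above.
type logEntry struct {
	Index int
	Term  int
	Cmd   string
}

type snapshot struct {
	LastIncludedIndex int
	LastIncludedTerm  int
	State             map[string]string // the final table, not its history
}

type compactingLog struct {
	snap    snapshot
	entries []logEntry // only entries with Index > snap.LastIncludedIndex
}

// compact folds everything up to appliedIndex into a snapshot and drops those
// entries, so the retained log stays bounded. A node that has been offline
// too long is then brought up to date by sending it the snapshot plus the
// short tail of entries committed since the snapshot was taken.
func (l *compactingLog) compact(appliedIndex, appliedTerm int, state map[string]string) {
	var keep []logEntry
	for _, e := range l.entries {
		if e.Index > appliedIndex {
			keep = append(keep, e)
		}
	}
	l.entries = keep
	l.snap = snapshot{
		LastIncludedIndex: appliedIndex,
		LastIncludedTerm:  appliedTerm,
		State:             state,
	}
}
```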

RAFT consensus protocol - Should entries be durable before committing

I have the following query about implementing RAFT:
Consider the following scenario/implementation:
1. The RAFT leader receives a command entry and appends the entry to an in-memory array.
2. It then sends the entries to the followers (with the heartbeat).
3. The followers receive the entry, append it to their in-memory array, and then send a response that they have received the entry.
4. The leader then commits the entry by writing it to a durable store (file).
5. The leader sends the latest commit index in the heartbeat.
6. The followers then commit the entries based on the leader's commit index by storing the entry to their durable store (file).
One of the implementations of RAFT (link: https://github.com/peterbourgon/raft/) seems to implement it this way. I wanted to confirm whether this is fine.
Is it OK if entries are maintained "in memory" by the leader and the followers until they are committed? In what circumstances might this scenario fail?
I disagree with the accepted answer.
A disk isn't mystically durable. Assuming the disk is local to the server, it can fail permanently, so clearly writing to disk doesn't save you from that. Replication is durability, provided that the replicas live in different failure domains, which they will if you are serious about durability. Of course there are many hazards a process suffers from that disks don't (the Linux OOM killer, out-of-memory in general, power loss, etc.), but a dedicated process on a dedicated machine can do pretty well, especially if the log store is, say, ramfs, so a process restart isn't an issue.
If log storage is lost, then the host's identity should be lost as well. A, B and C identify logs: new log, new id. B "rejoining" after a (potential) loss of storage is simply a buggy implementation. The new process can't claim the identity of B because it can't be sure that it has all the information that B had. Just as in the always-flush-to-disk case: if we replaced the disk of the machine hosting B, we couldn't just restart the process configured with B's identity. That would be nonsense. In both cases it should restart as D and then ask to join the cluster, at which point the problem of losing committed writes disappears in a puff of smoke.
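A sketch of that "new log, new id" rule at process startup might look like the following; the file layout, the idea of keeping the id file in the same directory as the log, and the use of a UUID library are all assumptions for the example, not taken from any particular implementation.

```go
import (
	"os"
	"path/filepath"

	"github.com/google/uuid" // any unique-ID scheme works; this is just an example
)

// startupIdentity returns the server's identity. The id file lives alongside
// the durable Raft state, so losing the log storage also loses the id. If the
// durable state is gone, the process must NOT reuse its old identity, because
// it can no longer prove it remembers its old votes and log; it comes up as a
// brand-new server and must be added to the cluster via membership change.
func startupIdentity(stateDir string) (string, error) {
	idFile := filepath.Join(stateDir, "server-id")
	if b, err := os.ReadFile(idFile); err == nil {
		return string(b), nil // durable state survived: keep the identity
	}
	newID := uuid.NewString()
	if err := os.WriteFile(idFile, []byte(newID), 0o600); err != nil {
		return "", err
	}
	return newID, nil
}
```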
I found the answer to the question by posting to raft-dev google group. I have added the answer for reference.
Please reference: https://groups.google.com/forum/#!msg/raft-dev/_lav2NeiypQ/1QbUB52fkggJ
Quoting Diego's answer:
For safety even in the face of correlated power outages, a majority of servers needs to have persisted the log entry before its effects are externalized. Any less than a majority and those servers could permanently fail, resulting in data loss/corruption.
Quoting from Ben Johnson's answer to my email regarding the same:
No, a server has to flush entries to disk before being considered part of the quorum.
For example, let's say you have a cluster of nodes called A, B, & C where A is the leader.
Node A replicates an entry to Node B.
Node B stores entry in memory and responds to Node A.
Node A now has a quorum and commits the entry.
Node A then gets partitioned away from Node B & C.
Node B then dies and loses the in-memory copy of the entry.
Node B comes back up.
When Node B & C then go to elect a leader, the "committed" entry will not be in their log.
When Node A rejoins the cluster, it will have an inconsistent log. The entry will have been committed and applied to the state machine, so it can't be rolled back.
Ben
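A minimal sketch of what "flush before being counted in the quorum" means on the follower side; the on-disk format and the helper name here are illustrative only, not from any real library.

```go
import (
	"encoding/json"
	"os"
)

// entry is an invented on-disk record; real implementations use their own format.
type entry struct {
	Term int
	Cmd  []byte
}

// persistEntries appends new entries to the log file and fsyncs before
// returning. A follower only replies success=true to AppendEntries after this
// has succeeded, so that its acknowledgment can safely count toward the quorum.
func persistEntries(path string, entries []entry) error {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o600)
	if err != nil {
		return err
	}
	defer f.Close()
	enc := json.NewEncoder(f)
	for _, e := range entries {
		if err := enc.Encode(e); err != nil {
			return err
		}
	}
	// Sync forces the data to stable storage; without it, "written" entries
	// can still be lost in a power failure, which is exactly the scenario
	// Diego describes above.
	return f.Sync()
}
```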
I think entries should be durable before committing.
Let's take Figure 8(e) of the extended Raft paper as an example. If entries are made durable only when committed, then:
S1 replicates entry 4 to S2 and S3, then commits 2 and 4.
All servers crash. Because S2 and S3 don't know that S1 has committed 2 and 4, they won't commit them. Therefore S1 has committed 1, 2 and 4, while S2, S3, S4 and S5 have committed only 1.
All servers restart except S1.
Because only committed entries are durable, S2, S3, S4 and S5 have the same single entry: 1.
S2 is elected as the leader.
S2 replicates a new entry to all other servers except the crashed S1.
S1 restarts. Because S2's entries are newer than S1's, S1's entries 2 and 4 are replaced by the new entry.
As a result, the committed entries 2 and 4 are lost. So I think uncommitted entries should also be made durable.