My question is: if the leader crashes but is later elected leader again, how does it know its actual lastApplied? Will it apply log entries repeatedly?
I want to know how this works.
lastApplied is tracked on each node individually, both on the leader and on followers. Every node must have persistent storage, where this value (and other data) resides.
So if a node restarts after a crash, it reads lastApplied back from that storage.
If the storage is completely lost, it's the same case as adding a brand new node to the pool.
Quick edit: lastApplied index works the same way for both leader and followers. There is nothing special about it based on the node role.
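As a rough illustration of that recovery path, here is a minimal Go sketch, where Store and StateMachine are hypothetical interfaces rather than any particular library's API: on startup the node reads lastApplied back from stable storage and only re-applies committed entries beyond it, so nothing is applied twice.

package raftsketch

// Sketch only: Store and StateMachine are hypothetical interfaces,
// not part of any particular Raft implementation.
type Store interface {
	LoadLastApplied() (uint64, error)            // read the persisted lastApplied index
	LoadCommitted(from uint64) ([][]byte, error) // committed entries at or after an index
}

type StateMachine interface {
	Apply(cmd []byte)
}

// recoverState runs on startup, regardless of whether the node later
// becomes leader or stays a follower.
func recoverState(s Store, sm StateMachine) (uint64, error) {
	lastApplied, err := s.LoadLastApplied()
	if err != nil {
		return 0, err
	}
	// Re-apply only entries strictly after lastApplied, so nothing
	// gets applied twice after a crash.
	entries, err := s.LoadCommitted(lastApplied + 1)
	if err != nil {
		return 0, err
	}
	for _, cmd := range entries {
		sm.Apply(cmd)
		lastApplied++
	}
	return lastApplied, nil
}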
In RAFT, I understand that a leader receives a request and federates it out to its peer list to commit to their respective logs.
My question is, is there a distinction between committing the action to the log and actually applying the action? If the answer is yes, then at what point does the action get applied?
My understanding is that once the leader receives, from the majority, "hey, I wrote this to my log", it applies the change, then federates an "Apply" command to the peers that wrote the change to their respective logs, and then the ack is sent to the client.
I would say there is a distinction between committing an entry and applying it to the state machine. Once an entry is committed (i.e. the commitIndex is >= the entry index) it can be applied at any time. In practice, you want the leader to apply committed entries as soon as possible to reduce latency, so entries will usually be applied to an in-memory state machine immediately.
In the case of in-memory state machines the distinction is not very obvious. But it's the other use cases for Raft that do necessitate this distinction. For example, the distinction becomes particularly important with persistent state machines. If the state machine is persisting changes to e.g. an underlying database, it's critical that each entry only be applied to the state machine once, so that the underlying store does not go back in time when the node replays entries to the state machine while recovering from a failure. To make persistent state machines idempotent, the state machine on each node needs to persist the entries that have been applied on that node as part of the persistent state. In this case, applying an entry really is distinct from committing it, and the distinction is a critical one.
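For a persistent state machine, one way to get that idempotence is to record the applied index in the same transaction as the state change itself. A minimal sketch, assuming a hypothetical transactional store (not any specific database API):

package raftsketch

// Sketch only: TxStore/Tx are hypothetical; the point is that the state
// mutation and the applied-index update commit together.
type TxStore interface {
	Begin() Tx
}

type Tx interface {
	ApplyCommand(cmd []byte)      // mutate the persistent state machine
	SetAppliedIndex(index uint64) // record how far we have applied
	Commit() error
}

// applyCommitted applies log entries in (lastApplied, commitIndex] exactly once.
// Log entry i is assumed to live at log[i-1].
func applyCommitted(db TxStore, log [][]byte, lastApplied, commitIndex uint64) uint64 {
	for i := lastApplied + 1; i <= commitIndex; i++ {
		tx := db.Begin()
		tx.ApplyCommand(log[i-1])
		tx.SetAppliedIndex(i)
		if err := tx.Commit(); err != nil {
			break // index i was not recorded as applied; retry later
		}
		lastApplied = i
	}
	return lastApplied
}

If the node crashes anywhere in this loop, replaying from the stored applied index cannot apply an entry a second time, because the index and the mutation are committed atomically.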
State machine replication is also only one use case for Raft. There are others as well. It’s perfectly valid, for example, to use the protocol for simple log replication, in which case entries wouldn’t be applied at all.
I've been watching the Raft algorithm video at https://youtu.be/vYp4LYbnnW8?t=3244, but am not clear about one circumstance.
In the leader election for term 4, if node s1 broadcasts RequestVote before s3 does, then nodes s2, s4 and s5 would vote for it, while s3 doesn't. When node s3 then broadcasts RequestVote to the others, how can it get their votes?
One possible way I can think of to handle the situation is:
if node s1 receives the rejection from s3 and finds out that s3's log is newer than its own, it does not set itself as leader, even though it received a majority of votes;
as for the other nodes, they remember which candidate they voted for, and if a new vote request comes in with a bigger <lastTerm, lastIndex>, they re-vote for the node with the bigger <lastTerm, lastIndex>.
In both scenarios, node s3 eventually gets the others' votes and sets itself as leader. I'm not sure if my guess is correct.
(Before I comment, be aware that there is NO possible way for entry #9 to be committed. There is no indication of which log entries are committed, but this discussion works with any of #s 1-8 as being committed.)
In short, s3 does not become the leader, s1 does because it gets a majority of the votes. If your concern is that entry #9 will be lost, that is true, but it wasn't committed anyway.
From §5.3:
In Raft, the leader handles inconsistencies by forcing the followers' logs to duplicate its own. This means that conflicting entries in follower logs will be overwritten with entries from the leader's log.
To comment on your handling of the situation.
1, if node s1 receives the rejection from s3 and finds out that s3's log is newer than its own, it does not set itself as leader even though it received a majority of votes
It could do this, but it will make failover take longer, because s3 would have to try again with a different timeout, and you can end up in a race where s1 always broadcasts RequestVote before s3 does. But again, it is always safe to delete the excess entries that s3 has.
The last paragraph of §5.3 talks about how this easy, timeout-based election process was used instead of ranking the nodes and selecting the best. I agree with the outcome. Simpler protocols are more robust.
2, As for the other nodes, they remember which candidate they voted for, and if a new vote request comes in with a bigger <lastTerm, lastIndex>, they re-vote for the node with the bigger <lastTerm, lastIndex>.
This is strictly forbidden because it destroys leader election: if a node can change its vote, you will very often elect multiple leaders for the same term. This is bad. I cannot stress enough how bad this is. Bad, bad, bad.
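For reference, the rule that decides these votes is the "at least as up-to-date" check from §5.4.1 of the paper. A minimal sketch of the comparison (the function name is illustrative):

package raftsketch

// Per §5.4.1: a candidate's log is at least as up-to-date as the voter's if
// its last term is higher, or, with equal last terms, if it is at least as
// long. A voter that fails this check rejects the RequestVote.
func candidateUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex uint64) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm
	}
	return candLastIndex >= myLastIndex
}

In the video's scenario this is exactly why s3 rejects s1's RequestVote while the other three nodes grant it: s1 still wins the election, and s3's extra uncommitted entry is later overwritten.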
In the case of a network partition or node crash, most distributed atomic broadcast protocols (like Extended Virtual Synchrony or Paxos) require the running nodes to keep logging messages until the crashed or partitioned node rejoins the cluster. When a node rejoins the cluster, replaying the logged messages is enough for it to regain the current state.
My question is: if the partitioned/crashed node takes a really long time to rejoin the cluster, then eventually the logs will overflow. This seems to be a very practical issue, yet the papers don't talk about it. Is there an obvious solution to this that I am missing? Or is my understanding incorrect?
You don't really need to remember the whole log. Imagine, for example, that the state you were synchronizing between the nodes was something like an SQL table with rows of the form (id: int, name: string), and the commands written into the logs were of the form "insert row with id=x and name=y", "delete row where id=z", "set name=a where id=1000", ...
Once such commands are committed, all you really care about is the final table. Then, once a node that was offline for a long time comes back online, it only needs to download the table plus the few entries from the log that were committed while the table was being downloaded.
This is called "log compaction"; check out chapter 7 of the Raft paper for more info.
There are a few potential solutions to the infinite-log problem, but one of the more popular ones for replicated state machines is to periodically snapshot the full replicated state machine and delete all history prior to that point. A node that has been offline too long would then just discard all of its information, download the snapshot, and start replaying the replicated log from that point.
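A rough sketch of the snapshot-and-truncate step described in both answers (the types and the compact helper are illustrative, not taken from any specific implementation):

package raftsketch

// Entry is a minimal log entry for this sketch.
type Entry struct {
	Term    uint64
	Command []byte
}

// Snapshot metadata as in Raft log compaction: the snapshot covers
// everything up to and including LastIncludedIndex.
type Snapshot struct {
	LastIncludedIndex uint64
	LastIncludedTerm  uint64
	State             []byte // serialized state machine, e.g. the SQL table above
}

// compact replaces the already-applied prefix of the log with a snapshot.
// Only applied entries may be discarded; assumes lastApplied >= 1 and that
// log entry i lives at log[i-1].
func compact(log []Entry, lastApplied uint64, serialize func() []byte) (Snapshot, []Entry) {
	snap := Snapshot{
		LastIncludedIndex: lastApplied,
		LastIncludedTerm:  log[lastApplied-1].Term,
		State:             serialize(),
	}
	// A node that was offline too long is sent snap first, then whatever
	// entries remain after it.
	rest := append([]Entry(nil), log[lastApplied:]...)
	return snap, rest
}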
I have the following query about implementation RAFT:
Consider the following scenario/implementation:
1. The RAFT leader receives a command entry; it appends the entry to an in-memory array.
2. It then sends the entries to the followers (with the heartbeat).
3. The followers receive the entry, append it to their in-memory array, and then send a response that they have received the entry.
4. The leader then commits the entry by writing it to a durable store (a file).
5. The leader sends the latest commit index in the heartbeat.
6. The followers then commit the entries based on the leader's commit index by storing the entry to their durable store (a file).
One of the implementations of RAFT (link: https://github.com/peterbourgon/raft/) seems to implement it this way. I wanted to confirm if this is fine.
Is it OK if entries are maintained "in memory" by the leader and the followers until they are committed? In what circumstances might this scenario fail?
I disagree with the accepted answer.
A disk isn't mystically durable. Assuming the disk is local to the server, it can permanently fail, so clearly writing to disk doesn't save you from that. Replication is durability, provided that the replicas live in different failure domains, which, if you are serious about durability, they will. Of course there are many hazards for a process that disks don't suffer from (the Linux OOM killer, running out of memory in general, power loss, etc.), but a dedicated process on a dedicated machine can do pretty well, especially if the log store is, say, ramfs, so a process restart isn't an issue.
If log storage is lost, then host identity should be lost as well. A, B, C identify logs: new log, new identity. B "rejoining" after a (potential) loss of storage is simply a buggy implementation. The new process can't claim the identity of B because it can't be sure that it has all the information that B had. Just like in the case of always flushing to disk: if we replaced the disk of the machine hosting B, we couldn't just restart the process configured with B's identity; that would be nonsense. It should restart as D in both cases and then ask to join the cluster, at which point the problem of losing committed writes disappears in a puff of smoke.
I found the answer to the question by posting to the raft-dev Google group. I have added it here for reference.
Please reference: https://groups.google.com/forum/#!msg/raft-dev/_lav2NeiypQ/1QbUB52fkggJ
Quoting Diego's answer:
For safety even in the face of correlated power outages, a majority of servers needs to have persisted the log entry before its effects are externalized. Any less than a majority and those servers could permanently fail, resulting in data loss/corruption.
Quoting from Ben Johnson's answer to my email regarding the same:
No, a server has to flush entries to disk before being considered part of the quorum.
For example, let's say you have a cluster of nodes called A, B, & C where A is the leader.
Node A replicates an entry to Node B.
Node B stores entry in memory and responds to Node A.
Node A now has a quorum and commits the entry.
Node A then gets partitioned away from Node B & C.
Node B then dies and loses the in-memory copy of the entry.
Node B comes back up.
When Node B & C then go to elect a leader, the "committed" entry will not be in their log.
When Node A rejoins the cluster, it will have an inconsistent log. The entry will have been committed and applied to the state machine, so it can't be rolled back.
Ben
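The practical consequence for the implementation in the question: a follower should append and fsync new entries before replying to AppendEntries, not only once it learns the commit index. A minimal sketch with a hypothetical write-ahead-log interface (this is not the peterbourgon/raft API):

package raftsketch

// WAL is a hypothetical durable log; the key point is the ordering below:
// append and fsync first, acknowledge second.
type WAL interface {
	Append(entries [][]byte) error // write the entries to the on-disk log
	Sync() error                   // fsync so they survive a crash or power loss
}

func handleAppendEntries(log WAL, entries [][]byte) bool {
	if err := log.Append(entries); err != nil {
		return false
	}
	// Flush to durable storage BEFORE responding; otherwise the leader may
	// count this node toward the quorum for an entry the node can still lose.
	if err := log.Sync(); err != nil {
		return false
	}
	return true
}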
I think entries should be durable before committing.
Let's take Figure 8(e) of the extended Raft paper as an example. If only committed entries are made durable, then:
S1 replicates entry 4 to S2 and S3, then commits 2 and 4.
All servers crash. Because S2 and S3 don't know that S1 has committed 2 and 4, they won't commit them. Therefore S1 has committed 1, 2 and 4, while S2, S3, S4 and S5 have committed only 1.
All servers restart except S1.
Because only committed entries are durable, S2, S3, S4 and S5 have the same single entry: 1.
S2 is elected as the leader.
S2 replicates a new entry to all other servers except the crashed S1.
S1 restarts. Because S2's entries are newer than S1's, S1's entries 2 and 4 are overwritten by the new entry from the previous step.
As a result, the committed entries 2 and 4 are lost. So I think uncommitted entries should also be durable.
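That lines up with Figure 2 of the Raft paper, which lists currentTerm, votedFor, and the log itself as state that must be updated on stable storage before responding to RPCs; note that the log there includes uncommitted entries. As a sketch:

package raftsketch

// Per Figure 2 of the Raft paper: persisted (and synced) before the node
// responds to any RPC.
type PersistentState struct {
	CurrentTerm uint64
	VotedFor    string  // candidate voted for in CurrentTerm, "" if none
	Log         []Entry // every entry, committed or not
}

// Entry is a minimal log entry for this sketch.
type Entry struct {
	Term    uint64
	Command []byte
}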
I've been wondering why ZooKeeper needs a majority of the machines in the ensemble to work at all. Let's say we have a very simple ensemble of 3 machines: A, B, C.
When A fails, a new leader is elected - fine, everything works. When another one dies, let's say B, the service becomes unavailable. Does that make sense? Why can't machine C handle everything alone until A and B are up again?
Since one machine is enough to do all the work (for example, a single-machine ensemble works fine)...
Is there any particular reason why ZooKeeper is designed this way? Is there a way to configure ZooKeeper so that, for example, the ensemble is available whenever at least one of the N machines is up?
Edit:
Maybe there is a way to apply a custom leader election algorithm? Or to define the quorum size?
Thanks in advance.
Zookeeper is intended to distribute things reliably. If the network of systems becomes segmented, you don't want the two halves operating independently and potentially getting out of sync, because when the failure is resolved the system won't know how to reconcile them. If you have it refuse to operate when it has less than a majority, then you can be assured that when a failure is resolved, everything will come right back up without further intervention.
The reason to get a majority vote is to avoid a problem called "split-brain".
Basically, in a network failure you don't want the two parts of the system to continue as usual. You want one to continue and the other to understand that it is not part of the cluster.
There are two main ways to achieve that. One is to hold a shared resource, for instance a shared disk where the leader holds a lock: if you can see the lock, you are part of the cluster; if you can't, you're out. If you are holding the lock you're the leader, and if you're not holding it you're not. The problem with this approach is that you need that shared resource.
The other way to prevent split-brain is a majority count: if you get enough votes, you are the leader. This still works when only two nodes of a three-node ensemble are up, where the leader says it is the leader and the other node, acting as a "witness", agrees. This method is preferable as it can work in a shared-nothing architecture, and indeed that is what Zookeeper uses.
As Michael mentioned, a node cannot know whether the reason it doesn't see the other nodes in the cluster is that those nodes are down or that there's a network problem - the safe bet is to say there's no quorum.
Let’s look at an example that shows how things can go wrong if the quorum (majority of running servers) is too small.
Say we have five servers and a quorum can be any set of two servers. Now say that servers s1 and s2 acknowledge that they have replicated a request to create a znode /z. The service returns to the client saying that the znode has been created. Now suppose servers s1 and s2 are partitioned away from the other servers and from clients for an arbitrarily long time, before they have a chance to replicate the new znode to the other servers. The service in this state is able to make progress because there are three servers available and it really needs only two according to our assumptions, but these three servers have never seen the new znode /z. Consequently, the request to create /z is not durable.
This is an example of the split-brain scenario. To avoid this problem, in this example the size of the quorum must be at least three, which is a majority out of the five servers in the ensemble. To make progress, the ensemble needs at least three servers available. To confirm that a request to update the state has completed successfully, this ensemble also requires that at least three servers acknowledge that they have replicated it.
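The underlying rule is that any two quorums must intersect, which is guaranteed only when a quorum is a majority, i.e. floor(n/2) + 1 servers. A small sketch of the arithmetic behind the example above (the helper names are illustrative):

package main

import "fmt"

// quorumSize is the majority quorum for an ensemble of n servers: floor(n/2) + 1.
func quorumSize(n int) int { return n/2 + 1 }

// minOverlap is the smallest possible intersection of two quorums of size q
// in an ensemble of n servers (a negative result means they can be disjoint).
func minOverlap(q, n int) int { return 2*q - n }

func main() {
	n := 5
	fmt.Println(quorumSize(n))    // 3
	fmt.Println(minOverlap(2, n)) // -1: two "quorums" of two can miss each other entirely
	fmt.Println(minOverlap(3, n)) // 1: any two majorities share a server that has seen /z
}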