How does the Raft algorithm maintain strong read consistency in the case of a write failure followed by a node failure?

Consider three nodes (A, B, C) storing key/value data, and suppose the following steps happen:
1. Node A, the leader, receives the key/value pair (1:15).
2. It replicates the entry to nodes B and C.
3. The entry is appended to node B's log but not yet committed.
4. Node C fails the entry.
5. The acknowledgement from node B is lost.
6. Node A fails the entry and returns a failure to the client.
7. Node A is still the leader, and B is not counted toward the quorum.
8. The client reads key 1 from node A and gets the old value.
9. Node A goes down.
10. Nodes B and C are up.
11. Now node B has the entry in its uncommitted log and node C does not.
How does log matching happen at this point? Is node B going to commit that entry or discard it? If it commits, reads would be inconsistent; if it discards, there could be data loss in other cases.

The error is in step 8. Every read operation must be confirmed by the other nodes, otherwise you risk getting stale data; the system should serve the read only after it has written a dummy (no-op) value to the log. In your case (B is effectively offline), the "read" must involve nodes A and C, so when node B comes back online and A dies, C will be able to invalidate B's record.
This is a tricky problem, and even etcd ran into it in the past (it has since been fixed).
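To make the fix concrete, here is a minimal Python sketch of that rule, with hypothetical names (`Leader`, `append_no_op`, `wait_until_committed`) that are not from etcd or any particular Raft implementation: the leader serves a read only after a dummy (no-op) entry written in its own term has been acknowledged by a quorum.

```python
class StaleLeaderError(Exception):
    """Raised when this node can no longer prove it is the leader."""


class Leader:
    def __init__(self, log, state_machine):
        self.log = log                      # replicated log (hypothetical interface)
        self.state_machine = state_machine  # local key/value store

    def linearizable_read(self, key):
        # Append a no-op (dummy) entry in the current term and wait until a
        # quorum has acknowledged it.  If that succeeds, this node was still
        # the leader when the read was issued, so its state machine cannot
        # be stale.  If it fails, refuse to serve the read.
        index = self.log.append_no_op()
        if not self.log.wait_until_committed(index, timeout=1.0):
            raise StaleLeaderError("cannot confirm leadership; not serving read")
        return self.state_machine.get(key)
```

With this rule, the read in step 8 would either be confirmed by a quorum that includes C (and thus reflect the failed write correctly) or be rejected, rather than silently returning potentially stale data.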

Related

How does Raft preserve safety when a leader commits a log entry and crashes before informing the followers of this commitment?

In my understanding, a leader sends an AppendEntries RPC to the followers, and if a majority of followers return success, the leader commits this entry. It commits the entry by applying it to its own state machine, and it also responds to the client to let it know that the command was successful.
However, at this time the commitment is not yet known to the followers. The leader will inform them in the next AppendEntries (or heartbeat) RPC calls.
In the simplest case, if the leader crashes after the commitment and before the next AppendEntries, Raft uses the "only the most up-to-date follower can win" strategy to ensure that the next leader must contain this log entry (although not committed), and the new leader will commit this entry and send AppendEntries to the other followers. In this way, the log entry is safely kept.
However, consider the following complicated scenario (extracted from the PhD thesis "Consensus: Bridging Theory and Practice", page 23).
At this point, the log entry from term 2 has been replicated on a
majority of the servers, but it is not committed. If S1 crashes as in
(d1), S5 could be elected leader (with votes from S2, S3, and S4) and
overwrite the entry with its own entry from term 3.
What if at this point the entry is committed on server S1, but not yet committed on the other servers? If S1 then crashes as in (d1), will this log entry be overwritten by S5?
In my understanding, a committed entry (applied to the state machine, with the client possibly informed of the result) should never be overwritten?
Did I misunderstand anything of the raft protocol?
Thanks.
There are more conditions in Raft to commit an entry.
On page 4 of this paper (the one-page summary of Raft) it says:
Leaders:
...
If there exists an N such that N > commitIndex, a majority of matchIndex[i] ≥ N, and log[N].term == currentTerm set commitIndex = N (§5.3, §5.4).
In other words, it is not enough for the entry to be replicated to a majority; its term also has to be the current term. This is why, in a practical system, a new leader proposes a no-op entry so that it can advance commitIndex.
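As a rough illustration (not taken from any real implementation), the commit rule quoted above can be sketched in Python like this; `self.match_index`, `self.commit_index`, and `self.current_term` follow the leader state described in Figure 2 of the paper:

```python
def try_advance_commit_index(self):
    """Advance commitIndex per the rule quoted above (a sketch, not a complete Raft).

    self.log          : list of entries, each with a .term attribute (index 1 = log[0])
    self.match_index  : dict follower_id -> highest log index known replicated there
    self.commit_index : highest log index known to be committed
    self.current_term : this leader's term
    """
    for n in range(len(self.log), self.commit_index, -1):          # try the highest N first
        replicas = 1 + sum(1 for m in self.match_index.values() if m >= n)  # +1 for the leader
        has_majority = replicas > (len(self.match_index) + 1) // 2
        if has_majority and self.log[n - 1].term == self.current_term:      # term check (§5.3, §5.4)
            self.commit_index = n
            break
```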
So now we know the leader won't commit until those conditions hold, but what if it commits and then fails to send the commit notification?
Later in section 5.4.1 of the same paper it says (emphasis mine):
Raft...guarantees that all the committed entries from previous terms are present on each new leader from the moment of its election....Raft uses the voting process to prevent a candidate from winning an election unless its log contains all committed entries. A candidate must contact a majority of the cluster in order to be elected, which means that every committed entry must be present in at least one of those servers.
In short, the new leader by definition must have the entries that the old leader considered committed.
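For illustration, the voting restriction behind this guarantee can be sketched as follows (a simplified Python fragment with invented names, not a real implementation): a voter grants its vote only if the candidate's log is at least as up to date as its own, comparing the last log term first and then the last log index.

```python
def candidate_log_is_up_to_date(my_log, candidate_last_term, candidate_last_index):
    """Return True if the candidate's log is at least as up to date as ours (§5.4.1)."""
    my_last_index = len(my_log)                      # the paper indexes logs from 1
    my_last_term = my_log[-1].term if my_log else 0
    if candidate_last_term != my_last_term:
        return candidate_last_term > my_last_term    # the higher last term wins
    return candidate_last_index >= my_last_index     # same last term: the longer log wins
```

Because a committed entry already sits on a majority of servers and a candidate needs votes from a majority, at least one voter holds that entry, and this check makes it refuse any candidate whose log lacks it.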

Does the Raft consensus protocol handle nodes that have lost contact with the leader but not with the other nodes?

In the case of network partitions, Raft stays consistent. But what happens if a single node loses contact only with the leader, becomes a candidate, and calls for votes?
This is the setup, I adjusted the examples from http://thesecretlivesofdata.com/raft/ to fit my needs:
Node B is the current leader and sends out heartbeats (red) to the followers. The connection between B and C gets lost and after the election timeout C becomes a candidate, votes for itself and asks nodes A, D and E to vote for it (green).
What happens?
As far as I understand Raft, nodes A, D and E should vote for C, which makes C the next leader (term 2). We then have two leaders, each sending out heartbeats, and hopefully nodes A, D and E will ignore those from B because of the lower term.
Is this correct or is there some better mechanism?
After going through the Raft paper again, it seems that my approach above was correct. From the paper:
Terms act as a logical clock in Raft, and they allow servers to detect obsolete information such as stale leaders. Each server stores a current term number, which increases monotonically over time. Current terms are exchanged whenever servers communicate; if one server's current term is smaller than the other's, then it updates its current term to the larger value. If a candidate or leader discovers that its term is out of date, it immediately reverts to follower state. If a server receives a request with a stale term number, it rejects the request.
The highlighted part is the one I was missing above. So the process is as follows (a sketch in code follows these steps):
After node C has become candidate, it increases its term-number to 2 and requests votes from the reachable nodes (A, D and E).
Those will immediately update their current_term variable to 2 and vote for C.
Thus, nodes A, D and E will ignore heartbeats from B and moreover tell B that the current term is 2.
B will revert to follower state (and won't be updated until the network connection between C and B is healed).
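The term bookkeeping in these steps can be sketched roughly as follows (Python pseudocode; the `Node` class and method names are invented for illustration, and the log up-to-date check is omitted):

```python
FOLLOWER, CANDIDATE, LEADER = "follower", "candidate", "leader"


class Node:
    def __init__(self):
        self.current_term = 1
        self.role = FOLLOWER
        self.voted_for = None

    def observe_term(self, term):
        # Rule from the quoted paragraph: any message carrying a higher term
        # bumps our current term and demotes a leader/candidate to follower.
        if term > self.current_term:
            self.current_term = term
            self.role = FOLLOWER
            self.voted_for = None

    def on_request_vote(self, candidate_id, candidate_term):
        self.observe_term(candidate_term)
        if candidate_term < self.current_term:
            return False                        # stale candidate: reject
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id       # log up-to-date check omitted for brevity
            return True
        return False

    def on_append_entries(self, leader_term):
        self.observe_term(leader_term)
        return leader_term >= self.current_term  # B's old-term heartbeats get rejected
```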
Since A, D and E keep receiving healthy heartbeats from the leader B (term 1), they would not respond to the vote request from C (term 2); C will time out, repeat the vote, and time out again.
See Figure 4 in the Raft paper: https://raft.github.io/raft.pdf

ZooKeeper: do watches on nodes that are modified block all the other reads?

My understanding of ZooKeeper is that a client will always execute requests in an ordered manner from ITS point of view.
Therefore, if client 1 issues:
write node A
2 reads: node A, node B
write node B
they will be executed in that order.
But in case client 1 also has a watch on node C, and client 2 writes that node, does that write on node C impact/block reads from client 1?
For example:
Client 1: starts watching C
Client 1: writes node A
Client 2: writes C
(Client 1: does client 1 block until the watch of C is fired? What if at this point client 1 tries to write node C?)
Client 1: 3 reads: node A, then B, then C
Client 1: writes node B
But in case client 1 also has a watch on node C, and client 2 writes that node, does that write on node C impact/block reads from client 1?
It doesn't block reads from client 1 because (check here):
When a ZooKeeper object is created, two threads are created as well: an IO thread and an event thread. All IO happens on the IO thread (using Java NIO). All event callbacks happen on the event thread.
but it impacts client 1 with respect to the order in which events are seen (check here):
A client will see a watch event for a znode it is watching before seeing the new data that corresponds to that znode.
Next question was:
Does client 1 block until the watch of C is fired?
No (see the explanation above).
What if at this point client 1 tries to write node C?
It will overwrite client 2's write, because it happens afterwards according to your sequence of bullets (ZooKeeper is ordered; check also here to understand how ZooKeeper achieves the ordering).
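As a small illustration (using the Python kazoo client; the paths and values are made up for the example), the watch callback runs on the client's event thread, so the reads and writes issued afterwards are not blocked waiting for it:

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

for path in ("/A", "/B", "/C"):
    zk.ensure_path(path)                 # make sure the example znodes exist


def on_c_changed(event):
    # Delivered on the client's event thread, separately from the request IO,
    # so it never blocks the reads/writes issued below.
    print("watch fired:", event)


zk.get("/C", watch=on_c_changed)         # client 1 starts watching C
zk.set("/A", b"v1")                      # client 1 writes node A
# ... client 2 writes /C from another process at this point ...
zk.get("/A")                             # reads proceed without waiting for the watch;
zk.get("/B")                             # the watch event for /C is delivered before
zk.get("/C")                             # client 1 sees the new data of /C
zk.set("/B", b"v1")                      # client 1 writes node B

zk.stop()
```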

Why is Paxos designed in two phases?

Why does Paxos require two phases (prepare/promise + accept/accepted) instead of a single one? That is, using only the prepare/promise portion, if the proposer has heard back from a majority of acceptors, that value is chosen.
What is the problem: does it break safety or liveness?
It breaks safety not to follow the full protocol.
Typical implementations of Multi-Paxos have a steady-state mode where a stable leader streams Accept messages containing fresh values. Only when a problem occurs (the leader crashes, stalls, or is partitioned by a network issue) does a new leader need to issue prepare messages to ensure safety. A full description of this is in the write-up of how TRex, an open-source Paxos library, implements Paxos.
Consider the following crash scenario which TRex would handle properly:
Nodes A, B, C with A leading
Client application sends V1 to leader A
Node A is in steady state and so sends accept(n, V1) to nodes B and C. The network starts to fail, though, so only B sees the message, and it replies with accepted(n).
Node A sees the response and now has a majority {A, B}, so it knows the value is fixed due to the safety proof of the protocol.
Node A attempts to broadcast the outcome to everyone just as its server dies. Only the client application that issued V1 gets the message. Imagine that V1 is a customer order: upon learning that the order is fixed, the client application debits the customer's credit card.
Node C times out on the dead leader and attempts to lead. It never saw the value V1. It cannot arbitrarily choose any new value without rolling back the order V1, but the customer has already been charged.
So Node C first issues a prepare(n+1) and node B responds with promise(n+1, V1).
Node C then issues accept(n+1, V1) and as long as the remaining messages get through nodes B and C will learn the value V1 was chosen.
Intuitively, we can say that node C has chosen to collaborate with the dead node A by choosing A's value. So intuitively we can see why there must be two rounds: the first round is needed to discover whether there is any pending work to finish, and the second round is used to fix the correct value to give consistency across all processes within the system.
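A compact Python sketch of that first round (the message shapes here are invented for illustration, not TRex's actual API): the new leader collects promises from a majority and must adopt the value of the highest-numbered accepted proposal it hears about, which is exactly how node C discovers and finishes A's pending work.

```python
def choose_value_after_prepare(promises, my_value):
    """promises: list of (promised_n, accepted_n, accepted_value) from a majority of acceptors.
    accepted_n and accepted_value are None if that acceptor has accepted nothing yet."""
    accepted = [(n, v) for (_, n, v) in promises if n is not None]
    if not accepted:
        return my_value                  # no pending work: free to propose our own value
    _, value = max(accepted, key=lambda nv: nv[0])
    return value                         # must re-propose the highest-numbered accepted value


# In the scenario above, C's prepare(n+1) gets promise(n+1, V1) from B, so
# choose_value_after_prepare returns V1 and C must propose V1, not a new value.
```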
It's not entirely accurate, but you can think of the two phases as 1) copying the data, and then 2) committing the data. If the data is just copied to the other servers, those servers would have no idea if enough other servers have the data for it to be considered safe to serve. Thus there is a second phase to let the servers know that they can commit the data.
Paxos is a little more complex than that, which allows it to continue during failures of either phase. Part of the Paxos proof is that it is the minimal protocol for doing this completely. That is, other protocols do more work, either because they add more capabilities, or because they were poorly designed.

Handling QUORUM write failures in Cassandra

According to the DataStax documentation about atomicity in Cassandra, a QUORUM write that succeeded on only one node will not be rolled back (see the Atomicity chapter here: http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html#cassandra/dml/dml_about_transactions_c.html). So when I perform a QUORUM write on a cluster with RF=3 and one node fails, I will get a write error status and one successful write on another node. This produces two cases:
the write will be propagated to the other nodes when they come back online;
the write can be completely lost if the node that accepted it breaks down completely before propagation.
What is the best way to deal with this kind of failure in, let's say, hypothetical funds-transfer logging?
When a QUORUM write fails with a "TimedOut" exception, you don't know whether the write succeeded or not. You should retry the write and treat it as if it had failed. If you have multiple writes that need to be grouped together, you should place them in a "batch", so that the batch succeeds or fails as a unit.
In either case, you also want to be doing QUORUM reads if you care about getting consistent results back. If you have RF=3 and the QUORUM write only reached one node, then the first time a QUORUM read that includes the new value succeeds, the value will be repaired onto one of the other nodes, and QUORUM reads will always return the new value from then on. So even if the value was effectively written at ONE, successive QUORUM reads will never see the value go back in time.
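As an illustration only, here is a rough sketch with the DataStax Python driver (the keyspace, table, and columns are invented for the example): the related writes go into a logged batch at QUORUM, the write is retried on timeout, and the read back is also done at QUORUM.

```python
from cassandra import ConsistencyLevel, WriteTimeout
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, SimpleStatement

session = Cluster(["127.0.0.1"]).connect("bank")    # keyspace "bank" is hypothetical


def log_transfer(transfer_id, debit_account, credit_account, amount, retries=3):
    # Group the two legs of the transfer in a logged batch so they succeed or fail together.
    batch = BatchStatement(consistency_level=ConsistencyLevel.QUORUM)
    batch.add(SimpleStatement(
        "INSERT INTO transfer_log (id, account, delta) VALUES (%s, %s, %s)"),
        (transfer_id, debit_account, -amount))
    batch.add(SimpleStatement(
        "INSERT INTO transfer_log (id, account, delta) VALUES (%s, %s, %s)"),
        (transfer_id, credit_account, amount))
    for _ in range(retries):
        try:
            session.execute(batch)
            return True
        except WriteTimeout:
            continue            # outcome unknown: retry (these inserts are idempotent)
    return False


def read_transfer(transfer_id):
    # QUORUM read: once it returns the new value, later QUORUM reads won't go back in time.
    stmt = SimpleStatement(
        "SELECT * FROM transfer_log WHERE id = %s",
        consistency_level=ConsistencyLevel.QUORUM)
    return list(session.execute(stmt, (transfer_id,)))
```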