How are out-of-order and wait-free writes handled? - apache-zookeeper

As stated in Guarantees:
Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Let's assume a client makes 2 updates (update1 and update2) in a very short time window (I understand zookeeper is good at read-domination applications). So my questions are:
Is that possible update2 is received before update1, therefore for zookeeper update1 has later stamp than that of update2? I assume yes due to network connection nature. If this the case that means client will lose its update2 and will have update1. Is there anyway zookeeper can ACK back the client with different stamp or whatever other data that let the client to determine if update2 is really received after update1. Basically zookeeper tells what it sees from server side to client, which gives client some info to act if that's not what the client wants.
What if there is a leader failure after receiving and confirming update1 and before receiving update2? I assume such writes are persisted somewhere in disk/DB etc. When the new leader comes back will it catch up first, meaning conduct update1, before confirming update2 back to client?
Just curious, since zookeeper claims it supports wait-free writing, does that mean there is a message queue built inside zookeeper to hold incoming writes? Otherwise if the leader has to make sure the update is populated to all other followers, the client is actually being blocked by during this replication process. I am guessing that's part of reason zookeeper does not support heavy write application.

For the first two questions, I think you can find details in Zookeeper's paper.
It's quite normal that different operations from the same client arrive in disorder to Zookeeper node. But Zookeeper use TCP to ensure that sequential network package will be receive orderly.
Leader must write operations in Write-Ahead-Log before it can confirm operations. The problems will diverge in two dimensions. The first situation we should consider is whether the leader could recover before followers realize leader failure. If yes, nothing bad will happen, all operations in failure time will lost, and client will resend the operations. If not, then we should consider whether the Leader has proposed a proposal before it fails. If it fails before proposing a proposal, then client will know the failure. If it has proposed a proposal, there must be at least one node in the cluster which has got the newest transactions. Then it will be the new Leader in next rolling. When the original Leader recovers from failure, it will realize he's no longer the leader(All transactions of Zookeeper contains a 64-bits transaction id, of which the higher 32 bits represent epoch, and the lower 32 bits represents proposal id). It will communicate with new Leader and then get updated(Sometimes it need truncate it's local transaction log first).
I don't know the details since I haven't read ZooKeeper's source code. But Leader only needs over half acknowledge from followers before it response to clients. Zookeeper provide both blocking and non-blocking API and you can choose what you like.

Related

inconsistent state after zookeeper leader crash?

I'm trying to understand zookeeper's internal.
Suppose a 3-servers zookeeper cluster, the leader server send a proposal(say setdata: foo=1) to two followers and then crashed, but at least one follower record this proposal to its transaction log file. According "Zab paper" says, the other two server can still form a valid quorum and elect a new leader. And the new leader can still propose and commit this proposal(setdata: foo=1).
My question is in this situation, the client think this request is not completed(because of the leader crash and not respond or the client timeout), but in fact it is still success in the zookeeper cluster. Is this an inconsistent?
In fact this is an inconsistent, but it's not a problem.
In zookeeper programmer guide,there is a line:
If a client gets a successful return code, the update will have been applied. On some failures (communication errors, timeouts, etc) the client will not know if the update has applied or not. We take steps to minimize the failures, but the only guarantee is only present with successful return codes. (This is called the monotonicity condition in Paxos.)
This means you know your update succeed when you gets a successful return code, but when you can't know whether it succeed or failed when you don't receive the return code.
But this is not a problem, when your update failed because of leader crash, you can just retry the update operation. This time your update will failed because the the version you specified is behind the actual version number and you will be notified. Then you can call get method to retrive the data to see whether the data equals you specified value.

How to handle reordered RPC in raft

When implementing the Raft algorithm, I found there is a situation that I think may or may not do harm to the cluster.
It is reasonable to assume some AppendEntriesRPC from Leader are received reordered(network delay or other reasons). Consider the Leader send a heartbeat AppendEntriesRPC to peer A, with prev_log_index = 1, and then send another AppendEntriesRPC with entry 2, and then it crash(I ensure this happen immediately by a callback in my test). If the two RPCs are handled in the order which they are sent, entry 2 will be inserted successfully. However, if the heartbeat RPC is delayed, then peer A will firstly insert entry 1 and respond to the Leader. Then comes the delayed heartbeat, peer A will erase entry 2, because the entry conflict with the Leader's prev_log_index = 1. So peer A erases a log entry by mistake.
To dig a little deeper, if the Leader doesn't crash immediately, will it fix this? I think if peer A respond to the delayed heartbeat correctly, the Leader will find out and fix it up in some later RPCs.
However, what if peer A's response to entry 2 lead to the commit_index advancing? In this case peer A vote to advance commit_index to 2, even though it actually does not have entry 2. So there may not enough votes for this advancing. When the Leader crashs now, a node with less logs will be elected as Leader. And I do encounter such situation during my testing.
My question is:
Is my reasoning correct?
If reordered RPC a real problem, how should I solve that? Is indexing and caching all RPCs, and force them be handled one by one a good solution? I found it hard to implement in gRPC.
Raft assumes an ordered stream protocol such as TCP. That is, if a message arrives out of order then it is buffered until its predecessor arrives. (This behavior is why TCP exists: because each individual packet can go through separate routes between servers and there is a high chance of out-of-order messages, and most applications prefer the ease-of-mind of a strict ordering.)
Other protocols, such as plain old Paxos, can work with out-of-order messages, but are typically much slower than Raft.

Can "observer" nodes in zookeeper respond with stale results?

This question is in reference to https://zookeeper.apache.org/doc/trunk/zookeeperObservers.html
Observers are non-voting members of an ensemble which only hear the
results of votes, not the agreement protocol that leads up to them.
Other than this simple distinction, Observers function exactly the
same as Followers - clients may connect to them and send read and
write requests to them. Observers forward these requests to the Leader
like Followers do, but they then simply wait to hear the result of the
vote. Because of this, we can increase the number of Observers as much
as we like without harming the performance of votes.
Observers have other advantages. Because they do not vote, they are
not a critical part of the ZooKeeper ensemble. Therefore they can
fail, or be disconnected from the cluster, without harming the
availability of the ZooKeeper service. The benefit to the user is that
Observers may connect over less reliable network links than Followers.
In fact, Observers may be used to talk to a ZooKeeper server from
another data center. Clients of the Observer will see fast reads, as
all reads are served locally, and writes result in minimal network
traffic as the number of messages required in the absence of the vote
protocol is smaller.
1) non-voting members of an ensemble - What do the voting members vote on?
2) How does an update request work for observers - When a ZK leader gets an update request, it requires a quorum of nodes to respond. Observer nodes seems like is not considered a quorum node. Does that mean an observer node lags behind the leader node for updates? If that is true, how does it ensure that observer nodes do not respond with stale data during reads?
3) Clients of the Observer will see fast reads, as all reads are served locally, and writes result in minimal network traffic as the number of messages required in the absence of the vote protocol is smaller - Reads from all the other nodes will also be local only because they are in-sync with the leader, no? And I did not get the part about writes.
These questions should be good to understanding zookeeper and distributed systems in general. Appreciate a good detailed answer for these. Thanks in advance !
1) non-voting members of an ensemble - What do the voting members vote on?
Typical members of the ensemble (not observers) vote on success/failure of proposed changes coordinated by the leader. There is some further discussion of the details in the paper ZooKeeper: Wait-free coordination for Internet-scale systems.
2) How does an update request work for observers - When a ZK leader gets an update request, it requires a quorum of nodes to respond. Observer nodes seems like is not considered a quorum node. Does that mean an observer node lags behind the leader node for updates? If that is true, how does it ensure that observer nodes do not respond with stale data during reads?
You are correct that observer nodes are not considered necessary participants in the quorum. In general, update lag will be subject to network latency between the observer and the leader. (Whether or not this is noticeable is subject to specific external factors, such as whether or not the observer and leader are in the same data center with a low-latency network link.)
Note that even without use of observers, there is no guarantee that every server in the ensemble is always completely up to date. The Apache ZooKeeper documentation on Consistency Guarantees contains this disclaimer:
Sometimes developers mistakenly assume one other guarantee that ZooKeeper does not in fact make. This is:
Simultaneously Consistent Cross-Client Views ZooKeeper does not
guarantee that at every instance in time, two different clients will
have identical views of ZooKeeper data. Due to factors like network
delays, one client may perform an update before another client gets
notified of the change. Consider the scenario of two clients, A and B.
If client A sets the value of a znode /a from 0 to 1, then tells
client B to read /a, client B may read the old value of 0, depending
on which server it is connected to. If it is important that Client A
and Client B read the same value, Client B should should call the
sync() method from the ZooKeeper API method before it performs its
read.
However, clients of ZooKeeper will never appear to "go back in time" by reading stale data from a point in time prior to the data they already read. This is accomplished by attaching a monotonically increasing transaction ID (called "zxid") to each ZooKeeper transaction. When the ZooKeeper client interacts with a server, it compares the client's last seen zxid to the current zxid of the server. If the server is behind the client, then it will not allow the client's next read to be processed by that server.
3) Clients of the Observer will see fast reads, as all reads are served locally, and writes result in minimal network traffic as the number of messages required in the absence of the vote protocol is smaller - Reads from all the other nodes will also be local only because they are in-sync with the leader, no? And I did not get the part about writes.
It's important to note that this statement from the documentation is written in the context of an important use-case for observers: multiple data center deployments with higher network latency between different data centers. In this statement, "served locally" means served from a ZooKeeper server within the same data center as the client, so that it doesn't suffer from the longer latency of connecting to another data center. For full context, here is a copy of the full quote:
In fact, Observers may be used to talk to a ZooKeeper server from another data center. Clients of the Observer will see fast reads, as all reads are served locally, and writes result in minimal network traffic as the number of messages required in the absence of the vote protocol is smaller.

How ZooKeeper guarantees "Single System Image"?

In the Consistency Guarantees section of ZooKeeper Programmer's Guide, it states that ZooKeeper will give "Single System Image" guarantees:
A client will see the same view of the service regardless of the server that it connects to.
According to the ZAB protocol, only if more than half of the followers acknowledge a proposal, the leader could commit the transaction. So it's likely that not all the followers are in the same status.
If the followers are not in the same status, how could ZooKeeper guarantees "Single System Status"?
References:
ZooKeeper’s atomic broadcast protocol: Theory and practice
Single System Image
Leader only waits for responses from a quorum of the followers to acknowledge to commit a transaction. That doesn't mean that some of the followers need not acknowledge the transaction or can "say no".
Eventually as the rest of the followers process the commit message from leader or as part of the synchronization, will have the same state as the master (with some delay). (not to be confused with Eventual consistency)
How delayed can the follower's state be depends on the configuration items syncLimit & tickTime (https://zookeeper.apache.org/doc/current/zookeeperAdmin.html)
A follower can at most be behind by syncLimit * tickTime time units before it gets dropped.
The document is a little misleading, I have made a pr.
see https://github.com/apache/zookeeper/pull/931.
In fact, zookeeper client keeps a zxid, so it will not connect to older follower if it has read some data from a newer server.
All reads and writes go to a majority of the nodes before being considered successful, so there's no way for a read following a write to not know about that previous write. At least one node knows about it. (Otherwise n/2+1 + n/2+1 > n, which is false.) It doesn't matter if many (at most all but one) has an outdated view of the world since at least one of them knows it all.
If enough nodes crash or the network becomes partitioned so that no group of nodes that are able to talk to each other are in a majority, Zab stops handling requests. If your acknowledged update gets accepted by a set of nodes that disappear and never come back online, your cluster will lose some data (but only when you ask it to move on, and leave its dead nodes behind).
Handling more than two requests is done by handling them two at a time, until there's only one state left.

What to do if the leader fails in Multi-Paxos for master-slave systems?

Backgound:
In section 3, named Implementing a State Machine, of Lamport's paper Paxos Made Simple, Multi-Paxos is described. Multi-Paxos is used in Google Paxos Made Live. (Multi-Paxos is used in Apache ZooKeeper). In Multi-Paxos, gaps can appear:
In general, suppose a leader can get α commands ahead--that is, it can propose commands i + 1 through i + α commands after commands 1 through i are chosen. A gap of up to α - 1 commands could then arise.
Now consider the following scenario:
The whole system uses master-slave architecture. Only the master serves client commands. Master and slaves reach consensus on the sequence of commands via Multi-Paxos. The master is the leader in Multi-Paxos instances. Assume now the master and two of its slaves have the states (commands have been chosen) shown in the following figure:
.
Note that, there are more than one gaps in the master state. Due to asynchrony, the two slaves lag behind. At this time, the master fails.
Problem:
What should the slaves do after they have detected the failure of the master (for example, by heartbeat mechanism)?
In particular, how to handle with the gaps and the missing commands with respect to that of the old master?
Update about Zab:
As #sbridges has pointed out, ZooKeeper uses Zab instead of Paxos. To quote,
Zab is primarily designed for primary-backup (i.e., master-slave) systems, like ZooKeeper, rather than for state machine replication.
It seems that Zab is closely related to my problems listed above. According to the short overview paper of Zab, Zab protocol consists of two modes: recovery and broadcast. In recovery mode, two specific guarantees are made: never forgetting committed messages and letting go of messages that are skipped. My confusion about Zab is:
In recovery mode does Zab also suffer from the gaps problem? If so, what does Zab do?
The gap should be the Paxos instances that has not reached agreement. In the paper Paxos Made Simple, the gap is filled by proposing a special “no-op” command that leaves the state unchanged.
If you cares about the order of chosen values for Paxos instances, you'd better use Zab instead, because Paxos does not preserve causal order. https://cwiki.apache.org/confluence/display/ZOOKEEPER/PaxosRun
The missing command should be the Paxos instances that has reached agreement, but not learned by learner. The value is immutable because it has been accepted a quorum of acceptor. When you run a paxos instance of this instance id, the value will be chosen and recovered to the same one on phase 1b.
When slaves/followers detected a failure on Leader, or the Leader lost a quorum support of slaves/follower, they should elect a new leader.
In zookeeper, there should be no gaps as the follower communicates with leader by TCP which keeps FIFO.
In recovery mode, after the leader is elected, the follower synchronize with leader first, and apply the modification on state until NEWLEADER is received.
In broadcast mode, the follower queues the PROPOSAL in pendingTxns, and wait the COMMIT in the same order. If the zxid of COMMIT mismatch with the zxid of head of pendingTxns, the follower will exit.
https://cwiki.apache.org/confluence/display/ZOOKEEPER/Zab1.0
Multi-Paxos is used in Apache ZooKeeper
Zookeeper uses zab, not paxos. See this link for the difference.
In particular, each zookeeper node in an ensemble commits updates in the same order as every other nodes,
Unlike client requests, state updates must be applied in the exact
original generation order of the primary, starting from the original
initial state of the primary. If a primary fails, a new primary that
executes recovery cannot arbitrarily reorder uncommitted state
updates, or apply them starting from a different initial state.
Specifically the ZAB paper says that a newly elected leader undertakes discovery to learn the next epoch number to set and who has the most up-to-date commit history. The follower sands an ACK-E message which states the max contiguous zxid it has seen. It then says that it undertakes a synchronisation phase where it transmits the state which followers which they have missed. It notes that in interesting optimisation is to only elect a leader which has a most up to date commit history.
With Paxos you don't have to allow gaps. If you do allow gaps then the paper Paxos Made Simple explains how to resolve them from page 9. A new leader knows the last committed value it saw and possibly some committed values above. It probes the slots from the lowest gap it knows about by running phase 1 to propose to those slots. If there are values in those slots it runs phase 2 to fix those values but if it is free to set a value it sets no-op value. Eventually it gets to the slot number where there have been no values proposed and it runs as normal.
In answer to your questions:
What should the slaves do after they have detected the failure of the master (for example, by heartbeat mechanism)?
They should attempt to lead after a randomised delay to try to reduce the risk of two candidates proposing at the same time which would waste messages and disk flushes as only one can lead. Randomised leader timeout is well covered in the Raft paper; the same approach can be used for Paxos.
In particular, how to handle with the gaps and the missing commands with respect to that of the old master?
The new leader should probe and fix the gaps to either the highest value proposed to that slot else a no-op until it has filled in the gaps then it can lead as normal.
The answer of #Hailin explains the gap problem as follows:
In zookeeper, there should be no gaps as the follower communicates with leader by TCP which keeps FIFO"
To supplement:
In the paper A simple totally ordered broadcast protocol, it mentions that ZooKeeper requires the prefix property:
If $m$ is the last message delivered for a leader $L$, any message proposed before $m$ by $L$ must also be delivered".
This property mainly relies on the TCP mechanism used in Zab. In Zab Wiki, it mentions that the implementation of Zab must follow the following assumption (besides others):
Servers must process packets in the order that they are received. Since TCP maintains ordering when sending packets, this means that packets will be processed in the order defined by the sender.