Is ZooKeeper always consistent in terms of CAP theorem? - apache-zookeeper

Is that correct that ZooKeeper is always CP (in terms of CAP theorem)?
Or is there anyway to use it as AP for service discovery needs?

Zookeeper is not A, and can't drop P. So it's called CP apparently. In terms of CAP theorem, "C" actually means linearizability.
linearizability : if operation B started after operation A successfully completed, then operation B must see the the system in the same state as it was on completion of operation A, or a newer state.
But,
Zookeeper has Sequential Consistency - Updates from a client will be applied in the order that they were sent.
ZooKeeper does not in fact simultaneously consistent across client views.
http://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#ch_zkGuarantees
ZooKeeper does not guarantee that at every instance in time, two different clients will have identical views of ZooKeeper data. Due to factors like network delays, one client may perform an update before another client gets notified of the change. Consider the scenario of two clients, A and B. If client A sets the value of a znode /a from 0 to 1, then tells client B to read /a, client B may read the old value of 0, depending on which server it is connected to. If it is important that Client A and Client B read the same value, Client B should should call the sync() method from the ZooKeeper API method before it performs its read.
ZooKeeper provides "sequential consistency". This is weaker than linearizability but is still very strong, much stronger than "eventual consistency". ZooKeeper also provides a sync command. If you invoke a sync command and then a read, the read is guaranteed to see at least the last write that completed before the sync started.
linearizability, writes should appear to be instantaneous. Imprecisely, once a write completes, all later reads (where “later” is defined by wall-clock start time) should return the value of that write or the value of a later write. Once a read returns a particular value, all later reads should return that value or the value of a later write."
In Zookeeper they have sync() method to use where we need something like linearizability.
Serializability is a guarantee about transactions, or groups of one or more operations over one or more objects. It guarantees that the execution of a set of transactions (usually containing read and write operations) over multiple items is equivalent to some serial execution (total ordering) of the transactions.
Refer :
http://zookeeper-user.578899.n2.nabble.com/Consistency-in-zookeeper-td7578531.html
http://www.bailis.org/blog/linearizability-versus-serializability/
Difference between Linearizability and Serializability

No, you cannot change consistency guarantees in current versions of ZooKeeper like you can in some other systems.
You can add a local cache to your clients which will make them have read only data if the cluster goes down, but in terms of CAP that is still not A because it needs to be available for updates as well as reads.
If ZK offers too strong levels of consistency for your service discovery needs, you should try researching other options, e.g. Eureka, Consul or etcd.
Possibly related reads:
https://tech.knewton.com/blog/2014/12/eureka-shouldnt-use-zookeeper-service-discovery/
https://github.com/Netflix/eureka/wiki/FAQ
https://www.consul.io/intro/vs/zookeeper.html

An excellent question.
In terms of CAP theorem, "C" actually means linearizability:
if operation B started after operation A successfully completed, then
operation B must see the the system in the same state as it was on
completion of operation A, or a newer state.
Since the write in ZooKeeper is considered completed after the quorum confirms it, there can still be stale nodes with old data. Therefore, strictly speaking, ZooKeeper is by default not a CP system, even though it provides reasonably high level of consistency. You can ensure linearizability by preceding a read with a sync command.
Regarding the availability under the network partition, those nodes that are not in majority could not process write requests anymore because they don't have a quorum.
See also:
Please stop calling database CP or AP
CAP FAQ

Related

Why Paxos is design in two phases

Why Paxos requires two phases(prepare/promise + accept/accepted) instead of a single one? That is, using only prepare/promise portion, if the proposer has heard back from a majority of acceptors, that value is choose.
What is the problem, does it break safety or liveness?
It breaks safety not to follow the full protocol.
Typical implementations of multi-paxos have a steady state mode where a stable leader streams Accept messages containing fresh values. Only when a problem occurs (leader crashes, stalls, or is partitioned by a network issue) does a new leader need to issue prepare messages to ensure saftey. A full description of this is in the write-up of how TRex an open source Paxos library implements Paxos.
Consider the following crash scenario which TRex would handle properly:
Nodes A, B, C with A leading
Client application sends V1 to leader A
Node A is in steady state and so sends accept(n, V1) to nodes B and C. The network starts to fail though so only B sees the message and it replies with accepted(n)
Node A sees the response and has a majority {A,B} so it knows the value is fixed due to the safety proof of the protocol.
Node A attempts to broadcast the outcome to everyone just as it's server dies. Only the client application who issued the V1 gets the message. Imagine that V1 is a customer order and upon learning the order is fixed the client application debts the customer credit card.
Node C times out on the dead leader and attempts to lead. It never saw the value V1. It cannot arbitrarily choose any new value without rolling back the order V1 but the customer has already been charged.
So Node C first issues a prepare(n+1) and node B responds with promise(n+1, V1).
Node C then issues accept(n+1, V1) and as long as the remaining messages get through nodes B and C will learn the value V1 was chosen.
Intuitively we can say that Node C has chosen to collaborate with the dead node A by choosing A's value. So intuitively we can see why there must be two rounds. The first round is needed to discover whether there is any pending work to finish. The second round is used to fix the correct value to give consistency across all processes within the system.
It's not entirely accurate, but you can think of the two phases as 1) copying the data, and then 2) committing the data. If the data is just copied to the other servers, those servers would have no idea if enough other servers have the data for it to be considered safe to serve. Thus there is a second phase to let the servers know that they can commit the data.
Paxos is a little more complex than that, which allows it to continue during failures of either phase. Part of the Paxos proof is that it is the minimal protocol for doing this completely. That is, other protocols do more work, either because they add more capabilities, or because they were poorly designed.

Can "observer" nodes in zookeeper respond with stale results?

This question is in reference to https://zookeeper.apache.org/doc/trunk/zookeeperObservers.html
Observers are non-voting members of an ensemble which only hear the
results of votes, not the agreement protocol that leads up to them.
Other than this simple distinction, Observers function exactly the
same as Followers - clients may connect to them and send read and
write requests to them. Observers forward these requests to the Leader
like Followers do, but they then simply wait to hear the result of the
vote. Because of this, we can increase the number of Observers as much
as we like without harming the performance of votes.
Observers have other advantages. Because they do not vote, they are
not a critical part of the ZooKeeper ensemble. Therefore they can
fail, or be disconnected from the cluster, without harming the
availability of the ZooKeeper service. The benefit to the user is that
Observers may connect over less reliable network links than Followers.
In fact, Observers may be used to talk to a ZooKeeper server from
another data center. Clients of the Observer will see fast reads, as
all reads are served locally, and writes result in minimal network
traffic as the number of messages required in the absence of the vote
protocol is smaller.
1) non-voting members of an ensemble - What do the voting members vote on?
2) How does an update request work for observers - When a ZK leader gets an update request, it requires a quorum of nodes to respond. Observer nodes seems like is not considered a quorum node. Does that mean an observer node lags behind the leader node for updates? If that is true, how does it ensure that observer nodes do not respond with stale data during reads?
3) Clients of the Observer will see fast reads, as all reads are served locally, and writes result in minimal network traffic as the number of messages required in the absence of the vote protocol is smaller - Reads from all the other nodes will also be local only because they are in-sync with the leader, no? And I did not get the part about writes.
These questions should be good to understanding zookeeper and distributed systems in general. Appreciate a good detailed answer for these. Thanks in advance !
1) non-voting members of an ensemble - What do the voting members vote on?
Typical members of the ensemble (not observers) vote on success/failure of proposed changes coordinated by the leader. There is some further discussion of the details in the paper ZooKeeper: Wait-free coordination for Internet-scale systems.
2) How does an update request work for observers - When a ZK leader gets an update request, it requires a quorum of nodes to respond. Observer nodes seems like is not considered a quorum node. Does that mean an observer node lags behind the leader node for updates? If that is true, how does it ensure that observer nodes do not respond with stale data during reads?
You are correct that observer nodes are not considered necessary participants in the quorum. In general, update lag will be subject to network latency between the observer and the leader. (Whether or not this is noticeable is subject to specific external factors, such as whether or not the observer and leader are in the same data center with a low-latency network link.)
Note that even without use of observers, there is no guarantee that every server in the ensemble is always completely up to date. The Apache ZooKeeper documentation on Consistency Guarantees contains this disclaimer:
Sometimes developers mistakenly assume one other guarantee that ZooKeeper does not in fact make. This is:
Simultaneously Consistent Cross-Client Views ZooKeeper does not
guarantee that at every instance in time, two different clients will
have identical views of ZooKeeper data. Due to factors like network
delays, one client may perform an update before another client gets
notified of the change. Consider the scenario of two clients, A and B.
If client A sets the value of a znode /a from 0 to 1, then tells
client B to read /a, client B may read the old value of 0, depending
on which server it is connected to. If it is important that Client A
and Client B read the same value, Client B should should call the
sync() method from the ZooKeeper API method before it performs its
read.
However, clients of ZooKeeper will never appear to "go back in time" by reading stale data from a point in time prior to the data they already read. This is accomplished by attaching a monotonically increasing transaction ID (called "zxid") to each ZooKeeper transaction. When the ZooKeeper client interacts with a server, it compares the client's last seen zxid to the current zxid of the server. If the server is behind the client, then it will not allow the client's next read to be processed by that server.
3) Clients of the Observer will see fast reads, as all reads are served locally, and writes result in minimal network traffic as the number of messages required in the absence of the vote protocol is smaller - Reads from all the other nodes will also be local only because they are in-sync with the leader, no? And I did not get the part about writes.
It's important to note that this statement from the documentation is written in the context of an important use-case for observers: multiple data center deployments with higher network latency between different data centers. In this statement, "served locally" means served from a ZooKeeper server within the same data center as the client, so that it doesn't suffer from the longer latency of connecting to another data center. For full context, here is a copy of the full quote:
In fact, Observers may be used to talk to a ZooKeeper server from another data center. Clients of the Observer will see fast reads, as all reads are served locally, and writes result in minimal network traffic as the number of messages required in the absence of the vote protocol is smaller.

How ZooKeeper guarantees "Single System Image"?

In the Consistency Guarantees section of ZooKeeper Programmer's Guide, it states that ZooKeeper will give "Single System Image" guarantees:
A client will see the same view of the service regardless of the server that it connects to.
According to the ZAB protocol, only if more than half of the followers acknowledge a proposal, the leader could commit the transaction. So it's likely that not all the followers are in the same status.
If the followers are not in the same status, how could ZooKeeper guarantees "Single System Status"?
References:
ZooKeeper’s atomic broadcast protocol: Theory and practice
Single System Image
Leader only waits for responses from a quorum of the followers to acknowledge to commit a transaction. That doesn't mean that some of the followers need not acknowledge the transaction or can "say no".
Eventually as the rest of the followers process the commit message from leader or as part of the synchronization, will have the same state as the master (with some delay). (not to be confused with Eventual consistency)
How delayed can the follower's state be depends on the configuration items syncLimit & tickTime (https://zookeeper.apache.org/doc/current/zookeeperAdmin.html)
A follower can at most be behind by syncLimit * tickTime time units before it gets dropped.
The document is a little misleading, I have made a pr.
see https://github.com/apache/zookeeper/pull/931.
In fact, zookeeper client keeps a zxid, so it will not connect to older follower if it has read some data from a newer server.
All reads and writes go to a majority of the nodes before being considered successful, so there's no way for a read following a write to not know about that previous write. At least one node knows about it. (Otherwise n/2+1 + n/2+1 > n, which is false.) It doesn't matter if many (at most all but one) has an outdated view of the world since at least one of them knows it all.
If enough nodes crash or the network becomes partitioned so that no group of nodes that are able to talk to each other are in a majority, Zab stops handling requests. If your acknowledged update gets accepted by a set of nodes that disappear and never come back online, your cluster will lose some data (but only when you ask it to move on, and leave its dead nodes behind).
Handling more than two requests is done by handling them two at a time, until there's only one state left.

How are out-of-order and wait-free writes handled?

As stated in Guarantees:
Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Let's assume a client makes 2 updates (update1 and update2) in a very short time window (I understand zookeeper is good at read-domination applications). So my questions are:
Is that possible update2 is received before update1, therefore for zookeeper update1 has later stamp than that of update2? I assume yes due to network connection nature. If this the case that means client will lose its update2 and will have update1. Is there anyway zookeeper can ACK back the client with different stamp or whatever other data that let the client to determine if update2 is really received after update1. Basically zookeeper tells what it sees from server side to client, which gives client some info to act if that's not what the client wants.
What if there is a leader failure after receiving and confirming update1 and before receiving update2? I assume such writes are persisted somewhere in disk/DB etc. When the new leader comes back will it catch up first, meaning conduct update1, before confirming update2 back to client?
Just curious, since zookeeper claims it supports wait-free writing, does that mean there is a message queue built inside zookeeper to hold incoming writes? Otherwise if the leader has to make sure the update is populated to all other followers, the client is actually being blocked by during this replication process. I am guessing that's part of reason zookeeper does not support heavy write application.
For the first two questions, I think you can find details in Zookeeper's paper.
It's quite normal that different operations from the same client arrive in disorder to Zookeeper node. But Zookeeper use TCP to ensure that sequential network package will be receive orderly.
Leader must write operations in Write-Ahead-Log before it can confirm operations. The problems will diverge in two dimensions. The first situation we should consider is whether the leader could recover before followers realize leader failure. If yes, nothing bad will happen, all operations in failure time will lost, and client will resend the operations. If not, then we should consider whether the Leader has proposed a proposal before it fails. If it fails before proposing a proposal, then client will know the failure. If it has proposed a proposal, there must be at least one node in the cluster which has got the newest transactions. Then it will be the new Leader in next rolling. When the original Leader recovers from failure, it will realize he's no longer the leader(All transactions of Zookeeper contains a 64-bits transaction id, of which the higher 32 bits represent epoch, and the lower 32 bits represents proposal id). It will communicate with new Leader and then get updated(Sometimes it need truncate it's local transaction log first).
I don't know the details since I haven't read ZooKeeper's source code. But Leader only needs over half acknowledge from followers before it response to clients. Zookeeper provide both blocking and non-blocking API and you can choose what you like.

Looking for message bus implementations that offer something between full ACID and nothing

Anyone know of a message bus implementation which offers granular control over consistency guarantees? Full ACID is too slow and no ACID is too wrong.
We're currently using Rhino ESB wrapping MSMQ for our messaging. When using durable, transactional messaging with distributed transactions, MSMQ can block the commit for considerable time while it waits on I/O completion.
Our messages fall into two general categories: business logic and denormalisation. The latter account for a significant percentage of message bus traffic.
Business logic messages require the guarantees of full ACID and MSMQ has proven quite adequate for this.
Denormalisation messages:
MUST be durable.
MUST NOT be processed until after the originating transaction completes.
MAY be processed multiple times.
MAY be processed even if the originating transaction rolls back, as long as 2) is adhered to.
(In some specific cases the durability requirements could probably be relaxed, but identifying and handling those cases as exceptions to the rule adds complexity.)
All denormalisation messages are handled in-process so there is no need for IPC.
If the process is restarted, all transactions may be assumed to have completed (committed or rolled back) and all denormalisation messages not yet processed must be recovered. It is acceptable to replay denormalisation messages which were already processed.
As far as I can tell, messaging systems which deal with transactions tend to offer a choice between full ACID or nothing, and ACID carries a performance penalty. We're seeing calls to TransactionScope#Commit() taking as long as a few hundred milliseconds in some cases depending on the number of messages sent.
Using a non-transactional message queue causes messages to be processed before their originating transaction completes, resulting in consistency problems.
Another part of our system which has similar consistency requirements but lower complexity is already using a custom implementation of something akin to a transaction log, and generalising that for this use case is certainly an option, but I'd rather not implement a low-latency, concurrent, durable, transactional messaging system myself if I don't have to :P
In case anyone's wondering, the reason for requiring durability of denormalisation messages is that detecting desyncs and fixing desyncs can be extremely difficult and extremely expensive respectively. People do notice when something's slightly wrong and a page refresh doesn't fix it, so ignoring desyncs isn't an option.
It's not exactly the answer you're looking for, but Jonathan Oliver has written extensively on how to avoid using distributed transactions in messaging and yet maintain transactional integrity:
http://blog.jonathanoliver.com/2011/04/how-i-avoid-two-phase-commit/
http://blog.jonathanoliver.com/2011/03/removing-2pc-two-phase-commit/
http://blog.jonathanoliver.com/2010/04/idempotency-patterns/
Not sure if this helps you but, hey.
It turns out that MSMQ+SQL+DTC don't even offer the consistency guarantees we need. We previously encountered a problem where messages were being processed before the distributed transaction which queued them had been committed to the database, resulting in out-of-date reads. This is a side-effect of using ReadCommitted isolation to consume the queue, since:
Start transaction A.
Update database table in A.
Queue message in A.
Request commit of A.
Message queue commits A
Start transaction B.
Read message in B.
Read database table in B, using ReadCommitted <- gets pre-A data.
Database commits A.
Our requirement is that B's read of the table block on A's commit, which requires Serializable transactions, which carries a performance penalty.
It looks like the normal thing to do is indeed to implement the necessary constraints and guarantees oneself, even though it sounds like reinventing the wheel.
Anyone got any comments on this?
If you want to do this by hand, here is a reliable approach. It satisfies (1) and (2), and it doesn't even need the liberties that you allow in (3) and (4).
Producer (business logic) starts transaction A.
Insert/update whatever into one or more tables.
Insert a corresponding message into PrivateMessageTable (part of the domain, and unshared, if you will). This is what will be distributed.
Commit transaction A. Producer has now simply and reliably performed its writes including the insertion of a message, or rolled everything back.
Dedicated distributer job queries a batch of unprocessed messages from PrivateMessageTable.
Distributer starts transaction B.
Mark the unprocessed messages as processed, rolling back if the number of rows modified is different than expected (two instances running at the same time?).
Insert a public representation of the messages into PublicMessageTable (a publically exposed table, in whatever way). Assign new, strictly sequential Ids to the public representations. Because only one process is doing these inserts, this can be guaranteed. Note that the table must be on the same host to avoid 2PC.
Commit transaction B. Distributor has now distributed each message to the public table exactly once, with strictly sequantial Ids.
A consumer (there can be several) queries the next batch of messages from PublicMessageTable with Id greater than its own LastSeenId.
Consumer starts transaction C.
Consumer inserts its own representation of the messages into its own table ConsumerMessageTable (thus advancing LastSeenId). Insert-ignore can help protect against multiple instances running. Note that this table can be in a completely different server.
Commit transaction C. Consumer has now consumed each message exactly once, in the same order the messages were made publically available, without ever skipping a message.
We can do whatever we want based on the consumed messages.
Of course, this requires very careful implementation.
It is even suitable for database clusters, as long as there is only a single write node, and both reads and writes perform causality checks. It may well be that having one of these is sufficient, but I'd have to consider the implications more carefully to make that claim.