Handling quorum write failures in Cassandra - nosql

According to the DataStax documentation about atomicity in Cassandra, a QUORUM write that succeeded on only one node will not be rolled back (see the Atomicity chapter here: http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html#cassandra/dml/dml_about_transactions_c.html). So when I perform a QUORUM write on a cluster with RF=3 and one node fails, I will get a write error status along with one successful write on another node. This produces two cases:
the write will be propagated to the other nodes when they come back online;
the write can be lost entirely if the node that accepted it breaks permanently before propagation.
What is the best way to deal with this kind of failure in, let's say, hypothetical funds-transfer logging?

When a QUORUM write fails with a "TimedOut" exception, you don't know whether the write succeeded or not. You should treat it as if it failed and retry the write. If you have multiple writes that need to be grouped together, place them in a batch, so that the batch succeeds or fails as a unit.
In either case, you also want to be doing QUORUM reads if you care about getting consistent results back. If you have RF=3 and the QUORUM write only landed on one node, then the first time a QUORUM read succeeds that includes the new value, the value is repaired onto one of the other nodes, and QUORUM reads will always return the new value from then on. So even if the value was only written at ONE, successive QUORUM reads will never see the value go back in time.
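A minimal sketch of that pattern with the DataStax Python driver (the keyspace, table and column names here are invented for illustration, and the retry loop is deliberately naive):

    import uuid
    from cassandra import ConsistencyLevel, WriteTimeout
    from cassandra.cluster import Cluster
    from cassandra.query import BatchStatement, SimpleStatement

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("transfers")      # hypothetical keyspace
    tx_id = uuid.uuid4()

    # Group the related writes so they succeed or fail together.
    insert = SimpleStatement(
        "INSERT INTO transfer_log (id, account, amount) VALUES (%s, %s, %s)")
    batch = BatchStatement(consistency_level=ConsistencyLevel.QUORUM)
    batch.add(insert, (tx_id, "acct-1", -100))
    batch.add(insert, (tx_id, "acct-2", 100))

    # A timeout means "unknown outcome": treat it as a failure and retry.
    # The statements are idempotent (same primary key -> same row), so a
    # retry after a partially applied write does no harm.
    for attempt in range(3):
        try:
            session.execute(batch)
            break
        except WriteTimeout:
            continue

    # Read back at QUORUM: if the value reached only one replica, the first
    # successful QUORUM read that includes it repairs another replica, and
    # later QUORUM reads never go back in time.
    select = SimpleStatement("SELECT * FROM transfer_log WHERE id = %s",
                             consistency_level=ConsistencyLevel.QUORUM)
    rows = session.execute(select, (tx_id,))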

Related

Kubernetes - How do I prevent duplication of work when there are multiple replicas of a service with a watcher?

I'm trying to build an event exporter as a toy project. It has a watcher that gets informed by the Kubernetes API every time an event occurs, and as a simple case, let's assume that it wants to store the event in a database or something.
Having just one running instance is probably susceptible to failures, so ideally I'd like two. In this situation, the naive implementation would have both instances trying to store the event in the database, so it would be duplicated.
What strategies are there to de-duplicate? Do I have to do it at the database level (say, by using some sort of eventId or hash of the event content) and accept the extra database load or is there a way to de-duplicate at the instance level, maybe built into the Kubernetes client code? Or do I need to implement some sort of leader election?
I assume this is a pretty common problem. Is there a more general term for this issue that I can search on to learn more?
I looked at the code for GKE event exporter as a reference but I was unable to find any de-duplication, so I assume that it happens on the receiving end.
You should use both leader election and de-duplication at your watcher level; either one alone won't be enough.
Why do you need leader election?
If high availability is your main concern, you should have leader election between the watcher instances. Only the leader pod will write the event to the database. If you don't use leader election, the instances will race with each other to write into the database.
You may check whether the event has already been written to the database and only write it if it hasn't. However, you cannot guarantee that another instance won't write to the database between the moment you check and the moment you write. In that case, a database-level lock or transaction might help.
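A rough sketch of lease-based leader election with the official Kubernetes Python client, assuming a Lease object from the coordination.k8s.io/v1 API; the lease name, namespace and timings are invented, and a real implementation would also have to handle update conflicts (HTTP 409) and clock skew:

    import datetime
    import time
    import uuid

    from kubernetes import client, config
    from kubernetes.client.rest import ApiException

    config.load_incluster_config()            # or load_kube_config() outside a pod
    coord = client.CoordinationV1Api()

    IDENTITY = str(uuid.uuid4())              # unique per pod
    NAMESPACE = "default"
    LEASE_NAME = "event-exporter-leader"      # hypothetical lock name
    LEASE_SECONDS = 15

    def try_acquire():
        """Return True if this pod holds (or has just taken) the lease."""
        now = datetime.datetime.now(datetime.timezone.utc)
        try:
            lease = coord.read_namespaced_lease(LEASE_NAME, NAMESPACE)
        except ApiException as e:
            if e.status != 404:
                raise
            body = client.V1Lease(
                metadata=client.V1ObjectMeta(name=LEASE_NAME),
                spec=client.V1LeaseSpec(holder_identity=IDENTITY,
                                        lease_duration_seconds=LEASE_SECONDS,
                                        acquire_time=now, renew_time=now))
            coord.create_namespaced_lease(NAMESPACE, body)
            return True
        spec = lease.spec
        expired = (spec.renew_time is None or
                   (now - spec.renew_time).total_seconds() > LEASE_SECONDS)
        if spec.holder_identity == IDENTITY or expired:
            spec.holder_identity = IDENTITY
            spec.renew_time = now
            coord.replace_namespaced_lease(LEASE_NAME, NAMESPACE, lease)
            return True
        return False

    while True:
        if try_acquire():
            pass                              # we are the leader: run the watcher here
        time.sleep(LEASE_SECONDS / 3)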
Why do you need de-duplication?
Leader election alone will not save you; you also need to implement de-duplication. If your leader pod restarts, it will resync all the existing events, so you need a check that decides whether an event should be processed or not.
Furthermore, if a failover happens, how does the new leader know which events were successfully exported by the previous leader?
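For the de-duplication side, a minimal sketch: key every write on the event's metadata.uid (or a hash of its content) so that replays after a restart or failover become no-ops. The SQLite table here is just a stand-in for whatever database you actually use:

    import sqlite3
    from kubernetes import client, config, watch

    config.load_kube_config()                 # or load_incluster_config() in a pod
    v1 = client.CoreV1Api()

    db = sqlite3.connect("events.db")
    db.execute("CREATE TABLE IF NOT EXISTS events (uid TEXT PRIMARY KEY, message TEXT)")

    w = watch.Watch()
    for item in w.stream(v1.list_event_for_all_namespaces):
        ev = item["object"]
        # INSERT OR IGNORE makes the write idempotent: a duplicate uid
        # (resync, restart, or a second replica racing us) is silently skipped.
        db.execute("INSERT OR IGNORE INTO events (uid, message) VALUES (?, ?)",
                   (ev.metadata.uid, ev.message or ""))
        db.commit()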

Inconsistent state after ZooKeeper leader crash?

I'm trying to understand ZooKeeper's internals.
Suppose a 3-server ZooKeeper cluster. The leader sends a proposal (say setData: foo=1) to the two followers and then crashes, but at least one follower has recorded this proposal in its transaction log file. According to the Zab paper, the other two servers can still form a valid quorum and elect a new leader, and the new leader can still propose and commit this proposal (setData: foo=1).
My question is: in this situation the client thinks the request did not complete (because the leader crashed and did not respond, or the client timed out), but in fact it still succeeded in the ZooKeeper cluster. Is this an inconsistency?
In fact this is an inconsistency, but it's not a problem.
In the ZooKeeper programmer's guide, there is a line:
If a client gets a successful return code, the update will have been applied. On some failures (communication errors, timeouts, etc.) the client will not know if the update has applied or not. We take steps to minimize the failures, but the guarantee is only present with successful return codes. (This is called the monotonicity condition in Paxos.)
This means you know your update succeeded when you get a successful return code, but you cannot know whether it succeeded or failed when you don't receive one.
But this is not a problem: when your update fails because of a leader crash, you can just retry the update operation. If the first attempt actually went through, the retry will fail because the version you specified is now behind the actual version number, and you will be notified. Then you can call the get method to retrieve the data and see whether it equals the value you specified.
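A sketch of that retry-and-check pattern with the kazoo Python client; the znode path /funds/tx-42 and the values are invented for illustration:

    from kazoo.client import KazooClient
    from kazoo.exceptions import BadVersionError

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    data, stat = zk.get("/funds/tx-42")
    try:
        # Conditional update: applied only if the znode still has the version
        # we read, i.e. nothing (including our own earlier, possibly
        # successful, attempt) has modified it in the meantime.
        zk.set("/funds/tx-42", b"1", version=stat.version)
    except BadVersionError:
        # The version moved on, so some update (maybe ours) already landed.
        # Read the data back and check whether it is the value we wanted.
        data, stat = zk.get("/funds/tx-42")
        if data != b"1":
            pass  # someone else's update won; decide how to handle it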

How ZooKeeper guarantees "Single System Image"?

In the Consistency Guarantees section of ZooKeeper Programmer's Guide, it states that ZooKeeper will give "Single System Image" guarantees:
A client will see the same view of the service regardless of the server that it connects to.
According to the ZAB protocol, the leader can commit a transaction only after more than half of the followers have acknowledged the proposal. So it's likely that not all the followers are in the same state.
If the followers are not in the same state, how can ZooKeeper guarantee "Single System Image"?
References:
ZooKeeper’s atomic broadcast protocol: Theory and practice
Single System Image
The leader only waits for acknowledgements from a quorum of the followers before committing a transaction. That doesn't mean that the remaining followers skip the transaction or can "say no".
Eventually, as the rest of the followers process the commit message from the leader, or catch up as part of synchronization, they will have the same state as the leader (with some delay). (Not to be confused with eventual consistency.)
How far behind a follower's state may lag depends on the configuration items syncLimit & tickTime (https://zookeeper.apache.org/doc/current/zookeeperAdmin.html).
A follower can be at most syncLimit * tickTime time units behind before it gets dropped; for example, with tickTime=2000 (milliseconds) and syncLimit=5, that bound is 10 seconds.
The document is a little misleading; I have made a PR, see https://github.com/apache/zookeeper/pull/931.
In fact, the ZooKeeper client keeps track of the last zxid it has seen, so it will not connect to an older follower once it has read data from a newer server.
All reads and writes go to a majority of the nodes before being considered successful, so there's no way for a read following a write to be unaware of that previous write: at least one node knows about it. (Any two majorities of at least n/2+1 nodes each must overlap, since otherwise they would contain more than n nodes in total.) It doesn't matter if many nodes (at most all but one) have an outdated view of the world, since at least one of them knows everything.
If enough nodes crash, or the network becomes partitioned so that no group of nodes able to talk to each other forms a majority, Zab stops handling requests. If your acknowledged update was accepted by a set of nodes that disappear and never come back online, your cluster will lose some data (but only when you ask it to move on and leave its dead nodes behind).
Handling more than two requests is done by handling them two at a time, until there's only one state left.

Is ZooKeeper always consistent in terms of CAP theorem?

Is that correct that ZooKeeper is always CP (in terms of CAP theorem)?
Or is there any way to use it as AP for service discovery needs?
ZooKeeper is not A, and can't drop P, so it's apparently called CP. In terms of the CAP theorem, "C" actually means linearizability.
linearizability: if operation B started after operation A successfully completed, then operation B must see the system in the same state as it was on completion of operation A, or a newer state.
But,
Zookeeper has Sequential Consistency - Updates from a client will be applied in the order that they were sent.
ZooKeeper is in fact not simultaneously consistent across client views.
http://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#ch_zkGuarantees
ZooKeeper does not guarantee that at every instance in time, two different clients will have identical views of ZooKeeper data. Due to factors like network delays, one client may perform an update before another client gets notified of the change. Consider the scenario of two clients, A and B. If client A sets the value of a znode /a from 0 to 1, then tells client B to read /a, client B may read the old value of 0, depending on which server it is connected to. If it is important that Client A and Client B read the same value, Client B should call the sync() method from the ZooKeeper API before it performs its read.
ZooKeeper provides "sequential consistency". This is weaker than linearizability but is still very strong, much stronger than "eventual consistency". ZooKeeper also provides a sync command. If you invoke a sync command and then a read, the read is guaranteed to see at least the last write that completed before the sync started.
Under linearizability, writes should appear to be instantaneous. Imprecisely, once a write completes, all later reads (where "later" is defined by wall-clock start time) should return the value of that write or the value of a later write. Once a read returns a particular value, all later reads should return that value or the value of a later write.
In ZooKeeper, the sync() method is what you use where you need something like linearizability.
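A sketch of that sync-then-read pattern with the kazoo Python client; the znode path /config/flag is made up for illustration:

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    # sync() forces the server this client is connected to to catch up with
    # the leader, so the read below sees at least every write that completed
    # before the sync started.
    zk.sync("/config/flag")
    value, stat = zk.get("/config/flag")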
Serializability is a guarantee about transactions, or groups of one or more operations over one or more objects. It guarantees that the execution of a set of transactions (usually containing read and write operations) over multiple items is equivalent to some serial execution (total ordering) of the transactions.
Refer :
http://zookeeper-user.578899.n2.nabble.com/Consistency-in-zookeeper-td7578531.html
http://www.bailis.org/blog/linearizability-versus-serializability/
Difference between Linearizability and Serializability
No, you cannot change consistency guarantees in current versions of ZooKeeper like you can in some other systems.
You can add a local cache to your clients, which will leave them with read-only data if the cluster goes down, but in terms of CAP that is still not A, because it needs to be available for updates as well as reads.
If ZK offers too strong levels of consistency for your service discovery needs, you should try researching other options, e.g. Eureka, Consul or etcd.
Possibly related reads:
https://tech.knewton.com/blog/2014/12/eureka-shouldnt-use-zookeeper-service-discovery/
https://github.com/Netflix/eureka/wiki/FAQ
https://www.consul.io/intro/vs/zookeeper.html
An excellent question.
In terms of the CAP theorem, "C" actually means linearizability: if operation B started after operation A successfully completed, then operation B must see the system in the same state as it was on completion of operation A, or a newer state.
Since the write in ZooKeeper is considered completed after the quorum confirms it, there can still be stale nodes with old data. Therefore, strictly speaking, ZooKeeper is by default not a CP system, even though it provides reasonably high level of consistency. You can ensure linearizability by preceding a read with a sync command.
Regarding availability under a network partition, the nodes that are not in the majority cannot process write requests anymore because they don't have a quorum.
See also:
Please stop calling databases CP or AP
CAP FAQ

How are out-of-order and wait-free writes handled?

As stated in Guarantees:
Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Let's assume a client makes 2 updates (update1 and update2) in a very short time window (I understand ZooKeeper is good at read-dominated applications). So my questions are:
Is it possible that update2 is received before update1, so that for ZooKeeper update1 gets a later stamp than update2? I assume yes, due to the nature of network connections. If that's the case, the client will lose its update2 and end up with update1. Is there any way ZooKeeper can ACK back to the client with a different stamp, or whatever other data, that lets the client determine whether update2 was really received after update1? Basically, ZooKeeper would tell the client what it sees on the server side, which gives the client some information to act on if that's not what it wants.
What if there is a leader failure after receiving and confirming update1 and before receiving update2? I assume such writes are persisted somewhere on disk, in a DB, etc. When the new leader comes up, will it catch up first, i.e. apply update1, before confirming update2 back to the client?
Just curious: since ZooKeeper claims to support wait-free writes, does that mean there is a message queue built inside ZooKeeper to hold incoming writes? Otherwise, if the leader has to make sure the update is propagated to all the other followers, the client is effectively blocked during this replication process. I am guessing that's part of the reason ZooKeeper does not suit write-heavy applications.
For the first two questions, I think you can find details in Zookeeper's paper.
Different operations from the same client could in principle arrive at the ZooKeeper node out of order, but ZooKeeper uses TCP, which guarantees that packets sent sequentially on one connection are received in order.
The leader must write operations to its write-ahead log before it can confirm them. The problem then splits into two cases. The first is whether the leader can recover before the followers notice its failure. If yes, nothing bad happens: the operations in flight at the time of the failure are lost, and the client will resend them. If not, we should consider whether the leader had proposed a proposal before it failed. If it failed before proposing, the client will learn of the failure. If it had proposed a proposal, there must be at least one node in the cluster that has the newest transactions, and that node will become the new leader in the next election. When the original leader recovers from the failure, it will realize it is no longer the leader (every ZooKeeper transaction carries a 64-bit transaction id, of which the higher 32 bits represent the epoch and the lower 32 bits the proposal id). It will then communicate with the new leader and get updated (sometimes it first needs to truncate its local transaction log).
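That zxid layout is easy to see in a tiny sketch:

    def split_zxid(zxid):
        """Split a 64-bit ZooKeeper transaction id into (epoch, counter)."""
        epoch = zxid >> 32            # higher 32 bits: leader epoch
        counter = zxid & 0xFFFFFFFF   # lower 32 bits: proposal counter in that epoch
        return epoch, counter

    # A proposal stamped by a restarted old leader carries a smaller epoch
    # than the new leader's, so the cluster can recognize and discard it.
    print(split_zxid(0x0000000200000007))     # -> (2, 7)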
I don't know the details since I haven't read ZooKeeper's source code, but the leader only needs acknowledgements from over half of the followers before it responds to clients. ZooKeeper provides both blocking and non-blocking APIs, and you can choose whichever you like.
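To illustrate the blocking vs. non-blocking point, a sketch with the kazoo Python client, which exposes both styles (the znode path is made up):

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    # Blocking: returns only after the write has been acknowledged by a quorum.
    zk.set("/app/state", b"update1")

    # Non-blocking: returns a handle immediately, so the client is not held up
    # while the leader replicates the proposal to its followers.
    async_result = zk.set_async("/app/state", b"update2")
    stat = async_result.get()                 # block only when the result is needed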