I get the concept of Crash Fault Tolerance (CFT) in theory. CFT is used to guarantee that the system keeps running even if the leader server crashes.
I need to implement a distributed system (a chat application) that is crash fault tolerant. For this I have to use a so-called "heartbeat" to check whether the leader server is still alive.
My question is: could someone show me a good code example of how to implement such a heartbeat?
Whether a heartbeat mechanism is applicable depends on the size of the cluster and the typical use case or deployment scenario you have at hand.
Many consensus-based algorithms rely on heartbeats to decide on the status of the leader server.
The Raft algorithm is a good reference: heartbeats are sent from the leader server to the followers, and you can also use its leader-election mechanism in case the leader crashes.
For large clusters, a heartbeat mechanism alone might not scale, so failure detectors combined with gossip-based protocols for propagation are preferred.
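To make this concrete, here is a minimal Python sketch of the two halves of a heartbeat: the leader broadcasting periodic beats, and a follower-side monitor that declares the leader dead after a timeout. All names here (HeartbeatMonitor, on_leader_dead, the message shape) are made up for illustration; a real chat application would send the beats over its own transport and start a leader election in the callback.

```python
import time
import threading

class HeartbeatMonitor:
    """Follower side: if no heartbeat arrives within `timeout` seconds,
    the leader is presumed crashed and `on_leader_dead` fires once."""

    def __init__(self, timeout, on_leader_dead, clock=time.monotonic):
        self.timeout = timeout
        self.on_leader_dead = on_leader_dead
        self.clock = clock          # injectable for testing
        self.last_beat = clock()
        self.leader_alive = True

    def record_heartbeat(self):
        # Call this whenever a heartbeat message arrives from the leader.
        self.last_beat = self.clock()
        self.leader_alive = True

    def check(self):
        # Call this periodically, e.g. from a timer thread.
        if self.leader_alive and self.clock() - self.last_beat > self.timeout:
            self.leader_alive = False
            self.on_leader_dead()   # e.g. start a leader election here

def leader_loop(send, interval, stop_event):
    """Leader side: broadcast a heartbeat every `interval` seconds
    until `stop_event` is set. `send` is your transport's send function."""
    while not stop_event.is_set():
        send({"type": "heartbeat"})
        stop_event.wait(interval)
```

The timeout is usually a small multiple of the send interval; Raft additionally randomizes the election timeout per follower to avoid split votes.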
A few references:
https://raft.github.io/ ,
https://github.com/topics/gossip-protocol
We have a three-node Couchbase cluster with two replicas and durability level MAJORITY.
This means that the mutation will be replicated to the active node (node A) and to one of the two replicas (node B) before it is acknowledged as successful.
In terms of consistency, what will happen if node A becomes unavailable and the hard failover process promotes node C replica before node A manages to replicate the mutation to node C?
According to the docs Protection Guarantees and Automatic Failover, the write is durable, but will it be available immediately?
Answered by @ingenthr here.
Assuming the order is that the client gets the acknowledgment of the durability and then the hard failover of your node A is triggered: during the failover, the cluster manager and the underlying data service will determine whether node B or C should be promoted to active for that vbucket (a.k.a. partition) to satisfy all promised durability. That was actually one of the trickier bits of the implementation.
"Immediately" is pretty much correct. Technically it does take some time to do the promotion of the vbucket, but this should be very short as it's just metadata checks and state changes and doesn't involve any data movement. Clients will need to be updated with the new topology as well. How long that takes is a function of the environment and what else is going on, but I'd expect single-digit seconds or even under a second.
Assuming you're using a modern SDK API 3.x client with best-effort retries, it will be mostly transparent to your application, but not entirely transparent, since you're doing a hard failover. Non-idempotent operations, for example, may bubble up as errors.
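The promotion logic described above can be illustrated with a toy model (this is not Couchbase code, just a sketch of the guarantee): a MAJORITY write on a 3-node cluster is acknowledged only once the active node plus at least one replica hold it, and a hard failover promotes the replica holding the highest mutation, so an acknowledged write survives even if the other replica never saw it.

```python
class Cluster:
    """Toy model of a durability=MAJORITY write on a 3-node cluster
    with 2 replicas (node A active, B and C replicas)."""

    def __init__(self):
        self.data = {"A": 0, "B": 0, "C": 0}  # each node's copy of one document
        self.active = "A"

    def durable_write(self, value, replicated_to):
        # The mutation reaches the active node plus the given replicas.
        # MAJORITY: the ack requires the active node + at least one replica.
        self.data[self.active] = value
        for node in replicated_to:
            self.data[node] = value
        acked = len(replicated_to) >= 1   # active + 1 replica = majority of 3
        return acked

    def hard_failover(self):
        # The cluster manager promotes the replica holding the highest
        # mutation, so acknowledged durable writes survive the failover.
        survivors = [n for n in self.data if n != self.active]
        self.active = max(survivors, key=lambda n: self.data[n])
        return self.active
```

In the scenario from the question, the write reached A and B but not C; because promotion considers which replica satisfies the promised durability, B is chosen and the acknowledged mutation is not lost.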
The information I found comparing Apache Kafka and ActiveMQ (and similar message queuing products) is never clear about the integrity properties of each solution (especially consistency).
With Kafka you can get the guarantee that no message is lost even in the presence of failures. Do you lose that guarantee using the "LazyPersistence" option?
By "no loss" I mean that the messages would be available to clients, even upon failure after restart - ideally, all messages arriving at the client, in the correct order.
Does ActiveMQ (either "classic" or Artemis) guarantee no loss of messages upon failure? Any configuration options that do give that guarantee? If the answer would differ for "classic" vs Artemis, that would be nice to know.
With Kafka, you can get the guarantee that no message is lost, even in the presence of failures; I guess you lose that guarantee using the "LazyPersistence" option, is that correct?
This is a large topic.
"guarantee that no message is lost"
This depends on a few things. First, you may configure retention: after a specific period it is fine for you that messages are lost. You may consider infinite retention, but then beware that you have enough storage for that; maybe you need compaction of the topic.
"even in the presence of failures; I guess you lose that guarantee using the 'LazyPersistence' option, is that correct?"
Kafka is a distributed system, and it is common for distributed systems to rely more on distributed replication than on synchronous disk writes. Even if you write synchronously to disk, the disk may die and the data be lost. To what degree you want to use distributed replication (e.g. 3 or 6 replicas?) and synchronous or asynchronous disk writes depends on your requirements, but it also involves a trade-off in throughput. E.g. AWS Aurora is a distributed database that uses 6 replicas.
There is no reasonable or practical way to have "no loss of messages" with any solution.
Kafka's approach is to replicate the data once it gets to the server. As @Jonas mentioned, there is a total-throughput trade-off. Kafka's producers are typically asynchronous out of the box, so it is reasonable to expect that a process restart (e.g. a container restart) or network outage would result in observable message loss on the producing application side. Also, lazy persistence can lead to observable message loss due to a process or server-side Kafka failure.
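To hedge against the producer-side loss just described, Kafka's standard configuration knobs trade throughput for stronger guarantees. These are real configuration names, but the exact durability you get still depends on your broker and topic setup:

```properties
# Producer: wait for all in-sync replicas to acknowledge each send,
# and make retries safe against duplication.
acks=all
enable.idempotence=true

# Topic/broker side: keep 3 copies and refuse acks unless at least
# 2 replicas are in sync, so an acknowledged message survives the
# loss of any single broker.
replication.factor=3
min.insync.replicas=2
```

Note that replication.factor is set at topic creation time, not on the producer; min.insync.replicas can be set per topic or broker-wide.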
ActiveMQ's approach is to sync data to disk using the OS system call fsync(), which is supposed to result in a write to disk. When you combine that with RAID storage, you have the most practical guarantee of data not being lost.
However, there is an alternative pattern that has nothing to do with persistence and can achieve a higher degree of guarantee. It is used by some financial trading systems and defense applications.
Often referred to as 'fanout', this pattern is supported by a fanout transport included in ActiveMQ's client. It works like this:
Producer sends message to 3 servers (they should be as isolated and separated from each other as possible).
Consumer(s) receive up to 3 messages.
First message through "wins" and the consumer app drops the other 2 messages.
With this approach, you can skip persistence altogether, since you have 3 independent routes and the odds of all 3 failing are low. (There are strategies to improve producer-side QoS in the event the producer's network is offline.)
The consumer has the option of processing the first message (fast) or requiring at least 2 copies before processing, to validate that the request is legit (secure, but higher latency).
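The consumer-side "first message wins" logic can be sketched like this (hypothetical names; ActiveMQ's fanout transport handles the producer side, but deduplication is up to the consumer application):

```python
class FanoutConsumer:
    """Sketch of the consumer side of the fanout pattern: up to 3 copies
    of each message arrive, one per independent route. With required=1
    the first copy wins and later duplicates are dropped; required=2
    gives the "validate before processing" variant."""

    def __init__(self, process, required=1):
        self.process = process
        self.required = required
        self.seen = {}       # message id -> copies received so far
        self.done = set()    # ids already handed to the application

    def on_message(self, msg_id, payload):
        if msg_id in self.done:
            return  # duplicate from another route: drop it
        self.seen[msg_id] = self.seen.get(msg_id, 0) + 1
        if self.seen[msg_id] >= self.required:
            self.done.add(msg_id)
            del self.seen[msg_id]
            self.process(payload)   # exactly one delivery to the app
```

A real implementation would also expire entries in `done` after some window, since message ids cannot be remembered forever.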
In consensus algorithms such as Paxos and Raft, a value is proposed, and if a quorum agrees, it's written durably to the data store. What happens to the participants that were unavailable at the time of the quorum? How do they eventually catch up? This seems to be left as an exercise for the reader wherever I look.
Take a look at the Raft protocol; catching up is simply built into the algorithm. The leader tracks, for each follower, the highest replicated index (matchIndex) and the nextIndex of the next entry to be sent, and it always sends entries to each follower starting at that follower's nextIndex, so no special case is needed to handle catching up a follower that was missing when the entry was committed. By its nature, when the follower restarts, the leader will begin sending it entries starting at that follower's nextIndex, and thus the node is caught up.
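The nextIndex bookkeeping can be sketched in a few lines. This is a toy model, not a full Raft implementation: entries are (term, command) pairs, and the AppendEntries RPCs are reduced to direct list access.

```python
def catch_up(leader_log, follower_log, next_index):
    """Replays the leader's AppendEntries retry loop against one follower:
    on a failed consistency check, decrement next_index and try again;
    on success, overwrite the follower's conflicting suffix with the
    leader's entries. Mutates follower_log in place."""
    while True:
        prev = next_index - 1
        # Consistency check: the follower must hold the leader's entry
        # at the index just before the new entries.
        consistent = prev < 0 or (prev < len(follower_log)
                                  and follower_log[prev] == leader_log[prev])
        if not consistent:
            next_index -= 1        # AppendEntries rejected: back up and retry
            continue
        follower_log[next_index:] = leader_log[next_index:]
        return len(leader_log)     # follower caught up; its new next_index
```

The same loop handles a follower that was simply down (missing suffix) and one that has stale entries from a dead leader (conflicting suffix).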
This is mentioned in Paxos Made Simple:
Because of message loss, a value could be chosen with no learner ever finding out. The learner could ask the acceptors what proposals they have accepted, but failure of an acceptor could make it impossible to know whether or not a majority had accepted a particular proposal. In that case, learners will find out what value is chosen only when a new proposal is chosen. If a learner needs to know whether a value has been chosen, it can have a proposer issue a proposal, using the algorithm described above.
And also in Raft paper:
The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will send to that follower.
If a follower’s log is inconsistent with the leader’s, the AppendEntries consistency check will fail in the next AppendEntries RPC. After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex will reach a point where the leader and follower logs match. When this happens, AppendEntries will succeed, which removes any conflicting entries in the follower’s log and appends entries from the leader’s log (if any).
If a follower or candidate crashes, then future RequestVote and AppendEntries RPCs sent to it will fail. Raft handles these failures by retrying indefinitely; if the crashed server restarts, then the RPC will complete successfully.
With the original Paxos papers, it is indeed left as an exercise for the reader. In practice, with Paxos you can send additional messages such as negative acknowledgements to propagate more information around the cluster as a performance optimisation. That can be used to let a node know that it is behind due to lost messages. Once a node knows that it is behind it needs to catch up which can be done with additional message types. That is described as Retransmission in the Trex multi-paxos engine that I wrote to demystify Paxos.
The Google Chubby Paxos paper, Paxos Made Live, criticises Paxos for leaving a lot up to the people doing the implementation. Lamport trained as a mathematician and was attempting to mathematically prove that you couldn't have consensus over lossy networks when he found the solution. The original papers very much supply a proof that it is possible rather than explain how to build practical systems with it. Modern papers usually describe an application of some new techniques backed up by experimental results; while they also supply a formal proof, IMHO most people skip over it and take it on trust. The unapproachable way that Paxos was introduced means that many people quote the original paper but fail to see that it describes leader election and multi-Paxos. Unfortunately, Paxos is still taught in a theoretical manner, not in a modern style, which leads people to think that it is hard and to miss its essence.
I argue that Paxos is simple, but that reasoning about failures in a distributed system, and testing to find any bugs, is hard. Everything that is left to the reader in the original papers doesn't affect correctness, but it does affect latency, throughput, and the complexity of the code. Once you understand what makes Paxos correct, as it is mechanically simple, it is straightforward to write the rest of what is needed in a way that doesn't violate consistency when you optimise the code for your use case and workload.
For example, Corfu and CURP give blisteringly high performance: one uses Paxos only for metadata, the other only needs to do Paxos when there are concurrent writes to the same keys. Those solutions don't directly compete with Raft or Multi-Paxos, as they solve for specific high-performance scenarios (e.g., k-v stores). Yet they demonstrate that for practical applications there is a huge amount of optimisation you can make, if your particular workload will let you, while still using Paxos for some part of the overall solution.
The Raft algorithm used by etcd and the ZAB algorithm used by Zookeeper both use a replicated log to update a state machine.
I was wondering if it's possible to design a similar system by simply using leader election and versioned values, and why those systems decided to use a replicated log.
In my example, suppose we have the following setup:
machine A (Leader), contain version 1
machine B (Follower), contain version 1
machine C (Follower), contain version 1
And the write would go like this:
Machine A receives a write request and stores the pending write V2
Machine A sends prepare requests to Machine B and Machine C
The followers (Machine B and Machine C) send acknowledgments to the leader (Machine A)
After the leader (Machine A) receives acknowledgments from a quorum of machines, it knows V2 is now committed and sends a success response to the client
The leader (Machine A) sends a finalize request to the followers (Machine B and Machine C) to inform them that V2 is committed and V1 can be discarded
For this system to work, on a leader change, after acquiring the leader lease, the new leader has to get the latest data version by reading from a quorum of nodes before accepting requests.
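Under the (big) assumption of a single stable leader and no failures, the write path described above can be sketched as follows. All names here are hypothetical, and the sketch deliberately ignores the failure, conflict, and membership questions that the answers below raise.

```python
class Node:
    """One machine holding a single versioned value."""

    def __init__(self):
        self.committed = (1, "v1")   # (version, value), i.e. V1
        self.pending = None

    def prepare(self, version, value):
        self.pending = (version, value)   # store the pending write
        return True                       # acknowledge to the leader

    def finalize(self, version):
        if self.pending and self.pending[0] == version:
            self.committed = self.pending  # V2 commits, V1 is discarded
            self.pending = None

def quorum_write(leader, followers, version, value):
    """Steps 1-5: leader stores the pending write, sends prepare to the
    followers, and commits once a quorum (leader included) has acked."""
    leader.prepare(version, value)
    acks = 1 + sum(f.prepare(version, value) for f in followers)
    if acks * 2 > 1 + len(followers):      # strict majority of all nodes
        leader.finalize(version)
        for f in followers:
            f.finalize(version)
        return True                        # success response to the client
    return False
```

Even this happy path shows why the quorum read on leader change is needed: a follower that missed the finalize still holds only its pending copy, so the new leader must consult a majority to find the latest committed version.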
"The Raft algorithm in etcd and the ZAB algorithm in Zookeeper both use a replicated log to update a state machine. I was wondering if it's possible to design a similar system by simply using leader election and versioned values."
Yes, it's possible to achieve consensus/linearizability without log replication. Originally the consensus problem was solved in the Paxos Made Simple paper by Leslie Lamport. He described two algorithms: Single Decree Paxos, to build a distributed linearizable write-once register, and Multi-Paxos, to make a distributed state machine on top of an append-only log (an ordered array of write-once registers).
An append-only log is a much more powerful abstraction than a write-once register, so it isn't surprising that people chose logs over registers. Besides, until Vertical Paxos (2009) was published, log replication was the only consensus protocol capable of cluster membership change, which is vital for many tasks: if you can't replace failed nodes, then eventually your cluster becomes unavailable.
Although Vertical Paxos is a good paper, it was much easier for me to understand Raft's idea of cluster membership change via joint consensus, so I wrote a post on how to adapt Raft's approach for Single Decree Paxos.
With time, the "write-once" nature of Single Decree Paxos was also resolved, turning write-once registers into distributed linearizable variables, a quite powerful abstraction suitable for many use cases. In the wild, I saw that approach in the Treode database. If you are interested, I blogged about this improved SDP in the How Paxos Works post.
So now that we have an alternative to logs, it makes sense to consider it, because log-based replication is complex and has intrinsic limitations:
with logs you need to care about log compaction and garbage collection
the size of the log is limited by the storage of one node
protocols for splitting a log and migrating to a new cluster are not well known
And why those system decided to use a replication log.
The log-based approach is older than the alternative, so it has had more time to gain popularity.
About your example
It's hard to evaluate, because you didn't describe how leader election happens, how conflicts between leaders are resolved, what the strategy is for handling failures, and how to change the membership of the cluster.
I believe if you describe them carefully you'll get a variant of Paxos.
Your example makes sense. However, have you considered every possible failure scenario? In step 2, Machine B could receive the message minutes before or after Machine C (or vice versa) due to network partitions or faulty routers. In step 3, the acknowledgements could be lost, delayed, or re-transmitted numerous times. The leader could also fail and come back up once, twice, or potentially several times all within the same consensus round. And in step 5, the messages could be lost, duplicated, or Machine A & C could receive the notification while B misses it....
Conceptual simplicity, also known as "reducing the potential points of failure", is key to distributed systems. Anything can happen, and will happen in realistic environments. Primitives, such as replicated logs based on consensus protocols proven to be correct in any environment, are a solid foundation upon which to build higher levels of abstraction. It's certainly true that better performance or latency or your "metric of interest" can be achieved by a custom-built algorithm but ensuring correctness for such an algorithm is a major time investment.
Replicated logs are simple, easily understood, predictable, and fall neatly into the domain of established consensus protocols (Paxos, Paxos variants, and Raft). That's why they're popular. It's not that they're the best for any particular application; rather, they're understood and reliable.
For related references, you may be interested in Understanding Paxos and Consensus in the Cloud: Paxos Systems Demystified
As I understand Chubby, at any given time there are 5 Chubby servers. One is the master and handles coordination of writes to the quorum, and the other 4 servers are read-only and forward writes to the master. Writes use Paxos to maintain consistency.
Can someone explain why there is a distinction between the master and the 4 replicas? Why isn't Chubby multi-master? This question could also apply to Zookeeper.
Having a single master is more efficient because nodes don't have to deal with as much contention.
Both Chubby and Zookeeper implement a distributed state-machine where the point of the system is to decide a total ordering on transitions from one state to the next. It can take a lot of messages (theoretically infinite messages) to resolve a contention when multiple nodes are proposing a transition at the same time.
Paxos (and thus Chubby) uses an optimization called a "distinguished leader" where the replicas forward writes to the distinguished leader to reduce contention. (I am assuming Chubby replicas do this. If not they could, but the designers merely push that responsibility to the client.) Zookeeper does this too.
Both Chubby and Zookeeper actually do handle multiple leaders because they must deal with a leader that doesn't know it has died and then comes back from the dead. For Chubby, this is the whole point of using Paxos: eventually one of the leaders will win. (Well theoretically it may not, but we engineers do practical things like randomized backoff to make that probability tolerably small.) Zookeeper, on the other hand, assigns a non-decreasing id to each leader; and any non-current leader's transitions are rejected. This means that when a leader dies, Zookeeper has to pause and go through a reconfiguration phase before accepting new transitions.
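Zookeeper's non-decreasing leader id can be sketched as a fencing token. This is a toy model, not ZAB itself: each new leader gets a higher epoch, and transitions from any stale (lower-epoch) leader are rejected, so a leader coming back from the dead can do no harm.

```python
class StateMachine:
    """Toy sketch of epoch-based leader fencing."""

    def __init__(self):
        self.current_epoch = 0
        self.log = []            # the totally ordered transitions

    def new_leader(self):
        # Each election hands out a strictly higher (non-decreasing) id.
        self.current_epoch += 1
        return self.current_epoch

    def propose(self, epoch, transition):
        if epoch < self.current_epoch:
            return False         # stale leader: transition rejected
        self.log.append(transition)
        return True
```

The reconfiguration pause mentioned above corresponds to the `new_leader` step: no transitions are accepted between the old leader's death and the new epoch being established.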