When does the Raft consensus algorithm apply the commit log to the leader and followers? - consensus

In RAFT, I understand that a leader receives a request and federates it out to it's peer list to commit to their respective logs.
My question is, is there a distinction between committing the action to the log and actually applying the action? If the answer is yes, then at what point does the action get applied?
My understanding is once the leaders receives, from the majority - "hey I wrote this to my log", the leader applies the change then federates an "Apply" command to the peers that wrote the change to their respective logs and then the ack is sent to the client.

I would say there is a distinction between committing an entry and applying it to the state machine. Once an entry is committed (i.e. the commitIndex is >= the entry index) it can be applied at any time. In practice, you want the leader to apply committed entries as soon as possible to reduce latency, so entries will usually be applied to an in-memory state machine immediately.
In the case of in-memory state machines the distinction is not very obvious. But it’s the other use cases for Raft that do necessitate this distinction. For example, the distinction becomes particularly important with persistent state machines. If the state machine is persisting changes to e.g. an underlying database, it’s critical that each entry only be applied to the state machine once so that the underlying store does not go back in time when the node replays entries to the state machine when recovering from a failure. To make persistent state machines idempotent, the state machine on each node needs to persist the entries that have been applied on that node as part of the persistent state. In this case, the process of applying entries is indeed a distinction, and a critical one.
State machine replication is also only one use case for Raft. There are others as well. It’s perfectly valid, for example, to use the protocol for simple log replication, in which case entries wouldn’t be applied at all.

Related

Why do we need to use Zookeeper for a Coordination Service instead of just a central database?

Quoting the zookeeper docs
ZooKeeper is a distributed, open-source coordination service for
distributed applications. It exposes a simple set of primitives that
distributed applications can build upon to implement higher level
services for synchronization, configuration maintenance, and groups
and naming.
Guarantees
ZooKeeper is very fast and very simple. Since its goal, though, is to
be a basis for the construction of more complicated services, such as
synchronization, it provides a set of guarantees. These are:
Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Atomicity - Updates either succeed or fail. No partial results.
Single System Image - A client will see the same view of the service regardless of the server that it connects to.
Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
Timeliness - The clients view of the system is guaranteed to be up-to-date within a certain time bound.
But I don't see any new problem that Zookeeper solves apart from being highly fault tolerant compared to a central database. All the guarantees that zookeeper assures can be guaranteed in a central database too.
Atomicity -> As it's a single node. all updates are atomic.
Sequential Consistency -> after an update clients can wait until the ack until they send the next update to maintain the sequence.
Single System Image, Reliability, Timeliness -> guaranteed as it's a single node.
So, Avoiding a single point of failure is the only main advantage of using zookeeper. Please correct me if I'm wrong.
Zookeeper (and other consensus based systems) offers sequential consistency, strong consistency and high availability.
"apart from being highly fault tolerant" that's actually huge - the fault tolerance.
If you don't care about availability, you totally can use any other linearizable storage - even a directory with files will work.
Consensus based system, and systems based on them (e.g. zoo + your own code) are used to implement machine state replication. All transitions are stored in a distributed log - to make it durable there are many copies. Consensus is about what is the order of event in the log.
With the log being available, the actual business code can consume events and change its state machine - typical state machine transitions. Since each copy of log has the same sequence of events, all states machines will get to the same state.
The key thing is about timing - all logs will get same events in the same order, but there is no guarantee when that happens - a node could be disconnected from the network, hence its log will be stale, and by extension the state machine as well.
To see the true latest value, as you would expect with a singe source of truth, you have to use linearizable read. One way of doing this is to append the read operation to the log itself and wait for it to be committed. Read do nothing with state machines, but the fact that a reader placed something to log and got it committed, that signals that the entire log is read - there is no stale data. (Stale it means that all writes happened before the read are reflected, while read is happening, new writes could happen).
All of this complexity comes form the availability requirements - a cluster with three nodes can let one node to go down, without affecting operations.
So, yes, you could use any linear storage to do the same, ignoring availability. You could do this by keeping the log of events in a table, and every client to track a pointer (or id) of last applied operation; so every client could go and move its own state machine.

Failover and strong consistency in Couchbase

We have a three-node Couchbase cluster with two replicas and durability level MAJORITY.
This means that the mutation will be replicated to the active node(node A) and to one of the two replicas(node B) before it is acknowledged as successful.
In terms of consistency, what will happen if node A becomes unavailable and the hard failover process promotes node C replica before node A manages to replicate the mutation to node C?
According to the docs Protection Guarantees and Automatic Failover, write is durable but will be available immediately?
Answered by #ingenthr here.
Assuming the order is that the client gets the acknowledgment of the
durability, then the hard failover is triggered of your node A, during
the failover the cluster manager and the underlying data service will
determine whether node B or C should be promoted to active for that
vbucket (a.k.a. partition) to satisfy all promised durability. That
was actually one of the trickier bits of implementation.
“Immediately” is pretty much correct. Technically it does take some
time to do the promotion of the vbucket, but this should be very short
as it’s just metadata checks and state changes and doesn’t involve any
data movement. Clients will need to be updated with the new topology
as well. How long is a function of the environment and what else is
going on, but I’d expect single-digit-seconds or even under a second.
Assuming you’re using a modern SDK API 3.x client with best-effort
retries, it will be mostly transparent to your application, but not
entirely transparent since you’re doing a hard failover.
Non-idempotent operations, for example, may bubble up as errors.

Is a replication log necessary to achieve linearizability in distributed store

The Raft algorithm used by etcd and ZAB algorithm by Zookeeper are both using replication log to update a state machine.
I was wondering if it's possible to design a similar system by simply using leader election and versioned values. And why those system decided to use a replication log.
I my example if we have the following setup
machine A (Leader), contain version 1
machine B (Follower), contain version 1
machine C (Follower), contain version 1
And the write would go like this:
Machine A receive Write request and store pending write V2
Machine A send prepare request to Machine B and Machine C
Followers (Machine B and Machine C) send Acknowledge to leader (Machine A)
After Leader (machine A) receive Acknowledge from quorum of machine, it know V2 is now commited and send success response to client
Leader (machine a) send finalize request to Follower (machine A and Machine B) to inform them that V2 is commited and V1 could be discarded.
For this system to work, On leader change after acquiring leader Lease the leader machine have to get the latest data version by reading from a quorum of node before accepting Request.
The raft algorithm in ETCD and ZAB algorithm in Zookeeper are both using replication log to update a state machine.
I was wondering if it's possible to design a similar system by simply using leader election and versioned values.
Yes, it's possible to achieve consensus/linearizability without log replication. Originally the consensus problem was solved in the Paxos Made Simple paper by Leslie Lamport (1998). He described two algorithms: Single Decree Paxos to to build a distributed linearizable write-once register and Multi-Paxos to make a distributed state machine on top of append only log (an ordered array of write-once registers).
Append only logs is much more powerful abstraction than write-once registers therefore it isn't surprising that people chose logs over registers. Besides, until Vertical Paxos (2009) was published, log replication was the only consensus protocol capable of cluster membership change; what is vital for multiple tasks: if you can't replace failed nodes then eventually your cluster becomes unavailable.
Yet Vertical Paxos is a good paper, it was much easier for me to understand the Raft's idea of cluster membership via the joint consensus, so I wrote a post on how to adapt the Raft's way for Single Decree Paxos.
With time the "write-once" nature of the Single Decree Paxos was also resolved turning write-once registers into distributed linearizable variables, a quite powerful abstraction suitable for the many use cases. In the wild I saw that approach in the Treode database. If you got interested I blogged about this improved SDP in the How Paxos Works post.
So now when we have an alternative to logs it makes sense to consider it because log based replication is complex and has intrinsic limitations:
with logs you need to care about log compaction and garbage collection
size of the log is limited by the size of one node
protocols for splitting a log and migration to a new cluster are not well-known
And why those system decided to use a replication log.
The log-based approach is older that the alternative, so it has more time to gain popularity.
About your example
It's hard to evaluate it, because you didn't describe how the leader election happens and the conflicts between leaders are resolved, what is the strategy to handle failures and how to change membership of the cluster.
I believe if you describe them carefully you'll get a variant of Paxos.
Your example makes sense. However, have you considered every possible failure scenario? In step 2, Machine B could receive the message minutes before or after Machine C (or vice versa) due to network partitions or faulty routers. In step 3, the acknowledgements could be lost, delayed, or re-transmitted numerous times. The leader could also fail and come back up once, twice, or potentially several times all within the same consensus round. And in step 5, the messages could be lost, duplicated, or Machine A & C could receive the notification while B misses it....
Conceptual simplicity, also known as "reducing the potential points of failure", is key to distributed systems. Anything can happen, and will happen in realistic environments. Primitives, such as replicated logs based on consensus protocols proven to be correct in any environment, are a solid foundation upon which to build higher levels of abstraction. It's certainly true that better performance or latency or your "metric of interest" can be achieved by a custom-built algorithm but ensuring correctness for such an algorithm is a major time investment.
Replicated logs are simple, easily understood, predictable, and fall neatly into the domain of established consensus protocols (paxos, paxos-variants, & raft). That's why they're popular. It's not because they're the best for any particular application, rather they're understood and reliable.
For related references, you may be interested in Understanding Paxos and Consensus in the Cloud: Paxos Systems Demystified

Why is MongoDB supposed to Consistent & Partition tolerant but not Available [duplicate]

Everywhere I look, I see that MongoDB is CP.
But when I dig in I see it is eventually consistent.
Is it CP when you use safe=true? If so, does that mean that when I write with safe=true, all replicas will be updated before getting the result?
MongoDB is strongly consistent by default - if you do a write and then do a read, assuming the write was successful you will always be able to read the result of the write you just read. This is because MongoDB is a single-master system and all reads go to the primary by default. If you optionally enable reading from the secondaries then MongoDB becomes eventually consistent where it's possible to read out-of-date results.
MongoDB also gets high-availability through automatic failover in replica sets: http://www.mongodb.org/display/DOCS/Replica+Sets
I agree with Luccas post. You can't just say that MongoDB is CP/AP/CA, because it actually is a trade-off between C, A and P, depending on both database/driver configuration and type of disaster: here's a visual recap, and below a more detailed explanation.
Scenario
Main Focus
Description
No partition
CA
The system is available and provides strong consistency
partition, majority connected
AP
Not synchronized writes from the old primary are ignored
partition, majority not connected
CP
only read access is provided to avoid separated and inconsistent systems
Consistency:
MongoDB is strongly consistent when you use a single connection or the correct Write/Read Concern Level (Which will cost you execution speed). As soon as you don't meet those conditions (especially when you are reading from a secondary-replica) MongoDB becomes Eventually Consistent.
Availability:
MongoDB gets high availability through Replica-Sets. As soon as the primary goes down or gets unavailable else, then the secondaries will determine a new primary to become available again. There is an disadvantage to this: Every write that was performed by the old primary, but not synchronized to the secondaries will be rolled back and saved to a rollback-file, as soon as it reconnects to the set(the old primary is a secondary now). So in this case some consistency is sacrificed for the sake of availability.
Partition Tolerance:
Through the use of said Replica-Sets MongoDB also achieves the partition tolerance: As long as more than half of the servers of a Replica-Set is connected to each other, a new primary can be chosen. Why? To ensure two separated networks can not both choose a new primary. When not enough secondaries are connected to each other you can still read from them (but consistency is not ensured), but not write. The set is practically unavailable for the sake of consistency.
As a brilliant new article showed up and also some awesome experiments by Kyle in this field, you should be careful when labeling MongoDB, and other databases, as C or A.
Of course CAP helps to track down without much words what the database prevails about it, but people often forget that C in CAP means atomic consistency (linearizability), for example. And this caused me lots of pain to understand when trying to classify. So, besides MongoDB give strong consistency, that doesn't mean that is C. In this way, if one make this classifications, I recommend to also give more depth in how it actually works to not leave doubts.
Yes, it is CP when using safe=true. This simply means, the data made it to the masters disk.
If you want to make sure it also arrived on some replica, look into the 'w=N' parameter where N is the number of replicas the data has to be saved on.
see this and this for more information.
MongoDB selects Consistency over Availability whenever there is a Partition. What it means is that when there's a partition(P) it chooses Consistency(C) over Availability(A).
To understand this, Let's understand how MongoDB does replica set works. A Replica Set has a single Primary node. The only "safe" way to commit data is to write to that node and then wait for that data to commit to a majority of nodes in the set. (you will see that flag for w=majority when sending writes)
Partition can occur in two scenarios as follows :
When Primary node goes down: system becomes unavailable until a new
primary is selected.
When Primary node looses connection from too many
Secondary nodes: system becomes unavailable. Other secondaries will try to
elect a new Primary and current primary will step down.
Basically, whenever a partition happens and MongoDB needs to decide what to do, it will choose Consistency over Availability. It will stop accepting writes to the system until it believes that it can safely complete those writes.
Mongodb never allows write to secondary. It allows optional reads from secondary but not writes. So if your primary goes down, you can't write till a secondary becomes primary again. That is how, you sacrifice High Availability in CAP theorem. By keeping your reads only from primary you can have strong consistency.
I'm not sure about P for Mongo. Imagine situation:
Your replica gets split into two partitions.
Writes continue to both sides as new masters were elected
Partition is resolved - all servers are now connected again
What happens is that new master is elected - the one that has highest oplog, but the data from the other master gets reverted to the common state before partition and it is dumped to a file for manual recovery
all secondaries catch up with the new master
The problem here is that the dump file size is limited and if you had a partition for a long time you can loose your data forever.
You can say that it's unlikely to happen - yes, unless in the cloud where it is more common than one may think.
This example is why I would be very careful before assigning any letter to any database. There's so many scenarios and implementations are not perfect.
If anyone knows if this scenario has been addressed in later releases of Mongo please comment! (I haven't been following everything that was happening for some time..)
Mongodb gives up availability. When we talk about availability in the context of the CAP theorem, it is about avoiding single points of failure that can go down. In mongodb. there is a primary router host. and if that goes down,there is gonna be some downtime in the time that it takes for it to elect a new replacement server to take its place. In practical, that is gonna happen very qucikly. we do have a couple of hot standbys sitting there ready to go. So as soon as the system detects that primary routing host went down, it is gonna switch over to a new one pretty much right away. Technically speaking it is still single point of failure. There is still a chance of downtime when that happens.
There is a config server, that is the primary and we have an app server, that is primary at any given time. even though we have multiple backups, there is gonna be a brief period of downtime if any of those servers go down. the system has to first detect that there was an outage and then remaining servers need to reelect a new primary host to take its place. that might take a few seconds and this is enough to say that mongodb is trading off the availability

Where does mongodb stand in the CAP theorem?

Everywhere I look, I see that MongoDB is CP.
But when I dig in I see it is eventually consistent.
Is it CP when you use safe=true? If so, does that mean that when I write with safe=true, all replicas will be updated before getting the result?
MongoDB is strongly consistent by default - if you do a write and then do a read, assuming the write was successful you will always be able to read the result of the write you just read. This is because MongoDB is a single-master system and all reads go to the primary by default. If you optionally enable reading from the secondaries then MongoDB becomes eventually consistent where it's possible to read out-of-date results.
MongoDB also gets high-availability through automatic failover in replica sets: http://www.mongodb.org/display/DOCS/Replica+Sets
I agree with Luccas post. You can't just say that MongoDB is CP/AP/CA, because it actually is a trade-off between C, A and P, depending on both database/driver configuration and type of disaster: here's a visual recap, and below a more detailed explanation.
Scenario
Main Focus
Description
No partition
CA
The system is available and provides strong consistency
partition, majority connected
AP
Not synchronized writes from the old primary are ignored
partition, majority not connected
CP
only read access is provided to avoid separated and inconsistent systems
Consistency:
MongoDB is strongly consistent when you use a single connection or the correct Write/Read Concern Level (Which will cost you execution speed). As soon as you don't meet those conditions (especially when you are reading from a secondary-replica) MongoDB becomes Eventually Consistent.
Availability:
MongoDB gets high availability through Replica-Sets. As soon as the primary goes down or gets unavailable else, then the secondaries will determine a new primary to become available again. There is an disadvantage to this: Every write that was performed by the old primary, but not synchronized to the secondaries will be rolled back and saved to a rollback-file, as soon as it reconnects to the set(the old primary is a secondary now). So in this case some consistency is sacrificed for the sake of availability.
Partition Tolerance:
Through the use of said Replica-Sets MongoDB also achieves the partition tolerance: As long as more than half of the servers of a Replica-Set is connected to each other, a new primary can be chosen. Why? To ensure two separated networks can not both choose a new primary. When not enough secondaries are connected to each other you can still read from them (but consistency is not ensured), but not write. The set is practically unavailable for the sake of consistency.
As a brilliant new article showed up and also some awesome experiments by Kyle in this field, you should be careful when labeling MongoDB, and other databases, as C or A.
Of course CAP helps to track down without much words what the database prevails about it, but people often forget that C in CAP means atomic consistency (linearizability), for example. And this caused me lots of pain to understand when trying to classify. So, besides MongoDB give strong consistency, that doesn't mean that is C. In this way, if one make this classifications, I recommend to also give more depth in how it actually works to not leave doubts.
Yes, it is CP when using safe=true. This simply means, the data made it to the masters disk.
If you want to make sure it also arrived on some replica, look into the 'w=N' parameter where N is the number of replicas the data has to be saved on.
see this and this for more information.
MongoDB selects Consistency over Availability whenever there is a Partition. What it means is that when there's a partition(P) it chooses Consistency(C) over Availability(A).
To understand this, Let's understand how MongoDB does replica set works. A Replica Set has a single Primary node. The only "safe" way to commit data is to write to that node and then wait for that data to commit to a majority of nodes in the set. (you will see that flag for w=majority when sending writes)
Partition can occur in two scenarios as follows :
When Primary node goes down: system becomes unavailable until a new
primary is selected.
When Primary node looses connection from too many
Secondary nodes: system becomes unavailable. Other secondaries will try to
elect a new Primary and current primary will step down.
Basically, whenever a partition happens and MongoDB needs to decide what to do, it will choose Consistency over Availability. It will stop accepting writes to the system until it believes that it can safely complete those writes.
Mongodb never allows write to secondary. It allows optional reads from secondary but not writes. So if your primary goes down, you can't write till a secondary becomes primary again. That is how, you sacrifice High Availability in CAP theorem. By keeping your reads only from primary you can have strong consistency.
I'm not sure about P for Mongo. Imagine situation:
Your replica gets split into two partitions.
Writes continue to both sides as new masters were elected
Partition is resolved - all servers are now connected again
What happens is that new master is elected - the one that has highest oplog, but the data from the other master gets reverted to the common state before partition and it is dumped to a file for manual recovery
all secondaries catch up with the new master
The problem here is that the dump file size is limited and if you had a partition for a long time you can loose your data forever.
You can say that it's unlikely to happen - yes, unless in the cloud where it is more common than one may think.
This example is why I would be very careful before assigning any letter to any database. There's so many scenarios and implementations are not perfect.
If anyone knows if this scenario has been addressed in later releases of Mongo please comment! (I haven't been following everything that was happening for some time..)
Mongodb gives up availability. When we talk about availability in the context of the CAP theorem, it is about avoiding single points of failure that can go down. In mongodb. there is a primary router host. and if that goes down,there is gonna be some downtime in the time that it takes for it to elect a new replacement server to take its place. In practical, that is gonna happen very qucikly. we do have a couple of hot standbys sitting there ready to go. So as soon as the system detects that primary routing host went down, it is gonna switch over to a new one pretty much right away. Technically speaking it is still single point of failure. There is still a chance of downtime when that happens.
There is a config server, that is the primary and we have an app server, that is primary at any given time. even though we have multiple backups, there is gonna be a brief period of downtime if any of those servers go down. the system has to first detect that there was an outage and then remaining servers need to reelect a new primary host to take its place. that might take a few seconds and this is enough to say that mongodb is trading off the availability