Does paxos provide true linearizable consistency or not? - distributed-computing

I think I might be confusing concepts here, but it seems to me like paxos would provide linearizable consistency for systems that implement it.
I know Cassandra uses it. I'm not 100% clear on how but assuming a leader is elected and that single leader does all the writes then communication is synchronous and real-time linearizability is achieved right?
But consensus algorithms like paxos are generally considered partially synchronous because there is a quorum (not 100% of node communication)- does this also mean it's not truly linearizable as well?
maybe because there is only a quorum a node could fall out of sync and that would break linearization?

A linearizable system does not need to be synchronous. Linearizability is a safety property: it says "nothing bad happens" but it doesn't affect linearizability if nothing good happens either. Any reads or writes that do not return (or that return an error) can be ignored when checking for linearizability. This means it's perfectly possible for a system to be linearizable even if one or more of the nodes are faulty or partitioned or running slowly.
Paxos is commonly used to implement a replicated state machine: a system that executes a sequence of operations on multiple nodes at once. Since the operations are deterministic and the nodes all agree on the operations to run and the sequence in which to run them, the nodes all converge to the same state (eventually).
You can implement a linearizable system using Paxos by having the operations in the sequence be writes and reads using the fact that the operations are placed in a totally-ordered sequence (i.e. linearized) by the Paxos protocol.
It's important to put the reads in the sequence as well as the writes. Imagine instead you only used Paxos to agree on the writes, and served reads directly from a node's local state. If the node serving the reads is partitioned from the other nodes then it would serve stale reads, violating linearizability. Each read must involve a quorum of nodes to ensure that the returned value is fresh, which means (effectively) putting the read into the sequence alongside the writes.
(There's some tricks you can play to make reads a bit more efficient than writes, given that reads commute with each other and don't need to be persisted to disk, but you can't escape the need to contact a quorum of nodes for both read and write operations)

Related

Why do we need to use Zookeeper for a Coordination Service instead of just a central database?

Quoting the zookeeper docs
ZooKeeper is a distributed, open-source coordination service for
distributed applications. It exposes a simple set of primitives that
distributed applications can build upon to implement higher level
services for synchronization, configuration maintenance, and groups
and naming.
Guarantees
ZooKeeper is very fast and very simple. Since its goal, though, is to
be a basis for the construction of more complicated services, such as
synchronization, it provides a set of guarantees. These are:
Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Atomicity - Updates either succeed or fail. No partial results.
Single System Image - A client will see the same view of the service regardless of the server that it connects to.
Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
Timeliness - The clients view of the system is guaranteed to be up-to-date within a certain time bound.
But I don't see any new problem that Zookeeper solves apart from being highly fault tolerant compared to a central database. All the guarantees that zookeeper assures can be guaranteed in a central database too.
Atomicity -> As it's a single node. all updates are atomic.
Sequential Consistency -> after an update clients can wait until the ack until they send the next update to maintain the sequence.
Single System Image, Reliability, Timeliness -> guaranteed as it's a single node.
So, Avoiding a single point of failure is the only main advantage of using zookeeper. Please correct me if I'm wrong.
Zookeeper (and other consensus based systems) offers sequential consistency, strong consistency and high availability.
"apart from being highly fault tolerant" that's actually huge - the fault tolerance.
If you don't care about availability, you totally can use any other linearizable storage - even a directory with files will work.
Consensus based system, and systems based on them (e.g. zoo + your own code) are used to implement machine state replication. All transitions are stored in a distributed log - to make it durable there are many copies. Consensus is about what is the order of event in the log.
With the log being available, the actual business code can consume events and change its state machine - typical state machine transitions. Since each copy of log has the same sequence of events, all states machines will get to the same state.
The key thing is about timing - all logs will get same events in the same order, but there is no guarantee when that happens - a node could be disconnected from the network, hence its log will be stale, and by extension the state machine as well.
To see the true latest value, as you would expect with a singe source of truth, you have to use linearizable read. One way of doing this is to append the read operation to the log itself and wait for it to be committed. Read do nothing with state machines, but the fact that a reader placed something to log and got it committed, that signals that the entire log is read - there is no stale data. (Stale it means that all writes happened before the read are reflected, while read is happening, new writes could happen).
All of this complexity comes form the availability requirements - a cluster with three nodes can let one node to go down, without affecting operations.
So, yes, you could use any linear storage to do the same, ignoring availability. You could do this by keeping the log of events in a table, and every client to track a pointer (or id) of last applied operation; so every client could go and move its own state machine.

Failover and strong consistency in Couchbase

We have a three-node Couchbase cluster with two replicas and durability level MAJORITY.
This means that the mutation will be replicated to the active node(node A) and to one of the two replicas(node B) before it is acknowledged as successful.
In terms of consistency, what will happen if node A becomes unavailable and the hard failover process promotes node C replica before node A manages to replicate the mutation to node C?
According to the docs Protection Guarantees and Automatic Failover, write is durable but will be available immediately?
Answered by #ingenthr here.
Assuming the order is that the client gets the acknowledgment of the
durability, then the hard failover is triggered of your node A, during
the failover the cluster manager and the underlying data service will
determine whether node B or C should be promoted to active for that
vbucket (a.k.a. partition) to satisfy all promised durability. That
was actually one of the trickier bits of implementation.
“Immediately” is pretty much correct. Technically it does take some
time to do the promotion of the vbucket, but this should be very short
as it’s just metadata checks and state changes and doesn’t involve any
data movement. Clients will need to be updated with the new topology
as well. How long is a function of the environment and what else is
going on, but I’d expect single-digit-seconds or even under a second.
Assuming you’re using a modern SDK API 3.x client with best-effort
retries, it will be mostly transparent to your application, but not
entirely transparent since you’re doing a hard failover.
Non-idempotent operations, for example, may bubble up as errors.

MongoDB: Write concern w:[number>1] - why?

I'm trying to understand the advantages of using write concern where is greater than 1.
I understand the use cases for w:1, and w:primary. I'm trying to understand why someone would use other values. Let's use a 6-node (+arbiter) replica set as an example.
If it's guaranteed writes with {w:majority, j:true} will survive the
failure of the primary, what the advantage of the w:5?
Does w:6 help to achieve linearizable access to secondaries at the
expense of availability as a failure of any node will prevent writes?
This is not documented as such.
Why would someone ever use w:2 or w:3? It doesn't guarantee the
write will survive a failure of the primary.
Write concern is a mechanism for tuning availability, durability and performance.
In a w:1 environment, you essentially expect to lose data if the primary experiences any issue whatsoever.
In a w:2 environment, you still may lose data but you expect less loss because the data will be replicated. If you lose two nodes, and you are unlucky with where the second write went, you can lose the data.
In a w:3 environment, there is still potential for data loss but it is less than in the w:2 environment.
w:4 is majority write, a sensible default for most applications.
w:5 provides one "unnecessary" write. This means your writes will take longer than they would with w:majority but if you have elections, secondary nodes will generally be more up-to-date with the primary, so you would reduce election times.
w:6 writes to every node. If you are doing secondary reads and your nodes are geographically distributed, this is a way of getting the data to all of the nodes quicker at the expense of potential write unavailability and obviously longer writes.
Does w:6 help to achieve linearizable access to secondaries at the expense of availability as a failure of any node will prevent writes?
No, this is not one of the benefits you get from w=# of nodes.

Is a replication log necessary to achieve linearizability in distributed store

The Raft algorithm used by etcd and ZAB algorithm by Zookeeper are both using replication log to update a state machine.
I was wondering if it's possible to design a similar system by simply using leader election and versioned values. And why those system decided to use a replication log.
I my example if we have the following setup
machine A (Leader), contain version 1
machine B (Follower), contain version 1
machine C (Follower), contain version 1
And the write would go like this:
Machine A receive Write request and store pending write V2
Machine A send prepare request to Machine B and Machine C
Followers (Machine B and Machine C) send Acknowledge to leader (Machine A)
After Leader (machine A) receive Acknowledge from quorum of machine, it know V2 is now commited and send success response to client
Leader (machine a) send finalize request to Follower (machine A and Machine B) to inform them that V2 is commited and V1 could be discarded.
For this system to work, On leader change after acquiring leader Lease the leader machine have to get the latest data version by reading from a quorum of node before accepting Request.
The raft algorithm in ETCD and ZAB algorithm in Zookeeper are both using replication log to update a state machine.
I was wondering if it's possible to design a similar system by simply using leader election and versioned values.
Yes, it's possible to achieve consensus/linearizability without log replication. Originally the consensus problem was solved in the Paxos Made Simple paper by Leslie Lamport (1998). He described two algorithms: Single Decree Paxos to to build a distributed linearizable write-once register and Multi-Paxos to make a distributed state machine on top of append only log (an ordered array of write-once registers).
Append only logs is much more powerful abstraction than write-once registers therefore it isn't surprising that people chose logs over registers. Besides, until Vertical Paxos (2009) was published, log replication was the only consensus protocol capable of cluster membership change; what is vital for multiple tasks: if you can't replace failed nodes then eventually your cluster becomes unavailable.
Yet Vertical Paxos is a good paper, it was much easier for me to understand the Raft's idea of cluster membership via the joint consensus, so I wrote a post on how to adapt the Raft's way for Single Decree Paxos.
With time the "write-once" nature of the Single Decree Paxos was also resolved turning write-once registers into distributed linearizable variables, a quite powerful abstraction suitable for the many use cases. In the wild I saw that approach in the Treode database. If you got interested I blogged about this improved SDP in the How Paxos Works post.
So now when we have an alternative to logs it makes sense to consider it because log based replication is complex and has intrinsic limitations:
with logs you need to care about log compaction and garbage collection
size of the log is limited by the size of one node
protocols for splitting a log and migration to a new cluster are not well-known
And why those system decided to use a replication log.
The log-based approach is older that the alternative, so it has more time to gain popularity.
About your example
It's hard to evaluate it, because you didn't describe how the leader election happens and the conflicts between leaders are resolved, what is the strategy to handle failures and how to change membership of the cluster.
I believe if you describe them carefully you'll get a variant of Paxos.
Your example makes sense. However, have you considered every possible failure scenario? In step 2, Machine B could receive the message minutes before or after Machine C (or vice versa) due to network partitions or faulty routers. In step 3, the acknowledgements could be lost, delayed, or re-transmitted numerous times. The leader could also fail and come back up once, twice, or potentially several times all within the same consensus round. And in step 5, the messages could be lost, duplicated, or Machine A & C could receive the notification while B misses it....
Conceptual simplicity, also known as "reducing the potential points of failure", is key to distributed systems. Anything can happen, and will happen in realistic environments. Primitives, such as replicated logs based on consensus protocols proven to be correct in any environment, are a solid foundation upon which to build higher levels of abstraction. It's certainly true that better performance or latency or your "metric of interest" can be achieved by a custom-built algorithm but ensuring correctness for such an algorithm is a major time investment.
Replicated logs are simple, easily understood, predictable, and fall neatly into the domain of established consensus protocols (paxos, paxos-variants, & raft). That's why they're popular. It's not because they're the best for any particular application, rather they're understood and reliable.
For related references, you may be interested in Understanding Paxos and Consensus in the Cloud: Paxos Systems Demystified

Cassandra write performance regarding consistency level

Here is quote from cassandra documentation about writes (LINK)
If all replicas for the affected row key are down, it is still
possible for a write to succeed if using a write consistency level of
ANY. Under this scenario, the hint and written data are stored on the
coordinator node, but will not be available to reads until the hint
gets written to the actual replicas that own the row. The ANY
consistency level provides absolute write availability at the cost of
consistency, as there is no guarantee as to when written data will be
available to reads (depending how long the replicas are down). Using
the ANY consistency level can also potentially increase load on the
cluster, as coordinator nodes must temporarily store extra rows
whenever a replica is not available to accept a write.
My question is: is writing to cassandra slower if we use consistency level of ANY than writes when we use consistency level of ONE ?
Hints are generated when appropriate replica nodes are inaccessible at write time. Write requests are then serialized locally on the request coordinator node. Once a valid replica node becomes available and the coordinator node learns of it, the request is passed along to the newly available replica.
With that background, there are two write-time scenarios to consider:
1) At least one replica is up for the affected row. In this case, there is no difference between consistency levels of ANY and ONE. The write just goes to the replica(s), and hinted handoff is not triggered. No performance difference.
2) All replicas are down for the affected row. This is where hints enter the picture. With consistency ANY there is extra work to be done on the coordinator node at request time, as the hint is written to a local system table for later replay. With consistency ONE, you would simply get a refused write in the same circumstances. ONE will expose write failures to the client, and will be faster than ANY.
Essentially, the tradeoff is refusing requests vs. pushing work onto remaining nodes, but only when nodes responsible for storing that row are down.