Why do we need to use Zookeeper for a Coordination Service instead of just a central database? - apache-zookeeper

Quoting the ZooKeeper docs:
ZooKeeper is a distributed, open-source coordination service for
distributed applications. It exposes a simple set of primitives that
distributed applications can build upon to implement higher level
services for synchronization, configuration maintenance, and groups
and naming.
Guarantees
ZooKeeper is very fast and very simple. Since its goal, though, is to
be a basis for the construction of more complicated services, such as
synchronization, it provides a set of guarantees. These are:
Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Atomicity - Updates either succeed or fail. No partial results.
Single System Image - A client will see the same view of the service regardless of the server that it connects to.
Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
Timeliness - The clients view of the system is guaranteed to be up-to-date within a certain time bound.
But I don't see any new problem that ZooKeeper solves, apart from being highly fault tolerant compared to a central database. All the guarantees that ZooKeeper provides can be provided by a central database too.
Atomicity -> As it's a single node, all updates are atomic.
Sequential Consistency -> After an update, clients can wait for the ack before sending the next update, which maintains the sequence.
Single System Image, Reliability, Timeliness -> Guaranteed, as it's a single node.
So avoiding a single point of failure is the only main advantage of using ZooKeeper. Please correct me if I'm wrong.

ZooKeeper (and other consensus-based systems) offers sequential consistency, strong consistency and high availability.
"Apart from being highly fault tolerant": that is actually huge. The fault tolerance is the point.
If you don't care about availability, you can use any other linearizable storage; even a directory of files will work.
Consensus-based systems, and systems built on them (e.g. ZooKeeper plus your own code), are used to implement state machine replication. All transitions are stored in a distributed log, and to make it durable there are many copies. Consensus is about agreeing on the order of events in the log.
With the log available, the actual business code can consume events and advance its state machine through ordinary state machine transitions. Since each copy of the log has the same sequence of events, all state machines will reach the same state.
The key issue is timing: all logs will get the same events in the same order, but there is no guarantee when that happens. A node could be disconnected from the network, so its log will be stale, and by extension so will its state machine.
To see the true latest value, as you would expect with a single source of truth, you have to use a linearizable read. One way of doing this is to append the read operation to the log itself and wait for it to be committed. The read does nothing to the state machines, but the fact that the reader placed an entry in the log and got it committed signals that the entire log up to that entry has been read, so there is no stale data. (Not stale here means that all writes that happened before the read are reflected; new writes may still happen while the read is in progress.)
All of this complexity comes from the availability requirement: a cluster of three nodes can let one node go down without affecting operations.
So yes, ignoring availability, you could use any linearizable storage to do the same. You could do this by keeping the log of events in a table and having every client track a pointer (or id) of the last applied operation, so every client can advance its own state machine.
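As a rough illustration of that last idea, here is a minimal sketch that uses an in-memory SQLite table as the stand-in for the central store. The table layout, the `append` helper and the `Client` class are all made up for this example; they are not part of ZooKeeper or any real coordination library.

```python
import sqlite3

# Stand-in for the central, linearizable store: one table holding the ordered log.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE log (id INTEGER PRIMARY KEY AUTOINCREMENT, key TEXT, value TEXT)"
)

def append(key, value):
    """Append one event; the database assigns its position in the log."""
    with conn:
        conn.execute("INSERT INTO log (key, value) VALUES (?, ?)", (key, value))

class Client:
    """Hypothetical client with its own state machine and a pointer into the log."""

    def __init__(self):
        self.last_applied = 0  # id of the last log entry applied locally
        self.state = {}        # the state machine: a plain key-value map

    def catch_up(self):
        """Apply, in log order, every entry after the local pointer."""
        rows = conn.execute(
            "SELECT id, key, value FROM log WHERE id > ? ORDER BY id",
            (self.last_applied,),
        ).fetchall()
        for entry_id, key, value in rows:
            self.state[key] = value
            self.last_applied = entry_id

a, b = Client(), Client()
append("leader", "node-1")
a.catch_up()
append("leader", "node-2")
stale_view = dict(a.state)  # a has not consumed the second event yet
a.catch_up()
b.catch_up()
```

Both clients end up in the same state because they apply the same entries in the same order, and `stale_view` shows the timing gap described above: a client that has not caught up yet sees old data. What this sketch does not give you is availability, since the single database is still a single point of failure.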

Related

When does the Raft consensus algorithm apply the commit log to the leader and followers?

In Raft, I understand that a leader receives a request and federates it out to its peer list to commit to their respective logs.
My question is: is there a distinction between committing the action to the log and actually applying the action? If the answer is yes, then at what point does the action get applied?
My understanding is that once the leader receives "hey, I wrote this to my log" from the majority, the leader applies the change, then federates an "Apply" command to the peers that wrote the change to their respective logs, and then the ack is sent to the client.
I would say there is a distinction between committing an entry and applying it to the state machine. Once an entry is committed (i.e. the commitIndex is >= the entry index) it can be applied at any time. In practice, you want the leader to apply committed entries as soon as possible to reduce latency, so entries will usually be applied to an in-memory state machine immediately.
In the case of in-memory state machines the distinction is not very obvious. But it’s the other use cases for Raft that do necessitate this distinction. For example, the distinction becomes particularly important with persistent state machines. If the state machine is persisting changes to e.g. an underlying database, it’s critical that each entry only be applied to the state machine once so that the underlying store does not go back in time when the node replays entries to the state machine when recovering from a failure. To make persistent state machines idempotent, the state machine on each node needs to persist a record of which entries have been applied on that node as part of the persistent state. In this case, applying entries is indeed a distinct step from committing them, and a critical one.
State machine replication is also only one use case for Raft. There are others as well. It’s perfectly valid, for example, to use the protocol for simple log replication, in which case entries wouldn’t be applied at all.
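To make the commit/apply split concrete, here is a small sketch of a persistent state machine that records the index of the last applied entry next to its data. The class names and the `disk` dictionary (a stand-in for durable storage) are invented for this example; a real store would update the data and the index atomically, for instance in one transaction.

```python
class PersistentStateMachine:
    """Toy durable state machine; `disk` stands in for storage that survives restarts."""

    def __init__(self, disk):
        self.disk = disk
        self.disk.setdefault("last_applied", 0)
        self.disk.setdefault("data", {})

    def apply(self, index, command):
        # Skip entries already reflected in the durable state, so replay is harmless.
        if index <= self.disk["last_applied"]:
            return False
        key, value = command
        self.disk["data"][key] = value
        self.disk["last_applied"] = index
        return True

class Node:
    """Holds a log and a commit index; applying is a separate, later step."""

    def __init__(self, state_machine):
        self.log = []          # entry with Raft index i lives at self.log[i - 1]
        self.commit_index = 0  # highest index known to be committed
        self.state_machine = state_machine

    def apply_committed(self):
        """Offer every committed entry to the state machine; count the new ones."""
        applied = 0
        for index in range(1, self.commit_index + 1):
            if self.state_machine.apply(index, self.log[index - 1]):
                applied += 1
        return applied

entries = [("x", 1), ("x", 2), ("y", 3)]
disk = {}

node = Node(PersistentStateMachine(disk))
node.log = list(entries)
node.commit_index = 2               # entry 3 is in the log but not yet committed
first_run = node.apply_committed()

# Simulated crash and restart: the log is replayed from the start.
restarted = Node(PersistentStateMachine(disk))
restarted.log = list(entries)
restarted.commit_index = 3
after_restart = restarted.apply_committed()
```

On the first run only the two committed entries are applied. After the restart the replay offers all three entries again, but only entry 3 changes the store, so the data never goes back in time.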

Failover and strong consistency in Couchbase

We have a three-node Couchbase cluster with two replicas and durability level MAJORITY.
This means that the mutation will be replicated to the active node (node A) and to one of the two replicas (node B) before it is acknowledged as successful.
In terms of consistency, what will happen if node A becomes unavailable and the hard failover process promotes the node C replica before node A manages to replicate the mutation to node C?
According to the docs (Protection Guarantees and Automatic Failover), the write is durable, but will it be available immediately?
Answered by @ingenthr:
Assuming the order is that the client gets the acknowledgment of the
durability, then the hard failover is triggered of your node A, during
the failover the cluster manager and the underlying data service will
determine whether node B or C should be promoted to active for that
vbucket (a.k.a. partition) to satisfy all promised durability. That
was actually one of the trickier bits of implementation.
“Immediately” is pretty much correct. Technically it does take some
time to do the promotion of the vbucket, but this should be very short
as it’s just metadata checks and state changes and doesn’t involve any
data movement. Clients will need to be updated with the new topology
as well. How long is a function of the environment and what else is
going on, but I’d expect single-digit-seconds or even under a second.
Assuming you’re using a modern SDK API 3.x client with best-effort
retries, it will be mostly transparent to your application, but not
entirely transparent since you’re doing a hard failover.
Non-idempotent operations, for example, may bubble up as errors.

How good are ZooKeeper and Etcd?

Disclaimer: I'm quite new to the etcd and ZooKeeper projects.
I've recently become interested in distributed open source products.
I found that they seem to require configuration (coordination?) systems, such as ZooKeeper for Presto DB and Hive, and etcd for Kubernetes, and I think that understanding the role of etcd and ZooKeeper is the first step to understanding distributed systems.
But now I feel lost... I can't yet understand what the good and unique points of etcd and ZooKeeper are. To me they look like well-distributed key-value stores or file systems.
Here are the impressions I have of the products. I know these impressions don't reflect all the features of the products, but I don't know which remaining features I should know about.
ZooKeeper: According to the overview page of ZooKeeper, it guarantees the following things.
Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Atomicity - Updates either succeed or fail. No partial results.
Single System Image - A client will see the same view of the service regardless of the server that it connects to.
Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
Timeliness - The clients view of the system is guaranteed to be up-to-date within a certain time bound.
Sequential consistency and atomicity are the unique features, which are not supported by most file systems, but the others are common among file systems.
etcd: According to the etcd README, it focuses on being:
Simple: curl'able user-facing API (HTTP+JSON)
Secure: optional SSL client cert authentication
Fast: benchmarked 1000s of writes/s per instance
Reliable: properly distributed using Raft
Most of these seem to be shared with Amazon S3 (although S3 doesn't support such fast access).
I know these products are very good because most distributed open source products depend on them. But what is the key, unique feature that makes distributed open source products choose them?
I think you're confusing the file-system-like interface with an actual file system. The systems you are mentioning are well suited for cluster coordination, in particular ZooKeeper. What they are not designed for is storing large amounts of data like a file system would. You should think of them more as suited for coordinating a file system. That is, one could imagine a file system storing paths to files in a consistent store like ZooKeeper or etcd, but not the files themselves. That they expose a file system-like interface does not correlate to any ability to store files. Indeed, these systems are designed to store small amounts of data that can be held in memory. By using a consistent store like ZooKeeper for storing file information in a distributed file system, the file system would ensure that clients see changes in the file system in sequential order.
ZooKeeper is really a set of primitives with which distributed systems can be coordinated. Particularly relevant to coordinating distributed systems with ZooKeeper are its session events (watches) which allow clients to listen for changes to the cluster state. Distributed systems typically use watches in ZooKeeper for things like locks, and the strong consistency guarantees of ZooKeeper make it perfectly suitable for that use case.
If you want a good idea of what systems like ZooKeeper and etcd are used for, you should check out the Apache Curator recipes. Atomix also implements similar types of APIs for coordinating distributed systems on top of a consensus algorithm. All of these tools are demonstrative of typical use cases for consensus-based distributed systems.
What's important to note is that these types of systems are built on top of consensus algorithms and usually store state in memory. They're suitable for operations that involve a small amount of data but require a high level of consistency, and that's why they're frequently used for things like distributed locking, configuration management, and group membership.
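To give a feel for the locking use case, here is an in-memory simulation of the idea behind the usual lock recipe: every contender creates an ephemeral, sequentially numbered node, the lowest number holds the lock, and a node disappears when its session ends. `ToyCoordinator` and its methods are invented for this sketch and are not the ZooKeeper or etcd API; a real client would also watch its predecessor node instead of polling.

```python
import itertools

class ToyCoordinator:
    """In-memory stand-in for a consistent store with ephemeral sequential nodes."""

    def __init__(self):
        self._sequence = itertools.count()
        self.nodes = {}  # sequence number -> session that owns the node

    def create_ephemeral_sequential(self, session):
        """Register a lock request; the store hands out the next sequence number."""
        number = next(self._sequence)
        self.nodes[number] = session
        return number

    def close_session(self, session):
        """Ephemeral nodes vanish when the owning session ends (crash or release)."""
        self.nodes = {n: s for n, s in self.nodes.items() if s != session}

    def lock_holder(self):
        """The session owning the lowest-numbered node holds the lock."""
        if not self.nodes:
            return None
        return self.nodes[min(self.nodes)]

coordinator = ToyCoordinator()
coordinator.create_ephemeral_sequential("worker-a")
coordinator.create_ephemeral_sequential("worker-b")
holder_before = coordinator.lock_holder()

coordinator.close_session("worker-a")  # worker-a crashes or releases the lock
holder_after = coordinator.lock_holder()
```

The lock passes to the next contender as soon as the first session goes away. This only works as a lock because every client agrees on the order of the nodes, which is exactly what the consensus layer provides.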

Do NoSQL datacenter aware features enable fast reads and writes when nodes are distributed across high-latency connections?

We have a data system in which writes and reads can be made in a couple of geographic locations which have high network latency between them (crossing a few continents, but not this slow). We can live with 'last write wins' conflict resolution, especially since edits can't be meaningfully merged.
I'd ideally like to use a distributed system that allows fast, local reads and writes, and copes with the replication and write propagation over the slow connection in the background. Do the datacenter-aware features in e.g. Voldemort or Cassandra deliver this?
It's either this, or we roll our own, probably based on collecting writes using something like rsync and sorting out the conflict resolution ourselves.
You should be able to get the behavior you're looking for using Voldemort. (I can't speak to Cassandra, but imagine that it's similarly possible using it.)
The key settings in the configuration will be:
replication-factor — This is the total number of times the data is stored. Each put or delete operation must eventually hit this many nodes. A replication factor of n means it can be possible to tolerate up to n - 1 node failures without data loss.
required-reads — The least number of reads that can succeed without throwing an exception.
required-writes — The least number of writes that can succeed without the client getting back an exception.
So for your situation, the replication would be set to whatever number made sense for your redundancy requirements, while both required-reads and required-writes would be set to 1. Reads and writes would return quickly, with a concomitant risk of stale or lost data, and the data would only be replicated to the other nodes afterwards.
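The trade-off between those three settings can be sketched with the usual quorum rule of thumb for Dynamo-style stores: a read is only sure to overlap the latest acknowledged write when required-reads plus required-writes exceeds the replication factor. The function below is a hypothetical helper written for illustration, not part of Voldemort.

```python
def quorum_profile(replication_factor, required_reads, required_writes):
    """Summarise what a (replication-factor, required-reads, required-writes) choice implies."""
    n, r, w = replication_factor, required_reads, required_writes
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("required-reads and required-writes must be between 1 and n")
    return {
        # Read and write sets are guaranteed to share a node only when r + w > n.
        "reads_see_latest_write": r + w > n,
        # Number of node responses a client waits for before the call returns.
        "nodes_awaited_per_read": r,
        "nodes_awaited_per_write": w,
    }

fast_local = quorum_profile(3, 1, 1)  # the setup suggested above
overlapping = quorum_profile(3, 2, 2)
```

With both values set to 1 the client waits for a single node, which keeps latency low, but a read may land on a replica that has not received the latest write yet. That matches the risk of stale data mentioned above.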
I have no experience with Voldemort, so I can only comment on Cassandra.
You can deploy Cassandra to multiple datacenters with an inter-DC latency higher than a few milliseconds (see http://spyced.blogspot.com/2010/04/cassandra-fact-vs-fiction.html).
To ensure fast local reads, you can configure the cluster to replicate your data to a certain number of nodes in each datacenter (see "Network Topology Strategy"). For example, you can specify that there should always be two replicas in each datacenter. So even when you lose a node in a datacenter, you will still be able to read your data locally.
Write requests can be sent to any node in a Cassandra cluster. So for fast writes, your clients would always speak to a local node. The node receiving the request (the "coordinator") will replicate the data to other nodes (in other datacenters) in the background. If nodes are down, the write request will still succeed and the coordinator will replicate the data to the failed nodes at a later time ("hinted handoff").
Conflict resolution is based on a client-supplied timestamp.
If you need more than eventual consistency, Cassandra offers several consistency options (including datacenter-aware options).
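For a sense of what timestamp-based "last write wins" means in practice, here is a minimal sketch of merging two replicas. The data layout (key mapped to a timestamp and value pair) and the `merge_lww` function are assumptions made for this example and do not mirror Cassandra's internals, including how it breaks ties between equal timestamps.

```python
def merge_lww(local, remote):
    """Merge two replicas of key -> (timestamp, value); the higher timestamp wins.

    On equal timestamps this sketch keeps the local value.
    """
    merged = dict(local)
    for key, (timestamp, value) in remote.items():
        if key not in merged or timestamp > merged[key][0]:
            merged[key] = (timestamp, value)
    return merged

# Two datacenters accepted writes independently; timestamps come from the clients.
europe = {"title": (100, "Draft"), "owner": (90, "ann")}
asia = {"title": (105, "Final")}

merged = merge_lww(europe, asia)
```

The later write to `title` replaces the earlier one outright, so nothing is merged at the field level. Since the timestamps come from the clients, the result is only as good as the clocks that produced them.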

wait for transactional replication in ADO.NET or TSQL

My web app uses ADO.NET against SQL Server 2008. Database writes happen against a primary (publisher) database, but reads are load balanced across the primary and a secondary (subscriber) database. We use SQL Server's built-in transactional replication to keep the secondary up-to-date. Most of the time, the couple of seconds of latency is not a problem.
However, I do have a case where I'd like to block until the transaction is committed at the secondary site. Blocking for a few seconds is OK, but returning a stale page to the user is not. Is there any way in ADO.NET or TSQL to specify that I want to wait for the replication to complete? Or can I, from the publisher, check the replication status of the transaction without manually connecting to the secondary server?
[edit]
99.9% of the time, the data in the subscriber is "fresh enough". But there is one operation that invalidates it. I can't read from the publisher every time on the off chance that it's become invalid. If I can't solve this problem under transactional replication, can you suggest an alternate architecture?
There's no such solution for SQL Server, but here's how I've worked around it in other environments.
Use three separate connection strings in your application, and choose the right one based on the needs of your query:
Realtime - Points directly at the one master server. All writes go to this connection string, and only the most mission-critical reads go here.
Near-Realtime - Points at a load balanced pool of subscribers. No writes go here, only reads. Used for the vast majority of OLTP reads.
Delayed Reporting - In your environment right now, it's going to point to the same load-balanced pool of subscribers, but down the road you can use a technology like log shipping to have a pool of servers 8-24 hours behind. These scale out really well, but the data's far behind. It's great for reporting, search, long-term history, and other non-realtime needs.
If you design your app to use those 3 connection strings from the start, scaling is a lot easier, especially in the case you're experiencing.
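A rough sketch of that routing decision follows. The connection strings, server names and the `pick_connection` helper are placeholders invented for this example; the point is only that the caller states how fresh the data must be and the code picks the matching connection string.

```python
from enum import Enum

class Freshness(Enum):
    REALTIME = "realtime"
    NEAR_REALTIME = "near_realtime"
    DELAYED_REPORTING = "delayed_reporting"

# Placeholder connection strings; the server names are hypothetical.
CONNECTION_STRINGS = {
    Freshness.REALTIME: "Server=primary-db;Database=app",
    Freshness.NEAR_REALTIME: "Server=subscriber-pool;Database=app",
    # For now this points at the same subscriber pool, as described above.
    Freshness.DELAYED_REPORTING: "Server=subscriber-pool;Database=app",
}

def pick_connection(is_write, freshness=Freshness.NEAR_REALTIME):
    """Writes always go to the primary; reads go wherever their freshness allows."""
    if is_write:
        return CONNECTION_STRINGS[Freshness.REALTIME]
    return CONNECTION_STRINGS[freshness]

write_target = pick_connection(is_write=True)
default_read = pick_connection(is_write=False)
critical_read = pick_connection(is_write=False, freshness=Freshness.REALTIME)
```

In the situation from the question, the one read that must not be stale would be issued with `Freshness.REALTIME`, while every other read keeps the default and stays on the subscribers.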
You are describing a synchronous mirroring situation. Replication cannot, by definition, support your requirement. Replication must wait for a transaction to commit before reading it from the log and delivering it to the distributor and from there to the subscriber, which means replication by definition has a window of opportunity for data to be out of sync.
If you have a requirement for an operation to read the authoritative copy of the data, then you should make that decision in the client and ensure you read from the publisher in that case.
While you can, in theory, validate whether a certain transaction was distributed to the subscriber or not, you should not base your design on it. Transactional replication makes no latency guarantee, by design, so you cannot rely on a 'perfect day' operation mode.