How good are ZooKeeper and Etcd? - apache-zookeeper

Disclaimer: I'm quite new for the etcd project and ZooKeeper project.
I'm recently getting interested in the distributed open source products.
I found they seems to require configuration(coordination?) systems such as ZooKeeper for Presto DB, Hive and Etcd for kubernetes and I think that understanding the role of etcd and ZooKeeper is the first step to understand the distributed systems.
But now, I feel like getting lost... I could not yet understand what is the good and unique points of the etcd and ZooKeeper. They look for me a well-distributed key-value storage or file systems.
Here is the impression that I have for the products. I know the impressions don't reflect the feature of the products. but I don't know what is the remaining feature that I should know.
ZooKeeper: According to the overview page of ZooKeeper, it guarantees the following things.
Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Atomicity - Updates either succeed or fail. No partial results.
Single System Image - A client will see the same view of the service regardless of the server that it connects to.
Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
Timeliness - The clients view of the system is guaranteed to be up-to-date within a certain time bound.
The sequential consistency and atomicity are the unique features which is not supported by most file systems but others are common among other file systems.
Etcd: According to the README of etcd. it focuses on
Simple: curl'able user-facing API (HTTP+JSON)
Secure: optional SSL client cert authentication
Fast: benchmarked 1000s of writes/s per instance
Reliable: properly distributed using Raft
Most of them seems common with Amazon S3 (S3 doesn't support such a fast access.)
I know those products are very good ones because most of the distributed open source products depend on them. but what is the key, unique feature that the distributed open source product choose them?

I think you're confusing the file-system-like interface with an actual file system. The systems you are mentioning are well suited for cluster coordination, in particular ZooKeeper. What they are not designed for is storing large amounts of data like a file system would. You should think of them more as suited for coordinating a file system. That is, one could imagine a file system storing paths to files in a consistent store like ZooKeeper or etcd, but not the files themselves. That they expose a file system-like interface does not correlate to any ability to store files. Indeed, these systems are designed to store small amounts of data that can be held in memory. By using a consistent store like ZooKeeper for storing file information in a distributed file system, the file system would ensure that clients see changes in the file system in sequential order.
ZooKeeper is really a set of primitives with which distributed systems can be coordinated. Particularly relevant to coordinating distributed systems with ZooKeeper are its session events (watches) which allow clients to listen for changes to the cluster state. Distributed systems typically use watches in ZooKeeper for things like locks, and the strong consistency guarantees of ZooKeeper make it perfectly suitable for that use case.
If you want a good idea of what systems like ZooKeeper and etcd are used for, you should check out the Apache Curator recipes. Atomix also implements similar types of APIs for coordinating distributed systems on top of a consensus algorithm. All of these tools are demonstrative of typical use cases for consensus-based distributed systems.
What's important to note is that these types of systems are built on top of consensus algorithms and usually store state in memory. They're suitable for operations that involve a small amount of data but require a high level of consistency, and that's why they're frequently used for things like distributed locking, configuration management, and group membership.

Related

What is meant by Distributed System?

I am reading about distributed systems and getting confused with what is really means?
I understand on high level, it means that set of different machines that work together to achieve a single goal.
But this definition seems too broad and loose. I would like to give some points to explain the reasons for my confusion:
I see lot of people referring the micro-services as distributed system where the functionalities like Order, Payment etc are distributed in different services, where as some other refer to multiple instances of Order service which possibly trying to serve customers and possibly use some consensus algorithm to come to consensus on shared state (eg. current Inventory level).
When talking about distributed database, I see lot of people talk about different nodes which possibly use to store/serve a part of user request like records with primary key from 'A-C' in first node 'D-F' in second node etc. On high level it looks like sharding.
When talking about distributed rate limiting. Some refer to multiple application nodes (so called distributed application nodes) using a single rate limiter, some other mention that the rate limiter itself has multiple nodes with a shared cache (like redis).
It feels that people use distributed systems to mention about microservices architecture, horizontal scaling, partitioning (sharding) and anything in between.
I am reading about distributed systems and getting confused with what is really means?
As commented by #ReinhardMänner, the good general term definition of distributed system (DS) is at https://en.wikipedia.org/wiki/Distributed_computing
A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another from any system. The components interact with one another in order to achieve a common goal.
Anything that fits above definition can be referred as DS. All mentioned examples such as micro-services, distributed databases, etc. are specific applications of the concept or implementation details.
The statement "X being a distributed system" does not inherently imply any of such details and for each DS must be explicitly specified, eg. distributed database does not necessarily meaning usage of sharding.
I'll also draw from Wikipedia, but I think that the second part of the quote is more important:
A distributed system is a system whose components are located on
different networked computers, which communicate and coordinate their
actions by passing messages to one another from any system. The
components interact with one another in order to achieve a common
goal. Three significant challenges of distributed systems are:
maintaining concurrency of components, overcoming the lack of a global clock, and managing the independent failure of components. When
a component of one system fails, the entire system does not fail.
A system that constantly has to overcome these problems, even if all services are on the same node, or if they communicate via pipes/streams/files, is effectively a distributed system.
Now, trying to clear up your confusion:
Horizontal scaling was there with monoliths before microservices. Horizontal scaling is basically achieved by division of compute resources.
Division of compute requires dealing with synchronization, node failure, multiple clocks. But that is still cheaper than scaling vertically. That's where you might turn to consensus by implementing consensus in the application, or using a dedicated service e.g. Zookeeper, or abusing a DB table for that purpose.
Monoliths present 2 problems that microservices solve: address-space dependency (i.e. someone's component may crash the whole process and thus your component) and long startup times.
While microservices solve these problems, these problems aren't what makes them into a "distributed system". It doesn't matter if the different processes/nodes run the same software (monolith) or not (microservices), it matters that they are different processes that can't easily communicate directly (e.g. via function calls that promise not to fail).
In databases, scaling horizontally is also cheaper than scaling vertically, The two components of horizontal DB scaling are division of compute - effectively, a distributed system - and division of storage - sharding - as you mentioned, e.g. A-C, D-F etc..
Sharding of storage does not define distributed systems - a single compute node can handle multiple storage nodes. It's just that it's much more useful for a database that divides compute to also shard its storage, so you often see them together.
Distributed rate limiting falls under "maintaining concurrency of components". If every node does its own rate limiting, and they don't communicate, then the system-wide rate cannot be enforced. If they wait for each other to coordinate enforcement, they aren't concurrent.
Usually the solution is "approximate" rate limiting where components synchronize "occasionally".
If your components can't easily (= no latency) agree on a global rate limit, that's usually because they can't easily agree on a global anything. In that case, you're effectively dealing with a distributed system, even if all components just threads in the same process.
(that could happen e.g. if you plan to scale out but haven't done so yet, so you don't allow your threads to communicate directly.)

Why do we need to use Zookeeper for a Coordination Service instead of just a central database?

Quoting the zookeeper docs
ZooKeeper is a distributed, open-source coordination service for
distributed applications. It exposes a simple set of primitives that
distributed applications can build upon to implement higher level
services for synchronization, configuration maintenance, and groups
and naming.
Guarantees
ZooKeeper is very fast and very simple. Since its goal, though, is to
be a basis for the construction of more complicated services, such as
synchronization, it provides a set of guarantees. These are:
Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Atomicity - Updates either succeed or fail. No partial results.
Single System Image - A client will see the same view of the service regardless of the server that it connects to.
Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
Timeliness - The clients view of the system is guaranteed to be up-to-date within a certain time bound.
But I don't see any new problem that Zookeeper solves apart from being highly fault tolerant compared to a central database. All the guarantees that zookeeper assures can be guaranteed in a central database too.
Atomicity -> As it's a single node. all updates are atomic.
Sequential Consistency -> after an update clients can wait until the ack until they send the next update to maintain the sequence.
Single System Image, Reliability, Timeliness -> guaranteed as it's a single node.
So, Avoiding a single point of failure is the only main advantage of using zookeeper. Please correct me if I'm wrong.
Zookeeper (and other consensus based systems) offers sequential consistency, strong consistency and high availability.
"apart from being highly fault tolerant" that's actually huge - the fault tolerance.
If you don't care about availability, you totally can use any other linearizable storage - even a directory with files will work.
Consensus based system, and systems based on them (e.g. zoo + your own code) are used to implement machine state replication. All transitions are stored in a distributed log - to make it durable there are many copies. Consensus is about what is the order of event in the log.
With the log being available, the actual business code can consume events and change its state machine - typical state machine transitions. Since each copy of log has the same sequence of events, all states machines will get to the same state.
The key thing is about timing - all logs will get same events in the same order, but there is no guarantee when that happens - a node could be disconnected from the network, hence its log will be stale, and by extension the state machine as well.
To see the true latest value, as you would expect with a singe source of truth, you have to use linearizable read. One way of doing this is to append the read operation to the log itself and wait for it to be committed. Read do nothing with state machines, but the fact that a reader placed something to log and got it committed, that signals that the entire log is read - there is no stale data. (Stale it means that all writes happened before the read are reflected, while read is happening, new writes could happen).
All of this complexity comes form the availability requirements - a cluster with three nodes can let one node to go down, without affecting operations.
So, yes, you could use any linear storage to do the same, ignoring availability. You could do this by keeping the log of events in a table, and every client to track a pointer (or id) of last applied operation; so every client could go and move its own state machine.

Limitations of Kafka as a Distributed DB

I have an application which requires an interesting orchestration between states of instances distributed across geographic regions, in combination with the need for a scalable distributed database.
At the moment I think that Kafka with log compaction will fit my needs for state maintenance and message exchange between instances, and Cassandra will fit my needs for high volume distributed reads and writes of persisted data.
However, there is quite a lot of data duplicated that way: Many of the data exchanged via Kafka would also need to be stored to Cassandra for distributed data access. Using Kafka for both messaging and distributed data querying and persistence seems tempting.
Therefore, I'm interested to figure out the real-world pros and cons to be expected when using e.g. the pull queries feature of Kafka to use it as a distributed database [1].
Though, I'm a bit suspicious about what to expect of that in terms of performance and scalability, especially when compared to Cassandra, as well as unknown pitfalls.
What are the tradeoffs when using Kafka as a distributed DB, and what would it compare performance-wise to "native" distributed systems like Cassandra?
[1] https://www.confluent.io/de-de/blog/pull-queries-in-preview-confluent-cloud-ksqdb/
pure KV lookups
Then Kafka StateStores / Interactive Queries can work, but with the caveat that if you use containers and an orchestrator, you need to maintain the state of those stores somewhere on persistent volumes. Otherwise, when the containers move to a fresh host, the streams changelog topic needs to be read from the very beginning, giving you a "cold-start" problem, and you will be unable to query.
Using any database (with persistent storage) will not have this problem, and will always be able to query immediately.
I'm not sure I would suggest Cassandra for strictly KV data, though.

Are all distributed database designed to process data in parallel?

I am learning about the characteristics of distributed database and I came across this website that describes some of the advantages of distributed database:
https://www.atlantic.net/cloud-hosting/about-distributed-databases-and-distributed-data-systems/
According to that site, the advantages of distributed database are listed below:
Reliability – Building an infrastructure is similar to investing: diversify to reduce your chances of loss. Specifically, if a failure occurs in one area of the distribution, the entire database does not experience a setback.
Security – You can give permissions to single sections of the overall database, for better internal and external protection.
Cost-effective – Bandwidth prices go down because users are accessing remote data less frequently.
Local access – Similarly to #1 above, if there is a failure in the umbrella network, you can still get access to your portion of the database.
Growth – If you add a new location to your business, it’s simple to create an additional node within the database, making distribution highly scalable.
Speed & resource efficiency – Most requests and other interactivity with the database are performed at a local level, also decreasing remote traffic.
Responsibility & containment – Because any glitches or failures occur locally, the issue is contained and can potentially be handled by the IT staff designated to handle that piece of the company.
However, parallelism (I mean not concurrent write, but processing data in parallel in each node) is not on the list. This makes me wonder: are all distributed databases (i.e. Mongo DB, Cassandra, HBase) designed to process data in parallel? If this is false, which distributed databases support parallel processing and which ones don't?
To find out what I mean by Parallel Processing (not concurrent write), please see: https://softwareengineering.stackexchange.com/questions/190719/the-difference-between-concurrent-and-parallel-execution

I am looking for a key-value datastore which has following (preferable) properties

I am trying to build a distributed task queue, and I am wondering if there is any data store, which has some or all of the following properties. I am looking to have a completely decentralized, multinode/multi-master self replicating datastore cluster to avoid any single point of failure.
Essential
Supports Python pickled object as Value.
Persistent.
More, the better, In decreasing order of importance (I do not expect any datastore to meet all the criteria. :-))
Distributed.
Synchronous Replication across multiple nodes supported.
Runs/Can run on multiple nodes, in multi-master configuration.
Datastore cluster exposed as a single server.
Round-robin access to/selection of a node for read/write action.
Decent python client.
Support for Atomicity in get/put and replication.
Automatic failover
Decent documentation and/or Active/helpful community
Significantly mature
Decent read/write performance
Any suggestions would be much appreciated.
Cassandra (open-sourced by facebook) has pretty much all of these properties. There are several Python clients, including pycassa.
Edited to add:
Cassandra is fully distributed, multi-node P2P, with tunable consistency levels (i.e. your replication can be synchronous or asynchronous or a mixture of both). Clients can connect to any server. Failover is automatic, and new servers can be added on-the-fly for load balancing. Cassandra is in production use by companies such as Facebook. There is an O'Reilly book. Write performance is extremely high, read performance is also high.