How to use ZooKeeper to distribute work across a cluster of servers - apache-zookeeper

I'm studying up for system design interviews and have run into this pattern in several different problems. Imagine I have a large volume of work that needs to be repeatedly processed at some cadence. For example, I have a large number of alert configurations that need to be checked every 5 min to see if the alert threshold has been breached.
The general approach is to split the work across a cluster of servers for scalability and fault tolerance. Each server would work as follows:
start up
read assigned shard
while true:
    process the assigned shard
    sleep 5 min
Based on this answer (Zookeeper for assigning shard indexes), I came up with the following approach using ZooKeeper:
1. When a server starts up, it registers itself by creating a child node /service/{server-id} under the /service node and watches the children of /service. ZooKeeper assigns a unique sequence number to the server.
2. The server reads its unique sequence number i from ZooKeeper. It also reads the total number of children n under the /service node.
3. The server identifies its shard by dividing the total volume of work into n pieces and locating the ith piece.
4. While true:
   - If the watch triggers (because servers have been added to or removed from the cluster), the server recalculates its shard.
   - The server processes its shard.
   - Sleep 5 min.
Does this sound reasonable? Is this generally the way that it is done in real world systems? A few questions:
In step #2, when the server reads the number of children, does it need to wait a period of time to let things settle down? What if every server is joining at the same time?
I'm not sure how timely the watch would be. Seems like there would be a time period where the server is still processing its shard and reassignment of shards might cause another server to pick up a shard that overlaps with what this server is processing, causing duplicate processing (which may or may not be ok). Is there any way to solve this?
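The shard calculation described above can be sketched as a pure function, kept separate from any ZooKeeper client code. This is only a minimal sketch under stated assumptions: the function name `shard_for` and the node names such as `server-0000000007` are hypothetical, and the work is assumed to be an in-memory list. One detail matters here: ZooKeeper sequence numbers keep increasing and are not reused, so after servers leave they no longer form a contiguous 0..n-1 range. The sketch therefore uses the server's rank among the sorted children as i rather than the raw sequence number.

```python
def shard_for(my_node: str, children: list[str], work_items: list[str]) -> list[str]:
    """Return the slice of work owned by my_node.

    children is the current list of child node names under /service.
    Sequential node names are zero-padded, so a plain sort orders them
    by sequence number as long as they share the same prefix.
    """
    members = sorted(children)
    i = members.index(my_node)  # rank of this server, 0..n-1
    n = len(members)
    # Item k belongs to the server whose rank equals k mod n.
    return [item for k, item in enumerate(work_items) if k % n == i]


# Hypothetical cluster state: three live servers with non-contiguous sequence numbers.
children = ["server-0000000007", "server-0000000003", "server-0000000012"]
alerts = [f"alert-{k}" for k in range(10)]
print(shard_for("server-0000000007", children, alerts))
```

Every server that sees the same list of children computes disjoint shards that together cover all of the work. While a membership change is still propagating, two servers can briefly hold different views, which is the overlap window the second question asks about.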


How to reliably shard data across multiple servers

I am currently reading up on some distributed systems design patterns. One of the design patterns for dealing with a lot of data (billions of entries or multiple petabytes) is to spread it out across multiple servers or storage units.
One of the solutions for this is to use a Consistent hash. This should result in an even spread across all servers in the hash.
The concept is rather simple: we can just add new servers and only the servers in the affected range are impacted, and if you lose servers, the remaining servers in the consistent hash take over. This works when all servers in the hash have the same data (in memory, on disk or in a database).
My question is how we handle adding and removing servers from a consistent hash when there is so much data that it can't be stored on a single host. How do the servers figure out what data to store and what not to?
Let's say that we have 2 machines running, "0" and "1". They are starting to reach 60% of their maximum capacity, so we decide to add an additional machine "2". Now a large part of the data on machine 0 has to be migrated to machine 2.
How would we automate this so that it happens reliably and without downtime?
My own suggested approach would be that the service hosting the consistent hash and the machines would have to be aware of how to transfer data between each other. When a new machine is added, the consistent hash service calculates the affected hash ranges. It then informs the affected machines of the affected hash range and that they need to transfer the affected data to machine 2. Once the affected machines are done transferring their data, they ACK back to the consistent hash service. Once all affected machines are done transferring data, the consistent hash service starts sending data to machine 2 and informs the affected machines that they can now remove the transferred data. If we have petabytes on each server, this process can take a long time. We therefore need to keep track of which entries were changed during the transfer so we can sync them afterwards, or we can submit the writes/updates to both machine 0 and machine 2 during the transfer.
My approach would work, but I feel it is a little risky with all the back and forth, so I would like to hear if there is a better way.
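To make the scenario concrete, here is a minimal sketch of a consistent hash ring with machines "0" and "1", followed by the addition of machine "2". The class `HashRing`, the helper `token_for`, the use of MD5, the 64 virtual nodes per machine and the `key-…` names are all assumptions made for illustration, not the design of any particular product. The sketch only shows which keys change owner; it does not transfer any data.

```python
import bisect
import hashlib


def token_for(value: str) -> int:
    """Map a string to a position on the ring (MD5 is used for spreading, not security)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (token, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each machine owns several points on the ring to even out the spread.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (token_for(f"{node}#{i}"), node))

    def owner(self, key: str) -> str:
        # The owner is the first ring point at or after the key's token, wrapping around.
        idx = bisect.bisect_left(self._ring, (token_for(key),))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["0", "1"])
keys = [f"key-{i}" for i in range(1000)]
before = {k: ring.owner(k) for k in keys}

ring.add_node("2")
after = {k: ring.owner(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys changed owner")
print("all moved keys now belong to:", {after[k] for k in moved})
```

The property worth noticing is that every key that moves ends up on the new machine; nothing is shuffled between "0" and "1". The set of moved keys is exactly the data that would have to be streamed to machine 2 in the migration described above.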
How would we automate so this will happen without downtime and while being reliable?
It depends on the technology used to store your data. In Cassandra, for example, there is no "central" entity that governs the process; it is done like almost everything else, by having nodes gossip with each other. There is no downtime when a new node joins the cluster (although performance might be slightly impacted).
The process is as follows:
The new node joining the cluster is defined as an empty node without system tables or data.
When a new node joins the cluster using the auto bootstrap feature, it will perform the following operations:
- Contact the seed nodes to learn about gossip state.
- Transition to Up and Joining state (to indicate it is joining the cluster; represented by UJ in the nodetool status).
- Contact the seed nodes to ensure schema agreement.
- Calculate the tokens that it will become responsible for.
- Stream replica data associated with the tokens it is responsible for from the former owners.
- Transition to Up and Normal state once streaming is complete (to indicate it is now part of the cluster; represented by UN in the nodetool status).
Taken from
So when the joining node is in the Joining State, it is receiving data from other nodes but not ready for reads until the process is complete (Up status).
DataStax also has some material on this

Why does a mongodb replica set need an odd number of voting members?

I find the replica set requirement a bit confusing, and I'm probably missing something obvious (like under which conditions there are elections).
I understand that in normal operation you need a quorum: a vote takes place, and to get a majority you need an odd number of machines.
But since we use a replica set for failover, if the master dies we are left with an even number of voting members, which in my limited experience lengthens the time to elect a primary.
Also, according to the documentation, the addition of a voting member doesn't start an election, so it would seem that starting (booting) your replica set with an even number of nodes would make more sense?
So if we start with, say, 4 machines in the replica set and one machine dies, there is a re-election with 3 machines and a fast quorum. We add a machine back to return to our normal operating state, with no re-election.
Can someone shed some light on this?
TL;DR: With single master systems, even partitions make it impossible to determine which remainder still has a majority, taking both systems down.
Let N be a cluster of four machines:
One machine dies, the others resume operation. Good.
Two machines die, we're offline because we no longer get a majority. Bad.
Let M be a cluster of three machines:
One machine dies, the others resume operation. Good.
Two machines die, we're offline because we no longer get a majority. Bad.
=> Same result at 3/4 of the cost.
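The arithmetic behind this comparison can be written down in a few lines. This is just a sketch of the majority rule as described above; the function names `majority` and `tolerated_failures` are made up for illustration.

```python
def majority(voting_members: int) -> int:
    """Smallest number of votes that is strictly more than half."""
    return voting_members // 2 + 1


def tolerated_failures(voting_members: int) -> int:
    """How many members can fail while the rest still form a majority."""
    return voting_members - majority(voting_members)


for n in (3, 4, 5):
    print(f"{n} members: majority={majority(n)}, tolerated failures={tolerated_failures(n)}")
```

Going from 3 to 4 members raises the majority from 2 to 3 without raising the number of failures that can be tolerated, which is why the fourth machine buys nothing here.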
Now, let's add an assumption or two:
We're also going to operate some kind of server application that uses the database
The network can be partitioned
Let's say you have two datacenters, one with two database instances and the backend server machines. If the connection to the backup center (which has one MongoDB instance) fails, you're still online.
Now if you added a second MongoDB instance at the backup data center, a network partition would, despite seemingly higher redundancy, yield lower availability since we'd lose the majority in case of a network partition and can't continue to operate.
=> Less availability at higher cost. But that doesn't answer the question yet.
Let's say you're really worried about availability: You have two data centers, with backend servers in both datacenters, anycast IPs, the whole deal. Now the network between the two DCs is partitioned, but some clients connect to DC A while others reach DC B. How do you now determine which datacenter may accept writes? It's not possible, and this is why the odd number is necessary.
You don't actually need anycast IPs, BGP or any fancy stuff for the problem to become real. Any writing application (like a worker, a stale request, anything) would require merging the different writes later, which is a completely different concurrency scheme.
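The partition argument can be sketched the same way: given how many voting members end up on each side of a split, at most one side can hold a strict majority of the whole set. The function name `side_with_majority` is hypothetical, and the sketch assumes every member has exactly one vote.

```python
from typing import Optional


def side_with_majority(partition_sizes: list[int]) -> Optional[int]:
    """Return the index of the partition that may keep accepting writes, or None."""
    total = sum(partition_sizes)
    for index, size in enumerate(partition_sizes):
        if 2 * size > total:  # strictly more than half of all voting members
            return index
    return None


print(side_with_majority([2, 1]))  # 3 members split 2/1
print(side_with_majority([2, 2]))  # 4 members split 2/2
```

With three members split 2/1, the first side keeps a majority and stays writable. With four members split 2/2, neither side does, so the function returns None and no datacenter may accept writes.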

Do NoSQL datacenter aware features enable fast reads and writes when nodes are distributed across high-latency connections?

We have a data system in which writes and reads can be made in a couple of geographic locations which have high network latency between them (crossing a few continents, but not this slow). We can live with 'last write wins' conflict resolution, especially since edits can't be meaningfully merged.
I'd ideally like to use a distributed system that allows fast, local reads and writes, and copes with the replication and write propagation over the slow connection in the background. Do the datacenter-aware features in e.g. Voldemort or Cassandra deliver this?
It's either this, or we roll our own, probably based on collecting writes using something like rsync and sorting out the conflict resolution ourselves.
You should be able to get the behavior you're looking for using Voldemort. (I can't speak to Cassandra, but imagine that it's similarly possible using it.)
The key settings in the configuration will be:
replication-factor — This is the total number of times the data is stored. Each put or delete operation must eventually hit this many nodes. A replication factor of n means it can be possible to tolerate up to n - 1 node failures without data loss.
required-reads — The least number of reads that can succeed without throwing an exception.
required-writes — The least number of writes that can succeed without the client getting back an exception.
So for your situation, the replication would be set to whatever number made sense for your redundancy requirements, while both required-reads and required-writes would be set to 1. Reads and writes would return quickly, with a concomitant risk of stale or lost data, and the data would only be replicated to the other nodes afterwards.
I have no experience with Voldemort, so I can only comment on Cassandra.
You can deploy Cassandra to multiple datacenters with an inter-DC latency higher than a few milliseconds (see
To ensure fast local reads, you can configure the cluster to replicate your data to a certain number of nodes in each datacenter (see "Network Topology Strategy"). For example, you can specify that there should always be two replicas in each data center. So even when you lose a node in a data center, you will still be able to read your data locally.
Write requests can be sent to any node in a Cassandra cluster. So for fast writes, your clients would always speak to a local node. The node receiving the request (the "coordinator") will replicate the data to other nodes (in other datacenters) in the background. If nodes are down, the write request will still succeed and the coordinator will replicate the data to the failed nodes at a later time ("hinted handoff").
Conflict resolution is based on a client-supplied timestamp.
If you need more than eventual consistency, Cassandra offers several consistency options (including datacenter-aware options).

How to implement client authentication solution with NoSQL (Cassandra)?

I am currently thinking about how to implement authentication for a web application with a NoSQL solution. The problem I encounter is that most NoSQL solutions (e.g. Cassandra, MongoDB) probably have delayed writes. For example, we write on node A, but it is not guaranteed that the write appears on node B at the same time. This follows logically from the approaches behind the NoSQL solutions.
Now one idea would be that you do no secondary reads (so everything goes over a master). This would probably work in MongoDB (where you actually have a master) but not in Cassandra (where all nodes are equal). But our application runs at several independent points all over the world, so we need multi master capability.
At the moment I am not aware of a solution with Cassandra where I could update data and be sure that subsequent reads (to all of the nodes) do have the change. So how could one build an authentication on top of those NoSQL solutions where the authentication request (read) could appear on several nodes in parallel?
Thanks for your help!
With respect to Apache Cassandra:
The ConsistencyLevel is an enum that controls both read and write behavior based on the ReplicationFactor in your schema definition. The different consistency levels have different meanings, depending on whether you're doing a write or a read operation. Note that if W + R > ReplicationFactor, where W is the number of nodes to block for on write, and R the number to block for on reads, you will have strongly consistent behavior; that is, readers will always see the most recent write. Of these, the most interesting is to do QUORUM reads and writes, which gives you consistency while still allowing availability in the face of node failures up to half of ReplicationFactor. Of course, if latency is more important than consistency, then you can use lower values for either or both.
This is managed on the application side. To your question specifically, it comes down to how you design your Cassandra implementation, replication factor across the Cassandra nodes and how your application behaves on read/writes.
For writes:
ANY: Ensure that the write has been written to at least 1 node, including HintedHandoff recipients.
ONE: Ensure that the write has been written to at least 1 replica's commit log and memory table before responding to the client.
QUORUM: Ensure that the write has been written to N / 2 + 1 replicas before responding to the client.
LOCAL_QUORUM: Ensure that the write has been written to ReplicationFactor / 2 + 1 nodes within the local datacenter (requires NetworkTopologyStrategy).
EACH_QUORUM: Ensure that the write has been written to ReplicationFactor / 2 + 1 nodes in each datacenter (requires NetworkTopologyStrategy).
ALL: Ensure that the write is written to all N replicas before responding to the client. Any unresponsive replicas will fail the operation.
For reads:
ANY: Not supported. You probably want ONE instead.
ONE: Will return the record returned by the first replica to respond. A consistency check is always done in a background thread to fix any consistency issues when ConsistencyLevel.ONE is used. This means subsequent calls will have correct data even if the initial read gets an older value. (This is called ReadRepair)
QUORUM: Will query all replicas and return the record with the most recent timestamp once it has at least a majority of replicas (N / 2 + 1) reported. Again, the remaining replicas will be checked in the background.
LOCAL_QUORUM: Returns the record with the most recent timestamp once a majority of replicas within the local datacenter have replied.
EACH_QUORUM: Returns the record with the most recent timestamp once a majority of replicas within each datacenter have replied.
ALL: Will query all replicas and return the record with the most recent timestamp once all replicas have replied. Any unresponsive replicas will fail the operation.
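The W + R > ReplicationFactor rule quoted above is easy to check by hand, but a small sketch makes the trade-off explicit. The function names below are hypothetical and this is plain arithmetic, not a call into any Cassandra driver.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that make up a quorum: ReplicationFactor / 2 + 1 (integer division)."""
    return replication_factor // 2 + 1


def reads_see_latest_write(w: int, r: int, replication_factor: int) -> bool:
    """True when every read set must overlap every write set in at least one replica."""
    return w + r > replication_factor


rf = 3
print("QUORUM writes + QUORUM reads:", reads_see_latest_write(quorum(rf), quorum(rf), rf))
print("ONE write + ONE read:", reads_see_latest_write(1, 1, rf))
print("ALL writes + ONE read:", reads_see_latest_write(rf, 1, rf))
```

For an authentication use case, this suggests that writing and reading credentials at QUORUM (or writing at ALL and reading at ONE) gives read-your-writes behavior, at the cost of higher latency than ONE/ONE.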

mongoDB replication+sharding on 2 servers reasonable?

Consider the following setup:
There are 2 physical servers which are set up as a regular mongodb replication set (including an arbiter process, so automatic failover will work correctly).
Now, as far as I understand, most actual work will be done on the primary server, while the slave will mostly just do work to keep its dataset in sync.
Would it be reasonable to introduce sharding into this setup by setting up another replication set on the same 2 servers, so that each of them has one mongod process running as primary and one process running as secondary?
The expected result would be that both servers share the workload of actual queries/inserts while both are up. If one server fails, the whole setup should gracefully fail over and continue running until the other server is restored.
Are there any downsides to this setup, except the overall overhead in setup and number of processes (mongos/configservers/arbiters)?
That would definitely work. I'd asked a question in the #mongodb IRC channel a bit ago as to whether or not it was a bad idea to run multiple mongod processes on a single machine. The answer was "as long as you have the RAM/CPU/bandwidth, go nuts".
It's worth noting that if you're looking for high-performance reads, and don't mind writes being a bit slower, you could:
Do your writes in "safe mode", where the write doesn't return until it's been propagated to N servers (in this case, where N is the number of servers in the replica set, so all of them)
Set the driver-appropriate flag in your connection code to allow reading from slaves.
This would get you a clustered setup similar to MySQL - write once on the master, but any of the slaves is eligible for a read. In a circumstance where you have many more reads than writes (say, an order of magnitude), this may be higher performance, but I don't know how it'd behave when a node goes down (since writes may stall trying to write to 3 nodes, but only 2 are up, etc - that would need testing).
One thing to note is that while both machines are up, your queries are being split between them. When one goes down, all queries will go to the remaining machine thus doubling the demands placed on it. You'd have to make sure your machines could withstand a sudden doubling of queries.
In that situation, I'd reconsider sharding in the first place, and just make it an un-sharded replica set of 2 machines (+1 arbiter).
You are missing one crucial detail: if you have a sharded setup with two physical nodes only, if one dies, all your data is gone. This is because you don't have any redundancy below the sharding layer (the recommended way is that each shard is composed of a replica set).
What you said about the replica set however is true: you can run it on two shared-nothing nodes and have an additional arbiter. However, the recommended setup would be 3 nodes: one primary and two secondaries.