Apache Zookeeper: distribution of nodes across data centers

I am working on a brand new SolrCloud - ZooKeeper infrastructure.
Some background information:
- all other services (mostly web site infrastructure) are distributed across two data centers, with active-active configurations.
- at the network level, the servers are set up on extended LANs, with dark fibre across the data centers, so latency is at a minimum.
- the SolrCloud - ZooKeeper infrastructure will be used by most of these applications.
I got a SolrCloud, and a ZooKeeper ensemble running. Implementation at this level is fine.
But I wonder how to distribute my ZooKeeper servers. The ensemble must have an odd number of servers, but I only have two data centers, so one of them necessarily hosts the majority of the nodes; if that one fails, the ensemble loses quorum. In other words, on any single data center failure I have a 50-50 chance of losing the majority.
What should I do? So far I have thought of:
- requesting a third data center (not likely to happen, $$$!)
- hosting two per data center and two on an external cloud provider (Amazon or ...?). Again $$$
- setting up an odd number at data center 1 and using an observer on site 2. But what then happens if site 1 fails? Can SolrCloud work with only one observer?
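For reference, a minimal sketch of what I understand the observer option's configuration would look like (hostnames are placeholders; three voters in data center 1 plus one observer at site 2):

    # zoo.cfg, shared by every server
    server.1=dc1-zk1:2888:3888
    server.2=dc1-zk2:2888:3888
    server.3=dc1-zk3:2888:3888
    server.4=site2-zk1:2888:3888:observer

    # additionally, only in the observer's own zoo.cfg
    peerType=observer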

If your requirement is to serve all search requests from a local data center (the one where the request originated), then you don't need a cross-data-center ZooKeeper deployment.
A cross-data-center ZooKeeper deployment is only needed to survive the loss of an entire DC (which is unlikely to happen, and surviving it is exactly what you pay $$$$ for). If you don't need to survive that, there is no reason to spread a ZooKeeper ensemble across multiple data centers.

I got a third site to host the third ZooKeeper instance. This site is another office of my company, not a "full data center". So each site has one ZooKeeper instance.
What allowed me to have one cluster spread over three data centers was that they are close enough together to get a dark fiber between them. The latency is very low and does not impact ZooKeeper performance.
Then for Solr, I got full replicas on the two main data centers. The third office only hosts a ZooKeeper for quorum. Using full replicas, I have all the data in each data center. If my Solr needs to increase later, I will shard, but for now our index is small.
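For illustration, creating such a collection might look roughly like this (collection name and node names are placeholders; createNodeSet pins one replica of the single shard onto a Solr node in each data center):

    curl 'http://dc1-solr1:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=1&replicationFactor=2&createNodeSet=dc1-solr1:8983_solr,dc2-solr1:8983_solr'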
It has proven solid for four years now, with one failure, and that one was at the third office, not in a data center.

Related

ActiveMQ Artemis cluster failover questions

I have a question in regards to Apache Artemis clustering with message grouping. This is also done in Kubernetes.
The current setup I have is 4 master nodes and 1 slave node. Node 0 is dedicated as LOCAL to handle message grouping and node 1 is the dedicated backup to node 0. Nodes 2-4 are REMOTE master nodes without backup nodes.
I've noticed that clients connected to nodes 2-4 are not failing over to the other available master nodes when the Artemis node they are connected to goes down; essentially they never discover the other nodes. Even after the original node comes back up, the client continues to fail to establish a connection. I've seen from a separate Stack Overflow post that master-to-master failover is not supported. Does this mean that for every master node I need to create a slave node as well to handle failover? And wouldn't that make each failure domain a pair of two instances instead of however many nodes are in the cluster?
On a separate basic test using a cluster of two nodes with one master and one slave, I've observed that when I bring down the master node clients are connected to, the clients don't fail over to the slave node. Any ideas why?
As you note in your question, failover is only supported between a live and a backup. Therefore, if you wanted failover for clients which were connected to nodes 2-4 then those nodes would need backups. This is described in more detail in the ActiveMQ Artemis documentation.
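As a rough sketch of what pairing a live broker with a backup via replication might look like (not your exact topology; the values are illustrative, and the elements go inside <core> of each broker's broker.xml):

    <!-- live broker's broker.xml -->
    <ha-policy>
      <replication>
        <master>
          <check-for-live-server>true</check-for-live-server>
        </master>
      </replication>
    </ha-policy>

    <!-- backup broker's broker.xml -->
    <ha-policy>
      <replication>
        <slave>
          <allow-failback>true</allow-failback>
        </slave>
      </replication>
    </ha-policy>

Clients should likewise use an HA-aware connection URL (host is a placeholder), e.g. tcp://node0:61616?ha=true&reconnectAttempts=-1, so they receive the cluster topology and keep retrying during failover.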
It's worth noting that clustering and message grouping, while technically possible, is a somewhat odd pairing. Clustering is a way to improve overall message throughput using horizontal scaling. However, message grouping naturally serializes message consumption for each group (to maintain message order) which then decreases overall message throughput (perhaps severely depending on the use-case). A single ActiveMQ Artemis node can potentially handle millions of messages per second. It may be that you don't need the increased message throughput of a cluster since you're grouping messages.
I've often seen users simply assume they need a cluster to deal with their expected load without actually conducting any performance benchmarking. This can potentially lead to higher costs for development, testing, administration, and (especially) hardware, and in some use-cases it can actually yield worse performance. Please ensure you've thoroughly benchmarked your application and broker architecture to confirm the proposed design.

Is there any service registry without master-cluster/server?

I recently started to learn more about service registries and their usage in distributed architecture.
All the applications providing service registries that I found (etcd, Consul, or Zookeeper) are based on the same model: a master-server/cluster with leader election.
Correct me if I'm wrong, but... doesn't this make the architecture less reliable, in the sense that the master cluster is a point of failure? To circumvent this we could always build a bigger cluster, but that is more costly and/or less performant.
My questions here are:
- since all these service registries elect a leader, wouldn't it be possible to do the same without specifying the machines that form the master cluster, instead letting them discover each other through broadcasting and elect a leader or a leading group?
- does a service registry without a master server/cluster exist?
- and if not, what are the current limitations that prevent us from building one?
All of those services are based on one whitepaper: Google Chubby (https://ai.google/research/pubs/pub27897). The idea is to have fast and consistent configuration storage for distributed systems.
To get there you need to eliminate the single point of failure. How can you do that? You introduce multiple machines storing the same data and replicate the data between them. But in that case, considering the unreliable network between those machines, how do you make sure the data is consistent among nodes? You choose one node from the cluster to be the Leader (using a distributed leader-election algorithm); if nodes have inconsistent values between them, it's the leader's job to pick the "correct" one.
It looks like we've returned to a "single point of failure" situation, but in reality, if the leader fails, the rest of the cluster simply votes and promotes a new leader. So the Leader's role in those systems is NOT to be a single point of truth, but rather a single point of decision making.
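To make the "single point of decision making" concrete, here is a minimal sketch of leader election on top of ZooKeeper using Apache Curator's LeaderLatch recipe (the connection string and latch path are placeholders):

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.leader.LeaderLatch;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class LeaderSketch {
        public static void main(String[] args) throws Exception {
            // connect to the ZK ensemble (placeholder hosts)
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk1:2181,zk2:2181,zk3:2181",
                    new ExponentialBackoffRetry(1000, 3));
            client.start();

            // every candidate creates a latch on the same path;
            // ZooKeeper atomically elects exactly one of them
            LeaderLatch latch = new LeaderLatch(client, "/registry/leader");
            latch.start();
            latch.await(); // blocks until this process becomes the leader

            // ... this node now acts as the single point of decision making ...

            latch.close();  // give up leadership; another candidate takes over
            client.close();
        }
    }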

Can a Kafka-Cluster be cut in half?

Scenario: you have a Kafka cluster spanning two different DCs, but configured as one cluster. So there is no mirroring through MirrorMaker or anything like that. The DCs are not very far from each other, but they are physically separated.
Now what do you have to do to ensure that the cluster is failsafe on BOTH SIDES if the connection between the two DCs goes down? In other words, producers and consumers on BOTH sides should still work.
I would guess: you need multiple ZooKeeper nodes and multiple Kafka nodes on both sides.
But is that enough? Does the cluster heal itself after getting reconnected?
Thanks in advance.
I'm assuming your data centers that "are not very far from each other" are basically Availability Zones (AZs).
It's pretty common to spread a cluster over multiple AZs. However, it's usually neither desired nor possible for each "slice" to live on its own.
The immediate issue is ZooKeeper, which by design prevents split-brain scenarios. If a ZK ensemble is split, at best only one "slice" keeps a quorum and carries on working; the brokers on the side without a working ZK quorum will be non-functional.
Then let's say it were possible to have both sides keep working. What happens when you join both sides together again?
Data is likely to have diverged as clients wrote data to each side separately. Now you could have the same partition with different messages for the same offset and no way to resolve the conflict as both options are "valid".
I hope this shows why this is not a possible solution. In practice, if an AZ goes offline, it is non functional until it is brought back online.
Clients that were connected to the offline AZ should reconnect to the other AZ (using multiple bootstrap servers) and clients that were in the failed AZ should be reprovisioned into another one.
If configured correctly, Kafka can survive an AZ outage (even though in practice, it's best to have 3 AZs) and keep all resources available. Also in this scenario, the cluster will automatically return to a good state when the failed AZ returns.
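To sketch what "configured correctly" might mean here (values are illustrative and assume 3 AZs, as recommended above): rack awareness spreads replicas across AZs, and replication settings let writes survive the loss of one AZ:

    # broker config: each broker advertises its own AZ,
    # so replicas of a partition get spread across AZs
    broker.rack=az1

    # broker defaults: one replica per AZ; writes need 2 in-sync replicas,
    # so losing one AZ doesn't block producers; never elect a stale leader
    default.replication.factor=3
    min.insync.replicas=2
    unclean.leader.election.enable=false

    # producer config: bootstrap from several AZs and wait for the ISR
    bootstrap.servers=broker-az1:9092,broker-az2:9092,broker-az3:9092
    acks=all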

How to deploy zookeeper across multiple data centers and failover?

I would like to know about the existing approaches for running ZooKeeper across data centers.
One approach that I found after doing some research is to use observers: have a single ensemble with the leader and followers in the main data center, and observers in the backup data center. When the main data center crashes, we select the other data center as the new main one and manually convert its observers to leader/followers.
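To make the manual conversion concrete, my understanding is that it boils down to a config edit plus a restart on the surviving nodes (hostnames are placeholders):

    # before: the backup DC's nodes are observers
    server.4=dc2-zk1:2888:3888:observer
    peerType=observer    # present only in the observers' own zoo.cfg

    # after losing the main DC: drop the :observer suffix and the
    # peerType line on the surviving nodes, then restart them
    server.4=dc2-zk1:2888:3888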
I would like to know about better approaches to achieve the same.
Thanks
First I would like to point out the cons of your solution, which hopefully my solution addresses:
a) in case of a main data center failure the recovery process is manual (I quote you: "convert observers to leader/follower manually")
b) only the main data center accepts writes -> in case of failure, all data (when observers don't write logs) or at least the last updates (when observers do write logs) are lost
Because the question is about data centerS, I'll assume we have enough DCs to reach our objective: solving a. and b. while keeping a usable multi-data-center distributed ZK.
So, when you have an even number of data centers (DCs), you can use an additional DC just to get an odd number of ZK nodes in the ensemble. With e.g. 2 DCs, a 3rd one can be added; each DC can then contain 1 rwZK (read-write ZK node) or, for better tolerance against failures, each DC can contain 3 rwZKs organized as hierarchical quorums (both cases can also benefit from ZK observers). Inside a DC, all ZK clients should point only to that DC's ZK group, so the only traffic that remains between DCs is for leader elections and writes. With this kind of setup one solves both a. and b., but loses write/recovery performance, because writes/elections must be agreed between data centers: at least 2 DCs must agree on writes/elections, with a 2-node agreement per DC (see hierarchical quorums, and the configuration sketch after the list below). The intra-DC agreement should be fast enough, hence it won't matter much to the overall write agreement; bottom line, approximately only the delay between DCs matters. The disadvantages of this approach are:
- additional cost for the 3rd data center: this could be mitigated by using a company office (as one of the answers above did) as the 3rd data center
- lost sessions because of inter-DC network latency and/or throughput: with high enough timeouts one can reach the maximum possible write throughput (which depends on the average inter-DC network speed), so this solution is valid only when that maximum is acceptable. Still, when using 1 rwZK per DC, I suspect there won't be much difference from your solution, because writes from the backup DC to the main DC must travel between DCs anyway; but your solution has no inter-DC write agreements or leader-election communication, so it's faster.
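For reference, a hedged sketch of the hierarchical-quorum variant with 3 rwZKs in each of 3 DCs (hostnames are placeholders) in zoo.cfg:

    server.1=dc1-zk1:2888:3888
    server.2=dc1-zk2:2888:3888
    server.3=dc1-zk3:2888:3888
    server.4=dc2-zk1:2888:3888
    server.5=dc2-zk2:2888:3888
    server.6=dc2-zk3:2888:3888
    server.7=dc3-zk1:2888:3888
    server.8=dc3-zk2:2888:3888
    server.9=dc3-zk3:2888:3888

    # one group per DC; a proposal passes when a majority of groups
    # each gather a majority of the weights of their own members
    group.1=1:2:3
    group.2=4:5:6
    group.3=7:8:9
    weight.1=1
    weight.2=1
    weight.3=1
    weight.4=1
    weight.5=1
    weight.6=1
    weight.7=1
    weight.8=1
    weight.9=1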
Other consideration:
Regardless of the chosen solution, the inter-DC communication should be secured, and ZK offers no solution for this, so tunneling or another approach must be implemented.
UPDATE
Another solution is to still use an additional 3rd DC (or company office), but keep only the rwZKs there (1, 3 or another odd number) while the other 2 DCs have only observer ZKs. Clients should still connect only to their own DC's ZK servers, but hierarchical quorums are no longer needed. The gain here is that write agreements and leader elections happen only inside the DC with the rwZKs (let's call it the arbiter DC). The disadvantages are:
- the arbiter DC is a single point of failure
- write requests still have to travel from the observer DCs to the arbiter DC

Single Kubernetes/OpenShift cluster/instance across datacenters?

With the understanding that Ubernetes is designed to fully solve this problem, is it currently possible (not necessarily recommended) to span a single K8s/OpenShift cluster across multiple internal corporate data centers?
Additionally, assume that latency between the data centers is relatively low and that infrastructure across the corporate data centers is relatively consistent.
Example: given 3 corporate DCs, deploy 1..* masters at each data center (as a single cluster) and have 1..* nodes at each DC, with pods/RCs/services/... being spun up across all 3 DCs.
Has anyone implemented something like this as a stopgap solution before Ubernetes drops, and if so, how has it worked, and what considerations should be taken into account when running like this?
is it currently possible (not necessarily recommended) to span a single K8s/OpenShift cluster across multiple internal corporate data centers?
Yes, it is currently possible. Nodes are given the address of an apiserver and client credentials, and then register themselves into the cluster. Nodes don't know (or care) whether the apiserver is local or remote, and the apiserver allows any node to register as long as it has valid credentials, regardless of where the node sits on the network.
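As a minimal sketch of that registration path (hostnames and file paths are placeholders), a node's kubelet only needs a kubeconfig pointing at the apiserver, which may well sit in another data center:

    apiVersion: v1
    kind: Config
    clusters:
    - name: corp-cluster
      cluster:
        # the apiserver may be in a different DC; the node doesn't care
        server: https://apiserver.dc1.corp.example:6443
        certificate-authority: /etc/kubernetes/ca.crt
    users:
    - name: kubelet
      user:
        client-certificate: /etc/kubernetes/kubelet.crt
        client-key: /etc/kubernetes/kubelet.key
    contexts:
    - name: default
      context:
        cluster: corp-cluster
        user: kubelet
    current-context: default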
Additionally, assume that latency between the data centers is relatively low and that infrastructure across the corporate data centers is relatively consistent.
This is important, as many of the settings in Kubernetes assume (either implicitly or explicitly) a high bandwidth, low-latency network between the apiserver and nodes.
Example: given 3 corporate DCs, deploy 1..* masters at each data center (as a single cluster) and have 1..* nodes at each DC, with pods/RCs/services/... being spun up across all 3 DCs.
The downside of this approach is that if you have one global cluster you have one global point of failure. Even if you have replicated, HA master components, data corruption can still take your entire cluster offline. And a bad config propagated to all pods in a replication controller can take your entire service offline. A bad node image push can take all of your nodes offline. And so on. This is one of the reasons that we encourage folks to use a cluster per failure domain rather than a single global cluster.