Maximum servers in a ZooKeeper ensemble cluster? - apache-zookeeper

Use case: 100 Servers in a pool; I want to start a ZooKeeper service on each Server and Server applications (ZooKeeper client) will use the ZooKeeper cluster (read/write). Then there is no single point of failure.
Is this solution possible for this use case? What about the performance?
What if there are 1000 Servers in the pool?

If you are simply trying to avoid a single point of failure, then you only need 3 servers. In a 3-node ensemble, a single failure can be tolerated, with the remaining 2 nodes forming the quorum. The more servers you have, the worse write performance will be, because every write must be acknowledged by a majority of the ensemble. 100 servers is the extreme of this, if ZK can even handle it.
However, having that many clients is no problem at all. ZooKeeper has active deployments with many more than 1000 clients. If you find that you need more servers to handle your read load, you can always add Observers, which serve reads but do not vote. I highly recommend you join the mailing list. It is an excellent way to quickly have your questions answered, and likely in much more detail than anyone will give you on SO.
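To make the arithmetic behind the three-server recommendation concrete, here is a minimal sketch of majority-quorum sizing. The function names are made up for illustration and are not part of any ZooKeeper API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of the voting servers."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many voting servers can fail while a majority remains."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (3, 5, 100):
    print(f"{n} servers: quorum={quorum_size(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

With 100 voting servers a write needs acknowledgement from 51 of them, which is why a small ensemble plus Observers is the usual suggestion for read-heavy loads.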

Maybe zookeeper is not the right tool?
Hazelcast does what you want, I think. You can have hundreds of peers, and if the master is lost a new one is elected from all the peers.
You don't need to use all of Hazelcast. You can just use the maps, or just the worker pools, or just the synchronisation primitives, or just the messaging, etc.

Related

Kafka on multiple instances of ec2

I am new to Kafka and trying to do a project. I wanted to set it up as it would be in a real-life example, but I am a bit confused. While searching the internet I found that if I want to have 3 brokers and 3 ZooKeepers, to provide replication factor = 2 and a quorum, I need 6 EC2 instances. I have been looking through YouTube for examples, but as far as I can see all of them show multiple brokers on one cluster. From my understanding it's better to keep the ZKs and all brokers separate, each on its own VM, so if one goes down I still have all of the rest. Can you confirm that?
Also, I am wondering how to set partitioning. Is it important to get right when creating a topic, or can I change it later when I need to scale?
Thanks in advance
looking for information on yt, google.
My suggestion would be to use MSK Serverless and forget how many machines exist.
As of Kafka 3.3.1, KRaft mode is production-ready, so ZooKeeper is no longer required. If you do use ZooKeeper, it doesn't have to run on separate machines (although that is recommended). You can also run multiple brokers on one server, so I'm not sure why you would need 6 instances for a replication factor of 2.
Regarding partitions, yes, create them ahead of time (over-provision, if necessary), since you cannot easily move data across partitions when you do scale.
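A rough sketch of why adding partitions later is awkward for keyed data: the partition is chosen from a hash of the key modulo the partition count, so changing the count changes where many keys land. The hash below is a stand-in (Kafka's default partitioner uses murmur2, not CRC32), and the key names and partition counts are invented for illustration.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash for illustration only; Kafka itself uses murmur2.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = [f"user-{i}" for i in range(1000)]

# Keys whose partition changes when the topic grows from 6 to 12 partitions.
moved = [k for k in keys if partition_for(k, 6) != partition_for(k, 12)]
print(f"{len(moved)} of {len(keys)} keys map to a different partition")
```

Messages already written stay in their old partition, so after the change a key's history can be spread over two partitions and per-key ordering is no longer guaranteed across them.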

ActiveMQ Artemis cluster failover questions

I have a question in regards to Apache Artemis clustering with message grouping. This is also done in Kubernetes.
The current setup I have is 4 master nodes and 1 slave node. Node 0 is dedicated as LOCAL to handle message grouping and node 1 is the dedicated backup to node 0. Nodes 2-4 are REMOTE master nodes without backup nodes.
I've noticed that clients connected to nodes 2-4 are not failing over to the 3 other master nodes available when the connected Artemis node goes down, essentially not discovering the other nodes. Even after the original node comes back up, the client continues to fail to establish a connection. I've seen from a separate Stack Overflow post that master-to-master failover is not supported. Does this mean that for every master node I need to create a slave node as well to handle the failover? Would this cause a two-instance point of failure instead of however many nodes are within the cluster?
On a separate basic test using a cluster of two nodes with one master and one slave, I've observed that when I bring down the master node that clients are connected to, the client doesn't fail over to the slave node. Any ideas why?
As you note in your question, failover is only supported between a live and a backup. Therefore, if you wanted failover for clients which were connected to nodes 2-4 then those nodes would need backups. This is described in more detail in the ActiveMQ Artemis documentation.
It's worth noting that clustering and message grouping, while technically possible, is a somewhat odd pairing. Clustering is a way to improve overall message throughput using horizontal scaling. However, message grouping naturally serializes message consumption for each group (to maintain message order) which then decreases overall message throughput (perhaps severely depending on the use-case). A single ActiveMQ Artemis node can potentially handle millions of messages per second. It may be that you don't need the increased message throughput of a cluster since you're grouping messages.
I've often seen users simply assume they need a cluster to deal with their expected load without actually conducting any performance benchmarking. This can potentially lead to higher costs for development, testing, administration, and (especially) hardware, and in some use-cases it can actually yield worse performance. Please ensure you've thoroughly benchmarked your application and broker architecture to confirm the proposed design.

Can a Kafka-Cluster be cut in half?

Scenario: you have a Kafka cluster spread across different DCs, but they are configured as one cluster. So there is no mirroring through MirrorMaker or something like that. The DCs are not very far from each other, but they are physically separated.
Now what do you have to do to ensure that the cluster is failsafe on BOTH SIDES if the connection between those two DCs is down? So on BOTH sides the producers and consumer should still work.
I would guess: you need multiple Zookeepers on both sides and multiple Kafka-Nodes.
But is that enough? Does the cluster heal itself after getting reconnected?
Thanks in advance.
I'm assuming your datacenters that "are not very far from each other" are basically Availability Zones (AZs).
It's pretty common to spread a cluster over multiple AZs. However, it's usually neither desired nor possible for each "slice" to live on its own.
The immediate issue is ZooKeeper, which by design prevents split-brain scenarios. So if a ZK cluster is split, only one "slice" (at best) will carry on working. The brokers on the side without a working ZK quorum will be non-functional.
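As a small illustration of "only one slice (at best)", the sketch below checks which side of a two-way network split still holds a majority of the ZooKeeper ensemble. The function and the example splits are hypothetical, not taken from the question.

```python
from typing import Optional


def side_with_quorum(ensemble_size: int, side_a: int, side_b: int) -> Optional[str]:
    """Return which side of a split keeps a majority, or None if neither does."""
    if side_a + side_b > ensemble_size:
        raise ValueError("sides cannot contain more servers than the ensemble")
    majority = ensemble_size // 2 + 1
    if side_a >= majority:
        return "A"
    if side_b >= majority:
        return "B"
    return None


print(side_with_quorum(5, 3, 2))  # 3 of 5 is a majority, so side A keeps working
print(side_with_quorum(6, 3, 3))  # an even split leaves no side with a majority
```

Because two disjoint sides can never both hold more than half of the servers, at most one side keeps serving, and with an even split across two DCs neither does.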
Then let's say it was possible to have both sides keep working. What happens when you join both sides again?
Data is likely to have diverged as clients wrote data to each side separately. Now you could have the same partition with different messages for the same offset and no way to resolve the conflict as both options are "valid".
I hope this shows why this is not a possible solution. In practice, if an AZ goes offline, it is non functional until it is brought back online.
Clients that were connected to the offline AZ should reconnect to the other AZ (using multiple bootstrap servers) and clients that were in the failed AZ should be reprovisioned into another one.
If configured correctly, Kafka can survive an AZ outage (even though in practice, it's best to have 3 AZs) and keep all resources available. Also in this scenario, the cluster will automatically return to a good state when the failed AZ returns.

Postgresql-Patroni-Etcd election handicapping

I am planning the creation of a PostgreSQL HA cluster that spans multiple data centres on different continents, and I am trying to figure out how to tweak the election parameters in etcd and Patroni so that we are unlikely to fail over accidentally to a continent far from our app servers unless the databases closer to the application servers are down.
So far in my research I have concluded that I should be able to tweak the election timeout settings in etcd so that the variance between the servers is less than the latency of the intercontinental hop. This should help on the etcd side by preventing far-away servers from deciding to take over. However, how do I prevent the same on the Patroni side? Is there a best practice for handicapping servers we want to avoid being put into the master role?

Why ZooKeeper needs majority to run?

I've been wondering why ZooKeeper needs a majority of the machines in the ensemble to work at all. Let's say we have a very simple ensemble of 3 machines: A, B, C.
When A fails, a new leader is elected. Fine, everything works. When another one dies, let's say B, the service is unavailable. Does that make sense? Why can't machine C handle everything alone until A and B are up again?
Since one machine is enough to do all the work (for example single machine ensemble works fine)...
Is there any particular reason why ZooKeeper is designed in this way? Is there a way to configure ZooKeeper so that, for example, the ensemble is always available when at least one of N is up?
Edit:
Maybe there is a way to apply a custom algorithm of leader selection? Or define a size of quorum?
Thanks in advance.
ZooKeeper is intended to distribute things reliably. If the network of systems becomes segmented, then you don't want the two halves operating independently and potentially getting out of sync, because when the failure is resolved, the system won't know how to reconcile them. If you have it refuse to operate when it has less than a majority, then you can be assured that when a failure is resolved, everything will come right back up without further intervention.
The reason to get a majority vote is to avoid a problem called "split-brain".
Basically, in a network failure you don't want the two parts of the system to continue as usual. You want one to continue and the other to understand that it is not part of the cluster.
There are two main ways to achieve that. One is to hold a shared resource, for instance a shared disk where the leader holds a lock. If you can see the lock you are part of the cluster; if you can't, you're out. If you are holding the lock you're the leader, and if you aren't, you're not. The problem with this approach is that you need that shared resource.
The other way to prevent split-brain is a majority count: if you get enough votes, you are the leader. This still works with only two nodes up in a three-node ensemble, where the leader says it is the leader and the other node, acting as a "witness", agrees. This method is preferable because it works in a shared-nothing architecture, and indeed it is what ZooKeeper uses.
As Michael mentioned, a node cannot know if the reason it doesn't see the other nodes in the cluster is because these nodes are down or there's a network problem - the safe bet is to say there's no quorum.
Let’s look at an example that shows how things can go wrong if the quorum (majority of running servers) is too small.
Say we have five servers and a quorum can be any set of two servers. Now say that servers s1 and s2 acknowledge that they have replicated a request to create a znode /z. The service returns to the client saying that the znode has been created. Now suppose servers s1 and s2 are partitioned away from the other servers and from clients for an arbitrarily long time, before they have a chance to replicate the new znode to the other servers. The service in this state is able to make progress because there are three servers available and it really needs only two according to our assumptions, but these three servers have never seen the new znode /z. Consequently, the request to create /z is not durable.
This is an example of the split-brain scenario. To avoid this problem, in this example the size of the quorum must be at least three, which is a majority out of the five servers in the ensemble. To make progress, the ensemble needs at least three servers available. To confirm that a request to update the state has completed successfully, this ensemble also requires that at least three servers acknowledge that they have replicated it.
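The five-server example can be checked by brute force. The sketch below enumerates every possible quorum of a given size and tests whether any two of them can be disjoint; the server names follow the example above, and the helper function is purely illustrative.

```python
from itertools import combinations


def quorums_always_intersect(servers, quorum_size):
    """True if every pair of quorums of this size shares at least one server."""
    quorums = [set(q) for q in combinations(servers, quorum_size)]
    return all(a & b for a, b in combinations(quorums, 2))


servers = ["s1", "s2", "s3", "s4", "s5"]

# With quorums of two, {s1, s2} and {s3, s4} never overlap, so /z can be lost.
print(quorums_always_intersect(servers, 2))  # False

# With quorums of three, any two quorums share a server that has seen /z.
print(quorums_always_intersect(servers, 3))  # True
```

The overlap is what makes an acknowledged write durable: any later quorum must include at least one server that took part in the earlier one.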