Scenario: you have a Kafka cluster spread over different DCs, but configured as one cluster. So there is no mirroring through MirrorMaker or something like that. The DCs are not very far from each other, but they are physically separated.
Now what do you have to do to ensure that the cluster is failsafe on BOTH SIDES if the connection between those two DCs is down? So on BOTH sides the producers and consumers should still work.
I would guess: you need multiple Zookeepers on both sides and multiple Kafka-Nodes.
But is that enough? Does the cluster heal itself after getting reconnected?
Thanks in advance.
I'm assuming your datacenters that "are not very far from each other" are basically Availability Zones (AZs).
It's pretty common to spread a cluster over multiple AZs. However, it's usually not desired or possible for each "slice" to live on its own.
The immediate issue is Zookeeper, which by design prevents split-brain scenarios. So if a ZK cluster is split, only one "slice" (at best) will carry on working, namely the one that still holds a majority of the ZK nodes. The brokers on the side that lost the ZK majority will be non functional.
Then let's say it was possible to have both sides keep working. What happens when you join both sides again?
Data is likely to have diverged as clients wrote data to each side separately. Now you could have the same partition with different messages for the same offset and no way to resolve the conflict as both options are "valid".
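To illustrate the divergence, here is a minimal sketch that models one partition as a plain Python list. It is not Kafka code, and the message names are made up for the example.

```python
def append(log, message):
    """Append a message and return the offset it was assigned."""
    log.append(message)
    return len(log) - 1

def conflicting_offsets(log_a, log_b):
    """Offsets where both logs hold a message but the messages differ."""
    return [i for i, (a, b) in enumerate(zip(log_a, log_b)) if a != b]

# State of the partition before the link between the DCs breaks.
shared = ["m0", "m1", "m2"]

# During the split, each side keeps accepting writes on its own copy.
side_a = list(shared)
side_b = list(shared)
offset_a = append(side_a, "order-created")    # hypothetical message
offset_b = append(side_b, "order-cancelled")  # hypothetical message

# Both writes were assigned the same offset, with different content.
conflicts = conflicting_offsets(side_a, side_b)
```

Both sides hand out offset 3 for different messages, and nothing in the log itself says which one should win once the link is back.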
I hope this shows why this is not a possible solution. In practice, if an AZ goes offline, it is non functional until it is brought back online.
Clients that were connected to the offline AZ should reconnect to the other AZ (using multiple bootstrap servers) and clients that were in the failed AZ should be reprovisioned into another one.
If configured correctly, Kafka can survive an AZ outage (even though in practice, it's best to have 3 AZs) and keep all resources available. Also in this scenario, the cluster will automatically return to a good state when the failed AZ returns.
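To make "configured correctly" a bit more concrete: a common setup (my assumption, it is not spelled out above) is replication factor 3 with replicas spread over the AZs via `broker.rack`, `min.insync.replicas=2` on the topic, and producers using `acks=all`. The sketch below only checks whether enough replicas survive an AZ failure. It assumes every surviving replica was in sync and ignores the ZooKeeper/controller quorum, which has to survive as well.

```python
def writable_after_az_loss(replica_azs, failed_az, min_insync_replicas):
    """True if a partition can still accept acks=all writes after one AZ fails.

    replica_azs: the AZ name of each replica of the partition.
    Simplification: all surviving replicas are assumed to be in sync.
    """
    surviving = [az for az in replica_azs if az != failed_az]
    return len(surviving) >= min_insync_replicas

def check_all_az_failures(replica_azs, min_insync_replicas):
    """Try the failure of each AZ in turn and report whether writes still work."""
    return {
        az: writable_after_az_loss(replica_azs, az, min_insync_replicas)
        for az in sorted(set(replica_azs))
    }

# Replication factor 3, one replica per AZ (AZ names are placeholders).
three_az_result = check_all_az_failures(["az1", "az2", "az3"], 2)

# Replication factor 3 squeezed into two AZs.
two_az_result = check_all_az_failures(["az1", "az1", "az2"], 2)
```

With three AZs the partition stays writable whichever AZ is lost. With two AZs, losing the AZ that holds two of the three replicas drops the partition below `min.insync.replicas`, which is one reason 3 AZs are easier to reason about.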
I am new to Kafka, trying to do a project. Wanted to do it as it would be in real life example, but I am kinda confused. While searching thru the internet I found that if I want to have 3 brokers and 3 zookeepers, to provide replication factor = 2 and quorum, I need 6 EC2 instances. I am looking thru youtube to find some examples, but as far as I see all of them show multiple brokers on one cluster. From my understanding it's better to keep ZKs and all brokers separately on each VM, so if one goes down I still have all of the rest. Can you confirm that ?
Also, wondering how to set partitioning. Is it important at the beginning of creating a topic, or I change that later when I need to scale ?
Thanks in advance
looking for information on yt, google.
My suggestion would be to use MSK Serverless and forget how many machines exist.
As of Kafka 3.3.1, KRaft mode is production-ready, so Kafka doesn't need Zookeeper anymore. If you do use Zookeeper, it doesn't need to run on separate machines (although that is recommended). You can also run multiple brokers on one server... So, I'm not fully sure why you would need 6 for a replication factor of 2.
Regarding partitions, yes, create it ahead of time (over provision, if necessary) since you cannot easily move data across partitions when you do scale
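A rough sketch of why changing the partition count later is awkward for keyed data. Kafka's default partitioner hashes the key with murmur2. I'm using `zlib.crc32` here purely as a stand-in so the example runs with the standard library, and the key names are made up.

```python
import zlib

def partition_for(key, num_partitions):
    """Map a key to a partition by hashing.

    Stand-in hash (crc32), NOT Kafka's murmur2; the modulo idea is the same.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Hypothetical keys, e.g. user ids used as message keys.
keys = [f"user-{i}" for i in range(1000)]

# Keys whose target partition changes when the topic grows from 6 to 8 partitions.
moved = [k for k in keys if partition_for(k, 6) != partition_for(k, 8)]

print(f"{len(moved)} of {len(keys)} keys map to a different partition")
```

Messages already written stay in the partition they were written to, so after the change new messages for a "moved" key land in a different partition than its older messages, and per-key ordering across the change is lost.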
I recently started to learn more about service registries and their usage in distributed architecture.
All the applications providing service registries that I found (etcd, Consul, or Zookeeper) are based on the same model: a master-server/cluster with leader election.
Correct me if I'm wrong but... doesn't this make the architecture less reliable? In the sense that the master cluster introduces a point of failure. To circumvent this we could always make a bigger cluster, but that is more costly and/or performs worse.
My questions here are:
as all these service registries elect a leader — wouldn't it be possible to do the same without specifying the machines that form the master cluster but rather let them discover themselves through broadcasting and elect a leader or a leading group ?
does a service registry without master-server/cluster exists ?
and if not, what are the current limitations that prevent us from doing this ?
All of those services are based on one whitepaper: Google Chubby (https://ai.google/research/pubs/pub27897). The idea is to have fast and consistent configuration storage for distributed systems. To get there you need to eliminate the single point of failure. How can you do that? You introduce multiple machines and replicate the same data across them. But in that case, given the unreliable network between those machines, how do you make sure that the data is consistent among nodes? You choose one of the nodes in the cluster to be the Leader (using a distributed leader election algorithm). If nodes have inconsistent values between them, it's the leader's job to pick the "correct" one. It looks like we've returned to a "single point of failure" situation, but in reality, if the leader fails, the rest of the cluster just votes and promotes a new leader. So the Leader role in those systems is NOT to be a single point of truth, but rather a single point of decision making.
I am working on a brand new SolrCloud - ZooKeeper infrastructure.
Some background information:
all other services (mostly web site infrastructure) are distributed across two data centers, with active-active configurations.
at the network level, the servers are setup on extended LANs, with dark fibre across the data centers. So latency is at a minimum.
the SolrCloud - ZooKeeper infrastructure will be used by most of these applications.
I got a SolrCloud, and a ZooKeeper ensemble running. Implementation at this level is fine.
But I wonder how to distribute my ZooKeeper servers. I must have an odd number of servers, but I only have two data centers. If one fails, I have a 50-50 chance that I will lose majority.
What should I do? So far I have thought of:
requesting a third data center (not likely to happen, $$$!)
host two per data center and two on an external cloud provider (Amazon or ...?). Again $$$
set up an odd number at data center 1 and use an observer on site 2. What then happens if site 1 fails? Can SolrCloud work with only one observer?
If your requirement is to serve all search requests from the local data center (the one where the request originated), then you don't need to go for a cross data center ZooKeeper deployment.
A cross data center ZooKeeper deployment is only needed to survive a DC crash (which is most likely not going to happen, and avoiding it is what you pay $$$$ for), so in that case there isn't any need to span a ZooKeeper cluster over multiple data centers.
I got a third site to host the other ZooKeeper instance. This site is another office of my company, not a "full data center". So each site has one ZooKeeper instance.
What allowed me to have one cluster spread over three data centers was that they are close enough together to get a dark fiber between them. The latency is very low and does not impact ZooKeeper performance.
Then for Solr, I have full replicas in the two main data centers. The third office only hosts a ZooKeeper instance for quorum. Using full replicas, I have all the data in each data center. If my Solr needs grow later, I will shard, but for now our index is small.
It has proven solid for four years now, with one failure. And it was at the third office, not in a data center.
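For anyone weighing the same options, here is a small sketch of the majority arithmetic behind the two-site versus three-site choice. The site names and server counts are just examples, and only voting servers are counted (observers do not count toward the quorum).

```python
def survives_site_failure(placement):
    """For each site, report whether the ensemble keeps a majority if that site is lost.

    placement: dict of site name -> number of voting ZooKeeper servers hosted there.
    """
    total = sum(placement.values())
    majority = total // 2 + 1
    return {
        site: (total - count) >= majority
        for site, count in placement.items()
    }

# Odd number of servers split over two data centers.
two_sites = survives_site_failure({"dc1": 2, "dc2": 1})

# One server per site, the third site being a small office.
three_sites = survives_site_failure({"dc1": 1, "dc2": 1, "office": 1})

print(two_sites)    # {'dc1': False, 'dc2': True}
print(three_sites)  # {'dc1': True, 'dc2': True, 'office': True}
```

With only two sites, one of them always holds the majority, so losing that site stops the ensemble. With one voter per site across three sites, any single site can fail and the other two still form a quorum.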
I've been wondering why ZooKeeper needs a majority of the machines in the ensemble to work at all. Let's say we have a very simple ensemble of 3 machines: A, B, C.
When A fails, a new leader is elected. Fine, everything works. When another one dies, let's say B, the service is unavailable. Does that make sense? Why can't machine C handle everything alone until A and B are up again?
Since one machine is enough to do all the work (for example, a single machine ensemble works fine)...
Is there any particular reason why ZooKeeper is designed in this way? Is there a way to configure ZooKeeper so that, for example, the ensemble is always available when at least one of N is up?
Edit:
Maybe there is a way to apply a custom algorithm of leader selection? Or define a size of quorum?
Thanks in advance.
Zookeeper is intended to distribute things reliably. If the network of systems becomes segmented, then you don't want the two halves operating independently and potentially getting out of sync, because when the failure is resolved, it won't know what to do. If you have it refuse to operate when it's got less than a majority, then you can be assured that when a failure is resolved, everything will come right back up without further intervention.
The reason to get a majority vote is to avoid a problem called "split-brain".
Basically, in a network failure you don't want the two parts of the system to continue as usual. You want one to continue and the other to understand that it is not part of the cluster.
There are two main ways to achieve that. One is to hold a shared resource, for instance a shared disk where the leader holds a lock: if you can see the lock you are part of the cluster, if you can't you're out. If you are holding the lock you're the leader, and if you aren't you're not. The problem with this approach is that you need that shared resource.
The other way to prevent split-brain is a majority count: if you get enough votes, you are the leader. This still works with two nodes remaining (out of an ensemble of 3), where the leader says it is the leader and the other node, acting as a "witness", also agrees. This method is preferable as it can work in a shared-nothing architecture, and indeed that is what Zookeeper uses.
As Michael mentioned, a node cannot know if the reason it doesn't see the other nodes in the cluster is because these nodes are down or there's a network problem - the safe bet is to say there's no quorum.
Let’s look at an example that shows how things can go wrong if the quorum (majority of running servers) is too small.
Say we have five servers and a quorum can be any set of two servers. Now say that servers s1 and s2 acknowledge that they have replicated a request to create a znode /z. The service returns to the client saying that the znode has been created. Now suppose servers s1 and s2 are partitioned away from the other servers and from clients for an arbitrarily long time, before they have a chance to replicate the new znode to the other servers. The service in this state is able to make progress because there are three servers available and it really needs only two according to our assumptions, but these three servers have never seen the new znode /z. Consequently, the request to create /z is not durable.
This is an example of the split-brain scenario. To avoid this problem, in this example the size of the quorum must be at least three, which is a majority out of the five servers in the ensemble. To make progress, the ensemble needs at least three servers available. To confirm that a request to update the state has completed successfully, this ensemble also requires that at least three servers acknowledge that they have replicated it.
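The five-server example can be checked by brute force. The sketch below enumerates every possible quorum of a given size and looks for pairs that share no server. Server names follow the example above, the function name is mine.

```python
from itertools import combinations

def disjoint_quorum_pairs(servers, quorum_size):
    """Return all pairs of quorums of the given size that have no server in common."""
    quorums = [set(q) for q in combinations(servers, quorum_size)]
    return [(a, b) for a, b in combinations(quorums, 2) if not (a & b)]

servers = ["s1", "s2", "s3", "s4", "s5"]

# Quorum of two: {s1, s2} can acknowledge /z while {s3, s4} never hears of it.
pairs_q2 = disjoint_quorum_pairs(servers, 2)

# Quorum of three (a majority of five): any two quorums overlap in at least one server.
pairs_q3 = disjoint_quorum_pairs(servers, 3)

print(len(pairs_q2))  # 15
print(len(pairs_q3))  # 0
```

With a quorum size of two there are disjoint quorums, so an acknowledged write can be invisible to another quorum. With a majority quorum, every later quorum contains at least one server that saw the write.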
Use case: 100 Servers in a pool; I want to start a ZooKeeper service on each Server and Server applications (ZooKeeper client) will use the ZooKeeper cluster (read/write). Then there is no single point of failure.
Is this solution possible for this use case? What about the performance?
What if there are 1000 Servers in the pool?
If you are simply trying to avoid a single point of failure, then you only need 3 servers. In a 3 node ensemble, a single failure can be tolerated with the remaining 2 nodes forming the quorum. The more servers you have the worse write performance will be. And 100 servers is the extreme of this, if ZK can even handle it.
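As a quick back-of-the-envelope check (plain arithmetic, not a ZooKeeper API), here is how the majority size and the number of tolerated failures grow with the number of voting servers:

```python
def majority(n):
    """Smallest number of servers that forms a majority of n voting servers."""
    return n // 2 + 1

def tolerated_failures(n):
    """How many servers can fail while a majority is still available."""
    return n - majority(n)

for n in (3, 4, 5, 7, 100):
    print(f"{n} servers: majority {majority(n)}, tolerates {tolerated_failures(n)} failures")
```

Going from 3 to 4 servers buys no extra fault tolerance, and with 100 voting servers every write has to be acknowledged by 51 of them before it completes.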
However, having that many clients is no problem at all. Zookeeper has active deployments with many more than 1000 clients. If you find that you need more servers to handle your read load, you can always add Observers. I highly recommend you join the mailing list. It is an excellent way to quickly have your questions answered, and likely in much more detail than anyone will give you on SO.
Maybe zookeeper is not the right tool?
Hazelcast does what you want, I think. You can have hundreds of peers, and if the master is lost a new one is elected from all the peers.
You don't need to use all of Hazelcast. You can just use the maps, or just the worker pools, or just the synchronisation primitives, or just the messaging, etc.