Apache Kafka KRaft mode topology best practices


My question concerns the recommended topology of Kafka brokers and controllers in KRaft mode.
According to best practices with ZooKeeper, we are supposed to create:
{3,5,7} ZooKeeper nodes
{3,5,7} Kafka broker nodes
This is a well-known structure that is recommended in every book and online course. One drawback of this model is that we need at least 6 machines / nodes, which is a lot.
Now, I'm afraid that in KRaft mode things might be different. The alternatives I see are the following:
(1) Three nodes where each node consists of a controller and a broker. I'm not sure it's a good one for production because once a single node is down (controller + broker), our system becomes fragile and we cannot afford to lose another node. Plus, I think it can introduce complications if we want to update one node in production and another one crashes.
(2) Six nodes: three separate controllers and three separate brokers - This is a good solution, and it handles some of the issues mentioned in (1) better, but I think we can find something better.
(3) Five nodes where each node is both a controller and a broker - I know that five nodes is reserved for heavy load systems, but I think that it's much better than model (2). Why should we use six machines when we can use five and have a much more reliable and available system? In other words, we can use a much better and cheaper solution.
(4) Hybrid - some standalone controllers and brokers, and some combined controller/broker nodes - I'm not sure whether this model has any benefits.
The only thing that worries me about model (3) is that I've not seen it anywhere else, so I'm not completely sure about it. Looking for your opinion and advice.

I might be a little late here, but your proposed third option is not really cheaper than the second if all five machines run as brokers. Brokers have way higher hardware requirements than controllers, so I would still go for the conservative second option.
Controller nodes will need as little as maybe 8 GB RAM and some storage/CPU, so you only need three machines (the brokers) that do the heavy lifting with lots of RAM and storage (and some CPU).
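To put numbers on the options from the question, here is a rough sketch of the quorum arithmetic. It assumes only that KRaft controllers form a Raft quorum that needs a majority of voters to stay available; the layout names and the `layouts` dict are made up for illustration, and hardware cost per machine is not modeled.

```python
def majority(voters: int) -> int:
    """Smallest number of voters that forms a majority."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can be lost while a majority still remains."""
    return voters - majority(voters)


# Hypothetical layouts from the question: (machines, controllers, brokers)
layouts = {
    "(1) 3 combined nodes": (3, 3, 3),
    "(2) 3 controllers + 3 brokers": (6, 3, 3),
    "(3) 5 combined nodes": (5, 5, 5),
}

for name, (machines, controllers, brokers) in layouts.items():
    print(
        f"{name}: {machines} machines, {brokers} broker-sized machines, "
        f"controller quorum {majority(controllers)}, "
        f"survives {tolerated_failures(controllers)} controller failure(s)"
    )
```

Option (3) does tolerate one more controller failure than (2), but it needs five broker-sized machines instead of three, which is the cost argument made above.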

Related

Kafka on multiple instances of ec2

I am new to Kafka and trying to do a project. I wanted to set it up as it would be in a real-life deployment, but I am kind of confused. While searching the internet I found that if I want to have 3 brokers and 3 ZooKeeper nodes, to provide replication factor = 2 and a quorum, I need 6 EC2 instances. I have been looking through YouTube for examples, but as far as I can see all of them show multiple brokers on one cluster. From my understanding it's better to keep the ZooKeeper nodes and the brokers each on a separate VM, so that if one goes down I still have all of the rest. Can you confirm that?
Also, I'm wondering how to set partitioning. Is it important when first creating a topic, or can I change it later when I need to scale?
Thanks in advance.
So far I have been looking for information on YouTube and Google.

My suggestion would be to use MSK Serverless and forget how many machines exist.
As of Kafka 3.3.1, KRaft mode is production-ready, so ZooKeeper is no longer needed. Even if you use it, ZooKeeper doesn't need to run on separate machines (although that is recommended). You can also run multiple brokers on one server... So, I'm not fully sure why you would need 6 for a replication factor of 2.
Regarding partitions, yes, create them ahead of time (over-provision, if necessary), since you cannot easily move data across partitions when you do scale.
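A small sketch of why changing the partition count later is awkward for keyed data: the partition is derived from the key's hash modulo the partition count, so growing the topic sends many keys to a different partition than the one holding their older records. Note the assumption: `zlib.crc32` is only a stand-in here (Kafka's default partitioner uses murmur2), and `partition_for` is a hypothetical helper, not a Kafka API.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition by hash modulo partition count.

    Stand-in hash: Kafka's default partitioner uses murmur2, not CRC32.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical keys, and a topic grown from 6 to 8 partitions.
keys = [f"user-{i}" for i in range(1000)]
moved = sum(partition_for(k, 6) != partition_for(k, 8) for k in keys)

print(f"{moved} of {len(keys)} keys map to a different partition after 6 -> 8")
```

Existing records stay where they were written, so after the change a key's history can be spread over two partitions and per-key ordering is no longer guaranteed across the boundary.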

Paxos and Discovery

Suppose I throw some machines into an elastic cluster and want to run some consensus algorithm on them (say, Paxos). Suppose they know the initial size of the network, say, 8 machines.
So, they'll run a consensus algorithm, and the quorum is 5.
Now, consider these cases:
(1) I see that CPU usage is too low, and I cut the cluster size in half, to 4 machines.
(2) There is a network partition, and each side gets 4 machines.
If I use the current cluster size to compute quorums, I'm vulnerable to partition splits, since to the underlying cluster situations (1) and (2) look exactly the same. However, if I use a fixed number, I'm not able to scale down the cluster (and I'm subject to inconsistencies due to partitions if I scale it up).
I have a third alternative, that of informing all the machines the size of the cluster when scaling, but there's a possibility of a partition happening right before a scale up, for instance, and that partition not receiving the new size and having enough quorum for a consensus using the old size.
Are Paxos and other safe consensus algorithms unusable in an elastic environment?

Quorum-based consensus protocols fundamentally require quorums in order to operate. Both Multi-Paxos and Raft can be used in environments with dynamically changing cluster and quorum sizes, but it must be done in a controlled manner that always maintains a consistent quorum. If, for example, you were currently using a cluster size of 8 and wanted to reduce that cluster to a size of 4, you could do so. However, that decision to reduce the cluster size to 4 must be a consensual decision that's agreed upon by the original 8.
Your question is a little unclear but it sounded like you were asking if you could safely reduce your cluster size to 4 as a recovery mechanism in the event that some kind of network partition renders your original cluster of 8 inoperable. The answer to that is effectively no since the decision to do so could not be consensual and attempting to go behind the back of the consensus algorithm is virtually guaranteed to result in inconsistencies. How would the new set of 4 be defined? How would you guarantee that all peers reached the same conclusion? How do you ensure they all make the same decision at the same time?
You could, of course, make all of these decisions manually and force the system to recover by shutting the consensus service down on each system and reconfiguring their quorum definition by hand. Assuming you don't screw up (which is an overwhelmingly big assumption for any real-world deployment) this would be safe. A better approach though would be to design the system such that one or two network partitions either won't halt the system (lots of sites) or use an eventual consistency model that gracefully handles the occasional network partitions. There's no magic bullet for getting around CAP restrictions.
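To make the 4/4 split concrete, here is a minimal sketch of the quorum check only (not an implementation of Paxos or Raft). The helper names are hypothetical. It compares a membership size that was agreed on by the cluster with the unsafe shortcut of letting each side treat whatever it can currently see as the cluster size.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of members that forms a majority."""
    return cluster_size // 2 + 1


def can_commit(reachable: int, configured_size: int) -> bool:
    """A side can make progress only if it reaches a majority of the configured size."""
    return reachable >= majority(configured_size)


# An 8-node cluster split by a network partition into two halves of 4.
halves = [4, 4]

# Agreed membership stays at 8: neither half reaches the quorum of 5.
agreed_config = [can_commit(reachable, 8) for reachable in halves]

# Unsafe shortcut: each half uses the nodes it can see as the cluster size.
visible_size = [can_commit(reachable, reachable) for reachable in halves]

print("agreed size 8:    ", agreed_config)  # both sides stall
print("size = what I see:", visible_size)   # both sides proceed: split-brain
```

With the agreed size the system stalls but stays consistent; with the shortcut both halves accept writes independently, which is the inconsistency described above.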

Paxos and friends can scale in an elastic way (somewhat). Instead of changing the quorum size, though, just add learners. Learners are nodes that don't participate in consensus, but get all the decisions. Just like acceptors, learners accept reads and forward writes to the leader.
There are two styles of learner. In the first, the learner listens to all events from the acceptors; in the second, the leader forwards all committed transitions to the learners.

Why ZooKeeper needs majority to run?

I've been wondering why ZooKeeper needs a majority of the machines in the ensemble to work at all. Let's say we have a very simple ensemble of 3 machines - A, B, C.
When A fails, a new leader is elected - fine, everything works. When another one dies, let's say B, the service is unavailable. Does it make sense? Why can't machine C handle everything alone until A and B are up again?
Since one machine is enough to do all the work (for example single machine ensemble works fine)...
Is there any particular reason why ZooKeeper is designed this way? Is there a way to configure ZooKeeper so that, for example, the ensemble is always available when at least one of N is up?
Edit:
Maybe there is a way to apply a custom algorithm of leader selection? Or define a size of quorum?
Thanks in advance.

Zookeeper is intended to distribute things reliably. If the network of systems becomes segmented, then you don't want the two halves operating independently and potentially getting out of sync, because when the failure is resolved, it won't know what to do. If you have it refuse to operate when it's got less than a majority, then you can be assured that when a failure is resolved, everything will come right back up without further intervention.

The reason to get a majority vote is to avoid a problem called "split-brain".
Basically, in a network failure you don't want the two parts of the system to continue as usual. You want one to continue and the other to understand that it is not part of the cluster.
There are two main ways to achieve that. One is to hold a shared resource, for instance a shared disk where the leader holds a lock: if you can see the lock you are part of the cluster, and if you can't you're out. If you are holding the lock you're the leader, and if you aren't you're not. The problem with this approach is that you need that shared resource.
The other way to prevent split-brain is a majority count: if you get enough votes, you are the leader. This still works with two nodes (in an ensemble of 3), where the leader says it is the leader and the other node, acting as a "witness", also agrees. This method is preferable as it can work in a shared-nothing architecture, and indeed that is what ZooKeeper uses.
As Michael mentioned, a node cannot know if the reason it doesn't see the other nodes in the cluster is because these nodes are down or there's a network problem - the safe bet is to say there's no quorum.

Let’s look at an example that shows how things can go wrong if the quorum (majority of running servers) is too small.
Say we have five servers and a quorum can be any set of two servers. Now say that servers s1 and s2 acknowledge that they have replicated a request to create a znode /z. The service returns to the client saying that the znode has been created. Now suppose servers s1 and s2 are partitioned away from the other servers and from clients for an arbitrarily long time, before they have a chance to replicate the new znode to the other servers. The service in this state is able to make progress because there are three servers available and it really needs only two according to our assumptions, but these three servers have never seen the new znode /z. Consequently, the request to create /z is not durable.
This is an example of the split-brain scenario. To avoid this problem, in this example the size of the quorum must be at least three, which is a majority out of the five servers in the ensemble. To make progress, the ensemble needs at least three servers available. To confirm that a request to update the state has completed successfully, this ensemble also requires that at least three servers acknowledge that they have replicated it.
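The five-server example can be checked by brute force. This is a small sketch (the helper `has_disjoint_quorums` is made up for illustration) that enumerates every possible quorum of a given size and looks for two quorums with no server in common, which is exactly the situation where one group acknowledges /z and another group proceeds without ever having seen it.

```python
from itertools import combinations


def has_disjoint_quorums(servers, quorum_size):
    """True if two quorums of this size can share no server at all."""
    quorums = [set(q) for q in combinations(servers, quorum_size)]
    return any(a.isdisjoint(b) for a, b in combinations(quorums, 2))


servers = ["s1", "s2", "s3", "s4", "s5"]

# Quorum of 2: {s1, s2} and {s3, s4} never overlap, so an update can be lost.
print(has_disjoint_quorums(servers, 2))  # True

# Quorum of 3 (a majority of 5): any two quorums share at least one server.
print(has_disjoint_quorums(servers, 3))  # False
```

Because any two majorities overlap, at least one server in every later quorum has seen each acknowledged update.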

Maximum servers in a ZooKeeper ensemble cluster?

Use case: 100 Servers in a pool; I want to start a ZooKeeper service on each Server and Server applications (ZooKeeper client) will use the ZooKeeper cluster (read/write). Then there is no single point of failure.
Is this solution possible for this use case? What about the performance?
What if there are 1000 Servers in the pool?

If you are simply trying to avoid a single point of failure, then you only need 3 servers. In a 3 node ensemble, a single failure can be tolerated with the remaining 2 nodes forming the quorum. The more servers you have the worse write performance will be. And 100 servers is the extreme of this, if ZK can even handle it.
However, having that many clients is no problem at all. Zookeeper has active deployments with many more than 1000 clients. If you find that you need more servers to handle your read load, you can always add Observers. I highly recommend you join the list serve. It is an excellent way to quickly have your questions answered, and likely in much more detail than anyone will give you on SO.

Maybe zookeeper is not the right tool?
Hazelcast does what you want, I think. You can have hundreds of peers, and if the master is lost a new one is elected from all the peers.
You don't need to use all of Hazelcast. You can just use the maps, or just the worker pools, or just the synchronisation primitives, or just the messaging, etc.

Do I absolutely need a minimum of 3 nodes/servers for a Cassandra cluster or will 2 suffice?

Surely one can run a single node cluster but I'd like some level of fault-tolerance.
At present I can afford to lease two servers (8GB RAM, private VLAN #1GigE) but not 3.
My understanding is that 3 nodes is the minimum needed for a Cassandra cluster because there's no possible majority between 2 nodes, and a majority is required for resolving versioning conflicts. Oh wait, am I thinking of "vector clocks" and Riak? Ack! Cassandra uses timestamps for conflict resolution.
For 2 nodes, what is the recommended read/write strategy? Should I generally write to ALL (both) nodes and read from ONE (N=2; W=N/2+1; W=2/2+1=2)? Cassandra will use hinted-handoff as usual even for 2 nodes, yes?
These 2 servers are located in the same data center FWIW.
Thanks!

If you need availability on an RF=2, cluster size = 2 system, then you can't use ALL, or you will not be able to write when a node goes down.
That is why people recommend 3 nodes instead of 2, because then you can do quorum reads+writes and still have both strong consistency and availability if a single node goes down.
With just 2 nodes you get to choose whether you want strong consistency (write with ALL) or availability in the face of a single node failure (write with ONE), but not both. Of course, if you write with ONE, Cassandra will do hinted handoff etc. as needed to make it eventually consistent.
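The trade-off can be written down as two small checks. This is a sketch of the usual rule of thumb (reads see the latest write when the read and write replica counts together exceed the replication factor), not Cassandra's actual code; the function names are hypothetical and only the ONE, QUORUM and ALL levels are modeled.

```python
def required(level: str, rf: int) -> int:
    """Replicas that must respond for a given consistency level."""
    return {"ONE": 1, "QUORUM": rf // 2 + 1, "ALL": rf}[level]


def strongly_consistent(write: str, read: str, rf: int) -> bool:
    """Read and write replica sets must overlap: W + R > RF."""
    return required(write, rf) + required(read, rf) > rf


def available(level: str, rf: int, replicas_down: int) -> bool:
    """The operation succeeds if enough replicas are still up."""
    return rf - replicas_down >= required(level, rf)


for rf, write, read in [(2, "ALL", "ONE"), (2, "ONE", "ONE"), (3, "QUORUM", "QUORUM")]:
    print(
        f"RF={rf} write={write} read={read}: "
        f"consistent={strongly_consistent(write, read, rf)}, "
        f"writes survive one node down={available(write, rf, 1)}"
    )
```

With RF=2, QUORUM also requires 2 replicas, so it behaves like ALL; only at RF=3 does QUORUM give both properties with one node down.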