I'm studying Kafka and I see that a lot of articles about it talk about a Kafka cluster with 3 nodes. I'm curious: why 3 nodes?
Is there something technically specific about this number?
This is all about high availability. The recommended replication factor for Kafka is 3, meaning you will have 3 replicas of every piece of data sent to your cluster. If you have only 1 node and that node crashes or has hardware issues such as disk failures, you will probably lose all your data, as opposed to having healthy replicas on the other 2 nodes.
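If it helps to see this in code, here is a minimal sketch of creating a topic with replication factor 3 using the Java AdminClient. The bootstrap address and topic name are placeholders, not anything from the question:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address; replace with your brokers.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 3: each partition gets a copy on 3 different brokers.
            NewTopic topic = new NewTopic("orders", 3, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```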
There are some additional explanations about this here as well: Kafka replicas
One of the reasons is leader election. One of the algorithms for electing a leader is the Raft algorithm (https://raft.github.io/): it elects a new leader only when a majority of brokers, i.e. at least floor(n/2) + 1 alive out of n, agree on one value. Long story short: with 3 brokers it is guaranteed that the cluster survives one failure. With 2 brokers, if one broker fails, the remaining broker will not be able to elect itself.
P.S. I'm not sure whether Kafka uses the Raft algorithm for electing a leader nowadays (older versions used ZooKeeper). I'm just describing an interesting case of leader election in distributed systems and why, if Raft is used, the minimum should be 3.
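Just to make the majority arithmetic concrete, here is a tiny illustrative snippet (not Kafka code, only the formula from above):

```java
public class QuorumMath {
    // Smallest number of nodes that forms a majority of n.
    static int majority(int n) {
        return n / 2 + 1;
    }

    // How many node failures the cluster can survive while still forming a majority.
    static int tolerableFailures(int n) {
        return n - majority(n);
    }

    public static void main(String[] args) {
        for (int n : new int[] {2, 3, 4, 5}) {
            System.out.printf("n=%d  majority=%d  tolerable failures=%d%n",
                    n, majority(n), tolerableFailures(n));
        }
        // n=2 -> majority=2, tolerates 0 failures
        // n=3 -> majority=2, tolerates 1 failure
        // n=5 -> majority=3, tolerates 2 failures
    }
}
```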
Related
I am considering how to deploy our Kafka cluster: one big cluster with several broker groups, or several clusters. If a big cluster, I want to know how big a Kafka cluster can be. Kafka has a controller node and I don't know how many brokers it can support. Another question is about the __consumer_offsets topic: how big can it get, and can we add more partitions to it?
I've personally worked with production Kafka clusters anywhere from 3 brokers to 20 brokers. They've all worked fine; it just depends on what kind of workload you're throwing at them. With Kafka, my general recommendation is that it's better to have a smaller number of larger, more powerful brokers than a bunch of tiny servers.
For a standing cluster, each broker you add increases "crosstalk" between the nodes, since they have to move partitions around, replicate data, and keep metadata in sync. This additional network chatter can impact how much load each broker can handle. As a general rule, adding brokers adds overall capacity, but you then have to shift partitions around so that the load is balanced properly across the entire cluster. Because of that, it's much better to start with 10 nodes, so that topics and partitions are spread out evenly from the beginning, than to start with 6 nodes and add 4 more later.
Regardless of the size of the cluster, there is always only one controller node at a time. If that node happens to go down, another node will take over as controller, but only one can be active at a given time, assuming the cluster is not in an unstable state.
The __consumer_offsets topic can have as many partitions as you want, but it defaults to 50 partitions. Since this is a compacted topic, assuming there is no excessive committing happening (this has happened to me twice already in production environments), the default settings should be enough for almost any scenario. You can look up the configuration settings for the consumer offsets topic by looking for broker properties that start with offsets. in the official Kafka documentation.
You can get more details at the official Kafka docs page: https://kafka.apache.org/documentation/
The size of a cluster can be determined in the following ways.
The most accurate way to model your use case is to simulate the load you expect on your own hardware. You can use the Kafka load generation tools kafka-producer-perf-test and kafka-consumer-perf-test.
Based on the producer and consumer metrics, we can decide the number of brokers for our cluster.
The other approach, without simulation, is based on the estimated rate at which you receive data and the required data retention period.
We can also calculate the throughput and, based on that, decide the number of brokers in our cluster.
Example
If you have 800 messages per second of 500 bytes each, then your ingress throughput is 800*500/(1024*1024) = ~0.4 MB/s. With a replication factor of 3 and 3 brokers up and running, the total write load on the cluster is 0.4*3 = 1.2 MB/s, which, spread across the 3 brokers, is roughly 0.4 MB/s of writes per broker.
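The same arithmetic as a small sketch, with the numbers from the example hard-coded:

```java
public class ThroughputEstimate {
    public static void main(String[] args) {
        int messagesPerSecond = 800;
        int messageSizeBytes = 500;
        int replicationFactor = 3;
        int brokers = 3;

        double ingressMBps = messagesPerSecond * messageSizeBytes / (1024.0 * 1024.0); // ~0.38 MB/s (rounded to 0.4 above)
        double clusterWriteMBps = ingressMBps * replicationFactor;                      // ~1.14 MB/s including replicas
        double perBrokerWriteMBps = clusterWriteMBps / brokers;                         // ~0.38 MB/s per broker

        System.out.printf("ingress=%.2f MB/s, cluster writes=%.2f MB/s, per broker=%.2f MB/s%n",
                ingressMBps, clusterWriteMBps, perBrokerWriteMBps);
    }
}
```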
More details about the architecture are available at confluent.
Within a Kafka Cluster, a single broker works as a controller. If you have a cluster of 100 brokers then one of them will act as the controller.
Internally, each broker tries to create an ephemeral node in ZooKeeper (/controller). The first one to succeed becomes the controller. The other brokers get a "node already exists" exception and set a watch on the controller node. When the controller dies, the ephemeral node is removed and the watching brokers are notified, triggering a new controller election.
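As a rough illustration of that pattern (this is not Kafka's actual implementation, just the ephemeral-node-plus-watch idea sketched with the plain ZooKeeper Java client; the broker id is made up):

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ControllerElectionSketch {
    private final ZooKeeper zk;
    private final String brokerId;

    ControllerElectionSketch(ZooKeeper zk, String brokerId) {
        this.zk = zk;
        this.brokerId = brokerId;
    }

    void tryToBecomeController() throws Exception {
        try {
            // Ephemeral node: it disappears automatically when this broker's ZK session dies.
            zk.create("/controller", brokerId.getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("Broker " + brokerId + " is now the controller");
        } catch (KeeperException.NodeExistsException e) {
            // Someone else won the race: watch the node and retry when it goes away.
            zk.exists("/controller", event -> {
                if (event.getType() == org.apache.zookeeper.Watcher.Event.EventType.NodeDeleted) {
                    try {
                        tryToBecomeController();
                    } catch (Exception ignored) {
                    }
                }
            });
        }
    }
}
```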
The functionality of the controller can be found here.
The __consumer_offsets topic is used to store the offsets committed by consumers. Its default partition count is 50, but it can be set to more. To change it, set the offsets.topic.num.partitions property.
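Note that offsets.topic.num.partitions only takes effect when the topic is first created. For an existing topic, partitions can be added through the Admin API; below is a minimal sketch against a placeholder topic name and broker address. Growing __consumer_offsets specifically should be done with care, since the group coordinator maps consumer groups to partitions by hashing the group id:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

public class IncreasePartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address

        try (AdminClient admin = AdminClient.create(props)) {
            // Grow a topic to 100 partitions in total (partitions can only be added, never removed).
            admin.createPartitions(
                    Collections.singletonMap("some-topic", NewPartitions.increaseTo(100))
            ).all().get();
        }
    }
}
```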
I'm trying to understand consistency maintenance in Kafka. Please find the scenario below and help me understand it.
Number of partition = 2
Replication factor = 3
Number of broker in the cluster = 4
In that case, how many nodes should acknowledge a write to achieve strong consistency: acks = all, acks = 3, or some other value? Please confirm.
You might be interested in seeing the When it Absolutely, Positively, Has to be There talk from Kafka Summit,
which was given by an engineer at Cloudera. Cloudera also has their own documentation on Kafka availability.
To summarize, more than 1 replica and an in-sync replica count higher than 1 are a good start. Then, on the producer, if you are okay with sacrificing throughput for durability, meaning all in-sync replicas must have a write before you continue, use acks=all. Otherwise, if you trust the leader broker to be highly available and unclean leader election is disabled, acks=1 should be okay in most cases.
acks=3 isn't a valid config, by the way. I think you are looking for min.insync.replicas=2 and acks=all with a replication factor of 3; from the above link:
If min.insync.replicas is set to 2 and acks is set to all, each message must be written successfully to at least two replicas. This guarantees that the message is not lost unless both hosts crash
Also, as of Kafka 0.11 you can enable the idempotent producer, the building block for transactions, to work towards exactly-once processing:
enable.idempotence=true
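For completeness, a minimal sketch of that producer configuration with the Java client; the bootstrap address and topic name are placeholders, and min.insync.replicas itself is a topic/broker setting rather than a producer one:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DurableProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Wait for all in-sync replicas to acknowledge each write.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Idempotent producer: retries cannot introduce duplicates within a partition.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

        // min.insync.replicas=2 on the topic/broker, combined with acks=all,
        // means at least 2 replicas must hold the write before it is acknowledged.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("some-topic", "key", "value"));
        }
    }
}
```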
In your setting, what you have is
4 brokers
Replication factor = 3
That means each message in a given partition will be replicated to 3 out of 4 brokers, including the leader for that partition.
In order to achieve strong consistency guarantees, you have to set min.insync.replicas to 2 and use acks=all. This way, you are guaranteed that each write goes to at least 2 of the 3 brokers which hold the data before it is acknowledged.
Setting acks to all provides the highest consistency guarantee at the expense of slower writes to the cluster.
If you use older versions of Kafka where unclean leader election is true by default, you should also consider setting it to false explicitly. This way, an out-of-sync broker won't be elected as the leader if the leader crashes (effectively trading some availability for consistency).
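If you want to apply those two topic-level settings programmatically, a minimal sketch with the Java AdminClient (Kafka 2.3+ for incrementalAlterConfigs; the broker address and topic name are placeholders) could look like this:

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class TightenTopicDurability {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "some-topic");
            Collection<AlterConfigOp> ops = Arrays.asList(
                    // Require 2 in-sync replicas before an acks=all write is accepted.
                    new AlterConfigOp(new ConfigEntry("min.insync.replicas", "2"),
                            AlterConfigOp.OpType.SET),
                    // Never let an out-of-sync replica become leader.
                    new AlterConfigOp(new ConfigEntry("unclean.leader.election.enable", "false"),
                            AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Collections.singletonMap(topic, ops)).all().get();
        }
    }
}
```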
Also, Kafka is a system where all reads go through the leader. This is a bit different from some other distributed systems, such as ZooKeeper, which support read replicas. So you do not end up in a situation where a client reads directly from a stale broker. The leader ensures that writes are ordered, replicated to the designated number of in-sync replicas, and acknowledged based on your acks setting.
If you are looking for consistency in the sense of the ACID properties, all replicas need to acknowledge. Since you have 3 replicas, all 3 of those nodes should acknowledge.
I have a problem with some Kafka topics and haven't been able to find an answer to it yet.
While adding more partitions to __confluent.support.metrics shouldn't be a problem (I know how to do that), I wonder whether it is possible to tell it to use brokers which apparently cannot be "seen" by this topic.
Also, I'd love to understand why these topics only use some brokers instead of all 5 available brokers in their cluster.
I'd love to fix these topics. But I fear that if I tell them to add (or use) partitions on brokers the topic can't "see", it might not work or even destroy the topic, which would be rather bad.
How can I instruct these topics, that there are 5 available brokers? Can I do it with one of the Kafka tools?
How could that have happened in the first place?
Why does the __consumer_offsets topic only "see" 4 brokers instead of 5 like all other topics in this cluster do?
FYI: I didn't set up any of this, but I have to clean up/revamp the running clusters and am stuck now; I've never come across this sort of problem before.
The reason this has happened is because you have only one partition and one replica for the __confluent.support.metrics topic. In a 5-node cluster, this means you will only be using 20% of the available brokers in the cluster, which corresponds with the image you've posted. A topic with replication-factor 1 and 1 partition will only ever hold data on one broker.
On the other hand, it is unusual that your __consumer_offsets topic would be using only 4 out of 5 brokers. My guess would be that your 5th broker was not online at the time of creation of __consumer_offsets (this is created when you consume from any topic for the first time) and thus no partitions were created on this broker.
However, this is probably nothing to worry about, as the spread of partitions across the cluster is generally handled by Kafka itself rather than being a user problem. There is no concept of a topic "seeing" a broker per se; rather, the brokers hold the data for the topics, and the topics will know which brokers they reside on. A topic doesn't generally need to concern itself with other brokers.
Both the consumer offsets and Confluent metrics topics have line items in the server properties file that determine what configuration those topics will be created with.
To improve the health of those topics, you can increase the replication factor, which will spread your topic over more brokers and provide fault tolerance. Also see the Kafka Tools wiki.
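If you do decide to spread an existing topic's partitions over more brokers, Kafka 2.4+ also exposes reassignment through the Admin API. A minimal sketch, with made-up broker ids and the usual placeholder bootstrap address:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Optional;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.common.TopicPartition;

public class SpreadPartition {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address

        try (AdminClient admin = AdminClient.create(props)) {
            // Move partition 0 of the topic onto brokers 1, 3 and 5 (made-up broker ids);
            // the first id in the list becomes the preferred leader.
            TopicPartition tp = new TopicPartition("__confluent.support.metrics", 0);
            NewPartitionReassignment target =
                    new NewPartitionReassignment(Arrays.asList(1, 3, 5));
            admin.alterPartitionReassignments(
                    Collections.singletonMap(tp, Optional.of(target))
            ).all().get();
        }
    }
}
```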
Kafka has the concept of an in-sync replica set, which is the set of nodes that aren't too far behind the leader.
What happens if the network cleanly partitions so that a minority containing the leader is on one side, and a majority containing the other in-sync nodes on the other side?
The minority/leader-side presumably thinks that it lost a bunch of nodes, reduces the ISR size accordingly, and happily carries on.
The other side probably thinks that it lost the leader, so it elects a new one and happily carries on.
Now we have two leaders in the same cluster, accepting writes independently. In a system that requires a majority of nodes to proceed after a partition, the old leader would step down and stop accepting writes.
What happens in this situation in Kafka? Does it require majority vote to change the ISR set? If so, is there a brief data loss until the leader side detects the outages?
I haven't tested this, but I think the accepted answer is wrong and Lars Francke is correct about the possibility of split-brain.
Zookeeper quorum requires a majority, so if ZK ensemble partitions, at most one side will have a quorum.
Being a controller requires having an active session with ZK (ephemeral znode registration). If the current controller is partitioned away from ZK quorum, it should voluntarily stop considering itself a controller. This should take at most zookeeper.session.timeout.ms = 6000. Brokers still connected to ZK quorum should elect a new controller among themselves. (based on this: https://stackoverflow.com/a/52426734)
Being a topic-partition leader also requires an active session with ZK. A leader that has lost its connection to the ZK quorum should voluntarily stop being one. The elected controller will detect that some ex-leaders are missing and will assign new leaders from the replicas that are in the ISR and still connected to the ZK quorum.
Now, what happens to producer requests received by the partitioned ex-leader during ZK timeout window? There are some possibilities.
If producer's acks = all and topic's min.insync.replicas = replication.factor, then all ISR should have exactly the same data. The ex-leader will eventually reject in-progress writes and producers will retry them. The newly elected leader will not have lost any data. On the other hand it won't be able to serve any write requests until the partition heals. It will be up to producers to decide to reject client requests or keep retrying in the background for a while.
Otherwise, it is very probable that the new leader will be missing up to zookeeper.session.timeout.ms + replica.lag.time.max.ms = 16000 worth of records and they will be truncated from the ex-leader after the partition heals.
Let's say you expect network partitions that last longer than you are comfortable being read-only for.
Something like this can work:
you have 3 availability zones and expect that at most 1 zone will be partitioned from the other 2
in each zone you have a Zookeeper node (or a few), so that 2 zones combined can always form a majority
in each zone you have a bunch of Kafka brokers
each topic has replication.factor = 3, one replica in each availability zone, min.insync.replicas = 2
producers' acks = all
This way there should be two Kafka ISRs on the ZK-quorum side of the network partition, at least one of them fully up to date with the ex-leader. So there is no data loss on the brokers, and the cluster stays available for writes from any producers that can still connect to the winning side.
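As a sketch of the topic side of such a setup, here is a minimal example of creating a topic with replication.factor = 3 and min.insync.replicas = 2, assuming every broker has broker.rack set to its availability zone so that Kafka's rack-aware assignment spreads the 3 replicas across the 3 zones (topic name and address are placeholders):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class ZoneSpreadTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address

        try (AdminClient admin = AdminClient.create(props)) {
            // Replication factor 3: with broker.rack set to the AZ on every broker,
            // the 3 replicas of each partition land in 3 different zones.
            NewTopic topic = new NewTopic("events", 6, (short) 3)
                    .configs(Collections.singletonMap("min.insync.replicas", "2"));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```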
In a Kafka cluster, one of the brokers is elected to serve as the controller.
Among other things, the controller is responsible for electing new leaders. The Replica Management section covers this briefly: http://kafka.apache.org/documentation/#design_replicamanagment
Kafka uses ZooKeeper to try to ensure there's only 1 controller at a time. However, the situation you described could still happen, splitting both the ZooKeeper ensemble (assuming both sides can still have quorum) and the Kafka cluster in 2, resulting in 2 controllers.
In that case, Kafka has a number of configurations to limit the impact:
unclean.leader.election.enable: false by default, this is used to prevent replicas that were not in-sync from ever becoming leaders. If no available replicas are in-sync, Kafka marks the partition as offline, preventing data loss
replication.factor and min.insync.replicas: For example, if you set them to 3 and 2 respectively, in case of a "split-brain" you can prevent producers from sending records to the minority side if they use acks=all
See also KIP-101 for the details about handling logs that have diverged once the cluster is back together.
Are there any tools or operations that can be used to mitigate data loss when a Kafka broker fails in a multi-node Kafka cluster?
Well, replication is an important feature of Kafka and a key element to avoid data loss. In particular, should one of your brokers go down, the replicas on other brokers will be used by the consumers just as if nothing happened (from the business side). Of course, this has consequences on connections, bandwidth, etc.
However, a message must have been properly produced in order to be replicated.
So basically, if you have a replication factor higher than 1, this should be safe, as long as your producers don't go down.
The default.replication.factor is 1, so set replication (at the topic level or cluster-wide) to 2 or 3. Of course, you need 2 or 3 brokers.
http://kafka.apache.org/documentation.html#basic_ops_increase_replication_factor
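Before running the reassignment described at that link, it can help to check what the current replication actually looks like. A minimal sketch with the Java AdminClient and a placeholder topic name and address:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class CheckReplication {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address

        try (AdminClient admin = AdminClient.create(props)) {
            // Print the leader, replica count and in-sync replica count for each partition.
            TopicDescription desc = admin.describeTopics(Collections.singleton("some-topic"))
                    .all().get().get("some-topic");
            desc.partitions().forEach(p ->
                    System.out.printf("partition %d: leader=%s replicas=%d in-sync=%d%n",
                            p.partition(), p.leader(), p.replicas().size(), p.isr().size()));
        }
    }
}
```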