From the Kafka documentation and from some other blogs I read, I concluded that one Kafka broker holds one partition of a topic. Here it says one Kafka broker holds only one partition. I have only one broker in my system, but I am able to create a topic with 3 partitions and a replication factor of 1. I also tried to create a topic with 3 partitions and a replication factor of 3 with only one broker. It throws the error below:
Error while executing topic command : replication factor: 3 larger than available brokers: 1
[2017-10-21 15:35:25,928] ERROR org.apache.kafka.common.errors.InvalidReplicationFactorException: replication factor: 3 larger than available brokers: 1
(kafka.admin.TopicCommand$).
So I have a few questions:
Does a Kafka broker hold replicas instead of partitions?
If I create 3 partitions with a single broker, what happens?
In the case of 1 broker, replication factor 1 and 3 partitions, how many partitions of a single topic can the Kafka broker hold?
Somebody, please explain what happens here.
The post you are referring to doesn't say that one broker can store only one partition. It just says that a partition is not splittable across brokers (a topic is). Actually, I manage brokers with thousands of partitions. So, for your questions:
Kafka brokers hold many partitions. Replication is the way to store multiple copies of partitions across the cluster.
If you create a topic with 3 partitions on a single-node cluster, the broker will hold the data for all three partitions. Replication is not possible, since it requires more nodes.
All of them.
Summary: the replication factor must be less than or equal to the number of brokers you have.
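For illustration, here is a minimal sketch of the same experiment using the Java AdminClient (the broker address localhost:9092 and the topic names demo-rf1/demo-rf3 are assumptions, not from the question):

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    import java.util.List;
    import java.util.Properties;
    import java.util.concurrent.ExecutionException;

    public class CreateTopicsDemo {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (Admin admin = Admin.create(props)) {
                // Works on a single broker: 3 partitions, replication factor 1.
                // All three partitions end up on the one broker.
                admin.createTopics(List.of(new NewTopic("demo-rf1", 3, (short) 1)))
                     .all().get();

                // Fails on a single broker: replication factor 3 > 1 available broker.
                try {
                    admin.createTopics(List.of(new NewTopic("demo-rf3", 3, (short) 3)))
                         .all().get();
                } catch (ExecutionException e) {
                    // The cause is InvalidReplicationFactorException, the same
                    // error the topic command printed above.
                    System.err.println(e.getCause());
                }
            }
        }
    }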
For example, I have a topic that has 2 partitions, and a producer using the default partitioner (round-robin, I assumed) writes to the topic. At some point, partition 1 becomes unavailable because all of its replica brokers go offline. Assuming the messages have no keys, will the producer resend those messages to partition 2, or does it simply get stuck?
That is an interesting question and we should look at it from a broader (cluster) perspective.
At some point, partition 1 becomes unavailable because all of the replica brokers go offline.
I see the following scenarios:
All replica brokers of partition one are different from the replica brokers of partition two.
All replica brokers of partition one are the same as for partition two.
Some replica brokers are the same as for partition two.
In scenario "1" it means you still have enough brokers alive as the replication factor is a topic-wide not a partition-based configuration. In that case as soon as the first broker goes down its data will be moved to another broker to ensure that your partition always has enough in-sync replicas.
In scenarios "2" both partitions become unavailable and your KafkaProducer will eventually time out. Now, it depends if you have other brokers that are alive and can take on the data of the partitions.
In scenario "3" the dead replicas would be shifted to running brokers. During that time the KafkaProducer will only write to partition 2 as this is the only available partition in the topic. As soon as partition 1 has enough in-sync replicas the producer will start producing again to both partitions.
Actually, I could think of many more scenarios. If you need a more concrete answer, you need to specify:
how many brokers you have,
what your replication factor actually is, and
in what order the brokers go down.
Assuming the messages have no specified keys, will the producer resend the messages to partition 2?
The KafkaProducer will not re-send the data that was previously sent to partition 1 to partition 2. Whatever was written to partition 1 will stay in partition 1.
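A small sketch of that behavior with the Java producer (the topic name and broker address are made up for illustration): a record without a key is bound to a partition by the default partitioner at send time, and if that partition stays offline the send fails rather than being rerouted.

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    import java.util.Properties;

    public class ProducerStuckDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            // Bound how long the producer keeps retrying before it gives up.
            props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 60000);

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // No key: the default partitioner picks the partition itself.
                ProducerRecord<String, String> record =
                        new ProducerRecord<>("two-partition-topic", "hello");
                producer.send(record, (metadata, exception) -> {
                    if (exception != null) {
                        // e.g. a TimeoutException if the record's partition stayed
                        // offline; the record is NOT rerouted to partition 2.
                        System.err.println("Send failed: " + exception);
                    } else {
                        System.out.printf("Wrote to partition %d at offset %d%n",
                                metadata.partition(), metadata.offset());
                    }
                });
            } // close() flushes outstanding sends
        }
    }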
A general question. Assume a topic has 3 Kafka partitions on different servers (brokers), and each partition has 10 messages with offsets 0,1,...,9 (a greater offset means the message arrived more recently). Let's say one partition happens to go down because its server is down. What is Kafka's strategy for rebalancing the 10 messages of the lost partition onto the other partitions?
Visually, we have
broker 1 partition: |1-0|1-1|1-2|1-3|1-4|1-5|1-6|1-7|1-8|1-9|
broker 2 partition: |2-0|2-1|2-2|2-3|2-4|2-5|2-6|2-7|2-8|2-9|
broker 3 partition: |3-0|3-1|3-2|3-3|3-4|3-5|3-6|3-7|3-8|3-9|
Now if broker 3 is down, how will 3-0 to 3-9 be inserted into broker 1 and broker 2?
(My assumption is that by default they will be spread half-and-half at random, appended to the tails of brokers 1 and 2 in the order of broker 3's timestamps, and that maybe there is somewhere one can configure this behavior in code?)
Thanks in advance.
If a partition only exists on a single broker (replication factor 1) then when this broker is offline, the partition is not available. This is what you drew in your question.
To keep data available even when brokers go down you have to create topics with a replication factor greater than 1.
Then the data of the partition will be replicated onto several brokers, and if one of them goes offline, user traffic will be redirected to the available replicas.
I suggest you go through the Replication section in the docs to understand how this works.
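If you want to check where the replicas of a partition live and which ones are in sync, here is a minimal sketch with the Java AdminClient (broker address and topic name are assumptions):

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartitionInfo;

    import java.util.Collections;
    import java.util.Properties;

    public class DescribeTopicDemo {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (Admin admin = Admin.create(props)) {
                TopicDescription desc = admin
                        .describeTopics(Collections.singleton("my-topic"))
                        .allTopicNames().get()   // Kafka clients 3.1+
                        .get("my-topic");
                for (TopicPartitionInfo p : desc.partitions()) {
                    // Each partition has one leader and a set of in-sync replicas.
                    // If the leader dies, a new leader is elected from the ISR.
                    System.out.printf("partition %d: leader=%s replicas=%s isr=%s%n",
                            p.partition(), p.leader(), p.replicas(), p.isr());
                }
            }
        }
    }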
The diagram below will help you understand how Kafka replicates partitions. If one broker is down, a consumer can read from another broker, because Kafka has replication. (Of course, you need to set it up as shown below.)
For example, if broker 1 dies, broker 2 will become a leader of topic1-part1, and a consumer can read from it.
ZooKeeper will know if a broker (and hence its partitions) is down, and another leader will be appointed.
I'm trying to understand the relationship between the replication factor and consumer groups. Example: number of partitions = 2, replication factor = 3, number of consumers in the consumer group = 4. In this case:
How many consumers will receive the message?
How does the replication factor impact the number of consumers that receive it?
For your first question, since you have two partitions in your example, only 2 of the 4 consumers will actually get data. The other two consumers will not have any partitions assigned to them, because there aren't any partitions left for that consumer group. If you had a different consumer group, then those consumers would still be assigned partitions.
Additionally, in this case, you mention there's only a single message coming through. Depending on which partition it's assigned to, the message will only be sent to that partition. So in this case, only one of the four consumers will get the message, the one that had that partition assigned to it.
As for your second question, the replication factor configuration in Kafka doesn't impact the number of messages consumers receive. Replication, as far as consumers and producers are concerned, is an internal Kafka cluster detail that they don't need to worry about. As long as they're producing to, and consuming from, the leader of the partition, that's all they need to know. One topic could have replication factor 2 and another replication factor 10, and they would both behave identically to producers and consumers.
There are a few more details in the official Kafka documentation: https://kafka.apache.org/documentation/#theconsumer
To give some additional detail on the replication factor: it doesn't have any relation whatsoever to the number of consumers receiving messages from the topic. Replication serves only one major purpose, and that is high availability. So, let's say you have 3 brokers in a cluster, and for a topic my-topic you've set the replication factor to 2. Now, if at most one broker goes down at some point in time, you'd still be okay, as the messages are replicated on another broker for the topic.
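One way to see the partition assignment for yourself is to run four copies of a consumer like this sketch (same group.id; the broker address, group name, and topic name are assumptions) and print what each instance gets; two of the four will end up with an empty assignment:

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class AssignmentDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singleton("two-partition-topic"));
                consumer.poll(Duration.ofSeconds(5)); // joins the group, triggers a rebalance
                // With 2 partitions and 4 members in the group, two consumers
                // get an empty assignment here and will never receive records.
                System.out.println("Assigned: " + consumer.assignment());
            }
        }
    }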
I have read from here and am a bit unsure about the partition log.
First they say:
For each topic, the Kafka cluster maintains a partitioned log that looks like this:
Then they show a picture:
Also they say
The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second they act as the unit of parallelism—more on that in a bit.
Do I understand correctly that:
A cluster can have only one partition log of a topic? In other words, two partitions of the same topic cannot be in the same cluster?
A cluster can have multiple partition logs from different topics?
The picture of a topic should be more like this?
A topic consists of one or many partitions. You specify the number of partitions when creating the topic, and partitions can also be added after creation.
Kafka will spread the partitions over as many brokers as it can in the cluster. If you only have a single broker, then they will all be on this broker.
Many partitions from the same topic can live on the same broker. This happens all the time as most clusters only have a dozen brokers and it's not uncommon to have 50 partitions, hence several partitions from the same topic will live on the same broker.
What the docs say is that a partition is a unit that cannot be split. It's either on a broker or not. A topic, on the other hand, is just a collection of partitions that share the same name and configuration.
To answer your question:
For a Kafka cluster of b brokers and a topic with p partitions, each broker will hold roughly p/b partitions as the primary copy. It might also hold replica partitions, but that depends on your replication factor. So, for example, if you have a 3-node cluster and a topic test with 6 partitions, each node will hold 2 of its partitions.
Yes, it surely can. Extending the previous point, if you have two topics test1 and test2, each with 6 partitions, then each broker will hold 4 partitions in total (2 for each topic).
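To verify the spread on a real cluster, here is a rough sketch that counts how many partition leaders each broker holds (topic names test1 and test2 are taken from the example above; the broker address is an assumption, and all leaders are assumed online):

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import java.util.stream.Collectors;

    public class LeaderSpreadDemo {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (Admin admin = Admin.create(props)) {
                // Map of broker id -> number of partition leaders it hosts.
                Map<Integer, Long> leadersPerBroker = admin
                        .describeTopics(List.of("test1", "test2"))
                        .allTopicNames().get().values().stream()
                        .flatMap(d -> d.partitions().stream())
                        .collect(Collectors.groupingBy(
                                p -> p.leader().id(), Collectors.counting()));
                // On a balanced 3-node cluster with two 6-partition topics,
                // this prints something like {0=4, 1=4, 2=4}.
                System.out.println(leadersPerBroker);
            }
        }
    }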
I guess in the diagram you have mislabeled brokers as cluster.
I just started exploring Kafka.
I have a query regarding Kafka topics and partitions.
Let's assume we have 3 machines: x.x.x.1, x.x.x.2, x.x.x.3.
We have a topic Test that has 3 partitions and a replication factor of 3, spread across the 3 machines above.
Is it possible to always write the 1st partition's data to machine 1, the 2nd partition's data to machine 2, and the 3rd partition's data to machine 3?
If it is possible, then how?
The way the partition assignment works is the following.
Starting from a random broker ID (which might not be x.x.x.1 but, say, x.x.x.3 instead), the leader of partition 0 is assigned to that broker, the leader of partition 1 to the next one, and so on.
So, for example, if broker x.x.x.2 is chosen, then the leader of partition 0 will be on it, the leader of partition 1 on x.x.x.3, and finally the leader of partition 2 on x.x.x.1.
The follower replicas are assigned by incrementing the starting broker by one: in this example, the first follower for partition 0 will be on x.x.x.3 and the second follower on x.x.x.1. The same happens for the follower replicas of partitions 1 and 2. In this way replication provides high availability and traffic is balanced across the cluster.
By the way, there is a tool named "kafka-reassign-partitions.sh" that you can use to specify your preferred assignment through a JSON file. You can find more here: https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools#Replicationtools-4.ReassignPartitionsTool
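For illustration, a minimal reassignment file for the 3-partition topic Test from the question might look like this (the broker IDs 1, 2, 3 are assumptions; the first replica in each list becomes the preferred leader):

    {
      "version": 1,
      "partitions": [
        {"topic": "Test", "partition": 0, "replicas": [1, 2, 3]},
        {"topic": "Test", "partition": 1, "replicas": [2, 3, 1]},
        {"topic": "Test", "partition": 2, "replicas": [3, 1, 2]}
      ]
    }

You would then pass that file to the tool via its --reassignment-json-file option together with --execute.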