How does offset work when I have multiple topics on one partition in Kafka? - apache-kafka

I am trying to develop a better understanding of how Kafka works. To keep things simple, currently I am running Kafka on one Zookeeper with 3 brokers and one partition with duplication factor of 3. I learned that, in general, it's better to have number of partitions ~= number of consumers.
Question 1: Do topics share offsets in the same partition?
I have multiple topics (e.g. dogs, cats, dinosaurs) on one partition (e.g. partition 0). Now my producers have produced a message to each of the topics. "msg: bark" to dogs, "msg: meow" to cats and "msg: rawr" to dinosaurs. I noticed that if I specify dogs[0][0], I get back bark and if I do the same on cats and dinosaurs, I do get back each message respectively. This is an awesome feature but it contradicts with my understanding. I thought offset is specific to a partition. If I have pushed three messages into a partition sequentially. Shouldn't the messages be indexed with 0, 1, and 2? Now it seems me that offset is specific to a topic.
This is how I imagined it
['bark', 'meow', 'rawr']
In reality, it looks like this
['bark']
['meow']
['rawr']
But that can't be it. There must be something keeping track of offset and the actual physical location of where the message is in the log file.
Question 2: How do you manage your messages if you were to have multiple partitions for one topic?
In question 1, I have multiple topics in one partition, now let's say I have multiple partitions for one topic. For example, I have 4 partitions for the dogs topic and I have 100 messages to push to my Kafka cluster. Do I distribute the messages evenly across partitions like 25 goes in partition 1, 25 goes in partition 2 and so on...?
If a consumer wants to consume all those 100 messages at once, he/she needs to hit all four partitions. How is this different from hitting 1 partition with 100 messages? Does network bandwidth impose a bottleneck?
Thank you in advance

For your question 1: It is impossible to have multiple topics on one partition. Partition is part of topic conceptually. You can have 3 topics and each of them has only one partition. So you have 3 partitions in total. That explains the behavior that you observed.
For your question 2: AT the producer side, if a valid partition number is specified that partition will be used when sending the record. If no partition is specified but a key is present, a partition will be chosen using a hash of the key. If neither key nor partition is present a partition will be assigned in a round-robin fashion. Now the number of partitions decides the max parallelism. There is a concept called consumer group, which can have multiple consumers in the same group consuming the same topic. In the example you gave, if your topic has only one partition, the max parallelism is one and only one consumer in the consumer group will receive messages (100 of them). But if you have 4 partitions, you can have up to 4 consumers, one for each partition and each receives 25 messages.

Related

When to create new Consumer in ConsumerGroup

I am newbie in Kafka world and was reading about Consumer and ConsumerGroup.I got the difference between them and understand why we need ConsumerGroup in Kafka.
But here my question is When we should decide when to create new Consumer within same Group.
When we have huge amount of data?
Could someone help me to understand any real use case.
Thanks
I think some very good points have already been mentioned and here are my few cents. As your primary question seems to be "When" to add a consumer in a group...
There are 2 scenarios I could think of:
If one or more consumers in a Consumer group are overloaded by consumption from multiple partitions and you intend to distribute that load and increase parallelism. In this case, you could add consumers and trigger a rebalance.
If the partitions in a topic are increasing. This is quite a tricky scenario and may disturb the existing consumers in some ways. Following are a few examples of when this might happen:
a) If the semantics of your data are changing as partitioning a topic
based on the semantics is quite a common use case
b) If the data volume is increasing and the semantics are also changing
c) If only the volume is increasing that is leading to Scenario 1
However, as you've pointed out in your question - if only the volume is increasing and the consumers in a group are nicely mapped to the partitions on a 1-to-1 basis then you may be better off leaving things as they are. Otherwise, you might end up in the Scenario 2b.
Hope this helps!
In Apache Kafka, the level of parallelism is defined by the number of partitions. The higher the number of partitions, the higher the level of parallelism one can achieve. Depending on the volume of data, you should set the number of partitions to the desired value. Note that you can not have more active consumers than number of partitions.
For example, assume that you have a topic test with 5 partitions and a consumer group test-group. At any given time, only 5 consumers can be active withing test-group. Say we've got 1000 messages in topic test, then each of the 5 active consumers will consume (approximately) 200 messages. In case you run more than 5 partitions, the remaining will be inactive meaning that they won't consumer any messages at all. Similarly, if you have less consumers than partitions, then some of your active consumers will consumer messages from more than one partition.
Another -less straight-forward- example would be the following (taken from):
In this scenario, we do have two topics (A and B), each of which has 3 partitions. Two consumers belonging to the same consumer group are consuming messages from both topics.
As mentioned above, Kafka scales the topic consumption by distributing partitions among a consumer group. A consumer group is nothing, but a set of consumers sharing the common identifier.
A consumer is responsible to consumer messages from one or more partitions. If there is a single consumer running in the consumer group, it will consume data from all partitions. If there are multiple consumers running with in same group, they distribute the load in consumes from different-different partitions.
Maximum number of consumers are equal to the maximum number of partitions. If the consumers number exceeds than number of partitions, excessive consumers will be idle.
Let's say if there is a topic with 4 partitions. There are two consumer groups A and B. Group A has two consumers C1,C2. Both consumers will consume from approx 2 and 2 partitions.
While in Consumer Group B, there are 4 consumers, each consumer will consume from one partition.
When to use single consumer or multiple consumer : It depends on the use case. If you want a consolidated output from the processing where the calculations are based on the entire data in the topic, you should use single consumer unless you have a post processing logic to merge the output from each consumer.
If you are just reading the data and want to parallelize the process by distributing load, use multiple consumers

Scaling up kafka consumer applications

Lets say I have one consumer group which subscribed to 4 topics and partitions for each topics are:-
EDITED:
First topic => 5 partitions
Second topic => 3 partitions
Third topic => 2 partitions
Fourth topic => 1 partitions
Total number of partitions = 11. So total how many applications I can run.
5(max number of partitions in input topics) or 11?
In kafka, scaling consumers depends on partition number.
Lets assume you have one topic with 3 partitions. And you have 2 different consumer app (different consumer groups) which does different works.
You can scale your consumer number up to 3 for per consumer group.
Single consumer (consumer group A) can consume messages from 3
partitions.
Two consumer (same consumer group) can not consume single
partition.
Take look at image : https://hadoopabcd.files.wordpress.com/2015/04/consumer-group.png
Read more about consumer groups blog series : https://dzone.com/articles/understanding-kafka-consumer-groups-and-consumer-l
In ideal situation the number of consumer in the consumer group should be equal to the number of partition. If that is not the case then you can have more then one consumer group kafka provides the feature that 2 consumer from the different consumer group can read from the same partition. That’s totally depends on your resources how many resources do you have for running the consumers.
Suppose you have an application that needs to read messages from a Kafka topic, run some validations against them, and write the results to another data store. In this case your application will create a consumer object, subscribe to the appropriate topic, and start receiving messages, validating them and writing the results. This may work well for a while, but what if the rate at which producers write messages to the topic exceeds the rate at which your application can validate them? If you are limited to a single consumer reading and processing the data, your application may fall farther and farther behind, unable to keep up with the rate of incoming messages. Obviously there is a need to scale consumption from topics. Just like multiple producers can write to the same topic, we need to allow multiple consumers to read from the same topic, splitting the data between them.
Kafka consumers are typically part of a consumer group. When multiple consumers are subscribed to a topic and belong to the same consumer group, each consumer in the group will receive messages from a different subset of the partitions in the topic.
Please refer to this https://www.safaribooksonline.com/library/view/kafka-the-definitive/9781491936153/ch04.html

How kafka partitions behave

Can you explain how kafka partitions works for this scenario
If i produce 9 (1-9) messages round robin with 1 topic & 3 partitions.
Does it means that:
Partition 1 contains: [1,4,7]
Partition 2 contains: [2,5,8]
Partition 3 contains: [3,6,9]
?
Also how many consumers can get all the data 3? why?
Can you explain?
I guess also that consumer group can solve it but not sure why
Can you explain how kafka partitions works for this scenario
Your understanding is correct.
Also how many consumers can get all the data 3? why?
Depends on how many consumers you have in your consumer group.
If you only have 1 consumer in a group, it will get all the messages from all partitions.
If you have 2 consumers in a group, each will claim a subset of the partitions, e.g. 1st consumer will get all messages from partitions 1 and 2 and the 2nd consumer will get messages from partition 3.
If you have 3 consumers in a group, each will get one partition assigned.
If you have more than 3 consumers in a group, 3 consumers will get one partition each and the remaining consumers will not get any messages, just act as redundancy in case of failover.
The distribution of messages in the partitions is correct if and only if you publish messages without keys. In Kafka it is common to publish messages as (Key, Value) pairs and if you produce messages this way then the default partitioner will ensure that all messages of the same key will get put in the same partition. It does this by using a hashing function on each of the keys that maps to one of the available partitions. In the extreme case where all your messages have the same key, then they would all go to the same partition. If your messages all had either a string key "foo" or a key called "bar" then all the messages with key "foo" may go to partition 3 and all the messages with key "bar" may go to partition 1.
In terms of your question about consumers, you can have an unlimited number of consumers. If each consumer has a unique group.id then they are considered independent and they will each get their own full set of the messages from all partitions.
However if you have consumers that share the same group.id then they are said to be in a consumer group and each will get an exclusive and roughly equal subset of the partitions. If you had 3 consumers in the same group they would get 1 partition each. If you added any more than 3 consumers in the same group then the first 3 will get 1 partition each and all the others will be standby consumers than only become active if one of the 3 active consumers leaves the group.
The distribution of the messages through the partitions is correct in the idea. The partitions are the paralelism unit of Kafka.
You can have 3 consumers which will each handle one partition, but you can also have only 1 consumer which will get the data from the 3 partitions. It depends on the throughput you can have/want for each consumer.
Concerning the consumer groups :
If all your consumers have the same consumer group, the messages will be load balanced over the consumers
If your consumers have different consumer groups, then each messages will be broadcast to all consumer processes
FYI : the messages order is only kept within a partition, that is why messages coming from different partitions could be unordered.

Apache Kafka Scaling Topics using partitions

We started to use Apache Kafka to persist Timeseries data into a Timeseries database. What we started with was to just have a single topic, a producer writing to this topic and a single consumer reading from this topic and dumping the data to the Timeseries database.
We had 3 broker instances and what we noticed in the first try was that the producer was pretty fast in writing messages to the topic. Within a matter of 30 minutes, we had around 1.5 million messages. The consumer was just doing 300 messages per second.
Our next approach was to partition the topic and have more consumer instances (equal to the number of partitions). This definitely improved on the consumer write speed. Now my questions are:
What happens if I set my topic partition to 6, but I have only 3 broker instances. Which broker instance would be the leader for partition 1 to 6?
Is there a formula to determine how many partitions would I be needing? Since this was our test environment, we could play with it and scale it. We might not be able to do the same on our production environment. So how to determine the partition size?
The partitions get distributed amongst your brokers. It's impossible to know which broker will be elected leader of a given partition -- and it can change over time. Depending on which version of Kafka and which Consumer API you use, your consumer may or may not discover partition leaders on its own. With the SimpleConsumer you have to find partition leaders on your own, and respond to new leader election in your code (instead of having it handled by the API automatically).
As to the number of partitions -- there's no real "formula" other than this: you can have no more parallelism than you have partitions. If you have 4 partitions and 5 consumers, one of the consumers will starve. I usually use numbers like 12 or 60 or multiples thereof for the number of partitions for large topics. Something that divides easily and cleanly among variable numbers of consumers.
Also, note that you can later on change the number of partitions, with some caveats. See this answer for how and what the caveats are.

How can I scale Kafka consumers?

I'm reading the Kafka documentation and noticed the following line:
Note however that there cannot be more consumer instances in a consumer group than partitions.
Hmm. How can I auto-scale this?
For example let's say I have a messaging system with hi/lo priorities, so I create a topic for messages and partitions for hi and lo priority messages.
If this was RabbitMQ, I'd have an auto-scalable group of consumers assigned to each partition, like this:
If I understand the Kafka model I can't have >1 consumer per partition in a consumer group, so that picture doesn't work for Kafka, right?
Ok, so what about >1 consumer groups like this:
That get's around Kafka's limitation but... If I understand how this works both consumer groups would be pulling from a partition, for example msg.hi, with their own offsets so neither would know about the other--meaning messages would likely be delivered twice!
How can I achieve the capability I had in the Rabbit design w/Kafka and still maintain the "queue-ness" of the behavior (I don't want to send a message twice)? What am I missing?
TL;DR
Topic is made up of partitions. Partitions decide the max number of consumers you can have in a group.
Scenario 1:
When we have only one consumer, It can read all the messages from all the partitions.
Scenario 2:
In the above set up, when you increase the number of consumers in the group, partition reassignment happens and instead of consumer 1 reading all the messages from all the partitions, consumer 2 could share some of the load with consumer 1 as shown below.
Scenario 3:
What happens If I have more number of consumers than the number of partitions.? Each consumer would be assigned 1 partition. Any additional consumers in the group will be sitting idle unless you increase the number of partitions for a Topic.
Summary:
We need to choose the partitions accordingly. That decides the max number of consumers in the group. Changing the partition for an existing topic is really NOT recommended as It could cause issues.
That is, Let's assume a producer producing names into a topic where we have 3 partitions. All the names starting with A-I go to Partition 1, J-R in partition 2 and S-Z in partition 3. Let's also assume that we have already produced 1 million messages. Now if you suddenly increase the number of partitions to 5 from 3, It will create a different A-Z range now. That is, A-F in Partition 1, G-K in partition 2, L-Q in partition 3, R-U in partition 4 and V-Z in partition 5. Do you get it? It kind of affects the order of the messages we had before! So you need to be aware of this. If this could be a problem, then we need to choose the partition accordingly upfront.
More info is here - http://www.vinsguru.com/kafka-scaling-consumers-out-for-a-consumer-group/
Your assumption about messages being consumed twice is correct (since each group consumes 100% of messages from a topic).
I agree with David. Moreover, I suggest that you create more partitions than you really need, which would leave you some headroom to increase the number of threads in the group when such a need arises.
You can always increase the number of partitions later (and/or add additional brokers), but it's nice to have that already done, so that you can only increase number of threads and be done with it (those situations usually require a quick response, so you should do all the prep. that you can do in advance).
Just create a bunch of partitions for hi and lo. 12 is a good number. So is 60. Just pick a number of partitions that matches how much maximum parallelization you want.
Honestly, although I personally would makemsg.hi and msg.lo be different topics entirely, that's not a requirement -- you can do custom parititoning to divide messages between partitions.
You can also use an AI based auto scaler like this https://www.confluent.io/events/kafka-summit-americas-2021/intelligent-auto-scaling-of-kafka-consumers-with-workload-prediction/
This scaler calculates the right number of consumer PODs based on workload prediciton and target KPI metrics