Will consumers in a group subscribe to one topic each, if there is a single partition per topic? - apache-kafka

I'm using Debezium to log changes in my database, and Debezium generates change events in a topic for each table that exists in my database.These change records are consumed to populate another database.
If I restrict each topic to only have 1 partition, and let's say I have 4 consumers running, when the consumers subscribe to topics, will the 4 consumers divide the topics among themselves? (they would distribute partitions among themselves, but here 1 topic = 1 partition)
Would the above setup mean that for each table, the generated events on the topic will always be executed in order because there's at most 1 consumer acting on the topic at the same time?
I'm currently trying to get Kafka to restrict 1 partition per topic, and have 2 consumers. But the 2 consumers seem to pick up different topics every now and then and not be 'sticky' to the topics.

Yes, if there are 4 topics, 1 partition in each topic and 4 consumers, the consumers will evenly distribute the partitions among themselves which will result in 1 topic per consumer.
However, to get a "sticky assignment" you would need to give the consumer groups static group IDs. Otherwise, when there are failures and a rebalance is triggered, the consumer can be assigned a different partition (in this case a different topic).

Related

Kafka Consumers subscribed to different topics in same consumer group

I am starting with kafka , have a question on the consumer groups. We have an application where we want different consumers from same group subscribing to different topics. The grouping is done based on some business criteria . To be specific consumer 1 from group A and consumer 2 from group A are subscribed to Topic 1 and Topic 2 each with 10 partitions. Does this mean that there consumer 1 can scale to 10 and consumer 2 also can scale to 10 since they are subscribed to different topics. Is this a correct design
yes, since within a topic kafka try to assign partitions to consumers as equal as possible. The key is topic:consumer_group_id so it doesn't matter another_topic:same_consumer_group_id - it's another key, and consumer with topic:consumer_group_id could be scaled to partitions number

Number of consumers in kafka comsumer-group

If a producer has 3 topics and 4 partitions each topic, should the consumer group contains 4 or 12 consumers?
I want to achieve ideal consumption.
There should be one consumer each partition for ideal consumption. So, for your case, 12 consumers should be ideal.
If you have N partitions, then you can have up to N consumers within the same consumer group each of which reading from a single partition. When you have less consumers than partitions, then some of the consumers will read from more than one partition. Also, if you have more consumers than partitions then some of the consumers will be inactive and will receive no messages at all.
You cannot have multiple consumers -within the same consumer group- consuming data from a single partition. Therefore, in order to consume data from the same partition using N consumers, you'd need to create N distinct consumer groups too.
Note that partitioning enhances the parallelism within a Kafka cluster. If you create thousands of consumers to consume data from only one partition, I suspect that you will lose some level of parallelism.
If you have 3 topics with 4 partition each.
For best optimisation you should have 4 consumers per consumer group.
Reason : If you have more than 4 consumers ,your extra consumers would be left ideal, because 4 consumers will be assigned 4 partitions with 1 consumer assigned 1 partition. So in short more than 4 consumers is not required per consumer group.
If you have less consumers say 2 consumers for 4 topics , each consumer will consume messages from 2 partitions each which will overload it.
There is no limit in number of consumer groups which subscribe to a topic.

Scaling up kafka consumer applications

Lets say I have one consumer group which subscribed to 4 topics and partitions for each topics are:-
EDITED:
First topic => 5 partitions
Second topic => 3 partitions
Third topic => 2 partitions
Fourth topic => 1 partitions
Total number of partitions = 11. So total how many applications I can run.
5(max number of partitions in input topics) or 11?
In kafka, scaling consumers depends on partition number.
Lets assume you have one topic with 3 partitions. And you have 2 different consumer app (different consumer groups) which does different works.
You can scale your consumer number up to 3 for per consumer group.
Single consumer (consumer group A) can consume messages from 3
partitions.
Two consumer (same consumer group) can not consume single
partition.
Take look at image : https://hadoopabcd.files.wordpress.com/2015/04/consumer-group.png
Read more about consumer groups blog series : https://dzone.com/articles/understanding-kafka-consumer-groups-and-consumer-l
In ideal situation the number of consumer in the consumer group should be equal to the number of partition. If that is not the case then you can have more then one consumer group kafka provides the feature that 2 consumer from the different consumer group can read from the same partition. That’s totally depends on your resources how many resources do you have for running the consumers.
Suppose you have an application that needs to read messages from a Kafka topic, run some validations against them, and write the results to another data store. In this case your application will create a consumer object, subscribe to the appropriate topic, and start receiving messages, validating them and writing the results. This may work well for a while, but what if the rate at which producers write messages to the topic exceeds the rate at which your application can validate them? If you are limited to a single consumer reading and processing the data, your application may fall farther and farther behind, unable to keep up with the rate of incoming messages. Obviously there is a need to scale consumption from topics. Just like multiple producers can write to the same topic, we need to allow multiple consumers to read from the same topic, splitting the data between them.
Kafka consumers are typically part of a consumer group. When multiple consumers are subscribed to a topic and belong to the same consumer group, each consumer in the group will receive messages from a different subset of the partitions in the topic.
Please refer to this https://www.safaribooksonline.com/library/view/kafka-the-definitive/9781491936153/ch04.html

How multiple consumer group consumers work across partition on the same topic in Kafka?

I was reading this SO answer and many such blogs.
What I know:
Multiple consumers can run on a single partition when running multiple consumers with multiple consumer group id and only one consumer from a consumer group can consume at a given time from a partition.
My question is related to multiple consumers from multiple consumer groups consuming from the same topic:
What happens in the case of multiple consumers(different groups) consuming a single topic(eventually the same partition)?
Do they get the same data?
How offset is managed? Is it separate for each consumer?
(Might be opinion based) How do you or generally recommended way is to handle overlapping data across two consumers of a separate group operating on a single partition?
Edit:
"overlapping data": means two consumers of separate consumer groups operating on the same partition getting the same data.
Yes they get the same data. Kafka only stores one copy of the data in the topic partitions' commit log. If consumers are not in the same group then they can each get the same data using fetch requests from the clients' consumer library. The assignment of which partitions each group member will get is managed by the lead consumer of each group. The entire process in detailed steps is documented here https://community.hortonworks.com/articles/72378/understanding-kafka-consumer-partition-assignment.html
Offsets are "managed" by the consumers, but "stored" in a special __consumer_offsets topic on the Kafka brokers.
Offsets are stored for each (consumer group, topic, partition) tuple. This combination is also used as the key when publishing offsets to the __consumer_offsets topic so that log compaction can delete old unneeded offset commit messages and so that all offsets for the same (consumer group, topic, partition) tuple are stored in the same partition of the __consumer_offsets topic (which defaults to 50 partitions)
Each consumer group gets every message from a subscribed topic.
Yes
Offset are stored by partition. For example let's say you have a topic with 2 partitions and a consumer group named cg made up of 2 consumers. In that case Kafka assigns each of the consumers one of the partitions. Then the consumers fetch the offset for the partition they were assigned to from Kafka (e.g. consumer 'asks' Kafka: "What is the offset for this topic for consumer group cg partition 1", or partition 2 for the other consumer). After getting the correct offset the consumer polls some Kafka broker for the next message in that partition.
I'm not entirely sure what you mean by overlapping data, can you clarify a bit or give an example?

How kafka partitions behave

Can you explain how kafka partitions works for this scenario
If i produce 9 (1-9) messages round robin with 1 topic & 3 partitions.
Does it means that:
Partition 1 contains: [1,4,7]
Partition 2 contains: [2,5,8]
Partition 3 contains: [3,6,9]
?
Also how many consumers can get all the data 3? why?
Can you explain?
I guess also that consumer group can solve it but not sure why
Can you explain how kafka partitions works for this scenario
Your understanding is correct.
Also how many consumers can get all the data 3? why?
Depends on how many consumers you have in your consumer group.
If you only have 1 consumer in a group, it will get all the messages from all partitions.
If you have 2 consumers in a group, each will claim a subset of the partitions, e.g. 1st consumer will get all messages from partitions 1 and 2 and the 2nd consumer will get messages from partition 3.
If you have 3 consumers in a group, each will get one partition assigned.
If you have more than 3 consumers in a group, 3 consumers will get one partition each and the remaining consumers will not get any messages, just act as redundancy in case of failover.
The distribution of messages in the partitions is correct if and only if you publish messages without keys. In Kafka it is common to publish messages as (Key, Value) pairs and if you produce messages this way then the default partitioner will ensure that all messages of the same key will get put in the same partition. It does this by using a hashing function on each of the keys that maps to one of the available partitions. In the extreme case where all your messages have the same key, then they would all go to the same partition. If your messages all had either a string key "foo" or a key called "bar" then all the messages with key "foo" may go to partition 3 and all the messages with key "bar" may go to partition 1.
In terms of your question about consumers, you can have an unlimited number of consumers. If each consumer has a unique group.id then they are considered independent and they will each get their own full set of the messages from all partitions.
However if you have consumers that share the same group.id then they are said to be in a consumer group and each will get an exclusive and roughly equal subset of the partitions. If you had 3 consumers in the same group they would get 1 partition each. If you added any more than 3 consumers in the same group then the first 3 will get 1 partition each and all the others will be standby consumers than only become active if one of the 3 active consumers leaves the group.
The distribution of the messages through the partitions is correct in the idea. The partitions are the paralelism unit of Kafka.
You can have 3 consumers which will each handle one partition, but you can also have only 1 consumer which will get the data from the 3 partitions. It depends on the throughput you can have/want for each consumer.
Concerning the consumer groups :
If all your consumers have the same consumer group, the messages will be load balanced over the consumers
If your consumers have different consumer groups, then each messages will be broadcast to all consumer processes
FYI : the messages order is only kept within a partition, that is why messages coming from different partitions could be unordered.