Scaling up kafka consumer applications - apache-kafka

Let's say I have one consumer group subscribed to 4 topics, with the following partition counts:
First topic => 5 partitions
Second topic => 3 partitions
Third topic => 2 partitions
Fourth topic => 1 partition
Total number of partitions = 11. How many instances of the application can I run in total:
5 (the largest partition count among the input topics) or 11?

In Kafka, how far you can scale consumers depends on the number of partitions.
Let's assume you have one topic with 3 partitions, and 2 different consumer applications (in different consumer groups) that do different work.
You can scale up to 3 consumers per consumer group.
A single consumer (consumer group A) can consume messages from all 3 partitions.
Two consumers in the same consumer group cannot consume from the same partition.
Take a look at this image: https://hadoopabcd.files.wordpress.com/2015/04/consumer-group.png
Read more about consumer groups in this blog series: https://dzone.com/articles/understanding-kafka-consumer-groups-and-consumer-l

Ideally, the number of consumers in a consumer group equals the number of partitions. If that is not the case, you can run more than one consumer group: Kafka allows two consumers from different consumer groups to read from the same partition. How many consumers you run depends entirely on the resources you have available for them.
Suppose you have an application that needs to read messages from a Kafka topic, run some validations against them, and write the results to another data store. In this case your application will create a consumer object, subscribe to the appropriate topic, and start receiving messages, validating them and writing the results. This may work well for a while, but what if the rate at which producers write messages to the topic exceeds the rate at which your application can validate them? If you are limited to a single consumer reading and processing the data, your application may fall farther and farther behind, unable to keep up with the rate of incoming messages. Obviously there is a need to scale consumption from topics. Just like multiple producers can write to the same topic, we need to allow multiple consumers to read from the same topic, splitting the data between them.
Kafka consumers are typically part of a consumer group. When multiple consumers are subscribed to a topic and belong to the same consumer group, each consumer in the group will receive messages from a different subset of the partitions in the topic.
Please refer to this https://www.safaribooksonline.com/library/view/kafka-the-definitive/9781491936153/ch04.html
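To make the "5 or 11" question concrete, here is a minimal sketch in plain Java (no Kafka dependency) that models two ways a group could split the partitions of the four topics across its members. It is a simplified model, not Kafka's actual assignor code: rangeStyle divides each topic separately (the idea behind Kafka's RangeAssignor), while roundRobinStyle deals out all partitions of all topics one at a time (the idea behind RoundRobinAssignor). The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified models of two assignment strategies; NOT Kafka's real implementation.
class AssignmentSketch {

    // The four topics from the question, with their partition counts.
    static Map<String, Integer> questionTopics() {
        Map<String, Integer> topics = new LinkedHashMap<>();
        topics.put("first", 5);
        topics.put("second", 3);
        topics.put("third", 2);
        topics.put("fourth", 1);
        return topics;
    }

    // Range-style: every topic is divided on its own, lowest-numbered consumers first.
    static List<List<String>> rangeStyle(Map<String, Integer> topics, int consumers) {
        List<List<String>> owned = emptyAssignments(consumers);
        for (Map.Entry<String, Integer> topic : topics.entrySet()) {
            int partitions = topic.getValue();
            int base = partitions / consumers;   // partitions every consumer gets
            int extra = partitions % consumers;  // first 'extra' consumers get one more
            int next = 0;
            for (int c = 0; c < consumers; c++) {
                int count = base + (c < extra ? 1 : 0);
                for (int i = 0; i < count; i++) {
                    owned.get(c).add(topic.getKey() + "-" + next);
                    next++;
                }
            }
        }
        return owned;
    }

    // Round-robin-style: all partitions of all topics are dealt out one by one.
    static List<List<String>> roundRobinStyle(Map<String, Integer> topics, int consumers) {
        List<List<String>> owned = emptyAssignments(consumers);
        int turn = 0;
        for (Map.Entry<String, Integer> topic : topics.entrySet()) {
            for (int p = 0; p < topic.getValue(); p++) {
                owned.get(turn % consumers).add(topic.getKey() + "-" + p);
                turn++;
            }
        }
        return owned;
    }

    // Counts the consumers that own at least one partition.
    static long activeConsumers(List<List<String>> owned) {
        return owned.stream().filter(parts -> !parts.isEmpty()).count();
    }

    private static List<List<String>> emptyAssignments(int consumers) {
        List<List<String>> owned = new ArrayList<>();
        for (int c = 0; c < consumers; c++) {
            owned.add(new ArrayList<>());
        }
        return owned;
    }

    public static void main(String[] args) {
        Map<String, Integer> topics = questionTopics();
        System.out.println("range-style, 11 consumers, active: "
                + activeConsumers(rangeStyle(topics, 11)));
        System.out.println("round-robin-style, 11 consumers, active: "
                + activeConsumers(roundRobinStyle(topics, 11)));
    }
}
```

In this model, with 11 instances the range-style split leaves only 5 of them with work, while the round-robin-style split keeps all 11 busy. So the useful upper bound is at most 11, and whether you actually reach it depends on the assignment strategy the consumers are configured with (partition.assignment.strategy).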

Related

Can I have multiple consumers process messages from a single queue in Apache Kafka

What I want to achieve is this:
Subscribe multiple consumers to a single topic
Each message should be processed by only one consumer
No consumer should be idle as long as the topic has unprocessed messages
As far as I understand I can get the first two points by defining multiple partitions for that topic, at least one partition per consumer. But that doesn't satisfy my 3rd requirement.
Assume I created a topic with 3 partitions and subscribe 3 consumers (same group id). Then a producer pushes a bulk of 300 messages which are equally distributed to all three partitions. So each partition contains 100 messages and consumers start to process. For whatever reasons one consumer takes longer and at some point when 2 consumers have already processed all messages of their partitions, the 3rd consumer still has several messages left to process.
In that scenario the 2 fast consumers would fall idle while the 3rd one is still processing messages.
What I have in mind is something like a topic with only one partition and all consumers subscribed share the same offset index. Then, whenever a consumer is idle it will fetch the next message from the topic that hasn't been processed by any of the consumers yet. I know that Kafka cannot have multiple consumers of the same group on one partition. It's just to explain my intentions.
Is there a way to configure my topology to meet my requirements?

Consuming from single kafka partition by multiple consumers

I read the following in the Kafka docs:
The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a "fair share" of partitions at any point in time.
Kafka only provides a total order over records within a partition, not between different partitions in a topic.
Per-partition ordering combined with the ability to partition data by key is sufficient for most applications.
However, if you require a total order over records this can be achieved with a topic that has only one partition, though this will mean only one consumer process per consumer group.
I read the following on this page:
Consumers read from any single partition, allowing you to scale throughput of message consumption in a similar fashion to message production.
Consumers can also be organized into consumer groups for a given topic — each consumer within the group reads from a unique partition and the group as a whole consumes all messages from the entire topic.
If you have more consumers than partitions then some consumers will be idle because they have no partitions to read from.
If you have more partitions than consumers then consumers will receive messages from multiple partitions.
If you have equal numbers of consumers and partitions, each consumer reads messages in order from exactly one partition.
Doubts
Does this mean that a single partition cannot be consumed by multiple consumers? Can't we have a single partition and a consumer group with more than one consumer, and make them all consume from that single partition?
If a single partition can be consumed by only a single consumer, why was this design decision made?
What if I need a total order over records and still need them to be consumed in parallel? Is that impossible in Kafka? Or does such a scenario not make sense?
Within a consumer group, at any time a partition can only be consumed by a single consumer. No, you can't have 2 consumers within the same group consuming from the same partition at the same time.
Kafka consumer groups allow multiple consumers to behave "sort of" like a single entity. The group as a whole should consume each message only once. If multiple consumers in a group were to consume the same partition, its records would be processed multiple times.
If you need to consume a partition multiple times, be sure these consumers are in different groups.
When processing needs to happen in order (serially) at any time there's only a single task to do. If you have records 1, 2 and 3 and want to process them in order, you cannot do anything until message 1 has been processed. It's the same for message 2 and 3. So what do you want to do in parallel?

I am reading Kafka documentation and trying to understand the working of it

I am reading the Kafka documentation and trying to understand how it works. This is regarding consumers. In brief, a topic is divided into a number of partitions. There are a number of consumer groups, each having a number of consumer instances. Now, my question is: does each partition send the "same" message to each consumer group, which is in turn given to a specific consumer instance within the group?
If so, how does Kafka ensure that the message is processed by only one consumer?
Kindly guide me if I am missing something.
Well, to put it simply:
we have a topic divided into partitions.
we have consumers that consume data from those topics.
Consumers are part of a consumer group by sharing the same group.id.
Every partition of a topic is consumed by exactly one consumer within a consumer group.
Example:
Topic "test" with 3 partitions.
Consumer group A: with 3 consumers.
Consumer group B: with 2 consumers.
The two consumer groups A and B both consume data from the topic "test".
Within group A, each of the 3 consumers will consume one partition, whereas in group B (with 2 consumers), one consumer will read 2 partitions and the other will consume the remaining one.
If we add one more consumer group with only one consumer in it, that consumer will read all 3 partitions of the topic.
Hope that helps; let me know if anything is unclear.
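The example with topic "test" can be sketched in a few lines of plain Java. This is only a counting model under the assumption that partitions are spread as evenly as possible within one group; the class and method names are hypothetical, and it does not talk to a broker.

```java
import java.util.Arrays;

// Counting model: how many partitions of ONE topic each consumer of ONE group owns.
class GroupShareSketch {

    static int[] partitionsPerConsumer(int partitions, int consumers) {
        int[] counts = new int[consumers];
        for (int p = 0; p < partitions; p++) {
            counts[p % consumers]++; // hand partitions out in turn
        }
        return counts;
    }

    public static void main(String[] args) {
        int partitions = 3; // topic "test"
        System.out.println("group A (3 consumers): "
                + Arrays.toString(partitionsPerConsumer(partitions, 3)));
        System.out.println("group B (2 consumers): "
                + Arrays.toString(partitionsPerConsumer(partitions, 2)));
        System.out.println("group C (1 consumer):  "
                + Arrays.toString(partitionsPerConsumer(partitions, 1)));
        // More consumers than partitions: the extra ones own nothing and stay idle.
        System.out.println("5 consumers:           "
                + Arrays.toString(partitionsPerConsumer(partitions, 5)));
    }
}
```

Each group is computed independently here, which mirrors the point above: every group gets all 3 partitions, and the split only happens among the consumers inside a group.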

When to create new Consumer in ConsumerGroup

I am a newbie in the Kafka world and was reading about Consumer and ConsumerGroup. I understand the difference between them and why we need a ConsumerGroup in Kafka.
My question is: when should we decide to create a new Consumer within the same group?
When we have a huge amount of data?
Could someone help me understand a real use case?
Thanks
I think some very good points have already been mentioned and here are my few cents. As your primary question seems to be "When" to add a consumer in a group...
There are 2 scenarios I could think of:
1) One or more consumers in a consumer group are overloaded by consuming from multiple partitions, and you intend to distribute that load and increase parallelism. In this case, you could add consumers and trigger a rebalance.
2) The number of partitions in a topic is increasing. This is quite a tricky scenario and may disturb the existing consumers in some ways. Following are a few examples of when this might happen:
a) The semantics of your data are changing (partitioning a topic based on its semantics is quite a common use case)
b) The data volume is increasing and the semantics are also changing
c) Only the volume is increasing, which leads to Scenario 1
However, as you've pointed out in your question - if only the volume is increasing and the consumers in a group are nicely mapped to the partitions on a 1-to-1 basis then you may be better off leaving things as they are. Otherwise, you might end up in the Scenario 2b.
Hope this helps!
In Apache Kafka, the level of parallelism is defined by the number of partitions. The higher the number of partitions, the higher the level of parallelism one can achieve. Depending on the volume of data, you should set the number of partitions to the desired value. Note that within a consumer group you cannot have more active consumers than partitions.
For example, assume that you have a topic test with 5 partitions and a consumer group test-group. At any given time, only 5 consumers can be active within test-group. Say we've got 1000 messages in topic test, then each of the 5 active consumers will consume (approximately) 200 messages. If you run more than 5 consumers, the remaining ones will be inactive, meaning that they won't consume any messages at all. Similarly, if you have fewer consumers than partitions, then some of your active consumers will consume messages from more than one partition.
Another -less straight-forward- example would be the following (taken from):
In this scenario, we do have two topics (A and B), each of which has 3 partitions. Two consumers belonging to the same consumer group are consuming messages from both topics.
As mentioned above, Kafka scales topic consumption by distributing partitions among the members of a consumer group. A consumer group is nothing but a set of consumers sharing a common identifier.
A consumer is responsible for consuming messages from one or more partitions. If there is a single consumer running in the consumer group, it will consume data from all partitions. If there are multiple consumers running within the same group, they share the load by consuming from different partitions.
The maximum number of active consumers in a group equals the number of partitions. If the number of consumers exceeds the number of partitions, the excess consumers will be idle.
Let's say there is a topic with 4 partitions and two consumer groups, A and B. Group A has two consumers, C1 and C2. Each of them will consume from 2 partitions.
In consumer group B there are 4 consumers, so each consumer will consume from one partition.
When to use a single consumer or multiple consumers: it depends on the use case. If you want a consolidated output where the calculations are based on the entire data in the topic, you should use a single consumer, unless you have post-processing logic to merge the output from each consumer.
If you are just reading the data and want to parallelize the process by distributing load, use multiple consumers.

Kafka how to consume one topic parallel

I read the Kafka documentation, but I still don't know how to consume one topic in parallel.
Suppose:
I have one topic like "something happened" (don't split this topic), and I have many customers that want to consume it.
So what should I do so that multiple customers can consume it in parallel? Should I use partitioning and customer groups?
I have one idea about this, but I'm not sure whether it is right.
Create many partitions for the same topic and map one partition to each customer, so the producer must write the same message to all of these partitions, with every customer in a different customer group. Is that right?
Using partitions is the way to parallelize the consumption of a topic. Let's say you have 10 partitions for your topic; then you can have 10 consumers in the same consumer group, each reading one partition. If you have fewer consumers than partitions, some consumers will be responsible for more than one partition. If you have more consumers than partitions, some consumers will not get any partition assigned and will have nothing to do except stand by to replace another consumer that has died.
Each topic in Kafka can be organized into many partitions. Partition allows for parallel consumption increasing throughput.
Producer publishes the message to a topic using the Kafka producer client library which balances the messages across the available partitions using a Partitioner. The broker to which the producer connects to takes care of sending the message to the broker which is the leader of that partition using the partition owner information in zookeeper. Consumers use Kafka’s High-level consumer library (which handles broker leader changes, managing offset info in zookeeper and figuring out partition owner info etc implicitly) to consume messages from partitions in streams; each stream may be mapped to a few partitions depending on how the consumer chooses to create the message streams.
For example, if there are 10 partitions for a topic and 3 consumer instances (C1,C2,C3 started in that order) all belonging to the same Consumer Group, we can have different consumption models that allow read parallelism as below
Each consumer uses a single stream.
In this model, when C1 starts all 10 partitions of the topic are mapped to the same stream and C1 starts consuming from that stream. When C2 starts, Kafka rebalances the partitions between the two streams. So, each stream will be assigned to 5 partitions(depending on the rebalance algorithm it might also be 4 vs 6) and each consumer consumes from its stream. Similarly, when C3 starts, the partitions are again rebalanced between the 3 streams. Note that in this model, when consuming from a stream assigned to more than one partition, the order of messages will be jumbled between partitions.
Each consumer uses more than one stream (say C1 uses 3, C2 uses 3 and C3 uses 4).
In this model, when C1 starts, all the 10 partitions are assigned to the 3 streams and C1 can consume from the 3 streams concurrently using multiple threads. When C2 starts, the partitions are rebalanced between the 6 streams and similarly when C3 starts, the partitions are rebalanced between the 10 streams. Each consumer can consume concurrently from multiple streams. Note that the number of streams and partitions here are equal. If the number of streams exceeds the number of partitions, some streams will not get any messages, as they will not be assigned any partitions.
Just to add the list of answers, Confluent has a library to do this for you, like Rapids. The project is here:
https://github.com/confluentinc/parallel-consumer
It's open source.
Note: I'm the author.
@Lundahl covered the theory; I'll give you a practical sample.
Create a topic for some purpose, e.g. news_events, with the parallelism (number of partitions) your consumers will need. You can calculate that from the time it takes to process one message, the number of messages you expect, and the time within which you want all the messages processed.
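That calculation can be sketched as a small helper. This is a rough back-of-the-envelope estimate, assuming one consumer thread per partition that processes messages one at a time and an even spread of messages over partitions; the class name, method name and the sample numbers are made up for illustration.

```java
// Rough estimate of the partition count needed to finish a backlog in time.
class PartitionSizingSketch {

    // partitions >= ceil(messages * millisPerMessage / targetMillis), and at least 1
    static int partitionsNeeded(long messages, long millisPerMessage, long targetMillis) {
        if (messages < 0 || millisPerMessage <= 0 || targetMillis <= 0) {
            throw new IllegalArgumentException(
                    "messages must be >= 0; millisPerMessage and targetMillis must be > 0");
        }
        long totalWorkMillis = Math.multiplyExact(messages, millisPerMessage);
        long partitions = (totalWorkMillis + targetMillis - 1) / targetMillis; // ceiling division
        return Math.toIntExact(Math.max(1L, partitions));
    }

    public static void main(String[] args) {
        // Hypothetical workload: 1,000,000 messages, 5 ms each, done within 10 minutes.
        int partitions = partitionsNeeded(1_000_000L, 5L, 600_000L);
        System.out.println("partitions needed: " + partitions);
    }
}
```

With the sample numbers the total work is 5,000,000 ms, which divided by the 600,000 ms target and rounded up gives 9 partitions. In practice you would probably add some headroom, since partitions are rarely loaded perfectly evenly.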
Let's create consumers for that topic. You want to read the news, and so does your sister or brother, each in your own time, so each of you needs your own consumer group id. This way Kafka knows that threads a, b, c belong to one consumer group and threads d, e, f to the second. Every consumer group receives the same messages, processes them at its own pace, and does not affect the others.
A message lands on one partition or another, never on two; for messages without a key, Kafka by default spreads them across the partitions, round-robin style. Remember, all consumer groups can connect to and read data from the same partitions.
I would suggest using rapids-kafka-client, a library which does the parallelism work for you: choose a number of threads equal to the number of partitions you have, choose a consumer group, and see the magic happen.
public static void main(String[] args) {
  ConsumerConfig.<String, String>builder()
      .prop(KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName())
      .prop(VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName())
      // consumer group id shared by all threads of this application
      .prop(GROUP_ID_CONFIG, "news-app")
      .topics("news_events")
      // number of consumer threads; per the advice above, match the topic's partition count
      .consumers(7)
      .callback((ctx, record) -> {
        System.out.printf("status=consumed, value=%s%n", record.value());
      })
      .build()
      .consume()
      .waitFor();
}
You can read more about consumer groups, topics and partitions here
I assume what you want is parallel consumption between customers in a publish/subscribe fashion.
Besides that, you can also have parallel consumption within a single customer in order to scale the consumer application.
Parallel consumption between customers
If by "customers" you mean different organizations which are interested in consuming the topic's messages independently, all you need is consumer groups.
This is a simple publish/subscribe pattern where each customer runs its own application and reads all of the topic's messages without interfering with the others.
Each customer application can be seen as a consumer group, made up of one or more Kafka consumers (whether running on a single node or spread across a cluster), all of them sharing the consumer group's identifier.
You achieve this goal regardless of partitions. If the topic is partitioned, you don't need to worry about writing the same message to all partitions. Remember that in Kafka messages are durable: a message read by a Kafka consumer is not deleted and remains available to be read by Kafka consumers from a different consumer group (until it expires). Furthermore, partitions are not meant to work like this; they help scale the storage of data (at a certain point all of a topic's data wouldn't fit on just one node) and scale consumer applications, as you can see below.
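The "a message read by one group is still there for another group" behaviour can be illustrated with a toy model in plain Java. It is only a sketch of the idea, assuming a single partition and one read position per consumer group; it is not how the broker is implemented, and all names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of one partition: an append-only log plus one read position per group.
class GroupOffsetsSketch {
    private final List<String> log = new ArrayList<>();
    private final Map<String, Integer> offsets = new HashMap<>();

    void append(String message) {
        log.add(message); // reading never removes anything from the log
    }

    // Returns up to 'max' unread messages for the group and advances only that group's position.
    List<String> poll(String groupId, int max) {
        int from = offsets.getOrDefault(groupId, 0);
        int to = Math.min(log.size(), from + max);
        offsets.put(groupId, to);
        return new ArrayList<>(log.subList(from, to));
    }

    public static void main(String[] args) {
        GroupOffsetsSketch partition = new GroupOffsetsSketch();
        partition.append("m1");
        partition.append("m2");
        partition.append("m3");

        // Two customers, each with its own group id, read the same data independently.
        System.out.println("customer-a: " + partition.poll("customer-a", 10));
        System.out.println("customer-b: " + partition.poll("customer-b", 2));
        System.out.println("customer-b: " + partition.poll("customer-b", 2));
        System.out.println("customer-a: " + partition.poll("customer-a", 10));
    }
}
```

Here customer-a sees all three messages, customer-b sees the same three at its own pace, and a second poll by customer-a returns nothing new because only its own position has moved.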
Parallel consumption within single customer
You can further parallelize, or rather scale, the consumption of messages within a consumer group by adding Kafka consumers to it.
Imagine the topic is huge, producers write to it at a high rate, and the consumer group has only one consumer: this poor consumer may struggle to keep up with the message arrival rate, especially if message processing is time-consuming too.
That's the case where you need partitions and more consumers in your consumer group, so that Kafka will assign partitions to consumers to distribute the reading load among them.
How partition assignment works has already been explained in other answers here, but basically, for a given consumer group:
each of the topic's partitions is assigned exclusively to one consumer,
a consumer might be assigned more than one partition,
if there are more consumers than partitions, some of them will stay idle, as they won't be assigned any partition to consume from.
Remember that message ordering in Kafka is guaranteed only at the partition level, so if you have many partitions and ordering matters, you need to choose the right message key to partition the data according to your requirements.
For example, if you want messages to be ordered by device, a device_id would be your key, which guarantees that messages from the same device will be written to the same partition.
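Why a key pins a device to one partition can be shown with a tiny stand-in for a hashing partitioner. This is a sketch only: it uses String.hashCode() purely for illustration, whereas Kafka's default partitioner hashes the serialized key bytes with murmur2, so the actual partition numbers will differ. The class name, method name and device ids are hypothetical.

```java
// Stand-in for a key-hashing partitioner: same key -> same partition.
class KeyPartitionSketch {

    // ASSUMPTION: String.hashCode() replaces Kafka's murmur2 hash of the key bytes.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions); // always in [0, numPartitions)
    }

    public static void main(String[] args) {
        int numPartitions = 6;
        String[] deviceIds = {"device-1", "device-2", "device-1", "device-3", "device-2"};
        for (String deviceId : deviceIds) {
            // Repeated device ids always map to the same partition, so their order is kept.
            System.out.println(deviceId + " -> partition " + partitionFor(deviceId, numPartitions));
        }
    }
}
```

The mapping depends on the number of partitions, so under this scheme adding partitions to an existing topic can send new messages for a device to a different partition than its older messages.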