Parallel Producing and Consuming in Kafka - apache-kafka

1. Consuming concurrently on the same topic and same partition
Suppose I have 100 partitions for a given topic (e.g. Purchases), I can easily consume these 100 partitions (e.g. Electronics, Clothing, and etc...) in parallel using a consumer group with 100 consumers in it.
However, that is assigning one consumer to each subset of the total data on Purchases. What if I want just want to consume one subset of data with 100 consumers concurrently? For example, for all of my consumers, they just want to know Electronics partition of the Purchases topic.
Is there way they can consume this partition concurrently?
In general I just want all my consumers to receive the same data set concurrently.
From the information I've gathered, it seems to me that consumers CANNOT consume from replicas: Consuming from a replica
Can I produce the same data to multiple topics, like Purchase-1[Electronics] and Purchase-2[Electronics] so then I can consume them concurrently? Is this a recommended approach?
2. Producing concurrently on the same topic and same partition
When multiple producers are producing to the same topic and same partition, since we can only write to the partition leader and replicas are only there for fault-tolerance, does this mean there isn't any concurrency? (i.e. each commit must wait in line.)

If those 100 consumers belong to different consumer groups, they can consume from the same topic and partition simultaneously. In that case, you need to make sure each consumer is able to handle the load from the 100 partitions.
Producers can produce to the same topic partition at the same time, but the actual order of messages written to the partition is determined by the partition leader.

If you want to consumer from a single partition in parallel, use something like Parallel Consumer (PC).
By using PC, you can process all your keys in parallel, regardless of how long it takes, and you can be as concurrent as you wish.
PC directly solves for this, by sub partitioning the input partitions by key and processing each key in parallel.
It also tracks per record acknowledgement. Check out Parallel Consumer on GitHub (it's open source BTW, and I'm the author).

Related

Is there a limit on consumers and consumer groups?

Is there any limit on the number of consumers or consumer groups in Kafka?
I am planning to push 200 MB of data every 10 mins to a topic and have 200+ distinct consumers listen and consume from this topic. Is there any other recommended way to do this?
As Rohit answer states, there' no such limit.
Regarding your issue, it seems like you want to achieve some kind of paralellization of consumption. If you send 200 consumers with 200 different consumer groups, each consumer will read all the data independently, so you'll have 200 threads reading the same 200MB every 10 minutes (200x200 MB = 40GB received every 10 minutes). I guess you wanted every consumer to read 1MB every 10 mins with your approach, but that's not how it works.
If the logic implemented by each consumer is the same, you shouldn't declare more than a consumer group. If you declare two consumer groups, each one will read the same data, and you'll just repeat the job done, duplicating the output. Set different consumer groups if the job to be done on the topic's records is different: for example, one consumer group must store the records into a DDBB. The other consumer group must visualize the data into Grafana. Those are two different processing mechanisms, so each one must read all the data at its own. This is not the only reason to declare different consumer groups, but one example of them. There are multiple justifications for declaring more than a consumer group for a topic.
Imagine an scenario where the only job to be done is storing the messages into a DDBB. If you declare two consumer groups and launch your consumers, what you'll get is duplicate values stored in your database, as the first consumer group is just doing the same work than the second. Not only you are re-reading from kafka, you are re-storing the same messages into the ddbb.
In order to achieve launching multiple consumers that efficiently share the work (so for example, launching 4 consumers each one reads 50MB), you must partition your topic.
Only one consumer thread from the same consumer group can read from an specific partition. If you have 4 partitions in that topic, and 4 consumer threads that share the same consumer group, launching them will lead to each thread reading from one partition. If you launch two consumers, both will be assigned 2 partitions. Works like this:
And in this scenario, you do have a limit in the number of consumers concurrently reading if they share the same consumer group, which is, the number of partitions of that topic. If you launch a 5th consumer thread, one of them will block/wait, because it wasn't assigned any partition. In the example, consumer 5 waits until a partition is avaliable for him (so maybe waits forever).
What I suggest is: decide how many consumer threads you'll need to consume the data and partition the topic in base of that. If you, for example, partition the topic to 8 different partitions, you'll be able to launch 8 consumers from the same consumer group. Each one will then read, more or less, (depending on the producer partitioner) 25MB (200/8) of the incoming data, efficiently sharing the work load: Each consumer will read from its own partition.
If you launch 200 consumers with 200 different consumer groups,
you'll just multiply the work to be done x200, as every single consumer will read the data from start to end.
If you launch 200 consumers with the same consumer group and the topic has a single partition,
you'll have one thread doing all the job and 199 stale consumers.
In Kafka, there is no limit on the number of Consumer groups for a particular topic. However, the increase in consumer groups increases network utilization.
Worth nothing that newer versions of Kafka, store offsets in the internal Kafka topic called __consumer_offsets.

Consuming from single kafka partition by multiple consumers

I read following in kafka docs:
The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a "fair share" of partitions at any point in time.
Kafka only provides a total order over records within a partition, not between different partitions in a topic.
Per-partition ordering combined with the ability to partition data by key is sufficient for most applications.
However, if you require a total order over records this can be achieved with a topic that has only one partition, though this will mean only one consumer process per consumer group.
I read following on this page:
Consumers read from any single partition, allowing you to scale throughput of message consumption in a similar fashion to message production.
Consumers can also be organized into consumer groups for a given topic — each consumer within the group reads from a unique partition and the group as a whole consumes all messages from the entire topic.
If you have more consumers than partitions then some consumers will be idle because they have no partitions to read from.
If you have more partitions than consumers then consumers will receive messages from multiple partitions.
If you have equal numbers of consumers and partitions, each consumer reads messages in order from exactly one partition.
Doubts
Does this means that single partition cannot be consumed by multiple consumers? Cant we have single partition and a consumer group with more than one consumer and make them all consume from single partition?
If single partition can be consumed by only single consumer, I was thinking why is this design decision?
What if I need total order over records and still need it to be consumed parallel? Is it undoable in Kafka? Or such scenario does not make sense?
Within a consumer group, at any time a partition can only be consumed by a single consumer. No you can't have 2 consumers within the same group consuming from the same partition at the same time.
Kafka Consumer groups allow to have multiple consumer "sort of" behave like a single entity. The group as a whole should only consume messages once. If multiple consumer in a group were to consume the same partitions, these records would be processed multiple times.
If you need to consume a partition multiple times, be sure these consumers are in different groups.
When processing needs to happen in order (serially) at any time there's only a single task to do. If you have records 1, 2 and 3 and want to process them in order, you cannot do anything until message 1 has been processed. It's the same for message 2 and 3. So what do you want to do in parallel?

Kafka repartitioning

From my understanding partitions and consumers are tied up into a 1:1 relationship in which a single consumer processes a partition. However is there such a way to repartition in the middle of processing?
We are currently trying to optimize a process in which the topic gets consumed across a group but there are cases in which the data processing needs to take longer on a certain consumer while others are already idle. Its like data cleansing where a certain partition might no longer need cleansing while others require fuzzy matching thereby adding complexity to the task a consumer performs.
Your understanding with regards to partitions and consumers is not quite right.
If you have N partitions, then you can have up to N consumers within the same consumer group each of which reading from a single partition. When you have less consumers than partitions, then some of the consumers will read from more than one partition. Also, if you have more consumers than partitions then some of the consumers will be inactive and will receive no messages at all.
If you have one consumer per partition, then some of the partitions might receive more messages and this is why some of your consumers might be idle while some others might still processing some messages. Note that messages are not always inserted into topic partitions in a round-robin fashion as messages with the same key are placed into the same partition.
in kafka topics are partitioned, and even if you can add partitions to a topic there is no repartitioning: all the data already written to a partition stays there, new data will be partitioned among the existing partitions (in a round robin fashion if you do not define keys, otherwise one key will always land in the same partition as long as you do not add partitions.)
But if you have a consumer group, and you add or remove consumers to this group, there is a group rebalancing where each consumer receives its share of partitions to exclusively consume from.
So if you have 3 partitions (with evenly distributed messages among them) and 2 consumers (in the same group) one consumer will have twice as much messages to handle than the other; with 3 consumers each one will consume one partition; with 4 consumers one will stay idle...
So as you already have evenly distributed messages (which is good), you should have as many consumers as you have partitions, and if it is still not fast enough you may add n partitions and n consumers. (For sure you could also try to optimize the consumer but that is another story...)
Added to answer comment:
Once a consumer -- from a given group -- is consuming a partition, it will continue to do so and will be the only one from the group consuming this partition, even if a lot of other consumers from the same group are idle. In one group a partition is never shared between consumers. (If the consumer crashes, another one will continue the work, and if a new consumer enters the group a rebalance will occur, but anyway only one consumer will work on one partition at a given time).
So one approach, as said in your comment would be to distribute the load evenly over the partitions. Another approach, would be to have a topic dedicated to expensive jobs, let it have a lot of partitions and a lot of consumers; and let the topic for non-expensive jobs have fever consumers.
Last approach that I would not recommend would be to not use the consumer group features and to manage yourself how you consume from Kafka, by using assign and seek methods from the consumer. (See KafkaConsumer JavaDoc for more information). Spark Structured Streaming for example is using that approach, but it is much more complex...

Can I have multiple consumer reading from same Kafka Topic in parallel?

Currently I have one Kafka topic.
Now I need to run multiple consumer so that message can be read and processed in parallel.
Is this possible.
I am using python and pykafka library.
consumer = topic.get_simple_consumer(consumer_group=b"charlie",
auto_commit_enable=True)
Is taking same message in both consumer. I need to process message only once.
You need to use BalancedConsumer instead of SimpleConsumer:
consumer = topic.get_balanced_consumer(consumer_group=b"charlie",
auto_commit_enable=True)
You should also ensure that the topic you're consuming has at least as many partitions as the number of consumers you're instantiating.
Generally you need multiple partitions and multiple consumer to do this, or, something like Parallel Consumer (PC) to sub divide the single partition.
However, it's recommended to have at least 3 partitions and have at least three consumers running in a group, to utilise high availability. You can again use PC to process all these partitions, sub divided by key, in parallel.
PC directly solves for this, by sub partitioning the input partitions by key and processing each key in parallel.
It also tracks per record acknowledgement. Check out Parallel Consumer on GitHub (it's open source BTW, and I'm the author).
Yes you can have multiple consumers reading from the same topic in parallel provided you use the same consumer group id and the number of partitions of the topic should greater than the consumers otherwise some of the consumers will not be assigned any partitions and those consumers won't fetch any data

Kafka how to consume one topic parallel

I read kafka document, still don't know how consume one topic parallel?
Suppose:
I have one topic like "something happened" (don't split this topic), and I have many customers that want to consume it.
So what should I do, so that multiple customers can consume it parallel? Should I use partitioning and customer groups?
I have one idea about this, but I'm not sure whether is it right.
Make many partitions about the same topic, and make one partition to one customer, so one producer must produce the same to these partitions, and every customer in the different customer group, is it right?
Using partitions is the way of being able to parallelize the consumption of a topic. Let´s say you have 10 partitions for your topic, then you can have 10 consumers in the same consumer group reading one partition each. If you have less consumers than partitions, then they would be responsible for more than one partition each. If you have more consumers than partitions, then there would be consumers who would not get any partition assigned to them and have nothing to do except being available to replace another consumer who has died.
Each topic in Kafka can be organized into many partitions. Partition allows for parallel consumption increasing throughput.
Producer publishes the message to a topic using the Kafka producer client library which balances the messages across the available partitions using a Partitioner. The broker to which the producer connects to takes care of sending the message to the broker which is the leader of that partition using the partition owner information in zookeeper. Consumers use Kafka’s High-level consumer library (which handles broker leader changes, managing offset info in zookeeper and figuring out partition owner info etc implicitly) to consume messages from partitions in streams; each stream may be mapped to a few partitions depending on how the consumer chooses to create the message streams.
For example, if there are 10 partitions for a topic and 3 consumer instances (C1,C2,C3 started in that order) all belonging to the same Consumer Group, we can have different consumption models that allow read parallelism as below
Each consumer uses a single stream.
In this model, when C1 starts all 10 partitions of the topic are mapped to the same stream and C1 starts consuming from that stream. When C2 starts, Kafka rebalances the partitions between the two streams. So, each stream will be assigned to 5 partitions(depending on the rebalance algorithm it might also be 4 vs 6) and each consumer consumes from its stream. Similarly, when C3 starts, the partitions are again rebalanced between the 3 streams. Note that in this model, when consuming from a stream assigned to more than one partition, the order of messages will be jumbled between partitions.
Each consumer uses more than one stream
(say C1 uses 3, C2 uses 3 and C3 uses 4). In this model, when C1 starts, all the 10 partitions are assigned to the 3 streams and C1 can consume from the 3 streams concurrently using multiple threads. When C2 starts, the partitions are rebalanced between the 6 streams and similarly when C3 starts, the partitions are rebalanced between the 10 streams. Each consumer can consume concurrently from multiple streams. Note that the number of streams and partitions here are equal. In case the number of streams exceed the partitions, some streams will not get any messages as they will not be assigned any partitions.
Just to add the list of answers, Confluent has a library to do this for you, like Rapids. The project is here:
https://github.com/confluentinc/parallel-consumer
It's open source.
Note: I'm the author.
#Lundahl did all the didactic, I'll give you a pratical sample.
Create a topic for some meaning, e.g. news_events with the parallelism your consumers will need (partitions), you can calc that using the time to process one message, the number of messages you will have and the time you want to have all the messages processed.
Let's create consumers for that topic, you wan't to read the news and your sister or brother also, each one on your time, then every one needs a consumer group id, this way kafka will know that threads a,b,c are for one consumer group and the d,e,c are for the second consumer group, every consumer group will receive the same messages, process it at their time and won't affect each other.
A message will come at one or other partition, never at two, by default Kafka makes round robin to choose the partition, remember, all consumers groups can connect and read data from all the same partitions
I would suggest you to use rapids-kafka-client, a library which do that parallelism stuff for you, choose the number of threads equal the number of partitions you have, choose a consumer group, and see the magic happen.
public static void main(String[] args){
ConsumerConfig.<String, String>builder()
.prop(KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName())
.prop(VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName())
.prop(GROUP_ID_CONFIG, "news-app")
.topics("news_events")
.consumers(7)
.callback((ctx, record) -> {
System.out.printf("status=consumed, value=%s%n", record.value());
})
.build()
.consume()
.waitFor();
}
You can read more about consumer groups, topics and partitions here
I assume what you want is parallel consumption between customers in a publish/subscribe fashion.
Beside that, you can also have parallel consumption within a single customer in order to scale the consumer application.
Parallel consumption between customers
If by "customers" you mean different organizations which are interested in consuming topic's messages independently, all you need is consumer groups.
This is a simple publish/subscribe pattern where each customer runs its own application and read all topic's messages without interfering with others.
Each customer application can be seen as a consumer group, made up by one or more Kafka consumers (whether running on a single node or spread across a cluster), all of them sharing the consumer group's identifier.
You achieve this goal regardless of partitions. In case topic is partitioned, you don't need to worry about writing the same message to all partitions. Remember that in Kafka messages are durable, a message read by a Kafka consumer is not deleted and is available to be read by other Kafka consumers from a different consumer group (until it expires). Furthermore, partitions are not meant to work like this, they help scale storage of data (at a certain point all topic's data wouldn't fit into just one node) and scale consumer applications as you can see below.
Parallel consumption within single customer
You can further parallelize, or better to say, scale the consumption of messages within a consumer group with, in fact, Kafka consumers.
Imagine topic is huge, producers write into it with an high rate, and consumer group has only one consumer: this poor consumer may struggle to keep up with the message arrival rate, especially if message processing is time-consuming too.
That's the case where you need partitions and more consumers in your consumer group, so that Kafka will assign partitions to consumers to distribute reading load among them.
How partition assignment works has been already explained in other answers here, but basically for a given consumer group:
each topic's partition is assigned exclusively to one consumer,
a consumer might get assigned more partitions
if consumers are more than topic's partitions, some of them will stay idle as they won't get assigned any partition to consume from.
Remember that message ordering in Kafka is guaranteed only at partition level, so if you have many partitions and ordering matters, you need to choose the right message key to partition data according to your requirements.
For example if you want messages be ordered by device, a device_id would be your key that guarantees messages of the same device will be written to the same partition.