Can we have multiple consumers consuming from a topic to achieve parallel processing in Kafka?
My use case is to read messages from a single partition in parallel.
Yes, you can process messages in parallel using many Kafka consumers, but not if you only have one partition.
Parallelism in Kafka consumption is bounded by the number of partitions. You can add partitions to a topic at any time, but note that doing so changes which partition a given key maps to.
Below is an example of how to process messages in parallel using rapids-kafka-client, a library that makes parallel Kafka consumption easier.
public static void main(String[] args){
  // The *_CONFIG constants are static imports from Kafka's own
  // org.apache.kafka.clients.consumer.ConsumerConfig
  ConsumerConfig.<String, String>builder()
    .prop(KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName())
    .prop(VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName())
    .prop(GROUP_ID_CONFIG, "stocks")
    .topics("stock_changed")
    .consumers(7)
    .callback((ctx, record) -> {
      System.out.printf("status=consumed, value=%s%n", record.value());
    })
    .build()
    .consume()
    .waitFor();
}
Simply put, Kafka consumers do not offer parallelism within a single partition by default.
But you can try Akka Streams Kafka (Reactive Kafka). Have a look through its docs.
The number of partitions defines the level of parallelism for reading from a Kafka topic. Reading itself, however, is (more or less) only limited by your network capacity.
A good pattern is to separate reading from processing: one thread per topic-partition for reading and multiple threads for processing the messages.
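The read/process split described above can be sketched with the Python standard library alone. This is a minimal illustration, not a Kafka client: the list of records stands in for what a poll loop would return from one partition, and `read_partition`, `process` and `consume` are hypothetical names chosen for this example.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def read_partition(records, handoff):
    """Reader thread: stands in for the poll loop of one topic-partition."""
    for record in records:
        handoff.put(record)
    handoff.put(None)  # sentinel: no more records


def process(record):
    """Hypothetical per-record work; replace with real business logic."""
    return record.upper()


def consume(records, workers=4):
    # Bounded hand-off buffer between the reader and the dispatch loop.
    handoff = queue.Queue(maxsize=100)
    reader = threading.Thread(target=read_partition, args=(records, handoff))
    reader.start()
    futures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Dispatch loop: hand every record to the worker pool.
        while (record := handoff.get()) is not None:
            futures.append(pool.submit(process, record))
    reader.join()
    # Results are gathered in submission order, but the work itself ran
    # concurrently, so processing order within the partition is not preserved.
    return [f.result() for f in futures]
```

With a real consumer, offsets should only be committed once the corresponding records have actually been processed, otherwise a crash can skip records that were read but never handled.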
You need multiple partitions to do this, or something like Parallel Consumer (PC) to subdivide the single partition.
However, it's recommended to have at least three partitions and at least three consumers running in a group for high availability. You can again use PC to process all these partitions in parallel, subdivided by key.
PC solves this directly by sub-partitioning the input partitions by key and processing each key in parallel.
It also tracks per-record acknowledgement. Check out Parallel Consumer on GitHub (it's open source BTW, and I'm the author).
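To make the idea of sub-partitioning by key concrete, here is a rough standard-library sketch. It is not how Parallel Consumer is implemented and does not use its API; `process_by_key` is a hypothetical helper that routes every record to a worker chosen by a stable hash of its key, so records with the same key stay in order while different keys can proceed in parallel.

```python
import queue
import threading
import zlib


def process_by_key(records, workers=4):
    """records: iterable of (key, value) pairs read in order from ONE partition."""
    queues = [queue.Queue() for _ in range(workers)]
    seen = {}  # key -> values in the order they were processed
    lock = threading.Lock()

    def worker(q):
        # Each worker drains its own queue in FIFO order.
        while (item := q.get()) is not None:
            key, value = item
            with lock:
                seen.setdefault(key, []).append(value)

    threads = [threading.Thread(target=worker, args=(q,)) for q in queues]
    for t in threads:
        t.start()
    for key, value in records:
        # Stable hash: the same key always lands on the same worker queue.
        index = zlib.crc32(key.encode("utf-8")) % workers
        queues[index].put((key, value))
    for q in queues:
        q.put(None)  # sentinel: tell each worker to stop
    for t in threads:
        t.join()
    return seen
```

The design choice here is one single-threaded queue per worker: ordering per key comes for free, at the cost that two hot keys hashing to the same worker share its capacity.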
Which one is recommended?
1. A single Kafka stream consuming from multiple topics
2. Different Kafka streams consuming from different topics (I've used this one already with no issues)
Is it possible to achieve #1? If yes, what are the implications?
And if I use the 'EXACTLY_ONCE' setting, what kind of complexity will it bring?
Kafka version: 2.2.0-cp2
Is it possible to achieve #1 (a single Kafka stream consuming from multiple topics)?
Yes, you can use StreamsBuilder#stream(Collection<String> topics).
If the data you want to process is spread across multiple topics, and these topics together constitute one single source, then you can use this, but not if you want to process those topics in parallel.
It is like one consumer subscribing to all these topics, which also means one thread consuming all of them. When you call poll(), it returns ConsumerRecords from all the subscribed topics, not just one.
In Kafka Streams there is a concept called a Topology, which is basically a directed acyclic graph of sources, processors and sinks. A topology can contain sub-topologies.
Sub-topologies can then be executed as independent stream tasks on parallel threads (Reference).
Since each sub-topology can have its own source, which can be a topic, you have to break up your graph into sub-topologies if you want these topics processed in parallel.
If I use the 'EXACTLY_ONCE' setting, what kind of complexity will it bring?
When messages reach a sink processor in a topology, the offsets of its source must be committed, where a source can be a single topic or a collection of topics.
Whether there are multiple topics or one, the producer has to send the offsets to the transaction. These offsets are basically a Map<TopicPartition, OffsetAndMetadata> that is committed together with the produced messages.
So I think it should not introduce any extra complexity whether it is a single topic with 10 partitions or 10 topics with 1 partition each, because offsets are tracked at the TopicPartition level, not at the topic level.
I would like to ask for some input on the following question. I'm using a Consumer.committableSource in my application. During tests I discovered that, instead of going round-robin among the partitions of the Kafka topic, the application drains a given partition until it consumes the latest entry before switching to the next partition. This is not ideal for my application, as it cares about the temporal order in which the events were put on Kafka. This exhaustive way of reading partitions is like going back and forth in time.
Any ideas on how I can tune the consumer to favor round-robin on partition consumption instead?
Thank you!
You can approach this in two ways. The first is preferable, as it achieves parallelism and high throughput with minimal latency.
Create multiple instances of the same consumer. They will work as a consumer group, and all instances will share the partition load in parallel.
For example, if you have 4 partitions and run 2 instances, then in the ideal case each instance consumes 2 partitions. If you increase the number of instances to 4, each instance will ideally consume 1 partition. Partition rebalancing is handled by the consumer group management.
You can also assign a list of partitions to the consumer yourself, using the API below:
public void assign(java.util.Collection<TopicPartition> partitions)
This manually assigns the given partitions to the consumer, so it consumes only those partitions. Consumer group rebalancing is not used in this case.
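The arithmetic of the first option (4 partitions shared by 2 or 4 instances) can be checked with a small simulation. This is only an illustration of an even split; `share_partitions` is a hypothetical helper, and a real assignment strategy may pick a different but equally balanced distribution.

```python
def share_partitions(partitions, instances):
    """Spread partition ids over consumer instances as evenly as possible."""
    assignment = {i: [] for i in range(instances)}
    for position, partition in enumerate(sorted(partitions)):
        # Deal partitions out one at a time, like cards.
        assignment[position % instances].append(partition)
    return assignment


two = share_partitions(range(4), 2)   # each instance gets 2 partitions
four = share_partitions(range(4), 4)  # each instance gets 1 partition
six = share_partitions(range(4), 6)   # two instances get nothing
```

The last call shows why adding instances beyond the partition count does not help: the surplus instances receive an empty assignment.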
I have a Kafka cluster with multiple topics. I'm going to configure one partition for each topic, and all those topics will be consumed by a single EC2 instance running 3 Kafka consumer threads (one consumer per thread) that belong to the same consumer group.
I haven't tried it yet, but I'm wondering whether Kafka will balance the partitions of all topics equally across the 3 threads, or whether it will assign all partitions to a single thread.
The Kafka consumer is NOT thread-safe, so you should not share the same consumer instance between different threads. Instead, create a new instance for each thread.
From documentation https://kafka.apache.org/0100/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#multithreaded:
1. One Consumer Per Thread
A simple option is to give each thread its own consumer instance. Here are the pros and cons of this approach:
PRO: It is the easiest to implement
PRO: It is often the fastest as no inter-thread co-ordination is needed
PRO: It makes in-order processing on a per-partition basis very easy to implement (each thread just processes messages in the order it receives them).
CON: More consumers means more TCP connections to the cluster (one per thread). In general Kafka handles connections very efficiently so this is generally a small cost.
CON: Multiple consumers means more requests being sent to the server and slightly less batching of data which can cause some drop in I/O throughput.
CON: The number of total threads across all processes will be limited by the total number of partitions.
If a topic has several partitions, messages from different partitions can be processed in parallel. You can create several consumer instances with the same group.id, and each consumer will get a subset of the partitions to consume.
Kafka doesn't guarantee balanced processing across different topics. By this I mean that, depending on the assignment strategy, partitions are distributed topic by topic, so partitions from different topics might not be assigned evenly.
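To see why partitions of different topics might not be spread evenly, the following sketch contrasts a per-topic assignment with a round-robin one over all topic-partitions. It is a simplified model in the spirit of the Java client's range and round-robin strategies (selected through partition.assignment.strategy), not their actual code; topic and consumer names are made up for the example.

```python
from itertools import cycle


def range_assign(topics, consumers):
    """Per-topic assignment: each topic's partitions are split on their own."""
    assignment = {c: [] for c in consumers}
    members = sorted(consumers)
    for topic, partition_count in sorted(topics.items()):
        per_member, extra = divmod(partition_count, len(members))
        start = 0
        for i, member in enumerate(members):
            # The first `extra` members take one partition more.
            length = per_member + (1 if i < extra else 0)
            assignment[member] += [(topic, p) for p in range(start, start + length)]
            start += length
    return assignment


def round_robin_assign(topics, consumers):
    """Lay out all topic-partitions in one list and deal them out in turn."""
    assignment = {c: [] for c in consumers}
    all_partitions = [(t, p) for t, n in sorted(topics.items()) for p in range(n)]
    for tp, member in zip(all_partitions, cycle(sorted(consumers))):
        assignment[member].append(tp)
    return assignment


topics = {"orders": 1, "payments": 1, "shipments": 1}  # one partition each
consumers = ["c1", "c2", "c3"]
by_range = range_assign(topics, consumers)
by_round_robin = round_robin_assign(topics, consumers)
```

In this model, three single-partition topics all end up on the first consumer under the per-topic strategy, while the round-robin strategy gives each consumer one topic.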
A consumer group should not have more consumers than partitions. Each partition is assigned to at most one consumer in the group, so the extra consumers sit idle. This one-consumer-per-partition rule is what preserves per-partition ordering and keeps offset tracking simple. Note also that the Java KafkaConsumer is not thread-safe (the KafkaProducer is).
So in Kafka's case, the number of partitions is your parallelism.
So in your scenario, with one partition, run exactly one consumer instance in exactly one thread (you can, of course, hand the messages off to a thread pool for later processing).
1. Consuming concurrently on the same topic and same partition
Suppose I have 100 partitions for a given topic (e.g. Purchases). I can easily consume these 100 partitions (e.g. Electronics, Clothing, etc.) in parallel using a consumer group with 100 consumers in it.
However, that assigns one consumer to each subset of the total data in Purchases. What if I want to consume just one subset of the data with 100 consumers concurrently? For example, all of my consumers only want to read the Electronics partition of the Purchases topic.
Is there way they can consume this partition concurrently?
In general I just want all my consumers to receive the same data set concurrently.
From the information I've gathered, it seems to me that consumers CANNOT consume from replicas: Consuming from a replica
Can I produce the same data to multiple topics, like Purchase-1[Electronics] and Purchase-2[Electronics] so then I can consume them concurrently? Is this a recommended approach?
2. Producing concurrently on the same topic and same partition
When multiple producers are producing to the same topic and same partition, since we can only write to the partition leader and replicas are only there for fault-tolerance, does this mean there isn't any concurrency? (i.e. each commit must wait in line.)
If those 100 consumers belong to different consumer groups, they can consume from the same topic and partition simultaneously. In that case, you need to make sure each consumer is able to handle the load from the 100 partitions.
Producers can produce to the same topic partition at the same time, but the actual order of messages written to the partition is determined by the partition leader.
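The point about different consumer groups can be illustrated with a tiny in-memory model of one partition. `PartitionLog` is a hypothetical stand-in, not a Kafka class: it keeps one read position per group id, which is why every group sees the full data set independently of the others.

```python
class PartitionLog:
    """In-memory stand-in for one partition: an append-only list of records."""

    def __init__(self):
        self.records = []
        self.committed = {}  # group id -> next offset to read

    def append(self, record):
        self.records.append(record)

    def poll(self, group_id, max_records=10):
        # Each group resumes from its own offset, unaffected by other groups.
        start = self.committed.get(group_id, 0)
        batch = self.records[start:start + max_records]
        self.committed[group_id] = start + len(batch)
        return batch


electronics = PartitionLog()
for item in ["tv", "phone", "laptop"]:
    electronics.append(item)

first_group = electronics.poll("group-1")   # reads all three records
second_group = electronics.poll("group-2")  # reads the same three records
again = electronics.poll("group-1")         # nothing new for group-1
```

Two consumers inside the same group would share one offset instead, which is why they split the data rather than each receiving all of it.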
If you want to consume from a single partition in parallel, use something like Parallel Consumer (PC).
By using PC, you can process all your keys in parallel, regardless of how long it takes, and you can be as concurrent as you wish.
Currently I have one Kafka topic.
Now I need to run multiple consumers so that messages can be read and processed in parallel.
Is this possible?
I am using Python and the pykafka library.
consumer = topic.get_simple_consumer(consumer_group=b"charlie",
                                     auto_commit_enable=True)
Both consumers receive the same messages, but I need each message to be processed only once.
You need to use BalancedConsumer instead of SimpleConsumer:
consumer = topic.get_balanced_consumer(consumer_group=b"charlie",
                                       auto_commit_enable=True)
You should also ensure that the topic you're consuming has at least as many partitions as the number of consumers you're instantiating.
Generally you need multiple partitions and multiple consumers to do this, or something like Parallel Consumer (PC) to subdivide the single partition.
Yes, you can have multiple consumers reading from the same topic in parallel, provided they use the same consumer group id and the topic has at least as many partitions as there are consumers. Otherwise some of the consumers will not be assigned any partitions and won't fetch any data.