Does Kafka balance partitions across consumer threads? - apache-kafka

I have a Kafka cluster with multiple topics. I'm going to give each topic one partition, and all of those topics will be consumed by a single EC2 instance running 3 Kafka consumer threads (one consumer per thread), all belonging to the same consumer group.
I haven't experimented with it yet, but I'm wondering whether Kafka will balance the partitions of all the topics across the 3 threads equally, or whether it will assign all partitions to only one thread.

The Kafka consumer is NOT thread-safe: you should not share the same consumer instance between threads. Instead, create a new instance for each thread.
From documentation https://kafka.apache.org/0100/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#multithreaded:
1. One Consumer Per Thread
A simple option is to give each thread its own consumer instance. Here are the pros and cons of this approach:
PRO: It is the easiest to implement
PRO: It is often the fastest as no inter-thread co-ordination is needed
PRO: It makes in-order processing on a per-partition basis very easy to implement (each thread just processes messages in the order it receives them).
CON: More consumers means more TCP connections to the cluster (one per thread). In general Kafka handles connections very efficiently so this is generally a small cost.
CON: Multiple consumers means more requests being sent to the server and slightly less batching of data which can cause some drop in I/O throughput.
CON: The number of total threads across all processes will be limited by the total number of partitions.
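The one-consumer-per-thread pattern can be sketched as below. Since a real KafkaConsumer needs a broker, a stub consumer stands in for it here; only the threading structure matters, namely one private consumer object per thread and nothing shared:

```python
import queue
import threading

class StubConsumer:
    """Stand-in for a real KafkaConsumer; yields canned records so the
    example runs without a broker. A real consumer would poll Kafka."""
    def __init__(self, records):
        self._records = records
    def __iter__(self):
        return iter(self._records)

def run_worker(thread_id, records, results):
    # Each thread creates and owns its OWN consumer instance; consumers
    # are not thread-safe, so instances are never shared across threads.
    consumer = StubConsumer(records)
    for record in consumer:
        results.put((thread_id, record))

results = queue.Queue()
partitions = {0: ["a0", "a1"], 1: ["b0"], 2: ["c0", "c1", "c2"]}
threads = [
    threading.Thread(target=run_worker, args=(tid, recs, results))
    for tid, recs in partitions.items()
]
for t in threads:
    t.start()
for t in threads:
    t.join()

consumed = [results.get() for _ in range(results.qsize())]
print(len(consumed))  # 6 records consumed across 3 independent consumers
```

Per-partition ordering is preserved inside each thread, which is the PRO listed above.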
If a topic has several partitions, messages from different partitions can be processed in parallel. You can create several consumer instances with the same group.id, and each consumer will get a subset of the partitions to consume from.
Be aware, though, that with the default (range) partition assignor, assignment is computed per topic, so partitions from different topics may not be spread evenly across the consumers in a group; with many single-partition topics, those partitions can all end up on the same consumer. The round-robin assignor distributes partitions across topics more evenly.
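To see why assignment across many single-partition topics can be lopsided, here is a rough simulation of the two assignment strategies (this mimics their documented behavior; the real implementations are RangeAssignor and RoundRobinAssignor in the Java client):

```python
def range_assign(topics, consumers):
    """Per-topic range assignment, as Kafka's default RangeAssignor does:
    for each topic independently, sorted consumers split that topic's
    partitions into contiguous ranges."""
    assignment = {c: [] for c in consumers}
    members = sorted(consumers)
    for topic, n_partitions in topics.items():
        per, extra = divmod(n_partitions, len(members))
        start = 0
        for i, c in enumerate(members):
            count = per + (1 if i < extra else 0)
            assignment[c] += [(topic, p) for p in range(start, start + count)]
            start += count
    return assignment

def round_robin_assign(topics, consumers):
    """Round-robin over ALL partitions of all topics, spreading them
    across consumers regardless of topic boundaries."""
    members = sorted(consumers)
    all_parts = sorted((t, p) for t, n in topics.items() for p in range(n))
    assignment = {c: [] for c in members}
    for i, tp in enumerate(all_parts):
        assignment[members[i % len(members)]].append(tp)
    return assignment

topics = {f"topic-{i}": 1 for i in range(6)}   # six single-partition topics
consumers = ["c0", "c1", "c2"]

by_range = range_assign(topics, consumers)
by_rr = round_robin_assign(topics, consumers)
print([len(by_range[c]) for c in consumers])  # [6, 0, 0]: all on one consumer
print([len(by_rr[c]) for c in consumers])     # [2, 2, 2]: spread evenly
```

With one partition per topic, the range strategy hands every partition to the first consumer, which is exactly the scenario the question describes.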

You should not have more consumers than partitions: the extra consumers will simply sit idle, since a partition is consumed by at most one member of a group. Note also that the Kafka (Java) consumer is not thread-safe (the producer, by contrast, can safely be shared between threads).
So in Kafka's case, the number of partitions is your parallelism.
In your scenario of one partition per topic, run exactly one consumer instance in exactly one thread for each partition (you can, of course, hand the messages off to a pool of worker threads for later processing).
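The hand-off to a worker pool mentioned above might look like this minimal sketch (a plain list stands in for the records a real poll loop would return, so it runs without a broker):

```python
import queue
import threading

# Sketch: one thread "consumes" (stubbed by a list instead of a real Kafka
# poll loop) and hands records to a small pool of worker threads.
records_in = ["m1", "m2", "m3", "m4"]
work = queue.Queue()
processed = queue.Queue()
STOP = object()   # sentinel telling workers to shut down
N_WORKERS = 2

def worker():
    while True:
        item = work.get()
        if item is STOP:
            break
        processed.put(item.upper())  # placeholder for the real processing

workers = [threading.Thread(target=worker) for _ in range(N_WORKERS)]
for w in workers:
    w.start()

# The single consumer thread only reads and enqueues; records enter the
# queue in partition order, though workers may finish out of order.
for record in records_in:
    work.put(record)
for _ in workers:
    work.put(STOP)
for w in workers:
    w.join()

out = sorted(processed.get() for _ in range(processed.qsize()))
print(out)  # ['M1', 'M2', 'M3', 'M4']
```

Note that offset commits need care in this design: a record is only safely committed once its worker has finished with it.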

Related

Best ways to design a kafka consumer

I need help choosing the best design for creating Kafka consumers.
We will have multiple topics, grouped by purpose, for example:
10 topics used to send out emails (10 because we expect more client traffic and want to dedicate a topic to each client, so that one client's messages are not delayed behind another's);
10 topics to process business logic, with the same reasoning for the count as above.
With this usage, what's the best way to design the Kafka consumers? A consumer dedicated to each topic? Or is there a way to scale consumers dynamically by passing in which topic each one should subscribe to? We will certainly deploy this in containers, but I'd like suggestions on how to structure the consumer side for dynamic scalability with common code. And what's the best technology to implement this kind of Kafka consumer (dotnet/java/python)?
Also, please suggest whether partitions make sense in this kind of design so that we can leverage consumer groups.
Consumers belonging to the same consumer group are assigned partitions of a topic.
In Kafka, a topic can have multiple partitions. Consumers read the messages of a topic from their assigned partition(s). Within a partition, messages are ordered by sequential offsets.
If topic-wide record order is not important, you generally want to start with a higher number of partitions per topic, say 100. Your data will be distributed across the 100 partitions, assuming null keys or at least 100 unique key values with non-colliding hashes, since record keys determine partitioning. If topic-wide order is important, you're limited to one partition, and therefore one consumer thread; that thread can, however, separate consumption from processing by loading records into another data structure (e.g. a queue) for worker threads to drain.
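A toy partitioner makes the key rule concrete. Kafka actually hashes keys with murmur2 (and spreads null-key records across partitions, batch by batch, with the sticky partitioner); md5 and a simple counter are stand-ins here:

```python
import hashlib
from itertools import count

NUM_PARTITIONS = 100

_rr = count()
def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Toy partitioner: keyed records hash to a fixed partition (real Kafka
    uses murmur2; md5 is just a stable stand-in), while null keys are
    spread round-robin across the partitions."""
    if key is None:
        return next(_rr) % num_partitions
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Same key -> same partition, always, so per-key order is preserved.
assert partition_for("client-42") == partition_for("client-42")

# Null keys spread across partitions.
spread = {partition_for(None) for _ in range(10)}
print(len(spread))  # 10 distinct partitions for the first 10 unkeyed records
```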
You can then have 10 consumers consuming from the 100 partitions. Each consumer will be assigned about 10 partitions and will poll its share of them.
If you want to scale out, you simply increase the number of consumers. If you double the number of consumers to 20, each consumer will process 5 partitions, roughly doubling your throughput.
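The arithmetic behind the scaling claim, as a quick check:

```python
def partitions_per_consumer(num_partitions, num_consumers):
    """How many partitions each consumer in the group handles; when the
    count doesn't divide evenly, the first `extra` consumers take one more."""
    per, extra = divmod(num_partitions, num_consumers)
    return [per + 1 if i < extra else per for i in range(num_consumers)]

print(partitions_per_consumer(100, 10))  # ten consumers, 10 partitions each
print(partitions_per_consumer(100, 20))  # twenty consumers, 5 each
print(partitions_per_consumer(100, 30))  # uneven: some take 4, some take 3
```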

One KafkaConsumer listening to multiple partitions VS multiple KafkaConsumers listening to multiple partitions

I have ten Kafka Producers each one writing to different partition of a topic.
I cannot tell which is more effective: having one consumer listening to all ten partitions, or having ten consumers, each listening to a different partition?
In terms of correctness there is no difference between the two approaches, but remember that with ten consumers there is the overhead of connecting each consumer to Kafka. Since a single consumer is perfectly capable of reading several partitions, one consumer is usually performant enough here.
Typically, if you have multiple consumers, you'll be able to get more throughput, since you'll have multiple threads/applications pulling data from the kafka cluster, which means you'll be able to parallelize across multiple cores, and maybe multiple servers.
However, you also need to take into account what you're trying to accomplish. Does one process/application need to look at all the data? Are the messages independent of each other? All of this will inform how your application should be designed.
In the default configuration, all of the available partitions of a topic will be distributed evenly across all consumers with the same group id. So you could have one consumer, and it will automatically grab all partitions of that topic; or you could instantiate ten consumers, and each consumer will get exactly one partition in this case.
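A small sketch of that default behavior, under the assumption of a single topic whose partitions are split into contiguous ranges (as the default assignor does for one topic):

```python
def assign(partitions, consumers):
    """Spread a topic's partitions evenly over the group's consumers as
    contiguous ranges, like Kafka's default assignor for a single topic."""
    consumers = sorted(consumers)
    per, extra = divmod(len(partitions), len(consumers))
    out, start = {}, 0
    for i, c in enumerate(consumers):
        n = per + (1 if i < extra else 0)
        out[c] = partitions[start:start + n]
        start += n
    return out

partitions = list(range(10))
print(assign(partitions, ["only"]))   # one consumer takes all 10 partitions
ten = assign(partitions, [f"c{i}" for i in range(10)])
print({c: len(p) for c, p in ten.items()})   # ten consumers: one partition each
```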

Kafka repartitioning

From my understanding partitions and consumers are tied up into a 1:1 relationship in which a single consumer processes a partition. However is there such a way to repartition in the middle of processing?
We are currently trying to optimize a process in which a topic is consumed across a group, but there are cases in which data processing takes longer on a certain consumer while others are already idle. It's like data cleansing, where a certain partition might no longer need cleansing while others require fuzzy matching, which adds complexity to the task that consumer performs.
Your understanding with regards to partitions and consumers is not quite right.
If you have N partitions, then you can have up to N consumers within the same consumer group each of which reading from a single partition. When you have less consumers than partitions, then some of the consumers will read from more than one partition. Also, if you have more consumers than partitions then some of the consumers will be inactive and will receive no messages at all.
If you have one consumer per partition, some partitions might receive more messages than others, which is why some of your consumers might be idle while others are still processing. Note that messages are not always distributed over the partitions in a round-robin fashion: messages with the same key are placed into the same partition.
In Kafka, topics are partitioned, and even though you can add partitions to a topic, there is no repartitioning: all the data already written to a partition stays there. New data will be distributed among the existing partitions (round-robin if you do not define keys; otherwise a given key will always land in the same partition, as long as you do not add partitions).
But if you have a consumer group, and you add or remove consumers to this group, there is a group rebalancing where each consumer receives its share of partitions to exclusively consume from.
So if you have 3 partitions (with messages evenly distributed among them) and 2 consumers (in the same group), one consumer will have twice as many messages to handle as the other; with 3 consumers each one will consume one partition; with 4 consumers one will stay idle.
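Those counts can be checked with a couple of lines (contiguous-range split, as the default assignor does within one topic):

```python
def share(n_partitions, group):
    """Number of partitions handled by each consumer in the group; members
    beyond the partition count end up with nothing to do."""
    per, extra = divmod(n_partitions, len(group))
    return {c: per + (1 if i < extra else 0) for i, c in enumerate(group)}

print(share(3, ["c0", "c1"]))              # c0 handles 2 partitions, c1 handles 1
print(share(3, ["c0", "c1", "c2"]))        # one partition each
print(share(3, ["c0", "c1", "c2", "c3"]))  # c3 gets 0 partitions and stays idle
```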
So as you already have evenly distributed messages (which is good), you should have as many consumers as you have partitions, and if it is still not fast enough you may add n partitions and n consumers. (For sure you could also try to optimize the consumer but that is another story...)
In response to a comment:
Once a consumer -- from a given group -- is consuming a partition, it will continue to do so and will be the only one from the group consuming this partition, even if a lot of other consumers from the same group are idle. In one group a partition is never shared between consumers. (If the consumer crashes, another one will continue the work, and if a new consumer enters the group a rebalance will occur, but anyway only one consumer will work on one partition at a given time).
So one approach, as you said in your comment, is to distribute the load evenly over the partitions. Another approach is to dedicate a topic to expensive jobs and give it many partitions and many consumers, while the topic for non-expensive jobs gets fewer consumers.
A last approach, which I would not recommend, is to bypass the consumer-group features and manage how you consume from Kafka yourself, using the consumer's assign and seek methods (see the KafkaConsumer JavaDoc for more information). Spark Structured Streaming, for example, uses that approach, but it is much more complex.

Maximum subscription limit of Kafka Topics Per Consumer

What is the maximum number of topics a consumer can subscribe to in Kafka? I'm not able to find this value documented anywhere.
If a consumer subscribes to 500,000 or more topics, will performance degrade?
500,000 or more topics in a single Kafka cluster would be a bad design from the broker point of view. You typically want to keep the number of topic partitions down to the low tens of thousands.
If you find yourself thinking you need that many topics in Kafka you might instead want to consider creating a smaller number of topics and having 500,000 or more keys instead. The number of keys in Kafka is unlimited.
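A sketch of the keys-instead-of-topics idea; the md5-based mapping below is only a stand-in for Kafka's real key hashing (murmur2), but it shows that every client key pins to one partition, preserving per-client ordering, while the broker only tracks a small, fixed partition count:

```python
import hashlib

def partition_for(key, num_partitions):
    """Deterministic key -> partition mapping (md5 as a stand-in for
    Kafka's murmur2 hash of the key bytes)."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Instead of 500,000 topics, one topic with 500,000 client keys: every
# client's records land on one fixed partition of a 50-partition topic.
NUM_PARTITIONS = 50
clients = [f"client-{i}" for i in range(500)]   # sample of the key space
placement = {c: partition_for(c, NUM_PARTITIONS) for c in clients}

assert placement["client-7"] == partition_for("client-7", NUM_PARTITIONS)
used = set(placement.values())
print(len(used))  # how many of the 50 partitions this sample hits
```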
To be technical, the "maximum" number of topics you could be subscribed to is constrained by the available memory of your consumer process (if your topics are listed explicitly, a very large portion of the Java String pool will be your topic names). This seems the least likely limiting factor (listing that many topics explicitly is prohibitive in itself).
Another consideration is how the topic-assignment data structures are set up at the group coordinator brokers. They could run out of space to record the topic assignments, depending on how they do it.
Lastly, which is the most plausible, is the available memory on your Apache Zookeeper node. ZK keeps ALL data in memory for fast retrieval. ZK is also not sharded, meaning all data MUST fit onto one node. This means there is a limit to the number of topics you can create, which is constrained by the available memory on a ZK node.
Consumption is initiated by the consumers. The act of subscribing to a topic does not mean the consumer will start receiving messages for that topic. So as long as the consumer can poll and process data for that many topics, Kafka should be fine as well.
The consumer is a fairly independent entity from the Kafka cluster, unless you are talking about the built-in command-line consumer that ships with Kafka.
That said, the logic of subscribing to Kafka topics, how many to subscribe to, and how to handle the data is up to the consumer, so the scalability issue here lies in the consumer logic.
Last but not least, I am not sure it is a good idea to consume too many topics within a single consumer. The very purpose of the pub-sub mechanism that Kafka provides through the segregation of messages into topics is to let separate consumers handle specific categories of messages. If you want to consume many topics, say a few thousand, with a single consumer, why divide the data into separate topics in the first place?

Multiple Consumers for Kafka topic

I have implemented a kafka consumer similar to the way described in this article: http://howtoprogram.xyz/2016/05/29/create-multi-threaded-apache-kafka-consumer/
The way it is implemented implies only one consumer thread per partition, so if I want 10 consumer threads I would need 10 topic partitions.
Are there any other working approaches for multithreaded consumers?
I also looked into this article: https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example
However while testing this example in my environment I got an exception topic-1395414642817-47bb4df2 can't rebalance after 4 retries.
The way it is implemented implies only one consumer thread per partition, so if I want 10 consumer threads I would need 10 topic partitions.
I assume you want to have more consumer threads than partitions? This is not supported by Kafka. In Kafka, each consumer tracks its own progress through a partition (i.e., what it has read, called its offset). If you had more consumer threads than partitions, those threads would somehow need to talk to each other to divide up the data. This pattern is non-standard and not supported. It would also not scale well, because adding more consumers would increase the synchronization overhead.
Having only one consumer thread per partition keeps Kafka scalable. Best practice is to over-partition your topics to get more flexibility with regard to the number of consumer threads.
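One way to see the value of over-partitioning: the consumer-group sizes that divide a topic's partitions with no imbalance are exactly the divisors of the partition count, so a generously partitioned topic leaves far more room to scale the group later. A tiny illustration:

```python
def even_group_sizes(n_partitions):
    """Group sizes that split the topic's partitions with no imbalance,
    one reason to over-partition up front."""
    return [g for g in range(1, n_partitions + 1) if n_partitions % g == 0]

print(even_group_sizes(4))    # [1, 2, 4]
print(even_group_sizes(12))   # [1, 2, 3, 4, 6, 12]: far more room to scale
```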