I have implemented a Kafka consumer similar to the way described in this article: http://howtoprogram.xyz/2016/05/29/create-multi-threaded-apache-kafka-consumer/
The way it is implemented implies only one consumer thread per partition, so if I want 10 consumer threads, I would need 10 topic partitions.
Are there any other working approaches for multithreaded consumers?
I also looked into this article: https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example
However, while testing this example in my environment I got an exception: topic-1395414642817-47bb4df2 can't rebalance after 4 retries.
"The way it is implemented implies only one consumer thread per partition. So if I want 10 consumer threads, I would need 10 topic partitions."
I assume you want to have more consumer threads than partitions? This is not supported by Kafka. In Kafka, each consumer tracks its own progress through a partition (what it has read so far, called its offset). If you want more consumer threads than partitions, those threads would somehow need to talk to each other to divide the data. This pattern is non-standard and not supported. It would also not scale well: as you add more consumers, the synchronization overhead would grow.
Having only one consumer thread per partition keeps Kafka scalable. Best practice is to over-partition your topics to get more flexibility with regard to the number of consumer threads.
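As an illustration of over-partitioning at creation time, here is a minimal sketch using the Java AdminClient; the topic name, partition count, replication factor, and broker address are assumptions:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateOverPartitionedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        try (AdminClient admin = AdminClient.create(props)) {
            // 30 partitions allows up to 30 consumer threads later,
            // even if you start with only a handful of consumers.
            NewTopic topic = new NewTopic("my-topic", 30, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```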
Related
I need help in finding the best design for creating Kafka consumers.
We will have multiple topics, and they can be grouped; for example:
10 topics used to send out emails (a count of 10 is chosen because we will be getting more client traffic and want to dedicate one topic per client, so that one client's messages are not delayed waiting behind another's)
10 topics to process business logic, with the count of 10 explained the same as above.
With this usage, what's the best way to design the Kafka consumers? A consumer dedicated to each topic? Or is there a way to scale consumers dynamically by passing in which topics they need to subscribe to? We will certainly deploy this in containers, but we want suggestions on how to get started with the consumer part, with dynamic scalability and common code. And what's the best technology to implement this type of Kafka consumer (dotnet/java/python)?
Please also suggest whether partitions make sense in this kind of design, so that we can leverage consumer groups.
Consumers belonging to the same consumer group are assigned partitions of a topic.
In Kafka, a topic can have multiple partitions. Consumers consume the messages of a particular topic from their assigned partition(s). The messages in a partition are ordered by sequential offsets.
If topic-wide record order is not important, you generally want to start with a higher number of partitions in a topic; let's say you start with 100. Your data will be distributed across the 100 partitions, assuming null keys or at least 100 unique key values with non-colliding hashes, since record keys determine partitioning. If topic-wide order is important, you're limited to one partition, and therefore one consumer thread; however, this thread can separate consumption from processing by loading records into an alternative data structure such as a queue (see the sketch after this answer).
You can now have 10 consumers consuming from the 100 partitions. Each consumer will be assigned about 10 partitions and will consume messages from them in a round-robin fashion.
If you want to scale out, you simply increase the number of consumers. If you double the number of consumers to 20, each consumer will process 5 partitions, and you get double the throughput.
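For the single-partition, order-preserving case mentioned above, a rough sketch of the hand-off-to-a-queue pattern follows; the broker address, topic, group id, queue bound, and worker count are placeholders, and a recent kafka-clients version is assumed:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class QueueHandoffConsumer {
    private static void process(ConsumerRecord<String, String> record) {
        // Placeholder for real business logic.
        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "single-partition-group");  // placeholder group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // Bounded queue decouples consumption from processing: if workers
        // fall behind, the consumer thread blocks instead of buffering forever.
        BlockingQueue<ConsumerRecord<String, String>> queue = new LinkedBlockingQueue<>(1000);

        for (int i = 0; i < 4; i++) { // 4 worker threads, illustrative
            new Thread(() -> {
                try {
                    while (true) {
                        process(queue.take());
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "worker-" + i).start();
        }

        // Exactly one consumer thread reads the single partition, in order.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("ordered-topic")); // placeholder topic
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    queue.put(record);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Note that once records fan out to multiple workers, strict per-partition processing order is lost again; keep a single worker if ordering matters end to end.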
I have a Kafka cluster with multiple topics. I'm going to set one partition for each topic, and all those topics will be consumed by a single EC2 instance running 3 Kafka consumer threads (one consumer per thread), all belonging to the same consumer group.
I haven't experimented with it yet, but I'm wondering: will Kafka balance the partitions of all the topics equally across the 3 threads, or will it assign all partitions to be consumed by only one thread?
The Kafka consumer is NOT thread-safe; you should not share the same consumer instance between different threads. Instead, you should create a new instance for each thread.
From the documentation https://kafka.apache.org/0100/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#multithreaded:
1. One Consumer Per Thread
A simple option is to give each thread its own consumer instance. Here are the pros and cons of this approach:
PRO: It is the easiest to implement
PRO: It is often the fastest as no inter-thread co-ordination is needed
PRO: It makes in-order processing on a per-partition basis very easy to implement (each thread just processes messages in the order it receives them).
CON: More consumers means more TCP connections to the cluster (one per thread). In general Kafka handles connections very efficiently so this is generally a small cost.
CON: Multiple consumers means more requests being sent to the server and slightly less batching of data which can cause some drop in I/O throughput.
CON: The number of total threads across all processes will be limited by the total number of partitions.
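A minimal sketch of that first option, one KafkaConsumer per thread, assuming a recent kafka-clients version; the broker address, topic, group id, and thread count below are placeholders:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumerPerThread {
    public static void main(String[] args) {
        int threadCount = 3; // should not exceed the topic's partition count
        for (int i = 0; i < threadCount; i++) {
            new Thread(() -> {
                Properties props = new Properties();
                props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
                props.put("group.id", "my-group"); // same group => partitions are split
                props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
                props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

                // Each thread owns its own KafkaConsumer; instances are
                // never shared because the consumer is not thread-safe.
                try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                    consumer.subscribe(Collections.singletonList("my-topic")); // placeholder topic
                    while (true) {
                        for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                            System.out.printf("%s partition=%d offset=%d%n",
                                    Thread.currentThread().getName(), record.partition(), record.offset());
                        }
                    }
                }
            }, "consumer-" + i).start();
        }
    }
}
```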
If a topic has several partitions, messages from different partitions can be processed in parallel. You can create a few consumer instances with the same group.id, and each consumer will get a subset of the partitions to consume data from.
Kafka doesn't manage parallelism across different topics, though. By this I mean that group assignment is not balanced across different topics: partitions from different topics might not be assigned evenly among the consumers.
You should not have more consumers than partitions; the extra consumers would sit idle, because message ordering and offset tracking are guaranteed only within a partition, so each partition is consumed by at most one consumer in a group. Partially because of this, the Kafka (Java) consumer is not thread-safe (the producer, by contrast, is).
So in Kafka's case, the number of partitions is your parallelism.
So in your scenario, having one partition, run exactly one consumer with exactly one consumer instance in exactly one thread (you can, of course, hand messages off to a pool of threads for later processing).
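That said, how evenly partitions from many single-partition topics spread over the group's threads depends on the assignor: the default range assignor assigns each topic's partitions independently, so many one-partition topics can pile up on one consumer, while the round-robin assignor distributes partitions across all subscribed topics. A configuration sketch (broker address, group, and topic names are placeholders):

```java
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class RoundRobinAssignment {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "multi-topic-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // Spread partitions from ALL subscribed topics evenly across the
        // group's consumers, instead of assigning each topic independently.
        props.put("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.RoundRobinAssignor");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Arrays.asList("topic-a", "topic-b", "topic-c"));
        // ... poll loop as usual ...
    }
}
```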
Is there any way to pause or throttle a Kafka producer based on consumer lag or other consumer issues? Or would the producer need to determine for itself whether there is consumer lag and perform the throttling itself?
Kafka is built on a pub/sub design: producers publish messages to a centralized topic, and multiple consumers can subscribe to that topic. Since multiple consumers are involved, you cannot tie the producer's speed to any one of them; one consumer can be slow while another is fast. It is also against the design principle, since otherwise both systems would become tightly coupled. If you have a throttling use case, you may want to evaluate another approach, such as direct REST calls.
Producer and Consumer are decoupled.
Producers push data to Kafka topics (partitioned topics), which are stored on Kafka brokers. A producer doesn't know who consumes the messages, or how often.
Consumers consume data from the brokers. A consumer doesn't know how many producers produced the messages. The same messages can even be consumed by several consumers in different groups, and, for example, some consumers can consume faster than others.
You can read more about producers and consumers on the Apache Kafka webpage.
It is not possible to throttle producers based on the performance of consumers.
"In my scenario I don't want to lose events if the disk size is exceeded before a message is consumed."
To tackle your issue, you have to rely on the parallelism Kafka offers. Your Kafka topic should have multiple partitions, and producers should use different keys to populate the topic. Your data will then be distributed across multiple partitions, and by adding a consumer group you can manage the load within a group of consumers. All data within a partition is processed in order, which may be relevant since you are dealing with event processing.
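If you still need producer-side backpressure, one application-level workaround (not a Kafka feature) is to poll the consumer group's lag with the AdminClient and pause sending while it is above a threshold. A sketch, assuming kafka-clients 2.5+ and placeholder broker, group, and threshold values:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class LagCheck {
    // Total lag of the given consumer group across its assigned partitions.
    static long totalLag(AdminClient admin, String groupId) throws Exception {
        Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets(groupId)
                     .partitionsToOffsetAndMetadata().get();

        Map<TopicPartition, OffsetSpec> latestSpec = new HashMap<>();
        committed.keySet().forEach(tp -> latestSpec.put(tp, OffsetSpec.latest()));
        Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                admin.listOffsets(latestSpec).all().get();

        long lag = 0;
        for (Map.Entry<TopicPartition, OffsetAndMetadata> e : committed.entrySet()) {
            if (e.getValue() == null) continue; // no committed offset yet
            lag += latest.get(e.getKey()).offset() - e.getValue().offset();
        }
        return lag;
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Producer backs off while the group is too far behind.
            while (totalLag(admin, "my-consumer-group") > 100_000) {
                Thread.sleep(1_000);
            }
            // ... proceed to send ...
        }
    }
}
```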
What is the maximum number of topics a consumer can subscribe to in Kafka? I am not able to find this value documented anywhere.
If a consumer subscribes to 500,000 or more topics, will there be a degradation in performance?
500,000 or more topics in a single Kafka cluster would be a bad design from the broker's point of view. You typically want to keep the number of topic partitions down to the low tens of thousands.
If you find yourself thinking you need that many topics in Kafka, you might instead want to consider creating a smaller number of topics and having 500,000 or more keys. The number of keys in Kafka is unlimited.
To be technical, the "maximum" number of topics you could be subscribed to would be constrained by the available memory of your consumer process (if your topics are listed explicitly, a very large portion of the Java String pool will be your topic names). This seems the less likely limiting factor (listing that many topics explicitly is prohibitive anyway).
Another consideration is how the topic-assignment data structures are set up on the group coordinator brokers. They could run out of space to record the topic assignments, depending on how they do it.
Lastly, and most plausibly, there is the available memory on your Apache ZooKeeper node. ZK keeps ALL data in memory for fast retrieval, and it is not sharded, meaning all data MUST fit on one node. This means there is a limit to the number of topics you can create, constrained by the available memory on a ZK node.
Consumption is initiated by the consumers. The act of subscribing to a topic does not mean the consumer will start receiving messages for that topic. So as long as the consumer can poll and process data for that many topics, Kafka should be fine as well.
The consumer is a fairly independent entity from the Kafka cluster, unless you are talking about the built-in command-line consumer that ships with Kafka.
That said, the logic of subscribing to Kafka topics, how many topics to subscribe to, and how to handle that data is up to the consumer. So the scalability issue here lies with the consumer logic.
Last but not least, I am not sure it is a good idea to consume too many topics within a single consumer. The very purpose of the pub/sub mechanism Kafka provides, through segregating messages into various topics, is to facilitate handling specific categories of messages with separate consumers. If you want to consume many topics, say a few thousand of them, with a single consumer, why divide the data into separate topics in Kafka in the first place?
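One practical note: instead of listing a huge number of topics explicitly, a consumer can subscribe by regex pattern, which also picks up matching topics created later. A minimal sketch (the pattern and connection settings are illustrative):

```java
import java.time.Duration;
import java.util.Collection;
import java.util.Properties;
import java.util.regex.Pattern;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class PatternSubscriber {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "pattern-group");            // placeholder group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Subscribes to every topic whose name matches the regex,
            // including topics created after this call.
            consumer.subscribe(Pattern.compile("events-.*"), new ConsumerRebalanceListener() {
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) { }
            });
            while (true) {
                consumer.poll(Duration.ofSeconds(1)).forEach(r ->
                        System.out.printf("%s:%d%n", r.topic(), r.offset()));
            }
        }
    }
}
```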
We started using Apache Kafka to persist timeseries data into a timeseries database. We began with just a single topic, a producer writing to this topic, and a single consumer reading from this topic and dumping the data to the timeseries database.
We had 3 broker instances, and what we noticed on the first try was that the producer was pretty fast at writing messages to the topic. Within a matter of 30 minutes, we had around 1.5 million messages. The consumer was doing just 300 messages per second.
Our next approach was to partition the topic and have more consumer instances (equal to the number of partitions). This definitely improved the consumer write speed. Now my questions are:
What happens if I set my topic's partition count to 6, but I have only 3 broker instances? Which broker instances would be the leaders for partitions 1 to 6?
Is there a formula to determine how many partitions I will need? Since this was our test environment, we could play with it and scale it. We might not be able to do the same in our production environment. So how do we determine the partition count?
The partitions get distributed amongst your brokers. It's impossible to know which broker will be elected leader of a given partition -- and it can change over time. Depending on which version of Kafka and which Consumer API you use, your consumer may or may not discover partition leaders on its own. With the SimpleConsumer you have to find partition leaders on your own, and respond to new leader election in your code (instead of having it handled by the API automatically).
As to the number of partitions -- there's no real "formula" other than this: you can have no more parallelism than you have partitions. If you have 4 partitions and 5 consumers, one of the consumers will starve. I usually use numbers like 12 or 60 or multiples thereof for the number of partitions for large topics. Something that divides easily and cleanly among variable numbers of consumers.
Also, note that you can later on change the number of partitions, with some caveats. See this answer for how and what the caveats are.
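For completeness, partition counts can later be increased (never decreased) with the Java AdminClient; a sketch, with one of the caveats being that keyed records may start hashing to different partitions (topic name, count, and broker address are placeholders):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

public class IncreasePartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Grow "my-topic" to 12 partitions. Caveat: records with the
            // same key may now map to a different partition than before,
            // so per-key ordering is only guaranteed for new data.
            admin.createPartitions(
                    Collections.singletonMap("my-topic", NewPartitions.increaseTo(12))
            ).all().get();
        }
    }
}
```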