I have a use case where a message needs to be broadcast to all nodes in a horizontally scalable, stateless application cluster, and I am considering Kafka for it. Since each node of the cluster must receive ALL messages in the topic, each node needs its own consumer group.
One can assume here that the volume of messages is not so high that each node cannot handle all messages.
To achieve this with Kafka, I would end up using the instance ID (or some other unique identifier) of the consumer process as the consumer group ID when consuming from the topic. This will push the number of consumer groups high, and every redeployment will start new consumer groups.
How many active consumer groups can I have at maximum at any given time? Will the number of consumer groups become a bottleneck before other bottlenecks (like bandwidth) kick in?
Frequent deployments of the consumer application will cause churn in the set of active consumer groups. Can Kafka sustain this churn over long periods of time?
Self-answer to my question: one solution that came from further research is to use the Kafka assign() API instead of the subscribe() API to consume. The former does not need a consumer group; I just configure every node to consume messages from all the partitions of the topic.
Acknowledgement to Igor Soarez, who seeded the idea in the comments that consumer groups are not needed to consume.
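The assign()-based broadcast can be illustrated with a toy model (plain Python, not the Kafka client API; the node names and helper functions are invented for illustration): under subscribe() with one shared group, partitions are split among nodes, while under assign() every node takes every partition.

```python
# Toy model contrasting subscribe() group semantics with manual assign().

def subscribe_assignment(partitions, nodes):
    """Group semantics: each partition goes to exactly one node (round robin)."""
    assignment = {n: [] for n in nodes}
    for i, p in enumerate(partitions):
        assignment[nodes[i % len(nodes)]].append(p)
    return assignment

def assign_assignment(partitions, nodes):
    """Manual assign(): every node takes every partition -> broadcast."""
    return {n: list(partitions) for n in nodes}

partitions = [0, 1, 2, 3]
nodes = ["node-a", "node-b"]

grouped = subscribe_assignment(partitions, nodes)    # partitions split
broadcast = assign_assignment(partitions, nodes)     # every node sees everything
```

With a real client this amounts to calling assign() with a TopicPartition for every partition of the topic instead of calling subscribe(). Since no group coordinates the nodes, each node also decides for itself where to start reading (e.g. seeking to the latest offset), unless it commits offsets under its own group.id.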
Related
I am externalising the Kafka consumer metadata for a topic in a DB, including consumer groups and the number of consumers in each group.
The Consumer_info table has:
- Topic name
- Consumer group name
- Number of consumers in group
- Consumer class name
At app server startup, I read the table and create consumer threads based on the number set in it. If the consumer group count is set to 3, I create 3 consumer threads. This number is based on the number of partitions for the given topic.
Now, if I need to scale out horizontally, how do I distribute the consumers belonging to the same group across multiple app server nodes, without reading the same message more than once?
The consumer initialization code, which runs at app server startup, reads the consumer metadata from the DB and creates all the consumer threads on the same app server instance. Even if I add more app server instances, they would all be redundant: the first server to start has already spawned consumer threads equal to the number of partitions, so any further consumers created on other instances would be idle.
Can you suggest a better approach to scale out consumers horizontally?
consumer groups and number of consumer in group
Running kafka-consumer-groups --describe ad hoc would give you more up-to-date information than an external database query, especially since consumers can rebalance or fall out of the group at any moment.
how do i distribute the consumers belonging to same group across multiple app server nodes. Without reading same message more than once
This is how Kafka Consumer groups operate, out of the box, assuming you are not manually assigning partitions in your code.
It is not possible to read a message more than once after you have consumed, acked, and committed that offset within the group.
I don't see the need for an external database when you can instead expose an API around the kafka-consumer-groups command.
Or you can use Stream-Messaging-Manager by Cloudera which shows a lot of this information as well
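As for scaling out: let Kafka itself do the distribution. Give every app server only a share of the threads (or simply cap the threads per node) and have all of them join the same group; the group coordinator then hands each partition to exactly one member, regardless of which host it runs on. A toy model of a simplified range assignor (plain Python, invented member names, not the broker's actual implementation) sketches the outcome:

```python
# Toy model: partitions of one topic are spread over ALL members of ONE group,
# no matter which app server each member runs on.

def range_assign(num_partitions, members):
    """Simplified range assignor: contiguous chunks of partitions per member."""
    members = sorted(members)
    per = num_partitions // len(members)
    extra = num_partitions % len(members)
    assignment, start = {}, 0
    for i, m in enumerate(members):
        count = per + (1 if i < extra else 0)
        assignment[m] = list(range(start, start + count))
        start += count
    return assignment

# 3 partitions, one consumer thread per app server, all sharing the same group.id:
members = ["server1-c0", "server2-c0", "server3-c0"]
assignment = range_assign(3, members)
```

Because every partition lands on exactly one member, no message is read twice within the group, and adding or removing an app server just triggers a rebalance.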
I'm trying to understand how to dynamically scale out an application which consumes a huge number of topics (unfortunately I can't reduce their number; by design, each topic is for a particular type of data).
I want my application cluster to share the load from all 200+ topics. E.g. when a new app node is added to the cluster, it should "steal" some topic subscriptions from old nodes, so the load becomes evenly distributed again.
As far as I understand, Kafka partitions/consumer groups help to parallelize a single topic, not to share load between multiple topics.
You need to make sure that all your app instances use the same Kafka consumer group (via group.id). In that case you get exactly the even distribution you want: when a new app instance is added, the consumer group rebalances and redistributes the load.
Also, when a new topic/partition is created, it will take the consumer up to "metadata.max.age.ms" (default 5 minutes) to start consuming from it. Make sure to set "auto.offset.reset" to "earliest" to not miss any data.
Finally, you might want to use a regex to subscribe to all those topics (if possible).
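The "stealing" behavior the question asks about is just a rebalance. A toy model of a simplified round-robin assignor (plain Python, not the Kafka client; topic and node names invented) shows how adding a third node takes over some of the topic subscriptions:

```python
from itertools import cycle

def round_robin_assign(topic_partitions, members):
    """Simplified round-robin assignor over all (topic, partition) pairs."""
    assignment = {m: [] for m in members}
    ring = cycle(sorted(members))
    for tp in sorted(topic_partitions):
        assignment[next(ring)].append(tp)
    return assignment

tps = [(f"topic-{i}", 0) for i in range(6)]  # 6 single-partition topics
before = round_robin_assign(tps, ["node-1", "node-2"])
after = round_robin_assign(tps, ["node-1", "node-2", "node-3"])
```

Before the third node joins, each of the two nodes owns three topics; after the rebalance, each node owns two, i.e. node-3 has "stolen" subscriptions from the older nodes.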
A Kafka Topic is a grouping of messages of a similar type, so you probably have 200+ types of messages that have to be consumed by 200+ types of consumers (even if one consumer may be able to handle several types, logically you have 200+ different handlings).
Kafka Partitions are a way to parallelize the consumption of messages from one Topic. Each Partition is fully consumed by exactly one consumer in a consumer group bound to the topic, so the total number of partitions for a topic needs to be at least the number of consumers in a consumer group for the partitioning feature to pay off.
So here you would have 200+ Topics, each having N partitions (where N is greater than or equal to your expected maximum number of application instances), and each application should consume from all 200+ Topics. Consumers label themselves with a consumer group name; each record published to a topic is delivered to one consumer instance within each subscribing consumer group. All your consumers can use the same consumer group.
See Kafka documentation for an even better explanation...
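The per-group delivery rule above can be sketched with a toy model (plain Python, invented group and member names): every subscribing group receives the record, and inside each group exactly one member gets it.

```python
def deliver(record_partition, groups):
    """Each group independently receives the record; within a group it goes
    to whichever member owns that partition (modulo used as a stand-in for
    real partition ownership)."""
    deliveries = {}
    for group, members in groups.items():
        owner = sorted(members)[record_partition % len(members)]
        deliveries[group] = owner
    return deliveries

# Two independent consumer groups subscribed to the same topic:
groups = {"billing": ["b0", "b1"], "audit": ["a0", "a1", "a2"]}
out = deliver(4, groups)   # a record sitting on partition 4
```

Both groups see the record (pub/sub across groups), but within each group a single consumer handles it (queue semantics within a group).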
What is the maximum number of topics a consumer can subscribe to in Kafka? I am not able to find this value documented anywhere.
If a consumer subscribes to 500,000 or more topics, will performance degrade?
500,000 or more topics in a single Kafka cluster would be a bad design from the broker point of view. You typically want to keep the number of topic partitions down to the low tens of thousands.
If you find yourself thinking you need that many topics in Kafka you might instead want to consider creating a smaller number of topics and having 500,000 or more keys instead. The number of keys in Kafka is unlimited.
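The "many keys instead of many topics" idea works because the producer's partitioner maps equal keys to the same partition, so per-key ordering is preserved without per-key topics. A toy stand-in (Kafka's default partitioner actually uses murmur2 hashing; MD5 is used here only to keep the sketch self-contained and deterministic) shows the property that matters:

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Simplified keyed partitioner: equal keys always map to the same
    partition. (Kafka's real default partitioner uses murmur2, not MD5.)"""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

p1 = partition_for("device-42", 12)
p2 = partition_for("device-42", 12)   # same key -> same partition, always
```

So 500,000 distinct keys can share one modestly partitioned topic, with each key's records staying in order on its partition.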
To be technical, the "maximum" number of topics you could subscribe to is constrained by the available memory of your consumer process (if your topics are listed explicitly, a very large portion of the Java String pool will be your topic names). This seems the less likely limiting factor, since listing that many topics explicitly is prohibitive anyway.
Another consideration is how the topic assignment data structures are set up on the group coordinator brokers. They could run out of space to record the topic assignments, depending on how they do it.
Lastly, and most plausibly, there is the available memory on your Apache ZooKeeper node. ZK keeps ALL data in memory for fast retrieval, and it is not sharded, meaning all data MUST fit on one node. This puts a limit on the number of topics you can create, constrained by the available memory of a ZK node.
Consumption is initiated by the consumers. The act of subscribing to a topic does not mean the consumer will start receiving messages for that topic. So as long as the consumer can poll and process data for that many topics, Kafka should be fine as well.
A consumer is a fairly independent entity from the Kafka cluster, unless you are talking about the built-in command-line consumer that ships with Kafka.
That said, the logic of subscribing to a Kafka topic, how many topics to subscribe to, and how to handle that data is up to the consumer. So the scalability issue here lies in the consumer logic.
Last but not least, I am not sure it is a good idea to consume too many topics within a single consumer. The very purpose of the pub/sub mechanism that Kafka provides, through the segregation of messages into various topics, is to facilitate handling specific categories of messages with separate consumers. So if you want to consume many topics, like a few thousand of them, with a single consumer, why divide the data into separate topics in the first place?
I'm a newbie to Kafka. I had a glance at the Kafka documentation. It seems that dispatching messages to a subscribing consumer group is implemented by binding partitions to consumer instances.
One important thing to remember when working with Apache Kafka is that the number of consumers in the same consumer group should be less than or equal to the number of partitions in the consumed topic. Otherwise, the excess consumers will not receive any messages from the topic.
In a non-prod environment, I didn't configure the topic's partitions. In that case, is there only a single partition? And if I start multiple consumers sharing the same group and subscribe them to the topic, will messages always be dispatched to the same instance in the group? In other words, do I have to partition the topic to get load balancing within a consumer group?
Thanks!
You are absolutely right. One partition cannot be processed in parallel (by one consumer group). You can treat a partition as atomic: it cannot be split.
If you configure the non-prod and prod environments with the same number of partitions per topic, that should help you find the correct number of consumers and catch problems before moving to prod.
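The "consumers must not exceed partitions" point can be seen in a toy model (plain Python, invented consumer names, not the broker's actual assignor): with 3 partitions and 4 group members, one member is left idle.

```python
def assign_with_idle(num_partitions, consumers):
    """Each partition goes to exactly one consumer; leftover consumers sit idle."""
    consumers = sorted(consumers)
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

# 4 consumers in one group, but the topic has only 3 partitions:
a = assign_with_idle(3, ["c0", "c1", "c2", "c3"])
idle = [c for c, parts in a.items() if not parts]   # c3 receives nothing
```

This is why a single-partition topic (the default in an unconfigured setup) sends everything to one consumer no matter how many join the group.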
In my setup, I have a consumer group with three processes (3 instances of a service) that consume from Kafka. What I've found to be happening is that the first node receives all of the traffic. If one node is manually killed, the next node picks up all Kafka traffic, but the last remaining node sits idle.
The behavior desired is that all messages get distributed evenly across all instances within the consumer group, which is what I thought should happen. As I understand, the way Kafka works is that it is supposed to distribute the messages evenly amongst all members of a consumer group. Is my understanding correct? I've been trying to determine why it may be that only one member of the consumer group is getting all traffic with no luck. Any thoughts/suggestions?
You need to make sure that the topic has more than one partition to be able to consume it in parallel. A consumer in a consumer group gets one or more allocated partitions from the broker but a single partition will never be shared across several consumers within the same group unless a consumer goes offline. The number of partitions a topic has equals the maximum number of consumers in a consumer group that can feed from a topic.
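A toy model (plain Python, invented node names, not the actual group protocol) of why a single-partition topic pins all traffic to one node, while three partitions spread it across the group:

```python
def owners(num_partitions, consumers):
    """Map each partition to exactly one consumer; a partition is never shared."""
    consumers = sorted(consumers)
    return {p: consumers[p % len(consumers)] for p in range(num_partitions)}

single = owners(1, ["node1", "node2", "node3"])   # one partition: one node active
spread = owners(3, ["node1", "node2", "node3"])   # one partition per node
```

With one partition, node1 owns everything and the others are hot standbys (which matches the observed failover behavior); raising the partition count to 3 gives each node its own partition.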