Kafka consumer horizontal scaling across multiple nodes - apache-kafka

I am externalising the Kafka consumer metadata for each topic in a database, including the consumer groups and the number of consumers in each group.
The Consumer_info table has:
Topic name,
Consumer group name,
Number of consumers in group,
Consumer class name
At app server startup I read the table and create consumers (threads) based on the number set in the table. If the consumer count for a group is set to 3, I create 3 consumer threads. This number is based on the number of partitions for the given topic.
Now, if I need to scale out horizontally, how do I distribute the consumers belonging to the same group across multiple app server nodes without reading the same message more than once?
The consumer initialization code, which is called at app server startup, reads the consumer metadata from the database and creates all the consumer threads on the same app server instance. Even if I add more app server instances, they would all be redundant, because the first server to start has already spawned as many consumer threads as there are partitions. Any more consumers created on other instances would be idle.
Can you suggest a better approach to scaling out consumers horizontally?

"consumer groups and number of consumer in group"
Running kafka-consumer-groups --describe ad hoc would give you more up-to-date information than an external database query, especially given that consumers can rebalance and can fall out of the group at any moment.
"how do i distribute the consumers belonging to same group across multiple app server nodes. Without reading same message more than once"
This is how Kafka consumer groups operate out of the box, assuming you are not manually assigning partitions in your code.
Within the group, a message is not read again once you have consumed it and committed its offset.
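As a rough illustration of that out-of-the-box behaviour, here is a minimal in-memory sketch of a range-style assignment for a single topic. It is a simplified model, not the Kafka client API: the function name range_assign and the member ids such as node1-t1 are hypothetical, and real member ids and assignment strategies may differ.

```python
def range_assign(num_partitions, members):
    """Simplified model of a range-style assignment for one topic.

    Members are sorted by id. Each member gets a contiguous block of
    partitions, and the first (num_partitions % len(members)) members
    get one extra partition.
    """
    members = sorted(members)
    base, extra = divmod(num_partitions, len(members))
    assignment = {}
    start = 0
    for i, member in enumerate(members):
        count = base + (1 if i < extra else 0)
        assignment[member] = list(range(start, start + count))
        start += count
    return assignment


# Hypothetical setup: a topic with 6 partitions, 3 consumer threads per node.
node1_threads = ["node1-t1", "node1-t2", "node1-t3"]
node2_threads = ["node2-t1", "node2-t2", "node2-t3"]

# Only node1 is running: its three threads share all six partitions.
one_node = range_assign(6, node1_threads)

# node2 joins the same group: the partitions are spread over both nodes.
two_nodes = range_assign(6, node1_threads + node2_threads)

for member, partitions in two_nodes.items():
    print(member, partitions)
```

In this model every partition has exactly one owner in the group at any time, which is why adding nodes with the same group.id does not by itself cause duplicate reads. It also shows why the partition count needs to be at least the total number of consumer threads across all nodes for none of them to sit idle.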
I don't see the need for an external database when you can expose an API around the kafka-consumer-groups command.
Or you can use Streams Messaging Manager by Cloudera, which shows a lot of this information as well.

Related

Scaling Maximum Number of Consumer Groups Kafka

I have a use case where a message needs to be broadcast to all nodes in a horizontally scalable, stateless application cluster, and I am considering Kafka for it. Since each node of the cluster will need to receive ALL messages in the topic, each node of the cluster needs to have its own consumer group.
One can assume here that the volume of messages is not so high that each node cannot handle all messages.
To achieve this with Kafka, I would end up using the instanceId (or some unique identifier) of the consumer process as the consumer group id when consuming from the topic. This will push the number of consumer groups high. As redeployments are done, new consumer groups will start.
How many active consumer groups can I have at most at any given time? Will the number of consumer groups become a bottleneck before other bottlenecks (like bandwidth) kick in?
There will be churn of active consumer groups with frequent deployments of the consumer application. Can Kafka sustain this churn in consumer groups over long periods of time?
Self-answer to my question: One solution that came from further research is to use the Kafka assign() API instead of the subscribe() API to consume. The former does not need a consumer group. I just configure every node to consume messages from all the partitions of the topic.
Acknowledgement to Igore Soarez who seeded the idea of not needing consumer groups to consume in comments.
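The idea behind the self-answer can be sketched with a small in-memory model. This is not the Kafka client: BroadcastNode, its poll method, and the topic dictionary are hypothetical stand-ins. With the real client each node would call assign() with all partitions of the topic, and the model only mimics the effect of each node keeping its own read position per partition.

```python
class BroadcastNode:
    """Models a node that reads every partition itself and tracks
    its own read position per partition (no group coordination)."""

    def __init__(self, name, partitions):
        self.name = name
        self.position = {p: 0 for p in partitions}
        self.received = []

    def poll(self, topic):
        # Read everything past our own position in every partition.
        for partition, log in topic.items():
            new_records = log[self.position[partition]:]
            self.received.extend(new_records)
            self.position[partition] += len(new_records)


# Hypothetical topic with two partitions, modelled as lists of records.
topic = {0: ["a", "c"], 1: ["b"]}
nodes = [BroadcastNode(f"node{i}", topic.keys()) for i in range(3)]

for node in nodes:
    node.poll(topic)

# Polling again without new records delivers nothing new.
for node in nodes:
    node.poll(topic)

for node in nodes:
    print(node.name, sorted(node.received))
```

Because no node shares its positions with any other, every node sees every record, which is the broadcast behaviour the question asks for. The trade-off is that each node has to decide for itself where to start reading after a restart.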

Can consumer groups span multiple servers?

When creating a consumer group in Kafka, does it create a pool of workers that run on the same JVM process or could a consumer group span multiple computers/nodes?
If it spans multiple computers then keeping track of offsets etc. will be hard.
First of all, you don't create consumer groups directly. You just create consumers, and consumers that have the same group.id form a consumer group. When multiple consumers are subscribed to a topic and belong to the same consumer group, each consumer in the group will receive messages from a different subset of the partitions in the topic. As shown in the image below:
Of course you can create these consumers on different servers, and that is the recommended approach for load balancing.
Kafka stores the offsets for each consumer group in a topic named __consumer_offsets, so keeping track of the offsets is not that hard. You can check the offsets of a consumer group with the kafka-consumer-groups tool and its --describe option.
"does it create a pool of workers that run on the same jvm process or could a consumer group span multiple computers/nodes?"
It depends on how many JVM processes you run for your consumer group. And yes, it can span multiple computers/nodes. Kafka's group coordinator then manages the assignment of the topic's partitions to the individual consumers in the group. Note that a single TopicPartition can be consumed by at most one consumer instance within the same consumer group.
"If it spans multiple computers then keeping track of offsets etc. will be hard."
Kafka makes this easy by centrally storing the committed offsets of each consumer group in an internal topic called "__consumer_offsets", which is available across the entire cluster. This works as long as all consumers in the group connect to the same Kafka cluster.
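To make the point about central offset tracking concrete, here is a minimal sketch in which a plain dictionary stands in for the internal offsets topic. The names commit, resume_position, and the group and topic names are all hypothetical, and the real storage format is more involved. The sketch only shows that progress is keyed by group, topic, and partition rather than by the machine a consumer runs on.

```python
# Stand-in for the internal offsets topic: (group, topic, partition) -> offset.
committed = {}


def commit(group, topic, partition, offset):
    """Record the group's progress for one partition."""
    committed[(group, topic, partition)] = offset


def resume_position(group, topic, partition, default=0):
    """Where a consumer of this group should continue for this partition."""
    return committed.get((group, topic, partition), default)


# A consumer of group "billing" running on server A commits its progress.
commit("billing", "orders", 0, 42)

# A consumer of the same group on server B later takes over the partition.
position_on_b = resume_position("billing", "orders", 0)

# A different group has its own, independent progress for the same partition.
position_other_group = resume_position("audit", "orders", 0)

print(position_on_b, position_other_group)
```

Since the key contains no host or process information, it makes no difference to the bookkeeping whether the members of a group run in one JVM or on several machines.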

Scaling out with 200+ Kafka topics

I'm trying to understand how to dynamically scale out an application that consumes a huge number of topics (unfortunately I can't reduce their number; by design, each topic is for a particular type of data).
I want my application cluster to share the load from all 200+ topics. For example, when a new app node is added to the cluster, it should "steal" some topic subscriptions from the old nodes, so that the load becomes evenly distributed again.
As far as I understand, Kafka partitions/consumer groups help to parallelize a topic, not to share a load between multiple topics.
You need to make sure that all your app instances use the same Kafka consumer group (via group.id). In that case you get the even distribution you want. When a new app instance is added, the consumer group rebalances and the load is redistributed.
Also, when a new topic/partition is created, it can take a consumer up to "metadata.max.age.ms" (default is 5 minutes) to start consuming from it. Make sure to set "auto.offset.reset" to "earliest" so that you do not miss any data.
Finally, you might want to use a regex to subscribe to all those topics (if possible).
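The combination of a regex subscription and a shared group can be sketched as follows. This is a simplified model rather than the Kafka client: the topic naming scheme data.typeNNN, the assumption of 4 partitions per topic, the instance names, and the round-robin style spread function are all illustrative assumptions.

```python
import re
from collections import Counter


def matching_topics(pattern, topics):
    """Select the topics whose full name matches the regex."""
    regex = re.compile(pattern)
    return sorted(t for t in topics if regex.fullmatch(t))


def spread(topic_partitions, instances):
    """Deal topic-partitions out to instances in turn (round-robin style model)."""
    instances = sorted(instances)
    return {
        tp: instances[i % len(instances)]
        for i, tp in enumerate(sorted(topic_partitions))
    }


# Hypothetical cluster: 200 data topics plus one unrelated topic.
all_topics = [f"data.type{i:03d}" for i in range(200)] + ["internal.audit"]
topics = matching_topics(r"data\..*", all_topics)

# Assumption for the sketch: every topic has 4 partitions.
topic_partitions = [(t, p) for t in topics for p in range(4)]

# Load per instance before and after a third app instance joins the group.
before = Counter(spread(topic_partitions, ["app-1", "app-2"]).values())
after = Counter(spread(topic_partitions, ["app-1", "app-2", "app-3"]).values())

print(dict(before))
print(dict(after))
```

In this model the unit that moves between instances is the topic-partition, not the topic, so a new instance takes over a share of partitions from many topics at once.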
A Kafka topic is a grouping of messages of a similar type, so you probably have 200+ types of messages that have to be consumed by 200+ types of consumers (even if one consumer may be able to handle several types, logically you have 200+ different handlers).
Kafka partitions are a way to parallelize the consumption of the messages from one topic. Each partition is fully consumed by one consumer in a consumer group bound to the topic, so the total number of partitions for a topic needs to be at least the number of consumers in the consumer group for the partitioning to pay off.
So here you would have 200+ topics, each with N partitions (where N is greater than or equal to your expected maximum number of application instances), and each application instance should consume from all 200+ topics. Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. All consumers can use the same consumer group.
See Kafka documentation for an even better explanation...

Apache Kafka Multiple Consumer Instances

I have a consumer that is supposed to read messages from a topic. This consumer actually reads the messages and writes them to a time series database. We have multiple instances of the time series database running as a cluster on multiple physical machines.
Our plan is to deploy the consumer on all those machines where the time series service is running. So if I have 5 nodes on which the time series service is running, I will install one consumer instance per node. All those consumer instances belong to the same consumer group. So in pictures the set up looks like below:
As you can see, the Producer P1 and P2 write into 2 partitions namely partition 1 and partition 2 of the kafka topic. I then have 4 instances of the time series service where one consumer is running per instance. How should I read using my consumer properly such that I do not end up with duplicate messages in my time series database?
Edit: After reading through the Kafka documentation, I came across these two statements:
If all the consumer instances have the same consumer group, then this works just like a traditional queue balancing load over the consumers.
If all the consumer instances have different consumer groups, then this works like publish-subscribe and all messages are broadcast to all consumers.
So in my case above, it is behaving like a Queue? Is my understanding correct?
If all consumers belong to one group (have the same group.id), then the Kafka topic will behave like a queue for you.
Important: there is no reason to have more consumers than partitions, as out-of-the-box Kafka consumers are scaled by partitions.
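The queue versus publish-subscribe behaviour, and the effect of having more consumers than partitions, can be illustrated with a small in-memory model. It is a sketch under stated assumptions, not the Kafka client: the deliver function, the consumer names, and the group names are hypothetical, and partition ownership is simplified to partition number modulo group size.

```python
from collections import defaultdict


def deliver(records, consumers):
    """Deliver (partition, value) records to consumers.

    consumers maps consumer name -> group id. Within each group, a
    partition is owned by exactly one member, so each record reaches
    one consumer per group.
    """
    groups = defaultdict(list)
    for name, group in sorted(consumers.items()):
        groups[group].append(name)

    received = defaultdict(list)
    for partition, value in records:
        for members in groups.values():
            owner = members[partition % len(members)]
            received[owner].append(value)
    return dict(received)


# Two partitions, as in the question.
records = [(0, "m0"), (1, "m1"), (0, "m2"), (1, "m3")]

# Same group: the records are split between the consumers (queue-like).
same_group = deliver(records, {"c1": "tsdb", "c2": "tsdb"})

# Different groups: every consumer receives every record (publish-subscribe).
different_groups = deliver(records, {"c1": "g1", "c2": "g2"})

# Four consumers in one group but only two partitions: two consumers get nothing.
four_consumers = deliver(records, {f"c{i}": "tsdb" for i in range(1, 5)})

print(same_group)
print(different_groups)
print(four_consumers)
```

In the single-group case each record is written to the time series database once, because only the owner of its partition receives it.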

Kafka Only One Consumer in Consumer Group Getting Messages

In my setup, I have a consumer group with three processes (3 instances of a service) that can consume from Kafka. What I've found to be happening is that the first node receives all of the traffic. If one node is manually killed, the next node picks up all Kafka traffic, but the last remaining node sits idle.
The behavior desired is that all messages get distributed evenly across all instances within the consumer group, which is what I thought should happen. As I understand, the way Kafka works is that it is supposed to distribute the messages evenly amongst all members of a consumer group. Is my understanding correct? I've been trying to determine why it may be that only one member of the consumer group is getting all traffic with no luck. Any thoughts/suggestions?
You need to make sure that the topic has more than one partition to be able to consume it in parallel. Each consumer in a consumer group is assigned one or more partitions, but a single partition is never shared by several consumers within the same group; if a consumer goes offline, its partitions are reassigned to the remaining consumers. The number of partitions a topic has equals the maximum number of consumers in a consumer group that can actively read from that topic.
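The behaviour described in the question matches what a single-partition topic would produce. The following sketch is a simplified model, not the Kafka client: the owners function and the service names svc-1 to svc-3 are hypothetical, and ownership is reduced to partition number modulo the number of live members.

```python
def owners(num_partitions, live_members):
    """Map each partition to exactly one live member (simplified model)."""
    live = sorted(live_members)
    return {p: live[p % len(live)] for p in range(num_partitions)}


members = ["svc-1", "svc-2", "svc-3"]

# One partition: a single member owns it, the other two sit idle.
single = owners(1, members)

# The owner is killed: the partition moves to one of the survivors,
# and the remaining member is still idle.
after_failure = owners(1, ["svc-2", "svc-3"])

# Three partitions: every member gets traffic.
three_partitions = owners(3, members)

print(single)
print(after_failure)
print(three_partitions)
```

If the topic in question really has one partition, increasing the partition count to at least the number of service instances should let all three members receive messages.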