Kafka Only One Consumer in Consumer Group Getting Messages - apache-kafka

In my setup, I have a consumer group with three processes (3 instances of a service) that can consume from Kafka. What I've found to be happening is that the first node receives all of the traffic. If that node is manually killed, the next node picks up all Kafka traffic, but the last remaining node sits idle.
The desired behavior is that all messages get distributed evenly across all instances within the consumer group, which is what I thought should happen. As I understand it, Kafka is supposed to distribute messages evenly amongst all members of a consumer group. Is my understanding correct? I've been trying, with no luck, to determine why only one member of the consumer group is getting all the traffic. Any thoughts/suggestions?

You need to make sure that the topic has more than one partition to be able to consume it in parallel. A consumer in a consumer group is assigned one or more partitions (or none, if there are more consumers than partitions), but a single partition is never shared by several consumers within the same group. If a consumer goes offline, its partitions are reassigned to the remaining members of the group. The number of partitions a topic has therefore equals the maximum number of consumers in a consumer group that can actively consume from that topic.
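To make the partition limit concrete, here is a minimal sketch in plain Python. It is a toy model, not the Kafka client: the function name assign_partitions and the node names are made up for illustration, and the simple round-robin rule only approximates what Kafka's real assignors do. It shows that with a single partition only one of three consumers receives anything, while three partitions give every consumer work.

```python
def assign_partitions(num_partitions, consumers):
    """Toy round-robin assignment (hypothetical helper, not Kafka's API).

    Each partition is given to exactly one consumer in the group.
    """
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


group = ["node-1", "node-2", "node-3"]

# One partition: only the first node gets it, the other two stay idle.
one = assign_partitions(1, group)
print(one)    # {'node-1': [0], 'node-2': [], 'node-3': []}

# Three partitions: each node owns exactly one partition.
three = assign_partitions(3, group)
print(three)  # {'node-1': [0], 'node-2': [1], 'node-3': [2]}
```

If the output for your topic looks like the first case, increasing the partition count is the likely fix.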

Related

Scaling Maximum Number of Consumer Groups Kafka

I have a use case where a message needs to be broadcast to all nodes in a horizontally scalable, stateless application cluster, and I am considering Kafka for it. Since each node of the cluster will need to receive ALL messages in the topic, each node of the cluster needs to have its own consumer group.
One can assume here that the volume of messages is not so high that a single node cannot handle all of them.
To achieve this with Kafka, I would end up using the instanceId (or some unique identifier) of the consumer process as the consumer group id when consuming from the topic. This will push the number of consumer groups high. As redeployments are done, new consumer groups will start.
How many active consumer groups can I have at maximum at any given time? Will the number of consumer groups become a bottleneck before other bottlenecks (like bandwidth etc.) kick in?
There will be churn of active consumer groups because the consumer application is deployed frequently. Can Kafka sustain this churn in consumer groups over long periods of time?
Self Answer to my question: One solution that came from further research is to use the Kafka assign() API instead of the subscribe() API to consume. The former does not rely on consumer group management. I just configure every node to consume messages from all the partitions of the topic.
Acknowledgement to Igore Soarez, who seeded in the comments the idea of not needing consumer groups to consume.
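The idea in the self answer can be illustrated with a small in-memory model. This is only a sketch in plain Python, not the Kafka client API: the topic dictionary and the StandaloneReader class are hypothetical names invented for the example. Each reader is given every partition and keeps its own offsets, which mirrors what manually assigning all partitions to each node achieves, so every node sees every message.

```python
# Toy in-memory topic (hypothetical): partition number -> ordered messages.
topic = {0: ["a", "c"], 1: ["b", "d"]}


class StandaloneReader:
    """Models a node that reads every partition and tracks its own offsets.

    No group coordination is involved, so readers never split partitions.
    """

    def __init__(self, partitions):
        self.offsets = {partition: 0 for partition in partitions}

    def poll(self, log):
        records = []
        for partition, offset in self.offsets.items():
            new_records = log[partition][offset:]
            records.extend(new_records)
            # Advance only this reader's private offset.
            self.offsets[partition] = offset + len(new_records)
        return records


nodes = [StandaloneReader(topic.keys()) for _ in range(3)]
received = [sorted(node.poll(topic)) for node in nodes]
print(received)  # every node received all four messages

# A second poll returns nothing new for a reader that is already caught up.
second_poll = nodes[0].poll(topic)
```

The trade-off in this model is that each node must track its own position, since no group is doing so on its behalf.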

Kafka consumer horizontal scaling across multiple nodes

I am externalising the Kafka consumer metadata for each topic in a db, including consumer groups and the number of consumers in each group.
The Consumer_info table has:
Topic name,
Consumer group name,
Number of consumers in group,
Consumer class name
At app server startup I read the table and create consumers (threads) based on the number set in the table. If the consumer count for a group is set to 3, I create 3 consumer threads. This number is based on the number of partitions for the given topic.
Now, in case I need to scale out horizontally, how do I distribute the consumers belonging to the same group across multiple app server nodes without reading the same message more than once?
The initialization code for the consumers, which is called at app server startup, reads the consumer metadata from the db and creates all the consumer threads on the same app server instance. Even if I add more app server instances, they would all be redundant, because the first server to start has already spawned as many consumer threads as there are partitions. Any more consumers created on other instances would be idle.
Can you suggest a better approach to scale out consumers horizontally?
"consumer groups and number of consumer in group"
Running kafka-consumer-groups --describe ad hoc would give you more up-to-date information than an external database query, especially given that consumers can rebalance and can fall out of the group at any moment.
"how do i distribute the consumers belonging to same group across multiple app server nodes. Without reading same message more than once"
This is how Kafka consumer groups operate out of the box, assuming you are not manually assigning partitions in your code.
Within a group, a message is not read again once it has been consumed and its offset committed, unless offsets are reset or a consumer explicitly seeks back.
I don't see the need for an external database when you can already expose an API around the kafka-consumer-groups command.
Or you can use Stream-Messaging-Manager by Cloudera, which shows a lot of this information as well.

Two consumers in the same consumer group, only one actually consumes, other one is idle

I have two instances of the same service reading from a topic.
Topic has 4 partitions.
Consumer group id is the same, however only one instance actually processes messages - the other one stays idle after successfully subscribing to the topic, according to the logs.
My understanding was I can speed up the processing by adding more consumers.
How do I run several consumers in parallel? What did I miss?

If you have less consumers than partitions, what happens?

If you have fewer consumers than partitions, does that simply mean you will not consume all the messages on a given topic?
In a cloud environment, how are you supposed to keep track of how many consumers are running and how many are pointing to a given topic#partition?
What if you have multiple consumers on a given topic#partition? I guess the consumer has to somehow keep track of what messages it has already processed in case of duplicates?
In fact, each consumer belongs to a consumer group. When a consumer group reads from a topic, all records of a given partition are delivered to a single consumer in the group.
If there are more partitions than consumers in a group, some consumers will consume data from more than one partition. If there are more consumers in a group than partitions, some consumers will get no data. If you add new consumer instances to the group, they will take over some partitions from old members. If you remove a consumer from the group (or the consumer dies), its partitions will be reassigned to other members.
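The join and leave behaviour described above can be sketched with a toy model. This is plain Python, not the Kafka client: ToyGroup, its methods, and the member names c1 to c5 are hypothetical, and the round-robin rule is a simplification of Kafka's real assignment strategies. Every membership change triggers a fresh assignment of the topic's 4 partitions.

```python
class ToyGroup:
    """Toy consumer group (hypothetical) that reassigns on every change."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.members = []
        self.assignment = {}

    def _rebalance(self):
        # Start from scratch: every partition gets exactly one owner.
        self.assignment = {member: [] for member in self.members}
        for partition in range(self.num_partitions):
            if self.members:
                owner = self.members[partition % len(self.members)]
                self.assignment[owner].append(partition)

    def join(self, member):
        self.members.append(member)
        self._rebalance()

    def leave(self, member):
        self.members.remove(member)
        self._rebalance()


group = ToyGroup(num_partitions=4)

group.join("c1")
after_one = dict(group.assignment)    # c1 owns all four partitions

group.join("c2")
after_two = dict(group.assignment)    # partitions are split between c1 and c2

for member in ("c3", "c4", "c5"):
    group.join(member)
after_five = dict(group.assignment)   # more members than partitions: c5 is idle

group.leave("c1")
after_leave = dict(group.assignment)  # c1's partition moves to a survivor

print(after_one, after_two, after_five, after_leave, sep="\n")
```

In this model the previously idle member picks up a partition as soon as another member leaves, which matches the failover behaviour described in the question threads above.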
Now let's take a look at your questions:
If you have fewer consumers than partitions, does that simply mean you will not consume all the messages on a given topic?
NO. Some consumers in the same consumer group will consume data from more than one partition.
In a cloud environment, how are you supposed to keep track of how many consumers are running and how many are pointing to a given topic#partition?
Kafka will take care of it. If new consumers join the group, or old consumers die, Kafka will rebalance.
What if you have multiple consumers on a given topic#partition?
You CANNOT have multiple consumers (in a consumer group) consume data from a single partition. However, if there is more than one consumer group, the same partition can be consumed by one (and only one) consumer in each consumer group.
1) No, that means you will have one consumer handling more than one partition.
2) Kafka never assigns the same partition to more than one consumer in the same group, because that would violate the ordering guarantee within a partition.
3) You could implement ConsumerRebalanceListener in your client code; it gets called whenever partitions are assigned to or revoked from the consumer.
You might want to take a look at this article, specifically the "Assigning partitions to consumers" part. In it I have a sample where you create a topic with 3 partitions and then a consumer with a ConsumerRebalanceListener that tells you which consumer is handling which partition. You can play around with it by starting one or more consumers and seeing what happens. The sample code is on GitHub.
http://www.javaworld.com/article/3066873/big-data/big-data-messaging-with-kafka-part-2.html

Kafka always one consumer consumes the topic messages in one group

I have two consumer servers with the same group id subscribed to the same topic.
A Kafka server is running, and the topic has only one partition.
As far as I know, messages should be consumed randomly across those two consumer servers.
But it seems that consumer server A always consumes the messages, and the other one consumes none. If I stop consumer server A, the other one works fine.
What I expect is that they consume messages randomly.
To be able to use two consumer instances in parallel you need at least two partitions in the topic. A consumer will bind to one or more partitions of a topic and other consumers with the same groupId will not claim partitions which already have consumers bound to them. If a consumer fails/crashes, the partition will be released and then picked up by another consumer instance.