Can consumer groups span multiple servers? - apache-kafka

When creating a consumer group in Kafka, does it create a pool of workers that run on the same JVM process or could a consumer group span multiple computers/nodes?
If it spans multiple computers then keeping track of offsets etc. will be hard.

First of all, you don't create consumer groups directly. You just create consumers and consumers that have same group.id will represent a consumer group. When multiple consumers
are subscribed to a topic and belong to the same consumer group, each consumer in
the group will receive messages from a different subset of the partitions in the topic. As shown in the image below:
Of course you can create these consumers in different servers and it is recommended approach for load balancing.
Kafka stores offsets for each consumer groups in topic named __consumer_offsets. So keeping track of the offsets is not that hard. You can check consumer offsets for a consumer groups with a command like this:

"does it create a pool of workers that run on the same jvm process or could a consumer group span multiple computers/nodes?"
It depends on how many jvm processes you create for your consumer group. And, yes, it can span multiple computer/nodes. Kafka's group coordinator will then assign individual threads to a partition of a topic. Note that a single TopicPartition can be consumed at maximum by one consumer (jvm process) within the same consumer group.
"If it spans multiple computers then keeping track of offsets etc. will be hard."
Kafka makes this easy by centrally storing all meta information and progress of each consumer group within an internal topic called "__consumer_offsets" which is available across the entire cluster, if and only if all nodes belong to the same cluster.

Related

What is the need of consumer group in kafka?

I don't understand the practical use case of the consumer group in Kafka.
A partition can only be read by only one consumer in a consumer group, so only a subset of a topic record is read by one consumer.
Can someone help with any practical scenario where the consumer group helps?
It's for parallel processing of event messages from the specific topic.
Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.
If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances.
If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.
Read more here:
https://docs.confluent.io/5.3.3/kafka/introduction.html#consumers

Kafka consumer horizontal scaling across multiple nodes

I am externalising the kafka consumer metadata for topic in db including consumer groups and number of consumer in group.
Consumer_info table has
Topic name,
Consumer group name,
Number of consumers in group
Consumer class name
At app server startup i am reading table and creating consumers (threads) based on number set in table. If consumer group count is set to 3, i create 3 consumer threads. This is based on number of partitions for a given topic
Now in case i need to scale out horizontally, how do i distribute the consumers belonging to same group across multiple app server nodes. Without reading same message more than once.
The initialization code for consumer which will be called at appserver startup reads metadata from db for consumer and creates all the consumer threads on same instance of app server, even if i add more app server instances, they would all be redundant as the first server which was started has spawned the defined consumer threads equal to the number of partitions.any more consumer created on other instances would be idle.
Can u suggest better approach to scale out consumers horizontally
consumer groups and number of consumer in group
Adhoc running kafka-consumer-groups --describe would give you more up-to-date information than an external database query, especially given that consumers can rebalance and can fall out of the group at any moment.
how do i distribute the consumers belonging to same group across multiple app server nodes. Without reading same message more than once
This is how Kafka Consumer groups operate, out of the box, assuming you are not manually assigning partitions in your code.
It is not possible to read a message more than once after you have consumed, acked, and committed that offset within the group
I don't see the need for an external database when you can already attempt to expose an API around kafka-consumer-groups command
Or you can use Stream-Messaging-Manager by Cloudera which shows a lot of this information as well

Can multiple consumers consume at the same time?

I am new to Kafka . I wanted to know if multiple consumers can consume data from the topic at the same time and process them simultaneously or they are consumed and processed in batches ?
Kafka Consumers group themselves under a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.
If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances.
If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.
The processing of the data can be done simultaneously or in batches based on your requirement.

If you have less consumers than partitions, what happens?

If you have less consumers than partitions, does that simply mean you will not consume all the messages on a given topic?
In a cloud environment, how are you suppose to keep track how many consumers are running and how many are pointing to a given topic#partition?
What if you have multiple consumers on a given topic#partition? I guess the consumer has to somehow keep track of what messages it has already processed in case of duplicates?
In fact, each consumer belongs to a consumer group. When Kafka cluster sends data to a consumer group, all records of a partition will be sent to a single consumer in the group.
If there're more paritions than consumers in a group, some consumers will consume data from more than one partition. If there're more consumers in a group than paritions, some consumers will get no data. If you add new consumer instances to the group, they will take over some partitons from old members. If you remove a consumer from the group (or the consumer dies), its partition will be reassigned to other member.
Now let's take a look at your questions:
If you have less consumers than partitions, does that simply mean you will not consume all the messages on a given topic?
NO. Some consumers in the same consumer group will consume data from more than one partition.
In a cloud environment, how are you suppose to keep track how many consumers are running and how many are pointing to a given topic#partition?
Kafka will take care of it. If new consumers join the group, or old consumers dies, Kafka will do reblance.
What if you have multiple consumers on a given topic#partition?
You CANNOT have multiple consumers (in a consumer group) to consume data from a single parition. However, if there're more than one consumer group, the same partition can be consumed by one (and only one) consumer in each consumer group.
1) No that means you will one consumer handling more than one consumer.
2) Kafka never assigns same partition to more than one consumer because that will violate order guarantee within a partition.
3) You could implement ConsumerRebalanceListener, in your client code that gets called whenever partitions are assigned or revoked from consumer.
You might want to take a look at this article specically "Assigning partitions to consumers" part. In that i have a sample where you create topic with 3 partitions and then a consumer with ConsumerRebalanceListener telling you which consumer is handling which partition. Now you could play around with it by starting 1 or more consumers and see what happens. The sample code is in github
http://www.javaworld.com/article/3066873/big-data/big-data-messaging-with-kafka-part-2.html

Kafka Only One Consumer in Consumer Group Getting Messages

In my setup, I have a consumer group with three processes (3 instances of a service) that can consume from Kafka. What I've found to be happing is that the first node is receiving all of the traffic. If one node is manually killed, the next node picks up all Kafka traffic, but the last remaining node sits idle.
The behavior desired is that all messages get distributed evenly across all instances within the consumer group, which is what I thought should happen. As I understand, the way Kafka works is that it is supposed to distribute the messages evenly amongst all members of a consumer group. Is my understanding correct? I've been trying to determine why it may be that only one member of the consumer group is getting all traffic with no luck. Any thoughts/suggestions?
You need to make sure that the topic has more than one partition to be able to consume it in parallel. A consumer in a consumer group gets one or more allocated partitions from the broker but a single partition will never be shared across several consumers within the same group unless a consumer goes offline. The number of partitions a topic has equals the maximum number of consumers in a consumer group that can feed from a topic.