I am new to Kafka. My question is how to create multiple consumer groups with multiple consumer instances, and how to assign those consumer instances to consume from a specific broker or partition. For example, I have to implement the setup shown in this example image.
Consumer groups relate to the high level consumer API while the ability to choose broker or partition to consume from relates to the simple consumer API.
The high level API will do rebalancing among consumers in a group automatically for you but it will consume all partitions for a given topic.
If you want to consume only from specific partitions within a topic, you need to use the simple consumer API and you'll have to deal with partition assignment yourself. There is an example of how to do this in the Kafka wiki.
How does the pubsub work in Kafka?
I was reading about Kafka topic-partition theory, and it mentioned that within one consumer group, each partition will be processed by one consumer only. Now there are 2 cases:
If the producer doesn't specify a partition or a message key, messages are distributed evenly across the partitions of the topic. If this is the case, and there can be only one consumer (or subscriber, in pub-sub terms) per partition, how do all the subscribers receive the same message?
If a producer produces to a specific partition, how do the other consumers (or subscribers) receive the message?
How does pub-sub work in each of the above cases? If only a single consumer can be attached to a specific partition, how do other consumers receive the same message?
Kafka prevents more than one consumer in a group from reading a single partition. If you have a use case where multiple consumers in the same consumer group need to process a particular event, then Kafka is probably the wrong tool. Otherwise, you need to write code outside the Kafka API to forward one consumer's events to other services via other protocols. The Kafka Streams Interactive Queries feature (with an RPC layer) is one example of this.
Or you would need lots of unique consumer groups, one per reader, so that each of them reads the same event.
The answer doesn't change when producers send data to a specific partition: as far as the consumer is concerned, the partition of each record is already fixed by the time it is read, whether it was chosen explicitly or by even distribution. A consumer is assigned specific partitions, and does not coordinate that assignment with any producer.
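To make the "one consumer group per subscriber" idea concrete, here is a minimal toy model in plain Python (no Kafka client involved). The group names, consumer names, and the CRC32-based `partition_for` function are made up for illustration; Kafka's own default partitioner uses a different hash. It only shows that each group sees every message once, while inside a group each partition has a single owner.

```python
from collections import defaultdict
from zlib import crc32

NUM_PARTITIONS = 3  # hypothetical topic with 3 partitions


def partition_for(key: str) -> int:
    # Stand-in for a key-hash partitioner (assumption: not Kafka's real hash).
    return crc32(key.encode()) % NUM_PARTITIONS


def deliver(messages, groups):
    """Deliver each message to the owner of its partition in every group.

    groups maps group name -> {partition: consumer name}.
    """
    received = defaultdict(list)
    for key, value in messages:
        partition = partition_for(key)
        for group, owners in groups.items():
            received[(group, owners[partition])].append(value)
    return received


# Two independent groups: "billing" splits the partitions between two
# consumers, "audit" has a single consumer that owns all of them.
groups = {
    "billing": {0: "billing-1", 1: "billing-1", 2: "billing-2"},
    "audit": {0: "audit-1", 1: "audit-1", 2: "audit-1"},
}
messages = [(f"user-{i}", f"event-{i}") for i in range(6)]
received = deliver(messages, groups)


def group_total(group):
    # Everything a group received, across all of its consumers.
    return sorted(v for (g, _), vals in received.items() if g == group for v in vals)
```

Calling `group_total("billing")` and `group_total("audit")` returns the same six events: that is the publish-subscribe behaviour across groups, while within "billing" each event reached only one of the two consumers.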
I have 1000+ web servers, and all of them are interested in messages from a topic. I am thinking of writing specific data to a particular partition of the topic, with all 1000+ servers interested in the data in that partition. How good an idea is it to use assign instead of subscribe? How scalable is this approach? Can I assign 1000+ consumers to read data from a particular partition?
In Kafka, every consumer belongs to a consumer group. Producers write to topics, not to groups; when the records of a topic are delivered to a consumer group, all records of a given partition go to a single consumer in that group.
If the number of partitions is greater than the number of consumers, then some consumers will consume data from more than one partition. On the other hand, if the number of consumers is greater than the number of partitions, some consumers will be inactive as they will receive no data.
You cannot have multiple consumers -within the same consumer group- consuming data from a single partition. Therefore, in order to consume data from the same partition using N consumers, you'd need to create N distinct consumer groups too.
Note that partitioning enhances the parallelism within a Kafka cluster. If you create thousands of consumers to consume data from only one partition, I suspect that you will lose some level of parallelism.
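The two cases above (more partitions than consumers, and more consumers than partitions) can be sketched with a simple round-robin split. This is only an illustration: the function name `distribute` is made up, and Kafka's real assignors differ in detail.

```python
def distribute(partitions, consumers):
    """Spread partitions over the consumers of ONE group, round-robin.

    Each partition ends up with exactly one owner; consumers beyond the
    number of partitions get an empty list, i.e. they sit idle.
    """
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# 5 partitions, 2 consumers: both consumers read several partitions.
more_partitions = distribute([0, 1, 2, 3, 4], ["c1", "c2"])

# 2 partitions, 4 consumers: c3 and c4 receive nothing.
more_consumers = distribute([0, 1], ["c1", "c2", "c3", "c4"])
```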
Subscribe vs Assign
Subscribe makes use of the consumer group: the group coordinator sends each consumer its assignment, and the partitions of the subscribed topics are distributed among the instances within that group.
Assign bypasses group management: the consumer is manually given an explicit list of topic partitions, and no automatic rebalancing takes place.
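As a rough sketch of the difference, assuming the third-party kafka-python client (the thread itself does not name a client library), the two styles could look like this. The topic name, group name, and broker address are placeholders, and the import is done inside the functions so the snippet loads even where the library is not installed.

```python
def consume_with_subscribe(topic="demo-topic", group_id="demo-group",
                           servers="localhost:9092"):
    """Group-managed consumption: the coordinator decides which partitions
    this instance gets, and rebalances when members join or leave."""
    from kafka import KafkaConsumer  # assumption: pip install kafka-python

    consumer = KafkaConsumer(bootstrap_servers=servers, group_id=group_id)
    consumer.subscribe([topic])
    for record in consumer:
        print(record.partition, record.offset, record.value)


def consume_with_assign(topic="demo-topic", partition=0,
                        servers="localhost:9092"):
    """Manual assignment: this instance reads exactly one partition and
    takes no part in group rebalancing."""
    from kafka import KafkaConsumer, TopicPartition

    # group_id=None turns off group coordination and offset commits in
    # kafka-python, so any number of such readers can attach to the same
    # partition independently of each other.
    consumer = KafkaConsumer(bootstrap_servers=servers, group_id=None)
    consumer.assign([TopicPartition(topic, partition)])
    for record in consumer:
        print(record.partition, record.offset, record.value)
```

With manual assignment each of the N readers tracks its own position, so the brokers still serve the same partition N times; that is the scalability cost mentioned above.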
I'm a newbie in Kafka. I had a glance at the Kafka documentation. It seems that messages are dispatched to a subscribing consumer group by binding each partition to a consumer instance.
One important thing to remember when working with Apache Kafka is that the number of consumers in the same consumer group should be less than or equal to the number of partitions in the consumed topic. Otherwise, the excess consumers will not receive any messages from the topic.
In a non-prod environment, I didn't configure the topic partitions. In that case, is there only a single partition in Kafka? And if I start multiple consumers sharing the same group and subscribe them to the topic, would messages always be dispatched to the same instance in the group? In other words, do I have to partition the topic to get the load-balancing feature of the consumer group?
Thanks!
You are absolutely right. One partition cannot be processed in parallel (by one consumer group). You can treat a partition as atomic: it cannot be split.
If you configure the non-prod and prod environments with the same number of partitions per topic, that should help you find the correct number of consumers and catch problems before moving to prod.
If you have fewer consumers than partitions, does that simply mean you will not consume all the messages on a given topic?
In a cloud environment, how are you supposed to keep track of how many consumers are running and how many are pointing to a given topic#partition?
What if you have multiple consumers on a given topic#partition? I guess the consumer has to somehow keep track of what messages it has already processed in case of duplicates?
In fact, each consumer belongs to a consumer group. When the Kafka cluster sends data to a consumer group, all records of a partition will be sent to a single consumer in the group.
If there are more partitions than consumers in a group, some consumers will consume data from more than one partition. If there are more consumers in a group than partitions, some consumers will get no data. If you add new consumer instances to the group, they will take over some partitions from old members. If you remove a consumer from the group (or the consumer dies), its partitions will be reassigned to the other members.
Now let's take a look at your questions:
If you have fewer consumers than partitions, does that simply mean you will not consume all the messages on a given topic?
NO. Some consumers in the same consumer group will consume data from more than one partition.
In a cloud environment, how are you supposed to keep track of how many consumers are running and how many are pointing to a given topic#partition?
Kafka will take care of it. If new consumers join the group, or old consumers die, Kafka will rebalance.
What if you have multiple consumers on a given topic#partition?
You CANNOT have multiple consumers (in one consumer group) consuming data from a single partition. However, if there is more than one consumer group, the same partition can be consumed by one (and only one) consumer in each consumer group.
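The membership changes described in this answer (consumers joining, leaving, or dying) can be modelled with a small toy class. `ToyGroup` is a made-up name and the round-robin rule is an assumption; the only point is that after every change each partition still has exactly one owner within the group.

```python
class ToyGroup:
    """Toy model of one consumer group; not Kafka's actual protocol."""

    def __init__(self, partitions):
        self.partitions = list(partitions)
        self.members = []

    def assignment(self):
        # Recompute ownership from scratch, as a rebalance would.
        result = {member: [] for member in self.members}
        for index, partition in enumerate(self.partitions):
            if self.members:
                owner = self.members[index % len(self.members)]
                result[owner].append(partition)
        return result

    def join(self, member):
        self.members.append(member)
        return self.assignment()

    def leave(self, member):
        # Also covers a consumer that died: its partitions go to the rest.
        self.members.remove(member)
        return self.assignment()


group = ToyGroup(range(4))
after_first = group.join("a")    # "a" owns all four partitions
after_second = group.join("b")   # "b" takes over some of them
after_death = group.leave("a")   # "b" inherits everything
```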
1) No, that means you will have one consumer handling more than one partition.
2) Kafka never assigns the same partition to more than one consumer in a group, because that would violate the ordering guarantee within a partition.
3) You could implement ConsumerRebalanceListener in your client code; it gets called whenever partitions are assigned to or revoked from the consumer.
You might want to take a look at this article, specifically the "Assigning partitions to consumers" part. In it I have a sample where you create a topic with 3 partitions and then a consumer with a ConsumerRebalanceListener that tells you which consumer is handling which partition. You can play around with it by starting 1 or more consumers and seeing what happens. The sample code is on GitHub.
http://www.javaworld.com/article/3066873/big-data/big-data-messaging-with-kafka-part-2.html
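The linked article uses the Java client. As a rough equivalent, assuming the third-party kafka-python client (which exposes a class with the same name), a listener that logs assignments might look like the sketch below. The topic, group, and broker address are placeholders, and this is not the article's sample code.

```python
def run_logging_consumer(topic="demo-topic", group_id="demo-group",
                         servers="localhost:9092"):
    """Start a group consumer that prints which partitions it holds."""
    # Imported here so the snippet loads without the library installed.
    from kafka import ConsumerRebalanceListener, KafkaConsumer

    class LoggingListener(ConsumerRebalanceListener):
        def on_partitions_revoked(self, revoked):
            # Called before a rebalance takes partitions away.
            print("revoked:", sorted(tp.partition for tp in revoked))

        def on_partitions_assigned(self, assigned):
            # Called after a rebalance hands partitions to this consumer.
            print("assigned:", sorted(tp.partition for tp in assigned))

    consumer = KafkaConsumer(bootstrap_servers=servers, group_id=group_id)
    consumer.subscribe([topic], listener=LoggingListener())
    for record in consumer:
        print(record.partition, record.offset, record.value)
```

Running this function in one, two, and then three separate processes against a 3-partition topic should show the printed assignments shrinking as each new process joins.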
Is there a way I can make a Kafka topic non-persistent? I plan to use multiple consumers on a single topic, but I don't want all my consumers picking up the same messages.
In Kafka, to simulate the behaviour of a queue, all your consumers would be in the same consumer group.
See the Kafka docs for more information:
Consumers
Messaging traditionally has two models: queuing and publish-subscribe.
In a queue, a pool of consumers may read from a server and each
message goes to one of them; in publish-subscribe the message is
broadcast to all consumers. Kafka offers a single consumer abstraction
that generalizes both of these—the consumer group. Consumers label
themselves with a consumer group name, and each message published to a
topic is delivered to one consumer instance within each subscribing
consumer group. Consumer instances can be in separate processes or on
separate machines.
If all the consumer instances have the same consumer group, then this
works just like a traditional queue balancing load over the consumers.
If you want to control when messages are deleted from the log, you can set retention.ms or retention.bytes in the topic configuration. Be aware that these settings delete messages regardless of whether they have been consumed.
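The last point is easy to miss, so here is a toy illustration of time-based retention in plain Python. The `Record` type and `apply_retention` function are made up, and the model is simplified: real Kafka removes whole log segments rather than individual records.

```python
from dataclasses import dataclass


@dataclass
class Record:
    offset: int
    timestamp_ms: int
    value: str


def apply_retention(log, now_ms, retention_ms):
    """Keep only records younger than retention_ms.

    Consumer progress is deliberately not an input: retention does not
    look at what has or has not been read.
    """
    return [r for r in log if now_ms - r.timestamp_ms <= retention_ms]


log = [Record(0, 1_000, "a"), Record(1, 5_000, "b"), Record(2, 9_000, "c")]
next_offset_to_read = 1  # hypothetical consumer has only processed offset 0

remaining = apply_retention(log, now_ms=10_000, retention_ms=3_000)
remaining_offsets = [r.offset for r in remaining]
```

Here offset 1 is dropped even though the consumer never read it, so a slow consumer resumes from offset 2 and silently misses "b".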