I'm trying to wrap my head around Kafka, and the thing that confuses me is partitions. In all/most of the examples I have seen, the consumers/producers seem to have implicit knowledge of the partitions (which partition to write messages to, which partition to read messages from). Is this correct? I initially thought that partitions are internal to the system and that consumers/producers don't need to know partition information. If they do need to know it, then aren't we exposing the inner structure of the topic to the outside world to a certain extent?
In Kafka every partition in a topic is replicated on a set of brokers, with at most one leader broker per partition. Within a consumer group there is no point in having more consumers on a topic than it has partitions, because the extra consumers would be inactive. One consumer can be assigned multiple partitions, but a single partition cannot be assigned to multiple consumers of the same group. So the number of partitions must be chosen according to the throughput you expect. The number of partitions can be increased on a topic, but never decreased. When consumers read from a partition they actually connect to the leader broker to consume messages.
Anyway, the partition leader could change, in which case the consumer gets an error and has to request fresh metadata from the cluster to find the new partition leader. At consumer startup, partitions are assigned according to the Kafka parameter partition.assignment.strategy. Of course, if consumers in the same consumer group start at different times, there will be a partition rebalance.
In the end, a client does need quite a lot of information about the Kafka cluster structure.
I am writing a Kafka consumer application. I have a topic with 4 partitions - 1 is leader and 3 are followers. The producer uses a key to identify the partition to push a message to.
If I write a consumer and run it on different nodes, or start 4 instances of the same consumer, how will message consumption happen? Will all 4 instances get the same messages?
What happens in the case of multiple consumers (same group) consuming a single topic?
Do they get the same data?
How is the offset managed? Is it separate for each consumer?
I would suggest that you read at least the first few chapters of Confluent's definitive guide to Kafka to get a preliminary understanding of how Kafka works.
I've kept my answers brief. Please refer to the book for detailed explanation.
How is the offset managed? Is it separate for each consumer?
It depends on the group id. Offsets are tracked per consumer group (one committed offset for each partition the group reads), not per individual consumer.
What happens in the case of multiple consumers (same group) consuming a single topic?
There can be multiple consumers, and they can share one group id or use different ones.
If 2 consumers belong to the same group, neither of them will get all the messages.
Do they get the same data?
No. Once a message has been read and the read is committed, the offset advances for that group, so a different consumer in the same group will not receive that message.
Hope that helps :)
What happens in the case of multiple consumers (same group) consuming a single topic?
Answer: Producers send records to a particular partition based on the record's key. The default partitioner for Java uses a hash of the record's key to choose the partition. When there are multiple consumers in the same consumer group, each consumer gets different partitions. So for any given partition (and therefore any given key), a single consumer receives all of its messages. When that consumer goes down, the group coordinator (one of the brokers in the cluster) triggers a rebalance, and the partition is assigned to one of the remaining consumers.
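To make the key-to-partition step concrete, here is a minimal sketch of a hash-based partitioner. It is not Kafka's actual code: the Java client hashes the key bytes with murmur2, while this sketch substitutes Arrays.hashCode so that it runs with the standard library only. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Simplified stand-in for a key-based partitioner (hypothetical class name).
// Assumption: Arrays.hashCode replaces the murmur2 hash used by the real client.
class KeyPartitionerSketch {

    // Maps a record key to a partition index in [0, numPartitions).
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        int hash = Arrays.hashCode(keyBytes);
        // Clear the sign bit so the modulo result is never negative.
        return (hash & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int partitions = 4;
        // "order-1" appears twice and lands in the same partition both times.
        for (String key : new String[] {"order-1", "order-2", "order-1"}) {
            System.out.printf("key=%s -> partition %d%n", key, partitionFor(key, partitions));
        }
    }
}
```

The property that matters is determinism: the same key always maps to the same partition, so all records for that key are read by whichever consumer currently owns that partition.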
Do they get same data?
Answer: If a consumer commits the offsets of the messages it has consumed from a partition and then goes down, a rebalance occurs as stated above. The consumer that gets this partition will not get those messages again. But if the consumer goes down before committing its offsets, the consumer that gets this partition will receive those messages.
How is the offset managed? Is it separate for each consumer?
Answer: No, the offset is not separate for each consumer. A partition is never assigned to more than one consumer in the same consumer group at a time. By default, the consumer that gets a partition assigned also picks up that partition's committed offset.
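The hand-over described in these answers can be modelled with a toy offset store keyed by (group, partition). This is only a sketch with invented names, not how the broker stores offsets. It also assumes that a group with no committed offset starts at 0, whereas a real consumer follows its auto.offset.reset setting.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of committed offsets: one entry per (group, partition),
// not one per consumer instance. All names here are hypothetical.
class CommittedOffsetsSketch {
    private final Map<String, Long> committed = new HashMap<>();

    private static String key(String group, int partition) {
        return group + "/" + partition;
    }

    // Records the next offset the group should read from this partition.
    void commit(String group, int partition, long nextOffset) {
        committed.put(key(group, partition), nextOffset);
    }

    // Where a consumer that was just assigned this partition resumes reading.
    // Assumption of this sketch: start at 0 when nothing has been committed.
    long resumePosition(String group, int partition) {
        return committed.getOrDefault(key(group, partition), 0L);
    }

    public static void main(String[] args) {
        CommittedOffsetsSketch offsets = new CommittedOffsetsSketch();
        // Consumer A processes offsets 0..4 of partition 0 and commits,
        // then processes 5..7 and dies before committing again.
        offsets.commit("news-app", 0, 5L);
        // Consumer B takes over partition 0 after the rebalance and
        // re-reads 5..7, because they were never committed.
        System.out.println("B resumes at " + offsets.resumePosition("news-app", 0));
        // A different group keeps its own position on the same partition.
        System.out.println("audit-app resumes at " + offsets.resumePosition("audit-app", 0));
    }
}
```

Because the position belongs to the group and the partition, whichever consumer owns the partition after a rebalance continues from the last commit, which is why uncommitted messages are delivered again.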
If you have fewer consumers than partitions, does that simply mean you will not consume all the messages on a given topic?
In a cloud environment, how are you supposed to keep track of how many consumers are running and how many are pointing to a given topic#partition?
What if you have multiple consumers on a given topic#partition? I guess the consumer has to somehow keep track of what messages it has already processed in case of duplicates?
In fact, each consumer belongs to a consumer group. When the Kafka cluster delivers data to a consumer group, all records of a partition go to a single consumer in the group.
If there are more partitions than consumers in a group, some consumers will consume data from more than one partition. If there are more consumers in a group than partitions, some consumers will get no data. If you add new consumer instances to the group, they will take over some partitions from old members. If you remove a consumer from the group (or the consumer dies), its partitions will be reassigned to other members.
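The cases above can be tried out with a small simulation. The sketch below deals partitions out to group members in turn. That is an assumed, simplified strategy with made-up names. Real assignors (selected through partition.assignment.strategy) can distribute partitions differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified group assignment: every partition gets exactly
// one owner within the group, and partitions are dealt out in turn.
class GroupAssignmentSketch {

    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        if (consumers.isEmpty()) {
            return assignment;
        }
        for (int partition = 0; partition < numPartitions; partition++) {
            String owner = consumers.get(partition % consumers.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Fewer consumers than partitions: one consumer reads two partitions.
        System.out.println(assign(Arrays.asList("c1", "c2"), 3));
        // More consumers than partitions: one consumer stays idle.
        System.out.println(assign(Arrays.asList("c1", "c2", "c3", "c4"), 3));
        // c1 leaves the group: a "rebalance" is just a fresh assignment.
        System.out.println(assign(Arrays.asList("c2", "c3", "c4"), 3));
    }
}
```

Running the assignment again with a different member list mimics a rebalance: partitions move, but at any moment each one still has a single owner in the group.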
Now let's take a look at your questions:
If you have fewer consumers than partitions, does that simply mean you will not consume all the messages on a given topic?
NO. Some consumers in the same consumer group will consume data from more than one partition.
In a cloud environment, how are you supposed to keep track of how many consumers are running and how many are pointing to a given topic#partition?
Kafka will take care of it. If new consumers join the group, or old consumers die, Kafka will rebalance.
What if you have multiple consumers on a given topic#partition?
You CANNOT have multiple consumers (in a consumer group) consuming data from a single partition. However, if there is more than one consumer group, the same partition can be consumed by one (and only one) consumer in each consumer group.
1) No, that means you will have one consumer handling more than one partition.
2) Kafka never assigns the same partition to more than one consumer in a group, because that would violate the ordering guarantee within a partition.
3) You could implement ConsumerRebalanceListener in your client code; it gets called whenever partitions are assigned to or revoked from the consumer.
You might want to take a look at this article, specifically the "Assigning partitions to consumers" part. In it I have a sample where you create a topic with 3 partitions and then a consumer with a ConsumerRebalanceListener that tells you which consumer is handling which partition. You can play around with it by starting 1 or more consumers and seeing what happens. The sample code is on GitHub.
http://www.javaworld.com/article/3066873/big-data/big-data-messaging-with-kafka-part-2.html
In the Apache Kafka 0.8.2 official documentation, section 5.6 Distribution, Consumers and Consumer Groups subsection, it says that
The consumers in a group divide up the partitions as fairly as
possible, each partition is consumed by exactly one consumer in a
consumer group.
But I have found that in practice it is possible for multiple consumers in a consumer group to consume data from a single partition, by sending FetchRequests for the same topic-partition.
And in the following Consumer Id Registry subsection
In addition to the group_id which is shared by all consumers in a
group, each consumer is given a transient, unique consumer_id (of the
form hostname:uuid) for identification purposes. Consumer ids are
registered in the following directory.
/consumers/[group_id]/ids/[consumer_id] --> {"topic1": #streams, ...,
"topicN": #streams} (ephemeral node)
It says there is a unique id for each consumer. However, I could not find such a structure in ZooKeeper.
I do not know when a consumer starts to register. The client library I used is kafka-python 0.9.4.
This may help:
(1) For your second question.
https://github.com/dpkp/kafka-python/issues/472
And issue38
It said "Coordinated Consumer Group support is under development."
(2) For your first question.
It said "This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group." (statement A). This depends on the client implementation, and it may not hold in some Kafka clients. I only have experience with the Python and C++ clients. If group support is implemented, each message is consumed by exactly one consumer in the group. How partitions are assigned among the consumers in one group differs between clients. When there are more partitions than consumers, statement A may hold. But partitions may also be reassigned when consumers join or leave the existing group. In that case, a partition may be consumed first by consumer A and then by consumer B. In some clients you can choose the assignment algorithm, such as round-robin.
I read the Kafka documentation, but I still don't know how to consume one topic in parallel.
Suppose:
I have one topic like "something happened" (don't split this topic), and I have many customers that want to consume it.
So what should I do so that multiple customers can consume it in parallel? Should I use partitioning and customer groups?
I have one idea about this, but I'm not sure whether it is right.
Make many partitions for the same topic and give one partition to each customer, so the producer must produce the same message to all of these partitions, and every customer is in a different customer group. Is that right?
Using partitions is the way to parallelize the consumption of a topic. Let's say you have 10 partitions for your topic; then you can have 10 consumers in the same consumer group reading one partition each. If you have fewer consumers than partitions, they would be responsible for more than one partition each. If you have more consumers than partitions, there would be consumers that do not get any partition assigned to them and have nothing to do except being available to replace another consumer that has died.
Each topic in Kafka can be organized into many partitions. Partitions allow for parallel consumption, increasing throughput.
A producer publishes a message to a topic using the Kafka producer client library, which balances the messages across the available partitions using a Partitioner. The broker to which the producer connects takes care of sending the message to the broker which is the leader of that partition, using the partition owner information in ZooKeeper. Consumers use Kafka's High-level consumer library (which implicitly handles broker leader changes, manages offset info in ZooKeeper, figures out partition owner info, etc.) to consume messages from partitions in streams; each stream may be mapped to a few partitions depending on how the consumer chooses to create the message streams.
For example, if there are 10 partitions for a topic and 3 consumer instances (C1,C2,C3 started in that order) all belonging to the same Consumer Group, we can have different consumption models that allow read parallelism as below
Each consumer uses a single stream.
In this model, when C1 starts all 10 partitions of the topic are mapped to the same stream and C1 starts consuming from that stream. When C2 starts, Kafka rebalances the partitions between the two streams. So, each stream will be assigned to 5 partitions(depending on the rebalance algorithm it might also be 4 vs 6) and each consumer consumes from its stream. Similarly, when C3 starts, the partitions are again rebalanced between the 3 streams. Note that in this model, when consuming from a stream assigned to more than one partition, the order of messages will be jumbled between partitions.
Each consumer uses more than one stream
(say C1 uses 3, C2 uses 3 and C3 uses 4). In this model, when C1 starts, all 10 partitions are assigned to its 3 streams and C1 can consume from the 3 streams concurrently using multiple threads. When C2 starts, the partitions are rebalanced between the 6 streams, and similarly when C3 starts, the partitions are rebalanced between the 10 streams. Each consumer can consume concurrently from multiple streams. Note that the number of streams and the number of partitions are equal here. If the number of streams exceeds the number of partitions, some streams will not get any messages as they will not be assigned any partitions.
Just to add to the list of answers, Confluent has a library to do this for you, like Rapids. The project is here:
https://github.com/confluentinc/parallel-consumer
It's open source.
Note: I'm the author.
@Lundahl covered the theory, so I'll give you a practical sample.
Create a topic for some purpose, e.g. news_events, with the parallelism your consumers will need (partitions). You can calculate that from the time it takes to process one message, the number of messages you will have, and the time in which you want all the messages processed.
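As a rough worked example of that calculation, the sketch below divides the total processing time by the time budget and rounds up. The numbers and names are invented for illustration, and it assumes one consumer thread per partition and a constant processing time per message.

```java
// Back-of-the-envelope sizing (hypothetical helper, not part of any library).
class PartitionCountSketch {

    // partitions >= ceil(messageCount * millisPerMessage / deadlineMillis)
    static int partitionsNeeded(long messageCount, long millisPerMessage, long deadlineMillis) {
        long totalWorkMillis = messageCount * millisPerMessage;
        // Integer ceiling division, with a floor of one partition.
        long needed = (totalWorkMillis + deadlineMillis - 1) / deadlineMillis;
        return (int) Math.max(1L, needed);
    }

    public static void main(String[] args) {
        // 1,000,000 messages at 20 ms each is 20,000,000 ms of work.
        // With a one hour budget (3,600,000 ms) that needs 6 parallel consumers.
        System.out.println(partitionsNeeded(1_000_000L, 20L, 3_600_000L));
    }
}
```

Since the partition count can be increased later but never decreased, it is usually safer to treat the result as a lower bound and leave some headroom.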
Let's create consumers for that topic. You want to read the news and so does your sister or brother, each at your own pace, so each of you needs a consumer group id. This way Kafka will know that threads a, b, c belong to one consumer group and threads d, e, f belong to the second consumer group. Every consumer group will receive the same messages, process them at its own pace, and won't affect the others.
A message lands in one partition or another, never in two. By default, for messages without a key, Kafka spreads them across the partitions (round robin in the classic default partitioner). Remember that all consumer groups can connect to and read data from the same partitions.
I would suggest you use rapids-kafka-client, a library which does that parallelism stuff for you: choose a number of threads equal to the number of partitions you have, choose a consumer group, and see the magic happen.
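    import static org.apache.kafka.clients.consumer.ConsumerConfig.GROUP_ID_CONFIG;
    import static org.apache.kafka.clients.consumer.ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG;
    import static org.apache.kafka.clients.consumer.ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG;
    import org.apache.kafka.common.serialization.StringDeserializer;

    // ConsumerConfig below is the builder class from rapids-kafka-client (its import is omitted here).
    public static void main(String[] args){
      ConsumerConfig.<String, String>builder()
          .prop(KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName())
          .prop(VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName())
          .prop(GROUP_ID_CONFIG, "news-app")    // consumer group id
          .topics("news_events")
          .consumers(7)                         // number of consumer threads
          .callback((ctx, record) -> {
            System.out.printf("status=consumed, value=%s%n", record.value());
          })
          .build()
          .consume()
          .waitFor();
    }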
You can read more about consumer groups, topics and partitions here
I assume what you want is parallel consumption between customers in a publish/subscribe fashion.
Beside that, you can also have parallel consumption within a single customer in order to scale the consumer application.
Parallel consumption between customers
If by "customers" you mean different organizations which are interested in consuming the topic's messages independently, all you need is consumer groups.
This is a simple publish/subscribe pattern where each customer runs its own application and reads all of the topic's messages without interfering with the others.
Each customer application can be seen as a consumer group, made up of one or more Kafka consumers (whether running on a single node or spread across a cluster), all of them sharing the consumer group's identifier.
You achieve this goal regardless of partitions. If the topic is partitioned, you don't need to worry about writing the same message to all partitions. Remember that in Kafka messages are durable: a message read by a Kafka consumer is not deleted and remains available to Kafka consumers from a different consumer group (until it expires). Furthermore, partitions are not meant to work like this; they help scale storage of data (at a certain point all of a topic's data wouldn't fit on just one node) and scale consumer applications, as you can see below.
Parallel consumption within single customer
You can further parallelize, or rather scale, the consumption of messages within a consumer group by adding more Kafka consumers to it.
Imagine the topic is huge, producers write into it at a high rate, and the consumer group has only one consumer: this poor consumer may struggle to keep up with the message arrival rate, especially if message processing is time-consuming too.
That's the case where you need partitions and more consumers in your consumer group, so that Kafka will assign partitions to consumers to distribute reading load among them.
How partition assignment works has already been explained in other answers here, but basically, for a given consumer group:
each of the topic's partitions is assigned exclusively to one consumer,
a consumer might be assigned more than one partition,
if there are more consumers than partitions, some of them will stay idle as they won't be assigned any partition to consume from.
Remember that message ordering in Kafka is guaranteed only at partition level, so if you have many partitions and ordering matters, you need to choose the right message key to partition data according to your requirements.
For example, if you want messages to be ordered by device, a device_id would be your key, which guarantees that messages from the same device will be written to the same partition.
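The device example can be checked with a small in-memory model in which each partition is just an append-only list. This is a sketch under stated assumptions: the partitioner is a simplified hash (String.hashCode instead of the murmur2 hash of the real client), and the class, method and device names are invented.

```java
import java.util.ArrayList;
import java.util.List;

// In-memory model of keyed production; every name here is hypothetical.
class KeyedOrderingSketch {

    // Simplified partitioner. Assumption: String.hashCode stands in for murmur2.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    // Appends each {deviceId, payload} event to the partition chosen by its key.
    static List<List<String>> produce(String[][] events, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String[] event : events) {
            String deviceId = event[0];
            partitions.get(partitionFor(deviceId, numPartitions))
                      .add(deviceId + ":" + event[1]);
        }
        return partitions;
    }

    public static void main(String[] args) {
        String[][] events = {
            {"dev-a", "on"}, {"dev-b", "on"}, {"dev-a", "off"}, {"dev-b", "off"}
        };
        List<List<String>> partitions = produce(events, 3);
        for (int p = 0; p < partitions.size(); p++) {
            System.out.println("partition " + p + ": " + partitions.get(p));
        }
    }
}
```

Events from different devices may interleave or end up in different partitions, but the "on" event of a given device always precedes its "off" event within that device's partition.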