How is message sequence preserved for topic with many partitions? - apache-kafka

I want any information/explanation on how Kafka maintains a message sequence when messages are written to topic with multiple partition.
For e.g. I have multiple message producer each producing messages sequentially and writing on the Kafka topic with more than 1 partition. In this case, how consumer group will work to consume messages.

Kafka only provides a total order over records within a partition, not between different partitions in a topic. Even within one partition, you still could encounter the out-of-order events if retries is enabled and max.in.flight.requests.per.connection is larger than 1.
Work-around is create a topic with only one partition although it means only one consumer process per consumer group.

Kafka will store messages in the partitions according to the message key given to the producer. If none is given, then the messages will be written in a round-robin style into the partitions. To keep ordering for a topic, you need to make sure the ordered sequence has the same key, or that the topic has only one partition.

Related

Apache Kafka PubSub

How does the pubsub work in Kafka?
I was reading about Kafka Topic-Partition theory, and it mentioned that In one consumer group, each partition will be processed by one consumer only. Now there are 2 cases:-
If the producer didn't mention the partition key or message key, the message will be evenly distributed across the partitions of a specific topic. ---- If this is the case, and there can be only one consumer(or subscriber in case of PubSub) per partition, how does all the subscribers receive the similar message?
If I producer produced to a specific partition, then how does the other consumers (or subscribers) receive the message?
How does the PubSub works in each of the above cases? if only a single consumer can get attached to a specific partition, how do other consumers receive the same msg?
Kafka prevents more than one consumer in a group from reading a single partition. If you have a use-case where multiple consumers in a consumer group need to process a particular event, then Kafka is probably the wrong tool. Otherwise, you need to write code external to Kafka API to transmit one consumer's events to other services via other protocols. Kafka Streams Interactive Query feature (with an RPC layer) is one example of this.
Or you would need lots of unique consumers groups to read the same event.
Answer doesn't change when producers send data to a specific partitions since "evenly distributed" partitions are still pre-computed, as far as the consumer is concerned. The consumer API is assigned to specific partitions, and does not coordinate the assignment with any producer.

Kafka default partitioner behavior when number of producers more than partitions

From the kafka faq page
In Kafka producer, a partition key can be specified to indicate the destination partition of the message. By default, a hashing-based partitioner is used to determine the partition id given the key
So all the messages with a particular key will always go to the same partition in a topic:
How does the consumer know which partition the producer wrote to, so it can consume directly from that partition?
If there are more producers than partitions, and multipe producers are writing to the same partition, how are the offsets ordered so that the consumers can consume messages from specific producers?
How does the consumer know which partition the producer wrote to
Doesn't need to, or at least shouldn't, as this would create a tight coupling between clients. All consumer instances should be responsible for handling all messages for the subscribed topic. While you can assign a Consumer to a list of TopicPartition instances, and you can call the methods of the DefaultPartitioner for a given key to find out what partition it would have gone to, I've personally not run across a need for that. Also, keep in mind, that Producers have full control over the partitioner.class setting, and do not need to inform Consumers about this setting.
If there are more producers than partitions, and multipe producers are writing to the same partition, how are the offsets ordered...
Number of producers or partitions doesn't matter. Batches are sequentially written to partitions. You can limit the number of batches sent at once per Producer client (and you only need one instance per application) with max.in.flight.requests, but for separate applications, you of course cannot control any ordering
so that the consumers can consume messages from specific producers?
Again, this should not be done.
Kafka is distributed event streaming, one of its use cases is decoupling services from producers to consumers, the producer producing/one application messages to topics and consumers /another application reads from topics,
If you have more then one producer, the order that data would be in the kafka/topic/partition is not guaranteed between producers, it will be the order of the messages that are written to the topic, (even with one producer there might be issues in ordering , read about idempotent producer)
The offset is atomic action which will promise that no two messages will get same offset.
The offset is running number, it has a meaning only in the specific topic and specfic partition
If using the default partioner it means you are using murmur2 algorithm to decide to which partition to send the messages, while sending a record to kafka that contains a key , the partioner in the producer runs the hash function which returns a value, the value is the number of the partition that this key would be sent to, this is same murmur2 function, so for the same key, with different producer you'll keep getting same partition value
The consumer is assigned/subscribed to handle topic/partition, it does not know which key was sent to each partition, there is assignor function which decides in consumer group, which consumer would handle which partition

Consuming from single kafka partition by multiple consumers

I read following in kafka docs:
The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a "fair share" of partitions at any point in time.
Kafka only provides a total order over records within a partition, not between different partitions in a topic.
Per-partition ordering combined with the ability to partition data by key is sufficient for most applications.
However, if you require a total order over records this can be achieved with a topic that has only one partition, though this will mean only one consumer process per consumer group.
I read following on this page:
Consumers read from any single partition, allowing you to scale throughput of message consumption in a similar fashion to message production.
Consumers can also be organized into consumer groups for a given topic — each consumer within the group reads from a unique partition and the group as a whole consumes all messages from the entire topic.
If you have more consumers than partitions then some consumers will be idle because they have no partitions to read from.
If you have more partitions than consumers then consumers will receive messages from multiple partitions.
If you have equal numbers of consumers and partitions, each consumer reads messages in order from exactly one partition.
Doubts
Does this means that single partition cannot be consumed by multiple consumers? Cant we have single partition and a consumer group with more than one consumer and make them all consume from single partition?
If single partition can be consumed by only single consumer, I was thinking why is this design decision?
What if I need total order over records and still need it to be consumed parallel? Is it undoable in Kafka? Or such scenario does not make sense?
Within a consumer group, at any time a partition can only be consumed by a single consumer. No you can't have 2 consumers within the same group consuming from the same partition at the same time.
Kafka Consumer groups allow to have multiple consumer "sort of" behave like a single entity. The group as a whole should only consume messages once. If multiple consumer in a group were to consume the same partitions, these records would be processed multiple times.
If you need to consume a partition multiple times, be sure these consumers are in different groups.
When processing needs to happen in order (serially) at any time there's only a single task to do. If you have records 1, 2 and 3 and want to process them in order, you cannot do anything until message 1 has been processed. It's the same for message 2 and 3. So what do you want to do in parallel?

Does consumer consume from replica partitions if multiple consumers running under same consumer group?

I am writing a kafka consumer application. I have a topic with 4 partitions - 1 is leader and 3 are followers. Producer uses key to identify a partition to push a message.
If I write a consumer and run it on different nodes or start 4 instances of same consumer, how message consuming will happen ? Does all 4 instances will get same messages ?
What happens in the case of multiple consumer(same group) consuming a single topic?
Do they get same data?
How offset is managed? Is it separate for each consumer?
I would suggest that you read at least first few chapters of confluent's definitive guide to kafka to get a priliminary understanding of how kafka works.
I've kept my answers brief. Please refer to the book for detailed explanation.
How offset is managed? Is it separate for each consumer?
Depends on the group id. Only one offset is managed for a group.
What happens in the case of multiple consumer(same group) consuming a single topic?
Consumers can be multiple - all can be identified by the same or different groups.
If 2 consumers belong to the same group, both will not get all messages.
Do they get same data?
No. Once a message is sent and a read is committed, the offset is incremented for that group. So a different consumer with the same group will not receive that message.
Hope that helps :)
What happens in the case of multiple consumer(same group) consuming a single topic?
Answer: Producers send records to a particular partition based on the record’s key here. The default partitioner for Java uses a hash of the record’s key to choose the partition. When there are multiple consumers in same consumer group, each consumer gets different partition. So, in this case, only single consumer receives all the messages. When the consumer which is receiving messages goes down, group coordinator (one of the brokers in the cluster) triggers rebalance and then that partition is assigned to one of the available consumer.
Do they get same data?
Answer: If consumer commits consumed messages to partition and goes down, so as stated above, rebalance occurs. The consumer who gets this partition, will not get messages. But if consumer goes down before committing its then the consumer who gets this partition, will get messages.
How offset is managed? Is it separate for each consumer?
Answer: No, offset is not separate to each consumer. Partition never gets assigned to multiple consumers in same consumer group at a time. The consumer who gets partition assigned, gets offset as well by default.

How multiple consumer group consumers work across partition on the same topic in Kafka?

I was reading this SO answer and many such blogs.
What I know:
Multiple consumers can run on a single partition when running multiple consumers with multiple consumer group id and only one consumer from a consumer group can consume at a given time from a partition.
My question is related to multiple consumers from multiple consumer groups consuming from the same topic:
What happens in the case of multiple consumers(different groups) consuming a single topic(eventually the same partition)?
Do they get the same data?
How offset is managed? Is it separate for each consumer?
(Might be opinion based) How do you or generally recommended way is to handle overlapping data across two consumers of a separate group operating on a single partition?
Edit:
"overlapping data": means two consumers of separate consumer groups operating on the same partition getting the same data.
Yes they get the same data. Kafka only stores one copy of the data in the topic partitions' commit log. If consumers are not in the same group then they can each get the same data using fetch requests from the clients' consumer library. The assignment of which partitions each group member will get is managed by the lead consumer of each group. The entire process in detailed steps is documented here https://community.hortonworks.com/articles/72378/understanding-kafka-consumer-partition-assignment.html
Offsets are "managed" by the consumers, but "stored" in a special __consumer_offsets topic on the Kafka brokers.
Offsets are stored for each (consumer group, topic, partition) tuple. This combination is also used as the key when publishing offsets to the __consumer_offsets topic so that log compaction can delete old unneeded offset commit messages and so that all offsets for the same (consumer group, topic, partition) tuple are stored in the same partition of the __consumer_offsets topic (which defaults to 50 partitions)
Each consumer group gets every message from a subscribed topic.
Yes
Offset are stored by partition. For example let's say you have a topic with 2 partitions and a consumer group named cg made up of 2 consumers. In that case Kafka assigns each of the consumers one of the partitions. Then the consumers fetch the offset for the partition they were assigned to from Kafka (e.g. consumer 'asks' Kafka: "What is the offset for this topic for consumer group cg partition 1", or partition 2 for the other consumer). After getting the correct offset the consumer polls some Kafka broker for the next message in that partition.
I'm not entirely sure what you mean by overlapping data, can you clarify a bit or give an example?