Multiple consumers with same group id - apache-kafka

I am a beginner in Kafka. I understand that multiple consumers with the same group id can't consume messages from the same partition of a topic. I am wondering what would happen if multiple Kafka consumers in a consumer group did read the same message from a partition, and why that is a bad thing.

Obviously, processing the same record multiple times is almost never intended, but the bigger problem is offset management.
If multiple consumers in a group read the same message and each commits its offset to indicate the message has been processed successfully, then the final commit (from the slowest consumer) always wins. Meanwhile, the other consumers would have already moved on to later data.
When that happens and any consumer client restarts, it has to rewind to the last committed offset, despite having already processed messages after it.
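To make the "last commit wins" point concrete, here is a minimal sketch in plain Python. It is a toy model, not the Kafka client API: OffsetStore and the consumers A and B are hypothetical, and the only assumption carried over from Kafka is that a group has one committed offset per partition and that a later commit overwrites an earlier one.

```python
# Toy model (not the Kafka client API): one committed offset per partition for the group.
class OffsetStore:
    def __init__(self):
        self.committed = {}  # partition -> next offset to read

    def commit(self, partition, offset):
        # The most recent commit overwrites the previous one; no maximum is taken.
        self.committed[partition] = offset

    def position_after_restart(self, partition):
        return self.committed.get(partition, 0)


store = OffsetStore()

# Hypothetical: consumers A and B (same group) both read partition 0.
store.commit(0, 8)  # fast consumer A has processed offsets 0..7
store.commit(0, 3)  # slow consumer B commits later, having processed only 0..2

resume_at = store.position_after_restart(0)
reprocessed_by_a = list(range(resume_at, 8))  # offsets A would see a second time
print(resume_at, reprocessed_by_a)
```

After a restart, A would resume at offset 3 and see offsets 3 through 7 again, even though it had already processed them.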

Related

Redundant Kafka Consumers

Can one have 2 clients reading a single topic such that they never receive the same message? If one client dies, the other keeps reading and gets all the messages.
In a word, "redundant clients": not for performance's sake but for client failover.
All I have seen are examples of N partitions and more than N clients in a consumer group, where N clients get messages and the rest are idle. It's not optimal to have 2 clients on a single partition where one client does nothing until the other fails.
More than one client in the same consumer group cannot be assigned the same partition at the same time, so they will never receive the same messages.
The scenario you're asking about is more fault tolerance than load balancing. Assume one partition and two consumers. If one consumer hits a fatal exception while consuming a message and dies without committing that offset, the consumer group rebalances, and the idle secondary consumer picks up from the last committed offset and tries consuming those same messages.
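The failover described here can be sketched with a toy model in plain Python. Nothing below is the Kafka client API; the member names "primary" and "standby", the five-message log, and the crash at offset 2 are all made up for illustration.

```python
# Toy model of a consumer group on a single partition (not the Kafka client API).
log = ["m0", "m1", "m2", "m3", "m4"]
committed = 0                      # next offset the group reads after a rebalance
members = ["primary", "standby"]


def owner(members):
    # With one partition, only one member is assigned; the others stay idle.
    return members[0] if members else None


seen = {"primary": [], "standby": []}

# The primary processes and commits m0 and m1, then dies while handling m2.
for offset in range(committed, len(log)):
    if offset == 2:
        members.remove("primary")  # crash before offset 2 is committed
        break
    seen[owner(members)].append(log[offset])
    committed = offset + 1

# Rebalance: the standby takes over from the last committed offset.
for offset in range(committed, len(log)):
    seen[owner(members)].append(log[offset])
    committed = offset + 1

print(seen)
```

The standby starts at m2, the message the primary died on, so nothing is skipped. If m2 itself is what kills consumers, the standby will hit the same problem.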

How does Kafka guarantee consumers don't read a single message twice?

How does Kafka guarantee that consumers don't read a single message twice?
Or is the above scenario possible?
Could the same message be read twice, by a single consumer or by multiple consumers?
There are many scenarios that cause a consumer to consume a duplicate message:
The producer published the message successfully, but the acknowledgement failed, which causes the producer to retry the same message.
The producer published a batch of messages, but the batch was only partially written. In that case it retries and resends the same batch, which causes duplicates.
Consumers receive a batch of messages from Kafka and manually commit their offset (enable.auto.commit=false).
If a consumer fails before committing to Kafka, the next time it will consume the same records again, which produces duplicates on the consumer side.
To guarantee that duplicate messages are not consumed, the job's execution and the offset commit must be atomic; that is what gives exactly-once delivery semantics on the consumer side.
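One way to picture "execution and offset commit must be atomic" is to keep the processing result and the offset in the same external store and write both in one transaction. The sketch below uses Python's built-in sqlite3 purely as an illustration; the table names, the handle function, and the simulated crash are all hypothetical, and the idea is simply that the consumer stores its own progress next to its results.

```python
import sqlite3

# Hypothetical schema: results and the consumer's progress share one database,
# so a single transaction covers both.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE results (msg_offset INTEGER PRIMARY KEY, value TEXT)")
db.execute("CREATE TABLE progress (part INTEGER PRIMARY KEY, next_offset INTEGER)")
db.execute("INSERT INTO progress VALUES (0, 0)")
db.commit()


def handle(partition, offset, value, fail=False):
    try:
        with db:  # commits on success, rolls back if an exception escapes
            db.execute("INSERT INTO results VALUES (?, ?)", (offset, value.upper()))
            if fail:
                raise RuntimeError("simulated crash before the offset is stored")
            db.execute(
                "UPDATE progress SET next_offset = ? WHERE part = ?",
                (offset + 1, partition),
            )
    except RuntimeError:
        pass  # the consumer would restart and re-read from next_offset


handle(0, 0, "a")
handle(0, 1, "b", fail=True)  # rolled back: neither the result nor the offset is saved
handle(0, 1, "b")             # redelivery after restart is applied exactly once

rows = db.execute("SELECT msg_offset, value FROM results ORDER BY msg_offset").fetchall()
next_offset = db.execute("SELECT next_offset FROM progress WHERE part = 0").fetchone()[0]
print(rows, next_offset)
```

Because the failed attempt leaves no trace, the redelivered message does not collide with a half-written result.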
You can use the parameters below to achieve exactly-once semantics. But please understand that this comes at a cost in performance.
Enable idempotence on the producer side, which guarantees that the same message is not published twice:
enable.idempotence=true
Set the consumer's transaction isolation level (isolation.level) to read_committed:
isolation.level=read_committed
In Kafka Streams, the above settings can be achieved by enabling the exactly-once semantics option (processing.guarantee), which makes each processing step a single transaction.
Idempotent
Idempotent delivery enables a producer to write messages to Kafka exactly once to a particular partition of a topic during the lifetime of a single producer, without data loss and with per-partition order preserved.
Transaction (isolation.level)
Transactions give us the ability to atomically update data in multiple topic partitions. All the records included in a transaction will be successfully saved, or none of them will be. It allows you to commit your consumer offsets in the same transaction along with the data you have processed, thereby allowing end-to-end exactly-once semantics.
The producer does not wait before writing a message to Kafka; instead it marks transaction boundaries with beginTransaction, commitTransaction, and abortTransaction (in case of failure). The consumer sets isolation.level to either read_committed or read_uncommitted:
read_committed: Consumers will always read committed data only.
read_uncommitted: Read all messages in offset order without waiting for transactions to be committed.
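The difference between the two isolation levels can be sketched with a toy partition log in plain Python. This is not the Kafka wire format or client API; the tuples and the read function are made up. It assumes the usual behavior: a read_committed consumer skips records from aborted transactions and does not read past the first transaction that is still open.

```python
# Toy model of a partition log with transactional records.
# Each entry: (offset, value, transaction id or None, transaction state).
log = [
    (0, "a", None, "none"),
    (1, "b", "t1", "committed"),
    (2, "c", "t2", "aborted"),
    (3, "d", "t3", "open"),
    (4, "e", None, "none"),
]


def read(log, isolation_level):
    out = []
    for offset, value, txn, state in log:
        if isolation_level == "read_committed":
            if state == "open":
                break      # cannot read past the first still-open transaction
            if state == "aborted":
                continue   # records from aborted transactions are filtered out
        out.append(value)
    return out


uncommitted_view = read(log, "read_uncommitted")
committed_view = read(log, "read_committed")
print(uncommitted_view, committed_view)
```

In this model the read_committed view stops at "b": record "e" is not transactional, but it sits behind the open transaction t3 and is held back until t3 commits or aborts.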
Please see the reference for more detail.
It is absolutely possible if you don't make your consuming process idempotent.
For example, suppose you are implementing at-least-once delivery semantics: you first process messages and then commit offsets. The commit may fail because of a server failure or a rebalance (maybe your consumer's partition was revoked at that time). So when you poll again, you will get the same messages a second time.
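A minimal sketch of an idempotent consumer under at-least-once delivery, in plain Python. It is a toy model rather than the Kafka client API: the message ids, the in-memory processed_ids set (which would need to be durable in practice), and the crash point are all assumptions made for illustration.

```python
# Toy at-least-once loop: process first, commit afterwards.
log = [("id-0", 10), ("id-1", 20), ("id-2", 30)]  # (message id, amount)
committed = 0
processed_ids = set()  # hypothetical dedup store; in practice it must be durable
total = 0


def process(msg_id, amount):
    global total
    if msg_id in processed_ids:
        return False   # duplicate delivery: skip the side effect
    processed_ids.add(msg_id)
    total += amount
    return True


def poll_and_process(crash_before_commit_at=None):
    global committed
    applied = []
    for offset in range(committed, len(log)):
        msg_id, amount = log[offset]
        applied.append((msg_id, process(msg_id, amount)))
        if offset == crash_before_commit_at:
            return applied  # processed, but the offset was never committed
        committed = offset + 1
    return applied


first = poll_and_process(crash_before_commit_at=1)
second = poll_and_process()  # after restart, offset 1 is delivered again
print(first, second, total)
```

The second run receives id-1 again but recognizes it, so the total counts each amount once.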
To be precise, this is what Kafka guarantees:
Kafka guarantees the order of messages within a partition
Produced messages are considered "committed" once they have been written to the partition on all its in-sync replicas
Messages that are committed will not be lost as long as at least one replica remains alive
Consumers can only read messages that are committed
Regarding consuming messages, the consumers keep track of their progress in a partition by saving the last offset read in an internal compacted Kafka topic (__consumer_offsets).
Kafka consumers can commit offsets automatically if enable.auto.commit is enabled. However, auto-commit is driven by a timer rather than by the completion of processing, so it can give "at most once" semantics: an offset may be committed for a message whose processing has not finished. Hence, the flag is usually disabled and the developer commits the offset explicitly once processing is complete.

Kafka Topic ordering when scaling up the partitions

Suppose your producers create messages for the users of a system, and the order of the messages matters at the user level.
My producers add messages to a topic that has two partitions, and I hash on the user_id to put all the messages of each user in the same partition, which guarantees their order.
How can I scale up the system and add more partitions to the topic while keeping the order of the messages?
How does Kafka treat the messages that were already produced before the repartitioning?
What will happen to the messages that were consumed but whose offsets were not committed back to Kafka?
1. Use a TreeSet (an ordered set) to cache messages in the consumer client for 1 minute (or less) and reorder them there; Kafka only guarantees order within one partition, and I think the producer cannot guarantee order either.
2. If you do not commit the offset manually, the next fetch request will return the same messages. In any case, the consumer client should handle messages idempotently, even if you have committed the offset.
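On the question of why adding partitions threatens per-user ordering: the key-to-partition mapping depends on the partition count, so some users start landing in a different partition while their older messages stay where they were written (Kafka does not move existing records). A small sketch, assuming integer user ids and using a plain modulo as a stand-in for the real hash (Kafka's default partitioner hashes the key bytes with murmur2):

```python
def partition_for(user_id, num_partitions):
    # Stand-in for hash(key) % num_partitions; assumes integer user ids.
    return user_id % num_partitions


user_ids = range(6)
before = {u: partition_for(u, 2) for u in user_ids}  # topic with 2 partitions
after = {u: partition_for(u, 3) for u in user_ids}   # same topic grown to 3

# Users whose new messages now go to a different partition than their old ones.
moved = [u for u in user_ids if before[u] != after[u]]
print(moved)
```

For every user in moved, old and new messages sit in different partitions, and Kafka gives no ordering guarantee across partitions, which is why consumers need some way to reorder or to drain the old data first.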

How to read messages from kafka consumer group without consuming?

I'm managing a Kafka queue using a common consumer group across multiple machines. Now I also need to show the current content of the queue. How do I read only those messages within the group that haven't been read yet, while keeping them readable by the other consumers in the group that actually process them? Any help would be appreciated.
In Kafka, the notion of "reading" messages from a topic and that of "consuming" them are the same thing. At a high level, the only thing that makes a "consumed" message unavailable to a consumer is that consumer setting its read offset to a value beyond that of the message in question. Thus, you can turn off the autocommit feature of your consumers and avoid committing offsets in cases where you'd like only to "read" but not to "consume".
A good proxy for getting "all messages which haven't been read" is to compare the latest committed offset to the high-water mark offset for each partition. This provides a notion of "lag" that indicates how far behind a given consumer is in its consumption of a partition. The fetch_consumer_lag CLI function in pykafka is a good example of how to do this.
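The lag arithmetic itself is just a per-partition subtraction. Here is a minimal sketch in plain Python; consumer_lag is a hypothetical helper (it is not pykafka's function), the offsets are made-up numbers, and treating a partition with no commit as starting from 0 is an assumption.

```python
def consumer_lag(high_watermarks, committed_offsets):
    # Both arguments map partition -> offset.
    # Assumption: a partition with no committed offset counts from 0.
    return {
        p: hw - committed_offsets.get(p, 0)
        for p, hw in high_watermarks.items()
    }


lag = consumer_lag({0: 120, 1: 80, 2: 45}, {0: 100, 1: 80})
print(lag)
```

A lag of 0 means the group has caught up on that partition; anything above 0 is the number of messages not yet committed as processed.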
In Kafka, a partition can be consumed by only one consumer in a group. For example, if your topic has 10 partitions and you spawn 20 consumers with the same group id, then only 10 will be assigned partitions and the remaining 10 will sit idle. An idle consumer is assigned a partition only when one of the existing consumers dies or stops polling the topic.
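The 10-partition, 20-consumer case can be sketched with a simplified round-robin assignment in plain Python. The assign function and the consumer names are hypothetical, and Kafka's real assignors (range, round-robin, sticky) differ in detail; the point is only that there are never more busy consumers than partitions.

```python
def assign(num_partitions, consumers):
    # Simplified round-robin: partition p goes to consumer p mod len(consumers).
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment


consumers = [f"c{i}" for i in range(20)]
assignment = assign(10, consumers)
idle = [c for c, parts in assignment.items() if not parts]

# If c0 dies, a rebalance over the survivors gives a formerly idle consumer work.
after_failure = assign(10, consumers[1:])
print(len(idle), after_failure["c10"])
```

With 20 consumers, 10 stay idle; once c0 leaves, c10 goes from idle to owning a partition.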
AFAIK, you can't do what I understand you want to do within a single consumer group. You can, of course, create another group id and process messages based on the information gathered by the first consumer group.
Kafka now has a KStream.peek() method.
See proposal "Add KStream peek method".
It's not 100% clear to me from the docs that this prevents the peeked message from being consumed from the topic, but I can't see how you could use it in any crash-safe, robust way unless it does.
See also:
Handling consumer rebalance when implementing synchronous auto-offset commit
High-Level Consumer and peeking messages
I think you can use the publish-subscribe model. Then each consumer has its own offset and can consume all messages for itself.

Kafka Message at-least-once mode at multi-consumer

Kafka uses at-least-once message delivery to ensure every message is processed, and uses a message offset to indicate which message is to be delivered next.
When there are multiple consumers, if some deadly message causes a consumer to crash during message processing, will this message be redelivered to other consumers and spread the death? If some slow message blocks a single consumer, can other consumers keep going and process subsequent messages?
Or even worse, if a slow and deadly message causes a consumer to crash, will it cause other consumers to start from its offset again?
There are a few things to consider here:
A Kafka topic partition can be consumed by one consumer in a consumer group at a time. So if two consumers belong to two different groups they can consume from the same partition simultaneously.
Stored offsets are per consumer group. So each topic partition has a stored offset for each active (or recently active) consumer group with consumer(s) subscribed to that partition.
Offsets can be auto-committed at certain intervals, or manually committed (by the consumer application).
So let's look at the scenarios you described.
Some deadly message causes a consumer crash during message processing:
If offsets are auto-committed, chances are by the time the processing of the message fails and crashes the consumer, the offset is already committed and the next consumer in the group that takes over would not see that message anymore.
If offsets are manually committed after processing is done, then the offset of that message will not be committed (for simplicity, I am assuming one message is read and processed at a time, but this can be easily generalized) because of the consumer crash. So any other consumer in the group that is (will be) subscribed to that topic will read the message again after taking over that partition. So it's possible that it will crash other consumers too. If offsets are committed before message processing, then the next consumers won't see the message because the offset is already committed when the first consumer crashed.
Some slow message blocks a single consumer: As long as the consumer is considered alive, no other consumer in the group will take over. If the slowness goes beyond the consumer's session.timeout.ms (or, in newer clients where heartbeats are sent from a background thread, beyond max.poll.interval.ms between polls), the consumer will be considered dead and removed from the group. So whether another consumer in the group will read that message depends on how/when the offset is committed.
Slow and deadly message causes a consumer crash: This scenario should be similar to the previous ones in terms of how Kafka handles it. Either slowness is detected first or the crash occurs first. Again the main thing is how/when the offset is committed.
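The effect of commit timing on a deadly message can be sketched with a toy model in plain Python. This is not the Kafka client API; the log contents, the consumer names, and the run_group function are made up, and it assumes one message is read and processed at a time, as in the scenarios above.

```python
# Toy model: offset 2 holds a message that crashes any consumer that processes it.
log = ["ok-0", "ok-1", "poison", "ok-3"]


def run_group(commit_before_processing, consumers=("c1", "c2", "c3")):
    committed = 0
    crashed, handled = [], []
    for consumer in consumers:       # each survivor takes over after a crash
        alive = True
        while alive and committed < len(log):
            offset = committed
            if commit_before_processing:
                committed = offset + 1
            if log[offset] == "poison":
                crashed.append(consumer)
                alive = False
                continue
            handled.append(log[offset])
            if not commit_before_processing:
                committed = offset + 1
    return crashed, handled


commit_first = run_group(commit_before_processing=True)
commit_last = run_group(commit_before_processing=False)
print(commit_first, commit_last)
```

Committing first loses the poison message but only kills one consumer. Committing after processing redelivers it to every consumer in turn, and in this model nothing after it is ever handled, which is the usual argument for some form of skip or dead-letter handling.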
I hope that helps with your questions.