Is manipulating the "read offset" as a Kafka consumer bad practice?

We have an ongoing discussion about the correct (or intended) usage of Kafka for events.
The point of contention is the ability of a consumer not only to subscribe (or resubscribe) to a topic but also to modify its own read offset.
Am I right in saying that "a consumer should be designed in a way that it never modifies its own read offset"?
Reasoning behind this:
The consumer cannot know which events are actually still stored inside a topic (log retention), so restoring a complete state from "delta" events is not possible.
The consumer has consumed an event once and confirmed this to the broker. Why consume it again?

If your consumer instances belong to the same consumer group, the consumers need not keep the state of reading from the topic themselves. That state is nothing but the offset, per partition, up to which your consumers have read so far, and it is tracked on the broker side. If your topic has multiple partitions, consumers belonging to the same consumer group can distribute the workload among themselves. If one of the consumers crashes or fails, the other consumers in the group know from which partition offsets to continue consuming the records.
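As a minimal sketch of that, assuming a local broker at localhost:9092 and a topic named events (both placeholders), a group consumer can rely entirely on the broker-side committed offsets and keep no position state of its own:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("group.id", "my-consumer-group");         // placeholder group
        props.put("enable.auto.commit", "true");            // offsets tracked per group on the broker
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));
            while (true) {
                // After a crash or rebalance, another group member resumes
                // from the last committed offset; no local state is needed.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```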

Related

Kafka: change in consumers number in a group

I understand that Kafka's semantics are that a consumer group must read a record only once. To achieve this, Kafka consumers maintain an offset, which is conveyed to brokers with read requests so that brokers can send data accordingly and ensure that already-read data is not resent. But how do the broker and consumers react when there is a change in the consumer group, like the addition of a new consumer or an existing consumer going down?
There are a few things which need to be considered here:
If a consumer goes down, how is its offset information taken into account while assigning its partitions to active consumers?
If a new consumer joins, how does the system ensure that it doesn't read data its consumer group has already read?
If consumers join or leave a group, a consumer group rebalance takes place. All consumers in the group are temporarily suspended, then partitions are reassigned among the members.
If those consumers were mid-processing, there's a good chance they'll re-consume the same data.
If you use transactions, that chance can be reduced, since records will be consumed "exactly once". But this doesn't necessarily mean "successfully processed and offset committed" exactly once.
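For illustration, restricting a consumer to transactionally committed records comes down to the isolation.level setting; here is a hedged configuration sketch (broker address and group id are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReadCommittedConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "txn-aware-group");         // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Only records from committed transactions are returned; aborted records are skipped.
        props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
        // Commit offsets manually, only after processing has succeeded.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // subscribe/poll/commitSync as usual ...
        }
    }
}
```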

Does consumer consume from replica partitions if multiple consumers running under same consumer group?

I am writing a Kafka consumer application. I have a topic with 4 partitions - 1 is the leader and 3 are followers. The producer uses a key to identify the partition to push a message to.
If I write a consumer and run it on different nodes, or start 4 instances of the same consumer, how will message consumption happen? Will all 4 instances get the same messages?
What happens in the case of multiple consumers (same group) consuming a single topic?
Do they get the same data?
How is the offset managed? Is it separate for each consumer?
I would suggest that you read at least the first few chapters of Confluent's Kafka: The Definitive Guide to get a preliminary understanding of how Kafka works.
I've kept my answers brief; please refer to the book for detailed explanations.
How is the offset managed? Is it separate for each consumer?
It depends on the group id: only one offset per partition is tracked for each group.
What happens in the case of multiple consumers (same group) consuming a single topic?
There can be multiple consumers, identified by the same or different group ids. If 2 consumers belong to the same group, the partitions are split between them, so neither gets all the messages.
Do they get the same data?
No. Once a message is delivered and its read is committed, the offset is incremented for that group. So a different consumer with the same group will not receive that message.
Hope that helps :)
What happens in the case of multiple consumers (same group) consuming a single topic?
Answer: Producers send records to a particular partition based on the record's key. The default partitioner for Java uses a hash of the record's key to choose the partition. When there are multiple consumers in the same consumer group, each consumer gets a different partition. So, in this case (all records sharing one key, hence one partition), only a single consumer receives all the messages. When the consumer that is receiving messages goes down, the group coordinator (one of the brokers in the cluster) triggers a rebalance, and that partition is assigned to one of the available consumers.
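To make the keyed-partitioning point concrete, here is a hedged producer sketch (the topic orders and the key customer-42 are made up); both records hash to the same partition, so one consumer in the group sees them, in order:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key hash to the same partition, so a single
            // consumer in the group receives them all, in production order.
            producer.send(new ProducerRecord<>("orders", "customer-42", "order created"));
            producer.send(new ProducerRecord<>("orders", "customer-42", "order shipped"));
        }
    }
}
```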
Do they get the same data?
Answer: If a consumer commits the messages it consumed from a partition and then goes down, a rebalance occurs as stated above, and the consumer that takes over the partition will not get those messages again. But if the consumer goes down before committing, then the consumer that takes over the partition will get those messages.
How is the offset managed? Is it separate for each consumer?
Answer: No, the offset is not separate for each consumer. A partition is never assigned to multiple consumers in the same consumer group at a time. The consumer that gets the partition assigned also picks up its committed offset by default.

How to read messages from a Kafka consumer group without consuming them?

I'm managing a Kafka queue using a common consumer group across multiple machines. Now I also need to show the current content of the queue. How do I read only those messages within the group which haven't been read yet, while keeping those messages readable by the other consumers in the group which actually process them? Any help would be appreciated.
In Kafka, the notion of "reading" messages from a topic and that of "consuming" them are the same thing. At a high level, the only thing that makes a "consumed" message unavailable to a consumer is that consumer setting its read offset to a value beyond that of the message in question. Thus, you can turn off the autocommit feature of your consumers and avoid committing offsets in cases where you'd like only to "read" but not to "consume".
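A minimal sketch of that approach (topic and group names are placeholders; note that joining the processing group does trigger a rebalance, as a later answer points out):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReadWithoutConsuming {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "workers");                 // placeholder group shared with processors
        props.put("enable.auto.commit", "false");         // never commit => never "consume"
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("work-queue"));
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(5))) {
                System.out.println(record.value()); // read, but the offset is never committed
            }
            // No commitSync()/commitAsync(): the committed offset is unchanged,
            // so the group's position is unaffected by this read.
        }
    }
}
```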
A good proxy for getting "all messages which haven't been read" is to compare the latest committed offset to the highwater mark offset per partition. This provides a notion of "lag" that indicates how far behind a given consumer is in its consumption of a partition. The fetch_consumer_lag CLI function in pykafka is a good example of how to do this.
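The same lag computation can be sketched with the plain Java consumer; this assumes the consumer already has partitions assigned (e.g. after a poll()) and uses endOffsets() for the high-water marks and committed() for the group's positions:

```java
import java.util.Map;
import java.util.Set;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class LagPrinter {
    // Prints per-partition lag: high-water mark minus last committed offset.
    static void printLag(KafkaConsumer<String, String> consumer) {
        Set<TopicPartition> assignment = consumer.assignment();
        Map<TopicPartition, Long> endOffsets = consumer.endOffsets(assignment); // high-water marks
        Map<TopicPartition, OffsetAndMetadata> committed = consumer.committed(assignment);
        for (TopicPartition tp : assignment) {
            OffsetAndMetadata c = committed.get(tp);
            long position = (c == null) ? 0L : c.offset(); // nothing committed yet => full lag
            System.out.printf("%s lag=%d%n", tp, endOffsets.get(tp) - position);
        }
    }
}
```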
In Kafka, a partition can be consumed by only one consumer in a group, i.e. if your topic has 10 partitions and you spawn 20 consumers with the same groupId, then only 10 will be connected to Kafka and the remaining 10 will sit idle. A new consumer will be picked up by Kafka only when one of the existing consumers dies or stops polling from the topic.
AFAIK, I don't think you can do what I understand you want to do within a consumer group. You can obviously create another groupId and process messages based on the information gathered by the first consumer group.
Kafka now has a KStream.peek() method
See proposal "Add KStream peek method".
It's not 100% clear to me from the docs that this prevents consuming of a message that's peeked from the topic, but I can't see how you could use it in any crash-safe, robust way unless it does.
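For what it's worth, peek() is a pass-through operation: it runs a side effect per record and forwards the record downstream unchanged, so it observes rather than withholds messages. A minimal topology sketch (topic names are placeholders):

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

public class PeekTopology {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> stream = builder.stream("input-topic"); // placeholder topic
        stream
            // peek() runs a side effect per record and forwards it unchanged
            .peek((key, value) -> System.out.printf("observed %s=%s%n", key, value))
            .to("output-topic"); // placeholder topic
        // builder.build() would then be passed to a KafkaStreams instance with the usual config.
    }
}
```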
See also:
Handling consumer rebalance when implementing synchronous auto-offset commit
High-Level Consumer and peeking messages
I think you can use the publish-subscribe model. Then each consumer group has its own offset and can consume all messages for itself.

Kafka: persist messages until consumed by all groups

I have a Kafka topic which has multiple consumer groups. I need messages on the topic to not be removed when their persistence duration expires if they have not yet been read by all consumer groups.
Is it possible to set up additional persistence rules beyond the duration? I need messages to always stay on a topic if they have never been consumed.
Would it be possible to "refresh" the timeout on a message if it has not been consumed and its duration expires?
This is not possible in Kafka. Kafka - unlike many more traditional message brokers - does not track which messages have or have not been consumed. This is the responsibility of the consumers. And because the broker doesn't track this, it cannot clean up the topic based on it.
In some cases, you can use compacted topics, which keep the last message for each key. Thanks to that, even a consumer which connects late might be able to recover the state. But this works only for specific kinds of data, such as state changes.
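For illustration, a compacted topic can be created with the admin client by setting cleanup.policy=compact; in this sketch the topic name, partition count, and replication factor are placeholders:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        try (AdminClient admin = AdminClient.create(props)) {
            // cleanup.policy=compact keeps the latest record per key indefinitely,
            // instead of deleting records once the retention period expires.
            NewTopic topic = new NewTopic("state-changes", 4, (short) 3)
                    .configs(Map.of("cleanup.policy", "compact"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```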

Does Kafka have a visibility timeout?

Does Kafka have something analogous to an SQS visibility timeout?
What is the property called?
I can't seem to find anything like this in the docs.
Kafka works a little bit differently than SQS.
Every message resides on a single topic partition. Kafka consumers are organized into consumer groups, and a single partition can be assigned to only one consumer in a group. That means other consumers from the same CG won't get the same message, and the same message will only be re-sent if the same consumer pulls messages from the broker without having committed the offset.
If the Kafka broker designated as group coordinator detects that a consumer has died, a rebalance happens and that partition can be assigned to another consumer. But again, this will be the only consumer that gets messages from that partition.
So as you can see, since Kafka does not use the competing-consumers pattern, there's no notion of a visibility timeout.
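The closest practical analogue is at-least-once delivery with manual commits: a record that was polled but whose offset was never committed gets redelivered after a rebalance. A hedged sketch (broker, group id, and topic are placeholders):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "workers");                 // placeholder group
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("tasks"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    process(record); // if this throws or the process dies here, the offset
                                     // is never committed and the record is redelivered
                                     // after the rebalance
                }
                consumer.commitSync(); // commit only after successful processing
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        System.out.println(record.value());
    }
}
```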