Kafka manual offset management issue - apache-kafka

While implementing manual offset management with Kafka 0.9, I encountered the following issue:
In order to manage the offsets manually, for each consumed record I retrieve the record's offset and commit the new offset (currentOffset + 1, since the committed offset is the position of the next record to consume; my offset reset strategy is "latest").
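A minimal sketch of that per-record pattern (assuming a KafkaConsumer<String, String> named consumer, the usual java.util and org.apache.kafka.clients imports, and a hypothetical process callback):
ConsumerRecords<String, String> records = consumer.poll(100); // 0.9 API: poll(long timeout)
for (ConsumerRecord<String, String> record : records) {
    process(record); // hypothetical application logic
    consumer.commitSync(Collections.singletonMap(
            new TopicPartition(record.topic(), record.partition()),
            new OffsetAndMetadata(record.offset() + 1))); // commit the next offset to consume
}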
When a new consumer group is created, it has no explicit offsets (the offset is "unknown"). Therefore, if the group is stopped before it has consumed messages from all existing partitions, it will have committed offsets for only some of the partitions (the ones the consumer got messages from), while the offset for the rest of the partitions will still be "unknown".
When the consumer is started again, it gets only some of the messages that were produced while it was down (only the ones from partitions that had a committed offset); the messages from partitions with an "unknown" offset are never consumed, because the "latest" reset strategy skips straight to the log end.
Since it's unacceptable in my case to miss any messages once a consumer group is created, I'd like to explicitly commit an offset for each partition before starting consumption.
To do that I found two options:
1. Use the low-level consumer to send an offset request.
2. Use the high-level consumer: call consumer.poll(0) (to trigger the assignment), then consumer.assignment(), and for each TopicPartition call consumer.committed(topicPartition), consumer.seekToEnd(topicPartition), and consumer.position(topicPartition), and eventually commit all the offsets (sketched below).
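For illustration, a rough sketch of option 2 against the 0.9 consumer API (assuming the usual java.util and org.apache.kafka.clients imports; only partitions whose offset is still unknown get an explicit commit):
consumer.poll(0); // trigger partition assignment
Map<TopicPartition, OffsetAndMetadata> endOffsets = new HashMap<>();
for (TopicPartition tp : consumer.assignment()) {
    if (consumer.committed(tp) == null) {  // offset still "unknown" for this partition
        consumer.seekToEnd(tp);            // jump to the log end...
        endOffsets.put(tp, new OffsetAndMetadata(consumer.position(tp))); // ...and record it
    }
}
consumer.commitSync(endOffsets); // pin an explicit offset for every assigned partition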
Both are more complex and noisy than I'd like; I'd expect a simpler API for getting the log end position of all partitions assigned to a consumer.
Any thoughts or ideas for a better implementation would be appreciated.
Thanks.

Which consumer API to use depends on where you are committing offsets.
If your offsets are stored in the Kafka brokers, then you should definitely use the high-level consumer API; it will give you more control over offsets.
If you are keeping offsets in ZooKeeper, then you can use the old consumer API, for example:
List<KafkaStream<byte[], byte[]>> streams =
    consumer.createMessageStreamsByFilter(new Whitelist(topicRegex), 1);

Related

Kafka consumer-group liveness empty topic partitions

Following up on this question, I would like to understand the semantics between consumer groups and offset expiry. In general, I'm curious how the Kafka protocol determines that some specific offset (for a consumer-group, topic, and partition combination) has expired. Is it based on periodic commits from consumers that are part of the group protocol, or does the offset tick get applied only after all consumers are deemed dead/closed?
I'm thinking this could have repercussions when dealing with topic partitions to which data isn't produced frequently. In my case, we have a consumer group reading from a fairly idle topic (not much data produced). Since the consumer group doesn't periodically commit any offsets, can we ever be in danger of losing previously committed offsets? For example, when some unforeseen rebalance happens, the topic partitions could get reassigned with lost offset commits, and this could cause the consumer to read data from the earliest (configured auto.offset.reset) point.
For user topics, offset expiry / topic retention is completely decoupled from consumer-group offsets. Segments do not "reopen" when a consumer accesses them.
At a minimum, segment.bytes, retention.ms (or retention.minutes / retention.hours), and retention.bytes all determine when log segments get deleted.
For the internal __consumer_offsets topic, offsets.retention.minutes controls when committed offsets are deleted (also in coordination with that topic's segment.bytes).
Old, closed segments are removed periodically by the broker's log-cleanup threads, not by the consumers. If a consumer is lagging considerably and requests offsets from a segment that has already been deleted, auto.offset.reset gets applied.
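For reference, a small sketch that reads those retention-related topic configs with the Java AdminClient (available in newer client versions; the bootstrap server and topic name here are assumptions):
// Inside a method declared to throw Exception; imports from java.util,
// org.apache.kafka.clients.admin, and org.apache.kafka.common.config assumed.
Properties props = new Properties();
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
try (AdminClient admin = AdminClient.create(props)) {
    ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic"); // assumed topic
    Config config = admin.describeConfigs(Collections.singleton(topic)).all().get().get(topic);
    for (String name : Arrays.asList("segment.bytes", "retention.ms", "retention.bytes")) {
        System.out.println(name + " = " + config.get(name).value());
    }
}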

Does Kafka distinguish between consumed offset and committed offset?

From what I understand, a consumer reads messages off a particular topic, and the consumer client periodically commits the offset.
So if for some reason the consumer fails on a particular message, that offset won't be committed and you can then go back and reprocess the message.
Is there anything that tracks the offset you just consumed and the offset you then commit?
Does Kafka distinguish between the consumed offset and the committed offset?
Yes, there is a big difference. The consumed offset is managed by the consumer in such a way that the consumer will fetch subsequent messages out of a topic partition.
The consumer can (but does not have to) commit offsets, either automatically or by calling the commit API. The information is stored in a Kafka-internal topic called __consumer_offsets, which stores the committed offset per ConsumerGroup, Topic, and Partition. It is used when the client is restarted or when a new consumer joins/leaves the ConsumerGroup.
Just keep in mind that if your client does not commit offset n but later commits offset n+1, for Kafka it makes no difference compared to committing both offsets.
Edit: More details on consumed and committed offsets can be found in the KafkaConsumer JavaDocs under "Offsets and Consumer Position":
Kafka maintains a numerical offset for each record in a partition. This offset acts as a unique identifier of a record within that partition, and also denotes the position of the consumer in the partition. For example, a consumer which is at position 5 has consumed records with offsets 0 through 4 and will next receive the record with offset 5. There are actually two notions of position relevant to the user of the consumer:
The position of the consumer gives the offset of the next record that will be given out. It will be one larger than the highest offset the consumer has seen in that partition. It automatically advances every time the consumer receives messages in a call to poll(Duration).
The committed position is the last offset that has been stored securely. Should the process fail and restart, this is the offset that the consumer will recover to. The consumer can either automatically commit offsets periodically; or it can choose to control this committed position manually by calling one of the commit APIs (e.g. commitSync and commitAsync).
This distinction gives the consumer control over when a record is considered consumed. It is discussed in further detail below.
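To illustrate the n vs. n+1 point, a hedged sketch (the topic name, partition, and offsets are assumptions; a KafkaConsumer named consumer is in scope). Committing only n + 1 leaves exactly the same stored offset as committing n and then n + 1:
long n = 41L; // example offset that was processed but never committed on its own
TopicPartition tp = new TopicPartition("my-topic", 0);
// Committing n + 1 marks everything up to and including offset n as consumed,
// even though offset n itself was never committed separately.
consumer.commitSync(Collections.singletonMap(tp, new OffsetAndMetadata(n + 1)));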

Kafka, consumer offset and multiple async commits

I'm trying to understand how Kafka handles the situation where multiple manual commits are issued by the consumer.
As a thought experiment, assume a single topic/partition with a single consumer. I publish two messages to this topic; they are processed asynchronously by the consumer, and the consumer does a manual commit after each message's processing completes. Now if message 1 completes first, followed by message 2, I would expect the broker to store the offset at 2. What happens in the reverse scenario? Would the broker now set the offset back to 1 from 2, or is there logic preventing the offset from decreasing?
From reading the docs, it appears that the topic 'position' is defined as the max committed offset + 1, which would imply that Kafka is invariant to the order in which the messages are committed. But it is unclear to me what happens when a consumer disconnects and reconnects to the broker: will it continue from the max committed offset or from the latest committed offset?
Thanks
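For concreteness, a sketch of the scenario with commitAsync (the topic, partition, and offsets are assumptions; a KafkaConsumer named consumer is in scope). As far as I know, the broker simply stores the last offset value it receives and does not enforce monotonicity:
TopicPartition tp = new TopicPartition("my-topic", 0);
// Message 2 finishes processing first and its offset is committed...
consumer.commitAsync(Collections.singletonMap(tp, new OffsetAndMetadata(2L)), null);
// ...then message 1 finishes and commits a smaller offset. If this request
// reaches the broker later, the stored offset moves back from 2 to 1.
consumer.commitAsync(Collections.singletonMap(tp, new OffsetAndMetadata(1L)), null);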

Does a Kafka Consumer receive a list of offsets first, before receiving the bytes/data?

I'm quite new to Apache Kafka and I'm currently reading Learning Apache Kafka, 2nd ed. (2015). Chapter 3, in the paragraph "Kafka Design fundamentals", says the following:
Consumers always consume messages from a particular partition sequentially and also acknowledge the message offset. This acknowledgement implies that the consumer has consumed all prior messages. Consumers issue an asynchronous pull request containing the offset of the message to be consumed to the broker and get the buffer of bytes.
I'm a bit thrown off by the word 'acknowledge'. Do I understand correctly that Kafka sends the offsets first, and that the consumer then uses this list of offsets to request the data it has not consumed yet?
Thanks in advance,
Nick
On startup, the KafkaConsumer issues an offset lookup request to the brokers for the specific consumer group that was configured on this consumer. If valid offsets are returned, those are used. Otherwise, the consumer uses an initial offset according to the auto.offset.reset parameter.
Afterwards, offsets are maintained mainly in memory within the consumer. Each poll() sends the current offset to the broker, and when the broker replies the consumer updates its in-memory offsets.
Additionally, the in-memory offsets are committed/acked to the broker from time to time. This can happen automatically within poll() if auto-commit is enabled, or commitSync()/commitAsync() must be called explicitly to send the offsets to the broker for reliable storage.
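As an illustration, a minimal consumer loop showing both commit styles (the bootstrap server, group id, and topic name are assumptions):
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CommitStylesExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                // assumed group id
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // only used when no valid committed offset exists
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");   // "true" would commit periodically inside poll()
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // assumed topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // process record ...
                }
                consumer.commitSync(); // explicit commit; drop this line when auto-commit is enabled
            }
        }
    }
}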