Kafka, consumer offset and multiple async commits - apache-kafka

I'm trying to understand how Kafka handles the situation where multiple manual commits are issued by the consumer.
As a thought experiment assume a single topic/partition with a single consumer. I publish two messages to this topic and they are processed async by the consumer and the consumer does a manual commit after message processing completes. Now if message 1 completes first followed by message 2, I would expect the broker to store the offset at 2. What happens in the reverse scenario? Would the broker now set the offset back to 1 from 2, or is there logic preventing the offset from decreasing?
From reading the docs it appears that the topic 'position' is defined as the max committed offset +1, which would imply that Kafka is invariant to the order the messages are committed in. But it is unclear to me what happens in the case where a consumer disconnects and reconnects to the broker, will it continue from the max committed offset or the latest committed offset?
Thanks

Related

Where is offset of consumer stored in Kafka [duplicate]

From what I understand a consumer reads messages off a particular topic, and the consumer client will periodically commit the offset.
So if for some reason the consumer fails a particular message, that offset won't be committed and you can then go back and reprocess he message.
Is there anything that tracks the offset you just consumed and the offset you then commit?
Does kafka distinguish between consumed offset and commited offset?
Yes, there is a big difference. The consumed offset is managed by the consumer in such a way that the consumer will fetch subsequent messages out of a topic partition.
The consumer can (but it is not a must) commit a message either automatically or by calling the commit API. The information is stored in a Kafka internal topic called __consumer_offsets and stores the committed offset based on ConsumerGroup, Topic and Partition. It will be used if the client is getting restartet or a new consumer joins/leaves the ConsumerGroup.
Just keep in mind that if your client does not committ offset n but later committs offset n+1, for Kafka it won't make a different to the case when you commit both offsets.
Edit: More details on consumed and committed offsets can be found in the JavaDocs of KafkaConsumer on Offsets and Consumer Position:
Kafka maintains a numerical offset for each record in a partition. This offset acts as a unique identifier of a record within that partition, and also denotes the position of the consumer in the partition. For example, a consumer which is at position 5 has consumed records with offsets 0 through 4 and will next receive the record with offset 5. There are actually two notions of position relevant to the user of the consumer:
The position of the consumer gives the offset of the next record that will be given out. It will be one larger than the highest offset the consumer has seen in that partition. It automatically advances every time the consumer receives messages in a call to poll(Duration).
The committed position is the last offset that has been stored securely. Should the process fail and restart, this is the offset that the consumer will recover to. The consumer can either automatically commit offsets periodically; or it can choose to control this committed position manually by calling one of the commit APIs (e.g. commitSync and commitAsync).
This distinction gives the consumer control over when a record is considered consumed. It is discussed in further detail below.

Does kafka distinguish between consumed offset and commited offset?

From what I understand a consumer reads messages off a particular topic, and the consumer client will periodically commit the offset.
So if for some reason the consumer fails a particular message, that offset won't be committed and you can then go back and reprocess he message.
Is there anything that tracks the offset you just consumed and the offset you then commit?
Does kafka distinguish between consumed offset and commited offset?
Yes, there is a big difference. The consumed offset is managed by the consumer in such a way that the consumer will fetch subsequent messages out of a topic partition.
The consumer can (but it is not a must) commit a message either automatically or by calling the commit API. The information is stored in a Kafka internal topic called __consumer_offsets and stores the committed offset based on ConsumerGroup, Topic and Partition. It will be used if the client is getting restartet or a new consumer joins/leaves the ConsumerGroup.
Just keep in mind that if your client does not committ offset n but later committs offset n+1, for Kafka it won't make a different to the case when you commit both offsets.
Edit: More details on consumed and committed offsets can be found in the JavaDocs of KafkaConsumer on Offsets and Consumer Position:
Kafka maintains a numerical offset for each record in a partition. This offset acts as a unique identifier of a record within that partition, and also denotes the position of the consumer in the partition. For example, a consumer which is at position 5 has consumed records with offsets 0 through 4 and will next receive the record with offset 5. There are actually two notions of position relevant to the user of the consumer:
The position of the consumer gives the offset of the next record that will be given out. It will be one larger than the highest offset the consumer has seen in that partition. It automatically advances every time the consumer receives messages in a call to poll(Duration).
The committed position is the last offset that has been stored securely. Should the process fail and restart, this is the offset that the consumer will recover to. The consumer can either automatically commit offsets periodically; or it can choose to control this committed position manually by calling one of the commit APIs (e.g. commitSync and commitAsync).
This distinction gives the consumer control over when a record is considered consumed. It is discussed in further detail below.

How does Kafka provides next batch of records to poll when commitAsync gets failed in committing offset

I have a use-case regarding consuming records by Kafka consumer.
For instance,
I have 1 topic which has 1 partition. Currently, it has 10 records and while consuming the first 10 records, another 10 records are written to the partition.
myConsumer polls the first time and returns the first 10 records say 0 - 9 records.
It processed all the records successfully.
It invoked commitAsync() to Kafka to commit the last offset.
Commit response is in processing. It can be a success or a failure.
But, since it is an asynchronous mode, it continues to poll for the next batch.
Now, how does either Kafka or consumer poll know that it has to read from the 10th position? Because the commitAsync request has not yet completed.
Please help me in understanding this concept.
Commit Offset tells the broker that the consumer has processed the corresponding message successfully. The consumer itself would be aware of its progress (except for start of consumer where it gets its last committed offset from broker).
At step-5 in your description, the commit offset is in progress. So:
Broker does not know that 0-9 records have been processed
Consumer itself has the read the messages and so it knows that is has read 0-9 messages. So it will know to read 10th onwards next.
Possible Scenarios
Lets say the commit fails for (0-9). Your next batch, say (10-15) is processed and committed succesfully then there is no harm done. Since we mark to the broker that processing till 15 is complete.
Lets say the commit fails for (0-9). Your next batch, (10-15) is processed and before committing, the consumer goes down. When your consumer is brought back up, it takes its state from broker (which does not have commit for either of the batch). So it will start reading from 0th message.
You can come up with several other scenarios as well. I guess the bottom line is, the importance of commit will come into picture when your consumer is restarted for whatever reason and it has get its last processed offset from kafka broker.

Kafka manual offset managment issue

While implementing manual offset management, I encountered the following issue: (using 0.9)
In order to manage the offsets manually, for each consumed record, I retrieve the current offset of the record and commit the new offset (currentOffset + 1, since the offset reset strategy is "latest").
When a new consumer group is created, it has no explicit offsets (offset is "unknown"), therefore, if it didn't consume messages from all existing partitions before it is stopped, it will have committed offsets for only part of the partitions (the ones the consumer got messages from), while the offset for the rest of the partitions will still be "unknown".
When the consumer is started again, it gets only some of the messages that were produced while it was down (only the ones from the partitions that had a committed offset), the messages from partitions with "unknown" offset are lost and will never be consumed due to the offset reset strategy.
Since it's unacceptable in my case to miss any messages once a consumer group is created, I'd like to explicitly commit an offset for each partition before starting consumption.
To do that I found two options:
Use low level consumer to send an offset request.
Use high level consumer, call consumer.poll(0) (to trigger the assignment), then call consumer.assignment(), and for each TopicPartition call consumer.committed(topicPartition); consumer.seekToEnd(topicPartition); consumer.position(topicPartition) and eventually commit all offsets.
Both are more complex and noisy than I'd expect (I'd expect a simpler API I could use to get the log end position for all partitions assigned to a consumer).
Any thoughts or ideas for a better implementation would be appreciated.
10x.
Using consumer API totally depends upon where are you committing offsets.
If your offsets are getting stored in Kafka broker then definitely
you should use high-level consumer API it will provide you with more control
over offsets.
If you are keeping offsets in zookeeper than you can use any old consumer API like
List< KafkaStream < byte[], byte[] > > streams
=consumer.createMessageStreamsByFilter(new Whitelist(topicRegex),1)

Does a Kafka Consumer receive a list of offsets first, before receiving the bytes/data?

I'm quite new to Apache Kafka and I'm currently reading Learning Apache Kafka, 2ed, (2015). Chapter 3, paragraph Kafka Design fundamentals says the following:
Consumers always consume messages from a particular partition sequentially and also acknowledge the message offset. This acknowledgement implies that the consumer has consumed all prior messages. Consumers issue an asynchronous pull request containing the offset of the message to be consumed to the broker and get the buffer of bytes.
I'm a bit thrown off by the word 'acknowledge'. Do I understand it correctly that Kafka sends the offset first and then the consumer uses the list of offsets to pull request the data it has not consumed yet?
Thanks in advance,
Nick
On startup, KafkaConsumer issues a offset lookup request to the brokers for the specific consumer group that was configured on this consumer. If valid offsets are returned those are used. Otherwise, the consumer uses an initial offset according to auto.offset.reset parameter.
Afterwards, offsets are maintained mainly in-memory within the consumer. Each poll() sends the current offset to the broker and on broker reply consumer updates the in-memory offsets.
Additionally, in-memory offset are committed/acked to the broker from time to time. This can happen automatically within poll() if auto commit is enabled, or commit() must be called explicitly to send offsets to the broker for reliably storing them.