How does Kafka handle the committed offset while enabling auto commit - apache-kafka

I'm a newbie on Kafka and trying to figure out how it works.
If I'm right, a Kafka broker will send a bunch of messages in one poll of consumer. In other words, when the consumer invokes the function poll, it will get a bunch of messages and then the consumer will process these messages one by one.
Now, let's assume that there are 100 messages in the broker, from 1 to 100. When the consumer invokes the function poll, 10 messages are sent together: 1 - 10, 11 - 20... At the same time, the consumer will commit automatically the committed offset to the broker every 5 seconds.
Say that at some moment, the consumer is sending the committed offset while it is processing the 15th message.
In this case, I don't know which number is the committed offset, 11 or 14?
If it's 11, it means that if the broker needs to resend for some reason, it will resend the bunch of messages from 11 to 20, but if it's 14, it means that it will resend the bunch of messages from 14 to 23.

"In this case, I don't know which number is the committed offset, 11 or 14?"
Auto commit will always commit the highest offset that was fetched during a poll. In your case it would commit back 20, independent of which offset is currently being processed by the client.
I guess this example shows you that enabling auto commit comes with some downsides. I recommend taking control of the committed offsets yourself by disabling auto commit and only committing offsets after the processing of all messages was successful. However, there are use cases where you can simply enable auto commit without ever needing to think about it.
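For illustration, a minimal sketch of that manual-commit pattern with the Java consumer; the broker address, topic name, group id and the process() step are placeholders, not anything from the question:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("group.id", "my-group");                   // placeholder group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");            // take control of committing yourself

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));   // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);                          // hypothetical processing step
                }
                consumer.commitSync();                        // commit only after the whole batch succeeded
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.println(record.offset() + ": " + record.value());
    }
}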
"If it's 11, it means that if the broker needs to resend for some reason, it will resend the bunch of messages from 11 to 20, but if it's 14, it means that it will resend the bunch of messages from 14 to 23."
There is a difference between a consumed and a committed offset. Committed offsets only become relevant when you restart your application or when consumers join or leave the consumer group of your client. Otherwise, the poll method does not care much about the committed offset while the application is running. I have written some more details on the difference between committed and consumed offsets in another answer.
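One way to see that difference at runtime is to compare the consumer's in-memory position (the next offset poll() will fetch) with what is committed on the broker. A small fragment, reusing the consumer from the sketch above and assuming partition 0 of my-topic is assigned to it:

import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

TopicPartition tp = new TopicPartition("my-topic", 0);        // placeholder partition
long consumed = consumer.position(tp);                        // next offset this consumer will read (tracked in memory)
OffsetAndMetadata committed = consumer.committed(tp);         // last offset stored for the group on the broker
System.out.println("consumed position: " + consumed);
System.out.println("committed offset: " + (committed == null ? "none yet" : committed.offset()));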

Related

How does Kafka provide the next batch of records to poll when commitAsync fails in committing the offset

I have a use-case regarding consuming records by Kafka consumer.
For instance,
I have 1 topic which has 1 partition. Currently, it has 10 records and while consuming the first 10 records, another 10 records are written to the partition.
myConsumer polls for the first time and returns the first 10 records, say records 0 - 9.
It processed all the records successfully.
It invoked commitAsync() to Kafka to commit the last offset.
Commit response is in processing. It can be a success or a failure.
But, since it is an asynchronous mode, it continues to poll for the next batch.
Now, how does either Kafka or consumer poll know that it has to read from the 10th position? Because the commitAsync request has not yet completed.
Please help me in understanding this concept.
Committing an offset tells the broker that the consumer has processed the corresponding message successfully. The consumer itself is aware of its own progress (except at consumer start, where it gets its last committed offset from the broker).
At step 5 in your description, the offset commit is in progress. So:
The broker does not know that records 0-9 have been processed.
The consumer itself has read the messages, so it knows that it has read messages 0-9. So it will know to read from the 10th onwards next.
Possible Scenarios
Let's say the commit fails for (0-9). If your next batch, say (10-15), is processed and committed successfully, then there is no harm done, since we mark to the broker that processing up to 15 is complete.
Let's say the commit fails for (0-9). Your next batch (10-15) is processed, and before committing, the consumer goes down. When your consumer is brought back up, it takes its state from the broker (which does not have a commit for either batch). So it will start reading from the 0th message.
You can come up with several other scenarios as well. I guess the bottom line is that the importance of the commit comes into the picture when your consumer is restarted for whatever reason and has to get its last processed offset from the Kafka broker.
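As a rough sketch of steps 3-6, reusing the consumer and process() placeholders from the first sketch above: the next poll() simply continues from the position the consumer tracks in memory, whether or not the asynchronous commit has completed yet.

ConsumerRecords<String, String> first = consumer.poll(Duration.ofMillis(500));   // say offsets 0-9
for (ConsumerRecord<String, String> record : first) {
    process(record);
}
consumer.commitAsync((offsets, exception) -> {              // non-blocking; the outcome arrives later
    if (exception != null) {
        System.err.println("commit of " + offsets + " failed: " + exception);
    }
});
// does not wait for the commit above; reads on from the in-memory position (offset 10 onwards)
ConsumerRecords<String, String> second = consumer.poll(Duration.ofMillis(500));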

Kafka, consumer offset and multiple async commits

I'm trying to understand how Kafka handles the situation where multiple manual commits are issued by the consumer.
As a thought experiment assume a single topic/partition with a single consumer. I publish two messages to this topic and they are processed async by the consumer and the consumer does a manual commit after message processing completes. Now if message 1 completes first followed by message 2, I would expect the broker to store the offset at 2. What happens in the reverse scenario? Would the broker now set the offset back to 1 from 2, or is there logic preventing the offset from decreasing?
From reading the docs it appears that the topic 'position' is defined as the max committed offset +1, which would imply that Kafka is invariant to the order the messages are committed in. But it is unclear to me what happens in the case where a consumer disconnects and reconnects to the broker, will it continue from the max committed offset or the latest committed offset?
Thanks

Kafka Transactional Producer & Consumer

Kafka generates an offset for each message. Say I produce 5 messages and the offsets will be from 1 to 5.
But with a transactional producer, say I produce 5 messages and commit, then 5 messages but abort, and then 5 messages and commit.
So, will the last committed 5 messages have offsets from 6 to 10 or from 11 to 15?
What if I don't abort or don't commit? Will the messages still be posted?
How does Kafka ignore offsets which are not committed, given that Kafka commit logs are offset-based? Does it use the transaction commit log for a transactional consumer to commit offsets and return the last stable offset? Or is it from the __transaction_state topic, which maintains the offsets?
The last 5 messages have offsets 11 to 15. When consuming with isolation.level=read_committed, the consumer will "jump" from offset 6 to 11.
If you don't commit or abort the transaction, it will automatically be timed out (aborted) after transaction.timeout.ms has elapsed (this is capped by the broker's transaction.max.timeout.ms).
Along with the message data, Kafka stores a bunch of metadata and is able to identify for each message if it has been committed or not. As committing offsets is the same as writing to a partition (the only difference is that it's done automatically by Kafka in an internal topic __consumer_offsets) it works the same way for offsets. Offsets added via sendOffsetsToTransaction() that were aborted or not committed will automatically be skipped.
As mentioned in another of your questions, I recommend having a look at the KIP that added exactly-once semantics to Kafka. It details all these mechanics and will help you get a better understanding: https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging
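To make the offset arithmetic concrete, here is a sketch of the three transactions with the Java producer; the broker address, topic and transactional.id are placeholders, and the offsets in the comments follow the numbering used in the question:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");              // placeholder broker
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("transactional.id", "my-tx-producer");               // placeholder id

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
producer.initTransactions();

producer.beginTransaction();                                    // first committed batch: offsets 1-5
for (int i = 1; i <= 5; i++) producer.send(new ProducerRecord<>("my-topic", "committed-" + i));
producer.commitTransaction();

producer.beginTransaction();                                    // aborted batch still occupies offsets 6-10
for (int i = 1; i <= 5; i++) producer.send(new ProducerRecord<>("my-topic", "aborted-" + i));
producer.abortTransaction();

producer.beginTransaction();                                    // last committed batch: offsets 11-15
for (int i = 1; i <= 5; i++) producer.send(new ProducerRecord<>("my-topic", "committed-again-" + i));
producer.commitTransaction();
producer.close();

// consumer side: with isolation.level=read_committed the aborted offsets are skipped
// consumerProps.put("isolation.level", "read_committed");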

Kafka Consumer is getting few (not all) old messages (that was already processed earlier)

We have topics with retention set to 7 days (168 hours). Messages are consumed in real time as and when the producer sends the message. Everything is working as expected. However, recently on a production server, Devops accidentally changed the time zone from PST to EST as part of an OS patch.
After Kafka server restart, we saw few (not all of them, but random) old messages being consumed by the consumers. We asked Devops to change it back to PST and restart. Again the old messages re-appeared this weekend as well.
We have not seen this problem in lower environments (Dev, QA, Stage etc).
Kafka version: kafka_2.12-0.11.0.2
Any help is highly appreciated.
Adding more info... Recently our CentOS had a patch update and somehow the admins changed from the PST timezone to EST and started the Kafka servers... After that our consumers started seeing messages from offset 0. After debugging, I found the timezone change, and the admins changed back from EST to PST after 4 days. Our message producers were sending messages regularly before and after the timezone changes. After the timezone change from EST back to PST, the Kafka servers were restarted and I am seeing the below warning.
This log happened when we changed back from EST to PST: (server.log)
[2018-06-13 18:36:34,430] WARN Found a corrupted index file due to requirement failed: Corrupt index found, index file (/app/kafka_2.12-0.11.0.2/data/__consumer_offsets-21/00000000000000002076.index) has non-zero size but the last offset is 2076 which is no larger than the base offset 2076.}. deleting /app/kafka_2.12-0.11.0.2/data/__consumer_offsets-21/00000000000000002076.timeindex, /app/kafka_2.12-0.11.0.2/data/__consumer_offsets-21/00000000000000002076.index, and /app/kafka_2.12-0.11.0.2/data/__consumer_offsets-21/00000000000000002076.txnindex and rebuilding index... (kafka.log.Log)
We restarted consumers after 3 days of timezone change back from EST to PST and started seeing consumer messages with offset 0 again.
As of Kafka v2.3.0
You can set
"enable.auto.commit" : "true",// default is true as well
"auto.commit.interval.ms" : "1000"
This means that after every 1 second, a consumer is going to commit its offset to Kafka, or every time data is fetched from the specified topic it will commit the latest offset.
So once your Kafka consumer has started and 1 second has elapsed, it will never re-read the messages that were already received by the consumer and committed. This setting does not require the Kafka server to be restarted.
I think this is because you restart the program before you commit new offsets.
Managing offsets
For each consumer group, Kafka maintains the committed offset for each partition being consumed. When a consumer processes a message, it doesn't remove it from the partition. Instead, it just updates its current offset using a process called committing the offset.
If a consumer fails after processing a message but before committing its offset, the committed offset information will not reflect the processing of the message. This means that the message will be processed again by the next consumer in that group to be assigned the partition.
Committing offsets automatically
The easiest way to commit offsets is to let the Kafka consumer do it automatically. This is simple, but it does give less control than committing manually. By default, a consumer automatically commits offsets every 5 seconds, regardless of the progress the consumer is making towards processing the messages. In addition, when the consumer calls poll(), this also causes the latest offset returned from the previous call to poll() to be committed (because it's probably been processed).
If the committed offset overtakes the processing of the messages and there is a consumer failure, it's possible that some messages might not be processed. This is because processing restarts at the committed offset, which is later than the last message to be processed before the failure. For this reason, if reliability is more important than simplicity, it's usually best to commit offsets manually.
Committing offsets manually
If enable.auto.commit is set to false, the consumer commits its offsets manually. It can do this either synchronously or asynchronously. A common pattern is to commit the offset of the latest processed message based on a periodic timer. This pattern means that every message is processed at least once, but the committed offset never overtakes the progress of messages that are actively being processed. The frequency of the periodic timer controls the number of messages that can be reprocessed following a consumer failure. Messages are retrieved again from the last saved committed offset when the application restarts or when the group rebalances.
The committed offset is the offset of the messages from which processing is resumed. This is usually the offset of the most recently processed message plus one.
From this article, which I think is very helpful.
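As a small illustration of that manual pattern, reusing the consumer and a batch of records returned by poll() from the first sketch above, the offset committed for each partition is the last processed offset plus one:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

Map<TopicPartition, OffsetAndMetadata> toCommit = new HashMap<>();
for (TopicPartition tp : records.partitions()) {
    List<ConsumerRecord<String, String>> partitionRecords = records.records(tp);
    long lastProcessed = partitionRecords.get(partitionRecords.size() - 1).offset();
    toCommit.put(tp, new OffsetAndMetadata(lastProcessed + 1));   // resume from the next offset after a restart
}
consumer.commitSync(toCommit);                                    // or commitAsync(toCommit, callback)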

How to delete a specific number of lines from a Kafka topic by using Python or any inbuilt method?

I am facing a problem while using the consumer.poll() method. After fetching data using the poll() method, the consumer won't have any data to commit, so please help me to remove a specific number of lines from the Kafka topic.
You need to make sure that the data is fully processed before you commit it, to avoid "data loss" in case of consumer failure.
Thus, if you enable auto.commit, make sure that you process all data completely after a poll() before you issue the next poll() because each poll() implicitly commits all data from its previous poll().
If this is not possible, you should disable auto.commit and commit manually after the data got completely processed via consumer.commit(...). For this, keep in mind that you do not need to commit each message individually, and that a commit with offset X implicitly commits all messages with offsets < X (e.g., after processing the message at offset 5, you commit offset 6 -- the committed offset is not the last successfully processed message, but the next message you want to process). And a commit of offset 6 commits all messages with offsets 0 to 5. Thus, you should not commit offset 6 before all messages with smaller offsets got completely processed.