Kafka assigns an offset to each message. Say I produce 5 messages; their offsets will be 1 to 5.
But in a transactional producer, say I produce 5 messages and commit, then 5 messages and abort, then 5 messages and commit.
Will the last 5 committed messages have offsets from 6 to 10 or from 11 to 15?
What if I neither abort nor commit? Will the messages still be posted?
How does Kafka ignore offsets that are not committed, given that Kafka's commit logs are offset based? Does it use the transaction commit log for a transactional consumer to commit offsets and return the last stable offset? Or does that come from the __transaction_state topic which maintains the offsets?
The last 5 messages have offsets 11 to 15. When consuming with isolation.level=read_committed, the consumer will "jump" from offset 6 to 11.
If you don't commit or abort the transaction, it will automatically be timed out and aborted once the transaction timeout (transaction.timeout.ms, capped by the broker-side transaction.max.timeout.ms) has elapsed.
Along with the message data, Kafka stores a bunch of metadata and can identify, for each message, whether it has been committed. Committing offsets is the same as writing to a partition (the only difference is that it's done automatically by Kafka in the internal topic __consumer_offsets), so it works the same way for offsets: offsets added via sendOffsetsToTransaction() that were aborted or never committed are automatically skipped.
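To make this concrete, here is a minimal sketch, assuming a single-partition topic named demo-topic (the topic name, transactional.id, and group id are made up for illustration). It commits one batch of 5 messages, aborts a second, commits a third, and then reads the topic with read_committed:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ReadCommittedDemo {
    public static void main(String[] args) {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("transactional.id", "demo-tx");       // enables transactions for this producer
        p.put("transaction.timeout.ms", "60000");   // coordinator aborts the transaction after this

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            producer.initTransactions();

            producer.beginTransaction();            // batch 1: committed
            for (int i = 1; i <= 5; i++) producer.send(new ProducerRecord<>("demo-topic", "m" + i));
            producer.commitTransaction();

            producer.beginTransaction();            // batch 2: aborted, but still occupies offsets
            for (int i = 6; i <= 10; i++) producer.send(new ProducerRecord<>("demo-topic", "m" + i));
            producer.abortTransaction();

            producer.beginTransaction();            // batch 3: committed
            for (int i = 11; i <= 15; i++) producer.send(new ProducerRecord<>("demo-topic", "m" + i));
            producer.commitTransaction();
        }

        Properties c = new Properties();
        c.put("bootstrap.servers", "localhost:9092");
        c.put("group.id", "demo-group");
        c.put("auto.offset.reset", "earliest");
        c.put("isolation.level", "read_committed"); // skip records from aborted transactions
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(List.of("demo-topic"));
            for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(5))) {
                // offsets of the aborted batch never show up here: expect a gap in r.offset()
                System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
            }
        }
    }
}

(One detail the simplified numbers above gloss over: each commit or abort also writes a control record to the partition, so the exact offsets shift by one per transaction; the visible effect is the same, the consumer jumps over the aborted batch.)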
As mentioned in another of your questions, I recommend having a look at the KIP that added exactly-once semantics to Kafka. It details all these mechanics and will help you get a better understanding: https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging
I'm running a program that starts with a message in the topic, consumes it, processes it, commits the next offset, and publishes a new message to the same topic, all in a transactional fashion. I have the following (simplified) trace:
Fetch READ_COMMITTED at offset 20 for partition test-topic-0
processing message at offset 20
Committed offset 21 for partition test-topic-0
Sending PRODUCE
COMMITTING_TRANSACTION
Fetch READ_COMMITTED at offset 22 for partition test-topic-0
processing message at offset 22 <==== first time
...rebalance...
Setting offset for partition test-topic-0 to the committed offset FetchPosition{offset=21
Committed offset 23 for partition test-topic-0
Sending PRODUCE
COMMITTING_TRANSACTION
Fetch READ_COMMITTED at offset 24 for partition test-topic-0
stale fetch response for partition test-topic-0 since its offset 24 does not match the expected offset FetchPosition{offset=21
Fetch READ_COMMITTED at offset 21 for partition test-topic-0
processing message at offset 22 <==== second time
As a result of this, I process message "22" twice. Is it expected for Kafka to just rewind the consumer offset to before the committed offset? Does the ordering of the log look right? I can update the question with the full log if necessary, but I don't think there is anything useful there.
Looks like a rebalance occurred before the producer could complete the transaction. Would be helpful to see the code / configs you're using / version of Kafka.
The transactional consume-process-produce pattern requires the producer to do a couple of different things when processing a batch of records (a full sketch follows these steps):
producer.beginTransaction() - this method guarantees that everything produced from the time it was called until the transaction is aborted/committed is part of a single transaction.
producer.send(producerRecord) - for each message you process in the batch.
producer.sendOffsetsToTransaction( Map<TopicPartition, OffsetAndMetadata> offsetsToCommit, consumer.groupMetadata() ) - call this once you've gone through the batch; it commits the consumed offsets as part of the transaction. Note that committing offsets any other way will not provide transactional guarantees.
Once all records from the batch have been produced and you have committed the offsets as part of the transaction, you finally commit the transaction and seal the deal - producer.commitTransaction()
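Putting the four steps together, a minimal consume-process-produce loop might look like this (the bootstrap server, topic names, group id, transactional.id, and the toUpperCase() processing are placeholders, not anything from the question):

import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class ConsumeProcessProduce {
    public static void main(String[] args) {
        Properties cp = new Properties();
        cp.put("bootstrap.servers", "localhost:9092");
        cp.put("group.id", "cpp-group");
        cp.put("enable.auto.commit", "false");       // offsets are committed through the transaction instead
        cp.put("isolation.level", "read_committed");
        cp.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cp.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties pp = new Properties();
        pp.put("bootstrap.servers", "localhost:9092");
        pp.put("transactional.id", "cpp-tx");
        pp.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        pp.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pp)) {

            producer.initTransactions();
            consumer.subscribe(List.of("input-topic"));

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) continue;

                producer.beginTransaction();
                try {
                    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                    for (ConsumerRecord<String, String> r : records) {
                        // produce the processed result...
                        producer.send(new ProducerRecord<>("output-topic", r.key(), r.value().toUpperCase()));
                        // ...and remember offset + 1 as the position to commit for this partition
                        offsets.put(new TopicPartition(r.topic(), r.partition()),
                                new OffsetAndMetadata(r.offset() + 1));
                    }
                    producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                    producer.commitTransaction();    // output records and input offsets become visible atomically
                } catch (Exception e) {
                    // simplified: fatal errors such as ProducerFencedException require closing the producer instead
                    producer.abortTransaction();     // offsets were not committed, so the batch will be re-read
                }
            }
        }
    }
}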
With that said, this should explain why it rejected message 24 and reprocessed message 22. I believe message 23 did not get to the last producer step, but I would need to see the code to be sure. From the Kafka Definitive Guide:
In order to guarantee that messages will be read in order, read_committed mode will not return messages that were produced after the point when the first still-open transaction began (known as the Last Stable Offset, or LSO). Those messages will be withheld until that transaction is committed or aborted by the producer, or until they reach transaction.timeout.ms (15 min default) and are aborted by the broker.
And
The two main mistakes (with transactions) are assuming that exactly-once guarantees apply to actions other than producing to Kafka, and that consumers always read entire transactions and have information about transaction boundaries.
I'm a newbie to Kafka and trying to figure out how it works.
If I'm right, a Kafka broker will send a bunch of messages in one consumer poll. In other words, when the consumer invokes poll, it gets a bunch of messages and then processes them one by one.
Now, let's assume that there are 100 messages in the broker, numbered 1 to 100. When the consumer invokes poll, 10 messages are sent together: 1-10, 11-20, and so on. At the same time, the consumer automatically commits the committed offset to the broker every 5 seconds.
Say that at some moment, the consumer is sending the committed offset while it is processing the 15th message.
In this case, I don't know which number is the committed offset, 11 or 14?
If it's 11, it means that if the broker needs to resend for some reason, it will resend the bunch of messages from 11 to 20, but if it's 14, it means that it will resend the bunch of messages from 14 to 23.
"In this case, I don't know which number is the committed offset, 11 or 14?"
Auto commit always commits the highest offset that was fetched during a poll. In your case it would commit back 20, regardless of which offset is currently being processed by the client.
I guess this example shows that enabling auto commit comes with some downsides. I recommend taking control of the committed offsets yourself by disabling auto commit and only committing offsets after all messages have been processed successfully, as sketched below. However, there are use cases where you can simply enable auto commit without ever needing to think about it.
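A minimal sketch of that pattern (the bootstrap server, group id, and topic name are made up), with auto commit disabled and one synchronous commit after the whole batch has been processed:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "manual-commit-group");
        props.put("enable.auto.commit", "false");   // take control of the committed offsets
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // hypothetical per-record processing
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
                // commit only after the whole batch was processed successfully;
                // this commits the highest polled offset + 1 for each partition
                if (!records.isEmpty()) consumer.commitSync();
            }
        }
    }
}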
"If it's 11, it means that if the broker needs to resend for some reason, it will resend the bunch of messages from 11 to 20, but if it's 14, it means that it will resend the bunch of messages from 14 to 23."
There is a difference between a consumed and a committed offset. Committed offsets only become relevant when you restart your application or when consumers join or leave your client's consumer group. Otherwise, the poll method does not care much about the committed offset while the application is running. I have written some more details on the difference between committed and consumed offsets in another answer.
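To see the difference programmatically, the consumer API exposes both values; a small sketch (assuming an already-configured consumer that has the partition assigned):

import java.util.Set;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class OffsetInspector {
    static void printOffsets(KafkaConsumer<String, String> consumer, TopicPartition tp) {
        // consumed position: the offset of the next record poll() will return
        long position = consumer.position(tp);
        // committed offset: where the group would resume after a restart or rebalance
        OffsetAndMetadata committed = consumer.committed(Set.of(tp)).get(tp);
        System.out.printf("position=%d committed=%s%n",
                position, committed == null ? "none" : committed.offset());
    }
}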
I have a use case regarding consuming records with a Kafka consumer.
For instance,
I have 1 topic with 1 partition. Currently, it has 10 records, and while the first 10 records are being consumed, another 10 records are written to the partition.
myConsumer polls for the first time and returns the first 10 records, say records 0-9.
It processed all the records successfully.
It invoked commitAsync() to Kafka to commit the last offset.
The commit response is still being processed; it can be a success or a failure.
But, since it is asynchronous, the consumer continues to poll for the next batch.
Now, how does either Kafka or the consumer's poll know that it has to read from the 10th position? After all, the commitAsync request has not yet completed.
Please help me in understanding this concept.
The committed offset tells the broker that the consumer has processed the corresponding messages successfully. The consumer itself is aware of its own progress (except at consumer start-up, where it gets its last committed offset from the broker).
At step 5 in your description, the offset commit is in progress. So:
The broker does not know that records 0-9 have been processed.
The consumer itself has read the messages, so it knows that it has read messages 0-9, and it will know to read from the 10th onwards next.
Possible Scenarios
Let's say the commit fails for (0-9). Your next batch, say (10-15), is processed and committed successfully; then there is no harm done, since we mark to the broker that processing up to 15 is complete.
Let's say the commit fails for (0-9). Your next batch, (10-15), is processed, and before committing, the consumer goes down. When your consumer is brought back up, it takes its state from the broker (which has a commit for neither batch). So it will start reading from the 0th message.
You can come up with several other scenarios as well. I guess the bottom line is that the importance of the commit comes into the picture when your consumer is restarted for whatever reason and has to get its last processed offset from the Kafka broker.
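A common pattern matching this answer (the bootstrap server, group id, and topic name are illustrative) is to commit asynchronously on the hot path and do one final synchronous commit on shutdown, precisely because poll() is driven by the consumer's in-memory position rather than by the committed offset:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AsyncCommitDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "async-commit-group");
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(List.of("my-topic"));
        try {
            while (true) {
                // poll() advances from the in-memory position, so it keeps moving
                // forward even while earlier commitAsync() calls are still in flight
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
                }
                consumer.commitAsync();   // non-blocking; a failure only matters after a restart or rebalance
            }
        } finally {
            try {
                consumer.commitSync();    // one last blocking commit so a restart resumes where we stopped
            } finally {
                consumer.close();
            }
        }
    }
}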
I am using the exactly-once semantics provided by Kafka, so my producer writes messages within a transaction. While my producer was sending the 100th message, right between send() and commitTransaction(), I killed the producer process.
Then I read the last few uncommitted messages in my topic:
Consumer record offset - message number
0 - 1
2 - 2
196 - 99 <- Last committed message by producer
198 - 100 <- 100th message was sent but not committed
Now, when I run a consumer with the read_committed isolation level, it reads exactly messages 1-99. But for that I had to read the entire topic. Eventually, I am going to store millions of messages in a topic, so reading the entire topic is not preferable.
Also, assume that the consumer is polling messages from a broker and there is some communication issue between the Kafka broker and the consumer. Say the last message read by the consumer is at offset#50. This means that I could not reliably identify the last committed message in the topic.
I tried other methods, i.e.:
seekToEnd() - took me to offset#200
endOffsets() - took me to offset#200
Is there a way to reliably get the last message that was committed by the Kafka producer? (In my case, offset#196.)
I'm trying to understand how Kafka handles the situation where multiple manual commits are issued by the consumer.
As a thought experiment, assume a single topic/partition with a single consumer. I publish two messages to this topic; they are processed asynchronously by the consumer, and the consumer does a manual commit after each message's processing completes. Now, if message 1 completes first, followed by message 2, I would expect the broker to store the offset at 2. What happens in the reverse scenario? Would the broker now set the offset back to 1 from 2, or is there logic preventing the offset from decreasing?
From reading the docs, it appears that the topic 'position' is defined as the max committed offset + 1, which would imply that Kafka is invariant to the order in which messages are committed. But it is unclear to me what happens when a consumer disconnects and reconnects to the broker: will it continue from the max committed offset or the latest committed offset?
Thanks