I have turned off auto commit in my Kafka consumer, so when processing fails for, say, three invalid messages, the manual ack is not sent and the lag increases to three.
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG
orders 0 35 38 3
After that, if a new valid message comes through and is processed successfully, it is manually acked, and the consumer group then looks like this:
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG
orders 0 39 39 0
Why does the consumer set the current offset to 39 when the messages at offsets 36, 37 and 38 were not successfully processed, and why are they never read again by the same consumer? Can anyone please explain this behavior? Thanks in advance!
In Kafka, consumers don't ack every single message. Instead they ack (commit) the offset of the last message they processed.
For example, if you commit offset 15, it implicitly means you've processed all messages from 0 up to 15. When committing 15, you also overwrite any previous commit, so afterwards you cannot tell whether 13 or 14 was committed before.
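As a minimal sketch with the plain Java client (broker address, topic and group id are just examples): the value handed to commitSync() is the offset of the next record to read, i.e. last processed + 1, and that single value covers everything before it.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.*;
    import org.apache.kafka.common.TopicPartition;

    public class CommitLastProcessed {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-consumer");
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("orders"));
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    // ... process the record here ...
                    // Commit the NEXT offset to read (last processed + 1). This single
                    // value marks this record and everything before it on the partition
                    // as processed, and it overwrites any older commit.
                    consumer.commitSync(Collections.singletonMap(
                            new TopicPartition(record.topic(), record.partition()),
                            new OffsetAndMetadata(record.offset() + 1)));
                }
            }
        }
    }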
I suggest you read the Consumer Position section in the docs, which goes over this concept.
Regarding reprocessing, Kafka offers a few options. When you hit a processing failure, you can retry the message before polling for and processing new records. Another option is to skip it as invalid and carry on (which is what you are currently doing).
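A sketch of that retry-then-skip pattern, reusing a consumer that is already subscribed like the one in the previous snippet (the retry count and the processing logic are placeholders):

    import java.time.Duration;
    import java.util.Collections;
    import org.apache.kafka.clients.consumer.*;
    import org.apache.kafka.common.TopicPartition;

    public class RetryThenSkip {
        static final int MAX_RETRIES = 3;

        // Poll loop for a consumer that is already subscribed to the topic.
        static void pollLoop(KafkaConsumer<String, String> consumer) {
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    boolean processed = false;
                    for (int attempt = 0; attempt < MAX_RETRIES && !processed; attempt++) {
                        try {
                            // ... process the record here ...
                            processed = true;
                        } catch (Exception e) {
                            // optionally back off before the next attempt
                        }
                    }
                    // Whether processing succeeded or the record is skipped as invalid,
                    // commit the next offset so the group does not stay stuck on it
                    // after a restart or rebalance.
                    consumer.commitSync(Collections.singletonMap(
                            new TopicPartition(record.topic(), record.partition()),
                            new OffsetAndMetadata(record.offset() + 1)));
                }
            }
        }
    }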
Alternatively, you could ensure the data is valid upfront by running a Streams job that pipes valid messages into a "checked" topic and forwards bad messages to a DLQ, then consume from this checked topic where you know you only have good messages. See: validation for kafka topic messages
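For illustration, a minimal Kafka Streams sketch of that pattern (the topic names, the application id and the isValid() check are placeholders):

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.*;
    import org.apache.kafka.streams.kstream.KStream;

    public class OrderValidation {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-validation");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> orders = builder.stream("orders");
            // Valid records go to the checked topic, everything else to the DLQ.
            orders.filter((key, value) -> isValid(value)).to("orders-checked");
            orders.filterNot((key, value) -> isValid(value)).to("orders-dlq");

            new KafkaStreams(builder.build(), props).start();
        }

        // Placeholder validation; replace with your own schema / constraint checks.
        static boolean isValid(String value) {
            return value != null && !value.isEmpty();
        }
    }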
Related
According to this, the log-end-offset values across the consumers in a consumer group for a topic should add up to the number of messages in that topic. I have a case where the log-end-offset adds up to twice the number of messages in the topic (it sums to 28 whereas there are only 14 messages in the topic). What are some potential explanations for this?
The current issue I am facing with this JDBC sink connector is that there is a bad message at offset 0: if the connector tries to process it, it fails due to a violated DB constraint. We were able to work around this by manually moving the connector's consumer offset so that it skips over the bad message. Then, seemingly at random months later, it tried to go back and process it even though nobody asked it to. The two issues seem related: something appears to be tricking the connector into thinking it needs to reprocess all of the messages in the topic, hence the log-end-offsets adding up to twice the number of messages in the topic.
We are on Confluent 5.3.3.
Hi, the offset always moves forward, even when old messages get pruned/deleted from the topic once the configurable retention time/size is reached. So in your case, having a log-end-offset of 28 while only 14 messages are actually still in the topic is a perfectly valid situation.
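If you want to check this for your own topic, here is a small sketch with the Java client (broker address and topic name are examples) that compares beginning and end offsets: the end offset only ever grows, while the beginning offset moves up as retention deletes old segments, so the difference is roughly what is still retained.

    import java.util.*;
    import org.apache.kafka.clients.consumer.*;
    import org.apache.kafka.common.TopicPartition;

    public class TopicOffsets {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                List<TopicPartition> partitions = new ArrayList<>();
                consumer.partitionsFor("orders").forEach(
                        p -> partitions.add(new TopicPartition(p.topic(), p.partition())));

                Map<TopicPartition, Long> begin = consumer.beginningOffsets(partitions);
                Map<TopicPartition, Long> end = consumer.endOffsets(partitions);
                for (TopicPartition tp : partitions) {
                    // The log-end-offset never goes down; the beginning offset moves up
                    // as retention deletes old segments, so (end - beginning) is roughly
                    // the number of records still retained in the partition.
                    System.out.printf("%s begin=%d end=%d retained~%d%n",
                            tp, begin.get(tp), end.get(tp), end.get(tp) - begin.get(tp));
                }
            }
        }
    }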
I'm a newbie to Kafka and trying to figure out how it works.
If I'm right, a Kafka broker sends a batch of messages in response to one consumer poll. In other words, when the consumer invokes poll, it gets a batch of messages and then processes them one by one.
Now, let's assume there are 100 messages on the broker, numbered 1 to 100. When the consumer invokes poll, 10 messages are sent at a time: 1-10, 11-20, and so on. At the same time, the consumer automatically commits its offset to the broker every 5 seconds.
Say that at some moment the auto commit fires while the consumer is processing the 15th message.
In this case, which number is the committed offset, 11 or 14?
If it's 11, it means that if the broker needs to resend for some reason, it will resend the batch of messages from 11 to 20; but if it's 14, it will resend the batch from 14 to 23.
"In this case, I don't know which number is the committed offset, 11 or 14?"
Auto commit always commits the highest offset that was fetched during a poll. In your case it would commit 20, independent of which offset is currently being processed by the client.
I guess this example shows that enabling auto commit comes with some downsides. I recommend taking control of the committed offsets yourself by disabling auto commit and only committing offsets after all messages have been processed successfully. However, there are use cases where you can simply enable auto commit and never need to think about it.
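A minimal sketch of that approach (broker address, topic and group id are just examples): auto commit is disabled, and the offsets of the last poll are committed only once the whole batch has been processed.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.*;

    public class CommitAfterBatch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("my-topic"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        // ... process each record here; throw to avoid committing the batch ...
                    }
                    // Only after the whole batch was processed successfully:
                    // this commits the offsets returned by the last poll().
                    consumer.commitSync();
                }
            }
        }
    }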
"If it's 11, it means that if the broker needs to resend for some reason, it will resend the bunch of messages from 11 to 20, but if it's 14, it means that it will resend the bunch of messages from 14 to 23."
There is a difference between a consumed and a committed offset. Committed offsets only become relevant when you restart your application or when consumers join or leave your client's consumer group. Otherwise, the poll method does not care much about the committed offset while the application is running. I have written some more details on the difference between committed and consumed offsets in another answer.
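For illustration, a small sketch (assuming a partition orders/0 and example connection settings) that prints both values side by side:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.*;
    import org.apache.kafka.common.TopicPartition;

    public class ConsumedVsCommitted {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");

            TopicPartition tp = new TopicPartition("orders", 0);
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.assign(Collections.singletonList(tp));
                consumer.poll(Duration.ofSeconds(1));
                // position(): the consumed offset, i.e. the next record poll() will return.
                // committed(): the offset stored for the group, only used after a restart
                // or a rebalance to decide where to resume (null if nothing was committed).
                System.out.println("consumed  (position) : " + consumer.position(tp));
                System.out.println("committed (for group): " + consumer.committed(tp));
            }
        }
    }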
I'm new to Kafka and I have a setup with a source Kafka topic whose messages have the default retention of 7 days. I have 3 brokers, with 1 partition and a replication factor of 1.
When I consume messages from the source Kafka topic into my target Kafka topic, I receive the messages in the same order. My question: if I try to reprocess all the messages from my source Kafka into my target Kafka, the target receives no messages at all. I know duplication should be avoided, but say I have a scenario with 100 messages in my source Kafka and I expect 200 messages in my target Kafka after running the job twice. Instead I get 100 messages in the first run and the second run returns nothing.
Can someone please explain why this is happening and what the underlying mechanism is?
A Kafka consumer reads data from the partitions of a topic. Within a consumer group, each partition is read by only one consumer at a time.
Once a message has been read by the consumer, it isn't read again by default. Let me first explain the current offset. When we call poll, Kafka sends us some messages. Let's assume we have 100 records in the partition. The initial position of the current offset is 0. We make our first call and receive 100 messages, and Kafka moves the current offset to 100.
The current offset is a pointer to the next record to be read, i.e. just past the last record Kafka has already sent to the consumer in the most recent poll. Because of the current offset, the consumer doesn't get the same record twice. Please go through the diagram at the following URL for a complete picture.
https://www.learningjournal.guru/courses/kafka/kafka-foundation-training/offset-management/
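For illustration, a small sketch with the Java client (broker address, topic name and partition are examples) that shows the current offset, i.e. the consumer's position, jumping past whatever the last poll() returned:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.*;
    import org.apache.kafka.common.TopicPartition;

    public class CurrentOffsetMoves {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");

            TopicPartition tp = new TopicPartition("source-topic", 0);
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.assign(Collections.singletonList(tp));
                System.out.println("position before poll: " + consumer.position(tp));
                int fetched = consumer.poll(Duration.ofSeconds(1)).count();
                // After the poll, the current offset has moved past every record that
                // was just handed to the application, so the next poll does not return
                // those records again.
                System.out.println("fetched " + fetched
                        + " records, position now: " + consumer.position(tp));
            }
        }
    }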
I am using the exactly-once semantics provided by Kafka, so my producer writes each message within a transaction. While my producer was sending the 100th message, right between send() and commitTransaction(), I killed the producer process.
I then read the last few messages in my topic, including uncommitted ones:
Consumer record offset - message number
0 - 1
2 - 2
196 - 99 <- Last committed message by producer
198 - 100 <- 100th message was sent but not committed
Now, when I run a consumer with the read_committed isolation level, it reads exactly messages 1-99. But to find this out I had to read the entire topic. Eventually I am going to store millions of messages in a topic, so reading the entire topic is not an option.
Also, assume that the consumer is polling messages from a broker and there is some communication issue between the Kafka broker and the consumer, and the last message read by the consumer is, say, offset #50. This means I cannot reliably identify the last committed message in the topic.
I also tried other methods:
seekToEnd() - took me to offset#200
endOffsets() - took me to offset#200
Is there a way to reliably get the last message that was committed by the Kafka producer? (In my case, offset #196.)
Kafka generates an offset for each message. Say I produce 5 messages; their offsets will be 1 to 5.
But with a transactional producer, say I produce 5 messages and commit, then 5 messages that I abort, and then 5 more messages that I commit.
Will the last 5 committed messages have offsets 6 to 10 or 11 to 15?
What if I neither abort nor commit? Will the messages still be posted?
How does Kafka ignore offsets that are not committed, given that Kafka's commit logs are offset based? Does it use a transaction commit log so a transactional consumer can commit offsets and return the last stable offset? Or does it come from the __transaction_state topic which maintains the offsets?
The last 5 messages have offsets 11 to 15. When consuming with isolation.level=read_committed, the consumer will "jump" from offset 6 to 11.
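On the consumer side this only needs the isolation level to be set (a minimal sketch; topic and group names are examples). Records from aborted transactions are then filtered out, and you will simply see gaps in the offsets:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.*;

    public class ReadCommittedConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "tx-reader");
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
            // Only return records from committed transactions; aborted batches are skipped.
            props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("tx-topic"));
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(5))) {
                    // Offsets will show gaps where aborted records and transaction
                    // markers sit; that is expected.
                    System.out.println(record.offset() + " -> " + record.value());
                }
            }
        }
    }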
If you don't commit or abort the transaction, it will automatically be timed out (aborted) once the transaction timeout has elapsed (the producer's transaction.timeout.ms, which the broker caps at transaction.max.timeout.ms).
Along with the message data, Kafka stores a bunch of metadata and is able to identify, for each message, whether it has been committed or not. Since committing offsets is just writing to a partition (the only difference being that it's done by Kafka itself in the internal __consumer_offsets topic), the same mechanism works for offsets: offsets added via sendOffsetsToTransaction() whose transaction was aborted or never committed are automatically skipped.
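For completeness, a sketch of the producing side (the transactional id, topic names, group id and the offset value are just placeholders): offsets passed to sendOffsetsToTransaction() become visible to the consumer group only if commitTransaction() succeeds, and are skipped otherwise.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.clients.producer.*;
    import org.apache.kafka.common.KafkaException;
    import org.apache.kafka.common.TopicPartition;

    public class TransactionalWrite {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "orders-tx-1");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.initTransactions();
                try {
                    producer.beginTransaction();
                    producer.send(new ProducerRecord<>("output-topic", "key", "value"));
                    // The consumed offsets become part of the same transaction: they are
                    // only applied to the group if commitTransaction() succeeds, and are
                    // skipped (like the data) if the transaction is aborted.
                    producer.sendOffsetsToTransaction(
                            Collections.singletonMap(
                                    new TopicPartition("input-topic", 0),
                                    new OffsetAndMetadata(42L)),   // placeholder offset
                            "my-consumer-group");
                    producer.commitTransaction();
                } catch (KafkaException e) {
                    // Sketch only: for fatal errors (e.g. a fenced producer) you would
                    // close the producer instead of aborting.
                    producer.abortTransaction();
                }
            }
        }
    }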
As mentioned in another of your questions, I recommend having a look at the KIP that added exactly-once semantics to Kafka. It details all these mechanics and will help you get a better understanding: https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging