Kafka consumer - how to recognize offset skipping/missing offsets?

Setup:
We have a Debezium/Kafka Connect setup with a Debezium Oracle producer and a Confluent JDBC consumer/sink.
Starting position / background / problem:
Due to high traffic we have decreased log.retention.minutes to one hour, which is suitable 99% of the time.
But in some rare cases one of the Kafka consumers slows down and can't keep up any longer. In that case messages are deleted in Kafka (due to the aforementioned retention period) before they have been picked up and handled by the consumer.
With the default config, the consumer then skips the missing records by choosing the earliest available offset. This leads to inconsistencies on the target side.
Question:
How can we handle those situations (if raising log.retention.minutes isn't an option)?
Note: We would be fine if the consumer simply threw an exception/stopped/etc. when it can't find a message for its given offset.
What we've tried so far...
We tried setting auto.offset.reset to none for the consumer and expected the consumer to stop when it can't find an offset. In theory this should work. In practice it immediately throws an exception when the consumer is instantiated, because there's no first/initial offset.
Final thoughts
So is there another config parameter we could use? (Something like "throw exception if offset is missing/skipped, but not on first start"?) Or is there a JMX metric we could monitor in case a consumer is skipping messages?

setting auto.offset.reset to none for the consumer and expected the consumer to stop in case it can't find an offset
That's what it'll do, yes.
In practice it immediately throws an exception when the consumer gets instantiated because there's no first/initial offset
You'll need to actually initialize the group first, then seek it to the earliest offset. E.g. kafka-consumer-groups --reset-offsets --to-earliest --group connect-<name>
Something like "throw exception if offset is missing/skipped, but not on first start"?
There's nothing to differentiate auto.offset.reset between "first" and "next" starts. But, you could create the connector with consumer.override.auto.offset.reset=earliest, then wait for it to be running, then set it back to none with a PUT /config call. Then repeat whenever it stops running again.
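For illustration, here is a rough sketch of that last step against the Kafka Connect REST API (PUT /connectors/<name>/config), using Java 11's HttpClient. The Connect host, connector name, and the rest of the config map are placeholders; in practice you would resend the connector's complete configuration with only consumer.override.auto.offset.reset changed back to none.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class SwitchOffsetReset {
        public static void main(String[] args) throws Exception {
            // Placeholder config: send the connector's FULL config, only the
            // consumer.override.auto.offset.reset value is what changes here.
            String body = "{"
                    + "\"connector.class\": \"io.confluent.connect.jdbc.JdbcSinkConnector\","
                    + "\"topics\": \"my-topic\","
                    + "\"consumer.override.auto.offset.reset\": \"none\""
                    + "}";
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://connect:8083/connectors/my-sink/config")) // placeholder host/name
                    .header("Content-Type", "application/json")
                    .PUT(HttpRequest.BodyPublishers.ofString(body))
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }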
JMX metric we could monitor in case a consumer is skipping messages
Not that I know of; the metrics are mostly reporting bytes processed. You'd have to additionally track how many bytes you expect it to read.
You'd need other monitoring solutions to detect log segments being deleted on the broker, and to compare those offset ranges with the offsets your consumer is currently reading.
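As a starting point for such a monitoring job, a rough sketch with the Java AdminClient (Kafka clients 2.5+ for listOffsets) could compare the group's committed offsets with the earliest offsets still available on the broker; the broker address and group id below are placeholders:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.ListOffsetsResult;
    import org.apache.kafka.clients.admin.OffsetSpec;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    public class SkippedOffsetCheck {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            try (AdminClient admin = AdminClient.create(props)) {
                // Committed offsets of the (hypothetical) sink connector's consumer group
                Map<TopicPartition, OffsetAndMetadata> committed =
                        admin.listConsumerGroupOffsets("connect-my-sink")
                             .partitionsToOffsetAndMetadata().get();

                // Earliest offsets still retained on the broker for the same partitions
                Map<TopicPartition, OffsetSpec> request = new HashMap<>();
                committed.keySet().forEach(tp -> request.put(tp, OffsetSpec.earliest()));
                Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> earliest =
                        admin.listOffsets(request).all().get();

                // If the committed offset is older than the earliest retained offset,
                // records were deleted before the consumer read them.
                committed.forEach((tp, om) -> {
                    long first = earliest.get(tp).offset();
                    if (om != null && om.offset() < first) {
                        System.out.printf("%s: gap of %d records (committed=%d, earliest=%d)%n",
                                tp, first - om.offset(), om.offset(), first);
                    }
                });
            }
        }
    }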

Related

Using seek to get only uncommitted offsets from the beginning

I am using Spring Kafka and have a requirement where I have to listen to a DLQ topic and put the message onto another topic after a few minutes. Here I acknowledge a message only when it has been put onto the other topic; otherwise I do not commit it and call kafkaListenerEndpointRegistry.stop(), which stops my Kafka consumer. A scheduled cron job then runs every 3 minutes and starts the consumer with kafkaListenerEndpointRegistry.start(), and since auto.offset.reset is set to earliest, the consumer gets all messages from the previously uncommitted offset and checks their eligibility to be put onto the other topic.
This approach works fine for small volumes, but for very large volumes I am not seeing the expected retries on both topics. So I suspect this might be happening because I am using kafkaListenerEndpointRegistry.stop() to stop the consumer. If I were able to seek the partitions and get all messages from the uncommitted offset, I would not have to stop and start my consumer.
For this, I tried ConsumerSeekAware.onPartitionsAssigned and called callback.seekToBeginning() to reset the offsets. But it looks like this also consumes all committed offsets, which puts a huge load on my services. So is there anything I am missing, or does seekToBeginning always read all messages (committed and uncommitted)?
Also, is there any way to trigger partition assignment manually while the Kafka consumer is running, so that it goes through the onPartitionsAssigned method?
auto.offset.reset is set to earliest then consumer is getting all messages from previously uncommitted
auto.offset.reset is meaningless if there is a committed offset; it just determines the behavior if there is no committed offset.
seekToBeginning always read all messages (committed and uncommitted)
Kafka maintains two pointers - the current position and the committed offset. seek has nothing to do with the committed offset; seekToBeginning just changes the position to the earliest record, so the next poll will return all records.
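If the goal is to re-read only uncommitted records, a small helper along these lines (plain Consumer API, Kafka clients 2.4+ for committed(Set); not Spring-specific) can seek each assigned partition back to its committed offset instead of the beginning:

    import java.util.Collections;
    import java.util.Map;
    import java.util.Set;
    import org.apache.kafka.clients.consumer.Consumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    final class SeekToCommitted {
        // Rewind every assigned partition to its last committed offset (or to the
        // beginning when nothing has been committed yet), so the next poll() returns
        // only records that were never committed.
        static void seekToCommitted(Consumer<?, ?> consumer) {
            Set<TopicPartition> assigned = consumer.assignment();
            Map<TopicPartition, OffsetAndMetadata> committed = consumer.committed(assigned);
            for (TopicPartition tp : assigned) {
                OffsetAndMetadata om = committed.get(tp);
                if (om != null) {
                    consumer.seek(tp, om.offset());
                } else {
                    consumer.seekToBeginning(Collections.singleton(tp));
                }
            }
        }
    }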
This approach is working fine for small volume but for very large volume I am not seeing the expected retries in both topics. So I am suspecting that this might be happening because I am using kafkaListenerEndpointRegistry.stop() to stop the consumer.
That should not be a problem; you might want to consider using a container stopping error handler instead; then throw an exception and the container will stop itself (you should also set the stopImmediate container property).
https://docs.spring.io/spring-kafka/docs/current/reference/html/#container-stopping-error-handlers
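A minimal sketch of that setup, assuming Spring Kafka 2.8+ (earlier versions register a ContainerStoppingErrorHandler via setErrorHandler instead); bean names and type parameters are illustrative:

    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
    import org.springframework.kafka.core.ConsumerFactory;
    import org.springframework.kafka.listener.CommonContainerStoppingErrorHandler;

    @Configuration
    class StopOnErrorConfig {

        @Bean
        ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
                ConsumerFactory<String, String> consumerFactory) {
            ConcurrentKafkaListenerContainerFactory<String, String> factory =
                    new ConcurrentKafkaListenerContainerFactory<>();
            factory.setConsumerFactory(consumerFactory);
            // Stop the container instead of retrying when the listener throws
            factory.setCommonErrorHandler(new CommonContainerStoppingErrorHandler());
            // Stop before processing the remaining records of the current poll
            factory.getContainerProperties().setStopImmediate(true);
            return factory;
        }
    }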

Apache Kafka Cleanup while consuming messages

Playing around with Apache Kafka and its retention mechanism, I'm thinking about the following situation:
A consumer fetches the first batch of messages with offsets 1-5
The cleaner deletes the first 10 messages, so the topic now has offsets 11-15
In the next poll, the consumer fetches the next batch with offsets 11-15
As you can see, the consumer lost offsets 6-10.
Question: is such a situation possible at all? In other words, will the cleaner execute while there is an active consumer? If yes, is the consumer able to somehow recognize that gap?
Yes, such a scenario can happen. The exact steps will be a bit different:
The consumer fetches messages 1-5
Messages 1-10 are deleted
The consumer tries to fetch message 6, but this offset is out of range
The consumer uses its offset reset policy (auto.offset.reset) to find a new valid offset:
If set to latest, the consumer moves to the end of the partition
If set to earliest the consumer moves to offset 11
If none or unset, the consumer throws an exception
To avoid such scenarios, you should monitor the lead of your consumer group. It's similar to the lag, but the lead indicates how far from the start of the partition the consumer is. Being near the start has the risk of messages being deleted before they are consumed.
If consumers are near the limits, you can dynamically add more consumers or increase the topic retention size/time if needed.
Setting auto.offset.reset to none will throw an exception if this happens; the other values only log it.
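To watch the lead programmatically, one option is to read the consumer's own records-lead-min metric (available in the consumer fetch-manager metrics since the KIP-223 changes); the helper and the alert threshold below are illustrative:

    import java.util.Map;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.Metric;
    import org.apache.kafka.common.MetricName;

    final class LeadMonitor {
        // Call this periodically from the poll loop; alerts when the consumer's
        // position gets close to the start of the log (hypothetical threshold).
        static void checkLead(KafkaConsumer<?, ?> consumer) {
            for (Map.Entry<MetricName, ? extends Metric> e : consumer.metrics().entrySet()) {
                MetricName name = e.getKey();
                if ("records-lead-min".equals(name.name())
                        && "consumer-fetch-manager-metrics".equals(name.group())) {
                    double leadMin = ((Number) e.getValue().metricValue()).doubleValue();
                    if (leadMin >= 0 && leadMin < 1000) {
                        System.err.println("Consumer is close to the start of the log, lead=" + leadMin);
                    }
                }
            }
        }
    }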
Question: is such a situation possible at all? Will the cleaner execute while there is an active consumer?
Yes, if the messages have crossed the TTL (time-to-live) period before they are consumed, this situation is possible.
Is the consumer able to somehow recognize that gap?
In cases where you suspect your configuration (high consumer lag, low TTL) might lead to this, the consumer offsets should be tracked. The kafka-consumer-groups.sh command gives you the position of all consumers in a consumer group as well as how far behind the end of the log they are.

Kafka Polling Mechanism

Kafka messages that have been posted by the producer keep appearing at the consumer end after a specific interval.
When I tried to consume a message from my Kafka topic, I faced the issue explained above. I suppose it happens due to re-polling after 5 minutes (the default poll interval). Is my understanding correct?
My expected result is that the message should not be reprocessed again and again; it should be processed only once. How can I achieve that?
Your configuration seems to be
enable.auto.commit: false and auto.commit.interval.ms: some value
The second setting is what causes messages to appear again after that specific interval (some value).
The same message keeps appearing at the consumer end for processing because it was not processed successfully (and thus its offset was not committed) the first time.
If no last-offset information is available in ZooKeeper or the broker, and auto.offset.reset is set to smallest (or earliest), processing will start from the 0th offset.
Change auto.offset.reset to largest (or latest) if you do not want to reprocess the same message.
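A rough sketch of a consumer set up that way: auto commit disabled, the offset committed only after the batch has been processed, and auto.offset.reset set to latest. Broker address, group id, and topic are placeholders:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class ProcessThenCommit {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                // placeholder
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit manually
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest"); // no committed offset -> start at the end

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("my-topic")); // placeholder topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.println(record.offset() + ": " + record.value()); // your processing here
                    }
                    consumer.commitSync(); // commit only after the batch was processed successfully
                }
            }
        }
    }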

Kafka producer buffering

Suppose there is a producer that is running and I start a consumer a few minutes later. I noticed that the consumer will consume old messages that have been produced by the producer, but I don't want that to happen. How can I avoid it? Are there any config parameters on the broker that can be set to solve this problem?
It really depends on the use case; you didn't provide much information about the architecture. For instance: once the consumer is up, is it a long-running consumer, or does it just wake up for a short while and consume newly arriving messages?
You can take any of the following approaches:
Filter each ConsumerRecord by timestamp, so you automatically throw away messages that were produced more than a configurable time ago.
In my team we're using ephemeral groups. That is, each time the service goes up, we generate a new group id for the consumer group and set auto.offset.reset to latest.
Seek to timestamp - since Kafka 0.10 you can seek to a certain position. Use consumer.offsetsForTimes to get the offset of each topic partition for the desired time, and then use consumer.seek to move to the given offset (see the sketch after this list).
If you use a consumer group but never commit to Kafka, then each time a consumer is assigned to a topic partition, it will start consuming according to the auto.offset.reset policy...
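A sketch of the seek-to-timestamp approach (the third option above); it assumes the consumer already has partitions assigned, and the maximum age is just a parameter you pick:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.kafka.clients.consumer.Consumer;
    import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
    import org.apache.kafka.common.TopicPartition;

    final class SeekToTimestamp {
        // Move every assigned partition to the first offset at or after "now minus maxAge".
        static void seekToRecent(Consumer<?, ?> consumer, Duration maxAge) {
            long startFrom = System.currentTimeMillis() - maxAge.toMillis();
            Map<TopicPartition, Long> query = new HashMap<>();
            for (TopicPartition tp : consumer.assignment()) {
                query.put(tp, startFrom);
            }
            Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(query);
            offsets.forEach((tp, oat) -> {
                if (oat != null) {
                    consumer.seek(tp, oat.offset());               // first record newer than maxAge
                } else {
                    consumer.seekToEnd(Collections.singleton(tp)); // nothing newer: skip to the end
                }
            });
        }
    }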

kafka consumer sessions timing out

We have an application in which a consumer reads a message and the thread does a number of things, including database accesses, before a message is produced to another topic. The time between consuming and producing the message on the thread can take several minutes. Once the message is produced to the new topic, a commit is done to indicate we are done with work on the consumer queue message. Auto commit is disabled for this reason.
I'm using the high-level consumer, and what I'm noticing is that the ZooKeeper and Kafka sessions time out because it takes too long before we do anything on the consumer queue. So Kafka ends up rebalancing every time the thread goes back to read more from the consumer queue, and after a while it starts to take a long time before a consumer reads a new message.
I can set the ZooKeeper session timeout very high so that this isn't a problem, but then I have to adjust the rebalance parameters accordingly, and Kafka won't pick up a new consumer for a while, among other side effects.
What are my options to solve this problem? Is there a way to heartbeat to Kafka and ZooKeeper to keep both happy? Would I still have these same issues if I were to use a simple consumer?
It sounds like your problems boil down to relying on the high-level consumer to manage the last-read offset. Using a simple consumer would solve that problem since you control the persistence of that offset. Note that all the high-level consumer commit does is store the last-read offset in ZooKeeper. No other action is taken, and the message you just read is still there in the partition and is readable by other consumers.
With the Kafka simple consumer, you have much more control over when and how that offset storage takes place. You can even persist that offset somewhere other than ZooKeeper (a database, for example).
The bad news is that while the simple consumer itself is simpler than the high-level consumer, there's a lot more work you have to do code-wise to make it work. You'll also have to write code to access multiple partitions - something the high-level consumer does quite nicely for you.
I think the issue is that the consumer's poll method is what triggers the consumer's heartbeat request. When you exceed session.timeout between polls, the consumer's heartbeat will not reach the coordinator; because of this missed heartbeat, the coordinator marks the consumer dead. Consumer rejoining is also very slow, especially in the case of a single consumer.
I faced a similar issue, and to solve it I had to change the following parameters in the consumer config properties:
session.timeout.ms=
request.timeout.ms=more than session timeout
You also have to add the following property in server.properties on the Kafka broker node:
group.max.session.timeout.ms =
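For illustration only, since the answer leaves the values blank, a consumer-side sketch with hypothetical numbers might look like this (the broker-side group.max.session.timeout.ms then has to be at least as large):

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;

    final class TimeoutSettings {
        static Properties consumerProps() {
            Properties props = new Properties();
            props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "300000"); // hypothetical: 5 minutes
            props.put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, "305000"); // hypothetical: > session.timeout.ms
            // broker side, server.properties: group.max.session.timeout.ms=300000 (or higher)
            return props;
        }
    }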
You can see the following link for more detail.
http://grokbase.com/t/kafka/users/16324waa50/session-timeout-ms-limit