KafkaConsumer resume partition cannot continue to receive uncommitted messages - apache-kafka

I'm using one topic, one partition, one consumer, Kafka client version is 0.10.
I got two different results:
If I paused partition first, then to produce a message and to invoke resume method. KafkaConsumer can poll the uncommitted message successfully.
But If I produced message first and didn't commit its offset, then to pause the partition, after several seconds, to invoke the resume method. KafkaConsumer would not receive the uncommitted message. I checked it on Kafka server using kafka-consumer-groups.sh, it shows LOG-END-OFFSET minus CURRENT-OFFSET = LAG = 1.
I have been trying to figure out it for two days, I repeated such tests a lot of times, the results are always like so. I need some suggestion or someone can tell me its Kafka's original mechanism.

For your observation#2, if you restart the application, it will supply you all records from the un-committed offset, i.e. the missing record and if your consumer again does not commit, it will be sent again when application registers consumer with Kafka upon restart. It is expected.
Assuming you are using consumer.poll() which creates a hybrid-streaming interface i.e. if accumulates data coming into Kafka for the duration mentioned and provides it to the consumer for processing once the duration is finished. This continuous accumulation happens in the backend and is not dependent on whether you have committed offset or not.
KafkaConsumer
The position of the consumer gives the offset of the next record that
will be given out. It will be one larger than the highest offset the
consumer has seen in that partition. It automatically advances every
time the consumer receives messages in a call to poll(long).

Related

Using seek to listen get only uncommitted offset from beginning

I am using Spring Kafka and have a requirement where I have to listen from a DLQ topic and put the message to another topic after few minutes. Here I am only acknowledging a msg only when it is put to another topic else I am not committing it and calling kafkaListenerEndpointRegistry.stop() which is stopping my kafka consumer. Then there is scheduled cron job running after every 3 minutes and starts the consumer by running kafkaListenerEndpointRegistry.start() and since auto.offset.reset is set to earliest then consumer is getting all msgs from previously uncommitted offset and checking their eligibility to be put on other topic.
This approach is working fine for small volume but for very large volume I am not seeing the expected retries in both topics. So I am suspecting that this might be happening because I am using kafkaListenerEndpointRegistry.stop() to stop the consumer. If I am able to seek to beginning of offset for each partition and get all msgs from uncommitted offset then I don't have to stop and start my consumer.
For this, I tried ConsumerSeekAware.onPartitionAssigned and calling callback.seekToBeginning() to reset offsets. But looks like it's also consuming all committed offset which is increasing huge load on my services. So is there anything I am missing or seekToBeginning always read all msgs(committed and uncommitted).
and is there any way to trigger partition assignment manually while running kafka consumer so that it goes to onPartitionAssigned method?
auto.offset.reset is set to earliest then consumer is getting all msgs from previously uncommitted
auto.offset.reset is meaningless if there is a committed offset; it just determines the behavior if there is no committed offset.
seekToBeginning always read all msgs(committed and uncommitted).
Kafka maintains 2 pointers - the current position and the committed offset; seek has nothing to do with committed offset, seekToBeginning just changes the position to the earliest record, so the next poll will return all records.
This approach is working fine for small volume but for very large volume I am not seeing the expected retries in both topics. So I am suspecting that this might be happening because I am using kafkaListenerEndpointRegistry.stop() to stop the consumer.
That should not be a problem; you might want to consider using a container stopping error handler instead; then throw an exception and the container will stop itself (you should also set the stopImmediate container property).
https://docs.spring.io/spring-kafka/docs/current/reference/html/#container-stopping-error-handlers

Apache Kafka Cleanup while consuming messages

Playing around with Apache Kafka and its retention mechanism I'm thinking about following situation:
A consumer fetches first batch of messages with offsets 1-5
The cleaner deletes the first 10 messages, so the topic now has offsets 11-15
In the next poll, the consumer fetches the next batch with offsets 11-15
As you can see the consumer lost the offsets 6-10.
Question, is such a situation possible at all? With other words, will the cleaner execute while there is an active consumer? If yes, is the consumer able to somehow recognize that gap?
Yes such a scenario can happen. The exact steps will be a bit different:
Consumer fetches message 1-5
Messages 1-10 are deleted
Consumer tries to fetch message 6 but this offset is out of range
Consumer uses its offset reset policy auto.offset.reset to find a new valid offset.
If set to latest, the consumer moves to the end of the partition
If set to earliest the consumer moves to offset 11
If none or unset, the consumer throws an exception
To avoid such scenarios, you should monitor the lead of your consumer group. It's similar to the lag, but the lead indicates how far from the start of the partition the consumer is. Being near the start has the risk of messages being deleted before they are consumed.
If consumers are near the limits, you can dynamically add more consumers or increase the topic retention size/time if needed.
Setting auto.offset.reset to none will throw an exception if this happens, the other values only log it.
Question, is such a situation possible at all? will the cleaner execute while there is an active consumer
Yes, if the messages have crossed TTL (Time to live) period before they are consumed, this situation is possible.
Is the consumer able to somehow recognize that gap?
In case where you suspect your configuration (high consumer lag, low TTL) might lead to this, the consumer should track offsets. kafka-consumer-groups.sh command gives you the information position of all consumers in a consumer group as well as how far behind the end of the log they are.

Kafka producer buffering

Suppose there is a producer which is running and I run a consumer a few minutes later. I noticed that the consumer will consume old messages that has been produced by the producer but I don't want that happens. How can I do that? Is there any config parameters in broker to be set and solve this problem?
It really depends on the use case, you didn't really provide much information about the architecture. For instance - once the consumer is up, is it a long running consumer, or does it just wake up for a short while and consumes new messages arriving?
You can take any of the following approaches:
Filter ConsumerRecord by timestamp, so you will automatically throw away messages that were produced over configurable time.
In my team we're using ephemeral groups. That is - each time the service goes up, we generate a new group id for the consumer group, setting auto.offset.reset to latest
Seek to timestamp - since kafka 0.10 you can seek to a certain position. Use consumer.offsetsForTimes to get the offset of each topic partition for the desired time, and then use consumer.seek to get to the given offset.
If you use a consumer group, but never commit to kafka, then each time the a consumer is assigned to a topic partition, it will start consuming according to auto.offset.reset policy...

Why is the kafka consumer consuming the same message hundreds of times?

I see from the logs that exact same message is consumed by the 665 times. Why does this happen?
I also see this in the logs
Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member.
This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies
that the poll loop is spending too much time message processing. You can address this either by increasing the session
timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
Consumer properties
group.id=someGroupId
bootstrap.servers=kafka:9092
enable.auto.commit=false
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=org.apache.kafka.common.serialization.StringDeserializer
session.timeout.ms=30000
max.poll.records=20
PS: Is it possible to consume only a specific number of messages like 10 or 50 or 100 messages from the 1000 that are in the queue?
I was looking at 'fetch.max.bytes' config, but it seems like it is for a message size rather than number of messages.
Thanks
The answer lies in the understanding of the following concepts:
session.timeout.ms
heartbeats
max.poll.interval.ms
In your case, your consumer receives a message via poll() but is not able to complete the processing in max.poll.interval.ms time. Therefore, it is assumed hung by the Broker and re-balancing of partitions happen due to which this consumer loses the ownership of all partitions. It is marked dead and is no longer part of a consumer group.
Then when your consumer completes the processing and calls poll() again two things happen:
Commit fails as the consumer no longer owns the partitions.
Broker identifies that the consumer is up again and therefore a re-balance is triggered and the consumer again joins the Consumer Group, start owning partitions and request messages from the Broker. Since the earlier message was not marked as committed (refer #1 above, failed commit) and is pending processing, the broker delivers the same message to consumer again.
Consumer again takes a lot of time to process and since is unable to finish processing in less than max.poll.interval.ms, 1. and 2. keep repeating in a loop.
To fix the problem, you can increase the max.poll.interval.ms to a large enough value based on how much time your consumer needs for processing. Then your consumer will not get marked as dead and will not receive duplicate messages.
However, the real fix is to check your processing logic and try to reduce the processing time.
The fix is described in the message you pasted:
You can address this either by increasing the session timeout or by
reducing the maximum size of batches returned in poll() with
max.poll.records.
The reason is a timeout is reached before your consumer is able to process and commit the message. When your Kafka consumer "commits", it's basically acknowledging receipt of the previous message, advancing the offset, and therefore moving onto the next message. But if that timeout is passed (as is the case for you), the consumer's commit isn't effective because it's happening too late; then the next time the consumer asks for a message, it's given the same message
Some of your options are to:
Increase session.timeout.ms=30000, so the consumer has more time
process the messages
Decrease the max.poll.records=20 so the consumer has less messages it'll need to work on before the timeout occurs. But this doesn't really apply to you because your consumer is already only just working on a single message
Or turn on enable.auto.commit, which probably also isn't the best solution for you because it might result in dropping messages though, as mentioned below:
If we allowed offsets to auto commit as in the previous example
messages would be considered consumed after they were given out by the
consumer, and it would be possible that our process could fail after
we have read messages into our in-memory buffer but before they had
been inserted into the database.
Source: https://kafka.apache.org/090/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html

Kafka Message at-least-once mode at multi-consumer

Kafka messaging use at-least-once message delivery to ensure every message to be processed, and uses a message offset to indicates which message is to deliver next.
When there are multiple consumers, if some deadly message cause a consumer crash during message processing, will this message be redelivered to other consumers and spread the death? If some slow message blocked a single consumer, can other consumers keep going and process subsequent messages?
Or even worse, if a slow and deadly message caused a consumer crash, will it cause other consumers start from its offset again?
There are a few things to consider here:
A Kafka topic partition can be consumed by one consumer in a consumer group at a time. So if two consumers belong to two different groups they can consume from the same partition simultaneously.
Stored offsets are per consumer group. So each topic partition has a stored offset for each active (or recently active) consumer group with consumer(s) subscribed to that partition.
Offsets can be auto-committed at certain intervals, or manually committed (by the consumer application).
So let's look at the scenarios you described.
Some deadly message causes a consumer crash during message processing
If offsets are auto-committed, chances are by the time the processing of the message fails and crashes the consumer, the offset is already committed and the next consumer in the group that takes over would not see that message anymore.
If offsets are manually committed after processing is done, then the offset of that message will not be committed (for simplicity, I am assuming one message is read and processed at a time, but this can be easily generalized) because of the consumer crash. So any other consumer in the group that is (will be) subscribed to that topic will read the message again after taking over that partition. So it's possible that it will crash other consumers too. If offsets are committed before message processing, then the next consumers won't see the message because the offset is already committed when the first consumer crashed.
Some slow message blocks a single consumer: As long as the consumer is considered alive no other consumer in the group will take over. If the slowness goes beyond the consumer's session.timeout.ms the consumer will be considered dead and removed from the group. So whether another consumer in the group will read that message depends on how/when the offset is committed.
Slow and deadly message causes a consumer crash: This scenario should be similar to the previous ones in terms of how Kafka handles it. Either slowness is detected first or the crash occurs first. Again the main thing is how/when the offset is committed.
I hope that helps with your questions.