Kafka consuming same message multiple times - apache-kafka

I am queuing a single message in a Kafka queue.
Kafka consumer properties:
enable.auto.commit=true
auto.commit.interval.ms=5000
max.poll.interval.ms=30000
Processing my message takes around 10 minutes, so the message kept being reprocessed every 5 minutes.
Then I changed max.poll.interval.ms to 20 minutes, and the issue is fixed.
But my question is this: why is that happening? Since I already have auto commit enabled and it should happen every 5 seconds, why were my messages not marked as committed in the former case?

When enable.auto.commit is set to true, the largest offset is committed every auto.commit.interval.ms. However, this happens only when poll() is called: on every poll(), the consumer checks whether it is time to commit the offsets it returned in the last poll.
Now, in your case, poll() is called every 20 minutes (max.poll.interval.ms), which means it might take up to an additional 20 minutes (plus the 5000 ms interval) before the offsets are committed.
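To make the mechanics concrete, here is a minimal sketch of a consumer with the question's settings. The broker address, group id, topic name, and process() method are illustrative placeholders, not taken from the question:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AutoCommitDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "my-group");                // placeholder group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "true");
        props.put("auto.commit.interval.ms", "5000");  // a commit becomes "due" every 5 s...
        props.put("max.poll.interval.ms", "1200000");  // ...but 20 min may pass between polls

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // placeholder topic
            while (true) {
                // Auto-commit happens HERE: offsets from the previous poll are
                // committed if auto.commit.interval.ms has elapsed.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // if this takes 10 min, nothing commits until the next poll()
                }
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // stand-in for the slow 10-minute processing described in the question
    }
}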

Related

What is the delay time between each poll

In the Kafka documentation, I'm trying to understand the property max.poll.interval.ms:
The maximum delay between invocations of poll() when using consumer group management. This places an upper bound on the amount of time that the consumer can be idle before fetching more records. If poll() is not called before expiration of this timeout, then the consumer is considered failed and the group will rebalance in order to reassign the partitions to another member.
This means each poll must happen before the poll timeout, which is 5 minutes by default. So my question is: exactly how much time does the consumer thread take between two consecutive polls?
For example: Consumer Thread 1
First poll--> with 100 records
--> process 100 records (took 1 minute)
--> consumer submitted offset
Second poll--> with 100 records
--> process 100 records (took 1 minute)
--> consumer submitted offset
Does the consumer take time between the first and second poll? If yes, why, and how can we change that time? (Assume this is when the topic has huge amounts of data.)
It's not clear what you mean by "take time between"; if you are talking about the spring-kafka listener container, there is no wait or sleep between polls.
The consumer is polled immediately after the offsets are committed.
So, max.poll.interval.ms must be large enough for your listener to process max.poll.records (plus some extra, just in case).
But, no, there are no delays added between polls, just the time it takes the listener to handle the results of the poll.
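If you want to see this for yourself, one quick (and purely illustrative) way is to timestamp consecutive poll() calls on a raw consumer; the measured gap is just your own processing time. This assumes a consumer built as in the sketch further above, and handle() is a hypothetical record handler:

long lastPoll = System.currentTimeMillis();
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    long now = System.currentTimeMillis();
    // the gap is dominated by how long the previous batch took to process
    System.out.println("gap between polls: " + (now - lastPoll) + " ms");
    lastPoll = now;
    for (ConsumerRecord<String, String> record : records) {
        handle(record); // hypothetical handler; its duration IS the "delay" between polls
    }
}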

Relationship between maxPollRecords and autoCommitEnable in kafka

Can someone please give me a good example and explain the relationship between the Kafka params maxPollRecords and autoCommitEnable.
There is no relationship as such between them. Let me explain the two configs.
In Kafka there are two ways a consumer can commit offsets -
1. Manual offset commit - where the responsibility of committing offsets lies with the developer.
2. Enable auto commit - where the Kafka consumer takes responsibility for committing offsets for you. How it works: on every poll() call you make on the consumer, it checks whether it is time to commit the offsets (dictated by the auto.commit.interval.ms configuration); if it is, it commits them.
For example, suppose auto.commit.interval.ms is set to 7 seconds and every call to poll() takes 8 seconds. On a particular call to poll(), the consumer checks whether the time to commit offsets has elapsed (in this example it has) and then commits the offsets fetched by the previous poll.
Offsets are also committed during the closing of a consumer.
Here are some links you can look at -
https://kafka.apache.org/documentation/#consumerconfigs
https://kafka.apache.org/11/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
Does kafka lose message if consumer holds message longer then auto commit interval time?
Now, on to max.poll.records. With this configuration, you can tell the Kafka consumer the maximum number of records you would like it to return on a single call to poll(). Note that you will generally not change this default unless your record processing is slow and you want to ensure that your consumer is not considered dead because processing too many records takes too long.
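As a rough sketch, the two settings live side by side in the consumer properties but control different things; the values below are illustrative only:

Properties props = new Properties();
props.put("enable.auto.commit", "true");      // WHO commits: the consumer itself, on poll()/close()
props.put("auto.commit.interval.ms", "7000"); // WHEN a commit becomes "due"
props.put("max.poll.records", "100");         // HOW MANY records one poll() may return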

Does kafka lose message if consumer holds message longer then auto commit interval time?

Say the auto-commit interval is 30 seconds, and a consumer for some reason could not process a message, holding it for longer than 30 seconds, and then crashes. Does the auto-commit mechanism commit this offset anyway right before the consumer crashes?
If my assumption is correct, is the message lost, since its offset was committed but the message itself was never processed?
Let's say your consumer group is named Test and you have a single consumer in the consumer group.
When Auto-Commit is enabled, offsets are committed only during poll() calls and during closing of a consumer.
For example, if auto.commit.interval.ms is 5 seconds and every call to poll() takes 7 seconds, then on every call to poll() the consumer checks whether the auto-commit interval has elapsed; if it has, as in this example, it commits the offsets.
Offsets are also committed during closing of a consumer.
From the documentation -
"Close the consumer, waiting for up to the default timeout of 30 seconds for any needed cleanup. If auto-commit is enabled, this will commit the current offsets if possible within the default timeout".
You can read more about it here -
https://kafka.apache.org/10/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html
Now, on to your question: if poll() is not called again and the consumer is not closed, the offset won't be committed.
If the consumer receives message N, commits its offset, and then crashes before having fully processed it, then by default the consumer will consider this message processed.
Note that the message is still on the broker, so it can be re-consumed and processed. But that requires some logic in your application to not only restart from the last committed position but also check whether previous records were processed successfully.
If your application typically takes a long time to process messages, you may want to switch from auto-commit to manual commit. That way you can better control when you commit and avoid this issue, as in the sketch below.
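A minimal manual-commit sketch, assuming the same consumer setup as in the earlier example but with enable.auto.commit set to false (process() is a placeholder for your slow handler):

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
    for (ConsumerRecord<String, String> record : records) {
        process(record); // may safely take longer than any auto-commit interval
    }
    if (!records.isEmpty()) {
        consumer.commitSync(); // commit only AFTER processing succeeded;
                               // a crash before this line re-delivers the batch
    }
}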

Kafka enable.auto.commit set to false but poll still fetch "next" messages

I want to tell Kafka when my consumer has successfully processed a record, so I have turned auto-commit off by setting enable.auto.commit to false. I have two messages at offsets zero and one on a topic I am subscribed to, and I have created a consumer so that each call to poll will return at most one record (by setting max.poll.records to 1).
I now call consumer.poll(5000) and receive the first message but I do not acknowledge it; I do not call commitSync or commitAsync. If I now call consumer.poll(5000) again, using the same consumer, I expect to get the exact same message I just read but, instead, I receive the second message.
How do I get consumer.poll to keep handing out the same message until I explicitly acknowledge it?
What you described is the expected behaviour. Every time you call poll(), it will return the next messages. The offset you commit is only used when connecting a new consumer so it knows where to (re)start from.
In MessageHub, we've set the session timeout to 30 seconds, so you need to call poll() more frequently than that to avoid being disconnected. If your processing takes longer, I can think of 2 options:
Use Kafka 0.10.2 and set max.poll.interval.ms to tell your Kafka client to keep the session alive (without you having to call poll()) while you process the previous record. (This feature was added in 0.10.1, but we don't support that version; 0.10.2 works because it can talk to 0.10.0 brokers.)
Use seek() to move back to the previous offset after each poll() so it keeps returning the same record, as in the sketch below.
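A sketch of option 2, assuming max.poll.records=1 as in the question; tryToProcess() is a hypothetical stand-in for your acknowledgement logic, and TopicPartition comes from org.apache.kafka.common:

ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
for (ConsumerRecord<String, String> record : records) {
    if (tryToProcess(record)) {      // hypothetical: returns true once acknowledged
        consumer.commitSync();       // position is already past this record
    } else {
        // rewind so the next poll() returns the exact same record again
        consumer.seek(new TopicPartition(record.topic(), record.partition()),
                record.offset());
    }
}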
Hope this helps!

Kafka Consumer - Poll behaviour

I'm facing some serious problems trying to implement a solution for my needs, regarding KafkaConsumer (>=0.9).
Let's imagine I have a function that has to read just n messages from a Kafka topic.
For example: getMsgs(5) --> gets the next 5 Kafka messages in the topic.
So, I have a loop that looks like the one below. (Edited with the actual correct parameters: in this case, the consumer's max.poll.records param was set to 1, so the loop only iterated once. Different consumers, some of which iterate through many messages, share an abstract parent class, which is why the code is structured this way. The numMss part is ad hoc for this consumer.)
int numMss = 0;
for (boolean exit = false; !exit; ) {
    ConsumerRecords<String, String> records = consumer.poll(config.pollTime);
    for (ConsumerRecord<String, String> r : records) {
        processRecord(r); // do my things
        numMss++;
        if (numMss == maximum) { // maximum = 5
            exit = true;
            break;
        }
    }
}
Taking this into account, the problem is that the poll() method could fetch more than 5 messages. For example, if it gets 10 messages, my code will forget those other 5 messages forever, since Kafka will think they have already been consumed.
I tried committing the offset, but it doesn't seem to work:
consumer.commitSync(Collections.singletonMap(partition,
new OffsetAndMetadata(record.offset() + 1)));
Even with the offset commit, whenever I launch the consumer again, it won't start from the 6th message (remember, I just wanted 5 messages) but from the 11th (since the first poll consumed 10 messages).
Is there any solution for this, or maybe (most surely) am I missing something?
Thanks in advance!!
You can set max.poll.records to whatever number you like, so that at most you will get that many records on each poll.
For the use case you stated, you don't have to commit offsets explicitly yourself. You can just set enable.auto.commit to true and auto.offset.reset to earliest; the latter kicks in when there is no committed offset for your group.id (in other words, when you are about to start reading a partition for the very first time). Once you have a group.id and some consumer offsets stored in Kafka, and your consumer process dies, it will continue from the last committed offset. That is the default behavior: when a consumer starts, it first checks whether there are any committed offsets and, if so, continues from the last committed one; auto.offset.reset won't kick in.
Note that you need to disable auto commit (enable.auto.commit set to false) if you want to commit offsets manually. Otherwise the next call to poll() will automatically commit the latest offsets of the messages you received from the previous poll().
From Kafka 0.9, the auto.offset.reset parameter values have changed:
What to do when there is no initial offset in Kafka or if the current offset does not exist any more on the server (e.g. because that data has been deleted):
earliest: automatically reset the offset to the earliest offset
latest: automatically reset the offset to the latest offset
none: throw exception to the consumer if no previous offset is found for the consumer's group
anything else: throw exception to the consumer.
Set the auto.offset.reset property to earliest. Then try consuming; you will get records starting from the committed offset.
Or use the consumer.seek(TopicPartition, offset) API before poll(), as in the fragment below.
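Putting those suggestions together for the "read exactly 5 messages" use case, a fragment of the consumer setup might look like this; the values, topic, and partition are illustrative only:

props.put("max.poll.records", "5");         // poll() now returns at most the 5 wanted records
props.put("enable.auto.commit", "true");    // the next poll() commits the previous batch
props.put("auto.offset.reset", "earliest"); // used only when the group has no committed offset

// or, to position explicitly before polling (topic, partition, and offset are hypothetical):
consumer.seek(new TopicPartition("my-topic", 0), 5L); // offset 5 = the 6th message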