Kafka Consumer - Poll behaviour - apache-kafka

I'm facing some serious problems trying to implement a solution for my needs, regarding KafkaConsumer (>=0.9).
Let's imagine I have a function that has to read just n messages from a kafka topic.
For example: getMsgs(5) --> gets the next 5 Kafka messages in the topic.
So, I have a loop that looks like this (edited with the actual, correct parameters). In this case, the consumer's max.poll.records param was set to 1, so the inner loop only iterated once per poll. Different consumers (some of them iterated through many messages) shared an abstract parent class (this one), which is why it's coded this way. The numMss part was ad hoc for this consumer.
for (boolean exit = false; !exit; ) {
    ConsumerRecords<String, String> records = consumer.poll(config.pollTime);
    for (ConsumerRecord<String, String> r : records) {
        processRecord(r); // do my things
        numMss++;
        if (numMss == maximum) { // maximum = 5
            exit = true;
            break;
        }
    }
}
Taking this into account, the problem is that the poll() method could get more than 5 messages. For example, if it gets 10 messages, my code will forget those other 5 messages forever, since Kafka will think they're already consumed.
I tried committing the offset, but it doesn't seem to work:
consumer.commitSync(Collections.singletonMap(partition,
new OffsetAndMetadata(record.offset() + 1)));
Even with the offset configuration, whenever I launch the consumer again, it won't start from the 6th message (remember, I just wanted 5 messages), but from the 11th (since the first poll consumed 10 messages).
Is there any solution for this, or maybe (most surely) am I missing something?
Thanks in advance!!

You can set max.poll.records to whatever number you like such that at most you will get that many records on each poll.
For the use case you stated in this question, you don't have to commit offsets explicitly yourself. You can just set enable.auto.commit to true and set auto.offset.reset to earliest, so that the latter kicks in when there are no committed offsets for your group.id (in other words, when you are about to start reading a partition for the very first time). Once you have a group.id and some consumer offsets stored in Kafka, and in case your Kafka consumer process dies, it will continue from the last committed offset. That is the default behavior: when a consumer starts, it first looks for any committed offsets and, if some exist, continues from the last committed one, and auto.offset.reset won't kick in.
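As a minimal sketch (the broker address, group id, and deserializers below are placeholders, not from the question), that configuration would look like:

import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
props.put("group.id", "my-group");                // placeholder group id
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("max.poll.records", "5");         // at most 5 records per poll()
props.put("enable.auto.commit", "true");    // let the consumer commit offsets itself
props.put("auto.offset.reset", "earliest"); // only used when no committed offset exists
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);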

Note: if you want to commit offsets manually, you need to disable auto commit by setting enable.auto.commit to false. Without that, the next call to poll() will automatically commit the latest offset of the messages you received from the previous poll().

From Kafka 0.9 onward, the auto.offset.reset parameter values have changed:
What to do when there is no initial offset in Kafka or if the current offset does not exist any more on the server (e.g. because that data has been deleted):
earliest: automatically reset the offset to the earliest offset
latest: automatically reset the offset to the latest offset
none: throw exception to the consumer if no previous offset is found for the consumer's group
anything else: throw exception to the consumer.

Set the auto.offset.reset property to "earliest". Then try to consume; you will get the records starting from the committed offset.
Or use the consumer.seek(TopicPartition, offset) API before polling, as sketched below.
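For instance, a minimal sketch of the seek() approach (the topic name, partition, and offset are placeholders, not from the question):

import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.common.TopicPartition;

TopicPartition tp = new TopicPartition("my-topic", 0); // placeholder topic/partition
consumer.assign(Collections.singletonList(tp));        // manual assignment, so seek() is valid immediately
consumer.seek(tp, 5L);                                 // next poll() starts at offset 5
ConsumerRecords<String, String> records = consumer.poll(1000);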

Related

Using seek to get only uncommitted offsets from the beginning

I am using Spring Kafka and have a requirement where I have to listen to a DLQ topic and put the message onto another topic after a few minutes. Here I only acknowledge a message when it is put onto the other topic; otherwise I don't commit it and call kafkaListenerEndpointRegistry.stop(), which stops my Kafka consumer. Then a scheduled cron job runs every 3 minutes and starts the consumer by running kafkaListenerEndpointRegistry.start(), and since auto.offset.reset is set to earliest, the consumer gets all messages from the previously uncommitted offset and checks their eligibility to be put onto the other topic.
This approach works fine for small volume, but for very large volume I am not seeing the expected retries in both topics. So I suspect this might be happening because I am using kafkaListenerEndpointRegistry.stop() to stop the consumer. If I were able to seek to the beginning offset of each partition and get all messages from the uncommitted offset, then I wouldn't have to stop and start my consumer.
For this, I tried ConsumerSeekAware.onPartitionsAssigned and calling callback.seekToBeginning() to reset the offsets. But it looks like this also consumes all committed offsets, which puts huge load on my services. So is there anything I am missing, or does seekToBeginning always read all messages (committed and uncommitted)?
And is there any way to trigger partition assignment manually while the Kafka consumer is running, so that it goes to the onPartitionsAssigned method?
auto.offset.reset is set to earliest then consumer is getting all msgs from previously uncommitted
auto.offset.reset is meaningless if there is a committed offset; it just determines the behavior if there is no committed offset.
seekToBeginning always read all msgs(committed and uncommitted).
Kafka maintains 2 pointers - the current position and the committed offset; seek has nothing to do with committed offset, seekToBeginning just changes the position to the earliest record, so the next poll will return all records.
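If the goal is to re-read only uncommitted records, one option is to seek back to the committed offset rather than the beginning. A sketch using the plain consumer API (not Spring's seek callback; consumer.committed() is a standard API, the rest is illustrative):

import java.util.Collections;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

for (TopicPartition tp : consumer.assignment()) {
    OffsetAndMetadata committed = consumer.committed(tp);
    if (committed != null) {
        consumer.seek(tp, committed.offset()); // position = first uncommitted record
    } else {
        consumer.seekToBeginning(Collections.singletonList(tp)); // nothing committed yet
    }
}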
This approach is working fine for small volume but for very large volume I am not seeing the expected retries in both topics. So I am suspecting that this might be happening because I am using kafkaListenerEndpointRegistry.stop() to stop the consumer.
That should not be a problem; you might want to consider using a container stopping error handler instead; then throw an exception and the container will stop itself (you should also set the stopImmediate container property).
https://docs.spring.io/spring-kafka/docs/current/reference/html/#container-stopping-error-handlers

Relationship between maxPollRecords and autoCommitEnable in kafka

Can someone please give me a good example of the relationship between the Kafka params maxPollRecords and autoCommitEnable?
There is no relationship as such between them. Let me explain the two configs to you.
In Kafka there are two ways a consumer can commit offsets:
1. Manual Offset Commit - where the responsibility of committing offsets lies with the developer.
2. Enable Auto Commit - where the Kafka consumer takes responsibility for committing offsets for you. It works like this: on every poll() call you make on the consumer, it is checked whether it is time to commit the offset (dictated by the auto.commit.interval.ms configuration); if it is time, the offset is committed.
For example, suppose auto.commit.interval.ms is set to 7 seconds and every call to poll() takes 8 seconds. On a particular call to poll(), the consumer will check whether the time to commit offsets has elapsed (which in this example it will have) and then commit the offsets fetched by the previous poll.
Offsets are also committed during the closing of a consumer.
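A rough sketch of the manual style, assuming a consumer configured with enable.auto.commit=false (the process() method and running flag are placeholders):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;

while (running) {
    ConsumerRecords<String, String> records = consumer.poll(1000);
    for (ConsumerRecord<String, String> record : records) {
        process(record); // your processing logic (placeholder)
    }
    consumer.commitSync(); // commit only after the whole batch was processed
}

With auto commit enabled, the loop is the same minus the commitSync() call; poll() handles committing on the configured interval.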
Here are some links you can look at -
https://kafka.apache.org/documentation/#consumerconfigs
https://kafka.apache.org/11/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
Does kafka lose message if consumer holds message longer then auto commit interval time?
Now, on to max.poll.records. With this configuration you can tell the Kafka consumer the maximum number of records it should return on a single call to poll(). Note that you will generally not change the default for this unless your record processing is slow and you want to ensure that your consumer is not considered dead because processing too many records takes too long.
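For that slow-processing case, the tuning would look something like this (the values are illustrative, not recommendations; max.poll.interval.ms exists from Kafka 0.10.1 onward):

props.put("max.poll.records", "50");         // smaller batches, so each batch is processed quickly
props.put("max.poll.interval.ms", "300000"); // maximum allowed gap between two poll() calls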

Is there a way to stop Kafka consumer at a specific offset?

I can seek to a specific offset. Is there a way to stop the consumer at a specific offset? In other words, consume until my given offset. As far as I know, Kafka does not offer such a function. Please correct me if I am wrong.
E.g. a partition has offsets 1-10. I only want to consume from 3-8. After consuming the 8th message, the program should exit.
You're right that Kafka does not offer this function out of the box, but you can achieve it in your consumer code, for example by using commitSync() to control it.
public void commitSync(Map<TopicPartition, OffsetAndMetadata> offsets)
Commit the specified offsets for the specified list of topics and partitions.
This commits offsets to Kafka. The offsets committed using this API will be used on the first fetch after every rebalance and also on startup. As such, if you need to store offsets in anything other than Kafka, this API should not be used. The committed offset should be the next message your application will consume, i.e. lastProcessedMessageOffset + 1.
This is a synchronous commit and will block until either the commit succeeds or an unrecoverable error is encountered (in which case it is thrown to the caller).
Something like this:
boolean goAhead = true;
while (goAhead) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        if (record.offset() > OFFSET_BOUND) {
            // commit the first offset past the bound, then stop consuming
            consumer.commitSync(Collections.singletonMap(
                    new TopicPartition(record.topic(), record.partition()),
                    new OffsetAndMetadata(record.offset())));
            goAhead = false;
            break;
        }
        process(record);
    }
}
You should set enable.auto.commit to false in the code above. In your case, OFFSET_BOUND could be set to 8. Since the committed offset is 9 in your example, the next time the consumer will fetch from that position.
Assuming that partition offsets are continuous (i.e. not log compacted), you could configure your consumer (using the max.poll.records config) so it reads a certain number of records in each poll. This would let you stop at the offset you want.
As far as I know, max.poll.records is a client-side feature; the Kafka fetch protocol only has byte-based limitations (https://kafka.apache.org/protocol#The_Messages_Fetch).
So in general you will read more messages under the hood.

kafka subscribe commit offset manually

I am using Kafka 0.9 and am confused by the behavior of subscribe.
Why does it expect a group.id with subscribe?
Do we need to commit the offset manually using commitSync? Even if I don't do that, I see that it always starts from the latest.
Is there a way to replay the messages from the beginning?
Why does it expects group.id with subscribe?
The concept of consumer groups is used by Kafka to enable parallel consumption of topics - every message will be delivered once per consumer group, no matter how many consumers actually are in that group. This is why the group parameter is mandatory, without a group Kafka would not know how this consumer should be treated in relation to other consumers that might subscribe to the same topic.
Whenever you start a consumer it will join a consumer group; based on how many other consumers are in this consumer group, it will then be assigned partitions to read from. For these partitions it then checks whether a last read offset is known; if one is found, it will start reading messages from this point.
If no offset is found, the parameter auto.offset.reset controls whether reading starts at the earliest or latest message in the partition.
Do we need to commit the offset manually using commitSync? Even if I don't do that, I see that it always starts from the latest.
Whether or not you need to commit the offset depends on the value you choose for the parameter enable.auto.commit. By default this is set to true, which means the consumer will automatically commit its offset regularly (how often is defined by auto.commit.interval.ms). If you set this to false, then you will need to commit the offsets yourself.
This default behavior is probably also what is causing your "problem" where your consumer always starts with the latest message. Since the offset was auto-committed it will use that offset.
Is there a way to replay the messages from the beginning?
If you want to start reading from the beginning every time, you can call seekToBeginning, which will reset to the first message in all subscribed partitions if called without parameters, or just those partitions that you pass in.
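A minimal sketch of the replay (the topic name is a placeholder; note that subscribe() assigns partitions lazily, so you need a poll() or a rebalance listener before the seek takes effect, and that in the 0.9 client seekToBeginning takes varargs rather than a Collection):

import java.util.Collections;

consumer.subscribe(Collections.singletonList("my-topic")); // placeholder topic
consumer.poll(0);                                          // triggers partition assignment
consumer.seekToBeginning(consumer.assignment());           // rewind all assigned partitions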

What determines Kafka consumer offset?

I am relatively new to Kafka. I have done a bit of experimenting with it, but a few things are unclear to me regarding consumer offset. From what I have understood so far, when a consumer starts, the offset it will start reading from is determined by the configuration setting auto.offset.reset (correct me if I am wrong).
Now say for example that there are 10 messages (offsets 0 to 9) in the topic, and a consumer happened to consume 5 of them before it went down (or before I killed the consumer). Then say I restart that consumer process. My questions are:
If the auto.offset.reset is set to earliest, is it always going to start consuming from offset 0?
If the auto.offset.reset is set to latest, is it going to start consuming from offset 5?
Is the behavior regarding this kind of scenario always deterministic?
Please don't hesitate to comment if anything in my question is unclear.
It is a bit more complex than you described.
The auto.offset.reset config kicks in ONLY if your consumer group does not have a valid offset committed somewhere (2 supported offset storages now are Kafka and Zookeeper), and it also depends on what sort of consumer you use.
If you use a high-level java consumer then imagine following scenarios:
You have a consumer in a consumer group group1 that has consumed 5 messages and died. Next time you start this consumer it won't even use that auto.offset.reset config and will continue from the place it died because it will just fetch the stored offset from the offset storage (Kafka or ZK as I mentioned).
You have messages in a topic (like you described) and you start a consumer in a new consumer group group2. There is no offset stored anywhere, and this time the auto.offset.reset config will decide whether to start from the beginning of the topic (earliest) or from the end of the topic (latest).
One more thing that affects what offset values will correspond to the earliest and latest configs is the log retention policy. Imagine you have a topic with retention configured to 1 hour. You produce 5 messages, and then an hour later you produce 5 more. The latest offset will still remain the same as in the previous example, but the earliest one won't be 0, because Kafka will already have removed the first messages; the earliest available offset will thus be 5.
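With newer clients (0.10.1+) you can check what earliest and latest currently resolve to. A sketch (topic and partition are placeholders):

import java.util.Collections;
import java.util.Map;
import org.apache.kafka.common.TopicPartition;

TopicPartition tp = new TopicPartition("my-topic", 0);
Map<TopicPartition, Long> earliest = consumer.beginningOffsets(Collections.singletonList(tp));
Map<TopicPartition, Long> latest = consumer.endOffsets(Collections.singletonList(tp));
System.out.println("earliest=" + earliest.get(tp) + ", latest=" + latest.get(tp));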
None of the above applies to SimpleConsumer; every time you run it, it will decide where to start from using the auto.offset.reset config.
If you use a Kafka version older than 0.9, you have to replace earliest and latest with smallest and largest.
Just an update: from Kafka 0.9 onward, Kafka uses a new Java version of the consumer, and the auto.offset.reset parameter values have changed. From the manual:
What to do when there is no initial offset in Kafka or if the current offset does not exist any more on the server (e.g. because that data has been deleted):
earliest: automatically reset the offset to the earliest offset
latest: automatically reset the offset to the latest offset
none: throw exception to the consumer if no previous offset is found for the consumer's group
anything else: throw exception to the consumer.
I spent some time to find this after checking the accepted answer, so I thought it might be useful for the community to post it.
Furthermore, there's offsets.retention.minutes. If the time since the last commit is greater than offsets.retention.minutes, the committed offset expires and auto.offset.reset kicks in as well.