I have a Java Kafka consumer in which I am fetching ConsumerRecords in a batch to process. The sample code is as follows:
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        doSomeProcessing(record.value());
    }
    consumer.commitAsync();
}

private void doSomeProcessing(String record) {
    // Make an external call to a system which can take a random amount of time
    // for different requests, or time out after 5 seconds.
}
The problem I have is which offset to commit when a later record finishes processing while an earlier record has not yet completed or timed out.
Let's suppose I get 2 records in a batch: the external call for the 1st message is still awaited, while the call for the 2nd has completed. If I wait up to 5 seconds for the external response, consumption from Kafka can become very slow in some cases. If I do not wait for the 1st request to complete before doing another poll, what offset do I commit to Kafka? If I commit 2 and the consumer crashes, the 1st message will be lost, because next time the latest committed offset would be 2.
I think you analyzed the problem correctly, and the answer is probably what you suspect: you can't commit an offset until every offset less than or equal to it has been processed. That's just how Kafka works: it's very much oriented around strong ordering.
The solution is to increase the number of partitions and consumers so you get the parallelism you desire. This is not great from some angles (you need more threads and resources), but at least you get to write synchronous code.
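As a rough illustration of that approach, here is a minimal sketch running one KafkaConsumer per thread in the same consumer group; the bootstrap server, topic name, group id, and thread count are all placeholders:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ParallelConsumers {
    public static void main(String[] args) {
        int numThreads = 4; // should not exceed the partition count
        for (int i = 0; i < numThreads; i++) {
            new Thread(() -> {
                Properties props = new Properties();
                props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
                props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
                props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                        "org.apache.kafka.common.serialization.StringDeserializer");
                props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                        "org.apache.kafka.common.serialization.StringDeserializer");
                // One KafkaConsumer per thread: consumers are not thread-safe,
                // and the group protocol spreads partitions across them.
                try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                    consumer.subscribe(Collections.singletonList("my-topic"));
                    while (true) {
                        ConsumerRecords<String, String> records = consumer.poll(100);
                        for (ConsumerRecord<String, String> record : records) {
                            doSomeProcessing(record.value()); // synchronous, per-partition order preserved
                        }
                        consumer.commitAsync();
                    }
                }
            }).start();
        }
    }

    private static void doSomeProcessing(String value) {
        // external call goes here
    }
}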
Alternatively, you can set up an error pipeline: for a message that fails, commit its offset anyway, push the message to an error queue, and process it later.
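As a sketch of that idea, assuming a hypothetical error topic "my-topic-errors" and an already-configured KafkaProducer<String, String> named errorProducer:

import org.apache.kafka.clients.producer.ProducerRecord;

// Inside the poll loop: failures are routed to an error topic instead of
// blocking the batch, so committing afterwards is safe.
for (ConsumerRecord<String, String> record : records) {
    try {
        doSomeProcessing(record.value());
    } catch (Exception e) {
        // Park the failed record for later reprocessing.
        errorProducer.send(new ProducerRecord<>("my-topic-errors", record.key(), record.value()));
    }
}
// Every record was either processed or parked, so the offsets can be committed.
consumer.commitAsync();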
Related
I am using @KafkaListener with props as:
max.poll.records set to 50 (each record takes 40-60 seconds to process)
enable-auto-commit=false
ack-mode set to MANUAL_IMMEDIATE
Below is the logic:

@KafkaListener(groupId = "ABC", topics = "Data1", containerFactory = "myCustomContainerFactory")
public void listen(ConsumerRecord<String, Object> record, Acknowledgment ack) {
    try {
        process(record);
        ack.acknowledge();
    } catch (Exception e) {
        reprocess(); // pause container and seek
    }
}
Other props like max.poll.interval.ms, session.timeout.ms, and the heartbeat interval are at their default values.
I am not able to understand what's going wrong here.
Suppose 500 messages are published across 2 partitions.
I am not sure why the consumer is not polling records as per the max.poll.records prop; it actually polls all 500 messages as soon as the application starts or the messages are published by the producer.
I have observed that after processing some records, approx. 5-7 minutes in, the consumer re-reads an offset that was already read, processed, and acknowledged.
After an hour, the log file shows that the same messages have been read multiple times.
Any help is appreciated.
Thanks.
The default max.poll.interval.ms is 300,000 milliseconds (5 minutes). With max.poll.records=50 and 40-60 seconds per record, a single batch can take 2,000-3,000 seconds, far exceeding that limit.
You either need to reduce max.poll.records or increase the interval; otherwise Kafka will force a rebalance because it considers the consumer non-responsive, and the redelivered offsets you observe are the result.
With such a large processing time, I would recommend max.poll.records=1; you clearly don't need higher throughput.
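As an illustration, a minimal sketch of the relevant consumer properties (values are examples; tune them to your worst-case batch time):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class SlowConsumerConfig {
    public static Properties consumerProps() {
        Properties props = new Properties();
        // One record per poll, since each record takes 40-60 seconds.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1);
        // Time allowed between polls before the broker assumes the consumer
        // is dead and rebalances; must exceed the worst-case batch time.
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 120_000);
        return props;
    }
}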
I have a use-case regarding consuming records with a Kafka consumer.
For instance,
I have 1 topic which has 1 partition. Currently, it has 10 records and while consuming the first 10 records, another 10 records are written to the partition.
1. myConsumer polls the first time and returns the first 10 records, say offsets 0-9.
2. It processes all the records successfully.
3. It invokes commitAsync() to commit the last offset to Kafka.
4. The commit response is still in flight; it can be a success or a failure.
5. But since the commit is asynchronous, the consumer continues to poll for the next batch.
Now, how does Kafka or the consumer's poll know that it has to read from the 10th position, given that the commitAsync request has not yet completed?
Please help me in understanding this concept.
Committing an offset tells the broker that the consumer has processed the corresponding messages successfully. The consumer itself is aware of its own progress (except at startup, when it fetches its last committed offset from the broker).
At step 5 in your description, the offset commit is in progress. So:
The broker does not know that records 0-9 have been processed.
The consumer itself has read the messages, so it knows it has read messages 0-9 and will read from the 10th onwards next.
Possible Scenarios
Let's say the commit fails for (0-9). If your next batch, say (10-15), is processed and committed successfully, then there is no harm done, since we mark to the broker that processing up to 15 is complete.
Let's say the commit fails for (0-9). Your next batch, (10-15), is processed, and before committing, the consumer goes down. When your consumer is brought back up, it takes its state from the broker (which has no commit for either batch), so it will start reading from the 0th message again.
You can come up with several other scenarios as well. I guess the bottom line is: the importance of the commit comes into the picture when your consumer is restarted for whatever reason and has to get its last processed offset from the Kafka broker.
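One way to make such commit failures visible is to pass a callback to commitAsync(); a minimal sketch, assuming `consumer` is your KafkaConsumer:

// Log failed commits so you know a crash in this window would cause a
// rewind to the previous successfully committed offset.
consumer.commitAsync((offsets, exception) -> {
    if (exception != null) {
        System.err.println("Commit failed for " + offsets + ": " + exception);
    }
});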
I'm using one topic, one partition, one consumer; the Kafka client version is 0.10.
I got two different results:
If I pause the partition first, then produce a message and invoke the resume method, KafkaConsumer can poll the uncommitted message successfully.
But if I produce the message first without committing its offset, then pause the partition and, after several seconds, invoke the resume method, KafkaConsumer does not receive the uncommitted message. I checked on the Kafka server using kafka-consumer-groups.sh; it shows LOG-END-OFFSET minus CURRENT-OFFSET = LAG = 1.
I have been trying to figure this out for two days. I repeated these tests many times and the results are always the same. I need some suggestions, or someone to tell me whether this is Kafka's intended behavior.
For your observation #2: if you restart the application, it will supply all records from the uncommitted offset, i.e. the missing record; and if your consumer again does not commit, the record will be sent again when the application registers the consumer with Kafka upon restart. This is expected.
I assume you are using consumer.poll(), which creates a hybrid streaming interface: it accumulates data coming into Kafka for the duration mentioned and provides it to the consumer for processing once the duration has elapsed. This continuous accumulation happens in the background and does not depend on whether you have committed an offset or not.
From the KafkaConsumer Javadoc:
The position of the consumer gives the offset of the next record that will be given out. It will be one larger than the highest offset the consumer has seen in that partition. It automatically advances every time the consumer receives messages in a call to poll(long).
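The distinction can be seen directly from the API; a minimal sketch, assuming `consumer` is an assigned KafkaConsumer (topic and partition are placeholders):

import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

TopicPartition tp = new TopicPartition("my-topic", 0);
// position(): the next offset to fetch; it advances on every poll(),
// whether or not anything has been committed.
long position = consumer.position(tp);
// committed(): the last offset stored on the broker; it only changes when
// the consumer commits (may be null if nothing was ever committed).
OffsetAndMetadata committed = consumer.committed(tp);
System.out.println("position=" + position + ", committed=" + committed);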
I can seek to a specific offset. Is there a way to stop the consumer at a specific offset? In other words, consume only up to a given offset. As far as I know, Kafka does not offer such a function. Please correct me if I am wrong.
E.g. a partition has offsets 1-10 and I only want to consume from 3 to 8. After consuming the 8th message, the program should exit.
You're right that Kafka does not offer this function out of the box, but you can achieve it in your own consumer code, for example by using commitSync() to control where consumption stops.
public void commitSync(Map<TopicPartition,OffsetAndMetadata> offsets)
Commit the specified offsets for the specified list of topics and partitions.
This commits offsets to Kafka. The offsets committed using this API will be used on the first fetch after every rebalance and also on startup. As such, if you need to store offsets in anything other than Kafka, this API should not be used. The committed offset should be the next message your application will consume, i.e. lastProcessedMessageOffset + 1.
This is a synchronous commit and will block until either the commit succeeds or an unrecoverable error is encountered (in which case it is thrown to the caller).
Something like this:
boolean goAhead = true;
while (goAhead) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        if (record.offset() > OFFSET_BOUND) {
            // Commit lastProcessedMessageOffset + 1: the offset of the first
            // record past the bound, as the Javadoc above requires.
            consumer.commitSync(Collections.singletonMap(
                    new TopicPartition(record.topic(), record.partition()),
                    new OffsetAndMetadata(record.offset())));
            goAhead = false;
            break;
        }
        process(record);
    }
}
You should set "enable.auto.commit" to false in the code above. In your case the OFFSET_BOUND could be set to 8. Since the committed offset in your example is 9, the consumer will fetch from that position next time.
Assuming that partition offsets are continuous (i.e. the topic is not log compacted), you could configure your consumer (using the max.poll.records config) so it reads a certain number of records in each poll. This would let you stop at the offset you want.
As far as I know, max.poll.records is a client-side feature; the Kafka fetch protocol has only byte-based limits (https://kafka.apache.org/protocol#The_Messages_Fetch). So in general you will read more messages under the hood.
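Combining seek() with an explicit stop condition for the 3-8 example above, a minimal sketch (assuming `consumer` and `process` exist; topic and partition are placeholders):

import java.util.Collections;
import org.apache.kafka.common.TopicPartition;

TopicPartition tp = new TopicPartition("my-topic", 0);
consumer.assign(Collections.singletonList(tp));
consumer.seek(tp, 3);          // start at offset 3
long lastWanted = 8;           // stop after offset 8
boolean done = false;
while (!done) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        process(record);
        if (record.offset() >= lastWanted) {
            done = true;       // the 8th message has been consumed; exit
            break;
        }
    }
}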
In our code, we plan to commit offsets manually. Our processing of the data is long-running, so we follow the pattern suggested before:
Read the records
Process the records in their own thread
Pause the consumer
Continue polling the paused consumer so that it stays alive
When the records are processed, commit the offsets
When the commit is done, resume the consumer
The code looks somewhat like this:

Future<?> task = null; // handle to the in-flight processing task
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(kafkaConfig.getTopicPolling());
    if (!records.isEmpty()) {
        task = pool.submit(new ProcessorTask(processor, createRecordsList(records)));
    }
    if (shouldPause(task)) {
        consumer.pause(listener.getPartitions());
    }
    if (isDoneProcessing(task)) {
        consumer.commitSync();
        consumer.resume(listener.getPartitions());
    }
}
If you notice, we commit using commitSync() (without any parameters).
Since the consumer is paused, subsequent iterations would return no records, but the commitSync() would happen later. In that case, which offsets would it try to commit? I have read the Definitive Guide and googled, but cannot find any information about this.
I think we should explicitly save the offsets. But I am not sure if the current code would be an issue.
Any information would be helpful.
Thanks,
Prateek
If you call consumer.commitSync() with no parameters it should commit the latest offset that your consumer has received. Since you can receive many messages in a single poll() you might want to have finer control over the commit and explicitly commit a specific offset such as the latest message that your consumer has successfully processed. This can be done by calling commitSync(Map<TopicPartition,OffsetAndMetadata> offsets)
You can see the syntax for the two ways to call commitSync in the Consumer Javadoc: http://kafka.apache.org/0110/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#commitSync()
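For instance, a minimal sketch of committing a specific offset, assuming `record` is the last message your consumer successfully processed:

import java.util.Collections;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

// The committed value must be the next offset to consume, i.e. the offset
// of the last processed message plus one.
consumer.commitSync(Collections.singletonMap(
        new TopicPartition(record.topic(), record.partition()),
        new OffsetAndMetadata(record.offset() + 1)));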