Relationship between maxPollRecords and autoCommitEnable in Kafka

Can someone please give me a good example of, and explain the relationship between, the Kafka parameters maxPollRecords and autoCommitEnable?

There is no relationship as such between them. Let me explain the two configs.
In Kafka there are two ways a consumer can commit offsets:
1. Manual offset commit: the responsibility for committing offsets lies with the developer.
2. Auto commit (enable.auto.commit): the Kafka consumer takes responsibility for committing offsets for you. On every poll() call you make on the consumer, it checks whether it is time to commit (this is dictated by the auto.commit.interval.ms configuration); if it is, it commits the offsets.
For example, suppose auto.commit.interval.ms is set to 7 seconds and each poll loop iteration (poll plus processing) takes 8 seconds. On each call to poll(), the consumer checks whether the commit interval has elapsed, which in this example it always has, and then commits the offsets of the records returned by the previous poll.
Offsets are also committed during the closing of a consumer.
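The timing rule described above can be sketched with a toy model in plain Java. This is only an illustration of "the interval is checked inside poll()", not the real KafkaConsumer implementation; the class and method names are hypothetical, and time is assumed to start at 0.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of auto-commit timing; hypothetical names, not the Kafka client. */
final class AutoCommitTimeline {

    /**
     * Given the times (ms) at which poll() is called and the value of
     * auto.commit.interval.ms, returns the poll times at which a commit happens.
     */
    static List<Long> commitTimes(long[] pollTimesMs, long autoCommitIntervalMs) {
        List<Long> commits = new ArrayList<>();
        long nextDeadline = autoCommitIntervalMs; // assumption: first commit is due one interval after t = 0
        for (long now : pollTimesMs) {
            if (now >= nextDeadline) {            // the check only happens when poll() is called
                commits.add(now);
                nextDeadline = now + autoCommitIntervalMs;
            }
        }
        return commits;
    }

    public static void main(String[] args) {
        // Interval 7 s, poll every 8 s: every poll finds the interval elapsed.
        System.out.println(commitTimes(new long[] {8000, 16000, 24000}, 7000));
        // Interval 7 s, poll every 3 s: only some polls trigger a commit.
        System.out.println(commitTimes(new long[] {3000, 6000, 9000, 12000, 15000, 18000}, 7000));
    }
}
```

In this model no commit ever happens between polls, which is the point: a consumer that stops calling poll() also stops auto-committing.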
Here are some links you can look at -
https://kafka.apache.org/documentation/#consumerconfigs
https://kafka.apache.org/11/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
Does Kafka lose a message if the consumer holds it longer than the auto-commit interval?
Now, on to max.poll.records. With this configuration you tell the Kafka consumer the maximum number of records you would like it to return from a single call to poll(). You will generally not change the default unless your record processing is slow and you want to ensure that your consumer is not considered dead because it takes too long to process a large batch.
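As a hedged illustration, the two settings simply sit side by side in the consumer configuration. The sketch below only builds a java.util.Properties object using the documented config keys; the broker address, the group name, and the value 100 are assumptions for illustration, and the deserializer settings a real KafkaConsumer needs are omitted.

```java
import java.util.Properties;

/** Builds example consumer settings; values are illustrative assumptions. */
final class ConsumerSettings {

    static Properties build() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.setProperty("group.id", "example-group");           // hypothetical group name
        // Offset committing: let the consumer commit during poll(), at most every 5 s.
        props.setProperty("enable.auto.commit", "true");
        props.setProperty("auto.commit.interval.ms", "5000");
        // Batch size: cap how many records one poll() call may return.
        // Lowered from the default of 500 on the assumption that processing is slow.
        props.setProperty("max.poll.records", "100");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(build());
    }
}
```

The two keys are independent: one controls when offsets are committed, the other how many records each poll() hands back.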

Related

How does Kafka provide the next batch of records to poll when commitAsync fails to commit the offset?

I have a use case regarding consuming records with a Kafka consumer.
For instance:
1. I have 1 topic with 1 partition. Currently it has 10 records, and while the first 10 records are being consumed, another 10 records are written to the partition.
2. myConsumer polls for the first time and returns the first 10 records, say records 0 - 9.
3. It processes all the records successfully.
4. It invokes commitAsync() to commit the last offset to Kafka.
5. The commit response is still pending. It can be a success or a failure.
6. But since it is in asynchronous mode, it continues to poll for the next batch.
Now, how does either Kafka or the consumer's poll know that it has to read from the 10th position, given that the commitAsync request has not yet completed?
Please help me understand this concept.
Committing an offset tells the broker that the consumer has processed the corresponding messages successfully. The consumer itself is aware of its own progress (except at startup, when it gets its last committed offset from the broker).
At step 5 in your description, the offset commit is in progress. So:
The broker does not know that records 0-9 have been processed.
The consumer itself has read the messages, so it knows that it has read records 0-9 and will read from offset 10 onwards next.
Possible scenarios
Let's say the commit fails for (0-9). Your next batch, say (10-15), is processed and committed successfully. Then there is no harm done, since we have marked to the broker that processing up to 15 is complete.
Let's say the commit fails for (0-9). Your next batch, (10-15), is processed and, before committing, the consumer goes down. When your consumer is brought back up, it takes its state from the broker (which does not have a commit for either batch), so it will start reading from message 0.
You can come up with several other scenarios as well. I guess the bottom line is that the commit matters when your consumer is restarted for whatever reason and has to get its last processed offset from the Kafka broker.
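The two scenarios can be walked through with a small model that keeps the consumer's in-memory position separate from the offset stored on the broker. This is a plain-Java sketch with hypothetical names, not the Kafka client; the commit outcome is passed in as a flag to stand in for an asynchronous commit succeeding or failing.

```java
/** Toy model: in-memory position vs. broker-side committed offset (hypothetical names). */
final class OffsetModel {
    private long position = 0;   // next offset to fetch, known only to the running consumer
    private long committed = 0;  // offset stored on the broker for the group

    /** Like poll(): returns {first, last} offsets of the batch and advances the position. */
    long[] poll(int batchSize) {
        long first = position;
        position += batchSize;
        return new long[] {first, position - 1};
    }

    /** Like a commit request: only a successful one changes the broker-side offset. */
    void commit(long nextOffset, boolean succeeds) {
        if (succeeds) {
            committed = nextOffset;
        }
    }

    /** Like a crash and restart: in-memory state is lost, resume from the committed offset. */
    void restart() {
        position = committed;
    }

    long position() { return position; }
    long committed() { return committed; }

    public static void main(String[] args) {
        OffsetModel consumer = new OffsetModel();
        consumer.poll(10);               // records 0-9
        consumer.commit(10, false);      // commit for 0-9 fails
        long[] next = consumer.poll(6);  // still continues with 10-15
        System.out.println(next[0] + "-" + next[1]);
        consumer.restart();              // crash before any commit succeeded
        System.out.println(consumer.position());
    }
}
```

A failed commit never moves the position, so the next poll continues at offset 10; only the restart falls back to whatever the broker has stored.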

How to make Kafka consumer read from last committed offset but not from last consumed offset?

My requirement is simple, yet I am not able to implement it with a plain consumer. I would like to consume records from the last committed offset every time I poll. That is, after I have polled a set of records, if I have not manually committed the offsets for those records, I expect the same set of records to be returned on the next poll. Is this possible with a plain Kafka consumer? FYI, I have already configured my consumer not to auto-commit.
The workaround I currently use is to manually seek to the last committed offset before every poll, but that adds a needless round trip and latency to message processing. Is there an out-of-the-box configuration in the Kafka consumer to achieve what I am expecting?

Does kafka partition assignment happen across processes?

I have a topic with 20 partitions and 3 processes with consumers (with the same group_id) consuming messages from the topic.
But I am seeing a discrepancy: unless one of the processes commits, the consumers in the other processes do not read any messages.
The consumers in the other processes do consume messages when I set auto-commit to true (which is why I suspect the consumers are being assigned to the first partition in each process).
Can someone please help me out with this issue? Also, how do I consume messages in parallel across processes?
If it is of any use, I am doing this on a pod (Kubernetes), where the 3 processes are 3 different mules.
Commit shouldn't make any difference because the committed offset is only used when there is a change in group membership. With three processes there would be some rebalancing while they start up but then when all 3 are running they will each have a fair share of the partitions.
Each time they poll, they keep track in memory of which offset they have consumed on each partition and each poll causes them to fetch from that point on. Whether they commit or not doesn't affect that behaviour.
Autocommit also makes little difference: it just means a commit is done synchronously during a subsequent poll rather than by your application code. The only real reason to commit manually is if you spawn other threads to process messages and so need to avoid committing messages that have not actually been processed. Doing this is generally not advisable; it is better to add consumers to increase throughput than to try to share out processing within a consumer.
One possible explanation is just infrequent polling. You mention that other consumers are picking up partitions and that committing affects behaviour, so I think it is safe to say that rebalances must be happening. Rebalances are caused either by a change in partitions at the broker (presumably not the case) or by a change in group membership, which in turn is caused either by the heartbeat thread dying (a pod being stopped) or by a consumer failing to poll for a long time (default 5 minutes, set by max.poll.interval.ms).
After a rebalance, each partition is assigned to a consumer, and if a previous consumer has ever committed an offset for that partition, the new one will poll from that offset. If not, the new one will poll from either the start of the partition or the high watermark, as set by auto.offset.reset; the default is latest (high watermark).
So, if a consumer polls but doesn't commit and then doesn't poll again for 5 minutes, a rebalance happens and a new consumer picks up the partition and starts from the end (so skipping any messages up to that point). Its first poll will return nothing, as it is starting from the end. If it doesn't poll for 5 minutes, another rebalance happens and the sequence repeats.
That could be the cause - there should be more information about what is going on in your logs - Kafka consumer code puts in plenty of helpful INFO level logging about rebalances.
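The starting-offset rule described in this answer (use the committed offset if one exists, otherwise fall back to auto.offset.reset) can be sketched as a single function. This is a simplified plain-Java model with hypothetical names; it ignores the case where a committed offset is out of range, and it throws a generic IllegalStateException where the real client has its own exception type.

```java
import java.util.OptionalLong;

/** Toy model of where a newly assigned consumer starts reading (hypothetical names). */
final class StartingOffset {

    /**
     * @param committed       offset previously committed for this group and partition, if any
     * @param autoOffsetReset value of auto.offset.reset: "earliest", "latest" or "none"
     * @param logStart        first offset still available in the partition
     * @param logEnd          offset one past the last record (the "end" of the partition)
     */
    static long resolve(OptionalLong committed, String autoOffsetReset, long logStart, long logEnd) {
        if (committed.isPresent()) {
            return committed.getAsLong(); // auto.offset.reset is not consulted at all
        }
        switch (autoOffsetReset) {
            case "earliest":
                return logStart;
            case "latest":
                return logEnd;            // records already in the partition are skipped
            default:
                throw new IllegalStateException(
                        "no committed offset and auto.offset.reset=" + autoOffsetReset);
        }
    }

    public static void main(String[] args) {
        // Never committed, default "latest": the new owner starts at the end.
        System.out.println(resolve(OptionalLong.empty(), "latest", 0, 100));
        // A previous consumer committed 42: the new owner resumes there.
        System.out.println(resolve(OptionalLong.of(42), "latest", 0, 100));
    }
}
```

With no commits and the default of latest, every new owner of the partition begins at the end, which matches the "first poll returns nothing" symptom described above.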

Does Kafka lose a message if the consumer holds it longer than the auto-commit interval?

Say the auto-commit interval is 30 seconds, and the consumer for some reason cannot process a message, holds it for longer than 30 seconds, and then crashes. Does the auto-commit mechanism commit this offset anyway, right before the consumer crashes?
If my assumption is correct, the message is lost, since its offset was committed but the message itself was never processed?
Let's say your consumer group is named Test and it has a single consumer.
When auto-commit is enabled, offsets are committed only during poll() calls and when the consumer is closed.
For example, suppose auto.commit.interval.ms is 5 seconds and each poll loop iteration takes 7 seconds. On every call to poll(), the consumer checks whether the auto-commit interval has elapsed; if it has, as in this example, it commits the offsets.
As for closing the consumer, the commit happens as part of close().
From the documentation -
"Close the consumer, waiting for up to the default timeout of 30 seconds for any needed cleanup. If auto-commit is enabled, this will commit the current offsets if possible within the default timeout".
You can read more about it here -
https://kafka.apache.org/10/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html
Now, on to your question: if poll() is not called again and the consumer is not closed, the offset will not be committed.
If the consumer receives message N, commits it, and then crashes before having fully processed it, then by default the consumer will consider this message processed.
Note that the message is still on the broker, so it can be re-consumed and processed. But that requires some logic in your application to not only restart from the last committed position but also check whether previous records were processed successfully.
If your application typically takes a long time to process messages, you may want to switch from auto commit to manual commit. That way you have better control over when you commit and can avoid this issue.

Kafka Consumer - Poll behaviour

I'm facing some serious problems trying to implement a solution for my needs with KafkaConsumer (>=0.9).
Let's imagine I have a function that has to read just n messages from a Kafka topic.
For example: getMsgs(5) --> gets the next 5 Kafka messages in the topic.
So, I have a loop that looks like this (edited with the actual parameters). In this case the consumer's max.poll.records parameter was set to 1, so the inner loop only iterated once per poll. Different consumers (some of which iterate through many messages) share an abstract parent class (this one), which is why it is coded this way. The numMss part was ad hoc for this consumer.
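// Assumes imports of org.apache.kafka.clients.consumer.ConsumerRecord and ConsumerRecords.
// String key/value types are an assumption; use whatever your deserializers produce.
int numMss = 0;
for (boolean exit = false; !exit; )
{
    ConsumerRecords<String, String> records = consumer.poll(config.pollTime);
    for (ConsumerRecord<String, String> r : records)
    {
        processRecord(r); // do my things
        numMss++;
        if (numMss == maximum) // maximum = 5
        {
            // Stops here even if the batch holds more records; those are never processed.
            exit = true;
            break;
        }
    }
}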
Taking this into account, the problem is that poll() could return more than 5 messages. For example, if it returns 10 messages, my code will forget the other 5 forever, since Kafka will think they have already been consumed.
I tried committing the offset, but it doesn't seem to work:
consumer.commitSync(Collections.singletonMap(partition,
        new OffsetAndMetadata(record.offset() + 1)));
Even with this offset commit, whenever I launch the consumer again it doesn't start from the 6th message (remember, I only wanted 5 messages) but from the 11th (since the first poll consumed 10 messages).
Is there any solution for this, or (most likely) am I missing something?
Thanks in advance!!
You can set max.poll.records to whatever number you like, so that each poll returns at most that many records.
For the use case you describe, you don't have to commit offsets explicitly yourself. You can set enable.auto.commit to true and auto.offset.reset to earliest, so that the reset kicks in when there are no committed offsets for your group.id (in other words, when you are about to start reading from a partition for the very first time). Once your group.id has offsets stored in Kafka, a consumer process that dies and restarts will continue from the last committed offset. This is the default behavior: when a consumer starts, it first looks for committed offsets and, if there are any, continues from the last committed offset, and auto.offset.reset does not kick in.
Have you disabled auto commit by setting enable.auto.commit to false? You need to disable it if you want to commit offsets manually. Otherwise, the next call to poll() can automatically commit the latest offsets of the messages you received from the previous poll().
From Kafka 0.9 the values accepted by the auto.offset.reset parameter have changed:
What to do when there is no initial offset in Kafka or if the current offset does not exist any more on the server (e.g. because that data has been deleted):
earliest: automatically reset the offset to the earliest offset
latest: automatically reset the offset to the latest offset
none: throw exception to the consumer if no previous offset is found for the consumer's group
anything else: throw exception to the consumer.
Set the auto.offset.reset property to "earliest" and then try consuming; you will get records starting from the committed offset.
Alternatively, call consumer.seek(TopicPartition, offset) before poll().
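To make the commit-plus-seek idea concrete for the getMsgs(5) case, here is a toy model in plain Java. It does not use the Kafka client; the class and method names are hypothetical, and the two assignments marked in the comments only stand in for commitSync(...) and seek(...). It is a sketch of the bookkeeping, under the assumption that the commit always succeeds.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of one partition and one consumer; hypothetical names, not the Kafka API. */
final class BoundedReader {
    private final long logEnd;   // offset one past the last record in the partition
    private long position = 0;   // in-memory fetch position (what seek() would change)
    private long committed = 0;  // offset stored for the group; used after a restart

    BoundedReader(long logEnd) {
        this.logEnd = logEnd;
    }

    /** Like poll(): returns up to maxPollRecords offsets and advances the position. */
    List<Long> poll(int maxPollRecords) {
        List<Long> batch = new ArrayList<>();
        while (position < logEnd && batch.size() < maxPollRecords) {
            batch.add(position++);
        }
        return batch;
    }

    /** Processes at most `maximum` records of one poll, then commits and rewinds. */
    List<Long> getMsgs(int maximum, int maxPollRecords) {
        List<Long> batch = poll(maxPollRecords);
        List<Long> processed = new ArrayList<>(batch.subList(0, Math.min(maximum, batch.size())));
        if (!processed.isEmpty()) {
            long next = processed.get(processed.size() - 1) + 1; // last processed offset + 1
            committed = next; // stands in for commitSync(...) with that offset
            position = next;  // stands in for seek(partition, next)
        }
        return processed;
    }

    long position() { return position; }
    long committed() { return committed; }
}
```

For example, with 20 records in the partition and a poll that returns 10, getMsgs(5, 10) processes offsets 0 to 4 and leaves both the committed offset and the position at 5, so the next call starts with offset 5. Setting max.poll.records to the number of records wanted avoids the rewind altogether.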