I'm using commitSync() after processing messages in Kafka. I wanted to know how long commitSync() tries to commit before raising an error. And if it raises an error, will the same messages be polled again later, or are they assumed to be consumed?
If you don't specify a timeout, commitSync() blocks for the duration specified by default.api.timeout.ms. This is 60 seconds by default.
If it fails, that consumer instance will not poll the same messages again; they are considered consumed.
However, if that consumer instance was to crash, a new instance using the same consumer group would restart from the last successfully committed position.
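For reference, newer Java clients (2.0+) also let you pass an explicit timeout to commitSync() so you control the bound yourself. A minimal sketch of what that might look like, assuming a simple String consumer; the broker address, group id, and topic name are placeholders:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.CommitFailedException;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.errors.TimeoutException;

public class CommitSyncTimeoutExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("group.id", "example-group");              // placeholder group
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("example-topic")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                // ... process records here ...
                try {
                    // Bound the commit explicitly instead of relying on default.api.timeout.ms (60s)
                    consumer.commitSync(Duration.ofSeconds(10));
                } catch (TimeoutException | CommitFailedException e) {
                    // The commit failed; this consumer instance will not re-poll these records,
                    // but a restarted instance in the same group would resume from the last
                    // successfully committed offset.
                }
            }
        }
    }
}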
I am using manual Kafka commits by setting the property enable.auto.commit to false when initialising the Kafka consumer, and committing manually after receiving and processing each message.
However, since the message processing in my consumer takes a long time, I am getting an exception with the message "error": "Broker: Group rebalance in progress".
The reason is that a commit issued after the rebalance timeout is rejected with this error. Now, one recovery action is to exit and re-instantiate the process, which will trigger rebalancing and partition assignment again. Another way is to catch this exception and then continue as usual, which works correctly only if the poll() call blocks until the rebalancing is complete; otherwise it would fetch the next batch and might process and commit it successfully, losing the message whose commit failed during the rebalance.
So I need to know the correct way to handle this case: should I re-instantiate the process, or should I catch and ignore the exception?
The best approach is to ignore it if it happens occasionally, and if it happens frequently, reduce max.poll.records or increase max.poll.interval.ms so that it only happens occasionally. Also, ensure that your code can handle duplicate records (if you can't do that, then there is a different answer).
The error you see is, as you probably realise, just because by the time the consumer committed, the group had decided that it had probably gone, so its partitions were picked up by a different consumer as part of a rebalance. The new consumer would have started from the last committed offset, hence duplicates.
Given that the original consumer is alive and well, it will no doubt poll again and so trigger another rebalance. This poll won't block waiting for the rebalance to occur; each poll allows for some communication about the current state of the group (within the polling thread), and after a number of polls the new allocation of partitions is agreed and accepted, at which point the rebalance is considered complete and that poll will tell the consumer its partition allocation and return a set of records.
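In code, catching the failure and carrying on might look roughly like the sketch below; the consumer setup and the processing step are placeholders, and this is only safe if that processing tolerates the occasional duplicate, as noted above:

import java.time.Duration;

import org.apache.kafka.clients.consumer.CommitFailedException;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CatchCommitFailed {
    // `consumer` is assumed to be an already-configured, subscribed KafkaConsumer
    static void pollLoop(KafkaConsumer<String, String> consumer) {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                // placeholder: processing must tolerate occasional duplicates
                System.out.println("processing offset " + record.offset());
            }
            try {
                consumer.commitSync();
            } catch (CommitFailedException e) {
                // The group rebalanced while we were processing; our partitions may now be
                // owned by another consumer, which will re-read from the last committed
                // offset (duplicates, not loss). Just keep polling: the next poll() re-joins
                // the group and returns the new partition assignment.
            }
        }
    }
}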
I'm running a Kafka cluster with 4 nodes, 1 producer and 1 consumer. It was working fine until the consumer failed. Now after I restart the consumer, it starts consuming new messages, but after some minutes it throws this error:
[WARN ]: org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Auto offset commit failed for group eventGroup: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
And it starts consuming the same messages again and loops forever.
I increased the session timeout and tried changing the group id, and it still does the same thing.
Also, is the client version of the Kafka consumer a big deal?
I'd suggest you decouple the consumer and the processing logic, to start with. E.g. let the Kafka consumer only poll messages and, maybe after sanitizing the messages (if necessary), delegate the actual processing of each record to a separate thread, then see if the same error still occurs. The error says you're spending too much time between subsequent polls, so this might resolve your issue. Also, please mention the version of Kafka you're using. Kafka had a different heartbeat management policy before version 0.10, which could make this issue easier to reproduce.
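A rough sketch of that decoupling, using an ExecutorService for the processing work; the pool size, handler, and all names are illustrative rather than taken from the question, and it assumes a reasonably recent client:

import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DecoupledProcessing {
    static void run(KafkaConsumer<String, String> consumer) {
        ExecutorService workers = Executors.newFixedThreadPool(4); // illustrative pool size
        while (true) {
            // The poll loop stays fast because the heavy work is handed off,
            // so the time between poll() calls stays short.
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(200));
            for (ConsumerRecord<String, String> record : records) {
                workers.submit(() -> handle(record));
            }
            // Caveat: with auto-commit enabled (as in the question's log), offsets can be
            // committed before the workers finish, so a crash may skip unprocessed records.
        }
    }

    static void handle(ConsumerRecord<String, String> record) {
        // placeholder for the actual (slow) processing
        System.out.println(record.value());
    }
}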
I'm using one topic, one partition, one consumer, Kafka client version is 0.10.
I got two different results:
If I pause the partition first, then produce a message and invoke the resume method, the KafkaConsumer can poll the uncommitted message successfully.
But if I produce a message first without committing its offset, then pause the partition and, after several seconds, invoke the resume method, the KafkaConsumer does not receive the uncommitted message. I checked on the Kafka server using kafka-consumer-groups.sh; it shows LOG-END-OFFSET minus CURRENT-OFFSET = LAG = 1.
I have been trying to figure this out for two days and have repeated these tests many times; the results are always the same. I need some suggestions, or can someone tell me whether this is Kafka's intended behaviour?
For your observation #2: if you restart the application, it will supply all records from the uncommitted offset, i.e. the missing record, and if your consumer again does not commit, it will be sent again when the application registers the consumer with Kafka upon restart. This is expected.
I'm assuming you are using consumer.poll(), which creates a hybrid streaming interface, i.e. it accumulates data coming into Kafka for the duration given and provides it to the consumer for processing once that duration is finished. This continuous accumulation happens in the background and does not depend on whether you have committed the offset or not.
KafkaConsumer
The position of the consumer gives the offset of the next record that will be given out. It will be one larger than the highest offset the consumer has seen in that partition. It automatically advances every time the consumer receives messages in a call to poll(long).
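One way to observe this yourself is to compare the consumer's position, which poll() advances, with the committed offset, which only moves when you commit. A small sketch under the assumption of manual partition assignment; the topic and partition are placeholders:

import java.time.Duration;
import java.util.Collections;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class PositionVsCommitted {
    static void show(KafkaConsumer<String, String> consumer) {
        TopicPartition tp = new TopicPartition("example-topic", 0); // placeholder topic/partition
        consumer.assign(Collections.singletonList(tp));

        consumer.poll(Duration.ofSeconds(1)); // fetch records but do not commit

        long position = consumer.position(tp);                 // next offset poll() will hand out
        OffsetAndMetadata committed = consumer.committed(tp);  // last committed offset, or null
        // position advances with every poll(); committed only moves when you commit.
        // That gap is the LAG=1 reported by kafka-consumer-groups.sh, and it is why
        // pause()/resume() does not re-deliver records that were already handed out.
        System.out.println("position=" + position + ", committed=" + committed);
    }
}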
I see from the logs that the exact same message is consumed 665 times. Why does this happen?
I also see this in the logs
Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
Consumer properties
group.id=someGroupId
bootstrap.servers=kafka:9092
enable.auto.commit=false
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=org.apache.kafka.common.serialization.StringDeserializer
session.timeout.ms=30000
max.poll.records=20
PS: Is it possible to consume only a specific number of messages, like 10, 50 or 100, out of the 1000 that are in the queue?
I was looking at the 'fetch.max.bytes' config, but it seems to limit the message size rather than the number of messages.
Thanks
The answer lies in the understanding of the following concepts:
session.timeout.ms
heartbeats
max.poll.interval.ms
In your case, your consumer receives a message via poll() but is not able to complete the processing within max.poll.interval.ms. Therefore, the Broker assumes it has hung, and a rebalance of partitions happens, due to which this consumer loses ownership of all partitions. It is marked dead and is no longer part of the consumer group.
Then when your consumer completes the processing and calls poll() again two things happen:
Commit fails as the consumer no longer owns the partitions.
The Broker identifies that the consumer is up again, so a rebalance is triggered; the consumer rejoins the Consumer Group, starts owning partitions, and requests messages from the Broker. Since the earlier message was not marked as committed (see #1 above, the failed commit) and is still pending processing, the Broker delivers the same message to the consumer again.
The consumer again takes a lot of time to process, and since it is unable to finish in less than max.poll.interval.ms, steps 1 and 2 keep repeating in a loop.
To fix the problem, you can increase the max.poll.interval.ms to a large enough value based on how much time your consumer needs for processing. Then your consumer will not get marked as dead and will not receive duplicate messages.
However, the real fix is to check your processing logic and try to reduce the processing time.
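For illustration, the relevant settings might be tuned roughly like this; the values are only examples, not recommendations:

import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;

public class SlowProcessingConfig {
    static Properties consumerProps() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "someGroupId");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        // Give slow processing more headroom before the group evicts the consumer
        // (10 minutes here; the default is 5 minutes, available in clients 0.10.1+).
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "600000");
        // And/or hand out fewer records per poll so each batch finishes sooner.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "5");
        return props;
    }
}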
The fix is described in the message you pasted:
You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
The reason is that a timeout is reached before your consumer is able to process and commit the message. When your Kafka consumer "commits", it is basically acknowledging receipt of the previous message, advancing the offset, and therefore moving on to the next message. But if that timeout has passed (as is the case for you), the consumer's commit isn't effective because it happens too late; the next time the consumer asks for a message, it is given the same message again.
Some of your options are to:
Increase session.timeout.ms=30000, so the consumer has more time to process the messages
Decrease max.poll.records=20 so the consumer has fewer messages to work on before the timeout occurs. But this doesn't really apply to you, because your consumer is already working on only a single message
Or turn on enable.auto.commit, which probably also isn't the best solution for you because it might result in dropping messages, as mentioned below:
If we allowed offsets to auto commit as in the previous example, messages would be considered consumed after they were given out by the consumer, and it would be possible that our process could fail after we have read messages into our in-memory buffer but before they had been inserted into the database.
Source: https://kafka.apache.org/090/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
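If you stay with manual commits, the usual pattern is to commit only after the batch has been durably handled. A minimal sketch, with the persistence step stubbed out as a placeholder:

import java.time.Duration;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CommitAfterProcessing {
    static void run(KafkaConsumer<String, String> consumer) {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                saveToDatabase(record); // placeholder for the real persistence step
            }
            // Commit only after the whole batch is durably handled; a crash before this
            // line replays the batch instead of silently dropping it.
            consumer.commitSync();
        }
    }

    static void saveToDatabase(ConsumerRecord<String, String> record) {
        System.out.println("stored " + record.key()); // illustrative stub
    }
}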
I want to tell Kafka when my consumer has successfully processed a record, so I have turned auto-commit off by setting enable.auto.commit to false. I have two messages on a topic I am subscribed to, at offsets zero and one, and have created a consumer so that each call to poll will return at most one record (by setting max.poll.records to 1).
I now call consumer.poll(5000) and receive the first message but I do not acknowledge it; I do not call commitSync or commitAsync. If I now call consumer.poll(5000) again, using the same consumer, I expect to get the exact same message I just read but, instead, I receive the second message.
How do I get consumer.poll to keep handing out the same message until I explicitly acknowledge it?
What you described is the expected behaviour. Every time you call poll(), it will return the next messages. The offset you commit is only used when connecting a new consumer so it knows where to (re)start from.
In MessageHub, we've set session.timeout.ms to 30 seconds, so you need to call poll() more often than that to avoid being disconnected. If your processing takes longer than that, then I can think of 2 options:
Use Kafka 0.10.2 and set max.poll.interval.ms to tell your Kafka client to keep the session alive (without you having to call poll()) while you process the previous record. (This feature was added in 0.10.1, but we don't support that version; 0.10.2 works because it is able to work with 0.10.0 brokers.)
Use seek() to move back to the previous offset after poll so it keeps returning the same record.
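A rough sketch of option 2, assuming the outcome of your processing decides whether to commit or rewind; the processing method is a placeholder, and it uses the newer poll(Duration) overload, though poll(long) behaves the same way:

import java.time.Duration;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SeekOnFailure {
    static void pollLoop(KafkaConsumer<String, String> consumer) {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                if (tryProcess(record)) {
                    consumer.commitSync();  // acknowledge only on success
                } else {
                    // Rewind so the next poll() hands out this same record again
                    TopicPartition tp = new TopicPartition(record.topic(), record.partition());
                    consumer.seek(tp, record.offset());
                    break; // skip the rest of this batch; it will be re-fetched
                }
            }
        }
    }

    static boolean tryProcess(ConsumerRecord<String, String> record) {
        return true; // placeholder for the real processing outcome
    }
}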
Hope this helps!