Kafka Timeout Exception: Batch Expired - apache-kafka

We are facing the following issue in production:
org.apache.kafka.common.errors.TimeoutException: Batch Expired.
Is it caused by an invalid configuration, such as the batch size, the request timeout, or something else?

The error indicates that some records are put into the queue at a faster rate than they can be sent from the client.
When your producer sends messages, they are stored in a buffer (before being sent to the target broker), and the records are grouped together into batches in order to increase throughput. When a new record is added to a batch, it must be sent within a configurable time window, which is controlled by request.timeout.ms (the default is 30 seconds). If the batch sits in the queue for longer than that, a TimeoutException is thrown, and the batch records are then removed from the queue and will not be delivered to the broker.
Increasing the value of request.timeout.ms should do the trick for you.
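For illustration, a minimal producer-config sketch; the broker address is a placeholder and 60000 is just an example value, not a recommendation:
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerTimeoutConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Give queued batches more time before they expire (the default is 30000 ms)
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, "60000");
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        // ... send records ...
        producer.close();
    }
}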

Related

Kafka org.apache.kafka.common.errors.TimeoutException for some of the records

Getting the below error message when producing a record to Azure Event Hub (Kafka-enabled):
Expiring 14 record(s) for eventhubname: 30125 ms has passed since batch creation plus linger time
Stack used: Azure Event Hub, Spring Kafka.
The below config is present in the Kafka producer config:
props.put(ProducerConfig.RETRIES_CONFIG, "3");
Would like to know if the Kafka producer will retry 3 times in case of the above error message.
ProducerConfig.RETRIES_CONFIG -> This configuration is not needed here. The default retries value is already Integer.MAX_VALUE (2147483647), and the producer automatically retries in case of failure.
Kindly check the following configuration and tune accordingly (a sketch with example values follows the list):
linger.ms=0 (try zero first; this sends the batch request as soon as possible. A non-zero value requires more tuning along with other parameters.)
buffer.memory -> the maximum memory used by the producer for buffering; try increasing this.
max.block.ms -> check this value; it is a probable cause of the TimeoutException. Increase it as your scenario requires.
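For example, extending the props map from the question (the values are illustrative starting points, not recommendations):
props.put(ProducerConfig.LINGER_MS_CONFIG, "0");            // send batches as soon as possible
props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, "67108864"); // 64 MB buffer (default is 32 MB); raise this if the buffer fills up
props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "120000");    // allow send()/metadata lookups to block longer before a TimeoutException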

Spring Kafka Consumer Unable to Rejoin after LeaveGroup Request

We are using Spring Kafka in Production under heavy load.
We have used @KafkaListener annotations and created those listeners as part of Spring Boot services.
Very frequently these consumers send LeaveGroup requests to the coordinator and then hang indefinitely without any log or error. The only option we are left with in that case is to redeploy that particular instance.
This is the series of logs that we see:
Attempt to heartbeat failed since group is rebalancing
Attempt to heartbeat failed since group is rebalancing
This member will leave the group because consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
Member consumer-1-0d0b333d-9e5b-4038-ac3f-e5d59c4e9d19 sending LeaveGroup request to coordinator 172.25.128.233:9092 (id: 2147483560 rack: null)
Additional information:
We are using below Kafka Configs:
Batch Size: 10
Max Poll Interval: 12 mins
Heart Beat: 8 seconds
Max Partition Size: 10 MB
Session Timeout: 25 seconds
Request Timeout: 20 mins
Basically, we want to know: once the consumer leaves the group, why does it not send a JoinGroup request again?
You should understand that the retry suspends the consumer thread (if a BackOffPolicy is used). There are no calls to Consumer.poll() during the retries. Kafka has two properties to determine consumer health. The session.timeout.ms is used to determine if the consumer is active. Since kafka-clients version 0.10.1.0, heartbeats are sent on a background thread, so a slow consumer no longer affects that. max.poll.interval.ms (default: five minutes) is used to determine if a consumer appears to be hung (taking too long to process records from the last poll). If the time between poll() calls exceeds this, the broker revokes the assigned partitions and performs a rebalance. For lengthy retry sequences, with back off, this can easily happen.
Since version 2.1.3, you can avoid this problem by using stateful retry in conjunction with a SeekToCurrentErrorHandler. In this case, each delivery attempt throws the exception back to the container, the error handler re-seeks the unprocessed offsets, and the same message is redelivered by the next poll(). This avoids the problem of exceeding the max.poll.interval.ms property (as long as an individual delay between attempts does not exceed it). So, when you use an ExponentialBackOffPolicy, you must ensure that the maxInterval is less than the max.poll.interval.ms property. To enable stateful retry, you can use the RetryingMessageListenerAdapter constructor that takes a stateful boolean argument (set it to true). When you configure the listener container factory (for @KafkaListener), set the factory’s statefulRetry property to true.
https://docs.spring.io/spring-kafka/reference/html/#stateful-retry
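A rough sketch of wiring a SeekToCurrentErrorHandler into the container factory (Spring Kafka 2.3+ constructor that takes a BackOff; the ConsumerFactory bean and the back-off values are assumptions for the example). With this in place, each failed delivery is re-seeked and retried via a fresh poll(), so long retry sequences do not exceed max.poll.interval.ms:
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.listener.SeekToCurrentErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

@Configuration
public class KafkaRetryConfig {

    // Failed records are re-seeked and redelivered by the next poll(),
    // so retries do not hold up the poll loop past max.poll.interval.ms.
    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        // 2 s between attempts, 3 delivery attempts in total;
        // keep the interval well below max.poll.interval.ms
        factory.setErrorHandler(new SeekToCurrentErrorHandler(new FixedBackOff(2000L, 2L)));
        return factory;
    }
}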

When is the producer triggered to send a request?

If I send just one record on the producer side and wait, when will the producer send the record to the broker?
In the Kafka docs, I found the config called "linger.ms", and it says:
once we get batch.size worth of records for a partition it will be sent immediately regardless of this setting, however if we have fewer than this many bytes accumulated for this partition we will 'linger' for the specified time waiting for more records to show up.
According to the above docs, I have two questions.
If the producer receives data whose size reaches batch.size, will it immediately send a request that contains only one batch to the broker? But as we know, one request can contain many batches, so how does that happen?
Does it mean that even if the accumulated data is less than batch.size, a request will still be sent to the broker after waiting linger.ms?
In Kafka, the lowest unit of sending is a record (a KV pair).
The Kafka producer attempts to send records in batches in order to optimize data transmission, so a single push from the producer to the cluster -- to the broker leader, to be precise -- can contain multiple records.
Moreover, batching always applies only to a given partition. Records produced to different partitions cannot be batched together, though they can form multiple batches within the same request.
There are a few parameters which influence the batching behaviour, as described in the documentation:
buffer.memory - The total bytes of memory the producer can use to buffer records waiting to be sent to the server. If records are sent faster than they can be delivered to the server the producer will block for max.block.ms after which it will throw an exception.
batch.size - The producer will attempt to batch records together into fewer requests whenever multiple records are being sent to the same partition. This helps performance on both the client and the server. This configuration controls the default batch size in bytes. No attempt will be made to batch records larger than this size. Requests sent to brokers will contain multiple batches, one for each partition with data available to be sent.
linger.ms - The producer groups together any records that arrive in between request transmissions into a single batched request. Normally this occurs only under load when records arrive faster than they can be sent out. However in some circumstances the client may want to reduce the number of requests even under moderate load. This setting accomplishes this by adding a small amount of artificial delay—that is, rather than immediately sending out a record the producer will wait for up to the given delay to allow other records to be sent so that the sends can be batched together. This can be thought of as analogous to Nagle's algorithm in TCP. This setting gives the upper bound on the delay for batching: once we get batch.size worth of records for a partition it will be sent immediately regardless of this setting, however if we have fewer than this many bytes accumulated for this partition we will 'linger' for the specified time waiting for more records to show up. This setting defaults to 0 (i.e. no delay). Setting linger.ms=5, for example, would have the effect of reducing the number of requests sent but would add up to 5ms of latency to records sent in the absence of load.
So from the above documentation you can see that linger.ms is an artificial delay that the producer waits when there are not enough bytes to transmit; but if the producer accumulates enough bytes before linger.ms elapses, the request is sent anyway.
On top of that, batching is also influenced by max.request.size:
max.request.size - The maximum size of a request in bytes. This setting will limit the number of record batches the producer will send in a single request to avoid sending huge requests. This is also effectively a cap on the maximum record batch size. Note that the server has its own cap on record batch size which may be different from this.
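Putting those settings together, here is a small illustrative sketch (the broker address, topic name, and all values are assumptions for the example):
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class BatchingDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "16384");         // send a batch as soon as 16 KB accumulate for a partition...
        props.put(ProducerConfig.LINGER_MS_CONFIG, "5");              // ...or after 5 ms, whichever comes first
        props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, "1048576"); // cap on a single request (and effectively on a batch)
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A single record: with linger.ms=5 it waits up to 5 ms for more
            // records to the same partition before the batch is sent.
            producer.send(new ProducerRecord<>("demo-topic", "key", "value")); // hypothetical topic
        }
    }
}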

Why is the kafka consumer consuming the same message hundreds of times?

I see from the logs that the exact same message is consumed 665 times. Why does this happen?
I also see this in the logs
Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
Consumer properties
group.id=someGroupId
bootstrap.servers=kafka:9092
enable.auto.commit=false
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=org.apache.kafka.common.serialization.StringDeserializer
session.timeout.ms=30000
max.poll.records=20
PS: Is it possible to consume only a specific number of messages like 10 or 50 or 100 messages from the 1000 that are in the queue?
I was looking at 'fetch.max.bytes' config, but it seems like it is for a message size rather than number of messages.
Thanks
The answer lies in the understanding of the following concepts:
session.timeout.ms
heartbeats
max.poll.interval.ms
In your case, your consumer receives a message via poll() but is not able to complete processing within max.poll.interval.ms. It is therefore assumed to be hung by the broker, and a rebalancing of partitions happens, due to which this consumer loses ownership of all its partitions. It is marked dead and is no longer part of the consumer group.
Then, when your consumer completes the processing and calls poll() again, two things happen:
The commit fails because the consumer no longer owns the partitions.
The broker identifies that the consumer is up again, so a rebalance is triggered; the consumer rejoins the consumer group, starts owning partitions, and requests messages from the broker. Since the earlier message was not marked as committed (refer to #1 above, the failed commit) and is still pending processing, the broker delivers the same message to the consumer again.
The consumer again takes a long time to process, and since it is unable to finish in less than max.poll.interval.ms, steps 1 and 2 keep repeating in a loop.
To fix the problem, you can increase the max.poll.interval.ms to a large enough value based on how much time your consumer needs for processing. Then your consumer will not get marked as dead and will not receive duplicate messages.
However, the real fix is to check your processing logic and try to reduce the processing time.
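For instance, in Java client terms (assuming a standard java.util.Properties map named props used to build the KafkaConsumer; the numbers are purely illustrative and should be sized to your actual processing time):
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "600000"); // allow up to 10 minutes between poll() calls before a rebalance
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "5");          // fewer records per poll() means less work per interval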
The fix is described in the message you pasted:
You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
The reason is that a timeout is reached before your consumer is able to process and commit the message. When your Kafka consumer "commits", it is basically acknowledging receipt of the previous message, advancing the offset, and therefore moving on to the next message. But if that timeout is passed (as is the case for you), the consumer's commit isn't effective, because it happens too late; then the next time the consumer asks for a message, it is given the same message.
Some of your options are to:
Increase session.timeout.ms=30000, so the consumer has more time to process the messages.
Decrease max.poll.records=20, so the consumer has fewer messages to work on before the timeout occurs. This doesn't really apply to you, though, because your consumer is already working on only a single message.
Or turn on enable.auto.commit, which probably also isn't the best solution for you because it might result in dropping messages, as mentioned below:
If we allowed offsets to auto commit as in the previous example messages would be considered consumed after they were given out by the consumer, and it would be possible that our process could fail after we have read messages into our in-memory buffer but before they had been inserted into the database.
Source: https://kafka.apache.org/090/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
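For completeness, a minimal manual-commit loop sketched against the newer Java client API (the topic name and processing step are placeholders); committing only after the records are fully processed avoids the auto-commit risk described in the quote above:
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualCommitLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "someGroupId");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "20");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("some-topic")); // hypothetical topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // must finish well within max.poll.interval.ms
                }
                consumer.commitSync(); // commit only after the whole batch is processed
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // placeholder for the real work (e.g. the database insert mentioned above)
    }
}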

Apache Kafka: TimeoutException and then nothing works

I am trying to send a message from a producer to a Kafka node in another DC.
Both the producer and consumer are set with the default 0.10.0.0 configuration, and the message sizes are not so small (around 500k).
Most of the time when sending messages I encounter these exceptions:
org.apache.kafka.common.errors.TimeoutException: Batch containing 1 record(s) expired due to timeout while requesting metadata from brokers for topic-0
org.apache.kafka.common.errors.TimeoutException: Failed to allocate memory within the configured max blocking time 60000 ms.
And after that no more messages get transferred (even the callback for remaining messages is not getting called).
Just wanted to chime in because I received the exact same errors today. I tried increasing the request.timeout.ms, decreasing batch.size, and even setting the batch.size to zero. However, nothing worked.
It turned out it was because the server couldn't connect to one of the 10 Kafka cluster nodes. So, what I saw were some inappropriate exceptions being thrown. By the way, we are using Kafka 0.9.0.1 if it matters.
According to Kafka documentation:
A small batch size will make batching less common and may reduce throughput (a batch size of zero will disable batching entirely). A very large batch size may use memory a bit more wastefully as we will always allocate a buffer of the specified batch size in anticipation of additional records.
Set batch.size = 0; it will resolve the issue.
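If you want to try that suggestion, the change is a single producer property (a sketch assuming a standard Properties map named props; note the documentation's caveat above that this disables batching and may reduce throughput):
props.put(ProducerConfig.BATCH_SIZE_CONFIG, "0"); // disable batching entirely; each record is sent on its own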