When do Kafka consumer retries happen? - apache-kafka

There is a retry feature on Kafka clients. I am struggling to find out when a retry happens. Would a retry happen if the connection to the broker is interrupted briefly? How about if the brokers were not reachable for 5 minutes? Will the messages get delivered once the brokers are back up? Or does a retry only happen in scenarios known to the Kafka clients?

The Kafka producer consists of a pool of buffer space that holds records that haven't yet been transmitted to the server, as well as a background I/O thread that is responsible for turning these records into batched requests and transmitting them to the cluster.
For example, if records are sent faster than they can be delivered to the server, the producer will block for max.block.ms, after which it throws an exception. The client then treats the batch as failed and will retry sending it according to the retries config:
org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for my-test-topic-4 due to 30024 ms has passed since batch creation plus linger time
Suppose the retries config is set to 3; if all retries fail, then the batch is lost:
error: Failed to send message after 3 tries
How about if the brokers were not reachable for 5 mins?
If the broker is down and in the meantime the retries are exhausted, then you lose the data.
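As a rough illustration, here is a minimal producer sketch showing the knobs that govern this behavior; the broker address and the values are assumptions, and delivery.timeout.ms (the overall per-record time budget, including retries) only exists on newer Java clients:
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class RetryingProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.RETRIES_CONFIG, 3);                  // how many times a failed batch is retried
        props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 1000);      // pause between retries
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120000); // total time budget per record, incl. retries (2.1+ clients)
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 60000);         // how long send() may block when the buffer is full

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-test-topic", "key", "value"), (metadata, exception) -> {
                if (exception != null) {
                    // invoked only after the client has given up, i.e. retries/timeouts are exhausted
                    System.err.println("Send failed after retries: " + exception);
                }
            });
        }
    }
}
If the brokers stay unreachable for longer than that budget (for example a 5-minute outage), the callback fires with a TimeoutException and the batch is dropped, as described above.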

Related

Messages sent synchronously to Kafka fails when controller broker is restarted

We need to send messages to Kafka synchronously as we cannot afford to lose messages. Also, we can only wait a few seconds for the write to complete. We are using the following config for the producer. During a rolling restart we see timeouts for requests when the controller broker is restarted at the end.
acks=all
request.timeout.ms=200
retries=3
Why are we seeing timeouts during the controller broker restart and not when the other brokers were restarted during the rolling restart?
How much time does it take for a new controller to get elected, considering it is not a big deployment?
Can these timeouts be avoided given the time constraints?
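For reference, a synchronous send with a short, bounded wait (as described above) typically looks like the sketch below; the broker address, topic name, payload, and wait time are assumptions, and request.timeout.ms is taken from the question as-is:
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class SyncSendExample {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("acks", "all");
        props.put("retries", 3);
        props.put("request.timeout.ms", 200); // as in the question; assumed to be milliseconds
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record = new ProducerRecord<>("my-topic", "payload"); // assumed topic
            try {
                // block for a bounded few seconds, matching the "can only wait a few seconds" constraint
                RecordMetadata md = producer.send(record).get(3, TimeUnit.SECONDS);
                System.out.println("written to partition " + md.partition() + " at offset " + md.offset());
            } catch (TimeoutException | ExecutionException e) {
                // this is where the controller-restart timeouts surface to the application
                System.err.println("synchronous send failed: " + e);
            }
        }
    }
}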

When does Kafka throw BufferExhaustedException?

org.apache.kafka.clients.producer.BufferExhaustedException: Failed to allocate memory within the configured max blocking time 5 ms.
This says that the exception is thrown when the producer is unable to allocate memory for the record within the configured max blocking time of 5 ms.
This is the error I get when trying to add Kafka s3-sink connectors. There are 11 topics across two Kafka brokers, and there were already consumers consuming from these topics. I was spinning up a 2-node Kafka Connect cluster with 11 connectors consuming from the same topics, but there was a huge spike in errors when I started these s3-sink connectors. Once I stopped the connectors, the errors dropped and things seemed fine. Then I started them again with fewer tasks, and this time the errors spiked when there was a sudden surge in traffic and went back to normal when the traffic returned to normal. There was a max retry of 5, and messages failed to write even after 5 attempts.
From what I have read, it might be due to the producer batch size or the producer rate being higher than the consumer rate. I guess each consumer could occupy up to 64 MB when there is bursty traffic. Could that be the reason? Should I try increasing the blocking time?
Producer Config:
lingerTime: 0
maxBlockTime: 5
bufferMemory: 1024000
batchSize: 102400
ack: "1"
maxRequestSize: 102400
retries: 1
maxInFlightRequestsPerConn: 1000
It was actually due to an increase in the IOPS on the EC2 instances that the Kafka brokers couldn't handle. Increasing the number of bytes fetched per poll and decreasing the frequency of polls fixed it.
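For an s3-sink running on Kafka Connect, "bytes fetched per poll" and "poll frequency" map to the underlying consumers' fetch settings, which can be overridden in the Connect worker config with the consumer. prefix. A sketch with assumed values (the property names are standard consumer configs, the numbers are illustrative, not the ones actually used):
# fetch bigger chunks per request, and let the broker wait a little to fill them
consumer.fetch.min.bytes=1048576
consumer.fetch.max.wait.ms=500
# more data and more records handed back per poll(), so polls happen less often
consumer.max.partition.fetch.bytes=5242880
consumer.max.poll.records=2000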

Kafka producers failing when one Kafka Broker goes down

We have a Kafka cluster with 4 brokers. We have set up the topic with the configuration
replication.factor=3, min.insync.replicas=2
We noticed that whenever a single broker fails, our producers start failing within 60-90 seconds with the below error
org.apache.kafka.common.errors.TimeoutException: Batch containing 19 record(s) expired due to timeout while requesting metadata from brokers for a-13
[ERROR] ERROR Parser:567 - org.apache.kafka.common.errors.TimeoutException: Batch containing 19 record(s) expired due to timeout while requesting metadata from brokers for a-13
We have the below producer configs on the producer side.
acks=all,
request.timeout.ms=120000
retry.backoff.ms=5000
retries=3
linger.ms=250
max.in.flight.requests.per.connection=2
As per this configuration, shouldn't the producer take at least 6 minutes before failing, since request.timeout.ms is 2 minutes and retries=3?
We do not have unclean leader election enabled. We are running Kafka 2.0 and the producer client version is 0.10.0.1.
We have replica.lag.time.max.ms set to 10s on the brokers. When the issue happened, we noticed that the leader re-election happened within 40 seconds. So I am confused why the producers fail almost instantly when one broker goes down.
I can provide more info if required.
You set acks=all, but did not mention which broker went down.
It sounds like the failed broker hosted one of the topic's partitions, and the ack is failing.
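One way to see what the producer is actually up against is to inspect the leader and in-sync replicas of the affected partition (a-13 in the error above). A minimal sketch using a recent Java AdminClient; the bootstrap address is an assumption:
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class DescribeTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singleton("a")).all().get().get("a");
            // partition 13 ("a-13") is the one named in the TimeoutException above
            desc.partitions().forEach(p ->
                    System.out.println("partition " + p.partition()
                            + " leader=" + p.leader()
                            + " isr=" + p.isr()));
        }
    }
}
If the partition's leader is the broker that went down, or the ISR has shrunk, that lines up with the acks=all sends failing until a new leader is elected.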

Is it possible to log or handle automatic kafka producer retries when acks=all

I know I can set acks=all in the Kafka producer configuration to make the producer wait for acknowledgement from the leader after all replicas receive the message. If the acknowledgement times out, the producer retries sending the message. This happens transparently without requiring any code changes. Is it possible to get some stats on those retries? Is it possible to know which message involved retries and how many? Does Kafka provide any kind of hook to be called before / after a retry so that we can log a message?
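As far as the client API goes, there is no per-retry callback on the producer, but it does expose retry counters through its metrics. A hedged sketch of polling producer.metrics() for the standard record-retry-rate / record-retry-total metrics (the helper name and call site are illustrative assumptions):
import java.util.Map;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

public class RetryMetricsLogger {
    // call this periodically (e.g. from a scheduled task) with your configured producer
    public static void logRetryStats(KafkaProducer<?, ?> producer) {
        Map<MetricName, ? extends Metric> metrics = producer.metrics();
        for (Map.Entry<MetricName, ? extends Metric> entry : metrics.entrySet()) {
            MetricName name = entry.getKey();
            // "record-retry-rate" / "record-retry-total" exist at the producer level and per topic
            if (name.name().startsWith("record-retry")) {
                System.out.println(name.group() + " " + name.tags() + " "
                        + name.name() + " = " + entry.getValue().metricValue());
            }
        }
    }
}
Note that these are aggregate counters (overall and per topic); they will not tell you which individual message was retried.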

Kafka is trying to send messages to a broker in "recovery mode"

I have the following setup
3 Kafka (v2.1.1) Brokers
5 Zookeeper instances
Kafka brokers have the following configuration:
auto.create.topics.enable: 'false'
default.replication.factor: 1
delete.topic.enable: 'false'
log.cleaner.threads: 1
log.message.format.version: '2.1'
log.retention.hours: 168
num.partitions: 1
offsets.topic.replication.factor: 1
transaction.state.log.min.isr: '2'
transaction.state.log.replication.factor: '3'
zookeeper.connection.timeout.ms: 10000
zookeeper.session.timeout.ms: 10000
min.insync.replicas: '2'
request.timeout.ms: 30000
Producer configuration (using Spring Kafka) is more or less as follows:
...
acks: all
retries: Integer.MAX_VALUE
deployment.timeout.ms: 360000ms
enable.idempotence: true
...
I read this configuration as follows: there are three Kafka brokers, but once one of them dies, it is fine as long as at least two replicas persist the data before the ack is sent back (= in-sync replicas). In case of failure, the Kafka producer will keep retrying for 6 minutes, but then gives up.
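For reference, and assuming the deployment.timeout.ms entry above is actually meant to be the producer's delivery.timeout.ms (that mapping is an assumption on my part; deployment.timeout.ms is not a Kafka producer property), the equivalent plain client properties would look roughly like:
acks=all
retries=2147483647
enable.idempotence=true
# assumed mapping: delivery.timeout.ms is the total per-record time budget, including retries
delivery.timeout.ms=360000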
This is the scenario which causes me headache:
All Kafka and Zookeeper instances are up and alive
I start sending messages in chunks (500 pcs each)
In the middle of the processing, one of the Brokers dies (hard kill)
Immediately, I see logs like 2019-08-09 13:06:39.805 WARN 1 --- [b6b45bb5c-7dxh7] o.a.k.c.NetworkClient : [Producer clientId=bla-6b6b45bb5c-7dxh7, transactionalId=bla-6b6b45bb5c-7dxh70] 4 partitions have leader brokers without a matching listener, including [...] (question 1: I do not see any further messages coming in, does this really mean the whole cluster is now stuck and waiting for the dead Broker to come back???)
After the dead Broker starts to boot up again, it starts with recovery of the corrupted index. This operation takes more than 10 minutes as I have a lot of data on the Kafka cluster
Every 30s, the producer tries to send the message again (due to request.timeout.ms property set to 30s)
Since my deployment.timeout.ms is set to 6 minutes and the Broker needs 10 minutes to recover and does not persist the data until then, the producer gives up and stops retrying = I potentially lose the data
The questions are
Why does the Kafka cluster wait until the dead Broker comes back?
When the producer realizes the Broker does not respond, why does it not try to connect to another Broker?
The thread is completely stuck for 6 minutes, waiting until the dead Broker recovers; how can I tell the producer to rather try another Broker?
Am I missing something or is there any good practice to avoid such scenario?
You have a number of questions; I'll take a shot at sharing our experience, which will hopefully shed light on some of them.
In my product, IBM IDR Replication, we had to provide robustness guidance to customers whose topics were being rebalanced, or who had lost a broker in their cluster. The result of some of our testing was that simply setting the request timeout was not sufficient, because in certain circumstances the request would decide not to wait the entire time and would instead perform another retry almost instantly. This burned through the configured number of retries, i.e. there are circumstances where the timeout period is circumvented.
As such we instructed users to utilize a formula like the following...
https://www.ibm.com/support/knowledgecenter/en/SSTRGZ_11.4.0/com.ibm.cdcdoc.cdckafka.doc/tasks/robust.html
"To tune the values for your environment, adjust the Kafka producer properties retry.backoff.ms and retries according to the following formula:
retry.backoff.ms * retries > the anticipated maximum time for leader change metadata to propagate in the cluster
For example, you might wish to configure retry.backoff.ms=300, retries=150 and max.in.flight.requests.per.connection=1."
So maybe try utilizing retries and retry.backoff.ms. Note that utilizing retries without idempotence can cause batches to be written out of order if you have more than one in flight... so choose accordingly based on your business logic.
It was our experience that the Kafka producer writes to the broker that is the leader for the partition, and so you have to wait for the new leader to be elected. When it is, if the retry process is still ongoing, the producer transparently determines the new leader and writes the data accordingly.
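Putting that formula into concrete producer properties (using the example numbers from the quoted IBM documentation; they are not tuned for this particular cluster):
# 300 ms * 150 retries = 45000 ms, so retries continue for roughly 45 seconds,
# which needs to exceed the anticipated leader-change metadata propagation time
retry.backoff.ms=300
retries=150
max.in.flight.requests.per.connection=1
# the question already enables idempotence, which preserves ordering across retries
enable.idempotence=true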