Kafka producers failing when one Kafka Broker goes down

We have a Kafka cluster with 4 brokers. We have set up the topic with the following configuration:
replication.factor=3, min.insync.replicas=2
We noticed that whenever a single broker fails, our producers start failing within 60-90 seconds with the below error
org.apache.kafka.common.errors.TimeoutException: Batch containing 19 record(s) expired due to timeout while requesting metadata from brokers for a-13
[ERROR] ERROR Parser:567 - org.apache.kafka.common.errors.TimeoutException: Batch containing 19 record(s) expired due to timeout while requesting metadata from brokers for a-13
We have the below configs on the producer side; a minimal Java sketch with these settings follows the list.
acks=all,
request.timeout.ms=120000
retry.backoff.ms=5000
retries=3
linger.ms=250
max.in.flight.requests.per.connection=2
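For reference, a minimal Java sketch of a producer configured with these settings (the bootstrap servers, topic name, and serializers below are placeholders, not our actual values):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        // The settings listed above.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 120000);
        props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 5000);
        props.put(ProducerConfig.RETRIES_CONFIG, 3);
        props.put(ProducerConfig.LINGER_MS_CONFIG, 250);
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 2);
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("a-topic", "key", "value")); // placeholder topic
        }
    }
}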
As per the configuration, will the producer take at least 6 minutes before failing, since request.timeout.ms=2 minutes and retries=3?
We do not have unclean leader election enabled. We are running Kafka 2.0 and the producer client version is 0.10.0.1.
We have replica.lag.time.max.ms set to 10s on the brokers. When the issue happened we noticed that the leader re-election completed within 40 seconds, so I am confused about why the producers are failing almost instantly when one broker goes down.
I can provide more info if required.

You set acks=all, but did not mention which broker is down.
Sounds like the failed broker hosted one of the topic's partitions, and the ack is failing.
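One way to check that is to describe the topic and see which partitions the dead broker was leading and whether the ISR shrank. A rough Java AdminClient sketch (the topic name and bootstrap servers below are placeholders):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class DescribeTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singleton("a"))
                    .all().get().get("a"); // "a" is a placeholder topic name
            // Print the leader and in-sync replicas of each partition to see
            // which partitions were led by the broker that went down.
            desc.partitions().forEach(p ->
                    System.out.printf("partition %d leader=%s isr=%s%n",
                            p.partition(), p.leader(), p.isr()));
        }
    }
}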

Related

Would the produced message be copied to all brokers irrespective of the replication factor in kafka

Let's say I have a Kafka cluster with 5 brokers and my replication factor is 3. With this configuration, if I send/produce a message, would it be copied to just 3 nodes, or to all 5 nodes but acknowledged after copying to 3 nodes?
Normally it will be replicated to 3 brokers, but acknowledgement depends on the producer's acks config and the min.insync.replicas config.
acks=0 means no acknowledgement. The producer sends the message and doesn't care whether it reaches the broker. You can lose messages.
acks=1 means leader acknowledgement. The acknowledgement is sent as soon as the leader gets the message, without waiting for the other replicas to replicate it.
acks=all means the acknowledgement is sent once all in-sync replicas have written the message (the leader waits for the in-sync replicas to replicate it).
min.insync.replicas is the minimum number of in-sync replicas required for the broker to accept writes when acks=all.
For example:
If you have 3 brokers, a topic with replication factor 3, and min.insync.replicas=1, then initially the messages you produce are sent to the leader and the 2 followers replicate them. But in case of broker failure or slowness on some brokers, the number of in-sync replicas can drop to just 1. At that point, even with acks=all, your messages are stored only on the leader (until the problem with the brokers is fixed and they catch up with the leader).
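To make the acks=all case concrete, here is a rough Java sketch: if min.insync.replicas is set higher than the current number of in-sync replicas, the broker rejects the write and the send callback gets a NotEnoughReplicasException (bootstrap servers and topic name are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.NotEnoughReplicasException;
import org.apache.kafka.common.serialization.StringSerializer;

public class AcksAllCallbackSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for all in-sync replicas
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value"), (metadata, exception) -> {
                if (exception instanceof NotEnoughReplicasException) {
                    // The ISR shrank below min.insync.replicas, so the broker rejected the write.
                    System.err.println("Not enough in-sync replicas: " + exception.getMessage());
                } else if (exception != null) {
                    System.err.println("Send failed: " + exception);
                } else {
                    System.out.println("Acked at offset " + metadata.offset());
                }
            });
        }
    }
}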
So the minimum recommended configuration to avoid message loss is having 3 brokers and this config:
topic replication factor=3
min.insync.replicas=2
acks=all
But if you want all 3 replicas to acknowledge in any case, then this configuration will be fine:
number of brokers in cluster=5
topic replication factor=3
min.insync.replicas=3
acks=all
With this config you can also tolerate up to 2 broker failures in the cluster.
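If it helps, a rough Java AdminClient sketch for creating a topic with the first recommended configuration (replication factor 3 and min.insync.replicas=2); the topic name, partition count, and bootstrap servers are placeholders:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions is an arbitrary placeholder; replication factor 3 and
            // min.insync.replicas=2 match the recommendation above.
            NewTopic topic = new NewTopic("my-topic", 6, (short) 3)
                    .configs(Collections.singletonMap("min.insync.replicas", "2"));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}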

Kafka is trying to send messages to a broker in "recovery mode"

I have the following setup
3 Kafka (v2.1.1) Brokers
5 Zookeeper instances
Kafka brokers have the following configuration:
auto.create.topics.enable: 'false'
default.replication.factor: 1
delete.topic.enable: 'false'
log.cleaner.threads: 1
log.message.format.version: '2.1'
log.retention.hours: 168
num.partitions: 1
offsets.topic.replication.factor: 1
transaction.state.log.min.isr: '2'
transaction.state.log.replication.factor: '3'
zookeeper.connection.timeout.ms: 10000
zookeeper.session.timeout.ms: 10000
min.insync.replicas: '2'
request.timeout.ms: 30000
The producer configuration (using Spring Kafka) is more or less as follows:
...
acks: all
retries: Integer.MAX_VALUE
deployment.timeout.ms: 360000ms
enable.idempotence: true
...
I read this configuration as follows: there are three Kafka brokers, but once one of them dies, it is fine if at least two of them (the in-sync replicas) replicate and persist the data before sending the ack back. In case of failure, the Kafka producer will keep retrying for 6 minutes, but then gives up.
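A stripped-down sketch of roughly how such a producer can be wired up with Spring Kafka (bootstrap servers are placeholders, and I am treating deployment.timeout.ms above as the standard producer property delivery.timeout.ms):

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;
import org.springframework.kafka.core.DefaultKafkaProducerFactory;
import org.springframework.kafka.core.KafkaTemplate;

public class ProducerFactorySketch {
    public static KafkaTemplate<String, String> kafkaTemplate() {
        Map<String, Object> config = new HashMap<>();
        config.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        config.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        config.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        config.put(ProducerConfig.ACKS_CONFIG, "all");
        config.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        config.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        // Assuming the 6-minute "deployment.timeout.ms" maps to delivery.timeout.ms.
        config.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 360000);
        return new KafkaTemplate<>(new DefaultKafkaProducerFactory<>(config));
    }
}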
This is the scenario which causes me headache:
All Kafka and Zookeeper instances are up and alive
I start sending messages in chunks (500 pcs each)
In the middle of the processing, one of the Brokers dies (hard kill)
Immediately, I see logs like 2019-08-09 13:06:39.805 WARN 1 --- [b6b45bb5c-7dxh7] o.a.k.c.NetworkClient : [Producer clientId=bla-6b6b45bb5c-7dxh7, transactionalId=bla-6b6b45bb5c-7dxh70] 4 partitions have leader brokers without a matching listener, including [...] (question 1: I do not see any further messages coming in, does this really mean the whole cluster is now stuck and waiting for the dead Broker to come back???)
After the dead Broker starts to boot up again, it starts with recovery of the corrupted index. This operation takes more than 10 minutes as I have a lot of data on the Kafka cluster
Every 30s, the producer tries to send the message again (due to request.timeout.ms property set to 30s)
Since my deployment.timeout.ms is set to 6 minutes and the Broker needs 10 minutes to recover and does not persist the data until then, the producer gives up and stops retrying, so I potentially lose the data
The questions are:
Why does the Kafka cluster wait until the dead Broker comes back?
When the producer realizes the Broker does not respond, why does it not try to connect to another Broker?
The thread is completely stuck for 6 minutes, waiting until the dead Broker recovers; how can I tell the producer to try another Broker instead?
Am I missing something or is there any good practice to avoid such scenario?
You have a number of questions; I'll take a shot at sharing our experience, which will hopefully shed light on some of them.
In my product, IBM IDR Replication, we had to provide robustness guidance to customers whose topics were being rebalanced or who had lost a broker in their clusters. The result of some of our testing was that simply setting the request timeout was not sufficient, because in certain circumstances the request would decide not to wait the entire time and would instead perform another retry almost instantly. This burned through the configured number of retries, i.e. there are circumstances where the timeout period is circumvented.
As such we instructed users to utilize a formula like the following...
https://www.ibm.com/support/knowledgecenter/en/SSTRGZ_11.4.0/com.ibm.cdcdoc.cdckafka.doc/tasks/robust.html
"To tune the values for your environment, adjust the Kafka producer properties retry.backoff.ms and retries according to the following formula:
retry.backoff.ms * retries > the anticipated maximum time for leader change metadata to propagate in the cluster
For example, you might wish to configure retry.backoff.ms=300, retries=150 and max.in.flight.requests.per.connection=1."
So maybe try utilizing retries and retry.backoff.ms. Note that utilizing retries without idempotence can cause batches to be written out of order if you have more than one in flight... so choose accordingly based on your business logic.
It was our experience that the Kafka producer writes to the broker that is the leader for the partition, and so you have to wait for the new leader to be elected. When it is, if the retry process is still ongoing, the producer transparently determines the new leader and writes data accordingly.
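As a rough sketch of that formula with the example values above: 300 ms of backoff times 150 retries gives about 45 seconds of retrying, which should comfortably cover a leader re-election that completes in well under a minute. The bootstrap servers below are a placeholder:

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class RetryTuningSketch {
    public static Properties retryTunedProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        // retry.backoff.ms * retries > time for leader-change metadata to propagate:
        // 300 ms * 150 retries = 45,000 ms of backoff across retries.
        props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 300);
        props.put(ProducerConfig.RETRIES_CONFIG, 150);
        // One request in flight so retried batches cannot be reordered
        // (or enable idempotence instead).
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1);
        return props;
    }
}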

Unclean shutdown breaks Kafka cluster

My team has observed that if a broker process dies uncleanly, it blocks producers from sending messages to the Kafka topic.
Here is how to reproduce the problem:
1) Create a Kafka 0.10 cluster with three brokers (A, B and C).
2) Create topic with replication_factor = 2
3) Set the producer to send messages with "acks=all", meaning all replicas must have the message before the producer proceeds to the next one.
4) Force IEM (IBM Endpoint Manager) to send a patch to broker A and force the server to reboot after the patches are installed.
Note: min.insync.replicas = 1
Result:
- Producers are not able to send messages to the Kafka topic after the broker reboots and comes back to join the cluster, with the following error messages.
[2016-09-28 09:32:41,823] WARN Error while fetching metadata with correlation id 0 : {logstash=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
We suspect that the replication_factor (2) is not sufficient for our Kafka environment, but we really need an explanation of what happens when a broker faces an unclean shutdown. The same issue occurred when setting up a cluster with 2 brokers and replication_factor = 1.
The workaround I used to recover the service is to clean up both the Kafka topic log files and the ZooKeeper data (rmr /brokers/topics/XXX and rmr /consumers/XXX).
Thanks,
Anukool

How to configure the time it takes for a kafka cluster to re-elect partition leaders after stopping and restarting a broker?

I have the following setup:
3 Kafka brokers and a 3-node ZooKeeper ensemble
1 topic with 12 partitions and 3 replicas (each kafka broker is thus the leader of 4 partitions)
I stop one of the brokers - it gets removed from the cluster, leadership of its partitions is moved to the two remaining brokers
I start the broker back - it reappears in the cluster, and eventually the leadership gets rebalanced so each broker is the leader of 4 partitions.
It works OK, except I find the time spent before the rebalancing too long (like minutes). This happens under no load - no messages are sent to the cluster, no messages are consumed.
Kafka version 0.9.0.0, zookeeper 3.4.6
zookeeper tickTime = 2000
kafka zookeeper.connection.timeout.ms = 6000
(basically the default config)
Does anyone know what config parameters in Kafka and/or ZooKeeper influence the time taken for the leader rebalancing?
As said in the official documentation http://kafka.apache.org/documentation.html#configuration (more details about broker configuration can be found in the Scala class kafka.server.KafkaConfig),
there is actually a leader.imbalance.check.interval.seconds property, which defaults to 300 (5 minutes); setting it to 30 seconds does what I need.

Partition re-balance on brokers in Kafka 0.8

The relatively scarce documentation for Kafka 0.8 does not mention what the expected behaviour is for balancing existing topics, partitions and replicas across brokers.
More specifically, what is the expected behaviour on arrival of a broker and on crash of a broker (leader or not) ?
Thanks.
I tested those 2 cases a while ago, though not under heavy load. I had one producer sending 10k messages (just a small string each) synchronously to a topic with a replication factor of 2 and 2 partitions, on a cluster of 2 brokers. There are 2 consumers. Each component is deployed on a separate machine. What I observed is:
On normal operation: broker 1 is the leader of partition 1 and a replica of partition 2; broker 2 is the leader of partition 2 and a replica of partition 1. Bringing a broker 3 into the cluster does not trigger a partition rebalance automatically.
On broker revival (crash then reboot): rebalancing is transparent to the producer and consumers. The rebooting broker replicates the log first and then makes itself available.
On broker crash (leader or not): simulated by a kill -9 on any one broker. The producer and consumers freeze until the killed broker's ephemeral node in ZK expires. After that, operations resume normally.