We run 3-node Kafka clusters on 2.7.0 with a rather high number of topics and partitions. Almost all topics have a single partition and a replication factor of 3, which gives us roughly:
topics: 7325
total partitions in the cluster (including replicas): 22110
Brokers are relatively small:
6 vCPU
16 GB memory
500 GB in /var/lib/kafka occupied by partition data
As you can imagine, with 3 brokers and a replication factor of 3 the data is spread very evenly across brokers: under normal circumstances each broker leads roughly the same number of partitions, and the total number of partitions per broker is equal.
Before yesterday's rolling restart everything was in sync. We stopped the process and started it again after 1 minute. It took some 10 minutes for the broker to synchronize with ZooKeeper and start listening on its port.
After logging 'Kafka server started', nothing happens. There is no CPU, memory or disk activity. The partition data is visible on the data disk. There have been no log messages for more than a day now since the process booted up.
We've tried restarting the ZooKeeper cluster (one node at a time) and restarting the broker again. It has now been 24 hours since the last restart and still no change.
The broker itself reports that it leads 0 partitions. Leadership for all of its partitions moved to the other brokers, and they report that every replica located on this broker is not in sync.
I'm aware that the number of partitions per broker far exceeds the recommendation, but I'm still confused by the complete lack of activity or log messages. Any ideas what to check next? It looks like something is stuck somewhere. I checked the Kafka ACLs and there is nothing blocking the broker's username.
I tried another restart with DEBUG logging enabled and it seems there is some problem with the metadata. These two messages keep repeating:
[2022-05-13 16:33:25,688] DEBUG [broker-1-to-controller-send-thread]: Controller isn't cached, looking for local metadata changes (kafka.server.BrokerToControllerRequestThread)
[2022-05-13 16:33:25,688] DEBUG [broker-1-to-controller-send-thread]: No controller defined in metadata cache, retrying after backoff (kafka.server.BrokerToControllerRequestThread)
With kcat it's also impossible to fetch topic metadata if I specify this broker as the bootstrap server.
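To see the same symptom from outside the broker logs, you can ask the stuck broker directly whether it knows about a controller via the AdminClient. A minimal sketch; the bootstrap address is a placeholder:

import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.DescribeClusterResult;
import org.apache.kafka.common.Node;

public class ControllerCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Point only at the suspect broker so the answer reflects its own metadata cache
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            DescribeClusterResult cluster = admin.describeCluster();
            // Normally prints the controller node; with the symptom above this may return null or time out
            Node controller = cluster.controller().get();
            System.out.println("Controller according to this broker: " + controller);
            System.out.println("Nodes known to this broker: " + cluster.nodes().get());
        }
    }
}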
We need to send messages to Kafka synchronously, as we cannot afford to lose messages, and we can only wait a few seconds for a write to complete. We are using the following producer config. During a rolling restart we see request timeouts when the controller broker is restarted at the end.
acks=all
request timeout = 200
retries = 3
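For reference, a synchronous producer using these settings looks roughly like the sketch below; the bootstrap servers and topic name are placeholders, and the 200 is assumed to be request.timeout.ms in milliseconds:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SyncProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, "200");
        props.put(ProducerConfig.RETRIES_CONFIG, "3");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // get() makes the send synchronous: it blocks until the broker acks or the request fails
            producer.send(new ProducerRecord<>("example-topic", "key", "value")).get();
        }
    }
}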
Why are we seeing timeouts during the controller broker restart, and not when the other brokers were restarted earlier in the rolling restart?
How long does it take for a new controller to get elected, considering this is not a big deployment?
Can these timeouts be avoided considering the time constraints?
I have the following setup:
3 Kafka (v2.1.1) Brokers
5 Zookeeper instances
Kafka brokers have the following configuration:
auto.create.topics.enable: 'false'
default.replication.factor: 1
delete.topic.enable: 'false'
log.cleaner.threads: 1
log.message.format.version: '2.1'
log.retention.hours: 168
num.partitions: 1
offsets.topic.replication.factor: 1
transaction.state.log.min.isr: '2'
transaction.state.log.replication.factor: '3'
zookeeper.connection.timeout.ms: 10000
zookeeper.session.timeout.ms: 10000
min.insync.replicas: '2'
request.timeout.ms: 30000
Producer configuration (using Spring Kafka) is more or less as follows:
...
acks: all
retries: Integer.MAX_VALUE
deployment.timeout.ms: 360000ms
enable.idempotence: true
...
I read this configuration as follows: there are three Kafka brokers, but once one of them dies, it is fine as long as at least two replicas persist the data before the ack is sent back (= in-sync replicas). In case of failure, the Kafka producer keeps retrying for 6 minutes, but then gives up.
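For comparison, a plain-Java sketch of the same producer settings; this assumes the 6-minute window is enforced through the standard delivery.timeout.ms producer property (the snippet above calls it deployment.timeout.ms):

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ProducerProps {
    static Properties producerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.toString(Integer.MAX_VALUE));
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        // 6-minute overall delivery window, assuming this is what deployment.timeout.ms maps to
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "360000");
        // 30 s per-request timeout, matching the retry interval described in the scenario below
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, "30000");
        return props;
    }
}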
This is the scenario that causes me a headache:
All Kafka and Zookeeper instances are up and alive
I start sending messages in chunks (500 pcs each)
In the middle of the processing, one of the Brokers dies (hard kill)
Immediately, I see logs like 2019-08-09 13:06:39.805 WARN 1 --- [b6b45bb5c-7dxh7] o.a.k.c.NetworkClient : [Producer clientId=bla-6b6b45bb5c-7dxh7, transactionalId=bla-6b6b45bb5c-7dxh70] 4 partitions have leader brokers without a matching listener, including [...] (question 1: I do not see any further messages coming in; does this really mean the whole cluster is now stuck, waiting for the dead Broker to come back?)
After the dead Broker starts booting up again, it begins recovering its corrupted index. This operation takes more than 10 minutes, as I have a lot of data on the Kafka cluster
Every 30s, the producer tries to send the message again (due to request.timeout.ms property set to 30s)
Since my deployment.timeout.ms is set to 6 minutes and the Broker needs 10 minutes to recover and does not persist the data until then, the producer gives up and stops retrying = I potentially lose the data
The questions are:
Why does the Kafka cluster wait until the dead Broker comes back?
When the producer realizes the Broker does not respond, why does it not try to connect to another Broker?
The thread is completely stuck for 6 minutes, waiting until the dead Broker recovers. How can I tell the producer to try another Broker instead?
Am I missing something, or is there a good practice to avoid such a scenario?
You have a number of questions; I'll take a shot at sharing our experience, which will hopefully shed light on some of them.
In my product, IBM IDR Replication, we had to provide robustness guidance to customers whose topics were being rebalanced, or who had lost a broker in their clusters. The result of some of our testing was that simply setting the request timeout was not sufficient, because in certain circumstances the request would decide not to wait the entire time and would instead perform another retry almost instantly. This burned through the configured number of retries, i.e. there are circumstances where the timeout period is circumvented.
As such, we instructed users to use a formula like the following...
https://www.ibm.com/support/knowledgecenter/en/SSTRGZ_11.4.0/com.ibm.cdcdoc.cdckafka.doc/tasks/robust.html
"To tune the values for your environment, adjust the Kafka producer properties retry.backoff.ms and retries according to the following formula:
retry.backoff.ms * retries > the anticipated maximum time for leader change metadata to propagate in the cluster
For example, you might wish to configure retry.backoff.ms=300, retries=150 and max.in.flight.requests.per.connection=1."
So maybe try using retries and retry.backoff.ms. Note that using retries without idempotence can cause batches to be written out of order if you have more than one request in flight, so choose accordingly based on your business logic.
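As a sketch, the quoted example values map onto producer properties like this; enable.idempotence is an optional addition here to guard against the reordering mentioned above, not part of the quoted recommendation:

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class RobustRetryProps {
    static Properties robustRetryProps() {
        Properties props = new Properties();
        // retry.backoff.ms * retries (300 ms * 150 = 45 s) should exceed the expected
        // time for leader-change metadata to propagate through the cluster.
        props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, "300");
        props.put(ProducerConfig.RETRIES_CONFIG, "150");
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1");
        // Optional: idempotence avoids duplicates and reordering when retries kick in
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        return props;
    }
}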
It was our experience that the Kafka producer writes to the broker that is the leader for the partition, so you have to wait for the new leader to be elected. When it is, and if the retry process is still ongoing, the producer transparently determines the new leader and writes the data accordingly.
Right now we are running Kafka on AWS EC2 instances, and ZooKeeper is also running on separate EC2 instances.
We have created services (systemd units) for Kafka and ZooKeeper to make sure they are started if the server gets rebooted.
The problem is that sometimes the ZooKeeper servers are a little late in starting, and by that time the Kafka brokers have already terminated.
To deal with this we are planning to increase zookeeper.connection.timeout.ms to some high value, like 10 minutes, on the broker side. Is this a good approach?
Are there any side effects of increasing zookeeper.connection.timeout.ms?
Increasing zookeeper.connection.timeout.ms may or may not solve the problem at hand, but there is a possibility that it will make a broker soft failure take longer to detect.
A couple of things you can do:
1) Alter the systemd unit so that Kafka's launch is delayed by roughly the 10 minutes you wanted to put into the ZooKeeper timeout; see the unit sketch after this list.
2) We are using an HDP cluster, which automatically takes care of such scenarios.
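For option 1, a minimal systemd unit sketch; the unit names (kafka.service, zookeeper.service), the install paths and the 10-minute sleep are assumptions about your environment:

# /etc/systemd/system/kafka.service (sketch)
[Unit]
Description=Apache Kafka broker
# Start only after the ZooKeeper unit on this host has been started
After=network.target zookeeper.service
Wants=zookeeper.service

[Service]
Type=simple
# Crude delay so ZooKeeper has time to come up before the broker connects
ExecStartPre=/bin/sleep 600
ExecStart=/opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
Restart=on-failure
RestartSec=30

[Install]
WantedBy=multi-user.target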
Here is an explanation from Kafka FAQs:
During a broker soft failure, e.g., a long GC, its session on ZooKeeper may timeout and hence be treated as failed. Upon detecting this situation, Kafka will migrate all the partition leaderships it currently hosts to other replicas. And once the broker resumes from the soft failure, it can only act as the follower replica of the partitions it originally leads.
To move the leadership back to the brokers, one can use the preferred-leader-election tool here. Also, in 0.8.2 a new feature will be added which periodically trigger this functionality (details here).
To reduce Zookeeper session expiration, either tune the GC or increase zookeeper.session.timeout.ms in the broker config.
https://cwiki.apache.org/confluence/display/KAFKA/FAQ
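If you are on Kafka 2.4 or newer, the preferred leader election mentioned in the FAQ can also be triggered programmatically through the AdminClient instead of the standalone tool; a sketch, with a placeholder bootstrap address:

import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.ElectionType;

public class PreferredLeaderElection {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // A null partition set asks the controller to run preferred leader election for all partitions
            admin.electLeaders(ElectionType.PREFERRED, null).partitions().get();
        }
    }
}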
Hope this helps
I have a Kafka environment with 2 brokers and 1 ZooKeeper node.
While producing messages to Kafka, if I stop broker 1 (the leader), the client stops producing messages and gives me the error below, although broker 2 has been elected as the new leader for the topic and its partitions.
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.
After 10 minutes, since broker 2 is the new leader, I expected the producer to send data to broker 2, but it kept failing with the exception above. lastRefreshMs and lastSuccessfullRefreshMs are still the same, although metadataExpireMs is 300000 for the producer.
I am using the new Kafka producer implementation on the producer side.
It seems that when the producer is initialized, it binds to one broker, and if that broker goes down it does not even try to connect to another broker in the cluster.
My expectation is that if a broker goes down, the producer should check the metadata for other available brokers and send the data to them.
By the way, my topic has 4 partitions and a replication factor of 2. I mention this in case it is relevant.
Configuration parameters:
{request.timeout.ms=30000, retry.backoff.ms=100, buffer.memory=33554432, ssl.truststore.password=null, batch.size=16384, ssl.keymanager.algorithm=SunX509, receive.buffer.bytes=32768, ssl.cipher.suites=null, ssl.key.password=null, sasl.kerberos.ticket.renew.jitter=0.05, ssl.provider=null, sasl.kerberos.service.name=null, max.in.flight.requests.per.connection=5, sasl.kerberos.ticket.renew.window.factor=0.8, bootstrap.servers=[10.201.83.166:9500, 10.201.83.167:9500], client.id=rest-interface, max.request.size=1048576, acks=1, linger.ms=0, sasl.kerberos.kinit.cmd=/usr/bin/kinit, ssl.enabled.protocols=[TLSv1.2, TLSv1.1, TLSv1], metadata.fetch.timeout.ms=60000, ssl.endpoint.identification.algorithm=null, ssl.keystore.location=null, value.serializer=class org.apache.kafka.common.serialization.ByteArraySerializer, ssl.truststore.location=null, ssl.keystore.password=null, key.serializer=class org.apache.kafka.common.serialization.ByteArraySerializer, block.on.buffer.full=false, metrics.sample.window.ms=30000, metadata.max.age.ms=300000, security.protocol=PLAINTEXT, ssl.protocol=TLS, sasl.kerberos.min.time.before.relogin=60000, timeout.ms=30000, connections.max.idle.ms=540000, ssl.trustmanager.algorithm=PKIX, metric.reporters=[], compression.type=none, ssl.truststore.type=JKS, max.block.ms=60000, retries=0, send.buffer.bytes=131072, partitioner.class=class org.apache.kafka.clients.producer.internals.DefaultPartitioner, reconnect.backoff.ms=50, metrics.num.samples=2, ssl.keystore.type=JKS}
Use Case:
1- Start BR1 and BR2, produce data (leader is BR1)
2- Stop BR2, produce data (fine)
3- Stop BR1 (which means there is no active broker in the cluster at this point), then start BR2 and produce data (failed, although the leader is BR2)
4- Start BR1, produce data (the leader is still BR2, but data is produced fine)
5- Stop BR2 (now BR1 is leader)
6- Stop BR1 (BR1 is still leader)
7- Start BR1, produce data (messages are produced fine again)
If the producer sent its latest successful data to BR1 and then all brokers go down, the producer expects BR1 to come up again, even though BR2 is up and is the new leader. Is this expected behaviour?
After spending hours I figured out Kafka's behaviour in my situation. Maybe this is a bug, or maybe it needs to be done this way for reasons that lie under the hood, but if I were implementing it I wouldn't do it this way :)
When all brokers go down and you are able to bring up only one broker, it must be the broker that went down last in order to produce messages successfully.
Let's say you have 5 brokers: BR1, BR2, BR3, BR4 and BR5. If all go down and the last broker to die was BR3 (which was the last leader), then even if you start BR1, BR2, BR4 and BR5, it will not help unless you also start BR3.
You need to increase the number of retries.
In your case you need to set it to >=5.
That is the only way for your producer to know that your cluster has a new leader.
Besides that, make sure that all your brokers have a copy of your partition(s). Otherwise you aren't going to get a new leader.
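Against the configuration posted in the question (which has retries=0 and acks=1), the retry-related part of the fix would look roughly like this sketch:

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class RetryProps {
    static Properties retryOverrides() {
        Properties props = new Properties();
        // The posted config uses retries=0, so a failed send is never re-attempted and
        // the producer never gets a chance to pick up the newly elected leader.
        props.put(ProducerConfig.RETRIES_CONFIG, "5");
        props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, "100"); // same backoff as in the posted config
        return props;
    }
}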
In the latest Kafka versions, when a broker that leads a partition used by a producer goes down, the producer will retry; when it catches a retriable exception it needs to refresh its metadata. The new metadata can be fetched from the least loaded node (leastLoadedNode), so the producer learns about the new leader and can write there.
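To watch this from the client side, a send callback can distinguish retriable errors (the kind the producer retries and refreshes metadata for) from fatal ones. A minimal sketch; the broker address and topic name are placeholders:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.RetriableException;
import org.apache.kafka.common.serialization.StringSerializer;

public class RetriableErrorLogger {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.RETRIES_CONFIG, "5");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("example-topic", "value"), (metadata, exception) -> {
                if (exception == null) {
                    System.out.println("Written to " + metadata.topic() + "-" + metadata.partition());
                } else if (exception instanceof RetriableException) {
                    // A retriable error: the producer retried (and refreshed metadata) before giving up
                    System.out.println("Retriable error: " + exception);
                } else {
                    System.out.println("Fatal error: " + exception);
                }
            });
            producer.flush();
        }
    }
}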