Kafka broker occasionally takes much longer than usual to load logs on startup - apache-kafka

We are observing that Kafka brokers occasionally take much longer to load logs on startup than usual. Much longer in this case means 40 minutes instead of at most 1 minute. This happens during a rolling restart following the procedure described by Confluent, after the broker has reported that controlled shutdown was successful.
Kafka Setup
Confluent Platform 5.5.0
Kafka Version 2.5.0
3 Replicas (minimum 2 in sync)
Controlled broker shutdown enabled
1TB of AWS EBS for Kafka log storage
Other potentially useful information
We make extensive use of Kafka Streams
We use exactly-once processing and transactional producers/consumers
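For context, the exactly-once part of that setup is just the standard Streams configuration; a rough sketch, where the application id and bootstrap servers are placeholders rather than our real values:

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

// Sketch only: placeholder application id and bootstrap servers.
Properties streamsProps = new Properties();
streamsProps.put(StreamsConfig.APPLICATION_ID_CONFIG, "example-streams-app");
streamsProps.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
streamsProps.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
streamsProps.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3);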
Observations
It is not always the same broker that takes a long time.
It does not only occur when the broker is the active controller.
A log partition that loads quickly (15 ms) can take a long time (9549 ms) for the same broker a day later.
We experienced this issue before on Kafka 2.4.0 but after upgrading to 2.5.0 it did not occur for a few weeks.
Does anyone have an idea what could be causing this? Or what additional information would be useful to track down the issue?

Related

Records associated with a Kafka batch listener are not consumed for some partitions after several rebalances (resiliency testing)

A few weeks ago my project was updated to use Kafka 3.2.1 instead of the version that ships with Spring Boot 2.7.3 (3.1.1). We made this upgrade to avoid an issue in Kafka Streams: illegal state and argument exceptions were not reaching the uncaught exception handler.
On the consumer side, we also moved to the cooperative sticky assignor.
In parallel, we started some resiliency tests, and we started to have issues with Kafka records that are no longer consumed on some partitions when using a Kafka batch listener. The issue occurred after several rebalances caused by the test (deployment is done in Kubernetes, and we stopped some pods, microservices, and broker instances). The issue is not present on every listener. The Kafka brokers and microservices are up and running.
During our investigations,
we enabled Kafka events and we can clearly see that the consumer is started
we can see in the logs that the partitions that are not consuming events are assigned.
debug has been enabled on the KafkaMessageListenerContainer. We see a lot of occurrences of Receive: 0 records and Commit list: {}
Are there any blocking points to using Kafka 3.2.1 with Spring Boot 2.7.3 / Spring Kafka 2.8.8?
Any help or other advice to progress our investigation is more than welcome.
Multiple listeners are defined; the retry seems to be fired from another listener (a shared error handler?).
This is a known bug, fixed in the next release:
https://github.com/spring-projects/spring-kafka/issues/2382
https://github.com/spring-projects/spring-kafka/commit/3de1e89ba697ead04de171cfa35273bb0daddbe6
A temporary workaround is to give each container its own error handler.
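A rough sketch of that workaround, assuming two ordinary batch container factories; the bean names, generic types, and back-off values below are placeholders:

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

@Configuration
public class ListenerContainerConfig {

    // One DefaultErrorHandler instance per factory, so retries fired for one
    // listener cannot leak into another through a shared handler instance.
    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> firstBatchFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        factory.setBatchListener(true);
        factory.setCommonErrorHandler(new DefaultErrorHandler(new FixedBackOff(1000L, 2L)));
        return factory;
    }

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> secondBatchFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        factory.setBatchListener(true);
        factory.setCommonErrorHandler(new DefaultErrorHandler(new FixedBackOff(1000L, 2L)));
        return factory;
    }
}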

Flink Kafka integration issues

Context:
We built a fraud-detection style app.
The business logic is all fine, but when put into production, the Kafka cluster becomes unstable.
The topic it writes to receives approximately 80 events/sec. After running for more than an hour, the Kafka broker offsets get corrupted,
which causes consumers on this broker to fail.
We haven't found any issues during load testing in non-production environments.
We are not sure what is causing the issue.
Topic config: single partition, replication 3.
Kafka version - 2.1
Flink version - 1.11.2-scala-2.12
We are using the same version for the Kafka consumer and Kafka producer.
Any suggestions as to what could be the cause are welcome.

Kafka : Failed to update metadata after 60000 ms with only one broker down

We have a Kafka producer configured as:
metadata.broker.list=broker1:9092,broker2:9092,broker3:9092,broker4:9092
serializer.class=kafka.serializer.StringEncoder
request.required.acks=1
request.timeout.ms=30000
batch.num.messages=25
message.send.max.retries=3
producer.type=async
compression.codec=snappy
The replication factor is 3 and the total number of partitions is currently 108.
The rest of the properties are defaults.
This producer was running absolutely fine. Then, for some reason, one of the brokers went down, and our producer started logging
"Failed to update metadata after 60000 ms". Nothing else was in the log; we were only seeing this error. At intervals, a few requests were getting blocked, even though the producer was async.
The issue was resolved when the broker was up and running again.
What can be the reason for this? One broker being down should not affect the system as a whole, as per my understanding.
Posting the answer for anyone who might face this issue:
The reason is an older version of the Kafka producer. Kafka producers take the bootstrap servers as a list. In older versions, the producer tries to connect to all the servers in round-robin fashion to fetch metadata. So, if one of the brokers is down, the requests going to that server fail and this message appears.
Solution:
Upgrade to a newer producer version.
Alternatively, reduce the metadata.fetch.timeout.ms setting: this ensures the main thread is not blocked and send fails sooner. The default value is 60000 ms. This is not needed in newer versions.
Note: the Kafka send method blocks until the producer is able to write to its buffer.
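For illustration, here is roughly what the equivalent setup looks like on the newer Java producer; the broker names, topic, and max.block.ms value are placeholders. max.block.ms caps how long send() may block waiting for metadata or buffer space, which replaces metadata.fetch.timeout.ms:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class NewProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker list; mirrors the old metadata.broker.list setting.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "1");
        props.put(ProducerConfig.RETRIES_CONFIG, 3);
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");
        // Cap how long send() may block waiting for metadata or free buffer space.
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 10000);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("example-topic", "key", "value"));
        }
    }
}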
I got the same error because I forgot to create the topic. Once I created the topic the issue was resolved.

Kafka streams 1.0: processing timeout with high max.poll.interval.ms and session.timeout.ms

I am using a stateless processor with Kafka Streams 1.0 and Kafka broker 1.0.1.
The problem is that the CustomProcessor gets closed every few seconds, which results in a rebalance signal. I am using the following configs:
session.timeout.ms=15000
heartbeat.interval.ms=3000 // set it to 1/3 session.timeout
max.poll.interval.ms=Integer.MAX_VALUE // set this large because I am doing intensive computational operations that can take up to 10 minutes to process 1 Kafka message (NLP operations)
max.poll.records=1
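In a Streams app these consumer values are typically passed through the consumer prefix; a minimal sketch with the values from above, where the application id and bootstrap servers are placeholders:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

// Sketch only: how the timeouts above reach the embedded consumer of a Streams app.
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "nlp-processor");   // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");  // placeholder
props.put(StreamsConfig.consumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), 15000);
props.put(StreamsConfig.consumerPrefix(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG), 3000);
props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), Integer.MAX_VALUE);
props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_RECORDS_CONFIG), 1);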
Despite this configuration and my understanding of how the Kafka timeout configurations work, I see the consumer rebalancing every few seconds.
I have already gone through the articles and Stack Overflow questions below about how to tune long-running operations and avoid a very long session timeout that makes failure detection late, but I still see the unexpected behavior, unless I misunderstand something.
KIP-62
Diff between session.timeout.ms and max.poll.interval
Kafka kstreams processing timeout
For the consumer environment setup, I have 8 machines with 16 cores each, consuming from 1 topic with 100 partitions, and I am following the practice this Confluent doc here recommends.
Any pointers?
I figured it out. After lots of debugging and enabling verbose logging for both the Kafka Streams client and the broker, it turned out to be 2 things:
There is a critical bug in Streams 1.0.0 (HERE), so I upgraded my client version from 1.0.0 to 1.0.1.
I updated the value of the consumer property default.deserialization.exception.handler from org.apache.kafka.streams.errors.LogAndFailExceptionHandler to org.apache.kafka.streams.errors.LogAndContinueExceptionHandler.
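A minimal sketch of that second change, with all other Streams settings omitted:

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.errors.LogAndContinueExceptionHandler;

// Log and skip records that fail deserialization instead of failing the stream thread.
Properties props = new Properties();
props.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG,
          LogAndContinueExceptionHandler.class.getName());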
After the above 2 changes, everything has been running perfectly with no restarts. I am using Grafana to monitor restarts, and for the past 48 hours not a single restart has happened.
I might do more troubleshooting to determine which of the 2 items above was the real fix, but I am in a hurry to deploy to production, so if anybody is interested in starting from there, go ahead; otherwise, once I have time I will do further analysis and update the answer!
So happy to get this fixed!!!

Maximum value for zookeeper.connection.timeout.ms

Right now we are running Kafka on AWS EC2 servers, and ZooKeeper is also running on separate EC2 instances.
We have created services (systemd units) for Kafka and ZooKeeper to make sure they are started in case the server gets rebooted.
The problem is that sometimes the ZooKeeper servers are a little late in starting, and by that time the Kafka brokers have terminated.
So to deal with this issue we are planning to increase zookeeper.connection.timeout.ms to some high number like 10 minutes on the broker side. Is this a good approach?
Are there any side effects of increasing the zookeeper.connection.timeout.ms timeout?
Increasing zookeeper.connection.timeout.ms may or may not handle the problem at hand, but there is a possibility that it will take longer to detect a broker soft failure.
A couple of things you can do:
1) Alter the systemd unit so that the Kafka launch is delayed by 10 minutes (the time you wanted to put in the ZooKeeper timeout).
2) We are using an HDP cluster, which automatically takes care of such scenarios.
Here is an explanation from the Kafka FAQ:
During a broker soft failure, e.g., a long GC, its session on ZooKeeper may timeout and hence be treated as failed. Upon detecting this situation, Kafka will migrate all the partition leaderships it currently hosts to other replicas. And once the broker resumes from the soft failure, it can only act as the follower replica of the partitions it originally leads.
To move the leadership back to the brokers, one can use the preferred-leader-election tool here. Also, in 0.8.2 a new feature will be added which periodically trigger this functionality (details here).
To reduce Zookeeper session expiration, either tune the GC or increase zookeeper.session.timeout.ms in the broker config.
https://cwiki.apache.org/confluence/display/KAFKA/FAQ
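For reference, both timeouts are broker-side settings in server.properties; the values below are purely illustrative, not recommendations:
zookeeper.connection.timeout.ms=60000
zookeeper.session.timeout.ms=18000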
Hope this helps