Kafka: Failed to update metadata after 60000 ms with only one broker down

We have a Kafka producer configured as follows:
metadata.broker.list=broker1:9092,broker2:9092,broker3:9092,broker4:9092
serializer.class=kafka.serializer.StringEncoder
request.required.acks=1
request.timeout.ms=30000
batch.num.messages=25
message.send.max.retries=3
producer.type=async
compression.codec=snappy
The replication factor is 3 and the total number of partitions is currently 108.
The rest of the properties are left at their defaults.
This producer was running absolutely fine. Then, for some reason, one of the brokers went down, and our producer started to log:
"Failed to update metadata after 60000 ms". There was nothing else in the log, only this error. At intervals, a few requests were getting blocked, even though the producer was async.
The issue was resolved when the broker was up and running again.
What could be the reason for this? As I understand it, one broker being down should not affect the system as a whole.

Posting the answer for anyone who might face this issue:
The reason is an older version of the Kafka producer. Kafka producers take the bootstrap servers as a list. In older versions, when fetching metadata, the producer tries to connect to all the servers in round-robin fashion. So if one of the brokers is down, the requests that go to that server fail and this message appears.
Solution:
Upgrade to a newer producer version.
Alternatively, reduce the metadata.fetch.timeout.ms setting: this ensures the main thread is not blocked for long and send fails sooner. The default value is 60000 ms. This is not needed in newer versions.
Note: the Kafka send method blocks until the producer is able to write to its buffer.
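To make the effect of the timeout concrete, here is a minimal toy model in Python. It is not the real client's logic: `fetch_metadata`, `is_up`, and `attempt_cost_ms` are hypothetical names, and time is simulated rather than measured. It only illustrates why a dead broker in the rotation can stall a send, and why a smaller timeout makes the failure surface sooner.

```python
from itertools import cycle


class MetadataTimeout(Exception):
    """Raised when no broker answered within the allowed (simulated) time."""


def fetch_metadata(brokers, is_up, timeout_ms, attempt_cost_ms):
    """Toy model: try brokers round-robin until one answers or time runs out.

    No real networking or sleeping happens here. Every attempt against a
    broker that is down is charged attempt_cost_ms of simulated time.
    Returns (broker, simulated_wait_ms) on success.
    """
    if not brokers:
        raise ValueError("at least one broker is required")
    elapsed_ms = 0
    for broker in cycle(brokers):
        if is_up(broker):
            return broker, elapsed_ms
        elapsed_ms += attempt_cost_ms
        if elapsed_ms >= timeout_ms:
            raise MetadataTimeout(
                f"Failed to update metadata after {timeout_ms} ms"
            )


if __name__ == "__main__":
    brokers = ["broker1:9092", "broker2:9092", "broker3:9092", "broker4:9092"]
    down = {"broker1:9092"}
    # One broker down: the caller waits for one failed attempt, then succeeds.
    print(fetch_metadata(brokers, lambda b: b not in down, 60000, 30000))
```

In this model, with only broker1 down the call loses one failed attempt before succeeding on broker2. If every broker is down, a 60000 ms limit gives up after the second attempt, while a 5000 ms limit gives up after the first.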

I got the same error because I forgot to create the topic. Once I created the topic the issue was resolved.

Records associated to a Kafka batch listener are not consumed for some partitions after several rebalances (resiliency testing)

A few weeks ago my project was updated to use Kafka 3.2.1 instead of the version that comes with Spring Boot 2.7.3 (3.1.1). We made this upgrade to avoid an issue in Kafka Streams where illegal state and illegal argument exceptions were not ending up in the uncaught exception handler.
On the consumer side, we also moved to the cooperative sticky assignor.
In parallel, we started some resiliency tests and began to have issues with Kafka records that are no longer consumed on some partitions when using a Kafka batch listener. The issue occurred after several rebalances caused by the test (the deployment is in Kubernetes and we stopped some pods, microservices and broker instances). The issue is not present on every listener. The Kafka brokers and microservices are up and running.
During our investigation:
We enabled Kafka events and can clearly see that the consumer is started.
We can see in the logs that the partitions that are not consuming events are assigned.
We enabled debug logging on the KafkaMessageListenerContainer and see a lot of occurrences of Receive: 0 records and Commit list: {}.
Are there any blocking issues with using Kafka 3.2.1 with Spring Boot 2.7.3 and Spring Kafka 2.8.8?
Any help or advice is more than welcome as we continue our investigation.
Multiple listeners are defined; the retry seems to be fired from another listener (shared error handler?).
This is a known bug, fixed in the next release:
https://github.com/spring-projects/spring-kafka/issues/2382
https://github.com/spring-projects/spring-kafka/commit/3de1e89ba697ead04de171cfa35273bb0daddbe6
A temporary workaround is to give each container its own error handler.

Confluent.Kafka.KafkaException: Broker: Specified group generation id is not valid

Environment
3-node Kafka Cluster
Amazon MSK
v2.3
1 topic
6 partitions
1 consumer group with 2 consumers
Running in Kubernetes
Confluent .NET SDK 1.2.2
All default settings, except for bootstrap.servers and group.id.
Problem
First, one of my consumers encounters the following exception.
Confluent.Kafka.KafkaException: Broker: Specified group generation id is not valid
at Confluent.Kafka.Impl.SafeKafkaHandle.Commit(IEnumerable`1 offsets)
at Confluent.Kafka.Consumer`2.Commit(IEnumerable`1 offsets)
The exception is trapped and the consumer is supposed to retry, but instead the app sits idle. The container is still up and running, but not consuming any more messages.
What's weirder is that the broker never reassigns that consumer's partitions so the consumer lag on those partitions begins to grow. It seems like the consumer is both alive (since the broker is not reassigning its partitions) and dead (since it cannot commit its offset or consume more messages). If we intervene and manually restart the consumers then the partitions are reassigned and the situation goes back to normal.
I'm not entirely sure what to make of the exception above. Google doesn't offer much. The most relevant lead I have is this issue in GitHub, which involves a broker restarting. To my knowledge, that is not happening in my situation. Any assistance would be greatly appreciated.
At least I have found a solution that works for me.
In my code I was committing manually and had set EnableAutoCommit = false.
Somehow a commit could be executed twice for the same offset. I removed the manual commits on the consumer and set EnableAutoCommit = true.
After that it worked.
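For anyone who has to keep manual commits, one possible way to rule out repeated commits of the same offset is a small guard around the commit call. The sketch below is a Python toy, not the Confluent .NET API: `CommitGuard` and `commit_fn` are hypothetical names, and the thread does not establish that duplicate commits were really the root cause of the exception.

```python
class CommitGuard:
    """Remembers the highest offset committed per partition and skips repeats.

    commit_fn is a hypothetical callable (partition, offset) -> None that
    performs the real commit with whatever client library is in use.
    """

    def __init__(self, commit_fn):
        self._commit_fn = commit_fn
        self._last_committed = {}  # partition -> highest committed offset

    def commit(self, partition, offset):
        """Commit only if offset is newer than the last one for this partition.

        Returns True if the commit was forwarded, False if it was skipped.
        """
        if offset <= self._last_committed.get(partition, -1):
            return False
        self._commit_fn(partition, offset)
        # Record the offset only after the underlying commit did not raise.
        self._last_committed[partition] = offset
        return True


if __name__ == "__main__":
    guard = CommitGuard(lambda p, o: print(f"commit partition={p} offset={o}"))
    guard.commit(0, 10)
    guard.commit(0, 10)  # skipped: same offset as before
    guard.commit(0, 11)
```

The guard only records an offset after the underlying commit returns without raising, so a failed commit can still be retried.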

Kafka broker occasionally takes much longer than usual to load logs on startup

We are observing that Kafka brokers occasionally take much more time than usual to load logs on startup. Much longer in this case means 40 minutes instead of at most 1 minute. This happens during a rolling restart following the procedure described by Confluent, after the broker has reported that controlled shutdown was successful.
Kafka Setup
Confluent Platform 5.5.0
Kafka Version 2.5.0
3 Replicas (minimum 2 in sync)
Controlled broker shutdown enabled
1TB of AWS EBS for Kafka log storage
Other potentially useful information
We make extensive use of Kafka Streams
We use exactly-once processing and transactional producers/consumers
Observations
It is not always the same broker that takes a long time.
It does not only occur when the broker is the active controller.
A log partition that loads quickly (15 ms) can take a long time (9549 ms) on the same broker a day later.
We experienced this issue before on Kafka 2.4.0 but after upgrading to 2.5.0 it did not occur for a few weeks.
Does anyone have an idea what could be causing this? Or what additional information would be useful to track down the issue?

Kafka INVALID_FETCH_SESSION_EPOCH

We are using a Kafka broker setup with a Kafka Streams application that runs using Spring Cloud Stream Kafka. Although it seems to run fine, we do get the following error statements in our log:
2019-02-21 22:37:20,253 INFO kafka-coordinator-heartbeat-thread | anomaly-timeline org.apache.kafka.clients.FetchSessionHandler [Consumer clientId=anomaly-timeline-56dc4481-3086-4359-a8e8-d2dae12272a2-StreamThread-1-consumer, groupId=anomaly-timeline] Node 2 was unable to process the fetch request with (sessionId=1290440723, epoch=2089): INVALID_FETCH_SESSION_EPOCH.
I searched the internet but there is not much information on this error. I guessed that it could have something to do with a difference in time settings between the broker and the consumer, but both machines have the same timeserver settings.
Any idea how this can be resolved?
The concept of a fetch session was introduced by KIP-227 in the 1.1.0 release: https://cwiki.apache.org/confluence/display/KAFKA/KIP-227%3A+Introduce+Incremental+FetchRequests+to+Increase+Partition+Scalability
Kafka brokers that are replica followers fetch messages from the leader. To avoid sending full metadata for all partitions each time, only the partitions that have changed are sent within the same fetch session.
Looking at Kafka's code, we can see one case where this error is returned:
if (session.epoch != expectedEpoch) {
  info(s"Incremental fetch session ${session.id} expected epoch $expectedEpoch, but " +
    s"got ${session.epoch}. Possible duplicate request.")
  new FetchResponse(Errors.INVALID_FETCH_SESSION_EPOCH, new FetchSession.RESP_MAP, 0, session.id)
} else {
src: https://github.com/axbaretto/kafka/blob/ab2212c45daa841c2f16e9b1697187eb0e3aec8c/core/src/main/scala/kafka/server/FetchSession.scala#L493
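As a rough illustration of that check, here is a simplified Python model. It assumes the broker keeps one expected epoch per session and advances it by one after each accepted request. The real broker code is more involved (the epoch wraps around, for example), and `FetchSession` and `handle_incremental_fetch` are hypothetical names here, not Kafka APIs.

```python
from dataclasses import dataclass

# Error names as plain strings; only the first one appears in the thread.
INVALID_FETCH_SESSION_EPOCH = "INVALID_FETCH_SESSION_EPOCH"
NONE = "NONE"


@dataclass
class FetchSession:
    session_id: int
    epoch: int  # epoch the broker expects on the next incremental fetch


def handle_incremental_fetch(session, request_epoch):
    """Accept the request only if its epoch matches what the session expects.

    Simplification (assumption): an accepted request bumps the epoch by one.
    A duplicate or out-of-order request therefore carries a stale epoch.
    """
    if request_epoch != session.epoch:
        return INVALID_FETCH_SESSION_EPOCH
    session.epoch += 1
    return NONE


if __name__ == "__main__":
    session = FetchSession(session_id=1290440723, epoch=2089)
    print(handle_incremental_fetch(session, 2089))  # first request is accepted
    print(handle_incremental_fetch(session, 2089))  # same epoch sent again
```

In this model the second call is rejected because the session has already moved on to the next epoch, which matches the "Possible duplicate request" hint in the broker log message.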
In general, if you don't have thousands of partitions and, at the same time, this doesn't happen very often, then it shouldn't worry you.
It seems this might be caused by KAFKA-8052, which was fixed in Kafka 2.3.0.
Indeed, you can get this message when log rolling or retention-based deletion occurs, as zen pointed out in the comments. It's not a problem if it doesn't happen all the time. If it does, check your log.roll and log.retention configurations.
Updating the client version to 2.3 (the same version as the broker) fixed it for me.
In our case, the root cause was a Kafka broker/client incompatibility. If your cluster is behind the client version, you might see all kinds of odd problems such as this.
Our Kafka broker is on 1.x.x and our Kafka consumer was on 2.x.x. As soon as we downgraded our spring-cloud-dependencies to Finchley.RELEASE, our problem was solved.
dependencyManagement {
    imports {
        mavenBom "org.springframework.cloud:spring-cloud-dependencies:Finchley.RELEASE"
    }
}

Kafka: What happens when the entire Kafka Cluster is down?

We're testing out the Producer and Consumer using Kafka. A few questions:
What happens when all the brokers are down and they're not responding at all?
Does the Producer need to keep pinging the Kafka brokers to know when it is back up online? Or is there a more elegant way for the Producer application to know?
How does Zookeeper help in all this? What if the ZK is down as well?
If one or more brokers are down, the producer will retry for a certain period of time (based on its settings). During this time, one or more of the consumers will not be able to read anything until the respective brokers are back up.
But if the cluster is down for longer than your total retry period, you will probably need to find a way to resend those failed messages.
This is one scenario where Kafka mirroring (the MirrorMaker tool) comes into the picture.
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=27846330
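Separately from mirroring, the "find a way to resend" part can be handled in the application. The following is a minimal Python sketch under stated assumptions: `send` is a hypothetical callable that raises ConnectionError while the cluster is unreachable, and failed messages are parked in an in-memory list. A real application would more likely persist them and would use its client library's own retry settings and error types.

```python
import time


def send_with_retries(send, message, max_retries, backoff_s, failed,
                      sleep=time.sleep):
    """Try send(message) up to 1 + max_retries times.

    If every attempt raises ConnectionError, the message is appended to
    `failed` so it can be replayed later. Returns True on success.
    """
    for attempt in range(max_retries + 1):
        try:
            send(message)
            return True
        except ConnectionError:
            if attempt < max_retries:
                sleep(backoff_s)  # wait before the next attempt
    failed.append(message)
    return False


def resend_failed(send, failed):
    """Replay parked messages; keep the ones that still fail.

    Returns the number of messages that are still parked afterwards.
    """
    still_failed = []
    for message in failed:
        try:
            send(message)
        except ConnectionError:
            still_failed.append(message)
    failed[:] = still_failed
    return len(still_failed)
```

The `sleep` parameter is only there so that the backoff can be replaced in tests. Calling `resend_failed` once the brokers are reachable again drains the parked list in its original order.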
Producers will fail because the cluster is unavailable. This means they will get a non-retriable error from the Kafka client implementation and, depending on your client process, messages will buffer in the local send queue of your application.
I'm sure that if ZooKeeper is down your system will not work anymore. This is one of Kafka's weaknesses: it needs ZooKeeper to work.