Kafka Streams applications run for a few days, then "WARN broker may not be available"

We are running around 30 Kafka Streams applications on Kubernetes. They usually run fine for a few days, but after some seemingly random time some of the applications enter a state where they log:
2021-11-05 09:44:01,183 [1-producer] WARN o.a.k.c.NetworkClient - [Producer clientId=appId-cd74cfbc-4c6a-49b4-823d-ac21beae2c27-StreamThread-1-producer] Connection to node -1 (********) could not be established. Broker may not be available.
2021-11-05 09:44:01,183 [amThread-1] WARN o.a.k.c.NetworkClient - [Consumer clientId=appId-cd74cfbc-4c6a-49b4-823d-ac21beae2c27-StreamThread-1-consumer, groupId=groupId] Connection to node -2 (******) could not be established. Broker may not be available.
2021-11-05 09:44:01,183 [amThread-1] WARN o.a.k.c.NetworkClient - [Consumer clientId=appId-cd74cfbc-4c6a-49b4-823d-ac21beae2c27-StreamThread-1-consumer, groupId=groupId] Bootstrap broker ****** (id: -2 rack: null) disconnected
2021-11-05 09:44:01,183 [1-producer] WARN o.a.k.c.NetworkClient - [Producer clientId=cd74cfbc-4c6a-49b4-823d-ac21beae2c27-StreamThread-1-producer] Bootstrap broker ****** (id: -1 rack: null) disconnected
These warnings are spammed to the log roughly every 2 minutes. Last time this kept going for 2 days before we realized some apps had stopped processing (I know, we are working on setting up monitoring). The applications won't recover until manually restarted, and since they never throw an error, nothing ever gets propagated to the liveness probe.
Does anyone know of a config we can set on the kafka-streams library so that an application stops spamming warnings after a certain number of retries and instead throws an error, so we can fail and restart automatically?
Also worth mentioning: the brokers are up and running, and other streams applications are processing fine against the same Kafka cluster.
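Since the stuck threads never throw, one workaround (not from this thread, just a sketch) is to track processing liveness yourself and expose it to the Kubernetes liveness probe. The class below is a minimal heartbeat: a processor would call `recordProgress()` per record, and the probe endpoint would call `isAlive()`. The wiring into Kafka Streams and the threshold value are assumptions to adapt to your traffic.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: a processing "heartbeat" a Kubernetes liveness probe can check.
// Call recordProgress() from a processor each time a record is handled;
// the probe handler calls isAlive().
public class ProcessingHeartbeat {
    private final AtomicLong lastProcessedMs = new AtomicLong(System.currentTimeMillis());
    private final long staleThresholdMs;

    public ProcessingHeartbeat(long staleThresholdMs) {
        this.staleThresholdMs = staleThresholdMs;
    }

    public void recordProgress() {
        lastProcessedMs.set(System.currentTimeMillis());
    }

    // Returns false once no record has been processed for longer than the
    // threshold, so the liveness probe fails and Kubernetes restarts the pod.
    public boolean isAlive() {
        return System.currentTimeMillis() - lastProcessedMs.get() < staleThresholdMs;
    }
}
```

Note this only works for topics with steady traffic; for quiet topics the threshold must be well above the longest expected gap between records.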

Related

Kafka producer does not signal that all brokers are unreachable

When all brokers/nodes of a cluster are unreachable, the error in the Kafka producer callback is a generic "Topic XXX not present in metadata after 60000 ms".
When I activate the DEBUG log level, I can see that all attempts to deliver the message to any node are failing:
DEBUG org.apache.kafka.clients.NetworkClient - Initialize connection to node node2.url:443 (id: 2 rack: null) for sending metadata request
DEBUG org.apache.kafka.clients.NetworkClient - Initiating connection to node node2.url:443 (id: 2 rack: null) using address node2.url:443/X.X.X.X:443
....
DEBUG org.apache.kafka.clients.NetworkClient - Disconnecting from node 2 due to socket connection setup timeout. The timeout value is 16024 ms.
DEBUG org.apache.kafka.clients.NetworkClient - Initialize connection to node node0.url:443 (id: 0 rack: null) for sending metadata request
DEBUG org.apache.kafka.clients.NetworkClient - Initiating connection to node node0.url:443 (id: 0 rack: null) using address node0.url:443/X.X.X.X:443
....
DEBUG org.apache.kafka.clients.NetworkClient - Disconnecting from node 0 due to socket connection setup timeout. The timeout value is 17408 ms.
and so on, until, after the delivery timeout, the send() callback gets the error:
ERROR my.kafka.SenderClass - Topic XXX not present in metadata after 60000 ms.
Unlike with the bootstrap URL, all nodes could be unreachable, for example because of wrong DNS entries or whatever.
How can the application understand that all nodes were unreachable? This is traced only as DEBUG info and is not available to the producer send() callback.
Such error detail at the application level would speed up troubleshooting.
Errors like this are usually signaled clearly by standard SOAP/REST web service interfaces.
The producer only cares about the bootstrap servers for its initial metadata fetch and about the leaders of the partitions it needs to write to. That being said, it doesn't need to know about "all" brokers.
How can the application understand that all nodes were not reachable?
If you set acks=1 or acks=all, then the callback should know that at least one broker had the data written. If not, there was some error.
You can use an AdminClient outside of the producer client to describe the topic(s) and fetch metadata about the partition leaders, then use standard TCP socket requests from Java to try to reach those advertised listeners.
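The TCP check suggested above could look like the sketch below: a plain socket connect against a broker's advertised host and port (as obtained, for example, from AdminClient#describeCluster). The class and method names are illustrative, not from any Kafka API.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Sketch: probe a broker's advertised listener with a raw TCP connect.
// Returns false if the socket cannot be opened within the timeout.
public class BrokerProbe {
    public static boolean isReachable(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true; // TCP handshake succeeded; the port is open
        } catch (IOException e) {
            return false; // refused, timed out, or DNS failure
        }
    }
}
```

A successful connect only proves the port is open, not that a healthy Kafka broker is behind it, but it is enough to distinguish "node unreachable" from the generic metadata timeout.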
FWIW, port 443 should ideally be reserved for HTTPS traffic, not Kafka. Kafka is not a REST/SOAP service.

Kafka client keeps reconnecting after a disconnect, can this be disabled?

The client will keep reconnecting by itself again and again; can I disable this?
Should I close Kafka client?
[2022-06-30 16:17:51,332] WARN [Consumer clientId=consumer-xxx, groupId=xxx] Bootstrap broker xxx:xxx (id: -1 rack: null) disconnected (org.apache.kafka.clients.NetworkClient)
[2022-06-30 16:17:52,387] WARN [Consumer clientId=consumer-pg-hudi001_5-1, groupId=pg-hudi001_5] Connection to node -1 (xxx) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
The consumer will constantly poll and periodically refresh its metadata about which brokers are available in the cluster.
The only way to stop it from trying to connect is to close() the consumer, yes.
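While the reconnect loop cannot be disabled, the client configs `reconnect.backoff.ms` and `reconnect.backoff.max.ms` can at least stretch the retry interval so the log is spammed less often. A sketch with illustrative values:

```java
import java.util.Properties;

public class BackoffConfig {
    // You can't turn the reconnect loop off; closing the client is the only
    // way to stop it. But you can slow it down with these client configs
    // (the values below are examples, not recommendations):
    public static Properties consumerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("reconnect.backoff.ms", "1000");      // initial backoff per host
        props.put("reconnect.backoff.max.ms", "30000"); // exponential backoff cap
        return props;
    }
}
```

The backoff increases exponentially per failed host up to the configured maximum, which turns a warning every few seconds into one every half minute.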

Kafka consumer should fail on "Bootstrap broker disconnected"

When a Kafka consumer cannot access the bootstrap broker it indefinitely tries to reconnect with the following message:
WARN NetworkClient - [Consumer clientId=consumer-testGroup-1, groupId=testGroup] Connection to node -1 (localhost/127.0.0.1:9999) could not be established. Broker may not be available.
WARN NetworkClient - [Consumer clientId=consumer-testGroup-1, groupId=testGroup] Bootstrap broker localhost:9999 (id: -1 rack: null) disconnected
What I want is for the consumer to throw an exception and abort execution. In the docs I couldn't find a property to limit the retries.
Is there a recommended way to implement this behaviour or a property I overlooked?
I am using the KafkaReceiver class from project reactor.

Auto restart Quarkus Microservice after broker unavailability

I have a very simple Quarkus microservice which uses SmallRye Reactive Messaging (Kafka). Sometimes my Kafka broker goes down, and I get the following logs:
2020-09-24 04:04:27,067 WARN [org.apa.kaf.cli.NetworkClient] (kafka-producer-network-thread | producer-1) [Producer clientId=producer-1] Bootstrap broker xxxxxxx.xxxx.xxx:2202 (id: -1 rack: null) disconnected
2020-09-24 04:04:27,083 WARN [org.apa.kaf.cli.NetworkClient] (kafka-producer-network-thread | producer-3) [Producer clientId=producer-3] Connection to node -1 (xxxxx.xxxx.xxxx.fr/XX.XX.XX.XXX:2202) could not be established. Broker may not be available.
After the broker has been restarted, I have to manually restart my microservice. Is it possible to add the capability for the microservice to resume consuming new incoming messages without any manual action?
Thank you!
If you are using the KafkaProducer and Consumer APIs, they automatically reconnect once the broker is up again.
Please ensure that your application does not throw an exception and kill the thread. If you keep the thread alive, it will reconnect. Catch all exceptions in the consumer thread to ensure it does not exit due to a runtime exception.
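The advice above, keeping the thread alive by catching everything, can be sketched as a generic retry loop. Here `pollOnce` is a hypothetical stand-in for the real `consumer.poll(...)` plus record processing; only the wrapping pattern is the point.

```java
// Sketch of "catch all exceptions so the consumer thread keeps running":
// a failing poll/process step is logged and retried instead of killing
// the thread. pollOnce is a hypothetical stand-in for the real
// consumer.poll(...) + processing code.
public class ResilientLoop {
    public interface Step { void pollOnce() throws Exception; }

    // Runs a fixed number of iterations for demonstration; a real service
    // would loop until an external shutdown flag is set.
    public static int run(Step step, int iterations) {
        int attempts = 0;
        for (int i = 0; i < iterations; i++) {
            attempts++;
            try {
                step.pollOnce();
            } catch (Exception e) {
                // Log and continue: the thread survives, and the Kafka client
                // will reconnect on the next poll once the broker is back.
                System.err.println("poll failed, retrying: " + e.getMessage());
            }
        }
        return attempts;
    }
}
```

Be deliberate about which exceptions are truly retryable; swallowing everything can also hide fatal misconfiguration, so log at a level your monitoring watches.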

Continuous warnings in Kafka server.log

I started ZooKeeper and the Kafka server on a Linux machine. Both servers started successfully. After that I created a "test" topic. Everything works fine using the console producer and console consumer. But when I try to send an event from a remote machine to the Kafka server, it fails. After some investigation I added the below configuration to server.properties:
listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://xx.xx.xx.xx:9092
Then I got the below continuous warning in the Kafka logs:
[2018-05-25 14:48:27,685] WARN [Controller id=0, targetBrokerId=0] Connection to node 0 could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
and this exception in controller.log:
[2018-05-25 14:48:27,583] WARN [RequestSendThread controllerId=0] Controller 0's connection to broker xx.xx.xx.xx:9092 (id: 0 rack: null) was unsuccessful (kafka.controller.RequestSendThread)
java.io.IOException: Connection to xx.xx.xx.xx:9092 (id: 0 rack: null) failed.
at org.apache.kafka.clients.NetworkClientUtils.awaitReady(NetworkClientUtils.java:70)
at kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:271)
at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:225)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
Please help me solve this issue.
Thanks in advance!