kafka-streams: alert on Kafka connection failure

When a kafka-streams app is running and Kafka suddenly goes down, the app enters a "waiting" mode: the consumer and producer threads log warnings about not being able to connect, and when Kafka comes back, everything should (theoretically) return to normal.
I'm trying to get an alert for this situation, but I can't find a place to catch it and emit a log/metric.
I tried the following:
streams.setUncaughtExceptionHandler, but this fires only when an exception is thrown, which is not the case here
extending ProductionExceptionHandler and setting the default.production.exception.handler property to my class that implements this interface. Again, as with setUncaughtExceptionHandler, no exception is thrown here, so nothing really happens. (A minimal sketch of both attempts is shown below.)
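Assuming the pre-2.8 API (the Thread.UncaughtExceptionHandler overload of setUncaughtExceptionHandler) and placeholder topic names and application id, the two attempts look roughly like this; neither hook fires on a quiet disconnect because no exception ever reaches them:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.errors.DefaultProductionExceptionHandler;
import java.util.Properties;

public class HandlerAttempts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-kafka-streams-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Attempt 2: a custom handler would go here instead of the default one;
        // it is only consulted when a send actually throws, which a quiet
        // disconnect never does.
        props.put(StreamsConfig.DEFAULT_PRODUCTION_EXCEPTION_HANDLER_CLASS_CONFIG,
                DefaultProductionExceptionHandler.class);

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        // Attempt 1: fires only when a StreamThread dies from an exception,
        // not on connection warnings.
        streams.setUncaughtExceptionHandler((thread, throwable) ->
                System.err.println("Uncaught in " + thread.getName() + ": " + throwable));
        streams.start();
    }
}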
I know Kafka has its own metrics which I can watch to find out if a broker is down, but there can be situations where the Kafka brokers are just fine and my kafka-streams app is still not able to connect (e.g. bad authentication configuration or VPN/VPC issues).
What can I do to catch those issues and log/report them?
Update:
See the consumer/producer logs when Kafka is not available:
2020-08-24 21:41:32,055 [my-kafka-streams-app-23a462fe-28c6-415a-a08a-b11d3ffffc2e-StreamThread-1] WARN o.apache.kafka.clients.NetworkClient - [] [Consumer clientId=my-kafka-streams-app-23a462fe-28c6-415a-a08a-b11d3ffffc2e-StreamThread-1-consumer, groupId=my-kafka-streams-app] Bootstrap broker localhost:9092 (id: -1 rack: null) disconnected
2020-08-24 21:41:32,186 [kafka-admin-client-thread | my-kafka-streams-app-23a462fe-28c6-415a-a08a-b11d3ffffc2e-admin] WARN o.apache.kafka.clients.NetworkClient - [] [AdminClient clientId=my-kafka-streams-app-23a462fe-28c6-415a-a08a-b11d3ffffc2e-admin] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.
2020-08-24 21:41:32,250 [kafka-producer-network-thread | my-kafka-streams-app-23a462fe-28c6-415a-a08a-b11d3ffffc2e-StreamThread-1-producer] WARN o.apache.kafka.clients.NetworkClient - [] [Producer clientId=my-kafka-streams-app-23a462fe-28c6-415a-a08a-b11d3ffffc2e-StreamThread-1-producer] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.

This case is not easy to detect programmatically. The problem is that the clients don't really expose their state to Kafka Streams, and thus Kafka Streams does not really know about the disconnect. There is a KIP that proposes adding a DISCONNECTED state, but it's not easy to implement (cf. https://cwiki.apache.org/confluence/display/KAFKA/KIP-457%3A+Add+DISCONNECTED+status+to+Kafka+Streams).
The exception handlers you mention don't help in this situation, as no exception is thrown (at least not within the Kafka Streams code base).
What you can try is to monitor consumer lag or some Kafka Streams metrics (like processing rate). They might provide a good enough proxy. Cf. https://docs.confluent.io/current/streams/monitoring.html
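As a rough sketch of the metrics-based approach (the metric group and name below match recent client versions but should be verified against the output of streams.metrics() for your version, and the alerting hook is a placeholder):

import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.streams.KafkaStreams;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class StreamsHealthWatcher {
    // Sample the per-thread process-rate metric once a minute; if the app
    // claims to be RUNNING but the rate stays at zero, the clients are likely
    // unable to reach the cluster (broker down, bad auth, VPN/VPC issues).
    public static void watch(KafkaStreams streams) {
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            double processRate = 0.0;
            for (Map.Entry<MetricName, ? extends Metric> e : streams.metrics().entrySet()) {
                if ("stream-thread-metrics".equals(e.getKey().group())
                        && "process-rate".equals(e.getKey().name())) {
                    processRate += ((Number) e.getValue().metricValue()).doubleValue();
                }
            }
            if (streams.state() == KafkaStreams.State.RUNNING && processRate == 0.0) {
                // replace with your alerting hook
                System.err.println("ALERT: Streams is RUNNING but process-rate is zero");
            }
        }, 1, 1, TimeUnit.MINUTES);
    }
}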

Related

Custom event handling for KafkaAdminClient

My goal is to do something when the broker is down, but I couldn't manage to do it.
The code is simple:
import java.util.Properties
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig}

val properties = new Properties()
properties.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
val client = AdminClient.create(properties)
// Suppose the app just runs from here without consuming/producing
It starts up, then I manually shut down Kafka.
The logs arrive:
2021-06-23T13:51:16,681+02:00 WARN [kafka-admin-client-thread | adminclient-1] org.apache.kafka.clients.NetworkClient: [AdminClient clientId=adminclient-1] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.
How to handle this? Basically I just want to invoke a custom method when the broker is down.
There is nothing I can 'catch'.
And I couldn't even find an event listener in AdminClient/KafkaAdminClient (or I'm just looking in the wrong place).
Edit: and of course I would also like to invoke my custom code when the broker comes back to life.
You can't issue a command to a server that isn't running... You would need to run a Kafka process check on the broker server itself that does not use Kafka-related tools (e.g. jps, systemctl, or checking some /var/run/kafka.pid file).
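That said, if an in-process approximation is acceptable, one option (a sketch, not an official listener API; the down/up hooks and the poll interval are arbitrary choices) is to poll the cluster with the AdminClient and fire callbacks on state transitions:

import org.apache.kafka.clients.admin.AdminClient;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BrokerLivenessPoller {
    private volatile boolean brokerUp = true;

    // Poll describeCluster() every 10 seconds; a timeout or error is treated
    // as "broker down". onDown/onUp fire once per transition, not per poll.
    public void start(AdminClient client, Runnable onDown, Runnable onUp) {
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            boolean reachable;
            try {
                client.describeCluster().nodes().get(5, TimeUnit.SECONDS);
                reachable = true;
            } catch (Exception e) {
                reachable = false;
            }
            if (brokerUp && !reachable) { brokerUp = false; onDown.run(); }
            else if (!brokerUp && reachable) { brokerUp = true; onUp.run(); }
        }, 0, 10, TimeUnit.SECONDS);
    }
}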

Kafka commit failed due to unsuccessful group coordinator rediscovery

I have an issue with commits in one of my services. It uses consumer.assign, not subscribe. After processing messages, it commits offsets in Kafka using commitAsync. Sometimes (once in a few days) a commit fails with a RetriableCommitFailedException, and in the logs I see messages like this:
[Consumer clientId=my-client-id, groupId=my-group-id] Offset commit failed on partition my-topic-28 at offset 283259051: The request timed out.
[Consumer clientId=my-client-id, groupId=my-group-id] Group coordinator 10.54.116.10:9093 (id: 2147483643 rack: null) is unavailable or invalid due to cause: error response REQUEST_TIMED_OUT.isDisconnected: false. Rediscovery will be attempted.
For some reason, this rediscovery sometimes has no effect, and after 10 minutes of retrying the commit is still failing.
At first, I thought this was somehow related to the fact that I'm using assign, not subscribe, and that I was somehow getting a rebalance that I didn't handle properly. But according to the javadocs, ConsumerRebalanceListener does not apply when using assign, so the problem is not with a rebalance.
Also, the admins said that all Kafka nodes were fine when I received the error, and the partition leader was not changing.
At the moment, I have no clue which direction to investigate. Why does the commit fail even after 10 minutes of retrying? Why does the group coordinator rediscovery sometimes fail?
I'm using Java client 2.8.0; the broker version is 2.3.1.
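For reference, a minimal sketch of the setup described above (topic, partition, group id, and bootstrap address are placeholders), with a commit callback that at least surfaces the retriable failures:

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.errors.RetriableCommitFailedException;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class AssignAndCommit {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group-id"); // still required for offset commits
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(new TopicPartition("my-topic", 28)));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                // ... process records ...
                consumer.commitAsync((offsets, exception) -> {
                    if (exception instanceof RetriableCommitFailedException) {
                        // a later commitAsync with newer offsets supersedes this one,
                        // but log it so failures like the ones above stay visible
                        System.err.println("Retriable commit failure: " + exception.getMessage());
                    }
                });
            }
        }
    }
}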

Apache Flink & Kafka FETCH_SESSION_ID_NOT_FOUND info logs

Our Flink application has a Kafka data source.
The application runs with a parallelism of 32.
When I look at the logs, I see a lot of statements about FETCH_SESSION_ID_NOT_FOUND.
2020-05-04 11:04:47,753 INFO org.apache.kafka.clients.FetchSessionHandler - [Consumer clientId=consumer-81, groupId=sampleGroup]
Node 26 was unable to process the fetch request with (sessionId=439766827, epoch=42): FETCH_SESSION_ID_NOT_FOUND.
2020-05-04 11:04:48,230 INFO org.apache.kafka.clients.FetchSessionHandler - [Consumer clientId=consumer-78, groupId=sampleGroup]
Node 28 was unable to process the fetch request with (sessionId=281654250, epoch=42): FETCH_SESSION_ID_NOT_FOUND.
What do these log statements mean?
What are the possible negative effects?
Note: I have no experience with Apache Kafka.
Thanks.
This can happen for a few reasons, but the most common one is the FetchSession cache being full on the brokers.
By default, brokers cache up to 1000 FetchSessions (configured via max.incremental.fetch.session.cache.slots). When this fills up, brokers can evict cache entries. If your client's cache entry is gone, it will receive the FETCH_SESSION_ID_NOT_FOUND error.
This error is not fatal and consumers should send a new full FetchRequest automatically and keep working.
You can check the size of the FetchSession cache using the kafka.server:type=FetchSessionCache,name=NumIncrementalFetchSessions metric.
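If you want to read that gauge programmatically, a sketch along these lines could work (it assumes the broker exposes remote JMX, e.g. on port 9999; adjust host and port to your setup):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class FetchSessionCacheCheck {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            // The broker's FetchSession cache gauge; compare against
            // max.incremental.fetch.session.cache.slots (default 1000)
            ObjectName gauge = new ObjectName(
                    "kafka.server:type=FetchSessionCache,name=NumIncrementalFetchSessions");
            Object value = mbsc.getAttribute(gauge, "Value");
            System.out.println("NumIncrementalFetchSessions = " + value);
        }
    }
}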

Kafka Streams metadata request only contains internal topics

I'm running a Kafka Streams app with version 2.1.0. I found that after running for some time, my app's instances (63 nodes) enter the ERROR state one by one. Eventually, all 63 nodes are down.
The exception is :
ERROR o.a.k.s.p.i.ProcessorStateManager - task [2_2] Failed to
flush state store KSTREAM-REDUCE-STATE-STORE-0000000014:
org.apache.kafka.streams.errors.StreamsException: task [2_2]
Abort sending since an error caught with a previous record
(key 110646599468 value InterimMessage [sessionStart=1567150872690,count=1]
timestamp 1567154490411) to topic item.interim due to
org.apache.kafka.common.errors.TimeoutException: Failed to update
metadata after 60000 ms.
You can increase producer parameter `retries` and `retry.backoff.ms`
to avoid this error.
I enabled DEBUG logging and found that the exception happens when the KStream asks for a metadata update for the internal topics only, but not for the destination topic (item.interim is the destination topic).
Normally,
[Producer clientId=client-autocreate-StreamThread-1-producer] Sending metadata
request (type=MetadataRequest,
topics=item.interim,test-KSTREAM-REDUCE-STATE-STORE-0000000014-changelog)
to node XXX:9092 (id: 7 rack: XXX)
But before the exception, it was
[Producer clientId=client-autocreate-StreamThread-1-producer] Sending metadata
request (type=MetadataRequest,
topics=test-KSTREAM-REDUCE-STATE-STORE-0000000014-changelog)
to node XXX:9092 (id: 7 rack: XXX)
Config I have changed:
max.request.size=14000000
receive.buffer.bytes=32768
auto.offset.reset=latest
enable.auto.commit=false
default.api.timeout.ms=180000
cache.max.bytes.buffering=10485760
retries=20
retry.backoff.ms=80000
request.timeout.ms=120000
commit.interval.ms=100
num.stream.threads=1
session.timeout.ms=30000
I'm really confused. Could anyone help me understand why the producer sends different metadata requests? And is there any possible way to solve the problem? Thanks a lot!

Kafka unrecoverable if broker dies

We have a Kafka cluster with three brokers (node ids 0, 1, 2) and a ZooKeeper setup with three nodes.
We created a topic "test" on this cluster with 20 partitions and replication factor 2. We are using the Java producer API to send messages to this topic. One of the Kafka brokers intermittently goes down, after which it is unrecoverable. To simulate the case, we killed one of the brokers manually. As per the Kafka architecture, it is supposed to recover on its own, but that is not happening. When I describe the topic on the console, I see the number of ISRs reduced to one for a few of the partitions, since one of the brokers was killed. Now, whenever we try to push messages via the producer API (either the Java client or the console producer), we encounter a SocketTimeoutException. A quick look into the logs shows "Unable to fetch the metadata".
WARN [2015-07-01 22:55:07,590] [ReplicaFetcherThread-0-3][] kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-0-3],
Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 23711; ClientId: ReplicaFetcherThread-0-3;
ReplicaId: 0; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: [zuluDelta,2] -> PartitionFetchInfo(11409,1048576),[zuluDelta,14] -> PartitionFetchInfo(11483,1048576).
Possible cause: java.nio.channels.ClosedChannelException
[2015-07-01 23:37:40,426] WARN Fetching topic metadata with correlation id 0 for topics [Set(test)] from broker [id:1,host:abc-0042.yy.xxx.com,port:9092] failed (kafka.client.ClientUtils$)
java.net.SocketTimeoutException
at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:221)
at kafka.utils.Utils$.read(Utils.scala:380)
at kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
at kafka.network.Receive$class.readCompletely(Transmission.scala:56)
at kafka.network.BoundedByteBufferReceive.readCompletely(BoundedByteBufferReceive.scala:29)
at kafka.network.BlockingChannel.receive(BlockingChannel.scala:111)
at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:75)
at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:72)
at kafka.producer.SyncProducer.send(SyncProducer.scala:113)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:58)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:93)
at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:66)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
Any leads will be appreciated...
From your error, Unable to fetch metadata, the most likely cause is that you set bootstrap.servers in the producer to only the broker that died.
Ideally, you should have more than one broker in the bootstrap.servers list, because if one of the brokers fails (or is unreachable) the others can still give you the metadata.
FYI: Metadata is the information about a particular topic: how many partitions it has, which brokers lead them, which brokers follow them, etc.
So, when a record is produced to a partition, the messages are sent to that partition's leader broker.
From your question, your ISR set has only one broker. You could try setting bootstrap.servers to this broker.
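As a sketch of that suggestion (hostnames are placeholders), the producer just needs several brokers in its bootstrap list so metadata can still be fetched when one of them dies:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class MultiBootstrapProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // List several brokers: any live one can answer the initial metadata request
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
                "broker0:9092,broker1:9092,broker2:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test", "key", "value"));
        }
    }
}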