Custom event handling for KafkaAdminClient - scala

My goal is to do something when the broker is down, but I haven't managed to do it.
The code is simple:
val properties = new Properties()
properties.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
val client = AdminClient.create(properties)
//Suppose that the App just runs from here without consuming/producing
It starts up, then I manually shut down Kafka.
This log arrives:
2021-06-23T13:51:16,681+02:00 WARN [kafka-admin-client-thread | adminclient-1] org.apache.kafka.clients.NetworkClient: [AdminClient clientId=adminclient-1] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.
How do I handle this? Basically I just want to invoke a custom method when the broker is down.
There is nothing I can 'catch', and I couldn't find an event listener in AdminClient/KafkaAdminClient either (or I am just looking in the wrong place).
Edit: And of course I would also like to invoke my custom code when the broker comes back to life.

You can't issue a command to a server that isn't running... You would need to run a check for the Kafka Java process on the broker server itself that does not use Kafka-related tools (e.g. jps, systemctl, or checking some /var/run/kafka.pid).
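If a client-side poll is acceptable (there is indeed no event listener on AdminClient to subscribe to), a common workaround is to probe the cluster periodically with a cheap metadata call and toggle your own handlers when the result changes. A rough Scala sketch; onBrokerDown/onBrokerBack and the timeout values are placeholders, not part of the Kafka API:

import java.util.Properties
import java.util.concurrent.{Executors, TimeUnit}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, ListTopicsOptions}

object BrokerWatcher {
  // Placeholder hooks - put your custom code here
  def onBrokerDown(): Unit = println("broker appears to be down")
  def onBrokerBack(): Unit = println("broker is reachable again")

  def main(args: Array[String]): Unit = {
    val properties = new Properties()
    properties.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    val client = AdminClient.create(properties)

    var brokerUp = true
    val scheduler = Executors.newSingleThreadScheduledExecutor()

    val probe: Runnable = () => {
      val nowUp =
        try {
          // Any cheap metadata call works as a liveness probe; a short timeout keeps it snappy
          client.listTopics(new ListTopicsOptions().timeoutMs(3000)).names().get(5, TimeUnit.SECONDS)
          true
        } catch {
          // Typically an ExecutionException wrapping a TimeoutException when the broker is unreachable
          case _: Exception => false
        }
      if (nowUp != brokerUp) {
        brokerUp = nowUp
        if (nowUp) onBrokerBack() else onBrokerDown()
      }
    }
    scheduler.scheduleAtFixedRate(probe, 0, 10, TimeUnit.SECONDS)
  }
}

This only tells you whether the AdminClient can reach the cluster, not why it can't; it is a polling approximation rather than a true event notification.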

Related

How to set consumer config values for Kafka Mirrormaker-2 2.6.1?

I am attempting to use mirrormaker 2 to replicate data between AWS Managed Kafkas (MSK) in 2 different AWS regions - one in eu-west-1 (CLOUD_EU) and the other in us-west-2 (CLOUD_NA), both running Kafka 2.6.1. For testing I am currently trying just to replicate topics 1 way, from EU -> NA.
I am starting a mirrormaker connect cluster using ./bin/connect-mirror-maker.sh and a properties file (included)
This works fine for topics with small messages on them, but one of my topics has binary messages up to 20MB in size. When I try to replicate that topic I get an error every 30 seconds:
[2022-04-21 13:47:05,268] INFO [Consumer clientId=consumer-29, groupId=null] Error sending fetch request (sessionId=INVALID, epoch=INITIAL) to node 2: {}. (org.apache.kafka.clients.FetchSessionHandler:481)
org.apache.kafka.common.errors.DisconnectException
When logging in DEBUG to get more information we get
[2022-04-21 13:47:05,267] DEBUG [Consumer clientId=consumer-29, groupId=null] Disconnecting from node 2 due to request timeout. (org.apache.kafka.clients.NetworkClient:784)
[2022-04-21 13:47:05,268] DEBUG [Consumer clientId=consumer-29, groupId=null] Cancelled request with header RequestHeader(apiKey=FETCH, apiVersion=11, clientId=consumer-29, correlationId=35) due to node 2 being disconnected (org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient:593)
It gets stuck in a loop constantly disconnecting with request timeout every 30s and then trying again.
Looking at this, I suspect that the problem is that request.timeout.ms is at its default (30s) and the consumer times out trying to read the topic with many large messages.
I followed the guide at https://github.com/apache/kafka/tree/trunk/connect/mirror to try to configure the consumer properties. However, no matter what I set, the consumer timeout remains fixed at the default, confirmed both by Kafka printing its config in the log and by timing the interval between the disconnect messages. For example, I set:
CLOUD_EU.consumer.request.timeout.ms=120000
In the properties that I start MM-2 with.
Based on various guides I have found while looking at this, I have also tried:
CLOUD_EU.request.timeout.ms=120000
CLOUD_EU.cluster.consumer.request.timeout.ms=120000
CLOUD_EU.consumer.override.request.timeout.ms=120000
CLOUD_EU.cluster.consumer.override.request.timeout.ms=120000
None of these have worked.
How can I change the consumer request.timeout.ms setting? The log is approx. 10,000 lines long, but everywhere the ConsumerConfig is logged it shows request.timeout.ms = 30000
Properties file I am using:
# specify any number of cluster aliases
clusters = CLOUD_EU, CLOUD_NA
# connection information for each cluster
CLOUD_EU.bootstrap.servers = kafka.eu-west-1.amazonaws.com:9092
CLOUD_NA.bootstrap.servers = kafka.us-west-2.amazonaws.com:9092
# enable and configure individual replication flows
CLOUD_EU->CLOUD_NA.enabled = true
CLOUD_EU->CLOUD_NA.topics = METRICS_ATTACHMENTS_OVERSIZE_EU
CLOUD_NA->CLOUD_EU.enabled = false
replication.factor=3
tasks.max = 1
############################# Internal Topic Settings #############################
checkpoints.topic.replication.factor=3
heartbeats.topic.replication.factor=3
offset-syncs.topic.replication.factor=3
offset.storage.replication.factor=3
status.storage.replication.factor=3
config.storage.replication.factor=3
############################ Kafka Settings ###################################
# CLOUD_EU cluster overrides
CLOUD_EU.consumer.request.timeout.ms=120000
CLOUD_EU.consumer.session.timeout.ms=150000

Debugging root cause for high Purgatory size in Kafka

We are using the ELK stack (7.10.2) in Kubernetes (1.21.5). Some time after our service provider Gardener changed the OS version (318.9.0 -> 576.1.0), our troubles with the logging stack started.
It seems that Kafka (v 2.8.1, 2 pods) does not stream data to Logstash (7.10.2, 2 pods) continuously, but sends it in chunks every few moments. In Kibana we do not see log records being added continually; instead we see a bunch of new records every few moments. Under high load (e.g. while debugging some component in the k8s cluster), this delay rises to minutes.
We discovered that the "delayed fetch in purgatory" metric is jumping in a very similar pattern
(see screenshot), like a saw. When I downgrade the OS version on the nodes from the current one (576.2.0, orange) to the previous one (318.9.0, blue), the problem disappears. As you would expect, we cannot stay on the old OS version for much longer.
I asked the Gardener staff for assistance, but without a root cause they are not able to help us. We did not change any settings or component versions, only the OS version on the nodes.
From Logstash's debug log I can see that Logstash is continuously connecting to and disconnecting from Kafka:
[2022-01-17T08:53:33,232][INFO ][org.apache.kafka.clients.consumer.internals.AbstractCoordinator] [Consumer clientId=elk-logstash-indexer-6c84d6bf8c-58gnz-containers-10, groupId=containers] Attempt to heartbeat failed since group is rebalancing
[2022-01-17T08:53:30,501][INFO ][org.apache.kafka.clients.consumer.internals.AbstractCoordinator] [Consumer clientId=elk-logstash-indexer-6c84d6bf8c-ct29t-containers-49, groupId=containers] Discovered group coordinator elk-kafka-0.kafka.logging.svc.cluster.local:9092 (id: 2147483647 rack: null)
[2022-01-17T08:53:30,001][INFO ][org.apache.kafka.common.utils.AppInfoParser] Kafka startTimeMs: 1642409610000
These lines keep repeating in a loop.
I can see a similar situation on the Kafka side:
[2022-01-20 11:55:04,241] DEBUG [broker-0-to-controller-send-thread]: Controller isn't cached, looking for local metadata changes (kafka.server.BrokerToControllerRequestThread)
[2022-01-20 11:55:04,241] DEBUG [broker-0-to-controller-send-thread]: No controller defined in metadata cache, retrying after backoff (kafka.server.BrokerToControllerRequestThread)
[2022-01-20 11:55:04,342] DEBUG [broker-0-to-controller-send-thread]: Controller isn't cached, looking for local metadata changes (kafka.server.BrokerToControllerRequestThread)
[2022-01-20 11:55:04,342] DEBUG [broker-0-to-controller-send-thread]: No controller defined in metadata cache, retrying after backoff (kafka.server.BrokerToControllerRequestThread)
[2022-01-20 11:55:04,365] DEBUG Accepted connection from /10.250.1.127:53678 on /100.96.30.21:9092 and assigned it to processor 1, sendBufferSize [actual|requested]: [102400|102400] recvBufferSize [actual|requested]: [102400|102400] (kafka.network.Acceptor)
[2022-01-20 11:55:04,365] DEBUG Processor 1 listening to new connection from /10.250.1.127:53678 (kafka.network.Processor)
[2022-01-20 11:55:04,368] DEBUG [SocketServer listenerType=ZK_BROKER, nodeId=0] Connection with /10.250.1.127 disconnected (org.apache.kafka.common.network.Selector)
I attempted:
doubling resources for Kafka and Logstash (no change)
changing the container engine from Docker to containerd (the problem was worse on containerd, ~400 -> ~1000)
changing the Logstash parameters for the Kafka plugin (no change)
comparing kernel settings (5.4.0 -> 5.10.0, I did not spot any interesting changes)
temporarily disabling Karydia for Kafka, Logstash and ZooKeeper (no change)
temporarily upgrading Logstash (7.10.2 -> 7.12.0, without success; all tested versions show the same bad behavior, and moving to a higher version is currently not possible without changing the versions of other ELK components)
Unfortunately I am not a Kafka expert, so I am not sure whether the connecting/disconnecting is the root cause, whether it comes from some non-optimal settings of ours, or whether the communication is being interfered with by something unknown to us.
I would like to ask the community for help with this problem. Suggestions on how to continue the investigation are very welcome too.
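As an aside, one way to gather more data on the rebalance loop visible in the Logstash log above is to watch the state of the "containers" consumer group from the broker side. A small Scala sketch, assuming an AdminClient can reach the brokers (group name and bootstrap address taken from the logs above); it only observes, it does not fix anything:

import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig}

object GroupStateWatcher extends App {
  val props = new Properties()
  props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
    "elk-kafka-0.kafka.logging.svc.cluster.local:9092")
  val admin = AdminClient.create(props)

  // Print the group state every 5 seconds. A healthy group settles on "Stable";
  // a rebalance loop keeps cycling through "PreparingRebalance" / "CompletingRebalance".
  while (true) {
    val desc = admin
      .describeConsumerGroups(Collections.singletonList("containers"))
      .all()
      .get()
      .get("containers")
    println(s"${java.time.Instant.now()} state=${desc.state()} members=${desc.members().size()}")
    Thread.sleep(5000)
  }
}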

kafka-streams alert on kafka connection failure

When a kafka-streams app is running and Kafka suddenly goes down, the app enters a "waiting" mode, the consumer and producer threads log warnings about not being able to connect, and when Kafka is back everything should (theoretically) go back to normal.
I'm trying to get an alert on this situation and I'm not able to find a place to catch it and send a log/metric.
I tried the following:
streams.setUncaughtExceptionHandler, but this is only invoked on exceptions, which is not the case here
extending ProductionExceptionHandler and changing the default.production.exception.handler property to my class which implements this interface. Again, as with setUncaughtExceptionHandler, no exception is thrown here, so nothing really happens.
I know Kafka has its own metrics which I can listen to to find out whether a broker is down, but there can be situations where the Kafka brokers are just fine and my kafka-streams app is not able to connect (e.g. bad authentication configuration or VPN/VPC issues).
What can I do to catch those issues and log/report them?
update
See the consumer/producer logs below for when Kafka is not available:
2020-08-24 21:41:32,055 [my-kafka-streams-app-23a462fe-28c6-415a-a08a-b11d3ffffc2e-StreamThread-1] WARN o.apache.kafka.clients.NetworkClient - [] [Consumer clientId=my-kafka-streams-app-23a462fe-28c6-415a-a08a-b11d3ffffc2e-StreamThread-1-consumer, groupId=my-kafka-streams-app] Bootstrap broker localhost:9092 (id: -1 rack: null) disconnected
2020-08-24 21:41:32,186 [kafka-admin-client-thread | my-kafka-streams-app-23a462fe-28c6-415a-a08a-b11d3ffffc2e-admin] WARN o.apache.kafka.clients.NetworkClient - [] [AdminClient clientId=my-kafka-streams-app-23a462fe-28c6-415a-a08a-b11d3ffffc2e-admin] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.
2020-08-24 21:41:32,250 [kafka-producer-network-thread | my-kafka-streams-app-23a462fe-28c6-415a-a08a-b11d3ffffc2e-StreamThread-1-producer] WARN o.apache.kafka.clients.NetworkClient - [] [Producer clientId=my-kafka-streams-app-23a462fe-28c6-415a-a08a-b11d3ffffc2e-StreamThread-1-producer] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.
This case is not easy to detect programmatically. The problem is that the clients don't really expose their state to Kafka Streams, and thus Kafka Streams does not really know about the disconnect. There is a KIP that proposes adding a DISCONNECTED state, but it's not easy to implement (cf. https://cwiki.apache.org/confluence/display/KAFKA/KIP-457%3A+Add+DISCONNECTED+status+to+Kafka+Streams).
The exception handlers you mention don't help for this situation, as no exception is thrown (at least not within the Kafka Streams code base).
What you can try is to monitor consumer lag or some Kafka Streams metrics (like processing rate). They might provide a good enough proxy. Cf. https://docs.confluent.io/current/streams/monitoring.html
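To make the metrics approach concrete: KafkaStreams.metrics() exposes the client metrics programmatically, so you can scan them on a schedule and raise your own alert when a throughput metric stays at zero while the instance still reports RUNNING. A rough Scala sketch; the metric name ("process-rate") and the check interval are assumptions to adapt to your Kafka version and workload:

import java.util.concurrent.{Executors, TimeUnit}
import scala.jdk.CollectionConverters._
import org.apache.kafka.streams.KafkaStreams

object StreamsStallMonitor {
  // `streams` is your already-started KafkaStreams instance
  def schedule(streams: KafkaStreams): Unit = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val check: Runnable = () => {
      // Sum a per-thread throughput metric; the exact metric name/group differs
      // between Kafka versions, so adjust to what streams.metrics() actually reports.
      val rate = streams.metrics().asScala.collect {
        case (name, metric) if name.name == "process-rate" =>
          metric.metricValue() match {
            case d: java.lang.Double => d.doubleValue()
            case _                   => 0.0
          }
      }.sum
      if (streams.state() == KafkaStreams.State.RUNNING && rate == 0.0) {
        // Hook your alerting here (log, StatsD, PagerDuty, ...)
        println("Kafka Streams appears stalled: no records processed in the last interval")
      }
    }
    scheduler.scheduleAtFixedRate(check, 1, 1, TimeUnit.MINUTES)
  }
}

This catches the "broker fine, app can't connect" case too, since the app stops processing either way; it just can't tell you which side is at fault.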

Apache Flink & Kafka FETCH_SESSION_ID_NOT_FOUND info logs

Our Flink application has a Kafka datasource.
Application is run with 32 parallelism.
When I look at the logs, I see a lot of statements about FETCH_SESSION_ID_NOT_FOUND.
2020-05-04 11:04:47,753 INFO org.apache.kafka.clients.FetchSessionHandler - [Consumer clientId=consumer-81, groupId=sampleGroup]
Node 26 was unable to process the fetch request with (sessionId=439766827, epoch=42): FETCH_SESSION_ID_NOT_FOUND.
2020-05-04 11:04:48,230 INFO org.apache.kafka.clients.FetchSessionHandler - [Consumer clientId=consumer-78, groupId=sampleGroup]
Node 28 was unable to process the fetch request with (sessionId=281654250, epoch=42): FETCH_SESSION_ID_NOT_FOUND.
What do these log statements mean?
What are the possible negative effects?
Note: I have no experience with Apache Kafka.
Thanks.
This can happen for a few reasons, but the most common one is the FetchSession cache being full on the brokers.
By default, brokers cache up to 1000 FetchSessions (configured via max.incremental.fetch.session.cache.slots). When this fills up, brokers can evict cache entries. If your client's cache entry is gone, it will receive the FETCH_SESSION_ID_NOT_FOUND error.
This error is not fatal, and consumers should automatically send a new full FetchRequest and keep working.
You can check the size of the FetchSession cache using the kafka.server:type=FetchSessionCache,name=NumIncrementalFetchSessions metric.
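If the cache really is exhausted (plausible with 32 parallel Flink consumers plus everything else fetching from the same brokers), the number of slots can be raised in the broker configuration. The value below is only an illustration; each cached session costs broker memory:

# server.properties on each broker (default is 1000)
max.incremental.fetch.session.cache.slots=2000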

Kafka Connection error in controller.log

I am using a single-node Kafka (v 0.10.2) and a single-node ZooKeeper (v 3.4.8), and my controller.log file is filled with this exception:
java.io.IOException: Connection to 1 was disconnected before the response was read
at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$3(NetworkClientBlockingOps.scala:114)
at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$3$adapted(NetworkClientBlockingOps.scala:112)
at scala.Option.foreach(Option.scala:257)
at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$1(NetworkClientBlockingOps.scala:112)
at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:136)
at kafka.utils.NetworkClientBlockingOps$.pollContinuously$extension(NetworkClientBlockingOps.scala:142)
at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
at kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:192)
at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:184)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
I googled this exception but was not able to find its root cause. Can someone suggest why this error is happening and how to prevent it?
I also encountered the same issue in a multi-node cluster scenario. It was caused by the connection between the Kafka node and ZooKeeper being shut down. I would suggest restarting the ZooKeeper server and then the Kafka node to re-establish the connection, after which the broker should handle pub/sub message traffic again.
Hope this helps you get past it.