Kafka Streams app does NOT fail when the Kafka cluster goes down

I have a Kafka Streams app running (0.10.2.1). When I shut down the Kafka cluster, the streams app continues to wait for the next message; when the cluster is brought back up, it resumes consuming messages. For the duration that the cluster is down, the app appears to be working fine. I have tested this for over 45 minutes.
I would expect Kafka to throw an exception or stop. I have configured a StateListener to log when KafkaStreams shuts down; however, it is never invoked.
kafkaStreams.setStateListener((newState, oldState) => {
  if (newState == KafkaStreams.State.NOT_RUNNING) {
    Log.error("Kafka died unexpectedly.")
  }
})
How do I get Kafka to throw an exception or shut down when it cannot connect to the cluster?
Note: this assumes that the cluster goes down after the app has started

Why would you want the Kafka Streams app to go down?
The app should be resilient to broker failures, that is, keep going patiently until the broker recovers, and that seems to be exactly what it is doing. If you have multiple instances of the Kafka Streams application and one of them loses connectivity to the broker, the load will be rebalanced onto the remaining instances. If every instance that lost connectivity shut itself down, you would lose instances, and with them redundancy and parallelism, even once broker connectivity recovered. As it stands, Kafka Streams is designed for resilience, and I'd argue that this is the correct behaviour.
IMHO if you want to detect broker (or connectivity) failures, that's a use case for monitoring, not for introducing failures into Kafka Streams applications.
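For illustration, here is a minimal sketch of what such monitoring could look like from outside the Streams topology: a periodic probe with the AdminClient (available in clients 0.11 and later, so newer than the 0.10.2.1 in the question). The function name brokersReachable, the timeouts, and what you do with the result are illustrative assumptions, not part of the original suggestion.

import java.util.Properties
import java.util.concurrent.TimeUnit
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig}

def brokersReachable(bootstrapServers: String): Boolean = {
  val props = new Properties()
  props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers)
  props.put(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, "5000")
  val admin = AdminClient.create(props)
  try {
    // describeCluster() fails if no broker answers the metadata request in time
    admin.describeCluster().nodes().get(5, TimeUnit.SECONDS)
    true
  } catch {
    case _: Exception => false
  } finally {
    admin.close()
  }
}

This could be called from a scheduled task; whether repeated failures should merely raise an alert or actively call kafkaStreams.close() remains an application-level decision.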

Related

How to manage Flink application when Kafka broker is unavailable?

I have a Flink application running in production which writes data to a Kafka topic owned by an external vendor.
We were notified by the vendor that they would be migrating their cluster and hence there will be downtime where the Kafka brokers will not be available.
My question is, what will happen to the Flink application data when the topic is not available to write data into? Can I allow my Flink application to continue running or should I stop it and wait for the brokers to be up and running?
The task will fail if it can't connect to the Kafka Sink. What it does after failing will depend on your Task Failure Recovery strategy.
If you don't want to keep an eye on when Kafka will be available again, a fixed-delay strategy with infinite retries and a long delay, or an exponential-delay strategy, may be your best option so that unnecessary restarts don't overload your infrastructure.
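As an illustrative sketch (the attempt count and delay below are arbitrary values, not recommendations), the fixed-delay variant can be set programmatically like this; the same can also be configured in flink-conf.yaml:

import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.api.common.time.Time
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
// A very large attempt count with a long pause effectively means "retry forever",
// so the job rides out the broker downtime instead of failing permanently.
env.setRestartStrategy(
  RestartStrategies.fixedDelayRestart(Int.MaxValue, Time.minutes(1)))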

Kafka Streams Apps Threads fail transaction and are fenced and restarted after Kafka broker restart

We are noticing that Streams app threads fail transactions during rolling restarts of our Kafka brokers. The transaction failure causes stream thread fencing, which in turn causes a restart of the thread and a rebalance. The rebalancing causes some delay in processing. Our goal is to make broker restarts as smooth as possible and prevent processing delays as much as possible.
For our rolling Broker restarts we use the controlled.shutdown=true configuration, and before each restart we wait for all partitions to be in-sync across all replicas.
For our Streams Apps we have properly configured group.instance.id and an appropriate session.timeout.ms so that rolling restarts of the streams apps themselves are smooth and without re-balances.
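For reference, a sketch of how those two settings can be passed to the Streams consumer via the consumer prefix; the instance id and timeout value below are illustrative only:

import java.util.Properties
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.streams.StreamsConfig

val props = new Properties()
// group.instance.id must be unique per application instance for static membership
props.put(StreamsConfig.consumerPrefix(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG), "streams-app-instance-1")
// a longer session timeout lets the group ride out short bounces without a rebalance
props.put(StreamsConfig.consumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), "30000")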
From the Kafka Streams app logs I have identified a sequence of events leading up to the fencing:
Broker starts shutting down
App logs an error producing to a topic due to NOT_LEADER_OR_FOLLOWER
App heartbeats fail because the group coordinator is the broker being restarted
App discovers a new group coordinator (this bounces a bit between the restarting broker and the live brokers)
App stabilizes
Broker starting up again
App fails to do fetch request to starting broker due to FETCH_SESSION_ID_NOT_FOUND
App discovers starting broker as transaction coordinator
App transaction fails due to one of two reasons:
InvalidProducerEpochException: Producer attempted to produce with an old epoch.
ProducerFencedException: There is a newer producer with the same transactionalId which fences the current one
Stream threads end up in fatal error state, get fenced and restarted which causes a rebalance.
What could be causing the two exceptions that make the stream thread transactions fail? My intuition is that the broker starting up is assigned as transaction coordinator before it has synced its transaction state with the in-sync brokers. This could explain why that broker knows about old epochs or different transactional ids.
How can we further identify what is going wrong here and how it can be improved?
You can set request.timeout.ms in Kafka Streams, which will make the Streams API wait for a longer period of time. Only if the Kafka broker is not back up within that period will it throw an exception, which can then be handled with a ProductionExceptionHandler, as described in Handling exceptions in Kafka streams.
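To make that concrete, here is a minimal sketch of both pieces; the timeout value and the handler's decision logic are illustrative assumptions, not recommendations from the answer:

import java.util.Properties
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.streams.StreamsConfig
import org.apache.kafka.streams.errors.ProductionExceptionHandler
import org.apache.kafka.streams.errors.ProductionExceptionHandler.ProductionExceptionHandlerResponse

class LogAndFailHandler extends ProductionExceptionHandler {
  override def handle(record: ProducerRecord[Array[Byte], Array[Byte]],
                      exception: Exception): ProductionExceptionHandlerResponse =
    // log the failed record here, then stop the stream thread;
    // returning CONTINUE instead would drop the record and keep processing
    ProductionExceptionHandlerResponse.FAIL

  override def configure(configs: java.util.Map[String, _]): Unit = ()
}

val props = new Properties()
props.put("request.timeout.ms", "60000") // give the brokers more time to come back
props.put(StreamsConfig.DEFAULT_PRODUCTION_EXCEPTION_HANDLER_CLASS_CONFIG,
  classOf[LogAndFailHandler].getName)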

Is it always necessary to restart streams after a Kafka broker outage/failover

We are using the Kafka Streams (0.11.0.1) API to consume events from a topic. But whenever there is a Kafka broker outage/failover, we need to restart all Kafka streamers to recover from the following error:
"Connection to node 39366 could not be established. Broker may not be available."
Just wondering if it is really required for the streamers to close and restart the streams? Why are the streamers not able to recover from this issue automatically? Or are we missing some configuration on the client/broker side?
Now we are planning to introduce code changes to handle all stream exceptions and trigger an automated restart of the streams. But I am really worried whether that is the right way to handle this scenario.
If you think of a real-world use case where hundreds of clients are connected to the brokers, restarting each of them makes no sense.
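For what it's worth, newer Kafka Streams clients (2.8 and later, so an upgrade from 0.11.0.1 would be needed) expose an uncaught-exception handler that can replace just the failed stream thread instead of requiring a restart of the whole application. A minimal sketch, assuming a running kafkaStreams instance:

import org.apache.kafka.streams.errors.StreamsUncaughtExceptionHandler
import org.apache.kafka.streams.errors.StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse

kafkaStreams.setUncaughtExceptionHandler(new StreamsUncaughtExceptionHandler {
  override def handle(exception: Throwable): StreamThreadExceptionResponse =
    // restart only the failed stream thread; the application itself keeps running
    StreamThreadExceptionResponse.REPLACE_THREAD
})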

Kafka Consumer: How To Catch Failover Of Brokers?

I'm currently trying to design a Kafka failover system to detect when a Kafka cluster (broker) goes down, but I'm having trouble getting the Consumer to realize that the broker is down.
I see that a Consumer will continuously poll if there is no data (while it is still connected to the broker) but if the broker goes down, my Consumer just sits there doing nothing forever.
Is there a property somewhere that I need to set for it to raise an Exception or something if it is unable to communicate with the broker? Or do I need to utilize some sort of a callback?
Thanks!
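As far as I know there is no dedicated callback for this, but one hedged approach is to issue a metadata request with a bounded timeout alongside the normal poll loop; in recent clients (2.0+) this throws a TimeoutException when no broker can be reached. A rough sketch, where brokersLookDown is a hypothetical helper and the probe interval and reaction are left to the application:

import java.time.Duration
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.errors.TimeoutException

// Returns true when a metadata request cannot be answered within the timeout,
// which is a strong hint that the brokers are unreachable.
def brokersLookDown(consumer: KafkaConsumer[String, String]): Boolean =
  try {
    consumer.listTopics(Duration.ofSeconds(5))
    false
  } catch {
    case _: TimeoutException => true
  }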

Apache kafka storm, persistence during maintenance

I have Ubuntu 14.04 LTS. I use a Node.js -> Kafka -> Storm -> MongoDB chain. During initial development everything went well; messages were ultimately stored in MongoDB.
In Kafka, I have one Zookeeper and broker0 on kafka1, and broker1 on kafka2. For Storm, Zookeeper, Nimbus, and DRPC are located on storm1; the supervisor and worker are located on storm2.
Now the question is what happens when I update storm1 and storm2. I stopped all processes on storm1 and storm2, assuming Kafka would buffer the messages from Node.js. After I restarted both storm1 and storm2 and redeployed the topology, I found that the messages sent while storm1 and storm2 were down are lost. So it seems Kafka did not preserve the messages during the Storm maintenance period.
My understanding was that Kafka would remember the offset of the last message it received an acknowledgement for.
In all, how can I prevent messages from being lost while Storm is under maintenance?
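As background for the assumption about acknowledgements: Kafka retains messages for the topic's retention period regardless of whether anyone is consuming, and a consumer that belongs to a stable group resumes from its last committed offset when it reconnects; the old storm-kafka spout keeps its offsets in Zookeeper, so those offsets must survive the redeploy for the same to apply. A minimal sketch of that resume behaviour with a plain consumer (the topic name and group id are placeholders):

import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}

val props = new Properties()
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka1:9092,kafka2:9092")
props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group")         // stable group id => committed offsets are reused
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")  // only applies when no committed offset exists yet
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("events"))
// After a restart, poll() resumes from the last committed offset, so messages
// produced while the consumer was down are still delivered (within retention).
val records = consumer.poll(Duration.ofSeconds(1))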