I have a Flink application running in production which writes data to a Kafka topic owned by an external vendor.
We were notified by the vendor that they would be migrating their cluster and hence there will be downtime where the Kafka brokers will not be available.
My question is, what will happen to the Flink application data when the topic is not available to write data into? Can I allow my Flink application to continue running or should I stop it and wait for the brokers to be up and running?
The task will fail if it can't connect to the Kafka Sink. What it does after failing will depend on your Task Failure Recovery strategy.
If you don't want to keep an eye on when Kafka will be available again, a fixed-delay with infinite retries and a long delay or an exponential-delay strategy may be your best option to not overload your infrastructure too much with unnecessary restarts.
Related
I have a Kafka server which runs on a single node. There is only 1 node because it's a test server. But even for a test server, I need to be sure that no data loss will occur while upgrade is in process.
I upgrade Kafka as:
Stop Kafka, Zookeeper, Kafka Connect and Schema Registry.
Upgrade all the components.
Start upgraded services.
Data loss may occur in the first step, where kafka is not running. I guess you can do a rolling update (?) with multiple brokers to prevent data loss but in my case it is not possible. How can I do something similar with a single broker? Is it possible? If not, what is the best approach for upgrading?
I have to say, obviously, you are always vulnerable to data losses if you are using only one node.
If you can't have more nodes you have the only choice:
Stop producing;
Stop consuming;
Enable parameter controlled.shutdown.enable - this will ensure that your broker saved offset in case of a shutdown.
I guess the first 2 steps are quite tricky.
Unfortunately, there is not much to play with - Kafka was not designed to be fault-tolerant with only one node.
The process of a rolling upgrade is still the same for a single broker.
Existing data during the upgrade shouldn't be lost.
Obviously, if producers are still running, all their requests will be denied while the broker is down, thus why you not only need multiple brokers to prevent data-loss, but a balanced cluster (with unclean leader election disabled) where your restart cycles don't completely take a set of topics offline.
We are using Kafka streams (0.11.0.1) api to consume events from topic. But whenever there is a Kafka Broker outage/failover, we need to restart all Kafka streamers to recover from following error:
"Connection to node 39366 could not be established. Broker may not be available."
Just wondering if it is really required for streamers to do streams close and restart? Why streamers are not able recover from this issue automatically? Or are we missing any configuration in client/Broker?
Now we are planning to introduce code changes to handle all stream exception and trigger an automated restart of streams. But am really worried if that is the right way to handle this scenario.
If you think in a real world use case where hundreds of clients connected to brokers and restarting each of them, its not making any sense.
I have a Kafka Streams app running (0.10.2.1). When I shut down the Kafka cluster the streams app continues to wait for the next message, when the cluster is brought back up, it will resume consuming messages. For the duration that the cluster is down the app appears to be working fine. I have tested this for over 45 minutes.
I would expect Kafka to throw an exception or stop. I have configure a StateListener to log when KafkaStreams shuts down, however it is never invoked.
kafkaStreams.setStateListener((newState, _) => {
if (newState == KafkaStreams.State.NOT_RUNNING) {
Log.error("Kafka died unexpectedly.")
}
})
How do I get Kafka to throw an exception or shutdown when it cannot connect to the cluster?
Note: this assumes that cluster goes down after the app has started
Why would you want the Kafka Streams app to go down?
The app should be resilient to broker failures, that is, keep going patiently until the broker recovers and it seems that this is what it's doing. If you have multiple instances of the Kafka Streams application and one of them loses connectivity to the broker, the load will be re-balanced onto the remaining instances. If each instance that lost connectivity just shut itself down, you would be losing instances and with them losing redundancy and parallelism even if the broker connectivity recovered. The way it is now Kafka Streams is designed for resilience. I'd argue that this is the correct behaviour.
IMHO if you want to detect broker (or connectivity) failures, that's a use case for monitoring, not for introducing failures into Kafka Streams applications.
I have a setup where I'm pushing events to kafka and then running a Kafka Streams application on the same cluster. Is it fair to say that the only way to scale the Kafka Streams application is to scale the kafka cluster itself by adding nodes or increasing Partitions?
In that case, how do I ensure that my consumers will not bring down the cluster and ensure that the critical pipelines are always "on". Is there any concept of Topology Priority which can avoid a possible downtime? I want to be able to expose the streams for anyone to build applications on without compromising the core pipelines. If the solution is to setup another kafka cluster, does it make more sense to use Apache storm instead, for all the adhoc queries? (I understand that a lot of consumers could still cause issues with the kafka cluster, but at least the topology processing is isolated now)
It is not recommended to run your Streams application on the same servers as your brokers (even if this is technically possible). Kafka's Streams API offers an application-based approach -- not a cluster-based approach -- because it's a library and not a framework.
It is not required to scale your Kafka cluster to scale your Streams application. In general, the parallelism of a Streams application is limited by the number of partitions of your app's input topics. It is recommended to over-partition your topic (the overhead for this is rather small) to guard against scaling limitations.
Thus, it is even simpler to "offer anyone to build applications" as everyone owns their application. There is no need to submit apps to a cluster. They can be executed anywhere you like (thus, each team can deploy their Streams application the same way by which they deploy any other application they have). Thus, you have many deployment options from a WAR file, over YARN/Mesos, to containers (like Kubernetes). Whatever works best for you.
Even if frameworks like Flink, Storm, or Samza offer cluster management, you can only use such tools that are integrated with those frameworks (for example, Samza requires YARN -- no other options available). Let's say you have already a Mesos setup, you can reuse it for your Kafka Streams applications -- no need for a dedicated "Kafka Streams cluster" (because there is no such thing).
An application’s processor topology is scaled by breaking it into
multiple tasks.
More specifically, Kafka Streams creates a fixed number of tasks based
on the input stream partitions for the application, with each task
assigned a list of partitions from the input streams (i.e., Kafka
topics).
The assignment of partitions to tasks never changes so that each task
is a fixed unit of parallelism of the application. Tasks can then
instantiate their own processor topology based on the assigned
partitions; they also maintain a buffer for each of its assigned
partitions and process messages one-at-a-time from these record
buffers.
As a result stream tasks can be processed independently and in
parallel without manual intervention.
It is important to understand that Kafka Streams is not a resource
manager, but a library that “runs” anywhere its stream processing
application runs. Multiple instances of the application are executed
either on the same machine, or spread across multiple machines and
tasks can be distributed automatically by the library to those running
application instances.
The assignment of partitions to tasks never changes; if an application
instance fails, all its assigned tasks will be restarted on other
instances and continue to consume from the same stream partitions.
The processing of the stream happens in the machines where the application is running.
I recommend you to have a look to this guide, it can help you to better understand the way Kafka Streams work.
I have Ubuntu 14.04TS. I use Node.js->Kafka->Storm->MongoDB chain. With initial development, everything goes well. Messages are finally stored into mMngoDB.
In Kafka, I have one Zookeeper and broker0 in kakfa1. broker1 in kafka2. With Storm, Zookeeper, nimbus, and DRPC are located at storm1. Supervisor and worker are located at storm2.
Now the questions is when I do update storm1 and storm2. I stopped all processes of storm1 and storm2. I suppose Kafka will buffer the message from Node.js. After I restarted both storm1 and storm2, and redeployed topology, I found messages during storm1 storm2's, down and up, are lost. So indeed, Kafka does not keep persistence of messages during storm maintenance period.
In my mind, Kafka will remember the last index of the message it receive acknowledgement.
In all, how could I prevent message from lost when storm is under maintenance.