Storm - KeeperException - apache-kafka

I'm running a storm cluster. I have a nimbus, zookeeper, Kafka server, and a supervisor in one node,
and another supervisor in a separate node.
When I deploy the topology which has a simple Kafka-spout in the first node. The supervisor in second node throws run time exception.
But it works fine with the supervisor in the first node. How to solve this?

Let me ask some questions regarding your topology:
1. How are you ensuring that spout executor runs only on first supervisor node? It can run on any supervisor node.
2. Supervisor node registered correctly in the cluster? I mean this node shows on UI? Because as per the exception, it seems zookeeper does not aware of this node.
3. If spout runs on first supervisor node means that kafka config parameters like hostname might have been specified like "localhost". So when spout runs on first node, it contacts the localhost only for kafka queue. And if spout tries to run on second supervisor node, it fails because, for him "localhost" is itself and there kafka queu is not there.
--Hariprasad

I have faced similar issue with Storm versions before 0.6.2. We tried running Zookeeper on a separate node and the issue was resolved, but we soon upgraded our version of Storm and no longer faced the issue. Try to run Zookeeper on a separate node and see if that helps. Also check if you have correctly configured the supervisors in the cluster. Check this Google Groups Thread to see if you have correct configuration options.

Related

Create Producer when the first broker in the list of brokers is down

I have a multi-node Kafka cluster which I use for consuming and producing.
In my application, I use confluent-kafka-go(1.6.1) to create producers and consumers. Everything works great when I produce and consume messages.
This is how I configure my bootstrap server list
"bootstrap.servers":"localhost:9092,localhost:9093,localhost:9094"
But the moment when I start giving out the IP address of the brokers in bootstrap.servers and if the first broker is down, it seems that the producer repeatedly fails creation telling
Failed to initialize Producer ID: Local: Timed out
If I remove the IP of the failed node, producing and consuming messages work.
If the broker is down after I create the producer/consumer, they continue to be usable by switching over to other nodes.
How should I configure bootstrap.servers in such a way that the producer will be created using the available nodes?
You shouldn't really be running 3 brokers on the same machine anyway, but using multiple unique servers works fine for me when the first is down (and the cluster elects a different leader if it needs to), so sounds like you either lost the primary leader of your topic partitions or you've lost the Controller. Enabling retires on the producer should be able fix itself (by making a new metadata request for partition leaders)
Overall, it's just a CSV; there's no other way to configure that property itself. You could stick a reverse proxy in front of the brokers that resolves only to healthy nodes, but then you'd be conflicting with a potential DNS cache

Kafka Cluster cotinues to run without zookeeper

I have a five node kafka cluster(confluent 5.5 community edition) with 3 zookeeper nodeseach on different aws instances.
While doing failover testing , noticed that the kafka cluster works fine even if all zookeeper nodes are down.
I was able to produce , consume and also create new consumers.
why does the kafka cluster not stop if it cannot connect to any zookeeper nodes ?
What would be the possible issues if we are unaware of such a failure scenario in production and kafka cluster continues to run without zookeeper connectivity ?
how do we handle such a scenario ?
Broker leader election, topic creation, simple ACLs (if you use them) still depend on Zookeeper. For other basic functions relying on the Kafka bootstrap protocols, they might still work, sure. There should definitely be broker logs indicating connection was lost
Ideally you'd have basic process healthchecking and incident management software that you shouldn't miss critical services going down in prod
How to handle? Restart Zookeeper...

Kafka won't start if a Zookeeper node is down

I have Kafka and Zookeeper co-located on the same servers, with multiple nodes.
In Kafka's server.properties, I have a line like
zookeeper.connect=server1:2181,server2:2181...
the problem is, Kafka will not start until all of the Zookeeper nodes are available. Otherwise, I will get an error like "fatal error during Kafka startup" and "Timed out waiting for connection while in state: CONNECTING" even though the other Zookeeper nodes are up.
This makes it challenging to script startup of each node independently, since the startup scripts on one node are dependent on the state of other nodes.
First: is this expected behavior or am I doing something wrong? Suppose I have 3 nodes in Zookeeper cluster; all 3 nodes have to be up for Kafka to start? That seems counterintuitive, since a larger cluster would actually increase the chance of failure on startup rather than provide more resiliency.
Second: What's a good solution for this? Is the only approach to make Kafka on each node wait until Zookeeper is fully up on all nodes?
As far as I know, this is a prerequisite for Kafka to start up correctly, and I don't think too much of a burden. If the zookeeper cluster itself is already having problems at startup time, Kafka itself might run into problems, so ensuring that the Zookeeper cluster is healthy is a good initial check, IMHO.
A way to get around this limitation is to configure a single-node Zookeeper cluster, and tell Kafka to use that cluster. After the fact, you can grow the zookeeper cluster to 3 or more nodes, while Kafka is already up and running. More details can be found here:
Adding new ZooKeeper node in Kafka cluster?
For the record, Kafka itself is completely fine if the Zookeeper cluster goes down once it's up and running. It just wouldn't be able to accept new producer/consumer connections or create topics, but the current ones that are active on the cluster continue to work just fine.
We have met the same problem in our production environment.
It turns out to be a bug (ZOOKEEPER-2184) from zookeeper library which kafka uses talking to zookeeper.
Our kafka version is 1.1.1 which use zookeeper-3.4.10.jar.
After we replaced it with zookeeper-3.4.13.jar, kafka can restart successfully.

Change storm zookeeper server in storm.yaml

I want to point my storm cluster to 3 zookeeper nodes. From my understanding, I should just change storm.yaml configuration in nimbus and supervisor nodes without stopping any of the nodes. Is it correct?
How should I validate my change?
I think you need to restart Nimbus and the supervisors once you've updated storm.yaml in order for them to pick up the change.
I'm assuming you currently have 1 Zookeeper node? In that case, you can validate your change after updating to 3 by stopping the Zookeeper node and checking that your Nimbus and supervisors keep running.
An alternative is to enable debug logging for org.apache.curator, it will most likely log which nodes it is connecting to.

During rolling upgrade/restart, how to detect when a kafka broker is "done"?

I need to automate a rolling restart of a kafka cluster (3 kafka brokers). I can easily do it manually - restart one after the other, while checking the log to see when it's fine (e.g., when the new process has joined the cluster).
What is a good way to automate this check? How can I ask the broker whether it's up and running, connected to its peers, all topics up-to-date and such? In my restart script, I have access to the metrics, but to be frank, I did not really see one there which gives me a clear picture.
Another way would be to ask what a good "readyness" probe would be that does not simply check some TCP/IP port, but looks at the actual server...
I would suggest exposing JMX metrics and tracking the following for cluster health
the controller count (must be 1 over the whole cluster)
under replicated partitions (should be zero for healthy cluster)
unclean leader elections (if you don't disable this in server.properties make sure there are none in the metric counts)
ISR shrinks within a reasonable time period, like 10 minute window (should be none)
Also, Yelp has tooling for rolling restarts implemented in Python, which requires Jolokia JMX Agents installed on the brokers, and it polls the metrics to make sure some of the above conditions are true
Assuming your cluster was healthy at the beginning of the restart operation, at a minimum, after each broker restart, you should ensure that the under-replicated partition count returns to zero before restarting the next broker.
As the previous responders mentioned, there is existing code out there to automate this. I don’t use Jolikia, myself, but my solution (which I’m working on now) also uses JMX metrics.
Kakfa Utils by Yelp is one of the best tools that can be used to detect when a kafka broker is "done". Specifically, kafka_rolling_restart is the tool which gets broker details from zookeeper and URP (Under Replicated Partitions) metrics from each broker. When a broker is restarted, total URPs across Kafka cluster is periodically collected and when it goes to zero, it restarts another broker. The controller broker is restarted at the last.