Storm fault tolerance: why does Nimbus restart a killed worker on a different machine?

How do I make storm-nimbus restart a worker on the same machine?
To test fault tolerance, I kill -9 a worker process, expecting the worker to be restarted on the same machine, but on one of the machines Nimbus launches the worker on another machine instead.
The Nimbus log does not show repeated retries, errors, or anything else unusual.
Would appreciate any help, Thanks!

You shouldn't need to. Workers should be able to switch to an open slot on any supervisor. If you have a bolt that cannot tolerate this because it reads data that only exists on a particular supervisor, that is a design problem.
Additionally, Storm's fault tolerance is intended to handle not only worker failures but also supervisor failures, in which case you won't be able to restart a worker on the same supervisor anyway. You shouldn't need to worry about where a worker runs: that's a feature of Storm.
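For context, the "slots" a worker can land in are just the ports each supervisor advertises in storm.yaml. A minimal sketch, assuming Storm 1.x-style configuration (hostnames and ports are placeholders):

    # storm.yaml on each supervisor node
    storm.zookeeper.servers:
        - "zk-host.example.com"
    nimbus.seeds: ["nimbus-host.example.com"]
    # each port below is one worker slot; Nimbus may assign a worker to any free slot
    # on any supervisor, which is why a killed worker can reappear on another machine
    supervisor.slots.ports:
        - 6700
        - 6701
        - 6702
        - 6703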

Related

Is a Kafka Connect worker a machine/server or just a CPU core?

In the docs, Kafka Connect workers are described as processes, so my understanding was that they correspond to CPU cores.
But the same docs say they provide automatic fault tolerance (in distributed mode), so my understanding was that they must be different machines, since fault tolerance at the process level seems meaningless to me.
Could somebody enlighten me, please?
A Kafka Connect worker is a JVM process.
You can run multiple Kafka Connect workers in distributed mode configured as a cluster, and if one worker dies, its work (tasks) is redistributed amongst the remaining workers.
Typically you would deploy one Kafka Connect worker per machine. Running multiple Kafka Connect workers in distributed mode on one machine is not something that would generally make sense IMO.
I have not tested it but I don't believe that a Kafka Connect worker is tied to one CPU.
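As a rough sketch of what "configured as a cluster" means in practice, workers become one Connect cluster simply by sharing the same group.id and internal topics (hostnames and topic names below are placeholders):

    # connect-distributed.properties (one copy per worker/machine)
    bootstrap.servers=kafka1:9092,kafka2:9092
    # all workers sharing this group.id form a single Connect cluster
    group.id=my-connect-cluster
    key.converter=org.apache.kafka.connect.json.JsonConverter
    value.converter=org.apache.kafka.connect.json.JsonConverter
    # internal topics the workers use to share connector configs, offsets and status
    config.storage.topic=connect-configs
    offset.storage.topic=connect-offsets
    status.storage.topic=connect-status

Each machine then starts one worker with bin/connect-distributed.sh connect-distributed.properties, and tasks are rebalanced across whichever workers are alive.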
For more explanation see here: https://youtu.be/oNK3lB8Z-ZA?t=1337 (slides: https://rmoff.dev/bbuzz19-kafka-connect)

When one Broker has a problem, what is the best way to resolve the situation?

I have three brokers running in a Kafka cluster, and one of them has failed due to an error, so I only have two running brokers left.
1) Usually, when this happens, will restarting the failed broker solve the problem?
2) If restarting the broker doesn't solve the problem, can I erase all the data that the failed broker had and restart it (because all the data will be restored from the two other brokers)? Is this method okay in production? If not, why?
When I was testing Kafka on my Windows 10 desktop a long time ago, if a broker had an error and restarting the server didn't work, I erased all the data. Then it began to run okay. (I am aware of the Kafka and Windows issues.) So I am curious whether this would also work in a clustered Kafka (Linux) environment.
Ultimately, it depends what the error is. If it is a networking error, then there is nothing necessarily wrong with the logs, so you should leave them alone (unless they are not being replicated properly).
The main downside of deleting all data from a broker is that some topics may only have one replica, and it may be on that node. Or if you lose other brokers while the replication is catching up, then all data is potentially gone. Also, if you have many TB of data replicating back to one node, you have to be aware of any disk/network contention that may occur, and consider throttling the replication (which could mean it takes hours for the node to be healthy again).
But yes, Windows and Linux ultimately work the same in this regard, and it is one way to address the situation in a clustered environment.
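If you do wipe and restart a broker, you can watch it catch up and throttle the rebuild traffic with the stock Kafka CLI tools. A hedged sketch (hostnames, broker id and rate are placeholders; older tool versions take --zookeeper instead of --bootstrap-server):

    # list partitions that are still catching up after the broker rejoins
    bin/kafka-topics.sh --bootstrap-server broker1:9092 --describe --under-replicated-partitions

    # cap replication traffic on broker 1 to roughly 50 MB/s while it rebuilds
    # (the rate only applies to replicas marked via the topic-level
    #  leader/follower.replication.throttled.replicas configs)
    bin/kafka-configs.sh --bootstrap-server broker1:9092 \
      --entity-type brokers --entity-name 1 --alter \
      --add-config leader.replication.throttled.rate=52428800,follower.replication.throttled.rate=52428800

Remember to remove the throttle once the broker is back in sync.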

Kafka cluster with single broker

I'm looking to start using Kafka for a system and I'm trying to cover all use cases.
Normally it would be run as a cluster of brokers on virtual servers (replication factor 3-5), but some customers don't care about resilience: a broker failure that needs a manual restart of the whole system is fine with them, and they just care about hardware costs.
So my question is, are there any issues with using Kafka as a single broker system for small installations with low throughput?
Cheers
It's absolutely OK to use a single Kafka broker. Note, however, that with a single broker you won't have a highly available service, meaning that when the broker fails you will have downtime.
Your replication-factor will be limited to 1 and therefore all of the partitions of a topic will be stored on the same node.
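As a small illustration (broker address and topic name are placeholders), topics on a single-broker cluster can still have multiple partitions, but the replication factor cannot exceed the number of brokers:

    # create a topic on a single-broker cluster
    bin/kafka-topics.sh --bootstrap-server localhost:9092 --create \
      --topic my-topic --partitions 3 --replication-factor 1

Asking for --replication-factor 2 or more here would fail, since there is only one broker to place replicas on.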
For a proof-of-concept or non-critical dev work, a single-node cluster works just fine. However, having a multi-node cluster has multiple benefits. It's okay to go with a single-node cluster if the following are not important/relevant for you:
scalability [spreads load across multiple brokers to maintain a certain throughput]
fail-over [guards against data loss in case one or more nodes go down]
availability [the system remains reachable and functioning even if one or more nodes go down]

Storm - KeeperException

I'm running a Storm cluster. I have Nimbus, ZooKeeper, a Kafka server, and a supervisor on one node, and another supervisor on a separate node.
When I deploy a topology with a simple Kafka spout from the first node, the supervisor on the second node throws a runtime exception.
But it works fine with the supervisor on the first node. How can I solve this?
Let me ask some questions regarding your topology:
1. How are you ensuring that the spout executor runs only on the first supervisor node? It can run on any supervisor node.
2. Is the supervisor node registered correctly in the cluster, i.e. does it show up in the Storm UI? From the exception, it seems ZooKeeper is not aware of this node.
3. If the spout only works on the first supervisor node, Kafka config parameters such as the hostname may have been specified as "localhost". When the spout runs on the first node it contacts localhost, where the Kafka queue actually is; when it tries to run on the second supervisor node, "localhost" refers to that node itself, where there is no Kafka queue, so it fails (see the sketch below).
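A minimal sketch of point 3, assuming the older storm-kafka spout API (hostnames, topic and ids are placeholders); the point is simply that the spout config should use a hostname every supervisor node can resolve, never "localhost":

    import backtype.storm.spout.SchemeAsMultiScheme;
    import backtype.storm.topology.TopologyBuilder;
    import storm.kafka.BrokerHosts;
    import storm.kafka.KafkaSpout;
    import storm.kafka.SpoutConfig;
    import storm.kafka.StringScheme;
    import storm.kafka.ZkHosts;

    public class KafkaSpoutTopology {
        public static void main(String[] args) {
            // point at a ZooKeeper host that resolves from every supervisor node
            BrokerHosts hosts = new ZkHosts("zk-host.example.com:2181");

            // topic, ZK root for offsets, and a consumer id (placeholder values)
            SpoutConfig spoutConfig = new SpoutConfig(hosts, "my-topic", "/my-topic", "kafka-spout-id");
            spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 1);
            // ... attach bolts and submit the topology as usual
        }
    }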
--Hariprasad
I have faced a similar issue with Storm versions before 0.6.2. We tried running ZooKeeper on a separate node and the issue was resolved; we then upgraded our version of Storm and no longer faced the issue. Try running ZooKeeper on a separate node and see if that helps. Also check that you have correctly configured the supervisors in the cluster. Check this Google Groups thread to see if you have the correct configuration options.

How does Storm leverage ZooKeeper for resilience?

From the description of Storm, it is based on ZooKeeper, and whenever a worker node dies it can be recovered and get its state from ZooKeeper.
Does anyone know how that is done? Specifically:
How does the failed worker node get recovered?
How does ZooKeeper keep its state? AFAIK, each znode can only store a small amount of data.
Are you talking about workers or supervisors? Each Storm worker node runs a Storm "supervisor" daemon, which manages worker processes.
You need to set up process supervision (something like daemontools or supervisord, which is unrelated to Storm supervisors) to monitor and restart the Nimbus and supervisor daemons in case they hit an exception. Both Nimbus and the supervisors are fail-fast and stateless; ZooKeeper is used for coordination between Nimbus and the supervisors, and the state they need is kept in ZooKeeper or on local disk so that it is not lost.
The state data isn't large, and ZooKeeper should be run supervised too.
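As an illustration of "run supervised", here is a minimal supervisord sketch (paths, user and log locations are assumptions); daemontools or a systemd unit would achieve the same thing:

    ; /etc/supervisor/conf.d/storm.conf
    [program:storm-nimbus]
    command=/opt/storm/bin/storm nimbus
    user=storm
    autostart=true
    autorestart=true
    stdout_logfile=/var/log/storm/nimbus.out.log
    stderr_logfile=/var/log/storm/nimbus.err.log

    [program:storm-supervisor]
    command=/opt/storm/bin/storm supervisor
    user=storm
    autostart=true
    autorestart=true
    stdout_logfile=/var/log/storm/supervisor.out.log
    stderr_logfile=/var/log/storm/supervisor.err.log

Because the daemons are fail-fast and stateless, supervisord simply restarting them is enough; they pick their state back up from ZooKeeper (and local disk) when they come back.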
Check this for more fault tolerance details.