How to configure various supervisors for a Nimbus in Storm? - apache-kafka

I have a Nimbus server and 3 other supervisor servers, and I have 11 Storm topologies running. But all of them are running on the Nimbus node only. How do I configure the other supervisors so that the topologies get distributed among them? Which configuration files do I have to change?

It seems that there is something funny going on. For the two hosts corona-stage-storm-supervisor-01 and corona-stage-storm-supervisor-02 there are two supervisors each. However, a host should have only one supervisor running. I would assume that this "confuses" Nimbus and it uses the remaining host (corona-storm-nimbus-01), which only has a single supervisor running.
See the Storm documentation for more detail (and talk to the admin who did the setup):
https://storm.apache.org/releases/1.0.0/Setting-up-a-Storm-cluster.html
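As a rough sketch, each supervisor host's storm.yaml should point at the same ZooKeeper ensemble and Nimbus host and declare its own worker slots; the ZooKeeper address and local dir below are placeholders, and the Nimbus hostname follows the one mentioned above:
storm.zookeeper.servers:
    - "zookeeper.example.com"
nimbus.seeds: ["corona-storm-nimbus-01"]
storm.local.dir: "/var/storm"
supervisor.slots.ports:
    - 6700
    - 6701
After changing storm.yaml, restart the supervisor daemon on each host so it re-registers with Nimbus.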
About the number of workers: this parameter defines how many worker JVMs are used for a topology (the supervisor JVM starts worker JVMs that do the actual work -- supervisors are basically a "host-local master" for coordination). You can set it in your job configuration via conf.setNumWorkers(int). If you want a topology to spread out over multiple hosts, you need to increase this parameter. Nevertheless, for multiple topologies as in your case, a value of one might also be OK -- different topologies should run on different hosts, independently of this parameter.
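As a minimal sketch (the topology name and the spout/bolt classes are hypothetical placeholders), the worker count is set on the topology configuration at submit time:
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class SpreadTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // Hypothetical components; replace with your own spout and bolt classes.
        builder.setSpout("events", new MySpout(), 2);
        builder.setBolt("counter", new MyBolt(), 4).shuffleGrouping("events");

        Config conf = new Config();
        // Request 3 worker JVMs so the topology can spread over the 3 supervisor hosts.
        conf.setNumWorkers(3);

        StormSubmitter.submitTopology("spread-topology", conf, builder.createTopology());
    }
}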
See Storm documentation for more details:
https://storm.apache.org/releases/1.0.0/Understanding-the-parallelism-of-a-Storm-topology.html

Related

Consume multiple network interfaces of a single machine for a Kafka cluster

I have a Linux machine with 3 network interfaces;
let's say the IPs are 192.168.1.101, 192.168.1.102, 192.168.1.103.
I want to use all 3 IPs of this single node to create a Kafka cluster with other nodes. Should all 3 IPs have their own separate brokers?
Also, using NIC bonding is not recommended; all IPs need to be utilized.
Overall, I'm not sure why you'd want to do this... If you are using separate volumes (log.dirs) for each address, then maybe you'd want separate Java processes, sure, but you'd still be sharing the same memory, and having that machine be a single point of failure.
In any case, you can set one process to have advertised.listeners list each of those addresses for clients to communicate with; however, you'd still have to deal with port allocations in the OS, so you might need to set listeners like so
listeners=PLAINTEXT_1://0.0.0.0:9092,PLAINTEXT_2://0.0.0.0:9093,PLAINTEXT_3://0.0.0.0:9094
And make sure you have listener.security.protocol.map set up as well, using those listener names.
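Putting that together, a server.properties sketch for a single broker on that host might look like this (using the three IPs from the question, plaintext only, ports arbitrary):
listeners=PLAINTEXT_1://0.0.0.0:9092,PLAINTEXT_2://0.0.0.0:9093,PLAINTEXT_3://0.0.0.0:9094
advertised.listeners=PLAINTEXT_1://192.168.1.101:9092,PLAINTEXT_2://192.168.1.102:9093,PLAINTEXT_3://192.168.1.103:9094
listener.security.protocol.map=PLAINTEXT_1:PLAINTEXT,PLAINTEXT_2:PLAINTEXT,PLAINTEXT_3:PLAINTEXT
# Needed because the default PLAINTEXT listener name no longer exists
inter.broker.listener.name=PLAINTEXT_1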
Note that clients will only communicate with the leader of a topic-partition at any time, so if you have one broker JVM process and 3 addresses for it, then really only one address is going to be utilized. One optimization could be to use a separate NIC for intra-cluster replication.

Running Kafka Connect in distributed mode?

I have a total of 3 VMs (CloudVPS). Each of them has Java and Confluent Open Source installed. In VM1 I am running 3 processes of the Splunk sink connector, which read from different topics and run on different ports, and using REST calls I posted the JSON configuration to each of them.
Since I am running in distributed mode, I want to take advantage of the other 2 VMs as well. Can anyone please tell me what to do to add the other 2 VMs to those 3 processes and achieve parallel processing?
You just need to run Kafka Connect in distributed mode on the three VMs, follow the instructions here, and make sure you give them all the same group.id, which identifies them as members of the same cluster (and thus eligible for sharing the workload of tasks across them). More config details for distributed mode here.
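A minimal connect-distributed.properties sketch, identical on each of the three VMs (the broker address, topic names, and group id below are placeholders):
bootstrap.servers=kafka1:9092
# Same group.id on all three workers so they join one Connect cluster
group.id=splunk-connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status
config.storage.replication.factor=3
offset.storage.replication.factor=3
status.storage.replication.factor=3
Once all three workers are up, the tasks of your Splunk sink connectors are rebalanced across the whole group.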
See also:
https://rmoff.net/2019/11/22/common-mistakes-made-when-configuring-multiple-kafka-connect-workers/
http://rmoff.dev/ksldn19-kafka-connect

What to do after one node in a ZooKeeper cluster fails?

According to https://zookeeper.apache.org/doc/r3.1.2/zookeeperAdmin.html#sc_zkMulitServerSetup
Cross Machine Requirements
For the ZooKeeper service to be active,
there must be a majority of non-failing machines that can communicate
with each other. To create a deployment that can tolerate the failure
of F machines, you should count on deploying 2xF+1 machines. Thus, a
deployment that consists of three machines can handle one failure, and
a deployment of five machines can handle two failures. Note that a
deployment of six machines can only handle two failures since three
machines is not a majority. For this reason, ZooKeeper deployments are
usually made up of an odd number of machines.
To achieve the highest probability of tolerating a failure you should
try to make machine failures independent. For example, if most of the
machines share the same switch, failure of that switch could cause a
correlated failure and bring down the service. The same holds true of
shared power circuits, cooling systems, etc.
My question is:
What should we do after we have identified a node failure within a ZooKeeper cluster to make the cluster 2F+1 again? Do we need to restart all the ZooKeeper nodes? Also, the clients connect to the ZooKeeper cluster; suppose we used DNS names and the recovered node uses the same DNS name.
For example:
10.51.22.89 zookeeper1
10.51.22.126 zookeeper2
10.51.23.216 zookeeper3
if 10.51.22.89 dies and we bring up 10.51.22.90 as zookeeper1, can all the nodes identify this change?
If you connect 10.51.22.90 as zookeeper1 (with the same myid file and configuration as 10.51.22.89 had before) and the data dir is empty, the process will connect to the current leader (zookeeper2 or zookeeper3) and copy a snapshot of the data. After successful initialization the node will inform the rest of the cluster nodes and you have 2F+1 again.
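For the example above, a sketch of the relevant files on the replacement node (default ZooKeeper ports assumed):
# zoo.cfg -- same server list on every node; zookeeper1 now resolves to 10.51.22.90
server.1=zookeeper1:2888:3888
server.2=zookeeper2:2888:3888
server.3=zookeeper3:2888:3888
# dataDir/myid on the replacement node contains just the line:
1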
Try this yourself, keeping tail -f on the log files. It won't hurt the cluster and you will learn a lot about ZooKeeper internals ;-)

Spark error: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

I have a virtual machine in which a spark-2.0.0-bin-hadoop2.7 in standalone mode is installed.
I ran ./sbin/start-all.sh to run the master and the slave.
When I do ./bin/spark-shell --master spark://192.168.43.27:7077 --driver-memory 600m --executor-memory 600m --executor-cores 1 on the machine itself, the task's status is RUNNING and I am able to run code in the Spark shell.
When I do exactly the same command from another machine in the network, the status is "RUNNING" again, but the spark-shell throws WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources. I guess the problem is not directly related to resources because the same command works in the virtual machine itself, but not when it comes from other machines.
I checked most of the topics related to this error and none of them solved my problem. I even disabled the firewall with sudo ufw disable just to make sure, but with no success. This was based on this link, which suggests:
Disable Firewall on the client : This was the solution that worked for me. Since I was working on a prototype in-house code, I disabled the firewall on the client node. For some reason the worker nodes, were not able to talk back to the client for me. For production purposes, you would want to open-up certain number of ports required.
There are two known reasons for this:
Your application requires more resources (cores, memory) than allocated. Increasing the worker cores and memory should solve it. Most other answers focus on this.
Less commonly known, the firewall may be blocking the communication between master and workers. This can happen especially if you are using a cloud service. According to Spark Security, besides the standard 8080, 8081, 7077 and 4040 ports, you also need to make sure the master and workers can communicate via SPARK_WORKER_PORT, spark.driver.port and spark.blockManager.port; the latter three are used when submitting jobs and are randomly assigned by the program (if left unconfigured). You may try to open all ports to run a quick test.
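If opening everything is not acceptable, one workaround is to pin the otherwise random ports so they can be allowed through the firewall explicitly (the port numbers below are arbitrary examples):
# conf/spark-defaults.conf on the machine running spark-shell
spark.driver.port        40000
spark.blockManager.port  40010
# conf/spark-env.sh on each worker host
SPARK_WORKER_PORT=35000
Then open those ports (plus 7077 and the web UI ports) between the client and the cluster machines.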
To add an example of @Fountaine007's first bullet:
I ran into the same issue and it was because the allocated vcores were fewer than the application expected.
For my specific scenario, I increased the value of yarn.nodemanager.resource.cpu-vcores under $HADOOP_HOME/etc/hadoop/yarn-site.xml.
For memory-related issues, you may also need to modify yarn.nodemanager.resource.memory-mb.
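For reference, the corresponding yarn-site.xml entries might look like this (the values are just examples for a small node):
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>
</property>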

Supervisors in STORM

I have a doubt about Storm and here it goes:
Can multiple supervisors run on a single node, or can we run only one supervisor on one machine?
Thanks.
In principle, there should be 1 supervisor daemon per physical machine. Why?
Answer: Nimbus receives heartbeats from the supervisor daemon and tries to restart it in case the supervisor dies; if the restart attempt fails permanently, Nimbus will assign that work to another supervisor.
Imagine two supervisors going down at the same time because they are on the same physical machine -- poor fault tolerance!
Running two supervisor daemons is also a waste of memory resources.
If you have machines with very high memory, simply increase the number of workers by adding more ports to supervisor.slots.ports in storm.yaml instead of adding supervisors, as shown below.
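For example, a storm.yaml slot list with four worker ports (6700-6703 are the common defaults) allows up to four worker JVMs on that machine:
supervisor.slots.ports:
    - 6700
    - 6701
    - 6702
    - 6703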
Theoretically possible -- practically you may not need to do it, unless you are doing a PoC/demo. I did this for one of the demos I gave by making multiple copies of Storm and changing the ports for one of the supervisors -- you can do it by changing supervisor.slots.ports.
It is basically designed per node, so one node should have only one supervisor. This daemon deals with the number of worker processes that you configured based on ports.
So there is no need for an extra supervisor daemon per node.
It is possible to run multiple supervisors on a single host. Have a look at this post on the storm-user mailing list.
Just make multiple copies of Storm, and change the storm.yaml to specify
different ports for each supervisor (supervisor.slots.ports).
The supervisor is configured on a per-node basis. Running multiple supervisors on a single node does not make much sense. The sole purpose of the supervisor daemon is to start/stop worker processes (each of these workers is responsible for running a subset of topologies). From the doc page:
The supervisor listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it.