How to build a scene where zookeeper jitter cause to multiple rapid elections - apache-zookeeper

I have two program, fc(failoverController) and web(webServer). And I use zookeeper to ensure high reliability.
fc will deploy on two server, two fc use apache-curator LeaderSelector to elect master, and the master will start a web process, and web process will provide services. In order not to give up leadership, I use a while(true) at the end of the function takeLeadership().
But in a certain situation, our custom deploy zookeeper on three vmware esxi virtual machine. and they are snapshot the three vm (snapshot vm memory) everyday.
one day, there has been a strange phenomenon, fc1 become master, A few milliseconds, fc2 become master, The time difference between before and after is very short. This triggered a bug in our program, we have two master.
In order to fix this problem, we use an AtomBoolean var, declare if zk status become LOST or SUSPEND, and use this var mark whether to exit takeLeadership.
now I want to test this two master case, how can I build a scene where zookeeper jitter cause to multiple rapid elections.
I has tested the following operations, but can't reproduce:
frequent restart of zk services.
use tcpkill to kill one of fc to zk port.

Related

How do I setup a Active / Passive environment with two nodes in OpenShift?

I am trying to configure a Active/Passive cluster with two nodes (using OpenShift). The second passive node should be a hot standby, in other words it is up and running but not doing anything, until the first node dies. Then the passive node becomes active and a new passive node is started.
I have read the High Availability documentation, however it just seems to cover the theory. Furthermore it seems like overkill ( I am thinking there might be an easier way to meet my goal).
Where would I start?
What you are asking for goes against the usual practice for how Kubernetes/OpenShift is used. You wouldn't have hot standby nodes, you would always use all nodes in the cluster. You would then allow for enough additional capacity in your cluster such that loosing a node doesn't cause a problem as other nodes would have enough capacity to then run the applications. In this scenario the Kubernetes scheduler would automatically restart any applications which were on a failed node on the other nodes in the cluster, without you needing to perform any explicit failover steps.
So don't try and do anything special, setup your cluster with the two nodes, with applications being distributed across both. If you need to have the ability to run with only a single node, make sure it has enough capacity to run everything. If over time you add more applications and one node is not enough, add a third node, with all three being used in normal case. You can then handle failure of a single node again.

What to do after one node in zookeeper cluster fails?

According to https://zookeeper.apache.org/doc/r3.1.2/zookeeperAdmin.html#sc_zkMulitServerSetup
Cross Machine Requirements For the ZooKeeper service to be active,
there must be a majority of non-failing machines that can communicate
with each other. To create a deployment that can tolerate the failure
of F machines, you should count on deploying 2xF+1 machines. Thus, a
deployment that consists of three machines can handle one failure, and
a deployment of five machines can handle two failures. Note that a
deployment of six machines can only handle two failures since three
machines is not a majority. For this reason, ZooKeeper deployments are
usually made up of an odd number of machines.
To achieve the highest probability of tolerating a failure you should
try to make machine failures independent. For example, if most of the
machines share the same switch, failure of that switch could cause a
correlated failure and bring down the service. The same holds true of
shared power circuits, cooling systems, etc.
My question is:
What should we do after we identified a node failure within Zookeeper cluster to make the cluster 2F+1 again? Do we need to restart all the zookeeper nodes? Also the clients connects to Zookeeper cluster, suppose we used DNS name and the recovered node using same DNS name.
For example:
10.51.22.89 zookeeper1
10.51.22.126 zookeeper2
10.51.23.216 zookeeper3
if 10.51.22.89 dies and we bring up 10.51.22.90 as zookeeper1, and all the nodes can identify this change.
If you connect 10.51.22.90 as zookeeper1 (with the same myid file and configuration as 10.51.22.89 had before) and the data dir is empty, the process will connect to current leader (zookeeper2 or zookeeper3) and copy snapshot of the data. After successful initialization the node will inform rest of the cluster nodes and you have 2F+1 again.
Try this yourself, having tail -f on log files. It won't hurt the cluster and you will learn a lot on zookeeper internals ;-)

Zooker Failover Strategies

We are young team building an applicaiton using Storm and Kafka.
We have common Zookeeper ensemble of 3 nodes which is used by both Storm and Kafka.
I wrote a test case to test zooker Failovers
1) Check all the three nodes are running and confirm one is elected as a Leader.
2) Using Zookeeper unix client, created a znode and set a value. Verify the values are reflected on other nodes.
3) Modify the znode. set value in one node and verify other nodes have the change reflected.
4) Kill one of the worker nodes and make sure the master/leader is notified about the crash.
5) Kill the leader node. Verify out of other two nodes, one is elected as a leader.
Do i need i add any more test case? additional ideas/suggestion/pointers to add?
From the documentation
Verifying automatic failover
Once automatic failover has been set up, you should test its operation. To do so, first locate the active NameNode. You can tell which node is active by visiting the NameNode web interfaces -- each node reports its HA state at the top of the page.
Once you have located your active NameNode, you may cause a failure on that node. For example, you can use kill -9 to simulate a JVM crash. Or, you could power cycle the machine or unplug its network interface to simulate a different kind of outage. After triggering the outage you wish to test, the other NameNode should automatically become active within several seconds. The amount of time required to detect a failure and trigger a fail-over depends on the configuration of ha.zookeeper.session-timeout.ms, but defaults to 5 seconds.
If the test does not succeed, you may have a misconfiguration. Check the logs for the zkfc daemons as well as the NameNode daemons in order to further diagnose the issue.
more on setting up automatic failover

Supervisors in STORM

I have a doubt in storm and here is goes:
Can multiple supervisors run on a single node? or is it the fact that we can run only one supervisor in one machine?
Thanks.
In Principle, There should be 1 Supervisor daemon per 1 physical machine. Why ?
Answer : Nimbus receives heart beat of Supervisor daemon and try to restart it in case supervisor died, if nimbus failed permanently on restart attempt. Nimbus will assign that job to another Supervisor.
Imagine, two Supervisors going down same time as they are from same physical machine, poor fault tolerant !!
running two Supervisor daemons will also be waste of memory resources.
If you have very high memory machines simply increase number of workers by adding more ports in storm.yaml instead adding supervisor.slots.ports.
Theoretically possible - practically you may not need to do it - unless you are doing a PoC/Demo. I did this for one of the demo I gave by making multiple copies of storm and changing the ports for one of the supervisors - you can do it by changing supervisors.slots.ports.
It is designed basically per node. So one node should have only one supervisor. This daemon deals with number of worker processes that you configured based on ports.
So there is no need of extra supervisor daemon per node.
It is possible to run multiple supervisors on a single host. Have a look at this post in storm-user mailing list.
Just copy multiple Storm, and change the storm.yaml to specify
different ports for each supervisor(supervisor.slots.ports)
Supervisor is configured per node basis. Running multiple supervisor on a single node does not make much sense. The sole purpose of the supervisor daemon is to start/stop the worker process (each of these workers are responsible for running subset of topologies). From the doc page ..
The supervisor listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it.

Load distribution to instances of a perl script running on each node of a cluster

I have a perl script (call it worker) installed on each node/machine (4 total) of a cluster (each running RHEL). The script itself is configured as a RedHat Cluster service (which means the RH cluster manager would ensure that one and exactly one instance of this script is running as long as at least one node in the cluster is up).
I have X amount of work to be done every day once a day, which this script does. So far the X was small enough and only one instance of this script was enough to do it. But now the load is going to increase and along with High Availability (viz already implemented using RHCS), I also need load distribution.
Question is how do I do that?
Of course I have a way to split the work in n parts of size X/n each. Options I had in mind:
Create a new load distributor, which splits the work in jobs of X/n. AND one of the following:
Create a named pipe on the network file system (which is mounted and visible on all nodes), post all jobs to the pipe. Make each worker script on each node read (atomic) from the pipe and do the work. OR
Make each worker script on each node listen on a TCP socket and the load distributor send jobs to each this socket in a round robin (or some other algo) fashion.
Theoretical problem with #1 is that we've observed some nasty latency problems with NFS. And I'm not even sure if NFS would support IPC via named pipes across machines.
Theoretical problem with #2 is that I have to implement some monitors to ensure that each worker is running and listening, which being a noob to Perl, I'm not sure if is easy enough.
I personally prefer load distributor creating a pool and workers pulling from it, rather than load distributor tracking each worker and pushing work to each. Any other options?
I'm open to new ideas as well. :)
Thanks!
-- edit --
using Perl 5.8.8, to be precise: This is perl, v5.8.8 built for x86_64-linux-thread-multi
If you want to keep it simple use a database to store the jobs and then have each worker lock the table and get the jobs they need then unlock and let the next worker do it's thing. This isn't the most scalable solution since you'll have lock contention, but with just 4 nodes it should be fine.
But if you start going down this road it might make sense to look at a dedicated job-queue system like Gearman.