Mesos cluster fails to elect master when using replicated_log - apache-zookeeper

Test environment: multi-node mesos 0.27.2 cluster on AWS (3 x masters, 2 x slaves, quorum=2).
Tested persistence with zkCli.sh and it works fine.
If I start the masters with --registry=in_memory, it works fine: a master is elected and I can start tasks via Marathon.
If I use the default (--registry=replicated_log), the cluster fails to elect a master:
https://gist.github.com/mitel/67acd44408f4d51af192
EDIT: apparently the problem was the firewall. I applied an allow-all type of rule to all my security groups and now I have a stable master. Once I figure out what was blocking the communication I'll post it here.

Discovered that Mesos masters also initiate connections to other masters on port 5050. After adding the corresponding egress rule to the masters' security group, the cluster is stable and master election happens as expected.
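For reference, here is a rough sketch of the kind of egress rule that did it for me, using the AWS CLI (the security group ID sg-MASTERS is a placeholder; ZooKeeper traffic on 2181/2888/3888 needs equivalent rules):
# masters initiate connections to each other on tcp/5050, so allow that egress
aws ec2 authorize-security-group-egress --group-id sg-MASTERS \
  --ip-permissions 'IpProtocol=tcp,FromPort=5050,ToPort=5050,UserIdGroupPairs=[{GroupId=sg-MASTERS}]'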
UPDATE: for those who try to build an internal firewall between the various components of Mesos/ZooKeeper/...: don't do it. It is better to design the security the way Mesosphere's DCOS does.

First off, let me briefly clarify the flag's meaning for posterity. --registry does not influence leader election; it specifies the persistence strategy for the registry (where Mesos tracks data that should be carried over across failover). The in_memory value should not be used in production and may even be removed in the future.
Leader election is performed by ZooKeeper. According to your log, you use the following ZooKeeper cluster: zk://10.1.69.172:2181,10.1.9.139:2181,10.1.79.211:2181/mesos.
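For context, a sketch of how such a master is typically started against that ensemble (flag values taken from your setup; the work_dir path is an assumption):
mesos-master \
  --zk=zk://10.1.69.172:2181,10.1.9.139:2181,10.1.79.211:2181/mesos \
  --quorum=2 \
  --work_dir=/var/lib/mesos \
  --registry=replicated_log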
Now, from your log, the cluster did not fail to elect a master; it actually did so twice:
I0313 18:35:28.257139 3253 master.cpp:1710] The newly elected leader is master@10.1.69.172:5050 with id edd3e4a7-ede8-44fe-b24c-67a8790e2b79
...
I0313 18:35:36.074087 3257 master.cpp:1710] The newly elected leader is master@10.1.9.139:5050 with id c4fd7c4d-e3ce-4ac3-9d8a-28c841dca7f5
I can't say exactly why the leader was elected twice; for that I would need the logs from the two other masters as well. According to your log, the last elected leader is on 10.1.9.139:5050, which is most probably not the one you provided the log from.
One suspicious thing I see in the log is that master IDs differ for the same IP:port. Do you have an idea why?
I0313 18:35:28.237251 3244 master.cpp:374] Master 24ecdfff-2c97-4de8-8b9c-dcea91115809 (10.1.69.172) started on 10.1.69.172:5050
...
I0313 18:35:28.257139 3253 master.cpp:1710] The newly elected leader is master@10.1.69.172:5050 with id edd3e4a7-ede8-44fe-b24c-67a8790e2b79

Related

Fail back from slave to master in Artemis colocated HA config is not working

I have a 4-node Artemis 2.10 cluster on Linux configured for async IO journal, replication and colocated HA servers. I have been testing failover and failback but it's not working. When shutting down one server (A) in an HA pair, the colocated backup on the second server (B) will activate and correctly process messages intended for the original server A. I modified the ColocatedFailoverExample from the Artemis examples to check this and it is working. The problem is that when I bring up the original server A it starts, becomes live, registers acceptors and addresses and joins the cluster, but a number of things are wrong:
Looking at the Artemis console for server A, there is no longer a colocated_backup_1 listed to show that it is providing a colocated backup to server B.
Server A coming back up causes the server that was failed over to, server B, to go totally offline and only function as a backup. The master it was providing stops and no longer displays addresses or acceptors in the UI.
Although it says it's running as a backup, server B doesn't have the colocated_backup_1 object shown in its console either.
Server B still seems to be part of the cluster, but in the UI there is no green master node shown for it anymore - just a red slave node circle. Client connections to server B fail, most likely because the colocated master that was running on it was shut down.
In the Artemis UI for server B, under node-name > cluster-connections > cluster-name, the attributes for the cluster show no nodes in the node array and the node id is wrong. The node id is now the same as the id of the master broker on server A. It's almost as if the information for the colocated_backup_01 broker on server B that was running before failover has replaced the server B live server, and there is now only one broker on server B - the colocated backup.
This all happens immediately when I bring up server A. The UI for server B refreshes at that moment: the colocated_backup_01 entry disappears, and the acceptors and addresses links under what was the master broker name for server B disappear as well. The cluster attributes page refreshes, and the 3 nodes that were listed in the "nodes" attribute disappear, leaving the "nodes" attribute empty.
Now if I take down server B instead and bring it back up, the roles between the server pair are swapped. Server B becomes live again and is shown as a master node in the topology (but still with no colocated_backup_01 in the UI), while the server A master broker goes offline and server A reconfigures itself as a backup/slave node. Whichever of server A or B is in this "offline", backup-only state, the value of the Node property in the cluster attributes shown in the UI is the same for both. Prior to the failover test they had different node ids, which makes sense, although the colocated_backup_01 backup on each did share the node id of the node it was backing up.
To summarize what I think is happening: the master that is coming back up after failover seems to trigger its partner backup node to come up as a backup but also to stop being a master node itself. From that point the colocation stops and there is only ever one live master between the two servers instead of one on each. The fail-back feature seems to be not only failing the original master back but shutting down the colocated master on the backup as well. It is almost as if the topology between the two was configured to be colocated but it is being treated as the standard two-node HA config where one server is the master and one is the slave.
The only way to fix the issue with the pair is to stop both servers and remove everything under the broker "data" directory on both boxes before starting them again. Just removing the colocated backup files on each machine isn't enough - everything under "data" has to go. After doing this they come up correctly and both are live masters and they pair up as HA colocated backups for each other again.
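For the record, the reset procedure is roughly the following on both servers (paths and the service wrapper are placeholders for however your broker instances are installed):
# stop the broker instance
<broker-instance>/bin/artemis-service stop
# wipe journal, bindings, paging and colocated backup data
rm -rf <broker-instance>/data/*
# start it again
<broker-instance>/bin/artemis-service start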
Here is the ha-policy section of the broker.xml file which is the same for all 4 servers:
<ha-policy>
  <replication>
    <colocated>
      <backup-request-retries>5</backup-request-retries>
      <backup-request-retry-interval>5000</backup-request-retry-interval>
      <max-backups>1</max-backups>
      <request-backup>true</request-backup>
      <backup-port-offset>20</backup-port-offset>
      <master>
        <check-for-live-server>true</check-for-live-server>
        <vote-on-replication-failure>true</vote-on-replication-failure>
      </master>
      <slave>
        <max-saved-replicated-journals-size>-1</max-saved-replicated-journals-size>
        <allow-failback>true</allow-failback>
        <vote-on-replication-failure>true</vote-on-replication-failure>
        <restart-backup>false</restart-backup>
      </slave>
    </colocated>
  </replication>
</ha-policy>

What to do after one node in zookeeper cluster fails?

According to https://zookeeper.apache.org/doc/r3.1.2/zookeeperAdmin.html#sc_zkMulitServerSetup
Cross Machine Requirements For the ZooKeeper service to be active,
there must be a majority of non-failing machines that can communicate
with each other. To create a deployment that can tolerate the failure
of F machines, you should count on deploying 2xF+1 machines. Thus, a
deployment that consists of three machines can handle one failure, and
a deployment of five machines can handle two failures. Note that a
deployment of six machines can only handle two failures since three
machines is not a majority. For this reason, ZooKeeper deployments are
usually made up of an odd number of machines.
To achieve the highest probability of tolerating a failure you should
try to make machine failures independent. For example, if most of the
machines share the same switch, failure of that switch could cause a
correlated failure and bring down the service. The same holds true of
shared power circuits, cooling systems, etc.
My question is:
What should we do after we have identified a node failure within a ZooKeeper cluster to make the cluster 2F+1 again? Do we need to restart all the ZooKeeper nodes? Also, the clients connect to the ZooKeeper cluster by DNS name; suppose the recovered node comes back under the same DNS name.
For example:
10.51.22.89 zookeeper1
10.51.22.126 zookeeper2
10.51.23.216 zookeeper3
If 10.51.22.89 dies and we bring up 10.51.22.90 as zookeeper1, will all the nodes identify this change?
If you connect 10.51.22.90 as zookeeper1 (with the same myid file and configuration as 10.51.22.89 had before) and the data dir is empty, the process will connect to the current leader (zookeeper2 or zookeeper3) and copy a snapshot of the data. After successful initialization the node will inform the rest of the cluster nodes and you have 2F+1 again.
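A rough sketch of that on the replacement host 10.51.22.90, assuming a stock zkServer.sh installation (paths are placeholders):
# same myid the old zookeeper1 had
echo 1 > /var/lib/zookeeper/myid
# zoo.cfg lists the same ensemble as on the surviving nodes:
#   server.1=zookeeper1:2888:3888
#   server.2=zookeeper2:2888:3888
#   server.3=zookeeper3:2888:3888
# point the zookeeper1 DNS entry at 10.51.22.90, then start the node
zkServer.sh start
# with an empty data dir it pulls a snapshot from the current leader on startup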
Try this yourself, keeping tail -f on the log files. It won't hurt the cluster and you will learn a lot about ZooKeeper internals ;-)

ZooKeeper Failover Strategies

We are a young team building an application using Storm and Kafka.
We have a common ZooKeeper ensemble of 3 nodes which is used by both Storm and Kafka.
I wrote a test case to test ZooKeeper failovers:
1) Check that all three nodes are running and confirm one is elected as the leader.
2) Using the ZooKeeper unix client, create a znode and set a value. Verify the value is reflected on the other nodes.
3) Modify the znode: set the value on one node and verify the other nodes have the change reflected.
4) Kill one of the follower (non-leader) nodes and make sure the leader is notified about the crash.
5) Kill the leader node. Verify that one of the other two nodes is elected as the leader.
Do I need to add any more test cases? Any additional ideas/suggestions/pointers?
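For reference, steps 1-3 look roughly like this with the stock CLI tools (host names and the znode path are placeholders):
zkCli.sh -server zookeeper1:2181 create /failover-test "v1"
zkCli.sh -server zookeeper2:2181 get /failover-test        # expect "v1"
zkCli.sh -server zookeeper3:2181 set /failover-test "v2"
zkCli.sh -server zookeeper1:2181 get /failover-test        # expect "v2"
echo srvr | nc zookeeper1 2181 | grep Mode                 # leader or follower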
From the documentation
Verifying automatic failover
Once automatic failover has been set up, you should test its operation. To do so, first locate the active NameNode. You can tell which node is active by visiting the NameNode web interfaces -- each node reports its HA state at the top of the page.
Once you have located your active NameNode, you may cause a failure on that node. For example, you can use kill -9 to simulate a JVM crash. Or, you could power cycle the machine or unplug its network interface to simulate a different kind of outage. After triggering the outage you wish to test, the other NameNode should automatically become active within several seconds. The amount of time required to detect a failure and trigger a fail-over depends on the configuration of ha.zookeeper.session-timeout.ms, but defaults to 5 seconds.
If the test does not succeed, you may have a misconfiguration. Check the logs for the zkfc daemons as well as the NameNode daemons in order to further diagnose the issue.
more on setting up automatic failover
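A sketch of that test with HDFS commands, assuming the usual two NameNode service IDs nn1/nn2 from an HA configuration:
hdfs haadmin -getServiceState nn1        # find out which NameNode is active
hdfs haadmin -getServiceState nn2
# on the active NameNode's host, simulate a JVM crash
jps | grep -i namenode                   # look up the NameNode PID
kill -9 <namenode-pid>
# within a few seconds the other NameNode should report "active"
hdfs haadmin -getServiceState nn2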

DRP for postgres-xl

I have installed and set up a 2-node cluster of Postgres-XL 9.2, where the coordinator and GTM are running on node1 and the datanode is set up on node2.
Now, before I use it in production, I have to deliver a DRP (disaster recovery plan) solution.
Does anyone have a DR plan for a Postgres-XL 9.2 architecture?
Best Regards,
Aviel B.
So from what you described you only have one of each node... What are you expecting to recover to?
Postgres-XL is a clustered solution. If you only have one of each node then you have no cluster: not only are you not getting any scaling advantage, it is actually going to run slower than standalone Postgres. Plus you have nothing to recover to. If you lose either node you have completely lost the database.
Also the docs recommend you put the coordinator and data nodes on the same server if you are going to combine nodes.
So for the simplest solution in Replication mode you would need something like
Server1 GTM
Server2 GTM Proxy
Server3 Coordinator 1 & DataNode 1
Server4 Coordinator 2 & DataNode 2
Postgres-XL has no failover support, so any failure will require manual intervention.
If you use the DISTRIBUTE BY REPLICATION option, you would just remove the failing node from the cluster and restart everything.
If you use the other DISTRIBUTE BY options, then data is spread over multiple nodes, which means that if you lose any node you lose everything. So for this option you will need to have a slave instance of every datanode and coordinator node you have. If one of the nodes fails, you would remove that node from the cluster and replace it with its slave backup node. Then restart it all.
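As a hedged sketch only, "removing the failing node" can look like this at the SQL level (node and host names are placeholders, and the exact procedure depends on how the cluster was built, e.g. with pgxc_ctl):
# on a surviving coordinator
psql -h coordinator1 -p 5432 -d postgres -c "DROP NODE datanode2;"
psql -h coordinator1 -p 5432 -d postgres -c "SELECT pgxc_pool_reload();"
# then restart the remaining GTM / coordinator / datanode processes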

Should Zookeeper cluster be assigned to only one SolrCloud cluster

I wonder about the best strategy with regard to ZooKeeper and SolrCloud clusters. Should one ZooKeeper cluster be dedicated to each SolrCloud cluster, or can multiple SolrCloud clusters share one ZooKeeper cluster? I guess the former must be a very safe approach, but I am wondering if the second option is fine as well.
As far as I know, SolrCloud uses ZooKeeper to share cluster state (up/down nodes) and to load shared core configurations (solrconfig.xml, schema.xml, etc.) on boot. If you have clients based on SolrJ's CloudSolrServer implementation, then they will mostly perform reads of the cluster state.
In this respect, I think it should be fine to share the same ZK ensemble. Many reads and few writes, this is exactly what ZK is designed for.
SolrCloud puts very little load on a ZooKeeper cluster, so if it's purely a performance consideration then there's no problem. It would probably be a waste of resources to have one ZK cluster per SolrCloud if they're all on a local network. Just make sure each SolrCloud keeps its configuration under a separate ZooKeeper path (chroot). For example, using -zkHost <host>:<port>/path1 for one SolrCloud, and replacing "path1" with "path2" for the second one, will put the Solr files in separate paths within ZooKeeper to ensure they don't conflict.
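A sketch of what that looks like in practice (ensemble hosts are placeholders; depending on the Solr version the chroot can be pre-created with the zkcli.sh script):
# first SolrCloud on chroot /path1
bin/solr start -cloud -z zk1:2181,zk2:2181,zk3:2181/path1
# second SolrCloud on the same ensemble, different chroot
bin/solr start -cloud -z zk1:2181,zk2:2181,zk3:2181/path2
# create a chroot up front if it does not exist yet
server/scripts/cloud-scripts/zkcli.sh -zkhost zk1:2181 -cmd makepath /path1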
Note that the ZK cluster should be well configured and robust, because if it goes down then none of the SolrClouds will be able to respond to changes in node availability or state (e.g. if a SolrCloud leader is lost or not connectable, or if a node enters a recovering state).