ZooKeeper for high availability (apache-zookeeper)

How does ZooKeeper work in the following situation?
Suppose I have three VMs (1, 2, 3) with different services running at their endpoints. My entire administration setup (TAC) is available only on the first VM, which means that whenever a client wants to connect, it connects to the first VM by default. The other two VMs just run a bunch of services. This entire cluster setup is maintained by ZooKeeper.
My question is: what if the first VM fails? I know ZooKeeper maintains high availability by electing another VM as the master, but the client by default points only to the first VM, not to the other two. Is there any way to overcome this, for example by taking over the first node's IP (since my admin setup is present only on that node), or by some other method?
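For the client side of this, ZooKeeper's usual answer is that clients are configured with a connection string listing every ensemble member, and the client library walks that list until one host responds, so no single VM is a hard-coded entry point. A minimal stdlib-only sketch of that failover idea (the host list and the `is_reachable` predicate are illustrative placeholders, not ZooKeeper's actual client API):

```python
def first_reachable(hosts, is_reachable):
    """Return the first host that answers, mimicking how a ZooKeeper
    client walks its connection string when a server is down."""
    for host in hosts:
        if is_reachable(host):
            return host
    raise ConnectionError("no ensemble member reachable")

# Example: vm1 is down, so the client falls over to vm2.
hosts = ["vm1:2181", "vm2:2181", "vm3:2181"]
print(first_reachable(hosts, lambda h: h != "vm1:2181"))  # vm2:2181
```

The same principle applies to the admin setup itself: if it lives on only one node, it is a single point of failure regardless of what ZooKeeper does, so replicating it (or putting a floating IP / load balancer in front) is the more robust fix than rescuing one IP.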

Related

How to repair bad IP addresses in standalone Service Fabric cluster

We've just shipped a standalone service fabric cluster to a customer site with a misconfiguration. Our setup:
Service Fabric 6.4
2 Windows servers, each running 3 Hyper-V virtual machines that host the cluster
We configured the cluster locally using static IP addresses for the nodes. When the servers arrived, the IP addresses of the Hyper-V machines were changed to conform to the customer's available IP addresses. Now we can't connect to the cluster, since every IP in the clusterConfig is wrong. Is there any way we can recover from this without reinstalling the cluster? We'd prefer to keep the new IPs assigned to the VMs if possible.
I've tested this only in my test environment (I've never done this in production, so do it at your own risk), but since you can't connect to the cluster anyway, I think it's worth a try.
Connect to each virtual machine that is part of the cluster and do the following:
Locate the Service Fabric cluster files (usually C:\ProgramData\SF\{nodeName}\Fabric)
Copy the ClusterManifest.current.xml file to a temp folder (for example C:\temp)
Go to the Fabric.Data subfolder and copy the InfrastructureManifest.xml file to the same temp folder
In each copied file, change the node IP addresses to the correct values
Stop the FabricHostSvc process by running net stop FabricHostSvc in PowerShell
After it stops successfully, run this PowerShell (admin mode) command to update the node's cluster configuration:
New-ServiceFabricNodeConfiguration -ClusterManifestPath C:\temp\ClusterManifest.current.xml -InfrastructureManifestPath C:\temp\InfrastructureManifest.xml
Once the config is updated, start the service again with net start FabricHostSvc
Do this for each node and pray for the best.
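The "change IP addresses" step is the error-prone part when repeated across six nodes. As a sketch, the substitution can be scripted as a blunt text replacement over the copied manifests; the file name, attribute, and IP values below are placeholders, and you should review the result by hand before running New-ServiceFabricNodeConfiguration:

```python
import tempfile
from pathlib import Path

def rewrite_ip(path, old_ip, new_ip):
    """Replace every occurrence of old_ip in a manifest copy.
    A blunt text substitution; eyeball the output before use."""
    p = Path(path)
    p.write_text(p.read_text().replace(old_ip, new_ip))

# Demo on a throwaway file standing in for a copied manifest.
with tempfile.TemporaryDirectory() as d:
    manifest = Path(d) / "ClusterManifest.current.xml"
    manifest.write_text('<Node IPAddressOrFQDN="10.0.0.5" />')
    rewrite_ip(manifest, "10.0.0.5", "192.168.1.5")
    print(manifest.read_text())  # <Node IPAddressOrFQDN="192.168.1.5" />
```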

Azure ServiceFabric ImageStoreService fault

I deployed this two-node-type cluster in Azure with reverse proxy enabled and waited for it to deploy. The cluster has 10 nodes, with an instance count of 5 for both node types. When I connected to the system explorer after 45 minutes, why would this system service be failing? I have not deployed the application yet.
I have the exact same problem - same two failing system services. From what I have been able to gather, it is a lack of disk space issue. What size of VMs are you using?
I had more failing services in quorum-loss state to begin with. Restarting the Virtual Machine Scale Set resolved most of these problems, but not FaultAnalysisService and ImageStoreService. Surely it should be possible to run a cluster on the smallest of VMs offered in Azure...

Service Fabric Reliability on Standalone Cluster

I am running Service Fabric across a three-node standalone cluster. Each node is on a separate virtual machine in a corporate enterprise cloud environment. Recently, two of the virtual machines on which the nodes reside were deleted (one of them being the machine the cluster was created from). After this deletion, I attempted to access Service Fabric Explorer on the remaining machine, only to get a "Page cannot be found" error. Furthermore, the Connect-ServiceFabricCluster (for attempting to connect to the remaining node) and Get-ServiceFabricApplication PowerShell commands fail, stating:
"A communication error caused the operation to fail."
and
"No cluster endpoint is reachable, please check if there is a connectivity/firewall/DNS issue."
respectively.
Under what conditions does Service Fabric's automatic failover capability work on a standalone cluster? Are there any steps that can be taken so that I would still be able to access Service Fabric from the remaining node(s) on a standalone cluster if several nodes suddenly go down at once?
The cluster services run as stateful services on the cluster. For a stateful service you need a minimum number of nodes running, to guarantee its availability and ability to preserve state. The minimum number of nodes is equal to the target replica set count of the partition/service.
If fewer than the minimum number of nodes are available, your (cluster) services will stop working.
More info here.
The cluster size is determined by your business needs. However, you must have a minimum cluster size of three nodes (machines or virtual machines).
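The "minimum number of nodes" rule above reduces to quorum arithmetic: a stateful replica set of size N needs a majority (N // 2 + 1) of replicas up, so it tolerates the loss of only N minus that majority. A small sketch of that reasoning (function names are mine, not a Service Fabric API):

```python
def quorum(replicas: int) -> int:
    """Majority of replicas a stateful partition needs to keep running."""
    return replicas // 2 + 1

def tolerated_failures(replicas: int) -> int:
    """How many replicas can be lost before quorum loss."""
    return replicas - quorum(replicas)

# A 3-replica system service survives 1 lost node; losing 2 of 3
# nodes at once (as in the question) drops it below quorum.
for n in (3, 5, 7):
    print(n, quorum(n), tolerated_failures(n))
```

This is why deleting two of three VMs left the remaining node unreachable: the system services themselves lost quorum, so there was nothing left to serve the Explorer or PowerShell endpoints.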

Windows Failover Cluster not online during creation of SQL Always On Availability Group

I've been following this tutorial to create an Azure SQL AlwaysOn Availability Group using Powershell:
Tutorial: AlwaysOn Availability Groups in Windows Azure (PowerShell)
When I get to the command that invokes the CreateAzureFailoverCluster PowerShell script, I check the state of the failover cluster. In Failover Cluster Manager, it is shown as "the cluster network name is not online".
When I look at the Cluster Events, I see this:
Cluster network name resource 'Cluster Name' cannot be brought online. Ensure that the network adapters for dependent IP address resources have access to at least one DNS server. Alternatively, enable NetBIOS for dependent IP addresses.
Each of the 3 servers in the cluster has access to the DC via ping. All of the preceding setup steps execute correctly. The servers are all on the 10.10.2.x/24 IP range except the DC, which is on 10.10.0.0/16 (with IP of 10.10.0.4)
All of the settings have been validated by prior execution of the tutorial on a different Azure subscription to create a failover cluster that works fine.
Cluster validation reveals this warning:
The "Cluster Group" does not contain an Cluster IP Address resource. This is a required resource for the group. It may be difficult to manage the cluster with this resource missing
(sic)
How do I add a Cluster IP Address resource?
There was nothing wrong with the configuration or the steps taken.
It took the cluster three hours to come online.
Attempting to bring the cluster online manually failed for at least 20 minutes after creating the cluster.
The Windows event logs on all four servers showed nothing to indicate when the cluster moved into the online state.
It seems the correct solution was to work on something else until the cluster started under its retry policy.
Did you set up a fixed IP address for the cluster, using the cluster manager? There's a bug where DHCP will assign the cluster the IP address of one of the SQL Server instances. I just assigned a high enough number (x.x.x.171, I think), and the problem went away.

Could we use zookeeper as a monitoring app?

I have a single server instance in the cluster that has some special duties, and we can have only one such server in the cluster. I would like to monitor the health of this server, bring up another instance if this node fails, and have the new node join the cluster. Do you think Apache ZooKeeper can do the job? Could you point me to an example?
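This is a classic fit for ZooKeeper's ephemeral sequential znodes: each candidate server creates an ephemeral sequential child under an election path, the candidate with the lowest sequence number acts as the special server, and when its session dies its znode vanishes, so the next-lowest candidate takes over. Real code would use a client library such as kazoo or Apache Curator; the stdlib sketch below only illustrates the election rule itself, with made-up znode names:

```python
def elect_leader(znodes):
    """Given ephemeral-sequential children of an election path,
    the candidate with the lowest sequence number is the leader."""
    return min(znodes, key=lambda z: int(z.rsplit("-", 1)[1]))

members = ["candidate-0000000002", "candidate-0000000000", "candidate-0000000001"]
print(elect_leader(members))  # candidate-0000000000

# If the leader's session dies, its ephemeral znode disappears and
# the next-lowest candidate takes over automatically:
members.remove("candidate-0000000000")
print(elect_leader(members))  # candidate-0000000001
```

Each standby sets a watch on the znode just below its own, so failover happens as soon as ZooKeeper expires the dead server's session, with no polling loop required.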