Windows Failover Cluster not online during creation of SQL Always On Availability Group - powershell

I've been following this tutorial to create an Azure SQL AlwaysOn Availability Group using PowerShell:
Tutorial: AlwaysOn Availability Groups in Windows Azure (PowerShell)
When I get to the command that invokes the CreateAzureFailoverCluster PowerShell script, I check the state of the failover cluster. Failover Cluster Manager reports that "the cluster network name is not online".
When I look at the Cluster Events, I see this:
Cluster network name resource 'Cluster Name' cannot be brought online. Ensure that the network adapters for dependent IP address resources have access to at least one DNS server. Alternatively, enable NetBIOS for dependent IP addresses.
Each of the 3 servers in the cluster can reach the DC via ping. All of the preceding setup steps execute correctly. The servers are all on the 10.10.2.x/24 IP range except the DC, which is on 10.10.0.0/16 (with an IP of 10.10.0.4).
All of the settings have been validated by prior execution of the tutorial on a different Azure subscription to create a failover cluster that works fine.
Cluster validation reveals this warning:
The "Cluster Group" does not contain an Cluster IP Address resource. This is a required resource for the group. It may be difficult to manage the cluster with this resource missing
(sic)
How do I add a Cluster IP Address resource?

There was nothing wrong with the configuration or the steps taken.
It took the cluster 3 hours to come online.
Attempts to bring the cluster online manually failed for at least 20 minutes after the cluster was created.
The Windows Event logs on all 4 servers showed nothing to indicate when the cluster moved into the online state.
It seems the correct solution was to work on something else until the cluster started under its retry policy.

Did you set up a fixed IP address for the cluster using Failover Cluster Manager? There's a bug: DHCP will assign the cluster the IP address of one of the SQL Server instances. I just assigned a high enough number (x.x.x.171, I think), and the problem went away.
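The static-address workaround above depends on picking an address that sits in the cluster subnet and is not already held by a node. A minimal sketch of that check, assuming the 10.10.2.0/24 range from the question; the node addresses and the function name `is_safe_cluster_ip` are made up for illustration and are not part of any Windows or Azure tooling:

```python
import ipaddress


def is_safe_cluster_ip(candidate, subnet, in_use):
    """Return True if candidate is inside subnet and not already taken.

    candidate: dotted-quad string proposed for the cluster name resource
    subnet:    CIDR string for the subnet the cluster nodes live in
    in_use:    iterable of dotted-quad strings already assigned to nodes
    """
    net = ipaddress.ip_network(subnet)
    addr = ipaddress.ip_address(candidate)
    used = {ipaddress.ip_address(ip) for ip in in_use}
    return (
        addr in net
        and addr not in used
        and addr != net.network_address
        and addr != net.broadcast_address
    )


# Hypothetical node addresses; substitute the real ones from your servers.
NODE_IPS = ["10.10.2.4", "10.10.2.5", "10.10.2.6"]
```

With these sample values, `is_safe_cluster_ip("10.10.2.171", "10.10.2.0/24", NODE_IPS)` passes, while an address that duplicates a node or falls outside the subnet is rejected.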

Related

Traffic between kubernetes API server and ETCD

We have a self-managed Kubernetes cluster deployed on AWS EC2 instances. For HA, we are running in 3 AZs with 3 API server nodes and 3 ETCD nodes (one node of each kind in each AZ). In the API server, we have specified the endpoints of all 3 ETCD servers:
--etcd-servers=https://0.etcd.com:2379,https://1.etcd.com:2379,https://2.etcd.com:2379
Currently the API server sends requests to any of the ETCD nodes at random, and AWS is charging us a high cross-AZ data transfer bill.
So we would like to know whether there is some way to intelligently select the ETCD node from the API server: if the API server in AZ-1 wants to communicate, it selects the ETCD node in the same AZ (AZ-1) if available, and otherwise goes cross-AZ.
Amount of traffic: we analysed VPC flow logs and see around 16T per day flowing from port 2379. This is the major contributor, so we are trying to see if we can reduce it.
I searched the internet to see whether people have solved this but couldn't find anything. I also found that the ETCD client has only 2 values for EndpointSelectionMode: Random, which is the default behaviour, and PrioritizeLeader: https://pkg.go.dev/go.etcd.io/etcd/client#EndpointSelectionMode
So the ETCD client does not provide this flexibility either.
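The selection policy being asked for (same AZ first, cross-AZ only as a fallback) can be stated as a short sketch. This is only an illustration of the desired ordering. It is not an option of `--etcd-servers` or of the ETCD client, and the zone labels and function name below are hypothetical:

```python
def order_endpoints_by_zone(endpoints, local_zone):
    """Return endpoint URLs with same-zone endpoints first.

    endpoints:  list of (url, zone) pairs
    local_zone: zone of the caller (for example the API server's AZ)
    The relative order inside each group is preserved.
    """
    local = [url for url, zone in endpoints if zone == local_zone]
    remote = [url for url, zone in endpoints if zone != local_zone]
    return local + remote


# URLs from the question; the zone names are assumed for illustration.
ETCD_ENDPOINTS = [
    ("https://0.etcd.com:2379", "az-1"),
    ("https://1.etcd.com:2379", "az-2"),
    ("https://2.etcd.com:2379", "az-3"),
]
```

A caller in `az-2` would try `https://1.etcd.com:2379` first and only fall back to the other two if it is unavailable.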

How to repair bad IP addresses in standalone Service Fabric cluster

We've just shipped a standalone service fabric cluster to a customer site with a misconfiguration. Our setup:
Service Fabric 6.4
2 Windows servers, each running 3 Hyper-V virtual machines that host the cluster
We configured the cluster locally using static IP addresses for the nodes. When the servers arrived, the IP addresses of the Hyper-V machines were changed to conform to the customer's available IP addresses. Now we can't connect to the cluster, since every IP in the clusterConfig is wrong. Is there any way we can recover from this without re-installing the cluster? We'd prefer to keep the new IPs assigned to the VMs if possible.
I've tested this only in my test environment (I've never done this in production, so do it at your own risk), but since you can't connect to the cluster anyway, I think it is worth trying.
Connect to each virtual machine that is part of the cluster and do the following steps:
Locate the Service Fabric cluster files (usually C:\ProgramData\SF\{nodeName}\Fabric)
Take the ClusterManifest.current.xml file and copy it to a temp folder (for example C:\temp)
Go to the Fabric.Data subfolder, take the InfrastructureManifest.xml file and copy it to the same temp folder
Inside each file you have copied, change the IP addresses of the nodes to the correct values
Stop the FabricHostSvc service by running the net stop FabricHostSvc command in PowerShell
After it has stopped successfully, run this PowerShell command (in admin mode) to update the node's cluster configuration:
New-ServiceFabricNodeConfiguration -ClusterManifestPath C:\temp\ClusterManifest.current.xml -InfrastructureManifestPath C:\temp\InfrastructureManifest.xml
Once the config is updated, start FabricHostSvc again by running net start FabricHostSvc
Do this for each node and pray for the best.
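The "change IP addresses" step above is easy to get wrong by hand when one address is a prefix of another (10.0.0.1 versus 10.0.0.10). A minimal sketch of a whole-address replacement that could be applied to the text of the copied manifest files; the function name and the sample addresses are hypothetical, and it should only ever be run on the copies in the temp folder:

```python
import re

# Matches a complete dotted-quad token that is not embedded in a longer
# run of digits and dots (so 10.0.0.1 does not match inside 10.0.0.10).
_IP_TOKEN = re.compile(r"(?<![\d.])(\d{1,3}(?:\.\d{1,3}){3})(?![\d.])")


def rewrite_ips(text, ip_map):
    """Replace every whole IP address found in text using ip_map.

    text:   contents of a copied manifest file
    ip_map: dict mapping old address -> new address
    Addresses that are not keys of ip_map are left unchanged.
    """
    return _IP_TOKEN.sub(lambda m: ip_map.get(m.group(1), m.group(1)), text)
```

Read each copied file, pass its text through `rewrite_ips` with your old-to-new mapping, review the result, and write it back before continuing with the remaining steps.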

Connect to On Premises Service Fabric Cluster

I've followed the steps from Microsoft to create a multi-node on-premises Service Fabric cluster. I've deployed a stateless app to the cluster and it seems to be working fine. So far I have connected to the cluster using the address of one of the nodes. Doing that, I can connect via PowerShell using Connect-ServiceFabricCluster nodename:19000 and I can open the Service Fabric Explorer website (http://nodename:19080/explorer/index.html).
The examples online suggest that if I hosted in Azure I could connect to http://mycluster.eastus.cloudapp.azure.com:19000 and it would resolve, but I can't work out what the equivalent is for my local cluster. I tried connecting to my sample cluster with Connect-ServiceFabricCluster sampleCluster.domain.local:19000, but that returns:
WARNING: Failed to contact Naming Service. Attempting to contact Failover Manager Service...
WARNING: Failed to contact Failover Manager Service, Attempting to contact FMM...
False
WARNING: No such host is known
Connect-ServiceFabricCluster : No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue.
Am I missing something in my setup? Should there be a central DNS entry somewhere that allows me to connect to the cluster? Or am I trying to do something that isn't supported On-Premises?
Yup, you're missing a load balancer.
This is the best resource I could find to help; I'll paste the relevant contents in case it becomes unavailable.
Reverse Proxy — When you provision a Service Fabric cluster, you have an option of installing Reverse Proxy on each of the nodes on the cluster. It performs the service resolution on the client’s behalf and forwards the request to the correct node which contains the application. In majority of the cases, services running on the Service Fabric run only on the subset of the nodes. Since the load balancer will not know which nodes contain the requested service, the client libraries will have to wrap the requests in a retry-loop to resolve service endpoints. Using Reverse Proxy will address the issue since it runs on each node and will know exactly on what nodes is the service running on. Clients outside the cluster can reach the services running inside the cluster via Reverse Proxy without any additional configuration.
Source: Azure Service Fabric is amazing
I have an Azure Service Fabric resource running, but the same rules apply. As the article states, you'll need a reverse proxy/load balancer, not only to resolve which nodes are running the API but also to balance the load between the nodes running that API. Health probes are necessary too, so that the load balancer knows which nodes are viable targets for traffic.
As an example, Azure creates 2 rules off the bat:
1. LBHttpRule on TCP/19080 with a TCP probe on port 19080 every 5 seconds with a 2 count error threshold.
2. LBRule on TCP/19000 with a TCP probe on port 19000 every 5 seconds with a 2 count error threshold.
What you need to add to make this forward-facing is a rule where you forward port 80 to your service http port. Then the health probe can be an http probe that hits a path to test a 200 return.
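To make the probe settings above concrete, here is a minimal sketch of a TCP probe with a consecutive-failure threshold of 2. It only illustrates the idea. How a real load balancer schedules probes and restores a node is product-specific, and the rule that a single success marks the node healthy again is an assumption of this sketch:

```python
import socket


def tcp_probe(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class ProbeState:
    """Tracks consecutive probe failures for one node."""

    def __init__(self, threshold=2):
        self.threshold = threshold  # failures in a row before marking down
        self.failures = 0

    @property
    def healthy(self):
        return self.failures < self.threshold

    def record(self, ok):
        """Record one probe result and return the resulting health."""
        # Assumption: one successful probe resets the failure count.
        self.failures = 0 if ok else self.failures + 1
        return self.healthy
```

A scheduler would call `state.record(tcp_probe(node, 19000))` every 5 seconds per node and send traffic only to nodes whose state is healthy.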
Once you get into the cluster, you can resolve the services normally and SF will take care of availability.
In Azure-land, this is abstracted again to using something like API Management to further reverse proxy it to SSL. What a mess but it works.
Once your load balancer is set up, you'll have a single IP to hit for management, publishing, and regular traffic.

Zookeeper for High availability

How does ZooKeeper work in the following situation?
Suppose I have 3 VMs (1, 2, 3) with different services running at their endpoints. My entire administration setup (TAC) is available only on the 1st VM (virtual machine), which means that whenever a client wants to connect, it connects to the 1st VM by default. The other 2 VMs just run a bunch of services. This entire cluster setup is maintained by ZooKeeper.
My question is: what if the 1st VM fails? I know ZooKeeper maintains high availability by electing another VM as the master, but the client by default points only to the 1st VM, not to the other two. Is there any way I can overcome this situation, either by obtaining the IP of the first node (since my admin setup is present only on that node) or by some other method?

Service Fabric Reliability on Standalone Cluster

I am running Service Fabric across a three-node standalone cluster. Each node is on a separate virtual machine in a corporate enterprise cloud environment. Recently two of the virtual machines on which the nodes reside were deleted (one of them being the machine from which the cluster was created). After this deletion, I attempted to access Service Fabric Explorer on the remaining machine, only to get a "Page Cannot be found" error. Furthermore, the Connect-ServiceFabricCluster (for attempting to connect to the remaining node) and the Get-ServiceFabricApplication PowerShell commands fail, stating:
"A communication error caused the operation to fail."
and
"No cluster endpoint is reachable, please check if there is a connectivity/firewall/DNS issue."
respectively.
Under what conditions does Service Fabric's automatic failover capability work on a standalone cluster? Are there any steps that can be taken so that I would still be able to access Service Fabric from the remaining node(s) on a standalone cluster if several nodes suddenly go down at once?
The cluster services run as stateful services on the cluster. For a stateful service you need a minimum number of nodes running to guarantee its availability and its ability to preserve state. The minimum number of nodes is equal to the target replica set count of the partition/service.
If fewer than the minimum number of nodes are available, your (cluster) services will stop working.
More info here.
The cluster size is determined by your business needs. However, you must have a minimum cluster size of three nodes (machines or virtual machines).
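The scenario in the question (three nodes, two deleted at once) can be checked with simple majority arithmetic. A minimal sketch, assuming each system service keeps one replica per node and needs a majority of its replicas to stay available; the function names are illustrative and are not Service Fabric APIs:

```python
def write_quorum(replica_count):
    """Smallest number of replicas that forms a majority."""
    return replica_count // 2 + 1


def has_quorum(replica_count, replicas_up):
    """True if the surviving replicas still form a majority."""
    return replicas_up >= write_quorum(replica_count)
```

Under that assumption a three-node cluster tolerates the loss of one node (`has_quorum(3, 2)` is true) but not two (`has_quorum(3, 1)` is false), which matches the behaviour described in the question.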