Service Fabric Reliability on Standalone Cluster - azure-service-fabric

I am running Service Fabric on a three-node standalone cluster. Each node is on a separate virtual machine in a corporate enterprise cloud environment. Recently, two of the virtual machines on which the nodes reside were deleted (one of them being the machine from which the cluster was created). After this deletion, I attempted to access Service Fabric Explorer on the remaining machine, only to get a "Page cannot be found" error. Furthermore, the Connect-ServiceFabricCluster (attempting to connect to the remaining node) and Get-ServiceFabricApplication PowerShell commands fail, stating:
"A communication error caused the operation to fail."
and
"No cluster endpoint is reachable, please check if there is a connectivity/firewall/DNS issue."
respectively.
Under what conditions does Service Fabric's automatic failover capability work on a standalone cluster? Are there any steps that can be taken so that I would still be able to access Service Fabric from the remaining node(s) on a standalone cluster if several nodes suddenly go down at once?

The cluster services run as stateful services on the cluster. A stateful service needs a minimum number of nodes running to guarantee its availability and its ability to preserve state. The minimum number of nodes is equal to the target replica set count of the partition/service.
If fewer than the minimum number of nodes are available, your (cluster) services will stop working.
More info here.
The cluster size is determined by your business needs. However, you must have a minimum cluster size of three nodes (machines or virtual machines).
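Assuming at least a quorum of nodes is still up, the replica-set rule above can be checked from PowerShell. This is a minimal sketch only; the endpoint is a placeholder for your surviving node's client connection endpoint, not a value from the question.

```powershell
# Sketch: connect to the surviving node's client endpoint (default port 19000).
# "10.0.0.3:19000" is a placeholder; substitute your node's address.
Connect-ServiceFabricCluster -ConnectionEndpoint "10.0.0.3:19000"

# Inspect overall cluster health and the system services' replica state.
Get-ServiceFabricClusterHealth
Get-ServiceFabricService -ApplicationName fabric:/System

# With a target replica set count of 3, quorum is floor(3/2) + 1 = 2 replicas.
# Losing two of three nodes leaves one replica, which is below quorum, so the
# system services stop and the commands above fail as described in the question.
```

In the scenario described, with two of three nodes gone, even Connect-ServiceFabricCluster will fail, which is consistent with the errors quoted above.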

Related

How to add remote vm instance as worker node in kubernetes cluster

I'm new to Kubernetes and trying to explore it. My question is:
Suppose I have an existing Kubernetes cluster with 1 master node and 1 worker node, and this setup is on AWS. Now I have 1 more VM instance available on Oracle Cloud Platform, and I want to configure that VM as a worker node and attach it to the existing cluster.
Is it possible to do so? Does anybody have any suggestions on this?
I would instead divide your clusters up by region (unless you have a good VPN between your Oracle and AWS infrastructure).
You can then run applications across clusters. If you absolutely must have one cluster that is geographically separated, I would create a master (etcd host) in each region that has a worker node.
Worker-node-to-master-node communication is critical for a Kubernetes cluster. Adding nodes from on-prem to a cloud provider, or from a different cloud provider, will cause lots of issues from a networking perspective.
A VPN connection between AWS and Oracle Cloud would be needed, and every time the worker node talks to the master node the traffic would (probably) have to cross an ocean.
EDIT: From Kubernetes Doc, Clusters cannot span clouds or regions (this functionality will require full federation support).
https://kubernetes.io/docs/setup/best-practices/multiple-zones/

Azure ServiceFabric ImageStoreService fault

I deployed a cluster with two node types in Azure, with the reverse proxy enabled, and waited for it to deploy. The cluster has 10 nodes, with an instance count of 5 for both node types. When I connected to Service Fabric Explorer after 45 minutes, why would this system service be failing? I have not deployed any application yet.
I have the exact same problem - same two failing system services. From what I have been able to gather, it is a lack of disk space issue. What size of VMs are you using?
I had more failing services in quorum-loss state to begin with. Restarting the Virtual Machine Scale Set resolved most of these problems, but not FaultAnalysisService and ImageStoreService. Surely it should be possible to run a cluster on the smallest of VMs offered in Azure...

How to restart Service Fabric scale set machines

We have a Service Fabric cluster with one scale set (primary) with 5 nodes. There was a memory leak in one of our services which drained all of the available memory on the nodes, and eventually other services failed. For instance, some PowerShell commands don't work now. In Service Fabric Explorer everything is healthy and we don't have any errors or warnings. Is it possible to restart the machines, and what is the best way to do it so we can restore the machines to their initial state where all of the services are working?
When scaling down, the scale set removes the node with the highest index, so it won't help to follow the documentation's approach of scaling up and then removing the faulty nodes.
What would happen if we restart the scale set nodes one by one? I see that Service Fabric handles it - it disables the node and activates it afterwards. But according to the documentation, on the Silver tier we need to have 5 nodes up and running at all times. So before restarting any of the nodes, should we scale up, add one more node, and then proceed with the restart?
If a failing node still has healthy services running, the best approach is to disable the node first with the Disable-ServiceFabricNode command, so that any healthy services are moved off the node with as little impact as possible.
Once the services are moved, in some cases just a Restart-ServiceFabricNode command can kill all locked services and bring the node back healthy, without actually restarting the VM.
As a last resort, you might need to restart the VM via PowerShell or the Azure portal to give the node a fresh start.
If your cluster is running under a high-density load, you might need to scale up first to add capacity to the cluster before it can relocate the services.
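As a rough sketch of the drain-and-restart sequence described above, assuming the Az PowerShell module is available - the endpoint, node name, resource group, and scale set name are all placeholders, not values from the question:

```powershell
# Placeholder endpoint; substitute your cluster's client connection endpoint.
Connect-ServiceFabricCluster -ConnectionEndpoint "mycluster.westus.cloudapp.azure.com:19000"

# 1. Drain services off the node first. The Restart intent tells Service
#    Fabric the node is coming back, so state is not rebuilt unnecessarily.
Disable-ServiceFabricNode -NodeName "_nodetype0_2" -Intent Restart
# Wait for Get-ServiceFabricNode to report the node as Disabled before proceeding.

# 2. In some cases restarting just the Fabric node process is enough.
Restart-ServiceFabricNode -NodeName "_nodetype0_2" -CommandCompletionMode Verify

# 3. As a last resort, restart the underlying scale set VM (instance 2 here),
#    then re-enable the node.
Restart-AzVmss -ResourceGroupName "myRG" -VMScaleSetName "nt0vmss" -InstanceId "2"
Enable-ServiceFabricNode -NodeName "_nodetype0_2"
```

Repeat per node, waiting for the cluster to report healthy between nodes, so you never drop below the node count your durability tier requires.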
Provided you have 'Silver' durability for your cluster, to restart an underlying Service Fabric VM, just go to the VMSS in Azure portal, select the VM and click 'Restart'. With 'Silver' tier, Service Fabric uses the Infrastructure Service to orchestrate disabling and restarting the nodes so you don't have to do all this manually.
Please note, you should not restart all VMs in the scale set at the same time, or go below the number of VMs that must stay up for your durability level. Doing so could lead to quorum loss and ultimately the demise of your cluster!

Zookeeper for High availability

How does ZooKeeper work in the following situation?
Suppose I have three VMs (1, 2, 3) with different services running at their endpoints. My entire administration setup (TAC) is available only on the 1st VM (virtual machine), which means that whenever a client wants to connect, it connects by default to the first VM. The other two VMs just run a bunch of services. This entire cluster setup is maintained by ZooKeeper.
My question is: what if the 1st VM fails? I know ZooKeeper maintains high availability by electing another VM as the master, but clients by default point only to the 1st VM and not to the other two. Is there any way I can overcome this situation, for example by getting the IP of the first node (since my admin setup is present only on that node), or by some other method?

How to set up an AWS node and a Vagrant node when the master node is local

Kubernetes 1.2 supports multi-node clusters across multiple service providers. Right now the master node is running on my laptop, and I want to add two worker nodes, one in Amazon and one in Vagrant. How do I achieve this?
Kubernetes 1.2 supports multi-node clusters across multiple service providers
Where did you see this? It isn't actually true. In 1.2 we added support for nodes across multiple availability zones within the same region on the same service provider (e.g. us-central1-a and us-central1-b in the us-central1 region in GCP). But there is no support for running nodes across regions in the same service provider much less spanning a cluster across service providers.
Right now the master node is running on my laptop, and I want to add two worker nodes, one in Amazon and one in Vagrant
The worker nodes must be able to connect directly to the master node. I wouldn't suggest exposing your laptop to the internet directly so that it can be reached from an Amazon data center, but would instead advise you to run the master node in the cloud.
Also note that if you are running nodes in the same cluster across multiple environments (AWS, GCP, Vagrant, bare metal, etc) then you are going to have a difficult time getting networking configured properly so that all pods can reach each other.