Service Fabric Cluster Node gone down suddenly

I have created a 5-node Azure Service Fabric cluster and deployed applications to it many times. However, one of the nodes in my cluster has suddenly gone down and is in an error state, while the remaining nodes are active. I've tried manually activating the node through Service Fabric Explorer, but nothing happens. I've also tried restarting the node; neither has helped.
Is there any way to force this node back online, or am I going to have to delete the cluster and start again?
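If Service Fabric Explorer won't bring the node back, it can be worth issuing the same operations from PowerShell, where a failure at least surfaces as an error message. A minimal sketch, assuming an unsecured cluster endpoint and a node name copied from Explorer (both placeholders here):
Connect-ServiceFabricCluster -ConnectionEndpoint "mycluster.westus.cloudapp.azure.com:19000"
# Clear any pending deactivation and re-enable the node
Enable-ServiceFabricNode -NodeName "_nodetype0_3"
# Failing that, restart the Fabric.exe host process on the node (-NodeInstanceId 0 matches any instance)
Restart-ServiceFabricNode -NodeName "_nodetype0_3" -NodeInstanceId 0 -CommandCompletionMode Verify
If the node still shows as Error after that, the underlying VM itself may be unhealthy, in which case restarting that scale set instance from the Azure portal is the next thing to try before giving up on the cluster.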

Related

kubectl get nodes hangs when I delete a node externally

Been experimenting with Kubernetes/Rancher and encountered some unexpected behavior. Today I'm deliberately putting on my chaos monkey hat and learning how things behave when stuff fails.
Here's what I've done:
1) Using the Rancher UI, I stood up a 3-node cluster on Digital Ocean.
Success -- a few minutes later I have a 3-node cluster, visible in Rancher.
2) Using the Rancher UI, I deleted a node in a 'happy' scenario, pushing the appropriate node-delete button in Rancher.
Some minutes later, I have a 2-node cluster. Great.
3) Using the Digital Ocean admin UI, I deleted a node in an 'oops' scenario, as if a sysadmin had accidentally deleted a node.
Back on the ranch (sorry), I go to view the state of the cluster in the Rancher UI. Unfortunately, after three minutes I get a gateway timeout; Chrome's network inspector shows the requests timing out.
Here's what kubectl says:
$ kubectl get nodes
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
So, question is, what happened here?
I was under the impression that Kubernetes was 'self healing', and that even if the node I deleted was the etcd leader, it would eventually recover. It's been around 2 hours -- do I just need to wait longer?
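One way to check whether etcd actually survived: assuming an RKE-provisioned cluster (which is what Rancher builds here), etcd runs in a container literally named etcd on each etcd-role node, so from an SSH session on one of the two surviving droplets you can query the remaining members directly:
$ docker exec etcd etcdctl member list
$ docker exec etcd etcdctl endpoint health
If the etcd and controlplane roles were only assigned to the droplet that was deleted, the API server has lost its datastore and there is nothing left to self-heal from, which would explain the timeouts.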

Activation of actors fails on on-premise cluster

We have some long-running jobs implemented as Service Fabric actors. The actors have no data other than the reminder. When these services get deployed to the local cluster, they activate with no issues.
When we deploy them to a server running a 3-node cluster, some of the services fail to activate. We don't see memory utilization on any node going beyond 50%. However, when we added 2 more nodes and ran on a 5-node cluster, activation worked fine.
We are using only 1 partition and a replica count of 1, so we're wondering whether there is some setting that is stopping the fabric from activating more services.
We have also increased the application port range, but no luck.
We've also noticed that after one service activation fails, other stateful services become unstable and show errors about unhealthy partitions.
The cluster also runs some stateless services, which run like a charm.
Any clue why the activation fails for the actors?
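One hedged starting point is to ask the cluster what its own health evaluation says for a failing service; a minimal PowerShell sketch (the endpoint, application, and service names below are placeholders):
Connect-ServiceFabricCluster -ConnectionEndpoint "mycluster:19000"
# Overall cluster health, including unhealthy evaluations per node
Get-ServiceFabricClusterHealth
# Partition layout and health of one failing actor service
Get-ServiceFabricPartition -ServiceName fabric:/MyApp/MyActorService
Get-ServiceFabricServiceHealth -ServiceName fabric:/MyApp/MyActorService
The health reports usually name the exact partition or replica that failed to open, which narrows down whether this is a placement problem or an activation crash.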

Service Fabric Cluster Upgrade Failing

I've an on-premise, secure, development cluster that I wish to upgrade. The current version is 5.7.198.9494. I've followed the steps listed here.
At the time of writing, the latest version of SF is 6.2.283.9494. However, running Get-ServiceFabricRuntimeUpgradeVersion -BaseVersion 5.7.198.9494 shows that I must first update to 6.0.232.9494 before upgrading to 6.2.283.9494.
I run the following in PowerShell, and the upgrade does start:
# Copy the runtime package to the cluster's image store
Copy-ServiceFabricClusterPackage -Code -CodePackagePath .\MicrosoftAzureServiceFabric.6.0.232.9494.cab -ImageStoreConnectionString "fabric:ImageStore"
# Register the copied package with the cluster
Register-ServiceFabricClusterPackage -Code -CodePackagePath MicrosoftAzureServiceFabric.6.0.232.9494.cab
# Kick off a monitored rolling upgrade
Start-ServiceFabricClusterUpgrade -Code -CodePackageVersion 6.0.232.9494 -Monitored -FailureAction Rollback
However, after a few minutes the following happens:
The PowerShell IDE crashes
The Service Fabric Cluster becomes unreachable
Service Fabric Local Cluster Manager disappears from the task bar
Event Viewer will log the events, see below.
Quite some time later, the VM reboots. Service Fabric Local Cluster Manager will then only give options to Setup or Restart the local cluster.
Event Viewer has logs under Applications and Services Logs -> Microsoft-Service Fabric -> Operational. Most are information about opening, closing, and aborting one of the upgrade domains. There are some warnings about a VM failing to open an upgrade domain, stating the error: Lease Failed.
This behavior happens consistently, and I've not yet been able to update the cluster. My guess is that we are not able to upgrade a development cluster, but I've not found an article that states that.
Am I doing something incorrectly here, or is it impossible to upgrade a development cluster?
I will assume you have a development cluster with a single node or multiple nodes in a single VM.
As described in the first section of the documentation at the same link you provided:
service-fabric-cluster-upgrade-windows-server
You can upgrade your cluster to the new version only if you're using a
production-style node configuration, where each Service Fabric node is
allocated on a separate physical or virtual machine. If you have a
development cluster, where more than one Service Fabric node is on a
single physical or virtual machine, you must re-create the cluster
with the new version.
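For a standalone dev cluster, that re-creation is scripted by the standalone package itself; roughly, assuming a typical package layout (the config file name is a placeholder):
# Run from the extracted standalone package folder, with the new package version downloaded first
.\RemoveServiceFabricCluster.ps1 -ClusterConfigFilePath .\ClusterConfig.json
.\CreateServiceFabricCluster.ps1 -ClusterConfigFilePath .\ClusterConfig.json -AcceptEULA
That way the cluster comes back up already on the new runtime, with no in-place upgrade involved.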

Service Fabric App removed after restarting vmss

We have multiple staging environments for our Service Fabric clusters. The development environment (a VM scale set) is deallocated automatically every night in order to save some costs. Our problem is that all applications that were deployed are removed from the cluster.
Any suggestions? Are we missing any configuration?
Thanks
Service Fabric stores state on local, ephemeral disks, meaning that if
the virtual machine is moved to a different host, the data does not
move with it. In normal operation, that is not a problem as the new
node is brought up to date by other nodes. However, if you stop all
nodes and restart them later, there is a significant possibility that
most of the nodes start on new hosts and make the system unable to
recover.
Also, if you deallocate a VM, the temp disk is released.
So, always leave at least 3 nodes running for a (Bronze reliability tier) development cluster.
Or delete and recreate the cluster, combined with automated application deployment (a sketch follows below).
You can also scale the node SKUs down for the night.
More info here.
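A minimal sketch of what that scripted redeployment could look like in PowerShell once the cluster is back (the endpoint, package path, and type names are placeholders):
Connect-ServiceFabricCluster -ConnectionEndpoint "mydevcluster:19000"
# Push the application package to the image store
Copy-ServiceFabricApplicationPackage -ApplicationPackagePath .\MyAppPkg -ApplicationPackagePathInImageStore MyAppPkg -ImageStoreConnectionString "fabric:ImageStore"
# Register the type, then create the application instance
Register-ServiceFabricApplicationType -ApplicationPathInImageStore MyAppPkg
New-ServiceFabricApplication -ApplicationName fabric:/MyApp -ApplicationTypeName MyAppType -ApplicationTypeVersion 1.0.0
Run on a schedule after the nightly start, this makes the 'delete and recreate' option tolerable for a dev environment.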

How to restart unresponsive kubernetes master in GKE

The kubernetes master in one of my GKE clusters became unresponsive last night following the infrastructure issue in us-central1-a.
Whenever I run "kubectl get pods" in the default namespace I get the following error message:
Error from server: an error on the server has prevented the request from succeeding
If I run "kubectl get pods --namespace=kube-system", I only see the kube-proxy and the fluentd-logging daemon.
I have tried scaling the cluster down to 0 and then scaling it back up. I have also tried downgrading and upgrading the cluster, but that seems to apply only to the nodes (not the master). Is there any GKE/K8s API command to issue a restart to the Kubernetes master?
There is not a command that will allow you to restart the Kubernetes master in GKE (since the master is considered a part of the managed service). There is automated infrastructure (and then an oncall engineer from Google) that is responsible for restarting the master if it is unhealthy.
In this particular case, restarting the master had no effect on restoring it to normal behavior, because Google Compute Engine Incident #16011 caused an outage on 2016-06-28 for GKE masters running in us-central1-a (even though that isn't indicated on the Google Cloud Status Dashboard). During the incident, many masters were unavailable.
If you had tried to create a GCE cluster using kube-up.sh during that time, you would have similarly seen that it would be unable to create a functional master VM due to the SSD Persistent disk latency issues.
I try to keep at least one newer version ready to upgrade to: if you upgrade the master, it restarts and is usually working again within a few minutes. Otherwise you may have to wait around 3 days for the Google team to reboot it. E-mail and phone won't help, and unless you have paid support (a transition that itself takes a few days), they won't lift a finger.
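If you go the upgrade route, a master upgrade (which restarts the master as a side effect) can be triggered from the command line; the cluster name, version, and zone below are placeholders:
$ gcloud container clusters upgrade my-cluster --master --cluster-version 1.6.4 --zone us-central1-a
Note that this only works once a newer master version is actually available for your cluster.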