Azure ServiceFabric ImageStoreService fault

I deployed this two-node-type cluster in Azure with the reverse proxy enabled and waited for it to deploy. The cluster has 10 nodes, with an instance count of 5 for both node types. When I opened Service Fabric Explorer after 45 minutes, why was this system service failing? I have not deployed the application yet.

I have the exact same problem, with the same two failing system services. From what I have been able to gather, it is caused by a lack of disk space. What size of VMs are you using?
I had more failing services in quorum-loss state to begin with. Restarting the virtual machine scale set resolved most of those problems, but not FaultAnalysisService and ImageStoreService. Surely it should be possible to run a cluster on the smallest VMs Azure offers...
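Before resizing anything, it is worth confirming the disk-space theory from the cluster's own health reports. Here is a minimal diagnostic sketch using the Service Fabric PowerShell module; the endpoint is a placeholder, and certificate parameters are omitted for brevity (a secured cluster needs them):

```powershell
# Connect to the cluster (placeholder endpoint; add -X509Credential and
# certificate parameters if the cluster is secured).
Connect-ServiceFabricCluster -ConnectionEndpoint "mycluster.westeurope.cloudapp.azure.com:19000"

# Cluster-level health rolls up the failing system services.
Get-ServiceFabricClusterHealth

# The health events on the failing service usually name the cause;
# disk pressure shows up here as an explicit health report.
Get-ServiceFabricServiceHealth -ServiceName "fabric:/System/ImageStoreService" |
    Select-Object -ExpandProperty HealthEvents
```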

Related

Activation of actors fails on an on-premises cluster

We have some long-running jobs implemented as Service Fabric actors. The actors have no data other than the reminder. When these services get deployed to the local cluster, they activate with no issues.
When we deploy them to a server which runs a 3-node cluster, some of the services fail to activate. We don't see memory utilization on any node going beyond 50%. However, when we added 2 more nodes and ran on a 5-node cluster, activation worked fine.
We are using only 1 partition and a replica count of 1, so we are wondering whether there is some setting that is stopping the fabric from activating more services.
We have also increased the application port range, but no luck.
We have also noticed that after one service activation fails, other stateful services become unstable. They show errors about unhealthy partitions.
The cluster also runs some stateless services, which run like a charm.
Any clue why the activation fails for the actors?
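One way to narrow this down is to walk from the service to its partitions and replicas and read the health report on the stuck replica. A minimal sketch, assuming a connected cluster and placeholder application/service names:

```powershell
# Placeholder names; substitute your actor application and service.
$app = "fabric:/MyApp"
$svc = "fabric:/MyApp/MyActorService"

# Confirm the service exists and check its status.
Get-ServiceFabricService -ApplicationName $app -ServiceName $svc

# Replicas stuck in InBuild or Down point at the activation failure,
# and NodeName shows which node it happens on.
Get-ServiceFabricPartition -ServiceName $svc |
    Get-ServiceFabricReplica |
    Format-Table NodeName, ReplicaStatus, HealthState

# The unhealthy evaluations usually carry the underlying error text.
Get-ServiceFabricServiceHealth -ServiceName $svc |
    Select-Object -ExpandProperty UnhealthyEvaluations
```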

How to restart Service Fabric scale set machines

We have a Service Fabric cluster with one (primary) scale set with 5 nodes. There was a memory leak in one of our services which drained all of the available memory on the nodes, and eventually other services failed; for instance, some PowerShell commands don't work now. In Service Fabric Explorer everything is healthy and we don't have any errors or warnings. Is it possible to restart the machines, and what is the best way to do it so we can restore the machines to their initial state where all of the services are working?
When scaling down, the scale set removes the node with the highest index, so it won't help to follow the documentation and scale up, then remove the faulty nodes.
What would happen if we restart the scale set nodes one by one? I see that Service Fabric handles it: it disables the node and activates it afterwards. But according to the documentation, at the Silver durability tier we need to have 5 nodes up and running at all times. So before restarting any of the nodes, should we scale up, add one more node, and then proceed with the restart?
If the failing node still has healthy services running, the best approach is to disable the node first with the Disable-ServiceFabricNode command, so that any healthy services are moved off the node with the least possible impact.
Once the services have been moved, in some cases a Restart-ServiceFabricNode command alone can kill any locked services and bring the node back healthy, without actually restarting the VM.
As a last resort, you might need to restart the VM via PowerShell or the Azure portal to give the node a fresh start.
If your cluster is running under a high-density load, you might need to scale up first to bring extra capacity to the cluster so it can reallocate the services.
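Put together, that sequence looks roughly like the sketch below; the node name is a placeholder, and it assumes you are already connected to the cluster:

```powershell
$node = "_NodeType0_3"   # placeholder node name

# Step 1: drain the node. Intent Restart tells Service Fabric the node
# will come back, so services are moved with minimal churn.
Disable-ServiceFabricNode -NodeName $node -Intent Restart -Force

# Wait until the node actually reports Disabled before touching it.
while ((Get-ServiceFabricNode -NodeName $node).NodeStatus -ne "Disabled") {
    Start-Sleep -Seconds 10
}

# Step 2: restart the Fabric node process; often enough to clear locked services.
Restart-ServiceFabricNode -NodeName $node -CommandCompletionMode Verify

# Step 3: re-enable the node so it can take load again.
Enable-ServiceFabricNode -NodeName $node
```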
Provided you have Silver durability for your cluster, to restart an underlying Service Fabric VM, just go to the VMSS in the Azure portal, select the VM, and click 'Restart'. With the Silver tier, Service Fabric uses the Infrastructure Service to orchestrate disabling and restarting the nodes, so you don't have to do all of this manually.
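The equivalent from PowerShell, assuming the Az.Compute module and placeholder resource names, would be roughly:

```powershell
# Placeholder resource group, scale set, and instance id. At Silver
# durability or above, the Infrastructure Service coordinates the
# node disable/enable around this restart.
Restart-AzVmss -ResourceGroupName "my-rg" -VMScaleSetName "nodetype0" -InstanceId "3"
```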
Please note, you should not restart all VMs in the scale set at the same time, or go below the number of VMs that must stay up per your durability level. That could lead to quorum loss and ultimately the demise of your cluster!

Service Fabric App removed after restarting vmss

We have multiple staging environments for our Service Fabric clusters. The development environment (VMSS) is deallocated automatically every night in order to save some costs. Our problem is that all applications that were deployed are removed from the cluster.
Any suggestions? Are we missing any configuration?
Thanks
Service Fabric stores state on local, ephemeral disks, meaning that if the virtual machine is moved to a different host, the data does not move with it. In normal operation, that is not a problem as the new node is brought up to date by other nodes. However, if you stop all nodes and restart them later, there is a significant possibility that most of the nodes start on new hosts and make the system unable to recover.
Also, if you deallocate a VM, its temp disk is released.
So, always leave at least 3 nodes running for a (Bronze reliability tier) development cluster, as in the sketch below.
Or delete and recreate the cluster, combined with automated application deployment.
You can also scale the node SKUs down for the night.
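For the 'leave at least 3 nodes running' option, a hedged sketch using Az.Compute with placeholder resource names; it trims the scale set's instance count overnight rather than deallocating it:

```powershell
# Placeholder names; trims the dev scale set to the 3-node minimum overnight.
$vmss = Get-AzVmss -ResourceGroupName "dev-rg" -VMScaleSetName "nodetype0"
$vmss.Sku.Capacity = 3
Update-AzVmss -ResourceGroupName "dev-rg" -VMScaleSetName "nodetype0" `
    -VirtualMachineScaleSet $vmss

# On Bronze durability you must also clear the state of the removed nodes
# yourself once they report Down (Remove-ServiceFabricNodeState).
```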

Service Fabric Reliability on Standalone Cluster

I am running Service Fabric across a three-node standalone cluster. Each node is on a separate virtual machine in a corporate enterprise cloud environment. Recently, two of the virtual machines on which the nodes reside were deleted (one of the deleted machines being the machine from which the cluster was created). After this deletion, I attempted to access Service Fabric Explorer on the remaining machine, only to get a "Page cannot be found" error. Furthermore, the Connect-ServiceFabricCluster (for attempting to connect to the remaining node) and Get-ServiceFabricApplication PowerShell commands fail, stating:
"A communication error caused the operation to fail."
and
"No cluster endpoint is reachable, please check if there is a connectivity/firewall/DNS issue."
respectively.
Under what conditions does Service Fabric's automatic failover capability work on a standalone cluster? Are there any steps that can be taken so that I would still be able to access Service Fabric from the remaining node(s) on a standalone cluster if several nodes suddenly go down at once?
The cluster's system services run as stateful services on the cluster. For a stateful service, you need a minimum number of nodes running to guarantee its availability and its ability to preserve state. The minimum number of nodes is equal to the target replica set count of the partition/service.
If fewer than the minimum number of nodes are available, your (cluster) services will stop working.
The cluster size is determined by your business needs. However, you must have a minimum cluster size of three nodes (machines or virtual machines).
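While the cluster is still healthy, you can see those replica set sizes directly on the system services. A small sketch, assuming a reachable cluster; the service name in the second command is just one example:

```powershell
# System services live under the fabric:/System application.
Get-ServiceFabricService -ApplicationName "fabric:/System" |
    Format-Table ServiceName, ServiceStatus, HealthState

# If fewer nodes than MinReplicaSetSize survive, the service stops working.
Get-ServiceFabricServiceDescription -ServiceName "fabric:/System/ClusterManagerService" |
    Format-List ServiceName, TargetReplicaSetSize, MinReplicaSetSize
```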

How to add an AWS node and a Vagrant node when the master node is local

Kubernetes 1.2 supports multi-node clusters across multiple service providers. The master node is currently running on my laptop, and I want to add two worker nodes, one in Amazon and one in Vagrant. How can I achieve this?
Kubernetes 1.2 supports multi-node clusters across multiple service providers
Where did you see this? It isn't actually true. In 1.2 we added support for nodes across multiple availability zones within the same region on the same service provider (e.g. us-central1-a and us-central1-b in the us-central1 region in GCP). But there is no support for running nodes across regions in the same service provider, much less spanning a cluster across service providers.
the master node is currently running on my laptop, and I want to add two worker nodes, one in Amazon and one in Vagrant
The worker nodes must be able to connect directly to the master node. I wouldn't suggest exposing your laptop to the internet directly so that it can be reached from an Amazon data center; instead, I would advise you to run the master node in the cloud.
Also note that if you run nodes in the same cluster across multiple environments (AWS, GCP, Vagrant, bare metal, etc.), you are going to have a difficult time getting networking configured properly so that all pods can reach each other.