Activation of actors fails on an on-premises cluster - azure-service-fabric

We have some long-running jobs created as Service Fabric actors. The actors have no data other than the reminder. When these services get deployed to the local cluster, they activate with no issues.
When we deploy them to a server that runs a 3-node cluster, some of the services fail to activate. We don't see memory utilization on any node going beyond 50%. However, when we added 2 more nodes and ran on a 5-node cluster, activation worked fine.
We are using only 1 partition and a replica count of 1, so we are wondering whether there is some setting that is stopping Service Fabric from activating more services.
We have also increased the application port range, but no luck.
We have also noticed that after one service activation fails, other stateful services become unstable and report errors about unhealthy partitions.
The cluster also runs some stateless services, which run like a charm.
Any clue why the activation fails for the actors?

Related

Why does my Dask client show zero workers, cores, and memory?

I'm using Dask deployed with Helm on a Kubernetes cluster in Kubernetes Engine on GCP. My current cluster setup has 5 nodes, each with 8 CPUs and 30 GB of memory.
I ran the notebook named 05-nyc-taxi.ipynb, which resulted in workers getting killed.
When I restarted the Dask client, it showed that I now have zero workers and zero memory.
However, when I run kubectl get services and kubectl get pods, it shows that my pods and services are running.
Any idea why this might be the case?
When you restart the client, it kills all the workers and starts making new ones. That process is asynchronous, but the rendering of the client object happens immediately, so there are no workers at that moment. You could render the client object again (and again) later:
In[]: client
Or check the dashboard.
Or, better, you could render the cluster object itself which, so long as you have jupyter widgets installed in the environment, will update itself in real-time. If you didn't happen to assign your cluster object before, it will also be available as client.cluster.
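For example, here is a minimal sketch of waiting for the restarted workers to re-register (it assumes the Helm chart's Jupyter pod has the scheduler address configured, which the standard Dask chart normally does via DASK_SCHEDULER_ADDRESS, so Client() connects without arguments):

    # Poll until the restarted workers have re-registered with the scheduler.
    import time
    from dask.distributed import Client

    client = Client()  # connects to the Helm-deployed scheduler

    while len(client.scheduler_info()["workers"]) == 0:
        time.sleep(1)  # workers come back asynchronously after a restart

    print(len(client.scheduler_info()["workers"]), "workers connected")
    client  # re-rendering the client in a notebook cell now shows the workers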
btw: why are you having to restart the cluster like this?

AWS ECS service running SSH behind Network Load Balancer + Target Group slow to deploy with CodeDeploy

I have an ECS service that serves an SSH process. I am deploying updates to this service through CodeDeploy. I noticed that this service is much slower to deploy than other services with identical images deployed at the same time using CodePipeline. The difference with this service is that it's behind an NLB (the others have no LB or are behind an ALB).
The service is set to 1 container, deploying at 200%/100%, so the service brings up 1 new container, ensures it's healthy, then removes the old one. What I see happen is:
New Container started in Initial state
3+ minutes later, New Container becomes Healthy. Old Container enters Draining
2+ minutes later, Old Container finishes Draining and stops
Deploying thus takes 5-7 minutes, mostly waiting for health checks or draining. However, I'm pretty sure SSH starts up very quickly, and I have the following settings on the target group which should make things relatively quick:
TCP health check on the correct port
Healthy/Unhealthy threshold: 2
Interval: 10s
Deregistration Delay: 10s
ECS Docker stop custom timeout: 65s
So the minimum time from SSH being up to the old container being terminated would be:
2*10=20s for TCP health check to turn to Healthy
10s for the deregistration delay before Docker stop
65s for the Docker stop timeout
This totals 95 seconds (20 + 10 + 65), which is a lot less than the observed 5-7 minutes. Other services take 1-3 minutes, and the LB/Target Group timings are not nearly as aggressive there.
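For reference, these target group settings map onto the API roughly like this; a boto3 sketch in which the target group ARN is a placeholder (the ECS Docker stop timeout is configured on the task definition / ECS agent, not on the target group):

    import boto3

    elbv2 = boto3.client("elbv2")
    tg_arn = "arn:aws:elasticloadbalancing:<region>:<account>:targetgroup/<name>/<id>"  # placeholder

    # TCP health check on the traffic port, 2 checks to flip state, every 10s.
    elbv2.modify_target_group(
        TargetGroupArn=tg_arn,
        HealthCheckProtocol="TCP",
        HealthCheckIntervalSeconds=10,
        HealthyThresholdCount=2,
        UnhealthyThresholdCount=2,
    )

    # 10-second deregistration delay so draining targets are released quickly.
    elbv2.modify_target_group_attributes(
        TargetGroupArn=tg_arn,
        Attributes=[{"Key": "deregistration_delay.timeout_seconds", "Value": "10"}],
    )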
Any ideas why my service behind an NLB seems slow to cycle through these lifecycle transitions?
You are not doing anything wrong here; this simply appears to be a (current) limitation of this product.
I recently noticed similar delays in registration/availability time with ECS services behind an NLB and decided to explore. I created a simple JavaScript TCP echo server and set it up as an ECS service behind an NLB (ECS service count of 1). Like you, I used a TCP health check with a healthy/unhealthy threshold of 2 and an interval/deregistration delay of 10 seconds.
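Any minimal TCP listener works for this kind of test; for example, a rough Python stand-in for that echo server (port 8080 is an arbitrary choice, so point the target group's TCP health check at whatever port you actually use):

    import socketserver

    class EchoHandler(socketserver.StreamRequestHandler):
        def handle(self):
            for line in self.rfile:      # echo every line back to the caller
                self.wfile.write(line)

    if __name__ == "__main__":
        # Listen on all interfaces so the NLB health checks can reach it.
        with socketserver.ThreadingTCPServer(("0.0.0.0", 8080), EchoHandler) as srv:
            srv.serve_forever()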
After the initial deploy was successful and the service was reachable via the NLB, I wanted to see how long it would take for the service to be restored in the event of a complete failure of the underlying instance. To simulate this, I killed the service via the ECS console. After several iterations of this test, I consistently observed a timeline similar to the following (times are in seconds):
0s: killed service
5s: ECS reports old service draining; Target Group shows service draining; ECS reports new service instance is started
15s: ECS reports new task is registered; Target Group shows new instance with status of 'initial'
135s: TCP healthcheck traffic from the load balancer starts arriving for the service (as measured by tcpdump on the EC2 host running the container)
225s: Target Group finally marks the service as 'healthy'; ECS reports service has reached a steady state
I performed the same tests with a simple Express app behind an ALB, and the gap between ECS starting the service and the ALB reporting it healthy was 10-15 seconds. The best result we achieved testing the NLB was 3.5 minutes from service stop to full availability.
I shared these findings with AWS via a support case, asking specifically for clarification on why there was a consistent 120-second gap before the NLB started healthchecking the service, and why we consistently saw 90-120 seconds between the beginning of healthchecks and service availability. They confirmed that this behavior is known but did not offer a time for resolution or a strategy to decrease the latency in service availability.
Unfortunately, this will not do much to help resolve your issue, but at least you can know that you're not doing anything wrong.

How to restart Service Fabric scale set machines

We have a Service Fabric cluster with one scale set (primary) with 5 nodes. There was a memory leak in one of our services which drained all of the available memory on the nodes, and eventually other services failed. For instance, some PowerShell commands don't work now. In Service Fabric Explorer everything is healthy and we don't have any errors or warnings. Is it possible to restart the machines, and what is the best way to do it so we could restore the machines to their initial state where all of the services are working?
When scaling down, the scale set removes the node with the highest index, so it won't help to follow the documentation, scale up, and then remove the nodes that are faulty.
What would happen if we restart the scale set nodes one by one? I see that Service Fabric handles it: it disables the node and activates it afterwards. But according to the documentation, with the Silver durability tier we need to have 5 nodes up and running at all times. So before restarting any of the nodes, should we scale up, add one more node, and then proceed with the restart?
If the failing nodes still have healthy services running, the best approach is to disable each node first with the Disable-ServiceFabricNode command, so that any healthy services are moved off the node with the least possible impact.
Once the services are moved, in some cases just a Restart-ServiceFabricNode command can kill all the locked services and bring the node back healthy, without actually restarting the VM.
As a last resort, you might need to restart the VM via PowerShell or the Azure portal to give the node a fresh start.
If your cluster is running under high-density load, you might need to scale up first to bring extra capacity into the cluster to reallocate the services.
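If you'd rather script the VM restart than click through the portal, a rough sketch with the Azure Python SDK (azure-identity plus azure-mgmt-compute) looks like the following; the resource names are placeholders, and the node should already have been disabled/drained as described above:

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.compute import ComputeManagementClient

    subscription_id = "<subscription-id>"  # placeholder
    compute = ComputeManagementClient(DefaultAzureCredential(), subscription_id)

    # Restart a single scale set instance and wait for the operation to finish.
    poller = compute.virtual_machine_scale_set_vms.begin_restart(
        resource_group_name="<resource-group>",
        vm_scale_set_name="<scale-set-name>",
        instance_id="0",  # restart one node at a time
    )
    poller.wait()

This triggers the same VMSS restart operation as the portal's Restart button.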
Provided you have 'Silver' durability for your cluster, to restart an underlying Service Fabric VM, just go to the VMSS in Azure portal, select the VM and click 'Restart'. With 'Silver' tier, Service Fabric uses the Infrastructure Service to orchestrate disabling and restarting the nodes so you don't have to do all this manually.
Please note, you should not restart all VMs in the scaleset at the same time, or go below the number of VMs needed to be up per your durability level. This could lead to quorum loss and ultimately the demise of your cluster!

Service Fabric Reliability on Standalone Cluster

I am running Service Fabric across a three-node standalone cluster. Each node is on a separate virtual machine in a corporate enterprise cloud environment. Recently, two of my virtual machines on which the nodes reside were deleted (one of the deleted machines being the machine from which the cluster was created). After this deletion, I attempted to access Service Fabric Explorer on the remaining machine, only to get a "Page cannot be found" error. Furthermore, the Connect-ServiceFabricCluster (for attempting to connect to the remaining node) and the Get-ServiceFabricApplication PowerShell commands fail, stating:
"A communication error caused the operation to fail."
and
"No cluster endpoint is reachable, please check if there is a connectivity/firewall/DNS issue."
respectively.
Under what conditions does Service Fabric's automatic failover capability work on a standalone cluster? Are there any steps that can be taken so that I would still be able to access Service Fabric from the remaining node(s) on a standalone cluster if several nodes suddenly go down at once?
The cluster services run as stateful services on the cluster. For a stateful service, you need a minimum number of nodes running to guarantee its availability and its ability to preserve state. The minimum number of nodes is equal to the target replica set count of the partition/service.
If less than the minimum number of nodes are available, your (cluster) services will stop working.
More info here.
The cluster size is determined by your business needs. However, you must have a minimum cluster size of three nodes (machines or virtual machines).
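To make the arithmetic concrete, here is an illustrative calculation, assuming the system services on this three-node standalone cluster use a target replica set count of 3:

    # Illustrative quorum math for a stateful partition (assumed target
    # replica set count of 3, as is typical for system services on a
    # three-node cluster).
    def quorum(target_replica_set_count: int) -> int:
        # A write quorum is a majority of the target replica set.
        return target_replica_set_count // 2 + 1

    remaining_nodes = 1  # two of the three VMs were deleted
    print("quorum needed:", quorum(3))                        # -> 2
    print("quorum available:", remaining_nodes >= quorum(3))  # -> False

With only one node left, the system services cannot form a write quorum, which is why the Explorer endpoint and the PowerShell commands stop responding.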

What happens when the Kubernetes master fails?

I've been trying to figure out what happens when the Kubernetes master fails in a cluster that only has one master. Do web requests still get routed to pods if this happens, or does the entire system just shut down?
According to the documentation for OpenShift 3, which is built on top of Kubernetes (https://docs.openshift.com/enterprise/3.2/architecture/infrastructure_components/kubernetes_infrastructure.html), if a master fails, nodes continue to function properly, but the system loses its ability to manage pods. Is this the same for vanilla Kubernetes?
In typical setups, the master nodes run both the API and etcd and are either largely or fully responsible for managing the underlying cloud infrastructure. When they are offline or degraded, the API will be offline or degraded.
In the event that they, etcd, or the API are fully offline, the cluster ceases to be a cluster and is instead a bunch of ad-hoc nodes for this period. The cluster will not be able to respond to node failures, create new resources, move pods to new nodes, etc., until both:
Enough etcd instances are back online to form a quorum and make progress (for a visual explanation of how this works and what these terms mean, see this page).
At least one API server can service requests
In a partially degraded state, the API server may be able to respond to requests that only read data.
However, in any case, life for applications will continue as normal unless nodes are rebooted or there is a dramatic failure of some sort during this time, because TCP/UDP services, load balancers, DNS, the dashboard, etc. should all continue to function for at least some time. Eventually, these things will all fail, on different timescales. In single-master setups or under complete API failure, DNS failure will probably happen first as caches expire (on the order of minutes, though the exact timing is configurable; see the CoreDNS cache plugin documentation). This is a good reason to consider a multi-master setup: DNS and service routing can continue to function indefinitely in a degraded state, even if etcd can no longer make progress.
There are actions that you could take as an operator which would accelerate failures, especially in a fully degraded state. For instance, rebooting a node would break DNS queries, and in fact probably all pod and service networking functionality, until at least one master comes back online. Restarting DNS pods or kube-proxy would also be bad.
If you'd like to test this out yourself, I recommend kubeadm-dind-cluster, kind, or, for more exotic setups, kubeadm on VMs or bare metal. Note: kubectl proxy will not work during API failure, as that routes traffic through the master(s).
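As a quick way to see the control-plane/data-plane split during such a test, a small probe like this (a sketch using the official Kubernetes Python client) starts failing as soon as the API server is unreachable, even while already-running pods keep serving traffic:

    from kubernetes import client, config

    def api_server_reachable() -> bool:
        try:
            config.load_kube_config()       # use load_incluster_config() inside a pod
            client.VersionApi().get_code()  # a plain GET /version against the API server
            return True
        except Exception:
            return False

    print("API server reachable:", api_server_reachable())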
A Kubernetes cluster without a master is like a company running without a Manager.
No one can instruct the workers (the k8s components) other than the Manager (the master node); even you, the owner of the cluster, can only instruct the Manager.
Everything works as usual until the work is finished or something stops the workers (because the master node died after assigning the work).
As there is no Manager to re-assign any work to them, the workers will wait and wait until the Manager comes back.
The best practice is to assign multiple Managers (masters) to your cluster.
Although your data plane and running applications do not immediately start breaking, there are several scenarios where cluster admins will wish they had a multi-master setup. Key to understanding the impact is understanding which components talk to the master, for what, and, more importantly, when they will fail if the master fails.
Although your application pods running on the data plane will not be immediately impacted, imagine a very possible scenario: your traffic suddenly surges and your Horizontal Pod Autoscaler kicks in. The autoscaling will not work, because the Metrics Server collects resource metrics from the kubelets and exposes them in the Kubernetes API server through the Metrics API for use by the Horizontal Pod Autoscaler and the Vertical Pod Autoscaler (but your API server is already dead). If your pods' memory shoots up because of the high load, they will eventually be killed by the OOM killer. If any pods die, then, since the controller manager and scheduler talk to the API server to watch the current state of pods, they will fail to react as well. In short, no new pod will be scheduled and your application may stop responding.
One thing to highlight is that Kubernetes system components communicate only with the API server; they don't talk to each other directly, so when the API server is down their functionality breaks down as well. An unavailable control plane can mean several things: failure of any or all of these components (API server, etcd, kube-scheduler, controller manager), or, at worst, the entire master node has crashed.
If the API server is unavailable, no one can use kubectl, as generally all commands talk to the API server (meaning you cannot connect to the cluster or log in to any pods to check anything on the container file system, and you will not be able to see application logs unless you have an additional centralized log management system).
If the etcd database fails or gets corrupted, your entire cluster state is gone, and the admins will want to restore it from backups as early as possible.
In short, a failed single-master control plane may not immediately impact your traffic-serving capability, but it cannot be relied on to keep serving your traffic.