Load Balancer has mystery "Failed" status - azure-service-fabric

I have a testing/staging Service Fabric cluster that I've been using to work through a proof-of-concept system for a project. As of last night, my load balancer is showing a "Failed" status, but with no other detail or messages that I can find, I have no idea what the issue actually is.
I'm still able to hit my services within the cluster, and they're executing requests successfully. All Load Balancer rules seem to be working and passing traffic, but the options to add/delete/modify anything within the Load Balancer are greyed out. The "Diagnose and solve problems" sheet shows 0 failed deployments, 0 errors, and 0 alerts. I've restarted the cluster and republished all the applications on it, but the only options I appear to have on the load balancer itself are to delete it or move it to another resource group.
How can I troubleshoot this further without tearing up my cluster and starting over (again)?
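Since the portal isn't surfacing anything, one non-destructive place to look is the load balancer's provisioning state and the subscription's activity log. A rough sketch with the Azure CLI (the resource group and load balancer names below are placeholders):
# Placeholder names; substitute your own resource group and load balancer.
RG=my-sf-rg
LB=my-sf-loadbalancer
# The state behind the portal's "Failed" badge.
az network lb show --resource-group "$RG" --name "$LB" --query provisioningState -o tsv
# Recent failed operations in the resource group; a failed write against the
# load balancer here usually explains why add/delete/modify are greyed out.
az monitor activity-log list --resource-group "$RG" --offset 7d --status Failed -o table
If the provisioning state is stuck on Failed, re-running the last deployment (or a no-op update of the load balancer from the ARM template) can nudge it back to Succeeded without recreating the cluster.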

Related

Kubernetes, Traefik load balancer with sticky session server unreachable for a few seconds during rolling update

When releasing an update, it all works fine (pods are destroyed one by one and the new version fired up), but if I refresh the website during a rolling update I sometimes get a server unreachable error. I guess it is due to the load balancer still trying to send traffic to a pod which is probably in the terminating state.
How can I resolve this issue? Should I add a retry policy to the Traefik IngressRoute, or is there a way to force the pod to be removed from the load balancer before it is terminated?
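Not a full answer, but one common mitigation is to keep the old pod serving for a few seconds after it is marked Terminating, so the ingress/load balancer has time to drop it from its endpoints before it actually exits. A minimal sketch, assuming a Deployment called my-app with a container called web (both placeholder names):
# Add a short preStop sleep so a terminating pod keeps answering while it is
# being removed from the endpoints list (my-app / web are placeholders).
kubectl patch deployment my-app --type strategic --patch '
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 30
      containers:
      - name: web
        lifecycle:
          preStop:
            exec:
              command: ["sleep", "10"]
'
A Traefik retry middleware attached to the IngressRoute can also mask the remaining window, but draining via preStop addresses the cause rather than retrying around it.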

GCP kubernetes - Ingress reporting: All backend services are in UNHEALTHY state

I'm fairly new to K8s but not so new that I haven't got a couple of running stacks and even a production site :)
In a new deployment I've noticed the following ingress:
Type: Ingress
Load balancer: External HTTP(S) LB
It is reporting "All backend services are in UNHEALTHY state", which is odd since the service is working and traffic has been served from it for a week.
Now, on closer inspection, Backend services: k8s-be-32460--etcetc is what's unhappy. So using the GUI I click that...
Then I see the frontend with a funnel for Asia, Europe, and America, which seems to be funneling all traffic to Europe. Presumably this is normal for the distributed external load balancer service (as per the docs), and my cluster resides in Europe. Cool. Except...
k8s-ig--etcetc europe-west1-b 1 of 3 instances healthy
1 out of 3 instances you say? eh?
And this is about as far as I've got so far. Can anyone shed any light?
Edit:
Ok, so one of the nodes reporting as unhealthy was in fact a node from the default-node-pool. I have now scaled that pool back to 0 nodes, since as far as I'm aware the preference is to manage node pools explicitly. That leaves us with just 2 nodes, 1 of which is unhealthy according to the ingress, despite both being in the same zone.
Digging even further, the GUI somehow reports that only one of the instance group's instances is healthy, yet these instances are auto-created by GCP; I don't manage them.
Any ideas?
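For reference, the per-instance health the GUI is summarising can also be pulled with gcloud, using the backend service name shown on the Ingress (the k8s-be-32460--... one above):
# Substitute the full auto-generated backend service name from the Ingress.
gcloud compute backend-services get-health k8s-be-32460--etcetc --global
That lists each instance in the instance group together with its healthState, which at least confirms exactly which VM the load balancer is unhappy with.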
Edit 2:
I followed this right the way through, SSHing to each of the VMs in the instance group and executing the health check on each node. One does indeed fail.
It's just a simple curl localhost:32460: one node routes and the other doesn't, though there is something listening on 32460, as shown here:
tcp6 0 0 :::32460 :::* LISTEN -
The health check is HTTP to / on port 32460.
Any ideas why a single node would have stopped working? As I say, I'm not savvy with how this underlying VM has been configured.
I'm wondering now whether it's just some sort of straightforward routing issue, but it's extremely convoluted at this point.
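If it is a routing issue, one common culprit worth ruling out is the Service's externalTrafficPolicy: when it is set to Local, kube-proxy only routes the NodePort on nodes that actually host a backing pod, so a node without a pod keeps the listener open but fails the health check. A quick check (my-service and app=my-service are placeholder names/labels):
# Does the Service drop traffic on nodes without local endpoints?
kubectl get service my-service -o jsonpath='{.spec.externalTrafficPolicy}{"\n"}'
# Which nodes actually run the pods behind it?
kubectl get pods -l app=my-service -o wide
If the policy is Local and the pods all sit on the healthy node, that would explain the one-node-routes, one-node-doesn't behaviour.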
This worked for me:
In my case, I was exposing an API that didn't have a default route; that is to say, if I browsed to my IP, it returned a 404 error (Not Found). So, as a test, I added a "default" route in my Startup.cs, like this:
app.UseEndpoints(endpoints =>
{
    // Catch-all route so that "/" returns 200 instead of 404,
    // which lets the load balancer's health check pass.
    endpoints.MapGet("/", async context =>
    {
        await context.Response.WriteAsync("Hola mundillo");
    });
    endpoints.MapControllers();
});
Then the status changed from unhealthy to OK. Maybe that isn't a definitive solution, but it may help someone find the error.
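To confirm a fix like this from the node's point of view, the same probe the GCE health checker performs can be repeated by hand (32460 is the NodePort from the question above; substitute your own):
# Expect 200 after the change; a 404 here is what the health checker was seeing.
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:32460/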

kubectl get nodes hangs when I delete a node externally

Been experimenting with Kubernetes/Rancher and encountered some unexpected behavior. Today I'm deliberately putting on my chaos monkey hat and learning how things behave when stuff fails.
Here's what I've done:
1) Using the Rancher UI, stand up a 3 node cluster on Digital Ocean
Success -- a few mins later I have a 3 node cluster, visible in Rancher.
2) Using the Rancher UI, I deleted a node in a 'happy' scenario, pushing the appropriate node delete button in Rancher.
Some minutes later, I have a 2 node cluster. Great.
3) Using the Digital Ocean admin UI, I delete a node in an 'oops' scenario as if a sysadmin accidentally deleted a node.
Back on the ranch (sorry), I click through to view the state of the cluster.
Unfortunately, after three minutes I'm getting a gateway timeout (detailed timeouts visible in the Chrome network inspector).
Here's what kubectl says:
$ kubectl get nodes
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
So, question is, what happened here?
I was under the impression Kubernetes was 'self-healing', and that even if the node I deleted was the etcd leader, it would eventually recover. It's been around 2 hours -- do I just need to wait more?
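Whether it self-heals depends on what the deleted droplet was running. If it was the only node with the etcd/controlplane roles, or its loss cost etcd its quorum, the API server can no longer answer, which is consistent with the kubectl timeout; waiting longer won't fix that on its own. A rough way to check from a surviving node, assuming an RKE-style Rancher cluster where etcd runs as a Docker container named etcd (an assumption about this setup):
# On a surviving node: are the control-plane containers still there?
docker ps --filter name=etcd --filter name=kube-apiserver
# If the etcd container exists, ask it who its members are and whether they're healthy.
docker exec etcd etcdctl member list
docker exec etcd etcdctl endpoint health
If a dead member is still listed, it generally has to be removed (or the node replaced through Rancher) before the cluster recovers.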

How to send alert when Azure Service fabric health goes bad

Recently our Azure Service Fabric health went bad after a deployment. The deployment was successful, but the Service Fabric health went bad due to a code issue and it was not rolling back. Only on looking into Service Fabric Explorer did we realize that the cluster had gone bad.
Is there a way to get an email alert when the Service Fabric health goes bad?
Scenarios where service fabric failed
Whole cluster: what happened was that 1 service went bad (showed in red) and was consuming a lot of memory, which in turn caused other services to go bad, and after that the whole cluster. I had to log into the scale set to see which service was taking most of the memory.
In another case we added another reliable collection, alongside an existing reliable collection, to a stateful service. This caused a failure.
In each of these cases I need to look at Service Fabric Explorer and then go to each scale set to see the actual error message.
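One approach is to poll the cluster's health REST endpoint (the HTTP gateway on port 19080 exposes GetClusterHealth) from a small watchdog and send the mail yourself whenever AggregatedHealthState is not Ok. A minimal sketch with placeholder cluster address, certificate paths, and mail command:
# Placeholders: cluster address and the client certificate for a secured cluster.
CLUSTER=https://mycluster.westus.cloudapp.azure.com:19080
CERT=./client-cert.pem
KEY=./client-key.pem
STATE=$(curl -sk --cert "$CERT" --key "$KEY" \
  "$CLUSTER/\$/GetClusterHealth?api-version=6.0" | jq -r '.AggregatedHealthState')
if [ "$STATE" != "Ok" ]; then
  # Wire this into whatever mailer or alerting webhook you already use.
  echo "Service Fabric cluster health is $STATE" | mail -s "Service Fabric cluster unhealthy" ops@example.com
fi
The same events can also be shipped to Azure Monitor / Log Analytics via the cluster's diagnostics settings and alerted on there; the script above is just the most direct route to an email.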

Do web api stateless services in azure service fabric cluster go to sleep after a period of inactivity?

We have a bunch of OWIN-based Web API services hosted in an Azure Service Fabric cluster. All of those services have been mapped to different ports in the associated load balancer. There are 2 out-of-the-box probes when the cluster is created: FabricGatewayProbe and FabricHttpGatewayProbe. We added our port rules and used FabricGatewayProbe in them.
For some reason, these service endpoints seem to be going to sleep after a period of inactivity, because clients of those services are timing out. We tried adjusting the load balancer idle timeout period to 30 minutes (which is the maximum). It seemed to help immediately, but only for a brief period, and then we were back to timeout errors.
Where else should I be looking for resolution of this problem?
So, further to our comments, I agree that the documentation is open to interpretation, but after doing some testing I can confirm the following:
When creating a new cluster via the portal, it will give you a 1:1 relation of rule to probe, and I have also been able to reproduce your issue by modifying one of my existing ARM templates to use the same existing probe as you have.
On reflection this makes sense, as a probe is effectively bound to a service. If you attempt to share one probe across rules on different ports, how will the load balancer know whether each of the services is actually up? Also, Service Fabric (depending on your instance count settings) will move the services between nodes.
So if you had two services on different ports sharing the same probe, and they ended up on different nodes, the service not using the probe's port would receive the error that the request took too long to respond.
A little long-winded, so hopefully a quick illustration will help show what I mean.
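As a rough sketch of what that looks like with the Azure CLI (all names and ports below are placeholders; the frontend IP and backend pool names are the defaults from the Service Fabric template, so check yours), each load-balancing rule gets its own probe on the backend port it forwards to:
RG=my-sf-rg
LB=my-sf-loadbalancer
# A dedicated probe for the service listening on (placeholder) port 8080.
az network lb probe create \
  --resource-group "$RG" --lb-name "$LB" \
  --name AppPort8080Probe --protocol Tcp --port 8080
# The rule for that port references its own probe instead of FabricGatewayProbe.
az network lb rule create \
  --resource-group "$RG" --lb-name "$LB" \
  --name AppPort8080Rule --protocol Tcp \
  --frontend-port 8080 --backend-port 8080 \
  --frontend-ip-name LoadBalancerIPConfig \
  --backend-pool-name LoadBalancerBEAddressPool \
  --probe-name AppPort8080Probe
Repeat the pair per service port; the two built-in Fabric probes stay bound to their own gateway rules.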