I have some traffic that can push CPU usage up to 180%. A single pod works, but the responses were extremely slow. When I configured my HPA with cpu=80%, min=1 and max={2 or more}, I hit connection refused while the HPA was creating more pods. When I set a larger min (i.e. min=3), the connection refused errors went away, but there were too many idle pods when traffic was low. Is there any way to keep a pod from being put online until it has completely started?
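For reference, an HPA with the settings described (CPU target 80%, min 1, max 2) might look roughly like this; the names are placeholders and the autoscaling/v2 API assumes a reasonably recent cluster:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app                  # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app                # placeholder Deployment
  minReplicas: 1
  maxReplicas: 2
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80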
I hit connection refused when HPA was creating more pods
Kubernetes uses the readinessProbe to determine whether to direct clients to a Pod. If a Pod's readinessProbe is not successful, any Service whose selectors match that Pod will not take it into consideration.
If there is no readinessProbe defined, or if it is misconfigured, Pods that are still starting up may end up serving client requests. Connection refused suggests that no process was listening for incoming connections yet.
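As a minimal sketch, a readinessProbe on the serving container could look like this; the /healthz path, port 8080 and image are placeholders for whatever your application actually exposes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: web-app
        image: example/web-app:latest    # placeholder image
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /healthz               # placeholder readiness endpoint
            port: 8080
          initialDelaySeconds: 5         # give the process time to start listening
          periodSeconds: 5
          failureThreshold: 3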
Please share your Deployment/StatefulSet/... if you need further assistance setting this up.
Related
We have a medium-sized Kubernetes cluster. Imagine a situation where approximately 70 pods are connecting to a single socket server. It works fine most of the time; however, from time to time one or two pods simply fail to resolve the Kubernetes DNS, and the request times out with the following error:
Error: dial tcp: lookup thishost.production.svc.cluster.local on 10.32.0.10:53: read udp 100.65.63.202:36638->100.64.209.61:53: i/o timeout at
What we noticed is that this is not the only service failing intermittently; other services experience it from time to time. We used to ignore it, since it was very random and rare, but in the above case it is very noticeable. The only solution is to actually kill the faulty pod. (Restarting doesn't help.)
Has anyone experienced this? Do you have any tips on how to debug or fix it?
It almost feels as if it's beyond our expertise and is fully related to the internals of the DNS resolver.
Kubernetes version: 1.23.4
Container Network: cilium
This issue is most probably related to the CNI.
I would suggest following the link to debug the issue:
https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/
To be able to help you, we need more information:
Is this cluster on-premises or in the cloud?
What are you using for CNI?
How many nodes are running, and are they all in the same subnet? If yes, do they have other interfaces?
Share the result of the command below:
kubectl get pods --namespace=kube-system -l k8s-app=kube-dns -o wide
When you restart the pod to solve the issue temporarily, does it stay on the same node or does it move to another one?
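Following the debugging guide linked above, it also helps to run a test pod and check resolution from inside the cluster, for example:

kubectl apply -f https://k8s.io/examples/admin/dns/dnsutils.yaml
kubectl exec -i -t dnsutils -- nslookup kubernetes.default
kubectl exec -i -t dnsutils -- cat /etc/resolv.conf
kubectl logs --namespace=kube-system -l k8s-app=kube-dns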
The problem I'm trying to solve is horizontal scaling for a web application where some sessions lead to high CPU usage. The idea is to use a readiness probe to inform Kubernetes that a pod is busy with its current task and that new traffic has to be sent to another one (the HPA will do the work and prepare a new pod).
But I want the session being processed on the initial pod to stay active, and once the work is done the result to be delivered to the user.
The question is: if the readiness probe fails, will Kubernetes:
Stop routing ALL traffic to the pod and drop the current sessions opened through the ingress, or
Stop routing NEW traffic to the pod, while current sessions remain active for the specified timeout?
Thank you in advance.
UPDATE
It seems I was not right in my first edit. It is more correct to say that Kubernetes will stop routing NEW traffic to the pod, but TCP connections such as SSH will still be alive.
When the Endpoints controller receives the notification that the readiness probe failed, it removes the Pod as an endpoint of the Service the Pod is part of. The API server then sends this information to the kube-proxies running on the worker nodes, and each kube-proxy updates the iptables rules on its node, which is what prevents new connections from being forwarded to this Pod. However, it's worth knowing that TCP is a stateful protocol (unlike HTTP), so existing connections (e.g. SSH sessions) will still be active.
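For the use case above, a hedged sketch would be a readinessProbe pointing at an endpoint the application itself flips to failing while it is busy with a heavy session; the /ready path and port 8080 are assumptions, not something Kubernetes defines:

containers:
- name: web-app                    # placeholder
  image: example/web-app:latest    # placeholder
  readinessProbe:
    httpGet:
      path: /ready                 # hypothetical endpoint the app returns 503 on while busy
      port: 8080
    periodSeconds: 5
    failureThreshold: 1            # drop out of the Service endpoints quickly when busy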
I have a GKE cluster (1.12.10-gke.17).
I'm running the nginx-ingress-controller with type: LoadBalancer.
I've set externalTrafficPolicy: Local to preserve the source ip.
Everything works great, except during rolling updates. I have maxSurge: 1 and maxUnavailable: 0.
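For context, the relevant parts of that setup look roughly like this (names and ports are approximate, not the exact manifests):

apiVersion: v1
kind: Service
metadata:
  name: nginx-ingress-controller   # approximate name
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local     # preserves the client source IP
  selector:
    app: nginx-ingress-controller
  ports:
  - name: http
    port: 80
    targetPort: 80
  - name: https
    port: 443
    targetPort: 443

and, in the controller Deployment, the rollout strategy fragment:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0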
My problem is that during a rolling update, I start getting request timeouts. I suspect the Google load balancer is still sending requests to the node where the pod is Terminating even though the health checks are failing. This happens for about 30-60s starting right when the pod changes from Running to Terminating. Everything stabilizes after a while and traffic eventually goes only to the new node with the new pod.
If the load balancer is slow to stop sending requests to a terminating pod, is there some way to make these rolling deploys hitless?
My understanding is that in a normal k8s Service, where externalTrafficPolicy is left at the default (Cluster), the Google load balancer simply sends requests to all nodes and lets the iptables sort it out. When a pod is Terminating, the iptables are updated quickly and traffic no longer gets sent to that pod. In the case where externalTrafficPolicy is Local, however, if the node that receives the request does not have a Running pod, the request times out, which is what is happening here.
If this is correct, then I only see two options:
1. stop sending requests to the node with a Terminating pod
2. continue servicing requests even though the pod is Terminating
I feel like option 1 is difficult since it requires informing the load balancer that the pod is about to start Terminating.
I've made some progress on option 2, but so far haven't gotten it working. I've managed to continue serving requests from the pod by adding a preStop lifecycle hook which just runs sleep 60, but I think the problem is that the healthCheckNodePort reports localEndpoints: 0 and I suspect something is blocking the request between arriving at the node and getting to the pod. Perhaps, the iptables aren't routing when localEndpoints: 0.
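For reference, the preStop hook I added looks roughly like this pod-spec fragment; note that terminationGracePeriodSeconds has to be longer than the sleep (the default is 30s):

terminationGracePeriodSeconds: 90          # must exceed the 60s sleep below
containers:
- name: nginx-ingress-controller           # placeholder name
  lifecycle:
    preStop:
      exec:
        command: ["sh", "-c", "sleep 60"]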
I've also adjusted the Google load balancer health check, which is different from the readinessProbe and livenessProbe, to the "fastest" settings possible, e.g. a 1s interval and a failure threshold of 1, and I've verified that the load balancer backend (i.e. the k8s node) does fail the health checks quickly, but the load balancer continues to send requests to the terminating pod anyway.
There is a similar discussion here. Although it's not identical, it's a similar use case.
Everything sounds like it is working as expected.
The LoadBalancer will send traffic to any healthy node based on the LoadBalancer health check. The LoadBalancer is unaware of individual pods.
The health check will mark a node as unhealthy once the health check threshold is crossed, i.e. the HC is sent every x seconds with an x-second timeout and x allowed failed requests. This causes a delay between the time the pod goes into Terminating and the time the node is marked as unhealthy.
Also note that once the pod is marked as notReady, it is removed from the Service endpoints. If there is no other pod on the node, traffic will continue reaching that node (because of the HC behaviour explained above), but the requests can't be forwarded because of the externalTrafficPolicy (traffic remains on the node where it was sent).
There are a couple of ways to address this.
1. To minimize the amount of time between a terminated pod and the node being marked as unhealthy, you can set a more aggressive health check. The trouble with this is that an overly sensitive HC may cause false positives, it usually increases the overhead on the node (additional health check requests), and it will not fully eliminate the failed requests.
2. Have enough pods running so that there are always at least 2 pods per node. Since the Service removes the pod from the endpoints once it goes into notReady, requests will just get sent to the running pod instead. The downside here is that you will either have additional overhead (more pods) or a tighter grouping (more vulnerable to failure). It also won't fully eliminate the failed requests, but they will be incredibly few.
3. Tweak the HC and your container to work together:
3a. Have the HC endpoint be separate from the normal path you use.
3b. Configure the container readinessProbe to match the main path your container serves traffic on (it will be different from the LB HC path).
3c. Configure your image so that, when SIGTERM is received, the first thing to go down is the HC path.
3d. Configure the image to gracefully drain all connections once a SIGTERM is received, rather than immediately closing all sessions and connections.
This should mean that ongoing sessions will gracefully terminate which reduces errors.
It should also mean that the node will start failing HC probes while it is still able to serve normal traffic; this gives the node time to be marked as unhealthy, so the LB stops sending traffic to it before it is no longer able to serve requests.
The problem with this last option is twofold. First, it is more complex to configure. Second, your pods will take longer to terminate, so rolling updates will take longer, as will any other process that relies on gracefully terminating the pod, such as draining the node. The second issue isn't too bad unless you need a quick turnaround.
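For what it's worth, a minimal sketch of what 3a-3d could look like at the pod-spec level; the paths, port, and timings are assumptions chosen to illustrate the split between the LB health-check path and the app's normal serving path, and the actual SIGTERM handling has to live in the application itself:

terminationGracePeriodSeconds: 60          # leave time for connection draining (3d)
containers:
- name: web-app                            # placeholder
  image: example/web-app:latest            # placeholder
  ports:
  - containerPort: 8080
  readinessProbe:                          # 3b: probes the app's normal serving path
    httpGet:
      path: /                              # assumed main path
      port: 8080
    periodSeconds: 5
  # 3a: the GCP LB health check is configured (outside this manifest) against a
  # dedicated path, e.g. /lb-healthz, served by the same container.
  # 3c/3d: on SIGTERM the app should start failing /lb-healthz first, then drain
  # existing connections before exiting; that logic is application code, not YAML.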
I'm having trouble with my Kubernetes ingress on GKE. I'm simulating termination of a preemptible instance by manually deleting it (through the GCP dashboard). I am running a regional GKE cluster (one VM in each availability zone in us-west1).
A few seconds after selecting delete on only one of the VMs I start receiving 502 errors through the load balancer. Stackdriver logs for the load balancer list the error as failed_to_connect_to_backend.
Monitoring the health of the backend service shows the backend being terminated go from HEALTHY to UNHEALTHY and then disappear, while the other two backends remain HEALTHY.
After a few seconds requests begin to succeed again.
I'm quite confused why the load balancer is unable to direct traffic to the healthy nodes while one goes down - or maybe this is a kubernetes issue? Could the load balancer be properly routing traffic to a healthy instance, but the kubernetes NodePort service on that instance proxies the request back to the unhealthy instance for some reason?
Well, I would say that if you kill a node from the GCP Console, you are killing it from the outside in. It will take time until the kubelet realizes this event, so kube-proxy won't update the service endpoints and the iptables immediately either.
Until that happens, the ingress controller will keep sending packets to the services specified by the ingress rule, and the services to pods that no longer exist.
This is just speculation; I might be wrong. But according to the GCP documentation, if you are using preemptible VMs, your app should be failure tolerant.
[EXTRA]
So, let's consider two general scenarios. In the first one we send a kubectl delete pod command, while in the second one we kill a node abruptly.
With kubectl delete pod ... you are telling the api-server that you want to kill a pod. The api-server will summon the kubelet to kill the pod, and it will be re-created on another node (if applicable). kube-proxy will update the iptables so that the services forward requests to the right pod.
If you kill the node, it is the kubelet that first realizes that something is wrong, so it reports this to the api-server. The api-server re-schedules the pods onto a different node (always). The rest is the same.
My point is that there is a difference between the api-server knowing from the beginning that no packets can be sent to a pod, and being notified only once the kubelet realizes that the node is unhealthy.
How do you solve this? You can't, and actually that is logical. Do you want the same performance from preemptible machines, which cost about 5 times less than a normal VM? If that were possible, everyone would be using them.
Finally, again, Google advises using preemptible VMs if your application is failure tolerant.
I was following this blog post on high-availability Kubernetes, but I noticed that when I shut down one of the masters, kubectl get pods would intermittently return:
The connection to the server 1.2.3.4 was refused - did you specify the right host or port?
My guess is that this is because the kubernetes-master load balancer does not health-check the master instances, so the unhealthy master is still in the load balancer rotation.
I just wanted to check whether this is a known issue or whether there is something wrong with my setup. I can add the health check manually, but it feels like I'm doing something wrong.
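If it helps, the check I would add manually just probes the API server's own health endpoint; assuming the masters serve the API on the default port 6443, something like the following verifies it by hand (1.2.3.4 is the load balancer address from the error above):

kubectl get --raw /healthz
curl -k https://1.2.3.4:6443/healthz      # assumed default API server port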