I'm having trouble with my Kubernetes ingress on GKE. I'm simulating termination of a preemptible instance by manually deleting it (through the GCP dashboard). I am running a regional GKE cluster (one VM in each availability zone in us-west1).
A few seconds after selecting delete on only one of the VMs I start receiving 502 errors through the load balancer. Stackdriver logs for the load balancer list the error as failed_to_connect_to_backend.
Monitoring the health of the backend service shows the backend being terminated going from HEALTHY to UNHEALTHY and then disappearing, while the other two backends remain HEALTHY.
After a few seconds requests begin to succeed again.
I'm quite confused as to why the load balancer is unable to direct traffic to the healthy nodes while one goes down - or maybe this is a Kubernetes issue? Could the load balancer be properly routing traffic to a healthy instance, but the Kubernetes NodePort service on that instance then proxies the request back to the unhealthy instance for some reason?
Well, I would say that if you kill a node from the GCP Console, you are kind of killing it from the outside in. It will take time for the kubelet to realize this event, so kube-proxy won't update the Service endpoints and the iptables rules immediately either.
Until that happens, the ingress controller will keep sending packets to the Services specified by the Ingress rules, and those Services will keep forwarding them to Pods that no longer exist.
This is just speculation; I might be wrong. But according to the GCP documentation, if you are using preemptible VMs, your app should be fault tolerant.
So, let's consider two general scenarios. In the first one we send a kubectl delete pod command, while in the second one we kill a node abruptly.
With kubectl delete pod ... you are telling the api-server that you want to kill a pod. The api-server will have the kubelet kill the pod and will re-create it on another node (if applicable). kube-proxy will update the iptables rules so that the Services forward requests to the right pods.
If you kill the node abruptly, its kubelet can no longer report anything, so the api-server only realizes something went wrong once the node's heartbeats stop arriving. The pods are then re-scheduled on a different node. The rest is the same.
My point is that there is a difference between the api-server knowing from the beginning that no packets can be sent to a pod, and it only finding out after the fact that the node has become unhealthy.
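If you want to see this difference for yourself, here is a rough sketch (the service and pod names are placeholders, not yours):

kubectl get endpoints my-service --watch   # watch the Service's endpoint list update live
kubectl delete pod my-pod-abc12            # hypothetical pod name; its endpoint disappears almost immediately

If you instead delete the VM from the GCP Console, the endpoints for pods on that node linger in the list until the control plane notices the node is unhealthy.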
How do you solve this? You can't. And actually this should be logical: do you expect the same behavior from preemptible machines, which cost about 5 times less than a normal VM? If that were possible, everyone would be using them.
Finally, again, Google advises using preemptible VMs only if your application is fault tolerant.
I'm running a web-server deployment in an EKS cluster. The deployment is exposed behind a NodePort service, ingress resource, and AWS Load Balancer controller.
This deployment is configured to run on "always-on" nodes, using a Node Selector.
The EKS cluster runs additional auto-scaled workloads which can also use spot instances if needed (in the same namespace).
Since the NodePort service exposes a static port on every node in the cluster, there are many targets in the corresponding target group, and they are registered and de-registered whenever a node is added to or removed from the cluster.
What exactly happens if a request from the client is routed to the service on a node that is about to be scaled down?
I'm asking since I'm getting many 504 Gateway Timeouts from the ALB. Specifically, these requests do not reach our FE/BE pods and terminate at the ALB level.
Welcome to the community, @gil-shelef!
Based on the AWS documentation, additional handlers should be used to add both resilience and cost savings.
Let's start by understanding how this works:
There is a dedicated node termination handler DaemonSet which runs a pod on each spot instance and listens for spot instance interruption notices. This makes it possible to gracefully terminate any running pods on that node, drain the node from the load balancer, and let the Kubernetes scheduler reschedule the evicted pods on other instances.
The workflow looks like the following (taken from the AWS documentation on Spot Instance Interruption Handling; that link also has an example). It can be summarized as:
Identify that a Spot Instance is about to be interrupted in two minutes.
Use the two-minute notification window to gracefully prepare the node for termination.
Taint the node and cordon it off to prevent new pods from being placed on it.
Drain connections on the running pods.
Once pods are removed from the endpoints, kube-proxy will trigger an update of the iptables rules, which takes a little bit of time. To make this smoother for end users, you should consider adding a preStop pause of about 5-10 seconds, as sketched below. You can find more information about how this happens and how you can mitigate it in my answer here.
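For illustration, a minimal sketch of such a preStop pause in a Deployment's pod template (the name, image, and timings are placeholders, not your actual config):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server                        # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-server
  template:
    metadata:
      labels:
        app: web-server
    spec:
      terminationGracePeriodSeconds: 40   # must cover the preStop pause plus application shutdown
      containers:
      - name: web
        image: nginx:1.25                 # placeholder image
        ports:
        - containerPort: 80
        lifecycle:
          preStop:
            exec:
              command: ["sh", "-c", "sleep 10"]   # give kube-proxy and the ALB time to deregister this pod

The sleep only delays the SIGTERM; it buys time for the endpoint, iptables and target-group updates to propagate before the container starts shutting down.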
Also, here are links for these handlers (a rough install sketch follows):
Node termination handler
Cluster autoscaler on AWS
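If it helps, the termination handler is usually installed from AWS's eks-charts Helm repository; a rough sketch (namespace and values are whatever fits your setup):

helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm install aws-node-termination-handler eks/aws-node-termination-handler --namespace kube-system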
For your last question, please check this AWS KB article on how to troubleshoot EKS and 504 errors.
I've recently been reading myself into kubernetes and want to create a StatefulSet for a service of mine.
As far as I understood, a StatefulSet with, let's say, 5 replicas offers certain DNS entries to reach it.
E.g. myservice1.internaldns.net, myservice2.internaldns.net
What would now happen, if one of the pods behind the dns entries goes down, even if it's just for a small amount of time?
I had a hard time finding information on this.
Is the request held until the pod is back? Will it be routed to another pod, possibly losing the respective state? Will it just straight up fail?
If your Pod is not ready, then traffic is not forwarded to that Pod. So, your Service will not load-balance traffic to Pods that are not ready.
To decide whether a given Pod is ready or not, you should define a readinessProbe. I recommend reading the Kubernetes documentation on "Configure Liveness, Readiness and Startup Probes".
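As a minimal sketch (the port and the /healthz path are assumptions about your app, not something Kubernetes prescribes):

apiVersion: v1
kind: Pod
metadata:
  name: myservice-0              # e.g. one replica of your StatefulSet
spec:
  containers:
  - name: app
    image: myservice:1.0         # placeholder image
    ports:
    - containerPort: 8080
    readinessProbe:
      httpGet:
        path: /healthz           # assumed health endpoint
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
      failureThreshold: 3        # after 3 failed checks the Pod is marked NotReady and dropped from the Service endpoints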
I have a Kubernetes cluster with two services deployed: SvcA and SvcB - both in the service mesh.
SvcA is backed by a single Pod, SvcA_P1. The application in SvcA_P1 exposes a PreStop HTTP hook. When performing a "kubectl drain" command on the node where SvcA_P1 resides, the Pod transitions into the "terminating" state and remains in that state until the application has completed its work (the REST request returns and Kubernetes removes the pod). The work for SvcA_P1 includes completing ongoing in-dialog (belonging to established sessions) HTTP requests/responses. It can stay in the "terminating" state for hours before completing.
When the Pod enters the "terminating" phase, the Istio sidecar appears to remove SvcA_P1 from the pool. Requests sent to SvcA_P1 from, e.g., SvcB_P1 are rejected with "no healthy upstream".
Is there a way to configure Istio/Envoy to:
Continue to send traffic/sessions with affinity to SvcA_P1 while in "terminating" state?
Reject traffic without session affinity to SvcA_P1 (no JSESSIONID, cookies, or special HTTP headers)?
I have played around with the DestinationRule(s), modifying trafficPolicy.loadBalancer.consistentHash.[httpHeaderName|httpCookie], with no luck. Once Envoy removes the upstream server, new destinations are hashed over the reduced set of servers.
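Something along these lines (the host and cookie name are just illustrative):

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: svca
spec:
  host: svca.default.svc.cluster.local    # assuming SvcA lives in the default namespace
  trafficPolicy:
    loadBalancer:
      consistentHash:
        httpCookie:
          name: JSESSIONID                # or httpHeaderName with a custom header instead
          ttl: 0s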
Thanks,
Thor
According to the Kubernetes documentation, when a Pod must be deleted, three things happen simultaneously:
1. Pod shows up as "Terminating" when listed in client commands.
2. When the Kubelet sees that a Pod has been marked as terminating because the "dead" timer for the Pod has been set in the API server, it begins the pod shutdown process. If the pod has defined a preStop hook, it is invoked inside of the pod. If the preStop hook is still running after the grace period expires, step 2 is then invoked with a small (2 second) extended grace period.
3. Pod is removed from the endpoints list for the service, and is no longer considered part of the set of running pods for replication controllers. Pods that shut down slowly cannot continue to serve traffic as load balancers (like the service proxy) remove them from their rotations.
Since Istio works as a mesh network below/behind Kubernetes Services, and Services no longer consider a Pod in the Terminating state a valid destination for traffic, tweaking Istio policies doesn't help much.
Is there a way to configure Istio/Envoy to continue to send traffic/sessions with affinity to SvcA_P1 while in "terminating" state?
This problem is at the Kubernetes level rather than the Istio/Envoy level: by default, upon entering the "Terminating" state, Pods are removed from their corresponding Services.
You can change that behaviour by telling your Service to advertise Pods in the "Terminating" state: see that answer.
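For illustration only, and assuming the linked answer refers to the Service's publishNotReadyAddresses field, a minimal sketch:

apiVersion: v1
kind: Service
metadata:
  name: svca                       # hypothetical Service in front of SvcA_P1
spec:
  selector:
    app: svca
  ports:
  - port: 80
    targetPort: 8080
  publishNotReadyAddresses: true   # keep advertising endpoints that are not ready, e.g. terminating pods

Whether Envoy then keeps those endpoints in its load-balancing pool is something you would need to verify in your mesh version.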
I am shutting down my k8s node manually to see if this affects the master.
After the shutdown I check the status of the nodes:
kubectl get nodes
The node which went down is still shown as Ready in the Status column. As a consequence, k8s still tries to schedule pods on this node but actually cannot. And even worse, it doesn't reschedule the pods on other healthy nodes.
After a while (5-10 mins) k8s notices the node has gone.
Is that expected behavior? If not how can I fix this?
I did some research to find out how k8s checks node health, but I couldn't find anything valuable.
I found the problem myself.
I was cutting the connection at the network layer with firewall rules. Since the kubelet had opened a session before the new deny rules were applied, the node was still seen as Ready. And because it was Ready, it kept receiving traffic, which was then blocked by the new rules since those connections had no established session.
So this inconsistency happens only when you change firewall rules.
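As a side note on the 5-10 minute window itself: once a node really is marked unhealthy, how long its pods stay bound to it is governed by taint-based evictions, and that can be tuned per pod. A sketch, with illustrative values:

apiVersion: v1
kind: Pod
metadata:
  name: example                    # hypothetical pod
spec:
  containers:
  - name: app
    image: nginx:1.25              # placeholder image
  tolerations:
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 60          # default is 300s, which lines up with the ~5 minute delay
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 60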
Standard practice for a rolling update of hosts behind load balancer is to gracefully take the hosts out of rotation. This can be done by marking the host "un-healthy" and ensuring the host is no longer receiving requests from the load balancer.
Does Kubernetes do something similar for pods managed by a ReplicationController and servicing a LoadBalancer Service?
I.e., does Kubernetes take a pod out of the LoadBalancer rotation, ensure incoming traffic has died down, and only then issue the pod shutdown?
Actually, once you delete the pod, it will be in the "terminating" state until it is destroyed (after terminationGracePeriodSeconds), which means it is removed from the service load balancer but is still capable of serving existing requests.
We also use "readiness" health checks, and preStop is synchronous, so you could make your preStop hook mark the pod's readiness as false and then wait for it to be removed from the load balancer before having the preStop hook exit.
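One way to wire that up, as a sketch (the marker-file approach and the timings are illustrative, not the only option):

apiVersion: v1
kind: Pod
metadata:
  name: web                        # hypothetical pod
spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: app
    image: nginx:1.25              # placeholder image
    readinessProbe:
      exec:
        command: ["cat", "/tmp/ready"]   # ready only while the marker file exists
      periodSeconds: 5
      failureThreshold: 1
    lifecycle:
      postStart:
        exec:
          command: ["touch", "/tmp/ready"]
      preStop:
        exec:
          # drop readiness, then wait for the endpoints/load balancer to catch up before exiting
          command: ["sh", "-c", "rm -f /tmp/ready && sleep 20"]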
Not quite. Kubernetes will send a stop signal to the containers in the pod. If the application doesn't stop, it will force-kill the container (after the terminationGracePeriodSeconds parameter).
There are a bunch of bugs opened to take care of this: https://github.com/kubernetes/kubernetes/issues/2789
I can't think of anything elegant that will do this.
There is a preStop hook for pods that will execute a script before termination. From there you could modify the pod's label, renaming it to something else. This will fool the replication controller, which will see that it now has a lower number of replicas than desired.
For the pods with this new label, you will have to implement your own logic for stopping them once they have finished their work.
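A rough sketch of that label trick, assuming kubectl is shipped inside the image and the pod's service account is allowed to patch pods (the label values and names are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: worker                      # hypothetical pod selected by the replication controller via app=worker
  labels:
    app: worker
spec:
  containers:
  - name: app
    image: my-worker:1.0            # placeholder image that includes kubectl
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    lifecycle:
      preStop:
        exec:
          # re-label the pod so the replication controller and Service stop matching it,
          # then let it finish outstanding work before exiting
          command: ["sh", "-c", "kubectl label pod $POD_NAME app=draining --overwrite && sleep 30"]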