Web-Server running in an EKS cluster with spot-instances - kubernetes

I'm running a web-server deployment in an EKS cluster. The deployment is exposed behind a NodePort service, ingress resource, and AWS Load Balancer controller.
This deployment is configured to run on "always-on" nodes, using a Node Selector.
The EKS cluster runs additional auto-scaled workloads which can also use spot instances if needed (in the same namespace).
Since the Node-Port service exposes a static port across all nodes in the cluster, there are many targets in the said target group, which are being registered and de-registered whenever a new node is being added/removed from the cluster.
What exactly happens if a request from the client is routed to the service on a node that is about to be scaled down?
I'm asking since I'm getting many 504 Gateway Timeouts from the ALB. Specifically, these requests do not reach our FE/BE pods and terminate at the ALB level.

Welcome to the community #gil-shelef!
Based on the AWS documentation, you should use additional handlers to get both resilience and cost savings.
Let's start with understanding how this works:
There is a node termination handler DaemonSet which runs a pod on each spot instance and listens for spot instance interruption notifications. This makes it possible to gracefully terminate any running pods on that node, drain the node from the load balancer, and let the Kubernetes scheduler reschedule the removed pods on different instances.
The workflow looks like the following (taken from the AWS documentation, Spot Instance Interruption Handling; that link also has an example):
The workflow can be summarized as:
Identify that a Spot Instance is about to be interrupted in two minutes.
Use the two-minute notification window to gracefully prepare the node for termination.
Taint the node and cordon it off to prevent new pods from being placed on it.
Drain connections on the running pods.
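As a rough illustration of the first step in that list: on a Spot Instance the interruption notice can be read from the instance metadata service under the path shown below, which returns a small JSON document once an interruption is scheduled. The sketch only parses such a document and computes the time left. The sample body and timestamps are made up, and fetching the document (including any metadata token handling) is left out.

```python
import json
from datetime import datetime, timezone

# Path on the EC2 instance metadata service for Spot Instances.
# It returns 404 until an interruption has been scheduled.
INSTANCE_ACTION_PATH = "/latest/meta-data/spot/instance-action"


def seconds_until_interruption(notice_body: str, now: datetime) -> float:
    """Return how many seconds remain before the action named in the notice."""
    notice = json.loads(notice_body)
    deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()


# Made-up sample notice, shaped like the documented response body.
sample = '{"action": "terminate", "time": "2024-01-01T12:02:00Z"}'
now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
remaining = seconds_until_interruption(sample, now)
print(remaining)  # 120.0
```

Whatever cordoning and draining you do has to fit inside that remaining time.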
Once pods are removed from endpoints, kube-proxy will trigger an update in iptables, which takes a little time. To make this smoother for end users, consider adding a pre-stop pause of about 5-10 seconds. You can find more information about how this happens and how to mitigate it in my answer here.
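A minimal sketch of such a pre-stop pause, written as a Python helper that builds the relevant part of a pod spec and prints it as JSON. The container name, image and the 10/40 second values are placeholders rather than anything from your setup, and the `sleep` command only works if the image ships that binary. The helper refuses a grace period that is not longer than the pause, because the time spent in the preStop hook counts against `terminationGracePeriodSeconds`.

```python
import json


def pod_spec_with_prestop_pause(image: str, pause_seconds: int, grace_seconds: int) -> dict:
    """Build a pod spec fragment whose container sleeps before it is stopped."""
    if grace_seconds <= pause_seconds:
        raise ValueError("grace period must be longer than the preStop pause")
    return {
        "terminationGracePeriodSeconds": grace_seconds,
        "containers": [
            {
                "name": "web",  # placeholder container name
                "image": image,
                "lifecycle": {
                    # Runs inside the container before the stop signal is sent.
                    "preStop": {"exec": {"command": ["sleep", str(pause_seconds)]}}
                },
            }
        ],
    }


spec = pod_spec_with_prestop_pause("nginx:1.25", pause_seconds=10, grace_seconds=40)
print(json.dumps(spec, indent=2))
```

The printed fragment goes under `spec.template.spec` of the Deployment.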
Also here are links for these handlers:
Node termination handler
Cluster autoscaler on AWS
For your last question, please check this AWS KB article on how to troubleshoot EKS and 504 errors

Related

How does kube-proxy behave when it can't reach the master?

From what I've read about Kubernetes, if the master(s) die, the workers should still be able to function as normal (https://stackoverflow.com/a/39173007/281469), although no new scheduling will occur.
However, I've found this to not be the case when the master can also schedule worker pods. Take a 2-node cluster, where one node is a master and the other a worker, and the master has the taints removed:
If I shut down the master and docker exec into one of the containers on the worker I can see that:
nc -zv ip-of-pod 80
succeeds, but
nc -zv ip-of-service 80
fails half of the time. The Kubernetes version is v1.15.10, using iptables mode for kube-proxy.
I'm guessing that since the kube-proxy on the worker node can't connect to the apiserver, it will not remove the master node from the iptables rules.
Questions:
Is it expected behaviour that kube-proxy won't stop routing to pods on master nodes, or is there something "broken"?
Are any workarounds available for this kind of setup to allow the worker nodes to still function correctly?
I realise the best thing to do is separate the CP nodes but that's not viable for what I'm working on at the moment.
Is it expected behaviour that kube-proxy won't stop routing to pods on master nodes, or is there something "broken"?
Are any workarounds available for this kind of setup to allow the worker nodes to still function correctly?
The cluster master plays the role of decision maker for the various activities on the cluster's nodes. This can include scheduling workloads, managing the workloads' lifecycle, scaling, etc. Each node is managed by the master components and contains the services necessary to run pods. The services on a node typically include kube-proxy, the container runtime and the kubelet.
The kube-proxy component enforces network rules on nodes and helps Kubernetes manage the connectivity among Pods and Services. kube-proxy also acts as an egress-based load-balancing controller which keeps watching the Kubernetes API server and continually updates the node's iptables subsystem based on it.
In simple terms, only the master node is aware of everything, and it is in charge of creating the list of routing rules as nodes are added or deleted. kube-proxy acts as a kind of enforcer: it checks with the master, syncs the information and enforces the rules on the list.
If the master node (API server) is down, the cluster will not be able to respond to API commands or deploy nodes. If no other master node is available, nobody can instruct the worker nodes about changes in work allocation, so they continue to run what the master scheduled earlier until the master is back and gives different instructions. In line with this, kube-proxy cannot sync the latest rules from the master either. It does not stop routing, though: it keeps handling networking and routing with the iptables rules that were in place before the master went down, which allows network communication to your pods provided all pods on the worker nodes are still up and running.
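To connect this with the "fails half of the time" observation in the question: in iptables mode kube-proxy picks one of the Service's endpoints at random for each new connection. If the frozen rule set still lists an endpoint that lived on the node that went down, roughly that share of connections will fail. The toy simulation below only illustrates the arithmetic. The two endpoint labels are invented, and nothing here talks to a real cluster.

```python
import random


def failure_rate(endpoints: dict, attempts: int, seed: int = 0) -> float:
    """Pick an endpoint uniformly at random per attempt; count dead picks."""
    rng = random.Random(seed)
    targets = list(endpoints)
    failures = sum(1 for _ in range(attempts) if not endpoints[rng.choice(targets)])
    return failures / attempts


# Rule set frozen at the moment the API server became unreachable.
# True means the endpoint still answers, False means it is gone.
stale_rules = {
    "10.0.0.5 (pod on master)": False,
    "10.0.1.7 (pod on worker)": True,
}
rate = failure_rate(stale_rules, attempts=10_000)
print(f"{rate:.2f}")  # close to 0.50 for one dead endpoint out of two
```

With every listed endpoint alive the same function returns 0.0, which matches the case where all pods behind the Service run on the surviving worker.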
A single-master architecture is not a preferred deployment architecture for production. Since resilience and reliability are among the major goals of Kubernetes, it is recommended as a best practice to use an HA cluster architecture to avoid a single point of failure.
Once you remove the taints, the Kubernetes scheduler doesn't need any tolerations to schedule pods on your master node. So it is as good as a worker node with control plane components running on it, and you can also run your workload pods on this node (although it's not a recommended practice).
Kube-proxy (https://kubernetes.io/docs/concepts/overview/components/#kube-proxy) is the component deployed on all the nodes of the cluster, and it handles the networking and routing of connections to your pods. So, even if your master node is down, kube-proxy still works fine on the worker node and it will route traffic to your pods running on the worker node.
If all your pods are running in worker nodes (which are still up and running), then kube-proxy will continue to route traffic to your pods even via service.
There is nothing inherent in Kubernetes that would cause this. The master node role is just for humans, and if you've removed the taints then the nodes are just normal nodes. That said, remember that usual rules about scheduling and resource requests apply so if your pods don't all fit then things wouldn't be scheduled. It's possible your Kubernetes deploy system set up more specialized firewall rules or similar around the control plane nodes, but that would be dependent on that system.

Is there a way to configure Istio to route traffic to a POD which is in the terminating state?

I have a Kubernetes cluster with two services deployed: SvcA and SvcB - both in the service mesh.
SvcA is backed by a single Pod, SvcA_P1. The application in SvcA_P1 exposes a PreStop HTTP hook. When performing a "kubectl drain" command on the node where SvcA_P1 resides, the Pod transitions into the "terminating" state and remains in that state until the application has completed its work (the REST request returns and Kubernetes removes the pod). The work for SvcA_P1 includes completing ongoing in-dialog (belonging to established sessions) HTTP requests/responses. It can stay in the "terminating" state for hours before completing.
When the Pod enters the "terminating" phase, Istio sidecar appears to remove the SvcA_P1 from the pool. Requests sent to SvcA_P1 from e.g., SvcB_P1 are rejected with a "no healthy upstream".
Is there a way to configure Istio/Envoy to:
Continue to send traffic/sessions with affinity to SvcA_P1 while in "terminating" state?
Reject traffic without session affinity to SvcA_P1 (no JSESSIONID, cookies, or special HTTP headers)?
I have played around with the DestinationRule(s), modifying trafficPolicy.loadBalancer.consistentHash.[httpHeaderName|httpCookie] with no luck. Once the Envoy removes the upstream server, the new destination is re-hashed using the reduced set of servers.
Thanks,
Thor
According to the Kubernetes documentation, when a pod must be deleted, three things happen simultaneously:
Pod shows up as “Terminating” when listed in client commands
When the Kubelet sees that a Pod has been marked as terminating because the "dead" timer for the Pod has been set in the API server, it begins the pod shutdown process.
If the pod has defined a preStop hook, it is invoked inside of the pod. If the preStop hook is still running after the grace period expires, step 2 is then invoked with a small (2 second) extended grace period.
Pod is removed from endpoints list for service, and are no longer considered part of the set of running pods for replication controllers. Pods that shutdown slowly cannot continue to serve traffic as load balancers (like the service proxy) remove them from their rotations.
Since Istio works like a mesh network below/behind Kubernetes Services, and Services no longer consider a Pod in the Terminating state as a destination for traffic, tweaking Istio policies doesn't help much.
Is there a way to configure Istio/Envoy to continue to send traffic/sessions with affinity to SvcA_P1 while in "terminating" state?
This problem is at Kubernetes level rather than Istio/Envoy level: by default, upon entering the "Terminating" state, Pods are removed from their corresponding Services.
You can change that behaviour by telling your Service to advertise Pods in the "Terminating" state: see that answer.
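I can't tell from here which setting the linked answer relies on, so treat the following as an assumption rather than a restatement of it. One Service field I know of that keeps not-ready (including terminating) Pods in the endpoints list is `spec.publishNotReadyAddresses`. The sketch builds such a Service manifest in Python and prints it as JSON. The service name, selector label and port are placeholders, and whether the Istio sidecar honours the published addresses the way you need is something to verify in your own mesh.

```python
import json


def service_publishing_not_ready(name: str, app_label: str, port: int) -> dict:
    """Build a Service manifest that also publishes not-ready Pod addresses."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "selector": {"app": app_label},  # placeholder label
            "ports": [{"port": port, "targetPort": port}],
            # Assumption: this is the setting meant above. It also publishes
            # Pods that are failing their readiness probe, not only
            # terminating ones.
            "publishNotReadyAddresses": True,
        },
    }


manifest = service_publishing_not_ready("svc-a", "svc-a", 8080)
print(json.dumps(manifest, indent=2))
```

Because this affects every client of the Service, rejecting new sessions still has to happen in the application or at the routing layer.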

GKE: 502 when stopping instance

I'm having trouble with my Kubernetes ingress on GKE. I'm simulating termination of a preemptible instance by manually deleting it (through the GCP dashboard). I am running a regional GKE cluster (one VM in each availability zone in us-west1).
A few seconds after selecting delete on only one of the VMs I start receiving 502 errors through the load balancer. Stackdriver logs for the load balancer list the error as failed_to_connect_to_backend.
Monitoring the health of backend-service shows the backend being terminated go from HEALTHY to UNHEALTHY and then disappearing while the other two backends remain HEALTHY.
After a few seconds requests begin to succeed again.
I'm quite confused why the load balancer is unable to direct traffic to the healthy nodes while one goes down - or maybe this is a kubernetes issue? Could the load balancer be properly routing traffic to a healthy instance, but the kubernetes NodePort service on that instance proxies the request back to the unhealthy instance for some reason?
Well, I would say that if you kill a node from the GCP Console, you are killing it from the outside in. Nothing on the node gets a chance to announce the shutdown, so it takes time until the cluster notices that the kubelet has gone silent. Until then kube-proxy won't update the service endpoints and the iptables rules either.
Until that happens, the ingress controller will keep sending packets to the services specified by the ingress rule, and the services will forward them to pods that no longer exist.
This is just speculation; I might be wrong. But according to the GCP documentation, if you are using preemptible VMs, your app should be fault tolerant.
[EXTRA]
So, let's consider two general scenarios. In the first one we send a kubectl delete pod command, while in the second one we kill a node abruptly.
With kubectl delete pod ... you are telling the api-server that you want to kill a pod. The api-server instructs the kubelet to kill the pod, and the pod is re-created on another node if necessary. kube-proxy updates the iptables rules so that the services forward requests to the right pod.
If you kill the node, nothing tells the api-server in advance. The kubelet on that node just stops reporting, and the node is only marked unhealthy once its status updates have been missing for a while. The pods are then re-scheduled on a different node (always). The rest is the same.
My point is that there is a difference between the api-server knowing from the beginning that no packets can be sent to a pod, and finding out only after the node has been detected as unhealthy.
How to solve this? You can't. And actually this is logical. Do you want the same performance from preemptible machines, which cost about 5 times less than a normal VM? If that were possible, everyone would be using these VMs.
Finally, again, Google advises using preemptible VMs if your application is failure tolerant.

Specify scheduling order of a Kubernetes DaemonSet

I have Consul running in my cluster and each node runs a consul-agent as a DaemonSet. I also have other DaemonSets that interact with Consul and therefore require a consul-agent to be running in order to communicate with the Consul servers.
My problem is, if my DaemonSet is started before the consul-agent, the application will error as it cannot connect to Consul and subsequently get restarted.
I also notice the same problem with other DaemonSets, e.g Weave, as it requires kube-proxy and kube-dns. If Weave is started first, it will constantly restart until the kube services are ready.
I know I could add retry logic to my application, but I was wondering if it was possible to specify the order in which DaemonSets are scheduled?
Kubernetes itself does not provide a way to specify dependencies between pods / deployments / services (e.g. "start pod A only if service B is available" or "start pod A after pod B").
The current approach (based on what I found while researching this) seems to be retry logic or an init container. To quote the docs:
They run to completion before any app Containers start, whereas app Containers run in parallel, so Init Containers provide an easy way to block or delay the startup of app Containers until some set of preconditions are met.
This means you can either add retry logic to your application (which I would recommend, as it might help you in different situations such as a short service outage) or you can use an init container that polls a health endpoint via the Kubernetes service name until it gets a satisfying response.
Retry logic is preferred over startup dependency ordering, since it handles both the initial bring-up case and recovery from post-start outages.
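A minimal sketch of that retry logic, assuming the dependency check is a function that returns True once the dependency (here the local consul-agent) answers. The function names, the attempt count and the delays are illustrative. The `sleep` parameter exists only so the example can run without actually waiting.

```python
import time


def wait_for(check, attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call check() until it returns True, doubling the delay between tries."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            if check():
                return attempt
        except OSError:
            pass  # connection errors are treated as "not ready yet"
        if attempt < attempts:
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise TimeoutError(f"dependency not ready after {attempts} attempts")


# Stand-in for a real health check; it succeeds on the third call.
calls = {"n": 0}


def fake_consul_check():
    calls["n"] += 1
    return calls["n"] >= 3


delays = []
attempts_used = wait_for(fake_consul_check, sleep=delays.append)
print(attempts_used, delays)  # 3 [0.5, 1.0]
```

The same loop works at startup and after a later outage, which is the point made above about preferring retries over ordering.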

Does a Kubernetes rolling-update gracefully remove pods from a service load balancer

Standard practice for a rolling update of hosts behind load balancer is to gracefully take the hosts out of rotation. This can be done by marking the host "un-healthy" and ensuring the host is no longer receiving requests from the load balancer.
Does Kubernetes do something similar for pods managed by a ReplicationController and servicing a LoadBalancer Service?
I.e., does Kubernetes take a pod out of the LoadBalancer rotation, ensure incoming traffic has died-down, and only then issue pod shutdown?
Actually, once you delete the pod, it will be in "terminating" state until it is destroyed (after terminationGracePeriodSeconds) which means it is removed from the service load balancer, but still capable of serving existing requests.
We also use "readiness" health checks, and preStop is synchronous, so you could make your preStop hook mark the readiness of the pod to be false, and then wait for it to be removed from the load balancer, before having the preStop hook exit.
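Roughly, such a preStop hook could look like the sketch below. `mark_unready` and `in_rotation` are hypothetical callbacks you would have to supply yourself, for example one that makes the readiness endpoint fail and one that checks whether traffic is still arriving. The timeout and poll interval are placeholders, and the whole hook still has to finish within terminationGracePeriodSeconds.

```python
import time


def pre_stop(mark_unready, in_rotation, timeout=30.0, poll=1.0,
             sleep=time.sleep, clock=time.monotonic):
    """Fail readiness, then block until the pod is out of rotation.

    Returns True if the pod left the rotation, False if the timeout hit first.
    """
    mark_unready()
    deadline = clock() + timeout
    while in_rotation():
        if clock() >= deadline:
            return False
        sleep(poll)
    return True


# Stand-ins: the pod drops out of rotation on the third poll.
state = {"ready": True, "polls": 0}


def mark_unready():
    state["ready"] = False


def in_rotation():
    state["polls"] += 1
    return state["polls"] < 3


drained = pre_stop(mark_unready, in_rotation, sleep=lambda seconds: None)
print(drained, state["ready"])  # True False
```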
Not quite. Kubernetes will send a stop command to the containers in the pod. If the application doesn't stop it will force kill the container (after terminationGracePeriodSeconds parameter).
There are a bunch of bugs opened to take care of this: https://github.com/kubernetes/kubernetes/issues/2789
I can't think of anything elegant that will do this.
There is a preStop parameter for pods that will execute a script before termination. From there you could change the pod's label to something else. This fools the replication controller, which will now see a lower number of replicas.
For the pods with this label you will have to make your own logic on stopping them when they have finished working.