Graceful termination of Kubernetes pods

We have an application with four pods running behind a load balancer. We want to try a rolling update, but we are not sure what happens when a pod goes down. The documentation is unclear, particularly this quote from Termination of Pods:
Pod is removed from endpoints list for service, and are no longer considered part of the set of running pods for replication controllers. Pods that shutdown slowly can continue to serve traffic as load balancers (like the service proxy) remove them from their rotations.
So, could someone guide us on the following questions:
1.) When a pod is shutting down, can it still serve new requests, or does the load balancer stop routing traffic to it?
2.) Does it complete the requests it is already processing until the grace period is exhausted, and then kill the container even if some process is still running?
3.) Also, this mentions replication controllers. What we have is a Deployment, and a Deployment manages ReplicaSets, so will there be any difference?
We went through this question, but the answers are conflicting and cite no source: Does a Kubernetes rolling-update gracefully remove pods from a service load balancer

1) When a Pod is shutting down, its state changes to Terminating and it is no longer considered by the LoadBalancer, as described in the Pod termination docs.
2) Yes. You might want to look at the pod.Spec.TerminationGracePeriodSeconds configuration to gain some control; you'll find details in the API documentation.
3) No. The ReplicaSet and the Deployment take care of scheduling Pods; there's no difference when it comes to the shutdown behaviour of the Pods.
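For reference, here is a minimal sketch of where that grace period lives in a Deployment manifest; the names and image are placeholders, not taken from the question:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 4
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      # Give in-flight requests up to 60s to finish before the container is killed
      # (the default is 30s). This is the pod.Spec.TerminationGracePeriodSeconds field.
      terminationGracePeriodSeconds: 60
      containers:
      - name: web
        image: example.com/web:1.0   # placeholder image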

Related

Web-Server running in an EKS cluster with spot-instances

I'm running a web-server deployment in an EKS cluster. The deployment is exposed behind a NodePort service, ingress resource, and AWS Load Balancer controller.
This deployment is configured to run on "always-on" nodes, using a Node Selector.
The EKS cluster runs additional auto-scaled workloads which can also use spot instances if needed (in the same namespace).
Since the NodePort service exposes a static port across all nodes in the cluster, there are many targets in the corresponding target group, which are registered and de-registered whenever a node is added to or removed from the cluster.
What exactly happens if a request from a client is routed to the service on a node that is about to be scaled down?
I'm asking since I'm getting many 504 Gateway Timeouts from the ALB. Specifically, these requests do not reach our FE/BE pods and terminate at the ALB level.
Welcome to the community #gil-shelef!
Based on the AWS documentation, additional handlers should be used to add both resilience and cost savings.
Let's start with understanding how this works:
There is a dedicated node termination handler DaemonSet which runs a pod on each spot instance and listens for spot instance interruption notices. This makes it possible to gracefully terminate any pods running on that node, drain the node from the load balancer, and let the Kubernetes scheduler reschedule the evicted pods on other instances.
The workflow looks like the following (taken from the AWS documentation on Spot Instance Interruption Handling; that page also has an example):
Identify that a Spot Instance is about to be interrupted in two minutes.
Use the two-minute notification window to gracefully prepare the node for termination.
Taint the node and cordon it off to prevent new pods from being placed on it.
Drain connections on the running pods.
Once the pods are removed from the endpoints, kube-proxy triggers an update of the iptables rules, which takes a little time. To make this smoother for end users, consider adding a preStop pause of about 5-10 seconds. You can find more information about how this happens and how to mitigate it in my answer here.
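A minimal sketch of such a pre-stop pause, assuming the container image has a shell available; names and image are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: web
  labels:
    app: web
spec:
  # Keep the grace period larger than the preStop pause so the pause isn't cut short.
  terminationGracePeriodSeconds: 40
  containers:
  - name: web
    image: example.com/web:1.0   # placeholder image
    lifecycle:
      preStop:
        exec:
          # Pause ~10s so endpoint removal can propagate through kube-proxy/iptables
          # and the ALB target group before the container receives SIGTERM.
          command: ["/bin/sh", "-c", "sleep 10"]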
Also here are links for these handlers:
Node termination handler
Cluster autoscaler on AWS
For your last question, please check this AWS KB article on how to troubleshoot EKS and 504 errors

Is there a way to configure Istio to route traffic to a POD which is in the terminating state?

I have a Kubernetes cluster with two services deployed: SvcA and SvcB - both in the service mesh.
SvcA is backed by a single Pod, SvcA_P1. The application in SvcA_P1 exposes a PreStop HTTP hook. When I perform a "kubectl drain" on the node where SvcA_P1 resides, the Pod transitions into the "terminating" state and remains there until the application has completed its work (the REST request returns and Kubernetes removes the pod). The work for SvcA_P1 consists of completing ongoing in-dialog (belonging to established sessions) HTTP requests/responses. It can stay in the "terminating" state for hours before completing.
When the Pod enters the "terminating" phase, Istio sidecar appears to remove the SvcA_P1 from the pool. Requests sent to SvcA_P1 from e.g., SvcB_P1 are rejected with a "no healthy upstream".
Is there a way to configure Istio/Envoy to:
Continue to send traffic/sessions with affinity to SvcA_P1 while in "terminating" state?
Reject traffic without session affinity to SvcA_P1 (no JSESSIONID, cookies, or special HTTP headers)?
I have played around with DestinationRule(s), modifying trafficPolicy.loadBalancer.consistentHash.[httpHeaderName|httpCookie], with no luck. Once Envoy removes the upstream server, the destination is re-hashed using the reduced set of servers.
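For reference, the kind of DestinationRule I have been experimenting with looks roughly like this (the host and header name stand in for my real values):

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: svca-affinity
spec:
  host: svca.default.svc.cluster.local   # placeholder for my actual service host
  trafficPolicy:
    loadBalancer:
      consistentHash:
        # Also tried httpCookie here instead of a header, with the same result.
        httpHeaderName: x-session-id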
Thanks,
Thor
According to the Kubernetes documentation, when a pod must be deleted, three things happen simultaneously:
Pod shows up as “Terminating” when listed in client commands.
When the Kubelet sees that a Pod has been marked as terminating because the "dead" timer for the Pod has been set in the API server, it begins the pod shutdown process. If the pod has defined a preStop hook, it is invoked inside of the pod. If the preStop hook is still running after the grace period expires, step 2 is then invoked with a small (2 second) extended grace period.
Pod is removed from endpoints list for service, and are no longer considered part of the set of running pods for replication controllers. Pods that shutdown slowly cannot continue to serve traffic as load balancers (like the service proxy) remove them from their rotations.
Since Istio works as a mesh network below/behind Kubernetes Services, and Services no longer consider a Pod in the Terminating state as a destination for traffic, tweaking Istio policies doesn't help much.
Is there a way to configure Istio/Envoy to continue to send traffic/sessions with affinity to SvcA_P1 while in "terminating" state?
This problem is at Kubernetes level rather than Istio/Envoy level: by default, upon entering the "Terminating" state, Pods are removed from their corresponding Services.
You can change that behaviour by telling your Service to advertise Pods in the "Terminating" state: see that answer.
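If I remember correctly, the mechanism in that answer is the Service field publishNotReadyAddresses, which keeps endpoints advertised even when the backing pods are not Ready; a minimal sketch with placeholder names, assuming that is the field meant here:

apiVersion: v1
kind: Service
metadata:
  name: svca
spec:
  # Keep advertising endpoints for pods that are not (or no longer) Ready.
  publishNotReadyAddresses: true
  selector:
    app: svca
  ports:
  - port: 80
    targetPort: 8080   # placeholder port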

Specify scheduling order of a Kubernetes DaemonSet

I have Consul running in my cluster and each node runs a consul-agent as a DaemonSet. I also have other DaemonSets that interact with Consul and therefore require a consul-agent to be running in order to communicate with the Consul servers.
My problem is, if my DaemonSet is started before the consul-agent, the application will error as it cannot connect to Consul and subsequently get restarted.
I also notice the same problem with other DaemonSets, e.g. Weave, which requires kube-proxy and kube-dns. If Weave is started first, it will constantly restart until the kube services are ready.
I know I could add retry logic to my application, but I was wondering if it was possible to specify the order in which DaemonSets are scheduled?
Kubernetes itself does not provide a way to specify dependencies between pods / deployments / services (e.g. "start pod A only if service B is available" or "start pod A after pod B").
The current approach (based on what I found while researching this) seems to be retry logic or an init container. To quote the docs:
They run to completion before any app Containers start, whereas app Containers run in parallel, so Init Containers provide an easy way to block or delay the startup of app Containers until some set of preconditions are met.
This means you can either add retry logic to your application (which I would recommend, as it might help you in different situations such as a short service outage), or you can use an init container that polls a health endpoint via the Kubernetes service name until it gets a satisfying response.
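A rough sketch of the init-container variant, assuming Consul's HTTP API is reachable through a Service named consul on port 8500; the service name, namespace, and image tag are assumptions:

apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  initContainers:
  - name: wait-for-consul
    image: curlimages/curl:8.8.0   # any small image with a shell and curl works
    command:
    - sh
    - -c
    # Block the app container until Consul answers on its status endpoint.
    - |
      until curl -fsS http://consul.default.svc.cluster.local:8500/v1/status/leader; do
        echo "waiting for consul..."; sleep 2;
      done
  containers:
  - name: app
    image: example.com/app:1.0   # placeholder image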
Retry logic is preferred over startup dependency ordering, since it handles both the initial bring-up case and recovery from post-start outages.

Configure Kubernetes StatefulSet to start pods first, restart failed containers after start?

Basic info
Hi, I'm encountering a problem with Kubernetes StatefulSets. I'm trying to spin up a set with 3 replicas.
These replicas/pods each have a container which pings a container in the other pods based on their network-id.
The container requires a response from all the pods. If it does not get a response the container will fail. In my situation I need 3 pods/replicas for my setup to work.
Problem description
What happens is the following: Kubernetes starts 2 pods rather quickly. However, since I need 3 pods for a fully functional cluster, the first 2 pods keep crashing because the 3rd is not up yet.
For some reason Kubernetes opts to keep restarting both pods instead of adding the 3rd pod so my cluster can function.
I've seen my setup run properly after about 15 minutes because Kubernetes added the 3rd pod by then.
Question
So, my question.
Does anyone know a way to delay restarting failed containers until the desired amount of pods/replicas have been booted?
I've since found out the cause of this.
StatefulSets launch pods in a specific order. If one of the pods fails to launch, the StatefulSet does not launch the next one.
You can add a podManagementPolicy: "Parallel" to launch the pods without waiting for previous pods to be Running.
See this documentation
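For example, the relevant part of the StatefulSet would look roughly like this (names and image are placeholders):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-cluster
spec:
  serviceName: my-cluster        # assumes a matching headless Service exists
  replicas: 3
  # Launch all replicas at once instead of the default OrderedReady behaviour.
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app: my-cluster
  template:
    metadata:
      labels:
        app: my-cluster
    spec:
      containers:
      - name: node
        image: example.com/node:1.0   # placeholder image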
I think a better way to deal with your problem is to leverage a liveness probe, as described in the documentation, rather than delaying the restart (which is not configurable in the YAML).
Your pods respond to the liveness probe right after they start, to let Kubernetes know they are alive, which prevents them from being restarted. Meanwhile, your pods keep pinging the others until they are all up. Only when all your pods have started will they serve external requests. This is similar to creating a ZooKeeper ensemble.
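A sketch of that split, assuming the application exposes a cheap /healthz endpoint ("the process is up") and a /ready endpoint ("the whole ensemble is formed"); both paths and the port are assumptions. This snippet belongs under spec.template.spec.containers of the StatefulSet:

- name: node
  image: example.com/node:1.0   # placeholder image
  ports:
  - containerPort: 8080
  livenessProbe:
    # Only checks that the process answers, so pods waiting for peers are not restarted.
    httpGet:
      path: /healthz
      port: 8080
    initialDelaySeconds: 5
    periodSeconds: 10
  readinessProbe:
    # Gates external traffic until the pod reports that the ensemble is formed.
    httpGet:
      path: /ready
      port: 8080
    periodSeconds: 5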

Does a Kubernetes rolling-update gracefully remove pods from a service load balancer

Standard practice for a rolling update of hosts behind a load balancer is to gracefully take the hosts out of rotation. This can be done by marking the host "unhealthy" and ensuring the host is no longer receiving requests from the load balancer.
Does Kubernetes do something similar for pods managed by a ReplicationController and servicing a LoadBalancer Service?
I.e., does Kubernetes take a pod out of the LoadBalancer rotation, ensure incoming traffic has died down, and only then issue the pod shutdown?
Actually, once you delete the pod, it will be in the "terminating" state until it is destroyed (after terminationGracePeriodSeconds), which means it is removed from the service load balancer but still capable of serving existing requests.
We also use "readiness" health checks, and preStop is synchronous, so you could make your preStop hook mark the readiness of the pod to be false, and then wait for it to be removed from the load balancer, before having the preStop hook exit.
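A sketch of that pattern, assuming the application exposes a /ready endpoint for the readiness probe and an admin endpoint that flips it to failing, and that curl is available in the image; the paths are placeholders. This goes on the container in the pod template:

- name: web
  image: example.com/web:1.0   # placeholder image
  readinessProbe:
    httpGet:
      path: /ready
      port: 8080
    periodSeconds: 2
    failureThreshold: 2
  lifecycle:
    preStop:
      exec:
        command:
        - /bin/sh
        - -c
        # /quitquitquit is a hypothetical app endpoint that makes /ready start failing;
        # then wait for the load balancer to drop this pod before the hook exits
        # and SIGTERM is sent.
        - curl -fsS -X POST http://localhost:8080/quitquitquit && sleep 20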
Not quite. Kubernetes will send a stop signal to the containers in the pod. If the application doesn't stop, it will force-kill the container after the terminationGracePeriodSeconds parameter elapses.
There are a bunch of bugs opened to take care of this: https://github.com/kubernetes/kubernetes/issues/2789
I can't think of anything elegant that will do this.
There is a preStop hook for pods that will execute a script before termination. From there you could modify the pod's label and rename it to something else. This will fool the replication controller, which will see that it now has a lower number of replicas.
For the pods with the changed label, you will have to implement your own logic for stopping them once they have finished their work.
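A rough sketch of that idea; it assumes the image ships kubectl and the pod runs under a ServiceAccount that is allowed to patch pods, both of which you would have to set up yourself:

apiVersion: v1
kind: Pod
metadata:
  name: worker
  labels:
    app: worker                        # the controller and Service select on this label
spec:
  serviceAccountName: pod-relabeler    # hypothetical ServiceAccount with "patch pods" RBAC
  containers:
  - name: app
    image: example.com/worker:1.0      # placeholder image, assumed to contain kubectl
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          # Relabel this pod so the controller replaces it and the Service stops
          # sending it traffic, then let it finish its in-flight work.
          - kubectl label pod "$HOSTNAME" app=draining --overwrite && sleep 300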