Automatic Pod Deletion Delay in Kubernetes

Is there a way to automatically delay all Kubernetes pod deletion requests such that the endpoint deregistration is signaled, but the pod's SIGTERM is delayed by several seconds?
It would be preferable, but not required, if the delay only affected pods with an Endpoint/Service.
Background:
It is well established that some traffic can continue to reach a Pod after it has been sent the SIGTERM signal, due to the asynchronous nature of endpoint deregistration relative to the deletion signal. The recommended mitigation is to introduce a few seconds of delay in the pod's preStop lifecycle hook by invoking sleep.
The difficulty arises when the pod's deployment is managed via Helm or another upstream source, or when there are large numbers of deployments and containers to manage. Modifying many deployments in this way may be difficult, or even impossible (e.g. the container may not have a sleep binary, a shell, or anything but the application executable).
I briefly explored a mutating admission controller, but dynamically adding a preStop hook that way seems unworkable: not all images have a /bin/sleep, and some already have a preStop hook that would need image-specific knowledge to merge.
(Of course, all of this could be avoided if the K8S API made the endpoint deregistration synchronous with a timeout to avoid deadlock (hint, hint), but I haven't seen any discussions of such a change. Yes, there are tons of reasons why this isn't synchronous, but that doesn't mean something can't be done.)
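For reference, the standard per-workload mitigation looks roughly like this (a minimal sketch; the names, image, and the 10-second sleep are placeholders, and it assumes the image actually ships a sleep binary):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                            # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      terminationGracePeriodSeconds: 40   # must cover the sleep plus the app's own shutdown
      containers:
      - name: my-app
        image: my-app:latest              # placeholder image
        lifecycle:
          preStop:
            exec:
              # give kube-proxy / ingress controllers time to observe the
              # endpoint removal before SIGTERM reaches the process
              command: ["sleep", "10"]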

The Kubernetes pod termination lifecycle has the following steps.
Pod is set to the “Terminating” State and removed from the endpoints list of all Services
preStop hook is executed
SIGTERM signal is sent to the pod
Kubernetes waits for a grace period, default is 30 seconds
SIGKILL signal is sent to pod, and the pod is removed
The grace period is what you need.
It's important to note that this grace period runs in parallel with the preStop hook and the SIGTERM signal.
A call to the preStop hook fails if the container is already in a terminated or completed state. The hook is blocking, i.e. synchronous, so it must complete before the signal to delete the container can be sent.
Here you can read more about Container Lifecycle Hooks.
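To illustrate what such a hook looks like, a preStop handler can run either an exec command or an httpGet request (a minimal sketch; the container name, image, and endpoint are placeholders):

spec:
  containers:
  - name: web                              # placeholder container name
    image: nginx:1.25                      # placeholder image
    lifecycle:
      preStop:
        # exec form: run a command inside the container
        exec:
          command: ["/bin/sh", "-c", "sleep 5"]
        # httpGet form (alternative to exec; a handler uses one or the other):
        # httpGet:
        #   path: /shutdown
        #   port: 8080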
So, for example, you could set terminationGracePeriodSeconds: 90, which might look like the following:
spec:
  terminationGracePeriodSeconds: 90
  containers:
  - name: myApplication
You can read the Kubernetes docs regarding Termination of Pods. I also recommend the great blog post Kubernetes best practices: terminating with grace.

Related

Can a kubernetes pod respond to a request when it is removed from a service endpoint?

I know that the moment a pod receives a deletion request, it is removed from the service endpoints and no longer receives new requests. However, I'm not sure whether the pod can return a response to a request it received just before it was removed from the service endpoints.
If the pod IP is missing from the service's endpoint, can it still respond to requests?
There are many reasons why Kubernetes might terminate a healthy container (for example, node drain, termination due to lack of resources on the node, rolling update).
Once Kubernetes has decided to terminate a Pod, a series of events takes place:
1 - Pod is set to the “Terminating” State and removed from the endpoints list of all Services
At this point, the pod stops getting new traffic. Containers running in the pod will not be affected.
2 - preStop Hook is executed
The preStop Hook is a special command or http request that is sent to the containers in the pod.
If your application doesn’t gracefully shut down when receiving a SIGTERM, you can use this hook to trigger a graceful shutdown. Most programs gracefully shut down when receiving a SIGTERM, but if you are using third-party code or are managing a system you don’t have control over, the preStop hook is a great way to trigger a graceful shutdown without modifying the application (see the sketch after this walkthrough).
3 - SIGTERM signal is sent to the pod
At this point, Kubernetes will send a SIGTERM signal to the containers in the pod. This signal lets the containers know that they are going to be shut down soon.
Your code should listen for this event and start shutting down cleanly at this point. This may include stopping any long-lived connections (like a database connection or WebSocket stream), saving the current state, or anything like that.
Even if you are using the preStop hook, it is important that you test what happens to your application if you send it a SIGTERM signal, so you are not surprised in production!
4 - Kubernetes waits for a grace period
At this point, Kubernetes waits for a specified time called the termination grace period. By default, this is 30 seconds. It’s important to note that this happens in parallel to the preStop hook and the SIGTERM signal. Kubernetes does not wait for the preStop hook to finish.
If your app finishes shutting down and exits before the terminationGracePeriod is done, Kubernetes moves to the next step immediately.
If your pod usually takes longer than 30 seconds to shut down, make sure you increase the grace period. You can do that by setting the terminationGracePeriodSeconds option in the Pod YAML.
5 - SIGKILL signal is sent to pod, and the pod is removed
If the containers are still running after the grace period, they are sent the SIGKILL signal and forcibly removed. At this point, all Kubernetes objects are cleaned up as well.
I hope this gives a good idea of the Kubernetes termination lifecycle and how to handle a Pod termination gracefully.
Based on this article.
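To tie steps 2 and 4 above together, here is a minimal sketch (not from the article; nginx and the values shown are only an illustration) of using a preStop hook to gracefully drain a third-party server you cannot modify, combined with a long enough grace period:

apiVersion: v1
kind: Pod
metadata:
  name: graceful-nginx                      # placeholder name
spec:
  terminationGracePeriodSeconds: 60         # step 4: must exceed the worst-case drain time
  containers:
  - name: nginx
    image: nginx:1.25                       # nginx here is only an illustration
    lifecycle:
      preStop:                              # step 2: runs before SIGTERM is delivered
        exec:
          command: ["nginx", "-s", "quit"]  # asks nginx to finish in-flight requests

Here nginx -s quit makes the server stop accepting new connections and finish in-flight requests before the process exits.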
Once the pod is deleted and its container has stopped, it cannot respond to requests, regardless of whether it has been removed from the service's endpoints.
If the pod's container is still alive, it can respond to requests, whether you reach it through the Service or not.

Kubernetes how to maintain session affinity during hpa scale down event

I have deployed my application in Kubernetes using a Deployment.
Whenever a user logs in to the application, the pod generates a session for that user.
To maintain session stickiness I have set a session cookie using NGINX ingress annotations.
When the HPA scales down the pods, users face a logout problem when their pod gets terminated: if the ingress generated a session bound to that pod, the user needs to log in again.
What I want is some sort of graceful termination of the connection: when a pod is in the terminating state, it should keep serving existing sessions until the grace period ends.
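For context, the stickiness annotations referred to above are roughly of this form (a sketch using the usual ingress-nginx affinity annotations; the names and values are placeholders, not the actual manifest):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app                                                  # placeholder
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "route"    # placeholder cookie name
    nginx.ingress.kubernetes.io/session-cookie-max-age: "3600"
spec:
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app                                        # placeholder Service
            port:
              number: 80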
What I want is some sort of graceful termination of the connection: when a pod is in the terminating state, it should keep serving existing sessions until the grace period ends.
You can use the terminationGracePeriodSeconds key in the Pod spec:
this makes Kubernetes wait for the specified number of seconds before the Pod gets terminated.
You can read more at: https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-terminating-with-grace
The answer from Harsh Manvar is great; however, I want to expand on it a bit :)
You can of course use terminationGracePeriodSeconds in the Pod spec. Look at the example YAML:
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: my-image
  terminationGracePeriodSeconds: 60
At this point, Kubernetes waits for a specified time called the termination grace period. By default, this is 30 seconds. It’s important to note that this happens in parallel to the preStop hook and the SIGTERM signal. Kubernetes does not wait for the preStop hook to finish.
If your app finishes shutting down and exits before the terminationGracePeriod is done, Kubernetes moves to the next step immediately.
If your pod usually takes longer than 30 seconds to shut down, make sure you increase the grace period. You can do that by setting the terminationGracePeriodSeconds option in the Pod YAML, as in the example above, which changes it to 60 seconds.
For more look here.
If you want to know exactly what the pod lifecycle looks like, see this link to the official documentation. The part about the termination of pods should be the most interesting; it also describes exactly how termination takes place.
It is also recommended that applications deployed on Kubernetes follow established design standards.
One set of standards for modern, cloud-based applications is known as the Twelve-Factor App:
Twelve-factor processes are stateless and share-nothing. Any data that needs to persist must be stored in a stateful backing service, typically a database.
Some web systems rely on “sticky sessions” – that is, caching user session data in memory of the app’s process and expecting future requests from the same visitor to be routed to the same process. Sticky sessions are a violation of twelve-factor and should never be used or relied upon. Session state data is a good candidate for a datastore that offers time-expiration, such as Memcached or Redis.

What happens when you scale down a Kubernetes Pod?

When you decrease the number of pods in a Kubernetes workload, I am assuming it does a soft kill. Does it start that process by stopping incoming connections? In an event-driven microservice environment, where a container reads messages from a queue, what happens when I deploy to the messages that are currently being processed? Does it stop taking messages from the queue?
When you decrease the number of pods in a Kubernetes workload, I am assuming it does a soft kill.
Yes, it does a graceful termination, so your pod gets a SIGTERM signal, but it is up to you to implement the handling in the app before the pod is killed at the end of the configured grace period, by default 30 seconds, configurable with the terminationGracePeriodSeconds field of the Pod.
In an event-driven microservice environment, where a container reads messages from a queue, what happens when I deploy to the messages that are currently being processed? Does it stop taking messages from the queue?
As explained above, your app needs to implement the handling of the SIGTERM signal and e.g. stop consuming new messages from the queue. You also need to properly configure the terminationGracePeriodSeconds so that messages can fully be processed before the pod is evicted.
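As a sketch, for a queue consumer the grace period is usually sized to the longest expected per-message processing time (all names and the 300-second figure below are illustrative assumptions, not taken from the question):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: queue-worker                  # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: queue-worker
  template:
    metadata:
      labels:
        app: queue-worker
    spec:
      # must exceed the worst-case time to finish one in-flight message;
      # the application itself has to stop pulling new messages on SIGTERM
      terminationGracePeriodSeconds: 300
      containers:
      - name: worker
        image: my-worker:latest       # illustrative image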
A good explanation of this is Kubernetes best practices: terminating with grace
Does it start that process by stopping incoming connections?
Yes, your pod is removed from the Kubernetes endpoints list, so it stops receiving new connections, provided you access your pods via Services.

With Kubernetes Is there a way to wait for a pod to finish its ongoing tasks before updating it?

I'm managing an application inside Kubernetes.
I have a front end (Nginx, Flask) and a backend (Celery).
Long-running tasks are sent to the backend using a middleware (RabbitMQ).
My issue is that I can receive long-running tasks at any time, and I don't want them to disturb my plan of upgrading the version of my application.
I'm using the command kubectl apply -f $MY_FILE to deploy/update my application. But if I do it while a Celery pod is busy, the pod will be terminated and I'll lose the task.
I tried using the readiness probe, but the pods are still being terminated.
My question is: is there a way for Kubernetes to target only 'free' pods and wait for the busy ones to finish?
Thank you
You can use preStop hooks to complete ongoing tasks before the pod is terminated.
Kubernetes sends the preStop event immediately before the Container is terminated. Kubernetes’ management of the Container blocks until the preStop handler completes, unless the Pod’s grace period expires. For more details, see Termination of Pods.
https://kubernetes.io/docs/tasks/configure-pod-container/attach-handler-lifecycle-event/#define-poststart-and-prestop-handlers
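A minimal sketch of that idea for a Celery worker pod (the drain script path is hypothetical; the point is that the hook blocks until in-flight tasks finish, and the grace period is longer than the longest task):

spec:
  terminationGracePeriodSeconds: 600           # longer than the longest expected task
  containers:
  - name: celery-worker                        # hypothetical name
    image: my-celery-image:latest              # hypothetical image
    lifecycle:
      preStop:
        exec:
          # hypothetical script that blocks until the worker has no active tasks
          command: ["/bin/sh", "-c", "/usr/local/bin/wait-for-tasks.sh"]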
One way is to create another deployment with the new image and expose it as a service, passing any new requests ONLY to this new deployment/service.
Meanwhile, the old deployment/service can still continue processing the existing requests and not take any new ones. Once all the requests are processed the old deployment/service can be deleted.
The only problem with this approach is that roughly double the resources are required for some time, as the old and new deployment/service run in parallel.
Something like A/B testing. FYI, Istio makes this easy with traffic management.

How does Kubernetes knows what pod to kill when downscaling?

Is there a way to tell Kubernetes what pods to kill before or after a downscale? For example, suppose that I have 10 replicas and I want to downscale them to 5, but I want certain replicas to be alive and others to be killed after the downscale. Is that possible?
While it's not possible to selectively choose which pod is killed, you can prevent what you're really concerned about, which is the killing of pods that are in the midst of processing tasks. This requires you do two things:
Your application should be able to listen for and handle SIGTERM events, which Kubernetes sends to pods before it kills them. In your case, your app would handle SIGTERM by finishing any in-flight tasks then exiting.
You set the terminationGracePeriodSeconds on the pod to something greater than the longest time it takes for the longest task to be processed. Setting this property extends the period of time between k8s sending the SIGTERM (asking your application to finish up), and SIGKILL (forcefully terminating).
As per the link provided by @Matt and @Robert Bailey's answer, ReplicaSet-based resources currently don't support removing specific Pods from the replica pool when scaling down. You can find the related issue #45509 and the follow-up PR #75763.
You can use StatefulSets instead of ReplicaSets:
https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/
Pods will be created sequentially (my-app-0, my-app-1, my-app-2), and when you scale down, they will be terminated in reverse order, from {N-1..0}.
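A minimal StatefulSet sketch (all names are placeholders; note the headless Service referenced by serviceName that a StatefulSet needs):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-app                   # placeholder; pods become my-app-0, my-app-1, ...
spec:
  serviceName: my-app-headless   # placeholder headless Service required by the StatefulSet
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-image:latest   # placeholder image

Scaling this down to 1 replica would then terminate my-app-2 first and my-app-1 next, while my-app-0 keeps running.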