What happens when you scale down a Kubernetes Pod?

When you decrease the number of pods in a Kubernetes workload, I am assuming it does a soft kill. Does it start that process by stopping incoming connections? In an event-driven microservice environment, where a container reads messages from a queue, what happens to the messages that are currently being processed when I deploy? Does it stop taking messages from the queue?

When you decrease the number of pods in a Kubernetes workload, I am assuming it is doing a soft kill.
Yes, it does a graceful termination: your pod gets a SIGTERM signal, but it is up to you to handle that signal in the app before the pod is killed at the end of the termination grace period, which is 30 seconds by default and can be changed with the terminationGracePeriodSeconds field of the Pod.
In an event-driven microservice environment, where a container reads messages from a queue, what happens to the messages that are currently being processed when I deploy? Does it stop taking messages from the queue?
As explained above, your app needs to handle the SIGTERM signal itself and, for example, stop consuming new messages from the queue. You also need to configure terminationGracePeriodSeconds appropriately so that in-flight messages can be fully processed before the pod is killed.
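As a minimal sketch, assuming a hypothetical queue-consumer image and that each message needs at most a few minutes to finish, the Pod side of this could look like:

apiVersion: v1
kind: Pod
metadata:
  name: queue-consumer                  # hypothetical name
spec:
  # Assumed value: give in-flight messages up to 5 minutes after SIGTERM;
  # size this to your longest message-processing time.
  terminationGracePeriodSeconds: 300
  containers:
    - name: consumer
      image: example.org/queue-consumer:latest   # placeholder image

The application itself still has to catch SIGTERM, stop polling the queue, and exit once in-flight messages are acknowledged; the grace period only buys it the time to do so.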
A good explanation of this is Kubernetes best practices: terminating with grace
Does it start that process by stopping incoming connections?
Yes: the pod is removed from the Kubernetes Endpoints of its Services, so it stops receiving new connections, as long as you access your pods via Services.

Related

Can a kubernetes pod respond to a request when it is removed from a service endpoint?

I know that the moment a pod receives a deletion request, it is removed from the Service endpoints and no longer receives new requests. However, I'm not sure whether the pod can still return a response to a request it received just before it was removed from the Service endpoints.
If the pod IP is missing from the service's endpoint, can it still respond to requests?
There are many reasons why Kubernetes might terminate a healthy container (for example, node drain, termination due to lack of resources on the node, rolling update).
Once Kubernetes has decided to terminate a Pod, a series of events takes place:
1 - Pod is set to the “Terminating” State and removed from the endpoints list of all Services
At this point, the pod stops getting new traffic. Containers running in the pod will not be affected.
2 - preStop Hook is executed
The preStop hook is a special command or HTTP request that is sent to the containers in the pod.
If your application doesn’t gracefully shut down when receiving a SIGTERM you can use this hook to trigger a graceful shutdown. Most programs gracefully shut down when receiving a SIGTERM, but if you are using third-party code or are managing a system you don’t have control over, the preStop hook is a great way to trigger a graceful shutdown without modifying the application.
3 - SIGTERM signal is sent to the pod
At this point, Kubernetes will send a SIGTERM signal to the containers in the pod. This signal lets the containers know that they are going to be shut down soon.
Your code should listen for this event and start shutting down cleanly at this point. This may include stopping any long-lived connections (like a database connection or WebSocket stream), saving the current state, or anything like that.
Even if you are using the preStop hook, it is important that you test what happens to your application if you send it a SIGTERM signal, so you are not surprised in production!
4 - Kubernetes waits for a grace period
At this point, Kubernetes waits for a specified time called the termination grace period. By default, this is 30 seconds. It’s important to note that this happens in parallel to the preStop hook and the SIGTERM signal. Kubernetes does not wait for the preStop hook to finish.
If your app finishes shutting down and exits before the terminationGracePeriod is done, Kubernetes moves to the next step immediately.
If your pod usually takes longer than 30 seconds to shut down, make sure you increase the grace period. You can do that by setting the terminationGracePeriodSeconds option in the Pod YAML (a sketch follows this list).
5 - SIGKILL signal is sent to pod, and the pod is removed
If the containers are still running after the grace period, they are sent the SIGKILL signal and forcibly removed. At this point, all Kubernetes objects are cleaned up as well.
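As a rough sketch (the name, image, and timings here are illustrative assumptions, not from the article), a Pod that combines a preStop hook with a longer grace period might look like this:

apiVersion: v1
kind: Pod
metadata:
  name: graceful-demo                   # hypothetical name
spec:
  terminationGracePeriodSeconds: 60     # must cover the preStop hook plus shutdown time
  containers:
    - name: app
      image: example.org/app:1.0        # placeholder image
      lifecycle:
        preStop:
          exec:
            # runs before SIGTERM is delivered to this container
            command: ["/bin/sh", "-c", "sleep 5"]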
I hope this gives a good idea of the Kubernetes termination lifecycle and how to handle a Pod termination gracefully.
Based on this article.
Once the pod is deleted and its container is stopped, it cannot respond to requests, regardless of whether it has been removed from the Service's endpoints.
If the pod's container is still alive, it can respond to requests, whether or not you reach it through the Service.

Kubernetes deployment - instantly replace all pods

Our application consists of pods that handle asynchronous tasks from a queue. They might take up to half an hour to finish a task, but once sent a SIGINT or SIGTERM, will not pick up any new tasks.
What we want to achieve with a Kubernetes deployment is that all existing pods are sent the signal to terminate, and without waiting for them to be stopped, create the new pods.
Based on the documentation, I think this can be achieved by setting maxSurge and maxUnavailable both to 100%. Would this achieve our goal?
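For reference, a sketch of how that strategy might be expressed, with placeholder names and images, and the half-hour task time from the question reflected in the grace period:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: queue-worker                    # hypothetical name
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 100%        # a full set of new pods may be created immediately
      maxUnavailable: 100%  # all old pods may start terminating without waiting
  selector:
    matchLabels:
      app: queue-worker
  template:
    metadata:
      labels:
        app: queue-worker
    spec:
      terminationGracePeriodSeconds: 1800   # tasks may take up to half an hour
      containers:
        - name: worker
          image: example.org/worker:latest  # placeholder image

The old pods still keep running until they finish their current task (up to the grace period); the strategy above only ensures the replacement pods are created without waiting for them.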

Kubernetes HPA and Scaling Down

I have a kubernetes HPA set up in my cluster, and it works as expected scaling up and down instances of pods as the cpu/memory increases and decreases.
The only thing is that my pods handle web requests, so it occasionally scales down a pod that's in the process of handling a web request. The caller never gets a response back from the pod that was scaled down, and thus the caller of the web API gets an error back.
This all makes sense theoretically. My question is does anyone know of a best practice way to handle this? Is there some way I can wait until all requests are processed before scaling down? Or some other way to ensure that requests complete before HPA scales down the pod?
I can think of a few solutions, none of which I like:
Add retry mechanism to the caller and just leave the cluster as is.
Don't use HPA for web request pods (seems like it defeats the purpose).
Try to create some sort of custom metric and see if I can get that metric into Kubernetes (e.g. https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#support-for-custom-metrics)
Any suggestions would be appreciated. Thanks in advance!
Graceful shutdown of pods
You must design your apps to support graceful shutdown. First, your pod will receive a SIGTERM signal, and after 30 seconds (configurable) it will receive a SIGKILL signal and be removed. See Termination of Pods.
SIGTERM: when your app receives the termination signal, the pod will no longer receive new requests, but you should try to finish responding to the requests it has already received.
Design for idempotency
Your apps should also be designed for idempotency so you can safely retry failed requests.
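To tie the graceful-shutdown advice to configuration, a pod-template fragment along these lines is common; the container name, image, and timings are assumptions, not from the question:

spec:
  terminationGracePeriodSeconds: 45     # preStop sleep plus the longest expected request
  containers:
    - name: web                         # hypothetical container name
      image: example.org/web-api:latest # placeholder image
      lifecycle:
        preStop:
          exec:
            # short delay so endpoint removal propagates and no new requests
            # arrive before the app receives SIGTERM
            command: ["/bin/sh", "-c", "sleep 10"]

The application must still stop accepting new connections and finish in-flight responses when it receives SIGTERM.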

How does Kubernetes know which pod to kill when downscaling?

Is there a way to tell Kubernetes what pods to kill before or after a downscale? For example, suppose that I have 10 replicas and I want to downscale them to 5, but I want certain replicas to be alive and others to be killed after the downscale. Is that possible?
While it's not possible to selectively choose which pod is killed, you can prevent what you're really concerned about, which is the killing of pods that are in the middle of processing tasks. This requires that you do two things:
Your application should be able to listen for and handle SIGTERM events, which Kubernetes sends to pods before it kills them. In your case, your app would handle SIGTERM by finishing any in-flight tasks then exiting.
You set the terminationGracePeriodSeconds on the pod to something greater than the time it takes to process your longest task. Setting this property extends the period of time between Kubernetes sending the SIGTERM (asking your application to finish up) and the SIGKILL (forcefully terminating).
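Putting those two pieces together, a minimal sketch could look like the following; the one-hour value is only an example and should exceed your longest task:

spec:
  terminationGracePeriodSeconds: 3600   # example value: longer than the longest task
  containers:
    - name: task-worker                 # hypothetical name; the app must handle SIGTERM itself
      image: example.org/task-worker:latest   # placeholder image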
As per the link provided by @Matt and @Robert Bailey's answer, K8s ReplicaSet-based resources currently don't support scale-down functions that remove specific Pods from the replica pool. You can find the related issue #45509 and the follow-up PR #75763.
You can use StatefulSets instead of ReplicaSets:
https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/
Pods will be created sequentially (my-app-0, my-app-1, my-app-2), and when you scale down, they will be terminated in reverse ordinal order, from {N-1..0}.
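A minimal StatefulSet sketch with placeholder names, assuming a matching headless Service exists:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-app
spec:
  serviceName: my-app        # assumed headless Service name
  replicas: 3                # creates my-app-0, my-app-1, my-app-2 in order
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: example.org/my-app:latest   # placeholder image

Scaling this down to replicas: 2 removes my-app-2 first, so the pods you want to keep should be the lowest ordinals.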

Automatic Pod Deletion Delay in Kubernetes

Is there is a way to automatically delay all Kubernetes pod deletion requests such that the endpoint deregistration is signaled, but the pod's SIGTERM is delayed by several seconds?
It would be preferable, but not required, if the delay only affected pods with an Endpoint/Service.
Background:
It is well established that some traffic can continue to reach a Pod after it has been sent the SIGTERM termination signal, due to the asynchronous nature of endpoint deregistration and the deletion signal. The recommended mitigation is to introduce a few seconds' delay in the pod's preStop lifecycle hook by invoking sleep.
The difficulty rapidly arises where the pod's deployment may be done via helm or other upstream source, or else there are large numbers of deployments and containers to be managed. Modifying many deployments in such a way may be difficult, or even impossible (e.g. the container may not have a sleep binary, shell, or anything but the application executable).
I briefly explored a mutating admission controller, but that seems unworkable for dynamically adding a preStop hook, as not all images have a /bin/sleep or a shell, and some already have a preStop hook that would need image-specific knowledge to merge.
(Of course, all of this could be avoided if the K8S API made the endpoint deregistration synchronous with a timeout to avoid deadlock (hint, hint), but I haven't seen any discussions of such a change. Yes, there are tons of reasons why this isn't synchronous, but that doesn't mean something can't be done.)
The Kubernetes pod termination lifecycle has the following steps:
Pod is set to the “Terminating” State and removed from the endpoints list of all Services
preStop hook is executed
SIGTERM signal is sent to the pod
Kubernetes waits for a grace period, default is 30 seconds
SIGKILL signal is sent to pod, and the pod is removed
The grace period is what you need.
It's important to note that this grace period happens in parallel to the preStop hook and the SIGTERM signal.
A call to the preStop hook fails if the container is already in a terminated or completed state. It is blocking, meaning synchronous, so it must complete before the call to delete the container can be sent.
Here you can read more about Container Lifecycle Hooks.
So, for example, you could set terminationGracePeriodSeconds: 90, which might look like the following:
spec:
  terminationGracePeriodSeconds: 90
  containers:
    - name: myApplication
You can read the Kubernetes docs regarding Termination of Pods. I also recommend great blog post Kubernetes best practices: terminating with grace.
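To get the delay described in the question without modifying the application, the usual pattern is a preStop sleep, which only works if the image actually has a shell and sleep available (which, as the question notes, is not always the case). A sketch building on the spec above:

spec:
  terminationGracePeriodSeconds: 90
  containers:
    - name: myApplication
      lifecycle:
        preStop:
          exec:
            # delays SIGTERM so endpoint deregistration can propagate first;
            # requires /bin/sh and sleep in the image
            command: ["/bin/sh", "-c", "sleep 10"]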