How does Kubernetes know which pod to kill when downscaling?

Is there a way to tell Kubernetes what pods to kill before or after a downscale? For example, suppose that I have 10 replicas and I want to downscale them to 5, but I want certain replicas to be alive and others to be killed after the downscale. Is that possible?

While it's not possible to selectively choose which pod is killed, you can prevent what you're really concerned about, which is the killing of pods that are in the midst of processing tasks. This requires you to do two things:
1. Your application should listen for and handle SIGTERM, which Kubernetes sends to the containers in a pod before it kills them. In your case, your app would handle SIGTERM by finishing any in-flight tasks and then exiting.
2. Set terminationGracePeriodSeconds on the pod to a value greater than the time your longest task takes to process. Setting this property extends the period between Kubernetes sending SIGTERM (asking your application to finish up) and SIGKILL (forcefully terminating it).
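As a rough illustration of the first point, here is a minimal Python sketch of a worker that stops picking up new tasks once SIGTERM arrives but always finishes the task it is working on. The GracefulWorker class, its in-memory task list, and the process callback are hypothetical stand-ins for a real queue consumer, not part of any Kubernetes API.

```python
import signal
import threading
from collections import deque


class GracefulWorker:
    """Processes tasks until a stop is requested, never abandoning a task midway."""

    def __init__(self, tasks):
        # Hypothetical in-memory stand-in for a real task queue.
        self.tasks = deque(tasks)
        self.stop_requested = threading.Event()
        self.completed = []

    def handle_sigterm(self, signum, frame):
        # Only record the request; the loop below finishes the current task first.
        self.stop_requested.set()

    def install(self):
        # signal.signal must be called from the main thread of the process.
        signal.signal(signal.SIGTERM, self.handle_sigterm)

    def run(self, process):
        # The stop flag is checked between tasks, so an in-flight task runs to completion.
        while self.tasks and not self.stop_requested.is_set():
            task = self.tasks.popleft()
            process(task)
            self.completed.append(task)
        return self.completed
```

In a real process you would call install() once at startup and then run(); the pod's terminationGracePeriodSeconds then has to cover the longest single call to process.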

As the link from @Matt and @Robert Bailey's answer point out, ReplicaSet-based resources currently do not support removing specific Pods from the replica pool when scaling down. See the related issue #45509 and the follow-up PR #75763.

You can use StatefulSets instead of ReplicaSets:
https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/
Their pods are created sequentially (my-app-0, my-app-1, my-app-2), and when you scale down, they are terminated in reverse order, from {N-1..0}.
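To make the reverse ordering concrete, here is a small Python sketch that lists which pods would be removed when a StatefulSet is scaled down. The function name is made up for illustration, and it assumes the usual name-ordinal pod naming; it does not talk to a cluster.

```python
def scale_down_order(name, current_replicas, target_replicas):
    """List the pods a StatefulSet would remove, highest ordinal first."""
    if not 0 <= target_replicas <= current_replicas:
        raise ValueError("target_replicas must be between 0 and current_replicas")
    return [
        f"{name}-{ordinal}"
        for ordinal in range(current_replicas - 1, target_replicas - 1, -1)
    ]


print(scale_down_order("my-app", 5, 2))  # ['my-app-4', 'my-app-3', 'my-app-2']
```

So if the replicas you want to keep alive are the ones with the lowest ordinals, a StatefulSet gives you that behaviour by construction.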

Related

Is it okay to change the pod eviction timeout? (k8s, openshift)

I want to know about the pod eviction timeout. I've already read the k8s and OpenShift manuals and some blogs,
but I couldn't find an article on the impact of reducing pod-eviction-timeout (default: 5m).
I think there is a reason why the default value is 5 minutes, but I can't find it.
Can you tell me how it will affect the k8s cluster if I change the setting?
(Example: changing pod-eviction-timeout to 2 minutes or less.)
For reference: we have an OpenShift (OKD) cluster that runs many services.
Whether the 5m timeout is a valid choice depends on your services and your infrastructure.
There are multiple reasons for a pod to be evicted, such as node pressure, scheduling priorities due to resource limits, priorityClasses, taints/tolerations, etc. Basically, pods are evicted on some kind of failure or on some kind of scheduling event, which can also be initiated by a user.
If you reduce the timeout, Kubernetes will not wait as long before forcefully killing the processes during the eviction. That can lead to unwanted behaviour with stateful services, because a service may not have enough time to shut down gracefully, and the attached volume may not be available in time when the pod is scheduled on another node. With stateless services everything is easier, so there won't be such problems.
In short: if you are running stateless services, this should not lead to any problems. If you have stateful services, it may cause problems, but that cannot be answered generally. You have to test it and see what happens, because you (and your team) know your services best.

Kubernetes deployment - instantly replace all pods

Our application consists of pods that handle asynchronous tasks from a queue. They might take up to half an hour to finish a task, but once sent a SIGINT or SIGTERM, will not pick up any new tasks.
What we want to achieve with a Kubernetes deployment is that all existing pods are sent the signal to terminate, and without waiting for them to be stopped, create the new pods.
Based on the documentation, I think this can be achieved by setting maxSurge and maxUnavailable both to 100%. Would this achieve our goal?

What happens when you scale down a Kubernetes Pod?

When you decrease the number of pods in a Kubernetes workload, I am assuming it is doing a soft kill. Does it start that process by stopping incoming connections? Consider an event-driven microservice environment where a container reads messages from a queue. When I deploy, what happens to the messages that are currently being processed? Does it stop taking messages from the queue?
When you decrease the number of pods in a Kubernetes workload, I am assuming it is doing a soft kill.
Yes, it does a graceful termination: your pod gets a SIGTERM signal, and it is up to you to handle it in the app before the pod is killed at the end of the graceful termination period. That period is 30 seconds by default and can be configured with the terminationGracePeriodSeconds field of the Pod.
In an event-driven microservice environment where a container reads messages from a queue: when I deploy, what happens to the messages that are currently being processed? Does it stop taking messages from the queue?
As explained above, your app needs to implement the handling of the SIGTERM signal and e.g. stop consuming new messages from the queue. You also need to properly configure the terminationGracePeriodSeconds so that messages can fully be processed before the pod is evicted.
A good explanation of this is Kubernetes best practices: terminating with grace
Does it start that process by stopping incoming connections?
Yes, your pod is removed from the Kubernetes Endpoint list, so it should work if you access your pods via Services.

K8s - Schedule new pod before the old one is terminated

I have read up on the Kubernetes docs but I'm unable to get a clear answer on my question. I'm using the official cluster-autoscaler.
I have a deployment that specifies one replica should be running. When a pod is terminated (for example, was running on a node that is getting scaled-down) is the new pod scheduled before the termination begins or after the termination is done? The docs say that schedule happens when terminating, but don't mention at which phase.
To achieve seamless node scale-down without disruption to any services, I would expect k8s to scale up pods to N+1 replicas (at this point pods are scheduled only to nodes that are not scaling down) and then drain the node. Based on my testing, it first drains, and then schedules any missing pods based on configurations. Is it possible to configure this behaviour, or is this currently not possible?
From what I understand, seamless updates are easy with the RollingUpdate strategy. I have not found an equivalent "Rolling" strategy for scale-down.
EDIT
TL;DR I'm looking for HA on a) two+ replica deployment and b) one replica deployment
a) Can be achieved by using PDBs. Check out Fritz's answer. If you need pods scheduled on different nodes, leverage anti-affinity (Marc's answer)
b) If you're okay with short disruption, PDB is the official way to go. If you need a workaround, my answer can be of inspiration.
The scale-down behavior can be configured with what is called a Pod Disruption Budget (PDB).
In a PodDisruptionBudget object, which is a separate resource rather than part of the Deployment manifest, you can define either maxUnavailable or minAvailable for your Pods during voluntary disruptions such as draining nodes.
For how to do it, check out the K8s documentation.
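To see why a budget helps with two or more replicas but blocks a drain for a single replica, here is a small Python sketch of how the number of allowed voluntary disruptions can be derived from a budget. The function and parameter names are illustrative assumptions, and percentage values are left out for brevity.

```python
def allowed_disruptions(healthy, desired, min_available=None, max_unavailable=None):
    """How many pods may be evicted voluntarily right now under a disruption budget."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        allowed = healthy - min_available
    else:
        # Pods that are already unhealthy count against the budget.
        allowed = max_unavailable - (desired - healthy)
    return max(0, allowed)


print(allowed_disruptions(healthy=3, desired=3, min_available=2))  # 1
print(allowed_disruptions(healthy=1, desired=1, min_available=1))  # 0
```

With one replica and minAvailable set to 1, the result is 0, so a drain has to wait; that is the "short disruption or workaround" trade-off described for case b) above.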
Below are some insights; I hope they help:
If you use a Deployment, its controller checks that you always have the desired number of replicas running. No less, no more. So when you kill a node (which hosts one of your replicas), the new pod will be scheduled after the termination of one of your original replicas. It's up to you to plan ahead if it's a planned maintenance.
If you have more than one node and want to achieve HA (high availability) for your deployments, then you should have a look at pod affinity/anti-affinity. You can find out more in the official doc
Hate to answer my own question, but an easy solution for a high-availability service with only one pod (without wasting resources on an idle replica) is to use a preStop hook (to make the action blocking if proper SIGTERM handling is not implemented) together with a terminationGracePeriodSeconds that leaves enough time for the replacement pod to start.
Contrary to what has been said here, the scheduling happens while the pod is terminating. I confirmed this with a quick test (which I should have done while reading the docs): I created a busybox (sh sleep 3600) deployment with one replica and terminationGracePeriodSeconds set to 240 seconds.
After I deleted the pod, it entered the Terminating state and stayed there for 240 seconds. Immediately after the pod was marked as Terminating, a new pod was scheduled to replace it.
So the previous pod has time to finish whatever it is doing and the other one can seamlessly take its place.
I haven't tested how the networking will behave, since the LB will stop sending new requests, but I assume the downtime will be much lower than without terminationGracePeriodSeconds set higher than the default.
Beware that this is not official by any means, but it serves as a workaround for my use case.

Does Kubernetes support connection draining?

Does Kubernetes support connection draining?
For example, my deployment rolls out a new version of my web app container.
In connection draining mode Kubernetes should spin up a new container from the new image and route all new traffic coming to my service to this new instance. The old instance should remain alive long enough to send a response for existing connections.
Kubernetes does support connection draining, but how it happens is controlled by the Pods, and is called graceful termination.
Graceful Termination
Let's take an example of a set of Pods serving traffic through a Service. This is a simplified example, the full details can be found in the documentation.
1. The system (or a user) notifies the API that the Pod needs to stop.
2. The Pod is set to the Terminating state. This removes it from a Service serving traffic. Existing connections are maintained, but new connections should stop as soon as the load balancers recognize the change.
3. The system sends SIGTERM to all containers in the Pod.
4. The system waits terminationGracePeriodSeconds (default 30s), or until the Pod completes on its own.
5. If containers in the Pod are still running at that point, they are sent SIGKILL and the Pod is forcefully terminated.
This covers not only the simple termination case: the exact same process is used in rolling update deployments, where each Pod is terminated in the same way and is given the opportunity to clean up.
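The steps above can be sketched as a tiny Python timeline. This is a simplified model rather than what the kubelet actually runs; the function name and the exit_after_seconds parameter (how long the app needs after SIGTERM) are assumptions made for illustration.

```python
def termination_timeline(exit_after_seconds, grace_period_seconds=30):
    """Return (second, event) pairs for one Pod shutdown.

    exit_after_seconds: time the app needs after SIGTERM, or None if it never exits.
    """
    events = [
        (0, "Pod set to Terminating and removed from Service endpoints"),
        (0, "SIGTERM sent to all containers"),
    ]
    if exit_after_seconds is not None and exit_after_seconds <= grace_period_seconds:
        # The app finished its cleanup within the grace period.
        events.append((exit_after_seconds, "containers exited on their own"))
    else:
        # The grace period ran out while containers were still running.
        events.append((grace_period_seconds, "SIGKILL sent to remaining containers"))
    return events


for second, event in termination_timeline(exit_after_seconds=45):
    print(second, event)
```

With the default of 30 seconds, an app that needs 45 seconds to drain ends with SIGKILL; raising the grace period to 60 lets it exit on its own.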
Using Graceful Termination For Connection Draining
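If you do not handle SIGTERM in your app, your Pods will typically terminate immediately, since the default action of SIGTERM is to terminate the process, and the grace period goes unused because the Pod exits on its own. (One caveat: a process that runs as PID 1 in its container and has no SIGTERM handler may ignore the signal and only stop when SIGKILL arrives.)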
If you need "connection draining", this is the basic way you would implement it in Kubernetes:
1. Handle the SIGTERM signal, and clean up your connections in a way that your application decides. This may simply be "do nothing" to allow in-flight connections to clear out. Long running connections may be terminated in a way that is (more) friendly to client applications.
2. Set the terminationGracePeriodSeconds long enough for your Pod to clean up after itself.
There are some more options which could help to enable a zero downtime deployment. Here's a summary:
1. Pod Graceful Termination
See Kekoa's answer.
Downside: Pod directly receives SIGTERM and will not accept any new requests (dependent on the implementation).
2. Pre Stop Hook
The Pod receives SIGTERM only after a waiting time and can still accept new requests in the meantime. You should set terminationGracePeriodSeconds to a value larger than the preStop sleeping time.
lifecycle:
  preStop:
    exec:
      command: ["/bin/bash","-c","/bin/sleep 90"]
This is the recommended solution of the Azure Application Gateway Ingress Controller.
Downside: Pod is removed from the list of Endpoints and might not be visible to other pods.
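The reason the grace period has to exceed the preStop sleep is that, as far as I understand the Kubernetes documentation, the grace period countdown starts before the preStop hook runs, so the hook's time is taken out of the same budget. A minimal Python sketch of that arithmetic follows; the function name is made up, and it ignores any small extension the kubelet may grant.

```python
def sigterm_budget(grace_period_seconds, prestop_seconds):
    """Seconds left for the app to react to SIGTERM once the preStop hook has finished."""
    return max(0, grace_period_seconds - prestop_seconds)


print(sigterm_budget(120, 90))  # 30
print(sigterm_budget(30, 90))   # 0
```

With the 90-second sleep shown above and the default grace period of 30 seconds, nothing is left for the application's own shutdown, which is why terminationGracePeriodSeconds needs to be raised alongside the hook.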
3. Helm Chart Hooks
If you need some cleanup before the Pods are removed from the Endpoints, you need Helm Chart Hooks.
apiVersion: batch/v1
kind: Job
metadata:
  name: graceful-shutdown-mydeployment
  annotations:
    "helm.sh/hook": pre-delete,pre-upgrade,pre-rollback
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    app.kubernetes.io/name: graceful-shutdown-mydeployment
spec:
  template:
    spec:
      containers:
        - name: graceful-shutdown
          image: ...
          command: ...
      restartPolicy: Never
  backoffLimit: 0
For more details see https://stackoverflow.com/a/66733416/1909531
Downside: Helm required
Full lifecycle
Here's how these options are executed in the pod's lifecycle.
1. Execute Helm Chart Hook
2. Set Pod to Terminating State + Executing Pre Stop Hook
3. Send SIGTERM
4. Wait terminationGracePeriodSeconds
5. Stop Pod by sending SIGKILL, if containers are still running.
Choosing an option
I would first try option 1 (terminationGracePeriodSeconds), then option 2 (Pre Stop Hook), and finally option 3 (Helm Chart Hook), in order of increasing complexity.
No, deployments do not support connection draining per se. Draining effectively happens as old pods stop and new pods start: clients connected to old pods will have to reconnect to new pods. Because clients connect to the service, this is transparent to them. You do need to ensure that your application can handle different versions running concurrently, but that is a good idea anyway, as it minimises downtime in upgrades and allows you to perform things like A/B testing.
There are a couple of different strategies which let you tweak how your upgrades take place. Deployments support two update strategies: Recreate and RollingUpdate.
With Recreate, old pods are stopped before new pods are started. This leads to a period of downtime but ensures that all clients connect to either the old or the new version - there will never be a time when both old & new pods are servicing clients at the same time. If downtime is acceptable to you then this may be an option for you.
Most of the time, however, downtime is unacceptable for a service so RollingUpdate is more appropriate. This starts up new pods & as it does so it stops old pods. As old pods are stopped, clients connected to them have to reconnect. Eventually there will be no old pods & all clients will have reconnected to new pods.
While there is no option to do connection draining as you suggest, you can configure the rolling update via maxUnavailable and maxSurge options. From http://kubernetes.io/docs/user-guide/deployments/#rolling-update-deployment:
.spec.strategy.rollingUpdate.maxUnavailable is an optional field that specifies the maximum number of Pods that can be unavailable during the update process. The value can be an absolute number (e.g. 5) or a percentage of desired Pods (e.g. 10%). The absolute number is calculated from percentage by rounding down. This can not be 0 if .spec.strategy.rollingUpdate.maxSurge is 0. By default, a value of 25% is used in apps/v1 (earlier API versions used a fixed value of 1).
For example, when this value is set to 30%, the old Replica Set can be scaled down to 70% of desired Pods immediately when the rolling update starts. Once new Pods are ready, old Replica Set can be scaled down further, followed by scaling up the new Replica Set, ensuring that the total number of Pods available at all times during the update is at least 70% of the desired Pods.
.spec.strategy.rollingUpdate.maxSurge is an optional field that specifies the maximum number of Pods that can be created above the desired number of Pods. Value can be an absolute number (e.g. 5) or a percentage of desired Pods (e.g. 10%). This can not be 0 if MaxUnavailable is 0. The absolute number is calculated from percentage by rounding up. By default, a value of 25% is used in apps/v1 (earlier API versions used a fixed value of 1).
For example, when this value is set to 30%, the new Replica Set can be scaled up immediately when the rolling update starts, such that the total number of old and new Pods do not exceed 130% of desired Pods. Once old Pods have been killed, the new Replica Set can be scaled up further, ensuring that the total number of Pods running at any time during the update is at most 130% of desired Pods.
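The two 30% examples can be checked with a short Python sketch. It assumes that maxSurge percentages round up and maxUnavailable percentages round down, which is my understanding of how the Deployment controller resolves them; the function names are illustrative, and the sketch simply rejects the case where both resolve to 0.

```python
def resolve(value, replicas, round_up):
    """Turn an absolute number or a percentage string like "30%" into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = int(value[:-1]) * replicas
        # Integer arithmetic avoids floating-point surprises when rounding.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rolling_update_bounds(replicas, max_surge="25%", max_unavailable="25%"):
    """Upper bound on total pods and lower bound on available pods during a rollout."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    if surge == 0 and unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    return {"max_total": replicas + surge, "min_available": replicas - unavailable}


print(rolling_update_bounds(10, "30%", "30%"))  # {'max_total': 13, 'min_available': 7}
```

For 10 desired Pods this gives at most 13 Pods in total and at least 7 available, matching the 130% and 70% figures quoted above.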
Hope that helps.