Why does scaling down a deployment seem to always remove the newest pods?

(Before I start, I'm using minikube v27 on Windows 10.)
I have created a deployment with the nginx 'hello world' container with a desired count of 2:
I actually went into the '2 hours' old pod and edited the index.html file from the welcome message to "broken" - I want to play with k8s to see what it would look like if one pod was 'faulty'.
If I scale this deployment up to more instances and then scale down again, I almost expected k8s to remove the oldest pods, but it consistently removes the newest:
How do I make it remove the oldest pods first?
(Ideally, I'd like to be able to just say "redeploy everything as the exact same version/image/desired count in a rolling deployment" if that is possible)

Pod deletion preference is based on an ordered series of checks, defined in code here:
https://github.com/kubernetes/kubernetes/blob/release-1.11/pkg/controller/controller_utils.go#L737
Summarizing, precedence is given to deleting pods:
that are unassigned to a node, vs assigned to a node
that are in pending or not running state, vs running
that are in not-ready, vs ready
that have been in ready state for fewer seconds
that have higher restart counts
that have newer vs older creation times
These checks are not directly configurable.
Given the rules, if you can make an old pod not ready, or cause it to restart, it will be removed at scale-down time before a newer pod that is ready and has not restarted.
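For example, one way to exploit the restart-count and ready-time rules is a sketch like this, assuming a deployment named hello whose container image has a shell and whose main process exits on SIGTERM (names are placeholders):
kubectl get pods -l app=hello --sort-by=.metadata.creationTimestamp   # list pods, oldest first
kubectl exec hello-5d4f8c7b9-oldest -- /bin/sh -c 'kill 1'            # main process exits, container restarts, restart count goes up
kubectl scale deployment hello --replicas=2                           # scale-down now prefers the restarted (old) pod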
There is discussion around use cases for the ability to control deletion priority, which mostly involve workloads that are a mix of job and service, here:
https://github.com/kubernetes/kubernetes/issues/45509

What about this:
kubectl scale deployment ingress-nginx-controller --replicas=2
Wait until 2 replicas are up.
kubectl delete pod ingress-nginx-controller-oldest-replica
kubectl scale deployment ingress-nginx-controller --replicas=1
I experienced zero downtime doing so while removing the oldest pod.
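On the question's last point (re-deploying the exact same image/replica count as a rolling deployment): assuming kubectl 1.15 or newer, a rolling restart of the existing deployment may be all that is needed:
kubectl rollout restart deployment <deployment_name>
This re-creates every pod from the same template via the normal rolling-update strategy, so an edited/'broken' pod gets replaced as well.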

Related

How to delete a pod from Kubernetes master node?

Does anyone know how to delete a pod from a Kubernetes master node? I have this one master node on a bare-metal Ubuntu server. When I try to delete it with "kubectl delete pod .." or force delete it as described here: https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/ it doesn't work; the pod is created again and again...
The pods in a StatefulSet are managed by the StatefulSet controller and will be recreated if the current and the desired replicas defined in the spec do not match.
The document you linked provides instructions on how to kill the pods forcefully, skipping the graceful shutdown, which can have unexpected consequences depending on the application.
The link clearly states the pods will be recreated in the section:
Force deletions do not wait for confirmation from the kubelet that the Pod has been terminated. Irrespective of whether a force deletion is successful in killing a Pod, it will immediately free up the name from the apiserver. This would let the StatefulSet controller create a replacement Pod with that same identity; this can lead to the duplication of a still-running Pod, and if said Pod can still communicate with the other members of the StatefulSet, will violate the at most one semantics that StatefulSet is designed to guarantee.
If you want the pods to be stopped and no new pods for the StatefulSet to be created, you need to scale the StatefulSet down by changing the replicas to 0.
You can read the official docs for how to scale the StatefulSet replicas.
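A minimal sketch, assuming the StatefulSet is named web (substitute your own name):
kubectl scale statefulset web --replicas=0   # stops all pods; no replacements are created
kubectl scale statefulset web --replicas=1   # later, bring it back up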
The key to figuring out how to kill the pod is to understand how it was created. For example, if the pod is part of a deployment with a declared replica count of 1, then once you kill or force-kill it, Kubernetes detects a mismatch between the desired state (the number of replicas defined in the deployment configuration) and the current state, and will create a new pod to replace the one that was deleted. Therefore, in this example you will need to either scale the deployment to 0 or delete the deployment.
If we need to kill the pods, we can just scale down the Deployment:
kubectl scale deploy <deployment_name> --replicas=<expected_no_of_replicas>
The way to delete a pod depends on how you created it. If you created it individually (not as part of a ReplicaSet/ReplicationController/Deployment) then you can delete the pod directly; otherwise the only option is to scale down. In a production setup, I believe most people use the Deployment option out of ReplicaSet/ReplicationController/Deployment (please refer to the documentation and understand the difference between those three options).
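A quick way to check which of those (if any) manages a pod is to look at its ownerReferences; a sketch, assuming a pod named mypod:
kubectl get pod mypod -o jsonpath='{.metadata.ownerReferences[*].kind}{"\n"}'
# prints e.g. ReplicaSet (Deployment-managed), StatefulSet or DaemonSet; empty output means a standalone pod
kubectl delete pod mypod   # a plain delete only sticks if nothing is set to recreate the pod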

Kubernetes Deployment Rolling Updates

I have an application that I deploy on Kubernetes.
This application has 4 replicas and I'm doing a rolling update on each deployment.
This application has a graceful shutdown which can take tens of minutes (it has to wait for running tasks to finish).
My problem is that during updates, I have over-capacity since all the older version pods are stuck at "Terminating" status while all the new pods are created.
During the updates, I end up running with 8 containers and it is something I'm trying to avoid.
I tried to set maxSurge to 0, but this setting doesn't take into consideration the "Terminating" pods, so the load on my servers during the deployment is too high.
The behaviour I'm trying to get is that new pods will only get created after the old version pods finished successfully, so at all times I'm not exceeding the number of replicas I set.
I wonder if there is a way to achieve such behaviour.
What I ended up doing is creating a StatefulSet with podManagementPolicy: Parallel and updateStrategy set to OnDelete.
I also set terminationGracePeriodSeconds to the maximum time it takes for a pod to terminate.
As a part of my deployment process, I apply the new StatefulSet with the new image and then delete all the running pods.
This way all the pods enter the Terminating state, and whenever a pod finishes its task and terminates, a new pod with the new image replaces it.
This way I'm able to keep a static number of replicas during the whole deployment process.
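A minimal sketch of that setup (names, image and grace period are placeholders, not from the original answer):
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: worker
spec:
  replicas: 4
  serviceName: worker
  podManagementPolicy: Parallel    # create/replace pods in parallel rather than one by one
  updateStrategy:
    type: OnDelete                 # a pod only picks up the new image when it is deleted
  selector:
    matchLabels:
      app: worker
  template:
    metadata:
      labels:
        app: worker
    spec:
      terminationGracePeriodSeconds: 3600   # upper bound on how long a pod may take to finish its tasks
      containers:
      - name: worker
        image: registry.example.com/worker:v2   # placeholder image
After applying the new image, kubectl delete pod -l app=worker puts every pod into Terminating, and the StatefulSet controller only starts each replacement (with the new image) once its predecessor has fully exited, so the replica count never exceeds 4.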
Let me suggest the following strategy:
Deployments implement the concept of ready pods to aid rolling updates. Readiness probes allow the deployment to gradually update pods while giving you the control to determine when the rolling update can proceed.
A Ready pod is one that is considered successfully updated by the Deployment and will no longer count towards the surge count for deployment. A pod will be considered ready if its readiness probe is successful and spec.minReadySeconds have passed since the pod was created. The default for these options will result in a pod that is ready as soon as its containers start.
So, what you can do is implement (if you haven't done so yet) a readiness probe for your pods, and set spec.minReadySeconds to a value that makes sense given the (worst-case) time it takes for your pods to terminate.
This will ensure the rollout happens gradually and in line with your requirements.
In addition to that, don't forget to configure a deadline for the rollout.
By default, if the rollout can't make any progress for 10 minutes, it's considered failed. The time after which the Deployment is considered failed is configurable through the progressDeadlineSeconds property in the Deployment spec.
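A sketch of those settings on a Deployment (probe path, port, image and timings are placeholders to adapt to your workload):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 4
  minReadySeconds: 1800            # roughly the worst-case time your pods need to terminate
  progressDeadlineSeconds: 3600    # give the slow rollout more than the default 10 minutes
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: registry.example.com/my-app:v2   # placeholder
        readinessProbe:
          httpGet:
            path: /healthz         # placeholder endpoint
            port: 8080
          periodSeconds: 10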

Apply changed limits in Kubernetes?

I changed the limits (default requested amount of CPU) on my Kubernetes cluster. Of course the new limits don't affect already running Pods. So, how can I apply the new (lower) limits to already running Pods.
Is there any way to update the limits in the running Pods without restarting them?
If I have to restart the Pods, how can this be done without deleting and recreating them? (I am really using plain Pods, no Deployments or the like.)
You need to restart the Pods:
You can't update the resources field of a running Pod. The update would be rejected.
You need to create new Pods and delete the old ones. You can create the new ones first and delete the old ones when the new ones are running, if this allows you to avoid downtime.
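For bare Pods, that workflow looks roughly like this (pod name and file name are hypothetical):
kubectl get pod mypod -o yaml > mypod.yaml
# edit resources.requests / resources.limits in mypod.yaml, strip server-populated fields
# (status, metadata.uid, metadata.resourceVersion, spec.nodeName) and give it a new metadata.name, e.g. mypod-2
kubectl create -f mypod.yaml
kubectl wait --for=condition=Ready pod/mypod-2   # wait for the replacement to come up
kubectl delete pod mypod                         # then remove the old one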
You can do that only when you run it as a Deployment, or at least run the pod with a restartPolicy of Always, so you can always scale down to zero and back up to 1 for a safe restart. In your case, considering that you just ran your pod using kubectl run without any restart policy (or with a restartPolicy of Never), I would run another pod, test it, and kill the already running ones. Others may well have better answers.
You can't change the properties of a running pod; the changes would be rejected. Instead, you can create a Deployment, whose rolling-update feature ensures that one pod keeps running while the limits are updated, so there is no downtime for your pod.
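Once the pod is managed by a Deployment, a change like the following triggers that rolling update (deployment name and values are placeholders):
kubectl set resources deployment my-app --requests=cpu=100m,memory=256Mi --limits=cpu=200m,memory=512Mi
kubectl rollout status deployment my-app   # watch the new pods replace the old ones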

kubernetes creates more pods than scale amount

I have encountered a strange situation in one of our clusters, where all of a sudden a number of new pods have been created so that we end up with a greater number of running pods than the scale amount.
So in the dashboard it will show
serviceX pods: 8/2
and then 8 running instances of that service
Questions
How can this possibly happen?
Is there an easy way to get rid of the extra pods (which all seem to be running)?
I have tried changing the scale amount in the dashboard and the extra pods do not disappear.
Both Pod and Deployment are full-fledged objects in the Kubernetes API. A Deployment manages Pods by means of ReplicaSets; what it boils down to is that the Deployment creates Pods with a spec taken from its template.
In your case the deployment named edgeservicepublic-svc is set to have 13 replicas. A Deployment is a kind of controller in Kubernetes, so it is natural that this controller continuously checks whether 13 pods exist. When a deployment is added to the cluster, it will automatically spin up the requested number of pods and then monitor them. If a pod dies, the deployment will automatically re-create it. Probably not enough pods were created at first, so the controller kept pursuing the desired number of them.
To make sure your deployment works properly you can delete the deployment and check that its pods are deleted. Make sure that you haven't set up an autoscaler ($ kubectl get hpa); if so, delete it. Then, if you want to change the deployment specification, edit the deployment configuration file and apply the changes ($ kubectl apply -f deployment_configuration_file.yaml).
Useful documentation about deployments and autoscaling in the context of GKE.
EDIT:
Basically, first check for an autoscaler and delete it if it exists. I told you to delete the deployment because you said you tried to change the scale amount / number of replicas. The way to be 100% sure that the changes are applied is to delete the whole deployment and then recreate it with the desired number of replicas. Of course you can just apply changes to the deployment configuration file ($ kubectl edit ...) or ($ kubectl apply -f), but sometimes the existing pods are not deleted, so deleting and recreating is safer. You could also create a new deployment with the same parameters but a different name.
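A sketch of that order of operations, assuming the service from the question is managed by a deployment named serviceX (adjust names and files to your cluster):
kubectl get hpa                                         # is an autoscaler fighting the manual scale?
kubectl delete hpa serviceX                             # if one exists (hypothetical name)
kubectl delete deployment serviceX                      # drop the deployment and its pods
kubectl apply -f deployment_configuration_file.yaml     # recreate it with the desired replica count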

Kubernetes deployment: Don't terminate pod until new pod has been in Running state for 2 minutes

When I do kubectl delete pod or kubectl patch, how can I ensure that the old pod doesn't actually delete itself until the replacement pod has been running for 2 minutes? (And if the replacement pod dies before the 2 minute mark, then don't even delete the old pod)
The reason is that my initiation takes about 2 minutes to pull some latest data and run calculations; after 2 minutes it'll reach a point where it'll either error out or continue running with the updated calculations.
I want to be able to occasionally delete the pod so that it'll restart and get new versions of stuff (because getting new versions is only done at the beginning of the code).
Is there a way I can do this without an init container? Because I'm concerned that it would be difficult to pass the calculated results from the init container to the main container
We need to tweak two parameters.
Set minReadySeconds to 2 minutes (120 seconds), or use a readiness probe instead of a hardcoded 2 minutes.
Do a rolling update with maxSurge > 0 (default: 1) and maxUnavailable: 0. This brings up the new pod(s), and only once they become ready are the old pod(s) killed. The process continues for the rest of the pods (see the sketch below the note).
Note: 0 <= maxSurge <= replicaCount
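A sketch of those two settings on a Deployment spec (names, image and replica count are placeholders):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: calc-app
spec:
  replicas: 2
  minReadySeconds: 120        # a new pod only counts as available after 2 minutes of being ready
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1             # bring up one extra pod with the new spec
      maxUnavailable: 0       # never remove an old pod before its replacement is available
  selector:
    matchLabels:
      app: calc-app
  template:
    metadata:
      labels:
        app: calc-app
    spec:
      containers:
      - name: calc-app
        image: registry.example.com/calc-app:latest   # placeholder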
If you are using a Deployment, you can set minReadySeconds in the spec to 120 (seconds). Kubernetes will not consider it actually ready and in service (and therefore spin down old pods) until the pod reports it has been ready for that long.