Kubernetes Deployment Terminate Oldest Pod

I'm using Azure Kubernetes Service to run a Go application that pulls from RabbitMQ, runs some processing, and returns it. The pods scale to handle an increase of jobs. Pretty run-of-the-mill stuff.
The HPA is set up like this:
NAME         REFERENCE               TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
production   Deployment/production   79%/80%   2         10        10         4d11h
staging      Deployment/staging      2%/80%    1         2         1          4d11h
What happens is that as the HPA scales up and down, there are always 2 pods that stay running. We've found that after running for long enough, the Go app on those pods will time out. Sometimes that takes days, sometimes weeks. Yes, we could probably dig into the code and figure this out, but it's kind of a low priority for that team.
Another solution I've thought of would be to have the HPA remove the oldest pods first. This would mean that the oldest pod would never be more than a few hours old. A first-in, first-out model.
However, I don't see any clear way to do that. It's entirely possible that there isn't one, but it seems like something that could work.
Am I missing something? Is there a way to make this work?

In my opinion (as I also mentioned in a comment), the simplest way, if not the most elegant, is to have a CronJob that periodically cleans up timed-out pods.
One CronJob object is like one line of a crontab (cron table) file. It
runs a job periodically on a given schedule, written in Cron format.
CronJobs are useful for creating periodic and recurring tasks, like
running backups or sending emails. CronJobs can also schedule
individual tasks for a specific time, such as scheduling a Job for
when your cluster is likely to be idle.
CronJob examples and how-tos:
How To Use Kubernetes’ Job and CronJob
Kubernetes: Delete pods older than X days
https://github.com/dignajar/clean-pods <-- real example
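The selection step of such a cleanup job can be sketched as below. This is only an illustration of the "delete pods older than X" logic, not the code of the linked project: the function name pods_older_than, the pod names and the timestamps are all made up, and a real job would read creation timestamps from the Kubernetes API and then delete the selected pods.

```python
from datetime import datetime, timedelta, timezone


def pods_older_than(pods, max_age, now):
    """Return the names of pods older than max_age, oldest first.

    pods maps a pod name to its creation time (hypothetical input shape).
    """
    expired = [(created, name) for name, created in pods.items()
               if now - created > max_age]
    return [name for created, name in sorted(expired)]


# Made-up example data: three pods of different ages.
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
pods = {
    "production-aaa": now - timedelta(days=5),
    "production-bbb": now - timedelta(hours=3),
    "production-ccc": now - timedelta(days=2),
}

# Pods older than one day are candidates for deletion.
print(pods_older_than(pods, timedelta(days=1), now))
# ['production-aaa', 'production-ccc']
```

Since the Deployment recreates any pod that gets deleted, running this on a schedule keeps every pod younger than the chosen threshold plus one schedule interval.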

Related

How will a scheduled (rolling) restart of a service be affected by an ongoing upgrade (and vice versa)

Due to a memory leak in one of our services I am planning to add a k8s CronJob to schedule a periodic restart of the leaking service. Right now we do not have the resources to look into the mem leak properly, so we need a temporary solution to quickly minimize the issues caused by the leak. It will be a rolling restart, as outlined here:
How to schedule pods restart
I have already tested this in our test cluster, and it seems to work as expected. The service has 2 replicas in test, and 3 in production.
My plan is to schedule the CronJob to run every 2 hours.
I am now wondering: How will the new CronJob behave if it should happen to execute while a service upgrade is already running? We do rolling upgrades to achieve zero downtime, and we sometimes roll out upgrades several times a day. I don't want to limit the people who deploy upgrades by saying "please ensure you never deploy near to 08:00, 10:00, 12:00 etc". That will never work in the long term.
And vice versa, I am also wondering what will happen if an upgrade is started while the CronJob is already running and the pods are restarting.
Does kubernetes have something built-in to handle this kind of conflict?
This answer to the linked question recommends using kubectl rollout restart from a CronJob pod. That command internally works by adding an annotation to the deployment's pod spec; since the pod spec is different, it triggers a new rolling upgrade of the deployment.
Say you're running an ordinary redeployment; that will change the image: setting in the pod spec. At about the same time, the kubectl rollout restart happens that changes an annotation setting in the pod spec. The Kubernetes API forces these two changes to be serialized, so the final deployment object will always have both changes in it.
This question then reduces to "what happens if a deployment changes and needs to trigger a redeployment, while a redeployment is already running?" The Deployment documentation covers this case: it will start deploying new pods on the newest version of the pod spec and treat all older ones as "old", so a pod with the intermediate state might only exist for a couple of minutes before getting replaced.
In short: this should work consistently and you shouldn't need to take any special precautions.
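To illustrate why the order of the two changes does not matter, here is a small sketch that models the pod template as a plain dictionary and applies both changes in either order. It is a simplification and not how the API server is implemented: apply_patch is a hypothetical helper, and the image names and timestamp are made up. The annotation key is the one kubectl rollout restart sets.

```python
import copy


def apply_patch(obj, patch):
    """Recursively merge patch into a copy of obj (simplified merge patch)."""
    result = copy.deepcopy(obj)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = apply_patch(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


# Simplified pod template: only the two fields that matter here.
template = {"annotations": {}, "image": "app:1.0"}

# An ordinary upgrade changes the image.
upgrade = {"image": "app:1.1"}
# A rollout restart only adds an annotation.
restart = {"annotations": {"kubectl.kubernetes.io/restartedAt": "2024-01-10T08:00:00Z"}}

upgrade_first = apply_patch(apply_patch(template, upgrade), restart)
restart_first = apply_patch(apply_patch(template, restart), upgrade)

# Both orders end with the new image and the restart annotation.
print(upgrade_first == restart_first)  # True
```

Because the two changes touch different fields, whichever one lands second simply triggers one more rollout towards the same final pod spec.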

Kubernetes Deployment Rolling Updates

I have an application that I deploy on Kubernetes.
This application has 4 replicas and I'm doing a rolling update on each deployment.
This application has a graceful shutdown which can take tens of minutes (it has to wait for running tasks to finish).
My problem is that during updates, I have over-capacity since all the older version pods are stuck at "Terminating" status while all the new pods are created.
During the updates, I end up running with 8 containers and it is something I'm trying to avoid.
I tried to set maxSurge to 0, but this setting doesn't take into consideration the "Terminating" pods, so the load on my servers during the deployment is too high.
The behaviour I'm trying to get is that new pods will only get created after the old version pods finished successfully, so at all times I'm not exceeding the number of replicas I set.
I wonder if there is a way to achieve such behaviour.
What I ended up doing is creating a StatefulSet with podManagementPolicy: Parallel and updateStrategy to OnDelete.
I also set terminationGracePeriodSeconds to the maximum time it takes for a pod to terminate.
As a part of my deployment process, I apply the new StatefulSet with the new image and then delete all the running pods.
This way all the pods enter the Terminating state, and whenever a pod finishes its task and terminates, a new pod with the new image replaces it.
This way I'm able to keep a static number of replicas during the whole deployment process.
Let me suggest the following strategy:
Deployments implement the concept of ready pods to aid rolling updates. Readiness probes allow the deployment to gradually update pods while giving you the control to determine when the rolling update can proceed.
A Ready pod is one that is considered successfully updated by the Deployment and will no longer count towards the surge count for deployment. A pod will be considered ready if its readiness probe is successful and spec.minReadySeconds have passed since the pod was created. The default for these options will result in a pod that is ready as soon as its containers start.
So what you can do is implement a readiness probe for your pods (if you haven't done so yet) and set spec.minReadySeconds to a value that covers the worst-case time it takes for your pods to terminate.
This will ensure the rollout happens gradually and in line with your requirements.
In addition to that, don't forget to configure a deadline for the rollout.
By default, after the rollout can’t make any progress in 10 minutes, it’s considered as failed. The time after which the Deployment is considered failed is configurable through the progressDeadlineSeconds property in the Deployment spec.

Scheduling a controller to run every one hour in Kubernetes

I have a console application which does some operations when run, and I build a Docker image of it. Now I would like to deploy it to Kubernetes and run it every hour. Is that possible in K8s?
I have read about Cron jobs but that's being offered only from version 1.4
The short answer: sure, you can do it with a CronJob, and yes, it does create a Pod. You can configure Job History Limits to control how many failed and completed pods you want to keep before Kubernetes deletes them.
Note that a CronJob is built on top of the Job resource: it creates a new Job object on each scheduled run.
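As a rough illustration of what the history limits do, the sketch below keeps only the most recent successful and failed runs from a list of finished jobs. It is a model of the behaviour, not the controller's code: jobs_to_keep and the run names are hypothetical, and the two limits stand in for the successfulJobsHistoryLimit and failedJobsHistoryLimit fields of the CronJob spec.

```python
def last_n(items, n):
    """Return the last n items of a list (an empty list when n is 0)."""
    return items[max(0, len(items) - n):]


def jobs_to_keep(finished, successful_limit, failed_limit):
    """finished is a list of (name, succeeded) tuples, oldest first."""
    succeeded = [name for name, ok in finished if ok]
    failed = [name for name, ok in finished if not ok]
    keep = set(last_n(succeeded, successful_limit)) | set(last_n(failed, failed_limit))
    # Preserve the original oldest-first order in the result.
    return [name for name, _ in finished if name in keep]


# Made-up history of six finished runs.
finished = [
    ("run-1", True), ("run-2", False), ("run-3", True),
    ("run-4", True), ("run-5", False), ("run-6", True),
]

print(jobs_to_keep(finished, successful_limit=2, failed_limit=1))
# ['run-4', 'run-5', 'run-6']
```

Everything not in the returned list would be cleaned up, together with the pods those jobs created.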

Why does scaling down a deployment seem to always remove the newest pods?

(Before I start, I'm using minikube v27 on Windows 10.)
I have created a deployment with the nginx 'hello world' container with a desired count of 2:
I actually went into the '2 hours' old pod and edited the index.html file from the welcome message to "broken" - I want to play with k8s to see what it would look like if one pod was 'faulty'.
If I scale this deployment up to more instances and then scale down again, I almost expected k8s to remove the oldest pods, but it consistently removes the newest:
How do I make it remove the oldest pods first?
(Ideally, I'd like to be able to just say "redeploy everything as the exact same version/image/desired count in a rolling deployment" if that is possible)
Pod deletion preference is based on an ordered series of checks, defined in code here:
https://github.com/kubernetes/kubernetes/blob/release-1.11/pkg/controller/controller_utils.go#L737
Summarizing, precedence is given to deleting pods:
that are unassigned to a node, vs assigned to a node
that are in pending or not running state, vs running
that are in not-ready, vs ready
that have been in ready state for fewer seconds
that have higher restart counts
that have newer vs older creation times
These checks are not directly configurable.
Given the rules, if you can make an old pod not ready, or cause an old pod to restart, it will be removed at scale-down time before a newer pod that is ready and has not restarted.
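The ordered checks listed above can be sketched as a sort key, where pods that sort first are the ones deleted first. This is a simplified model for illustration, not the controller's actual code: the Pod class, its field names and the example pods are made up, and the real implementation compares timestamps rather than precomputed durations.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    assigned: bool      # scheduled onto a node
    running: bool       # in the running state
    ready: bool         # passing its readiness check
    ready_seconds: int  # how long it has been ready
    restarts: int       # container restart count
    age_seconds: int    # time since creation


def deletion_key(pod):
    """Smaller keys sort first, meaning the pod is deleted first."""
    return (
        pod.assigned,       # unassigned before assigned
        pod.running,        # not running before running
        pod.ready,          # not ready before ready
        pod.ready_seconds,  # ready for fewer seconds first
        -pod.restarts,      # higher restart counts first
        pod.age_seconds,    # newer before older
    )


pods = [
    Pod("old", True, True, True, 7200, 0, 7200),
    Pod("new", True, True, True, 60, 0, 60),
    Pod("old-not-ready", True, True, False, 0, 0, 9000),
]

print([p.name for p in sorted(pods, key=deletion_key)])
# ['old-not-ready', 'new', 'old']
```

The example shows both effects described in this answer: between two healthy pods the newer one goes first, but an old pod that is not ready jumps ahead of both.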
There is discussion around use cases for the ability to control deletion priority, which mostly involve workloads that are a mix of job and service, here:
https://github.com/kubernetes/kubernetes/issues/45509
What about this:
kubectl scale deployment ingress-nginx-controller --replicas=2
Wait until 2 replicas are up.
kubectl delete pod ingress-nginx-controller-oldest-replica
kubectl scale deployment ingress-nginx-controller --replicas=1
I experienced zero downtime doing this while removing the oldest pod.

What is stored in a kubernetes job and how do I check resource use of old job(s)?

This morning I learned about the (unfortunate) default in kubernetes of all previously run cronjobs' jobs instances being retained in the cluster. Mea culpa for not reading that detail in the documentation. I also notice that deleting jobs (kubectl delete job [<foo> or --all]) takes quite a long time. Further, I noticed that even a reasonably provisioned kubernetes cluster with three large nodes appears to fail (get timeouts of all sorts when trying to use kubectl) when there are just ~750 such old jobs in the system (plus some other active containers that otherwise had not entailed heavy load) [Correction: there were also ~7k pods associated with those old jobs that were also retained :-o]. (I did learn about the configuration settings to limit/avoid storing old jobs from cronjobs, so this won't be a problem [for me] in the future.)
So, since I couldn't find documentation for kubernetes about this, my (related) questions are:
what exactly is stored when kubernetes retains old jobs? (Presumably it's the associated pod's logs and some metadata, but this doesn't explain why they seemed to place such a load on the cluster.)
is there a way to see the resources (disk only, I assume, but maybe there is some other resource) that individual or collective old jobs are using?
why does deleting a kubernetes job take on the order of a minute?
I don't know if k8s provides that kind of detail about how much disk space each job is consuming, but here is something you can try.
Try to find the pods associated with the job:
kubectl get pods --selector=job-name=<job name> --output=jsonpath={.items..metadata.name}
Once you know the pod, find the Docker container associated with it:
kubectl describe pod <pod name>
In the above output, look for Node and Container ID. Then log in to that node and go to the path /var/lib/docker/containers/<container id found above>, where you can investigate what is wrong.