How do we ensure scale-in protection for Kubernetes pods? - kubernetes

When scaling in, the HPA shouldn't terminate a pod that has a job running on it.
AWS Auto Scaling groups take care of this with scale-in protection for instances. Is there something similar in Kubernetes?

Use terminationGracePeriodSeconds to give your worker process time to finish. It will get a SIGTERM, then has that many seconds to finish (the default is 30, but you can make it anything; some of my workers have it set to 12 hours), then gets a SIGKILL if it hasn't exited. So stop accepting new work units on SIGTERM, set the grace period to the length of your longest work unit, and no worries :)
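A minimal sketch of the "stop accepting new work units on SIGTERM" part, in Python. The `GracefulWorker` class and its `fetch_unit` / `process_unit` callables are hypothetical names used only for illustration; the answer does not say how the real workers are written.

```python
import signal
import threading


class GracefulWorker:
    """Processes work units until asked to stop, never abandoning one midway."""

    def __init__(self, fetch_unit, process_unit):
        # Hypothetical callables: fetch_unit() returns the next unit or None,
        # process_unit(unit) runs a single unit to completion.
        self._fetch_unit = fetch_unit
        self._process_unit = process_unit
        self._stop = threading.Event()

    def request_stop(self, signum=None, frame=None):
        # Usable as a signal handler: it only sets a flag, so the unit that
        # is currently being processed keeps running.
        self._stop.set()

    def run(self):
        completed = 0
        # The flag is checked only between units, so no new unit is started
        # after SIGTERM, but the current one always finishes.
        while not self._stop.is_set():
            unit = self._fetch_unit()
            if unit is None:
                break  # nothing left to do
            self._process_unit(unit)
            completed += 1
        return completed


def install_sigterm_handler(worker):
    # Must be called from the main thread of the worker process.
    signal.signal(signal.SIGTERM, worker.request_stop)
```

The process then exits on its own once `run()` returns, so the pod goes away as soon as the in-flight unit is done. This only helps if terminationGracePeriodSeconds is at least as long as the longest unit, since SIGKILL cannot be caught.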

Related

Airflow tasks failing with SIGTERM when worker pod downscaling

I am running an Airflow cluster on EKS on AWS. I have set up some scaling config for the workers: if CPU/memory usage goes above 70%, Airflow spins up a new worker pod. However, I am facing an issue when these worker pods scale down. When worker pods start scaling down, one of two things happens:
If no task is running on a worker pod, it terminates within 40 seconds.
If any task is running on a worker pod, it terminates in about 8 minutes, and after one more minute I see the task fail in the UI.
I have set the two properties below in the Helm chart for worker pod termination.
celery:
  ## if celery worker Pods are gracefully terminated
  ## - consider defining a `workers.podDisruptionBudget` to prevent there not being
  ##   enough available workers during graceful termination waiting periods
  ##
  ## graceful termination process:
  ##  1. prevent worker accepting new tasks
  ##  2. wait AT MOST `workers.celery.gracefullTerminationPeriod` for tasks to finish
  ##  3. send SIGTERM to worker
  ##  4. wait AT MOST `workers.terminationPeriod` for kill to finish
  ##  5. send SIGKILL to worker
  ##
  gracefullTermination: true
  ## how many seconds to wait for tasks to finish before SIGTERM of the celery worker
  ##
  gracefullTerminationPeriod: 180
## how many seconds to wait after SIGTERM before SIGKILL of the celery worker
## - [WARNING] tasks that are still running during SIGKILL will be orphaned, this is important
##   to understand with KubernetesPodOperator(), as Pods may continue running
##
terminationPeriod: 120
From this, I would expect the worker pod to shut down after 5 minutes (180 s + 120 s), whether or not a task is running, so I am not sure why I see a total of 8 minutes for worker pod termination. My main question: is there any way to configure this so that a worker pod only terminates once the task running on it has finished executing? Tasks in my DAGs can run anywhere from a few minutes to a few hours, so I don't want to set a large value for gracefullTerminationPeriod. I would appreciate any solution for this.
Some more info: generally the long-running task is a Python operator that runs either a Presto SQL query or a Databricks job, via PrestoHook or DatabricksOperator respectively. I don't want these to receive SIGTERM before they finish executing when the worker pod scales down.
This is not possible due to limitations on the Kubernetes side. More details are available here. However, setting a large value for gracefullTerminationPeriod works. It is not what I originally intended, but it works better than I expected: with a large gracefullTerminationPeriod, workers don't wait around for the full period before terminating. If a worker pod is marked for termination, it terminates as soon as the number of tasks running on it reaches zero.
Until Kubernetes accepts the proposed changes and a new community Helm chart is released, I think this is the best solution that avoids the cost of keeping workers up.
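To make the timing concrete, here is a rough Python model of the five shutdown steps listed in the chart comments above. It is a simplification built only from those comments, not the chart's actual implementation, and the function name `termination_timeline` is made up for this sketch.

```python
def termination_timeline(remaining_task_seconds, graceful_period, termination_period):
    """Simplified model of the chart's documented shutdown steps (values in seconds)."""
    # Steps 1-2: the worker stops taking new tasks and waits AT MOST
    # graceful_period for running tasks; it moves on as soon as they finish.
    sigterm_at = min(remaining_task_seconds, graceful_period)
    # Steps 3-5: SIGTERM is sent, then SIGKILL follows AT MOST
    # termination_period later.
    sigkill_deadline = sigterm_at + termination_period
    return {
        "sigterm_at": sigterm_at,
        "sigkill_deadline": sigkill_deadline,
        "task_done_before_sigterm": remaining_task_seconds <= graceful_period,
    }


# Values from the question: a 2-hour task gets SIGTERM after 180 s,
# and SIGKILL no later than 300 s (5 minutes) after shutdown starts.
short_window = termination_timeline(7200, graceful_period=180, termination_period=120)

# A large graceful period (6 hours here, an arbitrary example value):
# a task with 10 minutes left finishes first, and SIGTERM follows at 600 s.
long_window = termination_timeline(600, graceful_period=21600, termination_period=120)
```

Under this model, the large period acts only as an upper bound, which matches the observation that workers exit as soon as their running task count reaches zero.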

How does the scale function in Kubernetes ensure that the current task or request is completed when scaling down?

Will unfinished tasks in the container be terminated when scaling down in Kubernetes?
Yes. To manage it, you have a few options and best practices to follow.
You can handle termination gracefully by reacting to SIGTERM and finishing in-flight work within the termination grace period.
The default grace period is 30 seconds, so if shutdown takes longer than that, make sure you set it explicitly.
Add and manage terminationGracePeriodSeconds in the YAML config.
https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-terminating-with-grace

How to terminate only certain pods based on whether or not they have finished a certain task in Kubernetes?

I'm having trouble finding a solution that allows me to terminate only certain pods in a deployment.
The application running inside the pods does some processing which can take a lot of time to finish.
Let's say I have 10 tasks that are stored in a database and I issue a command to scale the deployment to 10 pods.
Let's say that after some time 3 of the pods have finished their tasks and are no longer required.
How can I scale down the deployment from 10 to 7 while terminating only the pods that have finished their tasks and not the pods that are still processing?
I don't know if more details are needed, but I will happily edit the question if more are needed to answer this kind of problem.
In this case, a Kubernetes Job might be better suited to this kind of task.

Kubernetes Cronjob: Reset missed start times after cluster recovery

I have a cluster that includes a Cronjob scheduled to run every 5 minutes.
We recently experienced an issue that incurred downtime and required manual recovery of the cluster. Although now healthy again, this particular cronjob is failing to run with the following error:
Cannot determine if job needs to be started: Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew.
I understand that the Cronjob has 'missed' a number of scheduled jobs while the cluster was down, and that this has passed a threshold beyond which no further jobs will be scheduled.
How can I reset the number of missed start times and have these jobs scheduled again (without scheduling all the missed jobs to suddenly run?)
Per the kubernetes Cronjob docs, there does not seem to be a way to cleanly resolve this. Setting the .spec.startingDeadlineSeconds value to a large number will re-schedule all missed occurrences that fall within the increased window.
My solution was just to kubectl delete cronjob x-y-z and recreate it, which worked as desired.

Prevent K8S HPA from deleting pod after load is reduced

I have Sidekiq custom metrics coming from the Prometheus adapter. Using those queue metrics from Prometheus, I have set up an HPA. When the number of jobs in the Sidekiq queue goes above, say, 1000, the HPA triggers 10 new pods. Each pod then executes 100 jobs from the queue. When the queue drops to, say, 400 jobs, the HPA scales down. But when the scale-down happens, the HPA kills pods; say 4 pods are killed. Those 4 pods were still running jobs, say 30-50 jobs each. When the HPA deletes these 4 pods, the jobs running on them are terminated too, and those jobs are marked as failed in Sidekiq.
So what I want to achieve is to stop the HPA from deleting pods that are executing jobs. Moreover, I want the HPA not to scale down even after the load is reduced to the minimum, and instead delete pods only when the jobs-in-queue metric in Sidekiq is 0.
Is there any way to achieve this?
Weird usage, honestly: you're wasting resources even when your traffic is in the cool-down phase. But since you didn't provide further details, here it is.
Actually, it's not possible to achieve what you want, since the common behavior is to scale your workload in response to growing load. The only way to achieve this (and it is not recommended) is to set the Kubernetes Controller Manager's horizontal-pod-autoscaler-downscale-stabilization flag to a higher value.
JFI, the doc warns you:
Note: When tuning these parameter values, a cluster operator should be aware of the possible consequences. If the delay (cooldown) value is set too long, there could be complaints that the Horizontal Pod Autoscaler is not responsive to workload changes. However, if the delay value is set too short, the scale of the replicas set may keep thrashing as usual.
As per the discussion and the work done by #Hb_1993, it can be done with a pre-stop hook that delays the eviction, where the delay is based on the operation time or on some logic that determines whether processing is done.
A pre-stop hook is a lifecycle hook that is invoked before a pod is evicted. We can attach to this event and perform some logic, such as a ping check, to make sure that our pod has finished processing the current request.
PS: take this solution with a pinch of salt, as it might not work in all cases or might produce unintended results.
To do this, we introduce a sleep in the preStop hook that delays the shutdown sequence.
More details can be found in this article.
https://blog.gruntwork.io/delaying-shutdown-to-wait-for-pod-deletion-propagation-445f779a8304
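As an alternative to a fixed sleep, the "some logic to know if processing is done" idea can be sketched as a small polling helper that a preStop exec hook could run. This is only an illustration under assumptions: the function names and the `/tmp/worker-busy` marker file are hypothetical conventions, and the application itself would have to create and remove that file around each job.

```python
import os
import time


def wait_until_idle(is_busy, timeout_seconds, poll_seconds=1.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Block until is_busy() returns False or the timeout elapses.

    Returns True if the worker became idle, False if the timeout was hit.
    """
    deadline = clock() + timeout_seconds
    while is_busy():
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
    return True


def busy_marker_exists(path="/tmp/worker-busy"):
    # Hypothetical convention: the app creates this file while a job is
    # running and deletes it when the job completes.
    return os.path.exists(path)
```

A hook script would call something like `wait_until_idle(busy_marker_exists, timeout_seconds=...)` and then exit. The timeout should stay below the pod's terminationGracePeriodSeconds, because the grace period also covers the time the preStop hook spends running.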