Prevent K8S HPA from deleting pod after load is reduced - kubernetes

I have Sidekiq custom metrics coming from the Prometheus adapter, and I have set up an HPA using those queue metrics. When the number of jobs in the Sidekiq queue goes above, say, 1000, the HPA triggers 10 new pods, and each pod executes 100 jobs from the queue. When the jobs are reduced to, say, 400, the HPA scales down. But when the scale-down happens, the HPA kills pods, say 4 of them, and those 4 pods were still running jobs (each around 30-50 jobs). When the HPA deletes these 4 pods, the jobs running on them are terminated as well and are marked as failed in Sidekiq.
What I want to achieve is to stop the HPA from deleting pods that are still executing jobs. Moreover, I want the HPA not to scale down even after the load is reduced to the minimum, and instead delete pods only when the number of jobs in the Sidekiq queue metric is 0.
Is there any way to achieve this?

Weird usage, honestly: you're wasting resources even when your traffic is in the cool-down phase, but since you didn't provide further details, here it is.
Actually, it's not possible to achieve exactly what you desire, since the common behavior is to support a growing load against your workload. The only way to get close to this (and it is not recommended) is to change the kube-controller-manager's horizontal-pod-autoscaler-downscale-stabilization flag to a higher value.
JFI, the doc warns you:
Note: When tuning these parameter values, a cluster operator should be aware of the possible consequences. If the delay (cooldown) value is set too long, there could be complaints that the Horizontal Pod Autoscaler is not responsive to workload changes. However, if the delay value is set too short, the scale of the replicas set may keep thrashing as usual.
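If you would rather not touch controller-manager flags, roughly the same effect can be configured per HPA via the behavior field of the autoscaling/v2 API. A minimal sketch follows; the metric name (sidekiq_queue_size), the target values, and the resource names are placeholders, not taken from the question, and note that this only slows scale-down, it does not know which pods are busy:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sidekiq-worker              # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sidekiq-worker
  minReplicas: 1
  maxReplicas: 10
  behavior:
    scaleDown:
      # keep the highest recommendation seen over the last 30 minutes
      stabilizationWindowSeconds: 1800
      policies:
      - type: Pods
        value: 1                    # remove at most one pod...
        periodSeconds: 300          # ...every five minutes
  metrics:
  - type: External
    external:
      metric:
        name: sidekiq_queue_size    # placeholder metric exposed by the Prometheus adapter
      target:
        type: Value
        value: "1000"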

As per the discussion and the work done by #Hb_1993, it can be done with a pre-stop hook to delay the eviction, where the delay is based on the operation time or some logic that knows whether the processing is done or not.
A pre-stop hook is a lifecycle hook which is invoked before a pod is evicted; we can attach to this event and perform some logic, like a ping check, to make sure that our pod has completed the processing of its current requests.
PS- Use this solution with a pinch of salt as this might not work in all the cases or produce unintended results.
To do this, we introduce a sleep in the preStop hook that delays the shutdown sequence.
More details can be found in this article.
https://blog.gruntwork.io/delaying-shutdown-to-wait-for-pod-deletion-propagation-445f779a8304
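As a rough sketch of that idea (the sleep duration, container name, and image are illustrative assumptions, not prescriptions), the preStop hook on the worker container could look like this:

spec:
  # must be longer than the preStop sleep plus the worker's own shutdown time
  terminationGracePeriodSeconds: 180
  containers:
  - name: sidekiq                          # hypothetical container name
    image: my-registry/sidekiq:latest      # hypothetical image
    lifecycle:
      preStop:
        exec:
          # delay the SIGTERM so in-flight jobs get a chance to finish
          command: ["/bin/sh", "-c", "sleep 120"]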

Related

How do we ensure scale in protection for kubernetes pods?

While scaling-in, HPA shouldn't terminate a pod that has a job running on it.
This is taken care of by AWS autoscaling groups in the form of scale-in protection for instances. Is there something similar in kubernetes?
You use terminationGracePeriodSeconds to make your worker process wait until it is done. It will get a SIGTERM, then has that many seconds to finish (the default is 30, but you can make it anything; some of my workers have it set to 12 hours), then a SIGKILL if it hasn't exited. So stop accepting new work units on SIGTERM, set the threshold to the length of your longest work unit, and no worries :)
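For illustration, a minimal Deployment snippet applying this (names and the image are placeholders; the 12-hour figure mirrors the answer above):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: queue-worker                       # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: queue-worker
  template:
    metadata:
      labels:
        app: queue-worker
    spec:
      # window between SIGTERM and SIGKILL; size it to the longest work unit
      terminationGracePeriodSeconds: 43200 # 12 hours
      containers:
      - name: worker
        image: my-registry/worker:latest   # placeholder image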

How to terminate only certain pods based on whether or not they have finished a certain task in kubernetes?

I'm having trouble with finding a solution that allows to terminate only certain pods in a deployment.
The application running inside the pods does some processing which can take a lot of time to finish.
Let's say I have 10 tasks that are stored in a database and I issue a command to scale the deployment to 10 pods.
Let's say that after some time 3 of the pods have finished their tasks and are no longer required.
How can I scale down the deployment from 10 to 7 while terminating only the pods that have finished their tasks and not the pods that are still processing them?
I don't know if more details are needed, but I will happily edit the question if anything else is required to answer this kind of problem.
In this case a Kubernetes Job might be better suited for this kind of task.
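A sketch of that approach, assuming each pod picks one task from the database and exits when it is done (the name and image are hypothetical):

apiVersion: batch/v1
kind: Job
metadata:
  name: task-processor                # hypothetical name
spec:
  parallelism: 10                     # up to 10 pods run at once, one per task
  completions: 10                     # the Job is complete after 10 successful pods
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: processor
        image: my-registry/task-processor:latest   # hypothetical image

A pod that finishes its task simply exits successfully and is not replaced, so there is no need to decide which pods a scale-down should remove.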

Delay in Kubernetes Job status update when running many jobs in parallel

I have a bit of a unique use-case where I want to run a large number (thousands to tens of thousands) of Kubernetes Jobs at once. Each job consists of a single container, Parallelism 1 and Completions 1, with no side-car or agent. My cluster has plenty of capacity for the resources I'm requesting.
My problem is that the Job status is not transitioning to Complete for a significant period of time when I run many jobs concurrently.
My application submits Jobs and has a watcher on the namespace - as soon as a Job's status transitions to 'succeeded 1', we delete the Job and send information back to the application. The application needs this to happen as soon as possible in order to define and submit subsequent Jobs.
I'm able to submit new Job requests as fast as I want, and Pod scheduling happens without delay, but beyond about one or two hundred concurrent Jobs I get significant delay between a Job's Pod completing and the Job's status updating to Complete. At only around 1,000 jobs in the cluster, it can easily take 5-10 minutes for a Job status to update.
This tells me there is some process in the Kubernetes Control Plane that needs more resources to process Pod completion events more rapidly, or a configuration option that enables it to process more tasks in parallel. However, my system monitoring tools have not yet been able to identify any Control Plane services that are maxing out their available resources while the cluster processes the backlog, and all other operations on the cluster appear to be normal.
My question is - where should I look for system resource or configuration bottlenecks? I don't know enough about Kubernetes to know exactly what components are responsible for updating a Job's status.
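For what it's worth, one configuration option of the kind described above is the kube-controller-manager flag --concurrent-job-syncs, which bounds how many Job objects the Job controller reconciles in parallel (the default is 5). Whether it is the actual bottleneck here isn't established; the following is only a sketch of where it would be set, assuming a kubeadm-style self-managed control plane with a static pod manifest:

# /etc/kubernetes/manifests/kube-controller-manager.yaml (excerpt)
spec:
  containers:
  - name: kube-controller-manager
    command:
    - kube-controller-manager
    - --concurrent-job-syncs=50     # default is 5; higher values sync more Jobs in parallel at extra CPU cost
    # ...the rest of the existing flags stay unchanged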

Complete parallel Kubernetes job when one worker pod succeeds

I have a simple containerised python script which I am trying to parallelise with Kubernetes. This script guesses hashes until it finds a hashed value below a certain threshold.
I am only interested in the first such value, so I wish to create a Kubernetes job that spawns n worker pods and completes as soon as one worker pod finds a suitable value.
By default, Kubernetes jobs wait until all worker pods complete before marking the job as complete. I have so far been unable to find a way around this (no mention of this job pattern in the documentation), and have been relying on checking the logs of bare pods via a bash script to determine whether one has completed.
Is there a native means to achieve this? And, if not, what would be the best approach?
Hi, have a look at this link: https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#parallel-jobs.
I've never tried it but it seems possible to launch several pods and configure the end of the job when x pods have finished. In your case x is 1.
We can define two specifications for parallel Jobs:
1. Parallel Jobs with a fixed completion count:
   - specify a non-zero positive value for .spec.completions.
   - the Job represents the overall task, and is complete when there is one successful Pod for each value in the range 1 to .spec.completions.
   - not implemented yet: each Pod is passed a different index in the range 1 to .spec.completions.
2. Parallel Jobs with a work queue:
   - do not specify .spec.completions; it defaults to .spec.parallelism.
   - the Pods must coordinate amongst themselves or with an external service to determine what each should work on. For example, a Pod might fetch a batch of up to N items from the work queue.
   - each Pod is independently capable of determining whether or not all its peers are done, and thus that the entire Job is done.
   - when any Pod from the Job terminates with success, no new Pods are created.
   - once at least one Pod has terminated with success and all Pods are terminated, then the Job is completed with success.
   - once any Pod has exited with success, no other Pod should still be doing any work for this task or writing any output. They should all be in the process of exiting.
For a fixed completion count Job, you should set .spec.completions to the number of completions needed. You can set .spec.parallelism, or leave it unset and it will default to 1.
For a work queue Job, you must leave .spec.completions unset, and set .spec.parallelism to a non-negative integer.
For more information about how to make use of the different types of job, see the job patterns section.
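As an illustrative sketch of the work-queue variant for the hash-guessing case (the pod count, name, and image are assumptions, not from the question), you would leave .spec.completions unset and only set .spec.parallelism:

apiVersion: batch/v1
kind: Job
metadata:
  name: hash-search                  # hypothetical name
spec:
  parallelism: 8                     # eight workers guess hashes concurrently
  # .spec.completions is deliberately unset: this is the work-queue pattern
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: guesser
        image: my-registry/hash-guesser:latest   # hypothetical image

Keep in mind that the workers themselves still need some way to notice that a peer has already succeeded and exit (for example via a shared flag or queue message), because the Job is only marked complete once at least one Pod has succeeded and all Pods have terminated.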
You can also take a look on single job which starts controller pod:
This pattern is for a single Job to create a Pod which then creates other Pods, acting as a sort of custom controller for those Pods. This allows the most flexibility, but may be somewhat complicated to get started with and offers less integration with Kubernetes.
One example of this pattern would be a Job which starts a Pod which runs a script that in turn starts a Spark master controller (see spark example), runs a spark driver, and then cleans up.
An advantage of this approach is that the overall process gets the completion guarantee of a Job object, but complete control over what Pods are created and how work is assigned to them.
At the same time, take into consideration how the completion status of a Job is set by default: it is applied only when the specified number of successful completions is reached, which ensures that all tasks were processed properly. Applying this status before all tasks are finished is not a safe solution.
You should also know that finished Jobs are usually no longer needed in the system. Keeping them around in the system will put pressure on the API server. If the Jobs are managed directly by a higher level controller, such as CronJobs, the Jobs can be cleaned up by CronJobs based on the specified capacity-based cleanup policy.
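For standalone Jobs there is also the ttlSecondsAfterFinished field, handled by the TTL-after-finished controller, which removes finished Jobs automatically; a small sketch with an arbitrary TTL and placeholder names:

apiVersion: batch/v1
kind: Job
metadata:
  name: short-lived-job              # placeholder name
spec:
  ttlSecondsAfterFinished: 600       # the Job and its Pods are deleted 10 minutes after it finishes
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox
        command: ["sh", "-c", "echo done"]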
Here is the official documentation: jobs-parallel-processing, parallel-jobs.
Useful blog: article-parallel job.
EDIT:
Another option is to create a special script which will continuously check for the values you are looking for. Using a Job would then not be necessary; you could simply use a Deployment.

Scaling Down Replicas On Completion Without Termination

I'm using Kubernetes replicas to run 'run-to-completion' tasks.
I currently increase the number of replicas when we have a work item to complete on a queue, and a container immediately consumes the item. Since the desired replica count hasn't changed, once a container finishes Kubernetes kicks off a new replica, which then finds that there's nothing to do and so terminates immediately (and repeat). If I scale down the replicas before a container has finished (i.e. as soon as a queue item is consumed), one of the replicas that is doing work terminates prematurely.
Is there a way to reduce the replicas and not enforce termination?
You can add a termination grace period for all pods, so that when a container terminates it waits for some time.
You can also handle SIGTERM in the lifecycle so that the pod does not take the next item from the queue.
lifecycle:
  preStop:
    exec:
      # SIGTERM triggers a quick exit; gracefully terminate instead
      command: ["/usr/sbin/nginx", "-s", "quit"]
You can also have a look at this: https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-terminating-with-grace
If you find yourself in the same situation as me, have a look at Argo: https://argoproj.github.io/argo/