I'm using Kubernetes replicas to run 'run-to-completion' tasks.
I currently increase the number of replicas when we have a work item to complete on a queue, and the container immediately consumes the item. Since the desired replica count never changes, Kubernetes kicks off a new replica, which then finds that there's nothing to do and so terminates immediately (and repeat). If I scale down the replicas before a container has finished (i.e. as soon as a queue item is consumed), one of the replicas that is doing work terminates prematurely.
Is there a way to reduce the replicas and not enforce termination?
You can add a termination grace period for all pods, so that when a container is asked to terminate it is given some time to finish.
You can also add a preStop hook in the lifecycle, so that that particular pod does not take the next item from the queue:
lifecycle:
  preStop:
    exec:
      # SIGTERM triggers a quick exit; gracefully terminate instead
      command: ["/usr/sbin/nginx", "-s", "quit"]
You can also have a look at this: https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-terminating-with-grace
If you find yourself in the same situation as me, have a look at Argo: https://argoproj.github.io/argo/
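For reference, a run-to-completion task in Argo is expressed as a Workflow; here is a minimal sketch (the image and command are placeholders):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: process-queue-item-   # Argo appends a random suffix
spec:
  entrypoint: process
  templates:
    - name: process
      container:
        image: my-worker:latest       # placeholder image
        command: ["process-one-item"] # placeholder command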
I'm having a problem where a job runs out of memory, and K8s is continually trying to run it again, despite it having no chance of succeeding, since it's going to use the same amount of memory every time. I want it to simply let the job fail and sit there, and I'll take care of creating a new one with a higher memory limit, if desired, and/or deleting the existing failed job.
I have
restartPolicy: Never
backoffLimit: 0
From the not-so-clear things I've read, setting backoffLimit to 1 might do the trick. But is that true? Would that make it restart once, or is the 1 the number of times it can be run, including the first attempt?
Should I switch from jobs to pods? The main issue with that is that I don't think K8s will restart the pod on another worker node should the one it's running on go down, and that's a situation where I'd want the job to be restarted automatically on another node.
backoffLimit should be 1 as shown below
backoffLimit: 1
Setting backoffLimit to 0 is correct, if the Job is supposed to run once and not be restarted:
backoffLimit: Specifies the number of retries before marking this job failed.
Switching your workload to a Pod would make sense as long as you are not interested in restarts in combination with backoff limits.
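Putting the two settings together, a rough sketch of such a Job (the name and image are placeholders):

apiVersion: batch/v1
kind: Job
metadata:
  name: one-shot-job               # example name
spec:
  backoffLimit: 0                  # no retries after the first failure
  template:
    spec:
      restartPolicy: Never         # don't restart the container in place either
      containers:
        - name: worker
          image: my-worker:latest  # placeholder image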
I am looking for a solution where the running process inside the pod completes before the pod is deleted. Also, the Service should stop sending new requests to the terminating pod. But I am kind of confused between
terminationGracePeriodSeconds
and
lifecycle:
  preStop:
    exec:
      command:
        - "sleep"
        - "60"
Termination grace period and preStop hook are two different things. The grace period is the time the kubelet gives you to shut down gracefully (by handling TERM signals). The preStop hook can be used to run a command before the pod is shut down. If the preStop hook exceeds the grace period, it's given another 2 seconds to terminate.
That said, I wouldn't rely too much on graceful shutdowns. Pods live in a cold and hostile environment and can be killed at any time; you should be prepared for that.
More info on pod termination: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination
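To illustrate how the two interact, here is a sketch of a Pod spec using both (all values are just examples); the preStop hook runs first and its duration counts against the grace period:

apiVersion: v1
kind: Pod
metadata:
  name: graceful-worker               # example name
spec:
  terminationGracePeriodSeconds: 90   # total budget for the preStop hook plus TERM handling
  containers:
    - name: app
      image: my-app:latest            # placeholder image
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "60"]  # runs to completion before SIGTERM reaches the container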
I couldn't find anything in the official kubernetes docs about this. What's the actual low-level process for replacing a long-running cron job? I'd like to understand this so my application can handle it properly.
Is it a clean SIGHUP/SIGTERM signal that gets sent to the app that's running?
Is there a waiting period after that signal gets sent, so the app has time to cleanup/shutdown before it potentially gets killed? If so, what's that timeout in seconds? Or does it wait forever?
For reference, here's the Replace policy explanation in the docs:
https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/
Concurrency Policy
Replace: If it is time for a new job run and the previous job run hasn’t finished yet, the cron job replaces the currently running job run with a new job run
A CronJob ultimately just runs another Pod underneath.
When a CronJob with a concurrency policy of "Replace" is still active, the Job will be deleted, which also deletes the Pod.
When a Pod is deleted the Linux container/s will be sent a SIGTERM and then a SIGKILL after a grace period, defaulted to 30 seconds. The terminationGracePeriodSeconds property in a PodSpec can be set to override that default.
Due to the flag added to the DeleteJob call, it sounds like this delete only removes values from the kube key/value store, which would mean the new Job/Pod could be created while the current Job/Pod is still terminating. You could confirm this with a Job that doesn't respect SIGTERM and has terminationGracePeriodSeconds set to a few times your cluster's scheduling speed.
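For reference, a sketch of where those settings live in a CronJob (the schedule, image and the 120-second value are only examples; older clusters use apiVersion batch/v1beta1):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: long-running-cron                 # example name
spec:
  schedule: "*/10 * * * *"                # example schedule
  concurrencyPolicy: Replace              # delete the running Job (and its Pod) when a new run is due
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          terminationGracePeriodSeconds: 120  # time between SIGTERM and SIGKILL
          containers:
            - name: task
              image: my-task:latest           # placeholder image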
I have Sidekiq custom metrics coming from the Prometheus adapter. Using those queue metrics from Prometheus, I have set up an HPA. When the number of jobs in the Sidekiq queue goes above, say, 1000, the HPA triggers 10 new pods. Then each pod executes 100 jobs from the queue. When the jobs are reduced to, say, 400, the HPA scales down. But when the scale-down happens, the HPA kills pods, say 4 of them. Those 4 pods were still running jobs, say 30-50 jobs each. When the HPA deletes these 4 pods, the jobs running on them are also terminated, and those jobs are marked as failed in Sidekiq.
So what I want to achieve is to stop the HPA from deleting pods which are executing jobs. Moreover, I want the HPA to not scale down even after load is reduced to the minimum, and instead delete pods only when the jobs-in-queue Sidekiq metric is 0.
Is there any way to achieve this?
Weird usage, honestly: you're wasting resources even when your traffic is in the cool-down phase, but since you didn't provide further details, here it is.
Actually, it's not possible to achieve exactly what you desire, since the common behavior is to support a growing load against your workload. The only way to approximate it (and this is not recommended) is to change the Kubernetes Controller Manager's horizontal-pod-autoscaler-downscale-stabilization flag to a higher value.
JFI, the doc warns you:
Note: When tuning these parameter values, a cluster operator should be aware of the possible consequences. If the delay (cooldown) value is set too long, there could be complaints that the Horizontal Pod Autoscaler is not responsive to workload changes. However, if the delay value is set too short, the scale of the replicas set may keep thrashing as usual.
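For reference, that flag is passed to the kube-controller-manager; a sketch with an example value of 30 minutes:

kube-controller-manager --horizontal-pod-autoscaler-downscale-stabilization=30m0s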
As per the discussion and the work done by #Hb_1993, it can be done with a pre-stop hook to delay the eviction, where the delay is based on the operation time or on some logic that knows whether the processing is done or not.
A pre-stop hook is a lifecycle method which is invoked before a pod is evicted; we can attach to this event and perform some logic, such as a ping check to make sure that our pod has completed the processing of the current request.
PS: Use this solution with a pinch of salt, as it might not work in all cases or might produce unintended results.
To do this, we introduce a sleep in the preStop hook that delays the shutdown sequence.
More details can be found in this article.
https://blog.gruntwork.io/delaying-shutdown-to-wait-for-pod-deletion-propagation-445f779a8304
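A sketch of what such a preStop hook could look like; the /tmp/busy marker file is entirely hypothetical and would have to be maintained by your worker process, and the wait must stay within terminationGracePeriodSeconds:

lifecycle:
  preStop:
    exec:
      # Hypothetical: wait (up to ~60s) until the worker has cleared its "busy" marker
      command: ["sh", "-c", "for i in $(seq 1 60); do [ ! -f /tmp/busy ] && exit 0; sleep 1; done"]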
I have a situation where each message in a message queue has to be processed by a separate instance (one pod can process one message at a time). Many messages can be processed at once, but there is a limit of parallel executions. Once it's reached, no new messages are being pulled from the queue. Message processing takes about 30 minutes. No state needs to be stored on the pods between calls (all data is read from a database when pod starts processing a message). A new message should spawn a new pod, once the processing finishes, the pod should die.
Should I use Deployments, ReplicaSets, StatefulSets, Services? (we use Kubernetes with Azure) I guess the main question is which of these fits this workload.
I've tried ReplicaSets, but in a situation when three messages are being processed and one finishes, scaling down a ReplicaSet can kill a working pod, which is definitely not what I need.
I would say that since you do not need to handle state, you can rule out StatefulSets. On the other hand, a Deployment is a higher-level concept built on ReplicaSets, so you should use a Deployment rather than managing ReplicaSets directly. Lastly, as your processing is on demand, I would consider using Jobs: once a Job completes its task, it frees its resources and dies. This would require extra code, a helper that creates the Jobs, but could be very handy; see the sketch below.
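If you go the Jobs route, each message would translate into a small Job created by your helper; here is a sketch (the names, image and argument are placeholders):

apiVersion: batch/v1
kind: Job
metadata:
  generateName: process-message-       # the helper generates a unique name per message
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: processor
          image: my-processor:latest       # placeholder image
          args: ["--message-id", "1234"]   # hypothetical argument passed in by the helper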