I need to scale a set of pods that run queue-based workers. Jobs for workers can run for a long time (hours) and should not get interrupted. The number of pods is based on the length of the worker queue. Scaling would be done either with the Horizontal Pod Autoscaler using custom metrics, or with a simple controller that changes the number of replicas.
The problem with either solution is that, when scaling down, there is no control over which pod(s) get terminated. At any given time, most workers are likely working on short-running jobs or idle, with a few (more rarely) processing a long-running job. I'd like to avoid killing the long-running-job workers; idle or short-running-job workers can be terminated without issue.
What would be a way to do this with low complexity? One thing I can think of is to do this based on CPU usage of the pods. Not ideal, but it could be good enough. Another method could be that workers somehow expose a priority indicating whether they are the preferred pod to be deleted. This priority could change every time a worker picks up a new job though.
Eventually all jobs will be short running and this problem will go away, but that is a longer term goal for now.
Since version 1.22 there is a beta feature that helps you do that. You can add the annotation controller.kubernetes.io/pod-deletion-cost with a value in the range [-2147483647, 2147483647], and pods with a lower value will be killed first. The default is 0, so setting a negative value on one pod will make that pod the first to be killed during downscaling, e.g.
kubectl annotate pods my-pod-12345678-abcde controller.kubernetes.io/pod-deletion-cost=-1000
Link to discussion about the implementation of this feature: Scale down a deployment by removing specific pods (PodDeletionCost) #2255
Link to the documentation: ReplicaSet / Pod deletion cost
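Since the question mentions that workers could expose a priority themselves, one option is to have each worker update the annotation on its own pod whenever it picks up or finishes a long-running job. A minimal sketch, assuming the pod name is injected via the downward API as $POD_NAME and the worker's ServiceAccount is allowed to patch its own pod (both assumptions, not part of the feature itself):
# mark this pod as expensive to delete when a long-running job starts
kubectl annotate pod "$POD_NAME" controller.kubernetes.io/pod-deletion-cost=1000 --overwrite
# mark it as cheap to delete again once the job finishes
kubectl annotate pod "$POD_NAME" controller.kubernetes.io/pod-deletion-cost=0 --overwrite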
During termination of a pod, Kubernetes sends a SIGTERM signal to the containers of your pod. You can use that signal to gracefully shut down your app. The problem is that Kubernetes does not wait forever for your application to finish, and in your case your app may take a long time to exit.
In this case I recommend you use a preStop hook, which runs before Kubernetes sends the TERM signal and must finish within the grace period, after which the container is killed. There is an example here of how to use handlers:
apiVersion: v1
kind: Pod
metadata:
  name: lifecycle-demo
spec:
  containers:
  - name: lifecycle-demo-container
    image: nginx
    lifecycle:
      postStart:
        exec:
          command: ["/bin/sh", "-c", "echo Hello from the postStart handler > /usr/share/message"]
      preStop:
        exec:
          command: ["/bin/sh", "-c", "nginx -s quit; while killall -0 nginx; do sleep 1; done"]
There is a kind of workaround that can give you some control over pod termination.
I'm not sure it is best practice, but you can at least try it and test whether it suits your app.
Increase the Deployment grace period with terminationGracePeriodSeconds: 3600, where 3600 is the time in seconds of the longest possible task in the app (see the snippet below for where this field goes). This ensures that pods are not force-killed before the end of the grace period, i.e. before the longest task has had time to finish. Read the docs about the pod termination process in detail.
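A minimal sketch of where this field sits in the Deployment's pod template (the 3600 value is just the example from the step above, the container name and image are placeholders):
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 3600   # longest expected task duration, in seconds
      containers:
      - name: worker                         # placeholder container
        image: my-worker-image               # placeholder image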
Define a preStop handler. More details about lifecycle hooks can be found in the docs as well as in the example above. In my case, I've used the script below to create a file which is later used as a trigger to terminate the pod (there are probably more elegant solutions).
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "touch /home/node/app/preStop"]
Stop the app as soon as the condition is met. When the app exits, the pod terminates as well. It is not possible to end the PID 1 process from the preStop shell script, so you need to add some logic to the app to terminate itself. In my case it is a NodeJS app with a scheduler that runs every 30 seconds and checks whether two conditions are met: !isNodeBusy, which indicates that the app is allowed to finish, and fs.existsSync('/home/node/app/preStop'), which indicates that the preStop hook was triggered. The logic might be different for your app, but you get the basic idea.
const fs = require('fs');
const schedule = require('node-schedule'); // assuming the node-schedule package for the cron-style scheduler

// every 30 seconds: exit once the app is idle and the preStop hook has fired
schedule.scheduleJob('*/30 * * * * *', () => {
  if (!isNodeBusy && fs.existsSync('/home/node/app/preStop')) { // isNodeBusy is application state
    process.exit();
  }
});
Keep in mind that this workaround only works for voluntary disruptions and obviously does not help with involuntary disruptions. More info in the docs.
I think running this type of workload using a Deployment or similar, and using a HorizontalPodAutoscaler for scaling, is the wrong way to go. One way you could go about this is to:
Define a controller (this could perhaps be a Deployment) whose task is to periodically create a Kubernetes Job object.
The spec of the Job should contain a value for .spec.parallelism equal to the maximum number of concurrent executions you will accept.
The Pods spawned by the Job then run your processing logic. They should each pull a message from the queue, process it, and then delete it from the queue (in the case of success).
The Pods spawned by the Job must exit with the correct status (success or failure). This ensures that the Job recognises when the processing has completed, and so will not spin up additional Pods.
Using this method, .spec.parallelism controls the autoscaling based on how much work there is to be done, and scale-down is an automatic benefit of using a Job. A sketch of what such a Job could look like is shown below.
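A minimal sketch under stated assumptions: the image name is hypothetical, and the worker container is assumed to exit successfully once its share of the queue is processed, as described above.
apiVersion: batch/v1
kind: Job
metadata:
  name: queue-workers
spec:
  parallelism: 5   # maximum number of concurrent worker pods
  # completions is left unset (work-queue pattern): the Job completes once any pod succeeds and all pods have exited
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: worker
        image: registry.example.com/queue-worker:latest   # hypothetical image; pulls, processes, deletes queue messages, then exits 0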
You are looking for Pod Priority and Preemption. By configuring a high priority PriorityClass for your pods you can ensure that they won't be removed to make space for other pods with a lower priority.
Create a new PriorityClass
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: high-priority
value: 1000000
globalDefault: false
description: "This priority class will not cause other pods to be preempted."
Set your new PriorityClass in your pods
priorityClassName: high-priority
The value: 1000000 in the PriorityClass configures the scheduling priority of the pod. The higher the value, the more important the pod is.
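For illustration, the field goes into the pod spec (or pod template spec); the pod name, container name and image below are placeholders:
apiVersion: v1
kind: Pod
metadata:
  name: worker-pod
spec:
  priorityClassName: high-priority   # references the PriorityClass defined above
  containers:
  - name: worker
    image: registry.example.com/queue-worker:latest   # placeholder image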
For those who land on this page facing the issue of pods getting killed while a node is scaled down:
This is expected behaviour of the Cluster Autoscaler, as the CA tries to pack pods so that the cluster can run at its minimum size.
However, you can protect your Job pods from eviction (getting killed) by creating a PodDisruptionBudget with maxUnavailable=0 for them.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: sample-pdb
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: <your_app_name>
Related
I have an application that I want to scale in parallel but would love a way to have a brief pause between each new pod getting created.
I need this because during pod creation I update a file, and if the file isn't closed by the time the next pod tries to open it and update it, it will fail.
So I just need a 1-second buffer between pods opening and closing this file.
Right now, if a scale-up happens and more than one new pod is added, they hit the file at the same time. One pod will work, the other will fail, and I have to wait for the failed pod to time out so k8s kills and recreates it, at which point it works. That wastes a lot of valuable time during scaling events, when I need new pods as quickly as possible to handle the load.
Not sure if I'm not wording my searches correctly, but not able to find this on the k8s website. Any ideas would be helpful.
(Additional note: I know a StatefulSet will scale one pod at a time by default. However, that method requires all pods to be healthy before adding the next one, and if any pod becomes unhealthy it stops scaling until all pods are healthy again, which wouldn't be ideal during high-load situations.)
You will need to do some work in order to achieve this.
A few "hacks/solutions":
1. Init Container
Add an init container to your deployment.
The init container creates/deletes a lock file such as lock.txt in a location shared between pods.
On startup, the init container checks whether this file exists; if so, it waits until the file is removed.
Once the file is removed, the init container exits and the pod "starts". A rough sketch is shown below.
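A rough sketch of what that init container could look like. The busybox image, the shared volume, and the mount path are assumptions for illustration, since the answer does not specify where the lock file lives (it would need to be on storage shared between the pods, e.g. a ReadWriteMany volume):
initContainers:
- name: wait-for-lock
  image: busybox:1.36          # hypothetical image choice
  command: ["/bin/sh", "-c"]
  args:
  - |
    # wait while another pod holds the lock, then take it for this pod
    while [ -f /shared/lock.txt ]; do sleep 1; done
    touch /shared/lock.txt
  volumeMounts:
  - name: shared               # volume shared between pods (assumption)
    mountPath: /shared
The application (or a preStop/postStart hook) would then be responsible for removing /shared/lock.txt once it has finished updating the file.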
2. postStart hook
Add a postStart lifecycle hook to your pods which reads the current number of desired pods vs. the actual ones and, if required, executes a scale.
apiVersion: v1
kind: Pod
metadata:
  name: lifecycle-demo
spec:
  containers:
  - name: ...
    image: ...
    lifecycle:
      postStart:
        exec:
          command: [<get the current number of pods from the replica>,
                    <check to see if you need to restart a new pod>,
                    <scale a new pod if required>]
3. CronJob
Scale your pods with a CronJob, using a file or ConfigMap to set the desired number of pods.
Apply a CronJob that scales your pods, and once you have reached the desired number of pods, remove the CronJob. A rough sketch is shown below.
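A rough sketch of such a CronJob. The kubectl image, ServiceAccount, target Deployment name, and ConfigMap are all assumptions for illustration; the ServiceAccount would need RBAC permission to scale the Deployment:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: slow-scaler
spec:
  schedule: "* * * * *"                # every minute
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: scaler   # assumed SA allowed to scale the Deployment
          restartPolicy: OnFailure
          containers:
          - name: scale
            image: bitnami/kubectl:latest   # hypothetical image choice
            command: ["/bin/sh", "-c"]
            args:
            - kubectl scale deployment my-app --replicas="$(cat /config/desired-replicas)"
            volumeMounts:
            - name: scaler-config
              mountPath: /config
          volumes:
          - name: scaler-config
            configMap:
              name: scaler-config      # ConfigMap holding a desired-replicas key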
Before I start all the services I need a pod that does some initialization. But I don't want this pod to keep running after the init is done, and all the other pods/services should start only after this init is done. I am aware of init containers in a pod, but I don't think that would solve this issue, as I want the pod to exit after initialization.
You are recommended to let Kubernetes handle pods automatically instead of managing them manually yourself, when you can. Consider a Job for run-once tasks like this:
apiVersion: batch/v1
kind: Job
metadata:
  name: myjob
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name:
        image:
$ kubectl apply -f job.yml
Kubernetes will create a pod to run that job. The pod completes as soon as the job's process exits. Note that the completed job and its pod are released from consuming resources, but not removed completely, to give you a chance to examine their status and logs. Deleting the job removes the pod completely.
Jobs can do advanced things like restarting on failure with exponential back-off, running tasks in parallel, and limiting how long they are allowed to run.
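Those behaviours map directly to fields in the Job spec; a small illustration (the values here are arbitrary examples):
spec:
  backoffLimit: 4               # retries before the Job is marked failed (retried with exponential back-off)
  parallelism: 3                # how many pods may run at the same time
  activeDeadlineSeconds: 600    # hard limit on how long the Job may keep running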
It depends on the init task, but an init container is the best option you have (https://kubernetes.io/docs/concepts/workloads/pods/init-containers/).
Kubernetes will run the initContainer before the other containers, and once it has accomplished its task, it will exit.
Make the initContainer exit gracefully (with code 0) so that k8s understands that the container completed the task it was meant to do and does not try to restart it.
Your init task will be done, and the container will no longer exist.
You can try attaching handlers to Container lifecycle events. Kubernetes supports the postStart and preStop events. Kubernetes sends the postStart event immediately after a Container is started, and it sends the preStop event immediately before the Container is terminated.
https://kubernetes.io/docs/tasks/configure-pod-container/attach-handler-lifecycle-event/
https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/
When working locally with minikube, I noticed that each time I rebuild my app and roll it out with:
kubectl rollout restart deployment/api-local -n influx
the container takes 30 seconds to terminate. This is not a problem for production, but in local dev I end up losing a lot of time waiting, as I may rebuild my app 100 times a day (= losing 50 minutes a day).
Is there some trick to shorten this termination time? I understand this would not be part of k8s best practices, but it would make sense to me.
Set the terminationGracePeriodSeconds in the deployment/pod configuration. Per the reference docs, it is:
...Optional duration in seconds the pod needs to terminate gracefully. May be decreased in delete request. Value must be non-negative integer. The value zero indicates delete immediately. If this value is nil, the default grace period will be used instead. The grace period is the duration in seconds after the processes running in the pod are sent a termination signal and the time when the processes are forcibly halted with a kill signal. Set this value longer than the expected cleanup time for your process. Defaults to 30 seconds.
Example:
spec:
  replicas: 1
  selector:
    ...
  template:
    spec:
      terminationGracePeriodSeconds: 0
      containers:
      ...
Also, from here:
The kubectl delete command supports the --grace-period=<seconds> option which allows a user to override the default and specify their own value. The value 0 force deletes the Pod. You must specify an additional flag --force along with --grace-period=0 in order to perform force deletions.
Example (specify the namespace, if needed):
kubectl delete pod <pod-name> --now
...or
kubectl delete pod <pod-name> --grace-period=0 --force
I have a container that runs some data fetching from a MySQL database and simply displays the result in console.log(), and want to run this as a cron job in GKE. So far I have the container working on my local machine, and have successfully deployed this to GKE (in terms of there being no errors thrown so far as I can see).
However, the pods that were created are just left in the Running state instead of stopping after completion of the task. Are the pods supposed to stop automatically after executing all the code, or do they require an explicit instruction to stop, and if so what is the command to terminate a pod after creation (by the CronJob)?
I've read that there is supposedly a termination grace period of ~30s by default, but after running a minutely-executed cronjob for ~20 minutes, all the pods were still running. I'm not sure if there's a way to terminate the pods from inside the code; otherwise it would be a little silly to have a cronjob generating lots of pods left running idle. My cronjob.yaml is below:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: test
spec:
  schedule: "5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: test
            image: gcr.io/project/test:v1
            # env:
            # - name: "DELAY"
            #   value: 15
          restartPolicy: OnFailure
A CronJob is essentially a cookie cutter for jobs. That is, it knows how to create jobs and execute them at a certain time. Now, that being said, when looking at garbage collection and clean up behaviour of a CronJob, we can simply look at what the Kubernetes docs have to say about this topic in the context of jobs:
When a Job completes, no more Pods are created, but the Pods are not deleted either. Keeping them around allows you to still view the logs of completed pods to check for errors, warnings, or other diagnostic output. The job object also remains after it is completed so that you can view its status. It is up to the user to delete old jobs after noting their status. Delete the job with kubectl (e.g. kubectl delete jobs/pi or kubectl delete -f ./job.yaml).
Adding a process.kill(); line in the code to explicitly end the process after the code has finished executing allowed the pod to automatically stop after execution
A job in Kubernetes is intended to run a single instance of a pod and ensure it runs to completion. As another answer specifies, a CronJob is a factory for Jobs which knows how and when to spawn a job according to the specified schedule.
Accordingly, and unlike a service which is intended to run forever, the container(s) in the pod created by the job must exit upon completion of the job. There is a notable problem with the sidecar pattern, which often requires manual pod lifecycle handling: if your main container requires sidecar containers to provide logging or database access, you must arrange for these to exit upon completion of the main container, otherwise they will remain running and k8s will not consider the job complete. In such circumstances, the pod associated with the job will never terminate.
The termination grace period is not applicable here: this timer applies after Kubernetes has requested that your pod terminate (e.g. if you delete it). It specifies the maximum time the pod is afforded to shutdown gracefully before the kubelet will summarily terminate it. If Kubernetes never considers your job to be complete, this phase of the pod lifecycle will not be entered.
Furthermore, old pods are kept around after completion for some time to allow perusal of logs and such. You may see pods listed which are not actively running and so not consuming compute resources on your worker nodes.
If your pods are not completing, please provide more information regarding the code they are running so we can assist in determining why the process never exits.
What are the benefits of a Job with single Pod over just a single Pod with restart policy OnFailure to reliably execute once in kubernetes?
As discussed in Job being constantly recreated despite RestartPolicy: Never, in the case of a Job a new Pod will be created endlessly if the container returns a non-zero status. The same applies to a single OnFailure Pod, only this time no new pods are created, which is even cleaner.
What are the cons and pros of either approach? Can Pod restart parameters, such as restart delay or number of retry attempts, be controlled in either case?
The difference is that if a Job doesn't complete because the node that its pod was on went offline for some reason, then a new pod will be created to run on a different node. If a single pod doesn't complete because its node became unavailable, it won't be rescheduled onto a different node.