How to ensure a Kubernetes CronJob does not restart on failure

I have a cronjob that sends out emails to customers. It occasionally fails for various reasons. I do not want it to restart, but it still does.
I am running Kubernetes on GKE. To get it to stop, I have to delete the CronJob and then manually kill all the pods it creates.
This is bad, for obvious reasons.
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  creationTimestamp: 2018-06-21T14:48:46Z
  name: dailytasks
  namespace: default
  resourceVersion: "20390223"
  selfLink: [redacted]
  uid: [redacted]
spec:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      creationTimestamp: null
    spec:
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - command:
            - kubernetes/daily_tasks.sh
            env:
            - name: DB_HOST
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: status.hostIP
            envFrom:
            - secretRef:
                name: my-secrets
            image: [redacted]
            imagePullPolicy: IfNotPresent
            name: dailytasks
            resources: {}
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          dnsPolicy: ClusterFirst
          restartPolicy: Never
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
  schedule: 0 14 * * *
  successfulJobsHistoryLimit: 3
  suspend: true
status:
  active:
  - apiVersion: batch
    kind: Job
    name: dailytasks-1533218400
    namespace: default
    resourceVersion: "20383182"
    uid: [redacted]
  lastScheduleTime: 2018-08-02T14:00:00Z

It turns out that you have to set backoffLimit: 0 in combination with restartPolicy: Never and concurrencyPolicy: Forbid.
backoffLimit is the number of times the Job will be retried before it is considered failed. The default is 6.
concurrencyPolicy set to Forbid means a new run is not started while a previous one is still active, so at most one Job runs at a time.
restartPolicy set to Never means the pod's containers won't be restarted in place when they fail.
You need to do all 3 of these things, or your cronjob may run more than once.
spec:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      creationTimestamp: null
    spec:
      backoffLimit: 0   # <-- ADD THIS
      template:
        ... MORE STUFF ...
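For reference, here is a minimal sketch of a complete CronJob manifest with all three settings in place (the name, image, and schedule below are placeholders, not values taken from the question):

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: dailytasks                  # placeholder name
spec:
  schedule: "0 14 * * *"
  concurrencyPolicy: Forbid         # never start a new Job while one is still active
  jobTemplate:
    spec:
      backoffLimit: 0               # do not create replacement pods after a failure
      template:
        spec:
          restartPolicy: Never      # do not restart failed containers in place
          containers:
          - name: dailytasks
            image: registry.example.com/dailytasks:latest   # placeholder image
            command: ["kubernetes/daily_tasks.sh"]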

The Kubernetes CronJob resource has a suspend field in its spec.
It won't set this by itself, but if you want to ensure the job doesn't run again, you could update the script that sends the emails and have it patch the CronJob resource to set suspend: true when it fails.
Something like this:
kubectl patch cronjob <name> -p '{"spec": { "suspend": true }}'
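For example, a rough sketch of a wrapper for the email script that suspends its own CronJob on failure (this assumes the pod's service account has RBAC permission to patch cronjobs and that kubectl is available in the image; names are placeholders):

#!/bin/bash
# Run the real task; if it fails, suspend the CronJob so no further runs are scheduled.
if ! kubernetes/daily_tasks.sh; then
  kubectl patch cronjob dailytasks -p '{"spec": {"suspend": true}}'
  exit 1
fi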

Related

openshift cronjob with imagestream

When I update my imagestream and then trigger a run of a cronjob, the cronjob still uses the previous imagestream instead of pulling the latest, despite the fact that I have configured the cronjob to always pull.
The way I've been testing this is to:
push a new image to the stream, and verify the imagestream is updated
check the cronjob object and verify the image associated with the container still has the old imagestream hash
trigger a new run of the cronjob, which I would think would pull a new image since the pull policy is Always -- but it does not; the cronjob starts a container using the old imagestream (the commands sketched after the YAML below mirror these checks).
Here's the YAML:
apiVersion: template.openshift.io/v1
kind: Template
metadata:
  name: cool-cron-job-template
parameters:
  - name: ENVIRONMENT
    displayName: Environment
    required: true
objects:
  - apiVersion: v1
    kind: ImageStream
    metadata:
      name: cool-cron-job
      namespace: cool-namespace
      labels:
        app: cool-cron-job
        owner: cool-owner
    spec:
      lookupPolicy:
        local: true
  - apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: cool-cron-job-cron-job
      namespace: cool-namespace
      labels:
        app: cool-cron-job
        owner: cool-owner
    spec:
      schedule: "10 0 1 * *"
      concurrencyPolicy: "Forbid"
      startingDeadlineSeconds: 200
      suspend: false
      successfulJobsHistoryLimit: 1
      failedJobsHistoryLimit: 1
      jobTemplate:
        spec:
          template:
            metadata:
              labels:
                app: cool-cron-job
                cronjob: "true"
              annotations:
                alpha.image.policy.openshift.io/resolve-names: '*'
            spec:
              dnsPolicy: ClusterFirst
              restartPolicy: OnFailure
              securityContext: { }
              terminationGracePeriodSeconds: 0
              containers:
                - command: [ "python", "-m", "cool_cron_job.handler" ]
                  imagePullPolicy: Always
                  name: cool-cron-job-container
                  image: cool-cron-job:latest
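For context, the checks described above can be done with commands along these lines (a sketch using the names from the template; it assumes the oc CLI and access to the cool-namespace project):

# verify the imagestream tag now points at the new image
oc get istag cool-cron-job:latest -n cool-namespace

# inspect the image currently recorded on the CronJob's pod template
oc get cronjob cool-cron-job-cron-job -n cool-namespace \
  -o jsonpath='{.spec.jobTemplate.spec.template.spec.containers[0].image}'

# trigger an ad-hoc run from the CronJob
oc create job manual-test --from=cronjob/cool-cron-job-cron-job -n cool-namespace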

kubernetes vpa for CronJob

I need to run VPA for a CronJob. I referred to this doc.
I think I followed it properly, but it doesn't work for me.
I'm using GKE, 1.17.
The VPA version is vpa-release-0.8.
I created the CronJob and VPA with this file:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            app: hello
        spec:
          containers:
          - name: hello
            image: busybox
            imagePullPolicy: IfNotPresent
            args:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster
          restartPolicy: OnFailure
---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-vpa
spec:
  targetRef:
    apiVersion: "batch/v1beta1"
    kind: CronJob
    name: hello
  updatePolicy:
    updateMode: "Auto"
When I type this command:
kubectl describe vpa
I got this result:
Name:         my-vpa
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  autoscaling.k8s.io/v1
Kind:         VerticalPodAutoscaler
Metadata:
  Creation Timestamp:  2021-02-08T07:38:23Z
  Generation:          2
  Resource Version:    3762
  Self Link:           /apis/autoscaling.k8s.io/v1/namespaces/default/verticalpodautoscalers/my-vpa
  UID:                 07803254-c549-4568-a062-144c570a8d41
Spec:
  Target Ref:
    API Version:  batch/v1beta1
    Kind:         CronJob
    Name:         hello
  Update Policy:
    Update Mode:  Auto
Status:
  Conditions:
    Last Transition Time:  2021-02-08T07:39:14Z
    Status:                False
    Type:                  RecommendationProvided
  Recommendation:
Events:  <none>
@mario oh!! So there was not enough time to get metrics to recommend resources... – 변상현 Feb 10 at 2:36
Yes, exactly. If the only task of your CronJob is to echo Hello from the Kubernetes cluster and exit, you won't get any recommendations from VPA, as this is not a resource-intensive task.
However, if you modify your command so that it generates an artificial load in your CronJob-managed pods:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            app: hello
        spec:
          containers:
          - name: hello
            image: busybox
            imagePullPolicy: IfNotPresent
            args:
            - /bin/sh
            - -c
            - date; dd if=/dev/urandom | gzip -9 >> /dev/null
          restartPolicy: OnFailure
after a few minutes you'll get the expected result:
$ kubectl describe vpa my-vpa
Name:         my-vpa
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  autoscaling.k8s.io/v1
Kind:         VerticalPodAutoscaler
Metadata:
  Creation Timestamp:  2021-05-22T13:02:27Z
  Generation:          8
  ...
    Manager:         vpa-recommender
    Operation:       Update
    Time:            2021-05-22T13:29:40Z
  Resource Version:  5534471
  Self Link:         /apis/autoscaling.k8s.io/v1/namespaces/default/verticalpodautoscalers/my-vpa
  UID:               e37abd79-296d-4f72-8bd5-f2409457e9ff
Spec:
  Target Ref:
    API Version:  batch/v1beta1
    Kind:         CronJob
    Name:         hello
  Update Policy:
    Update Mode:  Auto
Status:
  Conditions:
    Last Transition Time:  2021-05-22T13:39:40Z
    Status:                False
    Type:                  LowConfidence
    Last Transition Time:  2021-05-22T13:29:40Z
    Status:                True
    Type:                  RecommendationProvided
  Recommendation:
    Container Recommendations:
      Container Name:  hello
      Lower Bound:
        Cpu:     1185m
        Memory:  2097152
      Target:
        Cpu:     1375m
        Memory:  2097152
      Uncapped Target:
        Cpu:     1375m
        Memory:  2097152
      Upper Bound:
        Cpu:     96655m
        Memory:  115343360
Events:  <none>
❗Important: Just don't leave it running for too long as you might be quite surprised with your bill 😉
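To stop the artificial load once the recommendation appears, you could suspend or delete the test CronJob, for example:

kubectl patch cronjob hello -p '{"spec": {"suspend": true}}'
# or remove it entirely
kubectl delete cronjob hello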

Kubernetes failed job with no pods

I see a failed Job that created no pods. There is also no information in the events. Since there are no pods, I could not check the logs.
Here is the description of the job which failed.
kubectl describe job time-limited-rbac-1604010900 -n add-ons
Name:                     time-limited-rbac-1604010900
Namespace:                add-ons
Selector:                 controller-uid=0816b9b3-814c-4802-83cf-5d5f3456701d
Labels:                   controller-uid=0816b9b3-814c-4802-83cf-5d5f3456701d
                          job-name=time-limited-rbac-1604010900
Annotations:              <none>
Controlled By:            CronJob/time-limited-rbac
Parallelism:              1
Completions:              <unset>
Start Time:               Thu, 29 Oct 2020 15:35:08 -0700
Active Deadline Seconds:  280s
Pods Statuses:            0 Running / 0 Succeeded / 1 Failed
Pod Template:
  Labels:           controller-uid=0816b9b3-814c-4802-83cf-5d5f3456701d
                    job-name=time-limited-rbac-1604010900
  Service Account:  time-limited-rbac
  Containers:
   time-limited-rbac:
    Image:      bitnami/kubectl:latest
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/bash
    Args:
      /var/tmp/time-limited-rbac.sh
    Environment:  <none>
    Mounts:
      /var/tmp/ from script (rw)
  Volumes:
   script:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      time-limited-rbac-script
    Optional:  false
Events:  <none>
Here is the description of the CronJob.
apiVersion: v1
items:
- apiVersion: batch/v1beta1
  kind: CronJob
  metadata:
    annotations:
      meta.helm.sh/release-name: time-limited-rbac
      meta.helm.sh/release-namespace: add-ons
    labels:
      app.kubernetes.io/name: time-limited-rbac
    name: time-limited-rbac
  spec:
    concurrencyPolicy: Replace
    failedJobsHistoryLimit: 1
    jobTemplate:
      metadata:
        creationTimestamp: null
      spec:
        activeDeadlineSeconds: 280
        backoffLimit: 3
        parallelism: 1
        template:
          metadata:
            creationTimestamp: null
          spec:
            containers:
            - args:
              - /var/tmp/time-limited-rbac.sh
              command:
              - /bin/bash
              image: bitnami/kubectl:latest
              imagePullPolicy: Always
              name: time-limited-rbac
              resources: {}
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
              volumeMounts:
              - mountPath: /var/tmp/
                name: script
            dnsPolicy: ClusterFirst
            restartPolicy: Never
            schedulerName: default-scheduler
            securityContext: {}
            serviceAccount: time-limited-rbac
            serviceAccountName: time-limited-rbac
            terminationGracePeriodSeconds: 0
            volumes:
            - configMap:
                defaultMode: 356
                name: time-limited-rbac-script
              name: script
    schedule: '*/5 * * * *'
    successfulJobsHistoryLimit: 3
    suspend: false
Is there any way to tune this cronjob to avoid such scenarios? We are receiving this issue at least once or twice every day.
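One place to look (a sketch of a diagnostic step, not a confirmed fix): a failed Job usually records the reason in its status conditions, e.g. DeadlineExceeded when activeDeadlineSeconds (280s here) is hit, and both that deadline and concurrencyPolicy: Replace terminate the Job's pods, which could leave a failed Job with no pods left to inspect:

kubectl get job time-limited-rbac-1604010900 -n add-ons -o jsonpath='{.status.conditions}'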

Kubernetes cronjob in GKE stops scheduling the job after a few weeks

I have this YAML for a cronjob running in Google Kubernetes Engine:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  creationTimestamp: 2019-04-22T18:20:51Z
  name: cron-field-velocity-field-details-manager
  namespace: master
  resourceVersion: "73643714"
  selfLink: /apis/batch/v1beta1/namespaces/master/cronjobs/cron-field-velocity-field-details-manager
  uid: 5be9e8d5-652b-11e9-bf91-42010a9600af
spec:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      creationTimestamp: null
    spec:
      template:
        metadata:
          creationTimestamp: null
          labels:
            app: cron-field-velocity-field-details-manager
            chart: field-velocity-field-details-manager-0.0.1
            heritage: Tiller
            release: master-field-velocity-field-details-manager
        spec:
          containers:
          - args:
            - ./field-velocity-field-details-manager.dll
            command:
            - dotnet
            image: taranisag/field-velocity-field-details-manager:master.993b179
            imagePullPolicy: IfNotPresent
            name: cron-field-velocity-field-details-manager
            resources:
              requests:
                cpu: "2"
                memory: 2Gi
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          dnsPolicy: ClusterFirst
          imagePullSecrets:
          - name: regsecret
          restartPolicy: Never
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
  schedule: '* 2,14 * * *'
  successfulJobsHistoryLimit: 3
  suspend: false
status:
  lastScheduleTime: 2019-06-20T02:00:00Z
It was working for a few weeks, meaning the job was running twice a day, but it stopped running a week ago.
There was no indication of an error, and the last run completed successfully.
Is it something I defined wrong in the YAML?
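One detail worth double-checking (an observation, not a confirmed cause): the schedule '* 2,14 * * *' fires every minute of hours 2 and 14 rather than twice a day; pinning the minute field gives exactly two runs daily, for example:

schedule: '0 2,14 * * *'   # 02:00 and 14:00 every day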

Kubernetes Cron Jobs - Run multiple pods for a cron job

Our requirement is that we need to do batch processing every 3 hours, but a single process cannot handle the workload, so we have to run multiple pods for the same cron job. Is there any way to do that?
Thank you.
You can set parallelism: <num_of_pods> in cronjob.spec.jobTemplate.spec and it will run multiple pods at the same time.
The following is an example of a cronjob which runs 3 nginx pods every minute.
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  creationTimestamp: null
  labels:
    run: cron1
  name: cron1
spec:
  concurrencyPolicy: Allow
  jobTemplate:
    metadata:
      creationTimestamp: null
    spec:
      parallelism: 3
      template:
        metadata:
          creationTimestamp: null
          labels:
            run: cron1
        spec:
          containers:
          - image: nginx
            name: cron1
            resources: {}
          restartPolicy: OnFailure
  schedule: '*/1 * * * *'
status: {}
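Note that with parallelism set and completions left unset, the Job is handled as a work-queue pattern: all three pods start together, but the Job completes once any one pod succeeds and the remaining pods terminate. If each scheduled run should instead wait for three successful pods, a sketch would also set completions alongside parallelism:

  jobTemplate:
    spec:
      parallelism: 3
      completions: 3   # require three successful pods per scheduled run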