How to fail a (cron) job after a certain number of retries? - kubernetes

We have a Kubernetes cluster of web-scraping cron jobs set up. All seems to go well until a cron job starts to fail (e.g., when a site structure changes and our scraper no longer works). It looks like every now and then a few failing cron jobs will keep retrying to the point that they bring down our cluster. Running kubectl get cronjobs (prior to a cluster failure) will show too many jobs running for a failing job.
I've attempted following the note described here regarding a known issue with the pod backoff failure policy; however, that does not seem to work.
Here is our config for reference:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: scrape-al
spec:
  schedule: '*/15 * * * *'
  concurrencyPolicy: Allow
  failedJobsHistoryLimit: 0
  successfulJobsHistoryLimit: 0
  jobTemplate:
    metadata:
      labels:
        app: scrape
        scrape: al
    spec:
      template:
        spec:
          containers:
            - name: scrape-al
              image: 'govhawk/openstates:1.3.1-beta'
              command:
                - /opt/openstates/openstates/pupa-scrape.sh
              args:
                - al bills --scrape
          restartPolicy: Never
      backoffLimit: 3
Ideally we would prefer that a cron job would be terminated after N retries (e.g., something like kubectl delete cronjob my-cron-job after my-cron-job has failed 5 times). Any ideas or suggestions would be much appreciated. Thanks!

You can tell your Job to stop retrying using backoffLimit.
Specifies the number of retries before marking this job failed.
In your case
spec:
  template:
    spec:
      containers:
        - name: scrape-al
          image: 'govhawk/openstates:1.3.1-beta'
          command:
            - /opt/openstates/openstates/pupa-scrape.sh
          args:
            - al bills --scrape
      restartPolicy: Never
  backoffLimit: 3
You set backoffLimit to 3 for your Job. That means when a Job is created by the CronJob, it will retry up to 3 times if it fails. This controls the Job, not the CronJob.
When a Job fails, another Job will still be created at the next scheduled time.
You want:
If I understand correctly, you want to stop scheduling new Jobs once your scheduled Jobs have failed 5 times. Right?
Answer:
In that case, this is not possible automatically.
Possible solution:
You need to suspend the CronJob so that it stops scheduling new Jobs:
suspend: true
You can do this manually. If you do not want to do it manually, you need to set up a watcher that monitors your CronJob's status and updates the CronJob to suspend it when necessary.
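A minimal sketch of both approaches, assuming the CronJob is named scrape-al as in the question and its Jobs carry the scrape: al label from the jobTemplate. The watcher loop is illustrative only, and it can only see failed Jobs if failedJobsHistoryLimit is raised above 0:
# Suspend the CronJob manually so it stops scheduling new Jobs:
kubectl patch cronjob scrape-al -p '{"spec":{"suspend":true}}'

# Naive "watcher": suspend the CronJob once 5 or more of its Jobs have failed.
while true; do
  failed=$(kubectl get jobs -l scrape=al \
    -o jsonpath='{.items[?(@.status.failed>0)].metadata.name}' | wc -w)
  if [ "$failed" -ge 5 ]; then
    kubectl patch cronjob scrape-al -p '{"spec":{"suspend":true}}'
    break
  fi
  sleep 60
done
In a real setup you would run something like this as its own small controller or cron task rather than an interactive shell loop.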

Related

How to make a Kubernetes job fail when it cannot find the pod image?

When a container image is not present on the cluster, the pod fails with the error ErrImageNeverPull, but the job never fails. Is there a configuration I can add to make sure the job fails if the pod startup fails?
apiVersion: batch/v1
kind: Job
metadata:
  name: image-not-present
spec:
  backoffLimit: 0
  ttlSecondsAfterFinished: 120
  template:
    spec:
      serviceAccountName: consolehub
      containers:
        - name: image-not-present
          image: aipaintr/image_not_present:latest
          imagePullPolicy: Never
      restartPolicy: OnFailure
You can configure activeDeadlineSeconds for this case. However, you have to know how long your job takes to reach Complete status, so that the timeout does not kill your pod while it is still processing.
From the documents:
The activeDeadlineSeconds applies to the duration of the job, no matter how many Pods are created. Once a Job reaches activeDeadlineSeconds, all of its running Pods are terminated and the Job status will become type: Failed with reason: DeadlineExceeded.
For example: I created a job with a wrong image and activeDeadlineSeconds: 100. Obviously, the pod got stuck in status Pending because of the wrong image (visible in kubectl describe pod).
After 100 seconds, the Job was marked Failed and the pod was killed as well (visible in kubectl describe job).
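For reference, a sketch of the Job from the question with activeDeadlineSeconds added; the 100-second value matches the example described above and should be tuned to your job's normal runtime:
apiVersion: batch/v1
kind: Job
metadata:
  name: image-not-present
spec:
  backoffLimit: 0
  activeDeadlineSeconds: 100   # mark the Job Failed if it has not completed within 100s
  ttlSecondsAfterFinished: 120
  template:
    spec:
      serviceAccountName: consolehub
      containers:
        - name: image-not-present
          image: aipaintr/image_not_present:latest
          imagePullPolicy: Never
      restartPolicy: OnFailure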

How to run a Kubernetes cronjob immediately

I'm very new to Kubernetes. Here I tried a cronjob YAML in which the pods are created every 1 minute.
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: hello
              image: busybox
              args:
                - /bin/sh
                - -c
                - date; echo Hello from the Kubernetes cluster
          restartPolicy: OnFailure
But the pods are created only after 1 minute. Is it possible to run the job immediately and then every 1 minute after that?
As already stated in the comments, a CronJob is backed by a Job. What you can do is literally launch CronJob and Job resources using the same spec at the same time. You can do that conveniently using a Helm chart or Kustomize.
Alternatively you can place both manifests in the same file or two files in the same directory and then use:
kubectl apply -f <file/dir>
With this workaround the initial Job is started right away and then, after some time, the CronJob takes over.
The downside of this solution is that the first Job is standalone and is not included in the CronJob's history. Another possible side effect is that the first Job and the first CronJob-created Job can run in parallel if the Job cannot finish its tasks fast enough; concurrencyPolicy does not take that standalone Job into consideration.
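A sketch of that workaround, pairing the hello CronJob from the question with a standalone Job that reuses the same pod spec (the Job name hello-initial is illustrative):
apiVersion: batch/v1
kind: Job
metadata:
  name: hello-initial      # one-off Job that runs immediately on apply
spec:
  template:
    spec:
      containers:
        - name: hello
          image: busybox
          args:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster
      restartPolicy: OnFailure
Place this in the same file or directory as the CronJob manifest and run kubectl apply -f on it; the Job starts right away and the CronJob continues on its schedule.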
From the documentation:
A cron job creates a job object about once per execution time of its
schedule. We say "about" because there are certain circumstances where
two jobs might be created, or no job might be created. We attempt to
make these rare, but do not completely prevent them.
So if you want to keep the task execution stricter, it may be better to use a Bash wrapper script with a sleep between task executions, or to design an app that forks subprocesses at the specified interval, build a container image, and run it as a Deployment.
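A rough sketch of such a wrapper, assuming a hypothetical task script and a one-minute interval like the schedule in the question; you would bake this into the image and run it as a Deployment:
#!/bin/sh
# Illustrative wrapper: run the task in a loop inside one long-running container.
while true; do
  /opt/task/run-task.sh   # hypothetical path to the actual task
  sleep 60                # desired interval between executions (1 minute here)
done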

Why does a Kubernetes cronjob pause?

I have a cronjob that is defined by this manifest:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: trigger
spec:
  concurrencyPolicy: Forbid
  startingDeadlineSeconds: 5
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      activeDeadlineSeconds: 50
      backoffLimit: 1
      parallelism: 1
      template:
        spec:
          containers:
            - env:
                - name: ApiKey
                  valueFrom:
                    secretKeyRef:
                      key: apiKey
                      name: something
              name: trigger
              image: curlimages/curl:7.71.1
              args:
                - -H
                - "Content-Type: application/json"
                - -H
                - "Authorization: $(ApiKey)"
                - -d
                - '{}'
                - http://url
          restartPolicy: Never
It sort of works, but not 100%. For some reason it runs 10 jobs, then pauses for 5-10 minutes or so, and then runs 10 new jobs. No errors are reported, but we don't understand why it pauses.
Any ideas on what might cause a cronjob in kubernetes to pause?
The most common problem when running CronJobs on k8s is
spawning too many pods, which consume all cluster resources.
It is very important to set proper CronJob limitations, so try to set memory limits for the pods.
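For example, a container-level memory limit in the pod template could look roughly like this (the values are illustrative, not tuned for the curl job above):
containers:
  - name: trigger
    image: curlimages/curl:7.71.1
    resources:
      requests:
        memory: '64Mi'    # scheduler reserves at least this much
      limits:
        memory: '256Mi'   # pod is killed if it exceeds this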
Also, speaking about concurrencyPolicy: you set the concurrencyPolicy parameter to Forbid, which means that the cron job does not allow concurrent runs; if it is time for a new job run and the previous job run hasn't finished yet, the cron job skips the new job run.
The .spec.concurrencyPolicy field is optional. It specifies how to treat concurrent executions of a job created by this cron job. There are the following concurrency policies:
Allow (default): the cron job allows concurrently running jobs
Forbid: explained above
Replace: if it is time for a new job run and the previous job run hasn't finished yet, the cron job replaces the currently running job run with a new job run
Try changing the policy to Allow or Replace according to your needs.
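For instance, switching the policy in the manifest above is a one-line change (Replace shown here; Allow works the same way):
spec:
  concurrencyPolicy: Replace   # replace the still-running job with the new one
  startingDeadlineSeconds: 5
  schedule: "*/1 * * * *"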
For a non-parallel Job, you can leave .spec.parallelism unset; when it is unset, it defaults to 1.
Take a look: cron-jobs-running-for-one-cron-execution-point-in-kubernetes, cron-job-limitations, cron-jobs.

Avoid multiple cron jobs running for one cron execution point in Kubernetes

EDIT: Question is solved, it was my mistake: I simply used the wrong cron settings. I assumed "* 2 * * *" would only run once per day at 2, but in fact it runs every minute during hour 2. So Kubernetes behaves correctly.
I keep having multiple jobs running at one cron execution point. But it seems to happen only if those jobs have a very short runtime. Any idea why this happens and how I can prevent it? I use concurrencyPolicy: Forbid, backoffLimit: 0 and restartPolicy: Never.
Example for a cron job that is supposed to run once per day, but runs multiple times just after its scheduled run time:
job-1554346620 1/1 11s 4h42m
job-1554346680 1/1 11s 4h41m
job-1554346740 1/1 10s 4h40m
Relevant config:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: job
spec:
  schedule: "* 2 * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: job
              image: job_image:latest
              command: ["rake", "run_job"]
          restartPolicy: Never
          imagePullSecrets:
            - name: regcred
      backoffLimit: 0
The most common problem when running CronJobs on k8s is:
spawning too many pods, which consume all cluster resources.
It is very important to set proper CronJob limitations.
If you are not sure what you need - just take this example as a template:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: my-first-conjob
  namespace: devenv-admitriev
spec:
  schedule: "*/10 * * * *"         # MM HH DD MM WKD -- Minutes, Hour, Day, Month, Weekday (e.g. Sun, Mon)
  successfulJobsHistoryLimit: 3    # how many completed jobs should be kept
  failedJobsHistoryLimit: 1        # how many failed jobs should be kept
  suspend: false                   # here you can suspend the cronjob without deleting it
  concurrencyPolicy: Forbid        # choose Forbid if you don't want concurrent executions of your Job
  # The amount of time that Kubernetes can miss and still start a job.
  # If Kubernetes missed too many job starts (100),
  # then Kubernetes logs an error and doesn't start any future jobs.
  startingDeadlineSeconds: 300     # if a job hasn't started in this many seconds, skip
  jobTemplate:
    spec:
      parallelism: 1               # how many pods will be instantiated at once
      completions: 1               # how many successful pod completions are required for the Job to be complete
      backoffLimit: 3              # maximum pod retries in case of failure
      activeDeadlineSeconds: 1800  # limit the time for which a Job can continue to run
      template:
        spec:
          restartPolicy: Never     # if you want restarts, use OnFailure
          terminationGracePeriodSeconds: 30
          containers:
            - name: my-first-conjob
              image: busybox
              command:
                - /bin/sh
              args:
                - -c
                - date; echo sleeping....; sleep 90s; echo exiting...;
              resources:
                requests:
                  memory: '128Mi'
                limits:
                  memory: '1Gi'
It's not entirely clear what you expected, but if I understand the question correctly, you mean not running all cron jobs at the same time:
1. First option: change their schedule times.
2. Second option: try other settings in your spec template, like Parallel Jobs, described here: https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
"For a work queue Job, you must leave .spec.completions unset, and set .spec.parallelism to a non-negative integer"
jobTemplate:
  spec:
    parallelism: 1
    template:
To reproduce this, please provide more details.
In addition, regarding "Jobs History":
by default, successfulJobsHistoryLimit and failedJobsHistoryLimit are set to 3 and 1 respectively.
Please take a look at: https://kubernetes.io/docs/tasks/job/
If you are interested, you can set up the limits in the "spec" section:
successfulJobsHistoryLimit: 1
failedJobsHistoryLimit: 1
Hope this helps.

Kubernetes - how to run job only once

I have a job definition based on an example from the Kubernetes website.
apiVersion: batch/v1
kind: Job
metadata:
  name: pi-with-timeout-6
spec:
  activeDeadlineSeconds: 30
  completions: 1
  parallelism: 1
  template:
    metadata:
      name: pi
    spec:
      containers:
        - name: pi
          image: perl
          command: ["exit", "1"]
      restartPolicy: Never
I would like to run this job once and not restart it if it fails. With the command exit 1, Kubernetes keeps trying to run new pods to get an exit code of 0 until the activeDeadlineSeconds timeout is reached. How can I avoid that? I would like to run build commands in Kubernetes to check compilation, and if compilation fails I'll get an exit code different from 0. I don't want to run the compilation again.
Is it possible? How?
By now this is possible by setting backoffLimit: 0, which tells the controller to do 0 retries. The default is 6.
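Applied to the Job from the question, a sketch would look like this (only backoffLimit is new; everything else is unchanged):
apiVersion: batch/v1
kind: Job
metadata:
  name: pi-with-timeout-6
spec:
  backoffLimit: 0            # do not retry failed pods
  activeDeadlineSeconds: 30
  completions: 1
  parallelism: 1
  template:
    metadata:
      name: pi
    spec:
      containers:
        - name: pi
          image: perl
          command: ["exit", "1"]   # fails, as in the question
      restartPolicy: Never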
If you want a one-try command runner, you should probably create a bare Pod, because the Job will try to execute the command until it succeeds or the active deadline is met.
Just create the pod from your template:
apiVersion: v1
kind: Pod
metadata:
  name: pi
spec:
  containers:
    - name: pi
      image: perl
      command: ["exit", "1"]
  restartPolicy: Never
Sadly, there is currently no way to prevent the job controller from respawning new pods when they fail, but the Kubernetes community is working on a solution; see:
"Backoff policy and failed pod limit" https://github.com/kubernetes/community/pull/583