In Kubernetes CronJobs, it is stated in the limitations section that:
Jobs may fail to run if the CronJob controller is not running or broken for a span of time from before the start time of the CronJob to start time plus startingDeadlineSeconds, or if the span covers multiple start times and concurrencyPolicy does not allow concurrency.
What I understand from this is: if startingDeadlineSeconds is set to 10 and the CronJob couldn't start for some reason at its scheduled time, it can still be attempted to start as long as those 10 seconds haven't passed; after the 10 seconds, however, it for sure won't be started. Is this correct?
Also, if I have concurrencyPolicy set to Forbid, does Kubernetes count it as a failure if a CronJob is due to be scheduled while a previous run is still running?
After investigating the code base of the Kubernetes repo, this is how the CronJob controller works:
Every 10 seconds, the CronJob controller checks the list of CronJobs in the given Kubernetes client.
For every CronJob, it checks how many schedules were missed in the period from lastScheduleTime until now. If there are more than 100 missed schedules, it doesn't start the job and records the event:
"FailedNeedsStart", "Cannot determine if job needs to be started. Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew."
It is important to note that if the field startingDeadlineSeconds is set (not nil), the controller counts how many schedules were missed in the window from startingDeadlineSeconds ago until now. For example, if startingDeadlineSeconds = 200, it counts how many schedules were missed in the last 200 seconds. The exact implementation of counting missed schedules can be found here.
If there are no more than 100 missed schedules from the previous step, the CronJob controller then checks whether the current time is past scheduledTime + startingDeadlineSeconds, i.e. whether it is already too late to start the job (the deadline has passed). If it isn't too late, the controller keeps trying to start the job. If it is already too late, it doesn't start the job and records the event:
"Missed starting window for {cronjob name}. Missed scheduled time to start a job {scheduledTime}"
It is also important to note that if the field startingDeadlineSeconds is not set, there is no deadline at all: the CronJob controller will attempt to start the job without checking whether it is too late.
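For concreteness, here is a minimal sketch of a CronJob that sets both fields discussed above, assuming a Kubernetes version where CronJob is served from batch/v1; the name, schedule, and image are illustrative:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-cronjob            # illustrative name
spec:
  schedule: "*/1 * * * *"          # illustrative schedule: every minute
  startingDeadlineSeconds: 200     # missed schedules are counted over the last 200 seconds
  concurrencyPolicy: Forbid        # do not start a new run while one is still active
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: job
            image: busybox         # placeholder image
            command: ["sh", "-c", "echo hello"]
```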
Therefore to answer the questions above:
1. If startingDeadlineSeconds is set to 10 and the CronJob couldn't start for some reason at its scheduled time, can it still be attempted to start as long as those 10 seconds haven't passed, while after the 10 seconds it for sure won't be started?
The CronJob controller will attempt to start the job, and it will be successfully scheduled if the 10 seconds after its scheduled time haven't passed yet. However, if the deadline has passed, it won't be started on this run, and it will be counted as a missed schedule in later executions.
2. If I have concurrencyPolicy set to Forbid, does Kubernetes count it as a failure if a CronJob is due to be scheduled while a previous run is still running?
Yes, it will be counted as a missed schedule, since missed schedules are calculated as described above.
Related
Is there a "natural" (I mean, through a parameter) way to limit how often a DAG can be triggered (let's say, to once every 24 hours)?
I don't want to schedule it, but a user can trigger the same DAG multiple times, and for resource and other reasons I want it to run only once.
As far as I can see, depends_on_past only checks against the previous run, but that could still happen many times a day.
Thanks.
Not directly, but you could likely implement task_instance_mutation_hook for the first task of the DAG; it could then immediately fail the task if it detects that the DAG has already been run that day.
https://airflow.apache.org/docs/apache-airflow/stable/concepts/cluster-policies.html#task-instance-mutation
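For concreteness, here is a minimal sketch of a closely related approach: instead of the cluster-wide task_instance_mutation_hook, a guard callable used as the DAG's first task that fails the run if the DAG has already been triggered that day. This assumes Airflow 2.2+; the function name and message are illustrative.

```python
# Hedged sketch (assumes Airflow 2.2+): use this as the callable of the DAG's
# first task. It fails the whole run if another run of the same DAG has
# already started on the same calendar day.
from airflow.exceptions import AirflowFailException
from airflow.models import DagRun
from airflow.operators.python import get_current_context


def fail_if_already_triggered_today():
    ctx = get_current_context()
    run_date = ctx["logical_date"]
    start_of_day = run_date.replace(hour=0, minute=0, second=0, microsecond=0)
    runs_today = DagRun.find(
        dag_id=ctx["dag"].dag_id,
        execution_start_date=start_of_day,
        execution_end_date=run_date,
    )
    # The current run is included in the result, so more than one entry
    # means the DAG was already triggered earlier today.
    if len(runs_today) > 1:
        raise AirflowFailException("This DAG has already been triggered today.")
```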
Basically, what I am trying to do is play around with the pod lifecycle and check whether we can do some cleanup/backup, such as copying logs, before the pod terminates.
What I need:
Copy logs/heap dumps from the container to a hostPath/S3 before the pod terminates.
What I tried:
I used a preStop hook with a shell command to echo a message (just to see if it works). I combined terminationGracePeriodSeconds with a delay in preStop and toggled them to see how the process behaves. For example, with terminationGracePeriodSeconds at 30 seconds (the default) and the preStop command set to sleep for 50 seconds, the message should not be generated, since the container will have been terminated by then. This works as expected.
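For reference, a minimal sketch of the experiment described above; the pod name, image, and exact commands are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: prestop-demo                   # illustrative name
spec:
  terminationGracePeriodSeconds: 30    # the default; raise it to give the hook more time
  containers:
  - name: app
    image: busybox                     # placeholder image
    command: ["sh", "-c", "sleep 3600"]
    lifecycle:
      preStop:
        exec:
          # Sleeping longer than the grace period means the echo never runs,
          # because the container is killed before the hook finishes.
          command: ["sh", "-c", "sleep 50 && echo preStop finished"]
```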
My questions:
What kind of processes are allowed (recommended) for a preStop hook? Copying logs/heap dumps of 15 GB or more will take a lot of time, and this time would then be used to define terminationGracePeriodSeconds.
What happens when preStop takes more time than the set grace period (in case the logs are huge, say 10 GB)?
What happens if I do not have any hooks but still set terminationGracePeriodSeconds? Will the container remain up until that grace period ends?
I found this issue, which closely relates to this, but could not follow it through: https://github.com/kubernetes/kubernetes/issues/24695
All input appreciated!
What kind of processes are allowed (recommended) for a preStop hook? Copying logs/heap dumps of 15 GB or more will take a lot of time, and this time would then be used to define terminationGracePeriodSeconds.
Anything goes here; it's more a matter of opinion and of how you would like your pods to linger. Another option is to let your pods terminate and store your data somewhere that persists past the pod lifecycle (e.g., AWS S3 or EBS), then use something like a Job to clean up the data.
What happens when preStop takes more time than the set grace period (in case the logs are huge, say 10 GB)?
Your preStop hook will not complete, which may mean incomplete data or data corruption.
What happens if I do not have any hooks but still set terminationGracePeriodSeconds? Will the container remain up until that grace period ends?
This would be the sequence:
A SIGTERM signal is sent to the main process in each container, and a "grace period" countdown starts.
If a container doesn't terminate within the grace period, a SIGKILL signal is sent and the container is killed.
I have a Quartz v1.x stateful job. The repeat interval is, let's say, 1 minute. The job itself typically terminates within a second, but it might happen that it lasts longer, let's say 5 minutes. The scheduler prevents parallel runs, but when the long-running job finishes, it runs, one after another, all the executions that were missed during the long run. In this example, 5 other runs will be scheduled right after the long execution finishes. What I want is to make the scheduler "forget" the missed starts. E.g., if a job starts at 12:00 and finishes at 12:05, then simply omit the runs at 12:01, 12:02, 12:03, 12:04, and, depending on the exact finish time, even 12:05. Is this somehow possible?
I need a stateful job to prevent parallel execution. A stateless job with the proper annotation is not an option, because we are using Quartz version 1.x. I have already tried playing around with the misfire policies (e.g. MISFIRE_INSTRUCTION_DO_NOTHING), but it seems that these are not intended for such situations. Could anyone help me?
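For concreteness, here is a minimal Quartz 1.x sketch of the setup described above: a StatefulJob on a trigger that fires every minute, with the misfire instruction mentioned in the question. Class, job, and trigger names are illustrative, and, as noted above, the misfire instruction alone did not solve the problem.

```java
import org.quartz.CronTrigger;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;
import org.quartz.Scheduler;
import org.quartz.StatefulJob;
import org.quartz.impl.StdSchedulerFactory;

// Implementing StatefulJob (Quartz 1.x) prevents parallel executions of the
// same JobDetail; triggers that come due while the job runs are blocked.
public class LongRunningStatefulJob implements StatefulJob {
    public void execute(JobExecutionContext context) throws JobExecutionException {
        // usually finishes within a second, occasionally runs for several minutes
    }

    public static void main(String[] args) throws Exception {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        JobDetail job = new JobDetail("longJob", Scheduler.DEFAULT_GROUP, LongRunningStatefulJob.class);
        // Fire at the start of every minute.
        CronTrigger trigger = new CronTrigger("everyMinute", Scheduler.DEFAULT_GROUP, "0 * * * * ?");
        // The misfire instruction the question refers to; it only affects
        // triggers that the scheduler treats as misfired.
        trigger.setMisfireInstruction(CronTrigger.MISFIRE_INSTRUCTION_DO_NOTHING);
        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}
```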
My problem description is as follows:
I have n state-based, infinitely running database crawlers.
How it currently works:
We are using a single machine for crawling.
We have three levels of priority queue: high, medium and low.
At the start, all database jobs are put into the low-priority queue.
A worker reads a job from the queue and performs the operation.
After finishing a job, it reschedules it with a delay of 5 minutes.
Solution I found
For the priority queue I can use:
http://zookeeper.apache.org/doc/r3.2.2/recipes.html#sc_recipes_priorityQueues
Problems I am still searching for a solution to:
How to reschedule a job in the queue with a future schedule time. Is there a way to do that in ZooKeeper? (See the sketch after this question.)
Cancelling an already started job. Suppose a user changes their database authentication details; I want to stop the already running job for that database and restart it with the new details.
What I thought is that, while starting, a worker will subscribe to that database's znode changes, and if something changes, it will stop that job and reschedule it.
Infinite queue: what I thought is that after finishing a job the worker will remove it from the queue and re-add it with a future schedule time. (Its implementation depends on point 1.)
Is this the correct way of doing this infinite task?
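For illustration, a hedged sketch of the "re-add with a future schedule time" idea from point 1: ZooKeeper has no native delayed delivery, so the not-before timestamp is encoded in the item payload and enforced by the workers. The paths, payload format, and method name are illustrative assumptions.

```java
import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class DelayedRequeue {
    // Re-enqueue a crawl job so that workers only run it after notBeforeMillis.
    // Assumes the parent node /crawl-queue/low already exists.
    public static void requeue(ZooKeeper zk, String dbJobId, long notBeforeMillis) throws Exception {
        byte[] payload = (dbJobId + "|" + notBeforeMillis).getBytes(StandardCharsets.UTF_8);
        // A sequential node keeps FIFO order within the same priority level.
        zk.create("/crawl-queue/low/item-", payload,
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
    }

    // A worker that dequeues an item parses the timestamp from the payload and,
    // if it is still in the future, either waits or puts the item back.
}
```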
I have a Quartz job that I want to fire every minute. The job itself contains logic to check whether there is a process to run, and if there is, the job could take 45 minutes to complete.
Using a SimpleTrigger, will Quartz fire this job every minute even if there is one already running? Or if the interval is set to 1 minute, does that mean Quartz will wait 1 minute after the job is done before it fires the next one?
If the trigger is set to fire every minute, it will fire every minute (and a new job instance will be created and invoked).
Unless the related job class is annotated with @DisallowConcurrentExecution.
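For illustration, a minimal sketch of such a job class, assuming Quartz 2.x (where the annotation exists); the class name and body are illustrative:

```java
import org.quartz.DisallowConcurrentExecution;
import org.quartz.Job;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;

// With this annotation, Quartz will not run two instances of the same JobDetail
// at the same time; a trigger that fires while an execution is in progress is
// delayed until that execution completes.
@DisallowConcurrentExecution
public class CheckAndProcessJob implements Job {
    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        // check whether there is work to do; this may take up to 45 minutes
    }
}
```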