Nomad periodic job starts immediately

I have a nomad periodic job that has this in the job config:
periodic {
  cron             = "* */2 * * *"
  prohibit_overlap = true
}
However, I find that when Nomad finishes running the job (the job takes less than two hours), it more or less immediately starts it again (sometimes within 60 seconds). I expected it to start again after about two hours.
Why does that happen, and how can I make Nomad start the job only every two hours?

You have * in the minutes field, so the job is eligible to start every minute (of every second hour). Change the cron expression to 0 */2 * * * so it only fires at the top of every second hour.
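The corrected stanza then looks like this (same job config as above, with only the cron expression changed):

periodic {
  cron             = "0 */2 * * *"
  prohibit_overlap = true
}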


I want to restart my Kubernetes pods after every 21 days

I want to restart my Kubernetes pods every 21 days, probably with a cron job that can do that. That pod should be able to restart pods from all of my deployments.
I don't want to write any extra script for that. Is that possible?
Thank you.
If you want it to run every 21 days regardless of the day of the month, then cron (and most other similar scheduling systems) doesn't support a direct translation of that requirement.
In that case you'd have to track the dates yourself and calculate the next run by adding 21 days to the current date each time the function runs.
My suggestion is:
If you want it to run at 00:00 on the 21st day of every month, the cron schedule is: 0 0 21 * * (at 00:00 on day-of-month 21).
If you want it to run at 12:00 on the 21st day of every month, the cron schedule is: 0 12 21 * * (at 12:00 on day-of-month 21).
Otherwise, if you want to add the 21 days yourself, you can use setInterval to schedule a function to run at a fixed interval, as below (note that setInterval takes the delay in milliseconds):
const waitMilliseconds = 21 * 24 * 60 * 60 * 1000; // 21 days in milliseconds
function scheduledFunction() {
  // Do something
}
// Run now
scheduledFunction();
// Run again every 21 days
setInterval(scheduledFunction, waitMilliseconds);
For more information on how to schedule a Pod restart, refer to this SO link.
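As an illustration only (this is my own sketch, not the content of the linked answer; the names and the deploy-restart ServiceAccount are placeholders, and that ServiceAccount needs RBAC permission to patch Deployments), a CronJob along these lines could trigger a rolling restart on the schedule above:

apiVersion: batch/v1              # batch/v1beta1 on older clusters
kind: CronJob
metadata:
  name: restart-my-app            # placeholder name
spec:
  schedule: "0 0 21 * *"          # 00:00 on day-of-month 21, as above
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: deploy-restart   # placeholder; needs RBAC to patch Deployments
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl           # any image with kubectl works
              command: ["kubectl", "rollout", "restart", "deployment/my-app"]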

Disable InboundChannelAdapter / Poller with autoStartup false

In Spring Integration, I want to disable a poller by setting autoStartup=false on the InboundChannelAdapter. I have two Tomcat instances with the same code deployed, and I want the pollers disabled on one of the instances, since I do not want the same job polling on both Tomcat instances concurrently. However, with the following setup, none of my pollers fire on either Tomcat instance 1 or Tomcat instance 2.
Here is the InboundChannelAdapter:
@Bean
@InboundChannelAdapter(value = "irsDataPrepJobInputWeekdayChannel",
        poller = @Poller(cron = "${batch.job.schedule.cron.weekdays.irsDataPrepJobRunner}", maxMessagesPerPoll = "1"),
        autoStartup = "${batch.job.schedule.cron.weekdays.irsDataPrepJobRunner.autoStartup}")
public MessageSource<JobLaunchRequest> pollIrsDataPrepWeekdayJob() {
    return () -> new GenericMessage<>(requestIrsDataPrepWeekdayJob());
}
The property files are as follows. Property file for Tomcat instance 1:
# I wish for this job to run on Tomcat instance 1
batch.job.schedule.cron.riStateAgencyTransmissionJobRunner=0 50 14 * * *
# since autoStartup defaults to true, I do not provide:
#batch.job.schedule.cron.riStateAgencyTransmissionJobRunner.autoStartup=true
# I do NOT wish for this job to run on Tomcat instance 1
batch.job.schedule.cron.weekdays.irsDataPrepJobRunner.autoStartup=false
# need to supply as poller has a cron placeholder
batch.job.schedule.cron.weekdays.irsDataPrepJobRunner=0 0/7 * * * 1-5
Property file for Tomcat instance 2:
# I wish for this job to run on Tomcat instance 2
batch.job.schedule.cron.weekdays.irsDataPrepJobRunner=0 0/7 * * * 1-5
# since autoStartup defaults to true, I do not provide:
#batch.job.schedule.cron.weekdays.irsDataPrepJobRunner.autoStartup=true
# I do NOT wish for this job to run on Tomcat instance 2
batch.job.schedule.cron.riStateAgencyTransmissionJobRunner.autoStartup=false
# need to supply as poller has a cron placeholder
batch.job.schedule.cron.riStateAgencyTransmissionJobRunner=0 50 14 * * *
The property files are passed in as a VM option, e.g. "-Druntime.scheduler=dev1". I cannot disable the poller on one of the JVMs by using "-" as the cron expression -- something similar to the ask here: Poller annotation with cron expression should support a special disable character.
My goal of being able to call the job manually from either Tomcat instance 1 or Tomcat instance 2 is working. My problem with the setup mentioned above is that none of the pollers fire as per their cron expressions.
Consider investigating the leader election pattern: https://docs.spring.io/spring-integration/docs/current/reference/html/messaging-endpoints.html#leadership-event-handling.
That way you keep those endpoints in a non-started state by default and give them the same role. The election then chooses a leader and starts the endpoints only on that instance.
Of course, there has to be some shared external service to coordinate leadership.
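As a rough sketch only (the candidate id, the "cluster" role and the choice of a JDBC-backed lock registry are my assumptions, not part of the original setup), the leader-election side could be wired roughly like this, using a database both Tomcat instances can reach (spring-integration-jdbc and its INT_LOCK table are required for the JdbcLockRegistry):

@Bean
public DefaultLockRepository lockRepository(DataSource dataSource) {
    return new DefaultLockRepository(dataSource);
}

@Bean
public JdbcLockRegistry lockRegistry(DefaultLockRepository lockRepository) {
    return new JdbcLockRegistry(lockRepository);
}

@Bean
public LockRegistryLeaderInitiator leaderInitiator(LockRegistry lockRegistry) {
    // Competes for a shared lock and publishes OnGrantedEvent / OnRevokedEvent for the "cluster" role.
    return new LockRegistryLeaderInitiator(lockRegistry, new DefaultCandidate("irsDataPrepLeader", "cluster"));
}

Endpoints that carry that role (e.g. annotated with @Role("cluster")) and are declared with autoStartup = "false" are then started by the framework's SmartLifecycleRoleController on whichever instance wins leadership, and stopped again if leadership is revoked.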

How to cleanup failed CronJob spawned Jobs once a more recent job passes

I am running management tasks using Kubernetes CronJobs and have Prometheus alerting when one of the spawned Jobs fails, using kube-state-metrics:
kube_job_status_failed{job="kube-state-metrics"} > 0
I want the failed Jobs to be cleaned up once a more recent Job passes, so that the alert stops firing.
Does the CronJob resource support this behaviour on its own?
Workarounds would be to make the Job clean up failed ones as its last step, or to create a much more complicated alert rule that takes the most recent Job as the definitive status, but those are not the nicest solutions IMO.
Kubernetes version: v1.15.1
As a workaround, the following query would show CronJobs where the last finished Job has failed:
(max by(owner_name, namespace) (kube_job_status_start_time * on(job_name) group_left(owner_name) ((kube_job_status_succeeded / kube_job_status_succeeded == 1) + on(job_name) group_left(owner_name) (0 * kube_job_owner{owner_is_controller="true",owner_kind="CronJob"}))))
< bool
(max by(owner_name, namespace) (kube_job_status_start_time * on(job_name) group_left(owner_name) ((kube_job_status_failed / kube_job_status_failed == 1) + on(job_name) group_left(owner_name) (0 * kube_job_owner{owner_is_controller="true",owner_kind="CronJob"})))) == 1
There's a great Kubernetes guide on cleaning up finished Jobs.
Specifically, see ttlSecondsAfterFinished, defined in the JobSpec API.
This should do roughly what you're asking: each finished Job, including the failed ones, is deleted automatically once its TTL elapses, so stale failures don't keep the alert firing.
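For illustration (the names, schedule and TTL value are placeholders; note that on v1.15 this field was still alpha behind the TTLAfterFinished feature gate and CronJob lived under batch/v1beta1), the TTL is set on the Job template inside the CronJob:

apiVersion: batch/v1              # batch/v1beta1 on v1.15
kind: CronJob
metadata:
  name: management-task           # placeholder
spec:
  schedule: "*/30 * * * *"        # placeholder schedule
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 3600   # delete each finished Job (failed or succeeded) after 1 hour
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: task
              image: example/task:latest   # placeholder image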

Kubernetes jobs and back-off limit values: is the value a number of retries or minutes?

I was reading the Kubernetes documentation about jobs and retries. I found this:
There are situations where you want to fail a Job after some amount of
retries due to a logical error in configuration etc. To do so, set
.spec.backoffLimit to specify the number of retries before considering
a Job as failed. The back-off limit is set by default to 6. Failed
Pods associated with the Job are recreated by the Job controller with
an exponential back-off delay (10s, 20s, 40s …) capped at six minutes.
The back-off count is reset if no new failed Pods appear before the
Job’s next status check.
I had two questions about the above quote:
Is the back-off limit value in minutes or a number of retries? The documentation's use of the value 6 (six) is confusing, because it initially states that the value is the number of retries, but then says "capped at six minutes".
Is there a way to define the back-off delay time? As I understand it, this behaviour (10s, 20s, 40s, ...) is the default and can't be changed.
No confusion here: .spec.backoffLimit is the number of retries (a minimal manifest showing where it is set follows the list below).
The Job controller recreates the failed Pods (associated with the Job) with an exponential back-off delay (10s, 20s, 40s, ..., capped at 360s), and this delay is fixed by the Job controller:
If the Pod fails, a new Pod is created after 10s
If it fails again, the next one is created after 20s
If it fails again, the next one comes after 40s
If it fails again, the next one comes after 80s (1m 20s)
If it fails again, the next one comes after 160s (2m 40s)
If it fails again, the next one comes after 320s (5m 20s)
If it fails again, the next one comes after 360s (not 640s, because the delay is capped at 360s, i.e. 6m)
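For reference, a minimal Job manifest (name, image and command are placeholders) showing where the retry count is set:

apiVersion: batch/v1
kind: Job
metadata:
  name: example-job              # placeholder
spec:
  backoffLimit: 6                # number of retries before the Job is marked failed
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: example/app:latest   # placeholder image
          command: ["do-work"]        # placeholder command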
By looking at the source code, it seems like the backoffLimit attribute specifies the failure count rather than failure time.
Excerpt of the code mentioned above:
func (jm *Controller) syncJob(ctx context.Context, key string) (forget bool, rErr error) {
    // ...
    succeeded, failed := getStatus(&job, pods, uncounted, expectedRmFinalizers)
    // ...
    // "failed" is a count of failed Pods; it is compared against BackoffLimit,
    // i.e. the limit is a retry count, not a duration.
    jobHasNewFailure := failed > job.Status.Failed
    exceedsBackoffLimit := jobHasNewFailure && (active != *job.Spec.Parallelism) &&
        (failed > *job.Spec.BackoffLimit)
    // ...
}

ECS/Fargate - can I schedule a job to run every 5 minutes UNLESS its already running?

I've got an ECS/Fargate task that runs every five minutes. Is there a way to tell it not to run if the prior instance is still working? At the moment I'm just passing it a cron expression, and there's nothing in the AWS cron/rate docs about blocking subsequent runs.
Conceptually, I'm looking for something similar to Spring's @Scheduled(fixedDelay=xxx), where it runs every five minutes after the previous run finishes.
EDIT: I've created the task using CloudFormation, not the CLI.
This solution works if you are using CloudWatch Logs for your ECS application:
Have your script emit a 'task completed' or 'script successfully completed running' message so you can track it later on.
Using the describeLogStreams call, first retrieve the latest log stream. In your case, this will be the stream that was created for the task which ran 5 minutes ago.
Once you have the name of that stream, check the last few logged events (text printed in the stream) to see whether the expected 'task completed' event was printed. Use the getLogEvents call for this.
If it wasn't, don't launch the next task, and instead wait or handle it as needed.
Schedule your script to run every 5 minutes as you would normally.
API links to the aws-sdk docs are below. This script is written in JavaScript and uses the AWS SDK for JavaScript (https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS.html), but you can use boto3 for Python or a different library for other languages.
API ref for describeLogStreams
API ref for getLogEvents
const AwsSdk = require('aws-sdk');

const logGroupName = 'logGroupName';
const cloudwatchlogs = new AwsSdk.CloudWatchLogs({
  apiVersion: '2014-03-28', region: 'us-east-1'
});

// Get the latest log stream for your task's log group.
// Limit results to 1 to get only one stream back.
const descLogStreamsParams = {
  logGroupName: logGroupName,
  descending: true,
  limit: 1,
  orderBy: 'LastEventTime'
};
cloudwatchlogs.describeLogStreams(descLogStreamsParams, (err, data) => {
  if (err) throw err;
  // Log stream for the previous task run..
  const latestLogStreamName = data.logStreams[0].logStreamName;
  // Call getLogEvents to read from this log stream now..
  const getEventsParams = {
    logGroupName: logGroupName,
    logStreamName: latestLogStreamName,
  };
  cloudwatchlogs.getLogEvents(getEventsParams, (err, data) => {
    if (err) throw err;
    const latestParsedMessage = JSON.parse(data.events[0].message);
    // Loop over data.events to get the last n messages
    // ...
  });
});
If you are launching the task with the CLI, the run-task command returns the task ARN.
You can then use this to check the status of that task:
aws ecs describe-tasks --cluster MYCLUSTER --tasks TASK-ARN --query 'tasks[0].lastStatus'
It will return RUNNING if it's still running, STOPPED if stopped, etc.
Note that Fargate is very aggressive about harvesting stopped tasks. If that command returns null, you can consider it STOPPED.
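A hypothetical wrapper along those lines (the cluster name, task definition and ARN file path are placeholders, and the network configuration that run-task requires for Fargate is omitted for brevity) might look like:

#!/bin/sh
# Skip this run if the previously launched task is still RUNNING.
PREV_ARN=$(cat /tmp/last-task-arn 2>/dev/null)
if [ -n "$PREV_ARN" ]; then
  STATUS=$(aws ecs describe-tasks --cluster MYCLUSTER --tasks "$PREV_ARN" \
    --query 'tasks[0].lastStatus' --output text)
  # Fargate harvests stopped tasks quickly; a null/None result means the task is gone.
  if [ "$STATUS" = "RUNNING" ]; then
    echo "Previous task still running; skipping this run."
    exit 0
  fi
fi
# Launch the next task and remember its ARN for the next check.
aws ecs run-task --cluster MYCLUSTER --task-definition MYTASK --launch-type FARGATE \
  --query 'tasks[0].taskArn' --output text > /tmp/last-task-arn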