ECS/Fargate - can I schedule a job to run every 5 minutes UNLESS it's already running?

I've got an ECS/Fargate task that runs every five minutes. Is there a way to tell it not to run if the prior instance is still working? At the moment I'm just passing it a cron expression, and there's nothing in the AWS cron/rate docs about blocking subsequent runs.
Conceptually I'm looking for something similar to Spring's @Scheduled(fixedDelay=xxx), where it runs again five minutes after it finishes.
EDIT - I've created the task using CloudFormation, not the CLI.

This solution works if you are using CloudWatch Logs for your ECS application:
- Have your script emit a 'task completed' or 'script successfully completed running' message so you can track it later on.
- Using the describeLogStreams function, first retrieve the latest log stream. This will be the stream that was created for the task which ran 5 minutes ago, in your case.
- Once you have the name of the stream, check the last few logged events (text printed in the stream) for the task-completed message your stream should have printed. Use the getLogEvents function for this.
- If it isn't there, don't launch the next task; wait or handle it as needed.
- Schedule your script to run every 5 minutes as you would normally.
API links to the aws-sdk docs are below. This script is written in JavaScript and uses the AWS SDK (https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS.html), but you can use boto3 for Python or a different library for other languages; a boto3 sketch follows the JavaScript example.
API ref for describeLogStreams
API ref for getLogEvents
const AwsSdk = require('aws-sdk');

const logGroupName = 'logGroupName';
const cloudwatchlogs = new AwsSdk.CloudWatchLogs({
  apiVersion: '2014-03-28', region: 'us-east-1'
});

// Get the latest log stream for your task's log group.
// Limit results to 1 to get only one stream back.
const descLogStreamsParams = {
  logGroupName: logGroupName,
  descending: true,
  limit: 1,
  orderBy: 'LastEventTime'
};

cloudwatchlogs.describeLogStreams(descLogStreamsParams, (err, data) => {
  if (err) throw err;
  // Log stream for the previous task run.
  const latestLogStreamName = data.logStreams[0].logStreamName;
  // Call getLogEvents to read from this log stream now.
  const getEventsParams = {
    logGroupName: logGroupName,
    logStreamName: latestLogStreamName,
  };
  cloudwatchlogs.getLogEvents(getEventsParams, (err, data) => {
    if (err) throw err;
    // Assumes your task logs JSON messages.
    const latestParsedMessage = JSON.parse(data.events[0].message);
    // Loop over data.events to get the last n messages
    // ...
  });
});
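Since the answer mentions boto3 as the Python alternative, here is a minimal sketch of the same check in Python. The log group name and the completion marker are placeholders; adapt them to whatever your script actually logs.

import boto3

LOG_GROUP_NAME = 'logGroupName'        # placeholder, as in the JS example
COMPLETION_MARKER = 'task completed'   # assumed message your script emits

logs = boto3.client('logs', region_name='us-east-1')

def previous_run_finished():
    # Latest stream, i.e. the one created for the previous task run.
    streams = logs.describe_log_streams(
        logGroupName=LOG_GROUP_NAME,
        orderBy='LastEventTime',
        descending=True,
        limit=1,
    )['logStreams']
    if not streams:
        return True  # nothing has ever run and logged
    # Read only the last few events from that stream.
    events = logs.get_log_events(
        logGroupName=LOG_GROUP_NAME,
        logStreamName=streams[0]['logStreamName'],
        limit=10,
        startFromHead=False,
    )['events']
    return any(COMPLETION_MARKER in e['message'] for e in events)

if not previous_run_finished():
    print('previous task still running - skipping this cycle')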

If you are launching the task with the CLI, the run-task command will return the task ARN.
You can then use it to check the status of that task:
aws ecs describe-tasks --cluster MYCLUSTER --tasks TASK-ARN --query 'tasks[0].lastStatus'
It will return RUNNING if it's still running, STOPPED if stopped, etc.
Note that Fargate is very aggressive about harvesting stopped tasks. If that command returns null, you can consider it STOPPED.
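A minimal sketch of that guard with boto3, in case you drive the schedule from a script or Lambda: the cluster, task definition, subnet, and the way you persist the previous task ARN between runs are all hypothetical placeholders.

import boto3

ecs = boto3.client('ecs', region_name='us-east-1')

def still_running(cluster, task_arn):
    """True if the previously launched task has not stopped yet."""
    if not task_arn:
        return False
    resp = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])
    if not resp['tasks']:
        # Fargate has already harvested the stopped task: treat as STOPPED.
        return False
    return resp['tasks'][0]['lastStatus'] != 'STOPPED'

def launch_if_idle(cluster, task_definition, previous_task_arn):
    if still_running(cluster, previous_task_arn):
        print('previous task still running - skipping this cycle')
        return previous_task_arn
    resp = ecs.run_task(
        cluster=cluster,
        taskDefinition=task_definition,
        launchType='FARGATE',
        networkConfiguration={'awsvpcConfiguration': {
            'subnets': ['subnet-0123456789abcdef0'],  # hypothetical subnet
            'assignPublicIp': 'ENABLED',
        }},
    )
    return resp['tasks'][0]['taskArn']  # persist this for the next cycle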

Related

How to monitor a quartz scheduler job?

I am very new to Quartz Scheduler. I am aware that we can enable logs for Quartz jobs and triggers with the following configuration:
org.quartz.plugin.jobHistory.class: org.quartz.plugins.history.LoggingJobHistoryPlugin
# Format of Log Generated
org.quartz.plugin.jobHistory.jobSuccessMessage= Job [{1}.{0}] execution complete and reports: { 8 }
org.quartz.plugin.jobHistory.jobToBeFiredMessage= Job [{1}.{0}] to be fired by trigger [{4}.{3}], re-fire: { 7 }
org.quartz.plugin.triggHistory.class= org.quartz.plugins.history.LoggingTriggerHistoryPlugin
# Format of Log Generated
org.quartz.plugin.triggHistory.triggerFiredMessage= Trigger \{1\}.\{0\} fired job \{6\}.\{5\} at: \{4, date, HH:mm:ss MM/dd/yyyy\}
org.quartz.plugin.triggHistory.triggerCompleteMessage= Trigger \{1\}.\{0\} completed firing job \{6\}.\{5\} at \{4, date, HH:mm:ss MM/dd/yyyy\}
But I am trying to understand whether there is any way to directly get quantitative metrics, like how many jobs are currently running or the duration of each job.
I am also aware of various tools like quartz-dask which provide a UI for the said metrics, but I am more interested in metrics that I could in turn push to my Prometheus instance.

How to wait until a job is done or a file is updated in airflow

I am trying to use apache-airflow, with Google Cloud Composer, to schedule batch processing that results in the training of a model with Google AI Platform. I failed to use Airflow operators, as I explain in this question: unable to specify master_type in MLEngineTrainingOperator.
Using the command line I managed to launch a job successfully.
So now my issue is integrating this command into Airflow.
Using BashOperator I can train the model, but I need to wait for the job to complete before creating a version and setting it as the default. This DAG creates a version before the job is done:
bash_command_train = "gcloud ai-platform jobs submit training training_job_name " \
    "--packages=gs://path/to/the/package.tar.gz " \
    "--python-version=3.5 --region=europe-west1 --runtime-version=1.14" \
    " --module-name=trainer.train --scale-tier=CUSTOM --master-machine-type=n1-highmem-16"

bash_train_operator = BashOperator(task_id='train_with_bash_command',
                                   bash_command=bash_command_train,
                                   dag=dag,)
...
create_version_op = MLEngineVersionOperator(
    task_id='create_version',
    project_id=PROJECT,
    model_name=MODEL_NAME,
    version={
        'name': version_name,
        'deploymentUri': export_uri,
        'runtimeVersion': RUNTIME_VERSION,
        'pythonVersion': '3.5',
        'framework': 'SCIKIT_LEARN',
    },
    operation='create')

set_version_default_op = MLEngineVersionOperator(
    task_id='set_version_as_default',
    project_id=PROJECT,
    model_name=MODEL_NAME,
    version={'name': version_name},
    operation='set_default')

# Ordering the tasks
bash_train_operator >> create_version_op >> set_version_default_op
The training results in the update of a file in Google Cloud Storage, so I am looking for an operator or a sensor that will wait until this file is updated. I noticed GoogleCloudStorageObjectUpdatedSensor, but I don't know how to make it retry until the file is updated.
Another solution would be to check for the job to be completed, but I can't find out how.
Any help would be greatly appreciated.
The Google Cloud documentation describes the --stream-logs flag as follows:
"Block until job completion and stream the logs while the job runs."
Add this flag to bash_command_train and I think it should solve your problem: the command will only return once the job is finished, so Airflow will then mark the task as a success. It will also let you monitor your training job's logs in Airflow.
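For illustration, here is the question's own operator with the flag added; apart from --stream-logs, everything is unchanged from the DAG above.

bash_command_train = (
    "gcloud ai-platform jobs submit training training_job_name "
    "--stream-logs "  # block until the job completes, streaming its logs
    "--packages=gs://path/to/the/package.tar.gz "
    "--python-version=3.5 --region=europe-west1 --runtime-version=1.14 "
    "--module-name=trainer.train --scale-tier=CUSTOM "
    "--master-machine-type=n1-highmem-16"
)

bash_train_operator = BashOperator(task_id='train_with_bash_command',
                                   bash_command=bash_command_train,
                                   dag=dag,)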

How to send logs to Cloudwatch from a background process running in AWS Fargate containers?

I'm using Fargate. My container runs two processes: a Celery worker in the background and Django in the foreground. The foreground process emits logs to stdout, hence AWS takes care of sending the Django logs to the concerned CloudWatch log group and stream.
Since the worker runs in the background, how do I send its logs to CloudWatch (to a different log stream within the same log group)?
If there's no way to move the second process to a separate container and log it as usual, you may install the awslogs package in the container and set it up to read the background process's log files and send their content to CloudWatch.
But I would not recommend such an approach.
Again, this is not necessarily a Fargate-based issue or question. For logging in Celery, check this out: http://docs.celeryproject.org/en/latest/userguide/tasks.html#logging
The worker won't update the redirection if you create a logger instance somewhere in your task or task module.
If you want to redirect sys.stdout and sys.stderr to a custom logger, you have to enable this manually, for example:
import sys

from celery.utils.log import get_task_logger

logger = get_task_logger(__name__)

@app.task(bind=True)
def add(self, x, y):
    old_outs = sys.stdout, sys.stderr
    rlevel = self.app.conf.worker_redirect_stdouts_level
    try:
        self.app.log.redirect_stdouts_to_logger(logger, rlevel)
        print('Adding {0} + {1}'.format(x, y))
        return x + y
    finally:
        sys.stdout, sys.stderr = old_outs
And for logging with Fargate, I would use the awslogs log driver, configured as documented here: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/using_awslogs.html
You can set the awslogs driver on the container definition either in the console or via the LogConfiguration property of the container definition in a CloudFormation template.
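A hedged sketch of that logConfiguration, expressed with boto3's register_task_definition; the family, image, role ARN, and log group names are hypothetical placeholders.

import boto3

ecs = boto3.client('ecs', region_name='us-east-1')

# The logConfiguration block is the part that matters; it mirrors the
# awslogs options described in the linked documentation.
ecs.register_task_definition(
    family='django-celery',  # hypothetical family name
    requiresCompatibilities=['FARGATE'],
    networkMode='awsvpc',
    cpu='512',
    memory='1024',
    executionRoleArn='arn:aws:iam::123456789012:role/ecsTaskExecutionRole',  # placeholder
    containerDefinitions=[{
        'name': 'app',
        'image': '123456789012.dkr.ecr.us-east-1.amazonaws.com/app:latest',  # placeholder
        'logConfiguration': {
            'logDriver': 'awslogs',
            'options': {
                'awslogs-group': '/ecs/django-celery',
                'awslogs-region': 'us-east-1',
                'awslogs-stream-prefix': 'app',
            },
        },
    }],
)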
Like @OK999 said, Celery is designed to swallow logs, whether it's on Fargate or not. We ended up using a Django LOGGING config like:
LOGGING = {
    'version': 1,
    # This only "disables" but the loggers don't propagate
    # 'disable_existing_loggers': False,
    ...
    'handlers': {
        'console': {
            'level': env.str('LOGGING_LEVEL'),
            'class': 'logging.StreamHandler',
            'formatter': 'verbose',
        },
    },
    'loggers': {
        ...
        # celery won't route logs to console without this
        'celery': {
            # filtered at the handler
            'level': logging.DEBUG,
            'handlers': ['console'],
        },
        ...
    },
}
We had to make this change long before transitioning to Fargate.

How does Spring Batch Admin stop a running job?

How does Spring Batch Admin stop a running job from the UI?
In Spring Batch Admin's online documentation I have read the following lines:
"A job that is executing can be stopped by the user (whether or not it
is launchable). The stop signal is sent via the database and once
detected by Spring Batch in whatever process is running the job, the
job is stopped (status moves from STOPPING to STOPPED) and no further
processing takes place."
Does that mean the Spring Batch Admin UI directly changes the status of the job inside the Spring Batch tables?
UPDATE: I tried executing the below query on the running job:
update batch_job_execution set status="STOPPED" where job_instance_id=19;
The query updates the row in the DB, but Spring Batch is not able to stop the running job.
If anybody has tried this, please do share the logic here.
You're confusing BatchStatus with ExitStatus.
What you are doing with that SQL is changing the STATUS to STOPPED.
When a job is running you can stop it from code: in each step iteration, check the status, and if STOPPING is set, stop the step.
Anyway, what you are doing is not elegant. The correct way is explained in Common Batch Patterns -> 11.2 Stopping a Job Manually for Business Reasons:
public class FooProcessor implements ItemProcessor<FooIn, FooOut> {
    public FooOut process(FooIn foo) throws Exception {
        if (sendToStop(foo)) {
            throw new MyStopException("I need to stop: " + foo);
        }
        // do my stuff
        return new FooOut(foo);
    }
}
Another simple way to stop a chunk step is to return null in the reader; this signals that there are no more elements for the reader to iterate:
public T read() throws Exception {
    T item = delegate.read();
    if (ifNeedStop(item)) {
        return null; // end the step here
    }
    return item;
}
I investigated the spring batch code.
It seems they update both the version and status of the BATCH_JOB_EXECUTION.
This works for me:
update batch_job_execution set status="STOPPED", version=version+1 where job_instance_id=19;
If you look into the jars of Spring Batch Admin, you can see that AbstractStep.java checks the status of the step and job in the database.
Based on this status it validates the step before running it.
This works well in all cases except inside a chunk, since the status is only re-checked after a large amount of processing. If you want to handle that case, you can implement your own listener to check the status (but it will increase DB hits).

Find if a job is running in Quartz 1.6

I would like to clarify details of the scheduler.getCurrentlyExecutingJobs() method in Quartz 1.6. I have a job that should have only one instance running at any given moment. It can be triggered to "run now" from a UI, but if a job instance is already running for this job, nothing should happen.
This is how I check whether there is a job running that interests me:
List<JobExecutionContext> currentJobs = scheduler.getCurrentlyExecutingJobs();
for (JobExecutionContext jobCtx : currentJobs) {
    jobName = jobCtx.getJobDetail().getName();
    groupName = jobCtx.getJobDetail().getGroup();
    if (jobName.equalsIgnoreCase("job_I_am_looking_for_name") &&
        groupName.equalsIgnoreCase("job_group_I_am_looking_for_name")) {
        // found it!
        logger.warn("the job is already running - do nothing");
    }
}
Then, to test this, I have a unit test that tries to schedule two instances of this job, one after the other. I was expecting to see the warning when trying to schedule the second job; however, instead, I'm getting this exception:
org.quartz.ObjectAlreadyExistsException: Unable to store Job with name:
'job_I_am_looking_for_name' and group: 'job_group_I_am_looking_for_name',
because one already exists with this identification.
When I run this unit test in debug mode, with a breakpoint on this line:
List<JobExecutionContext> currentJobs = scheduler.getCurrentlyExecutingJobs();
I see that the list is empty - so the scheduler does not see this job as running, but it still fails to schedule it again - which tells me the job was indeed running at the time...
Am I missing some finer points with this scheduler method?
Thanks!
Marina
For the benefit of others, I'm posting an answer to the issue I was having - I got help from Zemian Deng on the Terracotta forum: posting on Terracotta's forum
Here is the recap:
The actual checking of the running jobs was working fine - it was just timing in the unit tests, of course. I added some sleeping in the job and tweaked the unit tests to schedule the second job while the first one was still running - and verified that I could indeed find the first job still running.
The exception I was getting arose because I was trying to schedule a new job with the same name, rather than trigger the job already stored in the scheduler. The following code worked exactly as I needed:
List<JobExecutionContext> currentJobs = scheduler.getCurrentlyExecutingJobs();
for (JobExecutionContext jobCtx : currentJobs) {
    jobName = jobCtx.getJobDetail().getName();
    groupName = jobCtx.getJobDetail().getGroup();
    if (jobName.equalsIgnoreCase("job_I_am_looking_for_name") &&
        groupName.equalsIgnoreCase("job_group_I_am_looking_for_name")) {
        // found it!
        logger.warn("the job is already running - do nothing");
        return;
    }
}
// check if this job is already stored in the scheduler
JobDetail emailJob;
emailJob = scheduler.getJobDetail("job_I_am_looking_for_name", "job_group_I_am_looking_for_name");
if (emailJob == null) {
    // this job is not in the scheduler yet
    // create a JobDetail object for my job
    emailJob = jobFactory.getObject();
    emailJob.setName("job_I_am_looking_for_name");
    emailJob.setGroup("job_group_I_am_looking_for_name");
    scheduler.addJob(emailJob, true);
}
// this job is in the scheduler and it is not running right now - run it now
scheduler.triggerJob("job_I_am_looking_for_name", "job_group_I_am_looking_for_name");
Thanks!
Marina