How to wait until a job is done or a file is updated in airflow - google-cloud-storage

I am trying to use apache-airflow, with google cloud-composer, to shedule batch processing that result in the training of a model with google ai platform. I failed to use airflow operators as I explain in this question unable to specify master_type in MLEngineTrainingOperator
Using the command line I managed to launch a job successfully.
So now my issue is to integrate this command in airflow.
Using BashOperator I can train the model but I need to wait for the job to be completed before creating a version and setting it as the default. This DAG create a version before the job is done
bash_command_train = "gcloud ai-platform jobs submit training training_job_name " \
"--packages=gs://path/to/the/package.tar.gz " \
"--python-version=3.5 --region=europe-west1 --runtime-version=1.14" \
" --module-name=trainer.train --scale-tier=CUSTOM --master-machine-type=n1-highmem-16"
bash_train_operator = BashOperator(task_id='train_with_bash_command',
bash_command=bash_command_train,
dag=dag,)
...
create_version_op = MLEngineVersionOperator(
task_id='create_version',
project_id=PROJECT,
model_name=MODEL_NAME,
version={
'name': version_name,
'deploymentUri': export_uri,
'runtimeVersion': RUNTIME_VERSION,
'pythonVersion': '3.5',
'framework': 'SCIKIT_LEARN',
},
operation='create')
set_version_default_op = MLEngineVersionOperator(
task_id='set_version_as_default',
project_id=PROJECT,
model_name=MODEL_NAME,
version={'name': version_name},
operation='set_default')
# Ordering the tasks
bash_train_operator >> create_version_op >> set_version_default_op
The training result in updating of a file in Gcloud storage. So I am looking for an operator or a sensor that will wait until this file is updated, I noticed GoogleCloudStorageObjectUpdatedSensor, but I dont know how to make it retry until this file is updated.
An other solution would be to check for the job to be completed, but I can't find how too.
Any help would be greatly appreciated.

The Google Cloud documentation for the --stream-logs flag:
"Block until job completion and stream the logs while the job runs."
Add this flag to bash_command_train and I think it should solve your problem. The command should only release once the job is finished, then Airflow will mark it as success. It will also let you monitor your training job's logs in Airflow.

Related

Beam SDK harness still trying to launch docker when I set environment_type to be `PROCESS`

According to the beam harness documentation:
PROCESS: User code is executed by processes that are automatically started by the runner on each worker node.
args = [
"--runner=portableRunner",
"--streaming",
"--sdk_worker_parallelism=2",
"--environment_type=PROCESS",
"--environment_config={\"command\": \"/opt/apache/beam/boot\"}",
]
consumer_config = {
"security.protocol": "SASL_SSL",
"sasl.mechanism": "AWS_MSK_IAM",
"sasl.jaas.config": "software.amazon.msk.auth.iam.IAMLoginModule required;",
"sasl.client.callback.handler.class": "software.amazon.msk.auth.iam.IAMClientCallbackHandler",
"bootstrap.servers": bootstrap_servers,
}
with beam.Pipeline(options=PipelineOptions(args)) as p:
data = p | "Reading messages from Kafka" >> ReadFromKafka(
consumer_config=consumer_config,
topics=topics,
with_metadata=True
)
data | 'Writing to stdout' >> beam.Map(logging.info)
But when I run the code (deployed to k8s using flinkk8soperator), it is complaining:
Caused by: java.io.IOException: Cannot run program "docker": error=2, No such file or directory
Wondering if I misunderstand anything? Thanks!
After couple digging, I finally make the cross language work without using DinD or DooD. Here's the steps:
Ensure both job and task manager mount a shared volume for artifact staging. (This is required, otherwise the task manager will complained unable to find the submitted jar)
Ensure your docker image can run both java and python beam code, here's what I did:
# python SDK
COPY --from=apache/beam_python3.7_sdk:2.41.0 /opt/apache/beam/ /opt/apache/beam/
# java SDK
COPY --from=apache/beam_java8_sdk:2.41.0 /opt/apache/beam/ /opt/apache/beam_java/
In the job, you'll need to start the expansion service with extra args, for example the KafkaIo:
from apache_beam.io.kafka import ReadFromKafka, default_io_expansion_service
ReadFromKafka(
consumer_config=consumer_config,
topics=[topic],
with_metadata=False,
expansion_service=default_io_expansion_service(
append_args=[
'--defaultEnvironmentType=PROCESS',
"--defaultEnvironmentConfig={\"command\":\"/opt/apache/beam_java/boot\"}",
]
)
You portable execution relies on xLang support that requires starting a Java SDK with docker. Your cluster doesn't have docker installed.

Is there a function in celery for finding waiting messages in a queue? [duplicate]

How can I retrieve a list of tasks in a queue that are yet to be processed?
EDIT: See other answers for getting a list of tasks in the queue.
You should look here:
Celery Guide - Inspecting Workers
Basically this:
my_app = Celery(...)
# Inspect all nodes.
i = my_app.control.inspect()
# Show the items that have an ETA or are scheduled for later processing
i.scheduled()
# Show tasks that are currently active.
i.active()
# Show tasks that have been claimed by workers
i.reserved()
Depending on what you want
If you are using Celery+Django simplest way to inspect tasks using commands directly from your terminal in your virtual environment or using a full path to celery:
Doc: http://docs.celeryproject.org/en/latest/userguide/workers.html?highlight=revoke#inspecting-workers
$ celery inspect reserved
$ celery inspect active
$ celery inspect registered
$ celery inspect scheduled
Also if you are using Celery+RabbitMQ you can inspect the list of queues using the following command:
More info: https://linux.die.net/man/1/rabbitmqctl
$ sudo rabbitmqctl list_queues
if you are using rabbitMQ, use this in terminal:
sudo rabbitmqctl list_queues
it will print list of queues with number of pending tasks. for example:
Listing queues ...
0b27d8c59fba4974893ec22d478a7093 0
0e0a2da9828a48bc86fe993b210d984f 0
10#torob2.celery.pidbox 0
11926b79e30a4f0a9d95df61b6f402f7 0
15c036ad25884b82839495fb29bd6395 1
celerey_mail_worker#torob2.celery.pidbox 0
celery 166
celeryev.795ec5bb-a919-46a8-80c6-5d91d2fcf2aa 0
celeryev.faa4da32-a225-4f6c-be3b-d8814856d1b6 0
the number in right column is number of tasks in the queue. in above, celery queue has 166 pending task.
If you don't use prioritized tasks, this is actually pretty simple if you're using Redis. To get the task counts:
redis-cli -h HOST -p PORT -n DATABASE_NUMBER llen QUEUE_NAME
But, prioritized tasks use a different key in redis, so the full picture is slightly more complicated. The full picture is that you need to query redis for every priority of task. In python (and from the Flower project), this looks like:
PRIORITY_SEP = '\x06\x16'
DEFAULT_PRIORITY_STEPS = [0, 3, 6, 9]
def make_queue_name_for_pri(queue, pri):
"""Make a queue name for redis
Celery uses PRIORITY_SEP to separate different priorities of tasks into
different queues in Redis. Each queue-priority combination becomes a key in
redis with names like:
- batch1\x06\x163 <-- P3 queue named batch1
There's more information about this in Github, but it doesn't look like it
will change any time soon:
- https://github.com/celery/kombu/issues/422
In that ticket the code below, from the Flower project, is referenced:
- https://github.com/mher/flower/blob/master/flower/utils/broker.py#L135
:param queue: The name of the queue to make a name for.
:param pri: The priority to make a name with.
:return: A name for the queue-priority pair.
"""
if pri not in DEFAULT_PRIORITY_STEPS:
raise ValueError('Priority not in priority steps')
return '{0}{1}{2}'.format(*((queue, PRIORITY_SEP, pri) if pri else
(queue, '', '')))
def get_queue_length(queue_name='celery'):
"""Get the number of tasks in a celery queue.
:param queue_name: The name of the queue you want to inspect.
:return: the number of items in the queue.
"""
priority_names = [make_queue_name_for_pri(queue_name, pri) for pri in
DEFAULT_PRIORITY_STEPS]
r = redis.StrictRedis(
host=settings.REDIS_HOST,
port=settings.REDIS_PORT,
db=settings.REDIS_DATABASES['CELERY'],
)
return sum([r.llen(x) for x in priority_names])
If you want to get an actual task, you can use something like:
redis-cli -h HOST -p PORT -n DATABASE_NUMBER lrange QUEUE_NAME 0 -1
From there you'll have to deserialize the returned list. In my case I was able to accomplish this with something like:
r = redis.StrictRedis(
host=settings.REDIS_HOST,
port=settings.REDIS_PORT,
db=settings.REDIS_DATABASES['CELERY'],
)
l = r.lrange('celery', 0, -1)
pickle.loads(base64.decodestring(json.loads(l[0])['body']))
Just be warned that deserialization can take a moment, and you'll need to adjust the commands above to work with various priorities.
To retrieve tasks from backend, use this
from amqplib import client_0_8 as amqp
conn = amqp.Connection(host="localhost:5672 ", userid="guest",
password="guest", virtual_host="/", insist=False)
chan = conn.channel()
name, jobs, consumers = chan.queue_declare(queue="queue_name", passive=True)
A copy-paste solution for Redis with json serialization:
def get_celery_queue_items(queue_name):
import base64
import json
# Get a configured instance of a celery app:
from yourproject.celery import app as celery_app
with celery_app.pool.acquire(block=True) as conn:
tasks = conn.default_channel.client.lrange(queue_name, 0, -1)
decoded_tasks = []
for task in tasks:
j = json.loads(task)
body = json.loads(base64.b64decode(j['body']))
decoded_tasks.append(body)
return decoded_tasks
It works with Django. Just don't forget to change yourproject.celery.
This worked for me in my application:
def get_celery_queue_active_jobs(queue_name):
connection = <CELERY_APP_INSTANCE>.connection()
try:
channel = connection.channel()
name, jobs, consumers = channel.queue_declare(queue=queue_name, passive=True)
active_jobs = []
def dump_message(message):
active_jobs.append(message.properties['application_headers']['task'])
channel.basic_consume(queue=queue_name, callback=dump_message)
for job in range(jobs):
connection.drain_events()
return active_jobs
finally:
connection.close()
active_jobs will be a list of strings that correspond to tasks in the queue.
Don't forget to swap out CELERY_APP_INSTANCE with your own.
Thanks to #ashish for pointing me in the right direction with his answer here: https://stackoverflow.com/a/19465670/9843399
The celery inspect module appears to only be aware of the tasks from the workers perspective. If you want to view the messages that are in the queue (yet to be pulled by the workers) I suggest to use pyrabbit, which can interface with the rabbitmq http api to retrieve all kinds of information from the queue.
An example can be found here:
Retrieve queue length with Celery (RabbitMQ, Django)
I think the only way to get the tasks that are waiting is to keep a list of tasks you started and let the task remove itself from the list when it's started.
With rabbitmqctl and list_queues you can get an overview of how many tasks are waiting, but not the tasks itself: http://www.rabbitmq.com/man/rabbitmqctl.1.man.html
If what you want includes the task being processed, but are not finished yet, you can keep a list of you tasks and check their states:
from tasks import add
result = add.delay(4, 4)
result.ready() # True if finished
Or you let Celery store the results with CELERY_RESULT_BACKEND and check which of your tasks are not in there.
As far as I know Celery does not give API for examining tasks that are waiting in the queue. This is broker-specific. If you use Redis as a broker for an example, then examining tasks that are waiting in the celery (default) queue is as simple as:
connect to the broker
list items in the celery list (LRANGE command for an example)
Keep in mind that these are tasks WAITING to be picked by available workers. Your cluster may have some tasks running - those will not be in this list as they have already been picked.
The process of retrieving tasks in particular queue is broker-specific.
I've come to the conclusion the best way to get the number of jobs on a queue is to use rabbitmqctl as has been suggested several times here. To allow any chosen user to run the command with sudo I followed the instructions here (I did skip editing the profile part as I don't mind typing in sudo before the command.)
I also grabbed jamesc's grep and cut snippet and wrapped it up in subprocess calls.
from subprocess import Popen, PIPE
p1 = Popen(["sudo", "rabbitmqctl", "list_queues", "-p", "[name of your virtula host"], stdout=PIPE)
p2 = Popen(["grep", "-e", "^celery\s"], stdin=p1.stdout, stdout=PIPE)
p3 = Popen(["cut", "-f2"], stdin=p2.stdout, stdout=PIPE)
p1.stdout.close()
p2.stdout.close()
print("number of jobs on queue: %i" % int(p3.communicate()[0]))
If you control the code of the tasks then you can work around the problem by letting a task trigger a trivial retry the first time it executes, then checking inspect().reserved(). The retry registers the task with the result backend, and celery can see that. The task must accept self or context as first parameter so we can access the retry count.
#task(bind=True)
def mytask(self):
if self.request.retries == 0:
raise self.retry(exc=MyTrivialError(), countdown=1)
...
This solution is broker agnostic, ie. you don't have to worry about whether you are using RabbitMQ or Redis to store the tasks.
EDIT: after testing I've found this to be only a partial solution. The size of reserved is limited to the prefetch setting for the worker.
from celery.task.control import inspect
def key_in_list(k, l):
return bool([True for i in l if k in i.values()])
def check_task(task_id):
task_value_dict = inspect().active().values()
for task_list in task_value_dict:
if self.key_in_list(task_id, task_list):
return True
return False
With subprocess.run:
import subprocess
import re
active_process_txt = subprocess.run(['celery', '-A', 'my_proj', 'inspect', 'active'],
stdout=subprocess.PIPE).stdout.decode('utf-8')
return len(re.findall(r'worker_pid', active_process_txt))
Be careful to change my_proj with your_proj
To get the number of tasks on a queue you can use the flower library, here is a simplified example:
from flower.utils.broker import Broker
from django.conf import settings
def get_queue_length(queue):
broker = Broker(settings.CELERY_BROKER_URL)
queues_result = broker.queues([queue])
return queues_result.result()[0]['messages']

How to monitor a quartz scheduler job?

I am very new to quartz scheduler. I am aware that we can enable logs for quartz jobs and triggers by doing the following configuration
org.quartz.plugin.jobHistory.class: org.quartz.plugins.history.LoggingJobHistoryPlugin
# Format of Log Generated
org.quartz.plugin.jobHistory.jobSuccessMessage= Job [{1}.{0}] execution complete and reports: { 8 }
org.quartz.plugin.jobHistory.jobToBeFiredMessage= Job [{1}.{0}] to be fired by trigger [{4}.{3}], re-fire: { 7 }
org.quartz.plugin.triggHistory.class= org.quartz.plugins.history.LoggingTriggerHistoryPlugin
# Format of Log Generated
org.quartz.plugin.triggHistory.triggerFiredMessage= Trigger \{1\}.\{0\} fired job \{6\}.\{5\} at: \{4, date, HH:mm:ss MM/dd/yyyy\}
org.quartz.plugin.triggHistory.triggerCompleteMessage= Trigger \{1\}.\{0\} completed firing job \{6\}.\{5\} at \{4, date, HH:mm:ss MM/dd/yyyy\}
But I am trying to understand if there is any way to directly get the quantitative metrics like how many jobs are currently running or duration for each job etc.
I am also aware of various tools like quartz-dask which gives a ui for the said metrics. But I am more interested in the metrics which in turn I could push to my prometheus instance

TypeError--using slurm queue to submit pyiron jobs

I'm facing some issues while running pyiron jobs on my HPC via the pysqa adapter. I had accidentally erased the main pyiron directory containing pyiron, projects and resources folders. I had copied all the three from another cluster. The only thing that I think will cause problem is sqlite.db file in the resources folder. Previously, I had no issues running interactive VASP jobs through the adapter. I'm guessing something happened after the deletion incident.
The pyiron version I'm using is: 0.2.17
Here is a minimal example using an Interactive vasp job that I have tried:
from pyiron import Project
pr = Project('Al-test')
structure = pr.create_structure('Al', 'fcc', 4.05)
pr.remove_jobs(recursive=True)
from pysqa import QueueAdapter
sqa = QueueAdapter(directory='~/pyiron/resources/queues/')
sqa.queue_view
pr.job_table()
job = pr.create_job(pr.job_type.Vasp, 'job_int')
job.structure = structure
job.server.run_mode.interactive = True
job.executable.executable_path = '~/pyiron/resources/vasp/bin/run_vasp_5.4.4_std_mpi.sh'
job.input.incar['NCORE']=4
job.server.queue = 'slurm'
job.server.cores=16
job.server.view_queues()
sqa.get_queue_status()
job.run(run_again=True)
end of the error log:
~/pyiron/pyiron/pyiron/base/server/generic.py in queue_id(self, qid)
208 qid (int): queue ID
209 """
--> 210 self._queue_id = int(qid)
211
212 #property
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
Some inputs/feedback on this would be greatly appreciated.
Thanks!
We updated the queuing system interface in pyiron 0.3.X you can read more about this here:
https://pyiron.org/news/releases/2020/09/06/pyiron-0-3-X-HPC-release.html
For pyiron 0.3.X we have a detailed installation guide available on readthedocs.org:
https://pyiron.readthedocs.io/en/latest/source/installation.html#remote-hpc-cluster
So I highly recommend updating to pyiron 0.3.13.
Apart from this the error message basically says that the submission was not successful. If you navigate to the jobs working directory job.working_directory you should find a run_queue.sh script in the working directory. This is the script pyiron is using to submit the job to the queuing system. You can try to submit it manually using sbatch run_queue.sh this should print the queue id if successful and otherwise the error message from your queuing system.

ECS/Fargate - can I schedule a job to run every 5 minutes UNLESS its already running?

I've got an ECS/Fargate task that runs every five minutes. Is there a way to tell it to not run if the prior instance is still working? At the moment I'm just passing it a cron expression, and there's nothing in the cron/rate aws doc about blocking subsequent runs.
Conseptually I'm looking for something similar to Spring's #Scheduled(fixedDelay=xxx) where it'll run every five minutes after it finishes.
EDIT - I've created the task using cloudformation, not the cli
This solution works if you are using Cloudwatch Logging for your ECS application
- Have your script emit a 'task completed' or 'script successfully completed running' message so you can track it later on.
Using the describeLogStreams function, first retrieve the latest log stream. This will be the stream that was created for the task which ran 5 minutes ago in your case.
Once you have the name of the stream, check the last few logged events (text printed in the stream) to see if it's the expected task completed event that your stream should have printed. Use the getLogEvents function for this.
If it isn't, don't launch the next task and invoke a wait or handle as needed
Schedule your script to run every 5 minutes as you would normally.
API links to aws-sdk docs are below. This script is written in JS and uses the AWS-SDK (https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS.html) but you can use boto3 for python or a different lib for other languages
API ref for describeLogStreams
API ref for getLogEvents
const logGroupName = 'logGroupName';
this.cloudwatchlogs = new AwsSdk.CloudWatchLogs({
apiVersion: '2014-03-28', region: 'us-east-1'
});
// Get the latest log stream for your task's log group.
// Limit results to 1 to get only one stream back.
var descLogStreamsParams = {
logGroupName: logGroupName,
descending: true,
limit: 1,
orderBy: 'LastEventTime'
};
this.cloudwatchlogs.describeLogStreams(descLogStreamsParams, (err, data) => {
// Log Stream for the previous task run..
const latestLogStreamName = data.logStreams[0].logStreamName;
// Call getLogEvents to read from this log stream now..
const getEventsParams = {
logGroupName: logGroupName,
logStreamName: latestLogStreamName,
};
this.cloudwatchlogs.getLogEvents(params, (err, data) => {
const latestParsedMessage = JSON.parse(data.events[0].message);
// Loop over this to get last n messages
// ...
});
});
If you are launching the task with the CLI, the run-task command will return you the task-arn.
You can then use this to check the status of that task:
aws ecs describe-tasks --cluster MYCLUSTER --tasks TASK-ARN --query 'tasks[0].lastStatus'
It will return RUNNING if it's still running, STOPPED if stopped, etc.
Note that Fargate is very aggressive about harvesting stopped tasks. If that command returns null, you can consider it STOPPED.