How to ensure only one job fires at a time in Quartz.NET? - quartz-scheduler

I have a Windows Service that uses Quartz.NET to execute jobs that are scheduled. I only want it to pick up a single job at a time. However, occasionally I am seeing behavior that indicates that it has picked up two jobs at once.
There are two log files (the regular one and one automatically generated when the regular one is in use) with jobs that start at the exact same time. I can see both jobs executing in the QRTZ_FIRED_TRIGGERS table, but only one has the correct instance ID, which is odd.
I have configured Quartz to use only a single thread. Is this not how you tell it to only pick up a single job at a time?
Here is my quartz.config file with sensitive values hashed out:
quartz.scheduler.instanceName = DefaultQuartzJobScheduler
quartz.scheduler.instanceId = ######################
quartz.jobstore.clustered = true
quartz.jobstore.clusterCheckinInterval = 15000
quartz.threadPool.type = Quartz.Simpl.SimpleThreadPool, Quartz
quartz.jobStore.useProperties = false
quartz.jobStore.type = Quartz.Impl.AdoJobStore.JobStoreTX, Quartz
quartz.jobStore.driverDelegateType = Quartz.Impl.AdoJobStore.OracleDelegate, Quartz
quartz.jobStore.tablePrefix = QRTZ_
quartz.jobStore.lockHandler.type = Quartz.Impl.AdoJobStore.UpdateLockRowSemaphore, Quartz
quartz.jobStore.misfireThreshold = 60000
quartz.jobStore.dataSource = default
quartz.dataSource.default.connectionString = ######################
quartz.dataSource.default.provider = OracleClient-20
# Customizable values per Node
quartz.threadPool.threadCount = 1
quartz.threadPool.threadPriority = Normal

Make the threadcount = 1.
<add key="quartz.threadPool.threadCount" value="1"/>
<add key="quartz.threadPool.threadPriority" value="Normal"/>
(as you have done)
Make each of your jobs "Stateful"
[PersistJobDataAfterExecution]
[DisallowConcurrentExecution]
public class StatefulDoesNotRunConcurrentlyJob : IJob /* : IStatefulJob */ /* Error 43: 'Quartz.IStatefulJob' is obsolete: 'Use DisallowConcurrentExecutionAttribute and/or PersistJobDataAfterExecutionAttribute annotations instead.' */
{
    public void Execute(IJobExecutionContext context)
    {
        // job logic goes here; the attributes above prevent concurrent runs of this job
    }
}
I've left in the name of the older way of doing this (the now-obsolete IStatefulJob interface) along with the compiler error you get when you code against it, because the error message itself gives the hint.
Basically, if you have one thread AND every job is marked with [DisallowConcurrentExecution], only one job should run at any given time, i.e. everything runs in "serial mode".

Related

Databricks monitoring a column value in a spark table

I have a complex workflow (partly inherited) in Databricks that runs multiple child notebooks. Child notebooks should not run if the current run has been marked as terminated by an external process. Currently I am checking the state of this column before each run; the pseudocode below tries to give an idea. My question is: is there a way to make this read better and easier to maintain, e.g. a more global check or wait that dynamically detects that the job is "terminated" and moves the code on to the tidy-up part?
audit = spark.createDataFrame([('122', 'complete', None),
                               ('123', 'complete', None),
                               ('124', 'Processing', 'Terminated'),
                               ('125', 'Processing', None)],
                              ['job_id', 'job_status', 'external_status'])
audit.write.mode("overwrite").saveAsTable("audit")

def is_terminated(run_id):
    current_log = spark.table("audit").filter(f"job_id = '{run_id}'")
    job_terminated = current_log.select('external_status').collect()[0][0]
    return job_terminated

run_id = '125'  # in live code there is a get-current-run_id job

terminated = is_terminated(run_id)
if terminated != "Terminated":
    dbutils.notebook.run("./notebooks/job1", 0)

terminated = is_terminated(run_id)
otherstuff = otherstuffjob(run_id)
if terminated != "Terminated" and otherstuff == True:
    dbutils.notebook.run("./notebooks/job2", 0)

terminated = is_terminated(run_id)
yetotherstuff = yetotherstuffjob(run_id)
if terminated != "Terminated" and yetotherstuff == True:
    dbutils.notebook.run("./notebooks/job3", 0)

terminated = is_terminated(run_id)
finalstuff = finalstuffjob(run_id)
if terminated != "Terminated" and finalstuff == True:
    dbutils.notebook.run("./notebooks/job4", 0)

dbutils.notebook.run("./notebooks/finaltidyupjob", 0)  # this runs regardless

Log4cplus setproperty function usage

I use the following configuration for my logger, in the conf file :
log4cplus.appender.TestLogAppender = log4cplus::TimeBasedRollingFileAppender
log4cplus.appender.TestLogAppender.FilenamePattern = %d{yyyyMMdd}.log
log4cplus.appender.TestLogAppender.MaxHistory = 365
log4cplus.appender.TestLogAppender.Schedule = DAILY
log4cplus.appender.TestLogAppender.RollOnClose = false
log4cplus.appender.TestLogAppender.layout = log4cplus::PatternLayout
log4cplus.appender.TestLogAppender.layout.ConversionPattern = %m%n
And in my code I have the following logger initialization function, in which I first load the configuration file and then want to set the 'FilenamePattern' property to a new value, so that when I run multiple applications, each application writes to its own log file:
void InitLogger()
{
    PropertyConfigurator::doConfigure(L"LogConf.conf");
    helpers::Properties props(L"LogConf.conf");
    props.setProperty(L"log4cplus.appender.TestLogAppender.FilenamePattern",
                      L"Log/AppLogName.log.%d{yyyy-MM-dd}");
}
The problem is that even when I run just one application, the log messages are written to the file named by the original 'FilenamePattern' value from the configuration file.
It seems setProperty didn't apply the new value I gave it.
Is there a problem with my initializing logger function?
Have I used the setProperty method wrong?
You are obviously changing the properties after you have already configured the system, so your changes will be ignored. Do this instead:
helpers::Properties props(L"LogConf.conf");
props.setProperty(L"log4cplus.appender.TestLogAppender.FilenamePattern",
                  L"Log/AppLogName.log.%d{yyyy-MM-dd}");
PropertyConfigurator propConf(props);
propConf.configure();

airflow TriggerDagRunOperator how to change the execution date

I noticed that for a scheduled task the execution date is set in the past, according to:
Airflow was developed as a solution for ETL needs. In the ETL world,
you typically summarize data. So, if I want to summarize data for
2016-02-19, I would do it at 2016-02-20 midnight GMT, which would be
right after all data for 2016-02-19 becomes available.
However, when a DAG triggers another DAG, the execution date is set to now().
Is there a way to have the triggered DAG use the same execution date as the triggering DAG? Of course, I could rewrite the template and use yesterday_ds, but that is a tricky solution.
The following class extends TriggerDagRunOperator to allow passing the execution date as a string, which then gets converted back into a datetime. It's a bit hacky, but it's the only way I found to get the job done.
from datetime import datetime
import logging

from airflow import settings
from airflow.utils.state import State
from airflow.models import DagBag
from airflow.operators.dagrun_operator import TriggerDagRunOperator, DagRunOrder


class MMTTriggerDagRunOperator(TriggerDagRunOperator):
    """
    MMT-patched for passing an explicit execution date
    (otherwise it's hard to hook the datetime.now() date).
    Use when you want to explicitly set the execution date on the target DAG
    from the controller DAG.

    Adapted from Paul Elliot's solution on the airflow-dev mailing list archives:
    http://mail-archives.apache.org/mod_mbox/airflow-dev/201711.mbox/%3cCAJuWvXgLfipPmMhkbf63puPGfi_ezj8vHYWoSHpBXysXhF_oZQ#mail.gmail.com%3e

    Parameters
    ------------------
    execution_date: str
        the custom execution date (jinja'd)

    Usage Example:
    -------------------
    my_dag_trigger_operator = MMTTriggerDagRunOperator(
        execution_date="{{execution_date}}",
        task_id='my_dag_trigger_operator',
        trigger_dag_id='my_target_dag_id',
        python_callable=lambda context, dag_run_obj: dag_run_obj,
        params={},
        dag=my_controller_dag
    )
    """
    template_fields = ('execution_date',)

    def __init__(
            self, trigger_dag_id, python_callable, execution_date,
            *args, **kwargs
    ):
        self.execution_date = execution_date
        super(MMTTriggerDagRunOperator, self).__init__(
            trigger_dag_id=trigger_dag_id, python_callable=python_callable,
            *args, **kwargs
        )

    def execute(self, context):
        run_id_dt = datetime.strptime(self.execution_date, '%Y-%m-%d %H:%M:%S')
        dro = DagRunOrder(run_id='trig__' + run_id_dt.isoformat())
        dro = self.python_callable(context, dro)
        if dro:
            session = settings.Session()
            dbag = DagBag(settings.DAGS_FOLDER)
            trigger_dag = dbag.get_dag(self.trigger_dag_id)
            dr = trigger_dag.create_dagrun(
                run_id=dro.run_id,
                state=State.RUNNING,
                execution_date=self.execution_date,
                conf=dro.payload,
                external_trigger=True)
            logging.info("Creating DagRun {}".format(dr))
            session.add(dr)
            session.commit()
            session.close()
        else:
            logging.info("Criteria not met, moving on")
There is an issue you may run into when using this without setting execution_date=now(): your operator will throw a MySQL error if you try to start a DAG with an identical execution_date twice. This is because execution_date and dag_id are used to build the unique row key, and two rows with the same key cannot be inserted.
I can't think of a reason you would ever want to run two identical DAGs with the same execution_date in production anyway, but it is something I ran into while testing, and you should not be alarmed by it. Simply clear the old job or use a different datetime.
The TriggerDagRunOperator now has an execution_date parameter to set the execution date of the triggered run.
Unfortunately the parameter is not in the template fields.
If it gets added to the template fields (or if you override the operator and change the template_fields value), it will be possible to use it like this:
my_trigger_task = TriggerDagRunOperator(task_id='my_trigger_task',
                                        trigger_dag_id="triggered_dag_id",
                                        python_callable=conditionally_trigger,
                                        execution_date='{{execution_date}}',
                                        dag=dag)
It has not been released yet but you can see the sources here:
https://github.com/apache/incubator-airflow/blob/master/airflow/operators/dagrun_operator.py
The commit that did the change was:
https://github.com/apache/incubator-airflow/commit/089c996fbd9ecb0014dbefedff232e8699ce6283#diff-41f9029188bd5e500dec9804fed26fb4
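Until that is released, a minimal override along those lines might look like this (a sketch; the subclass name is made up, and it assumes the installed TriggerDagRunOperator already accepts execution_date as a constructor argument, which is exactly the linked change):

from airflow.operators.dagrun_operator import TriggerDagRunOperator

class TemplatedExecutionDateTriggerDagRunOperator(TriggerDagRunOperator):
    # Expose execution_date to Jinja so "{{ execution_date }}" is rendered
    # before execute() runs; everything else is inherited unchanged.
    template_fields = ('execution_date',)

The usage example above then works unchanged, with this subclass substituted for TriggerDagRunOperator.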
I improved the MMTTriggerDagRunOperator a bit. The execute function now checks whether the dag_run already exists and, if it does, restarts the DAG using Airflow's clear function. This lets us create real dependencies between DAGs, because being able to carry the execution date over to the triggered DAG opens up a whole universe of possibilities. I wonder why this is not the default behavior in Airflow.
def execute(self, context):
    run_id_dt = datetime.strptime(self.execution_date, '%Y-%m-%d %H:%M:%S')
    dro = DagRunOrder(run_id='trig__' + run_id_dt.isoformat())
    dro = self.python_callable(context, dro)
    if dro:
        session = settings.Session()
        dbag = DagBag(settings.DAGS_FOLDER)
        trigger_dag = dbag.get_dag(self.trigger_dag_id)
        if not trigger_dag.get_dagrun(self.execution_date):
            dr = trigger_dag.create_dagrun(
                run_id=dro.run_id,
                state=State.RUNNING,
                execution_date=self.execution_date,
                conf=dro.payload,
                external_trigger=True
            )
            logging.info("Creating DagRun {}".format(dr))
            session.add(dr)
            session.commit()
        else:
            trigger_dag.clear(
                start_date=self.execution_date,
                end_date=self.execution_date,
                only_failed=False,
                only_running=False,
                confirm_prompt=False,
                reset_dag_runs=True,
                include_subdags=False,
                dry_run=False
            )
            logging.info("Cleared DagRun {}".format(trigger_dag))
        session.close()
    else:
        logging.info("Criteria not met, moving on")
There is a function available in the experimental API section of airflow that allows you to trigger a dag with a specific execution date. https://github.com/apache/incubator-airflow/blob/master/airflow/api/common/experimental/trigger_dag.py
You can call this function as part of a PythonOperator to achieve the objective.
So it will look like:
from datetime import datetime

from airflow.api.common.experimental.trigger_dag import trigger_dag
from airflow.operators.python_operator import PythonOperator

trigger_operator = PythonOperator(task_id='YOUR_TASK_ID',
                                  python_callable=trigger_dag,
                                  op_args=['dag_id'],
                                  op_kwargs={'execution_date': datetime.now()})
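The same helper can also be called directly from any code running where Airflow is installed, for example (the DAG id and date here are placeholders):

from datetime import datetime
from airflow.api.common.experimental.trigger_dag import trigger_dag

# Trigger the target DAG (placeholder id) with an explicit execution date.
trigger_dag('my_target_dag_id', execution_date=datetime(2018, 1, 1))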

Schedule Celery task to run after other task(s) complete

I want to accomplish something like this:
results = []
for i in range(N):
    data = generate_data_slowly()
    res = tasks.process_data.apply_async(data)
    results.append(res)

celery.collect(results).then(tasks.combine_processed_data())
i.e. launch asynchronous tasks over a long period of time, then schedule a dependent task that will only be executed once all earlier tasks are complete.
I've looked at things like chain and chord, but it seems like they only work if you can construct your task graph completely upfront.
For anyone interested, I ended up using this snippet:
@app.task(bind=True, max_retries=None)
def wait_for(self, task_id_or_ids):
    try:
        ready = app.AsyncResult(task_id_or_ids).ready()
    except TypeError:
        ready = all(app.AsyncResult(task_id).ready()
                    for task_id in task_id_or_ids)
    if not ready:
        self.retry(countdown=2 ** self.request.retries)
And I write the workflow something like this:
task_ids = []
for i in range(N):
    task = (generate_data_slowly.si(i) |
            process_data.si(i))
    task_id = task.delay().task_id
    task_ids.append(task_id)

final_task = (wait_for.si(task_ids) |
              combine_processed_data.si())
final_task.delay()
That way you would be running your tasks synchronously.
The solution depends entirely on how and where the data are collected. Roughly, given that generate_data_slowly and tasks.process_data run synchronously (one after the other), a better approach would be to join both in one task (or a chain) and to group them.
chord will allow you to add a callback to that group.
The simplest example would be:
from celery import chord

@app.task
def getnprocess_data():
    data = generate_data_slowly()
    return whatever_process_data_does(data)

header = [getnprocess_data.s() for i in range(N)]
callback = combine_processed_data.s()

chord(header)(callback).get()

Replacing Celerybeat with Chronos

How mature is Chronos? Is it a viable alternative to a scheduler like celery-beat?
Right now our scheduling implements a periodic "heartbeat" task that checks for "outstanding" events and fires them if they are overdue. We are using python-dateutil's rrule to define this.
We are looking at alternatives to this approach, and Chronos seems a very attractive alternative: 1) it would remove the need for a heartbeat schedule task, 2) it supports RESTful submission of events in ISO8601 format, 3) it has a useful management interface, and 4) it scales.
The crucial requirement is that scheduling needs to be configurable on the fly from the web interface. This is why we can't use celerybeat's built-in scheduling out of the box.
Are we going to shoot ourselves in the foot by switching over to Chronos?
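For context, the kind of heartbeat described above looks roughly like this (a self-contained sketch with made-up event data, not the actual production code):

from datetime import datetime
from dateutil.rrule import rrulestr

# Made-up stand-in for the store of "outstanding" events (the real one lives in the DB).
EVENTS = [
    {"name": "nightly-report", "rrule": "FREQ=DAILY;BYHOUR=2;BYMINUTE=0",
     "last_fired": datetime(2018, 1, 1)},
]

def heartbeat(now=None):
    """Fire every event whose next rrule occurrence is already overdue."""
    now = now or datetime.utcnow()
    for event in EVENTS:
        rule = rrulestr(event["rrule"], dtstart=event["last_fired"])
        next_run = rule.after(event["last_fired"])
        if next_run is not None and next_run <= now:
            # In the real system this would enqueue the actual Celery task.
            print("firing", event["name"])
            event["last_fired"] = now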
This SO question has solutions to your dynamic periodic task problem; the following is not the accepted answer there at the moment:
import datetime

from django.db import models
from djcelery.models import PeriodicTask, IntervalSchedule


class TaskScheduler(models.Model):
    periodic_task = models.ForeignKey(PeriodicTask)

    @staticmethod
    def schedule_every(task_name, period, every, args=None, kwargs=None):
        """Schedules a task by name every "every" "period". So an example call would be:
        TaskScheduler.schedule_every('mycustomtask', 'seconds', 30, [1, 2, 3])
        that would schedule your custom task to run every 30 seconds with the
        arguments 1, 2 and 3 passed to the actual task.
        """
        permissible_periods = ['days', 'hours', 'minutes', 'seconds']
        if period not in permissible_periods:
            raise Exception('Invalid period specified')
        # create the periodic task and the interval
        ptask_name = "%s_%s" % (task_name, datetime.datetime.now())  # create some name for the periodic task
        interval_schedules = IntervalSchedule.objects.filter(period=period, every=every)
        if interval_schedules:  # just check if interval schedules like that already exist and reuse them
            interval_schedule = interval_schedules[0]
        else:  # create a brand new interval schedule
            interval_schedule = IntervalSchedule()
            interval_schedule.every = every  # should check to make sure this is a positive int
            interval_schedule.period = period
            interval_schedule.save()
        ptask = PeriodicTask(name=ptask_name, task=task_name, interval=interval_schedule)
        if args:
            ptask.args = args
        if kwargs:
            ptask.kwargs = kwargs
        ptask.save()
        return TaskScheduler.objects.create(periodic_task=ptask)

    def stop(self):
        """pauses the task"""
        ptask = self.periodic_task
        ptask.enabled = False
        ptask.save()

    def start(self):
        """starts the task"""
        ptask = self.periodic_task
        ptask.enabled = True
        ptask.save()

    def terminate(self):
        self.stop()
        ptask = self.periodic_task
        self.delete()
        ptask.delete()
I haven't used djcelery yet, but it supposedly has an admin interface for dynamic periodic tasks.
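For completeness, usage of the TaskScheduler model above would presumably look something like this (the task name is hypothetical; schedule_every is the call from its own docstring):

# Schedule a registered Celery task (hypothetical name) to run every 30 minutes.
scheduler = TaskScheduler.schedule_every('myapp.tasks.send_report', 'minutes', 30)

scheduler.stop()       # pause the periodic task
scheduler.start()      # resume it
scheduler.terminate()  # stop it and delete both database rows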