Airflow limit daily trigger - triggers

there is a "natural" ( I mean thought parameter) way to limit the number of triggering a dag (let say every 24 hours).
I don't want to schedule it, but some user can trigger the same dag multiple time, and for resources and others reason, I want it only once .
As I see "depends_on_past" depend only against the previous run, but it could be many time a day.
Thx

Not directly, but you could likely implement task_instance_mutation_hook in the first task of the DAG, it could then immediately fail the task if you check if it's been run several times the same day.
https://airflow.apache.org/docs/apache-airflow/stable/concepts/cluster-policies.html#task-instance-mutation

Related

Is using "pg_sleep" in plpgsql Procedures/Functions while using multiple worker background processes concurrently bad practice?

I am running multiple background worker processes in my Postgres database. I am doing this using Pg_Cron extension. I cannot unfortunately use Pg_Timetables, as suggested by another user here.
Thus, I have 5 dependent "Jobs" that need 1 other independent Procedure/Function to execute and complete before they can start. I originally having my Cron jobs simply check every 30minutes or-so some "job_log" table I created to see if the independent Job completed (i.e. if yes, execute Procedure, if not, return out of Procedure and check at next Cron interval)
However, I believe I could simplify the way I am triggering/orchestrating all these Jobs/Procedures greatly if I utilize pg_sleep and start all the Jobs at one -time (so no more checking every 30minutes). I would be running these Jobs in the night time concurrently so I believe it shouldn't effect my actual traffic that much.
i.e.
WHILE some_variable != some_condition LOOP
PERFORM pg_sleep(1);
some_variable := some_value; -- update variable here
END LOOP;
My question is
Would starting all these Jobs at one time (i.e. Setting a concrete time in the Cron expression e.g. 15 18 * * *), and utilizing pg_sleep be bad practice/inefficient as I would be idling 5 background workers while 1 Job completes. The 1 job these are dependent on could take any amount of time to finish i.e. 15 min, 30 min, 1hr (should be < 1 hr though).
Or is better to simply just use a Cron expression to check every 5min or so if the main/independent Job is done, so my other Jobs that are dependent can then run?
Running two schedulers, one of them home-built, seems more complex than just running one scheduler that does 2 (or 6, 1+5, however you count it) different things. If your goal is to make things simpler, does your proposal really achieve that?
I wouldn't worry about 5 backends sleeping on pg_sleep at the same time. But you might worry about them holding back the xid horizon while they do so, which would make vacuuming and HOT pruning less effective. But if you already have one long-running task in one snapshot (the thing they are waiting for) then more of them aren't going to make the matter worse.

is it possible to define Must Start Times for dependent jobs in AutoSys

I'm looking for an option in autosys to send an alert if the dependant jobs don't start within 10 min of the avg run time, for example, the last 30 days. The dependant jobs, do not have a fixed starting condition. They might have run at varying times in the last 30 days based on the completion of starting jobs. Would it be possible in autosys, to dynamically set the must start time for the jobs which don't have starting time rather they are dependant on starting conditions?
I use something like this in my critical jobs. You can try them too, if you want :
must_start_times: "19:01"
days_of_week: mo,tu,we,th,fr,sa,su
start_times: "19:00"

Run scheduler to execute jobs at an interval from the completion of the previous job

I need to create schedulers to execute jobs(class files) at specified intervals..For Now, I'm using Quartz Scheduler which triggers the jobs at defined intervals from the time of triggering of it.
For Eg: Consider I'm giving a cron expression to run for every one hour starting at morning 9.My first run will be at 9 and my second run will be at 10 and so on.
If my job is taking 20 minutes to execute then in that case this method is not that much efficient.
What I need to do is to schedule a job for every one hour from the completion time of the previously ran job
For Eg: Consider my job to run every one hour is triggered at 9 and for the first run it took 20 minutes to run, so for the next time the job should trigger only at 10:20 instead of 10 (ie., one hour from the completion of previous ran job)
I need to know whether there are any methods in Quartz Scheduling to achieve this or any other logic I need to do.
If anyone could help me out on this,it would be very helpful for me.
You can easily achieve this by job-chaining your job executions. There are various approaches you can choose from:
(1) Implement a Quartz JobListener and in its jobWasExecuted method, that is invoked by Quartz whenever a job finishes executing, re-fire your job.
(2) Look at the Quartz JobChainingJobListener that you can use to implement simple job chaining scenarios. Please note that the functionality of this listener is very limited as it does not allow you to insert delays between job executions, there is no support for conditions that must be met before target jobs are executed etc. But you can use it as a good starting point to implement (1).
(3) Use QuartzDesk (our commercial product) or any other product that allows you to create job chains while externalizing and managing all job dependencies outside of your application. A job chain can have multiple target jobs that can be executed immediately, with a fixed delay or at arbitrary time in the future produced by a JavaScript expression. It also allows you to implement somewhat more sophisticated works flows, such as firing a target job when multiple source jobs complete their execution etc. I am attaching screenshots showing you what a simple job chain that re-executes Job1 with a 1 minute delay upon Job1's completion (with any job execution status) looks like:

Is there a way to make the Start Time closer than Schedule Time in an SCOM Task?

I realize that when I execute a SCOM Task on demand from a Powershell script, there are 2 columns in Task Status view called Schedule Time and Start Time. It seems that there is an interval these two fields of around 15 seconds. I'm wondering if there is a way to minimize this time so I could have a response time shorter when I execute an SCOM task on demand.
This is not generally something that users can control. The "ScheduledTime" correlates to the time when the SDK received the request to execute the task. The "StartTime" represents the time that the agent healthservice actually began executing the task workflow locally.
In between those times, things are moving as fast as they can. The request needs to propagate to the database, and a server healthservice needs to be notified that a task is being triggered. The servers then need to determine the correct route for the task message to take, then the healthservices need to actually send and receive the message. Finally, it gets to the actual agent where the task will execute. All of these messages go through the same queues as other monitoring data.
That sequence can be very quick (when running a task against the local server), or fairly slow (in a big Management Group, or when there is lots of load, or if machines/network are slow). Besides upgrading your hardware, you can't really do anything to make the process run quicker.

Use beanstalkd, for periodic tasks, how to always make a job replaced by its latest one?

I am trying to use beanstalk for queuing a large number of periodic
tasks (for example, tasks need processed every N minutes), for each
task, if the last queued job is not completed (not reserved, i mean)
when current job to be added, the last queued job should be replaced
with current job, in other words, only the latest queued job of a task
should be processed.
how can i achieve that using beanstalk?
Ideas i have got right now is:
for each task, use memcached store its latest timestamps (set this
when add jobs to queue),
every time the worker reserved a job successfully, it first checks
timestamps for this task in memcached,
if timestamps of the job is same as timestamps in memcached, then
process this job,
otherwise skip this job, and delete it from the queue.
So is there better ways to do such work? please give your suggestions,
thanks.
I found a memcache/beanstalk combination also the best solution for an implementation where I didnt want a newer but identical job entering a queue.
Until 'named jobs' are done and the software released, that may be one of the better solutions.