I want to run two back-to-back map-reduce jobs automatically every hour.
So every hour:
startCollection---mapReduce1--->map1ResultCollection---mapReduce2--->map2ResultCollection
I want the mapReduce2 job to start only when the mapReduce1 job has finished.
How do I know when the mapReduce1 job is done and mapReduce2 is good to go?
Is there a "natural" way (I mean through a parameter) to limit how often a DAG can be triggered (let's say once every 24 hours)?
I don't want to schedule it, but a user can trigger the same DAG multiple times, and for resource and other reasons I want it to run only once.
As far as I can see, "depends_on_past" only depends on the previous run, but there could be many runs a day.
Thanks
Not directly, but you could likely implement task_instance_mutation_hook for the first task of the DAG; it could then fail the task immediately if a check shows the DAG has already been run several times the same day.
https://airflow.apache.org/docs/apache-airflow/stable/concepts/cluster-policies.html#task-instance-mutation
In a Java program,
I need to read a database, take that data, make some REST calls, and write the data to a txt file (which has a header, data, and a footer).
The job starts Saturday night and needs to finish before Saturday morning. If it has not finished, we need to close the file (writing the footer first) and start a new one.
I started looking at tools for this job. Spring Batch seems interesting.
I can split the job into reader, processor, and writer.
Is there something to check whether a job has reached its deadline?
The job will be launched with Jenkins.
I guess you must use a scheduler for that.
You must read the end date from the DB every minute or so, and
if (endDate.compareTo(new Date())<=0)
then the scheduler's job must stop the batch job.
You can use Quartz.
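The deadline check above could be isolated in a small helper, sketched here with java.time rather than java.util.Date. The class name DeadlineGuard is hypothetical and is not part of Spring Batch or Quartz; how the deadline is loaded from the DB and how the batch job is actually stopped are left out.

```java
import java.time.Clock;
import java.time.Instant;

// Hypothetical helper, not part of Spring Batch or Quartz.
final class DeadlineGuard {
    private final Instant deadline;
    private final Clock clock;

    DeadlineGuard(Instant deadline, Clock clock) {
        this.deadline = deadline;
        this.clock = clock;
    }

    // True once the current time is at or past the deadline,
    // the same condition as endDate.compareTo(new Date()) <= 0.
    boolean isExpired() {
        return !clock.instant().isBefore(deadline);
    }
}
```

A scheduler task running every minute could call isExpired() and, when it returns true, ask the batch job to write the footer and stop. Passing a Clock makes the check easy to test with a fixed time.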
Using the Eclipse Job class, it is possible to schedule a job to run a certain amount of time after it is scheduled, like this:
Job job = getMyJob();
job.schedule(delayInMilliseconds);
This will run the job after the specified delay. Is there a way to create a job that runs at a given hour of the day, every day? For example, I want to run a job at 5pm every day: if Eclipse happens to be open at 5pm the job will run; if it is closed, the job will be skipped that day and will wait for the next day.
Is there a way to create this type of recurrent job?
No, the Job API doesn't have anything like this.
You could use something like the scheduleAtFixedRate method of ScheduledExecutorService to schedule a Runnable to submit the job once a day.
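A minimal sketch of that approach, assuming the Runnable passed in is what calls job.schedule() on the Eclipse job. The class and method names (DailyJobScheduler, initialDelayMillis, scheduleDaily) are made up for illustration. The first run is delayed until the next occurrence of the target time, and later runs repeat every 24 hours.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical helper; not part of the Eclipse Job API.
final class DailyJobScheduler {
    // Milliseconds from 'now' until the next occurrence of 'target'.
    static long initialDelayMillis(LocalDateTime now, LocalTime target) {
        LocalDateTime next = now.toLocalDate().atTime(target);
        if (!next.isAfter(now)) {
            next = next.plusDays(1); // today's slot has passed, wait for tomorrow
        }
        return Duration.between(now, next).toMillis();
    }

    // Runs 'submitJob' once a day at 'target' for as long as the executor lives.
    static ScheduledFuture<?> scheduleDaily(ScheduledExecutorService executor,
                                            Runnable submitJob, LocalTime target) {
        long delay = initialDelayMillis(LocalDateTime.now(), target);
        return executor.scheduleAtFixedRate(
                submitJob, delay, TimeUnit.DAYS.toMillis(1), TimeUnit.MILLISECONDS);
    }
}
```

For the 5pm case this could be called as DailyJobScheduler.scheduleDaily(Executors.newSingleThreadScheduledExecutor(), () -> getMyJob().schedule(), LocalTime.of(17, 0)). Because the executor only exists while Eclipse is running, a day on which Eclipse is closed at 5pm is simply skipped. A fixed 24-hour period will drift by an hour across daylight-saving changes.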
I have a Quartz Job that I want to fire every minute. The job itself contains logic to check to see if there is a process to run and if there is, this job could take 45 minutes to complete.
Using a Simple Trigger, will Quartz fire this job off every minute even if there is one already running? Or if the interval is set to 1 minute, does that mean that Quartz will wait 1 minute after the job is done before it fires the next job?
If the trigger is set to fire every minute, it will fire every minute (and a new job instance will be created and invoked).
Unless the related job is marked @DisallowConcurrentExecution.
I am trying to use beanstalk to queue a large number of periodic tasks (for example, tasks that need to be processed every N minutes). For each task, if the last queued job has not been completed (not reserved, I mean) when the current job is about to be added, the last queued job should be replaced with the current job. In other words, only the latest queued job of a task should be processed.
How can I achieve that using beanstalk?
The idea I have right now is:
for each task, use memcached to store its latest timestamp (set this when adding jobs to the queue),
every time the worker successfully reserves a job, it first checks the timestamp for this task in memcached,
if the timestamp of the job is the same as the timestamp in memcached, then process this job,
otherwise skip this job and delete it from the queue.
So are there better ways to do this? Please give your suggestions, thanks.
I also found a memcache/beanstalk combination to be the best solution for an implementation where I didn't want a newer but identical job entering a queue.
Until 'named jobs' are done and the software released, that may be one of the better solutions.
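The timestamp check described in the question can be sketched as follows. This is only an illustration of the logic: a ConcurrentHashMap stands in for memcached, no beanstalk client is involved, and the names LatestJobFilter, recordQueued and shouldProcess are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; the map is an in-memory stand-in for memcached.
final class LatestJobFilter {
    // Task id -> timestamp of the most recently queued job for that task.
    private final Map<String, Long> latestByTask = new ConcurrentHashMap<>();

    // Producer side: call when putting a job for 'taskId' on the queue.
    void recordQueued(String taskId, long timestamp) {
        latestByTask.put(taskId, timestamp);
    }

    // Worker side: call after reserving a job. Returns true only if this job
    // is the latest one queued for its task; otherwise it should be deleted.
    boolean shouldProcess(String taskId, long jobTimestamp) {
        Long latest = latestByTask.get(taskId);
        return latest != null && latest == jobTimestamp;
    }
}
```

The worker would delete any reserved job for which shouldProcess returns false, so older queued jobs for a task are drained without being processed.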