Apache Airflow scheduling with a time bound and triggering - celery

I'm using Airflow with the Celery executor. I'm planning to add user interaction to a task, so a user can decide which branch a BranchOperator takes in a DAG. It currently works by running a continuous loop that checks a value in the database, but I don't feel that's a good approach. Is there an alternative?
I also want to wait for this interaction only up to a particular time, and stop otherwise. Is that possible in Airflow? If so, is there any way to change this time bound dynamically?
Thank you in advance.

You shouldn't be using a BranchOperator for this. If you want to proceed in your DAG based on some value in the database, you should use a Sensor. Airflow ships with several off-the-shelf sensors, and you can also look at those as a starting point for writing your own. Sensors poll for a certain criterion and time out after a configurable period of time. From your question, that sounds like exactly what you need.
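For example, here is a minimal sketch using the stock SqlSensor (Airflow 1.x import paths assumed; the connection ID, table, and query are placeholders for whatever your user interface writes to):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.sensors import SqlSensor

    dag = DAG('user_decision_dag', start_date=datetime(2017, 1, 1),
              schedule_interval='@daily')

    # Succeeds as soon as the query returns a non-zero first cell;
    # fails with a timeout error if the criterion is never met in time.
    wait_for_decision = SqlSensor(
        task_id='wait_for_user_decision',
        conn_id='my_app_db',      # Airflow connection to your database
        sql="SELECT COUNT(*) FROM user_decisions "
            "WHERE run_id = '{{ run_id }}' AND decision IS NOT NULL",
        poke_interval=60,         # seconds between polls
        timeout=4 * 60 * 60,      # stop waiting after four hours
        dag=dag,
    )

Since timeout is just a task argument, you can compute it when the DAG file is parsed (for example, from an Airflow Variable) instead of hard-coding it, which covers changing the time bound dynamically.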

Related

Bluemix Auto Scaling API

Is there a way for me to programmatically get notified when Bluemix auto scaling has scaled up or down?
I'm reading streaming data from a queue and would like to make sure the instances I have are balanced and the data is partitioned correctly.
At present this kind of notification service is not available; all you can do is query the instance scaling history in the Web UI. I think this requirement is interesting and should be considered as something to provide to developers in the future.
This kind of alert isn't available yet, but you can write a simple script that monitors the output of

    cf app (appname)

It returns the number of instances running and the state of each one; with the right combination of awk and grep (or a Perl script, for example) you could build your own alerter while waiting for this functionality.
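A rough sketch of that polling approach in Python rather than awk/grep (the exact cf app output format varies between CLI versions, so treat the parsing below as an assumption to adapt):

    import subprocess
    import time

    APP_NAME = 'my-app'   # placeholder: your Bluemix app name
    POLL_SECONDS = 60

    def instance_rows(app_name):
        out = subprocess.check_output(['cf', 'app', app_name]).decode()
        # Instance rows typically look like "#0   running   ..." in cf output.
        return [line for line in out.splitlines() if line.startswith('#')]

    last_count = None
    while True:
        count = len(instance_rows(APP_NAME))
        if last_count is not None and count != last_count:
            print('scaled from %d to %d instances' % (last_count, count))
            # ...hook your own notification (email, webhook, ...) in here...
        last_count = count
        time.sleep(POLL_SECONDS)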

How to handle large amounts of scheduled tasks on a web server?

I'm developing a website (using a LAMP stack) which must handle many user-created scheduled tasks. It works as follows: a user creates an event and sets a date, and other users (as many as 63) may join. A few hours before the set date, the system must email each user subscribed to that event. And that's it.
However, I have never handled scheduling, and the only tools I know (poorly) are cron and at. My plan is to create an at job for each event, which will call a script that gets all the subscribers' email addresses and mails them.
My question is: is my plan/design good? Is it scalable? Are there better options that I should be aware of?
Why a separate job for each event? I've done a similar thing for a newsletter with a single cron job that runs once per hour: if there are any newsletters to be sent, it handles them. In your case you'd have a script that runs once every hour and gets the list of users for events that fall within the desired time window since the last run.
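A sketch of that hourly script, in Python for brevity (the same logic ports directly to a php-cli script; the schema, table, and column names are placeholders):

    import sqlite3  # stand-in for your MySQL connection

    def due_events(conn):
        """Events starting within the next hour whose subscribers haven't been mailed."""
        return conn.execute(
            "SELECT e.id, e.title, u.email "
            "FROM events e "
            "JOIN subscriptions s ON s.event_id = e.id "
            "JOIN users u ON u.id = s.user_id "
            "WHERE e.starts_at BETWEEN datetime('now') "
            "AND datetime('now', '+1 hour') AND e.notified = 0"
        ).fetchall()

    if __name__ == '__main__':
        conn = sqlite3.connect('events.db')
        for event_id, title, email in due_events(conn):
            print('mail %s about %s' % (email, title))  # replace with your mailer
            conn.execute("UPDATE events SET notified = 1 WHERE id = ?", (event_id,))
        conn.commit()

A single crontab entry such as 0 * * * * /usr/bin/python /path/to/notify.py then replaces all the per-event at jobs.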
It will work. As far as scalability goes, at a minimum make sure the script runs in its own process so it doesn't bog down the server unnecessarily.
Create a php-cli script perhaps?
I'm doing most of my work in Rails nowadays, and there's a wealth of background-processing libraries there. One of them is Resque, which uses a Redis server to keep track of the jobs.
I found a PHP port: https://github.com/chrisboulton/php-resque
It might be overkill for your use case, but perhaps give it a shot.
If you would consider a proper framework that uses an application server (and not a simple webserver), Spring has a task scheduling layer that's simple to use. Scheduling jobs on the server really requires more than what a simple LAMP install can do, but I haven't used PHP in a while so maybe there's an equivalent.
Here's an article that compares some of your options.

celery task clean-up with DB backend

I'm trying to understand how and when tasks are cleaned up in celery. From looking at the task docs I see that:
    Old results will be cleaned automatically, based on the
    CELERY_TASK_RESULT_EXPIRES setting. By default this is set to expire
    after 1 day: if you have a very busy cluster you should lower this
    value.
But this quote is from the RabbitMQ Result Backend section, and I don't see any similar text in the Database Backend section. So my question is: is there a backend-agnostic approach I can take for old-task clean-up in Celery, and if not, is there a DB-backend-specific approach I should take? In case it makes any difference, I'm using django-celery. Thanks.
If you follow the link to the settings documentation for CELERY_TASK_RESULT_EXPIRES:
http://docs.celeryproject.org/en/latest/userguide/configuration.html#result-expires
it does say that the database backend supports this, but then you need to run celery beat (there's a default periodic task, called every day, that removes expired results).
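For example, with the old upper-case setting names that django-celery uses (assumed here; exact names and commands vary by version), the relevant pieces look something like:

    # settings.py
    CELERY_RESULT_BACKEND = 'database'         # database-backed result store
    CELERY_TASK_RESULT_EXPIRES = 6 * 60 * 60   # seconds; the default is one day

    # With the database backend, expired rows are only deleted when the
    # built-in celery.backend_cleanup periodic task actually fires, which
    # requires a beat process to be running, e.g.:
    #
    #   python manage.py celery beat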
The backend docs in the task guide should probably mention this as well; maybe there should be a dedicated guide for backends too. If you want to lobby for this, please open an issue at https://github.com/celery/celery/issues

Configuring Quartz.net Tasks

I want to be able to set up one or more Jobs/Triggers when my app runs. The list of Jobs and Triggers will come from a DB table. I DO NOT care about persisting the jobs to the DB for restarting or tracking purposes; basically I just want to use the DB table as an init device. Obviously I can do this by writing the code myself, but I am wondering if there is some way to use the SQLJobStore to get this functionality without the overhead of keeping the DB updated throughout the life of the app using the scheduler.
Thanks for your help!
Eric
The job store's main purpose is to store the scheduler's state, so there is no built-in way to do what you want. You can always write directly to the tables, and that will give you the results you want, but it isn't really the best way to do it.
The recommended way to do this would be to write some code that reads the data from your table and then connects to the scheduler using remoting to schedule the jobs.

MS CRM recursive workflow and performance

I’m about to write a workflow in CRM that calls itself every day, i.e. a recursive workflow.
It will run on half a million entities each day and deactivate each record that has not been updated in the past 3 days.
I’m worried about performance; has anyone else done this?
I haven't personally implemented anything like this, but that's 500,000 records that are floating around in the DB that the async service has to keep track of, which is going to tax your hardware. In addition, CRM keeps track of recursive workflow instances. I don't have the exact specs in front of me, but if a workflow calls itself a set number of times within a certain timeframe, CRM will kill the workflow.
Could you just write a console app that asks the Crm Service for records that haven't been updated in three days, and then deactivate them? Run it as a scheduled task once a day, and then your CRM system doesn't have the burden of keeping track of all those running workflow instances.
EDIT: Ah, I see now you might have been thinking of one workflow that runs on all the records as opposed to workflows running on each record. benjynito's advice makes sense if you go this route, although I still think a scheduled task would be more appropriate than using workflow.
You'll want to make sure your workflow is running in non-peak hours. Assuming you have an on-premise installation you should be able to get away with that. If you're using a hosted instance, you might be worried about one organization running the workflow while another organization is using the system. Use the timeout and maybe a custom workflow activity, if necessary, to force the start time to a certain period.
I'm assuming you'll be as efficient as possible in figuring out which records to deactivate. (i.e. Query Expression would only bring back the records you'll be deactivating).
The built-in infinite loop-protection offered by CRM shouldn't kill your workflow instances. It stops after a call depth of 8, but it resets to 1 if no calls are made for an hour. So the fact that you're doing this once a day should make you OK on the recursive workflow front.