Scheduling Celery tasks from multiple repos - celery

We've got Celery in a dedicated repo (with Beat, Flower and some workers) with some scheduled and non-scheduled tasks. We schedule some tasks using the normal syntax:
app.conf.beat_schedule.update({
'print-hello-again-every-minute': {
'task': 'dummy_scheduled_task',
'schedule': 60.0 # seconds
}
})
We then have another repo with specific business logic that we'd also like to create scheduled tasks for (the tasks themselves also live in this separate repo). We have a worker setup in this repo and it connects to the same Redis as the main repo.
Problem: We would like to avoid setting up another Beat and Flower instance and instead just spin up workers per repo that are close to the business logic that they need to run async.
However, we can't find a way of scheduling recurring tasks from these separate repos to the main Beat instance. Is this at all possible or do we have to use a custom scheduler for something like this?

Related

Use Airflow to run parametrized jobs on-demand and with a schedule

I have a reporting application that uses Celery to process thousands of jobs per day. There is a python module per each report type that encapsulates all job steps. Jobs take customer-specific parameters and typically complete within a few minutes. Currently, jobs are triggered by customers on-demand when they create a new report or request a refresh of an existing one.
Now, I would like to add scheduling, so the jobs run daily, and reports get refreshed automatically. I understand that Airflow shines at task orchestration and scheduling. I also like the idea of expressing my jobs as DAGs and getting the benefit of task retries. I can see how I can use Airflow to run scheduled batch-processing jobs, but I am unsure about my use case.
If I express my jobs as Airflow DAGs, I will still need to run them parametrized for each customer. It means, if the customer creates a new report, I will need to have a way to trigger a DAG with the customer-specific configuration. And with a scheduled execution, I will need to enumerate all customers and create a parametrized (sub-)DAG for each of them. My understanding this should be possible since Airflow supports DAGs created dynamically, however, I am not sure if this is an efficient and correct way to use Airflow.
I wonder if anyway considered using Airflow for a scenario similar to mine.
Celery workflows do literally the same, and you can create and run them at any point of time. Also, Celery has a pretty good scheduler (I have never seen it failing in 5 years of using Celery) - Celery Beat.
Sure, Airflow can be used to do what you need without any problems.
You can use Airflow to create DAGs dynamically, I am not sure if this will work with a scale of 1000 of DAGs though. There are some good examples on astronomer.io on Dynamically Generating DAGs in Airflow.
I have some DAGs and task that are dynamically generated by a yaml configuration with different schedules and configurations. It all works without any issue.
Only thing that might be challenging is the "jobs are triggered by customers on-demand" - I guess you could trigger any DAG with Airflow's REST API, but it's still in a experimental state.

Pause Scheduled tasks in SCDF

Hi I'm running batch jobs via SCDF in openshift environment. All the jobs have been scheduled through the scheduling option in SCDF. Is there way to pause or Hold those jobs from executing instead of destroying the schedules ? Since the number of jobs are more, everytime we have to recreated the schedules for all of them.
Thanks.
We have an open issue: spring-cloud/spring-cloud-dataflow#3276 to add support for it.
Feel free to update the issue with your use-case requirements and the acceptance criteria. Better yet, it'd be great if you can contribute adding support for it in a PR; we would love to collaborate and release it.

Scaling periodic tasks in celery

We have a 10 queue setup in our celery, a large setup each queue have a group of 5 to 10 task and each queue running on dedicated machine and some on multiple machines for scaling.
On the other hand, we have a bunch of periodic tasks, running on a separate machine with single instance, and some of the periodic tasks are taking long to execute and I want to run them in 10 queues instead.
Is there a way to scale celery beat or use it purely to trigger the task on a different destination "one of the 10 queues"?
Please advise?
Use celery routing to dispatch the task to where you need:

Rollback support in Celery

I am planning to use celery as the task management component of my project. It has almost all the features that I need for my project. I will have a set of tasks which can execute independently or in the specified sequence. In the sequential tasks, I want the ability to perform cleanup/rollbacks if one of the intermediate tasks fails. I was wondering if there is a feature available in celery out of the box to do the same or is any workaround available.
Celery doesn't support anything similar to rollback. Your tasks should be small atomic steps and intermediate steps should not corrupt your database.
If you need to revert changes, you can create task which is doing opposite operation and call it when celery is retrying to perform one of the tasks.

Jenkins trigger job by another which are running on offline node

Is there any way to do the following:
I have 2 jobs. One job on offline node has to trigger the second one. Are there any plugins in Jenkins that can do this. I know that TeamCity has a way of achieving this, but I think that Jenkins is more constrictive
When you configure your node, you can set Availability to Take this slave on-line when in demand and off-line when idle.
Set Usage as Leave this machine for tied jobs only
Finally, configure the job to be executed only on that node.
This way, when the job goes to queue and cannot execute (because the node is offline), Jenkins will try to bring this node online. After the job is finished, the node will go back to offline.
This of course relies on the fact that Jenkins is configured to be able to start this node.
One instance will always be turn on, on which the main job can be run. And have created the job which will look in DB and if in the DB no running instances, it will prepare one node. And the third job after running tests will clean up my environment.