How to make celery worker find the tasks to include? - celery

I have a FastAPI app and I use Celery for some async tasks. I also use Docker, so FastAPI runs in one container and Celery in another. Now I am breaking the workers into different queues, and they will run in different containers. Right now I am using almost the same image for FastAPI and Celery, but for this new worker I would end up with an image far bigger than it needs to be, since it contains code and packages that the worker doesn't need. To get around that I now have two different Dockerfiles, one for each worker, but both of them use the exact same file for setting up the Celery app.
This is the Celery app I was setting up:
from celery import Celery

celery_app = Celery(
    broker=config.celery_settings.broker_url,
    backend=config.celery_settings.result_backend,
    include=[
        "src.iam.service_layer.tasks",
        "src.receipt_tracking.service_layer.tasks",
        "src.cfe_scraper.tasks",
    ],
)
And the idea is to spin off src.cfe_scraper.tasks, and in that Docker image I will not have src.iam.service_layer.tasks or src.receipt_tracking.service_layer.tasks. But when I try to build the image I get an error saying that those paths don't exist, which is correct in that case; but if I simply delete the include argument, the worker won't have any tasks registered. Is there an easy way to solve this without having two modules that set up different Celery apps?
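One way to keep a single setup module (a sketch, not from the original post) is to drive the include list from an environment variable, so each Docker image declares only the task modules it actually ships. The variable name CELERY_TASK_MODULES below is hypothetical:

```python
import os

# Full list used by the default image; each slimmed-down worker image
# overrides this via the (hypothetical) CELERY_TASK_MODULES variable.
DEFAULT_TASK_MODULES = [
    "src.iam.service_layer.tasks",
    "src.receipt_tracking.service_layer.tasks",
    "src.cfe_scraper.tasks",
]

def task_modules_from_env(env=None):
    """Return the comma-separated module list from CELERY_TASK_MODULES,
    falling back to DEFAULT_TASK_MODULES when it is unset or empty."""
    env = os.environ if env is None else env
    raw = env.get("CELERY_TASK_MODULES", "")
    modules = [m.strip() for m in raw.split(",") if m.strip()]
    return modules or DEFAULT_TASK_MODULES

# The Celery app would then be built with:
# celery_app = Celery(..., include=task_modules_from_env())
```

The scraper image would then set CELERY_TASK_MODULES=src.cfe_scraper.tasks in its Dockerfile and never import the packages it doesn't have.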

Related

Using celery flower with AWS Ecs Fargate

We have our production running on AWS ECS with Fargate, where we are using multiple Celery workers.
We have integrated Flower to monitor our Celery tasks, with EFS as the persistent DB. Everything works fine until we trigger a new deployment: once we do, a new task comes up with new workers, and Flower considers them different from the existing workers, which it now marks as offline. Because of this we lose the existing data after every deployment.
We tried hard-coding the worker names, but even after that it works the same way; the only difference is that it no longer shows any offline workers after we trigger a deployment.
Please let me know your thoughts on this. Is it okay to use Flower to monitor Celery, or is there another tool we can use that doesn't have this kind of issue? If Flower is fine, please let me know how we can fix this.
Thank you.

Airflow: what do `airflow webserver`, `airflow scheduler` and `airflow worker` exactly do?

I've been working with Airflow for a while now; it was set up by a colleague. Lately I have run into several errors, which require me to know in more depth how to fix certain things within Airflow.
I do understand what the 3 processes are, I just don't understand the underlying things that happen when I run them. What exactly happens when I run one of the commands? Can I somewhere see afterwards that they are running? And if I run one of these commands, does this overwrite older webservers/schedulers/workers or add a new one?
Moreover, if I for example run airflow webserver, the screen shows some of the things that are happening. Can I simply get out of this by pressing CTRL + C? Because when I do this, it says things like Worker exiting and Shutting down: Master. Does this mean I'm shutting everything down? How else should I get out of the webserver screen then?
Each process does what it is built to do while it is running (the webserver provides a UI, the scheduler determines when things need to be run, and the workers actually run the tasks).
I think your confusion is that you may be seeing them as commands that tell some sort of "Airflow service" to do something, but they are each standalone commands that start processes to do stuff. I.e. starting from nothing, you run airflow scheduler: now you have a scheduler running. Run airflow webserver: now you have a webserver running. When you run airflow webserver, it starts a Python Flask app. While that process is running, the webserver is running; if you kill the process, it goes down.
All three have to be running for Airflow as a whole to work (assuming you are using an executor that needs workers). You should only ever have one scheduler running, but if you were to run two processes of airflow webserver (ignoring port conflicts), you would then have two separate HTTP servers running against the same metadata database. Workers are a little different, in that you may want multiple worker processes running so you can execute more tasks concurrently. So if you create multiple airflow worker processes, you'll end up with multiple processes taking jobs from the queue, executing them, and updating the task instance with the status of the task.
When you run any of these commands you'll see the stdout and stderr output in console. If you are running them as a daemon or background process, you can check what processes are running on the server.
If you ctrl+c, you are sending a signal to kill the process. Ideally for a production Airflow cluster you should have some supervisor monitoring the processes and ensuring that they are always running. Locally you can either run the commands in the foreground of separate shells and just keep them running while you need them, or run them as background daemons with the -D argument, e.g. airflow webserver -D.
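As a rough illustration of "check what processes are running on the server" (a sketch, assuming a Unix-like host where the commands appear in ps output), you could test which of the three roles are up like this:

```python
import subprocess

AIRFLOW_ROLES = ("airflow webserver", "airflow scheduler", "airflow worker")

def running_roles(ps_output=None):
    """Return a dict mapping each Airflow role to whether it appears
    in the process listing (the output of `ps aux` by default)."""
    if ps_output is None:
        ps_output = subprocess.run(
            ["ps", "aux"], capture_output=True, text=True
        ).stdout
    return {role: role in ps_output for role in AIRFLOW_ROLES}
```

Calling running_roles() with no argument inspects the live process table; passing a string lets you test the matching logic offline.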

Run celery periodic tasks on one machine only

I am working on a Django project where I am using Celery. I have two big modules in the project, named app1 and app2. I have created two Celery apps for them, which run on two separate machines. app1 and app2 contain different tasks that I want to run on different machines, and that works fine. But my problem is that I have some periodic tasks. I have defined a queue named periodic_tasks for them. I want to run these periodic tasks on a separate, third machine: on the third machine only the periodic tasks should run, and they shouldn't be executed from the other two machines. Is this possible using Celery?
On your third machine, make sure to start up Celery with the -Q or --queues option set to periodic_tasks. On app1 and app2, start up Celery without the periodic_tasks queue. You can read more about queue handling here: http://docs.celeryproject.org/en/latest/reference/celery.bin.worker.html#cmdoption-celery-worker-Q
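Besides the -Q flag on the worker, Celery's task routing can pin the periodic tasks to that queue in configuration. A minimal sketch (the task and module names here are made up for illustration); the dict would be assigned to app.conf.task_routes:

```python
# Route each periodic task to the dedicated queue; anything not listed
# keeps using Celery's default queue (named "celery" by default).
task_routes = {
    "app1.tasks.nightly_report": {"queue": "periodic_tasks"},
    "app2.tasks.cleanup_expired": {"queue": "periodic_tasks"},
}

def queue_for(task_name, routes=task_routes, default="celery"):
    """Resolve which queue a given task name would be sent to."""
    return routes.get(task_name, {}).get("queue", default)
```

The third machine then starts its worker with -Q periodic_tasks, while the app1 and app2 machines start theirs without that queue, so they never consume the periodic work.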

Gracefully update running celery pod in Kubernetes

I have a Kubernetes cluster running Django, Celery, RabbitMq and Celery Beat. I have several periodic tasks spaced out throughout the day (so as to keep server load down). There are only a few hours when no tasks are running, and I want to limit my rolling-updates to those times, without having to track it manually. So I'm looking for a solution that will allow me to fire off a script or task of some sort that will monitor the Celery server, and trigger a rolling update once there's a window in which no tasks are actively running. There are two possible ways I thought of doing this, but I'm not sure which is best, nor how to implement either one.
Run a script (bash or otherwise) that checks up on the Celery server every few minutes, and initiates the rolling-update if the server is inactive
Increment the Celery app name before each update (in the Beat run command, the Celery run command, and in the celery.py config file), create a new Celery pod, rolling-update the Beat pod, and then delete the old Celery pod 12 hours later (a reasonable time span for all running tasks to finish)
Any thoughts would be greatly appreciated.
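For the first option, the check itself can be tiny. A sketch, assuming you fetch the active-task mapping with Celery's app.control.inspect().active(), which returns a dict of worker name to list of active tasks (or None if no worker replies):

```python
def workers_idle(active_by_worker):
    """Return True when no worker reports an active task.

    `active_by_worker` mirrors the dict returned by Celery's
    app.control.inspect().active(), e.g. {"worker1@host": [...]}.
    A None value (no reply from any worker) is treated as not idle,
    to fail safe rather than trigger an update blindly.
    """
    if active_by_worker is None:
        return False
    return all(not tasks for tasks in active_by_worker.values())
```

A small cron job or sidecar could call this every few minutes and, once it returns True, trigger the rolling update (e.g. via kubectl rollout restart).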

Dotcloud multiple services

I'm new to dotcloud, and am confused about how multiple services work together.
my yaml build file is:
www:
    type: python
db:
    type: postgresql
worker:
    type: python-worker
broker:
    type: rabbitmq
And my supervisord file contains commands to start django celery & celerycam.
When I push my code out to my app, I can see that both the www and worker services start up their own instances of celery and celerycam, and, for example, the log files will be different. This makes sense (although it isn't made very clear in the dotCloud documentation, in my opinion: the documentation talks about setting up a worker service, but not about how to combine it with other services), but it does raise the question of how to configure an application where the python service mainly serves the web page, while the python-worker service works on background tasks, e.g. celery.
The dotCloud documentation does make mention of this:
"However, you should be aware that when you scale your application,
the cron tasks will be scheduled in all scaled instances – which is
probably not what you need! So in many cases, it will still be better
to use a separate service.
Similarly, a lot of (non-worker) services already run Supervisor, so
you can run additional background jobs in those services. Then again,
remember that those background jobs will run in multiple instances if
you scale your application. Moreover, if you add background jobs to
your web service, it will get less resources to serve pages, and your
performance will take a significant hit."
How do you configure dotcloud & your application to run just the webserver on one service, and background tasks on the worker service? Would you scale workers by increasing the concurrency setting in celery (and scaling the one service vertically), by adding extra worker services, or both?
Would you do this so that firstly the webserver service doesn't have to use resources in processing background tasks, and secondly so that you could scale the worker services independently of the webserver service?
There are two tricks.
First, you could use different approots for your www and worker services to separate the code they will run:
www:
    type: python
    approot: frontend
    # ...
worker:
    type: python-worker
    approot: backend
    # ...
Second, since your postinstall script is different for each approot, you can copy a file out to become the correct supervisord.conf for that particular service.
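A postinstall script along these lines could do the copy; this is a Python sketch, and how you detect the current service name is an assumption you'd adapt to whatever metadata your setup exposes:

```python
#!/usr/bin/env python
"""Hypothetical postinstall step: install the supervisord.conf that
matches this service, copied from a per-service template in the repo."""
import shutil

def install_supervisord_conf(service_name, src_dir, dest_path):
    """Copy supervisord.<service_name>.conf from src_dir to dest_path
    so each service ends up with its own supervisor configuration."""
    src = "%s/supervisord.%s.conf" % (src_dir, service_name)
    shutil.copy(src, dest_path)
    return src
```

With templates named supervisord.www.conf and supervisord.worker.conf checked into each approot, the same script works unchanged in both services.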
You may also want to look at the dotCloud tutorial and sample code for django-celery.
/Andy