Airflow Celery Worker celery-hostname - celery

On airflow 2.1.3, looking at the documentation for the CLI, airflow celery -h
it shows:
-H CELERY_HOSTNAME, --celery-hostname CELERY_HOSTNAME
Set the hostname of celery worker if you have multiple workers on a single machine
I am familiar with Celery and I know you can run multiple workers on the same machine. But with Airflow (and the celery executor) I don't understand how to do so.
if you do, on the same machine:
> airflow celery worker -H 'foo'
> airflow celery worker -H 'bar'
The second command will fail, complaining about the pid, so then:
> airflow celery worker -H 'bar' --pid some-other-pid-file
This will run another worker and it will sync with the first worker BUT you will get a port binding error since airflow binds the worker process to http://0.0.0.0:8793/ no matter what (unless I missed a parameter?).
It seems you are not suppose to run multiple workers per machines... Then my question is, what is the '-H' (--celery-hostname) option for? How would I use it?

The port that celery listens to is also configurable - it is used to serve logs to the webserver:
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#worker-log-server-port
You can run multiple celery workers with different settings for that port or run them with --skip-serve-logs to not open the webserver in the first place (for example if you send logs to sentry/s3/gcs etc.).
BTW. This sounds strange by the way to run several celery workers because you can utilise the machine CPUS by using parallelism. This seems much more practical an easier to manage.

Related

Dash Celery setup

I have docker-compose setup for my Dash application. I need suggestion or preferred way to setup my celery image.
I am using celery for following use-cases and these are cancellable/abortable/revoked task:
Upload file
Model training
Create train, test set
Case-1. Create one service as celery,
command: ["celery", "-A", "tasks", "worker", "--loglevel=INFO", "--pool=prefork", "--concurrency=3", "--statedb=/celery/worker.state"]
So, here we are using default queue, single worker (main) and 3 child/worker processes(ie can execute 3 tasks simultaneously)
Now, if I revoke any task, will it kill the main worker or just that child worker processes executing that task?
Case-2. Create three services as celery-{task_name} ie celery-upload etc,
command: ["celery", "-A", "tasks", "worker", "--loglevel=INFO", "--pool=prefork", "--concurrency=1", , "--statedb=/celery/worker.state", "--queues=upload_queue", , "--hostname=celery_worker_upload_queue"]
So, here we are using custom queue, single worker (main) and 1 child/worker processe(ie can execute 1 task) in its container. This way one service for each task.
Now, if I revoke any task, it will only kill the main worker or just the only child worker processes executing that task in respective container and rest celery containers will be alive?
I tried using below signals with command task.revoke(terminate=True)
SIGKILL and SIGTERM
In this, I observed #worker_process_shutdown.connect and #task_revoked.connect both gets fired.
Does this means main worker and concerned child worker process for whom revoke command is issued(or all child processes as main worker is down) are down?
SIGUSR1
In this, I observed only #task_revoked.connect gets fired.
Does this means main worker is still running/alive and only concerned child worker process for whom revoke command is issued is down?
Which case is preferred?
Is it possible to combine both cases? ie having single celery service with individual workers(main) and individual child worker process and individual queues Or
having single celery service with single worker (main), individual/dedicated child worker processes and individual queues for respective tasks?
One more doubt, As I think, using celery is required for above listed tasks, now say I have button for cleaning a dataframe will this too requires celery?
ie wherever I am dealing with dataframes should I need to use celery?
Please suggest.
UPDATE-2
worker processes = child-worker-process
This is how I am using as below
# Start button
result = background_task_job_one.apply_async(args=(n_clicks,), queue="upload_queue")
# Cancel button
result = result_from_tuple(data, app=celery_app)
result.revoke(terminate=True, signal=signal.SIGUSR1)
# Task
#celery_app.task(bind=True, name="job_one", base=AbortableTask)
def background_task_job_one(self, n_clicks):
msg = "Aborted"
status = False
try:
msg = job(n_clicks) # Long running task
status = True
except SoftTimeLimitExceeded as e:
self.update_state(task_id=self.request.id, state=states.REVOKED)
msg = "Aborted"
status = True
raise Ignore()
finally:
print("FINaLLY")
return status, msg
Is this way ok to handle cancellation of running task? Can you elaborate/explain this line [In practice you should not send signals directly to worker processes.]
Just for clarification from line [In prefork concurrency (the default) you will always have at least two processes running - Celery worker (coordinator) and one or more Celery worker-processes (workers)]
This means
celery -A app worker -P prefork -> 1 main worker and 1 child-worker-process. Is it same as below
celery -A app worker -P prefork -c 1 -> 1 main worker and 1 child-worker-process
Earlier, I tried using class AbortableTask and calling abort(), It was successfully updating the state and status as ABORTED but task was still alive/running.
I read to terminate currently executing task, it is must to pass terminate=True.
This is working, the task stops executing and I need to update task state and status manually to REVOKED, otherwise default PENDING. The only hard-decision to make is to use SIGKILL or SIGTERM or SIGUSR1. I found using SIGUSR1 the main worker process is alive and it revoked only the child worker process executing that task.
Also, luckily I found this link I can setup single celery service with multiple dedicated child-worker-process with its dedicated queues.
Case-3: Celery multi
command: ["celery", "multi", "show", "start", "default", "model", "upload", "-c", "1", "-l", "INFO", "-Q:default", "default_queue", "-Q:model", "model_queue", "-Q:upload", "upload_queue", "-A", "tasks", "-P", "prefork", "-p", "/proj/external/celery/%n.pid", "-f", "/proj/external/celery/%n%I.log", "-S", "/proj/external/celery/worker.state"]
But getting error,
celery service exited code 0
command: bash -c "celery multi start default model upload -c 1 -l INFO -Q:default default_queue -Q:model model_queue -Q:upload upload_queue -A tasks -P prefork -p /proj/external/celery/%n.pid -f /proj/external/celery/%n%I.log -S /proj/external/celery/worker.state"
Here also getting error,
celery | Usage: python -m celery worker [OPTIONS]
celery | Try 'python -m celery worker --help' for help.
celery | Error: No such option: -p
celery | * Child terminated with exit code 2
celery | FAILED
Some doubts, what is preferred 1 worker vs multi worker?
If multi worker with dedicated queues, creating docker service for each task increases the docker-file and services too. So I am trying single celery service with multiple dedicated child-worker-process with its dedicated queues which is easy to abort/revoke/cancel a task.
But getting error with case-3 i.e. celery multi.
Please suggest.
If you revoke a task, it may terminate the working process that was executing the task. The Celery worker will continue working as it needs to coordinate other worker processes. If the life of container is tied to the Celery worker, then container will continue running.
In practice you should not send signals directly to worker processes.
In prefork concurrency (the default) you will always have at least two processes running - Celery worker (coordinator) and one or more Celery worker-processes (workers).
To answer the last question we may need more details. It would be easier if you could run Celery task when all dataframes are available. If that is not the case, then perhaps run individual tasks to process dataframes. It is worth having a look at Celery workflows and see if you can build Chunk-ed workflow. Keep it simple, start with assumption that you have all dataframes available at once, and build from there.

Is it mandatory to run celery as daemon in production

I have seen celery documentation that its advisable to run celery as daemon process. In my case each celery worker is a docker container whose sole purpose is to execute celery tasks. In that scnario also, is it recommended to execute as daemon process?
No, if Celery worker runs inside a container there is no need to run it as daemon.

Why does Celery discourage worker & beat together?

From the celery help function:
> celery worker -h
...
Embedded Beat Options:
-B, --beat Also run the celery beat periodic task scheduler. Please note that there must only be
one instance of this service. .. note:: -B is meant to be used for development
purposes. For production environment, you need to start celery beat separately.
This also appears in the docs.
You can also embed beat inside the worker by enabling the workers -B
option, this is convenient if you’ll never run more than one worker
node, but it’s not commonly used and for that reason isn’t recommended
for production use:
celery -A proj worker -B
But it's not actually explained why it's "bad" to use this in production. Would love some insight.
The --beat option will start a beat scheduler along with the worker.
But you only need one beat scheduler。
In the production environment, you usually have more than one worker running. Using --beat option will be a disaster.
For example: you have a event scheduled at 12:am each day.
If you started two beat process, the event will run twice at 12:am each day.
If you’ll never run more than one worker node, --beat option if just fine.

Using Celery queues with multiple apps

How do you use a Celery queue with the same name for multiple apps?
I have an application with N client databases, which all require Celery task processing on a specific queue M.
For each client database, I have a separate celery worker that I launch like:
celery worker -A client1 -n client1#%h -P solo -Q long
celery worker -A client2 -n client2#%h -P solo -Q long
celery worker -A client3 -n client3#%h -P solo -Q long
When I ran all the workers at once, and tried to kick off a task to client1, I found it never seemed to execute. Then I killed all workers except for the first, and now the first worker receives and executes the task. It turned out that even though each worker's app used a different BROKER_URL, using the same queue caused them to steal each others tasks.
This surprised me, because if I don't specify -Q, meaning Celery pulls from the "default" queue, this doesn't happen.
How do I prevent this with my custom queue? Is the only solution to include a client ID in the queue name? Or is there a more "proper" solution?
For multiple applications I use different Redis databases like
redis://localhost:6379/0
redis://localhost:6379/1
etc.

Can I tie Celery workers to a particular instance given a shared database?

I have a number of machines each with a Django instance, sharing a single Postgres database.
I want to run Celery, preferably using the Django broker and the Postgres database for simplicity. I do not have a high volume of tasks to run, so there is no need to use a different broker for that reason.
I want to run celery tasks which operate on local file storage. This means that I want the celery worker only to run tasks which are on the same machine that triggered the event.
Is this possible with the current setup? If not, how do to it? A local Redis instance for each machine?
I worked out how to make this work. No need for fancy routing or brokers.
I run each celeryd instance with a special queue named after the host. This can be done automatically, like:
./manage.py celeryd -Q celery,`hostname`
I then set up a hostname in the settings.py that stores the hostname:
import socket
CELERY_HOSTNAME = socket.gethostname()
In each Django instance this will have a different value.
I can then specify this queue when I asynchronously call my task:
my_task.apply_async(args=[one, two], queue=settings.CELERY_HOSTNAME)