Adding extra Celery configs to Airflow

Does anyone know where I can add extra Celery configs to the Airflow Celery executor? For instance, I want to set the worker_pool_restarts property (http://docs.celeryproject.org/en/latest/userguide/configuration.html#worker-pool-restarts), but how do I pass extra Celery properties in general?

Use the just-released Airflow 1.9.0 and this is now configurable.
In airflow.cfg there is this line:
# Import path for celery configuration options
celery_config_options = airflow.config_templates.default_celery.DEFAULT_CELERY_CONFIG
which points to a Python module on the import path. The current default version is https://github.com/apache/incubator-airflow/blob/1.9.0/airflow/config_templates/default_celery.py
If you need a setting that isn't tweakable via that file, create a new module, say my_celery_config.py:
CELERY_CONFIG = {
    # your extra Celery settings, e.g.
    # 'worker_pool_restarts': True,
}
and put it in your AIRFLOW_HOME dir (i.e. alongside the dags/ folder), then set celery_config_options = my_celery_config.CELERY_CONFIG in the config.
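For reference, the corresponding airflow.cfg change would look something like this (a minimal sketch; the module name my_celery_config matches the file created above):
[celery]
# Import path for celery configuration options
celery_config_options = my_celery_config.CELERY_CONFIG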

In case you're running Airflow in Docker and you want to change the Celery configuration, you need to do the following:
Create an Airflow config folder (if you don't have one already) at the same level as your dags folder, and add a custom Celery configuration file (e.g. custom_celery_config.py) there.
Change the default Celery configuration in custom_celery_config.py. The idea is that this Python module should contain a variable holding the default Celery configuration plus your changes to it. For example, if you want to change Celery's task_queues setting, your custom_celery_config.py should look like this:
from airflow.config_templates.default_celery import DEFAULT_CELERY_CONFIG
from kombu import Exchange, Queue

CELERY_TASK_QUEUES = [
    Queue('task1', Exchange('task1', type='direct'), routing_key='task1', queue_arguments={'x-max-priority': 8}),
    Queue('task2', Exchange('task2', type='direct'), routing_key='task2', queue_arguments={'x-max-priority': 6}),
]

CELERY_CONFIG = {
    **DEFAULT_CELERY_CONFIG,
    "task_queues": CELERY_TASK_QUEUES,
}
Mount the config folder in the docker-compose.yml:
volumes:
  - /data/airflow/config:/opt/airflow/config
Set the Celery configuration in the docker-compose.yml like this (since Docker can see the config folder, it can access your custom_celery_config.py):
AIRFLOW__CELERY__CELERY_CONFIG_OPTIONS: 'custom_celery_config.CELERY_CONFIG'
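Putting the mount and the environment variable together, a hedged sketch of the relevant docker-compose.yml fragment (assuming the x-airflow-common anchor used by Airflow's reference compose file; adapt to your own layout):
x-airflow-common:
  &airflow-common
  environment:
    &airflow-common-env
    AIRFLOW__CELERY__CELERY_CONFIG_OPTIONS: 'custom_celery_config.CELERY_CONFIG'
  volumes:
    - /data/airflow/config:/opt/airflow/config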
Restart the Airflow Webserver, Scheduler etc.
For more info about the available settings, check the Celery configuration documentation.

Related

How to read airflow variables into kubernetes pod instance

I am trying to read Airflow variables into my ETL job to populate variables in the curation script. I am using the KubernetesPodOperator. How do I access the metadata database from my K8s pod?
Error I am getting from airflow:
ERROR - Unable to retrieve variable from secrets backend (MetastoreBackend). Checking subsequent secrets backend.
This is what I have in main.py for printing to the console log. I have a key in Airflow Variables named "AIRFLOW_URL".
import logging
from airflow.models import Variable

logger = logging.getLogger(__name__)
AIRFLOW_VAR_AIRFLOW_URL = Variable.get("AIRFLOW_URL")
logger.info("AIRFLOW_VAR_AIRFLOW_URL: %s", AIRFLOW_VAR_AIRFLOW_URL)
Can anyone point me in the right direction?
Your DAG can pass them as environment variables to your Pod, using a template (e.g. KubernetesPodOperator(... env_vars={"MY_VAR": "{{var.value.my_var}}"}, ...)).
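A minimal sketch of that approach (the task, image and variable names are hypothetical; assumes the cncf.kubernetes provider, where env_vars is a templated field that accepts a plain dict, and whose import path may differ slightly between provider versions):
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

# The {{ var.value.AIRFLOW_URL }} template is rendered by Airflow before the
# pod is created, so the pod itself never needs access to the metadata database.
read_airflow_url = KubernetesPodOperator(
    task_id="read_airflow_url",
    name="read-airflow-url",
    image="python:3.11-slim",
    cmds=["python", "-c", "import os; print(os.environ['AIRFLOW_URL'])"],
    env_vars={"AIRFLOW_URL": "{{ var.value.AIRFLOW_URL }}"},
)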
It looks like you have a secrets backend set in your config without actually having a secrets backend set up, so Airflow is trying to fetch your variable from there first.
Alter your config to remove the backend and backend_kwargs keys, and it should look at your Airflow variables first.
[secrets]
backend =
backend_kwargs =

Purge Celery tasks on GCP K8s

I want to purge Celery tasks. Celery is running on GCP Kubernetes in my case. Is there a way to do it from the terminal, for example via kubectl?
The solution I found was to write to a file shared by both the API and Celery containers. In this file, whenever an interruption is captured, a flag is set to true. Inside the Celery containers I keep periodically checking the contents of that file. If the flag is set to true, then I gracefully clear things up and raise an error.
Does this solve your problem? See: How can I properly kill a celery task in a kubernetes environment?
An alternative solution may be:
$ celery -A proj purge
or
from proj.celery import app
app.control.purge()
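Since the question asks about doing this via kubectl, the same purge can be run from your terminal inside the worker pod; a hedged one-liner, where the deployment name celery-worker and the app module proj are placeholders for your own:
kubectl exec -it deploy/celery-worker -- celery -A proj purge -f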

How can you use the kubectl tool (in a stateful/local way) for managing multiple clusters from different directories simultaneously?

Is there a way you can run kubectl in a 'session' such that it gets its kubeconfig from a local directory rather than from ~/.kubeconfig?
Example Use Case
Given the abstract nature of the question, it's worth describing why this may be valuable with an example. Suppose someone had an application, call it 'a', and 4 Kubernetes clusters, each running 'a'. They might have a simple script that performs some kubectl actions in each cluster to smoke test a new deployment of 'a'; for example, they may want to deploy the app and see how many copies of it were autoscaled in each cluster afterward.
Example Solution
As in git, maybe there could be a "try to use a local kubeconfig file if you can find one" behaviour, enabled via a git-style global setting:
kubectl global set-precedence local-kubectl
Then, in one terminal:
cd firstcluster
cat << EOF > kubeconfig
firstcluster
...
EOF
kubectl get pods
p4
Then, in another terminal:
cd secondcluster/
cat << EOF > kubeconfig
secondcluster
...
EOF
kubectl get pods
p1
p2
p3
Thus, the exact same kubectl commands (without having to set a context) run against different clusters depending on the directory you are in.
Some ideas for solutions
One idea I had for this was to write a kubectl-context plugin which made kubectl always check for a local kubeconfig before running and, if it found one, set the context behind the scenes to a context in a global config matching the directory name.
Another idea I've had along these lines would be to create different users which each had different kubeconfig home files.
And of course, using something like virtualenv, you might be able to set things up so that each environment had its own kubeconfig value.
Final thought
Ultimately I think the goal here is to subvert the idea that the ~/.kubeconfig file has any particular meaning, and instead look at ways that many kubeconfig files can be used on the same machine; not just via the --kubeconfig option, but in such a way that state is still maintained in a directory-local manner.
AFAIK, the config file is under ~/.kube/config and not ~/.kubeconfig. I suppose you are looking for an opinion on your question, and you gave me the great idea of creating kubevm, inspired by awsvm for the AWS CLI, chefvm for managing multiple Chef servers and rvm for managing multiple Ruby versions.
So, in essence, you could have a kubevm setup that switches between different ~/.kube configs. You can use a CLI like this:
# Use a specific config
kubevm use {YOUR_KUBE_CONFIG|default}
# or
kubevm YOUR_KUBE_CONFIG
# Set your default config
kubevm default YOUR_KUBE_CONFIG
# List your configurations, including current and default
kubevm list
# Create a new config
kubevm create YOUR_KUBE_CONFIG
# Delete a config
kubevm delete YOUR_KUBE_CONFIG
# Copy a config
kubevm copy SRC_CONFIG DEST_CONFIG
# Rename a config
kubevm rename OLD_CONFIG NEW_CONFIG
# Open a config directory in $EDITOR
kubevm edit YOUR_KUBE_CONFIG
# Update kubevm to the latest
kubevm update
Let me know if it's useful!
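For what it's worth, this kind of per-directory switching can be built on the KUBECONFIG environment variable, which kubectl consults before falling back to ~/.kube/config. A rough sketch of the idea as a shell function (the function name and the kubeconfig file layout are only illustrative):
# Point kubectl at a kubeconfig in the current directory if one exists,
# otherwise fall back to the default ~/.kube/config.
usecluster() {
  if [ -f "$PWD/kubeconfig" ]; then
    export KUBECONFIG="$PWD/kubeconfig"
  else
    unset KUBECONFIG
  fi
  kubectl config current-context
}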

Celery Flower Broker Tab not populating with broker_api set for rabbitmq api

I'm trying to populate the Broker tab on Celery Flower but when I pass a broker_api like the following example:
python manage.py celery flower --broker_api=http://guest:guest@localhost:15672/api/
I get the following error:
state.py:108 (run) Failed to inspect the broker: 'list' object is not callable
I'm confident the credentials I'm using are correct and the RabbitMQ Management Plugin is enabled. I'm able to access the RabbitMQ monitoring page through the browser.
flower==0.6.0
RabbitMQ 3.2.1
Does anyone know how to fix this?
Try removing the trailing slash after /api:
python manage.py celery flower --broker_api=http://guest:guest@localhost:15672/api
Had the same issue on an Airflow setup with Celery 5.2.6 and Flower 1.0.0. The solution for me was to launch Flower using:
airflow celery flower --broker-api=http://guest:guest@rabbitmq:15672/api/
For non-Airflow readers, I believe the command should be:
celery flower --broker=amqp://guest:guest@rabbitmq:5672 --broker_api=http://guest:guest@rabbitmq:15672/api/
A few remarks:
The above assumes a shared Docker network. If that's not the case, every @rabbitmq should be replaced with e.g. @localhost
--broker is not needed if running under Airflow's umbrella (it's passed from the Airflow config)
A good test to verify the API works is to access http://guest:guest@localhost:15672/api/index.html locally (see the example below)
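A hedged way to do the same check from the command line (adjust the host and credentials to your setup; a JSON response indicates the management API is reachable):
curl -u guest:guest http://localhost:15672/api/overview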

Can I tie Celery workers to a particular instance given a shared database?

I have a number of machines each with a Django instance, sharing a single Postgres database.
I want to run Celery, preferably using the Django broker and the Postgres database for simplicity. I do not have a high volume of tasks to run, so there is no need to use a different broker for that reason.
I want to run celery tasks which operate on local file storage. This means that I want the celery worker only to run tasks which are on the same machine that triggered the event.
Is this possible with the current setup? If not, how do I do it? A local Redis instance for each machine?
I worked out how to make this work. No need for fancy routing or brokers.
I run each celeryd instance with a special queue named after the host. This can be done automatically, like:
./manage.py celeryd -Q celery,`hostname`
I then set a value in settings.py that stores the hostname:
import socket
CELERY_HOSTNAME = socket.gethostname()
In each Django instance this will have a different value.
I can then specify this queue when I asynchronously call my task:
my_task.apply_async(args=[one, two], queue=settings.CELERY_HOSTNAME)
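To make the pieces explicit, a minimal sketch of the task side under this setup (module and argument names are hypothetical; only the queue argument differs between machines):
# myapp/tasks.py
from celery import shared_task

@shared_task
def my_task(one, two):
    # Runs only on the machine whose hostname queue received it,
    # so it can safely touch that machine's local file storage.
    ...

# caller, somewhere in the Django code on the same machine
from django.conf import settings
from myapp.tasks import my_task

my_task.apply_async(args=["/tmp/local_file", 2], queue=settings.CELERY_HOSTNAME)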