celery result_backend with postgresql - null for some fields in the task table

I am using Celery for a POC. My objective is to create a workflow across a distributed system.
For the purpose of a POC I have created a couple of docker containers each with a worker started with the queue option. I initiated a chain from yet another container. The chain executes successfully across the containers. Now I decided to enable the result_backend to store the results of each task getting executed.
I set the result_backend to postgresql with the schema option. After executing the chain, I do see the tables created in Postgres. However, the task_meta table has some columns as null (e.g. worker, queue), whereas task_id and status are correctly populated.
Has anyone faced a similar issue? Any help is appreciated.

I know it has been a long time since you asked the question but I hope it will help others.
By default, Celery does not write all task result attributes to the backend.
You have to configure it by setting result_extended to True, as stated here: https://docs.celeryproject.org/en/stable/userguide/configuration.html#result-extended
So if you configure your app from Python you will have to do something like:
from celery import Celery

app = Celery(
    broker=broker_url,
    backend=postgresql_url,  # e.g. 'db+postgresql://user:pass@host/dbname'
)
app.conf.update(result_extended=True)
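With result_extended enabled, the extra columns (name, args, kwargs, worker, retries, queue) should be populated for newly stored results. As a minimal check from Python, assuming a hypothetical task named add registered on the app (these AsyncResult attributes stay empty without result_extended):

result = add.delay(2, 2)
result.get(timeout=30)
print(result.name, result.worker, result.queue)  # extended metadata written to the result backend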

Related

Why can't I see my cluster when I'm trying to set up a scheduled task

I have a cluster in ECS with about 20+ services all happily running in it.
I've just uploaded a new image which I want to set up as a daily task. I can create it as a task and run it - the logs indicate it is running to completion.
I've gone into EventBridge and created a rule, set the detail and cron, selected the target (AWS service), then selected ECS task, but when I open the Cluster dropdown it is empty; I can't select a cluster because there are none listed.
Is this a security issue perhaps or am I missing something elsewhere - can't this be done?
Any help would be much appreciated.
Eventually managed to get this to work. The problem was that I was starting the EventBridge creation process in the wrong region - rookie mistake - so it couldn't see the cluster in the other region. D'Oh!

Airflow PostgresOperator Safety -- execution_timeout not respected? How to kill process in DB if it's taking too long?

Working on getting Airflow implemented at my company but need to perform some safety checks prior to connecting to our prod dbs.
There is concern about stupid SQL being deployed and eating up too many resources. My thought was that an execution_timeout setting on a PostgresOperator task would:
Fail the task
Kill the query process in the db
I have found neither to be true.
Code:
from datetime import timedelta

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator  # Airflow 2 / provider import path
from airflow.utils.dates import days_ago

with DAG(
    # Arguments applied to instantiate this DAG. Update your values here
    # All parameters visible in airflow.models.dag
    dag_id=DAG_ID,
    default_args=DEFAULT_ARGS,
    dagrun_timeout=timedelta(minutes=20),
    start_date=days_ago(1),
    schedule_interval=None,
    tags=['admin'],
    max_active_runs=1
) as dag:
    kill_test = PostgresOperator(
        task_id="kill_test",
        execution_timeout=timedelta(seconds=10),
        postgres_conn_id="usa_db",
        sql="""
            SET application_name to airflow_test;
            <SELECT... intentionally long running query> ;
        """)
Airflow does not fail the task after the timeout.
Even when I manually fail the task in the UI, it does not kill the query in the Postgres db.
What is the deal here? Is there any way to put in safety measures to hard-kill an Airflow-initiated Postgres query in the db?
I'm not posting it all here, but I have checked:
Airflow UI shows task instance duration way over execution timeout
pg_stat_activity to confirm the query is running way over the execution timeout
I guess you are looking for the runtime_parameters argument, e.g. runtime_parameters={'statement_timeout': '180000ms'} (see the airflow example).
I don't know in which version this was added, but if you update your apache-airflow-providers-postgres package to the latest version you can use the mentioned parameter.
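For reference, a minimal sketch of that suggestion applied to the kill_test task from the question, assuming a provider version that supports runtime_parameters (the connection id, timeout value and query are placeholders). The statement_timeout is enforced server-side by Postgres, so the query gets cancelled in the database even if Airflow itself never kills it:

from datetime import timedelta

from airflow.providers.postgres.operators.postgres import PostgresOperator

# inside the `with DAG(...) as dag:` block from the question
kill_test = PostgresOperator(
    task_id="kill_test",
    postgres_conn_id="usa_db",
    execution_timeout=timedelta(seconds=10),
    # Postgres aborts the statement after 10 s, independently of Airflow
    runtime_parameters={"statement_timeout": "10000ms"},
    sql="SELECT pg_sleep(300);",  # stand-in for the intentionally long-running query
)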

Use Airflow to run parametrized jobs on-demand and with a schedule

I have a reporting application that uses Celery to process thousands of jobs per day. There is a python module per each report type that encapsulates all job steps. Jobs take customer-specific parameters and typically complete within a few minutes. Currently, jobs are triggered by customers on-demand when they create a new report or request a refresh of an existing one.
Now, I would like to add scheduling, so the jobs run daily, and reports get refreshed automatically. I understand that Airflow shines at task orchestration and scheduling. I also like the idea of expressing my jobs as DAGs and getting the benefit of task retries. I can see how I can use Airflow to run scheduled batch-processing jobs, but I am unsure about my use case.
If I express my jobs as Airflow DAGs, I will still need to run them parametrized for each customer. This means that if a customer creates a new report, I will need a way to trigger a DAG with the customer-specific configuration. And with a scheduled execution, I will need to enumerate all customers and create a parametrized (sub-)DAG for each of them. My understanding is that this should be possible since Airflow supports dynamically created DAGs; however, I am not sure if this is an efficient and correct way to use Airflow.
I wonder if anyone has considered using Airflow for a scenario similar to mine.
Celery workflows do literally the same, and you can create and run them at any point in time. Also, Celery has a pretty good scheduler (I have never seen it fail in 5 years of using Celery) - Celery Beat.
Sure, Airflow can be used to do what you need without any problems.
You can use Airflow to create DAGs dynamically, though I am not sure how well this will work at a scale of 1000s of DAGs. There are some good examples on astronomer.io about Dynamically Generating DAGs in Airflow.
I have some DAGs and tasks that are dynamically generated from a YAML configuration with different schedules and configurations. It all works without any issue.
The only thing that might be challenging is the "jobs are triggered by customers on-demand" part - I guess you could trigger any DAG with Airflow's REST API, but it's still in an experimental state.
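To make the dynamic-DAG idea concrete, here is a minimal sketch (not production code) of generating one daily refresh DAG per customer from a configuration; the CUSTOMERS dict and report_refresh() callable are hypothetical placeholders for the existing per-report-type modules:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

CUSTOMERS = {"acme": {"report": "sales"}, "globex": {"report": "usage"}}  # placeholder config

def report_refresh(customer, **params):
    ...  # call into the existing report module for this customer here

for customer, params in CUSTOMERS.items():
    with DAG(
        dag_id=f"report_refresh_{customer}",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",  # scheduled refresh; on-demand runs can be triggered via the REST API or UI
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="run_report",
            python_callable=report_refresh,
            op_kwargs={"customer": customer, **params},
        )
    globals()[dag.dag_id] = dag  # expose each generated DAG so the scheduler picks it up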

Trigger an Airflow DAG asynchronously from a database trigger

I want to consolidate a couple of historically grown scripts (Python, Bash and PowerShell) whose purpose is to sync data between a lot of different database backends (mostly Postgres, but also Oracle and SQL Server) across different sites. There isn't really a master; it's more like a loose group of partner companies working on the same domain-specific use cases, each with its own data silo, and it's my job to hold all this together as well as I can.
Currently those scripts are cron-scheduled and need to run on the origin server where a dataset gets initially written, to sync it to every partner overnight.
I am also familiar with and use Apache Airflow in another project. So my idea was to use a workflow management tool like Airflow to streamline the sync process and make it more centralized. But also with Airflow, only a time-interval scheduler is available to trigger a DAG.
As most writes come in through Postgres databases, I'd like to make use of the NOTIFY/LISTEN feature, and I already have a Python daemon based on it that listens for database changes (via triggers) and then calls an event handler.
The last missing piece is how best to trigger an Airflow DAG from this handler, and how to keep all of this running reliably.
Perhaps there is a better solution?
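One way to wire the existing LISTEN daemon to Airflow: have the event handler call Airflow's stable REST API (Airflow 2) to trigger a DAG run. A minimal sketch, assuming a hypothetical data_changed notification channel, a sync_partners DAG, and basic-auth credentials (all placeholders):

import json
import select

import psycopg2
import requests

AIRFLOW_API = "http://airflow-webserver:8080/api/v1"

def trigger_dag(dag_id, conf):
    # POST /dags/{dag_id}/dagRuns starts a new run with the given conf
    resp = requests.post(
        f"{AIRFLOW_API}/dags/{dag_id}/dagRuns",
        auth=("airflow", "airflow"),
        json={"conf": conf},
        timeout=10,
    )
    resp.raise_for_status()

conn = psycopg2.connect("dbname=app user=sync")
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
cur = conn.cursor()
cur.execute("LISTEN data_changed;")  # channel raised by the table triggers

while True:
    if select.select([conn], [], [], 60) == ([], [], []):
        continue  # timed out waiting for a notification, loop again
    conn.poll()
    while conn.notifies:
        note = conn.notifies.pop(0)
        trigger_dag("sync_partners", conf=json.loads(note.payload or "{}"))

To keep the daemon itself running reliably, running it under a process supervisor such as systemd with restart-on-failure is a common choice.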

How can I monitor the tasks started with pyspark

I am using pyspark to run some tasks on a cluster.
I want to see the status of the tasks.
I think that the UI must be started by default,
as mentioned here.
But I am unable to reach the UI (at http://localhost:4040 or so).
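A quick way to confirm where (or whether) the UI is listening is to ask the running SparkContext itself; a minimal sketch, assuming a standard PySpark session (on a cluster the UI is served from the driver host, not necessarily localhost, and the port moves to 4041, 4042, ... if 4040 is already taken):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ui-check").getOrCreate()
# Prints e.g. 'http://<driver-host>:4040'; None if spark.ui.enabled is false
print(spark.sparkContext.uiWebUrl)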