Airflow PostgresOperator Safety -- execution_timeout not respected? How to kill process in DB if it's taking too long? - postgresql

Working on getting Airflow implemented at my company but need to perform some safety checks prior to connecting to our prod dbs.
There is concern about stupid SQL being deployed and eating up too many resources. My thought was that an execution_timeout setting on a PostgresOp task would:
Fail the task
Kill the query process in the db
I have found neither to be true.
Code:
from datetime import timedelta

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator
from airflow.utils.dates import days_ago

with DAG(
    # Arguments applied to instantiate this DAG. Update your values here
    # All parameters visible in airflow.models.dag
    dag_id=DAG_ID,
    default_args=DEFAULT_ARGS,
    dagrun_timeout=timedelta(minutes=20),
    start_date=days_ago(1),
    schedule_interval=None,
    tags=['admin'],
    max_active_runs=1
) as dag:

    kill_test = PostgresOperator(
        task_id="kill_test",
        execution_timeout=timedelta(seconds=10),
        postgres_conn_id="usa_db",
        sql="""
            SET application_name to airflow_test;
            <SELECT... intentionally long running query> ;
        """)
Airflow does not fail the task after the timeout.
Even when I manually fail the task in the UI, it does not kill the query in the Postgres db.
What is the deal here? Is there any way to put in safety measures to hard kill an Airflow initiated Postgres query in the db?
I'm not posting the details here, but I have checked:
The Airflow UI shows the task instance duration way over the execution timeout
pg_stat_activity confirms the query is running way over the execution timeout

I guess you are looking for this parameter: runtime_parameters={'statement_timeout': '180000ms'} (Airflow example).
I don't know in which version this was added, but if you update your apache-airflow-providers-postgres module to the latest version you can use the mentioned parameter.
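For illustration, a minimal sketch of how that parameter could be applied to the task from the question, assuming a provider version that supports runtime_parameters (the connection id and SQL placeholder are taken from the original post):

from airflow.providers.postgres.operators.postgres import PostgresOperator

kill_test = PostgresOperator(
    task_id="kill_test",
    postgres_conn_id="usa_db",
    # Postgres itself aborts any statement that runs longer than
    # statement_timeout, so the query is cancelled server-side even if
    # Airflow's execution_timeout is never enforced.
    runtime_parameters={'statement_timeout': '180000ms'},
    sql="""
        SET application_name to airflow_test;
        <SELECT... intentionally long running query> ;
    """)

Because the limit is enforced by the database session rather than by the Airflow worker, the query gets killed even if the task process itself hangs.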

Related

Spring Batch can not obtain job lock via DB (postgres)

I have several instances of an "orchestrator" microservice that run on different nodes and execute Spring Batch jobs. Only one instance has to be "active" and conduct the job at a time. The jobs are scheduled twice a day via the @Scheduled annotation with a cron expression.
So, each microservice instance tries to execute the jobs with a single identifying JobParameter, a LocalDateTime.now() truncated to seconds to compensate for time differences between the OpenShift nodes my instances run on.
The underlying DB is Postgres 12, whose transaction isolation level is set to repeatable read.
The problem seems impossible to me, but it happens and always reproduces. Job execution fails on each microservice instance with a DuplicateKeyException on the composite PK, which is (not surprisingly) the job name and the identifying parameter's hash.
The question is how is it possible and what am I missing? Any ideas?
Sorry for such a late answer. There was no problem at all; the locks work correctly regardless of transaction isolation level. We have two OpenShift clusters - active and inactive. The jobs were running on the "inactive" nodes, which are called that only because no client traffic is routed to them. As it turned out, production support had no access to the "inactive" nodes' logs :)

Use Airflow to run parametrized jobs on-demand and with a schedule

I have a reporting application that uses Celery to process thousands of jobs per day. There is a python module per each report type that encapsulates all job steps. Jobs take customer-specific parameters and typically complete within a few minutes. Currently, jobs are triggered by customers on-demand when they create a new report or request a refresh of an existing one.
Now, I would like to add scheduling, so the jobs run daily, and reports get refreshed automatically. I understand that Airflow shines at task orchestration and scheduling. I also like the idea of expressing my jobs as DAGs and getting the benefit of task retries. I can see how I can use Airflow to run scheduled batch-processing jobs, but I am unsure about my use case.
If I express my jobs as Airflow DAGs, I will still need to run them parametrized for each customer. That means, if a customer creates a new report, I will need a way to trigger a DAG with the customer-specific configuration. And with a scheduled execution, I will need to enumerate all customers and create a parametrized (sub-)DAG for each of them. My understanding is that this should be possible since Airflow supports dynamically created DAGs; however, I am not sure if this is an efficient and correct way to use Airflow.
I wonder if anyone has considered using Airflow for a scenario similar to mine.
Celery workflows do literally the same, and you can create and run them at any point in time. Also, Celery has a pretty good scheduler (I have never seen it fail in 5 years of using Celery) - Celery Beat.
Sure, Airflow can be used to do what you need without any problems.
You can use Airflow to create DAGs dynamically; I am not sure if this will work at a scale of 1000s of DAGs, though. There are some good examples on astronomer.io on Dynamically Generating DAGs in Airflow.
I have some DAGs and tasks that are dynamically generated from a YAML configuration with different schedules and configurations, and it all works without any issue (see the sketch below).
The only thing that might be challenging is the "jobs are triggered by customers on-demand" part - I guess you could trigger any DAG with Airflow's REST API, but it's still in an experimental state.
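As a rough illustration of that pattern, here is a minimal sketch of per-customer DAG generation; the CUSTOMERS mapping, the report types and the run_report callable are hypothetical stand-ins for a real YAML config and the existing per-report job modules:

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago

# Hypothetical per-customer configuration; in practice this would be
# loaded from a YAML file or another config source.
CUSTOMERS = {
    'customer_a': {'report_type': 'sales', 'schedule': '@daily'},
    'customer_b': {'report_type': 'usage', 'schedule': '@daily'},
}

def run_report(customer_id, report_type, **context):
    # Placeholder for the existing per-report-type job logic.
    print(f"Refreshing {report_type} report for {customer_id}")

for customer_id, cfg in CUSTOMERS.items():
    with DAG(
        dag_id=f"report_{customer_id}",
        schedule_interval=cfg['schedule'],
        start_date=days_ago(1),
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="refresh_report",
            python_callable=run_report,
            op_kwargs={'customer_id': customer_id,
                       'report_type': cfg['report_type']},
        )
    # Expose each generated DAG at module level so the scheduler picks it up.
    globals()[dag.dag_id] = dag

An on-demand refresh could then reuse the same DAGs by triggering a run with a customer-specific conf payload through the REST API or the CLI.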

celery result_backend with postgresql - null for some fields in the task table

I am using Celery for a POC. My objective is to create a workflow across a distributed system.
For the purpose of the POC I have created a couple of Docker containers, each with a worker started with the queue option. I initiated a chain from yet another container. The chain executes successfully across the containers. Now I have decided to enable the result_backend to store the results of each task that gets executed.
I set the result_backend to postgresql with the schema option. After executing the chain, I now do see the tables created in Postgres. However, the task_meta table has some columns as null (e.g. the worker and queue columns), whereas the task_id and status are correctly populated.
Has anyone faced a similar issue? Any help is appreciated.
I know it has been a long time since you asked the question but I hope it will help others.
By default, Celery does not write all task result attributes to the backend.
You have to configure it by setting result_extended to True, as stated here: https://docs.celeryproject.org/en/stable/userguide/configuration.html#result-extended
So if you configure your app from Python, you will have to do something like:
from celery import Celery

app = Celery(
    broker=broker_url,
    backend=postgresql_url,
)
app.conf.update(result_extended=True)
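Continuing from the app configured above, a minimal sketch of how the extended attributes can be checked; the add task and the reports queue are made up for illustration, and a worker must be consuming that queue:

@app.task
def add(x, y):
    return x + y

result = add.apply_async((2, 3), queue='reports')
result.get(timeout=30)

# With result_extended=True the backend also stores the task name, args,
# kwargs, worker and queue, so those task_meta columns should no longer be
# null, and the same values are exposed on the AsyncResult.
print(result.name, result.queue, result.worker)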

Issues with postgres_operator in Airflow dag

I am currently using Airflow 1.8.2 to schedule some EMR tasks and then execute some long-running queries on our Redshift cluster. For that purpose I am using the postgres_operator. The queries take about 30 minutes to run. However, once they are done, the connection never closes and the operator runs for an hour and a half more until it's terminated at the 2-hour mark every time. The message on termination is that the server closed the connection unexpectedly.
I've checked the logs on Redshift's end and it shows the queries have run and the connection has been closed. Somehow, that is never communicated back to Airflow. Any directions of what more I could check would be helpful. To give some more info, my Airflow installation is an extension of the https://github.com/puckel/docker-airflow docker image, is run in an ECS cluster and has SQLite as backend since I am still testing Airflow out. Also, I'm using the sequential executor for the backend. I would appreciate any help in this matter.
We had a similar issue before, but I am using SQLAlchemy to connect to Redshift; if you are using the postgres_operator, it should be very similar. It seems Redshift will close the connection if it doesn't see any activity on a long-running query - in your case, 30 minutes is a pretty long query.
Check https://www.postgresql.org/docs/9.5/static/runtime-config-connection.html
You have three settings - tcp_keepalives_idle, tcp_keepalives_interval, tcp_keepalives_count - that send a live message to Redshift to indicate "Hey, I am still alive."
You can pass the following as an argument, so something like this: connect_args={'keepalives': 1, 'keepalives_idle': 60, 'keepalives_interval': 60} (see the sketch below).
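A minimal sketch of how those connect_args are wired in with SQLAlchemy, assuming a psycopg2 driver; the connection URL is a placeholder, and the keepalive values are forwarded straight to the underlying libpq connection:

from sqlalchemy import create_engine, text

# Placeholder Redshift connection string.
redshift_url = "postgresql+psycopg2://user:password@redshift-host:5439/dbname"

engine = create_engine(
    redshift_url,
    # Passed through to the DBAPI connect() call; libpq then sends TCP
    # keepalive probes so a connection that looks idle during a long
    # query is not dropped.
    connect_args={
        'keepalives': 1,
        'keepalives_idle': 60,
        'keepalives_interval': 60,
    },
)

with engine.connect() as conn:
    conn.execute(text("SELECT 1"))  # the long-running query would go here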

Problem in submitting jobs in oracle

A job has been submitted and an entry is also there in dba_jobs, but the job never comes into the running state, so there is no entry for it in dba_jobs_running, even though the parameter 'JOB_QUEUE_PROCESS' has the value 10 and there are no jobs in the running state. Please suggest how to solve this problem.
SELECT NEXT_DATE, NEXT_SEC, BROKEN, FAILURES, WHAT
FROM DBA_JOBS
WHERE JOB = :JOB_ID
What does that return? A BROKEN job won't kick off, and if the NEXT_DATE/NEXT_SEC is set in the future, it won't kick off until that time either.
I hope you labeled that database parameter correctly i.e. 'JOB_QUEUE_PROCESSES=10'.
This is typically why a job won't run.
Also check that the user/schema running the job is correct.
An alternative is to use a different scheduling tool to run the job (e.g. cron on Linux).