Run PostgresOperator with statement timeout in an Airflow DAG - postgresql

So I've been trying to run two PostgresOperators in my DAG, which looks like this:
import logging
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

default_args = {
    'owner': 'local',
}

log = logging.getLogger(__name__)

TEMP_SETTLEMENT = """
set statement_timeout to 0;
select function_a();
"""

VACUM_SETTLEMENT = """
vacuum (verbose, analyze) test_table;
"""

try:
    with DAG(
        dag_id='temp-test',
        default_args=default_args,
        schedule_interval=None,
        start_date=datetime(2021, 10, 1),
        max_active_runs=1,
        catchup=False
    ) as dag:
        pg = PostgresOperator(
            task_id="data",
            postgres_conn_id="connection_1",
            database="DB_test",
            autocommit=True,
            sql=TEMP_SETTLEMENT,
        )
        vacum = PostgresOperator(
            task_id="vacum",
            postgres_conn_id="connection_1",
            database="DB_test",
            autocommit=True,
            sql=VACUM_SETTLEMENT,
        )
        pg >> vacum
except ImportError as e:
    log.warning("Could not import DAGs: %s", str(e))
I keep getting the statement timeout when I try to run TEMP_SETTLEMENT. Is there any way to keep statement_timeout = 0?
Thanks

Update:
Starting from apache-airflow-providers-postgres>=4.1.0 you can do:
PostgresOperator(
    ...,
    runtime_parameters={'statement_timeout': '3000ms'},
)
This capability was added in the PR that solved the feature request linked in the original answer below.
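Applied to the task from the question, a minimal sketch (assuming apache-airflow-providers-postgres>=4.1.0; in Postgres a value of '0' disables the statement timeout) could look like:

pg = PostgresOperator(
    task_id="data",
    postgres_conn_id="connection_1",
    database="DB_test",
    autocommit=True,
    sql=TEMP_SETTLEMENT,
    # session-level override, instead of the SET inside the SQL string
    runtime_parameters={'statement_timeout': '0'},
)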
Original Answer:
You didn't mention it, but from your description I assume that the timeout comes from Postgres and not from Airflow.
For the moment the PostgresOperator does not allow overriding the hook/connection settings.
To solve your issue you will need to edit connection_1: as explained in the docs, you will need to add statement_timeout to the connection's Extra field:
{'statement_timeout': '3600s'}
I opened https://github.com/apache/airflow/issues/21486 as a follow-up feature request to allow setting statement_timeout directly from the operator.

Related

Airflow default variables - Incremental load setup

I am trying to implement an incremental data load for a data extract from one RDS Postgres instance to another Postgres RDS.
I am using Airflow to implement the ETL. After reading for a while about Airflow macros, I decided I'd set up the incremental flow with Airflow's default variables.
So, the algorithm is this:

if my previous execution date is None or null or '':
    pick data from the beginning of time (in our case, a year back)
else:
    pick the previous execution date
end if
Note: the following code is just to understand the default variables first; it is not yet applied to the problem I have described above.
The corresponding code is shown below. When I run the DAG for the first time, I always end up printing 'None' for the previoussuccessfulexecutiondate variable and never the historical date as described above. I am unable to figure this out. Any ideas would be of great help.
from datetime import *
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.dates import days_ago

default_args = {'owner': 'airflow', 'start_date': days_ago(1), 'depends_on_past': 'False'}
dag = DAG('jinja_trial_10', default_args=default_args, schedule_interval=timedelta(minutes=5))

def printexecutiontimes(**kwargs):
    executiondate = kwargs.get('execution_date')
    previoussuccessfulexecutiondate = kwargs.get('prev_execution_date_success')
    previousexecutiondate = kwargs.get('prev_ds_nodash')
    if (previoussuccessfulexecutiondate == 'None' or previoussuccessfulexecutiondate is None):
        previoussuccessfulexecutiondate = datetime.strftime(datetime.now() - timedelta(days=365), '%Y-%m-%d')
    print('Execution Date : {0}'.format(executiondate))
    print('Previous successful execution date : {0}'.format(previoussuccessfulexecutiondate))
    print('Previous execution date : {0}'.format(previousexecutiondate))
    print('hello')

task_start = DummyOperator(task_id='start', dag=dag)
jinja_task = PythonOperator(task_id='TryingoutJinjatemplates',
                            python_callable=printexecutiontimes,
                            provide_context='True',
                            dag=dag)
task_end = DummyOperator(task_id='end', dag=dag)

task_start >> jinja_task >> task_end
I had to do something very similar recently, and the following is what I ended up with: a custom function using DagRun details.
Refer to this answer if you just want to get the last DAG run (irrespective of status).
For me, I had to get the date of the last successful run, hence I created the function below:
from datetime import datetime
from airflow.models import DagRun

def get_last_dag_run(dag_id):
    dag_runs = DagRun.find(dag_id=dag_id)
    dag_runs.sort(key=lambda x: x.execution_date, reverse=True)
    for dag_run in dag_runs:
        # print all dag runs - debug only
        print(f"All ----- state: {dag_run.state} , run_id: {dag_run.run_id} , execution_date: {dag_run.execution_date}")
    print('Success runs ---------------------------------')
    dag_runs = list(filter(lambda x: x.state == 'success', dag_runs))
    for dag_run in dag_runs:
        # print successful dag runs - debug only
        print(f"Success - state: {dag_run.state} , run_id: {dag_run.run_id} , execution_date: {dag_run.execution_date}")
    # return the last successful execution date or a default value (01-01-1970)
    return dag_runs[0].execution_date if dag_runs else datetime(1970, 1, 1)
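A minimal sketch of how this could be wired into a task (the task_id and callable name are illustrative, reusing the dag object and PythonOperator import from the question):

def incremental_load(**kwargs):
    # falls back to 1970-01-01 when there has been no successful run yet
    last_success = get_last_dag_run('jinja_trial_10')
    print('Loading data changed since: {0}'.format(last_success))

load_task = PythonOperator(task_id='incremental_load',
                           python_callable=incremental_load,
                           provide_context=True,
                           dag=dag)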
After a few experiments and a lot of reading, I came up with the following code and it worked for me.
Create a variable in the Airflow UI and assign it a value of 0.
Use Airflow's predefined variables to determine whether it is a full load or an incremental load.
Pseudo code -

if value of the Variable created == 0 then
    set Variable = 1
    set the start date to a point in time in the past (a date-time from the inception of a certain process)
    set the end date to the value of "execution_date" (defined as part of Airflow's predefined variables)
else
    set the start date to "prev_execution_date_success" (defined as part of Airflow's predefined variables)
    set the end date to "execution_date" (defined as part of Airflow's predefined variables)
end
Below is the code snippet for the same:

from datetime import *
from dateutil.parser import parse
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.dates import days_ago
from airflow.models import Variable

default_args = {'owner': 'airflow', 'start_date': datetime(2020, 11, 3, 12, 5), 'depends_on_past': 'False'}
dag = DAG('airflow_incremental_load_setup', default_args=default_args, schedule_interval=timedelta(minutes=5))

def printexecutiontimes(**kwargs):
    # Variable to be created before the running of the dag
    full_load_check = Variable.get('full_load_completion')
    print('full_load_check : {0}'.format(full_load_check))
    if full_load_check == '0':
        print('First execution')
        print('Execution date : {0}'.format(kwargs.get('execution_date')))
        print('Actual start date : {0}'.format(kwargs.get('ds')))
        print('Previous successful execution date : {0}'.format(kwargs.get('prev_execution_date_success')))
        print('Calculated field : {0}'.format(datetime.strftime(datetime.now() - timedelta(days=365), '%Y-%m-%d')))
        Variable.set('full_load_check', '1')
        start_date = datetime.strftime(datetime.now() - timedelta(days=365), '%Y-%m-%d')
        end_date = datetime.strftime(kwargs.get('execution_date'), '%Y-%m-%d')
    else:
        print('After the first execution ..')
        print('Execution date : {0}'.format(kwargs.get('execution_date')))
        print('Actual start date : {0}'.format(kwargs.get('ds')))
        print('Previous successful execution date : {0}'.format(kwargs.get('prev_execution_date_success')))
        print('Calculated field : {0}'.format(kwargs.get('prev_execution_date_success')))
        start_date = kwargs.get('prev_execution_date_success')
        start_date = parse(str(start_date))
        end_date = kwargs.get('execution_date')
        end_date = parse(str(end_date))
        print('Type of start_date_check : {0}'.format(type(start_date)))
        start_date = datetime.strftime(start_date, '%Y-%m-%d')
        end_date = datetime.strftime(end_date, '%Y-%m-%d')

task_start = DummyOperator(task_id='start', dag=dag)
main_task = PythonOperator(task_id='IncrementalJobTask',
                           python_callable=printexecutiontimes,
                           provide_context='True',
                           dag=dag)
task_end = DummyOperator(task_id='end', dag=dag)

task_start >> main_task >> task_end
This is what helped me:
if isinstance(context['prev_execution_date_success'], type(None)):
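A minimal sketch of how that check could be used inside the callable (the one-year fallback mirrors the question; the function name is illustrative, and datetime/timedelta come from the question's imports):

def pick_start_date(**context):
    if isinstance(context['prev_execution_date_success'], type(None)):
        # no successful run yet: fall back to a year back, as in the question
        return datetime.now() - timedelta(days=365)
    return context['prev_execution_date_success']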

Apache airflow: setting catchup to False is not working

I have a DAG created on Apache Airflow. It seems the scheduler is configured to run it from June 2015. (By the way, I do not know why: it is a new DAG I created and I didn't backfill it; I only backfilled other DAGs with different DAG IDs over these date intervals, and the scheduler took those dates and backfilled my new DAG. I'm just starting to work with Airflow.)
(Update: I realized the DAG is backfilled because the start date is set in the DAG's default config, although this does not explain the behaviour I describe below.)
I'm trying to stop the scheduler from running all the DAG executions from that date. The command airflow backfill --mark_success tutorial2 -s '2015-06-01' -e '2019-02-27' is giving me database errors (see below), so I'm trying to set catchup to False.
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such table: job
[SQL: 'INSERT INTO job (dag_id, state, job_type, start_date, end_date, latest_heartbeat, executor_class, hostname, unixname) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)']
[parameters: ('tutorial2', 'running', 'BackfillJob', '2019-02-27 10:52:37.281716', None, '2019-02-27 10:52:37.281733', 'SequentialExecutor', '08b6eb432df9', 'airflow')]
(Background on this error at: http://sqlalche.me/e/e3q8)
So I'm using another approach. What I've tried:
Setting catchup_by_default = False in airflow.cfg and restarting the whole Docker container.
Setting catchup = False in my Python DAG file and launching the file with python again.
What I'm seeing on the web UI:
DAG executions are being launched starting in June 2015 (screenshot: https://i.stack.imgur.com/7hlL9.png).
Catchup is set to False in the DAG's configuration (screenshot: https://i.stack.imgur.com/E01Cc.png).
So I don't understand why those DAG executions are being launched.
Thank you
DAG code:
"""
Code that goes along with the Airflow tutorial is located at:
https://github.com/apache/airflow/blob/master/airflow/example_dags/tutorial.py
"""
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'catchup': False,
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}

dag = DAG(
    'tutorial2', default_args=default_args, schedule_interval='* * * * *')

# t1, t2 and t3 are examples of tasks created by instantiating operators
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

t2 = BashOperator(
    task_id='sleep',
    bash_command='sleep 5',
    retries=3,
    dag=dag)

templated_command = """
{% for i in range(5) %}
    echo "{{ ds }}"
    echo "{{ macros.ds_add(ds, 7)}}"
    echo "{{ params.my_param }}"
{% endfor %}
"""

t3 = BashOperator(
    task_id='templated',
    bash_command=templated_command,
    params={'my_param': 'Parameter I passed in'},
    dag=dag)

t2.set_upstream(t1)
t3.set_upstream(t1)
I think you actually need to specify catchup at the DAG level, not pass it in through default_args. (The latter doesn't really make sense anyway, since those are the default args for the tasks; you couldn't have some tasks catch up and others not.)
Try this:
dag = DAG(
    'tutorial2', default_args=default_args, schedule_interval='* * * * *',
    catchup=False)

Better ways to write SQL statement in DB2 which makes use of "EXISTS" keyword

I have written the SQL below in an RPGLE program. The intent is to update the header file (TC400F) if no corresponding records exist in the detail file (TC401F). Are there better ways of doing this? By better, I mean anything that would make the query run faster or look cleaner.
Exec SQL UPDATE TC400F
         SET T40STS = '05',
             T40OFL = '1'
         WHERE T40SID = :K#T41SID AND
               T40PID = :K#T41PID AND
               NOT EXISTS (SELECT * FROM TC401F WHERE
                           T41SID = :K#T41SID AND
                           T41PID = :K#T41PID);
Try this:

Exec SQL UPDATE TC400F f1
         SET (f1.T40STS, f1.T40OFL) = ('05', '1')
         WHERE f1.T40SID = :K#T41SID AND
               f1.T40PID = :K#T41PID AND
               NOT EXISTS
               (
                 SELECT * FROM TC401F f2
                 WHERE (f1.T40SID, f1.T40PID) = (f2.T41SID, f2.T41PID)
               );
You have not done anything here that explicitly restricts performance. The optimizer is pretty smart these days. Most of the things that affect SQL performance are external to the statement. It is best to write the statement to be semantically correct (as you have above) and let the optimizer do its thing. Then, if you see performance issues, investigate them with the Explain tools in Run SQL Scripts. Most likely any performance issues will derive from missing or incorrect indexes.
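If the Explain output does point at the NOT EXISTS probe, a supporting index over the detail file is the usual remedy; a sketch using the column names from the question (the index name is made up):

-- hypothetical index to support the NOT EXISTS lookup against TC401F
CREATE INDEX TC401F_SID_PID ON TC401F (T41SID, T41PID);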
Exec SQL UPDATE TC400F
         SET T40STS = '05',
             T40OFL = '1'
         WHERE (t40sts <> '05' or t40ofl <> '1')
           AND T40SID = :K#T41SID
           AND T40PID = :K#T41PID
           AND NOT EXISTS (SELECT onefieldhere FROM TC401F WHERE
                           T41SID = :K#T41SID AND
                           T41PID = :K#T41PID);
Add a little optimistic code: don't update what is already set.
Please don't make the AS400 an orphan by ending a line with an operator that really belongs at the start of the next line. Optimistic updates are probably fastest when the values are already set, because the database knows it's already done. You could make it better by using the alias field names so someone in the future will know what a t40ofl is. Select only one field in the EXISTS clause so SQL won't have to pull in a whole row.

How to connect Jupyter Ipython notebook to Amazon redshift

I am using Mac Yosemite.
I have installed the packages postgresql, psycopg2, and simplejson using conda install "package name".
After the installation I imported these packages. I tried to create a JSON file with my Amazon Redshift credentials:
{
    "user_name": "YOUR USER NAME",
    "password": "YOUR PASSWORD",
    "host_name": "YOUR HOST NAME",
    "port_num": "5439",
    "db_name": "YOUR DATABASE NAME"
}
I used:

with open("Credentials.json") as fh:
    creds = simplejson.loads(fh.read())

But this is throwing an error. These were the instructions given on a website. I tried searching other websites, but no site gives a good explanation.
Please let me know how I can connect Jupyter to Amazon Redshift.
There's a nice guide from RJMetrics here: "Setting up Your Analytics Stack with Jupyter Notebook & AWS Redshift". It uses ipython-sql.
This works great and displays results in a grid.
In [1]:

import sqlalchemy
import psycopg2
import simplejson
%load_ext sql
%config SqlMagic.displaylimit = 10

In [2]:

with open("./my_db.creds") as fh:
    creds = simplejson.loads(fh.read())

connect_to_db = 'postgresql+psycopg2://' + \
    creds['user_name'] + ':' + creds['password'] + '@' + \
    creds['host_name'] + ':' + creds['port_num'] + '/' + creds['db_name']
%sql $connect_to_db

In [3]:

%sql SELECT * FROM my_table LIMIT 25;
Here's how I do it:
----INSERT IN CELL 1-----
import psycopg2
redshift_endpoint = "<add your endpoint>"
redshift_user = "<add your user>"
redshift_pass = "<add your password>"
port = <your port>
dbname = "<your db name>"

----INSERT IN CELL 2-----
from sqlalchemy import create_engine
from sqlalchemy import text
engine_string = "postgresql+psycopg2://%s:%s@%s:%d/%s" \
    % (redshift_user, redshift_pass, redshift_endpoint, port, dbname)
engine = create_engine(engine_string)

----INSERT IN CELL 3 - THIS EXAMPLE WILL GET ALL TABLES FROM YOUR DATABASE-----
sql = """
select schemaname, tablename from pg_tables order by schemaname, tablename;
"""

----LOAD RESULTS AS TUPLES TO A LIST-----
tables = []
output = engine.execute(sql)
for row in output:
    tables.append(row)
tables

--IF YOU'RE USING PANDAS---
import pandas as pd
raw_data = pd.read_sql_query(text(sql), engine)
The easiest way is to use this extension -
https://github.com/sat28/jupyter-redshift
The sample notebook shows how it loads the Redshift utility as an IPython magic.
Edit 1
Support for writing back to the Redshift database has also been added.

Use python to execute line in postgresql

I have imported a shapefile named tc_bf25 using QGIS, and the following is my Python script, typed in PyScripter:
import sys
import psycopg2

conn = psycopg2.connect("dbname = 'routing_template' user = 'postgres' host = 'localhost' password = '****'")
cur = conn.cursor()

query = """
    ALTER TABLE tc_bf25 ADD COLUMN source integer;
    ALTER TABLE tc_bf25 ADD COLUMN target integer;
    SELECT assign_vertex_id('tc_bf25', 0.0001, 'the_geom', 'gid')
;"""
cur.execute(query)

query = """
    CREATE OR REPLACE VIEW tc_bf25_ext AS
        SELECT *, startpoint(the_geom), endpoint(the_geom)
        FROM tc_bf25
;"""
cur.execute(query)

query = """
    CREATE TABLE node1 AS
        SELECT row_number() OVER (ORDER BY foo.p)::integer AS id,
               foo.p AS the_geom
        FROM (
            SELECT DISTINCT tc_bf25_ext.startpoint AS p FROM tc_bf25_ext
            UNION
            SELECT DISTINCT tc_bf25_ext.endpoint AS p FROM tc_bf25_ext
        ) foo
        GROUP BY foo.p
;"""
cur.execute(query)

query = """
    CREATE TABLE network1 AS
        SELECT a.*, b.id as start_id, c.id as end_id
        FROM tc_bf25_ext AS a
            JOIN node AS b ON a.startpoint = b.the_geom
            JOIN node AS c ON a.endpoint = c.the_geom
;"""
cur.execute(query)

query = """
    ALTER TABLE network1 ADD COLUMN shape_leng double precision;
    UPDATE network1 SET shape_leng = length(the_geom)
;"""
cur.execute(query)
I got the error at the second cur.execute(query).
But when I go to pgAdmin to check the result, even though no error occurred, the first cur.execute(query) didn't add new columns to my table.
What mistake did I make, and how do I fix it?
I am working with PostgreSQL 8.4 and Python 2.7.6 under Windows 8.1 x64.
When using psycopg2, autocommit is set to False by default. The first two statements both refer to table tc_bf25, but the first statement makes an uncommitted change to the table. So try running conn.commit() between statements to see if this resolves the issue.
You should run each statement individually. Do not combine multiple statements into a semicolon-separated series and run them all at once. It makes error handling and fetching of results much harder.
If you still have the problem once you've made that change, show the exact statement you're having the problem with.
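A minimal sketch of that pattern, using the connection and the first statement from the question (one statement per execute, followed by an explicit commit):

query = "ALTER TABLE tc_bf25 ADD COLUMN source integer;"
cur.execute(query)
conn.commit()   # makes the ALTER TABLE visible to other sessions, e.g. pgAdmin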
Just to add to @Talvalin's answer: you can enable auto-commit by adding

conn = psycopg2.connect("dbname='mydb' user='postgres' host='localhost' password='****'")
conn.autocommit = True

after you connect to your database using psycopg2.