Apache Airflow: setting catchup to False is not working - scheduler

I have a DAG created in Apache Airflow. The scheduler seems configured to run it from June 2015. (By the way, I do not know why: it is a new DAG I created and I didn't backfill it; I only backfilled other DAGs with different DAG IDs over those date intervals, yet the scheduler took those dates and backfilled my new DAG. I'm just starting to work with Airflow.)
(Update: I realized the DAG is backfilled because the start date is set in the DAG's default config, although this still doesn't explain the behaviour I describe below.)
I'm trying to stop the scheduler from running all the DAG executions from that date. The command airflow backfill --mark_success tutorial2 -s '2015-06-01' -e '2019-02-27' gives me database errors (see below), so I'm trying to set catchup to False instead.
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such
table: job [SQL: 'INSERT INTO job (dag_id, state, job_type,
start_date, end_date, latest_heartbeat, executor_class, hostname,
unixname) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)'] [parameters:
('tutorial2', 'running', 'BackfillJob', '2019-02-27 10:52:37.281716',
None, '2019-02-27 10:52:37.281733', 'SequentialExecutor',
'08b6eb432df9', 'airflow')] (Background on this error at:
http://sqlalche.me/e/e3q8)
So I'm using another approach. What I've tried:
Setting catchup_by_default = False in airflow.cfg and restarting the whole Docker container.
Setting catchup = False in my Python DAG file and running the file with python again.
What I'm seeing on the web UI:
DAG executions are being launched starting in June 2015:
[Screenshot: DAG runs being scheduled from June 2015] https://i.stack.imgur.com/7hlL9.png
Catchup is set to False in the DAG's configuration:
[Screenshot: Catchup shown as False in the DAG details] https://i.stack.imgur.com/E01Cc.png
So I don't understand why those DAG executions are being launched.
Thank you
DAG code:
"""
Code that goes along with the Airflow tutorial is located at:
https://github.com/apache/airflow/blob/master/airflow/example_dags/tutorial.py
"""
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'catchup': False,
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}
dag = DAG(
    'tutorial2', default_args=default_args, schedule_interval='* * * * *')
# t1, t2 and t3 are examples of tasks created by instantiating operators
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)
t2 = BashOperator(
    task_id='sleep',
    bash_command='sleep 5',
    retries=3,
    dag=dag)
templated_command = """
{% for i in range(5) %}
    echo "{{ ds }}"
    echo "{{ macros.ds_add(ds, 7)}}"
    echo "{{ params.my_param }}"
{% endfor %}
"""
t3 = BashOperator(
    task_id='templated',
    bash_command=templated_command,
    params={'my_param': 'Parameter I passed in'},
    dag=dag)
t2.set_upstream(t1)
t3.set_upstream(t1)

I think you actually need to specify catchup at the DAG level, not pass it in through default_args. (The latter doesn't really make sense anyway, since those are the default arguments for the tasks; you couldn't have some tasks catch up and others not.)
Try this:
dag = DAG(
    'tutorial2', default_args=default_args, schedule_interval='* * * * *', catchup=False)
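For completeness, a minimal sketch of that split, trimmed down from the question's DAG: per-task defaults stay in default_args, while catchup moves to the DAG constructor.

from datetime import datetime, timedelta
from airflow import DAG

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2015, 6, 1),
    'retries': 1,                         # per-task defaults belong here
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'tutorial2',
    default_args=default_args,            # no 'catchup' key in here
    schedule_interval='* * * * *',
    catchup=False,                        # DAG-level setting goes here
)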

Related

Run Postgres operator with statement timeout in an Airflow DAG

So I've been trying to run two PostgresOperator tasks in my DAG, and it looks like this:
import logging
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

default_args = {
    'owner': 'local',
}

log = logging.getLogger(__name__)

TEMP_SETTLEMENT = """
set statement_timeout to 0;
select function_a();
"""

VACUM_SETTLEMENT = """
vacuum (verbose, analyze) test_table;
"""

try:
    with DAG(
        dag_id='temp-test',
        default_args=default_args,
        schedule_interval=None,
        start_date=datetime(2021, 10, 1),
        max_active_runs=1,
        catchup=False
    ) as dag:
        pg = PostgresOperator(
            task_id="data",
            postgres_conn_id="connection_1",
            database="DB_test",
            autocommit=True,
            sql=TEMP_SETTLEMENT,
        )
        vacum = PostgresOperator(
            task_id="vacum",
            postgres_conn_id="connection_1",
            database="DB_test",
            autocommit=True,
            sql=VACUM_SETTLEMENT,
        )
        pg >> vacum
except ImportError as e:
    log.warning("Could not import DAGs: %s", str(e))
I keep getting the statement timeout when I try to run TEMP_SETTLEMENT. Is there any way to keep statement_timeout=0?
Thanks
Update:
Starting with apache-airflow-providers-postgres>=4.1.0 you can do:
PostgresOperator(
    ...,
    runtime_parameters={'statement_timeout': '3000ms'},
)
This capability was added in the PR that resolved the corresponding issue.
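For the use case in the question, you could presumably pass '0' instead, since in PostgreSQL a statement_timeout of 0 disables the timeout; a sketch against the question's first task:

pg = PostgresOperator(
    task_id="data",
    postgres_conn_id="connection_1",
    database="DB_test",
    sql=TEMP_SETTLEMENT,
    runtime_parameters={'statement_timeout': '0'},  # 0 means no timeout
)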
Original Answer:
You didn't mention it, but from your description I assume that the timeout comes from Postgres and not from Airflow.
For the moment the PostgresOperator does not allow overriding the hook/connection settings.
To solve your issue you will need to edit connection_1: as explained in the docs, you will need to add statement_timeout to the Extra field:
{'statement_timeout': '3600s'}
I opened https://github.com/apache/airflow/issues/21486 as a followup feature request to allow setting statement_timeout directly from the operator.

Use different pandas version for different tasks in the same DAG (Airflow)

Say I have two tasks which use two different versions of, say, pandas:
#my_task_one
import pandas as pd  # Pandas 1.0.0

def f1(data):
    ...
    return 0
and
#my_task_two
import pandas as pd  # version 2.0.0

def f2(data):
    ...
    return 0
In my Airflow setup (local, no Docker), is there a way to create a venv or requirements file for each task? E.g.:
#dag.py
t1 = PythonOperator(
    task_id="t1",
    python_callable=f1,
    requirements="my_task_one_requirement.txt"  # How to set requirements for this task?
)
t2 = PythonOperator(
    task_id="t2",
    python_callable=f2,
    requirements="my_task_two_requirement.txt"  # How to set requirements for this task?
)
t1 >> t2
In case it can't be done in the same DAG file, is there a way to specify the requirements for a given DAG file, e.g. placing t1 and t2 in DAG1 and DAG2 respectively, but with different packages/requirements files?
Airflow has the PythonVirtualenvOperator, which is suitable for this use case:
t1 = PythonVirtualenvOperator(
    task_id="t1",
    python_callable=f1,
    requirements=["pandas==1.0.0"],
)
t2 = PythonVirtualenvOperator(
    task_id="t2",
    python_callable=f2,
    requirements=["pandas==2.0.0"],
)
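One caveat worth adding (a general property of the operator, not something stated in the answer above): the callable is serialized and executed inside a freshly built virtualenv, so its imports should live inside the function body rather than at module level. A sketch of what f1 might look like under that constraint:

def f1(data):
    # imports go inside the function: it runs in the task's own virtualenv
    import pandas as pd
    print(pd.__version__)  # should report 1.0.0 for this task
    return 0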

Airflow default variables - Incremental load setup

I am trying to implement an incremental data load for an extract from one RDS Postgres instance to another RDS Postgres instance.
I am using Airflow to implement the ETL. After reading for a while about Airflow macros, I decided I'd set up the incremental flow with Airflow's predefined default variables.
So the algorithm works this way:
if my previous execution date is None or null or '':
    pick data from the beginning of time (in our case, a year back)
else
    pick the previous execution date
end if
Note: the following code is just for understanding the default variables first; it is not yet applied to the problem I described above.
The corresponding code is shown below. When I run the DAG for the first time, I always end up printing 'None' for the previoussuccessfulexecutiondate variable and never the historical date I mentioned. I am unable to figure this out. Any ideas would be of great help.
from datetime import *
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.dates import days_ago

default_args = {'owner': 'airflow', 'start_date': days_ago(1), 'depends_on_past': 'False'}
dag = DAG('jinja_trial_10', default_args=default_args, schedule_interval=timedelta(minutes=5))

def printexecutiontimes(**kwargs):
    executiondate = kwargs.get('execution_date')
    previoussuccessfulexecutiondate = kwargs.get('prev_execution_date_success')
    previousexecutiondate = kwargs.get('prev_ds_nodash')
    if previoussuccessfulexecutiondate == 'None' or previoussuccessfulexecutiondate is None:
        previoussuccessfulexecutiondate = datetime.strftime(datetime.now() - timedelta(days=365), '%Y-%m-%d')
    print('Execution Date : {0}'.format(executiondate))
    print('Previous successful execution date : {0}'.format(previoussuccessfulexecutiondate))
    print('Previous execution date : {0}'.format(previousexecutiondate))
    print('hello')

task_start = DummyOperator(task_id='start', dag=dag)
jinja_task = PythonOperator(task_id='TryingoutJinjatemplates',
                            python_callable=printexecutiontimes,
                            provide_context='True',
                            dag=dag)
task_end = DummyOperator(task_id='end', dag=dag)
task_start >> jinja_task >> task_end
I had to do something very similar recently, and I ended up creating a custom function using DagRun details.
Refer to this answer if you just want to get the last DAG run (irrespective of status).
For me, I had to get the date of the last successful run, hence I created the function below:
from datetime import datetime
from airflow.models import DagRun

def get_last_dag_run(dag_id):
    dag_runs = DagRun.find(dag_id=dag_id)
    dag_runs.sort(key=lambda x: x.execution_date, reverse=True)
    for dag_run in dag_runs:
        # print all dag runs - debug only
        print(f"All ----- state: {dag_run.state} , run_id: {dag_run.run_id} , execution_date: {dag_run.execution_date}")
    print('Success runs ---------------------------------')
    dag_runs = list(filter(lambda x: x.state == 'success', dag_runs))
    for dag_run in dag_runs:
        # print successful dag runs - debug only
        print(f"Success - state: {dag_run.state} , run_id: {dag_run.run_id} , execution_date: {dag_run.execution_date}")
    # return the last successful execution date or a default value (1970-01-01)
    return dag_runs[0].execution_date if dag_runs else datetime(1970, 1, 1)
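A possible way to wire the helper into a task callable (the DAG id and the callable below are illustrative, not part of the original answer):

def pick_incremental_window(**kwargs):
    # start the extract at the last successful run, or at 1970-01-01 on the very first run
    start_date = get_last_dag_run('airflow_incremental_load_setup')
    end_date = kwargs.get('execution_date')
    print(f"Loading data from {start_date} to {end_date}")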
After a few experiments and a lot of reading, I came up with the following code and it worked for me:
Create a Variable in the Airflow UI and assign it a value of 0.
Use Airflow's predefined variables to determine whether it is a full load or an incremental load.
Pseudo code -
if value of the Variable created = 0
then
    set Variable = 1
    set the start date to a point in time in the past (a date-time from the inception of a certain process)
    set the end date to the value of "execution_date" (one of Airflow's predefined variables)
else
    set the start date to "prev_execution_date_success" (one of Airflow's predefined variables)
    set the end date to "execution_date" (one of Airflow's predefined variables)
end
Below is the code snippet for the same
from datetime import *
from dateutil.parser import parse
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.dates import days_ago
from airflow.models import Variable

default_args = {'owner': 'airflow', 'start_date': datetime(2020, 11, 3, 12, 5), 'depends_on_past': 'False'}
dag = DAG('airflow_incremental_load_setup', default_args=default_args, schedule_interval=timedelta(minutes=5))

def printexecutiontimes(**kwargs):
    # Variable to be created (with value 0) before running the dag
    full_load_check = Variable.get('full_load_completion')
    print('full_load_check : {0}'.format(full_load_check))
    if full_load_check == '0':
        print('First execution')
        print('Execution date : {0}'.format(kwargs.get('execution_date')))
        print('Actual start date : {0}'.format(kwargs.get('ds')))
        print('Previous successful execution date : {0}'.format(kwargs.get('prev_execution_date_success')))
        print('Calculated field : {0}'.format(datetime.strftime(datetime.now() - timedelta(days=365), '%Y-%m-%d')))
        Variable.set('full_load_completion', '1')
        start_date = datetime.strftime(datetime.now() - timedelta(days=365), '%Y-%m-%d')
        end_date = datetime.strftime(kwargs.get('execution_date'), '%Y-%m-%d')
    else:
        print('After the first execution ..')
        print('Execution date : {0}'.format(kwargs.get('execution_date')))
        print('Actual start date : {0}'.format(kwargs.get('ds')))
        print('Previous successful execution date : {0}'.format(kwargs.get('prev_execution_date_success')))
        print('Calculated field : {0}'.format(kwargs.get('prev_execution_date_success')))
        start_date = kwargs.get('prev_execution_date_success')
        start_date = parse(str(start_date))
        end_date = kwargs.get('execution_date')
        end_date = parse(str(end_date))
        print('Type of start_date_check : {0}'.format(type(start_date)))
        start_date = datetime.strftime(start_date, '%Y-%m-%d')
        end_date = datetime.strftime(end_date, '%Y-%m-%d')

task_start = DummyOperator(task_id='start', dag=dag)
main_task = PythonOperator(task_id='IncrementalJobTask',
                           python_callable=printexecutiontimes,
                           provide_context='True',
                           dag=dag)
task_end = DummyOperator(task_id='end', dag=dag)
task_start >> main_task >> task_end
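A small hardening of the same idea (an assumption about preference, not something from the answer above): Variable.get accepts a default_var fallback, so the Variable does not have to be pre-created in the UI before the first run.

full_load_check = Variable.get('full_load_completion', default_var='0')  # '0' forces a full load the first time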
It helped me:
if isinstance(context['prev_execution_date_success'], type(None)):
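That check can be expanded into a small sketch along the lines of the approach above (the one-year fallback is only illustrative):

from datetime import datetime, timedelta

if isinstance(context['prev_execution_date_success'], type(None)):
    # first run: there is no previous successful run yet, so fall back to a year back
    start_date = datetime.now() - timedelta(days=365)
else:
    start_date = context['prev_execution_date_success']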

Sequence does not start with an initial number using the apartment gem

I'm trying to start a sequence at an initial number in my tenants, but only the public schema gets it.
Take a look at my migration:
class CreateDisputes < ActiveRecord::Migration[5.0]
  def change
    create_table :disputes, id: :uuid do |t|
      ...
      t.integer :code
      ...
    end
    execute %{
      CREATE SEQUENCE disputes_code_seq INCREMENT BY 1
        NO MINVALUE NO MAXVALUE
        START WITH 1001 CACHE 1
        OWNED BY disputes.code;
      ALTER TABLE ONLY disputes
        ALTER COLUMN code SET DEFAULT nextval('disputes_code_seq'::regclass);
    }
    ...
  end
end
Thanks!
I am no expert in the apartment gem, but Apartment is not creating the disputes_code_seq in the tenant's schema.
The workaround for this is to uncomment the following line in config/initializers/apartment.rb
# Apartment can be forced to use raw SQL dumps instead of schema.rb for creating new schemas.
# Use this when you are using some extra features in PostgreSQL that can't be represented in
# schema.rb, like materialized views etc. (only applies with use_schemas set to true).
# (Note: this option doesn't use db/structure.sql, it creates SQL dump by executing pg_dump)
#
config.use_sql = true
With config.use_sql set to true, the Apartment migration will create the sequence for the tenant. Here are the logs from the migration and the Rails console for reference.
Following is the migration log
ubuntu#ubuntu-xenial:~/devel/apartment/testseq$ rails db:migrate
== 20170224161015 CreateDisputes: migrating ===================================
-- create_table(:disputes)
-> 0.0035s
-- execute("\n CREATE SEQUENCE disputes_code_seq INCREMENT BY 1\n NO MINVALUE NO MAXVALUE\n START WITH 1001 CACHE 1\n OWNED BY disputes.code;\n\n ALTER TABLE ONLY disputes\n ALTER COLUMN code SET DEFAULT nextval('disputes_code_seq'::regclass);\n ")
-> 0.0012s
== 20170224161015 CreateDisputes: migrated (0.0065s) ==========================
[WARNING] - The list of tenants to migrate appears to be empty. This could mean a few things:
1. You may not have created any, in which case you can ignore this message
2. You've run `apartment:migrate` directly without loading the Rails environment
* `apartment:migrate` is now deprecated. Tenants will automatically be migrated with `db:migrate`
Note that your tenants currently haven't been migrated. You'll need to run `db:migrate` to rectify this.
Following is the log of tenant creation and adding a row to disputes
irb(main):001:0> Apartment::Tenant.create('tenant2')
<output snipped for brevity>
irb(main):005:0> Apartment::Tenant.switch!('tenant2')
=> "\"tenant2\""
irb(main):006:0> d = Dispute.new
=> #<Dispute id: nil, code: nil, created_at: nil, updated_at: nil>
irb(main):007:0> d.save
(0.2ms) BEGIN
SQL (0.6ms) INSERT INTO "disputes" ("created_at", "updated_at") VALUES ($1, $2) RETURNING "id" [["created_at", 2017-02-25 03:09:49 UTC], ["updated_at", 2017-02-25 03:09:49 UTC]]
(0.6ms) COMMIT
=> true
irb(main):008:0> d.reload
Dispute Load (0.3ms) SELECT "disputes".* FROM "disputes" WHERE "disputes"."id" = $1 LIMIT $2 [["id", 1], ["LIMIT", 1]]
=> #<Dispute id: 1, code: 1001, created_at: "2017-02-25 03:09:49", updated_at: "2017-02-25 03:09:49">
As you can see in the following log, the code column keeps following the sequence numbers.
irb(main):009:0> d = Dispute.new
=> #<Dispute id: nil, code: nil, created_at: nil, updated_at: nil>
irb(main):010:0> d.save
(0.3ms) BEGIN
SQL (0.6ms) INSERT INTO "disputes" ("created_at", "updated_at") VALUES ($1, $2) RETURNING "id" [["created_at", 2017-02-25 03:11:13 UTC], ["updated_at", 2017-02-25 03:11:13 UTC]]
(0.5ms) COMMIT
=> true
irb(main):011:0> d.reload
Dispute Load (0.5ms) SELECT "disputes".* FROM "disputes" WHERE "disputes"."id" = $1 LIMIT $2 [["id", 2], ["LIMIT", 1]]
=> #<Dispute id: 2, code: 1002, created_at: "2017-02-25 03:11:13", updated_at: "2017-02-25 03:11:13">

SQLAlchemy: Problems Migrating to PostgreSQL from SQLite (e.g. sqlalchemy.exc.ProgrammingError:)

I am having difficulties migrating a working script from SQLite to PostgreSQL. I am using SQLAlchemy. When I run the script, it raises the following error:
raise exc.DBAPIError.instance(statement, parameters, e, connection_invalidated=is_disconnect)
sqlalchemy.exc.ProgrammingError: (ProgrammingError) can't adapt 'INSERT INTO cnn_hot_stocks (datetime, list, ticker, price, change, "pctChange") VALUES (%(datetime)s, %(list)s, %(ticker)s, %(price)s, %(change)s, %(pctChange)s)' {'price': Decimal('7.94'), 'list': 'active', 'datetime': datetime.datetime(2012, 6, 23, 11, 45, 1, 544361), 'pctChange': u'+1.53%', 'ticker': u'BAC', 'change': Decimal('0.12')}
The insert call works well with the SQLite engine, but I want to use PostgreSQL to take advantage of a native decimal type for keeping financial data exact. I copied the script and just changed the DB engine to my PostgreSQL server. Any advice on how to troubleshoot this error would be greatly appreciated by this SQLAlchemy newbie... I think I am up a creek on this one! Thanks in advance!
Here are my relevant code segments and table descriptions:
# imports assumed by this snippet (Python 2 era: urllib2, pytidylib, BeautifulSoup)
import urllib2
from datetime import datetime
from decimal import Decimal
from bs4 import BeautifulSoup
from tidylib import tidy_document
from sqlalchemy import create_engine, MetaData, Table

dbstring = "postgresql://postgres:postgres@localhost:5432/algo"
db = create_engine(dbstring)
db.echo = True  # log the generated SQL
metadata = MetaData(db)
cnn_hot_stocks = Table('cnn_hot_stocks', metadata, autoload=True)
i = cnn_hot_stocks.insert()  # running log from CNN hot stocks web-site

def scrape_data():
    try:
        html = urllib2.urlopen('http://money.cnn.com/data/hotstocks/').read()
        markup, errors = tidy_document(html)
        soup = BeautifulSoup(markup,)
    except Exception as e:
        pass
    list_map = {2: 'active',
                3: 'gainer',
                4: 'loser'
                }
    # Iterate over 3 tables on CNN hot stock web-site
    for x in range(2, 5):
        table = soup('table')[x]
        for row in table.findAll('tr')[1:]:
            timestamp = datetime.now()
            col = row.findAll('td')
            ticker = col[0].a.string
            price = Decimal(col[1].span.string)
            change = Decimal(col[2].span.span.string)
            pctChange = col[3].span.span.string
            log_data = {'datetime': timestamp,
                        'list': list_map[x],
                        'ticker': ticker,
                        'price': price,
                        'change': change,
                        'pctChange': pctChange
                        }
            print log_data
            # Commit to DB
            i.execute(log_data)
TABLE:
cnn_hot_stocks = Table('cnn_hot_stocks', metadata,  # log of stocks data on CNN hot stocks lists
    Column('datetime', DateTime, primary_key=True),
    Column('list', String),  # loser/gainer/active
    Column('ticker', String),
    Column('price', Numeric),
    Column('change', Numeric),
    Column('pctChange', String),
)
My reading of the documentation is that you have to use Numeric instead of Decimal.
PostgreSQL has no type named decimal (it's an alias for numeric, but not a very full-featured one), and SQLAlchemy seems to expect Numeric as the type it can use for abstraction purposes.
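As a minimal sketch of that suggestion (the precision and scale values are assumptions for illustration, not something from the answer): declaring the money columns with sqlalchemy.Numeric maps them to PostgreSQL's NUMERIC type, which accepts Python Decimal values directly.

from sqlalchemy import Column, DateTime, MetaData, Numeric, String, Table

metadata = MetaData()
cnn_hot_stocks = Table(
    'cnn_hot_stocks', metadata,
    Column('datetime', DateTime, primary_key=True),
    Column('list', String),
    Column('ticker', String),
    Column('price', Numeric(12, 2)),   # stores Decimal('7.94') exactly as NUMERIC(12, 2)
    Column('change', Numeric(12, 2)),
    Column('pctChange', String),
)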