Apache Airflow: setting catchup to False is not working - scheduler

I have a DAG created in Apache Airflow. The scheduler seems configured to run it from June 2015. (By the way, I do not know why: it is a new DAG I created and I didn't backfill it; I only backfilled other DAGs with different DAG IDs over those date intervals, yet the scheduler took those dates and backfilled my new DAG. I'm just starting to work with Airflow.)
(Update: I realized the DAG is backfilled because the start date is set in the DAG's default config, although this still doesn't explain the behaviour I describe below.)
I'm trying to stop the scheduler from running all the DAG executions from that date. The command airflow backfill --mark_success tutorial2 -s '2015-06-01' -e '2019-02-27' gives me database errors (see below), so I'm trying to set catchup to False instead.
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such
table: job [SQL: 'INSERT INTO job (dag_id, state, job_type,
start_date, end_date, latest_heartbeat, executor_class, hostname,
unixname) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)'] [parameters:
('tutorial2', 'running', 'BackfillJob', '2019-02-27 10:52:37.281716',
None, '2019-02-27 10:52:37.281733', 'SequentialExecutor',
'08b6eb432df9', 'airflow')] (Background on this error at:
http://sqlalche.me/e/e3q8)
So I'm using another approach. What I've tried:
Setting catchup_by_default = False in airflow.cfg and restarting the whole Docker container.
Setting catchup = False in my Python DAG file and running the file with python again.
What I'm seeing on the web UI:
DAG executions are being launched starting in June 2015:
[Screenshot: DAG runs being scheduled from June 2015] https://i.stack.imgur.com/7hlL9.png
Catchup is set to False in the DAG's configuration:
[Screenshot: Catchup shown as False in the DAG details] https://i.stack.imgur.com/E01Cc.png
So I don't understand why those DAG executions are being launched.
Thank you
DAG code:
"""
Code that goes along with the Airflow tutorial is located at:
https://github.com/apache/airflow/blob/master/airflow/example_dags/tutorial.py
"""
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'catchup': False,
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}
dag = DAG(
    'tutorial2', default_args=default_args, schedule_interval='* * * * *')
# t1, t2 and t3 are examples of tasks created by instantiating operators
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)
t2 = BashOperator(
    task_id='sleep',
    bash_command='sleep 5',
    retries=3,
    dag=dag)
templated_command = """
{% for i in range(5) %}
    echo "{{ ds }}"
    echo "{{ macros.ds_add(ds, 7)}}"
    echo "{{ params.my_param }}"
{% endfor %}
"""
t3 = BashOperator(
    task_id='templated',
    bash_command=templated_command,
    params={'my_param': 'Parameter I passed in'},
    dag=dag)
t2.set_upstream(t1)
t3.set_upstream(t1)

I think you actually need to specify catchup at the DAG level, not pass it in through default_args. (The latter doesn't really make sense anyway, since those are the default arguments for the tasks; you couldn't have some tasks catch up and others not.)
Try this:
dag = DAG(
    'tutorial2', default_args=default_args, schedule_interval='* * * * *', catchup=False)
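For completeness, a minimal sketch of that split, trimmed down from the question's DAG: per-task defaults stay in default_args, while catchup moves to the DAG constructor.

from datetime import datetime, timedelta
from airflow import DAG

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2015, 6, 1),
    'retries': 1,                         # per-task defaults belong here
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'tutorial2',
    default_args=default_args,            # no 'catchup' key in here
    schedule_interval='* * * * *',
    catchup=False,                        # DAG-level setting goes here
)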

Related

Run Postgres operator with statement timeout in an Airflow DAG

So I've been trying to run two PostgresOperator tasks in my DAG, and it looks like this:
import logging
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

default_args = {
    'owner': 'local',
}

log = logging.getLogger(__name__)

TEMP_SETTLEMENT = """
set statement_timeout to 0;
select function_a();
"""

VACUM_SETTLEMENT = """
vacuum (verbose, analyze) test_table;
"""

try:
    with DAG(
        dag_id='temp-test',
        default_args=default_args,
        schedule_interval=None,
        start_date=datetime(2021, 10, 1),
        max_active_runs=1,
        catchup=False
    ) as dag:
        pg = PostgresOperator(
            task_id="data",
            postgres_conn_id="connection_1",
            database="DB_test",
            autocommit=True,
            sql=TEMP_SETTLEMENT,
        )
        vacum = PostgresOperator(
            task_id="vacum",
            postgres_conn_id="connection_1",
            database="DB_test",
            autocommit=True,
            sql=VACUM_SETTLEMENT,
        )
        pg >> vacum
except ImportError as e:
    log.warning("Could not import DAGs: %s", str(e))
I keep getting the statement timeout when I try to run TEMP_SETTLEMENT. Is there any way to keep statement_timeout=0?
Thanks
Update:
Starting with apache-airflow-providers-postgres>=4.1.0 you can do:
PostgresOperator(
    ...,
    runtime_parameters={'statement_timeout': '3000ms'},
)
This capability was added in the PR that resolved the corresponding issue.
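For the use case in the question, you could presumably pass '0' instead, since in PostgreSQL a statement_timeout of 0 disables the timeout; a sketch against the question's first task:

pg = PostgresOperator(
    task_id="data",
    postgres_conn_id="connection_1",
    database="DB_test",
    sql=TEMP_SETTLEMENT,
    runtime_parameters={'statement_timeout': '0'},  # 0 means no timeout
)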
Original Answer:
You didn't mention it, but from your description I assume that the timeout comes from Postgres and not from Airflow.
For the moment the PostgresOperator does not allow overriding the hook/connection settings.
To solve your issue you will need to edit connection_1: as explained in the docs, you will need to add statement_timeout to the Extra field:
{'statement_timeout': '3600s'}
I opened https://github.com/apache/airflow/issues/21486 as a followup feature request to allow setting statement_timeout directly from the operator.

Use different pandas version for different tasks in the same DAG (Airflow)

Say I have two tasks which use two different versions of, say, pandas:
#my_task_one
import pandas as pd  # Pandas 1.0.0

def f1(data):
    ...
    return 0
and
#my_task_two
import pandas as pd  # version 2.0.0

def f2(data):
    ...
    return 0
In my Airflow setup (local, no Docker), is there a way to create a venv or requirements file for each task? E.g.:
#dag.py
t1 = PythonOperator(
    task_id="t1",
    python_callable=f1,
    requirements="my_task_one_requirement.txt"  # How to set requirements for this task?
)
t2 = PythonOperator(
    task_id="t2",
    python_callable=f2,
    requirements="my_task_two_requirement.txt"  # How to set requirements for this task?
)
t1 >> t2
In case it can't be done in the same DAG file, is there a way to specify the requirements for a given DAG file, e.g. placing t1 and t2 in DAG1 and DAG2 respectively, but with different packages/requirements files?
Airflow has the PythonVirtualenvOperator, which is suitable for this use case:
t1 = PythonVirtualenvOperator(
    task_id="t1",
    python_callable=f1,
    requirements=["pandas==1.0.0"],
)
t2 = PythonVirtualenvOperator(
    task_id="t2",
    python_callable=f2,
    requirements=["pandas==2.0.0"],
)
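One caveat worth adding (a general property of the operator, not something stated in the answer above): the callable is serialized and executed inside a freshly built virtualenv, so its imports should live inside the function body rather than at module level. A sketch of what f1 might look like under that constraint:

def f1(data):
    # imports go inside the function: it runs in the task's own virtualenv
    import pandas as pd
    print(pd.__version__)  # should report 1.0.0 for this task
    return 0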

Airflow default variables - Incremental load setup

I am trying to implement an incremental data load for an extract from one RDS Postgres instance to another RDS Postgres instance.
I am using Airflow to implement the ETL. After reading for a while about Airflow macros, I decided I'd set up the incremental flow with Airflow's predefined default variables.
So the algorithm works this way:
if my previous execution date is None or null or '':
    pick data from the beginning of time (in our case, a year back)
else
    pick the previous execution date
end if
Note: the following code is just for understanding the default variables first; it is not yet applied to the problem I described above.
The corresponding code is shown below. When I run the DAG for the first time, I always end up printing 'None' for the previoussuccessfulexecutiondate variable and never the historical date I mentioned. I am unable to figure this out. Any ideas would be of great help.
from datetime import *
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.dates import days_ago

default_args = {'owner': 'airflow', 'start_date': days_ago(1), 'depends_on_past': 'False'}
dag = DAG('jinja_trial_10', default_args=default_args, schedule_interval=timedelta(minutes=5))

def printexecutiontimes(**kwargs):
    executiondate = kwargs.get('execution_date')
    previoussuccessfulexecutiondate = kwargs.get('prev_execution_date_success')
    previousexecutiondate = kwargs.get('prev_ds_nodash')
    if previoussuccessfulexecutiondate == 'None' or previoussuccessfulexecutiondate is None:
        previoussuccessfulexecutiondate = datetime.strftime(datetime.now() - timedelta(days=365), '%Y-%m-%d')
    print('Execution Date : {0}'.format(executiondate))
    print('Previous successful execution date : {0}'.format(previoussuccessfulexecutiondate))
    print('Previous execution date : {0}'.format(previousexecutiondate))
    print('hello')

task_start = DummyOperator(task_id='start', dag=dag)
jinja_task = PythonOperator(task_id='TryingoutJinjatemplates',
                            python_callable=printexecutiontimes,
                            provide_context='True',
                            dag=dag)
task_end = DummyOperator(task_id='end', dag=dag)
task_start >> jinja_task >> task_end
I had to do something very similar recently, and I ended up creating a custom function using DagRun details.
Refer to this answer if you just want to get the last DAG run (irrespective of status).
For me, I had to get the date of the last successful run, hence I created the function below:
from datetime import datetime
from airflow.models import DagRun

def get_last_dag_run(dag_id):
    dag_runs = DagRun.find(dag_id=dag_id)
    dag_runs.sort(key=lambda x: x.execution_date, reverse=True)
    for dag_run in dag_runs:
        # print all dag runs - debug only
        print(f"All ----- state: {dag_run.state} , run_id: {dag_run.run_id} , execution_date: {dag_run.execution_date}")
    print('Success runs ---------------------------------')
    dag_runs = list(filter(lambda x: x.state == 'success', dag_runs))
    for dag_run in dag_runs:
        # print successful dag runs - debug only
        print(f"Success - state: {dag_run.state} , run_id: {dag_run.run_id} , execution_date: {dag_run.execution_date}")
    # return the last successful execution date or a default value (1970-01-01)
    return dag_runs[0].execution_date if dag_runs else datetime(1970, 1, 1)
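A possible way to wire the helper into a task callable (the DAG id and the callable below are illustrative, not part of the original answer):

def pick_incremental_window(**kwargs):
    # start the extract at the last successful run, or at 1970-01-01 on the very first run
    start_date = get_last_dag_run('airflow_incremental_load_setup')
    end_date = kwargs.get('execution_date')
    print(f"Loading data from {start_date} to {end_date}")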
After a few experiments and a lot of reading, I came up with the following code and it worked for me:
Create a Variable in the Airflow UI and assign it a value of 0.
Use Airflow's predefined variables to determine whether it is a full load or an incremental load.
Pseudo code -
if value of the Variable created = 0
then
    set Variable = 1
    set the start date to a point in time in the past (a date-time from the inception of a certain process)
    set the end date to the value of "execution_date" (one of Airflow's predefined variables)
else
    set the start date to "prev_execution_date_success" (one of Airflow's predefined variables)
    set the end date to "execution_date" (one of Airflow's predefined variables)
end
Below is the code snippet for the same
from datetime import *
from dateutil.parser import parse
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.dates import days_ago
from airflow.models import Variable

default_args = {'owner': 'airflow', 'start_date': datetime(2020, 11, 3, 12, 5), 'depends_on_past': 'False'}
dag = DAG('airflow_incremental_load_setup', default_args=default_args, schedule_interval=timedelta(minutes=5))

def printexecutiontimes(**kwargs):
    # Variable to be created (with value 0) before running the dag
    full_load_check = Variable.get('full_load_completion')
    print('full_load_check : {0}'.format(full_load_check))
    if full_load_check == '0':
        print('First execution')
        print('Execution date : {0}'.format(kwargs.get('execution_date')))
        print('Actual start date : {0}'.format(kwargs.get('ds')))
        print('Previous successful execution date : {0}'.format(kwargs.get('prev_execution_date_success')))
        print('Calculated field : {0}'.format(datetime.strftime(datetime.now() - timedelta(days=365), '%Y-%m-%d')))
        Variable.set('full_load_completion', '1')
        start_date = datetime.strftime(datetime.now() - timedelta(days=365), '%Y-%m-%d')
        end_date = datetime.strftime(kwargs.get('execution_date'), '%Y-%m-%d')
    else:
        print('After the first execution ..')
        print('Execution date : {0}'.format(kwargs.get('execution_date')))
        print('Actual start date : {0}'.format(kwargs.get('ds')))
        print('Previous successful execution date : {0}'.format(kwargs.get('prev_execution_date_success')))
        print('Calculated field : {0}'.format(kwargs.get('prev_execution_date_success')))
        start_date = kwargs.get('prev_execution_date_success')
        start_date = parse(str(start_date))
        end_date = kwargs.get('execution_date')
        end_date = parse(str(end_date))
        print('Type of start_date_check : {0}'.format(type(start_date)))
        start_date = datetime.strftime(start_date, '%Y-%m-%d')
        end_date = datetime.strftime(end_date, '%Y-%m-%d')

task_start = DummyOperator(task_id='start', dag=dag)
main_task = PythonOperator(task_id='IncrementalJobTask',
                           python_callable=printexecutiontimes,
                           provide_context='True',
                           dag=dag)
task_end = DummyOperator(task_id='end', dag=dag)
task_start >> main_task >> task_end
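A small hardening of the same idea (an assumption about preference, not something from the answer above): Variable.get accepts a default_var fallback, so the Variable does not have to be pre-created in the UI before the first run.

full_load_check = Variable.get('full_load_completion', default_var='0')  # '0' forces a full load the first time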
It helped me:
if isinstance(context['prev_execution_date_success'], type(None)):
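That check can be expanded into a small sketch along the lines of the approach above (the one-year fallback is only illustrative):

from datetime import datetime, timedelta

if isinstance(context['prev_execution_date_success'], type(None)):
    # first run: there is no previous successful run yet, so fall back to a year back
    start_date = datetime.now() - timedelta(days=365)
else:
    start_date = context['prev_execution_date_success']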

Sequence does not start with an initial number using the apartment gem

I'm trying to start a sequence at an initial number in my tenants, but only the public schema gets it.
Take a look at my migration:
class CreateDisputes < ActiveRecord::Migration[5.0]
  def change
    create_table :disputes, id: :uuid do |t|
      ...
      t.integer :code
      ...
    end
    execute %{
      CREATE SEQUENCE disputes_code_seq INCREMENT BY 1
        NO MINVALUE NO MAXVALUE
        START WITH 1001 CACHE 1
        OWNED BY disputes.code;
      ALTER TABLE ONLY disputes
        ALTER COLUMN code SET DEFAULT nextval('disputes_code_seq'::regclass);
    }
    ...
  end
end
Thanks!
I am no expert in the apartment gem, but Apartment is not creating the disputes_code_seq in the tenant's schema.
The workaround for this is to uncomment the following line in config/initializers/apartment.rb
# Apartment can be forced to use raw SQL dumps instead of schema.rb for creating new schemas.
# Use this when you are using some extra features in PostgreSQL that can't be represented in
# schema.rb, like materialized views etc. (only applies with use_schemas set to true).
# (Note: this option doesn't use db/structure.sql, it creates SQL dump by executing pg_dump)
#
config.use_sql = true
With config.use_sql set to true, the Apartment migration will create the sequence for the tenant. Here are the logs from the migration and the Rails console for reference.
Following is the migration log
ubuntu#ubuntu-xenial:~/devel/apartment/testseq$ rails db:migrate
== 20170224161015 CreateDisputes: migrating ===================================
-- create_table(:disputes)
-> 0.0035s
-- execute("\n CREATE SEQUENCE disputes_code_seq INCREMENT BY 1\n NO MINVALUE NO MAXVALUE\n START WITH 1001 CACHE 1\n OWNED BY disputes.code;\n\n ALTER TABLE ONLY disputes\n ALTER COLUMN code SET DEFAULT nextval('disputes_code_seq'::regclass);\n ")
-> 0.0012s
== 20170224161015 CreateDisputes: migrated (0.0065s) ==========================
[WARNING] - The list of tenants to migrate appears to be empty. This could mean a few things:
1. You may not have created any, in which case you can ignore this message
2. You've run `apartment:migrate` directly without loading the Rails environment
* `apartment:migrate` is now deprecated. Tenants will automatically be migrated with `db:migrate`
Note that your tenants currently haven't been migrated. You'll need to run `db:migrate` to rectify this.
Following is the log of tenant creation and adding a row to disputes
irb(main):001:0> Apartment::Tenant.create('tenant2')
<output snipped for brevity>
irb(main):005:0> Apartment::Tenant.switch!('tenant2')
=> "\"tenant2\""
irb(main):006:0> d = Dispute.new
=> #<Dispute id: nil, code: nil, created_at: nil, updated_at: nil>
irb(main):007:0> d.save
(0.2ms) BEGIN
SQL (0.6ms) INSERT INTO "disputes" ("created_at", "updated_at") VALUES ($1, $2) RETURNING "id" [["created_at", 2017-02-25 03:09:49 UTC], ["updated_at", 2017-02-25 03:09:49 UTC]]
(0.6ms) COMMIT
=> true
irb(main):008:0> d.reload
Dispute Load (0.3ms) SELECT "disputes".* FROM "disputes" WHERE "disputes"."id" = $1 LIMIT $2 [["id", 1], ["LIMIT", 1]]
=> #<Dispute id: 1, code: 1001, created_at: "2017-02-25 03:09:49", updated_at: "2017-02-25 03:09:49">
As you can see in the following log, the code column keeps following the sequence numbers.
irb(main):009:0> d = Dispute.new
=> #<Dispute id: nil, code: nil, created_at: nil, updated_at: nil>
irb(main):010:0> d.save
(0.3ms) BEGIN
SQL (0.6ms) INSERT INTO "disputes" ("created_at", "updated_at") VALUES ($1, $2) RETURNING "id" [["created_at", 2017-02-25 03:11:13 UTC], ["updated_at", 2017-02-25 03:11:13 UTC]]
(0.5ms) COMMIT
=> true
irb(main):011:0> d.reload
Dispute Load (0.5ms) SELECT "disputes".* FROM "disputes" WHERE "disputes"."id" = $1 LIMIT $2 [["id", 2], ["LIMIT", 1]]
=> #<Dispute id: 2, code: 1002, created_at: "2017-02-25 03:11:13", updated_at: "2017-02-25 03:11:13">

SQLAlchemy: Problems Migrating to PostgreSQL from SQLite (e.g. sqlalchemy.exc.ProgrammingError:)

I am having difficulties migrating a working script from SQLite to PostgreSQL. I am using SQLAlchemy. When I run the script, it raises the following error:
raise exc.DBAPIError.instance(statement, parameters, e, connection_invalidated=is_disconnect)
sqlalchemy.exc.ProgrammingError: (ProgrammingError) can't adapt 'INSERT INTO cnn_hot_stocks (datetime, list, ticker, price, change, "pctChange") VALUES (%(datetime)s, %(list)s, %(ticker)s, %(price)s, %(change)s, %(pctChange)s)' {'price': Decimal('7.94'), 'list': 'active', 'datetime': datetime.datetime(2012, 6, 23, 11, 45, 1, 544361), 'pctChange': u'+1.53%', 'ticker': u'BAC', 'change': Decimal('0.12')}
The insert call works well with the SQLite engine, but I want to use PostgreSQL to take advantage of a native decimal type for keeping financial data exact. I copied the script and just changed the DB engine to my PostgreSQL server. Any advice on how to troubleshoot this error would be greatly appreciated by this SQLAlchemy newbie... I think I am up a creek on this one! Thanks in advance!
Here are my relevant code segments and table descriptions:
# imports assumed by this snippet (Python 2 era: urllib2, pytidylib, BeautifulSoup)
import urllib2
from datetime import datetime
from decimal import Decimal
from bs4 import BeautifulSoup
from tidylib import tidy_document
from sqlalchemy import create_engine, MetaData, Table

dbstring = "postgresql://postgres:postgres@localhost:5432/algo"
db = create_engine(dbstring)
db.echo = True  # log the generated SQL
metadata = MetaData(db)
cnn_hot_stocks = Table('cnn_hot_stocks', metadata, autoload=True)
i = cnn_hot_stocks.insert()  # running log from CNN hot stocks web-site

def scrape_data():
    try:
        html = urllib2.urlopen('http://money.cnn.com/data/hotstocks/').read()
        markup, errors = tidy_document(html)
        soup = BeautifulSoup(markup,)
    except Exception as e:
        pass
    list_map = {2: 'active',
                3: 'gainer',
                4: 'loser'
                }
    # Iterate over 3 tables on CNN hot stock web-site
    for x in range(2, 5):
        table = soup('table')[x]
        for row in table.findAll('tr')[1:]:
            timestamp = datetime.now()
            col = row.findAll('td')
            ticker = col[0].a.string
            price = Decimal(col[1].span.string)
            change = Decimal(col[2].span.span.string)
            pctChange = col[3].span.span.string
            log_data = {'datetime': timestamp,
                        'list': list_map[x],
                        'ticker': ticker,
                        'price': price,
                        'change': change,
                        'pctChange': pctChange
                        }
            print log_data
            # Commit to DB
            i.execute(log_data)
TABLE:
cnn_hot_stocks = Table('cnn_hot_stocks', metadata,  # log of stocks data on CNN hot stocks lists
    Column('datetime', DateTime, primary_key=True),
    Column('list', String),  # loser/gainer/active
    Column('ticker', String),
    Column('price', Numeric),
    Column('change', Numeric),
    Column('pctChange', String),
)
My reading of the documentation is that you have to use Numeric instead of Decimal.
PostgreSQL has no type named decimal (it's an alias for numeric, but not a very full-featured one), and SQLAlchemy seems to expect Numeric as the type it can use for abstraction purposes.
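As a minimal sketch of that suggestion (the precision and scale values are assumptions for illustration, not something from the answer): declaring the money columns with sqlalchemy.Numeric maps them to PostgreSQL's NUMERIC type, which accepts Python Decimal values directly.

from sqlalchemy import Column, DateTime, MetaData, Numeric, String, Table

metadata = MetaData()
cnn_hot_stocks = Table(
    'cnn_hot_stocks', metadata,
    Column('datetime', DateTime, primary_key=True),
    Column('list', String),
    Column('ticker', String),
    Column('price', Numeric(12, 2)),   # stores Decimal('7.94') exactly as NUMERIC(12, 2)
    Column('change', Numeric(12, 2)),
    Column('pctChange', String),
)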