Airflow default variables - Incremental load setup - macros

I am trying to implement an incremental data load for a data extract from one RDS Postgres instance to another.
I am using Airflow to implement the ETL. After reading about Airflow macros for a while, I decided to set up the incremental flow with Airflow's default (predefined) variables.
The algorithm is:
if my previous execution date is None or null or '':
    pick data from the beginning of time (in our case, a year back)
else
    pick the previous execution date
end if
Note: the following code is just to understand the default variables; it is not yet wired into the problem described above.
The corresponding code is shown below. When I run the DAG for the first time, I always end up printing 'None' for the previoussuccessfulexecutiondate variable and never the historical date like I mentioned. I am unable to figure this out. Any ideas on this would be of great help.
from datetime import *

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.dates import days_ago

default_args = {'owner': 'airflow', 'start_date': days_ago(1), 'depends_on_past': False}
dag = DAG('jinja_trial_10', default_args=default_args, schedule_interval=timedelta(minutes=5))

def printexecutiontimes(**kwargs):
    executiondate = kwargs.get('execution_date')
    previoussuccessfulexecutiondate = kwargs.get('prev_execution_date_success')
    previousexecutiondate = kwargs.get('prev_ds_nodash')
    if previoussuccessfulexecutiondate == 'None' or previoussuccessfulexecutiondate is None:
        previoussuccessfulexecutiondate = datetime.strftime(datetime.now() - timedelta(days=365), '%Y-%m-%d')
    print('Execution Date : {0}'.format(executiondate))
    print('Previous successful execution date : {0}'.format(previoussuccessfulexecutiondate))
    print('Previous execution date : {0}'.format(previousexecutiondate))
    print('hello')

task_start = DummyOperator(task_id='start', dag=dag)
jinja_task = PythonOperator(task_id='TryingoutJinjatemplates',
                            python_callable=printexecutiontimes,
                            provide_context=True,
                            dag=dag)
task_end = DummyOperator(task_id='end', dag=dag)

task_start >> jinja_task >> task_end

I had to do something very similar recently, and the following is what I ended up with: a custom function using DagRun details.
Refer to this answer if you just want to get the last DAG run (irrespective of status).
For me, I had to get the date of the last successful run, hence I created the function below:
from datetime import datetime

from airflow.models import DagRun

def get_last_dag_run(dag_id):
    dag_runs = DagRun.find(dag_id=dag_id)
    dag_runs.sort(key=lambda x: x.execution_date, reverse=True)
    for dag_run in dag_runs:
        # print all dag runs - debug only
        print(f"All ----- state: {dag_run.state} , run_id: {dag_run.run_id} , execution_date: {dag_run.execution_date}")
    print('Success runs ---------------------------------')
    dag_runs = list(filter(lambda x: x.state == 'success', dag_runs))
    for dag_run in dag_runs:
        # print successful dag runs - debug only
        print(f"Success - state: {dag_run.state} , run_id: {dag_run.run_id} , execution_date: {dag_run.execution_date}")
    # return last successful execution date or a default value (1970-01-01)
    return dag_runs[0].execution_date if dag_runs else datetime(1970, 1, 1)
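A minimal usage sketch, assuming the function above is importable from the DAG file (the dag id and the window logic here are illustrative, not from the original post):

def compute_incremental_window(**kwargs):
    # start of the window: last successful run, or the 1970-01-01 default
    window_start = get_last_dag_run('jinja_trial_10')
    window_end = kwargs['execution_date']
    print(f"Extracting rows changed between {window_start} and {window_end}")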

After a few experiments and a lot of reading, I came up with the following code and it worked for me:
Create a variable in the Airflow UI and assign it a value of 0.
Use Airflow's predefined variables to determine whether it is a full load or an incremental load.
Pseudo code -
if value of the Variable created = 0 then
    set the Variable to 1
    set the start date to a point in time in the past (a date-time from the inception of a certain process)
    set the end date to the value of "execution_date" (one of Airflow's predefined variables)
else
    set the start date to "prev_execution_date_success" (one of Airflow's predefined variables)
    set the end date to "execution_date" (one of Airflow's predefined variables)
end
Below is the code snippet for the same
from datetime import *

from dateutil.parser import parse

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.models import Variable

default_args = {'owner': 'airflow', 'start_date': datetime(2020, 11, 3, 12, 5), 'depends_on_past': False}
dag = DAG('airflow_incremental_load_setup', default_args=default_args, schedule_interval=timedelta(minutes=5))

def printexecutiontimes(**kwargs):
    # Variable to be created in the Airflow UI before the first run of the dag
    full_load_check = Variable.get('full_load_completion')
    print('full_load_check : {0}'.format(full_load_check))
    if full_load_check == '0':
        print('First execution')
        print('Execution date : {0}'.format(kwargs.get('execution_date')))
        print('Actual start date : {0}'.format(kwargs.get('ds')))
        print('Previous successful execution date : {0}'.format(kwargs.get('prev_execution_date_success')))
        print('Calculated field : {0}'.format(datetime.strftime(datetime.now() - timedelta(days=365), '%Y-%m-%d')))
        Variable.set('full_load_check', '1')
        start_date = datetime.strftime(datetime.now() - timedelta(days=365), '%Y-%m-%d')
        end_date = datetime.strftime(kwargs.get('execution_date'), '%Y-%m-%d')
    else:
        print('After the first execution ..')
        print('Execution date : {0}'.format(kwargs.get('execution_date')))
        print('Actual start date : {0}'.format(kwargs.get('ds')))
        print('Previous successful execution date : {0}'.format(kwargs.get('prev_execution_date_success')))
        print('Calculated field : {0}'.format(kwargs.get('prev_execution_date_success')))
        start_date = kwargs.get('prev_execution_date_success')
        start_date = parse(str(start_date))
        end_date = kwargs.get('execution_date')
        end_date = parse(str(end_date))
        print('Type of start_date_check : {0}'.format(type(start_date)))
        start_date = datetime.strftime(start_date, '%Y-%m-%d')
        end_date = datetime.strftime(end_date, '%Y-%m-%d')

task_start = DummyOperator(task_id='start', dag=dag)
main_task = PythonOperator(task_id='IncrementalJobTask',
                           python_callable=printexecutiontimes,
                           provide_context=True,
                           dag=dag)
task_end = DummyOperator(task_id='end', dag=dag)

task_start >> main_task >> task_end
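One small refinement worth knowing, assuming a reasonably recent Airflow: Variable.get accepts a default_var argument, which removes the need to pre-create the variable in the UI:

from airflow.models import Variable

# falls back to '0' if the variable was never created, so the first run
# still takes the full-load branch
full_load_check = Variable.get('full_load_completion', default_var='0')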

This helped me:
if isinstance(context['prev_execution_date_success'], type(None)):
(isinstance(x, type(None)) is just a more verbose way of writing x is None.)

Related

AWS Glue 3: NameError: name 'date_trunc' is not defined

I built a job in AWS Glue Studio with the version set to Glue 3, i.e. Spark 3 is supported.
The goal is to truncate the date in column "date" to the minute, i.e. with all seconds set to 00.
I found the function date_trunc for this, but I get the error "NameError: name 'date_trunc' is not defined".
The code runs in a custom transform and looks as follows:
def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    df = dfc.select(list(dfc.keys())[0]).toDF()
    df_rounded = df.withColumn("date_truncated", date_trunc("minute", col("date")))
    dyf_rounded = DynamicFrame.fromDF(df_rounded, glueContext, "rounded")
    return DynamicFrameCollection({"CustomTransform0": dyf_rounded}, glueContext)
How can I make that function work? I assume I have to import that function, but I don't see a way to do that in the Studio designer.
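One plausible fix, assuming the Studio custom transform accepts ordinary Python imports at the top of its script block (it is a regular Python snippet): date_trunc and col live in pyspark.sql.functions, so pull them in explicitly:

# add these imports at the top of the custom transform script
from awsglue.dynamicframe import DynamicFrame, DynamicFrameCollection
from pyspark.sql.functions import col, date_trunc  # resolves the NameError

# then, inside MyTransform, the existing line works as written:
# df_rounded = df.withColumn("date_truncated", date_trunc("minute", col("date")))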

Use different pandas version for different tasks in the same DAG (Airflow)

Say I have two tasks which use two different versions of, say, pandas:
# my_task_one
import pandas as pd  # pandas 1.0.0

def f1(data):
    ...
    return 0

and

# my_task_two
import pandas as pd  # pandas 2.0.0

def f2(data):
    ...
    return 0
In my Airflow setup (local, no Docker), is there a way to create a venv or requirements file for each task, e.g.
# dag.py
t1 = PythonOperator(
    task_id="t1",
    python_callable=f1,
    requirements="my_task_one_requirement.txt"  # How to set requirements for this task?
)
t2 = PythonOperator(
    task_id="t2",
    python_callable=f2,
    requirements="my_task_two_requirement.txt"  # How to set requirements for this task?
)
t1 >> t2
In case it can't be done in the same DAG file, is there a way to specify the requirements for a given DAG file, e.g. placing t1 and t2 in DAG1 and DAG2 respectively, but with different packages/requirements files?
Airflow has the PythonVirtualenvOperator, which is suitable for this use case.
t1 = PythonVirtualenvOperator(
    task_id="t1",
    python_callable=f1,
    requirements=["pandas==1.0.0"],
)
t2 = PythonVirtualenvOperator(
    task_id="t2",
    python_callable=f2,
    requirements=["pandas==2.0.0"],
)
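One caveat, based on how PythonVirtualenvOperator works: the callable is serialized and executed inside the freshly built virtualenv, so it must be self-contained. In particular, do the pandas import inside the function body, not at module level. A minimal sketch:

# Airflow 2.x; in 1.10 the import path is airflow.operators.python_operator
from airflow.operators.python import PythonVirtualenvOperator

def f1(data=None):
    # imported here because the function runs in the isolated venv,
    # not in the environment that parses the DAG file
    import pandas as pd
    print(pd.__version__)  # should report 1.0.0 inside t1's venv
    return 0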

Is there pagination for transaction search?

I am trying to execute the TransactionSearchReq method using the PayPal SOAP API and I get the following warning:
ShortMessage: Search warning
LongMessage: The number of results were truncated. Please change your search parameters if you wish to see all your results.
ErrorCode: 11002
SeverityCode: Warning
It also says in the docs that "The maximum number of transactions that can be returned from a TransactionSearch API call is 100."
(https://developer.paypal.com/docs/classic/api/merchant/TransactionSearch_API_Operation_SOAP/)
Is there some way to paginate results so that I can get more than 100 results from multiple queries?
Here's one way you can do it in Rails. This assumes you want to search from a specific point in time until now, but you could change the end_date to specify an end date. Note that I've added the 'paypal-sdk-merchant' gem to my Gemfile (see https://github.com/paypal/merchant-sdk-ruby) and followed the instructions to set up my authentication.
The two things you'll want to edit below are the start_date method (to set your own start date) and the do_something(x) method, which is whatever you want to do to each of the individual orders within your date range.
module PaypalTxnSearch
  def check_for_updated_orders
    begin
      @paypal_order_list = get_paypal_orders_in_range(start_date, end_date)
      @paypal_order_list.PaymentTransactions.each do |x|
        # This is where you can call a method to process each transaction
        do_something(x)
      end
      # TransactionSearch returns up to 100 of the most recent items.
    end while txn_search_result_needs_pagination?
  end

  def get_paypal_orders_in_range(start_date, end_date)
    @api = PayPal::SDK::Merchant::API.new
    # Build Transaction Search request object
    # https://developer.paypal.com/webapps/developer/docs/classic/api/merchant/TransactionSearch_API_Operation_NVP/
    @transaction_search = @api.build_transaction_search(
      StartDate: start_date,
      EndDate: end_date
    )
    # Make API call & get response
    @response = @api.transaction_search(@transaction_search)
    # Access Response
    return_response_or_errors(@response)
  end

  def start_date
    # In this example we look back 6 months, but you can change it
    Date.today.advance(months: -6)
  end

  def end_date
    if defined?(@paypal_order_list)
      @paypal_order_list.PaymentTransactions.last.Timestamp
    else
      DateTime.now
    end
  end

  def txn_search_result_needs_pagination?
    @paypal_order_list.Ack == 'SuccessWithWarning' &&
      @paypal_order_list.Errors.count == 1 &&
      @paypal_order_list.Errors[0].ErrorCode == '11002'
  end

  def return_response_or_errors(response)
    if response.success?
      response
    else
      response.Errors
    end
  end
end
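A note on the design choice here: each pass through the loop re-queries with end_date set to the Timestamp of the last (i.e. oldest) transaction in the previous batch, so every iteration pages another batch of up to 100 results further back in time. The begin/end while loop stops once the response no longer carries the 11002 "results truncated" warning, meaning the final batch fit within the limit.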

Select data within specific month in SELECT-OPTIONS

This ABAP code works, but only once: I run it with different parameters, yet the result data does not change. How can I solve this?
PARAMETERS: S_MONTH LIKE ISELLIST-MONTH OBLIGATORY.
SELECT-OPTIONS: S_DATE FOR SY-DATUM.

AT SELECTION-SCREEN ON VALUE-REQUEST FOR S_MONTH.
  PERFORM GET_DATES.

FORM GET_DATES.
  DATA: MONTH     LIKE ISELLIST-MONTH,
        FIRST_DAY LIKE SY-DATUM,
        LAST_DAY  LIKE SY-DATUM.

  MONTH = SY-DATUM+0(6). "default
  CALL FUNCTION 'POPUP_TO_SELECT_MONTH'
    EXPORTING
      ACTUAL_MONTH   = MONTH
    IMPORTING
      SELECTED_MONTH = MONTH.
  IF SY-SUBRC <> 0.
    "put some message
  ENDIF.

  CONCATENATE MONTH '01' INTO FIRST_DAY.
  CALL FUNCTION 'RP_LAST_DAY_OF_MONTHS'
    EXPORTING
      DAY_IN            = FIRST_DAY
    IMPORTING
      LAST_DAY_OF_MONTH = LAST_DAY.
  IF SY-SUBRC <> 0.
    "put some message
  ENDIF.

  S_DATE-LOW    = FIRST_DAY.
  S_DATE-HIGH   = LAST_DAY.
  S_DATE-SIGN   = 'I'.
  S_DATE-OPTION = 'BT'.
  APPEND S_DATE.

  S_MONTH = MONTH.
ENDFORM.
Add REFRESH S_DATE. before the APPEND S_DATE. As the code stands, you just keep appending every selection you make, so the rows from earlier runs stay in the selection table and the first range keeps taking effect.

SQLAlchemy: Problems Migrating to PostgreSQL from SQLite (e.g. sqlalchemy.exc.ProgrammingError:)

I am having difficulties migrating a working script from SQLite to PostgreSQL. I am using SQLAlchemy. When I run the script, it raises the following error:
raise exc.DBAPIError.instance(statement, parameters, e, connection_invalidated=is_disconnect)
sqlalchemy.exc.ProgrammingError: (ProgrammingError) can't adapt 'INSERT INTO cnn_hot_stocks (datetime, list, ticker, price, change, "pctChange") VALUES (%(datetime)s, %(list)s, %(ticker)s, %(price)s, %(change)s, %(pctChange)s)' {'price': Decimal('7.94'), 'list': 'active', 'datetime': datetime.datetime(2012, 6, 23, 11, 45, 1, 544361), 'pctChange': u'+1.53%', 'ticker': u'BAC', 'change': Decimal('0.12')}
The insert call works well when using the SQLite engine, but I want to use PostgreSQL to utilize the native decimal type for keeping financial data correct. I copied the script and just changed the DB engine to my PostgreSQL server. Any advice on how to troubleshoot this error would be greatly appreciated by this SQLAlchemy newbie... I think I am up a creek on this one! Thanks in advance!
Here are my relevant code segments and table descriptions:
dbstring = "postgresql://postgres:postgres@localhost:5432/algo"
db = create_engine(dbstring)
db.echo = True  # Try changing this to True and see what happens
metadata = MetaData(db)

cnn_hot_stocks = Table('cnn_hot_stocks', metadata, autoload=True)
i = cnn_hot_stocks.insert()  # running log from cnn hot stocks web-site

def scrape_data():
    try:
        html = urllib2.urlopen('http://money.cnn.com/data/hotstocks/').read()
        markup, errors = tidy_document(html)
        soup = BeautifulSoup(markup,)
    except Exception as e:
        pass
    list_map = {2: 'active',
                3: 'gainer',
                4: 'loser'
                }
    # Iterate over 3 tables on CNN hot stock web-site
    for x in range(2, 5):
        table = soup('table')[x]
        for row in table.findAll('tr')[1:]:
            timestamp = datetime.now()
            col = row.findAll('td')
            ticker = col[0].a.string
            price = Decimal(col[1].span.string)
            change = Decimal(col[2].span.span.string)
            pctChange = col[3].span.span.string
            log_data = {'datetime': timestamp,
                        'list': list_map[x],
                        'ticker': ticker,
                        'price': price,
                        'change': change,
                        'pctChange': pctChange
                        }
            print log_data
            # Commit to DB
            i.execute(log_data)
TABLE:
cnn_hot_stocks = Table('cnn_hot_stocks', metadata,  # log of stocks data on cnn hot stocks lists
                       Column('datetime', DateTime, primary_key=True),
                       Column('list', String),  # loser/gainer/active
                       Column('ticker', String),
                       Column('price', Numeric),
                       Column('change', Numeric),
                       Column('pctChange', String),
                       )
My reading of the documentation is that you have to use Numeric instead of Decimal.
PostgreSQL has no type named decimal (it's an alias for numeric, but not a very full-featured one), and SQLAlchemy seems to expect Numeric as the type it can use for abstraction purposes.
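For illustration, a minimal sketch of the table with explicit Numeric columns (column names taken from the question; the precision and scale values are arbitrary choices, not from the original post):

from sqlalchemy import Table, Column, DateTime, String, Numeric, MetaData

metadata = MetaData()

cnn_hot_stocks = Table('cnn_hot_stocks', metadata,
                       Column('datetime', DateTime, primary_key=True),
                       Column('list', String),
                       Column('ticker', String),
                       # Numeric maps to PostgreSQL's numeric and round-trips decimal.Decimal values
                       Column('price', Numeric(12, 2)),
                       Column('change', Numeric(12, 2)),
                       Column('pctChange', String),
                       )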