Airflow default variables - Incremental load setup - macros
I am trying to implement an incremental data load for a data extract from one RDS Postgres instance to another RDS Postgres instance.
I am using Airflow to implement the ETL. After reading about Airflow macros for a while, I decided to set up the incremental flow with Airflow's default (predefined) variables.
So, the algorithm works this way:
if my previous execution date is None or null or '':
    pick data from the beginning of time (in our case, a year back)
else:
    pick the previous execution date
end if
Note: the following code is meant to understand the default variables first; it is not yet applied to the problem described above.
The corresponding code is shown below. When I run the DAG for the first time, I always end up printing 'None' for the previoussuccessfulexecutiondate variable and never the historical date mentioned above. I am unable to figure this out. Any ideas on this would be of great help.
from datetime import *
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.dates import days_ago

default_args = {'owner': 'airflow', 'start_date': days_ago(1), 'depends_on_past': 'False'}

dag = DAG('jinja_trial_10', default_args=default_args, schedule_interval=timedelta(minutes=5))

def printexecutiontimes(**kwargs):
    executiondate = kwargs.get('execution_date')
    previoussuccessfulexecutiondate = kwargs.get('prev_execution_date_success')
    previousexecutiondate = kwargs.get('prev_ds_nodash')
    if previoussuccessfulexecutiondate == 'None' or previoussuccessfulexecutiondate is None:
        previoussuccessfulexecutiondate = datetime.strftime(datetime.now() - timedelta(days=365), '%Y-%m-%d')
    print('Execution Date : {0}'.format(executiondate))
    print('Previous successful execution date : {0}'.format(previoussuccessfulexecutiondate))
    print('Previous execution date : {0}'.format(previousexecutiondate))
    print('hello')

task_start = DummyOperator(task_id='start', dag=dag)

jinja_task = PythonOperator(task_id='TryingoutJinjatemplates',
                            python_callable=printexecutiontimes,
                            provide_context='True',
                            dag=dag)

task_end = DummyOperator(task_id='end', dag=dag)

task_start >> jinja_task >> task_end
I had to do something very similar recently, and the following is what I ended up with: a custom function built on the DagRun details.
Refer to this answer if you just want to get the last DAG run (irrespective of its status).
In my case I needed the date of the last successful run, hence the function below:
from datetime import datetime
from airflow.models import DagRun

def get_last_dag_run(dag_id):
    dag_runs = DagRun.find(dag_id=dag_id)
    dag_runs.sort(key=lambda x: x.execution_date, reverse=True)
    for dag_run in dag_runs:
        # print all dag runs - debug only
        print(f"All ----- state: {dag_run.state} , run_id: {dag_run.run_id} , execution_date: {dag_run.execution_date}")
    print('Success runs ---------------------------------')
    dag_runs = list(filter(lambda x: x.state == 'success', dag_runs))
    for dag_run in dag_runs:
        # print successful dag runs - debug only
        print(f"Success - state: {dag_run.state} , run_id: {dag_run.run_id} , execution_date: {dag_run.execution_date}")
    # return the execution date of the last successful run, or a default value (1970-01-01)
    return dag_runs[0].execution_date if dag_runs else datetime(1970, 1, 1)
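A quick usage sketch, not part of the original answer: the function can be called from inside a task callable to build the incremental window. The dag_id shown is a placeholder.

def build_incremental_window(**kwargs):
    # Lower bound: last successful run of this DAG, or 1970-01-01 on the very first run
    start_date = get_last_dag_run('my_dag_id')  # 'my_dag_id' is a placeholder
    # Upper bound: the current run's execution date (requires provide_context on Airflow 1.x)
    end_date = kwargs['execution_date']
    print('Incremental window: {0} -> {1}'.format(start_date, end_date))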
After a few experiments and a lot of reading, I came up with the following approach, and it worked for me:
Create a variable in the Airflow UI and assign it the value 0 (a programmatic alternative is sketched right after these steps).
Use Airflow's predefined variables to determine whether it is a full load or an incremental load.
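As an alternative to the UI, the variable could also be seeded programmatically; a minimal sketch, assuming the same variable name that the code below expects:

from airflow.models import Variable

# One-off seeding of the flag the DAG checks; '0' means the full load has not run yet
Variable.set('full_load_completion', '0')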
Pseudo code -
if value of the Variable created = 0 then
    set the Variable = 1
    set the start date to a point in time in the past (a date-time from the inception of a certain process)
    set the end date to the value of "execution_date" (one of Airflow's predefined variables)
else
    set the start date to "prev_execution_date_success" (one of Airflow's predefined variables)
    set the end date to "execution_date" (one of Airflow's predefined variables)
end
Below is the corresponding code snippet:
from datetime import *
from dateutil.parser import parse
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.dates import days_ago
from airflow.models import Variable

default_args = {'owner': 'airflow', 'start_date': datetime(2020, 11, 3, 12, 5), 'depends_on_past': False}

dag = DAG('airflow_incremental_load_setup', default_args=default_args, schedule_interval=timedelta(minutes=5))

def printexecutiontimes(**kwargs):
    # Variable to be created before the first run of the dag
    full_load_check = Variable.get('full_load_completion')
    print('full_load_check : {0}'.format(full_load_check))
    if full_load_check == '0':
        print('First execution')
        print('Execution date : {0}'.format(kwargs.get('execution_date')))
        print('Actual start date : {0}'.format(kwargs.get('ds')))
        print('Previous successful execution date : {0}'.format(kwargs.get('prev_execution_date_success')))
        print('Calculated field : {0}'.format(datetime.strftime(datetime.now() - timedelta(days=365), '%Y-%m-%d')))
        # flip the flag so subsequent runs take the incremental branch
        Variable.set('full_load_completion', '1')
        start_date = datetime.strftime(datetime.now() - timedelta(days=365), '%Y-%m-%d')
        end_date = datetime.strftime(kwargs.get('execution_date'), '%Y-%m-%d')
    else:
        print('After the first execution ..')
        print('Execution date : {0}'.format(kwargs.get('execution_date')))
        print('Actual start date : {0}'.format(kwargs.get('ds')))
        print('Previous successful execution date : {0}'.format(kwargs.get('prev_execution_date_success')))
        print('Calculated field : {0}'.format(kwargs.get('prev_execution_date_success')))
        start_date = kwargs.get('prev_execution_date_success')
        start_date = parse(str(start_date))
        end_date = kwargs.get('execution_date')
        end_date = parse(str(end_date))
        print('Type of start_date_check : {0}'.format(type(start_date)))
        start_date = datetime.strftime(start_date, '%Y-%m-%d')
        end_date = datetime.strftime(end_date, '%Y-%m-%d')

task_start = DummyOperator(task_id='start', dag=dag)

main_task = PythonOperator(task_id='IncrementalJobTask',
                           python_callable=printexecutiontimes,
                           provide_context=True,
                           dag=dag)

task_end = DummyOperator(task_id='end', dag=dag)

task_start >> main_task >> task_end
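The start_date / end_date window computed above is what would bound the incremental extract from the source Postgres. A minimal sketch of that step, not from the original answer; the connection id postgres_source, table my_table and column updated_at are placeholders:

from airflow.hooks.postgres_hook import PostgresHook

def extract_increment(start_date, end_date):
    # Pull only rows modified inside the [start_date, end_date) window from the source database
    src = PostgresHook(postgres_conn_id='postgres_source')  # placeholder connection id
    sql = """
        SELECT *
        FROM my_table            -- placeholder table
        WHERE updated_at >= %(start_date)s
          AND updated_at <  %(end_date)s
    """
    return src.get_records(sql, parameters={'start_date': start_date, 'end_date': end_date})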
What helped me was this check:
if isinstance(context['prev_execution_date_success'], type(None)):
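For completeness, a minimal sketch (my own, not from the original answer) of how that check could drive the fallback the question describes:

from datetime import datetime, timedelta

def pick_start_date(**context):
    prev_success = context['prev_execution_date_success']
    if isinstance(prev_success, type(None)):
        # No successful previous run yet: fall back to a year back, as described in the question
        start_date = datetime.now() - timedelta(days=365)
    else:
        start_date = prev_success
    print('Incremental window starts at: {0}'.format(start_date))
    return start_date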