AWS Glue 3: NameError: name 'date_trunc' is not defined - pyspark

I built a job in AWS Glue Studio with the version set to Glue 3.0, which supports Spark 3.
The goal is to truncate the value in column "date" to the minute, i.e. set the seconds to 00.
I found the function date_trunc for this, but I get the error "NameError: name 'date_trunc' is not defined".
The code runs in a custom transform and looks as follows:
def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    df = dfc.select(list(dfc.keys())[0]).toDF()
    df_rounded = df.withColumn("date_truncated", date_trunc("minute", col("date")))
    dyf_rounded = DynamicFrame.fromDF(df_rounded, glueContext, "rounded")
    return DynamicFrameCollection({"CustomTransform0": dyf_rounded}, glueContext)
How can I make that function work? I assume I have to import it, but I don't see a way to do that in the Studio designer.
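For reference, a minimal sketch of the transform with the missing import added (assuming the Studio code box accepts plain Python, the pyspark.sql.functions import can simply live inside the function body; everything else mirrors the question):
def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    # date_trunc and col come from pyspark.sql.functions; importing them here
    # avoids having to touch the header of the generated script.
    from pyspark.sql.functions import date_trunc, col

    df = dfc.select(list(dfc.keys())[0]).toDF()
    df_rounded = df.withColumn("date_truncated", date_trunc("minute", col("date")))
    dyf_rounded = DynamicFrame.fromDF(df_rounded, glueContext, "rounded")
    return DynamicFrameCollection({"CustomTransform0": dyf_rounded}, glueContext)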

Related

ClickHouse client syntax error with Kafka integration

I'm on ClickHouse client version 18.16.1 and I'm following this blog post: https://altinity.com/blog/2020/5/21/clickhouse-kafka-engine-tutorial
When creating a table I'm using this syntax:
CREATE TABLE readings (
readings_id Int32 Codec(DoubleDelta, LZ4),
time DateTime Codec(DoubleDelta, LZ4),
date ALIAS toDate(time),
temperature Decimal(5,2) Codec(T64, LZ4)
) Engine = MergeTree
PARTITION BY toYYYYMM(time)
ORDER BY (readings_id, time);
and I'm getting an error that says
"""
Code: 62, e.displayText() = DB::Exception: Syntax error: failed at position 76 (line 2, col 23): Codec(DoubleDelta, LZ4),
time DateTime Codec(DoubleDelta, LZ4),
date ALIAS toDate(time),
temperature Decimal(5,2) Codec(T64, LZ4)
)
ENGINE = MergeTr. Expected one of: token, ClosingRoundBracket, Comma, DEFAULT, MATERIALIZED, ALIAS, COMMENT, e.what() = DB::Exception
"""
Let me know what I'm doing wrong. Thanks.

How to pass case statements from an external file into .selectExpr("*", "all case statements") in Spark code

I have the case statements below in a SQL file.
Note: it is just a sample and I saved it as col_sql.sql
"CASE WHEN a = 1 THEN ONE END AS INT_VAL"
, "CASE WHEN a = 'DE' THEN 'APHABET' AS STR_VAL"
In the Spark Scala code, I'm reading col_sql.sql as below:
val col_file = "dir/path/col_sql.sql"
val col_query = readFile(col_file)  // internally converted to a string using .mkString
Then I pass it to my select query in the Spark code:
.selectExpr("*", col_query)
Expectation:
My expectation is that when my Spark job runs, the case statements should be passed to .selectExpr() exactly as given in the SQL file, like this:
.selectExpr("*", "CASE WHEN a = 1 THEN ONE END AS INT_VAL", "CASE WHEN a = 'DE' THEN 'APHABET' AS STR_VAL")
When I run this manually in spark2-shell it works correctly, but in a spark2-submit job it throws a ParseDriver error.
Kindly assist me on this.
Each argument in selectExpr should resolve to one column (see examples in the doc). In this case you will have to split the expression read from the file, e.g.:
// Example given the complete string, you could split already when reading the file
val col_query = "\"CASE WHEN a = 1 THEN ONE END AS INT_VAL\", \"CASE WHEN a = 'DE' THEN 'APHABET' AS STR_VAL\""
val cols_queries = col_query.split(",").map(x => x.trim().stripPrefix("\"").stripSuffix("\""))
df.selectExpr("*", cols_queries: _*) // to expand the list into arguments

Airflow default variables - Incremental load setup

I am trying to implement an incremental data load for an extract from one RDS Postgres database to another RDS Postgres database.
I am using Airflow to implement the ETL. After reading about Airflow macros for a while, I decided to set up the incremental flow with Airflow's default variables.
The algorithm is as follows:
if my previous execution date is None or null or '':
    pick data from the beginning of time (in our case, a year back)
else:
    pick the previous execution date
end if
Note: the following code is just to understand the default variables first; it is not yet applied to the problem described above.
The corresponding code is shown below. When I run the DAG for the first time, I always end up printing 'None' for the previoussuccessfulexecutiondate variable and never the historical date as described above. I am unable to figure this out. Any ideas would be a great help.
from datetime import *
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.dates import days_ago

default_args = {'owner': 'airflow', 'start_date': days_ago(1), 'depends_on_past': 'False'}
dag = DAG('jinja_trial_10', default_args=default_args, schedule_interval=timedelta(minutes=5))

def printexecutiontimes(**kwargs):
    executiondate = kwargs.get('execution_date')
    previoussuccessfulexecutiondate = kwargs.get('prev_execution_date_success')
    previousexecutiondate = kwargs.get('prev_ds_nodash')
    if (previoussuccessfulexecutiondate == 'None' or previoussuccessfulexecutiondate is None):
        previoussuccessfulexecutiondate = datetime.strftime(datetime.now() - timedelta(days=365), '%Y-%m-%d')
    print('Execution Date : {0}'.format(executiondate))
    print('Previous successful execution date : {0}'.format(previoussuccessfulexecutiondate))
    print('Previous execution date : {0}'.format(previousexecutiondate))
    print('hello')

task_start = DummyOperator(task_id='start', dag=dag)
jinja_task = PythonOperator(task_id='TryingoutJinjatemplates',
                            python_callable=printexecutiontimes,
                            provide_context='True',
                            dag=dag)
task_end = DummyOperator(task_id='end', dag=dag)

task_start >> jinja_task >> task_end
I had to do something very similar recently, and the following is what I ended up with: a custom function using DagRun details.
Refer to this answer if you just want to get the last DAG run (irrespective of status).
For me, I had to get the date of the last successful run, hence I created the function below:
from datetime import datetime
from airflow.models import DagRun

def get_last_dag_run(dag_id):
    dag_runs = DagRun.find(dag_id=dag_id)
    dag_runs.sort(key=lambda x: x.execution_date, reverse=True)
    for dag_run in dag_runs:
        # print all dag runs - debug only
        print(f"All ----- state: {dag_run.state} , run_id: {dag_run.run_id} , execution_date: {dag_run.execution_date}")
    print('Success runs ---------------------------------')
    dag_runs = list(filter(lambda x: x.state == 'success', dag_runs))
    for dag_run in dag_runs:
        # print successful dag runs - debug only
        print(f"Success - state: {dag_run.state} , run_id: {dag_run.run_id} , execution_date: {dag_run.execution_date}")
    # return last successful execution date or default value (01-01-1970)
    return dag_runs[0].execution_date if dag_runs else datetime(1970, 1, 1)
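As an illustration only (the DAG id and callable name below are placeholders, not part of the original answer), the helper could feed the extract window of a task:
def print_extract_window(**kwargs):
    # 'jinja_trial_10' stands in for whatever dag_id the job actually uses.
    start_date = get_last_dag_run('jinja_trial_10')
    end_date = kwargs.get('execution_date')
    print('Extract window: {0} -> {1}'.format(start_date, end_date))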
After a few experiments and a lot of reading, I came up with the following approach and it worked for me:
Create a variable in the Airflow UI and assign it the value 0.
Use Airflow's predefined variables to determine whether it is a full load or an incremental load.
Pseudo code:
if value of Variable created == 0 then
    set Variable = 1
    set the start date to a point in time in the past (a date-time from the inception of a certain process)
    set the end date to the value of "execution_date" (one of Airflow's predefined variables)
else
    set the start date to "prev_execution_date_success" (one of Airflow's predefined variables)
    set the end date to "execution_date" (one of Airflow's predefined variables)
end
Below is the code snippet for the same
from datetime import *
from dateutil.parser import parse
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.dates import days_ago
from airflow.models import Variable

default_args = {'owner': 'airflow', 'start_date': datetime(2020, 11, 3, 12, 5), 'depends_on_past': 'False'}
dag = DAG('airflow_incremental_load_setup', default_args=default_args, schedule_interval=timedelta(minutes=5))

def printexecutiontimes(**kwargs):
    # Variable to be created before running the DAG
    full_load_check = Variable.get('full_load_completion')
    print('full_load_check : {0}'.format(full_load_check))
    if full_load_check == '0':
        print('First execution')
        print('Execution date : {0}'.format(kwargs.get('execution_date')))
        print('Actual start date : {0}'.format(kwargs.get('ds')))
        print('Previous successful execution date : {0}'.format(kwargs.get('prev_execution_date_success')))
        print('Calculated field : {0}'.format(datetime.strftime(datetime.now() - timedelta(days=365), '%Y-%m-%d')))
        Variable.set('full_load_check', '1')
        start_date = datetime.strftime(datetime.now() - timedelta(days=365), '%Y-%m-%d')
        end_date = datetime.strftime(kwargs.get('execution_date'), '%Y-%m-%d')
    else:
        print('After the first execution ..')
        print('Execution date : {0}'.format(kwargs.get('execution_date')))
        print('Actual start date : {0}'.format(kwargs.get('ds')))
        print('Previous successful execution date : {0}'.format(kwargs.get('prev_execution_date_success')))
        print('Calculated field : {0}'.format(kwargs.get('prev_execution_date_success')))
        start_date = kwargs.get('prev_execution_date_success')
        start_date = parse(str(start_date))
        end_date = kwargs.get('execution_date')
        end_date = parse(str(end_date))
        print('Type of start_date_check : {0}'.format(type(start_date)))
        start_date = datetime.strftime(start_date, '%Y-%m-%d')
        end_date = datetime.strftime(end_date, '%Y-%m-%d')

task_start = DummyOperator(task_id='start', dag=dag)
main_task = PythonOperator(task_id='IncrementalJobTask',
                           python_callable=printexecutiontimes,
                           provide_context='True',
                           dag=dag)
task_end = DummyOperator(task_id='end', dag=dag)

task_start >> main_task >> task_end
What helped me was checking for None explicitly:
if isinstance(context['prev_execution_date_success'], type(None)):
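A minimal sketch of how that check might be used inside a task callable (the function name and the 365-day fallback are illustrative, mirroring the question's "year back" requirement):
from datetime import datetime, timedelta

def resolve_start_date(context):
    prev_success = context['prev_execution_date_success']
    if isinstance(prev_success, type(None)):
        # First run: there is no previous successful run, so fall back a year
        return datetime.now() - timedelta(days=365)
    return prev_success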

Trying to update a PostgreSQL database using Doobie but no update happening

I'm trying to update a table in a PostgreSQL database, passing dynamic values using Doobie (functional JDBC). While executing the SQL statement I get the error below. Any help will be appreciated.
Code
Working code
sql"""UPDATE layout_lll
|SET runtime_params = 'testing string'
|WHERE run_id = '123-ksdjf-oreiwlds-9dadssls-kolb'
|""".stripMargin.update.quick.unsafeRunSync
Not working code
val abcRunTimeParams="testing string"
val runID="123-ksdjf-oreiwlds-9dadssls-kolb"
sql"""UPDATE layout_lll
|SET runtime_params = '${abcRunTimeParams}'
|WHERE run_id = '$runID'
|""".stripMargin.update.quick.unsafeRunSync
Error
Exception in thread "main" org.postgresql.util.PSQLException: The column index is out of range: 3, number of columns: 2.
Remove the ' quotes; Doobie makes sure they aren't needed. Doobie (and virtually any other DB library) uses parametrized queries, like:
UPDATE layout_lll
SET runtime_params = ?
WHERE run_id = ?
where ? will be replaced by the parameters passed later on. This:
makes SQL injection impossible
helps spot errors in SQL syntax
When you want to pass a parameter, the ' is part of the value passed, not part of the parametrized query, and Doobie (or the JDBC driver) will "add" it for you. The variables you pass are processed by Doobie; they aren't just pasted in like in normal string interpolation.
TL;DR Try running
val abcRunTimeParams="testing string"
val runID="123-ksdjf-oreiwlds-9dadssls-kolb"
sql"""UPDATE layout_lll
|SET runtime_params = ${abcRunTimeParams}
|WHERE run_id = $runID
|""".stripMargin.update.quick.unsafeRunSync

How to link a Python pandas DataFrame to mysql.connector '%s' values

I am trying to pipe a web-scraped pandas DataFrame into a MySQL table with mysql.connector, but I can't seem to link the DataFrame values to the %s placeholders. The connection is good (I can add individual rows), but it just returns errors when I replace the values with %s.
cnx = mysql.connector.connect(host = 'ip', user = 'user', passwd = 'pass', database = 'db')
cursor = cnx.cursor()
insert_df = ("""INSERT INTO table"
"(page_1, date_1, record_1, task_1)"
"VALUES ('%s','%s','%s','%s')""")
cursor.executemany(insert_df, df)
cnx.commit()
cnx.close()
This returns "ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."
If I add any additional operations it returns "ProgrammingError: Parameters for query must be an Iterable."
I am very new to this, so any help is appreciated.
The workaround for me was to redo my whole process. I used SQLAlchemy; the documentation makes this very easy. Message me if you want the code I used.
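The answer doesn't include the code, but a minimal sketch of the usual SQLAlchemy + pandas route (the connection string, credentials and table name are placeholders mirroring the question) would look roughly like this:
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string; substitute real credentials and database name.
engine = create_engine('mysql+mysqlconnector://user:pass@ip/db')

# Sample frame standing in for the scraped data (columns mirror the question).
df = pd.DataFrame([{'page_1': 'p1', 'date_1': '2020-11-03', 'record_1': 'r1', 'task_1': 't1'}])

# Appends the rows; pandas and SQLAlchemy handle the parameter binding for you.
df.to_sql('table', con=engine, if_exists='append', index=False)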