How to run arbitrary / DDL SQL statements or stored procedures using AWS Glue - pyspark

Is it possible to execute arbitrary SQL commands like ALTER TABLE from an AWS Glue Python job? I know I can use it to read data from tables, but is there a way to execute other database-specific commands?
I need to ingest data into a target database and then run some ALTER commands right after.

So after doing extensive research and also opening a case with AWS Support, they told me it is not possible from a Python shell or Glue PySpark job at the moment. But I tried something creative and it worked! The idea is to use py4j, which Spark already relies on, and utilize the standard java.sql package.
Two huge benefits of this approach:
You can define your database connection as a Glue data connection and keep the JDBC details and credentials there without hardcoding them in the Glue code. My example below does that by calling glueContext.extract_jdbc_conf('your_glue_data_connection_name') to get the JDBC URL and credentials defined in Glue.
If you need to run SQL commands on a database that Glue supports out of the box, you don't even need to use/pass a JDBC driver for that database - just make sure you set up a Glue connection for that database and add that connection to your Glue job; Glue will upload the proper database driver jars.
Remember that the code below is executed by the driver process and cannot be run on Spark workers/executors.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
logger = glueContext.get_logger()
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# dw-poc-dev spark test
source_jdbc_conf = glueContext.extract_jdbc_conf('your_glue_database_connection_name')
from py4j.java_gateway import java_import
java_import(sc._gateway.jvm,"java.sql.Connection")
java_import(sc._gateway.jvm,"java.sql.DatabaseMetaData")
java_import(sc._gateway.jvm,"java.sql.DriverManager")
java_import(sc._gateway.jvm,"java.sql.SQLException")
conn = sc._gateway.jvm.DriverManager.getConnection(source_jdbc_conf.get('url'), source_jdbc_conf.get('user'), source_jdbc_conf.get('password'))
print(conn.getMetaData().getDatabaseProductName())
# call stored procedure, in this case I call sp_start_job
cstmt = conn.prepareCall("{call dbo.sp_start_job(?)}");
cstmt.setString("job_name", "testjob");
results = cstmt.execute();
conn.close()
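For the plain DDL case in the original question (e.g. ALTER TABLE), the same java.sql connection can be used with a regular Statement before it is closed. A minimal sketch; the table name, column name and exact DDL syntax below are placeholders and will vary by database:
stmt = conn.createStatement()
stmt.executeUpdate("ALTER TABLE my_schema.my_table ADD my_new_column VARCHAR(50)")  # placeholder DDL
stmt.close()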

I finally got this working after a couple of hours so hopefully the following will be helpful. My script is heavily influenced by the earlier responses, thank you.
Prerequisites:
You will want the Glue connection configured and tested before attempting any scripts.
When setting up your AWS Glue job, use Spark, Glue version 2.0 or later, and Python version 3.
I recommend configuring this job with just 2 workers to save on cost; the bulk of the work is going to be done by the database, not by Glue.
The following is tested with an AWS RDS PostgreSQL instance, but is hopefully flexible enough to work for other databases.
The script needs 3 parameters updated near the top of the script (glue_connection_name, database_name, and stored_proc).
The JOB_NAME, connection string, and credentials are retrieved by the script and do not need to be supplied.
If your stored proc will return a dataset then replace executeUpdate with executeQuery.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
glue_connection_name = '[Name of your glue connection (not the job name)]'
database_name = '[name of your postgreSQL database]'
stored_proc = '[Stored procedure call, for example public.mystoredproc()]'
#Below this point no changes should be necessary.
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glue_job_name = args['JOB_NAME']
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(glue_job_name, args)
job.commit()
logger = glueContext.get_logger()
logger.info('Getting details for connection ' + glue_connection_name)
source_jdbc_conf = glueContext.extract_jdbc_conf(glue_connection_name)
from py4j.java_gateway import java_import
java_import(sc._gateway.jvm,"java.sql.Connection")
java_import(sc._gateway.jvm,"java.sql.DatabaseMetaData")
java_import(sc._gateway.jvm,"java.sql.DriverManager")
java_import(sc._gateway.jvm,"java.sql.SQLException")
conn = sc._gateway.jvm.DriverManager.getConnection(source_jdbc_conf.get('url') + '/' + database_name, source_jdbc_conf.get('user'), source_jdbc_conf.get('password'))
logger.info('Connected to ' + conn.getMetaData().getDatabaseProductName() + ', ' + source_jdbc_conf.get('url') + '/' + database_name)
stmt = conn.createStatement();
rs = stmt.executeUpdate('call ' + stored_proc);
logger.info("Finished")

I modified the code shared by Mishkin, but it did not work for me. After troubleshooting a bit, I realized that in my case the connection from the catalog did not work, so I had to set it up manually and tweak the code a little. It is now working, but it throws an exception at the end because it is not able to convert the Java result to a Python result. I did a workaround, so use with caution.
Below is my code.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## @params: [TempDir, JOB_NAME]
args = getResolvedOptions(sys.argv, ['TempDir','JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
#source_jdbc_conf = glueContext.extract_jdbc_conf('redshift_publicschema')
from py4j.java_gateway import java_import
java_import(sc._gateway.jvm,"java.sql.Connection")
java_import(sc._gateway.jvm,"java.sql.DatabaseMetaData")
java_import(sc._gateway.jvm,"java.sql.DriverManager")
java_import(sc._gateway.jvm,"java.sql.SQLException")
print('Trying to connect to DB')
conn = sc._gateway.jvm.DriverManager.getConnection('jdbc:redshift://redshift-cluster-2-url:4000/databasename', 'myusername', 'mypassword')
print('Trying to connect to DB success!')
print(conn.getMetaData().getDatabaseProductName())
# call stored procedure, in this case I call sp_start_job
stmt = conn.createStatement();
#cstmt = conn.prepareCall("call dbname.schemaname.my_storedproc();");
print('Call to proc trying ')
#cstmt.setString("job_name", "testjob");
try:
    rs = stmt.executeQuery('call mySchemaName.my_storedproc()')
except:
    print("An exception occurred but proc has run")
#results = cstmt.execute()
conn.close()

It depends. If you are using Redshift as the target, you have the option of specifying pre and post actions as part of the connection options, and you can put your ALTER statements there (see the sketch below). For other target types you might need to use a Python module such as pg8000 (in the case of Postgres) or similar.
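A hedged sketch of the Redshift pre/post actions approach; the frame, Glue connection name, database, table and SQL statements below are placeholders, not values from the original answer:
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=my_dynamic_frame,                       # placeholder DynamicFrame to write
    catalog_connection="my_redshift_connection",  # placeholder Glue connection name
    connection_options={
        "database": "dev",
        "dbtable": "public.target_table",
        "preactions": "TRUNCATE TABLE public.target_table;",
        "postactions": "ALTER TABLE public.target_table OWNER TO etl_user;"
    },
    redshift_tmp_dir=args["TempDir"])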

If you attach a connection object to the Glue job, you can easily get the connection settings:
glue_client = boto3.client('glue')
getjob=glue_client.get_job(JobName=args["JOB_NAME"])
connection_settings = glue_client.get_connection(Name=getjob['Job']['Connections']['Connections'][0])
conn_name = connection_settings['Connection']['Name']
df = glueContext.extract_jdbc_conf(conn_name)
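The extracted conf (named df above) can then be fed into the same py4j DriverManager call used in the earlier answers. A short sketch, assuming the java_import calls shown earlier have already been run:
conn = sc._gateway.jvm.DriverManager.getConnection(
    df.get('url'), df.get('user'), df.get('password'))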

Thanks Mishkin for sharing the script. I got the error below when I followed the script for Redshift:
An error occurred while calling z:java.sql.DriverManager.getConnection. [Amazon]JDSI Required setting ConnSchema is not present in connection settings
It looks like source_jdbc_conf.get('url') does not include the database name in the JDBC URL, so I ended up appending the database name to the JDBC URL (sketched below).
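A minimal sketch of that workaround, reusing the extract_jdbc_conf setup from the scripts above; the database name is a placeholder:
database_name = 'mydatabase'  # placeholder: the database to append to the URL
conn = sc._gateway.jvm.DriverManager.getConnection(
    source_jdbc_conf.get('url') + '/' + database_name,
    source_jdbc_conf.get('user'), source_jdbc_conf.get('password'))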

Related

DynamicFrame.fromDF causes extreme delay in writing to db using glueContext.write_from_options()

I have a glue job, wherein I need to read data from 2 tables from SQL Server, perform some joins/transformation and write back to another new/truncated table in SQL Server. The size of the data to be written is 15GB approx.
I have tried 2 approaches, as follows, and see a massive difference in performance. I am looking to get the job completed in under 10 minutes.
APPROACH 1 - Takes about 17 minutes overall (read data from SQL Server, transformations, write to S3, read from S3, write back to SQL Server)
Read from SQL Server into Spark dataframes (3 - 5 seconds approx.)
Perform transformations on the Spark dataframes (5 seconds approx.)
Write the data to temporary storage on S3 (8 minutes approx.)
Read from S3 into a DynamicFrame using glueContext.create_dynamic_frame.from_options()
Write to the SQL Server table using glueContext.write_from_options() (9 minutes)
APPROACH 2 - Takes about 50 minutes overall (read data from SQL Server, transformations, write back to SQL Server)
Read from SQL Server into Spark dataframes (3 - 5 seconds approx.)
Perform transformations on the Spark dataframes (5 seconds approx.)
Convert the Spark dataframe to a DynamicFrame using DynamicFrame.fromDF()
Write to the SQL Server table using glueContext.write_from_options() (43 minutes)
I observed that the second approach takes more time even though I have avoided the S3 write and read-back by converting the Spark dataframe to a DynamicFrame and using it for writing to SQL Server. The tables are also truncated before the data is written to them. I was expecting that by removing the S3 read/write I could get the job completed in 10-12 minutes.
Am I missing something here? Any suggestions, please.
Code template for approach1:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from pyspark.sql.types import *
from pyspark.sql.functions import *
import time
from py4j.java_gateway import java_import
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
logger = glueContext.get_logger()
# STEP 1 -- READ DATA FROM TABLES INTO DATAFRAMES
# -----------------------------------------------
# STEP 2 -- PERFORM TRANSFORMATIONS IF ANY, AND WRITE TO DATALAKE - S3
#----------------------------------------------------------------------
df.write.mode("overwrite").csv("s3://<<bucket-name>>/temp_result")
# STEP 3 -- READ DATA FROM S3 INTO NEW DATAFRAMES
#------------------------------------------------
newdf = glueContext.create_dynamic_frame.from_options(connection_type='s3',connection_options = {"paths": ["s3://<<bucket-name>>/temp_result"]},format='csv')
# STEP 4 -- TRUNCATE TARGET TABLE AS ITS A FULL REFRESH ALWAYS IN THE TARGET TABLE
#---------------------------------------------------------------------------------
cstmt = conn.prepareCall("TRUNCATE TABLE mytable_in_db");
results = cstmt.execute();
# STEP 5 -- WRITE TO TARGET TABLE FROM DATAFRAME
# ----------------------------------------------
glueContext.write_from_options(frame_or_dfc=newdf, connection_type="sqlserver", connection_options=connection_sqlserver_options)
job.commit()
Code template for approach2:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from pyspark.sql.types import *
from pyspark.sql.functions import *
import time
from py4j.java_gateway import java_import
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
logger = glueContext.get_logger()
# STEP 1 -- READ DATA FROM TABLES INTO DATAFRAMES
# -----------------------------------------------
# STEP 2 -- PERFORM TRANSFORMATIONS IF ANY AND STORE TO df
#----------------------------------------------------------------------
# df contains the transformed data
# STEP 3 -- CONVERT SPARK DATAFRAME TO DYNAMIC DATAFRAME
#--------------------------------------------------------
newdf2 = DynamicFrame.fromDF(df, glueContext , "newdf2")
# STEP 4 -- TRUNCATE TARGET TABLE AS ITS A FULL REFRESH ALWAYS IN THE TARGET TABLE
#---------------------------------------------------------------------------------
cstmt = conn.prepareCall("TRUNCATE TABLE mytable_in_db");
results = cstmt.execute();
# STEP 5 -- WRITE TO TARGET TABLE FROM DATAFRAME
# ----------------------------------------------
glueContext.write_from_options(frame_or_dfc=newdf2, connection_type="sqlserver", connection_options=connection_sqlserver_options)
job.commit()
I'm facing the same problem. From my SQL Profiler and Activity Monitor it seems that glueContext.create_dynamic_frame.from_options() is the weak spot.
I came to this conclusion after noticing the lack of open sessions on the target instance (RDS SQL Server) while Glue is obtaining the DynamicFrame from the source (also RDS SQL Server).
An important thing to notice: the idle state happens when bigger tables (1-2+ million records) are being retrieved from the source.
I would suggest trying to bypass Glue's methods and going with the spark.read method: create your DataFrame, do all transforms, and convert to a DynamicFrame for the load (see the sketch after the snippet below).
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
df = spark.read.format("jdbc") \
    .option("url", "jdbc:sqlserver://") \
    .option("user", "") \
    .option("password", "") \
    .option("dbtable", "") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()
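A rough sketch of the remaining steps the answer describes (transform, convert back to a DynamicFrame, write), reusing the placeholder names from the question's own templates; this is an outline under those assumptions, not a tested implementation:
from awsglue.dynamicframe import DynamicFrame

transformed_df = df  # apply your transformations here
dyf = DynamicFrame.fromDF(transformed_df, glueContext, "dyf")
glueContext.write_from_options(frame_or_dfc=dyf,
                               connection_type="sqlserver",
                               connection_options=connection_sqlserver_options)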

Pass parameters/arguments to HDInsight/Spark Activity in Azure Data Factory

I have an on-demand HDInsight cluster that is launched from a Spark Activity within Azure Data Factory and runs PySpark 3.1. To test out my code, I normally launch Jupyter Notebook from the created HDInsight Cluster page.
Now, I would like to pass some parameters to that Spark activity and retrieve these parameters from within Jupyter notebook code. I've tried doing so in two ways, but none of them worked for me:
Method A. as Arguments and then tried to retrieve them using sys.argv[].
Method B. as Spark configuration and then tried to retrieve them using sc.getConf().getAll().
I suspect that either:
I am not specifying parameters correctly
or using a wrong way to retrieve them in Jupyter Notebook code
or parameters are only valid for the Python *.py scripts specified in the "File path" field, but not for the Jupyter notebooks.
Any pointers on how to pass parameters into HDInsight Spark activity within Azure Data Factory would be much appreciated.
The issue is with the entryFilePath. In the Spark activity of the HDInsight cluster, you must give the entryFilePath as either a .jar file or a .py file. When we follow this, we can successfully pass arguments, which can be utilized using sys.argv.
The following is an example of how you can pass arguments to python script.
The code inside nb1.py (sample) is as shown below:
from pyspark import SparkContext
from pyspark.sql import *
import sys
sc = SparkContext()
sqlContext = HiveContext(sc)
# Create an RDD from sample data which is already available
hvacText = sc.textFile("wasbs:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
# Create a schema for our data
Entry = Row('Date', 'Time', 'TargetTemp', 'ActualTemp', 'BuildingID')
# Parse the data and create a schema
hvacParts = hvacText.map(lambda s: s.split(',')).filter(lambda s: s[0] != 'Date')
hvac = hvacParts.map(lambda p: Entry(str(p[0]), str(p[1]), int(p[2]), int(p[3]), int(p[6])))
# Infer the schema and create a table
hvacTable = sqlContext.createDataFrame(hvac)
hvacTable.registerTempTable('hvactemptable')
dfw = DataFrameWriter(hvacTable)
# use the argument from the pipeline to create the table
dfw.saveAsTable(sys.argv[1])
When the pipeline is triggered, it runs successfully and the required table will be created (name of this table is passed as an argument from pipeline Spark activity). We can query this table in HDInsight cluster's Jupyter notebook using the following query:
select * from new_hvac
NOTE:
Please ensure that you are passing arguments to a Python script (.py file), not a Python notebook.

How to pass an argument to a spark submit job in airflow

I have to trigger a PySpark module from Airflow using a SparkSubmitOperator. The PySpark module needs to take the Spark session variable as an argument. I have used application_args to pass the parameter to the PySpark module. But when I ran the DAG, the spark-submit operator failed and the parameter I passed in was treated as a None type variable.
I need to know how to pass an argument to a PySpark module triggered through SparkSubmitOperator.
The DAG code is below:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PRJT").enableHiveSupport().getOrCreate()
spark_config = {
    'conn_id': 'spark_default',
    'driver_memory': '1g',
    'executor_cores': 1,
    'num_executors': 1,
    'executor_memory': '1g'
}
dag = DAG(
    dag_id="spark_session_prgm",
    default_args=default_args,
    schedule_interval='@daily',
    catchup=False)
spark_submit_task1 = SparkSubmitOperator(
    task_id='spark_submit_task1',
    application='/home/airflow_home/dags/tmp_spark_1.py',
    application_args=['spark'],
    **spark_config, dag=dag)
The sample code in tmp_spark_1.py program:
After a bit of debugging, I found the solution to my problem.
argparse was the reason it was not working. Instead, I used sys.argv[1] and it does the job.
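A minimal sketch of what tmp_spark_1.py could look like with this approach; the argument handling mirrors application_args=['spark'] from the DAG above, and the app name is a placeholder:
import sys
from pyspark.sql import SparkSession

# sys.argv[1] receives the first value from application_args, here the string 'spark'
arg_from_dag = sys.argv[1]
spark = SparkSession.builder.appName("PRJT").enableHiveSupport().getOrCreate()
print("Received argument from Airflow:", arg_from_dag)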

BigQuery connector ClassNotFoundException in PySpark on Dataproc

I'm trying to run a script in PySpark, using Dataproc.
The script is kind of a merge between this example and what I need to do, as I wanted to check if everything works. Obviously, it doesn't.
The error I get is:
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.ClassNotFoundException: com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat
I made sure I have all the jars, added some new jars as suggested in other similar posts. I also checked the SPARK_HOME variable.
Below you can see the code; the error appears when trying to instantiate table_data.
"""BigQuery I/O PySpark example."""
from __future__ import absolute_import
import json
import pprint
import subprocess
import pyspark
from pyspark.sql import SQLContext
sc = pyspark.SparkContext()
bucket = sc._jsc.hadoopConfiguration().get('fs.gs.system.bucket')
project = sc._jsc.hadoopConfiguration().get('fs.gs.project.id')
input_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_input'.format(bucket)
conf = {
    'mapred.bq.project.id': project,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'publicdata',
    'mapred.bq.input.dataset.id': 'samples',
    'mapred.bq.input.table.id': 'shakespeare',
}
output_dataset = 'wordcount_dataset'
output_table = 'wordcount_output'
table_data = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)
As pointed out in the example, you need to include BigQuery connector jar when submitting the job.
Through Dataproc jobs API:
gcloud dataproc jobs submit pyspark --cluster=${CLUSTER} \
/path/to/your/script.py \
--jars=gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar
or spark-submit from inside the cluster:
spark-submit --jars=gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar \
/path/to/your/script.py

How to executing arbitrary sql from pyspark notebook using SQLContext?

I'm trying a basic test case of reading data from dashDB into spark and then writing it back to dashDB again.
Step 1. First within the notebook, I read the data:
sqlContext = SQLContext(sc)
dashdata = sqlContext.read.jdbc(
url="jdbc:db2://bluemix05.bluforcloud.com:50000/BLUDB:user=****;password=****;",
table="GOSALES.BRANCH"
).cache()
Step 2. Then from dashDB I create the target table:
DROP TABLE ****.FROM_SPARK;
CREATE TABLE ****.FROM_SPARK AS (
SELECT *
FROM GOSALES.BRANCH
) WITH NO DATA
Step 3. Finally, within the notebook I save the data to the table:
from pyspark.sql import DataFrameWriter
writer = DataFrameWriter(dashdata)
dashdata = writer.jdbc(
url="jdbc:db2://bluemix05.bluforcloud.com:50000/BLUDB:user=****;password=****;",
table="****.FROM_SPARK"
)
Question: Is it possible to run the sql in step 2 from pyspark? I couldn't see how this could be done from the pyspark documentation. I don't want to use vanilla python for connecting to dashDB because of the effort involved in setting up the library.
Use ibmdbpy. See this brief demo.
With as_idadataframe() you can upload DataFrames into dashDB as a table.
Added key steps here as stackoverflow doesn't like linking to answers:
Step 1: Add a cell containing:
#!pip install --user future
#!pip install --user lazy
#!pip install --user jaydebeapi
#!pip uninstall --yes ibmdbpy
#!pip install ibmdbpy --user --no-deps
#!wget -O $HOME/.local/lib/python2.7/site-packages/ibmdbpy/db2jcc4.jar https://ibm.box.com/shared/static/lmhzyeslp1rqns04ue8dnhz2x7fb6nkc.zip
Step 2: Then, from another notebook cell:
from ibmdbpy import IdaDataBase
idadb = IdaDataBase('jdbc:db2://<dashdb server name>:50000/BLUDB:user=<dashdb user>;password=<dashdb pw>')
....
Yes, you can create a table in dashDB from a notebook.
Below is the code for Scala:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql._
import org.apache.log4j.Logger
import org.apache.log4j.Level
import java.sql.Connection
import java.sql.DriverManager
import java.sql.SQLException
import com.ibm.db2.jcc._
import java.io._
val jdbcClassName="com.ibm.db2.jcc.DB2Driver"
val url="jdbc:db2://awh-yp-small02.services.dal.bluemix.net:50001/BLUDB:sslConnection=true;" // enter the hostip fromc connection settings
val user="<username>"
val password="<password>"
Class.forName(jdbcClassName)
val connection = DriverManager.getConnection(url, user, password)
val stmt = connection.createStatement()
stmt.executeUpdate("CREATE TABLE COL12345(" +
"month VARCHAR(82))")
stmt.close()
connection.commit()
connection.close()