How can we execute a SQL statement (like 'call store_proc();') in Redshift via a PySpark Glue ETL job, utilizing a catalog connection?
I want to pass the Redshift connection details (host, user, password) from the Glue Catalog Connection.
I understand the 'write_dynamic_frame' option, but I am not sure how to execute only a SQL statement against the Redshift server.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=data_frame,
    catalog_connection="Redshift_Catalog_Conn",
    connection_options={"preactions": "call store_proc();", "dbtable": "public.table1", "database": "admin"},
    redshift_tmp_dir="s3://glue_etl/")
As I understand it, you want to call a stored procedure in Redshift from your Glue ETL job. There are a couple of ways to do this.
The simpler way is to attach the call as a post-action on a write, as follows.
post_query = "begin; CALL sp_procedure1(); end;"
datasink = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=mydf,
    catalog_connection="redshift_connection",
    connection_options={"dbtable": "my_table", "database": "dev", "postactions": post_query},
    redshift_tmp_dir="s3://tempb/temp/",
    transformation_ctx="datasink")
The other, more elaborate, solution is to run SQL queries in application code.
Establish a connection to your Redshift cluster via Glue connections and create a dynamic frame in Glue with the JDBC options.
my_conn_options = {
    "url": "jdbc:redshift://host:port/redshift-database-name",
    "dbtable": "redshift-table-name",
    "user": "username",
    "password": "password",
    "redshiftTmpDir": args["TempDir"],
    "aws_iam_role": "arn:aws:iam::account id:role/role-name"
}

df = glueContext.create_dynamic_frame_from_options("redshift", my_conn_options)
In order to execute the stored procedure, we will use Spark SQL, so first convert the Glue DynamicFrame to a Spark DataFrame.
spark_df=df.toDF()
spark_df.createOrReplaceTempView("CUSTOM_TABLE_NAME")
spark.sql('call store_proc();')
Your stored procedure in Redshift should have return values, which can then be written out to variables.
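If the goal is only to execute a single SQL statement, without loading or writing any data, another option is to pull the connection details out of the Glue catalog connection with glueContext.extract_jdbc_conf and run the statement over a raw JDBC connection through py4j. The following is a minimal sketch, not a tested implementation: the connection name, database and procedure name are the ones from the question, sc is the job's SparkContext, and it assumes the Redshift JDBC driver is available on the Glue job's classpath.

# Hedged sketch: execute one SQL statement against Redshift using the
# credentials stored in a Glue catalog connection.
conf = glueContext.extract_jdbc_conf("Redshift_Catalog_Conn")

# Depending on how the connection was defined, the returned url may or may
# not already include the database name; append it if needed.
jdbc_url = conf["url"] + "/admin"

driver_manager = sc._jvm.java.sql.DriverManager
conn = driver_manager.getConnection(jdbc_url, conf["user"], conf["password"])
try:
    stmt = conn.createStatement()
    stmt.execute("call store_proc();")
    stmt.close()
finally:
    conn.close()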
Can someone let me know how to make the first row the header when saving to SQL Server with Databricks?
I am currently using the following code to upload / save to SQL in Azure:
jdbcUrl = f"jdbc:sqlserver://{DBServer}.database.windows.net:1433;database={DBDatabase};user={DBUser};password={DBPword};encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
df = spark.read.csv("/mnt/lake/RAW/OptionsetMetadata.csv")
df.write.mode("overwrite") \
.format("jdbc") \
.option("url", jdbcUrl) \
.option("dbtable", 'UpdatedProducts')\
.save()
The table looks like the following after saving:
The JDBC driver creates the table according to the DataFrame's schema. It looks like you're reading the CSV file without specifying .option("header", "true"), so the first row is treated as data instead of column names. Just add this option to your read operation.
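For example, the read in your snippet would become:

# Treat the first row of the CSV as column names instead of data
df = spark.read.option("header", "true").csv("/mnt/lake/RAW/OptionsetMetadata.csv")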
So I am using PySpark to connect to a Postgres database from Databricks. I can read, create tables and update them, but I am unable to delete a record.
dfs = spark.read.format('jdbc')\
.option("url", jdbcUrl)\
.option("user", user)\
.option("password", password)\
.option("query", "DELETE FROM meta.test4 WHERE Emp_Id = 1")\
.load()
This snippet results in a syntax error:
org.postgresql.util.PSQLException: ERROR: syntax error at or near "FROM"
How do I delete a record in Postgres?
spark.read is only used for reading data. Internally, it wraps the query in a SELECT * FROM (<query>) so your statement actually becomes:
SELECT * FROM (DELETE FROM meta.test4 WHERE Emp_Id = 1)
and this obviously causes the syntax error you described.
If you want to run DML/DDL operations against a remote database, you need to connect explicitly and run the statement using JDBC's Connection and Statement classes. This tutorial provides a nice overview.
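For example, on Databricks you can reach the JDBC classes through py4j and reuse the jdbcUrl, user and password variables from your read snippet. This is a minimal sketch, assuming the PostgreSQL JDBC driver is already on the cluster classpath (your reads above already depend on it):

# Hedged sketch: run a DML statement over a plain JDBC connection
driver_manager = spark.sparkContext._jvm.java.sql.DriverManager
conn = driver_manager.getConnection(jdbcUrl, user, password)
try:
    stmt = conn.createStatement()
    # executeUpdate returns the number of affected rows
    deleted = stmt.executeUpdate("DELETE FROM meta.test4 WHERE Emp_Id = 1")
    print(f"Deleted {deleted} row(s)")
    stmt.close()
finally:
    conn.close()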
I am trying to delete records (not truncate, since it's based on a condition) from a Postgres table. Can someone help me with the PySpark command for this?
For selecting/inserting I am using the commands below.
Build the query:
logDetlSql = f"(select * from {dqLogDetlStgTbl} where prc_name = '{inputParam.prc_name}') logDetlStg"
Execute it to fetch data using the Spark JDBC reader:
dfsql_log_detl_stg = spark.read.jdbc(url=dq_dburl, table=logDetlSql, properties=connection_properties)  # .select(*dqLogTblColList)
Write using the Spark JDBC writer:
dfsql_log_detl_stg.write.jdbc(url=dq_dburl, mode="append", table=dqLogDetlTbl, properties=connection_properties)
I'm trying to save my data frame in *.orc format via JDBC to PostgreSQL. I have an intermediate table created on localhost on the server I use, but the table is not saved in PostgreSQL.
I would like to find out which formats PostgreSQL works with (you may not be able to create an *.orc table in it), and what it accepts: a Dataset or a SQL query over the created table.
I'm using Spark.
Properties conProperties = new Properties();
conProperties.setProperty("driver", "org.postgresql.Driver");
conProperties.setProperty("user", args[2]);
conProperties.setProperty("password", args[3]);

finalTable.write()
    .format("orc")
    .mode(SaveMode.Overwrite)
    .jdbc(args[1], "dataTable", conProperties);
spark-submit --class com.pack.Main --master yarn --deploy-mode cluster /project/build/libs/mainC-1.0-SNAPSHOT.jar ha-cluster?jdbc:postgresql://MyHostame:5432/nameUser&user=nameUser&password=passwordUser
You cannot store the .orc format in Postgres, and you wouldn't want that either.
See the updated write below:
finalTable.write()
    .format("jdbc")
    .option("url", args[1])
    .option("dbtable", "dataTable")
    .option("user", args[2])
    .option("password", args[3])
    .save();
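For anyone doing the same thing from PySpark, the equivalent write looks like this; a sketch in which final_table, the URL and the credentials are placeholders for your own DataFrame and arguments:

# Hedged PySpark equivalent of the JDBC write above; replace the URL,
# credentials and final_table with your own values.
final_table.write \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://host:5432/dbname") \
    .option("dbtable", "dataTable") \
    .option("user", "username") \
    .option("password", "password") \
    .mode("overwrite") \
    .save()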
I am new to Dataproc clusters and PySpark. While looking for code to load a table from BigQuery into the cluster, I came across the code below, and I was unable to figure out what I am supposed to change for my use case and what we are providing as input in the input directory.
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
import subprocess
sc = SparkContext()
spark = SparkSession(sc)
bucket = spark._jsc.hadoopConfiguration().get('fs.gs.system.bucket')
project = spark._jsc.hadoopConfiguration().get('fs.gs.project.id')
input_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_input'.format(bucket)
conf = {
    'mapred.bq.project.id': project,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'dataset_new',
    'mapred.bq.input.dataset.id': 'retail',
    'mapred.bq.input.table.id': 'market',
}
You are trying to use the Hadoop BigQuery connector; for Spark you should use the Spark BigQuery connector.
To read data from BigQuery, you can follow this example:
# Use the Cloud Storage bucket for temporary BigQuery export data used
# by the connector.
bucket = "[bucket]"
spark.conf.set('temporaryGcsBucket', bucket)
# Load data from BigQuery.
words = spark.read.format('bigquery') \
.option('table', 'bigquery-public-data:samples.shakespeare') \
.load()
words.createOrReplaceTempView('words')
# Perform word count.
word_count = spark.sql(
'SELECT word, SUM(word_count) AS word_count FROM words GROUP BY word')
word_count.show()
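To adapt this to your use case, point the table option at your own project, dataset and table, taking the ids from the mapred.bq.input.* settings in your snippet; for example:

# Hedged sketch: read the table referenced in the question's config.
# 'dataset_new', 'retail' and 'market' are the project, dataset and table ids
# from the mapred.bq.input.* settings above -- adjust if those were placeholders.
market = spark.read.format('bigquery') \
    .option('table', 'dataset_new.retail.market') \
    .load()
market.show()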