How to delete records using Spark JDBC in PostgreSQL? - pyspark

I am trying to delete records (not truncate, since it is based on a condition) from a PostgreSQL table. Can someone tell me what the PySpark command would be?
For selecting/inserting I am using the commands below.
Build the query:
logDetlSql = f"(select * from {dqLogDetlStgTbl} where prc_name = '{inputParam.prc_name}') logDetlStg"
Execute it to fetch data using Spark JDBC:
dfsql_log_detl_stg = spark.read.jdbc(url=dq_dburl, table=logDetlSql, properties=connection_properties)#.select(*dqLogTblColList)
Write using Spark JDBC:
dfsql_log_detl_stg.write.jdbc(url=dq_dburl, mode="append", table=dqLogDetlTbl, properties=connection_properties)
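As the Databricks answer further down explains, spark.read/spark.write cannot push a conditional DELETE to the database, so the statement has to go through a plain database connection on the driver. A minimal sketch of one way to do that, assuming the psycopg2 package is available on the driver; the connection details and table name are placeholders, and only the prc_name filter is taken from the query above:
# Hedged sketch, not from the original post: delete matching rows over a plain
# PostgreSQL connection. Host, database, credentials and table name are placeholders.
import psycopg2

conn = psycopg2.connect(host="your-postgres-host", dbname="your_db",
                        user="your_user", password="your_password")
try:
    with conn.cursor() as cur:
        # parameterized to avoid quoting/injection issues
        cur.execute("DELETE FROM dq_log_detl_stg WHERE prc_name = %s",
                    (inputParam.prc_name,))
    conn.commit()
finally:
    conn.close()
The same thing can be done without an extra Python dependency by going through the JDBC driver directly; see the sketch under the Databricks question below.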

Related

Execute Stored Procedure in Glue ETL

How can we execute a SQL statement (like 'call store_proc();') in Redshift via a PySpark Glue ETL job, utilizing a catalog connection?
I want to pass the Redshift connection details (host, user, password) from the Glue Catalog connection.
I understand the 'write_dynamic_frame' option, but I am not sure how to execute just a SQL statement against the Redshift server.
glueContext.write_dynamic_frame.from_jdbc_conf(frame=data_frame,
    catalog_connection="Redshift_Catalog_Conn",
    connection_options={"preactions": "call stored_prod();", "dbtable": "public.table1", "database": "admin"},
    redshift_tmp_dir="s3://glue_etl/")
As I understand it, you want to call a stored procedure in Redshift from your Glue ETL job. There are two ways to do this.
The simpler way is to execute the stored procedure as a post-action of the write, as follows:
post_query="begin; CALL sp_procedure1(); end;"
datasink = glueContext.write_dynamic_frame.from_jdbc_conf(frame = mydf, \
    catalog_connection = "redshift_connection", \
    connection_options = {"dbtable": "my_table", "database": "dev", "postactions": post_query}, \
    redshift_tmp_dir = 's3://tempb/temp/', transformation_ctx = "datasink")
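For completeness, the preactions key from the question works the same way: Glue runs the SQL passed there against Redshift before writing the frame, while postactions runs after the write. A small sketch of the connection_options dictionary combining both, reusing the names from this thread:
# Sketch only: both keys take a semicolon-separated list of SQL statements that
# Glue executes against Redshift around the write (names taken from this thread).
connection_options = {
    "dbtable": "my_table",
    "database": "dev",
    "preactions": "call stored_prod();",                 # runs before the write
    "postactions": "begin; CALL sp_procedure1(); end;"   # runs after the write
}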
The other, more elaborate solution is to run SQL queries in application code.
Establish a connection to your Redshift cluster via Glue connections and create a dynamic frame in Glue with the JDBC options:
my_conn_options = {
    "url": "jdbc:redshift://host:port/redshift-database-name",
    "dbtable": "redshift-table-name",
    "user": "username",
    "password": "password",
    "redshiftTmpDir": args["TempDir"],
    "aws_iam_role": "arn:aws:iam::account id:role/role-name"
}
df = glueContext.create_dynamic_frame_from_options("redshift", my_conn_options)
In order to execute the stored procedure, we will use Spark SQL, so first convert the Glue DynamicFrame to a Spark DataFrame:
spark_df=df.toDF()
spark_df.createOrReplaceTempView("CUSTOM_TABLE_NAME")
spark.sql('call store_proc();')
Your stored procedure in Redshift should have return values, which can be written out to variables.

Delete records from Postgres from Databricks (pyspark)

I am using PySpark to connect to a Postgres database from Databricks. I can read, create tables and also update them, but I am unable to delete a record.
dfs = spark.read.format('jdbc')\
    .option("url", jdbcUrl)\
    .option("user", user)\
    .option("password", password)\
    .option("query", "DELETE FROM meta.test4 WHERE Emp_Id = 1")\
    .load()
This snippet results in a syntax error:
org.postgresql.util.PSQLException: ERROR: syntax error at or near "FROM"
How do I delete a record in Postgres?
spark.read is only used for reading data. Internally, it wraps the query in SELECT * FROM (<query>), so your statement actually becomes:
SELECT * FROM (DELETE FROM meta.test4 WHERE Emp_Id = 1)
and this obviously causes the syntax error you described.
If you want to run DML/DDL operations against a remote database, you need to connect explicitly and run the statement using JDBC's Connection and Statement classes. This tutorial provides a nice overview.
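For a PySpark/Databricks notebook, here is a minimal sketch of that approach, going through java.sql.DriverManager via py4j. Note that _jvm is an internal PySpark attribute, and this assumes the PostgreSQL JDBC driver is already on the cluster classpath (as it is on Databricks); jdbcUrl, user and password are the same variables used in the question.
# Hedged sketch: run the DELETE through a plain JDBC connection instead of spark.read.
driver_manager = spark.sparkContext._jvm.java.sql.DriverManager
conn = driver_manager.getConnection(jdbcUrl, user, password)
try:
    stmt = conn.createStatement()
    deleted = stmt.executeUpdate("DELETE FROM meta.test4 WHERE Emp_Id = 1")
    print(f"Deleted {deleted} row(s)")
    stmt.close()
finally:
    conn.close()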

Spark SQL unable to find the database and table it earlier wrote to

There is a Spark component that creates a SQL table out of transformed data. It successfully saves the data into spark-warehouse under the <database_name>.db folder. The component also tries to read from the existing table in order not to blindly overwrite it. While reading, Spark is unable to find any database other than default.
sparkVersion: 2.4
val spark: SparkSession = SparkSession.builder()
  .master("local[*]")
  .config("spark.debug.maxToStringFields", 100)
  .config("spark.sql.warehouse.dir", "D:/Demo/spark-warehouse/")
  .getOrCreate()

def saveInitialTable(df: DataFrame): Unit = {
  df.createOrReplaceTempView(Constants.tempTable)
  spark.sql("create database " + databaseName)
  spark.sql(
    s"""create table if not exists $databaseName.$tableName
       |using parquet partitioned by (${Constants.partitions.mkString(",")})
       |as select * from ${Constants.tempTable}""".stripMargin)
}

def deduplication(dataFrame: DataFrame): DataFrame = {
  if (Try(spark.sql("show tables from " + databaseName)).isFailure) {
    //something
  }
}
After the saveInitialTable function has run successfully, on the second run the deduplication function is still not able to pick up <database_name>.
I am not using Hive explicitly anywhere, just Spark DataFrames and the SQL API.
When I run the REPL in the same directory as spark-warehouse, it too shows only the default database.
scala> spark.sql("show databases").show()
2021-10-07 18:45:57 WARN ObjectStore:6666 - Version information not found in metastore.
hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
2021-10-07 18:45:57 WARN ObjectStore:568 - Failed to get database default, returning
NoSuchObjectException
+------------+
|databaseName|
+------------+
| default|
+------------+

Unable to load Hive table from Spark DataFrame with more than 25 columns in HDP 3

We were trying to populate a Hive table from spark-shell. A DataFrame with 25 columns was added to the Hive table successfully using the Hive Warehouse Connector, but for more columns than this we got the error below:
Caused by: java.lang.IllegalArgumentException: Missing required char ':' at 'struct<_c0:string,_c1:string,_c2:string,_c3:string,_c4:string,_c5:string,_c6:string,_c7:string,_c8:string,_c9:string,_c10:string,_c11:string,_c12:string,_c13:string,_c14:string,_c15:string,_c16:string,_c17:string,_c18:string,_c19:string,_c20:string,_c21:string,_c22:string,_c23:string,...^ 2 more fields>'
at org.apache.orc.TypeDescription.requireChar(TypeDescription.java:293)
Below is the sample input file data (the input file is a CSV).
|col1 |col2 |col3 |col4 |col5 |col6 |col7 |col8 |col9 |col10 |col11 |col12 |col13 |col14 |col15 |col16 |col17|col18 |col19 |col20 |col21 |col22 |col23 |col24 |col25|col26 |
|--------------------|-----|-----|-------------------|--------|---------------|-----------|--------|--------|--------|--------|--------|--------|--------|--------|------|-----|---------------------------------------------|--------|-------|---------|---------|---------|------------------------------------|-----|----------|
|11111100000000000000|CID81|DID72|2015-08-31 00:17:00|null_val|919122222222222|1627298243 |null_val|null_val|null_val|null_val|null_val|null_val|Download|null_val|Mobile|NA |x-nid:xyz<-ch-nid->N4444.245881.ABC-119490111|12452524|1586949|sometext |sometext |sometext1|8b8d94af-5407-42fa-9c4f-baaa618377c8|Click|2015-08-31|
|22222200000000000000|CID82|DID73|2015-08-31 00:57:00|null_val|919122222222222|73171145211|null_val|null_val|null_val|null_val|null_val|null_val|Download|null_val|Tablet|NA |x-nid:xyz<-ch-nid->N4444.245881.ABC-119490111|12452530|1586956|88200211 |88200211 |sometext2|9b04580d-1669-4eb3-a5b0-4d9cec422f93|Click|2015-08-31|
|33333300000000000000|CID83|DID74|2015-08-31 00:17:00|null_val|919122222222222|73171145211|null_val|null_val|null_val|null_val|null_val|null_val|Download|null_val|Laptop|NA |x-nid:xyz<-ch-nid->N4444.245881.ABC-119490111|12452533|1586952|sometext2|sometext2|sometext3|3ab8511d-6f85-4e1f-8b11-a1d9b159f22f|Click|2015-08-31|
The Spark shell was started using the command below:
spark-shell --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.0.1.0-187.jar --conf spark.hadoop.metastore.catalog.default=hive --conf spark.sql.hive.hiveserver2.jdbc.url="jdbc:hive2://sandbox-hdp.hortonworks.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;user=raj_ops"
Version of HDP is 3.0.1
The Hive table was created using the command below:
val hive = com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder.session(spark).build()
hive.createTable("tablename").ifNotExists().column()...create()
Data was saved using the command below:
df.write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector").option("table", "tablename").mode("append").save()
Kindly help us on this.
Thank you in advance.
I faced this problem. After thoroughly examining the source code of the following classes:
org.apache.orc.TypeDescription
org.apache.spark.sql.types.StructType
org.apache.spark.util.Utils
I found out that the culprit was the variable DEFAULT_MAX_TO_STRING_FIELDS inside the class org.apache.spark.util.Utils:
/* The performance overhead of creating and logging strings for wide schemas can be large. To limit the impact, we bound the number of fields to include by default. This can be overridden by setting the 'spark.debug.maxToStringFields' conf in SparkEnv. */
val DEFAULT_MAX_TO_STRING_FIELDS = 25
So, after setting this property in my application, for example conf.set("spark.debug.maxToStringFields", "128"), the issue was gone.
I hope it helps others.
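Since the error in the question comes from spark-shell, the same property can presumably also be set when starting the shell, for example by adding it to the original invocation:
spark-shell --conf spark.debug.maxToStringFields=128 --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.0.1.0-187.jar ...
(the remaining --conf options are the same as in the question's command).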

Write Spark DataFrame to PostgreSQL

I'm trying to write a Spark DataFrame to a pre-created PostgreSQL table. I get the following error during the INSERT part of my job:
java.sql.BatchUpdateException: Batch entry 0 INSERT INTO ref.tableA(a,b) VALUES ('Mike',548758) was aborted. Call getNextException to see the cause.
I also tried to catch the error and call the getNextException method, but I still get the same error in the logs. In order to write the DataFrame to the corresponding table I used the following process:
val jdbcProps = new java.util.Properties()
jdbcProps.setProperty("driver", Config.psqlDriver)
jdbcProps.setProperty("user", Config.psqlUser)
jdbcProps.setProperty("password", Config.psqlPassword)
jdbcProps.setProperty("stringtype", "unspecified")
df.write
  .format("jdbc")
  .mode(SaveMode.Append)
  .jdbc(Config.psqlUrl, tableName, jdbcProps)
Package versions:
- Spark: 1.6.2
- Scala: 2.10.6
Any ideas?