Upsert into an RDBMS (MySQL) table using PySpark DataFrames and JDBC - pyspark

I'm trying to perform a merge (upsert) operation on MySQL using PySpark DataFrames and a JDBC connection.
I followed the article below, which does the same thing in Scala ( https://medium.com/#thomaspt748/how-to-upsert-data-into-relational-database-using-apache-spark-part-2-45a9d49d0f43 ).
I need to perform the upsert from PySpark, and I got stuck at iterating through the PySpark DataFrame to call an upsert function like the one below. The source DataFrame needs to be passed in so its rows can be read as parameters for the SQL upsert.
(In simple terms, performing the SQL upsert using a PySpark DataFrame.)
import mysql.connector

def upsertToDelta(id, name, price, purchase_date):
    try:
        connection = mysql.connector.connect(host='localhost',
                                             database='Electronics',
                                             user='pynative',
                                             password='pynative##29')
        cursor = connection.cursor()
        mySql_insert_query = """MERGE INTO targetTable USING
                                VALUES (%s, %s, %s, %s) AS INSROW (Id, Name, Price, Purchase_date)
                                WHEN NOT MATCHED THEN INSERT VALUES (INSROW.Id, INSROW.Name, INSROW.Price, INSROW.Purchase_date)
                                WHEN MATCHED THEN UPDATE SET Name = INSROW.Name"""
        recordTuple = (id, name, price, purchase_date)
        cursor.execute(mySql_insert_query, recordTuple)
        connection.commit()
        print("Record inserted successfully into test table")
    except mysql.connector.Error as error:
        print("Failed to insert into MySQL table {}".format(error))
dataFrame.writeStream \
    .format("delta") \
    .foreachBatch(upsertToDelta) \
    .outputMode("update") \
    .start()
Any help on this is highly appreciated.
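For reference, a minimal sketch of one way to do this from PySpark (a sketch, not the article's code): MySQL has no MERGE statement, so the upsert can be expressed as INSERT ... ON DUPLICATE KEY UPDATE and applied per partition of the source DataFrame with foreachPartition. The connection settings and the targetTable columns (Id, Name, Price, Purchase_date) are assumed from the question, sourceDataFrame is a placeholder for your DataFrame, and mysql-connector-python must be installed on the executors.
import mysql.connector

def upsert_partition(rows):
    # one connection per partition; rows is an iterator of Row objects
    connection = mysql.connector.connect(host='localhost',
                                         database='Electronics',
                                         user='pynative',
                                         password='pynative##29')
    cursor = connection.cursor()
    # MySQL's upsert form: insert, and update the non-key columns on a key collision
    upsert_sql = """INSERT INTO targetTable (Id, Name, Price, Purchase_date)
                    VALUES (%s, %s, %s, %s)
                    ON DUPLICATE KEY UPDATE Name = VALUES(Name),
                                            Price = VALUES(Price),
                                            Purchase_date = VALUES(Purchase_date)"""
    for row in rows:
        cursor.execute(upsert_sql, (row['Id'], row['Name'], row['Price'], row['Purchase_date']))
    connection.commit()
    cursor.close()
    connection.close()

sourceDataFrame.foreachPartition(upsert_partition)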

Related

Execute SQL stored in dataframe using pyspark

I have a list of SQL statements stored in a Hive table column. I have to fetch one SQL statement at a time from the Hive table and execute it. I'm getting the SQL as a DataFrame, but can anyone tell me how to execute the SQL stored in that DataFrame?
The column parameter_value contains the SQL
extract_sql = spark.sql(""" select parameter_value
from schema_name.table_params
where project_name = 'some_projectname'
and sub_project_name = 'some_sub_project'
and parameter_name = 'extract_sql' """)
Now extract_sql contains the SQL; how do I execute it?
You can do it as follows:
sqls = spark.sql(""" select parameter_value
from schema_name.table_params
where project_name = 'some_projectname'
and sub_project_name = 'some_sub_project'
and parameter_name = 'extract_sql' """).collect()
for sql in sqls:
    spark.sql(sql[0]).show()
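If the lookup is expected to return exactly one statement, a small variation (assuming a single matching row) is to take just the first row instead of collecting everything:
# assumes the query matches exactly one row; parameter_value holds the SQL text
sql_text = spark.sql(""" select parameter_value
                         from schema_name.table_params
                         where project_name = 'some_projectname'
                         and sub_project_name = 'some_sub_project'
                         and parameter_name = 'extract_sql' """).first()["parameter_value"]
spark.sql(sql_text).show()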

How to add a new column to a Delta Lake table?

I'm trying to add a new column to data stored as a Delta Table in Azure Blob Storage. Most of the actions being done on the data are upserts, with many updates and few new inserts. My code to write data currently looks like this:
DeltaTable.forPath(spark, deltaPath)
  .as("dest_table")
  .merge(myDF.as("source_table"),
    "dest_table.id = source_table.id")
  .whenNotMatched()
  .insertAll()
  .whenMatched(upsertCond)
  .updateExpr(upsertStat)
  .execute()
From these docs, it looks like Delta Lake supports adding new columns on insertAll() and updateAll() calls only. However, I'm updating only when certain conditions are met and want the new column added to all the existing data (with a default value of null).
I've come up with a solution that seems extremely clunky and am wondering if there's a more elegant approach. Here's my current proposed solution:
// Read in existing data
val myData = spark.read.format("delta").load(deltaPath)
// Register table with Hive metastore
myData.write.format("delta").saveAsTable("input_data")
// Add new column
spark.sql("ALTER TABLE input_data ADD COLUMNS (new_col string)")
// Save as DataFrame and overwrite data on disk
val sqlDF = spark.sql("SELECT * FROM input_data")
sqlDF.write.format("delta").option("mergeSchema", "true").mode("overwrite").save(deltaPath)
Alter your Delta table first and then do your merge operation:
from pyspark.sql.functions import lit
spark.read.format("delta").load('/mnt/delta/cov')\
.withColumn("Recovered", lit(''))\
.write\
.format("delta")\
.mode("overwrite")\
.option("overwriteSchema", "true")\
.save('/mnt/delta/cov')
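Note that lit('') fills the new column with empty strings; if the default should be null instead, as the question asks, a typed null literal works the same way (same snippet, only the default value changes):
from pyspark.sql.functions import lit

# same rewrite, but the new column defaults to null instead of ''
spark.read.format("delta").load('/mnt/delta/cov')\
    .withColumn("Recovered", lit(None).cast("string"))\
    .write\
    .format("delta")\
    .mode("overwrite")\
    .option("overwriteSchema", "true")\
    .save('/mnt/delta/cov')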
New columns can also be added with SQL commands as follows:
ALTER TABLE dbName.TableName ADD COLUMNS (newColumnName dataType)
UPDATE dbName.TableName SET newColumnName = val;
This is the approach that worked for me, using Scala.
Take a Delta table, named original_table, whose path is:
val path_to_delta = "/mnt/my/path"
This table currently has 1M records with the following schema: pk, field1, field2, field3, field4.
I want to add a new field, named newfield, to the existing schema without losing the data already stored in original_table.
So I first created a dummy record with a simple schema containing just pk and newfield:
case class new_schema(
  pk: String,
  newfield: String
)
I created a dummy record using that schema:
import spark.implicits._
val dummy_record = Seq(new_schema("delete_later", null)).toDF
I inserted this new record (the existing 1M records will have newfield populated as null). I also removed this dummy record from the original table:
import io.delta.tables.DeltaTable

dummy_record
  .write
  .format("delta")
  .option("mergeSchema", "true")
  .mode("append")
  .save(path_to_delta)

val original_dt: DeltaTable = DeltaTable.forPath(spark, path_to_delta)
original_dt.delete("pk = 'delete_later'")
Now the original table will have 6 fields: pk, field1, field2, field3, field4 and newfield
Finally, I upsert the newfield values into the corresponding 1M records, using pk as the join key:
import org.apache.spark.sql.functions.col

val df_with_new_field = // You bring new data from somewhere...
original_dt
  .as("original")
  .merge(
    df_with_new_field.as("new"),
    "original.pk = new.pk")
  .whenMatched
  .update(Map(
    "newfield" -> col("new.newfield")
  ))
  .execute()
https://www.databricks.com/blog/2019/09/24/diving-into-delta-lake-schema-enforcement-evolution.html
Have you tried using the merge statement?
https://docs.databricks.com/spark/latest/spark-sql/language-manual/merge-into.html

Check the Metastore for the Table availability in Spark

I want to check whether the table is available in a specific Hive database before using the select query below.
How do I get that information from the metastore?
sparkSession.sql("select * from db_name.table_name")
You can run the commands below before running your operation on the table:
sparkSession.sql("use databaseName");
val df = sparkSession.sql("show tables like 'tableName'")
if(df.head(1).isEmpty == false){
//write the code
}
Try this
# This will give the list of tables in the database as a dataframe.
tables = sparkSession.sql('SHOW TABLES IN ' + db_name)
# You can collect them to bring in a list of row items
tables_list = tables.select('tableName').collect()
# Convert them into a python array
tables_list_arr = []
for table_name in tables_list:
    tables_list_arr.append(str(table_name['tableName']))
# Then execute your command, checking for the table you need
if 'table_name' in tables_list_arr:
    sparkSession.sql("select * from db_name.table_name")

How to resolve the error "org.apache.spark.SparkException: Requested partitioning does not match the tablename table" in spark-shell

While writing data into a Hive partitioned table, I am getting the error below.
org.apache.spark.SparkException: Requested partitioning does not match the tablename table:
I have converted my RDD to a DataFrame using a case class, and then I am trying to write the data into the existing Hive partitioned table. But I am getting this error, and as per the printed logs, "Requested partitions:" comes up blank. The partition columns appear as expected in the Hive table.
spark-shell error:
scala> data1.write.format("hive").partitionBy("category", "state").mode("append").saveAsTable("sampleb.sparkhive6")
org.apache.spark.SparkException: Requested partitioning does not match the sparkhive6 table:
Requested partitions:
Table partitions: category,state
Hive table format:
hive> describe formatted sparkhive6;
OK
col_name data_type comment
txnno int
txndate string
custno int
amount double
product string
city string
spendby string
Partition Information
col_name data_type comment
category string
state string
Try the insertInto() function instead of saveAsTable(). Note that partitionBy() can't be combined with insertInto(); the table's existing partition columns are used.
scala> data1.write.format("hive")
  .mode("append")
  .insertInto("sampleb.sparkhive6")
(or)
Register a temp view on top of the DataFrame, then use a SQL statement to insert the data into the Hive table.
scala> data1.createOrReplaceTempView("temp_vw")
scala> spark.sql("insert into sampleb.sparkhive6 partition(category,state) select txnno,txndate,custno,amount,product,city,spendby,category,state from temp_vw")

Spark SQL - regexp_replace not updating the column value

I ran the following query in Hive and it successfully updated the column value in the table: select id, regexp_replace(full_name,'A','C') from table
But when I ran the same query from Spark SQL, it did not update the actual records:
hiveContext.sql("select id, regexp_replace(full_name,'A','C') from table")
But when I do hiveContext.sql("select id, regexp_replace(full_name,'A','C') from table").show(), it displays 'A' replaced with 'C' successfully, though only in the display and not in the actual table.
I tried to assign the result to another variable
val vFullName = hiveContext.sql("select id, regexp_replace(full_name,'A','C') from table")
and then
vFullName.show() displays the original values, without replacement.
How do I get the value replaced in the table from SparkSQL?
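A SELECT with regexp_replace only transforms the rows it returns; it never rewrites the stored data. To persist the change from Spark, the transformed result has to be written back out. A minimal sketch (the target table name table_replaced is a placeholder, since Spark generally cannot overwrite a table it is reading from in the same job):
# select the replaced values, then write them out as a new table
vFullName = hiveContext.sql("select id, regexp_replace(full_name, 'A', 'C') as full_name from table")
vFullName.write.mode("overwrite").saveAsTable("table_replaced")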