Update only changed rows in a PySpark Delta table on Databricks

I need to update only the rows in an existing table that have changed compared to a newly created DataFrame. Right now I use subtract to get the changed rows, but I am not sure how to merge them into the existing table.
old_df = spark.sql("select * from existing table")
diff = new_df.subtract(old_df)
The diff DataFrame now has to be inserted (for new rows) or used to update existing records:
(deltaTable.alias("full_df").merge(
merge_df.alias("append_df"),
"full_df.col1 = append_df.col1 OR full_df.col2 =append_df.col2")
.whenNotMatchedInsertAll()
.execute()
)
This does not update existing records (case: the col2 value changed, col1 did not).

.whenMatchedUpdateAll() accepts a condition which you can use to preserve unchanged rows:
(deltaTable.alias("full_df").merge(
merge_df.alias("append_df"),
"full_df.col1 = append_df.col1 OR full_df.col2 = append_df.col2")
.whenNotMatchedInsertAll()
.whenMatchedUpdateAll("full_df.col1 != append_df.col1 OR full_df.col2 != append_df.col2")
.execute()
)
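When the table has more than two columns, the two condition strings can be generated instead of typed by hand. The sketch below is plain Python and only builds strings. The helper name `build_merge_conditions` and the split into key columns and value columns are assumptions for illustration, not part of the Delta Lake API. It also assumes rows are matched on the key columns alone and updated only when a value column differs, which is not the same as the OR-based match above.

```python
def build_merge_conditions(key_cols, value_cols, target="full_df", source="append_df"):
    """Return (match_condition, update_condition) as Spark SQL strings.

    Hypothetical helper: key_cols identify a row, value_cols are checked for changes.
    """
    if not key_cols or not value_cols:
        raise ValueError("key_cols and value_cols must both be non-empty")
    # Rows match when every key column is equal.
    match = " AND ".join(f"{target}.{c} = {source}.{c}" for c in key_cols)
    # <=> is Spark SQL's null-safe equality, so NULL versus a value counts as a change.
    changed = " OR ".join(f"NOT ({target}.{c} <=> {source}.{c})" for c in value_cols)
    return match, changed


match_cond, update_cond = build_merge_conditions(["col1"], ["col2"])
print(match_cond)
print(update_cond)
```

The resulting strings would be passed as the merge condition and as the argument to whenMatchedUpdateAll, respectively.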

Related

PySpark Code Modification to Remove Nulls

I received help with the following PySpark code to prevent errors when doing a merge in Databricks; see:
Databricks Error: Cannot perform Merge as multiple source rows matched and attempted to modify the same target row in the Delta table conflicting way
I was wondering if I could get help to modify the code to drop NULLs.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
df2 = partdf.withColumn("rn", row_number().over(Window.partitionBy("P_key").orderBy("Id")))
df3 = df2.filter("rn = 1").drop("rn")
Thanks
The code you are using does not remove the rows where P_key is null. row_number() is also applied to the partition of null P_key values, and the row in that partition with row number 1 is kept.
You can use df.na.drop instead to get the required result.
df.na.drop(subset=["P_key"]).show(truncate=False)
To make your original approach work, add a row with a null P_key and the lowest possible unique id, store that id in a variable, run the same code, and add an extra condition to the filter, as shown below.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number,when,col
df = spark.read.option("header",True).csv("dbfs:/FileStore/sample1.csv")
#adding row with least possible id value.
dup_id = '0'
new_row = spark.createDataFrame([[dup_id,'','x','x']], schema = ['id','P_key','c1','c2'])
#replacing empty string with null for P_Key
new_row = new_row.withColumn('P_key',when(col('P_key')=='',None).otherwise(col('P_key')))
df = df.union(new_row) #row added
#code to remove duplicates
df2 = df.withColumn("rn", row_number().over(Window.partitionBy("P_key").orderBy("id")))
df2.show(truncate=False)
#additional condition to remove added id row.
df3 = df2.filter((df2.rn == 1) & (df2.P_key!=dup_id)).drop("rn")
df3.show()
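To check the expected result on a small sample without a cluster, the intended behaviour (drop rows whose P_key is null, then keep the row with the smallest id for each P_key) can be modelled in plain Python. This is only a model of the logic, not Spark code; the function name `dedupe_non_null` and the sample rows are made up for illustration.

```python
def dedupe_non_null(rows, key="P_key", order="id"):
    """Keep one row per non-null key: the one with the smallest `order` value."""
    best = {}
    for row in rows:
        k = row[key]
        if k is None:
            # Rows with a null key are dropped entirely.
            continue
        if k not in best or row[order] < best[k][order]:
            best[k] = row
    return sorted(best.values(), key=lambda r: r[order])


# Hypothetical sample data.
sample = [
    {"id": 1, "P_key": "a"},
    {"id": 2, "P_key": "a"},
    {"id": 3, "P_key": None},
    {"id": 4, "P_key": "b"},
    {"id": 5, "P_key": None},
]
result = dedupe_non_null(sample)
print(result)
```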

How to add a new column to a Delta Lake table?

I'm trying to add a new column to data stored as a Delta Table in Azure Blob Storage. Most of the actions being done on the data are upserts, with many updates and few new inserts. My code to write data currently looks like this:
DeltaTable.forPath(spark, deltaPath)
.as("dest_table")
.merge(myDF.as("source_table"),
"dest_table.id = source_table.id")
.whenNotMatched()
.insertAll()
.whenMatched(upsertCond)
.updateExpr(upsertStat)
.execute()
From these docs, it looks like Delta Lake supports adding new columns on insertAll() and updateAll() calls only. However, I'm updating only when certain conditions are met and want the new column added to all the existing data (with a default value of null).
I've come up with a solution that seems extremely clunky and am wondering if there's a more elegant approach. Here's my current proposed solution:
// Read in existing data
val myData = spark.read.format("delta").load(deltaPath)
// Register table with Hive metastore
myData.write.format("delta").saveAsTable("input_data")
// Add new column
spark.sql("ALTER TABLE input_data ADD COLUMNS (new_col string)")
// Save as DataFrame and overwrite data on disk
val sqlDF = spark.sql("SELECT * FROM input_data")
sqlDF.write.format("delta").option("mergeSchema", "true").mode("overwrite").save(deltaPath)
Alter your Delta table first, then run your merge operation:
from pyspark.sql.functions import lit
spark.read.format("delta").load('/mnt/delta/cov')\
.withColumn("Recovered", lit(''))\
.write\
.format("delta")\
.mode("overwrite")\
.option("overwriteSchema", "true")\
.save('/mnt/delta/cov')
New columns can also be added with SQL commands as follows:
ALTER TABLE dbName.TableName ADD COLUMNS (newColumnName dataType)
UPDATE dbName.TableName SET newColumnName = val;
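If several columns need to be added, the ALTER TABLE statement can be assembled from a mapping of column names to types. This is a minimal sketch in plain Python; the helper name `add_columns_sql` is hypothetical, and it assumes the table and column names are trusted identifiers that need no quoting or escaping.

```python
def add_columns_sql(table, columns):
    """Build an ALTER TABLE ... ADD COLUMNS statement from a {name: type} dict."""
    if not columns:
        raise ValueError("no columns given")
    # Assumption: names are plain identifiers, so no backtick quoting is applied.
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return f"ALTER TABLE {table} ADD COLUMNS ({cols})"


stmt = add_columns_sql("dbName.TableName", {"new_col": "STRING"})
print(stmt)
```

The resulting string could then be passed to spark.sql.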
This is the approach that worked for me using Scala.
I have a Delta table named original_table, whose path is:
val path_to_delta = "/mnt/my/path"
This table currently has 1M records with the following schema: pk, field1, field2, field3, field4
I want to add a new field, named newfield, to the existing schema without losing the data already stored in original_table.
So I first defined a simple schema containing just pk and newfield:
case class new_schema(
pk: String,
newfield: String
)
I created a dummy record using that schema:
import spark.implicits._
val dummy_record = Seq(new new_schema("delete_later", null)).toDF
I inserted this new record (the existing 1M records will have newfield populated as null). I also removed this dummy record from the original table:
dummy_record
.write
.format("delta")
.option("mergeSchema", "true")
.mode("append")
.save(path_to_delta)
val original_dt: DeltaTable = DeltaTable.forPath(spark, path_to_delta)
original_dt.delete("pk = 'delete_later'")
Now the original table will have 6 fields: pk, field1, field2, field3, field4 and newfield
Finally I upsert the newfield values in the corresponding 1M records using pk as join key
val df_with_new_field = // You bring new data from somewhere...
original_dt
.as("original")
.merge(
df_with_new_field .as("new"),
"original.pk = new.pk")
.whenMatched
.update( Map(
"newfield" -> col("new.newfield")
))
.execute()
https://www.databricks.com/blog/2019/09/24/diving-into-delta-lake-schema-enforcement-evolution.html
Have you tried using the merge statement?
https://docs.databricks.com/spark/latest/spark-sql/language-manual/merge-into.html

How to Compare columns of two tables using Spark?

I am trying to compare two tables by reading them as DataFrames. For each common column in those tables, I compare the concatenation of a primary key (say order_id) with that column (for example order_date, order_name, order_event).
The Scala Code I am using
val primaryKey = "order_id"
for (i <- commonColumnsList){
val column_name = i
val tempDataFrameForNew = newDataFrame.selectExpr(s"concat($primaryKey,$i) as concatenated")
val tempDataFrameOld = oldDataFrame.selectExpr(s"concat($primaryKey,$i) as concatenated")
//Get those records which are common in both old/new tables
matchCountCalculated = tempDataFrameForNew.intersect(tempDataFrameOld)
//Get those records which aren't common in both old/new tables
nonMatchCountCalculated = tempDataFrameOld.unionAll(tempDataFrameForNew).except(matchCountCalculated)
//Total Null/Non-Null Counts in both old and new tables.
nullsCountInNewDataFrame = newDataFrame.select(s"$i").filter(x => x.isNullAt(0)).count().toInt
nullsCountInOldDataFrame = oldDataFrame.select(s"$i").filter(x => x.isNullAt(0)).count().toInt
nonNullsCountInNewDataFrame = newDFCount - nullsCountInNewDataFrame
nonNullsCountInOldDataFrame = oldDFCount - nullsCountInOldDataFrame
//Put the result for a given column in a Seq variable, later convert it to Dataframe.
tempSeq = tempSeq :+ Row(column_name, matchCountCalculated.toString, nonMatchCountCalculated.toString, (nullsCountInNewDataFrame - nullsCountInOldDataFrame).toString,
(nonNullsCountInNewDataFrame - nonNullsCountInOldDataFrame).toString)
}
// Final Step: Create DataFrame using Seq and some Schema.
spark.createDataFrame(spark.sparkContext.parallelize(tempSeq), schema)
The above code is working fine for a medium set of Data, but as the number of Columns and Records increases in my New & Old Table, the execution time is increasing. Any sort of advice is appreciated.
Thank you in Advance.
You can do the following:
1. Outer join the old and new DataFrames on the primary key
joined_df = df_old.join(df_new, primary_key, "outer")
2. Cache it if you possibly can. This will save you a lot of time
3. Now you can iterate over columns and compare columns using spark functions (.isNull for not matched, == for matched etc)
for (col <- df_new.columns){
val matchCount = joined_df.filter(df_new[col].isNotNull && df_old[col].isNotNull).count()
val nonMatchCount = ...
}
This should be considerably faster, especially if you can cache your DataFrame. If you can't, it might be a good idea to save the joined DataFrame to disk in order to avoid a shuffle each time.
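Step 3 can also be done in a single pass rather than one count() per column, by generating one pair of aggregate expressions per column and evaluating them together. The sketch below only builds the Spark SQL expression strings in plain Python. The helper name `comparison_exprs` and the `old`/`new` aliases are assumptions; it presumes both DataFrames were aliased that way before the join and that the primary key is not in the column list.

```python
def comparison_exprs(columns, old="old", new="new"):
    """Build match / non-match count expressions for each compared column."""
    exprs = []
    for c in columns:
        # <=> is null-safe equality in Spark SQL: NULL <=> NULL is true.
        same = f"{old}.{c} <=> {new}.{c}"
        exprs.append(f"sum(CASE WHEN {same} THEN 1 ELSE 0 END) AS {c}_match")
        exprs.append(f"sum(CASE WHEN {same} THEN 0 ELSE 1 END) AS {c}_non_match")
    return exprs


exprs = comparison_exprs(["order_date", "order_name"])
for e in exprs:
    print(e)
```

The list could then be passed to selectExpr on the joined DataFrame so that all counts come out of one job. Note that after an outer join a key present on only one side yields NULL for the other side's columns, so NULL versus NULL counts as a match here.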

Extract and Replace values from duplicates rows in PySpark Data Frame

I have duplicate rows in a PySpark DataFrame that may contain the same data or have missing values.
The code that I wrote is very slow and does not run in a distributed way.
Does anyone know how to retain a single unique value from duplicate rows in a PySpark DataFrame in a way that runs distributed and with a fast processing time?
I have written complete PySpark code and it works correctly.
But the processing time is really slow and it is not practical to use it on a Spark cluster.
'''
# Columns of duplicate Rows of DF
dup_columns = df.columns
for row_value in df_duplicates.rdd.toLocalIterator():
    print(row_value)
    # Match duplicates using std name and create RDD
    fill_duplicated_rdd = ((df.where((sf.col("stdname") == row_value['stdname']))
                            .where(sf.col("stdaddress") == row_value['stdaddress']))
                           .rdd.map(fill_duplicates))
    # Creating feature names for the same RDD
    fill_duplicated_rdd_col_names = (((df.where((sf.col("stdname") == row_value['stdname']) &
                                                (sf.col("stdaddress") == row_value['stdaddress'])))
                                      .rdd.map(fill_duplicated_columns_extract)).first())
    # Creating DF using the previous RDD
    # This DF stores value of a single set of matching duplicate rows
    df_streamline = fill_duplicated_rdd.toDF(fill_duplicated_rdd_col_names)
    for column in df_streamline.columns:
        try:
            col_value = ([str(value[column]) for value in
                          df_streamline.select(col(column)).distinct().rdd.toLocalIterator() if value[column] != ""])
            if len(col_value) >= 1:
                # non null or empty value of a column store here
                # This value is a no duplicate distinct value
                col_value = col_value[0]
                #print(col_value)
                # The non-duplicate distinct value of the column is stored back to
                # replace any rows in the PySpark DF that were empty.
                df_dedup = (df_dedup
                            .withColumn(column, sf.when((sf.col("stdname") == row_value['stdname'])
                                                        & (sf.col("stdaddress") == row_value['stdaddress'])
                                                        , col_value)
                                        .otherwise(df_dedup[column])))
                #print(col_value)
        except:
            print("None")
'''
There are no error messages, but the code runs very slowly. I want a solution that fills empty values in the PySpark DataFrame rows with the unique non-empty value from their duplicates. It could even fill the rows with the mode of the values.
"""
df_streamline = fill_duplicated_rdd.toDF(fill_duplicated_rdd_col_names)
for column in df_streamline.columns:
    try:
        # distinct() was replaced by isNotNull().limit(1).take(1) to improve the speed of the code and extract values of the row.
        col_value = df_streamline.select(column).where(sf.col(column).isNotNull()).limit(1).take(1)[0][column]
        df_dedup = (df_dedup
                    .withColumn(column, sf.when((sf.col("stdname") == row_value['stdname'])
                                                & (sf.col("stdaddress") == row_value['stdaddress'])
                                                , col_value)
                                .otherwise(df_dedup[column])))
"""
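One way to avoid the driver-side loop altogether is to express the fill as a single grouped aggregation, so that Spark does the work across the cluster. The sketch below builds such a query in plain Python. The function name `dedup_query`, the view name `people`, the value columns `phone` and `email`, and the choice of max() to pick one non-empty value per group are all assumptions, not taken from the original code. max() ignores NULLs and NULLIF turns empty strings into NULL, so each group keeps a non-empty value when one exists.

```python
def dedup_query(view, key_cols, value_cols):
    """Build a GROUP BY query that collapses duplicates to one row per key."""
    keys = ", ".join(key_cols)
    # NULLIF(c, '') maps empty strings to NULL; max() then skips the NULLs.
    picks = ", ".join(f"max(NULLIF({c}, '')) AS {c}" for c in value_cols)
    return f"SELECT {keys}, {picks} FROM {view} GROUP BY {keys}"


# Hypothetical view and value column names.
query = dedup_query("people", ["stdname", "stdaddress"], ["phone", "email"])
print(query)
```

The query could be run with spark.sql after registering the DataFrame as a temporary view under the same name. Note that max() picks the largest non-empty value, not the mode, so it only matches the requirement when duplicates agree on their non-empty values.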

How to delete data from Hive external table for Non-Partition column?

I have created an external table in Hive partitioned by client and month.
The requirement asks to delete the data for ID=201 from that table but it's not partitioned by the ID column.
I have tried to do this with INSERT OVERWRITE, but it's not working.
We are using Spark 2.2.0.
How can I solve this problem?
val sqlDF = spark.sql("select * from db.table")
val newSqlDF1 = sqlDF.filter(!col("ID").isin("201") && col("month").isin("062016"))
val columns = newSqlDF1.schema.fieldNames.mkString(",")
newSqlDF1.createOrReplaceTempView("myTempTable")
spark.sql(s"INSERT OVERWRITE TABLE db.table PARTITION(client, month) select ${columns} from myTempTable")