Delta: Insert based on condition (WhenMatchedInsert) - pyspark

I am looking for a smarter way to insert into a Delta table based on a condition, i.e. an "insert when matched" where I don't have to fake-skip the update part of the merge with update_condition = "true = false".
I wasn't able to find anything. I assume the options of sdf.write.format("delta").mode("append").options(***) might offer a solution, but I couldn't find any documentation on which options are supported.
I am using Databricks on Azure with runtime 11.3 LTS.
Let's say I have the following DataFrames:
sdf1 = spark.createDataFrame(
    [
        (1, "foo", "dd", "1", "99"),
        (2, "bar", "2sfs", "1", "99"),
    ],
    ["id", "col", "col2", "s_date", "e_date"],
)
sdf2 = spark.createDataFrame(
    [
        (1, "foo", "dd", "1", "99"),
        (2, "bar", "2sfs", "33", "99"),
        (3, "bar", "dwdw", "3", "5"),
    ],
    ["id", "col", "col2", "s_date", "e_date"],
)
My expectation is to add only those rows from the second DataFrame sdf2 that satisfy a condition, let's say target.id <> source.id.
I mocked up an "insert when matched" logic myself:
from delta.tables import DeltaTable
insert_condition = "target.id <> source.id"
merge_condition = f"not ({insert_condition})"
#merge_condition = "target.id = source.id"
update_condition = "true = false"
insert_condition = None
delta_path = "/mnt/raw/testNiko/matchedInsert"
# write sdf1
sdf1.write.format("delta").mode("overwrite").option("overwriteSchema", "True").save(delta_path)
#Insert when matched SDF2
delta_table = DeltaTable.forPath(spark, delta_path)
delta_merge_builder = delta_table.alias("target").merge(sdf2.alias("source"), merge_condition)
delta_merge_builder = delta_merge_builder.whenMatchedUpdateAll(update_condition)
delta_merge_builder = delta_merge_builder.whenNotMatchedInsertAll(insert_condition)
delta_merge_builder.execute()
sdf_merge = spark.read.format("delta").load(delta_path)
display(sdf_merge)
This gives the expected result, but I am looking for a smarter approach to "insert when matched" that doesn't require faking away the update part of the merge with update_condition = "true = false".
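For reference, the kind of leaner variant I was hoping for (a sketch only, assuming the Delta merge builder accepts a merge with just a whenNotMatchedInsertAll clause, so that matched rows are simply left untouched):
from delta.tables import DeltaTable

insert_condition = "target.id <> source.id"
merge_condition = f"not ({insert_condition})"

delta_table = DeltaTable.forPath(spark, delta_path)
(
    delta_table.alias("target")
    .merge(sdf2.alias("source"), merge_condition)
    .whenNotMatchedInsertAll()  # no whenMatched clause at all, so matched rows are ignored
    .execute()
)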

Related

Is it possible to combine .agg(dictionary) and renaming the resulting column with .alias() in Pyspark?

I have a pyspark dataframe 'pyspark_df'. I want to group the data and aggregate it with a general aggregation function given as a string, e.g. one of 'avg', 'count', 'max', 'mean', 'min', or 'sum'.
I need the resulting aggregated column to be named 'aggregated' regardless of the aggregation type.
I have been able to do this as follows.
seriesname = 'Group'
dateVar = 'as_of_date'
aggSeriesName = 'Balance'
aggType = 'sum'
name_to_be_Changed = aggType + '(' + aggSeriesName + ')'
group_sorted = pyspark_df.groupby(dateVar,seriesname).agg({aggSeriesName: aggType}).withColumnRenamed(name_to_be_Changed,'aggregated').toPandas()
However, is there a way to do this via .alias()? I have seen it used as follows:
group_sorted = pyspark_df.groupby(dateVar,seriesname).agg(sum(aggSeriesName).alias('aggregated')).toPandas()
How do I use alias in a way that I don't have to type out the 'sum(aggSeriesName)' portion? Hopefully I am being clear.
I'm not sure why you are asking this and therefore can't offer a proper alternative solution. As far as I know it is not possible to combine .agg(dictionary) with renaming the resulting column via .alias; withColumnRenamed is the way to go in this case.
What you can also do is apply a selectExpr:
vertices = sqlContext.createDataFrame([
    ("a", "Alice", 34),
    ("b", "Bob", 36),
    ("c", "Charlie", 30),
    ("d", "David", 29),
    ("e", "Esther", 32),
    ("f", "Fanny", 36),
    ("g", "Gabby", 60)], ["id", "name", "age"])
aggSeriesName = 'age'
aggType = 'sum'
targetName = 'aggregated'
bla = vertices.selectExpr('{}({}) as {}'.format(aggType, aggSeriesName, targetName))
bla.show()
Output:
+----------+
|aggregated|
+----------+
| 257|
+----------+
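Another option, not part of the original answer (and assuming the names pyspark_df, dateVar, seriesname, aggSeriesName and aggType from the question): look up the aggregate function by its string name on pyspark.sql.functions and attach the alias directly, which avoids spelling out sum(aggSeriesName):
from pyspark.sql import functions as F

# resolve e.g. 'sum' to F.sum, then alias the resulting column
agg_col = getattr(F, aggType)(aggSeriesName).alias('aggregated')
group_sorted = pyspark_df.groupby(dateVar, seriesname).agg(agg_col).toPandas()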

Pyspark isin with column in argument doesn't exclude rows

I need to exclude rows which don't have the value True in the column status.
In my opinion this filter(isin() == False) construct should solve my problem, but it doesn't.
df = sqlContext.createDataFrame([( "A", "True"), ( "A", "False"), ( "B", "False"), ("C", "True")], ( "name", "status"))
df.registerTempTable("df")
df_t = df[df.status == "True"]
from pyspark.sql import functions as sf
df_f = df.filter(df.status.isin(df_t.name)== False)
I expect row:
B | False
any help is greatly appreciated!
First, I think in your last statement, you meant to use df.name instead of df.status.
df_f = df.filter(df.status.isin(df_t.name)== False)
Second, even if you use df.name, it still won't work.
That's because it mixes columns (Column type) from two different DataFrames, i.e. df_t and df, in the final statement. I don't think this works in pyspark.
However, you can achieve the same effect using other methods.
If I understand correctly, you want to select 'A' and 'C' first through the 'status' column, then select the rows excluding ['A', 'C']. The point is to extend the selection to the second row of 'A', which can be achieved with a Window. See below:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df = sqlContext.createDataFrame([( "A", "True"), ( "A", "False"), ( "B", "False"), ("C", "True")], ( "name", "status"))
df.registerTempTable("df")
# create an auxiliary column satisfying the condition
df = df.withColumn("flag", F.when(df['status']=="True", 1).otherwise(0))
df.show()
# extend the selection to other rows with the same 'name'
df = df.withColumn('flag', F.max(df['flag']).over(Window.partitionBy('name')))
df.show()
#filter is now easy
df_f = df.filter(df.flag==0)
df_f.show()
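An alternative that is not part of the original answer, but gives the same result on this data: collect the names whose status is "True" and remove those rows with a left anti join.
# names that have at least one row with status == "True"
names_true = df.filter(df.status == "True").select("name").distinct()

# keep only the rows whose name is NOT in names_true
df_f = df.join(names_true, on="name", how="left_anti")
df_f.show()  # expected: the single row (B, False)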

Scala code to label rows of data frame based on another data frame

I just started learning Scala for data analytics and I encountered a problem when trying to label the rows of one data frame based on another data frame.
Suppose I have a df1 with columns "date", "id", "value", and "label", where "label" is initially set to "F" for all rows. Then I have a df2, a smaller data set with columns "date", "id", "value". I want to change the label in df1 from "F" to "T" if that row appears in df2, i.e. if some row in df2 has the same combination of ("date", "id", "value") as that row in df1.
I tried df.filter and df.join, but it seems that neither solves my problem.
I think this is what you are looking for.
val spark = SparkSession.builder().master("local").appName("test").getOrCreate()
import spark.implicits._
//create Dataframe 1
val df1 = spark.sparkContext.parallelize(Seq(
("2016-01-01", 1, "abcd", "F"),
("2016-01-01", 2, "efg", "F"),
("2016-01-01", 3, "hij", "F"),
("2016-01-01", 4, "klm", "F")
)).toDF("date","id","value", "label")
//Create Dataframe 2
val df2 = spark.sparkContext.parallelize(Seq(
("2016-01-01", 1, "abcd"),
("2016-01-01", 3, "hij")
)).toDF("date1","id1","value1")
val condition = $"date" === $"date1" && $"id" === $"id1" && $"value" === $"value1"
//Join two dataframe with above condition
val result = df1.join(df2, condition, "left")
// check whether both sides contain the same values and drop the extra columns
val finalResult = result.withColumn("label", condition)
.drop("date1","id1","value1")
// update the label column from true/false to "T"/"F"
finalResult.withColumn("label", when(col("label") === true, "T").otherwise("F")).show
The basic idea is to join the two and then calculate the result. Something like this:
val df2Mod = df2.withColumn("tmp", lit(true))
val joined = df1.join(df2Mod, df1("date") <=> df2Mod("date") && df1("id") <=> df2Mod("id") && df1("value") <=> df2Mod("value"), "left_outer")
joined.withColumn("label", when(joined("tmp").isNull, "F").otherwise("T"))
The idea is that we add the "tmp" column and then do a left_outer join. "tmp" would be null for everything not in df2 and therefore we can use that to calculate the label.
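A rough PySpark sketch of the same idea (not from the original answers; it uses a plain equi-join on the three key columns instead of the null-safe <=> operator, and assumes df1 and df2 are the DataFrames described in the question):
from pyspark.sql import functions as F

# mark every row of df2, then left-join it back onto df1 on the key columns
df2_mod = df2.withColumn("tmp", F.lit(True))
joined = df1.join(df2_mod, ["date", "id", "value"], "left_outer")

# rows without a match keep tmp = null and are labelled "F"
labeled = joined.withColumn("label", F.when(F.col("tmp").isNull(), "F").otherwise("T")).drop("tmp")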

spark apply function to columns in parallel

Spark will process the data in parallel, but not the operations. In my DAG I want to call a function per column, as in "Spark processing columns in parallel"; the values for each column can be calculated independently from the other columns. Is there any way to achieve such parallelism via the Spark SQL API? Utilizing window functions (see "Spark dynamic DAG is a lot slower and different from hard coded DAG") helped to optimize the DAG by a lot, but it only executes in a serial fashion.
An example with a little bit more information can be found at https://github.com/geoHeil/sparkContrastCoding
The minimum example is below:
val df = Seq(
(0, "A", "B", "C", "D"),
(1, "A", "B", "C", "D"),
(0, "d", "a", "jkl", "d"),
(0, "d", "g", "C", "D"),
(1, "A", "d", "t", "k"),
(1, "d", "c", "C", "D"),
(1, "c", "B", "C", "D")
).toDF("TARGET", "col1", "col2", "col3TooMany", "col4")
val inputToDrop = Seq("col3TooMany")
val inputToBias = Seq("col1", "col2")
val targetCounts = df.filter(df("TARGET") === 1).groupBy("TARGET").agg(count("TARGET").as("cnt_foo_eq_1"))
val newDF = df.toDF.join(broadcast(targetCounts), Seq("TARGET"), "left")
newDF.cache
def handleBias(df: DataFrame, colName: String, target: String = "TARGET") = {
val w1 = Window.partitionBy(colName)
val w2 = Window.partitionBy(colName, target)
df.withColumn("cnt_group", count("*").over(w2))
.withColumn("pre2_" + colName, mean(target).over(w1))
.withColumn("pre_" + colName, coalesce(min(col("cnt_group") / col("cnt_foo_eq_1")).over(w1), lit(0D)))
.drop("cnt_group")
}
val joinUDF = udf((newColumn: String, newValue: String, codingVariant: Int, results: Map[String, Map[String, Seq[Double]]]) => {
results.get(newColumn) match {
case Some(tt) => {
val nestedArray = tt.getOrElse(newValue, Seq(0.0))
if (codingVariant == 0) {
nestedArray.head
} else {
nestedArray.last
}
}
case None => throw new Exception("Column not contained in initial data frame")
}
})
Now I want to apply my handleBias function to all the columns, unfortunately, this is not executed in parallel.
val res = (inputToDrop ++ inputToBias).toSet.foldLeft(newDF) {
(currentDF, colName) =>
{
logger.info("using col " + colName)
handleBias(currentDF, colName)
}
}
.drop("cnt_foo_eq_1")
val combined = ((inputToDrop ++ inputToBias).toSet).foldLeft(res) {
(currentDF, colName) =>
{
currentDF
.withColumn("combined_" + colName, map(col(colName), array(col("pre_" + colName), col("pre2_" + colName))))
}
}
val columnsToUse = combined
  .select(combined.columns
    .filter(_.startsWith("combined_"))
    .map(combined(_)): _*)
val newNames = columnsToUse.columns.map(_.split("combined_").last)
val renamed = columnsToUse.toDF(newNames: _*)
val cols = renamed.columns
val localData = renamed.collect
val columnsMap = cols.map { colName =>
colName -> localData.flatMap(_.getAs[Map[String, Seq[Double]]](colName)).toMap
}.toMap
"values for each column could be calculated independently from other columns"
While that is true, it doesn't really help your case. You can generate a number of independent DataFrames, each with its own additions, but that doesn't mean you can automatically combine them into a single execution plan.
Each application of handleBias shuffles your data twice, and the output DataFrames don't have the same data distribution as the parent DataFrame. This is why, when you fold over the list of columns, each addition has to be performed separately.
Theoretically you could design a pipeline which can be expressed (with pseudocode) like this:
add unique id:
df_with_id = df.withColumn("id", unique_id())
compute each df independently and convert to long format:
dfs = for (c in columns)
    yield handle_bias(df, c).withColumn(
        "pres", explode([(pre_name, pre_value), (pre2_name, pre2_value)])
    )
union all partial results:
combined = dfs.reduce(union)
pivot to convert from long to wide format:
combined.groupBy("id").pivot("pres._1").agg(first("pres._2"))
but I doubt it is worth all the fuss. The process you use is extremely heavy as it is and requires significant network and disk IO.
If the total number of levels (sum(count(distinct x)) for x in columns) is relatively low, you can try to compute all statistics in a single pass using, for example, aggregateByKey with a Map[Tuple2[_, _], StatCounter]; otherwise consider downsampling to a level where you can compute the statistics locally.
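For illustration, the generic add-id / long-format / pivot pattern sketched in the pseudocode above could look roughly as follows in PySpark. This is a sketch only: handle_bias is assumed to be a Python counterpart of handleBias that keeps the id column plus pre_<col> and pre2_<col>, and the column list is hypothetical.
from functools import reduce
from pyspark.sql import functions as F

df_with_id = df.withColumn("id", F.monotonically_increasing_id())

partials = []
for c in ["col1", "col2"]:                  # columns to process independently
    stats = handle_bias(df_with_id, c)      # assumed to keep "id", "pre_" + c, "pre2_" + c
    # melt the two statistic columns into (stat, value) pairs
    long_form = stats.select(
        "id",
        F.explode(
            F.create_map(
                F.lit("pre_" + c), F.col("pre_" + c),
                F.lit("pre2_" + c), F.col("pre2_" + c),
            )
        ).alias("stat", "value"),
    )
    partials.append(long_form)

# union all partial results, then pivot from long back to wide format
combined = reduce(lambda a, b: a.unionByName(b), partials)
wide = combined.groupBy("id").pivot("stat").agg(F.first("value"))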

Spark Scala: Issue Substituting Filter Expression In DataFrame

I have a dataframe which holds the join of two tables.
I want to compare each field of table1 to the corresponding field of table2 (the schemas are the same).
Columns in Table A = colA1, colB1, colC1 , ...
Columns in Table B = colA2, colB2, colC2, ...
So, I need to filter out the data which satisfies the condition
(colA1 = colA2) AND (colB1 = colB2) AND (colC1 = colC2) and so on.
Since my table has a lot of fields, I tried to build the expression programmatically:
val filterCols = Seq("colA","colB","colC")
val sq = '"'
val exp = filterCols.map({ x => s"(join_df1($sq${x}1$sq) === join_df1($sq${x}2$sq))" }).mkString(" && ")
Resulting expression: res28: String = (join_df1("colA1") === join_df1("colA2")) && (join_df1("colB1") === join_df1("colB2")) && (join_df1("colC1") === join_df1("colC2"))
Now when I try to substitute it into the dataframe filter, it throws an error.
join_df1.filter($exp)
I am not sure whether I am doing this right. I need to find a way to substitute my expression and filter out the matching rows.
Any help is appreciated.
Thanks in advance
This is not valid SQL. Try:
val df = Seq(
("a", "a", "b", "b", "c", "c"),
("a", "A", "b", "B", "c", "C")).toDF("a1", "a2", "b1", "b2", "c1", "c2")
val filterCols = Seq("A", "B", "C")
val exp = filterCols.map(x => s"${x}1 = ${x}2").mkString(" AND ")
df.where(exp)
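A rough PySpark counterpart of the same idea, in case it helps (a sketch assuming the same a1/a2, b1/b2, c1/c2 column names as in the answer's example):
# build a plain SQL string such as "a1 = a2 AND b1 = b2 AND c1 = c2" and pass it to where()
filter_cols = ["a", "b", "c"]
exp = " AND ".join(f"{c}1 = {c}2" for c in filter_cols)
filtered = df.where(exp)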