Duplicate data from streams on merge into Delta tables - PySpark

I have a source table with, say, the following data:
+----------------+---+--------+-----------------+---------+
|registrationDate| id|custName| email|eventName|
+----------------+---+--------+-----------------+---------+
| 17-02-2023| 2| Person2|person2#gmail.com| INSERT|
| 17-02-2023| 1| Person1|person1#gmail.com| INSERT|
| 17-02-2023| 5| Person5|person5#gmail.com| INSERT|
| 17-02-2023| 4| Person4|person4#gmail.com| INSERT|
| 17-02-2023| 3| Person3|person3#gmail.com| INSERT|
+----------------+---+--------+-----------------+---------+
The table above is stored in S3 in Delta format.
Now I'm creating a dataframe from Kinesis streams and trying to merge it into my Delta table. Every operation works fine (upserts, deletes, everything), but let's say the dataframe generated from the stream looks like this:
+---------+---+---------------+----------------+----------------+
|eventName|id |custName |email |registrationDate|
+---------+---+---------------+----------------+----------------+
|REMOVE |1 |null |null |null |
|MODIFY |2 |ModPerson2 |modemail#mod.com|09-02-2023 |
|MODIFY |3 |3modifiedperson|modp#mod.com |09-02-2023 |
|INSERT |100|Person100 |p100#p.com |09-02-2023 |
|INSERT |200|Person200 |p200#p.com |09-02-2023 |
|REMOVE |5 |null |null |null |
|REMOVE |200|null |null |null |
+---------+---+---------------+----------------+----------------+
As the dataframe above shows, the same batch both inserts a record with id 200 and deletes the record with that id. After the merge, the records with id 1 and 5 are deleted, but the one with id 200 is not.
The merge does not handle this kind of duplication within a batch correctly. How do I counter this?
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, 's3://path-to-my-s3')
(deltaTable.alias("first_df").merge(
        updated_data.alias("append_df"),
        "first_df.id = append_df.id")
    .whenMatchedDelete("append_df.eventName='REMOVE'")
    .whenMatchedUpdateAll("first_df.id = append_df.id")
    .whenNotMatchedInsertAll()
    .execute()
)
Resulting Delta table after the merge (the rows with id 200 still remain):
+----------------+---+---------------+-----------------+---------+
|registrationDate| id| custName| email|eventName|
+----------------+---+---------------+-----------------+---------+
| 09-02-2023| 2| ModPerson2| modemail#mod.com| MODIFY|
| 17-02-2023| 4| Person4|person4#gmail.com| INSERT|
| 09-02-2023|100| Person100| p100#p.com| INSERT|
| 09-02-2023|200| Person200| p200#p.com| INSERT|
| null|200| null| null| REMOVE|
| 09-02-2023| 3|3modifiedperson| modp#mod.com| MODIFY|
+----------------+---+---------------+-----------------+---------+
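One way to reason about the result above: id 200 does not exist in the target, so both the INSERT row and the REMOVE row for 200 are unmatched and both get inserted. A common remedy is to collapse the batch to one final event per id before merging, and never insert a REMOVE row. The following is only a minimal plain-Python sketch of that logic, not Spark or Delta code. The function names are hypothetical, and it assumes the list order of the batch reflects arrival order, which the question does not state.

```python
# Hypothetical sketch in plain Python; no Spark or Delta required.
# Assumption: the order of `events` is the order in which they arrived.

def latest_event_per_id(events):
    """Keep only the last event seen for each id."""
    latest = {}
    for event in events:
        latest[event["id"]] = event  # later events overwrite earlier ones
    return list(latest.values())


def apply_batch(target, events):
    """Apply a de-duplicated batch to `target`, a dict keyed by id."""
    result = dict(target)
    for event in latest_event_per_id(events):
        if event["eventName"] == "REMOVE":
            # Delete if present; a REMOVE for an unknown id is simply dropped
            result.pop(event["id"], None)
        else:
            # INSERT or MODIFY: upsert the row
            result[event["id"]] = event
    return result


target = {i: {"id": i, "eventName": "INSERT"} for i in (1, 2, 3, 4, 5)}
batch = [
    {"id": 1, "eventName": "REMOVE"},
    {"id": 2, "eventName": "MODIFY"},
    {"id": 3, "eventName": "MODIFY"},
    {"id": 100, "eventName": "INSERT"},
    {"id": 200, "eventName": "INSERT"},
    {"id": 5, "eventName": "REMOVE"},
    {"id": 200, "eventName": "REMOVE"},
]
merged = apply_batch(target, batch)
print(sorted(merged))  # ids left after the batch
```

In the PySpark code above, the equivalent would be to reduce updated_data to one row per id before calling merge, and to pass a condition such as "append_df.eventName != 'REMOVE'" to whenNotMatchedInsertAll so that an unmatched REMOVE row is never inserted. Deciding which row is the latest needs an explicit ordering column (for example a sequence number or arrival timestamp), which the batch shown does not include.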

Related

How to perform one to many mapping on spark scala dataframe column using flatmaps

I am specifically looking for a flatMap solution for mocking a data column in a Spark Scala dataframe by duplicating rows, that is, a one-to-many mapping inside flatMap.
My input data looks something like this:
+---+----+-----+
|id |name|marks|
+---+----+-----+
|1 |ABCD|12 |
|2 |CDEF|12 |
|3 |FGHI|14 |
+---+----+-----+
My expected output after a 1-to-3 mapping of the id column is something like this:
+---+----+-----+
|id |name|marks|
+---+----+-----+
|1 |ABCD|12 |
|2 |CDEF|12 |
|3 |FGHI|14 |
|2 |null|null |
|3 |null|null |
|1 |null|null |
|2 |null|null |
|1 |null|null |
|3 |null|null |
+---+----+-----+
Please let me know if any part of the requirement needs clarification.
Thanks in advance!!!
It looks like you want to generate mock data that reuses the values in the id column.
You can select the id column, generate random values for the other columns, and union the result back onto your original dataset.
For example:
val data = Seq((1,"asd",15), (2,"asd",20), (3,"test",99)).toDF("id","testName","marks")
+---+--------+-----+
| id|testName|marks|
+---+--------+-----+
| 1| asd| 15|
| 2| asd| 20|
| 3| test| 99|
+---+--------+-----+
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val newRecords = data.select("id")
  .withColumn("testName", concat(lit("name_"), (rand() * 10).cast(IntegerType).cast(StringType)))
  .withColumn("marks", (rand() * 100).cast(IntegerType))
+---+--------+-----+
| id|testName|marks|
+---+--------+-----+
| 1| name_2| 35|
| 2| name_9| 20|
| 3| name_3| 7|
+---+--------+-----+
val result = data.unionAll(newRecords)
+---+--------+-----+
| id|testName|marks|
+---+--------+-----+
| 1| asd| 15|
| 2| asd| 20|
| 3| test| 99|
| 1| name_2| 35|
| 2| name_9| 20|
| 3| name_3| 7|
+---+--------+-----+
You can run the randomisation step in a loop and union all of the generated dataframes to get more copies per id.
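The question asked for a flatMap-style expansion, where each input row yields several output rows. As a rough illustration of that per-row logic only (plain Python, not the Spark API), the sketch below maps each row to the original plus two blank copies that keep the id, matching the 1-to-3 mapping in the question. The helper name expand_row and the copies parameter are made up for this example.

```python
# Hypothetical sketch in plain Python of a flatMap-style 1-to-3 expansion.

def expand_row(row, copies=3):
    """Return the original row plus (copies - 1) rows that keep only the id."""
    blanks = [
        {"id": row["id"], "name": None, "marks": None}
        for _ in range(copies - 1)
    ]
    return [dict(row)] + blanks


rows = [
    {"id": 1, "name": "ABCD", "marks": 12},
    {"id": 2, "name": "CDEF", "marks": 12},
    {"id": 3, "name": "FGHI", "marks": 14},
]

# "Flat map": expand every row, then flatten the per-row lists into one list
expanded = [out for row in rows for out in expand_row(row)]

for out in expanded:
    print(out)
```

The rows come out grouped by id rather than with the originals first, which should not matter for an unordered dataframe.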

How to rename the column produced by the count() function in Scala

I have the below df:
+-------+----+--------+
|student|vars|observed|
+-------+----+--------+
|      1| ABC|      19|
|      1| ABC|       1|
|      2| CDB|       1|
|      1| ABC|       8|
|      3| XYZ|       3|
|      1| ABC|     389|
|      2| CDB|     946|
|      1| ABC|     342|
+-------+----+--------+
I want to add a new frequency column, grouping by the two columns "student" and "vars", in Scala.
val frequency = df.groupBy($"student", $"vars").count()
This code generates a "count" column with the frequencies, but it drops the "observed" column from the df.
I would like to create a new df as follows, without losing the "observed" column:
+-------+----+--------+-----------+
|student|vars|observed|total_count|
+-------+----+--------+-----------+
|      1| ABC|       9|         22|
|      1| ABC|       1|         22|
|      2| CDB|       1|          7|
|      1| ABC|       2|         22|
|      3| XYZ|       3|          3|
|      1| ABC|       8|         22|
|      2| CDB|       6|          7|
|      1| ABC|       2|         22|
+-------+----+--------+-----------+
You cannot do this directly, but there are a couple of ways:
You can join the original df with the count df.
You can collect the observed column during the aggregation and explode it again.
With explode:
val frequency = df.groupBy("student", "vars")
  .agg(collect_list("observed").as("observed_list"), count("*").as("total_count"))
  .select($"student", $"vars", explode($"observed_list").alias("observed"), $"total_count")
scala> frequency.show(false)
+-------+----+--------+-----------+
|student|vars|observed|total_count|
+-------+----+--------+-----------+
|3 |XYZ |3 |1 |
|2 |CDB |1 |2 |
|2 |CDB |946 |2 |
|1 |ABC |389 |5 |
|1 |ABC |342 |5 |
|1 |ABC |19 |5 |
|1 |ABC |1 |5 |
|1 |ABC |8 |5 |
+-------+----+--------+-----------+
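We can use window functions as well:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count}
val windowSpec = Window.partitionBy("student", "vars")
val frequency = df.withColumn("total_count", count(col("student")).over(windowSpec))
frequency.show(false)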
+-------+----+--------+-----------+
|student|vars|observed|total_count|
+-------+----+--------+-----------+
|3 |XYZ |3 |1 |
|2 |CDB |1 |2 |
|2 |CDB |946 |2 |
|1 |ABC |389 |5 |
|1 |ABC |342 |5 |
|1 |ABC |19 |5 |
|1 |ABC |1 |5 |
|1 |ABC |8 |5 |
+-------+----+--------+-----------+
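Both approaches above do the same thing conceptually: compute a count per (student, vars) group, then attach that count to every original row instead of collapsing the group. A minimal plain-Python sketch of that two-pass idea, using the rows from the question (this illustrates the logic only and is not Spark code):

```python
from collections import Counter

rows = [
    (1, "ABC", 19), (1, "ABC", 1), (2, "CDB", 1), (1, "ABC", 8),
    (3, "XYZ", 3), (1, "ABC", 389), (2, "CDB", 946), (1, "ABC", 342),
]

# Pass 1: count the rows in each (student, vars) group
group_sizes = Counter((student, vars_) for student, vars_, _ in rows)

# Pass 2: attach the group's count to every original row, keeping "observed"
with_counts = [
    (student, vars_, observed, group_sizes[(student, vars_)])
    for student, vars_, observed in rows
]

for row in with_counts:
    print(row)
```

The window version expresses both passes in one expression, which is why it keeps all input columns without a join or an explode.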

How to get the number of rows produced by a join in Spark

Consider these two Dataframes:
+---+
|id |
+---+
|1 |
|2 |
|3 |
+---+
+---+-----+
|idz|word |
+---+-----+
|1 |bat |
|1 |mouse|
|2 |horse|
+---+-----+
I am doing a Left join on ID=IDZ:
val r = df1.join(df2, (df1("id") === df2("idz")), "left_outer").
withColumn("ID_EMPLOYE_VENDEUR", when(col("word") =!= ("null"), col("word")).otherwise(null)).drop("word")
r.show(false)
+---+----+------------------+
|id |idz |ID_EMPLOYE_VENDEUR|
+---+----+------------------+
|1 |1 |mouse |
|1 |1 |bat |
|2 |2 |horse |
|3 |null|null |
+---+----+------------------+
But what if I only want to keep the rows whose id matches exactly one idz? Otherwise, I would like ID_EMPLOYE_VENDEUR to be null. The desired output is:
+---+----+------------------+
|id |idz |ID_EMPLOYE_VENDEUR|
+---+----+------------------+
|1 |1 |null | -- because the join produced two rows for this id
|2 |2 |horse |
|3 |null|null |
+---+----+------------------+
I should point out that I am working with a large dataframe, so the solution should not be too expensive in time.
Thank you
As you mentioned, your data is large, so a groupBy followed by a join is not a good option. You can use a window function with over instead, as below:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val windowSpec = Window.partitionBy("idz")
// Count rows per idz in df2 (the dataframe holding idz and word) and null out word for repeated idz
val newDF = df2.withColumn("count", count("idz").over(windowSpec))
  .dropDuplicates("idz")
  .withColumn("word", when(col("count") >= 2, lit(null)).otherwise(col("word")))
  .drop("count")
val r = df1.join(newDF, df1("id") === newDF("idz"), "left_outer")
  .withColumn("ID_EMPLOYE_VENDEUR", when(col("word") =!= "null", col("word")).otherwise(null))
  .drop("word")
r.show()
+---+----+------------------+
| id| idz|ID_EMPLOYE_VENDEUR|
+---+----+------------------+
| 1| 1| null|
| 3|null| null|
| 2| 2| horse|
+---+----+------------------+
With a groupBy and a join, you can easily detect that more than one idz from df2 matched a single id from df1:
r.join(
    r.groupBy("id").count().as("g"),
    $"g.id" === r("id")
  )
  .withColumn(
    "ID_EMPLOYE_VENDEUR",
    expr("if(count != 1, null, ID_EMPLOYE_VENDEUR)")
  )
  .drop($"g.id").drop("count")
  .distinct()
  .show()
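Note: neither the groupBy nor the join should trigger an additional exchange step (a shuffle across the network), because the dataframe r is already partitioned on id as the result of a join on id. This assumes the original join was a shuffle-based join rather than a broadcast join; you can check the physical plan with explain().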
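Both answers rely on the same rule: count how many idz rows match each id, and keep the word only when that count is exactly one. The plain-Python sketch below spells out that rule on the data from the question. It illustrates the logic only and is not Spark code; the variable names are made up for this example.

```python
from collections import defaultdict

df1_ids = [1, 2, 3]
df2_rows = [(1, "bat"), (1, "mouse"), (2, "horse")]

# Group the words of df2 by idz so each id's matches can be counted
words_by_idz = defaultdict(list)
for idz, word in df2_rows:
    words_by_idz[idz].append(word)

result = []
for id_ in df1_ids:
    words = words_by_idz.get(id_, [])
    idz = id_ if words else None                     # left join: None when nothing matched
    vendeur = words[0] if len(words) == 1 else None  # keep the word only for a unique match
    result.append((id_, idz, vendeur))

for row in result:
    print(row)
```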

Duplicating the record count in apache spark

This is an extension of this question, Apache Spark group by combining types and sub types.
val sales = Seq(
("Warsaw", 2016, "facebook","share",100),
("Warsaw", 2017, "facebook","like",200),
("Boston", 2015,"twitter","share",50),
("Boston", 2016,"facebook","share",150),
("Toronto", 2017,"twitter","like",50)
).toDF("city", "year","media","action","amount")
That solution works, but the expected output should now count some rows under several categories, depending on a condition.
So, the output should look like this:
+-------+--------+-----+
| Boston|facebook| 1|
| Boston| share1 | 2|
| Boston| share2 | 2|
| Boston| twitter| 1|
|Toronto| twitter| 1|
|Toronto| like | 1|
| Warsaw|facebook| 2|
| Warsaw|share1 | 1|
| Warsaw|share2 | 1|
| Warsaw|like | 1|
+-------+--------+-----+
Here, if the action is share, it needs to be counted under both share1 and share2. When I count it programmatically, I use a case statement: when the action is share, share1 = share1 + 1 and share2 = share2 + 1.
But how can I do this in Scala, PySpark, or SQL?
A simple filter and a few unions should give you the desired output:
val media = sales.groupBy("city", "media").count()
val action = sales.groupBy("city", "action").count().select($"city", $"action".as("media"), $"count")
val share = action.filter($"media" === "share")
media.union(action.filter($"media" =!= "share"))
.union(share.withColumn("media", lit("share1")))
.union(share.withColumn("media", lit("share2")))
.show(false)
which should give you
+-------+--------+-----+
|city |media |count|
+-------+--------+-----+
|Boston |facebook|1 |
|Boston |twitter |1 |
|Toronto|twitter |1 |
|Warsaw |facebook|2 |
|Warsaw |like |1 |
|Toronto|like |1 |
|Boston |share1 |2 |
|Warsaw |share1 |1 |
|Boston |share2 |2 |
|Warsaw |share2 |1 |
+-------+--------+-----+
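The counting rule described in the question (a share action increments both share1 and share2, every other action increments its own label) can also be written as a single pass over the rows. The plain-Python sketch below uses the sales data from the question to show that rule. It is an illustration of the logic rather than Spark code.

```python
from collections import Counter

sales = [
    ("Warsaw", 2016, "facebook", "share", 100),
    ("Warsaw", 2017, "facebook", "like", 200),
    ("Boston", 2015, "twitter", "share", 50),
    ("Boston", 2016, "facebook", "share", 150),
    ("Toronto", 2017, "twitter", "like", 50),
]

counts = Counter()
for city, _year, media, action, _amount in sales:
    # Every row counts once under its media
    counts[(city, media)] += 1
    # A "share" counts under both share1 and share2; other actions count under themselves
    labels = ("share1", "share2") if action == "share" else (action,)
    for label in labels:
        counts[(city, label)] += 1

for (city, label), n in sorted(counts.items()):
    print(city, label, n)
```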

Apply an aggregate result to all ungrouped rows of a dataframe in Spark

Assume there is a dataframe as follows:
machine_id | value
1| 5
1| 3
2| 6
2| 10
2| 14
I want to produce a final dataframe like this
machine_id | value | diff
1| 5| 1
1| 3| -1
2| 6| -4
2| 10| 0
2| 14| 4
The values in the "diff" column are computed as value - groupBy($"machine_id").avg($"value"), that is, each value minus the average for its machine_id.
Note that the average for machine_id == 1 is (5+3)/2 = 4 and for machine_id == 2 it is (6+10+14)/3 = 10.
What is the best way to produce such a final dataframe in Apache Spark?
You can use a window function to get the desired output.
Given the dataframe as
+----------+-----+
|machine_id|value|
+----------+-----+
|1 |5 |
|1 |3 |
|2 |6 |
|2 |10 |
|2 |14 |
+----------+-----+
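You can use the following code
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col}
df.withColumn("diff", avg("value").over(Window.partitionBy("machine_id")))
  .withColumn("diff", col("value") - col("diff"))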
to get the final result as
+----------+-----+----+
|machine_id|value|diff|
+----------+-----+----+
|1 |5 |1.0 |
|1 |3 |-1.0|
|2 |6 |-4.0|
|2 |10 |0.0 |
|2 |14 |4.0 |
+----------+-----+----+
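To make the arithmetic behind the window expression concrete, here is a small plain-Python sketch that computes the per-machine average and then subtracts it from each value, using the rows from the answer. It only illustrates the calculation and does not use Spark.

```python
from collections import defaultdict

rows = [(1, 5), (1, 3), (2, 6), (2, 10), (2, 14)]

# Collect the values of each machine_id
values_by_machine = defaultdict(list)
for machine_id, value in rows:
    values_by_machine[machine_id].append(value)

# Average per machine_id (what avg("value") over the partition yields)
means = {m: sum(vs) / len(vs) for m, vs in values_by_machine.items()}

# diff = value - group average, attached to every original row
with_diff = [(m, v, v - means[m]) for m, v in rows]

for row in with_diff:
    print(row)
```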