I have 2 parquet tables. The simplified schema is as follows:
case class Product(
  SerialNumber: Integer,
  UniqueKey: String,
  ValidityDate1: Date
)

case class ExceptionEvents(
  SerialNumber: Integer,
  ExceptionId: String,
  ValidityDate2: Date
)
The Product Dataframe can contain the following entries, as an example:
Product:
-----------------------------------------
SerialNumber UniqueKey ValidityDate1
-----------------------------------------
10001 Key_1 01/10/2021
10001 Key_2 05/10/2021
10001 Key_3 10/10/2021
10002 Key_4 02/10/2021
10003 Key_5 07/10/2021
-----------------------------------------
ExceptionEvents:
-----------------------------------------
SerialNumber ExceptionId ValidityDate2
-----------------------------------------
10001 ExcId_1 02/10/2021
10001 ExcId_2 05/10/2021
10001 ExcId_3 07/10/2021
10001 ExcId_4 11/10/2021
10001 ExcId_5 15/10/2021
-----------------------------------------
I want to join the two DFs so that the SerialNumbers match, and each ExceptionEvent is mapped to the Product whose ValidityDate1 is closest to, but not after, its ValidityDate2 (per the example below, equal dates count as a match).
For example, the resultant DF should look like below:
---------------------------------------------------------------------
SerialNumber ExceptionId UniqueKey ValidityDate2
---------------------------------------------------------------------
10001 ExcId_1 Key_1 02/10/2021
10001 ExcId_2 Key_2 05/10/2021
10001 ExcId_3 Key_2 07/10/2021
10001 ExcId_4 Key_3 11/10/2021
10001 ExcId_5 Key_3 15/10/2021
---------------------------------------------------------------------
Any idea how this should be done using the Scala & Spark DataFrame APIs?
The below solution works fine for me:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val dfp1 = List(("1001", "Key1", "01/10/2021"), ("1001", "Key2", "05/10/2021"), ("1001", "Key3", "10/10/2021"), ("1002", "Key4", "02/10/2021")).toDF("SerialNumber", "UniqueKey", "Date1")
val dfProduct = dfp1.withColumn("Date1", to_date($"Date1", "dd/MM/yyyy"))
val dfe1 = List(("1001", "ExcId1", "02/10/2021"), ("1001", "ExcId2", "05/10/2021"), ("1001", "ExcId3", "07/10/2021"), ("1001", "ExcId4", "11/10/2021"), ("1001", "ExcId5", "15/10/2021")).toDF("SerialNumber", "ExceptionId", "Date2")
val dfExceptions = dfe1.withColumn("Date2", to_date($"Date2", "dd/MM/yyyy"))

// Join on SerialNumber and keep only products whose Date1 is on or before Date2.
val exceptionStat2 = dfExceptions.as("fact")
  .join(dfProduct.as("dim"), Seq("SerialNumber"))
  .select($"fact.*", $"dim.UniqueKey", datediff($"fact.Date2", $"dim.Date1").as("DiffDate"))
  .where($"DiffDate" >= 0)

// Rank the candidate products per exception by date distance and keep the closest one.
val exceptionStat3 = exceptionStat2
  .withColumn("rank", rank.over(Window.partitionBy($"SerialNumber", $"ExceptionId").orderBy($"DiffDate")))
  .where($"rank" === 1)
  .select($"SerialNumber", $"ExceptionId", $"UniqueKey", $"Date2", $"DiffDate", $"rank")
  .orderBy($"SerialNumber", $"Date2")
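For reference, the same closest-match pick can also be done without a window function, by taking the minimum (DiffDate, UniqueKey) struct per exception. This is only a sketch built on the dfExceptions/dfProduct frames above, not part of the original solution:
// Pick, per exception, the product row with the smallest non-negative date gap.
val closest = dfExceptions.as("fact")
  .join(dfProduct.as("dim"), Seq("SerialNumber"))
  .where(datediff($"fact.Date2", $"dim.Date1") >= 0)
  .groupBy($"SerialNumber", $"ExceptionId", $"fact.Date2")
  .agg(min(struct(datediff($"fact.Date2", $"dim.Date1").as("DiffDate"), $"dim.UniqueKey".as("UniqueKey"))).as("best"))
  .select($"SerialNumber", $"ExceptionId", $"best.UniqueKey".as("UniqueKey"), $"Date2")
  .orderBy($"SerialNumber", $"Date2")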
I have two dataframes
dataframe_1
Id   qname    qval
01   Mango    [100,200]
01   Banana   [500,400,800]

dataframe_2
reqId   Mango   Banana   Orange   Apple
1000    100     500      NULL     NULL
1001    200     500      NULL     NULL
1002    200     800      NULL     NULL
1003    900     1100     NULL     NULL

Expected Result
Id   reqId
01   1000
01   1001
01   1002
Please give me some idea. I need to match every qname/qval pair of dataframe_1 against the correspondingly named column of dataframe_2, ignoring the NULL columns of dataframe_2, and collect all matching reqIds from dataframe_2.
Note: all qname/qval pairs of a particular Id in dataframe_1 must match the corresponding columns of dataframe_2, ignoring nulls. For example, Id 01 has two qname/qval pairs; both must match against the columns of dataframe_2 with those names.
The logic is:
- In df2, pair each value column with "reqId".
- In df2, introduce a dummy column with a constant value and group by it, so all values end up in one group.
- Unpivot df2.
- Join df1 with the processed df2.
- For each element in the "qval" list, filter the corresponding "reqId" from the joined df2 column.
- Group by "Id" and explode "reqId".
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType

df1 = spark.createDataFrame(data=[["01","Mango",[100,200]],["01","Banana",[500,400,800]],["02","Banana",[800,1100]]], schema=["Id","qname","qval"])
df2 = spark.createDataFrame(data=[[1000,100,500,None,None],[1001,200,500,None,None],[1002,200,800,None,None],[1003,900,1100,None,None]], schema="reqId int,Mango int,Banana int,Orange int,Apple int")

# Pair each value column with its reqId.
for c in df2.columns:
    if c != "reqId":
        df2 = df2.withColumn(c, F.array(c, "reqId"))

# Collapse everything into a single row, then unpivot with stack().
df2 = df2.withColumn("dummy", F.lit(0)) \
    .groupBy("dummy") \
    .agg(*[F.collect_list(c).alias(c) for c in df2.columns]) \
    .drop("dummy", "reqId")
stack_cols = ", ".join([f"{c}, '{c}'" for c in df2.columns])
df2 = df2.selectExpr(f"stack({len(df2.columns)},{stack_cols}) as (qval2, qname2)")

# For each (value, reqId) pair of a column, keep the reqIds whose value appears in qval.
@F.udf(returnType=ArrayType(IntegerType()))
def compare_qvals(qval, qval2):
    return [x[1] for x in qval2 if x[0] in qval]

df_result = df1.join(df2, on=(df1.qname == df2.qname2)) \
    .withColumn("reqId", compare_qvals("qval", "qval2")) \
    .groupBy("Id") \
    .agg(F.flatten(F.array_distinct(F.collect_list("reqId"))).alias("reqId")) \
    .withColumn("reqId", F.explode("reqId"))
Output:
+---+-----+
|Id |reqId|
+---+-----+
|01 |1000 |
|01 |1001 |
|01 |1002 |
|02 |1002 |
|02 |1003 |
+---+-----+
PS: to cover the case of multiple "Id"s, I added some extra data to the sample dataset, hence the output has a few extra rows.
My base dataframe looks like this:
HeroNamesDF
id gen name surname supername
1 1 Clarc Kent BATMAN
2 1 Bruce Smith BATMAN
3 2 Clark Kent SUPERMAN
And then I have another one with the corrections: CorrectionsDF
id gen attribute value
1 1 supername SUPERMAN
1 1 name Clark
2 1 surname Wayne
My approach to the problem was to do this:
CorrectionsDF.select("id", "gen").distinct().collect().map(r => {
  val id = r(0)
  val gen = r(1)
  val corrections = CorrectionsDF.filter(col("id") === lit(id) and col("gen") === lit(gen))
  val candidates = HeroNamesDF.filter(col("id") === lit(id) and col("gen") === lit(gen))
  candidates.columns.map(column => {
    val change = corrections.where(col("attribute") === lit(column)).select("id", "gen", "value")
    candidates.select("id", "gen", column)
      .join(change, Seq("id", "gen"), "full")
      .withColumn(column, when(col("value").isNotNull, col("value")).otherwise(col(column)))
      .drop("value")
  }).reduce((df1, df2) => df1.join(df2, Seq("id", "gen")))
})
Expected output:
id gen name surname supername
1 1 Clark Kent SUPERMAN
2 1 Bruce Wayne BATMAN
3 2 Clark Kent SUPERMAN
And I would like to get rid of the .collect() but I can't make it work.
If I understood the example correctly, one left join combined with a group by should be sufficient in your case. With the group by we build a map, using collect_list and map_from_arrays, which holds the corrections for every id/gen pair, e.g. {"name" : "Clark", "supername" : "SUPERMAN"} for id 1, gen 1:
import org.apache.spark.sql.functions.{collect_list, map_from_arrays, coalesce, first}
val hdf = (load hero df)
val cdf = (load corrections df)
hdf.join(cdf, Seq("id", "gen"), "left")
.groupBy(hdf("id"), hdf("gen"))
.agg(
map_from_arrays(
collect_list("attribute"), // the keys
collect_list("value") // the values
).as("m"),
first("name").as("name"),
first("surname").as("surname"),
first("supername").as("supername")
)
.select(
$"id",
$"gen",
coalesce($"m".getItem("name"), $"name").as("name"),
coalesce($"m".getItem("surname"), $"surname").as("surname"),
coalesce($"m".getItem("supername"), $"supername").as("supername")
)
I have the following DataFrame df:
Id label field1 field2
1 xxx 2 3
1 yyy 1 5
2 aaa 0 10
1 zzz 2 6
For each unique Id I want to know the label with the highest field1 and the label with the highest field2.
Expected result:
Id labelField1 labelField2
1 xxx zzz
2 aaa aaa
I know how to do it if I only needed labelField1 or labelField2, but I am not sure of the best way to handle both labels.
val w1 = Window.partitionBy($"Id").orderBy($"field1".desc)
val w2 = Window.partitionBy($"Id").orderBy($"field2".desc)
val myLabels = df.select("Id", "label", "field1", "field2")
.withColumn("rn", row_number.over(w1)).where($"rn" === 1)
.drop("rn")
.drop("field1")
You can combine the struct and max built-in functions to achieve your requirement as follows:
import org.apache.spark.sql.functions._
df.groupBy("Id")
.agg(max(struct("field1", "label")).as("temp1"), max(struct("field2", "label")).as("temp2"))
.select(col("Id"), col("temp1.label").as("labelField1"), col("temp2.label").as("labelField2"))
.show(false)
which should give you
+---+-----------+-----------+
|Id |labelField1|labelField2|
+---+-----------+-----------+
|1 |xxx |zzz |
|2 |aaa |aaa |
+---+-----------+-----------+
Note: in case of a tie, as in field1 for Id=1 where xxx and zzz are tied, max(struct(...)) breaks the tie on the next struct field (the label), so the returned label is not determined by the field value alone.
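If you want the tie-break to be explicit, one option (a sketch, not part of the original answer; column names follow the example df above) is to fall back to ordered window functions:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Highest field1/field2 per Id, ties broken explicitly by the smallest label.
val w1 = Window.partitionBy(col("Id")).orderBy(col("field1").desc, col("label").asc)
val w2 = Window.partitionBy(col("Id")).orderBy(col("field2").desc, col("label").asc)

df.withColumn("labelField1", first(col("label")).over(w1))
  .withColumn("labelField2", first(col("label")).over(w2))
  .select("Id", "labelField1", "labelField2")
  .distinct()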
I use Spark 1.6. I tried to broadcast an RDD and am not sure how to access the broadcast variable in the DataFrames.
I have two dataframes employee & department.
Employee Dataframe
-------------------
Emp Id | Emp Name | Emp_Age
------------------
1 | john | 25
2 | David | 35
Department Dataframe
--------------------
Dept Id | Dept Name | Emp Id
-----------------------------
1 | Admin | 1
2 | HR | 2
import scala.collection.Map
val df_emp = hiveContext.sql("select * from emp")
val df_dept = hiveContext.sql("select * from dept")
val rdd = df_emp.rdd.map(row => (row.getInt(0),row.getString(1)))
val lkp = rdd.collectAsMap()
val bc = sc.broadcast(lkp)
print(bc.value.get(1).get)
// The statement below doesn't work
val combinedDF = df_dept.withColumn("emp_name",bc.value.get($"emp_id").get)
How do I refer the broadcast variable in the above combinedDF statement?
How to handle if the lkp doesn't return any value?
Is there a way to return multiple records from the lkp (let's assume there are 2 records for emp_id=1 in the lookup; I would like to get both)?
How do I return more than one value from the broadcast (emp_name & emp_age)?
How do I refer the broadcast variable in the above combinedDF statement?
Use a udf. If emp_id is an Int:
val f = udf((emp_id: Int) => bc.value.get(emp_id))
df_dept.withColumn("emp_name", f($"emp_id"))
How to handle if the lkp doesn't return any value?
Don't call .get on the Option the way the print statement does; let the udf return the Option itself (as above) and missing keys simply come back as null in the resulting column.
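For example, a small sketch (assuming the same bc broadcast map as above) that substitutes a default instead of null:
// Missing keys become "unknown" instead of null.
val fWithDefault = udf((emp_id: Int) => bc.value.getOrElse(emp_id, "unknown"))
df_dept.withColumn("emp_name", fWithDefault($"emp_id"))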
Is there a way to return multiple records from the lkp
Use groupByKey:
val lkp = rdd.groupByKey.collectAsMap()
and explode the resulting array; note that the udf then has to be rebuilt over the new broadcast so it returns a Seq, as in the sketch after this snippet:
df_dept.withColumn("emp_name", f($"emp_id")).withColumn("emp_name", explode($"emp_name"))
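Spelled out end to end (a sketch reusing the rdd, sc and df_dept names from the question):
// Broadcast all names per emp_id, look them up, then explode into one row per name.
val lkpMulti = rdd.groupByKey.mapValues(_.toSeq).collectAsMap()
val bcMulti = sc.broadcast(lkpMulti)
val fMulti = udf((emp_id: Int) => bcMulti.value.getOrElse(emp_id, Seq.empty[String]))
df_dept
  .withColumn("emp_names", fMulti($"emp_id"))
  .withColumn("emp_name", explode($"emp_names"))
  .drop("emp_names")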
or just skip all of those steps and use a broadcast join:
import org.apache.spark.sql.functions._
df_emp.join(broadcast(df_dept), Seq("Emp Id"), "left")
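As for returning more than one value (emp_name & emp_age): the broadcast join above already carries every column of df_emp along. If you prefer the map-based approach, a sketch (assuming the emp layout from the question: id, name, age) is to broadcast tuples and expose each part as its own column:
// Broadcast (name, age) pairs keyed by emp id.
val lkpFull = df_emp.rdd.map(row => (row.getInt(0), (row.getString(1), row.getInt(2)))).collectAsMap()
val bcFull = sc.broadcast(lkpFull)
val nameUdf = udf((emp_id: Int) => bcFull.value.get(emp_id).map(_._1))
val ageUdf  = udf((emp_id: Int) => bcFull.value.get(emp_id).map(_._2))
df_dept
  .withColumn("emp_name", nameUdf($"emp_id"))
  .withColumn("emp_age", ageUdf($"emp_id"))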
I use Spark SQL to load data into a val like this:
val customers = sqlContext.sql("SELECT * FROM customers")
But I have a separate txt file that contains a single column, CUST_ID, with 50,00 rows, i.e.:
CUST_ID
1
2
3
I want my customers val to have all customers in customers table that are not in the TXT file.
Using SQL I would do this with something like SELECT * FROM customers WHERE cust_id NOT IN ('1','2','3').
How can I do this using Spark?
I've read the text file and I can print its rows, but I'm not sure how to combine this with my SQL query:
scala> val custids = sc.textFile("cust_ids.txt")
scala> custids.take(4).foreach(println)
CUST_ID
1
2
3
You can import your text file as a dataframe and do a left outer join:
val customers = Seq(("1", "AAA", "shipped"), ("2", "ADA", "delivered") , ("3", "FGA", "never received")).toDF("id","name","status")
val custId = Seq(1,2).toDF("custId")
customers.join(custId,'id === 'custId,"leftOuter")
.where('custId.isNull)
.drop("custId")
.show()
+---+----+--------------+
| id|name| status|
+---+----+--------------+
| 3| FGA|never received|
+---+----+--------------+
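Since your IDs actually come from a text file, here is a sketch of how to read them as a DataFrame and filter with a left_anti join, which expresses NOT IN directly (this assumes Spark 2.x and that the customer id column is called id, as in the example above):
// Read the single-column file, using its CUST_ID header, then anti-join it away.
val custIds = spark.read.option("header", "true").csv("cust_ids.txt")
val result = customers.join(custIds, customers("id") === custIds("CUST_ID"), "left_anti")
result.show()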