Skip the current row COUNT and sum up the other COUNTS for current key with Spark Dataframe - scala

My input:
val df = sc.parallelize(Seq(
  ("0", "car1", "success"),
  ("0", "car1", "success"),
  ("0", "car3", "success"),
  ("0", "car2", "success"),
  ("1", "car1", "success"),
  ("1", "car2", "success"),
  ("0", "car3", "success")
)).toDF("id", "item", "status")
My intermediary group by output looks like this:
val df2 = df.groupBy("id", "item").agg(count("item").alias("occurences"))
+---+----+----------+
| id|item|occurences|
+---+----+----------+
| 0|car3| 2|
| 0|car2| 1|
| 0|car1| 2|
| 1|car2| 1|
| 1|car1| 1|
+---+----+----------+
The output I would like is:
For each row, I want to calculate the sum of occurrences of the other items for the same id, i.e. skipping the occurrences value of the current row's item.
For example, in the output table below, car3 appeared 2 times for id "0", car2 appeared 1 time and car1 appeared 2 times.
So for id "0", the sum of other occurrences for its "car3" item would be car2(1) + car1(2) = 3.
For the same id "0", the sum of other occurrences for its "car2" item would be car3(2) + car1(2) = 4.
This continues for the rest. Sample output:
+---+----+----------+----------------------+
| id|item|occurences| other_occurences_sum |
+---+----+----------+----------------------+
| 0|car3| 2| 3 |<- (car2+car1) for id 0
| 0|car2| 1| 4 |<- (car3+car1) for id 0
| 0|car1| 2| 3 |<- (car3+car2) for id 0
| 1|car2| 1| 1 |<- (car1) for id 1
| 1|car1| 1| 1 |<- (car2) for id 1
+---+----+----------+----------------------+

That's a perfect target for a window function.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
val w = Window.partitionBy("id")
df2.withColumn(
"other_occurences_sum", sum($"occurences").over(w) - $"occurences"
).show
// +---+----+----------+--------------------+
// | id|item|occurences|other_occurences_sum|
// +---+----+----------+--------------------+
// | 0|car3| 2| 3|
// | 0|car2| 1| 4|
// | 0|car1| 2| 3|
// | 1|car2| 1| 1|
// | 1|car1| 1| 1|
// +---+----+----------+--------------------+
Here sum($"occurences").over(w) is the sum of all occurrences for the current id. Of course, a join is also valid:
df2.join(
df2.groupBy("id").agg(sum($"occurences") as "total"), Seq("id")
).select(
$"*", ($"total" - $"occurences") as "other_occurences_sum"
).show
// +---+----+----------+--------------------+
// | id|item|occurences|other_occurences_sum|
// +---+----+----------+--------------------+
// | 0|car3| 2| 3|
// | 0|car2| 1| 4|
// | 0|car1| 2| 3|
// | 1|car2| 1| 1|
// | 1|car1| 1| 1|
// +---+----+----------+--------------------+
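For reference, the same window expression can also be written in Spark SQL. A minimal sketch, assuming a spark-shell SparkSession named spark and the df2 defined above:
df2.createOrReplaceTempView("df2")
spark.sql("""
  SELECT id, item, occurences,
         SUM(occurences) OVER (PARTITION BY id) - occurences AS other_occurences_sum
  FROM df2
""").show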

Related

How to solve the following issue with Apache Spark with an optimal solution

I need to solve the following problem without GraphFrames, please help.
Input Dataframe
|-----------+-----------+--------------|
| ID | prev | next |
|-----------+-----------+--------------|
| 1 | 1 | 2 |
| 2 | 1 | 3 |
| 3 | 2 | null |
| 9 | 9 | null |
|-----------+-----------+--------------|
output dataframe
|-----------+------------|
| bill_id | item_id |
|-----------+------------|
| 1 | [1, 2, 3] |
| 9 | [9] |
|-----------+------------|
This is probably quite inefficient, but it works. It is inspired by how GraphFrames does connected components. Basically, join the dataframe with itself on the prev column until prev doesn't get any lower, then group.
import pyspark.sql.functions as F  # needed for collect_set below

df = sc.parallelize([(1, 1, 2), (2, 1, 3), (3, 2, None), (9, 9, None)]).toDF(['ID', 'prev', 'next'])
df.show()
+---+----+----+
| ID|prev|next|
+---+----+----+
| 1| 1| 2|
| 2| 1| 3|
| 3| 2|null|
| 9| 9|null|
+---+----+----+
converged = False
count = 0
while not converged:
    step = df.join(df.selectExpr('ID as prev', 'prev as lower_prev'), 'prev', 'left').cache()
    print('step', count)
    step.show()
    converged = step.where('prev != lower_prev').count() == 0
    df = step.selectExpr('ID', 'lower_prev as prev')
    print('df', count)
    df.show()
    count += 1
step 0
+----+---+----+----------+
|prev| ID|next|lower_prev|
+----+---+----+----------+
| 2| 3|null| 1|
| 1| 2| 3| 1|
| 1| 1| 2| 1|
| 9| 9|null| 9|
+----+---+----+----------+
df 0
+---+----+
| ID|prev|
+---+----+
| 3| 1|
| 1| 1|
| 2| 1|
| 9| 9|
+---+----+
step 1
+----+---+----------+
|prev| ID|lower_prev|
+----+---+----------+
| 1| 3| 1|
| 1| 1| 1|
| 1| 2| 1|
| 9| 9| 9|
+----+---+----------+
df 1
+---+----+
| ID|prev|
+---+----+
| 3| 1|
| 1| 1|
| 2| 1|
| 9| 9|
+---+----+
df.groupBy('prev').agg(F.collect_set('ID').alias('item_id')).withColumnRenamed('prev', 'bill_id').show()
+-------+---------+
|bill_id| item_id|
+-------+---------+
| 1|[1, 2, 3]|
| 9| [9]|
+-------+---------+
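For completeness, a rough Scala sketch of the same iterative self-join (an untested translation of the Python above; assumes spark.implicits._ is in scope for toDF):
import org.apache.spark.sql.functions.collect_set

var df = Seq((1, 1, Some(2)), (2, 1, Some(3)), (3, 2, None), (9, 9, None))
  .toDF("ID", "prev", "next")
var converged = false
while (!converged) {
  // join each row's prev against the IDs to look for an even lower ancestor
  val step = df.join(df.selectExpr("ID as prev", "prev as lower_prev"), Seq("prev"), "left").cache()
  converged = step.where("prev != lower_prev").count() == 0
  df = step.selectExpr("ID", "lower_prev as prev")
}
df.groupBy("prev")
  .agg(collect_set("ID").alias("item_id"))
  .withColumnRenamed("prev", "bill_id")
  .show()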

Filter one data frame using another data frame in Spark Scala

I am going to demonstrate my question using the following two data frames.
val datF1= Seq((1,"everlasting",1.39),(1,"game", 2.7),(1,"life",0.69),(1,"learning",0.69),
(2,"living",1.38),(2,"worth",1.38),(2,"life",0.69),(3,"learning",0.69),(3,"never",1.38)).toDF("ID","token","value")
datF1.show()
+---+-----------+-----+
| ID| token|value|
+---+-----------+-----+
| 1|everlasting| 1.39|
| 1| game| 2.7|
| 1| life| 0.69|
| 1| learning| 0.69|
| 2| living| 1.38|
| 2| worth| 1.38|
| 2| life| 0.69|
| 3| learning| 0.69|
| 3| never| 1.38|
+---+-----------+-----+
val dataF2= Seq(("life ",0.71),("learning",0.75)).toDF("token1","val2")
dataF2.show()
+--------+----+
| token1|val2|
+--------+----+
| life |0.71|
|learning|0.75|
+--------+----+
I want to filter the ID and value of datF1 based on the token1 column of dataF2. For each word in token1 of dataF2, if that word appears in token for the ID then the value should be equal to the value in datF1, else the value should be zero.
In other words, my desired output should be like this:
+---+----+----+
| ID| val|val2|
+---+----+----+
| 1|0.69|0.69|
| 2| 0.0|0.69|
| 3|0.69| 0.0|
+---+----+----+
Since learning is not present for ID 2, val is equal to zero. Similarly, since life is not there for ID 3, val2 equals zero.
I did it manually as follows:
val newQ61=datF1.filter($"token"==="learning")
val newQ7 =Seq(1,2,3).toDF("ID")
val newQ81 =newQ7.join(newQ61, Seq("ID"), "left")
val tf2=newQ81.select($"ID" ,when(col("value").isNull ,0).otherwise(col("value")) as "val" )
val newQ62=datF1.filter($"token"==="life")
val newQ71 =Seq(1,2,3).toDF("ID")
val newQ82 =newQ71.join(newQ62, Seq("ID"), "left")
val tf3=newQ82.select($"ID" ,when(col("value").isNull ,0).otherwise(col("value")) as "val2" )
val tf4 =tf2.join(tf3 ,Seq("ID"), "left")
tf4.show()
+---+----+----+
| ID| val|val2|
+---+----+----+
| 1|0.69|0.69|
| 2| 0.0|0.69|
| 3|0.69| 0.0|
+---+----+----+
Instead of doing this manually, is there a way to do this more efficiently by accessing the indexes of one data frame within the other data frame? In real-life situations there can be more than 2 words, so manually handling each word may be a very hard thing to do.
Thank you.
UPDATE
When I use a leftsemi join, my output is like this:
datF1.join(dataF2, $"token"===$"token1", "leftsemi").show()
+---+--------+-----+
| ID| token|value|
+---+--------+-----+
| 1|learning| 0.69|
| 3|learning| 0.69|
+---+--------+-----+
I believe a left outer join and then pivoting on token can work here:
val ans = datF1.join(dataF2, $"token" === $"token1", "LEFT_OUTER")
  .filter($"token1".isNotNull)
  .select("ID", "token", "value")
  .groupBy("ID")
  .pivot("token")
  .agg(first("value"))
  .na.fill(0)
The result (after filling nulls with 0):
ans.show
+---+--------+----+
| ID|learning|life|
+---+--------+----+
|  1|    0.69|0.69|
|  3|    0.69| 0.0|
|  2|     0.0|0.69|
+---+--------+----+
UPDATE: as the answer by Lamanus suggests, an inner join is possibly a better approach than an outer join + filter.
I think an inner join is enough. By the way, I found a typo in your test case, which makes the result wrong.
val dataF1= Seq((1,"everlasting",1.39),
(1,"game", 2.7),
(1,"life",0.69),
(1,"learning",0.69),
(2,"living",1.38),
(2,"worth",1.38),
(2,"life",0.69),
(3,"learning",0.69),
(3,"never",1.38)).toDF("ID","token","value")
dataF1.show
// +---+-----------+-----+
// | ID| token|value|
// +---+-----------+-----+
// | 1|everlasting| 1.39|
// | 1| game| 2.7|
// | 1| life| 0.69|
// | 1| learning| 0.69|
// | 2| living| 1.38|
// | 2| worth| 1.38|
// | 2| life| 0.69|
// | 3| learning| 0.69|
// | 3| never| 1.38|
// +---+-----------+-----+
val dataF2= Seq(("life",0.71), // "life " -> "life"
("learning",0.75)).toDF("token1","val2")
dataF2.show
// +--------+----+
// | token1|val2|
// +--------+----+
// | life|0.71|
// |learning|0.75|
// +--------+----+
val resultDF = dataF1.join(dataF2, $"token" === $"token1", "inner")
resultDF.show
// +---+--------+-----+--------+----+
// | ID| token|value| token1|val2|
// +---+--------+-----+--------+----+
// | 1| life| 0.69| life|0.71|
// | 1|learning| 0.69|learning|0.75|
// | 2| life| 0.69| life|0.71|
// | 3|learning| 0.69|learning|0.75|
// +---+--------+-----+--------+----+
resultDF.groupBy("ID").pivot("token").agg(first("value"))
.na.fill(0).orderBy("ID").show
This will give you a result such as:
+---+--------+----+
| ID|learning|life|
+---+--------+----+
| 1| 0.69|0.69|
| 2| 0.0|0.69|
| 3| 0.69| 0.0|
+---+--------+----+
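If the column names then need to match the originally requested val/val2 layout, one possible follow-up is to alias the pivoted columns. A sketch only, assuming the resultDF from above and spark.implicits._ in scope; pivoted is a hypothetical name:
import org.apache.spark.sql.functions.first
val pivoted = resultDF.groupBy("ID").pivot("token").agg(first("value")).na.fill(0)
pivoted.select($"ID", $"learning".alias("val"), $"life".alias("val2")).orderBy("ID").show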
Seems like you need a "left semi join". It will filter one dataframe based on another one.
Try using it like:
datF1.join(dataF2, $"token" === $"token1", "leftsemi")
You can find a bit more info here - https://medium.com/datamindedbe/little-known-spark-dataframe-join-types-cc524ea39fd5

How to select rows using Window function? [duplicate]

This question already has answers here: How to select the first row of each group?
I have the following DataFrame df in Spark:
+-------+-----+---+
|OrderID| Type|Qty|
+-------+-----+---+
| 571936|62800|  1|
| 571936|62800|  1|
| 571936|62802|  3|
| 661455|72800|  1|
| 661455|72801|  1|
+-------+-----+---+
I need to select the row that has the largest value of Qty for each unique OrderID, or the last row per OrderID if all Qty values are the same (e.g. as for 661455). The expected result:
+-------+-----+---+
|OrderID| Type|Qty|
+-------+-----+---+
| 571936|62802|  3|
| 661455|72801|  1|
+-------+-----+---+
Any ideas how to get it?
This is what I tried:
val partitionWindow = Window.partitionBy(col("OrderID")).orderBy(col("Qty").asc)
val result = df.over(partitionWindow)
scala> import org.apache.spark.sql.expressions.Window
scala> import org.apache.spark.sql.functions.{max, row_number}
scala> val w = Window.partitionBy("OrderID").orderBy("Qty")
scala> val w1 = Window.partitionBy("OrderID")
scala> df.show()
+-------+-----+---+
|OrderID| Type|Qty|
+-------+-----+---+
| 571936|62800| 1|
| 571936|62800| 1|
| 571936|62802| 3|
| 661455|72800| 1|
| 661455|72801| 1|
+-------+-----+---+
scala> df.withColumn("rn", row_number.over(w)).withColumn("mxrn", max("rn").over(w1)).filter($"mxrn" === $"rn").drop("mxrn","rn").show
+-------+-----+---+
|OrderID| Type|Qty|
+-------+-----+---+
| 661455|72801| 1|
| 571936|62802| 3|
+-------+-----+---+
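A related sketch (not part of the answer above, and untested): ordering the window by Qty descending lets a single row_number filter do the job; ties within an OrderID are broken here by also ordering on Type descending, which matches the sample expected output, assuming that is the intended "last row" rule:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
val w2 = Window.partitionBy("OrderID").orderBy($"Qty".desc, $"Type".desc)
df.withColumn("rn", row_number.over(w2)).filter($"rn" === 1).drop("rn").show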

How to create a dataframe based on the first appearing date and on additional columns for each id

I am trying to create a dataframe with the following condition:
I have multiple IDs, multiple columns with defaults (0 or 1) and a startdate column. I would like to get a dataframe with the defaults that appear at the first startdate (default_date) for each id.
The original df looks like this:
+----+-----+-----+-----+-----------+
|id |def_a|def_b|deb_c|date |
+----+-----+-----+-----+-----------+
| 01| 1| 0| 1| 2019-01-31|
| 02| 1| 1| 0| 2018-12-31|
| 03| 1| 1| 1| 2018-10-31|
| 01| 1| 0| 1| 2018-09-30|
| 02| 1| 1| 0| 2018-08-31|
| 03| 1| 1| 0| 2018-07-31|
| 03| 1| 1| 1| 2019-05-31|
+----+-----+-----+-----+-----------+
This is how I would like to have it:
+----+-----+-----+-----+-----------+
|id |def_a|def_b|deb_c|date |
+----+-----+-----+-----+-----------+
| 01| 1| 0| 1| 2018-09-30|
| 02| 1| 1| 0| 2018-08-31|
| 03| 1| 1| 1| 2018-07-31|
+----+-----+-----+-----+-----------+
I tried the following code:
val w = Window.partitionBy($"id").orderBy($"date".asc)
val result = join3.withColumn("rn", row_number.over(w)).where($"def_a" === 1 || $"def_b" === 1 || $"def_c" === 1).filter($"rn" >= 1).drop("rn")
result.show
I would be grateful for any help
This should work for you. First assign the min date per id to the original df, then join the new df2 back with df.
import org.apache.spark.sql.expressions.Window
val df = Seq(
(1,1,0,1,"2019-01-31"),
(2,1,1,0,"2018-12-31"),
(3,1,1,1,"2018-10-31"),
(1,1,0,1,"2018-09-30"),
(2,1,1,0,"2018-08-31"),
(3,1,1,0,"2018-07-31"),
(3,1,1,1,"2019-05-31"))
.toDF("id" ,"def_a" , "def_b", "deb_c", "date")
val w = Window.partitionBy($"id").orderBy($"date".asc)
val df2 = df.withColumn("date", $"date".cast("date"))
.withColumn("min_date", min($"date").over(w))
.select("id", "min_date")
.distinct()
df.join(df2, df("id") === df2("id") && df("date") === df2("min_date"))
.select(df("*"))
.show
And the output should be:
+---+-----+-----+-----+----------+
| id|def_a|def_b|deb_c| date|
+---+-----+-----+-----+----------+
| 1| 1| 0| 1|2018-09-30|
| 2| 1| 1| 0|2018-08-31|
| 3| 1| 1| 0|2018-07-31|
+---+-----+-----+-----+----------+
By the way, I believe you had a little mistake in your expected results: it should be (3, 1, 1, 0, 2018-07-31), not (3, 1, 1, 1, 2018-07-31).
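A shorter variant of the same idea (a sketch, not tested) keeps only the first row per id with row_number, reusing the window w defined above:
import org.apache.spark.sql.functions.row_number
df.withColumn("date", $"date".cast("date"))
  .withColumn("rn", row_number.over(w))
  .where($"rn" === 1)
  .drop("rn")
  .show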

Spark - How to apply rules defined in a dataframe to another dataframe

I'm trying to solve this kind of problem with Spark 2, but I can't find a solution.
I have a dataframe A :
+----+-------+------+
|id |COUNTRY| MONTH|
+----+-------+------+
| 1 | US | 1 |
| 2 | FR | 1 |
| 4 | DE | 1 |
| 5 | DE | 2 |
| 3 | DE | 3 |
+----+-------+------+
And a dataframe B :
+-------+------+------+
|COLUMN |VALUE | PRIO |
+-------+------+------+
|COUNTRY| US | 5 |
|COUNTRY| FR | 15 |
|MONTH | 3 | 2 |
+-------+------+------+
The idea is to apply "rules" of dataframe B on dataframe A in order to get this result :
dataframe A' :
+----+-------+------+------+
|id |COUNTRY| MONTH| PRIO |
+----+-------+------+------+
| 1 | US | 1 | 5 |
| 2 | FR | 1 | 15 |
| 4 | DE | 1 | 20 |
| 5 | DE | 2 | 20 |
| 3 | DE | 3 | 2 |
+----+-------+------+------+
I tried something like that:
dfB.collect.foreach( r =>
var dfAp = dfA.where(r.getAs("COLUMN") == r.getAs("VALUE"))
dfAp.withColumn("PRIO", lit(r.getAs("PRIO")))
)
But I'm sure it's not the right way.
What is the strategy to solve this problem in Spark?
Working under the assumption that the set of rules is reasonably small (possible concerns are the size of the data and the size of the generated expression which, in the worst-case scenario, can crash the planner), the simplest solution is to use a local collection and map it to a SQL expression:
import org.apache.spark.sql.functions.{coalesce, col, lit, when}
val df = Seq(
(1, "US", "1"), (2, "FR", "1"), (4, "DE", "1"),
(5, "DE", "2"), (3, "DE", "3")
).toDF("id", "COUNTRY", "MONTH")
val rules = Seq(
("COUNTRY", "US", 5), ("COUNTRY", "FR", 15), ("MONTH", "3", 2)
).toDF("COLUMN", "VALUE", "PRIO")
val prio = coalesce(rules.as[(String, String, Int)].collect.map {
case (c, v, p) => when(col(c) === v, p)
} :+ lit(20): _*)
df.withColumn("PRIO", prio)
+---+-------+-----+----+
| id|COUNTRY|MONTH|PRIO|
+---+-------+-----+----+
| 1| US| 1| 5|
| 2| FR| 1| 15|
| 4| DE| 1| 20|
| 5| DE| 2| 20|
| 3| DE| 3| 2|
+---+-------+-----+----+
You can replace coalesce with least or greatest to apply the smallest or the largest matching value respectively.
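For example, a sketch of the smallest-matching-value variant: least skips the nulls produced by unmatched when branches (and needs at least two arguments), while coalesce still supplies the default of 20:
import org.apache.spark.sql.functions.least
val prioSmallest = coalesce(
  least(rules.as[(String, String, Int)].collect.map {
    case (c, v, p) => when(col(c) === v, p)
  }: _*),
  lit(20))
df.withColumn("PRIO", prioSmallest).show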
With a larger set of rules you could:
melt the data to convert it to a long format:
val dfLong = df.melt(Seq("id"), df.columns.tail, "COLUMN", "VALUE")
join by column and value.
Aggregate PRIO by id with an appropriate aggregation function (for example min):
val priorities = dfLong.join(rules, Seq("COLUMN", "VALUE"))
.groupBy("id")
.agg(min("PRIO").alias("PRIO"))
Left outer join the output with df by id:
df.join(priorities, Seq("id"), "leftouter").na.fill(20)
+---+-------+-----+----+
| id|COUNTRY|MONTH|PRIO|
+---+-------+-----+----+
| 1| US| 1| 5|
| 2| FR| 1| 15|
| 4| DE| 1| 20|
| 5| DE| 2| 20|
| 3| DE| 3| 2|
+---+-------+-----+----+
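Note that melt as called above is not a built-in Dataset method in older Spark versions (it usually refers to a helper, or to the newer unpivot/melt API). If neither is available, a sketch of the same long-format conversion with the SQL stack function, which works here because COUNTRY and MONTH are both string columns in df, can be used in place of the melt call:
val dfLong = df.selectExpr(
  "id",
  "stack(2, 'COUNTRY', COUNTRY, 'MONTH', MONTH) as (COLUMN, VALUE)")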
Let's assume the rules in dataframe B are limited.
I have created dataframe "df" for the table below:
+---+-------+------+
| id|COUNTRY|MONTH|
+---+-------+------+
| 1| US| 1|
| 2| FR| 1|
| 4| DE| 1|
| 5| DE| 2|
| 3| DE| 3|
+---+-------+------+
By using a UDF:
import org.apache.spark.sql.functions.udf
val code = udf { (x: String, y: Int) =>
  if (x == "US") "5" else if (x == "FR") "15" else if (y == 3) "2" else "20"
}
df.withColumn("PRIO", code($"COUNTRY", $"MONTH")).show()
Output:
+---+-------+------+----+
| id|COUNTRY|MONTH|PRIO|
+---+-------+------+----+
| 1| US| 1| 5|
| 2| FR| 1| 15|
| 4| DE| 1| 20|
| 5| DE| 2| 20|
| 3| DE| 3| 2|
+---+-------+------+----+