Spark Scala: moving average for multiple columns

Input:
val customers = sc.parallelize(List(("Alice", "2016-05-01", 50.00,4),
("Alice", "2016-05-03", 45.00,2),
("Alice", "2016-05-04", 55.00,4),
("Bob", "2016-05-01", 25.00,6),
("Bob", "2016-05-04", 29.00,7),
("Bob", "2016-05-06", 27.00,10))).
toDF("name", "date", "amountSpent","NumItems")
Procedure:
// Import the window functions.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// Create a window spec.
val wSpec1 = Window.partitionBy("name").orderBy("date").rowsBetween(-1, 1)
In this window spec, the data is partitioned by customer, each customer's data is ordered by date, and the window frame runs from -1 (one row before the current row) to 1 (one row after the current row), i.e. a sliding window of three rows. The task is to compute a window-based sum for a list of columns, here "amountSpent" and "NumItems", but in practice there can be up to hundreds of columns.
Below is a solution that performs the window-based summation for each column separately. However, how can the summation be done more efficiently? There is no need to resolve the sliding-window rows again for every single column.
// Calculate the sum of spent
customers.withColumn("sumSpent",sum(customers("amountSpent")).over(wSpec1)).show()
+-----+----------+-----------+--------+--------+
| name| date|amountSpent|NumItems|sumSpent|
+-----+----------+-----------+--------+--------+
|Alice|2016-05-01| 50.0| 4| 95.0|
|Alice|2016-05-03| 45.0| 2| 150.0|
|Alice|2016-05-04| 55.0| 4| 100.0|
| Bob|2016-05-01| 25.0| 6| 54.0|
| Bob|2016-05-04| 29.0| 7| 81.0|
| Bob|2016-05-06| 27.0| 10| 56.0|
+-----+----------+-----------+--------+--------+
// Calculate the sum of items
customers.withColumn( "sumItems",
sum(customers("NumItems")).over(wSpec1) ).show()
+-----+----------+-----------+--------+--------+
| name| date|amountSpent|NumItems|sumItems|
+-----+----------+-----------+--------+--------+
|Alice|2016-05-01| 50.0| 4| 6|
|Alice|2016-05-03| 45.0| 2| 10|
|Alice|2016-05-04| 55.0| 4| 6|
| Bob|2016-05-01| 25.0| 6| 13|
| Bob|2016-05-04| 29.0| 7| 23|
| Bob|2016-05-06| 27.0| 10| 17|
+-----+----------+-----------+--------+--------+
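(An aside, not from the original post: since the title asks for a moving average, the same window spec gives it directly by swapping sum for avg; a minimal sketch, assuming the customers DataFrame and wSpec1 defined above.)
// 3-row moving average over the same sliding window
customers.withColumn("avgSpent", avg(customers("amountSpent")).over(wSpec1)).show()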

Currently, I guess, it's not possible to update multiple columns in a single Window expression. You can make it look as if it happens at the same time, as below:
val customers = sc.parallelize(List(("Alice", "2016-05-01", 50.00,4),
("Alice", "2016-05-03", 45.00,2),
("Alice", "2016-05-04", 55.00,4),
("Bob", "2016-05-01", 25.00,6),
("Bob", "2016-05-04", 29.00,7),
("Bob", "2016-05-06", 27.00,10))).
toDF("name", "date", "amountSpent","NumItems")
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// Create a window spec.
val wSpec1 = Window.partitionBy("name").orderBy("date").rowsBetween(-1, 1)
var tempdf = customers
val colNames = List("amountSpent", "NumItems")
for (column <- colNames) {
  tempdf = tempdf.withColumn(column + "Sum", sum(tempdf(column)).over(wSpec1))
}
tempdf.show(false)
You should get output like:
+-----+----------+-----------+--------+--------------+-----------+
|name |date |amountSpent|NumItems|amountSpentSum|NumItemsSum|
+-----+----------+-----------+--------+--------------+-----------+
|Bob |2016-05-01|25.0 |6 |54.0 |13 |
|Bob |2016-05-04|29.0 |7 |81.0 |23 |
|Bob |2016-05-06|27.0 |10 |56.0 |17 |
|Alice|2016-05-01|50.0 |4 |95.0 |6 |
|Alice|2016-05-03|45.0 |2 |150.0 |10 |
|Alice|2016-05-04|55.0 |4 |100.0 |6 |
+-----+----------+-----------+--------+--------------+-----------+
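The same loop can be written without the mutable var, as a minimal sketch (it still adds one window expression per column):
// foldLeft threads the DataFrame through the list of column names
val summed = colNames.foldLeft(customers) { (df, c) =>
  df.withColumn(c + "Sum", sum(df(c)).over(wSpec1))
}
summed.show(false)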

Yes, it's possible to compute the window only once (on Spark 2, which lets you use collect_list with struct types). Assuming you have the dataframe and windowSpec as in your code, then:
val colNames = List("amountSpent","NumItems")
val cols= colNames.map(col(_))
// put window-content of all columns in one struct
val df_wc_arr = customers
.withColumn("window_content_arr",collect_list(struct(cols:_*)).over(wSpec1))
// calculate sum of window-content for each column
// aggregation expressions used later
val aggExpr = colNames.map(n => sum(col("window_content."+n)).as(n+"Sum"))
df_wc_arr
.withColumn("window_content",explode($"window_content_arr"))
.drop($"window_content_arr")
.groupBy(($"name" :: $"date" :: cols):_*)
.agg(aggExpr.head,aggExpr.tail:_*)
.orderBy($"name",$"date")
.show
gives
+-----+----------+-----------+--------+--------------+-----------+
| name| date|amountSpent|NumItems|amountSpentSum|NumItemsSum|
+-----+----------+-----------+--------+--------------+-----------+
|Alice|2016-05-01| 50.0| 4| 95.0| 6|
|Alice|2016-05-03| 45.0| 2| 150.0| 10|
|Alice|2016-05-04| 55.0| 4| 100.0| 6|
| Bob|2016-05-01| 25.0| 6| 54.0| 13|
| Bob|2016-05-04| 29.0| 7| 81.0| 23|
| Bob|2016-05-06| 27.0| 10| 56.0| 17|
+-----+----------+-----------+--------+--------------+-----------+
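To get the moving averages from the title instead of sums, only the aggregation expressions change; the window content is still collected once. A sketch under the same setup:
// average the exploded window content instead of summing it
val avgExpr = colNames.map(n => avg(col("window_content." + n)).as(n + "Avg"))
df_wc_arr
.withColumn("window_content", explode($"window_content_arr"))
.drop($"window_content_arr")
.groupBy(($"name" :: $"date" :: cols):_*)
.agg(avgExpr.head, avgExpr.tail:_*)
.orderBy($"name", $"date")
.show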

Related

Sum of column in sqlDataframe without using groupBy or agg functions in scala/spark

For the dataframe given below, I want a new column in the dataframe that holds a constant value: the sum of the freq column.
+------+----+
|number|freq|
+------+----+
| 8| 1|
| 6| 2|
| 2| 4|
+------+----+
The result should look like
+------+----+-------+
|number|freq|new_col|
+------+----+-------+
| 8| 1| 7|
| 6| 2| 7|
| 2| 4| 7|
+------+----+-------+
and I want this without groupBy or agg.
I tried:
var x = sum(df("freq"))
df.withColumn("new_col",lit(x))
or
df.withColumn("new_col",x)
or
df.withColumn("new_col",sum($"freq"))
But none worked.
You can try this, but be careful: it uses a single partition:
import spark.implicits._
import org.apache.spark.sql.functions._
val df = Seq(
(8,1),
(6,2),
(2,4)
).toDF("number","freq")
df.withColumn("new_col", sum($"freq").over())
.show(false)
+------+----+-------+
|number|freq|new_col|
+------+----+-------+
|8 |1 |7 |
|6 |2 |7 |
|2 |4 |7 |
+------+----+-------+
You could use a window over the entire dataframe to do that, but I strongly recommend against it: all the data would have to go to a single partition, which would be terrible for performance.
A simple way to do it, very similar to your first approach, is:
import org.apache.spark.sql.Row
val Row(x) = df.select(sum('freq)).head
val new_df = df.withColumn("new_col", lit(x))
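The same idea without pattern matching on Row, as a small sketch (sum over an integer column comes back as a Long):
val total = df.agg(sum($"freq")).first().getLong(0)   // 7 for the sample data
val new_df2 = df.withColumn("new_col", lit(total))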

Fill null values in dataframe column with next value

I have to fill the first null values of a column with the next non-null value in the same column. This logic applies only to the first run of consecutive null values in the column.
I have a dataframe with similar to below
// I replaced null with 0 in the value column
val df = Seq( (0,"exA",30), (0,"exB",22), (0,"exC",19), (16,"exD",13),
(5,"exE",28), (6,"exF",26), (0,"exG",12), (13,"exH",53))
.toDF("value", "col2", "col3")
scala> df.show(false)
+-----+----+----+
|value|col2|col3|
+-----+----+----+
|0 |exA |30 |
|0 |exB |22 |
|0 |exC |19 |
|16 |exD |13 |
|5 |exE |28 |
|6 |exF |26 |
|0 |exG |12 |
|13 |exH |53 |
+-----+----+----+
From this dataframe I expect the following:
scala> df.show(false)
+-----+----+----+
|value|col2|col3|
+-----+----+----+
|16 |exA |30 | // Change the value 0 to 16 in the value column
|16 |exB |22 | // Change the value 0 to 16 in the value column
|16 |exC |19 | // Change the value 0 to 16 in the value column
|16 |exD |13 |
|5 |exE |28 |
|6 |exF |26 |
|0 |exG |12 | // value should not be changed here
|13 |exH |53 |
+-----+----+----+
Please help me solve this.
You can use a Window function for this purpose:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val df = Seq( (0,"exA",30), (0,"exB",22), (0,"exC",19), (16,"exD",13),
(5,"exE",28), (6,"exF",26), (0,"exG",12), (13,"exH",53))
.toDF("value", "col2", "col3")
val w = Window.orderBy($"col2".desc)
df.withColumn("Result", last(when($"value" === 0, null).otherwise($"value"), ignoreNulls = true).over(w))
.orderBy($"col2")
.show(10)
Will result in
+-----+----+----+------+
|value|col2|col3|Result|
+-----+----+----+------+
| 0| exA| 30| 16|
| 0| exB| 22| 16|
| 0| exC| 19| 16|
| 16| exD| 13| 16|
| 5| exE| 28| 5|
| 6| exF| 26| 6|
| 0| exG| 12| 13|
| 13| exH| 53| 13|
+-----+----+----+------+
The expression df.orderBy($"col2") is needed only to show the final results in the right order. You can skip it if you don't care about the final order.
UPDATE
To get exactly what you need, you have to use slightly more complicated code:
val w = Window.orderBy($"col2")
val w2 = Window.orderBy($"col2".desc)
df.withColumn("IntermediateResult", first(when($"value" === 0, null).otherwise($"value"), ignoreNulls = true).over(w))
.withColumn("Result", when($"IntermediateResult".isNull, last($"IntermediateResult", ignoreNulls = true).over(w2)).otherwise($"value"))
.orderBy($"col2")
.show(10)
+-----+----+----+------------------+------+
|value|col2|col3|IntermediateResult|Result|
+-----+----+----+------------------+------+
| 0| exA| 30| null| 16|
| 0| exB| 22| null| 16|
| 0| exC| 19| null| 16|
| 16| exD| 13| 16| 16|
| 5| exE| 28| 16| 5|
| 6| exF| 26| 16| 6|
| 0| exG| 12| 16| 0|
| 13| exH| 53| 16| 13|
+-----+----+----+------------------+------+
I think you need to take the first non-null or non-zero value based on the order of col2. Please find the script below. I have registered the dataframe as a temporary table in Spark's memory so I can write SQL against it.
val df = Seq( (0,"exA",30), (0,"exB",22), (0,"exC",19), (16,"exD",13),
(5,"exE",28), (6,"exF",26), (0,"exG",12), (13,"exH",53))
.toDF("value", "col2", "col3")
df.registerTempTable("table_df")
spark.sql("with cte as(select *,row_number() over(order by col2) rno from table_df) select case when value = 0 and rno<(select min(rno) from cte where value != 0) then (select value from cte where rno=(select min(rno) from cte where value != 0)) else value end value,col2,col3 from cte").show(df.count.toInt,false)
Please let me know if you have any questions.
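A minor side note, not from the original answer: on Spark 2.x registerTempTable is deprecated; the equivalent call is
df.createOrReplaceTempView("table_df")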
I added a new column with an incremental id to your DF:
import org.apache.spark.sql.functions._
val df_1 = Seq((0,"exA",30),
(0,"exB",22),
(0,"exC",19),
(16,"exD",13),
(5,"exE",28),
(6,"exF",26),
(0,"exG",12),
(13,"exH",53))
.toDF("value", "col2", "col3")
.withColumn("UniqueID", monotonically_increasing_id)
Filter the DF to keep only rows with non-zero values:
val df_2 = df_1.filter("value != 0")
create a variable "limit" to limit first N row that we need and variable Nvar for the first non-zero value
val limit = df_2.agg(min("UniqueID")).collect().map(_(0)).mkString("").toInt + 1
val nVal = df_1.limit(limit).agg(max("value")).collect().map(_(0)).mkString("").toInt
Create a DF that overwrites the column of the same name ("value") based on a condition:
val df_4 = df_1.withColumn("value", when(($"UniqueID" < limit), nVal).otherwise($"value"))
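A quick check, as a sketch: assuming the sample data sits in a single partition, the generated ids are 0..7 and line up with row order, so the first four rows get nVal (16) and the rest keep their original value, matching the expected output in the question.
df_4.orderBy("UniqueID").drop("UniqueID").show(false)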

Upsert Two Dataframes in Scala

I have two data sources, both of which have opinions about the current state of the same set of entities. Either data source may contain the most current data, which may or may not be from the current date. For example:
val df1 = Seq((1, "green", "there", "2018-01-19"), (2, "yellow", "there", "2018-01-18"), (4, "yellow", "here", "2018-01-20")).toDF("id", "status", "location", "date")
val df2 = Seq((2, "red", "here", "2018-01-20"), (3, "green", "there", "2018-01-20"), (4, "green", "here", "2018-01-19")).toDF("id", "status", "location", "date")
df1.show
+---+------+--------+----------+
| id|status|location| date|
+---+------+--------+----------+
| 1| green| there|2018-01-19|
| 2|yellow| there|2018-01-18|
| 4|yellow| here|2018-01-20|
+---+------+--------+----------+
df2.show
+---+------+--------+----------+
| id|status|location| date|
+---+------+--------+----------+
| 2| red| here|2018-01-20|
| 3| green| there|2018-01-20|
| 4| green| here|2018-01-19|
+---+------+--------+----------+
I want the output to be the set of most current states for each entity:
+---+------+--------+----------+
| id|status|location| date|
+---+------+--------+----------+
| 1| green| there|2018-01-19|
| 2| red| here|2018-01-20|
| 3| green| there|2018-01-20|
| 4|yellow| here|2018-01-20|
+---+------+--------+----------+
My approach, which seems to work, is to join the two tables and then do a kind of custom coalesce operation based on date:
val joined = df1.join(df2, df1("id") === df2("id"), "outer")
+----+------+--------+----------+----+------+--------+----------+
| id|status|location| date| id|status|location| date|
+----+------+--------+----------+----+------+--------+----------+
| 1| green| there|2018-01-19|null| null| null| null|
|null| null| null| null| 3| green| there|2018-01-20|
| 4|yellow| here|2018-01-20| 4|yellow| here|2018-01-20|
| 2|yellow| there|2018-01-18| 2| red| here|2018-01-20|
+----+------+--------+----------+----+------+--------+----------+
def weirdCoal(name: String) = when(df1("date") > df2("date") || df2("date").isNull, df1(name)).otherwise(df2(name)) as name
val output = joined.select(df1.columns.map(weirdCoal):_*)
+---+------+--------+----------+
| id|status|location| date|
+---+------+--------+----------+
| 1| green| there|2018-01-19|
| 2| red| here|2018-01-20|
| 3| green| there|2018-01-20|
| 4|yellow| here|2018-01-20|
+---+------+--------+----------+
Which is the output I expect.
I can also see doing this via some kind of union / aggregation approach or with a window that partitions by id and sorts by date and takes the last row.
My question: is there an idiomatic way of doing this?
Yes, it can be done without a join, using Window functions:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
df1.union(df2)
.withColumn("rank", rank().over(Window.partitionBy($"id").orderBy($"date".desc)))
.filter($"rank" === 1)
.drop($"rank")
.orderBy($"id")
.show
output:
+---+------+--------+----------+
| id|status|location| date|
+---+------+--------+----------+
| 1| green| there|2018-01-19|
| 2| red| here|2018-01-20|
| 3| green| there|2018-01-20|
| 4|yellow| here|2018-01-20|
+---+------+--------+----------+
The above code partitions the data by id and, for each id, keeps the row with the latest date.
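One hedged variant: if both sources can report the same latest date for an id, rank() keeps both rows; row_number() keeps exactly one row per id.
// row_number() breaks ties, so exactly one row survives per id
df1.union(df2)
.withColumn("rn", row_number().over(Window.partitionBy($"id").orderBy($"date".desc)))
.filter($"rn" === 1)
.drop("rn")
.orderBy($"id")
.show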

Joining two DataFrames and appending where not exists

I have two DataFrames. One is a MasterList, the other is an InsertList.
MasterList:
+--------+--------+
| ttm_id|audit_id|
+--------+--------+
| 1| 10|
| 15| 10|
+--------+--------+
InsertList:
+--------+--------+
| ttm_id|audit_id|
+--------+--------+
| 1| 10|
| 15| 9|
+--------+--------+
In Scala, how do I join the two DataFrames but only append to the new DataFrame the records where
MasterList.ttm_id = InsertList.ttm_id AND
MasterList.audit_id != InsertList.audit_id
ExpectedOutput:
+--------+--------+
| ttm_id|audit_id|
+--------+--------+
| 1| 10|
| 15| 10|
| 15| 9|
+--------+--------+
I'd anti-join (NOT IN) on both columns and then union:
val masterList = Seq((1, 10), (15, 10)).toDF("ttm_id", "audit_id")
val insertList = Seq((1, 10), (15, 9)).toDF("ttm_id", "audit_id")
insertList
.join(masterList, Seq("ttm_id", "audit_id"), "leftanti")
.union(masterList)
.show
// +------+--------+
// |ttm_id|audit_id|
// +------+--------+
// | 15| 9|
// | 1| 10|
// | 15| 10|
// +------+--------+
It seems that you want to merge in the rows of the insertList dataframe that are not in the masterList dataframe. This can be achieved using the except function:
insertList.except(masterList)
Then just use the union function to merge both dataframes:
masterList.union(insertList.except(masterList))
You should get the desired result:
+------+--------+
|ttm_id|audit_id|
+------+--------+
|1 |10 |
|15 |10 |
|15 |9 |
+------+--------+
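For reference, the intermediate except result on this sample data (a quick check, not part of the original answer):
insertList.except(masterList).show
// +------+--------+
// |ttm_id|audit_id|
// +------+--------+
// |    15|       9|
// +------+--------+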

How to perform division operation in dataFrame Spark using Scala?

I have a DataFrame like the one below.
+---+---+-----+
|uId| Id| sum |
+---+---+-----+
| 3| 1| 1.0|
| 7| 1| 1.0|
| 1| 2| 3.0|
| 1| 1| 1.0|
| 6| 5| 1.0|
+---+---+-----+
Using the above DataFrame, I want to generate the new DataFrame shown below.
The sum column should be:
For example:
For uid=3 and id=1, my sum column value should be (old sum value * 1 / count of ID(1)), i.e.
1.0*1/3=0.333
For uid=7 and id=1, my sum column value should be (old sum value * 1 / count of ID(1)), i.e.
1.0*1/3=0.333
For uid=1 and id=2, my sum column value should be (old sum value * 1 / count of ID(2)), i.e.
3.0*1/1=3.0
For uid=6 and id=5, my sum column value should be (old sum value * 1 / count of ID(5)), i.e.
1.0*1/1=1.0
My final output should be:
+---+---+---------+
|uId| Id| sum |
+---+---+---------+
| 3| 1| 0.33333|
| 7| 1| 0.33333|
| 1| 2| 3.0 |
| 1| 1| 0.3333 |
| 6| 5| 1.0 |
+---+---+---------+
You can use a Window function to get the count of each group of the id column and then divide the original sum by that count:
import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy("id")
import org.apache.spark.sql.functions._
df.withColumn("sum", $"sum"/count("id").over(windowSpec))
You should get the final dataframe as:
+---+---+------------------+
|uId|Id |sum |
+---+---+------------------+
|3 |1 |0.3333333333333333|
|7 |1 |0.3333333333333333|
|1 |1 |0.3333333333333333|
|6 |5 |1.0 |
|1 |2 |3.0 |
+---+---+------------------+
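An alternative sketch without a window function, assuming df is the question's input DataFrame: compute the per-id counts once with groupBy and join them back before dividing.
import org.apache.spark.sql.functions._
// count rows per Id, join the counts back, then divide the original sum
val counts = df.groupBy("Id").agg(count("Id").as("cnt"))
df.join(counts, "Id")
.withColumn("sum", $"sum" / $"cnt")
.drop("cnt")
.show()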