Sliding window over a period of weeks in Spark - Scala

I have a dataset:
+---------------+-----------+---------+--------+
| Country | Timezone |Year_Week|MinUsers|
+---------------+-----------+---------+--------+
|Germany |1.0 |2019-01 |4322 |
|Germany |1.0 |2019-02 |4634 |
|Germany |1.0 |2019-03 |5073 |
|Germany |1.0 |2019-04 |4757 |
|Germany |1.0 |2019-05 |5831 |
|Germany |1.0 |2019-06 |5026 |
|Germany |1.0 |2019-07 |5038 |
|Germany |1.0 |2019-08 |5005 |
|Germany |1.0 |2019-09 |5766 |
|Germany |1.0 |2019-10 |5204 |
|Germany |1.0 |2019-11 |5240 |
|Germany |1.0 |2019-12 |5306 |
|Germany |1.0 |2019-13 |5381 |
|Germany |1.0 |2019-14 |5659 |
|Germany |1.0 |2019-15 |5518 |
|Germany |1.0 |2019-16 |6666 |
|Germany |1.0 |2019-17 |5594 |
|Germany |1.0 |2019-18 |5395 |
|Germany |1.0 |2019-19 |5482 |
|Germany |1.0 |2019-20 |5582 |
|Germany |1.0 |2019-21 |5492 |
|Germany |1.0 |2019-22 |5889 |
|Germany |1.0 |2019-23 |6514 |
|Germany |1.0 |2019-24 |5112 |
|Germany |1.0 |2019-25 |4795 |
|Germany |1.0 |2019-26 |4673 |
|Germany |1.0 |2019-27 |5330 |
+---------------+-----------+---------+--------+
I want to slide over the dataset with a window of 25 weeks and calculate the average of MinUsers over that period. The final result should look like this:
+---------------+-----------+---------+-------------+
| Country | Timezone |Year_Week|Avg(MinUsers)|
+---------------+-----------+---------+-------------+
|Germany |1.0 |2019-25 |6006.12 |
|Germany |1.0 |2019-26 |2343.16 |
|Germany |1.0 |2019-27 |8464.2 |
+---------------+-----------+---------+-------------+
*The Avg(MinUsers) values are dummy numbers.
I want the average per Country, per Timezone, per Year_Week:
df
  .groupBy("Country", "Timezone", "Year_Week")
  .agg(min("NumUserPer4Hour").alias("MinUsers"))
  .withColumn("Avg", avg("MinUsers").over(Window.partitionBy("Country", "Timezone").rowsBetween(-25, 0).orderBy("Year_Week")))
  .orderBy("Country", "Year_Week")
I'm not sure how to add the partition information there. I tried a tumbling window as well, but it didn't work well.
It would be great if someone could help with this.

This can be solved with a Window Function.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named spark (needed for toDF and $)
val df = Seq(("Germany",1.0,"2019-01",4322),
("Germany",1.0,"2019-02",4634),
("Germany",1.0,"2019-03",5073),
("Germany",1.0,"2019-04",4757),
("Germany",1.0,"2019-05",5831),
("Germany",1.0,"2019-06",5026),
("Germany",1.0,"2019-07",5038),
("Germany",1.0,"2019-08",5005),
("Germany",1.0,"2019-09",5766),
("Germany",1.0,"2019-10",5204),
("Germany",1.0,"2019-11",5240),
("Germany",1.0,"2019-12",5306),
("Germany",1.0,"2019-13",5381),
("Germany",1.0,"2019-14",5659),
("Germany",1.0,"2019-15",5518),
("Germany",1.0,"2019-16",6666),
("Germany",1.0,"2019-17",5594),
("Germany",1.0,"2019-18",5395),
("Germany",1.0,"2019-19",5482),
("Germany",1.0,"2019-20",5582),
("Germany",1.0,"2019-21",5492),
("Germany",1.0,"2019-22",5889),
("Germany",1.0,"2019-23",6514),
("Germany",1.0,"2019-24",5112),
("Germany",1.0,"2019-25",4795),
("Germany",1.0,"2019-26",4673),
("Germany",1.0,"2019-27",5330)
).toDF("Country", "Timezone", "Year_Week", "MinUsers")
val w = Window.partitionBy("Country", "Timezone")
  .orderBy("Year_Week")
  .rowsBetween(-25, Window.currentRow)

df.select(
    $"Country",
    $"Timezone",
    $"Year_week",
    avg($"MinUsers").over(w).as("Avg(MinUsers)")
  )
  .filter($"Year_Week" >= "2019-25")
  .show()
The filter is there only to reduce the output to the rows in your question; the window function computes the average for every row. When the frame reaches past the beginning of the dataframe (fewer than 25 previous weeks available), it simply averages over the rows that do exist inside the window.
The above code produces:
+-------+--------+---------+-----------------+
|Country|Timezone|Year_week| Avg(MinUsers)|
+-------+--------+---------+-----------------+
|Germany| 1.0| 2019-25| 5371.24|
|Germany| 1.0| 2019-26|5344.384615384615|
|Germany| 1.0| 2019-27|5383.153846153846|
+-------+--------+---------+-----------------+
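Note that rowsBetween(-25, Window.currentRow) spans up to 26 rows: 25 preceding rows plus the current one. If the intent is a window of exactly 25 weeks including the current week, a lower bound of -24 would do it; the following is only a sketch under that assumption (the averages above were computed with the -25 frame and would change slightly):
// Frame of exactly 25 rows: the current week plus the 24 preceding weeks.
val w25 = Window.partitionBy("Country", "Timezone")
  .orderBy("Year_Week")
  .rowsBetween(-24, Window.currentRow)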

If Year_Week were an actual date/timestamp field, you could use a range-based frame with the following code; you can replace days with weeks, months, years, etc. (some_value is a placeholder for the column being averaged):
// Register the dataframe as a temp view so it can be referenced as "df" in SQL.
df.createOrReplaceTempView("df")

spark.sql(
  """SELECT *, avg(some_value) OVER (
       PARTITION BY Country, Timezone
       ORDER BY CAST(Year_Week AS timestamp)
       RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
     ) AS avg FROM df""").show()
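Year_Week in this question is a string like 2019-01, though, which cannot be cast to a timestamp directly. A possible workaround (only a sketch, assuming the week numbers never exceed 52) is to derive a numeric week index and use rangeBetween, which also keeps the frame correct when some weeks are missing from the data:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Approximate, monotonically increasing week index: year * 52 + week
// (ignores 53-week ISO years, so treat it as a sketch).
val withIdx = df.withColumn(
  "week_idx",
  split($"Year_Week", "-").getItem(0).cast("int") * 52 +
    split($"Year_Week", "-").getItem(1).cast("int")
)

val wRange = Window.partitionBy("Country", "Timezone")
  .orderBy("week_idx")
  .rangeBetween(-24, 0) // current week plus the 24 preceding calendar weeks

withIdx.withColumn("Avg(MinUsers)", avg($"MinUsers").over(wRange)).show(false)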

Related

How to rank dataframe depending on a group of rows in a column?

I have this dataframe:
+-----+----------+---------+
|num |Timestamp |frequency|
+-----+----------+---------+
|20.0 |1632899456|4 |
|20.0 |1632901256|4 |
|20.0 |1632901796|4 |
|20.0 |1632899155|4 |
|10.0 |1632901743|2 |
|10.0 |1632899933|2 |
|91.0 |1632899756|1 |
|32.0 |1632900776|1 |
|41.0 |1632900176|1 |
+-----+----------+---------+
I want to add a column containing the rank of each frequency. The new dataframe would look like this:
+-----+----------+---------+------------+
|num |Timestamp |frequency|rank |
+-----+----------+---------+------------+
|20.0 |1632899456|4 |1 |
|20.0 |1632901256|4 |1 |
|20.0 |1632901796|4 |1 |
|20.0 |1632899155|4 |1 |
|10.0 |1632901743|2 |2 |
|10.0 |1632899933|2 |2 |
|91.0 |1632899756|1 |3 |
|32.0 |1632900776|1 |3 |
|41.0 |1632900176|1 |3 |
+-----+----------+---------+------------+
I am using Spark 2.4.3 and SQLContext, with Scala.
You can use dense_rank:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{dense_rank, desc}

val df2 = df.withColumn("rank", dense_rank().over(Window.orderBy(desc("frequency"))))
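One thing to be aware of: a window with orderBy but no partitionBy moves all rows into a single partition, and Spark logs a warning to that effect; that is fine for small data but worth knowing about on large dataframes. Also, dense_rank (rather than rank) is what produces the consecutive 1, 2, 3 values in the expected output, since rank would leave gaps after ties:
df2.orderBy(desc("frequency")).show(false)
// dense_rank: frequency 4 -> 1, frequency 2 -> 2, frequency 1 -> 3
// rank() would instead give 1,1,1,1,5,5,7,7,7 here because of the ties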

How to rename the columns produced by the count() function in Scala

I have the below df:
+------+-------+--------+
|student| vars|observed|
+------+-------+--------+
| 1| ABC | 19|
| 1| ABC | 1|
| 2| CDB | 1|
| 1| ABC | 8|
| 3| XYZ | 3|
| 1| ABC | 389|
| 2| CDB | 946|
| 1| ABC | 342|
+------+-------+--------+
I want to add a new frequency column by grouping on the two columns "student" and "vars" in Scala.
val frequency = df.groupBy($"student", $"vars").count()
This code generates a "count" column with the frequencies, but loses the "observed" column from the df.
I would like to create a new df as follows, without losing the "observed" column:
+-------+------+--------+-----------+
|student| vars |observed|total_count|
+-------+------+--------+-----------+
|      1| ABC  |       9|         22|
|      1| ABC  |       1|         22|
|      2| CDB  |       1|          7|
|      1| ABC  |       2|         22|
|      3| XYZ  |       3|          3|
|      1| ABC  |       8|         22|
|      2| CDB  |       6|          7|
|      1| ABC  |       2|         22|
+-------+------+--------+-----------+
You cannot do this directly, but there are a couple of ways:
You can join the original df with the count df (a minimal sketch of this is shown at the end of this answer).
You can collect the observed column while doing the aggregation and explode it again.
With explode:
import org.apache.spark.sql.functions._

val frequency = df
  .groupBy("student", "vars")
  .agg(collect_list("observed").as("observed_list"), count("*").as("total_count"))
  .select($"student", $"vars", explode($"observed_list").alias("observed"), $"total_count")
scala> frequency.show(false)
+-------+----+--------+-----------+
|student|vars|observed|total_count|
+-------+----+--------+-----------+
|3 |XYZ |3 |1 |
|2 |CDB |1 |2 |
|2 |CDB |946 |2 |
|1 |ABC |389 |5 |
|1 |ABC |342 |5 |
|1 |ABC |19 |5 |
|1 |ABC |1 |5 |
|1 |ABC |8 |5 |
+-------+----+--------+-----------+
We can use window functions as well:
import org.apache.spark.sql.expressions.Window

val windowSpec = Window.partitionBy("student", "vars")
val frequency = df.withColumn("total_count", count(col("student")).over(windowSpec))
frequency.show(false)
+-------+----+--------+-----------+
|student|vars|observed|total_count|
+-------+----+--------+-----------+
|3 |XYZ |3 |1 |
|2 |CDB |1 |2 |
|2 |CDB |946 |2 |
|1 |ABC |389 |5 |
|1 |ABC |342 |5 |
|1 |ABC |19 |5 |
|1 |ABC |1 |5 |
|1 |ABC |8 |5 |
+-------+----+--------+-----------+
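For the first option (joining the original df with the count df), a minimal sketch, reusing the imports above, could look like this:
val counts = df
  .groupBy("student", "vars")
  .agg(count("*").as("total_count"))

val withCounts = df.join(counts, Seq("student", "vars"))
withCounts.show(false)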

How to find the next occurring item from current row in a data frame using Spark Windowing?

I have the following Dataframe:
+------+----------+-------------+--------------------+---------+-----+----------+
|ID |MEM_ID | BFS | SVC_DT |TYP |SEQ |BFS_SEQ |
+------+----------+-------------+--------------------+---------+-----+----------+
|105771|29378668 | BRIMONIDINE | 2019-02-04 00:00:00|PD |1 |1 |
|105772|29378668 | BRIMONIDINE | 2019-04-04 00:00:00|PD |2 |2 |
|105773|29378668 | BRIMONIDINE | 2019-04-17 00:00:00|RV |3 |3 |
|105774|29378668 | TIMOLOL | 2019-04-17 00:00:00|RV |4 |1 |
|105775|29378668 | BRIMONIDINE | 2019-04-22 00:00:00|PD |5 |4 |
|105776|29378668 | TIMOLOL | 2019-04-22 00:00:00|PD |6 |2 |
+------+----------+-------------+--------------------+---------+-----+----------+
For every row, I have to find the next occurrence of TYP = 'PD' at the BFS level, starting from the current row, and populate its associated ID in a new column named 'NEXT_PD_TYP_ID'.
The output I am expecting is:
+------+---------+-------------+--------------------+----+-----+---------+---------------+
|ID |MEM_ID | BFS | SVC_DT |TYP |SEQ |BFS_SEQ |NEXT_PD_TYP_ID |
+------+---------+-------------+--------------------+----+-----+---------+---------------+
|105771|29378668 | BRIMONIDINE | 2019-02-04 00:00:00|PD |1 |1 |105772 |
|105772|29378668 | BRIMONIDINE | 2019-04-04 00:00:00|PD |2 |2 |105775 |
|105773|29378668 | BRIMONIDINE | 2019-04-17 00:00:00|RV |3 |3 |105775 |
|105774|29378668 | TIMOLOL | 2019-04-17 00:00:00|RV |4 |1 |105776 |
|105775|29378668 | BRIMONIDINE | 2019-04-22 00:00:00|PD |5 |4 |null |
|105776|29378668 | TIMOLOL | 2019-04-22 00:00:00|PD |6 |2 |null |
+------+---------+-------------+--------------------+----+-----+---------+---------------+
I need help with this.
I have tried conditional aggregation, max(when), but since there is more than one 'PD' the max returns only one value for all the rows.
There are no error messages.
I hope this helps.
I created a new column with the IDs of the rows where TYP === "PD" and called it TYPPDID.
Then I used a window frame ranging from the next row to the unbounded following row and took the first non-null TYPPDID.
The orderBy("ID") at the end is only there to show the records in order.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._ // assumes a SparkSession named spark (needed for toDF and $)
val df = Seq(
("105771", "BRIMONIDINE", "PD"),
("105772", "BRIMONIDINE", "PD"),
("105773", "BRIMONIDINE","RV"),
("105774", "TIMOLOL", "RV"),
("105775", "BRIMONIDINE", "PD"),
("105776", "TIMOLOL", "PD")
).toDF("ID", "BFS", "TYP").withColumn("TYPPDID", when($"TYP" === "PD", $"ID"))
df: org.apache.spark.sql.DataFrame = [ID: string, BFS: string ... 2 more fields]
scala> df.show
+------+-----------+---+-------+
| ID| BFS|TYP|TYPPDID|
+------+-----------+---+-------+
|105771|BRIMONIDINE| PD| 105771|
|105772|BRIMONIDINE| PD| 105772|
|105773|BRIMONIDINE| RV| null|
|105774| TIMOLOL| RV| null|
|105775|BRIMONIDINE| PD| 105775|
|105776| TIMOLOL| PD| 105776|
+------+-----------+---+-------+
scala> val overColumns = Window.partitionBy("BFS").orderBy("ID").rowsBetween(1, Window.unboundedFollowing)
overColumns: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@eb923ef
scala> df.withColumn("NEXT_PD_TYP_ID",first("TYPPDID", true).over(overColumns)).orderBy("ID").show(false)
+------+-----------+---+-------+--------------+
|ID    |BFS        |TYP|TYPPDID|NEXT_PD_TYP_ID|
+------+-----------+---+-------+--------------+
|105771|BRIMONIDINE|PD |105771 |105772        |
|105772|BRIMONIDINE|PD |105772 |105775        |
|105773|BRIMONIDINE|RV |null   |105775        |
|105774|TIMOLOL    |RV |null   |105776        |
|105775|BRIMONIDINE|PD |105775 |null          |
|105776|TIMOLOL    |PD |105776 |null          |
+------+-----------+---+-------+--------------+
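Applied to the full dataframe from the question, the same pattern would presumably partition by MEM_ID and BFS and order by SEQ (or SVC_DT). This is only a sketch, assuming the original column names and the same imports as above:
// Assumes a dataframe df with the original ID, MEM_ID, BFS, SVC_DT, TYP, SEQ columns.
val byBfs = Window.partitionBy("MEM_ID", "BFS")
  .orderBy("SEQ")
  .rowsBetween(1, Window.unboundedFollowing)

val withNext = df
  .withColumn("TYPPDID", when($"TYP" === "PD", $"ID"))
  .withColumn("NEXT_PD_TYP_ID", first("TYPPDID", ignoreNulls = true).over(byBfs))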

GroupBy based on conditions in Spark dataframe

I have two dataframes.
Dataframe1 contains key/value pairs:
+------+-----------------+
| Key | Value |
+------+-----------------+
| key1 | Column1 |
+------+-----------------+
| key2 | Column2 |
+------+-----------------+
| key3 | Column1,Column3 |
+------+-----------------+
Second dataframe:
This is the actual dataframe where I need to apply the groupBy operation:
+---------+---------+---------+--------+
| Column1 | Column2 | Column3 | Amount |
+---------+---------+---------+--------+
| A | A1 | XYZ | 100 |
+---------+---------+---------+--------+
| A | A1 | XYZ | 100 |
+---------+---------+---------+--------+
| A | A2 | XYZ | 10 |
+---------+---------+---------+--------+
| A | A3 | PQR | 100 |
+---------+---------+---------+--------+
| B | B1 | XYZ | 200 |
+---------+---------+---------+--------+
| B | B2 | PQR | 280 |
+---------+---------+---------+--------+
| B | B3 | XYZ | 20 |
+---------+---------+---------+--------+
Dataframe1 contains the key/value columns. For each key in dataframe1, I have to take the corresponding value and perform a groupBy with those columns on dataframe2:
Dframe = df.groupBy($"key").sum("amount").show()
Expected output: generate three dataframes, one per key in dataframe1.
d1 = df.groupBy($"key1").sum("amount").show()
which effectively has to be: df.groupBy($"Column1").sum("amount").show()
+---+-----+
| A | 310 |
+---+-----+
| B | 500 |
+---+-----+
Code:
d2 = df.groupBy($"key2").sum("amount").show()
which effectively is: df.groupBy($"Column2").sum("amount").show()
dataframe:
+----+-----+
| A1 | 200 |
+----+-----+
| A2 | 10 |
+----+-----+
Code:
d3 = df.groupBy($"key3").sum("amount").show()
DataFrame:
+---+-----+-----+
| A | XYZ | 320 |
+---+-----+-----+
| A | PQR | 10 |
+---+-----+-----+
| B | XYZ | 220 |
+---+-----+-----+
| B | PQR | 280 |
+---+-----+-----+
In the future, if I add more keys, it should still produce the corresponding dataframes. Can someone help me?
Given the key/value dataframe as (although, as explained below, I would suggest not building a dataframe from that source data in the first place)
+----+---------------+
|Key |Value |
+----+---------------+
|key1|Column1 |
|key2|Column2 |
|key3|Column1,Column3|
+----+---------------+
and actual dataframe as
+-------+-------+-------+------+
|Column1|Column2|Column3|Amount|
+-------+-------+-------+------+
|A |A1 |XYZ |100 |
|A |A1 |XYZ |100 |
|A |A2 |XYZ |10 |
|A |A3 |PQR |100 |
|B |B1 |XYZ |200 |
|B |B2 |PQR |280 |
|B |B3 |XYZ |20 |
+-------+-------+-------+------+
I would suggest collecting the first dataframe into key/value pairs on the driver (which is why it does not really need to be a dataframe at all), as
val maps = df1.rdd.map(row => row(0) -> row(1)).collect()
and then looping over the pairs, as
import org.apache.spark.sql.functions._

for (kv <- maps) {
  df2.groupBy(kv._2.toString.split(",").map(col): _*).agg(sum($"Amount")).show(false)
  // you can store the results in separate dataframes or write them to files or a database
}
You should get the following outputs:
+-------+-----------+
|Column1|sum(Amount)|
+-------+-----------+
|B |500 |
|A |310 |
+-------+-----------+
+-------+-----------+
|Column2|sum(Amount)|
+-------+-----------+
|A2 |10 |
|B2 |280 |
|B1 |200 |
|B3 |20 |
|A3 |100 |
|A1 |200 |
+-------+-----------+
+-------+-------+-----------+
|Column1|Column3|sum(Amount)|
+-------+-------+-----------+
|B |PQR |280 |
|B |XYZ |220 |
|A |PQR |100 |
|A |XYZ |210 |
+-------+-------+-----------+
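If you need to keep the results around instead of just showing them (for example when more keys are added later), one option is to hold them in a map keyed by the key name. This is only a sketch, under the same assumptions as above (df1 holds the key/value pairs and df2 the actual data):
val results: Map[String, org.apache.spark.sql.DataFrame] =
  df1.rdd.map(row => row(0).toString -> row(1).toString).collect().toMap
    .map { case (key, value) =>
      key -> df2.groupBy(value.split(",").map(col): _*).agg(sum($"Amount"))
    }

// e.g. results("key3").show(false) prints the Column1/Column3 aggregation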

How to transform the dataframe into label/feature vectors?

I am running a logistic regression model in Scala and I have a dataframe like the one below:
df
+-----------+------------+
|x |y |
+-----------+------------+
| 0| 0|
| 0| 33|
| 0| 58|
| 0| 96|
| 0| 1|
| 1| 21|
| 0| 10|
| 0| 65|
| 1| 7|
| 1| 28|
+-----------+------------+
I need to transform this into something like this:
+-----+------------------+
|label| features |
+-----+------------------+
| 0.0|(1,[1],[0]) |
| 0.0|(1,[1],[33]) |
| 0.0|(1,[1],[58]) |
| 0.0|(1,[1],[96]) |
| 0.0|(1,[1],[1]) |
| 1.0|(1,[1],[21]) |
| 0.0|(1,[1],[10]) |
| 0.0|(1,[1],[65]) |
| 1.0|(1,[1],[7]) |
| 1.0|(1,[1],[28]) |
+-----+------------------+
I tried
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
val assembler = new VectorAssembler()
.setInputCols(Array("x"))
.setOutputCol("Feature")
var lrModel= lr.fit(daf.withColumnRenamed("x","label").withColumnRenamed("y","features"))
Any help is appreciated.
Given the dataframe as
+---+---+
|x |y |
+---+---+
|0 |0 |
|0 |33 |
|0 |58 |
|0 |96 |
|0 |1 |
|1 |21 |
|0 |10 |
|0 |65 |
|1 |7 |
|1 |28 |
+---+---+
And doing as below
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.types.DoubleType

val assembler = new VectorAssembler()
  .setInputCols(Array("x", "y"))
  .setOutputCol("features")

val output = assembler.transform(df).select($"x".cast(DoubleType).as("label"), $"features")
output.show(false)
This would give you the following result:
+-----+----------+
|label|features |
+-----+----------+
|0.0 |(2,[],[]) |
|0.0 |[0.0,33.0]|
|0.0 |[0.0,58.0]|
|0.0 |[0.0,96.0]|
|0.0 |[0.0,1.0] |
|1.0 |[1.0,21.0]|
|0.0 |[0.0,10.0]|
|0.0 |[0.0,65.0]|
|1.0 |[1.0,7.0] |
|1.0 |[1.0,28.0]|
+-----+----------+
Now using LogisticRegression would be easy
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

val lrModel = lr.fit(output)
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
You will have output as
Coefficients: [1.5672602877378823,0.0] Intercept: -1.4055020984891717
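Note that the expected output in the question appears to use only y as the feature (with x as the label), whereas the assembler above puts both x and y into the features vector. If that is the intent, a small variation would be the following (an assumption about the intended feature set; the fitted coefficients and intercept will of course differ from the ones printed above):
val assemblerYOnly = new VectorAssembler()
  .setInputCols(Array("y"))
  .setOutputCol("features")

val outputYOnly = assemblerYOnly
  .transform(df)
  .select($"x".cast(DoubleType).as("label"), $"features")

val lrModelYOnly = lr.fit(outputYOnly)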