Spark collect_set from a column using window function approach - scala

I have a sample dataset with salaries. I want to distribute the salaries into 3 buckets, find the lowest salary in each bucket, collect those lower bounds into an array, and attach that array to every row of the original dataset. I am trying to use a window function to do that, but it builds the array progressively, row by row, instead of over the whole dataset.
Here is the code that I have written
val spark = sparkSession
import spark.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val simpleData = Seq(("James", "Sales", 3000),
("Michael", "Sales", 3100),
("Robert", "Sales", 3200),
("Maria", "Finance", 3300),
("James", "Sales", 3400),
("Scott", "Finance", 3500),
("Jen", "Finance", 3600),
("Jeff", "Marketing", 3700),
("Kumar", "Marketing", 3800),
("Saif", "Sales", 3900)
)
val df = simpleData.toDF("employee_name", "department", "salary")
val windowSpec = Window.orderBy("salary")
val ntileFrame = df.withColumn("ntile", ntile(3).over(windowSpec))
val lowWindowSpec = Window.partitionBy("ntile")
val ntileMinDf = ntileFrame.withColumn("lower_bound", min("salary").over(lowWindowSpec))
var rangeDf = ntileMinDf.withColumn("range", collect_set("lower_bound").over(windowSpec))
rangeDf.show()
I am getting the dataset like this
+-------------+----------+------+-----+-----------+------------------+
|employee_name|department|salary|ntile|lower_bound| range|
+-------------+----------+------+-----+-----------+------------------+
| James| Sales| 3000| 1| 3000| [3000]|
| Michael| Sales| 3100| 1| 3000| [3000]|
| Robert| Sales| 3200| 1| 3000| [3000]|
| Maria| Finance| 3300| 1| 3000| [3000]|
| James| Sales| 3400| 2| 3400| [3000, 3400]|
| Scott| Finance| 3500| 2| 3400| [3000, 3400]|
| Jen| Finance| 3600| 2| 3400| [3000, 3400]|
| Jeff| Marketing| 3700| 3| 3700|[3000, 3700, 3400]|
| Kumar| Marketing| 3800| 3| 3700|[3000, 3700, 3400]|
| Saif| Sales| 3900| 3| 3700|[3000, 3700, 3400]|
+-------------+----------+------+-----+-----------+------------------+
I am expecting the dataset to look like this
+-------------+----------+------+-----+-----------+------------------+
|employee_name|department|salary|ntile|lower_bound| range|
+-------------+----------+------+-----+-----------+------------------+
| James| Sales| 3000| 1| 3000|[3000, 3700, 3400]|
| Michael| Sales| 3100| 1| 3000|[3000, 3700, 3400]|
| Robert| Sales| 3200| 1| 3000|[3000, 3700, 3400]|
| Maria| Finance| 3300| 1| 3000|[3000, 3700, 3400]|
| James| Sales| 3400| 2| 3400|[3000, 3700, 3400]|
| Scott| Finance| 3500| 2| 3400|[3000, 3700, 3400]|
| Jen| Finance| 3600| 2| 3400|[3000, 3700, 3400]|
| Jeff| Marketing| 3700| 3| 3700|[3000, 3700, 3400]|
| Kumar| Marketing| 3800| 3| 3700|[3000, 3700, 3400]|
| Saif| Sales| 3900| 3| 3700|[3000, 3700, 3400]|
+-------------+----------+------+-----+-----------+------------------+

To ensure that your window takes into account all rows, and not only the rows up to the current one, you can use the rowsBetween method with Window.unboundedPreceding and Window.unboundedFollowing as arguments. Your last line thus becomes:
var rangeDf = ntileMinDf.withColumn(
  "range",
  collect_set("lower_bound")
    .over(Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
)
and you get the following rangeDf dataframe:
+-------------+----------+------+-----+-----------+------------------+
|employee_name|department|salary|ntile|lower_bound| range|
+-------------+----------+------+-----+-----------+------------------+
| James| Sales| 3000| 1| 3000|[3000, 3700, 3400]|
| Michael| Sales| 3100| 1| 3000|[3000, 3700, 3400]|
| Robert| Sales| 3200| 1| 3000|[3000, 3700, 3400]|
| Maria| Finance| 3300| 1| 3000|[3000, 3700, 3400]|
| James| Sales| 3400| 2| 3400|[3000, 3700, 3400]|
| Scott| Finance| 3500| 2| 3400|[3000, 3700, 3400]|
| Jen| Finance| 3600| 2| 3400|[3000, 3700, 3400]|
| Jeff| Marketing| 3700| 3| 3700|[3000, 3700, 3400]|
| Kumar| Marketing| 3800| 3| 3700|[3000, 3700, 3400]|
| Saif| Sales| 3900| 3| 3700|[3000, 3700, 3400]|
+-------------+----------+------+-----+-----------+------------------+
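For context, an ordered window without an explicit frame defaults to rangeBetween(Window.unboundedPreceding, Window.currentRow), which is why the original range column grew row by row. Below is a minimal sketch of an equivalent fix that keeps the ordering on the existing spec and only widens the frame; the fullWindow name is illustrative, not from the original answer.

// Keep the salary ordering, but let the frame span the whole partition
// so collect_set sees every row instead of only the rows so far.
val fullWindow = Window
  .orderBy("salary")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

val rangeDf = ntileMinDf.withColumn("range", collect_set("lower_bound").over(fullWindow))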

Related

Add another column after groupBy and agg

I have a df that looks like this:
+-----+-------+-----+
|docId|vocabId|count|
+-----+-------+-----+
| 3| 3| 600|
| 2| 3| 702|
| 1| 2| 120|
| 2| 5| 200|
| 2| 2| 500|
| 3| 1| 100|
| 3| 5| 2000|
| 3| 4| 122|
| 1| 3| 1200|
| 1| 1| 1000|
+-----+-------+-----+
I want to output the max count for each vocabId along with the docId it belongs to. I did this:
val wordCounts = docwords.groupBy("vocabId").agg(max($"count") as ("count"))
and got this:
+-------+----------+
|vocabId| count |
+-------+----------+
| 1| 1000|
| 3| 1200|
| 5| 2000|
| 4| 122|
| 2| 500|
+-------+----------+
How do I add the docId at the front?
It should look something like this (the order is not important):
+-----+-------+-----+
|docId|vocabId|count|
+-----+-------+-----+
| 2| 2| 500|
| 3| 5| 2000|
| 3| 4| 122|
| 1| 3| 1200|
| 1| 1| 1000|
+-----+-------+-----+
You can do a self join with docwords on count and vocabId, something like below:
val wordCounts = docwords
  .groupBy("vocabId")
  .agg(max($"count").as("count"))
  .join(docwords, Seq("vocabId", "count"))
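An alternative sketch, not part of the original answer, that avoids the self join: rank the rows of each vocabId by descending count with a window function and keep only the first one. Note that ties on the max count keep a single arbitrary row here, whereas the join keeps all of them.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Rank rows within each vocabId by descending count and keep only the top row.
val w = Window.partitionBy("vocabId").orderBy(col("count").desc)
val topPerVocab = docwords
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")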

Finding the maximum across several columns and unifying the result in a single column with Spark

I have the following Dataset:
+----+-----+--------+-----+--------+
| id|date1|address1|date2|address2|
+----+-----+--------+-----+--------+
| 1| 2019| Paris| 2018| Madrid|
| 2| 2020|New York| 2002| Geneva|
| 3| 1998| London| 2005| Tokyo|
| 4| 2005| Sydney| 2013| Berlin|
+----+-----+--------+-----+--------+
I am trying to obtain the most recent date and the corresponding address for each id in two additional columns. The desired result is:
+----+-----+--------+-----+--------+--------+-----------+
| id|date1|address1|date2|address2|date_max|address_max|
+----+-----+--------+-----+--------+--------+-----------+
| 1| 2019| Paris| 2018| Madrid| 2019| Paris|
| 2| 2020|New York| 2002| Geneva| 2020| New York|
| 3| 1998| London| 2005| Tokyo| 2005| Tokyo|
| 4| 2005| Sydney| 2013| Berlin| 2013| Berlin|
+----+-----+--------+-----+--------+--------+-----------+
Any ideas on how to do this efficiently?
You can do a CASE WHEN to pick the more recent date/address:
import org.apache.spark.sql.functions._
val date_max = when(col("date1") > col("date2"), col("date1")).otherwise(col("date2")).alias("date_max")
val address_max = when(col("date1") > col("date2"), col("address1")).otherwise(col("address2")).alias("address_max")
val result = df.select(col("*"), date_max, address_max)
If you want a more scalable option with many columns:
val df2 = df.withColumn(
    "all_date",
    array(df.columns.filter(_.contains("date")).map(col): _*)
  ).withColumn(
    "all_address",
    array(df.columns.filter(_.contains("address")).map(col): _*)
  ).withColumn(
    "date_max",
    array_max($"all_date")
  ).withColumn(
    "address_max",
    element_at($"all_address", array_position($"all_date", array_max($"all_date")).cast("int"))
  ).drop("all_date", "all_address")
df2.show
+---+-----+--------+-----+--------+--------+-----------+
| id|date1|address1|date2|address2|date_max|address_max|
+---+-----+--------+-----+--------+--------+-----------+
|  1| 2019|   Paris| 2018|  Madrid|    2019|      Paris|
|  2| 2020| NewYork| 2002|  Geneva|    2020|    NewYork|
|  3| 1998|  London| 2005|   Tokyo|    2005|      Tokyo|
|  4| 2005|  Sydney| 2013|  Berlin|    2013|     Berlin|
+---+-----+--------+-----+--------+--------+-----------+
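If the columns always come in (dateN, addressN) pairs, another hedged sketch is to fold over those pairs with when, carrying along the address of the greatest date seen so far. The pairs value and the result name are illustrative, not from the original answer.

import org.apache.spark.sql.functions._

// Pairs of (date column, address column); extend this list for more columns.
val pairs = Seq(("date1", "address1"), ("date2", "address2"))

// Keep the (date, address) of whichever pair has the greatest date so far.
val (dateMax, addressMax) = pairs.tail.foldLeft((col(pairs.head._1), col(pairs.head._2))) {
  case ((bestDate, bestAddr), (d, a)) =>
    (when(col(d) > bestDate, col(d)).otherwise(bestDate),
     when(col(d) > bestDate, col(a)).otherwise(bestAddr))
}

val result = df.select(col("*"), dateMax.alias("date_max"), addressMax.alias("address_max"))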

Spark Dataframe Scala: add new columns by some conditions

I revised my question so that it is easier to understand.
The original df looks like this:
+---+----------+-------+----+------+
| id|tim |price | qty|qtyChg|
+---+----------+-------+----+------+
| 1| 31951.509| 0.370| 1| 1|
| 2| 31951.515|145.380| 100| 100|
| 3| 31951.519|149.370| 100| 100|
| 4| 31951.520|144.370| 100| 100|
| 5| 31951.520|119.370| 5| 5|
| 6| 31951.520|149.370| 300| 200|
| 7| 31951.521|149.370| 400| 100|
| 8| 31951.522|149.370| 410| 10|
| 9| 31951.522|149.870| 50| 50|
| 10| 31951.522|109.370| 50| 50|
| 11| 31951.522|144.370| 400| 300|
| 12| 31951.524|149.370| 610| 200|
| 13| 31951.526|135.130| 22| 22|
| 14| 31951.527|149.370| 750| 140|
| 15| 31951.528| 89.370| 100| 100|
| 16| 31951.528|145.870| 50| 50|
| 17| 31951.528|139.370| 100| 100|
| 18| 31951.531|144.370| 410| 10|
| 19| 31951.531|149.370| 769| 19|
| 20| 31951.538|149.370| 869| 100|
| 21| 31951.538|144.880| 200| 200|
| 22| 31951.541|139.370| 221| 121|
| 23| 31951.542|149.370|1199| 330|
| 24| 31951.542|139.370| 236| 15|
| 25| 31951.542|144.370| 510| 100|
| 26| 31951.543|146.250| 50| 50|
| 27| 31951.543|143.820| 100| 100|
| 28| 31951.543|139.370| 381| 145|
| 29| 31951.544|149.370|1266| 67|
| 30| 31951.544|150.000| 50| 50|
| 31| 31951.544|137.870| 300| 300|
| 32| 31951.544|140.470| 10| 10|
| 33| 31951.545|150.000| 53| 3|
| 34| 31951.545|140.000| 25| 25|
| 35| 31951.545|148.310| 8| 8|
| 36| 31951.547|149.000| 20| 20|
| 37| 31951.549|143.820| 102| 2|
| 38| 31951.549|150.110| 75| 75|
+---+----------+-------+----+------+
Then I run the code:
val ww = Window.partitionBy().orderBy($"tim")
val step1 = df
  .withColumn("sequence", sort_array(collect_set(col("price")).over(ww), asc = false))
  .withColumn("top1price", col("sequence").getItem(0))
  .withColumn("top2price", col("sequence").getItem(1))
  .drop("sequence")
The new dataframe looks like this:
+---+---------+-------+----+------+---------+---------+
| id| tim| price| qty|qtyChg|top1price|top2price|
+---+---------+-------+----+------+---------+---------+
| 1|31951.509| 0.370| 1| 1| 0.370| null|
| 2|31951.515|145.380| 100| 100| 145.380| 0.370|
| 3|31951.519|149.370| 100| 100| 149.370| 145.380|
| 4|31951.520|149.370| 300| 200| 149.370| 145.380|
| 5|31951.520|144.370| 100| 100| 149.370| 145.380|
| 6|31951.520|119.370| 5| 5| 149.370| 145.380|
| 7|31951.521|149.370| 400| 100| 149.370| 145.380|
| 8|31951.522|109.370| 50| 50| 149.870| 149.370|
| 9|31951.522|144.370| 400| 300| 149.870| 149.370|
| 10|31951.522|149.870| 50| 50| 149.870| 149.370|
| 11|31951.522|149.370| 410| 10| 149.870| 149.370|
| 12|31951.524|149.370| 610| 200| 149.870| 149.370|
| 13|31951.526|135.130| 22| 22| 149.870| 149.370|
| 14|31951.527|149.370| 750| 140| 149.870| 149.370|
| 15|31951.528| 89.370| 100| 100| 149.870| 149.370|
| 16|31951.528|139.370| 100| 100| 149.870| 149.370|
| 17|31951.528|145.870| 50| 50| 149.870| 149.370|
| 18|31951.531|144.370| 410| 10| 149.870| 149.370|
| 19|31951.531|149.370| 769| 19| 149.870| 149.370|
| 20|31951.538|144.880| 200| 200| 149.870| 149.370|
| 21|31951.538|149.370| 869| 100| 149.870| 149.370|
| 22|31951.541|139.370| 221| 121| 149.870| 149.370|
| 23|31951.542|144.370| 510| 100| 149.870| 149.370|
| 24|31951.542|139.370| 236| 15| 149.870| 149.370|
| 25|31951.542|149.370|1199| 330| 149.870| 149.370|
| 26|31951.543|139.370| 381| 145| 149.870| 149.370|
| 27|31951.543|143.820| 100| 100| 149.870| 149.370|
| 28|31951.543|146.250| 50| 50| 149.870| 149.370|
| 29|31951.544|140.470| 10| 10| 150.000| 149.870|
| 30|31951.544|137.870| 300| 300| 150.000| 149.870|
| 31|31951.544|150.000| 50| 50| 150.000| 149.870|
| 32|31951.544|149.370|1266| 67| 150.000| 149.870|
| 33|31951.545|140.000| 25| 25| 150.000| 149.870|
| 34|31951.545|150.000| 53| 3| 150.000| 149.870|
| 35|31951.545|148.310| 8| 8| 150.000| 149.870|
| 36|31951.547|149.000| 20| 20| 150.000| 149.870|
| 37|31951.549|150.110| 75| 75| 150.110| 150.000|
| 38|31951.549|143.820| 102| 2| 150.110| 150.000|
+---+---------+-------+----+------+---------+---------+
I am hoping to get two new columns, top1priceQty and top2priceQty, which store the most recently updated qty corresponding to top1price and top2price.
For example, in row 6, top1price = 149.370; based on this value, I want to get its corresponding qty, which is 400 (not 100 or 300). In row 33, when top1price = 150.000, I want to get its corresponding qty, which is the 53 that comes from row 32, not the 50 from row 28. The same rule applies to top2price.
Thank you all in advance!
You were very close to the answer yourself. Instead of collecting a set of just one column, collect an array of 'LMTPRICE' (the price column here) together with its corresponding 'qty'. Then use getItem(0).getItem(0) for top1price and getItem(0).getItem(2) for top1priceQty. To keep the ordering by INTEREST_TIME (the tim column) so the correct qty is picked, include INTEREST_TIME in the array as well, after 'LMTPRICE' and before 'qty'.
df.withColumn("sequence",sort_array(collect_set(array("LMTPRICE","INTEREST_TIME","qty")).over(ww),asc=false)).withColumn("top1price",col("sequence").getItem(0).getItem(0)).withColumn("top1priceQty",col("sequence").getItem(0).getItem(2).cast("int")).drop("sequence").show(false)
+-----+-------------+--------+---+------+---------+------------+
|index|INTEREST_TIME|LMTPRICE|qty|qtyChg|top1price|top1priceQty|
+-----+-------------+--------+---+------+---------+------------+
|0 |31951.509 |0.37 |1 |1 |0.37 |1 |
|1 |31951.515 |145.38 |100|100 |145.38 |100 |
|2 |31951.519 |149.37 |100|100 |149.37 |100 |
|3 |31951.52 |119.37 |5 |5 |149.37 |300 |
|4 |31951.52 |144.37 |100|100 |149.37 |300 |
|5 |31951.52 |149.37 |300|200 |149.37 |300 |
|6 |31951.521 |149.37 |400|100 |149.37 |400 |
|7 |31951.522 |149.87 |50 |50 |149.87 |50 |
|8 |31951.522 |149.37 |410|10 |149.87 |50 |
|9 |31951.522 |109.37 |50 |50 |149.87 |50 |
|10 |31951.522 |144.37 |400|300 |149.87 |50 |
|11 |31951.524 |149.87 |610|200 |149.87 |610 |
|12 |31951.526 |135.13 |22 |22 |149.87 |610 |
|13 |31951.527 |149.37 |750|140 |149.87 |610 |
|14 |31951.528 |139.37 |100|100 |149.87 |610 |
|15 |31951.528 |145.87 |50 |50 |149.87 |610 |
|16 |31951.528 |89.37 |100|100 |149.87 |610 |
|17 |31951.531 |144.37 |410|10 |149.87 |610 |
|18 |31951.531 |149.37 |769|19 |149.87 |610 |
|19 |31951.538 |149.37 |869|100 |149.87 |610 |
+-----+-------------+--------+---+------+---------+------------+
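A related sketch of the same idea using a struct instead of an array, so qty keeps its integer type rather than being coerced to the array's common element type. Field order (price, tim, qty) drives the sort exactly as in the array version. This assumes sort_array and collect_set accept struct elements in your Spark version, and it uses the question's column names (price, tim, qty) rather than the answer's (LMTPRICE, INTEREST_TIME).

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Same running window as in the question: ordered by time, default growing frame.
val ww = Window.partitionBy().orderBy($"tim")

val step = df
  .withColumn("sequence", sort_array(collect_set(struct($"price", $"tim", $"qty")).over(ww), asc = false))
  .withColumn("top1price", $"sequence".getItem(0).getField("price"))
  .withColumn("top1priceQty", $"sequence".getItem(0).getField("qty"))
  .drop("sequence")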

Taking a sum in Spark Scala based on a condition

I have a data frame like this. How can I take the sum of the column Sales where the Rank is greater than 3, per 'M'?
+---+-----+----+
| M|Sales|Rank|
+---+-----+----+
| M1| 200| 1|
| M1| 175| 2|
| M1| 150| 3|
| M1| 125| 4|
| M1| 90| 5|
| M1| 85| 6|
| M2| 1001| 1|
| M2| 500| 2|
| M2| 456| 3|
| M2| 345| 4|
| M2| 231| 5|
| M2| 123| 6|
+---+-----+----+
Expected Output --
+---+-----+----+---------------+
| M|Sales|Rank|SumGreaterThan3|
+---+-----+----+---------------+
| M1| 200| 1| 300|
| M1| 175| 2| 300|
| M1| 150| 3| 300|
| M1| 125| 4| 300|
| M1| 90| 5| 300|
| M1| 85| 6| 300|
| M2| 1001| 1| 699|
| M2| 500| 2| 699|
| M2| 456| 3| 699|
| M2| 345| 4| 699|
| M2| 231| 5| 699|
| M2| 123| 6| 699|
+---+-----+----+---------------+
I have done a sum over a window like this:
df.withColumn("SumGreaterThan3", sum("Sales").over(Window.partitionBy(col("M")))) // But this will provide the total sum of Sales.
To replicate the same DF:
val df = Seq(
("M1",200,1),
("M1",175,2),
("M1",150,3),
("M1",125,4),
("M1",90,5),
("M1",85,6),
("M2",1001,1),
("M2",500,2),
("M2",456,3),
("M2",345,4),
("M2",231,5),
("M2",123,6)
).toDF("M","Sales","Rank")
Well, partitioning alone is enough to define the window. You then have to make the summation conditional by combining sum and when.
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy("M")
df.withColumn("SumGreaterThan3", sum(when('Rank > 3, 'Sales).otherwise(0)).over(w).alias("sum")).show
This will give you the expected results.
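An equivalent sketch without a window function: aggregate the conditional sum per M and join it back to every row. The condSums name is illustrative; note that a group with no Rank > 3 rows would get null here instead of 0, which you could patch with coalesce.

import org.apache.spark.sql.functions._

// Per-M sum of Sales restricted to Rank > 3, then broadcast back to all rows of that M.
val condSums = df.filter(col("Rank") > 3)
  .groupBy("M")
  .agg(sum("Sales").as("SumGreaterThan3"))

df.join(condSums, Seq("M"), "left").show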

Apache Spark group by combining types and sub types

I have this dataset in Spark:
val sales = Seq(
("Warsaw", 2016, "facebook","share",100),
("Warsaw", 2017, "facebook","like",200),
("Boston", 2015,"twitter","share",50),
("Boston", 2016,"facebook","share",150),
("Toronto", 2017,"twitter","like",50)
).toDF("city", "year","media","action","amount")
I can now group this by city and media like this,
val groupByCityAndYear = sales
.groupBy("city", "media")
.count()
groupByCityAndYear.show()
+-------+--------+-----+
| city| media|count|
+-------+--------+-----+
| Boston|facebook| 1|
| Boston| twitter| 1|
|Toronto| twitter| 1|
| Warsaw|facebook| 2|
+-------+--------+-----+
But how can I combine media and action together in one column, so that the expected output is:
+-------+--------+-----+
| Boston|facebook| 1|
| Boston| share | 2|
| Boston| twitter| 1|
|Toronto| twitter| 1|
|Toronto| like | 1|
| Warsaw|facebook| 2|
| Warsaw|share | 1|
| Warsaw|like | 1|
+-------+--------+-----+
Combine the media and action columns into an array column, explode it, then do a groupBy count:
sales.select(
  $"city", explode(array($"media", $"action")).as("mediaAction")
).groupBy("city", "mediaAction").count().show()
+-------+-----------+-----+
| city|mediaAction|count|
+-------+-----------+-----+
| Boston| share| 2|
| Boston| facebook| 1|
| Warsaw| share| 1|
| Boston| twitter| 1|
| Warsaw| like| 1|
|Toronto| twitter| 1|
|Toronto| like| 1|
| Warsaw| facebook| 2|
+-------+-----------+-----+
Or, assuming media and action don't intersect (the two columns have no common values):
sales.groupBy("city", "media").count().union(
sales.groupBy("city", "action").count()
).show
+-------+--------+-----+
| city| media|count|
+-------+--------+-----+
| Boston|facebook| 1|
| Boston| twitter| 1|
|Toronto| twitter| 1|
| Warsaw|facebook| 2|
| Boston| share| 2|
| Warsaw| share| 1|
| Warsaw| like| 1|
|Toronto| like| 1|
+-------+--------+-----+
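If you want the combined column to be called mediaAction in the union variant as well, a small sketch: rename each side before the union. Union matches columns by position, so this only changes the header, not how the rows are combined; the combined name is illustrative.

val combined = sales.groupBy("city", "media").count()
  .withColumnRenamed("media", "mediaAction")
  .union(
    sales.groupBy("city", "action").count()
      .withColumnRenamed("action", "mediaAction")
  )
combined.show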