Finding percentile per group in Spark-Scala - scala

I am trying to compute a percentile over a column using a window function, as below. I have referred here for the ApproxQuantile definition over a group.
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, typedLit}

val df1 = Seq(
  (1, 10.0), (1, 20.0), (1, 40.6), (1, 15.6), (1, 17.6), (1, 25.6),
  (1, 39.6), (2, 20.5), (2, 70.3), (2, 69.4), (2, 74.4), (2, 45.4),
  (3, 60.6), (3, 80.6), (4, 30.6), (4, 90.6)
).toDF("ID", "Count")

val idBucketMapping = Seq((1, 4), (2, 3), (3, 2), (4, 2))
  .toDF("ID", "Bucket")

// wrapper that exposes ApproximatePercentile as a usable Column function
object PercentileApprox {
  def percentile_approx(col: Column, percentage: Column,
                        accuracy: Column): Column = {
    val expr = new ApproximatePercentile(
      col.expr, percentage.expr, accuracy.expr
    ).toAggregateExpression
    new Column(expr)
  }
  def percentile_approx(col: Column, percentage: Column): Column =
    percentile_approx(col, percentage,
      lit(ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY))
}
import PercentileApprox._

// percentages for a given number of buckets, e.g. doBucketing(4) -> 0.0, 0.25, 0.5, 0.75
def doBucketing(bucket_size: Int) = (1 until bucket_size)
  .scanLeft(0d)((a, _) => a + (1 / bucket_size.toDouble))

val res = df1
  .withColumn("percentile",
    percentile_approx(col("Count"), typedLit(doBucketing(2)))
      .over(Window.partitionBy("ID"))
  )
scala> df1.show
+---+-----+
| ID|Count|
+---+-----+
| 1| 10.0|
| 1| 20.0|
| 1| 40.6|
| 1| 15.6|
| 1| 17.6|
| 1| 25.6|
| 1| 39.6|
| 2| 20.5|
| 2| 70.3|
| 2| 69.4|
| 2| 74.4|
| 2| 45.4|
| 3| 60.6|
| 3| 80.6|
| 4| 30.6|
| 4| 90.6|
+---+-----+
scala> idBucketMapping.show
+---+------+
| ID|Bucket|
+---+------+
| 1| 4|
| 2| 3|
| 3| 2|
| 4| 2|
+---+------+
scala> res.show
+---+-----+------------------+
| ID|Count| percentile|
+---+-----+------------------+
| 1| 10.0|[10.0, 20.0, 40.6]|
| 1| 20.0|[10.0, 20.0, 40.6]|
| 1| 40.6|[10.0, 20.0, 40.6]|
| 1| 15.6|[10.0, 20.0, 40.6]|
| 1| 17.6|[10.0, 20.0, 40.6]|
| 1| 25.6|[10.0, 20.0, 40.6]|
| 1| 39.6|[10.0, 20.0, 40.6]|
| 3| 60.6|[60.6, 60.6, 80.6]|
| 3| 80.6|[60.6, 60.6, 80.6]|
| 4| 30.6|[30.6, 30.6, 90.6]|
| 4| 90.6|[30.6, 30.6, 90.6]|
| 2| 20.5|[20.5, 69.4, 74.4]|
| 2| 70.3|[20.5, 69.4, 74.4]|
| 2| 69.4|[20.5, 69.4, 74.4]|
| 2| 74.4|[20.5, 69.4, 74.4]|
| 2| 45.4|[20.5, 69.4, 74.4]|
+---+-----+------------------+
Up to here everything works and the logic is simple. But I need the results in a dynamic fashion: the argument passed to doBucketing (2 above) should be taken from idBucketMapping based on the ID value.
This seems a little bit tricky to me. Is this possible by any means?
Expected output (the number of percentile buckets is driven by the idBucketMapping DataFrame):
+---+-----+------------------------+
|ID |Count|percentile |
+---+-----+------------------------+
|1 |10.0 |[10.0, 15.6, 20.0, 39.6]|
|1 |20.0 |[10.0, 15.6, 20.0, 39.6]|
|1 |40.6 |[10.0, 15.6, 20.0, 39.6]|
|1 |15.6 |[10.0, 15.6, 20.0, 39.6]|
|1 |17.6 |[10.0, 15.6, 20.0, 39.6]|
|1 |25.6 |[10.0, 15.6, 20.0, 39.6]|
|1 |39.6 |[10.0, 15.6, 20.0, 39.6]|
|3 |60.6 |[60.6, 60.6] |
|3 |80.6 |[60.6, 60.6] |
|4 |30.6 |[30.6, 30.6] |
|4 |90.6 |[30.6, 30.6] |
|2 |20.5 |[20.5, 45.4, 70.3] |
|2 |70.3 |[20.5, 45.4, 70.3] |
|2 |69.4 |[20.5, 45.4, 70.3] |
|2 |74.4 |[20.5, 45.4, 70.3] |
|2 |45.4 |[20.5, 45.4, 70.3] |
+---+-----+------------------------+

I have a solution for you that is rather inelegant and works only if you have a limited number of possible bucket sizes.
My first version is very ugly.
// for the sake of clarity, let's define a function that generates the
// window aggregation
def per(x : Int) = percentile_approx(col("count"), typedLit(doBucketing(x)))
.over(Window.partitionBy("ID"))
// then, we simply try to match the Bucket column with a possible value
val res = df1
.join(idBucketMapping, Seq("ID"))
.withColumn("percentile", when('Bucket === 2, per(2)
.otherwise(when('Bucket === 3, per(3))
.otherwise(per(4)))
)
That's nasty but it works in your case.
Slightly less ugly, but with the very same logic: you can define a set of possible numbers of buckets and use it to do the same thing as above.
val possible_number_of_buckets = 2 to 5
val res = df1
  .join(idBucketMapping, Seq("ID"))
  .withColumn("percentile",
    possible_number_of_buckets.tail
      .foldLeft(per(possible_number_of_buckets.head)) { (column, size) =>
        when('Bucket === size, per(size)).otherwise(column)
      })

percentile_approx takes a percentage and an accuracy. It seems they both must be constant literals, so we cannot compute percentile_approx at runtime with a dynamically calculated percentage or accuracy.
Ref: the Apache Spark percentile_approx source on GitHub.
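If the set of bucket sizes is not known in advance, one possible workaround is a sketch under the question's own setup (df1, idBucketMapping, doBucketing and the PercentileApprox wrapper defined above): run the window aggregation once per distinct bucket size, so the percentage stays a constant literal within each branch, and union the pieces.

// Hedged sketch: one aggregation per distinct bucket size, then union the results.
// Assumes idBucketMapping is small enough to collect to the driver and is non-empty.
val bucketSizes = idBucketMapping.select("Bucket").distinct.as[Int].collect()
val resDynamic = bucketSizes
  .map { size =>
    df1
      .join(idBucketMapping.filter($"Bucket" === size), Seq("ID"))
      .withColumn("percentile",
        percentile_approx(col("Count"), typedLit(doBucketing(size)))
          .over(Window.partitionBy("ID")))
  }
  .reduce(_ union _)
resDynamic.show(false)

If I recall correctly, Spark 3.1+ also ships percentile_approx directly in org.apache.spark.sql.functions, but its percentage argument still has to be foldable, so the same limitation applies there.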

Related

Spark: explode multiple columns into one

Is it possible to explode multiple columns into one new column in Spark? I have a DataFrame which looks like this:
userId  varA     varB
1       [0,2,5]  [1,2,9]
Desired output:
userId  bothVars
1       0
1       2
1       5
1       1
1       2
1       9
What I have tried so far:
val explodedDf = df.withColumn("bothVars", explode($"varA")).drop("varA")
.withColumn("bothVars", explode($"varB")).drop("varB")
which doesn't work. Any suggestions are much appreciated.
You could wrap the two arrays into one and flatten the nested array before exploding it, as shown below:
val df = Seq(
(1, Seq(0, 2, 5), Seq(1, 2, 9)),
(2, Seq(1, 3, 4), Seq(2, 3, 8))
).toDF("userId", "varA", "varB")
df.
select($"userId", explode(flatten(array($"varA", $"varB"))).as("bothVars")).
show
// +------+--------+
// |userId|bothVars|
// +------+--------+
// | 1| 0|
// | 1| 2|
// | 1| 5|
// | 1| 1|
// | 1| 2|
// | 1| 9|
// | 2| 1|
// | 2| 3|
// | 2| 4|
// | 2| 2|
// | 2| 3|
// | 2| 8|
// +------+--------+
Note that flatten is available on Spark 2.4+.
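For Spark versions before 2.4 (where flatten is not available), a similar result can be obtained by exploding each array on its own and unioning the two results; this is only a sketch and the row order will differ:

// Hedged sketch for Spark < 2.4: explode each array separately and union the pieces.
val exploded = df.select($"userId", explode($"varA").as("bothVars"))
  .union(df.select($"userId", explode($"varB").as("bothVars")))
exploded.show(false)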
Use array_union and then the explode function. Note that array_union removes duplicate elements, which is why the duplicated value 2 for userId 1 does not appear in the output below.
scala> df.show(false)
+------+---------+---------+
|userId|varA |varB |
+------+---------+---------+
|1 |[0, 2, 5]|[1, 2, 9]|
|2 |[1, 3, 4]|[2, 3, 8]|
+------+---------+---------+
scala> df
.select($"userId",explode(array_union($"varA",$"varB")).as("bothVars"))
.show(false)
+------+--------+
|userId|bothVars|
+------+--------+
|1 |0 |
|1 |2 |
|1 |5 |
|1 |1 |
|1 |9 |
|2 |1 |
|2 |3 |
|2 |4 |
|2 |2 |
|2 |8 |
+------+--------+
array_union is available in Spark 2.4+
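If the duplicates are supposed to be kept (as in the desired output above), the two array columns can be concatenated instead of unioned; a minimal sketch, also relying on Spark 2.4+ for concat on array columns:

// Hedged sketch: concat keeps duplicate elements, unlike array_union (Spark 2.4+).
df.select($"userId", explode(concat($"varA", $"varB")).as("bothVars")).show(false)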

Calculate residual amount in dataframe column

I have a "capacity" dataframe:
scala> sql("create table capacity (id String, capacity Int)");
scala> sql("insert into capacity values ('A', 50), ('B', 100)");
scala> sql("select * from capacity").show(false)
+---+--------+
|id |capacity|
+---+--------+
|A |50 |
|B |100 |
+---+--------+
I have another "used" dataframe with following information:
scala> sql ("create table used (id String, capacityId String, used Int)");
scala> sql ("insert into used values ('item1', 'A', 10), ('item2', 'A', 20), ('item3', 'A', 10), ('item4', 'B', 30), ('item5', 'B', 40), ('item6', 'B', 40)")
scala> sql("select * from used order by capacityId").show(false)
+-----+----------+----+
|id |capacityId|used|
+-----+----------+----+
|item1|A |10 |
|item3|A |10 |
|item2|A |20 |
|item6|B |40 |
|item4|B |30 |
|item5|B |40 |
+-----+----------+----+
Column "capacityId" of the "used" dataframe is foreign key to column "id" of the "capacity" dataframe.
I want to calculate the "capacityLeft" column which is residual amount at that point of time.
+-----+----------+----+--------------+
|id |capacityId|used| capacityLeft |
+-----+----------+----+--------------+
|item1|A |10 |40 | <- 50(capacity of 'A')-10
|item3|A |10 |30 | <- 40-10
|item2|A |20 |10 | <- 30-20
|item6|B |40 |60 | <- 100(capacity of 'B')-40
|item4|B |30 |30 | <- 60-30
|item5|B |40 |-10 | <- 30-40
+-----+----------+----+--------------+
In the real scenario, a "createdDate" column is used for ordering the rows of the "used" dataframe.
Spark version: 2.2
This can be solved by using window functions in Spark. Note that for this to work there needs to be a column that keeps track of the row order within each capacityId.
Start by joining the two dataframes together:
val df = used.join(capacity.withColumnRenamed("id", "capacityId"), Seq("capacityId"), "inner")
Here the id column of the capacity dataframe is renamed to capacityId so it matches the join key in the used dataframe and does not leave a duplicate column behind.
Now create a window and calculate the cumulative sum of the used column, then subtract it from the capacity to get the remaining amount:
val w = Window.partitionBy("capacityId").orderBy("createdDate")
val df2 = df.withColumn("capacityLeft", $"capacity" - sum($"used").over(w))
Resulting dataframe with example createdDate column:
+----------+-----+----+-----------+--------+------------+
|capacityId| id|used|createdDate|capacity|capacityLeft|
+----------+-----+----+-----------+--------+------------+
| B|item6| 40| 1| 100| 60|
| B|item4| 30| 2| 100| 30|
| B|item5| 40| 3| 100| -10|
| A|item1| 10| 1| 50| 40|
| A|item3| 10| 2| 50| 30|
| A|item2| 20| 3| 50| 10|
+----------+-----+----+-----------+--------+------------+
Any unwanted columns can now be removed with drop.
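For example, a minimal sketch, assuming the helper createdDate and capacity columns are no longer needed:

// Hedged sketch: drop the columns that were only needed for the computation.
val result = df2.drop("capacity", "createdDate")
result.show(false)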

Scala/Spark: drop duplicates based on another column's specific value [duplicate]

This question already has answers here:
How to select the first row of each group?
(9 answers)
Closed 1 year ago.
I want to drop duplicate rows that share the same id but do not have a specific value in another column (in this case, among rows with the same id, keep the one where value = 1).
Input df:
+---+-----+------+
| id|value|sorted|
+---+-----+------+
| 3| 0| 2|
| 3| 1| 3|
| 4| 0| 6|
| 4| 1| 5|
| 5| 4| 6|
+---+-----+------+
Result I want:
+---+-----+------+
| id|value|sorted|
+---+-----+------+
| 3| 1| 3|
| 4| 1| 5|
| 5| 4| 6|
+---+-----+------+
This can be done by getting the rows where value is 1 and then left joining them with the original data:
val df = List(
(3, 0, 2),
(3, 1, 3),
(4, 0, 6),
(4, 1, 5),
(5, 4, 6)
).toDF("id", "value", "sorted")
val withOne = df.filter($"value" === 1)
val joinedWithOriginal = df.alias("orig").join(withOne.alias("one"), Seq("id"), "left")
val result = joinedWithOriginal
.where($"one.value".isNull || $"one.value" === $"orig.value")
.select("orig.id", "orig.value", "orig.sorted")
result.show(false)
Output:
+---+-----+------+
|id |value|sorted|
+---+-----+------+
|3 |1 |3 |
|4 |1 |5 |
|5 |4 |6 |
+---+-----+------+
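Since the question is closed as a duplicate of "How to select the first row of each group?", a window-based sketch in that spirit is shown below. It keeps the row with the highest value per id, which matches the expected output here; treating "value = 1" as "the preferred row" is my assumption.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Hedged sketch: rank rows per id and keep only the top-ranked one.
val w = Window.partitionBy("id").orderBy($"value".desc)
df.withColumn("rn", row_number().over(w))
  .filter($"rn" === 1)
  .drop("rn")
  .show(false)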

Perform Arithmetic Operations on multiple columns in Spark dataframe

I have an input Spark DataFrame named df as follows:
+---------------+---+---+---+-----------+
|Main_CustomerID| P1| P2| P3|Total_Count|
+---------------+---+---+---+-----------+
| 725153| 1| 0| 2| 3|
| 873008| 0| 0| 3| 3|
| 625109| 1| 1| 0| 2|
+---------------+---+---+---+-----------+
Here Total_Count is the sum of P1, P2 and P3, where P1, P2 and P3 are product names. I need to find the frequency of each product by dividing each product's value by its Total_Count, and create a new Spark DataFrame named frequencyTable as follows:
+---------------+------------------+---+------------------+-----------+
|Main_CustomerID| P1| P2| P3|Total_Count|
+---------------+------------------+---+------------------+-----------+
| 725153|0.3333333333333333|0.0|0.6666666666666666| 3|
| 873008| 0.0|0.0| 1.0| 3|
| 625109| 0.5|0.5| 0.0| 2|
+---------------+------------------+---+------------------+-----------+
I have done this using Scala as:
val df_columns = df.columns.toSeq
var frequencyTable = df
for (index <- df_columns) {
if (index != "Main_CustomerID" && index != "Total_Count") {
frequencyTable = frequencyTable.withColumn(index, df.col(index) / df.col("Total_Count"))
}
}
But I would prefer to avoid this for loop because my df is large. What is the optimized solution?
If you have a dataframe such as
val df = Seq(
("725153", 1, 0, 2, 3),
("873008", 0, 0, 3, 3),
("625109", 1, 1, 0, 2)
).toDF("Main_CustomerID", "P1", "P2", "P3", "Total_Count")
+---------------+---+---+---+-----------+
|Main_CustomerID|P1 |P2 |P3 |Total_Count|
+---------------+---+---+---+-----------+
|725153 |1 |0 |2 |3 |
|873008 |0 |0 |3 |3 |
|625109 |1 |1 |0 |2 |
+---------------+---+---+---+-----------+
You can simply use foldLeft on the columns other than Main_CustomerID and Total_Count, i.e. on P1, P2 and P3:
val df_columns = (df.columns.toSet - "Main_CustomerID" - "Total_Count").toList
df_columns.foldLeft(df){(tempdf, colName) => tempdf.withColumn(colName, df.col(colName) / df.col("Total_Count"))}.show(false)
which should give you
+---------------+------------------+---+------------------+-----------+
|Main_CustomerID|P1 |P2 |P3 |Total_Count|
+---------------+------------------+---+------------------+-----------+
|725153 |0.3333333333333333|0.0|0.6666666666666666|3 |
|873008 |0.0 |0.0|1.0 |3 |
|625109 |0.5 |0.5|0.0 |2 |
+---------------+------------------+---+------------------+-----------+
I hope the answer is helpful
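For reference, a minimal alternative sketch that avoids both the for loop and foldLeft by building all the ratio columns up front and selecting them in one go (assuming the product columns are known in advance):

import org.apache.spark.sql.functions.col

// Hedged sketch: build all ratio expressions first, then issue a single select.
val productCols = Seq("P1", "P2", "P3")
val ratioCols = productCols.map(c => (col(c) / col("Total_Count")).as(c))
val allCols = (col("Main_CustomerID") +: ratioCols) :+ col("Total_Count")
val frequencyTable = df.select(allCols: _*)
frequencyTable.show(false)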

transform a feature of a spark groupedBy DataFrame

I'm searching for a Scala analogue of pandas' .transform().
Namely, I need to create a new feature: the group mean of the corresponding class.
val df = Seq(
("a", 1),
("a", 3),
("b", 3),
("b", 7)
).toDF("class", "val")
+-----+---+
|class|val|
+-----+---+
| a| 1|
| a| 3|
| b| 3|
| b| 7|
+-----+---+
val grouped_df = df.groupBy('class)
Here's the Python (pandas) implementation:
df["class_mean"] = grouped_df["class"].transform(
lambda x: x.mean())
So, the desired result:
+-----+---+----------+
|class|val|class_mean|
+-----+---+----------+
| a| 1| 2.0|
| a| 3| 2.0|
| b| 3| 5.0|
| b| 7| 5.0|
+-----+---+----------+
You can use
import org.apache.spark.sql.functions.mean

df.groupBy("class").agg(mean("val").as("class_mean"))
If you want to keep all the columns, you can use a window function:
import org.apache.spark.sql.expressions.Window

val w = Window.partitionBy("class")
df.withColumn("class_mean", mean("val").over(w))
.show(false)
Output:
+-----+---+----------+
|class|val|class_mean|
+-----+---+----------+
|b |3 |5.0 |
|b |7 |5.0 |
|a |1 |2.0 |
|a |3 |2.0 |
+-----+---+----------+
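An alternative that is arguably closer to the pandas groupby().transform() pattern is to aggregate once and join the result back onto the original rows; a minimal sketch:

import org.apache.spark.sql.functions.mean

// Hedged sketch: compute the per-class mean once and join it back onto every row.
val classMeans = df.groupBy("class").agg(mean("val").as("class_mean"))
df.join(classMeans, Seq("class")).show(false)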