Finding percentile per group in Spark-Scala - scala
I am trying to compute a percentile over a column per group, using a Window function as shown below. I have referred here to use the ApproximatePercentile definition over a group.
val df1 = Seq(
(1, 10.0), (1, 20.0), (1, 40.6), (1, 15.6), (1, 17.6), (1, 25.6),
(1, 39.6), (2, 20.5), (2 ,70.3), (2, 69.4), (2, 74.4), (2, 45.4),
(3, 60.6), (3, 80.6), (4, 30.6), (4, 90.6)
).toDF("ID","Count")
val idBucketMapping = Seq((1, 4), (2, 3), (3, 2), (4, 2))
.toDF("ID", "Bucket")
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
import org.apache.spark.sql.expressions.Window

// thin wrapper that exposes ApproximatePercentile as a regular Column aggregate
object PercentileApprox {
  def percentile_approx(col: Column, percentage: Column, accuracy: Column): Column = {
    val expr = new ApproximatePercentile(
      col.expr, percentage.expr, accuracy.expr
    ).toAggregateExpression
    new Column(expr)
  }

  // overload with the default accuracy
  def percentile_approx(col: Column, percentage: Column): Column =
    percentile_approx(col, percentage,
      lit(ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY))
}
import PercentileApprox._
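For a quick sanity check of the wrapper (my own illustrative snippet, assuming spark.implicits._ is in scope as implied by the toDF calls above), a plain groupBy aggregation returns one percentile array per ID:

// illustrative only: percentile boundaries per ID using the two-argument overload
df1.groupBy("ID")
  .agg(percentile_approx(col("Count"), typedLit(Seq(0.0, 0.5, 1.0))).as("percentiles"))
  .show(false)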
// percentage boundaries for the requested number of buckets
def doBucketing(bucket_size: Int): Seq[Double] = (1 until bucket_size)
  .scanLeft(0d)((a, _) => a + (1 / bucket_size.toDouble))

val res = df1
  .withColumn("percentile",
    percentile_approx(col("Count"), typedLit(doBucketing(2)))
      .over(Window.partitionBy("ID"))
  )
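As a quick check of doBucketing (an illustration I added, not in the original post), it produces the lower percentage boundary of each bucket:

// doBucketing(2) should yield Seq(0.0, 0.5)
// doBucketing(4) should yield Seq(0.0, 0.25, 0.5, 0.75)
doBucketing(4).foreach(println)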
scala> df1.show
+---+-----+
| ID|Count|
+---+-----+
| 1| 10.0|
| 1| 20.0|
| 1| 40.6|
| 1| 15.6|
| 1| 17.6|
| 1| 25.6|
| 1| 39.6|
| 2| 20.5|
| 2| 70.3|
| 2| 69.4|
| 2| 74.4|
| 2| 45.4|
| 3| 60.6|
| 3| 80.6|
| 4| 30.6|
| 4| 90.6|
+---+-----+
scala> idBucketMapping.show
+---+------+
| ID|Bucket|
+---+------+
| 1| 4|
| 2| 3|
| 3| 2|
| 4| 2|
+---+------+
scala> res.show
+---+-----+------------------+
| ID|Count| percentile|
+---+-----+------------------+
| 1| 10.0|[10.0, 20.0, 40.6]|
| 1| 20.0|[10.0, 20.0, 40.6]|
| 1| 40.6|[10.0, 20.0, 40.6]|
| 1| 15.6|[10.0, 20.0, 40.6]|
| 1| 17.6|[10.0, 20.0, 40.6]|
| 1| 25.6|[10.0, 20.0, 40.6]|
| 1| 39.6|[10.0, 20.0, 40.6]|
| 3| 60.6|[60.6, 60.6, 80.6]|
| 3| 80.6|[60.6, 60.6, 80.6]|
| 4| 30.6|[30.6, 30.6, 90.6]|
| 4| 90.6|[30.6, 30.6, 90.6]|
| 2| 20.5|[20.5, 69.4, 74.4]|
| 2| 70.3|[20.5, 69.4, 74.4]|
| 2| 69.4|[20.5, 69.4, 74.4]|
| 2| 74.4|[20.5, 69.4, 74.4]|
| 2| 45.4|[20.5, 69.4, 74.4]|
+---+-----+------------------+
Up to here everything is well and good and the logic is simple. But I need the results computed dynamically: the number of buckets passed to doBucketing (hard-coded as 2 above) should be taken from idBucketMapping, based on the ID value.
This seems a little bit tricky to me. Is this possible by any means?
Expected output --
The number of percentile buckets per ID is taken from the idBucketMapping DataFrame:
+---+-----+------------------------+
|ID |Count|percentile |
+---+-----+------------------------+
|1 |10.0 |[10.0, 15.6, 20.0, 39.6]|
|1 |20.0 |[10.0, 15.6, 20.0, 39.6]|
|1 |40.6 |[10.0, 15.6, 20.0, 39.6]|
|1 |15.6 |[10.0, 15.6, 20.0, 39.6]|
|1 |17.6 |[10.0, 15.6, 20.0, 39.6]|
|1 |25.6 |[10.0, 15.6, 20.0, 39.6]|
|1 |39.6 |[10.0, 15.6, 20.0, 39.6]|
|3 |60.6 |[60.6, 60.6] |
|3 |80.6 |[60.6, 60.6] |
|4 |30.6 |[30.6, 30.6] |
|4 |90.6 |[30.6, 30.6] |
|2 |20.5 |[20.5, 45.4, 70.3] |
|2 |70.3 |[20.5, 45.4, 70.3] |
|2 |69.4 |[20.5, 45.4, 70.3] |
|2 |74.4 |[20.5, 45.4, 70.3] |
|2 |45.4 |[20.5, 45.4, 70.3] |
+---+-----+------------------------+
I have a solution for you that is extremely inelegant and works only if you have a limited number of possible bucket sizes.
My first version is very ugly.
// for the sake of clarity, let's define a function that generates the
// window aggregation for a given number of buckets
def per(x: Int) = percentile_approx(col("Count"), typedLit(doBucketing(x)))
  .over(Window.partitionBy("ID"))

// then, we simply try to match the Bucket column with a possible value
val res = df1
  .join(idBucketMapping, Seq("ID"))
  .withColumn("percentile",
    when('Bucket === 2, per(2))
      .otherwise(when('Bucket === 3, per(3))
      .otherwise(per(4))))
That's nasty but it works in your case.
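If you want to confirm that each ID picked up its own bucket count, a quick check I would add (not part of the original answer) is to collapse the result to one representative row per ID:

// one row per ID; the length of the percentile array should match Bucket
res.select("ID", "Bucket", "percentile").dropDuplicates("ID").show(false)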
Slightly less ugly, but with the very same logic: you can define a set of possible bucket counts and use it to do the same thing as above.
val possible_number_of_buckets = 2 to 5

val res = df1
  .join(idBucketMapping, Seq("ID"))
  .withColumn("percentile", possible_number_of_buckets
    .tail
    .foldLeft(per(possible_number_of_buckets.head)) { (column, size) =>
      when('Bucket === size, per(size)).otherwise(column)
    })
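For readers unfamiliar with the foldLeft trick: it builds the same nested when/otherwise chain as the first version, one branch per candidate bucket count. For possible_number_of_buckets = 2 to 5, the folded column is roughly equivalent to this hand-written expression (my own expansion, shown only for illustration):

// the last candidate ends up as the outermost branch, the head as the final fallback
when('Bucket === 5, per(5))
  .otherwise(when('Bucket === 4, per(4))
  .otherwise(when('Bucket === 3, per(3))
  .otherwise(per(2))))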
percentile_approx takes a percentage and an accuracy. It seems both of them must be constant literals (foldable expressions), so we cannot compute percentile_approx with a percentage or accuracy that is calculated dynamically at runtime.
ref - the percentile_approx (ApproximatePercentile) source in the Apache Spark git repository
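To make the constraint concrete, here is a small sketch of my own (not from the Spark sources): a constant percentage literal works, while a percentage derived from another column is not foldable and is expected to be rejected at analysis time.

// OK: the percentage array is a constant literal, hence foldable
df1.groupBy("ID")
  .agg(percentile_approx(col("Count"), typedLit(Seq(0.0, 0.5))).as("p"))

// Not OK (sketch, hypothetical column name): a per-row percentage is not foldable,
// so ApproximatePercentile should fail its analysis check
// df1.groupBy("ID")
//   .agg(percentile_approx(col("Count"), col("per_row_percentages")))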
Related
Spark: explode multiple columns into one
Is it possible to explode multiple columns into one new column in Spark? I have a dataframe which looks like this:

userId  varA     varB
1       [0,2,5]  [1,2,9]

Desired output:

userId  bothVars
1       0
1       2
1       5
1       1
1       2
1       9

What I have tried so far:

val explodedDf = df.withColumn("bothVars", explode($"varA")).drop("varA")
  .withColumn("bothVars", explode($"varB")).drop("varB")

which doesn't work. Any suggestions are much appreciated.
You could wrap the two arrays into one and flatten the nested array before exploding it, as shown below:

val df = Seq(
  (1, Seq(0, 2, 5), Seq(1, 2, 9)),
  (2, Seq(1, 3, 4), Seq(2, 3, 8))
).toDF("userId", "varA", "varB")

df.
  select($"userId", explode(flatten(array($"varA", $"varB"))).as("bothVars")).
  show
// +------+--------+
// |userId|bothVars|
// +------+--------+
// |     1|       0|
// |     1|       2|
// |     1|       5|
// |     1|       1|
// |     1|       2|
// |     1|       9|
// |     2|       1|
// |     2|       3|
// |     2|       4|
// |     2|       2|
// |     2|       3|
// |     2|       8|
// +------+--------+

Note that flatten is available on Spark 2.4+.
Use array_union and then use the explode function.

scala> df.show(false)
+------+---------+---------+
|userId|varA     |varB     |
+------+---------+---------+
|1     |[0, 2, 5]|[1, 2, 9]|
|2     |[1, 3, 4]|[2, 3, 8]|
+------+---------+---------+

scala> df
  .select($"userId", explode(array_union($"varA", $"varB")).as("bothVars"))
  .show(false)
+------+--------+
|userId|bothVars|
+------+--------+
|1     |0       |
|1     |2       |
|1     |5       |
|1     |1       |
|1     |9       |
|2     |1       |
|2     |3       |
|2     |4       |
|2     |2       |
|2     |8       |
+------+--------+

array_union is available in Spark 2.4+. Note that array_union also removes duplicate elements, so its output can differ from the flatten approach above when the two arrays overlap (here the repeated values appear only once).
Calculate residual amount in dataframe column
I have a "capacity" dataframe: scala> sql("create table capacity (id String, capacity Int)"); scala> sql("insert into capacity values ('A', 50), ('B', 100)"); scala> sql("select * from capacity").show(false) +---+--------+ |id |capacity| +---+--------+ |A |50 | |B |100 | +---+--------+ I have another "used" dataframe with following information: scala> sql ("create table used (id String, capacityId String, used Int)"); scala> sql ("insert into used values ('item1', 'A', 10), ('item2', 'A', 20), ('item3', 'A', 10), ('item4', 'B', 30), ('item5', 'B', 40), ('item6', 'B', 40)") scala> sql("select * from used order by capacityId").show(false) +-----+----------+----+ |id |capacityId|used| +-----+----------+----+ |item1|A |10 | |item3|A |10 | |item2|A |20 | |item6|B |40 | |item4|B |30 | |item5|B |40 | +-----+----------+----+ Column "capacityId" of the "used" dataframe is foreign key to column "id" of the "capacity" dataframe. I want to calculate the "capacityLeft" column which is residual amount at that point of time. +-----+----------+----+--------------+ |id |capacityId|used| capacityLeft | +-----+----------+----+--------------+ |item1|A |10 |40 | <- 50(capacity of 'A')-10 |item3|A |10 |30 | <- 40-10 |item2|A |20 |10 | <- 30-20 |item6|B |40 |60 | <- 100(capacity of 'B')-40 |item4|B |30 |30 | <- 60-30 |item5|B |40 |-10 | <- 30-40 +-----+----------+----+--------------+ In real senario, the "createdDate" column is used for ordering of "used" dataframe column. Spark version: 2.2
This can be solved by using window functions in Spark. Note that for this to work there needs to exist a column that keeps track of the row order for each capacityId.

Start by joining the two dataframes together:

val df = used.join(capacity.withColumnRenamed("id", "capacityId"),
  Seq("capacityId"), "inner")

Here the id column in the capacity dataframe is renamed to match the id name in the used dataframe, so as not to keep duplicate columns.

Now create a window and calculate the cumulative sum of the used column. Take the value of the capacity and subtract the cumulative sum to get the remaining amount:

val w = Window.partitionBy("capacityId").orderBy("createdDate")
val df2 = df.withColumn("capacityLeft", $"capacity" - sum($"used").over(w))

Resulting dataframe with an example createdDate column:

+----------+-----+----+-----------+--------+------------+
|capacityId|   id|used|createdDate|capacity|capacityLeft|
+----------+-----+----+-----------+--------+------------+
|         B|item6|  40|          1|     100|          60|
|         B|item4|  30|          2|     100|          30|
|         B|item5|  40|          3|     100|         -10|
|         A|item1|  10|          1|      50|          40|
|         A|item3|  10|          2|      50|          30|
|         A|item2|  20|          3|      50|          10|
+----------+-----+----+-----------+--------+------------+

Any unwanted columns can now be removed with drop.
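If the data has no natural ordering column at all, one workaround I can think of (my own addition, not part of the original answer) is to synthesize one, keeping in mind that it only reflects the dataframe's current row order and carries no business meaning:

import org.apache.spark.sql.functions.monotonically_increasing_id

// hypothetical fallback when createdDate is missing
val usedWithOrder = used.withColumn("rowOrder", monotonically_increasing_id())
val wFallback = Window.partitionBy("capacityId").orderBy("rowOrder")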
Scala/Spark: drop duplicates based on a specific value in another column [duplicate]
This question already has answers here: How to select the first row of each group? (9 answers). Closed 1 year ago.

I want to drop duplicates with the same ID that do not have a specific value in another column (in this case, filter to those rows that have the same ID and value = 1).

Input df:

+---+-----+------+
| id|value|sorted|
+---+-----+------+
|  3|    0|     2|
|  3|    1|     3|
|  4|    0|     6|
|  4|    1|     5|
|  5|    4|     6|
+---+-----+------+

Result I want:

+---+-----+------+
| id|value|sorted|
+---+-----+------+
|  3|    1|     3|
|  4|    1|     5|
|  5|    4|     6|
+---+-----+------+
This can be done by getting the rows where value is 1 and then left-joining with the original data:

val df = List(
  (3, 0, 2),
  (3, 1, 3),
  (4, 0, 6),
  (4, 1, 5),
  (5, 4, 6)
).toDF("id", "value", "sorted")

val withOne = df.filter($"value" === 1)
val joinedWithOriginal = df.alias("orig").join(withOne.alias("one"), Seq("id"), "left")
val result = joinedWithOriginal
  .where($"one.value".isNull || $"one.value" === $"orig.value")
  .select("orig.id", "orig.value", "orig.sorted")
result.show(false)

Output:
+---+-----+------+
|id |value|sorted|
+---+-----+------+
|3  |1    |3     |
|4  |1    |5     |
|5  |4    |6     |
+---+-----+------+
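Since the question is closed as a duplicate of "How to select the first row of each group?", a window-function alternative (my own sketch, not from the answer above) would rank rows within each id so that value = 1 comes first and keep only the top row:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// rows with value = 1 sort first; ties are broken arbitrarily
val w = Window.partitionBy("id").orderBy((col("value") === 1).desc)
val viaWindow = df
  .withColumn("rn", row_number().over(w))
  .where(col("rn") === 1)
  .drop("rn")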
Perform Arithmetic Operations on multiple columns in Spark dataframe
I have an input spark dataframe named df:

+---------------+---+---+---+-----------+
|Main_CustomerID| P1| P2| P3|Total_Count|
+---------------+---+---+---+-----------+
|         725153|  1|  0|  2|          3|
|         873008|  0|  0|  3|          3|
|         625109|  1|  1|  0|          2|
+---------------+---+---+---+-----------+

Here Total_Count is the sum of P1, P2, P3, and P1, P2, P3 are product names. I need to find the frequency of each product by dividing the product values by Total_Count. I need to create a new spark dataframe named frequencyTable as follows:

+---------------+------------------+---+------------------+-----------+
|Main_CustomerID|                P1| P2|                P3|Total_Count|
+---------------+------------------+---+------------------+-----------+
|         725153|0.3333333333333333|0.0|0.6666666666666666|          3|
|         873008|               0.0|0.0|               1.0|          3|
|         625109|               0.5|0.5|               0.0|          2|
+---------------+------------------+---+------------------+-----------+

I have done this using Scala as:

val df_columns = df.columns.toSeq
var frequencyTable = df
for (index <- df_columns) {
  if (index != "Main_CustomerID" && index != "Total_Count") {
    frequencyTable = frequencyTable.withColumn(index, df.col(index) / df.col("Total_Count"))
  }
}

But I would prefer to avoid this for loop because my df is much larger. What is the optimized solution?
If you have a dataframe such as

val df = Seq(
  ("725153", 1, 0, 2, 3),
  ("873008", 0, 0, 3, 3),
  ("625109", 1, 1, 0, 2)
).toDF("Main_CustomerID", "P1", "P2", "P3", "Total_Count")

+---------------+---+---+---+-----------+
|Main_CustomerID|P1 |P2 |P3 |Total_Count|
+---------------+---+---+---+-----------+
|725153         |1  |0  |2  |3          |
|873008         |0  |0  |3  |3          |
|625109         |1  |1  |0  |2          |
+---------------+---+---+---+-----------+

you can simply use foldLeft on the columns except Main_CustomerID and Total_Count, i.e. on P1, P2 and P3:

val df_columns = (df.columns.toSet - "Main_CustomerID" - "Total_Count").toList

df_columns.foldLeft(df) { (tempdf, colName) =>
  tempdf.withColumn(colName, df.col(colName) / df.col("Total_Count"))
}.show(false)

which should give you

+---------------+------------------+---+------------------+-----------+
|Main_CustomerID|P1                |P2 |P3                |Total_Count|
+---------------+------------------+---+------------------+-----------+
|725153         |0.3333333333333333|0.0|0.6666666666666666|3          |
|873008         |0.0               |0.0|1.0               |3          |
|625109         |0.5               |0.5|0.0               |2          |
+---------------+------------------+---+------------------+-----------+

I hope the answer is helpful.
transform a feature of a spark groupedBy DataFrame
I'm searching for a Scala analogue of Python's .transform(). Namely, I need to create a new feature: the group mean of the corresponding class.

val df = Seq(
  ("a", 1),
  ("a", 3),
  ("b", 3),
  ("b", 7)
).toDF("class", "val")

+-----+---+
|class|val|
+-----+---+
|    a|  1|
|    a|  3|
|    b|  3|
|    b|  7|
+-----+---+

val grouped_df = df.groupBy('class)

Here's the Python implementation:

df["class_mean"] = grouped_df["class"].transform(lambda x: x.mean())

So, the desired result:

+-----+---+----------+
|class|val|class_mean|
+-----+---+----------+
|    a|  1|       2.0|
|    a|  3|       2.0|
|    b|  3|       5.0|
|    b|  7|       5.0|
+-----+---+----------+
You can use

df.groupBy("class").agg(mean("val").as("class_mean"))

If you want to keep all the columns, then you can use a window function:

val w = Window.partitionBy("class")
df.withColumn("class_mean", mean("val").over(w))
  .show(false)

Output:

+-----+---+----------+
|class|val|class_mean|
+-----+---+----------+
|b    |3  |5.0       |
|b    |7  |5.0       |
|a    |1  |2.0       |
|a    |3  |2.0       |
+-----+---+----------+