Spark Dataframe Pivot without Overlapping in Pivoting Columns - scala

Say, I have a dataframe, df, such as:
Column_A Column_B Column_C Column_D Priority
a_1 b_1 c_1 d_1 high
a_1 b_1 c_1 d_1 medium
a_1 b_1 c_1 d_1 low
a_1 b_1 c_1 d_2 high
a_1 b_1 c_1 d_3 medium
a_1 b_1 c_1 d_4 high
a_1 b_1 c_1 d_4 low
a_2 b_2 c_2 d_5 medium
a_2 b_2 c_2 d_5 low
a_2 b_2 c_2 d_6 high
a_2 b_2 c_2 d_7 low
Now if I collect the values of Column_D by applying groupBy on Columns A, B, and C and then pivoting on the Priority column, the outcome will be:
scala> val outcome = df.groupBy("Column_A", "Column_B", "Column_C")
         .pivot("Priority", Seq("high", "medium", "low"))
         .agg(collect_set("Column_D") as "Set_D")
scala> outcome.show
Column_A Column_B Column_C high medium low
a_1 b_1 c_1 [d_1, d_2, d_4] [d_1, d_3] [d_1, d_4]
a_2 b_2 c_2 [d_6] [d_5] [d_5, d_7]
But I want the pivoting to be hierarchical, i.e., if, for a given group of Columns A, B, and C, a value of Column_D lies in the high column, then it should appear in neither medium nor low. Similarly, if the value lies in medium, it should not appear in low. It's like exclusively collecting the values of Column_D based on a hierarchy over the pivoting columns.
The desired outcome:
scala> outcome.show
Column_A Column_B Column_C high medium low
a_1 b_1 c_1 [d_1, d_2, d_4] [d_3] []
a_2 b_2 c_2 [d_6] [d_5] [d_7]
Simply subtracting column high from column medium, and column medium from column low, is not applicable because there could be any number of buckets in the Priority column; for example, Priority could have values like ("very high", "high", "medium", "low", "very low"), etc.
Edit: The order of the Priority values is determined by a score list, for example [("high", 1), ("medium", 0.5), ("low", 0.25)] or [("very high", 1), ("high", 0.8), ("medium", 0.6), ("low", 0.4), ("very low", 0.2)].
Any lead would be highly appreciated. Thanks a lot!! :)

Subtracting the arrays from each other can work if the list of scores is used to iterate over the columns in the correct order:
import org.apache.spark.sql.{functions, DataFrame}

val df = ...       // the original dataframe from the question
val outcome = ...  // the pivoted dataframe from the question

val scoreList: Seq[(String, Double)] = Seq(("very high", 1), ("high", 0.8), ("medium", 0.6), ("low", 0.4), ("very low", 0.2))

val actCols = scoreList.sortWith(_._2 > _._2)    //sort the list by descending score
  .filter(t => outcome.columns.contains(t._1))   //remove all priorities that are not columns of the actual dataset
  .map(_._1)                                     //only keep the column names

var cleanedUp = outcome
for ((col, i) <- actCols.tail.zipWithIndex) {    //iterate over all columns but the one with the highest prio
  val colsWithHigherPrio = actCols.take(i + 1)   //consider all columns with higher prio than the current one
  for (c <- colsWithHigherPrio) {                //remove all values that are already present in columns with higher prio
    cleanedUp = cleanedUp.withColumn(col, functions.array_except(functions.col(col), functions.col(c)))
  }
}
cleanedUp.show()
Output:
+--------+--------+--------+---------------+------+-----+
|Column_A|Column_B|Column_C| high|medium| low|
+--------+--------+--------+---------------+------+-----+
| a_2| b_2| c_2| [d_6]| [d_5]|[d_7]|
| a_1| b_1| c_1|[d_1, d_2, d_4]| [d_3]| []|
+--------+--------+--------+---------------+------+-----+
Edit: a very similar approach without using explicit loops:
def cleanupSingleCol(df: DataFrame, cols: Seq[String]) = cols.tail.foldLeft(df)((d, c) =>
  d.withColumn(cols.head, functions.array_except(functions.col(cols.head), functions.col(c))))

val cleanedUp = actCols.inits.filter(_.size >= 2).foldLeft(outcome)((df, l) =>
  cleanupSingleCol(df, l.reverse))
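To see what drives the fold: inits produces every prefix of the ordered column names, and reversing a prefix puts the column to clean at its head. A quick REPL check (illustrative only):
Seq("high", "medium", "low").inits.toList
// List(List(high, medium, low), List(high, medium), List(high), List())
// filter(_.size >= 2) keeps the first two prefixes; after reverse,
// "low" is cleaned against "medium" and "high", and "medium" against "high".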

Related

Spark - Add a column which sums another column, grouped by a third column, without losing other columns

I have a DataFrame with 5 columns:
col1, col2, col3, col4 and col5
I'd like to add col6 which would be the sum of col5, grouped by col1. But I don't want to lose the other columns.
If I do:
df
.groupBy("col1")
.agg(sum("col5") as "col6")
Then I lose columns 2-4.
I can do a join by running:
val sumValues = df
.groupBy("col1")
.agg(sum("col5") as "col6")
df
.join(sumValues, Seq("col1"))
But it feels like overkill.
I was hoping to do something like:
df
.withGroupedColumn("col6", "col1", sum("col5") as "col6")
Is there a simple way to do that in Spark?
You can use window functions:
val df2 = df.withColumn("col6", expr("sum(col5) over (partition by col1)"))
Or equivalently
import org.apache.spark.sql.expressions.Window
val df2 = df.withColumn("col6", sum("col5").over(Window.partitionBy("col1")))

Spark Scala - Winsorize DataFrame columns within groups

I am pre-processing data for machine learning inputs. A target value column, call it "price", has many outliers, and rather than winsorizing price over the whole set I want to winsorize it within groups labeled by "product_category". There are other features; product_category is just a price-relevant label.
There is a Scala stat function that works great:
df_data.stat.approxQuantile("price", Array(0.01, 0.99), 0.00001)
// res19: Array[Double] = Array(3.13, 318.54)
Unfortunately, it doesn't support computing the quantiles within groups, nor does it support window partitions.
df_data
.groupBy("product_category")
.approxQuantile($"price", Array(0.01, 0.99), 0.00001)
// error: value approxQuantile is not a member of
// org.apache.spark.sql.RelationalGroupedDataset
What is the best way to compute, say, the p01 and p99 within groups of a Spark dataframe, for the purpose of replacing values beyond that range, i.e., winsorizing?
My dataset schema can be imagined like this; it's over 20MM rows with approximately 10K different labels for "product_category", so performance is also a concern.
df_data and a winsorized price column:
+---------+------------------+--------+---------+
| item | product_category | price | pr_winz |
+---------+------------------+--------+---------+
| I000001 | XX11 | 1.99 | 5.00 |
| I000002 | XX11 | 59.99 | 59.99 |
| I000003 | XX11 |1359.00 | 850.00 |
+---------+------------------+--------+---------+
supposing p01 = 5.00, p99 = 850.00 for this product_category
Here is what I came up with, after struggling with the documentation (there are two functions, approx_percentile and percentile_approx, that apparently do the same thing).
I was not able to figure out how to implement this except as a Spark SQL expression; I'm not sure exactly why grouping only works there. I suspect it's because it's part of Hive?
Spark DataFrame Winsorizer
Tested on DataFrames in the 10 to 100MM row range
// Winsorize function, groupable by a list of columns
// low/hi: elements of [0, 1]
// precision: integer in [1, 1E7-ish]; in practice use 100 or 1000 for large data, smaller is faster/less accurate
// group_col: comma-separated list of column names
import org.apache.spark.sql._
import org.apache.spark.sql.functions.expr

def grouped_winzo(df: DataFrame, winz_col: String, group_col: String, low: Double, hi: Double, precision: Integer): DataFrame = {
  df.createOrReplaceTempView("df_table")

  spark.sql(s"""
    select distinct
      *
      , percentile_approx($winz_col, $low, $precision) over(partition by $group_col) p_low
      , percentile_approx($winz_col, $hi, $precision) over(partition by $group_col) p_hi
    from df_table
    """)
    .withColumn(winz_col + "_winz", expr(s"""
      case when $winz_col <= p_low then p_low
           when $winz_col >= p_hi then p_hi
           else $winz_col end"""))
    .drop(winz_col, "p_low", "p_hi")
}
// winsorize the price column of a dataframe at the p01 and p99
// percentiles, grouped by the 'product_category' column.
val df_winsorized = grouped_winzo(
  df_data, "price", "product_category", 0.01, 0.99, 1000)
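If you are on Spark 3.1 or later, roughly the same thing can be expressed without the SQL string, since percentile_approx is also exposed in org.apache.spark.sql.functions and can be used over a window. A sketch of that variant (rewritten from the SQL above; assumes Spark 3.1+ and is untested against the question's data):
import org.apache.spark.sql.functions.{col, lit, percentile_approx, when}
import org.apache.spark.sql.expressions.Window

// per-category approximate quantile bounds computed over a window
val w = Window.partitionBy("product_category")
val withBounds = df_data
  .withColumn("p_low", percentile_approx(col("price"), lit(0.01), lit(1000)).over(w))
  .withColumn("p_hi",  percentile_approx(col("price"), lit(0.99), lit(1000)).over(w))

// clip price to the per-category [p_low, p_hi] range
val df_winsorized2 = withBounds
  .withColumn("pr_winz",
    when(col("price") <= col("p_low"), col("p_low"))
      .when(col("price") >= col("p_hi"), col("p_hi"))
      .otherwise(col("price")))
  .drop("p_low", "p_hi")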

how to increase performance on Spark distinct() on multiple columns

Could you please suggest an alternative way of implementing distinct on a Spark dataframe?
I tried both SQL and Spark distinct, but because of the dataset size (>2 billion rows) it fails on the shuffle.
If I increase the nodes and memory to >250 GB, the process runs for a long time (more than 7 hours).
val df = spark.read.parquet(out)

val df1 = df
  .select($"ID", $"col2", $"suffix", $"date", $"year", $"codes")
  .distinct()

val df2 = df1.withColumn("codes", expr("transform(codes, (c,s) -> (d,s) )"))
df2.createOrReplaceTempView("df2")

val df3 = spark.sql(
  """SELECT
       ID, col2, suffix,
       d.s as seq,
       d.c as code,
       year, date
     FROM df2
     LATERAL VIEW explode(codes) exploded_table as d
  """)

df3
  .repartition(600, List(col("year"), col("date")): _*)
  .write
  .mode("overwrite")
  .partitionBy("year", "date")
  .save(OutDir)

Sample a different number of random rows for every group in a dataframe in spark scala

The goal is to sample (without replacement) a different number of rows in a dataframe for every group. The number of rows to sample for a specific group is in another dataframe.
Example: idDF is the dataframe to sample from. The groups are denoted by the ID column. The dataframe planDF specifies the number of rows to sample for each group, where "datesToUse" denotes the number of rows and "ID" denotes the group. "totalDates" is the total number of rows for that group and may or may not be useful.
The final result should have 3 rows sampled from the first group (ID 1), 2 rows sampled from the second group (ID 2) and 1 row sampled from the third group (ID 3).
val idDF = Seq(
(1, "2017-10-03"),
(1, "2017-10-22"),
(1, "2017-11-01"),
(1, "2017-10-02"),
(1, "2017-10-09"),
(1, "2017-12-24"),
(1, "2017-10-20"),
(2, "2017-11-17"),
(2, "2017-11-12"),
(2, "2017-12-02"),
(2, "2017-10-03"),
(3, "2017-12-18"),
(3, "2017-11-21"),
(3, "2017-12-13"),
(3, "2017-10-08"),
(3, "2017-10-16"),
(3, "2017-12-04")
).toDF("ID", "date")
val planDF = Seq(
(1, 3, 7),
(2, 2, 4),
(3, 1, 6)
).toDF("ID", "datesToUse", "totalDates")
This is an example of what a resultant dataframe should look like:
+---+----------+
| ID| date|
+---+----------+
| 1|2017-10-22|
| 1|2017-11-01|
| 1|2017-10-20|
| 2|2017-11-12|
| 2|2017-10-03|
| 3|2017-10-16|
+---+----------+
So far, I tried to use the sample method for DataFrame: https://spark.apache.org/docs/1.5.0/api/java/org/apache/spark/sql/DataFrame.html
Here is an example that would work for an entire data frame.
def sampleDF(DF: DataFrame, datesToUse: Int, totalDates: Int): DataFrame = {
  val fraction = datesToUse / totalDates.toFloat.toDouble
  DF.sample(false, fraction)
}
I can't figure out how to use something like this for each group. I tried joining the planDF table to the idDF table and using a window partition.
Another idea I had was to somehow make a new column with randomly labeled True / false and then filter on that column.
Another option, staying entirely in DataFrames, would be to compute probabilities using your planDF, join with idDF, append a column of random numbers, and then filter. Helpfully, sql.functions has a rand function.
import org.apache.spark.sql.functions._
import spark.implicits._

val probabilities = planDF.withColumn("prob", $"datesToUse" / $"totalDates")
val dfWithProbs = idDF.join(probabilities, Seq("ID"))
  .withColumn("rand", rand())
  .where($"rand" < $"prob")
(You'll want to double check that that isn't integer division.)
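If you want the result to contain only the original columns again, you can drop the helper columns afterwards; a small follow-up sketch (not part of the original answer):
// keep only the columns of idDF after filtering
val sampled = dfWithProbs.select("ID", "date")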
With the assumption that your planDF is small enough to be collected, you can use Scala's foldLeft to traverse the id list and accumulate the sampled DataFrames per id:
import org.apache.spark.sql.{Row, DataFrame}

def sampleByIdDF(DF: DataFrame, id: Int, datesToUse: Int, totalDates: Int): DataFrame = {
  val fraction = datesToUse.toDouble / totalDates
  DF.where($"id" === id).sample(false, fraction)
}

val emptyDF = Seq.empty[(Int, String)].toDF("ID", "date")

val planList = planDF.rdd.collect.map{ case Row(x: Int, y: Int, z: Int) => (x, y, z) }
// planList: Array[(Int, Int, Int)] = Array((1,3,7), (2,2,4), (3,1,6))

planList.foldLeft( emptyDF ){ case (accDF: DataFrame, (id: Int, num: Int, total: Int)) =>
  accDF union sampleByIdDF(idDF, id, num, total)
}
// res1: org.apache.spark.sql.DataFrame = [ID: int, date: string]
// res1.show
// +---+----------+
// | ID| date|
// +---+----------+
// | 1|2017-10-03|
// | 1|2017-11-01|
// | 1|2017-10-02|
// | 1|2017-12-24|
// | 1|2017-10-20|
// | 2|2017-11-17|
// | 2|2017-11-12|
// | 2|2017-12-02|
// | 3|2017-11-21|
// | 3|2017-12-13|
// +---+----------+
Note that method sample() does not necessarily generate the exact number of samples specified in the method arguments. Here's a relevant SO Q&A.
If your planDF is large, you might have to consider using RDD's aggregate, which has the following signature (skipping the implicit argument):
def aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U): U
It works somewhat like foldLeft, except that it has one accumulation operator within a partition and an additional one to combine results from different partitions.
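For intuition, here's a tiny standalone illustration of aggregate (a plain sum over an RDD of Ints, unrelated to the sampling itself):
val nums = spark.sparkContext.parallelize(1 to 10)
val total = nums.aggregate(0)(
  (acc, x) => acc + x,   // seqOp: fold values within each partition
  (a, b) => a + b        // combOp: merge the per-partition results
)
// total: Int = 55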

Ignoring case in spark while joining

I have a Spark dataframe (input_dataframe_1); the data in this dataframe looks like below:
id value
1 Ab
2 Ai
3 aB
I have another Spark dataframe (input_dataframe_2); the data in this dataframe looks like below:
name value
x ab
y iA
z aB
I want to join both dataframes and the join condition should be case-insensitive. Below is the join condition I am using:
output = input_dataframe_1.join(input_dataframe_2,['value'])
How can I make join condition case insensitive?
from pyspark.sql.functions import lower
#sample data
input_dataframe_1 = sc.parallelize([(1, 'Ab'), (2, 'Ai'), (3, 'aB')]).toDF(["id", "value"])
input_dataframe_2 = sc.parallelize([('x', 'ab'), ('y', 'iA'), ('z', 'aB')]).toDF(["name", "value"])
output = input_dataframe_1.\
join(input_dataframe_2, lower(input_dataframe_1.value)==lower(input_dataframe_2.value)).\
drop(input_dataframe_2.value)
output.show()
Assuming you are doing an inner join, find the solution below:
Create input dataframe 1:
val inputDF1 = spark.createDataFrame(Seq(("1","Ab"),("2","Ai"),("3","aB"))).withColumnRenamed("_1","id").withColumnRenamed("_2","value")
Create input dataframe 2:
val inputDF2 = spark.createDataFrame(Seq(("x","ab"),("y","iA"),("z","aB"))).withColumnRenamed("_1","name").withColumnRenamed("_2","value")
Join both dataframes on the lower(value) column:
import org.apache.spark.sql.functions.lower
inputDF1.join(inputDF2, lower(inputDF1.col("value")) === lower(inputDF2.col("value"))).show
+---+-----+----+-----+
| id|value|name|value|
+---+-----+----+-----+
|  1|   Ab|   z|   aB|
|  1|   Ab|   x|   ab|
|  3|   aB|   z|   aB|
|  3|   aB|   x|   ab|
+---+-----+----+-----+
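A variation on the same idea is to materialize a lower-cased join key, join on it, and drop it afterwards. A minimal sketch building on inputDF1/inputDF2 above (the helper column name value_lc is just for illustration):
import org.apache.spark.sql.functions.{col, lower}

val df1L = inputDF1.withColumn("value_lc", lower(col("value")))
val df2L = inputDF2.withColumn("value_lc", lower(col("value")))

// joining with a Seq merges the key column, so only the helper needs dropping afterwards
df1L.join(df2L, Seq("value_lc")).drop("value_lc").show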