Spark Scala - Winsorize DataFrame columns within groups - scala

I am pre-processing data for machine learning inputs. A target value column, call it "price", has many outliers, and rather than winsorizing price over the whole set I want to winsorize it within groups labeled by "product_category". There are other features; product_category is just a price-relevant label.
There is a Scala stat function that works great:
df_data.stat.approxQuantile("price", Array(0.01, 0.99), 0.00001)
// res19: Array[Double] = Array(3.13, 318.54)
Unfortunately, it doesn't support computing the quantiles within groups, nor does it support window partitions.
df_data
  .groupBy("product_category")
  .approxQuantile($"price", Array(0.01, 0.99), 0.00001)
// error: value approxQuantile is not a member of
// org.apache.spark.sql.RelationalGroupedDataset
What is the best way to compute, say, the p01 and p99 within groups of a Spark dataframe, for the purpose of replacing values beyond that range, i.e. winsorizing?
My dataset schema can be imagined like this, and it's over 20MM rows with approximately 10K different labels for "product_category", so performance is also a concern.
df_data and a winsorized price column:
+---------+------------------+---------+---------+
| item    | product_category |   price | pr_winz |
+---------+------------------+---------+---------+
| I000001 | XX11             |    1.99 |    5.00 |
| I000002 | XX11             |   59.99 |   59.99 |
| I000003 | XX11             | 1359.00 |  850.00 |
+---------+------------------+---------+---------+
supposing p01 = 5.00, p99 = 850.00 for this product_category

Here is what I came up with, after struggling with the documentation (there are two functions approx_percentile and percentile_approx that apparently do the same thing).
I was not able to figure out how to implement this except as a Spark SQL expression; I'm not sure exactly why grouping only works there. I suspect it's because it's part of Hive?
Spark DataFrame Winsorizor
Tested on DF in 10 to 100MM rows range
// Winsorize function, groupable by a list of columns
// low/hi: each an element of [0, 1]
// precision: integer in [1, ~1E7]; in practice use 100 or 1000 for large data, smaller is faster/less accurate
// group_col: comma-separated list of column names
import org.apache.spark.sql._
import org.apache.spark.sql.functions.expr

def grouped_winzo(df: DataFrame, winz_col: String, group_col: String, low: Double, hi: Double, precision: Integer): DataFrame = {
  df.createOrReplaceTempView("df_table")
  spark.sql(s"""
    select distinct
      *
      , percentile_approx($winz_col, $low, $precision) over(partition by $group_col) p_low
      , percentile_approx($winz_col, $hi, $precision) over(partition by $group_col) p_hi
    from df_table
  """)
  .withColumn(winz_col + "_winz", expr(s"""
    case when $winz_col <= p_low then p_low
         when $winz_col >= p_hi then p_hi
         else $winz_col end"""))
  .drop(winz_col, "p_low", "p_hi")
}
// winsorize the price column of a dataframe at the p01 and p99
// percentiles, grouped by the 'product_category' column.
val df_winsorized = grouped_winzo(
  df_data
  , "price"
  , "product_category"
  , 0.01, 0.99, 1000)
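A possible variant (an untested sketch, not the approach above; it assumes a single grouping column): compute the per-group bounds once with groupBy/agg and join them back, instead of using a window plus select distinct.
// Variant sketch: aggregate the bounds per group, then join and clamp.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, expr, greatest, least}

def grouped_winzo_agg(df: DataFrame, winz_col: String, group_col: String,
                      low: Double, hi: Double, precision: Int): DataFrame = {
  val bounds = df.groupBy(group_col).agg(
    expr(s"percentile_approx($winz_col, $low, $precision)").as("p_low"),
    expr(s"percentile_approx($winz_col, $hi, $precision)").as("p_hi"))
  df.join(bounds, Seq(group_col))
    .withColumn(winz_col + "_winz",
      least(greatest(col(winz_col), col("p_low")), col("p_hi")))  // clamp to [p_low, p_hi]
    .drop("p_low", "p_hi")
}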

Related

How to create a range column based on a column value?

I have sample data in a table which contains distance_travelled_in_meter, where the values are of Integer type, as follows:
distance_travelled_in_meter |
----------------------------|
500                         |
1221                        |
990                         |
575                         |
I want to create range based on the value of the column distance_travelled_in_meter. Range column has values with 500 intervals.
The result dataset is as follows:
distance_travelled_in_meter | range
----------------------------|----------
500                         | 1-500
1221                        | 1000-1500
990                         | 500-1000
575                         | 500-1000
For the value 500 the range is 1-500, as it is within 500 meters; 1221 is in 1000-1500, and so on.
I tried using spark.sql.functions.sequence, but it takes start and stop column values, which is not what I want; I want the ranges I described above. It also creates a range array from the start column value to the stop column value.
I'm using Spark2.4.2 with Scala 2.11.12
Any help is much appreciated.
You can chain multiple when expressions that you generate dynamically using something like this:
import org.apache.spark.sql.functions.{col, lit, when}

val maxDistance = 1221 // you can get this from the dataframe
val ranges = (0 until maxDistance by 500).map(x => (x, x + 500))
val rangeExpr = ranges.foldLeft(lit(null)) {
  case (acc, (lowerBound, upperBound)) =>
    when(
      col("distance_travelled_in_meter").between(lowerBound, upperBound),
      lit(s"$lowerBound-$upperBound")
    ).otherwise(acc)
}
val df1 = df.withColumn("range", rangeExpr)
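Alternatively (a sketch on top of the same idea, not part of the original answer), the bucket can be derived arithmetically, which avoids having to know maxDistance up front; the lower bound is special-cased to 1 for the first bucket to match the expected output:
// Arithmetic-bucket sketch: ceil(d / 500) gives the bucket number,
// and the label is built from the bucket bounds.
import org.apache.spark.sql.functions.{ceil, col, concat, lit, when}

val bucket = ceil(col("distance_travelled_in_meter") / 500)
val lower  = when(bucket === 1, lit(1)).otherwise((bucket - 1) * 500)
val upper  = bucket * 500
val dfRanges = df.withColumn("range",
  concat(lower.cast("string"), lit("-"), upper.cast("string")))  // 500 -> "1-500", 990 -> "500-1000"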

Spark dataframe random sampling based on frequency occurrence in dataframe

Input description
I have a spark job with input dataframe with a column queryId. This queryId is not unique with respect to the dataframe. For example, there are roughly 3M rows in the spark dataframe with 450k distinct query ids.
Problem
I am trying to implement sampling logic and create a new column, sampledQueryId, which contains a randomly sampled query id for each dataframe row, looked up from the aggregate set of query ids in the dataframe.
Sampling goal
The restriction is that the sampled query id shouldn't be equal to the input query id.
Sampling should correspond to the frequency of occurrence of each query id in the incoming Spark dataframe - i.e. given two query ids q1 and q2, if the ratio of occurrence is 10:1 (q1:q2), then q1 should appear approximately 10 times more often in the sampled id column.
Solution tried so far
I have tried to implement this by collecting the query ids into a list and looking up that list with random sampling, but I suspect, based on empirical evidence, that the logic doesn't work as expected: e.g. I see a specific query id getting sampled 200 times while a query id with similar frequency never gets sampled.
Any suggestions on whether this spark code is expected to work as intended?
val random = new scala.util.Random
val queryIds = data.select($"queryId").map(row => row.getAs[Long](0)).collect()
val sampleQueryId = udf((queryId: Long) => {
  val sampledId = queryIds(random.nextInt(queryIds.length))
  if (sampledId != queryId) sampledId else null
})
val dataWithSampledIds = data.withColumn("sampledQueryId", sampleQueryId($"queryId"))
I received a response on a different forum; documenting it here for posterity's sake. The issue is that a single random instance is being passed to all executors through the udf, so the n-th row on every executor is going to give the same output.
scala> val random = new scala.util.Random
scala> val getRandom = udf((data: Long) => random.nextInt(10000))
scala> spark.range(0, 12, 1, 4).withColumn("rnd", getRandom($"id")).orderBy($"id").show
+---+----+
| id| rnd|
+---+----+
| 0|6720|
| 1|7667|
| 2|3344|
| 3|6720|
| 4|7667|
| 5|3344|
| 6|6720|
| 7|7667|
| 8|3344|
| 9|6720|
| 10|7667|
| 11|3344|
+---+----+
This df had 4 partitions. The value of rnd for every n-th row is the same (e.g. id = 1, 4, 7, 10 are the same). The solution is to use the rand() built-in function in Spark, like below.
val queryIds = data.select($"queryId").map(row => row.getAs[Long](0)).collect()
val sampleQueryId = udf((queryId: Long, rand: Double) => {
  val sampledId = queryIds(scala.math.floor(rand * queryIds.length).toInt)
  if (sampledId != queryId) sampledId else null
})
val dataWithSampledIds = data.withColumn("sampledQueryId", sampleQueryId($"queryId", rand()))
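One caveat to add (my note, not part of the received response): a Scala UDF whose declared result type is the primitive Long cannot actually return null, so wrapping the result in Option is one way to keep the "skip self-matches" behaviour:
// Option-returning variant so a self-match becomes null in the output column
// instead of tripping over a primitive Long return type.
val sampleQueryIdOpt = udf((queryId: Long, rnd: Double) => {
  val sampledId = queryIds(scala.math.floor(rnd * queryIds.length).toInt)
  if (sampledId != queryId) Some(sampledId) else None
})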

Explode or pivot spark scala dataframe horizontally to create a big flat dataframe

I have a dataframe with following schema:
UserID | StartDate | endDate | orderId | OrderCost| OrderItems| OrderLocation| Rank
Where Rank is 1 to 10.
I need to transpose this dataframe on rank and create dataframe in the below format:
UserID| StartDate_1 | endDate_1 | orderId_1 | OrderCost_1| OrderItems_1| OrderLocation_1|start_2 |endDate_2| orderId_2 | OrderCost_2| OrderItems_2| OrderLocation_2 |............| startDate_N|endDate_N | orderId_N | OrderCost_N| OrderItems_N| OrderLocation_N
If a user has only two records, with rank 3 and 10, then the requirement is to populate the columns with suffixes _3 and _10; the rest of the cell values for that user will be null.
I have tried 2 brute-force approaches:
Filter the DF for a rank, rename the columns with the suffix, and self-join back to the DF.
Group by UserID, collect as a list, and pass it to a map function where I populate an array based on rank and then return a seq of strings; create the DF by passing the required schema.
Both seemed to be working (unsure if it's the right approach), but they are not generic enough that I can reuse them for the different use cases I have.
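For reference, a rough sketch of the first brute-force approach (hypothetical names; it assumes the input dataframe is df with the schema above and ranks 1 to 10):
// Take each rank's slice, suffix its columns, and join the slices on UserID.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

val orderCols = Seq("StartDate", "endDate", "orderId", "OrderCost", "OrderItems", "OrderLocation")

def rankSlice(df: DataFrame, rank: Int): DataFrame =
  orderCols.foldLeft(df.filter(col("Rank") === rank)) {
    (acc, c) => acc.withColumnRenamed(c, s"${c}_$rank")
  }.drop("Rank")

val wide = (1 to 10).map(rankSlice(df, _)).reduce(_.join(_, Seq("UserID"), "outer"))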
In this example I used https://github.com/bokeh/bokeh/blob/master/bokeh/sampledata/_data/auto-mpg.csv
Spark by default puts the rank in front, so the column names are "reversed" from what you specified, but this is done in only a few steps. The key is that exprs should be created dynamically, and that agg requires them to be split into a head and tail (which is why there is .agg(exprs(0), exprs.slice(1, exprs.length):_*) below).
scala> df2.columns
res39: Array[String] = Array(mpg, cyl, displ, hp, weight, accel, yr, origin, name, Rank)
// note here, you would use columns.slice with the indices for
// the columns you need, i.e. (1, 7)
val exprs = for (col <- df2.columns.slice(0, 8)) yield expr(s"first(${col}) as ${col}")
exprs: Array[org.apache.spark.sql.Column] = Array(first(mpg, false) AS `mpg`, first(cyl, false) AS `cyl`, first(displ, false) AS `displ`, first(hp, false) AS `hp`, first(weight, false) AS `weight`, first(accel, false) AS `accel`, first(yr, false) AS `yr`, first(origin, false) AS `origin`)
scala> val resultDF = df2.groupBy("name").pivot("Rank").agg(exprs(0), exprs.slice(1, exprs.length):_*)
scala> resultDF.columns
res40: Array[String] = Array(name, 1_mpg, 1_cyl, 1_displ, 1_hp, 1_weight, 1_accel, 1_yr, 1_origin, 2_mpg, 2_cyl, 2_displ, 2_hp, 2_weight, 2_accel, 2_yr, 2_origin, 3_mpg, 3_cyl, 3_displ, 3_hp, 3_weight, 3_accel, 3_yr, 3_origin, 4_mpg, 4_cyl, 4_displ, 4_hp, 4_weight, 4_accel, 4_yr, 4_origin, 5_mpg, 5_cyl, 5_displ, 5_hp, 5_weight, 5_accel, 5_yr, 5_origin, 6_mpg, 6_cyl, 6_displ, 6_hp, 6_weight, 6_accel, 6_yr, 6_origin, 7_mpg, 7_cyl, 7_displ, 7_hp, 7_weight, 7_accel, 7_yr, 7_origin, 8_mpg, 8_cyl, 8_displ, 8_hp, 8_weight, 8_accel, 8_yr, 8_origin, 9_mpg, 9_cyl, 9_displ, 9_hp, 9_weight, 9_accel, 9_yr, 9_origin, 10_mpg, 10_cyl, 10_displ, 10_hp, 10_weight, 10_accel, 10_yr, 10_origin)
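If the "<field>_<rank>" names from the question are required, a small follow-up sketch (assuming every pivoted column is named "<rank>_<field>") can flip them:
// Flip "1_mpg" -> "mpg_1"; non-pivoted columns such as "name" are left alone.
val renamed = resultDF.columns.foldLeft(resultDF) { (acc, c) =>
  c.split("_", 2) match {
    case Array(rank, field) if rank.forall(_.isDigit) =>
      acc.withColumnRenamed(c, s"${field}_${rank}")
    case _ => acc
  }
}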

How to merge two columns into a new DataFrame?

I have two DataFrames (Spark 2.2.0 and Scala 2.11.8). The first DataFrame, df1, has one column called col1, and the second one, df2, also has one column, called col2. The number of rows is equal in both DataFrames.
How can I merge these two columns into a new DataFrame?
I tried join, but I think that there should be some other way to do it.
Also, I tried to apply withColumn, but it does not compile.
val result = df1.withColumn(col("col2"), df2.col1)
UPDATE:
For example:
df1 =
col1
1
2
3
df2 =
col2
4
5
6
result =
col1 col2
1 4
2 5
3 6
If there's no actual relationship between these two columns, it sounds like you need the union operator, which will return, well, just the union of these two dataframes:
var df1 = Seq("a", "b", "c").toDF("one")
var df2 = Seq("d", "e", "f").toDF("two")
df1.union(df2).show
+---+
|one|
+---+
| a |
| b |
| c |
| d |
| e |
| f |
+---+
[edit]
Now that you've made clear you just want two columns, with DataFrames you can use the trick of adding a row index with the function monotonically_increasing_id() and joining on that index value:
import org.apache.spark.sql.functions.monotonically_increasing_id
var df1 = Seq("a", "b", "c").toDF("one")
var df2 = Seq("d", "e", "f").toDF("two")
df1.withColumn("id", monotonically_increasing_id())
.join(df2.withColumn("id", monotonically_increasing_id()), Seq("id"))
.drop("id")
.show
+---+---+
|one|two|
+---+---+
| a | d |
| b | e |
| c | f |
+---+---+
As far as I know, the only way to do what you want with DataFrames is by adding an index column using RDD.zipWithIndex to each and then doing a join on the index column. Code for doing zipWithIndex on a DataFrame can be found in this SO answer.
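For what it's worth, a minimal sketch of that zipWithIndex route (assuming a SparkSession named spark and equal row counts in both DataFrames):
// Add an explicit row index to each DataFrame via RDD.zipWithIndex,
// join on it, then drop it again.
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

def withRowIndex(df: DataFrame): DataFrame = {
  val indexed = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
  spark.createDataFrame(indexed,
    StructType(df.schema.fields :+ StructField("row_idx", LongType, nullable = false)))
}

val merged = withRowIndex(df1)
  .join(withRowIndex(df2), Seq("row_idx"))
  .drop("row_idx")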
But, if the DataFrames are small, it would be much simpler to collect the two DFs in the driver, zip them together, and make the result into a new DataFrame.
[Update with example of in-driver collect/zip]
// collect both single-column DataFrames (string columns here), zip the values,
// and build a new DataFrame from the pairs
val df3 = spark.createDataFrame(df1.collect().map(_.getString(0)) zip df2.collect().map(_.getString(0)))
  .withColumnRenamed("_1", "col1")
  .withColumnRenamed("_2", "col2")
It depends on what you want to do.
If you want to merge two DataFrames you should use a join. The same join types exist as in relational algebra (or any DBMS).
You are saying that your DataFrames have just one column each.
In that case you might want to do a cross join (cartesian product), which gives you a two-column table of all possible combinations of col1 and col2, or you might want the union (as referred to by #Chondrops), which gives you a one-column table with all elements.
I think all other join types' use cases can be handled with specialized operations in Spark (in this case, two DataFrames with one column each).
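A minimal sketch of that first option, using the single-column df1/df2 from the earlier answers:
// Every (one, two) combination: 3 x 3 = 9 rows for the sample data.
val allPairs = df1.crossJoin(df2)
allPairs.show()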

Calculate frequency of column in data frame using spark sql

I'm trying to get the frequency of distinct values in a Spark dataframe column, something like "value_counts" from Python Pandas. By frequency I mean the highest-occurring value in a table column (such as the rank 1 value, rank 2, rank 3, etc.). In the expected output, 1 has occurred 9 times in column a, so it has the topmost frequency.
I'm using Spark SQL but it is not working out, maybe because the reduce operation I have written is wrong.
**Pandas Example**
value_counts().index[1]
**Current Code in Spark**
val x = parquetRDD_subset.schema.fieldNames
val dfs = x.map(field => spark.sql(
  s"""select 'ParquetRDD' as TableName,
     '$field' as column,
     min($field) as min, max($field) as max,
     SELECT number_cnt FROM (SELECT $field as value,
     approx_count_distinct($field) as number_cnt FROM peopleRDDtable
     group by $field) as frequency from peopleRDDtable"""))
val withSum = dfs.reduce((x, y) => x.union(y)).distinct()
withSum.show()
The problem area is the query below.
SELECT number_cnt FROM (SELECT $field as value,
approx_count_distinct($field) as number_cnt FROM peopleRDDtable
group by $field)
**Expected output**
TableName    | column | min | max | frequency1 |
_____________+________+_____+_____+____________+
ParquetRDD   | a      | 1   | 30  | 9          |
_____________+________+_____+_____+____________+
ParquetRDD   | b      | 2   | 21  | 5          |
How do I solve this? Please help.
I could solve the issue by using count($field) instead of approx_count_distinct($field). Then I used the rank analytical function to get the first-ranked value. It worked.
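For reference, a compact sketch of that idea (my reconstruction, not the posted code; it takes the max over plain counts rather than using rank, and assumes the peopleRDDtable view and the field list x from the question):
// Per-field query: count each value, then report the top count as the
// frequency alongside the overall min and max of the field.
val perField = x.map(field => spark.sql(s"""
  select 'ParquetRDD' as TableName,
         '$field'     as column,
         min($field)  as min,
         max($field)  as max,
         max(cnt)     as frequency1
  from (
    select $field, count(*) as cnt
    from peopleRDDtable
    group by $field
  ) t
"""))
val result = perField.reduce(_ union _)
result.show()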