How to include/map calculated percentiles to the result dataframe? - scala

I'm using spark-sql 2.4.1, and I'm trying to find quantiles, i.e. percentile 0, percentile 25, etc., on each column of my given data.
Since I am computing multiple percentiles, how do I retrieve each calculated percentile from the results?
My dataframe df:
+----+---------+-------------+----------+-----------+
| id| date| revenue|con_dist_1| con_dist_2|
+----+---------+-------------+----------+-----------+
| 10|1/15/2018| 0.010680705| 6|0.019875458|
| 10|1/15/2018| 0.006628853| 4|0.816039063|
| 10|1/15/2018| 0.01378215| 4|0.082049528|
| 10|1/15/2018| 0.010680705| 6|0.019875458|
| 10|1/15/2018| 0.006628853| 4|0.816039063|
+----+---------+-------------+----------+-----------+
I need to get the expected output/result as shown below:
+----+---------+-------------+-------------+------------+-------------+
| id| date| revenue| perctile_col| quantile_0 |quantile_10 |
+----+---------+-------------+-------------+------------+-------------+
| 10|1/15/2018| 0.010680705| con_dist_1 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.010680705| con_dist_2 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.006628853| con_dist_1 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.006628853| con_dist_2 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.01378215| con_dist_1 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.01378215| con_dist_2 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.010680705| con_dist_1 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.010680705| con_dist_2 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.006628853| con_dist_1 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.006628853| con_dist_2 |<quant0_val>|<quant10_val>|
+----+---------+-------------+-------------+------------+-------------+
I have already calculated the quantiles like this, but I need to add them to the output dataframe:
val col_list = Array("con_dist_1", "con_dist_2")
val quantiles = df.stat.approxQuantile(col_list, Array(0.0, 0.1, 0.5), 0.0)
val percentile_0 = 0  // index of probability 0.0 in the probabilities array
val percentile_10 = 1 // index of probability 0.1
val Q0 = quantiles(col_list.indexOf("con_dist_1"))(percentile_0)
val Q10 = quantiles(col_list.indexOf("con_dist_1"))(percentile_10)
How do I get the expected output shown above?

An easy solution would be to create multiple dataframes, one per "con_dist" column, and then merge them with union. This can be done with a map over col_list as follows:
import org.apache.spark.sql.functions.lit

val col_list = Array("con_dist_1", "con_dist_2")
val quantiles = df.stat.approxQuantile(col_list, Array(0.0, 0.1, 0.5), 0.0)
val df2 = df.drop(col_list: _*) // we don't need these columns anymore

val result = col_list
  .zipWithIndex
  .map { case (col, colIndex) =>
    // percentile_0 = 0 and percentile_10 = 1 are the indices defined in the question
    val Q0 = quantiles(colIndex)(percentile_0)
    val Q10 = quantiles(colIndex)(percentile_10)
    df2.withColumn("perctile_col", lit(col))
      .withColumn("quantile_0", lit(Q0))
      .withColumn("quantile_10", lit(Q10))
  }
  .reduce(_.union(_))
The final dataframe will then be:
+---+---------+-----------+------------+-----------+-----------+
| id| date| revenue|perctile_col| quantile_0|quantile_10|
+---+---------+-----------+------------+-----------+-----------+
| 10|1/15/2018|0.010680705| con_dist_1| 4.0| 4.0|
| 10|1/15/2018|0.006628853| con_dist_1| 4.0| 4.0|
| 10|1/15/2018| 0.01378215| con_dist_1| 4.0| 4.0|
| 10|1/15/2018|0.010680705| con_dist_1| 4.0| 4.0|
| 10|1/15/2018|0.006628853| con_dist_1| 4.0| 4.0|
| 10|1/15/2018|0.010680705| con_dist_2|0.019875458|0.019875458|
| 10|1/15/2018|0.006628853| con_dist_2|0.019875458|0.019875458|
| 10|1/15/2018| 0.01378215| con_dist_2|0.019875458|0.019875458|
| 10|1/15/2018|0.010680705| con_dist_2|0.019875458|0.019875458|
| 10|1/15/2018|0.006628853| con_dist_2|0.019875458|0.019875458|
+---+---------+-----------+------------+-----------+-----------+
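A note on this approach: the map over col_list builds one dataframe per con_dist column, so df is typically scanned once per column before the union. As a rough alternative sketch (assuming the same col_list, quantiles, percentile_0 and percentile_10 values as above; the extra names used here are only for illustration), the per-column quantiles can also be attached as an array of structs and exploded, so the source dataframe is only read once:
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

// One struct per con_dist column, holding the column name and its quantiles.
val quantileStructs = col_list.zipWithIndex.map { case (c, i) =>
  struct(
    lit(c).as("perctile_col"),
    lit(quantiles(i)(percentile_0)).as("quantile_0"),
    lit(quantiles(i)(percentile_10)).as("quantile_10"))
}

// Explode the array of structs and flatten it into the three result columns.
val result2 = df.drop(col_list: _*)
  .withColumn("q", explode(array(quantileStructs: _*)))
  .select(col("id"), col("date"), col("revenue"),
    col("q.perctile_col"), col("q.quantile_0"), col("q.quantile_10"))
Either way, the quantile values end up as literal columns next to the original rows.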

Related

Window function acts not as expected when I use Order By (PySpark)

So I have read this comprehensive material, yet I don't understand why the window function behaves this way.
Here's a little example:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

columns = ["CATEGORY", "REVENUE"]
data = [("Cell Phone", "6000"),
        ("Tablet", "1500"),
        ("Tablet", "5500"),
        ("Cell Phone", "5000"),
        ("Cell Phone", "6000"),
        ("Tablet", "2500"),
        ("Cell Phone", "3000"),
        ("Cell Phone", "3000"),
        ("Tablet", "3000"),
        ("Tablet", "4500"),
        ("Tablet", "6500")]

df = spark.createDataFrame(data=data, schema=columns)

window_spec = Window.partitionBy(df['CATEGORY']).orderBy(df['REVENUE'])
revenue_difference = F.max(df['REVENUE']).over(window_spec)

df.select(
    df['CATEGORY'],
    df['REVENUE'],
    revenue_difference.alias("revenue_difference")).show()
So when I write orderBy(df['REVENUE']), I get this:
+----------+-------+------------------+
| CATEGORY|REVENUE|revenue_difference|
+----------+-------+------------------+
|Cell Phone| 3000| 3000|
|Cell Phone| 3000| 3000|
|Cell Phone| 5000| 5000|
|Cell Phone| 6000| 6000|
|Cell Phone| 6000| 6000|
| Tablet| 1500| 1500|
| Tablet| 2500| 2500|
| Tablet| 3000| 3000|
| Tablet| 4500| 4500|
| Tablet| 5500| 5500|
| Tablet| 6500| 6500|
+----------+-------+------------------+
But when I write orderBy(df['REVENUE'].desc()), I get this:
+----------+-------+------------------+
| CATEGORY|REVENUE|revenue_difference|
+----------+-------+------------------+
|Cell Phone| 6000| 6000|
|Cell Phone| 6000| 6000|
|Cell Phone| 5000| 6000|
|Cell Phone| 3000| 6000|
|Cell Phone| 3000| 6000|
| Tablet| 6500| 6500|
| Tablet| 5500| 6500|
| Tablet| 4500| 6500|
| Tablet| 3000| 6500|
| Tablet| 2500| 6500|
| Tablet| 1500| 6500|
+----------+-------+------------------+
I don't understand this, because the way I see it, the MAX value in each window should stay the same no matter what the order is. Can someone please explain what I am not getting here?
Thank you!
The simple reason is that the default window frame is Window.unboundedPreceding to Window.currentRow, which means the max is taken from the first row in the partition up to the current row, NOT up to the last row of the partition.
This is a common gotcha. (You can replace max() with sum() and see what output you get; it also changes depending on how you order the partition.)
To solve this, you can specify that you want the max of each partition to always be calculated using the full window partition, like so:
window_spec = (Window.partitionBy(df['CATEGORY'])
               .orderBy(df['REVENUE'])
               .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
revenue_difference = F.max(df['REVENUE']).over(window_spec)

df.select(
    df['CATEGORY'],
    df['REVENUE'],
    revenue_difference.alias("revenue_difference")).show()
+----------+-------+------------------+
| CATEGORY|REVENUE|revenue_difference|
+----------+-------+------------------+
| Tablet| 6500| 6500|
| Tablet| 5500| 6500|
| Tablet| 4500| 6500|
| Tablet| 3000| 6500|
| Tablet| 2500| 6500|
| Tablet| 1500| 6500|
|Cell Phone| 6000| 6000|
|Cell Phone| 6000| 6000|
|Cell Phone| 5000| 6000|
|Cell Phone| 3000| 6000|
|Cell Phone| 3000| 6000|
+----------+-------+------------------+

How to calculate group-based quantiles?

I'm using spark-sql 2.4.1, and I'm trying to find quantiles, i.e. percentile 0, percentile 25, etc., on each column of my given data.
My dataframe df:
+----+---------+-------------+----------+-----------+--------+
| id| date| revenue|con_dist_1| con_dist_2| state |
+----+---------+-------------+----------+-----------+--------+
| 10|1/15/2018| 0.010680705| 6|0.019875458| TX |
| 10|1/15/2018| 0.006628853| 4|0.816039063| AZ |
| 10|1/15/2018| 0.01378215| 4|0.082049528| TX |
| 10|1/15/2018| 0.010680705| 6|0.019875458| TX |
| 10|1/15/2018| 0.006628853| 4|0.816039063| AZ |
+----+---------+-------------+----------+-----------+--------+
How can I find the quantiles of the columns "con_dist_1" and "con_dist_2" for each state?
A possible solution could be:
scala> input.show
+---+---------+-----------+----------+-----------+-----+
| id| date| revenue|con_dist_1| con_dist_2|state|
+---+---------+-----------+----------+-----------+-----+
| 10|1/15/2018|0.010680705| 6|0.019875458| TX|
| 10|1/15/2018|0.006628853| 4|0.816039063| AZ|
| 10|1/15/2018| 0.01378215| 4|0.082049528| TX|
| 10|1/15/2018|0.010680705| 6|0.019875458| TX|
| 10|1/15/2018|0.006628853| 4|0.816039063| AZ|
+---+---------+-----------+----------+-----------+-----+
scala> val df1 = input.groupBy("state").agg(collect_list("con_dist_1").as("combined_1"), collect_list("con_dist_2").as("combined_2"))
df1: org.apache.spark.sql.DataFrame = [state: string, combined_1: array<int> ... 1 more field]
scala> df1.show
+-----+----------+--------------------+
|state|combined_1| combined_2|
+-----+----------+--------------------+
| AZ| [4, 4]|[0.816039063, 0.8...|
| TX| [6, 4, 6]|[0.019875458, 0.0...|
+-----+----------+--------------------+
scala> df1.
| withColumn("comb1_Q1", sort_array($"combined_1")(((size($"combined_1")-1)*0.25).cast("int"))).
| withColumn("comb1_Q2", sort_array($"combined_1")(((size($"combined_1")-1)*0.5).cast("int"))).
| withColumn("comb1_Q3", sort_array($"combined_1")(((size($"combined_1")-1)*0.75).cast("int"))).
| withColumn("comb_2_Q1", sort_array($"combined_2")(((size($"combined_2")-1)*0.25).cast("int"))).
| withColumn("comb_2_Q2", sort_array($"combined_2")(((size($"combined_2")-1)*0.5).cast("int"))).
| withColumn("comb_2_Q3", sort_array($"combined_2")(((size($"combined_2")-1)*0.75).cast("int"))).
| show
+-----+----------+--------------------+--------+--------+--------+-----------+-----------+-----------+
|state|combined_1| combined_2|comb1_Q1|comb1_Q2|comb1_Q3| comb_2_Q1| comb_2_Q2| comb_2_Q3|
+-----+----------+--------------------+--------+--------+--------+-----------+-----------+-----------+
| AZ| [4, 4]|[0.816039063, 0.8...| 4| 4| 4|0.816039063|0.816039063|0.816039063|
| TX| [6, 4, 6]|[0.019875458, 0.0...| 4| 6| 6|0.019875458|0.019875458|0.019875458|
+-----+----------+--------------------+--------+--------+--------+-----------+-----------+-----------+
EDIT
I don't think we can achieve this with the approxQuantile method, because you want the quantiles per state: that would require grouping by the state column and aggregating the con_dist columns, and approxQuantile expects a whole column of integers or floats, not an array type.
The other solution is to use Spark SQL as shown below:
scala> input.show
+---+---------+-----------+----------+-----------+-----+
| id| date| revenue|con_dist_1| con_dist_2|state|
+---+---------+-----------+----------+-----------+-----+
| 10|1/15/2018|0.010680705| 6|0.019875458| TX|
| 10|1/15/2018|0.006628853| 4|0.816039063| AZ|
| 10|1/15/2018| 0.01378215| 4|0.082049528| TX|
| 10|1/15/2018|0.010680705| 6|0.019875458| TX|
| 10|1/15/2018|0.006628853| 4|0.816039063| AZ|
+---+---------+-----------+----------+-----------+-----+
scala> input.createOrReplaceTempView("input")
scala> :paste
// Entering paste mode (ctrl-D to finish)
val query = "select state, percentile_approx(con_dist_1,0.25) as col1_quantile_1, " +
"percentile_approx(con_dist_1,0.5) as col1_quantile_2," +
"percentile_approx(con_dist_1,0.75) as col1_quantile_3, " +
"percentile_approx(con_dist_2,0.25) as col2_quantile_1,"+
"percentile_approx(con_dist_2,0.5) as col2_quantile_2," +
"percentile_approx(con_dist_2,0.75) as col2_quantile_3 " +
"from input group by state"
// Exiting paste mode, now interpreting.
query: String = select state, percentile_approx(con_dist_1,0.25) as col1_quantile_1, percentile_approx(con_dist_1,0.5) as col1_quantile_2,percentile_approx(con_dist_1,0.75) as col1_quantile_3, percentile_approx(con_dist_2,0.25) as col2_quantile_1,percentile_approx(con_dist_2,0.5) as col2_quantile_2,percentile_approx(con_dist_2,0.75) as col2_quantile_3 from input group by state
scala> val df2 = spark.sql(query)
df2: org.apache.spark.sql.DataFrame = [state: string, col1_quantile_1: int ... 5 more fields]
scala> df2.show
+-----+---------------+---------------+---------------+---------------+---------------+---------------+
|state|col1_quantile_1|col1_quantile_2|col1_quantile_3|col2_quantile_1|col2_quantile_2|col2_quantile_3|
+-----+---------------+---------------+---------------+---------------+---------------+---------------+
| AZ| 4| 4| 4| 0.816039063| 0.816039063| 0.816039063|
| TX| 4| 6| 6| 0.019875458| 0.019875458| 0.082049528|
+-----+---------------+---------------+---------------+---------------+---------------+---------------+
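For reference, a rough DataFrame-API equivalent of the SQL above (the same percentile_approx function, just invoked through expr instead of a temp view; the result name here is only for illustration) would look like this:
import org.apache.spark.sql.functions.expr

// groupBy + agg with percentile_approx called via expr, mirroring the SQL query.
val quantilesByState = input.groupBy("state").agg(
  expr("percentile_approx(con_dist_1, 0.25)").as("col1_quantile_1"),
  expr("percentile_approx(con_dist_1, 0.5)").as("col1_quantile_2"),
  expr("percentile_approx(con_dist_1, 0.75)").as("col1_quantile_3"),
  expr("percentile_approx(con_dist_2, 0.25)").as("col2_quantile_1"),
  expr("percentile_approx(con_dist_2, 0.5)").as("col2_quantile_2"),
  expr("percentile_approx(con_dist_2, 0.75)").as("col2_quantile_3"))

quantilesByState.show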
Let me know if it helps!!

Understanding the split function of MLlib from PySpark

I have the following transformed data.
dataframe: rev
+--------+------------------+
|features| label|
+--------+------------------+
| [24.0]| 6.382551510879452|
| [29.0]| 6.233604067150788|
| [35.0]|15.604956217859785|
+--------+------------------+
When I split it into two sets as follows, I get something really unexpected. Sorry, I am new to PySpark.
(trainingData, testData) = rev.randomSplit([0.7, 0.3])
Now when I check, I find:
trainingData.show(3)
+--------+--------------------+
|features| label|
+--------+--------------------+
| [22.0]|0.007807592294154144|
| [22.0]|0.016228017481755445|
| [22.0]|0.029326273621380787|
+--------+--------------------+
And unfortunately, when I run the model and check the predictions on the test set, I get the following:
+------------------+--------------------+--------+
| prediction| label|features|
+------------------+--------------------+--------+
|11.316183853894138|0.023462300065135114| [22.0]|
|11.316183853894138| 0.02558467547137103| [22.0]|
|11.316183853894138| 0.03734394063419729| [22.0]|
|11.316183853894138| 0.07660100900324195| [22.0]|
|11.316183853894138| 0.08032742812331381| [22.0]|
+------------------+--------------------+--------+
The predictions and labels are in a horrible relationship.
Thanks in advance.
Info Update:
Whole dataset:
rev.describe().show()
+-------+--------------------+
|summary| label|
+-------+--------------------+
| count| 28755967|
| mean| 11.326884020257475|
| stddev| 6.0085535870540125|
| min|5.158072668697356E-4|
| max| 621.5236222433649|
+-------+--------------------+
And train set:
+-------+--------------------+
|summary| label|
+-------+--------------------+
| count| 20132404|
| mean| 11.327304652511287|
| stddev| 6.006384709888342|
| min|5.158072668697356E-4|
| max| 294.9624797344751|
+-------+--------------------+
Try setting a seed in pyspark.sql.DataFrame.randomSplit so the split is reproducible (the weights are normalized, so [7.0, 3.0] is equivalent to [0.7, 0.3]):
(trainingData, testData) = rev.randomSplit([7.0, 3.0], 100)

Sorting numeric String in Spark Dataset

Let's assume that I have the following Dataset:
+-----------+----------+
|productCode| amount|
+-----------+----------+
| XX-13| 300|
| XX-1| 250|
| XX-2| 410|
| XX-9| 50|
| XX-10| 35|
| XX-100| 870|
+-----------+----------+
Where productCode is of String type and the amount is an Int.
If one tries to order this by productCode, the result will be as follows (which is expected because of the nature of String comparison):
def orderProducts(product: Dataset[Product]): Dataset[Product] = {
  product.orderBy("productCode")
}
// Output:
+-----------+----------+
|productCode| amount|
+-----------+----------+
| XX-1| 250|
| XX-10| 35|
| XX-100| 870|
| XX-13| 300|
| XX-2| 410|
| XX-9| 50|
+-----------+----------+
How can I get the output ordered by the integer part of productCode, as shown below, using the Dataset API?
+-----------+----------+
|productCode| amount|
+-----------+----------+
| XX-1| 250|
| XX-2| 410|
| XX-9| 50|
| XX-10| 35|
| XX-13| 300|
| XX-100| 870|
+-----------+----------+
Use the expression in the orderBy. Check this out:
scala> val df = Seq(("XX-13",300),("XX-1",250),("XX-2",410),("XX-9",50),("XX-10",35),("XX-100",870)).toDF("productCode", "amt")
df: org.apache.spark.sql.DataFrame = [productCode: string, amt: int]
scala> df.orderBy(split('productCode,"-")(1).cast("int")).show
+-----------+---+
|productCode|amt|
+-----------+---+
| XX-1|250|
| XX-2|410|
| XX-9| 50|
| XX-10| 35|
| XX-13|300|
| XX-100|870|
+-----------+---+
scala>
With window functions, you could do it like this:
scala> df.withColumn("row1",row_number().over(Window.orderBy(split('productCode,"-")(1).cast("int")))).show(false)
18/12/10 09:25:07 WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+-----------+---+----+
|productCode|amt|row1|
+-----------+---+----+
|XX-1 |250|1 |
|XX-2 |410|2 |
|XX-9 |50 |3 |
|XX-10 |35 |4 |
|XX-13 |300|5 |
|XX-100 |870|6 |
+-----------+---+----+
scala>
Note that Spark warns about moving all the data to a single partition.
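For completeness, a minimal sketch of how the same sort expression could be plugged back into the question's Dataset-returning helper (assuming the Product case class from the question is in scope):
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.split

// orderBy with a Column expression keeps the Dataset[Product] type,
// sorting by the numeric part after the "-" in productCode.
def orderProducts(product: Dataset[Product]): Dataset[Product] =
  product.orderBy(split(product("productCode"), "-")(1).cast("int"))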

How to transform the dataframe into label feature vector?

I am running a logistic regression model in Scala and I have a data frame like the one below:
df
+-----------+------------+
|x |y |
+-----------+------------+
| 0| 0|
| 0| 33|
| 0| 58|
| 0| 96|
| 0| 1|
| 1| 21|
| 0| 10|
| 0| 65|
| 1| 7|
| 1| 28|
+-----------+------------+
I need to transform this into something like this:
+-----+------------------+
|label| features |
+-----+------------------+
| 0.0|(1,[1],[0]) |
| 0.0|(1,[1],[33]) |
| 0.0|(1,[1],[58]) |
| 0.0|(1,[1],[96]) |
| 0.0|(1,[1],[1]) |
| 1.0|(1,[1],[21]) |
| 0.0|(1,[1],[10]) |
| 0.0|(1,[1],[65]) |
| 1.0|(1,[1],[7]) |
| 1.0|(1,[1],[28]) |
+-----+------------------+
I tried:
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

val assembler = new VectorAssembler()
  .setInputCols(Array("x"))
  .setOutputCol("Feature")

val lrModel = lr.fit(df.withColumnRenamed("x", "label").withColumnRenamed("y", "features"))
Any help is appreciated.
Given the dataframe as
+---+---+
|x |y |
+---+---+
|0 |0 |
|0 |33 |
|0 |58 |
|0 |96 |
|0 |1 |
|1 |21 |
|0 |10 |
|0 |65 |
|1 |7 |
|1 |28 |
+---+---+
And doing as below:
val assembler = new VectorAssembler()
  .setInputCols(Array("x", "y"))
  .setOutputCol("features")

val output = assembler.transform(df).select($"x".cast(DoubleType).as("label"), $"features")
output.show(false)
Would give you the result as:
+-----+----------+
|label|features |
+-----+----------+
|0.0 |(2,[],[]) |
|0.0 |[0.0,33.0]|
|0.0 |[0.0,58.0]|
|0.0 |[0.0,96.0]|
|0.0 |[0.0,1.0] |
|1.0 |[1.0,21.0]|
|0.0 |[0.0,10.0]|
|0.0 |[0.0,65.0]|
|1.0 |[1.0,7.0] |
|1.0 |[1.0,28.0]|
+-----+----------+
Now using LogisticRegression would be easy:
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

val lrModel = lr.fit(output)

println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
You will get output like:
Coefficients: [1.5672602877378823,0.0] Intercept: -1.4055020984891717
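If the goal is instead the exact shape shown in the question (the label taken from x and a one-element feature vector built only from y), a minimal sketch along the same lines, assuming the same df and spark.implicits._ in scope for $, would be:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.types.DoubleType

// y as the only feature, x cast to Double as the label.
val yAssembler = new VectorAssembler()
  .setInputCols(Array("y"))
  .setOutputCol("features")

val labeled = yAssembler.transform(df)
  .select($"x".cast(DoubleType).as("label"), $"features")
The resulting dataframe can then be passed to lr.fit in the same way as output above.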