Spark Scala: User defined aggregate function that calculates median - scala

I'm trying to find a way to calculate the median for a given DataFrame.
val df = sc.parallelize(Seq(("a",1.0),("a",2.0),("a",3.0),("b",6.0), ("b", 8.0))).toDF("col1", "col2")
+----+----+
|col1|col2|
+----+----+
| a| 1.0|
| a| 2.0|
| a| 3.0|
| b| 6.0|
| b| 8.0|
+----+----+
Now I want to do something like this:
df.groupBy("col1").agg(calcmedian("col2"))
the result should look like this:
+----+------+
|col1|median|
+----+------+
| a| 2.0|
| b| 7.0|
+----+------+
Therefore calcmedian() has to be a UDAF, but the problem is that the "evaluate" method of a UDAF only takes a Row, while I need the whole group to sort the values and return the median...
// Once all entries for a group are exhausted, spark will evaluate to get the final result
def evaluate(buffer: Row) = {...}
Is this possible somehow, or is there another nice workaround? I want to stress that I know how to calculate the median on a dataset with "one group". But I don't want to use that algorithm in a "foreach" loop, as this is inefficient!
Thank you!
Edit:
this is what I tried so far:
object calcMedian extends UserDefinedAggregateFunction {
  // Schema you get as an input
  def inputSchema = new StructType().add("col2", DoubleType)
  // Schema of the row which is used for aggregation
  def bufferSchema = new StructType().add("col2", DoubleType)
  // Returned type
  def dataType = DoubleType
  // Self-explaining
  def deterministic = true
  // initialize - called once for each group
  def initialize(buffer: MutableAggregationBuffer) = {
    buffer(0) = 0.0
  }
  // called for each input record of that group
  def update(buffer: MutableAggregationBuffer, input: Row) = {
    buffer(0) = input.getDouble(0)
  }
  // if the function supports partial aggregates, spark might (as an optimization) compute partial results and combine them together
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
    buffer1(0) = buffer2.getDouble(0)
  }
  // Once all entries for a group are exhausted, spark will evaluate to get the final result
  def evaluate(buffer: Row) = {
    val tile = 50
    var median = 0.0
    // PROBLEM: buffer is a Row --> I need a DataFrame here???
    val rdd_sorted = buffer.sortBy(x => x)
    val c = rdd_sorted.count()
    if (c == 1) {
      median = rdd_sorted.first()
    } else {
      val index = rdd_sorted.zipWithIndex().map(_.swap)
      val last = c
      val n = (tile / 100d) * (c * 1d)
      val k = math.floor(n).toLong
      val d = n - k
      if (k <= 0) {
        median = rdd_sorted.first()
      } else {
        if (k <= c) {
          median = index.lookup(last - 1).head
        } else {
          if (k >= c) {
            median = index.lookup(last - 1).head
          } else {
            median = index.lookup(k - 1).head + d * (index.lookup(k).head - index.lookup(k - 1).head)
          }
        }
      }
    }
    median
  } // end of evaluate
}

Try this:
import org.apache.spark.sql.functions._
val result = df.groupBy("col1").agg(callUDF("percentile_approx", col("col2"), lit(0.5)))
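For completeness, a minimal sketch of the full call against the df from the question (this assumes percentile_approx is available, i.e. Spark 2.1+ where it is a built-in SQL function, or a HiveContext on older versions):
import org.apache.spark.sql.functions._

// Approximate median per group; 0.5 is the median quantile.
val medians = df.groupBy("col1")
  .agg(callUDF("percentile_approx", col("col2"), lit(0.5)).as("median"))

medians.show()
// Being an approximation, the value for group "b" may come out as 6.0
// rather than the exact interpolated median of 7.0.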

Related

Dynamically renaming column in dataframe, and then joining with one more table

I have a property table like the one below in a dataframe.
Based on the columns-to-rename input, I have to rename the columns of the main table.
If the cust_id flag is Y, I also want to join with the customer table.
In the final output I want to show the hashed column values under the actual column names.
val maintab_df = maintable
val cust_df = customertable
Join the main table and the customer table after renaming the main table column e to a:
maintable.a = customertable.a
Here's an example of how to do it:
propertydf.show
+-----------------+------------+
|columns-to-rename|cust_id_flag|
+-----------------+------------+
|(e to a),(d to b)| Y|
+-----------------+------------+
val columns_to_rename = propertydf.head(1)(0).getAs[String]("columns-to-rename")
val cust_id_flag = propertydf.head(1)(0).getAs[String]("cust_id_flag")
val parsed_columns = columns_to_rename.split(",")
  .map(c => c.replace("(", "").replace(")", "")
  .split(" to "))
// parsed_columns: Array[Array[String]] = Array(Array(e, a), Array(d, b))

val rename_columns = maintab_df.columns.map(c => {
  val matched = parsed_columns.filter(p => c == p(0))
  if (matched.size != 0)
    col(c).as(matched(0)(1).toString)
  else
    col(c)
})
// rename_columns: Array[org.apache.spark.sql.Column] = Array(e AS `a`, f, c, d AS `b`)

val select_columns = maintab_df.columns.map(c => {
  val matched = parsed_columns.filter(p => c == p(0))
  if (matched.size != 0)
    col(matched(0)(1) + "_hash").as(matched(0)(1).toString)
  else
    col(c)
})
// select_columns: Array[org.apache.spark.sql.Column] = Array(a_hash AS `a`, f, c, b_hash AS `b`)
val join_cond = parsed_columns.map(_(1))
// join_cond: Array[String] = Array(a, b)
val result = if (cust_id_flag == "Y") {
  maintab_df.select(rename_columns:_*)
    .join(cust_df, join_cond)
    .select(select_columns:_*)
} else {
  maintab_df
}
result.show
+------+---+---+--------+
| a| f| c| b|
+------+---+---+--------+
|*****!| 1| 11| &&&&|
| ****%| 2| 12|;;;;;;;;|
|*****#| 3| 13| \\\\\\|
+------+---+---+--------+

Scala Dataframe get max value of specific row

Given a dataframe with an index column ("Z"):
val tmp= Seq(("D",0.1,0.3, 0.4), ("E",0.3, 0.1, 0.4), ("F",0.2, 0.2, 0.5)).toDF("Z", "a", "b", "c")
+---+---+---+---+
| Z | a| b| c|
+---+---+---+---+
| "D"|0.1|0.3|0.4|
| "E"|0.3|0.1|0.4|
| "F"|0.2|0.2|0.5|
+---+---+---+---+
Say I'm interested in the first row, where Z = "D":
tmp.filter(col("Z")=== "D")
+---+---+---+---+
| Z | a| b| c|
+---+---+---+---+
|"D"|0.1|0.3|0.4|
+---+---+---+---+
How do I get the min and max values of that DataFrame row and their corresponding column names, while keeping the index column?
Desired output if I want the top 2 max values:
+---+---+---+
|  Z|  b|  c|
+---+---+---+
|  D|0.3|0.4|
+---+---+---+
Desired output if I want the min:
+---+---+
| Z | a|
+---+---+
| D |0.1|
+---+---+
What I tried:
// first convert that DF to an array
val tmp = df.collect.map(_.toSeq).flatten
// returns
tmp: Array[Any] = Array(0.1, 0.3, 0.4) <--- don't know why Any is returned
//take top values of array
val n = 1
tmp.zipWithIndex.sortBy(-_._1).take(n).map(_._2)
But got error:
No implicit Ordering defined for Any.
Any way to do it straight from dataframe instead of array?
You can do something like this:
tmp
  .where($"a" === 0.1)
  .take(1)
  .map { row =>
    Seq(row.getDouble(0), row.getDouble(1), row.getDouble(2))
  }
  .head
  .sortBy(d => -d)
  .take(2)
Or, if you have a big amount of fields, you can take the schema and pattern match the row fields against the schema data types, like this:
import org.apache.spark.sql.types._

val schemaWithIndex = tmp.schema.zipWithIndex

tmp
  .where($"a" === 0.1)
  .take(1)
  .map { row =>
    for {
      tuple <- schemaWithIndex
    } yield {
      val field = tuple._1
      val index = tuple._2
      field.dataType match {
        case DoubleType => row.getDouble(index)
      }
    }
  }
  .head
  .sortBy(d => -d)
  .take(2)
Maybe there is an easier way to do this.
Definitely not the fastest way, but straight from dataframe
More generic solution:
// somewhere in codebase
import spark.implicits._
import org.apache.spark.sql.functions._

def transform[T, R : Encoder](ds: DataFrame, colsToSelect: Seq[String])(func: Map[String, T] => Map[String, R])
                             (implicit encoder: Encoder[Map[String, R]]): DataFrame = {
  ds.map(row => func(row.getValuesMap(colsToSelect)))
    .toDF()
    .select(explode(col("value")))
    .withColumn("idx", lit(1))
    .groupBy(col("idx")).pivot(col("key")).agg(first(col("value")))
    .drop("idx")
}
Now it's about working with a Map, where the map key is a field name and the map value is the field value.
def fuzzyStuff(values: Map[String, Any]): Map[String, String] = {
  val valueForA = values("a").asInstanceOf[Double]
  // Do whatever you want to do
  // ...
  // use a Map as the return type, where the key is a column name and the value is whatever you want
  Map("x" -> s"fuzzyA-$valueForA")
}

def maxN(n: Int)(values: Map[String, Double]): Map[String, Double] = {
  println(values)
  values.toSeq.sortBy(-_._2).take(n).toMap
}
Usage:
val tmp = Seq((0.1,0.3, 0.4), (0.3, 0.1, 0.4), (0.2, 0.2, 0.5)).toDF("a", "b", "c")
val filtered = tmp.filter(col("a") === 0.1)
transform(filtered, colsToSelect = Seq("a", "b", "c"))(maxN(2))
.show()
+---+---+
| b| c|
+---+---+
|0.3|0.4|
+---+---+
transform(filtered, colsToSelect = Seq("a", "b", "c"))(fuzzyStuff)
.show()
+----------+
| x|
+----------+
|fuzzyA-0.1|
+----------+
Define max and min functions
def maxN(values: Map[String, Double], n: Int): Map[String, Double] = {
  values.toSeq.sortBy(-_._2).take(n).toMap
}

def min(values: Map[String, Double]): Map[String, Double] = {
  Map(values.toSeq.minBy(_._2))
}
Create dataset
val tmp= Seq((0.1,0.3, 0.4), (0.3, 0.1, 0.4), (0.2, 0.2, 0.5)).toDF("a", "b", "c")
val filtered = tmp.filter(col("a") === 0.1)
Explode and pivot the map type
val df = filtered.map(row => maxN(row.getValuesMap(Seq("a", "b", "c")), 2)).toDF()
val exploded = df.select(explode($"value"))
+---+-----+
|key|value|
+---+-----+
|  c|  0.4|
|  b|  0.3|
+---+-----+
//Then pivot
exploded.withColumn("idx", lit(1))
  .groupBy($"idx").pivot($"key").agg(first($"value"))
  .drop("idx")
  .show()
.show()
+---+---+
| b| c|
+---+---+
|0.3|0.4|
+---+---+
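Not part of the answers above, but if you also need to keep the "Z" index column from the question's original tmp (the version that has Z), one hedged sketch is to collect the single filtered row, rank its numeric values locally, and re-select only the winning columns:
import org.apache.spark.sql.functions.col

// Assumes the question's tmp with columns Z, a, b, c and exactly one row with Z = "D".
val row = tmp.filter(col("Z") === "D").head()
val numericCols = Seq("a", "b", "c")

// Names of the top-2 values in that row.
val topCols = numericCols
  .map(name => name -> row.getAs[Double](name))
  .sortBy(-_._2)
  .take(2)
  .map(_._1)

// Keep the original column order (b before c) and prepend Z.
val keep = numericCols.filter(topCols.contains)
tmp.filter(col("Z") === "D")
  .select((Seq("Z") ++ keep).map(c => col(c)): _*)
  .show()
// +---+---+---+
// |  Z|  b|  c|
// +---+---+---+
// |  D|0.3|0.4|
// +---+---+---+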

Spark creating a new column based on a mapped value of an existing column

I am trying to map the values of one column in my dataframe to a new value and put it into a new column using a UDF, but I am unable to get the UDF to accept a parameter that isn't also a column. For example, I have a dataframe dfOriginal like this:
+-----------+-----+
|high_scores|count|
+-----------+-----+
| 9| 1|
| 21| 2|
| 23| 3|
| 7| 6|
+-----------+-----+
And I'm trying to get a sense of the bin the numeric value falls into, so I may construct a list of bins like this:
case class Bin(binMax: BigDecimal, binWidth: BigDecimal) {
  val binMin = binMax - binWidth

  // only one of the two evaluations can include an "or=", otherwise a value could fit in 2 bins
  def fitsInBin(value: BigDecimal): Boolean = value > binMin && value <= binMax

  def rangeAsString(): String = {
    val sb = new StringBuilder()
    sb.append(trimDecimal(binMin)).append(" - ").append(trimDecimal(binMax))
    sb.toString()
  }
}
And then I want to transform my old dataframe like this to make dfBin:
+-----------+-----+---------+
|high_scores|count|bin_range|
+-----------+-----+---------+
| 9| 1| 0 - 10 |
| 21| 2| 20 - 30 |
| 23| 3| 20 - 30 |
| 7| 6| 0 - 10 |
+-----------+-----+---------+
So that I can ultimately get a count of the instances of the bins by calling .groupBy("bin_range").count().
I am trying to generate dfBin by using the withColumn function with an UDF.
Here's the code with the UDF I am attempting to use:
val convertValueToBinRangeUDF = udf((value: String, binList: List[Bin]) => {
  val number = BigDecimal(value)
  val bin = binList.find(bin => bin.fitsInBin(number)).getOrElse(Bin(BigDecimal(0), BigDecimal(0)))
  bin.rangeAsString()
})
val binList = List(Bin(10, 10), Bin(20, 10), Bin(30, 10), Bin(40, 10), Bin(50, 10))
val dfBin = dfOriginal.withColumn("bin_range", convertValueToBinRangeUDF(col("high_scores"), binList))
But it's giving me a type mismatch:
Error:type mismatch;
found : List[Bin]
required: org.apache.spark.sql.Column
val valueCountsWithBin = valuesCounts.withColumn(binRangeCol, convertValueToBinRangeUDF(col(columnName), binList))
Seeing the definition of a UDF makes me think it should handle the conversion fine, but it clearly doesn't. Any ideas?
The problem is that parameters to a UDF should all be of column type. One solution would be to convert binList into a column and pass it to the UDF, similar to the current code.
However, it is simpler to adjust the UDF slightly and turn it into a def. In this way you can easily pass other non-column type data:
def convertValueToBinRangeUDF(binList: List[Bin]) = udf((value: String) => {
  val number = BigDecimal(value)
  val bin = binList.find(bin => bin.fitsInBin(number)).getOrElse(Bin(BigDecimal(0), BigDecimal(0)))
  bin.rangeAsString()
})
Usage:
val dfBin = valuesCounts.withColumn("bin_range", convertValueToBinRangeUDF(binList)($"columnName"))
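To tie it back to the question's data, a hedged end-to-end sketch (assuming the Bin class and binList defined above, and spark.implicits._ in scope; the example rows simply mirror the question's dfOriginal):
import org.apache.spark.sql.functions._

val dfOriginal = Seq((9, 1), (21, 2), (23, 3), (7, 6)).toDF("high_scores", "count")

val dfBin = dfOriginal.withColumn(
  "bin_range",
  // the UDF expects a String, so cast the numeric column first
  convertValueToBinRangeUDF(binList)(col("high_scores").cast("string")))

// and finally the bin counts the question is ultimately after:
dfBin.groupBy("bin_range").count().show()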
Try this -
scala> case class Bin(binMax:BigDecimal, binWidth:BigDecimal) {
| val binMin = binMax - binWidth
|
| // only one of the two evaluations can include an "or=", otherwise a value could fit in 2 bins
| def fitsInBin(value: BigDecimal): Boolean = value > binMin && value <= binMax
|
| def rangeAsString(): String = {
| val sb = new StringBuilder()
| sb.append(binMin).append(" - ").append(binMax)
| sb.toString()
| }
| }
defined class Bin
scala> val binList = List(Bin(10, 10), Bin(20, 10), Bin(30, 10), Bin(40, 10), Bin(50, 10))
binList: List[Bin] = List(Bin(10,10), Bin(20,10), Bin(30,10), Bin(40,10), Bin(50,10))
scala> spark.udf.register("convertValueToBinRangeUDF", (value: String) => {
| val number = BigDecimal(value)
| val bin = binList.find( bin => bin.fitsInBin(number)).getOrElse(Bin(BigDecimal(0), BigDecimal(0)))
| bin.rangeAsString()
| })
res13: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
//-- Testing with one record
scala> val dfOriginal = spark.sql(s""" select "9" as `high_scores`, "1" as count """)
dfOriginal: org.apache.spark.sql.DataFrame = [high_scores: string, count: string]
scala> dfOriginal.createOrReplaceTempView("dfOriginal")
scala> val dfBin = spark.sql(s""" select high_scores, count, convertValueToBinRangeUDF(high_scores) as bin_range from dfOriginal """)
dfBin: org.apache.spark.sql.DataFrame = [high_scores: string, count: string ... 1 more field]
scala> dfBin.show(false)
+-----------+-----+---------+
|high_scores|count|bin_range|
+-----------+-----+---------+
|9 |1 |0 - 10 |
+-----------+-----+---------+
Hope this will help.

Remove columns from Dataframe that all have one value (more efficient)

Let's say I have following dataframe:
/*
+---------+--------+----------+--------+
|a |b | c |d |
+---------+--------+----------+--------+
| bob| -1| 5| -1|
| alice| -1| -1| -1|
+---------+--------+----------+--------+
*/
I want to remove columns which only have -1 in all rows (in this case b and d). I found a solution but when I run my job I found out it was very inefficient:
private def removeEmptyColumns(df: DataFrame): DataFrame = {
  val types = List("IntegerType", "DoubleType", "LongType")
  val dTypes: Array[(String, String)] = df.dtypes
  dTypes.foldLeft(df)((d, t) => {
    val colType = t._2
    val colName = t._1
    if (types.contains(colType)) {
      if (colType.equals("IntegerType")) {
        if (d.select(colName).filter(col(colName) =!= -1).take(1).length == 0) d.drop(colName)
        else d
      } else if (colType.equals("DoubleType")) {
        if (d.select(colName).filter(col(colName) =!= -1.0).take(1).length == 0) d.drop(colName)
        else d
      } else {
        if (d.select(colName).filter(col(colName) =!= -1).take(1).length == 0) d.drop(colName)
        else d
      }
    } else {
      d
    }
  })
}
Is there a better solution or way to improve my existing code?
(I think this line val count = d.select(colName).distinct.count is the bottleneck)
I am using Spark 2.2 atm.
Many thanks
Instead of counting the number of distinct values, check whether there exists any value other than -1:
d.select(colName).filter(col(colName) =!= -1).take(1).length == 0
Another approach
Instead of going through the dataframe n times (once for each column), you can try to collect the statistics all at once:
val summary = d.agg(
  max(col1).as(s"${col1}_max"), min(col1).as(s"${col1}_min"),
  max(col2).as(s"${col2}_max"), min(col2).as(s"${col2}_min"),
  ...)
  .first
Then check whether the min and max values for a column are both -1.
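A hedged sketch of that single-pass idea, built programmatically over the numeric columns (the type filter and the -1 check are assumptions; adapt them to your schema):
import org.apache.spark.sql.functions.{col, max, min}

// min/max for every numeric column, collected in a single job
val numericCols = df.dtypes.collect {
  case (name, dtype) if Set("IntegerType", "DoubleType", "LongType").contains(dtype) => name
}
val aggExprs = numericCols.flatMap(c =>
  Seq(min(col(c)).as(s"${c}_min"), max(col(c)).as(s"${c}_max")))
val summary = df.agg(aggExprs.head, aggExprs.tail: _*).first()

// a column can be dropped when its min and max are both -1 (i.e. every value is -1)
val constantCols = numericCols.filter { c =>
  val mn = summary.getAs[Any](s"${c}_min")
  val mx = summary.getAs[Any](s"${c}_max")
  mn != null && mn == mx && mn.toString.toDouble == -1.0
}
val cleaned = df.drop(constantCols: _*)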

Get top values from a spark dataframe column in Scala

val df = sc.parallelize(Seq((201601, "a"),
  (201602, "b"),
  (201603, "c"),
  (201604, "c"),
  (201607, "c"),
  (201604, "c"),
  (201608, "c"),
  (201609, "c"),
  (201605, "b"))).toDF("col1", "col2")
I want to get the top 3 values of col1. Can anyone please let me know the best way to do this?
Spark : 1.6.2
Scala : 2.10
You can do it like below.
df.select($"col1").orderBy($"col1".desc).limit(3).show()
You will get
+------+
| col1|
+------+
|201609|
|201608|
|201607|
+------+
You can extract the maxDate first and then filter based on the maxDate:
val maxDate = df.agg(max("col1")).first().getAs[Int](0)
// maxDate: Int = 201609
def minusThree(date: Int): Int = {
  var year = date / 100
  var month = date % 100
  if (month <= 3) {
    year -= 1
    month += 9
  } else {
    month -= 3
  }
  year * 100 + month
}
df.filter($"col1" > minusThree(maxDate)).show
+------+----+
| col1|col2|
+------+----+
|201607| c|
|201608| c|
|201609| c|
+------+----+
You can get the same kind of result in one more way, using the RDD top function.
Example:
val data = sc.parallelize(Seq(("maths", 52), ("english", 75), ("science", 82), ("computer", 65), ("maths", 85))).top(2)
Results:
(science,82)
(maths,85)
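Applied to the df from this question (top 3 values of col1), a hedged sketch of the same idea through the RDD API (this assumes col1 is an Int column) would be:
// pull col1 into an RDD of Ints and take the 3 largest values
val top3 = df.select("col1").rdd.map(_.getInt(0)).top(3)
// top3: Array[Int] = Array(201609, 201608, 201607)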