collect_set equivalent spark 1.5 UDAF method verification - scala

Can someone tell me the equivalent function for collect_set in Spark 1.5?
Is there any workaround to get similar results to collect_set(col(name))?
Is this a correct approach:
class CollectSetFunction[T](val colType: DataType) extends UserDefinedAggregateFunction {

  def inputSchema: StructType =
    new StructType().add("inputCol", colType)

  def bufferSchema: StructType =
    new StructType().add("outputCol", ArrayType(colType))

  def dataType: DataType = ArrayType(colType)

  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer.update(0, new scala.collection.mutable.ArrayBuffer[T])
  }

  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val list = buffer.getSeq[T](0)
    if (!input.isNullAt(0)) {
      val sales = input.getAs[T](0)
      buffer.update(0, list :+ sales)
    }
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1.update(0, buffer1.getSeq[T](0).toSet ++ buffer2.getSeq[T](0).toSet)
  }

  def evaluate(buffer: Row): Any = {
    buffer.getSeq[T](0)
  }
}

The code looks correct. Furthermore, I tested it in 1.6.2 in local mode and got the same result (see below). I don't know of any simpler alternative using the DataFrame API. Using RDDs it's pretty straightforward, and in 1.5 it may sometimes be preferable to take a detour through the RDD API, since DataFrames are not yet fully featured there.
scala> val rdd = sc.parallelize(1 to 10).map(x => (x % 5, x))
scala> rdd.groupByKey.mapValues(_.toSet.toList).toDF("k", "set").show
+---+-------+
| k| set|
+---+-------+
| 0|[5, 10]|
| 1| [1, 6]|
| 2| [2, 7]|
| 3| [3, 8]|
| 4| [4, 9]|
+---+-------+
And if you want to factor it out, an initial version (which can be improved) could be the following:
def collectSet(df: DataFrame, k: Column, v: Column) = df
  .select(k.as("k"), v.as("v"))
  .map(r => (r.getInt(0), r.getInt(1)))
  .groupByKey()
  .mapValues(_.toSet.toList)
  .toDF("k", "v")
but if you want to perform other aggregations as well, you will not be able to avoid a join (see the sketch below).
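For example, a minimal sketch (not from the original answer) of combining the set with a second aggregate via a join, assuming the collectSet helper above and a DataFrame df with an integer key column k and value column v like the one built below:

import org.apache.spark.sql.functions.count

val sets   = collectSet(df, $"k", $"v")                   // columns: k, v (the collected set)
val counts = df.groupBy($"k").agg(count($"v").as("cnt"))  // columns: k, cnt
val both   = counts.join(sets, "k")                       // one row per key with both results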
scala> val df = sc.parallelize((1 to 10)).toDF("v").withColumn("k", pmod('v,lit(5)))
df: org.apache.spark.sql.DataFrame = [v: int, k: int]
scala> val csudaf = new CollectSetFunction[Int](IntegerType)
scala> df.groupBy('k).agg(collect_set('v),csudaf('v)).show
+---+--------------+---------------------+
| k|collect_set(v)|CollectSetFunction(v)|
+---+--------------+---------------------+
| 0| [5, 10]| [5, 10]|
| 1| [1, 6]| [1, 6]|
| 2| [2, 7]| [2, 7]|
| 3| [3, 8]| [3, 8]|
| 4| [4, 9]| [4, 9]|
+---+--------------+---------------------+
test 2:
scala> val df = sc.parallelize((1 to 100000)).toDF("v").withColumn("k", floor(rand*10))
df: org.apache.spark.sql.DataFrame = [v: int, k: bigint]
scala> df.groupBy('k).agg(collect_set('v).as("a"),csudaf('v).as("b"))
.groupBy('a==='b).count.show
+-------+-----+
|(a = b)|count|
+-------+-----+
| true| 10|
+-------+-----+

Related

Evaluate formulas in Spark DataFrame

Is it possible to evaluate formulas in a dataframe which refer to columns? e.g. if I have data like this (Scala example):
val df = Seq(
( 1, "(a+b)/d", 1, 20, 2, 3, 1 ),
( 2, "(c+b)*(a+e)", 0, 1, 2, 3, 4 ),
( 3, "a*(d+e+c)", 7, 10, 6, 2, 1 )
)
.toDF( "Id", "formula", "a", "b", "c", "d", "e" )
df.show()
Expected results:
+---+------+
| Id|result|
+---+------+
|  1|     7|
|  2|    12|
|  3|    63|
+---+------+
I have been unable to get selectExpr, expr, eval() or combinations of them to work.
You can use the Scala toolbox eval in a UDF:
import org.apache.spark.sql.functions.{array, col, udf}
import scala.reflect.runtime.universe
import scala.tools.reflect.ToolBox

val tb = universe.runtimeMirror(getClass.getClassLoader).mkToolBox()

val cols = df.columns.tail   // formula, a, b, c, d, e

val eval_udf = udf(
  (r: Seq[String]) =>
    tb.eval(tb.parse(
      ("val %s = %s;" * cols.tail.size).format(
        cols.tail.zip(r.tail).flatMap(x => List(x._1, x._2)): _*
      ) + r(0)
    )).toString
)

val df2 = df.select(col("id"), eval_udf(array(df.columns.tail.map(col): _*)).as("result"))
df2.show
+---+------+
| id|result|
+---+------+
| 1| 7|
| 2| 12|
| 3| 63|
+---+------+
A slightly different version of mck's answer: replace the variables in the formula column with their corresponding values from the other columns, then call an eval UDF:
import org.apache.spark.sql.functions.{expr, udf}
import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox

val eval = udf((f: String) => {
  val toolbox = currentMirror.mkToolBox()
  toolbox.eval(toolbox.parse(f)).toString
})

val formulaExpr = expr(df.columns.drop(2).foldLeft("formula")((acc, c) => s"replace($acc, '$c', $c)"))

df.select($"Id", eval(formulaExpr).as("result")).show()
//+---+------+
//| Id|result|
//+---+------+
//| 1| 7|
//| 2| 12|
//| 3| 63|
//+---+------+
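To see what that expression looks like, you can print the generated SQL string (assuming the df defined in the question, whose columns after Id and formula are a through e):

val generated = df.columns.drop(2).foldLeft("formula")((acc, c) => s"replace($acc, '$c', $c)")
// generated: String =
//   replace(replace(replace(replace(replace(formula, 'a', a), 'b', b), 'c', c), 'd', d), 'e', e)
// i.e. every variable name in the formula string is substituted by that row's column value
// before the resulting arithmetic expression is handed to the eval UDF.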

Pass column and a Map to a Scala UDF

I come from Pyspark. I know how to do this in Pyspark but haven't managed to do the same thing in Scala.
Here is a dataframe,
val df = Seq(
("u1", Array[Int](2,3,4)),
("u2", Array[Int](7,8,9))
).toDF("id", "mylist")
// +---+---------+
// | id| mylist|
// +---+---------+
// | u1|[2, 3, 4]|
// | u2|[7, 8, 9]|
// +---+---------+
and here is a Map object,
val myMap = (1 to 4).toList.map(x => (x,0)).toMap
//myMap: scala.collection.immutable.Map[Int,Int] = Map(1 -> 0, 2 -> 0, 3 -> 0, 4 -> 0)
so this map has keys from 1 to 4.
For each row of df, I want to check whether any element of "mylist" is contained in myMap as a key. If so, return that element (any one of them if several are contained); otherwise return -1.
So the result should look like
+---+---------+-----+
| id|   mylist|label|
+---+---------+-----+
| u1|[2, 3, 4]|    2|
| u2|[7, 8, 9]|   -1|
+---+---------+-----+
I have tried the following approaches:
The function below works on an Array, but not on a Column:
def list2label(ls: Array[Int], m: Map[Int, Int]): Int = {
  var flag = 0
  for (element <- ls) {
    if (m.contains(element)) flag = element
  }
  flag
}
val testls = Array[Int](2,3,4)
list2label(testls, myMap)
//testls: Array[Int] = Array(2, 3, 4)
//res33: Int = 4
Trying to use a UDF, I got an error:
def list2label_udf(m: Map[Int, Int]) = udf( (ls: Array[Int]) =>(
var flag = 0
for (element <- ls) {
if (m.contains(element)) flag = element
}
flag
)
)
//<console>:3: error: illegal start of simple expression
// var flag = 0
// ^
I think my UDF is in the wrong format.
In PySpark I can do what I want like this:
%pyspark
myDict={1:0, 2:0, 3:0, 4:0}
def list2label(ls, myDict):
    for i in ls:
        if i in myDict:
            return i
    return 0
def list2label_UDF(myDict):
    return udf(lambda c: list2label(c, myDict))
df = df.withColumn("label",list2label_UDF(myDict)(col("mylist")))
Any help would be appreciated!
The solution is shown below:
scala> df.show
+---+---------+
| id| mylist|
+---+---------+
| u1|[2, 3, 4]|
| u2|[7, 8, 9]|
+---+---------+
scala> def customUdf(m: Map[Int,Int]) = udf((s: Seq[Int]) => {
         val intersection = s.toList.intersect(m.keys.toList)
         if (intersection.isEmpty) -1 else intersection(0)
       })
customUdf: (m: Map[Int,Int])org.apache.spark.sql.expressions.UserDefinedFunction
scala> df.select($"id", $"myList", customUdf(myMap)($"myList").as("new_col")).show
+---+---------+-------+
| id| myList|new_col|
+---+---------+-------+
| u1|[2, 3, 4]| 2|
| u2|[7, 8, 9]| -1|
+---+---------+-------+
Another approach could be to send the list of the map's keys instead of the map itself, since you are only checking against the keys. The solution for this is shown below:
scala> def customUdf1(m: List[Int]) = udf((s: Seq[Int]) => {
         val intersection = s.toList.intersect(m)
         if (intersection.isEmpty) -1 else intersection(0)
       })
customUdf1: (m: List[Int])org.apache.spark.sql.expressions.UserDefinedFunction
scala> df.select($"id",$"myList", customUdf1(myMap.keys.toList)($"myList").as("new_col")).show
+---+---------+-------+
| id| myList|new_col|
+---+---------+-------+
| u1|[2, 3, 4]| 2|
| u2|[7, 8, 9]| -1|
+---+---------+-------+
Let me know if it helps!!
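As an aside, a sketch of why the original attempt failed (this explanation is mine, not part of the answer above): a multi-statement function body passed to udf needs braces rather than parentheses, and a Spark array column arrives in a Scala UDF as a Seq, not an Array. With those two changes the original approach also works:

def list2label_udf(m: Map[Int, Int]) = udf((ls: Seq[Int]) => {
  var flag = -1                        // default to -1 when no key matches, as requested
  for (element <- ls) {
    if (m.contains(element)) flag = element
  }
  flag
})

df.withColumn("label", list2label_udf(myMap)($"mylist")).show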

Add new column containing an Array of column names sorted by the row-wise values

Given a DataFrame with a few columns, I'm trying to create a new column containing an array of these columns' names sorted in decreasing order, based on the row-wise values of these columns.
| a | b | c | newcol|
|---|---|---|-------|
| 1 | 4 | 3 |[b,c,a]|
| 4 | 1 | 3 |[a,c,b]|
---------------------
The names of the columns are stored in a variable names: Array[String].
What approach should I go for?
Using a UDF is the simplest way to achieve this custom task.
import org.apache.spark.sql.functions.{array, col, udf}

val df = spark.createDataFrame(Seq((1, 4, 3), (4, 1, 3))).toDF("a", "b", "c")
val names = df.schema.fieldNames
// zip each row's values with the column names and sort by value in decreasing order
val sortNames = udf((v: Seq[Int]) => v.zip(names).sortBy(-_._1).map(_._2))
df.withColumn("newcol", sortNames(array(names.map(col): _*))).show
Something like this could be an approach using the Dataset API:
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

case class Element(name: String, value: Int)
case class Columns(a: Int, b: Int, c: Int, elements: Array[String])

def function1()(implicit spark: SparkSession) = {
  import spark.implicits._
  val df0: DataFrame =
    spark.createDataFrame(spark.sparkContext
      .parallelize(Seq(Row(1, 2, 3), Row(4, 1, 3))),
      StructType(Seq(StructField("a", IntegerType, false),
        StructField("b", IntegerType, false),
        StructField("c", IntegerType, false))))
  val df1 = df0
    .flatMap(row => Seq(Columns(row.getAs[Int]("a"),
      row.getAs[Int]("b"),
      row.getAs[Int]("c"),
      Array(Element("a", row.getAs[Int]("a")),
        Element("b", row.getAs[Int]("b")),
        Element("c", row.getAs[Int]("c"))).sortBy(-_.value).map(_.name))))
  df1
}

def main(args: Array[String]): Unit = {
  implicit val spark = SparkSession.builder().master("local[1]").getOrCreate()
  function1().show()
}
gives:
+---+---+---+---------+
| a| b| c| elements|
+---+---+---+---------+
| 1| 2| 3|[c, b, a]|
| 4| 1| 3|[a, c, b]|
+---+---+---+---------+
Try something like this:
import org.apache.spark.sql.functions.{lit, map, udf}

val sorted_column_names = udf((column_map: Map[String, Int]) =>
  column_map.toSeq.sortBy(-_._2).map(_._1)
)

df.withColumn("column_map", map(lit("a"), $"a", lit("b"), $"b", lit("c"), $"c"))
  .withColumn("newcol", sorted_column_names($"column_map"))

How to perform arithmetic operations in a DataFrame groupBy aggregation in Spark? [duplicate]

This question already has answers here: Cumulative product in Spark (4 answers). Closed 4 years ago.
I have a dataframe as follows:
val df = Seq(("x", "y", 1),("x", "z", 2),("x", "a", 4), ("x", "a", 5), ("t", "y", 1), ("t", "y2", 6), ("t", "y3", 3), ("t", "y4", 5)).toDF("F1", "F2", "F3")
+---+---+---+
| F1| F2| F3|
+---+---+---+
| x| y| 1|
| x| z| 2|
| x| a| 4|
| x| a| 5|
| t| y| 1|
| t| y2| 6|
| t| y3| 3|
| t| y4| 5|
+---+---+---+
How do I groupBy on column "F1", and multiply on "F3"?
For a sum I can do the following, but I'm not sure which function to use for multiplication.
df.groupBy("F1").agg(sum("F3")).show
+---+-------+
| F1|sum(F3)|
+---+-------+
| x| 12|
| t| 15|
+---+-------+
val df = Seq(("x", "y", 1),("x", "z", 2),("x", "a", 4), ("x", "a", 5), ("t", "y", 1), ("t", "y2", 6), ("t", "y3", 3), ("t", "y4", 5)).toDF("F1", "F2", "F3")
import org.apache.spark.sql.Row
val x=df.select($"F1",$"F3").groupByKey{case r=>r.getString(0)}.reduceGroups{ ((r),(r2)) =>Row(r.getString(0),r.getInt(1)*r2.getInt(1)) }
x.show()
+-----+------------------------------------------+
|value|ReduceAggregator(org.apache.spark.sql.Row)|
+-----+------------------------------------------+
| x| [x, 40]|
| t| [t, 90]|
+-----+------------------------------------------+
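The result is a Dataset[(String, Row)] with an unwieldy column name, so you may want to flatten it back into named columns, for example (a sketch, assuming the x above and the tuple encoders from spark.implicits):

x.map { case (k, r) => (k, r.getInt(1)) }.toDF("F1", "product").show
// +---+-------+
// | F1|product|
// +---+-------+
// |  x|     40|
// |  t|     90|
// +---+-------+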
Define a custom aggregate function as follows:
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{DataType, LongType, StructField, StructType}

class Product extends UserDefinedAggregateFunction {

  // This is the input schema for your aggregate function.
  override def inputSchema: org.apache.spark.sql.types.StructType =
    StructType(StructField("value", LongType) :: Nil)

  // These are the internal fields you keep for computing your aggregate.
  override def bufferSchema: StructType = StructType(
    StructField("product", LongType) :: Nil
  )

  // This is the output type of your aggregation function.
  override def dataType: DataType = LongType

  override def deterministic: Boolean = true

  // This is the initial value for your buffer schema.
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 1L
  }

  // This is how to update your buffer schema given an input.
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getAs[Long](0) * input.getAs[Long](0)
  }

  // This is how to merge two objects with the bufferSchema type.
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getAs[Long](0) * buffer2.getAs[Long](0)
  }

  // This is where you output the final value, given the final value of your bufferSchema.
  override def evaluate(buffer: Row): Any = {
    buffer.getLong(0)
  }
}
Then use it in an aggregation as follows:
import org.apache.spark.sql.functions.col

val product = new Product
val df = Seq(("x", "y", 1), ("x", "z", 2), ("x", "a", 4), ("x", "a", 5), ("t", "y", 1), ("t", "y2", 6), ("t", "y3", 3), ("t", "y4", 5)).toDF("F1", "F2", "F3")
df.groupBy("F1").agg(product(col("F3"))).show
Here's the output :
+---+-----------+
| F1|product(F3)|
+---+-----------+
| x| 40|
| t| 90|
+---+-----------+
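If you also want to call it from SQL, the UDAF can be registered under a name (a sketch, not from the original answer; on Spark 1.x use sqlContext.udf.register and registerTempTable instead):

df.createOrReplaceTempView("t")
spark.udf.register("product_agg", new Product)   // "product_agg" is an arbitrary name
spark.sql("SELECT F1, product_agg(F3) FROM t GROUP BY F1").show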

Spark, Scala, DataFrame: create feature vectors

I have a DataFrame that looks like the following:
userID, category, frequency
1,cat1,1
1,cat2,3
1,cat9,5
2,cat4,6
2,cat9,2
2,cat10,1
3,cat1,5
3,cat7,16
3,cat8,2
The number of distinct categories is 10, and I would like to create a feature vector for each userID and fill the missing categories with zeros.
So the output would be something like:
userID,feature
1,[1,3,0,0,0,0,0,0,5,0]
2,[0,0,0,6,0,0,0,0,2,1]
3,[5,0,0,0,0,0,16,2,0,0]
It is just an illustrative example; in reality I have about 200,000 unique userIDs and 300 unique categories.
What is the most efficient way to create the features DataFrame?
A little bit more DataFrame-centric solution:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.{lit, sum, when}

val df = sc.parallelize(Seq(
  (1, "cat1", 1), (1, "cat2", 3), (1, "cat9", 5), (2, "cat4", 6),
  (2, "cat9", 2), (2, "cat10", 1), (3, "cat1", 5), (3, "cat7", 16),
  (3, "cat8", 2))).toDF("userID", "category", "frequency")
// Create a sorted array of categories
val categories = df
  .select($"category")
  .distinct.map(_.getString(0))
  .collect
  .sorted

// Prepare the vector assembler
val assembler = new VectorAssembler()
  .setInputCols(categories)
  .setOutputCol("features")

// Aggregation expressions
val exprs = categories.map(
  c => sum(when($"category" === c, $"frequency").otherwise(lit(0))).alias(c))

val transformed = assembler.transform(
    df.groupBy($"userID").agg(exprs.head, exprs.tail: _*))
  .select($"userID", $"features")
and a UDAF alternative:
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.mllib.linalg.{Vectors, VectorUDT}
import org.apache.spark.sql.types.{ArrayType, DoubleType, IntegerType, StructType}
import scala.collection.mutable.WrappedArray
class VectorAggregate(n: Int) extends UserDefinedAggregateFunction {

  def inputSchema = new StructType()
    .add("i", IntegerType)
    .add("v", DoubleType)

  def bufferSchema = new StructType().add("buff", ArrayType(DoubleType))

  def dataType = new VectorUDT()

  def deterministic = true

  def initialize(buffer: MutableAggregationBuffer) = {
    buffer.update(0, Array.fill(n)(0.0))
  }

  def update(buffer: MutableAggregationBuffer, input: Row) = {
    if (!input.isNullAt(0)) {
      val i = input.getInt(0)
      val v = input.getDouble(1)
      val buff = buffer.getAs[WrappedArray[Double]](0)
      buff(i) += v
      buffer.update(0, buff)
    }
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
    val buff1 = buffer1.getAs[WrappedArray[Double]](0)
    val buff2 = buffer2.getAs[WrappedArray[Double]](0)
    for ((x, i) <- buff2.zipWithIndex) {
      buff1(i) += x
    }
    buffer1.update(0, buff1)
  }

  def evaluate(buffer: Row) = Vectors.dense(
    buffer.getAs[Seq[Double]](0).toArray)
}
with example usage:
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("category_idx")
  .fit(df)

val indexed = indexer.transform(df)
  .withColumn("category_idx", $"category_idx".cast("integer"))
  .withColumn("frequency", $"frequency".cast("double"))

val n = indexer.labels.size + 1

val transformed = indexed
  .groupBy($"userID")
  .agg(new VectorAggregate(n)($"category_idx", $"frequency").as("vec"))

transformed.show
// +------+--------------------+
// |userID| vec|
// +------+--------------------+
// | 1|[1.0,5.0,0.0,3.0,...|
// | 2|[0.0,2.0,0.0,0.0,...|
// | 3|[5.0,0.0,16.0,0.0...|
// +------+--------------------+
In this case the order of values is defined by indexer.labels:
indexer.labels
// Array[String] = Array(cat1, cat9, cat7, cat2, cat8, cat4, cat10)
In practice I would prefer the solution by Odomontois, so these are provided mostly for reference.
Suppose:
val cs: SparkContext
val sc: SQLContext
val cats: DataFrame
where userId and frequency are bigint columns, which correspond to scala.Long.
We create an intermediate mapping RDD:
val catMaps = cats.rdd
  .groupBy(_.getAs[Long]("userId"))
  .map { case (id, rows) =>
    id -> rows
      .map { row => row.getAs[String]("category") -> row.getAs[Long]("frequency") }
      .toMap
  }
Then we collect all the categories present, in lexicographic order:
val catNames = cs.broadcast(catMaps.map(_._2.keySet).reduce(_ union _).toArray.sorted)
or create it manually:
val catNames = cs.broadcast(1 to 10 map {n => s"cat$n"} toArray)
Finally, we transform the maps into arrays, with 0 values for the missing categories:
import sc.implicits._
val catArrays = catMaps
.map { case (id, catMap) => id -> catNames.value.map(catMap.getOrElse(_, 0L)) }
.toDF("userId", "feature")
Now catArrays.show() prints something like:
+------+--------------------+
|userId| feature|
+------+--------------------+
| 2|[0, 1, 0, 6, 0, 0...|
| 1|[1, 0, 3, 0, 0, 0...|
| 3|[5, 0, 0, 0, 16, ...|
+------+--------------------+
This may not be the most elegant solution for DataFrames, as I am barely familiar with this area of Spark.
Note that you could create catNames manually to add zeros for the missing cat3, cat5, ...
Also note that otherwise the catMaps RDD is computed twice; you might want to .persist() it.
Given your input:
val df = Seq((1, "cat1", 1), (1, "cat2", 3), (1, "cat9", 5),
(2, "cat4", 6), (2, "cat9", 2), (2, "cat10", 1),
(3, "cat1", 5), (3, "cat7", 16), (3, "cat8", 2))
.toDF("userID", "category", "frequency")
df.show
+------+--------+---------+
|userID|category|frequency|
+------+--------+---------+
| 1| cat1| 1|
| 1| cat2| 3|
| 1| cat9| 5|
| 2| cat4| 6|
| 2| cat9| 2|
| 2| cat10| 1|
| 3| cat1| 5|
| 3| cat7| 16|
| 3| cat8| 2|
+------+--------+---------+
Just run:
val pivoted = df.groupBy("userID").pivot("category").avg("frequency")
val dfZeros = pivoted.na.fill(0)
dfZeros.show
+------+----+-----+----+----+----+----+----+
|userID|cat1|cat10|cat2|cat4|cat7|cat8|cat9|
+------+----+-----+----+----+----+----+----+
| 1| 1.0| 0.0| 3.0| 0.0| 0.0| 0.0| 5.0|
| 3| 5.0| 0.0| 0.0| 0.0|16.0| 2.0| 0.0|
| 2| 0.0| 1.0| 0.0| 6.0| 0.0| 0.0| 2.0|
+------+----+-----+----+----+----+----+----+
Finally, use VectorAssembler to create an org.apache.spark.ml.linalg.Vector, as sketched below.
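A minimal sketch of that last step (my addition, assuming the dfZeros above; the input columns are everything except userID):

import org.apache.spark.ml.feature.VectorAssembler

val catCols = dfZeros.columns.filter(_ != "userID")
val assembler = new VectorAssembler()
  .setInputCols(catCols)
  .setOutputCol("features")
val features = assembler.transform(dfZeros).select("userID", "features")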
NOTE: I have not checked performances on this yet...
EDIT: Possibly more complex, but likely more efficient!
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{collect_list, sort_array, struct, udf}

def toSparseVectorUdf(size: Int) = udf[Vector, Seq[Row]] {
  (data: Seq[Row]) => {
    val indices = data.map(_.getDouble(0).toInt).toArray
    val values = data.map(_.getInt(1).toDouble).toArray
    Vectors.sparse(size, indices, values)
  }
}
val indexer = new StringIndexer().setInputCol("category").setOutputCol("idx")
val indexerModel = indexer.fit(df)
val totalCategories = indexerModel.labels.size
val dataWithIndices = indexerModel.transform(df)
val data = dataWithIndices.groupBy("userId").agg(sort_array(collect_list(struct($"idx", $"frequency".as("val")))).as("data"))
val dataWithFeatures = data.withColumn("features", toSparseVectorUdf(totalCategories)($"data")).drop("data")
dataWithFeatures.show(false)
+------+--------------------------+
|userId|features |
+------+--------------------------+
|1 |(7,[0,1,3],[1.0,5.0,3.0]) |
|3 |(7,[0,2,4],[5.0,16.0,2.0])|
|2 |(7,[1,5,6],[2.0,6.0,1.0]) |
+------+--------------------------+
NOTE: StringIndexer sorts categories by frequency, so the most frequent category will be at index 0 in indexerModel.labels. Feel free to use your own mapping if you'd like and pass that directly to toSparseVectorUdf, as sketched below.
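For example, a hedged sketch of a manual, lexicographic mapping instead of the frequency-based one (labels, toIdx and idxUdf are names introduced here for illustration only):

import org.apache.spark.sql.functions.udf

val labels = df.select("category").distinct.collect.map(_.getString(0)).sorted
val toIdx  = labels.zipWithIndex.toMap                  // category -> index, lexicographic order
val idxUdf = udf((c: String) => toIdx(c).toDouble)

val dataWithIndices2 = df.withColumn("idx", idxUdf($"category"))
// then group and apply toSparseVectorUdf(labels.size) exactly as above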