Converting a dataframe to dataset retains extra columns

Converting a dataframe to dataset retains extra columns - scala

In Spark 2.11, when converting a Dataframe to a Dataset, spark retains the extra columns that are not even referred in the class of the dataset.
scala> case class F(x: String, y: String)
defined class F
scala> import spark.implicits._
import spark.implicits._
scala> val df = Seq(("1a","2a","3a","4a"), ("5a", "6a", "7a","8a")).toDF("w","x","y","z")
df: org.apache.spark.sql.DataFrame = [w: string, x: string ... 2 more fields]
scala> val ds = df.to
toDF toJSON toJavaRDD toLocalIterator toString
scala> val ds = df.as
as asInstanceOf
scala> val ds = df.as[F]
ds: org.apache.spark.sql.Dataset[F] = [w: string, x: string ... 2 more fields]
scala> ds.show()
+---+---+---+---+
| w| x| y| z|
+---+---+---+---+
| 1a| 2a| 3a| 4a|
| 5a| 6a| 7a| 8a|
+---+---+---+---+
This throws the type safety aspect of it down the drain. Is there a way to prevent this from happening ?
I read that using ds.map(identity) because the conversion is supposed to be lazy but it does not trigger when trying to write the dataframe. Plus it's impossible to write a test to cover .map(identity) and prevent deletion.

You can address this behavior in several ways but one of the easiest and most performant option in scala in to define your own to[T] method that selects only the appropriate columns:
import scala.reflect.runtime.universe.TypeTag
import org.apache.spark.sql._
object SparkExtensions {
implicit class ExtendedDataFrame(df: DataFrame) {
def to[T <: Product: TypeTag]: Dataset[T] = {
import df.sparkSession.implicits._
import org.apache.spark.sql.functions.col
df.select(Encoders.product[T].schema.map(f => col(f.name)): _*).as[T]
}
}
}
Which you can than use in the following way:
scala> case class F(x: String, y: String)
defined class F
scala> import SparkExtensions._
import SparkExtensions._
scala> val df = Seq(("1a","2a","3a","4a"), ("5a", "6a", "7a","8a")).toDF("w","x","y","z")
df: org.apache.spark.sql.DataFrame = [w: string, x: string ... 2 more fields]
scala> val ds: Dataset[F] = df.to[F]
ds: org.apache.spark.sql.Dataset[F] = [w: string, x: string]
scala> ds.show()
+---+---+
| w| x|
+---+---+
| 1a| 2a|
| 5a| 6a|
+---+---+
If you don't want to define an implicit class, you can always simply do df.as[F].rdd.toDS to achieve the same result in a slightly less runtime efficient way.

Related

Select case class based on String in Scala

How can I select a case class based on a String value?
My code is
val spark = SparkSession.builder()...
val rddOfJsonStrings: RDD[String] = // some json strings as RDD
val classSelector: String = ??? // could be "Foo" or "Bar", or any other String value
case class Foo(foo: String)
case class Bar(bar: String)
if (classSelector == "Foo") {
val df: DataFrame = spark.read.json(rddOfJsonStrings)
df.as[Foo]
} else if (classSelector == "Bar") {
val df: DataFrame = spark.read.json(rddOfJsonStrings)
df.as[Bar]
} else {
throw ClassUnknownException //custom Exception
}
The variable classSeletector is a simple String that should be used to point to the case class of the same name.
Imagine I don't only have Foo and Bar as case classes but more then those two. How is it possible to call the df.as[] statement based on the String (if possible at all)?
Or is there a completely different approach available in Scala?

Check below code
classSeletector match {
case c if Foo.getClass.getSimpleName.replace("$","").equalsIgnoreCase(c) => spark.read.json(rddOfJsonStrings).as[Foo]
case c if Bar.getClass.getSimpleName.replace("$","").equalsIgnoreCase(c) => spark.read.json(rddOfJsonStrings).as[Bar]
case _ => throw ClassUnknownException //custom Exception
}

How is it possible to call the df.as[] statement based on the String (if possible at all)?
It isn't (or based on any runtime value). You may note that all answers still need to:
have a separate branch for Foo and Bar (and one more branch for each class you'll want to add);
repeat the class name twice in the branch.
You can avoid the second:
import scala.reflect.{classTag, ClassTag}
val df: DataFrame = spark.read.json(rddOfJsonStrings)
// local function defined where df and classSelector are visible
def dfAsOption[T : Encoder : ClassTag] =
Option.when(classSelector == classTag[T].runtimeClass.simpleName)(df.as[T])
dfAsOption[Foo].dfAsOption(asOption[Bar]).getOrElse(throw ClassUnknownException)
But for the first you'd need a macro if it's possible at all. I would guess it isn't.

Define a generic method and invoke it,
getDs[Foo](spark,rddOfJsonStrings)
getDs[Bar](spark,rddOfJsonStrings)
def getDs[T](spark : SparkSession, rddOfJsonStrings:String) {
spark.read.json(rddOfJsonStrings).as[T](Encoders.bean[T](classOf[T]))
}

Alternative-
highlights-
Use simpleName of the case class and not of the companion object
if classSelector is null, the solution won't fail
case class Foo(foo: String)
case class Bar(bar: String)
Testcase-
val rddOfJsonStrings: RDD[String] = spark.sparkContext.parallelize(Seq("""{"foo":1}"""))
val classSelector: String = "Foo" // could be "Foo" or "Bar", or any other String value
val ds = classSelector match {
case foo if classOf[Foo].getSimpleName == foo =>
val df: DataFrame = spark.read.json(rddOfJsonStrings)
df.as[Foo]
case bar if classOf[Bar].getSimpleName == bar =>
val df: DataFrame = spark.read.json(rddOfJsonStrings)
df.as[Bar]
case _ => throw new UnsupportedOperationException
}
ds.show(false)
/**
* +---+
* |foo|
* +---+
* |1 |
* +---+
*/

You can use reflective toolbox
import org.apache.spark.sql.{Dataset, SparkSession}
import scala.reflect.runtime
import scala.tools.reflect.ToolBox
object Main extends App {
val spark = SparkSession.builder
.master("local")
.appName("Spark SQL basic example")
.getOrCreate()
import spark.implicits._
val rddOfJsonStrings: Dataset[String] = spark.createDataset(Seq("""{"foo":"aaa"}"""))
// val rddOfJsonStrings: Dataset[String] = spark.createDataset(Seq("""{"bar":"bbb"}"""))
val classSelector: String = "Foo"
// val classSelector: String = "Bar"
case class Foo(foo: String)
case class Bar(bar: String)
val runtimeMirror = runtime.currentMirror
val toolbox = runtimeMirror.mkToolBox()
val res = toolbox.eval(toolbox.parse(s"""
import org.apache.spark.sql.DataFrame
import Main._
import spark.implicits._
val df: DataFrame = spark.read.json(rddOfJsonStrings)
df.as[$classSelector]
""")).asInstanceOf[Dataset[_]]
println(res) // [foo: string]
}
Notice that statically you will have a Dataset[_], not Dataset[Foo] or Dataset[Bar].

GroupBy + custom aggregation on Dataset with Case class / Trait in the Key

I am trying to refactor some code and put the general logic into a trait. I basically want to process datasets, group them by some key and aggregate:
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{ Dataset, Encoder, Encoders, TypedColumn }
case class SomeKey(a: String, b: Boolean)
case class InputRow(
SomeKey,
v: Double
)
trait MyTrait {
def processInputs: Dataset[InputRow]
def groupAndAggregate(
logs: Dataset[InputRow]
): Dataset[(SomeKey, Long)] = {
import logs.sparkSession.implicits._
logs
.groupByKey(i => i.key)
.agg(someAggFunc)
}
//Whatever agg function: here, it counts the number of v that are >= 0.5
def someAggFunc: TypedColumn[InputRow, Long] =
new Aggregator[
/*input type*/ InputRow,
/* "buffer" type */ Long,
/* output type */ Long
] with Serializable {
def zero = 0L
def reduce(b: Long, a: InputRow) = {
if (a.v >= 0.5)
b + 1
else
b
}
def merge(b1: Long, b2: Long) =
b1 + b2
// map buffer to output type
def finish(b: Long) = b
def bufferEncoder: Encoder[Long] = Encoders.scalaLong
def outputEncoder: Encoder[Long] = Encoders.scalaLong
}.toColumn
}
everything works fine: I can instantiate a class that inherits from MyTrait and override the way I process inputs:
import spark.implicits._
case class MyTraitTest(testDf: DataFrame) extends MyTrait {
override def processInputs: Dataset[InputRow] = {
val ds = testDf
.select(
$"a",
$"b",
$"v",
)
.rdd
.map(
r =>
InputRow(
SomeKey(r.getAs[String]("a"), r.getAs[Boolean]("b")),
r.getAs[Double]("v")
)
)
.toDS
ds
}
val df: DataFrame = Seq(
("1", false, 0.40),
("1", false, 0.54),
("0", true, 0.85),
("1", true, 0.39)
).toDF("a", "b", "v")
val myTraitTest = MyTraitTest(df)
val ds: Dataset[InputRow] = myTraitTest.processInputs
val res = myTraitTest.groupAndAggregate(ds)
res.show(false)
+----------+----------------------------------+
|key |InputRow |
+----------+----------------------------------+
|[1, false]|1 |
|[0, true] |1 |
|[1, true] |0 |
+----------+----------------------------------+
Now the problem: I want SomeKey to derive from a more generic trait Key, because the key will not always have only two fields, the fields won't have the same type etc. It will always be a simple tuple of some basic primitive types though.
So I tried to do the following:
trait Key extends Product
case class SomeKey(a: String, b: Boolean) extends Key
case class SomeOtherKey(x: Int, y: Boolean, z: String) extends Key
case class InputRow[T <: Key](
key: T,
v: Double
)
trait MyTrait[T <: Key] {
def processInputs: Dataset[InputRow[T]]
def groupAndAggregate(
logs: Dataset[InputRow[T]]
): Dataset[(T, Long)] = {
import logs.sparkSession.implicits._
logs
.groupByKey(i => i.key)
.agg(someAggFunc)
}
def someAggFunc: TypedColumn[InputRow[T], Long] = {...}
I now do:
case class MyTraitTest(testDf: DataFrame) extends MyTrait[SomeKey] {
override def processInputs: Dataset[InputRow[SomeKey]] = {
...
}
etc.
But now I get the error : Unable to find encoder for type T. An implicit Encoder[T] is needed to store T instances in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
.groupByKey(i => i.key)
I really don't know how to work around this issue, I tried lots of things without success. Sorry for this quite lengthy description but hopefully you have all the elements to help me understand... thanks!

Spark needs to be able to implicitly create the encoder for product type T so you'll need to help it work around the JVM type erasure and pass the TypeTag for T as an implicit parameter of your groupAndAggregate method.
A working example:
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{ DataFrame, Dataset, Encoders, TypedColumn }
import scala.reflect.runtime.universe.TypeTag
trait Key extends Product
case class SomeKey(a: String, b: Boolean) extends Key
case class SomeOtherKey(x: Int, y: Boolean, z: String) extends Key
case class InputRow[T <: Key](key: T, v: Double)
trait MyTrait[T <: Key] {
def processInputs: Dataset[InputRow[T]]
def groupAndAggregate(
logs: Dataset[InputRow[T]]
)(implicit tTypeTag: TypeTag[T]): Dataset[(T, Long)] = {
import logs.sparkSession.implicits._
logs
.groupByKey(i => i.key)
.agg(someAggFunc)
}
def someAggFunc: TypedColumn[InputRow[T], Long] =
new Aggregator[InputRow[T], Long, Long] with Serializable {
def reduce(b: Long, a: InputRow[T]) = b + (a.v * 100).toLong
def merge(b1: Long, b2: Long) = b1 + b2
def zero = 0L
def finish(b: Long) = b
def bufferEncoder = Encoders.scalaLong
def outputEncoder = Encoders.scalaLong
}.toColumn
}
with a wrapping case class
case class MyTraitTest(testDf: DataFrame) extends MyTrait[SomeKey] {
import testDf.sparkSession.implicits._
import org.apache.spark.sql.functions.struct
override def processInputs = testDf
.select(struct($"a", $"b") as "key", $"v" )
.as[InputRow[SomeKey]]
}
and a test execution
val df = Seq(
("1", false, 0.40),
("1", false, 0.54),
("0", true, 0.85),
("1", true, 0.39)
).toDF("a", "b", "v")
val myTraitTest = MyTraitTest(df)
val ds = myTraitTest.processInputs
val res = myTraitTest.groupAndAggregate(ds)
res.show(false)
+----------+-----------------------------------------------+
|key |$anon$1($line5460910223.$read$$iw$$iw$InputRow)|
+----------+-----------------------------------------------+
|[1, false]|94 |
|[1, true] |39 |
|[0, true] |85 |
+----------+-----------------------------------------------+

Dataframe with "Sparse" Vector groupBy aggregation,not dense Vector in spark using Scala [duplicate]

I have a DataFrame of two columns, ID of type Int and Vec of type Vector (org.apache.spark.mllib.linalg.Vector).
The DataFrame looks like follow:
ID,Vec
1,[0,0,5]
1,[4,0,1]
1,[1,2,1]
2,[7,5,0]
2,[3,3,4]
3,[0,8,1]
3,[0,0,1]
3,[7,7,7]
....
I would like to do a groupBy($"ID") then apply an aggregation on the rows inside each group by summing the vectors.
The desired output of the above example would be:
ID,SumOfVectors
1,[5,2,7]
2,[10,8,4]
3,[7,15,9]
...
The available aggregation functions will not work, e.g. df.groupBy($"ID").agg(sum($"Vec") will lead to an ClassCastException.
How to implement a custom aggregation function that allows me to do the sum of vectors or arrays or any other custom operation?

Spark >= 3.0
You can use Summarizer with sum
import org.apache.spark.ml.stat.Summarizer
df
.groupBy($"id")
.agg(Summarizer.sum($"vec").alias("vec"))
Spark <= 3.0
Personally I wouldn't bother with UDAFs. There are more than verbose and not exactly fast (Spark UDAF with ArrayType as bufferSchema performance issues) Instead I would simply use reduceByKey / foldByKey:
import org.apache.spark.sql.Row
import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.ml.linalg.{Vector, Vectors}
def dv(values: Double*): Vector = Vectors.dense(values.toArray)
val df = spark.createDataFrame(Seq(
(1, dv(0,0,5)), (1, dv(4,0,1)), (1, dv(1,2,1)),
(2, dv(7,5,0)), (2, dv(3,3,4)),
(3, dv(0,8,1)), (3, dv(0,0,1)), (3, dv(7,7,7)))
).toDF("id", "vec")
val aggregated = df
.rdd
.map{ case Row(k: Int, v: Vector) => (k, BDV(v.toDense.values)) }
.foldByKey(BDV.zeros[Double](3))(_ += _)
.mapValues(v => Vectors.dense(v.toArray))
.toDF("id", "vec")
aggregated.show
// +---+--------------+
// | id| vec|
// +---+--------------+
// | 1| [5.0,2.0,7.0]|
// | 2|[10.0,8.0,4.0]|
// | 3|[7.0,15.0,9.0]|
// +---+--------------+
And just for comparison a "simple" UDAF. Required imports:
import org.apache.spark.sql.expressions.{MutableAggregationBuffer,
UserDefinedAggregateFunction}
import org.apache.spark.ml.linalg.{Vector, Vectors, SQLDataTypes}
import org.apache.spark.sql.types.{StructType, ArrayType, DoubleType}
import org.apache.spark.sql.Row
import scala.collection.mutable.WrappedArray
Class definition:
class VectorSum (n: Int) extends UserDefinedAggregateFunction {
def inputSchema = new StructType().add("v", SQLDataTypes.VectorType)
def bufferSchema = new StructType().add("buff", ArrayType(DoubleType))
def dataType = SQLDataTypes.VectorType
def deterministic = true
def initialize(buffer: MutableAggregationBuffer) = {
buffer.update(0, Array.fill(n)(0.0))
}
def update(buffer: MutableAggregationBuffer, input: Row) = {
if (!input.isNullAt(0)) {
val buff = buffer.getAs[WrappedArray[Double]](0)
val v = input.getAs[Vector](0).toSparse
for (i <- v.indices) {
buff(i) += v(i)
}
buffer.update(0, buff)
}
}
def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
val buff1 = buffer1.getAs[WrappedArray[Double]](0)
val buff2 = buffer2.getAs[WrappedArray[Double]](0)
for ((x, i) <- buff2.zipWithIndex) {
buff1(i) += x
}
buffer1.update(0, buff1)
}
def evaluate(buffer: Row) = Vectors.dense(
buffer.getAs[Seq[Double]](0).toArray)
}
And an example usage:
df.groupBy($"id").agg(new VectorSum(3)($"vec") alias "vec").show
// +---+--------------+
// | id| vec|
// +---+--------------+
// | 1| [5.0,2.0,7.0]|
// | 2|[10.0,8.0,4.0]|
// | 3|[7.0,15.0,9.0]|
// +---+--------------+
See also: How to find mean of grouped Vector columns in Spark SQL?.

I suggest the following (works on Spark 2.0.2 onward), it might be optimized but it's very nice, one thing you have to know in advance is the vector size when you create the UDAF instance
import org.apache.spark.ml.linalg._
import org.apache.spark.mllib.linalg.WeightedSparseVector
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
class VectorAggregate(val numFeatures: Int)
extends UserDefinedAggregateFunction {
private type B = Map[Int, Double]
def inputSchema: StructType = StructType(StructField("vec", new VectorUDT()) :: Nil)
def bufferSchema: StructType =
StructType(StructField("agg", MapType(IntegerType, DoubleType)) :: Nil)
def initialize(buffer: MutableAggregationBuffer): Unit =
buffer.update(0, Map.empty[Int, Double])
def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
val zero = buffer.getAs[B](0)
input match {
case Row(DenseVector(values)) => buffer.update(0, values.zipWithIndex.foldLeft(zero){case (acc,(v,i)) => acc.updated(i, v + acc.getOrElse(i,0d))})
case Row(SparseVector(_, indices, values)) => buffer.update(0, values.zip(indices).foldLeft(zero){case (acc,(v,i)) => acc.updated(i, v + acc.getOrElse(i,0d))}) }}
def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
val zero = buffer1.getAs[B](0)
buffer1.update(0, buffer2.getAs[B](0).foldLeft(zero){case (acc,(i,v)) => acc.updated(i, v + acc.getOrElse(i,0d))})}
def deterministic: Boolean = true
def evaluate(buffer: Row): Any = {
val Row(agg: B) = buffer
val indices = agg.keys.toArray.sorted
Vectors.sparse(numFeatures,indices,indices.map(agg)).compressed
}
def dataType: DataType = new VectorUDT()
}

With pyspark 3.0.0, which is my version, you can use Summarizer to do it easily. Your col needs to be type of DenseVector
from pyspark.ml.stat import Summarizer
sdf.groupBy("ID").agg(Summarizer.mean(sdf.Vec)).show()
Note: there is no avg function in pyspark, but you can use mean method

Spark: group concat equivalent in scala rdd

I have following DataFrame:
|-----id-------|----value------|-----desc------|
| 1 | v1 | d1 |
| 1 | v2 | d2 |
| 2 | v21 | d21 |
| 2 | v22 | d22 |
|--------------|---------------|---------------|
I want to transform it into:
|-----id-------|----value------|-----desc------|
| 1 | v1;v2 | d1;d2 |
| 2 | v21;v22 | d21;d22 |
|--------------|---------------|---------------|
Is it possible through data frame operations?
How would rdd transformation look like in this case?
I presume rdd.reduce is the key, but I have no idea how to adapt it to this scenario.

You can transform your data using spark sql
case class Test(id: Int, value: String, desc: String)
val data = sc.parallelize(Seq((1, "v1", "d1"), (1, "v2", "d2"), (2, "v21", "d21"), (2, "v22", "d22")))
.map(line => Test(line._1, line._2, line._3))
.df
data.registerTempTable("data")
val result = sqlContext.sql("select id,concat_ws(';', collect_list(value)),concat_ws(';', collect_list(value)) from data group by id")
result.show

Suppose you have something like
import scala.util.Random
val sqlc: SQLContext = ???
case class Record(id: Long, value: String, desc: String)
val testData = for {
(i, j) <- List.fill(30)(Random.nextInt(5), Random.nextInt(5))
} yield Record(i, s"v$i$j", s"d$i$j")
val df = sqlc.createDataFrame(testData)
You can easily join data as:
import sqlc.implicits._
def aggConcat(col: String) = df
.map(row => (row.getAs[Long]("id"), row.getAs[String](col)))
.aggregateByKey(Vector[String]())(_ :+ _, _ ++ _)
val result = aggConcat("value").zip(aggConcat("desc")).map{
case ((id, value), (_, desc)) => (id, value, desc)
}.toDF("id", "values", "descs")
If you would like to have concatenated strings instead of arrays, you can run later
import org.apache.spark.sql.functions._
val resultConcat = result
.withColumn("values", concat_ws(";", $"values"))
.withColumn("descs" , concat_ws(";", $"descs" ))

If working with DataFrames, use UDAF
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{DataType, StringType, StructField, StructType}
class ConcatStringsUDAF(InputColumnName: String, sep:String = ",") extends UserDefinedAggregateFunction {
def inputSchema: StructType = StructType(StructField(InputColumnName, StringType) :: Nil)
def bufferSchema: StructType = StructType(StructField("concatString", StringType) :: Nil)
def dataType: DataType = StringType
def deterministic: Boolean = true
def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = ""
private def concatStrings(str1: String, str2: String): String = {
(str1, str2) match {
case (s1: String, s2: String) => Seq(s1, s2).filter(_ != "").mkString(sep)
case (null, s: String) => s
case (s: String, null) => s
case _ => ""
}
}
def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
val acc1 = buffer.getAs[String](0)
val acc2 = input.getAs[String](0)
buffer(0) = concatStrings(acc1, acc2)
}
def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
val acc1 = buffer1.getAs[String](0)
val acc2 = buffer2.getAs[String](0)
buffer1(0) = concatStrings(acc1, acc2)
}
def evaluate(buffer: Row): Any = buffer.getAs[String](0)
}
And then use this way
val stringConcatener = new ConcatStringsUDAF("Category_ID", ",")
data.groupBy("aaid", "os_country").agg(stringConcatener(data("X")).as("Xs"))
As from Spark 1.6, have a look at Datasets and Aggregator.

After some research I've came up with sth like that:
val data = sc.parallelize(
List(
("1", "v1", "d1"),
("1", "v2", "d2"),
("2", "v21", "d21"),
("2", "v22", "d22")))
.map{ case(id, value, desc)=>((id), (value, desc))}
.reduceByKey((x,y)=>(x._1+";"+y._1, x._2+";"+x._2))
.map{ case(id,(value, desc))=>(id, value, desc)}.toDF("id", "value","desc")
.show()
that leaves me with:
+---+-------+-------+
| id| value| desc|
+---+-------+-------+
| 1| v1;v2| d1;d1|
| 2|v21;v22|d21;d21|
+---+-------+-------+

How to define a custom aggregation function to sum a column of Vectors?

I have a DataFrame of two columns, ID of type Int and Vec of type Vector (org.apache.spark.mllib.linalg.Vector).
The DataFrame looks like follow:
ID,Vec
1,[0,0,5]
1,[4,0,1]
1,[1,2,1]
2,[7,5,0]
2,[3,3,4]
3,[0,8,1]
3,[0,0,1]
3,[7,7,7]
....
I would like to do a groupBy($"ID") then apply an aggregation on the rows inside each group by summing the vectors.
The desired output of the above example would be:
ID,SumOfVectors
1,[5,2,7]
2,[10,8,4]
3,[7,15,9]
...
The available aggregation functions will not work, e.g. df.groupBy($"ID").agg(sum($"Vec") will lead to an ClassCastException.
How to implement a custom aggregation function that allows me to do the sum of vectors or arrays or any other custom operation?

Spark >= 3.0
You can use Summarizer with sum
import org.apache.spark.ml.stat.Summarizer
df
.groupBy($"id")
.agg(Summarizer.sum($"vec").alias("vec"))
Spark <= 3.0
Personally I wouldn't bother with UDAFs. There are more than verbose and not exactly fast (Spark UDAF with ArrayType as bufferSchema performance issues) Instead I would simply use reduceByKey / foldByKey:
import org.apache.spark.sql.Row
import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.ml.linalg.{Vector, Vectors}
def dv(values: Double*): Vector = Vectors.dense(values.toArray)
val df = spark.createDataFrame(Seq(
(1, dv(0,0,5)), (1, dv(4,0,1)), (1, dv(1,2,1)),
(2, dv(7,5,0)), (2, dv(3,3,4)),
(3, dv(0,8,1)), (3, dv(0,0,1)), (3, dv(7,7,7)))
).toDF("id", "vec")
val aggregated = df
.rdd
.map{ case Row(k: Int, v: Vector) => (k, BDV(v.toDense.values)) }
.foldByKey(BDV.zeros[Double](3))(_ += _)
.mapValues(v => Vectors.dense(v.toArray))
.toDF("id", "vec")
aggregated.show
// +---+--------------+
// | id| vec|
// +---+--------------+
// | 1| [5.0,2.0,7.0]|
// | 2|[10.0,8.0,4.0]|
// | 3|[7.0,15.0,9.0]|
// +---+--------------+
And just for comparison a "simple" UDAF. Required imports:
import org.apache.spark.sql.expressions.{MutableAggregationBuffer,
UserDefinedAggregateFunction}
import org.apache.spark.ml.linalg.{Vector, Vectors, SQLDataTypes}
import org.apache.spark.sql.types.{StructType, ArrayType, DoubleType}
import org.apache.spark.sql.Row
import scala.collection.mutable.WrappedArray
Class definition:
class VectorSum (n: Int) extends UserDefinedAggregateFunction {
def inputSchema = new StructType().add("v", SQLDataTypes.VectorType)
def bufferSchema = new StructType().add("buff", ArrayType(DoubleType))
def dataType = SQLDataTypes.VectorType
def deterministic = true
def initialize(buffer: MutableAggregationBuffer) = {
buffer.update(0, Array.fill(n)(0.0))
}
def update(buffer: MutableAggregationBuffer, input: Row) = {
if (!input.isNullAt(0)) {
val buff = buffer.getAs[WrappedArray[Double]](0)
val v = input.getAs[Vector](0).toSparse
for (i <- v.indices) {
buff(i) += v(i)
}
buffer.update(0, buff)
}
}
def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
val buff1 = buffer1.getAs[WrappedArray[Double]](0)
val buff2 = buffer2.getAs[WrappedArray[Double]](0)
for ((x, i) <- buff2.zipWithIndex) {
buff1(i) += x
}
buffer1.update(0, buff1)
}
def evaluate(buffer: Row) = Vectors.dense(
buffer.getAs[Seq[Double]](0).toArray)
}
And an example usage:
df.groupBy($"id").agg(new VectorSum(3)($"vec") alias "vec").show
// +---+--------------+
// | id| vec|
// +---+--------------+
// | 1| [5.0,2.0,7.0]|
// | 2|[10.0,8.0,4.0]|
// | 3|[7.0,15.0,9.0]|
// +---+--------------+
See also: How to find mean of grouped Vector columns in Spark SQL?.

I suggest the following (works on Spark 2.0.2 onward), it might be optimized but it's very nice, one thing you have to know in advance is the vector size when you create the UDAF instance
import org.apache.spark.ml.linalg._
import org.apache.spark.mllib.linalg.WeightedSparseVector
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
class VectorAggregate(val numFeatures: Int)
extends UserDefinedAggregateFunction {
private type B = Map[Int, Double]
def inputSchema: StructType = StructType(StructField("vec", new VectorUDT()) :: Nil)
def bufferSchema: StructType =
StructType(StructField("agg", MapType(IntegerType, DoubleType)) :: Nil)
def initialize(buffer: MutableAggregationBuffer): Unit =
buffer.update(0, Map.empty[Int, Double])
def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
val zero = buffer.getAs[B](0)
input match {
case Row(DenseVector(values)) => buffer.update(0, values.zipWithIndex.foldLeft(zero){case (acc,(v,i)) => acc.updated(i, v + acc.getOrElse(i,0d))})
case Row(SparseVector(_, indices, values)) => buffer.update(0, values.zip(indices).foldLeft(zero){case (acc,(v,i)) => acc.updated(i, v + acc.getOrElse(i,0d))}) }}
def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
val zero = buffer1.getAs[B](0)
buffer1.update(0, buffer2.getAs[B](0).foldLeft(zero){case (acc,(i,v)) => acc.updated(i, v + acc.getOrElse(i,0d))})}
def deterministic: Boolean = true
def evaluate(buffer: Row): Any = {
val Row(agg: B) = buffer
val indices = agg.keys.toArray.sorted
Vectors.sparse(numFeatures,indices,indices.map(agg)).compressed
}
def dataType: DataType = new VectorUDT()
}

With pyspark 3.0.0, which is my version, you can use Summarizer to do it easily. Your col needs to be type of DenseVector
from pyspark.ml.stat import Summarizer
sdf.groupBy("ID").agg(Summarizer.mean(sdf.Vec)).show()
Note: there is no avg function in pyspark, but you can use mean method

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Converting a dataframe to dataset retains extra columns - scala

Related

Select case class based on String in Scala

GroupBy + custom aggregation on Dataset with Case class / Trait in the Key

Dataframe with "Sparse" Vector groupBy aggregation,not dense Vector in spark using Scala [duplicate]

Spark: group concat equivalent in scala rdd

How to define a custom aggregation function to sum a column of Vectors?

Categories

Resources