Apache Spark - How to define a UserDefinedAggregateFunction after Spark 3.0? - scala

I'm using Spark 3.0, and in order to use a user-defined function as a window function, I need an instance of UserDefinedAggregateFunction. Initially I thought that using the new Aggregator and udaf would solve this problem (as shown here), but udaf returns a UserDefinedFunction, not a UserDefinedAggregateFunction.
Since Spark 3.0, UserDefinedAggregateFunction is deprecated, as stated here (even though it can still be found around).
So the question is: is there a correct (not deprecated) way in Spark 3.0 to define a proper UserDefinedAggregateFunction and use it as a window function?

In Spark 3, the new API uses Aggregator to define user-defined aggregations:
abstract class Aggregator[-IN, BUF, OUT] extends Serializable:
A base class for user-defined aggregations, which can be used in
Dataset operations to take all of the elements of a group and reduce
them to a single value.
Aggregator brings performance improvements compared to the deprecated UDAF; see the Spark issue Efficient User Defined Aggregators.
Here's an example of how to define a mean Aggregator and register it using the functions.udaf method:
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.udaf

val meanAgg = new Aggregator[Long, (Long, Long), Double]() {
  def zero = (0L, 0L)                                         // init the buffer: (sum, count)
  def reduce(y: (Long, Long), x: Long) = (y._1 + x, y._2 + 1) // fold one input value into the buffer
  def merge(a: (Long, Long), b: (Long, Long)) = (a._1 + b._1, a._2 + b._2) // combine two buffers
  def finish(r: (Long, Long)) = r._1.toDouble / r._2          // sum / count
  def bufferEncoder: Encoder[(Long, Long)] = ExpressionEncoder[(Long, Long)]
  def outputEncoder: Encoder[Double] = ExpressionEncoder[Double]
}
val meanUdaf = udaf(meanAgg)
Using it with Window:
import org.apache.spark.sql.expressions.Window
import spark.implicits._

val df = Seq(
  (1, 2), (1, 5),
  (2, 3), (2, 1)
).toDF("id", "value")

df.withColumn("mean", meanUdaf($"value").over(Window.partitionBy($"id"))).show
//+---+-----+----+
//| id|value|mean|
//+---+-----+----+
//| 1| 2| 3.5|
//| 1| 5| 3.5|
//| 2| 3| 2.0|
//| 2| 1| 2.0|
//+---+-----+----+
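If you also want to call the aggregate from Spark SQL (not covered in the answer above), the same UserDefinedFunction returned by udaf can be registered by name. A minimal sketch, where the function name and the temp view name are only illustrative:
// Hypothetical SQL usage of the udaf-wrapped Aggregator; names are illustrative only
spark.udf.register("mean_udaf", meanUdaf)
df.createOrReplaceTempView("values_table")
spark.sql("SELECT id, mean_udaf(value) AS mean FROM values_table GROUP BY id").show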

Related

How to use agg() on a KeyValueGroupedDataset and remain Typesafe

I know that this question was posted here before but the answers weren't satisfying for my case.
How to use the agg method of Spark KeyValueGroupedDataset?
Actually the question posted there is not in line with its content, as it revolves around a Dataset and its groupBy() function and NOT a KeyValueGroupedDataset.
I am trying to work with case classes and stay type-safe. The answers in that question were not type-safe and used SQL statements on a DataFrame, which is easily recognizable by the column names being passed as string parameters.
What I am trying to achieve is this here:
val r = dsResult1.groupByKey(r => r.Id) // r is a KeyValueGroupedDataset
r.agg // ??? I don't know how to call this method and couldn't find any examples
This is the scaladoc : http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/KeyValueGroupedDataset.html
You can use an Aggregator to stay type-safe.
Let's say we have a data class
case class A(id: Long, text: String)
with the data
val dsA: Dataset[A] = Seq(A(1, "one"), A(2, "two"), A(3, "three"), A(3, "THREE")).toDS()
//+---+-----+
//| id| text|
//+---+-----+
//| 1| one|
//| 2| two|
//| 3|three|
//| 3|THREE|
//+---+-----+
and we want to group the data by id while combining the texts of each id.
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{Dataset, Encoder, KeyValueGroupedDataset}

val kvgds: KeyValueGroupedDataset[Long, A] = dsA.groupByKey(a => a.id)

val myAggregateFunction: Aggregator[A, A, A] = new Aggregator[A, A, A] {
  def zero: A = A(-1, "")
  def reduce(a: A, b: A): A =
    if (b.equals(zero)) a
    else if (a.equals(zero)) b
    else A(a.id, a.text + "+" + b.text)
  def merge(a: A, b: A): A = reduce(a, b)
  def finish(r: A): A = r
  def bufferEncoder: Encoder[A] = dsA.encoder
  def outputEncoder: Encoder[A] = dsA.encoder
}

val result: Dataset[(Long, A)] = kvgds.agg(myAggregateFunction.toColumn.name("result"))
All operations are type-safe and the result is a Dataset[(Long, A)] with the contents
+---+----------------+
|key| result|
+---+----------------+
| 1| [1, one]|
| 3|[3, three+THREE]|
| 2| [2, two]|
+---+----------------+
Good lord. As of Spark 3.0.0, typed aggregates are not supported anymore and one should use the untyped aggregate functions instead :-( So I am going from typed to untyped, what the... Or are there other approaches to stay type-safe?
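One option that does stay inside the typed API (a sketch that is not part of the answer above, reusing the case class A and dsA from it) is to skip the Aggregator/Column machinery entirely and use reduceGroups on the KeyValueGroupedDataset:
// Hedged sketch: reduceGroups keeps everything typed end-to-end and
// returns a Dataset[(Long, A)], just like the Aggregator version above
val reduced: Dataset[(Long, A)] =
  dsA.groupByKey(_.id)
     .reduceGroups((a, b) => A(a.id, a.text + "+" + b.text))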

How to add a new column to my DataFrame such that values of new column are populated by some other function in scala?

def myFunc(row: Row): String = {
  // process row
  // returns a string
}

def appendNewCol(inputDF: DataFrame): DataFrame = {
  inputDF.withColumn("newcol", myFunc(Row))
  inputDF
}
But no new column got created in my case. My myFunc passes the row to a knowledge base session object, which returns a string after firing rules. Can I do it this way? If not, what is the right way? Thanks in advance.
I saw many StackOverflow solutions using expr(), SQL functions on col(), udf(x) and other techniques, but here my newcol is not derived directly from an existing column.
DataFrame:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import spark.implicits._

val myFunc = (r: Row) => { r.getAs[String]("col1") + "xyz" } // example transformation

val testDf = spark.sparkContext.parallelize(Seq(
  (1, "abc"), (2, "def"), (3, "ghi"))).toDF("id", "col1")
testDf.show

// Apply the function to each row at the RDD level and append the result to the row
val rddRes = testDf
  .rdd
  .map { x =>
    val y = myFunc(x)
    Row.fromSeq(x.toSeq ++ Seq(y))
  }

// Extend the original schema with the new column
val newSchema = StructType(testDf.schema.fields ++ Array(StructField("col2", dataType = StringType, nullable = false)))
spark.sqlContext.createDataFrame(rddRes, newSchema).show
Results:
+---+----+
| id|col1|
+---+----+
| 1| abc|
| 2| def|
| 3| ghi|
+---+----+
+---+----+------+
| id|col1| col2|
+---+----+------+
| 1| abc|abcxyz|
| 2| def|defxyz|
| 3| ghi|ghixyz|
+---+----+------+
With Dataset:
case class TestData(id: Int, col1: String)
case class TransformedData(id: Int, col1: String, col2: String)

val test: Dataset[TestData] = List(TestData(1, "abc"), TestData(2, "def"), TestData(3, "ghi")).toDS

val transformed: Dataset[TransformedData] = test
  .map { x: TestData =>
    val newCol = x.col1 + "xyz"
    TransformedData(x.id, x.col1, newCol)
  }
transformed.show
As you can see, the Dataset version is more readable and provides strong typing.
Since I'm unaware of your Spark version, I'm providing both solutions here. However, if you're using Spark >= 1.6, you should look into Datasets. Playing with RDDs is fun, but it can quickly devolve into longer job runs and a host of other issues that you won't foresee.

Is there an explode function equivalent in plain Scala?

I am trying to look for the explode function or its equivalent in plain Scala rather than Spark.
Using the explode function in Spark, I was able to flatten a row with multiple elements into multiple rows as below.
scala> import org.apache.spark.sql.functions.explode
import org.apache.spark.sql.functions.explode
scala> val test = spark.read.json(spark.sparkContext.parallelize(Seq("""{"a":1,"b":[2,3]}""")))
scala> test.schema
res1: org.apache.spark.sql.types.StructType = StructType(StructField(a,LongType,true), StructField(b,ArrayType(LongType,true),true))
scala> test.show
+---+------+
| a| b|
+---+------+
| 1|[2, 3]|
+---+------+
scala> val flat = test.withColumn("b",explode($"b"))
flat: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint]
scala> flat.show
+---+---+
| a| b|
+---+---+
| 1| 2|
| 1| 3|
+---+---+
Is there an explode equivalent function in plain Scala without using Spark? Is there any way I can implement it if there is no explode function available in Scala?
A simple flatMap should help you in this case. I don't know the exact data structure you would like to work with in Scala, but let's take a slightly artificial example:
val l: List[(Int, List[Int])] = List(1 -> List(2, 3))

val result: List[(Int, Int)] = l.flatMap {
  case (a, b) => b.map(i => a -> i)
}
println(result)
which will produce the following result:
List((1,2), (1,3))
UPDATE
As suggested in the comment section by #jwvh, the same result may be achieved using a for-comprehension, hiding the explicit flatMap & map invocations:
val result2: List[(Int, Int)] = for((a, bList) <- l; b <- bList) yield a -> b
Hope this helps!

Spark: reduce/aggregate by key

I am new to Spark and Scala, so I have no idea what this kind of problem is called (which makes searching for it pretty hard).
I have data of the following structure:
[(date1, (name1, 1)), (date1, (name1, 1)), (date1, (name2, 1)), (date2, (name3, 1))]
In some way, this has to be reduced/aggregated to:
[(date1, [(name1, 2), (name2, 1)]), (date2, [(name3, 1)])]
I know how to do reduceByKey on a list of key-value pairs, but this particular problem is a mystery to me.
Thanks in advance!
Here it is, step-wise, with sample data matching yours:
val rdd1 = sc.makeRDD(Array( ("d1",("A",1)), ("d1",("A",1)), ("d1",("B",1)), ("d2",("E",1)) ),2)
val rdd2 = rdd1.map(x => ((x._1, x._2._1), x._2._2))
val rdd3 = rdd2.groupByKey
val rdd4 = rdd3.map {
  case (str, nums) => (str, nums.sum)
}
val rdd5 = rdd4.map(x => (x._1._1, (x._1._2, x._2))).groupByKey
rdd5.collect
returns:
res28: Array[(String, Iterable[(String, Int)])] = Array((d2,CompactBuffer((E,1))), (d1,CompactBuffer((A,2), (B,1))))
A better approach, avoiding the first groupByKey, is as follows:
val rdd1 = sc.makeRDD(Array( ("d1",("A",1)), ("d1",("A",1)), ("d1",("B",1)), ("d2",("E",1)) ), 2)
val rdd2 = rdd1.map(x => ((x._1, x._2._1), x._2._2)) // key by (date, name) so reduceByKey can sum the counts
val rdd3 = rdd2.reduceByKey(_ + _)
val rdd4 = rdd3.map(x => (x._1._1, (x._1._2, x._2))).groupByKey // necessary shuffle
rdd4.collect
The RDD approach above should be enough, but as I stated in the comments, this can also be done with DataFrames for structured data, so run the code below:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val rddA = sc.makeRDD(Array( ("d1","A",1), ("d1","A",1), ("d1","B",1), ("d2","E",1) ),2)
val dfA = rddA.toDF("c1", "c2", "c3")
val dfB = dfA
  .groupBy("c1", "c2")
  .agg(sum("c3").alias("sum"))
dfB.show
returns:
+---+---+---+
| c1| c2|sum|
+---+---+---+
| d1| A| 2|
| d2| E| 1|
| d1| B| 1|
+---+---+---+
But you can do the following to approximate the CompactBuffer output from the RDD approach above.
import org.apache.spark.sql.functions.{col, udf}
case class XY(x: String, y: Long)
val xyTuple = udf((x: String, y: Long) => XY(x, y))
val dfC = dfB
  .withColumn("xy", xyTuple(col("c2"), col("sum")))
  .drop("c2")
  .drop("sum")
dfC.printSchema
dfC.show

// This gives you the CompactBuffer answer, but from a DataFrame perspective
val dfD = dfC.groupBy(col("c1")).agg(collect_list(col("xy")))
dfD.show
returns the following (some renaming and possibly sorting is still required; see the sketch after the output):
+---+----------------+
| c1|collect_list(xy)|
+---+----------------+
| d2| [[E, 1]]|
| d1|[[A, 2], [B, 1]]|
+---+----------------+
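For the renaming and sorting mentioned above, a short follow-up sketch (the new column name and the sort key are just illustrative choices):
// Hedged follow-up: rename the aggregated column and order the output by key
val dfE = dfD
  .withColumnRenamed("collect_list(xy)", "xys")
  .orderBy("c1")
dfE.show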

Spark: How to convert a Dataset[(String , array[int])] to Dataset[(String, int, int, int, int, ...)] using Scala [closed]

The Dataset I have has the following schema: [String, array[int]]. Now, I want to convert it to a Dataset (or DataFrame) with the following schema: [String, int, int, int, ...]. Note that array[int] is dynamic; it can therefore have a different length for different rows.
The problem stems from the fact that the tuple (String, Array[Int]) is a specific type and it is the same type no matter how many elements are in the Array.
On the other hand, the tuple (String, Int) is a different type from (String, Int, Int), which is different still from (String, Int, Int, Int), and so on. Being a strongly typed language, Scala doesn't easily allow for a method that takes one type as input and produces one of many possible, and unrelated, types as output.
Perhaps if you describe why you think you want to do this we can offer a better solution for your situation.
As #jwvh suggests, you probably cannot do this in a type-safe Dataset way. If you relax type-safety you can probably do this using DataFrames (assuming your arrays are not crazy long - I believe a DataFrame is currently restricted to Int.MaxValue columns).
Here is the solution using (primarily) DataFrames on Spark 2.0.2:
We start with a toy example:
import org.apache.spark.sql.functions._
import spark.implicits._
val ds = spark.createDataset(("Hello", Array(1,2,3)) :: ("There", Array(1,2,10,11,100)) :: ("A", Array(5,6,7,8)) :: Nil)
// ds: org.apache.spark.sql.Dataset[(String, Array[Int])] = [_1: string, _2: array<int>]
ds.show()
+-----+-------------------+
| _1| _2|
+-----+-------------------+
|Hello| [1, 2, 3]|
|There|[1, 2, 10, 11, 100]|
| A| [5, 6, 7, 8]|
+-----+-------------------+
Next we compute the max length of the arrays we have (we hope is not crazy long here):
val maxLen = ds.select(max(size($"_2")).as[Long]).collect().head
Next, we want a function to select an element of the array at a particular index. We express the array selection function as a UDF:
val u = udf((a: Seq[Int], i: Int) => if (a.size <= i) null.asInstanceOf[Int] else a(i)) // note: null.asInstanceOf[Int] evaluates to 0, hence the 0s for shorter arrays below
Now we create all the columns we want to generate:
val columns = ds.col("_1") +: (for(i <- 0 until maxLen.toInt ) yield u(ds.col("_2"), lit(i)).as(s"a[$i]"))
Then hopefully we are done:
ds.select(columns:_*).show()
+-----+----+----+----+----+----+
| _1|a[0]|a[1]|a[2]|a[3]|a[4]|
+-----+----+----+----+----+----+
|Hello| 1| 2| 3| 0| 0|
|There| 1| 2| 10| 11| 100|
| A| 5| 6| 7| 8| 0|
+-----+----+----+----+----+----+
Here is the complete code for copy-paste:
import org.apache.spark.sql.functions._
import spark.implicits._
val ds = spark.createDataset(("Hello", Array(1,2,3)) :: ("There", Array(1,2,10,11,100)) :: ("A", Array(5,6,7,8)) :: Nil)
val maxLen = ds.select(max(size($"_2")).as[Long]).collect().head
val u = udf((a: Seq[Int], i: Int) => if(a.size <= i) null.asInstanceOf[Int] else a(i))
val columns = ds.col("_1") +: (for(i <- 0 until maxLen.toInt ) yield u(ds.col("_2"), lit(i)).as(s"a[$i]"))
ds.select(columns:_*).show()
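As an aside (not part of the original answer), the per-index UDF can likely be replaced by Column.getItem, which yields null for out-of-range positions instead of 0. A hedged sketch reusing ds and maxLen from above:
// Hedged alternative sketch: select array elements directly with getItem;
// out-of-range positions come back as null rather than 0
val cols = ds.col("_1") +: (0 until maxLen.toInt).map(i => ds.col("_2").getItem(i).as(s"a[$i]"))
ds.select(cols: _*).show()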