Generic UDAF in Spark 3.0 using Aggregator

Spark 3.0 has deprecated UserDefinedAggregateFunction and I was trying to rewrite my UDAF using Aggregator. Basic usage of Aggregator is simple; however, I struggle with a more generic version of the function.
I will try to explain my problem with this example, an implementation of collect_set. It's not my actual use case, but it's easier to explain the problem with:
import org.apache.spark.sql.{Encoders, Row}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator

class CollectSetDemoAgg(name: String) extends Aggregator[Row, Set[Int], Set[Int]] {
  override def zero = Set.empty
  override def reduce(b: Set[Int], a: Row) = b + a.getInt(a.fieldIndex(name))
  override def merge(b1: Set[Int], b2: Set[Int]) = b1 ++ b2
  override def finish(reduction: Set[Int]) = reduction
  override def bufferEncoder = Encoders.kryo[Set[Int]]
  override def outputEncoder = ExpressionEncoder()
}
// using it:
df.agg(new CollectSetDemoAgg("rank").toColumn as "result").show()
I prefer .toColumn to .udf.register, but that's not the point here.
Problem:
I cannot make a universal version of this Aggregator; it only works with integers.
I've attempted:
class CollectSetDemo(name: String) extends Aggregator[Row, Set[Any], Set[Any]]
It crashes with the error:
java.lang.UnsupportedOperationException: No Encoder found for Any
- array element class: "java.lang.Object"
- root class: "scala.collection.immutable.Set"
at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerFor$1(ScalaReflection.scala:567)
I could not go with CollectSetDemo[T] because I was not able to write a proper outputEncoder for it. Also, when using udaf, I can only work with Spark data types, columns, etc.

I have not found a nice way to solve the situation, but I was able to work around it somewhat. The code was partially borrowed from RowEncoder:
import org.apache.spark.sql.{Encoders, Row}
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.types._
import scala.reflect.ClassTag
import scala.reflect.runtime.universe.typeTag

class CollectSetDemoAgg(name: String, fieldType: DataType) extends Aggregator[Row, Set[Any], Any] {
  override def zero = Set.empty
  override def reduce(b: Set[Any], a: Row) = b + a.get(a.fieldIndex(name))
  override def merge(b1: Set[Any], b2: Set[Any]) = b1 ++ b2
  override def finish(reduction: Set[Any]) = reduction.toSeq
  override def bufferEncoder = Encoders.kryo[Set[Any]]

  // the interesting part: build the output encoder from the runtime DataType
  override def outputEncoder = {
    val mirror = ScalaReflection.mirror
    val tt = fieldType match {
      case ArrayType(LongType, _)    => typeTag[Seq[Long]]
      case ArrayType(IntegerType, _) => typeTag[Seq[Int]]
      case ArrayType(StringType, _)  => typeTag[Seq[String]]
      // .. etc etc
      case _ => throw new RuntimeException(s"Could not create encoder for ${name} column (${fieldType})")
    }
    val tpe = tt.in(mirror).tpe
    val cls = mirror.runtimeClass(tpe)
    val serializer = ScalaReflection.serializerForType(tpe)
    val deserializer = ScalaReflection.deserializerForType(tpe)
    new ExpressionEncoder[Any](serializer, deserializer, ClassTag[Any](cls))
  }
}
One thing I had to add was a result data type parameter in the aggregator. The usage then changed to:
df.agg(new CollectSetDemoAgg("rank", new ArrayType(IntegerType, true)).toColumn as "result").show()
I really don't like how it turned out, but it works. I also welcome any suggestions on how to improve it.

Modification of @Ramunas' answer with generics:
import org.apache.spark.sql.{Encoders, Row}
import org.apache.spark.sql.catalyst.ScalaReflection.{mirror, serializerForType, deserializerForType}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator
import scala.reflect.ClassTag
import scala.reflect.runtime.universe.{typeTag, TypeTag}

class CollectSetDemoAgg[T: TypeTag](name: String) extends Aggregator[Row, Set[T], Seq[T]] {
  override def zero = Set.empty
  override def reduce(b: Set[T], a: Row) = b + a.getAs[T](a.fieldIndex(name))
  override def merge(b1: Set[T], b2: Set[T]) = b1 ++ b2
  override def finish(reduction: Set[T]) = reduction.toSeq
  override def bufferEncoder = Encoders.kryo[Set[T]]

  override def outputEncoder = {
    val tt = typeTag[Seq[T]]
    val tpe = tt.in(mirror).tpe
    val cls = mirror.runtimeClass(tpe)
    val serializer = serializerForType(tpe)
    val deserializer = deserializerForType(tpe)
    new ExpressionEncoder[Seq[T]](serializer, deserializer, ClassTag[Seq[T]](cls))
  }
}
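Usage stays the same as before, just with an explicit element type (a minimal sketch, assuming a DataFrame df with an integer rank column):
df.agg(new CollectSetDemoAgg[Int]("rank").toColumn as "result").show()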

Related

Generating random/sample json based on a schema in Scala

I need to generate some random JSON samples that comply with a schema, dynamically. Meaning that the input would be a schema (e.g. json-schema) and the output would be a JSON document that complies with it.
I'm looking for pointers. Any suggestions?
This is not a complete solution, but you can take it from here.
Let's assume these are the domain objects we want to generate:
case class Dummy1(foo: String)
case class Dummy11(foo: Dummy1)
If we do this:
// The Random type class itself is not shown in the original snippet;
// a minimal definition that makes the code compile:
trait Random[T] {
  def generate(): T
}

object O {
  implicit def stringR: Random[String] = new Random[String] {
    override def generate(): String = "s"
  }
  implicit def intR: Random[Int] = new Random[Int] {
    override def generate(): Int = 1
  }
  implicit def tupleR[T1: Random, T2: Random]: Random[(T1, T2)] = new Random[(T1, T2)] {
    override def generate(): (T1, T2) = {
      val t1: T1 = G.random[T1]()
      val t2: T2 = G.random[T2]()
      (t1, t2)
    }
  }
}

object G {
  def random[R: Random](): R = {
    implicitly[Random[R]].generate()
  }
}
then we will be able to generate some primitive values:
import O._
val s: String = G.random[String]()
val i: Int = G.random[Int]()
val t: (Int, String) = G.random[(Int, String)]()
println("s=" + s)
println("i=" + i)
println("t=" + t)
Now, to jump to custom types, we need to add
def randomX[R: Random, T](f: R => T): Random[T] = {
  val value: Random[R] = implicitly[Random[R]]
  new Random[T] {
    override def generate(): T = f.apply(value.generate())
  }
}
to our G object.
Now we can
import O._
val d1: Dummy1 = G.randomX(Dummy1.apply).generate()
println("d1=" + d1)
and with some extra effort even
import O._
implicit val d1Gen: Random[Dummy1] = G.randomX(Dummy1.apply)
val d11: Dummy11 = G.randomX(Dummy11.apply).generate()
println("d11=" + d11)
Now you need to extend it to all the primitives you have, add a real implementation of random, and support classes with more than one field (see the sketch below), and you're ready to go.
You may even make some fancy library out of it.
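For classes with more than one field, one possible extension is an arity-2 variant of randomX (the randomX2 name and shape are my own sketch, not part of the answer above):
// Hypothetical two-field variant, to be added to the G object:
def randomX2[R1: Random, R2: Random, T](f: (R1, R2) => T): Random[T] =
  new Random[T] {
    override def generate(): T =
      f(implicitly[Random[R1]].generate(), implicitly[Random[R2]].generate())
  }

// Usage sketch:
import O._
case class Dummy2(foo: String, bar: Int)
val d2: Dummy2 = G.randomX2(Dummy2.apply _).generate()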

How to reflect concrete types that correspond to the type parameters of an abstraction type in Scala?

Suppose we have a generic type (for example, Seq[E]) and a concrete subtype (for example, Seq[Int]). How can we extract the concrete type that corresponds to a type parameter of the abstraction type? In other words, how can we learn the binding E -> Int?
Below is a minimal code example that tests for the desired behavior. The extractTypeBinding function would perform the transformation in question.
import scala.reflect.runtime.{universe => ru}

class MyFuncs
object MyFuncs {
  def fn1[E](s: Seq[E]): E = ???
  def fn2[K, V](m: Map[K, V]): Int = ???
}

object Scratch {
  def extractTypeBinding(genType: ru.Type, typeParam: ru.Type)(concreteType: ru.Type): ru.Type = ???

  def getArgTypes(methodSymbol: ru.MethodSymbol): Seq[ru.Type] =
    methodSymbol.paramLists.headOption.getOrElse(Nil).map(_.typeSignature)

  def main(a: Array[String]): Unit = {
    // Grab the argument types of our methods.
    val funcsType = ru.typeOf[MyFuncs].companion
    val fn1ArgTypes = getArgTypes(funcsType.member(ru.TermName("fn1")).asMethod)
    val fn2ArgTypes = getArgTypes(funcsType.member(ru.TermName("fn2")).asMethod)
    val genericSeq = fn1ArgTypes.head // Seq[E]
    val genericMap = fn2ArgTypes.head // Map[K, V]

    // Create an extractor for the `E` in `Seq[E]`.
    val seqElExtractor = extractTypeBinding(genericSeq, genericSeq.typeArgs.head) _
    // Extractor for the `K` in `Map[K,V]`
    val mapKeyExtractor = extractTypeBinding(genericMap, genericMap.typeArgs.head) _
    // Extractor for the `V` in `Map[K,V]`
    val mapValueExtractor = extractTypeBinding(genericMap, genericMap.typeArgs(1)) _

    println(seqElExtractor(ru.typeOf[Seq[Int]])) // should be Int
    println(seqElExtractor(ru.typeOf[Seq[Map[String, Double]]])) // should be Map[String, Double]
    println(mapKeyExtractor(ru.typeOf[Map[String, Double]])) // should be String
    println(mapKeyExtractor(ru.typeOf[Map[Int, Boolean]])) // should be Int
    println(mapValueExtractor(ru.typeOf[Map[String, Double]])) // should be Double
    println(mapValueExtractor(ru.typeOf[Map[Int, Boolean]])) // should be Boolean
  }
}
Based on the docstrings, it seems like asSeenFrom should be the key to implementing extractTypeBinding. I tried the implementation below, but it returned the type parameter unchanged:
def extractTypeBinding(genType: ru.Type, typeParam: ru.Type)(concreteType: ru.Type): ru.Type =
  typeParam.asSeenFrom(concreteType, genType.typeSymbol.asClass)
...
println(seqElExtractor(ru.typeOf[Seq[Int]])) // E
println(seqElExtractor(ru.typeOf[Seq[Map[String, Double]]])) // E
If asSeenFrom is the correct approach, what would the correct incantation be?
If not, then how should this be done?
The simplest solution came from the helpful prodding by Dmytro Mitin in the comments.
I had a couple of misunderstandings about .typeArgs that were cleared up with some additional experimentation:
- It returns all type arguments, not just the abstract ones.
- It only returns the "top level" type arguments of the type you call it on. In other words, Map[A, Map[B, C]] only has 2 type args (A and Map[B, C]).
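A quick REPL check makes both points concrete (a sketch of my own, not from the original answer):
import scala.reflect.runtime.{universe => ru}

ru.typeOf[Map[String, Int]].typeArgs
// List(String, Int) -- concrete arguments are returned too

ru.typeOf[Map[String, Map[Int, Double]]].typeArgs
// List(String, Map[Int,Double]) -- nested arguments stay packed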
Both of those seem very intuitive now, but I initially made some foolish assumptions. Below is a modified version of my test that more clearly achieves my original intent.
import scala.reflect.runtime.{universe => ru}

class MyFuncs
object MyFuncs {
  def fn1[E](s: Seq[E]): E = ???
  def fn2[K, V](m: Map[K, V]): Int = ???
}

object Scratch {
  def typeArgBindings(genericType: ru.Type, concreteType: ru.Type): Map[ru.Type, ru.Type] =
    // #todo consider validating both have the same base type.
    genericType.typeArgs.zip(concreteType.typeArgs).toMap

  def getArgTypes(methodSymbol: ru.MethodSymbol): Seq[ru.Type] =
    methodSymbol.paramLists.headOption.getOrElse(Nil).map(_.typeSignature)

  def main(a: Array[String]): Unit = {
    // Grab the argument types of our methods.
    val funcsType = ru.typeOf[MyFuncs].companion
    val fn1ArgTypes = getArgTypes(funcsType.member(ru.TermName("fn1")).asMethod)
    val fn2ArgTypes = getArgTypes(funcsType.member(ru.TermName("fn2")).asMethod)
    val genericSeq = fn1ArgTypes.head // Seq[E]
    val genericMap = fn2ArgTypes.head // Map[K, V]

    println(typeArgBindings(genericSeq, ru.typeOf[Seq[Int]])) // Map(E -> Int)
    println(typeArgBindings(genericSeq, ru.typeOf[Seq[Map[String, Double]]])) // Map(E -> Map[String,Double])
    println(typeArgBindings(genericMap, ru.typeOf[Map[String, Double]])) // Map(K -> String, V -> Double)
    println(typeArgBindings(genericMap, ru.typeOf[Map[Int, Boolean]])) // Map(K -> Int, V -> Boolean)
  }
}

Is it possible to write a upickle Serializer for akka

I would like to implement an Akka Serializer using upickle, but I'm not sure it's possible. To do so I would need to implement a Serializer, something like the following:
import akka.serialization.Serializer
import upickle.default._

class UpickleSerializer extends Serializer {
  def includeManifest: Boolean = true
  def identifier = 1234567

  def toBinary(obj: AnyRef): Array[Byte] = {
    writeBinary(obj) // ???
  }

  def fromBinary(bytes: Array[Byte], clazz: Option[Class[_]]): AnyRef = {
    readBinary(bytes) // ???
  }
}
The problem is I cannot call writeBinary/readBinary without having the relevant Writer/Reader. Is there a way I can look these up based on the object class?
Take a look at the following files; you should get some ideas!
CborAkkaSerializer.scala
LocationAkkaSerializer.scala
Note: these serializers use CBOR.
I found a way to do it using reflection. I base the solution on the assumption that any object that needs to be serialized has a ReadWriter defined in its companion object:
import akka.serialization.Serializer
import upickle.default._

class UpickleSerializer extends Serializer {
  private var map = Map[Class[_], ReadWriter[AnyRef]]()

  def includeManifest: Boolean = true
  def identifier = 1234567

  def toBinary(obj: AnyRef): Array[Byte] = {
    implicit val rw = getReadWriter(obj.getClass)
    writeBinary(obj)
  }

  def fromBinary(bytes: Array[Byte], clazz: Option[Class[_]]): AnyRef = {
    implicit val rw = lookup(clazz.get)
    readBinary[AnyRef](bytes)
  }

  // Cache looked-up ReadWriters; the reflective lookup is expensive.
  private def getReadWriter(clazz: Class[_]) = map.get(clazz) match {
    case Some(rw) => rw
    case None =>
      val rw = lookup(clazz)
      map += clazz -> rw
      rw
  }

  // Find a ReadWriter[_] member on the companion object of the given class.
  private def lookup(clazz: Class[_]) = {
    import scala.reflect.runtime._
    import universe.typeOf
    val rootMirror = universe.runtimeMirror(clazz.getClassLoader)
    val classSymbol = rootMirror.classSymbol(clazz)
    val moduleSymbol = classSymbol.companion.asModule
    val moduleMirror = rootMirror.reflectModule(moduleSymbol)
    val instanceMirror = rootMirror.reflect(moduleMirror.instance)
    val members = instanceMirror.symbol.typeSignature.members
    members.find(_.typeSignature <:< typeOf[ReadWriter[_]]) match {
      case Some(rw) =>
        instanceMirror.reflectField(rw.asTerm).get.asInstanceOf[ReadWriter[AnyRef]]
      case None =>
        throw new RuntimeException("Not found")
    }
  }
}
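For the lookup to succeed, every type that goes through the serializer needs a ReadWriter field in its companion object, along these lines (an illustrative example of the assumption, not code from the answer):
import upickle.default._

case class Ping(message: String)

object Ping {
  // The serializer's reflective lookup finds this field.
  implicit val rw: ReadWriter[Ping] = macroRW
}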

Determine non-empty additional fields in a subclass

Assume I have a trait which looks something like this
trait MyTrait {
  val x: Option[String] = None
  val y: Option[String] = None
}
After defining the trait, I extend it in a case class MyClass, which looks something like this:
case class MyClass(
  override val x: Option[String] = None,
  override val y: Option[String] = None,
  z: Option[String] = None
) extends MyTrait
Now I need to find out whether any property other than those inherited from MyTrait is not None. That is, I need to write a method called getClassInfo that returns true/false based on the values present in the case class; in this case it should return true if z is non-empty. My getClassInfo goes something like this:
def getClassInfo(myClass: MyClass): Boolean = {
  myClass
    .productIterator
    .filterNot(x => x.isInstanceOf[MyTrait])
    .exists(_.isInstanceOf[Some[_]])
}
Ideally this should filter out all the fields that are part of MyTrait and leave me z in this case. I tried using variance, but isInstanceOf does not accept a variance annotation:
filterNot(x => x.isInstanceOf[+MyTrait]) // does not compile
The expected behaviour is:
val a = getClassInfo(MyClass()) // needs to return false
val b = getClassInfo(MyClass(Some("a"), Some("B"), Some("c"))) // needs to return true
val c = getClassInfo(MyClass(z = Some("z"))) // needs to return true
val d = getClassInfo(MyClass(x = Some("x"), y = Some("y"))) // needs to return false
The simple answer is to declare an abstract method that gives the result you want and override it in the subclass:
trait MyTrait {
  def x: Option[String]
  def y: Option[String]
  def anyNonEmpty: Boolean = false
}

case class MyClass(x: Option[String] = None, y: Option[String] = None, z: Option[String] = None) extends MyTrait {
  override def anyNonEmpty = z.nonEmpty
}
You can then call anyNonEmpty on your object to get the getClassInfo result.
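If you want to keep the original entry point, getClassInfo then reduces to a simple delegation (a sketch):
def getClassInfo(m: MyTrait): Boolean = m.anyNonEmpty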
Also note that I've used def here in the trait, because a val in a trait is generally a bad idea due to initialisation issues.
If you really need reflection you can try
import scala.reflect.runtime.currentMirror
import scala.reflect.runtime.universe._
def getClassInfo(myClass: MyClass): Boolean = {
  def fields[A: TypeTag] = typeOf[A].members.collect {
    case m: MethodSymbol if m.isGetter && m.isPublic => m
  }
  val mtFields = fields[MyTrait]
  val mcFields = fields[MyClass]
  val mtFieldNames = mtFields.map(_.name).toSet
  val mcNotMtFields = mcFields.filterNot(f => mtFieldNames.contains(f.name))
  val instanceMirror = currentMirror.reflect(myClass)
  val mcNotMtFieldValues = mcNotMtFields.map(f => instanceMirror.reflectField(f).get)
  mcNotMtFieldValues.exists(_.isInstanceOf[Some[_]])
}
val a = getClassInfo(MyClass()) // false
val b = getClassInfo(MyClass(Some("a"), Some("B"), Some("c"))) // true
val c = getClassInfo(MyClass(z = Some("z"))) // true
val d = getClassInfo(MyClass(x = Some("x"), y = Some("y"))) // false

Convert java.util.IdentityHashMap to scala.immutable.Map

What is the simplest way to convert a java.util.IdentityHashMap[A,B] into a subtype of scala.immutable.Map[A,B]? I need to keep keys separate unless they are eq.
Here's what I've tried so far:
scala> case class Example()
scala> val m = new java.util.IdentityHashMap[Example, String]()
scala> m.put(Example(), "first!")
scala> m.put(Example(), "second!")
scala> import scala.collection.JavaConverters._
import scala.collection.JavaConverters._
scala> m.asScala // got a mutable Scala equivalent OK
res14: scala.collection.mutable.Map[Example,String] = Map(Example() -> first!, Example() -> second!)
scala> m.asScala.toMap // doesn't work, since toMap() removes duplicate keys (testing with ==)
res15: scala.collection.immutable.Map[Example,String] = Map(Example() -> second!)
Here's a simple implementation of an identity map in Scala. In usage it should be similar to the standard immutable Map.
Example usage:
val im = IdentityMap(
  new String("stuff") -> 5,
  new String("stuff") -> 10)
println(im) // IdentityMap(stuff -> 5, stuff -> 10)
Your case:
import scala.collection.JavaConverters._
import java.{util => ju}

val javaIdentityMap: ju.IdentityHashMap[String, Int] = ???
val scalaIdentityMap = IdentityMap.empty[String, Int] ++ javaIdentityMap.asScala
Implementation itself (for performance reasons, there may be some more methods that need to be overridden):
import scala.collection.generic.ImmutableMapFactory
import scala.collection.immutable.MapLike
import IdentityMap.{Wrapper, wrap}

class IdentityMap[A, +B] private(underlying: Map[Wrapper[A], B])
  extends Map[A, B] with MapLike[A, B, IdentityMap[A, B]] {

  def +[B1 >: B](kv: (A, B1)) =
    new IdentityMap(underlying + ((wrap(kv._1), kv._2)))

  def -(key: A) =
    new IdentityMap(underlying - wrap(key))

  def iterator =
    underlying.iterator.map {
      case (kw, v) => (kw.value, v)
    }

  def get(key: A) =
    underlying.get(wrap(key))

  override def size: Int =
    underlying.size

  override def empty =
    new IdentityMap(underlying.empty)

  override def stringPrefix =
    "IdentityMap"
}

object IdentityMap extends ImmutableMapFactory[IdentityMap] {
  def empty[A, B] =
    new IdentityMap(Map.empty)

  private class Wrapper[A](val value: A) {
    override def toString: String =
      value.toString

    override def equals(other: Any) = other match {
      case otherWrapper: Wrapper[_] =>
        value.asInstanceOf[AnyRef] eq otherWrapper.value.asInstanceOf[AnyRef]
      case _ => false
    }

    override def hashCode =
      System.identityHashCode(value)
  }

  private def wrap[A](key: A) =
    new Wrapper(key)
}
One way to handle this would be to change what equality means for the class, e.g.
scala> case class Example() {
     |   override def equals(that: Any) = that match {
     |     case that: AnyRef => this eq that
     |     case _ => false
     |   }
     | }
defined class Example
scala> val m = new java.util.IdentityHashMap[Example, String]()
m: java.util.IdentityHashMap[Example,String] = {}
scala> m.put(Example(), "first!")
res1: String = null
scala> m.put(Example(), "second!")
res2: String = null
scala> import scala.collection.JavaConverters._
import scala.collection.JavaConverters._
scala> m.asScala
res3: scala.collection.mutable.Map[Example,String] = Map(Example() -> second!, Example() -> first!)
scala> m.asScala.toMap
res4: scala.collection.immutable.Map[Example,String] = Map(Example() -> second!, Example() -> first!)
Or if you don't want to change equality for the class, you could make a wrapper.
Of course, this won't perform as well as a Map that uses eq instead of ==; it might be worth asking for one.
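For completeness, a minimal sketch of the wrapper idea, with a hypothetical IdentityKey class (assuming the original Example case class without the equals override; it is essentially the same trick as Wrapper in the answer above):
// Hypothetical wrapper: equality is reference identity of the wrapped value.
final class IdentityKey[A <: AnyRef](val value: A) {
  override def equals(other: Any): Boolean = other match {
    case that: IdentityKey[_] => value eq that.value
    case _                    => false
  }
  override def hashCode: Int = System.identityHashCode(value)
}

// Duplicate (==-equal but not eq) keys now coexist in an immutable Map:
val m = Map(new IdentityKey(Example()) -> "first!",
            new IdentityKey(Example()) -> "second!")
// m.size == 2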