How to store custom objects in Dataset? - scala

According to Introducing Spark Datasets:
As we look forward to Spark 2.0, we plan some exciting improvements to Datasets, specifically:
...
Custom encoders – while we currently autogenerate encoders for a wide variety of types, we’d like to open up an API for custom objects.
and attempts to store custom type in a Dataset lead to following error like:
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases
or:
Java.lang.UnsupportedOperationException: No Encoder found for ....
Are there any existing workarounds?
Note this question exists only as an entry point for a Community Wiki answer. Feel free to update / improve both question and answer.

Update
This answer is still valid and informative, although things are now better since 2.2/2.3, which adds built-in encoder support for Set, Seq, Map, Date, Timestamp, and BigDecimal. If you stick to making types with only case classes and the usual Scala types, you should be fine with just the implicit in SQLImplicits.
Unfortunately, virtually nothing has been added to help with this. Searching for #since 2.0.0 in Encoders.scala or SQLImplicits.scala finds things mostly to do with primitive types (and some tweaking of case classes). So, first thing to say: there currently is no real good support for custom class encoders. With that out of the way, what follows is some tricks which do as good a job as we can ever hope to, given what we currently have at our disposal. As an upfront disclaimer: this won't work perfectly and I'll do my best to make all limitations clear and upfront.
What exactly is the problem
When you want to make a dataset, Spark "requires an encoder (to convert a JVM object of type T to and from the internal Spark SQL representation) that is generally created automatically through implicits from a SparkSession, or can be created explicitly by calling static methods on Encoders" (taken from the docs on createDataset). An encoder will take the form Encoder[T] where T is the type you are encoding. The first suggestion is to add import spark.implicits._ (which gives you these implicit encoders) and the second suggestion is to explicitly pass in the implicit encoder using this set of encoder related functions.
There is no encoder available for regular classes, so
import spark.implicits._
class MyObj(val i: Int)
// ...
val d = spark.createDataset(Seq(new MyObj(1),new MyObj(2),new MyObj(3)))
will give you the following implicit related compile time error:
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases
However, if you wrap whatever type you just used to get the above error in some class that extends Product, the error confusingly gets delayed to runtime, so
import spark.implicits._
case class Wrap[T](unwrap: T)
class MyObj(val i: Int)
// ...
val d = spark.createDataset(Seq(Wrap(new MyObj(1)),Wrap(new MyObj(2)),Wrap(new MyObj(3))))
Compiles just fine, but fails at runtime with
java.lang.UnsupportedOperationException: No Encoder found for MyObj
The reason for this is that the encoders Spark creates with the implicits are actually only made at runtime (via scala relfection). In this case, all Spark checks at compile time is that the outermost class extends Product (which all case classes do), and only realizes at runtime that it still doesn't know what to do with MyObj (the same problem occurs if I tried to make a Dataset[(Int,MyObj)] - Spark waits until runtime to barf on MyObj). These are central problems that are in dire need of being fixed:
some classes that extend Product compile despite always crashing at runtime and
there is no way of passing in custom encoders for nested types (I have no way of feeding Spark an encoder for just MyObj such that it then knows how to encode Wrap[MyObj] or (Int,MyObj)).
Just use kryo
The solution everyone suggests is to use the kryo encoder.
import spark.implicits._
class MyObj(val i: Int)
implicit val myObjEncoder = org.apache.spark.sql.Encoders.kryo[MyObj]
// ...
val d = spark.createDataset(Seq(new MyObj(1),new MyObj(2),new MyObj(3)))
This gets pretty tedious fast though. Especially if your code is manipulating all sorts of datasets, joining, grouping etc. You end up racking up a bunch of extra implicits. So, why not just make an implicit that does this all automatically?
import scala.reflect.ClassTag
implicit def kryoEncoder[A](implicit ct: ClassTag[A]) =
org.apache.spark.sql.Encoders.kryo[A](ct)
And now, it seems like I can do almost anything I want (the example below won't work in the spark-shell where spark.implicits._ is automatically imported)
class MyObj(val i: Int)
val d1 = spark.createDataset(Seq(new MyObj(1),new MyObj(2),new MyObj(3)))
val d2 = d1.map(d => (d.i+1,d)).alias("d2") // mapping works fine and ..
val d3 = d1.map(d => (d.i, d)).alias("d3") // .. deals with the new type
val d4 = d2.joinWith(d3, $"d2._1" === $"d3._1") // Boom!
Or almost. The problem is that using kryo leads to Spark just storing every row in the dataset as a flat binary object. For map, filter, foreach that is enough, but for operations like join, Spark really needs these to be separated into columns. Inspecting the schema for d2 or d3, you see there is just one binary column:
d2.printSchema
// root
// |-- value: binary (nullable = true)
Partial solution for tuples
So, using the magic of implicits in Scala (more in 6.26.3 Overloading Resolution), I can make myself a series of implicits that will do as good a job as possible, at least for tuples, and will work well with existing implicits:
import org.apache.spark.sql.{Encoder,Encoders}
import scala.reflect.ClassTag
import spark.implicits._ // we can still take advantage of all the old implicits
implicit def single[A](implicit c: ClassTag[A]): Encoder[A] = Encoders.kryo[A](c)
implicit def tuple2[A1, A2](
implicit e1: Encoder[A1],
e2: Encoder[A2]
): Encoder[(A1,A2)] = Encoders.tuple[A1,A2](e1, e2)
implicit def tuple3[A1, A2, A3](
implicit e1: Encoder[A1],
e2: Encoder[A2],
e3: Encoder[A3]
): Encoder[(A1,A2,A3)] = Encoders.tuple[A1,A2,A3](e1, e2, e3)
// ... you can keep making these
Then, armed with these implicits, I can make my example above work, albeit with some column renaming
class MyObj(val i: Int)
val d1 = spark.createDataset(Seq(new MyObj(1),new MyObj(2),new MyObj(3)))
val d2 = d1.map(d => (d.i+1,d)).toDF("_1","_2").as[(Int,MyObj)].alias("d2")
val d3 = d1.map(d => (d.i ,d)).toDF("_1","_2").as[(Int,MyObj)].alias("d3")
val d4 = d2.joinWith(d3, $"d2._1" === $"d3._1")
I haven't yet figured out how to get the expected tuple names (_1, _2, ...) by default without renaming them - if someone else wants to play around with this, this is where the name "value" gets introduced and this is where the tuple names are usually added. However, the key point is that that I now have a nice structured schema:
d4.printSchema
// root
// |-- _1: struct (nullable = false)
// | |-- _1: integer (nullable = true)
// | |-- _2: binary (nullable = true)
// |-- _2: struct (nullable = false)
// | |-- _1: integer (nullable = true)
// | |-- _2: binary (nullable = true)
So, in summary, this workaround:
allows us to get separate columns for tuples (so we can join on tuples again, yay!)
we can again just rely on the implicits (so no need to be passing in kryo all over the place)
is almost entirely backwards compatible with import spark.implicits._ (with some renaming involved)
does not let us join on the kyro serialized binary columns, let alone on fields those may have
has the unpleasant side-effect of renaming some of the tuple columns to "value" (if necessary, this can be undone by converting .toDF, specifying new column names, and converting back to a dataset - and the schema names seem to be preserved through joins, where they are most needed).
Partial solution for classes in general
This one is less pleasant and has no good solution. However, now that we have the tuple solution above, I have a hunch the implicit conversion solution from another answer will be a bit less painful too since you can convert your more complex classes to tuples. Then, after creating the dataset, you'd probably rename the columns using the dataframe approach. If all goes well, this is really an improvement since I can now perform joins on the fields of my classes. If I had just used one flat binary kryo serializer that wouldn't have been possible.
Here is an example that does a bit of everything: I have a class MyObj which has fields of types Int, java.util.UUID, and Set[String]. The first takes care of itself. The second, although I could serialize using kryo would be more useful if stored as a String (since UUIDs are usually something I'll want to join against). The third really just belongs in a binary column.
class MyObj(val i: Int, val u: java.util.UUID, val s: Set[String])
// alias for the type to convert to and from
type MyObjEncoded = (Int, String, Set[String])
// implicit conversions
implicit def toEncoded(o: MyObj): MyObjEncoded = (o.i, o.u.toString, o.s)
implicit def fromEncoded(e: MyObjEncoded): MyObj =
new MyObj(e._1, java.util.UUID.fromString(e._2), e._3)
Now, I can create a dataset with a nice schema using this machinery:
val d = spark.createDataset(Seq[MyObjEncoded](
new MyObj(1, java.util.UUID.randomUUID, Set("foo")),
new MyObj(2, java.util.UUID.randomUUID, Set("bar"))
)).toDF("i","u","s").as[MyObjEncoded]
And the schema shows me I columns with the right names and with the first two both things I can join against.
d.printSchema
// root
// |-- i: integer (nullable = false)
// |-- u: string (nullable = true)
// |-- s: binary (nullable = true)

Using generic encoders.
There are two generic encoders available for now kryo and javaSerialization where the latter one is explicitly described as:
extremely inefficient and should only be used as the last resort.
Assuming following class
class Bar(i: Int) {
override def toString = s"bar $i"
def bar = i
}
you can use these encoders by adding implicit encoder:
object BarEncoders {
implicit def barEncoder: org.apache.spark.sql.Encoder[Bar] =
org.apache.spark.sql.Encoders.kryo[Bar]
}
which can be used together as follows:
object Main {
def main(args: Array[String]) {
val sc = new SparkContext("local", "test", new SparkConf())
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
import BarEncoders._
val ds = Seq(new Bar(1)).toDS
ds.show
sc.stop()
}
}
It stores objects as binary column so when converted to DataFrame you get following schema:
root
|-- value: binary (nullable = true)
It is also possible to encode tuples using kryo encoder for specific field:
val longBarEncoder = Encoders.tuple(Encoders.scalaLong, Encoders.kryo[Bar])
spark.createDataset(Seq((1L, new Bar(1))))(longBarEncoder)
// org.apache.spark.sql.Dataset[(Long, Bar)] = [_1: bigint, _2: binary]
Please note that we don't depend on implicit encoders here but pass encoder explicitly so this most likely won't work with toDS method.
Using implicit conversions:
Provide implicit conversions between representation which can be encoded and custom class, for example:
object BarConversions {
implicit def toInt(bar: Bar): Int = bar.bar
implicit def toBar(i: Int): Bar = new Bar(i)
}
object Main {
def main(args: Array[String]) {
val sc = new SparkContext("local", "test", new SparkConf())
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
import BarConversions._
type EncodedBar = Int
val bars: RDD[EncodedBar] = sc.parallelize(Seq(new Bar(1)))
val barsDS = bars.toDS
barsDS.show
barsDS.map(_.bar).show
sc.stop()
}
}
Related questions:
How to create encoder for Option type constructor, e.g. Option[Int]?

You can use UDTRegistration and then Case Classes, Tuples, etc... all work correctly with your User Defined Type!
Say you want to use a custom Enum:
trait CustomEnum { def value:String }
case object Foo extends CustomEnum { val value = "F" }
case object Bar extends CustomEnum { val value = "B" }
object CustomEnum {
def fromString(str:String) = Seq(Foo, Bar).find(_.value == str).get
}
Register it like this:
// First define a UDT class for it:
class CustomEnumUDT extends UserDefinedType[CustomEnum] {
override def sqlType: DataType = org.apache.spark.sql.types.StringType
override def serialize(obj: CustomEnum): Any = org.apache.spark.unsafe.types.UTF8String.fromString(obj.value)
// Note that this will be a UTF8String type
override def deserialize(datum: Any): CustomEnum = CustomEnum.fromString(datum.toString)
override def userClass: Class[CustomEnum] = classOf[CustomEnum]
}
// Then Register the UDT Class!
// NOTE: you have to put this file into the org.apache.spark package!
UDTRegistration.register(classOf[CustomEnum].getName, classOf[CustomEnumUDT].getName)
Then USE IT!
case class UsingCustomEnum(id:Int, en:CustomEnum)
val seq = Seq(
UsingCustomEnum(1, Foo),
UsingCustomEnum(2, Bar),
UsingCustomEnum(3, Foo)
).toDS()
seq.filter(_.en == Foo).show()
println(seq.collect())
Say you want to use a Polymorphic Record:
trait CustomPoly
case class FooPoly(id:Int) extends CustomPoly
case class BarPoly(value:String, secondValue:Long) extends CustomPoly
... and the use it like this:
case class UsingPoly(id:Int, poly:CustomPoly)
Seq(
UsingPoly(1, new FooPoly(1)),
UsingPoly(2, new BarPoly("Blah", 123)),
UsingPoly(3, new FooPoly(1))
).toDS
polySeq.filter(_.poly match {
case FooPoly(value) => value == 1
case _ => false
}).show()
You can write a custom UDT that encodes everything to bytes (I'm using java serialization here but it's probably better to instrument Spark's Kryo context).
First define the UDT class:
class CustomPolyUDT extends UserDefinedType[CustomPoly] {
val kryo = new Kryo()
override def sqlType: DataType = org.apache.spark.sql.types.BinaryType
override def serialize(obj: CustomPoly): Any = {
val bos = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(bos)
oos.writeObject(obj)
bos.toByteArray
}
override def deserialize(datum: Any): CustomPoly = {
val bis = new ByteArrayInputStream(datum.asInstanceOf[Array[Byte]])
val ois = new ObjectInputStream(bis)
val obj = ois.readObject()
obj.asInstanceOf[CustomPoly]
}
override def userClass: Class[CustomPoly] = classOf[CustomPoly]
}
Then register it:
// NOTE: The file you do this in has to be inside of the org.apache.spark package!
UDTRegistration.register(classOf[CustomPoly].getName, classOf[CustomPolyUDT].getName)
Then you can use it!
// As shown above:
case class UsingPoly(id:Int, poly:CustomPoly)
Seq(
UsingPoly(1, new FooPoly(1)),
UsingPoly(2, new BarPoly("Blah", 123)),
UsingPoly(3, new FooPoly(1))
).toDS
polySeq.filter(_.poly match {
case FooPoly(value) => value == 1
case _ => false
}).show()

Encoders work more or less the same in Spark2.0. And Kryo is still the recommended serialization choice.
You can look at following example with spark-shell
scala> import spark.implicits._
import spark.implicits._
scala> import org.apache.spark.sql.Encoders
import org.apache.spark.sql.Encoders
scala> case class NormalPerson(name: String, age: Int) {
| def aboutMe = s"I am ${name}. I am ${age} years old."
| }
defined class NormalPerson
scala> case class ReversePerson(name: Int, age: String) {
| def aboutMe = s"I am ${name}. I am ${age} years old."
| }
defined class ReversePerson
scala> val normalPersons = Seq(
| NormalPerson("Superman", 25),
| NormalPerson("Spiderman", 17),
| NormalPerson("Ironman", 29)
| )
normalPersons: Seq[NormalPerson] = List(NormalPerson(Superman,25), NormalPerson(Spiderman,17), NormalPerson(Ironman,29))
scala> val ds1 = sc.parallelize(normalPersons).toDS
ds1: org.apache.spark.sql.Dataset[NormalPerson] = [name: string, age: int]
scala> val ds2 = ds1.map(np => ReversePerson(np.age, np.name))
ds2: org.apache.spark.sql.Dataset[ReversePerson] = [name: int, age: string]
scala> ds1.show()
+---------+---+
| name|age|
+---------+---+
| Superman| 25|
|Spiderman| 17|
| Ironman| 29|
+---------+---+
scala> ds2.show()
+----+---------+
|name| age|
+----+---------+
| 25| Superman|
| 17|Spiderman|
| 29| Ironman|
+----+---------+
scala> ds1.foreach(p => println(p.aboutMe))
I am Ironman. I am 29 years old.
I am Superman. I am 25 years old.
I am Spiderman. I am 17 years old.
scala> val ds2 = ds1.map(np => ReversePerson(np.age, np.name))
ds2: org.apache.spark.sql.Dataset[ReversePerson] = [name: int, age: string]
scala> ds2.foreach(p => println(p.aboutMe))
I am 17. I am Spiderman years old.
I am 25. I am Superman years old.
I am 29. I am Ironman years old.
Till now] there were no appropriate encoders in present scope so our persons were not encoded as binary values. But that will change once we provide some implicit encoders using Kryo serialization.
// Provide Encoders
scala> implicit val normalPersonKryoEncoder = Encoders.kryo[NormalPerson]
normalPersonKryoEncoder: org.apache.spark.sql.Encoder[NormalPerson] = class[value[0]: binary]
scala> implicit val reversePersonKryoEncoder = Encoders.kryo[ReversePerson]
reversePersonKryoEncoder: org.apache.spark.sql.Encoder[ReversePerson] = class[value[0]: binary]
// Ecoders will be used since they are now present in Scope
scala> val ds3 = sc.parallelize(normalPersons).toDS
ds3: org.apache.spark.sql.Dataset[NormalPerson] = [value: binary]
scala> val ds4 = ds3.map(np => ReversePerson(np.age, np.name))
ds4: org.apache.spark.sql.Dataset[ReversePerson] = [value: binary]
// now all our persons show up as binary values
scala> ds3.show()
+--------------------+
| value|
+--------------------+
|[01 00 24 6C 69 6...|
|[01 00 24 6C 69 6...|
|[01 00 24 6C 69 6...|
+--------------------+
scala> ds4.show()
+--------------------+
| value|
+--------------------+
|[01 00 24 6C 69 6...|
|[01 00 24 6C 69 6...|
|[01 00 24 6C 69 6...|
+--------------------+
// Our instances still work as expected
scala> ds3.foreach(p => println(p.aboutMe))
I am Ironman. I am 29 years old.
I am Spiderman. I am 17 years old.
I am Superman. I am 25 years old.
scala> ds4.foreach(p => println(p.aboutMe))
I am 25. I am Superman years old.
I am 29. I am Ironman years old.
I am 17. I am Spiderman years old.

In case of Java Bean class, this can be useful
import spark.sqlContext.implicits._
import org.apache.spark.sql.Encoders
implicit val encoder = Encoders.bean[MyClasss](classOf[MyClass])
Now you can simply read the dataFrame as custom DataFrame
dataFrame.as[MyClass]
This will create a custom class encoder and not a binary one.

My examples will be in Java, but I don't imagine it to be difficult adapting to Scala.
I have been quite successful converting RDD<Fruit> to Dataset<Fruit> using spark.createDataset and Encoders.bean as long as Fruit is a simple Java Bean.
Step 1: Create the simple Java Bean.
public class Fruit implements Serializable {
private String name = "default-fruit";
private String color = "default-color";
// AllArgsConstructor
public Fruit(String name, String color) {
this.name = name;
this.color = color;
}
// NoArgsConstructor
public Fruit() {
this("default-fruit", "default-color");
}
// ...create getters and setters for above fields
// you figure it out
}
I'd stick to classes with primitive types and String as fields before the DataBricks folks beef up their Encoders. If you have a class with nested object, create another simple Java Bean with all of its fields flattened, so you can use RDD transformations to map the complex type to the simpler one. Sure it's a little extra work, but I imagine it'll help a lot on performance working with a flat schema.
Step 2: Get your Dataset from the RDD
SparkSession spark = SparkSession.builder().getOrCreate();
JavaSparkContext jsc = new JavaSparkContext();
List<Fruit> fruitList = ImmutableList.of(
new Fruit("apple", "red"),
new Fruit("orange", "orange"),
new Fruit("grape", "purple"));
JavaRDD<Fruit> fruitJavaRDD = jsc.parallelize(fruitList);
RDD<Fruit> fruitRDD = fruitJavaRDD.rdd();
Encoder<Fruit> fruitBean = Encoders.bean(Fruit.class);
Dataset<Fruit> fruitDataset = spark.createDataset(rdd, bean);
And voila! Lather, rinse, repeat.

For those who may in my situation I put my answer here, too.
To be specific,
I was reading 'Set typed data' from SQLContext. So original data format is DataFrame.
val sample = spark.sqlContext.sql("select 1 as a, collect_set(1) as b limit 1")
sample.show()
+---+---+
| a| b|
+---+---+
| 1|[1]|
+---+---+
Then convert it into RDD using rdd.map() with mutable.WrappedArray type.
sample
.rdd.map(r =>
(r.getInt(0), r.getAs[mutable.WrappedArray[Int]](1).toSet))
.collect()
.foreach(println)
Result:
(1,Set(1))

In addition to the suggestions already given, another option I recently discovered is that you can declare your custom class including the trait org.apache.spark.sql.catalyst.DefinedByConstructorParams.
This works if the class has a constructor that uses types the ExpressionEncoder can understand, i.e. primitive values and standard collections. It can come in handy when you're not able to declare the class as a case class, but don't want to use Kryo to encode it every time it's included in a Dataset.
For example, I wanted to declare a case class that included a Breeze vector. The only encoder that would be able to handle that would normally be Kryo. But if I declared a subclass that extended the Breeze DenseVector and DefinedByConstructorParams, the ExpressionEncoder understood that it could be serialized as an array of Doubles.
Here's how I declared it:
class SerializableDenseVector(values: Array[Double]) extends breeze.linalg.DenseVector[Double](values) with DefinedByConstructorParams
implicit def BreezeVectorToSerializable(bv: breeze.linalg.DenseVector[Double]): SerializableDenseVector = bv.asInstanceOf[SerializableDenseVector]
Now I can use SerializableDenseVector in a Dataset (directly, or as part of a Product) using a simple ExpressionEncoder and no Kryo. It works just like a Breeze DenseVector but serializes as an Array[Double].

#Alec's answer is great! Just to add a comment in this part of his/her answer:
import spark.implicits._
case class Wrap[T](unwrap: T)
class MyObj(val i: Int)
// ...
val d = spark.createDataset(Seq(Wrap(new MyObj(1)),Wrap(new MyObj(2)),Wrap(new MyObj(3))))
#Alec mentions:
there is no way of passing in custom encoders for nested types (I have no way of feeding Spark an encoder for just MyObj such that it then knows how to encode Wrap[MyObj] or (Int,MyObj)).
It seems so, because if I add an encoder for MyObj:
implicit val myEncoder = org.apache.spark.sql.Encoders.kryo[MyObj]
, it still fails:
java.lang.UnsupportedOperationException: No Encoder found for MyObj
- field (class: "MyObj", name: "unwrap")
- root class: "Wrap"
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:643)
But notice the important error message:
root class: "Wrap"
It actually gives a hint that encoding MyObj isn't enough, and you have to encode the entire chain including Wrap[T].
So if I do this, it solves the problem:
implicit val myWrapperEncoder = org.apache.spark.sql.Encoders.kryo[Wrap[MyObj]]
Hence, the comment of #Alec is NOT that true:
I have no way of feeding Spark an encoder for just MyObj such that it then knows how to encode Wrap[MyObj] or (Int,MyObj)
We still have a way to feed Spark the encoder for MyObj such that it then knows how to encode Wrap[MyObj] or (Int,MyObj).

Related

How to infer StructType schema for Spark Scala at run time given a Fully Qualified Name of a case class

Since a few days I was wondering if it is possible to infer a schema for Spark in Scala for a given case class, but unknown at compile time.
The only input is a string containing the FQN of the class (that could be used for example to create an instance of the case class at runtime via reflection)
I was thinking if it was possible to do something like:
package com.my.namespace
case class MyCaseClass (name: String, num: Int)
//Somewhere else in codebase
// coming from external configuration file, so unknown at compile time
val fqn = "com.my.namespace.MyCaseClass"
val schema = Encoders.product [ getXYZ( fqn ) ].schema
Of course, any other techniques that is not using Encoders is fine (building StructType analysing an instance of the case class ? Is it even possible ?)
What is the best approach?
Is it something feasible ?
You can use reflective toolbox
package com.my.namespace
import org.apache.spark.sql.types.StructType
import scala.reflect.runtime
import scala.tools.reflect.ToolBox
case class MyCaseClass (name: String, num: Int)
object Main extends App {
val fqn = "com.my.namespace.MyCaseClass"
val runtimeMirror = runtime.currentMirror
val toolbox = runtimeMirror.mkToolBox()
val res = toolbox.eval(toolbox.parse(s"""
import org.apache.spark.sql.Encoders
Encoders.product[$fqn].schema
""")).asInstanceOf[StructType]
println(res) // StructType(StructField(name,StringType,true),StructField(num,IntegerType,false))
}

How do I write a Dataset encoder to support mapping a function to a org.apache.spark.sql.Dataset[String] in Scala Spark

Moving from Spark 1.6 to Spark 2.2* has brought the error “error: Unable to find encoder for type stored in a 'Dataset'. Primitive types (Int, String, etc)” when trying to apply a method to a dataset returned from querying a parquet table.
I have oversimplified my code to demonstrate the same error. The code queries a parquet file to return the following datatype:
'org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]'
I apply a function to extract a string and integer , returning a string. Returning the following
datatype: Array[String]
Next, I need to perform extensive manipulations requiring a separate function. In this test function, I try to append a string producing the same error as my detailed example.
I have tried some encoder examples and use of the ‘case’ but have not come up with a workable solution. Any suggestions/ examples would be appreciated
scala> var d1 = hive.executeQuery(st)
d1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [cvdt35_message_id_d: string,
cvdt35_input_timestamp_s: decimal(16,5) ... 2 more fields]
val parseCVDP_parquet = (s:org.apache.spark.sql.Row) => s.getString(2).split("0x"
(1)+","+s.getDecimal(1);
scala> var d2 = d1.map(parseCVDP_parquet)
d2: org.apache.spark.sql.Dataset[String] = [value: string]
scala> d2.take(1)
20/03/25 19:01:08 WARN TaskSetManager: Stage 3 contains a task of very large size (131 KB). The
maximum recommended task size is 100 KB.
res10: Array[String] = Array(ab04006000504304,1522194407.95162)
scala> def dd(s:String){
| s + "some string"
| }
dd: (s: String)Unit
scala> var d3 = d2.map{s=> dd(s) }
<console>:47: error: Unable to find encoder for type stored in a Dataset. Primitive types (Int,
String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support
for serializing other types will be added in future releases.
To distill the problem further, i believe this scenario (though I have not tried all possible solutions to) can be simplified further to the following code:
scala> var test = ( 1 to 3).map( _ => "just some words").toDS()
test: org.apache.spark.sql.Dataset[String] = [value: string]
scala> def f(s: String){
| s + "hi"
| }
f: (s: String)Unit
scala> var test2 = test.map{ s => f(s) }
<console>:42: error: Unable to find encoder for type stored in a Dataset.
Primitive types (Int, String, etc) and Product types (case classes) are
supported by importing spark.implicits._ Support for serializing other types
will be added in future releases.
var test2 = test.map{ s => f(s) }
I have a solution at least to my simplified Problem (below).
I will be testing more....
scala> var test = ( 1 to 3).map( _ => "just some words").toDS()
test: org.apache.spark.sql.Dataset[String] = [value: string]
scala> def f(s: String): String = {
| val r = s + "hi"
| return r
| }
f: (s: String)String
scala> var test2 = test.rdd.map{ s => f(s) }
test2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[17] at map at <console>:43
scala> test2.take(1)
res9: Array[String] = Array(just some wordshi)
The first solution does not work on my initial (production) data set, rather producing the error "org.apache.spark.SparkException: Task not serializable" (interestingly though both stored as the same data type (org.apache.spark.sql.Dataset[String] = [value: string]) which I believe to be related. I included yet another solution to my test data set that eliminates the initial Encoder error and as shown actually works on my toy problem, does not ramp to a production data set. A bit confused as to exactly why my application is sidelined in the movement from 1.6 to 2.3 version spark as I didn't have to make any special accommodations to my application for years and have run it successfully for calculations that most likely count in the trillions. Other explorations have included wrapping my method as Serializable, explorations of the #transient keyword, leveraging the "org.apache.spark.serializer.KryoSerializer", writing my methods as functions and changing all vars to 'vals' (following related posts on 'stack').
scala> import spark.implicits._
import spark.implicits._
scala> var test = ( 1 to 3).map( _ => "just some words").toDS()
test: org.apache.spark.sql.Dataset[String] = [value: string]
scala> def f(s: String): String = {
| val r = s + "hi"
| return r
| }
f: (s: String)String
scala> var d2 = test.map{s => f(s)}(Encoders.STRING)
d2: org.apache.spark.sql.Dataset[String] = [value: string]
scala> d2.take(1)
res0: Array[String] = Array(just some wordshi)
scala>

Evalutate complex type with quasiquote scala, unlifting

I need to compile function and then evaluate it with different parameters of type List[Map[String, AnyRef]].
I have the following code that does not compile with such the type but compiles with simple type like List[Int].
I found that there are just certain implementations of Liftable in scala.reflect.api.StandardLiftables.StandardLiftableInstances
import scala.reflect.runtime.universe
import scala.reflect.runtime.universe._
import scala.tools.reflect.ToolBox
val tb = universe.runtimeMirror(getClass.getClassLoader).mkToolBox()
val functionWrapper =
"""
object FunctionWrapper {
def makeBody(messages: List[Map[String, AnyRef]]) = Map.empty
}""".stripMargin
val functionSymbol =
tb.define(tb.parse(functionWrapper).asInstanceOf[tb.u.ImplDef])
val list: List[Map[String, AnyRef]] = List(Map("1" -> "2"))
tb.eval(q"$functionSymbol.function($list)")
Getting compilation error for this, how can I make it work?
Error:(22, 38) Can't unquote List[Map[String,AnyRef]], consider using
... or providing an implicit instance of
Liftable[List[Map[String,AnyRef]]]
tb.eval(q"$functionSymbol.function($list)")
^
The problem comes not from complicated type but from the attempt to use AnyRef. When you unquote some literal, it means you want the infrastructure to be able to create a valid syntax tree to create an object that would exactly match the object you pass. Unfortunately this is obviously not possible for all objects. For example, assume that you've passed a reference to Thread.currentThread() as a part of the Map. How it could possible work? Compiler is just not able to recreate such a complicated object (not to mention making it the current thread). So you have two obvious alternatives:
Make you argument also a Tree i.e. something like this
def testTree() = {
val tb = universe.runtimeMirror(getClass.getClassLoader).mkToolBox()
val functionWrapper =
"""
| object FunctionWrapper {
|
| def makeBody(messages: List[Map[String, AnyRef]]) = Map.empty
|
| }
""".stripMargin
val functionSymbol =
tb.define(tb.parse(functionWrapper).asInstanceOf[tb.u.ImplDef])
//val list: List[Map[String, AnyRef]] = List(Map("1" -> "2"))
val list = q"""List(Map("1" -> "2"))"""
val res = tb.eval(q"$functionSymbol.makeBody($list)")
println(s"testTree = $res")
}
The obvious drawback of this approach is that you loose type safety at compile time and might need to provide a lot of context for the tree to work
Another approach is to not try to pass anything containing AnyRef to the compiler-infrastructure. It means you create some function-like Wrapper:
package so {
trait Wrapper {
def call(args: List[Map[String, AnyRef]]): Map[String, AnyRef]
}
}
and then make your generated code return a Wrapper instead of directly executing the logic and call the Wrapper from the usual Scala code rather than inside compiled code. Something like this:
def testWrapper() = {
val tb = universe.runtimeMirror(getClass.getClassLoader).mkToolBox()
val functionWrapper =
"""
|object FunctionWrapper {
| import scala.collection._
| import so.Wrapper /* <- here probably different package :) */
|
| def createWrapper(): Wrapper = new Wrapper {
| override def call(args: List[Map[String, AnyRef]]): Map[String, AnyRef] = Map.empty
| }
|}
| """.stripMargin
val functionSymbol = tb.define(tb.parse(functionWrapper).asInstanceOf[tb.u.ImplDef])
val list: List[Map[String, AnyRef]] = List(Map("1" -> "2"))
val tree: tb.u.Tree = q"$functionSymbol.createWrapper()"
val wrapper = tb.eval(tree).asInstanceOf[Wrapper]
val res = wrapper.call(list)
println(s"testWrapper = $res")
}
P.S. I'm not sure what are you doing but beware of performance issues. Scala is a hard language to compile and thus it might easily take more time to compile your custom code than to run it. If performance becomes an issue you might need to use some other methods such as full-blown macro-code-generation or at least caching of the compiled code.

Using a scala.Symbol in a Spark Dataset [duplicate]

According to Introducing Spark Datasets:
As we look forward to Spark 2.0, we plan some exciting improvements to Datasets, specifically:
...
Custom encoders – while we currently autogenerate encoders for a wide variety of types, we’d like to open up an API for custom objects.
and attempts to store custom type in a Dataset lead to following error like:
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases
or:
Java.lang.UnsupportedOperationException: No Encoder found for ....
Are there any existing workarounds?
Note this question exists only as an entry point for a Community Wiki answer. Feel free to update / improve both question and answer.
Update
This answer is still valid and informative, although things are now better since 2.2/2.3, which adds built-in encoder support for Set, Seq, Map, Date, Timestamp, and BigDecimal. If you stick to making types with only case classes and the usual Scala types, you should be fine with just the implicit in SQLImplicits.
Unfortunately, virtually nothing has been added to help with this. Searching for #since 2.0.0 in Encoders.scala or SQLImplicits.scala finds things mostly to do with primitive types (and some tweaking of case classes). So, first thing to say: there currently is no real good support for custom class encoders. With that out of the way, what follows is some tricks which do as good a job as we can ever hope to, given what we currently have at our disposal. As an upfront disclaimer: this won't work perfectly and I'll do my best to make all limitations clear and upfront.
What exactly is the problem
When you want to make a dataset, Spark "requires an encoder (to convert a JVM object of type T to and from the internal Spark SQL representation) that is generally created automatically through implicits from a SparkSession, or can be created explicitly by calling static methods on Encoders" (taken from the docs on createDataset). An encoder will take the form Encoder[T] where T is the type you are encoding. The first suggestion is to add import spark.implicits._ (which gives you these implicit encoders) and the second suggestion is to explicitly pass in the implicit encoder using this set of encoder related functions.
There is no encoder available for regular classes, so
import spark.implicits._
class MyObj(val i: Int)
// ...
val d = spark.createDataset(Seq(new MyObj(1),new MyObj(2),new MyObj(3)))
will give you the following implicit related compile time error:
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases
However, if you wrap whatever type you just used to get the above error in some class that extends Product, the error confusingly gets delayed to runtime, so
import spark.implicits._
case class Wrap[T](unwrap: T)
class MyObj(val i: Int)
// ...
val d = spark.createDataset(Seq(Wrap(new MyObj(1)),Wrap(new MyObj(2)),Wrap(new MyObj(3))))
Compiles just fine, but fails at runtime with
java.lang.UnsupportedOperationException: No Encoder found for MyObj
The reason for this is that the encoders Spark creates with the implicits are actually only made at runtime (via scala relfection). In this case, all Spark checks at compile time is that the outermost class extends Product (which all case classes do), and only realizes at runtime that it still doesn't know what to do with MyObj (the same problem occurs if I tried to make a Dataset[(Int,MyObj)] - Spark waits until runtime to barf on MyObj). These are central problems that are in dire need of being fixed:
some classes that extend Product compile despite always crashing at runtime and
there is no way of passing in custom encoders for nested types (I have no way of feeding Spark an encoder for just MyObj such that it then knows how to encode Wrap[MyObj] or (Int,MyObj)).
Just use kryo
The solution everyone suggests is to use the kryo encoder.
import spark.implicits._
class MyObj(val i: Int)
implicit val myObjEncoder = org.apache.spark.sql.Encoders.kryo[MyObj]
// ...
val d = spark.createDataset(Seq(new MyObj(1),new MyObj(2),new MyObj(3)))
This gets pretty tedious fast though. Especially if your code is manipulating all sorts of datasets, joining, grouping etc. You end up racking up a bunch of extra implicits. So, why not just make an implicit that does this all automatically?
import scala.reflect.ClassTag
implicit def kryoEncoder[A](implicit ct: ClassTag[A]) =
org.apache.spark.sql.Encoders.kryo[A](ct)
And now, it seems like I can do almost anything I want (the example below won't work in the spark-shell where spark.implicits._ is automatically imported)
class MyObj(val i: Int)
val d1 = spark.createDataset(Seq(new MyObj(1),new MyObj(2),new MyObj(3)))
val d2 = d1.map(d => (d.i+1,d)).alias("d2") // mapping works fine and ..
val d3 = d1.map(d => (d.i, d)).alias("d3") // .. deals with the new type
val d4 = d2.joinWith(d3, $"d2._1" === $"d3._1") // Boom!
Or almost. The problem is that using kryo leads to Spark just storing every row in the dataset as a flat binary object. For map, filter, foreach that is enough, but for operations like join, Spark really needs these to be separated into columns. Inspecting the schema for d2 or d3, you see there is just one binary column:
d2.printSchema
// root
// |-- value: binary (nullable = true)
Partial solution for tuples
So, using the magic of implicits in Scala (more in 6.26.3 Overloading Resolution), I can make myself a series of implicits that will do as good a job as possible, at least for tuples, and will work well with existing implicits:
import org.apache.spark.sql.{Encoder,Encoders}
import scala.reflect.ClassTag
import spark.implicits._ // we can still take advantage of all the old implicits
implicit def single[A](implicit c: ClassTag[A]): Encoder[A] = Encoders.kryo[A](c)
implicit def tuple2[A1, A2](
implicit e1: Encoder[A1],
e2: Encoder[A2]
): Encoder[(A1,A2)] = Encoders.tuple[A1,A2](e1, e2)
implicit def tuple3[A1, A2, A3](
implicit e1: Encoder[A1],
e2: Encoder[A2],
e3: Encoder[A3]
): Encoder[(A1,A2,A3)] = Encoders.tuple[A1,A2,A3](e1, e2, e3)
// ... you can keep making these
Then, armed with these implicits, I can make my example above work, albeit with some column renaming
class MyObj(val i: Int)
val d1 = spark.createDataset(Seq(new MyObj(1),new MyObj(2),new MyObj(3)))
val d2 = d1.map(d => (d.i+1,d)).toDF("_1","_2").as[(Int,MyObj)].alias("d2")
val d3 = d1.map(d => (d.i ,d)).toDF("_1","_2").as[(Int,MyObj)].alias("d3")
val d4 = d2.joinWith(d3, $"d2._1" === $"d3._1")
I haven't yet figured out how to get the expected tuple names (_1, _2, ...) by default without renaming them - if someone else wants to play around with this, this is where the name "value" gets introduced and this is where the tuple names are usually added. However, the key point is that that I now have a nice structured schema:
d4.printSchema
// root
// |-- _1: struct (nullable = false)
// | |-- _1: integer (nullable = true)
// | |-- _2: binary (nullable = true)
// |-- _2: struct (nullable = false)
// | |-- _1: integer (nullable = true)
// | |-- _2: binary (nullable = true)
So, in summary, this workaround:
allows us to get separate columns for tuples (so we can join on tuples again, yay!)
we can again just rely on the implicits (so no need to be passing in kryo all over the place)
is almost entirely backwards compatible with import spark.implicits._ (with some renaming involved)
does not let us join on the kyro serialized binary columns, let alone on fields those may have
has the unpleasant side-effect of renaming some of the tuple columns to "value" (if necessary, this can be undone by converting .toDF, specifying new column names, and converting back to a dataset - and the schema names seem to be preserved through joins, where they are most needed).
Partial solution for classes in general
This one is less pleasant and has no good solution. However, now that we have the tuple solution above, I have a hunch the implicit conversion solution from another answer will be a bit less painful too since you can convert your more complex classes to tuples. Then, after creating the dataset, you'd probably rename the columns using the dataframe approach. If all goes well, this is really an improvement since I can now perform joins on the fields of my classes. If I had just used one flat binary kryo serializer that wouldn't have been possible.
Here is an example that does a bit of everything: I have a class MyObj which has fields of types Int, java.util.UUID, and Set[String]. The first takes care of itself. The second, although I could serialize using kryo would be more useful if stored as a String (since UUIDs are usually something I'll want to join against). The third really just belongs in a binary column.
class MyObj(val i: Int, val u: java.util.UUID, val s: Set[String])
// alias for the type to convert to and from
type MyObjEncoded = (Int, String, Set[String])
// implicit conversions
implicit def toEncoded(o: MyObj): MyObjEncoded = (o.i, o.u.toString, o.s)
implicit def fromEncoded(e: MyObjEncoded): MyObj =
new MyObj(e._1, java.util.UUID.fromString(e._2), e._3)
Now, I can create a dataset with a nice schema using this machinery:
val d = spark.createDataset(Seq[MyObjEncoded](
new MyObj(1, java.util.UUID.randomUUID, Set("foo")),
new MyObj(2, java.util.UUID.randomUUID, Set("bar"))
)).toDF("i","u","s").as[MyObjEncoded]
And the schema shows me I columns with the right names and with the first two both things I can join against.
d.printSchema
// root
// |-- i: integer (nullable = false)
// |-- u: string (nullable = true)
// |-- s: binary (nullable = true)
Using generic encoders.
There are two generic encoders available for now kryo and javaSerialization where the latter one is explicitly described as:
extremely inefficient and should only be used as the last resort.
Assuming following class
class Bar(i: Int) {
override def toString = s"bar $i"
def bar = i
}
you can use these encoders by adding implicit encoder:
object BarEncoders {
implicit def barEncoder: org.apache.spark.sql.Encoder[Bar] =
org.apache.spark.sql.Encoders.kryo[Bar]
}
which can be used together as follows:
object Main {
def main(args: Array[String]) {
val sc = new SparkContext("local", "test", new SparkConf())
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
import BarEncoders._
val ds = Seq(new Bar(1)).toDS
ds.show
sc.stop()
}
}
It stores objects as binary column so when converted to DataFrame you get following schema:
root
|-- value: binary (nullable = true)
It is also possible to encode tuples using kryo encoder for specific field:
val longBarEncoder = Encoders.tuple(Encoders.scalaLong, Encoders.kryo[Bar])
spark.createDataset(Seq((1L, new Bar(1))))(longBarEncoder)
// org.apache.spark.sql.Dataset[(Long, Bar)] = [_1: bigint, _2: binary]
Please note that we don't depend on implicit encoders here but pass encoder explicitly so this most likely won't work with toDS method.
Using implicit conversions:
Provide implicit conversions between representation which can be encoded and custom class, for example:
object BarConversions {
implicit def toInt(bar: Bar): Int = bar.bar
implicit def toBar(i: Int): Bar = new Bar(i)
}
object Main {
def main(args: Array[String]) {
val sc = new SparkContext("local", "test", new SparkConf())
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
import BarConversions._
type EncodedBar = Int
val bars: RDD[EncodedBar] = sc.parallelize(Seq(new Bar(1)))
val barsDS = bars.toDS
barsDS.show
barsDS.map(_.bar).show
sc.stop()
}
}
Related questions:
How to create encoder for Option type constructor, e.g. Option[Int]?
You can use UDTRegistration and then Case Classes, Tuples, etc... all work correctly with your User Defined Type!
Say you want to use a custom Enum:
trait CustomEnum { def value:String }
case object Foo extends CustomEnum { val value = "F" }
case object Bar extends CustomEnum { val value = "B" }
object CustomEnum {
def fromString(str:String) = Seq(Foo, Bar).find(_.value == str).get
}
Register it like this:
// First define a UDT class for it:
class CustomEnumUDT extends UserDefinedType[CustomEnum] {
override def sqlType: DataType = org.apache.spark.sql.types.StringType
override def serialize(obj: CustomEnum): Any = org.apache.spark.unsafe.types.UTF8String.fromString(obj.value)
// Note that this will be a UTF8String type
override def deserialize(datum: Any): CustomEnum = CustomEnum.fromString(datum.toString)
override def userClass: Class[CustomEnum] = classOf[CustomEnum]
}
// Then Register the UDT Class!
// NOTE: you have to put this file into the org.apache.spark package!
UDTRegistration.register(classOf[CustomEnum].getName, classOf[CustomEnumUDT].getName)
Then USE IT!
case class UsingCustomEnum(id:Int, en:CustomEnum)
val seq = Seq(
UsingCustomEnum(1, Foo),
UsingCustomEnum(2, Bar),
UsingCustomEnum(3, Foo)
).toDS()
seq.filter(_.en == Foo).show()
println(seq.collect())
Say you want to use a Polymorphic Record:
trait CustomPoly
case class FooPoly(id:Int) extends CustomPoly
case class BarPoly(value:String, secondValue:Long) extends CustomPoly
... and the use it like this:
case class UsingPoly(id:Int, poly:CustomPoly)
Seq(
UsingPoly(1, new FooPoly(1)),
UsingPoly(2, new BarPoly("Blah", 123)),
UsingPoly(3, new FooPoly(1))
).toDS
polySeq.filter(_.poly match {
case FooPoly(value) => value == 1
case _ => false
}).show()
You can write a custom UDT that encodes everything to bytes (I'm using java serialization here but it's probably better to instrument Spark's Kryo context).
First define the UDT class:
class CustomPolyUDT extends UserDefinedType[CustomPoly] {
val kryo = new Kryo()
override def sqlType: DataType = org.apache.spark.sql.types.BinaryType
override def serialize(obj: CustomPoly): Any = {
val bos = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(bos)
oos.writeObject(obj)
bos.toByteArray
}
override def deserialize(datum: Any): CustomPoly = {
val bis = new ByteArrayInputStream(datum.asInstanceOf[Array[Byte]])
val ois = new ObjectInputStream(bis)
val obj = ois.readObject()
obj.asInstanceOf[CustomPoly]
}
override def userClass: Class[CustomPoly] = classOf[CustomPoly]
}
Then register it:
// NOTE: The file you do this in has to be inside of the org.apache.spark package!
UDTRegistration.register(classOf[CustomPoly].getName, classOf[CustomPolyUDT].getName)
Then you can use it!
// As shown above:
case class UsingPoly(id:Int, poly:CustomPoly)
Seq(
UsingPoly(1, new FooPoly(1)),
UsingPoly(2, new BarPoly("Blah", 123)),
UsingPoly(3, new FooPoly(1))
).toDS
polySeq.filter(_.poly match {
case FooPoly(value) => value == 1
case _ => false
}).show()
Encoders work more or less the same in Spark2.0. And Kryo is still the recommended serialization choice.
You can look at following example with spark-shell
scala> import spark.implicits._
import spark.implicits._
scala> import org.apache.spark.sql.Encoders
import org.apache.spark.sql.Encoders
scala> case class NormalPerson(name: String, age: Int) {
| def aboutMe = s"I am ${name}. I am ${age} years old."
| }
defined class NormalPerson
scala> case class ReversePerson(name: Int, age: String) {
| def aboutMe = s"I am ${name}. I am ${age} years old."
| }
defined class ReversePerson
scala> val normalPersons = Seq(
| NormalPerson("Superman", 25),
| NormalPerson("Spiderman", 17),
| NormalPerson("Ironman", 29)
| )
normalPersons: Seq[NormalPerson] = List(NormalPerson(Superman,25), NormalPerson(Spiderman,17), NormalPerson(Ironman,29))
scala> val ds1 = sc.parallelize(normalPersons).toDS
ds1: org.apache.spark.sql.Dataset[NormalPerson] = [name: string, age: int]
scala> val ds2 = ds1.map(np => ReversePerson(np.age, np.name))
ds2: org.apache.spark.sql.Dataset[ReversePerson] = [name: int, age: string]
scala> ds1.show()
+---------+---+
| name|age|
+---------+---+
| Superman| 25|
|Spiderman| 17|
| Ironman| 29|
+---------+---+
scala> ds2.show()
+----+---------+
|name| age|
+----+---------+
| 25| Superman|
| 17|Spiderman|
| 29| Ironman|
+----+---------+
scala> ds1.foreach(p => println(p.aboutMe))
I am Ironman. I am 29 years old.
I am Superman. I am 25 years old.
I am Spiderman. I am 17 years old.
scala> val ds2 = ds1.map(np => ReversePerson(np.age, np.name))
ds2: org.apache.spark.sql.Dataset[ReversePerson] = [name: int, age: string]
scala> ds2.foreach(p => println(p.aboutMe))
I am 17. I am Spiderman years old.
I am 25. I am Superman years old.
I am 29. I am Ironman years old.
Till now] there were no appropriate encoders in present scope so our persons were not encoded as binary values. But that will change once we provide some implicit encoders using Kryo serialization.
// Provide Encoders
scala> implicit val normalPersonKryoEncoder = Encoders.kryo[NormalPerson]
normalPersonKryoEncoder: org.apache.spark.sql.Encoder[NormalPerson] = class[value[0]: binary]
scala> implicit val reversePersonKryoEncoder = Encoders.kryo[ReversePerson]
reversePersonKryoEncoder: org.apache.spark.sql.Encoder[ReversePerson] = class[value[0]: binary]
// Ecoders will be used since they are now present in Scope
scala> val ds3 = sc.parallelize(normalPersons).toDS
ds3: org.apache.spark.sql.Dataset[NormalPerson] = [value: binary]
scala> val ds4 = ds3.map(np => ReversePerson(np.age, np.name))
ds4: org.apache.spark.sql.Dataset[ReversePerson] = [value: binary]
// now all our persons show up as binary values
scala> ds3.show()
+--------------------+
| value|
+--------------------+
|[01 00 24 6C 69 6...|
|[01 00 24 6C 69 6...|
|[01 00 24 6C 69 6...|
+--------------------+
scala> ds4.show()
+--------------------+
| value|
+--------------------+
|[01 00 24 6C 69 6...|
|[01 00 24 6C 69 6...|
|[01 00 24 6C 69 6...|
+--------------------+
// Our instances still work as expected
scala> ds3.foreach(p => println(p.aboutMe))
I am Ironman. I am 29 years old.
I am Spiderman. I am 17 years old.
I am Superman. I am 25 years old.
scala> ds4.foreach(p => println(p.aboutMe))
I am 25. I am Superman years old.
I am 29. I am Ironman years old.
I am 17. I am Spiderman years old.
In case of Java Bean class, this can be useful
import spark.sqlContext.implicits._
import org.apache.spark.sql.Encoders
implicit val encoder = Encoders.bean[MyClasss](classOf[MyClass])
Now you can simply read the dataFrame as custom DataFrame
dataFrame.as[MyClass]
This will create a custom class encoder and not a binary one.
My examples will be in Java, but I don't imagine it to be difficult adapting to Scala.
I have been quite successful converting RDD<Fruit> to Dataset<Fruit> using spark.createDataset and Encoders.bean as long as Fruit is a simple Java Bean.
Step 1: Create the simple Java Bean.
public class Fruit implements Serializable {
private String name = "default-fruit";
private String color = "default-color";
// AllArgsConstructor
public Fruit(String name, String color) {
this.name = name;
this.color = color;
}
// NoArgsConstructor
public Fruit() {
this("default-fruit", "default-color");
}
// ...create getters and setters for above fields
// you figure it out
}
I'd stick to classes with primitive types and String as fields before the DataBricks folks beef up their Encoders. If you have a class with nested object, create another simple Java Bean with all of its fields flattened, so you can use RDD transformations to map the complex type to the simpler one. Sure it's a little extra work, but I imagine it'll help a lot on performance working with a flat schema.
Step 2: Get your Dataset from the RDD
SparkSession spark = SparkSession.builder().getOrCreate();
JavaSparkContext jsc = new JavaSparkContext();
List<Fruit> fruitList = ImmutableList.of(
new Fruit("apple", "red"),
new Fruit("orange", "orange"),
new Fruit("grape", "purple"));
JavaRDD<Fruit> fruitJavaRDD = jsc.parallelize(fruitList);
RDD<Fruit> fruitRDD = fruitJavaRDD.rdd();
Encoder<Fruit> fruitBean = Encoders.bean(Fruit.class);
Dataset<Fruit> fruitDataset = spark.createDataset(rdd, bean);
And voila! Lather, rinse, repeat.
For those who may in my situation I put my answer here, too.
To be specific,
I was reading 'Set typed data' from SQLContext. So original data format is DataFrame.
val sample = spark.sqlContext.sql("select 1 as a, collect_set(1) as b limit 1")
sample.show()
+---+---+
| a| b|
+---+---+
| 1|[1]|
+---+---+
Then convert it into RDD using rdd.map() with mutable.WrappedArray type.
sample
.rdd.map(r =>
(r.getInt(0), r.getAs[mutable.WrappedArray[Int]](1).toSet))
.collect()
.foreach(println)
Result:
(1,Set(1))
In addition to the suggestions already given, another option I recently discovered is that you can declare your custom class including the trait org.apache.spark.sql.catalyst.DefinedByConstructorParams.
This works if the class has a constructor that uses types the ExpressionEncoder can understand, i.e. primitive values and standard collections. It can come in handy when you're not able to declare the class as a case class, but don't want to use Kryo to encode it every time it's included in a Dataset.
For example, I wanted to declare a case class that included a Breeze vector. The only encoder that would be able to handle that would normally be Kryo. But if I declared a subclass that extended the Breeze DenseVector and DefinedByConstructorParams, the ExpressionEncoder understood that it could be serialized as an array of Doubles.
Here's how I declared it:
class SerializableDenseVector(values: Array[Double]) extends breeze.linalg.DenseVector[Double](values) with DefinedByConstructorParams
implicit def BreezeVectorToSerializable(bv: breeze.linalg.DenseVector[Double]): SerializableDenseVector = bv.asInstanceOf[SerializableDenseVector]
Now I can use SerializableDenseVector in a Dataset (directly, or as part of a Product) using a simple ExpressionEncoder and no Kryo. It works just like a Breeze DenseVector but serializes as an Array[Double].
#Alec's answer is great! Just to add a comment in this part of his/her answer:
import spark.implicits._
case class Wrap[T](unwrap: T)
class MyObj(val i: Int)
// ...
val d = spark.createDataset(Seq(Wrap(new MyObj(1)),Wrap(new MyObj(2)),Wrap(new MyObj(3))))
#Alec mentions:
there is no way of passing in custom encoders for nested types (I have no way of feeding Spark an encoder for just MyObj such that it then knows how to encode Wrap[MyObj] or (Int,MyObj)).
It seems so, because if I add an encoder for MyObj:
implicit val myEncoder = org.apache.spark.sql.Encoders.kryo[MyObj]
, it still fails:
java.lang.UnsupportedOperationException: No Encoder found for MyObj
- field (class: "MyObj", name: "unwrap")
- root class: "Wrap"
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:643)
But notice the important error message:
root class: "Wrap"
It actually gives a hint that encoding MyObj isn't enough, and you have to encode the entire chain including Wrap[T].
So if I do this, it solves the problem:
implicit val myWrapperEncoder = org.apache.spark.sql.Encoders.kryo[Wrap[MyObj]]
Hence, the comment of #Alec is NOT that true:
I have no way of feeding Spark an encoder for just MyObj such that it then knows how to encode Wrap[MyObj] or (Int,MyObj)
We still have a way to feed Spark the encoder for MyObj such that it then knows how to encode Wrap[MyObj] or (Int,MyObj).

How to create a custom Transformer from a UDF?

I was trying to create and save a Pipeline with custom stages. I need to add a column to my DataFrame by using a UDF. Therefore, I was wondering if it was possible to convert a UDF or a similar action into a Transformer?
My custom UDF looks like this and I'd like to learn how to do it using an UDF as a custom Transformer.
def getFeatures(n: String) = {
val NUMBER_FEATURES = 4
val name = n.split(" +")(0).toLowerCase
((1 to NUMBER_FEATURES)
.filter(size => size <= name.length)
.map(size => name.substring(name.length - size)))
}
val tokenizeUDF = sqlContext.udf.register("tokenize", (name: String) => getFeatures(name))
It is not a fully featured solution but your can start with something like this:
import org.apache.spark.ml.{UnaryTransformer}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{ArrayType, DataType, StringType}
class NGramTokenizer(override val uid: String)
extends UnaryTransformer[String, Seq[String], NGramTokenizer] {
def this() = this(Identifiable.randomUID("ngramtokenizer"))
override protected def createTransformFunc: String => Seq[String] = {
getFeatures _
}
override protected def validateInputType(inputType: DataType): Unit = {
require(inputType == StringType)
}
override protected def outputDataType: DataType = {
new ArrayType(StringType, true)
}
}
Quick check:
val df = Seq((1L, "abcdef"), (2L, "foobar")).toDF("k", "v")
val transformer = new NGramTokenizer().setInputCol("v").setOutputCol("vs")
transformer.transform(df).show
// +---+------+------------------+
// | k| v| vs|
// +---+------+------------------+
// | 1|abcdef|[f, ef, def, cdef]|
// | 2|foobar|[r, ar, bar, obar]|
// +---+------+------------------+
You can even try to generalize it to something like this:
import org.apache.spark.sql.catalyst.ScalaReflection.schemaFor
import scala.reflect.runtime.universe._
class UnaryUDFTransformer[T : TypeTag, U : TypeTag](
override val uid: String,
f: T => U
) extends UnaryTransformer[T, U, UnaryUDFTransformer[T, U]] {
override protected def createTransformFunc: T => U = f
override protected def validateInputType(inputType: DataType): Unit =
require(inputType == schemaFor[T].dataType)
override protected def outputDataType: DataType = schemaFor[U].dataType
}
val transformer = new UnaryUDFTransformer("featurize", getFeatures)
.setInputCol("v")
.setOutputCol("vs")
If you want to use UDF not the wrapped function you'll have to extend Transformer directly and override transform method. Unfortunately majority of the useful classes is private so it can be rather tricky.
Alternatively you can register UDF:
spark.udf.register("getFeatures", getFeatures _)
and use SQLTransformer
import org.apache.spark.ml.feature.SQLTransformer
val transformer = new SQLTransformer()
.setStatement("SELECT *, getFeatures(v) AS vs FROM __THIS__")
transformer.transform(df).show
// +---+------+------------------+
// | k| v| vs|
// +---+------+------------------+
// | 1|abcdef|[f, ef, def, cdef]|
// | 2|foobar|[r, ar, bar, obar]|
// +---+------+------------------+
I initially tried to extend the Transformer and UnaryTransformer abstracts but encountered trouble with my application being unable to reach DefaultParamsWriteable.As an example that may be relevant to your problem, I created a simple term normalizer as a UDF following along from this example. My goal is to match terms against patterns and sets to replace them with generic terms. For example:
"\b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,}\b".r -> "emailaddr"
This is the class
import scala.util.matching.Regex
class TermNormalizer(normMap: Map[Any, String]) {
val normalizationMap = normMap
def normalizeTerms(terms: Seq[String]): Seq[String] = {
var termsUpdated = terms
for ((term, idx) <- termsUpdated.view.zipWithIndex) {
for (normalizer <- normalizationMap.keys: Iterable[Any]) {
normalizer match {
case (regex: Regex) =>
if (!regex.findFirstIn(term).isEmpty) termsUpdated =
termsUpdated.updated(idx, normalizationMap(regex))
case (set: Set[String]) =>
if (set.contains(term)) termsUpdated =
termsUpdated.updated(idx, normalizationMap(set))
}
}
}
termsUpdated
}
}
I use it like this:
val testMap: Map[Any, String] = Map("hadoop".r -> "elephant",
"spark".r -> "sparky", "cool".r -> "neat",
Set("123", "456") -> "set1",
Set("789", "10") -> "set2")
val testTermNormalizer = new TermNormalizer(testMap)
val termNormalizerUdf = udf(testTermNormalizer.normalizeTerms(_: Seq[String]))
val trainingTest = sqlContext.createDataFrame(Seq(
(0L, "spark is cool 123", 1.0),
(1L, "adsjkfadfk akjdsfhad 456", 0.0),
(2L, "spark rocks my socks 789 10", 1.0),
(3L, "hadoop is cool 10", 0.0)
)).toDF("id", "text", "label")
val testTokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("words")
val tokenizedTrainingTest = testTokenizer.transform(trainingTest)
println(tokenizedTrainingTest
.select($"id", $"text", $"words", termNormalizerUdf($"words"), $"label").show(false))
Now that I read the question a little closer, it sounds like you're asking how to avoid doing it this way lol. Anyways, I'll still post it in case someone in the future is looking for an easy way to apply a transformer-ish like functionality
If you wish to make the transformer writable as well, then you can re-implement the traits such as HasInputCol in the sharedParams library in a public package of your choice and then use them with DefaultParamsWritable trait to make the transformer persistable.
This way you can also avoid having to place part of your code inside the spark core ml packages but you kind of maintain a parallel set of params in your own package. This isnt really a problem given they hardly ever change.
But do track the bug in their JIRA board here that asks for some of the common sharedParams to be made public instead of private to the ml so that people can directly use those from outside classes.