Converting Scala case class to PySpark schema

Given a simple Scala case class like this:
package com.foo.storage.schema
case class Person(name: String, age: Int)
it's possible to create a Spark schema from a case class as follows:
import org.apache.spark.sql._
import com.foo.storage.schema.Person
val schema = Encoders.product[Person].schema
I wonder if it's possible to access the schema of a case class from Python/PySpark. I would hope to do something like this (Python):
jvm = sc._jvm
py4j_class = jvm.com.foo.storage.schema.Person
jvm.org.apache.spark.sql.Encoders.product(py4j_class)
This throws an error com.foo.storage.schema.Person._get_object_id does not exist in the JVM. Encoders.product is a generic method in Scala, and I'm not entirely sure how to specify the type parameter using Py4J. Is there a way to use the case class to create a PySpark schema?

I've found there's no clean or easy way to do this using generics, not even as a pure Scala function. What I ended up doing is making a companion object for the case class that can fetch the schema.
Solution
package com.foo.storage.schema

import org.apache.spark.sql.Encoders

case class Person(name: String, age: Int)

object Person {
  def getSchema = Encoders.product[Person].schema
}
This function can be called from Py4J, but will return a JavaObject. It can be converted with a helper function like this:
from pyspark.sql.types import StructType
import json
def java_schema_to_python(j_schema):
    json_schema = json.loads(j_schema.json())
    return StructType.fromJson(json_schema)
Finally, we can extract our schema:
j_schema = jvm.com.foo.storage.schema.Person.getSchema()
java_schema_to_python(j_schema)
Alternative solution
I found there is one more way to do this, but I like the first one better. You can write a generic Scala function that infers the type of its argument and uses that to derive the schema:
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.StructType
import scala.reflect.runtime.universe.TypeTag

object SchemaConverter {
  def getSchemaFromType[T <: Product : TypeTag](obj: T): StructType = {
    Encoders.product[T].schema
  }
}
Which can be called like this:
val schema = SchemaConverter.getSchemaFromType(Person("Joe", 42))
I didn't like this method as much, since it requires you to create a dummy instance of the case class. I haven't tested it, but I think the function above could be called via Py4J too.

Related

How to create an Encoder for Scala collection (to implement custom Aggregator)?

Spark 2.3.0 with Scala 2.11. I'm implementing a custom Aggregator according to the docs here. The aggregator requires three types: input, buffer, and output.
My aggregator has to act upon all previous rows in the window so I declared it like this:
case class Foo(...)

object MyAggregator extends Aggregator[Foo, ListBuffer[Foo], Boolean] {
  // other override methods
  override def bufferEncoder: Encoder[ListBuffer[Foo]] = ???
}
One of the override methods is supposed to return the encoder for the buffer type, which in this case is a ListBuffer. I can't find any suitable encoder in org.apache.spark.sql.Encoders, nor any other way to encode this, so I don't know what to return here.
I thought of creating a new case class which has a single property of type ListBuffer[Foo] and using that as my buffer class, and then using Encoders.product on that, but I am not sure if that is necessary or if there is something else I am missing. Thanks for any tips.
You should just let Spark SQL do its work and find the proper encoder using ExpressionEncoder as follows:
scala> spark.version
res0: String = 2.3.0
case class Mod(id: Long)
import org.apache.spark.sql.Encoder
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
scala> val enc: Encoder[ListBuffer[Mod]] = ExpressionEncoder()
enc: org.apache.spark.sql.Encoder[scala.collection.mutable.ListBuffer[Mod]] = class[value[0]: array<struct<id:bigint>>]
I cannot see anything in org.apache.spark.sql.Encoders that could be used to directly encode a ListBuffer, or for that matter even a List.
One option seems to be going with putting it in a case class, as you suggested:
import org.apache.spark.sql.Encoders
case class Foo(field: String)
case class Wrapper(lb: scala.collection.mutable.ListBuffer[Foo])
Encoders.product[Wrapper]
Another option could be to use kryo:
Encoders.kryo[scala.collection.mutable.ListBuffer[Foo]]
Or finally you could look at ExpressionEncoders, which extend Encoder:
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
ExpressionEncoder[scala.collection.mutable.ListBuffer[Foo]]
This is the best solution, as it keeps everything transparent to catalyst and therefore allows it to do all of its wonderful optimisations.
One thing I noticed whilst having a play:
ExpressionEncoder[scala.collection.mutable.ListBuffer[Foo]].schema == ExpressionEncoder[List[Foo]].schema
I haven't tested any of the above whilst executing aggregations, so there may be runtime issues. Hope this is helpful.
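For completeness, here is a rough sketch of how the ExpressionEncoder approach could be wired into an Aggregator like the one in the question; the Foo field and the aggregation logic are placeholders invented for illustration, not part of either answer:
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator

// Placeholder case class; the real one would carry whatever fields the window logic needs.
case class Foo(id: Long)

object MyAggregator extends Aggregator[Foo, ListBuffer[Foo], Boolean] {
  def zero: ListBuffer[Foo] = ListBuffer.empty[Foo]
  def reduce(buffer: ListBuffer[Foo], row: Foo): ListBuffer[Foo] = buffer += row
  def merge(b1: ListBuffer[Foo], b2: ListBuffer[Foo]): ListBuffer[Foo] = b1 ++= b2
  def finish(buffer: ListBuffer[Foo]): Boolean = buffer.nonEmpty // placeholder logic
  // Let Spark SQL derive the buffer encoder via ExpressionEncoder, as in the first answer.
  def bufferEncoder: Encoder[ListBuffer[Foo]] = ExpressionEncoder()
  def outputEncoder: Encoder[Boolean] = Encoders.scalaBoolean
}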

Scala Type Classes Understanding Interface Syntax

I was reading about Cats and encountered the following code snippet, which is about serializing objects to JSON.
It starts with a trait like this:
trait JsonWriter[A] {
  def write(value: A): Json
}
After this, a domain case class and some type class instances for it are defined:
final case class Person(name: String, email: String)
object JsonWriterInstances {
  implicit val stringWriter: JsonWriter[String] =
    new JsonWriter[String] {
      def write(value: String): Json =
        JsString(value)
    }

  implicit val personWriter: JsonWriter[Person] =
    new JsonWriter[Person] {
      def write(value: Person): Json =
        JsObject(Map(
          "name" -> JsString(value.name),
          "email" -> JsString(value.email)
        ))
    }

  // etc...
}
So far so good! I can then use this like this:
import JsonWriterInstances._
Json.toJson(Person("Dave", "dave@example.com"))
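The Json.toJson call assumes an interface object that isn't shown above; the book defines one along these lines. A minimal sketch, together with a bare-bones Json ADT so the snippets are self-contained, might look like this:
// Minimal Json ADT assumed by the snippets above.
sealed trait Json
final case class JsString(value: String) extends Json
final case class JsObject(fields: Map[String, Json]) extends Json

// Interface object: resolves a JsonWriter instance implicitly and delegates to it.
object Json {
  def toJson[A](value: A)(implicit w: JsonWriter[A]): Json =
    w.write(value)
}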
Later on I come across something called the interface syntax, which uses extension methods to extend existing types with interface methods like below:
object JsonSyntax {
  implicit class JsonWriterOps[A](value: A) {
    def toJson(implicit w: JsonWriter[A]): Json =
      w.write(value)
  }
}
This then simplifies serializing a Person to:
import JsonWriterInstances._
import JsonSyntax._
Person("Dave", "dave#example.com").toJson
What I don't understand is how the Person is boxed into JsonWriterOps such that I can call toJson directly, as though toJson were defined in the Person case class itself. I like this magic, but I fail to understand this last step about JsonWriterOps. So what is the idea behind this interface syntax, and how does it work? Any help?
This is actually a standard Scala feature: since JsonWriterOps is marked implicit and is in scope, the compiler can apply it at compile time when needed.
Hence scalac will do the following transformations:
Person("Dave", "dave#example.com").toJson
new JsonWriterOps(Person("Dave", "dave#example.com")).toJson
new JsonWriterOps[Person](Person("Dave", "dave#example.com")).toJson
Side note:
It's much more efficient to define implicit classes as value classes, like this:
implicit class JsonWriterOps[A](val value: A) extends AnyVal
This makes the compiler also optimize away the new object construction, if possible, compiling the whole implicit conversion + method call to a simple function call.
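Putting that together, here is a sketch of the syntax object with the value-class refinement applied (the same behaviour as above, just avoiding the wrapper allocation where possible):
object JsonSyntax {
  // Value class: the `val` parameter and AnyVal let the compiler avoid allocating the wrapper.
  implicit class JsonWriterOps[A](val value: A) extends AnyVal {
    def toJson(implicit w: JsonWriter[A]): Json =
      w.write(value)
  }
}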

scala-cass generic read from cassandra table as case class

I am attempting to use scala-cass in order to read from Cassandra and convert the result set to a case class using resultSet.as[CaseClass]. This works great when running the following.
import com.weather.scalacass.syntax._
case class TestTable(id: String, data1: Int, data2: Long)
val resultSet = session.execute(s"select * from test.testTable limit 10")
resultSet.one.as[TestTable]
Now I am attempting to make this more generic and I am unable to find the proper type constraint for the generic class.
import com.weather.scalacass.syntax._
case class TestTable(id: String, data1: Int, data2: Long)
abstract class GenericReader[T] {
  val table: String
  val keyspace: String

  def getRows(session: Session): T = {
    val resultSet = session.execute(s"select * from $keyspace.$table limit 10")
    resultSet.one.as[T]
  }
}
I implement this class with the desired case class and attempt to call getRows on the created Object.
object TestTable extends GenericReader[TestTable] {
  val keyspace = "test"
  val table = "TestTable"
}
TestTable.getRows(session)
This fails to compile with the error could not find implicit value for parameter ccd: com.weather.scalacass.CCCassFormatDecoder[T].
I am trying to add a type constraint to GenericReader in order to ensure the implicit conversion will work. However, I am unable to find the proper type. I am attempting to read through scala-cass in order to find the proper constraint but I have had no luck so far.
I would also be happy to use any other library that can achieve this.
Looks like as[T] requires an implicit value that you don't have in scope, so you'll need to require that implicit parameter in the getRows method as well.
def getRows(session: Session)(implicit cfd: CCCassFormatDecoder[T]): T
You could express this as a type constraint (what you were looking for in the original question) using context bounds:
abstract class GenericReader[T: CCCassFormatDecoder]
Rather than try to bound your generic T type, it might be easier to just pass through the missing implicit parameter:
abstract class GenericReader[T](implicit ccd: CCCassFormatDecoder[T]) {
  val table: String
  val keyspace: String

  def getRows(session: Session): T = {
    val resultSet = session.execute(s"select * from $keyspace.$table limit 10")
    resultSet.one.as[T]
  }
}
Finding a concrete value for that implicit can then be deferred to when you narrow that T to a specific class (like object TestTable extends GenericReader[TestTable]).
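Putting the pieces together, a minimal sketch of the context-bound version might look like this; the DataStax Session import and the assumption that scala-cass can derive a decoder for the case class automatically are mine:
import com.datastax.driver.core.Session
import com.weather.scalacass.CCCassFormatDecoder
import com.weather.scalacass.syntax._

// Context bound: a CCCassFormatDecoder[T] must be in scope wherever a concrete subclass is defined.
abstract class GenericReader[T: CCCassFormatDecoder] {
  val table: String
  val keyspace: String

  def getRows(session: Session): T = {
    val resultSet = session.execute(s"select * from $keyspace.$table limit 10")
    resultSet.one.as[T]
  }
}

case class TestTable(id: String, data1: Int, data2: Long)

object TestTable extends GenericReader[TestTable] {
  val keyspace = "test"
  val table = "TestTable"
}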

How to implement a trait with a generic case class that creates a dataset in Scala

I want to create a Scala trait that should be implemented with a case class T. The trait simply loads data and transforms it into a Spark Dataset of type T. I get the error that no encoder can be found for the type stored in the Dataset, which I think is because Scala does not know that T should be a case class. How can I tell the compiler that? I've seen somewhere that I should mention Product, but there is no such class defined. Feel free to suggest other ways to do this!
I have the following code but it is not compiling with the error: 42: error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._
[INFO] .as[T]
I'm using Spark 1.6.1
Code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Dataset, SQLContext}
/**
 * A trait that moves data on Hadoop with Spark based on the location and the granularity of the data.
 */
trait Agent[T] {

  // Location of the data on Hadoop (referenced by load below).
  val location: String

  /**
   * Load a DataFrame from the location and convert it into a Dataset
   * @return Dataset[T]
   */
  protected def load(): Dataset[T] = {
    // Read in the data
    SparkContextKeeper.sqlContext.read
      .format("com.databricks.spark.csv")
      .load("/myfolder/" + location + "/2016/10/01/")
      .as[T]
  }
}
Your code is missing three things:
Indeed, you must let the compiler know that T is a subclass of Product (the supertype of all Scala case classes and tuples)
The compiler also requires the TypeTag and ClassTag of the actual case class. These are used implicitly by Spark to overcome type erasure
An import of sqlContext.implicits._
Unfortunately, you can't add type parameters with context bounds in a trait, so the simplest workaround would be to use an abstract class instead:
import scala.reflect.runtime.universe.TypeTag
import scala.reflect.ClassTag
abstract class Agent[T <: Product : ClassTag : TypeTag] {
protected def load(): Dataset[T] = {
val sqlContext: SQLContext = SparkContextKeeper.sqlContext
import sqlContext.implicits._
sqlContext.read.// same...
}
}
Obviously, this isn't equivalent to using a trait, and might suggest that this design isn't the best fit for the job. Another alternative is placing load in an object and moving the type parameter to the method:
object Agent {
  protected def load[T <: Product : ClassTag : TypeTag](): Dataset[T] = {
    // same...
  }
}
Which one is preferable is mostly up to where and how you're going to call load and what you're planning to do with the result.
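As a rough sketch, the object-based variant might be filled in like this. Passing location as a parameter and dropping the protected modifier (so it can be called from outside) are my own choices; SparkContextKeeper and the CSV path come from the question's code:
import scala.reflect.ClassTag
import scala.reflect.runtime.universe.TypeTag
import org.apache.spark.sql.{Dataset, SQLContext}

object Agent {
  def load[T <: Product : ClassTag : TypeTag](location: String): Dataset[T] = {
    val sqlContext: SQLContext = SparkContextKeeper.sqlContext
    import sqlContext.implicits._
    sqlContext.read
      .format("com.databricks.spark.csv")
      .load("/myfolder/" + location + "/2016/10/01/")
      .as[T]
  }
}

// Hypothetical call site, assuming a Person case class exists:
// val people = Agent.load[Person]("people")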
You need to take two actions:
Add import sparkSession.implicits._ to your imports
Declare your trait as trait Agent[T <: Product]

How to map a PostgreSQL custom enum column with Slick 2.0.1?

I just can't figure it out. What I am using right now is:
abstract class DBEnumString extends Enumeration {
  implicit val enumMapper = MappedJdbcType.base[Value, String](
    _.toString(),
    s => this.withName(s)
  )
}
And then:
object SomeEnum extends DBEnumString {
  type T = Value
  val A1 = Value("A1")
  val A2 = Value("A2")
}
The problem is, during insert/update JDBC driver for PostgreSQL complains about parameter type being "character varying" when column type is "some_enum", which is reasonable as I am converting SomeEnum to String.
How do I tell Slick to treat String as DB-defined "enum_type"? Or how do I define some other Scala type that will map to "enum_type"?
I had similar confusion when trying to get my PostgreSQL enums to work with Slick. slick-pg allows you to use Scala enums with your database's enums, and its test suite shows how.
Below is an example.
Say we have this enumerated type in our database.
CREATE TYPE Dog AS ENUM ('Poodle', 'Labrador');
We want to be able to map these to Scala enums, so we can use them happily with Slick. We can do this with slick-pg, an extension for Slick.
First off, we make a Scala version of the above enum.
object Dogs extends Enumeration {
  type Dog = Value
  val Poodle, Labrador = Value
}
To get the extra functionality from slick-pg we extend the normal PostgresDriver and say we want to map our Scala enum to the PostgreSQL one (remember to change the slick driver in application.conf to the one you've created).
object MyPostgresDriver extends PostgresDriver with PgEnumSupport {
  override val api = new API with MyEnumImplicits {}

  trait MyEnumImplicits {
    implicit val dogTypeMapper = createEnumJdbcType("Dog", Dogs)
    implicit val dogListTypeMapper = createEnumListJdbcType("Dog", Dogs)
    implicit val dogColumnExtensionMethodsBuilder = createEnumColumnExtensionMethodsBuilder(Dogs)
    implicit val dogOptionColumnExtensionMethodsBuilder = createEnumOptionColumnExtensionMethodsBuilder(Dogs)
  }
}
Now when you want to make a new model case class, simply use the corresponding Scala enum.
case class User(favouriteDog: Dog)
And when you do the whole DAO table shenanigans, again you can just use it.
class Users(tag: Tag) extends Table[User](tag, "User") {
  def favouriteDog = column[Dog]("favouriteDog")
  def * = favouriteDog <> (User.apply, User.unapply)
}
Obviously you need the Scala Dog enum in scope wherever you use it.
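To round this off, here is a hypothetical query sketch; the users table query and the db handle are placeholders, and the Slick 3-style api import follows the driver definition above:
import MyPostgresDriver.api._
import Dogs._

// Placeholder table query; a real application would also need a Database instance to run the action.
val users = TableQuery[Users]

// Filtering on the enum column works because the implicits above provide a column type for Dog.
val poodleFans = users.filter(_.favouriteDog === Poodle).result
// db.run(poodleFans)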
Due to a bug in Slick, you currently can't dynamically link to a custom Slick driver in application.conf (it should work). This means you either need to run Play Framework with start and forgo dynamic recompiling, or you can create a standalone sbt project containing just the custom Slick driver and depend on it locally.