Spark/Scala: create an empty Dataset using generics in a trait

I have a trait that takes a type parameter, and one of its methods needs to be able to create an empty typed Dataset.
trait MyTrait[T] {
  val sparkSession: SparkSession
  val spark = sparkSession.session
  val sparkContext = spark.sparkContext

  def createEmptyDataset(): Dataset[T] = {
    import spark.implicits._ // to access .toDS() function
    // DOESN'T WORK.
    val emptyRDD = sparkContext.parallelize(Seq[T]())
    val accumulator = emptyRDD.toDS()
    ...
  }
}
So far I have not gotten it to work. The compiler complains that there is no ClassTag available for T, and that value toDS is not a member of org.apache.spark.rdd.RDD[T].
Any help would be appreciated. Thanks!

You have to provide both ClassTag[T] and Encoder[T] in the same scope. For example:
import org.apache.spark.sql.{SparkSession, Dataset, Encoder}
import scala.reflect.ClassTag

trait MyTrait[T] {
  val ct: ClassTag[T]
  val enc: Encoder[T]

  val sparkSession: SparkSession
  // lazy, so it is not evaluated before the concrete sparkSession is initialized
  lazy val sparkContext = sparkSession.sparkContext

  def createEmptyDataset(): Dataset[T] = {
    val emptyRDD = sparkContext.emptyRDD[T](ct)
    sparkSession.createDataset(emptyRDD)(enc)
  }
}
with a concrete implementation:
class Foo extends MyTrait[Int] {
  val sparkSession = SparkSession.builder.getOrCreate()
  import sparkSession.implicits._

  val ct = implicitly[ClassTag[Int]]
  val enc = implicitly[Encoder[Int]]
}
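A quick usage sketch, assuming a Spark master is configured for the builder (e.g. via spark.master=local[*]); Demo is just an illustrative name, not part of the question:

object Demo {
  def main(args: Array[String]): Unit = {
    val foo = new Foo                  // works because sparkContext in the trait is lazy
    val ds = foo.createEmptyDataset()
    ds.printSchema()                   // root |-- value: integer (nullable = false)
    println(ds.count())                // 0
    foo.sparkSession.stop()
  }
}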
It is also possible to skip the RDD entirely:
import org.apache.spark.sql.{SparkSession, Dataset, Encoder}

trait MyTrait[T] {
  val enc: Encoder[T]

  val sparkSession: SparkSession

  def createEmptyDataset(): Dataset[T] = {
    sparkSession.emptyDataset[T](enc)
  }
}
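By analogy with Foo above, a concrete implementation of this variant only has to supply the Encoder (Bar is an illustrative name):

class Bar extends MyTrait[Int] {
  val sparkSession = SparkSession.builder.getOrCreate()
  import sparkSession.implicits._
  val enc = implicitly[Encoder[Int]]
}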
See How to declare traits as taking implicit "constructor parameters"?, specifically the answers by Blaisorblade and Alexey Romanov.

Related

Scala - How to extract Json4s with dynamic case class created with ToolBox

I defined a case class dynamically using ToolBox. When I call json4s's extract, I get the following exception:
import org.json4s._
import org.json4s.jackson.JsonMethods
import scala.reflect.runtime._
import scala.tools.reflect.ToolBox

implicit val formats = DefaultFormats

val cm = universe.runtimeMirror(getClass.getClassLoader)
val toolBox = cm.mkToolBox()
val parse =
  toolBox.parse(
    s"""
       | case class Person( name:String, age:String)
       | scala.reflect.classTag[ Person].runtimeClass
    """.stripMargin)
val person = toolBox.compile( parse)().asInstanceOf[Class[_]]

val js = JsonMethods.parse("""{ "name":"Tom","age" : "28"}""")
val jv = js.extract[person.type ] // How do I pass the class type?

Exception in thread "main" org.json4s.MappingException: No constructor for type Class, JObject(List((name,JString(Tom)), (age,JString(28))))
But after creating a dummy instance of the dynamically created class and passing that instance's type, the JSON is parsed. I don't know why. How can I parse it without creating a dummy instance?
import org.json4s._
import org.json4s.jackson.JsonMethods
import scala.reflect.runtime._
import scala.tools.reflect.ToolBox

implicit val formats = DefaultFormats

val cm = universe.runtimeMirror(getClass.getClassLoader)
val toolBox = cm.mkToolBox()
val parse =
  toolBox.parse(
    s"""
       | case class Person( name:String, age:String)
       | scala.reflect.classTag[ Person].runtimeClass
    """.stripMargin)
val person = toolBox.compile( parse)().asInstanceOf[Class[_]]
val dummy = person.getConstructors.head.newInstance( "a", "b") // make a dummy instance

val js = JsonMethods.parse("""{ "name":"Tom","age" : "28"}""")
println( js.extract[ dummy.type ] ) // Result: Person(Tom,28)
x.type is a singleton type, so person.type can't be what you want: it is the singleton type of this specific variable val person: Class[_], not the type of the generated class.
Fortunately, dummy.type is correct because of runtime reflection. This works even for an ordinary case class:
import org.json4s._
import org.json4s.jackson.JsonMethods
implicit val formats = DefaultFormats
case class Person(name: String, age: String)
val js = JsonMethods.parse("""{ "name":"Tom","age" : "28"}""")
val dummy0: AnyRef = Person("a", "b")
val dummy: AnyRef = dummy0
js.extract[dummy.type] // Person(Tom,28)
Actually, after implicit resolution, js.extract[Person] is
js.extract[Person](formats, ManifestFactory.classType(classOf[Person]))
and js.extract[dummy.type] is
js.extract[dummy.type](formats, ManifestFactory.singleType(dummy))
So for a toolbox-generated case class we could try
import org.json4s._
import org.json4s.jackson.JsonMethods
import scala.reflect.ManifestFactory
import scala.reflect.runtime.universe
import scala.reflect.runtime.universe._
import scala.tools.reflect.ToolBox
val cm = universe.runtimeMirror(getClass.getClassLoader)
val toolBox = cm.mkToolBox()
implicit val formats = DefaultFormats
val person = toolBox.eval(q"""
case class Person(name:String, age:String)
scala.reflect.classTag[Person].runtimeClass
""").asInstanceOf[Class[_]]
val js = JsonMethods.parse("""{ "name":"Tom","age" : "28"}""")
js.extract(formats, ManifestFactory.classType(person))
// java.lang.ClassCastException: __wrapper$1$6246735221dc4d64a9e372a9d0891e5e.__wrapper$1$6246735221dc4d64a9e372a9d0891e5e$Person$1 cannot be cast to scala.runtime.Nothing$
(Here toolBox.eval(tree) is used instead of toolBox.compile(toolBox.parse(string))().)
But this doesn't work.
The Manifest should instead be captured at toolbox compile time:
import org.json4s._
import org.json4s.jackson.JsonMethods
import scala.reflect.runtime.universe
import scala.reflect.runtime.universe._
import scala.tools.reflect.ToolBox
val cm = universe.runtimeMirror(getClass.getClassLoader)
val toolBox = cm.mkToolBox()
implicit val formats = DefaultFormats
val person = toolBox.eval(q"""
case class Person(name:String, age:String)
val clazz = scala.reflect.classTag[Person].runtimeClass
scala.reflect.ManifestFactory.classType(clazz)
""").asInstanceOf[Manifest[_]]
val js = JsonMethods.parse("""{ "name":"Tom","age" : "28"}""")
js.extract(formats, person) // Person(Tom,28)
Alternatively, you don't need a Java-reflection Class at all. You can do
import scala.reflect.runtime
import scala.reflect.runtime.universe._
import scala.tools.reflect.ToolBox
val cm = runtime.currentMirror
val toolBox = cm.mkToolBox()
toolBox.eval(q"""
import org.json4s._
import org.json4s.jackson.JsonMethods
implicit val formats = DefaultFormats
case class Person(name: String, age: String)
val js = JsonMethods.parse(${"""{"name":"Tom","age" : "28"}"""})
js.extract[Person]
""") // Person(Tom,28)
or
import org.json4s._
import org.json4s.jackson.JsonMethods
import scala.reflect.runtime
import scala.reflect.runtime.universe._
import scala.tools.reflect.ToolBox
object Main extends App {
  val cm = runtime.currentMirror
  val toolBox = cm.mkToolBox()

  implicit val formats = DefaultFormats

  val person: ClassSymbol = toolBox.define(q"case class Person(name: String, age: String)")

  val js = JsonMethods.parse("""{"name":"Tom","age" : "28"}""")
  val jv = toolBox.eval(q"""
    import Main._
    js.extract[$person]
  """)
  println(jv) // Person(Tom,28)
}

Implicit Encoder for TypedDataset and Type Bounds in Scala

My objective is to create a MyDataFrame class that will know how to fetch data at a given path, but I want to provide type-safety. I'm having some trouble using a frameless.TypedDataset with type bounds on remote data. For example
sealed trait Schema
final case class TableA(id: String) extends Schema
final case class TableB(id: String) extends Schema

class MyDataFrame[T <: Schema](path: String, implicit val spark: SparkSession) {
  def read = TypedDataset.create(spark.read.parquet(path)).as[T]
}
But I keep getting could not find implicit value for evidence parameter of type frameless.TypedEncoder[org.apache.spark.sql.Row]. I know that TypedDataset.create needs an Injection for this to work, but I'm not sure how I would write this for a generic T. I thought the compiler would be able to deduce that it would work, since all subtypes of Schema are case classes.
Anybody ever run into this?
All implicit parameters should be in the last parameter list and this parameter list should be separate from non-implicit ones.
If you try to compile
class MyDataFrame[T <: Schema](path: String)(implicit spark: SparkSession) {
  def read = TypedDataset.create(spark.read.parquet(path)).as[T]
}
you'll see the error
Error:(11, 35) could not find implicit value for evidence parameter of type frameless.TypedEncoder[org.apache.spark.sql.Row]
def read = TypedDataset.create(spark.read.parquet(path)).as[T]
So let's just add the corresponding implicit parameter:
class MyDataFrame[T <: Schema](path: String)(implicit spark: SparkSession, te: TypedEncoder[Row]) {
  def read = TypedDataset.create(spark.read.parquet(path)).as[T]
}
and we'll get the error
Error:(11, 64) could not find implicit value for parameter as: frameless.ops.As[org.apache.spark.sql.Row,T]
def read = TypedDataset.create(spark.read.parquet(path)).as[T]
So let's add one more implicit parameter:
import frameless.ops.As
import frameless.{TypedDataset, TypedEncoder}
import org.apache.spark.sql.{Row, SparkSession}

class MyDataFrame[T <: Schema](path: String)(implicit spark: SparkSession, te: TypedEncoder[Row], as: As[Row, T]) {
  def read = TypedDataset.create(spark.read.parquet(path)).as[T]
}
or with kind-projector
class MyDataFrame[T <: Schema : As[Row, ?]](path: String)(implicit spark: SparkSession, te: TypedEncoder[Row]) {
  def read = TypedDataset.create(spark.read.parquet(path)).as[T]
}
You can create a custom type class:
trait Helper[T] {
  implicit def te: TypedEncoder[Row]
  implicit def as: As[Row, T]
}
object Helper {
  implicit def mkHelper[T](implicit te0: TypedEncoder[Row], as0: As[Row, T]): Helper[T] = new Helper[T] {
    override implicit def te: TypedEncoder[Row] = te0
    override implicit def as: As[Row, T] = as0
  }
}

class MyDataFrame[T <: Schema : Helper](path: String)(implicit spark: SparkSession) {
  val h = implicitly[Helper[T]]
  import h._
  def read = TypedDataset.create(spark.read.parquet(path)).as[T]
}
or
class MyDataFrame[T <: Schema](path: String)(implicit spark: SparkSession, h: Helper[T]) {
  import h._
  def read = TypedDataset.create(spark.read.parquet(path)).as[T]
}
or
import org.apache.spark.sql.DataFrame

trait Helper[T] {
  def create(dataFrame: DataFrame): TypedDataset[T]
}
object Helper {
  implicit def mkHelper[T](implicit te: TypedEncoder[Row], as: As[Row, T]): Helper[T] =
    (dataFrame: DataFrame) => TypedDataset.create(dataFrame).as[T]
}

class MyDataFrame[T <: Schema : Helper](path: String)(implicit spark: SparkSession) {
  def read = implicitly[Helper[T]].create(spark.read.parquet(path))
}
or
class MyDataFrame[T <: Schema](path: String)(implicit spark: SparkSession, h: Helper[T]) {
  def read = h.create(spark.read.parquet(path))
}
Corrected version:
import org.apache.spark.sql.Encoder
import frameless.{TypedDataset, TypedEncoder}

class MyDataFrame[T <: Schema](path: String)(implicit
  spark: SparkSession,
  e: Encoder[T],
  te: TypedEncoder[T]
) {
  def read: TypedDataset[T] = TypedDataset.create[T](spark.read.parquet(path).as[T])
}
or using context bounds
class MyDataFrame[T <: Schema : Encoder : TypedEncoder](path: String)(implicit
  spark: SparkSession
) {
  def read: TypedDataset[T] = TypedDataset.create[T](spark.read.parquet(path).as[T])
}
Testing:
I converted a JSON file {"id": "xyz"} into a parquet file and then ran:
sealed trait Schema
final case class TableA(id: String) extends Schema
final case class TableB(id: String) extends Schema

import org.apache.spark.sql.SparkSession

implicit val spark: SparkSession = SparkSession.builder
  .master("local")
  .appName("Spark SQL basic example")
  .getOrCreate()

import spark.implicits._
import frameless.syntax._

val res: TypedDataset[TableA] = new MyDataFrame[TableA]("path/to/parquet/file").read
println(res) // [id: string]
res.foreach(println).run() // TableA(xyz)

Where is the implicit value for parameter P: cats.Parallel[cats.effect.IO,F]

Running the example code snippet under the parSequence subtopic in the Cats Effect documentation throws an error:
import cats._, cats.data._, cats.syntax.all._, cats.effect.IO
val anIO = IO(1)
val aLotOfIOs = NonEmptyList.of(anIO, anIO)
val ioOfList = aLotOfIOs.parSequence
<console>:44: error: could not find implicit value for parameter P: cats.Parallel[cats.effect.IO,F]
I included an implicit Timer[IO], i.e. implicit val timer = IO.timer(ExecutionContext.global), but it does not work. Please advise. Thanks!
Update #1
For a complete working snippet,
import cats._, cats.data._, cats.syntax.all._, cats.effect.IO
import scala.concurrent.ExecutionContext.Implicits.global
implicit val contextShift = IO.contextShift(global)
val anIO = IO(1)
val aLotOfIOs = NonEmptyList.of(anIO, anIO)
val ioOfList = aLotOfIOs.parSequence
The implicit you're looking for is defined in cats.effect.IOInstances and you can bring it in scope by importing cats.effect.IO._.
private[effect] abstract class IOInstances extends IOLowPriorityInstances {
  //....
  implicit def ioParallel(implicit cs: ContextShift[IO]): Parallel[IO, IO.Par] =
    new Parallel[IO, IO.Par] {
      final override val applicative: Applicative[IO.Par] =
        parApplicative(cs)
      final override val monad: Monad[IO] =
        ioConcurrentEffect(cs)
      final override val sequential: ~>[IO.Par, IO] =
        new FunctionK[IO.Par, IO] { def apply[A](fa: IO.Par[A]): IO[A] = IO.Par.unwrap(fa) }
      final override val parallel: ~>[IO, IO.Par] =
        new FunctionK[IO, IO.Par] { def apply[A](fa: IO[A]): IO.Par[A] = IO.Par(fa) }
    }
}

object IO extends IOInstances {
  // ...
}
Note that you will need to have an implicit ContextShift[IO] in scope if you want to use the ioParallel instance.
It is a common pattern in Scala to have implicit instances defined as part of the companion object for the class (in this case IO).
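For reference, a minimal self-contained sketch against cats-effect 2.x (where ContextShift still exists); the ContextShift[IO] is the piece that was missing in the original snippet, and ParSequenceDemo is just an illustrative name:

import cats.data.NonEmptyList
import cats.effect.{ContextShift, IO}
import cats.syntax.all._
import scala.concurrent.ExecutionContext

object ParSequenceDemo extends App {
  // Needed so that the implicit Parallel[IO, IO.Par] from IO's companion object can be constructed.
  implicit val contextShift: ContextShift[IO] = IO.contextShift(ExecutionContext.global)

  val anIO = IO(1)
  val aLotOfIOs = NonEmptyList.of(anIO, anIO)

  val ioOfList: IO[NonEmptyList[Int]] = aLotOfIOs.parSequence
  println(ioOfList.unsafeRunSync()) // NonEmptyList(1, 1)
}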

Unit testing in Spark with SQLContext implicits

I'm trying to run multiple unit tests in Spark and have copied (and slightly adapted) this bit from the Spark source code:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, Suite}

trait SharedSparkContext extends BeforeAndAfterAll {
  self: Suite =>

  @transient private var _sc: SparkContext = _
  @transient private var _sqlContext: SQLContext = _

  def sc: SparkContext = _sc
  def sqlContext: SQLContext = _sqlContext

  private var conf = new SparkConf(false)

  override def beforeAll() {
    super.beforeAll()
    _sc = new SparkContext("local[*]", "Test Suites", conf)
    _sqlContext = new SQLContext(_sc)
  }

  override def afterAll() {
    try {
      LocalSparkContext.stop(_sc)
      _sc = null
    } finally {
      super.afterAll()
    }
  }
}
The LocalSparkContext class and its companion object are simply copied from the Spark source.
I thought about using it as follows, but that gives me a "stable identifier required" error, because def sqlContext is not a stable identifier from which implicits can be imported:
class MySuite extends FlatSpec with SharedSparkContext {
  import sqlContext.implicits._
  // ...
}
I have tried replacing it with the following, but that gives me null pointer exceptions:
class MySuite extends FlatSpec with SharedSparkContext {
  val sqlCtxt = sqlContext
  import sqlCtxt.implicits._
  // ...
}
I am using Spark 1.4.1 and I have set parallelExecution in test := false.
How can I get this to work (without using additional packages)?
Instead of using a trait, you can use a simple object that holds all your variables. Here's what I do for my tests:
object TestConfiguration extends Serializable {
  private val sparkConf = new SparkConf()
    .setAppName("Tests")
    .setMaster("local")

  private lazy val sparkContext = new SparkContext(sparkConf)
  private lazy val sqlContext = new SQLContext(sparkContext)

  def getSqlContext() = {
    sqlContext
  }
}
Then, you'll be able to use the sqlContext in a test suite.
class MySuite extends FlatSpec with SharedSparkContext {
  val sqlCtxt = TestConfiguration.getSqlContext()
  import sqlCtxt.implicits._
  // ...
}
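With the stable identifier in place, the implicits resolve as usual. A minimal sketch for Spark 1.x (the suite below deliberately does not mix in SharedSparkContext, so only the TestConfiguration context is started; the test name and column names are illustrative):

import org.scalatest.FlatSpec

class ImplicitsSuite extends FlatSpec {
  val sqlCtxt = TestConfiguration.getSqlContext()
  import sqlCtxt.implicits._

  "the imported implicits" should "turn a local Seq into a DataFrame" in {
    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")
    assert(df.count() == 2)
  }
}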

How do I supply an implicit value for an akka.stream.Materializer when sending a FakeRequest?

I'm trying to make sense of the error(s) I'm seeing below, and to learn how to fix it.
could not find implicit value for parameter materializer: akka.stream.Materializer
  val fut: Future[Result] = action.apply(fakeRequest).run
                                                      ^

not enough arguments for method run: (implicit materializer: akka.stream.Materializer)scala.concurrent.Future[play.api.mvc.Result].
Unspecified value parameter materializer.
  val fut: Future[Result] = action.apply(fakeRequest).run
                                                      ^
Here is the test code that produced the error(s):
package com.foo.test

import com.foo.{Api, BoundingBox}
import org.scalatest.{FlatSpec, Matchers}
import play.api.libs.json._
import play.api.mvc._
import play.api.test.{FakeHeaders, FakeRequest}
import scala.concurrent.duration._
import scala.concurrent.{Await, Future}

class TestJmlPlay extends FlatSpec with Matchers {

  val bbox = new BoundingBox(-76.778154438007732F, 39.239828198015971F, -76.501003519894326F, 39.354663763993926F)

  "latitudes" should "be between swLat and neLat" in {
    val action: Action[AnyContent] = (new Api).getForPlay(bbox)
    val jsonStr = getStringFromAction(action)
    areLatitudesOk(jsonStr, bbox) shouldBe true
  }

  private def getStringFromAction(action: Action[AnyContent]): String = {
    val fakeRequest: Request[String] = new FakeRequest("fakeMethod", "fakeUrl", new FakeHeaders, "fakeBody")
    val fut: Future[Result] = action.apply(fakeRequest).run // <== ERROR!
    val result = Await.result(fut, 5000 milliseconds)
    result.body.toString
  }

  private def areLatitudesOk(jsonStr: String, bbox: BoundingBox): Boolean = ...
}
You can create an implicit ActorMaterializer within your test class which will use testkit's ActorSystem:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.testkit.TestKit

class TestJmlPlay(_system: ActorSystem) extends TestKit(_system) ... {

  implicit val materializer: ActorMaterializer = ActorMaterializer()

  val bbox = ...
You don't need a Materializer.
I believe you are not calling the right action.apply method. You want def apply(request: Request[A]): Future[Result].
To call that overload, you need a FakeRequest[AnyContent], i.e. the same parametrized type as action: Action[AnyContent]. This type is forced by the Play body parser I believe you set for your action.
After that you don't need the .run call; see the sketch below.
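A hedged sketch of what the call site could then look like (Play-version details vary; the request method/path are placeholders, and the body handling from the question is kept as-is):

import play.api.mvc.{Action, AnyContent, Result}
import play.api.test.FakeRequest
import scala.concurrent.duration._
import scala.concurrent.{Await, Future}

private def getStringFromAction(action: Action[AnyContent]): String = {
  // FakeRequest() yields a FakeRequest[AnyContentAsEmpty.type], which is a
  // Request[AnyContent], so apply(request) returns a Future[Result] directly
  // and neither .run nor a Materializer is needed.
  val fakeRequest = FakeRequest("GET", "/fake")
  val fut: Future[Result] = action.apply(fakeRequest)
  val result = Await.result(fut, 5.seconds)
  result.body.toString
}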