Generic T as Spark Dataset[T] constructor - scala

In the following snippet, the tryParquet function tries to load a Dataset from a Parquet file if it exists. If the file does not exist, it computes the provided Dataset plan, persists it, and returns it:
import scala.util.{Try, Success, Failure}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Dataset

sealed trait CustomRow

case class MyRow(
  id: Int,
  name: String
) extends CustomRow

// assumes an active SparkSession named `spark` and `import spark.implicits._`
val ds: Dataset[MyRow] =
  Seq((1, "foo"),
      (2, "bar"),
      (3, "baz")).toDF("id", "name").as[MyRow]

def tryParquet[T <: CustomRow](session: SparkSession, path: String, target: Dataset[T]): Dataset[T] =
  Try(session.read.parquet(path)) match {
    case Success(df) => df.as[T] // <---- compile error here
    case Failure(_) => {
      target.write.parquet(path)
      target
    }
  }

val readyDS: Dataset[MyRow] =
  tryParquet(spark, "/path/to/file.parq", ds)
However, this produces a compile error on df.as[T]:
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._
Support for serializing other types will be added in future releases.
case Success(df) => df.as[T]
One can circumvent this problem by having tryParquet return an untyped DataFrame and letting the caller cast it to the desired type. However, is there any solution if we want the type to be managed internally by the function?

It looks like this is possible by adding an Encoder context bound to the type parameter:
import org.apache.spark.sql.Encoder
def tryParquet[T <: CustomRow: Encoder](...)
This way the compiler can prove that df.as[T] is provided an Encoder when constructing the objects.
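For completeness, a minimal sketch of the full function with the context bound; the body is unchanged from the question, only the signature gains the Encoder:
import org.apache.spark.sql.{Dataset, Encoder, SparkSession}
import scala.util.{Failure, Success, Try}

def tryParquet[T <: CustomRow: Encoder](session: SparkSession, path: String, target: Dataset[T]): Dataset[T] =
  Try(session.read.parquet(path)) match {
    case Success(df) => df.as[T] // compiles: an implicit Encoder[T] is now in scope
    case Failure(_) =>
      target.write.parquet(path)
      target
  }
The call site is unchanged: tryParquet(spark, "/path/to/file.parq", ds) resolves Encoder[MyRow] from spark.implicits._, since MyRow is a case class.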

Related

How to infer StructType schema for Spark Scala at run time given a Fully Qualified Name of a case class

For a few days I have been wondering whether it is possible to infer a Spark schema in Scala for a case class that is unknown at compile time.
The only input is a string containing the FQN of the class (which could be used, for example, to create an instance of the case class at runtime via reflection).
I was wondering whether it is possible to do something like:
package com.my.namespace

case class MyCaseClass(name: String, num: Int)

// Somewhere else in the codebase
// coming from an external configuration file, so unknown at compile time
val fqn = "com.my.namespace.MyCaseClass"
val schema = Encoders.product[getXYZ(fqn)].schema
Of course, any other technique that does not use Encoders is fine (building a StructType by analysing an instance of the case class? Is that even possible?).
What is the best approach?
Is it feasible at all?
You can use the reflective toolbox:
package com.my.namespace

import org.apache.spark.sql.types.StructType
import scala.reflect.runtime
import scala.tools.reflect.ToolBox

case class MyCaseClass(name: String, num: Int)

object Main extends App {
  val fqn = "com.my.namespace.MyCaseClass"
  val runtimeMirror = runtime.currentMirror
  val toolbox = runtimeMirror.mkToolBox()
  val res = toolbox.eval(toolbox.parse(s"""
    import org.apache.spark.sql.Encoders
    Encoders.product[$fqn].schema
  """)).asInstanceOf[StructType]
  println(res) // StructType(StructField(name,StringType,true),StructField(num,IntegerType,false))
}
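Since the FQN comes from configuration, the same toolbox call can be wrapped in a small helper; the schemaForFqn name below is just illustrative, and the class only has to be present on the runtime classpath:
import org.apache.spark.sql.types.StructType
import scala.reflect.runtime
import scala.tools.reflect.ToolBox

// Illustrative wrapper around the toolbox call shown above.
def schemaForFqn(fqn: String): StructType = {
  val toolbox = runtime.currentMirror.mkToolBox()
  toolbox.eval(toolbox.parse(s"""
    import org.apache.spark.sql.Encoders
    Encoders.product[$fqn].schema
  """)).asInstanceOf[StructType]
}

// e.g. schemaForFqn("com.my.namespace.MyCaseClass")
Note that the toolbox compiles the snippet at runtime, which is relatively expensive, so you may want to cache results if this is called often.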

Convert a PreparedStatement object to JSON in Scala

I'm trying to convert a PreparedStatement (the object used for sending SQL statements to the database) to JSON with Scala.
So far, I've found that the best way to convert an object to JSON in Scala is with the net.liftweb library.
But when I tried it, I got empty JSON.
This is the code:
import java.sql.DriverManager
import net.liftweb.json._
import net.liftweb.json.Serialization.write

object Main {
  def main(args: Array[String]): Unit = {
    implicit val formats = DefaultFormats
    val jdbcSqlConnStr = "sqlserverurl**"
    val conn = DriverManager.getConnection(jdbcSqlConnStr)
    val statement = conn.prepareStatement("exec select_all")
    val piedPierJSON2 = write(statement)
    println(piedPierJSON2)
  }
}
This is the result:
{}
When I used an object I created, the conversion worked:
case class Person(name: String, address: Address)
case class Address(city: String, state: String)
val p = Person("Alvin Alexander", Address("Talkeetna", "AK"))
val piedPierJSON3 = write(p)
println(piedPierJSON3)
This is the result:
{"name":"Alvin Alexander","address":{"city":"Talkeetna","state":"AK"}}
I understood where the problem was: PreparedStatement is an interface, and none of its subtypes are serializable...
I'm going to try to wrap it up and put it in a different class.
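For example, a minimal sketch of that idea (the StatementInfo name and fields are hypothetical; JDBC gives you no portable way to read the SQL text back from a PreparedStatement, so keep the string you built it from):
import net.liftweb.json._
import net.liftweb.json.Serialization.write

// Hypothetical wrapper holding the data we already have, instead of the statement itself.
case class StatementInfo(sql: String, parameters: List[String])

implicit val formats = DefaultFormats
val info = StatementInfo("exec select_all", Nil)
println(write(info)) // {"sql":"exec select_all","parameters":[]}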

Spark unable to find encoder (case class) although providing it

Trying to figure out why I am getting an error about encoders; any insight would be helpful!
ERROR Unable to find encoder for type SolrNewsDocument, An implicit Encoder[SolrNewsDocument] is needed to store `
Clearly I have imported spark.implicits._. I have also provided an encoder as a case class.
def ingestDocsToSolr(newsItemDF: DataFrame) = {
  case class SolrNewsDocument(
    title: String,
    body: String,
    publication: String,
    date: String,
    byline: String,
    length: String
  )
  import spark.implicits._
  val solrDocs = newsItemDF.as[SolrNewsDocument].map { doc =>
    val solrDoc = new SolrInputDocument
    solrDoc.setField("title", doc.title.toString)
    solrDoc.setField("body", doc.body)
    solrDoc.setField("publication", doc.publication)
    solrDoc.setField("date", doc.date)
    solrDoc.setField("byline", doc.byline)
    solrDoc.setField("length", doc.length)
    solrDoc
  }
  // can be used for stream SolrSupport.
  SolrSupport.indexDocs("localhost:2181", "collection", 10, solrDocs.rdd)
  val solrServer = SolrSupport.getCachedCloudClient("localhost:2181")
  solrServer.setDefaultCollection("collection")
  solrServer.commit(false, false)
}
// Check this one: move the case class declaration before the function declaration.
// The encoder is created once the case class definition has been compiled at the top level; only then can the compiler find and use the encoder inside the function declaration.
import spark.implicits._

case class SolrNewsDocument(title: String, body: String, publication: String, date: String, byline: String, length: String)

def ingestDocsToSolr(newsItemDF: DataFrame) = {
  val solrDocs = newsItemDF.as[SolrNewsDocument]
}
I got this error trying to iterate over a text file, and in my case, as of Spark 2.4.x, the issue was that I had to convert it to an RDD first (that conversion used to be implicit):
textFile
  .rdd
  .flatMap(line => line.split(" "))
This came up while migrating our Scala codebase to Spark 2.

How to use TypeInformation in a generic method using Scala

I'm trying to create a generic method in Apache Flink to parse a DataSet[String] (JSON strings) using case classes. I tried to use TypeInformation as mentioned here: https://ci.apache.org/projects/flink/flink-docs-stable/dev/types_serialization.html#generic-methods
I'm using Liftweb to parse the JSON strings; this is my code:
import net.liftweb.json._
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.api.scala._

class Loader(settings: Map[String, String])(implicit environment: ExecutionEnvironment) {
  val env: ExecutionEnvironment = environment

  def load[T: TypeInformation](): DataSet[T] = {
    val data: DataSet[String] = env.fromElements(
      """{"name": "name1"}""",
      """{"name": "name2"}"""
    )
    implicit val formats = DefaultFormats
    data.map(item => parse(item).extract[T])
  }
}
But I got the error:
No Manifest available for T
data.map(item => parse(item).extract[T])
Then I tried to add a Manifest and delete the TypeInformation like this:
def load[T: Manifest](): DataSet[T] = { ...
And I got the following error:
could not find implicit value for evidence parameter of type org.apache.flink.api.common.typeinfo.TypeInformation[T]
I'm very confused about this; I'd really appreciate your help.
Thanks.
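For reference, a sketch of one way out (not from the original post): keep both context bounds inside the same Loader class, so the caller supplies a TypeInformation[T] for Flink and a Manifest[T] for Liftweb's extract; the Manifest also satisfies the ClassTag that Flink's map requires.
// Replaces the load method shown above; both implicits are resolved at the call site,
// where T is a concrete case class.
def load[T: TypeInformation: Manifest](): DataSet[T] = {
  val data: DataSet[String] = env.fromElements(
    """{"name": "name1"}""",
    """{"name": "name2"}"""
  )
  implicit val formats = DefaultFormats
  data.map(item => parse(item).extract[T])
}

// e.g. loader.load[Name]() where case class Name(name: String) is defined at the call site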

How to pass data through a spark mapper without modelling it in the parameter class?

I need to do stateful processing of dataframe rows. To do that I need to create a bean or case class that models the data necessary for the stateful processing. I would like to hang on to other data in the dataframe for use after the stateful processing without modelling it in the case class. How can this be done?
In stateless processing we can sort of stay in DataFrame land by using UDFs but we do not have that option here.
Here's what I tried:
package com.example.so

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class WibbleState() // just a placeholder

case class Wibble
(
  x: String,
  y: Int,
  data: Row // data I don't want to model in the case class
)

object PartialModelization {
  def wibbleStateFlatMapper(k: String,
                            it: Iterator[Wibble],
                            state: GroupState[WibbleState]): Iterator[Wibble] = it

  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .appName("PartialModelization")
      .master("local[*]").getOrCreate()
    import spark.implicits._

    // imagine this is actually a streaming data frame
    val input = spark.createDataFrame(List(("a", 1, 0), ("b", 1, 2)))
      .toDF("x", "y", "z")

    // don't want to model z in the case class
    // if that seems pointless, imagine there is also z1, z2, z3, etc.
    // or that z is itself a struct
    input.select($"x", $"y", struct("*").as("data"))
      .as[Wibble]
      .groupByKey(w => w.x)
      .flatMapGroupsWithState[WibbleState, Wibble](
        OutputMode.Append, GroupStateTimeout.NoTimeout)(wibbleStateFlatMapper)
      .select("data.*")
      .show()
  }
}
Which gives this error:
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for org.apache.spark.sql.Row
- field (class: "org.apache.spark.sql.Row", name: "data")
- root class: "com.example.so.Wibble"
Conceptually you might suggest finding some key that allows us to join the output dataframe with the input one to recover the "data" attribute, but from a performance and implementation-complexity standpoint that seems like a horrible solution. (I'd rather just type out the whole data structure in my case classes in that case!)
The best solution I have found so far is to use a tuple to separate the mapper data from the row data.
So we remove the data attribute from Wibble.
case class Wibble
(
  x: String,
  y: Int
)
Modify the types on our stateful flat mapper to handle (Wibble, Row) instead of just Wibble:
def wibbleStateFlatMapper(k: String,
                          it: Iterator[(Wibble, Row)],
                          state: GroupState[WibbleState]): Iterator[(Wibble, Row)] = it
Now our pipeline code becomes:
// additional imports needed for the explicit encoders below
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.catalyst.encoders.RowEncoder

// imagine this is actually a streaming data frame
val input = spark.createDataFrame(List(("a", 1, 0), ("b", 1, 2)))
  .toDF("x", "y", "z")

val inputEncoder = RowEncoder(input.schema)
val wibbleEncoder = Encoders.product[Wibble]
implicit val tupleEncoder = Encoders.tuple(wibbleEncoder, inputEncoder)

input.select(struct($"x", $"y").as("wibble"), struct("*").as("data"))
  .as(tupleEncoder)
  .groupByKey({ case (w, _) => w.x })
  .flatMapGroupsWithState(
    OutputMode.Append, GroupStateTimeout.NoTimeout)(wibbleStateFlatMapper)
  .select("_2.*")
  .show()