Spark unable to find encoder(case class) although providing it - scala

Trying to figure out why getting an error on encoders, any insight would be helpful!
ERROR Unable to find encoder for type SolrNewsDocument, An implicit Encoder[SolrNewsDocument] is needed to store `
Clearly I have imported spark.implicits._. I have also have provided an encoder as a case class.
def ingestDocsToSolr(newsItemDF: DataFrame) = {
case class SolrNewsDocument(
title: String,
body: String,
publication: String,
date: String,
byline: String,
length: String
)
import spark.implicits._
val solrDocs = newsItemDF.as[SolrNewsDocument].map { doc =>
val solrDoc = new SolrInputDocument
solrDoc.setField("title", doc.title.toString)
solrDoc.setField("body", doc.body)
solrDoc.setField("publication", doc.publication)
solrDoc.setField("date", doc.date)
solrDoc.setField("byline", doc.byline)
solrDoc.setField("length", doc.length)
solrDoc
}
// can be used for stream SolrSupport.
SolrSupport.indexDocs("localhost:2181", "collection", 10, solrDocs.rdd);
val solrServer = SolrSupport.getCachedCloudClient("localhost:2181")
solrServer.setDefaultCollection("collection")
solrServer.commit(false, false)
}

//Check this one.-Move case class declaration before function declaration.
//Encoder is created once case class statement is executed by compiler. Then only compiler will be able to use encoder inside function deceleration.
import spark.implicits._
case class SolrNewsDocument(title: String,body: String,publication: String,date: String,byline: String,length: String)
def ingestDocsToSolr(newsItemDF:DataFrame) = {
val solrDocs = newsItemDF.as[SolrNewsDocument]}

i got this error trying to iterate over a text file, and in my case, as of spark 2.4.x the issue was that i had to cast it to an RDD first (that used to be implicit)
textFile
.rdd
.flatMap(line=>line.split(" "))
Migrating our Scala codebase to Spark 2

Related

How to infer StructType schema for Spark Scala at run time given a Fully Qualified Name of a case class

Since a few days I was wondering if it is possible to infer a schema for Spark in Scala for a given case class, but unknown at compile time.
The only input is a string containing the FQN of the class (that could be used for example to create an instance of the case class at runtime via reflection)
I was thinking if it was possible to do something like:
package com.my.namespace
case class MyCaseClass (name: String, num: Int)
//Somewhere else in codebase
// coming from external configuration file, so unknown at compile time
val fqn = "com.my.namespace.MyCaseClass"
val schema = Encoders.product [ getXYZ( fqn ) ].schema
Of course, any other techniques that is not using Encoders is fine (building StructType analysing an instance of the case class ? Is it even possible ?)
What is the best approach?
Is it something feasible ?
You can use reflective toolbox
package com.my.namespace
import org.apache.spark.sql.types.StructType
import scala.reflect.runtime
import scala.tools.reflect.ToolBox
case class MyCaseClass (name: String, num: Int)
object Main extends App {
val fqn = "com.my.namespace.MyCaseClass"
val runtimeMirror = runtime.currentMirror
val toolbox = runtimeMirror.mkToolBox()
val res = toolbox.eval(toolbox.parse(s"""
import org.apache.spark.sql.Encoders
Encoders.product[$fqn].schema
""")).asInstanceOf[StructType]
println(res) // StructType(StructField(name,StringType,true),StructField(num,IntegerType,false))
}

Scala Broadcast + UDF

I am trying to broadcast an List and pass the broadcast variable to UDF (Scala code is present in separate file). But facing issues.
val Lookup_BroadCast = SC.broadcast(lookup_data)
UDF creation with 3 arguments
val Call_Sub_Pgm = udf(foo(_: String, Lookup_BroadCast: org.apache.spark.broadcast.Broadcast[List[String]], Trace: String))
Calling the UDF using "withColumn"
Out_DF = Out_DF.withColumn("Col-1", Call_Sub_Pgm(col(Col-1),Lookup_BroadCast,lit(Trace)))
I am getting compilation error for above code - "found broadcast variable, required Sql Column"
If i remove "Lookup_BroadCast" variable from above
Out_DF = Out_DF.withColumn("Col-1", Call_Sub_Pgm(col(Col-1),Lookup_BroadCast,lit(Trace)))
then I get below error:
java.lang.ClassCastException: org.spark.masking.ExtractData$$anonfun$7 cannot be cast to scala.Function0
Serializable wrapper class can be created for function, with Broadcast in constructor:
class Wrapper(Lookup_BroadCast: Broadcast[List[String]]) extends Serializable {
def foo(v: String, s: String): String = {
// usage example
Lookup_BroadCast.value.head
}
}
And used like:
val wrapper = new Wrapper(Lookup_BroadCast)
val Call_Sub_Pgm = udf(wrapper.foo(_: String, _: String))

Convert prepareStament object to Json Scala

I'am trying to convert prepareStament(object uses for sending SQL statement to the database ) to Json with scala.
So far, I've discovered that the best way to convert an object to Json in scala is to do it with the net.liftweb library.
But when I tried it, I got an empty json.
this is the code
import java.sql.DriverManager
import net.liftweb.json._
import net.liftweb.json.Serialization.write
object Main {
def main (args: Array[String]): Unit = {
implicit val formats = DefaultFormats
val jdbcSqlConnStr = "sqlserverurl**"
val conn = DriverManager.getConnection(jdbcSqlConnStr)
val statement = conn.prepareStatement("exec select_all")
val piedPierJSON2= write(statement)
println(piedPierJSON2)
}
}
this is the result
{}
I used an object I created , and the conversion worked.
case class Person(name: String, address: Address)
case class Address(city: String, state: String)
val p = Person("Alvin Alexander", Address("Talkeetna", "AK"))
val piedPierJSON3 = write(p)
println(piedPierJSON3)
This is the result
{"name":"Alvin Alexander","address":{"city":"Talkeetna","state":"AK"}}
I understood where the problem was, PrepareStament is an interface, and none of its subtypes are serializable...
I'm going to try to wrap it up and put it in a different class.

How to use TypeInformation in a generic method using Scala

I'm trying to create a generic method in Apache Flink to parse a DataSet[String](JSON strings) using case classes. I tried to use the TypeInformation like it's mentioned here: https://ci.apache.org/projects/flink/flink-docs-stable/dev/types_serialization.html#generic-methods
I'm using liftweb to parse the JSON string, this is my code:
import net.liftweb.json._
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.api.scala._
class Loader(settings: Map[String, String])(implicit environment: ExecutionEnvironment) {
val env: ExecutionEnvironment = environment
def load[T: TypeInformation](): DataSet[T] = {
val data: DataSet[String] = env.fromElements(
"""{"name": "name1"}""",
"""{"name": "name2"}"""
)
implicit val formats = DefaultFormats
data.map(item => parse(item).extract[T])
}
}
But I got the error:
No Manifest available for T
data.map(item => parse(item).extract[T])
Then I tried to add a Manifest and delete the TypeInformation like this:
def load[T: Manifest](): DataSet[T] = { ...
And I got the next error:
could not find implicit value for evidence parameter of type org.apache.flink.api.common.typeinfo.TypeInformation[T]
I'm very confuse about this, I'll really appreciate your help.
Thanks.

Generic T as Spark Dataset[T] constructor

In the following snippet, the tryParquet function tries to load a Dataset from a Parquet file if it exists. If not, it computes, persists and returns back the Dataset plan which was provided:
import scala.util.{Try, Success, Failure}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Dataset
sealed trait CustomRow
case class MyRow(
id: Int,
name: String
) extends CustomRow
val ds: Dataset[MyRow] =
Seq((1, "foo"),
(2, "bar"),
(3, "baz")).toDF("id", "name").as[MyRow]
def tryParquet[T <: CustomRow](session: SparkSession, path: String, target: Dataset[T]): Dataset[T] =
Try(session.read.parquet(path)) match {
case Success(df) => df.as[T] // <---- compile error here
case Failure(_) => {
target.write.parquet(path)
target
}
}
val readyDS: Dataset[MyRow] =
tryParquet(spark, "/path/to/file.parq", ds)
However this produces a compile error on df.as[T]:
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._
Support for serializing other types will be added in future releases.
case Success(df) => df.as[T]
One can circumvent this problem by making tryParquet cast df to return an untyped DataFrame and let caller cast to the desired constructor. However is there any solution in the case we want the type to be managed internally by the function?
Looks like it's possible by using an Encoder in the type parameter:
import org.apache.spark.sql.Encoder
def tryParquet[T <: CustomRow: Encoder](...)
This way the compiler can prove that df.as[T] is providing an Encoder when constructing the objects.