Scala Broadcast + UDF - scala

I am trying to broadcast a List and pass the broadcast variable to a UDF (the Scala code is in a separate file), but I am facing issues.
val Lookup_BroadCast = SC.broadcast(lookup_data)
UDF creation with 3 arguments
val Call_Sub_Pgm = udf(foo(_: String, Lookup_BroadCast: org.apache.spark.broadcast.Broadcast[List[String]], Trace: String))
Calling the UDF using "withColumn"
Out_DF = Out_DF.withColumn("Col-1", Call_Sub_Pgm(col(Col-1),Lookup_BroadCast,lit(Trace)))
I get a compilation error for the code above: "found broadcast variable, required Sql Column"
If I remove the "Lookup_BroadCast" argument from the call above:
Out_DF = Out_DF.withColumn("Col-1", Call_Sub_Pgm(col(Col-1),lit(Trace)))
then I get the error below:
java.lang.ClassCastException: org.spark.masking.ExtractData$$anonfun$7 cannot be cast to scala.Function0

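A UDF can only be called with Column arguments, so the broadcast variable cannot be passed at the call site. Instead, a Serializable wrapper class can be created for the function, taking the Broadcast in its constructor: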
import org.apache.spark.broadcast.Broadcast

class Wrapper(Lookup_BroadCast: Broadcast[List[String]]) extends Serializable {
  def foo(v: String, s: String): String = {
    // usage example: read the broadcast value inside the function
    Lookup_BroadCast.value.head
  }
}
It is then used like this, so the UDF takes only the two column arguments:
val wrapper = new Wrapper(Lookup_BroadCast)
val Call_Sub_Pgm = udf(wrapper.foo(_: String, _: String))
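The idea behind the wrapper can be sketched without Spark: the lookup data is bound once in the constructor, and the function that remains takes only the per-row values. This is a minimal sketch, not the asker's code; FakeBroadcast, LookupWrapper, and the body of foo are hypothetical stand-ins so the example runs on its own.

```scala
// Hypothetical stand-in for org.apache.spark.broadcast.Broadcast,
// so that this sketch runs without Spark on the classpath.
final case class FakeBroadcast[T](value: T)

// Same shape as the Wrapper class above: the lookup is a constructor
// argument, so foo only needs the two per-row values.
class LookupWrapper(lookup: FakeBroadcast[List[String]]) extends Serializable {
  def foo(v: String, trace: String): String =
    if (lookup.value.contains(v)) s"$trace:found" else s"$trace:missing"
}

object LookupDemo {
  val wrapper = new LookupWrapper(FakeBroadcast(List("a", "b")))

  // A two-argument function, analogous to what would be handed to udf(...).
  val f: (String, String) => String = wrapper.foo(_: String, _: String)
}
```

With Spark, the same two-argument function would be passed to udf and called with two Column arguments, for example col(...) and lit(Trace).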

Related

Mockito verify fails when method takes a function as argument

I have a Scala test which uses Mockito to verify that certain DataFrame transformations are invoked. I broke it down to this simple example that reproduces the problem:
import org.apache.spark.sql.DataFrame
import org.scalatest.funsuite.AnyFunSuite
import org.apache.spark.sql.functions._
import org.mockito.{Mockito, MockitoSugar}
class SimpleTest extends AnyFunSuite{
def withGreeting(df: DataFrame):DataFrame = {
df.withColumn("greeting", lit("hello"))
}
test("sample test") {
val mockDF = MockitoSugar.mock[DataFrame]
val mockDF2 = MockitoSugar.mock[DataFrame]
MockitoSugar.doReturn(mockDF2).when(mockDF).transform(withGreeting)
mockDF.transform(withGreeting)
val orderVerifier = Mockito.inOrder(mockDF)
orderVerifier.verify(mockDF).transform(withGreeting)
}
}
I'm trying to assert that the transform was called on my mockDF, but it fails with
Argument(s) are different! Wanted:
dataset.transform(<function1>);
-> at org.apache.spark.sql.Dataset.transform(Dataset.scala:2182)
Actual invocations have different arguments:
dataset.transform(<function1>);
Why would the verify fail in this case?
To test this correctly, save the lambda expression passed to transform in a val and pass that same val to every transform call:
def withGreeting(df: DataFrame):DataFrame = {
df.withColumn("greeting", lit("hello"))
}
test("sample test") {
val mockDF = MockitoSugar.mock[DataFrame]
val mockDF2 = MockitoSugar.mock[DataFrame]
val withGreetingExpression: DataFrame => DataFrame = df => withGreeting(df)
MockitoSugar.doReturn(mockDF2).when(mockDF).transform(withGreetingExpression)
mockDF.transform(withGreetingExpression)
val orderVerifier = Mockito.inOrder(mockDF)
orderVerifier.verify(mockDF).transform(withGreetingExpression)
}
Mockito requires the same (or equal) arguments in the stubbed, actual, and verified calls to the mocked function. When you pass the method without saving it in a val, each call to transform(withGreeting) creates a new Function[DataFrame, DataFrame] object.
transform(withGreeting)
is the same as:
transform(new Function[DataFrame, DataFrame] {
override def apply(df: DataFrame): DataFrame = withGreeting(df)
})
These objects are not equal to each other, which is the cause of the error message:
Argument(s) are different!
For example, try to execute:
println(((df: DataFrame) => withGreeting(df)) == ((df: DataFrame) => withGreeting(df))) //false
You can read more about object equality in Java (it works the same way in Scala):
wikibooks
javaworld.com
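The equality behaviour can be checked without Spark or Mockito. The sketch below uses a hypothetical method double in place of withGreeting; it only illustrates that function values compare by reference, so two separately written lambdas differ while a saved val equals itself.

```scala
object FunctionEquality {
  def double(x: Int): Int = x * 2

  // Two lambdas with identical bodies, written at two different places.
  val a: Int => Int = x => double(x)
  val b: Int => Int = x => double(x)

  // Function1 does not override equals, so == falls back to reference equality.
  val separateLambdasEqual: Boolean = a == b // different objects
  val savedLambdaEqual: Boolean = a == a     // the same object
}
```

This is why reusing one val for the stubbing, the call, and the verification lets Mockito match the arguments.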

Spark unable to find encoder(case class) although providing it

I am trying to figure out why I am getting an error on encoders; any insight would be helpful!
ERROR Unable to find encoder for type SolrNewsDocument, An implicit Encoder[SolrNewsDocument] is needed to store `
Clearly I have imported spark.implicits._. I have also provided an encoder as a case class.
def ingestDocsToSolr(newsItemDF: DataFrame) = {
case class SolrNewsDocument(
title: String,
body: String,
publication: String,
date: String,
byline: String,
length: String
)
import spark.implicits._
val solrDocs = newsItemDF.as[SolrNewsDocument].map { doc =>
val solrDoc = new SolrInputDocument
solrDoc.setField("title", doc.title.toString)
solrDoc.setField("body", doc.body)
solrDoc.setField("publication", doc.publication)
solrDoc.setField("date", doc.date)
solrDoc.setField("byline", doc.byline)
solrDoc.setField("length", doc.length)
solrDoc
}
// can be used for stream SolrSupport.
SolrSupport.indexDocs("localhost:2181", "collection", 10, solrDocs.rdd);
val solrServer = SolrSupport.getCachedCloudClient("localhost:2181")
solrServer.setDefaultCollection("collection")
solrServer.commit(false, false)
}
Try this: move the case class declaration before (outside) the function declaration.
The implicit encoder can only be derived for a case class that is declared outside the method. When the case class is declared inside the function body, the compiler cannot find an implicit Encoder for it there.
import spark.implicits._
case class SolrNewsDocument(title: String, body: String, publication: String, date: String, byline: String, length: String)
def ingestDocsToSolr(newsItemDF: DataFrame) = {
  val solrDocs = newsItemDF.as[SolrNewsDocument]
}
I got this error when trying to iterate over a text file. In my case, as of Spark 2.4.x, the issue was that I had to convert it to an RDD first (the conversion used to be implicit):
textFile
.rdd
.flatMap(line=>line.split(" "))
Migrating our Scala codebase to Spark 2

State management not serializable

In my application, I want to keep track of multiple states. Thus I tried to encapsulate the whole state management logic within a class StateManager as follows:
@SerialVersionUID(xxxxxxxL)
class StateManager(
inputStream: DStream[(String, String)],
initialState: RDD[(String, String)]
) extends Serializable {
lazy val state = inputStream.mapWithState(stateSpec).map(_.get)
lazy val stateSpec = StateSpec
.function(trackStateFunc _)
.initialState(initialState)
.timeout(Seconds(30))
def trackStateFunc(key: String, value: Option[String], state: State[String]): Option[(String, String)] = {}
}
object StateManager { def apply(dstream: DStream[(String, String)], initialstate: RDD[(String, String)]) = new StateManager(_dStream, _initialState) }
The @SerialVersionUID(xxxxxxxL) ... extends Serializable is an attempt to solve my problem.
But when calling StateManager from my main class like the following:
val lStreamingContext = StreamingEnvironment(streamingWindow, checkpointDirectory)
val statemanager= StateManager(lStreamingEnvironment.sparkContext, 1, None)
val state= statemanager.state(lKafkaStream)
state.foreachRDD(_.foreach(println))
(see below for StreamingEnvironment), I get:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
[...]
Caused by: java.io.NotSerializableException: Object of org.apache.spark.streaming.kafka.DirectKafkaInputDStream is being serialized possibly as a part of closure of an RDD operation. This is because the DStream object is being referred to from within the closure. Please rewrite the RDD operation inside this DStream to avoid this. This has been enforced to avoid bloating of Spark tasks with unnecessary objects.
The error is clear, but I still don't understand at what point it is triggered.
Where does it trigger?
What could I do to solve this and have a reusable class?
The might-be-useful StreamingEnvironment class:
class StreamingEnvironment(mySparkConf: SparkConf, myKafkaConf: KafkaConf, myStreamingWindow: Duration, myCheckPointDirectory: String) {
val sparkContext = spark.SparkContext.getOrCreate(mySparkConf)
lazy val streamingContext = new StreamingContext(sparkContext , mMicrobatchPeriod)
streamingContext.checkpoint(mCheckPointDirectory)
streamingContext.remember(Minutes(1))
def stream() = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](streamingContext, myKafkaConf.mBrokers, myKafkaConf.mTopics)
}
object StreamingEnvironment {
def apply(streamingWindow: Duration, checkpointDirectory: String) = {
//setup sparkConf and kafkaConf
new StreamingEnvironment(sparkConf , kafkaConf, streamingWindow, checkpointDirectory)
}
}
When a method is lifted into a function, the resulting function keeps a reference to the enclosing instance, as happens here: function(trackStateFunc _). Serializing that function therefore also serializes the StateManager, including its inputStream field.
Declaring trackStateFunc directly as a function (i.e. as a val) will probably take care of the problem.
Also note that marking a class Serializable does not make its fields serializable. DStream is not serializable and should be annotated as @transient, which will probably solve the issue as well.
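The capture of the outer instance can be reproduced with plain Java serialization and no Spark. This is a minimal sketch under stated assumptions: Resource is a hypothetical stand-in for the non-serializable DStream, and Manager, SafeManager, and Funcs are illustrative names, not part of the question's code.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for a non-serializable object such as a DStream.
class Resource

class Manager(val resource: Resource) extends Serializable {
  def track(x: Int): Int = x + 1
  // Lifting the method keeps a reference to `this`, and so to `resource`.
  def lifted: Int => Int = track _
}

class SafeManager(@transient val resource: Resource) extends Serializable {
  def track(x: Int): Int = x + 1
  // Still references `this`, but the transient field is skipped on serialization.
  def lifted: Int => Int = track _
}

object Funcs {
  // A plain function value that references no enclosing instance.
  val plain: Int => Int = x => x + 1
}

object SerializationCheck {
  // Returns false if Java serialization rejects the object graph.
  def canSerialize(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

On Scala 2.12 or later, SerializationCheck.canSerialize(new Manager(new Resource).lifted) is expected to return false, while the SafeManager variant and Funcs.plain are expected to serialize, which mirrors the two fixes suggested above.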

Spark convert RDD to DataFrame - Enumeration is not supported

I have a case class which contains an enumeration field "PersonType". I would like to insert this record into a Hive table.
object PersonType extends Enumeration {
type PersonType = Value
val BOSS = Value
val REGULAR = Value
}
case class Person(firstname: String, lastname: String)
case class Holder(personType: PersonType.Value, person: Person)
And:
val hiveContext = new HiveContext(sc)
import hiveContext.implicits._
val item = new Holder(PersonType.REGULAR, new Person("tom", "smith"))
val content: Seq[Holder] = Seq(item)
val data : RDD[Holder] = sc.parallelize(content)
val df = data.toDF()
...
When I try to convert the corresponding RDD to DataFrame, I get the following exception:
Exception in thread "main" java.lang.UnsupportedOperationException:
Schema for type com.test.PersonType.Value is not supported
...
at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:691)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:630)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:414)
at org.apache.spark.sql.SQLImplicits.rddToDataFrameHolder(SQLImplicits.scala:94)
I'd like to convert PersonType to String before inserting to Hive.
Is it possible to extend the implicit conversion to handle PersonType as well?
I tried something like this but didn't work:
object PersonTypeConversions {
implicit def toString(personType: PersonTypeConversions.Value): String = personType.toString()
}
import PersonTypeConversions._
Spark: 1.6.0
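One way to get the String form before the DataFrame conversion is to map each Holder to a second case class whose personType field is a String. The sketch below is only an illustration of that idea and runs without Spark; HolderRow and fromHolder are hypothetical names that do not appear in the question.

```scala
object PersonType extends Enumeration {
  type PersonType = Value
  val BOSS = Value
  val REGULAR = Value
}

case class Person(firstname: String, lastname: String)
case class Holder(personType: PersonType.Value, person: Person)

// Hypothetical row type: same fields as Holder, with the enumeration stored as a String.
case class HolderRow(personType: String, person: Person)

object HolderRow {
  // Convert the enumeration value to its name before handing the data to Spark.
  def fromHolder(h: Holder): HolderRow = HolderRow(h.personType.toString, h.person)
}
```

Assuming the RDD from the question, data.map(HolderRow.fromHolder).toDF() would then contain only String fields and a nested case class, which avoids the enumeration type in the schema.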

Enriching SparkContext without incurring in serialization issues

I am trying to use Spark to process data that comes from HBase tables. This blog post gives an example of how to use NewHadoopAPI to read data from any Hadoop InputFormat.
What I have done
Since I will need to do this many times, I was trying to use implicits to enrich SparkContext, so that I can get an RDD from a given set of columns in HBase. I have written the following helper:
trait HBaseReadSupport {
implicit def toHBaseSC(sc: SparkContext) = new HBaseSC(sc)
implicit def bytes2string(bytes: Array[Byte]) = new String(bytes)
}
final class HBaseSC(sc: SparkContext) extends Serializable {
def extract[A](data: Map[String, List[String]], result: Result, interpret: Array[Byte] => A) =
data map { case (cf, columns) =>
val content = columns map { column =>
val cell = result.getColumnLatestCell(cf.getBytes, column.getBytes)
column -> interpret(CellUtil.cloneValue(cell))
} toMap
cf -> content
}
def makeConf(table: String) = {
val conf = HBaseConfiguration.create()
conf.setBoolean("hbase.cluster.distributed", true)
conf.setInt("hbase.client.scanner.caching", 10000)
conf.set(TableInputFormat.INPUT_TABLE, table)
conf
}
def hbase[A](table: String, data: Map[String, List[String]])
(interpret: Array[Byte] => A) =
sc.newAPIHadoopRDD(makeConf(table), classOf[TableInputFormat],
classOf[ImmutableBytesWritable], classOf[Result]) map { case (key, row) =>
Bytes.toString(key.get) -> extract(data, row, interpret)
}
}
It can be used like
val rdd = sc.hbase[String](table, Map(
"cf" -> List("col1", "col2")
))
In this case we get an RDD of (String, Map[String, Map[String, String]]), where the first component is the row key and the second is a map whose keys are column families and whose values are maps from column names to cell values.
Where it fails
Unfortunately, it seems that my job gets a reference to sc, which is itself not serializable by design. What I get when I run the job is
Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
I can remove the helper classes and use the same logic inline in my job and everything runs fine. But I want to get something which I can reuse instead of writing the same boilerplate over and over.
By the way, the issue is not specific to implicit, even using a function of sc exhibits the same problem.
For comparison, the following helper to read TSV files (I know it's broken as it does not support quoting and so on, never mind) seems to work fine:
trait TsvReadSupport {
implicit def toTsvRDD(sc: SparkContext) = new TsvRDD(sc)
}
final class TsvRDD(val sc: SparkContext) extends Serializable {
def tsv(path: String, fields: Seq[String], separator: Char = '\t') = sc.textFile(path) map { line =>
val contents = line.split(separator).toList
(fields, contents).zipped.toMap
}
}
How can I encapsulate the logic to read rows from HBase without unintentionally capturing the SparkContext?
Just add the @transient annotation to the sc variable:
final class HBaseSC(@transient val sc: SparkContext) extends Serializable {
...
}
and make sure sc is not used within the extract function, since it won't be available on the workers.
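Why the field is unavailable on the workers can be seen with a plain serialization round trip, which is roughly what happens when a task is shipped. This is a minimal sketch without Spark; DriverOnly and Helper are hypothetical stand-ins for SparkContext and HBaseSC.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for a driver-only, non-serializable object such as SparkContext.
class DriverOnly

// The transient field is skipped when the helper is serialized.
class Helper(@transient val ctx: DriverOnly, val table: String) extends Serializable

object TransientRoundTrip {
  // Serialize to bytes and read the object back, as a rough model of shipping it to a worker.
  def roundTrip[T <: AnyRef](obj: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

After the round trip, the copy's ctx field is null while table keeps its value, so any code that runs on the deserialized copy must not touch the transient field.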
If it's necessary to access Spark context from within distributed computation, rdd.context function might be used:
val rdd = sc.newAPIHadoopRDD(...)
rdd map {
case (k, v) =>
val ctx = rdd.context
....
}