State management not serializable - Scala

In my application, I want to keep track of multiple states. Thus I tried to encapsulate the whole state management logic within a class StateManager as follows:
@SerialVersionUID(xxxxxxxL)
class StateManager(
    inputStream: DStream[(String, String)],
    initialState: RDD[(String, String)]
) extends Serializable {

  lazy val state = inputStream.mapWithState(stateSpec).map(_.get)

  lazy val stateSpec = StateSpec
    .function(trackStateFunc _)
    .initialState(initialState)
    .timeout(Seconds(30))

  def trackStateFunc(key: String, value: Option[String], state: State[String]): Option[(String, String)] = {
    // ...
  }
}

object StateManager {
  def apply(dStream: DStream[(String, String)], initialState: RDD[(String, String)]) =
    new StateManager(dStream, initialState)
}
The @SerialVersionUID(xxxxxxxL) ... extends Serializable part is an attempt to solve my problem.
But when calling StateManager from my main class like the following:
val lStreamingEnvironment = StreamingEnvironment(streamingWindow, checkpointDirectory)
val lKafkaStream = lStreamingEnvironment.stream()
val stateManager = StateManager(lKafkaStream, initialState)
val state = stateManager.state
state.foreachRDD(_.foreach(println))
(see below for StreamingEnvironment), I get:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
[...]
Caused by: java.io.NotSerializableException: Object of org.apache.spark.streaming.kafka.DirectKafkaInputDStream is being serialized possibly as a part of closure of an RDD operation. This is because the DStream object is being referred to from within the closure. Please rewrite the RDD operation inside this DStream to avoid this. This has been enforced to avoid bloating of Spark tasks with unnecessary objects.
The error is clear, but I still don't understand at what point it is triggered.
Where is it triggered?
What could I do to solve this and have a reusable class?
The might-be-useful StreamingEnvironment class:
class StreamingEnvironment(mySparkConf: SparkConf, myKafkaConf: KafkaConf, myStreamingWindow: Duration, myCheckPointDirectory: String) {
  val sparkContext = SparkContext.getOrCreate(mySparkConf)

  lazy val streamingContext = new StreamingContext(sparkContext, myStreamingWindow)

  streamingContext.checkpoint(myCheckPointDirectory)
  streamingContext.remember(Minutes(1))

  def stream() = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](streamingContext, myKafkaConf.mBrokers, myKafkaConf.mTopics)
}

object StreamingEnvironment {
  def apply(streamingWindow: Duration, checkpointDirectory: String) = {
    // set up sparkConf and kafkaConf
    new StreamingEnvironment(sparkConf, kafkaConf, streamingWindow, checkpointDirectory)
  }
}

When we lift a method into a function, a reference to the enclosing class becomes part of that function value, as in function(trackStateFunc _).
Declaring trackStateFunc directly as a function (i.e. as a val) will probably take care of the problem.
Also note that marking a class Serializable does not magically make it so. The DStream is not serializable and should be annotated @transient, which will probably solve the issue as well.
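As a rough sketch of both suggestions combined (the mapping logic below is a placeholder, not the original trackStateFunc body), assuming the same Spark Streaming types as in the question:
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

class StateManager(
    @transient private val inputStream: DStream[(String, String)],  // not serialized with the instance
    @transient private val initialState: RDD[(String, String)]
) extends Serializable {

  // A function value instead of a lifted method: no hidden $outer reference to StateManager
  val trackStateFunc: (String, Option[String], State[String]) => Option[(String, String)] =
    (key, value, _) => value.map(v => (key, v))  // placeholder logic

  lazy val stateSpec = StateSpec
    .function(trackStateFunc)
    .initialState(initialState)
    .timeout(Seconds(30))

  lazy val state = inputStream.mapWithState(stateSpec).map(_.get)
}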

Related

Scala Broadcast + UDF

I am trying to broadcast a List and pass the broadcast variable to a UDF (the Scala code is in a separate file), but I am facing issues.
val Lookup_BroadCast = SC.broadcast(lookup_data)
UDF creation with 3 arguments
val Call_Sub_Pgm = udf(foo(_: String, Lookup_BroadCast: org.apache.spark.broadcast.Broadcast[List[String]], Trace: String))
Calling the UDF using "withColumn"
Out_DF = Out_DF.withColumn("Col-1", Call_Sub_Pgm(col(Col-1),Lookup_BroadCast,lit(Trace)))
I am getting a compilation error for the above code - "found broadcast variable, required Sql Column".
If I remove the "Lookup_BroadCast" variable from the above,
Out_DF = Out_DF.withColumn("Col-1", Call_Sub_Pgm(col(Col-1), lit(Trace)))
then I get the following error:
java.lang.ClassCastException: org.spark.masking.ExtractData$$anonfun$7 cannot be cast to scala.Function0
A serializable wrapper class can be created for the function, with the Broadcast in its constructor:
class Wrapper(Lookup_BroadCast: Broadcast[List[String]]) extends Serializable {
  def foo(v: String, s: String): String = {
    // usage example
    Lookup_BroadCast.value.head
  }
}
And used like:
val wrapper = new Wrapper(Lookup_BroadCast)
val Call_Sub_Pgm = udf(wrapper.foo(_: String, _: String))
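For completeness, a hedged sketch of how the wrapped UDF might then be applied (the DataFrame and column names are the ones from the question); only Column arguments appear in the withColumn call, while the Broadcast travels inside the wrapper:
import org.apache.spark.sql.functions.{col, lit, udf}

val wrapper = new Wrapper(Lookup_BroadCast)
val Call_Sub_Pgm = udf(wrapper.foo(_: String, _: String))

// Only Column arguments are passed to the UDF; the broadcast is carried by `wrapper`
val result = Out_DF.withColumn("Col-1", Call_Sub_Pgm(col("Col-1"), lit(Trace)))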

Code working in Spark-Shell but not in Eclipse

I have a small piece of Scala code which works properly in the Spark shell but not in Eclipse with the Scala plugin. I can access HDFS from the plugin; I tried writing another file and it worked.
FirstSpark.scala
package bigdata.spark

import org.apache.spark.SparkConf
import java.io._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object FirstSpark {
  def main(args: Array[String]) = {
    val conf = new SparkConf().setMaster("local").setAppName("FirstSparkProgram")
    val sparkcontext = new SparkContext(conf)
    val textFile = sparkcontext.textFile("hdfs://pranay:8020/spark/linkage")
    val m = new Methods()
    val q = textFile.filter(x => !m.isHeader(x)).map(x => m.parse(x))
    q.saveAsTextFile("hdfs://pranay:8020/output")
  }
}
Methods.scala
package bigdata.spark

import java.util.function.ToDoubleFunction

class Methods {
  def isHeader(s: String): Boolean = {
    s.contains("id_1")
  }

  def parse(line: String) = {
    val pieces = line.split(',')
    val id1 = pieces(0).toInt
    val id2 = pieces(1).toInt
    val matches = pieces(11).toBoolean
    val mapArray = pieces.slice(2, 11).map(toDouble)
    MatchData(id1, id2, mapArray, matches)
  }

  def toDouble(s: String) = {
    if ("?".equals(s)) Double.NaN else s.toDouble
  }
}
case class MatchData(id1: Int, id2: Int, scores: Array[Double], matched: Boolean)
Error Message:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:335)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:334)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
Can anyone please help me with this?
Try changing class Methods { .. } to object Methods { .. }.
I think the problem is at val q =textFile.filter(x => !m.isHeader(x)).map(x=> m.parse(x)). When Spark sees the filter and map functions it tries to serialize the functions passed to them (x => !m.isHeader(x) and x=> m.parse(x)) so that it can dispatch the work of executing them to all of the executors (this is the Task referred to). However, to do this, it needs to serialize m, since this object is referenced inside the function (it is in the closure of the two anonymous methods) - but it cannot do this since Methods is not serializable. You could add extends Serializable to the Methods class, but in this case an object is more appropriate (and is already Serializable).
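A sketch of that change (same parsing logic, just an object, so the closure references a static singleton instead of serializing an instance):
package bigdata.spark

object Methods {
  def isHeader(s: String): Boolean = s.contains("id_1")

  def parse(line: String): MatchData = {
    val pieces = line.split(',')
    val id1 = pieces(0).toInt
    val id2 = pieces(1).toInt
    val matches = pieces(11).toBoolean
    val scores = pieces.slice(2, 11).map(toDouble)
    MatchData(id1, id2, scores, matches)
  }

  def toDouble(s: String): Double =
    if ("?".equals(s)) Double.NaN else s.toDouble
}

// In FirstSpark, drop `new Methods()` and reference the object directly:
// val q = textFile.filter(x => !Methods.isHeader(x)).map(Methods.parse)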

Decoupling non-serializable object to avoid Serialization error in Spark

The following class contains the main function which tries to read from Elasticsearch and prints the documents returned:
object TopicApp extends Serializable {
  def run() {
    val start = System.currentTimeMillis()

    val sparkConf = new Configuration()
    sparkConf.set("spark.executor.memory", "1g")
    sparkConf.set("spark.kryoserializer.buffer", "256")

    val es = new EsContext(sparkConf)

    val esConf = new Configuration()
    esConf.set("es.nodes", "localhost")
    esConf.set("es.port", "9200")
    esConf.set("es.resource", "temp_index/some_doc")
    esConf.set("es.query", "?q=*:*")
    esConf.set("es.fields", "_score,_id")

    val documents = es.documents(esConf)
    documents.foreach(println)

    val end = System.currentTimeMillis()
    println("Total time: " + (end - start) + " ms")

    es.shutdown()
  }

  def main(args: Array[String]) {
    run()
  }
}
The following class converts the returned documents to JSON using org.json4s:
class EsContext(sparkConf: HadoopConfig) extends SparkBase {
  private val sc = createSCLocal("ElasticContext", sparkConf)

  def documentsAsJson(esConf: HadoopConfig): RDD[String] = {
    implicit val formats = DefaultFormats
    val source = sc.newAPIHadoopRDD(
      esConf,
      classOf[EsInputFormat[Text, MapWritable]],
      classOf[Text],
      classOf[MapWritable]
    )
    val docs = source.map(
      hit => {
        val doc = Map("ident" -> hit._1.toString) ++ mwToMap(hit._2)
        write(doc)
      }
    )
    docs
  }

  def shutdown() = sc.stop()

  // mwToMap() converts MapWritable to Map
}
The following trait creates the local SparkContext for the application:
trait SparkBase extends Serializable {
  protected def createSCLocal(name: String, config: HadoopConfig): SparkContext = {
    val iterator = config.iterator()
    for (prop <- iterator) {
      val k = prop.getKey
      val v = prop.getValue
      if (k.startsWith("spark."))
        System.setProperty(k, v)
    }
    val runtime = Runtime.getRuntime
    runtime.gc()

    val conf = new SparkConf()
    conf.setMaster("local[2]")
    conf.setAppName(name)
    conf.set("spark.serializer", classOf[KryoSerializer].getName)
    conf.set("spark.ui.port", "0")

    new SparkContext(conf)
  }
}
When I run TopicApp I get the following errors:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:324)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:323)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.map(RDD.scala:323)
at TopicApp.EsContext.documents(EsContext.scala:51)
at TopicApp.TopicApp$.run(TopicApp.scala:28)
at TopicApp.TopicApp$.main(TopicApp.scala:39)
at TopicApp.TopicApp.main(TopicApp.scala)
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@14f70e7d)
- field (class: TopicApp.EsContext, name: sc, type: class org.apache.spark.SparkContext)
- object (class TopicApp.EsContext, TopicApp.EsContext@2cf77cdc)
- field (class: TopicApp.EsContext$$anonfun$documents$1, name: $outer, type: class TopicApp.EsContext)
- object (class TopicApp.EsContext$$anonfun$documents$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
... 13 more
Going through other posts that cover a similar issue, most of them recommend making the classes Serializable or separating the non-serializable objects from the classes.
From the error I got, I inferred that SparkContext, i.e. sc, is non-serializable, as SparkContext is not a serializable class.
How should I decouple the SparkContext so that the application runs correctly?
I can't run your program to be sure, but the general rule is not to create anonymous functions that refer to members of unserializable classes if they have to be executed on the RDD's data. In your case:
EsContext has a val of type SparkContext, which is (intentionally) not serializable
In the anonymous function passed to RDD.map in EsContext.documentsAsJson, you call another function of this EsContext instance (mwToMap) which forces Spark to serialize that instance, along with the SparkContext it holds
One possible solution would be removing mwToMap from the EsContext class (possibly into a companion object of EsContext - objects need not be serializable as they are static). If there are other methods of the same nature (write?) they'll have to be moved too. This would look something like:
import EsContext._

class EsContext(sparkConf: HadoopConfig) extends SparkBase {
  private val sc = createSCLocal("ElasticContext", sparkConf)

  def documentsAsJson(esConf: HadoopConfig): RDD[String] = { /* unchanged */ }
  def documents(esConf: HadoopConfig): RDD[EsDocument] = { /* unchanged */ }
  def shutdown() = sc.stop()
}

object EsContext {
  private def mwToMap(mw: MapWritable): Map[String, String] = { ... }
}
If moving these methods out isn't possible (i.e. if they require some of EsContext's members) - then consider separating the class that does the actual mapping from this context (which seems to be some kind of wrapper around the SparkContext - if that's what it is, that's all that it should be).
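For instance, a sketch of that separation (the DocumentMapper name and its mwToMap body are my own guesses, not the original code): a small serializable helper owns only the per-record logic, so the map closure captures it rather than EsContext.
import org.apache.hadoop.io.{MapWritable, Text}
import org.json4s.DefaultFormats
import org.json4s.jackson.Serialization.write
import scala.collection.JavaConverters._

// Holds everything a task needs and nothing it doesn't (in particular, no SparkContext)
class DocumentMapper extends Serializable {

  def toJson(id: Text, mw: MapWritable): String = {
    implicit val formats = DefaultFormats
    val doc = Map("ident" -> id.toString) ++ mwToMap(mw)
    write(doc)
  }

  private def mwToMap(mw: MapWritable): Map[String, String] =
    mw.entrySet().asScala.map(e => e.getKey.toString -> e.getValue.toString).toMap
}

// Inside EsContext.documentsAsJson, only the mapper is captured by the closure:
// val mapper = new DocumentMapper
// source.map { case (id, mw) => mapper.toJson(id, mw) }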

Spark: Task not serializable (Broadcast/RDD/SparkContext)

There are numerous questions about "Task not serializable" in Spark. However, this case seems quite particular.
I have created a class:
class Neighbours(e: RDD[E], m: KMeansModel) extends Serializable {
  val allEmbeddings: RDD[(String, E)] = e.map(e => (e.w, e))
    .persist()
  val sc = allEmbeddings.sparkContext
  val centroids = sc.broadcast(m.clusterCenters)
  [...]
The class defines the following method:
private def centroidDistances(v: Vector): Array[Double] = {
  centroids.value.map(c => (centroids.value.indexOf(c), Vectors.sqdist(v, c)))
    .sortBy(_._1)
    .map(_._2)
}
However, when the class is used, a "Task not serializable" exception is thrown.
Strangely enough, a tiny change in the header of class Neighbours suffices to fix the issue. Instead of creating a val sc: SparkContext to use for broadcasting, I merely inline the code that creates the Spark context:
class Neighbours(e: RDD[E], m: KMeansModel) extends Serializable {
  val allEmbeddings: RDD[(String, E)] = e.map(e => (e.w, e))
    .setName("embeddings")
    .persist()
  val centroids = allEmbeddings.sparkContext.broadcast(m.clusterCenters)
  [...]
My question is: how does the second variant make a difference? What goes wrong in the first one? Intuitively, this should be merely syntactic sugar. Is this a bug in Spark?
I use Spark 1.4.1 on a Hadoop/Yarn cluster.
When you define
class Neighbours(e: RDD[E], m: KMeansModel) extends Serializable {
  ...
  val sc = allEmbeddings.sparkContext
  val centroids = sc.broadcast(m.clusterCenters)
  ...
}
you have made sc into a class variable, meaning it can be accessed from an instance of Neighbours, e.g. neighbours.sc. This means that sc needs to be serializable, which it is not.
When you inline the code, only the final value of centroids needs to be serializable. centroids is of type Broadcast which is Serializable.
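If you'd rather keep a named reference than inline it, a sketch (reusing the question's own E and KMeansModel types) that keeps the SparkContext out of the serialized state:
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.rdd.RDD

class Neighbours(e: RDD[E], m: KMeansModel) extends Serializable {
  val allEmbeddings: RDD[(String, E)] = e.map(x => (x.w, x)).persist()

  // @transient: skipped during serialization; only ever used here, on the driver
  @transient private val sc = allEmbeddings.sparkContext

  // Broadcast itself is Serializable, so closures may capture `centroids` safely
  val centroids = sc.broadcast(m.clusterCenters)
}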

Enriching SparkContext without incurring in serialization issues

I am trying to use Spark to process data that comes from HBase tables. This blog post gives an example of how to use NewHadoopAPI to read data from any Hadoop InputFormat.
What I have done
Since I will need to do this many times, I was trying to use implicits to enrich SparkContext, so that I can get an RDD from a given set of columns in HBase. I have written the following helper:
trait HBaseReadSupport {
  implicit def toHBaseSC(sc: SparkContext) = new HBaseSC(sc)
  implicit def bytes2string(bytes: Array[Byte]) = new String(bytes)
}

final class HBaseSC(sc: SparkContext) extends Serializable {
  def extract[A](data: Map[String, List[String]], result: Result, interpret: Array[Byte] => A) =
    data map { case (cf, columns) =>
      val content = columns map { column =>
        val cell = result.getColumnLatestCell(cf.getBytes, column.getBytes)
        column -> interpret(CellUtil.cloneValue(cell))
      } toMap

      cf -> content
    }

  def makeConf(table: String) = {
    val conf = HBaseConfiguration.create()
    conf.setBoolean("hbase.cluster.distributed", true)
    conf.setInt("hbase.client.scanner.caching", 10000)
    conf.set(TableInputFormat.INPUT_TABLE, table)
    conf
  }

  def hbase[A](table: String, data: Map[String, List[String]])
    (interpret: Array[Byte] => A) =
    sc.newAPIHadoopRDD(makeConf(table), classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result]) map { case (key, row) =>
        Bytes.toString(key.get) -> extract(data, row, interpret)
      }
}
It can be used like
val rdd = sc.hbase[String](table, Map(
"cf" -> List("col1", "col2")
))
In this case we get an RDD of (String, Map[String, Map[String, String]]), where the first component is the row key and the second is a map whose keys are column families and whose values are maps from column names to cell values.
Where it fails
Unfortunately, it seems that my job gets a reference to sc, which is itself not serializable by design. What I get when I run the job is
Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
I can remove the helper classes and use the same logic inline in my job and everything runs fine. But I want to get something which I can reuse instead of writing the same boilerplate over and over.
By the way, the issue is not specific to implicits; even using a plain function of sc exhibits the same problem.
For comparison, the following helper to read TSV files (I know it's broken as it does not support quoting and so on, never mind) seems to work fine:
trait TsvReadSupport {
  implicit def toTsvRDD(sc: SparkContext) = new TsvRDD(sc)
}

final class TsvRDD(val sc: SparkContext) extends Serializable {
  def tsv(path: String, fields: Seq[String], separator: Char = '\t') = sc.textFile(path) map { line =>
    val contents = line.split(separator).toList
    (fields, contents).zipped.toMap
  }
}
How can I encapsulate the logic to read rows from HBase without unintentionally capturing the SparkContext?
Just add the @transient annotation to the sc variable:
final class HBaseSC(@transient val sc: SparkContext) extends Serializable {
  ...
}
and make sure sc is not used within the extract function, since it won't be available on the workers.
If it's necessary to access the Spark context from within a distributed computation, the rdd.context method can be used:
val rdd = sc.newAPIHadoopRDD(...)
rdd map {
  case (k, v) =>
    val ctx = rdd.context
    ....
}
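Putting it together, a rough usage sketch (table and column names are made up): with @transient in place, the SparkContext is touched only on the driver inside hbase, while extract runs on the executors without it.
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("hbase-read").setMaster("local[2]"))

val hbaseSC = new HBaseSC(sc)  // Serializable; the @transient sc field is dropped when tasks are shipped
val rdd = hbaseSC.hbase("my_table", Map("cf" -> List("col1", "col2")))(bytes => new String(bytes))

rdd.take(5).foreach(println)   // (rowkey, Map(columnFamily -> Map(column -> cellValue)))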