Spark: Task not serializable (Broadcast/RDD/SparkContext) - scala

There are numerous questions about the "Task not serializable" error in Spark. However, this case seems quite particular.
I have created a class:
class Neighbours(e: RDD[E], m: KMeansModel) extends Serializable {
  val allEs: RDD[(String, E)] = e.map(e => (e.w, e))
    .persist()
  val sc = allEs.sparkContext
  val centroids = sc.broadcast(m.clusterCenters)
  [...]
The class defines the following method:
private def centroidDistances(v: Vector): Array[Double] = {
  centroids.value.map(c => (centroids.value.indexOf(c), Vectors.sqdist(v, c)))
    .sortBy(_._1)
    .map(_._2)
}
However, when the class is called, a "Task not serializable" exception is thrown.
Strangely enough, a tiny change in the header of class Neighbours suffices to fix the issue. Instead of creating a val sc: SparkContext to use for broadcasting, I merely inline the code that obtains the SparkContext:
class Neighbours(e: RDD[E], m: KMeansModel) extends Serializable {
  val allEs: RDD[(String, E)] = e.map(e => (e.w, e))
    .setName("embeddings")
    .persist()
  val centroids = allEs.sparkContext.broadcast(m.clusterCenters)
  [...]
My question is: how does the second variant make a difference? What goes wrong in the first one? Intuitively, this should be merely syntactic sugar. Is this a bug in Spark?
I use Spark 1.4.1 on a Hadoop/Yarn cluster.

When you define
class Neighbours(e: RDD[E], m: KMeansModel) extends Serializable {
  ...
  val sc = allEs.sparkContext
  val centroids = sc.broadcast(m.clusterCenters)
  ...
}
You have made sc a class field, meaning it could be accessed from an instance of Neighbours, e.g. neighbours.sc. This means that sc needs to be serializable, which it is not.
When you inline the code, only the final value of centroids needs to be serializable. centroids is of type Broadcast which is Serializable.
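For reference, a minimal sketch of both ways to avoid capturing the SparkContext, assuming E is a simple case class with a w field (the real type is not shown in the question):

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.clustering.KMeansModel

// Placeholder element type; the real E from the question is not shown.
case class E(w: String)

class Neighbours(e: RDD[E], m: KMeansModel) extends Serializable {
  val allEs: RDD[(String, E)] = e.map(e => (e.w, e)).persist()

  // Option 1: do not keep the SparkContext as a field; only the
  // Broadcast value (which is Serializable) is stored on the instance.
  val centroids = allEs.sparkContext.broadcast(m.clusterCenters)

  // Option 2: if a SparkContext field is really wanted, mark it @transient
  // so it is skipped when the instance is serialized into a task closure.
  @transient private lazy val sc = allEs.sparkContext
}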

Related

How to load multiple files into multiple rdd?

I have multiple files that are independent and need to be processed by Spark. How can I load them into separate RDDs in parallel? Thanks!
The coding language is Scala.
If you want concurrent reading/processing of the RDDs, you could leverage scala.concurrent.Future (or effect types from ZIO, Cats Effect, etc.).
Sample code for the loading function is below:
import scala.concurrent.{ExecutionContext, Future}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

def load(paths: Seq[String], spark: SparkSession)
        (implicit ec: ExecutionContext): Seq[Future[RDD[String]]] = {
  def loadSinglePath(path: String): Future[RDD[String]] = Future {
    spark.sparkContext.textFile(path)
  }
  paths map loadSinglePath
}
Sample code for using this function:
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.DurationInt

val spark = SparkSession.builder.master("local[*]").getOrCreate()
implicit val ec = ExecutionContext.global

val result = load(Seq("t1.txt", "t2.txt", "t3.txt"), spark).zipWithIndex
  .map { case (rddFuture, idx) =>
    rddFuture.map(rdd =>
      println(s"RDD with index $idx has ${rdd.count()} lines")
    )
  }
Await.result(Future.sequence(result), 1.hour)
For example purposes, the default global ExecutionContext is used, but it can be configured to run your code in a custom one (just replace the implicit val ec with your own ExecutionContext).
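For instance, a dedicated ExecutionContext backed by a fixed thread pool could look like the sketch below (the pool size of 4 is an arbitrary choice for illustration; keep a reference to the underlying ExecutorService if you need to shut it down afterwards):

import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext

// A dedicated pool for the driver-side loading Futures, so they do not
// compete with other work scheduled on the global ExecutionContext.
implicit val loadEc: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))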

Nested Spark DataFrame when calling Vegas-viz .withDataFrame

Quick abstract:
I am trying to display multiple histograms from Spark DataFrames with Vegas-viz in Scala. I created a trait for building different types of histograms, and implemented classes extending it. When I create an instance of a child class, I get a NullPointerException, which makes me think there is a nested DataFrame somewhere.
Is there a workaround? Did I miss something and the error is something else?
Details:
Here is the trait:
trait Histogram {
  val rawdf: DataFrame
  val sparseDim: Seq[String]
  val name: String
  val xColumn: String
  val yColumn: String
  val group: DataFrame
  val plot: ExtendedUnitSpecBuilder = Vegas(name).
    withDataFrame(group).
    encodeX(
      field = xColumn,
      Quantitative,
      scale = Scale(ScaleType.Log),
      title = sparseDim.reduce((a, b) => a + ", " + b)
    ).
    encodeY(field = yColumn, Quantitative).
    mark(Bar)
  def show(): Unit = plot.show
}
And here is one of the classes extending it:
class HistogramCount(val rawdf: DataFrame,
                     val sparseDim: Seq[String],
                     val name: String = "Histogram Count") extends Histogram {
  val xColumn = "cube"
  val yColumn = "count"
  override val group: DataFrame = rawdf.
    select("VALUE", sparseDim: _*).
    groupBy(sparseDim.head, sparseDim.tail: _*).
    count().
    withColumnRenamed("count", "cube").
    groupBy("cube").
    count()
}
When I create an instance of the child class, the following error occurs:
Exception in thread "main" java.lang.NullPointerException
at <Pointing to .withDataFrame(group) in the trait>
I guess this is because the evaluation of group is lazy and it is only triggered by .withDataFrame(group) when plot is created.
I tried to evaluate the group DataFrame before calling plot with a val evaluate: Long = group.rdd.count(), but it does not solve the issue.
Solved it by making the variable plot lazy. Still not sure if that is the best way, though.
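The NullPointerException comes from initialisation order: the trait's body runs before the subclass constructor, so group is still null when an eager plot is built. Below is a sketch of the trait with only that change, assuming the usual vegas._ and vegas.sparkExt._ imports:

import org.apache.spark.sql.DataFrame
import vegas._
import vegas.sparkExt._ // for withDataFrame on Spark DataFrames

trait Histogram {
  val rawdf: DataFrame
  val sparseDim: Seq[String]
  val name: String
  val xColumn: String
  val yColumn: String
  val group: DataFrame

  // lazy defers building the plot until it is first used, i.e. after the
  // subclass constructor has initialised group, so Vegas no longer sees null.
  lazy val plot = Vegas(name).
    withDataFrame(group).
    encodeX(
      field = xColumn,
      Quantitative,
      scale = Scale(ScaleType.Log),
      title = sparseDim.reduce((a, b) => a + ", " + b)
    ).
    encodeY(field = yColumn, Quantitative).
    mark(Bar)

  def show(): Unit = plot.show
}

Declaring plot as a def would avoid the problem too, at the cost of rebuilding the plot on each access.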

State management not serializable

In my application, I want to keep track of multiple states. Thus I tried to encapsulate the whole state management logic within a class StateManager as follows:
@SerialVersionUID(xxxxxxxL)
class StateManager(
    inputStream: DStream[(String, String)],
    initialState: RDD[(String, String)]
) extends Serializable {
  lazy val state = inputStream.mapWithState(stateSpec).map(_.get)
  lazy val stateSpec = StateSpec
    .function(trackStateFunc _)
    .initialState(initialState)
    .timeout(Seconds(30))
  def trackStateFunc(key: String, value: Option[String], state: State[String]): Option[(String, String)] = {}
}
object StateManager {
  def apply(dstream: DStream[(String, String)], initialstate: RDD[(String, String)]) =
    new StateManager(dstream, initialstate)
}
The @SerialVersionUID(xxxxxxxL) ... extends Serializable is an attempt to solve my problem.
But when calling StateManager from my main class like the following:
val lStreamingEnvironment = StreamingEnvironment(streamingWindow, checkpointDirectory)
val stateManager = StateManager(lStreamingEnvironment.sparkContext, 1, None)
val state = stateManager.state(lKafkaStream)
state.foreachRDD(_.foreach(println))
(see below for StreamingEnvironment), I get:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
[...]
Caused by: java.io.NotSerializableException: Object of org.apache.spark.streaming.kafka.DirectKafkaInputDStream is being serialized possibly as a part of closure of an RDD operation. This is because the DStream object is being referred to from within the closure. Please rewrite the RDD operation inside this DStream to avoid this. This has been enforced to avoid bloating of Spark tasks with unnecessary objects.
The error is clear, but I still don't get at what point it is triggered.
Where is it triggered?
What could I do to solve this and have a reusable class?
The might-be-useful StreamingEnvironment class:
class StreamingEnvironment(mySparkConf: SparkConf, myKafkaConf: KafkaConf, myStreamingWindow: Duration, myCheckPointDirectory: String) {
  val sparkContext = SparkContext.getOrCreate(mySparkConf)
  lazy val streamingContext = new StreamingContext(sparkContext, myStreamingWindow)
  streamingContext.checkpoint(myCheckPointDirectory)
  streamingContext.remember(Minutes(1))

  def stream() = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](streamingContext, myKafkaConf.mBrokers, myKafkaConf.mTopics)
}
object StreamingEnvironment {
  def apply(streamingWindow: Duration, checkpointDirectory: String) = {
    //setup sparkConf and kafkaConf
    new StreamingEnvironment(sparkConf, kafkaConf, streamingWindow, checkpointDirectory)
  }
}
When we lift a method into a function, as in function(trackStateFunc _), the resulting function value keeps a reference to the enclosing instance of the parent class.
Declaring trackStateFunc directly as a function (i.e. as a val) will probably take care of the problem.
Also note that marking a class Serializable does not magically make it so. DStream is not serializable and should be annotated @transient, which will probably solve the issue as well.
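A sketch combining both suggestions; the mapping function body is only a stub here, since the real logic is not shown in the question:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

@SerialVersionUID(1L)
class StateManager(
    @transient private val inputStream: DStream[(String, String)],
    @transient private val initialState: RDD[(String, String)]
) extends Serializable {

  // A plain function value; unlike the eta-expansion trackStateFunc _,
  // it captures no reference to the enclosing StateManager instance.
  val trackStateFunc: (String, Option[String], State[String]) => Option[(String, String)] =
    (key, value, state) => None // stub; the real logic is not shown in the question

  lazy val stateSpec = StateSpec
    .function(trackStateFunc)
    .initialState(initialState)
    .timeout(Seconds(30))

  lazy val state = inputStream.mapWithState(stateSpec).map(_.get)
}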

Code working in Spark-Shell not in eclipse

I have a small piece of Scala code which works properly in the Spark shell but not in Eclipse with the Scala plugin. I can access HDFS from the plugin; I tried writing another file and it worked.
FirstSpark.scala
package bigdata.spark
import org.apache.spark.SparkConf
import java.io._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object FirstSpark {
  def main(args: Array[String]) = {
    val conf = new SparkConf().setMaster("local").setAppName("FirstSparkProgram")
    val sparkcontext = new SparkContext(conf)
    val textFile = sparkcontext.textFile("hdfs://pranay:8020/spark/linkage")
    val m = new Methods()
    val q = textFile.filter(x => !m.isHeader(x)).map(x => m.parse(x))
    q.saveAsTextFile("hdfs://pranay:8020/output")
  }
}
Methods.scala
package bigdata.spark
import java.util.function.ToDoubleFunction

class Methods {
  def isHeader(s: String): Boolean = {
    s.contains("id_1")
  }
  def parse(line: String) = {
    val pieces = line.split(',')
    val id1 = pieces(0).toInt
    val id2 = pieces(1).toInt
    val matches = pieces(11).toBoolean
    val mapArray = pieces.slice(2, 11).map(toDouble)
    MatchData(id1, id2, mapArray, matches)
  }
  def toDouble(s: String) = {
    if ("?".equals(s)) Double.NaN else s.toDouble
  }
}

case class MatchData(id1: Int, id2: Int,
                     scores: Array[Double], matched: Boolean)
Error Message:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:335)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:334)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
Can anyone please help me with this?
Try changing class Methods { .. } to object Methods { .. }.
I think the problem is at val q =textFile.filter(x => !m.isHeader(x)).map(x=> m.parse(x)). When Spark sees the filter and map functions it tries to serialize the functions passed to them (x => !m.isHeader(x) and x=> m.parse(x)) so that it can dispatch the work of executing them to all of the executors (this is the Task referred to). However, to do this, it needs to serialize m, since this object is referenced inside the function (it is in the closure of the two anonymous methods) - but it cannot do this since Methods is not serializable. You could add extends Serializable to the Methods class, but in this case an object is more appropriate (and is already Serializable).
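For illustration, a sketch of what the object version might look like (the method bodies are unchanged from the question):

package bigdata.spark

// A singleton object is referenced statically from the closure, so no
// driver-side instance has to be serialized and shipped with the task.
object Methods {
  def isHeader(s: String): Boolean = s.contains("id_1")

  def parse(line: String): MatchData = {
    val pieces = line.split(',')
    val id1 = pieces(0).toInt
    val id2 = pieces(1).toInt
    val matches = pieces(11).toBoolean
    val mapArray = pieces.slice(2, 11).map(toDouble)
    MatchData(id1, id2, mapArray, matches)
  }

  def toDouble(s: String): Double =
    if ("?".equals(s)) Double.NaN else s.toDouble
}

The driver code then drops val m = new Methods() and calls the object directly, e.g. val q = textFile.filter(x => !Methods.isHeader(x)).map(x => Methods.parse(x)).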

Enriching SparkContext without incurring in serialization issues

I am trying to use Spark to process data that comes from HBase tables. This blog post gives an example of how to use NewHadoopAPI to read data from any Hadoop InputFormat.
What I have done
Since I will need to do this many times, I was trying to use implicits to enrich SparkContext, so that I can get an RDD from a given set of columns in HBase. I have written the following helper:
trait HBaseReadSupport {
  implicit def toHBaseSC(sc: SparkContext) = new HBaseSC(sc)
  implicit def bytes2string(bytes: Array[Byte]) = new String(bytes)
}

final class HBaseSC(sc: SparkContext) extends Serializable {
  def extract[A](data: Map[String, List[String]], result: Result, interpret: Array[Byte] => A) =
    data map { case (cf, columns) =>
      val content = columns map { column =>
        val cell = result.getColumnLatestCell(cf.getBytes, column.getBytes)
        column -> interpret(CellUtil.cloneValue(cell))
      } toMap

      cf -> content
    }

  def makeConf(table: String) = {
    val conf = HBaseConfiguration.create()
    conf.setBoolean("hbase.cluster.distributed", true)
    conf.setInt("hbase.client.scanner.caching", 10000)
    conf.set(TableInputFormat.INPUT_TABLE, table)
    conf
  }

  def hbase[A](table: String, data: Map[String, List[String]])
              (interpret: Array[Byte] => A) =
    sc.newAPIHadoopRDD(makeConf(table), classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result]) map { case (key, row) =>
        Bytes.toString(key.get) -> extract(data, row, interpret)
    }
}
It can be used like this:
val rdd = sc.hbase[String](table, Map(
"cf" -> List("col1", "col2")
))
In this case we get an RDD of (String, Map[String, Map[String, String]]), where the first component is the rowkey and the second is a map whose keys are column families and whose values are maps whose keys are columns and whose contents are the cell values.
Where it fails
Unfortunately, it seems that my job gets a reference to sc, which is itself not serializable by design. What I get when I run the job is
Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
I can remove the helper classes and use the same logic inline in my job and everything runs fine. But I want to get something which I can reuse instead of writing the same boilerplate over and over.
By the way, the issue is not specific to implicits; even using a plain function of sc exhibits the same problem.
For comparison, the following helper to read TSV files (I know it's broken as it does not support quoting and so on, never mind) seems to work fine:
trait TsvReadSupport {
  implicit def toTsvRDD(sc: SparkContext) = new TsvRDD(sc)
}

final class TsvRDD(val sc: SparkContext) extends Serializable {
  def tsv(path: String, fields: Seq[String], separator: Char = '\t') = sc.textFile(path) map { line =>
    val contents = line.split(separator).toList
    (fields, contents).zipped.toMap
  }
}
How can I encapsulate the logic to read rows from HBase without unintentionally capturing the SparkContext?
Just add the @transient annotation to the sc variable:
final class HBaseSC(#transient val sc: SparkContext) extends Serializable {
...
}
and make sure sc is not used within the extract function, since it won't be available on the workers.
If it's necessary to access the Spark context from within a distributed computation, the rdd.context method can be used:
val rdd = sc.newAPIHadoopRDD(...)
rdd map {
case (k, v) =>
val ctx = rdd.context
....
}