Scala Test SparkException: Task not serializable

I'm new to Scala and Spark. I wrote a simple test class and have been stuck on this issue for the whole day. Please find the code below.
A.scala
class A(val key: String) extends Serializable {
  def getKey(): String = this.key
}
B.scala
class B(val key: String) extends Serializable {
  def getKey(): String = this.key
}
Test.scala
import com.holdenkarau.spark.testing.{RDDComparisons, SharedSparkContext}
import org.scalatest.FunSuite
import org.scalatest.BeforeAndAfter

class Test extends FunSuite with SharedSparkContext with RDDComparisons with BeforeAndAfter with Serializable {
  //comment this
  private[this] val b1 = new B("test1")

  test("Test RDD") {
    val a1 = new A("test1")
    val a2 = new A("test2")
    val expected = sc.parallelize(Seq(a1, a2))
    println(b1.getKey())
    //val b1 = new B("test1")
    //val key1: String = b1.getKey()
    expected.foreach { a =>
      //if (a.getKey().equalsIgnoreCase(key1))
      if (a.getKey().equalsIgnoreCase(b1.getKey()))
        print("hi")
    }
  }
}
This code throws the following exception:
Task not serializable
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:393)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:926)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:925)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.foreach(RDD.scala:925)
at com.adgear.adata.hhid.Test$$anonfun$1.apply$mcV$sp(Test.scala:19)
at com.adgear.adata.hhid.Test$$anonfun$1.apply(Test.scala:11)
at com.adgear.adata.hhid.Test$$anonfun$1.apply(Test.scala:11)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
at org.scalatest.TestSuite$class.withFixture(TestSuite.scala:196)
at org.scalatest.FunSuite.withFixture(FunSuite.scala:1560)
at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
at org.scalatest.FunSuite.runTest(FunSuite.scala:1560)
at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
at org.scalatest.Suite$class.run(Suite.scala:1147)
at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233)
at com.adgear.adata.hhid.Test.org$scalatest$BeforeAndAfterAll$$super$run(Test.scala:7)
at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213)
at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
at com.adgear.adata.hhid.Test.run(Test.scala:7)
When I comment out the class-level declaration of b1 and declare it inside the test method instead, "if(a.getKey().equalsIgnoreCase(b1.getKey()))" works. If I keep the class-level b1 definition, the same line throws the exception above. To work around it, I have to dereference b1 first, as in the commented-out lines "val key1 :String = b1.getKey()" and "if(a.getKey().equalsIgnoreCase(key1))"; then it works.
As one can see, A, B, and Test all implement Serializable, yet I still get this exception. What is causing this issue?
Thanks

Declaring a class as Serializable doesn't mean that it can be serialized unless all of its fields are Serializable as well.
Since your Test class extends FunSuite, it has an "assertionsHelper" field which is not Serializable. So when you reference the "b1" field in your "foreach" closure, Spark tries to serialize the Test instance along with all of its fields (including the assertionsHelper).
If you want to avoid this, you'll have to either define b1 somewhere else (in the test method scope or a companion object), or dereference b1 into a new variable before using it in the foreach function:
val b1_ref = b1
expected.foreach { a =>
  if (a.getKey().equalsIgnoreCase(b1_ref.getKey()))
    print("hi")
}
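For the companion-object option mentioned above, here is a minimal sketch (reusing the names from the question; members of an object are referenced statically, so the closure no longer needs to capture the suite instance):
object Test {
  // Held by the companion object, not the FunSuite instance,
  // so Spark never tries to serialize the suite.
  private val b1 = new B("test1")
}

class Test extends FunSuite with SharedSparkContext with RDDComparisons {
  test("Test RDD") {
    val expected = sc.parallelize(Seq(new A("test1"), new A("test2")))
    expected.foreach { a =>
      if (a.getKey().equalsIgnoreCase(Test.b1.getKey()))
        print("hi")
    }
  }
}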
PS: When you encounter a serialization exception, you usually have access to the "serialization stack" in the logs, which tells you exactly which object caused the error.

Related

How to use Scala Lists together with JAXB?

Short version
How do I have to annotate the list property in
case class Foo(list: List[String])
to be able to serialize it with JAXB?
Long version
Here is what I tried so far. I annotated the property with an @XmlElements and an @XmlJavaTypeAdapter annotation, see class ScalaFoo. But the ScalaFoo serialization fails with the exception below, while JavaFoo works.
import java.util
import javax.xml.bind.annotation._
import javax.xml.bind.annotation.adapters._
import ListTest.{xmlElement, xmlElements, xmlTypeAdapter}
import scala.annotation.meta.field
import scala.collection.JavaConverters._

object ListTest extends App {
  type xmlElement = XmlElement @field
  type xmlElements = XmlElements @field
  type xmlTypeAdapter = XmlJavaTypeAdapter @field

  import com.sun.jersey.api.json._
  val config = JSONConfiguration.natural().rootUnwrapping(true).build()
  val context = new JSONJAXBContext(config, classOf[JavaFoo], classOf[ScalaFoo])
  val javaFoo = JavaFoo(util.Arrays.asList("a"))
  val scalaFoo = ScalaFoo(List("a"))
  context.createJSONMarshaller().marshallToJSON(javaFoo, System.out)
  context.createJSONMarshaller().marshallToJSON(scalaFoo, System.out)
}

class ListAdapter[A] extends XmlAdapter[java.util.List[A], List[A]] {
  def marshal(v: List[A]): util.List[A] = new util.ArrayList(v.asJava)
  def unmarshal(v: java.util.List[A]): List[A] = v.asScala.toList
}

@XmlRootElement(name = "foo")
@XmlAccessorType(XmlAccessType.FIELD)
case class JavaFoo(@xmlElements(value = Array(new xmlElement(`type` = classOf[String])))
                   list: java.util.List[String]) {
  private def this() = this(null)
}

@XmlRootElement(name = "foo")
@XmlAccessorType(XmlAccessType.FIELD)
case class ScalaFoo(@xmlTypeAdapter(classOf[ListAdapter[String]])
                    @xmlElements(value = Array(new xmlElement(`type` = classOf[String])))
                    list: List[String]) {
  private def this() = this(null)
}
Exception:
Exception in thread "main" javax.xml.bind.MarshalException
- with linked exception:
[com.sun.istack.SAXException2: class java.util.ArrayList nor any of its super class is known to this context.
javax.xml.bind.JAXBException: class java.util.ArrayList nor any of its super class is known to this context.]
at com.sun.xml.bind.v2.runtime.MarshallerImpl.write(MarshallerImpl.java:323)
at com.sun.xml.bind.v2.runtime.MarshallerImpl.marshal(MarshallerImpl.java:177)
at com.sun.jersey.json.impl.JSONMarshallerImpl.marshal(JSONMarshallerImpl.java:94)
at com.sun.jersey.json.impl.BaseJSONMarshaller.marshallToJSON(BaseJSONMarshaller.java:106)
at com.sun.jersey.json.impl.BaseJSONMarshaller.marshallToJSON(BaseJSONMarshaller.java:94)
at ListTest$.delayedEndpoint$ListTest$1(ListTest.scala:26)
at ListTest$delayedInit$body.apply(ListTest.scala:12)
at scala.Function0.apply$mcV$sp(Function0.scala:34)
at scala.Function0.apply$mcV$sp$(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App.$anonfun$main$1$adapted(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:389)
at scala.App.main(App.scala:76)
at scala.App.main$(App.scala:74)
at ListTest$.main(ListTest.scala:12)
at ListTest.main(ListTest.scala)
Caused by: com.sun.istack.SAXException2: class java.util.ArrayList nor any of its super class is known to this context.
javax.xml.bind.JAXBException: class java.util.ArrayList nor any of its super class is known to this context.
at com.sun.xml.bind.v2.runtime.XMLSerializer.reportError(XMLSerializer.java:250)
at com.sun.xml.bind.v2.runtime.XMLSerializer.reportError(XMLSerializer.java:265)
at com.sun.xml.bind.v2.runtime.XMLSerializer.childAsXsiType(XMLSerializer.java:657)
at com.sun.xml.bind.v2.runtime.property.SingleElementNodeProperty.serializeBody(SingleElementNodeProperty.java:153)
at com.sun.xml.bind.v2.runtime.ClassBeanInfoImpl.serializeBody(ClassBeanInfoImpl.java:344)
at com.sun.xml.bind.v2.runtime.XMLSerializer.childAsSoleContent(XMLSerializer.java:597)
at com.sun.xml.bind.v2.runtime.ClassBeanInfoImpl.serializeRoot(ClassBeanInfoImpl.java:328)
at com.sun.xml.bind.v2.runtime.XMLSerializer.childAsRoot(XMLSerializer.java:498)
at com.sun.xml.bind.v2.runtime.MarshallerImpl.write(MarshallerImpl.java:320)
... 15 more
Caused by: javax.xml.bind.JAXBException: class java.util.ArrayList nor any of its super class is known to this context.
at com.sun.xml.bind.v2.runtime.JAXBContextImpl.getBeanInfo(JAXBContextImpl.java:611)
at com.sun.xml.bind.v2.runtime.XMLSerializer.childAsXsiType(XMLSerializer.java:652)
... 21 more

Why does custom DefaultSource give java.io.NotSerializableException?

This is my first post on SO, and my apologies if improper formatting is used.
I'm working with Apache Spark to create a new source (via DefaultSource), BaseRelations, etc., and I have run into a problem with serialization that I would like to understand better. Consider the class below, which extends BaseRelation and implements the scan builder.
class RootTableScan(path: String, treeName: String)(@transient val sqlContext: SQLContext) extends BaseRelation with PrunedFilteredScan {

  private val att: core.SRType = {
    val reader = new RootFileReader(new java.io.File(Seq(path).head))
    val tmp =
      if (treeName == null)
        buildATT(findTree(reader.getTopDir), arrangeStreamers(reader), null)
      else
        buildATT(reader.getKey(treeName).getObject.asInstanceOf[TTree],
          arrangeStreamers(reader), null)
    tmp
  }

  // define the schema from the AST
  def schema: StructType = {
    val s = buildSparkSchema(att)
    s
  }

  // builds a scan
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // parallelize over all the files
    val r = sqlContext.sparkContext.parallelize(Seq(path), 1).
      flatMap({ fileName =>
        val reader = new RootFileReader(new java.io.File(fileName))
        // get the TTree
        /* PROBLEM !!! */
        val rootTree =
          // findTree(reader)
          if (treeName == null) findTree(reader)
          else reader.getKey(treeName).getObject.asInstanceOf[TTree]
        new RootTreeIterator(rootTree, arrangeStreamers(reader),
          requiredColumns, filters)
      })
    println("Done building Scan")
    r
  }
}
PROBLEM identifies where the issue happens. treeName is a val that gets injected into the class through the constructor. The lambda that uses it is supposed to be executed on the slave, and I do need to send the treeName, i.e. serialize it. I would like to understand why exactly the code snippet below causes this NotSerializableException. I know for sure that without treeName in it, it works just fine:
val rootTree =
  // findTree(reader)
  if (treeName == null) findTree(reader)
  else reader.getKey(treeName).getObject.asInstanceOf[TTree]
Below is the stack trace:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2056)
at org.apache.spark.rdd.RDD$$anonfun$flatMap$1.apply(RDD.scala:375)
at org.apache.spark.rdd.RDD$$anonfun$flatMap$1.apply(RDD.scala:374)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.flatMap(RDD.scala:374)
at org.dianahep.sparkroot.package$RootTableScan.buildScan(sparkroot.scala:95)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$8.apply(DataSourceStrategy.scala:260)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$8.apply(DataSourceStrategy.scala:260)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:303)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:302)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:379)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:298)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:256)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:61)
at org.apache.spark.sql.execution.SparkPlanner.plan(SparkPlanner.scala:47)
at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:51)
at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:48)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48)
at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:78)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:76)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:83)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:83)
at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2572)
at org.apache.spark.sql.Dataset.head(Dataset.scala:1934)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2149)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:239)
at org.apache.spark.sql.Dataset.show(Dataset.scala:526)
at org.apache.spark.sql.Dataset.show(Dataset.scala:486)
at org.apache.spark.sql.Dataset.show(Dataset.scala:495)
... 50 elided
Caused by: java.io.NotSerializableException: org.dianahep.sparkroot.package$RootTableScan
Serialization stack:
- object not serializable (class: org.dianahep.sparkroot.package$RootTableScan, value: org.dianahep.sparkroot.package$RootTableScan@6421e9e7)
- field (class: org.dianahep.sparkroot.package$RootTableScan$$anonfun$1, name: $outer, type: class org.dianahep.sparkroot.package$RootTableScan)
- object (class org.dianahep.sparkroot.package$RootTableScan$$anonfun$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
From the stack I think I can deduce that it tries to serialize my lambda and cannot. This lambda is a closure, since it uses a val that is defined outside of the lambda's scope. But I don't understand why it cannot be serialized.
Any help would be really appreciated!!!
Thanks a lot!
Any time a Scala closure references a class variable, like treeName, the JVM serializes the parent class along with the closure. Your class RootTableScan is not serializable, though! The solution is to create a local string variable:
// builds a scan
def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
  val localTreeName = treeName // this is safe to serialize
  // parallelize over all the files
  val r = sqlContext.sparkContext.parallelize(Seq(path), 1).
    flatMap({ fileName =>
      val reader = new RootFileReader(new java.io.File(fileName))
      // get the TTree
      val rootTree =
        if (localTreeName == null) findTree(reader)
        else reader.getKey(localTreeName).getObject.asInstanceOf[TTree]
      new RootTreeIterator(rootTree, arrangeStreamers(reader),
        requiredColumns, filters)
    })
  r
}
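The rule can be seen in isolation with a minimal, hypothetical example (the Holder class below is illustrative, not from the original code):
import org.apache.spark.rdd.RDD

class Holder(val name: String) { // not Serializable
  // Fails at runtime: `name` desugars to `this.name`,
  // so the closure captures the whole Holder instance.
  def bad(rdd: RDD[String]): RDD[Int] =
    rdd.map(s => (s + name).length)

  // Works: the closure captures only the local String copy.
  def good(rdd: RDD[String]): RDD[Int] = {
    val localName = name
    rdd.map(s => (s + localName).length)
  }
}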

Decoupling non-serializable object to avoid Serialization error in Spark

The following class contains the main function which tries to read from Elasticsearch and prints the documents returned:
object TopicApp extends Serializable {
  def run() {
    val start = System.currentTimeMillis()

    val sparkConf = new Configuration()
    sparkConf.set("spark.executor.memory", "1g")
    sparkConf.set("spark.kryoserializer.buffer", "256")
    val es = new EsContext(sparkConf)

    val esConf = new Configuration()
    esConf.set("es.nodes", "localhost")
    esConf.set("es.port", "9200")
    esConf.set("es.resource", "temp_index/some_doc")
    esConf.set("es.query", "?q=*:*")
    esConf.set("es.fields", "_score,_id")

    val documents = es.documents(esConf)
    documents.foreach(println)

    val end = System.currentTimeMillis()
    println("Total time: " + (end - start) + " ms")
    es.shutdown()
  }

  def main(args: Array[String]) {
    run()
  }
}
The following class converts the returned documents to JSON using org.json4s:
class EsContext(sparkConf: HadoopConfig) extends SparkBase {
  private val sc = createSCLocal("ElasticContext", sparkConf)

  def documentsAsJson(esConf: HadoopConfig): RDD[String] = {
    implicit val formats = DefaultFormats
    val source = sc.newAPIHadoopRDD(
      esConf,
      classOf[EsInputFormat[Text, MapWritable]],
      classOf[Text],
      classOf[MapWritable]
    )
    val docs = source.map(
      hit => {
        val doc = Map("ident" -> hit._1.toString) ++ mwToMap(hit._2)
        write(doc)
      }
    )
    docs
  }

  def shutdown() = sc.stop()

  // mwToMap() converts MapWritable to Map
}
The following trait creates the local SparkContext for the application:
trait SparkBase extends Serializable {
  protected def createSCLocal(name: String, config: HadoopConfig): SparkContext = {
    val iterator = config.iterator()
    for (prop <- iterator) {
      val k = prop.getKey
      val v = prop.getValue
      if (k.startsWith("spark."))
        System.setProperty(k, v)
    }
    val runtime = Runtime.getRuntime
    runtime.gc()

    val conf = new SparkConf()
    conf.setMaster("local[2]")
    conf.setAppName(name)
    conf.set("spark.serializer", classOf[KryoSerializer].getName)
    conf.set("spark.ui.port", "0")
    new SparkContext(conf)
  }
}
When I run TopicApp I get the following error:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:324)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:323)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.map(RDD.scala:323)
at TopicApp.EsContext.documents(EsContext.scala:51)
at TopicApp.TopicApp$.run(TopicApp.scala:28)
at TopicApp.TopicApp$.main(TopicApp.scala:39)
at TopicApp.TopicApp.main(TopicApp.scala)
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@14f70e7d)
- field (class: TopicApp.EsContext, name: sc, type: class org.apache.spark.SparkContext)
- object (class TopicApp.EsContext, TopicApp.EsContext@2cf77cdc)
- field (class: TopicApp.EsContext$$anonfun$documents$1, name: $outer, type: class TopicApp.EsContext)
- object (class TopicApp.EsContext$$anonfun$documents$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
... 13 more
Going through other posts that cover similar issues, most recommend making the classes Serializable or separating the non-serializable objects from the classes.
From the error I got, I inferred that sc is the problem: SparkContext is not a serializable class.
How should I decouple SparkContext so that the application runs correctly?
I can't run your program to be sure, but the general rule is not to create anonymous functions that refer to members of unserializable classes if they have to be executed on the RDD's data. In your case:
EsContext has a val of type SparkContext, which is (intentionally) not serializable
In the anonymous function passed to RDD.map in EsContext.documentsAsJson, you call another method of this EsContext instance (mwToMap), which forces Spark to serialize that instance, along with the SparkContext it holds
One possible solution would be removing mwToMap from the EsContext class (possibly into a companion object of EsContext; objects need not be serializable, as they are effectively static). If there are other methods of the same nature (write?), they'll have to be moved too. This would look something like:
import EsContext._

class EsContext(sparkConf: HadoopConfig) extends SparkBase {
  private val sc = createSCLocal("ElasticContext", sparkConf)
  def documentsAsJson(esConf: HadoopConfig): RDD[String] = { /* unchanged */ }
  def documents(esConf: HadoopConfig): RDD[EsDocument] = { /* unchanged */ }
  def shutdown() = sc.stop()
}

object EsContext {
  private def mwToMap(mw: MapWritable): Map[String, String] = { ... }
}
If moving these methods out isn't possible (i.e. if they require some of EsContext's members), then consider separating the class that does the actual mapping from this context (which seems to be some kind of wrapper around the SparkContext; if that's what it is, that's all it should be).
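As one hedged sketch of that separation (DocumentMapper is an illustrative name, and json4s-jackson on the classpath is assumed), the per-record logic can live in a small Serializable class that holds no SparkContext:
import org.json4s.DefaultFormats
import org.json4s.jackson.Serialization.write

class DocumentMapper extends Serializable {
  // Pure per-record logic; safe to ship to executors.
  def toJson(ident: String, fields: Map[String, String]): String = {
    implicit val formats = DefaultFormats
    write(Map("ident" -> ident) ++ fields)
  }
}
documentsAsJson would then create one DocumentMapper outside the map call and use only that instance inside the closure, so Spark serializes the small mapper instead of EsContext.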

Scala Play no application started when grabbing data sources from application.conf

I am trying to read data sources from my application.conf file, but every time I run my server or try to run test cases, I get an error saying that there is no started application.
Here is an example of what I am trying to do:
Unit test that is trying to read a property from my application.conf
class DbConfigWebUnitTest extends PlaySpec with OneAppPerSuite {
  implicit override lazy val app: FakeApplication = FakeApplication(
    additionalConfiguration = Map("db.test.url" -> "jdbc:postgresql://localhost:5432/suredbitswebtest",
      "db.test.user" -> "postgres", "db.test.password" -> "postgres", "db.test.driver" -> "org.postgresql.Driver"))

  val dbManagementWeb = new DbManagementWeb with DbConfigWeb with DbTestQualifier

  "DbConfigWebTest" must {
    "have the same username as what is defined in application.conf" in {
      dbManagementWeb.username must be("postgres")
    }
  }
}
Here is my DbConfigWeb
import play.api.Play.current

trait DbConfigWeb extends DbConfig { qualifier: DbQualifier =>
  val url: String = current.configuration.getString(qualifier + ".url").get
  val username: String = current.configuration.getString(qualifier + ".user").get
  val password: String = current.configuration.getString(qualifier + ".password").get
  val driver: String = current.configuration.getString(qualifier + ".driver").get
  override def database: DatabaseDef = JdbcBackend.Database.forURL(url, username, password, null, driver)
  override implicit val session = database createSession
}

trait DbQualifier {
  val qualifier: String
}

trait DbProductionQualifier extends DbQualifier {
  override val qualifier = "db.production"
}

trait DbTestQualifier extends DbQualifier {
  override val qualifier = "db.test"
}
and lastly here is my stack trace:
[suredbits-web] $ last test:test
[debug] Forking tests - parallelism = false
[debug] Create a single-thread test executor
[debug] Runner for sbt.FrameworkWrapper produced 0 initial tasks for 0 tests.
[debug] Runner for org.scalatest.tools.Framework produced 2 initial tasks for 2 tests.
[debug] Running TaskDef(com.suredbits.web.db.DbConfigWebUnitTest, sbt.ForkMain$SubclassFingerscan@48687c55, false, [SuiteSelector])
[error] Uncaught exception when running com.suredbits.web.db.DbConfigWebUnitTest: java.lang.RuntimeException: There is no started application
sbt.ForkMain$ForkError: There is no started application
at scala.sys.package$.error(package.scala:27)
at play.api.Play$$anonfun$current$1.apply(Play.scala:71)
at play.api.Play$$anonfun$current$1.apply(Play.scala:71)
at scala.Option.getOrElse(Option.scala:120)
at play.api.Play$.current(Play.scala:71)
at com.suredbits.web.db.DbConfigWeb$class.$init$(DbConfigWebProduction.scala:14)
at com.suredbits.web.db.DbConfigWebUnitTest$$anon$1.<init>(DbConfigWebUnitTest.scala:14)
at com.suredbits.web.db.DbConfigWebUnitTest.<init>(DbConfigWebUnitTest.scala:14)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at java.lang.Class.newInstance(Class.java:379)
at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:641)
at sbt.ForkMain$Run$2.call(ForkMain.java:294)
at sbt.ForkMain$Run$2.call(ForkMain.java:284)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
I think the key problem is that vals in Scala traits are initialized at construction time, which is prior to the test Play application being started (presumably its lifecycle is tied to each spec example). You have a couple of workarounds:
make everything in DbConfigWeb a def or perhaps a lazy val
give DbConfigWeb an abstract play.api.Application field from which to extract the config values (rather than using current) and pass it explicitly (the fake application) to whatever DbManagementWeb is as a constructor parameter
Here's a simplified version, using the first approach (which works for me):
import play.api.Play.current

trait DbConfig

trait DbConfigWeb extends DbConfig {
  self: DbQualifier =>

  // Using defs instead of vals
  def url: String = current.configuration.getString(qualifier + ".url").get
  def username: String = current.configuration.getString(qualifier + ".user").get
  def password: String = current.configuration.getString(qualifier + ".password").get
  def driver: String = current.configuration.getString(qualifier + ".driver").get
}

trait DbQualifier {
  val qualifier: String
}

trait DbTestQualifier extends DbQualifier {
  override val qualifier = "db.test"
}
and the spec:
import controllers.{DbConfigWeb, DbTestQualifier}
import org.scalatestplus.play.{OneAppPerSuite, PlaySpec}
import play.api.test.FakeApplication

class DbConfigTest extends PlaySpec with OneAppPerSuite {
  implicit override lazy val app: FakeApplication = FakeApplication(
    additionalConfiguration = Map("db.test.url" -> "jdbc:h2:mem:play",
      "db.test.user" -> "sa", "db.test.password" -> "", "db.test.driver" -> "org.h2.Driver"))

  val dbManagementWeb = new DbConfigWeb with DbTestQualifier

  "DbConfigWebTest" must {
    "have the same username as what is defined in application.conf" in {
      dbManagementWeb.username must be("sa")
    }
  }
}
Personally I prefer the second approach, which keeps the application state passed around explicitly rather than relying on play.api.Play.current, which you cannot count on always being started; a sketch of it follows.
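A minimal sketch of that second approach (the abstract application member is an assumed name, not from the original code):
trait DbConfigWeb extends DbConfig {
  self: DbQualifier =>

  // Supplied explicitly: by the real app in production,
  // or by the FakeApplication in tests.
  def application: play.api.Application

  def url: String = application.configuration.getString(qualifier + ".url").get
  def username: String = application.configuration.getString(qualifier + ".user").get
}
In the spec, the test would then wire it up explicitly, e.g. new DbConfigWeb with DbTestQualifier { override def application = app }.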
You mentioned in the comments that lazy vals were not working for you but I can only conjecture that some chain of calls was forcing initialization: check again that this isn't the case.
Note also that the order of initialization for vals can be complex and, while some might disagree, it's a pretty safe bet to stick to defs as trait members unless you're sure the member is some expensive operation (in which case a lazy val might be an option); the snippet below illustrates the timing difference.
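A small self-contained illustration of that timing difference (not from the original post):
trait Eager { val x: String = { println("initialized during construction"); "x" } }
trait Lazy  { lazy val x: String = { println("initialized on first access"); "x" } }

object InitDemo extends App {
  val e = new Eager {} // prints immediately, during construction
  val l = new Lazy {}  // prints nothing yet
  l.x                  // prints now
}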

Even trivial serialization examples in Scala don't work. Why?

I am trying the simplest possible serialization examples of a class:
@serializable class Person(age: Int) {}
val fred = new Person(45)
import java.io._
val out = new ObjectOutputStream(new FileOutputStream("test.obj"))
out.writeObject(fred)
out.close()
This throws the exception "java.io.NotSerializableException: Main$$anon$1$Person". Why?
Is there a simple serialization example?
I also tried
@serializable class Person(nm: String) {
  private val name: String = nm
}
val fred = new Person("Fred")
...
and tried to remove @serializable and some other permutations. The file "test.obj" is created, over 2 KB in size, and has plausible contents.
EDIT:
Reading the "test.obj" back in (from the 2nd answer below) causes
Welcome to Scala version 2.10.3 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51).
Type in expressions to have them evaluated. Type :help for more information.

scala> import java.io._
import java.io._

scala> val fis = new FileInputStream( "test.obj" )
fis: java.io.FileInputStream = java.io.FileInputStream@716ad1b3

scala> val oin = new ObjectInputStream( fis )
oin: java.io.ObjectInputStream = java.io.ObjectInputStream@1f927f0a

scala> val p = oin.readObject
java.io.WriteAbortedException: writing aborted; java.io.NotSerializableException: Main$$anon$1
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1354)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at .(:12)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:734)
at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:983)
at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:573)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:604)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:568)
at scala.tools.nsc.interpreter.ILoop.reallyInterpret$1(ILoop.scala:756)
at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:801)
at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:713)
at scala.tools.nsc.interpreter.ILoop.processLine$1(ILoop.scala:577)
at scala.tools.nsc.interpreter.ILoop.innerLoop$1(ILoop.scala:584)
at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:587)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:878)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:833)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:833)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:833)
at scala.tools.nsc.MainGenericRunner.runTarget$1(MainGenericRunner.scala:83)
at scala.tools.nsc.MainGenericRunner.process(MainGenericRunner.scala:96)
at scala.tools.nsc.MainGenericRunner$.main(MainGenericRunner.scala:105)
at scala.tools.nsc.MainGenericRunner.main(MainGenericRunner.scala)
Caused by: java.io.NotSerializableException: Main$$anon$1
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at Main$$anon$1.<init>(a.scala:11)
at Main$.main(a.scala:1)
at Main.main(a.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at scala.tools.nsc.util.ScalaClassLoader$$anonfun$run$1.apply(ScalaClassLoader.scala:71)
at scala.tools.nsc.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
at scala.tools.nsc.util.ScalaClassLoader$URLClassLoader.asContext(ScalaClassLoader.scala:139)
at scala.tools.nsc.util.ScalaClassLoader$class.run(ScalaClassLoader.scala:71)
at scala.tools.nsc.util.ScalaClassLoader$URLClassLoader.run(ScalaClassLoader.scala:139)
at scala.tools.nsc.CommonRunner$class.run(ObjectRunner.scala:28)
at scala.tools.nsc.ObjectRunner$.run(ObjectRunner.scala:45)
at scala.tools.nsc.CommonRunner$class.runAndCatch(ObjectRunner.scala:35)
at scala.tools.nsc.ObjectRunner$.runAndCatch(ObjectRunner.scala:45)
at scala.tools.nsc.ScriptRunner.scala$tools$nsc$ScriptRunner$$runCompiled(ScriptRunner.scala:171)
at scala.tools.nsc.ScriptRunner$$anonfun$runScript$1.apply(ScriptRunner.scala:188)
at scala.tools.nsc.ScriptRunner$$anonfun$runScript$1.apply(ScriptRunner.scala:188)
at scala.tools.nsc.ScriptRunner$$anonfun$withCompiledScript$1.apply$mcZ$sp(ScriptRunner.scala:157)
at scala.tools.nsc.ScriptRunner$$anonfun$withCompiledScript$1.apply(ScriptRunner.scala:131)
at scala.tools.nsc.ScriptRunner$$anonfun$withCompiledScript$1.apply(ScriptRunner.scala:131)
at scala.tools.nsc.util.package$.trackingThreads(package.scala:51)
at scala.tools.nsc.util.package$.waitingForThreads(package.scala:35)
at scala.tools.nsc.ScriptRunner.withCompiledScript(ScriptRunner.scala:130)
at scala.tools.nsc.ScriptRunner.runScript(ScriptRunner.scala:188)
at scala.tools.nsc.ScriptRunner.runScriptAndCatch(ScriptRunner.scala:201)
at scala.tools.nsc.MainGenericRunner.runTarget$1(MainGenericRunner.scala:76)
... 3 more
Note that the @serializable scaladoc says it has been deprecated since 2.9.0:
Deprecated (Since version 2.9.0) instead of @serializable class C, use class C extends Serializable
So you just have to use the Serializable trait:
class Person(val age: Int) extends Serializable
This works for me (type :paste in the REPL and paste these lines):
import java.io.{ObjectOutputStream, ObjectInputStream}
import java.io.{FileOutputStream, FileInputStream}
class Person(val age: Int) extends Serializable {
  override def toString = s"Person($age)"
}
val os = new ObjectOutputStream(new FileOutputStream("/tmp/example.dat"))
os.writeObject(new Person(22))
os.close()
val is = new ObjectInputStream(new FileInputStream("/tmp/example.dat"))
val obj = is.readObject()
is.close()
obj
This is the output:
// Exiting paste mode, now interpreting.
import java.io.{ObjectOutputStream, ObjectInputStream}
import java.io.{FileOutputStream, FileInputStream}
defined class Person
os: java.io.ObjectOutputStream = java.io.ObjectOutputStream@5126abfd
is: java.io.ObjectInputStream = java.io.ObjectInputStream@41e598aa
obj: Object = Person(22)
res8: Object = Person(22)
So, as you can see, the [de]serialization attempt was successful.
Edit (on why you're getting NotSerializableException when you run Scala script from file)
I've put my code into a file and tried to run it via scala test.scala and got exactly the same error as you. Here is my speculation on why it happens.
According to the stack trace, a weird class Main$$anon$1 is not serializable. The logical question is: why is it there in the first place? We're trying to serialize Person, after all, not something weird.
A Scala script is special in that it is implicitly wrapped into an object called Main. This is indicated by the stack trace:
at Main$$anon$1.<init>(test.scala:9)
at Main$.main(test.scala:1)
at Main.main(test.scala)
The names here suggest that the static method Main.main is the program entry point, and that it delegates to the Main$.main instance method (an object's class is named after the object, but with $ appended). This instance method in turn tries to create an instance of a class Main$$anon$1. As far as I remember, anonymous classes are named that way.
Now, let's try to find the exact Person class name (run this as a Scala script):
class Person(val age: Int) extends Serializable {
  override def toString = s"Person($age)"
}
println(new Person(22).getClass)
This prints something I was expecting:
class Main$$anon$1$Person
This means that Person is not a top-level class; instead it is a nested class defined in the anonymous class generated by the compiler! So in fact we have something like this:
object Main {
  def main(args: Array[String]) {
    new { // this is where Main$$anon$1 is generated, and the following code is its constructor body
      class Person(val age: Int) extends Serializable { ... }
      // all other definitions
    }
  }
}
But in Scala all nested classes are what Java calls "non-static nested" (or "inner") classes. This means that these classes always contain an implicit reference to an instance of their enclosing class. In this case, the enclosing class is Main$$anon$1. Because of that, when the Java serializer tries to serialize Person, it transitively encounters the instance of Main$$anon$1 and tries to serialize it, but since that class is not Serializable, the process fails. BTW, serializing non-static inner classes is a notorious problem in the Java world; it is known to cause issues exactly like this one.
As for why it works in the REPL, it seems that classes declared in the REPL somehow do not end up as inner ones, so they don't have any implicit outer fields. Hence serialization works normally for them.
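The same failure can be reproduced deliberately in a few lines (a sketch, not from the original post):
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

class Outer { // not Serializable
  class Inner(val n: Int) extends Serializable // carries a hidden $outer reference
}

object InnerDemo extends App {
  val outer = new Outer
  val out = new ObjectOutputStream(new ByteArrayOutputStream())
  out.writeObject(new outer.Inner(1)) // throws java.io.NotSerializableException: Outer
}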
You could use the Serializable trait:
Trivial serialization example using Java serialization with the Serializable trait:
case class Person(age: Int) extends Serializable
Usage:
Serialization, Write Object
val fos = new FileOutputStream( "person.serializedObject" )
val o = new ObjectOutputStream( fos )
o writeObject Person(31)
Deserialization, Read Object
val fis = new FileInputStream( "person.serializedObject" )
val oin = new ObjectInputStream( fis )
val p= oin.readObject
Which creates the following output:
fis: java.io.FileInputStream = java.io.FileInputStream@43a2bc95
oin: java.io.ObjectInputStream = java.io.ObjectInputStream@710afce3
p: Object = Person(31)
As you can see, the deserialization can't infer the object type itself, which is a clear drawback; the usual workaround is an explicit cast, sketched below.
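A sketch of that workaround, replacing the untyped readObject above (it fails with a ClassCastException at runtime if the stream holds something else):
val person = oin.readObject().asInstanceOf[Person]
println(person.age) // the static type is recovered, so members are accessible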
Serialization with Scala Pickling
https://github.com/scala/pickling (or part of the standard distribution starting with Scala 2.11)
In the example code the object is not written to a file, and JSON is used instead of bytecode serialization, which avoids certain problems originating in bytecode incompatibilities between different Scala versions.
import scala.pickling._
import json._
case class Person(age: Int)
val person = Person(31)
val pickledPerson = person.pickle
val unpickledPerson = pickledPerson.unpickle[Person]
class Person(age:Int) {} is equivalent to the Java code:
class Person {
  Person(int age) {}
}
which is probably not what you want. Note that the parameter age is simply discarded and Person has no member fields.
You want either:
@serializable case class Person(age: Int)
@serializable class Person(val age: Int)
You can leave out the empty curly brackets at the end. In fact, it's encouraged.
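A quick way to check the difference (a hypothetical snippet):
class P1(age: Int)      // parameter is discarded; P1 has no `age` member
class P2(val age: Int)  // `age` becomes a field

// new P1(45).age  // does not compile: value age is not a member of P1
println(new P2(45).age) // prints 45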