Even trivial serialization examples in Scala don't work. Why? - class

I am trying the simplest possible serialization examples of a class:
#serializable class Person(age:Int) {}
val fred = new Person(45)
import java.io._
val out = new ObjectOutputStream(new FileOutputStream("test.obj"))
out.writeObject(fred)
out.close()
This throws exception "java.io.NotSerializableException: Main$$anon$1$Person" on me. Why?
Is there a simple serialization example?
I also tried
#serializable class Person(nm:String) {
private val name:String=nm
}
val fred = new Person("Fred")
...
and tried to remove #serializable and some other permutations. The file "test.obj" is created, over 2Kb in size and has plausible contents.
EDIT:
Reading the "test.obj" back in (from the 2nd answer below) causes
Welcome to Scala version 2.10.3 (Java HotSpot(TM) 64-Bit Server VM,
Java 1.7.0_51). Type in expressions to have them evaluated. Type :help
for more information.
scala> import java.io._ import java.io._
scala> val fis = new FileInputStream( "test.obj" ) fis:
java.io.FileInputStream = java.io.FileInputStream#716ad1b3
scala> val oin = new ObjectInputStream( fis ) oin:
java.io.ObjectInputStream = java.io.ObjectInputStream#1f927f0a
scala> val p= oin.readObject java.io.WriteAbortedException: writing
aborted; java.io.NotSerializableException: Main$$anon$1
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1354)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at .(:12)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:734)
at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:983)
at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:573)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:604)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:568)
at scala.tools.nsc.interpreter.ILoop.reallyInterpret$1(ILoop.scala:756)
at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:801)
at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:713)
at scala.tools.nsc.interpreter.ILoop.processLine$1(ILoop.scala:577)
at scala.tools.nsc.interpreter.ILoop.innerLoop$1(ILoop.scala:584)
at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:587)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:878)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:833)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:833)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:833)
at scala.tools.nsc.MainGenericRunner.runTarget$1(MainGenericRunner.scala:83)
at scala.tools.nsc.MainGenericRunner.process(MainGenericRunner.scala:96)
at scala.tools.nsc.MainGenericRunner$.main(MainGenericRunner.scala:105)
at scala.tools.nsc.MainGenericRunner.main(MainGenericRunner.scala) Caused
by: java.io.NotSerializableException: Main$$anon$1
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at Main$$anon$1.(a.scala:11)
at Main$.main(a.scala:1)
at Main.main(a.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at scala.tools.nsc.util.ScalaClassLoader$$anonfun$run$1.apply(ScalaClassLoader.scala:71)
at scala.tools.nsc.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
at scala.tools.nsc.util.ScalaClassLoader$URLClassLoader.asContext(ScalaClassLoader.scala:139)
at scala.tools.nsc.util.ScalaClassLoader$class.run(ScalaClassLoader.scala:71)
at scala.tools.nsc.util.ScalaClassLoader$URLClassLoader.run(ScalaClassLoader.scala:139)
at scala.tools.nsc.CommonRunner$class.run(ObjectRunner.scala:28)
at scala.tools.nsc.ObjectRunner$.run(ObjectRunner.scala:45)
at scala.tools.nsc.CommonRunner$class.runAndCatch(ObjectRunner.scala:35)
at scala.tools.nsc.ObjectRunner$.runAndCatch(ObjectRunner.scala:45)
at scala.tools.nsc.ScriptRunner.scala$tools$nsc$ScriptRunner$$runCompiled(ScriptRunner.scala:171)
at scala.tools.nsc.ScriptRunner$$anonfun$runScript$1.apply(ScriptRunner.scala:188)
at scala.tools.nsc.ScriptRunner$$anonfun$runScript$1.apply(ScriptRunner.scala:188)
at scala.tools.nsc.ScriptRunner$$anonfun$withCompiledScript$1.apply$mcZ$sp(ScriptRunner.scala:157)
at scala.tools.nsc.ScriptRunner$$anonfun$withCompiledScript$1.apply(ScriptRunner.scala:131)
at scala.tools.nsc.ScriptRunner$$anonfun$withCompiledScript$1.apply(ScriptRunner.scala:131)
at scala.tools.nsc.util.package$.trackingThreads(package.scala:51)
at scala.tools.nsc.util.package$.waitingForThreads(package.scala:35)
at scala.tools.nsc.ScriptRunner.withCompiledScript(ScriptRunner.scala:130)
at scala.tools.nsc.ScriptRunner.runScript(ScriptRunner.scala:188)
at scala.tools.nsc.ScriptRunner.runScriptAndCatch(ScriptRunner.scala:201)
at scala.tools.nsc.MainGenericRunner.runTarget$1(MainGenericRunner.scala:76)
... 3 more

Note that #serializable scaladoc tells that it is deprecated since 2.9.0:
Deprecated (Since version 2.9.0) instead of #serializable class C, use class C extends Serializable
So you just have to use Serializable trait:
class Person(val age: Int) extends Serializable
This works for me (type :paste in REPL and paste these lines):
import java.io.{ObjectOutputStream, ObjectInputStream}
import java.io.{FileOutputStream, FileInputStream}
class Person(val age: Int) extends Serializable {
override def toString = s"Person($age)"
}
val os = new ObjectOutputStream(new FileOutputStream("/tmp/example.dat"))
os.writeObject(new Person(22))
os.close()
val is = new ObjectInputStream(new FileInputStream("/tmp/example.dat"))
val obj = is.readObject()
is.close()
obj
This is the output:
// Exiting paste mode, now interpreting.
import java.io.{ObjectOutputStream, ObjectInputStream}
import java.io.{FileOutputStream, FileInputStream}
defined class Person
os: java.io.ObjectOutputStream = java.io.ObjectOutputStream#5126abfd
is: java.io.ObjectInputStream = java.io.ObjectInputStream#41e598aa
obj: Object = Person(22)
res8: Object = Person(22)
So, you can see, the [de]serialization attempt was successful.
Edit (on why you're getting NotSerializableException when you run Scala script from file)
I've put my code into a file and tried to run it via scala test.scala and got exactly the same error as you. Here is my speculation on why it happens.
According to the stack trace a weird class Main$$anon$1 is not serializable. Logical question is: why it is there in the first place? We're trying to serialize Person after all, not something weird.
Scala script is special in that it is implicitly wrapped into an object called Main. This is indicated by the stack trace:
at Main$$anon$1.<init>(test.scala:9)
at Main$.main(test.scala:1)
at Main.main(test.scala)
The names here suggest that Main.main static method is the program entry point, and this method delegates to Main$.main instance method (object's class is named after the object but with $ appended). This instance method in turn tries to create an instance of a class Main$$anon$1. As far as I remember, anonymous classes are named that way.
Now, let's try to find exact Person class name (run this as Scala script):
class Person(val age: Int) extends Serializable {
override def toString = s"Person($age)"
}
println(new Person(22).getClass)
This prints something I was expecting:
class Main$$anon$1$Person
This means that Person is not a top-level class; instead it is a nested class defined in the anonymous class generated by the compiler! So in fact we have something like this:
object Main {
def main(args: Array[String]) {
new { // this is where Main$$anon$1 is generated, and the following code is its constructor body
class Person(val age: Int) extends Serializable { ... }
// all other definitions
}
}
}
But in Scala all nested classes are something called "nested non-static" (or "inner") classes in Java. This means that these classes always contain an implicit reference to an instance of their enclosing class. In this case, enclosing class is Main$$anon$1. Because of that when Java serializer tries to serialize Person, it transitively encounters an instance of Main$$anon$1 and tries to serialize it, but since it is not Serializable, the process fails. BTW, serializing non-static inner classes is a notorious thing in Java world, it is known to cause problems like this one.
As for why it works in REPL, it seems that in REPL declared classes somehow do not end up as inner ones, so they don't have any implicit fields. Hence serialization works normally for them.

You could use the Serializable Trait:
Trivial Serialization example using Java Serialization with the Serializable Trait:
case class Person(age: Int) extends Serializable
Usage:
Serialization, Write Object
val fos = new FileOutputStream( "person.serializedObject" )
val o = new ObjectOutputStream( fos )
o writeObject Person(31)
Deserialization, Read Object
val fis = new FileInputStream( "person.serializedObject" )
val oin = new ObjectInputStream( fis )
val p= oin.readObject
Which creates following output
fis: java.io.FileInputStream = java.io.FileInputStream#43a2bc95
oin: java.io.ObjectInputStream = java.io.ObjectInputStream#710afce3
p: Object = Person(31)
As you see the deserialization can't infer the Object Type itself, which is a clear drawback.
Serialization with Scala-Pickling
https://github.com/scala/pickling or part of the Standard-Distribution starting with Scala 2.11
In the exmple code the object is not written to a file and JSON is used instead of ByteCode Serialization which avoids certain problems originating in byte code incompatibilities between different Scala version.
import scala.pickling._
import json._
case class Person(age: Int)
val person = Person(31)
val pickledPerson = person.pickle
val unpickledPerson = pickledPerson.unpickle[Person]

class Person(age:Int) {} is equivalent to the Java code:
class Person{
Person(Int age){}
}
which is probably not what you want. Note that the parameter age is simply discarded and Person has no member fields.
You want either:
#serializable case class Person(age:Int)
#serializable class Person(val age:Int)
You can leave out the empty curly brackets at the end. In fact, it's encouraged.

Related

Generating recursive structures in scalacheck

I'm trying to make a generator for a recursive datatype called Row. A row is a list of named Vals, where a Val is either an atomic Bin or else a nested Row.
This is my code:
package com.dtci.data.anonymize.parquet
import java.nio.charset.StandardCharsets
import org.scalacheck.Gen
object TestApp extends App {
sealed trait Val
case class Bin(bytes: Array[Byte]) extends Val
object Bin {
def from_string(str: String): Bin = Bin(str.getBytes(StandardCharsets.UTF_8))
}
case class Row(flds: List[(String, Val)]) extends Val
val gen_bin = Gen.alphaStr.map(Bin.from_string)
val gen_field_name = Gen.alphaLowerStr
val gen_field = Gen.zip(gen_field_name, gen_val)
val gen_row = Gen.nonEmptyListOf(gen_field).map(Row.apply)
def gen_val: Gen[Val] = Gen.oneOf(gen_bin, gen_row)
gen_row.sample.get.flds.foreach( fld => println(s"${fld._1} --> ${fld._2}"))
}
It crashes with the following stack trace:
Exception in thread "main" java.lang.NullPointerException
at org.scalacheck.Gen.$anonfun$flatMap$2(Gen.scala:84)
at org.scalacheck.Gen$R.flatMap(Gen.scala:243)
at org.scalacheck.Gen$R.flatMap$(Gen.scala:240)
at org.scalacheck.Gen$R$$anon$3.flatMap(Gen.scala:228)
at org.scalacheck.Gen.$anonfun$flatMap$1(Gen.scala:84)
at org.scalacheck.Gen$Parameters.useInitialSeed(Gen.scala:318)
at org.scalacheck.Gen$$anon$5.doApply(Gen.scala:255)
at org.scalacheck.Gen$$anon$1.$anonfun$doApply$1(Gen.scala:110)
at org.scalacheck.Gen$Parameters.useInitialSeed(Gen.scala:318)
at org.scalacheck.Gen$$anon$1.doApply(Gen.scala:109)
at org.scalacheck.Gen.$anonfun$map$1(Gen.scala:79)
at org.scalacheck.Gen$Parameters.useInitialSeed(Gen.scala:318)
at org.scalacheck.Gen$$anon$5.doApply(Gen.scala:255)
at org.scalacheck.Gen.$anonfun$flatMap$2(Gen.scala:84)
at org.scalacheck.Gen$R.flatMap(Gen.scala:243)
at org.scalacheck.Gen$R.flatMap$(Gen.scala:240)
at org.scalacheck.Gen$R$$anon$3.flatMap(Gen.scala:228)
at org.scalacheck.Gen.$anonfun$flatMap$1(Gen.scala:84)
at org.scalacheck.Gen$Parameters.useInitialSeed(Gen.scala:318)
at org.scalacheck.Gen$$anon$5.doApply(Gen.scala:255)
at org.scalacheck.Gen$$anon$1.$anonfun$doApply$1(Gen.scala:110)
at org.scalacheck.Gen$Parameters.useInitialSeed(Gen.scala:318)
at org.scalacheck.Gen$$anon$1.doApply(Gen.scala:109)
at org.scalacheck.Gen$.$anonfun$sequence$2(Gen.scala:492)
at scala.collection.LinearSeqOps.foldLeft(LinearSeq.scala:168)
at scala.collection.LinearSeqOps.foldLeft$(LinearSeq.scala:164)
at scala.collection.immutable.List.foldLeft(List.scala:79)
at org.scalacheck.Gen$.$anonfun$sequence$1(Gen.scala:490)
at org.scalacheck.Gen$Parameters.useInitialSeed(Gen.scala:318)
at org.scalacheck.Gen$$anon$5.doApply(Gen.scala:255)
at org.scalacheck.Gen.$anonfun$map$1(Gen.scala:79)
at org.scalacheck.Gen$Parameters.useInitialSeed(Gen.scala:318)
at org.scalacheck.Gen$$anon$5.doApply(Gen.scala:255)
at org.scalacheck.Gen$$anon$1.$anonfun$doApply$1(Gen.scala:110)
at org.scalacheck.Gen$Parameters.useInitialSeed(Gen.scala:318)
at org.scalacheck.Gen$$anon$1.doApply(Gen.scala:109)
at org.scalacheck.Gen.$anonfun$flatMap$2(Gen.scala:84)
at org.scalacheck.Gen$R.flatMap(Gen.scala:243)
at org.scalacheck.Gen$R.flatMap$(Gen.scala:240)
at org.scalacheck.Gen$R$$anon$3.flatMap(Gen.scala:228)
at org.scalacheck.Gen.$anonfun$flatMap$1(Gen.scala:84)
at org.scalacheck.Gen$Parameters.useInitialSeed(Gen.scala:318)
at org.scalacheck.Gen$$anon$5.doApply(Gen.scala:255)
at org.scalacheck.Gen$.$anonfun$sized$1(Gen.scala:551)
at org.scalacheck.Gen$Parameters.useInitialSeed(Gen.scala:318)
at org.scalacheck.Gen$$anon$5.doApply(Gen.scala:255)
at org.scalacheck.Gen$$anon$1.$anonfun$doApply$1(Gen.scala:110)
at org.scalacheck.Gen$Parameters.useInitialSeed(Gen.scala:318)
at org.scalacheck.Gen$$anon$1.doApply(Gen.scala:109)
at org.scalacheck.Gen.$anonfun$map$1(Gen.scala:79)
at org.scalacheck.Gen$Parameters.useInitialSeed(Gen.scala:318)
at org.scalacheck.Gen$$anon$5.doApply(Gen.scala:255)
at org.scalacheck.Gen.sample(Gen.scala:154)
What's wrong with my code, and what would have been the best way for me to diagnose it myself?
As a note, I've seen the remarks about Gen.oneOf being strict and needing Gen.lzy for recursive structures. But if, in my code, I wrap the definition of gen_val inside of Gen.lzy(...) then I get a stack overflow rather than the current null pointer exception.
First of all, be careful using object Main extends App. I find its fields initialization semantic less obvious than plain old main with line-after-line semantics:
object Main {
def main(args: Array[String]): Unit = {...}
}
This is likely a problem with the NullPointerException.
Usually, it can be fixed by careful checking out fields initialization order and marking some (or all of them) val's as lazy.
The StackOverflowError arises because of too deep generated data structure.
Generally, when you are dealing with any kind of recursion, always consider the base case when the recursion should stop and the step which eventually will hit the base case.
In your particular case we can utilize the Gen.sized and Gen.resize which are responsible for how "big" are generated elements (checkout docs and google for more information).
package com.dtci.data.anonymize.parquet
import java.nio.charset.StandardCharsets
import org.scalacheck.Gen
object Main extends App {
sealed trait Val
case class Bin(bytes: Array[Byte]) extends Val
object Bin {
def from_string(str: String): Bin = Bin(str.getBytes(StandardCharsets.UTF_8))
}
case class Row(flds: List[(String, Val)]) extends Val
val gen_bin = Gen.alphaStr.map(Bin.from_string)
val gen_field_name = Gen.alphaLowerStr
val gen_field = Gen.zip(gen_field_name, gen_val)
val gen_row = Gen.sized(size => Gen.resize(size / 2, Gen.nonEmptyListOf(gen_field).map(Row.apply)))
def gen_val: Gen[Val] = Gen.sized { size =>
if (size <= 0) {
gen_bin
} else {
Gen.oneOf(gen_bin, gen_row)
}
}
gen_row.sample.get.flds.foreach(fld => println(s"${fld._1} --> ${fld._2}"))
}

Scala Test SparkException: Task not serializable

I'm new to Scala and Spark. Wrote a simple test class and stuck on this issue for the whole day.
Please find the below code
A.scala
class A(key :String) extends Serializable {
val this.key:String=key
def getKey(): String = { return this.key}
}
B.Scala
class B(key :String) extends Serializable {
val this.key:String=key
def getKey(): String = { return this.key}
}
Test.scala
import com.holdenkarau.spark.testing.{RDDComparisons, SharedSparkContext}
import org.scalatest.FunSuite
import org.scalatest.BeforeAndAfter
class Test extends FunSuite with SharedSparkContext with RDDComparisons with BeforeAndAfter with Serializable {
//comment this
private[this] val b1 = new B("test1")
test("Test RDD") {
val a1 = new A("test1")
val a2 = new A("test2")
val expected= sc.parallelize(Seq(a1,a2))
println(b1.getKey())
//val b1 = new B("test1")
//val key1 :String = b1.getKey()
expected.foreach{ a =>
//if(a.getKey().equalsIgnoreCase(key1))
if(a.getKey().equalsIgnoreCase(b1.getKey()))
print("hi")
}
}
}
This code is throwing exception
Task not serializable
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:393)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:926)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:925)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.foreach(RDD.scala:925)
at com.adgear.adata.hhid.Test$$anonfun$1.apply$mcV$sp(Test.scala:19)
at com.adgear.adata.hhid.Test$$anonfun$1.apply(Test.scala:11)
at com.adgear.adata.hhid.Test$$anonfun$1.apply(Test.scala:11)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
at org.scalatest.TestSuite$class.withFixture(TestSuite.scala:196)
at org.scalatest.FunSuite.withFixture(FunSuite.scala:1560)
at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
at org.scalatest.FunSuite.runTest(FunSuite.scala:1560)
at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
at org.scalatest.Suite$class.run(Suite.scala:1147)
at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233)
at com.adgear.adata.hhid.Test.org$scalatest$BeforeAndAfterAll$$super$run(Test.scala:7)
at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213)
at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
at com.adgear.adata.hhid.Test.run(Test.scala:7)
When I comment out the class level declaration of b1 and use the declaration inside the test methods itself then "if(a.getKey().equalsIgnoreCase(b1.getKey()))" this works. And if I retain class level b1 definition then "if(a.getKey().equalsIgnoreCase(b1.getKey()))" throws above exception. To solve this, I have to use "//val key1 :String = b1.getKey()" and "//if(a.getKey().equalsIgnoreCase(key1))" then it works.
As one can see A, B, and Test all implements Serializable still I get this exception. What is causing this issue?
Thanks
Declaring a class as Serializable doesn't mean that it can be serialized unless all of its field are Serializable as well.
Since your Test class extends Funsuite, it will have an "assertionsHelper" field which is not Serializable. So when you reference the "b1" field in your "forEach" method, Spark will try to serialize the Test instance along with all its field (including the assertionsHelper).
If you want to avoid this, you'll have to either define b1 somwhere else (in the test method scope or a companion object), or dereference b1 into a new variable before including it in the forEach function:
val b1_ref = b1
expected.foreach { a =>
if (a.getKey().equalsIgnoreCase(b1_ref.getKey()))
print("hi")
}
PS: When you encounter a serialization exception you usually have access to the "serialization stack" in the logs which tell you exactly which object caused the error

Why does custom DefaultSource give java.io.NotSerializableException?

this is my first post on SO and my apology if the improper format is being used.
I'm working with Apache Spark to create a new source (via DefaultSource), BaseRelations, etc... and run into a problem with serialization that I would like to understand better. Consider below a class that extends BaseRelation and implements the scan builder.
class RootTableScan(path: String, treeName: String)(#transient val sqlContext: SQLContext) extends BaseRelation with PrunedFilteredScan{
private val att: core.SRType =
{
val reader = new RootFileReader(new java.io.File(Seq(path) head))
val tmp =
if (treeName==null)
buildATT(findTree(reader.getTopDir), arrangeStreamers(reader), null)
else
buildATT(reader.getKey(treeName).getObject.asInstanceOf[TTree],
arrangeStreamers(reader), null)
tmp
}
// define the schema from the AST
def schema: StructType = {
val s = buildSparkSchema(att)
s
}
// builds a scan
def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
// parallelize over all the files
val r = sqlContext.sparkContext.parallelize(Seq(path), 1).
flatMap({fileName =>
val reader = new RootFileReader(new java.io.File(fileName))
// get the TTree
/* PROBLEM !!! */
val rootTree =
// findTree(reader)
if (treeName == null) findTree(reader)
else reader.getKey(treeName).getObject.asInstanceOf[TTree]
new RootTreeIterator(rootTree, arrangeStreamers(reader),
requiredColumns, filters)
})
println("Done building Scan")
r
}
}
}
PROBLEM identifies where the issue happens. treeName is a val that gets injected into the class thru the constructor. The lambda that uses it is supposed to be executed on the slave and I do need to send the treeName - serialize it. I would like to understand why exactly the code snippet below causes this NotSerializableException. I know for sure that without treeName in it, it works just fine
val rootTree =
// findTree(reader)
if (treeName == null) findTree(reader)
else reader.getKey(treeName).getObject.asInstanceOf[TTree]
Below is the Stack trace
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2056)
at org.apache.spark.rdd.RDD$$anonfun$flatMap$1.apply(RDD.scala:375)
at org.apache.spark.rdd.RDD$$anonfun$flatMap$1.apply(RDD.scala:374)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.flatMap(RDD.scala:374)
at org.dianahep.sparkroot.package$RootTableScan.buildScan(sparkroot.scala:95)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$8.apply(DataSourceStrategy.scala:260)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$8.apply(DataSourceStrategy.scala:260)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:303)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:302)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:379)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:298)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:256)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:61)
at org.apache.spark.sql.execution.SparkPlanner.plan(SparkPlanner.scala:47)
at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:51)
at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:48)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48)
at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:78)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:76)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:83)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:83)
at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2572)
at org.apache.spark.sql.Dataset.head(Dataset.scala:1934)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2149)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:239)
at org.apache.spark.sql.Dataset.show(Dataset.scala:526)
at org.apache.spark.sql.Dataset.show(Dataset.scala:486)
at org.apache.spark.sql.Dataset.show(Dataset.scala:495)
... 50 elided
Caused by: java.io.NotSerializableException: org.dianahep.sparkroot.package$RootTableScan
Serialization stack:
- object not serializable (class: org.dianahep.sparkroot.package$RootTableScan, value: org.dianahep.sparkroot.package$RootTableScan#6421e9e7)
- field (class: org.dianahep.sparkroot.package$RootTableScan$$anonfun$1, name: $outer, type: class org.dianahep.sparkroot.package$RootTableScan)
- object (class org.dianahep.sparkroot.package$RootTableScan$$anonfun$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
From the stack I think I can deduce that it tries to serialize my lambda and can not. this lambda should be a closure as we have a val in there that is defined outside of the lambda scope. But I don't understand why this can not be serialized.
Any help would be really appreciated!!!
Thanks a lot!
Any time a scala closure references a class variable, like treeName, then the JVM serializes the parent class along with the closure. Your class RootTableScan is not serializable, though! The solution is to create a local string variable:
// builds a scan
def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
val localTreeName = treeName // this is safe to serialize
// parallelize over all the files
val r = sqlContext.sparkContext.parallelize(Seq(path), 1).
flatMap({fileName =>
val reader = new RootFileReader(new java.io.File(fileName))
// get the TTree
/* PROBLEM !!! */
val rootTree =
// findTree(reader)
if (localTreeName == null) findTree(reader)
else reader.getKey(localTreeName).getObject.asInstanceOf[TTree]
new RootTreeIterator(rootTree, arrangeStreamers(reader),
requiredColumns, filters)
})

Scala Case Class serialization

I've been trying to binary serialize a composite case class object that kept throwing a weird exception. I don't really understand what is wrong with this example which throws the following exception. I used to get that exception for circular references which is not the case here. Some hints please?
java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field com.Table.rows of type scala.collection.immutable.List in instance of com.Table
java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field com.Table.rows of type scala.collection.immutable.List in instance of com.Table
at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133)
at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2024)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
at com.TestSeri$.serializeBinDeserialise(TestSeri.scala:37)
at com.TestSeri$.main(TestSeri.scala:22)
at com.TestSeri.main(TestSeri.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
Here is the code
import java.io._
import scalax.file.Path
case class Row(name: String)
case class Table(rows: List[Row])
case class Cont(docs: Map[String, Table])
case object TestSeri {
def main(args: Array[String]) {
val cc = Cont(docs = List(
"1" -> Table(rows = List(Row("r1"), Row("r2"))),
"2" -> Table(rows = List(Row("r301"), Row("r31"), Row("r32")))
).toMap)
val tt = Table(rows = List(Row("r1"), Row("r2")))
val ttdes = serializeBinDeserialize(tt)
println(ttdes == tt)
val ccdes = serializeBinDeserialize(cc)
println(ccdes == cc)
}
def serializeBinDeserialize[T](payload: T): T = {
val bos = new ByteArrayOutputStream()
val out = new ObjectOutputStream(bos)
out.writeObject(payload)
val bis = new ByteArrayInputStream(bos.toByteArray)
val in = new ObjectInputStream(bis)
in.readObject().asInstanceOf[T]
}
}
Replacing List with Array which is immutable too, fixed the problem.
In my original problem I had a Map which I replaced with TreeMap.
I think is likely to be related to the proxy pattern implementation in generic immutable List and Map mentioned here:
https://issues.scala-lang.org/browse/SI-9237.
Can't believe I wasted a full day on this.

Code working in Spark-Shell not in eclipse

I have a small Scala code which works properly on Spark-Shell but not in Eclipse with Scala plugin. I can access hdfs using plugin tried writing another file and it worked..
FirstSpark.scala
package bigdata.spark
import org.apache.spark.SparkConf
import java. io. _
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object FirstSpark {
def main(args: Array[String])={
val conf = new SparkConf().setMaster("local").setAppName("FirstSparkProgram")
val sparkcontext = new SparkContext(conf)
val textFile =sparkcontext.textFile("hdfs://pranay:8020/spark/linkage")
val m = new Methods()
val q =textFile.filter(x => !m.isHeader(x)).map(x=> m.parse(x))
q.saveAsTextFile("hdfs://pranay:8020/output") }
}
Methods.scala
package bigdata.spark
import java.util.function.ToDoubleFunction
class Methods {
def isHeader(s:String):Boolean={
s.contains("id_1")
}
def parse(line:String) ={
val pieces = line.split(',')
val id1=pieces(0).toInt
val id2=pieces(1).toInt
val matches=pieces(11).toBoolean
val mapArray=pieces.slice(2, 11).map(toDouble)
MatchData(id1,id2,mapArray,matches)
}
def toDouble(s: String) = {
if ("?".equals(s)) Double.NaN else s.toDouble
}
}
case class MatchData(id1: Int, id2: Int,
scores: Array[Double], matched: Boolean)
Error Message:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:335)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:334)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
Can anyone please help me with this
Try changing class Methods { .. } to object Methods { .. }.
I think the problem is at val q =textFile.filter(x => !m.isHeader(x)).map(x=> m.parse(x)). When Spark sees the filter and map functions it tries to serialize the functions passed to them (x => !m.isHeader(x) and x=> m.parse(x)) so that it can dispatch the work of executing them to all of the executors (this is the Task referred to). However, to do this, it needs to serialize m, since this object is referenced inside the function (it is in the closure of the two anonymous methods) - but it cannot do this since Methods is not serializable. You could add extends Serializable to the Methods class, but in this case an object is more appropriate (and is already Serializable).