Spark not serializable issue - Scala

I'm working on refactoring our code so that we can use the CAKE pattern for DI.
I've stumbled upon a serialisation issue which I'm having difficulty understanding.
When I call this function:
def getAccounts(winZones: Broadcast[List[WindowsZones]]): RDD[AccountDetails] = {
  val accounts = getAccounts // call to db
  val result = accounts.map(row =>
    Some(AccountDetails(UUID.fromString(row.getAs[String]("")),
      row.getAs[String](""),
      UUID.fromString(row.getAs[String]("")),
      row.getAs[String](""),
      row.getAs[String](""),
      DateUtils.getIanaZoneFromWinZone(row.getAs[String]("timeZone"), winZones))))
    .map(m => m.get)
  result
}
It works perfectly, but it is ugly and I want to refactor it so that the middle mapping from Row to AccountDetails is placed inside a private function. When I do that, however, it causes the serialisation issue.
I'd like:
def getAccounts(winZones: Broadcast[List[WindowsZones]]): RDD[AccountDetails] = {
  val accounts = getAccounts
  val result = accounts
    .map(m => getAccountDetails(m, winZones))
    .filter(_.isDefined)
    .map(m => m.get)
  result
}
private def getAccountDetails(row: Row, winZones: Broadcast[List[WindowsZones]]): Option[AccountDetails] = {
  try {
    Some(AccountDetails(UUID.fromString(""),
      row.getAs[String](""),
      UUID.fromString(row.getAs[String]("")),
      row.getAs[String](""),
      row.getAs[String](""),
      DateUtils.getIanaZoneFromWinZone(row.getAs[String]("timeZone"), winZones)))
  }
  catch {
    case e: Exception =>
      logger.error(s"Unable to set AccountDetails $e")
      None
  }
}
Any help is appreciated, of course. The AccountDetails object is a case class, should that be pertinent. I'm also happy to take any other advice on implementing cake or DI with Spark in general. Thanks.
Edit to show structure:
trait serviceImpl extends anotherComponent { this: DBA =>
  def Accounts = new Accounts

  class Accounts extends AccountService {
    // the methods above are defined here.
  }
}
Edit to include stacktrace:
17/02/13 17:32:32 INFO CodeGenerator: Code generated in 271.36617 ms
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2039)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:366)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:365)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.map(RDD.scala:365)
at FunnelServiceComponentImpl$FunnelAccounts.getAccounts(FunnelServiceComponentImpl.scala:24)
at Main$.delayedEndpoint$Main$1(Main.scala:26)
at Main$delayedInit$body.apply(Main.scala:7)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at Main$.main(Main.scala:7)
at Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Caused by: java.io.NotSerializableException: FunnelServiceComponentImpl$FunnelAccounts
Serialization stack:
- object not serializable (class: FunnelServiceComponentImpl$FunnelAccounts, value: FunnelServiceComponentImpl$FunnelAccounts#16b7e04a)
- field (class: FunnelServiceComponentImpl$FunnelAccounts$$anonfun$1, name: $outer, type: class FunnelServiceComponentImpl$FunnelAccounts)
- object (class FunnelServiceComponentImpl$FunnelAccounts$$anonfun$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 26 more
17/02/13 17:32:32 INFO SparkContext: Invoking stop() from shutdown hook

Where are you defining the functions?
Let's say you are defining them in a class X. If that class is not serializable, this would cause your issue.
To solve this, you can either make it an object instead or make the class serializable.
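For example (a minimal sketch; the class and method names here are illustrative, not from your code):
// Option 1: define the function in an object, so Spark never needs to
// serialize an enclosing instance in order to call it.
object RowHelpers {
  def normalize(s: String): String = s.trim.toLowerCase
}

// Option 2: make the enclosing class Serializable. Everything the class
// references must then also be serializable, so this is the more fragile option.
class RowService extends Serializable {
  def normalize(s: String): String = s.trim.toLowerCase
}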

Because getAccountDetails is in your class, Spark will want to serialize your entire FunnelAccounts object. After all, you need an instance in order to use this method. However, FunnelAccounts is not serializable, so it can't be sent off to a worker.
In your case, you should move getAccountDetails into a FunnelAccounts object, so that you don't need an instance of FunnelAccounts to run it.
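A minimal sketch of that refactor (keeping your column lookups and DateUtils call as-is; logging omitted, and the imports shown are the usual Spark/Java ones this code would need):
import java.util.UUID

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.Row

object FunnelAccounts extends Serializable {
  def getAccountDetails(row: Row,
                        winZones: Broadcast[List[WindowsZones]]): Option[AccountDetails] = {
    try {
      Some(AccountDetails(
        UUID.fromString(row.getAs[String]("")),
        row.getAs[String](""),
        UUID.fromString(row.getAs[String]("")),
        row.getAs[String](""),
        row.getAs[String](""),
        DateUtils.getIanaZoneFromWinZone(row.getAs[String]("timeZone"), winZones)))
    } catch {
      case e: Exception => None // log here if needed
    }
  }
}

// In your class, the closure now only references the standalone object:
// accounts.map(row => FunnelAccounts.getAccountDetails(row, winZones))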

Related

Spark Scala: Task Not serializable error

I am using IntelliJ Community Edition with the Scala plugin and the Spark libraries. I am still learning Spark and am using a Scala Worksheet.
I have written the code below, which removes punctuation marks from a String:
def removePunctuation(text: String): String = {
  val punctPattern = "[^a-zA-Z0-9\\s]".r
  punctPattern.replaceAllIn(text, "").toLowerCase
}
Then I read a text file and try to remove punctuation:
val myfile = sc.textFile("/home/ubuntu/data.txt",4).map(removePunctuation)
This gives the error below; any help would be appreciated:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(/home/ubuntu/src/main/scala/Test.sc:294)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(/home/ubuntu/src/main/scala/Test.sc:284)
at org.apache.spark.util.ClosureCleaner$.clean(/home/ubuntu/src/main/scala/Test.sc:104)
at org.apache.spark.SparkContext.clean(/home/ubuntu/src/main/scala/Test.sc:2090)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(/home/ubuntu/src/main/scala/Test.sc:366)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(/home/ubuntu/src/main/scala/Test.sc:365)
at org.apache.spark.rdd.RDDOperationScope$.withScope(/home/ubuntu/src/main/scala/Test.sc:147)
at #worksheet#.#worksheet#(/home/ubuntu/src/main/scala/Test.sc:108)
Caused by: java.io.NotSerializableException: A$A21$A$A21
Serialization stack:
- object not serializable (class: A$A21$A$A21, value: A$A21$A$A21#62db3891)
- field (class: A$A21$A$A21$$anonfun$words$1, name: $outer, type: class A$A21$A$A21)
- object (class A$A21$A$A21$$anonfun$words$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.map(RDD.scala:369)
at A$A21$A$A21.words$lzycompute(Test.sc:27)
at A$A21$A$A21.words(Test.sc:27)
at A$A21$A$A21.get$$instance$$words(Test.sc:27)
at A$A21$.main(Test.sc:73)
at A$A21.main(Test.sc)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.jetbrains.plugins.scala.worksheet.MyWorksheetRunner.main(MyWorksheetRunner.java:22)
As T. Gaweda already pointed out, you're most likely defining your function in a class that's not serializable. Because it is a pure function, i.e. it doesn't depend on any context of the enclosing class, I suggest you put it into a companion object which should extend Serializable. This would be Scala's equivalent of a Java static method:
object Helper extends Serializable {
  def removePunctuation(text: String): String = {
    val punctPattern = "[^a-zA-Z0-9\\s]".r
    punctPattern.replaceAllIn(text, "").toLowerCase
  }
}
As @TGaweda suggests, Spark's SerializationDebugger is very helpful for identifying "the serialization path leading from the given object to the problematic object." All the dollar signs before the "Serialization stack" in the stack trace indicate that the container object for your method is the problem.
While it is easiest to just slap Serializable on your container class, I prefer to take advantage of the fact that Scala is a functional language and use your function as a first-class citizen:
sc.textFile("/home/ubuntu/data.txt", 4).map { text =>
  val punctPattern = "[^a-zA-Z0-9\\s]".r
  punctPattern.replaceAllIn(text, "").toLowerCase
}
Or if you really want to keep things separate:
val removePunctuation: String => String = (text: String) => {
  val punctPattern = "[^a-zA-Z0-9\\s]".r
  punctPattern.replaceAllIn(text, "").toLowerCase
}
sc.textFile("/home/ubuntu/data.txt",4).map(removePunctuation)
Both options work, of course, since Regex is serializable, as you can confirm.
On a secondary but very important note: constructing a Regex is expensive, so factor it out of your transformations for the sake of performance, possibly with a broadcast.
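For example (a sketch, assuming an existing SparkContext named sc):
// Compile the pattern once on the driver and broadcast it, rather than
// re-building it for every record inside the transformation.
val punctPattern = sc.broadcast("[^a-zA-Z0-9\\s]".r)

val cleaned = sc.textFile("/home/ubuntu/data.txt", 4)
  .map(text => punctPattern.value.replaceAllIn(text, "").toLowerCase)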
Read the stack trace; there is:
$outer, type: class A$A21$A$A21
It is a very good hint. Your lambda is serializable, but your class is not.
When you write a lambda expression, that expression holds a reference to the outer class. The outer class in your case is not serializable, i.e. it does not implement Serializable, or one of its fields is not serializable.

org.apache.spark.SparkException: Task not serializable with lambda

I'm very new to Scala and Spark, and I'm having an issue that has me very confused. Please give me some advice.
I'm making an RDD[MyEntityClass] from an RDD[Array[String]] using a lambda. I ran into an error saying there is a null value when parsing a String to a Long, so to investigate this I extracted the conversion into a method that lets me set a breakpoint.
However, now I'm getting org.apache.spark.SparkException: Task not serializable and I can't find what's wrong. Below is my code snippet; please help me if you can spot anything.
def makingData(): RDD[MyEntityClass] = {
  ...
  data.map(row => toMyEntityClass(row))
}

def toMyEntityClass(row: Array[String]): MyEntityClass = {
  var id = row(0).toLong
  var name = row(1)
  var code = row(2).toLong
  var parentId = row(3).toLong
  var status = row(4)
  MyEntityClass(id, name, code, parentId, status)
}
===== updated question =====
I'm updating my question to respond to your advice. I already have MyEntityClass as a case class, as below:
case class MyEntityClass(id: Long, name: String, code: Long, parentId: Long, status: String)
===== appended stack trace =====
Task not serializable
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2030)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:314)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:313)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.map(RDD.scala:313)
at com.myproject.repository.MyRepositorySpec.getDummyData(MyRepositorySpec.scala:40)
at com.myproject.repository.MyRepositorySpec$$anonfun$3.apply(MyRepositorySpec.scala:66)
at com.myproject.repository.MyRepositorySpec$$anonfun$3.apply(MyRepositorySpec.scala:65)
at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1681)
at org.scalatest.Suite$class.withFixture(Suite.scala:1031)
at org.scalatest.FlatSpec.withFixture(FlatSpec.scala:1691)
at org.scalatest.FlatSpecLike$class.invokeWithFixture$1(FlatSpecLike.scala:1678)
at org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1690)
at org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1690)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:287)
at org.scalatest.FlatSpecLike$class.runTest(FlatSpecLike.scala:1690)
at org.scalatest.FlatSpec.runTest(FlatSpec.scala:1691)
at org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1748)
at org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1748)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:394)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:382)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:382)
at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:371)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:408)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:382)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:382)
at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:377)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:459)
at org.scalatest.FlatSpecLike$class.runTests(FlatSpecLike.scala:1748)
at org.scalatest.FlatSpec.runTests(FlatSpec.scala:1691)
at org.scalatest.Suite$class.run(Suite.scala:1320)
at org.scalatest.FlatSpec.org$scalatest$FlatSpecLike$$super$run(FlatSpec.scala:1691)
at org.scalatest.FlatSpecLike$$anonfun$run$1.apply(FlatSpecLike.scala:1794)
at org.scalatest.FlatSpecLike$$anonfun$run$1.apply(FlatSpecLike.scala:1794)
at org.scalatest.SuperEngine.runImpl(Engine.scala:519)
at org.scalatest.FlatSpecLike$class.run(FlatSpecLike.scala:1794)
at org.scalatest.FlatSpec.run(FlatSpec.scala:1691)
at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:46)
at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$1.apply(Runner.scala:1340)
at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$1.apply(Runner.scala:1334)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1334)
at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1011)
at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1010)
at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1500)
at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:1010)
at org.scalatest.tools.Runner$.run(Runner.scala:850)
at org.scalatest.tools.Runner.run(Runner.scala)
at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2(ScalaTestRunner.java:138)
at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:28)
Caused by: java.io.NotSerializableException: org.scalatest.Assertions$AssertionsHelper
Serialization stack:
- object not serializable (class: org.scalatest.Assertions$AssertionsHelper, value: org.scalatest.Assertions$AssertionsHelper#45e639ee)
- field (class: org.scalatest.FlatSpec, name: assertionsHelper, type: class org.scalatest.Assertions$AssertionsHelper)
- object (class com.myproject.repository.MyRepositorySpec, MyRepositorySpec)
- field (class: com.myproject.repository.MyRepositorySpec$$anonfun$getDummyData$1, name: $outer, type: class com.myproject.repository.MyRepositorySpec)
- object (class com.myproject.repository.MyRepositorySpec$$anonfun$getDummyData$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
... 61 more
From the code given above, I understand that you want to convert RDD[Array[String]] to RDD[MyEntityClass].
We have two options here:
1. Make MyEntityClass a case class, which is Serializable by default. For example:
case class MyEntityClass(id: Long, name: String, code: Long, parentId: Long, status: String)
2. Make MyEntityClass a normal class that extends Serializable; then it is eligible for serialization (see the sketch below). Note: in general this approach is used when the class has more than 22 fields (the case-class product-arity limit) and you are on Scala 2.10 or earlier.
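A minimal sketch of the second option, with the same fields as your case class:
class MyEntityClass(val id: Long,
                    val name: String,
                    val code: Long,
                    val parentId: Long,
                    val status: String) extends Serializable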
EDIT: After you confirmed that MyEntityClass is a case class and pasted the SerializationDebugger stack trace, it revealed that MyRepositorySpec is just a test class which extends FlatSpec and has makingData() and toMyEntityClass(). You are using your test class inside the closure, which is the cause of this exception.
This is clearly evident from the error below:
Caused by: java.io.NotSerializableException: org.scalatest.Assertions$AssertionsHelper
Serialization stack:
- object not serializable (class: org.scalatest.Assertions$AssertionsHelper, value: org.scalatest.Assertions$AssertionsHelper#45e639ee)
- field (class: org.scalatest.FlatSpec, name: assertionsHelper, type: class org.scalatest.Assertions$AssertionsHelper)
- object (class com.myproject.repository.MyRepositorySpec, MyRepositorySpec)
- field (class: com.myproject.repository.MyRepositorySpec$$anonfun$getDummyData$1, name: $outer, type: class com.myproject.repository.MyRepositorySpec)
Solution: make MyRepositorySpec Serializable.
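An alternative that sidesteps the problem entirely is to keep the conversion logic out of the test class, so the closure never references the spec at all. A minimal sketch (the object name is illustrative):
object MyEntityConverters extends Serializable {
  def toMyEntityClass(row: Array[String]): MyEntityClass =
    MyEntityClass(row(0).toLong, row(1), row(2).toLong, row(3).toLong, row(4))
}

// In the spec:
// val entities = data.map(MyEntityConverters.toMyEntityClass)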

How to perform Secondary Sort in Spark?

I was searching for secondary sort using Spark and found this solution:
case class RFMCKey(cId: String, R: Double, F: Double, M: Double, C: Double)

class RFMCPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[RFMCKey]
    k.cId.hashCode() % numPartitions
  }
}

object RFMCKey {
  implicit def orderingBycId[A <: RFMCKey]: Ordering[A] = {
    Ordering.by(k => (k.R, k.F * -1, k.M * -1, k.C * -1))
  }
}
Now this is the code that I am using for my RFMC (Recency, Frequency, Monetary, Clumpiness) program.
In the same code, at the end, I am doing:
val rfmcTableSorted = rfmcTable.repartitionAndSortWithinPartitions(new RFMCPartitioner(1))
But when I load this file in spark-shell, I get the following error:
<console>:130: error: RFMCKey is already defined as (compiler-generated) case class companion object RFMCKey
object RFMCKey {
^
<console>:198: error: RFMCKey.type does not take parameters
case (custId, (((rVal, fVal), mVal),cVal)) => (RFMCKey(custId, rVal, fVal, mVal, cVal), rVal+","+fVal+","+mVal+","+cVal)
^
<console>:200: error: value repartitionAndSortWithinPartitions is not a member of org.apache.spark.rdd.RDD[Nothing]
val rfmcTableSorted = rfmcTable.repartitionAndSortWithinPartitions(new RFMCPartitioner(1)).cache()
How do I circumvent this issue?
Update 1
I tried changing the order of declaration of my case class and its companion object, and surprisingly the shell loaded the file without throwing any errors. But when I ran my program, it threw a new error:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1623)
at org.apache.spark.rdd.RDD.map(RDD.scala:286)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$rfmc$.constructRFMC(<console>:113)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:36)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:41)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:43)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:45)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:47)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:49)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:51)
at $iwC$$iwC$$iwC.<init>(<console>:53)
at $iwC$$iwC.<init>(<console>:55)
at $iwC.<init>(<console>:57)
at <init>(<console>:59)
at .<init>(<console>:63)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$rfmc$
Serialization stack:
- object not serializable (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$rfmc$, value: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$rfmc$#757fc606)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$rfmc$$anonfun$17, name: $outer, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$rfmc$)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$rfmc$$anonfun$17, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:38)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:80)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
... 52 more
Update 2
The way I am defining my objects and functions is like this:
object rfmc {
  def constructrfmc() = {
    // Everything goes inside, including the custom key and partitioner
    // code defined above
  }
}
Update 3
The way I define my code in Eclipse, which works perfectly, is:
object rfmc extends App {
  // Everything goes inside, including the custom key and partitioner
  // code defined above
}
I also created a JAR for this code and ran it using spark-submit, and that too worked perfectly.
To address the issue that RFMCKey is already defined, you need to swap the order of your case class and object declarations, as explained in this issue.
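Concretely, in the file you load into the shell, declare the explicit companion object before the case class, roughly like this (a sketch based on your code):
object RFMCKey {
  implicit def orderingBycId[A <: RFMCKey]: Ordering[A] =
    Ordering.by(k => (k.R, k.F * -1, k.M * -1, k.C * -1))
}

case class RFMCKey(cId: String, R: Double, F: Double, M: Double, C: Double)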
Regarding your updates, there may be some limitations in the spark-shell that prevent it from executing arbitrary code (such as with accumulators). To get more insight into the serialization mechanism, you should pass the option -Dsun.io.serialization.extendedDebugInfo=true. Remember that the spark-shell is more of an exploratory utility for testing small portions of code or new features iteratively thanks to the REPL, not a fully-fledged production-ready tool for testing your code extensively.
Your safest option here is to package your app into a jar, set up Spark in standalone mode, and run spark-submit with your packaged jar. As reflected in updates 3 and 4 of your post, you'll need to update your code to wrap it into an object so that it is the entry point of your job. This will let you make sure your code is not at fault here.
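A minimal sketch of that setup (the master URL and jar name are placeholders):
import org.apache.spark.{SparkConf, SparkContext}

object RFMCJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rfmc")
    val sc = new SparkContext(conf)
    // build rfmcTable as in your code, then:
    // val rfmcTableSorted = rfmcTable.repartitionAndSortWithinPartitions(new RFMCPartitioner(1))
    sc.stop()
  }
}

// Packaged and submitted with:
//   spark-submit --class RFMCJob --master spark://<host>:7077 rfmc.jar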

Serialization Exception on spark

I've run into a very strange serialization problem in Spark.
The code is as below:
class PLSA(val sc: SparkContext, val numOfTopics: Int) extends Serializable {
  def infer(document: RDD[Document]): RDD[DocumentParameter] = {
    val docs = documents.map(doc => DocumentParameter(doc, numOfTopics))
    docs
  }
}
where Document is defined as:
class Document(val tokens: SparseVector[Int]) extends Serializable
and DocumentParameter is:
class DocumentParameter(val document: Document, val theta: Array[Float]) extends Serializable

object DocumentParameter extends Serializable {
  def apply(document: Document, numOfTopics: Int) = new DocumentParameter(document,
    Array.ofDim[Float](numOfTopics))
}
SparseVector is a serializable class in breeze.linalg.
This is a simple map procedure, and all the classes are serializable, but I get this exception:
org.apache.spark.SparkException: Task not serializable
But when I remove the numOfTopics parameter, that is:
object DocumentParameter extends Serializable {
  def apply(document: Document) = new DocumentParameter(document,
    Array.ofDim[Float](10))
}
and call it like this:
val docs = documents.map(DocumentParameter.apply)
and it seems OK.
Is the type Int not serializable? But I do see code written like that elsewhere.
I am not sure how to fix this bug.
#UPDATED#:
Thank you @samthebest. I will add more details about it.
stack trace:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
at org.apache.spark.rdd.RDD.map(RDD.scala:270)
at com.topicmodel.PLSA.infer(PLSA.scala:13)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:30)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
at $iwC$$iwC$$iwC.<init>(<console>:39)
at $iwC$$iwC.<init>(<console>:41)
at $iwC.<init>(<console>:43)
at <init>(<console>:45)
at .<init>(<console>:49)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
... 46 more
As the stack trace gives the general information of exception, I removed it.
I run the code in the spark-shell.
// suppose I have got an RDD[Document] as docs
val numOfTopics = 100
val plsa = new PLSA(sc, numOfTopics)
val docPara = plsa.infer(docs)
Could you give me some tutorials or tips on serialization?
Anonymous functions serialize their containing class. When you map {doc => DocumentParameter(doc, numOfTopics)}, the only way it can give that function access to numOfTopics is to serialize the PLSA class. And that class can't actually be serialized, because (as you can see from the stacktrace) it contains the SparkContext which isn't serializable (Bad Things would happen if individual cluster nodes had access to the context and could e.g. create new jobs from within a mapper).
In general, try to avoid storing the SparkContext in your classes (edit: or at least, make sure it's very clear what kind of classes contain the SparkContext and what kind don't); it's better to pass it as a (possibly implicit) parameter to individual methods that need it. Alternatively, move the function {doc => DocumentParameter(doc, numOfTopics)} into a different class from PLSA, one that really can be serialized.
(As multiple people have suggested, it's possible to keep the SparkContext in the class but mark it as @transient so that it won't be serialized. I don't recommend this approach; it means the class will "magically" change state when serialized (losing the SparkContext), so you might end up with NPEs when you try to access the SparkContext from inside a serialized job. It's better to maintain a clear distinction between classes that are only used in the "control" code (and might use the SparkContext) and classes that are serialized to run on the cluster (which must not have the SparkContext).)
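Another common way to break the reference to the enclosing instance, if you want to keep the method on PLSA, is to copy the field into a local val before the transformation, so the closure captures only that value (a sketch, assuming the Document and DocumentParameter classes from the question):
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class PLSA(val sc: SparkContext, val numOfTopics: Int) {
  def infer(documents: RDD[Document]): RDD[DocumentParameter] = {
    val k = numOfTopics // local copy: the closure captures an Int,
                        // not the PLSA instance (and its SparkContext)
    documents.map(doc => DocumentParameter(doc, k))
  }
}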
This is indeed a weird one, but I think I can guess the problem. First, though, you have not provided the bare minimum needed to solve the problem (I can guess, because I've seen hundreds of these before). Here are some problems with your question:
def infer(document: RDD[Document], numOfTopics: Int): RDD[DocumentParameter] = {
  val docs = documents.map(doc => DocumentParameter(doc, numOfTopics))
}
This method doesn't return RDD[DocumentParameter]; it returns Unit. You must have copied and pasted the code incorrectly.
Secondly, you haven't provided the entire stack trace. Why? There is no reason not to provide the full stack trace; the full stack trace with its message is necessary to understand the error. Usually a NotSerializableException tells you exactly what is not serializable.
Thirdly, you haven't told us where the method infer is defined. Are you doing this in a shell? What is the containing object/class/trait of infer?
Anyway, I'm going to guess that by passing in the Int you're causing a chain of things to get serialized that you don't expect. I can't give you any more information than that until you provide the bare minimum code needed to fully understand your problem.

Akka persistence NotSerializableException for Enumeration

I'm trying to use Akka's persistence module (2.3.0). Unfortunately, when I send a Persistent message containing an Enumeration, I get java.io.NotSerializableException. Here's my simple example:
object TestEnum extends Enumeration with Serializable {
  type TestEnum = Value
  val test = Value("test")
}

class TestProcessor extends Processor with Logging {
  override def receive: Actor.Receive = {
    case PersistenceFailure(payload, sequenceNr, cause) =>
      log.error(s"error when receiving persistent message [$payload]", cause)
    case a =>
      log.error(s"test proc received message [$a]")
  }
}

val a = sys.actorOf(Props[TestProcessor])
a ! Persistent(TestEnum.test)
This ends with:
error when reciving persistent message[test]java.io.NotSerializableException: scala.slick.driver.JdbcTypesComponent$MappedJdbcType$$anon$1
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply$mcV$sp(Serializer.scala:129)
at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply(Serializer.scala:129)
at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply(Serializer.scala:129)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at akka.serialization.JavaSerializer.toBinary(Serializer.scala:129)
at akka.persistence.serialization.MessageSerializer.persistentPayloadBuilder(MessageSerializer.scala:111)
at akka.persistence.serialization.MessageSerializer.akka$persistence$serialization$MessageSerializer$$persistentMessageBuilder(MessageSerializer.scala:97)
at akka.persistence.serialization.MessageSerializer.toBinary(MessageSerializer.scala:45)
at akka.serialization.Serialization$$anonfun$serialize$1.apply(Serialization.scala:90)
at akka.serialization.Serialization$$anonfun$serialize$1.apply(Serialization.scala:90)
at scala.util.Try$.apply(Try.scala:161)
at akka.serialization.Serialization.serialize(Serialization.scala:90)
at akka.persistence.journal.leveldb.LeveldbStore$class.persistentToBytes(LeveldbStore.scala:98)
at akka.persistence.journal.leveldb.LeveldbJournal.persistentToBytes(LeveldbJournal.scala:19)
at akka.persistence.journal.leveldb.LeveldbStore$class.akka$persistence$journal$leveldb$LeveldbStore$$addToMessageBatch(LeveldbStore.scala:104)
at akka.persistence.journal.leveldb.LeveldbStore$$anonfun$writeMessages$1$$anonfun$apply$1.apply(LeveldbStore.scala:48)
at akka.persistence.journal.leveldb.LeveldbStore$$anonfun$writeMessages$1$$anonfun$apply$1.apply(LeveldbStore.scala:48)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at akka.persistence.journal.leveldb.LeveldbStore$$anonfun$writeMessages$1.apply(LeveldbStore.scala:48)
at akka.persistence.journal.leveldb.LeveldbStore$$anonfun$writeMessages$1.apply(LeveldbStore.scala:48)
at akka.persistence.journal.leveldb.LeveldbStore$class.withBatch(LeveldbStore.scala:90)
at akka.persistence.journal.leveldb.LeveldbJournal.withBatch(LeveldbJournal.scala:19)
at akka.persistence.journal.leveldb.LeveldbStore$class.writeMessages(LeveldbStore.scala:48)
at akka.persistence.journal.leveldb.LeveldbJournal.writeMessages(LeveldbJournal.scala:19)
at akka.persistence.journal.SyncWriteJournal$$anonfun$receive$1$$anonfun$1.apply$mcV$sp(SyncWriteJournal.scala:27)
at akka.persistence.journal.SyncWriteJournal$$anonfun$receive$1$$anonfun$1.apply(SyncWriteJournal.scala:27)
at akka.persistence.journal.SyncWriteJournal$$anonfun$receive$1$$anonfun$1.apply(SyncWriteJournal.scala:27)
at scala.util.Try$.apply(Try.scala:161)
at akka.persistence.journal.SyncWriteJournal$$anonfun$receive$1.applyOrElse(SyncWriteJournal.scala:27)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at akka.persistence.journal.leveldb.LeveldbJournal.aroundReceive(LeveldbJournal.scala:19)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Does anybody have any idea what is going on here and how to fix it?
It was more complicated than I thought. Following Patrik's suggestion, I defined all participating objects as top-level classes/objects in different files, and the test example started to work. Finally, I discovered why the error occurs:
I'm using Slick to communicate with my DB, so I declared a very useful Enumeration base class with a default mapper that allows Slick to use Enumerations without writing converters:
abstract class DBEnum extends Enumeration {
  implicit val enumMapper = MappedJdbcType.base[Value, String](_.toString, this.withName(_))
}
Then I derived from that class and created:
object AccountType extends DBEnum {
  type AccountType = Value
  val ADMINISTRATOR = Value("administrator")
  val REGULAR = Value("regular")
  val ADMINISTRATOR_SPONSORED = Value("administrator_sponsored")
}
My actual test example was:
a ! Persistent(TestEnum.test)
a ! Persistent(AccountType.ADMINISTRATOR)
It seems that TestEnum ended up with an enumMapper (probably because of some weird implicit conversion; maybe someone can explain how that is possible) which can't be serialized (it does not implement Serializable).
Changing
implicit val enumMapper = MappedJdbcType.base[Value, String](_.toString, this.withName(_))
to
implicit def enumMapper = MappedJdbcType.base[Value, String](_.toString, this.withName(_))
fixed my problem.