I am using IntelliJ Community Edition with the Scala plugin and Spark libraries. I am still learning Spark and am using a Scala Worksheet.
I have written the code below, which removes punctuation marks from a String:
def removePunctuation(text: String): String = {
  val punctPattern = "[^a-zA-Z0-9\\s]".r
  punctPattern.replaceAllIn(text, "").toLowerCase
}
Then I read a text file and try to remove punctuation:
val myfile = sc.textFile("/home/ubuntu/data.txt",4).map(removePunctuation)
This gives the error below; any help would be appreciated:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(/home/ubuntu/src/main/scala/Test.sc:294)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(/home/ubuntu/src/main/scala/Test.sc:284)
at org.apache.spark.util.ClosureCleaner$.clean(/home/ubuntu/src/main/scala/Test.sc:104)
at org.apache.spark.SparkContext.clean(/home/ubuntu/src/main/scala/Test.sc:2090)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(/home/ubuntu/src/main/scala/Test.sc:366)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(/home/ubuntu/src/main/scala/Test.sc:365)
at org.apache.spark.rdd.RDDOperationScope$.withScope(/home/ubuntu/src/main/scala/Test.sc:147)
at #worksheet#.#worksheet#(/home/ubuntu/src/main/scala/Test.sc:108)
Caused by: java.io.NotSerializableException: A$A21$A$A21
Serialization stack:
- object not serializable (class: A$A21$A$A21, value: A$A21$A$A21@62db3891)
- field (class: A$A21$A$A21$$anonfun$words$1, name: $outer, type: class A$A21$A$A21)
- object (class A$A21$A$A21$$anonfun$words$1, )
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.map(RDD.scala:369)
at A$A21$A$A21.words$lzycompute(Test.sc:27)
at A$A21$A$A21.words(Test.sc:27)
at A$A21$A$A21.get$$instance$$words(Test.sc:27)
at A$A21$.main(Test.sc:73)
at A$A21.main(Test.sc)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.jetbrains.plugins.scala.worksheet.MyWorksheetRunner.main(MyWorksheetRunner.java:22)
As T. Gaweda already pointed out, you're most likely defining your function in a class that's not serializable. Because it is a pure function, i.e. it doesn't depend on any context of the enclosing class, I suggest you put it into a companion object that extends Serializable. This would be Scala's equivalent of a Java static method:
object Helper extends Serializable {
  def removePunctuation(text: String): String = {
    val punctPattern = "[^a-zA-Z0-9\\s]".r
    punctPattern.replaceAllIn(text, "").toLowerCase
  }
}
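The call site then just references the object's method, so no enclosing instance is captured; with your file it would look roughly like this:
val myfile = sc.textFile("/home/ubuntu/data.txt", 4).map(Helper.removePunctuation)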
As @TGaweda suggests, Spark's SerializationDebugger is very helpful for identifying "the serialization path leading from the given object to the problematic object." All the dollar signs before the "Serialization stack" in the stack trace indicate that the container object for your method is the problem.
While it is easiest to just slap Serializable on your container class, I prefer to take advantage of the fact that Scala is a functional language and use your function as a first-class citizen:
sc.textFile("/home/ubuntu/data.txt", 4).map { text =>
  val punctPattern = "[^a-zA-Z0-9\\s]".r
  punctPattern.replaceAllIn(text, "").toLowerCase
}
Or if you really want to keep things separate:
val removePunctuation: String => String = (text: String) => {
  val punctPattern = "[^a-zA-Z0-9\\s]".r
  punctPattern.replaceAllIn(text, "").toLowerCase
}
sc.textFile("/home/ubuntu/data.txt",4).map(removePunctuation)
These options work, of course, because Regex is serializable, as you can confirm for yourself.
On a secondary but very important note, constructing a Regex is expensive, so factor it out of your transformations for the sake of performance, possibly with a broadcast.
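To illustrate that last point, here is one possible sketch (the names Patterns, punct and punctBroadcast are my own, not anything from your code): compile the pattern once and reference it from a serializable object, or ship the compiled Regex via a broadcast variable.
import scala.util.matching.Regex

object Patterns extends Serializable {
  // compiled once per JVM rather than once per record
  val punct: Regex = "[^a-zA-Z0-9\\s]".r
}

val cleaned = sc.textFile("/home/ubuntu/data.txt", 4)
  .map(text => Patterns.punct.replaceAllIn(text, "").toLowerCase)

// or, broadcasting the compiled pattern explicitly (Regex is serializable)
val punctBroadcast = sc.broadcast("[^a-zA-Z0-9\\s]".r)
val cleanedViaBroadcast = sc.textFile("/home/ubuntu/data.txt", 4)
  .map(text => punctBroadcast.value.replaceAllIn(text, "").toLowerCase)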
Read the stack trace; it contains:
$outer, type: class A$A21$A$A21
That is a very good hint: your lambda is serializable, but the class containing it is not.
When you create a lambda expression, the expression holds a reference to its outer class. In your case the outer class is not serializable, i.e. it does not implement Serializable or one of its fields is not serializable.
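A minimal sketch of the mechanism (the class and field names are invented for the example): any field or method of the enclosing class that the lambda touches drags the whole instance into the closure, which is exactly the $outer reference in your trace.
import org.apache.spark.rdd.RDD

class TextCleaner {                              // does not extend Serializable
  val suffix = "!"

  // fails: the lambda reads the field `suffix`, so it captures `this` (the whole TextCleaner)
  def bad(rdd: RDD[String]): RDD[String] = rdd.map(s => s + suffix)

  // works: copy the field into a local val, so the closure captures only a String
  def good(rdd: RDD[String]): RDD[String] = {
    val localSuffix = suffix
    rdd.map(s => s + localSuffix)
  }
}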
Related
I am a C++ programmer and am trying to learn Scala. I want to achieve something similar to the following code that uses a C++ template:
template<typename T>
class Foo {
public:
    T* bar;
    /////////////////Other Code Omitted//////////////////////////
};
Its counterpart in Scala will not compile due to type erasure:
class Foo[E]() {
  val bar = new E() // will not compile
}
I have been searching all night for a workaround; this seems to be one of them:
package test
import scala.reflect._
object Type {
  def newInstance[T: ClassTag](init_args: AnyRef*): T = {
    classTag[T].runtimeClass.getConstructors.head.newInstance(init_args: _*).asInstanceOf[T]
  }
}

class Foo[T1: ClassTag](init_args: AnyRef*) {
  val bar = Type.newInstance[T1](init_args)
}

class TestClass(val arg: String) {
  val data = arg
}
However, when I try to instantiate one (val test = new Foo[TestClass]("test")) in the Scala console, it gives the following error:
java.lang.IllegalArgumentException: argument type mismatch
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at ParActor.Type$.newInstance(ParActor.scala:32)
at ParActor.Foo.<init>(ParActor.scala:37)
... 35 elided
I am not exactly sure what causes the problem or how to fix it. Other workarounds are also welcome.
You should turn
Type.newInstance[T1](init_args)
into
Type.newInstance[T1](init_args: _*)
What : _* does is tell the compiler to expand a sequence into a varargs argument. A varargs parameter AnyRef* is actually received as an IndexedSeq[AnyRef], more specifically a WrappedArray[AnyRef]. So when you pass init_args to newInstance without telling the compiler to interpret it as a varargs argument, you are actually passing WrappedArray(WrappedArray("test")), a single argument whose type does not match the constructor, hence the argument type mismatch.
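Applied to the Foo class from the question, the change is just adding the splat at the call site (everything else is the original code; the usage line is only for illustration):
import scala.reflect._

class Foo[T1: ClassTag](init_args: AnyRef*) {
  val bar = Type.newInstance[T1](init_args: _*)  // forward the varargs instead of wrapping them again
}

val test = new Foo[TestClass]("test")            // now resolves TestClass's (String) constructor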
I'm working on refactoring our code so that we can use the cake pattern for DI.
I've stumbled upon a serialisation issue which I'm having difficulty understanding.
When I call this function:
def getAccounts(winZones: Broadcast[List[WindowsZones]]): RDD[AccountDetails] = {
  val accounts = getAccounts // call to db
  val result = accounts.map(row =>
      Some(AccountDetails(UUID.fromString(row.getAs[String]("")),
        row.getAs[String](""),
        UUID.fromString(row.getAs[String]("")),
        row.getAs[String](""),
        row.getAs[String](""),
        DateUtils.getIanaZoneFromWinZone(row.getAs[String]("timeZone"), winZones))))
    .map(m => m.get)
  result
}
it works perfectly, but this is ugly and I want to refactor it so that the middle mapping from row to AccountDetails is placed inside a private function - but doing that causes the serialisation issue.
I'd like:
def getAccounts(winZones: Broadcast[List[WindowsZones]]): RDD[AccountDetails] = {
  val accounts = getAccounts
  val result = accounts
    .map(m => getAccountDetails(m, winZones))
    .filter(_.isDefined)
    .map(m => m.get)
  result
}

private def getAccountDetails(row: Row, winZones: Broadcast[List[WindowsZones]]): Option[AccountDetails] = {
  try {
    Some(AccountDetails(UUID.fromString(""),
      row.getAs[String](""),
      UUID.fromString(row.getAs[String]("")),
      row.getAs[String](""),
      row.getAs[String](""),
      DateUtils.getIanaZoneFromWinZone(row.getAs[String]("timeZone"), winZones)))
  }
  catch {
    case e: Exception =>
      logger.error(s"Unable to set AccountDetails $e")
      None
  }
}
Any help is appreciated, of course. The AccountDetails object is a case class, should that be pertinent. I'm also happy to take any other advice on implementing cake or DI with Spark in general. Thanks.
Edit to show structure:
trait serviceImpl extends anotherComponent { this: DBA =>
  def Accounts = new Accounts

  class Accounts extends AccountService {
    // the methods above are defined here
  }
}
Edit to include stacktrace:
17/02/13 17:32:32 INFO CodeGenerator: Code generated in 271.36617 ms
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2039)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:366)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:365)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.map(RDD.scala:365)
at FunnelServiceComponentImpl$FunnelAccounts.getAccounts(FunnelServiceComponentImpl.scala:24)
at Main$.delayedEndpoint$Main$1(Main.scala:26)
at Main$delayedInit$body.apply(Main.scala:7)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at Main$.main(Main.scala:7)
at Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Caused by: java.io.NotSerializableException: FunnelServiceComponentImpl$FunnelAccounts
Serialization stack:
- object not serializable (class: FunnelServiceComponentImpl$FunnelAccounts, value: FunnelServiceComponentImpl$FunnelAccounts@16b7e04a)
- field (class: FunnelServiceComponentImpl$FunnelAccounts$$anonfun$1, name: $outer, type: class FunnelServiceComponentImpl$FunnelAccounts)
- object (class FunnelServiceComponentImpl$FunnelAccounts$$anonfun$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 26 more
17/02/13 17:32:32 INFO SparkContext: Invoking stop() from shutdown hook
Where are you defining the functions?
Let's say you are defining them in a class X. If that class is not serializable, it would cause your issue.
To solve this you can either make it an object instead or make the class serializable.
Because getAccountDetails is in your class, Spark will want to serialize your entire FunnelAccounts object. After all, you need an instance in order to use this method. However, FunnelAccounts is not serializable. Thus it can't be sent off to a worker.
In your case, you should move getAccountDetails into an object (for example a companion object of FunnelAccounts) so that you don't need an instance of FunnelAccounts to run it, as sketched below.
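A rough sketch of that refactoring, reusing the AccountDetails, WindowsZones and DateUtils types from your post and keeping the elided column names as they are (the object name AccountMapper is made up, and I've dropped the logger call since a non-serializable logger field would reintroduce the same problem):
import java.util.UUID
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.Row

object AccountMapper {
  def getAccountDetails(row: Row, winZones: Broadcast[List[WindowsZones]]): Option[AccountDetails] =
    try {
      Some(AccountDetails(UUID.fromString(row.getAs[String]("")),
        row.getAs[String](""),
        UUID.fromString(row.getAs[String]("")),
        row.getAs[String](""),
        row.getAs[String](""),
        DateUtils.getIanaZoneFromWinZone(row.getAs[String]("timeZone"), winZones)))
    } catch {
      case _: Exception => None
    }
}

// inside FunnelAccounts:
// accounts.map(row => AccountMapper.getAccountDetails(row, winZones)).filter(_.isDefined).map(_.get)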
I was searching for secondary sort using Spark and found this solution:
case class RFMCKey(cId: String, R: Double, F: Double, M: Double, C: Double)

class RFMCPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[RFMCKey]
    k.cId.hashCode() % numPartitions
  }
}

object RFMCKey {
  implicit def orderingBycId[A <: RFMCKey]: Ordering[A] = {
    Ordering.by(k => (k.R, k.F * -1, k.M * -1, k.C * -1))
  }
}
Now this is the code that I am using for my RFMC (Recency, Frequency, Monetary, Clumpiness) program.
In the same code, at the end, I am doing:
val rfmcTableSorted = rfmcTable.repartitionAndSortWithinPartitions(new RFMCPartitioner(1))
But when I load this file in spark-shell, I get the following error:
<console>:130: error: RFMCKey is already defined as (compiler-generated) case class companion object RFMCKey
object RFMCKey {
^
<console>:198: error: RFMCKey.type does not take parameters
case (custId, (((rVal, fVal), mVal),cVal)) => (RFMCKey(custId, rVal, fVal, mVal, cVal), rVal+","+fVal+","+mVal+","+cVal)
^
<console>:200: error: value repartitionAndSortWithinPartitions is not a member of org.apache.spark.rdd.RDD[Nothing]
val rfmcTableSorted = rfmcTable.repartitionAndSortWithinPartitions(new RFMCPartitioner(1)).cache()
How do I circumvent this issue?
Update 1
I tried changing the order of declaration of my case class and its companion object, and surprisingly the shell loaded the file without throwing any errors. But when I ran my program it threw a new error:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1623)
at org.apache.spark.rdd.RDD.map(RDD.scala:286)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$rfmc$.constructRFMC(<console>:113)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:36)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:41)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:43)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:45)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:47)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:49)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:51)
at $iwC$$iwC$$iwC.<init>(<console>:53)
at $iwC$$iwC.<init>(<console>:55)
at $iwC.<init>(<console>:57)
at <init>(<console>:59)
at .<init>(<console>:63)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$rfmc$
Serialization stack:
- object not serializable (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$rfmc$, value: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$rfmc$@757fc606)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$rfmc$$anonfun$17, name: $outer, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$rfmc$)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$rfmc$$anonfun$17, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:38)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:80)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
... 52 more
Update 2
The way I am defining my objects and functions is like this:
object rfmc {
  def constructrfmc() = {
    // Everything goes inside including the custom key and partitioner
    // code defined above
  }
}
Update 3
The way I define my code in Eclipse, which works perfectly, is:
object rfmc extends App {
  // Everything goes inside including the custom key and partitioner
  // code defined above
}
I also created a JAR for this code and ran it using spark-submit, and that too worked perfectly.
To address the issue that RFMCKey is already defined, you need to swap the order of your case class and object declaration as explained in this issue.
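Concretely, that means declaring the companion object before the case class (or pasting both in a single :paste block in the shell); roughly, reusing your own definitions:
object RFMCKey {
  implicit def orderingBycId[A <: RFMCKey]: Ordering[A] =
    Ordering.by(k => (k.R, k.F * -1, k.M * -1, k.C * -1))
}

case class RFMCKey(cId: String, R: Double, F: Double, M: Double, C: Double)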
Regarding your updates, there may be limitations in the spark-shell that prevent it from executing arbitrary code (as is the case with accumulators, for example). To get more insight into the serialization mechanism, pass the JVM option -Dsun.io.serialization.extendedDebugInfo=true. Remember that the spark-shell is primarily an exploratory tool for iteratively testing small portions of code or new features in a REPL; it is not a fully-fledged, production-ready tool that you should rely on to test all of your code.
Your safest option here is to package your app into a jar, set up Spark in standalone mode, and run spark-submit with your packaged jar. As reflected in update 3 of your post, you'll need to wrap your code in an object so that it is the entry point of your job. This will let you make sure your code itself is not at fault here.
I am consistently getting Task not serializable errors.
I have made a small class that extends Serializable, which is what I believe is required when you need values in it to be serialised.
class SGD(filePath: String) extends Serializable {

  val rdd = sc.textFile(filePath)

  val mappedRDD = rdd.map(x => x.split(" ").slice(0, 3))
    .map(y => Rating(y(0).toInt, y(1).toInt, y(2).toDouble))
    .cache

  val RNG = new Random(1)

  val factorsRDD = mappedRDD.map(x => (x.user, (x.product, x.rating)))
    .groupByKey
    .mapValues(listOfItemsAndRatings =>
      Vector(Array.fill(2){RNG.nextDouble}))
}
The final line always results in a Task not serializable error. What I do not understand is: the class is Serializable, and the class Random is also Serializable according to the API. So what am I doing wrong? I consistently can't get things like this to work; therefore, I imagine my understanding is wrong. I keep being told the class must be Serializable... well, it is, and it still doesn't work!?
scala.util.Random was not Serializable until 2.11.0-M2.
Most likely you are using an earlier version of Scala.
A class doesn't become Serializable until all its members are Serializable as well (or some other mechanism is provided to serialize them, e.g. transient or readObject/writeObject.)
I get the following stack trace when running the given example on Spark 1.3:
Caused by: java.io.NotSerializableException: scala.util.Random
Serialization stack:
- object not serializable (class: scala.util.Random, value: scala.util.Random@52bbf03d)
- field (class: $iwC$$iwC$SGD, name: RNG, type: class scala.util.Random)
One way to fix it is to move the instantiation of the random variable inside mapValues:
mapValues(listOfItemsAndRatings => {
  val RNG = new Random(1)
  Vector(Array.fill(2)(RNG.nextDouble))
})
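If you would rather not construct a new Random for every value, another option (only a sketch, building on the mappedRDD from your post) is mapPartitions, which lets you create one generator per partition on the executor side:
import scala.util.Random

val factorsRDD = mappedRDD
  .map(x => (x.user, (x.product, x.rating)))
  .groupByKey
  .mapPartitions({ iter =>
    val rng = new Random(1)                      // created on the executor, never serialized
    iter.map { case (user, itemsAndRatings) =>
      (user, Vector(Array.fill(2)(rng.nextDouble)))
    }
  }, preservesPartitioning = true)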
I have run into a very strange serialization problem in Spark.
The code is as follows:
class PLSA(val sc: SparkContext, val numOfTopics: Int) extends Serializable {
  def infer(documents: RDD[Document]): RDD[DocumentParameter] = {
    val docs = documents.map(doc => DocumentParameter(doc, numOfTopics))
    docs
  }
}
where Document is defined as:
class Document(val tokens: SparseVector[Int]) extends Serializable
and DocumentParameter is:
class DocumentParameter(val document: Document, val theta: Array[Float]) extends Serializable
object DocumentParameter extends Serializable {
  def apply(document: Document, numOfTopics: Int) =
    new DocumentParameter(document, Array.ofDim[Float](numOfTopics))
}
SparseVector is a serializable class (breeze.linalg.SparseVector).
This is a simple map procedure, and all the classes are serializable, but I get this exception:
org.apache.spark.SparkException: Task not serializable
But when I remove the numOfTopics parameter, that is:
object DocumentParameter extends Serializable {
  def apply(document: Document) =
    new DocumentParameter(document, Array.ofDim[Float](10))
}
and call it like this:
val docs = documents.map(DocumentParameter.apply)
and it seems OK.
Is the type Int not serializable? But I do see other code written like that.
I am not sure how to fix this bug.
#UPDATED#:
Thank you @samthebest. I will add more details about it.
stack trace:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
at org.apache.spark.rdd.RDD.map(RDD.scala:270)
at com.topicmodel.PLSA.infer(PLSA.scala:13)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:30)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
at $iwC$$iwC$$iwC.<init>(<console>:39)
at $iwC$$iwC.<init>(<console>:41)
at $iwC.<init>(<console>:43)
at <init>(<console>:45)
at .<init>(<console>:49)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
... 46 more
As the rest of the stack trace only gives general information about the exception, I removed it.
I run the code in spark-shell.
// suppose I have already obtained an RDD[Document] called docs
val numOfTopics = 100
val plsa = new PLSA(sc, numOfTopics)
val docPara = plsa.infer(docs)
Could you give me some tutorials or tips on serialization?
Anonymous functions serialize their containing class. When you map {doc => DocumentParameter(doc, numOfTopics)}, the only way Spark can give that function access to numOfTopics is to serialize the PLSA instance. And that class can't actually be serialized, because (as you can see from the stack trace) it contains the SparkContext, which isn't serializable (Bad Things would happen if individual cluster nodes had access to the context and could e.g. create new jobs from within a mapper).
In general, try to avoid storing the SparkContext in your classes (edit: or at least, make sure it's very clear what kind of classes contain the SparkContext and what kind don't); it's better to pass it as a (possibly implicit) parameter to individual methods that need it. Alternatively, move the function {doc => DocumentParameter(doc, numOfTopics)} into a different class from PLSA, one that really can be serialized.
(As multiple people have suggested, it's possible to keep the SparkContext in the class but mark it as @transient so that it won't be serialized. I don't recommend this approach; it means the class will "magically" change state when serialized (losing the SparkContext), and so you might end up with NPEs when you try to access the SparkContext from inside a serialized job. It's better to maintain a clear distinction between classes that are only used in the "control" code (and might use the SparkContext) and classes that are serialized to run on the cluster (which must not have the SparkContext).)
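For what it's worth, another common workaround for the specific closure in the question (not quite what's described above, but in the same spirit) is to copy numOfTopics into a local val inside infer, so the lambda captures only an Int rather than the whole PLSA instance. A sketch, assuming the Document and DocumentParameter classes from the post:
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class PLSA(val sc: SparkContext, val numOfTopics: Int) {
  def infer(documents: RDD[Document]): RDD[DocumentParameter] = {
    val topics = numOfTopics                     // local copy: the closure no longer references `this`
    documents.map(doc => DocumentParameter(doc, topics))
  }
}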
This is indeed a weird one, but I think I can guess the problem. But first, you have not provided the bare minimum to solve the problem (I can guess, because I've seen 100s of these before). Here are some problems with your question:
def infer(document: RDD[Document], numOfTopics: Int): RDD[DocumentParameter] = {
  val docs = documents.map(doc => DocumentParameter(doc, numOfTopics))
}
This method doesn't return RDD[DocumentParameter]; it returns Unit. You must have copied and pasted your code incorrectly.
Secondly, you haven't provided the entire stack trace. Why? There is no reason not to provide the full stack trace, and the full stack trace with its message is necessary to understand the error; one needs the whole error to understand what the error is. Usually a NotSerializableException tells you exactly what is not serializable.
Thirdly, you haven't told us where the method infer is defined. Are you doing this in a shell? What is the containing object/class/trait, etc., of infer?
Anyway, I'm going to guess that by passing in the Int you're causing a chain of things to get serialized that you don't expect. I can't give you any more information than that until you provide the minimal code needed to fully understand your problem.