I am consistently running into Task not serializable errors.
I have made a small class that extends Serializable, which is what I believe is required when you need its values to be serialised.
class SGD(filePath: String) extends Serializable {

  val rdd = sc.textFile(filePath)

  val mappedRDD = rdd.map(x => x.split(" ").slice(0, 3))
    .map(y => Rating(y(0).toInt, y(1).toInt, y(2).toDouble))
    .cache

  val RNG = new Random(1)

  val factorsRDD = mappedRDD.map(x => (x.user, (x.product, x.rating)))
    .groupByKey
    .mapValues(listOfItemsAndRatings => Vector(Array.fill(2){RNG.nextDouble}))
}
The final line always results in a Task not serializable error. What I do not understand is: the class is Serializable, and the class Random is also Serializable according to the API. So what am I doing wrong? I consistently can't get things like this to work; therefore, I imagine my understanding is wrong. I keep being told the class must be Serializable... well, it is, and it still doesn't work!?
scala.util.Random was not Serializable until 2.11.0-M2.
Most likely you are using an earlier version of Scala.
A class doesn't become Serializable until all of its members are Serializable as well (or some other mechanism is provided to serialize them, e.g. marking them @transient or implementing readObject/writeObject).
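A minimal sketch of that rule using plain Java serialization (the class names here are made up for illustration):

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

class NotSer                                                  // does not extend Serializable
class Holder extends Serializable { val inner = new NotSer }  // Serializable, but holds a NotSer

object Demo extends App {
  val out = new ObjectOutputStream(new ByteArrayOutputStream())
  out.writeObject(new Holder)  // throws java.io.NotSerializableException: NotSer
}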
I get the following stack trace when running the given example in Spark 1.3:
Caused by: java.io.NotSerializableException: scala.util.Random
Serialization stack:
- object not serializable (class: scala.util.Random, value: scala.util.Random@52bbf03d)
- field (class: $iwC$$iwC$SGD, name: RNG, type: class scala.util.Random)
One way to fix it is to move the instantiation of the random variable inside mapValues:

mapValues(listOfItemsAndRatings => {
  val RNG = new Random(1)
  Vector(Array.fill(2)(RNG.nextDouble))
})
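Another pattern sometimes used in Spark code (not part of the original answer, and assuming the rest of the enclosing class serializes cleanly) is to mark the generator as a @transient lazy val, so it is skipped during serialization and re-created lazily on each executor. In the sketch below, sc is the SparkContext from the question and Rating is assumed to be org.apache.spark.mllib.recommendation.Rating, matching the field names used:

import scala.util.Random
import org.apache.spark.mllib.recommendation.Rating

class SGD(filePath: String) extends Serializable {

  // excluded from serialization; rebuilt lazily wherever it is first accessed
  @transient lazy val RNG = new Random(1)

  val mappedRDD = sc.textFile(filePath)
    .map(_.split(" ").slice(0, 3))
    .map(y => Rating(y(0).toInt, y(1).toInt, y(2).toDouble))
    .cache

  val factorsRDD = mappedRDD.map(x => (x.user, (x.product, x.rating)))
    .groupByKey
    .mapValues(_ => Vector(Array.fill(2)(RNG.nextDouble)))
}

Note that with either approach each re-created generator is seeded with 1, so the drawn values are not independent across copies.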
Related
Say I define the following case class:
case class C(i: Int) {
  lazy val incremented = copy(i = i + 1)
}
And then try to serialize it to json:
import java.io.StringWriter
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

val mapper = new ObjectMapper()
mapper.registerModule(DefaultScalaModule)

val out = new StringWriter
mapper.writeValue(out, C(4))

val json = out.toString()
println("Json is: " + json)
It will throw the following exception:
Exception in thread "main" com.fasterxml.jackson.databind.JsonMappingException: Infinite recursion (StackOverflowError) (through reference chain: C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->
...
I don't understand why it is trying to serialize the lazy val by default in the first place; that does not seem like the logical approach to me.
And can I disable this behaviour?
This happens because Jackson is designed for Java. Specifically, note that:
Java has no notion of a lazy val
Java's normal semantics around fields and constructors don't allow fields to be partitioned into "needed for construction" and "derived from construction" (neither of those is a technical term) the way Scala's combination of vals in the default constructor (implicitly present in a case class) and vals in a class's body does
The consequence of the second is that (except for beans, sometimes) Java-oriented serialization approaches tend to assume that anything which is a field in the object (including private fields, since the Java idiom is to make fields private by default) needs to be serialized, with the ability to opt out through @transient annotations.
The first, in turn, means that lazy vals are implemented by the compiler in a way that includes a private field.
Thus, to a Java-oriented serializer like Jackson, a lazy val without an @transient annotation gets serialized.
Scala-oriented serialization approaches (e.g. circe, play-json, etc.) tend to serialize case classes by only serializing the constructor parameters.
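For example, a quick sketch with circe's generic derivation (assuming circe-generic is on the classpath): only the constructor parameter i appears in the output, because derivation is driven by the constructor and the lazy val in the body is never touched.

import io.circe.generic.auto._
import io.circe.syntax._

case class C(i: Int) {
  lazy val incremented = copy(i = i + 1)
}

println(C(4).asJson.noSpaces)  // {"i":4}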
The solution I found was to use json4s for my serialization rather than Jackson databind. My issue arose while using Akka Cluster, so I had to add a custom serializer to my project. For reference, here is my complete implementation:
import java.nio.charset.StandardCharsets

import akka.actor.ExtendedActorSystem
import akka.actor.typed.{ActorRef, ActorRefResolver}
import akka.actor.typed.scaladsl.adapter._
import akka.serialization.Serializer
import org.json4s.{CustomSerializer, DefaultFormats, Formats, JString}
import org.json4s.native.Serialization.{read, write}  // the json4s "jackson" backend's Serialization works the same way

import scala.reflect.ManifestFactory

class Json4sSerializer(system: ExtendedActorSystem) extends Serializer {
  private val actorRefResolver = ActorRefResolver(system.toTyped)

  // Typed ActorRefs are (de)serialized via their string serialization format
  object ActorRefSerializer extends CustomSerializer[ActorRef[_]](format => (
    {
      case JString(str) =>
        actorRefResolver.resolveActorRef[AnyRef](str)
    },
    {
      case actorRef: ActorRef[_] =>
        JString(actorRefResolver.toSerializationFormat(actorRef))
    }
  ))

  implicit private val formats: Formats = DefaultFormats + ActorRefSerializer

  def includeManifest: Boolean = true

  def identifier = 1234567

  def toBinary(obj: AnyRef): Array[Byte] =
    write(obj).getBytes(StandardCharsets.UTF_8)

  def fromBinary(bytes: Array[Byte], clazz: Option[Class[_]]): AnyRef = clazz match {
    case Some(cls) =>
      read[AnyRef](new String(bytes, StandardCharsets.UTF_8))(formats, ManifestFactory.classType(cls))
    case None =>
      throw new RuntimeException("Specified includeManifest but it was never passed")
  }
}
You can't serialize that class because the value is infinitely recursive (hence the stack overflow). Specifically, the value of incremented for C(4) is an instance of C(5). The value of incremented for C(5) is C(6). The value of incremented for C(6) is C(7) and so on...
Since an instance of C(n) contains an instance of C(n+1), it can never be fully serialized.
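Concretely:

val c = C(4)
c.incremented              // C(5)
c.incremented.incremented  // C(6)
// ...and so on: a field-walking serializer never reaches a value without an `incremented` field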
If you don't want a field to appear in the JSON, make it a function:
case class C(i: Int) {
def incremented = copy(i = i + 1)
}
The root of this problem is trying to serialise a class that also implements application logic, which breaches the Single Responsibility Principle (the S in SOLID), a form of separation of concerns.
It is better to have distinct classes for serialisation and populate them from the application data as necessary. This allows different forms of serialisation to be used without having to change the application logic.
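A hedged illustration of that approach, using a hypothetical CJson view class populated from the domain object:

case class C(i: Int) {
  lazy val incremented = copy(i = i + 1)  // application logic stays on the domain class
}

case class CJson(i: Int)                  // serialization-only view, no derived fields
object CJson {
  def from(c: C): CJson = CJson(c.i)
}

// mapper.writeValue(out, CJson.from(C(4)))  // produces {"i":4}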
I am trying to join two RDDs and save the result to Cassandra, but my code doesn't work. At the beginning I had one huge main method and everything worked well, but when I split it into a function and a class it stopped working. I am new to Scala and Spark.
The code is:
class Migration extends Serializable {

  case class userId(offerFamily: String, bp: String, pdl: String) extends Serializable
  case class siteExternalId(site_external_id: Option[String]) extends Serializable
  case class profileData(begin_ts: Option[Long], Source: Option[String]) extends Serializable

  def SparkMigrationProfile(sc: SparkContext) = {
    val test = sc.cassandraTable[siteExternalId](KEYSPACE, TABLE)
      .keyBy[userId]
      .filter(x => x._2.site_external_id != None)

    val profileRDD = sc.cassandraTable[profileData](KEYSPACE, TABLE)
      .keyBy[userId]

    // doesn't work
    test.join(profileRDD)
      .foreach(println)

    // doesn't work
    test.join(profileRDD)
      .saveToCassandra(keyspace, table)
  }
}
At the beginning I get the famous: Exception in thread "main" org.apache.spark.SparkException: Task not serializable at . . .
So I made my main class extend Serializable, and the case classes as well, but it still doesn't work.
I think you should move your case classes from the Migration class to a dedicated file and/or object. This should solve your problem. Additionally, Scala case classes are serializable by default.
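A sketch of that restructuring, keeping the question's names (KEYSPACE and TABLE are assumed to be defined elsewhere, as in the original code):

import com.datastax.spark.connector._
import org.apache.spark.SparkContext

// Top-level case classes: constructing them no longer captures a Migration instance
case class userId(offerFamily: String, bp: String, pdl: String)
case class siteExternalId(site_external_id: Option[String])
case class profileData(begin_ts: Option[Long], Source: Option[String])

class Migration extends Serializable {
  def SparkMigrationProfile(sc: SparkContext) = {
    val test = sc.cassandraTable[siteExternalId](KEYSPACE, TABLE)
      .keyBy[userId]
      .filter(_._2.site_external_id.isDefined)

    val profileRDD = sc.cassandraTable[profileData](KEYSPACE, TABLE)
      .keyBy[userId]

    test.join(profileRDD)
      .saveToCassandra(KEYSPACE, TABLE)
  }
}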
Here is a runnable demo
import java.io._

object Main extends App {
  case class Value(s: String)

  val serializer = new ObjectOutputStream(new ByteArrayOutputStream())
  serializer.writeObject(Value("123"))
  println("success") //> success
}
Please note that the program succeeds even though I do not mark my class with Serializable. Does Serializable make sense in Scala?
Case classes extend Serializable by default in Scala. If you create a regular class, it will need to extend Serializable; otherwise it will throw a serialization error.
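A short sketch in the same style as the demo above, contrasting the two:

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

class Plain(val s: String)       // a regular class, does not extend Serializable
case class Value(s: String)      // case classes pull in Serializable automatically

object Demo extends App {
  val serializer = new ObjectOutputStream(new ByteArrayOutputStream())
  serializer.writeObject(Value("123"))      // succeeds
  serializer.writeObject(new Plain("123"))  // throws java.io.NotSerializableException: Plain
}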
Background
Here's my situation: I'm trying to create a class that filters an RDD based on some feature of the contents, but that feature can be different in different scenarios so I'd like to parameterize that with a function. Unfortunately, I seem to be running into issues with the way Scala captures its closures. Even though my function is serializable, the class is not.
The closure-cleaning example in the Spark source seems to suggest that my situation can't be solved, but I'm convinced there's a way to achieve what I'm trying to do by creating the right (smaller) closure.
My Code
class MyFilter(getFeature: Element => String, other: NonSerializable) {
  def filter(rdd: RDD[Element]): RDD[Element] = {
    // All my complicated logic I want to share
    rdd.filter { elem => getFeature(elem) == "myTargetString" }
  }
}
Simplified Example
class Foo(f: Int => Double, rdd: RDD[Int]) {
  def go(data: RDD[Int]) = data.map(f)
}
val works = new Foo(_.toDouble, otherRdd)
works.go(myRdd).collect() // works
val myMap = Map(1 -> 10d)
val complicatedButSerializableFunc: Int => Double = x => myMap.getOrElse(x, 0)
val doesntWork = new Foo(complicatedButSerializableFunc, otherRdd)
doesntWork.go(myRdd).collect() // craps out
org.apache.spark.SparkException: Task not serializable
Caused by: java.io.NotSerializableException: $iwC$$iwC$Foo
Serialization stack:
- object not serializable (class: $iwC$$iwC$Foo, value: $iwC$$iwC$Foo@61e33118)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: foo, type: class $iwC$$iwC$Foo)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC@47d6a31a)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1, name: $outer, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1, <function1>)
// Even though
val out = new ObjectOutputStream(new FileOutputStream("test.obj"))
out.writeObject(complicatedButSerializableFunc) // works
Questions
Why does the first simplified example not attempt to serialize all of Foo, but the second one does?
How can I get the reference to my serializable function without including a reference to Foo in my closure?
Found the answer with the help of this article.
Essentially, when creating the closure for a given function, Scala will include the entire enclosing object for any complex field that is referenced (if someone has a good explanation for why this doesn't happen in the first simple example, I'll accept that answer). The solution is to pass the serializable value through another function so that only the minimal reference is kept, very similar to the old JavaScript for-loop pattern for event listeners.
Example
def enclose[E, R](enclosed: E)(func: E => R): R = func(enclosed)

class Foo(f: Int => Double, somethingNonserializable: RDD[String]) {
  def go(data: RDD[Int]) = enclose(f) { actualFunction => data.map(actualFunction) }
}
Or with a JS-style self-executing anonymous function:
def go(data: RDD[Int]) = ((actualFunction: Int => Double) => data.map(actualFunction))(f)
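With that change, the call pattern that failed in the simplified example is expected to work, since the task now captures only the function value and not the enclosing Foo (otherRdd here stands for whatever RDD the constructor expects):

val fixed = new Foo(complicatedButSerializableFunc, otherRdd)
fixed.go(myRdd).collect()  // only `f` is serialized, not the Foo instance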
I have the following RDD:
myRDD:org.apache.spark.rdd.RDD[(String, org.apache.spark.mllib.linalg.Vector)]
Then I want to add a fixed key:
myRdd.map(("myFixedKey",_)):org.apache.spark.rdd.RDD[(String, (String, org.apache.spark.mllib.linalg.Vector))]
But if I use a constant String val instead of a hardcoded/explicit string:
val myFixedKeyVal:String = "myFixedKey"
myRdd.map((myFixedKeyVal,_))
This previous line of code gives the following exception:
org.apache.spark.SparkException: Task not serializable
Am I missing something?
Solution:
OK, I found the problem: myRdd is an object that extends a Serializable class, but this RDD is then processed by another class, e.g. Process:
class Process(someRdd: MyRddClass) extends Serializable {
  def preprocess = someRdd.map(identity)  // `identity` stands in for the real mapping logic
}

val someprocess = new Process(myRdd)
val newRdd = someprocess.preprocess
newRdd.map(x => ("newkey", x))
This class Process must extend Serializable too in order to work. I thought that the newRdd was extending the root class MyRddClass...
The string constant is not the problem. Turn on serialization debugging with -Dsun.io.serialization.extendedDebugInfo=true to figure out the real cause.