Serializing a case class with a lazy val causes a StackOverflow - scala

Say I define the following case class:
case class C(i: Int) {
  lazy val incremented = copy(i = i + 1)
}
And then try to serialize it to json:
val mapper = new ObjectMapper()
mapper.registerModule(DefaultScalaModule)
val out = new StringWriter
mapper.writeValue(out, C(4))
val json = out.toString()
println("Json is: " + json)
It will throw the following exception:
Exception in thread "main" com.fasterxml.jackson.databind.JsonMappingException: Infinite recursion (StackOverflowError) (through reference chain: C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]-
...
I don't know why it is trying to serialize the lazy val by default in the first place; that does not seem like the logical approach to me.
And can I disable this feature?

This happens because Jackson is designed for Java. Specifically, note that:
Java has no idea of a lazy val
Java's normal semantics around fields and constructors don't allow the partitioning of fields into "needed for construction" and "derived from construction" (neither of those is a technical term) that Scala provides through the combination of vals in the default constructor (implicitly present in a case class) and vals in a class's body
The consequence of the second is that (except for beans, sometimes) Java-oriented serialization approaches tend to assume that anything which is a field in the object (including private fields, since the Java idiom is to make fields private by default) needs to be serialized, with the ability to opt out through @transient annotations.
The first, in turn, means that lazy vals are implemented by the compiler in a way that includes a private field.
Thus, to a Java-oriented serializer like Jackson, a lazy val without a @transient annotation gets serialized.
Scala-oriented serialization approaches (e.g. circe, play-json, etc.) tend to serialize case classes by only serializing the constructor parameters.
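For the "can I disable this?" part: one commonly suggested option (a sketch, not the only approach; depending on the jackson-module-scala version the meta-annotation targets below may not even be needed) is to mark the member with Jackson's @JsonIgnore so it never becomes a JSON property:

import com.fasterxml.jackson.annotation.JsonIgnore
import scala.annotation.meta.{field, getter}

case class C(i: Int) {
  // Target both the compiler-generated field and the accessor so Jackson
  // sees the ignore marker wherever it introspects the member.
  @(JsonIgnore @field @getter)
  lazy val incremented: C = copy(i = i + 1)
}

With that in place only the constructor parameter i ends up in the JSON, which is the behaviour the Scala-oriented libraries give you by default.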

The solution I found was to use json4s for my serialization rather than Jackson databind. My issue arose using Akka Cluster, so I had to add a custom serializer to my project. For reference, here is my complete implementation:
import java.nio.charset.StandardCharsets

import akka.actor.ExtendedActorSystem
import akka.actor.typed.{ActorRef, ActorRefResolver}
import akka.actor.typed.scaladsl.adapter._
import akka.serialization.Serializer
import org.json4s.{CustomSerializer, DefaultFormats, Formats}
import org.json4s.JsonAST.JString
// read/write come from whichever json4s backend is on the classpath,
// e.g. org.json4s.native.Serialization or org.json4s.jackson.Serialization
import org.json4s.native.Serialization.{read, write}

import scala.reflect.ManifestFactory

class Json4sSerializer(system: ExtendedActorSystem) extends Serializer {
  private val actorRefResolver = ActorRefResolver(system.toTyped)

  object ActorRefSerializer extends CustomSerializer[ActorRef[_]](format => (
    {
      case JString(str) =>
        actorRefResolver.resolveActorRef[AnyRef](str)
    },
    {
      case actorRef: ActorRef[_] =>
        JString(actorRefResolver.toSerializationFormat(actorRef))
    }
  ))

  implicit private val formats: Formats = DefaultFormats + ActorRefSerializer

  def includeManifest: Boolean = true

  def identifier = 1234567

  def toBinary(obj: AnyRef): Array[Byte] =
    write(obj).getBytes(StandardCharsets.UTF_8)

  def fromBinary(bytes: Array[Byte], clazz: Option[Class[_]]): AnyRef = clazz match {
    case Some(cls) =>
      read[AnyRef](new String(bytes, StandardCharsets.UTF_8))(formats, ManifestFactory.classType(cls))
    case None =>
      throw new RuntimeException("Specified includeManifest but it was never passed")
  }
}

You can't serialize that class because the value is infinitely recursive (hence the stack overflow). Specifically, the value of incremented for C(4) is an instance of C(5). The value of incremented for C(5) is C(6). The value of incremented for C(6) is C(7) and so on...
Since an instance of C(n) contains an instance of C(n+1), it can never be fully serialized.
If you don't want a field to appear in the JSON, make it a def (a method) instead:
case class C(i: Int) {
  def incremented = copy(i = i + 1)
}
The root of this problem is trying to serialise a class that also implements application logic, which breaches the Single Responsibility Principle (the S in SOLID).
It is better to have distinct classes for serialisation and populate them from the application data as necessary. This allows different forms of serialisation to be used without having to change the application logic.
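A minimal sketch of that separation (the names here are illustrative, not from the question): keep the domain class as it is and map it to a plain data-transfer class just before serialization.

// Domain class with behaviour; never handed to the mapper directly.
case class C(i: Int) {
  lazy val incremented: C = copy(i = i + 1)
}

// Plain serialization class: constructor parameters only, no derived members.
case class CDto(i: Int)

object CDto {
  def fromDomain(c: C): CDto = CDto(c.i)
}

// mapper.writeValue(out, CDto.fromDomain(C(4)))   // produces {"i":4}

Swapping the JSON library, or changing the wire format, then only touches CDto.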

Related

How to use Avro serialization for scala case classes with Flink 1.7?

We've got a Flink job written in Scala using case classes (generated from avsc files by avrohugger) to represent our state. We would like to use Avro for serialising our state so that state migration will work when we update our models. We understood that since Flink 1.7 Avro serialization is supported out of the box. We added the flink-avro module to the classpath, but when restoring from a saved snapshot we notice that it's still trying to use Kryo serialization. Relevant code snippet:
case class Foo(id: String, timestamp: java.time.Instant)
val env = StreamExecutionEnvironment.getExecutionEnvironment
val conf = env.getConfig
conf.disableForceKryo()
conf.enableForceAvro()
val rawDataStream: DataStream[String] = env.addSource(MyFlinkKafkaConsumer)
val parsedDataStream: DataStream[Foo] = rawDataStream.flatMap(new JsonParser[Foo])
// do something useful with it
env.execute("my-job")
When performing a state migration on Foo (e.g. by adding a field and deploying the job) I see that it tries to deserialize using Kryo, which obviously fails. How can I make sure Avro serialization is being used?
UPDATE
Found out about https://issues.apache.org/jira/browse/FLINK-10897, so POJO state serialization with Avro is only supported from 1.8 afaik. I tried it using the latest RC of 1.8 with a simple WordCount POJO that extends from SpecificRecord:
/** MACHINE-GENERATED FROM AVRO SCHEMA. DO NOT EDIT DIRECTLY */
import scala.annotation.switch

case class WordWithCount(var word: String, var count: Long)
  extends org.apache.avro.specific.SpecificRecordBase {

  def this() = this("", 0L)

  def get(field$: Int): AnyRef = {
    (field$: @switch) match {
      case 0 => {
        word
      }.asInstanceOf[AnyRef]
      case 1 => {
        count
      }.asInstanceOf[AnyRef]
      case _ => new org.apache.avro.AvroRuntimeException("Bad index")
    }
  }

  def put(field$: Int, value: Any): Unit = {
    (field$: @switch) match {
      case 0 => this.word = {
        value.toString
      }.asInstanceOf[String]
      case 1 => this.count = {
        value
      }.asInstanceOf[Long]
      case _ => new org.apache.avro.AvroRuntimeException("Bad index")
    }
    ()
  }

  def getSchema: org.apache.avro.Schema = WordWithCount.SCHEMA$
}

object WordWithCount {
  val SCHEMA$ = new org.apache.avro.Schema.Parser().parse(
    "{\"type\":\"record\",\"name\":\"WordWithCount\",\"fields\":" +
    "[{\"name\":\"word\",\"type\":\"string\"}," +
    "{\"name\":\"count\",\"type\":\"long\"}]}")
}
This, however, also didn’t work out of the box. We then tried to define our own type information using flink-avro’s AvroTypeInfo but this fails because Avro looks for a SCHEMA$ property (SpecificData:285) in the class and is unable to use Java reflection to identify the SCHEMA$ in the Scala companion object.
I could never get reflection to work due to Scala's fields being private under the hood. AFAIK the only solution is to update Flink to use avro's non-reflection-based constructors in AvroInputFormat (compare).
In a pinch, other than switching to Java, one could fall back to Avro's GenericRecord, maybe using avro4s to generate them from avrohugger's Standard format (note that avro4s will generate its own schema from the generated Scala types).
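For illustration, a rough sketch of the GenericRecord fallback (plain Avro API, not tied to Flink; the schema string is the WordWithCount one from above):

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericRecord, GenericRecordBuilder}

val schema: Schema = new Schema.Parser().parse(
  """{"type":"record","name":"WordWithCount","fields":
    |[{"name":"word","type":"string"},{"name":"count","type":"long"}]}""".stripMargin)

// Build records against the schema instead of relying on reflection
// over the generated Scala case class and its companion's SCHEMA$.
val record: GenericRecord = new GenericRecordBuilder(schema)
  .set("word", "hello")
  .set("count", 1L)
  .build()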

flink parsing JSON in map: InvalidProgramException: Task not serializable

I am working on a Flink project and would like to parse the source JSON string data to a JSON object. I am using jackson-module-scala for the JSON parsing. However, I encountered some issues when using the JSON parser within Flink APIs (map, for example).
Here are some examples of the code; I cannot understand why it behaves like this under the hood.
Situation 1:
In this case, I am doing what jackson-module-scala's official example code told me to do:
create a new ObjectMapper
register the DefaultScalaModule
DefaultScalaModule is a Scala object that includes support for all currently supported Scala data types.
call the readValue in order to parse the JSON to Map
The error I got is: org.apache.flink.api.common.InvalidProgramException: Task not serializable.
object JsonProcessing {
  def main(args: Array[String]) {
    // set up the execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // get input data
    val text = env.readTextFile("xxx")
    val mapper = new ObjectMapper
    mapper.registerModule(DefaultScalaModule)
    val counts = text.map(mapper.readValue(_, classOf[Map[String, String]]))
    // execute and print result
    counts.print()
    env.execute("JsonProcessing")
  }
}
Situation 2:
Then I did some Google, and came up with the following solution, where registerModule is moved into the map function.
val mapper = new ObjectMapper
val counts = text.map(l => {
  mapper.registerModule(DefaultScalaModule)
  mapper.readValue(l, classOf[Map[String, String]])
})
However, what I am not able to understand is why this works when it calls a method on the mapper object defined outside. Is it because ObjectMapper itself is Serializable, as stated here: ObjectMapper.java#L114?
Now the JSON parsing is working fine, but every time I have to call mapper.registerModule(DefaultScalaModule), which I think could cause a performance issue (does it?). I also tried another solution as follows.
Situation 3:
I created a new case class Jsen and used it as the corresponding parsing class, registering the Scala module. It is also working fine.
However, this is not so flexible if your input JSON varies often; it is not maintainable to manage the class Jsen.
case class Jsen(
  @JsonProperty("a") a: String,
  @JsonProperty("c") c: String,
  @JsonProperty("e") e: String
)
object JsonProcessing {
  def main(args: Array[String]) {
    ...
    val mapper = new ObjectMapper
    val counts = text.map(mapper.readValue(_, classOf[Jsen]))
    ...
  }
}
Additionally, I also tried using JsonNode without calling registerModule as follows:
...
val mapper = new ObjectMapper
val counts = text.map(mapper.readValue(_, classOf[JsonNode]))
...
It is working fine as well.
My main question is: what is actually causing the problem of Task not serializable under the hood of registerModule(DefaultScalaModule)?
How to identify whether your code could potentially cause this unserializable problem during coding?
The thing is that Apache Flink is designed to be distributed. That means it needs to be able to run your code remotely, so all your processing functions must be serializable. In the current implementation this is ensured early on, when you build your streaming job, even if you will not run it in any distributed mode. This is a trade-off with the obvious benefit of giving you feedback down to the very line that breaks this contract (via the exception stack trace).
So when you write
val counts = text.map(mapper.readValue(_, classOf[Map[String, String]]))
what you actually write is something like
val counts = text.map(new Function1[String, Map[String, String]] {
  val capturedMapper = mapper
  override def apply(param: String) = capturedMapper.readValue(param, classOf[Map[String, String]])
})
The important thing here is that you capture the mapper from the outside context and store it as part of your Function1 object, which has to be serializable. And this means that the mapper has to be serializable. The designers of the Jackson library recognized that kind of need and, since there is nothing fundamentally non-serializable in a mapper, they made their ObjectMapper and the default modules serializable. Unfortunately for you, the designers of the Scala Jackson module missed that and made their DefaultScalaModule deeply non-serializable by making ScalaTypeModifier and all its subclasses non-serializable. This is why your second code works while the first one doesn't: a "raw" ObjectMapper is serializable, while an ObjectMapper with the pre-registered DefaultScalaModule is not.
There are a few possible workarounds. Probably the easiest one is to wrap ObjectMapper
object MapperWrapper extends java.io.Serializable {
  // this lazy is the important trick here
  // @transient adds some safety in current Scala (see also the Update section)
  @transient lazy val mapper = {
    val mapper = new ObjectMapper
    mapper.registerModule(DefaultScalaModule)
    mapper
  }

  def readValue[T](content: String, valueType: Class[T]): T = mapper.readValue(content, valueType)
}
and then use it as
val counts = text.map(MapperWrapper.readValue(_, classOf[Map[String, String]]))
This lazy trick works because although an instance of DefaultScalaModule is not serializable, the function to create an instance of DefaultScalaModule is.
Update: what about @transient?
what are the differences here, if I add lazy val vs. @transient lazy val?
This is actually a tricky question. What the lazy val is compiled to is actually something like this:
object MapperWrapper extends java.io.Serializable {
  // @transient is set or not set for both fields depending on its presence at "lazy val"
  [@transient] private var mapperValue: ObjectMapper = null
  [@transient] @volatile private var mapperInitialized = false

  def mapper: ObjectMapper = {
    if (!mapperInitialized) {
      this.synchronized {
        val mapper = new ObjectMapper
        mapper.registerModule(DefaultScalaModule)
        mapperValue = mapper
        mapperInitialized = true
      }
    }
    mapperValue
  }

  def readValue[T](content: String, valueType: Class[T]): T = mapper.readValue(content, valueType)
}
where @transient on the lazy val affects both backing fields. So now you can see why the lazy val trick works:
locally it works because it delays initialization of the mapperValue field until the first access to the mapper method, so the field is safely null when the serialization check is performed
remotely it works because MapperWrapper is fully serializable and the logic of how the lazy val should be initialized is put into a method of the same class (see def mapper).
Note however that AFAIK this behavior of how lazy val is compiled is an implementation detail of the current Scala compiler rather than a part of the Scala specification. If at some later point a class similar to .NET's Lazy is added to the Java standard library, the Scala compiler might start generating different code. This is important because it provides a kind of trade-off for @transient. The benefit of adding @transient now is that it ensures that code like this works as well:
val someJson:String = "..."
val something:Something = MapperWrapper.readValue(someJson:String, ...)
val counts = text.map(MapperWrapper.readValue(_, classOf[Map[String, String]]))
Without @transient the code above will fail, because we forced initialization of the lazy backing field and now it contains a non-serializable value. With @transient this is not an issue, as that field will not be serialized at all.
A potential drawback of @transient is that if Scala changes the way code for lazy val is generated and the field is marked as @transient, it might actually not be de-serialized in the remote-work scenario.
There is also a trick with object: for objects the Scala compiler generates custom de-serialization logic (it overrides readResolve) to return the same singleton object. That means an object, including its lazy val, is not really de-serialized and the value from the object itself is used. So a @transient lazy val inside an object is much more future-proof than inside a class in the remote scenario.
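A small self-contained check of that object behaviour, using plain Java serialization just to illustrate the readResolve point (the names are mine, not from the answer):

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object WrapperSingleton extends java.io.Serializable {
  @transient lazy val heavy: String = "initialized lazily"
}

object ReadResolveDemo extends App {
  // Round-trip the singleton through Java serialization.
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(WrapperSingleton)
  oos.close()
  val restored = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray)).readObject()

  // The compiler-generated readResolve hands back the same singleton,
  // so the @transient lazy val is simply re-initialized on first access.
  println(restored eq WrapperSingleton) // true
  println(WrapperSingleton.heavy)
}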

spray-json and spray-routing: how to invoke JsonFormat write in complete

I am trying to figure out how to get a custom JsonFormat write method to be invoked when using the routing directive complete. JsonFormats created with the jsonFormat set of helper functions work fine, but a fully hand-written JsonFormat does not get called.
sealed trait Error
sealed trait ErrorWithReason extends Error {
  def reason: String
}
case class ValidationError(reason: String) extends ErrorWithReason
case object EntityNotFound extends Error
case class DatabaseError(reason: String) extends ErrorWithReason
case class Record(a: String, b: String, error: Error)
object MyJsonProtocol extends DefaultJsonProtocol {
  implicit object ErrorJsonFormat extends JsonFormat[Error] {
    def write(err: Error) = err match {
      case e: ErrorWithReason => JsString(e.reason)
      case x => JsString(x.toString())
    }
    def read(value: JsValue) = {
      value match {
        // Really only intended to serialize to JSON for API responses, not implementing read
        case _ => throw new DeserializationException("Can't reliably deserialize Error")
      }
    }
  }
  implicit val record2Json = jsonFormat3(Record)
}
And then a route like:
import MyJsonProtocol._
trait TestRoute extends HttpService with Json4sSupport {
  path("testRoute") {
    val response: Record = getErrorRecord()
    complete(response)
  }
}
If I add logging, I can see that the ErrorJsonFormat.write method never gets called.
The ramifications are as follows showing what output I'm trying to get and what I actually get. Let's say the Record instance was Record("something", "somethingelse", EntityNotFound)
actual
{
"a": "something",
"b": "somethingelse",
"error": {}
}
intended
{
"a": "something",
"b": "somethingelse",
"error": "EntityNotFound"
}
I was expecting that complete(record) would use the implicit JsonFormat for Record, which in turn relies on the implicit object ErrorJsonFormat whose write method creates the appropriate JsString field. Instead it seems to recognize the provided ErrorJsonFormat while ignoring its instructions for serializing.
I feel like there should be a solution that does not involve needing to replace implicit val record2Json = jsonFormat3(Record) with an explicit implicit object RecordJsonFormat extends JsonFormat[Record] { ... }
So to summarize what I am asking
Why does the serialization of Record fail to call the ErrorJsonFormat write method (what does it even do instead?) answered below
Is there a way to match my expectation while still using complete(record)?
Edit
Digging through the spray-json source code, there is an sbt-boilerplate template that seems to define the jsonFormat series of methods: https://github.com/spray/spray-json/blob/master/src/main/boilerplate/spray/json/ProductFormatsInstances.scala.template
and the relevant product for jsonFormat3 from that seems to be :
def jsonFormat3[P1 :JF, P2 :JF, P3 :JF, T <: Product :ClassManifest](construct: (P1, P2, P3) => T): RootJsonFormat[T] = {
  val Array(p1, p2, p3) = extractFieldNames(classManifest[T])
  jsonFormat(construct, p1, p2, p3)
}
def jsonFormat[P1 :JF, P2 :JF, P3 :JF, T <: Product](construct: (P1, P2, P3) => T, fieldName1: String, fieldName2: String, fieldName3: String): RootJsonFormat[T] = new RootJsonFormat[T] {
  def write(p: T) = {
    val fields = new collection.mutable.ListBuffer[(String, JsValue)]
    fields.sizeHint(3 * 4)
    fields ++= productElement2Field[P1](fieldName1, p, 0)
    fields ++= productElement2Field[P2](fieldName2, p, 1)
    fields ++= productElement2Field[P3](fieldName3, p, 2)
    JsObject(fields: _*)
  }
  def read(value: JsValue) = {
    val p1V = fromField[P1](value, fieldName1)
    val p2V = fromField[P2](value, fieldName2)
    val p3V = fromField[P3](value, fieldName3)
    construct(p1V, p2V, p3V)
  }
}
From this it would seem that jsonFormat3 itself is perfectly fine (if you trace into productElement2Field, it grabs the writer and directly calls write). The problem must then be that complete(record) doesn't involve JsonFormat at all and marshals the object by some other means.
So this seems to answer part 1: Why does the serialization of Record fail to call the ErrorJsonFormat write method (what does it even do instead?). No JsonFormat is called because complete marshals via some other means.
It seems the remaining question is whether it is possible to provide a marshaller for the complete directive that will use the JsonFormat if it exists and otherwise default to its normal behavior. I realize that I can generally rely on the default marshaller for basic case class serialization. But when I get a complicated trait/case class setup like in this example, I need to use JsonFormat to get the proper response. Ideally, someone writing routes shouldn't need to know in which situations the default marshaller suffices and in which the JsonFormat must be invoked. In other words, needing to distinguish whether a given type should be written as complete(someType) or complete(someType.toJson) feels wrong.
After digging further, it seems the root of the problem has been a confusion of the Json4s and Spray-Json libraries in the code. In trying to track down examples of various elements of JSON handling, I didn't recognize the separation between the two libraries readily and ended up with code that mixed some of each, explaining the unexpected behavior.
In this question, the offending piece is pulling in the Json4sSupport in the router. The proper definition should be using SprayJsonSupport:
import MyJsonProtocol._
trait TestRoute extends HttpService with SprayJsonSupport {
  path("testRoute") {
    val response: Record = getErrorRecord()
    complete(response)
  }
}
With this all considered, the answers are more apparent.
1: Why does the serialization of Record fail to call the ErrorJsonFormat write method (and what does it do instead)?
No JsonFormat is called because complete marshals via some other means. That other means is the marshaling provided implicitly by Json4s with Json4sSupport. You can use record.toJson to force spray-json serialization of the object, but the output will not be clean (it will include nested JS objects and "fields" keys).
Is there a way to match my expectation while still using complete(record)?
Yes, using SprayJsonSupport will use implicit RootJsonReader and/or RootJsonWriter where needed to automatically create a relevant Unmarshaller and/or Marshaller. Documentation reference
So with SprayJsonSupport it will see the RootJsonWriter defined by the jsonFormat3(Record) and complete(record) will serialize as expected.

Support generic deserialization from a List[(String, Any)] in Scala

This is a follow up to the following question, which concerned serialization: How best to keep a cached list of member fields, one each for a family of case classes in Scala
I'm trying to generically support deserialization in the same way. One straightforward attempt is the following:
abstract class Serializer[T](implicit ctag: ClassTag[T]) {
  private val fields = ctag.runtimeClass.getDeclaredFields.toList
  fields foreach { _.setAccessible(true) }

  implicit class AddSerializeMethod(obj: T) {
    def serialize = fields.map(f => (f.getName, f.get(obj)))
  }

  def deserialize(data: List[(String, Any)]): T = {
    val m = data.toMap
    val r: T = ctag.runtimeClass.newInstance // ???
    fields.foreach { case f => f.set(r, m(f.getName)) }
    r
  }
}
There are a couple of issues with the code:
The line with val r: T = ... has a compile error because the compiler thinks it's not guaranteed to have the right type. (I'm generally unsure of how to create a new instance of a generic class in a typesafe way -- not sure why this isn't safe since the instance of Serializer is created with a class tag whose type is checked by the compiler).
The objects I'm creating are expected to be immutable case class objects, which are guaranteed to be fully constructed if created in the usual way. However, since I'm mutating the fields of instances of these objects in the deserialize method, how can I be sure that the objects will not be seen as partially constructed (due to caching and instruction reordering) if they are published to other threads?
ClassTag's runtimeClass method returns Class[_], not Class[T], probably due to the fact generics in Scala and Java behave differently; you can try casting it forcefully: val r: T = ctag.runtimeClass.newInstance.asInstanceOf[T]
newInstance calls the default, parameterless constructor. If the class doesn't have one, newInstance will throw InstantiationException. There's no way around it, except for:
looking around for other constructors
writing custom serializers (see how Gson does that; BTW Gson can automatically serialize only classes with parameterless constructors and those classes it has predefined deserializers for)
for case classes, finding their companion object and calling its apply method
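A rough, hypothetical sketch of that last option (not from the original answer; real code would also need to handle overloaded or curried apply methods):

import scala.reflect.ClassTag

def instantiateViaApply[T](args: Seq[AnyRef])(implicit ctag: ClassTag[T]): T = {
  val clazz = ctag.runtimeClass
  // By convention the companion object's class is named "<ClassName>$"
  // and exposes a static MODULE$ field holding the singleton instance.
  val companionClass = Class.forName(clazz.getName + "$")
  val companion = companionClass.getField("MODULE$").get(null)
  val applyMethod = companionClass.getMethods
    .find(_.getName == "apply")
    .getOrElse(throw new IllegalArgumentException(s"No apply on companion of ${clazz.getName}"))
  applyMethod.invoke(companion, args: _*).asInstanceOf[T]
}

// Usage, assuming a case class Person(name: String, age: Int):
// val p = instantiateViaApply[Person](Seq("Alice", Int.box(30)))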
Anyhow, reflection allows for modifying final fields as well, so if you manage to create an immutable object, you'll be able to set its fields.

Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects

Getting strange behavior when calling function outside of a closure:
when the function is in an object everything works
when the function is in a class I get:
Task not serializable: java.io.NotSerializableException: testing
The problem is that I need my code in a class and not an object. Any idea why this is happening? Is a Scala object serializable by default?
This is a working code example:
object working extends App {
  val list = List(1,2,3)
  val rddList = Spark.ctx.parallelize(list)

  // calling function outside closure
  val after = rddList.map(someFunc(_))

  def someFunc(a: Int) = a + 1

  after.collect().map(println(_))
}
This is the non-working example :
object NOTworking extends App {
  new testing().doIT
}

// adding extends Serializable won't help
class testing {
  val list = List(1,2,3)
  val rddList = Spark.ctx.parallelize(list)

  def doIT = {
    // again calling the function someFunc
    val after = rddList.map(someFunc(_))
    // this will crash (Spark is lazy)
    after.collect().map(println(_))
  }

  def someFunc(a: Int) = a + 1
}
RDDs extend the Serializable interface, so this is not what's causing your task to fail. Now this doesn't mean that you can serialise an RDD with Spark and avoid NotSerializableException
Spark is a distributed computing engine and its main abstraction is a resilient distributed dataset (RDD), which can be viewed as a distributed collection. Basically, RDD's elements are partitioned across the nodes of the cluster, but Spark abstracts this away from the user, letting the user interact with the RDD (collection) as if it were a local one.
Not to get into too many details, but when you run different transformations on a RDD (map, flatMap, filter and others), your transformation code (closure) is:
serialized on the driver node,
shipped to the appropriate nodes in the cluster,
deserialized,
and finally executed on the nodes
You can of course run this locally (as in your example), but all those phases (apart from shipping over network) still occur. [This lets you catch any bugs even before deploying to production]
What happens in your second case is that you are calling a method, defined in class testing, from inside the map function. Spark sees that, and since methods cannot be serialized on their own, Spark tries to serialize the whole testing class, so that the code will still work when executed in another JVM. You have two possibilities:
Either you make class testing serializable, so the whole class can be serialized by Spark:
import org.apache.spark.{SparkContext, SparkConf}

object Spark {
  val ctx = new SparkContext(new SparkConf().setAppName("test").setMaster("local[*]"))
}

object NOTworking extends App {
  new Test().doIT
}

class Test extends java.io.Serializable {
  val rddList = Spark.ctx.parallelize(List(1,2,3))

  def doIT() = {
    val after = rddList.map(someFunc)
    after.collect().foreach(println)
  }

  def someFunc(a: Int) = a + 1
}
or you make someFunc function instead of a method (functions are objects in Scala), so that Spark will be able to serialize it:
import org.apache.spark.{SparkContext, SparkConf}

object Spark {
  val ctx = new SparkContext(new SparkConf().setAppName("test").setMaster("local[*]"))
}

object NOTworking extends App {
  new Test().doIT
}

class Test {
  val rddList = Spark.ctx.parallelize(List(1,2,3))

  def doIT() = {
    val after = rddList.map(someFunc)
    after.collect().foreach(println)
  }

  val someFunc = (a: Int) => a + 1
}
A similar, but not identical, problem with class serialization may also be of interest to you; you can read about it in this Spark Summit 2013 presentation.
As a side note, you can rewrite rddList.map(someFunc(_)) to rddList.map(someFunc), they are exactly the same. Usually, the second is preferred as it's less verbose and cleaner to read.
EDIT (2015-03-15): SPARK-5307 introduced SerializationDebugger, and Spark 1.3.0 is the first version to use it. It adds the serialization path to a NotSerializableException. When a NotSerializableException is encountered, the debugger visits the object graph to find the path towards the object that cannot be serialized, and constructs information to help the user find the object.
In OP's case, this is what gets printed to stdout:
Serialization stack:
- object not serializable (class: testing, value: testing@2dfe2f00)
- field (class: testing$$anonfun$1, name: $outer, type: class testing)
- object (class testing$$anonfun$1, <function1>)
Grega's answer is great in explaining why the original code does not work and two ways to fix the issue. However, this solution is not very flexible; consider the case where your closure includes a method call on a non-Serializable class that you have no control over. You can neither add the Serializable tag to this class nor change the underlying implementation to change the method into a function.
Nilesh presents a great workaround for this, but the solution can be made both more concise and general:
def genMapper[A, B](f: A => B): A => B = {
  val locker = com.twitter.chill.MeatLocker(f)
  x => locker.get.apply(x)
}
This function-serializer can then be used to automatically wrap closures and method calls:
rdd map genMapper(someFunc)
This technique also has the benefit of not requiring the additional Shark dependencies in order to access KryoSerializationWrapper, since Twitter's Chill is already pulled in by core Spark
Complete talk fully explaining the problem, which proposes a great paradigm shifting way to avoid these serialization problems: https://github.com/samthebest/dump/blob/master/sams-scala-tutorial/serialization-exceptions-and-memory-leaks-no-ws.md
The top-voted answer basically suggests throwing away an entire language feature: no longer using methods and only using functions. Indeed, in functional programming, methods in classes should be avoided, but turning them into functions isn't solving the design issue here (see the link above).
As a quick fix in this particular situation you could just use the @transient annotation to tell it not to try to serialise the offending value (here, Spark.ctx is a custom class, not Spark's, following the OP's naming):
@transient
val rddList = Spark.ctx.parallelize(list)
You can also restructure code so that rddList lives somewhere else, but that is also nasty.
The Future is Probably Spores
In the future, Scala will include these things called "spores" that should allow fine-grained control over what does and does not get pulled in by a closure. Furthermore, this should turn all mistakes of accidentally pulling in non-serializable types (or any unwanted values) into compile errors, rather than the current situation of horrible runtime exceptions / memory leaks.
http://docs.scala-lang.org/sips/pending/spores.html
A tip on Kryo serialization
When using Kryo, make it so that registration is necessary; this will mean you get errors instead of memory leaks:
"Finally, I know that kryo has kryo.setRegistrationOptional(true) but I am having a very difficult time trying to figure out how to use it. When this option is turned on, kryo still seems to throw exceptions if I haven't registered classes."
Strategy for registering classes with kryo
Of course this only gives you type-level control not value-level control.
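In Spark terms, a sketch of that tip (MyRecord is a placeholder for your own classes; spark.kryo.registrationRequired plays the role of requiring registration):

import org.apache.spark.SparkConf

case class MyRecord(id: Long, name: String)   // stand-in for your own classes

val conf = new SparkConf()
  .setAppName("kryo-registration-required")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Fail fast on any class that was not registered, instead of silently
  // serializing it together with its full class name.
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(classOf[MyRecord]))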
... more ideas to come.
I faced a similar issue, and what I understand from Grega's answer is:
object NOTworking extends App {
  new testing().doIT
}

// adding extends Serializable won't help
class testing {
  val list = List(1,2,3)
  val rddList = Spark.ctx.parallelize(list)

  def doIT = {
    // again calling the function someFunc
    val after = rddList.map(someFunc(_))
    // this will crash (Spark is lazy)
    after.collect().map(println(_))
  }

  def someFunc(a: Int) = a + 1
}
your doIT method is trying to serialize the someFunc(_) method, but since methods are not serializable, it tries to serialize the class testing, which again is not serializable.
So to make your code work, you should define someFunc inside the doIT method. For example:
def doIT = {
  // function definition
  def someFunc(a: Int) = a + 1
  val after = rddList.map(someFunc(_))
  after.collect().map(println(_))
}
And if there are multiple functions coming into the picture, then all those functions should be available to the parent context.
I solved this problem using a different approach. You simply need to serialize the objects before passing them through the closure, and de-serialize them afterwards. This approach just works, even if your classes aren't Serializable, because it uses Kryo behind the scenes. All you need is some currying. ;)
Here's an example of how I did it:
def genMapper(kryoWrapper: KryoSerializationWrapper[(Foo => Bar)])
             (foo: Foo): Bar = {
  kryoWrapper.value.apply(foo)
}

val mapper = genMapper(KryoSerializationWrapper(new Blah(abc))) _
rdd.flatMap(mapper).collectAsMap()

class Blah(abc: ABC) extends (Foo => Bar) {
  def apply(foo: Foo): Bar = ??? // this is the real function
}
Feel free to make Blah as complicated as you want, class, companion object, nested classes, references to multiple 3rd party libs.
KryoSerializationWrapper refers to: https://github.com/amplab/shark/blob/master/src/main/scala/shark/execution/serialization/KryoSerializationWrapper.scala
I'm not entirely certain that this applies to Scala but, in Java, I solved the NotSerializableException by refactoring my code so that the closure did not access a non-serializable final field.
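The same refactoring works in Scala: copy just the value the closure needs into a local val, so the closure captures that value rather than the enclosing (non-serializable) instance. A sketch with made-up names:

import org.apache.spark.rdd.RDD

// Stand-in for some third-party class that is not Serializable.
class NonSerializableConfig(val prefix: String)

class Handler(config: NonSerializableConfig) {
  def process(rdd: RDD[String]): RDD[String] = {
    val prefix = config.prefix        // pull the serializable value out first
    rdd.map(line => prefix + line)    // the closure now captures only `prefix`, not `this` or `config`
  }
}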
Scala methods defined in a class are not serializable on their own; methods can be converted into functions to resolve the serialization issue.
Method syntax:
def func_name(x: String): String = {
  ...
  return x
}
Function syntax:
val func_name = { (x: String) =>
  ...
  x
}
FYI, in Spark 2.4 a lot of you will probably encounter this issue. Kryo serialization has gotten better, but in many cases you cannot use spark.kryo.unsafe=true or the naive Kryo serializer.
For a quick fix try changing the following in your Spark configuration
spark.kryo.unsafe="false"
OR
spark.serializer="org.apache.spark.serializer.JavaSerializer"
I modify the custom RDD transformations that I encounter or personally write by using explicit broadcast variables and the new inbuilt twitter-chill API, converting them from rdd.map(row => ...) to rdd.mapPartitions(partition => { ... }) functions.
Example
Old (not-great) Way
val sampleMap = Map("index1" -> 1234, "index2" -> 2345)
val outputRDD = rdd.map(row => {
  val value = sampleMap.get(row._1)
  value
})
Alternative (better) Way
import com.twitter.chill.MeatLocker

val sampleMap = Map("index1" -> 1234, "index2" -> 2345)
val brdSerSampleMap = spark.sparkContext.broadcast(MeatLocker(sampleMap))

rdd.mapPartitions(partition => {
  // unwrap the broadcast MeatLocker once per partition and use it inside the closure
  val deSerSampleMap = brdSerSampleMap.value.get
  partition.map(row => {
    val value = deSerSampleMap.get(row._1)
    value
  }).toIterator
})
This new way will only deserialize the broadcast value once per partition, which is better. You will still need to use Java serialization if you do not register the classes.
I had a similar experience.
The error was triggered when I initialized a variable on the driver (master), but then tried to use it on one of the workers.
When that happens, Spark Streaming will try to serialize the object to send it over to the worker, and fail if the object is not serializable.
I solved the error by making the variable static.
Previous non-working code
private final PhoneNumberUtil phoneUtil = PhoneNumberUtil.getInstance();
Working code
private static final PhoneNumberUtil phoneUtil = PhoneNumberUtil.getInstance();
Credits:
https://learn.microsoft.com/en-us/answers/questions/35812/sparkexception-job-aborted-due-to-stage-failure-ta.html (the answer by pradeepcheekatla-msft)
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html
def upper(name: String): String = {
  val upperCased: String = name.toUpperCase()
  upperCased
}
val toUpperName = udf {(EmpName: String) => upper(EmpName)}
val emp_details = """[{"id": "1","name": "James Butt","country": "USA"},
{"id": "2", "name": "Josephine Darakjy","country": "USA"},
{"id": "3", "name": "Art Venere","country": "USA"},
{"id": "4", "name": "Lenna Paprocki","country": "USA"},
{"id": "5", "name": "Donette Foller","country": "USA"},
{"id": "6", "name": "Leota Dilliard","country": "USA"}]"""
val df_emp = spark.read.json(Seq(emp_details).toDS())
val df_name=df_emp.select($"id",$"name")
val df_upperName= df_name.withColumn("name",toUpperName($"name")).filter("id='5'")
display(df_upperName)
This will give the error:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
Solution -
import java.io.Serializable

object obj_upper extends Serializable {
  def upper(name: String): String = {
    val upperCased: String = name.toUpperCase()
    upperCased
  }
  val toUpperName = udf { (EmpName: String) => upper(EmpName) }
}

val df_upperName = df_name.withColumn("name", obj_upper.toUpperName($"name")).filter("id='5'")
display(df_upperName)
My solution was to add a companion object that handles all the methods that are not serializable within the class.
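One way to read that (a sketch, since the original answer shows no code): keep the class non-serializable and move the logic used inside closures into its companion object, so the closure references the companion rather than the enclosing instance.

import org.apache.spark.rdd.RDD

class Processing {                      // the class itself never needs to be serialized
  def run(rdd: RDD[Int]): RDD[Int] =
    rdd.map(Processing.addOne)          // the closure refers to the companion, not to `this`
}

object Processing {                     // companion holds the methods used inside closures
  def addOne(a: Int): Int = a + 1
}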