Say I define the following case class:
case class C(i: Int) {
lazy val incremented = copy(i = i + 1)
}
And then try to serialize it to json:
val mapper = new ObjectMapper()
mapper.registerModule(DefaultScalaModule)
val out = new StringWriter
mapper.writeValue(out, C(4))
val json = out.toString()
println("Json is: " + json)
It will throw the following exception:
Exception in thread "main" com.fasterxml.jackson.databind.JsonMappingException: Infinite recursion (StackOverflowError) (through reference chain: C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]-
...
I don't understand why it tries to serialize the lazy val by default in the first place. This does not seem like the logical approach to me.
And can I disable this feature?
This happens because Jackson is designed for Java. Specifically, note that:
Java has no idea of a lazy val
Java's normal semantics around fields and constructors don't allow fields to be partitioned into "needed for construction" and "derived for construction" (neither of those is a technical term), which is what Scala's combination of val in the default constructor (implicitly present in a case class) and val in a class's body provides
The consequence of the second is that (except for beans, sometimes) Java-oriented serialization approaches tend to assume that anything which is a field in the object (including private fields, since the Java idiom is to make fields private by default) needs to be serialized, with the ability to opt out through @transient annotations.
The first, in turn, means that lazy vals are implemented by the compiler in a way that includes a private field.
Thus, to a Java-oriented serializer like Jackson, a lazy val without a @transient annotation gets serialized.
Scala-oriented serialization approaches (e.g. circe, play-json, etc.) tend to serialize case classes by only serializing the constructor parameters.
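To see roughly what a reflection-based, Java-oriented serializer has to work with, one can inspect the compiled class with plain Java reflection. This is a minimal sketch using only the standard library; `LazyValReflection` is a hypothetical helper name, not part of Jackson, and the exact synthetic field names differ between Scala versions, so the code only searches for names that contain `incremented`.

```scala
// The case class from the question, with an explicit result type on the lazy val.
final case class C(i: Int) {
  lazy val incremented: C = copy(i = i + 1)
}

// Hypothetical helper (not part of Jackson) that lists what reflection sees on C.
object LazyValReflection {
  // JVM fields the compiler generated for the lazy val (names vary by Scala version).
  def lazyFields: List[String] =
    classOf[C].getDeclaredFields.map(_.getName).filter(_.contains("incremented")).toList

  // Zero-argument methods declared on C; the lazy val's accessor is among them.
  def zeroArgMethods: List[String] =
    classOf[C].getDeclaredMethods.filter(_.getParameterCount == 0).map(_.getName).toList

  def main(args: Array[String]): Unit = {
    println(s"fields:  $lazyFields")
    println(s"methods: $zeroArgMethods")
  }
}
```

From the JVM's point of view the lazy val looks like an ordinary stored property, which is consistent with the explanation above.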
The solution I found was to use json4s for my serialization rather than jackson-databind. My issue arose while using Akka Cluster, so I had to add a custom serializer to my project. For reference, here is my complete implementation:
class Json4sSerializer(system: ExtendedActorSystem) extends Serializer {
private val actorRefResolver = ActorRefResolver(system.toTyped)
object ActorRefSerializer extends CustomSerializer[ActorRef[_]](format => (
{
case JString(str) =>
actorRefResolver.resolveActorRef[AnyRef](str)
},
{
case actorRef: ActorRef[_] =>
JString(actorRefResolver.toSerializationFormat(actorRef))
}
))
implicit private val formats = DefaultFormats + ActorRefSerializer
def includeManifest: Boolean = true
def identifier = 1234567
def toBinary(obj: AnyRef): Array[Byte] = {
write(obj).getBytes(StandardCharsets.UTF_8)
}
def fromBinary(bytes: Array[Byte], clazz: Option[Class[_]]): AnyRef = clazz match {
case Some(cls) =>
read[AnyRef](new String(bytes, StandardCharsets.UTF_8))(formats, ManifestFactory.classType(cls))
case None =>
throw new RuntimeException("Specified includeManifest but it was never passed")
}
}
You can't serialize that class because the value is infinitely recursive (hence the stack overflow). Specifically, the value of incremented for C(4) is an instance of C(5). The value of incremented for C(5) is C(6). The value of incremented for C(6) is C(7) and so on...
Since an instance of C(n) contains an instance of C(n+1), it can never be fully serialized.
If you don't want a field to appear in the JSON, make it a function:
case class C(i: Int) {
def incremented = copy(i = i + 1)
}
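A small sketch of why the def version is safe, using only the standard library. The helper `follow` and the object name `DefInsteadOfLazyVal` are made up for illustration: with a def nothing is stored in the instance, so the only data the case class carries is its constructor parameter, while the chain of `incremented` values can still be followed for as many steps as the caller asks for.

```scala
// Same class as above: `incremented` is recomputed on each call and never stored.
final case class C(i: Int) {
  def incremented: C = copy(i = i + 1)
}

object DefInsteadOfLazyVal {
  // Hypothetical helper: follow `incremented` a fixed number of times.
  // The chain has no natural end, so the caller must choose where to stop.
  @annotation.tailrec
  def follow(c: C, steps: Int): C =
    if (steps <= 0) c else follow(c.incremented, steps - 1)

  def main(args: Array[String]): Unit = {
    println(follow(C(4), 3))
    // Only the constructor parameter is part of the case class's data.
    println(C(4).productIterator.toList)
  }
}
```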
The root of this problem is trying to serialise a class that also implements application logic, which breaches the principle of Separation of Concerns (closely related to the Single Responsibility Principle, the S in SOLID).
It is better to have distinct classes for serialisation and populate them from the application data as necessary. This allows different forms of serialisation to be used without having to change the application logic.
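One possible shape for that separation is sketched below. The names `Counter`, `CounterDto`, `toDto` and `fromDto` are hypothetical; the idea is only that the serializer is handed the plain data class and never sees the class that carries behaviour.

```scala
object SeparateDto {
  // Domain class: carries application logic.
  final case class Counter(i: Int) {
    def incremented: Counter = copy(i = i + 1)
  }

  // Hypothetical data-transfer class: only the data that should appear in the JSON.
  final case class CounterDto(i: Int)

  // Explicit conversions at the boundary between application logic and serialization.
  def toDto(c: Counter): CounterDto = CounterDto(c.i)
  def fromDto(d: CounterDto): Counter = Counter(d.i)

  def main(args: Array[String]): Unit = {
    val dto = toDto(Counter(4).incremented)
    println(dto)          // this is what would be passed to the JSON library
    println(fromDto(dto)) // and this is what the application works with
  }
}
```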
I'm trying to get the SQLContext instance from one module in another module. The first module instantiates it to an implicit sqlContext and I had (erroneously) thought that I could then use an implicit parameter in the second module, but the compiler informs me that:
could not find implicit value for parameter sqlCtxt: org.apache.spark.sql.SQLContext
Here's the skeletal setup I have (I have elided imports and details):
-----
// Application.scala
-----
package apps
object Application extends App {
val env = new SparkEnvironment("My app", ...)
try {
// Call methods from various packages that internally use code from DFExtensions.scala
}
}
-----
// SparkEnvironment.scala
-----
package common
class SparkEnvironment(val app: String, ...) {
@transient lazy val conf: SparkConf = new SparkConf().setAppName(app)
@transient implicit lazy val sc: SparkContext = new SparkContext(conf)
@transient implicit lazy val sqlContext: SQLContext = new SQLContext(sc)
...
}
-----
// DFExtensions.scala
-----
package util
object DFExtensions {
private def myFun(...)(implicit sqlCtxt: SQLContext) = { ... }
implicit final class DFExt(val df: DataFrame) extends AnyVal {
// Extension methods for DataFrame where myFun is supposed to be used -- causes exception!
}
}
Since it's a multi-project sbt setup I don't want to pass around the instance env to all related objects because the stuff in util is really a shared library. Each sub-project (i.e. app) has its own instance created in the main method.
Because myFun is only called from the implicit class DFExt I thought about creating an implicit just before each call à la implicit val sqlCtxt = df.sqlContext and that compiles but it's kind of ugly and I would not need the implicit in SparkEnvironment any longer.
According to this discussion the implicit sqlContext instance is not in scope, hence compilation fails. I'm not sure a package object would work because the implicit value and parameter are in different packages.
Is what I'm trying to achieve even possible? Is there a better alternative?
The idea is to have several sub-projects that use the same libraries and core functions to share the same project. They are typically updated together, so it's nice to have them in a single place. Most of the library functions directly work on data frames and other structures in Spark, but occasionally I need to do something that requires an instance of SparkContext or SQLContext, for instance write a query with sqlContext.sql as some syntax is not yet natively supported (e.g. flattening with outer lateral views).
Each sub-project has its own main method that creates an implicit instance. Obviously the libraries do not 'know' about this as they are in different packages and I don't pass around the instances. I had thought that somehow implicits are looked for at runtime, so that when an application runs there is an instance of SQLContext defined as an implicit. It's possible that a) it's not in scope because it's in a different package or b) what I'm trying to do is just a bad idea.
Currently there is only one main method because I first have to split the application in multiple components, which I have not done yet.
Just in case it helps:
Spark 1.4.1
Scala 2.10
sbt 0.13.8
Because myFun is only called from the implicit class DFExt I thought about creating an implicit just before each call à la implicit val sqlCtxt = df.sqlContext and that compiles but it's kind of ugly and I would not need the implicit in SparkEnvironment any longer.
Just put the implicit and myFun inside DFExt:
implicit final class DFExt(val df: DataFrame) extends AnyVal {
private implicit def sqlCtxt: SQLContext = df.sqlContext
// no need to take an implicit parameter, as sqlCtxt is already in scope
private def myFun(...) = ...
// The extension methods can now use sqlCtxt and/or myFun freely
}
You could also make sqlCtxt a val, but then: 1) DFExt can't extend AnyVal anymore; 2) it needs to be initialized even if the extension method you call doesn't need it; 3) any calls to sqlCtxt are likely to be inlined, so you are just accessing a val from df instead of this anyway. If they aren't, this means you are using it far too little to matter.
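The same pattern can be tried without Spark. In this sketch `Ctx` and `Frame` are hypothetical stand-ins for SQLContext and DataFrame, and `describe` plays the role of myFun; it shows a variant in which the helper keeps its implicit parameter and the value class supplies it from the wrapped value.

```scala
object ImplicitInValueClass {
  // Hypothetical stand-ins for SQLContext and DataFrame.
  final class Ctx(val name: String)
  final class Frame(val ctx: Ctx, val rows: List[Int])

  // Library helper that needs a context, like myFun in the question.
  private def describe(rows: List[Int])(implicit ctx: Ctx): String =
    s"${ctx.name}: ${rows.sum}"

  implicit final class FrameExt(val frame: Frame) extends AnyVal {
    // A def (not a val), so FrameExt can remain a value class.
    private implicit def ctx: Ctx = frame.ctx

    // The implicit above is in scope here, so no context has to be passed explicitly.
    def summary: String = describe(frame.rows)
  }

  def main(args: Array[String]): Unit =
    println(new Frame(new Ctx("test-ctx"), List(1, 2, 3)).summary)
}
```

Because implicits are resolved at compile time, the context has to be reachable at the place where `describe` is called; taking it from the wrapped value achieves that without any global instance.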
For creating the data source I have
object MyDataSource {
private lazy val dataSource: javax.sql.DataSource = {
val ds = new BasicDataSource
val conf = ConfigFactory.load()
val url = conf.getString("jdbc-url")
val driver = conf.getString("jdbc-driver")
val username = conf.getString("db-username")
val password = conf.getString("db-password")
val port = conf.getString("db-port")
val maxActive = conf.getInt("max-active")
val maxIdle = conf.getInt("max-idle")
val initSize = conf.getInt("init-size")
ds.setDriverClassName(driver)
ds.setUsername(username)
ds.setPassword(password)
ds.setMaxActive(maxActive)
ds.setMaxIdle(maxIdle)
ds.setInitialSize(initSize)
ds.setUrl(url)
ds
}
lazy val database = Database.forDataSource(dataSource)
}
MyDataSource is used as below
def insertCompany = {
MyDataSource.database.withSession{ implicit session =>
company.insert(companyRow)
}
}
Now, for testing, I have a trait DatabaseSpec which loads the test database config (pointing to the test db) and has the following fixture:
def withSession(testCode: Session => Any) {
val session = postgres.createSession()
session.conn.setAutoCommit(false)
try {
testCode(session)
} finally {
session.rollback()
session.close()
}
}
And test code can then mix in DatabaseSpec and use withSession to test transactional code.
Now the question is: what is the best practice for keeping MyDataSource.database.withSession abstracted away from the DataSource in insertCompany, so that the method can be tested with DatabaseSpec, pointing to the test db?
The best way to be able to exchange a value, e.g. for prod and testing, is by parameterizing your code in that value. E.g.
def insertCompany(db: Database) = db.withSession(company.insert(companyRow)(_))
or
class DAO(db:Database){
def insertCompany = db.withSession(company.insert(companyRow)(_))
}
Keep it simple. Avoid unnecessary complexity like the Cake pattern, DI frameworks or mixin composition for this.
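To make the effect on testing concrete, here is a self-contained sketch that does not depend on Slick. `Session` and `Database` below are hypothetical stand-ins that merely record statements, and `CompanyDao` is an assumed name; the point is only that the test decides which database the DAO talks to.

```scala
import scala.collection.mutable.ArrayBuffer

object ParameterizedDao {
  // Hypothetical stand-ins for Slick's Session and Database; they only record statements.
  final class Session(val statements: ArrayBuffer[String])
  final class Database(val name: String) {
    val statements: ArrayBuffer[String] = ArrayBuffer.empty[String]
    def withSession[A](f: Session => A): A = f(new Session(statements))
  }

  // The DAO receives its database instead of reaching for a global object.
  class CompanyDao(db: Database) {
    def insertCompany(company: String): Unit =
      db.withSession { session =>
        session.statements += s"insert $company"
        ()
      }
  }

  def main(args: Array[String]): Unit = {
    val testDb = new Database("test") // production code would pass the real database here
    new CompanyDao(testDb).insertCompany("Acme")
    println(testDb.statements.toList)
  }
}
```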
If you need to pass multiple values around... aggregate them into a "config"-class. Compose multiple config classes with different purposes to target different things, if you want to avoid writing one huge config class as stuff accumulates.
If you find yourself passing config objects to all your functions, you can mark them as implicit, that saves you at least the call-site code overhead. Or you can use something like scalaz's monadic function composition to avoid call site and definition site code overhead for passing config around. It is sometimes called the Reader monad, but it is simply for-comprehension enabled composition of 1-argument functions.
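For readers who have not seen it, the "for-comprehension enabled composition of 1-argument functions" can be sketched in a few lines. This is a hand-rolled illustration, not scalaz's actual Reader type, and `Config`, `dbUrl`, `user` and `banner` are made-up names.

```scala
object ReaderSketch {
  // A thin wrapper around a one-argument function C => A.
  final case class Reader[C, A](run: C => A) {
    def map[B](f: A => B): Reader[C, B] = Reader((c: C) => f(run(c)))
    // Both steps are given the same configuration value c.
    def flatMap[B](f: A => Reader[C, B]): Reader[C, B] = Reader((c: C) => f(run(c)).run(c))
  }

  // Hypothetical configuration class.
  final case class Config(dbUrl: String, user: String)

  val dbUrl: Reader[Config, String] = Reader((c: Config) => c.dbUrl)
  val user: Reader[Config, String] = Reader((c: Config) => c.user)

  // No Config is mentioned while composing; it is supplied once, at the very end.
  val banner: Reader[Config, String] =
    for {
      url  <- dbUrl
      name <- user
    } yield s"$name@$url"

  def main(args: Array[String]): Unit =
    println(banner.run(Config("jdbc:test", "alice")))
}
```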
Slick 2.2 will ship with something like that out-of-the-box and make what you want very easy.
Also, here is an approach I am currently playing around with, a composable configuration object "TMap". This code example shows step by step how you get from global imports over parameterized functions and making them implicit to using TMap and removing most boilerplate: https://github.com/cvogt/slick-action/blob/0.1/src/test/scala/org/cvogt/di/TMapTest.scala#L49
I'm getting strange behavior when calling a function outside of a closure:
when the function is in an object, everything works
when the function is in a class, I get:
Task not serializable: java.io.NotSerializableException: testing
The problem is that I need my code in a class and not an object. Any idea why this is happening? Is a Scala object serialized (by default?)?
This is a working code example:
object working extends App {
val list = List(1,2,3)
val rddList = Spark.ctx.parallelize(list)
//calling function outside closure
val after = rddList.map(someFunc(_))
def someFunc(a:Int) = a+1
after.collect().map(println(_))
}
This is the non-working example :
object NOTworking extends App {
new testing().doIT
}
//adding extends Serializable wont help
class testing {
val list = List(1,2,3)
val rddList = Spark.ctx.parallelize(list)
def doIT = {
//again calling the function someFunc
val after = rddList.map(someFunc(_))
//this will crash (spark lazy)
after.collect().map(println(_))
}
def someFunc(a:Int) = a+1
}
RDDs extend the Serializable interface, so this is not what's causing your task to fail. However, this doesn't mean that you can serialise an RDD with Spark and avoid a NotSerializableException.
Spark is a distributed computing engine and its main abstraction is a resilient distributed dataset (RDD), which can be viewed as a distributed collection. Basically, RDD's elements are partitioned across the nodes of the cluster, but Spark abstracts this away from the user, letting the user interact with the RDD (collection) as if it were a local one.
Not to get into too many details, but when you run different transformations on a RDD (map, flatMap, filter and others), your transformation code (closure) is:
serialized on the driver node,
shipped to the appropriate nodes in the cluster,
deserialized,
and finally executed on the nodes
You can of course run this locally (as in your example), but all those phases (apart from shipping over network) still occur. [This lets you catch any bugs even before deploying to production]
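The phases listed above can be imitated without a cluster by pushing a function value through plain Java serialization. This is only a sketch of the idea under the assumption of Scala 2.12 or later (where function literals are serializable); Spark's real closure handling does more, and `ClosureRoundTrip` is a made-up name.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object ClosureRoundTrip {
  // Phase 1 (driver): turn the closure into bytes.
  def serialize(closure: AnyRef): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(closure) finally out.close()
    buffer.toByteArray
  }

  // Phase 3 (worker): rebuild the closure from the bytes.
  def deserialize[A](bytes: Array[Byte]): A = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject().asInstanceOf[A] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val addOne: Int => Int = _ + 1
    val shipped = serialize(addOne)                // phase 2 would send these bytes over the network
    val rebuilt = deserialize[Int => Int](shipped)
    println(List(1, 2, 3).map(rebuilt))            // phase 4: execute the rebuilt closure
  }
}
```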
What happens in your second case is that you are calling a method, defined in class testing from inside the map function. Spark sees that and since methods cannot be serialized on their own, Spark tries to serialize the whole testing class, so that the code will still work when executed in another JVM. You have two possibilities:
Either you make class testing serializable, so the whole class can be serialized by Spark:
import org.apache.spark.{SparkContext,SparkConf}
object Spark {
val ctx = new SparkContext(new SparkConf().setAppName("test").setMaster("local[*]"))
}
object NOTworking extends App {
new Test().doIT
}
class Test extends java.io.Serializable {
val rddList = Spark.ctx.parallelize(List(1,2,3))
def doIT() = {
val after = rddList.map(someFunc)
after.collect().foreach(println)
}
def someFunc(a: Int) = a + 1
}
Or you make someFunc a function instead of a method (functions are objects in Scala), so that Spark will be able to serialize it:
import org.apache.spark.{SparkContext,SparkConf}
object Spark {
val ctx = new SparkContext(new SparkConf().setAppName("test").setMaster("local[*]"))
}
object NOTworking extends App {
new Test().doIT
}
class Test {
val rddList = Spark.ctx.parallelize(List(1,2,3))
def doIT() = {
val after = rddList.map(someFunc)
after.collect().foreach(println)
}
val someFunc = (a: Int) => a + 1
}
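The difference between the two variants can be reproduced without Spark by attempting plain Java serialization on both kinds of function value. The sketch assumes Scala 2.12 or later; `Holder` and `canSerialize` are hypothetical names, and `Holder` is deliberately not Serializable, like `testing` in the question.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import scala.util.Try

object CaptureDemo {
  // Deliberately NOT Serializable.
  class Holder {
    def plusOne(a: Int): Int = a + 1

    // The lambda calls an instance method, so it has to capture `this`.
    def viaMethod: Int => Int = x => plusOne(x)

    // A self-contained function value; its body does not refer to `this`.
    val viaFunction: Int => Int = (a: Int) => a + 1
  }

  // True if Java serialization accepts the object, false if it throws.
  def canSerialize(obj: AnyRef): Boolean = Try {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try out.writeObject(obj) finally out.close()
  }.isSuccess

  def main(args: Array[String]): Unit = {
    val holder = new Holder
    println(s"closure over a method: ${canSerialize(holder.viaMethod)}")
    println(s"function value:        ${canSerialize(holder.viaFunction)}")
  }
}
```

The closure that goes through the method drags the whole enclosing instance along, which is why the options are either making that instance serializable or avoiding the capture.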
A similar, but not identical, problem with class serialization may be of interest to you; you can read about it in this Spark Summit 2013 presentation.
As a side note, you can rewrite rddList.map(someFunc(_)) to rddList.map(someFunc), they are exactly the same. Usually, the second is preferred as it's less verbose and cleaner to read.
EDIT (2015-03-15): SPARK-5307 introduced SerializationDebugger and Spark 1.3.0 is the first version to use it. It adds serialization path to a NotSerializableException. When a NotSerializableException is encountered, the debugger visits the object graph to find the path towards the object that cannot be serialized, and constructs information to help user to find the object.
In OP's case, this is what gets printed to stdout:
Serialization stack:
- object not serializable (class: testing, value: testing@2dfe2f00)
- field (class: testing$$anonfun$1, name: $outer, type: class testing)
- object (class testing$$anonfun$1, <function1>)
Grega's answer is great in explaining why the original code does not work and two ways to fix the issue. However, this solution is not very flexible; consider the case where your closure includes a method call on a non-Serializable class that you have no control over. You can neither add the Serializable tag to this class nor change the underlying implementation to change the method into a function.
Nilesh presents a great workaround for this, but the solution can be made both more concise and general:
def genMapper[A, B](f: A => B): A => B = {
val locker = com.twitter.chill.MeatLocker(f)
x => locker.get.apply(x)
}
This function-serializer can then be used to automatically wrap closures and method calls:
rdd map genMapper(someFunc)
This technique also has the benefit of not requiring the additional Shark dependencies in order to access KryoSerializationWrapper, since Twitter's Chill is already pulled in by core Spark.
Complete talk fully explaining the problem, which proposes a great paradigm shifting way to avoid these serialization problems: https://github.com/samthebest/dump/blob/master/sams-scala-tutorial/serialization-exceptions-and-memory-leaks-no-ws.md
The top voted answer is basically suggesting throwing away an entire language feature - that is no longer using methods and only using functions. Indeed in functional programming methods in classes should be avoided, but turning them into functions isn't solving the design issue here (see above link).
As a quick fix in this particular situation you could just use the @transient annotation to tell it not to try to serialise the offending value (here, Spark.ctx is a custom class not Spark's one following OP's naming):
@transient
val rddList = Spark.ctx.parallelize(list)
You can also restructure code so that rddList lives somewhere else, but that is also nasty.
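What @transient does can be checked with plain Java serialization. In this sketch `Resource` is a hypothetical non-serializable class standing in for the offending value, and `Job` is an assumed serializable owner; the marked field is skipped when the object is written and is therefore absent when it is read back.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object TransientDemo {
  // Hypothetical resource, deliberately not Serializable.
  final class Resource

  class Job(val name: String) extends Serializable {
    // Skipped by Java serialization; without @transient, writing a Job would fail.
    @transient val resource: Resource = new Resource
  }

  // Serialize and immediately deserialize, as if sending the object to another JVM.
  def roundTrip(job: Job): Job = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(job) finally out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[Job] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val copy = roundTrip(new Job("job-1"))
    println(copy.name)
    println(copy.resource == null) // the transient field was not restored
  }
}
```

The price is that the field is null on the receiving side, so this only helps when the value is not needed there; a commonly used variation is `@transient lazy val`, so that the value is rebuilt on first access.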
The Future is Probably Spores
In the future Scala will include these things called "spores" that should allow fine-grained control over what exactly does and does not get pulled in by a closure. Furthermore, this should turn all mistakes of accidentally pulling in non-serializable types (or any unwanted values) into compile errors, rather than the horrible runtime exceptions / memory leaks we get now.
http://docs.scala-lang.org/sips/pending/spores.html
A tip on Kryo serialization
When using Kryo, make it so that registration is necessary; this will mean you get errors instead of memory leaks:
"Finally, I know that kryo has kryo.setRegistrationOptional(true) but I am having a very difficult time trying to figure out how to use it. When this option is turned on, kryo still seems to throw exceptions if I haven't registered classes."
Strategy for registering classes with kryo
Of course this only gives you type-level control not value-level control.
... more ideas to come.
I faced similar issue, and what I understand from Grega's answer is
object NOTworking extends App {
new testing().doIT
}
//adding extends Serializable wont help
class testing {
val list = List(1,2,3)
val rddList = Spark.ctx.parallelize(list)
def doIT = {
//again calling the function someFunc
val after = rddList.map(someFunc(_))
//this will crash (spark lazy)
after.collect().map(println(_))
}
def someFunc(a:Int) = a+1
}
Your doIT method tries to serialize the someFunc(_) method, but since methods are not serializable, it tries to serialize the class testing, which is itself not serializable.
To make your code work, you should define someFunc inside the doIT method. For example:
def doIT = {
def someFunc(a:Int) = a+1
//function definition
val after = rddList.map(someFunc(_))
after.collect().map(println(_))
}
And if there are multiple functions coming into picture, then all those functions should be available to the parent context.
I solved this problem using a different approach. You simply need to serialize the objects before passing through the closure, and de-serialize afterwards. This approach just works, even if your classes aren't Serializable, because it uses Kryo behind the scenes. All you need is some curry. ;)
Here's an example of how I did it:
def genMapper(kryoWrapper: KryoSerializationWrapper[(Foo => Bar)])
(foo: Foo) : Bar = {
kryoWrapper.value.apply(foo)
}
val mapper = genMapper(KryoSerializationWrapper(new Blah(abc))) _
rdd.flatMap(mapper).collectAsMap()
class Blah(abc: ABC) extends (Foo => Bar) {
def apply(foo: Foo): Bar = ??? // the real function body goes here
}
Feel free to make Blah as complicated as you want, class, companion object, nested classes, references to multiple 3rd party libs.
KryoSerializationWrapper refers to: https://github.com/amplab/shark/blob/master/src/main/scala/shark/execution/serialization/KryoSerializationWrapper.scala
I'm not entirely certain that this applies to Scala but, in Java, I solved the NotSerializableException by refactoring my code so that the closure did not access a non-serializable final field.
Scala methods defined in a class are not serializable on their own; converting them into functions can resolve the serialization issue.
Method syntax
def func_name (x: String) : String = {
...
return x
}
Function syntax
val func_name = { (x: String) =>
...
x
}
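A compilable version of the two forms, plus the conversion between them, is sketched below. The names `shout`, `shoutFn` and `lifted` are made up for illustration. Note that when a method of a class instance is converted to a function value, the resulting function still refers to that instance.

```scala
object MethodVsFunction {
  // A method: belongs to its enclosing object or class and is not a value by itself.
  def shout(x: String): String = x.toUpperCase

  // A function value: an ordinary object of type String => String.
  val shoutFn: String => String = (x: String) => x.toUpperCase

  // Eta-expansion: the compiler wraps the method in a function value
  // because a function type is expected here.
  val lifted: String => String = shout

  def main(args: Array[String]): Unit = {
    println(shout("method"))
    println(shoutFn("function"))
    println(List("a", "b").map(lifted))
  }
}
```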
FYI in Spark 2.4 a lot of you will probably encounter this issue. Kryo serialization has gotten better but in many cases you cannot use spark.kryo.unsafe=true or the naive kryo serializer.
For a quick fix try changing the following in your Spark configuration
spark.kryo.unsafe="false"
OR
spark.serializer="org.apache.spark.serializer.JavaSerializer"
I modify custom RDD transformations that I encounter or personally write by using explicit broadcast variables and utilizing the new inbuilt twitter-chill api, converting them from rdd.map(row => to rdd.mapPartitions(partition => { functions.
Example
Old (not-great) Way
val sampleMap = Map("index1" -> 1234, "index2" -> 2345)
val outputRDD = rdd.map(row => {
val value = sampleMap.get(row._1)
value
})
Alternative (better) Way
import com.twitter.chill.MeatLocker
val sampleMap = Map("index1" -> 1234, "index2" -> 2345)
val brdSerSampleMap = spark.sparkContext.broadcast(MeatLocker(sampleMap))
rdd.mapPartitions(partition => {
val deSerSampleMap = brdSerSampleMap.value.get
partition.map(row => {
val value = deSerSampleMap.get(row._1)
value
}).toIterator
})
This new way reads the broadcast variable only once per partition, which is better. You will still need to use Java serialization if you do not register classes.
I had a similar experience.
The error was triggered when I initialized a variable on the driver (master), but then tried to use it on one of the workers.
When that happens, Spark Streaming will try to serialize the object to send it over to the worker, and fail if the object is not serializable.
I solved the error by making the variable static.
Previous non-working code
private final PhoneNumberUtil phoneUtil = PhoneNumberUtil.getInstance();
Working code
private static final PhoneNumberUtil phoneUtil = PhoneNumberUtil.getInstance();
Credits:
https://learn.microsoft.com/en-us/answers/questions/35812/sparkexception-job-aborted-due-to-stage-failure-ta.html ( The answer of pradeepcheekatla-msft)
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html
def upper(name: String) : String = {
var uppper : String = name.toUpperCase()
uppper
}
val toUpperName = udf {(EmpName: String) => upper(EmpName)}
val emp_details = """[{"id": "1","name": "James Butt","country": "USA"},
{"id": "2", "name": "Josephine Darakjy","country": "USA"},
{"id": "3", "name": "Art Venere","country": "USA"},
{"id": "4", "name": "Lenna Paprocki","country": "USA"},
{"id": "5", "name": "Donette Foller","country": "USA"},
{"id": "6", "name": "Leota Dilliard","country": "USA"}]"""
val df_emp = spark.read.json(Seq(emp_details).toDS())
val df_name=df_emp.select($"id",$"name")
val df_upperName= df_name.withColumn("name",toUpperName($"name")).filter("id='5'")
display(df_upperName)
This gives the following error:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
Solution -
import java.io.Serializable;
object obj_upper extends Serializable {
def upper(name: String) : String =
{
var uppper : String = name.toUpperCase()
uppper
}
val toUpperName = udf {(EmpName: String) => upper(EmpName)}
}
val df_upperName=
df_name.withColumn("name",obj_upper.toUpperName($"name")).filter("id='5'")
display(df_upperName)
My solution was to add a companion class that handles all the methods that are not serializable within the class.
Given the trait (simplified)
trait A {
val eventStream: EventStream
val credentialsStorage = // something here
val userStorage = // something here
val crypto = // something here
...
lazy val authSvc = new CoreAuthentication(credentialsStorage, new AuthenticationProviderResolver, userStorage, eventStream, crypto)
}
class T extends A with TraitProvidingEventStream with FlatSpec with [lot of another traits here] {
val eventStream = systemFromTraitProvidingEventStream.eventStream
"This" should "work" in {
println(authSvc) // this is "magic"
val user = authSvc.doSomethingWithUser(...);
}
}
If I remove the line marked as // this is "magic", then I get a NullPointerException on the next line, so authSvc is null.
What may be wrong there?
I wasn't able to create a clean, small test case for this; usually this works well.
This came up once on the ML: If an exception is thrown when initializing a lazy val, the val is null; but you can attempt to init again and it can work magically. (That is, the "initialized" bit flag for the lazy val is not set on the first failed attempt to initialize.)
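The retry behaviour is easy to reproduce in isolation. In this sketch `Flaky` is a hypothetical class whose lazy val fails on the first initialization attempt only; because the failed attempt does not mark the val as initialized, the second access runs the initializer again.

```scala
import scala.util.Try

object LazyRetryDemo {
  final class Flaky {
    var attempts = 0

    // The initializer throws on its first run and succeeds afterwards.
    lazy val value: String = {
      attempts += 1
      if (attempts == 1) throw new IllegalStateException("first initialization fails")
      "initialized"
    }
  }

  def main(args: Array[String]): Unit = {
    val flaky = new Flaky
    println(Try(flaky.value)) // the first access fails
    println(flaky.value)      // the second access re-runs the initializer
    println(flaky.attempts)
  }
}
```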
I think the case on the ML had to do with init order of vals in traits, so maybe that's your problem. It's infamously dangerous to rely on it, hence the advice to use defs in traits. See Luigi's comment on DelayedInit.
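The initialization-order hazard itself can be shown in a few lines. The traits and classes below are hypothetical and much smaller than the ones in the question: a strict val in a trait is computed before the subclass has assigned the val it depends on, whereas a lazy val (or a def) is only evaluated on first access, by which time construction has finished.

```scala
object InitOrderDemo {
  trait Eager {
    val name: String
    // Evaluated while the trait is initialized, before the subclass sets `name`.
    val greeting: String = "hello " + name
  }

  trait Deferred {
    val name: String
    // Evaluated on first access, after the whole object has been constructed.
    lazy val greeting: String = "hello " + name
  }

  class EagerImpl extends Eager { val name: String = "world" }
  class DeferredImpl extends Deferred { val name: String = "world" }

  def main(args: Array[String]): Unit = {
    println(new EagerImpl().greeting)    // sees the not-yet-initialized `name`
    println(new DeferredImpl().greeting)
  }
}
```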