This is more of a Scala question than a Spark one. I have this Spark initialization code:
object EntryPoint {
val spark = SparkFactory.createSparkSession(...
val funcsSingleton = ContextSingleton[CustomFunctions] { new CustomFunctions(Some(hashConf)) }
lazy val funcs = funcsSingleton.get
//this part I want moved to another place since there are many many UDFs
spark.udf.register("funcName", udf {funcName _ })
}
The other class, CustomFunctions looks like this
class CustomFunctions(val hashConfig: Option[HashConfig], spark: Option[SparkSession] = None) {
val funcUdf = udf { funcName _ }
def funcName(colValue: String) = withDefinedOpt(hashConfig) { c =>
...}
}
The class above is wrapped in a Serializable holder, ContextSingleton, which is defined like so:
class ContextSingleton[T: ClassTag](constructor: => T) extends AnyRef with Serializable {
val uuid = UUID.randomUUID.toString
@transient private lazy val instance = ContextSingleton.pool.synchronized {
ContextSingleton.pool.getOrElseUpdate(uuid, constructor)
}
def get = instance.asInstanceOf[T]
}
object ContextSingleton {
private val pool = new TrieMap[String, Any]()
def apply[T: ClassTag](constructor: => T): ContextSingleton[T] = new ContextSingleton[T](constructor)
def poolSize: Int = pool.size
def poolClear(): Unit = pool.clear()
}
Now to my problem: I don't want to register the UDFs explicitly as done in the EntryPoint app. I create all UDFs as needed in my CustomFunctions class and then want to dynamically register only the ones named in the user-provided config. What would be the best way to achieve this? Also, I want to register the required UDFs outside the main app, but that throws the infamous Task not serializable exception. Serializing the big CustomFunctions class is not a good idea, hence I wrapped it in ContextSingleton, but that does not solve my problem of registering UDFs outside the main app. Please suggest the right approach.
I am adding logging into my Play application, and in order to avoid cumbersome and repeated code I have created a case class with a formatting function inside in order to clean up my logs:
final case class LogMessage(keyValuePairs: (String, String)*) {
def jsonify: String =
s"""{${keyValuePairs map { case (key, value) => s""""$key":"$value"""" } mkString "," }}"""
}
Currently, in order to call this method I have to do something like:
Logger.info(LogMessage(("message", s"here is my message")).jsonify)
// jsonify prints:
// {"message":"here is my message"}
This runs fine, but I don't like how I need to write .jsonify after every time I make a new case class. Is there a way to make this method automatically called on creating a LogMessage case class so I don't have jsonify written all over my code? I have read about implicit methods but simply changing the method to implicit def jsonify = ... doesn't do anything.
This may be solved by overriding toString method of LogMessage, but there is also a better option.
Since actually there is no need to use instances of LogMessage, we can implement an object with its apply() method, which will serve as converter from tuples to the jsonified String:
object JsonifiedMessage {
def apply(keyValuePairs: (String, Any)*): String = {
val jsonified = keyValuePairs.map { case (key, value) => s""""$key":"$value"""" }.mkString(",")
s"{${jsonified}}"
}
}
Can be used as:
Logger.info(JsonifiedMessage(("a", 1), ("b", 2)))
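The toString alternative mentioned at the top of this answer could look like the sketch below. It reuses the LogMessage class from the question; LogMessageDemo is a hypothetical name, and println stands in for Logger.info, which is not available outside a Play application. Values are interpolated without any JSON escaping, exactly as in the original jsonify.

```scala
final case class LogMessage(keyValuePairs: (String, String)*) {
  // Overriding toString lets println and string interpolation render the JSON directly.
  override def toString: String =
    keyValuePairs
      .map { case (key, value) => s""""$key":"$value"""" }
      .mkString("{", ",", "}")
}

object LogMessageDemo {
  def main(args: Array[String]): Unit = {
    val msg = LogMessage("message" -> "here is my message")
    println(msg) // {"message":"here is my message"}
  }
}
```

A logger method that only accepts a String would still need an explicit msg.toString or an interpolation such as s"$msg" at the call site, which is one reason the apply-based object above may be the more convenient option.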
I have to append this column generated by the method 'strToInt' which is turning out to be not serializable.
def strToInt(colVal : String) : Int = {
var str = new Array[String](3)
str(0) = "icmp"; str(1) = "tcp"; str(2) = "udp"
var i = 0
for (i <- 0 to str.length-1) {
if (str(i) == colVal) { return i }
}
throw new IllegalStateException("This never happens")
}
val strtoint = udf(strToInt(_:String)).apply(col("Atr 1"))
val newDF = df.withColumn("newCol", strtoint)
I have tried putting the function in a helper class this way,
object Helper extends Serializable {
def strToInt ...
}
but it doesn't help.
Change your code to be as follows where the function execution is at withColumn level (not when the UDF is defined).
// define a UDF
val strtoint = udf(strToInt _)
// use it (aka execute)
val newDF = df.withColumn("newCol", strtoint(col("Atr 1")))
That seemingly little change changes what you create and how you execute it afterwards.
As you may have noticed already, udf creates a user-defined function that Spark SQL understands (and can execute):
udf[RT, A1](f: (A1) ⇒ RT): UserDefinedFunction
Defines a user-defined function of 1 arguments as user-defined function (UDF).
(I removed the implicit parameters to ease comprehension)
Quoting the scaladoc of UserDefinedFunction:
A user-defined function. To create one, use the udf functions in functions.
Not much, I agree, but the "protocol" is to define a UDF first before you can execute it in your queries, say with the withColumn or select operators.
I'd also change strToInt to be more Scala-idiomatic (and hopefully easier to comprehend, too).
def strToInt(colVal : String) : Int = {
val strs = Array("icmp", "tcp", "udp")
strs.indexOf(colVal)
}
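One behavioural difference is worth noting: Array.indexOf returns -1 for a value that is not in the array, whereas the original loop threw an IllegalStateException. A minimal sketch, with StrToIntCheck as a hypothetical wrapper object:

```scala
object StrToIntCheck {
  private val protocols = Array("icmp", "tcp", "udp")

  // Position of the protocol name, or -1 when the name is unknown.
  def strToInt(colVal: String): Int = protocols.indexOf(colVal)

  def main(args: Array[String]): Unit = {
    println(strToInt("tcp")) // 1
    println(strToInt("gre")) // -1
  }
}
```

If unknown values should still fail loudly, the result can be checked for -1 before it is returned.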
The key to understanding what's going on here is that while Scala is a functional programming language, it runs on the JVM which does not have support for a functional type. At runtime, any val assigned an "anonymous" or "lambda" function will actually be an instance of an anonymous class with an apply method. So let's say you have the following:
object helper {
val isNegative: (Int => Boolean) = (n: Int) => n < 0
}
This compiles to the same thing as this:
object helper {
val isNegative: Function1[Int, Boolean] = new Function1[Int, Boolean] {
def apply(n: Int): Boolean = n < 0
}
}
isNegative is really an anonymous class instance extending the trait Function1. When you instead do this:
object helper {
def isNegative(n: Int): Boolean = n < 0
}
Now isNegative is a method of the object helper instead. When it comes to dealing with Spark, if you were to do something like this:
// ds is a Dataset[Int]
ds.filter(isNegative)
In the first case Spark will have to serialize the anonymous class assigned to isNegative and fail because it is not serializable. In the second case, it will have to serialize helper, which does work because an object is serializable if all its state is serializable.
To apply this to your problem, when you do this:
val strtoint = udf(strToInt(_:String)).apply(col("Atr 1"))
at runtime what strtoint is is an anonymous class instance with the trait Function1[String, UserDefinedFunction], that is, a function that generates a UserDefinedFunction when it is called. With the underscore filled in, it is identical to this:
val strtoint: Function1[String, UserDefinedFunction] = new Function1[String, UserDefinedFunction] {
def apply(t1: String) = udf(strToInt(t1 :String)).apply(col("Atr 1"))
}
To minimally change your code, you can just change the val to a def:
def sti = udf(strToInt(_:String)).apply(col("Atr 1"))
Now sti is a member function of its enclosing class, and if that is serializable, you should be good as far as Spark is concerned. The other thing to keep in mind here is that strToInt also needs to be part of a serializable class or object.
The other way to fix this, as has been suggested, would be to change val strtoint to a UserDefinedFunction, which is a case class and thus serializable; however, you still need to make sure that strToInt is a member of a serializable class or object.
This problem seems to be similar to the problem I was experiencing (In Java).
My udf function was using Cipher library to encrypt something and the exception that was thrown is :
Caused by: java.io.NotSerializableException: javax.crypto.Cipher
Serialization stack:
- object not serializable (class: javax.crypto.Cipher, value: javax.crypto.Cipher@625d02ce)
I could not add 'implements Serializable' to Cipher class because it was a library provided by Java.
I used the following solution from this link : spark-how-to-call-udf-over-dataset-in-java
private static UDF1<String, String> toUpper = new UDF1<String, String>() {
public String call(final String str) throws Exception {
return str.toUpperCase();
}
};
Register the UDF and you can use callUDF function.
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;
sqlContext.udf().register("toUpper", toUpper, DataTypes.StringType);
peopleDF.select(col("name"),callUDF("toUpper", col("name"))).show();
Where instead of calling str.toUpperCase(); I called my Cipher instance.
I am getting strange behavior when calling a function defined outside of a closure:
when the function is in an object, everything works;
when the function is in a class, I get:
Task not serializable: java.io.NotSerializableException: testing
The problem is that I need my code in a class, not an object. Any idea why this is happening? Is a Scala object serialized by default?
This is a working code example:
object working extends App {
val list = List(1,2,3)
val rddList = Spark.ctx.parallelize(list)
//calling function outside closure
val after = rddList.map(someFunc(_))
def someFunc(a:Int) = a+1
after.collect().map(println(_))
}
This is the non-working example :
object NOTworking extends App {
new testing().doIT
}
//adding extends Serializable won't help
class testing {
val list = List(1,2,3)
val rddList = Spark.ctx.parallelize(list)
def doIT = {
//again calling the function someFunc
val after = rddList.map(someFunc(_))
//this will crash (spark lazy)
after.collect().map(println(_))
}
def someFunc(a:Int) = a+1
}
RDDs extend the Serializable interface, so this is not what's causing your task to fail. This doesn't mean, however, that you can serialize an RDD with Spark and avoid a NotSerializableException.
Spark is a distributed computing engine and its main abstraction is a resilient distributed dataset (RDD), which can be viewed as a distributed collection. Basically, RDD's elements are partitioned across the nodes of the cluster, but Spark abstracts this away from the user, letting the user interact with the RDD (collection) as if it were a local one.
Not to get into too many details, but when you run different transformations on an RDD (map, flatMap, filter and others), your transformation code (closure) is:
serialized on the driver node,
shipped to the appropriate nodes in the cluster,
deserialized,
and finally executed on the nodes
You can of course run this locally (as in your example), but all those phases (apart from shipping over network) still occur. [This lets you catch any bugs even before deploying to production]
What happens in your second case is that you are calling a method, defined in class testing from inside the map function. Spark sees that and since methods cannot be serialized on their own, Spark tries to serialize the whole testing class, so that the code will still work when executed in another JVM. You have two possibilities:
Either you make class testing serializable, so the whole class can be serialized by Spark:
import org.apache.spark.{SparkContext,SparkConf}
object Spark {
val ctx = new SparkContext(new SparkConf().setAppName("test").setMaster("local[*]"))
}
object NOTworking extends App {
new Test().doIT
}
class Test extends java.io.Serializable {
val rddList = Spark.ctx.parallelize(List(1,2,3))
def doIT() = {
val after = rddList.map(someFunc)
after.collect().foreach(println)
}
def someFunc(a: Int) = a + 1
}
or you make someFunc a function instead of a method (functions are objects in Scala), so that Spark will be able to serialize it:
import org.apache.spark.{SparkContext,SparkConf}
object Spark {
val ctx = new SparkContext(new SparkConf().setAppName("test").setMaster("local[*]"))
}
object NOTworking extends App {
new Test().doIT
}
class Test {
val rddList = Spark.ctx.parallelize(List(1,2,3))
def doIT() = {
val after = rddList.map(someFunc)
after.collect().foreach(println)
}
val someFunc = (a: Int) => a + 1
}
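The difference between a method and a function value can be reproduced without a cluster. The sketch below uses plain Java serialization as a rough stand-in for what Spark does to a closure on the driver; ClosureCheck, WithMethod and WithFunction are hypothetical names, and the printed results assume Scala 2.12 or later.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object ClosureCheck {
  // True if Java serialization accepts the object, false if it reaches a non-serializable reference.
  def serializes(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      try out.writeObject(obj) finally out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    println(serializes(new WithMethod().closure))   // false
    println(serializes(new WithFunction().closure)) // true
  }
}

// Deliberately not Serializable, like class testing in the question.
class WithMethod {
  def someFunc(a: Int): Int = a + 1
  // Calling a method from a lambda makes the lambda hold a reference to `this`.
  def closure: Int => Int = a => someFunc(a)
}

// Also not Serializable, but the function value never touches `this`.
class WithFunction {
  val someFunc: Int => Int = (a: Int) => a + 1
  def closure: Int => Int = someFunc
}
```

Note that a function value stops being safe as soon as its body reads another member of the enclosing class, because that read goes through `this` again.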
A similar, but not identical, problem with class serialization may be of interest to you; you can read about it in this Spark Summit 2013 presentation.
As a side note, you can rewrite rddList.map(someFunc(_)) to rddList.map(someFunc), they are exactly the same. Usually, the second is preferred as it's less verbose and cleaner to read.
EDIT (2015-03-15): SPARK-5307 introduced SerializationDebugger and Spark 1.3.0 is the first version to use it. It adds serialization path to a NotSerializableException. When a NotSerializableException is encountered, the debugger visits the object graph to find the path towards the object that cannot be serialized, and constructs information to help user to find the object.
In OP's case, this is what gets printed to stdout:
Serialization stack:
- object not serializable (class: testing, value: testing@2dfe2f00)
- field (class: testing$$anonfun$1, name: $outer, type: class testing)
- object (class testing$$anonfun$1, <function1>)
Grega's answer is great in explaining why the original code does not work and two ways to fix the issue. However, this solution is not very flexible; consider the case where your closure includes a method call on a non-Serializable class that you have no control over. You can neither add the Serializable tag to this class nor change the underlying implementation to change the method into a function.
Nilesh presents a great workaround for this, but the solution can be made both more concise and general:
def genMapper[A, B](f: A => B): A => B = {
val locker = com.twitter.chill.MeatLocker(f)
x => locker.get.apply(x)
}
This function-serializer can then be used to automatically wrap closures and method calls:
rdd map genMapper(someFunc)
This technique also has the benefit of not requiring the additional Shark dependencies in order to access KryoSerializationWrapper, since Twitter's Chill is already pulled in by core Spark.
Complete talk fully explaining the problem, which proposes a great paradigm shifting way to avoid these serialization problems: https://github.com/samthebest/dump/blob/master/sams-scala-tutorial/serialization-exceptions-and-memory-leaks-no-ws.md
The top voted answer is basically suggesting throwing away an entire language feature - that is no longer using methods and only using functions. Indeed in functional programming methods in classes should be avoided, but turning them into functions isn't solving the design issue here (see above link).
As a quick fix in this particular situation you could just use the @transient annotation to tell it not to try to serialise the offending value (here, Spark.ctx is a custom class not Spark's one following OP's naming):
@transient
val rddList = Spark.ctx.parallelize(list)
You can also restructure code so that rddList lives somewhere else, but that is also nasty.
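A related pattern is to combine @transient with lazy val, so that the skipped value is rebuilt on first use after deserialization instead of staying null. The sketch below uses plain Java serialization rather than Spark; Connection, Worker and TransientDemo are hypothetical names.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Stand-in for a resource that cannot be serialized.
class Connection {
  def send(msg: String): String = s"sent: $msg"
}

class Worker(val name: String) extends Serializable {
  // Skipped during serialization and rebuilt lazily on first access afterwards.
  @transient lazy val conn: Connection = new Connection
}

object TransientDemo {
  // Serializes and deserializes an object, roughly what happens to a task closure.
  def roundTrip[T <: AnyRef](obj: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(obj) finally out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val worker = new Worker("w1")
    worker.conn.send("warm-up")        // initialise the field before serializing
    val copy = roundTrip(worker)
    println(copy.conn.send(copy.name)) // sent: w1
  }
}
```

Without lazy, the transient field would simply be null in the deserialized copy.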
The Future is Probably Spores
In the future, Scala will include these things called "spores", which should give us fine-grained control over what does and does not get pulled in by a closure. Furthermore, this should turn all mistakes of accidentally pulling in non-serializable types (or any unwanted values) into compile errors, rather than what we have now: horrible runtime exceptions and memory leaks.
http://docs.scala-lang.org/sips/pending/spores.html
A tip on Kryo serialization
When using Kryo, make registration required; this means you get errors instead of memory leaks:
"Finally, I know that kryo has kryo.setRegistrationOptional(true) but I am having a very difficult time trying to figure out how to use it. When this option is turned on, kryo still seems to throw exceptions if I haven't registered classes."
Strategy for registering classes with kryo
Of course this only gives you type-level control not value-level control.
... more ideas to come.
I faced similar issue, and what I understand from Grega's answer is
object NOTworking extends App {
new testing().doIT
}
//adding extends Serializable won't help
class testing {
val list = List(1,2,3)
val rddList = Spark.ctx.parallelize(list)
def doIT = {
//again calling the function someFunc
val after = rddList.map(someFunc(_))
//this will crash (spark lazy)
after.collect().map(println(_))
}
def someFunc(a:Int) = a+1
}
Your doIT method is trying to serialize the someFunc(_) method, but as methods are not serializable, it tries to serialize the class testing, which again is not serializable.
So to make your code work, you should define someFunc inside the doIT method. For example:
def doIT = {
// function definition, local to doIT
def someFunc(a:Int) = a+1
val after = rddList.map(someFunc(_))
after.collect().map(println(_))
}
And if there are multiple functions coming into picture, then all those functions should be available to the parent context.
I solved this problem using a different approach. You simply need to serialize the objects before passing through the closure, and de-serialize afterwards. This approach just works, even if your classes aren't Serializable, because it uses Kryo behind the scenes. All you need is some curry. ;)
Here's an example of how I did it:
def genMapper(kryoWrapper: KryoSerializationWrapper[(Foo => Bar)])
(foo: Foo) : Bar = {
kryoWrapper.value.apply(foo)
}
val mapper = genMapper(KryoSerializationWrapper(new Blah(abc))) _
rdd.flatMap(mapper).collectAsMap()
class Blah(abc: ABC) extends (Foo => Bar) {
// This is the real function
def apply(foo: Foo): Bar = ???
}
Feel free to make Blah as complicated as you want, class, companion object, nested classes, references to multiple 3rd party libs.
KryoSerializationWrapper refers to: https://github.com/amplab/shark/blob/master/src/main/scala/shark/execution/serialization/KryoSerializationWrapper.scala
I'm not entirely certain that this applies to Scala but, in Java, I solved the NotSerializableException by refactoring my code so that the closure did not access a non-serializable final field.
Scala methods defined in a class are not serializable on their own; a method can be converted into a function to resolve the serialization issue.
Method syntax
def func_name(x: String): String = {
...
x
}
Function syntax
val func_name = { (x: String) =>
...
x
}
FYI in Spark 2.4 a lot of you will probably encounter this issue. Kryo serialization has gotten better but in many cases you cannot use spark.kryo.unsafe=true or the naive kryo serializer.
For a quick fix try changing the following in your Spark configuration
spark.kryo.unsafe="false"
OR
spark.serializer="org.apache.spark.serializer.JavaSerializer"
I modify custom RDD transformations that I encounter or personally write by using explicit broadcast variables and utilizing the new inbuilt twitter-chill api, converting them from rdd.map(row => to rdd.mapPartitions(partition => { functions.
Example
Old (not-great) Way
val sampleMap = Map("index1" -> 1234, "index2" -> 2345)
val outputRDD = rdd.map(row => {
val value = sampleMap.get(row._1)
value
})
Alternative (better) Way
import com.twitter.chill.MeatLocker
val sampleMap = Map("index1" -> 1234, "index2" -> 2345)
val brdSerSampleMap = spark.sparkContext.broadcast(MeatLocker(sampleMap))
rdd.mapPartitions(partition => {
val deSerSampleMap = brdSerSampleMap.value.get
partition.map(row => {
val value = deSerSampleMap.get(row._1)
value
}).toIterator
})
This new way will only call the broadcast variable once per partition which is better. You will still need to use Java Serialization if you do not register classes.
I had a similar experience.
The error was triggered when I initialized a variable on the driver (master) but then tried to use it on one of the workers.
When that happens, Spark Streaming tries to serialize the object to send it over to the worker, and fails if the object is not serializable.
I solved the error by making the variable static.
Previous non-working code
private final PhoneNumberUtil phoneUtil = PhoneNumberUtil.getInstance();
Working code
private static final PhoneNumberUtil phoneUtil = PhoneNumberUtil.getInstance();
Credits:
https://learn.microsoft.com/en-us/answers/questions/35812/sparkexception-job-aborted-due-to-stage-failure-ta.html ( The answer of pradeepcheekatla-msft)
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html
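In Scala the closest equivalent of a static field is a member of an object. The sketch below, which uses plain Java serialization rather than Spark, suggests why this helps: a lambda that reaches the helper through an object captures nothing, while one that closes over a local instance drags the instance along. Formatter (a stand-in for PhoneNumberUtil), Shared and StaticHolderCheck are hypothetical names.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Not Serializable; stands in for a third-party helper such as PhoneNumberUtil.
class Formatter {
  def pad(n: Int): String = f"$n%05d"
}

// One instance per JVM, initialised on first use, like a static final field in Java.
object Shared {
  val formatter = new Formatter
}

object StaticHolderCheck {
  // True if Java serialization accepts the object.
  def serializes(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      try out.writeObject(obj) finally out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  // The lambda looks the helper up through the object at call time; nothing is captured.
  def viaObject: Int => String = n => Shared.formatter.pad(n)

  // The lambda closes over a local, non-serializable instance.
  def viaLocal: Int => String = {
    val local = new Formatter
    n => local.pad(n)
  }

  def main(args: Array[String]): Unit = {
    println(serializes(viaObject)) // true
    println(serializes(viaLocal))  // false
    println(viaObject(42))         // 00042
  }
}
```

On a cluster, each executor JVM would initialise its own copy of Shared.formatter the first time a task touches it.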
def upper(name: String) : String = {
var uppper : String = name.toUpperCase()
uppper
}
val toUpperName = udf {(EmpName: String) => upper(EmpName)}
val emp_details = """[{"id": "1","name": "James Butt","country": "USA"},
{"id": "2", "name": "Josephine Darakjy","country": "USA"},
{"id": "3", "name": "Art Venere","country": "USA"},
{"id": "4", "name": "Lenna Paprocki","country": "USA"},
{"id": "5", "name": "Donette Foller","country": "USA"},
{"id": "6", "name": "Leota Dilliard","country": "USA"}]"""
val df_emp = spark.read.json(Seq(emp_details).toDS())
val df_name=df_emp.select($"id",$"name")
val df_upperName= df_name.withColumn("name",toUpperName($"name")).filter("id='5'")
display(df_upperName)
This gives the following error:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
Solution -
import java.io.Serializable;
object obj_upper extends Serializable {
def upper(name: String) : String =
{
var uppper : String = name.toUpperCase()
uppper
}
val toUpperName = udf {(EmpName: String) => upper(EmpName)}
}
val df_upperName=
df_name.withColumn("name",obj_upper.toUpperName($"name")).filter("id='5'")
display(df_upperName)
My solution was to add a companion class that handles all methods that are not serializable within the class.
def whatever(): Unit = {
// ...
val parser = new MyParser
val parse = parser.parse(input)
if (parse successful) {
semanticAnalysis(parse)
}
}
def semanticAnalysis(parse: DontKnowTheCorrectType): Unit = {
// ...
}
What type do I have to give to the formal parameter parse? Hovering over parse inside whatever says val parse: parser.ParseResult[parsing.Program], but of course that doesn't work as a parameter type of semanticAnalysis, because the local variable parse is not in scope there.
Parse results are path-dependent types because they are results of this specific parser and there's no guarantee they are compatible.
This is why parsers are usually not instantiated the way you use them (new MyParser), but as object.
object MyParser extends RegexParsers {
def statements : Parser[List[Statement]] = // ...
}
def handleResult(res: MyParser.ParseResult[List[Statement]]) = {
// ...
}
val res = MyParser.parseAll(MyParser.statements, "/* */")
If you need more dynamic behaviour (or want concurrent parsing, parser combinators aren't thread-safe, ouch), you'll just have to keep the parser object accessible (and stable) wherever you want to use its results.
Sadly, passing a parser and its result around together is not trivial because you run into the prohibition of dependent method types, e.g.
def fun(p: scala.util.parsing.combinator.Parsers, res: p.ParseResult[_]) = {}
won't compile ("illegal dependent method type"), but there are ways to get around that if you must, like the answer to this question.
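On Scala 2.10 and later, one way around this is to put the parser and its result into separate parameter lists, since a parameter may be referenced in the type of a later parameter section. A minimal sketch follows; MiniParser is a hypothetical stand-in for a combinator parser, used so that the example compiles without the parser-combinators library.

```scala
// Hypothetical stand-in for a combinator parser: each instance has its own Result type.
class MiniParser {
  case class Result(value: String)
  def parse(in: String): Result = Result(in.trim)
}

object DependentParams {
  // `res` may refer to `p` because `p` is declared in an earlier parameter list.
  def length(p: MiniParser)(res: p.Result): Int = res.value.length

  def main(args: Array[String]): Unit = {
    val parser = new MiniParser
    println(length(parser)(parser.parse("  abc "))) // 3
  }
}
```

The parser argument has to be a stable identifier (a val or an object) at the call site, otherwise the compiler cannot relate the result's type to it.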
You should be able to define semanticAnalysis as:
def semanticAnalysis(parse: MyParser#ParseResult[parsing.Program]) = {
...
}
Note the use of # instead of . for the type. There are more details about type projections here, but basically, parser.ParseResult is different for each parser, while MyParser#ParseResult is the same for all parser instances.
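The difference between the two notations can be seen in a small self-contained sketch. TinyParser is a hypothetical class with an inner Result type, standing in for MyParser and its ParseResult.

```scala
// Hypothetical stand-in for MyParser: Result is an inner class, so each instance has its own type.
class TinyParser {
  case class Result(value: String)
  def parse(in: String): Result = Result(in.trim)
}

object ProjectionDemo {
  // TinyParser#Result accepts a Result produced by any TinyParser instance.
  def describe(r: TinyParser#Result): String = s"parsed '${r.value}'"

  def main(args: Array[String]): Unit = {
    val p1 = new TinyParser
    val p2 = new TinyParser
    println(describe(p1.parse(" a "))) // parsed 'a'
    println(describe(p2.parse("b")))   // parsed 'b'
    // val r: p1.Result = p2.parse("b") // does not compile: p2.Result is not p1.Result
  }
}
```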
But, as themel says, you should probably be using object instead of class.
You can rewrite it like this:
def whatever(): Unit = {
// ...
val parser = new MyParser
def semanticAnalysis(parse: parser.ParseResult[parsing.Program]): Unit = {
// ...
}
val parse = parser.parse(input)
if (parse successful) {
semanticAnalysis(parse)
}
}
If you have this being called from multiple places, then maybe this:
class SemanticAnalysis(parser: Parser) {
def apply(parse: parser.ParseResult) {
// ...
}
}
And then
if (parse successful) {
new SemanticAnalysis(parser)(parse)
}
When I need to use the result of parser combinators in other parts of a program, I extract the success result or error message from the path-dependent ParseResult and put the data into an independent type. It's more verbose than I like, but I want to keep the combinator instance an implementation detail of the parser.
sealed abstract class Result[+T]
case class Success[T](result: T) extends Result[T]
case class Failure(msg: String) extends Result[Nothing]
case class Error(msg: String) extends Result[Nothing]
/** Parse the package declarations in the file. */
def parse(file: String): Result[List[Package]] = {
val stream = ... // open the file...
val parser = new MyParser
val result = parser.parseAll(parser.packages, stream)
stream.close()
result match {
case parser.Success(packages, _) => Success(packages)
case parser.Failure(msg, _) => Failure(msg)
case parser.Error(msg, _) => Error(msg)
}
}