I customized a toy estimator "SimpleIndexer" by following Holden Karau's tutorial at https://www.oreilly.com/learning/extend-spark-ml-for-your-own-modeltransformer-types. The problem is that it errors out when used in "CrossValidator".
The error is
Exception in thread "main" java.lang.NoSuchMethodException: ....SimpleIndexerModel.<init>(java.lang.String)
at java.lang.Class.getConstructor0(Class.java:3082)
at java.lang.Class.getConstructor(Class.java:1825)
at org.apache.spark.ml.param.Params$class.defaultCopy(params.scala:846)
at org.apache.spark.ml.PipelineStage.defaultCopy(Pipeline.scala:42)
at com.nextperf.feature.SimpleIndexerModel.copy(SimpleIndexer.scala:63)
A similar question was asked before: java.lang.NoSuchMethodException: <Class>.<init>(java.lang.String) when copying custom Transformer. Apparently the issue comes from the "copy" method. I tried the solution mentioned in that post, but it does not work.
"SimpleIndexerModel" extends the DefaultParamsWritable trait
Add a companion object that extends the DefaultParamsReadable trait
class SimpleIndexerModel(override val uid: String, words: Array[String])
extends Model[SimpleIndexerModel] with SimpleIndexerParams with DefaultParamsWritable{
...
...
}
object SimpleIndexerModel extends DefaultParamsReadable[SimpleIndexerModel]
The spark official implementation of this toy example is "StringIndexer". I cannot find a clue. Does anyone know why it happens, and how to fix the problem?
//"StringIndexerModel" works fine
val indexer1 = new StringIndexerModel("abc",Array("a"))
val m1 = indexer1.copy(new ParamMap())
//
//"SimpleIndexerModel" fails
val indexer2 = new SimpleIndexerModel("abc",Array("a"))
// This call throws the exception.
val m2 = indexer2.copy(new ParamMap())
See the implementation of Params.defaultCopy: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/param/params.scala#L845
This method requires a constructor that takes a single String parameter (uid). You can resolve the problem by adding such a constructor to your SimpleIndexerModel class:
def this(uid: String) = {...}
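To see why the lookup fails, here is a minimal sketch that imitates the reflective constructor lookup from the stack trace (Class.getConstructor with a single String argument) using plain Scala classes and no Spark dependency. The names WithoutUidCtor, WithUidCtor, copyByUid, and hasUidConstructor are hypothetical, and the empty default for words is only an assumption; a real model would still have to carry its data over in copy.

```scala
object CopyByReflection {
  // Hypothetical stand-in for a model with only the two-argument constructor.
  class WithoutUidCtor(val uid: String, val words: Array[String])

  // Same shape, plus the single-String constructor the reflective lookup expects.
  class WithUidCtor(val uid: String, val words: Array[String]) {
    // Assumption for illustration: default to an empty word list.
    def this(uid: String) = this(uid, Array.empty[String])
  }

  // Mimics the lookup in the stack trace: find a (String) constructor and call it.
  def copyByUid[T](cls: Class[T], uid: String): T =
    cls.getConstructor(classOf[String]).newInstance(uid)

  // True if the class exposes a public constructor taking a single String.
  def hasUidConstructor(cls: Class[_]): Boolean =
    try { cls.getConstructor(classOf[String]); true }
    catch { case _: NoSuchMethodException => false }
}
```

With only the two-argument constructor, getConstructor(classOf[String]) throws NoSuchMethodException, which matches the error in the question.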
I am a C++ programmer trying to learn Scala. I want to achieve something similar to the following code, which uses a C++ template:
template<typename T>
class Foo {
public:
T* bar;
/////////////////Other Code Omitted//////////////////////////
};
Its counterpart in Scala will not compile due to type erasure:
class Foo[E](){
val bar = new E() //Will not compile
}
I have been searching the whole night for a workaround, and this seems to be one of them:
package test
import scala.reflect._
object Type {
def newInstance[T: ClassTag](init_args: AnyRef*): T = {
classTag[T].runtimeClass.getConstructors.head.newInstance(init_args: _*).asInstanceOf[T]
}
}
class Foo[T1:ClassTag](init_args: AnyRef*){
val bar = Type.newInstance[T1](init_args)
}
class TestClass(val arg:String){
val data = arg
}
However, when I try to instantiate one (val test = new Foo[Test]("test")) in the scala console, it gives the following error
java.lang.IllegalArgumentException: argument type mismatch
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at ParActor.Type$.newInstance(ParActor.scala:32)
at ParActor.Foo.<init>(ParActor.scala:37)
... 35 elided
I am not exactly sure what causes the problem or how to fix it. Other workarounds are also welcome.
You should turn
Type.newInstance[T1](init_args)
into
Type.newInstance[T1](init_args: _*)
What : _* does is turn a list or sequence into a varargs argument. A varargs parameter AnyRef* is actually an IndexedSeq[AnyRef], more specifically a WrappedArray[AnyRef]. So when you pass init_args as an argument to newInstance without telling the compiler to interpret it as a varargs argument you are actually passing in WrappedArray(WrappedArray("test")).
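A small sketch of the difference, using only the standard library. The names countArgs, forwardWrapped, and forwardExpanded are made up for illustration; the point is just that forwarding a varargs parameter without : _* passes the whole sequence as a single argument.

```scala
object VarargsDemo {
  // Returns how many separate arguments the varargs parameter received.
  def countArgs(args: AnyRef*): Int = args.length

  // Forwards the sequence as ONE argument (the bug from the question):
  // the Seq itself is an AnyRef, so it becomes the only element.
  def forwardWrapped(args: AnyRef*): Int = countArgs(args)

  // Expands the sequence back into individual arguments.
  def forwardExpanded(args: AnyRef*): Int = countArgs(args: _*)
}
```

Calling forwardWrapped("a", "b") makes countArgs see one argument (the wrapped sequence), while forwardExpanded("a", "b") makes it see two.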
I am getting strange behavior when calling a function outside of a closure:
when the function is in an object, everything works
when the function is in a class, I get:
Task not serializable: java.io.NotSerializableException: testing
The problem is that I need my code in a class and not an object. Any idea why this is happening? Is a Scala object serialized by default?
This is a working code example:
object working extends App {
val list = List(1,2,3)
val rddList = Spark.ctx.parallelize(list)
//calling function outside closure
val after = rddList.map(someFunc(_))
def someFunc(a:Int) = a+1
after.collect().map(println(_))
}
This is the non-working example :
object NOTworking extends App {
new testing().doIT
}
//adding extends Serializable won't help
class testing {
val list = List(1,2,3)
val rddList = Spark.ctx.parallelize(list)
def doIT = {
//again calling the function someFunc
val after = rddList.map(someFunc(_))
//this will crash (spark lazy)
after.collect().map(println(_))
}
def someFunc(a:Int) = a+1
}
RDDs extend the Serializable interface, so this is not what's causing your task to fail. That doesn't mean, however, that you can serialize an RDD with Spark and avoid a NotSerializableException.
Spark is a distributed computing engine and its main abstraction is a resilient distributed dataset (RDD), which can be viewed as a distributed collection. Basically, RDD's elements are partitioned across the nodes of the cluster, but Spark abstracts this away from the user, letting the user interact with the RDD (collection) as if it were a local one.
Not to get into too many details, but when you run different transformations on an RDD (map, flatMap, filter and others), your transformation code (closure) is:
serialized on the driver node,
shipped to the appropriate nodes in the cluster,
deserialized,
and finally executed on the nodes
You can of course run this locally (as in your example), but all those phases (apart from shipping over network) still occur. [This lets you catch any bugs even before deploying to production]
What happens in your second case is that you are calling a method, defined in class testing from inside the map function. Spark sees that and since methods cannot be serialized on their own, Spark tries to serialize the whole testing class, so that the code will still work when executed in another JVM. You have two possibilities:
Either you make class testing serializable, so the whole class can be serialized by Spark:
import org.apache.spark.{SparkContext,SparkConf}
object Spark {
val ctx = new SparkContext(new SparkConf().setAppName("test").setMaster("local[*]"))
}
object NOTworking extends App {
new Test().doIT
}
class Test extends java.io.Serializable {
val rddList = Spark.ctx.parallelize(List(1,2,3))
def doIT() = {
val after = rddList.map(someFunc)
after.collect().foreach(println)
}
def someFunc(a: Int) = a + 1
}
or you make someFunc a function instead of a method (functions are objects in Scala), so that Spark will be able to serialize it:
import org.apache.spark.{SparkContext,SparkConf}
object Spark {
val ctx = new SparkContext(new SparkConf().setAppName("test").setMaster("local[*]"))
}
object NOTworking extends App {
new Test().doIT
}
class Test {
val rddList = Spark.ctx.parallelize(List(1,2,3))
def doIT() = {
val after = rddList.map(someFunc)
after.collect().foreach(println)
}
val someFunc = (a: Int) => a + 1
}
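The difference between the two variants can be reproduced without Spark. The sketch below assumes Scala 2.12 or later and uses plain Java serialization as a stand-in for what happens to a closure before it is shipped; Holder, closureOverMethod, closureOverFunction, and isSerializable are hypothetical names for illustration only.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object ClosureCaptureDemo {
  // Deliberately NOT Serializable, like the `testing` class in the question.
  class Holder {
    // A method belongs to the instance.
    def addOneMethod(a: Int): Int = a + 1
    // A function value is a standalone object.
    val addOneFunction: Int => Int = (a: Int) => a + 1

    // The lambda calls a method on `this`, so it captures the whole Holder.
    def closureOverMethod: Int => Int = a => addOneMethod(a)

    // Hands out the function object itself; no Holder reference travels with it.
    def closureOverFunction: Int => Int = addOneFunction
  }

  // Attempts plain Java serialization and reports whether it succeeded.
  def isSerializable(obj: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(obj); true }
    catch { case _: NotSerializableException => false }
    finally out.close()
  }
}
```

Serializing closureOverMethod fails because the captured Holder is not serializable, while closureOverFunction serializes fine because nothing from the enclosing instance is captured.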
A similar, though not identical, problem with class serialization may also be of interest; it is covered in this Spark Summit 2013 presentation.
As a side note, you can rewrite rddList.map(someFunc(_)) to rddList.map(someFunc), they are exactly the same. Usually, the second is preferred as it's less verbose and cleaner to read.
EDIT (2015-03-15): SPARK-5307 introduced SerializationDebugger and Spark 1.3.0 is the first version to use it. It adds the serialization path to a NotSerializableException. When a NotSerializableException is encountered, the debugger visits the object graph to find the path towards the object that cannot be serialized, and constructs information to help the user find the object.
In OP's case, this is what gets printed to stdout:
Serialization stack:
- object not serializable (class: testing, value: testing@2dfe2f00)
- field (class: testing$$anonfun$1, name: $outer, type: class testing)
- object (class testing$$anonfun$1, <function1>)
Grega's answer is great in explaining why the original code does not work and two ways to fix the issue. However, this solution is not very flexible; consider the case where your closure includes a method call on a non-Serializable class that you have no control over. You can neither add the Serializable tag to this class nor change the underlying implementation to change the method into a function.
Nilesh presents a great workaround for this, but the solution can be made both more concise and general:
def genMapper[A, B](f: A => B): A => B = {
val locker = com.twitter.chill.MeatLocker(f)
x => locker.get.apply(x)
}
This function-serializer can then be used to automatically wrap closures and method calls:
rdd map genMapper(someFunc)
This technique also has the benefit of not requiring the additional Shark dependencies in order to access KryoSerializationWrapper, since Twitter's Chill is already pulled in by core Spark.
Complete talk fully explaining the problem, which proposes a great paradigm shifting way to avoid these serialization problems: https://github.com/samthebest/dump/blob/master/sams-scala-tutorial/serialization-exceptions-and-memory-leaks-no-ws.md
The top voted answer basically suggests throwing away an entire language feature, that is, no longer using methods and only using functions. Indeed, in functional programming methods in classes should be avoided, but turning them into functions doesn't solve the design issue here (see above link).
As a quick fix in this particular situation, you could just use the @transient annotation to tell it not to try to serialize the offending value (here, Spark.ctx is a custom class, not Spark's own, following OP's naming):
@transient
val rddList = Spark.ctx.parallelize(list)
You can also restructure code so that rddList lives somewhere else, but that is also nasty.
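As a rough illustration of what @transient does, here is a sketch using plain Java serialization rather than Spark. Connection, Job, and roundTrip are hypothetical names, and the sketch assumes the enclosing class is itself Serializable. Note that the transient field comes back as null after deserialization, so this only helps when the value is not needed on the other side.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object TransientDemo {
  // Hypothetical stand-in for a handle that cannot be serialized.
  class Connection { val name = "local" }

  class Job(val label: String) extends Serializable {
    // Skipped during serialization; it is null after deserialization.
    @transient val conn: Connection = new Connection
  }

  // Serializes and then deserializes a value with plain Java serialization.
  def roundTrip[T <: AnyRef](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

Without the annotation, writeObject would fail on the Connection field; with it, the Job survives the round trip but its conn field is no longer populated.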
The Future is Probably Spores
In the future, Scala will include "spores", which should give us fine-grained control over what does and does not get pulled in by a closure. Furthermore, this should turn all mistakes of accidentally pulling in non-serializable types (or any unwanted values) into compile errors, rather than the horrible runtime exceptions and memory leaks we get now.
http://docs.scala-lang.org/sips/pending/spores.html
A tip on Kryo serialization
When using Kryo, make registration mandatory; this means you get errors instead of memory leaks:
"Finally, I know that kryo has kryo.setRegistrationOptional(true) but I am having a very difficult time trying to figure out how to use it. When this option is turned on, kryo still seems to throw exceptions if I haven't registered classes."
Strategy for registering classes with kryo
Of course this only gives you type-level control not value-level control.
... more ideas to come.
I faced a similar issue, and what I understand from Grega's answer is:
object NOTworking extends App {
new testing().doIT
}
//adding extends Serializable won't help
class testing {
val list = List(1,2,3)
val rddList = Spark.ctx.parallelize(list)
def doIT = {
//again calling the function someFunc
val after = rddList.map(someFunc(_))
//this will crash (spark lazy)
after.collect().map(println(_))
}
def someFunc(a:Int) = a+1
}
Your doIT method tries to serialize the someFunc(_) method, but since methods are not serializable, Spark tries to serialize the class testing, which is also not serializable.
To make your code work, you should define someFunc inside the doIT method. For example:
def doIT = {
  def someFunc(a:Int) = {
    a+1
    //function definition
  }
  val after = rddList.map(someFunc(_))
  after.collect().map(println(_))
}
And if there are multiple functions coming into picture, then all those functions should be available to the parent context.
I solved this problem using a different approach. You simply need to serialize the objects before passing through the closure, and de-serialize afterwards. This approach just works, even if your classes aren't Serializable, because it uses Kryo behind the scenes. All you need is some curry. ;)
Here's an example of how I did it:
def genMapper(kryoWrapper: KryoSerializationWrapper[(Foo => Bar)])
(foo: Foo) : Bar = {
kryoWrapper.value.apply(foo)
}
val mapper = genMapper(KryoSerializationWrapper(new Blah(abc))) _
rdd.flatMap(mapper).collectAsMap()
class Blah(abc: ABC) extends (Foo => Bar) {
def apply(foo: Foo) : Bar = ??? // This is the real function
}
Feel free to make Blah as complicated as you want, class, companion object, nested classes, references to multiple 3rd party libs.
KryoSerializationWrapper refers to: https://github.com/amplab/shark/blob/master/src/main/scala/shark/execution/serialization/KryoSerializationWrapper.scala
I'm not entirely certain that this applies to Scala but, in Java, I solved the NotSerializableException by refactoring my code so that the closure did not access a non-serializable final field.
Scala methods defined in a class are not serializable on their own; converting them into functions can resolve the serialization issue.
Method syntax
def func_name (x: String) : String = {
...
return x
}
Function syntax
val func_name = { (x: String) =>
...
x
}
FYI in Spark 2.4 a lot of you will probably encounter this issue. Kryo serialization has gotten better but in many cases you cannot use spark.kryo.unsafe=true or the naive kryo serializer.
For a quick fix try changing the following in your Spark configuration
spark.kryo.unsafe="false"
OR
spark.serializer="org.apache.spark.serializer.JavaSerializer"
I modify custom RDD transformations that I encounter or personally write by using explicit broadcast variables and utilizing the new inbuilt twitter-chill api, converting them from rdd.map(row => to rdd.mapPartitions(partition => { functions.
Example
Old (not-great) Way
val sampleMap = Map("index1" -> 1234, "index2" -> 2345)
val outputRDD = rdd.map(row => {
val value = sampleMap.get(row._1)
value
})
Alternative (better) Way
import com.twitter.chill.MeatLocker
val sampleMap = Map("index1" -> 1234, "index2" -> 2345)
val brdSerSampleMap = spark.sparkContext.broadcast(MeatLocker(sampleMap))
rdd.mapPartitions(partition => {
val deSerSampleMap = brdSerSampleMap.value.get
partition.map(row => {
val value = deSerSampleMap.get(row._1)
value
}).toIterator
})
This new way reads the broadcast variable only once per partition, which is better. You will still need to use Java Serialization if you do not register classes.
I had a similar experience.
The error was triggered when I initialized a variable on the driver (master), but then tried to use it on one of the workers.
When that happens, Spark Streaming will try to serialize the object to send it over to the worker, and fail if the object is not serializable.
I solved the error by making the variable static.
Previous non-working code
private final PhoneNumberUtil phoneUtil = PhoneNumberUtil.getInstance();
Working code
private static final PhoneNumberUtil phoneUtil = PhoneNumberUtil.getInstance();
Credits:
https://learn.microsoft.com/en-us/answers/questions/35812/sparkexception-job-aborted-due-to-stage-failure-ta.html ( The answer of pradeepcheekatla-msft)
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html
def upper(name: String) : String = {
var uppper : String = name.toUpperCase()
uppper
}
val toUpperName = udf {(EmpName: String) => upper(EmpName)}
val emp_details = """[{"id": "1","name": "James Butt","country": "USA"},
{"id": "2", "name": "Josephine Darakjy","country": "USA"},
{"id": "3", "name": "Art Venere","country": "USA"},
{"id": "4", "name": "Lenna Paprocki","country": "USA"},
{"id": "5", "name": "Donette Foller","country": "USA"},
{"id": "6", "name": "Leota Dilliard","country": "USA"}]"""
val df_emp = spark.read.json(Seq(emp_details).toDS())
val df_name=df_emp.select($"id",$"name")
val df_upperName= df_name.withColumn("name",toUpperName($"name")).filter("id='5'")
display(df_upperName)
This will give the error:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
Solution -
import java.io.Serializable;
object obj_upper extends Serializable {
def upper(name: String) : String =
{
var uppper : String = name.toUpperCase()
uppper
}
val toUpperName = udf {(EmpName: String) => upper(EmpName)}
}
val df_upperName=
df_name.withColumn("name",obj_upper.toUpperName($"name")).filter("id='5'")
display(df_upperName)
My solution was to add a companion class that handles all the methods that are not serializable within the class.
Can someone help me to understand what is wrong with the code below? The problem is inside the "join" method - I am not able to set "state" field. Error message is -
No implicit view available from code.model.Membership.MembershipState.Val => _14.MembershipState.Value.
[error] create.member(user).group(group).state(MembershipState.Accepted).save
[error] ^
[error] one error found
[error] (compile:compile) Compilation failed
What does _14 mean? I tried a similar thing with MappedGender and it works as expected, so why does MappedEnum fail?
scala 2.10
lift 2.5
Thanks
package code
package model
import net.liftweb.mapper._
import net.liftweb.util._
import net.liftweb.common._
class Membership extends LongKeyedMapper[Membership] with IdPK {
def getSingleton = Membership
object MembershipState extends Enumeration {
val Requested = new Val(1, "Requested")
val Accepted = new Val(2, "Accepted")
val Denied = new Val(3, "Denied")
}
object state extends MappedEnum(this, MembershipState)
{
override def defaultValue = MembershipState.Requested
}
object member extends MappedLongForeignKey(this, User) {
override def dbIndexed_? = true
}
object group extends MappedLongForeignKey(this, Group) {
override def dbIndexed_? = true
}
}
object Membership extends Membership with LongKeyedMetaMapper[Membership] {
def join (user : User, group : Group) = {
create.member(user).group(group).state(MembershipState.Accepted).save
}
}
Try moving your MembershipState enum outside of the Membership class. I was getting the same error as you until I tried this. Not sure why, but the code compiled after I did that.
_14 means a compiler-generated intermediate anonymous value. In other words, the compiler doesn't know how to express the type it's looking for in a better way.
But if you look past that, you see the compiler is looking for a conversion from [...].Val to [...].Value. I would guess that changing
val Requested = new Val(1, "Requested")
to
val Requested = Value(1, "Requested")
would fix the error.
(I'm curious where you picked up the "new Val" style?)
What's strange is that Val actually extends Value. So if the outer type were known correctly (not inferred to the odd _14), Val vs. Value wouldn't be a problem. The issue here is that Lift for some reason defines the setters using the now-deprecated view bound syntax. Perhaps this causes the compiler, rather than going in a straight line and trying to fit the input type into the expected type, to start from both ends instead, defining the starting type and the required type, and then searching for an implicit view function that can reconcile the two.
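One possible reading of why moving the enumeration helps is path-dependent typing: an Enumeration declared inside a class gets a separate type per instance of that class. The answers above do not confirm that this is what Lift runs into, so treat the following as a sketch under that assumption, with hypothetical names (NestedOwner, SharedState, SharedOwner) and no Lift dependency.

```scala
// Enumeration nested in a class: every instance gets its own enum object,
// so a.State.Value and b.State.Value are different types.
class NestedOwner {
  object State extends Enumeration {
    val Requested, Accepted = Value
  }
  def accepts(s: State.Value): Boolean = s == State.Accepted
}

// Enumeration at top level: a single, stable type shared by everything.
object SharedState extends Enumeration {
  val Requested, Accepted = Value
}

class SharedOwner {
  def accepts(s: SharedState.Value): Boolean = s == SharedState.Accepted
}

object EnumPathDemo {
  val a = new NestedOwner
  val b = new NestedOwner
  // a.accepts(b.State.Accepted)   // does not compile: b.State.Value is not a.State.Value
  val nestedOk: Boolean = a.accepts(a.State.Accepted)
  val sharedOk: Boolean = new SharedOwner().accepts(SharedState.Accepted)
}
```

With the top-level enumeration there is only one SharedState.Value type, so the compiler never has to invent a name for the owning instance.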
I've got the above odd error message that I don't understand "value Parsers is not a member of package scala.util.parsing.combinator".
I'm trying to learn Parser combinators by writing a C parser step by step. I started at token, so I have the classes:
import util.parsing.combinator.JavaTokenParsers
object CeeParser extends JavaTokenParsers {
def token: Parser[CeeExpr] = ident ^^ (x => Token(x))
}
abstract class CeeExpr
case class Token(name: String) extends CeeExpr
This is as simple as I could make it.
The code below works fine, but if I uncomment the commented line I get the error message given above:
object Play {
def main(args: Array[String]) {
//val parser: _root_.scala.util.parsing.combinator.Parsers.Parser[CeeExpr] = CeeParser.token
val x = CeeParser.token
print(x)
}
}
In case it is a problem with my setup, I'm using scala 2.7.6 via the scala-plugin for intellij. Can anyone shed any light on this? The message is wrong, Parsers is a member of scala.util.parsing.combinator.
--- Follow-up
This person http://www.scala-lang.org/node/5475 seems to have the same problem, but I don't understand the answer he was given. Can anyone explain it?
The problem is that Parser is an inner class of Parsers, so the proper way to refer to it is through an instance of Parsers. That is, CeeParser.Parser is different from any other x.Parser.
The correct way to refer to the type of CeeParser.token is CeeParser.Parser.
The issue is that Parsers is not a package or object; it is a trait, so its members can't be imported directly. You need to refer to them through the specific object extending the trait.
In this case the specific object is CeeParser, so the type of the val should be CeeParser.Parser[CeeExpr]:
val parser : CeeParser.Parser[CeeExpr] = CeeParser.token
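The same situation can be reproduced in miniature without the parser combinator library. In the sketch below, MiniParsers, MiniParser, and MiniCeeParser are hypothetical stand-ins for Parsers, Parser, and CeeParser; it only shows how a class nested in a trait is named through an instance (or through a type projection), not how the real library is implemented.

```scala
// Hypothetical miniature of the Parsers/Parser relationship; not the real library.
trait MiniParsers {
  // Inner class: its type is tied to the enclosing instance.
  class MiniParser[T](val label: String)
  def token: MiniParser[String] = new MiniParser[String]("token")
}

object MiniCeeParser extends MiniParsers

object PathDependentDemo {
  // Path-dependent type: named through the object that owns it.
  val viaInstance: MiniCeeParser.MiniParser[String] = MiniCeeParser.token

  // Type projection: accepts a MiniParser belonging to any MiniParsers instance.
  val viaProjection: MiniParsers#MiniParser[String] = MiniCeeParser.token

  // val wrong: MiniParsers.MiniParser[String] = MiniCeeParser.token
  // does not compile: there is no value named MiniParsers to select MiniParser from
}
```

The commented-out line is the analogue of the failing declaration in the question: the trait name alone is not a value, so nothing can be selected from it with a dot.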