I have the following RDD:
myRDD:org.apache.spark.rdd.RDD[(String, org.apache.spark.mllib.linalg.Vector)]
Then I want to add a fixed key:
myRdd.map(("myFixedKey",_)):org.apache.spark.rdd.RDD[(String, (String, org.apache.spark.mllib.linalg.Vector))]
But if I use a constant String val instead of a hardcoded/explicit string:
val myFixedKeyVal:String = "myFixedKey"
myRdd.map((myFixedKeyVal,_))
This previous line code gives the following exception:
org.apache.spark.SparkException: Task not serializable
Am I missing something?
Solution:
Ok I found the problem, myRdd is an object that extends a Serializable class, but after process this RDD by another class, e.g. Process:
class Process(someRdd:MyRddClass) extends Serializable{
def preprocess = someRdd.map(_)
}
val someprocess = Process(myRdd)
val newRdd = someprocess.preprocess
newRdd.map(x=>("newkey",x)
This class Process must extend Serializable too in order to work. I thought that the newRdd was extending the root class MyRddClass...
The string constant is not the problem. Turn on serialization debugging with -Dsun.io.serialization.extendedDebugInfo=true to figure out the real cause.
Related
I customized a toy estimator "SimpleIndexer" by following Holden Karau's tutorial at https://www.oreilly.com/learning/extend-spark-ml-for-your-own-modeltransformer-types. The problem is it error out when using it in "CrossValidator".
The error is
Exception in thread "main" java.lang.NoSuchMethodException: ....SimpleIndexerModel.<init>(java.lang.String)
at java.lang.Class.getConstructor0(Class.java:3082)
at java.lang.Class.getConstructor(Class.java:1825)
at org.apache.spark.ml.param.Params$class.defaultCopy(params.scala:846)
at org.apache.spark.ml.PipelineStage.defaultCopy(Pipeline.scala:42)
at com.nextperf.feature.SimpleIndexerModel.copy(SimpleIndexer.scala:63)
There is a similar questions asked before - java.lang.NoSuchMethodException: <Class>.<init>(java.lang.String) when copying custom Transformer. Apparently the issue come from the "copy" method. But I tried the solution mentioned in the post, and it does not work.
"SimpleIndexerModel" extends the DefaultParamsWritable trait
Add a Companion object that extends the DefaultParamsReadable Interface
class SimpleIndexerModel(override val uid: String, words: Array[String])
extends Model[SimpleIndexerModel] with SimpleIndexerParams with DefaultParamsWritable{
...
...
}
object SimpleIndexerModel extends DefaultParamsReadable[SimpleIndexerModel]
The spark official implementation of this toy example is "StringIndexer". I cannot find a clue. Does anyone know why it happens, and how to fix the problem?
//"StringIndexerModel" works fine
val indexer1 = new StringIndexerModel("abc",Array("a"))
val m1 = indexer1.copy(new ParamMap())
//
//"SimpleIndexerModel" fails
val indexer2 = new SimpleIndexerModel("abc",Array("a"))
// This call throws the exception.
val m2 = indexer2.copy(new ParamMap())
See the implementation of Params.defaultCopy: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/param/params.scala#L845
This method requires a constructor with only one String parameter(uid). So you can resolve your problem by adding a constructor to your SimpleIndexerModel class.
def this(uid: String) = {...}
I am working a on Flink project and would like to parse the source JSON string data to Json Object. I am using jackson-module-scala for the JSON parsing. However, I encountered some issues with using the JSON parser within Flink APIs (map for example).
Here are some examples of the code, and I cannot understand the reason under the hood why it is behaving like this.
Situation 1:
In this case, I am doing what the jackson-module-scala's official exmaple code told me to do:
create a new ObjectMapper
register the DefaultScalaModule
DefaultScalaModule is a Scala object that includes support for all currently supported Scala data types.
call the readValue in order to parse the JSON to Map
The error I got is: org.apache.flink.api.common.InvalidProgramException:Task not serializable.
object JsonProcessing {
def main(args: Array[String]) {
// set up the execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
// get input data
val text = env.readTextFile("xxx")
val mapper = new ObjectMapper
mapper.registerModule(DefaultScalaModule)
val counts = text.map(mapper.readValue(_, classOf[Map[String, String]]))
// execute and print result
counts.print()
env.execute("JsonProcessing")
}
}
Situation 2:
Then I did some Google, and came up with the following solution, where registerModule is moved into the map function.
val mapper = new ObjectMapper
val counts = text.map(l => {
mapper.registerModule(DefaultScalaModule)
mapper.readValue(l, classOf[Map[String, String]])
})
However, what I am not able to understand is: why this is going to work, with calling method of outside-defined object mapper? Is it because the ObjectMapper itself is Serializable as stated here ObjectMapper.java#L114?
Now, the JSON parsing is working fine, but every time, I have to call mapper.registerModule(DefaultScalaModule) which I think could cause some performance issue (Does it?). I also tried another solution as follows.
Situation 3:
I created a new case class Jsen, and use it as the corresponding parsing class, registering the Scala modules. And it is also working fine.
However, this is not so flexible if your input JSON is varying very often. It is not maintainable to manage the class Jsen.
case class Jsen(
#JsonProperty("a") a: String,
#JsonProperty("c") c: String,
#JsonProperty("e") e: String
)
object JsonProcessing {
def main(args: Array[String]) {
...
val mapper = new ObjectMapper
val counts = text.map(mapper.readValue(_, classOf[Jsen]))
...
}
Additionally, I also tried using JsonNode without calling registerModule as follows:
...
val mapper = new ObjectMapper
val counts = text.map(mapper.readValue(_, classOf[JsonNode]))
...
It is working fine as well.
My main question is: what is actually causing the problem of Task not serializable under the hood of registerModule(DefaultScalaModule)?
How to identify whether your code could potentially cause this unserializable problem during coding?
The thing is that Apache Flink is designed to be distributed. It means that it needs to be able to run your code remotely. So it means that all your processing functions should be serializable. In the current implementation this is ensure early on when you build your streaming process even if you will not run this in any distributed mode. This is a trade-off with an obvious benefit of providing you feedback down to the very line that breaks this contract (via exception stack trace).
So when you write
val counts = text.map(mapper.readValue(_, classOf[Map[String, String]]))
what you actually write is something like
val counts = text.map(new Function1[String, Map[String, String]] {
val capturedMapper = mapper
override def apply(param: String) = capturedMapper.readValue(param, classOf[Map[String, String]])
})
The important thing here is that you capture the mapper from the outside context and store it as a part of your Function1 object that has to be serializble. And this means that the mapper has to be serializable. The designers of Jackson library recognized that kind of a need and since there is nothing fundamentally non-serizliable in a mapper they made their ObjectMapper and the default Modules serializable. Unfortunately for you the designers of Scala Jackson Module missed that and made their DefaultScalaModule deeply non-serialiazable by making ScalaTypeModifier and all sub-classes non-serializable. This is why your second code works while the first one doesn't: "raw" ObjectMapper is serializable while ObjectMapper with pre-registered DefaultScalaModule is not.
There are a few possible workarounds. Probably the easiest one is to wrap ObjectMapper
object MapperWrapper extends java.io.Serializable {
// this lazy is the important trick here
// #transient adds some safety in current Scala (see also Update section)
#transient lazy val mapper = {
val mapper = new ObjectMapper
mapper.registerModule(DefaultScalaModule)
mapper
}
def readValue[T](content: String, valueType: Class[T]): T = mapper.readValue(content, valueType)
}
and then use it as
val counts = text.map(MapperWrapper.readValue(_, classOf[Map[String, String]]))
This lazy trick works because although an instance of DefaultScalaModule is not serializable, the function to create an instance of DefaultScalaModule is.
Update: what about #transient?
what are the differences here, if I add lazy val vs. #transient lazy val?
This is actually a tricky question. What the lazy val is compiled to is actually something like this:
object MapperWrapper extends java.io.Serializable {
// #transient is set or not set for both fields depending on its presence at "lazy val"
[#transient] private var mapperValue: ObjectMapper = null
[#transient] #volatile private var mapperInitialized = false
def mapper: ObjectMapper = {
if (!mapperInitialized) {
this.synchronized {
val mapper = new ObjectMapper
mapper.registerModule(DefaultScalaModule)
mapperValue = mapper
mapperInitialized = true
}
}
mapperValue
}
def readValue[T](content: String, valueType: Class[T]): T = mapper.readValue(content, valueType)
}
where #transient on the lazy val affects both backing fields. So now you can see why lazy val trick works:
locally it works because it delays initialization of the mapperValue field until first access to the mapper method so the field is safe null when the serialization check is performed
remotely it works because MapperWrapper is fully serializable and the logic of how lazy val should be initialized is put into a method of the same class (see def mapper).
Note however that AFAIK this behavior of how lazy val is compiled is an implementation detail of the current Scala compiler rather than a part of the Scala specification. If at some later point a class similar to .Net Lazy will be added to the Java standard library, Scala compiler potentially might start generating different code. This is important because it provides a kind of trade-off for #transient. The benefit of adding #transient now is that it ensures that code like this works as well:
val someJson:String = "..."
val something:Something = MapperWrapper.readValue(someJson:String, ...)
val counts = text.map(MapperWrapper.readValue(_, classOf[Map[String, String]]))
Without #transient the code above will fail because we forced initialization of the lazy backing field and now it contains a non-serializable value. With #transient this is not an issue as that field will not be serialized at all.
A potential drawback of #transient is that if Scala changes the way code for lazy val is generated and the field is marked as #transient, it might actually be not de-serialized in the remote-work scenario.
Also there is a trick with object because for objects Scala compiler generates custom de-serialization logic (overrides readResolve) to return the same singleton object. It means that the object including the lazy val is not really de-serialized and the value from the object itself is used. It means that #transient lazy val inside object is much more future-proof than inside class in remote scenario.
I am trying to use join method between 2 RDD and save it to cassandra but my code don't work . at the begining , i get a huge Main method and everything working well , but when i use function and class this don't work . i am new to scala and spark
code is :
class Migration extends Serializable {
case class userId(offerFamily: String, bp: String, pdl: String) extends Serializable
case class siteExternalId(site_external_id: Option[String]) extends Serializable
case class profileData(begin_ts: Option[Long], Source: Option[String]) extends Serializable
def SparkMigrationProfile(sc: SparkContext) = {
val test = sc.cassandraTable[siteExternalId](KEYSPACE,TABLE)
.keyBy[userId]
.filter(x => x._2.site_external_id != None)
val profileRDD = sc.cassandraTable[profileData](KEYSPACE,TABLE)
.keyBy[userId]
//dont work
test.join(profileRDD)
.foreach(println)
// don't work
test.join(profileRDD)
.saveToCassandra(keyspace, table)
}
At the beginig i get the famous : Exception in thread "main" org.apache.spark.SparkException: Task not serializable at . . .
so i extends my main class and also the case class but stil don't work .
I think you should move your case classes from Migration class to dedicated file and/or object. This should solve your problem. Additionally, Scala case classes are serializable by default.
This is my code:
class FNNode(val name: String)
case class Ingredient(override val name: String, category: String) extends FNNode(name)
val ingredients: RDD[(VertexId, FNNode)] =
sc.textFile(PATH+"ingr_info.tsv").
filter(! _.startsWith("#")).
map(line => line.split('\t')).
map(x => (x(0).toInt ,Ingredient(x(1), x(2))))
and there are no errors when I define these variables. However, when trying to execute it:
ingredients.take(1)
I get
org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: java.io.InvalidClassException: $iwC$$iwC$Ingredient; no valid constructor
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
It seems this could be related to Serialization issues as per the answer here . However, I have no idea of how to solve this if it is indeed a Serialization issue.
I'm following along the code in this book by they way, so I would assume this should have at least worked at some point?
This fixed your issue for me:
class FNNode(val name: String) extends Serializable
I am consistently having errors regarding Task not Serializable.
I have made a small Class and it extends Serializable - which is what I believe is meant to be the case when you need values in it to be serialised.
class SGD(filePath : String) extends Serializable {
val rdd = sc.textFile(filePath)
val mappedRDD = rdd.map(x => x.split(" ")
.slice(0,3))
.map(y => Rating(y(0).toInt, y(1).toInt, y(2).toDouble))
.cache
val RNG = new Random(1)
val factorsRDD = mappedRDD(x => (x.user, (x.product, x.rating)))
.groupByKey
.mapValues(listOfItemsAndRatings =>
Vector(Array.fill(2){RNG.nextDouble}))
}
The final line always results in a Task not Serializable error. What I do not understand is: the Class is Serializable; and, the Class Random is also Serializable according to the API. So, what am I doing wrong? I consistently can't get stuff like this to work; therefore, I imagine my understanding is wrong. I keep being told the Class must be Serializable... well it is and it still doesn't work!?
scala.util.Random was not Serializable until 2.11.0-M2.
Most likely you are using an earlier version of Scala.
A class doesn't become Serializable until all its members are Serializable as well (or some other mechanism is provided to serialize them, e.g. transient or readObject/writeObject.)
I get the following stacktrace when running given example in spark-1.3:
Caused by: java.io.NotSerializableException: scala.util.Random
Serialization stack:
- object not serializable (class: scala.util.Random, value: scala.util.Random#52bbf03d)
- field (class: $iwC$$iwC$SGD, name: RNG, type: class scala.util.Random)
One way to fix it is to take instatiation of random variable within mapValues:
mapValues(listOfItemsAndRatings => { val RNG = new Random(1)
Vector(Array.fill(2)(RNG.nextDouble)) })