This is my code:
class FNNode(val name: String)
case class Ingredient(override val name: String, category: String) extends FNNode(name)
val ingredients: RDD[(VertexId, FNNode)] =
  sc.textFile(PATH + "ingr_info.tsv").
    filter(!_.startsWith("#")).
    map(line => line.split('\t')).
    map(x => (x(0).toInt, Ingredient(x(1), x(2))))
and there are no errors when I define these variables. However, when trying to execute it:
ingredients.take(1)
I get
org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: java.io.InvalidClassException: $iwC$$iwC$Ingredient; no valid constructor
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
It seems this could be related to serialization issues, as per the answer here. However, I have no idea how to solve this if it is indeed a serialization issue.
I'm following along with the code in this book, by the way, so I would assume this should have worked at some point?
This fixed your issue for me:
class FNNode(val name: String) extends Serializable
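The "no valid constructor" part of the exception points at the cause: Ingredient is a case class and therefore already Serializable, but its superclass FNNode was not. When Java deserialization rebuilds an object, the first non-serializable superclass in the hierarchy must have an accessible no-argument constructor, and FNNode(val name: String) has none. Making FNNode itself Serializable, as above, removes that requirement.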
I am trying to use the join method between 2 RDDs and save the result to Cassandra, but my code doesn't work. At the beginning I had one huge main method and everything worked well, but when I moved the code into a function and a class it stopped working. I am new to Scala and Spark.
The code is:
class Migration extends Serializable {

  case class userId(offerFamily: String, bp: String, pdl: String) extends Serializable
  case class siteExternalId(site_external_id: Option[String]) extends Serializable
  case class profileData(begin_ts: Option[Long], Source: Option[String]) extends Serializable

  def SparkMigrationProfile(sc: SparkContext) = {
    val test = sc.cassandraTable[siteExternalId](KEYSPACE, TABLE)
      .keyBy[userId]
      .filter(x => x._2.site_external_id != None)

    val profileRDD = sc.cassandraTable[profileData](KEYSPACE, TABLE)
      .keyBy[userId]

    // doesn't work
    test.join(profileRDD)
      .foreach(println)

    // doesn't work
    test.join(profileRDD)
      .saveToCassandra(keyspace, table)
  }
}
At the beginning I get the famous: Exception in thread "main" org.apache.spark.SparkException: Task not serializable at . . .
So I made my main class extend Serializable, and also the case classes, but it still doesn't work.
I think you should move your case classes from the Migration class to a dedicated file and/or object. This should solve your problem. Additionally, Scala case classes are serializable by default.
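A minimal sketch of that refactoring, reusing the names from the question (MigrationModels is just an illustrative name, and the method body stays as it was):

// The case classes now live in a top-level object instead of inside
// Migration, so their instances no longer carry an $outer reference to a
// Migration object when Spark serializes the closures.
object MigrationModels {
  case class userId(offerFamily: String, bp: String, pdl: String)
  case class siteExternalId(site_external_id: Option[String])
  case class profileData(begin_ts: Option[Long], Source: Option[String])
}

class Migration extends Serializable {
  import MigrationModels._

  def SparkMigrationProfile(sc: SparkContext) = {
    // ... same cassandraTable / join / saveToCassandra calls as above ...
  }
}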
I am trying to read an Avro file into a case class UserItemIds that contains another case class, User, using sbt and Scala 2.11.
case class User(id: Long, description: String)
case class UserItemIds(user: User, itemIds: List[Long])
val UserItemIdsInputStream = env.createInput(new AvroInputFormat[UserItemIds](user_item_ids_In, classOf[UserItemIds]))
UserItemIdsInputStream.print()
but I receive the following error:
Caused by: java.lang.NoSuchMethodException: schema.User.<init>()
Can anyone guide me on how to work with these types, please? This example uses Avro files, but it could be Parquet or any custom DB input.
Do I need to use TypeInformation? For example something like this, and if yes, how do I do so?
val tupleInfo: TypeInformation[(User, List[Long])] = createTypeInformation[(User, List[Long])]
I also saw env.registerType(); does it relate to the issue at all? Any help is greatly appreciated.
I read that the solution to this Java error is to add a default constructor, so I tried to do that by adding a factory method to the Scala case class via its companion object:
object UserItemIds {
  case class UserItemIds(user: User, itemIds: List[Long])

  def apply(user: User, itemIds: List[Long]) = new UserItemIds(user, itemIds)
}
but this has not resolved the issue.
You have to add a default constructor for the User and UserItemIds types. This could look the following way:
case class User(id: Long, description: String) {
  def this() = this(0L, "")
}

case class UserItemIds(user: User, itemIds: List[Long]) {
  def this() = this(new User(), List())
}
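This works because, as the NoSuchMethodException: schema.User.<init>() suggests, the Avro input format instantiates the target classes reflectively through a no-argument constructor and only then fills in the fields. The auxiliary this() constructors above give it something to call, while the primary constructors and existing call sites stay unchanged.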
I am consistently having errors regarding Task not Serializable.
I have made a small class and it extends Serializable, which is what I believe is meant to be the case when you need values in it to be serialised.
import scala.util.Random
import org.apache.spark.mllib.recommendation.Rating

class SGD(filePath: String) extends Serializable {

  val rdd = sc.textFile(filePath)

  val mappedRDD = rdd.map(x => x.split(" ").slice(0, 3))
    .map(y => Rating(y(0).toInt, y(1).toInt, y(2).toDouble))
    .cache

  val RNG = new Random(1)

  val factorsRDD = mappedRDD.map(x => (x.user, (x.product, x.rating)))
    .groupByKey
    .mapValues(listOfItemsAndRatings =>
      Vector(Array.fill(2){RNG.nextDouble}))
}
The final line always results in a Task not Serializable error. What I do not understand is: the class is Serializable, and the class Random is also Serializable according to the API. So, what am I doing wrong? I consistently can't get stuff like this to work; therefore, I imagine my understanding is wrong. I keep being told the class must be Serializable... well, it is, and it still doesn't work!?
scala.util.Random was not Serializable until 2.11.0-M2.
Most likely you are using an earlier version of Scala.
A class doesn't become Serializable until all its members are Serializable as well (or some other mechanism is provided to serialize them, e.g. transient or readObject/writeObject).
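For completeness, a minimal sketch of the transient route mentioned above (whether it is sufficient also depends on the remaining members of the class being serializable):

class SGD(filePath: String) extends Serializable {
  // Skipped during serialization and lazily re-created on each executor
  // after deserialization, so the Random itself never has to be serialized.
  @transient lazy val RNG = new scala.util.Random(1)
  // ... rest of the class as in the question ...
}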
I get the following stack trace when running the given example in Spark 1.3:
Caused by: java.io.NotSerializableException: scala.util.Random
Serialization stack:
- object not serializable (class: scala.util.Random, value: scala.util.Random@52bbf03d)
- field (class: $iwC$$iwC$SGD, name: RNG, type: class scala.util.Random)
One way to fix it is to move the instantiation of the random variable inside mapValues:
mapValues(listOfItemsAndRatings => {
  val RNG = new Random(1)
  Vector(Array.fill(2)(RNG.nextDouble))
})
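This works because the Random is now created inside the task on the executor, so the closure no longer captures the non-serializable RNG field of the SGD instance. Note that creating new Random(1) per record means every key gets the same pair of numbers; use a different seeding strategy if that matters.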
I have the following RDD:
myRDD:org.apache.spark.rdd.RDD[(String, org.apache.spark.mllib.linalg.Vector)]
Then I want to add a fixed key:
myRdd.map(("myFixedKey",_)):org.apache.spark.rdd.RDD[(String, (String, org.apache.spark.mllib.linalg.Vector))]
But if I use a constant String val instead of a hardcoded/explicit string:
val myFixedKeyVal:String = "myFixedKey"
myRdd.map((myFixedKeyVal,_))
This previous line of code gives the following exception:
org.apache.spark.SparkException: Task not serializable
Am I missing something?
Solution:
OK, I found the problem: myRdd is an object of a class that extends a Serializable class, but this RDD is then processed by another class, e.g. Process:
class Process(someRdd: MyRddClass) extends Serializable {
  def preprocess = someRdd.map(x => x) // placeholder transformation
}

val someprocess = new Process(myRdd)
val newRdd = someprocess.preprocess
newRdd.map(x => ("newkey", x))
This class Process must extend Serializable too in order to work. I thought that newRdd was still an instance of the root class MyRddClass...
The string constant is not the problem. Turn on serialization debugging with -Dsun.io.serialization.extendedDebugInfo=true to figure out the real cause.
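For example, a sketch assuming a plain SparkConf setup (in client mode the driver flag has to be passed on the launching JVM's command line instead, since the driver is already running by the time the conf is built):

val conf = new org.apache.spark.SparkConf()
  .set("spark.driver.extraJavaOptions", "-Dsun.io.serialization.extendedDebugInfo=true")
  .set("spark.executor.extraJavaOptions", "-Dsun.io.serialization.extendedDebugInfo=true")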
I have some Spark code and I use Kryo serialization. When no server fails everything runs fine, but when a server fails, I come across some big issues as it tries to recover itself. Basically the error message says that my Article class becomes unknown to a server.
Job aborted due to stage failure: Task 29 in stage 4.0 failed 4 times, most recent failure: Lost task 29.3 in stage 4.0 (TID 316, DATANODE-3): com.esotericsoftware.kryo.KryoException: Unable to find class: $line50.$read$$iwC$$iwC$Article
com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138)
com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115)
com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721)
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:235)
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
I really have a lot of trouble understanding what I am doing wrong...
I declare these classes outside of my maps:
case class Contrib(contribType: Option[String], surname: Option[String], givenNames: Option[String],
                   phone: Option[String], email: Option[String], fax: Option[String])

// Class to hold references
case class Reference(idRef: Option[String], articleNameRef: Option[String],
                     pmIDFrom: Option[Long], pmIDRef: Option[Long])

// Class to hold articles
case class Article(articleName: String, articleAbstract: Option[String],
                   pmID: Option[Long], doi: Option[String],
                   references: Iterator[Reference],
                   contribs: Iterator[Contrib],
                   keywords: List[String])
It seems that some executors just don't know what an Article is anymore...
How can I resolve that issue?
Thanks,
Stephane