NullPointerException when using Word2VecModel with UserDefinedFunction - scala

I am trying to pass a word2vec model object to my spark udf. Basically I have a test set with movie Ids and I want to pass the ids along with the model object to get an array of recommended movies for each row.
def udfGetSynonyms(model: org.apache.spark.ml.feature.Word2VecModel) =
udf((col : String) => {
model.findSynonymsArray("20", 1)
})
however this gives me a null pointer exception. When I run model.findSynonymsArray("20", 1) outside the udf I get the expected answer. For some reason it doesn't understand something about the function within the udf but can run it outside the udf.
Note: I added "20" here just to get a fixed answer to see if that would work. It does the same when I replace "20" with col.
Thanks for the help!
StackTrace:
SparkException: Job aborted due to stage failure: Task 0 in stage 23127.0 failed 4 times, most recent failure: Lost task 0.3 in stage 23127.0 (TID 4646648, 10.56.243.178, executor 149): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$udfGetSynonyms1$1: (string) => array<struct<_1:string,_2:double>>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:49)
at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:126)
at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:125)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:111)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:350)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
at org.apache.spark.ml.feature.Word2VecModel.findSynonymsArray(Word2Vec.scala:273)
at linebb57ebe901e04c40a4fba9fb7416f724554.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$udfGetSynonyms1$1.apply(command-232354:7)
at linebb57ebe901e04c40a4fba9fb7416f724554.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$udfGetSynonyms1$1.apply(command-232354:4)
... 12 more

The SQL and udf API is a bit limited and I am not sure if there is a way to use custom types as columns or as inputs to udfs. A bit of googling didn't turn up anything too useful.
Instead, you can use the DataSet or RDD API and just use a regular Scala function instead of a udf, something like:
val model: Word2VecModel = ...
val inputs: DataSet[String] = ...
inputs.map(movieId => model.findSynonymsArray(movieId, 10))
Alternatively, I guess you could serialize the model to and from a string, but that seems much uglier.

I think this issue happens because wordVectors is a transient variable
class Word2VecModel private[ml] (
#Since("1.4.0") override val uid: String,
#transient private val wordVectors: feature.Word2VecModel)
extends Model[Word2VecModel] with Word2VecBase with MLWritable {
I have solved this by broadcasting w2vModel.getVectors and re-creating the Word2VecModel model inside each partition

Related

Error comes when using User Defined Case Class in Scala

The result that I am trying to find out is that, I need to find the distinct repos accessed by a particular user, using the inverted index based method in Scala. I have made the required Case Class to extract the data from the input file, and hence made the required RDD and joined on the basis of the required user, but whenver after finding all the repos accessed. When I try to use the distinct and then count function, then I get some error, which I will paste below.
The code is as follows:
import org.apache.spark.rdd.RDD
import java.text.SimpleDateFormat
import java.util.Date
case class LogLine(debug_level: String, timestamp: Date, download_id: String,retrieval_stage: String, rest: String);
val regex = """([^\s]+), ([^\s]+)\+00:00, ([^\s]+) -- ([^\s]+): (.*$)""".r
val rdd = sc.
textFile("file.txt").
flatMap ( x => x match {
case regex(debug_level,dateTime,downloadId,retrievalStage,rest) =>
val df = new SimpleDateFormat("yyyy-MM-dd:HH:mm:ss")
new Some(LogLine(debug_level, df.parse(dateTime.replace("T", ":")), downloadId, retrievalStage, rest))
case _ => None;
})
rdd.cache()
val inverted2 = rdd.groupBy(element => element.download_id).cache
val list2 = List[String]("ghtorrent-22")
val user = sc.parallelize(list2).keyBy(x => x)
val poss = inverted2.join(user).flatMap{x => x._2._1}.map(line => line.rest)
After this, whenever, I try to run, the next part to evaluate the result, I get the error.
The code and error are as follows:
Code:
poss.distinct.count
Error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 8.0 failed 1 times, most recent failure: Lost task 5.0 in stage 8.0 (TID 59) (LAPTOP-Q81MN7NI executor driver): java.lang.OutOfMemoryError: Java heap space
at java.base/java.io.ObjectInputStream$HandleTable.grow(ObjectInputStream.java:4055)
at java.base/java.io.ObjectInputStream$HandleTable.assign(ObjectInputStream.java:3861)
at java.base/java.io.ObjectInputStream.readString(ObjectInputStream.java:2043)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1661)
at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
at java.base/java.io.ObjectInputStream.readArray(ObjectInputStream.java:2102)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:493)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:451)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:158)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.readNextItem(ExternalAppendOnlyMap.scala:514)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.hasNext(ExternalAppendOnlyMap.scala:538)
at scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:1087)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.readNextHashCode(ExternalAppendOnlyMap.scala:335)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.$anonfun$new$1(ExternalAppendOnlyMap.scala:315)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.$anonfun$new$1$adapted(ExternalAppendOnlyMap.scala:313)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$Lambda$3514/0x0000000101065840.apply(Unknown Source)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.<init>(ExternalAppendOnlyMap.scala:313)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.iterator(ExternalAppendOnlyMap.scala:287)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:43)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:118)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2261)
at org.apache.spark.rdd.RDD.count(RDD.scala:1253)
... 37 elided
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.base/java.io.ObjectInputStream$HandleTable.grow(ObjectInputStream.java:4055)
at java.base/java.io.ObjectInputStream$HandleTable.assign(ObjectInputStream.java:3861)
at java.base/java.io.ObjectInputStream.readString(ObjectInputStream.java:2043)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1661)
at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
at java.base/java.io.ObjectInputStream.readArray(ObjectInputStream.java:2102)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:493)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:451)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:158)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.readNextItem(ExternalAppendOnlyMap.scala:514)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.hasNext(ExternalAppendOnlyMap.scala:538)
at scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:1087)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.readNextHashCode(ExternalAppendOnlyMap.scala:335)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.$anonfun$new$1(ExternalAppendOnlyMap.scala:315)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.$anonfun$new$1$adapted(ExternalAppendOnlyMap.scala:313)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$Lambda$3514/0x0000000101065840.apply(Unknown Source)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.<init>(ExternalAppendOnlyMap.scala:313)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.iterator(ExternalAppendOnlyMap.scala:287)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:43)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:118)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)

Spark UDF now throws ArrayIndexOutOfBoundsException

I wrote a UDF in Spark (3.0.0) to do an MD5 hash of columns that looks like this:
def Md5Hash(text: String): String = {
java.security.MessageDigest.getInstance("MD5")
.digest(text.getBytes())
.map(0xFF & _)
.map("%02x".format(_))
.foldLeft("") { _ + _ }
}
val md5Hash: UserDefinedFunction = udf(Md5Hash(_))
This function has worked fine for me for months, but it is now failing at runtime:
org.apache.spark.SparkException: Failed to execute user defined function(UDFs$$$Lambda$3876/1265187815: (string) => string)
....
Caused by: java.lang.ArrayIndexOutOfBoundsException
at sun.security.provider.DigestBase.engineUpdate(DigestBase.java:116)
at sun.security.provider.MD5.implDigest(MD5.java:109)
at sun.security.provider.DigestBase.engineDigest(DigestBase.java:207)
at sun.security.provider.DigestBase.engineDigest(DigestBase.java:186)
at java.security.MessageDigest$Delegate.engineDigest(MessageDigest.java:592)
at java.security.MessageDigest.digest(MessageDigest.java:365)
at java.security.MessageDigest.digest(MessageDigest.java:411)
It still works on some small datasets, but I have another larger dataset (10Ms of rows, so not terribly huge) that fails here. I couldn't find any indication that the data I'm trying to hash are bizarre in any way -- all input values are non-null, ASCII strings. What might cause this error when it previously worked fine? I'm running in AWS EMR 6.1.0 with Spark 3.0.0.

Can't access broadcast variable in transformation

I'm having problems accessing a variable from inside a transformation function. Could someone help me out?
Here are my relevant classes and functions.
#SerialVersionUID(889949215L)
object MyCache extends Serializable {
#transient lazy val logger = Logger(getClass.getName)
#volatile var cache: Broadcast[Map[UUID, Definition]] = null
def getInstance(sparkContext: SparkContext) : Broadcast[Map[UUID, Definition]] = {
if (cache == null) {
synchronized {
val map = sparkContext.cassandraTable("keyspace", "table")
.collect()
.map(m => m.getUUID("id") ->
Definition(m.getString("c1"), m.getString("c2"), m.getString("c3"),
m.getString("c4"))).toMap
cache = sparkContext.broadcast(map)
}
}
cache
}
}
In a different file:
object Processor extends Serializable {
#transient lazy val logger = Logger(getClass.getName)
def processData[T: ClassTag](rawStream: DStream[(String, String)], ssc: StreamingContext,
processor: (String, Broadcast[Map[UUID, Definition]]) => T): DStream[T] = {
MYCache.getInstance(ssc.sparkContext)
var newCacheValues = Map[UUID, Definition]()
rawStream.cache()
rawStream
.transform(rdd => {
val array = rdd.collect()
array.foreach(r => {
val value = getNewCacheValue(r._2, rdd.context)
if (value.isDefined) {
newCacheValues = newCacheValues + value.get
}
})
rdd
})
if (newCacheValues.nonEmpty) {
logger.info(s"Rebroadcasting. There are ${newCacheValues.size} new values")
logger.info("Destroying old cache")
MyCache.cache.destroy()
// this is probably wrong here, destroying object, but then referencing it. But I haven't gotten to this part yet.
MyCache.cache = ssc.sparkContext.broadcast(MyCache.cache.value ++ newCacheValues)
}
rawStream
.map(r => {
println("######################")
println(MyCache.cache.value)
r
})
.map(r => processor(r._2, MyCache.cache.value))
.filter(r => null != r)
}
}
Every time I run this I get SparkException: Failed to get broadcast_1_piece0 of broadcast_1 when trying to access cache.value
When I add a println(MyCache.cache.values) right after the .getInstance I'm able to access the broadcast variable, but when I deploy it to a mesos cluster I'm unable to access the broadcast values again, but with a null pointer exception.
Update:
The error I'm seeing is on println(MyCache.cache.value). I shouldn't have added this if statement containing the destroy, because my tests are never hitting that.
The basics of my application are, I have a table in cassandra that won't be updated very much. But I need to do some validation on some streaming data. So I want to pull all the data from this table, that isn't update much, into memory. getInstance pulls the whole table in on startup, and then I check all my streaming data to see if I need to pull from cassandra again (which I will have to very rarely). The transform and collect is where I check to see if I need to pull new data in. But since there is a chance that my table will be updated, I will need to update the broadcast occasionally. So my idea was to destroy it and then rebroadcast. I will update that once I get the other stuff working.
I get the same error if I comment out the destroy and rebroadcast.
Another update:
I need to access the broadcast variable in processor this line: .map(r => processor(r._2, MyCache.cache.value)).
I'm able to broadcast variable in the transform, and if I do println(MyCache.cache.value) in the transform, then all my tests pass, and I'm able to then access the broadcast in processor
Update:
rawStream
.map(r => {
println("$$$$$$$$$$$$$$$$$$$")
println(metrics.value)
r
})
This is the stack trace I get when it hits this line.
ERROR org.apache.spark.executor.Executor - Exception in task 0.0 in stage 135.0 (TID 114)
java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_1_piece0 of broadcast_1
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1222)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at com.uptake.readings.ingestion.StreamProcessors$$anonfun$processIncomingKafkaData$4.apply(StreamProcessors.scala:160)
at com.uptake.readings.ingestion.StreamProcessors$$anonfun$processIncomingKafkaData$4.apply(StreamProcessors.scala:158)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:414)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:284)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to get broadcast_1_piece0 of broadcast_1
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:137)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:175)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1219)
... 24 more
[Updated answer]
You're getting an error because the code inside rawStream.map i.e. MyCache.cache.value is getting executed on one of the executor and there the MyCache.cache is still null!
When you did MyCache.getInstance, it created the MyCache.cache value on the driver and broadcasted it alright. But you're not referring to the same object in the your map method, so it doesn't get sent over to executors. Instead since you are directly referring to the MyCache, the executors invoke MyCache.cache on their own copy of MyCache object, and this obviously is null.
You can get this to work as expected by first getting an instance of cache broadcast object within the driver and using that object in the map. The following code should work for you --
val cache = MYCache.getInstance(ssc.sparkContext)
rawStream.map(r => {
println(cache.value)
r
})

Spark streaming: task "predict" not serializable

I am trying to make a spark streaming program using a model to predict, but I get an error doing this: Task not serializable.
Code:
val model = sc.objectFile[DecisionTreeModel]("DecisionTreeModel").first()
val parsedData = reducedData.map { line =>
val arr = Array(line._2._1,line._2._2,line._2._3,line._2._4,line._2._5,line._2._6,line._2._7,line._2._8,line._2._9,line._2._10,line._2._11)
val vector = LabeledPoint(line._2._4, Vectors.dense(arr))
model.predict(vector.features))
}
I paste the error:
scala> val parsedData = reducedData.map { line =>
| val arr = Array(line._2._1,line._2._2,line._2._3,line._2._4,line._2._5,line._2._6,line._2._7,line._2._8,line._2._9,line._2._10,line._2._11)
| val vector=LabeledPoint(line._2._4, Vectors.dense(arr))
| model.predict(vector.features)
| }
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2030)
at org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:528)
at org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:528)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:709)
at .......
How can I solve this issue?
Thanks!
Refer this link:
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html
In your case, "model" is instantiated in driver and used in map which causes the object to be sent over network from driver to executors, so it should be serializable. If you cannot make model serializable, try avoiding having to serialize by instantiating model inside map.You may also need to control how often you create this object within executor - once per row(default), once per task(i.e., thread) or once per executor(i.e, jvm).
Finally, I don't think you can have a single global "model" object that you can cause mutations to from multiple executors - just in case that's what you are looking for(irrespective of whether you need to make it serializable or not).Comments welcome on this point.

SparkContext not serializable inside a companion object

I'm currently trying to extend a Machine Learning application that uses Scala and Spark. I'm using the structure of a previous project from Dieterich Lawson that I found on Github
https://github.com/dieterichlawson/admm
This project basically uses SparkContext to build an RDD of blocks of training samples, and then perform local computations on each of these sets (for example solving a linear system).
I was following the same scheme, but for my local computation I need to perform a L-BFGS algorithm on each block of training samples. In order to do so, I wanted to use the L-BFGS algorithm from the mlLib which has the following signature.
runLBFGS(RDD<scala.Tuple2<Object,Vector>> data, Gradient gradient,
Updater updater, int numCorrections, double convergenceTol,
int maxNumIterations, double regParam, Vector initialWeights)
As it says, the method takes as input an RDD[Object,Vector] of the training samples. The problem is that locally on each worker I no longer keep the RDD structure of the data. Therefore, I'm trying to use parallelize function of the SparkContext on each block of the matrix. But when I do this, I get a serializer exception. (The exact exception message is at the end of the question).
This is a detailed explanation on how I'm handling the SparkContext.
First, in the main application it is used to open a textfile and it is used in the factory of the class LogRegressionXUpdate:
val A = sc.textFile("ds1.csv")
A.checkpoint
val f = LogRegressionXUpdate.fromTextFile(A,params.rho,1024,sc)
In the application, the class LogRegressionXUpdate is implemented as follows
class LogRegressionXUpdate(val training: RDD[(Double, NV)],
val rho: Double) extends Function1[BDV[Double],Double] with Prox with Serializable{
def prox(x: BDV[Double], rho: Double): BDV[Double] = {
val numCorrections = 10
val convergenceTol = 1e-4
val maxNumIterations = 20
val regParam = 0.1
val (weights, loss) = LBFGS.runLBFGS(
training,
new GradientForLogRegADMM(rho,fromBreeze(x)),
new SimpleUpdater(),
numCorrections,
convergenceTol,
maxNumIterations,
regParam,
fromBreeze(x))
toBreeze(weights.toArray).toDenseVector
}
def apply(x: BDV[Double]): Double = {
Math.pow(1,2.0)
}
}
With the following companion object:
object LogRegressionXUpdate {
def fromTextFile(file: RDD[String], rho: Double, blockHeight: Int = 1024, #transient sc: SparkContext): RDF[LogRegressionXUpdate] = {
val fns = new BlockMatrix(file, blockHeight).blocks.
map(X => new LogRegressionXUpdate(sc.parallelize((X(*,::).map(fila => (fila(-1),fromBreeze(fila(0 to -2))))).toArray),rho))
new RDF[LogRegressionXUpdate](fns, 0L)
}
}
This constructor is causing a serialization error though I'm not really needing the SparkContext to build each RDD locally. I've searched for solutions to this problem and adding #transient didn't solve it.
Then, my question is: is it really possible to build these "second layer RDDs" or I'm forced to perform a non distributed version of the L-BFGS algorithm.
Thanks in advance!
Error Log:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:315)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:294)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:293)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
at org.apache.spark.rdd.RDD.map(RDD.scala:293)
at admm.functions.LogRegressionXUpdate$.fromTextFile(LogRegressionXUpdate.scala:70)
at admm.examples.Lasso$.run(Lasso.scala:96)
at admm.examples.Lasso$$anonfun$main$1.apply(Lasso.scala:70)
at admm.examples.Lasso$$anonfun$main$1.apply(Lasso.scala:69)
at scala.Option.map(Option.scala:145)
at admm.examples.Lasso$.main(Lasso.scala:69)
at admm.examples.Lasso.main(Lasso.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext#20576557)
- field (class: admm.functions.LogRegressionXUpdate$$anonfun$1, name: sc$1, type: class org.apache.spark.SparkContext)
- object (class admm.functions.LogRegressionXUpdate$$anonfun$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:312)
... 21 more
RDDs should only be accessed from the driver. Whenever you call something like
myRDD.map(someObject.someMethod)
spark serializes whatever that is needed for the computation of someMethod, and sends it to the workers. There, the method is deserialized and then it runs on each partition independently.
You, however, try to use a method that itself uses spark: you attempt to create a new RDD. However, this is not possible since they can only be created in the driver. The error you see is spark's attempt to serialize the spark context itself since it is needed for the computation at each block. More about serialization can be found in the first answer to this question.
"... though I'm not really needing the SparkContext to build each RDD locally" - actually this is exactly what you are doing when calling sc.parallelize. Bottom line - you need to find (or write) a local implementation of L-BFGS.