Task not serializable with aggregateByKey - Scala

Environment: Spark 1.6.0, using Scala.
The program compiles with sbt, but when I submit it, it fails with an error.
The full error is as follows:
17/01/21 18:32:24 INFO net.NetworkTopology: Adding a new node: /YH11070029/10.39.0.213:50010
17/01/21 18:32:24 INFO storage.BlockManagerMasterEndpoint: Registering block manager 10.39.0.44:41961 with 2.7 GB RAM, BlockManagerId(349, 10.39.0.44, 41961)
17/01/21 18:32:24 INFO storage.BlockManagerMasterEndpoint: Registering block manager 10.39.2.178:48591 with 2.7 GB RAM, BlockManagerId(518, 10.39.2.178, 48591)
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKeyWithClassTag$1.apply(PairRDDFunctions.scala:93)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKeyWithClassTag$1.apply(PairRDDFunctions.scala:82)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.PairRDDFunctions.combineByKeyWithClassTag(PairRDDFunctions.scala:82)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$aggregateByKey$1.apply(PairRDDFunctions.scala:177)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$aggregateByKey$1.apply(PairRDDFunctions.scala:166)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.PairRDDFunctions.aggregateByKey(PairRDDFunctions.scala:166)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$aggregateByKey$3.apply(PairRDDFunctions.scala:206)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$aggregateByKey$3.apply(PairRDDFunctions.scala:206)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.PairRDDFunctions.aggregateByKey(PairRDDFunctions.scala:205)
at com.sina.adalgo.feature.ETL$$anonfun$13.apply(ETL.scala:190)
at com.sina.adalgo.feature.ETL$$anonfun$13.apply(ETL.scala:102)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
The purpose of the code is to count the frequencies of categorical feature values. The main code is as follows:
object ETL extends Serializable {
... ...
val cateList = featureData.map {
case (psid: String, label: String, cate_features: ParArray[String], media_features: String) =>
val pair_feature = cate_features.zipWithIndex.map(x => (x._2, x._1))
pair_feature
}.flatMap(_.toList)
def seqop(m: HashMap[String, Int] , s: String) : HashMap[String, Int]={
var x = m.getOrElse(s, 0)
x += 1
m += s -> x
m
}
def combop(m: HashMap[String, Int], n: HashMap[String, Int]) : HashMap[String, Int]={
for (k <- n) {
var x = m.getOrElse(k._1, 0)
x += k._2
m += k._1 -> x
}
m
}
val hash = HashMap[String, Int]()
val feaFreq = cateList.aggregateByKey(hash)(seqop, combop) // (i, HashMap[String, Int]), where i is the index of the categorical feature
The object already extends Serializable.
Why does this happen? Can you help me?

This problem typically happens in Spark when the aggregation function is a closure that unintentionally closes over unwanted objects, or simply when it is a function defined inside the main class of the Spark driver code.
I suspect this is the case here, since your stack trace shows the failure originating in org.apache.spark.util.ClosureCleaner.
This is problematic because, when Spark tries to forward that function to the workers so they can do the actual aggregation, it ends up serializing much more than you intended: the function itself plus its surrounding class.
See also this post by Erik Erlandson, which explains some edge cases of closure serialization well, as well as the Spark 1.6 notes on closures.
A quick fix is probably to move the definition of the functions you use in aggregateByKey to a separate object, completely independent from the rest of the code.
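As a minimal sketch of that last suggestion, the two aggregation functions from the question can live in a small standalone object. The name FreqOps is hypothetical, and whether this alone removes the exception depends on what else the original closure captured, which the truncated stack trace does not show.

```scala
import scala.collection.mutable.HashMap

// Hypothetical helper object: it holds only the two aggregation functions,
// so referencing them does not drag in the enclosing driver object.
object FreqOps extends Serializable {
  // Fold one category value into the per-partition frequency map.
  def seqOp(m: HashMap[String, Int], s: String): HashMap[String, Int] = {
    m(s) = m.getOrElse(s, 0) + 1
    m
  }

  // Merge the counts of n into m and return m.
  def combOp(m: HashMap[String, Int], n: HashMap[String, Int]): HashMap[String, Int] = {
    for ((k, v) <- n) m(k) = m.getOrElse(k, 0) + v
    m
  }
}
```

With this in place, the call in the question would become cateList.aggregateByKey(HashMap[String, Int]())(FreqOps.seqOp, FreqOps.combOp), so the closures refer only to FreqOps and not to the enclosing ETL object.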

Related

Spark Scala: difference in App execution vs line by line in REPL

I have a simple word count program packed as object:
object MyApp {
val path = "file:///home/sergey/spark/spark-2.2.0/README.md"
val readMe = sc.textFile(path)
val stop = List("to","the","a")
val res = (readMe
.flatMap(_.split("\\W+"))
.filter(_.length > 0)
.map(_.toLowerCase)
.filter(!stop.contains(_))
.map((_, 1))
.reduceByKey(_ + _)
.sortBy(-_._2)
)
println(res.take(3).mkString)
}
When I try to execute it I get:
scala> MyApp
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2287)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:387)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:386)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.filter(RDD.scala:386)
... 51 elided
Caused by: java.io.NotSerializableException: MyApp$
Serialization stack:
- object not serializable (class: MyApp$, value: MyApp$@7bd44868)
- field (class: MyApp$$anonfun$5, name: $outer, type: class MyApp$)
- object (class MyApp$$anonfun$5, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
(the culprit being the .filter(!stop.contains(_)) line).
However, when I execute the same code line by line it runs well and produces expected results.
I would really appreciate answers to 2 questions:
What is so different between line-by-line execution and singleton execution, so that one runs whereas the other fails?
What could be other solutions than packing !stop.contains(_) closure together with stop list into another object?
Generally speaking, your program is a little unusual.
Let me go through the details; I hope it helps.
You said you got the correct answer when the program was executed line by line. From your description, I guess this happened in spark-shell, right?
One important thing to note is that when you open spark-shell from the Spark package, you are in a REPL environment where a SparkContext object has already been constructed for you; creating your own SparkContext will not work.
For example,
$ ./bin/spark-shell --master local[4]
Now you want a standalone Spark application, written as a program file such as MyApp.scala. For a parallel operation on an RDD (here readMe, an RDD[String]), Spark breaks the job triggered by the action take into many tasks and forwards these tasks to the workers to execute.
Now look at your code. To run your operations, Spark has to build the closure (the variables and methods that must be visible to the executor to perform its computations on the RDD), and in your code everything the closure needs lives in the whole singleton object. Your trace shows this: the closure holds an $outer reference to MyApp$, which is not serializable.
However, inside the singleton object MyApp, a value such as sc of type SparkContext cannot and should not be serializable, because it lives only on the driver node to tell Spark how to access the cluster, so your submission will fail.
I have revised the code so that you can run it on your machine. For your purpose, though, the revised code should be submitted with the spark-submit script.
import org.apache.spark.{SparkConf, SparkContext}
object MyApp {
private var sc: SparkContext = _
def init(): Unit = {
val sparkConf = new SparkConf().setAppName(this.getClass.getName)
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sc = new SparkContext(sparkConf)
}
def mission(): Unit = {
val path = "file:///home/sergey/spark/spark-2.2.0/README.md"
val readMe = sc.textFile(path)
val stop = List("to", "the", "a")
val res = readMe
.flatMap(_.split("\\W+"))
.filter(_.length > 0)
.map(_.toLowerCase)
.filter(!stop.contains(_))
.map((_, 1))
.reduceByKey(_ + _)
.sortBy(-_._2)
println(res.take(3).mkString)
}
def main(args: Array[String]): Unit = {
init()
mission()
}
}
Please refer to the spark-submit usage page; sooner or later you will have to use it.
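On the second question (alternatives to moving the closure and the stop list into another object), one common option is to copy the field into a local val before building the closure. The sketch below checks this with plain Java serialization, which is what Spark uses when it validates closures, and needs no Spark at all. WordFilter and ClosureCheck are hypothetical names; WordFilter stands in for a driver class or object that is not Serializable.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object ClosureCheck {
  // True if Java serialization accepts the value, false if it hits
  // something that is not Serializable.
  def canSerialize(value: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(value)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }
}

// Hypothetical stand-in for driver code that is not Serializable.
class WordFilter {
  val stop = List("to", "the", "a")

  // Refers to the field `stop`, i.e. `this.stop`, so the closure holds `this`.
  def viaField: String => Boolean = w => !stop.contains(w)

  // Copies the field into a local val first; the closure holds only the List.
  def viaLocal: String => Boolean = {
    val localStop = stop
    w => !localStop.contains(w)
  }
}
```

Calling ClosureCheck.canSerialize on both closures shows the difference: the one built by viaField pulls in the whole WordFilter instance, while the one built by viaLocal carries only the list it needs.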

Code working in spark-shell but not in Eclipse

I have a small piece of Scala code that works properly in spark-shell but not in Eclipse with the Scala plugin. I can access HDFS using the plugin: I tried writing another file and it worked.
FirstSpark.scala
package bigdata.spark
import org.apache.spark.SparkConf
import java.io._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object FirstSpark {
def main(args: Array[String])={
val conf = new SparkConf().setMaster("local").setAppName("FirstSparkProgram")
val sparkcontext = new SparkContext(conf)
val textFile =sparkcontext.textFile("hdfs://pranay:8020/spark/linkage")
val m = new Methods()
val q =textFile.filter(x => !m.isHeader(x)).map(x=> m.parse(x))
q.saveAsTextFile("hdfs://pranay:8020/output") }
}
Methods.scala
package bigdata.spark
import java.util.function.ToDoubleFunction
class Methods {
def isHeader(s:String):Boolean={
s.contains("id_1")
}
def parse(line:String) ={
val pieces = line.split(',')
val id1=pieces(0).toInt
val id2=pieces(1).toInt
val matches=pieces(11).toBoolean
val mapArray=pieces.slice(2, 11).map(toDouble)
MatchData(id1,id2,mapArray,matches)
}
def toDouble(s: String) = {
if ("?".equals(s)) Double.NaN else s.toDouble
}
}
case class MatchData(id1: Int, id2: Int,
scores: Array[Double], matched: Boolean)
Error Message:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:335)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:334)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
Can anyone please help me with this?
Try changing class Methods { .. } to object Methods { .. }.
I think the problem is at val q = textFile.filter(x => !m.isHeader(x)).map(x => m.parse(x)). When Spark sees the filter and map functions, it tries to serialize the functions passed to them (x => !m.isHeader(x) and x => m.parse(x)) so that it can dispatch the work of executing them to all of the executors (this is the Task referred to). However, to do this it needs to serialize m, since this object is referenced inside the functions (it is in the closure of the two anonymous functions), and it cannot do so because Methods is not serializable. You could add extends Serializable to the Methods class, but in this case an object is more appropriate: a closure that calls a method on a top-level object does not capture any instance, so nothing extra has to be serialized.
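A minimal sketch of the object version, reusing the method bodies from the question. The sample line in the usage note is made up for illustration, and the Spark call site appears only as a comment.

```scala
// Same helpers as in the question, but on a singleton object instead of a class,
// so closures call them without capturing an instance.
object Methods {
  def isHeader(s: String): Boolean = s.contains("id_1")

  def toDouble(s: String): Double =
    if ("?".equals(s)) Double.NaN else s.toDouble

  def parse(line: String): MatchData = {
    val pieces = line.split(',')
    MatchData(
      pieces(0).toInt,
      pieces(1).toInt,
      pieces.slice(2, 11).map(toDouble), // the nine score columns
      pieces(11).toBoolean)
  }
}

case class MatchData(id1: Int, id2: Int, scores: Array[Double], matched: Boolean)

// Call site in the driver (no `new Methods()` needed any more):
//   val q = textFile.filter(x => !Methods.isHeader(x)).map(x => Methods.parse(x))
```

For example, Methods.parse("1,2,1,?,1,1,1,1,0,1,0,true") yields a MatchData with nine scores, the second of which is NaN because of the "?" placeholder.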

Can't access broadcast variable in transformation

I'm having problems accessing a variable from inside a transformation function. Could someone help me out?
Here are my relevant classes and functions.
@SerialVersionUID(889949215L)
object MyCache extends Serializable {
@transient lazy val logger = Logger(getClass.getName)
@volatile var cache: Broadcast[Map[UUID, Definition]] = null
def getInstance(sparkContext: SparkContext) : Broadcast[Map[UUID, Definition]] = {
if (cache == null) {
synchronized {
val map = sparkContext.cassandraTable("keyspace", "table")
.collect()
.map(m => m.getUUID("id") ->
Definition(m.getString("c1"), m.getString("c2"), m.getString("c3"),
m.getString("c4"))).toMap
cache = sparkContext.broadcast(map)
}
}
cache
}
}
In a different file:
object Processor extends Serializable {
@transient lazy val logger = Logger(getClass.getName)
def processData[T: ClassTag](rawStream: DStream[(String, String)], ssc: StreamingContext,
processor: (String, Broadcast[Map[UUID, Definition]]) => T): DStream[T] = {
MyCache.getInstance(ssc.sparkContext)
var newCacheValues = Map[UUID, Definition]()
rawStream.cache()
rawStream
.transform(rdd => {
val array = rdd.collect()
array.foreach(r => {
val value = getNewCacheValue(r._2, rdd.context)
if (value.isDefined) {
newCacheValues = newCacheValues + value.get
}
})
rdd
})
if (newCacheValues.nonEmpty) {
logger.info(s"Rebroadcasting. There are ${newCacheValues.size} new values")
logger.info("Destroying old cache")
MyCache.cache.destroy()
// this is probably wrong here, destroying object, but then referencing it. But I haven't gotten to this part yet.
MyCache.cache = ssc.sparkContext.broadcast(MyCache.cache.value ++ newCacheValues)
}
rawStream
.map(r => {
println("######################")
println(MyCache.cache.value)
r
})
.map(r => processor(r._2, MyCache.cache.value))
.filter(r => null != r)
}
}
Every time I run this I get SparkException: Failed to get broadcast_1_piece0 of broadcast_1 when trying to access cache.value.
When I add a println(MyCache.cache.value) right after the .getInstance call, I'm able to access the broadcast variable. But when I deploy to a Mesos cluster, I'm again unable to access the broadcast values, this time with a null pointer exception.
Update:
The error I'm seeing is on println(MyCache.cache.value). I shouldn't have added this if statement containing the destroy, because my tests are never hitting that.
The basics of my application are as follows. I have a table in Cassandra that won't be updated very often, and I need to validate some streaming data against it. So I want to pull all the data from this rarely updated table into memory. getInstance pulls the whole table in on startup, and then I check all my streaming data to see whether I need to pull from Cassandra again (which will happen very rarely). The transform and collect are where I check whether I need to pull new data in. But since there is a chance that my table will be updated, I will need to update the broadcast occasionally. So my idea was to destroy it and then rebroadcast. I will update that once I get the other stuff working.
I get the same error if I comment out the destroy and rebroadcast.
Another update:
I need to access the broadcast variable in processor, on this line: .map(r => processor(r._2, MyCache.cache.value)).
I'm able to use the broadcast variable in the transform, and if I do println(MyCache.cache.value) in the transform, then all my tests pass and I'm then able to access the broadcast in processor.
Update:
rawStream
.map(r => {
println("$$$$$$$$$$$$$$$$$$$")
println(metrics.value)
r
})
This is the stack trace I get when it hits this line.
ERROR org.apache.spark.executor.Executor - Exception in task 0.0 in stage 135.0 (TID 114)
java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_1_piece0 of broadcast_1
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1222)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at com.uptake.readings.ingestion.StreamProcessors$$anonfun$processIncomingKafkaData$4.apply(StreamProcessors.scala:160)
at com.uptake.readings.ingestion.StreamProcessors$$anonfun$processIncomingKafkaData$4.apply(StreamProcessors.scala:158)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:414)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:284)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to get broadcast_1_piece0 of broadcast_1
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:137)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:175)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1219)
... 24 more
[Updated answer]
You're getting the error because the code inside rawStream.map, i.e. MyCache.cache.value, is executed on one of the executors, and there MyCache.cache is still null.
When you called MyCache.getInstance, it created the MyCache.cache value on the driver and broadcast it correctly. But you're not referring to that same object in your map function, so it doesn't get sent over to the executors. Instead, since you refer to MyCache directly, the executors read MyCache.cache on their own copy of the MyCache object, and there it is null.
You can get this to work as expected by first getting the broadcast object on the driver and then using that object in the map. The following code should work for you:
val cache = MyCache.getInstance(ssc.sparkContext)
rawStream.map(r => {
println(cache.value)
r
})

Spark streaming: task "predict" not serializable

I am trying to make a spark streaming program using a model to predict, but I get an error doing this: Task not serializable.
Code:
val model = sc.objectFile[DecisionTreeModel]("DecisionTreeModel").first()
val parsedData = reducedData.map { line =>
val arr = Array(line._2._1,line._2._2,line._2._3,line._2._4,line._2._5,line._2._6,line._2._7,line._2._8,line._2._9,line._2._10,line._2._11)
val vector = LabeledPoint(line._2._4, Vectors.dense(arr))
model.predict(vector.features)
}
I paste the error:
scala> val parsedData = reducedData.map { line =>
| val arr = Array(line._2._1,line._2._2,line._2._3,line._2._4,line._2._5,line._2._6,line._2._7,line._2._8,line._2._9,line._2._10,line._2._11)
| val vector=LabeledPoint(line._2._4, Vectors.dense(arr))
| model.predict(vector.features)
| }
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2030)
at org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:528)
at org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:528)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:709)
at .......
How can I solve this issue?
Thanks!
Refer to this link:
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html
In your case, "model" is instantiated on the driver and used inside map, which causes the object to be sent over the network from the driver to the executors, so it must be serializable. If you cannot make model serializable, try to avoid serializing it by instantiating model inside map. You may also need to control how often you create this object within an executor: once per row (the default), once per task (i.e., thread), or once per executor (i.e., JVM).
Finally, I don't think you can have a single global "model" object that you mutate from multiple executors, in case that's what you are looking for (irrespective of whether you need to make it serializable or not). Comments welcome on this point.
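The "once per task" and "once per executor" options can be sketched without Spark. ToyModel, ModelHolder and PerPartition are hypothetical names, and ToyModel is only a stand-in for a model that cannot be serialized; the counter exists just to make the number of instantiations visible.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for a model that is expensive to build and not serializable.
class ToyModel {
  ToyModel.created.incrementAndGet() // count every construction
  def predict(features: Array[Double]): Double =
    if (features.sum > 1.0) 1.0 else 0.0
}

object ToyModel {
  val created = new AtomicInteger(0)
}

// Once per JVM: the lazy val is initialised on first access and then reused.
object ModelHolder {
  lazy val model: ToyModel = new ToyModel
}

object PerPartition {
  // Once per partition: build the model once, then reuse it for every row.
  // This has the shape of a function that could be passed to mapPartitions.
  def scorePartition(rows: Iterator[Array[Double]]): Iterator[Double] = {
    val model = new ToyModel
    rows.map(row => model.predict(row))
  }
}
```

In a Spark job, scorePartition could be passed to mapPartitions to get one model per task, while referring to ModelHolder.model inside map would give one model per executor JVM, created on first use. Neither variant ships a model instance from the driver.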

SparkContext not serializable inside a companion object

I'm currently trying to extend a machine learning application that uses Scala and Spark. I'm using the structure of a previous project by Dieterich Lawson that I found on GitHub:
https://github.com/dieterichlawson/admm
This project basically uses SparkContext to build an RDD of blocks of training samples, and then performs local computations on each of these sets (for example, solving a linear system).
I was following the same scheme, but for my local computation I need to run an L-BFGS algorithm on each block of training samples. To do so, I wanted to use the L-BFGS algorithm from MLlib, which has the following signature:
runLBFGS(RDD<scala.Tuple2<Object,Vector>> data, Gradient gradient,
Updater updater, int numCorrections, double convergenceTol,
int maxNumIterations, double regParam, Vector initialWeights)
As it says, the method takes as input an RDD[(Object, Vector)] of the training samples. The problem is that locally on each worker I no longer keep the RDD structure of the data. Therefore, I'm trying to use the parallelize function of the SparkContext on each block of the matrix. But when I do this, I get a serialization exception. (The exact exception message is at the end of the question.)
This is a detailed explanation of how I'm handling the SparkContext.
First, in the main application it is used to open a textfile and it is used in the factory of the class LogRegressionXUpdate:
val A = sc.textFile("ds1.csv")
A.checkpoint
val f = LogRegressionXUpdate.fromTextFile(A,params.rho,1024,sc)
In the application, the class LogRegressionXUpdate is implemented as follows
class LogRegressionXUpdate(val training: RDD[(Double, NV)],
val rho: Double) extends Function1[BDV[Double],Double] with Prox with Serializable{
def prox(x: BDV[Double], rho: Double): BDV[Double] = {
val numCorrections = 10
val convergenceTol = 1e-4
val maxNumIterations = 20
val regParam = 0.1
val (weights, loss) = LBFGS.runLBFGS(
training,
new GradientForLogRegADMM(rho,fromBreeze(x)),
new SimpleUpdater(),
numCorrections,
convergenceTol,
maxNumIterations,
regParam,
fromBreeze(x))
toBreeze(weights.toArray).toDenseVector
}
def apply(x: BDV[Double]): Double = {
Math.pow(1,2.0)
}
}
With the following companion object:
object LogRegressionXUpdate {
def fromTextFile(file: RDD[String], rho: Double, blockHeight: Int = 1024, @transient sc: SparkContext): RDF[LogRegressionXUpdate] = {
val fns = new BlockMatrix(file, blockHeight).blocks.
map(X => new LogRegressionXUpdate(sc.parallelize((X(*,::).map(fila => (fila(-1),fromBreeze(fila(0 to -2))))).toArray),rho))
new RDF[LogRegressionXUpdate](fns, 0L)
}
}
This constructor is causing a serialization error, though I'm not really needing the SparkContext to build each RDD locally. I've searched for solutions to this problem, and adding @transient didn't solve it.
So my question is: is it really possible to build these "second layer RDDs", or am I forced to use a non-distributed version of the L-BFGS algorithm?
Thanks in advance!
Error Log:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:315)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:294)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:293)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
at org.apache.spark.rdd.RDD.map(RDD.scala:293)
at admm.functions.LogRegressionXUpdate$.fromTextFile(LogRegressionXUpdate.scala:70)
at admm.examples.Lasso$.run(Lasso.scala:96)
at admm.examples.Lasso$$anonfun$main$1.apply(Lasso.scala:70)
at admm.examples.Lasso$$anonfun$main$1.apply(Lasso.scala:69)
at scala.Option.map(Option.scala:145)
at admm.examples.Lasso$.main(Lasso.scala:69)
at admm.examples.Lasso.main(Lasso.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@20576557)
- field (class: admm.functions.LogRegressionXUpdate$$anonfun$1, name: sc$1, type: class org.apache.spark.SparkContext)
- object (class admm.functions.LogRegressionXUpdate$$anonfun$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:312)
... 21 more
RDDs should only be accessed from the driver. Whenever you call something like
myRDD.map(someObject.someMethod)
Spark serializes whatever is needed for the computation of someMethod and sends it to the workers. There, the method is deserialized and then runs on each partition independently.
You, however, are trying to use a method that itself uses Spark: you attempt to create a new RDD. This is not possible, since RDDs can only be created on the driver. The error you see is Spark's attempt to serialize the SparkContext itself, since it is needed for the computation on each block. More about serialization can be found in the first answer to this question.
"... though I'm not really needing the SparkContext to build each RDD locally": actually, using it is exactly what you are doing when you call sc.parallelize. Bottom line: you need to find (or write) a local implementation of L-BFGS.
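To illustrate what "local" means here, the sketch below fits a logistic regression on one in-memory block using only plain Scala arrays, so it needs no SparkContext and could be called inside an RDD transformation. LocalSolver and fitBlock are hypothetical names, and the optimizer is plain gradient descent used as a simple stand-in: it is not L-BFGS, and the step count and learning rate are arbitrary assumptions.

```scala
// Hypothetical local solver: runs entirely in memory on one block of samples.
// NOTE: plain gradient descent, a simple stand-in for L-BFGS.
object LocalSolver {
  private def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  private def dot(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0
    var i = 0
    while (i < a.length) { s += a(i) * b(i); i += 1 }
    s
  }

  // block: (label, features) pairs with labels in {0.0, 1.0}; dim: feature count.
  def fitBlock(block: Array[(Double, Array[Double])], dim: Int,
               steps: Int = 200, lr: Double = 0.1): Array[Double] = {
    val w = Array.fill(dim)(0.0)
    if (block.nonEmpty) {
      for (_ <- 0 until steps) {
        // Gradient of the logistic loss summed over the block.
        val grad = Array.fill(dim)(0.0)
        for ((label, x) <- block) {
          val err = sigmoid(dot(w, x)) - label
          for (j <- 0 until dim) grad(j) += err * x(j)
        }
        // Step against the averaged gradient.
        for (j <- 0 until dim) w(j) -= lr * grad(j) / block.length
      }
    }
    w
  }
}
```

In the question's fromTextFile, the map over blocks would then call something like LocalSolver.fitBlock on the rows of each block instead of sc.parallelize followed by LBFGS.runLBFGS. A real replacement would swap the loop for an in-memory L-BFGS implementation.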