Is using a function in a transformation causing Not Serializable exceptions? - scala

I have a Breeze DenseMatrix. I compute the mean per row and the mean of squares per row and put them into another DenseMatrix, one per column, but I get a Task not serializable exception. I know that sc is not serializable, but I think the exception occurs because I call functions inside a transformation in SafeZones.
Am I right? And what would be a possible way to do this without those functions? Any help would be great!
Code:
object MotitorDetection {
  case class MonDetect() extends Serializable {
    var sc: SparkContext = _
    var machines: Int = 0
    var counters: Int = 0
    var GlobalVec = BDM.zeros[Double](counters, 2)

    def findMean(a: BDM[Double]): BDV[Double] = {
      var c = mean(a(*, ::))
      c
    }

    def toMatrix(x: BDV[Double], y: BDV[Double], C: Int): BDM[Double] = {
      val m = BDM.zeros[Double](C, 2)
      m(::, 0) := x
      m(::, 1) := y
      m
    }

    def SafeZones(stream: DStream[(Int, BDM[Double])]) {
      stream.foreachRDD { (rdd: RDD[(Int, BDM[Double])], _) =>
        if (isEmpty(rdd) == false) {
          val InputVec = rdd.map(x => (x._1, toMatrix(findMean(x._2), findMean(pow(x._2, 2)), counters)))
          GlobalMeanVector(InputVec)
        }
      }
    }
  }
}
Exception:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2287)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.map(RDD.scala:369)
at ScalaApps.MotitorDetection$MonDetect$$anonfun$SafeZones$1.apply(MotitorDetection.scala:85)
at ScalaApps.MotitorDetection$MonDetect$$anonfun$SafeZones$1.apply(MotitorDetection.scala:82)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:257)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:256)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext#6eee7027)
- field (class: ScalaApps.MotitorDetection$MonDetect, name: sc, type: class org.apache.spark.SparkContext)
- object (class ScalaApps.MotitorDetection$MonDetect, MonDetect())
- field (class: ScalaApps.MotitorDetection$MonDetect$$anonfun$SafeZones$1, name: $outer, type: class ScalaApps.MotitorDetection$MonDetect)
- object (class ScalaApps.MotitorDetection$MonDetect$$anonfun$SafeZones$1, <function2>)
- field (class: ScalaApps.MotitorDetection$MonDetect$$anonfun$SafeZones$1$$anonfun$2, name: $outer, type: class ScalaApps.MotitorDetection$MonDetect$$anonfun$SafeZones$1)
- object (class ScalaApps.MotitorDetection$MonDetect$$anonfun$SafeZones$1$$anonfun$2, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 28 more

The findMean method is a method of the MonDetect case class (inside the object MotitorDetection), and MonDetect carries a SparkContext field, which is not serializable. The closure used in rdd.map drags the whole MonDetect instance along, so the task is not serializable.
Move all the matrix-related functions into a separate serializable object, say MatrixUtils:
object MatrixUtils {
  def findMean(a: BDM[Double]): BDV[Double] = {
    var c = mean(a(*, ::))
    c
  }

  def toMatrix(x: BDV[Double], y: BDV[Double], C: Int): BDM[Double] = {
    val m = BDM.zeros[Double](C, 2)
    m(::, 0) := x
    m(::, 1) := y
    m
  }

  ...
}
and then use only those methods from rdd.map(...):
object MotitorDetection {
  val sc = ...

  def SafeZones(stream: DStream[(Int, BDM[Double])]) {
    import MatrixUtils._

    ... = rdd.map( ... )
  }
}
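For completeness, a minimal sketch of what the refactored SafeZones could look like (assuming counters is still an Int on the enclosing object, GlobalMeanVector is defined elsewhere as in the question, and MatrixUtils is the object above):

import breeze.linalg.{DenseMatrix => BDM}
import breeze.numerics.pow
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

object MotitorDetection {
  var counters: Int = 0                             // as in the question

  def SafeZones(stream: DStream[(Int, BDM[Double])]): Unit = {
    import MatrixUtils._
    // Copy the field into a local val so the closure only needs a plain Int
    // and does not reference any enclosing instance.
    val numCounters = counters
    stream.foreachRDD { (rdd: RDD[(Int, BDM[Double])], _) =>
      if (!rdd.isEmpty()) {                         // RDD.isEmpty() instead of the custom isEmpty(rdd) helper
        val inputVec = rdd.map { case (id, m) =>
          (id, toMatrix(findMean(m), findMean(pow(m, 2.0)), numCounters))
        }
        GlobalMeanVector(inputVec)                  // assumed to exist, as in the question
      }
    }
  }
}

Everything referenced inside rdd.map is now either a local value or a member of the serializable MatrixUtils object, so the closure no longer needs the SparkContext-holding instance.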

Related

Spark Dataframe stat throwing Task not serializable

What am I trying to do? (Context)
I'm trying to calculate some statistics for a DataFrame/Dataset in Spark that is read from a directory of .parquet files about US flights between 2013 and 2015. More specifically, I'm using the approxQuantile method in DataFrameStatFunctions, which can be accessed by calling the stat method on a Dataset. See the documentation.
import airportCaseStudy.model.Flight
import org.apache.spark.sql.SparkSession

object CaseStudy {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession
      .builder
      .master("local[*]")
      .getOrCreate
    val sc = spark.sparkContext
    sc.setLogLevel("ERROR")

    import spark.sqlContext.implicits._
    val flights = spark
      .read
      .parquet("C:\\Users\\Bluetab\\IdeaProjects\\GraphFramesSparkPlayground\\src\\resources\\flights")
      .as[Flight]

    flights.show()
    flights.printSchema()
    flights.describe("year", "flightEpochSeconds").show()

    val approxQuantiles = flights.stat
      .approxQuantile(Array("year", "flightEpochSeconds"), Array(0.25, 0.5, 0.75), 0.25)
    // whatever...
  }
}
Flight is simply a case class.
package airportCaseStudy.model

case class Flight(year: Int, quarter: Int, month: Int, dayOfMonth: Int, dayOfWeek: Int, flightDate: String,
                  uniqueCarrier: String, airlineID: String, carrier: String, tailNum: String, flightNum: Int,
                  originAirportID: String, origin: String, originCityName: String, dstAirportID: String,
                  dst: String, dstCityName: String, taxiOut: Float, taxiIn: Float, cancelled: Boolean,
                  diverted: Float, actualETMinutes: Float, airTimeMinutes: Float, distanceMiles: Float,
                  flightEpochSeconds: Long)
What's the issue?
I'm using Spark 2.4.0.
When executing val approxQuantiles = flights.stat.approxQuantile(Array("year", "flightEpochSeconds"), Array(0.25, 0.5, 0.75), 0.25), I get a Task not serializable exception. I spent some time checking the following links, but I'm not able to figure out why this exception occurs.
Find quantiles and mean using spark (python and scala)
Statistical and Mathematical functions with DF in Spark from Databricks
Exception
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:393)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
at org.apache.spark.rdd.PairRDDFunctions.$anonfun$combineByKeyWithClassTag$1(PairRDDFunctions.scala:88)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.combineByKeyWithClassTag(PairRDDFunctions.scala:77)
at org.apache.spark.rdd.PairRDDFunctions.$anonfun$foldByKey$1(PairRDDFunctions.scala:222)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.foldByKey(PairRDDFunctions.scala:211)
at org.apache.spark.rdd.RDD.$anonfun$treeAggregate$1(RDD.scala:1158)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1137)
at org.apache.spark.sql.execution.stat.StatFunctions$.multipleApproxQuantiles(StatFunctions.scala:102)
at org.apache.spark.sql.DataFrameStatFunctions.approxQuantile(DataFrameStatFunctions.scala:104)
at airportCaseStudy.CaseStudy$.main(CaseStudy.scala:27)
at airportCaseStudy.CaseStudy.main(CaseStudy.scala)
Caused by: java.io.NotSerializableException: scala.runtime.LazyRef
Serialization stack:
- object not serializable (class: scala.runtime.LazyRef, value: LazyRef thunk)
- element of array (index: 2)
- array (class [Ljava.lang.Object;, size 3)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.rdd.PairRDDFunctions, functionalInterfaceMethod=scala/Function0.apply:()Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/rdd/PairRDDFunctions.$anonfun$foldByKey$2:(Lorg/apache/spark/rdd/PairRDDFunctions;[BLscala/runtime/LazyRef;)Ljava/lang/Object;, instantiatedMethodType=()Ljava/lang/Object;, numCaptured=3])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class org.apache.spark.rdd.PairRDDFunctions$$Lambda$2158/61210602, org.apache.spark.rdd.PairRDDFunctions$$Lambda$2158/61210602#165a5979)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 2)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.rdd.PairRDDFunctions, functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/rdd/PairRDDFunctions.$anonfun$foldByKey$3:(Lscala/Function0;Lscala/Function2;Ljava/lang/Object;)Ljava/lang/Object;, instantiatedMethodType=(Ljava/lang/Object;)Ljava/lang/Object;, numCaptured=2])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class org.apache.spark.rdd.PairRDDFunctions$$Lambda$2159/758750856, org.apache.spark.rdd.PairRDDFunctions$$Lambda$2159/758750856#6a6e410c)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:400)
... 22 more
I appreciate any help you can provide.
add "extends Serializable" to you class or object.
class/Object Test extends Serializable{
//type you code
}
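Applied to the question's code, the suggestion would look roughly like this (a sketch; only the declaration changes, the body stays as above):

object CaseStudy extends Serializable {
  def main(args: Array[String]): Unit = {
    // ... same body as in the question ...
  }
}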

Task not serializable after adding it to ForEachPartition

I am receiving a Task not serializable exception in Spark when attempting to implement an Apache Pulsar sink in Spark Structured Streaming.
I have already attempted to extract the PulsarConfig into a separate class and call it within the .foreachPartition lambda function, which is what I normally do for JDBC connections and other systems I integrate into Spark Structured Streaming, as shown below:
PulsarSink Class
class PulsarSink(
    sqlContext: SQLContext,
    parameters: Map[String, String],
    partitionColumns: Seq[String],
    outputMode: OutputMode) extends Sink {

  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    data.toJSON.foreachPartition( partition => {
      val pulsarConfig = new PulsarConfig(parameters).client
      val producer = pulsarConfig.newProducer(Schema.STRING)
        .topic(parameters.get("topic").get)
        .compressionType(CompressionType.LZ4)
        .sendTimeout(0, TimeUnit.SECONDS)
        .create
      partition.foreach(rec => producer.send(rec))
      producer.flush()
    })
  }
}
PulsarConfig Class
class PulsarConfig(parameters: Map[String, String]) {

  def client(): PulsarClient = {
    import scala.collection.JavaConverters._

    if (!parameters.get("tlscert").isEmpty && !parameters.get("tlskey").isEmpty) {
      val tlsAuthMap = Map("tlsCertFile" -> parameters.get("tlscert").get,
        "tlsKeyFile" -> parameters.get("tlskey").get).asJava
      val tlsAuth: Authentication = AuthenticationFactory.create(classOf[AuthenticationTls].getName, tlsAuthMap)

      PulsarClient.builder
        .serviceUrl(parameters.get("broker").get)
        .tlsTrustCertsFilePath(parameters.get("tlscert").get)
        .authentication(tlsAuth)
        .enableTlsHostnameVerification(false)
        .allowTlsInsecureConnection(true)
        .build
    } else {
      PulsarClient.builder
        .serviceUrl(parameters.get("broker").get)
        .enableTlsHostnameVerification(false)
        .allowTlsInsecureConnection(true)
        .build
    }
  }
}
The error message I receive is the following:
ERROR StreamExecution: Query [id = 12c715c2-2d62-4523-a37a-4555995ccb74, runId = d409c0db-7078-4654-b0ce-96e46dfb322c] terminated with error
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2294)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:925)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:924)
at org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply$mcV$sp(Dataset.scala:2341)
at org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply(Dataset.scala:2341)
at org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply(Dataset.scala:2341)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2828)
at org.apache.spark.sql.Dataset.foreachPartition(Dataset.scala:2340)
at org.apache.spark.datamediation.impl.sink.PulsarSink.addBatch(PulsarSink.scala:20)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply$mcV$sp(StreamExecution.scala:666)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply(StreamExecution.scala:666)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply(StreamExecution.scala:666)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:279)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:665)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(StreamExecution.scala:306)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:294)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:294)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:279)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:294)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:290)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:206)
Caused by: java.io.NotSerializableException: org.apache.spark.datamediation.impl.sink.PulsarSink
Serialization stack:
- object not serializable (class: org.apache.spark.datamediation.impl.sink.PulsarSink, value: org.apache.spark.datamediation.impl.sink.PulsarSink#38813f43)
- field (class: org.apache.spark.datamediation.impl.sink.PulsarSink$$anonfun$addBatch$1, name: $outer, type: class org.apache.spark.datamediation.impl.sink.PulsarSink)
- object (class org.apache.spark.datamediation.impl.sink.PulsarSink$$anonfun$addBatch$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:337)
... 31 more
Values used inside "foreachPartition" can be copied from class level into local function variables:
override def addBatch(batchId: Long, data: DataFrame): Unit = {
  val parametersLocal = parameters
  data.toJSON.foreachPartition( partition => {
    val pulsarConfig = new PulsarConfig(parametersLocal).client
    ...
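For reference, a sketch of the complete addBatch with that change applied (the rest mirrors the code in the question):

override def addBatch(batchId: Long, data: DataFrame): Unit = {
  // Copy the constructor parameter into a local val: the closure then captures
  // a Map[String, String] instead of the non-serializable PulsarSink instance.
  val parametersLocal = parameters
  data.toJSON.foreachPartition { partition =>
    val pulsarConfig = new PulsarConfig(parametersLocal).client
    val producer = pulsarConfig.newProducer(Schema.STRING)
      .topic(parametersLocal.get("topic").get)
      .compressionType(CompressionType.LZ4)
      .sendTimeout(0, TimeUnit.SECONDS)
      .create
    partition.foreach(rec => producer.send(rec))
    producer.flush()
  }
}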

Spark task not serializable

I've tried all the solutions to this problem that I found on Stack Overflow, but despite this I can't solve it.
I have a "MainObj" object that instantiates a "Recommendation" object. When I call the "recommendationProducts" method I always get an error.
Here is the code of the method:
def recommendationProducts(item: Int): Unit = {
  val aMatrix = new DoubleMatrix(Array(1.0, 2.0, 3.0))

  def cosineSimilarity(vec1: DoubleMatrix, vec2: DoubleMatrix): Double = {
    vec1.dot(vec2) / (vec1.norm2() * vec2.norm2())
  }

  val itemFactor = model.productFeatures.lookup(item).head
  val itemVector = new DoubleMatrix(itemFactor)

  // Here is where I get the error:
  val sims = model.productFeatures.map { case (id, factor) =>
    val factorVector = new DoubleMatrix(factor)
    val sim = cosineSimilarity(factorVector, itemVector)
    (id, sim)
  }

  val sortedSims = sims.top(10)(Ordering.by[(Int, Double), Double] {
    case (id, similarity) => similarity
  })

  println("\nTop 10 products:")
  sortedSims.map(x => (x._1, x._2)).foreach(println)
}
This is the error:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.map(RDD.scala:369)
at RecommendationObj.recommendationProducts(RecommendationObj.scala:269)
at MainObj$.analisiIUNGO(MainObj.scala:257)
at MainObj$.menu(MainObj.scala:54)
at MainObj$.main(MainObj.scala:37)
at MainObj.main(MainObj.scala)
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext#7c2312fa)
- field (class: RecommendationObj, name: sc, type: class org.apache.spark.SparkContext)
- object (class MainObj$$anon$1, MainObj$$anon$1#615bad16)
- field (class: RecommendationObj$$anonfun$37, name: $outer, type: class RecommendationObj)
- object (class RecommendationObj$$anonfun$37, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 14 more
I tried to:
1) add "extends Serializable" (Scala) to my class
2) add "extends java.io.Serializable" to my class
3) add "@transient" to some parts
4) get the model (and other features) inside this class (now I get them from another object and pass them to my class as arguments)
How can I resolve it? I'm going crazy!
Thank you in advance!
The key is here:
field (class: RecommendationObj, name: sc, type: class org.apache.spark.SparkContext)
So you have a field named sc of type SparkContext. Spark wants to serialize the enclosing class, so it also tries to serialize all of its fields.
You should either (see the sketch below):
use the @transient annotation, check whether the field is null, and recreate it when needed, or
not keep the SparkContext in a field, but pass it as a method argument instead. However, remember that you should never use a SparkContext inside closures passed to map, flatMap, etc.
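To make the two options concrete, a minimal sketch (the class and method names mirror the question; the exact wiring is an assumption):

import org.apache.spark.SparkContext

// Option 1: mark the field @transient so it is never serialized with closures.
// It will be null on executors, so only touch it in driver-side code and
// recreate it if needed.
class RecommendationObj(@transient var sc: SparkContext) extends Serializable {

  // Option 2: take the SparkContext as a method argument instead of a field,
  // and make sure no closure passed to map/flatMap references it.
  def recommendationProducts(sc: SparkContext, item: Int): Unit = {
    // driver-side code may use sc here; the function passed to rdd.map
    // must only use local, serializable values
  }
}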

Task not serializable error when applying regex transformation to String column

I want to reformat a DataFrame's String column. From format
(p:str_1,1)(p:str_2,2) ...
to format
str_1:1|str_2:2 ...
I wrote the following code [2], but I get an org.apache.spark.SparkException: Task not serializable exception, caused by java.io.NotSerializableException: scala.util.matching.Regex$MatchIterator [3].
Can someone help me understand why Regex$MatchIterator is not serializable, and how to fix this? It seems that the iterator returned by findAllIn(xs) is causing the problem at the for-statement's elem <- mi [1].
Thanks a lot!
-- β
[1] If I change elem <- mi to use a different iterator (e.g. elem <- Iterator(1, 2, 3)) then there is no exception and the code runs, but obviously it doesn't do what I need it to do. I also tried to get a plain iterator with findAllIn(xs).toIterator, but the same exception occurs.
[2] Code
val df = spark.sparkContext.parallelize(Seq(
  ("(p:some_string,6)(p:some_other_string,4)", "foo"),
  ("(p:yet_another_string,1) ", "bar")
)).toDF("my_p", "my_s")

val regexStr: String = "\\((?:p|p2)?:(?<q>.+?),(?<s>\\d+)\\)"

def _reformat(xs: String): String = {
  val re: scala.util.matching.Regex = regexStr.r
  val mi = re.findAllIn(xs)
  val d: Iterator[String] = for {
    elem <- mi
    val q: String = mi.group(1)
    val s: String = mi.group(2)
    val pair: String = s"${q}:${s}"
  } yield pair
  d.mkString("|")
}

def reformat: UserDefinedFunction = udf[String, String](_reformat)

val dfReformated: DataFrame = df
  .withColumn("my_p", reformat($"my_p"))
[3] StackTrace
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2039)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:817)
...
at org.apache.spark.sql.Dataset.head(Dataset.scala:1934)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2149)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:239)
at org.apache.spark.sql.Dataset.show(Dataset.scala:526)
at org.apache.spark.sql.Dataset.show(Dataset.scala:506)
... 46 elided
Caused by: java.io.NotSerializableException: scala.util.matching.Regex$MatchIterator
Serialization stack:
- object not serializable (class: scala.util.matching.Regex$MatchIterator, value: empty iterator)
- field (class: $iw, name: mi, type: class scala.util.matching.Regex$MatchIterator)
- object (class $iw, $iw#79303206)
- field (class: $iw, name: $iw, type: class $iw)
...
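For what it's worth, one way to sidestep the issue (a sketch, not taken from the original thread) is to keep the regex work entirely inside the UDF body and consume the matches eagerly with findAllMatchIn, so no Regex$MatchIterator can end up captured by the closure. This assumes the same df and spark.implicits._ import as in [2]:

import org.apache.spark.sql.functions.udf

// The regex and column names follow [2]; every non-serializable value
// (in particular the match iterator) stays strictly local to the UDF body.
val reformat = udf { (xs: String) =>
  val re = "\\((?:p|p2)?:(?<q>.+?),(?<s>\\d+)\\)".r
  re.findAllMatchIn(xs)
    .map(m => s"${m.group(1)}:${m.group(2)}")
    .mkString("|")
}

val dfReformated = df.withColumn("my_p", reformat($"my_p"))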

Scala: XStream complains object not serializable

I have the following case classes defined, and I would like to print out ClientData in XML format using XStream.
case class Address(addressLine1: String,
                   addressLine2: String,
                   city: String,
                   provinceCode: String,
                   country: String,
                   addressTypeDesc: String) extends Serializable {
}

case class ClientData(title: String,
                      firstName: String,
                      lastName: String,
                      addrList: Option[List[Address]]) extends Serializable {
}

object ex1 {
  def main(args: Array[String]) {
    ...
    ...
    ...
    // In below, x is Try[ClientData]
    val xstream = new XStream(new DomDriver)
    newClientRecord.foreach(x => if (x.isSuccess) println(xstream.toXML(x.get)))
  }
}
When the program executes the line that prints each ClientData in XML format, I get the runtime error below. Please help.
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:911)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:910)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.foreach(RDD.scala:910)
at lab9$.main(lab9.scala:63)
at lab9.main(lab9.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Caused by: java.io.NotSerializableException: com.thoughtworks.xstream.XStream
Serialization stack:
- object not serializable (class: com.thoughtworks.xstream.XStream, value: com.thoughtworks.xstream.XStream#51e94b7d)
- field (class: lab9$$anonfun$main$1, name: xstream$1, type: class com.thoughtworks.xstream.XStream)
- object (class lab9$$anonfun$main$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
... 16 more
It isn't XStream that complains, it's Spark. You need to define the xstream variable inside the task:
newClientRecord.foreach { x =>
  if (x.isSuccess) {
    val xstream = new XStream(new DomDriver)
    println(xstream.toXML(x.get))
  }
}
if XStream is sufficiently cheap to create, or:
newClientRecord.foreachPartition { xs =>
  val xstream = new XStream(new DomDriver)
  xs.foreach { x =>
    if (x.isSuccess) {
      println(xstream.toXML(x.get))
    }
  }
}
otherwise.
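Either way, the XStream instance is created on the executor that uses it, so nothing non-serializable has to be shipped from the driver; the foreachPartition variant simply builds one XStream per partition and reuses it for all records in that partition, instead of one per record.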