Spark task not serializable - scala

I've tried all the solutions to this problem that I found on StackOverflow but, despite this, I can't solve it.
I have a "MainObj" object that instantiates a "Recommendation" object. When I call the "recommendationProducts" method I always get an error.
Here is the code of the method:
def recommendationProducts(item: Int): Unit = {
  val aMatrix = new DoubleMatrix(Array(1.0, 2.0, 3.0))
  def cosineSimilarity(vec1: DoubleMatrix, vec2: DoubleMatrix): Double = {
    vec1.dot(vec2) / (vec1.norm2() * vec2.norm2())
  }
  val itemFactor = model.productFeatures.lookup(item).head
  val itemVector = new DoubleMatrix(itemFactor)
  // Here is where I get the error:
  val sims = model.productFeatures.map { case (id, factor) =>
    val factorVector = new DoubleMatrix(factor)
    val sim = cosineSimilarity(factorVector, itemVector)
    (id, sim)
  }
  val sortedSims = sims.top(10)(Ordering.by[(Int, Double), Double] {
    case (id, similarity) => similarity
  })
  println("\nTop 10 products:")
  sortedSims.map(x => (x._1, x._2)).foreach(println)
}
This is the error:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.map(RDD.scala:369)
at RecommendationObj.recommendationProducts(RecommendationObj.scala:269)
at MainObj$.analisiIUNGO(MainObj.scala:257)
at MainObj$.menu(MainObj.scala:54)
at MainObj$.main(MainObj.scala:37)
at MainObj.main(MainObj.scala)
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@7c2312fa)
- field (class: RecommendationObj, name: sc, type: class org.apache.spark.SparkContext)
- object (class MainObj$$anon$1, MainObj$$anon$1@615bad16)
- field (class: RecommendationObj$$anonfun$37, name: $outer, type: class RecommendationObj)
- object (class RecommendationObj$$anonfun$37, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 14 more
I tried to:
1) add "extends Serializable" (Scala) to my class
2) add "extends java.io.Serializable" to my class
3) add "@transient" to some parts
4) get the model (and other features) inside this class (currently I get them from another object and pass them to my class as arguments)
How can I resolve this? I'm going crazy!
Thank you in advance!

The key is here:
field (class: RecommendationObj, name: sc, type: class org.apache.spark.SparkContext)
So you have a field named sc of type SparkContext. Spark has to serialize the class instance, so it also tries to serialize all of its fields.
You should either:
- mark the field with the @transient annotation, check whether it is null, and recreate it if so, or
- not use SparkContext from a field, but pass it as a method argument. Remember, however, that you should never use SparkContext inside closures passed to map, flatMap, etc.
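The capture described above can be reproduced without a cluster. The following is a minimal sketch, not the asker's actual classes: FakeContext, Recommender and ClosureCheck are hypothetical names, FakeContext stands in for the non-serializable SparkContext, and plain Java serialization stands in for the check Spark performs before shipping a task. It shows that a closure reading a field drags the whole enclosing instance (and its context field) along, while a closure over a local copy does not.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for SparkContext: deliberately NOT Serializable.
class FakeContext

// Hypothetical class holding a non-serializable field, like `sc` in RecommendationObj.
class Recommender(val ctx: FakeContext, val offset: Double) extends Serializable {
  // `offset` is read as this.offset, so the lambda captures `this` and with it `ctx`.
  def badClosure: Double => Double = x => x + offset

  // The field is copied into a local val first, so the lambda captures only a Double.
  def goodClosure: Double => Double = {
    val localOffset = offset
    x => x + localOffset
  }
}

object ClosureCheck {
  // Returns true if Java serialization accepts the object, false if it hits
  // a non-serializable value anywhere in the object graph.
  def isSerializable(obj: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try {
      out.writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    } finally {
      out.close()
    }
  }
}
```

With `val r = new Recommender(new FakeContext, 1.0)`, `ClosureCheck.isSerializable(r.badClosure)` is false and `ClosureCheck.isSerializable(r.goodClosure)` is true. In the question's code the closure calls cosineSimilarity, a method nested inside recommendationProducts; depending on the Scala version, calling it may likewise require a reference to the enclosing RecommendationObj, so moving it into a standalone object is one option worth trying.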

Related

Spark Dataframe stat throwing Task not serializable

What am I trying to do? (Context)
I'm trying to calculate some stats for a DataFrame/Dataset in Spark that is read from a directory of .parquet files about US flights between 2013 and 2015. To be more specific, I'm using the approxQuantile method of DataFrameStatFunctions, which can be accessed by calling the stat method on a Dataset. See the documentation.
import airportCaseStudy.model.Flight
import org.apache.spark.sql.SparkSession
object CaseStudy {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession
.builder
.master("local[*]")
.getOrCreate
val sc = spark.sparkContext
sc.setLogLevel("ERROR")
import spark.sqlContext.implicits._
val flights = spark
.read
.parquet("C:\\Users\\Bluetab\\IdeaProjects\\GraphFramesSparkPlayground\\src\\resources\\flights")
.as[Flight]
flights.show()
flights.printSchema()
flights.describe("year", "flightEpochSeconds").show()
val approxQuantiles = flights.stat
.approxQuantile(Array("year", "flightEpochSeconds"), Array(0.25, 0.5, 0.75), 0.25)
// whatever...
}
}
Flight is simply a case class.
package airportCaseStudy.model
case class Flight(year: Int, quarter: Int, month: Int, dayOfMonth: Int, dayOfWeek: Int, flightDate: String,
uniqueCarrier: String, airlineID: String, carrier: String, tailNum: String, flightNum: Int,
originAirportID: String, origin: String, originCityName: String, dstAirportID: String,
dst: String, dstCityName: String, taxiOut: Float, taxiIn: Float, cancelled: Boolean,
diverted: Float, actualETMinutes: Float, airTimeMinutes: Float, distanceMiles: Float, flightEpochSeconds: Long)
What's the issue?
I'm using Spark 2.4.0.
When executing val approxQuantiles = flights.stat.approxQuantile(Array("year", "flightEpochSeconds"), Array(0.25, 0.5, 0.75), 0.25), the call fails because some task cannot be serialized. I spent some time going through the following links, but I'm not able to figure out why this exception is thrown.
Find quantiles and mean using spark (python and scala)
Statistical and Mathematical functions with DF in Spark from Databricks
Exception
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:393)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
at org.apache.spark.rdd.PairRDDFunctions.$anonfun$combineByKeyWithClassTag$1(PairRDDFunctions.scala:88)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.combineByKeyWithClassTag(PairRDDFunctions.scala:77)
at org.apache.spark.rdd.PairRDDFunctions.$anonfun$foldByKey$1(PairRDDFunctions.scala:222)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.foldByKey(PairRDDFunctions.scala:211)
at org.apache.spark.rdd.RDD.$anonfun$treeAggregate$1(RDD.scala:1158)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1137)
at org.apache.spark.sql.execution.stat.StatFunctions$.multipleApproxQuantiles(StatFunctions.scala:102)
at org.apache.spark.sql.DataFrameStatFunctions.approxQuantile(DataFrameStatFunctions.scala:104)
at airportCaseStudy.CaseStudy$.main(CaseStudy.scala:27)
at airportCaseStudy.CaseStudy.main(CaseStudy.scala)
Caused by: java.io.NotSerializableException: scala.runtime.LazyRef
Serialization stack:
- object not serializable (class: scala.runtime.LazyRef, value: LazyRef thunk)
- element of array (index: 2)
- array (class [Ljava.lang.Object;, size 3)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.rdd.PairRDDFunctions, functionalInterfaceMethod=scala/Function0.apply:()Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/rdd/PairRDDFunctions.$anonfun$foldByKey$2:(Lorg/apache/spark/rdd/PairRDDFunctions;[BLscala/runtime/LazyRef;)Ljava/lang/Object;, instantiatedMethodType=()Ljava/lang/Object;, numCaptured=3])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class org.apache.spark.rdd.PairRDDFunctions$$Lambda$2158/61210602, org.apache.spark.rdd.PairRDDFunctions$$Lambda$2158/61210602@165a5979)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 2)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.rdd.PairRDDFunctions, functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/rdd/PairRDDFunctions.$anonfun$foldByKey$3:(Lscala/Function0;Lscala/Function2;Ljava/lang/Object;)Ljava/lang/Object;, instantiatedMethodType=(Ljava/lang/Object;)Ljava/lang/Object;, numCaptured=2])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class org.apache.spark.rdd.PairRDDFunctions$$Lambda$2159/758750856, org.apache.spark.rdd.PairRDDFunctions$$Lambda$2159/758750856@6a6e410c)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:400)
... 22 more
I appreciate any help you can provide.
Add "extends Serializable" to your class or object.
// works the same way for an object: object Test extends Serializable { ... }
class Test extends Serializable {
  // your code here
}

DataFrame using UDF giving Task not serializable Exception

I'm trying to use the show() method on a DataFrame, but it throws a Task not serializable exception.
I have tried making the object extend Serializable, but the error persists.
object App extends Serializable{
def main(args: Array[String]): Unit = {
Logger.getLogger("org.apache").setLevel(Level.WARN);
val spark = SparkSession.builder()
.appName("LearningSpark")
.master("local[*]")
.getOrCreate()
val sc = spark.sparkContext
val inputPath = "./src/resources/2015-03-01-0.json"
val ghLog = spark.read.json(inputPath)
val pushes = ghLog.filter("type = 'PushEvent'")
val grouped = pushes.groupBy("actor.login").count
val ordered = grouped.orderBy(grouped("count").desc)
ordered.show(5)
val empPath = "./src/resources/ghEmployees.txt"
val employees = Set() ++ (
for {
line <- fromFile(empPath).getLines
} yield line.trim)
val bcEmployees = sc.broadcast(employees)
import spark.implicits._
val isEmp = user => bcEmployees.value.contains(user)
val isEmployee = spark.udf.register("SetContainsUdf", isEmp)
val filtered = ordered.filter(isEmployee($"login"))
filtered.show()
}
}
Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
19/09/01 10:21:48 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:393)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$1(RDD.scala:850)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:849)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:630)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:92)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.$anonfun$doExecute$1(ShuffleExchangeExec.scala:128)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:391)
at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:151)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:627)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:136)
at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3383)
at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2544)
at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3364)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2544)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2758)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
at org.apache.spark.sql.Dataset.show(Dataset.scala:745)
at org.apache.spark.sql.Dataset.show(Dataset.scala:704)
at org.apache.spark.sql.Dataset.show(Dataset.scala:713)
at App$.main(App.scala:33)
at App.main(App.scala)
Caused by: java.io.NotSerializableException: scala.runtime.LazyRef
Serialization stack:
- object not serializable (class: scala.runtime.LazyRef, value: LazyRef thunk)
- element of array (index: 2)
- array (class [Ljava.lang.Object;, size 3)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.sql.catalyst.expressions.ScalaUDF, functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/sql/catalyst/expressions/ScalaUDF.$anonfun$f$2:(Lscala/Function1;Lorg/apache/spark/sql/catalyst/expressions/Expression;Lscala/runtime/LazyRef;Lorg/apache/spark/sql/catalyst/InternalRow;)Ljava/lang/Object;, instantiatedMethodType=(Lorg/apache/spark/sql/catalyst/InternalRow;)Ljava/lang/Object;, numCaptured=3])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class org.apache.spark.sql.catalyst.expressions.ScalaUDF$$Lambda$2364/2031154005, org.apache.spark.sql.catalyst.expressions.ScalaUDF$$Lambda$2364/2031154005@1fd37440)
- field (class: org.apache.spark.sql.catalyst.expressions.ScalaUDF, name: f, type: interface scala.Function1)
- object (class org.apache.spark.sql.catalyst.expressions.ScalaUDF, UDF:SetContainsUdf(actor#6.login))
- writeObject data (class: scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.List$SerializationProxy, scala.collection.immutable.List$SerializationProxy@3b65084e)
- writeReplace data (class: scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.$colon$colon, List(isnotnull(type#13), (type#13 = PushEvent), UDF:SetContainsUdf(actor#6.login)))
- field (class: org.apache.spark.sql.execution.FileSourceScanExec, name: dataFilters, type: interface scala.collection.Seq)
- object (class org.apache.spark.sql.execution.FileSourceScanExec, FileScan json [actor#6,type#13] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/C:/Users/abhaydub/Scala-Spark-workspace/LearningSpark/src/resources/2015-..., PartitionFilters: [], PushedFilters: [IsNotNull(type), EqualTo(type,PushEvent)], ReadSchema: struct<actor:struct<avatar_url:string,gravatar_id:string,id:bigint,login:string,url:string>,type:...
)
- field (class: org.apache.spark.sql.execution.FilterExec, name: child, type: class org.apache.spark.sql.execution.SparkPlan)
- object (class org.apache.spark.sql.execution.FilterExec, Filter ((isnotnull(type#13) && (type#13 = PushEvent)) && UDF:SetContainsUdf(actor#6.login))
+- FileScan json [actor#6,type#13] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/C:/Users/abhaydub/Scala-Spark-workspace/LearningSpark/src/resources/2015-..., PartitionFilters: [], PushedFilters: [IsNotNull(type), EqualTo(type,PushEvent)], ReadSchema: struct<actor:struct<avatar_url:string,gravatar_id:string,id:bigint,login:string,url:string>,type:...
)
- field (class: org.apache.spark.sql.execution.ProjectExec, name: child, type: class org.apache.spark.sql.execution.SparkPlan)
- object (class org.apache.spark.sql.execution.ProjectExec, Project [actor#6]
+- Filter ((isnotnull(type#13) && (type#13 = PushEvent)) && UDF:SetContainsUdf(actor#6.login))
+- FileScan json [actor#6,type#13] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/C:/Users/abhaydub/Scala-Spark-workspace/LearningSpark/src/resources/2015-..., PartitionFilters: [], PushedFilters: [IsNotNull(type), EqualTo(type,PushEvent)], ReadSchema: struct<actor:struct<avatar_url:string,gravatar_id:string,id:bigint,login:string,url:string>,type:...
)
- field (class: org.apache.spark.sql.execution.aggregate.HashAggregateExec, name: child, type: class org.apache.spark.sql.execution.SparkPlan)
- object (class org.apache.spark.sql.execution.aggregate.HashAggregateExec, HashAggregate(keys=[actor#6.login AS actor#6.login#53], functions=[partial_count(1)], output=[actor#6.login#53, count#43L])
+- Project [actor#6]
+- Filter ((isnotnull(type#13) && (type#13 = PushEvent)) && UDF:SetContainsUdf(actor#6.login))
+- FileScan json [actor#6,type#13] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/C:/Users/abhaydub/Scala-Spark-workspace/LearningSpark/src/resources/2015-..., PartitionFilters: [], PushedFilters: [IsNotNull(type), EqualTo(type,PushEvent)], ReadSchema: struct<actor:struct<avatar_url:string,gravatar_id:string,id:bigint,login:string,url:string>,type:...
)
+------------------+-----+
|             login|count|
+------------------+-----+
|      greatfirebot|  192|
|diversify-exp-user|  146|
|     KenanSulayman|   72|
|        manuelrp07|   45|
|    mirror-updates|   42|
+------------------+-----+
only showing top 5 rows
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 14)
- element of array (index: 1)
- array (class [Ljava.lang.Object;, size 3)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.sql.execution.WholeStageCodegenExec, functionalInterfaceMethod=scala/Function2.apply:(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/sql/execution/WholeStageCodegenExec.$anonfun$doExecute$4$adapted:(Lorg/apache/spark/sql/catalyst/expressions/codegen/CodeAndComment;[Ljava/lang/Object;Lorg/apache/spark/sql/execution/metric/SQLMetric;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, instantiatedMethodType=(Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, numCaptured=3])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$1297/815648243, org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$1297/815648243@27438750)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:400)
... 48 more
I had spark 2.4.4 with Scala "2.12.1". I encountered the same issue (object not serializable (class: scala.runtime.LazyRef, value: LazyRef thunk)) and it was driving me crazy. I changed Scala version to "2.12.10" and the issue is solved now!
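A minimal build.sbt sketch of that version bump, assuming an sbt project that depends on spark-sql. The two version numbers are the ones named in this answer; everything else about the build is unknown and left out.

```scala
// build.sbt (sketch): pin the Scala patch version named above
scalaVersion := "2.12.10"

// %% appends the Scala binary version (_2.12) to the artifact name
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"
```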
The serialization issue is not caused by the object not being Serializable.
The object itself is not serialized and sent to the executors; it is the transformation code that is serialized.
One of the functions in the code is not serializable.
Looking at the code and the trace, isEmployee seems to be the issue.
An observation:
1. isEmployee wraps the isEmp lambda. In Spark, a UDF is created with org.apache.spark.sql.functions.udf or registered with org.apache.spark.sql.UDFRegistration#register, and the function it wraps, together with everything that function captures, has to be serializable.
I can think of two solutions:
1. Create and register the UDF correctly, so that it serializes properly
2. Avoid the UDF completely and use the broadcast variable with the filter method, as follows
val employees: Set[String] = Set("users")
val bcEmployees = sc.broadcast(employees)
val filtered = ordered.filter { x =>
  val user = x.getString(0) // assuming index 0 contains the user
  bcEmployees.value.contains(user) // access the broadcast variable inside the closure
}
filtered.show()
Serialization is one of life's mysteries, and so are some of the differences between spark-shell and Databricks notebooks (the latter being easier to work with).
https://medium.com/onzo-tech/serialization-challenges-with-spark-and-scala-a2287cd51c54 is worth consulting: it shows that adding extends Serializable at the top level is not the answer. As far as I understand, the driver ships only the relevant pieces to the executors.
If I run your code as is in a Databricks notebook, without any extends Serializable, it works fine! Until now, Databricks notebooks have always surfaced serialization issues for me. That is interesting, because I was assured that a pseudo-cluster should pick up all possible serialization issues prior to release; apparently that is not always so. Then again, a notebook is not spark-submit.
If I run it in spark-shell in two consecutive paste-mode blocks, as shown below, while 1) omitting a few things and 2) wrapping the function for your UDF in an object that extends Serializable (the UDF takes a Column, so we adhere to that), it works.
:paste 1
scala> :paste
// Entering paste mode (ctrl-D to finish)
object X extends Serializable {
  // the parameter type is required; without it the lambda does not compile
  val isEmp = (user: String) => bcEmployees.value.contains(user)
}
:paste 2
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
.appName("LearningSpark")
.master("local[*]")
.getOrCreate()
val sc = spark.sparkContext
// Register UDF
val isEmployee = udf(X.isEmp)
import scala.io.Source
import spark.implicits._
// Simulated input.
val ghLog = Seq(("john2X0", "push"), ("james09", "abc"), ("peter01", "push"), ("mary99", "push"), ("peter01", "push")).toDF("login", "type")
val pushes = ghLog.filter("type = 'push'")
val grouped = pushes.groupBy("login").count
val ordered = grouped.orderBy(grouped("count").desc)
ordered.show(5)
val emp = "/home/mapr/emp.txt"
val employees = Set() ++ (
for {
line <- Source.fromFile(emp).getLines
} yield line.trim)
val bcEmployees = sc.broadcast(employees)
val filtered = ordered.filter(isEmployee($"login"))
filtered.show()
So, the other answer says to avoid the UDF, which is more performant in some cases, but I am sticking with the UDF, which accepts a column as input and is potentially reusable. This approach works with spark-submit as well; that should be obvious, but I mention it for posterity.

java.io.NotSerializableException: org.apache.spark.streaming.StreamingContext

Ran into this error when trying to run a spark streaming application with checkpointing enabled.
java.io.NotSerializableException: DStream checkpointing has been enabled but the DStreams with their functions are not serializable
org.apache.spark.streaming.StreamingContext
Serialization stack:
- object not serializable (class: org.apache.spark.streaming.StreamingContext, value: org.apache.spark.streaming.StreamingContext@63cf0da6)
- field (class: com.sales.spark.job.streaming.SalesStream, name: streamingContext, type: class org.apache.spark.streaming.StreamingContext)
- object (class com.sales.spark.job.streaming.SalesStreamFactory$$anon$1, com.sales.spark.job.streaming.SalesStreamFactory$$anon$1@1738d3b2)
- field (class: com.sales.spark.job.streaming.SalesStream$$anonfun$runJob$1, name: $outer, type: class com.sales.spark.job.streaming.SalesStream)
- object (class com.sales.spark.job.streaming.SalesStream$$anonfun$runJob$1, <function1>)
I am trying to execute the piece of code below. I think the issue has to do with accessing the Spark session variable inside the tempTableView function.
Code
liveRecordStream
  .foreachRDD(newRDD => {
    if (!newRDD.isEmpty()) {
      val cacheRDD = newRDD.cache()
      val updTempTables = tempTableView(t2s, stgDFMap, cacheRDD)
      val rdd = updatestgDFMap(stgDFMap, cacheRDD)
      persistStgTable(stgDFMap)
      dfMap
        .filter(entry => updTempTables.contains(entry._2))
        .map(spark.sql)
        .foreach(df => writeToES(writer, df))
    }
  })
tempTableView
def tempTableView(t2s: Map[String, StructType], stgDFMap: Map[String, DataFrame], cacheRDD: RDD[cacheRDD]): Set[String] = {
stgDFMap.keys.filter { table =>
val tRDD = cacheRDD
.filter(r => r.Name == table)
.map(r => r.values)
val tDF = spark.createDataFrame(tRDD, tableNameToSchema(table))
if (!tRDD.isEmpty()) {
val tName = s"temp_$table"
tDF.createOrReplaceTempView(tName)
}
!tRDD.isEmpty()
}.toSet
}
I'm not sure how to get the Spark session variable inside this function, which is called inside foreachRDD.
I am instantiating the streamingContext as part of a different class.
class Test {
  lazy val sparkSession: SparkSession =
    SparkSession
      .builder()
      .appName("testApp")
      .config("es.nodes", SalesConfig.elasticnode)
      .config("es.port", SalesConfig.elasticport)
      .config("spark.sql.parquet.filterPushdown", parquetFilterPushDown)
      .config("spark.debug.maxToStringFields", 100000)
      .config("spark.rdd.compress", rddCompress)
      .config("spark.task.maxFailures", 25)
      .config("spark.streaming.unpersist", streamingUnPersist)
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate() // needed so the expression yields a SparkSession rather than a Builder
  lazy val streamingContext: StreamingContext = new StreamingContext(sparkSession.sparkContext, Seconds(15))
  streamingContext.checkpoint("/Users/gswaminathan/Guidewire/Java/explore-policy/checkpoint/")
}
I tried making this class extend Serializable, but no luck.

Task not serializable error when applying regex transformation to String column

I want to reformat a DataFrame's String column. From format
(p:str_1,1)(p:str_2,2) ...
to format
str_1:1|str_2:2 ...
I wrote the following code [2], but I get an org.apache.spark.SparkException: Task not serializable exception, caused by java.io.NotSerializableException: scala.util.matching.Regex$MatchIterator [3].
Can someone help me understand why Regex$MatchIterator is not serializable, and how to fix this? It seems that the Iterator returned by findAllIn(xs) is causing harm at the for-comprehension's elem <- mi [1].
Thanks a lot!
-- β
[1] If I change elem <- mi to use an arbitrary iterator (e.g. elem <- Iterator(1, 2, 3)), then there is no error and the code runs, but obviously it doesn't do what I need it to do. I also tried to get a normal iterator with findAllIn(xs).toIterator, but the same exception occurs.
[2] Code
val df = spark.sparkContext.parallelize(Seq(("(p:some_string,6)(p:some_other_string,4)", "foo"), ("(p:yet_another_string,1) ", "bar"))).toDF("my_p", "my_s")
val regexStr: String = "\\((?:p|p2)?:(?<q>.+?),(?<s>\\d+)\\)"
def _reformat(xs: String): String = {
  val re: scala.util.matching.Regex = regexStr.r
  val mi = re.findAllIn(xs)
  val d: Iterator[String] = for {
    elem <- mi
    val q: String = mi.group(1)
    val s: String = mi.group(2)
    val pair: String = s"${q}:${s}"
  } yield pair
  d.mkString("|")
}
def reformat: UserDefinedFunction = udf[String, String](_reformat)
val dfReformated: DataFrame = df
  .withColumn("my_p", reformat($"my_p"))
[3] StackTrace
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2039)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:817)
...
at org.apache.spark.sql.Dataset.head(Dataset.scala:1934)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2149)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:239)
at org.apache.spark.sql.Dataset.show(Dataset.scala:526)
at org.apache.spark.sql.Dataset.show(Dataset.scala:506)
... 46 elided
Caused by: java.io.NotSerializableException: scala.util.matching.Regex$MatchIterator
Serialization stack:
- object not serializable (class: scala.util.matching.Regex$MatchIterator, value: empty iterator)
- field (class: $iw, name: mi, type: class scala.util.matching.Regex$MatchIterator)
- object (class $iw, $iw@79303206)
- field (class: $iw, name: $iw, type: class $iw)
...
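The trace lists mi as a field of the REPL wrapper class $iw, which suggests that a MatchIterator defined at the top level of the shell session ends up in the closure. A minimal sketch of the reformatting itself follows, assuming the input and output formats stated in the question. The object name Reformat is hypothetical. It uses findAllMatchIn, so each Match carries its own groups and nothing refers back to a shared iterator; whether this alone removes the exception in the asker's session is not confirmed by the source.

```scala
import scala.util.matching.Regex

// Hypothetical holder object; a standalone object avoids dragging in an enclosing instance.
object Reformat {
  // Same pattern as in the question; group 1 is the string, group 2 the number.
  private val pattern: Regex = "\\((?:p|p2)?:(?<q>.+?),(?<s>\\d+)\\)".r

  def reformat(xs: String): String =
    pattern
      .findAllMatchIn(xs)                        // Iterator[Regex.Match]
      .map(m => s"${m.group(1)}:${m.group(2)}")  // each Match owns its groups
      .mkString("|")
}
```

For example, `Reformat.reformat("(p:some_string,6)(p:some_other_string,4)")` returns `some_string:6|some_other_string:4`. Wrapping it for a DataFrame column would presumably look like `udf(Reformat.reformat _)`, using org.apache.spark.sql.functions.udf.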

Why does custom DefaultSource give java.io.NotSerializableException?

This is my first post on SO, so my apologies if I'm using an improper format.
I'm working with Apache Spark to create a new data source (via DefaultSource, BaseRelation, etc.) and have run into a serialization problem that I would like to understand better. Consider the class below, which extends BaseRelation and implements the scan builder.
class RootTableScan(path: String, treeName: String)(@transient val sqlContext: SQLContext) extends BaseRelation with PrunedFilteredScan {
  private val att: core.SRType = {
    val reader = new RootFileReader(new java.io.File(Seq(path) head))
    val tmp =
      if (treeName == null)
        buildATT(findTree(reader.getTopDir), arrangeStreamers(reader), null)
      else
        buildATT(reader.getKey(treeName).getObject.asInstanceOf[TTree],
          arrangeStreamers(reader), null)
    tmp
  }
  // define the schema from the AST
  def schema: StructType = {
    val s = buildSparkSchema(att)
    s
  }
  // builds a scan
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // parallelize over all the files
    val r = sqlContext.sparkContext.parallelize(Seq(path), 1).
      flatMap({ fileName =>
        val reader = new RootFileReader(new java.io.File(fileName))
        // get the TTree
        /* PROBLEM !!! */
        val rootTree =
          // findTree(reader)
          if (treeName == null) findTree(reader)
          else reader.getKey(treeName).getObject.asInstanceOf[TTree]
        new RootTreeIterator(rootTree, arrangeStreamers(reader),
          requiredColumns, filters)
      })
    println("Done building Scan")
    r
  }
}
The PROBLEM comment marks where the issue happens. treeName is a val that is passed into the class through the constructor. The lambda that uses it is supposed to run on the executors, so treeName does need to be serialized and sent along with it. I would like to understand exactly why the code snippet below causes this NotSerializableException. I know for sure that without treeName in it, everything works fine.
val rootTree =
// findTree(reader)
if (treeName == null) findTree(reader)
else reader.getKey(treeName).getObject.asInstanceOf[TTree]
Below is the stack trace:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2056)
at org.apache.spark.rdd.RDD$$anonfun$flatMap$1.apply(RDD.scala:375)
at org.apache.spark.rdd.RDD$$anonfun$flatMap$1.apply(RDD.scala:374)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.flatMap(RDD.scala:374)
at org.dianahep.sparkroot.package$RootTableScan.buildScan(sparkroot.scala:95)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$8.apply(DataSourceStrategy.scala:260)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$8.apply(DataSourceStrategy.scala:260)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:303)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:302)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:379)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:298)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:256)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:61)
at org.apache.spark.sql.execution.SparkPlanner.plan(SparkPlanner.scala:47)
at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:51)
at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:48)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48)
at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:78)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:76)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:83)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:83)
at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2572)
at org.apache.spark.sql.Dataset.head(Dataset.scala:1934)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2149)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:239)
at org.apache.spark.sql.Dataset.show(Dataset.scala:526)
at org.apache.spark.sql.Dataset.show(Dataset.scala:486)
at org.apache.spark.sql.Dataset.show(Dataset.scala:495)
... 50 elided
Caused by: java.io.NotSerializableException: org.dianahep.sparkroot.package$RootTableScan
Serialization stack:
- object not serializable (class: org.dianahep.sparkroot.package$RootTableScan, value: org.dianahep.sparkroot.package$RootTableScan@6421e9e7)
- field (class: org.dianahep.sparkroot.package$RootTableScan$$anonfun$1, name: $outer, type: class org.dianahep.sparkroot.package$RootTableScan)
- object (class org.dianahep.sparkroot.package$RootTableScan$$anonfun$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
From the stack trace I think I can deduce that Spark tries to serialize my lambda and cannot. The lambda should be a closure, since it uses a val that is defined outside the lambda's scope, but I don't understand why it cannot be serialized.
Any help would be really appreciated!!!
Thanks a lot!
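Any time a Scala closure references a member of the enclosing class, such as treeName, it is really referencing this.treeName. The closure therefore captures the enclosing instance (the $outer field in your serialization stack), and Spark has to serialize that instance along with the closure. Your class RootTableScan is not serializable, though! The solution is to copy the field into a local string variable and use that inside the closure: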
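Here is a minimal sketch of the effect that runs without Spark. It is an illustration under assumptions, not the code from the question: Scan is a hypothetical stand-in for RootTableScan, and ClosureCheck.isSerializable is a hypothetical helper that only approximates Spark's check by attempting plain Java serialization, which is what the JavaSerializerInstance frames in the stack trace suggest Spark does.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for RootTableScan: note that it is NOT Serializable.
class Scan(val treeName: String) {
  // Uses the field, i.e. this.treeName, so the closure captures `this`.
  def capturesThis: String => Boolean =
    key => key == treeName

  // Copies the field into a local val first, so only the String is captured.
  def capturesLocal: String => Boolean = {
    val localTreeName = treeName
    key => key == localTreeName
  }
}

object ClosureCheck {
  // Hypothetical helper: try to Java-serialize the object and report whether it worked.
  def isSerializable(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

With val scan = new Scan("events"), ClosureCheck.isSerializable(scan.capturesThis) should come back false, because serializing the closure drags in the non-serializable Scan instance, while ClosureCheck.isSerializable(scan.capturesLocal) should come back true, because only the String travels with the closure. Applied to the buildScan method from the question, the same local-copy fix looks like this: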
// builds a scan
def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
val localTreeName = treeName // this is safe to serialize
// parallelize over all the files
val r = sqlContext.sparkContext.parallelize(Seq(path), 1).
flatMap({fileName =>
val reader = new RootFileReader(new java.io.File(fileName))
// get the TTree
/* PROBLEM !!! */
val rootTree =
// findTree(reader)
if (localTreeName == null) findTree(reader)
else reader.getKey(localTreeName).getObject.asInstanceOf[TTree]
new RootTreeIterator(rootTree, arrangeStreamers(reader),
requiredColumns, filters)
})