java.io.NotSerializableException: org.apache.spark.streaming.StreamingContext - scala

I ran into this error when trying to run a Spark Streaming application with checkpointing enabled.
java.io.NotSerializableException: DStream checkpointing has been enabled but the DStreams with their functions are not serializable
org.apache.spark.streaming.StreamingContext
Serialization stack:
- object not serializable (class: org.apache.spark.streaming.StreamingContext, value: org.apache.spark.streaming.StreamingContext#63cf0da6)
- field (class: com.sales.spark.job.streaming.SalesStream, name: streamingContext, type: class org.apache.spark.streaming.StreamingContext)
- object (class com.sales.spark.job.streaming.SalesStreamFactory$$anon$1, com.sales.spark.job.streaming.SalesStreamFactory$$anon$1#1738d3b2)
- field (class: com.sales.spark.job.streaming.SalesStream$$anonfun$runJob$1, name: $outer, type: class com.sales.spark.job.streaming.SalesStream)
- object (class com.sales.spark.job.streaming.SalesStream$$anonfun$runJob$1, <function1>)
The error is thrown when executing the piece of code below. I think the issue has to do with trying to access the SparkSession variable inside the tempTableView function.
Code
liveRecordStream
  .foreachRDD(newRDD => {
    if (!newRDD.isEmpty()) {
      val cacheRDD = newRDD.cache()
      val updTempTables = tempTableView(t2s, stgDFMap, cacheRDD)
      val rdd = updatestgDFMap(stgDFMap, cacheRDD)
      persistStgTable(stgDFMap)
      dfMap
        .filter(entry => updTempTables.contains(entry._2))
        .map(spark.sql)
        .foreach(df => writeToES(writer, df))
    }
  })
tempTableView
def tempTableView(t2s: Map[String, StructType], stgDFMap: Map[String, DataFrame], cacheRDD: RDD[cacheRDD]): Set[String] = {
  stgDFMap.keys.filter { table =>
    val tRDD = cacheRDD
      .filter(r => r.Name == table)
      .map(r => r.values)
    val tDF = spark.createDataFrame(tRDD, tableNameToSchema(table))
    if (!tRDD.isEmpty()) {
      val tName = s"temp_$table"
      tDF.createOrReplaceTempView(tName)
    }
    !tRDD.isEmpty()
  }.toSet
}
I am not sure how to get the SparkSession variable inside this function, which is called inside foreachRDD.
I am instantiating the StreamingContext in a different class:
class Test {
  lazy val sparkSession: SparkSession =
    SparkSession
      .builder()
      .appName("testApp")
      .config("es.nodes", SalesConfig.elasticnode)
      .config("es.port", SalesConfig.elasticport)
      .config("spark.sql.parquet.filterPushdown", parquetFilterPushDown)
      .config("spark.debug.maxToStringFields", 100000)
      .config("spark.rdd.compress", rddCompress)
      .config("spark.task.maxFailures", 25)
      .config("spark.streaming.unpersist", streamingUnPersist)
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()

  lazy val streamingContext: StreamingContext = new StreamingContext(sparkSession.sparkContext, Seconds(15))
  streamingContext.checkpoint("/Users/gswaminathan/Guidewire/Java/explore-policy/checkpoint/")
}
I tried making this class extend Serializable, but no luck.
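For reference, a minimal sketch of one common workaround (not necessarily the fix that was eventually used here), assuming liveRecordStream from the code above is in scope: derive the session from the RDD that foreachRDD hands you instead of capturing the outer class's streamingContext/sparkSession fields. The helper name sessionFor is illustrative.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical helper: reuse/rebuild the session from the RDD's own SparkContext.
def sessionFor(rdd: RDD[_]): SparkSession =
  SparkSession.builder().config(rdd.sparkContext.getConf).getOrCreate()

liveRecordStream.foreachRDD { newRDD =>
  if (!newRDD.isEmpty()) {
    val spark = sessionFor(newRDD) // a local val, so no non-serializable outer field is captured
    // ... pass `spark` down to tempTableView and friends instead of referencing the field ...
  }
}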

Related

DataFrame using UDF giving Task not serializable Exception

I am trying to use the show() method on a DataFrame, but it throws a Task not serializable exception.
I have tried extending Serializable on the object, but the error persists.
// imports assumed by the snippet
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import scala.io.Source.fromFile

object App extends Serializable {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache").setLevel(Level.WARN)
    val spark = SparkSession.builder()
      .appName("LearningSpark")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val inputPath = "./src/resources/2015-03-01-0.json"
    val ghLog = spark.read.json(inputPath)
    val pushes = ghLog.filter("type = 'PushEvent'")
    val grouped = pushes.groupBy("actor.login").count
    val ordered = grouped.orderBy(grouped("count").desc)
    ordered.show(5)

    val empPath = "./src/resources/ghEmployees.txt"
    val employees = Set() ++ (
      for {
        line <- fromFile(empPath).getLines
      } yield line.trim)
    val bcEmployees = sc.broadcast(employees)

    import spark.implicits._
    val isEmp = (user: String) => bcEmployees.value.contains(user)
    val isEmployee = spark.udf.register("SetContainsUdf", isEmp)
    val filtered = ordered.filter(isEmployee($"login"))
    filtered.show()
  }
}
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/09/01 10:21:48 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
+------------------+-----+
|             login|count|
+------------------+-----+
|      greatfirebot|  192|
|diversify-exp-user|  146|
|     KenanSulayman|   72|
|        manuelrp07|   45|
|    mirror-updates|   42|
+------------------+-----+
only showing top 5 rows
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:393)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$1(RDD.scala:850)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:849)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:630)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:92)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.$anonfun$doExecute$1(ShuffleExchangeExec.scala:128)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:391)
at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:151)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:627)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:136)
at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3383)
at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2544)
at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3364)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2544)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2758)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
at org.apache.spark.sql.Dataset.show(Dataset.scala:745)
at org.apache.spark.sql.Dataset.show(Dataset.scala:704)
at org.apache.spark.sql.Dataset.show(Dataset.scala:713)
at App$.main(App.scala:33)
at App.main(App.scala)
Caused by: java.io.NotSerializableException: scala.runtime.LazyRef
Serialization stack:
- object not serializable (class: scala.runtime.LazyRef, value: LazyRef thunk)
- element of array (index: 2)
- array (class [Ljava.lang.Object;, size 3)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.sql.catalyst.expressions.ScalaUDF, functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/sql/catalyst/expressions/ScalaUDF.$anonfun$f$2:(Lscala/Function1;Lorg/apache/spark/sql/catalyst/expressions/Expression;Lscala/runtime/LazyRef;Lorg/apache/spark/sql/catalyst/InternalRow;)Ljava/lang/Object;, instantiatedMethodType=(Lorg/apache/spark/sql/catalyst/InternalRow;)Ljava/lang/Object;, numCaptured=3])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class org.apache.spark.sql.catalyst.expressions.ScalaUDF$$Lambda$2364/2031154005, org.apache.spark.sql.catalyst.expressions.ScalaUDF$$Lambda$2364/2031154005#1fd37440)
- field (class: org.apache.spark.sql.catalyst.expressions.ScalaUDF, name: f, type: interface scala.Function1)
- object (class org.apache.spark.sql.catalyst.expressions.ScalaUDF, UDF:SetContainsUdf(actor#6.login))
- writeObject data (class: scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.List$SerializationProxy, scala.collection.immutable.List$SerializationProxy#3b65084e)
- writeReplace data (class: scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.$colon$colon, List(isnotnull(type#13), (type#13 = PushEvent), UDF:SetContainsUdf(actor#6.login)))
- field (class: org.apache.spark.sql.execution.FileSourceScanExec, name: dataFilters, type: interface scala.collection.Seq)
- object (class org.apache.spark.sql.execution.FileSourceScanExec, FileScan json [actor#6,type#13] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/C:/Users/abhaydub/Scala-Spark-workspace/LearningSpark/src/resources/2015-..., PartitionFilters: [], PushedFilters: [IsNotNull(type), EqualTo(type,PushEvent)], ReadSchema: struct<actor:struct<avatar_url:string,gravatar_id:string,id:bigint,login:string,url:string>,type:...
)
- field (class: org.apache.spark.sql.execution.FilterExec, name: child, type: class org.apache.spark.sql.execution.SparkPlan)
- object (class org.apache.spark.sql.execution.FilterExec, Filter ((isnotnull(type#13) && (type#13 = PushEvent)) && UDF:SetContainsUdf(actor#6.login))
+- FileScan json [actor#6,type#13] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/C:/Users/abhaydub/Scala-Spark-workspace/LearningSpark/src/resources/2015-..., PartitionFilters: [], PushedFilters: [IsNotNull(type), EqualTo(type,PushEvent)], ReadSchema: struct<actor:struct<avatar_url:string,gravatar_id:string,id:bigint,login:string,url:string>,type:...
)
- field (class: org.apache.spark.sql.execution.ProjectExec, name: child, type: class org.apache.spark.sql.execution.SparkPlan)
- object (class org.apache.spark.sql.execution.ProjectExec, Project [actor#6]
+- Filter ((isnotnull(type#13) && (type#13 = PushEvent)) && UDF:SetContainsUdf(actor#6.login))
+- FileScan json [actor#6,type#13] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/C:/Users/abhaydub/Scala-Spark-workspace/LearningSpark/src/resources/2015-..., PartitionFilters: [], PushedFilters: [IsNotNull(type), EqualTo(type,PushEvent)], ReadSchema: struct<actor:struct<avatar_url:string,gravatar_id:string,id:bigint,login:string,url:string>,type:...
)
- field (class: org.apache.spark.sql.execution.aggregate.HashAggregateExec, name: child, type: class org.apache.spark.sql.execution.SparkPlan)
- object (class org.apache.spark.sql.execution.aggregate.HashAggregateExec, HashAggregate(keys=[actor#6.login AS actor#6.login#53], functions=[partial_count(1)], output=[actor#6.login#53, count#43L])
+- Project [actor#6]
+- Filter ((isnotnull(type#13) && (type#13 = PushEvent)) && UDF:SetContainsUdf(actor#6.login))
+- FileScan json [actor#6,type#13] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/C:/Users/abhaydub/Scala-Spark-workspace/LearningSpark/src/resources/2015-..., PartitionFilters: [], PushedFilters: [IsNotNull(type), EqualTo(type,PushEvent)], ReadSchema: struct<actor:struct<avatar_url:string,gravatar_id:string,id:bigint,login:string,url:string>,type:...
)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 14)
- element of array (index: 1)
- array (class [Ljava.lang.Object;, size 3)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.sql.execution.WholeStageCodegenExec, functionalInterfaceMethod=scala/Function2.apply:(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/sql/execution/WholeStageCodegenExec.$anonfun$doExecute$4$adapted:(Lorg/apache/spark/sql/catalyst/expressions/codegen/CodeAndComment;[Ljava/lang/Object;Lorg/apache/spark/sql/execution/metric/SQLMetric;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, instantiatedMethodType=(Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, numCaptured=3])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$1297/815648243, org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$1297/815648243#27438750)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:400)
... 48 more
I was on Spark 2.4.4 with Scala 2.12.1. I hit the same issue (object not serializable (class: scala.runtime.LazyRef, value: LazyRef thunk)) and it was driving me crazy. I changed the Scala version to 2.12.10 and the issue is now solved!
The serialization issue is not caused by the object itself not being Serializable. The object is not what gets serialized and sent to the executors; it is the transformation code that is serialized, and one of the functions used in that code is not serializable.
Looking at the code and the trace, isEmployee seems to be the issue.
One observation: isEmployee is not a UDF. In Spark, a UDF needs to be created by extending org.apache.spark.sql.expressions.UserDefinedFunction, which is Serializable, and after defining the function it needs to be registered using org.apache.spark.sql.UDFRegistration#register.
I can think of two solutions:
1. Create and register the UDF properly, so that serialization happens correctly (see the sketch after the code below).
2. Avoid the UDF entirely and use a broadcast variable with the filter method, as follows:
val employees: Set[String] = Set("users")
val bcEmployees = sc.broadcast(employees)

val filtered = ordered.filter { x =>
  val user = x.getString(0)        // assuming 0th index contains user
  bcEmployees.value.contains(user) // access broadcast variable in closure
}
filtered.show()
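For solution 1, here is a minimal sketch of what creating and registering the UDF properly could look like, assuming spark, sc and ordered from the question are in scope (the names bcEmployees and isEmployeeUdf are illustrative):

import org.apache.spark.sql.functions.{col, udf}
import scala.io.Source

// Broadcast the lookup set once; the UDF closure only captures the broadcast handle.
val employees: Set[String] = Source.fromFile("./src/resources/ghEmployees.txt").getLines().map(_.trim).toSet
val bcEmployees = sc.broadcast(employees)

// udf() wraps a plain, fully typed Scala function into a serializable column expression.
val isEmployeeUdf = udf((user: String) => bcEmployees.value.contains(user))
// Alternatively, register it by name: spark.udf.register("SetContainsUdf", (user: String) => bcEmployees.value.contains(user))

val filtered = ordered.filter(isEmployeeUdf(col("login")))
filtered.show()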
Life is full of mysteries; serialization is one of them, as are some of the differences between the spark-shell and Databricks notebooks, which are more forgiving.
https://medium.com/onzo-tech/serialization-challenges-with-spark-and-scala-a2287cd51c54 is worth consulting to see that a top-level extends Serializable is not the answer here; as far as I understand, the driver ships only the relevant pieces to the executors.
If I run your code as is in a Databricks notebook, without any extends Serializable, it works fine! In the past I have always been able to reproduce serialization issues in Databricks notebooks, so this is interesting: in a pseudo-cluster one would expect to catch all possible serialization issues before release, but apparently not always. Then again, a notebook is not spark-submit.
If I run it in the spark-shell with two consecutive "paste modes" (or line by line) as shown below, it works, provided I 1) omit a few things and 2) wrap your UDF in an object that extends Serializable; the UDF takes a Column, so we adhere to that.
:paste 1
scala> :paste
// Entering paste mode (ctrl-D to finish)
object X extends Serializable {
  val isEmp = (user: String) => bcEmployees.value.contains(user)
}
:paste 2
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder()
  .appName("LearningSpark")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Register UDF
val isEmployee = udf(X.isEmp)

import scala.io.Source
import spark.implicits._

// Simulated input.
val ghLog = Seq(("john2X0", "push"), ("james09", "abc"), ("peter01", "push"), ("mary99", "push"), ("peter01", "push")).toDF("login", "type")

val pushes = ghLog.filter("type = 'push'")
val grouped = pushes.groupBy("login").count
val ordered = grouped.orderBy(grouped("count").desc)
ordered.show(5)

val emp = "/home/mapr/emp.txt"
val employees = Set() ++ (
  for {
    line <- Source.fromFile(emp).getLines
  } yield line.trim)
val bcEmployees = sc.broadcast(employees)

val filtered = ordered.filter(isEmployee($"login"))
filtered.show()
So the other answer says to avoid the UDF, which can be more performant in some cases, but I am sticking with the UDF since it takes a Column as input and is potentially reusable. This approach works with spark-submit as well, although that should be obvious; mentioned for posterity.

java.io.NotSerializableException with Spark Streaming Checkpoint enabled

I have enabled checkpointing in my spark streaming application and encounter this error on a class that is downloaded as a dependency.
With no checkpointing the application works great.
Error:
java.io.NotSerializableException: com.fasterxml.jackson.module.paranamer.shaded.CachingParanamer
Serialization stack:
- object not serializable (class: com.fasterxml.jackson.module.paranamer.shaded.CachingParanamer, value: com.fasterxml.jackson.module.paranamer.shaded.CachingParanamer#46c7c593)
- field (class: com.fasterxml.jackson.module.paranamer.ParanamerAnnotationIntrospector, name: _paranamer, type: interface com.fasterxml.jackson.module.paranamer.shaded.Paranamer)
- object (class com.fasterxml.jackson.module.paranamer.ParanamerAnnotationIntrospector, com.fasterxml.jackson.module.paranamer.ParanamerAnnotationIntrospector#39d62e47)
- field (class: com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair, name: _secondary, type: class com.fasterxml.jackson.databind.AnnotationIntrospector)
- object (class com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair, com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair#7a925ac4)
- field (class: com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair, name: _primary, type: class com.fasterxml.jackson.databind.AnnotationIntrospector)
- object (class com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair, com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair#203b98cf)
- field (class: com.fasterxml.jackson.databind.cfg.BaseSettings, name: _annotationIntrospector, type: class com.fasterxml.jackson.databind.AnnotationIntrospector)
- object (class com.fasterxml.jackson.databind.cfg.BaseSettings, com.fasterxml.jackson.databind.cfg.BaseSettings#78c34153)
- field (class: com.fasterxml.jackson.databind.cfg.MapperConfig, name: _base, type: class com.fasterxml.jackson.databind.cfg.BaseSettings)
- object (class com.fasterxml.jackson.databind.DeserializationConfig, com.fasterxml.jackson.databind.DeserializationConfig#2df0a4c3)
- field (class: com.fasterxml.jackson.databind.ObjectMapper, name: _deserializationConfig, type: class com.fasterxml.jackson.databind.DeserializationConfig)
- object (class com.fasterxml.jackson.databind.ObjectMapper, com.fasterxml.jackson.databind.ObjectMapper#2db07651)
I am not sure how to make this class serializable, since it is a Maven dependency. I am using v2.6.0 of jackson-core in my pom.xml. If I try to use a newer version of Jackson core, I get an Incompatible Jackson version exception.
Code
liveRecordStream
  .foreachRDD(newRDD => {
    if (!newRDD.isEmpty()) {
      val cacheRDD = newRDD.cache()
      val updTempTables = tempTableView(t2s, stgDFMap, cacheRDD)
      val rdd = updatestgDFMap(stgDFMap, cacheRDD)
      persistStgTable(stgDFMap)
      dfMap
        .filter(entry => updTempTables.contains(entry._2))
        .map(spark.sql)
        .foreach(df => writeToES(writer, df))
      cacheRDD.unpersist()
    }
  })
The issue happens only when a method such as tempTableView is called inside foreachRDD.
tempTableView
def tempTableView(t2s: Map[String, StructType], stgDFMap: Map[String, DataFrame], cacheRDD: RDD[cacheRDD]): Set[String] = {
  stgDFMap.keys.filter { table =>
    val tRDD = cacheRDD
      .filter(r => r.Name == table)
      .map(r => r.values)
    val tDF = spark.createDataFrame(tRDD, tableNameToSchema(table))
    if (!tRDD.isEmpty()) {
      val tName = s"temp_$table"
      tDF.createOrReplaceTempView(tName)
    }
    !tRDD.isEmpty()
  }.toSet
}
Any help is appreciated; I am not sure how to debug this and fix the issue.
From the code snippet you shared, I don't see where the Jackson library is invoked. However, NotSerializableException usually happens when you try to send an object that doesn't implement the Serializable interface over the wire.
Spark is a distributed processing engine: there is a driver and there are multiple executors across nodes, and only the part of the code that needs to be computed on the executors is sent to them by the driver (over the wire). Spark transformations execute that way, i.e. across multiple nodes, so if you pass an instance of a class that doesn't implement Serializable into such a code block (a block that executes across nodes), it will throw NotSerializableException.
Ex:
def main(args: Array[String]): Unit = {
  val gson: Gson = new Gson()
  val sparkConf = new SparkConf().setMaster("local[2]")
  val spark = SparkSession.builder().config(sparkConf).getOrCreate()
  val rdd = spark.sparkContext.parallelize(Seq("0", "1"))
  val something = rdd.map(str => {
    gson.toJson(str)
  })
  something.foreach(println)
  spark.close()
}
This code block will throw NotSerializableException because we are sending an instance of Gson to a distributed function. map is a Spark transformation operation so it will execute on executors. The following will work:
def main(args: Array[String]): Unit = {
  val sparkConf = new SparkConf().setMaster("local[2]")
  val spark = SparkSession.builder().config(sparkConf).getOrCreate()
  val rdd = spark.sparkContext.parallelize(Seq("0", "1"))
  val something = rdd.map(str => {
    val gson: Gson = new Gson()
    gson.toJson(str)
  })
  something.foreach(println)
  spark.close()
}
The reason the above works is that we instantiate Gson inside the transformation, so it is created on the executor; it is never sent from the driver over the wire, and therefore no serialization is needed.
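As a small addition of my own (not part of the answer above): if building the object for every record is too expensive, mapPartitions lets you build it once per partition, still on the executor side:

val something = rdd.mapPartitions { iter =>
  val gson: Gson = new Gson() // created on the executor, one instance per partition
  iter.map(str => gson.toJson(str))
}
something.foreach(println)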
The issue was with the Jackson ObjectMapper that was being serialized. The ObjectMapper should not be serialized; I fixed this by declaring it as @transient val objMapper = new ObjectMapper...
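As a rough sketch of that fix (the surrounding class is illustrative, not the asker's actual code): mark the mapper @transient so closure/checkpoint serialization skips it, and make it a lazy val so it is rebuilt wherever the instance is deserialized.

import com.fasterxml.jackson.databind.ObjectMapper

class EsRecordWriter extends Serializable {
  // Skipped during serialization; lazily re-created on first use after deserialization.
  @transient lazy val objMapper: ObjectMapper = new ObjectMapper()

  def toJson(value: AnyRef): String = objMapper.writeValueAsString(value)
}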

Spark task not serializable

I've tried all the solutions to this problem that I found on StackOverflow but, despite this, I can't solve it.
I have a "MainObj" object that instantiates a "Recommendation" object. When I call the "recommendationProducts" method I always get an error.
Here is the code of the method:
def recommendationProducts(item: Int): Unit = {
  val aMatrix = new DoubleMatrix(Array(1.0, 2.0, 3.0))

  def cosineSimilarity(vec1: DoubleMatrix, vec2: DoubleMatrix): Double = {
    vec1.dot(vec2) / (vec1.norm2() * vec2.norm2())
  }

  val itemFactor = model.productFeatures.lookup(item).head
  val itemVector = new DoubleMatrix(itemFactor)

  //Here is where I get the error:
  val sims = model.productFeatures.map { case (id, factor) =>
    val factorVector = new DoubleMatrix(factor)
    val sim = cosineSimilarity(factorVector, itemVector)
    (id, sim)
  }

  val sortedSims = sims.top(10)(Ordering.by[(Int, Double), Double] {
    case (id, similarity) => similarity
  })

  println("\nTop 10 products:")
  sortedSims.map(x => (x._1, x._2)).foreach(println)
}
This is the error:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.map(RDD.scala:369)
at RecommendationObj.recommendationProducts(RecommendationObj.scala:269)
at MainObj$.analisiIUNGO(MainObj.scala:257)
at MainObj$.menu(MainObj.scala:54)
at MainObj$.main(MainObj.scala:37)
at MainObj.main(MainObj.scala)
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext#7c2312fa)
- field (class: RecommendationObj, name: sc, type: class org.apache.spark.SparkContext)
- object (class MainObj$$anon$1, MainObj$$anon$1#615bad16)
- field (class: RecommendationObj$$anonfun$37, name: $outer, type: class RecommendationObj)
- object (class RecommendationObj$$anonfun$37, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 14 more
I tried to add:
1) "extends Serializable" (Scala) to my class
2) "extends java.io.Serializable" to my class
3) "@transient" to some parts
4) Getting the model (and other features) inside this class (currently I get them from another object and pass them to my class as arguments)
How can I resolve it? It's driving me crazy!
Thank you in advance!
The key is here:
field (class: RecommendationObj, name: sc, type: class org.apache.spark.SparkContext)
So you have a field named sc of type SparkContext. Spark wants to serialize the class, so it tries to serialize all of its fields as well.
You should either:
use the @transient annotation, check whether the field is null, and recreate it if needed, or
not use the SparkContext from a field, but pass it as a method argument instead. Remember, though, that you should never use the SparkContext inside closures passed to map, flatMap, etc. A rough sketch of both options follows.
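A minimal sketch of both options, assuming a class shaped roughly like the question's RecommendationObj (the names and constructor here are illustrative, not the asker's actual code):

import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.jblas.DoubleMatrix

class Recommendation(model: MatrixFactorizationModel) extends Serializable {

  // Option 1: keep the SparkContext out of serialization and recreate it on demand.
  @transient private var sc: SparkContext = SparkContext.getOrCreate()
  def context: SparkContext = {
    if (sc == null) sc = SparkContext.getOrCreate()
    sc
  }

  // Option 2: only serializable locals (here itemVector) are captured by the map
  // closure; sc and other driver-only state are never referenced inside it.
  def recommendationProducts(item: Int): Unit = {
    val itemVector = new DoubleMatrix(model.productFeatures.lookup(item).head)
    val sims = model.productFeatures.map { case (id, factor) =>
      val factorVector = new DoubleMatrix(factor)
      (id, factorVector.dot(itemVector) / (factorVector.norm2() * itemVector.norm2()))
    }
    sims.top(10)(Ordering.by[(Int, Double), Double](_._2)).foreach(println)
  }
}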

Spark Streaming check pointing throws Not Serializable exception

We are using the Spark Streaming receiver-based approach, and we just enabled checkpointing to get rid of a data-loss issue.
The Spark version is 1.6.1 and we are receiving messages from a Kafka topic.
I'm using ssc inside the foreachRDD method of the DStream, so it throws a Not Serializable exception.
I tried making the class extend Serializable, but I still get the same error. It happens only when we enable checkpointing.
Code is:
def main(args: Array[String]): Unit = {
  val checkPointLocation = "/path/to/wal"
  val ssc = StreamingContext.getOrCreate(checkPointLocation, () => createContext(checkPointLocation))
  ssc.start()
  ssc.awaitTermination()
}

def createContext(checkPointLocation: String): StreamingContext = {
  val sparkConf = new SparkConf().setAppName("Test")
  sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
  val ssc = new StreamingContext(sparkConf, Seconds(40))
  ssc.checkpoint(checkPointLocation)
  val sc = ssc.sparkContext
  val sqlContext: SQLContext = new HiveContext(sc)
  val kafkaParams = Map("group.id" -> groupId,
    CommonClientConfigs.SECURITY_PROTOCOL_CONFIG -> sasl,
    ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
    ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
    "metadata.broker.list" -> brokerList,
    "zookeeper.connect" -> zookeeperURL)
  val dStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicMap, StorageLevel.MEMORY_AND_DISK_SER).map(_._2)

  dStream.foreachRDD(rdd => {
    // using sparkContext / sqlContext to do any operation throws error.
    // convert RDD[String] to RDD[Row]
    // Create Schema for the RDD.
    sqlContext.createDataFrame(rdd, schema)
  })

  ssc
}
Error log:
2017-02-08 22:53:53,250 ERROR [Driver] streaming.StreamingContext: Error starting the context, marking it as stopped
java.io.NotSerializableException: DStream checkpointing has been enabled but the DStreams with their functions are not serializable
org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext#1c5e3677)
- field (class: com.x.payments.RemedyDriver$$anonfun$main$1, name: sc$1, type: class org.apache.spark.SparkContext)
- object (class com.x.payments.RemedyDriver$$anonfun$main$1, )
- field (class: org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3, name: cleanedF$1, type: interface scala.Function1)
- object (class org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3, )
- writeObject data (class: org.apache.spark.streaming.dstream.DStream)
- object (class org.apache.spark.streaming.dstream.ForEachDStream, org.apache.spark.streaming.dstream.ForEachDStream#68866c5)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 16)
- field (class: scala.collection.mutable.ArrayBuffer, name: array, type: class [Ljava.lang.Object;)
- object (class scala.collection.mutable.ArrayBuffer, ArrayBuffer(org.apache.spark.streaming.dstream.ForEachDStream#68866c5))
- writeObject data (class: org.apache.spark.streaming.dstream.DStreamCheckpointData)
- object (class org.apache.spark.streaming.dstream.DStreamCheckpointData, [ 0 checkpoint files ])
- writeObject data (class: org.apache.spark.streaming.dstream.DStream)
- object (class org.apache.spark.streaming.kafka.KafkaInputDStream, org.apache.spark.streaming.kafka.KafkaInputDStream#acd8e32)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 16)
- field (class: scala.collection.mutable.ArrayBuffer, name: array, type: class [Ljava.lang.Object;)
- object (class scala.collection.mutable.ArrayBuffer, ArrayBuffer(org.apache.spark.streaming.kafka.KafkaInputDStream#acd8e32))
- writeObject data (class: org.apache.spark.streaming.DStreamGraph)
- object (class org.apache.spark.streaming.DStreamGraph, org.apache.spark.streaming.DStreamGraph#6935641e)
- field (class: org.apache.spark.streaming.Checkpoint, name: graph, type: class org.apache.spark.streaming.DStreamGraph)
- object (class org.apache.spark.streaming.Checkpoint, org.apache.spark.streaming.Checkpoint#484bf033)
at org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:557)
at org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:601)
at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:600)
at com.x.payments.RemedyDriver$.main(RemedyDriver.scala:104)
at com.x.payments.RemedyDriver.main(RemedyDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:559)
2017-02-08 22:53:53,250 ERROR [Driver] payments.RemedyDriver$: DStream checkpointing has been enabled but the DStreams with their functions are not serializable org.apache.spark.SparkContext
2017-02-08 22:53:53,255 INFO [Driver] yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
Update:
Basically what we are trying to do is convert the RDD to a DataFrame inside the foreachRDD method of the DStream, then apply the DataFrame API on top of it and finally store the data in Cassandra. We used sqlContext to convert the RDD to a DataFrame, and that is where it throws the error.
If you want to access the SparkContext, do so via the rdd value:
dStream.foreachRDD(rdd => {
  val sqlContext = new HiveContext(rdd.context)
  val dataFrameSchema = sqlContext.createDataFrame(rdd, schema)
})
This:
dStream.foreachRDD(rdd => {
  // using sparkContext / sqlContext to do any operation throws error.
  val numRDD = sc.parallelize(1 to 10, 2)
  log.info("NUM RDD COUNT:" + numRDD.count())
})
is causing the SparkContext to be serialized into the closure, which fails because the SparkContext is not serializable.

Spark: Solrj exception on serialization

I am trying to import some CSV files into Solr, using Spark as a "pipe", in this way:
trait SparkConnector {
  def connectionWrapper[A](f: SparkContext => A): A = {
    val sc: SparkContext = getSparkContext()
    val res: A = f(sc)
    sc.stop
    res
  }
}

object SolrIndexBuilder extends SparkConnector {

  val solr = new ConcurrentUpdateSolrClient("http://some-solr-url", 10000, 5)

  def run(implicit fileSystem: FileSystem) = connectionWrapper[Unit] { sparkContext =>
    def build(): Unit = {
      def toSolrDocument(person: Map[String, String], fieldNames: Array[String]): SolrInputDocument = {
        val doc = new SolrInputDocument()
        // add the values to the solr doc
        doc
      }

      def rddToMapWithHeaderUnchecked(rdd: RDD[String], header: Array[String]): RDD[Map[String, String]] =
        rdd.map { line =>
          val splits = new CSVParser().parseLine(line)
          header.zip(splits).toMap
        }

      val indexCsv = sparkContext.textFile("/somewhere/filename.csv")
      val fieldNames = Vector("field1", "field2", "field3", ...)
      val indexRows = rddToMapWithHeaderUnchecked(indexCsv, fieldNames)

      indexRows.map(row => toSolrDocument(row, fieldNames)).foreach { doc => solr.add(doc) }

      if (!indexRows.isEmpty()) {
        solr.blockUntilFinished()
        solr.commit()
      }
    }

    build()
  }
}
Running the code on a cluster (Spark-mode) I get the following error:
Caused by: java.io.NotSerializableException: org.apache.http.impl.client.SystemDefaultHttpClient
Serialization stack:
- object not serializable (class: org.apache.http.impl.client.SystemDefaultHttpClient, value: org.apache.http.impl.client.SystemDefaultHttpClient#3bdd79c7)
- field (class: org.apache.solr.client.solrj.impl.HttpSolrClient, name: httpClient, type: interface org.apache.http.client.HttpClient)
- object (class org.apache.solr.client.solrj.impl.HttpSolrClient, org.apache.solr.client.solrj.impl.HttpSolrClient#5d2c576a)
- field (class: org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient, name: client, type: class org.apache.solr.client.solrj.impl.HttpSolrClient)
- object (class org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient, org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient#5da7593d)
- field (class: com.oneninetwo.andromeda.solr.SolrIndexBuilder$$anonfun$run$1$$anonfun$build$1$2, name: solr$1, type: class org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient)
- object (class com.oneninetwo.andromeda.solr.SolrIndexBuilder$$anonfun$run$1$$anonfun$build$1$2, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
... 26 more
which clearly states that I cannot serialize the classes related to SolrJ (and their dependencies).
I have tried different things but I cannot find a way to make it work. Any help?
UPDATE
This is one solution I have found:
indexRows.map(row => toSolrDocument(row, fieldNames)).foreachPartition { partition =>
  val solr = new ConcurrentUpdateSolrClient(...)
  partition.foreach { doc =>
    solr.add(doc)
  }
  solr.commit()
}