Task not serializable exception when converting a Dataset to an RDD - Scala

I have a Dataset that looks like this:
dataset.show(10)
+-----------+
|   features|
+-----------+
|[14.378858]|
|[14.388442]|
|[14.384361]|
|[14.386358]|
|[14.390068]|
|[14.423256]|
|[14.425567]|
|[14.434074]|
|[14.437667]|
|[14.445997]|
+-----------+
only showing top 10 rows
But when I try to convert this Dataset into an RDD using .rdd, like this:
val myRDD = dataset.rdd
I get the following exception:
Task not serializable: java.io.NotSerializableException: scala.runtime.LazyRef
Serialization stack:
- object not serializable (class: scala.runtime.LazyRef, value: LazyRef thunk)
- element of array (index: 2)
- array (class [Ljava.lang.Object;, size 3)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.sql.catalyst.expressions.ScalaUDF, functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/sql/catalyst/expressions/ScalaUDF.$anonfun$f$2:(Lscala/Function1;Lorg/apache/spark/sql/catalyst/expressions/Expression;Lscala/runtime/LazyRef;Lorg/apache/spark/sql/catalyst/InternalRow;)Ljava/lang/Object;, instantiatedMethodType=(Lorg/apache/spark/sql/catalyst/InternalRow;)Ljava/lang/Object;, numCaptured=3])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
How do I fix this?

java.io.NotSerializableException: scala.runtime.LazyRef
This error indicates a Scala runtime version mismatch. You have not mentioned your Spark version.
The Scala version you are running with does not match the one your Spark build expects; downgrading to Scala 2.11 should work.
See the version table at https://mvnrepository.com/artifact/org.apache.spark/spark-core
and change your Scala version accordingly.
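One way to see which Scala library is actually loaded at runtime, and whether its scala.runtime.LazyRef class is serializable, is a small check like the one below. This is only a sketch and not part of the original answer: LazyRefCheck is a hypothetical name, and the code assumes Scala 2.12 or later, where LazyRef exists.

```scala
object LazyRefCheck {
  // Version of the Scala library that is actually loaded at runtime.
  def scalaVersion: String = scala.util.Properties.versionNumberString

  // True when the loaded scala.runtime.LazyRef class implements java.io.Serializable.
  def lazyRefIsSerializable: Boolean =
    classOf[java.io.Serializable].isAssignableFrom(classOf[scala.runtime.LazyRef[_]])

  def main(args: Array[String]): Unit = {
    println(s"Scala library on classpath: $scalaVersion")
    println(s"LazyRef is Serializable: $lazyRefIsSerializable")
  }
}
```

Running it on the same classpath as the failing job (for example, by pasting the object into spark-shell) shows whether the Scala library in use matches the one the Spark artifacts were built for.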

Related

NotSerializableException: org.apache.spark.sql.TypedColumn when calling a UDAF

I am trying to reproduce the User Defined Aggregate Functions example provided in the Spark SQL Guide.
The only change I have made to the original code is the DataFrame creation:
import org.apache.spark.sql.{Encoder, Encoders, SparkSession, functions}
import org.apache.spark.sql.expressions.Aggregator
case class Employee(name: String, salary: Long)
case class Average(var sum: Long, var count: Long)
object MyAverage extends Aggregator[Employee, Average, Double] {
  // A zero value for this aggregation. Should satisfy the property that any b + zero = b
  def zero: Average = Average(0L, 0L)
  // Combine two values to produce a new value. For performance, the function may modify `buffer`
  // and return it instead of constructing a new object
  def reduce(buffer: Average, employee: Employee): Average = {
    buffer.sum += employee.salary
    buffer.count += 1
    buffer
  }
  // Merge two intermediate values
  def merge(b1: Average, b2: Average): Average = {
    b1.sum += b2.sum
    b1.count += b2.count
    b1
  }
  // Transform the output of the reduction
  def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count
  // Specifies the Encoder for the intermediate value type
  def bufferEncoder: Encoder[Average] = Encoders.product
  // Specifies the Encoder for the final output value type
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
val originalDF = Seq(
("Michael", 3000),
("Andy", 4500),
("Justin", 3500),
("Berta", 4000)
).toDF("name", "salary")
+-------+------+
|name |salary|
+-------+------+
|Michael|3000 |
|Andy |4500 |
|Justin |3500 |
|Berta |4000 |
+-------+------+
When I use this UDAF with Spark SQL (the second option in the documentation):
spark.udf.register("myAverage", functions.udaf(MyAverage))
originalDF.createOrReplaceTempView("employees")
val result = spark.sql("SELECT myAverage(salary) as average_salary FROM employees")
result.show()
Everything goes as expected:
+--------------+
|average_salary|
+--------------+
| 3750.0|
+--------------+
However, when I try to use the approach which converts the function to a TypedColumn:
val averageSalary = MyAverage.toColumn.name("average_salary")
val result = originalDF.as[Employee].select(averageSalary)
result.show()
I am getting the following Exception:
Job aborted due to stage failure.
Caused by: NotSerializableException: org.apache.spark.sql.TypedColumn
Serialization stack:
- object not serializable (class: org.apache.spark.sql.TypedColumn, value: myaverage(knownnotnull(assertnotnull(input[0, $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Average, true])).sum AS sum, knownnotnull(assertnotnull(input[0, $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Average, true])).count AS count, newInstance(class $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Average), boundreference()) AS average_salary)
- field (class: $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw, name: averageSalary, type: class org.apache.spark.sql.TypedColumn)
- object (class $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw, $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw#1254d4c6)
- field (class: $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$MyAverage$, name: $outer, type: class $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw)
- object (class $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$MyAverage$, $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$MyAverage$#60a7eee1)
- field (class: org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, name: aggregator, type: class org.apache.spark.sql.expressions.Aggregator)
- object (class org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, MyAverage($line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Employee))
- field (class: org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, name: aggregateFunction, type: class org.apache.spark.sql.catalyst.expressions.aggregate.AggregateFunction)
- object (class org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, partial_myaverage($line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$MyAverage$#60a7eee1, Some(newInstance(class $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Employee)), Some(class $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Employee), Some(StructType(StructField(name,StringType,true),StructField(salary,LongType,false))), knownnotnull(assertnotnull(input[0, $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Average, true])).sum, knownnotnull(assertnotnull(input[0, $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Average, true])).count, newInstance(class $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Average), input[0, double, false], DoubleType, false, 0, 0) AS buf#308)
- writeObject data (class: scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.List$SerializationProxy, scala.collection.immutable.List$SerializationProxy#f939d16)
- writeReplace data (class: scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.$colon$colon, List(partial_myaverage($line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$MyAverage$#60a7eee1, Some(newInstance(class $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Employee)), Some(class $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Employee), Some(StructType(StructField(name,StringType,true),StructField(salary,LongType,false))), knownnotnull(assertnotnull(input[0, $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Average, true])).sum, knownnotnull(assertnotnull(input[0, $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Average, true])).count, newInstance(class $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Average), input[0, double, false], DoubleType, false, 0, 0) AS buf#308))
- field (class: org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, name: aggregateExpressions, type: interface scala.collection.Seq)
- object (class org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, ObjectHashAggregate(keys=[], functions=[partial_myaverage($line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$MyAverage$#60a7eee1, Some(newInstance(class $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Employee)), Some(class $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Employee), Some(StructType(StructField(name,StringType,true),StructField(salary,LongType,false))), knownnotnull(assertnotnull(input[0, $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Average, true])).sum, knownnotnull(assertnotnull(input[0, $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Average, true])).count, newInstance(class $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Average), input[0, double, false], DoubleType, false, 0, 0) AS buf#308], output=[buf#308])
+- LocalTableScan [name#274, salary#275]
)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 6)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, functionalInterfaceMethod=scala/Function2.apply:(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/sql/execution/aggregate/ObjectHashAggregateExec.$anonfun$doExecute$1$adapted:(Lorg/apache/spark/sql/execution/aggregate/ObjectHashAggregateExec;ILorg/apache/spark/sql/execution/metric/SQLMetric;Lorg/apache/spark/sql/execution/metric/SQLMetric;Lorg/apache/spark/sql/execution/metric/SQLMetric;Lorg/apache/spark/sql/execution/metric/SQLMetric;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, instantiatedMethodType=(Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, numCaptured=6])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$Lambda$5930/237479585, org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$Lambda$5930/237479585#6de6a3e)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 1)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.rdd.RDD, functionalInterfaceMethod=scala/Function3.apply:(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/rdd/RDD.$anonfun$mapPartitionsWithIndexInternal$2$adapted:(Lscala/Function2;Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, instantiatedMethodType=(Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, numCaptured=1])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class org.apache.spark.rdd.RDD$$Lambda$5932/1340469986, org.apache.spark.rdd.RDD$$Lambda$5932/1340469986#7939a132)
- field (class: org.apache.spark.rdd.MapPartitionsRDD, name: f, type: interface scala.Function3)
- object (class org.apache.spark.rdd.MapPartitionsRDD, MapPartitionsRDD[20] at $anonfun$executeCollectResult$1 at FrameProfiler.scala:80)
- field (class: org.apache.spark.NarrowDependency, name: _rdd, type: class org.apache.spark.rdd.RDD)
- object (class org.apache.spark.OneToOneDependency, org.apache.spark.OneToOneDependency#1e0b1350)
- writeObject data (class: scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.List$SerializationProxy, scala.collection.immutable.List$SerializationProxy#29edc56a)
- writeReplace data (class: scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.$colon$colon, List(org.apache.spark.OneToOneDependency#1e0b1350))
- field (class: org.apache.spark.rdd.RDD, name: dependencies_, type: interface scala.collection.Seq)
- object (class org.apache.spark.rdd.MapPartitionsRDD, MapPartitionsRDD[21] at $anonfun$executeCollectResult$1 at FrameProfiler.scala:80)
- field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
- object (class scala.Tuple2, (MapPartitionsRDD[21] at $anonfun$executeCollectResult$1 at FrameProfiler.scala:80,org.apache.spark.ShuffleDependency#567dc75c))
What am I missing?
I am running this script in DBR 11.0, with Spark 3.3.0, Scala 2.12
Calling toColumn inside select() fixed the problem:
val result = originalDF.as[Employee].select(MyAverage.toColumn.name("average_salary"))
result.show()
+--------------+
|average_salary|
+--------------+
| 3750.0|
+--------------+
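A possible explanation, read off the serialization stack above: MyAverage carries an $outer reference to the notebook cell wrapper, and that wrapper holds averageSalary as a field, so the non-serializable TypedColumn is dragged along with the aggregator. The sketch below reproduces that pattern with plain Java serialization and no Spark at all. NotSerializableColumn, CellWithField, and CellWithoutField are hypothetical stand-ins, not Spark classes, and this is an illustration of the mechanism rather than a confirmed diagnosis.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object OuterCaptureDemo {
  // Hypothetical stand-in for a value that is not Serializable, such as a TypedColumn.
  final class NotSerializableColumn(val name: String)

  // Mimics a notebook cell wrapper that keeps the column as a field.
  class CellWithField extends Serializable {
    val label = "average_salary"
    val column = new NotSerializableColumn(label)
    // Inner class: it uses `label`, so each instance keeps a reference to its outer cell.
    class Agg extends Serializable { def describe: String = label }
    val agg = new Agg
  }

  // Same shape, but the column only exists as a temporary value inside a method.
  class CellWithoutField extends Serializable {
    val label = "average_salary"
    class Agg extends Serializable { def describe: String = label }
    val agg = new Agg
    def columnName: String = new NotSerializableColumn(label).name
  }

  // Returns true when Java serialization accepts the value and everything it references.
  def isJavaSerializable(value: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(value); true }
    catch { case _: NotSerializableException => false }
  }

  def main(args: Array[String]): Unit = {
    println(isJavaSerializable(new CellWithField().agg))
    println(isJavaSerializable(new CellWithoutField().agg))
  }
}
```

In this sketch, serializing the first agg fails because its outer instance has a non-serializable field, while the second succeeds because no such field exists. Inlining toColumn in select() plausibly has the same effect: the TypedColumn never becomes a field of the wrapper.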

Spark Dataframe stat throwing Task not serializable

What am I trying to do? (Context)
I'm trying to calculate some statistics for a DataFrame/Dataset in Spark that is read from a directory of .parquet files about US flights between 2013 and 2015. More specifically, I'm using the approxQuantile method of DataFrameStatFunctions, which is accessed by calling the stat method on a Dataset. See the documentation.
import airportCaseStudy.model.Flight
import org.apache.spark.sql.SparkSession
object CaseStudy {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession
      .builder
      .master("local[*]")
      .getOrCreate
    val sc = spark.sparkContext
    sc.setLogLevel("ERROR")
    import spark.sqlContext.implicits._
    val flights = spark
      .read
      .parquet("C:\\Users\\Bluetab\\IdeaProjects\\GraphFramesSparkPlayground\\src\\resources\\flights")
      .as[Flight]
    flights.show()
    flights.printSchema()
    flights.describe("year", "flightEpochSeconds").show()
    val approxQuantiles = flights.stat
      .approxQuantile(Array("year", "flightEpochSeconds"), Array(0.25, 0.5, 0.75), 0.25)
    // whatever...
  }
}
Flight is simply a case class.
package airportCaseStudy.model
case class Flight(year: Int, quarter: Int, month: Int, dayOfMonth: Int, dayOfWeek: Int, flightDate: String,
uniqueCarrier: String, airlineID: String, carrier: String, tailNum: String, flightNum: Int,
originAirportID: String, origin: String, originCityName: String, dstAirportID: String,
dst: String, dstCityName: String, taxiOut: Float, taxiIn: Float, cancelled: Boolean,
diverted: Float, actualETMinutes: Float, airTimeMinutes: Float, distanceMiles: Float, flightEpochSeconds: Long)
What's the issue?
I'm using Spark 2.4.0.
When executing val approxQuantiles = flights.stat.approxQuantile(Array("year", "flightEpochSeconds"), Array(0.25, 0.5, 0.75), 0.25), the call fails because some task cannot be serialized. I spent some time going through the following links, but I still cannot figure out why this exception is thrown.
Find quantiles and mean using spark (python and scala)
Statistical and Mathematical functions with DF in Spark from Databricks
Exception
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:393)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
at org.apache.spark.rdd.PairRDDFunctions.$anonfun$combineByKeyWithClassTag$1(PairRDDFunctions.scala:88)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.combineByKeyWithClassTag(PairRDDFunctions.scala:77)
at org.apache.spark.rdd.PairRDDFunctions.$anonfun$foldByKey$1(PairRDDFunctions.scala:222)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.foldByKey(PairRDDFunctions.scala:211)
at org.apache.spark.rdd.RDD.$anonfun$treeAggregate$1(RDD.scala:1158)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1137)
at org.apache.spark.sql.execution.stat.StatFunctions$.multipleApproxQuantiles(StatFunctions.scala:102)
at org.apache.spark.sql.DataFrameStatFunctions.approxQuantile(DataFrameStatFunctions.scala:104)
at airportCaseStudy.CaseStudy$.main(CaseStudy.scala:27)
at airportCaseStudy.CaseStudy.main(CaseStudy.scala)
Caused by: java.io.NotSerializableException: scala.runtime.LazyRef
Serialization stack:
- object not serializable (class: scala.runtime.LazyRef, value: LazyRef thunk)
- element of array (index: 2)
- array (class [Ljava.lang.Object;, size 3)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.rdd.PairRDDFunctions, functionalInterfaceMethod=scala/Function0.apply:()Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/rdd/PairRDDFunctions.$anonfun$foldByKey$2:(Lorg/apache/spark/rdd/PairRDDFunctions;[BLscala/runtime/LazyRef;)Ljava/lang/Object;, instantiatedMethodType=()Ljava/lang/Object;, numCaptured=3])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class org.apache.spark.rdd.PairRDDFunctions$$Lambda$2158/61210602, org.apache.spark.rdd.PairRDDFunctions$$Lambda$2158/61210602#165a5979)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 2)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.rdd.PairRDDFunctions, functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/rdd/PairRDDFunctions.$anonfun$foldByKey$3:(Lscala/Function0;Lscala/Function2;Ljava/lang/Object;)Ljava/lang/Object;, instantiatedMethodType=(Ljava/lang/Object;)Ljava/lang/Object;, numCaptured=2])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class org.apache.spark.rdd.PairRDDFunctions$$Lambda$2159/758750856, org.apache.spark.rdd.PairRDDFunctions$$Lambda$2159/758750856#6a6e410c)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:400)
... 22 more
I appreciate any help you can provide.
Add "extends Serializable" to your class or object:
// works the same way for a class: class Test extends Serializable { ... }
object Test extends Serializable {
  // your code here
}

DataFrame using UDF giving Task not serializable Exception

I am trying to call the show() method on a DataFrame, but it throws a Task not serializable exception.
I have tried making the object extend Serializable, but the error persists.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import scala.io.Source.fromFile
object App extends Serializable {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache").setLevel(Level.WARN)
    val spark = SparkSession.builder()
      .appName("LearningSpark")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext
    val inputPath = "./src/resources/2015-03-01-0.json"
    val ghLog = spark.read.json(inputPath)
    val pushes = ghLog.filter("type = 'PushEvent'")
    val grouped = pushes.groupBy("actor.login").count
    val ordered = grouped.orderBy(grouped("count").desc)
    ordered.show(5)
    val empPath = "./src/resources/ghEmployees.txt"
    val employees = Set() ++ (
      for {
        line <- fromFile(empPath).getLines
      } yield line.trim)
    val bcEmployees = sc.broadcast(employees)
    import spark.implicits._
    // The lambda parameter needs an explicit type to compile
    val isEmp = (user: String) => bcEmployees.value.contains(user)
    val isEmployee = spark.udf.register("SetContainsUdf", isEmp)
    val filtered = ordered.filter(isEmployee($"login"))
    filtered.show()
  }
}
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/09/01 10:21:48 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:393)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$1(RDD.scala:850)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:849)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:630)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:92)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.$anonfun$doExecute$1(ShuffleExchangeExec.scala:128)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:391)
at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:151)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:627)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:136)
at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3383)
at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2544)
at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3364)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2544)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2758)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
at org.apache.spark.sql.Dataset.show(Dataset.scala:745)
at org.apache.spark.sql.Dataset.show(Dataset.scala:704)
at org.apache.spark.sql.Dataset.show(Dataset.scala:713)
at App$.main(App.scala:33)
at App.main(App.scala)
Caused by: java.io.NotSerializableException: scala.runtime.LazyRef
Serialization stack:
- object not serializable (class: scala.runtime.LazyRef, value: LazyRef thunk)
- element of array (index: 2)
- array (class [Ljava.lang.Object;, size 3)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.sql.catalyst.expressions.ScalaUDF, functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/sql/catalyst/expressions/ScalaUDF.$anonfun$f$2:(Lscala/Function1;Lorg/apache/spark/sql/catalyst/expressions/Expression;Lscala/runtime/LazyRef;Lorg/apache/spark/sql/catalyst/InternalRow;)Ljava/lang/Object;, instantiatedMethodType=(Lorg/apache/spark/sql/catalyst/InternalRow;)Ljava/lang/Object;, numCaptured=3])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class org.apache.spark.sql.catalyst.expressions.ScalaUDF$$Lambda$2364/2031154005, org.apache.spark.sql.catalyst.expressions.ScalaUDF$$Lambda$2364/2031154005#1fd37440)
- field (class: org.apache.spark.sql.catalyst.expressions.ScalaUDF, name: f, type: interface scala.Function1)
- object (class org.apache.spark.sql.catalyst.expressions.ScalaUDF, UDF:SetContainsUdf(actor#6.login))
- writeObject data (class: scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.List$SerializationProxy, scala.collection.immutable.List$SerializationProxy#3b65084e)
- writeReplace data (class: scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.$colon$colon, List(isnotnull(type#13), (type#13 = PushEvent), UDF:SetContainsUdf(actor#6.login)))
- field (class: org.apache.spark.sql.execution.FileSourceScanExec, name: dataFilters, type: interface scala.collection.Seq)
- object (class org.apache.spark.sql.execution.FileSourceScanExec, FileScan json [actor#6,type#13] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/C:/Users/abhaydub/Scala-Spark-workspace/LearningSpark/src/resources/2015-..., PartitionFilters: [], PushedFilters: [IsNotNull(type), EqualTo(type,PushEvent)], ReadSchema: struct<actor:struct<avatar_url:string,gravatar_id:string,id:bigint,login:string,url:string>,type:...
)
- field (class: org.apache.spark.sql.execution.FilterExec, name: child, type: class org.apache.spark.sql.execution.SparkPlan)
- object (class org.apache.spark.sql.execution.FilterExec, Filter ((isnotnull(type#13) && (type#13 = PushEvent)) && UDF:SetContainsUdf(actor#6.login))
+- FileScan json [actor#6,type#13] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/C:/Users/abhaydub/Scala-Spark-workspace/LearningSpark/src/resources/2015-..., PartitionFilters: [], PushedFilters: [IsNotNull(type), EqualTo(type,PushEvent)], ReadSchema: struct<actor:struct<avatar_url:string,gravatar_id:string,id:bigint,login:string,url:string>,type:...
)
- field (class: org.apache.spark.sql.execution.ProjectExec, name: child, type: class org.apache.spark.sql.execution.SparkPlan)
- object (class org.apache.spark.sql.execution.ProjectExec, Project [actor#6]
+- Filter ((isnotnull(type#13) && (type#13 = PushEvent)) && UDF:SetContainsUdf(actor#6.login))
+- FileScan json [actor#6,type#13] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/C:/Users/abhaydub/Scala-Spark-workspace/LearningSpark/src/resources/2015-..., PartitionFilters: [], PushedFilters: [IsNotNull(type), EqualTo(type,PushEvent)], ReadSchema: struct<actor:struct<avatar_url:string,gravatar_id:string,id:bigint,login:string,url:string>,type:...
)
- field (class: org.apache.spark.sql.execution.aggregate.HashAggregateExec, name: child, type: class org.apache.spark.sql.execution.SparkPlan)
- object (class org.apache.spark.sql.execution.aggregate.HashAggregateExec, HashAggregate(keys=[actor#6.login AS actor#6.login#53], functions=[partial_count(1)], output=[actor#6.login#53, count#43L])
+- Project [actor#6]
+- Filter ((isnotnull(type#13) && (type#13 = PushEvent)) && UDF:SetContainsUdf(actor#6.login))
+- FileScan json [actor#6,type#13] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/C:/Users/abhaydub/Scala-Spark-workspace/LearningSpark/src/resources/2015-..., PartitionFilters: [], PushedFilters: [IsNotNull(type), EqualTo(type,PushEvent)], ReadSchema: struct<actor:struct<avatar_url:string,gravatar_id:string,id:bigint,login:string,url:string>,type:...
)
+------------------+-----+
|             login|count|
+------------------+-----+
|      greatfirebot|  192|
|diversify-exp-user|  146|
|     KenanSulayman|   72|
|        manuelrp07|   45|
|    mirror-updates|   42|
+------------------+-----+
only showing top 5 rows
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 14)
- element of array (index: 1)
- array (class [Ljava.lang.Object;, size 3)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.sql.execution.WholeStageCodegenExec, functionalInterfaceMethod=scala/Function2.apply:(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/sql/execution/WholeStageCodegenExec.$anonfun$doExecute$4$adapted:(Lorg/apache/spark/sql/catalyst/expressions/codegen/CodeAndComment;[Ljava/lang/Object;Lorg/apache/spark/sql/execution/metric/SQLMetric;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, instantiatedMethodType=(Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, numCaptured=3])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$1297/815648243, org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$1297/815648243#27438750)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:400)
... 48 more
I had spark 2.4.4 with Scala "2.12.1". I encountered the same issue (object not serializable (class: scala.runtime.LazyRef, value: LazyRef thunk)) and it was driving me crazy. I changed Scala version to "2.12.10" and the issue is solved now!
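For an sbt project, pinning the versions mentioned in this answer might look like the build fragment below. It is only a sketch: the project name is hypothetical, and the spark-sql module is an assumption about which Spark artifact the project depends on.

```scala
// build.sbt (sketch): Scala and Spark versions taken from the answer above
name := "learning-spark" // hypothetical project name
scalaVersion := "2.12.10"
// %% appends the Scala binary version (_2.12) to the artifact name,
// so the Spark artifacts and the Scala library share the same binary version.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"
```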
The serialization issue is not caused by the object not being Serializable.
The object itself is not serialized and sent to the executors for execution; it is the transformation code that is serialized.
One of the functions in the code is not serializable.
Looking at the code and the trace, isEmployee seems to be the issue.
A couple of observations:
1. isEmployee does not appear to be set up as a proper UDF. In Spark, a UDF is an org.apache.spark.sql.expressions.UserDefinedFunction, which is Serializable; it is typically created with org.apache.spark.sql.functions.udf and, for use in SQL, registered with org.apache.spark.sql.UDFRegistration#register.
I can think of two solutions:
1. Create and register the UDF correctly, so that serialization works
2. Avoid the UDF entirely and use a broadcast variable with the filter method, as follows:
val employees: Set[String] = Set("users")
val bcEmployees = sc.broadcast(employees)
val filtered = ordered.filter { x =>
  val user = x.getString(0)         // assuming index 0 holds the user
  bcEmployees.value.contains(user)  // read the broadcast value inside the closure
}
filtered.show()
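The point that the closure, not the data, is what gets serialized can be sketched without Spark at all. The example below is only an illustration: DriverOnlyResource, canSerialize, badFilter and goodFilter are hypothetical names, and plain Java serialization stands in for what Spark does when it ships a closure to the executors.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object ClosureSerializationSketch {
  // Hypothetical stand-in for a driver-only resource; deliberately NOT Serializable.
  class DriverOnlyResource { val names: Set[String] = Set("alice", "bob") }

  // Attempt Java serialization and report whether it succeeded.
  def canSerialize(obj: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(obj); true }
    catch { case _: NotSerializableException => false }
    finally out.close()
  }

  // The function body uses `resource`, so the whole non-serializable object is captured.
  def badFilter(): String => Boolean = {
    val resource = new DriverOnlyResource
    (user: String) => resource.names.contains(user)
  }

  // Only the immutable Set[String] is captured, and that is Serializable.
  def goodFilter(): String => Boolean = {
    val names = new DriverOnlyResource().names
    (user: String) => names.contains(user)
  }

  def main(args: Array[String]): Unit = {
    println(s"bad closure serializable:  ${canSerialize(badFilter())}")
    println(s"good closure serializable: ${canSerialize(goodFilter())}")
  }
}
```

A broadcast variable plays the role of the captured Set here: the closure holds a small serializable handle rather than a driver-side object.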
Life is full of mysteries. Serialization is one of them, and so are some of the differences between spark-shell and Databricks notebooks, which are easier to work with.
https://medium.com/onzo-tech/serialization-challenges-with-spark-and-scala-a2287cd51c54 is worth reading: it shows that adding extends Serializable at the top level is not the answer; as far as I understand, the driver ships only the relevant pieces to the executors.
If I run your code as is in a Databricks notebook, without any extends Serializable, it works fine! Until now I had always been able to catch serialization issues in Databricks notebooks. I was assured that a pseudo-cluster should surface every possible serialization issue before release; apparently that is not always so. Interesting, but a notebook is not spark-submit.
If I run it in spark-shell using two consecutive paste-mode blocks, shown below, and 1) omit a few things and 2) wrap the function for your UDF in an object that extends Serializable (the UDF takes a Column, so we stick to that), it works.
:paste 1
scala> :paste
// Entering paste mode (ctrl-D to finish)
object X extends Serializable {
  // The parameter type must be given explicitly, otherwise the lambda does not compile
  val isEmp = (user: String) => bcEmployees.value.contains(user)
}
:paste 2
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
.appName("LearningSpark")
.master("local[*]")
.getOrCreate()
val sc = spark.sparkContext
// Wrap the function as a UDF
val isEmployee = udf(X.isEmp)
import scala.io.Source
import spark.implicits._
// Simulated input.
val ghLog = Seq(("john2X0", "push"), ("james09", "abc"), ("peter01", "push"), ("mary99", "push"), ("peter01", "push")).toDF("login", "type")
val pushes = ghLog.filter("type = 'push'")
val grouped = pushes.groupBy("login").count
val ordered = grouped.orderBy(grouped("count").desc)
ordered.show(5)
val emp = "/home/mapr/emp.txt"
val employees = Set() ++ (
for {
line <- Source.fromFile(emp).getLines
} yield line.trim)
val bcEmployees = sc.broadcast(employees)
val filtered = ordered.filter(isEmployee($"login"))
filtered.show()
The other answer suggests not using a UDF at all, which can be more performant in some cases, but I am sticking with the UDF because it accepts a Column as input and is potentially reusable. This approach works with spark-submit as well; that should be obvious, but I mention it for posterity.

Spark Streaming check pointing throws Not Serializable exception

We are using the receiver-based approach in Spark Streaming, and we just enabled checkpointing to get rid of a data loss issue.
The Spark version is 1.6.1 and we are receiving messages from a Kafka topic.
I'm using ssc inside the foreachRDD method of the DStream, so it throws a NotSerializableException.
I tried making the class extend Serializable, but I still get the same error. It happens only when checkpointing is enabled.
The code is:
def main(args: Array[String]): Unit = {
  val checkPointLocation = "/path/to/wal"
  val ssc = StreamingContext.getOrCreate(checkPointLocation, () => createContext(checkPointLocation))
  ssc.start()
  ssc.awaitTermination()
}
def createContext(checkPointLocation: String): StreamingContext = {
  val sparkConf = new SparkConf().setAppName("Test")
  sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
  val ssc = new StreamingContext(sparkConf, Seconds(40))
  ssc.checkpoint(checkPointLocation)
  val sc = ssc.sparkContext
  val sqlContext: SQLContext = new HiveContext(sc)
  val kafkaParams = Map("group.id" -> groupId,
    CommonClientConfigs.SECURITY_PROTOCOL_CONFIG -> sasl,
    ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
    ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
    "metadata.broker.list" -> brokerList,
    "zookeeper.connect" -> zookeeperURL)
  val dStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicMap, StorageLevel.MEMORY_AND_DISK_SER).map(_._2)
  dStream.foreachRDD(rdd => {
    // using sparkContext / sqlContext to do any operation throws error.
    // convert RDD[String] to RDD[Row]
    // Create Schema for the RDD.
    sqlContext.createDataFrame(rdd, schema)
  })
  ssc
}
Error log:
2017-02-08 22:53:53,250 ERROR [Driver] streaming.StreamingContext:
Error starting the context, marking it as stopped
java.io.NotSerializableException: DStream checkpointing has been
enabled but the DStreams with their functions are not serializable
org.apache.spark.SparkContext Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value:
org.apache.spark.SparkContext#1c5e3677)
- field (class: com.x.payments.RemedyDriver$$anonfun$main$1, name: sc$1, type: class org.apache.spark.SparkContext)
- object (class com.x.payments.RemedyDriver$$anonfun$main$1, )
- field (class: org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3,
name: cleanedF$1, type: interface scala.Function1)
- object (class org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3,
)
- writeObject data (class: org.apache.spark.streaming.dstream.DStream)
- object (class org.apache.spark.streaming.dstream.ForEachDStream,
org.apache.spark.streaming.dstream.ForEachDStream#68866c5)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 16)
- field (class: scala.collection.mutable.ArrayBuffer, name: array, type: class [Ljava.lang.Object;)
- object (class scala.collection.mutable.ArrayBuffer, ArrayBuffer(org.apache.spark.streaming.dstream.ForEachDStream#68866c5))
- writeObject data (class: org.apache.spark.streaming.dstream.DStreamCheckpointData)
- object (class org.apache.spark.streaming.dstream.DStreamCheckpointData, [ 0
checkpoint files
])
- writeObject data (class: org.apache.spark.streaming.dstream.DStream)
- object (class org.apache.spark.streaming.kafka.KafkaInputDStream,
org.apache.spark.streaming.kafka.KafkaInputDStream#acd8e32)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 16)
- field (class: scala.collection.mutable.ArrayBuffer, name: array, type: class [Ljava.lang.Object;)
- object (class scala.collection.mutable.ArrayBuffer, ArrayBuffer(org.apache.spark.streaming.kafka.KafkaInputDStream#acd8e32))
- writeObject data (class: org.apache.spark.streaming.DStreamGraph)
- object (class org.apache.spark.streaming.DStreamGraph, org.apache.spark.streaming.DStreamGraph#6935641e)
- field (class: org.apache.spark.streaming.Checkpoint, name: graph, type: class org.apache.spark.streaming.DStreamGraph)
- object (class org.apache.spark.streaming.Checkpoint, org.apache.spark.streaming.Checkpoint#484bf033)
at org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:557)
at org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:601)
at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:600)
at com.x.payments.RemedyDriver$.main(RemedyDriver.scala:104)
at com.x.payments.RemedyDriver.main(RemedyDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:559)
2017-02-08 22:53:53,250 ERROR [Driver] payments.RemedyDriver$: DStream
checkpointing has been enabled but the DStreams with their functions
are not serializable org.apache.spark.SparkContext Serialization
stack:
- object not serializable (class: org.apache.spark.SparkContext, value:
org.apache.spark.SparkContext#1c5e3677)
- field (class: com.x.payments.RemedyDriver$$anonfun$main$1, name: sc$1, type: class org.apache.spark.SparkContext)
- object (class com.x.payments.RemedyDriver$$anonfun$main$1, )
- field (class: org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3,
name: cleanedF$1, type: interface scala.Function1)
- object (class org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3,
)
- writeObject data (class: org.apache.spark.streaming.dstream.DStream)
- object (class org.apache.spark.streaming.dstream.ForEachDStream,
org.apache.spark.streaming.dstream.ForEachDStream#68866c5)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 16)
- field (class: scala.collection.mutable.ArrayBuffer, name: array, type: class [Ljava.lang.Object;)
- object (class scala.collection.mutable.ArrayBuffer, ArrayBuffer(org.apache.spark.streaming.dstream.ForEachDStream#68866c5))
- writeObject data (class: org.apache.spark.streaming.dstream.DStreamCheckpointData)
- object (class org.apache.spark.streaming.dstream.DStreamCheckpointData, [ 0
checkpoint files
])
- writeObject data (class: org.apache.spark.streaming.dstream.DStream)
- object (class org.apache.spark.streaming.kafka.KafkaInputDStream,
org.apache.spark.streaming.kafka.KafkaInputDStream#acd8e32)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 16)
- field (class: scala.collection.mutable.ArrayBuffer, name: array, type: class [Ljava.lang.Object;)
- object (class scala.collection.mutable.ArrayBuffer, ArrayBuffer(org.apache.spark.streaming.kafka.KafkaInputDStream#acd8e32))
- writeObject data (class: org.apache.spark.streaming.DStreamGraph)
- object (class org.apache.spark.streaming.DStreamGraph, org.apache.spark.streaming.DStreamGraph#6935641e)
- field (class: org.apache.spark.streaming.Checkpoint, name: graph, type: class org.apache.spark.streaming.DStreamGraph)
- object (class org.apache.spark.streaming.Checkpoint, org.apache.spark.streaming.Checkpoint#484bf033) 2017-02-08
22:53:53,255 INFO [Driver] yarn.ApplicationMaster: Final app status:
SUCCEEDED, exitCode: 0
Update:
Basically, what we are trying to do is convert the RDD to a DataFrame inside the foreachRDD method of the DStream, then apply the DataFrame API on top of it, and finally store the data in Cassandra. We used sqlContext to convert the RDD to a DataFrame, and that is when the error is thrown.
If you want to access the SparkContext, do so via the rdd value:
dStream.foreachRDD(rdd => {
  val sqlContext = new HiveContext(rdd.context)
  val dataFrameSchema = sqlContext.createDataFrame(rdd, schema)
})
This:
dStream.foreachRDD(rdd => {
  // using sparkContext / sqlContext to do any operation throws error.
  val numRDD = sc.parallelize(1 to 10, 2)
  log.info("NUM RDD COUNT:" + numRDD.count())
})
causes the SparkContext to be captured in the closure and serialized, which fails because SparkContext is not serializable.
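The difference between the two snippets can be sketched with plain Scala. This is an illustration only: FakeContext and FakeRdd are hypothetical stand-ins for SparkContext and an RDD, and Java serialization stands in for what checkpointing does to the foreachRDD function.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object ContextFromArgumentSketch {
  // Hypothetical stand-in for SparkContext; deliberately NOT Serializable.
  class FakeContext { def countRange(n: Int): Int = (1 to n).size }
  // Hypothetical stand-in for an RDD that exposes the context it belongs to.
  class FakeRdd(val context: FakeContext)

  // Attempt Java serialization and report whether it succeeded.
  def canSerialize(obj: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(obj); true }
    catch { case _: NotSerializableException => false }
    finally out.close()
  }

  // Like the failing snippet: the body refers to an outer `sc`, so `sc` is captured.
  def capturing(sc: FakeContext): FakeRdd => Int =
    (rdd: FakeRdd) => sc.countRange(10)

  // Like the fix: the context is looked up from the argument at call time,
  // so the function captures nothing.
  def fromArgument(): FakeRdd => Int =
    (rdd: FakeRdd) => rdd.context.countRange(10)

  def main(args: Array[String]): Unit = {
    println(s"capturing sc serializable:   ${canSerialize(capturing(new FakeContext))}")
    println(s"using rdd.context serializable: ${canSerialize(fromArgument())}")
  }
}
```

The function that reads the context from its argument has no captured state, which is why it survives serialization.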

DStream checkpointing has been enabled but the DStreams with their functions are not serializable

I want to send a DStream to Kafka, but it still doesn't work.
searchWordCountsDStream.foreachRDD(rdd =>
  rdd.foreachPartition(partitionOfRecords => {
    val props = new HashMap[String, Object]()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, outbroker)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    partitionOfRecords.foreach {
      case (x: String, y: String) =>
        println(x)
        val message = new ProducerRecord[String, String](outtopic, null, x)
        producer.send(message)
    }
    producer.close()
  })
)
Here is some of the error output:
16/10/31 14:44:15 ERROR StreamingContext: Error starting the context,
marking it as stopped java.io.NotSerializableException: DStream
checkpointing has been enabled but the DStreams with their functions
are not serializable spider.app.job.MeetMonitor Serialization stack:
- object not serializable (class: spider.app.job.MeetMonitor, value: spider.app.job.MeetMonitor#433c6abb)
- field (class: spider.app.job.MeetMonitor$$anonfun$createContext$2, name: $outer, type: class spider.app.job.MeetMonitor)
- object (class spider.app.job.MeetMonitor$$anonfun$createContext$2, )
- field (class: org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3,
name: cleanedF$1, type: interface scala.Function1)
- object (class org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3,
)
- writeObject data (class: org.apache.spark.streaming.dstream.DStream)
- object (class org.apache.spark.streaming.dstream.ForEachDStream, org.apache.spark.streaming.dstream.ForEachDStream#3ac3f6f)
- writeObject data (class: org.apache.spark.streaming.dstream.DStreamCheckpointData)
- object (class org.apache.spark.streaming.dstream.DStreamCheckpointData, [ 0
checkpoint files
])
- writeObject data (class: org.apache.spark.streaming.dstream.DStream)
- object (class org.apache.spark.streaming.dstream.ForEachDStream, org.apache.spark.streaming.dstream.ForEachDStream#6f9c5048)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 16)
- field (class: scala.collection.mutable.ArrayBuffer, name: array, type: class [Ljava.lang.Object;)
- object (class scala.collection.mutable.ArrayBuffer, ArrayBuffer(org.apache.spark.streaming.dstream.ForEachDStream#6f9c5048,
org.apache.spark.streaming.dstream.ForEachDStream#3ac3f6f))
- writeObject data (class: org.apache.spark.streaming.dstream.DStreamCheckpointData)
- object (class org.apache.spark.streaming.dstream.DStreamCheckpointData, [ 0
checkpoint files
])
I encountered the same problem and found an answer here
https://forums.databricks.com/questions/382/why-is-my-spark-streaming-application-throwing-a-n.html
It seems that using checkpointing together with foreachRDD causes the problem. After removing checkpointing from my code, everything works.
P/S. I just want to comment, but I do not have enough reputation to do so.
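The stack trace above points at a $outer field of type spider.app.job.MeetMonitor, which suggests the function passed to foreachRDD reads members of the enclosing class (such as outbroker or outtopic) and therefore drags the whole instance along. One common alternative to removing checkpointing is to copy those members into local vals first. The sketch below only illustrates that idea; Job, outTopic and canSerialize are hypothetical names, and plain Java serialization stands in for the checkpoint.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object OuterCaptureSketch {
  // Attempt Java serialization and report whether it succeeded.
  def canSerialize(obj: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(obj); true }
    catch { case _: NotSerializableException => false }
    finally out.close()
  }

  // Hypothetical stand-in for a job class; deliberately NOT Serializable.
  class Job(val outTopic: String) {
    // Reads the member inside the function body, so the function captures `this`.
    def capturesOuter: String => String =
      (x: String) => s"$outTopic:$x"

    // Copies the member into a local val first, so only the String is captured.
    def capturesLocal: String => String = {
      val topic = outTopic
      (x: String) => s"$topic:$x"
    }
  }

  def main(args: Array[String]): Unit = {
    val job = new Job("out")
    println(s"member access serializable: ${canSerialize(job.capturesOuter)}")
    println(s"local copy serializable:    ${canSerialize(job.capturesLocal)}")
  }
}
```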
I have been working with Spark 2.3.0 and encountered the same issue. I resolved it simply by implementing the Serializable interface in the class named in the error.
In your case, spider.app.job.MeetMonitor should implement it, like this:
public class MeetMonitor implements Serializable {
//
...
}
Another thing: if you use a Logger in your class, note that its instance is also not serializable and can therefore cause the same issue.
This can be resolved by declaring it as:
private static final Logger logger = Logger.getLogger(.class);
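In Scala code, a rough counterpart of the static logger is a @transient lazy val. The sketch below is an assumption-laden illustration, not the poster's code: FakeLogger, Monitor, EagerMonitor and canSerialize are hypothetical names, and FakeLogger merely stands in for a real non-serializable logger.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object TransientLoggerSketch {
  // Hypothetical stand-in for a logger; deliberately NOT Serializable.
  class FakeLogger { def info(msg: String): String = s"INFO $msg" }

  // The logger is an ordinary field, so it is serialized along with the instance.
  class EagerMonitor extends Serializable {
    val logger: FakeLogger = new FakeLogger
  }

  // The logger field is skipped during serialization and is created lazily on
  // first use, which also applies to a deserialized copy.
  class Monitor extends Serializable {
    @transient lazy val logger: FakeLogger = new FakeLogger
  }

  // Attempt Java serialization and report whether it succeeded.
  def canSerialize(obj: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(obj); true }
    catch { case _: NotSerializableException => false }
    finally out.close()
  }

  def main(args: Array[String]): Unit = {
    val monitor = new Monitor
    monitor.logger.info("force initialization before serializing")
    println(s"plain field serializable:     ${canSerialize(new EagerMonitor)}")
    println(s"transient lazy serializable:  ${canSerialize(monitor)}")
  }
}
```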