We are using the Spark Streaming receiver-based approach, and we just enabled checkpointing to fix a data loss issue.
The Spark version is 1.6.1 and we are receiving messages from a Kafka topic.
I'm using ssc inside the foreachRDD method of the DStream, so it throws a NotSerializableException.
I tried making the class extend Serializable, but I still get the same error. It happens only when checkpointing is enabled.
Code is:
def main(args: Array[String]): Unit = {
  val checkPointLocation = "/path/to/wal"
  val ssc = StreamingContext.getOrCreate(checkPointLocation, () => createContext(checkPointLocation))
  ssc.start()
  ssc.awaitTermination()
}

def createContext(checkPointLocation: String): StreamingContext = {
  val sparkConf = new SparkConf().setAppName("Test")
  sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")

  val ssc = new StreamingContext(sparkConf, Seconds(40))
  ssc.checkpoint(checkPointLocation)

  val sc = ssc.sparkContext
  val sqlContext: SQLContext = new HiveContext(sc)

  val kafkaParams = Map(
    "group.id" -> groupId,
    CommonClientConfigs.SECURITY_PROTOCOL_CONFIG -> sasl,
    ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
    ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
    "metadata.broker.list" -> brokerList,
    "zookeeper.connect" -> zookeeperURL)

  val dStream = KafkaUtils
    .createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicMap, StorageLevel.MEMORY_AND_DISK_SER)
    .map(_._2)

  dStream.foreachRDD(rdd => {
    // Using sparkContext / sqlContext to do any operation here throws the error.
    // Convert RDD[String] to RDD[Row], create a schema for the RDD, then:
    sqlContext.createDataFrame(rdd, schema)
  })

  ssc
}
Error log:
2017-02-08 22:53:53,250 ERROR [Driver] streaming.StreamingContext:
Error starting the context, marking it as stopped
java.io.NotSerializableException: DStream checkpointing has been
enabled but the DStreams with their functions are not serializable
org.apache.spark.SparkContext Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value:
org.apache.spark.SparkContext#1c5e3677)
- field (class: com.x.payments.RemedyDriver$$anonfun$main$1, name: sc$1, type: class org.apache.spark.SparkContext)
- object (class com.x.payments.RemedyDriver$$anonfun$main$1, )
- field (class: org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3,
name: cleanedF$1, type: interface scala.Function1)
- object (class org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3,
)
- writeObject data (class: org.apache.spark.streaming.dstream.DStream)
- object (class org.apache.spark.streaming.dstream.ForEachDStream,
org.apache.spark.streaming.dstream.ForEachDStream#68866c5)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 16)
- field (class: scala.collection.mutable.ArrayBuffer, name: array, type: class [Ljava.lang.Object;)
- object (class scala.collection.mutable.ArrayBuffer, ArrayBuffer(org.apache.spark.streaming.dstream.ForEachDStream#68866c5))
- writeObject data (class: org.apache.spark.streaming.dstream.DStreamCheckpointData)
- object (class org.apache.spark.streaming.dstream.DStreamCheckpointData, [ 0
checkpoint files
])
- writeObject data (class: org.apache.spark.streaming.dstream.DStream)
- object (class org.apache.spark.streaming.kafka.KafkaInputDStream,
org.apache.spark.streaming.kafka.KafkaInputDStream#acd8e32)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 16)
- field (class: scala.collection.mutable.ArrayBuffer, name: array, type: class [Ljava.lang.Object;)
- object (class scala.collection.mutable.ArrayBuffer, ArrayBuffer(org.apache.spark.streaming.kafka.KafkaInputDStream#acd8e32))
- writeObject data (class: org.apache.spark.streaming.DStreamGraph)
- object (class org.apache.spark.streaming.DStreamGraph, org.apache.spark.streaming.DStreamGraph#6935641e)
- field (class: org.apache.spark.streaming.Checkpoint, name: graph, type: class org.apache.spark.streaming.DStreamGraph)
- object (class org.apache.spark.streaming.Checkpoint, org.apache.spark.streaming.Checkpoint#484bf033)
at org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:557)
at org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:601)
at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:600)
at com.x.payments.RemedyDriver$.main(RemedyDriver.scala:104)
at com.x.payments.RemedyDriver.main(RemedyDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:559)
2017-02-08 22:53:53,250 ERROR [Driver] payments.RemedyDriver$: DStream checkpointing has been enabled but the DStreams with their functions are not serializable org.apache.spark.SparkContext
Serialization stack: [identical to the stack above]
2017-02-08 22:53:53,255 INFO [Driver] yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
Update:
Basically, what we are trying to do is convert the RDD to a DataFrame (inside the foreachRDD method of the DStream), apply the DataFrame API on top of it, and finally store the data in Cassandra. We used sqlContext to convert the RDD to a DataFrame, and that is where the error is thrown.
If you want to access the SparkContext, do so via the rdd value:
dStream.foreachRDD(rdd => {
  val sqlContext = new HiveContext(rdd.context)
  val dataFrameSchema = sqlContext.createDataFrame(rdd, schema)
})
This:
dStream.foreachRDD(rdd => {
  // using sparkContext / sqlContext to do any operation throws an error
  val numRDD = sc.parallelize(1 to 10, 2)
  log.info("NUM RDD COUNT:" + numRDD.count())
})

is causing the SparkContext to be serialized into the closure, which fails because SparkContext is not serializable.
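For reference, a minimal sketch of what the corrected foreachRDD body could look like, under the same assumptions as the question (Spark 1.6, a predefined schema, and the spark-cassandra-connector on the classpath). rowFromLine is a hypothetical String => Row parser, and the keyspace/table names are placeholders:

import org.apache.spark.sql.{Row, SQLContext, SaveMode}

dStream.foreachRDD { rdd =>
  // Derive the context from the RDD instead of capturing the driver-side
  // sqlContext, so nothing non-serializable ends up in the checkpointed closure.
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)

  val rowRDD = rdd.map(rowFromLine) // hypothetical String => Row parser matching schema
  val df = sqlContext.createDataFrame(rowRDD, schema)

  df.write
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table")) // placeholder names
    .mode(SaveMode.Append)
    .save()
}

SQLContext.getOrCreate is used here for brevity; if Hive features are required, a HiveContext built from rdd.sparkContext works the same way.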
Related
I am trying to reproduce the user-defined aggregate functions example provided in the Spark SQL Guide.
The only change I am making with respect to the original code is the DataFrame creation:
import org.apache.spark.sql.{Encoder, Encoders, SparkSession, functions}
import org.apache.spark.sql.expressions.Aggregator
case class Employee(name: String, salary: Long)
case class Average(var sum: Long, var count: Long)
object MyAverage extends Aggregator[Employee, Average, Double] {
// A zero value for this aggregation. Should satisfy the property that any b + zero = b
def zero: Average = Average(0L, 0L)
// Combine two values to produce a new value. For performance, the function may modify `buffer`
// and return it instead of constructing a new object
def reduce(buffer: Average, employee: Employee): Average = {
buffer.sum += employee.salary
buffer.count += 1
buffer
}
// Merge two intermediate values
def merge(b1: Average, b2: Average): Average = {
b1.sum += b2.sum
b1.count += b2.count
b1
}
// Transform the output of the reduction
def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count
// Specifies the Encoder for the intermediate value type
def bufferEncoder: Encoder[Average] = Encoders.product
// Specifies the Encoder for the final output value type
def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
val originalDF = Seq(
("Michael", 3000),
("Andy", 4500),
("Justin", 3500),
("Berta", 4000)
).toDF("name", "salary")
+-------+------+
|name |salary|
+-------+------+
|Michael|3000 |
|Andy |4500 |
|Justin |3500 |
|Berta |4000 |
+-------+------+
When I try to use this UDAF with Spark SQL (the second option in the documentation):
spark.udf.register("myAverage", functions.udaf(MyAverage))
originalDF.createOrReplaceTempView("employees")
val result = spark.sql("SELECT myAverage(salary) as average_salary FROM employees")
result.show()
Everything goes as expected:
+--------------+
|average_salary|
+--------------+
| 3750.0|
+--------------+
However, when I try to use the approach which converts the function to a TypedColumn:
val averageSalary = MyAverage.toColumn.name("average_salary")
val result = originalDF.as[Employee].select(averageSalary)
result.show()
I am getting the following Exception:
Job aborted due to stage failure.
Caused by: NotSerializableException: org.apache.spark.sql.TypedColumn
Serialization stack:
- object not serializable (class: org.apache.spark.sql.TypedColumn, value: myaverage(knownnotnull(assertnotnull(input[0, $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Average, true])).sum AS sum, knownnotnull(assertnotnull(input[0, $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Average, true])).count AS count, newInstance(class $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Average), boundreference()) AS average_salary)
- field (class: $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw, name: averageSalary, type: class org.apache.spark.sql.TypedColumn)
- object (class $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw, $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw#1254d4c6)
- field (class: $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$MyAverage$, name: $outer, type: class $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw)
- object (class $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$MyAverage$, $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$MyAverage$#60a7eee1)
- field (class: org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, name: aggregator, type: class org.apache.spark.sql.expressions.Aggregator)
- object (class org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, MyAverage($line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Employee))
- field (class: org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, name: aggregateFunction, type: class org.apache.spark.sql.catalyst.expressions.aggregate.AggregateFunction)
- object (class org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, partial_myaverage($line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$MyAverage$#60a7eee1, Some(newInstance(class $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Employee)), Some(class $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Employee), Some(StructType(StructField(name,StringType,true),StructField(salary,LongType,false))), knownnotnull(assertnotnull(input[0, $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Average, true])).sum, knownnotnull(assertnotnull(input[0, $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Average, true])).count, newInstance(class $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Average), input[0, double, false], DoubleType, false, 0, 0) AS buf#308)
- writeObject data (class: scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.List$SerializationProxy, scala.collection.immutable.List$SerializationProxy#f939d16)
- writeReplace data (class: scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.$colon$colon, List(partial_myaverage($line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$MyAverage$#60a7eee1, Some(newInstance(class $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Employee)), Some(class $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Employee), Some(StructType(StructField(name,StringType,true),StructField(salary,LongType,false))), knownnotnull(assertnotnull(input[0, $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Average, true])).sum, knownnotnull(assertnotnull(input[0, $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Average, true])).count, newInstance(class $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Average), input[0, double, false], DoubleType, false, 0, 0) AS buf#308))
- field (class: org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, name: aggregateExpressions, type: interface scala.collection.Seq)
- object (class org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, ObjectHashAggregate(keys=[], functions=[partial_myaverage($line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$MyAverage$#60a7eee1, Some(newInstance(class $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Employee)), Some(class $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Employee), Some(StructType(StructField(name,StringType,true),StructField(salary,LongType,false))), knownnotnull(assertnotnull(input[0, $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Average, true])).sum, knownnotnull(assertnotnull(input[0, $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Average, true])).count, newInstance(class $line24f4d7b3f7f54dfc89ae8e2757da4abf39.$read$$iw$$iw$$iw$$iw$$iw$$iw$Average), input[0, double, false], DoubleType, false, 0, 0) AS buf#308], output=[buf#308])
+- LocalTableScan [name#274, salary#275]
)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 6)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, functionalInterfaceMethod=scala/Function2.apply:(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/sql/execution/aggregate/ObjectHashAggregateExec.$anonfun$doExecute$1$adapted:(Lorg/apache/spark/sql/execution/aggregate/ObjectHashAggregateExec;ILorg/apache/spark/sql/execution/metric/SQLMetric;Lorg/apache/spark/sql/execution/metric/SQLMetric;Lorg/apache/spark/sql/execution/metric/SQLMetric;Lorg/apache/spark/sql/execution/metric/SQLMetric;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, instantiatedMethodType=(Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, numCaptured=6])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$Lambda$5930/237479585, org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$Lambda$5930/237479585#6de6a3e)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 1)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.rdd.RDD, functionalInterfaceMethod=scala/Function3.apply:(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/rdd/RDD.$anonfun$mapPartitionsWithIndexInternal$2$adapted:(Lscala/Function2;Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, instantiatedMethodType=(Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, numCaptured=1])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class org.apache.spark.rdd.RDD$$Lambda$5932/1340469986, org.apache.spark.rdd.RDD$$Lambda$5932/1340469986#7939a132)
- field (class: org.apache.spark.rdd.MapPartitionsRDD, name: f, type: interface scala.Function3)
- object (class org.apache.spark.rdd.MapPartitionsRDD, MapPartitionsRDD[20] at $anonfun$executeCollectResult$1 at FrameProfiler.scala:80)
- field (class: org.apache.spark.NarrowDependency, name: _rdd, type: class org.apache.spark.rdd.RDD)
- object (class org.apache.spark.OneToOneDependency, org.apache.spark.OneToOneDependency#1e0b1350)
- writeObject data (class: scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.List$SerializationProxy, scala.collection.immutable.List$SerializationProxy#29edc56a)
- writeReplace data (class: scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.$colon$colon, List(org.apache.spark.OneToOneDependency#1e0b1350))
- field (class: org.apache.spark.rdd.RDD, name: dependencies_, type: interface scala.collection.Seq)
- object (class org.apache.spark.rdd.MapPartitionsRDD, MapPartitionsRDD[21] at $anonfun$executeCollectResult$1 at FrameProfiler.scala:80)
- field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
- object (class scala.Tuple2, (MapPartitionsRDD[21] at $anonfun$executeCollectResult$1 at FrameProfiler.scala:80,org.apache.spark.ShuffleDependency#567dc75c))
What am I missing?
I am running this script in DBR 11.0, with Spark 3.3.0, Scala 2.12
Applying the toColumn call inline inside the select() fixed the problem. (Assigning the TypedColumn to a val in a notebook cell appears to turn it into a field of the REPL wrapper object that MyAverage also references, so the non-serializable TypedColumn gets dragged along when the aggregator is serialized; inlining the call avoids creating that field.)
val result = originalDF.as[Employee].select(MyAverage.toColumn.name("average_salary"))
result.show()
+--------------+
|average_salary|
+--------------+
| 3750.0|
+--------------+
I have a Dataset which looks like this:
dataset.show(10)
| features|
+-----------+
|[14.378858]|
|[14.388442]|
|[14.384361]|
|[14.386358]|
|[14.390068]|
|[14.423256]|
|[14.425567]|
|[14.434074]|
|[14.437667]|
|[14.445997]|
+-----------+
only showing top 10 rows
But when I try to convert this Dataset into an RDD using .rdd like below:
val myRDD = dataset.rdd
I'm getting an exception like the one below:
Task not serializable: java.io.NotSerializableException: scala.runtime.LazyRef
Serialization stack:
- object not serializable (class: scala.runtime.LazyRef, value: LazyRef thunk)
- element of array (index: 2)
- array (class [Ljava.lang.Object;, size 3)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.sql.catalyst.expressions.ScalaUDF, functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/sql/catalyst/expressions/ScalaUDF.$anonfun$f$2:(Lscala/Function1;Lorg/apache/spark/sql/catalyst/expressions/Expression;Lscala/runtime/LazyRef;Lorg/apache/spark/sql/catalyst/InternalRow;)Ljava/lang/Object;, instantiatedMethodType=(Lorg/apache/spark/sql/catalyst/InternalRow;)Ljava/lang/Object;, numCaptured=3])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
How do I fix this?
java.io.NotSerializableException: scala.runtime.LazyRef
clearly indicates a runtime version mismatch. You have not mentioned your Spark version, but this is a Scala version issue: downgrade to Scala 2.11 and it should work.
See the version table at https://mvnrepository.com/artifact/org.apache.spark/spark-core and change your Scala version accordingly.
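As a minimal sketch, and assuming an sbt build (the exact versions are placeholders; pick them from the compatibility table linked above so the Scala suffix of the Spark artifacts matches your scalaVersion):

// build.sbt - example pairing only; match these to your cluster's Spark build
scalaVersion := "2.11.12"

val sparkVersion = "2.4.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"  % sparkVersion % "provided"
)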
What am I trying to do? (Context)
I'm trying to calculate some stats for a DataFrame/Dataset in Spark that is read from a directory of .parquet files about US flights between 2013 and 2015. To be more specific, I'm using the approxQuantile method in DataFrameStatFunctions, which can be accessed by calling the stat method on a Dataset. See the docs.
import airportCaseStudy.model.Flight
import org.apache.spark.sql.SparkSession
object CaseStudy {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession
      .builder
      .master("local[*]")
      .getOrCreate
    val sc = spark.sparkContext
    sc.setLogLevel("ERROR")

    import spark.sqlContext.implicits._

    val flights = spark
      .read
      .parquet("C:\\Users\\Bluetab\\IdeaProjects\\GraphFramesSparkPlayground\\src\\resources\\flights")
      .as[Flight]

    flights.show()
    flights.printSchema()
    flights.describe("year", "flightEpochSeconds").show()

    val approxQuantiles = flights.stat
      .approxQuantile(Array("year", "flightEpochSeconds"), Array(0.25, 0.5, 0.75), 0.25)
    // whatever...
  }
}
Flight is simply a case class.
package airportCaseStudy.model
case class Flight(year: Int, quarter: Int, month: Int, dayOfMonth: Int, dayOfWeek: Int, flightDate: String,
uniqueCarrier: String, airlineID: String, carrier: String, tailNum: String, flightNum: Int,
originAirportID: String, origin: String, originCityName: String, dstAirportID: String,
dst: String, dstCityName: String, taxiOut: Float, taxiIn: Float, cancelled: Boolean,
diverted: Float, actualETMinutes: Float, airTimeMinutes: Float, distanceMiles: Float, flightEpochSeconds: Long)
What's the issue?
I'm using Spark 2.4.0.
When executing val approxQuantiles = flights.stat.approxQuantile(Array("year", "flightEpochSeconds"), Array(0.25, 0.5, 0.75), 0.25), I can't get it to run because some task is apparently not serializable. I spent some time checking the following links, but I'm not able to figure out why this exception occurs.
Find quantiles and mean using spark (python and scala)
Statistical and Mathematical functions with DF in Spark from Databricks
Exception
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:393)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
at org.apache.spark.rdd.PairRDDFunctions.$anonfun$combineByKeyWithClassTag$1(PairRDDFunctions.scala:88)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.combineByKeyWithClassTag(PairRDDFunctions.scala:77)
at org.apache.spark.rdd.PairRDDFunctions.$anonfun$foldByKey$1(PairRDDFunctions.scala:222)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.foldByKey(PairRDDFunctions.scala:211)
at org.apache.spark.rdd.RDD.$anonfun$treeAggregate$1(RDD.scala:1158)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1137)
at org.apache.spark.sql.execution.stat.StatFunctions$.multipleApproxQuantiles(StatFunctions.scala:102)
at org.apache.spark.sql.DataFrameStatFunctions.approxQuantile(DataFrameStatFunctions.scala:104)
at airportCaseStudy.CaseStudy$.main(CaseStudy.scala:27)
at airportCaseStudy.CaseStudy.main(CaseStudy.scala)
Caused by: java.io.NotSerializableException: scala.runtime.LazyRef
Serialization stack:
- object not serializable (class: scala.runtime.LazyRef, value: LazyRef thunk)
- element of array (index: 2)
- array (class [Ljava.lang.Object;, size 3)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.rdd.PairRDDFunctions, functionalInterfaceMethod=scala/Function0.apply:()Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/rdd/PairRDDFunctions.$anonfun$foldByKey$2:(Lorg/apache/spark/rdd/PairRDDFunctions;[BLscala/runtime/LazyRef;)Ljava/lang/Object;, instantiatedMethodType=()Ljava/lang/Object;, numCaptured=3])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class org.apache.spark.rdd.PairRDDFunctions$$Lambda$2158/61210602, org.apache.spark.rdd.PairRDDFunctions$$Lambda$2158/61210602#165a5979)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 2)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.rdd.PairRDDFunctions, functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/rdd/PairRDDFunctions.$anonfun$foldByKey$3:(Lscala/Function0;Lscala/Function2;Ljava/lang/Object;)Ljava/lang/Object;, instantiatedMethodType=(Ljava/lang/Object;)Ljava/lang/Object;, numCaptured=2])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class org.apache.spark.rdd.PairRDDFunctions$$Lambda$2159/758750856, org.apache.spark.rdd.PairRDDFunctions$$Lambda$2159/758750856#6a6e410c)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:400)
... 22 more
I appreciate any help you can provide.
add "extends Serializable" to you class or object.
class/Object Test extends Serializable{
//type you code
}
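Applied to the example above, that suggestion would look roughly like the sketch below (whether it actually resolves the LazyRef failure depends on the Scala/Spark version pairing mentioned in the earlier LazyRef answer):

object CaseStudy extends Serializable {
  def main(args: Array[String]): Unit = {
    // ... same body as in the question: build the SparkSession, read the
    // parquet files as Dataset[Flight], then call flights.stat.approxQuantile(...)
  }
}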
Ran into this error when trying to run a spark streaming application with checkpointing enabled.
java.io.NotSerializableException: DStream checkpointing has been enabled but the DStreams with their functions are not serializable
org.apache.spark.streaming.StreamingContext
Serialization stack:
- object not serializable (class: org.apache.spark.streaming.StreamingContext, value: org.apache.spark.streaming.StreamingContext#63cf0da6)
- field (class: com.sales.spark.job.streaming.SalesStream, name: streamingContext, type: class org.apache.spark.streaming.StreamingContext)
- object (class com.sales.spark.job.streaming.SalesStreamFactory$$anon$1, com.sales.spark.job.streaming.SalesStreamFactory$$anon$1#1738d3b2)
- field (class: com.sales.spark.job.streaming.SalesStream$$anonfun$runJob$1, name: $outer, type: class com.sales.spark.job.streaming.SalesStream)
- object (class com.sales.spark.job.streaming.SalesStream$$anonfun$runJob$1, <function1>)
This happens when trying to execute the piece of code below. I think the issue has to do with trying to access the Spark session variable inside the tempTableView function.
Code
liveRecordStream
  .foreachRDD(newRDD => {
    if (!newRDD.isEmpty()) {
      val cacheRDD = newRDD.cache()
      val updTempTables = tempTableView(t2s, stgDFMap, cacheRDD)
      val rdd = updatestgDFMap(stgDFMap, cacheRDD)
      persistStgTable(stgDFMap)
      dfMap
        .filter(entry => updTempTables.contains(entry._2))
        .map(spark.sql)
        .foreach(df => writeToES(writer, df))
    }
  })
tempTableView
def tempTableView(t2s: Map[String, StructType], stgDFMap: Map[String, DataFrame], cacheRDD: RDD[cacheRDD]): Set[String] = {
  stgDFMap.keys.filter { table =>
    val tRDD = cacheRDD
      .filter(r => r.Name == table)
      .map(r => r.values)
    val tDF = spark.createDataFrame(tRDD, tableNameToSchema(table))
    if (!tRDD.isEmpty()) {
      val tName = s"temp_$table"
      tDF.createOrReplaceTempView(tName)
    }
    !tRDD.isEmpty()
  }.toSet
}
I'm not sure how to get the Spark session variable inside this function, which is called inside foreachRDD.
I am instantiating the streamingContext as part of a different class:
class Test {
  lazy val sparkSession: SparkSession =
    SparkSession
      .builder()
      .appName("testApp")
      .config("es.nodes", SalesConfig.elasticnode)
      .config("es.port", SalesConfig.elasticport)
      .config("spark.sql.parquet.filterPushdown", parquetFilterPushDown)
      .config("spark.debug.maxToStringFields", 100000)
      .config("spark.rdd.compress", rddCompress)
      .config("spark.task.maxFailures", 25)
      .config("spark.streaming.unpersist", streamingUnPersist)
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()

  lazy val streamingContext: StreamingContext = new StreamingContext(sparkSession.sparkContext, Seconds(15))

  streamingContext.checkpoint("/Users/gswaminathan/Guidewire/Java/explore-policy/checkpoint/")
}
I tried making this class extend Serializable, but no luck.
I want to send a DStream to Kafka, but it still doesn't work.
searchWordCountsDStream.foreachRDD(rdd =>
  rdd.foreachPartition(partitionOfRecords => {
    val props = new HashMap[String, Object]()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, outbroker)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    partitionOfRecords.foreach {
      case (x: String, y: String) =>
        println(x)
        val message = new ProducerRecord[String, String](outtopic, null, x)
        producer.send(message)
    }
    producer.close()
  })
)
This is some of the error info:
16/10/31 14:44:15 ERROR StreamingContext: Error starting the context,
marking it as stopped java.io.NotSerializableException: DStream
checkpointing has been enabled but the DStreams with their functions
are not serializable spider.app.job.MeetMonitor Serialization stack:
- object not serializable (class: spider.app.job.MeetMonitor, value: spider.app.job.MeetMonitor#433c6abb)
- field (class: spider.app.job.MeetMonitor$$anonfun$createContext$2, name: $outer, type: class spider.app.job.MeetMonitor)
- object (class spider.app.job.MeetMonitor$$anonfun$createContext$2, )
- field (class: org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3,
name: cleanedF$1, type: interface scala.Function1)
- object (class org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3,
)
- writeObject data (class: org.apache.spark.streaming.dstream.DStream)
- object (class org.apache.spark.streaming.dstream.ForEachDStream, org.apache.spark.streaming.dstream.ForEachDStream#3ac3f6f)
- writeObject data (class: org.apache.spark.streaming.dstream.DStreamCheckpointData)
- object (class org.apache.spark.streaming.dstream.DStreamCheckpointData, [ 0
checkpoint files
])
- writeObject data (class: org.apache.spark.streaming.dstream.DStream)
- object (class org.apache.spark.streaming.dstream.ForEachDStream, org.apache.spark.streaming.dstream.ForEachDStream#6f9c5048)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 16)
- field (class: scala.collection.mutable.ArrayBuffer, name: array, type: class [Ljava.lang.Object;)
- object (class scala.collection.mutable.ArrayBuffer, ArrayBuffer(org.apache.spark.streaming.dstream.ForEachDStream#6f9c5048,
org.apache.spark.streaming.dstream.ForEachDStream#3ac3f6f))
- writeObject data (class: org.apache.spark.streaming.dstream.DStreamCheckpointData)
- object (class org.apache.spark.streaming.dstream.DStreamCheckpointData, [ 0
checkpoint files
])
I encountered the same problem and found an answer here:
https://forums.databricks.com/questions/382/why-is-my-spark-streaming-application-throwing-a-n.html
It seems that using checkpointing together with foreachRDD causes the problem. After removing the checkpoint from my code, everything works fine.
P.S. I just wanted to comment, but I do not have enough reputation to do so.
I have been working with Spark version 2.3.0 and encountered the same issue. I resolved it just by implementing the Serializable interface for the class that was throwing the error.
In your case, spider.app.job.MeetMonitor should implement it, like:

public class MeetMonitor implements Serializable {
    // ...
}
Another thing: if you are using a Logger in your class, note that its instance is also not serializable and can cause the same issue.
This can be resolved by defining it as:

private static final Logger logger = Logger.getLogger(MeetMonitor.class);
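For Scala classes, a common equivalent (an assumption on my part, not part of the original answer) is to mark the logger @transient so it is excluded from serialization and re-created lazily wherever the object is deserialized:

import org.apache.log4j.Logger

class MeetMonitor extends Serializable {
  // @transient keeps the non-serializable Logger out of the serialized object;
  // lazy re-initializes it on first use in each JVM.
  @transient private lazy val logger: Logger = Logger.getLogger(getClass)
}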