Spark Timestamp Column Not Serializable Error - scala

I am writing a proof-of-concept application that modifies a DataFrame by adding a timestamp column, like so:
val modified = source.withColumn("time_stamp", current_timestamp().as("time_stamp"))
modified.show()
However, it throws an error on the second line:
org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:231)
at
(yadda, yadda, yadda)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Failed to serialize task 80, not attempting to retry it. Exception during serialization: java.io.NotSerializableException: org.apache.spark.sql.Column
Serialization stack:
- object not serializable (class: org.apache.spark.sql.Column, value: current_timestamp())
- element of array (index: 6)
- array (class [Ljava.lang.Object;, size 7)
- field (class: org.apache.spark.sql.catalyst.expressions.GenericRow, name: values, type: class [Ljava.lang.Object;)
- object (class org.apache.spark.sql.catalyst.expressions.GenericRow, [1,1,like right now,system,no really, like just now,system,current_timestamp()])
- element of array (index: 0)
- array (class [Lorg.apache.spark.sql.Row;, size 1)
- field (class: scala.collection.mutable.WrappedArray$ofRef, name: array, type: class [Ljava.lang.Object;)
- object (class scala.collection.mutable.WrappedArray$ofRef, WrappedArray([1,1,like right now,system,no really, like just now,system,current_timestamp()]))
- writeObject data (class: org.apache.spark.rdd.ParallelCollectionPartition)
- object (class org.apache.spark.rdd.ParallelCollectionPartition, org.apache.spark.rdd.ParallelCollectionPartition#a47)
- field (class: org.apache.spark.scheduler.ShuffleMapTask, name: partition, type: interface org.apache.spark.Partition)
- object (class org.apache.spark.scheduler.ShuffleMapTask, ShuffleMapTask(24, 7))
What am I doing wrong?

Related

Spark NotSerializableException when overriding log4j logs

I have overridden the Databricks log4j logs using an init script. When my code is triggered it runs fine up to a point; when it reaches the line below:
val ds = df.as[MySDMData]
ds.map(a => func1(a)).write.format("delta").mode("overwrite").option("header","true").save(s"${Interimpath}/sdm_outer_java")
it fails with the following stacktrace:
Caused by: Job aborted due to stage failure.
Caused by: NotSerializableException: org.apache.log4j.Logger
Serialization stack:
- object not serializable (class: org.apache.log4j.Logger, value: org.apache.log4j.Logger#33280b3f)
- field (class: $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw$$iw$$iw$$iw$$iw, name: logger, type: class org.apache.log4j.Logger)
- object (class $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw$$iw$$iw$$iw$$iw, $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw$$iw$$iw$$iw$$iw#78706dad)
- field (class: $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw$$iw$$iw$$iw, name: $iw, type: class $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw$$iw$$iw$$iw$$iw)
- object (class $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw$$iw$$iw$$iw, $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw$$iw$$iw$$iw#259eefbe)
- field (class: $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw$$iw$$iw, name: $iw, type: class $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw$$iw$$iw$$iw)
- object (class $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw$$iw$$iw, $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw$$iw$$iw#58203e28)
- field (class: $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw$$iw, name: $iw, type: class $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw$$iw$$iw)
- object (class $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw$$iw, $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw$$iw#621d2f16)
- field (class: $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw, name: $iw, type: class $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw$$iw)
- object (class $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw, $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw#49da2284)
- field (class: $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw, name: $iw, type: class $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw)
- object (class $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw, $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw#4f8ac0c9)
- field (class: $linef43f9ceebbd54e07ba09b7bf5984364029.$read, name: $iw, type: class $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw)
- object (class $linef43f9ceebbd54e07ba09b7bf5984364029.$read, $linef43f9ceebbd54e07ba09b7bf5984364029.$read#3fc7ede0)
- field (class: $linef43f9ceebbd54e07ba09b7bf5984364041.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw, name: $linef43f9ceebbd54e07ba09b7bf5984364029$read, type: class $linef43f9ceebbd54e07ba09b7bf5984364029.$read)
- object (class $linef43f9ceebbd54e07ba09b7bf5984364041.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw, $linef43f9ceebbd54e07ba09b7bf5984364041.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw#375936db)
- field (class: $linef43f9ceebbd54e07ba09b7bf5984364041.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw, name: $outer, type: class $linef43f9ceebbd54e07ba09b7bf5984364041.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw)
- object (class $linef43f9ceebbd54e07ba09b7bf5984364041.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw, $linef43f9ceebbd54e07ba09b7bf5984364041.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw#750dbdc3)
- field (class: $linef43f9ceebbd54e07ba09b7bf5984364041.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw, name: $outer, type: class $linef43f9ceebbd54e07ba09b7bf5984364041.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw)
- object (class $linef43f9ceebbd54e07ba09b7bf5984364041.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw, $linef43f9ceebbd54e07ba09b7bf5984364041.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw#4c9f0247)
- field (class: $linef43f9ceebbd54e07ba09b7bf5984364041.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw, name: $outer, type: class $linef43f9ceebbd54e07ba09b7bf5984364041.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw)
- object (class $linef43f9ceebbd54e07ba09b7bf5984364041.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw, $linef43f9ceebbd54e07ba09b7bf5984364041.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw#e81047)
- element of array (index: 4)
- array (class [Ljava.lang.Object;, size 7)
- element of array (index: 1)
- array (class [Ljava.lang.Object;, size 3)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.sql.execution.WholeStageCodegenExec, functionalInterfaceMethod=scala/Function2.apply:(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/sql/execution/WholeStageCodegenExec.$anonfun$doExecute$4$adapted:(Lorg/apache/spark/sql/catalyst/expressions/codegen/CodeAndComment;[Ljava/lang/Object;Lorg/apache/spark/sql/execution/metric/SQLMetric;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, instantiatedMethodType=(Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, numCaptured=3])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$6664/133918073, org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$6664/133918073#36b47765)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 1)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.rdd.RDD, functionalInterfaceMethod=scala/Function3.apply:(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/rdd/RDD.$anonfun$mapPartitionsWithIndexInternal$2$adapted:(Lscala/Function2;Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, instantiatedMethodType=(Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, numCaptured=1])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class org.apache.spark.rdd.RDD$$Lambda$6661/1112983908, org.apache.spark.rdd.RDD$$Lambda$6661/1112983908#63563080)
- field (class: org.apache.spark.rdd.MapPartitionsRDD, name: f, type: interface scala.Function3)
- object (class org.apache.spark.rdd.MapPartitionsRDD, MapPartitionsRDD[4613] at execute at DeltaInvariantCheckerExec.scala:85)
- field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
- object (class scala.Tuple2, (MapPartitionsRDD[4613] at execute at DeltaInvariantCheckerExec.scala:85,org.apache.spark.sql.execution.datasources.FileFormatWriter$$$Lambda$7277/44213980#6853ff1b))
This is my case class:
case class MyData(var case_id: String,
                  var mbr_facet_id: String,
                  var mbr_id: String,
                  ....
                  var tpc_chg: String,
                  var icue_evi_flg: String)
I am mapping the other dataframe to my case class, as shown below:
val ds = df.as[MyData]
ds.map(a => func1(a)).write.mode("overwrite").option("header","true").parquet(s"${Interimpath}/cdf_cpm_interim3")
When it reaches this point I get the error. Is this error caused by the map function? How do I solve it?
New Edit
We have a dataframe df, and MySDMData is a case class with some parameters. Using it I make the data types the same in both (the dataframe and the case class):
val ds = df.as[MySDMData]
Here ds is a Dataset. Then I do the following:
ds.map(a => func1(a)).write.format("delta").mode("overwrite").option("header", "true").save(s"${Interimpath}/sdm_outer_java")
where func1 is a method that accepts a MySDMData instance (ds), performs some logical operations, and returns a MySDMData:
def func1(ds: MySDMData): MySDMData = {
  /* logical operation */
  val obj = MySDMData(ds.case_id, ds.mbr_facet_id, ds.mbr_id, ....)
  obj // return
}
Are you adding an instance of org.apache.log4j.Logger to one of your classes or case classes? See these lines in your stacktrace:
Caused by: NotSerializableException: org.apache.log4j.Logger // this type is not serializable
- field (class: ..., name: logger, type: class org.apache.log4j.Logger)
Spark is trying to serialize that logger. If so, don't do it: a logger is not something to serialize and send somewhere else. Loggers belong to a specific scope of your code; they are meant to be used only where they are created.
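A minimal sketch of two common ways to apply that advice (class, method, and logger names here are illustrative, not taken from the asker's code):

import org.apache.log4j.Logger

// Option 1: keep the field, but mark it @transient lazy so it is skipped during
// task serialization and re-created lazily on each executor instead of being shipped.
class SdmJob extends Serializable {
  @transient lazy val logger: Logger = Logger.getLogger(getClass.getName)
}

// Option 2: look the logger up locally inside the function that runs on the executors,
// so no Logger instance is captured by the closure at all.
def func1(a: MySDMData): MySDMData = {
  val log = Logger.getLogger("func1")
  log.debug("processing case_id=" + a.case_id)
  a // plus whatever transformation func1 already performs
}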

Spark/Scala serialization of list. Task not serializable: java.io.NotSerializableException

The issue is with Spark Dataset and serialization of a list of Ints. Scala version is 2.10.4 and Spark version is 1.6.
This is similar to other questions but I can't get it to work based on those responses. I've simplified the code down in order to just show the problem.
I have a case class:
case class FlightExt(callsign: Option[String], serials: List[Int])
And my main method is like this:
val (ctx, sctx) = SparkUtil.createContext() // just a helper function to build context
val flightsDataFrame = separateFlightsMock(sctx) // reads data from Parquet file
import sctx.implicits._
flightsDataFrame.as[FlightExt]
.map(flight => flight.callsign)
.show()
I get the following error:
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: scala.reflect.internal.Symbols$PackageClassSymbol
Serialization stack:
- object not serializable (class: scala.reflect.internal.Symbols$PackageClassSymbol, value: package scala)
- field (class: scala.reflect.internal.Types$ThisType, name: sym, type: class scala.reflect.internal.Symbols$Symbol)
- object (class scala.reflect.internal.Types$UniqueThisType, scala.type)
- field (class: scala.reflect.internal.Types$TypeRef, name: pre, type: class scala.reflect.internal.Types$Type)
- object (class scala.reflect.internal.Types$TypeRef$$anon$6, scala.Int)
- field (class: org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$5, name: elementType$2, type: class scala.reflect.api.Types$TypeApi)
- object (class org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$5, <function1>)
- field (class: org.apache.spark.sql.catalyst.expressions.MapObjects, name: function, type: interface scala.Function1)
- object (class org.apache.spark.sql.catalyst.expressions.MapObjects, mapobjects(<function1>,cast(serials#7 as array<int>),IntegerType))
- field (class: org.apache.spark.sql.catalyst.expressions.Invoke, name: targetObject, type: class org.apache.spark.sql.catalyst.expressions.Expression)
- object (class org.apache.spark.sql.catalyst.expressions.Invoke, invoke(mapobjects(<function1>,cast(serials#7 as array<int>),IntegerType),array,ObjectType(class [Ljava.lang.Object;)))
- writeObject data (class: scala.collection.immutable.$colon$colon)
- object (class scala.collection.immutable.$colon$colon, List(invoke(mapobjects(<function1>,cast(serials#7 as array<int>),IntegerType),array,ObjectType(class [Ljava.lang.Object;))))
- field (class: org.apache.spark.sql.catalyst.expressions.StaticInvoke, name: arguments, type: interface scala.collection.Seq)
- object (class org.apache.spark.sql.catalyst.expressions.StaticInvoke, staticinvoke(class scala.collection.mutable.WrappedArray$,ObjectType(interface scala.collection.Seq),make,invoke(mapobjects(<function1>,cast(serials#7 as array<int>),IntegerType),array,ObjectType(class [Ljava.lang.Object;)),true))
- writeObject data (class: scala.collection.immutable.$colon$colon)
If I remove the list from FlightExt then everything works fine, which indicates there is no problem with the lambda function serialization.
Scala on its own seems to serialize a list of Ints fine. Perhaps Spark has an issue with serializing Lists?
I've also tried using a Java Integer.
EDIT:
If I change List to Array it works, but if I have something like this:
case class FlightExt(callsign: Option[String], other: Array[AnotherCaseClass])
it also fails with the same error.
I'm new to Scala and Spark and may be missing something, but any explanation would be appreciated.
Put the FlightExt case class inside an object; check the code below.
object Flight {
case class FlightExt(callsign: Option[String], var serials: List[Int])
}
Then use Flight.FlightExt:
val (ctx, sctx) = SparkUtil.createContext() // just a helper function to build context
val flightsDataFrame = separateFlightsMock(sctx) // reads data from Parquet file
import sctx.implicits._
flightsDataFrame.as[Flight.FlightExt]
.map(flight => flight.callsign)
.show()

Spark : Scala mocking, Task not serializable

I am trying to use Mockito for unit testing some Scala code. I want to run Spark locally, i.e. in my IntelliJ IDE. Here is a sample:
class MyScalaSparkTests extends FunSuite with BeforeAndAfter with MockitoSugar with java.io.Serializable {
val configuration:SparkConf = new SparkConf()
.setAppName("Your Application Name")
.setMaster("local");
val sc = new SparkContext(configuration);
lazy val testSess = SparkSession.builder.appName("local_test").getOrCreate()
test ("test service") {
import testSess.implicits._
// (1) init
val testObject = spy(new MyScalaClass(<some args>))
val testDf = testSess.emptyDataset[MyCaseClass1].toDF()
testDf.union(Seq(MyCaseClass(<some args>)).toDF())
testObject.testDataFrame = testDf
val testSource = testSess.emptyDataset[MyCaseClass2].toDF()
testSource.union(Seq(MyCaseClass2(<some args>)).toDF())
testObject.setSourceDf(testSource)
val testRes = testObject.someMethod()
val r = testRes.take(1)
println(r)
}
}
So basically, here is what I am trying to do:
MyScalaClass has someMethod() which compares data between two data frames called testDataFrame and testSource. It then returns another data frame which has the results. Now, in my unit test, I am spying on MyScalaClass to create testObject. Then I create testDataFrame and testSource and assign them to testObject. Finally, I call testObject.someMethod().
Now in the debugger, at this line
val r = testRes.take(1)
I see that testRes is a Dataset, hence something is being returned by the method. But when I try to take something from it in order to verify the results, I get
Task not serializable
org.apache.spark.SparkException: Task not serializable
and further down the stacktrace
Caused by: java.io.NotSerializableException: org.mockito.internal.creation.DelegatingMethod
Serialization stack:
- object not serializable (class: org.mockito.internal.creation.DelegatingMethod, value: org.mockito.internal.creation.DelegatingMethod#a97f2bff)
- field (class: org.mockito.internal.invocation.InterceptedInvocation, name: mockitoMethod, type: interface org.mockito.internal.invocation.MockitoMethod)
- object (class org.mockito.internal.invocation.InterceptedInvocation, bSV2PartValidator.toString();)
- field (class: org.mockito.internal.invocation.InvocationMatcher, name: invocation, type: interface org.mockito.invocation.Invocation)
- object (class org.mockito.internal.invocation.InvocationMatcher, bSV2PartValidator.toString();)
- field (class: org.mockito.internal.stubbing.InvocationContainerImpl, name: invocationForStubbing, type: interface org.mockito.invocation.MatchableInvocation)
- object (class org.mockito.internal.stubbing.InvocationContainerImpl, invocationForStubbing: bSV2PartValidator.toString();)
- field (class: org.mockito.internal.handler.MockHandlerImpl, name: invocationContainer, type: class org.mockito.internal.stubbing.InvocationContainerImpl)
- object (class org.mockito.internal.handler.MockHandlerImpl, org.mockito.internal.handler.MockHandlerImpl#47c019d7)
- field (class: org.mockito.internal.handler.NullResultGuardian, name: delegate, type: interface org.mockito.invocation.MockHandler)
- object (class org.mockito.internal.handler.NullResultGuardian, org.mockito.internal.handler.NullResultGuardian#7222e168)
- field (class: org.mockito.internal.handler.InvocationNotifierHandler, name: mockHandler, type: interface org.mockito.invocation.MockHandler)
- object (class org.mockito.internal.handler.InvocationNotifierHandler, org.mockito.internal.handler.InvocationNotifierHandler#1e4f8430)
- field (class: org.mockito.internal.creation.bytebuddy.MockMethodInterceptor, name: handler, type: interface org.mockito.invocation.MockHandler)
- object (class org.mockito.internal.creation.bytebuddy.MockMethodInterceptor, org.mockito.internal.creation.bytebuddy.MockMethodInterceptor#34d08905)
- field (class: com.walmart.labs.search.signals.validators.BSV2PartValidator$MockitoMock$213785213, name: mockitoInterceptor, type: class org.mockito.internal.creation.bytebuddy.MockMethodInterceptor)
- object (class com.walmart.labs.search.signals.validators.BSV2PartValidator$MockitoMock$213785213, com.walmart.labs.search.signals.validators.BSV2PartValidator$MockitoMock$213785213#7f289126)
- field (class: com.walmart.labs.search.signals.validators.BSV2PartValidator$$anonfun$1, name: $outer, type: class com.walmart.labs.search.signals.validators.BSV2PartValidator)
- object (class com.walmart.labs.search.signals.validators.BSV2PartValidator$$anonfun$1, <function1>)
- element of array (index: 1)
- array (class [Ljava.lang.Object;, size 7)
- field (class: org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8, name: references$1, type: class [Ljava.lang.Object;)
- object (class org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8, <function2>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 78 more
What am I doing wrong? Is it even possible to spy on or mock Spark behavior in an IDE?
Mocks are not serializable by default, as needing to serialize them is usually a code smell in unit testing.
You can try enabling serialization by creating the mock like mock[MyType](Mockito.withSettings().serializable()) and see what happens when Spark tries to use it.
BTW, I recommend using mockito-scala instead of traditional Mockito, as it may save you some other problems.
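A minimal sketch of that suggestion, using the MockitoSugar trait already mixed into the test above (whether serializable settings also play well with a spy of your real class is something to verify):

import org.mockito.Mockito

// Build the mock with serializable settings so Spark's closure serializer can ship it.
// The question uses spy(new MyScalaClass(...)); withSettings() also accepts
// .spiedInstance(...) if real-method behavior is needed, but start with a plain mock.
val testObject = mock[MyScalaClass](Mockito.withSettings().serializable())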

RDD from Dataset results in a Serialization Error with Spark 2.x

I have an RDD that I created from a Dataset using a Databricks notebook.
When I try to get concrete values from it, it simply fails with a serialization error message.
Here is where I get my data (PageCount is a case class):
val pcDf = spark.sql("SELECT * FROM pagecounts20160801")
val pcDs = pcDf.as[PageCount]
val pcRdd = pcDs.rdd
Then when I do:
pcRdd.take(10)
I get the following exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 82.0 (TID 2474) had a not serializable result: org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
Even though the same attempt on the Dataset works:
pcDs.take(10)
EDIT:
Here is the complete stacktrace:
Serialization stack:
- object not serializable (class: org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection, value: <function1>)
- field (class: org.apache.spark.sql.execution.datasources.FileFormat$$anon$1, name: appendPartitionColumns, type: class org.apache.spark.sql.catalyst.expressions.UnsafeProjection)
- object (class org.apache.spark.sql.execution.datasources.FileFormat$$anon$1, <function1>)
- field (class: org.apache.spark.sql.execution.datasources.FileScanRDD, name: readFunction, type: interface scala.Function1)
- object (class org.apache.spark.sql.execution.datasources.FileScanRDD, FileScanRDD[1095] at )
- field (class: org.apache.spark.NarrowDependency, name: _rdd, type: class org.apache.spark.rdd.RDD)
- object (class org.apache.spark.OneToOneDependency, org.apache.spark.OneToOneDependency#502bfe49)
- writeObject data (class: scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.List$SerializationProxy, scala.collection.immutable.List$SerializationProxy#51dc790)
- writeReplace data (class: scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.$colon$colon, List(org.apache.spark.OneToOneDependency#502bfe49))
- field (class: org.apache.spark.rdd.RDD, name: org$apache$spark$rdd$RDD$$dependencies_, type: interface scala.collection.Seq)
- object (class org.apache.spark.rdd.MapPartitionsRDD, MapPartitionsRDD[1096] at )
- field (class: org.apache.spark.NarrowDependency, name: _rdd, type: class org.apache.spark.rdd.RDD)
- object (class org.apache.spark.OneToOneDependency, org.apache.spark.OneToOneDependency#52ce8951)
- writeObject data (class: scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.List$SerializationProxy, scala.collection.immutable.List$SerializationProxy#57850f0)
- writeReplace data (class: scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.$colon$colon, List(org.apache.spark.OneToOneDependency#52ce8951))
- field (class: org.apache.spark.rdd.RDD, name: org$apache$spark$rdd$RDD$$dependencies_, type: interface scala.collection.Seq)
- object (class org.apache.spark.rdd.MapPartitionsRDD, MapPartitionsRDD[1097] at )
- field (class: org.apache.spark.NarrowDependency, name: _rdd, type: class org.apache.spark.rdd.RDD)
- object (class org.apache.spark.OneToOneDependency, org.apache.spark.OneToOneDependency#7e99329a)
- writeObject data (class: scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.List$SerializationProxy, scala.collection.immutable.List$SerializationProxy#792f3145)
- writeReplace data (class: scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.$colon$colon, List(org.apache.spark.OneToOneDependency#7e99329a))
- field (class: org.apache.spark.rdd.RDD, name: org$apache$spark$rdd$RDD$$dependencies_, type: interface scala.collection.Seq)
- object (class org.apache.spark.rdd.MapPartitionsRDD, MapPartitionsRDD[1098] at )
- field (class: org.apache.spark.sql.Dataset, name: rdd, type: class org.apache.spark.rdd.RDD)
- object (class org.apache.spark.sql.Dataset, Invalid tree; null:
null)
- field (class: lineb9de310f01c84f49b76c6c6295a1393c121.$read$$iw$$iw$$iw$$iw, name: pcDs, type: class org.apache.spark.sql.Dataset)
- object (class lineb9de310f01c84f49b76c6c6295a1393c121.$read$$iw$$iw$$iw$$iw, lineb9de310f01c84f49b76c6c6295a1393c121.$read$$iw$$iw$$iw$$iw#3482035d)
- field (class: lineb9de310f01c84f49b76c6c6295a1393c121.$read$$iw$$iw$$iw$$iw$PageCount, name: $outer, type: class lineb9de310f01c84f49b76c6c6295a1393c121.$read$$iw$$iw$$iw$$iw)
- object (class lineb9de310f01c84f49b76c6c6295a1393c121.$read$$iw$$iw$$iw$$iw$PageCount, PageCount(de.b,Spezial:Linkliste/Datei:Playing_card_diamond_9.svg,1,6053))
- element of array (index: 0)
- array (class [Llineb9de310f01c84f49b76c6c6295a1393c121.$read$$iw$$iw$$iw$$iw$PageCount;, size 10)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1452)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1440)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1439)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1439)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1665)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1868)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1881)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1894)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1311)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.take(RDD.scala:1285)
at lineb9de310f01c84f49b76c6c6295a1393c137.$read$$iw$$iw$$iw$$iw.<init>(<console>:33)
at lineb9de310f01c84f49b76c6c6295a1393c137.$read$$iw$$iw$$iw.<init>(<console>:40)
at lineb9de310f01c84f49b76c6c6295a1393c137.$read$$iw$$iw.<init>(<console>:42)
at lineb9de310f01c84f49b76c6c6295a1393c137.$read$$iw.<init>(<console>:44)
at lineb9de310f01c84f49b76c6c6295a1393c137.$eval$.$print$lzycompute(<console>:7)
at lineb9de310f01c84f49b76c6c6295a1393c137.$eval$.$print(<console>:6)
The PageCount class definitely holds a non-serializable reference (some non-transient, non-serializable member, or maybe a parent type with the same problem). Because the object itself cannot be serialized, Spark tries to serialize the enclosing scope, pulling in more and more of its members, including, somewhere up the road, a member of FileFormat: the projection generated by Janino, which is not serializable by design.
That failure is just a side effect of the bad serialization of the target object (PageCount).
Relevant code from Spark's FileFormat.scala (it should be marked @transient to really avoid serialization, in case appendPartitionColumns has already been materialized):
// Using lazy val to avoid serialization
private lazy val appendPartitionColumns =
  GenerateUnsafeProjection.generate(fullSchema, fullSchema)
This unexpected serialization would never happen in a regular scenario, though, as long as serialization of the user-defined types succeeds.
Serializing a Spark RDD (a raw type: Spark knows no global schema for it) means serializing the complete objects during materialization, both the object data and the object "schema" (its type). The default mechanism is the Java serializer (so you can try to serialize a PageCount with the Java serializer yourself, which may reveal the problem with this type). It can be replaced with the more efficient Kryo serializer, which however serializes each object into an opaque blob, so the schema is lost and SQL that requires column access can no longer be applied. This is why accessing the RDD triggers the serialization issue.
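For reference, a minimal sketch of switching the default serializer to the Kryo serializer mentioned above (standard Spark configuration keys; registering PageCount is optional, and the schema caveat just described still applies):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[PageCount]))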
DataFrames / Datasets, on the other hand, are bound to a schema that is known to Spark, so Spark does not need to pass the object structure between nodes; only the data is passed.
This is why there is no problem materializing the Dataset / DataFrame whose underlying object type is PageCount.
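A minimal sketch of the check suggested above: collect a single PageCount to the driver (the Dataset path works, per the question) and Java-serialize it directly. If this throws NotSerializableException, the problem lies in PageCount, or in something it captures such as the enclosing notebook cell, rather than in Spark itself.

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

val sample = pcDs.head() // one PageCount instance on the driver
val probe = new ObjectOutputStream(new ByteArrayOutputStream())
probe.writeObject(sample) // throws NotSerializableException if PageCount cannot be serialized
probe.close()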

Spark Scala API (Zeppelin Notebook): Ignore serialization of foreachRDD transformation

Calling a foreachRDD transformation causes a NotSerializableException in my Spark Scala project in a Zeppelin notebook.
It's a Streaming application which gathers data over windows, so I had to enable checkpointing:
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3n.awsAccessKeyId", "xxx")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "xxx")
streamingCtx.checkpoint("s3n://mybucket/data/checkpoints")
Here is the offending line (without it, everything works perfectly):
countPerPlug1h.foreachRDD(rdd => rdd.toDF().registerTempTable("test6"))
I get the following error:
java.io.NotSerializableException: DStream checkpointing has been enabled but the DStreams with their functions are not serializable
org.apache.spark.streaming.StreamingContext
Serialization stack:
- object not serializable (class: org.apache.spark.streaming.StreamingContext, value: org.apache.spark.streaming.StreamingContext#13d5321)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: streamingCtx, type: class org.apache.spark.streaming.StreamingContext)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC#50a1a698)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC#1d4c88f7)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC#257426e0)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC#125ec828)
- field (class: $iwC$$iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC#48950383)
- field (class: $iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC, $iwC$$iwC$$iwC#412cdee8)
- field (class: $iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC)
- object (class $iwC$$iwC, $iwC$$iwC#1d8af6ae)
- field (class: $iwC, name: $iw, type: class $iwC$$iwC)
- object (class $iwC, $iwC#1a73298c)
- field (class: $line36.$read, name: $iw, type: class $iwC)
- object (class $line36.$read, $line36.$read#b6c225e)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: $VAL184, type: class $line36.$read)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC#3bcc2573)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: $outer, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC#21c54459)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1, name: $outer, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1, <function1>)
- field (class: org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3, name: cleanedF$1, type: interface scala.Function1)
- object (class org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3, <function2>)
- writeObject data (class: org.apache.spark.streaming.dstream.DStream)
- object (class org.apache.spark.streaming.dstream.ForEachDStream, org.apache.spark.streaming.dstream.ForEachDStream#5697482b)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 16)
- field (class: scala.collection.mutable.ArrayBuffer, name: array, type: class [Ljava.lang.Object;)
- object (class scala.collection.mutable.ArrayBuffer, ArrayBuffer(org.apache.spark.streaming.dstream.ForEachDStream#5697482b, org.apache.spark.streaming.dstream.ForEachDStream#4b0a272f))
- writeObject data (class: org.apache.spark.streaming.dstream.DStreamCheckpointData)
- object (class org.apache.spark.streaming.dstream.DStreamCheckpointData, [
0 checkpoint files
])
- writeObject data (class: org.apache.spark.streaming.dstream.DStream)
- object (class org.apache.spark.streaming.dstream.SocketInputDStream, org.apache.spark.streaming.dstream.SocketInputDStream#3cbc22e2)
- writeObject data (class: org.apache.spark.streaming.dstream.DStreamCheckpointData)
- object (class org.apache.spark.streaming.dstream.DStreamCheckpointData, [
0 checkpoint files
])
- writeObject data (class: org.apache.spark.streaming.dstream.DStream)
- object (class org.apache.spark.streaming.dstream.SocketInputDStream, org.apache.spark.streaming.dstream.SocketInputDStream#3d84a839)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 16)
- field (class: scala.collection.mutable.ArrayBuffer, name: array, type: class [Ljava.lang.Object;)
- object (class scala.collection.mutable.ArrayBuffer, ArrayBuffer(org.apache.spark.streaming.dstream.SocketInputDStream#3d84a839, org.apache.spark.streaming.dstream.SocketInputDStream#3cbc22e2))
- writeObject data (class: org.apache.spark.streaming.DStreamGraph)
- object (class org.apache.spark.streaming.DStreamGraph, org.apache.spark.streaming.DStreamGraph#27cc3d89)
- field (class: org.apache.spark.streaming.Checkpoint, name: graph, type: class org.apache.spark.streaming.DStreamGraph)
- object (class org.apache.spark.streaming.Checkpoint, org.apache.spark.streaming.Checkpoint#56913019)
at org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:566)
at org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:602)
at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:601)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:42)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:47)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:49)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:51)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:53)
at $iwC$$iwC$$iwC.<init>(<console>:55)
at $iwC$$iwC.<init>(<console>:57)
at $iwC.<init>(<console>:59)
at <init>(<console>:61)
at .<init>(<console>:65)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.zeppelin.spark.SparkInterpreter.interpretInput(SparkInterpreter.java:658)
at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:623)
at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:616)
at org.apache.zeppelin.interpreter.ClassloaderInterpreter.interpret(ClassloaderInterpreter.java:57)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:276)
at org.apache.zeppelin.scheduler.Job.run(Job.java:170)
at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:118)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
But I need this table for presentation purposes. So my question: is there a way to avoid serialization of this line?
EDIT
I found a mailing-list post which seems to describe the same problem:
https://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%3CCAOADwJHhW7Rtv6sNpd3_y5Q12Uu6NEYLqgdT2H-WcnOf+3Aa-g#mail.gmail.com%3E
Can someone give me a clue about it?