Spark: Scala mocking, Task not serializable

I am trying to use Mockito for unit testing some Scala code. I want to run Spark locally, i.e. in my IntelliJ IDE. Here is a sample:
class MyScalaSparkTests extends FunSuite with BeforeAndAfter with MockitoSugar with java.io.Serializable {

  val configuration: SparkConf = new SparkConf()
    .setAppName("Your Application Name")
    .setMaster("local")
  val sc = new SparkContext(configuration)
  lazy val testSess = SparkSession.builder.appName("local_test").getOrCreate()

  test("test service") {
    import testSess.implicits._
    // (1) init
    val testObject = spy(new MyScalaClass(<some args>))

    val testDf = testSess.emptyDataset[MyCaseClass1].toDF()
    testDf.union(Seq(MyCaseClass(<some args>)).toDF())
    testObject.testDataFrame = testDf

    val testSource = testSess.emptyDataset[MyCaseClass2].toDF()
    testSource.union(Seq(MyCaseClass2(<some args>)).toDF())
    testObject.setSourceDf(testSource)

    val testRes = testObject.someMethod()
    val r = testRes.take(1)
    println(r)
  }
}
So basically, here is what I am trying to do:
MyScalaClass has someMethod(), which compares data between two data frames called testDataFrame and testSource and returns another data frame containing the results. In my unit test, I am spying on MyScalaClass to create testObject. Then I create testDataFrame and testSource and assign them to testObject. Finally, I call testObject.someMethod().
Now in the debugger, at this line
val r = testRes.take(1)
I see that testRes is a Dataset, so something is being returned by the method. But when I try to take something from it in order to verify the results, I get:
Task not serializable
org.apache.spark.SparkException: Task not serializable
and further down the stacktrace
Caused by: java.io.NotSerializableException: org.mockito.internal.creation.DelegatingMethod
Serialization stack:
- object not serializable (class: org.mockito.internal.creation.DelegatingMethod, value: org.mockito.internal.creation.DelegatingMethod#a97f2bff)
- field (class: org.mockito.internal.invocation.InterceptedInvocation, name: mockitoMethod, type: interface org.mockito.internal.invocation.MockitoMethod)
- object (class org.mockito.internal.invocation.InterceptedInvocation, bSV2PartValidator.toString();)
- field (class: org.mockito.internal.invocation.InvocationMatcher, name: invocation, type: interface org.mockito.invocation.Invocation)
- object (class org.mockito.internal.invocation.InvocationMatcher, bSV2PartValidator.toString();)
- field (class: org.mockito.internal.stubbing.InvocationContainerImpl, name: invocationForStubbing, type: interface org.mockito.invocation.MatchableInvocation)
- object (class org.mockito.internal.stubbing.InvocationContainerImpl, invocationForStubbing: bSV2PartValidator.toString();)
- field (class: org.mockito.internal.handler.MockHandlerImpl, name: invocationContainer, type: class org.mockito.internal.stubbing.InvocationContainerImpl)
- object (class org.mockito.internal.handler.MockHandlerImpl, org.mockito.internal.handler.MockHandlerImpl#47c019d7)
- field (class: org.mockito.internal.handler.NullResultGuardian, name: delegate, type: interface org.mockito.invocation.MockHandler)
- object (class org.mockito.internal.handler.NullResultGuardian, org.mockito.internal.handler.NullResultGuardian#7222e168)
- field (class: org.mockito.internal.handler.InvocationNotifierHandler, name: mockHandler, type: interface org.mockito.invocation.MockHandler)
- object (class org.mockito.internal.handler.InvocationNotifierHandler, org.mockito.internal.handler.InvocationNotifierHandler#1e4f8430)
- field (class: org.mockito.internal.creation.bytebuddy.MockMethodInterceptor, name: handler, type: interface org.mockito.invocation.MockHandler)
- object (class org.mockito.internal.creation.bytebuddy.MockMethodInterceptor, org.mockito.internal.creation.bytebuddy.MockMethodInterceptor#34d08905)
- field (class: com.walmart.labs.search.signals.validators.BSV2PartValidator$MockitoMock$213785213, name: mockitoInterceptor, type: class org.mockito.internal.creation.bytebuddy.MockMethodInterceptor)
- object (class com.walmart.labs.search.signals.validators.BSV2PartValidator$MockitoMock$213785213, com.walmart.labs.search.signals.validators.BSV2PartValidator$MockitoMock$213785213#7f289126)
- field (class: com.walmart.labs.search.signals.validators.BSV2PartValidator$$anonfun$1, name: $outer, type: class com.walmart.labs.search.signals.validators.BSV2PartValidator)
- object (class com.walmart.labs.search.signals.validators.BSV2PartValidator$$anonfun$1, <function1>)
- element of array (index: 1)
- array (class [Ljava.lang.Object;, size 7)
- field (class: org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8, name: references$1, type: class [Ljava.lang.Object;)
- object (class org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8, <function2>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 78 more
What am I doing wrong? Is it even possible to spy on or mock Spark behavior in an IDE?

Mocks are not serialisable by default, as that is usually a code smell in unit testing.
You can try enabling serialisation by creating the mock like mock[MyType](Mockito.withSettings().serializable()) and see what happens when Spark tries to use it.
BTW, I recommend using mockito-scala instead of traditional Mockito, as it may save you some other problems.
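For instance, a minimal sketch of what that could look like with plain Mockito (MyScalaClass and its constructor arguments are placeholders from the question):

import org.mockito.Mockito
import org.mockito.Mockito.withSettings

// Serializable mock: the extra MockSettings make Mockito generate a mock
// that can survive Java serialization when Spark ships the task.
val serializableMock = Mockito.mock(classOf[MyScalaClass], withSettings().serializable())

// Serializable "spy": wrap a real instance and call real methods by default.
val serializableSpy = Mockito.mock(
  classOf[MyScalaClass],
  withSettings()
    .spiedInstance(new MyScalaClass(/* some args */))
    .defaultAnswer(Mockito.CALLS_REAL_METHODS)
    .serializable()
)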

Related

Spark NotSerializableException when overriding log4j logs

I have overridden the Databricks log4j logging using an init script. When my code is triggered it runs fine up to a point. When it reaches the lines below:
val ds = df.as[MySDMData]
ds.map(a => func1(a)).write.format("delta").mode("overwrite").option("header","true").save(s"${Interimpath}/sdm_outer_java")
it fails with the following stacktrace:
Caused by: Job aborted due to stage failure.
Caused by: NotSerializableException: org.apache.log4j.Logger
Serialization stack:
- object not serializable (class: org.apache.log4j.Logger, value: org.apache.log4j.Logger#33280b3f)
- field (class: $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw$$iw$$iw$$iw$$iw, name: logger, type: class org.apache.log4j.Logger)
- object (class $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw$$iw$$iw$$iw$$iw, $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw$$iw$$iw$$iw$$iw#78706dad)
- field (class: $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw$$iw$$iw$$iw, name: $iw, type: class $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw$$iw$$iw$$iw$$iw)
- object (class $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw$$iw$$iw$$iw, $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw$$iw$$iw$$iw#259eefbe)
- field (class: $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw$$iw$$iw, name: $iw, type: class $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw$$iw$$iw$$iw)
- object (class $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw$$iw$$iw, $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw$$iw$$iw#58203e28)
- field (class: $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw$$iw, name: $iw, type: class $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw$$iw$$iw)
- object (class $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw$$iw, $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw$$iw#621d2f16)
- field (class: $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw, name: $iw, type: class $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw$$iw)
- object (class $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw, $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw#49da2284)
- field (class: $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw, name: $iw, type: class $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw$$iw)
- object (class $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw, $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw#4f8ac0c9)
- field (class: $linef43f9ceebbd54e07ba09b7bf5984364029.$read, name: $iw, type: class $linef43f9ceebbd54e07ba09b7bf5984364029.$read$$iw)
- object (class $linef43f9ceebbd54e07ba09b7bf5984364029.$read, $linef43f9ceebbd54e07ba09b7bf5984364029.$read#3fc7ede0)
- field (class: $linef43f9ceebbd54e07ba09b7bf5984364041.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw, name: $linef43f9ceebbd54e07ba09b7bf5984364029$read, type: class $linef43f9ceebbd54e07ba09b7bf5984364029.$read)
- object (class $linef43f9ceebbd54e07ba09b7bf5984364041.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw, $linef43f9ceebbd54e07ba09b7bf5984364041.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw#375936db)
- field (class: $linef43f9ceebbd54e07ba09b7bf5984364041.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw, name: $outer, type: class $linef43f9ceebbd54e07ba09b7bf5984364041.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw)
- object (class $linef43f9ceebbd54e07ba09b7bf5984364041.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw, $linef43f9ceebbd54e07ba09b7bf5984364041.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw#750dbdc3)
- field (class: $linef43f9ceebbd54e07ba09b7bf5984364041.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw, name: $outer, type: class $linef43f9ceebbd54e07ba09b7bf5984364041.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw)
- object (class $linef43f9ceebbd54e07ba09b7bf5984364041.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw, $linef43f9ceebbd54e07ba09b7bf5984364041.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw#4c9f0247)
- field (class: $linef43f9ceebbd54e07ba09b7bf5984364041.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw, name: $outer, type: class $linef43f9ceebbd54e07ba09b7bf5984364041.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw)
- object (class $linef43f9ceebbd54e07ba09b7bf5984364041.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw, $linef43f9ceebbd54e07ba09b7bf5984364041.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw#e81047)
- element of array (index: 4)
- array (class [Ljava.lang.Object;, size 7)
- element of array (index: 1)
- array (class [Ljava.lang.Object;, size 3)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.sql.execution.WholeStageCodegenExec, functionalInterfaceMethod=scala/Function2.apply:(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/sql/execution/WholeStageCodegenExec.$anonfun$doExecute$4$adapted:(Lorg/apache/spark/sql/catalyst/expressions/codegen/CodeAndComment;[Ljava/lang/Object;Lorg/apache/spark/sql/execution/metric/SQLMetric;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, instantiatedMethodType=(Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, numCaptured=3])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$6664/133918073, org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$6664/133918073#36b47765)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 1)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.rdd.RDD, functionalInterfaceMethod=scala/Function3.apply:(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/rdd/RDD.$anonfun$mapPartitionsWithIndexInternal$2$adapted:(Lscala/Function2;Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, instantiatedMethodType=(Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, numCaptured=1])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class org.apache.spark.rdd.RDD$$Lambda$6661/1112983908, org.apache.spark.rdd.RDD$$Lambda$6661/1112983908#63563080)
- field (class: org.apache.spark.rdd.MapPartitionsRDD, name: f, type: interface scala.Function3)
- object (class org.apache.spark.rdd.MapPartitionsRDD, MapPartitionsRDD[4613] at execute at DeltaInvariantCheckerExec.scala:85)
- field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
- object (class scala.Tuple2, (MapPartitionsRDD[4613] at execute at DeltaInvariantCheckerExec.scala:85,org.apache.spark.sql.execution.datasources.FileFormatWriter$$$Lambda$7277/44213980#6853ff1b))
This is my case class
case class MyData(var case_id: String,
                  var mbr_facet_id: String,
                  var mbr_id: String,
                  ....
                  var tpc_chg: String,
                  var icue_evi_flg: String)
I am mapping my case class over the dataframe, as shown below:
val ds = df.as[MyData]
ds.map(a => func1(a)).write.mode("overwrite").option("header","true").parquet(s"${Interimpath}/cdf_cpm_interim3")
When it reaches this point I get the error. Is the error because of the map function? How do I solve it?
New Edit
We have a dataframe df, and MySDMData is a case class with some parameters.
Using it I make the data types match on both sides:
val ds = df.as[MySDMData]
Here ds is a Dataset.
Then I do the following:
ds.map(a => func1(a)).write.format("delta").mode("overwrite").option("header", "true").save(s"${Interimpath}/sdm_outer_java")
where func1 is a method that accepts a MySDMData record, performs some logical operation, and returns a new MySDMData:
def func1(ds: MySDMData): MySDMData = {
  /* logical operation */
  val obj = MySDMData(ds.case_id, ds.mbr_facet_id, ds.mbr_id, ....)
  obj // return
}
Are you adding an instance of org.apache.log4j.Logger to one of your classes/case classes? See these lines in your stack trace:
Caused by: NotSerializableException: org.apache.log4j.Logger    // this type is not serializable
...
- field (class: ..., name: logger, type: class org.apache.log4j.Logger)    // <- the logger field being pulled in
It is trying to serialize the logger. Don't do that: a logger is not something to serialize and send somewhere else; loggers belong to a specific scope of your code (they are meant to be used only where they are defined).
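A common way out (a sketch, not tied to the asker's exact notebook code; log4j 1.x and the names Logging/MyJob are assumptions) is to keep the logger out of anything Spark has to serialize, either by holding it in a standalone object or by marking it @transient lazy so it is rebuilt on each executor instead of being shipped:

import org.apache.log4j.Logger

// Option 1: keep the logger in a top-level object. Scala objects are accessed
// like statics, so referencing Logging.logger from a closure does not drag the
// logger into the serialized task.
object Logging {
  @transient lazy val logger: Logger = Logger.getLogger("MyJob")
}

// Option 2: if the logger must live on a class that gets serialized, mark it
// @transient lazy val; it is skipped during serialization and re-created
// lazily on the executor the first time it is used.
class MyJob extends Serializable {
  @transient lazy val logger: Logger = Logger.getLogger(classOf[MyJob])

  def transform(value: String): String = {
    logger.info(s"processing $value")   // safe: the logger is rebuilt where it is used
    value
  }
}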

Spark/Scala serialization of list. Task not serializable: java.io.NotSerializableException

The issue is with Spark Dataset and serialization of a list of Ints. Scala version is 2.10.4 and Spark version is 1.6.
This is similar to other questions but I can't get it to work based on those responses. I've simplified the code down in order to just show the problem.
I have a case class:
case class FlightExt(callsign: Option[String], serials: List[Int])
And my main method is like this:
val (ctx, sctx) = SparkUtil.createContext() // just a helper function to build context
val flightsDataFrame = separateFlightsMock(sctx) // reads data from Parquet file
import sctx.implicits._
flightsDataFrame.as[FlightExt]
.map(flight => flight.callsign)
.show()
I get the following error:
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: scala.reflect.internal.Symbols$PackageClassSymbol
Serialization stack:
- object not serializable (class: scala.reflect.internal.Symbols$PackageClassSymbol, value: package scala)
- field (class: scala.reflect.internal.Types$ThisType, name: sym, type: class scala.reflect.internal.Symbols$Symbol)
- object (class scala.reflect.internal.Types$UniqueThisType, scala.type)
- field (class: scala.reflect.internal.Types$TypeRef, name: pre, type: class scala.reflect.internal.Types$Type)
- object (class scala.reflect.internal.Types$TypeRef$$anon$6, scala.Int)
- field (class: org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$5, name: elementType$2, type: class scala.reflect.api.Types$TypeApi)
- object (class org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$5, <function1>)
- field (class: org.apache.spark.sql.catalyst.expressions.MapObjects, name: function, type: interface scala.Function1)
- object (class org.apache.spark.sql.catalyst.expressions.MapObjects, mapobjects(<function1>,cast(serials#7 as array<int>),IntegerType))
- field (class: org.apache.spark.sql.catalyst.expressions.Invoke, name: targetObject, type: class org.apache.spark.sql.catalyst.expressions.Expression)
- object (class org.apache.spark.sql.catalyst.expressions.Invoke, invoke(mapobjects(<function1>,cast(serials#7 as array<int>),IntegerType),array,ObjectType(class [Ljava.lang.Object;)))
- writeObject data (class: scala.collection.immutable.$colon$colon)
- object (class scala.collection.immutable.$colon$colon, List(invoke(mapobjects(<function1>,cast(serials#7 as array<int>),IntegerType),array,ObjectType(class [Ljava.lang.Object;))))
- field (class: org.apache.spark.sql.catalyst.expressions.StaticInvoke, name: arguments, type: interface scala.collection.Seq)
- object (class org.apache.spark.sql.catalyst.expressions.StaticInvoke, staticinvoke(class scala.collection.mutable.WrappedArray$,ObjectType(interface scala.collection.Seq),make,invoke(mapobjects(<function1>,cast(serials#7 as array<int>),IntegerType),array,ObjectType(class [Ljava.lang.Object;)),true))
- writeObject data (class: scala.collection.immutable.$colon$colon)
If I remove the list from FlightExt then everything works fine, which indicates there is no problem with the lambda function serialization.
Scala on its own seems to serialize a list of Ints fine. Perhaps Spark has an issue with serializing Lists?
I've also tried using a Java Integer.
EDIT:
If I change List to Array it works but if I have something like this:
case class FlightExt(callsign: Option[String], other: Array[AnotherCaseClass])
It also fails with the same error
I'm new to Scala and Spark and may be missing something, but any explanation would be appreciated.
Put the FlightExt class inside an object (so it is a top-level definition that does not capture a reference to the enclosing class or REPL instance). Check the code below.
object Flight {
  case class FlightExt(callsign: Option[String], var serials: List[Int])
}
Use Flight.FlightExt
val (ctx, sctx) = SparkUtil.createContext() // just a helper function to build context
val flightsDataFrame = separateFlightsMock(sctx) // reads data from Parquet file
import sctx.implicits._
flightsDataFrame.as[Flight.FlightExt]
.map(flight => flight.callsign)
.show()

How do I read a Scala serialization stack (from Spark)?

How do I read a serialization stack?
I'm building a distributed NLP application on top of Spark. I periodically run into these NotSerializable exceptions, and always muddle my way through them. But, I've never found good documentation on what everything in the Serialization stack means.
How do I read a Serialization stack accompanying a NotSerializable error in Scala? How do I pinpoint the class or object that is causing the error? What is the significance of the 'field', 'object', 'writeObject' and 'writeReplace' fields in the stack?
An example is below:
Caused by: java.io.NotSerializableException: MyPackage.testing.PreprocessTest$$typecreator1$1
Serialization stack:
- object not serializable (class: MyPackage.testing.PreprocessTest$$typecreator1$1, value: MyPackage.testing.PreprocessTest$$typecreator1$1#27f6854b)
- writeObject data (class: scala.reflect.api.SerializedTypeTag)
- object (class scala.reflect.api.SerializedTypeTag, scala.reflect.api.SerializedTypeTag#4a571516)
- writeReplace data (class: scala.reflect.api.SerializedTypeTag)
- object (class scala.reflect.api.TypeTags$TypeTagImpl, TypeTag[String])
- field (class: MyPackage.package$$anonfun$deserializeMaps$1, name: evidence$1$1, type: interface scala.reflect.api.TypeTags$TypeTag)
- object (class MyPackage.package$$anonfun$deserializeMaps$1, <function1>)
- field (class: MyPackage.package$$anonfun$deserializeMaps$1$$anonfun$apply$4, name: $outer, type: class MyPackage.package$$anonfun$deserializeMaps$1)
- object (class MyPackage.package$$anonfun$deserializeMaps$1$$anonfun$apply$4, <function1>)
- field (class: MyPackage.package$$anonfun$deserializeMaps$1$$anonfun$apply$4$$anonfun$apply$5, name: $outer, type: class MyPackage.package$$anonfun$deserializeMaps$1$$anonfun$apply$4)
- object (class MyPackage.package$$anonfun$deserializeMaps$1$$anonfun$apply$4$$anonfun$apply$5, <function1>)
- field (class: org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2, name: func$2, type: interface scala.Function1)
- object (class org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2, <function1>)
- field (class: org.apache.spark.sql.catalyst.expressions.ScalaUDF, name: f, type: interface scala.Function1)
- object (class org.apache.spark.sql.catalyst.expressions.ScalaUDF, UDF(UDF(tokenMap#149)))
- field (class: org.apache.spark.sql.catalyst.expressions.Alias, name: child, type: class org.apache.spark.sql.catalyst.expressions.Expression)
- object (class org.apache.spark.sql.catalyst.expressions.Alias, UDF(UDF(tokenMap#149)) AS tokenMap#3131)
- writeObject data (class: scala.collection.immutable.$colon$colon)
- object (class scala.collection.immutable.$colon$colon, List(id#148, UDF(UDF(tokenMap#149)) AS tokenMap#3131, UDF(UDF(bigramMap#150)) AS bigramMap#3132, sentences#151, se_sentence_count#152, se_word_count#153, se_subjective_count#154, se_objective_count#155, se_document_sentiment#156, UDF(UDF(se_category#157)) AS se_category#3133))
- field (class: org.apache.spark.sql.execution.Project, name: projectList, type: interface scala.collection.Seq)
- object (class org.apache.spark.sql.execution.Project, Project [id#148,UDF(UDF(tokenMap#149)) AS tokenMap#3131,UDF(UDF(bigramMap#150)) AS bigramMap#3132,sentences#151,se_sentence_count#152,se_word_count#153,se_subjective_count#154,se_objective_count#155,se_document_sentiment#156,UDF(UDF(se_category#157)) AS se_category#3133]
+- InMemoryColumnarTableScan [se_sentence_count#152,bigramMap#150,id#148,tokenMap#149,se_word_count#153,sentences#151,se_document_sentiment#156,se_subjective_count#154,se_category#157,se_objective_count#155], InMemoryRelation [id#148,tokenMap#149,bigramMap#150,sentences#151,se_sentence_count#152,se_word_count#153,se_subjective_count#154,se_objective_count#155,se_document_sentiment#156,se_category#157], true, 10000, StorageLevel(true, true, false, true, 1), Union, None
)
- field (class: org.apache.spark.sql.execution.ConvertToSafe, name: child, type: class org.apache.spark.sql.execution.SparkPlan)
- object (class org.apache.spark.sql.execution.ConvertToSafe, ConvertToSafe
+- Project [id#148,UDF(UDF(tokenMap#149)) AS tokenMap#3131,UDF(UDF(bigramMap#150)) AS bigramMap#3132,sentences#151,se_sentence_count#152,se_word_count#153,se_subjective_count#154,se_objective_count#155,se_document_sentiment#156,UDF(UDF(se_category#157)) AS se_category#3133]
+- InMemoryColumnarTableScan [se_sentence_count#152,bigramMap#150,id#148,tokenMap#149,se_word_count#153,sentences#151,se_document_sentiment#156,se_subjective_count#154,se_category#157,se_objective_count#155], InMemoryRelation [id#148,tokenMap#149,bigramMap#150,sentences#151,se_sentence_count#152,se_word_count#153,se_subjective_count#154,se_objective_count#155,se_document_sentiment#156,se_category#157], true, 10000, StorageLevel(true, true, false, true, 1), Union, None
)
- field (class: org.apache.spark.sql.execution.ConvertToSafe$$anonfun$2, name: $outer, type: class org.apache.spark.sql.execution.ConvertToSafe)
- object (class org.apache.spark.sql.execution.ConvertToSafe$$anonfun$2, <function1>)
You can get a feeling for what that output means by looking at the visit method in SerializationDebugger.scala.
At the time of writing this answer, it looks like this:
/**
 * Visit the object and its fields and stop when we find an object that is not serializable.
 * Return the path as a list. If everything can be serialized, return an empty list.
 */
def visit(o: Any, stack: List[String]): List[String] = {
  if (o == null) {
    List.empty
  } else if (visited.contains(o)) {
    List.empty
  } else {
    visited += o
    o match {
      // Primitive value, string, and primitive arrays are always serializable
      case _ if o.getClass.isPrimitive => List.empty
      case _: String => List.empty
      case _ if o.getClass.isArray && o.getClass.getComponentType.isPrimitive => List.empty

      // Traverse non primitive array.
      case a: Array[_] if o.getClass.isArray && !o.getClass.getComponentType.isPrimitive =>
        val elem = s"array (class ${a.getClass.getName}, size ${a.length})"
        visitArray(o.asInstanceOf[Array[_]], elem :: stack)

      case e: java.io.Externalizable =>
        val elem = s"externalizable object (class ${e.getClass.getName}, $e)"
        visitExternalizable(e, elem :: stack)

      case s: Object with java.io.Serializable =>
        val elem = s"object (class ${s.getClass.getName}, $s)"
        visitSerializable(s, elem :: stack)

      case _ =>
        // Found an object that is not serializable!
        s"object not serializable (class: ${o.getClass.getName}, value: $o)" :: stack
    }
  }
}
So this visit method traverses your object that needs to be serialized. There are multiple cases:
There are cases where there is no serialization issue:
your object is null
visited.contains(o) => you've traversed the whole object without an issue
o.getClass.isPrimitive => you're serializing a primitive type
String can be serialized
o.getClass.isArray && o.getClass.getComponentType.isPrimitive => serializing an array of primitive types
Then there are 3 cases where a more complicated object is encountered and it needs to be checked whether that complicated object contains unserializable objects. They call the visitArray, visitExternalizable and visitSerializable functions with the current element pushed onto the serialization stack (see elem :: stack), and those end up calling visit again. This happens recursively until:
you're done traversing the object (visited.contains(o))
you bounce onto an unserializable object
In your case, it seems you made a UDF that needed to be serialized. It contained a bunch of Serializable objects (ConvertToSafe, Project, List, Alias, ScalaUDF, some anonymous functions, TypeTags) and then finally bounced against the unserializable object: MyPackage.testing.PreprocessTest$$typecreator1$1, an anonymous class generated inside PreprocessTest. So that last one is the one you need to make serializable if you want this to pass.
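As a concrete illustration of the most common pattern behind "field ..., name: $outer" entries, here is a minimal sketch (the class and field names are hypothetical, not taken from the question): a lambda defined inside a non-serializable class silently captures the enclosing instance, and copying what you need into a local val removes the capture.

import org.apache.spark.sql.SparkSession

class Preprocess {                        // note: does not extend Serializable
  val suffix = "_clean"

  def broken(spark: SparkSession): Array[String] = {
    import spark.implicits._
    // `suffix` really means `this.suffix`, so the closure captures `this`;
    // the serialization stack then reports Preprocess as "object not serializable"
    // and shows the anonymous function's $outer field pointing back at it.
    Seq("a", "b").toDS().map(_ + suffix).collect()
  }

  def fixed(spark: SparkSession): Array[String] = {
    import spark.implicits._
    val localSuffix = suffix              // a plain String: serializable, no $outer capture
    Seq("a", "b").toDS().map(_ + localSuffix).collect()
  }
}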
Hope this helps!

RDD from Dataset results in a Serialization Error with Spark 2.x

I have an RDD that I created from a Dataset using Databricks notebook.
When I try to get concrete values from it, it simply fails with a serialization error message.
Here is where I get my data (PageCount is a case class):
val pcDf = spark.sql("SELECT * FROM pagecounts20160801")
val pcDs = pcDf.as[PageCount]
val pcRdd = pcDs.rdd
Then when I do:
pcRdd.take(10)
I get the following exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 82.0 (TID 2474) had a not serializable result: org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
Even though the same attempt on the Dataset works:
pcDs.take(10)
EDIT:
Here is the complete stack trace:
Serialization stack:
- object not serializable (class: org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection, value: <function1>)
- field (class: org.apache.spark.sql.execution.datasources.FileFormat$$anon$1, name: appendPartitionColumns, type: class org.apache.spark.sql.catalyst.expressions.UnsafeProjection)
- object (class org.apache.spark.sql.execution.datasources.FileFormat$$anon$1, <function1>)
- field (class: org.apache.spark.sql.execution.datasources.FileScanRDD, name: readFunction, type: interface scala.Function1)
- object (class org.apache.spark.sql.execution.datasources.FileScanRDD, FileScanRDD[1095] at )
- field (class: org.apache.spark.NarrowDependency, name: _rdd, type: class org.apache.spark.rdd.RDD)
- object (class org.apache.spark.OneToOneDependency, org.apache.spark.OneToOneDependency#502bfe49)
- writeObject data (class: scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.List$SerializationProxy, scala.collection.immutable.List$SerializationProxy#51dc790)
- writeReplace data (class: scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.$colon$colon, List(org.apache.spark.OneToOneDependency#502bfe49))
- field (class: org.apache.spark.rdd.RDD, name: org$apache$spark$rdd$RDD$$dependencies_, type: interface scala.collection.Seq)
- object (class org.apache.spark.rdd.MapPartitionsRDD, MapPartitionsRDD[1096] at )
- field (class: org.apache.spark.NarrowDependency, name: _rdd, type: class org.apache.spark.rdd.RDD)
- object (class org.apache.spark.OneToOneDependency, org.apache.spark.OneToOneDependency#52ce8951)
- writeObject data (class: scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.List$SerializationProxy, scala.collection.immutable.List$SerializationProxy#57850f0)
- writeReplace data (class: scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.$colon$colon, List(org.apache.spark.OneToOneDependency#52ce8951))
- field (class: org.apache.spark.rdd.RDD, name: org$apache$spark$rdd$RDD$$dependencies_, type: interface scala.collection.Seq)
- object (class org.apache.spark.rdd.MapPartitionsRDD, MapPartitionsRDD[1097] at )
- field (class: org.apache.spark.NarrowDependency, name: _rdd, type: class org.apache.spark.rdd.RDD)
- object (class org.apache.spark.OneToOneDependency, org.apache.spark.OneToOneDependency#7e99329a)
- writeObject data (class: scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.List$SerializationProxy, scala.collection.immutable.List$SerializationProxy#792f3145)
- writeReplace data (class: scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.$colon$colon, List(org.apache.spark.OneToOneDependency#7e99329a))
- field (class: org.apache.spark.rdd.RDD, name: org$apache$spark$rdd$RDD$$dependencies_, type: interface scala.collection.Seq)
- object (class org.apache.spark.rdd.MapPartitionsRDD, MapPartitionsRDD[1098] at )
- field (class: org.apache.spark.sql.Dataset, name: rdd, type: class org.apache.spark.rdd.RDD)
- object (class org.apache.spark.sql.Dataset, Invalid tree; null:
null)
- field (class: lineb9de310f01c84f49b76c6c6295a1393c121.$read$$iw$$iw$$iw$$iw, name: pcDs, type: class org.apache.spark.sql.Dataset)
- object (class lineb9de310f01c84f49b76c6c6295a1393c121.$read$$iw$$iw$$iw$$iw, lineb9de310f01c84f49b76c6c6295a1393c121.$read$$iw$$iw$$iw$$iw#3482035d)
- field (class: lineb9de310f01c84f49b76c6c6295a1393c121.$read$$iw$$iw$$iw$$iw$PageCount, name: $outer, type: class lineb9de310f01c84f49b76c6c6295a1393c121.$read$$iw$$iw$$iw$$iw)
- object (class lineb9de310f01c84f49b76c6c6295a1393c121.$read$$iw$$iw$$iw$$iw$PageCount, PageCount(de.b,Spezial:Linkliste/Datei:Playing_card_diamond_9.svg,1,6053))
- element of array (index: 0)
- array (class [Llineb9de310f01c84f49b76c6c6295a1393c121.$read$$iw$$iw$$iw$$iw$PageCount;, size 10)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1452)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1440)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1439)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1439)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1665)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1868)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1881)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1894)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1311)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.take(RDD.scala:1285)
at lineb9de310f01c84f49b76c6c6295a1393c137.$read$$iw$$iw$$iw$$iw.<init>(<console>:33)
at lineb9de310f01c84f49b76c6c6295a1393c137.$read$$iw$$iw$$iw.<init>(<console>:40)
at lineb9de310f01c84f49b76c6c6295a1393c137.$read$$iw$$iw.<init>(<console>:42)
at lineb9de310f01c84f49b76c6c6295a1393c137.$read$$iw.<init>(<console>:44)
at lineb9de310f01c84f49b76c6c6295a1393c137.$eval$.$print$lzycompute(<console>:7)
at lineb9de310f01c84f49b76c6c6295a1393c137.$eval$.$print(<console>:6)
The PageCount class definitely has a non-serializable reference (some non-transient, non-serializable member, or perhaps a parent type with the same problem). The failure to serialize the given object led Spark to try to serialize the enclosing scope, pulling in more and more of its members, including a member of FileFormat somewhere up the road: the projection generated by Janino, which is not serializable by design.
This is just a side effect of the target object (PageCount) failing to serialize.
Relevant code from Spark's FileFormat.scala (it should arguably be marked @transient to really avoid serialization in case appendPartitionColumns has already been materialized):
// Using lazy val to avoid serialization
private lazy val appendPartitionColumns =
  GenerateUnsafeProjection.generate(fullSchema, fullSchema)
This "unexpected" serialization will never happen in a regular scenario, though, as long as serialization of the user-defined types succeeds.
Serializing a Spark RDD (a raw type: Spark knows no global schema for it) involves serializing the complete objects (the object data plus the object "schema", i.e. its type) during materialization. The default serialization mechanism is the Java serializer (so you can try to serialize PageCount with the Java serializer on its own; this may reveal the problem with the type). It can be replaced with the more efficient Kryo serializer, which, however, serializes objects into opaque blobs, so the schema is lost and SQL that needs column access can no longer be applied. This is why RDD access triggers serialization issues.
DataFrames/Datasets are strongly typed and bound to a schema that Spark knows. Spark therefore does not need to pass the object structure between nodes; only the data is passed.
This is why there is no problem materializing a Dataset/DataFrame whose underlying object type is PageCount.
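A minimal sketch of the kind of definition the answer is describing (the extra field and the field names are assumptions, not taken from the question; in a notebook the hidden $outer reference from the defining cell can play the same role): a case class that drags in a non-serializable member breaks the RDD path, while marking that member @transient (or moving it out of the class) fixes it.

import org.apache.log4j.Logger

// Breaks pcDs.rdd.take(10): Java serialization of each PageCount row
// walks into `logger`, which is not serializable.
case class PageCount(project: String, page: String, requests: Long, bytes: Long) {
  val logger: Logger = Logger.getLogger(getClass.getName)
}

// Works: @transient fields are skipped by Java serialization, and the lazy val
// is re-created on the executor the first time it is touched.
case class PageCountFixed(project: String, page: String, requests: Long, bytes: Long) {
  @transient lazy val logger: Logger = Logger.getLogger(getClass.getName)
}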

User Defined Variables in spark - org.apache.spark.SparkException: Task not serializable

I want to add user-defined variables that can be used inside filter and map functions in Spark; currently I am getting an error when I try to do that.
EDIT: I am running this code in a Zeppelin notebook, so I don't have to create a separate class.
My code is something like this:
val some_value = "qwerty"
val some_other_value = "x.y.z"
data.filter(r => r.getString("a.b.c").equals(some_value))
.map(r => (r.getString(some_other_value)))
Please note that here "data" is an RDD containing JSONs.
I am getting the following error:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:341)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:340)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.filter(RDD.scala:340)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$$$$$fa17825793f04f8d2edd8765c45e2a6c$$$$wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:202)......
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.zeppelin.spark.SparkInterpreter.interpretInput(SparkInterpreter.java:747)
at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:711)
at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:704)
at org.apache.zeppelin.interpreter.ClassloaderInterpreter.interpret(ClassloaderInterpreter.java:57)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:312)
at org.apache.zeppelin.scheduler.Job.run(Job.java:171)
at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext#7f166a30)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: c, type: class org.apache.spark.SparkContext)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC#65b97194)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC#5e9afdf2)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC#1d72f175)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC#7b44290b)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC#2fd0142d)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC#1812605b)
- field (class: $iwC$$iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC#40020ce9)
- field (class: $iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC, $iwC$$iwC$$iwC#3db9c1c)
- field (class: $iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC)
- object (class $iwC$$iwC, $iwC$$iwC#6968a929)
- field (class: $iwC, name: $iw, type: class $iwC$$iwC)
- object (class $iwC, $iwC#1e8d42e6)
- field (class: $line25.$read, name: $iw, type: class $iwC)
- object (class $line25.$read, $line25.$read#6b4256a)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: $VAL5449, type: class $line25.$read)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC#7d6cf781)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC#f4d2716)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC#d8c3a34)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC#7bc0d5a5)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC#74ef7040)
- field (class: $iwC$$iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC#7e55be5d)
- field (class: $iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC, $iwC$$iwC$$iwC#2f59915e)
- field (class: $iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC)
- object (class $iwC$$iwC, $iwC$$iwC#36faa9d0)
- field (class: $iwC, name: $iw, type: class $iwC$$iwC)
- object (class $iwC, $iwC#54f2aef7)
- field (class: $line385.$read, name: $iw, type: class $iwC)
- object (class $line385.$read, $line385.$read#3d8590f3)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$$$$$fa17825793f04f8d2edd8765c45e2a6c$$$$wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: $VAL5516, type: class $line385.$read)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$$$$$fa17825793f04f8d2edd8765c45e2a6c$$$$wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$$$$$fa17825793f04f8d2edd8765c45e2a6c$$$$wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC#419dfc19)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$$$$$fa17825793f04f8d2edd8765c45e2a6c$$$$wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: $outer, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$$$$$fa17825793f04f8d2edd8765c45e2a6c$$$$wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$$$$$fa17825793f04f8d2edd8765c45e2a6c$$$$wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$$$$$fa17825793f04f8d2edd8765c45e2a6c$$$$wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC#6e994c6d)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$$$$$2e9cf4ebd66898e1aa2d2fdd9497ea7$$$$C$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1, name: $outer, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$$$$$fa17825793f04f8d2edd8765c45e2a6c$$$$wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$$$$$2e9cf4ebd66898e1aa2d2fdd9497ea7$$$$C$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1, <function1>)
at
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
... 123 more
EDIT 2: Finally I got the solution: just wrap the whole thing in a Scala function and pass the parameters.
import org.apache.spark.rdd.RDD

val some_value = "qwerty"
val some_other_value = "x.y.z"

def func(value_1: String, value_2: String): RDD[String] = {
  val d = data.filter(r => r.getString("a.b.c").equals(value_1))
    .map(r => r.getString(value_2))
  d
}

val new_data = func(some_value, some_other_value)
Another approach is to use functions that build the complete operations used inside the closure:
def myFilter(e: String) = (r: org.apache.avro.generic.GenericRecord) => r.getString("a.b.c").equals(e)
def myGenerator(f: String) = (r: org.apache.avro.generic.GenericRecord) => r.getString(f)
val p = data.filter(myFilter(some_value)).map(myGenerator(some_other_value))
When you get an org.apache.spark.SparkException: Task not serializable exception, it means that you are using a reference to an instance of a non-serializable class inside a transformation.
Beware of closures using fields or methods of an outer object (these will reference the whole object).
For example:
NotSerializable notSerializable = new NotSerializable();
JavaRDD<String> rdd = sc.textFile("/tmp/myfile");
rdd.map(s -> notSerializable.doSomething(s)).collect();
Here are some ideas to fix this error:
Make the class Serializable.
Declare the instance only within the lambda function passed in map.
Make the NotSerializable object static and create it once per machine.
Call rdd.foreachPartition and create the NotSerializable object in there, like this:
rdd.foreachPartition(iter -> {
  NotSerializable notSerializable = new NotSerializable();
  // ...Now process iterator
});
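In Scala, the same per-partition pattern might look like this (a sketch reusing the hypothetical NotSerializable/doSomething names from the Java example above):

rdd.foreachPartition { iter =>
  val notSerializable = new NotSerializable()   // created on the executor, never serialized
  iter.foreach(s => notSerializable.doSomething(s))
}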
Also, look at Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects
Tip: you can use a JVM parameter to get detailed information about serialization by adding
-Dsun.io.serialization.extendedDebugInfo=true to SPARK_JAVA_OPTS.
SerializationDebugger:
SPARK-5307 introduced the SerializationDebugger, and Spark 1.3.0 is the first version to use it. It adds the serialization path to a NotSerializableException. When a NotSerializableException is encountered, the debugger visits the object graph to find the path towards the object that cannot be serialized and constructs information to help the user find that object. For example:
Serialization stack:
- object not serializable (class: testing, value: testing#2dfe2f00)
- field (class: testing$$anonfun$1, name: $outer, type: class testing)
- object (class testing$$anonfun$1, <function1>)
EDIT
I'm trying to address the problem directly here.
Since we run our Spark application on the JVM, an object can be serialized only if its class explicitly extends the special Serializable interface. In Scala, when you declare a case class, it automatically extends the Serializable interface.
1. Case class approach:
object Test extends App {
  case class MyInputForMapAndFilter(somevalue: String, someothervalue: String)

  val some_value = "qwerty"
  val some_other_value = "x.y.z"
  val myInputForMapAndFilter = MyInputForMapAndFilter(some_value, some_other_value)

  data.filter(r => r.getString("a.b.c").equals(myInputForMapAndFilter.somevalue))
    .map(r => r.getString(myInputForMapAndFilter.someothervalue))
}
A case class in Scala is serializable by default.
2. Another approach you can try: declare the vals as @transient:
@transient val some_value = "qwerty"
@transient val some_other_value = "x.y.z"
data.filter(r => r.getString("a.b.c").equals(some_value))
  .map(r => r.getString(some_other_value))
Please try this and let me know the output; feel free to ask more questions.
What you need to do is use a broadcast variable which you can then fetch in map and reduce functions.
For example to create the broadcast variable do:
val broadcastString = sc.broadcast("my string")
And in your map and reduce functions you can get the broadcasted string using:
val myStr = broadcastString.value
See here for more: http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
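Applied to the snippet from the question, a sketch of that approach could look like this (assuming data is an RDD of records exposing getString, as in the question):

val some_value = sc.broadcast("qwerty")
val some_other_value = sc.broadcast("x.y.z")

val result = data
  .filter(r => r.getString("a.b.c").equals(some_value.value))   // .value reads the broadcast on the executor
  .map(r => r.getString(some_other_value.value))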