org.apache.spark.SparkException: Task not serializable with lambda - scala

I'm very new to Scala and Spark, and now I'm facing an issue that has me confused. Please give me some advice.
I'm building an RDD[MyEntityClass] from an RDD[Array[String]] using a lambda, but I hit an error saying a null value could not be parsed from String to Long. To investigate, I extracted the conversion into a method so I could set a breakpoint in it.
However, now I'm getting org.apache.spark.SparkException: Task not serializable and I can't find what's wrong. Below is my code snippet; please help me if you can spot anything.
def makingData(): RDD[MyEntityClass] = {
  .
  .
  data.map(row => toMyEntityClass(row))
}

def toMyEntityClass(row: Array[String]): MyEntityClass = {
  var id = row(0).toLong
  var name = row(1)
  var code = row(2).toLong
  var parentId = row(3).toLong
  var status = row(4)
  MyEntityClass(id, name, code, parentId, status)
}
===== updated question =====
I'm updating my question in response to your advice. I already have MyEntityClass defined as a case class, as shown below.
case class MyEntityClass(id: Long, name: String, code: Long, parentId: Long, status: String)
===== appended stack trace =====
Task not serializable
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2030)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:314)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:313)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.map(RDD.scala:313)
at com.myproject.repository.MyRepositorySpec.getDummyData(MyRepositorySpec.scala:40)
at com.myproject.repository.MyRepositorySpec$$anonfun$3.apply(MyRepositorySpec.scala:66)
at com.myproject.repository.MyRepositorySpec$$anonfun$3.apply(MyRepositorySpec.scala:65)
at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1681)
at org.scalatest.Suite$class.withFixture(Suite.scala:1031)
at org.scalatest.FlatSpec.withFixture(FlatSpec.scala:1691)
at org.scalatest.FlatSpecLike$class.invokeWithFixture$1(FlatSpecLike.scala:1678)
at org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1690)
at org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1690)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:287)
at org.scalatest.FlatSpecLike$class.runTest(FlatSpecLike.scala:1690)
at org.scalatest.FlatSpec.runTest(FlatSpec.scala:1691)
at org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1748)
at org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1748)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:394)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:382)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:382)
at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:371)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:408)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:382)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:382)
at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:377)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:459)
at org.scalatest.FlatSpecLike$class.runTests(FlatSpecLike.scala:1748)
at org.scalatest.FlatSpec.runTests(FlatSpec.scala:1691)
at org.scalatest.Suite$class.run(Suite.scala:1320)
at org.scalatest.FlatSpec.org$scalatest$FlatSpecLike$$super$run(FlatSpec.scala:1691)
at org.scalatest.FlatSpecLike$$anonfun$run$1.apply(FlatSpecLike.scala:1794)
at org.scalatest.FlatSpecLike$$anonfun$run$1.apply(FlatSpecLike.scala:1794)
at org.scalatest.SuperEngine.runImpl(Engine.scala:519)
at org.scalatest.FlatSpecLike$class.run(FlatSpecLike.scala:1794)
at org.scalatest.FlatSpec.run(FlatSpec.scala:1691)
at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:46)
at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$1.apply(Runner.scala:1340)
at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$1.apply(Runner.scala:1334)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1334)
at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1011)
at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1010)
at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1500)
at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:1010)
at org.scalatest.tools.Runner$.run(Runner.scala:850)
at org.scalatest.tools.Runner.run(Runner.scala)
at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2(ScalaTestRunner.java:138)
at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:28)
Caused by: java.io.NotSerializableException: org.scalatest.Assertions$AssertionsHelper
Serialization stack:
- object not serializable (class: org.scalatest.Assertions$AssertionsHelper, value: org.scalatest.Assertions$AssertionsHelper#45e639ee)
- field (class: org.scalatest.FlatSpec, name: assertionsHelper, type: class org.scalatest.Assertions$AssertionsHelper)
- object (class com.myproject.repository.MyRepositorySpec, MyRepositorySpec)
- field (class: com.myproject.repository.MyRepositorySpec$$anonfun$getDummyData$1, name: $outer, type: class com.myproject.repository.MyRepositorySpec)
- object (class com.myproject.repository.MyRepositorySpec$$anonfun$getDummyData$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
... 61 more

From the code given above, I understand that you want to convert RDD[Array[String]] to RDD[MyEntityClass].
We have two options here:
Make MyEntityClass a case class, which is Serializable by default.
For example:
case class MyEntityClass(id: Long, name: String, code: Long, parentId: Long, status: String)
Make MyEntityClass a normal class that extends Serializable, so it becomes eligible for serialization. Note: in general this approach is used when the class would need more than 22 fields (the product-arity limit for case classes) and you are on Scala 2.10 or earlier.
EDIT: After you confirmed that MyEntityClass is a case class and pasted the SerializationDebugger stack trace, it becomes clear that MyRepositorySpec is a test class which extends FlatSpec and contains makingData() and toMyEntityClass(). You are using your test class inside the closure, and that is the cause of this exception.
This is clearly evident from the error below:
caused by: java.io.NotSerializableException:
org.scalatest.Assertions$AssertionsHelper Serialization stack:
- object not serializable (class: org.scalatest.Assertions$AssertionsHelper, value:
org.scalatest.Assertions$AssertionsHelper#45e639ee)
- field (class: org.scalatest.FlatSpec, name: assertionsHelper, type: class org.scalatest.Assertions$AssertionsHelper)
- object (class com.myproject.repository.MyRepositorySpec, MyRepositorySpec)
- field (class: com.myproject.repository.MyRepositorySpec$$anonfun$getDummyData$1,
name:
Solution: make MyRepositorySpec Serializable.
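For illustration, a minimal sketch of that fix, plus an alternative that avoids capturing the spec in the closure at all (the standalone conversion object is my own suggestion, not something in your code):
import org.apache.spark.rdd.RDD
import org.scalatest.FlatSpec

// Option A (the suggested fix): make the spec itself Serializable so the captured $outer can be serialized.
// Note: any non-serializable fields the spec still holds would have to be avoided or marked @transient.
class MyRepositorySpec extends FlatSpec with Serializable {
  // makingData() and toMyEntityClass() as in the question
}

// Option B (hypothetical alternative): move the conversion into a standalone object so the
// closure passed to map() no longer references the spec at all.
object MyEntityConversions extends Serializable {
  def toMyEntityClass(row: Array[String]): MyEntityClass =
    MyEntityClass(row(0).toLong, row(1), row(2).toLong, row(3).toLong, row(4))
}

def makingData(data: RDD[Array[String]]): RDD[MyEntityClass] =
  data.map(MyEntityConversions.toMyEntityClass)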

Related

Spark/Scala serialization of list. Task not serializable: java.io.NotSerializableException

The issue is with Spark Dataset and serialization of a list of Ints. Scala version is 2.10.4 and Spark version is 1.6.
This is similar to other questions but I can't get it to work based on those responses. I've simplified the code down in order to just show the problem.
I have a case class:
case class FlightExt(callsign: Option[String], serials: List[Int])
And my main method is like this:
val (ctx, sctx) = SparkUtil.createContext() // just a helper function to build context
val flightsDataFrame = separateFlightsMock(sctx) // reads data from Parquet file
import sctx.implicits._

flightsDataFrame.as[FlightExt]
  .map(flight => flight.callsign)
  .show()
I get the following error:
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: scala.reflect.internal.Symbols$PackageClassSymbol
Serialization stack:
- object not serializable (class: scala.reflect.internal.Symbols$PackageClassSymbol, value: package scala)
- field (class: scala.reflect.internal.Types$ThisType, name: sym, type: class scala.reflect.internal.Symbols$Symbol)
- object (class scala.reflect.internal.Types$UniqueThisType, scala.type)
- field (class: scala.reflect.internal.Types$TypeRef, name: pre, type: class scala.reflect.internal.Types$Type)
- object (class scala.reflect.internal.Types$TypeRef$$anon$6, scala.Int)
- field (class: org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$5, name: elementType$2, type: class scala.reflect.api.Types$TypeApi)
- object (class org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$5, <function1>)
- field (class: org.apache.spark.sql.catalyst.expressions.MapObjects, name: function, type: interface scala.Function1)
- object (class org.apache.spark.sql.catalyst.expressions.MapObjects, mapobjects(<function1>,cast(serials#7 as array<int>),IntegerType))
- field (class: org.apache.spark.sql.catalyst.expressions.Invoke, name: targetObject, type: class org.apache.spark.sql.catalyst.expressions.Expression)
- object (class org.apache.spark.sql.catalyst.expressions.Invoke, invoke(mapobjects(<function1>,cast(serials#7 as array<int>),IntegerType),array,ObjectType(class [Ljava.lang.Object;)))
- writeObject data (class: scala.collection.immutable.$colon$colon)
- object (class scala.collection.immutable.$colon$colon, List(invoke(mapobjects(<function1>,cast(serials#7 as array<int>),IntegerType),array,ObjectType(class [Ljava.lang.Object;))))
- field (class: org.apache.spark.sql.catalyst.expressions.StaticInvoke, name: arguments, type: interface scala.collection.Seq)
- object (class org.apache.spark.sql.catalyst.expressions.StaticInvoke, staticinvoke(class scala.collection.mutable.WrappedArray$,ObjectType(interface scala.collection.Seq),make,invoke(mapobjects(<function1>,cast(serials#7 as array<int>),IntegerType),array,ObjectType(class [Ljava.lang.Object;)),true))
- writeObject data (class: scala.collection.immutable.$colon$colon)
If I remove the list from FlightExt then everything works fine, which indicates there is no problem with the lambda function serialization.
Scala on its own seems to serialize a list of Ints fine. Perhaps Spark has an issue serializing Lists?
I've also tried using a Java Integer.
EDIT:
If I change List to Array it works, but if I have something like this:
case class FlightExt(callsign: Option[String], other: Array[AnotherCaseClass])
it also fails with the same error.
I'm new to Scala and Spark and may be missing something, but any explanation would be appreciated.
Put the FlightExt case class inside an object, as in the code below.
object Flight {
  case class FlightExt(callsign: Option[String], var serials: List[Int])
}
Use Flight.FlightExt
val (ctx, sctx) = SparkUtil.createContext() // just a helper function to build context
val flightsDataFrame = separateFlightsMock(sctx) // reads data from Parquet file
import sctx.implicits._

flightsDataFrame.as[Flight.FlightExt]
  .map(flight => flight.callsign)
  .show()

Spark : Scala mocking, Task not serializable

I am trying to use Mockito for unit testing some Scala code. I want to run Spark locally, i.e. in my IntelliJ IDE. Here is a sample:
class MyScalaSparkTests extends FunSuite with BeforeAndAfter with MockitoSugar with java.io.Serializable {
  val configuration: SparkConf = new SparkConf()
    .setAppName("Your Application Name")
    .setMaster("local")
  val sc = new SparkContext(configuration)
  lazy val testSess = SparkSession.builder.appName("local_test").getOrCreate()

  test("test service") {
    import testSess.implicits._
    // (1) init
    val testObject = spy(new MyScalaClass(<some args>))
    val testDf = testSess.emptyDataset[MyCaseClass1].toDF()
    testDf.union(Seq(MyCaseClass(<some args>)).toDF())
    testObject.testDataFrame = testDf
    val testSource = testSess.emptyDataset[MyCaseClass2].toDF()
    testSource.union(Seq(MyCaseClass2(<some args>)).toDF())
    testObject.setSourceDf(testSource)
    val testRes = testObject.someMethod()
    val r = testRes.take(1)
    println(r)
  }
}
So basically, here is what I am trying to do:
MyScalaClass has someMethod() which compares data between two data frames called testDataFrame and testSource. It then returns another data frame which has the results. Now, in my unit test, I am spying on MyScalaClass to create testObject. Then I create testDataFrame and testSource and assign them to testObject. Finally, I call testObject.someMethod().
Now in the debugger, at this line
val r = testRes.take(1)
I see that testRes is a Dataset, so something is being returned by the method. But when I try to take something from it in order to verify the results, I get:
Task not serializable
org.apache.spark.SparkException: Task not serializable
and further down the stacktrace
Caused by: java.io.NotSerializableException: org.mockito.internal.creation.DelegatingMethod
Serialization stack:
- object not serializable (class: org.mockito.internal.creation.DelegatingMethod, value: org.mockito.internal.creation.DelegatingMethod#a97f2bff)
- field (class: org.mockito.internal.invocation.InterceptedInvocation, name: mockitoMethod, type: interface org.mockito.internal.invocation.MockitoMethod)
- object (class org.mockito.internal.invocation.InterceptedInvocation, bSV2PartValidator.toString();)
- field (class: org.mockito.internal.invocation.InvocationMatcher, name: invocation, type: interface org.mockito.invocation.Invocation)
- object (class org.mockito.internal.invocation.InvocationMatcher, bSV2PartValidator.toString();)
- field (class: org.mockito.internal.stubbing.InvocationContainerImpl, name: invocationForStubbing, type: interface org.mockito.invocation.MatchableInvocation)
- object (class org.mockito.internal.stubbing.InvocationContainerImpl, invocationForStubbing: bSV2PartValidator.toString();)
- field (class: org.mockito.internal.handler.MockHandlerImpl, name: invocationContainer, type: class org.mockito.internal.stubbing.InvocationContainerImpl)
- object (class org.mockito.internal.handler.MockHandlerImpl, org.mockito.internal.handler.MockHandlerImpl#47c019d7)
- field (class: org.mockito.internal.handler.NullResultGuardian, name: delegate, type: interface org.mockito.invocation.MockHandler)
- object (class org.mockito.internal.handler.NullResultGuardian, org.mockito.internal.handler.NullResultGuardian#7222e168)
- field (class: org.mockito.internal.handler.InvocationNotifierHandler, name: mockHandler, type: interface org.mockito.invocation.MockHandler)
- object (class org.mockito.internal.handler.InvocationNotifierHandler, org.mockito.internal.handler.InvocationNotifierHandler#1e4f8430)
- field (class: org.mockito.internal.creation.bytebuddy.MockMethodInterceptor, name: handler, type: interface org.mockito.invocation.MockHandler)
- object (class org.mockito.internal.creation.bytebuddy.MockMethodInterceptor, org.mockito.internal.creation.bytebuddy.MockMethodInterceptor#34d08905)
- field (class: com.walmart.labs.search.signals.validators.BSV2PartValidator$MockitoMock$213785213, name: mockitoInterceptor, type: class org.mockito.internal.creation.bytebuddy.MockMethodInterceptor)
- object (class com.walmart.labs.search.signals.validators.BSV2PartValidator$MockitoMock$213785213, com.walmart.labs.search.signals.validators.BSV2PartValidator$MockitoMock$213785213#7f289126)
- field (class: com.walmart.labs.search.signals.validators.BSV2PartValidator$$anonfun$1, name: $outer, type: class com.walmart.labs.search.signals.validators.BSV2PartValidator)
- object (class com.walmart.labs.search.signals.validators.BSV2PartValidator$$anonfun$1, <function1>)
- element of array (index: 1)
- array (class [Ljava.lang.Object;, size 7)
- field (class: org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8, name: references$1, type: class [Ljava.lang.Object;)
- object (class org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8, <function2>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 78 more
What am I doing wrong? Is it even possible to spy on or mock Spark behavior in an IDE?
Mocks are not serializable by default, as needing to serialize them is usually a code smell in unit testing.
You can try enabling serialization by creating the mock like mock[MyType](Mockito.withSettings().serializable()) and see what happens when Spark tries to use it.
By the way, I recommend using mockito-scala instead of traditional Mockito, as it may save you some other problems.
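For example, a minimal sketch of the serializable-settings idea applied to the test above (names follow the question; treat the mock[T](settings) overload as an assumption to check against your MockitoSugar/Mockito versions):
import org.mockito.Mockito

// Ask Mockito to build a mock whose internals can pass Java serialization.
// For a spy-like setup you could additionally pass .spiedInstance(realInstance) in the settings.
val testObject = mock[MyScalaClass](Mockito.withSettings().serializable())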

Spark Scala: Task Not serializable error

I am using IntelliJ Community Edition with the Scala plugin and Spark libraries. I am still learning Spark and am using a Scala Worksheet.
I have written the below code which removes punctuation marks in a String:
def removePunctuation(text: String): String = {
  val punctPattern = "[^a-zA-Z0-9\\s]".r
  punctPattern.replaceAllIn(text, "").toLowerCase
}
Then I read a text file and try to remove punctuation:
val myfile = sc.textFile("/home/ubuntu/data.txt",4).map(removePunctuation)
This gives the error below; any help would be appreciated:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(/home/ubuntu/src/main/scala/Test.sc:294)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(/home/ubuntu/src/main/scala/Test.sc:284)
at org.apache.spark.util.ClosureCleaner$.clean(/home/ubuntu/src/main/scala/Test.sc:104)
at org.apache.spark.SparkContext.clean(/home/ubuntu/src/main/scala/Test.sc:2090)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(/home/ubuntu/src/main/scala/Test.sc:366)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(/home/ubuntu/src/main/scala/Test.sc:365)
at org.apache.spark.rdd.RDDOperationScope$.withScope(/home/ubuntu/src/main/scala/Test.sc:147)
at #worksheet#.#worksheet#(/home/ubuntu/src/main/scala/Test.sc:108)
Caused by: java.io.NotSerializableException: A$A21$A$A21
Serialization stack:
- object not serializable (class: A$A21$A$A21, value: A$A21$A$A21#62db3891)
- field (class: A$A21$A$A21$$anonfun$words$1, name: $outer, type: class A$A21$A$A21)
- object (class A$A21$A$A21$$anonfun$words$1, )
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.map(RDD.scala:369)
at A$A21$A$A21.words$lzycompute(Test.sc:27)
at A$A21$A$A21.words(Test.sc:27)
at A$A21$A$A21.get$$instance$$words(Test.sc:27)
at A$A21$.main(Test.sc:73)
at A$A21.main(Test.sc)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.jetbrains.plugins.scala.worksheet.MyWorksheetRunner.main(MyWorksheetRunner.java:22)
As T. Gaweda already pointed out, you're most likely defining your function in a class that's not serializable. Because it is a pure function, i.e. it doesn't depend on any context of the enclosing class, I suggest you put it into a companion object which should extend Serializable. This would be Scala's equivalent of a Java static method:
object Helper extends Serializable {
  def removePunctuation(text: String): String = {
    val punctPattern = "[^a-zA-Z0-9\\s]".r
    punctPattern.replaceAllIn(text, "").toLowerCase
  }
}
As @TGaweda suggests, Spark's SerializationDebugger is very helpful for identifying "the serialization path leading from the given object to the problematic object." All the dollar signs before the "Serialization stack" in the stack trace indicate that the container object for your method is the problem.
While it is easiest to just slap Serializable on your container class, I prefer to take advantage of the fact that Scala is a functional language and use your function as a first-class citizen:
sc.textFile("/home/ubuntu/data.txt", 4).map { text =>
  val punctPattern = "[^a-zA-Z0-9\\s]".r
  punctPattern.replaceAllIn(text, "").toLowerCase
}
Or if you really want to keep things separate:
val removePunctuation: String => String = (text: String) => {
  val punctPattern = "[^a-zA-Z0-9\\s]".r
  punctPattern.replaceAllIn(text, "").toLowerCase
}

sc.textFile("/home/ubuntu/data.txt", 4).map(removePunctuation)
These options work, of course, because Regex is serializable, as you can confirm.
On a secondary but very important note, constructing a Regex is expensive, so factor it out of your transformations for the sake of performance, possibly with a broadcast.
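For instance, a rough sketch of factoring the pattern out so it is compiled once per executor JVM rather than once per record (a broadcast variable would be another option):
import scala.util.matching.Regex

object Helper extends Serializable {
  // Compiled once when the object is initialized, then reused for every record.
  val punctPattern: Regex = "[^a-zA-Z0-9\\s]".r

  def removePunctuation(text: String): String =
    punctPattern.replaceAllIn(text, "").toLowerCase
}

sc.textFile("/home/ubuntu/data.txt", 4).map(Helper.removePunctuation)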
Read the stack trace; it contains:
$outer, type: class A$A21$A$A21
It is a very good hint. Your lambda is serializable, but your class is not.
When you create a lambda expression, that expression holds a reference to its outer class. The outer class in your case is not serializable, i.e. it does not implement Serializable, or one of its fields is not serializable.
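To make the capture concrete, here is a small hypothetical sketch of the pattern the debugger is pointing at (WordCleaner is an invented name, not from the question):
import org.apache.spark.rdd.RDD

class WordCleaner { // hypothetical: does NOT extend Serializable
  val pattern = "[^a-zA-Z0-9\\s]".r

  // The lambda reads `pattern`, a field of this class, so the closure captures `this`
  // as $outer and Spark then tries (and fails) to serialize the whole WordCleaner.
  def clean(rdd: RDD[String]): RDD[String] =
    rdd.map(s => pattern.replaceAllIn(s, ""))
}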

Spark not serializable issue

I'm working on refactoring our code so that we can use the CAKE pattern for DI.
I've stumbled upon a serialisation issue which I'm having difficulty understanding.
When I call this function:
def getAccounts(winZones: Broadcast[List[WindowsZones]]): RDD[AccountDetails] = {
  val accounts = getAccounts //call to db
  val result = accounts.map(row =>
    Some(AccountDetails(UUID.fromString(row.getAs[String]("")),
      row.getAs[String](""),
      UUID.fromString(row.getAs[String]("")),
      row.getAs[String](""),
      row.getAs[String](""),
      DateUtils.getIanaZoneFromWinZone(row.getAs[String]("timeZone"), winZones))))
    .map(m => m.get)
  result
}
it works perfectly, but it is ugly and I want to refactor it so that the middle mapping from row to AccountDetails is placed inside a private function. But when I do that, it causes the serialisation issue.
I'd like:
def getAccounts(winZones: Broadcast[List[WindowsZones]]): RDD[AccountDetails] = {
  val accounts = getAccounts
  val result = accounts
    .map(m => getAccountDetails(m, winZones))
    .filter(_.isDefined)
    .map(m => m.get)
  result
}

private def getAccountDetails(row: Row, winZones: Broadcast[List[WindowsZones]]): Option[AccountDetails] = {
  try {
    Some(AccountDetails(UUID.fromString(""),
      row.getAs[String](""),
      UUID.fromString(row.getAs[String]("")),
      row.getAs[String](""),
      row.getAs[String](""),
      DateUtils.getIanaZoneFromWinZone(row.getAs[String]("timeZone"), winZones)))
  }
  catch {
    case e: Exception =>
      logger.error(s"Unable to set AccountDetails $e")
      None
  }
}
Any help is appreciated, of course. The AccountDetails object is a case class, should that be pertinent. I'm also happy to take any other advice on implementing cake or DI with Spark in general. Thanks.
Edit to show structure:
trait serviceImpl extends anotherComponent { this: DBA =>
  def Accounts = new Accounts

  class Accounts extends AccountService {
    //the methods above are defined here.
  }
}
Edit to include stacktrace:
17/02/13 17:32:32 INFO CodeGenerator: Code generated in 271.36617 ms
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2039)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:366)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:365)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.map(RDD.scala:365)
at FunnelServiceComponentImpl$FunnelAccounts.getAccounts(FunnelServiceComponentImpl.scala:24)
at Main$.delayedEndpoint$Main$1(Main.scala:26)
at Main$delayedInit$body.apply(Main.scala:7)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at Main$.main(Main.scala:7)
at Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Caused by: java.io.NotSerializableException: FunnelServiceComponentImpl$FunnelAccounts
Serialization stack:
- object not serializable (class: FunnelServiceComponentImpl$FunnelAccounts, value: FunnelServiceComponentImpl$FunnelAccounts#16b7e04a)
- field (class: FunnelServiceComponentImpl$FunnelAccounts$$anonfun$1, name: $outer, type: class FunnelServiceComponentImpl$FunnelAccounts)
- object (class FunnelServiceComponentImpl$FunnelAccounts$$anonfun$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 26 more
17/02/13 17:32:32 INFO SparkContext: Invoking stop() from shutdown hook
Where are you defining the functions?
Let's say you are defining them in a class X. If that class is not serializable, this would cause your issue.
To solve this you can either make it an object instead or make the class serializable.
Because getAccountDetails is in your class, Spark will want to serialize your entire FunnelAccounts object. After all, you need an instance in order to use this method. However, FunnelAccounts is not serializable. Thus it can't be sent off to a worker.
In your case, you should move getAccountDetails into a FunnelAccounts object, so that you don't need an instance of FunnelAccounts to run it.
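For illustration, a rough sketch of that move (names and column accesses follow the question; AccountDetails, WindowsZones and DateUtils are assumed to come from your own codebase):
import java.util.UUID
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.Row

object FunnelAccounts {
  // Lives on an object, so the closure below captures only the broadcast and this function,
  // not the non-serializable FunnelAccounts class instance.
  def getAccountDetails(row: Row, winZones: Broadcast[List[WindowsZones]]): Option[AccountDetails] =
    try {
      Some(AccountDetails(
        UUID.fromString(row.getAs[String]("")),
        row.getAs[String](""),
        UUID.fromString(row.getAs[String]("")),
        row.getAs[String](""),
        row.getAs[String](""),
        DateUtils.getIanaZoneFromWinZone(row.getAs[String]("timeZone"), winZones)))
    } catch {
      case e: Exception => None
    }
}

// In getAccounts:
// accounts.map(row => FunnelAccounts.getAccountDetails(row, winZones)).filter(_.isDefined).map(_.get)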

User Defined Variables in spark - org.apache.spark.SparkException: Task not serializable

I want to add user-defined variables that can be used inside filter and map functions in Spark, but currently I am getting an error when I try to do that.
EDIT: I am running this code in a Zeppelin notebook, so I don't have to create a separate class.
My code is something like:
val some_value = "qwerty"
val some_other_value = "x.y.z"

data.filter(r => r.getString("a.b.c").equals(some_value))
  .map(r => (r.getString(some_other_value)))
Please note that here "data" is an RDD containing JSONs.
I am getting the following error:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:341)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:340)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.filter(RDD.scala:340)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$$$$$fa17825793f04f8d2edd8765c45e2a6c$$$$wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:202)......
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.zeppelin.spark.SparkInterpreter.interpretInput(SparkInterpreter.java:747)
at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:711)
at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:704)
at org.apache.zeppelin.interpreter.ClassloaderInterpreter.interpret(ClassloaderInterpreter.java:57)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:312)
at org.apache.zeppelin.scheduler.Job.run(Job.java:171)
at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext#7f166a30)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: c, type: class org.apache.spark.SparkContext)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC#65b97194)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC#5e9afdf2)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC#1d72f175)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC#7b44290b)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC#2fd0142d)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC#1812605b)
- field (class: $iwC$$iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC#40020ce9)
- field (class: $iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC, $iwC$$iwC$$iwC#3db9c1c)
- field (class: $iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC)
- object (class $iwC$$iwC, $iwC$$iwC#6968a929)
- field (class: $iwC, name: $iw, type: class $iwC$$iwC)
- object (class $iwC, $iwC#1e8d42e6)
- field (class: $line25.$read, name: $iw, type: class $iwC)
- object (class $line25.$read, $line25.$read#6b4256a)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: $VAL5449, type: class $line25.$read)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC#7d6cf781)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC#f4d2716)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC#d8c3a34)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC#7bc0d5a5)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC#74ef7040)
- field (class: $iwC$$iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC#7e55be5d)
- field (class: $iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC, $iwC$$iwC$$iwC#2f59915e)
- field (class: $iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC)
- object (class $iwC$$iwC, $iwC$$iwC#36faa9d0)
- field (class: $iwC, name: $iw, type: class $iwC$$iwC)
- object (class $iwC, $iwC#54f2aef7)
- field (class: $line385.$read, name: $iw, type: class $iwC)
- object (class $line385.$read, $line385.$read#3d8590f3)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$$$$$fa17825793f04f8d2edd8765c45e2a6c$$$$wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: $VAL5516, type: class $line385.$read)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$$$$$fa17825793f04f8d2edd8765c45e2a6c$$$$wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$$$$$fa17825793f04f8d2edd8765c45e2a6c$$$$wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC#419dfc19)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$$$$$fa17825793f04f8d2edd8765c45e2a6c$$$$wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: $outer, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$$$$$fa17825793f04f8d2edd8765c45e2a6c$$$$wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$$$$$fa17825793f04f8d2edd8765c45e2a6c$$$$wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$$$$$fa17825793f04f8d2edd8765c45e2a6c$$$$wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC#6e994c6d)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$$$$$2e9cf4ebd66898e1aa2d2fdd9497ea7$$$$C$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1, name: $outer, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$$$$$fa17825793f04f8d2edd8765c45e2a6c$$$$wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$$$$$2e9cf4ebd66898e1aa2d2fdd9497ea7$$$$C$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
... 123 more
EDIT 2: Finally I got a solution: just wrap the whole code in a Scala function and pass the parameters.
val some_value = "qwerty"
val some_other_value = "x.y.z"

def func(value_1: String, value_2: String): RDD[String] = {
  val d = data.filter(r => r.getString("a.b.c").equals(value_1))
    .map(r => (r.getString(value_2)))
  d
}

val new_data = func(some_value, some_other_value)
Another approach is to use functions that build the complete operations used inside the closure:
def myFilter(e: String) = (r: org.apache.avro.generic.GenericRecord) => r.getString("a.b.c").equals(e)
def myGenerator(f: String) = (r: org.apache.avro.generic.GenericRecord) => r.getString(f)

val p = data.filter(myFilter(some_value)).map(myGenerator(some_other_value))
When you get an org.apache.spark.SparkException: Task not serializable exception, it means that you are using a reference to an instance of a non-serializable class inside a transformation.
Beware of closures using fields/methods of the outer object (these will reference the whole object).
For example:
NotSerializable notSerializable = new NotSerializable();
JavaRDD<String> rdd = sc.textFile("/tmp/myfile");
rdd.map(s -> notSerializable.doSomething(s)).collect();
Here are some ideas to fix this error:
Make the class Serializable.
Declare the instance only within the lambda function passed to map.
Make the NotSerializable object static so it is created once per machine.
Call rdd.foreachPartition and create the NotSerializable object in there, like this:
rdd.foreachPartition(iter -> {
  NotSerializable notSerializable = new NotSerializable();
  // ...Now process iterator
});
Also, look at Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects
Tip: you can use a JVM parameter to see detailed info about serialization: add -Dsun.io.serialization.extendedDebugInfo=true to SPARK_JAVA_OPTS.
SerializationDebugger:
SPARK-5307 introduced SerializationDebugger, and Spark 1.3.0 is the first version to use it. It adds the serialization path to a NotSerializableException. When a NotSerializableException is encountered, the debugger visits the object graph to find the path towards the object that cannot be serialized, and constructs information to help the user find the object. For example:
Serialization stack:
- object not serializable (class: testing, value: testing#2dfe2f00)
- field (class: testing$$anonfun$1, name: $outer, type: class testing)
- object (class testing$$anonfun$1, <function1>)
EDIT
I'm directly trying to address the problem here.
Since we run our Spark application on the JVM, an object can only be serialized if its class explicitly extends the Serializable interface. In Scala, when you declare a case class it automatically extends the Serializable interface.
1. Case class approach:
object Test extends App {
  case class MyInputForMapAndFilter(somevalue: String, someothervalue: String)

  val some_value = "qwerty"
  val some_other_value = "x.y.z"
  val myInputForMapAndFilter = MyInputForMapAndFilter(some_value, some_other_value)

  data.filter(r => r.getString("a.b.c").equals(myInputForMapAndFilter.somevalue))
    .map(r => (r.getString(myInputForMapAndFilter.someothervalue)))
}
A case class in Scala is serializable by default.
2. Another approach you can try: declare the vals as @transient:
@transient val some_value = "qwerty"
@transient val some_other_value = "x.y.z"

data.filter(r => r.getString("a.b.c").equals(some_value))
  .map(r => (r.getString(some_other_value)))
Please try this and let me know the output, and feel free to ask more questions if you hit further problems.
What you need to do is use a broadcast variable which you can then fetch in map and reduce functions.
For example to create the broadcast variable do:
val broadcastString = sc.broadcast("my string")
And in your map and reduce functions you can get the broadcasted string using:
val myStr = broadcastString.value
See here for more: http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
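Applied to the snippet in the question, a minimal sketch of the broadcast approach might look like this (assuming sc is the SparkContext available in the notebook):
val someValueBc = sc.broadcast("qwerty")
val someOtherValueBc = sc.broadcast("x.y.z")

// The closures now reference the broadcast handles, which are serializable,
// and each executor reads the values via .value.
val newData = data
  .filter(r => r.getString("a.b.c").equals(someValueBc.value))
  .map(r => r.getString(someOtherValueBc.value))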