Scala: Xtream complains object not serializable

Scala: Xtream complains object not serializable - scala

I have the following case classes defined and i would like to print out ClientData in xml format using xstream.
case class Address(addressLine1: String,
addressLine2: String,
city: String,
provinceCode: String,
country: String,
addressTypeDesc: String) extends Serializable{
}
case class ClientData(title: String,
firstName: String,
lastName: String,
addrList:Option[List[Address]]) extends Serializable{
}
object ex1{
def main(args: Array[String]){
...
...
...
// In below, x is Try[ClientData]
val xstream = new XStream(new DomDriver)
newClientRecord.foreach(x=> if (x.isSuccess) println(xstream.toXML(x.get)))
}
}
And when the program execute the line to print each ClientData in xml format, I am getting the runtime error below. Please help.
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:911)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:910)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.foreach(RDD.scala:910)
at lab9$.main(lab9.scala:63)
at lab9.main(lab9.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Caused by: java.io.NotSerializableException: com.thoughtworks.xstream.XStream
Serialization stack:
- object not serializable (class: com.thoughtworks.xstream.XStream, value: com.thoughtworks.xstream.XStream#51e94b7d)
- field (class: lab9$$anonfun$main$1, name: xstream$1, type: class com.thoughtworks.xstream.XStream)
- object (class lab9$$anonfun$main$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
... 16 more

It isn't XStream which complains, it's Spark. You need to define xstream variable inside the task:
newClientRecord.foreach { x=>
if (x.isSuccess) {
val xstream = new XStream(new DomDriver)
println(xstream.toXML(x.get))
}
}
if XStream is sufficiently cheap to create;
newClientRecord.foreachPartition { xs =>
val xstream = new XStream(new DomDriver)
xs.foreach { x =>
if (x.isSuccess) {
println(xstream.toXML(x.get))
}
}
}
otherwise.

Related

Task not serializable after adding it to ForEachPartition

I am receiving a task not serializable exception in spark when attempting to implement an Apache pulsar Sink in spark structured streaming.
I have already attempted to extrapolate the PulsarConfig to a separate class and call this within the .foreachPartition lambda function which I normally do for JDBC connections and other systems I integrate into spark structured streaming like shown below:
PulsarSink Class
class PulsarSink(
sqlContext: SQLContext,
parameters: Map[String, String],
partitionColumns: Seq[String],
outputMode: OutputMode) extends Sink{
override def addBatch(batchId: Long, data: DataFrame): Unit = {
data.toJSON.foreachPartition( partition => {
val pulsarConfig = new PulsarConfig(parameters).client
val producer = pulsarConfig.newProducer(Schema.STRING)
.topic(parameters.get("topic").get)
.compressionType(CompressionType.LZ4)
.sendTimeout(0, TimeUnit.SECONDS)
.create
partition.foreach(rec => producer.send(rec))
producer.flush()
})
}
PulsarConfig Class
class PulsarConfig(parameters: Map[String, String]) {
def client(): PulsarClient = {
import scala.collection.JavaConverters._
if(!parameters.get("tlscert").isEmpty && !parameters.get("tlskey").isEmpty) {
val tlsAuthMap = Map("tlsCertFile" -> parameters.get("tlscert").get,
"tlsKeyFile" -> parameters.get("tlskey").get).asJava
val tlsAuth: Authentication = AuthenticationFactory.create(classOf[AuthenticationTls].getName, tlsAuthMap)
PulsarClient.builder
.serviceUrl(parameters.get("broker").get)
.tlsTrustCertsFilePath(parameters.get("tlscert").get)
.authentication(tlsAuth)
.enableTlsHostnameVerification(false)
.allowTlsInsecureConnection(true)
.build
}
else{
PulsarClient.builder
.serviceUrl(parameters.get("broker").get)
.enableTlsHostnameVerification(false)
.allowTlsInsecureConnection(true)
.build
}
}
}
The error message I receive is the following:
ERROR StreamExecution: Query [id = 12c715c2-2d62-4523-a37a-4555995ccb74, runId = d409c0db-7078-4654-b0ce-96e46dfb322c] terminated with error
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2294)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:925)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:924)
at org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply$mcV$sp(Dataset.scala:2341)
at org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply(Dataset.scala:2341)
at org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply(Dataset.scala:2341)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2828)
at org.apache.spark.sql.Dataset.foreachPartition(Dataset.scala:2340)
at org.apache.spark.datamediation.impl.sink.PulsarSink.addBatch(PulsarSink.scala:20)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply$mcV$sp(StreamExecution.scala:666)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply(StreamExecution.scala:666)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply(StreamExecution.scala:666)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:279)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:665)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(StreamExecution.scala:306)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:294)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:294)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:279)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:294)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:290)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:206)
Caused by: java.io.NotSerializableException: org.apache.spark.datamediation.impl.sink.PulsarSink
Serialization stack:
- object not serializable (class: org.apache.spark.datamediation.impl.sink.PulsarSink, value: org.apache.spark.datamediation.impl.sink.PulsarSink#38813f43)
- field (class: org.apache.spark.datamediation.impl.sink.PulsarSink$$anonfun$addBatch$1, name: $outer, type: class org.apache.spark.datamediation.impl.sink.PulsarSink)
- object (class org.apache.spark.datamediation.impl.sink.PulsarSink$$anonfun$addBatch$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:337)
... 31 more

Values used in "foreachPartition" can be reassigned from class level to function variables:
override def addBatch(batchId: Long, data: DataFrame): Unit = {
val parametersLocal = parameters
data.toJSON.foreachPartition( partition => {
val pulsarConfig = new PulsarConfig(parametersLocal).client

Is using function in transformation causing Not Serializable exceptions?

I have a Breeze DenseMatrix, i find mean per row and mean of squares per row and put them in another DenseMatrix, one per column. But i get Task Not Serializable exception. I know that sc is not Serializable but i think that the exception is because i call functions in a transformation in Safe Zones.
Am i right? And how could be a possible way to be done without any functions? Any help would be great!
Code:
object MotitorDetection {
case class MonDetect() extends Serializable {
var sc: SparkContext = _
var machines: Int=0
var counters: Int=0
var GlobalVec= BDM.zeros[Double](counters, 2)
def findMean(a: BDM[Double]): BDV[Double] = {
var c = mean(a(*, ::))
c}
def toMatrix(x: BDV[Double], y: BDV[Double], C: Int): BDM[Double]={
val m = BDM.zeros[Double](C,2)
m(::, 0) := x
m(::, 1) := y
m}
def SafeZones(stream: DStream[(Int, BDM[Double])]){
stream.foreachRDD { (rdd: RDD[(Int, BDM[Double])], _) =>
if (isEmpty(rdd) == false) {
val InputVec = rdd.map(x=> (x._1, toMatrix(findMean(x._2), findMean(pow(x._2, 2)), counters)))
GlobalMeanVector(InputVec)
}}}
Exception:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2287)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.map(RDD.scala:369)
at ScalaApps.MotitorDetection$MonDetect$$anonfun$SafeZones$1.apply(MotitorDetection.scala:85)
at ScalaApps.MotitorDetection$MonDetect$$anonfun$SafeZones$1.apply(MotitorDetection.scala:82)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:257)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:256)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748) Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext#6eee7027)
- field (class: ScalaApps.MotitorDetection$MonDetect, name: sc, type: class org.apache.spark.SparkContext)
- object (class ScalaApps.MotitorDetection$MonDetect, MonDetect())
- field (class: ScalaApps.MotitorDetection$MonDetect$$anonfun$SafeZones$1, name: $outer, type: class ScalaApps.MotitorDetection$MonDetect)
- object (class ScalaApps.MotitorDetection$MonDetect$$anonfun$SafeZones$1, <function2>)
- field (class: ScalaApps.MotitorDetection$MonDetect$$anonfun$SafeZones$1$$anonfun$2, name: $outer, type: class ScalaApps.MotitorDetection$MonDetect$$anonfun$SafeZones$1)
- object (class ScalaApps.MotitorDetection$MonDetect$$anonfun$SafeZones$1$$anonfun$2, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 28 more

The findMean method is a method of the object MotitorDetection. The object MotitorDetection has a SparkContext on-board, which is not serializable. Thus, the task used in rdd.map is not serializable.
Move all the matrix-related functions into a separate serializable object, MatrixUtils, say:
object MatrixUtils {
def findMean(a: BDM[Double]): BDV[Double] = {
var c = mean(a(*, ::))
c
}
def toMatrix(x: BDV[Double], y: BDV[Double], C: Int): BDM[Double]={
val m = BDM.zeros[Double](C,2)
m(::, 0) := x
m(::, 1) := y
m
}
...
}
and then use only those methods from rdd.map(...):
object MotitorDetection {
val sc = ...
def SafeZones(stream: DStream[(Int, BDM[Double])]){
import MatrixUtils._
... = rdd.map( ... )
}
}

solr add document error

I am trying to load CSV file to solr doc, i am trying using scala. I am new to scala. For case class structure, if i pass one set of values it works fine. But if i want to want all read values from CSV, it gives an error. I am not sure how to do it in scala, any help greatly appreciated.
object BasicParseCsv {
case class Person(id: String, name: String,age: String, addr: String )
val schema = ArrayBuffer[Person]()
def main(args: Array[String]) {
val master = args(0)
val inputFile = args(1)
val outputFile = args(2)
val sc = new SparkContext(master, "BasicParseCsv", System.getenv("SPARK_HOME"))
val params = new ModifiableSolrParams
val Solr = new HttpSolrServer("http://localhost:8983/solr/person1")
//Preparing the Solr document
val doc = new SolrInputDocument()
val input = sc.textFile(inputFile)
val result = input.map{ line =>
val reader = new CSVReader(new StringReader(line));
reader.readNext();
}
def getSolrDocument(person: Person): SolrInputDocument = {
val document = new SolrInputDocument()
document.addField("id",person.id)
document.addField("name", person.name)
document.addField("age",person.age)
document.addField("addr", person.addr)
document
}
def send(persons:List[Person]){
persons.foreach(person=>Solr.add(getSolrDocument(person)))
Solr.commit()
}
val people = result.map(x => Person(x(0), x(1),x(2),x(3)))
val book1 = new Person("101","xxx","20","abcd")
send(List(book1))
people.map(person => send(List(Person(person.id, person.name, person.age,person.addr))))
System.out.println("Documents added")
}
}
people.map(person => send(List(Person(person.id, person.name, person.age,person.addr)))) ==> gives error
val book1 = new Person("101","xxx","20","abcd") ==> works fine
Update : I get below error
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2067)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:324)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:323)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.map(RDD.scala:323)
at BasicParseCsv$.main(BasicParseCsv.scala:90)
at BasicParseCsv.main(BasicParseCsv.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
Caused by: java.io.NotSerializableException: org.apache.http.impl.client.SystemDefaultHttpClient
Serialization stack:
- object not serializable (class: org.apache.http.impl.client.SystemDefaultHttpClient, value: org.apache.http.impl.client.SystemDefaultHttpClient#1dbd580)
- field (class: org.apache.solr.client.solrj.impl.HttpSolrServer, name: httpClient, type: interface org.apache.http.client.HttpClient)
- object (class org.apache.solr.client.solrj.impl.HttpSolrServer, org.apache.solr.client.solrj.impl.HttpSolrServer#17e0827)
- field (class: BasicParseCsv$$anonfun$main$1, name: Solr$1, type: class org.apache.solr.client.solrj.impl.HttpSolrServer)

Spark application got the error of "Task not serializable"?

The following code got the error of "Task not serializable"?
The error
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2101)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.map(RDD.scala:369)
at ConnTest$.main(main.scala:41)
at ConnTest.main(main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:743)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: DoWork
Serialization stack:
- object not serializable (class: DoWork, value: DoWork#655621fd)
- field (class: ConnTest$$anonfun$2, name: doWork$1, type: class DoWork)
- object (class ConnTest$$anonfun$2, )
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 20 more
Code:
object ConnTest extends App {
override def main(args: scala.Array[String]): Unit = {
super.main(args)
val date = args(0)
val conf = new SparkConf()
val sc = new SparkContext(conf.setAppName("Test").setMaster("local[*]"))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val jdbcSqlConn = "jdbc:sqlserver://......;"
val listJob = new ItemListJob(sqlContext, jdbcSqlConn)
val list = listJob.run(date).select("id").rdd.map(r => r(0).asInstanceOf[Int]).collect()
// It returns about 3000 rows
val doWork = new DoWork(sqlContext, jdbcSqlConn)
val processed = sc.parallelize(list).map(d => {
doWork.run(d, date)
})
}
}
class ItemList(sqlContext: org.apache.spark.sql.SQLContext, jdbcSqlConn: String) {
def run(date: LocalDate) = {
sqlContext.read.format("jdbc").options(Map(
"driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
"url" -> jdbcSqlConn,
"dbtable" -> s"dbo.GetList('$date')"
)).load()
}
}
class DoWork(sqlContext: org.apache.spark.sql.SQLContext, jdbcSqlConn: String) {
def run(id: Int, date: LocalDate) = {
// ...... read the data from database for id, and create a text file
val data = sqlContext.read.format("jdbc").options(Map(
"driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
"url" -> jdbcSqlConn,
"dbtable" -> s"someFunction('$id', $date)"
)).load()
// .... create a text file with content of data
(id, date)
}
}
Update:
I changed the .map() call to the following,
val processed = sc.parallelize(dealList).toDF.map(d => {
doWork.run(d(0).asInstanceOf[Int], rc)
})
Now I got the error of
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for java.time.LocalDate
- field (class: "java.time.LocalDate", name: "_2")
- root class: "scala.Tuple2"
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:602)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:596)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:587)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)

The issue is in the following closure:
val processed = sc.parallelize(list).map(d => {
doWork.run(d, date)
})
The closure in map will run in executors, so Spark needs to serialize doWork and send it to executors. DoWork must be serializable. However. I saw DoWork contains sc and sqlContext so you cannot just make DoWork implement Serializable because you cannot use them in executors.
I guess you probably want to store data into database in DoWork. If so, you can convert RDD to DataFrame and save it via jdbc method, such as:
sc.parallelize(list).toDF.write.jdbc(...)
I cannot give more suggestions since you don't provides the codes in DoWork.

Play 2.2 Scala class serializable for cache

I have one scala class from play-mongojack example. It works fine. However, when I tried to save it in Play ehcache, it throws NotSerializableException. How can I make this class serializable?
class BlogPost(#ObjectId #Id val id: String,
#BeanProperty #JsonProperty("date") val date: Date,
#BeanProperty #JsonProperty("title") val title: String,
#BeanProperty #JsonProperty("author") val author: String,
#BeanProperty #JsonProperty("content") val content: String) {
#ObjectId #Id #BeanProperty var blogId: String = _
#BeanProperty #JsonProperty("uploadedFile") var uploadedFile: Option[(String, String, Long)] = None
}
object BlogPost {
def apply(
date: Date,
title: String,
author: String,
content: String): BlogPost = new BlogPost(date,title,author,content)
def unapply(e: Event) =
new Some((e.messageId,
e.date,
e.title,
e.author,
e.content,
e.blogId,
e.uploadedFile) )
private lazy val db = MongoDB.collection("blogposts", classOf[BlogPost], classOf[String])
def save(blogPost: BlogPost) { db.save(blogPost) }
def findByAuthor(author: String) = db.find().is("author", author).asScala
}
Saving to cache:
var latestBlogs = List[BlogPost]()
Cache.set("latestBlogs", latestBlogs, 30)
It throws an exception:
[error] n.s.e.s.d.DiskStorageFactory - Disk Write of latestBlogs failed:
java.io.NotSerializableException: BlogPost
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) ~[na:1.7.0_45]
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) ~[na:1.7.0_45]
at java.util.ArrayList.writeObject(ArrayList.java:742) ~[na:1.7.0_45]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.7.0_45]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) ~[na:1.7.0_45]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.7.0_45]
EDIT 1:
I tried to extends the object with Serializable, it doesn't work.
object BlogPost extends Serializable {}
EDIT 2:
The vitalii's comment works for me.
class BlogPost() extends scala.Serializable {}

Try to derive class BlogPost from Serializable or define it as a case class which are serializable by default.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Scala: Xtream complains object not serializable - scala

Related

Task not serializable after adding it to ForEachPartition

Is using function in transformation causing Not Serializable exceptions?

solr add document error

Spark application got the error of "Task not serializable"?

Play 2.2 Scala class serializable for cache

Categories

Resources