Serialization issue while renaming HDFS files in parallel using Scala Spark

I want to rename HDFS files in parallel using Spark, but I am getting a serialization exception; I have included the exception after my code.
The issue occurs when I use spark.sparkContext.parallelize. I am able to rename all the files when doing it in a plain loop.
def renameHdfsToS3(spark: SparkSession, hdfsFolder: String, outputFileName: String,
                   renameFunction: (String, String) => String, bktOutput: String,
                   folderOutput: String, kmsKey: String): Boolean = {
  try {
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val path = new Path(hdfsFolder)
    val files = fs.listStatus(path).filter(fs => fs.isFile)
    val parallelRename = spark.sparkContext.parallelize(files).map(f => {
      parallelRenameHdfs(fs, outputFileName, renamePartFileWithTS, f)
    })
    val hdfsTopLevelPath = fs.getWorkingDirectory() + "/" + hdfsFolder
    return true
  } catch {
    case NonFatal(e) => {
      e.printStackTrace()
      return false
    }
  }
}
Below is the exception I am getting
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)
Caused by: java.io.NotSerializableException: org.apache.hadoop.fs.LocalFileSystem
Serialization stack:
- object not serializable (class: org.apache.hadoop.fs.LocalFileSystem, value: org.apache.hadoop.fs.LocalFileSystem#1d96d872)
- field (class: …)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)

The approach is incorrect: sc.parallelize is for distributing data as RDDs, not for running file-system operations on executors. You need to work at the file-system level on the driver instead. Many such posts exist.
Something like this should suffice, blended with your own logic; note .par, which enables parallel processing, e.g.:
originalpath.par.foreach(e => hdfs.rename(e, e.suffix("finish")))
You need to check how parallelism is configured with .par. See https://docs.scala-lang.org/overviews/parallel-collections/configuration.html
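For reference, here is a minimal sketch of that idea applied to the scenario in the question: list the part files on the driver and rename them through a Scala parallel collection, so no FileSystem handle is ever shipped to executors. The renameInParallel name and the ".finish" suffix are illustrative (standing in for renamePartFileWithTS), and it assumes Scala 2.12 or earlier, where .par is available without the separate scala-parallel-collections module.
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

def renameInParallel(spark: SparkSession, hdfsFolder: String): Unit = {
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  // List only the files under the folder, exactly like the original code.
  val files = fs.listStatus(new Path(hdfsFolder)).filter(_.isFile).map(_.getPath)

  // .par runs the renames on the driver's fork-join pool; nothing is
  // serialized or sent to executors, so the FileSystem handle is fine here.
  files.par.foreach(p => fs.rename(p, p.suffix(".finish")))
}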

Related

Unable to get the value of first column from a row in dataset using spark scala

I'm trying to iterate over a DataFrame using foreachPartition to insert values into a database. I use foreachPartition, group the rows, and then use foreach to iterate over each row. Please find my code below:
val endDF = spark.read.parquet(path).select("pc").filter(col("pc").isNotNull)
endDF.foreachPartition((partition: Iterator[Row]) => {
  Class.forName(driver)
  val con = DriverManager.getConnection(jdbcurl, user, pwd)
  partition.grouped(100).foreach(batch => {
    val st = con.createStatement()
    batch.foreach(row => {
      val pc = row.get(0).toString()
      val in = s"""insert tshdim (pc) values(${pc})""".stripMargin
      st.addBatch(in)
    })
    st.executeLargeBatch
  })
  con.close()
})
When I try to get the pc value from the row (val pc = row.get(0).toString()), it throws the following exception. I'm doing this in spark-shell.
org.apache.spark.SparkException: Task not serializable ...
Caused by: java.io.NotSerializableException: org.apache.spark.sql.Dataset$RDDQueryExecution$
Serialization stack:
- object not serializable (class: org.apache.spark.sql.Dataset$RDDQueryExecution$, value: org.apache.spark.sql.Dataset$RDDQueryExecution$#jfaf)
- field (class: org.apache.spark.sql.Dataset, name: RDDQueryExecutionModule, type: org.apache.spark.sql.Dataset$RDDQueryExecution$)
- object (class: org.apache.spark.sql.Dataset, [pc: String])
The function passed to foreachPartition needs to be serialized and sent to the executors.
So, in your case, Spark is trying to serialize the DriverManager class and everything needed for your JDBC connection, and some of that is not serializable.
foreachPartition works fine without DriverManager:
endDF.foreachPartition((partition: Iterator[Row]) => {
  partition.grouped(100).foreach(batch => {
    batch.foreach(row => {
      val pc = row.get(0)
      println(pc)
    })
  })
})
To save it to your DB, first do .collect and write the rows from the driver.
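For completeness, a minimal sketch of that suggestion, assuming the filtered pc column is small enough to fit in driver memory; jdbcurl, user, pwd and the tshdim table come from the question:
import java.sql.DriverManager
import org.apache.spark.sql.Row

// Collect the single filtered column to the driver and run the JDBC batch
// there, so no JDBC object ever has to be serialized into a Spark closure.
val rows: Array[Row] = endDF.collect()

val con = DriverManager.getConnection(jdbcurl, user, pwd)
try {
  val st = con.createStatement()
  rows.grouped(100).foreach { batch =>
    batch.foreach(row => st.addBatch(s"insert tshdim (pc) values(${row.get(0)})"))
    st.executeBatch()
  }
} finally {
  con.close()
}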

Spark: java.io.NotSerializableException: com.amazonaws.services.s3.AmazonS3Client

I'm trying to read a large number of large files from S3, which takes considerable time if done as a DataFrame operation. So, following this post and the related gist, I'm trying to use an RDD to read the S3 objects in parallel, as below:
def dfFromS3Objects(s3: AmazonS3, bucket: String, prefix: String, pageLength: Int = 1000) = {
  import com.amazonaws.services.s3._
  import model._
  import spark.sqlContext.implicits._
  import scala.collection.JavaConversions._
  import scala.io.Source
  import java.io.InputStream

  val request = new ListObjectsRequest()
  request.setBucketName(bucket)
  request.setPrefix(prefix)
  request.setMaxKeys(pageLength)
  // Note that listObjects returns truncated data if there are more keys than "pageLength". You might need to deal with that.
  val objs: ObjectListing = s3.listObjects(request)
  spark.sparkContext.parallelize(objs.getObjectSummaries.map(_.getKey).toList)
    .flatMap { key => Source.fromInputStream(s3.getObject(bucket, key).getObjectContent: InputStream).getLines }
    .toDF()
}
which when tested ends up with
Caused by: java.io.NotSerializableException: com.amazonaws.services.s3.AmazonS3Client
Serialization stack:
- object not serializable (class: com.amazonaws.services.s3.AmazonS3Client, value: com.amazonaws.services.s3.AmazonS3Client#35c8be21)
- field (class: de.smava.data.bards.anonymize.HistoricalBardAnonymization$$anonfun$dfFromS3Objects$2, name: s3$1, type: interface com.amazonaws.services.s3.AmazonS3)
- object (class de.smava.data.bards.anonymize.HistoricalBardAnonymization$$anonfun$dfFromS3Objects$2, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:342)
... 63 more
I understand that the AmazonS3 object I supply needs to be shipped to the executors, and hence needs to be serializable, but this is from a sample snippet, meaning someone got it working. I need help figuring out what I am missing here.
In the gist, s3 is defined as a method, which creates a new client for every call. This is not recommended. One way around the problem is to use mapPartitions:
spark
  .sparkContext
  .parallelize(objs.getObjectSummaries.map(_.getKey).toList)
  .mapPartitions { it =>
    val s3 = ... // init the client here
    it.flatMap { key => Source.fromInputStream(s3.getObject(bucket, key).getObjectContent: InputStream).getLines }
  }
  .toDF
This would still create multiple clients per JVM, but possibly vastly fewer than the version that creates a client for every object. If you wanted to reuse the client between threads inside a JVM, you could, for example, wrap it in a top-level object
object Foo {
  val s3 = ...
}
and use static configuration for the client.
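A sketch of that pattern, assuming the AWS SDK v1 used in the question and credentials from the default provider chain; S3ClientHolder is an illustrative name:
import com.amazonaws.services.s3.{AmazonS3, AmazonS3ClientBuilder}
import scala.io.Source

// One client per JVM: the lazy val is initialized on first use on each
// executor and shared by all tasks running in that executor.
object S3ClientHolder {
  lazy val s3: AmazonS3 = AmazonS3ClientBuilder.standard().build()
}

// Inside mapPartitions the client is resolved on the executor and is never
// part of the serialized closure.
spark.sparkContext
  .parallelize(objs.getObjectSummaries.map(_.getKey).toList)
  .mapPartitions { it =>
    val s3 = S3ClientHolder.s3
    it.flatMap(key => Source.fromInputStream(s3.getObject(bucket, key).getObjectContent).getLines)
  }
  .toDF()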

java.io.NotSerializableException with Spark Streaming Checkpoint enabled

I have enabled checkpointing in my Spark Streaming application and encounter this error on a class that comes in as a dependency.
With no checkpointing the application works great.
Error:
com.fasterxml.jackson.module.paranamer.shaded.CachingParanamer
Serialization stack:
- object not serializable (class: com.fasterxml.jackson.module.paranamer.shaded.CachingParanamer, value: com.fasterxml.jackson.module.paranamer.shaded.CachingParanamer#46c7c593)
- field (class: com.fasterxml.jackson.module.paranamer.ParanamerAnnotationIntrospector, name: _paranamer, type: interface com.fasterxml.jackson.module.paranamer.shaded.Paranamer)
- object (class com.fasterxml.jackson.module.paranamer.ParanamerAnnotationIntrospector, com.fasterxml.jackson.module.paranamer.ParanamerAnnotationIntrospector#39d62e47)
- field (class: com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair, name: _secondary, type: class com.fasterxml.jackson.databind.AnnotationIntrospector)
- object (class com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair, com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair#7a925ac4)
- field (class: com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair, name: _primary, type: class com.fasterxml.jackson.databind.AnnotationIntrospector)
- object (class com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair, com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair#203b98cf)
- field (class: com.fasterxml.jackson.databind.cfg.BaseSettings, name: _annotationIntrospector, type: class com.fasterxml.jackson.databind.AnnotationIntrospector)
- object (class com.fasterxml.jackson.databind.cfg.BaseSettings, com.fasterxml.jackson.databind.cfg.BaseSettings#78c34153)
- field (class: com.fasterxml.jackson.databind.cfg.MapperConfig, name: _base, type: class com.fasterxml.jackson.databind.cfg.BaseSettings)
- object (class com.fasterxml.jackson.databind.DeserializationConfig, com.fasterxml.jackson.databind.DeserializationConfig#2df0a4c3)
- field (class: com.fasterxml.jackson.databind.ObjectMapper, name: _deserializationConfig, type: class com.fasterxml.jackson.databind.DeserializationConfig)
- object (class com.fasterxml.jackson.databind.ObjectMapper, com.fasterxml.jackson.databind.ObjectMapper#2db07651)
I am not sure how to make this class serializable, as it's a Maven dependency. I am using v2.6.0 of jackson-core in my pom.xml. If I try to use a newer version of jackson-core, I get an incompatible Jackson version exception.
Code
liveRecordStream
  .foreachRDD(newRDD => {
    if (!newRDD.isEmpty()) {
      val cacheRDD = newRDD.cache()
      val updTempTables = tempTableView(t2s, stgDFMap, cacheRDD)
      val rdd = updatestgDFMap(stgDFMap, cacheRDD)
      persistStgTable(stgDFMap)
      dfMap
        .filter(entry => updTempTables.contains(entry._2))
        .map(spark.sql)
        .foreach(df => writeToES(writer, df))
      cacheRDD.unpersist()
    }
  })
The issue happens only if a method call occurs inside foreachRDD, like tempTableView in this case.
tempTableView
def tempTableView(t2s: Map[String, StructType], stgDFMap: Map[String, DataFrame], cacheRDD: RDD[cacheRDD]): Set[String] = {
  stgDFMap.keys.filter { table =>
    val tRDD = cacheRDD
      .filter(r => r.Name == table)
      .map(r => r.values)
    val tDF = spark.createDataFrame(tRDD, tableNameToSchema(table))
    if (!tRDD.isEmpty()) {
      val tName = s"temp_$table"
      tDF.createOrReplaceTempView(tName)
    }
    !tRDD.isEmpty()
  }.toSet
}
Any help is appreciated. Not sure how to debug this and fix the issue.
From the code snippet you have shared, I don't see where the Jackson library is invoked. However, NotSerializableException usually happens when you try to send an object that doesn't implement the Serializable interface over the wire.
Spark is a distributed processing engine, meaning it works this way: there is a driver and multiple executors across nodes. Only the part of the code that needs to be computed is sent by the driver to the executors (over the wire). Spark transformations run that way, i.e. across multiple nodes, and if you pass an instance of a class that doesn't implement the Serializable interface into such a code block (a block that executes across nodes), it will throw NotSerializableException.
Ex:
def main(args: Array[String]): Unit = {
  val gson: Gson = new Gson()
  val sparkConf = new SparkConf().setMaster("local[2]")
  val spark = SparkSession.builder().config(sparkConf).getOrCreate()
  val rdd = spark.sparkContext.parallelize(Seq("0", "1"))
  val something = rdd.map(str => {
    gson.toJson(str)
  })
  something.foreach(println)
  spark.close()
}
This code block will throw a NotSerializableException because we are sending an instance of Gson into a distributed function: map is a Spark transformation, so it executes on the executors. The following will work:
def main(args: Array[String]): Unit = {
  val sparkConf = new SparkConf().setMaster("local[2]")
  val spark = SparkSession.builder().config(sparkConf).getOrCreate()
  val rdd = spark.sparkContext.parallelize(Seq("0", "1"))
  val something = rdd.map(str => {
    val gson: Gson = new Gson()
    gson.toJson(str)
  })
  something.foreach(println)
  spark.close()
}
The reason the above works is that we instantiate Gson inside the transformation, so it is created on the executor. It is never sent from the driver program over the wire, so no serialization is needed.
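Not from the answer above, but a common refinement of the same idea: if constructing the helper object per record is costly, mapPartitions lets you build it once per partition, still on the executor:
val something = rdd.mapPartitions { iter =>
  val gson: Gson = new Gson() // created on the executor, once per partition
  iter.map(str => gson.toJson(str))
}
something.foreach(println)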
The issue was with the Jackson ObjectMapper being serialized; the ObjectMapper should not be serialized. Fixed this by marking it transient: @transient val objMapper = new ObjectMapper...
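A minimal sketch of that fix, with an illustrative wrapper class (the real class in the application is not shown in the question). Marking the mapper both @transient and lazy means it is skipped during serialization and rebuilt on the executor after deserialization:
import com.fasterxml.jackson.databind.ObjectMapper

class RecordMapper extends Serializable {
  // Excluded from the serialized closure/checkpoint; re-created lazily
  // wherever an instance of this class is deserialized.
  @transient lazy val objMapper: ObjectMapper = new ObjectMapper()

  def toJson(value: Any): String = objMapper.writeValueAsString(value)
}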

org.apache.spark.SparkException: Task not serializable in Spark Scala

I am trying to get employeeId from employee_table and use this id to query the employee_address table to fetch the address.
There is nothing wrong with the tables, but when I run the code below I get org.apache.spark.SparkException: Task not serializable.
I think I know the issue: the SparkContext lives on the master and not on the workers. But I don't know how to get my head around this.
val employeeRDDRdd = sc.cassandraTable("local_keyspace", "employee_table")

try {
  val data = employeeRDDRdd
    .map(row => {
      row.getStringOption("employeeID") match {
        case Some(s) if (s != null) && s.nonEmpty => s
        case None => ""
      }
    })
  // Create tuples of employee id and address, filtering out employees whose address is empty.
  val id = data
    .map(s => (s, getID(s)))
    .filter(tups => tups._2.nonEmpty)
  // Print the total size of the RDD.
  println(id.count())
} catch {
  case e: Exception => e.printStackTrace()
}

def getID(employeeID: String): String = {
  val addressRDD = sc.cassandraTable("local_keyspace", "employee_address")
  val data = addressRDD.map(row => row.getStringOption("address") match {
    case Some(s) if (s != null) && s.nonEmpty => s
    case None => ""
  })
  data.collect()(0)
}
Exception ==>
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2039)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:366)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:365)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.map(RDD.scala:365)
Serialization Error Caused by SparkContext Captured in Lambda
The serialization issue is caused by
val addressRDD = sc.cassandraTable("local_keyspace", "employee_address")
This portion is used inside a serialized lambda here:
val id = data
.map(s => (s,getID(s)))
All RDD transformations represent remotely executed code, which means their entire contents must be serializable.
The SparkContext is not serializable, but it is necessary for getID to work, so there is an exception. The basic rule is that you cannot touch the SparkContext within any RDD transformation.
If you are actually trying to join with data in Cassandra, you have a few options.
If you are just pulling rows based on the partition key:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#using-joinwithcassandratable
If you are trying to join on some other field:
Load both RDDs separately and do a Spark join
// Key both RDDs on the join column (e.g. with keyBy) before calling join.
val leftrdd = sc.cassandraTable(test, table1)
val rightrdd = sc.cassandraTable(test, table2)
leftrdd.join(rightrdd)
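A minimal sketch of that approach using the tables from the question; it assumes employee_address also has an employeeID column to join on (joinWithCassandraTable, linked above, is the alternative when joining on the partition key):
import com.datastax.spark.connector._

// Load both tables as RDDs and key them by the join column; no SparkContext
// is touched inside any transformation, so nothing non-serializable is captured.
val employees = sc.cassandraTable("local_keyspace", "employee_table")
  .keyBy(row => row.getString("employeeID"))

val addresses = sc.cassandraTable("local_keyspace", "employee_address")
  .keyBy(row => row.getString("employeeID"))

val employeeWithAddress = employees.join(addresses)
  .map { case (id, (emp, addr)) => (id, addr.getString("address")) }

println(employeeWithAddress.count())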

SparkContext not serializable inside a companion object

I'm currently trying to extend a machine learning application that uses Scala and Spark. I'm using the structure of a previous project from Dieterich Lawson that I found on GitHub:
https://github.com/dieterichlawson/admm
This project basically uses SparkContext to build an RDD of blocks of training samples and then performs local computations on each of these sets (for example, solving a linear system).
I was following the same scheme, but for my local computation I need to run an L-BFGS algorithm on each block of training samples. To do so, I wanted to use the L-BFGS implementation from MLlib, which has the following signature:
runLBFGS(RDD<scala.Tuple2<Object,Vector>> data, Gradient gradient,
Updater updater, int numCorrections, double convergenceTol,
int maxNumIterations, double regParam, Vector initialWeights)
As it says, the method takes as input an RDD[(Object, Vector)] of the training samples. The problem is that locally, on each worker, I no longer have the data in RDD form. Therefore, I'm trying to call the SparkContext's parallelize function on each block of the matrix. But when I do this, I get a serialization exception. (The exact exception message is at the end of the question.)
Here is a detailed explanation of how I'm handling the SparkContext.
First, in the main application it is used to open a text file, and then it is used in the factory method of the class LogRegressionXUpdate:
val A = sc.textFile("ds1.csv")
A.checkpoint
val f = LogRegressionXUpdate.fromTextFile(A,params.rho,1024,sc)
In the application, the class LogRegressionXUpdate is implemented as follows
class LogRegressionXUpdate(val training: RDD[(Double, NV)],
                           val rho: Double) extends Function1[BDV[Double], Double] with Prox with Serializable {

  def prox(x: BDV[Double], rho: Double): BDV[Double] = {
    val numCorrections = 10
    val convergenceTol = 1e-4
    val maxNumIterations = 20
    val regParam = 0.1
    val (weights, loss) = LBFGS.runLBFGS(
      training,
      new GradientForLogRegADMM(rho, fromBreeze(x)),
      new SimpleUpdater(),
      numCorrections,
      convergenceTol,
      maxNumIterations,
      regParam,
      fromBreeze(x))
    toBreeze(weights.toArray).toDenseVector
  }

  def apply(x: BDV[Double]): Double = {
    Math.pow(1, 2.0)
  }
}
With the following companion object:
object LogRegressionXUpdate {
  def fromTextFile(file: RDD[String], rho: Double, blockHeight: Int = 1024,
                   @transient sc: SparkContext): RDF[LogRegressionXUpdate] = {
    val fns = new BlockMatrix(file, blockHeight).blocks
      .map(X => new LogRegressionXUpdate(
        sc.parallelize((X(*, ::).map(fila => (fila(-1), fromBreeze(fila(0 to -2))))).toArray),
        rho))
    new RDF[LogRegressionXUpdate](fns, 0L)
  }
}
This constructor causes a serialization error, even though I don't really need the SparkContext to build each RDD locally. I've searched for solutions to this problem, and adding @transient didn't solve it.
My question, then, is: is it really possible to build these "second layer RDDs", or am I forced to use a non-distributed version of the L-BFGS algorithm?
Thanks in advance!
Error Log:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:315)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:294)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:293)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
at org.apache.spark.rdd.RDD.map(RDD.scala:293)
at admm.functions.LogRegressionXUpdate$.fromTextFile(LogRegressionXUpdate.scala:70)
at admm.examples.Lasso$.run(Lasso.scala:96)
at admm.examples.Lasso$$anonfun$main$1.apply(Lasso.scala:70)
at admm.examples.Lasso$$anonfun$main$1.apply(Lasso.scala:69)
at scala.Option.map(Option.scala:145)
at admm.examples.Lasso$.main(Lasso.scala:69)
at admm.examples.Lasso.main(Lasso.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext#20576557)
- field (class: admm.functions.LogRegressionXUpdate$$anonfun$1, name: sc$1, type: class org.apache.spark.SparkContext)
- object (class admm.functions.LogRegressionXUpdate$$anonfun$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:312)
... 21 more
RDDs should only be accessed from the driver. Whenever you call something like
myRDD.map(someObject.someMethod)
Spark serializes whatever is needed for the computation of someMethod and sends it to the workers. There, the method is deserialized and then run on each partition independently.
You, however, try to use a method that itself uses Spark: you attempt to create a new RDD. That is not possible, since RDDs can only be created on the driver. The error you see is Spark's attempt to serialize the SparkContext itself, since it is needed for the computation at each block. More about serialization can be found in the first answer to this question.
"... though I'm not really needing the SparkContext to build each RDD locally" - actually this is exactly what you are doing when calling sc.parallelize. Bottom line: you need to find (or write) a local implementation of L-BFGS.