org.apache.spark.SparkException: Task not serializable in Spark Scala

I am trying to get employeeId from employee_table and use this id to query the employee_address table to fetch the address.
There is nothing wrong with the tables, but when I run the code below I get org.apache.spark.SparkException: Task not serializable.
I think I know the issue: the SparkContext lives only on the master, not on the workers. But I don't know how to work around this.
val employeeRDDRdd = sc.cassandraTable("local_keyspace", "employee_table")

try {
  val data = employeeRDDRdd
    .map(row => {
      row.getStringOption("employeeID") match {
        case Some(s) if (s != null) && s.nonEmpty => s
        case None => ""
      }
    })

  // Create tuples of employee id and address, filtering out employees whose address is empty.
  val id = data
    .map(s => (s, getID(s)))
    .filter(tups => tups._2.nonEmpty)

  // Print the total size of the RDD.
  println(id.count())
} catch {
  case e: Exception => e.printStackTrace()
}

def getID(employeeID: String): String = {
  val addressRDD = sc.cassandraTable("local_keyspace", "employee_address")

  val data = addressRDD.map(row => row.getStringOption("address") match {
    case Some(s) if (s != null) && s.nonEmpty => s
    case None => ""
  })

  data.collect()(0)
}
Exception ==>
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2039)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:366)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:365)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.map(RDD.scala:365)

Serialization Error Caused by SparkContext Captured in Lambda
The serialization issue is caused by
val addressRDD = sc.cassandraTable("local_keyspace", "employee_address")
This code runs inside a lambda that itself has to be serialized, here:
val id = data
.map(s => (s,getID(s)))
All RDD transformations represent remotely executed code, which means their entire contents must be serializable.
The SparkContext is not serializable, but it is required for getID to work, so an exception is thrown. The basic rule is that you cannot touch the SparkContext inside any RDD transformation.
If you are actually trying to join with data in Cassandra, you have a few options.
If you are just pulling rows based on the partition key, use joinWithCassandraTable:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#using-joinwithcassandratable
If you are trying to join on some other field, load both RDDs separately and do a Spark join:
val leftRdd = sc.cassandraTable("test", "table1")
val rightRdd = sc.cassandraTable("test", "table2")
leftRdd.join(rightRdd)
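For example, a minimal sketch of that keyed join for the tables in the question (it assumes employee_address also carries an employeeID column, which is not shown in the question):

import com.datastax.spark.connector._

// Key both tables by employeeID on the Spark side; the SparkContext is only used
// on the driver, never inside a transformation.
val employees = sc.cassandraTable("local_keyspace", "employee_table")
  .map(row => (row.getString("employeeID"), row))

val addresses = sc.cassandraTable("local_keyspace", "employee_address")
  .map(row => (row.getString("employeeID"), row.getString("address")))

// RDD joins need (key, value) pairs; the result is (employeeID, (employeeRow, address)).
val joined = employees.join(addresses)
  .filter { case (_, (_, address)) => address != null && address.nonEmpty }

println(joined.count())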

Related

Unable to get the value of first column from a row in dataset using spark scala

I'm trying to iterate over a dataframe using foreachPartition to insert values into a database. Inside each partition I group the rows and use foreach to iterate over each row. Please find my code below:
val endDF = spark.read.parquet(path).select("pc").filter(col("pc").isNotNull)

endDF.foreachPartition((partition: Iterator[Row]) => {
  Class.forName(driver)
  val con = DriverManager.getConnection(jdbcurl, user, pwd)
  partition.grouped(100).foreach(batch => {
    val st = con.createStatement()
    batch.foreach(row => {
      val pc = row.get(0).toString()
      val in = s"""insert tshdim (pc) values(${pc})""".stripMargin
      st.addBatch(in)
    })
    st.executeLargeBatch
  })
  con.close()
})
When I try to get the pc value from the row (val pc = row.get(0).toString()) it throws the following exception. I'm doing this in spark-shell.
org.apache.spark.SparkException: Task not serializable ...
Caused by: java.io.NotSerializableException: org.apache.spark.sql.Dataset$RDDQueryExecution$
Serialization stack:
- object not serializable (class: org.apache.spark.sql.Dataset$RDDQueryExecution$, value: org.apache.spark.sql.Dataset$RDDQueryExecution$@jfaf)
- field (class: org.apache.spark.sql.Dataset, name: RDDQueryExecutionModule, type: org.apache.spark.sql.Dataset$RDDQueryExecution$)
- object (class: org.apache.spark.sql.Dataset, [pc: String])
The function passed to foreachPartition needs to be serialized and shipped to the executors.
So, in your case, Spark is trying to serialize the DriverManager class and everything else needed for your JDBC connection, and some of that is not serializable.
foreachPartition works without the DriverManager:
endDF.foreachPartition((partition: Iterator[Row]) => {
  partition.grouped(100).foreach(batch => {
    batch.foreach(row => {
      val pc = row.get(0)
      println(pc)
    })
  })
})
To save the values in your DB, first collect them on the driver with .collect and write them from there.
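To illustrate, a minimal sketch of that driver-side write, reusing the jdbcurl, user, pwd and tshdim table from the question (adjust the insert statement to your SQL dialect):

import java.sql.DriverManager

// Collect the single "pc" column on the driver, so nothing from the Dataset is
// captured inside an executor-side closure.
val pcValues = endDF.collect().map(_.get(0).toString)

val con = DriverManager.getConnection(jdbcurl, user, pwd)
try {
  val st = con.createStatement()
  pcValues.grouped(100).foreach { batch =>
    batch.foreach(pc => st.addBatch(s"insert into tshdim (pc) values ('$pc')"))
    st.executeBatch()  // write each group of 100 rows as one batch
  }
} finally {
  con.close()
}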

Clarification on Spark Scala UDFs

I have a few questions regarding Spark UDFs.
What is the difference between the spark.udf.register syntax and udf?
What I have explored so far is that spark.udf.register allows me to pass a function that has a name, i.e.
def isLessThanAverage(revenue: Double) = {
  revenue <= average match {
    case true => "BelowAverage"
    case false => "AboveAverage"
  }
}
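For example, I can register it by name and call it from Spark SQL (a sketch; it assumes average is already defined and the dataframe is exposed as a temp view):

spark.udf.register("isLessThanAverage", isLessThanAverage _)

sales_by_date_log.createOrReplaceTempView("sales_by_date_log")
spark.sql("SELECT Revenue, isLessThanAverage(Revenue) AS IsBelowAverage FROM sales_by_date_log").show()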
I am encountering a weird error using udf.
val lessThanAverage_udf = udf((revenue: Double) => revenue <= average match {case true => "BelowAverage" case false => "AboveAverage"})
The above code block works, and I can apply it to a dataframe, i.e.
val sales_marker = sales_by_date_log.withColumn("IsBelowAverage",lessThanAverage_udf(col("Revenue")))
sales_marker.show()
However, if I use
val lessThanAverage2 = (revenue: Double) => revenue <= average match {case true => "BelowAverage" case false => "AboveAverage"}
val lessThanAverage_UDF = udf(lessThanAverage2)
val sales_marker2 = sales_by_date_log.withColumn("IsBelowAverage",lessThanAverage_UDF(col("Revenue")))
sales_marker2.show()
I get the error below:
Job aborted due to stage failure.
Caused by: ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field line425abd1ae5ca416b8e9fe842b8ff8fc6532.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.lessThanAverage2 of type scala.Function1 in instance of line425abd1ae5ca416b8e9fe842b8ff8fc6532.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw
EDIT: I'm using Spark 3.0.1 and Scala 2.12.10.

Serialization issue while renaming HDFS File using scala Spark in parallel

I want to rename HDFS files in parallel using Spark, but I am getting a serialization exception; I have included the exception after my code.
The issue occurs when I use spark.sparkContext.parallelize. I am able to rename all the files when doing it in a plain loop.
def renameHdfsToS3(spark: SparkSession, hdfsFolder: String, outputFileName: String,
                   renameFunction: (String, String) => String, bktOutput: String,
                   folderOutput: String, kmsKey: String): Boolean = {
  try {
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val path = new Path(hdfsFolder)
    val files = fs.listStatus(path).filter(fs => fs.isFile)

    val parallelRename = spark.sparkContext.parallelize(files).map(f => {
      parallelRenameHdfs(fs, outputFileName, renamePartFileWithTS, f)
    })

    val hdfsTopLevelPath = fs.getWorkingDirectory() + "/" + hdfsFolder
    return true
  } catch {
    case NonFatal(e) => {
      e.printStackTrace()
      return false
    }
  }
}
Below is the exception I am getting
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)
Caused by: java.io.NotSerializableException: org.apache.hadoop.fs.LocalFileSystem
Serialization stack:
- object not serializable (class: org.apache.hadoop.fs.LocalFileSystem, value: org.apache.hadoop.fs.LocalFileSystem@1d96d872)
- field (class: ...
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
The approach is incorrect: sc.parallelize is for processing data via RDDs, not for running file-system operations. You need to do the renaming at the file-system level on the driver; many such posts exist.
Something like the following should suffice, blended with your own logic. Note .par, which enables parallel processing, e.g.:
originalpath.par.foreach(e => hdfs.rename(e, e.suffix("finish")))
You need to check how parallelism is defined with .par; see https://docs.scala-lang.org/overviews/parallel-collections/configuration.html
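A slightly fuller sketch of that approach, using the question's hdfsFolder (the "_finished" suffix is only an illustration; on Scala 2.13+ the .par call additionally needs the scala-parallel-collections module):

import org.apache.hadoop.fs.{FileSystem, Path}

// Everything here runs on the driver, so no Hadoop objects have to be serialized.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val files = fs.listStatus(new Path(hdfsFolder)).filter(_.isFile).map(_.getPath)

// .par turns the Array into a parallel collection, so the renames run on several driver threads.
files.par.foreach(p => fs.rename(p, p.suffix("_finished")))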

NullPointerException applying a function to spark RDD that works on non-RDD

I have a function that I want to apply to every row of a .csv file:
def convert(inString: Array[String]): String = {
  val country = inString(0)
  val sellerId = inString(1)
  val itemID = inString(2)
  try {
    val minidf = sqlContext.read.json(sc.makeRDD(inString(3) :: Nil))
      .withColumn("country", lit(country))
      .withColumn("seller_id", lit(sellerId))
      .withColumn("item_id", lit(itemID))
    val finalString = minidf.toJSON.collect().mkString(",")
    finalString
  } catch {
    case e: Exception =>
      println("AN EXCEPTION " + inString.mkString(","))
      "this is an exception " + e + " " + inString.mkString(",")
  }
}
This function transforms an entry of the sort:
CA 112578240 132080411845 [{"id":"general_spam_policy","severity":"critical","timestamp":"2017-02-26T08:30:16Z"}]
where I have 4 columns, the 4th being a JSON blob, into
[{"country":"CA", "seller_id":"112578240", "item_id":"132080411845", "id":"general_spam_policy","severity":"critical","timestamp":"2017-02-26T08:30:16Z"}]
which is the JSON object with the first 3 columns inserted into the fourth.
Now, this works:
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).collect().map(x => convert(x))
or this:
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).take(10).map(x => convert(x))
but this does not
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).map(x => convert(x))
The last one throws a java.lang.NullPointerException.
I included a try/catch clause to see where exactly this is failing, and it's failing for every single row.
What am I doing wrong here?
You cannot use sqlContext or sparkContext inside a Spark map, since those objects can only exist on the driver node; essentially, they are in charge of distributing your tasks.
You could rewrite the JSON-parsing bit in pure Scala using one of these libraries: https://manuel.bernhardt.io/2015/11/06/a-quick-tour-of-json-libraries-in-scala/
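For instance, a minimal sketch with json4s (one possible choice among the linked libraries) that builds the enriched JSON in plain Scala, so the map closure no longer touches sqlContext:

import org.json4s._
import org.json4s.jackson.JsonMethods._

// Hypothetical pure-Scala replacement for convert(): prepend the first three
// columns as fields of every object in the JSON blob from column 4.
def convert(inString: Array[String]): String = {
  val extra = List(
    "country"   -> JString(inString(0)),
    "seller_id" -> JString(inString(1)),
    "item_id"   -> JString(inString(2)))

  val enriched = parse(inString(3)) match {
    case JArray(items) =>
      JArray(items.map {
        case JObject(fields) => JObject(extra ++ fields)
        case other           => other
      })
    case other => other
  }
  compact(render(enriched))
}

// The plain .map now works, because the closure only captures serializable code:
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).map(convert)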

Scala-Spark NullPointerError when submitting jar, not in shell

My Spark job raises a null pointer exception that I cannot trace down. When I print the potentially null variables, they're all populated on every worker. My data does not contain null values, and the same job works within the spark-shell. The execute function of the job is below, followed by the error message.
All helper methods not defined in the function are defined within the body of the Spark job object, so I believe closures are not the problem.
override def execute(sc:SparkContext) = {
def construct_query(targetTypes:List[String]) = Map("query" ->
Map("nested" ->
Map("path"->"annotations.entities.items",
"query"-> Map("terms"->
Map("annotations.entities.items.type"-> targetTypes)))))
val sourceConfig = HashMap(
"es.nodes" -> params.targetClientHost
)
// Base elastic search RDD returning articles which match the above query on entity types
val rdd = EsSpark.esJsonRDD(sc,
params.targetIndex,
toJson(construct_query(params.entityTypes)),
sourceConfig
).sample(false,params.sampleRate)
// Mapping ES json into news article object, then extracting the entities list of
// well defined annotations
val objectsRDD = rdd.map(tuple => {
  val maybeArticle =
    try {
      Some(JavaJsonUtils.fromJson(tuple._2, classOf[SearchableNewsArticle]))
    } catch {
      case e: Exception => None
    }
  (tuple._1, maybeArticle)
}).filter(tuple => {
  tuple._2.isDefined && tuple._2.get.annotations.isDefined &&
    tuple._2.get.annotations.get.entities.isDefined
}).map(tuple => (tuple._1, tuple._2.get.annotations.get.entities.get))
// flat map the RDD of entities lists into a list of (entity text, (entity type, 1)) tuples
val entityDataMap: RDD[(String, (String, Int))] = objectsRDD.flatMap(tuple =>   // line 79
  tuple._2.items.collect({
    case item if (item.`type`.isDefined) && (item.text.isDefined) &&
      (params.entityTypes.contains(item.`type`.get)) =>                          // line 81
        (cleanUpText(item.text.get), (item.`type`.get, 1))
  }))
// bucketize the tuples RDD into entity text, List(entity_type, entity_count) to make count aggregation and file writeouts
// easier to follow
val finalResults: Array[(String, (String, Int))] = entityDataMap.reduceByKey((x, y) => (x._1, x._2+y._2)).collect()
val entityTypeMapping = Map(
"HealthCondition" -> "HEALTH_CONDITION",
"Drug" -> "DRUG",
"FieldTerminology" -> "FIELD_TERMINOLOGY"
)
for (finalTuple <- finalResults) {
  val entityText = finalTuple._1
  val entityType = finalTuple._2._1
  if (entityTypeMapping.contains(entityType)) {
    if (!Files.exists(Paths.get(entityTypeMapping.get(entityType).get + ".txt"))) {
      val myFile = new java.io.FileOutputStream(new File(entityTypeMapping.get(entityType).get + ".txt"), false)
      printToFile(myFile) { p => p.println(entityTypeMapping.get(entityType)) }
    }
  }
  val myFile = new java.io.FileOutputStream(new File(entityTypeMapping.get(entityType).get + ".txt"), true)
  printToFile(myFile) { p => p.println(entityText) }
}
}
And the error message below:
java.lang.NullPointerException
at com.quid.gazetteers.GazetteerGenerator$$anonfun$4$$anonfun$apply$1.isDefinedAt(GazetteerGenerator.scala:81)
at com.quid.gazetteers.GazetteerGenerator$$anonfun$4$$anonfun$apply$1.isDefinedAt(GazetteerGenerator.scala:79)
at scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.collect(TraversableLike.scala:278)
at scala.collection.AbstractTraversable.collect(Traversable.scala:105)
at com.quid.gazetteers.GazetteerGenerator$$anonfun$4.apply(GazetteerGenerator.scala:79)
at com.quid.gazetteers.GazetteerGenerator$$anonfun$4.apply(GazetteerGenerator.scala:79)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:189)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
This question has been resolved: the params attribute was not serialized and so was not available to the Spark workers. The solution is to create a Spark broadcast variable in the scope where the params attribute is needed.
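A minimal sketch of that fix, assuming params.entityTypes is the value needed inside the flatMap closure:

// Broadcast just the value the closure needs, instead of capturing the whole job object.
val entityTypesBc = sc.broadcast(params.entityTypes)

val entityDataMap: RDD[(String, (String, Int))] = objectsRDD.flatMap(tuple =>
  tuple._2.items.collect({
    case item if item.`type`.isDefined && item.text.isDefined &&
      entityTypesBc.value.contains(item.`type`.get) =>
        (cleanUpText(item.text.get), (item.`type`.get, 1))
  }))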