SORM class cast exception - scala

I'm trying to migrate my Play Framework app from Slick to SORM. I created the basic classes, but I cannot get it working:
case class UserAction(url: String, request: Map[String, String], ts: DateTime)

val ts = DateTime.now()
val act = Db.save(UserAction(request.uri, request.headers.toSimpleMap, ts))
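For context, Db here would be a SORM Instance along these lines (a sketch: the entity registration follows from the usage above, but the connection URL is an assumption, since the original post does not show the instance definition):

import sorm._

// Sketch of the SORM instance behind Db.save; the H2 URL is a placeholder.
object Db extends Instance(
  entities = Set(Entity[UserAction]()),
  url = "jdbc:h2:mem:useractions"
)

Note that SORM's save returns the entity mixed in with sorm.Persisted, so the runtime class of the saved value is a generated PersistedAnonymous subclass rather than plain UserAction, which is exactly the class named in the cast failure below.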
But I'm getting the following exception:
play.api.Application$$anon$1: Execution exception[[ClassCastException: __wrapper$1$cf01f4410488416a9990d12622889a84.__wrapper$1$cf01f4410488416a9990d12622889a84$PersistedAnonymous31$1 cannot be cast to models.sorm.UserAction]]
at play.api.Application$class.handleError(Application.scala:293) ~[play_2.10.jar:2.2.1]
at play.api.DefaultApplication.handleError(Application.scala:399) [play_2.10.jar:2.2.1]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$2$$anonfun$applyOrElse$3.apply(PlayDefaultUpstreamHandler.scala:261) [play_2.10.jar:2.2.1]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$2$$anonfun$applyOrElse$3.apply(PlayDefaultUpstreamHandler.scala:261) [play_2.10.jar:2.2.1]
at scala.Option.map(Option.scala:145) [scala-library.jar:na]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$2.applyOrElse(PlayDefaultUpstreamHandler.scala:261) [play_2.10.jar:2.2.1]
Caused by: java.lang.ClassCastException: __wrapper$1$cf01f4410488416a9990d12622889a84.__wrapper$1$cf01f4410488416a9990d12622889a84$PersistedAnonymous31$1 cannot be cast to models.sorm.UserAction
at Global$$anonfun$doFilter$1.apply(Global.scala:21) ~[classes/:na]
at Global$$anonfun$doFilter$1.apply(Global.scala:15) ~[classes/:na]
at play.api.mvc.EssentialAction$$anon$2.apply(Action.scala:61) ~[play_2.10.jar:2.2.1]
at play.api.mvc.EssentialAction$$anon$2.apply(Action.scala:60) ~[play_2.10.jar:2.2.1]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$12.apply(PlayDefaultUpstreamHandler.scala:162) ~[play_2.10.jar:2.2.1]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$12.apply(PlayDefaultUpstreamHandler.scala:160) ~[play_2.10.jar:2.2.1]
What am I doing wrong?

Related

Writing to Kafka topic from Scala Spark fails with error: java.lang.ClassCastException

I am trying to read data from Elasticsearch and dump it into a Kafka topic. When writing to the Kafka topic I get a ClassCastException.
I am using Spark 2.3.2 and Scala 2.11.2. I am running the code below and ending up with a ClassCastException. I am not sure how to fix this. The same code (writing to Kafka from Spark) seems to work for others (https://stackoverflow.com/a/66546024/153271).
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions.{struct, to_json}
import spark.implicits._

val reader = spark.read.format("org.elasticsearch.spark.sql")
  .option("es.nodes.wan.only", "false")
  .option("es.port", "9200")
  .option("es.net.ssl", "false")
  .option("es.nodes", "elasticsearch.com")
  .option("es.read.field.include", "userId,refNum")
  .option("es.nodes.discovery", "true")
val df = reader.load("xxxxx")
df.show()
df.registerTempTable("dftable")
val df2 = sql("select * from dftable where refNum='yyyy' limit 2")
df2.show
val kafkaServer: String = "localhost:9092"
val topicSampleName: String = "quickstart-events"
val df3 = df2.cache()
df3.select(to_json(struct("*")).as("value"))
  .selectExpr("CAST(value AS STRING)")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaServer)
  .option("topic", topicSampleName)
  .save()
The error I get is as follows:
2022-05-02 14:02:52 ERROR TaskSetManager:70 - Task 0 in stage 19.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 19.0 failed 4 times, most recent failure: Lost task 0.3 in stage 19.0 (TID 27, 192.168.22.145, executor 0): java.lang.ClassCastException: cannot assign instance of org.apache.spark.sql.kafka010.KafkaWriter$$anonfun$write$1 to field org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.cleanF$6 of type scala.Function1 in instance of org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29
at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)
at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2411)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:80)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1651)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1639)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1638)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1638)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1872)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1821)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1810)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:933)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:933)
at org.apache.spark.sql.kafka010.KafkaWriter$.write(KafkaWriter.scala:87)
at org.apache.spark.sql.kafka010.KafkaSourceProvider.createRelation(KafkaSourceProvider.scala:206)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:264)
... 51 elided
Caused by: java.lang.ClassCastException: cannot assign instance of org.apache.spark.sql.kafka010.KafkaWriter$$anonfun$write$1 to field org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.cleanF$6 of type scala.Function1 in instance of org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29
at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)
at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2411)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:80)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This is what I have tried so far:
When I run df3.select(to_json(struct("*")).as("value")).selectExpr("CAST(value AS STRING)") the expression evaluates fine, and when I append .show to it I can see the JSON in the dataframe. That means the problem must be coming from somewhere in .write.format("kafka"), but I am unable to identify where.
I am new to Spark/Scala. How can I fix it?
It looks like a deserialization error.
You might want to check whether you have properly structured data for Kafka.
This part seems to hold the error:
to_json(struct("*")).as("value")
Check here for how to create a Kafka source for streaming queries.
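For reference, the batch write pattern from the Spark Kafka integration guide looks roughly like the sketch below (the bootstrap servers and topic name are placeholders). Note also that this particular "cannot assign instance of ... to field ... of type scala.Function1" shape often shows up when the spark-sql-kafka connector jar does not match the cluster's Spark/Scala version, so that is worth double-checking as well.

// Batch write to Kafka as shown in the Spark integration guide;
// the bootstrap servers and topic name are placeholders.
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic1")
  .save()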

Scala Runtime Error : org.apache.spark.SparkException:Failed to execute user defined function(Tokenizer$$Lambda$../..: (string) => array<string>)

I have the following program, which runs in cluster mode on 4 EMR VMs:

val spark = ...        // a SparkSession
val file_path = ...    // input from args
val wherePutOutput = file_path + "\\output\\"
val whereReadingDB = file_path + "\\inputJsonFiles"

// PREPROCESS DB AND WRITE IT ON S3
var toBePreprocessedDB = readFullDBFromJson(whereReadingDB, spark).repartition(5)

// PREPROCESS FULL DB
println("starting preprocessing")
val preProcessData = new DataPreprocessing(toBePreprocessedDB)
var tokenizedPreprocessedDB = preProcessData.preProcessDF()

// WRITE PREPROCESSED DBs TO A FILE
println("finished preprocessing, starting to write files")
tokenizedPreprocessedDB.write.mode(SaveMode.Overwrite).format("json").save(wherePutOutput + "foo/")
where the DataPreprocessing class is:

class DataPreprocessing(private var df: DataFrame) {

  def preProcessDF(): DataFrame = {
    val cleanedDF = removePunctuation()
    val tokenizedDF = tokenize(cleanedDF)
    removeStopWords(tokenizedDF)
  }

  def setDF(newDF: DataFrame): Unit = {
    df = newDF
  }

  // returns sentenceDF_clean
  private def removePunctuation(): DataFrame = {
    df.select(col("id"), lower(regexp_replace(col("content"), "[^a-zA-Z\\s]", "")).alias("text"))
  }

  private def tokenize(df: DataFrame): DataFrame = {
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    tokenizer.transform(df).select(col("id"), col("words"))
  }

  private def removeStopWords(df: DataFrame): DataFrame = {
    val remover = new StopWordsRemover().setInputCol("words").setOutputCol("words_clean")
    remover.transform(df).select(col("id"), col("words_clean"))
  }
}
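readFullDBFromJson is not shown in the question; presumably it is a thin JSON-reading wrapper along the lines of this hypothetical sketch:

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical sketch of the elided helper: load all JSON files under a path.
def readFullDBFromJson(path: String, spark: SparkSession): DataFrame =
  spark.read.json(path)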
Now, my runtime error is the following, and it happens when the write command executes:
[ERROR] Aborting task
org.apache.spark.SparkException: Failed to execute user defined function(Tokenizer$$Lambda$2873/1370139092: (string) => array<string>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:272)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:281)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
at org.apache.spark.ml.feature.Tokenizer.$anonfun$createTransformFunc$1(Tokenizer.scala:40)
... 15 more
[ERROR] Job job_20210508112210_0006 aborted.
[ERROR] Exception in task 3.0 in stage 6.0 (TID 62)
org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:291)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Failed to execute user defined function(Tokenizer$$Lambda$2873/1370139092: (string) => array<string>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:272)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:281)
... 9 more
Caused by: java.lang.NullPointerException
at org.apache.spark.ml.feature.Tokenizer.$anonfun$createTransformFunc$1(Tokenizer.scala:40)
... 15 more
[ WARN] Lost task 3.0 in stage 6.0 (TID 62, LAPTOP-H7MM9952, executor driver): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:291)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Failed to execute user defined function(Tokenizer$$Lambda$2873/1370139092: (string) => array<string>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:272)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:281)
... 9 more
Caused by: java.lang.NullPointerException
at org.apache.spark.ml.feature.Tokenizer.$anonfun$createTransformFunc$1(Tokenizer.scala:40)
... 15 more
[ERROR] Task 3 in stage 6.0 failed 1 times; aborting job
[ERROR] Aborting job f82df4a0-373c-4517-a1cf-fe70a2c7086c.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 6.0 failed 1 times, most recent failure: Lost task 3.0 in stage 6.0 (TID 62, LAPTOP-H7MM9952, executor driver): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:291)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Failed to execute user defined function(Tokenizer$$Lambda$2873/1370139092: (string) => array<string>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:272)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:281)
... 9 more
Caused by: java.lang.NullPointerException
at org.apache.spark.ml.feature.Tokenizer.$anonfun$createTransformFunc$1(Tokenizer.scala:40)
... 15 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:195)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:178)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:131)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:963)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:963)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:415)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:399)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:288)
at ServiceMain$.block_of_code(ServiceMain.scala:105)
at ServiceMain$.$anonfun$main$1(ServiceMain.scala:226)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at ServiceMain$.time(ServiceMain.scala:251)
at ServiceMain$.main(ServiceMain.scala:226)
at ServiceMain.main(ServiceMain.scala)
Caused by: org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:291)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Failed to execute user defined function(Tokenizer$$Lambda$2873/1370139092: (string) => array<string>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:272)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:281)
... 9 more
Caused by: java.lang.NullPointerException
at org.apache.spark.ml.feature.Tokenizer.$anonfun$createTransformFunc$1(Tokenizer.scala:40)
... 15 more
[ WARN] Failed to delete file or dir [C:\...\_temporary\0\_temporary\attempt_20210508112210_0006_m_000000_59\.part-00000-15278b9c-2615-425f-a831-9622cc1e3fc2-c000.json.crc]: it still exists.
[ERROR] Aborting task
org.apache.spark.TaskKilledException
at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:156)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:272)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:281)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[ WARN] Failed to delete file or dir [C:\...\output\wikidb\_temporary\0\_temporary\attempt_20210508112210_0006_m_000000_59\part-00000-15278b9c-2615-425f-a831-9622cc1e3fc2-c000.json]: it still exists.
[ WARN] Failed to delete file or dir [C:\...\output\wikidb\_temporary\0\_temporary\attempt_20210508112210_0006_m_000001_60\.part-00001-15278b9c-2615-425f-a831-9622cc1e3fc2-c000.json.crc]: it still exists.
[ WARN] Failed to delete file or dir [C:\...\output\wikidb\_temporary\0\_temporary\attempt_20210508112210_0006_m_000001_60\part-00001-15278b9c-2615-425f-a831-9622cc1e3fc2-c000.json]: it still exists.
[ WARN] Failed to delete file or dir [C:\...\output\wikidb\_temporary\0\_temporary\attempt_20210508112210_0006_m_000002_61\.part-00002-15278b9c-2615-425f-a831-9622cc1e3fc2-c000.json.crc]: it still exists.
[ WARN] Failed to delete file or dir [C:\...\output\wikidb\_temporary\0\_temporary\attempt_20210508112210_0006_m_000002_61\part-00002-15278b9c-2615-425f-a831-9622cc1e3fc2-c000.json]: it still exists.
[ERROR] Job job_20210508112210_0006 aborted.
[ERROR] Aborting task
org.apache.spark.TaskKilledException
...
[ERROR] Job job_20210508112210_0006 aborted.
[ERROR] Aborting task
org.apache.spark.TaskKilledException
...
Exception in thread "main" org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:226)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:178)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:131)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:963)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:963)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:415)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:399)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:288)
at ServiceMain$.block_of_code(ServiceMain.scala:105)
at ServiceMain$.$anonfun$main$1(ServiceMain.scala:226)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at ServiceMain$.time(ServiceMain.scala:251)
at ServiceMain$.main(ServiceMain.scala:226)
at ServiceMain.main(ServiceMain.scala)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 6.0 failed 1 times, most recent failure: Lost task 3.0 in stage 6.0 (TID 62, LAPTOP-H7MM9952, executor driver): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:291)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Failed to execute user defined function(Tokenizer$$Lambda$2873/1370139092: (string) => array<string>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:272)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:281)
... 9 more
Caused by: java.lang.NullPointerException
at org.apache.spark.ml.feature.Tokenizer.$anonfun$createTransformFunc$1(Tokenizer.scala:40)
... 15 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:195)
... 27 more
Caused by: org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:291)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Failed to execute user defined function(Tokenizer$$Lambda$2873/1370139092: (string) => array<string>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:272)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:281)
... 9 more
Caused by: java.lang.NullPointerException
at org.apache.spark.ml.feature.Tokenizer.$anonfun$createTransformFunc$1(Tokenizer.scala:40)
... 15 more
[ERROR] Job job_20210508112210_0006 aborted.
The created tokenizedPreprocessedDB is a two-column Spark dataframe: <id, array(string)>.
What I have already tried:
df.na.fill(...), to replace every null value, even though none was found.
I checked that every "WrappedArray[String]" in the column was actually a WrappedArray and not a String value.
I set the nullable property of the column to false.
I also noticed that running df.rdd.foreach(row => println(row)) fails with the same error. If I remove the repartition, the error is no longer thrown by the println statement, but the write statement still throws it.
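For concreteness, the null guard described above amounts to something like this sketch (the column name comes from the code above; the NPE at Tokenizer.scala:40 is consistent with a null string reaching the tokenizer's toLowerCase call):

import org.apache.spark.sql.functions.col

// Drop rows whose "text" column is null before tokenizing: Tokenizer's
// transform function calls toLowerCase on the raw string, which throws
// a NullPointerException on null input.
val guardedDF = cleanedDF.filter(col("text").isNotNull)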
Any suggestions?

Calling function inside RDD map function in Spark cluster

I was testing a simple string-parser function I defined in my code, but one of the worker nodes always fails at execution time. Here is the dummy code I've been testing:
/* JUST A SIMPLE PARSER TO CLEAN PARENTHESES */
def parseString(field: String): String = {
  val Pattern = "(.*.)".r
  field match {
    case "null" => "null"
    // NOTE: the match is not exhaustive; an empty string matches neither case.
    case Pattern(field) => field.replace('(', ' ').replace(')', ' ').replace(" ", "")
  }
}
/* CREATE TWO DISTRIBUTED RDDs TO JOIN THEM */
val emp = sc.parallelize(Seq((1,"jordan",10), (2,"ricky",20), (3,"matt",30), (4,"mince",35), (5,"rhonda",30)), 6)
val dept = sc.parallelize(Seq(("hadoop",10), ("spark",20), ("hive",30), ("sqoop",40)), 6)
val manipulated_emp = emp.keyBy(t => t._3)
val manipulated_dept = dept.keyBy(t => t._2)
val left_outer_join_data = manipulated_emp.leftOuterJoin(manipulated_dept)
/* OUTPUT */
left_outer_join_data.collect.foreach(println)
/*
(30,((3,matt,30),Some((hive,30))))
(30,((5,rhonda,30),Some((hive,30))))
(20,((2,ricky,20),Some((spark,20))))
(10,((1,jordan,10),Some((hadoop,10))))
(35,((4,mince,35),None))
*/
val res = left_outer_join_data
.map(f => (f._2._1._1, f._2._1._2, f._2._2.getOrElse("null").toString))
.collect
res
.map(f => ( f._1, f._2, parseString(f._3)))
.foreach(println)
/* DESIRED OUTPUT */
/*
(3,matt,hive,30)
(5,rhonda,hive,30)
(2,ricky,spark,20)
(1,jordan,hadoop,10)
(4,mince,null)
*/
This code works if I collect the results of res in the driver first. Since this is a test, there is no problem doing that, but my actual application would deal with millions of rows, and collecting results in the driver is discouraged. So if I do the same without collecting first, like this:
val res = left_outer_join_data
.map(f => (f._2._1._1, f._2._1._2, f._2._2.getOrElse("null").toString))
res
.map(f => ( f._1, f._2, parseString(f._3)))
.foreach(println)
I get the following:
ERROR TaskSetManager: Task 5 in stage 17.0 failed 4 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 17.0 failed 4 times, most recent failure: Lost task 5.3 in stage 17.0 (TID 166, 192.168.28.101, executor 1): java.lang.NoClassDefFoundError: Could not initialize class tele.com.SimcardMsisdn$
at tele.com.SimcardMsisdn$$anonfun$main$1.apply(SimcardMsisdn.scala:249)
at tele.com.SimcardMsisdn$$anonfun$main$1.apply(SimcardMsisdn.scala:249)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1938)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1965)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:918)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:916)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.foreach(RDD.scala:916)
at tele.com.SimcardMsisdn$.main(SimcardMsisdn.scala:249)
at tele.com.SimcardMsisdn.main(SimcardMsisdn.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:743)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.NoClassDefFoundError: Could not initialize class tele.com.SimcardMsisdn$
at tele.com.SimcardMsisdn$$anonfun$main$1.apply(SimcardMsisdn.scala:249)
at tele.com.SimcardMsisdn$$anonfun$main$1.apply(SimcardMsisdn.scala:249)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Why does Spark fail to execute my parser on the nodes? Could you recommend a solution or workaround?
UPDATE
I found the solution to this problem (posted below); nonetheless, I'm still confused about this issue. Maybe I'm doing something wrong.
Well, I've managed to solve it myself by broadcasting the Pattern variable to the workers:
val Pattern = sc.broadcast("(.*.)".r)
and doing the pattern matching within the map, not in a function, and without collecting to the driver:
val res = left_outer_join_data
  .map(f => (f._2._1._1, f._2._1._2, f._2._2.getOrElse("null").toString))
res.map { f =>
  val P = Pattern.value // bind the broadcast regex to a stable identifier usable as a pattern
  (f._1, f._2, f._3 match {
    case "null" => "null"
    case P(s)   => s.replace('(', ' ').replace(')', ' ').replace(" ", "")
  })
}.foreach(println)
Then I got the desired output from the worker stdout:
(3,matt,hive,30)
(5,rhonda,hive,30)
(2,ricky,spark,20)
(1,jordan,hadoop,10)
(4,mince,null)
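An alternative that avoids the broadcast entirely is to keep the parser in a small standalone object with no Spark state, so that its class initializer has nothing that can fail on the executors. A sketch (StringParsers is a hypothetical name, and the fall-through case is an addition that makes the match exhaustive):

// Hypothetical standalone helper: it references no SparkContext or other
// driver-side state, so initializing this object on an executor is safe.
object StringParsers extends Serializable {
  private val Pattern = "(.*.)".r

  def parseString(field: String): String = field match {
    case "null"         => "null"
    case Pattern(inner) => inner.replace('(', ' ').replace(')', ' ').replace(" ", "")
    case other          => other // avoid a MatchError on non-matching input
  }
}

With that in place, the original res.map(f => (f._1, f._2, StringParsers.parseString(f._3))) works on the executors without collecting to the driver first.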

Play 2.5.3 - Running Scalatest on console, getting java.lang.NoClassDefFoundError: Could not initialize class play.api.Play$

I get the error below when I run the unit tests from the sbt console twice or more. I do not get any error the first time. If I exit sbt (Ctrl+D) and go back into sbt, the tests again run without issue the first time. I'm using IntelliJ.
My test suites contain many test cases like the one below:

val timeOutSpan = 5
val intervalSpan = 500

implicit val defaultPatience =
  PatienceConfig(timeout = Span(timeOutSpan, Seconds), interval = Span(intervalSpan, Millis))

it should "get houses" in {
  val index: Future[Result] = destinations.index.apply(FakeRequest(GET, "/houses"))
  whenReady(index) { index =>
    index.header.status shouldBe 200
  }
}
I'm getting the error below:
java.lang.ExceptionInInitializerError
at play.api.http.HttpConfiguration$.current(HttpConfiguration.scala:149)
at play.api.mvc.GlobalStateHttpConfiguration$.httpConfiguration(Http.scala:22)
at play.api.mvc.Session$.path(Http.scala:713)
at play.api.i18n.DefaultMessagesApi.setLang(Messages.scala:502)
at play.api.i18n.I18nSupport$ResultWithLang.withLang(I18nSupport.scala:49)
at controllers.BaseController.decorate(BaseController.scala:38)
at modules.home.controllers.Home$$anonfun$index$1$$anonfun$apply$1$$anonfun$apply$2$$anonfun$apply$3$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$9$$anonfun$apply$10.apply(Home.scala:51)
at modules.home.controllers.Home$$anonfun$index$1$$anonfun$apply$1$$anonfun$apply$2$$anonfun$apply$3$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$9$$anonfun$apply$10.apply(Home.scala:40)
at scala.util.Success$$anonfun$map$1.apply(Try.scala:237)
at scala.util.Try$.apply(Try.scala:192)
at scala.util.Success.map(Try.scala:237)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.ClassCastException: org.apache.xerces.parsers.XIncludeAwareParserConfiguration cannot be cast to org.apache.xerces.xni.parser.XMLParserConfiguration
at org.apache.xerces.parsers.SAXParser.<init>(Unknown Source)
at org.apache.xerces.parsers.SAXParser.<init>(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.<init>(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl.<init>(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl.<init>(Unknown Source)
at org.apache.xerces.jaxp.SAXParserFactoryImpl.newSAXParserImpl(Unknown Source)
at org.apache.xerces.jaxp.SAXParserFactoryImpl.setFeature(Unknown Source)
at play.api.Play$.<init>(Play.scala:46)
at play.api.Play$.<clinit>(Play.scala)
... 19 more
Mixing OneAppPerSuite into my test suite solved the issue:

class ExampleSpec extends FlatSpec with OneAppPerSuite {
  // test cases go here
}

Details about OneAppPerSuite are here.
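For completeness, a minimal sketch of the resulting suite, assuming scalatestplus-play's OneAppPerSuite for Play 2.5, and with a hypothetical controller lookup standing in for however destinations is built in the original suite:

import scala.concurrent.Future

import org.scalatest.{FlatSpec, Matchers}
import org.scalatest.concurrent.ScalaFutures
import org.scalatestplus.play.OneAppPerSuite
import play.api.mvc.Result
import play.api.test.FakeRequest
import play.api.test.Helpers._

class ExampleSpec extends FlatSpec with Matchers with ScalaFutures with OneAppPerSuite {

  // Hypothetical: resolve the controller from the app provided by OneAppPerSuite.
  private lazy val destinations = app.injector.instanceOf[controllers.Destinations]

  "The index action" should "get houses" in {
    val index: Future[Result] = destinations.index.apply(FakeRequest(GET, "/houses"))
    whenReady(index) { result =>
      result.header.status shouldBe 200
    }
  }
}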

PlayFramework NoClassDefFoundError: jdk/nashorn/api/scripting/NashornScriptEngineFactory

Calling the NashornScriptEngineFactory yields a RuntimeException in a simple Play Framework application.
build.sbt
lazy val root = (project in file(".")).enablePlugins(PlayScala)
scalaVersion := "2.11.6"
scalacOptions += "-target:jvm-1.8"
Application.scala
import jdk.nashorn.api.scripting.NashornScriptEngineFactory
import play.api.mvc._
object Application extends Controller {

  def index = Action {
    val nashorn = new NashornScriptEngineFactory().getScriptEngine("-scripting")
    Ok(nashorn.eval("3;").toString)
  }
}
On a similar plain sbt project, however, it works:
Main.scala
import jdk.nashorn.api.scripting.NashornScriptEngineFactory
object Main extends App {
  val nashorn = new NashornScriptEngineFactory().getScriptEngine("-scripting")
  println(nashorn.eval("4;"))
}
Why is Play failing to load the required factory?
Stacktrace
[error] application -
! #6locb3604 - Internal server error, for (GET) [/] ->
play.api.Application$$anon$1: Execution exception[[RuntimeException: java.lang.NoClassDefFoundError: jdk/nashorn/api/scripting/NashornScriptEngineFactory]]
at play.api.Application$class.handleError(Application.scala:296) ~[play_2.11-2.3.8.jar:2.3.8]
at play.api.DefaultApplication.handleError(Application.scala:402) [play_2.11-2.3.8.jar:2.3.8]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$14$$anonfun$apply$1.applyOrElse(PlayDefaultUpstreamHandler.scala:205) [play_2.11-2.3.8.jar:2.3.8]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$14$$anonfun$apply$1.applyOrElse(PlayDefaultUpstreamHandler.scala:202) [play_2.11-2.3.8.jar:2.3.8]
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) [scala-library-2.11.6.jar:na]
Caused by: java.lang.RuntimeException: java.lang.NoClassDefFoundError: jdk/nashorn/api/scripting/NashornScriptEngineFactory
at play.api.mvc.ActionBuilder$$anon$1.apply(Action.scala:523) ~[play_2.11-2.3.8.jar:2.3.8]
at play.api.mvc.Action$$anonfun$apply$1$$anonfun$apply$4$$anonfun$apply$5.apply(Action.scala:130) ~[play_2.11-2.3.8.jar:2.3.8]
at play.api.mvc.Action$$anonfun$apply$1$$anonfun$apply$4$$anonfun$apply$5.apply(Action.scala:130) ~[play_2.11-2.3.8.jar:2.3.8]
at play.utils.Threads$.withContextClassLoader(Threads.scala:21) ~[play_2.11-2.3.8.jar:2.3.8]
at play.api.mvc.Action$$anonfun$apply$1$$anonfun$apply$4.apply(Action.scala:129) ~[play_2.11-2.3.8.jar:2.3.8]
Caused by: java.lang.NoClassDefFoundError: jdk/nashorn/api/scripting/NashornScriptEngineFactory
at controllers.Application$$anonfun$index$1.apply(Application.scala:10) ~[classes/:na]
at controllers.Application$$anonfun$index$1.apply(Application.scala:9) ~[classes/:na]
at play.api.mvc.ActionBuilder$$anonfun$apply$17.apply(Action.scala:464) ~[play_2.11-2.3.8.jar:2.3.8]
at play.api.mvc.ActionBuilder$$anonfun$apply$17.apply(Action.scala:464) ~[play_2.11-2.3.8.jar:2.3.8]
at play.api.mvc.ActionBuilder$$anonfun$apply$16.apply(Action.scala:433) ~[play_2.11-2.3.8.jar:2.3.8]
Caused by: java.lang.ClassNotFoundException: jdk.nashorn.api.scripting.NashornScriptEngineFactory
at java.net.URLClassLoader.findClass(URLClassLoader.java:381) ~[na:1.8.0_40]
at java.lang.ClassLoader.loadClass(ClassLoader.java:424) ~[na:1.8.0_40]
at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ~[na:1.8.0_40]
at controllers.Application$$anonfun$index$1.apply(Application.scala:10) ~[classes/:na]
at controllers.Application$$anonfun$index$1.apply(Application.scala:9) ~[classes/:na]
This is related to https://github.com/playframework/playframework/pull/3420 and works in Play 2.4.x!
addSbtPlugin("com.typesafe.play" % "sbt-plugin" % "2.4.0-M3")
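For reference, the addSbtPlugin line belongs in the project's plugins file:

// project/plugins.sbt
addSbtPlugin("com.typesafe.play" % "sbt-plugin" % "2.4.0-M3")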