Getting an error while writing RDD data to an Avro file - Scala

Below is my RDD:
// (movieID, movieName)
val title = movies.map(f => (f.toString().split("::")(0).replaceAll("\\[", "").trim(), f.toString().split("::")(1)))
println(title.toDebugString)
// (movieID, viewCount) for the ten most-viewed movies
val view = ratings.map(f => (f.toString().split("::")(1).trim(), 1)).reduceByKey(_ + _).sortBy(f => f._2, false).take(10).toSeq
val viewRDD = sc.parallelize(view)
// (movieName, viewCount)
val join = title.join(viewRDD).map(f => (f._2._1, f._2._2))
val dataRdd = join.map(row => (row._1, row._2))
I am trying to save dataRdd in Avro format, for which I am using the saveAsNewAPIHadoopFile method. Below is how I am saving the RDD:
dataRdd.saveAsNewAPIHadoopFile("E:\\ml-1m\\ml-1m\\movieAvro2",classOf[AvroKey[_]], classOf[AvroValue[_]], classOf[AvroKeyValueOutputFormat[_, _]], sc.hadoopConfiguration)
When I run the program I get the error below:
java.lang.IllegalStateException: Writer schema for output key was not set. Use AvroJob.setOutputKeySchema().
at org.apache.avro.hadoop.io.AvroDatumConverterFactory.create(AvroDatumConverterFactory.java:94)
at org.apache.avro.mapreduce.AvroKeyValueOutputFormat.getRecordWriter(AvroKeyValueOutputFormat.java:55)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1119)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
18/10/18 11:27:46 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 5, localhost, executor driver): java.lang.IllegalStateException: Writer schema for output key was not set. Use AvroJob.setOutputKeySchema().
at org.apache.avro.hadoop.io.AvroDatumConverterFactory.create(AvroDatumConverterFactory.java:94)
at org.apache.avro.mapreduce.AvroKeyValueOutputFormat.getRecordWriter(AvroKeyValueOutputFormat.java:55)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1119)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
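For reference, the exception points at AvroJob.setOutputKeySchema(). Below is a minimal sketch of declaring the writer schemas on a Hadoop Job and handing its configuration to saveAsNewAPIHadoopFile; the STRING/INT schemas and the AvroKey/AvroValue wrapping are assumptions based on the (movieName, viewCount) pairs in dataRdd, not tested code:

import org.apache.avro.Schema
import org.apache.avro.mapred.{AvroKey, AvroValue}
import org.apache.avro.mapreduce.{AvroJob, AvroKeyValueOutputFormat}
import org.apache.hadoop.mapreduce.Job

// Declare the writer schemas the Avro output format needs; the types are
// assumptions matching the (String, Int) pairs held by dataRdd.
val job = Job.getInstance(sc.hadoopConfiguration)
AvroJob.setOutputKeySchema(job, Schema.create(Schema.Type.STRING))
AvroJob.setOutputValueSchema(job, Schema.create(Schema.Type.INT))

// Wrap each pair in AvroKey/AvroValue and pass the job's configuration so the
// schemas set above actually reach AvroKeyValueOutputFormat.
dataRdd
  .map { case (name, count) => (new AvroKey[String](name), new AvroValue[Integer](count)) }
  .saveAsNewAPIHadoopFile(
    "E:\\ml-1m\\ml-1m\\movieAvro2",
    classOf[AvroKey[String]],
    classOf[AvroValue[Integer]],
    classOf[AvroKeyValueOutputFormat[String, Integer]],
    job.getConfiguration)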

Related

Scala Runtime Error : org.apache.spark.SparkException:Failed to execute user defined function(Tokenizer$$Lambda$../..: (string) => array<string>)

I have the following program, which runs in cluster mode on 4 EMR VMs:
val spark = ...          // a SparkSession
val file_path = ...      // input path taken from args
val wherePutOutput = file_path + "\\output\\"
val whereReadingDB = file_path + "\\inputJsonFiles"

// PREPROCESS DB and WRITE IT on S3
var toBePreprocessedDB = readFullDBFromJson(whereReadingDB, spark).repartition(5)

// PREPROCESS FULL DB
println("starting preprocessing")
val preProcessData = new DataPreprocessing(toBePreprocessedDB)
var tokenizedPreprocessedDB = preProcessData.preProcessDF()

// WRITE PREPROCESSED DBs TO A FILE
println("finished preprocessing and starting to write files")
tokenizedPreprocessedDB.write.mode(SaveMode.Overwrite).format("json").save(wherePutOutput + "foo/")
where the DataPreprocessing class is:
class DataPreprocessing(private var df: DataFrame) {

  def preProcessDF(): DataFrame = {
    val cleanedDF = removePuntuaction()
    val tokenizedDF = tokenize(cleanedDF)
    removeStopWords(tokenizedDF)
  }

  def setDF(newDF: DataFrame): Unit = {
    df = newDF
  }

  // return sentenceDF_clean
  private def removePuntuaction(): DataFrame = {
    df.select(col("id"), lower(regexp_replace(col("content"), "[^a-zA-Z\\s]", "")).alias("text"))
  }

  private def tokenize(df: DataFrame): DataFrame = {
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    tokenizer.transform(df).select(col("id"), col("words"))
  }

  private def removeStopWords(df: DataFrame): DataFrame = {
    val remover = new StopWordsRemover().setInputCol("words").setOutputCol("words_clean")
    remover.transform(df).select(col("id"), col("words_clean"))
  }
}
Now, my runtime error is the following; it happens during execution of the write command.
[ERROR] Aborting task
org.apache.spark.SparkException: Failed to execute user defined function(Tokenizer$$Lambda$2873/1370139092: (string) => array<string>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:272)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:281)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
at org.apache.spark.ml.feature.Tokenizer.$anonfun$createTransformFunc$1(Tokenizer.scala:40)
... 15 more
[ERROR] Job job_20210508112210_0006 aborted.
[ERROR] Exception in task 3.0 in stage 6.0 (TID 62)
org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:291)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Failed to execute user defined function(Tokenizer$$Lambda$2873/1370139092: (string) => array<string>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:272)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:281)
... 9 more
Caused by: java.lang.NullPointerException
at org.apache.spark.ml.feature.Tokenizer.$anonfun$createTransformFunc$1(Tokenizer.scala:40)
... 15 more
[ WARN] Lost task 3.0 in stage 6.0 (TID 62, LAPTOP-H7MM9952, executor driver): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:291)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Failed to execute user defined function(Tokenizer$$Lambda$2873/1370139092: (string) => array<string>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:272)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:281)
... 9 more
Caused by: java.lang.NullPointerException
at org.apache.spark.ml.feature.Tokenizer.$anonfun$createTransformFunc$1(Tokenizer.scala:40)
... 15 more
[ERROR] Task 3 in stage 6.0 failed 1 times; aborting job
[ERROR] Aborting job f82df4a0-373c-4517-a1cf-fe70a2c7086c.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 6.0 failed 1 times, most recent failure: Lost task 3.0 in stage 6.0 (TID 62, LAPTOP-H7MM9952, executor driver): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:291)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Failed to execute user defined function(Tokenizer$$Lambda$2873/1370139092: (string) => array<string>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:272)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:281)
... 9 more
Caused by: java.lang.NullPointerException
at org.apache.spark.ml.feature.Tokenizer.$anonfun$createTransformFunc$1(Tokenizer.scala:40)
... 15 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:195)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:178)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:131)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:963)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:963)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:415)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:399)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:288)
at ServiceMain$.block_of_code(ServiceMain.scala:105)
at ServiceMain$.$anonfun$main$1(ServiceMain.scala:226)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at ServiceMain$.time(ServiceMain.scala:251)
at ServiceMain$.main(ServiceMain.scala:226)
at ServiceMain.main(ServiceMain.scala)
Caused by: org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:291)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Failed to execute user defined function(Tokenizer$$Lambda$2873/1370139092: (string) => array<string>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:272)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:281)
... 9 more
Caused by: java.lang.NullPointerException
at org.apache.spark.ml.feature.Tokenizer.$anonfun$createTransformFunc$1(Tokenizer.scala:40)
... 15 more
[ WARN] Failed to delete file or dir [C:\...\_temporary\0\_temporary\attempt_20210508112210_0006_m_000000_59\.part-00000-15278b9c-2615-425f-a831-9622cc1e3fc2-c000.json.crc]: it still exists.
[ERROR] Aborting task
org.apache.spark.TaskKilledException
at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:156)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:272)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:281)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[ WARN] Failed to delete file or dir [C:\...\output\wikidb\_temporary\0\_temporary\attempt_20210508112210_0006_m_000000_59\part-00000-15278b9c-2615-425f-a831-9622cc1e3fc2-c000.json]: it still exists.
[ WARN] Failed to delete file or dir [C:\...\output\wikidb\_temporary\0\_temporary\attempt_20210508112210_0006_m_000001_60\.part-00001-15278b9c-2615-425f-a831-9622cc1e3fc2-c000.json.crc]: it still exists.
[ WARN] Failed to delete file or dir [C:\...\output\wikidb\_temporary\0\_temporary\attempt_20210508112210_0006_m_000001_60\part-00001-15278b9c-2615-425f-a831-9622cc1e3fc2-c000.json]: it still exists.
[ WARN] Failed to delete file or dir [C:\...\output\wikidb\_temporary\0\_temporary\attempt_20210508112210_0006_m_000002_61\.part-00002-15278b9c-2615-425f-a831-9622cc1e3fc2-c000.json.crc]: it still exists.
[ WARN] Failed to delete file or dir [C:\...\output\wikidb\_temporary\0\_temporary\attempt_20210508112210_0006_m_000002_61\part-00002-15278b9c-2615-425f-a831-9622cc1e3fc2-c000.json]: it still exists.
[ERROR] Job job_20210508112210_0006 aborted.
[ERROR] Aborting task
org.apache.spark.TaskKilledException
...
[ERROR] Job job_20210508112210_0006 aborted.
[ERROR] Aborting task
org.apache.spark.TaskKilledException
...
Exception in thread "main" org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:226)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:178)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:131)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:963)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:963)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:415)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:399)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:288)
at ServiceMain$.block_of_code(ServiceMain.scala:105)
at ServiceMain$.$anonfun$main$1(ServiceMain.scala:226)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at ServiceMain$.time(ServiceMain.scala:251)
at ServiceMain$.main(ServiceMain.scala:226)
at ServiceMain.main(ServiceMain.scala)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 6.0 failed 1 times, most recent failure: Lost task 3.0 in stage 6.0 (TID 62, LAPTOP-H7MM9952, executor driver): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:291)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Failed to execute user defined function(Tokenizer$$Lambda$2873/1370139092: (string) => array<string>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:272)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:281)
... 9 more
Caused by: java.lang.NullPointerException
at org.apache.spark.ml.feature.Tokenizer.$anonfun$createTransformFunc$1(Tokenizer.scala:40)
... 15 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:195)
... 27 more
Caused by: org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:291)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Failed to execute user defined function(Tokenizer$$Lambda$2873/1370139092: (string) => array<string>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:272)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:281)
... 9 more
Caused by: java.lang.NullPointerException
at org.apache.spark.ml.feature.Tokenizer.$anonfun$createTransformFunc$1(Tokenizer.scala:40)
... 15 more
[ERROR] Job job_20210508112210_0006 aborted.
The resulting tokenizedPreprocessedDB is a two-column Spark DataFrame: <id, array<string>>.
What I have already tried:
df.na.fill(...), to replace every null value, even though none were found.
I checked that every "WrappedArray[String]" in the column was actually a WrappedArray and not a String value.
I set the nullable property of the column to false.
I also noticed that running "df.rdd.foreach(row => println(row))" fails with the same error. If I remove the repartition, the error is no longer thrown by the println statement, but the write statement still throws it.
Any suggestions?
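One thing worth noting: Tokenizer's transform function throws a NullPointerException when its input string is null, and regexp_replace propagates nulls from content. Below is a hypothetical variant of removePuntuaction that rules this cause out; the na.drop call is an addition for illustration, not part of the original class:

private def removePuntuaction(): DataFrame = {
  df.select(col("id"),
      lower(regexp_replace(col("content"), "[^a-zA-Z\\s]", "")).alias("text"))
    .na.drop(Seq("text")) // drop rows whose cleaned text is null (e.g. null content)
}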

Spark SQL execution in Scala

I have the following data (alldata), which has a SQL query and a view name.
Select_Query|viewname
select v1,v2 from conditions|cond
select w1,w2 from locations|loca
I have split it and registered it as the temp table (alldata).
val Select_Querydf = spark.sql("select Select_Query,ViewName from alldata")
While I try to execute each query and register a temp view or table from it, it throws a NullPointerException. But the println shows all the values in the table correctly when I comment out the spark.sql statement.
Select_Querydf.foreach { row =>
  val Selectstmt = row(0).toString()
  val viewname = row(1).toString()
  println(Selectstmt + "-->" + viewname)
  spark.sql(Selectstmt).registerTempTable(viewname) //.createOrReplaceTempView(viewname)
}
output:
select v1,v2 from conditions-->cond
select w1,w2 from locations-->loca
But when I execute it with spark.sql, it shows the following error. Please help me figure out where I am going wrong.
19/12/09 02:43:12 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4)
java.lang.NullPointerException
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:128)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:126)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
at sparkscalacode1.SQLQueryexecutewithheader$$anonfun$main$1.apply(SQLQueryexecutewithheader.scala:36)
at sparkscalacode1.SQLQueryexecutewithheader$$anonfun$main$1.apply(SQLQueryexecutewithheader.scala:32)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
19/12/09 02:43:12 ERROR TaskSetManager: Task 0 in stage 4.0 failed 1 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 4, localhost, executor driver): java.lang.NullPointerException
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:128)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:126)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
at sparkscalacode1.SQLQueryexecutewithheader$$anonfun$main$1.apply(SQLQueryexecutewithheader.scala:36)
at sparkscalacode1.SQLQueryexecutewithheader$$anonfun$main$1.apply(SQLQueryexecutewithheader.scala:32)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Here spark.sql, which belongs to the SparkSession, cannot be used inside the foreach of a DataFrame. The SparkSession is created on the driver, while foreach runs on the workers, and the session is not serialized to them.
I hope you have a small list for Select_Querydf; if so, you can collect it to the driver and use it as below.
Select_Querydf.collect().foreach { row =>
  val Selectstmt = row.getString(0)
  val viewname = row.getString(1)
  println(Selectstmt + "-->" + viewname)
  spark.sql(Selectstmt).createOrReplaceTempView(viewname)
}
Hope this helps!

RDD: verify date format and remove a row if the date format is incorrect - Scala, Spark

Sample rddDate: [2016-08-01,"pm",5,"ri"]
Some rows in this RDD have an incorrectly formatted date, so I can't count the rows in the RDD; the count throws an IndexOutOfBoundsException.
The date type used is java.sql.Date.
The expected date format for every row in the RDD is "yyyy-mm-dd":
2016-08-01
To verify the date format in the RDD, the code below is implemented:
val rddVerified: RDD[(Date, String, Long, String)] = rddDate.map { a =>
  val fmt = DateTimeFormat forPattern "yyyy-mm-dd"
  val input = a._1.toString
  try {
    val output = fmt parseDateTime input
  } catch {
    case e: Exception =>
      val v1 = new java.util.Date("2016-08-01")
      val v2 = new Date(a1.getTime)
      val ed: (Date, String, Int, String) = (v2, "p1", 2, "r1")
      Some(ed) // This gives compile time error
  } finally {
    Some(a._1, a._2, a._3, a._4)
  }
}
I am not able to handle the exception in the catch section. I want to either remove that row from the RDD or correct the date format in that row.
I want to get returned RDD in this format:
RDD[(Date, String, Long, String)]
Thanks.
UPDATE
Exception when counting the DataFrame:
COUNT : :
[error] o.a.s.e.Executor - Exception in task 0.0 in stage 7.0 (TID 7)
java.lang.IndexOutOfBoundsException: 1
at scala.collection.LinearSeqOptimized$class.apply(LinearSeqOptimized.scala:65)
at scala.collection.immutable.List.apply(List.scala:84)
at controllers.Spark$$anonfun$5.apply(Spark.scala:78)
at controllers.Spark$$anonfun$5.apply(Spark.scala:78)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:464)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
[warn] o.a.s.s.TaskSetManager - Lost task 0.0 in stage 7.0 (TID 7, localhost, executor driver): java.lang.IndexOutOfBoundsException: 1
at scala.collection.LinearSeqOptimized$class.apply(LinearSeqOptimized.scala:65)
at scala.collection.immutable.List.apply(List.scala:84)
at controllers.Spark$$anonfun$5.apply(Spark.scala:78)
at controllers.Spark$$anonfun$5.apply(Spark.scala:78)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:464)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
[error] o.a.s.s.TaskSetManager - Task 0 in stage 7.0 failed 1 times; aborting job
[warn] o.a.s.s.BlockManager - Putting block rdd_1_1 failed due to an exception
[warn] o.a.s.s.BlockManager - Block rdd_1_1 could not be removed as it was not found on disk or in memory
[warn] o.a.s.s.BlockManager - Putting block rdd_1_2 failed due to an exception
[warn] o.a.s.s.BlockManager - Block rdd_1_2 could not be removed as it was not found on disk or in memory
[warn] o.a.s.s.TaskSetManager - Lost task 1.0 in stage 7.0 (TID 8, localhost, executor driver): TaskKilled (unknown reason)
[warn] o.a.s.s.TaskSetManager - Lost task 2.0 in stage 7.0 (TID 9, localhost, executor driver): TaskKilled (unknown reason)
[error] application -
stage failure: Task 0 in stage 7.0 failed 1 times, most recent failure: Lost task 0.0 in stage 7.0 (TID 7, localhost, executor driver): java.lang.IndexOutOfBoundsException: 1
at scala.collection.LinearSeqOptimized$class.apply(LinearSeqOptimized.scala:65)
at scala.collection.immutable.List.apply(List.scala:84)
at controllers.Spark$$anonfun$5.apply(Spark.scala:78)
at controllers.Spark$$anonfun$5.apply(Spark.scala:78)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:464)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:]]
at play.api.http.HttpErrorHandlerExceptions$.throwableToUsefulException(HttpErrorHandler.scala:293)
at play.api.http.DefaultHttpErrorHandler.onServerError(HttpErrorHandler.scala:220)
at play.api.GlobalSettings$class.onError(GlobalSettings.scala:160)
at play.api.DefaultGlobal$.onError(GlobalSettings.scala:188)
at play.api.http.GlobalSettingsHttpErrorHandler.onServerError(HttpErrorHandler.scala:100)
at play.core.server.netty.PlayRequestHandler$$anonfun$2$$anonfun$apply$1.applyOrElse(PlayRequestHandler.scala:100)
at play.core.server.netty.PlayRequestHandler$$anonfun$2$$anonfun$apply$1.applyOrElse(PlayRequestHandler.scala:99)
at scala.concurrent.Future$$anonfun$recoverWith$1.apply(Future.scala:346)
at scala.concurrent.Future$$anonfun$recoverWith$1.apply(Future.scala:345)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 1 times, most recent failure: Lost task 0.0 in stage 7.0 (TID 7, localhost, executor driver): java.lang.IndexOutOfBoundsException: 1
at scala.collection.LinearSeqOptimized$class.apply(LinearSeqOptimized.scala:65)
at scala.collection.immutable.List.apply(List.scala:84)
at controllers.Spark$$anonfun$5.apply(Spark.scala:78)
at controllers.Spark$$anonfun$5.apply(Spark.scala:78)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:464)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
Caused by: java.lang.IndexOutOfBoundsException: 1
at scala.collection.LinearSeqOptimized$class.apply(LinearSeqOptimized.scala:65)
at scala.collection.immutable.List.apply(List.scala:84)
at controllers.Spark$$anonfun$5.apply(Spark.scala:78)
at controllers.Spark$$anonfun$5.apply(Spark.scala:78)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:464)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
I would recommend using the Dataset API. It is not only faster and simpler, but also more future-friendly:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.to_date

val spark: SparkSession = ???
import spark.implicits._

rddDate.toDF
  .withColumn("_1", to_date($"_1"))
  .na.drop(Seq("_1"))
  .as[(java.sql.Date, String, Long, String)]
Edit
But the problem is somewhere else in your code.
Caused by: java.lang.IndexOutOfBoundsException: 1
suggests that you made a mistake somewhere, probably in the parsing logic. You have to step back to the place where you call apply and add exception handling there.
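For example, a defensive version of that parsing step could look like the sketch below; the raw input rddLines, the comma delimiter, and the field positions are hypothetical, since the original parsing code is not shown:

import java.sql.Date
import org.apache.spark.rdd.RDD
import scala.util.Try

// Rows that are too short or carry a malformed date yield None and are dropped
// by flatMap, so a later count cannot hit an IndexOutOfBoundsException.
val rddDate: RDD[(Date, String, Long, String)] = rddLines.flatMap { line =>
  val fields = line.split(",").toList
  Try((Date.valueOf(fields(0)), fields(1), fields(2).toLong, fields(3))).toOption
}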

Program running on Spark 2.1.1 but not on 2.0.2

I have a very simple clustering program I developed on IntelliJ IDEA with Spark 2.1.1. However, when I launch the .jar with Spark 2.0.2 on my cluster, it gives the following error:
17/09/25 14:23:11 ERROR Executor: Exception in task 2.0 in stage 3.0 (TID 7)
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$2: (vector) => vector)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:106)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:98)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.IllegalArgumentException: Do not support vector type class org.apache.spark.mllib.linalg.SparseVector
at org.apache.spark.mllib.feature.StandardScalerModel.transform(StandardScaler.scala:160)
at org.apache.spark.ml.feature.StandardScalerModel$$anonfun$2.apply(StandardScaler.scala:167)
at org.apache.spark.ml.feature.StandardScalerModel$$anonfun$2.apply(StandardScaler.scala:167)
... 37 more
Here is my code:
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DoubleType

def main(args: Array[String]): Unit = {
  val spark = SparkSession.builder.config("spark.eventLog.enabled", "true").config("spark.eventLog.dir", "").appName("S1").getOrCreate()
  val df = spark.read.format("csv").option("header", true).csv("petitexport.csv")
  var dff = df.drop("numeroCarte")
  dff.cache()
  // cast every remaining column to Double
  for (field <- dff.schema.fields) {
    dff = dff.withColumn(field.name, dff(field.name).cast(DoubleType))
  }
  val featureCols = Array("NB de trx", "NB de trx RD", "Somme RD", "Somme refus", "NB Pays visite", "NB trx nocturnes")
  val assembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
  val dff2 = assembler.transform(dff)
  val scaler = new StandardScaler().setWithStd(true).setWithMean(true).setInputCol("features").setOutputCol("scaledFeatures")
  val scalerModel = scaler.fit(dff2)
  val scaledData2 = scalerModel.transform(dff2)
  scaledData2.cache
  val kmeans = new KMeans().setK(5).setMaxIter(10).setTol(0.001).setSeed(200).setFeaturesCol("scaledFeatures")
  val model = kmeans.fit(scaledData2)
  val predictions = model.transform(scaledData2)
  predictions.show
}
Is it possible to fix this so it works on Spark 2.0.2? I understand it is about SparseVector, but I don't really see a solution.
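One guess, stated as an assumption rather than a verified fix: the jar is compiled against Spark 2.1.1 but executed on a 2.0.2 cluster, and such a version mismatch alone can surface errors like this one. A minimal build.sbt sketch that pins the build to the cluster's version:

// Hypothetical build.sbt excerpt: compile against the Spark version the
// cluster actually runs (2.0.2 here) and mark it "provided" so the cluster's
// jars are used at runtime.
val sparkVersion = "2.0.2"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"   % sparkVersion % "provided",
  "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided"
)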

Use a method inside a UDF - Spark, Scala

I want to use a method located in another class inside a user-defined function (UDF), but it is not working.
I have a method:
def traitementDataFrameEleve(sc: SparkSession, dfRedis: DataFrame, domainMail: String, dir: String): Boolean = {
  def loginUDF = udf((sn: String, givenName: String) => {
    LoginClass.GenerateloginPersone(sn, givenName, dfr)
  })
  dfEleve.withColumn("ENTPersonLogin", loginUDF(dfEleve("sn"), dfEleve("givenName")))
}
LoginClass is a class that contains the GenerateloginPersone method.
Output error:
org.apache.spark.SparkException: Failed to execute user defined function(anonfun$loginUDF$1$1: (string, string) => string)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.NullPointerException
at org.apache.spark.sql.Dataset.schema(Dataset.scala:410)
at org.apache.spark.sql.Dataset.printSchema(Dataset.scala:419)
at IntegrationDonneesENTLea_V1_AcBordeaux.LoginClass$.GenerateloginPersone(LoginClass.scala:16)
at IntegrationDonneesENTLea_V1_AcBordeaux.Eleve$$anonfun$loginUDF$1$1.apply(Eleve.scala:25)
at IntegrationDonneesENTLea_V1_AcBordeaux.Eleve$$anonfun$loginUDF$1$1.apply(Eleve.scala:23)
... 16 more
Thank you.
It is not allowed to access:
distributed data structures (like Dataset or RDD)
SparkContext / SparkSession
from a Spark task (a transformation or a UDF application). This is why you get an NPE.
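A sketch of the usual workaround, under the assumption that GenerateloginPersone can be adapted to take a plain driver-side collection (and return a String) instead of working on a DataFrame; the column name "login" is hypothetical:

import org.apache.spark.sql.functions.udf

// Collect on the driver whatever the login logic needs from the DataFrame,
// broadcast it, and let the UDF close over the broadcast value instead of
// over a Dataset or the SparkSession.
val existingLogins: Set[String] =
  dfRedis.select("login").collect().map(_.getString(0)).toSet
val existingLoginsBc = sc.sparkContext.broadcast(existingLogins) // sc is the SparkSession parameter

def loginUDF = udf((sn: String, givenName: String) =>
  LoginClass.GenerateloginPersone(sn, givenName, existingLoginsBc.value))

dfEleve.withColumn("ENTPersonLogin", loginUDF(dfEleve("sn"), dfEleve("givenName")))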