Spark job keeps failing after running the same map 3 times - Scala

My job has a step where I convert a DataFrame to an RDD[(key, value)], but the step runs three times, gets stuck on the third run, and fails.
The Spark UI shows:
Active Jobs (1)
Job Id (Job Group) | Description | Submitted | Duration | Stages: Succeeded/Total | Tasks (for all stages): Succeeded/Total
3 (zeppelin-20161017-005442_839671900) | Zeppelin map at <console>:69 | 2016/10/25 05:50:02 | 1.6 min | 0/1 | 210/45623
Completed Jobs (2)
Job Id (Job Group) | Description | Submitted | Duration | Stages: Succeeded/Total | Tasks (for all stages): Succeeded/Total
2 (zeppelin-20161017-005442_839671900) | Zeppelin map at <console>:69 | 2016/10/25 05:16:28 | 23 min | 1/1 | 46742/46075 (21 failed)
1 (zeppelin-20161017-005442_839671900) | Zeppelin map at <console>:69 | 2016/10/25 04:47:58 | 17 min | 1/1 | 47369/46795 (20 failed)
This is the code:
val eventsRDD = {
r =>
val customerId = r.getAs[String]("customerId")
val itemId = r.getAs[String]("itemId")
val countryId = r.getAs[Long]("countryId").toInt
val timeStamp = r.getAs[String]("eventTimestamp")
val totalRent = r.getAs[Int]("totalRent")
val totalPurchase = r.getAs[Int]("totalPurchase")
val totalProfit = r.getAs[Int]("totalProfit")
val store = r.getAs[String]("store")
val itemName = r.getAs[String]("itemName")
val itemName = if (itemName.size > 0 && itemName.nonEmpty && itemName != null ) itemName else "NA"
(itemId, (customerId, countryId, timeStamp, totalRent, totalProfit, totalPurchase, store,itemName ))
Can someone tell me what is wrong here? If I want to persist/cache the RDD, which one should I use?
Error:
16/10/25 23:28:55 INFO YarnClientSchedulerBackend: Asked to remove non-existent executor 181
16/10/25 23:28:55 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1477415847345_0005_02_031011 on host: ip-172-31-14-104.ec2.internal. Exit status: 52. Diagnostics: Exception from container-launch.
Container id: container_1477415847345_0005_02_031011
Exit code: 52
Stack trace: ExitCodeException exitCode=52:
at org.apache.hadoop.util.Shell.runCommand(
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$

Your map operation is hitting an error that propagates to the driver, which results in task failure.
By default, spark.task.maxFailures is 4. The documentation describes it as:
Number of failures of any particular task before giving up on the job.
The total number of failures spread across different tasks will not
cause the job to fail; a particular task has to fail this number of
attempts. Should be greater than or equal to 1. Number of allowed
retries = this value - 1.
So when a task fails, Spark retries the map operation for that task until it has failed 4 times in total, and only then gives up on the job.
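To make the retry rule concrete, here is a small plain-Scala simulation of "a task is attempted until it has failed maxFailures times". It is only a sketch and not Spark's actual scheduler code; the names TaskRetrySketch and runWithMaxFailures are made up for illustration.

```scala
import scala.util.{Failure, Success, Try}

// Illustrative only: a toy model of the retry rule, not Spark's scheduler.
object TaskRetrySketch {

  // Runs `task` until it succeeds or has failed `maxFailures` times.
  // `task` receives the 1-based attempt number.
  // Returns the result (if any attempt succeeded) and the number of attempts made.
  def runWithMaxFailures[A](maxFailures: Int)(task: Int => A): (Option[A], Int) = {
    require(maxFailures >= 1, "maxFailures should be greater than or equal to 1")
    var attempts = 0
    var result: Option[A] = None
    while (result.isEmpty && attempts < maxFailures) {
      attempts += 1
      Try(task(attempts)) match {
        case Success(value) => result = Some(value)
        case Failure(_)     => () // this attempt failed; loop again if attempts remain
      }
    }
    (result, attempts)
  }

  def main(args: Array[String]): Unit = {
    // A task that always throws uses up every allowed attempt.
    val alwaysFails = runWithMaxFailures[Int](4)(_ => throw new RuntimeException("boom"))
    println(s"always failing task: $alwaysFails")

    // A task that succeeds on its third attempt stops retrying there.
    val thirdTimeLucky = runWithMaxFailures[Int](4) { attempt =>
      if (attempt < 3) throw new RuntimeException("boom") else attempt
    }
    println(s"task that succeeds on attempt 3: $thirdTimeLucky")
  }
}
```

With maxFailures set to 4, the always-failing task is attempted 4 times (the first try plus 3 retries), which matches "Number of allowed retries = this value - 1".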
If I want to persist/cache, which one should I use?
cache is just a specific persist operation: it persists the RDD with the default storage level (MEMORY_ONLY).


Apache Spark: count filtered words from textFileStream

Hi, I am new to Scala and I am using IntelliJ IDEA. I am trying to stream a text file in a Scala application running on a Hadoop/Spark cluster. My main goal is to count only words (without any special characters) in a key-value format.
I found the article "apache-spark regex extract words from rdd", where they use the findAllIn() function with a regex, but I am not sure I am using it correctly in my code.
When I build my project, generate the jar file, and run it in Spark, I manually provide the text file. It seems to run and count words, but it also seems to enter a loop, as it processes the file endlessly. I thought it should process it only once.
Can someone tell me why this may be happening? Or is there a better way to achieve my goal?
Part of my text is:
It wlll only take a day,' he said. The others disagreed.
It's too fragile," 289 they said disapprovingly 23 age, but he refused to listen. Not
quite so lazy, the second little pig went in search of planks of seasoned 12 hola, 1256 23.
My code is:
package streaming
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 1) {
      System.err.println("Usage: HdfsWordCount <directory>")
      System.exit(1)
    }
    val sparkConf = new SparkConf().setAppName("HdfsWordCount").setMaster("local")
    // Create the context
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    // Create the FileInputDStream on the directory and use the
    // stream to count words in new files created
    val lines = ssc.textFileStream(args(0))
    // Keep only runs of letters, dropping digits and punctuation
    val sep_words = lines.flatMap("[a-zA-Z]+".r.findAllIn(_))
    val wordCounts = sep_words.map(x => (x, 1)).reduceByKey(_ + _)
    // Print each batch's counts and start the streaming job
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
My expected output is something like below but as mentioned before it keeps repeating:
22/10/24 05:11:40 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
22/10/24 05:11:40 INFO Executor: Finished task 0.0 in stage 27.0 (TID 20). 1529 bytes result sent to driver
22/10/24 05:11:40 INFO TaskSetManager: Finished task 0.0 in stage 27.0 (TID 20) in 18 ms on localhost (executor driver) (1/1)
22/10/24 05:11:40 INFO TaskSchedulerImpl: Removed TaskSet 27.0, whose tasks have all completed, from pool
22/10/24 05:11:40 INFO DAGScheduler: ResultStage 27 (print at NetworkWordCount.scala:28) finished in 0.031 s
22/10/24 05:11:40 INFO DAGScheduler: Job 13 finished: print at NetworkWordCount.scala:28, took 0.036092 s
Time: 1666588300000 ms
22/10/24 05:11:40 INFO JobScheduler: Finished job streaming job 1666588300000 ms.1 from job set of time 1666588300000 ms
22/10/24 05:11:40 INFO JobScheduler: Total delay: 0.289 s for time 1666588300000 ms (execution: 0.222 s)
22/10/24 05:11:40 INFO ShuffledRDD: Removing RDD 40 from persistence list
22/10/24 05:11:40 INFO BlockManager: Removing RDD 40
22/10/24 05:11:40 INFO MapPartitionsRDD: Removing RDD 39 from persistence list
22/10/24 05:11:40 INFO BlockManager: Removing RDD 39
Additionally, I noticed that when I use only the split() function, without any regex or findAllIn(), it processes the file only once as expected, but of course it only splits the text on spaces.
Something like this:
val lines = ssc.textFileStream(args(0))
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
I almost forgot: if you could also explain the meaning of the underscores in the code, it would help me a lot. I am having some trouble understanding that part too.
Thanks in advance.
If your file doesn't change while you're parsing it, you don't need a StreamingContext.
Just create a SparkContext:
val sc = new SparkContext(new SparkConf().setAppName("HdfsWordCount").setMaster("local"))
and then process your file in the way you need.
You need a StreamingContext only if you have data sources that keep changing and you need to process the changes continuously.
There are many different ways to use the underscore in Scala, but in your case the underscore is just syntactic sugar that shortens a lambda function. It stands for the lambda's argument, here an element of the collection.
You can rewrite the code that uses underscores like this:
// both lines do the same
.flatMap("[a-zA-Z]+".r.findAllIn(_))
.flatMap((s: String) => "[a-zA-Z]+".r.findAllIn(s))
// both lines do the same
.map(x => (x, 1)).reduceByKey(_ + _)
.map(x => (x, 1)).reduceByKey((l, r) => l + r)
For a better understanding of this case and other usages of the underscore, read some articles on the topic.
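The equivalence can also be tried without Spark. The following is a minimal sketch on plain Scala collections; the object and method names are made up for illustration, and since local collections have no reduceByKey, a groupBy followed by reduce stands in for it here.

```scala
// Illustrative only: plain Scala collections, no Spark involved.
object UnderscoreSketch {
  private val wordPattern = "[a-zA-Z]+".r

  // Placeholder syntax: `_` stands for the lambda's single argument.
  def wordsWithUnderscore(lines: Seq[String]): Seq[String] =
    lines.flatMap(wordPattern.findAllIn(_))

  // The same function with the argument written out.
  def wordsExplicit(lines: Seq[String]): Seq[String] =
    lines.flatMap((s: String) => wordPattern.findAllIn(s))

  // `_ + _` uses two placeholders: the first and the second argument, in order.
  def countWithUnderscore(words: Seq[String]): Map[String, Int] =
    words.map(x => (x, 1)).groupBy(_._1).map {
      case (word, pairs) => (word, pairs.map(_._2).reduce(_ + _))
    }

  // The same count with every lambda argument named.
  def countExplicit(words: Seq[String]): Map[String, Int] =
    words.map(x => (x, 1)).groupBy(pair => pair._1).map {
      case (word, pairs) => (word, pairs.map(pair => pair._2).reduce((l, r) => l + r))
    }

  def main(args: Array[String]): Unit = {
    val lines = Seq("It's too fragile, 289 they said", "they said 23")
    val words = wordsWithUnderscore(lines)
    println(words)
    println(countWithUnderscore(words))
    println(countWithUnderscore(words) == countExplicit(wordsExplicit(lines)))
  }
}
```

Note that the regex keeps only runs of letters, so "It's" is split into "It" and "s" and the numbers are dropped.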
Well, I think I found my main error. I was telling Spark to monitor the same directory where I was creating my output file. So my guess is that it was triggering a new event for itself every time it saved the processed output. Once I created separate input and output directories, the repeated output stopped and everything worked as expected. It now generates new output only when I manually provide a new file to the server.

Spark getting slow as time goes on

I have Scala code running on Spark 2.4.0 that computes the BFS of a graph, which is stored in a table as below:
At some point in the algorithm, I need to update the visited flag of the current vertex's neighbours. I do this with the following code. The code works fine, but it becomes slower and slower as time goes on. The last nested loop seems to be the problem:
//var vertices = schema(StructType(Seq(StructField("id",IntegerType),StructField("visited", IntegerType),StructField("qOrder",IntegerType))
val neighbours = edges.filter($"src" === start).join(vertices,$"id" === $"dst").filter($"visited" === lit(0))
.select($"dst".as("id")).withColumn("visited", lit(1)).withColumn("qOrder", lit(priorityCounter)).cache()
vertices = vertices.filter($"id" =!= x(0)).union(neighbours.filter(col("id")===x(0))).cache()
When it becomes slow, it starts giving the following errors and warnings:
2021-06-08 19:55:08,998 [driver-heartbeater] WARN org.apache.spark.executor.Executor - Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: GC overhead limit exceeded
Does anyone have any idea about the problem?
I have set the spark parameters as follows:
spark.scheduler.listenerbus.eventqueue.capacity 100000000
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.executorIdleTimeout 2m
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.maxExecutors 10000
spark.max.fetch.failures.per.stage 10 64
spark.rpc.askTimeout 600s
spark.driver.memory 32g
spark.executor.memory 32g

Scala-Spark read and manipulate data from mongoDB: cursor not found exception

I have 3 JSON files stored in MongoDB and I want to manipulate them to obtain a particular DataFrame.
val readConfigUser = ReadConfig(Map("uri" -> "mongodb://<IP>:<port>/db.collection1"))
val userDF = MongoSpark
.select($"user_id", $"review_count", $"friends", $"fans")
val readConfigBusiness = ReadConfig(Map("uri" -> "mongodb://<IP>:<port>/yelpdb.businessCollection"))
val businessDF = MongoSpark
.select($"business_id", $"categories", $"review_count", $"stars")
val readConfigReview = ReadConfig(Map("uri" -> "mongodb://<IP>:<port>/yelpdb.reviewsCollection"))
val reviewDF = MongoSpark
.select($"review_id", $"user_id", $"cool", $"stars", $"business_id", $"useful", $"funny")
After many manipulations I want to find the max value in the dataframe performing this snippet of code:
val max_influence = final_tempDF.agg(max("influence") as ("max_influence")).first.getAs[Double](0)
val finalDF = final_tempDF
.select($"user_id", $"business_id", $"categories_reviewed", $"list_users_with_same_reviews_business", ($"influence"/max_influence) as ("normalized_influence"))
But execution fails at this point with this exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 11.0 failed 1 times, most recent failure: Lost task 2.0 in stage 11.0 (TID 32, localhost, executor driver): com.mongodb.MongoCursorNotFoundException: Query failed with error code -5 and error message 'Cursor 50354548740 not found on server <IP>:<port>' on server <IP>:<port>
This is reported in the MongoDB log file:
2017-09-22T15:54:08.565+0200 I COMMAND [conn36] killcursors: found 0 of 1
2017-09-22T15:54:09.184+0200 I - [conn36] end connection (14 connections now open)
2017-09-22T15:56:35.642+0200 I - [conn44] end connection (13 connections now open)
2017-09-22T15:56:35.642+0200 I - [conn47] end connection (12 connections now open)
2017-09-22T15:56:35.642+0200 I - [conn48] end connection (11 connections now open)
2017-09-22T15:56:35.643+0200 I - [conn46] end connection (10 connections now open)
2017-09-22T15:56:35.643+0200 I - [conn45] end connection (9 connections now open)
What is the problem? How can I fix it? I'm using Databricks Community Edition (I'm a student) and the MongoDB version is 3.4.9.

saveAsTextFile hangs in spark Connection reset by peer in Data frame

I am running a Spark application that does a simple diff between two DataFrames.
I execute it as a jar file in my cluster environment.
My cluster has 94 nodes.
There are two data sets, 2 GB and 4 GB, which are mapped to DataFrames.
My job works fine for very small files ...
I personally think saveAsTextFile takes most of the time in my application.
Below are my cluster config details:
Total Vmem allocated for Containers 394.80 GB
Total VCores allocated for Containers 36
This is how I run my Spark job:
spark-submit --queue root.queue --deploy-mode client --master yarn SparkApplication-SQL-jar-with-dependencies.jar
Here is my code:
object TestDiff {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("WordCount");
conf.set("spark.executor.memory", "32g")
conf.set("spark.driver.memory", "32g")
conf.set("spark.driver.maxResultSize", "4g")
val sc = new SparkContext(conf); //Creating spark context
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{ StructType, StructField, StringType, DoubleType, IntegerType }
import org.apache.spark.sql.functions.udf
val schema = StructType(Array(
StructField("filler1", StringType),
StructField("dunsnumber", StringType),
StructField("transactionalindicator", StringType)))
import org.apache.spark.sql.functions._
val textRdd1 = sc.textFile("/home/cloudera/TRF/PCFP/INCR")
val rowRdd1 = textRdd1.map(line => Row.fromSeq(line.split("\\|", -1)))
var df1 = sqlContext.createDataFrame(rowRdd1, schema)
val textRdd2 = sc.textFile("/home/cloudera/TRF/PCFP/MAIN")
val rowRdd2 = textRdd2.map(line => Row.fromSeq(line.split("\\|", -1)))
var df2 = sqlContext.createDataFrame(rowRdd2, schema)
//Finding the diff between two if any of the columns has changed
val diffAnyColumnDF = df1.except(df2)
It runs for more than 30 minutes and then fails with the exception below.
Here are the logs:
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 14 more
17/09/15 11:55:01 WARN netty.NettyRpcEnv: Ignored message: HeartbeatResponse(false)
17/09/15 11:56:19 WARN netty.NettyRpcEndpointRef: Error sending message [message = Heartbeat(1,[Lscala.Tuple2;#7fe57079,BlockManagerId(1,, 33507))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:520)
at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:520)
at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:520)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1818)
at org.apache.spark.executor.Executor$$anon$
at java.util.concurrent.Executors$
at java.util.concurrent.FutureTask.runAndReset(
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(
at java.util.concurrent.ScheduledThreadPoolExecutor$
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 14 more
17/09/15 11:56:19 WARN netty.NettyRpcEnv: Ignored message: HeartbeatResponse(false)
Please suggest how to tune my Spark job.
I just changed the executor memory and the job succeeded, but it is very slow:
conf.set("spark.executor.memory", "64g")
The job now takes around 15 minutes to complete.
Attaching DAG Visualization
After increasing the timeout configuration I get the error below:
executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 175200 ms
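I think a single partition of your file is too big. It takes a long time to stream the data over the TCP channel, and the connection cannot be kept alive that long, so it gets reset.
Can you split the data into a higher number of partitions, for example with repartition? Note that coalesce on its own only reduces the partition count; increasing it requires a shuffle.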

Scala Spark SQLContext Program throwing array out of bound exception

I am new to Apache Spark. I am trying to create a schema and load data from HDFS. Below is my code:
// importing sqlcontext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
//defining the schema
case class Author1(Author_Key: Long, Author_ID: Long, Author: String, First_Name: String, Last_Name: String, Middle_Name: String, Full_Name: String, Institution_Full_Name: String, Country: String, DIAS_ID: Int, R_ID: String)
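val D_Authors1 = sc.textFile("hdfs:///user/D_Authors.txt").map(_.split("\\|"))
.map(auth => Author1(auth(0).trim.toLong, auth(1).trim.toLong, auth(2), auth(3), auth(4), auth(5), auth(6), auth(7), auth(8), auth(9).trim.toInt, auth(10)))
//register the table
D_Authors1.registerAsTable("D_Authors1")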
val auth = sqlContext.sql("SELECT * FROM D_Authors1")
sqlContext.sql("SELECT * FROM D_Authors").collect().foreach(println)
When I execute this code, it throws an array index out of bounds exception. Below is the error:
14/08/18 06:57:14 INFO Analyzer: Max iterations (2) reached for batch MultiInstanceRelations
14/08/18 06:57:14 INFO Analyzer: Max iterations (2) reached for batch CaseInsensitiveAttributeReferences
14/08/18 06:57:14 INFO SQLContext$$anon$1: Max iterations (2) reached for batch Add exchange
14/08/18 06:57:14 INFO SQLContext$$anon$1: Max iterations (2) reached for batch Prepare Expressions
14/08/18 06:57:14 INFO FileInputFormat: Total input paths to process : 1
14/08/18 06:57:14 INFO SparkContext: Starting job: collect at <console>:24
14/08/18 06:57:14 INFO DAGScheduler: Got job 5 (collect at <console>:24) with 2 output partitions (allowLocal=false)
14/08/18 06:57:14 INFO DAGScheduler: Final stage: Stage 5(collect at <console>:24)
14/08/18 06:57:14 INFO DAGScheduler: Parents of final stage: List()
14/08/18 06:57:14 INFO DAGScheduler: Missing parents: List()
14/08/18 06:57:14 INFO DAGScheduler: Submitting Stage 5 (SchemaRDD[26] at RDD at SchemaRDD.scala:98
== Query Plan ==
ExistingRdd [Author_Key#22L,Author_ID#23L,Author#24,First_Name#25,Last_Name#26,Middle_Name#27,Full_Name#28,Institution_Full_Name#29,Country#30,DIAS_ID#31,R_ID#32], MapPartitionsRDD[23] at mapPartitions at basicOperators.scala:174), which has no missing parents
14/08/18 06:57:14 INFO DAGScheduler: Submitting 2 missing tasks from Stage 5 (SchemaRDD[26] at RDD at SchemaRDD.scala:98
== Query Plan ==
ExistingRdd [Author_Key#22L,Author_ID#23L,Author#24,First_Name#25,Last_Name#26,Middle_Name#27,Full_Name#28,Institution_Full_Name#29,Country#30,DIAS_ID#31,R_ID#32], MapPartitionsRDD[23] at mapPartitions at basicOperators.scala:174)
14/08/18 06:57:14 INFO YarnClientClusterScheduler: Adding task set 5.0 with 2 tasks
14/08/18 06:57:14 INFO TaskSetManager: Starting task 5.0:0 as TID 38 on executor 1: (NODE_LOCAL)
14/08/18 06:57:14 INFO TaskSetManager: Serialized task 5.0:0 as 4401 bytes in 1 ms
14/08/18 06:57:15 INFO TaskSetManager: Starting task 5.0:1 as TID 39 on executor 1: (NODE_LOCAL)
14/08/18 06:57:15 INFO TaskSetManager: Serialized task 5.0:1 as 4401 bytes in 0 ms
14/08/18 06:57:15 WARN TaskSetManager: Lost TID 38 (task 5.0:0)
14/08/18 06:57:15 WARN TaskSetManager: Loss was due to java.lang.ArrayIndexOutOfBoundsException
java.lang.ArrayIndexOutOfBoundsException: 10
at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
at scala.collection.Iterator$$anon$
at scala.collection.Iterator$$anon$
at scala.collection.Iterator$$anon$1.head(Iterator.scala:840)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:179)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:174)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:110)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.executor.Executor$
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
14/08/18 06:57:15 WARN TaskSetManager: Lost TID 39 (task 5.0:1)
14/08/18 06:57:15 WARN TaskSetManager: Loss was due to java.lang.ArrayIndexOutOfBoundsException
java.lang.ArrayIndexOutOfBoundsException: 9
at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
at scala.collection.Iterator$$anon$
at scala.collection.Iterator$$anon$
at scala.collection.Iterator$$anon$1.head(Iterator.scala:840)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:179)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:174)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:110)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.executor.Executor$
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
Your problem has nothing to do with Spark.
Format your code correctly (I have corrected it).
Don't mix camel case and underscore naming: use underscores for SQL fields and camel case for Scala vals.
When you get an exception, read it. It usually tells you what you are doing wrong; in your case it's probably that some of the records in hdfs:///user/D_Authors.txt are not in the form you expect.
When you get an exception, debug it: try actually catching the exception and printing out the records that fail to parse.
_.split("\\|") drops trailing empty strings, so a record with empty fields at the end yields a shorter array; use _.split("\\|", -1) to keep them.
In Scala you don't need magic numbers that manually access elements of an array; it's ugly and more prone to error. Use a pattern match ...
Here is a simple example that includes handling of unusual records:
case class Author(author: String, authorAge: Int)
// `lines` is a placeholder name for the RDD[String] of input lines
lines.map(_.split("\t", -1) match {
  case Array(author, authorAge) => Author(author, authorAge.toInt)
  case unexpectedArrayForm =>
    throw new RuntimeException("Record did not have correct number of fields: " +
      unexpectedArrayForm.mkString("\t"))
})
Now if you coded it like this, your exception would tell you straight away exactly what is wrong with your data.
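As a rough illustration of the last few points, here is a sketch in plain Scala (no Spark needed) that shows the effect of the -1 limit and a variant of the pattern match that reports bad records instead of throwing. The names RecordParsingSketch and parse, and the sample lines, are made up for illustration.

```scala
import scala.util.Try

// Illustrative only: plain Scala, with hypothetical sample records.
object RecordParsingSketch {
  final case class Author(author: String, authorAge: Int)

  // Right(author) for a well-formed line, Left(reason) for anything else.
  def parse(line: String): Either[String, Author] =
    line.split("\t", -1) match {
      case Array(author, authorAge) =>
        Try(authorAge.trim.toInt).toOption
          .map(age => Author(author, age))
          .toRight(s"Age is not an integer in record: $line")
      case unexpected =>
        Left(s"Expected 2 fields but found ${unexpected.length} in record: $line")
    }

  def main(args: Array[String]): Unit = {
    // Without a limit, trailing empty fields are dropped; with -1 they are kept.
    println("a||".split("\\|").length)
    println("a||".split("\\|", -1).length)

    // Each line is reported either as a parsed Author or with the reason it failed.
    val lines = Seq("Ann\t41", "Bob", "Cy\tunknown")
    lines.map(parse).foreach(println)
  }
}
```

Returning an Either keeps the bad records visible without aborting the whole job on the first one; whether that or a thrown exception is preferable depends on how strict the load should be.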
One final point/concern: why are you using Spark SQL? Your data is in text form; are you trying to transform it into, say, Parquet? If not, why not just use the regular Scala API to perform your analysis? Moreover, it's type checked and compile checked, unlike SQL.