Unable to run Spark with Mesos - scala

I set up Spark 0.9.1 to run on Mesos 0.13.0 using the steps mentioned here. The Mesos UI shows two workers registered. I want to run these commands in spark-shell:
> scala> val data = 1 to 10000
> data: scala.collection.immutable.Range.Inclusive = Range(1, 2, 3, 4, 5, 6,
> 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,
> 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41,
> 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58,
> 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75,
> 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92,
> 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107,
> 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121,
> 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135,
> 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149,
> 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163,
> 164, 165, 166, 167, 168, 169, 170...
> scala> val distData = sc.parallelize(data)
> distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:14
Now when I run the collect method, the following error occurs:
> scala> distData.filter(_< 10).collect()
14/06/03 19:54:55 INFO SparkContext: Starting job: collect at <console>:17
14/06/03 19:54:55 INFO DAGScheduler: Got job 0 (collect at <console>:17) with 8 output partitions (allowLocal=false)
14/06/03 19:54:55 INFO DAGScheduler: Final stage: Stage 0 (collect at <console>:17)
14/06/03 19:54:55 INFO DAGScheduler: Parents of final stage: List()
14/06/03 19:54:55 INFO DAGScheduler: Missing parents: List()
14/06/03 19:54:55 INFO DAGScheduler: Submitting Stage 0 (FilteredRDD[1] at filter at <console>:17), which has no missing parents
14/06/03 19:54:55 INFO DAGScheduler: Submitting 8 missing tasks from Stage 0 (FilteredRDD[1] at filter at <console>:17)
14/06/03 19:54:55 INFO TaskSchedulerImpl: Adding task set 0.0 with 8 tasks
14/06/03 19:54:55 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor 201406031732-3213994176-5050-6320-11: host-DSRV05.host (PROCESS_LOCAL)
14/06/03 19:54:55 INFO TaskSetManager: Serialized task 0.0:0 as 1338 bytes in 8 ms
14/06/03 19:54:55 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on executor 201406031732-3213994176-5050-6320-10: host-DSRV04.host (PROCESS_LOCAL)
14/06/03 19:54:55 INFO TaskSetManager: Serialized task 0.0:1 as 1338 bytes in 0 ms
14/06/03 19:54:55 INFO TaskSetManager: Starting task 0.0:2 as TID 2 on executor 201406031732-3213994176-5050-6320-11: host-DSRV05.host (PROCESS_LOCAL)
14/06/03 19:54:55 INFO TaskSetManager: Serialized task 0.0:2 as 1338 bytes in 0 ms
14/06/03 19:54:55 INFO TaskSetManager: Starting task 0.0:3 as TID 3 on executor 201406031732-3213994176-5050-6320-10: host-DSRV04.host (PROCESS_LOCAL)
14/06/03 19:54:55 INFO TaskSetManager: Serialized task 0.0:3 as 1338 bytes in 1 ms
14/06/03 19:54:55 INFO TaskSetManager: Starting task 0.0:4 as TID 4 on executor 201406031732-3213994176-5050-6320-11: host-DSRV05.host (PROCESS_LOCAL)
14/06/03 19:54:55 INFO TaskSetManager: Serialized task 0.0:4 as 1338 bytes in 0 ms
14/06/03 19:54:55 INFO TaskSetManager: Starting task 0.0:5 as TID 5 on executor 201406031732-3213994176-5050-6320-10: host-DSRV04.host (PROCESS_LOCAL)
14/06/03 19:54:55 INFO TaskSetManager: Serialized task 0.0:5 as 1338 bytes in 0 ms
14/06/03 19:54:55 INFO TaskSetManager: Starting task 0.0:6 as TID 6 on executor 201406031732-3213994176-5050-6320-11: host-DSRV05.host (PROCESS_LOCAL)
14/06/03 19:54:55 INFO TaskSetManager: Serialized task 0.0:6 as 1338 bytes in 0 ms
14/06/03 19:54:55 INFO TaskSetManager: Starting task 0.0:7 as TID 7 on executor 201406031732-3213994176-5050-6320-10: host-DSRV04.host (PROCESS_LOCAL)
14/06/03 19:54:55 INFO TaskSetManager: Serialized task 0.0:7 as 1338 bytes in 0 ms
14/06/03 19:54:56 INFO TaskSetManager: Re-queueing tasks for 201406031732-3213994176-5050-6320-10 from TaskSet 0.0
14/06/03 19:54:56 WARN TaskSetManager: Lost TID 5 (task 0.0:5)
14/06/03 19:54:56 WARN TaskSetManager: Lost TID 7 (task 0.0:7)
14/06/03 19:54:56 WARN TaskSetManager: Lost TID 1 (task 0.0:1)
14/06/03 19:54:56 WARN TaskSetManager: Lost TID 3 (task 0.0:3)
14/06/03 19:54:56 INFO DAGScheduler: Executor lost: 201406031732-3213994176-5050-6320-10 (epoch 0)
14/06/03 19:54:56 INFO BlockManagerMasterActor: Trying to remove executor 201406031732-3213994176-5050-6320-10 from BlockManagerMaster.
14/06/03 19:54:56 INFO BlockManagerMaster: Removed 201406031732-3213994176-5050-6320-10 successfully in removeExecutor
14/06/03 19:54:56 INFO TaskSetManager: Starting task 0.0:3 as TID 8 on executor 201406031732-3213994176-5050-6320-11: host-DSRV05.host (PROCESS_LOCAL)
14/06/03 19:54:56 INFO TaskSetManager: Serialized task 0.0:3 as 1338 bytes in 0 ms
14/06/03 19:54:56 INFO DAGScheduler: Host gained which was in lost list earlier: host-DSRV04.host
14/06/03 19:54:56 INFO TaskSetManager: Starting task 0.0:1 as TID 9 on executor 201406031732-3213994176-5050-6320-10: host-DSRV04.host (PROCESS_LOCAL)
14/06/03 19:54:56 INFO TaskSetManager: Serialized task 0.0:1 as 1338 bytes in 0 ms
14/06/03 19:54:56 INFO TaskSetManager: Starting task 0.0:7 as TID 10 on executor 201406031732-3213994176-5050-6320-11: host-DSRV05.host (PROCESS_LOCAL)
14/06/03 19:54:56 INFO TaskSetManager: Serialized task 0.0:7 as 1338 bytes in 0 ms
14/06/03 19:54:56 INFO TaskSetManager: Starting task 0.0:5 as TID 11 on executor 201406031732-3213994176-5050-6320-10: host-DSRV04.host (PROCESS_LOCAL)
14/06/03 19:54:56 INFO TaskSetManager: Serialized task 0.0:5 as 1338 bytes in 0 ms
14/06/03 19:54:57 INFO TaskSetManager: Re-queueing tasks for 201406031732-3213994176-5050-6320-11 from TaskSet 0.0
14/06/03 19:54:57 WARN TaskSetManager: Lost TID 8 (task 0.0:3)
14/06/03 19:54:57 WARN TaskSetManager: Lost TID 2 (task 0.0:2)
14/06/03 19:54:57 WARN TaskSetManager: Lost TID 4 (task 0.0:4)
14/06/03 19:54:57 WARN TaskSetManager: Lost TID 10 (task 0.0:7)
14/06/03 19:54:57 WARN TaskSetManager: Lost TID 6 (task 0.0:6)
14/06/03 19:54:57 WARN TaskSetManager: Lost TID 0 (task 0.0:0)
14/06/03 19:54:57 INFO DAGScheduler: Executor lost: 201406031732-3213994176-5050-6320-11 (epoch 1)
14/06/03 19:54:57 INFO BlockManagerMasterActor: Trying to remove executor 201406031732-3213994176-5050-6320-11 from BlockManagerMaster.
14/06/03 19:54:57 INFO BlockManagerMaster: Removed 201406031732-3213994176-5050-6320-11 successfully in removeExecutor
14/06/03 19:54:57 INFO DAGScheduler: Host gained which was in lost list earlier: host-DSRV05.host
14/06/03 19:54:57 INFO TaskSetManager: Starting task 0.0:0 as TID 12 on executor 201406031732-3213994176-5050-6320-11: host-DSRV05.host (PROCESS_LOCAL)
14/06/03 19:54:57 INFO TaskSetManager: Serialized task 0.0:0 as 1338 bytes in 1 ms
14/06/03 19:54:57 INFO TaskSetManager: Starting task 0.0:6 as TID 13 on executor 201406031732-3213994176-5050-6320-10: host-DSRV04.host (PROCESS_LOCAL)
14/06/03 19:54:57 INFO TaskSetManager: Serialized task 0.0:6 as 1338 bytes in 0 ms
14/06/03 19:54:57 INFO TaskSetManager: Starting task 0.0:7 as TID 14 on executor 201406031732-3213994176-5050-6320-11: host-DSRV05.host (PROCESS_LOCAL)
14/06/03 19:54:57 INFO TaskSetManager: Serialized task 0.0:7 as 1338 bytes in 1 ms
14/06/03 19:54:57 INFO TaskSetManager: Starting task 0.0:4 as TID 15 on executor 201406031732-3213994176-5050-6320-10: host-DSRV04.host (PROCESS_LOCAL)
14/06/03 19:54:57 INFO TaskSetManager: Serialized task 0.0:4 as 1338 bytes in 0 ms
14/06/03 19:54:57 INFO TaskSetManager: Starting task 0.0:2 as TID 16 on executor 201406031732-3213994176-5050-6320-11: host-DSRV05.host (PROCESS_LOCAL)
14/06/03 19:54:57 INFO TaskSetManager: Serialized task 0.0:2 as 1338 bytes in 0 ms
14/06/03 19:54:57 INFO TaskSetManager: Starting task 0.0:3 as TID 17 on executor 201406031732-3213994176-5050-6320-10: host-DSRV04.host (PROCESS_LOCAL)
14/06/03 19:54:57 INFO TaskSetManager: Serialized task 0.0:3 as 1338 bytes in 1 ms
14/06/03 19:54:57 INFO TaskSetManager: Re-queueing tasks for 201406031732-3213994176-5050-6320-11 from TaskSet 0.0
14/06/03 19:54:57 WARN TaskSetManager: Lost TID 14 (task 0.0:7)
14/06/03 19:54:57 WARN TaskSetManager: Lost TID 16 (task 0.0:2)
14/06/03 19:54:57 WARN TaskSetManager: Lost TID 12 (task 0.0:0)
14/06/03 19:54:57 INFO DAGScheduler: Executor lost: 201406031732-3213994176-5050-6320-11 (epoch 2)
14/06/03 19:54:57 INFO BlockManagerMasterActor: Trying to remove executor 201406031732-3213994176-5050-6320-11 from BlockManagerMaster.
14/06/03 19:54:57 INFO BlockManagerMaster: Removed 201406031732-3213994176-5050-6320-11 successfully in removeExecutor
14/06/03 19:54:57 INFO DAGScheduler: Host gained which was in lost list earlier: host-DSRV05.host
14/06/03 19:54:57 INFO TaskSetManager: Starting task 0.0:0 as TID 18 on executor 201406031732-3213994176-5050-6320-11: host-DSRV05.host (PROCESS_LOCAL)
14/06/03 19:54:57 INFO TaskSetManager: Serialized task 0.0:0 as 1338 bytes in 0 ms
14/06/03 19:54:57 INFO TaskSetManager: Starting task 0.0:2 as TID 19 on executor 201406031732-3213994176-5050-6320-11: host-DSRV05.host (PROCESS_LOCAL)
14/06/03 19:54:57 INFO TaskSetManager: Serialized task 0.0:2 as 1338 bytes in 0 ms
14/06/03 19:54:57 INFO TaskSetManager: Starting task 0.0:7 as TID 20 on executor 201406031732-3213994176-5050-6320-11: host-DSRV05.host (PROCESS_LOCAL)
14/06/03 19:54:57 INFO TaskSetManager: Serialized task 0.0:7 as 1338 bytes in 0 ms
14/06/03 19:54:58 INFO TaskSetManager: Re-queueing tasks for 201406031732-3213994176-5050-6320-10 from TaskSet 0.0
14/06/03 19:54:58 WARN TaskSetManager: Lost TID 17 (task 0.0:3)
14/06/03 19:54:58 WARN TaskSetManager: Lost TID 11 (task 0.0:5)
14/06/03 19:54:58 WARN TaskSetManager: Lost TID 13 (task 0.0:6)
14/06/03 19:54:58 WARN TaskSetManager: Lost TID 9 (task 0.0:1)
14/06/03 19:54:58 WARN TaskSetManager: Lost TID 15 (task 0.0:4)
14/06/03 19:54:58 INFO DAGScheduler: Executor lost: 201406031732-3213994176-5050-6320-10 (epoch 3)
14/06/03 19:54:58 INFO BlockManagerMasterActor: Trying to remove executor 201406031732-3213994176-5050-6320-10 from BlockManagerMaster.
14/06/03 19:54:58 INFO BlockManagerMaster: Removed 201406031732-3213994176-5050-6320-10 successfully in removeExecutor
14/06/03 19:54:58 INFO DAGScheduler: Host gained which was in lost list earlier: host-DSRV04.host
14/06/03 19:54:58 INFO TaskSetManager: Starting task 0.0:4 as TID 21 on executor 201406031732-3213994176-5050-6320-11: host-DSRV05.host (PROCESS_LOCAL)
14/06/03 19:54:58 INFO TaskSetManager: Serialized task 0.0:4 as 1338 bytes in 0 ms
14/06/03 19:54:58 INFO TaskSetManager: Starting task 0.0:1 as TID 22 on executor 201406031732-3213994176-5050-6320-10: host-DSRV04.host (PROCESS_LOCAL)
14/06/03 19:54:58 INFO TaskSetManager: Serialized task 0.0:1 as 1338 bytes in 0 ms
14/06/03 19:54:58 INFO TaskSetManager: Starting task 0.0:6 as TID 23 on executor 201406031732-3213994176-5050-6320-11: host-DSRV05.host (PROCESS_LOCAL)
14/06/03 19:54:58 INFO TaskSetManager: Serialized task 0.0:6 as 1338 bytes in 0 ms
14/06/03 19:54:58 INFO TaskSetManager: Starting task 0.0:5 as TID 24 on executor 201406031732-3213994176-5050-6320-10: host-DSRV04.host (PROCESS_LOCAL)
14/06/03 19:54:58 INFO TaskSetManager: Serialized task 0.0:5 as 1338 bytes in 1 ms
14/06/03 19:54:58 INFO TaskSetManager: Starting task 0.0:3 as TID 25 on executor 201406031732-3213994176-5050-6320-10: host-DSRV04.host (PROCESS_LOCAL)
14/06/03 19:54:58 INFO TaskSetManager: Serialized task 0.0:3 as 1338 bytes in 0 ms
14/06/03 19:54:59 INFO TaskSetManager: Re-queueing tasks for 201406031732-3213994176-5050-6320-11 from TaskSet 0.0
14/06/03 19:54:59 WARN TaskSetManager: Lost TID 23 (task 0.0:6)
14/06/03 19:54:59 WARN TaskSetManager: Lost TID 20 (task 0.0:7)
14/06/03 19:54:59 ERROR TaskSetManager: Task 0.0:7 failed 4 times; aborting job
14/06/03 19:54:59 INFO DAGScheduler: Failed to run collect at <console>:17
14/06/03 19:54:59 INFO DAGScheduler: Executor lost: 201406031732-3213994176-5050-6320-11 (epoch 4)
14/06/03 19:54:59 INFO BlockManagerMasterActor: Trying to remove executor 201406031732-3213994176-5050-6320-11 from BlockManagerMaster.
14/06/03 19:54:59 INFO BlockManagerMaster: Removed 201406031732-3213994176-5050-6320-11 successfully in removeExecutor
14/06/03 19:54:59 INFO DAGScheduler: Host gained which was in lost list earlier: host-DSRV05.host
org.apache.spark.SparkException: Job aborted: Task 0.0:7 failed 4 times (most recent failure: unknown)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
>
> scala> 14/06/03 19:55:00 INFO TaskSetManager: Re-queueing tasks for 201406031732-3213994176-5050-6320-10 from TaskSet 0.0
> 14/06/03 19:55:00 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
> 14/06/03 19:55:00 INFO DAGScheduler: Executor lost: 201406031732-3213994176-5050-6320-10 (epoch 5)
> 14/06/03 19:55:00 INFO BlockManagerMasterActor: Trying to remove executor 201406031732-3213994176-5050-6320-10 from BlockManagerMaster.
> 14/06/03 19:55:00 INFO BlockManagerMaster: Removed 201406031732-3213994176-5050-6320-10 successfully in removeExecutor
> 14/06/03 19:55:00 INFO DAGScheduler: Host gained which was in lost list earlier: host-DSRV04.host
I've checked my Spark configuration many times and it looks fine to me. Any ideas what might have gone wrong?
--
Thanks

As it turns out, my tar file wasn't created properly.
I recreated it and it's working fine now.
Sorry for the trouble.
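
A badly built archive can be caught before starting the shell by opening it and confirming that it reads end to end and contains the entries the executors need. The following is a minimal sketch of such a check, assuming the tar file in question is the Spark distribution archive that the Mesos workers download. The directory name `spark-dist` and the required path `bin/spark-class` are hypothetical placeholders, not taken from the original setup.

```python
import io
import tarfile


def check_archive(fileobj, required_suffixes):
    """Return (ok, problems) for a tar archive read from a file object."""
    problems = []
    try:
        with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
            # Listing members walks every header in the archive; a corrupt
            # or truncated file usually fails at this point.
            names = tar.getnames()
    except (tarfile.TarError, EOFError, OSError) as exc:
        return False, ["archive is unreadable: %s" % exc]
    for suffix in required_suffixes:
        if not any(name.endswith(suffix) for name in names):
            problems.append("missing entry ending with %r" % suffix)
    return not problems, problems


def build_sample_archive(member_names):
    """Build a small gzip-compressed tar in memory, for demonstration only."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name in member_names:
            data = b"#!/bin/sh\n"
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    # Hypothetical layout; substitute the paths your executors actually need.
    sample = build_sample_archive(["spark-dist/bin/spark-class"])
    print(check_archive(sample, ["bin/spark-class"]))
    print(check_archive(io.BytesIO(b"not a tar file"), ["bin/spark-class"]))
```

For a real archive, pass `open(path, "rb")` instead of the in-memory sample. The check only looks at member names, so it will not notice a file that is present but has the wrong contents.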

Related

Azure Databricks: Problem saving to CSV with 100+ columns and many nulls, org.apache.spark.SparkException: Job aborted

I have seen many people with this issue, but none of their solutions resolved mine. I have a JSON file that I read in as a Spark DataFrame and then flatten. Lastly, I try to save the flattened data as a CSV with 100+ columns, more than 47 million rows, and many nulls.
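
To make the flattening step concrete, here is a minimal sketch in plain Python. It is not the Spark code from the notebook, which is not shown, and the field names and the `_` separator are hypothetical. It illustrates how nested JSON objects turn into many top-level columns, some of which hold nulls.

```python
def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts into a single-level dict with joined key names."""
    flat = {}
    for key, value in record.items():
        name = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the prefix along.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


if __name__ == "__main__":
    # Hypothetical record; a null (None) leaf becomes a column of its own.
    record = {"order": {"id": 7, "state": {"name": "open", "reason": None}}, "year": 2022}
    print(flatten(record))
```

Each nested level adds its leaves as separate columns, which is how a moderately nested document can end up well past 100 columns once flattened.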
I get the error org.apache.spark.SparkException: Job aborted.
Here is part of the stack trace:
WARN TaskSetManager: Lost task 0.2 in stage 41.0 (TID 164) (10.139.64.7 executor 22): java.lang.BootstrapMethodError: call site initialization exception
at java.lang.invoke.CallSite.makeSite(CallSite.java:341)
at java.lang.invoke.MethodHandleNatives.linkCallSiteImpl(MethodHandleNatives.java:307)
at java.lang.invoke.MethodHandleNatives.linkCallSite(MethodHandleNatives.java:297)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.recordWriteFailure$1(FileFormatWriter.scala:403)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:431)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$12(FileFormatWriter.scala:310)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:169)
at org.apache.spark.scheduler.Task.$anonfun$run$4(Task.scala:137)
at com.databricks.unity.EmptyHandle$.runWithAndClose(UCSHandle.scala:104)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:137)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:96)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:902)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1696)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:905)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:760)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.InternalError: BMH.reinvoke=Lambda(a0:L/SpeciesData<LLL>,a1:L,a2:L,a3:L,a4:L,a5:L,a6:L,a7:L,a8:L,a9:L,a10:L,a11:L)=>{
t12:L=BoundMethodHandle$Species_L3.argL2(a0:L);
t13:L=MethodHandle.invokeBasic(t12:L,a2:L);
t14:L=BoundMethodHandle$Species_L3.argL1(a0:L);
t15:L=MethodHandle.invokeBasic(t14:L,a1:L);
t16:L=MethodHandleImpl.array(a4:L,a5:L,a6:L,a7:L,a8:L,a9:L,a10:L,a11:L);
t17:L=BoundMethodHandle$Species_L3.argL0(a0:L);
t18:L=MethodHandle.invokeBasic(t17:L,t15:L,t13:L,a3:L,t16:L);t18:L}
at java.lang.invoke.MethodHandleStatics.newInternalError(MethodHandleStatics.java:127)
at java.lang.invoke.LambdaForm.compileToBytecode(LambdaForm.java:660)
at java.lang.invoke.LambdaForm.prepare(LambdaForm.java:635)
at java.lang.invoke.MethodHandle.<init>(MethodHandle.java:461)
at java.lang.invoke.BoundMethodHandle.<init>(BoundMethodHandle.java:58)
at java.lang.invoke.BoundMethodHandle$Species_L3.<init>(Species_L3)
at java.lang.invoke.BoundMethodHandle$Species_L3.make(Species_L3)
at java.lang.invoke.BoundMethodHandle$Species_LL.copyWithExtendL(Species_LL)
at java.lang.invoke.MethodHandleImpl.makePairwiseConvertByEditor(MethodHandleImpl.java:231)
at java.lang.invoke.MethodHandleImpl.makePairwiseConvert(MethodHandleImpl.java:194)
at java.lang.invoke.MethodHandleImpl.makePairwiseConvert(MethodHandleImpl.java:380)
at java.lang.invoke.MethodHandle.asTypeUncached(MethodHandle.java:776)
at java.lang.invoke.MethodHandle.asType(MethodHandle.java:761)
at java.lang.invoke.MethodHandleImpl$AsVarargsCollector.asTypeUncached(MethodHandleImpl.java:508)
at java.lang.invoke.MethodHandle.asType(MethodHandle.java:761)
at java.lang.invoke.CallSite.makeSite(CallSite.java:323)
... 25 more
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
22/12/13 09:56:11 INFO TaskSetManager: Starting task 0.3 in stage 41.0 (TID 165) (10.139.64.5, executor 23, partition 0, PROCESS_LOCAL, taskResourceAssignments Map())
22/12/13 09:56:12 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old Ema: 1.0, New Ema: 1.0
22/12/13 09:56:12 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20221213083124-0000/22 is now EXITED (Command exited with code 50)
22/12/13 09:56:12 INFO StandaloneSchedulerBackend: Executor app-20221213083124-0000/22 removed: Command exited with code 50
22/12/13 09:56:12 ERROR TaskSchedulerImpl: Lost executor 22 on 10.139.64.7: Command exited with code 50
22/12/13 09:56:12 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20221213083124-0000/25 on worker-20221213085244-10.139.64.7-41835 (10.139.64.7:41835) with 4 core(s)
22/12/13 09:56:12 INFO DAGScheduler: Executor lost: 22 (epoch 18)
22/12/13 09:56:12 INFO StandaloneSchedulerBackend: Granted executor ID app-20221213083124-0000/25 on hostPort 10.139.64.7:41835 with 4 core(s), 3.1 GiB RAM
22/12/13 09:56:12 INFO BlockManagerMasterEndpoint: Trying to remove executor 22 from BlockManagerMaster.
22/12/13 09:56:12 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(22, 10.139.64.7, 40355, None)
22/12/13 09:56:12 INFO BlockManagerMaster: Removed 22 successfully in removeExecutor
22/12/13 09:56:12 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20221213083124-0000/25 is now RUNNING
22/12/13 09:56:15 INFO StandaloneSchedulerBackend$StandaloneDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.139.64.6:55600) with ID 24, ResourceProfileId 0
22/12/13 09:56:15 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old Ema: 1.0, New Ema: 1.0
22/12/13 09:56:16 INFO BlockManagerMasterEndpoint: Registering block manager 10.139.64.6:39649 with 1504.5 MiB RAM, BlockManagerId(24, 10.139.64.6, 39649, None)
22/12/13 09:56:16 INFO BlockManagerInfo: Added broadcast_36_piece0 in memory on 10.139.64.5:45187 (size: 220.5 KiB, free: 1504.3 MiB)
22/12/13 09:56:18 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old Ema: 1.0, New Ema: 1.0
22/12/13 09:56:18 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 10.139.64.5:45187 (size: 18.1 KiB, free: 1504.3 MiB)
22/12/13 09:56:21 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old Ema: 1.0, New Ema: 1.0
22/12/13 09:56:22 INFO StandaloneSchedulerBackend$StandaloneDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.139.64.7:55476) with ID 25, ResourceProfileId 0
22/12/13 09:56:24 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old Ema: 1.0, New Ema: 1.0
22/12/13 09:56:24 INFO BlockManagerMasterEndpoint: Registering block manager 10.139.64.7:43527 with 1504.5 MiB RAM, BlockManagerId(25, 10.139.64.7, 43527, None)
22/12/13 09:56:24 INFO DataSourceFactory$: DataSource Jdbc URL: jdbc:mariadb://consolidated-westeuropec2-prod-metastore-1.mysql.database.azure.com:3306/organization8773400708728304?useSSL=true&sslMode=VERIFY_CA&disableSslHostnameVerification=true&trustServerCertificate=false&serverSslCert=/databricks/common/mysql-ssl-ca-cert.crt
22/12/13 09:56:24 INFO HikariDataSource: metastore-monitor - Starting...
22/12/13 09:56:24 INFO HikariDataSource: metastore-monitor - Start completed.
22/12/13 09:56:24 INFO HikariDataSource: metastore-monitor - Shutdown initiated...
22/12/13 09:56:24 INFO HikariDataSource: metastore-monitor - Shutdown completed.
22/12/13 09:56:24 INFO MetastoreMonitor: Metastore healthcheck successful (connection duration = 207 milliseconds)
22/12/13 09:56:27 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old Ema: 1.0, New Ema: 1.0
22/12/13 09:56:27 INFO BlockManagerInfo: Added broadcast_35_piece0 in memory on 10.139.64.5:45187 (size: 13.6 KiB, free: 1504.3 MiB)
22/12/13 09:56:30 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old Ema: 1.0, New Ema: 1.0
22/12/13 09:56:33 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old Ema: 1.0, New Ema: 1.0
22/12/13 09:56:34 INFO DriverCorral: DBFS health check ok
22/12/13 09:56:34 INFO HiveMetaStore: 1: get_database: default
22/12/13 09:56:34 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
22/12/13 09:56:34 INFO DriverCorral: Metastore health check ok
22/12/13 09:56:36 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old Ema: 1.0, New Ema: 1.0
22/12/13 09:56:39 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old Ema: 1.0, New Ema: 1.0
22/12/13 09:56:42 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old Ema: 1.0, New Ema: 1.0
22/12/13 09:56:45 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old Ema: 1.0, New Ema: 1.0
22/12/13 09:56:48 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old Ema: 1.0, New Ema: 1.0
22/12/13 09:56:51 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old Ema: 1.0, New Ema: 1.0
22/12/13 09:56:54 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old Ema: 1.0, New Ema: 1.0
22/12/13 09:56:57 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old Ema: 1.0, New Ema: 1.0
22/12/13 09:57:00 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old Ema: 1.0, New Ema: 1.0
22/12/13 09:57:03 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old Ema: 1.0, New Ema: 1.0
22/12/13 09:57:06 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old Ema: 1.0, New Ema: 1.0
22/12/13 09:57:08 INFO CommChannelWebSocket: onWebSocketConnect: websocket connected with session: WebSocketSession[websocket=JettyAnnotatedEventDriver[com.databricks.backend.daemon.driver.CommChannelWebSocket#e83bc3c],behavior=SERVER,connection=WebSocketServerConnection#52d196e1::DecryptedEndPoint#685a7d89{l=/10.139.64.4:6062,r=/10.139.0.4:51020,OPEN,fill=-,flush=-,to=2/7200000},remote=WebSocketRemoteEndpoint#486edc7e[batching=true],incoming=JettyAnnotatedEventDriver[com.databricks.backend.daemon.driver.CommChannelWebSocket#e83bc3c],outgoing=ExtensionStack[queueSize=0,extensions=[],incoming=org.eclipse.jetty.websocket.common.WebSocketSession,outgoing=org.eclipse.jetty.websocket.server.WebSocketServerConnection]]
22/12/13 09:57:08 ERROR OutgoingDirectNotebookMessageBuffer: Session should be closed before interrupting.
22/12/13 09:57:08 INFO OutgoingDirectNotebookMessageBuffer: Stop MessageSendTask with session: 1914687926
22/12/13 09:57:08 INFO OutgoingDirectNotebookMessageBuffer: Start MessageSendTask with session: 349608004
22/12/13 09:57:08 INFO CommChannelWebSocket: onWebSocketClose: websocket closed with statusCode: 1006, reason: Disconnected
22/12/13 09:57:09 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old Ema: 1.0, New Ema: 1.0
22/12/13 09:57:32 INFO BlockManagerMaster: Removed 23 successfully in removeExecutor
22/12/13 09:57:32 ERROR FileFormatWriter: Aborting job 6535aa05-18e2-43f7-83e5-bbcf56a1ee02.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 41.0 failed 4 times, most recent failure: Lost task 0.3 in stage 41.0 (TID 165) (10.139.64.5 executor 23): ExecutorLostFailure (executor 23 exited caused by one of the running tasks) Reason: Command exited with code 52
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:3312)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:3244)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:3235)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:3235)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1424)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1424)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1424)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3524)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3462)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3450)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:51)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$runJob$1(DAGScheduler.scala:1169)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1157)
at org.apache.spark.SparkContext.runJobInternal(SparkContext.scala:2713)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2696)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$1(FileFormatWriter.scala:299)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:154)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:207)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:126)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:124)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:138)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.$anonfun$applyOrElse$2(QueryExecution.scala:241)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$8(SQLExecution.scala:243)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:392)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:188)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:985)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:142)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:342)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.$anonfun$applyOrElse$1(QueryExecution.scala:241)
at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$withMVTagsIfNecessary(QueryExecution.scala:226)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.applyOrElse(QueryExecution.scala:239)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.applyOrElse(QueryExecution.scala:232)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:99)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:512)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:31)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:268)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:264)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:488)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$eagerlyExecuteCommands$1(QueryExecution.scala:232)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:324)
at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:232)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:186)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:177)
at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:268)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:965)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:430)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:397)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:251)
at org.apache.spark.sql.DataFrameWriter.csv(DataFrameWriter.scala:956)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:306)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
at java.lang.Thread.run(Thread.java:750)
22/12/13 09:57:32 INFO ClusterLoadMonitor: Removed query with execution ID:9. Current active queries:0
22/12/13 09:57:32 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20221213083124-0000/26 is now RUNNING
22/12/13 09:57:32 WARN JupyterDriverLocal: User code returned error with traceback: ---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<command-3309668900398459> in <cell line: 268>()
    266
    267 _type = "orderState"
--> 268 check_if_dupe(_type, year, month, day, year_prev, month_prev, day_prev)
    269 #display(df)
    270 #check_df(df, _type)
<command-3309668900398459> in check_if_dupe(_type, year, month, day, year_prev, month_prev, day_prev)
    235 df_flattened = df_flattened.na.fill(value=0)
/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
    327 "An error occurred while calling {0}{1}{2}.\n".
    328 format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o3023.csv.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:882)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$1(FileFormatWriter.scala:334)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:154)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:207)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:126)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:124)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:138)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.$anonfun$applyOrElse$2(QueryExecution.scala:241)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$8(SQLExecution.scala:243)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:392)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:188)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:985)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:142)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:342)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.$anonfun$applyOrElse$1(QueryExecution.scala:241)
at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$withMVTagsIfNecessary(QueryExecution.scala:226)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.applyOrElse(QueryExecution.scala:239)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.applyOrElse(QueryExecution.scala:232)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:99)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:512)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:31)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:268)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:264)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:488)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$eagerlyExecuteCommands$1(QueryExecution.scala:232)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:324)
at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:232)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:186)
... 49 more
22/12/13 09:57:32 INFO ProgressReporter$: Removed result fetcher for 3479767264321116845_5405572452950350648_76aea015601745f79d10ebdb188a0ee2
22/12/13 09:57:33 INFO ClusterLoadAvgHelper: Current cluster load: 0, Old Ema: 1.0, New Ema: 0.85
22/12/13 09:57:36 INFO ClusterLoadAvgHelper: Current cluster load: 0, Old Ema: 0.85, New Ema: 0.0
22/12/13 09:57:40 INFO StandaloneSchedulerBackend$StandaloneDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.139.64.5:50180) with ID 26, ResourceProfileId 0
22/12/13 09:57:42 INFO BlockManagerMasterEndpoint: Registering block manager 10.139.64.5:44301 with 1504.5 MiB RAM, BlockManagerId(26, 10.139.64.5, 44301, None)
22/12/13 09:58:28 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20221213083124-0000/25 is now DECOMMISSIONED (worker decommissioned because of kill request from HTTP endpoint (data migration disabled))
22/12/13 09:58:28 INFO StandaloneSchedulerBackend: Asked to decommission executor app-20221213083124-0000/25
22/12/13 09:58:28 INFO StandaloneSchedulerBackend: Decommission executors: 25
22/12/13 09:58:28 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(25, 10.139.64.7, 43527, None)) as being decommissioning.
22/12/13 09:58:28 INFO StandaloneSchedulerBackend: Notify executor 25 to decommissioning.
22/12/13 09:58:28 INFO StandaloneSchedulerBackend: Executor app-20221213083124-0000/25 decommissioned: ExecutorDecommissionInfo(worker decommissioned because of kill request from HTTP endpoint (data migration disabled),Some(10.139.64.7),false)
22/12/13 09:58:28 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20221213083124-0000/25 is now LOST (worker lost)
22/12/13 09:58:28 INFO StandaloneSchedulerBackend: Executor app-20221213083124-0000/25 removed: worker lost
22/12/13 09:58:28 INFO StandaloneAppClient$ClientEndpoint: Master removed worker worker-20221213085244-10.139.64.7-41835: 10.139.64.7:41835 got disassociated
22/12/13 09:58:28 INFO StandaloneSchedulerBackend: Worker worker-20221213085244-10.139.64.7-41835 removed: 10.139.64.7:41835 got disassociated
22/12/13 09:58:28 ERROR TaskSchedulerImpl: Lost executor 25 on 10.139.64.7: Executor decommission: worker decommissioned because of kill request from HTTP endpoint (data migration disabled)
22/12/13 09:58:28 INFO DAGScheduler: Executor lost: 25 (epoch 18)
22/12/13 09:58:28 INFO TaskSchedulerImpl: Handle removed worker worker-20221213085244-10.139.64.7-41835: 10.139.64.7:41835 got disassociated
22/12/13 09:58:28 INFO BlockManagerMasterEndpoint: Trying to remove executor 25 from BlockManagerMaster.
22/12/13 09:58:28 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(25, 10.139.64.7, 43527, None)
22/12/13 09:58:28 INFO BlockManagerMaster: Removed 25 successfully in removeExecutor
22/12/13 09:58:28 INFO DAGScheduler: Shuffle files lost for host: 10.139.64.7 (epoch 18)
22/12/13 09:58:28 INFO DAGScheduler: Shuffle files lost for worker worker-20221213085244-10.139.64.7-41835 on host 10.139.64.7
I have tried increasing the driver memory and switching to a higher-performance cluster, both without success.

What is the difference between using single quotes and double quotes in the split() method in Scala?

I am working on the CCA-175 practice questions. I am given a text file whose fields are delimited by |:
Christopher|Jan 11, 2015, |5
Kapil|11 Jan, 2015|5
Thomas|6/17/2014|5
John|22-08-2013|5
Mithun|2013|5
Jitendra||5
Then I loaded the file as an RDD and tried to map over it. However, split returns two different results depending on whether I pass the delimiter in single quotes or double quotes, and only the single-quote version is right.
Using single quotes, line.split('|') returned:
Array[String] = Array(Christopher, Jan 11, 2015, 5), which is right.
Using double quotes, line.split("|") returned:
Array[String] = Array(C, h, r, i, s, t, o, p, h, e, r, |, J, a, n, " ", 1, 1, , " ", 2, 0, 1, 5, |, 5),
which is not what I need.
Can anyone explain the difference?
Thanks!
scala> val feedbackmap = feedback.map(line=>line.split('|'))
feedbackmap: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[4] at map at <console>:29
scala> feedbackmap.first
19/04/10 14:15:55 INFO SparkContext: Starting job: first at <console>:32
19/04/10 14:15:55 INFO DAGScheduler: Got job 4 (first at <console>:32) with 1 output partitions
19/04/10 14:15:55 INFO DAGScheduler: Final stage: ResultStage 4 (first at <console>:32)
19/04/10 14:15:55 INFO DAGScheduler: Parents of final stage: List()
19/04/10 14:15:55 INFO DAGScheduler: Missing parents: List()
19/04/10 14:15:55 INFO DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[4] at map at <console>:29), which has no missing parents
19/04/10 14:15:55 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 3.4 KB, free 510.7 MB)
19/04/10 14:15:55 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 2003.0 B, free 510.7 MB)
19/04/10 14:15:55 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost:43371 (size: 2003.0 B, free: 511.1 MB)
19/04/10 14:15:55 INFO SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1008
19/04/10 14:15:55 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (MapPartitionsRDD[4] at map at <console>:29)
19/04/10 14:15:55 INFO TaskSchedulerImpl: Adding task set 4.0 with 1 tasks
19/04/10 14:15:55 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 5, localhost, partition 0,ANY, 2171 bytes)
19/04/10 14:15:55 INFO Executor: Running task 0.0 in stage 4.0 (TID 5)
19/04/10 14:15:55 INFO HadoopRDD: Input split: hdfs://nn01.itversity.com:8020/user/junyanxu/scenario_37/feedback.txt:0+58
19/04/10 14:15:55 INFO Executor: Finished task 0.0 in stage 4.0 (TID 5). 2173 bytes result sent to driver
19/04/10 14:15:55 INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID 5) in 7 ms on localhost (1/1)
19/04/10 14:15:55 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool
19/04/10 14:15:55 INFO DAGScheduler: ResultStage 4 (first at <console>:32) finished in 0.007 s
19/04/10 14:15:55 INFO DAGScheduler: Job 4 finished: first at <console>:32, took 0.012483 s
19/04/10 14:15:55 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool
res3: Array[String] = Array(Christopher, Jan 11, 2015, 5)
scala> 19/04/10 14:20:55 WARN SparkContext: Killing executors is only supported in coarse-grained mode
19/04/10 14:20:55 WARN ExecutorAllocationManager: Unable to reach the cluster manager to kill executor driver!
val
scala> val feedbackmap2 = feedback.map(line=>line.split("|"))
feedbackmap2: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[5] at map at <console>:29
scala> feedbackmap2.first
19/04/10 14:22:58 INFO SparkContext: Starting job: first at <console>:32
19/04/10 14:22:58 INFO DAGScheduler: Got job 5 (first at <console>:32) with 1 output partitions
19/04/10 14:22:58 INFO DAGScheduler: Final stage: ResultStage 5 (first at <console>:32)
19/04/10 14:22:58 INFO DAGScheduler: Parents of final stage: List()
19/04/10 14:22:58 INFO DAGScheduler: Missing parents: List()
19/04/10 14:22:58 INFO DAGScheduler: Submitting ResultStage 5 (MapPartitionsRDD[5] at map at <console>:29), which has no missing parents
19/04/10 14:22:58 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 3.4 KB, free 510.7 MB)
19/04/10 14:22:58 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 2003.0 B, free 510.7 MB)
19/04/10 14:22:58 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost:43371 (size: 2003.0 B, free: 511.1 MB)
19/04/10 14:22:58 INFO SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:1008
19/04/10 14:22:58 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 5 (MapPartitionsRDD[5] at map at <console>:29)
19/04/10 14:22:58 INFO TaskSchedulerImpl: Adding task set 5.0 with 1 tasks
19/04/10 14:22:58 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID 6, localhost, partition 0,ANY, 2171 bytes)
19/04/10 14:22:58 INFO Executor: Running task 0.0 in stage 5.0 (TID 6)
19/04/10 14:22:58 INFO HadoopRDD: Input split: hdfs://nn01.itversity.com:8020/user/junyanxu/scenario_37/feedback.txt:0+58
19/04/10 14:22:58 INFO Executor: Finished task 0.0 in stage 5.0 (TID 6). 2244 bytes result sent to driver
19/04/10 14:22:58 INFO TaskSetManager: Finished task 0.0 in stage 5.0 (TID 6) in 12 ms on localhost (1/1)
19/04/10 14:22:58 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool
19/04/10 14:22:58 INFO DAGScheduler: ResultStage 5 (first at <console>:32) finished in 0.012 s
19/04/10 14:22:58 INFO DAGScheduler: Job 5 finished: first at <console>:32, took 0.040166 s
res4: Array[String] = Array(C, h, r, i, s, t, o, p, h, e, r, |, J, a, n, " ", 1, 1, ,, " ", 2, 0, 1, 5, |, 5)
In Scala, single quotes denote a Char, so split('|') splits on the literal | character. Double quotes denote a String, and split with a String argument treats it as a regular expression, so the unescaped | is interpreted as the regex "or" (alternation) operator.
I think Arnon Rotem-Gal-Oz made a good point about the meaning of | inside a string passed to split: it is a logical operator (alternation).
More precisely, the regex "|" means "empty string or empty string". Because the empty string matches at every position in a String (you can think of "ab" as "a" + "" + "b"), the split happens between every pair of characters.
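To make the "empty string or empty string" point concrete, here is a minimal sketch comparing the regex "|" with the empty regex "". The object and method names are made up for illustration, and the results in the comments assume Java 8 or later, where a zero-width match at the start of the input does not produce a leading empty string.

```scala
object EmptyAlternationDemo {
  // Hypothetical helper: split with Java's regex-based String.split and
  // return a List so that results can be compared with ==.
  def splitRegex(input: String, regex: String): List[String] =
    input.split(regex).toList

  def main(args: Array[String]): Unit = {
    val line = "a|b"
    // "|" is an alternation of two empty patterns, so it matches the empty
    // string at every position, exactly like the empty regex "".
    println(splitRegex(line, "|")) // List(a, |, b) on Java 8+
    println(splitRegex(line, ""))  // same result as the line above
  }
}
```

Both calls cut the string between every pair of characters, which is the behaviour seen in res4 above.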
See also "scala string.split does not work", which states:
If you use split('|') or split("""\|""") you should get what you want.
Indeed, an escaped | is no longer treated as a logical operator in the regular expression but as the literal character itself.
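As a sketch of the options mentioned above, the following compares splitting on the Char, on an escaped regex, and on a regex built with java.util.regex.Pattern.quote (the last one is not from the answers, just another standard way to get a literal pattern). The object and method names are hypothetical.

```scala
import java.util.regex.Pattern

object PipeSplitDemo {
  // Split on the literal '|' character (Char overload from Scala's StringOps).
  def byChar(line: String): List[String] = line.split('|').toList

  // Split on an escaped regex; the triple-quoted string avoids doubling the backslash.
  def byEscapedRegex(line: String): List[String] = line.split("""\|""").toList

  // Split on a regex that Pattern.quote has turned into a literal match for "|".
  def byQuotedRegex(line: String): List[String] = line.split(Pattern.quote("|")).toList

  def main(args: Array[String]): Unit = {
    val line = "Christopher|Jan 11, 2015, |5"
    // All three variants yield the same three fields:
    // "Christopher", "Jan 11, 2015, " and "5".
    println(byChar(line))
    println(byEscapedRegex(line))
    println(byQuotedRegex(line))
  }
}
```

Note that the REPL prints Array(Christopher, Jan 11, 2015, 5), which looks like four elements only because the second field itself contains a comma.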

Getting NullPointerException in Scala/Spark code

I am reading a CSV file in Spark using Scala. The file has a null value in, say, line 10, so my code throws a NullPointerException at that line and does not print the records after it.
Below is my code:
import org.apache.spark.sql.SparkSession
import java.lang.Long
object HighestGDP {
  def main(args:Array[String]){
    val spark = SparkSession.builder().appName("GDP").master("local").getOrCreate()
    val data = spark.read.csv("D:\\BGH\\Spark\\World_Bank_Indicators.csv").rdd
    val result = data.filter(line=>line.getString(1).substring(4,8).equals("2009")||line.getString(1).substring(4,8).equals("2010"))
    result.foreach(println)
    var gdp2009 = result.filter(rec=>rec.getString(1).substring(4,8).equals("2009"))
      .map{line=>{
        var GDP= 0L
        if(line.getString(19).equals(null))
          GDP=0L
        else
          GDP= line.getString(19).replaceAll(",", "").toLong
        (line.getString(0),GDP)
      }}
    gdp2009.foreach(println)
    result.foreach(println)
  }
}
Is there any way to set the value to 0 where it is null? I tried an if/else, but it still does not work.
ERROR:
18/03/06 22:56:01 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1208 bytes result sent to driver
18/03/06 22:56:01 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 297 ms on localhost (1/1)
18/03/06 22:56:01 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
18/03/06 22:56:01 INFO DAGScheduler: ResultStage 1 (foreach at HighestGDP.scala:12) finished in 0.297 s
18/03/06 22:56:01 INFO DAGScheduler: Job 1 finished: foreach at HighestGDP.scala:12, took 0.346954 s
18/03/06 22:56:01 INFO SparkContext: Starting job: foreach at HighestGDP.scala:21
18/03/06 22:56:01 INFO DAGScheduler: Got job 2 (foreach at HighestGDP.scala:21) with 1 output partitions
18/03/06 22:56:01 INFO DAGScheduler: Final stage: ResultStage 2 (foreach at HighestGDP.scala:21)
18/03/06 22:56:01 INFO DAGScheduler: Parents of final stage: List()
18/03/06 22:56:01 INFO DAGScheduler: Missing parents: List()
18/03/06 22:56:01 INFO DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[12] at map at HighestGDP.scala:14), which has no missing parents
18/03/06 22:56:01 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 15.3 KB, free 355.2 MB)
18/03/06 22:56:01 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 7.1 KB, free 355.2 MB)
18/03/06 22:56:01 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on 13.133.209.137:57085 (size: 7.1 KB, free: 355.5 MB)
18/03/06 22:56:01 INFO SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1012
18/03/06 22:56:01 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 2 (MapPartitionsRDD[12] at map at HighestGDP.scala:14)
18/03/06 22:56:01 INFO TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
18/03/06 22:56:01 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, partition 0, PROCESS_LOCAL, 5918 bytes)
18/03/06 22:56:01 INFO Executor: Running task 0.0 in stage 2.0 (TID 2)
18/03/06 22:56:01 INFO FileScanRDD: Reading File path: file:///D:/BGH/Spark/World_Bank_Indicators.csv, range: 0-260587, partition values: [empty row]
(Afghanistan,425)
(Albania,3796)
(Algeria,3952)
18/03/06 22:56:01 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.NullPointerException
at HighestGDP$$anonfun$3.apply(HighestGDP.scala:15)
at HighestGDP$$anonfun$3.apply(HighestGDP.scala:14)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:894)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:894)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
18/03/06 22:56:01 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, localhost): java.lang.NullPointerException
at HighestGDP$$anonfun$3.apply(HighestGDP.scala:15)
at HighestGDP$$anonfun$3.apply(HighestGDP.scala:14)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:894)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:894)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
18/03/06 22:56:01 ERROR TaskSetManager: Task 0 in stage 2.0 failed 1 times; aborting job
18/03/06 22:56:01 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
18/03/06 22:56:01 INFO TaskSchedulerImpl: Cancelling stage 2
18/03/06 22:56:01 INFO DAGScheduler: ResultStage 2 (foreach at HighestGDP.scala:21) failed in 0.046 s
18/03/06 22:56:01 INFO DAGScheduler: Job 2 failed: foreach at HighestGDP.scala:21, took 0.046961 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): java.lang.NullPointerException
at HighestGDP$$anonfun$3.apply(HighestGDP.scala:15)
at HighestGDP$$anonfun$3.apply(HighestGDP.scala:14)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:894)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:894)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1873)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1886)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1899)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1913)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:894)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:892)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.foreach(RDD.scala:892)
at HighestGDP$.main(HighestGDP.scala:21)
at HighestGDP.main(HighestGDP.scala)
Caused by: java.lang.NullPointerException
at HighestGDP$$anonfun$3.apply(HighestGDP.scala:15)
at HighestGDP$$anonfun$3.apply(HighestGDP.scala:14)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:894)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:894)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
18/03/06 22:56:01 INFO SparkContext: Invoking stop() from shutdown hook
18/03/06 22:56:01 INFO SparkUI: Stopped Spark web UI at http://13.133.209.137:4040
18/03/06 22:56:01 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/03/06 22:56:01 INFO MemoryStore: MemoryStore cleared
18/03/06 22:56:01 INFO BlockManager: BlockManager stopped
18/03/06 22:56:01 INFO BlockManagerMaster: BlockManagerMaster stopped
18/03/06 22:56:01 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/03/06 22:56:01 INFO SparkContext: Successfully stopped SparkContext
18/03/06 22:56:01 INFO ShutdownHookManager: Shutdown hook called
18/03/06 22:56:01 INFO ShutdownHookManager: Deleting directory C:\Users\kumar.harsh\AppData\Local\Temp\spark-65330823-f67a-4a9d-acaf-42478e3b7109
I guess the problem is line.getString(19).equals(null). If line.getString(19) returns null, you cannot call the equals method on it (that call is what throws the NullPointerException). Instead of this check, use line.getString(19) == null.
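An alternative to an explicit == null check is to wrap the value in Option, which turns null into None. The sketch below is only an illustration: NullSafeGdp and parseGdp are hypothetical names, and treating a blank string as 0 is an extra assumption that goes beyond the question.

```scala
object NullSafeGdp {
  // Hypothetical helper: parse a possibly-null number with comma grouping.
  // Option(raw) is None when raw is null, so no method is ever called on null.
  // Assumption: a blank value is also treated as 0.
  def parseGdp(raw: String): Long =
    Option(raw)
      .map(_.replaceAll(",", "").trim)
      .filter(_.nonEmpty)
      .map(_.toLong)
      .getOrElse(0L)

  def main(args: Array[String]): Unit = {
    println(parseGdp("3,796")) // 3796
    println(parseGdp(null))    // 0
  }
}
```

Inside the map from the question, this would be used as (line.getString(0), parseGdp(line.getString(19))). Spark's Row also offers isNullAt(19) if you prefer to test the column before reading it.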
One more hint: try to avoid hard-coding the Spark master in your code, as that will cause problems later on. See the discussion on: Spark job with explicit setMaster("local"), passed to spark-submit with YARN.

Spark Exits with exception

This is the stackTrace that I am getting while running the application:
16/11/03 11:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 233 on executor id: 4 hostname: 10.178.149.243.
16/11/03 11:25:45 WARN TaskSetManager: Lost task 1.0 in stage 11.0 (TID 217, 10.178.149.243): java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/11/03 11:25:45 INFO TaskSetManager: Lost task 14.0 in stage 11.0 (TID 225) on executor 10.178.149.243: java.util.NoSuchElementException (None.get) [duplicate 1]
16/11/03 11:25:45 INFO TaskSetManager: Starting task 14.1 in stage 11.0 (TID 234, 10.178.149.243, partition 14, NODE_LOCAL, 8828 bytes)
16/11/03 11:25:45 INFO TaskSetManager: Lost task 22.0 in stage 11.0 (TID 232) on executor 10.178.149.243: java.util.NoSuchElementException (None.get) [duplicate 2]
16/11/03 11:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 234 on executor id: 4 hostname: 10.178.149.243.
16/11/03 11:25:45 INFO TaskSetManager: Starting task 22.1 in stage 11.0 (TID 235, 10.178.149.243, partition 22, NODE_LOCAL, 9066 bytes)
16/11/03 11:25:45 INFO TaskSetManager: Lost task 24.0 in stage 11.0 (TID 233) on executor 10.178.149.243: java.util.NoSuchElementException (None.get) [duplicate 3]
16/11/03 11:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 235 on executor id: 4 hostname: 10.178.149.243.
16/11/03 11:25:45 INFO TaskSetManager: Starting task 24.1 in stage 11.0 (TID 236, 10.178.149.243, partition 24, NODE_LOCAL, 9185 bytes)
16/11/03 11:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 236 on executor id: 4 hostname: 10.178.149.243.
16/11/03 11:25:45 INFO TaskSetManager: Lost task 22.1 in stage 11.0 (TID 235) on executor 10.178.149.243: java.util.NoSuchElementException (None.get) [duplicate 4]
16/11/03 11:25:45 INFO TaskSetManager: Starting task 22.2 in stage 11.0 (TID 237, 10.178.149.243, partition 22, NODE_LOCAL, 9066 bytes)
16/11/03 11:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 237 on executor id: 4 hostname: 10.178.149.243.
16/11/03 11:25:45 INFO TaskSetManager: Lost task 14.1 in stage 11.0 (TID 234) on executor 10.178.149.243: java.util.NoSuchElementException (None.get) [duplicate 5]
16/11/03 11:25:45 INFO TaskSetManager: Starting task 14.2 in stage 11.0 (TID 238, 10.178.149.243, partition 14, NODE_LOCAL, 8828 bytes)
16/11/03 11:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 238 on executor id: 4 hostname: 10.178.149.243.
16/11/03 11:25:45 INFO TaskSetManager: Lost task 24.1 in stage 11.0 (TID 236) on executor 10.178.149.243: java.util.NoSuchElementException (None.get) [duplicate 6]
16/11/03 11:25:45 INFO TaskSetManager: Starting task 24.2 in stage 11.0 (TID 239, 10.178.149.243, partition 24, NODE_LOCAL, 9185 bytes)
16/11/03 11:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 239 on executor id: 4 hostname: 10.178.149.243.
16/11/03 11:25:45 INFO TaskSetManager: Lost task 22.2 in stage 11.0 (TID 237) on executor 10.178.149.243: java.util.NoSuchElementException (None.get) [duplicate 7]
16/11/03 11:25:45 INFO TaskSetManager: Starting task 22.3 in stage 11.0 (TID 240, 10.178.149.243, partition 22, NODE_LOCAL, 9066 bytes)
16/11/03 11:25:45 INFO TaskSetManager: Lost task 14.2 in stage 11.0 (TID 238) on executor 10.178.149.243: java.util.NoSuchElementException (None.get) [duplicate 8]
16/11/03 11:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 240 on executor id: 4 hostname: 10.178.149.243.
16/11/03 11:25:45 INFO TaskSetManager: Starting task 14.3 in stage 11.0 (TID 241, 10.178.149.243, partition 14, NODE_LOCAL, 8828 bytes)
16/11/03 11:25:45 INFO TaskSetManager: Lost task 24.2 in stage 11.0 (TID 239) on executor 10.178.149.243: java.util.NoSuchElementException (None.get) [duplicate 9]
16/11/03 11:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 241 on executor id: 4 hostname: 10.178.149.243.
16/11/03 11:25:45 INFO TaskSetManager: Starting task 24.3 in stage 11.0 (TID 242, 10.178.149.243, partition 24, NODE_LOCAL, 9185 bytes)
16/11/03 11:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 242 on executor id: 4 hostname: 10.178.149.243.
16/11/03 11:25:45 INFO TaskSetManager: Lost task 22.3 in stage 11.0 (TID 240) on executor 10.178.149.243: java.util.NoSuchElementException (None.get) [duplicate 10]
16/11/03 11:25:45 ERROR TaskSetManager: Task 22 in stage 11.0 failed 4 times; aborting job
16/11/03 11:25:45 INFO TaskSetManager: Starting task 0.0 in stage 12.0 (TID 243, 10.178.149.243, partition 0, NODE_LOCAL, 10016 bytes)
16/11/03 11:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 243 on executor id: 4 hostname: 10.178.149.243.
16/11/03 11:25:45 INFO TaskSetManager: Lost task 14.3 in stage 11.0 (TID 241) on executor 10.178.149.243: java.util.NoSuchElementException (None.get) [duplicate 11]
16/11/03 11:25:45 INFO TaskSchedulerImpl: Cancelling stage 12
16/11/03 11:25:45 INFO TaskSchedulerImpl: Stage 12 was cancelled
16/11/03 11:25:45 INFO TaskSetManager: Starting task 0.0 in stage 14.0 (TID 244, 10.178.149.243, partition 0, NODE_LOCAL, 7638 bytes)
16/11/03 11:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 244 on executor id: 4 hostname: 10.178.149.243.
16/11/03 11:25:45 INFO TaskSetManager: Lost task 24.3 in stage 11.0 (TID 242) on executor 10.178.149.243: java.util.NoSuchElementException (None.get) [duplicate 12]
16/11/03 11:25:45 INFO DAGScheduler: ShuffleMapStage 12 (show at RNFBackTagger.scala:97) failed in 0.112 s
16/11/03 11:25:45 INFO TaskSchedulerImpl: Cancelling stage 14
16/11/03 11:25:45 INFO TaskSchedulerImpl: Stage 14 was cancelled
16/11/03 11:25:45 INFO DAGScheduler: ShuffleMapStage 14 (show at RNFBackTagger.scala:97) failed in 0.104 s
16/11/03 11:25:45 INFO TaskSchedulerImpl: Cancelling stage 11
16/11/03 11:25:45 INFO TaskSchedulerImpl: Stage 11 was cancelled
16/11/03 11:25:45 INFO DAGScheduler: ShuffleMapStage 11 (show at RNFBackTagger.scala:97) failed in 0.126 s
16/11/03 11:25:45 WARN TaskSetManager: Lost task 0.0 in stage 12.0 (TID 243, 10.178.149.243): java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/11/03 11:25:45 INFO DAGScheduler: Job 7 failed: show at RNFBackTagger.scala:97, took 0.141681 s
16/11/03 11:25:45 INFO TaskSchedulerImpl: Removed TaskSet 12.0, whose tasks have all completed, from pool
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 22 in stage 11.0 failed 4 times, most recent failure: Lost task 22.3 in stage 11.0 (TID 240, 10.178.149.243): java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:347)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2189)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1925)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1924)
at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2562)
at org.apache.spark.sql.Dataset.head(Dataset.scala:1924)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2139)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:239)
at org.apache.spark.sql.Dataset.show(Dataset.scala:526)
at com.knoldus.xml.RNFBackTagger$.main(RNFBackTagger.scala:97)
at com.knoldus.xml.RNFBackTagger.main(RNFBackTagger.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/11/03 11:25:45 WARN JobProgressListener: Task start for unknown stage 12
16/11/03 11:25:45 WARN TaskSetManager: Lost task 0.0 in stage 14.0 (TID 244, 10.178.149.243): java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/11/03 11:25:45 INFO TaskSchedulerImpl: Removed TaskSet 14.0, whose tasks have all completed, from pool
16/11/03 11:25:45 INFO SparkContext: Invoking stop() from shutdown hook
16/11/03 11:25:45 WARN JobProgressListener: Task start for unknown stage 14
16/11/03 11:25:45 INFO SerialShutdownHooks: Successfully executed shutdown hook: Clearing session cache for C* connector
16/11/03 11:25:45 INFO TaskSetManager: Finished task 5.0 in stage 11.0 (TID 219) in 137 ms on 10.178.149.22 (1/35)
16/11/03 11:25:45 INFO SparkUI: Stopped Spark web UI at http://10.178.149.133:4040
16/11/03 11:25:45 INFO StandaloneSchedulerBackend: Shutting down all executors
16/11/03 11:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
16/11/03 11:25:45 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:152)
at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:132)
at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:571)
at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:179)
at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:108)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
16/11/03 11:25:45 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:152)
at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:132)
at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:571)
at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:179)
at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:108)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
16/11/03 11:25:45 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/11/03 11:25:45 INFO MemoryStore: MemoryStore cleared
16/11/03 11:25:45 INFO BlockManager: BlockManager stopped
16/11/03 11:25:45 INFO BlockManagerMaster: BlockManagerMaster stopped
16/11/03 11:25:45 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/11/03 11:25:45 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped.
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:150)
at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:132)
at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:571)
at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:179)
at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:108)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
16/11/03 11:25:45 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped.
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:150)
at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:132)
at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:571)
at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:179)
at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:108)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
16/11/03 11:25:45 INFO SparkContext: Successfully stopped SparkContext
16/11/03 11:25:45 INFO ShutdownHookManager: Shutdown hook called
16/11/03 11:25:45 INFO ShutdownHookManager: Deleting directory /tmp/spark-c52a6da9-5702-4128-9950-805d5f9dd75e
At first I was not able to pinpoint the problem.
Then I tried removing unnecessary code, piece by piece.
That showed the problem lies in this:
val groupedDF = selectedDF.groupBy("id").agg(collect_list("name"))
groupedDF.show
If I call show on selectedDF instead, it displays the correct result.
I am using Spark 2.0.0. Please help me understand what the problem is.
Link to the code:
https://gist.github.com/shiv4nsh/0c3f62e3afd95634a6061b405c774582
The show on line 19 prints its result, and the show on line 28 throws this exception.
Server configuration: I have Spark 2.0 running on an 8-core worker with 10 GB of memory, on CentOS.
Script for launching application:
./bin/spark-submit --class com.knoldus.Application /root/code/newCode/project1/target/deployable.jar
Any help is appreciated!
Note: The code works fine in local mode. The error is thrown only when I run it on the cluster.
I had a similar issue. It turned out that my application was creating a new SparkContext every time it tried to load certain classes in the executors. The same problem is very likely in your case if the code that the executors need to load to run certain steps sits in the same 'logical context' as the code that instantiates the SparkContext.
Restructure your code so that the SparkContext is created at most once.
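A minimal sketch of one such restructuring, assuming Spark 2.0's SparkSession API. The object names Transforms and Application, the normalize helper, the app name, and the sample rows are all hypothetical, not taken from the linked gist. The idea is that functions shipped to executors live in an object that holds no Spark handles, while the session is created exactly once inside main on the driver:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, udf}

// Hypothetical helper object: code that executors load lives here,
// and it holds no SparkContext or SparkSession.
object Transforms {
  def normalize(name: String): String = name.trim.toLowerCase
}

object Application {
  def main(args: Array[String]): Unit = {
    // The only place a session (and its SparkContext) is created; runs on the driver.
    // The master URL is expected to come from spark-submit.
    val spark = SparkSession.builder()
      .appName("single-context-sketch") // hypothetical app name
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data standing in for selectedDF.
    val selectedDF = Seq((1, " Ann"), (1, "Bob "), (2, "Cid")).toDF("id", "name")

    // The UDF closure refers only to the Spark-free helper object.
    val normalizeUdf = udf(Transforms.normalize _)

    val groupedDF = selectedDF
      .withColumn("name", normalizeUdf($"name"))
      .groupBy("id")
      .agg(collect_list("name"))
    groupedDF.show()

    spark.stop()
  }
}
```

Whether this matches the cause in the question depends on how the original code is laid out. The point is only that nothing referenced from a closure should trigger the construction of a second context.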
I had a similar problem, too. It turned out I was inadvertently trying to use the SparkContext inside a UDAF (which runs inside the executors).
More details here: How to collect a single row dataframe and use fields as constants
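A rough sketch of the usual way around that, assuming Spark 2.0 and using a plain UDF rather than a UDAF for brevity. The names Scaling, ConstantsSketch, and factor, and the sample values, are hypothetical. The single-row DataFrame is collected on the driver first, so the function that runs on the executors captures only an ordinary Double and never touches the SparkContext:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// Hypothetical pure helper; safe on executors because it uses no Spark handles.
object Scaling {
  def scale(value: Double, factor: Double): Double = value * factor
}

object ConstantsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("constants-sketch") // hypothetical app name
      .getOrCreate()
    import spark.implicits._

    // Made-up single-row DataFrame holding a constant.
    val configDF = Seq(2.5).toDF("factor")

    // Collect on the driver: `factor` is now a plain Double, not a DataFrame.
    val factor: Double = configDF.head().getDouble(0)

    // The closure captures only the Double value.
    val scaleUdf = udf((v: Double) => Scaling.scale(v, factor))

    Seq(1.0, 2.0).toDF("value")
      .withColumn("scaled", scaleUdf($"value"))
      .show()

    spark.stop()
  }
}
```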

ClassNotFoundException anonfun when deploying Scala code to Spark

I'm new to Apache Spark and I'm trying to deploy a simple piece of Scala code to Spark.
Note: I am trying to connect to an existing running cluster, which I configure via my Java parameters as: spark.master=spark://MyHostName:7077
Environment
Spark 1.5.1 built with Scala 2.10
Spark runs in standalone mode on my local machine
OS: OS X El Capitan
JVM: JDK 1.8.0_60
IDE: IntelliJ IDEA Community 14.1.5
Scala version: 2.10.4
sbt: 0.13.8
Code
import org.apache.spark.{SparkConf, SparkContext}
object HelloSpark {
  def main(args: Array[String]) {
    val logFile = "/README.md"
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    println("%s done!".format(numAs))
  }
}
build.sbt
name := "data-streamer210"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "1.5.1",
  "org.apache.spark" % "spark-streaming_2.10" % "1.5.1",
  "org.apache.spark" % "spark-mllib_2.10" % "1.5.1",
  "org.apache.spark" % "spark-bagel_2.10" % "1.5.1",
  "org.apache.spark" % "spark-streaming-twitter_2.10" % "1.5.1"
)
Error
15/10/19 19:40:09 INFO SparkContext: Starting job: count at HelloSpark.scala:14
15/10/19 19:40:09 INFO DAGScheduler: Got job 0 (count at HelloSpark.scala:14) with 2 output partitions
15/10/19 19:40:09 INFO DAGScheduler: Final stage: ResultStage 0(count at HelloSpark.scala:14)
15/10/19 19:40:09 INFO DAGScheduler: Parents of final stage: List()
15/10/19 19:40:09 INFO DAGScheduler: Missing parents: List()
15/10/19 19:40:09 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at filter at HelloSpark.scala:14), which has no missing parents
15/10/19 19:40:09 INFO MemoryStore: ensureFreeSpace(3192) called with curMem=120313, maxMem=2061647216
15/10/19 19:40:09 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 1966.0 MB)
15/10/19 19:40:09 INFO MemoryStore: ensureFreeSpace(1892) called with curMem=123505, maxMem=2061647216
15/10/19 19:40:09 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1892.0 B, free 1966.0 MB)
15/10/19 19:40:09 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 127.0.0.1:50941 (size: 1892.0 B, free: 1966.1 MB)
15/10/19 19:40:09 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:861
15/10/19 19:40:09 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at filter at HelloSpark.scala:14)
15/10/19 19:40:09 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
15/10/19 19:40:10 INFO SparkDeploySchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor#127.0.0.1:50951/user/Executor#-147774947]) with ID 0
15/10/19 19:40:10 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 127.0.0.1, PROCESS_LOCAL, 2160 bytes)
15/10/19 19:40:10 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 127.0.0.1, PROCESS_LOCAL, 2160 bytes)
15/10/19 19:40:10 INFO SparkDeploySchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor#127.0.0.1:50952/user/Executor#1450479604]) with ID 2
15/10/19 19:40:10 INFO SparkDeploySchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor#127.0.0.1:50957/user/Executor#1447408721]) with ID 1
15/10/19 19:40:10 INFO SparkDeploySchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor#127.0.0.1:50955/user/Executor#1397136754]) with ID 3
15/10/19 19:40:10 INFO BlockManagerMasterEndpoint: Registering block manager 127.0.0.1:50963 with 530.0 MB RAM, BlockManagerId(0, 127.0.0.1, 50963)
15/10/19 19:40:10 INFO BlockManagerMasterEndpoint: Registering block manager 127.0.0.1:50964 with 530.0 MB RAM, BlockManagerId(2, 127.0.0.1, 50964)
15/10/19 19:40:10 INFO BlockManagerMasterEndpoint: Registering block manager 127.0.0.1:50965 with 530.0 MB RAM, BlockManagerId(1, 127.0.0.1, 50965)
15/10/19 19:40:10 INFO BlockManagerMasterEndpoint: Registering block manager 127.0.0.1:50966 with 530.0 MB RAM, BlockManagerId(3, 127.0.0.1, 50966)
15/10/19 19:40:11 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 127.0.0.1:50963 (size: 1892.0 B, free: 530.0 MB)
15/10/19 19:40:11 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, 127.0.0.1): java.lang.ClassNotFoundException: HelloSpark$$anonfun$1
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/10/19 19:40:11 INFO TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) on executor 127.0.0.1: java.lang.ClassNotFoundException (HelloSpark$$anonfun$1) [duplicate 1]
15/10/19 19:40:11 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 2, 127.0.0.1, PROCESS_LOCAL, 2160 bytes)
15/10/19 19:40:11 INFO TaskSetManager: Starting task 1.1 in stage 0.0 (TID 3, 127.0.0.1, PROCESS_LOCAL, 2160 bytes)
15/10/19 19:40:11 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 127.0.0.1:50966 (size: 1892.0 B, free: 530.0 MB)
15/10/19 19:40:11 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 127.0.0.1:50964 (size: 1892.0 B, free: 530.0 MB)
15/10/19 19:40:11 INFO TaskSetManager: Lost task 1.1 in stage 0.0 (TID 3) on executor 127.0.0.1: java.lang.ClassNotFoundException (HelloSpark$$anonfun$1) [duplicate 2]
15/10/19 19:40:11 INFO TaskSetManager: Starting task 1.2 in stage 0.0 (TID 4, 127.0.0.1, PROCESS_LOCAL, 2160 bytes)
15/10/19 19:40:11 INFO TaskSetManager: Lost task 1.2 in stage 0.0 (TID 4) on executor 127.0.0.1: java.lang.ClassNotFoundException (HelloSpark$$anonfun$1) [duplicate 3]
15/10/19 19:40:11 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 2) on executor 127.0.0.1: java.lang.ClassNotFoundException (HelloSpark$$anonfun$1) [duplicate 4]
15/10/19 19:40:11 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 5, 127.0.0.1, PROCESS_LOCAL, 2160 bytes)
15/10/19 19:40:11 INFO TaskSetManager: Starting task 1.3 in stage 0.0 (TID 6, 127.0.0.1, PROCESS_LOCAL, 2160 bytes)
15/10/19 19:40:11 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 5) on executor 127.0.0.1: java.lang.ClassNotFoundException (HelloSpark$$anonfun$1) [duplicate 5]
15/10/19 19:40:11 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID 7, 127.0.0.1, PROCESS_LOCAL, 2160 bytes)
15/10/19 19:40:11 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 7) on executor 127.0.0.1: java.lang.ClassNotFoundException (HelloSpark$$anonfun$1) [duplicate 6]
15/10/19 19:40:11 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
15/10/19 19:40:11 INFO TaskSchedulerImpl: Cancelling stage 0
15/10/19 19:40:11 INFO TaskSchedulerImpl: Stage 0 was cancelled
15/10/19 19:40:11 INFO DAGScheduler: ResultStage 0 (count at HelloSpark.scala:14) failed in 2.613 s
15/10/19 19:40:11 INFO DAGScheduler: Job 0 failed: count at HelloSpark.scala:14, took 2.716305 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 7, 127.0.0.1): java.lang.ClassNotFoundException: HelloSpark$$anonfun$1
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1835)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1848)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1919)
at org.apache.spark.rdd.RDD.count(RDD.scala:1121)
at HelloSpark$.main(HelloSpark.scala:14)
at HelloSpark.main(HelloSpark.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
Caused by: java.lang.ClassNotFoundException: HelloSpark$$anonfun$1
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/10/19 19:40:11 INFO SparkContext: Invoking stop() from shutdown hook
15/10/19 19:40:11 WARN TaskSetManager: Lost task 1.3 in stage 0.0 (TID 6, 127.0.0.1): org.apache.spark.TaskKilledException
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:204)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/10/19 19:40:11 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/10/19 19:40:11 INFO SparkUI: Stopped Spark web UI at http://127.0.0.1:4040
15/10/19 19:40:11 INFO DAGScheduler: Stopping DAGScheduler
15/10/19 19:40:11 INFO SparkDeploySchedulerBackend: Shutting down all executors
15/10/19 19:40:11 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
15/10/19 19:40:11 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
15/10/19 19:40:11 INFO MemoryStore: MemoryStore cleared
15/10/19 19:40:11 INFO BlockManager: BlockManager stopped
15/10/19 19:40:11 INFO BlockManagerMaster: BlockManagerMaster stopped
15/10/19 19:40:11 INFO SparkContext: Successfully stopped SparkContext
15/10/19 19:40:11 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
15/10/19 19:40:11 INFO ShutdownHookManager: Shutdown hook called
15/10/19 19:40:11 INFO ShutdownHookManager: Deleting directory /private/var/folders/q9/m_d81ms107n09tj8k5wbzfb40000gp/T/spark-53ce9474-5488-4d50-bfb6-c58ddeed7640
Process finished with exit code 1
When you run Spark from IntelliJ, you can connect either to a "local" Spark JVM or to a remote cluster.
If you set your master to local (e.g., setMaster("local[*]")), then any code in your local scope/project is available to the temporary, local (single-JVM) cluster you just created. Everything runs locally and exits when your test ends (if you are running a unit test) or when you exit the app (if you are running it as an app inside IntelliJ).
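As a minimal sketch of the local case (the object name HelloSparkLocal and the job body are only illustrative, and it assumes spark-core is on the project's classpath), something like this runs entirely inside the IntelliJ-launched JVM, so the closure class never has to be shipped anywhere:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative name; not the asker's actual HelloSpark object.
object HelloSparkLocal {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("HelloSparkLocal")
      .setMaster("local[*]") // driver and executors share this one JVM

    val sc = new SparkContext(conf)
    try {
      val distData = sc.parallelize(1 to 10000)
      // The closure passed to filter is loaded from the project's own classpath.
      val small = distData.filter(_ < 10).count() // counts the values 1 through 9
      println(s"values below 10: $small")
    } finally {
      sc.stop() // release the local "cluster" when the app exits
    }
  }
}
```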
However, if you point the master at a remote cluster (say setMaster("spark://localhost:7077")), you need to make sure that the cluster has access to your new code. In your case it needs the compiled closure you are passing to filter, which is the HelloSpark$$anonfun$1 class the executors report as missing.
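To see why the executors complain, here is a small sketch that uses only the JDK and no Spark at all. It assumes the failure works the way the stack trace suggests: Java serialization sends the class name and field data, not the bytecode, so the receiving side must already be able to load the class. LessThanTen and ClosureShippingDemo are made-up names standing in for the compiled closure and for the driver/executor pair.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream, ObjectStreamClass}
import java.net.{URL, URLClassLoader}

// Stand-in for a compiled closure class such as HelloSpark$$anonfun$1.
class LessThanTen extends (Int => Boolean) with Serializable {
  def apply(x: Int): Boolean = x < 10
}

object ClosureShippingDemo {
  // "Driver" side: the byte stream carries the class name, not the class file.
  def serialize(obj: AnyRef): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(obj) finally out.close()
    buffer.toByteArray
  }

  // "Executor" side: classes named in the stream are looked up in the given loader only.
  def deserialize(bytes: Array[Byte], loader: ClassLoader): AnyRef = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes)) {
      override def resolveClass(desc: ObjectStreamClass): Class[_] =
        Class.forName(desc.getName, false, loader)
    }
    try in.readObject() finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val bytes = serialize(new LessThanTen)

    // A loader that can see the application classes: deserialization works.
    val f = deserialize(bytes, classOf[LessThanTen].getClassLoader).asInstanceOf[Int => Boolean]
    println(s"with app classes: f(3) = ${f(3)}")

    // A loader with no URLs and no parent sees only JDK classes,
    // much like an executor that never received the application jar.
    val withoutAppClasses = new URLClassLoader(Array.empty[URL], null)
    try deserialize(bytes, withoutAppClasses)
    catch { case e: ClassNotFoundException => println(s"without app classes: $e") }
  }
}
```

The second call fails with a ClassNotFoundException raised from URLClassLoader.findClass, which is the same frame that sits at the top of the executor stack trace in the question.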
When I want to execute a new piece of code on a running Spark cluster, I usually package my app as an uber jar (see sbt-assembly) and pass that jar as an argument to spark-submit (see the spark-submit documentation for more details).
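If you would rather keep launching from IntelliJ against the remote master, one alternative is to tell Spark where the uber jar is via SparkConf.setJars, so the executors can fetch it. This is only a sketch: the jar path and the object name HelloSparkRemote are hypothetical, it assumes a standalone master is listening on spark://localhost:7077, and the jar has to be rebuilt whenever the code changes.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative name; not the asker's actual HelloSpark object.
object HelloSparkRemote {
  def main(args: Array[String]): Unit = {
    // Hypothetical path: wherever your build writes the assembled (uber) jar.
    val appJar = "target/scala-2.10/hello-spark-assembly-1.0.jar"

    val conf = new SparkConf()
      .setAppName("HelloSparkRemote")
      .setMaster("spark://localhost:7077") // assumed standalone master URL
      .setJars(Seq(appJar))                // jars to distribute to the cluster

    val sc = new SparkContext(conf)
    try {
      // The closure class for filter is now inside appJar, so executors can load it.
      val small = sc.parallelize(1 to 10000).filter(_ < 10).count()
      println(s"values below 10: $small")
    } finally {
      sc.stop()
    }
  }
}
```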
There's also an interesting interaction if you call setMaster in your code, even if you have it set to the right master. For example, I had code like this:
val conf = new SparkConf().setAppName("Simple Application").setMaster("spark://greine:7077")
that I submitted like this:
bin/spark-submit --class SimpleApp --master yarn --deploy-mode cluster /Users/james/Projects/sparkHelloWorld/target/scala-2.11/sparkHelloWorld-assembly-1.0.jar
I believe the jar (sparkHelloWorld-assembly-1.0.jar) was built correctly and contained all the required class files. The job still failed with this error:
17/04/08 09:19:08 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 5, 10.178.252.14, executor 1): java.lang.ClassNotFoundException: SimpleApp$$anonfun$1
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1819)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1986)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:80)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
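Once I removed the call to setMaster("spark://greine:7077"), the job ran and completed correctly with the same spark-submit command. This is consistent with Spark's configuration precedence: properties set directly on SparkConf override flags passed to spark-submit, so the hard-coded master conflicted with --master yarn.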
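A way to avoid the clash while still being able to run from an IDE is to leave the master out of the code and only supply a fallback. The sketch below uses SparkConf.setIfMissing for that; the local[*] fallback and the job body are my own placeholders, not part of the original SimpleApp, and it assumes spark-submit hands its --master value to the driver as the spark.master property.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Simple Application")
      // No setMaster call: spark-submit's --master decides where the job runs.
      // The fallback below applies only when spark.master was not supplied at all,
      // for example when the app is started directly from an IDE.
      .setIfMissing("spark.master", "local[*]")

    val sc = new SparkContext(conf)
    try {
      // Placeholder job; substitute the real application logic.
      val small = sc.parallelize(1 to 10000).filter(_ < 10).count()
      println(s"values below 10: $small")
    } finally {
      sc.stop()
    }
  }
}
```

With this version the spark-submit command shown earlier can stay unchanged, and the same jar works with a different --master.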