Spark unit test fails due to stage failure - scala

I recently upgraded an application from Spark 1.4.1 to 1.6.0, and now the unit tests in my application (written with ScalaTest 3.0) suddenly fail, and not because of API or behavioural changes in Spark.
The weird thing is that each time I run the tests with sbt test, a different test fails, always with the following message:
[info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 87 in stage 206.0 failed 1 times, most recent failure: Lost task 87.0 in stage 206.0 (TID 4228, localhost): ExecutorLostFailure (executor driver exited caused by one of the running
tasks) Reason: Executor heartbeat timed out after 148400 ms
[info] Driver stacktrace:
[info] at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
[info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
[info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
[info] at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
[info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
[info] at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
[info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
[info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
[info] at scala.Option.foreach(Option.scala:236)
[info] at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
[info] ...
I have set the following in build.sbt:
javaOptions in test += "-Xmx2G"
fork in test := true
parallelExecution in test := false
So the unit tests themselves are fine, but there is something going on that I cannot put my finger on. Does anyone have an idea?

Since this code has been working, I suspect the default memory settings (either executor or driver or overhead) may have changed with the upgrade.
Please post the YARN logs for your application ID; they will have more details about the error.
Also, see this link describing a similar error: https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Executor-Timed-Out/td-p/45097
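If it turns out to be the timeout itself rather than memory, another thing worth trying is raising the heartbeat and network timeouts for the test context. A minimal sketch, assuming a local-mode SparkContext created in the test suite (the values are illustrative, not the defaults that changed):
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative test setup: more frequent heartbeats and a longer network
// timeout so long-running test stages are not reported as ExecutorLostFailure.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("unit-tests")
  .set("spark.executor.heartbeatInterval", "20s")
  .set("spark.network.timeout", "300s")

val sc = new SparkContext(conf)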

Related

Resolving the issue "It is possible the underlying files have been updated" in Databricks

val mnt_point_write="/mnt/pnt"
ord_JsonDF.write.mode("overwrite").format("json").option("header",true).json(mnt_point_write+"/Processed_file")
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 114.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 114.0 (TID 13876, 10.139.64.40, executor 27):
com.databricks.sql.io.FileReadException: Error while reading file dbfs:/mnt-02/AF_tab/Processed/YYYY=2019/MM=07/DD=21/part-00000-tid-677839983764655717-ahfuhaufhehfhurfawefkjfaffadfe-2685-1.c000.csv.
It is possible the underlying files have been updated.
You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
I am running into the above issue. Can anyone help me resolve it, please?
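As the error message itself suggests, one thing to try is invalidating Spark's cached file listing for the affected path or table before reading or overwriting it again. A minimal sketch, assuming the mount path from the stack trace and a placeholder table name:
// "spark" is the SparkSession that Databricks provides in a notebook.
spark.catalog.refreshByPath("dbfs:/mnt-02/AF_tab/Processed")  // invalidate the cached file listing for this path
spark.sql("REFRESH TABLE my_table")                           // or refresh a registered table ("my_table" is a placeholder)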

OCI error "/opt/docker/bin/my_job": no such file or directory when using sbt docker:publishLocal

If you use sbt docker:publishLocal to build a Docker image from your Scala project, you will see lines like the following printed out:
[info] Packaging /home/user123/myUser/repos/my_job/target/scala-2.12/app_internal_2.12-0.1.jar ...
[info] Done packaging.
[info] Sending build context to Docker daemon 129.7MB
[info] Step 1/7 : FROM openjdk:11-jre
[info] ---> 8c8b7f0ab84c
[info] Step 2/7 : LABEL MAINTAINER="no_name#my.org"
[info] ---> Using cache
[info] ---> d5caf9a92999
[info] Step 3/7 : WORKDIR /opt/docker
[info] ---> Using cache
[info] ---> d887eeb10e8e
[info] Step 4/7 : ADD --chown=root:root opt /opt
[info] ---> 1b43a84a5e32
[info] Step 5/7 : USER root
[info] ---> Running in 282c7f7de8ad
[info] Removing intermediate container 282c7f7de8ad
[info] ---> 11fed4892683
[info] Step 6/7 : ENTRYPOINT ["/opt/docker/bin/my_job"]
[info] ---> Running in 1d297dd1e960
[info] Removing intermediate container 1d297dd1e960
[info] ---> 1923a8df3fcf
[info] Step 7/7 : CMD []
[info] ---> Running in 3d9f7a4a262b
[info] Removing intermediate container 3d9f7a4a262b
[info] ---> d67ed46fd3fe
[info] Successfully built d67ed46fd3fe
[info] Successfully tagged docker_app_internal:0.1
[info] Built image docker_app_internal with tags [0.1]
[success] Total time: 25 s, completed Mar 27, 2019 10:23:35 AM
You may be confused by the error because of the following behaviour:
This works:
docker run -it --entrypoint=/bin/bash docker_app_internal:0.1 -i
This does not work:
docker run docker_app_internal:0.1
Thanks to #Muki for creating this helpful project.
Refer: https://github.com/sbt/sbt-native-packager
If your project root folder name differs from your main class name, then the entrypoint generated by sbt docker:publishLocal becomes /your/linuxpath/bin/rootFolder. However, the actual launcher script created in the Docker image is /your/linuxpath/bin/main-class (if your main class is named MainClass).
To fix this, explicitly set the entrypoint in build.sbt as below:
dockerEntrypoint := Seq("/opt/docker/bin/main-class")
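For context, a minimal build.sbt sketch using sbt-native-packager (com.example.MainClass is a placeholder for your own main class):
enablePlugins(JavaAppPackaging, DockerPlugin)

// "com.example.MainClass" is a placeholder; the packager turns "MainClass"
// into the hyphenated launcher script "main-class" under /opt/docker/bin.
mainClass in Compile := Some("com.example.MainClass")
dockerBaseImage := "openjdk:11-jre"

// Point the entrypoint at the script the packager actually generates,
// not at the project/root folder name.
dockerEntrypoint := Seq("/opt/docker/bin/main-class")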

Failed to get broadcast_22_piece0 of broadcast_22

When I run a Scala application on a Spark cluster in YARN mode (Spark version 2.2.0), the application uses the Pregel model and each vertex in the data graph sends messages. The exception information is as follows:
Exception in thread "main" org.apache.spark.SparkException:
Job aborted due to stage failure: Task 29 in stage 25.0 failed 4 times,
most recent failure: Lost task 29.3 in stage 25.0 (TID 1632, 192.168.1.5, executor 1): java.io.IOException: org.apache.spark.SparkException:
Failed to get broadcast_22_piece0 of broadcast_22
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1310)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Failed to get broadcast_22_piece0 of broadcast_22
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:178)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:150)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:150)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:150)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:222)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
... 12 more
I have searched for the exception online, and one of the suggestions is to add the statement .set("spark.cleaner.ttl","2000"),
but it still does not work.
Can you help me? Thanks a lot.
Some snippets that may cause the above exception are as follows:
val spark = SparkSession.builder.master("spark://node01.:7077").appName("ioce").getOrCreate()
.......
and in the program, DataFrame joins are used (which, from what I read online, may also be relevant to the exception).
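For reference, here is a sketch of how that suggestion would be wired into the snippet above (the master URL and app name are copied from the question; note that, as far as I know, spark.cleaner.ttl was removed in Spark 2.x, so on 2.2.0 the setting is most likely ignored):
import org.apache.spark.sql.SparkSession

// Sketch only: wiring the suggested property into the builder. On Spark 2.x
// the spark.cleaner.ttl property is, to my knowledge, no longer honoured,
// which would explain why it has no effect.
val spark = SparkSession.builder
  .master("spark://node01.:7077")          // master URL copied from the question
  .appName("ioce")
  .config("spark.cleaner.ttl", "2000")
  .getOrCreate()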

Scala/Spark tests failing with "copyAndReset must return a zero value copy"

After upgrading the Spark framework from 1.6 to 2.1, running my tests results in many occurrences of the following error:
[info] java.lang.AssertionError: assertion failed: copyAndReset must return a zero value copy
[info] at scala.Predef$.assert(Predef.scala:170)
[info] at org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:163)
[info] at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
[info] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[info] at java.lang.reflect.Method.invoke(Method.java:497)
[info] at java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1075)
[info] at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1136)
[info] at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
[info] at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
[info] at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
[info] ...
Any ideas what's happening and how I can fix this?
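For background, the assertion comes from AccumulatorV2.writeReplace, which calls copyAndReset() before serializing the accumulator and asserts that the result isZero. Spark 2.x replaced the old Accumulable API with AccumulatorV2, so a custom accumulator whose copy/reset/isZero methods are inconsistent will trip this check. A minimal sketch of a custom AccumulatorV2 that satisfies the invariant (the name and element type are illustrative):
import org.apache.spark.util.AccumulatorV2

// Collects Long ids into a Set. copyAndReset() (copy followed by reset)
// yields an empty accumulator, so the "zero value copy" assertion holds.
class LongSetAccumulator extends AccumulatorV2[Long, Set[Long]] {
  private var _set = Set.empty[Long]

  override def isZero: Boolean = _set.isEmpty

  override def copy(): LongSetAccumulator = {
    val acc = new LongSetAccumulator
    acc._set = _set
    acc
  }

  override def reset(): Unit = { _set = Set.empty[Long] }

  override def add(v: Long): Unit = { _set += v }

  override def merge(other: AccumulatorV2[Long, Set[Long]]): Unit = { _set ++= other.value }

  override def value: Set[Long] = _set
}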

Spark exception after re-submitting stopped application

I'm running a Spark job (from a Spark notebook) using dynamic allocation with the options
"spark.master": "yarn-client",
"spark.shuffle.service.enabled": "true",
"spark.dynamicAllocation.enabled": "true",
"spark.dynamicAllocation.executorIdleTimeout": "30s",
"spark.dynamicAllocation.cachedExecutorIdleTimeout": "1h",
"spark.dynamicAllocation.minExecutors": "0",
"spark.dynamicAllocation.maxExecutors": "20",
"spark.executor.cores": 2
(Note: I'm not sure yet whether the issue is caused by dynamicAllocation or not)
I'm using Spark version 1.6.1.
If I cancel a running job/app (either by pressing the cancel button on the cell in the notebook, or by shutting down the notebook server and thus the app) and restart the same app shortly after (some minutes), I often get the following exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 4 times, most recent failure: Lost task 1.3 in stage 2.0 (TID 38, i89810.sbb.ch): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_3_piece0 of broadcast_3
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1222)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to get broadcast_3_piece0 of broadcast_3
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:137)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:175)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1219)
... 11 more
Using the YARN ResourceManager, I verified that the old job is no longer running before re-submitting it. Still, I suppose the problem arises because the killed job is not yet fully cleaned up and interferes with the newly launched job?
Has anybody encountered the same issue and knows how to solve it?
You need to set up the external shuffle service when dynamic allocation is enabled. Otherwise, shuffle files are deleted when executors are removed, which is why the Failed to get broadcast_3_piece0 of broadcast_3 exception is thrown.
For more information, see the official Spark documentation on Dynamic Resource Allocation.
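Note that the options in the question already set spark.shuffle.service.enabled, so the part to verify is the cluster side: the shuffle service must also be enabled on every YARN NodeManager (yarn.nodemanager.aux-services including spark_shuffle, backed by org.apache.spark.network.yarn.YarnShuffleService in yarn-site.xml). For completeness, a sketch of the application-side settings:
import org.apache.spark.SparkConf

// Application-side settings for dynamic allocation with the external
// shuffle service, so shuffle files survive executor removal.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "0")
  .set("spark.dynamicAllocation.maxExecutors", "20")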