Spark streaming jobs are failing after running for a few days - scala

I am facing an issue where my Spark streaming jobs keep failing after running for a few days with the error below:
AM Container for appattempt_1610108774021_0354_000001 exited with exitCode: -104
Failing this attempt.Diagnostics: Container [pid=31537,containerID=container_1610108774021_0354_01_000001] is running beyond physical memory limits. Current usage: 5.8 GB of 5.5 GB physical memory used; 8.0 GB of 27.3 GB virtual memory used. Killing container.
Dump of the process-tree for container_1610108774021_0354_01_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 31742 31537 31537 31537 (java) 1583676 58530 8499392512 1507368 /usr/lib/jvm/java-openjdk/bin/java -server -Xmx5078m -
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
spark-submit:
spark-submit --name DWH-CDC-commonJob --deploy-mode cluster --master yarn --conf spark.sql.shuffle.partitions=10 --conf spark.eventLog.enabled=false --conf spark.sql.caseSensitive=true --conf spark.driver.memory=5078M --class com.aos.Loader --jars file:////home/hadoop/lib/* --executor-memory 5000M --conf "spark.alert.duration=4" --conf spark.dynamicAllocation.enabled=false --num-executors 3 --files /home/hadoop/log4j.properties,/home/hadoop/application.conf --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" streams_2.11-1.0.jar application.conf
I have tried increasing spark.executor.memoryOverhead, but the job still fails after a few days. I want to understand how to arrive at a value that lets it run without interruption, or whether there is some other configuration I am missing.
Spark: 2.4
AWS EMR: 5.23
Scala: 2.11.12
Two data nodes (4 vCPU, 16 GB RAM each).
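For context, the 5.5 GB limit in the kill message appears to be the driver heap plus the default memory overhead. A minimal sketch of that arithmetic, assuming Spark's documented default of max(10% of the heap, 384 MB) for spark.driver.memoryOverhead (the exact figure on your cluster may differ):
// Rough sketch, assuming Spark's documented defaults, not values read from this job
val driverHeapMb = 5078                                     // --conf spark.driver.memory=5078M
val overheadMb = math.max(384, (driverHeapMb * 0.10).toInt) // default overhead, ~507 MB here
val containerMb = driverHeapMb + overheadMb                 // ~5585 MB, i.e. the 5.5 GB limit in the error
println(s"AM container size ~= $containerMb MB")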

Related

Why does spark application fail with java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig even though the jar exists?

I am working on a Hadoop cluster that has Spark 2.3.x. For my use case I need Spark 2.4.x, which I downloaded from the internet, moved to my server, and extracted into a new dir: ~/john/spark247ext/spark-2.4.7-bin-hadoop2.7
This is what my Spark 2.4.7 directory looks like:
username#:[~/john/spark247ext/spark-2.4.7-bin-hadoop2.7] {173} $ ls
bin conf data examples jars kubernetes LICENSE licenses NOTICE python R README.md RELEASE sbin yarn
These are the contents of my bin dir.
username#:[~/john/spark247ext/spark-2.4.7-bin-hadoop2.7/bin] {175} $ ls
beeline find-spark-home.cmd pyspark2.cmd spark-class sparkR2.cmd spark-shell.cmd spark-submit
beeline.cmd load-spark-env.cmd pyspark.cmd spark-class2.cmd sparkR.cmd spark-sql spark-submit2.cmd
docker-image-tool.sh load-spark-env.sh run-example spark-class.cmd spark-shell spark-sql2.cmd spark-submit.cmd
find-spark-home pyspark run-example.cmd sparkR spark-shell2.cmd spark-sql.cmd
I am submitting my Spark code using the spark-submit command below:
./spark-submit --master yarn --deploy-mode cluster --driver-class-path /home/john/jars/mssql-jdbc-9.2.0.jre8.jar --jars /home/john/jars/spark-bigquery-with-dependencies_2.11-0.19.1.jar,/home/john/jars/mssql-jdbc-9.2.0.jre8.jar --driver-memory 1g --executor-memory 4g --executor-cores 4 --num-executors 4 --class com.loader /home/john/jars/HiveLoader-1.0-SNAPSHOT-jar-with-dependencies.jar somearg1 somearg2 somearg3
The job fails with the exception java.lang.ClassNotFoundException: com.sun.jersey.api.client.config.ClientConfig,
so I added that jar to my spark-submit command as well, as shown below.
./spark-submit --master yarn --deploy-mode cluster --driver-class-path /home/john/jars/mssql-jdbc-9.2.0.jre8.jar --jars /home/john/jars/spark-bigquery-with-dependencies_2.11-0.19.1.jar,/home/john/jars/mssql-jdbc-9.2.0.jre8.jar,/home/john/jars/jersey-client-1.19.4.jar --driver-memory 1g --executor-memory 4g --executor-cores 4 --num-executors 4 --class com.loader /home/john/jars/HiveLoader-1.0-SNAPSHOT-jar-with-dependencies.jar somearg1 somearg2 somearg3
I also checked the directory /john/spark247ext/spark-2.4.7-bin-hadoop2.7/jars and found that the jar jersey-client-x.xx.x.jar exists there.
username#:[~/john/spark247ext/spark-2.4.7-bin-hadoop2.7/jars] {179} $ ls -ltr | grep jersey
-rwxrwxrwx 1 john john 951701 Sep 8 2020 jersey-server-2.22.2.jar
-rwxrwxrwx 1 john john 72733 Sep 8 2020 jersey-media-jaxb-2.22.2.jar
-rwxrwxrwx 1 john john 971310 Sep 8 2020 jersey-guava-2.22.2.jar
-rwxrwxrwx 1 john john 66270 Sep 8 2020 jersey-container-servlet-core-2.22.2.jar
-rwxrwxrwx 1 john john 18098 Sep 8 2020 jersey-container-servlet-2.22.2.jar
-rwxrwxrwx 1 john john 698375 Sep 8 2020 jersey-common-2.22.2.jar
-rwxrwxrwx 1 john john 167421 Sep 8 2020 jersey-client-2.22.2.jar
I also added the dependency in my pom.xml file:
<dependency>
    <groupId>com.sun.jersey</groupId>
    <artifactId>jersey-client</artifactId>
    <version>1.19.4</version>
</dependency>
Even after passing the jar file in my spark-submit command and also creating a fat jar out of my Maven project that includes all dependencies, I still see the exception:
Exception in thread "main" java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig
at org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:55)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.createTimelineClient(YarnClientImpl.java:181)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:168)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:161)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1135)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1530)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: com.sun.jersey.api.client.config.ClientConfig
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
The Spark I downloaded is for my own use case, so I haven't changed any settings of the existing Spark version on the cluster, which is Spark 2.3.
Could anyone let me know what I should do to fix this issue so that the code runs properly?
Can you try using this property in your spark-submit:
--conf "spark.driver.userClassPathFirst=true"
I think you are getting a jar conflict, where a different version of the same jar is being picked up from the environment.
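One way to confirm that (just a sketch, not part of the suggestion above) is to print where the driver JVM actually resolves the class from, for example in a spark-shell started with the same --jars:
// Quick classpath check: if this throws ClassNotFoundException, no jersey 1.x client jar
// is visible to the driver; if it resolves, the printed location shows which jar wins.
val cls = Class.forName("com.sun.jersey.api.client.config.ClientConfig")
println(cls.getProtectionDomain.getCodeSource.getLocation)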

Spark: More Executors in one machine, longer duration time for each Task

When I run LogisticRegression in Spark, I found that one stage is special: as the number of executors increases, the average task processing time becomes longer. Why could that happen?
Environment:
All servers are local, no cloud.
Server 1: 6 cores, 10 GB memory (Spark master, HDFS master, HDFS slave).
Server 2: 6 cores, 10 GB memory (HDFS slave).
Server 3: 6 cores, 10 GB memory (Spark slave, HDFS slave).
Deployed in Spark standalone mode.
Input file size: large enough to meet the parallelism requirements. Spark reads the file from HDFS.
All the workloads use the same input file.
Note that only Server 3 participates in the calculation (only it becomes a Spark worker).
Special stage DAG
1 core 1g memory
spark-submit --executor-cores 1 --executor-memory 1g --total-executor-cores 1 ....
mid-task duration: 1s
2 core 2g memory
spark-submit --executor-cores 1 --executor-memory 1g --total-executor-cores 2 ....
mid-task duration: 2s
3 core 3g memory
spark-submit --executor-cores 1 --executor-memory 1g --total-executor-cores 3 ....
mid-task duration: 2s
4 core 4g memory
spark-submit --executor-cores 1 --executor-memory 1g --total-executor-cores 4 ....
mid-task duration: 3s
5 core 5g memory
spark-submit --executor-cores 1 --executor-memory 1g --total-executor-cores 5 ....
mid-task duration: 3s
As can be seen from the runs above, the more executors on a machine, the longer the average running time of a single task. May I ask why this could happen? I did not see any disk spill from the executors, and the memory should be sufficient.
Note: only this stage exhibits this phenomenon; other stages do not have this problem.

Spark-Scala code runs to completion within Spark-Shell but runs infinitely via Spark-Submit

I have a piece of Spark/Scala code that runs as expected in the spark-shell, but when launched via spark-submit (as shown below), executors continually fail with exit code 1 and the job runs indefinitely.
./spark-submit --name Loader --class SP.Loader --master spark://spark-master.default.svc.cluster.local:7077 --executor-memory 64G --packages com.datastax.spark:spark-cassandra-connector_2.11:2.4.1 --conf spark.cores.max=280 --conf spark.cassandra.connection.host=cassandra.default.svc.cluster.local --driver-memory 32G --conf spark.driver.maxResultSize=16G /notebooks/Drago/scala/target/scala-2.11/loader_2.11-1.0.jar
19/09/10 20:16:15 INFO BlockManagerMaster: Removal of executor 11 requested
19/09/10 20:16:15 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 11
19/09/10 20:16:15 INFO BlockManagerMasterEndpoint: Trying to remove executor 11 from BlockManagerMaster.
19/09/10 20:16:15 INFO StandaloneSchedulerBackend: Granted executor ID app-20190910201611-0039/20 on hostPort 10.42.4.135:39575 with 35 core(s), 64.0 GB RAM
19/09/10 20:16:15 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20190910201611-0039/20 is now RUNNING
19/09/10 20:16:15 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20190910201611-0039/15 is now EXITED (Command exited with code 1)
19/09/10 20:16:15 INFO StandaloneSchedulerBackend: Executor app-20190910201611-0039/15 removed: Command exited with code 1
19/09/10 20:16:15 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20190910201611-0039/21 on worker-20190830191329-10.42.6.133-34823 (10.42.6.133:34823) with 35 core(s)

wholeTextFiles Method is failing with ExitCode 52 java.lang.OutOfMemoryError

I have an HDFS directory with 13.2 GB and 4 files in it. I am trying to read all the files using the wholeTextFiles method in Spark, but I have some issues.
This is my code.
val path = "/tmp/cnt/warehouse/"
val whole = sc.wholeTextFiles(path, 32) // pass the path variable, not the string literal "path"
val data = whole.map(r => (r._1, r._2.split("\r\n")))
val x = data.flatMap(r => r._2) // the original referenced an undefined `file`; assuming the split lines were meant
x.take(1000).foreach(println)
Below is the spark-submit command.
spark2-submit \
--class SparkTest \
--master yarn \
--deploy-mode cluster \
--num-executors 32 \
--executor-memory 15G \
--driver-memory 25G \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.port.maxRetries=100 \
--conf spark.kryoserializer.buffer.max=1g \
--conf spark.yarn.queue=xyz \
SparkTest-1.0-SNAPSHOT.jar
Even though I give minPartitions = 32, it is stored in only 4 partitions.
Is my spark-submit correct or not?
The error is below:
Job aborted due to stage failure: Task 0 in stage 32.0 failed 4 times, most recent failure: Lost task 0.3 in stage 32.0 (TID 113, , executor 37): ExecutorLostFailure (executor 37 exited caused by one of the running tasks) Reason: Container from a bad node: container_e599_1560551438641_35180_01_000057 on host: . Exit status: 52. Diagnostics: Exception from container-launch.
Container id: container_e599_1560551438641_35180_01_000057
Exit code: 52
Stack trace: ExitCodeException exitCode=52:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:604)
at org.apache.hadoop.util.Shell.run(Shell.java:507)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.__launchContainer__(LinuxContainerExecutor.java:399)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 52
.
Driver stacktrace:
Even though I give min partitions 32, it is storing in 4 partitions only.
You can refer to the link below:
Spark Creates Less Partitions Then minPartitions Argument on WholeTextFiles
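In short, wholeTextFiles does not split individual files, so 4 input files yield at most 4 partitions no matter what minPartitions hint is passed. A sketch of a common workaround (my assumption, not something taken from the linked answer; it assumes an active SparkContext sc) is to flatten and repartition after the read:
// Sketch: spread the work out after reading, since the 4 files themselves cannot be split
val path = "/tmp/cnt/warehouse/"
val lines = sc.wholeTextFiles(path)                        // at most 4 partitions here
  .flatMap { case (_, content) => content.split("\r\n") }  // one record per line instead of per file
  .repartition(32)                                         // redistribute before heavy downstream work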
Is my spark-submit correct or not?
The syntax is correct, but the values you have passed are more than what is needed. You are giving 32 * 15 = 480 GB to the executors plus 25 GB to the driver just to process 13 GB of data.
Giving more executors and more memory does not necessarily give better results; sometimes it causes overhead, and even failures due to lack of resources.
The error also points to an issue with the resources you are using.
For processing only 13 GB of data you should use configurations like the following (not exactly these; you have to calculate them):
Executors: 6
Cores per executor: 5
Executor memory: 5 GB
Driver memory: 2 GB
For more details and the calculation, you can refer to the link below:
How to tune spark executor number, cores and executor memory?
Note: the driver does not require more memory than the executors, so driver memory should be less than or equal to executor memory in most cases.
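As a rough illustration of that calculation (the node counts and specs below are made-up numbers for the example, not taken from the question), the usual heuristic works out like this:
// Illustrative sizing arithmetic; plug in your own cluster numbers
val nodes = 3                                         // worker nodes (assumption)
val coresPerNode = 16                                 // vCPUs per node (assumption)
val memPerNodeGb = 64                                 // RAM per node in GB (assumption)
val coresPerExec = 5                                  // common rule of thumb for good HDFS throughput
val execsPerNode = (coresPerNode - 1) / coresPerExec                   // leave 1 core per node for OS/daemons -> 3
val totalExecutors = execsPerNode * nodes - 1                          // leave one slot for the YARN AM -> 8
val memPerExecGb = (((memPerNodeGb - 1) / execsPerNode) * 0.93).toInt  // keep ~7% for memoryOverhead -> 19
println(s"--num-executors $totalExecutors --executor-cores $coresPerExec --executor-memory ${memPerExecGb}G")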

Unable to see Spark UI when I submit in yarn-cluster mode on Hadoop-2.6

I'm using Apache Spark 1.6 with a cluster of 8 nodes remotely. I'm submitting the job using spark-submit on the master node, as below:
hastimal#hadoop-8:/usr/local/spark$ ./bin/spark-submit --class umkc.graph.SparkRdfCcCount --master yarn-cluster --num-executors 7 --executor-memory 52g --executor-cores 7 --driver-memory 52g --conf spark.default.parallelism=49 --conf spark.driver.maxResultSize=4g --conf spark.yarn.executor.memoryOverhead=4608 --conf spark.yarn.driver.memoryOverhead=4608 --conf spark.akka.frameSize=1200 --conf spark.network.timeout=300 --conf spark.io.compression.codec=lz4 --conf spark.rdd.compress=true --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://128.110.152.54:9000/SparkHistory --conf spark.broadcast.compress=true --conf spark.shuffle.spill.compress=true --conf spark.shuffle.compress=true --conf spark.shuffle.manager=sort /users/hastimal/SparkProcessing.jar /inputRDF/data-793-805.nt
Everything is fine: I'm getting output without any error, but when I go to see the Spark UI it doesn't show. In my Spark Scala code I have written this:
val conf = new SparkConf().setAppName("Spark Processing").set("spark.ui.port","4041")
After following a couple of things, including this and this, I resolved my issues related to permissions and writing to HDFS. When I run spark-submit and look at the logs in YARN, it shows this:
16/04/25 16:34:23 INFO server.AbstractConnector: Started SelectChannelConnector#0.0.0.0:4041
16/04/25 16:34:23 INFO util.Utils: Successfully started service 'SparkUI' on port 4041.
16/04/25 16:34:23 INFO ui.SparkUI: Started SparkUI at http://128.110.152.131:4041
16/04/25 16:34:23 INFO cluster.YarnClusterScheduler: Created YarnClusterScheduler
16/04/25 16:34:24 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 41216.
16/04/25 16:34:24 INFO netty.NettyBlockTransferService: Server created on 41216
16/04/25 16:34:24 INFO storage.BlockManagerMaster: Trying to register BlockManager
16/04/25 16:34:24 INFO storage.BlockManagerMasterEndpoint: Registering block manager 128.110.152.131:41216 with 34.4 GB RAM, BlockManagerId(driver, 128.110.152.131, 41216)
16/04/25 16:34:24 INFO storage.BlockManagerMaster: Registered BlockManager
This means the Spark UI has been started at http://128.110.152.131:4041, which is again one of the data nodes, and when I go to that URL it shows a connection-refused error.
FYI: all the ports used are open on all the machines. Please help me; I want to see the DAG of my Spark job. I'm able to see all YARN applications through the YARN UI, and I can see the application UI using port 8088.
I want to see the Spark UI with the DAG like we see in standalone mode or when running from the IntelliJ IDE.
In YARN mode the application master creates the Spark UI. While the job is running, go to the Resource Manager UI and click on ApplicationMaster; you will see the UI (it is served through the RM proxy, typically at a URL of the form http://<resourcemanager>:8088/proxy/<application_id>/).