I have HDFS directory with 13.2 GB and 4 files in it. I am trying to read all files using wholeTextFile method in spark, But i have some issues
This is my code.
val path = "/tmp/cnt/warehouse/"
val whole = sc.wholeTextFiles("path",32)
val data = whole.map(r => (r._1,r._2.split("\r\n")))
val x = file.flatMap(r => r._1)
x.take(1000).foreach(println)
Below is the spark Submit.
spark2-submit \
--class SparkTest \
--master yarn \
--deploy-mode cluster \
--num-executors 32 \
--executor-memory 15G \
--driver-memory 25G \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.port.maxRetries=100 \
--conf spark.kryoserializer.buffer.max=1g \
--conf spark.yarn.queue=xyz \
SparkTest-1.0-SNAPSHOT.jar
even though i give min partitions 32, it is storing in 4 partitions only.
My spark submit is correct or not?
Error Below
Job aborted due to stage failure: Task 0 in stage 32.0 failed 4 times, most recent failure: Lost task 0.3 in stage 32.0 (TID 113, , executor 37): ExecutorLostFailure (executor 37 exited caused by one of the running tasks) Reason: Container from a bad node: container_e599_1560551438641_35180_01_000057 on host: . Exit status: 52. Diagnostics: Exception from container-launch.
Container id: container_e599_1560551438641_35180_01_000057
Exit code: 52
Stack trace: ExitCodeException exitCode=52:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:604)
at org.apache.hadoop.util.Shell.run(Shell.java:507)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.__launchContainer__(LinuxContainerExecutor.java:399)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 52
.
Driver stacktrace:
Even though i give min partitions 32, it is storing in 4 partitions
only.
You can refer below link
Spark Creates Less Partitions Then minPartitions Argument on WholeTextFiles
My spark submit is correct or not?
Syntax is correct but value you have passed is more than it needed. I mean you are giving 32 * 15 = 480 GB to Executors + 25 GB to driver just to process 13 GB data?
Giving more executors and more memory does not give efficient result. Sometime it cause overhead and also failure due to lack of resources
Error is also showing issue with resources you are using.
For processing only 13 GB data you should use like below configurations (not exactly, you have to calculate):
Executors # 6
Core #5
Executor-Memory 5 GB
Driver Memory 2 GB
For more details & calculation you can refer below link:
How to tune spark executor number, cores and executor memory?
Note: Driver does not require more memory than Executor so Driver
memory should be less or equal to Executor memory in most of cases.
Related
I am facing this issue, my spark streaming jobs keeps on failing after running for few days with below error:
AM Container for appattempt_1610108774021_0354_000001 exited with exitCode: -104
Failing this attempt.Diagnostics: Container [pid=31537,containerID=container_1610108774021_0354_01_000001] is running beyond physical memory limits. Current usage: 5.8 GB of 5.5 GB physical memory used; 8.0 GB of 27.3 GB virtual memory used. Killing container.
Dump of the process-tree for container_1610108774021_0354_01_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 31742 31537 31537 31537 (java) 1583676 58530 8499392512 1507368 /usr/lib/jvm/java-openjdk/bin/java -server -Xmx5078m -
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
spark submit:
spark-submit --name DWH-CDC-commonJob --deploy-mode cluster --master yarn --conf spark.sql.shuffle.partitions=10 --conf spark.eventLog.enabled=false --conf spark.sql.caseSensitive=true --conf spark.driver.memory=5078M --class com.aos.Loader --jars file:////home/hadoop/lib/* --executor-memory 5000M --conf "spark.alert.duration=4" --conf spark.dynamicAllocation.enabled=false --num-executors 3 --files /home/hadoop/log4j.properties,/home/hadoop/application.conf --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" streams_2.11-1.0.jar application.conf
Have tried increasing the spark.executor.memoryOverhead but it fails after few days, I want to understand how we can arrive to number where it can run without any interruptions. Or is there any other configuration that I am missing.
Spark 2.4 version
aws EMR: 5.23
scala :2.11.12
Two data nodes( vCPU 4, 16 GB ram each).
I keep getting org.apache.spark.SparkException: Job aborted when I try to save my flattened json file in azure blob as csv. Some answers that I have found recomends to increase the executor memory. Which I have done here:
I get this error when I try to save the config:
What do I need to do to solve this issue?
EDIT
Adding part of the stacktrace that is causing org.apache.spark.SparkException: Job aborted. I have also tried with and without coalesce when saving my flattend dataframe:
ERROR FileFormatWriter: Aborting job 0d8c01f9-9ff3-4297-b677-401355dca6c4.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 79.0 failed 4 times, most recent failure: Lost task 0.3 in stage 79.0 (TID 236) (10.139.64.7 executor 15): ExecutorLostFailure (executor 15 exited caused by one of the running tasks) Reason: Command exited with code 52
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:3312)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:3244)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:3235)
Experiencing similar error when executing the spark.executor.memory 4g command on my cluster with similar worker node.
The cause of the error is mainly the limit of executor memory in specific cluster node is 3 Gb and you are passing the value as 4 Gb as error message suggests.
Resolution:
Give spark.executor.memory less than 3Gb.
Select the bigger worker type Standard_F8, Standard_F16 etc.
Can someone please help me to figure out my simple spark application is requiring huge driver memory? Even though I allocated about 112GB, my application fails at about 67GB.
Thanks in advance
The spark driver is using huge memory for running simple application
Allocated about 112G of memory for running my application when do spark-submit
At start of the job, I see below message in the logs
1019 [main] INFO org.apache.spark.storage.memory.MemoryStore - MemoryStore started with capacity 67.0 GiB
My application fails with this error message
java.lang.IllegalStateException: dag-scheduler-event-loop has already been stopped accidentally.
at org.apache.spark.util.EventLoop.post(EventLoop.scala:107)
at org.apache.spark.scheduler.DAGScheduler.taskStarted(DAGScheduler.scala:283)
at org.apache.spark.scheduler.TaskSetManager.prepareLaunchingTask(TaskSetManager.scala:539)
at org.apache.spark.scheduler.TaskSetManager.$anonfun$resourceOffer$2(TaskSetManager.scala:478)
at scala.Option.map(Option.scala:230)
at org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:455)
at org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOfferSingleTaskSet$2(TaskSchedulerImpl.scala:395)
at org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOfferSingleTaskSet$2$adapted(TaskSchedulerImpl.scala:390)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOfferSingleTaskSet$1(TaskSchedulerImpl.scala:390)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOfferSingleTaskSet(TaskSchedulerImpl.scala:381)
at org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOffers$20(TaskSchedulerImpl.scala:587)
at org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOffers$20$adapted(TaskSchedulerImpl.scala:582)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOffers$16(TaskSchedulerImpl.scala:582)
at org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOffers$16$adapted(TaskSchedulerImpl.scala:555)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:555)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.$anonfun$makeOffers$5(CoarseGrainedSchedulerBackend.scala:359)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$$withLock(CoarseGrainedSchedulerBackend.scala:955)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$DriverEndpoint$$makeOffers(CoarseGrainedSchedulerBackend.scala:351)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:162)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
59376410 [dispatcher-CoarseGrainedScheduler] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Cancelling stage 1
59376410 [dispatcher-CoarseGrainedScheduler] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Killing all running tasks in stage 1: Stage cancelled
59376415 [dispatcher-CoarseGrainedScheduler] ERROR org.apache.spark.scheduler.DAGSchedulerEventProcessLoop - DAGSchedulerEventProcessLoop failed; shutting down SparkContext
scala code snippet
val df = spark.read.parquet(data_path)
df.rdd.foreachPartition(p => {
// code to process the code...
})
Job Submit
'''
spark-submit --master "spark://x.x.x.x:7077" --driver-cores=4 --driver-memory=112G --conf spark.driver.maxResultSize=0 --conf spark.rpc.message.maxSize=2047 --conf spark.driver.host=x.x.x.x --class myclass.processor --packages "..,org.apache.hadoop:hadoop-azure:3.3.1" --deploy-mode client
'''
When I run LogisticRegression in Spark, I found a stage is special, as the number of executors increases, the average task processing time becomes longer, why it could happen?
Environment:
All servers are local, no cloud.
Server 1: 6 cores 10g memory (Spark master, HDFS master, HDFS slave).
Server 2: 6 cores 10g memory (HDFS slave).
Server 3: 6 cores 10g memory (Spark slave, HDFS slave).
Deploy in Spark standalone mode.
Input file size: Large enough, it can meet the requirements of parallelism. Spark will read the file from HDFS.
All the workloads have the same input file.
You can see that only server3 will participate in the calculation(Only it will become Spark worker).
Special stage DAG
1 core 1g memory
spark-submit --executor-cores 1 --executor-memory 1g --total-executor-cores 1 ....
mid-task duration: 1s
2 core 2g memory
spark-submit --executor-cores 1 --executor-memory 1g --total-executor-cores 2 ....
mid-task duration: 2s
3 core 3g memory
spark-submit --executor-cores 1 --executor-memory 1g --total-executor-cores 3 ....
mid-task duration: 2s
4 core 4g memory
spark-submit --executor-cores 1 --executor-memory 1g --total-executor-cores 4 ....
mid-task duration: 3s
5 core 5g memory
spark-submit --executor-cores 1 --executor-memory 1g --total-executor-cores 5 ....
mid-task duration: 3s
As can be seen from the above figure, the more executors on a machine will cause the longer the average running time of a single task. May I ask why this could happen, and I had not seen the executor had disk spill, the memory should be sufficient.
Note: Only this stage will produce this phenomenon, other stages did not have this problem.
I have loaded a dataset which is just around ~ 20 GB in size - the cluster has ~ 1TB available so memory shouldn't be an issue imho.
It is no problem for me to save the original data which consists only of strings:
df_data.write.parquet(os.path.join(DATA_SET_BASE, 'concatenated.parquet'), mode='overwrite')
However, as I transform the data:
df_transformed = df_data.drop('bri').join(
df_data[['docId', 'bri']].rdd\
.map(lambda x: (x.docId, json.loads(x.bri))
if x.bri is not None else (x.docId, dict()))\
.toDF()\
.withColumnRenamed('_1', 'docId')\
.withColumnRenamed('_2', 'bri'),
['dokumentId']
)
and then save it:
df_transformed.parquet(os.path.join(DATA_SET_BASE, 'concatenated.parquet'), mode='overwrite')
The log output will tell me that the memory limit was exceeded:
18/03/08 10:23:09 WARN TaskSetManager: Lost task 17.0 in stage 18.3 (TID 2866, worker06.hadoop.know-center.at): ExecutorLostFailure (executor 40 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 15.2 GB of 13.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/03/08 10:23:09 WARN TaskSetManager: Lost task 29.0 in stage 18.3 (TID 2878, worker06.hadoop.know-center.at): ExecutorLostFailure (executor 40 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 15.2 GB of 13.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/03/08 10:23:09 WARN TaskSetManager: Lost task 65.0 in stage 18.3 (TID 2914, worker06.hadoop.know-center.at): ExecutorLostFailure (executor 40 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 15.2 GB of 13.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
I'm not quite sure what the problem is. Even setting the executor's memory to 60GB RAM each does not solve the problem.
So, obviously the problem comes with the transformation. Any idea what exactly causes this problem?