Losing executors when saving parquet file - pyspark

I have loaded a dataset of roughly 20 GB. The cluster has about 1 TB of memory available, so memory shouldn't be an issue.
Saving the original data, which consists only of strings, is no problem:
df_data.write.parquet(os.path.join(DATA_SET_BASE, 'concatenated.parquet'), mode='overwrite')
However, as I transform the data:
df_transformed = df_data.drop('bri').join(
    # Parse the JSON string in 'bri' into a dict; use an empty dict for missing values.
    df_data[['docId', 'bri']].rdd\
        .map(lambda x: (x.docId, json.loads(x.bri))
             if x.bri is not None else (x.docId, dict()))\
        .toDF()\
        .withColumnRenamed('_1', 'docId')\
        .withColumnRenamed('_2', 'bri'),
    ['docId']  # join key present on both sides
)
and then save it:
df_transformed.write.parquet(os.path.join(DATA_SET_BASE, 'concatenated.parquet'), mode='overwrite')
The log output will tell me that the memory limit was exceeded:
18/03/08 10:23:09 WARN TaskSetManager: Lost task 17.0 in stage 18.3 (TID 2866, worker06.hadoop.know-center.at): ExecutorLostFailure (executor 40 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 15.2 GB of 13.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/03/08 10:23:09 WARN TaskSetManager: Lost task 29.0 in stage 18.3 (TID 2878, worker06.hadoop.know-center.at): ExecutorLostFailure (executor 40 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 15.2 GB of 13.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/03/08 10:23:09 WARN TaskSetManager: Lost task 65.0 in stage 18.3 (TID 2914, worker06.hadoop.know-center.at): ExecutorLostFailure (executor 40 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 15.2 GB of 13.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
I'm not quite sure what the problem is. Even setting each executor's memory to 60 GB does not solve it.
So the problem evidently comes from the transformation. Any idea what exactly causes it?
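One way to read the warnings above is to pull the two numbers out of the YARN message and see how far over the container limit the executor went. The following is a minimal sketch in plain Python; the helper name `parse_yarn_memory_kill` is hypothetical, and the sample line is abbreviated from the log above. Note that the limit in the message is the size of the YARN container (executor memory plus memory overhead), not the JVM heap alone.

```python
import re

# Matches the "<used> GB of <limit> GB physical memory used" part of the YARN message.
_PATTERN = re.compile(r"([\d.]+) GB of ([\d.]+) GB physical memory used")

def parse_yarn_memory_kill(line):
    """Return (used_gb, limit_gb) from a YARN kill message, or None if absent."""
    match = _PATTERN.search(line)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

# Abbreviated copy of one warning from the log above.
sample = ("Container killed by YARN for exceeding memory limits. "
          "15.2 GB of 13.5 GB physical memory used. "
          "Consider boosting spark.yarn.executor.memoryOverhead.")

used_gb, limit_gb = parse_yarn_memory_kill(sample)
excess_gb = round(used_gb - limit_gb, 1)
print(f"used={used_gb} GB, limit={limit_gb} GB, over by {excess_gb} GB")
```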

Related

Azure Databricks: Error, Specified heap memory (4096MB) is above the maximum executor memory (3157MB) allowed for node type Standard_F4

I keep getting org.apache.spark.SparkException: Job aborted when I try to save my flattened JSON file to Azure Blob as CSV. Some answers that I have found recommend increasing the executor memory, which I have done here:
I get this error when I try to save the config:
What do I need to do to solve this issue?
EDIT
Adding the part of the stack trace that leads to org.apache.spark.SparkException: Job aborted. I have also tried with and without coalesce when saving my flattened dataframe:
ERROR FileFormatWriter: Aborting job 0d8c01f9-9ff3-4297-b677-401355dca6c4.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 79.0 failed 4 times, most recent failure: Lost task 0.3 in stage 79.0 (TID 236) (10.139.64.7 executor 15): ExecutorLostFailure (executor 15 exited caused by one of the running tasks) Reason: Command exited with code 52
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:3312)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:3244)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:3235)
I experienced a similar error when setting spark.executor.memory 4g on my cluster with a similar worker node type.
The cause is that the maximum executor memory for this node type is about 3 GB (3157 MB), while you are requesting 4 GB (4096 MB), as the error message says.
Resolution:
Set spark.executor.memory to a value below that limit.
Or select a bigger worker type such as Standard_F8 or Standard_F16.
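The check behind that error can be sketched in a few lines of Python. This is only an illustration of the comparison, not how Databricks implements it; the helper names `to_megabytes` and `fits_node` are hypothetical, and the 3157 MB limit is taken from the error message for Standard_F4.

```python
def to_megabytes(value):
    """Convert a size such as '4g' or '4096m' to whole megabytes (binary units)."""
    units = {"m": 1, "g": 1024}
    number, suffix = value[:-1], value[-1].lower()
    if suffix not in units:
        raise ValueError(f"unsupported size suffix in {value!r}")
    return int(number) * units[suffix]

# Limit quoted in the error message for node type Standard_F4.
MAX_EXECUTOR_MEMORY_MB = 3157

def fits_node(requested, limit_mb=MAX_EXECUTOR_MEMORY_MB):
    """True if the requested executor memory is within the node's limit."""
    return to_megabytes(requested) <= limit_mb

print(fits_node("4g"))     # False: 4096 MB exceeds 3157 MB
print(fits_node("3000m"))  # True
```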

wholeTextFiles Method is failing with ExitCode 52 java.lang.OutOfMemoryError

I have an HDFS directory of 13.2 GB with 4 files in it. I am trying to read all the files using the wholeTextFiles method in Spark, but I have some issues.
This is my code.
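val path = "/tmp/cnt/warehouse/"
// Pass the path variable; 32 is only a suggested minimum number of partitions.
val whole = sc.wholeTextFiles(path, 32)
// Split each file's content into lines, keeping the file path as the key.
val data = whole.map(r => (r._1, r._2.split("\r\n")))
val x = data.flatMap(r => r._1)
x.take(1000).foreach(println)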
Below is the spark-submit command.
spark2-submit \
--class SparkTest \
--master yarn \
--deploy-mode cluster \
--num-executors 32 \
--executor-memory 15G \
--driver-memory 25G \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.port.maxRetries=100 \
--conf spark.kryoserializer.buffer.max=1g \
--conf spark.yarn.queue=xyz \
SparkTest-1.0-SNAPSHOT.jar
Even though I give minPartitions as 32, it is stored in only 4 partitions.
Is my spark-submit correct or not?
The error is below:
Job aborted due to stage failure: Task 0 in stage 32.0 failed 4 times, most recent failure: Lost task 0.3 in stage 32.0 (TID 113, , executor 37): ExecutorLostFailure (executor 37 exited caused by one of the running tasks) Reason: Container from a bad node: container_e599_1560551438641_35180_01_000057 on host: . Exit status: 52. Diagnostics: Exception from container-launch.
Container id: container_e599_1560551438641_35180_01_000057
Exit code: 52
Stack trace: ExitCodeException exitCode=52:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:604)
at org.apache.hadoop.util.Shell.run(Shell.java:507)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.__launchContainer__(LinuxContainerExecutor.java:399)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 52
.
Driver stacktrace:
Even though I give minPartitions as 32, it is stored in only 4 partitions.
You can refer to the link below:
Spark Creates Less Partitions Then minPartitions Argument on WholeTextFiles
Is my spark-submit correct or not?
The syntax is correct, but the values you have passed are far more than needed. You are giving 32 * 15 = 480 GB to the executors plus 25 GB to the driver just to process 13 GB of data.
Adding more executors and more memory does not necessarily make the job more efficient. Sometimes it causes overhead, and it can even cause failures due to a lack of resources.
The error also points to a problem with the resources you are using.
To process only 13 GB of data, you should use something like the configuration below (not exactly; you have to calculate it for your cluster):
Executors: 6
Cores: 5
Executor memory: 5 GB
Driver memory: 2 GB
For more details and the calculation, you can refer to the link below:
How to tune spark executor number, cores and executor memory?
Note: The driver does not usually need more memory than an executor, so in most cases driver memory should be less than or equal to executor memory.
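To make the comparison concrete, the arithmetic above can be written out as a small Python sketch. The function name `total_memory_gb` is hypothetical, and the totals count only the requested heap sizes; the per-container memory overhead that YARN adds on top is not included.

```python
def total_memory_gb(num_executors, executor_memory_gb, driver_memory_gb):
    """Heap memory requested across all executors plus the driver (overhead excluded)."""
    return num_executors * executor_memory_gb + driver_memory_gb

DATA_GB = 13.2  # size of the HDFS directory from the question

submitted = total_memory_gb(32, 15, 25)  # values from the spark2-submit command
suggested = total_memory_gb(6, 5, 2)     # values from the suggested configuration

print(submitted, suggested)           # 505 32
print(round(submitted / DATA_GB, 1))  # requested memory as a multiple of the data size
```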

joblib Parallel running out of memory

I have something like this
outputs = Parallel(n_jobs=12, verbose=10)(delayed(_process_article)(article, config) for article in data)
Case 1: Run on ubuntu with 80 cores:
CPU(s): 80
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 2
There are a total of 90,000 tasks. At around 67k it fails and is terminated.
joblib.externals.loky.process_executor.BrokenProcessPool: A process in the executor was terminated abruptly, the pool is not usable anymore.
When I monitor top at around 67k, I see the free memory fall sharply:
top - 11:40:25 up 2 days, 18:35, 4 users, load average: 7.09, 7.56, 7.13
Tasks: 32 total, 3 running, 29 sleeping, 0 stopped, 0 zombie
%Cpu(s): 7.6 us, 2.6 sy, 0.0 ni, 89.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 33554432 total, 40 free, 33520996 used, 33396 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 40 avail Mem
Case 2: Mac with 8 cores
hw.physicalcpu: 4
hw.logicalcpu: 8
But on the Mac it is much, much slower, and surprisingly it does not get killed at 67k.
Additionally, I reduced the parallelism (in case 1) to 2 and 4, and it still fails :(
Why is this happening? Has anyone faced this issue before and has a fix?
Note: when I run for 50,000 tasks it runs well and does not give any problems.
Thank you!
I got a machine with more memory (128 GB), and that solved the problem!
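If more memory is not available, one thing that may be worth trying is to process the articles in batches and persist each batch's results before starting the next, so that all 90,000 outputs never have to sit in memory at once. The sketch below shows only the batching part in plain Python. `batched` and `process_in_batches` are hypothetical helpers, `handle_batch` stands in for the `Parallel(...)` call from the question, and `save_results` stands in for writing to disk. Whether this helps depends on where the memory actually goes, which the question does not establish.

```python
from itertools import islice

def batched(items, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def process_in_batches(items, batch_size, handle_batch, save_results):
    """Run handle_batch on each batch and hand its results to save_results.

    handle_batch is a placeholder for something like
    Parallel(n_jobs=12)(delayed(_process_article)(a, config) for a in batch);
    save_results is a placeholder for writing the batch's outputs to disk.
    """
    count = 0
    for batch in batched(items, batch_size):
        save_results(handle_batch(batch))
        count += len(batch)
    return count

# Toy usage with stand-in callables instead of joblib and real storage.
saved = []
done = process_in_batches(range(10), 4, lambda b: [x * 2 for x in b], saved.extend)
print(done, saved)
```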

Is AWS Glue Scalable?

I have tried to include all the relevant information about how I'm using Glue; please let me know if you need more.
Here is my scenario:
aws s3 ls s3://bucketname/ --recursive --profile production | grep Auto | wc -l
2487
There are no more than 2487 S3 objects of interest for the transformation.
aws s3api list-objects --bucket bucketname --output json --query "[sum(Contents[].Size), length(Contents[])]" --profile production | awk 'NR!=2 {print $0;next} NR==2 {print $0/1024/1024/1024" GB"}'
[
344.768 GB
3829
]
Each S3 object is no larger than 100 MB and is compressed JSON.
3829 is the total number of objects, but I'm only interested in processing 2487 of them.
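As a rough sanity check on these numbers, the average object size can be derived from the total size and the object count. This is a minimal Python sketch; the function name `average_object_mb` is hypothetical, and the estimate for the 2487 objects assumes they have the same average size as the rest of the bucket, which may not hold. Because the objects are compressed JSON, their size in memory after decompression is presumably larger, by an amount these figures do not reveal.

```python
TOTAL_GB = 344.768          # total bucket size reported by the CLI output above
TOTAL_OBJECTS = 3829        # total number of objects in the bucket
OBJECTS_OF_INTEREST = 2487  # objects matched by the grep above

def average_object_mb(total_gb, object_count):
    """Average object size in MB, using 1 GB = 1024 MB as in the awk command."""
    return total_gb * 1024 / object_count

avg_mb = average_object_mb(TOTAL_GB, TOTAL_OBJECTS)
# Estimate only: assumes the objects of interest have the same average size as the rest.
estimated_input_gb = avg_mb * OBJECTS_OF_INTEREST / 1024

print(round(avg_mb, 1))
print(round(estimated_input_gb, 1))
```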
Scala Code:
val glueContext: GlueContext = new GlueContext(sc)
val auto01: DynamicFrame = glueContext.getCatalogSource(database = "jsondb", tableName = "01").getDynamicFrame()
auto01.printSchema()
Trying to get the schema gives:
18/06/09 18:31:44 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 32, ip-172-31-16-40.ec2.internal, executor 9): ExecutorLostFailure (executor 9 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 5.7 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/09 18:31:44 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 5.7 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
..
..
..
18/06/09 18:34:13 WARN ExecutorAllocationManager: Attempted to mark unknown executor 12 idle
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 2.0 failed 4 times, most recent failure: Lost task 2.3 in stage 2.0 (TID 44, ip-172-31-16-40.ec2.internal, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 6.0 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1026)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:1008)
at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1128)
at org.apache.spark.sql.glue.util.SchemaUtils$.fromRDD(SchemaUtils.scala:57)
at com.amazonaws.services.glue.DynamicFrame.recomputeSchema(DynamicFrame.scala:235)
at com.amazonaws.services.glue.DynamicFrame.schema(DynamicFrame.scala:223)
at com.amazonaws.services.glue.DynamicFrame.printSchema(DynamicFrame.scala:244)
... 48 elided
Am I missing anything that I should consider when using Glue?

Why are the executors getting killed by the driver?

The first stage of my spark job is quite simple.
It reads from a big number of files (around 30,000 files and 100GB in total) -> RDD[String]
does a map (to parse each line) -> RDD[Map[String,Any]]
filters -> RDD[Map[String,Any]]
coalesces (.coalesce(100, true))
When running it, I observe quite peculiar behavior. The number of executors grows until it reaches the limit I specified in spark.dynamicAllocation.maxExecutors (typically 100 or 200 in my application). Then it starts decreasing quickly (at approx. 14000/33428 tasks) until only a few executors remain; the others are killed by the driver. When this task is done, the number of executors increases back to its maximum value.
Below is a screenshot of the number of executors at its lowest.
And here is a screenshot of the task summary.
I guess that these executors are killed because they are idle. But in that case, I do not understand why they would become idle. There are still a lot of tasks left to do in the stage...
Do you have any idea why this happens?
EDIT
More details about the driver logs when an executor is killed:
16/09/30 12:23:33 INFO cluster.YarnClusterSchedulerBackend: Disabling executor 91.
16/09/30 12:23:33 INFO scheduler.DAGScheduler: Executor lost: 91 (epoch 0)
16/09/30 12:23:33 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 91 from BlockManagerMaster.
16/09/30 12:23:33 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(91, server.com, 40923)
16/09/30 12:23:33 INFO storage.BlockManagerMaster: Removed 91 successfully in removeExecutor
16/09/30 12:23:33 INFO cluster.YarnClusterScheduler: Executor 91 on server.com killed by driver.
16/09/30 12:23:33 INFO spark.ExecutorAllocationManager: Existing executor 91 has been removed (new total is 94)
Logs on the executor
16/09/30 12:26:28 INFO rdd.HadoopRDD: Input split: hdfs://...
16/09/30 12:26:32 INFO executor.Executor: Finished task 38219.0 in stage 0.0 (TID 26519). 2312 bytes result sent to driver
16/09/30 12:27:33 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
16/09/30 12:27:33 INFO storage.DiskBlockManager: Shutdown hook called
16/09/30 12:27:33 INFO util.ShutdownHookManager: Shutdown hook called
I'm seeing this problem on executors that are killed as a result of an idle timeout. I have an exceedingly demanding computational load, but it's mostly computed in a UDF, invisible to Spark. I believe that there's some spark parameter that can be adjusted.
Try looking through the spark.executor parameters in https://spark.apache.org/docs/latest/configuration.html#spark-properties and see if anything jumps out.
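The executor log in the question is consistent with an idle timeout: the gap between the last finished task and the SIGTERM can be computed from the timestamps. The sketch below does this in plain Python; the helper name `log_time` is hypothetical and the two log lines are shortened copies of the ones above. As far as I know, spark.dynamicAllocation.executorIdleTimeout defaults to 60s, which would match a gap of about a minute, but this is a reading of the logs rather than a confirmed diagnosis.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%y/%m/%d %H:%M:%S"  # e.g. "16/09/30 12:26:32"

def log_time(line):
    """Parse the leading timestamp (first 17 characters) of a Spark log line."""
    return datetime.strptime(line[:17], LOG_TIME_FORMAT)

# Shortened copies of two lines from the executor log above.
finished = "16/09/30 12:26:32 INFO executor.Executor: Finished task 38219.0 in stage 0.0 (TID 26519)."
sigterm = "16/09/30 12:27:33 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM"

idle_seconds = int((log_time(sigterm) - log_time(finished)).total_seconds())
print(idle_seconds)  # 61
```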