I am working with the following Spark config:
maxCores = 5
driverMemory=2g
executorMemory=17g
executorInstances=100
Issue:
Out of 100 executors, my job ends up with only 10 active executors, even though enough memory is available. I even tried raising the executor count to 250, and still only 10 remain active. All I am trying to do is load a multi-partition Hive table and run df.count over it (a minimal sketch of the job is below).
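Roughly, the job is nothing more than this (a sketch only; the table name is a placeholder, not my real table):
// read a multi-partition Hive table and count it
val df = spark.table("db.some_partitioned_table")   // placeholder table name
println(df.count())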
Please help me understand what is causing the executors to be killed.
17/12/20 11:08:21 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
17/12/20 11:08:21 INFO storage.DiskBlockManager: Shutdown hook called
17/12/20 11:08:21 INFO util.ShutdownHookManager: Shutdown hook called
Not sure why YARN is killing my executors.
I faced a similar issue where investigating the NodeManager logs led me to the root cause.
You can access them via the web interface at
nodeManagerAddress:PORT/logs
The PORT is specified in yarn-site.xml under yarn.nodemanager.webapp.address (default: 8042).
My investigation workflow:
Collect the aggregated logs with the yarn logs command (see the example after this list)
Identify the node and container emitting the error (in these logs)
Search the NodeManager logs by the timestamp of the error for the root cause
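A minimal sketch of the log-collection step; the application ID is a placeholder you would take from the ResourceManager UI or from yarn application -list:
# fetch the aggregated container logs for one application
yarn logs -applicationId <applicationId> > app_logs.txt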
Btw: you can access the aggregated collection (XML) of all configurations affecting a node at the same port with:
nodeManagerAddress:PORT/conf
I believe this issue has more to do with memory and the dynamic-allocation idle timeouts at the executor/container level. Make sure you can change the config params at the executor/container level.
One way to resolve this issue is by changing this config value, either in your spark-shell or in the spark job:
spark.dynamicAllocation.executorIdleTimeout
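For example, a sketch of how you might raise the idle timeout on submission (the 120s value is only an illustrative placeholder, not a recommendation from the linked thread):
spark-submit \
  --conf "spark.dynamicAllocation.enabled=true" \
  --conf "spark.dynamicAllocation.executorIdleTimeout=120s" \
  ... # rest of your usual options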
This thread has more detailed information on how to resolve this issue which worked for me:
https://jira.apache.org/jira/browse/SPARK-21733
I had the same issue: my spark job was using only 1 task node and killing the other provisioned nodes. This also happened when switching to EMR Serverless, where my job was being run on only one "thread". The submit command below fixed it for me:
spark-submit \
--name KSSH-0.3 \
--class com.jiuye.KSSH \
--master yarn \
--deploy-mode cluster \
--driver-memory 2g \
--executor-memory 2g \
--executor-cores 1 \
--num-executors 8 \
--jars $(echo /opt/software/spark2.1.1/spark_on_yarn/libs/*.jar | tr ' ' ',') \
--conf "spark.ui.showConsoleProgress=false" \
--conf "spark.yarn.am.memory=1024m" \
--conf "spark.yarn.am.memoryOverhead=1024m" \
--conf "spark.yarn.driver.memoryOverhead=1024m" \
--conf "spark.yarn.executor.memoryOverhead=1024m" \
--conf "spark.yarn.am.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=300 -XX:InitiatingHeapOccupancyPercent=50 -XX:G1ReservePercent=20 -XX:+DisableExplicitGC -Dcdh.version=5.12.0" \
--conf "spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=300 -XX:InitiatingHeapOccupancyPercent=50 -XX:G1ReservePercent=20 -XX:+DisableExplicitGC -Dcdh.version=5.12.0" \
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=300 -XX:InitiatingHeapOccupancyPercent=50 -XX:G1ReservePercent=20 -XX:+DisableExplicitGC -Dcdh.version=5.12.0" \
--conf "spark.streaming.backpressure.enabled=true" \
--conf "spark.streaming.kafka.maxRatePerPartition=1250" \
--conf "spark.locality.wait=1s" \
--conf "spark.shuffle.consolidateFiles=true" \
--conf "spark.executor.heartbeatInterval=360000" \
--conf "spark.network.timeout=420000" \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--conf "spark.hadoop.fs.hdfs.impl.disable.cache=true" \
/opt/software/spark2.1.1/spark_on_yarn/KSSH-0.3.jar
Related
Trying to write Parquet data to S3 or HDFS gives the same kind of error. I am already writing with overwrite mode:
resFinal.write.mode(SaveMode.Overwrite).partitionBy("pro".......
The spark-submit command used for this job is:
spark-submit --master yarn --deploy-mode cluster --executor-memory 50G --driver-memory 54G --executor-cores 5 --queue High --conf spark.yarn.maxAppAttempts=1 --conf spark.driver.maxResultSize=7g --conf spark.executor.memoryOverhead=4500 --conf spark.driver.memoryOverhead=5400 --conf spark.sql.shuffle.partitions=7000 --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.spill.compress=true --conf spark.sql.tungsten.enabled=true --conf spark.sql.autoBroadCastJoinThreshold=-1 --conf spark.speculation=true --conf spark.dynamicAllocation.minExecutors=200 --conf spark.dynamicAllocation.maxExecutors=500 --conf spark.memory.storageFraction=0.6 --conf spark.memory.fraction=0.7 --class com.mnb.history_cleanup s3://dv-cam/1/cleanup-1.0-SNAPSHOT.jar H 0 20170101-20170102 HSO
No matter whether I write to HDFS or S3, I see
org.apache.hadoop.fs.FileAlreadyExistsException: Path already exists as a file: s3://dv-ms-east-1/ms414x-test1/dl/ry8/.spark-staging-28e84dbb-7e91-4d5c-87ba-8e880cf28904/
or
File does not exist: /user/m/dl/vi/.spark-staging-bdb317f3-7ff9-458e-9ea8-7fb70ce4/pro
I'm executing Spark code in the Scala shell using the Kafka jars, and my intention is to stream messages from a Kafka topic. My Spark object is created, but can anyone help me with how to pass the JAAS configuration file when starting the spark shell? My error points to a missing JAAS configuration.
Assuming you have a spark-kafka.jaas file in the folder you are running spark-submit from, you pass it with --files, as well as via a driver and an executor option:
spark-submit \
...
--files "spark-kafka.jaas#spark-kafka.jaas" \
--driver-java-options "-Djava.security.auth.login.config=./spark-kafka.jaas" \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./spark-kafka.jaas"
You might also need to set "security.protocol" within the Spark code's Kafka properties to one of the supported Kafka SASL protocols.
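As an illustration, a minimal Scala sketch of where that property would go when using the spark-streaming-kafka-0-10 consumer (the broker address, group id, and protocol value are placeholders, not settings from the question):
import org.apache.kafka.common.serialization.StringDeserializer

// Kafka consumer properties for the 0-10 direct stream
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",               // placeholder broker
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "my-consumer-group",                   // placeholder group id
  "security.protocol" -> "SASL_PLAINTEXT",             // or SASL_SSL, depending on the broker listener
  "sasl.kerberos.service.name" -> "kafka"
)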
I had an issue like yours. I'm using this startup script to run my spark-shell, on Spark 2.3.0.
export HOME=/home/alessio.palma/scala_test
spark2-shell \
--verbose \
--principal hdp_ud_appadmin \
--files "jaas.conf" \
--keytab $HOME/hdp_ud_app.keytab \
--master local[2] \
--packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.3.0,org.apache.kafka:kafka-clients:0.10.0.1 \
--conf spark.driver.extraJavaOptions="-Djava.security.auth.login.config=/home/alessio.palma/scala_test/jaas.conf -Djava.security.krb5.conf=/etc/krb5.conf" \
--conf spark.executor.extraJavaOptions="-Djava.security.auth.login.config=/home/alessio.palma/scala_test/jaas.conf -Djava.security.krb5.conf=/etc/krb5.conf" \
--driver-java-options spark.driver.extraJavaOptions="-Djava.security.auth.login.config=file://jaas.conf -Djava.security.krb5.conf=file:///etc/krb5.conf" \
--driver-java-options spark.executor.extraJavaOptions="-Djava.security.auth.login.config=file://jaas.conf -Djava.security.krb5.conf=file:///etc/krb5.conf" \
--queue=root.Global.UnifiedData.hdp_global_ud_app
Every attempt failed with this error:
org.apache.kafka.common.KafkaException: Failed to construct kafka consumer
:
.
Caused by: org.apache.kafka.common.KafkaException: org.apache.kafka.common.KafkaException: Jaas configuration not found
It looks like spark.driver.extraJavaOptions and spark.executor.extraJavaOptions are not working. Everything kept failing until I added this line at the top of my startup script:
export SPARK_SUBMIT_OPTS='-Djava.security.auth.login.config=/home/alessio.palma/scala_test/jaas.conf'
And magically the jaas.conf file was found. Another thing I suggest adding to your startup script is:
export SPARK_KAFKA_VERSION=0.10
Folks,
I am running a PySpark job that reads a 500 MB file from HDFS and constructs a numpy matrix from the content of the file.
Cluster Info:
9 datanodes
128 GB memory / 48 vCores per node
Job config
conf = SparkConf().setAppName('test') \
.set('spark.executor.cores', 4) \
.set('spark.executor.memory', '72g') \
.set('spark.driver.memory', '16g') \
.set('spark.yarn.executor.memoryOverhead',4096 ) \
.set('spark.dynamicAllocation.enabled', 'true') \
.set('spark.shuffle.service.enabled', 'true') \
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
.set('spark.driver.maxResultSize',10000) \
.set('spark.kryoserializer.buffer.max', 2044)
fileRDD=sc.textFile("/tmp/test_file.txt")
fileRDD.cache()  # note: cache is a method; without the parentheses nothing is cached
list_of_lines_from_file = fileRDD.map(lambda line: line.split(" ")).collect()
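For context, the collected list of lines then feeds the numpy matrix mentioned above, something along these lines (a sketch only; the dtype and the assumption that every token is numeric and every line has the same length are mine):
import numpy as np
# the map above split each line into string tokens; convert them into a numeric matrix on the driver
matrix = np.array(list_of_lines_from_file, dtype=float)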
Error
The collect() call is throwing an out-of-memory error.
18/05/17 19:03:15 ERROR client.TransportResponseHandler: Still have 1
requests outstanding when connection fromHost/IP:53023 is closed
18/05/17 19:03:15 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
java.lang.OutOfMemoryError: Java heap space
Any help is much appreciated.
A little background on this issue:
I was having this issue while running the code through a Jupyter Notebook on an edge node of a Hadoop cluster.
Finding in Jupyter
Since you can only submit code from Jupyter in client mode (equivalent to launching spark-shell from the edge node), the Spark driver is always the edge node, which is already packed with other long-running daemon processes, and the memory available there is always less than the memory required for fileRDD.collect() on my file.
It worked fine with spark-submit
I put the content from Jupyter into a .py file and invoked it through spark-submit with the same settings. Whoa!! It ran in seconds there, the reason being that with --deploy-mode cluster the driver runs in a YARN container on a cluster node that has the required memory free, instead of on the busy edge node.
spark-submit --name "test_app" --master yarn --deploy-mode cluster --conf spark.executor.cores=4 --conf spark.executor.memory=72g --conf spark.driver.memory=72g --conf spark.yarn.executor.memoryOverhead=8192 --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryoserializer.buffer.max=2044 --conf spark.driver.maxResultSize=1g --conf spark.driver.extraJavaOptions='-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:MaxDirectMemorySize=2g' --conf spark.executor.extraJavaOptions='-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:MaxDirectMemorySize=2g' test.py
Next step:
Our next step is to see whether the Jupyter notebook can submit the Spark job to the YARN cluster via a Livy job server or a similar approach.
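A minimal sketch of what such a Livy batch submission could look like, assuming a Livy server on port 8998 (the host, file path, and every setting here are placeholders, not values from our cluster):
curl -s -X POST http://livy-host:8998/batches \
  -H 'Content-Type: application/json' \
  -d '{
        "file": "hdfs:///path/to/test.py",
        "conf": {
          "spark.executor.memory": "72g",
          "spark.executor.cores": "4",
          "spark.dynamicAllocation.enabled": "true"
        }
      }'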
I came across a scenario where, when I supply spark.yarn.stagingDir to spark-submit, it starts failing, and it doesn't give any clue about the root cause. I spent quite a long time figuring out that the spark.yarn.stagingDir parameter is what causes it. Why does spark-submit fail when this parameter is supplied?
Check the related question here for more details.
Command which fails:
spark-submit \
--conf "spark.yarn.stagingDir=/xyz/warehouse/spark" \
--queue xyz \
--class com.xyz.TestJob \
--master yarn \
--deploy-mode cluster \
--conf "spark.local.dir=/xyz/warehouse/tmp" \
/xyzpath/java-test-1.0-SNAPSHOT.jar
When I remove spark.yarn.stagingDir, it starts working:
spark-submit \
--queue xyz \
--class com.xyz.TestJob \
--master yarn \
--deploy-mode cluster \
--conf "spark.local.dir=/xyz/warehouse/tmp" \
/xyzpath/java-test-1.0-SNAPSHOT.jar
Exception stacktrace:
Application application_1506717704791_145448 finished with failed
status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1167)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1213)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
I encountered exactly the same problem when I set spark.yarn.stagingDir to /tmp (it worked fine once I removed this very configuration entry).
My solution is to specify the full HDFS path, like hdfs://hdfs_server_name/tmp, instead of merely /tmp. Hope it works for you.
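Applied to the failing command above, that would look something like this (hdfs_server_name is a placeholder for your actual NameNode address):
spark-submit \
  --conf "spark.yarn.stagingDir=hdfs://hdfs_server_name/xyz/warehouse/spark" \
  --queue xyz \
  --class com.xyz.TestJob \
  --master yarn \
  --deploy-mode cluster \
  --conf "spark.local.dir=/xyz/warehouse/tmp" \
  /xyzpath/java-test-1.0-SNAPSHOT.jar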
I have an issue running a Scala Spark 2.1 application in cluster mode.
Release label: emr-5.7.0
Hadoop distribution: Amazon 2.7.3
Applications: Hive 2.1.1, Presto 0.170, Spark 2.1.1, Ganglia 3.7.2, Zeppelin 0.7.2, ZooKeeper 3.4.10
I have a .jar which works perfectly when submitted via client mode on the cluster.
When I try to submit the jar in cluster mode, I receive an exception:
java.lang.IllegalArgumentException: Error while instantiating org.apache.spark.sql.hive.HiveSessionState':
at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981) at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)
...
Caused by: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:97)
Here is how I try to run the application:
spark-submit --master yarn \
--deploy-mode cluster \
--num-executors 64 \
--executor-cores 6 \
--driver-memory 6g \
--executor-memory 10g \
--conf "spark.driver.extraClassPath=/usr/lib/spark/jars/*.jar" \
--conf "spark.executor.extraClassPath=/usr/lib/spark/jars/*.jar" \
--conf "spark.yarn.queue=test_queue" \
--conf "spark.sql.hive.metastore.jars=/usr/hive/lib/*.jar" \
--jars /usr/lib/spark/jars/datanucleus-rdbms-3.2.9.jar,/usr/lib/spark/jars/datanucleus-api-jdo-3.2.6.jar,/usr/lib/spark/jars/datanucleus-core-3.2.10.jar \
--class MainApp /home/hadoop/app/application-1.0.jar
Here is my initialization of SparkSession:
val sparkSession = SparkSession
.builder()
.appName(applicationName)
.enableHiveSupport()
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.getOrCreate()
Could someone give some suggestions on what is worth trying?
PS: a PySpark application on this cluster works like a charm in cluster mode.
Don't specify where to look for the Hive jars with spark.sql.hive.metastore.jars; EMR will do that for you itself. Give that a shot.
If it doesn't work, please post your EMR cluster settings.
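In other words, a sketch of the same spark-submit with the metastore-jars override removed (everything else unchanged from the question):
spark-submit --master yarn \
  --deploy-mode cluster \
  --num-executors 64 \
  --executor-cores 6 \
  --driver-memory 6g \
  --executor-memory 10g \
  --conf "spark.driver.extraClassPath=/usr/lib/spark/jars/*.jar" \
  --conf "spark.executor.extraClassPath=/usr/lib/spark/jars/*.jar" \
  --conf "spark.yarn.queue=test_queue" \
  --jars /usr/lib/spark/jars/datanucleus-rdbms-3.2.9.jar,/usr/lib/spark/jars/datanucleus-api-jdo-3.2.6.jar,/usr/lib/spark/jars/datanucleus-core-3.2.10.jar \
  --class MainApp /home/hadoop/app/application-1.0.jar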
The issue was fixed; Spark was looking at obsolete libs.