EMR Spark cluster mode Hive issue - scala

I have an issue running a Scala Spark 2.1 application in cluster mode.
Release label: emr-5.7.0
Hadoop distribution: Amazon 2.7.3
Applications: Hive 2.1.1, Presto 0.170, Spark 2.1.1, Ganglia 3.7.2, Zeppelin 0.7.2, ZooKeeper 3.4.10
I have a .jar that works perfectly when submitted in client mode on the cluster.
When I try to submit the jar in cluster mode, I receive an exception:
java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':
at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981) at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)
...
Caused by: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:97)
Here is how I try to run the application:
spark-submit --master yarn \
--deploy-mode cluster \
--num-executors 64 \
--executor-cores 6 \
--driver-memory 6g \
--executor-memory 10g \
--conf "spark.driver.extraClassPath=/usr/lib/spark/jars/*.jar" \
--conf "spark.executor.extraClassPath=/usr/lib/spark/jars/*.jar" \
--conf "spark.yarn.queue=test_queue" \
--conf "spark.sql.hive.metastore.jars=/usr/hive/lib/*.jar" \
--jars /usr/lib/spark/jars/datanucleus-rdbms-3.2.9.jar,/usr/lib/spark/jars/datanucleus-api-jdo-3.2.6.jar,/usr/lib/spark/jars/datanucleus-core-3.2.10.jar \
--class MainApp /home/hadoop/app/application-1.0.jar
Here is my initialization of SparkSession:
val sparkSession = SparkSession
.builder()
.appName(applicationName)
.enableHiveSupport()
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.getOrCreate()
Could someone suggest what is worth trying?
PS: a PySpark application runs like a charm on this cluster in cluster mode.

Don't specify where to look for the Hive jars with spark.sql.hive.metastore.jars; EMR will do that for you itself. Give that a shot.
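For example, a trimmed sketch of the same submit without the spark.sql.hive.metastore.jars override (the resource and class-path flags from the question are omitted here for brevity):
spark-submit --master yarn \
--deploy-mode cluster \
--conf "spark.yarn.queue=test_queue" \
--jars /usr/lib/spark/jars/datanucleus-rdbms-3.2.9.jar,/usr/lib/spark/jars/datanucleus-api-jdo-3.2.6.jar,/usr/lib/spark/jars/datanucleus-core-3.2.10.jar \
--class MainApp /home/hadoop/app/application-1.0.jar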
If it doesn't work, please post your EMR cluster settings.

The issue was fixed; Spark was looking at obsolete libs.

Related

Upgrading to spark_Cassandra_Connector_2_12:3.2.0 from spark_Cassandra_connector_2_11:2.5.1 is breaking

My code has been working well in production for over a year with
spark_Cassandra_connector_2_11:2.5.1 on EMR 5.29.
We want to upgrade to Scala 2.12 and EMR 6.6.0, and did the respective upgrades:
spark - 3.2.0
datastax_cassandra_connector - 3.2.0
scala - 2.12.11
EMR - 6.6.0
There is no config on the classpath; just the below config is passed to Spark:
--conf spark.cassandra.connection.host=${casHostIps} \
--conf spark.cassandra.auth.username=${casUser} \
--conf spark.cassandra.auth.password=${casPassword} \
--conf spark.cassandra.output.ignoreNulls=true \
--conf spark.cassandra.output.concurrent.writes=$SPARK_CASSANDRA_OUTPUT_CONCURRENT_WRITES \
The upgraded code no longer works on the new EMR with the above config options. It always complains with the below:
Caused by: com.typesafe.config.ConfigException$Missing: withValue(advanced.reconnection-policy.base-delay): No configuration setting found for key 'advanced.session-leak'
at com.typesafe.config.impl.SimpleConfig.findKeyOrNull(SimpleConfig.java:157)
at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:150)
at com.typesafe.config.impl.SimpleConfig.findOrNull(SimpleConfig.java:177)
at com.typesafe.config.impl.SimpleConfig.findOrNull(SimpleConfig.java:181)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:189)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:194)
at com.typesafe.config.impl.SimpleConfig.getConfigNumber(SimpleConfig.java:224)
at com.typesafe.config.impl.SimpleConfig.getInt(SimpleConfig.java:235)
at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
at com.datastax.oss.driver.internal.core.config.typesafe.TypesafeDriverExecutionProfile.getCached(TypesafeDriverExecutionProfile.java:291)
at com.datastax.oss.driver.internal.core.config.typesafe.TypesafeDriverExecutionProfile.getInt(TypesafeDriverExecutionProfile.java:88)
at com.datastax.oss.driver.internal.core.session.DefaultSession.<init>(DefaultSession.java:103)
at com.datastax.oss.driver.internal.core.session.DefaultSession.init(DefaultSession.java:88)
at com.datastax.oss.driver.api.core.session.SessionBuilder.buildDefaultSessionAsync(SessionBuilder.java:903)
at com.datastax.oss.driver.api.core.session.SessionBuilder.buildAsync(SessionBuilder.java:817)
at com.datastax.oss.driver.api.core.session.SessionBuilder.build(SessionBuilder.java:835)
at com.datastax.spark.connector.cql.DefaultConnectionFactory$.createSession(CassandraConnectionFactory.scala:143)
at com.datastax.spark.connector.cql.CassandraConnector$.createSession(CassandraConnector.scala:167)
I ended up referring to a datastax java-driver config (reference.conf) on my classpath, and it is finally working, following https://docs.datastax.com/en/developer/java-driver/4.14/manual/core/configuration/reference/:
--conf spark.files=s3://${efSparkJarBucket}/${efSparkJarPrefix}reference.conf \
--conf spark.cassandra.connection.config.profile.path=reference.conf \
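For illustration, a throttling section of the kind described in that java-driver reference could be appended to this reference.conf before it is uploaded; the setting names below come from the java-driver 4.x configuration reference, while the values (and the reuse of the same S3 variables as above) are only placeholders:
# Append a request-throttler block to the reference.conf shipped via spark.files
cat >> reference.conf <<'EOF'
datastax-java-driver {
  advanced.throttler {
    class = ConcurrencyLimitingRequestThrottler
    max-concurrent-requests = 512
    max-queue-size = 10000
  }
}
EOF
aws s3 cp reference.conf s3://${efSparkJarBucket}/${efSparkJarPrefix}reference.conf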
But there are not many tuning options at the java-driver level compared to what I can do at the Spark level, e.g.:
--conf spark.cassandra.output.concurrent.writes=100 \
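(For context, a couple more of the Spark-level write-tuning settings meant here; the names are taken from the spark-cassandra-connector reference docs and the values are purely illustrative:)
--conf spark.cassandra.output.batch.size.rows=256 \
--conf spark.cassandra.output.throughputMBPerSec=50 \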
Two questions:
Why is the 3.2.0 spark-cassandra-connector a breaking change compared to 2.5.1?
If referring to a config file (reference.conf) is the only option, how can I tune the throttling? My load does go through, but very slowly, and the Cassandra cluster nodes also keep going in and out.

Spark job gives path already exists error, even after using overwrite mode

Trying to write Parquet data to S3 or HDFS gives the same kind of error, even though I already use overwrite mode:
resFinal.write.mode(SaveMode.Overwrite).partitionBy("pro".......
The spark-submit command used for this job is:
spark-submit --master yarn \
--deploy-mode cluster \
--executor-memory 50G \
--driver-memory 54G \
--executor-cores 5 \
--queue High \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.driver.maxResultSize=7g \
--conf spark.executor.memoryOverhead=4500 \
--conf spark.driver.memoryOverhead=5400 \
--conf spark.sql.shuffle.partitions=7000 \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.spill.compress=true \
--conf spark.sql.tungsten.enabled=true \
--conf spark.sql.autoBroadCastJoinThreshold=-1 \
--conf spark.speculation=true \
--conf spark.dynamicAllocation.minExecutors=200 \
--conf spark.dynamicAllocation.maxExecutors=500 \
--conf spark.memory.storageFraction=0.6 \
--conf spark.memory.fraction=0.7 \
--class com.mnb.history_cleanup \
s3://dv-cam/1/cleanup-1.0-SNAPSHOT.jar H 0 20170101-20170102 HSO
No matter whether I write to HDFS or S3, I see:
org.apache.hadoop.fs.FileAlreadyExistsException: Path already exists as a file: s3://dv-ms-east-1/ms414x-test1/dl/ry8/.spark-staging-28e84dbb-7e91-4d5c-87ba-8e880cf28904/
or
File does not exist: /user/m/dl/vi/.spark-staging-bdb317f3-7ff9-458e-9ea8-7fb70ce4/pro

PySpark memory issue: Caused by: java.lang.OutOfMemoryError: Java heap space

Folks,
I am running PySpark code that reads a 500 MB file from HDFS and constructs a NumPy matrix from the contents of the file.
Cluster info:
9 datanodes
128 GB memory / 48 vCores per node
Job config:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('test') \
.set('spark.executor.cores', 4) \
.set('spark.executor.memory', '72g') \
.set('spark.driver.memory', '16g') \
.set('spark.yarn.executor.memoryOverhead',4096 ) \
.set('spark.dynamicAllocation.enabled', 'true') \
.set('spark.shuffle.service.enabled', 'true') \
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
.set('spark.driver.maxResultSize',10000) \
.set('spark.kryoserializer.buffer.max', 2044)
sc = SparkContext.getOrCreate(conf)  # get or create the SparkContext from the conf above

fileRDD = sc.textFile("/tmp/test_file.txt")
fileRDD.cache()  # cache() must be called with parentheses to take effect
list_of_lines_from_file = fileRDD.map(lambda line: line.split(" ")).collect()
Error:
The collect() call is throwing an out-of-memory error.
18/05/17 19:03:15 ERROR client.TransportResponseHandler: Still have 1
requests outstanding when connection from Host/IP:53023 is closed
18/05/17 19:03:15 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
java.lang.OutOfMemoryError: Java heap space
Any help is much appreciated.
A little background on this issue
I was having this issue while running the code through a Jupyter Notebook on an edge node of a Hadoop cluster.
Findings in Jupyter
Since you can only submit code from Jupyter in client mode (equivalent to launching spark-shell from the edge node), the Spark driver always runs on the edge node, which is already packed with other long-running daemon processes, and its available memory is always less than the memory required for fileRDD.collect() on my file.
Worked fine in spark-submit
I put the content from Jupyter into a .py file and invoked it through spark-submit with the same settings. Whoa!! It ran in seconds there, the reason being that with spark-submit in cluster mode, YARN places the driver on one of the cluster nodes that has the required memory free.
spark-submit --name "test_app" \
--master yarn \
--deploy-mode cluster \
--conf spark.executor.cores=4 \
--conf spark.executor.memory=72g \
--conf spark.driver.memory=72g \
--conf spark.yarn.executor.memoryOverhead=8192 \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.kryoserializer.buffer.max=2044 \
--conf spark.driver.maxResultSize=1g \
--conf spark.driver.extraJavaOptions='-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:MaxDirectMemorySize=2g' \
--conf spark.executor.extraJavaOptions='-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:MaxDirectMemorySize=2g' \
test.py
Next step:
Our next step is to see if the Jupyter notebook can submit the Spark job to the YARN cluster via a Livy job server or a similar approach.
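As a rough sketch of that Livy approach (assuming a Livy server on its default port 8998; the host, script path, and app name below are placeholders), the notebook host can submit a batch job to YARN with a plain REST call, so the driver runs inside the cluster instead of on the edge node:
# Submit test.py as a YARN batch job via Livy's /batches endpoint
curl -s -X POST http://livy-host:8998/batches \
  -H "Content-Type: application/json" \
  -d '{
        "file": "hdfs:///user/hadoop/test.py",
        "name": "test_app",
        "conf": {
          "spark.executor.memory": "72g",
          "spark.driver.memory": "72g",
          "spark.dynamicAllocation.enabled": "true"
        }
      }'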

Spark Error : executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM

I am working with the following Spark config:
maxCores = 5
driverMemory=2g
executorMemory=17g
executorInstances=100
Issue:
Out of 100 executors, my job ends up with only 10 active executors, even though enough memory is available. I even tried setting the executors to 250, but only 10 remain active. All I am trying to do is load a multi-partition Hive table and run df.count over it.
Please help me understand what is causing the executors to be killed:
17/12/20 11:08:21 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
17/12/20 11:08:21 INFO storage.DiskBlockManager: Shutdown hook called
17/12/20 11:08:21 INFO util.ShutdownHookManager: Shutdown hook called
Not sure why YARN is killing my executors.
I faced a similar issue, where investigating the NodeManager logs led me to the root cause.
You can access them via the web interface:
nodeManagerAddress:PORT/logs
The PORT is specified in yarn-site.xml under yarn.nodemanager.webapp.address (default: 8042).
My investigation workflow:
Collect the logs (yarn logs ... command)
Identify the node and container (in these logs) emitting the error
Search the NodeManager logs around the timestamp of the error for the root cause
Btw: you can access the aggregated collection (XML) of all configurations affecting a node at the same port with:
nodeManagerAddress:PORT/conf
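A rough sketch of that workflow from the shell (the application id, node hostname, and port below are placeholders):
# 1. Collect the aggregated YARN logs for the application
yarn logs -applicationId application_1513764203076_0001 > app.log
# 2. Find which node/container reported the TERM signal
grep -n "RECEIVED SIGNAL TERM" app.log
# 3. Browse that node's NodeManager logs and effective configuration
curl http://node-manager-host:8042/logs
curl http://node-manager-host:8042/conf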
I believe this issue has more to do with memory and the dynamic-allocation idle timeouts at the executor/container level. Make sure you can change the config params at the executor/container level.
One of the ways you can resolve this is by changing this config value, either in your spark-shell or in the Spark job:
spark.dynamicAllocation.executorIdleTimeout
This thread has more detailed information on how to resolve the issue; it worked for me:
https://jira.apache.org/jira/browse/SPARK-21733
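For example, a sketch of how that timeout can be raised on a submit (the value and the class/jar names here are placeholders; dynamic allocation also needs the external shuffle service enabled):
spark-submit --master yarn \
--deploy-mode cluster \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.executorIdleTimeout=300s \
--conf spark.dynamicAllocation.minExecutors=10 \
--conf spark.dynamicAllocation.maxExecutors=100 \
--class MainApp app.jar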
I had the same issue: my Spark job was using only 1 task node and killing the other provisioned nodes. This also happened when switching to EMR Serverless, where my job was being run on only one "thread". Please see below, as this fixed it for me:
spark-submit \
--name KSSH-0.3 \
--class com.jiuye.KSSH \
--master yarn \
--deploy-mode cluster \
--driver-memory 2g \
--executor-memory 2g \
--executor-cores 1 \
--num-executors 8 \
--jars $(echo /opt/software/spark2.1.1/spark_on_yarn/libs/*.jar | tr ' ' ',') \
--conf "spark.ui.showConsoleProgress=false" \
--conf "spark.yarn.am.memory=1024m" \
--conf "spark.yarn.am.memoryOverhead=1024m" \
--conf "spark.yarn.driver.memoryOverhead=1024m" \
--conf "spark.yarn.executor.memoryOverhead=1024m" \
--conf "spark.yarn.am.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=300 -XX:InitiatingHeapOccupancyPercent=50 -XX:G1ReservePercent=20 -XX:+DisableExplicitGC -Dcdh.version=5.12.0" \
--conf "spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=300 -XX:InitiatingHeapOccupancyPercent=50 -XX:G1ReservePercent=20 -XX:+DisableExplicitGC -Dcdh.version=5.12.0" \
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=300 -XX:InitiatingHeapOccupancyPercent=50 -XX:G1ReservePercent=20 -XX:+DisableExplicitGC -Dcdh.version=5.12.0" \
--conf "spark.streaming.backpressure.enabled=true" \
--conf "spark.streaming.kafka.maxRatePerPartition=1250" \
--conf "spark.locality.wait=1s" \
--conf "spark.shuffle.consolidateFiles=true" \
--conf "spark.executor.heartbeatInterval=360000" \
--conf "spark.network.timeout=420000" \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--conf "spark.hadoop.fs.hdfs.impl.disable.cache=true" \
/opt/software/spark2.1.1/spark_on_yarn/KSSH-0.3.jar

Why does spark-submit fail with `spark.yarn.stagingDir` with master yarn and deploy-mode cluster?

I came across a scenario where, when I supply spark.yarn.stagingDir to spark-submit, it starts failing without giving any clue about the root cause, and I spent quite a long time figuring out that it was because of the spark.yarn.stagingDir parameter. Why does spark-submit fail when this parameter is supplied?
Check the related question here for more details.
The command which fails:
spark-submit \
--conf "spark.yarn.stagingDir=/xyz/warehouse/spark" \
--queue xyz \
--class com.xyz.TestJob \
--master yarn \
--deploy-mode cluster \
--conf "spark.local.dir=/xyz/warehouse/tmp" \
/xyzpath/java-test-1.0-SNAPSHOT.jar
When I remove spark.yarn.stagingDir, it starts working:
spark-submit \
--queue xyz \
--class com.xyz.TestJob \
--master yarn \
--deploy-mode cluster \
--conf "spark.local.dir=/xyz/warehouse/tmp" \
/xyzpath/java-test-1.0-SNAPSHOT.jar
Exception stacktrace:
Application application_1506717704791_145448 finished with failed
status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1167)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1213)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
I encountered exactly the same problem when I set spark.yarn.stagingDir to /tmp (while it worked fine once I removed this very configuration entry).
My solution was to specify the full HDFS path, like hdfs://hdfs_server_name/tmp, instead of merely /tmp. Hope it works for you.
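That is, something along these lines, where hdfs_server_name stands in for your cluster's actual namenode / fs.defaultFS authority:
spark-submit \
--conf "spark.yarn.stagingDir=hdfs://hdfs_server_name/xyz/warehouse/spark" \
--queue xyz \
--class com.xyz.TestJob \
--master yarn \
--deploy-mode cluster \
--conf "spark.local.dir=/xyz/warehouse/tmp" \
/xyzpath/java-test-1.0-SNAPSHOT.jar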