SparkPi program keeps running under Yarn/Spark/Google Compute Engine - scala

I deployed a Hadoop (YARN + Spark) cluster on Google Compute Engine with one master and two slaves. When I run the following shell script:
spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 1 --driver-memory 1g --executor-memory 1g --executor-cores 1 /home/hadoop/spark-install/lib/spark-examples-1.1.0-hadoop2.4.0.jar 10
the job just keeps running & every second I get a message similar to this:
15/02/06 22:47:12 INFO yarn.Client: Application report from ResourceManager:
application identifier: application_1423247324488_0008
appId: 8
clientToAMToken: null
appDiagnostics:
appMasterHost: hadoop-w-zrem.c.myapp.internal
appQueue: default
appMasterRpcPort: 0
appStartTime: 1423261517468
yarnAppState: RUNNING
distributedFinalState: UNDEFINED
appTrackingUrl: http://hadoop-m-xxxx:8088/proxy/application_1423247324488_0008/
appUser: achitre

Instead of --master yarn-cluster use --master yarn-client

After adding the following line to my script, it worked:
export SPARK_JAVA_OPTS="-Dspark.yarn.executor.memoryOverhead=1024 -Dspark.local.dir=/tmp -Dspark.executor.memory=1024"
I guess we shouldn't use 'm', 'g', etc. when specifying memory; otherwise we get a NumberFormatException.
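As a toy illustration of that guess (plain Python, not Spark's actual parser): if a setting is read as a plain integer number of megabytes, any unit suffix fails much like the Java NumberFormatException does.

```python
def parse_overhead_mb(value):
    """Toy stand-in for a setting parsed as a plain integer of MB,
    mimicking how a unit suffix like '1g' could trigger a
    NumberFormatException in older Spark versions (an assumption for
    illustration, not the real parser)."""
    return int(value)  # raises ValueError on '1g', '1024m', etc.

print(parse_overhead_mb("1024"))  # → 1024
```

So `-Dspark.yarn.executor.memoryOverhead=1024` parses cleanly, while a value like `1g` would not.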

Related

spark-submit --py-files gives warning RuntimeWarning: Failed to add file <abc.py> speficied in 'spark.submit.pyFiles' to Python path:

We have a PySpark-based application and we are doing a spark-submit as shown below. The application works as expected, but we are seeing a weird warning message. Is there a way to handle this, and why is it coming?
Note: The cluster is Azure HDI Cluster.
spark-submit --master yarn --deploy-mode cluster --jars file:/<localpath>/* --py-files pyFiles/__init__.py,pyFiles/<abc>.py,pyFiles/<abd>.py --files files/<env>.properties,files/<config>.json main.py
warning seen is:
/usr/hdp/current/spark3-client/python/pyspark/context.py:256:
RuntimeWarning: Failed to add file
[file:///home/sshuser/project/pyFiles/abc.py] speficied in
'spark.submit.pyFiles' to Python path:
/mnt/resource/hadoop/yarn/local/usercache/sshuser/filecache/929
warnings.warn(
The above warning comes for all files, i.e. abc.py, abd.py, etc. (whichever are passed to --py-files).

How to read a file present on the edge node when submitting a Spark application with deploy mode = cluster

I have a Spark Scala application (Spark 2.4). I am passing a file present on my edge node as an argument to my driver (main) program, and I read this file using scala.io.Source. When I do a spark-submit with --deploy-mode client, the application runs fine and can read the file. But when I use --deploy-mode cluster, the application fails saying the file was not found. Is there a way I can read the file from the edge node in cluster mode?
Thanks.
Edit:
I tried giving file:// before the file path but that is not working either.
This is how I am giving the file path as an argument to my main class:
spark2-submit --jars spark-avro_2.11-2.4.0.jar --master yarn --deploy-mode cluster --driver-memory 4G --executor-memory 4G --executor-cores 4 --num-executors 6 --conf spark.executor.memoryOverhead=4096 --conf spark.driver.memoryOverhead=4096 --conf spark.executor.instances=150 --conf spark.shuffle.service.enabled=true --class com.citi.gct.main.StartGCTEtl global-consumer-etl-0.0.1-SNAPSHOT-jar-with-dependencies.jar file://home/gfrrtnee/aditya/Trigger_1250-ING-WS-ALL-PCL-INGEST-CPB_20200331_ING-GLOBAL-PCL-CPB-04-Apr-19-1.event dev Y
But I still get the same error in cluster mode.
20/05/07 06:27:47 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: User class threw exception: java.io.FileNotFoundException: file:/home/gfrrtnee/aditya/Trigger_1250-ING-WS-ALL-PCL-INGEST-CPB_20200331_ING-GLOBAL-PCL-CPB-04-Apr-19-1.event (No such file or directory)
In cluster mode, you can use the --files option of spark-submit.
Example: https://cloud.ibm.com/docs/services/AnalyticsforApacheSpark?topic=AnalyticsforApacheSpark-specify-file-path
Another option for you is to place the file in a distributed file system such as HDFS or DBFS.
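With --files, YARN localizes the file into each container's working directory, so the program opens it by bare filename rather than by its edge-node path. A plain-Python illustration of that lookup (the question's application is Scala; this only shows the idea, and the path is a shortened, hypothetical version of the one in the question):

```python
import os

def staged_name(path_passed_at_submit):
    """When a file is shipped with spark-submit --files, it is copied
    into the container's working directory, so only its basename is
    meaningful inside the job (illustration, not a Spark API)."""
    return os.path.basename(path_passed_at_submit)

# the driver would open "trigger.event" from its working directory,
# not the original /home/.../ edge-node path (hypothetical file name)
print(staged_name("/home/gfrrtnee/aditya/trigger.event"))  # → trigger.event
```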

pyspark memory issue: Caused by: java.lang.OutOfMemoryError: Java heap space

Folks,
I am running PySpark code that reads a 500 MB file from HDFS and constructs a NumPy matrix from the contents of the file.
Cluster Info:
9 datanodes
128 GB memory / 48 vCores per node
Job config
conf = SparkConf().setAppName('test') \
    .set('spark.executor.cores', 4) \
    .set('spark.executor.memory', '72g') \
    .set('spark.driver.memory', '16g') \
    .set('spark.yarn.executor.memoryOverhead', 4096) \
    .set('spark.dynamicAllocation.enabled', 'true') \
    .set('spark.shuffle.service.enabled', 'true') \
    .set('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
    .set('spark.driver.maxResultSize', 10000) \
    .set('spark.kryoserializer.buffer.max', 2044)
fileRDD = sc.textFile("/tmp/test_file.txt")
fileRDD.cache()
list_of_lines_from_file = fileRDD.map(lambda line: line.split(" ")).collect()
Error
The collect() step is throwing an out-of-memory error.
18/05/17 19:03:15 ERROR client.TransportResponseHandler: Still have 1
requests outstanding when connection from Host/IP:53023 is closed
18/05/17 19:03:15 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
java.lang.OutOfMemoryError: Java heap space
any help is much appreciated.
A little background on this issue
I was having this issue while running the code through a Jupyter notebook that runs on an edge node of a Hadoop cluster.
Findings in Jupyter
Since you can only submit code from Jupyter in client mode (equivalent to launching spark-shell from the edge node), the Spark driver is always the edge node, which is already packed with other long-running daemon processes, and the available memory there is always less than the memory required for fileRDD.collect() on my file.
Worked fine in spark-submit
I put the content from Jupyter into a .py file and invoked it through spark-submit with the same settings. Whoa!! It ran in seconds there. The reason is that spark-submit is optimized to choose the driver node from one of the cluster nodes that has the required memory free.
spark-submit --name "test_app" --master yarn --deploy-mode cluster --conf spark.executor.cores=4 --conf spark.executor.memory=72g --conf spark.driver.memory=72g --conf spark.yarn.executor.memoryOverhead=8192 --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryoserializer.buffer.max=2044 --conf spark.driver.maxResultSize=1g --conf spark.driver.extraJavaOptions='-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:MaxDirectMemorySize=2g' --conf spark.executor.extraJavaOptions='-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:MaxDirectMemorySize=2g' test.py
Next Step :
Our next step is to see whether a Jupyter notebook can submit the Spark job to the YARN cluster, via a Livy job server or a similar approach.
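Separate from where the driver runs, the collect() in the question pulls every split line into the driver heap at once. When rows can be consumed one at a time, a lazy pipeline keeps the driver footprint small; here is a plain-Python sketch mirroring the lambda from the question (PySpark's RDD.toLocalIterator() offers a similar streaming alternative to collect()):

```python
def split_lines(lines):
    """Lazily split each line on spaces, mirroring
    fileRDD.map(lambda line: line.split(" ")) without materializing
    every row at once the way collect() does."""
    for line in lines:
        yield line.split(" ")

# rows are produced one at a time instead of as one giant list
for row in split_lines(["1 2 3", "4 5 6"]):
    print(row)  # → ['1', '2', '3'] then ['4', '5', '6']
```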

airflow java.sql.SQLException: No suitable driver

I am getting this error from Airflow:
java.sql.SQLException: No suitable driver
The same code works fine in Oozie. The command to run this job in Airflow is below.
I added the line marked in bold, but still no luck.
/usr/hdp/current/spark-client/bin/spark-submit --master yarn --deploy-mode cluster --queue batch --jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar --files /usr/hdp/current/spark-client/conf/hive-site.xml --class com.bcp.test.SimpleBcpJob --executor-memory 1G hdfs://<server_name>/edw/libs/bcptest.jar **hdfs://<server_name>/edw/libs/sqljdbc41.jar** ${nameNode}/edw/data.properties ${nameNode}/edw/test/simple-bcp-job.properties reload
I'd appreciate any suggestions.

Spark: Executor Lost Failure (After adding groupBy job)

I'm trying to run a Spark job on a YARN client. I have two nodes, and each node has the following configurations.
I'm getting "ExecutorLostFailure (executor 1 lost)".
I have tried most of the Spark tuning configurations. I have got it down to one executor lost; initially I got about 6 executor failures.
This is my configuration (my spark-submit):
HADOOP_USER_NAME=hdfs spark-submit --class genkvs.CreateFieldMappings \
  --master yarn-client --driver-memory 11g --executor-memory 11G \
  --total-executor-cores 16 --num-executors 15 \
  --conf "spark.executor.extraJavaOptions=-XX:+UseCompressedOops -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  --conf spark.akka.frameSize=1000 --conf spark.shuffle.memoryFraction=1 \
  --conf spark.rdd.compress=true --conf spark.core.connection.ack.wait.timeout=800 \
  my-data/lookup_cache_spark-assembly-1.0-SNAPSHOT.jar \
  -h hdfs://hdp-node-1.zone24x7.lk:8020 -p 800
My data size is 6GB and I’m doing a groupBy in my job.
def process(in: RDD[(String, String, Int, String)]) = {
  in.groupBy(_._4)
}
I'm new to Spark; please help me find my mistake. I've been struggling for at least a week now.
Thank you very much in advance.
Two issues pop out:
spark.shuffle.memoryFraction is set to 1. Why did you choose that instead of leaving the default 0.2? That may starve other, non-shuffle operations.
You only have 11G available for 16 cores. With only 11G, I would set the number of workers in your job to no more than 3, and initially (to get past the executors-lost issue) just try 1. With 16 executors each one gets about 700 MB, so it is no surprise they are hitting OOME / executor lost.
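The arithmetic behind that last point can be sketched in plain Python (the 11 GB and 16 executors come from the question; YARN memory overhead is ignored for simplicity):

```python
def mb_per_executor(node_memory_gb, executors_per_node):
    """Rough per-executor share of a node's memory, ignoring the
    YARN memory overhead (a back-of-envelope sketch, not a Spark API)."""
    return node_memory_gb * 1024 // executors_per_node

# 11 GB split 16 ways leaves roughly 700 MB per executor, which makes
# the heap-space / executor-lost failures unsurprising
print(mb_per_executor(11, 16))  # → 704

# with only 3 executors per node, each gets a workable ~3.7 GB
print(mb_per_executor(11, 3))   # → 3754
```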