DiskSpace quota exception when running spark action in oozie - scala

I am trying to run a spark action in oozie. My spark job fails with the below error :
The DiskSpace quota of /user/nidhin is exceeded: quota = 10737418240 B = 10 GB but diskspace consumed = 10973426088 B = 10.22 GB
I added the staging dir property in my oozie workflow and pointed to a HDFS directory other than home which has TBs of space , even then i get the same error.
<action name="CheckErrors" cred="hcat">
<spark xmlns="uri:oozie:spark-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>yarn</master>
<mode>cluster</mode>
<name>JobName</name>
<class>com.nidhin.util.CheckErrorsRaw
</class>
<jar>${processor_jar}</jar>
<spark-opts>--queue=${queue_name}
--num-executors 0
--conf spark.dynamicAllocation.enabled=true
--conf spark.shuffle.service.enabled=true
--conf spark.yarn.stagingDir=${hdfs_data_base_dir}
</spark-opts>
<arg>${load_dt}</arg>
</spark>
<ok to="End" />
<error to="Kill" />
</action>
${hdfs_data_base_dir} is /tenants/proj/ directory in HDFS and has TBs worth of space in it.
When i look into the spark jobtracker UI , the property is properly reflected.
spark.yarn.stagingDir hdfs://tenants/proj/
How do fix this error and point to the above mentioned stagingDir ?

Related

how to read a file present on the edge node when submit spark application in deploy mode = cluster

I have an spark scala application( spark 2.4 ). I am passing a file present on my edge node as an argument to my driver(main) program, I read this file using scala.io.Source .Now when i do a spark-submit and mention --deploy-mode clientthen the application runs fine and it can read the file. But when i use deploy-mode cluster. the application fails saying file not found. Is there a way i can read the file from the edge node in cluster mode.
Thanks.
Edit..
I tried giving file:// before the file path but hat is not working either...
this is how i am giving the file path as an argument to my main class.
spark2-submit --jars spark-avro_2.11-2.4.0.jar --master yarn --deploy-mode cluster --driver-memory 4G --executor-memory 4G --executor-cores 4 --num-executors 6 --conf spark.executor.memoryOverhead=4096 --conf spark.driver.memoryOverhead=4096 --conf spark.executor.instances=150 --conf spark.shuffle.service.enabled=true --class com.citi.gct.main.StartGCTEtl global-consumer-etl-0.0.1-SNAPSHOT-jar-with-dependencies.jar file://home/gfrrtnee/aditya/Trigger_1250-ING-WS-ALL-PCL-INGEST-CPB_20200331_ING-GLOBAL-PCL-CPB-04-Apr-19-1.event dev Y
But still i am getting the same error in cluster mode.
20/05/07 06:27:47 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: User class threw exception: java.io.FileNotFoundException: file:/home/gfrrtnee/aditya/Trigger_1250-ING-WS-ALL-PCL-INGEST-CPB_20200331_ING-GLOBAL-PCL-CPB-04-Apr-19-1.event (No such file or directory)
In cluster mode, you can use the --files option of spark-submit.
Example: https://cloud.ibm.com/docs/services/AnalyticsforApacheSpark?topic=AnalyticsforApacheSpark-specify-file-path
Another option for you is to place the file in a distributed file system like hdfs or dbfs.

Oozie pyspark action using Spark 1.6 instead of 2.2

When run from the command line using spark2-submit its running under Spark version 2.2.0. But when i use a oozie spark action its running under Spark version 1.6.0 and failing with error TypeError: 'JavaPackage' object is not callable
My oozie spark action below
<!-- Spark action first -->
<action name="foundationorder" cred="hcat">
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<job-xml>${hiveConfig}</job-xml>
<master>yarn-cluster</master>
<mode>cluster</mode>
<name>OrderFill</name>
<jar>${envRoot}/oozie/scripts/pyscripts/orders_fill.py</jar>
<spark-opts>--py-files ${envRoot}/oozie/scripts/pyscripts/order_fill.zip
--files ${hiveConfig}
--conf spark.yarn.appMasterEnv.SPARK_HOME=/data/2/parcels/SPARK2-2.2.0.cloudera4-1.cdh5.13.3.p0.603055/lib/spark2/bin/
--conf spark.executorEnv.SPARK_HOME=/data/2/parcels/SPARK2-2.2.0.cloudera4-1.cdh5.13.3.p0.603055/lib/spark2/bin/
</spark-opts>
<arg>${Arg1}</arg>
<arg>${Arg2}</arg>
<arg>${Arg3}</arg>
</spark>
<ok to="sendEmailKill"/>
<error to="sendEmailKill"/>
</action>
I have mentioned oozie.action.sharelib.for.spark=spark2 in the job.properties file
Please advise how to force Oozie to use spark2

Pyspark Phoenix integration failing in oozie workflow

I am connecting and ingesting data into phoenix table using pyspark by below code
dataframe.write.format("org.apache.phoenix.spark").mode("overwrite").option("table", "tablename").option("zkUrl", "localhost:2181").save()
When i run this in spark submit it works fine by below command,
spark-submit --master local --deploy-mode client --files /etc/hbase/conf/hbase-site.xml --conf "spark.executor.extraClassPath=/usr/hdp/current/phoenix-client/lib/phoenix-spark-4.7.0.2.6.3.0-235.jar:/usr/hdp/current/phoenix-client/phoenix-4.7.0.2.6.3.0-235-client.jar" --conf "spark.driver.extraClassPath=/usr/hdp/current/phoenix-client/lib/phoenix-spark-4.7.0.2.6.3.0-235.jar:/usr/hdp/current/phoenix-client/phoenix-4.7.0.2.6.3.0-235-client.jar" sparkPhoenix.py
When i run this with oozie I am getting below error,
.ConnectionClosingException: Connection to ip-172-31-44-101.us-west-2.compute.internal/172.31.44.101:16020 is closing. Call id=9, waitTime=3 row 'SYSTEM:CATALOG,,' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=ip-172-31-44-101
Below is workflow,
<action name="pysparkAction" retry-max="1" retry-interval="1" cred="hbase">
<spark
xmlns="uri:oozie:spark-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>local</master>
<mode>client</mode>
<name>Spark Example</name>
<jar>sparkPhoenix.py</jar>
<spark-opts>--py-files Leia.zip --files /etc/hbase/conf/hbase-site.xml --conf spark.executor.extraClassPath=/usr/hdp/current/phoenix-client/lib/phoenix-spark-4.7.0.2.6.3.0-235.jar:/usr/hdp/current/phoenix-client/phoenix-4.7.0.2.6.3.0-235-client.jar --conf spark.driver.extraClassPath=/usr/hdp/current/phoenix-client/lib/phoenix-spark-4.7.0.2.6.3.0-235.jar:/usr/hdp/current/phoenix-client/phoenix-4.7.0.2.6.3.0-235-client.jar</spark-opts>
</spark>
<ok to="successEmailaction"/>
<error to="failEmailaction"/>
</action>
Using spark-submit I got the same error I corrected that by passing required jars. In oozie, Even i pass jars, it throwing error.
I found that "--files /etc/hbase/conf/hbase-site.xml" does not working when integrated with oozie. I pass the hbase-site.xml as below with file tag in oozie spark action. It works fine now
<file>file:///etc/hbase/conf/hbase-site.xml</file>

pyspark memory issue :Caused by: java.lang.OutOfMemoryError: Java heap space

Folks,
Am running a pyspark code to read 500mb file from hdfs and constructing a numpy matrix from the content of the file
Cluster Info:
9 datanodes
128 GB Memory /48 vCore CPU /Node
Job config
conf = SparkConf().setAppName('test') \
.set('spark.executor.cores', 4) \
.set('spark.executor.memory', '72g') \
.set('spark.driver.memory', '16g') \
.set('spark.yarn.executor.memoryOverhead',4096 ) \
.set('spark.dynamicAllocation.enabled', 'true') \
.set('spark.shuffle.service.enabled', 'true') \
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
.set('spark.driver.maxResultSize',10000) \
.set('spark.kryoserializer.buffer.max', 2044)
fileRDD=sc.textFile("/tmp/test_file.txt")
fileRDD.cache
list_of_lines_from_file = fileRDD.map(lambda line: line.split(" ")).collect()
Error
The Collect piece is spitting outofmemory error.
18/05/17 19:03:15 ERROR client.TransportResponseHandler: Still have 1
requests outstanding when connection fromHost/IP:53023 is closed
18/05/17 19:03:15 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
java.lang.OutOfMemoryError: Java heap space
any help is much appreciated.
A little background on this issue
I was having this issue while i run the code through Jupyter Notebook which runs on an edgenode of a hadoop cluster
Finding in Jupyter
since you can only submit the code from Jupyter through client mode,(equivalent to launching spark-shell from the edgenode) the spark driver is always the edgenode which is already packed with other long running daemon processes, where the available memory is always lesser than the memory required for fileRDD.collect() on my file
Worked fine in spark-submit
I put the content from Jupyer to a .py file and invoked the same through spark-submit with same settings Whoa!! , it ran in seconds there, reason being , spark-submit is optimized to choose the driver node from one of the nodes that has required memory free from the cluster .
spark-submit --name "test_app" --master yarn --deploy-mode cluster --conf spark.executor.cores=4 --conf spark.executor.memory=72g --conf spark.driver.memory=72g --conf spark.yarn.executor.memoryOverhead=8192 --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryoserializer.buffer.max=2044 --conf spark.driver.maxResultSize=1g --conf spark.driver.extraJavaOptions='-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:MaxDirectMemorySize=2g' --conf spark.executor.extraJavaOptions='-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:MaxDirectMemorySize=2g' test.py
Next Step :
Our next step is to see if Jupyter notebook can submit the spark job to YARN cluster , via a Livy JobServer or a similar approach.

Spark: Executor Lost Failure (After adding groupBy job)

I’m trying to run Spark job on Yarn client. I have two nodes and each node has the following configurations.
I’m getting “ExecutorLostFailure (executor 1 lost)”.
I have tried most of the Spark tuning configuration. I have reduced to one executor lost because initially I got like 6 executor failures.
These are my configuration (my spark-submit) :
HADOOP_USER_NAME=hdfs spark-submit --class genkvs.CreateFieldMappings
--master yarn-client --driver-memory 11g --executor-memory 11G --total-executor-cores 16 --num-executors 15 --conf "spark.executor.extraJavaOptions=-XX:+UseCompressedOops
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" --conf spark.akka.frameSize=1000 --conf spark.shuffle.memoryFraction=1 --conf
spark.rdd.compress=true --conf
spark.core.connection.ack.wait.timeout=800
my-data/lookup_cache_spark-assembly-1.0-SNAPSHOT.jar -h
hdfs://hdp-node-1.zone24x7.lk:8020 -p 800
My data size is 6GB and I’m doing a groupBy in my job.
def process(in: RDD[(String, String, Int, String)]) = {
in.groupBy(_._4)
}
I’m new to Spark, please help me to find my mistake. I’m struggling for at least a week now.
Thank you very much in advance.
Two issues pop out:
the spark.shuffle.memoryFraction is set to 1. Why did you choose that instead of leaving the default 0.2 ? That may starve other non shuffle operations
You only have 11G available to 16 cores. With only 11G I would set the number of workers in your job to no more than 3 - and initially (to get past the executors lost issue) just try 1. With 16 executors each one gets like 700mb - which then no surprise they are getting OOME / executor lost.