PySpark Phoenix integration failing in Oozie workflow - pyspark

I am connecting to and ingesting data into a Phoenix table using PySpark with the code below:
dataframe.write.format("org.apache.phoenix.spark").mode("overwrite").option("table", "tablename").option("zkUrl", "localhost:2181").save()
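For context, here is a minimal end-to-end sketch of the same write path; the SparkSession setup, the sample DataFrame, and the table/zkUrl values are illustrative placeholders rather than details from the original job:
from pyspark.sql import SparkSession

# Build a session; the Phoenix/HBase jars must still be on the driver and
# executor classpaths (see the spark-submit command below).
spark = SparkSession.builder.appName("phoenixIngest").getOrCreate()

# Illustrative DataFrame; column names must match the Phoenix table schema.
dataframe = spark.createDataFrame([(1, "a"), (2, "b")], ["ID", "COL1"])

# Write through the phoenix-spark connector; "tablename" and the zkUrl are
# placeholders for the real table name and ZooKeeper quorum.
dataframe.write.format("org.apache.phoenix.spark") \
    .mode("overwrite") \
    .option("table", "tablename") \
    .option("zkUrl", "localhost:2181") \
    .save()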
When I run this with spark-submit it works fine, using the command below:
spark-submit --master local --deploy-mode client --files /etc/hbase/conf/hbase-site.xml --conf "spark.executor.extraClassPath=/usr/hdp/current/phoenix-client/lib/phoenix-spark-4.7.0.2.6.3.0-235.jar:/usr/hdp/current/phoenix-client/phoenix-4.7.0.2.6.3.0-235-client.jar" --conf "spark.driver.extraClassPath=/usr/hdp/current/phoenix-client/lib/phoenix-spark-4.7.0.2.6.3.0-235.jar:/usr/hdp/current/phoenix-client/phoenix-4.7.0.2.6.3.0-235-client.jar" sparkPhoenix.py
When I run this with Oozie I get the error below:
.ConnectionClosingException: Connection to ip-172-31-44-101.us-west-2.compute.internal/172.31.44.101:16020 is closing. Call id=9, waitTime=3 row 'SYSTEM:CATALOG,,' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=ip-172-31-44-101
Below is the workflow:
<action name="pysparkAction" retry-max="1" retry-interval="1" cred="hbase">
<spark xmlns="uri:oozie:spark-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>local</master>
<mode>client</mode>
<name>Spark Example</name>
<jar>sparkPhoenix.py</jar>
<spark-opts>--py-files Leia.zip --files /etc/hbase/conf/hbase-site.xml --conf spark.executor.extraClassPath=/usr/hdp/current/phoenix-client/lib/phoenix-spark-4.7.0.2.6.3.0-235.jar:/usr/hdp/current/phoenix-client/phoenix-4.7.0.2.6.3.0-235-client.jar --conf spark.driver.extraClassPath=/usr/hdp/current/phoenix-client/lib/phoenix-spark-4.7.0.2.6.3.0-235.jar:/usr/hdp/current/phoenix-client/phoenix-4.7.0.2.6.3.0-235-client.jar</spark-opts>
</spark>
<ok to="successEmailaction"/>
<error to="failEmailaction"/>
</action>
With spark-submit I initially got the same error and fixed it by passing the required jars. In Oozie, even though I pass the same jars, it still throws the error.

I found that "--files /etc/hbase/conf/hbase-site.xml" does not work when integrated with Oozie. I now pass hbase-site.xml with a <file> tag in the Oozie Spark action, as below, and it works fine:
<file>file:///etc/hbase/conf/hbase-site.xml</file>
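For reference, a trimmed sketch of how the action looks with the <file> element in place (only the relevant elements are shown, the elided parts are unchanged from the workflow above, and the element ordering follows the usual Oozie action layout):
<action name="pysparkAction" retry-max="1" retry-interval="1" cred="hbase">
    <spark xmlns="uri:oozie:spark-action:0.2">
        ...
        <jar>sparkPhoenix.py</jar>
        <spark-opts>--py-files Leia.zip --conf spark.executor.extraClassPath=... --conf spark.driver.extraClassPath=...</spark-opts>
        <file>file:///etc/hbase/conf/hbase-site.xml</file>
    </spark>
    <ok to="successEmailaction"/>
    <error to="failEmailaction"/>
</action>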

Related

spark-submit --py-files gives warning RuntimeWarning: Failed to add file <abc.py> speficied in 'spark.submit.pyFiles' to Python path:

We have a PySpark-based application and we are doing a spark-submit as shown below. The application works as expected, but we are seeing a strange warning message. Is there any way to handle this, and why does it appear?
Note: the cluster is an Azure HDInsight (HDI) cluster.
spark-submit --master yarn --deploy-mode cluster --jars file:/<localpath>/* --py-files pyFiles/__init__.py,pyFiles/<abc>.py,pyFiles/<abd>.py --files files/<env>.properties,files/<config>.json main.py
The warning seen is:
/usr/hdp/current/spark3-client/python/pyspark/context.py:256: RuntimeWarning: Failed to add file [file:///home/sshuser/project/pyFiles/abc.py] speficied in 'spark.submit.pyFiles' to Python path:
/mnt/resource/hadoop/yarn/local/usercache/sshuser/filecache/929
  warnings.warn(
The above warning appears for every file passed to --py-files, i.e. abc.py, abd.py, etc.
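One way to check whether the py-files were actually distributed despite the warning is to inspect things from inside the driver; here is a minimal diagnostic sketch (the property name spark.submit.pyFiles is standard, everything else is illustrative):
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# What spark-submit recorded for --py-files.
print(spark.sparkContext.getConf().get("spark.submit.pyFiles", ""))

# Whether the YARN filecache directories actually made it onto sys.path.
for p in sys.path:
    print(p)

# If the modules shipped via --py-files import cleanly at this point, the
# warning is harmless for this job; if an import fails, the filecache path
# really is missing from the Python path.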

Oozie pyspark action using Spark 1.6 instead of 2.2

When run from the command line using spark2-submit, it runs under Spark version 2.2.0. But when I use an Oozie Spark action, it runs under Spark version 1.6.0 and fails with the error TypeError: 'JavaPackage' object is not callable.
My Oozie Spark action is below:
<!-- Spark action first -->
<action name="foundationorder" cred="hcat">
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<job-xml>${hiveConfig}</job-xml>
<master>yarn-cluster</master>
<mode>cluster</mode>
<name>OrderFill</name>
<jar>${envRoot}/oozie/scripts/pyscripts/orders_fill.py</jar>
<spark-opts>--py-files ${envRoot}/oozie/scripts/pyscripts/order_fill.zip
--files ${hiveConfig}
--conf spark.yarn.appMasterEnv.SPARK_HOME=/data/2/parcels/SPARK2-2.2.0.cloudera4-1.cdh5.13.3.p0.603055/lib/spark2/bin/
--conf spark.executorEnv.SPARK_HOME=/data/2/parcels/SPARK2-2.2.0.cloudera4-1.cdh5.13.3.p0.603055/lib/spark2/bin/
</spark-opts>
<arg>${Arg1}</arg>
<arg>${Arg2}</arg>
<arg>${Arg3}</arg>
</spark>
<ok to="sendEmailKill"/>
<error to="sendEmailKill"/>
</action>
I have set oozie.action.sharelib.for.spark=spark2 in the job.properties file.
Please advise how to force Oozie to use Spark 2.
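For reference, the relevant job.properties lines would look like the sketch below; the sharelib override is the one mentioned above, while oozie.use.system.libpath=true is a common companion setting and an assumption here, not something stated in the original post:
# job.properties
oozie.action.sharelib.for.spark=spark2
oozie.use.system.libpath=true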

Spark Scala Jaas configuration

I'm executing Spark code in the Scala shell using Kafka jars, and my intention is to stream messages from a Kafka topic. My Spark object is created, but can anyone help me with how to pass a JAAS configuration file while starting the Spark shell? The error points to a missing JAAS configuration.
Assuming you have a spark-kafka.jaas file in the current folder you are running spark-submit from, you pass it as a file, as well as a driver and executor option:
spark-submit \
...
--files "spark-kafka.jaas#spark-kafka.jaas" \
--driver-java-options "-Djava.security.auth.login.config=./spark-kafka.jaas" \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./spark-kafka.jaas"
You might also need to set "security.protocol" within the Spark code's Kafka properties to one of the supported Kafka SASL protocols.
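As an illustration only, here is a PySpark-flavored sketch of where that property goes (the original question uses the Scala shell, but the option name is the same; the broker list, topic, and SASL_PLAINTEXT value are placeholder assumptions, and an existing SparkSession named spark is assumed):
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder brokers
      .option("subscribe", "my_topic")                      # placeholder topic
      # Must match what the brokers expose, e.g. SASL_PLAINTEXT or SASL_SSL.
      .option("kafka.security.protocol", "SASL_PLAINTEXT")
      .load())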
I got an issue like yours. I'm using this startup script to run my spark-shell, with Spark 2.3.0:
export HOME=/home/alessio.palma/scala_test
spark2-shell \
--verbose \
--principal hdp_ud_appadmin \
--files "jaas.conf" \
--keytab $HOME/hdp_ud_app.keytab \
--master local[2] \
--packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.3.0,org.apache.kafka:kafka-clients:0.10.0.1 \
--conf spark.driver.extraJavaOptions="-Djava.security.auth.login.config=/home/alessio.palma/scala_test/jaas.conf -Djava.security.krb5.conf=/etc/krb5.conf" \
--conf spark.executor.extraJavaOptions="-Djava.security.auth.login.config=/home/alessio.palma/scala_test/jaas.conf -Djava.security.krb5.conf=/etc/krb5.conf" \
--driver-java-options spark.driver.extraJavaOptions="-Djava.security.auth.login.config=file://jaas.conf -Djava.security.krb5.conf=file:///etc/krb5.conf" \
--driver-java-options spark.executor.extraJavaOptions="-Djava.security.auth.login.config=file://jaas.conf -Djava.security.krb5.conf=file:///etc/krb5.conf" \
--queue=root.Global.UnifiedData.hdp_global_ud_app
Every attempt failed with this error:
org.apache.kafka.common.KafkaException: Failed to construct kafka consumer
...
Caused by: org.apache.kafka.common.KafkaException: org.apache.kafka.common.KafkaException: Jaas configuration not found
It looks like spark.driver.extraJavaOptions and spark.executor.extraJavaOptions are not working. Everything kept failing until I added this line at the top of my startup script:
export SPARK_SUBMIT_OPTS='-Djava.security.auth.login.config=/home/alessio.palma/scala_test/jaas.conf'
And magically the jaas.conf file was found. Another thing I suggest adding to your startup script is:
export SPARK_KAFKA_VERSION=0.10

DiskSpace quota exception when running spark action in oozie

I am trying to run a Spark action in Oozie. My Spark job fails with the error below:
The DiskSpace quota of /user/nidhin is exceeded: quota = 10737418240 B = 10 GB but diskspace consumed = 10973426088 B = 10.22 GB
I added the staging dir property in my Oozie workflow and pointed it to an HDFS directory other than my home directory, one with TBs of space, but even then I get the same error.
<action name="CheckErrors" cred="hcat">
<spark xmlns="uri:oozie:spark-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>yarn</master>
<mode>cluster</mode>
<name>JobName</name>
<class>com.nidhin.util.CheckErrorsRaw</class>
<jar>${processor_jar}</jar>
<spark-opts>--queue=${queue_name}
--num-executors 0
--conf spark.dynamicAllocation.enabled=true
--conf spark.shuffle.service.enabled=true
--conf spark.yarn.stagingDir=${hdfs_data_base_dir}
</spark-opts>
<arg>${load_dt}</arg>
</spark>
<ok to="End" />
<error to="Kill" />
</action>
${hdfs_data_base_dir} is the /tenants/proj/ directory in HDFS and has TBs worth of space in it.
When I look at the Spark job tracker UI, the property is properly reflected:
spark.yarn.stagingDir hdfs://tenants/proj/
How do I fix this error and point to the above-mentioned stagingDir?

airflow java.sql.SQLException: No suitable driver

I am getting this error from Airflow:
java.sql.SQLException: No suitable driver
The same code works fine in Oozie. The command to run this job in Airflow is below.
I added the part in bold but still no luck.
/usr/hdp/current/spark-client/bin/spark-submit --master yarn --deploy-mode cluster --queue batch --jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar --files /usr/hdp/current/spark-client/conf/hive-site.xml --class com.bcp.test.SimpleBcpJob --executor-memory 1G hdfs://<server_name>/edw/libs/bcptest.jar **hdfs://<server_name>/edw/libs/sqljdbc41.jar** ${nameNode}/edw/data.properties ${nameNode}/edw/test/simple-bcp-job.properties reload
I'd appreciate any suggestions.