Oozie pyspark action using Spark 1.6 instead of 2.2 - pyspark

When run from the command line using spark2-submit, it runs under Spark version 2.2.0. But when I use an Oozie Spark action, it runs under Spark version 1.6.0 and fails with the error TypeError: 'JavaPackage' object is not callable.
My Oozie Spark action is below:
<!-- Spark action first -->
<action name="foundationorder" cred="hcat">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <job-xml>${hiveConfig}</job-xml>
        <master>yarn-cluster</master>
        <mode>cluster</mode>
        <name>OrderFill</name>
        <jar>${envRoot}/oozie/scripts/pyscripts/orders_fill.py</jar>
        <spark-opts>--py-files ${envRoot}/oozie/scripts/pyscripts/order_fill.zip
            --files ${hiveConfig}
            --conf spark.yarn.appMasterEnv.SPARK_HOME=/data/2/parcels/SPARK2-2.2.0.cloudera4-1.cdh5.13.3.p0.603055/lib/spark2/bin/
            --conf spark.executorEnv.SPARK_HOME=/data/2/parcels/SPARK2-2.2.0.cloudera4-1.cdh5.13.3.p0.603055/lib/spark2/bin/
        </spark-opts>
        <arg>${Arg1}</arg>
        <arg>${Arg2}</arg>
        <arg>${Arg3}</arg>
    </spark>
    <ok to="sendEmailKill"/>
    <error to="sendEmailKill"/>
</action>
I have set oozie.action.sharelib.for.spark=spark2 in the job.properties file.
Please advise how to force Oozie to use Spark 2.
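For reference, a minimal job.properties carrying this switch might look like the sketch below (host names and ports are placeholders):

nameNode=hdfs://namenode:8020
jobTracker=resourcemanager:8032
oozie.use.system.libpath=true
oozie.action.sharelib.for.spark=spark2

You can also confirm that a spark2 sharelib is actually installed on the Oozie server with oozie admin -shareliblist spark2; if nothing is listed, the spark2 sharelib has not been installed there.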

Related

Upgrading to spark_Cassandra_Connector_2_12:3.2.0 from spark_Cassandra_connector_2_11:2.5.1 is breaking

My code has been working well in production for over a year with
spark_Cassandra_connector_2_11:2.5.1 on EMR 5.29.
We want to upgrade to Scala 2.12 and EMR 6.6.0, and made the corresponding upgrades:
spark - 3.2.0
datastax_cassandra_connector - 3.2.0
scala - 2.12.11
EMR - 6.6.0
There is no config on the classpath; only the config below is passed to Spark:
--conf spark.cassandra.connection.host=${casHostIps} \
--conf spark.cassandra.auth.username=${casUser} \
--conf spark.cassandra.auth.password=${casPassword} \
--conf spark.cassandra.output.ignoreNulls=true \
--conf spark.cassandra.output.concurrent.writes=$SPARK_CASSANDRA_OUTPUT_CONCURRENT_WRITES \
The upgraded code no longer works on the new EMR with the above config options. It always complains with the following:
Caused by: com.typesafe.config.ConfigException$Missing: withValue(advanced.reconnection-policy.base-delay): No configuration setting found for key 'advanced.session-leak'
at com.typesafe.config.impl.SimpleConfig.findKeyOrNull(SimpleConfig.java:157)
at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:150)
at com.typesafe.config.impl.SimpleConfig.findOrNull(SimpleConfig.java:177)
at com.typesafe.config.impl.SimpleConfig.findOrNull(SimpleConfig.java:181)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:189)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:194)
at com.typesafe.config.impl.SimpleConfig.getConfigNumber(SimpleConfig.java:224)
at com.typesafe.config.impl.SimpleConfig.getInt(SimpleConfig.java:235)
at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
at com.datastax.oss.driver.internal.core.config.typesafe.TypesafeDriverExecutionProfile.getCached(TypesafeDriverExecutionProfile.java:291)
at com.datastax.oss.driver.internal.core.config.typesafe.TypesafeDriverExecutionProfile.getInt(TypesafeDriverExecutionProfile.java:88)
at com.datastax.oss.driver.internal.core.session.DefaultSession.<init>(DefaultSession.java:103)
at com.datastax.oss.driver.internal.core.session.DefaultSession.init(DefaultSession.java:88)
at com.datastax.oss.driver.api.core.session.SessionBuilder.buildDefaultSessionAsync(SessionBuilder.java:903)
at com.datastax.oss.driver.api.core.session.SessionBuilder.buildAsync(SessionBuilder.java:817)
at com.datastax.oss.driver.api.core.session.SessionBuilder.build(SessionBuilder.java:835)
at com.datastax.spark.connector.cql.DefaultConnectionFactory$.createSession(CassandraConnectionFactory.scala:143)
at com.datastax.spark.connector.cql.CassandraConnector$.createSession(CassandraConnector.scala:167)
I ended up pointing to a DataStax Java driver config on my classpath, and it finally works with https://docs.datastax.com/en/developer/java-driver/4.14/manual/core/configuration/reference/:
--conf spark.files=s3://${efSparkJarBucket}/${efSparkJarPrefix}reference.conf \
--conf spark.cassandra.connection.config.profile.path=reference.conf \
But there are not as many tuning options at the java-driver level as I have at the Spark level, e.g.:
--conf spark.cassandra.output.concurrent.writes=100 \
Two questions:
1. Why is the Spark Cassandra connector 3.2.0 a breaking change compared to 2.5.1?
2. If pointing to a config file (reference.conf) is the only option, how can I tune the throttling? My load does run, but very slowly, and the Cassandra cluster nodes are also dropping out and rejoining.
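For the throttling part, the driver-level knobs live under advanced.throttler in the profile file. A minimal sketch of what could be added to the reference.conf already being shipped (the class choice and numbers are illustrative, taken from the java-driver reference configuration linked above):

datastax-java-driver {
  advanced.throttler {
    # Caps in-flight requests per session; RateLimitingRequestThrottler is the rate-based alternative
    class = ConcurrencyLimitingRequestThrottler
    max-concurrent-requests = 1024
    max-queue-size = 10000
  }
}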

Hadoop, Spark: java.lang.NoSuchFieldError: TOKEN_KIND

I want to share an interesting error I ran into recently:
Exception in thread "main" java.lang.NoSuchFieldError: TOKEN_KIND
at org.apache.hadoop.crypto.key.kms.KMSClientProvider$KMSTokenRenewer.handleKind(KMSClientProvider.java:166)
at org.apache.hadoop.security.token.Token.getRenewer(Token.java:351)
at org.apache.hadoop.security.token.Token.renew(Token.java:377)
at org.apache.spark.deploy.security.HadoopFSCredentialProvider$$anonfun$getTokenRenewalInterval$1$$anonfun$5$$anonfun$apply$1.apply$mcJ$sp(HadoopFSDelegationTokeProvider.scala:119)
I was trying to spark2-submit a job to a remote driver host on a Cloudera cluster like this:
from pyspark.sql import SparkSession

# The original snippet was missing line continuations and passed
# .config("cluster") with no key; presumably the deploy mode was intended.
spark = SparkSession.builder \
    .master("yarn") \
    .config("spark.submit.deployMode", "cluster") \
    .config("spark.driver.host", "remote_driver_host") \
    .config("spark.yarn.keytab", "path_to_principal.keytab") \
    .config("spark.yarn.principal", "principal.name") \
    .config("spark.driver.bindAddress", "0.0.0.0") \
    .getOrCreate()
The Apache Spark and Hadoop versions on the Cloudera cluster are 2.3.0 and 2.6.0, respectively.
So the cause of the issue was quite trivial: a version mismatch between the local Spark binaries and the remote Spark driver.
Locally I had Spark 2.4.5 installed, while on Cloudera it was 2.3.0; after aligning the versions to 2.3.0, the issue was resolved and the Spark job completed successfully.
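A quick way to spot such a mismatch, assuming an edge node you can SSH to (the host name is a placeholder):

spark-submit --version                            # version of the local client
ssh gateway.example.com spark2-submit --version   # version shipped with the cluster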

Pyspark Phoenix integration failing in oozie workflow

I am connecting to and ingesting data into a Phoenix table using PySpark with the code below:
dataframe.write.format("org.apache.phoenix.spark") \
    .mode("overwrite") \
    .option("table", "tablename") \
    .option("zkUrl", "localhost:2181") \
    .save()
When I run this with spark-submit it works fine, using the command below:
spark-submit --master local --deploy-mode client \
    --files /etc/hbase/conf/hbase-site.xml \
    --conf "spark.executor.extraClassPath=/usr/hdp/current/phoenix-client/lib/phoenix-spark-4.7.0.2.6.3.0-235.jar:/usr/hdp/current/phoenix-client/phoenix-4.7.0.2.6.3.0-235-client.jar" \
    --conf "spark.driver.extraClassPath=/usr/hdp/current/phoenix-client/lib/phoenix-spark-4.7.0.2.6.3.0-235.jar:/usr/hdp/current/phoenix-client/phoenix-4.7.0.2.6.3.0-235-client.jar" \
    sparkPhoenix.py
When I run it through Oozie, I get the error below:
.ConnectionClosingException: Connection to ip-172-31-44-101.us-west-2.compute.internal/172.31.44.101:16020 is closing. Call id=9, waitTime=3 row 'SYSTEM:CATALOG,,' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=ip-172-31-44-101
Below is the workflow:
<action name="pysparkAction" retry-max="1" retry-interval="1" cred="hbase">
<spark
xmlns="uri:oozie:spark-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>local</master>
<mode>client</mode>
<name>Spark Example</name>
<jar>sparkPhoenix.py</jar>
<spark-opts>--py-files Leia.zip --files /etc/hbase/conf/hbase-site.xml --conf spark.executor.extraClassPath=/usr/hdp/current/phoenix-client/lib/phoenix-spark-4.7.0.2.6.3.0-235.jar:/usr/hdp/current/phoenix-client/phoenix-4.7.0.2.6.3.0-235-client.jar --conf spark.driver.extraClassPath=/usr/hdp/current/phoenix-client/lib/phoenix-spark-4.7.0.2.6.3.0-235.jar:/usr/hdp/current/phoenix-client/phoenix-4.7.0.2.6.3.0-235-client.jar</spark-opts>
</spark>
<ok to="successEmailaction"/>
<error to="failEmailaction"/>
</action>
Using spark-submit I had gotten the same error and corrected it by passing the required jars. In Oozie, even though I pass the jars, it still throws the error.
I found that --files /etc/hbase/conf/hbase-site.xml does not work when integrated with Oozie. Instead, I pass hbase-site.xml with a <file> tag in the Oozie Spark action, as below. It works fine now:
<file>file:///etc/hbase/conf/hbase-site.xml</file>
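For context, a sketch of where that element sits in the action (the spark-action schema puts <file> after <spark-opts> and any <arg> elements; the rest of the action above is unchanged):

<spark xmlns="uri:oozie:spark-action:0.2">
    ...
    <spark-opts>...</spark-opts>
    <file>file:///etc/hbase/conf/hbase-site.xml</file>
</spark>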

Spark Scala Jaas configuration

I'm executing Spark code in the Scala shell using Kafka jars, and my intention is to stream messages from a Kafka topic. My Spark object is created, but can anyone help me with how to pass a JAAS configuration file when starting the Spark shell? My error points to a missing JAAS configuration.
Assuming you have a spark-kafka.jaas file in the current folder you are running spark-submit from, you pass it as a file, as well as a driver and executor option:
spark-submit \
...
--files "spark-kafka.jaas#spark-kafka.jaas" \
--driver-java-options "-Djava.security.auth.login.config=./spark-kafka.jaas" \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./spark-kafka.jaas"
You might also need to set "security.protocol" within the Spark code's Kafka properties to one of the supported Kafka SASL protocols.
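For example, with the spark-streaming-kafka-0-10 consumer API (the package pulled in by the startup script below), that setting goes into the kafkaParams map; a sketch with placeholder broker, group, and protocol values:

import org.apache.kafka.common.serialization.StringDeserializer

// Consumer properties for KafkaUtils.createDirectStream; values are placeholders
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9093",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "my-consumer-group",
  "security.protocol" -> "SASL_PLAINTEXT"  // or SASL_SSL, matching your brokers
)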
I had an issue like yours. I'm using this startup script to run my spark-shell, on Spark 2.3.0:
export HOME=/home/alessio.palma/scala_test
spark2-shell \
--verbose \
--principal hdp_ud_appadmin \
--files "jaas.conf" \
--keytab $HOME/hdp_ud_app.keytab \
--master local[2] \
--packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.3.0,org.apache.kafka:kafka-clients:0.10.0.1 \
--conf spark.driver.extraJavaOptions="-Djava.security.auth.login.config=/home/alessio.palma/scala_test/jaas.conf -Djava.security.krb5.conf=/etc/krb5.conf" \
--conf spark.executor.extraJavaOptions="-Djava.security.auth.login.config=/home/alessio.palma/scala_test/jaas.conf -Djava.security.krb5.conf=/etc/krb5.conf" \
--driver-java-options spark.driver.extraJavaOptions="-Djava.security.auth.login.config=file://jaas.conf -Djava.security.krb5.conf=file:///etc/krb5.conf" \
--driver-java-options spark.executor.extraJavaOptions="-Djava.security.auth.login.config=file://jaas.conf -Djava.security.krb5.conf=file:///etc/krb5.conf" \
--queue=root.Global.UnifiedData.hdp_global_ud_app
Every attempt failed with this error:
org.apache.kafka.common.KafkaException: Failed to construct kafka consumer
...
Caused by: org.apache.kafka.common.KafkaException: org.apache.kafka.common.KafkaException: Jaas configuration not found
It looks like spark.driver.extraJavaOptions and spark.executor.extraJavaOptions were not working. Everything kept failing until I added this line at the top of my startup script:
export SPARK_SUBMIT_OPTS='-Djava.security.auth.login.config=/home/alessio.palma/scala_test/jaas.conf'
And magically the jaas.conf file was found. Another thing I suggest adding to your startup script is:
export SPARK_KAFKA_VERSION=0.10

DiskSpace quota exception when running spark action in oozie

I am trying to run a Spark action in Oozie. My Spark job fails with the error below:
The DiskSpace quota of /user/nidhin is exceeded: quota = 10737418240 B = 10 GB but diskspace consumed = 10973426088 B = 10.22 GB
I added the staging directory property to my Oozie workflow and pointed it to an HDFS directory other than my home directory, one that has TBs of space; even then I get the same error.
<action name="CheckErrors" cred="hcat">
<spark xmlns="uri:oozie:spark-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>yarn</master>
<mode>cluster</mode>
<name>JobName</name>
<class>com.nidhin.util.CheckErrorsRaw
</class>
<jar>${processor_jar}</jar>
<spark-opts>--queue=${queue_name}
--num-executors 0
--conf spark.dynamicAllocation.enabled=true
--conf spark.shuffle.service.enabled=true
--conf spark.yarn.stagingDir=${hdfs_data_base_dir}
</spark-opts>
<arg>${load_dt}</arg>
</spark>
<ok to="End" />
<error to="Kill" />
</action>
${hdfs_data_base_dir} is the /tenants/proj/ directory in HDFS and has TBs worth of space in it.
When I look in the Spark job tracker UI, the property is properly reflected:
spark.yarn.stagingDir hdfs://tenants/proj/
How do I fix this error and point to the above-mentioned stagingDir?