Spark Job hanging - scala

I am running a Spark job on a YARN cluster through the spark-submit command. The job starts but hangs at the action steps. Below is the command I am using:
spark-submit \
--class com.spark.Inbound \
--name spark-learning \
--deploy-mode cluster \
--queue default \
--conf spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/home_dir/sam/log4j.properties \
--conf spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/home_dir/sam/log4j.properties \
--conf spark.logConf=true \
sparkscala_2.11-1.0.jar dev
val inputFileData = sparkContext.textFile(sourceConfig.employeeInLocation + "/" + sourceConfig.employeeInFileName)
inputFileData
  .flatMap(data => data.split("\n"))
  .foreach(println(_))
The same job, when run without the log4j properties, finishes successfully.
spark-submit \
--class com.spark.Inbound \
--name spark-learning \
--deploy-mode cluster \
--queue default \
sparkscala_2.11-1.0.jar dev
Am I missing any parameters?

Related

DriverManager.getConnection method is failing to establish a JDBC connection with Apache Phoenix

I am getting the error below while writing data to HBase through Apache Phoenix using Spark Scala code, even though I am passing all the HTrace libraries in my spark-submit command:
java.lang.NoClassDefFoundError: org/apache/htrace/Sampler
at org.apache.phoenix.trace.util.Tracing$Frequency.<clinit>(Tracing.java:73)
at org.apache.phoenix.query.QueryServicesOptions.<clinit>(QueryServicesOptions.java:230)
at org.apache.phoenix.query.QueryServicesImpl.<init>(QueryServicesImpl.java:36)
at org.apache.phoenix.jdbc.PhoenixDriver.getQueryServices(PhoenixDriver.java:197)
at org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices(PhoenixDriver.java:235)
at org.apache.phoenix.jdbc.PhoenixEmbeddedDriver.createConnection(PhoenixEmbeddedDriver.java:150)
at org.apache.phoenix.jdbc.PhoenixDriver.connect(PhoenixDriver.java:221)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:247)
Here is my spark-submit command:
spark-submit \
--jars $(echo /vol1/cloudera/parcels/CDH/jars/hbase*.jar | tr ' ' ','),$(echo /vol1/cloudera/parcels/CDH/jars/htrace-core4-4.2.0-incubating.jar | tr ' ' ','),mongo-spark-connector_2.11-2.4.0.jar,mongo-java-driver-3.11.0.jar,mail-1.4.7.jar,spark-sql-kafka-0-10_2.11-2.4.0-cdh6.2.1.jar,$(echo /vol1/cloudera/parcels/PHOENIX/lib/phoenix/lib/*.jar | tr ' ' ',') \
--files /etc/hbase/conf.cloudera.hbase/hbase-site.xml \
--driver-class-path mongo-spark-connector_2.11-2.4.0.jar:mongo-java-driver-3.11.0.jar:mail-1.4.7.jar:spark-sql-kafka-0-10_2.11-2.4.0-cdh6.2.1.jar:phoenix-core-5.0.0-cdh6.2.0.jar \
--conf spark.executor.extraClassPath=mongo-spark-connector_2.11-2.4.0.jar:mongo-java-driver-3.11.0.jar:mail-1.4.7.jar:spark-sql-kafka-0-10_2.11-2.4.0-cdh6.2.1.jar:phoenix-core-5.0.0-cdh6.2.0.jar:htrace-core4-4.2.0-incubating.jar \
--conf spark.driver.extraClassPath=mongo-spark-connector_2.11-2.4.0.jar:mongo-java-driver-3.11.0.jar:mail-1.4.7.jar:spark-sql-kafka-0-10_2.11-2.4.0-cdh6.2.1.jar:phoenix-core-5.0.0-cdh6.2.0.jar:htrace-core4-4.2.0-incubating.jar \
--class com.collectivei.spark2.opprtntymgnt.StreamEntry \
--driver-memory 2g \
--num-executors 2 \
--executor-cores 3 \
--executor-memory 3g \
--conf spark.streaming.backpressure.enabled=true \
--conf spark.streaming.backpressure.pid.minRate=10 \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.maxAppAttempts=4 \
--conf spark.yarn.am.attemptFailuresValidityInterval=1h \
--conf spark.yarn.max.executor.failures=16 \
--conf spark.yarn.executor.failuresValidityInterval=1h \
--conf spark.task.maxFailures=8 \
--queue users.admin \
--conf spark.speculation=true \
example.jar

Is there a specific order of parameters used in spark-submit while submitting a job?

I am trying to submit a spark job using spark-submit as below:
SPARK_MAJOR_VERSION=2 spark-submit \
--conf spark.ui.port=4090 \
--driver-class-path /home/devusr/jars/greenplum-spark_2.11-1.3.0.jar \
--jars /home/devusr/jars/greenplum-spark_2.11-1.3.0.jar \
--executor-cores 3 \
--executor-memory 13G \
--class com.partition.source.YearPartition splinter_2.11-0.1.jar \
--master=yarn \
--keytab /home/devusr/devusr.keytab \
--principal devusr#DEV.COM \
--files /usr/hdp/current/spark2-client/conf/hive-site.xml,testconnection.properties \
--name Splinter \
--conf spark.executor.extraClassPath=/home/devusr/jars/greenplum-spark_2.11-1.3.0.jar \
--conf spark.executor.instances=10 \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.files.maxPartitionBytes=256M
But the job doesn't run and instead just prints:
SPARK_MAJOR_VERSION is set to 2, using Spark2
Could anyone let me know if there is any specific order for the parameters used in spark-submit?
The format for using spark-submit in cluster mode on YARN is:
$ ./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster [options] <app jar> [app options]
as documented at https://spark.apache.org/docs/2.1.0/running-on-yarn.html
If splinter_2.11-0.1.jar is the jar that contains your class com.partition.source.YearPartition, can you try the following? Note that anything placed after the application jar is treated as an argument to your application, so all spark-submit options need to come before the jar:
spark-submit \
--class com.partition.source.YearPartition \
--master=yarn \
--conf spark.ui.port=4090 \
--driver-class-path /home/devusr/jars/greenplum-spark_2.11-1.3.0.jar \
--jars /home/devusr/jars/greenplum-spark_2.11-1.3.0.jar \
--executor-cores 3 \
--executor-memory 13G \
--keytab /home/devusr/devusr.keytab \
--principal devusr#DEV.COM \
--files /usr/hdp/current/spark2-client/conf/hive-site.xml,testconnection.properties \
--name Splinter \
--conf spark.executor.extraClassPath=/home/devusr/jars/greenplum-spark_2.11-1.3.0.jar \
--conf spark.executor.instances=10 \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.files.maxPartitionBytes=256M \
splinter_2.11-0.1.jar

cassandra/datastax: programmatically setting the datastax package

The following spark-submit script works:
nohup ./bin/spark-submit --jars ./ikoda/extrajars/ikoda_assembled_ml_nlp.jar,./ikoda/extrajars/stanford-corenlp-3.8.0.jar,./ikoda/extrajars/stanford-parser-3.8.0.jar \
--packages datastax:spark-cassandra-connector:2.0.1-s_2.11 \
--class ikoda.mlserver.Application \
--conf spark.cassandra.connection.host=192.168.0.33 \
--master local[*] ./ikoda/ikodaanalysis-mlserver-0.1.0.jar 1000 > ./logs/nohup.out &
Programmatically, I can do the same by configuring the SparkConf:
val conf = new SparkConf().setMaster("local[4]").setAppName("MLPCURLModelGenerationDataStream")
conf.set("spark.streaming.stopGracefullyOnShutdown", "true")
conf.set("spark.cassandra.connection.host", sparkcassandraconnectionhost)
conf.set("spark.driver.maxResultSize", sparkdrivermaxResultSize)
conf.set("spark.network.timeout", sparknetworktimeout)
Question
Can I add --packages datastax:spark-cassandra-connector:2.0.1-s_2.11 programmatically? If yes, how?
The corresponding option is spark.jars.packages:
conf.set(
  "spark.jars.packages",
  "datastax:spark-cassandra-connector:2.0.1-s_2.11")

Passing sys props to Spark 1.5, especially properties with spaces in them

After using Spark 1.2 for quite a long time, I have realised that you can no longer pass Spark configuration to the driver via --conf on the command line.
I am thinking about using system properties and picking the config up using the following bit of code:
def getConfigOption(conf: SparkConf, name: String): Option[String] =
  conf.getOption(name).orElse(sys.props.get(name))
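For what it's worth, a self-contained usage sketch of that helper; the property names config.file and startTime come from the rest of this question, and the bare SparkConf is only illustrative:
import org.apache.spark.SparkConf

def getConfigOption(conf: SparkConf, name: String): Option[String] =
  conf.getOption(name).orElse(sys.props.get(name))

// Look the values up in the Spark conf first, then fall back to -D system properties.
val conf       = new SparkConf()
val configFile = getConfigOption(conf, "config.file")
val startTime  = getConfigOption(conf, "startTime")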
How do I pass a config.file option and a string version of the date, specified as a start time, to a spark-submit command?
I have attempted using the following in my start up shell script:
--conf "spark.executor.extraJavaOptions=-Dconfig.file=../conf/application.conf -DstartTime=2016-06-04 00:00:00"
but this fails as it splits the command at the spaces.
Any idea how to do this successfully, or has anyone got any advice on this one?
EDIT: this is the bash script being used:
#!/bin/bash
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop/conf}
LIB_DIRECTORY=/opt/app/latest/lib/
ANALYSIS_JAR=spark-fa-2.16.18-standalone.jar
ANALYSIS_DRIVER_CLASS=com.spark.fa.Main
OTHER_OPTIONS=""
KEYTAB="/opt/app/keytab/fa.keytab"
PRINCIPAL="spark_K"
CLUSTER_OPTIONS=" \
--master yarn-client \
--driver-memory 2000M \
--executor-memory 5G \
--num-executors 39 \
--executor-cores 5 \
--conf spark.default.parallelism=200 \
--driver-java-options=-Dconfig.file=../conf/application.conf \
--conf "spark.executor.extraJavaOptions=-DstartTime='2016-06-04 00:00:00'" \
--conf spark.storage.memoryFraction=0.9 \
--files /opt/app/latest/conf/application.conf \
--conf spark.storage.safetyFraction=0.9 \
--keytab ${KEYTAB} \
--principal ${PRINCIPAL} \
"
spark-submit --class ${ANALYSIS_DRIVER_CLASS} ${CLUSTER_OPTIONS} ${LIB_DIRECTORY}/${ANALYSIS_JAR} ${CONFIG} ${#}
Use quotes:
--conf "spark.executor.extraJavaOptions=-Dconfig.file=../conf/application.conf -DstartTime='2016-06-04 00:00:00'"
If your parameter contains both spaces and single quotes (for instance a query parameter), you should enclose it in escaped double quotes \"
Example:
spark-submit --master yarn --deploy-mode cluster --conf "spark.driver.extraJavaOptions=-DfileFormat=PARQUET -Dquery=\"select * from bucket where code in ('A')\" -Dchunk=yes" spark-app.jar

How to pass -D parameter or environment variable to Spark job?

I want to change the Typesafe config of a Spark job between dev and prod environments. It seems to me that the easiest way to accomplish this is to pass -Dconfig.resource=ENVNAME to the job; the Typesafe Config library will then do the work for me.
Is there a way to pass that option directly to the job? Or is there a better way to change the job config at runtime?
EDIT:
Nothing happens when I add --conf "spark.executor.extraJavaOptions=-Dconfig.resource=dev" option to spark-submit command.
I got Error: Unrecognized option '-Dconfig.resource=dev'. when I pass -Dconfig.resource=dev to spark-submit command.
Change the spark-submit command line by adding three options:
--files <location_to_your_app.conf>
--conf 'spark.executor.extraJavaOptions=-Dconfig.resource=app'
--conf 'spark.driver.extraJavaOptions=-Dconfig.resource=app'
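As a sanity check on the job side, a minimal sketch of how the Typesafe Config library then picks the file up, assuming the conf file shipped with --files ends up in the container's working directory/classpath; no extra code is needed beyond a plain load, and the db.host key is purely hypothetical:
import com.typesafe.config.ConfigFactory

// ConfigFactory.load() honours the -Dconfig.resource / -Dconfig.file system properties,
// so it loads whatever the extraJavaOptions above point at instead of application.conf.
val config = ConfigFactory.load()
val dbHost = config.getString("db.host") // hypothetical key, for illustration only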
Here is my Spark program, run with the additional Java options:
/home/spark/spark-1.6.1-bin-hadoop2.6/bin/spark-submit \
--files /home/spark/jobs/fact_stats_ad.conf \
--conf spark.executor.extraJavaOptions=-Dconfig.fuction.conf \
--conf 'spark.driver.extraJavaOptions=-Dalluxio.user.file.writetype.default=CACHE_THROUGH -Dalluxio.user.file.write.location.policy.class=alluxio.client.file.policy.MostAvailableFirstPolicy -Dconfig.file=/home/spark/jobs/fact_stats_ad.conf' \
--class jobs.DiskDailyJob \
--packages com.databricks:spark-csv_2.10:1.4.0 \
--jars /home/spark/jobs/alluxio-core-client-1.2.0-RC2-jar-with-dependencies.jar \
--driver-memory 2g \
/home/spark/jobs/convert_to_parquet.jar \
AD_COOKIE_REPORT FACT_AD_STATS_DAILY | tee /data/fact_ad_stats_daily.log
As you can see, the command passes:
the custom config file:
--files /home/spark/jobs/fact_stats_ad.conf
the executor Java options:
--conf spark.executor.extraJavaOptions=-Dconfig.fuction.conf
the driver Java options:
--conf 'spark.driver.extraJavaOptions=-Dalluxio.user.file.writetype.default=CACHE_THROUGH -Dalluxio.user.file.write.location.policy.class=alluxio.client.file.policy.MostAvailableFirstPolicy -Dconfig.file=/home/spark/jobs/fact_stats_ad.conf'
Hope it helps.
I had a lot of problems with passing -D parameters to the Spark executors and the driver. I've added a quote from my blog post about it:
"The right way to pass the parameters is through the properties spark.driver.extraJavaOptions and spark.executor.extraJavaOptions. I've passed both the log4j configuration property and the parameter that I needed for the configuration. (To the driver I was able to pass only the log4j configuration.) For example (written in a properties file passed to spark-submit with --properties-file):
spark.driver.extraJavaOptions -Dlog4j.configuration=file:///spark/conf/log4j.properties
spark.executor.extraJavaOptions -Dlog4j.configuration=file:///spark/conf/log4j.properties -Dapplication.properties.file=hdfs:///some/path/on/hdfs/app.properties
spark.application.properties.file hdfs:///some/path/on/hdfs/app.properties
"
You can read my blog post about the overall configuration of Spark. I am running on YARN as well.
--files <location_to_your_app.conf>
--conf 'spark.executor.extraJavaOptions=-Dconfig.resource=app'
--conf 'spark.driver.extraJavaOptions=-Dconfig.resource=app'
If you write it this way, the later --conf value for the same property will overwrite the earlier one; you can verify this in the Spark UI's Environment tab after the job has started.
So the correct way is to put the options on the same line, like this:
--conf 'spark.executor.extraJavaOptions=-Da=b -Dc=d'
If you do this, you will find all your settings shown in the Spark UI.
I am starting my Spark application via a spark-submit command launched from within another Scala application. So I have an Array like
Array(".../spark-submit", ..., "--conf", confValues, ...)
where confValues is:
for yarn-cluster mode:
"spark.driver.extraJavaOptions=-Drun.mode=production -Dapp.param=..."
for local[*] mode:
"run.mode=development"
It is a bit tricky to understand where (not) to escape quotes and spaces, though. You can check the Spark web interface for system property values.
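To make the quoting point concrete, here is a hedged sketch of that launcher idea using scala.sys.process; the paths, class name, and jar below are placeholders, not values from the original post. Because each option is its own array element, the spaces inside confValues need no shell-level escaping:
import scala.sys.process._

// Placeholder values; only the structure matters here.
val confValues = "spark.driver.extraJavaOptions=-Drun.mode=production -Dapp.param=x"
val cmd = Seq(
  "/opt/spark/bin/spark-submit",  // placeholder spark-submit path
  "--master", "yarn",
  "--deploy-mode", "cluster",
  "--class", "com.example.Main",  // placeholder main class
  "--conf", confValues,           // passed as a single array element, so the embedded spaces survive
  "/opt/app/app.jar"              // placeholder application jar
)
val exitCode = cmd.!              // run spark-submit and wait for it to finish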
spark-submit --driver-java-options "-Denv=DEV -Dmode=local" --class co.xxx.datapipeline.jobs.EventlogAggregator target/datapipeline-jobs-1.0-SNAPSHOT.jar
The above command works for me:
-Denv=DEV => to read the DEV environment properties file, and
-Dmode=local => to create the SparkContext locally - .setMaster("local[*]")
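For illustration, a small sketch of how those two flags might be consumed in the driver; the property names env and mode come from the command above, everything else is assumed:
import org.apache.spark.SparkConf

// Read the -D flags from the driver JVM's system properties.
val env  = sys.props.getOrElse("env", "DEV")
val mode = sys.props.getOrElse("mode", "local")

val conf = new SparkConf().setAppName("EventlogAggregator")
if (mode == "local") conf.setMaster("local[*]")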
The approach in the command below may be helpful for you:
spark-submit --master local[2] \
--conf 'spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/tmp/log4j.properties' \
--conf 'spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/tmp/log4j.properties' \
--class com.test.spark.application.TestSparkJob target/application-0.0.1-SNAPSHOT-jar-with-dependencies.jar prod
I have tried this and it worked for me. I would also suggest reading the Spark documentation below, which is really helpful:
https://spark.apache.org/docs/latest/running-on-yarn.html
I originally had this config file:
my-app {
environment: dev
other: xxx
}
This is how I'm loading my config in my spark scala code:
import java.io.File
import com.typesafe.config.ConfigFactory

val config = ConfigFactory.parseFile(new File("my-app.conf"))
  .withFallback(ConfigFactory.load())
  .resolve()
  .getConfig("my-app")
With this setup, despite what the Typesafe Config documentation and all the other answers say, the system property override didn't work for me when I launched my Spark job like so (most likely because the parsed file takes precedence over ConfigFactory.load(), which is the layer that carries system-property overrides, so the file's environment: dev wins):
spark-submit \
--master yarn \
--deploy-mode cluster \
--name my-app \
--driver-java-options='-XX:MaxPermSize=256M -Dmy-app.environment=prod' \
--files my-app.conf \
my-app.jar
To get it to work I had to change my config file to:
my-app {
environment: dev
environment: ${?env.override}
other: xxx
}
and then launch it like so:
spark-submit \
--master yarn \
--deploy-mode cluster \
--name my-app \
--driver-java-options='-XX:MaxPermSize=256M -Denv.override=prod' \
--files my-app.conf \
my-app.jar
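To close the loop, a hedged sketch of reading the resolved value with the modified config file: with -Denv.override=prod on the driver JVM, the optional substitution replaces the default dev (the loading code repeats the snippet from earlier in this answer):
import java.io.File
import com.typesafe.config.ConfigFactory

val config = ConfigFactory.parseFile(new File("my-app.conf"))
  .withFallback(ConfigFactory.load()) // load() carries system properties, so env.override resolves from there
  .resolve()
  .getConfig("my-app")

println(config.getString("environment")) // "prod" when -Denv.override=prod is set, otherwise "dev"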