I am getting the error below while writing data to HBase through Apache Phoenix from Spark Scala code. I am already passing all the HTrace libraries in my spark-submit command.
java.lang.NoClassDefFoundError: org/apache/htrace/Sampler
at org.apache.phoenix.trace.util.Tracing$Frequency.<clinit>(Tracing.java:73)
at org.apache.phoenix.query.QueryServicesOptions.<clinit>(QueryServicesOptions.java:230)
at org.apache.phoenix.query.QueryServicesImpl.<init>(QueryServicesImpl.java:36)
at org.apache.phoenix.jdbc.PhoenixDriver.getQueryServices(PhoenixDriver.java:197)
at org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices(PhoenixDriver.java:235)
at org.apache.phoenix.jdbc.PhoenixEmbeddedDriver.createConnection(PhoenixEmbeddedDriver.java:150)
at org.apache.phoenix.jdbc.PhoenixDriver.connect(PhoenixDriver.java:221)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:247)
Here is my spark-submit command:
spark-submit \
--jars $(echo /vol1/cloudera/parcels/CDH/jars/hbase*.jar | tr ' ' ','),$(echo /vol1/cloudera/parcels/CDH/jars/htrace-core4-4.2.0-incubating.jar | tr ' ' ','),mongo-spark-connector_2.11-2.4.0.jar,mongo-java-driver-3.11.0.jar,mail-1.4.7.jar,spark-sql-kafka-0-10_2.11-2.4.0-cdh6.2.1.jar,$(echo /vol1/cloudera/parcels/PHOENIX/lib/phoenix/lib/*.jar | tr ' ' ',') \
--files /etc/hbase/conf.cloudera.hbase/hbase-site.xml \
--driver-class-path mongo-spark-connector_2.11-2.4.0.jar:mongo-java-driver-3.11.0.jar:mail-1.4.7.jar:spark-sql-kafka-0-10_2.11-2.4.0-cdh6.2.1.jar:phoenix-core-5.0.0-cdh6.2.0.jar \
--conf spark.executor.extraClassPath=mongo-spark-connector_2.11-2.4.0.jar:mongo-java-driver-3.11.0.jar:mail-1.4.7.jar:spark-sql-kafka-0-10_2.11-2.4.0-cdh6.2.1.jar:phoenix-core-5.0.0-cdh6.2.0.jar:htrace-core4-4.2.0-incubating.jar \
--conf spark.driver.extraClassPath=mongo-spark-connector_2.11-2.4.0.jar:mongo-java-driver-3.11.0.jar:mail-1.4.7.jar:spark-sql-kafka-0-10_2.11-2.4.0-cdh6.2.1.jar:phoenix-core-5.0.0-cdh6.2.0.jar:htrace-core4-4.2.0-incubating.jar \
--class com.collectivei.spark2.opprtntymgnt.StreamEntry \
--driver-memory 2g \
--num-executors 2 \
--executor-cores 3 \
--executor-memory 3g \
--conf spark.streaming.backpressure.enabled=true \
--conf spark.streaming.backpressure.pid.minRate=10 \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.maxAppAttempts=4 \
--conf spark.yarn.am.attemptFailuresValidityInterval=1h \
--conf spark.yarn.max.executor.failures=16 \
--conf spark.yarn.executor.failuresValidityInterval=1h \
--conf spark.task.maxFailures=8 \
--queue users.admin \
--conf spark.speculation=true \
example.jar
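For context on what the error means: org/apache/htrace/Sampler comes from the older htrace-core 3.x artifact (package org.apache.htrace), while htrace-core4 relocates its classes under org.apache.htrace.core, so shipping only htrace-core4 may not satisfy this lookup. A minimal diagnostic sketch, assuming you can add a quick check to the driver code before the Phoenix JDBC connection is opened, to see which HTrace classes are actually visible at runtime:

// Minimal diagnostic sketch (assumes you can add this to the driver code
// before the Phoenix connection is opened): report which HTrace variants
// are actually visible on the runtime classpath.
def report(className: String): Unit =
  try {
    Class.forName(className)
    println(s"present: $className")
  } catch {
    case _: ClassNotFoundException => println(s"missing: $className")
  }

report("org.apache.htrace.Sampler")      // htrace-core 3.x package, the class Phoenix fails to load
report("org.apache.htrace.core.Sampler") // htrace-core4 package, the jar the command currently ships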
Related
I am trying to submit a Spark job using spark-submit as below:
> SPARK_MAJOR_VERSION=2 spark-submit --conf spark.ui.port=4090
> --driver-class-path /home/devusr/jars/greenplum-spark_2.11-1.3.0.jar
> --jars /home/devusr/jars/greenplum-spark_2.11-1.3.0.jar
> --executor-cores 3 --executor-memory 13G
> --class com.partition.source.YearPartition splinter_2.11-0.1.jar
> --master=yarn
> --keytab /home/devusr/devusr.keytab --principal devusr#DEV.COM
> --files /usr/hdp/current/spark2-client/conf/hive-site.xml,testconnection.properties
> --name Splinter
> --conf spark.executor.extraClassPath=/home/devusr/jars/greenplum-spark_2.11-1.3.0.jar
> --conf spark.executor.instances=10
> --conf spark.dynamicAllocation.enabled=false
> --conf spark.files.maxPartitionBytes=256M
But the job doesn't run and instead just prints:
SPARK_MAJOR_VERSION is set to 2, using Spark2
Could anyone let me know if there is a specific order for the parameters used in spark-submit?
The format for spark-submit in cluster mode on YARN is
$ ./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster [options] <app jar> [app options]
as documented at https://spark.apache.org/docs/2.1.0/running-on-yarn.html. Everything placed after the application jar is treated as an application argument rather than a spark-submit option, so the options that follow splinter_2.11-0.1.jar in your command (--master=yarn, --keytab, and so on) are never parsed by spark-submit. If splinter_2.11-0.1.jar is the jar that contains your class com.partition.source.YearPartition, can you try this:
spark-submit \
--class com.partition.source.YearPartition \
--master=yarn \
--conf spark.ui.port=4090 \
--driver-class-path /home/devusr/jars/greenplum-spark_2.11-1.3.0.jar \
--jars /home/devusr/jars/greenplum-spark_2.11-1.3.0.jar \
--executor-cores 3 \
--executor-memory 13G \
--keytab /home/devusr/devusr.keytab \
--principal devusr#DEV.COM \
--files /usr/hdp/current/spark2-client/conf/hive-site.xml,testconnection.properties \
--name Splinter \
--conf spark.executor.extraClassPath=/home/devusr/jars/greenplum-spark_2.11-1.3.0.jar \
--conf spark.executor.instances=10 \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.files.maxPartitionBytes=256M \
splinter_2.11-0.1.jar
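To make the argument-ordering point concrete, here is a minimal sketch of a hypothetical main method (the real com.partition.source.YearPartition is not shown in the question) that just prints whatever spark-submit hands over as application arguments; with the original command, the misplaced options after splinter_2.11-0.1.jar would show up here:

package com.partition.source

// Hypothetical stand-in for the real YearPartition entry point (its code is
// not shown in the question), used only to illustrate that anything written
// after the application jar on the spark-submit line arrives here as plain
// application arguments rather than being parsed as spark-submit options.
object YearPartition {
  def main(args: Array[String]): Unit = {
    args.zipWithIndex.foreach { case (arg, i) =>
      println(s"application argument $i: $arg")
    }
  }
}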
I am running a Spark job on a YARN cluster through the spark-submit command. The job starts but hangs at the action steps. Below is the command I am using:
spark-submit \
--class com.spark.Inbound \
--name spark-learning \
--deploy-mode cluster \
--queue default \
--conf spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/home_dir/sam/log4j.properties \
--conf spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/home_dir/sam/log4j.properties \
--conf spark.logConf=true \
sparkscala_2.11-1.0.jar dev
// Read the input file and print every line.
val inputFileData = sparkContext.textFile(sourceConfig.employeeInLocation + "/" + sourceConfig.employeeInFileName)
inputFileData
  .flatMap(data => data.split("\n"))
  .foreach(println(_))
The same job, when running without the log4j properties, finishes successfully.
spark-submit \
--class com.spark.Inbound \
--name spark-learning \
--deploy-mode cluster \
--queue default \
sparkscala_2.11-1.0.jar dev
Am I missing any parameters?
The following spark-submit script works:
nohup ./bin/spark-submit --jars ./ikoda/extrajars/ikoda_assembled_ml_nlp.jar,./ikoda/extrajars/stanford-corenlp-3.8.0.jar,./ikoda/extrajars/stanford-parser-3.8.0.jar \
--packages datastax:spark-cassandra-connector:2.0.1-s_2.11 \
--class ikoda.mlserver.Application \
--conf spark.cassandra.connection.host=192.168.0.33 \
--master local[*] ./ikoda/ikodaanalysis-mlserver-0.1.0.jar 1000 > ./logs/nohup.out &
Programmatically, I can do the same by configuring the SparkContext:
val conf = new SparkConf().setMaster("local[4]").setAppName("MLPCURLModelGenerationDataStream")
conf.set("spark.streaming.stopGracefullyOnShutdown", "true")
conf.set("spark.cassandra.connection.host", sparkcassandraconnectionhost)
conf.set("spark.driver.maxResultSize", sparkdrivermaxResultSize)
conf.set("spark.network.timeout", sparknetworktimeout)
Question
Can I add --packages datastax:spark-cassandra-connector:2.0.1-s_2.11 programmatically? If yes, how?
The corresponding configuration option is spark.jars.packages:
conf.set(
"spark.jars.packages",
"datastax:spark-cassandra-connector:2.0.1-s_2.11")
After using Spark 1.2 for quite a long time, I have realised that you can no longer pass Spark configuration to the driver via --conf on the command line.
I am thinking about using system properties and picking the config up using the following bit of code:
import org.apache.spark.SparkConf
def getConfigOption(conf: SparkConf, name: String): Option[String] =
  conf getOption name orElse sys.props.get(name)
How do I pass a config.file option and a string version of the date specified as a start time to a spark-submit command?
I have attempted using the following in my start-up shell script:
--conf "spark.executor.extraJavaOptions=-Dconfig.file=../conf/application.conf -DstartTime=2016-06-04 00:00:00"
but this fails as the space splits the command up.
Any idea how to do this successfully, or has anyone got any advice on this one?
EDIT: here is the bash script being used:
#!/bin/bash
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop/conf}
LIB_DIRECTORY=/opt/app/latest/lib/
ANALYSIS_JAR=spark-fa-2.16.18-standalone.jar
ANALYSIS_DRIVER_CLASS=com.spark.fa.Main
OTHER_OPTIONS=""
KEYTAB="/opt/app/keytab/fa.keytab"
PRINCIPAL="spark_K"
CLUSTER_OPTIONS=" \
--master yarn-client \
--driver-memory 2000M \
--executor-memory 5G \
--num-executors 39 \
--executor-cores 5 \
--conf spark.default.parallelism=200 \
--driver-java-options=-Dconfig.file=../conf/application.conf \
--conf "spark.executor.extraJavaOptions=-DstartTime='2016-06-04 00:00:00'" \
--conf spark.storage.memoryFraction=0.9 \
--files /opt/app/latest/conf/application.conf \
--conf spark.storage.safetyFraction=0.9 \
--keytab ${KEYTAB} \
--principal ${PRINCIPAL} \
"
spark-submit --class ${ANALYSIS_DRIVER_CLASS} ${CLUSTER_OPTIONS} ${LIB_DIRECTORY}/${ANALYSIS_JAR} ${CONFIG} ${#}
Use quotes:
--conf "spark.executor.extraJavaOptions=-Dconfig.file=../conf/application.conf -DstartTime='2016-06-04 00:00:00'"
If your parameter contains both spaces and single quotes (for instance a query parameter), you should enclose it in an escaped double quote \"
Example:
spark-submit --master yarn --deploy-mode cluster --conf "spark.driver.extraJavaOptions=-DfileFormat=PARQUET -Dquery=\"select * from bucket where code in ('A')\" -Dchunk=yes" spark-app.jar
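On the consuming side, a minimal sketch of how the -Dconfig.file and -DstartTime values would typically be read back inside the job, mirroring the sys.props approach already shown in the question (the property names are the ones used in the examples above):

// Minimal sketch: read back the JVM system properties set via
// --driver-java-options / spark.executor.extraJavaOptions in the script above.
val configFile: Option[String] = sys.props.get("config.file")
val startTime: Option[String]  = sys.props.get("startTime")

println(s"config.file = ${configFile.getOrElse("<not set>")}")
println(s"startTime   = ${startTime.getOrElse("<not set>")}")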
I have some Scala / Spark code packaged into a sparktest_2.10-1.0.jar file.
I'm trying to do spark-submit:
spark-submit --class sparktest_2.10-1.0 --master local[2]
I get: Error: Must specify a primary resource (JAR or Python or R file)
What is the proper way to do spark-submit?
spark-submit
--class "main-class"
--master spark://master-url
--deploy-mode "deploy-mode"
--conf <key>=<value>
... # other options
application-jar
[application-arguments]
E.g.:
spark-submit --class "com.example.myapp" myapp.jar