How to handle the exception: "No suitable driver" while running spark-submit - postgresql

I am trying to run a Spark job that reads a table from a Postgres database and inserts it into a Hive table on HDFS. To do that, I have set my connection properties in a properties file as below:
devUserName=usrname
devPassword=pwd
gpDriverClass=org.postgresql.Driver
gpDevUrl=jdbc:postgresql://ip:port/dbname?ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory
I read the properties file like this:
import java.io.FileInputStream
import java.util.Properties

val conFile = "testconnection.properties"
val properties = new Properties()
properties.load(new FileInputStream(conFile))
// GP Properties
val connectionUrl = properties.getProperty("gpDevUrl")
val devUserName = properties.getProperty("devUserName")
val devPassword = properties.getProperty("devPassword")
val driverClass = properties.getProperty("gpDriverClass")
I am submitting the jar file in the following way:
SPARK_MAJOR_VERSION=2 spark-submit --conf spark.ui.port=4090 --driver-class-path /home/usrname/jars/postgresql-42.1.4.jar --master=yarn --deploy-mode=cluster --driver-memory 40g --driver-cores 3 --executor-memory 40g --executor-cores 20 --num-executors 5 --files /usr/hdp/current/spark2-client/conf/hive-site.xml,testconnection.properties --jars /home/usrname/jars/postgresql-42.1.4.jar --class com.partition.source.YearPartition splinter_2.11-0.1.jar --keytab /home/usrname/usrname.keytab --principal usrname#DEV.COM --name Splinter --conf spark.executor.extraClassPath=/home/usrname/jars/postgresql-42.1.4.jar
The job fails with the following error message:
client token: Token { kind: YARN_CLIENT_TOKEN, service: }
diagnostics: User class threw exception: java.sql.SQLException: No suitable driver
ApplicationMaster host: 10.230.137.190
ApplicationMaster RPC port: 0
queue: default
start time: 1537545547309
final status: FAILED
tracking URL: http://ip:port/proxy/application_123456789_76/
user: fdlhdpetl
18/09/21 15:59:38 INFO Client: Deleted staging directory hdfs://dev/user/usrname/.sparkStaging/application_123456789_76
Exception in thread "main" org.apache.spark.SparkException: Application application_123456789_76 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1187)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1233)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:782)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
18/09/21 15:59:38 INFO ShutdownHookManager: Shutdown hook called
18/09/21 15:59:38 INFO ShutdownHookManager: Deleting directory /tmp/spark-59934097-dfe9-4ad2-a66a-ff93d42cf838
My dependencies in the build.sbt file:
name := "Splinter"
version := "0.1"
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.0.0" % "provided",
"org.apache.spark" %% "spark-sql" % "2.0.0" % "provided",
"org.json4s" %% "json4s-jackson" % "3.2.11" % "provided",
"org.apache.httpcomponents" % "httpclient" % "4.5.3"
)
// https://mvnrepository.com/artifact/org.postgresql/postgresql
libraryDependencies += "org.postgresql" % "postgresql" % "42.1.4"
Code used to read the table from Postgres:
def prepareFinalDF(splitColumns: List[String], textList: ListBuffer[String], allColumns: String, dataMapper: Map[String, String], partition_columns: Array[String], spark: SparkSession): DataFrame = {
  val execQuery = s"select ${allColumns}, 0 as ${flagCol} from schema.tablename where period_year='2017'"
  val yearDF = spark.read.format("jdbc")
    .option("url", connectionUrl)
    .option("dbtable", s"(${execQuery}) as year2017")
    .option("user", devUserName)
    .option("password", devPassword)
    .option("numPartitions", 20)
    .load()
  val totalCols: List[String] = splitColumns ++ textList
  val cdt = new ChangeDataTypes(totalCols, dataMapper)
  hiveDataTypes = cdt.gpDetails()
  prepareHiveTableSchema(hiveDataTypes, partition_columns)
  val allColsOrdered = yearDF.columns.diff(partition_columns) ++ partition_columns
  val allCols = allColsOrdered.map(colname => org.apache.spark.sql.functions.col(colname))
  val resultDF = yearDF.select(allCols: _*)
  val stringColumns = resultDF.schema.fields.filter(x => x.dataType == StringType).map(s => s.name)
  val finalDF = stringColumns.foldLeft(resultDF) { (tempDF, colName) =>
    tempDF.withColumn(colName, regexp_replace(regexp_replace(col(colName), "[\r\n]+", " "), "[\t]+", " "))
  }
  finalDF
}
execQuery contains: select * from schema.tablename where period_year=2017
val dataDF = prepareFinalDF(splitColumns, textList, allColumns, dataMapper, partition_columns, spark)
dataDF.createOrReplaceTempView("preparedDF")
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("set hive.exec.dynamic.partition=true")
spark.sql(s"INSERT OVERWRITE TABLE default.xx_gl_forecast PARTITION(${prtn_String_columns}) select * from preparedDF")
I checked the jar mentioned in spark-submit and it is present in the given directory, but the job still fails when submitted.
Could anyone tell me what mistake I am making here? Is there any particular order to be followed when passing parameters to spark-submit?

The JDBC driver for Postgres has to be on the classpath not only of the Spark driver but also of the executors. There are three ways to do it (a short sketch follows the list):
Build a fat jar that includes the driver. This means keeping "org.postgresql" % "postgresql" % "42.1.4" in your project's libraryDependencies (which you already do) and assembling the jar with a plugin such as sbt-assembly, since a plain sbt package does not bundle dependencies.
Tell Spark to download the dependency from Maven. In your case, add --packages org.postgresql:postgresql:42.1.4 to your spark-submit command (instead of --driver-class-path).
Use --jars followed by the path to the driver's jar. This ships the jar to the cluster and adds it to both the driver's and the executors' classpaths.
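For completeness, once the driver jar is available on the cluster, you can also tell the JDBC reader explicitly which driver class to use instead of relying on DriverManager matching the URL on its own. A minimal sketch reusing the values already read from the properties file in the question (this sketch is an addition, not part of the answer above):
// Sketch: pass the driver class (gpDriverClass, i.e. org.postgresql.Driver) explicitly.
val yearDF = spark.read.format("jdbc")
  .option("url", connectionUrl)
  .option("driver", driverClass)                    // JDBC driver class to load on driver and executors
  .option("dbtable", s"(${execQuery}) as year2017")
  .option("user", devUserName)
  .option("password", devPassword)
  .option("numPartitions", 20)
  .load()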

Related

scala spark NoClassDefFoundError - InitialPositionInStream

I am deploying a Spark application written in Scala to an EMR cluster with the following command, and I cannot figure out why I receive a missing-dependency error when it runs on the cluster.
error message:
User class threw exception: java.lang.NoClassDefFoundError: com/amazonaws/services/kinesis/clientlibrary/lib/worker/InitialPositionInStream
aws emr add-steps --cluster-id j-xxxxxxx --steps Type=spark,Name=ScalaStream,Args=[\
--class,"ScalaStream",\
--deploy-mode,cluster,\
--master,yarn,\
--jars,s3://xxx.xxx.xxx/aws-java-sdk-1.11.715.jar,\
--conf,spark.yarn.submit.waitAppCompletion=false,\
s3://xxx.xxxx.xxxx/simple-project_2.12-1.0.jar\
],ActionOnFailure=CONTINUE
and sbt file
name := "Simple Project"
version := "1.0"
scalaVersion := "2.12.8"
libraryDependencies += "org.apache.spark" % "spark-sql_2.12" % "2.4.4"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.12" % "2.4.4"
libraryDependencies += "com.amazonaws" % "aws-java-sdk" % "1.11.715"
libraryDependencies += "org.apache.spark" % "spark-streaming-kinesis-asl_2.12" % "2.4.4"
partial code below
...
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
...
val streamingContext = new StreamingContext(sparkContext, batchInterval)
// Populate the appropriate variables from the given args
val streamAppName = "xxxxxx"
val streamName = "xxxxxx"
val endpointUrl = "https://kinesis.xxxxx.amazonaws.com"
val regionName = "xx-xx-x"
val initialPosition = InitialPositionInStream.LATEST
val checkpointInterval = batchInterval
val storageLevel = StorageLevel.MEMORY_AND_DISK_2
val kinesisStream = KinesisUtils.createStream(streamingContext, streamAppName, streamAppName, endpointUrl, regionName, initialPosition, checkpointInterval, storageLevel)
20/02/05 21:43:10 ERROR ApplicationMaster: User class threw exception: java.lang.NoClassDefFoundError: com/amazonaws/services/kinesis/clientlibrary/lib/worker/InitialPositionInStream
java.lang.NoClassDefFoundError: com/amazonaws/services/kinesis/clientlibrary/lib/worker/InitialPositionInStream
at ScalaStream$.main(stream.scala:32)
at ScalaStream.main(stream.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:684)
Caused by: java.lang.ClassNotFoundException: com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more
20/02/05 21:43:10 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.lang.NoClassDefFoundError: com/amazonaws/services/kinesis/clientlibrary/lib/worker/InitialPositionInStream
at ScalaStream$.main(stream.scala:32)
at ScalaStream.main(stream.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:684)
Caused by: java.lang.ClassNotFoundException: com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more
)
I've tried including the AWS dependencies both in the sbt file and in the --jars parameter of spark-submit, but I cannot see why the dependency is missing.
fixed by updating the following
sbt
name := "Simple Project"
version := "1.0"
scalaVersion := "2.12.8"
libraryDependencies += "org.apache.spark" % "spark-sql_2.12" % "2.4.4"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.12" % "2.4.4"
libraryDependencies += "org.apache.spark" % "spark-streaming-kinesis-asl_2.12" % "2.4.4"
deploy script
aws emr add-steps --cluster-id j-xxxxxxx --steps Type=spark,Name=ScalaStream,Args=[\
--class,"ScalaStream",\
--deploy-mode,cluster,\
--master,yarn,\
--packages,\'org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.0,org.postgresql:postgresql:42.2.9,com.facebook.presto:presto-jdbc:0.60\',\
--conf,spark.yarn.submit.waitAppCompletion=false,\
--conf,yarn.log-aggregation-enable=true,\
--conf,spark.dynamicAllocation.enabled=true,\
--conf,spark.cores.max=4,\
--conf,spark.network.timeout=300,\
s3://xxx.xxx/simple-project_2.12-1.0.jar\
],ActionOnFailure=CONTINUE
The key was the --packages flag added to the aws emr add-steps command. I had mistakenly thought that sbt package bundled the required dependencies.
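An alternative to --packages is to build a fat jar that actually bundles spark-streaming-kinesis-asl, since sbt package only packages your own classes. A minimal sketch using the sbt-assembly plugin (the plugin version and merge strategy below are assumptions, not taken from the original post):
// project/plugins.sbt (assumed plugin version)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

// build.sbt: mark what EMR already provides as "provided" and bundle the rest
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.4.4" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "2.4.4"
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}
Build with sbt assembly and point the EMR step at the resulting assembly jar instead of the sbt package output.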

Exception in thread "main" java.lang.NoSuchMethodError in ubuntu only

I have the code below:
import java.io.File
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}
object RDFBenchVerticalPartionedTables {
def main(args: Array[String]): Unit = {
println("Start of programm .... ")
val conf = new SparkConf().setMaster("local").setAppName("SQLSPARK")
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
val sc = new SparkContext(conf)
sc.setLogLevel("ERROR")
println("Conf and SC declared... ")
val spark = SparkSession
.builder()
.master("local[*]")
.appName("SparkConversionSingleTable")
.getOrCreate()
println("SparkSession declared... ")
println("Before Agrs..... ")
val filePathCSV=args(0)
val filePathAVRO=args(1)
val filePathORC=args(2)
val filePathParquet=args(3)
println("After Agrs..... ")
val csvFiles = new File(filePathCSV).list()
println("After List of Files Agrs..... " + csvFiles.length )
println("Before the foreach ... ")
csvFiles.foreach{verticalTableName=>
println("inside the foreach ... ")
val verticalTableName2=verticalTableName.dropRight(4)
val RDFVerticalTableDF = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(filePathCSV+"/"+verticalTableName).toDF()
RDFVerticalTableDF.write.format("com.databricks.spark.avro").save(filePathAVRO+"/"+verticalTableName2+".avro")
RDFVerticalTableDF.write.parquet(filePathParquet+"/"+verticalTableName2+".parquet")
RDFVerticalTableDF.write.orc(filePathORC+"/"+verticalTableName2+".orc")
println("Vertical Table: '" +verticalTableName2+"' Has been Successfully Converted to AVRO, PARQUET and ORC !")
}
}
}
This class transforms a list of CSV files from a directory given as args(0) and saves them in different formats (Avro, ORC and Parquet) to three directories given as args(1), args(2) and args(3).
I tried to submit this job using spark-submit; it works on Windows, but running the same job on Ubuntu fails with this error:
ubuntu#ragab:~$ spark-submit --class RDFBenchVerticalPartionedTables --master local[*] /home/ubuntu/testjar/rdfschemaconversion_2.11-0.1.jar "/data/RDFBench4/VerticalPartionnedTables/VerticalPartitionedTables100" "/data/RDFBench3/ConvertedData/SP2Bench100/AVRO/VerticalTables" "/data/RDFBench3/ConvertedData/SP2Bench100/ORC/VerticalTables" "/data/RDFBench3/ConvertedData/SP2Bench100/Parquet"
19/05/04 18:10:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Start of programm ....
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Conf and SC declared...
SparkSession declared...
Before Agrs.....
After Agrs.....
After List of Files Agrs..... 25
Before the foreach ...
Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;
at RDFBenchVerticalPartionedTables$.main(RDFBenchVerticalPartionedTables.scala:45)
at RDFBenchVerticalPartionedTables.main(RDFBenchVerticalPartionedTables.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
this is my sbt file:
name := "RDFSchemaConversion"
version := "0.1"
scalaVersion := "2.11.12"
mainClass in (Compile, run) := Some("RDFBenchVerticalPartionedTables")
mainClass in (Compile, packageBin) := Some("RDFBenchVerticalPartionedTables")
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"
libraryDependencies += "com.typesafe" % "config" % "1.3.1"
libraryDependencies += "com.databricks" %% "spark-avro" % "4.0.0"
Your Spark distribution on Ubuntu seems to have been built with Scala 2.12, which is binary-incompatible with your jar, compiled with Scala 2.11. Rebuild the project against the Scala version your Spark installation uses, or install a Spark build for Scala 2.11.
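A quick way to confirm the mismatch is to check which Scala version the Ubuntu installation was built with, for example by running spark-submit --version (the banner prints the Scala version) or from spark-shell. A minimal sketch; the printed values are illustrative:
scala> util.Properties.versionString   // Scala version the running Spark build uses
res0: String = version 2.12.8
scala> spark.version
res1: String = 2.4.0
If this reports Scala 2.12 while build.sbt pins scalaVersion := "2.11.12", that confirms the mismatch described above.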

java.lang.NullPointerException while reading data from MSSQL server with spark

I am having issues reading data from an MSSQL server using Cloudera Spark. I am not sure where the problem is or what is causing it.
Here is my build.sbt
val sparkversion = "1.6.0-cdh5.10.1"
name := "SimpleSpark"
organization := "com.huff.spark"
version := "1.0"
scalaVersion := "2.10.5"
mainClass in Compile := Some("com.huff.spark.example.SimpleSpark")
assemblyJarName in assembly := "mssql.jar"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-streaming-kafka" % "1.6.0" % "provided",
"org.apache.spark" %% "spark-streaming" % "1.6.0" % "provided",
"org.apache.spark" % "spark-core_2.10" % sparkversion % "provided", // to test in cluseter
"org.apache.spark" % "spark-sql_2.10" % sparkversion % "provided" // to test in cluseter
)
resolvers += "Confluent IO" at "http://packages.confluent.io/maven"
resolvers += "Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos"
And here is my scala source:
package com.huff.spark.example
import org.apache.spark.sql._
import java.sql.{Connection, DriverManager}
import java.util.Properties
import org.apache.spark.{SparkContext, SparkConf}
object SimpleSpark {
def main(args: Array[String]) {
val sourceProp = new java.util.Properties
val conf = new SparkConf().setAppName("SimpleSpark").setMaster("yarn-cluster") //to test in cluster
val sc = new SparkContext(conf)
var SqlContext = new SQLContext(sc)
val driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
val jdbcDF = SqlContext.read.format("jdbc").options(Map("url" -> "jdbc:sqlserver://sqltestsrver;databaseName=LEh;user=sparkaetl;password=sparkaetl","driver" -> driver,"dbtable" -> "StgS")).load()
jdbcDF.show(5)
}
}
And this is the error I see:
17/05/24 04:35:20 ERROR ApplicationMaster: User class threw exception: java.lang.NullPointerException
java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:155)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:222)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:146)
at com.huff.spark.example.SimpleSpark$.main(SimpleSpark.scala:16)
at com.huff.spark.example.SimpleSpark.main(SimpleSpark.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:552)
17/05/24 04:35:20 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.lang.NullPointerException)
I know the problem is in line 16 which is:
val jdbcDF = SqlContext.read.format("jdbc").options(Map("url" -> "jdbc:sqlserver://sqltestsrver;databaseName=LEh;user=sparkaetl;password=sparkaetl","driver" -> driver,"dbtable" -> "StgS")).load()
But I can't pinpoint what exactly the problem is. Is it something to do with access (which is doubtful), a problem with the connection parameters (the error message would say so), or something else I am not aware of? Thanks in advance :-)
If you are using Azure SQL Server, please copy the JDBC connection string from the Azure portal. I tried it and it worked for me.
Azure Databricks, using Scala mode:
import com.microsoft.sqlserver.jdbc.SQLServerDriver
import java.sql.DriverManager
import org.apache.spark.sql.SQLContext
import sqlContext.implicits._
// MS SQL JDBC Connection String ...
val jdbcSqlConn = "jdbc:sqlserver://***.database.windows.net:1433;database=**;user=***;password=****;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
// Loading the ms sql table via spark context into dataframe
val jdbcDF = sqlContext.read.format("jdbc").options(
Map("url" -> jdbcSqlConn,
"driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
"dbtable" -> "***")).load()
// Registering the temp table so that we can SQL like query against the table
jdbcDF.registerTempTable("yourtablename")
// selecting only top 10 rows here but you can use any sql statement
val yourdata = sqlContext.sql("SELECT * FROM yourtablename LIMIT 10")
// display the data
yourdata.show()
The NPE occurs when Spark tries to close the database connection, which indicates that it could not obtain a proper connection via JdbcUtils.createConnectionFactory. You should check your connection URL and the logs for failures.
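To rule Spark out, it can also help to open a plain JDBC connection with the same URL from the machine where the driver runs. A minimal sketch, reusing the placeholder URL from the question and assuming the SQL Server driver jar is on the classpath:
import java.sql.DriverManager

// Fails immediately with ClassNotFoundException if the SQL Server driver jar is missing
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")
val url = "jdbc:sqlserver://sqltestsrver;databaseName=LEh;user=sparkaetl;password=sparkaetl"
val conn = DriverManager.getConnection(url)
try {
  println(s"Connected, closed = ${conn.isClosed}")   // expect: Connected, closed = false
} finally {
  conn.close()
}
If this fails, the problem is in the URL, credentials or network rather than in Spark itself.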

java.lang.ClassNotFoundException: org.apache.hadoop.hbase.HBaseConfiguration

I want to create my first Scala program using the example HBaseTest2.scala provided in Spark 1.4.1. The goal is to connect to HBase and do some basic things, such as counting or scanning rows. However, when I try to execute the program, I get an error. It seems that Spark cannot find the class HBaseConfiguration. Assume we are located at the root path of my project HBaseTest2, /usr/local/Cellar/spark/programs/HBaseTest2. Here are some details of the exception:
./src/main/scala/com/orange/spark/examples/HBaseTest2.scala
package com.orange.spark.examples
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark._
object HBaseTest2 {
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("HBaseTest2")
val sc = new SparkContext(sparkConf)
val tableName = "personal-cloud-test"
// please ensure HBASE_CONF_DIR is on classpath of spark driver
// e.g: set it through spark.driver.extraClassPath property
// in spark-defaults.conf or through --driver-class-path
// command line option of spark-submit
val conf = HBaseConfiguration.create()
// Other options for configuring scan behavior are available. More information available at
// http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html
conf.set(TableInputFormat.INPUT_TABLE, tableName)
// Initialize hBase table if necessary
val admin = new HBaseAdmin(conf)
if (!admin.isTableAvailable(tableName)) {
val tableDesc = new HTableDescriptor(TableName.valueOf(tableName))
admin.createTable(tableDesc)
}
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])
println("hbaseRDD.count()")
println(hBaseRDD.count())
sc.stop()
admin.close()
}
}
./build.sbt
I've added dependencies in this file to ensure all classes called are included in the jar file.
name := "HBaseTest2"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.1"
libraryDependencies ++= Seq(
"org.apache.hadoop" % "hadoop-core" % "1.2.1",
"org.apache.hbase" % "hbase" % "1.0.1.1",
"org.apache.hbase" % "hbase-client" % "1.0.1.1",
"org.apache.hbase" % "hbase-common" % "1.0.1.1",
"org.apache.hbase" % "hbase-server" % "1.0.1.1"
)
Run application
MacBook-Pro-de-Mincong:spark-1.4.1 minconghuang$ bin/spark-submit \
--class "com.orange.spark.examples.HBaseTest2" \
--master local[4] \
../programs/HBaseTest2/target/scala-2.11/hbasetest2_2.11-1.0.jar
Exception
15/08/18 12:06:17 INFO storage.BlockManagerMaster: Registered BlockManager
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration
at com.orange.spark.examples.HBaseTest2$.main(HBaseTest2.scala:21)
at com.orange.spark.examples.HBaseTest2.main(HBaseTest2.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.HBaseConfiguration
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 11 more
15/08/18 12:06:17 INFO spark.SparkContext: Invoking stop() from shutdown hook
The problem might come from the HBase configuration as mentioned in HBaseTest2.scala line 16 :
// please ensure HBASE_CONF_DIR is on classpath of spark driver
// e.g: set it through spark.driver.extraClassPath property
// in spark-defaults.conf or through --driver-class-path
// command line option of spark-submit
But I don't know how to configure it... I've added HBASE_CONF_DIR to the CLASSPATH in my command line. The CLASSPATH is now /usr/local/Cellar/hadoop/hbase-1.0.1.1/conf. Nothing happened... T_T So what should I do to get this fixed? I can add/delete details if needed. Thanks a lot!!
Have you tried
sparkConf.set("spark.driver.extraClassPath", "/usr/local/Cellar/hadoop/hbase-1.0.1.1/conf")
The problem came from the classpath setting, as mentioned in HBaseTest2.scala line 33:
// please ensure HBASE_CONF_DIR is on classpath of spark driver
// e.g: set it through spark.driver.extraClassPath property
// in spark-defaults.conf or through --driver-class-path
// command line option of spark-submit
As I'm using Mac OS X, the setup is different from Linux. When I tried echo $CLASSPATH, it returned empty; it seems the Mac setup does not use CLASSPATH for the driver. So I needed to add all the jar files through spark.driver.extraClassPath in the spark-defaults.conf file. My colleague did the same on Linux. I think there is a more elegant way to handle this, but we didn't find it. Please share if you know the answer. Thanks.
Mac / Linux
add all external jars in conf/spark-defaults.conf
spark.driver.extraClassPath /path/to/a.jar:/path/to/b.jar:/path/to/c.jar

ZeroMQ word count app gives error when you compile in spark 1.2.1

I'm trying to set up a ZeroMQ data stream to Spark. Basically I took the ZeroMQWordCount.scala app and tried to recompile and run it.
I have ZeroMQ 2.1 installed and Spark 1.2.1.
Here is my Scala code:
package org.apache.spark.examples.streaming
import akka.actor.ActorSystem
import akka.actor.actorRef2Scala
import akka.zeromq._
import akka.zeromq.Subscribe
import akka.util.ByteString
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.zeromq._
import scala.language.implicitConversions
import org.apache.spark.SparkConf
object ZmqBenchmark {
def main(args: Array[String]) {
if (args.length < 2) {
System.err.println("Usage: ZmqBenchmark <zeroMQurl> <topic>")
System.exit(1)
}
//StreamingExamples.setStreamingLogLevels()
val Seq(url, topic) = args.toSeq
val sparkConf = new SparkConf().setAppName("ZmqBenchmark")
// Create the context and set the batch size
val ssc = new StreamingContext(sparkConf, Seconds(2))
def bytesToStringIterator(x: Seq[ByteString]) = (x.map(_.utf8String)).iterator
// For this stream, a zeroMQ publisher should be running.
val lines = ZeroMQUtils.createStream(ssc, url, Subscribe(topic), bytesToStringIterator _)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
}
and this is my .sbt file for dependencies:
name := "ZmqBenchmark"
version := "1.0"
scalaVersion := "2.10.4"
resolvers += "Typesafe Repository" at "http://repo.typesafe.com/typesafe/releases/"
resolvers += "Sonatype (releases)" at "https://oss.sonatype.org/content/repositories/releases/"
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.2.1"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.2.1"
libraryDependencies += "org.apache.spark" % "spark-streaming-zeromq_2.10" % "1.2.1"
libraryDependencies += "com.typesafe.akka" %% "akka-actor" % "2.2.0"
libraryDependencies += "org.zeromq" %% "zeromq-scala-binding" % "0.0.6"
libraryDependencies += "com.typesafe.akka" % "akka-zeromq_2.10.0-RC5" % "2.1.0-RC6"
libraryDependencies += "org.apache.spark" % "spark-examples_2.10" % "1.1.1"
libraryDependencies += "org.spark-project.zeromq" % "zeromq-scala-binding_2.11" % "0.0.7-spark"
The application compiles without any errors using sbt package; however, when I run the application with spark-submit, I get an error:
zaid#zaid-VirtualBox:~/spark-1.2.1$ ./bin/spark-submit --master local[*] ./zeromqsub/example/target/scala-2.10/zmqbenchmark_2.10-1.0.jar tcp://127.0.0.1:5553 hello
15/03/06 10:21:11 WARN Utils: Your hostname, zaid-VirtualBox resolves to a loopback address: 127.0.1.1; using 192.168.220.175 instead (on interface eth0)
15/03/06 10:21:11 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/streaming/zeromq/ZeroMQUtils$
at ZmqBenchmark$.main(ZmqBenchmark.scala:78)
at ZmqBenchmark.main(ZmqBenchmark.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.streaming.zeromq.ZeroMQUtils$
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 9 more
Any ideas why this happens? I know the app should work, because when I run the same example using the run-example script and point to the ZeroMQWordCount app from Spark, it runs without the exception. My guess is that the sbt file is incorrect; what else do I need to have in the sbt file?
Thanks
You are using ZeroMQUtils.createStream but the line
Caused by: java.lang.ClassNotFoundException: org.apache.spark.streaming.zeromq.ZeroMQUtils
shows that the bytecode for ZeroMQUtils was not located. When the Spark examples are run, they run against a jar file (like spark-1.2.1/examples/target/scala-2.10/spark-examples-1.2.1-hadoop1.0.4.jar) that includes the ZeroMQUtils class. A solution is to use the --jars flag so the spark-submit command can find the bytecode. In your case, this could be something like:
spark-submit --jars /opt/spark/spark-1.2.1/examples/target/scala-2.10/spark-examples-1.2.1-hadoop1.0.4.jar --master local[*] ./zeromqsub/example/target/scala-2.10/zmqbenchmark_2.10-1.0.jar tcp://127.0.0.1:5553 hello
assuming that you have installed spark-1.2.1 in /opt/spark.