Apache Spark NoSuchMethodError using .reduceByKey() - scala

I'm trying to run some examples in Apache Spark to learn more about it, but when I try to do it (in spark-shell) I'm receiving the error:
java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.addDeprecations([Lorg/apache/hadoop/conf/Configuration$DeprecationDelta;)V
There's the full execution and the error trace. I wish you could help me.
pcitbu#pcitbumint /usr/spark/spark-2.0.1-bin-hadoop2.7 $ spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/10/25 09:52:38 WARN SparkConf:
SPARK_WORKER_INSTANCES was detected (set to '3').
This is deprecated in Spark 1.0+.
Please instead use:
- ./spark-submit with --num-executors to specify the number of executors
- Or set SPARK_EXECUTOR_INSTANCES
- spark.executor.instances to configure the number of instances in the spark config.
16/10/25 09:52:38 WARN Utils: Your hostname, pcitbumint resolves to a loopback address: 127.0.1.1; using 192.168.0.119 instead (on interface ens33)
16/10/25 09:52:38 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
16/10/25 09:52:38 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
Spark context Web UI available at http://192.168.0.119:4040
Spark context available as 'sc' (master = local[*], app id = local-1477381958561).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.0.1
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_101)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val file = sc.textFile("README.md")
file: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[1] at textFile at <console>:24
scala> val counts = file.flatMap(line => line.split(" "))
counts: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:26
scala> .map(word => (word, 1))
res0: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at <console>:29
scala> .reduceByKey(_ + _)
java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.addDeprecations([Lorg/apache/hadoop/conf/Configuration$DeprecationDelta;)V
at org.apache.hadoop.hdfs.HdfsConfiguration.addDeprecatedKeys(HdfsConfiguration.java:66)
at org.apache.hadoop.hdfs.HdfsConfiguration.<clinit>(HdfsConfiguration.java:31)
at org.apache.hadoop.hdfs.DistributedFileSystem.<clinit>(DistributedFileSystem.java:116)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:810)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:855)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1440)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1464)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:263)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:124)
at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:563)
at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:318)
at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:291)
at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$29.apply(SparkContext.scala:992)
at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$29.apply(SparkContext.scala:992)
at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at scala.Option.map(Option.scala:146)
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:195)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:65)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$reduceByKey$3.apply(PairRDDFunctions.scala:328)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$reduceByKey$3.apply(PairRDDFunctions.scala:328)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.PairRDDFunctions.reduceByKey(PairRDDFunctions.scala:327)
... 48 elided

please check your hadoop version. More particularly the version of hadoop-common-x.x.x.jar
Couple of points:
Did you built this spark source yourself, if yes check the hadoop version you built with.
If you've a pre-built version, please see the version of hadoop-common inside your install dir i.e. /usr/spark/spark-2.0.1-bin-hadoop2.7

Related

Why spark-shell throws NoSuchMethodException while calling newInstance via reflection

spark-shell throws NoSuchMethodException if I define a class in REPL and then call newInstance via reflection.
Spark context available as 'sc' (master = yarn, app id = application_1656488084960_0162).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.0.3
/_/
Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_141)
Type in expressions to have them evaluated.
Type :help for more information.
scala> class Demo {
| def demo(s: String): Unit = println(s)
| }
defined class Demo
scala> classOf[Demo].newInstance().demo("OK")
java.lang.InstantiationException: Demo
at java.lang.Class.newInstance(Class.java:427)
... 47 elided
Caused by: java.lang.NoSuchMethodException: Demo.<init>()
at java.lang.Class.getConstructor0(Class.java:3082)
at java.lang.Class.newInstance(Class.java:412)
... 47 more
But the same code works fine in native scala REPL:
Welcome to Scala 2.12.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131).
Type in expressions for evaluation. Or try :help.
scala> class Demo {
| def demo(s: String): Unit = println(s)
| }
defined class Demo
scala> classOf[Demo].newInstance().demo("OK")
OK
What's the difference between spark-shell REPL and native scala REPL?
I guess the Demo class might be treated as inner class in spark-shell REPL.
But ... how to solve the problem?
In Scala 2.12.4 REPL the class is nested into objects, so it has zero-argument constructor accessible via .newInstance(). In Spark 3.0.3 shell the class is nested into classes, so there is no zero-argument constructor, the constructor of Demo accepts an instance of outer class and should be accessed via .getConstructors()(0).newInstance(...). Start Scala REPL and Spark shell with ./scala -Xprint:typer and ./spark-shell -Xprint:typer correspondingly and you'll see the difference.
So in Spark shell try
classOf[Demo].getDeclaredMethod("demo", classOf[String])
.invoke(
classOf[Demo].getConstructors()(0).newInstance($lineXX.$read.INSTANCE.$iw.$iw),
"OK"
)
//OK
//resYY: Object = null
(XX is the number of line where Demo is defined).
See details in Spark adds hidden parameter to constructor of a Scala class

Why sample code with accumulator throws exception?

all
I was reading the section about accumulator in Spark documentation. http://spark.apache.org/docs/latest/rdd-programming-guide.html#accumulators
I was trying to run the sample code in spark-shell:
download the spark zip file
unzip file and cd to the directory
execute ./bin/spark-shell
scala> val accum = sc.longAccumulator("My Accumulator")
accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, name: Some(My Accumulator), value: 0)
scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))
...
10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s
scala> accum.value
res2: Long = 10
But no luck, I tried spark 2 and spark 3, both throw an exception. Could you tell me why?
This is spark 2
Spark context available as 'sc' (master = local[*], app id = local-1609254493356).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.7
/_/
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 15.0.1)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val accum = sc.longAccumulator("My Accumulator")
accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, name: Some(My Accumulator), value: 0)
scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))
at org.apache.spark.util.FieldAccessFinder$$anon$4$$anonfun$visitMethodInsn$7.apply(ClosureCleaner.scala:845)
at org.apache.spark.util.FieldAccessFinder$$anon$4$$anonfun$visitMethodInsn$7.apply(ClosureCleaner.scala:828)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:134)
This is spark 3
Spark context Web UI available at http://192.168.57.243:4040
Spark context available as 'sc' (master = local[*], app id = local-1609254662877).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.0.1
/_/
Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 15.0.1)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val accum = sc.longAccumulator("My Accumulator")
accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, name: Some(My Accumulator), value: 0)
scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))
java.lang.IllegalAccessException: Can not set final $iw field $Lambda$2159/0x0000000801470488.arg$1 to $iw
at java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:76)
at java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:80)
at java.base/jdk.internal.reflect.UnsafeQualifiedObjectFieldAccessorImpl.set(UnsafeQualifiedObjectFieldAccessorImpl.java:79)
at java.base/java.lang.reflect.Field.set(Field.java:793)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:398)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2362)
at org.apache.spark.rdd.RDD.$anonfun$foreach$1(RDD.scala:985)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
at org.apache.spark.rdd.RDD.foreach(RDD.scala:984)
... 47 elided
Java 15.0.1 is the problem. Java 11 is the highest version supported:
Spark runs on Java 8/11, Scala 2.12, Python 2.7+/3.4+ and R 3.5+. Java 8 prior to version 8u92 support is deprecated as of Spark 3.0.0.

Why df write console format not showing anything?

I have a static dataframe, how to write it to the console instead of using df.show()
val sparkConfig = new SparkConf().setAppName("streaming-vertica").setMaster("local[2]")
val sparkSession = SparkSession.builder().master("local[2]").config(sparkConfig).getOrCreate()
val sc = sparkSession.sparkContext
val rows = sc.parallelize(Array(
Row(1,"hello", true),
Row(2,"goodbye", false)
))
val schema = StructType(Array(
StructField("id",IntegerType, false),
StructField("sings",StringType,true),
StructField("still_here",BooleanType,true)
))
val df = sparkSession.createDataFrame(rows, schema)
df.write
.format("console")
.mode("append")
This is writing nothing into console:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/04/27 00:30:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Process finished with exit code 0
On using save :
df.write
.format("console")
.mode("append")
.save()
It gives :
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/04/27 00:45:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.RuntimeException: org.apache.spark.sql.execution.streaming.ConsoleSinkProvider does not allow create table as select.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:473)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
at rep.StaticDFWrite$.main(StaticDFWrite.scala:35)
at rep.StaticDFWrite.main(StaticDFWrite.scala)
Spark version = 2.2.1
scala version = 2.11.12
You have to call save on DataFrameWriter object.
without save method it will just create DataFrameWriter object & terminate your session.
Check below code, I have checked in spark-shell.
Please note this code is working on spark version 2.4.0 but not working on 2.2.0
console format will not work with write in spark 2.2.0 - https://issues.apache.org/jira/browse/SPARK-20599
scala> df.write.format("console").mode("append")
res5: org.apache.spark.sql.DataFrameWriter[org.apache.spark.sql.Row] = org.apache.spark.sql.DataFrameWriter#148a3112
scala> df.write.format("console").mode("append").save()
+--------+---+
| name|age|
+--------+---+
|srinivas| 20|
+--------+---+

Exception while connecting to Hbase using Spark

I am connecting to Hbase using Spark. I have added all the dependencies but still i am getting this exception. Kindly help me like which JAR i need to add to resolve this issue.
SPARK_MAJOR_VERSION is set to 2, using Spark2
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.6.5.0-292/spark2/jars/slf4j-log4j12 -1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.6.5.0-292/spark2/jars/slf4j-log4j12 -1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.6.5.0-292/spark2/jars/phoenix-4.7.0 .2.6.5.0-292-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.6.5.0-292/spark2/jars/phoenix-4.7.0 .2.6.5.0-292-thin-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLeve l(newLevel).
18/09/17 05:34:36 WARN Utils: Service 'SparkUI' could not bind on port 4040. Att empting port 4041.
Spark context Web UI available at http://sandbox-hdp.hortonworks.com:4041
Spark context available as 'sc' (master = local[*], app id = local-1537162476668).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.3.0.2.6.5.0-292
/_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_171)
Type in expressions to have them evaluated.
Type :help for more information.
scala> :paste
// Entering paste mode (ctrl-D to finish)
import org.apache.spark.sql.{SQLContext, _}
import org.apache.spark.sql.execution.datasources.hbase._
import org.apache.spark.{SparkConf, SparkContext}
import spark.sqlContext.implicits._
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{ConnectionFactory,HBaseAdmin,HTable,Put,Get}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.{HTableDescriptor,HColumnDescriptor}
def catalog = s"""{
|"table":{"namespace":"default", "name":"Contacts"},
|"rowkey":"key",
|"columns":{
|"rowkey":{"cf":"rowkey", "col":"key", "type":"string"},
|"officeAddress":{"cf":"Office", "col":"Address", "type":"string"},
|"officePhone":{"cf":"Office", "col":"Phone", "type":"string"},
|"personalName":{"cf":"Personal", "col":"Name", "type":"string"},
|"personalPhone":{"cf":"Personal", "col":"Phone", "type":"string"}
|}
|}""".stripMargin
def withCatalog(cat: String): DataFrame = {
spark.sqlContext
.read
.options(Map(HBaseTableCatalog.tableCatalog->cat))
.format("org.apache.spark.sql.execution.datasources.hbase")
.load()
}
val df = withCatalog(catalog)
df.registerTempTable("contacts")
val query = spark.sqlContext.sql("select personalName, officeAddress from contacts")
query.show() <p>
// Exiting paste mode, now interpreting.
warning: there was one deprecation warning; re-run with -deprecation for details
java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/shaded/protobuf/generated/MasterProtos$MasterService$BlockingInterface
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
Below are the Jar's available in spark-Jar's Folder
hbase-0.94.2.jar
hbase-annotations-1.2.0.jar
hbase-client-2.1.0.jar
hbase-common-2.1.0.jar
hbase-hadoop-compat-2.1.0.jar
hbase-hadoop2-compat-2.1.0.jar
hbase-it-1.1.2.2.6.5.0-292.jar
hbase-prefix-tree-1.1.2.2.6.5.0-292.jar
hbase-procedure-1.1.2.2.6.5.0-292.jar
hbase-protocol-2.1.0.jar
hbase-server-2.1.0.jar
hbase-spark-1.2.0-cdh5.8.3.jar
hbase-spark-1.1.2.2.6.5.0-292.jar
hbase-thrift-1.1.2.2.6.5.0-292.jar
hive-hbase-handler-0.12.0-cdh5.1.3.jar
hive-hbase-handler-3.1.0.jar
protobuf-java-3.5.1.jar
Kindly provide me suggestion like which jar i missed to add in the jars folder in order to connect to hbase.
Seems like you are missing a shc-core jar which is used to write dataframes to hbase which has been implented by hortonworks.
As you are importing the package from hortonworks-shc-connector
import org.apache.spark.sql.execution.datasources.hbase._
You need add the jar to your spark application.
Steps to get the jar of shc-core connector:
First get pull of hortonworks-spark/hbase connector github repository then checkout to a appropriate branch with version of hbase and hadoop that you are using in your environment and build it using
mvn clean install -DskipTests
after executing above you will have a jar in your ~/.m2/repository/com/hortonworks/shc/
Use this jar for your spark application.
You can either add to your spark-jar folder or you can pass it in spark-submit/spark-shell with --jars flag
Then use try to execute the code you are trying run.
I have followed the same steps and was able to read from hbase with HCatalog.
Example
spark-shell --jars shc-core-1.1.3-2.4-s_2.11.jar
SPARK_MAJOR_VERSION is set to 2, using Spark2
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
setLogLevel(newLevel).
Spark context Web UI available at http://sandbox-hdp.hortonworks.com:4040
Spark context available as 'sc' (master = yarn, app id =
application_1592322799672_0007).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.0.7.0.3.0-79
/_/
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_232)
Type in expressions to have them evaluated.
Type :help for more information.
scala> :paste
// Entering paste mode (ctrl-D to finish)
import org.apache.spark.sql.{SQLContext, _}
import org.apache.spark.sql.execution.datasources.hbase._
import org.apache.spark.{SparkConf, SparkContext}
import spark.sqlContext.implicits._
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{ConnectionFactory,HBaseAdmin,HTable,Put,Get}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.{HTableDescriptor,HColumnDescriptor}
def catalog = s"""{
|"table":{"namespace":"default", "name":"Contacts"},
|"rowkey":"key",
|"columns":{
|"rowkey":{"cf":"rowkey", "col":"key", "type":"string"},
|"officeAddress":{"cf":"Office", "col":"Address", "type":"string"},
|"officePhone":{"cf":"Office", "col":"Phone", "type":"string"},
|"personalName":{"cf":"Personal", "col":"Name", "type":"string"},
|"personalPhone":{"cf":"Personal", "col":"Phone", "type":"string"}
|}
|}""".stripMargin
def withCatalog(cat: String): DataFrame = {
spark.sqlContext
.read
.options(Map(HBaseTableCatalog.tableCatalog->cat))
.format("org.apache.spark.sql.execution.datasources.hbase")
.load()
}
val df = withCatalog(catalog)
df.registerTempTable("contacts")
val query = spark.sqlContext.sql("select personalName, officeAddress from contacts")
query.show()
// Exiting paste mode, now interpreting.
warning: there was one deprecation warning; re-run with -deprecation for details
Hive Session ID = 5cc02976-98c4-447f-9ba0-e70c4a3c4ab1
+------------+-------------+
|personalName|officeAddress|
+------------+-------------+
|John Jackson| 40 Ellis St.|
|John Jackson| 40 Ellis St.|
+------------+-------------+
import org.apache.spark.sql.{SQLContext, _}
import org.apache.spark.sql.execution.datasources.hbase._
import org.apache.spark.{SparkConf, SparkContext}
import spark.sqlContext.implicits._
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{ConnectionFactory, HBaseAdmin, HTable, Put, Get}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.{HTableDescriptor, HColumnDescriptor}
catalog: String
withCatalog: (cat: String)org.apache.spark.sql.DataFrame
df: org.apache.spark.sql.DataFrame = [rowkey: string, officeAddress: string ... 3 more fields]
query: org.apache.spark.sql.DataFrame = [personalName: string, officeAddress: string]
scala> query.show()
+------------+-------------+
|personalName|officeAddress|
+------------+-------------+
|John Jackson| 40 Ellis St.|
|John Jackson| 40 Ellis St.|
+------------+-------------+
scala>
Stack Versions :
HBase 2.2.0
Hadoop 3.1.1
Spark 2.4.0
Scala 2.11.12

Test Spark Scala with Maven Got Error: java.lang.NoClassDefFoundError

I tried to Test Spark Scala on Scala IDE (eclipse) with Maven but keep getting error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
at org.apache.spark.SparkConf.loadFromSystemProperties(SparkConf.scala:73)
at org.apache.spark.SparkConf.<init>(SparkConf.scala:68)
at org.apache.spark.SparkConf.<init>(SparkConf.scala:55)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:904)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:901)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)
at com.SimpleApp$.main(SimpleApp.scala:7)
at com.SimpleApp.main(SimpleApp.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 9 more
The program I try is the Quick Start code, from the Spark Documentation:
import org.apache.spark.sql.SparkSession
object SimpleApp {
def main(args: Array[String]) {
val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
val logData = spark.read.textFile(logFile).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println(s"Lines with a: $numAs, Lines with b: $numBs")
spark.stop()
}
}
I use Spark 2.2.0 and Scala 2.11.7. The pom.xml file is:
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.2.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.2.0</version>
</dependency>
I followed a solution from another thread: NoClassDefFoundError com.apache.hadoop.fs.FSDataInputStream when execute spark-shell
But it doesn't work for me. The content in my spark-env.sh file is:
# If 'hadoop' binary is on your PATH
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
# With explicit path to 'hadoop' binary
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)
# Passing a Hadoop configuration directory
export SPARK_DIST_CLASSPATH=$(hadoop --config /usr/local/hadoop/etc/hadoop classpath)
Could anybody help me with this? Appreciate your help.
Devesh's answer solve parts of my problem. However, I got other problems:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/08/17 10:34:03 INFO SparkContext: Running Spark version 2.2.0
18/08/17 10:34:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/08/17 10:34:03 WARN Utils: Your hostname, toshiba0 resolves to a loopback address: 127.0.1.1; using 192.168.1.217 instead (on interface wlp2s0)
18/08/17 10:34:03 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
18/08/17 10:34:03 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:376)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:909)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:901)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)
at com.SimpleApp$.main(SimpleApp.scala:11)
at com.SimpleApp.main(SimpleApp.scala)
18/08/17 10:34:03 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:376)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:909)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:901)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)
at com.SimpleApp$.main(SimpleApp.scala:11)
at com.SimpleApp.main(SimpleApp.scala)
I don't know why Spark says my loopback address is 127.0.1.1, I checked my configuration: /etc/network/interfaces, it's auto loopback, and I ping 127.0.0.1. It works.
I followed the solution from this link Error initializing SparkContext: A master URL must be set in your configuration
and put the following code, because I use my laptop. It still doesn't work.
val conf = new SparkConf().setMaster("local[2]")
Don't know what happen to my settings. Thank you!
Just add following in maven pom.xml file
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.7.0</version>
</dependency>
In previous versions of Spark, you had to create a SparkConf and SparkContext to interact with Spark whereas in Spark 2.0 onwards the same effects can be achieved through SparkSession, without explicitly creating SparkConf, SparkContext or SQLContext, as they’re encapsulated within the SparkSession
** sample code snippet:-**
import org.apache.spark.sql.SparkSession
object SimpleApp {
def main(args: Array[String]) {
val logFile = "YOUR_SPARK_HOME/README.md" // some file on system
val spark = SparkSession
.builder
.appName("Simple Application")
.master("local[2]")
.getOrCreate()
val logData = spark.read.textFile(logFile).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println(s"Lines with a: $numAs, Lines with b: $numBs")
}
}