Use Spark SQL through the Scala IDE - scala

I want to try Spark SQL. At first I used bin/spark-shell and entered this code:
// case class describing the CSV columns (same definition as in the IDE code below)
case class MyMatch(col1: String, col2: String, col3: String, col4: String, col5: String,
  col6: String, col7: String, col8: String, col9: String)

val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
val data = sc.textFile("hdfs://localhost:9000/cars.csv")
val mapr = data.map(p => p.split(','))
val MyMatchRDD = mapr.map(p => MyMatch(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7), p(8)))

import sqlcontext.implicits._
val personDF = MyMatchRDD.toDF()
personDF.registerTempTable("Person")

val res = sqlcontext.sql("SELECT * FROM Person")
res.collect().foreach(println)
I didn't get any issue; it all worked. But when I used the Scala IDE, with these dependencies in my Maven pom file:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.3.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-catalyst_2.10</artifactId>
<version>1.3.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>1.3.0</version>
</dependency>
and the same code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.slf4j.Logger
import org.slf4j.LoggerFactory

object SparkSQL {

  case class MyMatch(col1: String, col2: String, col3: String, col4: String, col5: String,
    col6: String, col7: String, col8: String, col9: String)

  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("HiveFromSpark").setMaster("local")
    val sc = new SparkContext(sparkConf)
    val sqlcontext = new org.apache.spark.sql.SQLContext(sc)

    val data = sc.textFile("hdfs://localhost:9000/cars.csv")
    val mapr = data.map(p => p.split(','))
    val MyMatchRDD = mapr.map(p => MyMatch(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7), p(8)))

    import sqlcontext.implicits._
    val personDF = MyMatchRDD.toDF()
    personDF.registerTempTable("Person")

    val res = sqlcontext.sql("SELECT * FROM Person")
    res.collect().foreach(println)
  }
}
I got this error:
Exception in thread "main" scala.reflect.internal.MissingRequirementError: class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial classloader with boot classpath [D:\scala-SDK-4.4.1-vfinal-2.11-win32.win32.x86_64\eclipse\plugins\org.scala-
Thanks in advance for your help

You are using the wrong Scala version - this Spark build is compiled against Scala 2.10. Check your runtime and compiler Scala versions.
Also, why are you using such old dependencies? Spark is at version 2.0.2 right now, built with Scala 2.11.
Recommended actions:
(Optional) Change <version>1.3.0</version> to <version>2.0.2</version>
In your Scala compiler, change the version to 2.11 (if you updated to Spark 2) or 2.10 (if you still use Spark 1)
Make sure you have the proper Scala version installed on your machine - 2.11 in the case of Spark 2, 2.10 in the case of Spark 1. You can check the Scala version by typing scala -version in a console
Make sure your Scala IDE supports the Scala version that was chosen (a quick runtime check is sketched below)
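If it is not obvious which versions actually end up on the classpath, a minimal sketch along these lines (the VersionCheck object name is just for illustration) prints the Spark and Scala versions the running application really uses, so they can be compared against the pom and the IDE settings:
import org.apache.spark.{SparkConf, SparkContext}
object VersionCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("VersionCheck").setMaster("local"))
    // Spark version actually loaded from the classpath at run time
    println("Spark version: " + sc.version)
    // Scala library on the runtime classpath - must match the _2.10 / _2.11 artifact suffix
    println("Scala version: " + scala.util.Properties.versionString)
    sc.stop()
  }
}
If the printed Scala version does not match the artifact suffix of your Spark dependencies (_2.10 above), that mismatch is the likely cause of the MissingRequirementError.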

Related

Test Spark Scala with Maven Got Error: java.lang.NoClassDefFoundError

I tried to test Spark with Scala on Scala IDE (Eclipse) with Maven, but I keep getting this error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
at org.apache.spark.SparkConf.loadFromSystemProperties(SparkConf.scala:73)
at org.apache.spark.SparkConf.<init>(SparkConf.scala:68)
at org.apache.spark.SparkConf.<init>(SparkConf.scala:55)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:904)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:901)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)
at com.SimpleApp$.main(SimpleApp.scala:7)
at com.SimpleApp.main(SimpleApp.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 9 more
The program I am trying is the Quick Start code from the Spark documentation:
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    spark.stop()
  }
}
I use Spark 2.2.0 and Scala 2.11.7. The pom.xml file is:
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.2.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.2.0</version>
</dependency>
I followed a solution from another thread: NoClassDefFoundError com.apache.hadoop.fs.FSDataInputStream when execute spark-shell
But it doesn't work for me. The content in my spark-env.sh file is:
# If 'hadoop' binary is on your PATH
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
# With explicit path to 'hadoop' binary
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)
# Passing a Hadoop configuration directory
export SPARK_DIST_CLASSPATH=$(hadoop --config /usr/local/hadoop/etc/hadoop classpath)
Could anybody help me with this? Appreciate your help.
Devesh's answer solves part of my problem. However, I then got another problem:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/08/17 10:34:03 INFO SparkContext: Running Spark version 2.2.0
18/08/17 10:34:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/08/17 10:34:03 WARN Utils: Your hostname, toshiba0 resolves to a loopback address: 127.0.1.1; using 192.168.1.217 instead (on interface wlp2s0)
18/08/17 10:34:03 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
18/08/17 10:34:03 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:376)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:909)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:901)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)
at com.SimpleApp$.main(SimpleApp.scala:11)
at com.SimpleApp.main(SimpleApp.scala)
18/08/17 10:34:03 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:376)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:909)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:901)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)
at com.SimpleApp$.main(SimpleApp.scala:11)
at com.SimpleApp.main(SimpleApp.scala)
I don't know why Spark says my loopback address is 127.0.1.1. I checked my configuration in /etc/network/interfaces: it is set to auto loopback, and pinging 127.0.0.1 works.
I followed the solution from this link: Error initializing SparkContext: A master URL must be set in your configuration
and added the following line, because I am running on my laptop. It still doesn't work.
val conf = new SparkConf().setMaster("local[2]")
I don't know what is wrong with my settings. Thank you!
Just add the following to your Maven pom.xml file:
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.7.0</version>
</dependency>
In previous versions of Spark you had to create a SparkConf and SparkContext to interact with Spark, whereas from Spark 2.0 onwards the same can be achieved through SparkSession, without explicitly creating a SparkConf, SparkContext or SQLContext, since they are encapsulated within the SparkSession.
Sample code snippet:
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // some file on the system
    val spark = SparkSession
      .builder
      .appName("Simple Application")
      .master("local[2]")
      .getOrCreate()
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
  }
}
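As a side note, the setMaster("local[2]") attempt from the question had no effect because that SparkConf was never handed to the SparkSession builder. A minimal sketch (the SimpleAppWithConf name is just for illustration) of passing an existing SparkConf through builder.config:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
object SimpleAppWithConf {
  def main(args: Array[String]): Unit = {
    // The master set on this conf takes effect only because the conf is passed to the builder
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]")
    val spark = SparkSession.builder.config(conf).getOrCreate()
    println("Running against master: " + spark.sparkContext.master)
    spark.stop()
  }
}
The 127.0.1.1 loopback messages are only warnings about which network interface Spark binds to; as the log itself suggests, set SPARK_LOCAL_IP if you need to bind to another address. They are not what causes the "A master URL must be set" failure.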

MongoDB Scala Driver doesn't allow using a Map collection in a case class

I just started using MongoDB and I'm trying to write a small application to test Mongo with Scala. I created the following case class in order to cast the Documents to a Scala class:
case class User(
  _id: ObjectId,
  userId: String,
  items: Map[String, Int]
)

object User {
  def apply(userId: String, items: Map[String, Int]): User =
    new User(new ObjectId, userId, items)

  implicit val codecRegistry: CodecRegistry =
    fromRegistries(fromProviders(classOf[User]), DEFAULT_CODEC_REGISTRY)
}
I get the following error but I don't know why since the Map keys are in fact strings.
[ERROR] error: Maps must contain string types for keys
[INFO] implicit val codecRegistry: CodecRegistry = fromRegistries (fromProviders (classOf [User]), DEFAULT_CODEC_REGISTRY)
[INFO] ^
[ERROR] one error found
I'm also applying the codecRegistry to the MongoDatabase.
Thank you very much.
The problem was that I was using a version of the driver that is compiled for Scala 2.11 and not 2.12. By changing the Maven dependency from
<dependency>
<groupId>org.mongodb.scala</groupId>
<artifactId>mongo-scala-driver_2.11</artifactId>
<version>2.2.1</version>
</dependency>
to
<dependency>
<groupId>org.mongodb.scala</groupId>
<artifactId>mongo-scala-driver_2.12</artifactId>
<version>2.2.1</version>
</dependency>
solved the problem.
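For completeness, a minimal sketch of how such a registry is typically wired into the database and collection, using the User case class from the question (the database and collection names here are placeholders):
import org.bson.codecs.configuration.CodecRegistries.{fromProviders, fromRegistries}
import org.mongodb.scala.MongoClient.DEFAULT_CODEC_REGISTRY
import org.mongodb.scala.bson.codecs.Macros._
import org.mongodb.scala.{MongoClient, MongoCollection, MongoDatabase}
object UserStore {
  // Combine the macro-generated User codec with the driver's default registry
  val codecRegistry = fromRegistries(fromProviders(classOf[User]), DEFAULT_CODEC_REGISTRY)
  val client: MongoClient = MongoClient() // mongodb://localhost:27017 by default
  val database: MongoDatabase = client.getDatabase("testdb").withCodecRegistry(codecRegistry)
  val users: MongoCollection[User] = database.getCollection[User]("users")
}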

scala.reflect.internal.MissingRequirementError: object java.lang.Object in compiler mirror not found

I am trying to build a Spark Streaming application using sbt package, and I can't figure out the reason for this error.
This is part of the error:
scala.reflect.internal.MissingRequirementError: object java.lang.Object in compiler mirror not found.
at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
And here is the code:
import org.apache.spark.SparkContext
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import twitter4j.Status

object TrendingHashTags {
  def main(args: Array[String]): Unit = {
    val Array(consumerKey, consumerSecret, accessToken, accessTokenSecret,
      lang, batchInterval, minThreshold, showCount) = args.take(8)
    val filters = args.takeRight(args.length - 8)
    System.setProperty("twitter4j.oauth.consumerKey", consumerKey)
    System.setProperty("twitter4j.oauth.consumerSecret", consumerSecret)
    System.setProperty("twitter4j.oauth.accessToken", accessToken)
    System.setProperty("twitter4j.oauth.accessTokenSecret", accessTokenSecret)
    val conf = new SparkConf().setAppName("TrendingHashTags")
    val ssc = new StreamingContext(conf, Seconds(batchInterval.toInt))
    val tweets = TwitterUtils.createStream(ssc, None, filters)
    val tweetsFilteredByLang = tweets.filter { tweet => tweet.getLang() == lang }
    val statuses = tweetsFilteredByLang.map { tweet => tweet.getText() }
    val words = statuses.flatMap { status => status.split("""\s+""") }
    val hashTags = words.filter { word => word.startsWith("#") }
    val hashTagPairs = hashTags.map { hashtag => (hashtag, 1) }
    val tagsWithCounts = hashTagPairs.updateStateByKey(
      (counts: Seq[Int], prevCount: Option[Int]) =>
        prevCount.map { c => c + counts.sum }.orElse { Some(counts.sum) }
    )
    val topHashTags = tagsWithCounts.filter { case (t, c) =>
      c > minThreshold.toInt
    }
    val sortedTopHashTags = topHashTags.transform { rdd =>
      rdd.sortBy({ case (w, c) => c }, false)
    }
    sortedTopHashTags.print(showCount.toInt)
    ssc.start()
    ssc.awaitTermination()
  }
}
I solved this issue: I found that I was using Java 9, which isn't compatible with my Scala version, so I migrated from Java 9 to Java 8.
The error means that Scala was compiled using a version of Java different from the one currently in use.
I am using maven instead of sbt, but the same behavior is observed.
Find the java version:
> /usr/libexec/java_home -V
Matching Java Virtual Machines (2):
15.0.1, x86_64: "OpenJDK 15.0.1" /Users/noname/Library/Java/JavaVirtualMachines/openjdk-15.0.1/Contents/Home
1.8.0_271, x86_64: "Java SE 8" /Library/Java/JavaVirtualMachines/jdk1.8.0_271.jdk/Contents/Home
If you installed Scala while you were on a Java version >1.8 and then downgraded the Java version (edited $JAVA_HOME to point to 1.8), you will get this error.
Check the Scala version being used by the project:
$ ls -l /Users/noname/.m2/repository/org/scala-lang/scala-library/2.11.11/scala-library-2.11.11.jar
-rwxrwxrwx 1 noname staff 0 Nov 17 03:41 /Users/noname/.m2/repository/org/scala-lang/scala-library/2.11.11/scala-library-2.11.11.jar
To rectify the issue, remove the scala jar file:
$ rm /Users/noname/.m2/repository/org/scala-lang/scala-library/2.11.11/scala-library-2.11.11.jar
Now execute mvn clean install again and the project will compile.
I had faced this issue when I had to downgrade my project's Scala version to use a dependency compiled against a lower Scala version, and I could not resolve it even after making sure the JDK and all other dependencies were compatible with the downgraded Scala library version.
As @ForeverLearner mentioned above, deleting the Scala library versions higher than the one I now use to compile the project from the Maven repo (/Users/<>/.m2/repository/org/scala-lang/scala-library/...) helped me get rid of this error.
The above fix (setting Java 8) resolved my issue as well. If you are using IntelliJ, you can go to Project Settings and, under Project, change the Project SDK to 1.8.
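If it is unclear which of several installed JVMs a build or application is actually using, a tiny sketch like this (the JvmCheck name is just for illustration) prints it at startup:
object JvmCheck {
  def main(args: Array[String]): Unit = {
    // The JVM running this process and where it is installed
    println("java.version = " + System.getProperty("java.version"))
    println("java.home    = " + System.getProperty("java.home"))
  }
}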

Runtime error in Scala: NoSuchMethodError

I am trying to use Spark MLlib algorithms in Scala in Eclipse. There are no problems during compilation, but at runtime there is an error saying "NoSuchMethodError".
Here is my code (copied):
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib._

object LinearRegression {

  def truncate(k: Array[String], n: Int): List[String] = {
    // remove the element at index n-1 from the array
    val trunced = k.take(n - 1) ++ k.drop(n)
    trunced.toList
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("linear regression").setMaster("local"))

    // Loading data
    val data = sc.textFile("D://Innominds//DataSets//Regression//Regression Dataset.csv")
    println("Total no of instances :" + data.count())

    // Split the data into training and testing
    val split = data.randomSplit(Array(0.8, 0.2))
    val train = split(0).cache()
    println("Training instances :" + train.count())
    val test = split(1).cache()
    println("Testing instances :" + test.count())

    // Mapping the data
    val trainingRDD = train.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(5).toDouble, Vectors.dense(truncate(parts, 5).map(x => x.toDouble).toArray))
    }
    val testingRDD = test.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(5).toDouble, Vectors.dense(truncate(parts, 5).map(x => x.toDouble).toArray))
    }

    val model = LinearRegressionWithSGD.train(trainingRDD, 20)

    val predict = testingRDD.map { x =>
      val score = model.predict(x.features)
      (score, x.label)
    }
    val loss = predict.map { case (p, l) =>
      val err = p - l
      err * err
    }.reduce(_ + _)
    val rmse = math.sqrt(loss / test.count())
    println("Test RMSE = " + rmse)
    sc.stop()
  }
}
The error arises at the line where the model is built, i.e.
val model = LinearRegressionWithSGD.train(trainingRDD, 20)
The print statements before this line print their values to the console perfectly.
Dependencies in pom.xml are:
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.4</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.specs</groupId>
<artifactId>specs</artifactId>
<version>1.2.5</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>1.2.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.11</artifactId>
<version>1.3.0</version>
</dependency>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>14.0.1</version>
</dependency>
</dependencies>
Error in Eclipse:
15/03/19 15:11:32 INFO SparkContext: Created broadcast 6 from broadcast at GradientDescent.scala:185
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.rdd.RDD.treeAggregate$default$4(Ljava/lang/Object;)I
at org.apache.spark.mllib.optimization.GradientDescent$$anonfun$runMiniBatchSGD$1.apply$mcVI$sp(GradientDescent.scala:189)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:166)
at org.apache.spark.mllib.optimization.GradientDescent$.runMiniBatchSGD(GradientDescent.scala:184)
at org.apache.spark.mllib.optimization.GradientDescent.optimize(GradientDescent.scala:107)
at org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:263)
at org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:190)
at org.apache.spark.mllib.regression.LinearRegressionWithSGD$.train(LinearRegression.scala:150)
at org.apache.spark.mllib.regression.LinearRegressionWithSGD$.train(LinearRegression.scala:184)
at Algorithms.LinearRegression$.main(LinearRegression.scala:46)
at Algorithms.LinearRegression.main(LinearRegression.scala)
You're using spark-core 1.2.1 and spark-mllib 1.3.0. Make sure you use the same version for both dependencies.
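For example, one way to keep the two artifacts aligned is to factor the Spark version into a single Maven property (a sketch, shown here with 1.3.0; use whichever Spark version you target for both):
<properties>
<spark.version>1.3.0</spark.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
</dependencies>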

Error compiling Spark code: object mapreduce is not a member of package org.apache.hadoop.hbase

Edit: I added the HBase dependencies defined in the top-level pom file to the project-level pom, and now it can find the package.
I have a Scala object to read data from an HBase (0.98.4-hadoop2) table within Spark (1.0.1). However, compiling with Maven results in an error when I try to import org.apache.hadoop.hbase.mapreduce.TableInputFormat.
error: object mapreduce is not a member of package org.apache.hadoop.hbase
The code and relevant pom are below:
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.SparkContext
import java.util.Properties
import java.io.FileInputStream
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

object readDataFromHbase {

  def main(args: Array[String]): Unit = {
    var propFileName = "hbaseConfig.properties"
    if (args.size > 0) {
      propFileName = args(0)
    }

    /** Load properties **/
    val prop = new Properties
    val inStream = new FileInputStream(propFileName)
    prop.load(inStream)

    // set spark context and open input file
    val sparkMaster = prop.getProperty("hbase.spark.master")
    val sparkJobName = prop.getProperty("hbase.spark.job.name")
    val sc = new SparkContext(sparkMaster, sparkJobName)

    // set hbase connection
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set("hbase.rootdir", prop.getProperty("hbase.rootdir"))
    hbaseConf.set(TableInputFormat.INPUT_TABLE, prop.getProperty("hbase.table.name"))

    val hBaseRDD = sc.newAPIHadoopRDD(hbaseConf,
      classOf[org.apache.hadoop.hbase.mapreduce.TableInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])

    val hBaseData = hBaseRDD.map(t => t._2)
      .map(res => res.getColumnLatestCell("cf".getBytes(), "col".getBytes()))
      .map(c => c.getValueArray())
      .map(a => new String(a, "utf8"))

    hBaseData.foreach(println)
  }
}
The HBase part of the pom file is (hbase.version = 0.98.4-hadoop2):
<!-- HBase -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-common</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-hadoop2-compat</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-hadoop-compat</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-protocol</artifactId>
<version>${hbase.version}</version>
</dependency>
I have run a clean package with no luck. The main thing I need from the import is classOf[TableInputFormat], to be used in setting up the RDD. I suspect that I'm missing a dependency in my pom file but can't figure out which one. Any help would be greatly appreciated.
TableInputFormat is in the org.apache.hadoop.hbase.mapreduce package, which is part of the hbase-server artifact, so you will need to add that as a dependency, as @xgdgsc commented:
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>${hbase.version}</version>
</dependency>
In Spark 1.0 and above:
Put all your HBase jars into the spark/assembly/lib or spark/core/lib directory. Hopefully you have Docker to automate all this.
a) For the CDH version, the relevant HBase jars are usually under /usr/lib/hbase/*.jar, which are symlinks to the correct jars.
b) A good article to read: http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html