I am trying to use Spark MLlib algorithms in Scala in Eclipse. Compilation succeeds, but at runtime the program fails with a "NoSuchMethodError".
Here is my code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib._
object LinearRegression {
  // Drops the n-th element (1-based), i.e. the element at index n - 1.
  def truncate(k: Array[String], n: Int): List[String] = {
    val trunced = k.take(n - 1) ++ k.drop(n)
    // println(trunced.length)
    trunced.toList
  }
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("linear regression").setMaster("local"))
    //Loading Data
    val data = sc.textFile("D://Innominds//DataSets//Regression//Regression Dataset.csv")
    println("Total no of instances :" + data.count())
    //Split the data into training and testing
    val split = data.randomSplit(Array(0.8, 0.2))
    val train = split(0).cache()
    println("Training instances :" + train.count())
    val test = split(1).cache()
    println("Testing instances :" + test.count())
    //Mapping the data
    val trainingRDD = train.map { line =>
      val parts = line.split(',')
      //println(parts.length)
      LabeledPoint(parts(5).toDouble, Vectors.dense(truncate(parts, 5).map(x => x.toDouble).toArray))
    }
    val testingRDD = test.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(5).toDouble, Vectors.dense(truncate(parts, 5).map(x => x.toDouble).toArray))
    }
    val model = LinearRegressionWithSGD.train(trainingRDD, 20)
    val predict = testingRDD.map { x =>
      val score = model.predict(x.features)
      (score, x.label)
    }
    val loss = predict.map {
      case (p, l) =>
        val err = p - l
        err * err
    }.reduce(_ + _)
    val rmse = math.sqrt(loss / test.count())
    println("Test RMSE = " + rmse)
    sc.stop()
  }
}
The error occurs while training the model, i.e. at
val model = LinearRegressionWithSGD.train(trainingRDD, 20)
The print statements before this line print their values to the console correctly.
The dependencies in pom.xml are:
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.4</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.specs</groupId>
<artifactId>specs</artifactId>
<version>1.2.5</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>1.2.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.11</artifactId>
<version>1.3.0</version>
</dependency>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>14.0.1</version>
</dependency>
</dependencies>
Error in eclipse:
15/03/19 15:11:32 INFO SparkContext: Created broadcast 6 from broadcast at GradientDescent.scala:185
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.rdd.RDD.treeAggregate$default$4(Ljava/lang/Object;)I
at org.apache.spark.mllib.optimization.GradientDescent$$anonfun$runMiniBatchSGD$1.apply$mcVI$sp(GradientDescent.scala:189)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:166)
at org.apache.spark.mllib.optimization.GradientDescent$.runMiniBatchSGD(GradientDescent.scala:184)
at org.apache.spark.mllib.optimization.GradientDescent.optimize(GradientDescent.scala:107)
at org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:263)
at org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:190)
at org.apache.spark.mllib.regression.LinearRegressionWithSGD$.train(LinearRegression.scala:150)
at org.apache.spark.mllib.regression.LinearRegressionWithSGD$.train(LinearRegression.scala:184)
at Algorithms.LinearRegression$.main(LinearRegression.scala:46)
at Algorithms.LinearRegression.main(LinearRegression.scala)
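You're using spark-core 1.2.1 and spark-mllib 1.3.0. As the stack trace shows, MLlib 1.3.0 calls RDD.treeAggregate, which the spark-core 1.2.1 jar on your classpath does not provide. Make sure you use the same version for both dependencies.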
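When a NoSuchMethodError like this appears, it can help to ask the JVM which jar each class was actually loaded from. The following is only a minimal sketch: WhichJar and jarOf are hypothetical names introduced for illustration, and scala.Option stands in for a Spark class so that the snippet runs without Spark on the classpath.

```scala
object WhichJar {
  // Returns the location (usually a jar path) a class was loaded from,
  // or None when the JVM reports no code source for it
  // (as is typically the case for JDK bootstrap classes).
  def jarOf(cls: Class[_]): Option[String] =
    for {
      domain   <- Option(cls.getProtectionDomain)
      source   <- Option(domain.getCodeSource)
      location <- Option(source.getLocation)
    } yield location.toString

  def main(args: Array[String]): Unit = {
    // Assumption: in the project above one would pass
    // classOf[org.apache.spark.rdd.RDD[_]] and
    // classOf[org.apache.spark.mllib.regression.LinearRegressionWithSGD]
    // here and compare the version numbers in the two jar file names.
    println(jarOf(classOf[Option[_]]))
  }
}
```

If the two Spark classes resolve to jars with different version numbers in their file names, the dependency versions are mixed.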
When I run a normal word count program (code below) without any DataFrame, I am able to run the application with spark-submit.
object wordCount {
def main(args: Array[String]): Unit = {
val logFile= "path/thread.txt"
val sparkConf = new SparkConf().setAppName("Spark Word Count")
val sc = new SparkContext(sparkConf)
val file = sc.textFile(logFile)
val counts = file.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("path/output1234")
sc.stop()
}
}
But when I run the code below:
import scala.reflect.runtime.universe
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD.rddToPairRDDFunctions
object wordCount {
def main(args: Array[String]): Unit = {
val logFile = "path/thread.txt"
val sparkConf = new SparkConf().setAppName("Spark Word Count")
val sc = new SparkContext(sparkConf)
val file = sc.textFile(logFile)
val counts = file.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
case class count1(key:String,value:Int)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._;
counts.toDF.registerTempTable("count1")
val counts1 = sqlContext.sql("select * from count1")
counts.saveAsTextFile("path/output1234")
sc.stop()
}
}
I am getting the below error:
Exception in thread "main" java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror;
at com.cadillac.spark.sparkjob.wordCount$.main(wordCount.scala:18)
I am not sure what I am missing.
The pom.xml I am using is as follows:
<name>sparkjob</name>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.6.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>1.6.1</version>
</dependency>
<dependency>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<version>2.10</version>
</dependency>
</dependencies>
</project>
Please suggest any changes.
My cluster runs:
Spark version 2.1.0-mapr-1703
Scala version 2.11.8
Thanks in advance.
The documentation explains the reason for this error as follows:
This means that there is a mix of Scala versions in the libraries used in your code. The collection API is different between Scala 2.10 and 2.11 and this is the most common error which occurs if a Scala 2.10 library is attempted to be loaded in a Scala 2.11 runtime. To fix this make sure that the name has the correct Scala version suffix to match your Scala version.
Your cluster runs Scala 2.11.8 and Spark 2.1.0, so change your dependencies from
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.6.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>1.6.1</version>
</dependency>
to
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.1.0</version>
</dependency>
and add one more dependency
<!-- https://mvnrepository.com/artifact/org.scala-lang/scala-library -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.8</version>
</dependency>
With these changes the error should go away.
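To see which Scala version the application is really running on, and whether an artifact name carries the matching suffix, something like the sketch below can be used. It assumes a Scala 2.x runtime; ScalaVersionCheck, binaryVersion and matchesSuffix are hypothetical helper names, not part of Spark or Maven.

```scala
object ScalaVersionCheck {
  // Binary version = first two components, e.g. "2.11" from "2.11.8".
  def binaryVersion(full: String): String =
    full.split('.').take(2).mkString(".")

  // True when an artifactId such as "spark-core_2.11" ends with the given binary suffix.
  def matchesSuffix(artifactId: String, binary: String): Boolean =
    artifactId.endsWith("_" + binary)

  def main(args: Array[String]): Unit = {
    // Version of the Scala library found on the runtime classpath.
    val running = scala.util.Properties.versionNumberString
    val binary = binaryVersion(running)
    println(s"Running on Scala $running (binary version $binary)")
    // Artifact names taken from the pom files in this thread.
    Seq("spark-core_2.10", "spark-sql_2.10", "spark-core_2.11", "spark-sql_2.11").foreach { id =>
      println(s"$id matches runtime: ${matchesSuffix(id, binary)}")
    }
  }
}
```

Running this with the same classpath as the failing job shows at a glance whether the _2.10 or the _2.11 artifacts fit the runtime.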
I am trying to use Spark Structured Streaming with Kafka.
object StructuredStreaming {
def main(args: Array[String]) {
if (args.length < 2) {
System.err.println("Usage: StructuredStreaming <hostname> <port>")
System.exit(1)
}
val host = args(0)
val port = args(1).toInt
val spark = SparkSession
.builder
.appName("StructuredStreaming")
.config("spark.master", "local")
.getOrCreate()
import spark.implicits._
// Subscribe to 1 topic
val lines = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9093")
.option("subscribe", "sparkss")
.load()
lines.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
}
}
I took my code from the Spark documentation and got this build error:
Unable to find encoder for type stored in a Dataset. Primitive types
(Int, String, etc) and Product types (case classes) are supported by
importing spark.implicits._ Support for serializing other types will
be added in future releases.
.as[(String, String)]
I read in another SO post that this is caused by a missing import spark.implicits._, but adding the import does not change anything for me.
UPDATE: my pom.xml
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<slf4j.version>1.7.12</slf4j.version>
<spark.version>2.1.0</spark.version>
<scala.version>2.10.4</scala.version>
<scala.binary.version>2.10</scala.binary.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_2.10</artifactId>
<version>2.1.0</version>
</dependency>
</dependencies>
Well, I tried with Scala 2.11.8:
<scala.version>2.11.8</scala.version>
<scala.binary.version>2.11</scala.binary.version>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_2.11</artifactId>
<version>2.1.0</version>
</dependency>
</dependencies>
With the corresponding dependencies (for Scala 2.11) it eventually worked.
Warning: you need to restart your project in IntelliJ. I think there are some problems when you change the version without restarting; the errors remain until you do.
I get an error when I try to compile, test and run a JUnit test.
I want to load a local Avro file using DataFrames, but I am getting an exception:
org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] null
I am not using Cassandra at all. The versions of the jars involved are:
<properties>
<!-- Generic properties -->
<java.version>1.7</java.version>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
<!-- Dependency versions -->
<maven.compiler.source>1.7</maven.compiler.source>
<maven.compiler.target>1.7</maven.compiler.target>
<scala.version>2.10.4</scala.version>
<junit.version>4.11</junit.version>
<slf4j.version>1.7.12</slf4j.version>
<spark.version>1.5.0-cdh5.5.2</spark.version>
<databricks.version>1.5.0</databricks.version>
<json4s-native.version>3.5.0</json4s-native.version>
<spark-avro.version>2.0.1</spark-avro.version>
</properties>
and these are the dependencies:
<dependencies>
<dependency>
<groupId>org.json4s</groupId>
<artifactId>json4s-native_2.10</artifactId>
<version>${json4s-native.version}</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>${junit.version}</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-csv_2.10</artifactId>
<version>${databricks.version}</version>
<exclusions>
<exclusion>
<groupId>org.xerial.snappy</groupId>
<artifactId>snappy-java</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.xerial.snappy</groupId>
<artifactId>snappy-java</artifactId>
<version>1.0.4.1</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-avro_2.10</artifactId>
<version>${spark-avro.version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/log4j/log4j -->
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
</dependency>
</dependencies>
I have tried to compile the project with
mvn clean install -Dorg.xerial.snappy.lib.name=libsnappyjava.jnlib -Dorg.xerial.snappy.tempdir=/tmp
having first copied the jar into /tmp, with no luck.
$ ls -lt /tmp/
total 1944
...27 dic 13:01 snappy-java-1.0.4.jar
This is the code:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext, SaveMode}
import org.apache.spark.{SparkConf, SparkContext}
import com.databricks.spark.avro._
import java.io._
//auxiliary function
def readRawData(pathToResources: String, sqlContext: SQLContext, rawFormat: String = "json"): DataFrame = {
val a: DataFrame = rawFormat match {
case "avro" => sqlContext.read.avro(pathToResources)
case "json" => sqlContext.read.json(pathToResources)
case _ => throw new Exception("Format not supported, use AVRO or JSON instead.")
}
val b: DataFrame = a.filter("extraData.type = 'data'")
val c: DataFrame = a.select("extraData.topic", "extraData.timestamp",
"extraData.sha1Hex", "extraData.filePath", "extraData.fileName",
"extraData.lineNumber", "extraData.type",
"message")
val indexForMessage: Int = c.schema.fieldIndex("message")
val result: RDD[Row] = c.rdd.filter(r => !r.anyNull).flatMap(r => {
val metadata: String = r.toSeq.slice(0, indexForMessage).mkString(",")
val lines = r.getString(indexForMessage).split("\n")
lines.map(l => Row.fromSeq(metadata.split(",").toSeq ++ Seq(l)))
})
sqlContext.createDataFrame(result, c.schema)
}//readRawData
def validate(rawFlumeData : String = "FlumeData.1482407196579",fileNamesToBeDigested : String = "fileNames-to-be-digested.txt", sqlContext: SQLContext,sc:SparkContext) : Boolean = {
val result : Boolean = true
sqlContext.sparkContext.hadoopConfiguration.set("avro.mapred.ignore.inputs.without.extension", "false")
val rawDF : DataFrame = readRawData(rawFlumeData, sqlContext, rawFormat = "avro")
rawDF.registerTempTable("RAW")
//this line provokes the exception! cannot load snappy jar file!
val arrayRows : Array[org.apache.spark.sql.Row] = sqlContext.sql("SELECT distinct fileName as filenames FROM RAW GROUP BY fileName").collect()
val arrayFileNames : Array[String] = arrayRows.map(row=>row.getString(0))
val fileNamesDigested = "fileNames-AVRO-1482407196579.txt"
val pw = new PrintWriter(new File(fileNamesDigested))
for (filename <-arrayFileNames) pw.write(filename + "\n")
pw.close
val searchListToBeDigested : org.apache.spark.rdd.RDD[String] = sc.textFile(fileNamesToBeDigested)
//I create a map with values like these: Map(EUR_BACK_SWVOL_SMILE_GBP_20160930.csv -> 0, UK_SC_equities_20160930.csv -> 14,...
//val mapFileNamesToBeDigested: Map[String, Long] = searchListToBeDigested.zipWithUniqueId().collect().toMap
val searchFilesAVRODigested = sc.textFile(fileNamesDigested)
val mapFileNamesAVRODigested: Map[String, Long] = searchFilesAVRODigested.zipWithUniqueId().collect().toMap
val pwResults = new PrintWriter(new File("validation-results.txt"))
//The result has to be saved to a text file somewhere...
val buffer = StringBuilder.newBuilder
//Bring the results back to the driver.
val listFilesToBeDigested = searchListToBeDigested.map {line =>
val resultTemp = mapFileNamesAVRODigested.getOrElse(line,"NOT INGESTED!")
var resul = ""
if (resultTemp == "NOT INGESTED!"){
resul = "File " + line + " " + resultTemp + "\n"
}
else{
resul = "File " + line + " " + " is INGESTED!" + "\n"
}
resul
}.collect()
//append the data to the buffer
listFilesToBeDigested.foreach(buffer.append(_))
//write the buffer contents to the output text file.
pwResults.write(buffer.toString)
pwResults.close
//this boolean must return false in case of a exception or error...
result
}//
This is the unit test code:
private[validation] class ValidateInputCSVFilesTest {
//AS YOU CAN SEE, I do not WANT to use snappy at all!
val conf = new SparkConf()
.setAppName("ValidateInputCSVFilesTest")
.setMaster("local[2]")
.set("spark.driver.allowMultipleContexts", "true")
.set("spark.driver.host", "127.0.0.1")
.set("spark.io.compression.codec", "lzf")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val properties : Properties = new Properties()
properties.setProperty("frtb.input.csv.validation.avro","./src/test/resources/avro/FlumeData.1482407196579")
properties.setProperty("frtb.input.csv.validation.list.files","./src/test/resources/fileNames-to-be-digested.txt")
import sqlContext.implicits._
sqlContext.sparkContext.hadoopConfiguration.set("avro.mapred.ignore.inputs.without.extension", "false")
@Test
def testValidateInputFiles() = {
//def validate(rawFlumeData : String = "FlumeData.1482407196579",fileNamesToBeDigested : String = "fileNames-to-be-digested.txt", sqlContext: SQLContext)
val rawFlumeData = properties.getProperty("frtb.input.csv.validation.avro")
val fileNamesToBeDigested = properties.getProperty("frtb.input.csv.validation.list.files")
println("rawFlumeData is " + rawFlumeData )
println("fileNamesToBeDigested is " + fileNamesToBeDigested )
val result : Boolean = ValidateInputCSVFiles.validate(rawFlumeData ,fileNamesToBeDigested ,sqlContext,sc)
Assert.assertTrue("Must be true...",result)
}//end of test method
}//end of unit class
The same code runs perfectly in a local spark-shell started with this command:
$ bin/spark-shell --packages org.json4s:json4s-native_2.10:3.5.0 --packages com.databricks:spark-csv_2.10:1.5.0 --packages com.databricks:spark-avro_2.10:2.0.1
What else can I do?
Thanks in advance.
The problem was solved when I changed the scope of the Spark dependencies to provided.
This is the part of the pom.xml that solved my problem; now I can run the job with the spark-submit command:
<properties>
<!-- Generic properties -->
<java.version>1.7</java.version>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
<!-- Dependency versions -->
<maven.compiler.source>1.7</maven.compiler.source>
<maven.compiler.target>1.7</maven.compiler.target>
<scala.version>2.10.4</scala.version>
<junit.version>4.11</junit.version>
<slf4j.version>1.7.12</slf4j.version>
<spark.version>1.5.0-cdh5.5.2</spark.version>
<databricks.version>1.5.0</databricks.version>
<json4s-native.version>3.5.0</json4s-native.version>
<spark-avro.version>2.0.1</spark-avro.version>
</properties>
...
<dependencies>
<dependency>
<groupId>org.json4s</groupId>
<artifactId>json4s-native_2.10</artifactId>
<version>${json4s-native.version}</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>${junit.version}</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-csv_2.10</artifactId>
<version>${databricks.version}</version>
<scope>provided</scope>
<exclusions>
<exclusion>
<groupId>org.xerial.snappy</groupId>
<artifactId>snappy-java</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.xerial.snappy</groupId>
<artifactId>snappy-java</artifactId>
<version>1.0.4.1</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-avro_2.10</artifactId>
<version>${spark-avro.version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/log4j/log4j -->
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
</dependency>
</dependencies>
...
I tried the code below and cannot import sqlContext.implicits._; the Scala IDE reports the following error and the code does not build:
value implicits is not a member of org.apache.spark.sql.SQLContext
Do I need to add any dependencies in pom.xml?
Spark version 1.5.2
package com.Spark.ConnectToHadoop
import org.apache.spark.SparkConf
import org.apache.spark.SparkConf
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.rdd.RDD
//import groovy.sql.Sql.CreateStatementCommand
//import org.apache.spark.SparkConf
object CountWords {
def main(args:Array[String]){
val objConf = new SparkConf().setAppName("Spark Connection").setMaster("spark://IP:7077")
var sc = new SparkContext(objConf)
val objHiveContext = new HiveContext(sc)
objHiveContext.sql("USE test")
var rdd= objHiveContext.sql("select * from Table1")
val options=Map("path" -> "hdfs://URL/apps/hive/warehouse/test.db/TableName")
//val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ //Error
val dataframe = rdd.toDF()
dataframe.write.format("orc").options(options).mode(SaveMode.Overwrite).saveAsTable("TableName")
}
}
My pom.xml file is as follows
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.Sudhir.Maven1</groupId>
<artifactId>SparkDemo</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<name>SparkDemo</name>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.5.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>1.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.10</artifactId>
<version>1.5.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>0.9.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.10</artifactId>
<version>1.2.1</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-jdbc</artifactId>
<version>1.2.1</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
</dependencies>
</project>
First create the context:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Now we have a sqlContext bound to sc (in spark-shell this is available automatically at launch).
Then import its implicits:
import sqlContext.implicits._
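The reason the import has to come after the val is that implicits is a member of the context instance, so the import needs an already created, stable identifier (a val, not a var). The mechanism can be sketched without Spark; Context, RichSeq and describe below are hypothetical stand-ins invented for illustration and are not the Spark API.

```scala
// Hypothetical stand-in for SQLContext: the implicits live on the instance.
class Context(val name: String) {
  object implicits {
    // Adds a describe method to any Seq while the import is in scope.
    implicit class RichSeq[A](val xs: Seq[A]) {
      def describe: String = s"$name holds ${xs.size} rows"
    }
  }
}

object ImplicitsDemo {
  def main(args: Array[String]): Unit = {
    // Must be a val: a Scala import path has to be a stable identifier.
    val ctx = new Context("demo")
    import ctx.implicits._
    println(Seq(1, 2, 3).describe)
  }
}
```

In the same way, toDF only becomes available on an RDD once sqlContext.implicits._ (or spark.implicits._ in Spark 2.x) has been imported from a val.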
With the release of Spark 2.0.0 (July 26, 2016) one should now use the following:
import spark.implicits._ // spark = SparkSession.builder().getOrCreate()
https://databricks.com/blog/2016/08/15/how-to-use-sparksession-in-apache-spark-2-0.html
Your pom.xml pulls in an old version of spark-sql (1.0.0) that does not match your Spark 1.5.2. Change it to:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>1.5.2</version>
</dependency>
If you are building with sbt, update the library versions to:
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-core_2.12" % "2.4.6" % "provided",
"org.apache.spark" % "spark-sql_2.12" % "2.4.6" % "provided"
)
And then import SqlImplicits as below.
val spark = SparkSession.builder()
.appName("appName")
.getOrCreate()
import spark.sqlContext.implicits._;
You can also use
<properties>
<spark.version>2.2.0</spark.version>
</properties>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
Edit: I added the hbase dependencies defined in the top level pom file to the project level pom and now it can find the package.
I have a scala object to read data from an HBase (0.98.4-hadoop2) table within Spark (1.0.1). However, compiling with maven results in an error when I try to import org.apache.hadoop.hbase.mapreduce.TableInputFormat.
error: object mapreduce is not a member of package org.apache.hadoop.hbase
The code and relevant pom are below:
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.SparkContext
import java.util.Properties
import java.io.FileInputStream
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
object readDataFromHbase {
def main(args: Array[String]): Unit = {
var propFileName = "hbaseConfig.properties"
if(args.size > 0){
propFileName = args(0)
}
/** Load properties **/
val prop = new Properties
val inStream = new FileInputStream(propFileName)
prop.load(inStream)
//set spark context and open input file
val sparkMaster = prop.getProperty("hbase.spark.master")
val sparkJobName = prop.getProperty("hbase.spark.job.name")
val sc = new SparkContext(sparkMaster,sparkJobName )
//set hbase connection
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set("hbase.rootdir", prop.getProperty("hbase.rootdir"))
hbaseConf.set(TableInputFormat.INPUT_TABLE, prop.getProperty("hbase.table.name"))
val hBaseRDD = sc.newAPIHadoopRDD(hbaseConf, classOf[org.apache.hadoop.hbase.mapreduce.TableInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result]
)
val hBaseData = hBaseRDD.map(t=>t._2)
.map(res =>res.getColumnLatestCell("cf".getBytes(), "col".getBytes()))
.map(c=>c.getValueArray())
.map(a=> new String(a, "utf8"))
hBaseData.foreach(println)
}
}
The Hbase part of the pom file is (hbase.version = 0.98.4-hadoop2):
<!-- HBase -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-common</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-hadoop2-compat</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-hadoop-compat</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-hadoop-compat</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-protocol</artifactId>
<version>${hbase.version}</version>
</dependency>
I have cleaned the package with no luck. The main thing I need from the import is the classOf(TableInputFormat) to be used in setting the RDD. I suspect that I'm missing a dependency in my pom file but can't figure out which one. Any help would be greatly appreciated.
TableInputFormat is in the org.apache.hadoop.hbase.mapreduce package, which is part of the hbase-server artifact, so you will need to add that as a dependency, as @xgdgsc commented:
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>${hbase.version}</version>
</dependency>
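A quick way to check whether a given class is visible on the classpath the project actually resolves (for example from a REPL started with the project's dependencies) is sketched below. ClasspathCheck and isOnClasspath are hypothetical names; only Class.forName from the JDK is used.

```scala
object ClasspathCheck {
  // True if the named class can be located by this object's class loader.
  // initialize = false avoids running the class's static initializers.
  def isOnClasspath(className: String): Boolean =
    try {
      Class.forName(className, false, getClass.getClassLoader)
      true
    } catch {
      case _: ClassNotFoundException => false
    }

  def main(args: Array[String]): Unit = {
    // Prints false until the artifact that contains this class is on the classpath.
    println(isOnClasspath("org.apache.hadoop.hbase.mapreduce.TableInputFormat"))
  }
}
```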
In Spark 1.0 and above:
Put all your HBase jars into the spark/assembly/lib or spark/core/lib directory. Hopefully you have Docker to automate all this.
a) For CDH, the related HBase jars are usually under /usr/lib/hbase/*.jar, which are symlinks to the correct jars.
b) A good article to read: http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html