Exception while running hive support on Spark: Unable to instantiate SparkSession with Hive support because Hive classes are not found - scala

Hello, I am trying to use Hive with Spark, but when I try to execute the program it shows this error:
Exception in thread "main" java.lang.IllegalArgumentException: Unable to instantiate SparkSession with Hive support because Hive classes are not found.
This is my source code:
package com.spark.hiveconnect

import java.io.File

import org.apache.spark.sql.{Row, SaveMode, SparkSession}

object sourceToHIve {
  case class Record(key: Int, value: String)

  def main(args: Array[String]): Unit = {
    val warehouseLocation = new File("spark-warehouse").getAbsolutePath

    val spark = SparkSession
      .builder()
      .appName("Spark Hive Example")
      .config("spark.sql.warehouse.dir", warehouseLocation)
      .enableHiveSupport()
      .getOrCreate()

    import spark.implicits._
    import spark.sql

    sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
    sql("LOAD DATA LOCAL INPATH '/usr/local/spark3/examples/src/main/resources/kv1.txt' INTO TABLE src")
    sql("SELECT * FROM src").show()

    spark.close()
  }
}
This is my build.sbt file.
name := "SparkHive"
version := "0.1"
scalaVersion := "2.12.10"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.5"
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.5"
I also have Hive running in the console.
Can anyone help me with this?
Thank you.

Try adding:
libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.4.5"

The main problem is that the class "org.apache.hadoop.hive.conf.HiveConf" cannot be loaded.
You can insert the following code for testing:
Class.forName("org.apache.hadoop.hive.conf.HiveConf", true,
  Thread.currentThread().getContextClassLoader)
An error will occur on this line.
To be exact, the fundamental problem is that your build may not include the Hive-on-Spark dependency.
Check for the following dependency (Maven syntax):
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>2.4.3</version>
</dependency>
The class "org.apache.hadoop.hive.conf.HiveConf" is located in this dependency.

Related

Setup Scala and Apache Spark with SBT in Intellij

I am trying to run a Spark Scala project in IntelliJ IDEA on a Windows 10 machine.
My build.sbt:
name := "SbtIntellSpark1"
version := "0.1"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"
project/build.properties:
sbt.version = 1.0.3
Main.scala:
package example

import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, Logger}

object Main {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)

    val session = SparkSession
      .builder()
      .appName("StackOverflowSurvey")
      .master("local[1]")
      .getOrCreate()

    val df = session.read

    val responses = df
      .option("header", true)
      .option("inferSchema", true)
      .csv("2016-stack-overflow-survey-responses.csv")

    responses.printSchema()
  }
}
The code runs perfectly (the schema is properly printed) when I run the Main object directly (screenshot omitted).
My Run Configuration is as follows (screenshot omitted).
The problem is that when I use "Run the program", it shows a huge stack of errors which is too large to show here; please see this gist.
How can I solve this issue?

Why does spark-xml fail with NoSuchMethodError with Spark 2.0.0 dependency?

Hi, I am new to Scala and IntelliJ, and I am just trying to do this in Scala:
import org.apache.spark
import org.apache.spark.sql.SQLContext
import com.databricks.spark.xml.XmlReader

object SparkSample {
  def main(args: Array[String]): Unit = {
    val conf = new spark.SparkConf()
    conf.setAppName("Datasets Test")
    conf.setMaster("local[2]")
    val sc = new spark.SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val df = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "shop")
      .load("shops.xml") /* NoSuchMethod error here */

    val selectedData = df.select("author", "_id")
    df.show
  }
}
Basically I am trying to convert XML into a Spark DataFrame.
I am getting a NoSuchMethodError in '.load("shops.xml")'.
Below is the SBT file:
version := "0.1"
scalaVersion := "2.11.3"
val sparkVersion = "2.0.0"
val sparkXMLVersion = "0.3.3"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion exclude("jline", "2.12"),
  "org.apache.spark" %% "spark-sql" % sparkVersion excludeAll(ExclusionRule(organization = "jline"), ExclusionRule("name", "2.12")),
  "com.databricks" %% "spark-xml" % sparkXMLVersion
)
Below is the trace:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.types.DecimalType$.Unlimited()Lorg/apache/spark/sql/types/DecimalType;
at com.databricks.spark.xml.util.InferSchema$.<init>(InferSchema.scala:50)
at com.databricks.spark.xml.util.InferSchema$.<clinit>(InferSchema.scala)
at com.databricks.spark.xml.XmlRelation$$anonfun$1.apply(XmlRelation.scala:46)
at com.databricks.spark.xml.XmlRelation$$anonfun$1.apply(XmlRelation.scala:46)
at scala.Option.getOrElse(Option.scala:120)
at com.databricks.spark.xml.XmlRelation.<init>(XmlRelation.scala:45)
at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:66)
at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:44)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:315)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)
Can someone point out the error? It seems like a dependency issue to me.
spark-core seems to be working fine, but not spark-sql.
I had Scala 2.12 before but changed to 2.11 because spark-core was not resolved.
tl;dr I think it's a Spark version mismatch issue. Use spark-xml 0.4.1.
Quoting spark-xml's Requirements (highlighting mine):
This library requires Spark 2.0+ for 0.4.x.
For version that works with Spark 1.x, please check for branch-0.3.
That says to me that spark-xml 0.3.3 works with Spark 1.x, not the Spark 2.0.0 you requested.
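A sketch of the corrected dependency block, assuming you keep Spark 2.0.0 and Scala 2.11 (spark-xml 0.4.1 is published for Scala 2.11; the jline exclusions from the original build.sbt are omitted for brevity):

val sparkVersion = "2.0.0"
val sparkXMLVersion = "0.4.1"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  // spark-xml 0.4.x targets Spark 2.0+, which matches the Spark version above
  "com.databricks" %% "spark-xml" % sparkXMLVersion
)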

java.lang.NullPointerException while reading data from MSSQL server with spark

I am having issues reading data from an MSSQL server using Cloudera Spark. I am not sure where the problem is or what is causing it.
Here is my build.sbt
val sparkversion = "1.6.0-cdh5.10.1"
name := "SimpleSpark"
organization := "com.huff.spark"
version := "1.0"
scalaVersion := "2.10.5"
mainClass in Compile := Some("com.huff.spark.example.SimpleSpark")
assemblyJarName in assembly := "mssql.jar"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming-kafka" % "1.6.0" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.6.0" % "provided",
  "org.apache.spark" % "spark-core_2.10" % sparkversion % "provided", // to test in cluster
  "org.apache.spark" % "spark-sql_2.10" % sparkversion % "provided" // to test in cluster
)
resolvers += "Confluent IO" at "http://packages.confluent.io/maven"
resolvers += "Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos"
And here is my Scala source:
package com.huff.spark.example

import org.apache.spark.sql._
import java.sql.{Connection, DriverManager}
import java.util.Properties
import org.apache.spark.{SparkContext, SparkConf}

object SimpleSpark {
  def main(args: Array[String]): Unit = {
    val sourceProp = new java.util.Properties
    val conf = new SparkConf().setAppName("SimpleSpark").setMaster("yarn-cluster") // to test in cluster
    val sc = new SparkContext(conf)
    var SqlContext = new SQLContext(sc)
    val driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    val jdbcDF = SqlContext.read.format("jdbc").options(Map("url" -> "jdbc:sqlserver://sqltestsrver;databaseName=LEh;user=sparkaetl;password=sparkaetl","driver" -> driver,"dbtable" -> "StgS")).load()
    jdbcDF.show(5)
  }
}
And this is the error I see:
17/05/24 04:35:20 ERROR ApplicationMaster: User class threw exception: java.lang.NullPointerException
java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:155)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:222)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:146)
at com.huff.spark.example.SimpleSpark$.main(SimpleSpark.scala:16)
at com.huff.spark.example.SimpleSpark.main(SimpleSpark.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:552)
17/05/24 04:35:20 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.lang.NullPointerException)
I know the problem is in line 16, which is:
val jdbcDF = SqlContext.read.format("jdbc").options(Map("url" -> "jdbc:sqlserver://sqltestsrver;databaseName=LEh;user=sparkaetl;password=sparkaetl","driver" -> driver,"dbtable" -> "StgS")).load()
But I can't pinpoint what exactly the problem is. Is it something to do with access (which is doubtful), a problem with the connection parameters (the error message would say so), or something else I am not aware of? Thanks in advance :-)
If you are using Azure SQL Server, copy the JDBC connection string from the Azure portal. I tried it and it worked for me.
Azure Databricks using Scala mode:
import com.microsoft.sqlserver.jdbc.SQLServerDriver
import java.sql.DriverManager
import org.apache.spark.sql.SQLContext
import sqlContext.implicits._
// MS SQL JDBC Connection String ...
val jdbcSqlConn = "jdbc:sqlserver://***.database.windows.net:1433;database=**;user=***;password=****;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
// Loading the ms sql table via spark context into dataframe
val jdbcDF = sqlContext.read.format("jdbc").options(
  Map("url" -> jdbcSqlConn,
    "driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "dbtable" -> "***")).load()
// Registering the temp table so that we can SQL like query against the table
jdbcDF.registerTempTable("yourtablename")
// selecting only top 10 rows here but you can use any sql statement
val yourdata = sqlContext.sql("SELECT * FROM yourtablename LIMIT 10")
// display the data
yourdata.show()
The NPE occurs when Spark tries to close the database Connection, which indicates that it could not obtain a proper connection via JdbcUtils.createConnectionFactory. You should check your connection URL and the logs for failures.
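As a quick way to rule Spark out, here is a minimal sketch that tests the same URL with plain JDBC (assuming the Microsoft SQL Server JDBC driver jar is on the classpath; the host and credentials below are the placeholders from the question):

import java.sql.DriverManager

object JdbcCheck {
  def main(args: Array[String]): Unit = {
    // Load the driver explicitly, then open a connection with the exact URL used in Spark
    Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")
    val url = "jdbc:sqlserver://sqltestsrver;databaseName=LEh;user=sparkaetl;password=sparkaetl"
    val conn = DriverManager.getConnection(url)
    try {
      println("Connected: " + !conn.isClosed)
    } finally {
      conn.close()
    }
  }
}

If this fails with the same URL, the connection string (or network access to the server) is the real problem, and the Spark NPE above is just masking it.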

Cannot make Spark run inside a scala worksheet in Intellij Idea

The following code runs with no problems if I put it inside an object which extends the App trait and run it using IDEA's Run command.
However, when I try running it from a worksheet, I encounter one of these scenarios:
1- If the first line is present, I get:
Task not serializable: java.io.NotSerializableException:A$A34$A$A34
2- If the first line is commented out, I get:
Unable to generate an encoder for inner class A$A35$A$A35$A12 without
access to the scope that this class was defined in.
//First line!
org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this)

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

case class AClass(id: Int, f1: Int, f2: Int)

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("Test App")
  .getOrCreate()

import spark.implicits._

val schema = StructType(Array(
  StructField("id", IntegerType),
  StructField("f1", IntegerType),
  StructField("f2", IntegerType)))

val df = spark.read.schema(schema)
  .option("header", "true")
  .csv("dataset.csv")

// Displays the content of the DataFrame to stdout
df.show()

val ads = df.as[AClass]

//This is the line that causes serialization error
ads.foreach(x => println(x))
The project has been created using Idea's Scala plugin, and this is my build.sbt:
...
scalaVersion := "2.10.6"
scalacOptions += "-unchecked"
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "2.1.0",
  "org.apache.spark" % "spark-sql_2.10" % "2.1.0",
  "org.apache.spark" % "spark-mllib_2.10" % "2.1.0"
)
I tried the solution in this answer, but it is not working for IDEA Ultimate 2017.1, which I am using. Also, when I use worksheets, I prefer not to add an extra object to the worksheet if at all possible.
If I use the collect() method on the dataset object and get an Array of AClass instances, there are no errors either. It is working with the Dataset directly that causes the error.
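For reference, a minimal sketch of that collect()-based workaround (using the same ads value as in the snippet above): collect() returns a local Array[AClass] on the driver, so the foreach closure never has to be serialized by the worksheet's wrapper classes.

// Pull the rows to the driver first, then iterate locally
val localRows: Array[AClass] = ads.collect()
localRows.foreach(println)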
Use Eclipse compatibility mode (open Preferences -> type "scala" -> in Languages & Frameworks, choose Scala -> choose Worksheet -> select only "eclipse compatibility mode"). See https://gist.github.com/RAbraham/585939e5390d46a7d6f8

Object streaming is not a member of package org.apache.spark

I'm trying to compile a simple Scala program using StreamingContext. Here is a snippet of my code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.SparkListener
import org.apache.spark.scheduler.SparkListenerStageCompleted
import org.apache.spark.streaming.StreamingContext._ // error: object streaming is not a member of package org.apache.spark

object FileCount {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("File Count")
      .setMaster("local")
    val sc = new SparkContext(conf)

    val textFile = sc.textFile(args(0))
    val ssc = new StreamingContext(sc, Seconds(10)) // error: not found: type StreamingContext

    sc.stop()
  }
}
I have these two errors:
object streaming is not a member of package org.apache.spark
and
not found: type StreamingContext
Any help please!
If you are using sbt, add the following library dependencies:
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.1.0" % "provided"
If you are using Maven, add the following to pom.xml:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.1.0</version>
    <scope>provided</scope>
</dependency>
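Once spark-streaming is on the classpath, the import in the question also needs the streaming types themselves; here is a minimal sketch of a corrected version (assuming Spark 2.1.0 as in the dependency above, and "local[2]" because a local streaming job needs at least two threads):

import org.apache.spark.{SparkConf, SparkContext}
// StreamingContext and Seconds live in org.apache.spark.streaming;
// StreamingContext._ alone only imports the companion object's members, not the type
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("File Count").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(10))
    // ... define input streams and transformations on ssc here ...
    ssc.stop(stopSparkContext = true)
  }
}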
You'll need to add the spark-streaming dependency to your build tool.
You need to add the dependency that corresponds to your import statement. Hopefully you have already added the spark-streaming dependencies; besides that, this dependency is also needed.
Here are the dependencies based on your dependency management tool.
For Maven: add the following to pom.xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.11</artifactId>
    <version>2.1.0</version>
</dependency>
For SBT: add the following to build.sbt
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.1.0" % "provided"
For Gradle:
provided group: 'org.apache.spark', name: 'spark-mllib_2.11', version: '2.1.0'
TIP: use grepcode.com to find the appropriate dependency by searching for your import statement. It is a nice site!
NOTE: dependency versions change and get updated over time.
I added the missing dependencies, and after that it worked for me. They are:
"org.apache.spark" %% "spark-mllib" % SparkVersion,
"org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.1"