Unable to create a table locally: "Hive support is required" (Scala)

I get the error even after setting this configuration:
config("spark.sql.catalogImplementation","hive")
override def beforeAll(): Unit = {
  super[SharedSparkContext].beforeAll()
  SparkSessionProvider._sparkSession = SparkSession.builder()
    .master("local[*]")
    .config("spark.sql.catalogImplementation","hive")
    .getOrCreate()
}
Edit: this is how I am setting up my local database and tables for testing.
val stgDb = "test_stagingDB"
val stgTbl_exp ="test_stagingDB_expected"
val stgTbl_result="test_stg_table_result"
val trgtDb = "test_activeDB"
val trgtTbl_exp ="test_activeDB_expected"
val trgtTbl_result ="test_activeDB_results"
def setUpDb(): Unit = {
  println("Set up DB started")
  val localPath="file:/C:/Users/vmurthyms/Code-prdb/prdb/com.rxcorp.prdb"
  spark.sql(s"CREATE DATABASE IF NOT EXISTS test_stagingDB LOCATION '$localPath/test_stagingDB.db'")
  spark.sql(s"CREATE DATABASE IF NOT EXISTS test_activeDB LOCATION '$localPath/test_activeDB.db'")
  spark.sql(s"CREATE TABLE IF NOT EXISTS $trgtDb.${trgtTbl_exp}_ina (Id String, Name String)")
  println("Set up DB done")
}
setUpDb()
Running the spark.sql("CREATE TABLE ...") command produces the error below:
Hive support is required to CREATE Hive TABLE (AS SELECT);;
'CreateTable test_activeDB.test_activeDB_expected_ina, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Ignore
org.apache.spark.sql.AnalysisException: Hive support is required to CREATE Hive TABLE (AS SELECT);;
'CreateTable test_activeDB.test_activeDB_expected_ina, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Ignore
at org.apache.spark.sql.execution.datasources.HiveOnlyCheck$$anonfun$apply$12.apply(rules.scala:392)
at org.apache.spark.sql.execution.datasources.HiveOnlyCheck$$anonfun$apply$12.apply(rules.scala:390)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:117)
at org.apache.spark.sql.execution.datasources.HiveOnlyCheck$.apply(rules.scala:390)
at org.apache.spark.sql.execution.datasources.HiveOnlyCheck$.apply(rules.scala:388)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$2.apply(CheckAnalysis.scala:349)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$2.apply(CheckAnalysis.scala:349)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:349)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:92)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:641)
at com.rxcorp.prdb.exe.SitecoreAPIExtractTest$$anonfun$2.setUpDb$1(SitecoreAPIExtractTest.scala:127)
at com.rxcorp.prdb.exe.SitecoreAPIExtractTest$$anonfun$2.apply$mcV$sp(SitecoreAPIExtractTest.scala:130)

You are almost there, and the error message itself gives the clue: you need to call enableHiveSupport() when creating the Spark session. For example:
SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.catalogImplementation","hive")
  .enableHiveSupport()
  .getOrCreate()
Also, when using enableHiveSupport(), setting config("spark.sql.catalogImplementation","hive") looks redundant. I think you can safely remove that line.
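To illustrate the answer above, here is a minimal sketch of a local session with Hive support that then creates a table. It assumes the spark-hive module is on the test classpath (without it, enableHiveSupport() fails) and that no other session has already been created in the same JVM. The names HiveSupportSketch and demo_tbl are made up for this example.

```scala
import org.apache.spark.sql.SparkSession

object HiveSupportSketch {
  // Assumption: spark-hive is on the classpath; otherwise enableHiveSupport() throws.
  def build(): SparkSession =
    SparkSession.builder()
      .master("local[*]")
      .appName("hive-support-sketch") // hypothetical app name
      .enableHiveSupport()
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = build()

    // With Hive support enabled this prints "hive"; a plain session reports "in-memory".
    println(spark.conf.get("spark.sql.catalogImplementation"))

    // In Spark 2.x, CREATE TABLE without a USING clause is treated as a Hive table,
    // which is why it needs the Hive catalog.
    spark.sql("CREATE TABLE IF NOT EXISTS demo_tbl (Id String, Name String)")

    spark.stop()
  }
}
```

If a test does not actually need Hive, adding a USING clause to the statement (for example USING parquet) should create a data source table instead, which is not subject to the Hive-only check seen in the stack trace.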

Related

.csv not a SequenceFile error on Select Hive Query

I am quite a newbie to Spark and Scala ;)
Code summary :
Reading data from CSV files --> Creating A simple inner join on 2 Files --> Writing data to Hive table --> Submitting the job on the cluster
Can you please help to identify what went wrong.
The code is not really complex.
The job runs successfully on the cluster.
However, when I try to query the data written to the Hive table, I run into an issue:
hive> select * from Customers limit 10;
Failed with exception java.io.IOException:java.io.IOException: hdfs://m01.itversity.com:9000/user/itv000666/warehouse/updatedcustomers.db/customers/part-00000-348a54cf-aa0c-45b4-ac49-3a881ae39702_00000.c000.csv not a SequenceFile
object LapeyreSparkDemo extends App {
  //Getting spark ready
  val sparkConf = new SparkConf()
  sparkConf.set("spark.app.name","Spark for Lapeyre")
  //Creating Spark Session
  val spark = SparkSession.builder()
    .config(sparkConf)
    .enableHiveSupport()
    .config("spark.sql.warehouse.dir","/user/itv000666/warehouse")
    .getOrCreate()
  Logger.getLogger(getClass.getName).info("Spark Session Created Successfully")
  //Reading
  Logger.getLogger(getClass.getName).info("Data loading in DF started")
  val ordersSchema = "orderid Int, customerName String, orderDate String, custId Int, orderStatus String, age String, amount Int"
  val orders2019Df = spark.read
    .format("csv")
    .option("header",true)
    .schema(ordersSchema)
    .option("path","/user/itv0006666/lapeyrePoc/orders2019.csv")
    .load
  val newOrder = orders2019Df.withColumnRenamed("custId", "oldCustId")
    .withColumnRenamed("customername","oldCustomerName")
  val orders2020Df = spark.read
    .format("csv")
    .option("header",true)
    .schema(ordersSchema)
    .option("path","/user/itv000666/lapeyrePoc/orders2020.csv")
    .load
  Logger.getLogger(getClass.getName).info("Data loading in DF complete")
  //processing
  Logger.getLogger(getClass.getName).info("Processing Started")
  val joinCondition = newOrder.col("oldCustId") === orders2020Df.col("custId")
  val joinType = "inner"
  val joinData = newOrder.join(orders2020Df, joinCondition, joinType)
    .select("custId","customername")
  //Writing
  spark.sql("create database if not exists updatedCustomers")
  joinData.write
    .format("csv")
    .mode(SaveMode.Overwrite)
    .bucketBy(4, "custId")
    .sortBy("custId")
    .saveAsTable("updatedCustomers.Customers")
  //Stopping Spark Session
  spark.stop()
}
Please let me know if more information is required.
Thanks in advance.
This is the culprit:
joinData.write
  .format("csv")
Instead, I used this and it worked:
joinData.write
  .format("Hive")
Since I am writing data to a Hive table (ORC format), the format should be "Hive" and not "csv".
Also, do not forget to enable Hive support while creating the Spark session.
Also, in Spark 2, bucketBy and sortBy are not supported for this kind of write. Spark 3 may support them.
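The fix above can be sketched end to end as follows. This is a hedged example, not the poster's exact job: it assumes Hive support is available, uses two made-up rows in place of the real join result, and passes the "fileFormat" option to request ORC explicitly (to my knowledge the Hive format otherwise falls back to a text file layout). The object and method names are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object HiveWriteSketch {
  // Writes `df` as a Hive table; "fileFormat" selects the storage format (assumed ORC here).
  // bucketBy/sortBy are left out on purpose, as in the answer above.
  def writeAsHiveOrc(df: DataFrame, table: String): Unit =
    df.write
      .format("hive")
      .option("fileFormat", "orc")
      .mode(SaveMode.Overwrite)
      .saveAsTable(table)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("hive-write-sketch") // hypothetical app name
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // Stand-in data for the join result in the question.
    val joinData = Seq((1, "Ann"), (2, "Bob")).toDF("custId", "customername")

    spark.sql("create database if not exists updatedCustomers")
    writeAsHiveOrc(joinData, "updatedCustomers.Customers")

    spark.table("updatedCustomers.Customers").show()
    spark.stop()
  }
}
```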

Spark Scala Write dataframe to MongoDB

I am attempting to write my transformed data frame into MongoDB using this as a guide
https://docs.mongodb.com/spark-connector/master/scala/streaming/
So far, reading a data frame from MongoDB works perfectly, as shown below.
val mongoURI = "mongodb://000.000.000.000:27017"
val Conf = makeMongoURI(mongoURI,"blog","articles")
val readConfigintegra: ReadConfig = ReadConfig(Map("uri" -> Conf))
val sparkSess = SparkSession.builder()
  .master("local")
  .appName("MongoSparkConnectorIntro")
  .config("spark.mongodb.output.uri", "mongodb://000.000.000.000:27017/blog.vectors")
  .getOrCreate()
// Uses the ReadConfig
val df3 = sparkSess.sqlContext.loadFromMongoDB(ReadConfig(Map("uri" -> "mongodb://000.000.000.000:27017/blog.articles")))
However, writing this data frame to MongoDB seems to prove more difficult.
//reads data from mongo and does some transformations
val data = read_mongo()
data.show(20,false)
data.write.mode("append").mongo()
For the last line, I receive the following error.
Exception in thread "main" java.lang.IllegalArgumentException: Missing database name. Set via the 'spark.mongodb.output.uri' or 'spark.mongodb.output.database' property
This seems confusing to me as I set this within my spark Session in the code blocks above.
val sparkSess = SparkSession.builder()
  .master("local")
  .appName("MongoSparkConnectorIntro")
  .config("spark.mongodb.output.uri", "mongodb://000.000.000.000:27017/blog.vectors")
  .getOrCreate()
Can you spot anything I'm doing wrong?
My solution pretty much parallels how I read the data, but uses WriteConfig instead:
data.saveToMongoDB(WriteConfig(Map("uri" -> "mongodb://000.000.000.000:27017/blog.vectors")))
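An equivalent way to pass an explicit WriteConfig, sketched under the assumption that the 2.x MongoDB Spark connector (the one that exposes WriteConfig and MongoSpark) is on the classpath. The object and method names are made up for illustration; the URI is the placeholder from the question.

```scala
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.WriteConfig
import org.apache.spark.sql.DataFrame

object MongoWriteSketch {
  // Builds a WriteConfig whose database and collection come from the URI path ("blog.vectors").
  def writeConfigFor(uri: String): WriteConfig =
    WriteConfig(Map("uri" -> uri))

  // Appends `data` to the target collection without relying on session-level settings.
  def appendToMongo(data: DataFrame, uri: String): Unit =
    MongoSpark.save(data.write.mode("append"), writeConfigFor(uri))
}
```

Usage would look like MongoWriteSketch.appendToMongo(data, "mongodb://000.000.000.000:27017/blog.vectors"). Passing the configuration explicitly sidesteps the question of which SparkSession the data frame was created from.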

Write to a JDBC source in Scala

I am trying to run a classic SQL query from Scala to insert some data into a SQL Server database table.
The connection to my database works perfectly, and I can read data over JDBC from a recently created table called "textspark", which has a single column called "firstname": create table textspark(firstname varchar(10)).
However, when I try to write data into the table, I get the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: textspark
This is my code:
//Step 1: Check that the JDBC driver is available
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")
//Step 2: Create the JDBC URL
val jdbcHostname = "localhost"
val jdbcPort = 1433
val jdbcDatabase ="mydatabase"
val jdbcUsername = "mylogin"
val jdbcPassword = "mypwd"
// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase}"
// Create a Properties() object to hold the parameters.
import java.util.Properties
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
//Step 3: Check connectivity to the SQLServer database
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
connectionProperties.setProperty("Driver", driverClass)
//Read data from JDBC
val textspark_table = spark.read.jdbc(jdbcUrl, "textspark", connectionProperties)
textspark_table.show()
//the read operation works perfectly!!
//Write data to JDBC
import org.apache.spark.sql.SaveMode
spark.sql("insert into textspark values('test') ")
.write
.mode(SaveMode.Append) // <--- Append to the existing table
.jdbc(jdbcUrl, "textspark", connectionProperties)
//the write operation generates error!!
Can anyone please help me fix this error?
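You don't use an INSERT statement here. spark.sql looks up textspark in Spark's own catalog, where the JDBC table is not registered, hence "Table or view not found". Append mode is fine, but instead of inserting the data you should select or create it as a DataFrame and then write that. Try something like this: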
spark.sql("select 'text'")
  .write
  .mode(SaveMode.Append)
  .jdbc(jdbcUrl, "textspark", connectionProperties)
or
Seq("test").toDS
  .write
  .mode(SaveMode.Append)
  .jdbc(jdbcUrl, "textspark", connectionProperties)
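A variation on the snippets above, as a hedged sketch: the DataFrame column is named firstname to match the target table. As far as I know, recent Spark versions build the INSERT from the DataFrame's column names, so an unnamed column (which would be called value or text) may not line up with the table. The object and method names are hypothetical; jdbcUrl and the connection properties are the ones from the question.

```scala
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object JdbcAppendSketch {
  // Builds a one-column DataFrame whose column name matches the table's column.
  def toRows(spark: SparkSession, names: Seq[String]): DataFrame = {
    import spark.implicits._
    names.toDF("firstname")
  }

  // Appends the rows to the existing textspark table over JDBC.
  def appendFirstNames(spark: SparkSession, jdbcUrl: String,
                       props: Properties, names: Seq[String]): Unit =
    toRows(spark, names)
      .write
      .mode(SaveMode.Append)
      .jdbc(jdbcUrl, "textspark", props)
}
```

A call such as JdbcAppendSketch.appendFirstNames(spark, jdbcUrl, connectionProperties, Seq("test")) would then append one row.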

Not able to read data from AWS S3 (ORC) through local IntelliJ (Spark/Scala)

We are reading data from a Hive table on AWS through Spark/Scala from IntelliJ (which runs on a local machine). We can see the schema of the table, but we are not able to read the data.
Please find the flow below for a better understanding:
Intellij(spark/scala)------> hive:9083(remote)------> s3(orc)
Note: IntelliJ runs locally, while Hive and S3 are on AWS.
Please find the code below:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
//import org.apache.spark.sql.hive.HiveContext
object hiveconnect {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("SparkHiveExample")
      .config("hive.metastore.uris", "thrift://10.20.30.40:9083")
      .master("local[*]")
      .config("spark.sql.warehouse.dir", "s3://abc/test/main")
      .config("spark.driver.allowMultipleContexts", "true")
      .config("access-key","key")
      .config("secret-key","key")
      .enableHiveSupport()
      .getOrCreate()
    println("Start of SQL Session--------------------")
    spark.sql("show databases").show()
    spark.sql("select * from ace.visit limit 5").show()
  }
}
Error: Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
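The error message names two properties, fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey, whereas the code sets keys called access-key and secret-key. A minimal sketch of passing the named properties through the session builder follows. It relies on Spark's "spark.hadoop." prefix, which copies such entries into the Hadoop configuration. The object and method names are hypothetical, and whether the s3:// scheme works at all depends on the Hadoop libraries on the classpath, which this sketch does not address.

```scala
import org.apache.spark.sql.SparkSession

object S3CredentialSketch {
  // Property names are taken from the error message; the "spark.hadoop." prefix
  // asks Spark to forward them to the Hadoop configuration.
  def s3Settings(accessKey: String, secretKey: String): Map[String, String] = Map(
    "spark.hadoop.fs.s3.awsAccessKeyId" -> accessKey,
    "spark.hadoop.fs.s3.awsSecretAccessKey" -> secretKey
  )

  // Applies every setting to a builder shaped like the one in the question.
  def buildSession(accessKey: String, secretKey: String): SparkSession =
    s3Settings(accessKey, secretKey)
      .foldLeft(
        SparkSession.builder()
          .appName("SparkHiveExample")
          .master("local[*]")
          .enableHiveSupport()
      ) { case (builder, (key, value)) => builder.config(key, value) }
      .getOrCreate()
}
```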

Can we use multiple SparkSessions to access two different Hive servers?

I need to compare two tables, source and destination, that live on two separate remote Hive servers. Can we use two SparkSessions for this, something like what I tried below?
val spark = SparkSession.builder().master("local")
  .appName("spark remote")
  .config("javax.jdo.option.ConnectionURL", "jdbc:mysql://192.168.175.160:3306/metastore?useSSL=false")
  .config("javax.jdo.option.ConnectionUserName", "hiveroot")
  .config("javax.jdo.option.ConnectionPassword", "hivepassword")
  .config("hive.exec.scratchdir", "/tmp/hive/${user.name}")
  .config("hive.metastore.uris", "thrift://192.168.175.160:9083")
  .enableHiveSupport()
  .getOrCreate()
SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()
val sparkdestination = SparkSession.builder()
  .config("javax.jdo.option.ConnectionURL", "jdbc:mysql://192.168.175.42:3306/metastore?useSSL=false")
  .config("javax.jdo.option.ConnectionUserName", "hiveroot")
  .config("javax.jdo.option.ConnectionPassword", "hivepassword")
  .config("hive.exec.scratchdir", "/tmp/hive/${user.name}")
  .config("hive.metastore.uris", "thrift://192.168.175.42:9083")
  .enableHiveSupport()
  .getOrCreate()
I tried SparkSession.clearActiveSession() and SparkSession.clearDefaultSession(), but it isn't working; it throws the error below:
Hive: Failed to access metastore. This class should not accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
Is there any other way to access two Hive tables using multiple SparkSessions or SparkContexts?
Thanks
I use the following approach, and it works perfectly fine with Spark 2.1:
val sc = SparkSession.builder()
  .config("hive.metastore.uris", "thrift://dbsyz1111:10000")
  .enableHiveSupport()
  .getOrCreate()
// Create dataframe 1 by reading data from a Hive table of metastore 1
val dataframe_1 = sc.sql("select * from <SourceTableOfMetastore_1>")
// Reset the existing Spark sessions
SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()
// Initialize Spark session 2 with Hive metastore 2
val spc2 = SparkSession.builder()
  .config("hive.metastore.uris", "thrift://dbsyz2222:10004")
  .enableHiveSupport()
  .getOrCreate()
// Rebuild dataframe 1 in Spark session 2 from its RDD and schema
val dataframe_2 = spc2.createDataFrame(dataframe_1.rdd, dataframe_1.schema)
dataframe_2.write.mode("Append").saveAsTable("<targetTableNameOfMetastore_2>")
Look at the documentation of SparkSession's getOrCreate method, which states:
gets an existing [[SparkSession]] or, if there is no existing one,
creates a new one based on the options set in this builder.
This method first checks whether there is a valid thread-local
SparkSession, and if yes, return that one. It then checks whether
there is a valid global default SparkSession, and if yes, return
that one. If no valid global default SparkSession exists, the method
creates a new SparkSession and assigns the newly created
SparkSession as the global default.
In case an existing SparkSession is returned, the config options specified in this builder will be applied to the existing
SparkSession.
That is why it returns the first session along with its configuration.
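The behaviour described in that documentation can be observed with a small local sketch. It assumes a plain local Spark installation without Hive, and the object and app names are made up for the example.

```scala
import org.apache.spark.sql.SparkSession

object GetOrCreateSketch {
  def main(args: Array[String]): Unit = {
    val first = SparkSession.builder()
      .master("local[1]")
      .appName("first") // hypothetical app name
      .getOrCreate()

    // A second builder in the same thread finds the existing session and returns it
    // instead of creating a new one.
    val second = SparkSession.builder()
      .appName("second")
      .getOrCreate()
    assert(first eq second)

    // newSession() returns a distinct session object that still shares the SparkContext.
    val isolated = first.newSession()
    assert(isolated ne first)
    assert(isolated.sparkContext eq first.sparkContext)

    first.stop()
  }
}
```

Note that newSession() isolates SQL configuration and temporary views but shares the underlying SparkContext and shared state, so on its own it should not be expected to connect to a second metastore.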
Please go through the docs to find alternative ways to create a session.
I am working with a Spark version below 2, so I am not sure exactly how to create a new session without the configurations colliding.
However, the test suite SparkSessionBuilderSuite.scala is a useful reference if you want to work it out yourself.
An example method from that test suite:
test("use session from active thread session and propagate config options") {
  val defaultSession = SparkSession.builder().getOrCreate()
  val activeSession = defaultSession.newSession()
  SparkSession.setActiveSession(activeSession)
  val session = SparkSession.builder().config("spark-config2", "a").getOrCreate()
  assert(activeSession != defaultSession)
  assert(session == activeSession)
  assert(session.conf.get("spark-config2") == "a")
  assert(session.sessionState.conf == SQLConf.get)
  assert(SQLConf.get.getConfString("spark-config2") == "a")
  SparkSession.clearActiveSession()
  assert(SparkSession.builder().getOrCreate() == defaultSession)
  SparkSession.clearDefaultSession()
}