Not able to read data from AWS S3 (ORC) through local IntelliJ (Spark/Scala) - scala

We are reading data (a table) from AWS (Hive) through Spark/Scala using IntelliJ (which is on a local machine). We can see the schema of the table, but we are not able to read the data.
Please find the flow below for a better understanding:
Intellij(spark/scala)------> hive:9083(remote)------> s3(orc)
Note: IntelliJ is on the local machine, while Hive and S3 are on AWS.
Please find the code below:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
//import org.apache.spark.sql.hive.HiveContext

object hiveconnect {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("SparkHiveExample")
      .config("hive.metastore.uris", "thrift://10.20.30.40:9083")
      .master("local[*]")
      .config("spark.sql.warehouse.dir", "s3://abc/test/main")
      .config("spark.driver.allowMultipleContexts", "true")
      .config("access-key", "key")
      .config("secret-key", "key")
      .enableHiveSupport()
      .getOrCreate()
    println("Start of SQL Session--------------------")
    spark.sql("show databases").show()
    spark.sql("select * from ace.visit limit 5").show()
  }
}
Error: Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
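The error indicates the S3 filesystem never sees the credentials: "access-key"/"secret-key" are not configuration keys Hadoop recognizes. A minimal sketch of one way to pass them, using the standard Hadoop S3A property names (credentials are placeholders, the table name is taken from the question, and this assumes the hadoop-aws connector is on the classpath):

import org.apache.spark.sql.SparkSession

object hiveconnect {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("SparkHiveExample")
      .master("local[*]")
      .config("hive.metastore.uris", "thrift://10.20.30.40:9083")
      .enableHiveSupport()
      .getOrCreate()

    // Put the AWS credentials on the Hadoop configuration that Spark uses for file systems.
    // fs.s3a.* keys apply when the table location is read through the s3a:// connector;
    // fs.s3.awsAccessKeyId / fs.s3.awsSecretAccessKey are the keys named in the error for s3:// URLs.
    val hadoopConf = spark.sparkContext.hadoopConfiguration
    hadoopConf.set("fs.s3a.access.key", "<access-key>")
    hadoopConf.set("fs.s3a.secret.key", "<secret-key>")
    hadoopConf.set("fs.s3.awsAccessKeyId", "<access-key>")
    hadoopConf.set("fs.s3.awsSecretAccessKey", "<secret-key>")

    spark.sql("select * from ace.visit limit 5").show()
  }
}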

Related

Fail to saveToMongoDB: java.lang.ClassNotFoundException: com.mongodb.hadoop.io.BSONWritable

I want to convert data from a DataFrame to an RDD and save it to MongoDB. Here is my code:
import pymongo
import pymongo_spark
from pyspark import SparkConf, SparkContext
from pyspark import BasicProfiler
from pyspark.sql import SparkSession

class MyCustomProfiler(BasicProfiler):
    def show(self, id):
        print("My custom profiles for RDD:%s" % id)

conf = SparkConf().set("spark.python.profile", "true")

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Word Count") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# Important: activate pymongo_spark.
pymongo_spark.activate()

on_time_dataframe = spark.read.parquet(r'\data\on_time_performance.parquet')
on_time_dataframe.show()

# Note we have to convert the row to a dict to avoid https://jira.mongodb.org/browse/HADOOP-276
as_dict = on_time_dataframe.rdd.map(lambda row: row.asDict())
as_dict.saveToMongoDB('mongodb://localhost:27017/agile_data_science.on_time_performance')
Some errors occur:
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.saveAsNewAPIHadoopFile.
: java.lang.ClassNotFoundException: com.mongodb.hadoop.io.BSONWritable
I have installed the mongo-hadoop package; it seems I don't have the BSONWritable class. I'm not good at Java, so I'd like someone to help me.

Reading file from Azure Data Lake Storage V2 with Spark 2.4

I am trying to read a simple CSV file from Azure Data Lake Storage V2 with Spark 2.4 in my IntelliJ IDE on Mac.
Code below:
package com.example

import org.apache.spark.SparkConf
import org.apache.spark.sql._

object Test extends App {
  val appName: String = "DataExtract"
  val master: String = "local[*]"

  val sparkConf: SparkConf = new SparkConf()
    .setAppName(appName)
    .setMaster(master)
    .set("spark.scheduler.mode", "FAIR")
    .set("spark.sql.session.timeZone", "UTC")
    .set("spark.sql.shuffle.partitions", "32")
    .set("fs.defaultFS", "abfs://development@xyz.dfs.core.windows.net/")
    .set("fs.azure.account.key.xyz.dfs.core.windows.net", "~~key~~")

  val spark: SparkSession = SparkSession
    .builder()
    .config(sparkConf)
    .getOrCreate()

  spark.time(run(spark))

  def run(spark: SparkSession): Unit = {
    val df = spark.read.csv("abfs://development@xyz.dfs.core.windows.net/development/sales.csv")
    df.show(10)
  }
}
It's not able to read the file, and it throws the exception below:
Exception in thread "main" java.lang.NullPointerException
at org.wildfly.openssl.CipherSuiteConverter.toJava(CipherSuiteConverter.java:284)
at org.wildfly.openssl.OpenSSLEngine.toJavaCipherSuite(OpenSSLEngine.java:1094)
at org.wildfly.openssl.OpenSSLEngine.getEnabledCipherSuites(OpenSSLEngine.java:729)
at org.wildfly.openssl.OpenSSLContextSPI.getCiphers(OpenSSLContextSPI.java:333)
at org.wildfly.openssl.OpenSSLContextSPI$1.getSupportedCipherSuites(OpenSSLContextSPI.java:365)
at org.apache.hadoop.fs.azurebfs.utils.SSLSocketFactoryEx.<init>(SSLSocketFactoryEx.java:105)
at org.apache.hadoop.fs.azurebfs.utils.SSLSocketFactoryEx.initializeDefaultFactory(SSLSocketFactoryEx.java:72)
at org.apache.hadoop.fs.azurebfs.services.AbfsClient.<init>(AbfsClient.java:79)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:817)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:149)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:108)
Can anyone help me, what is the mistake?
As per my research, you will receive this error message when you have a JAR that is incompatible with your Hadoop version.
I would request you to kindly go through the issues below:
http://mail-archives.apache.org/mod_mbox/spark-issues/201907.mbox/%3CJIRA.13243325.1562321895000.591499.1562323440292#Atlassian.JIRA%3E
https://issues.apache.org/jira/browse/HADOOP-16410
Had the same issue; resolved it by adding wildfly-openssl version 1.0.7, as the docs shared by @cheekatlapradeep-msft mentioned:
<dependency>
    <groupId>org.wildfly.openssl</groupId>
    <artifactId>wildfly-openssl</artifactId>
    <version>1.0.7.Final</version>
</dependency>
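If the project is built with sbt instead of Maven, the equivalent dependency would be (a sketch, assuming the same 1.0.7.Final version fits your Hadoop/ABFS setup):

// build.sbt
libraryDependencies += "org.wildfly.openssl" % "wildfly-openssl" % "1.0.7.Final"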

How to load files in Spark SQL through remote Hive storage (S3, ORC) using Spark/Scala + code + configuration

IntelliJ (Spark) ---> Hive (remote) ---> storage on S3 (ORC format)
Not able to read a remote Hive table through Spark/Scala.
I was able to read the table schema, but not the table data.
Error -Exception in thread "main" java.lang.IllegalArgumentException:
AWS Access Key ID and Secret Access Key must be specified as the
username or password (respectively) of a s3 URL, or by setting the
fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties
(respectively).
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql.types.StructType

object mainclas {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("hivetable")
      .config("hive.metastore.uris", "thrift://10.20.30.40:9083")
      .config("access-key", "PQHFFDEGGDDVDVV")
      .config("secret-key", "FFGSGHhjhhhdjhJHJHHJGJHGjHH")
      .config("format", "orc")
      .enableHiveSupport()
      .getOrCreate()

    val res = spark.sqlContext.sql("show tables").show()
    val res1 = spark.sql("select * from ace.visit limit 5").show()
  }
}
Try this:
val spark = SparkSession.builder
  .master("local[*]")
  .appName("hivetable")
  .config("hive.metastore.uris", "thrift://10.20.30.40:9083")
  .config("fs.s3n.awsAccessKeyId", "PQHFFDEGGDDVDVV")
  .config("fs.s3n.awsSecretAccessKey", "FFGSGHhjhhhdjhJHJHHJGJHGjHH")
  .config("format", "orc")
  .enableHiveSupport()
  .getOrCreate()
You need to prefix all the fs. options with spark.hadoop if you are setting them in the Spark config. And as noted: use s3a over s3n if you can.
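Putting that advice together, the credential lines from the snippet above would look roughly like this (a sketch using the question's placeholder keys; it assumes the hadoop-aws jar is on the classpath and the table location is reachable through s3a://):

val spark = SparkSession.builder
  .master("local[*]")
  .appName("hivetable")
  .config("hive.metastore.uris", "thrift://10.20.30.40:9083")
  // spark.hadoop.* entries are copied into the Hadoop configuration Spark uses for file systems
  .config("spark.hadoop.fs.s3a.access.key", "PQHFFDEGGDDVDVV")
  .config("spark.hadoop.fs.s3a.secret.key", "FFGSGHhjhhhdjhJHJHHJGJHGjHH")
  .enableHiveSupport()
  .getOrCreate()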

Can we use multiple SparkSessions to access two different Hive servers?

I have a scenario where I need to compare two different tables, source and destination, from two separate remote Hive servers. Can we use two SparkSessions, something like what I tried below?
val spark = SparkSession.builder().master("local")
  .appName("spark remote")
  .config("javax.jdo.option.ConnectionURL", "jdbc:mysql://192.168.175.160:3306/metastore?useSSL=false")
  .config("javax.jdo.option.ConnectionUserName", "hiveroot")
  .config("javax.jdo.option.ConnectionPassword", "hivepassword")
  .config("hive.exec.scratchdir", "/tmp/hive/${user.name}")
  .config("hive.metastore.uris", "thrift://192.168.175.160:9083")
  .enableHiveSupport()
  .getOrCreate()

SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()

val sparkdestination = SparkSession.builder()
  .config("javax.jdo.option.ConnectionURL", "jdbc:mysql://192.168.175.42:3306/metastore?useSSL=false")
  .config("javax.jdo.option.ConnectionUserName", "hiveroot")
  .config("javax.jdo.option.ConnectionPassword", "hivepassword")
  .config("hive.exec.scratchdir", "/tmp/hive/${user.name}")
  .config("hive.metastore.uris", "thrift://192.168.175.42:9083")
  .enableHiveSupport()
  .getOrCreate()
I tried with SparkSession.clearActiveSession() and SparkSession.clearDefaultSession() but it isn't working, throwing the error below:
Hive: Failed to access metastore. This class should not accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
Is there any other way we can access two Hive tables using multiple SparkSessions or SparkContexts?
Thanks
I use this approach and it works perfectly fine with Spark 2.1:
val sc = SparkSession.builder()
  .config("hive.metastore.uris", "thrift://dbsyz1111:10000")
  .enableHiveSupport()
  .getOrCreate()

// Create dataframe 1 by reading the data from a Hive table of metastore 1
val dataframe_1 = sc.sql("select * from <SourcetbaleofMetaStore_1>")

// Reset the existing Spark sessions
SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()

// Initialize Spark session 2 with Hive metastore 2
val spc2 = SparkSession.builder()
  .config("hive.metastore.uris", "thrift://dbsyz2222:10004")
  .enableHiveSupport()
  .getOrCreate()

// Load dataframe 1 of Spark session 1 into a new dataframe of Spark session 2,
// by taking its schema and data through the RDD API
val dataframe_2 = spc2.createDataFrame(dataframe_1.rdd, dataframe_1.schema)
dataframe_2.write.mode("Append").saveAsTable(<targettableNameofMetastore_2>)
Look at the SparkSession getOrCreate method, which states that it:
gets an existing [[SparkSession]] or, if there is no existing one,
creates a new one based on the options set in this builder.
This method first checks whether there is a valid thread-local
SparkSession, and if yes, return that one. It then checks whether
there is a valid global default SparkSession, and if yes, return
that one. If no valid global default SparkSession exists, the method
creates a new SparkSession and assigns the newly created
SparkSession as the global default.
In case an existing SparkSession is returned, the config options specified in this builder will be applied to the existing
SparkSession.
That's the reason it's returning the first session and its configurations.
Please go through the docs to find out alternative ways to create a session.
I'm working on a Spark version < 2, so I am not sure exactly how to create a new session without a collision of configuration.
But here is a useful test case, i.e. SparkSessionBuilderSuite.scala, to do that -
DIY.
Example method in that test case:
test("use session from active thread session and propagate config options") {
val defaultSession = SparkSession.builder().getOrCreate()
val activeSession = defaultSession.newSession()
SparkSession.setActiveSession(activeSession)
val session = SparkSession.builder().config("spark-config2", "a").getOrCreate()
assert(activeSession != defaultSession)
assert(session == activeSession)
assert(session.conf.get("spark-config2") == "a")
assert(session.sessionState.conf == SQLConf.get)
assert(SQLConf.get.getConfString("spark-config2") == "a")
SparkSession.clearActiveSession()
assert(SparkSession.builder().getOrCreate() == defaultSession)
SparkSession.clearDefaultSession()
}
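As a minimal standalone sketch of the behaviour quoted above (the config key is made up purely for illustration): a second getOrCreate() without clearing the sessions returns the same SparkSession and merges the new options into it, which is why a second metastore URI set this way never takes effect.

import org.apache.spark.sql.SparkSession

object GetOrCreateBehaviour {
  def main(args: Array[String]): Unit = {
    val first = SparkSession.builder()
      .master("local[*]")
      .appName("first")
      .getOrCreate()

    // No clearActiveSession()/clearDefaultSession() in between, so this does NOT build a new session:
    // getOrCreate() returns the existing default session and applies the new option to it.
    val second = SparkSession.builder()
      .config("some.illustrative.option", "value") // hypothetical key, for illustration only
      .getOrCreate()

    assert(first eq second)
    assert(second.conf.get("some.illustrative.option") == "value")

    first.stop()
  }
}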

How to query data stored in Hive table using SparkSession of Spark2?

I am trying to query data stored in a Hive table from Spark 2. Environment:
1. cloudera-quickstart-vm-5.7.0-0-vmware
2. Eclipse with Scala 2.11.8 plugin
3. Spark2 and Maven under
I did not change the Spark default configuration. Do I need to configure anything in Spark or Hive?
Code:
import org.apache.spark._
import org.apache.spark.sql.SparkSession

object hiveTest {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("HiveSQL")
      .enableHiveSupport()
      .getOrCreate()

    val data = sparkSession2.sql("select * from test.mark")
  }
}
Getting the error:
16/08/29 00:18:10 INFO SparkSqlParser: Parsing command: select * from test.mark
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:48)
at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:47)
at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:54)
at org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:54)
at org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
at org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
at org.apache.spark.sql.hive.HiveSessionState$$anon$1.<init>(HiveSessionState.scala:63)
at org.apache.spark.sql.hive.HiveSessionState.analyzer$lzycompute(HiveSessionState.scala:63)
at org.apache.spark.sql.hive.HiveSessionState.analyzer(HiveSessionState.scala:62)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
at hiveTest$.main(hiveTest.scala:34)
at hiveTest.main(hiveTest.scala)
Caused by: java.lang.IllegalArgumentException: requirement failed: Duplicate SQLConfigEntry. spark.sql.hive.convertCTAS has been registered
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.sql.internal.SQLConf$.org$apache$spark$sql$internal$SQLConf$$register(SQLConf.scala:44)
at org.apache.spark.sql.internal.SQLConf$SQLConfigBuilder$$anonfun$apply$1.apply(SQLConf.scala:51)
at org.apache.spark.sql.internal.SQLConf$SQLConfigBuilder$$anonfun$apply$1.apply(SQLConf.scala:51)
at org.apache.spark.internal.config.TypedConfigBuilder$$anonfun$createWithDefault$1.apply(ConfigBuilder.scala:122)
at org.apache.spark.internal.config.TypedConfigBuilder$$anonfun$createWithDefault$1.apply(ConfigBuilder.scala:122)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.internal.config.TypedConfigBuilder.createWithDefault(ConfigBuilder.scala:122)
at org.apache.spark.sql.hive.HiveUtils$.<init>(HiveUtils.scala:103)
at org.apache.spark.sql.hive.HiveUtils$.<clinit>(HiveUtils.scala)
... 14 more
Any suggestion is appreciated
Thanks
Robin
This is what I am using:
import org.apache.spark.sql.SparkSession

object LoadCortexDataLake extends App {
  val spark = SparkSession.builder().appName("Cortex-Batch").enableHiveSupport().getOrCreate()

  spark.read.parquet(file).createOrReplaceTempView("temp")
  spark.sql(s"insert overwrite table $table_nm partition(year='$yr',month='$mth',day='$dt') select * from temp")
}
I think you should use 'sparkSession.sql' instead of 'sparkSession2.sql'
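In other words, the builder in the question assigns the session to sparkSession, so the query has to go through that same variable (a minimal corrected sketch of the question's code):

val sparkSession = SparkSession.builder
  .master("local")
  .appName("HiveSQL")
  .enableHiveSupport()
  .getOrCreate()

// The variable used for the query must match the one the session was assigned to.
val data = sparkSession.sql("select * from test.mark")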
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

val spark = SparkSession
  .builder()
  .appName("Connect to Hive")
  .config("hive.metastore.uris", "thrift://cdh-hadoop-master:Port")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.sql("SELECT * FROM table_name")