How to load files in SparkSQL through remote Hive storage (S3, ORC) using Spark/Scala + code + configuration - scala

IntelliJ (Spark) ---> Hive (remote) ---> storage on S3 (ORC format)
Not able to read a remote Hive table through Spark/Scala.
I was able to read the table schema, but not able to read the table data.
Error -Exception in thread "main" java.lang.IllegalArgumentException:
AWS Access Key ID and Secret Access Key must be specified as the
username or password (respectively) of a s3 URL, or by setting the
fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties
(respectively).
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql.types.StructType

object mainclas {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("hivetable")
      .config("hive.metastore.uris", "thrift://10.20.30.40:9083")
      .config("access-key", "PQHFFDEGGDDVDVV")
      .config("secret-key", "FFGSGHhjhhhdjhJHJHHJGJHGjHH")
      .config("format", "orc")
      .enableHiveSupport()
      .getOrCreate()

    val res = spark.sqlContext.sql("show tables").show()
    val res1 = spark.sql("select * from ace.visit limit 5").show()
  }
}

Try this:
val spark = SparkSession.builder
  .master("local[*]")
  .appName("hivetable")
  .config("hive.metastore.uris", "thrift://10.20.30.40:9083")
  .config("fs.s3n.awsAccessKeyId", "PQHFFDEGGDDVDVV")
  .config("fs.s3n.awsSecretAccessKey", "FFGSGHhjhhhdjhJHJHHJGJHGjHH")
  .config("format", "orc")
  .enableHiveSupport()
  .getOrCreate()

You need to prefix all of the fs. options with spark.hadoop. if you are setting them in the Spark config. And as noted: use s3a over s3n if you can.
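Putting both points together, the builder would look roughly like this (a sketch, not tested against the asker's cluster; the metastore URI and placeholder keys are copied from the question, and it assumes the hadoop-aws and AWS SDK jars needed by s3a are on the classpath):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: the spark.hadoop. prefix forwards fs.* options into the
// Hadoop Configuration that the s3a filesystem actually reads.
val spark = SparkSession.builder
  .master("local[*]")
  .appName("hivetable")
  .config("hive.metastore.uris", "thrift://10.20.30.40:9083")
  .config("spark.hadoop.fs.s3a.access.key", "PQHFFDEGGDDVDVV")
  .config("spark.hadoop.fs.s3a.secret.key", "FFGSGHhjhhhdjhJHJHHJGJHGjHH")
  .enableHiveSupport()
  .getOrCreate()
```

If the table location registered in the metastore still uses s3:// or s3n:// URLs, the matching fs.s3.* or fs.s3n.* credential keys (with the same spark.hadoop. prefix) have to be set instead.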

Related

Fail to saveToMongoDB: java.lang.ClassNotFoundException: com.mongodb.hadoop.io.BSONWritable

I want to convert data from a DataFrame to an RDD and save it to MongoDB. Here is my code:
import pymongo
import pymongo_spark
from pyspark import SparkConf, SparkContext
from pyspark import BasicProfiler
from pyspark.sql import SparkSession

class MyCustomProfiler(BasicProfiler):
    def show(self, id):
        print("My custom profiles for RDD:%s" % id)

conf = SparkConf().set("spark.python.profile", "true")
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Word Count") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# Important: activate pymongo_spark.
pymongo_spark.activate()

on_time_dataframe = spark.read.parquet(r'\data\on_time_performance.parquet')
on_time_dataframe.show()

# Note we have to convert the row to a dict to avoid https://jira.mongodb.org/browse/HADOOP-276
as_dict = on_time_dataframe.rdd.map(lambda row: row.asDict())
as_dict.saveToMongoDB('mongodb://localhost:27017/agile_data_science.on_time_performance')
Some errors occur:
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.saveAsNewAPIHadoopFile.
: java.lang.ClassNotFoundException: com.mongodb.hadoop.io.BSONWritable
I have installed the mongo-hadoop package, but it seems the BSONWritable class cannot be found. I'm not good at Java, so I would appreciate some help.

How to fix 22: error: not found: value SparkSession in Scala?

I am new to Spark and I would like to read a CSV file into a DataFrame.
Spark 1.3.0 / Scala 2.3.0
This is what I have so far:
# Start spark-shell with the CSV package module
spark-shell --packages com.databricks:spark-csv_2.10:1.3.0

// Import Spark classes
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext

// Create SparkConf
val conf = new SparkConf().setAppName("local").setMaster("master")
val sc = new SparkContext(conf)

// Create SQLContext and import its implicits
val sqlCtx = new SQLContext(sc)
import sqlCtx._

// Create SparkSession and use it for all purposes:
val session = SparkSession.builder().appName("local").master("master").getOrCreate()

// Read the CSV file and turn it into a DataFrame
val df_fc = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("/home/Desktop/test.csv")
However, at SparkSession.builder() it gives the following error:
error: not found: value SparkSession
               ^
How can I fix this error?
SparkSession is only available in Spark 2.x. There is no need to create a SparkContext in Spark 2; SparkSession itself provides the gateway to everything.
Try the below, as you are using version 1.x:
val df_fc = sqlCtx.read.format("com.databricks.spark.csv").option("header", "true").load("/home/Desktop/test.csv")
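For comparison, once you do move to Spark 2.x the same read needs no external package (a sketch; the path is the one from the question):

```scala
import org.apache.spark.sql.SparkSession

// Spark 2.x sketch: SparkSession replaces SQLContext, and CSV support is built in.
val spark = SparkSession.builder()
  .appName("local")
  .master("local[*]")
  .getOrCreate()

val df_fc = spark.read
  .option("header", "true")
  .csv("/home/Desktop/test.csv")
```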

Not able to read data from AWS S3 (ORC) through IntelliJ local (Spark/Scala)

We are reading data/tables from AWS (Hive) through Spark/Scala using IntelliJ (which is on a local machine). We are able to see the schema of the table, but not able to read the data.
Please find the flow below for a better understanding:
Intellij(spark/scala)------> hive:9083(remote)------> s3(orc)
Note: here IntelliJ is on the local machine, and Hive and S3 are on AWS.
Please find the code below:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
//import org.apache.spark.sql.hive.HiveContext

object hiveconnect {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("SparkHiveExample")
      .config("hive.metastore.uris", "thrift://10.20.30.40:9083")
      .master("local[*]")
      .config("spark.sql.warehouse.dir", "s3://abc/test/main")
      .config("spark.driver.allowMultipleContexts", "true")
      .config("access-key", "key")
      .config("secret-key", "key")
      .enableHiveSupport()
      .getOrCreate()

    println("Start of SQL Session--------------------")
    spark.sql("show databases").show()
    spark.sql("select * from ace.visit limit 5").show()
  }
}
Error: Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).

Spark Scala Cassandra CSV insert into cassandra

Here is the code below:
Scala version: 2.11
Spark version: 2.0.2.6
Cassandra version: cqlsh 5.0.1 | Cassandra 3.11.0.1855 | DSE 5.1.3 | CQL spec 3.4.4 | Native protocol v4
I am trying to read from a CSV and write to a Cassandra table. I am new to Scala and Spark. Please correct me where I am going wrong.
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, Logger}
import com.datastax
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import com.datastax.spark.connector._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}
import org.apache.spark.sql._
import com.datastax.spark.connector.UDTValue
import com.datastax.spark.connector.mapper.DefaultColumnMapper
object dataframeset {
  def main(args: Array[String]): Unit = {
    // Cassandra part
    val conf = new SparkConf().setAppName("Sample1").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")
    val rdd1 = sc.cassandraTable("tdata", "map")
    rdd1.collect().foreach(println)

    // Scala read-CSV part
    Logger.getLogger("org").setLevel(Level.ERROR)
    Logger.getLogger("akka").setLevel(Level.ERROR)
    val spark1 = org.apache.spark.sql.SparkSession
      .builder()
      .master("local")
      .appName("Spark SQL basic example")
      .getOrCreate()
    val df = spark1.read.format("csv")
      .option("header", "true")
      .option("inferschema", "true")
      .load("/Users/tom/Desktop/del2.csv")
    import spark1.implicits._
    df.printSchema()

    val dfprev = df.select(col = "Year", "Measure").filter("Category = 'Prevention'")
    // dfprev.collect().foreach(println)
    val a = dfprev.select("YEAR")
    val b = dfprev.select("Measure")
    val collection = sc.parallelize(Seq(a, b))
    collection.saveToCassandra("tdata", "map", SomeColumns("sno", "name"))
    spark1.stop()
  }
}
Error:
Exception in thread "main" java.lang.IllegalArgumentException: Multiple constructors with the same number of parameters not allowed.
Cassandra Table
cqlsh:tdata> desc map

CREATE TABLE tdata.map (
    sno int PRIMARY KEY,
    name text
);
I know I am missing something, especially when trying to write the entire DataFrame into Cassandra in one shot, but I don't know what needs to be done either.
Thanks
tom
You can directly write a DataFrame (Dataset[Row] in Spark 2.x) to Cassandra.
You will have to define the Cassandra host, and the username and password if authentication is enabled, in the Spark conf to connect to Cassandra, using something like:
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "CASSANDRA_HOST")
.set("spark.cassandra.auth.username", "CASSANDRA_USERNAME")
.set("spark.cassandra.auth.password", "CASSANDRA_PASSWORD")
OR
val spark1 = org.apache.spark.sql.SparkSession
.builder()
.master("local")
.config("spark.cassandra.connection.host", "CASSANDRA_HOST")
.config("spark.cassandra.auth.username", "CASSANDRA_USERNAME")
.config("spark.cassandra.auth.password", "CASSANDRA_PASSWORD")
.appName("Spark SQL basic example")
.getOrCreate()
val dfprev = df.filter("Category = 'Prevention'").select(col("Year").as("yearAdded"), col("Measure").as("Recording"))
dfprev.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "map", "keyspace" -> "tdata"))
  .save()
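To sanity-check the write, the same connector data source can read the table back (a sketch that assumes the session and keyspace/table names from the answer above):

```scala
// Sketch: read the rows back through the spark-cassandra-connector
// data source to confirm the save succeeded.
val saved = spark1.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "map", "keyspace" -> "tdata"))
  .load()
saved.show()
```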
Dataframe in spark-cassandra-connector

How to query data stored in Hive table using SparkSession of Spark2?

I am trying to query data stored in a Hive table from Spark 2. Environment: 1. cloudera-quickstart-vm-5.7.0-0-vmware 2. Eclipse with Scala 2.11.8 plugin 3. Spark2 and Maven under
I did not change the Spark default configuration. Do I need to configure anything in Spark or Hive?
Code
import org.apache.spark._
import org.apache.spark.sql.SparkSession
object hiveTest {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("HiveSQL")
      .enableHiveSupport()
      .getOrCreate()

    val data = sparkSession2.sql("select * from test.mark")
  }
}
Getting error
16/08/29 00:18:10 INFO SparkSqlParser: Parsing command: select * from test.mark
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:48)
at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:47)
at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:54)
at org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:54)
at org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
at org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
at org.apache.spark.sql.hive.HiveSessionState$$anon$1.<init>(HiveSessionState.scala:63)
at org.apache.spark.sql.hive.HiveSessionState.analyzer$lzycompute(HiveSessionState.scala:63)
at org.apache.spark.sql.hive.HiveSessionState.analyzer(HiveSessionState.scala:62)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
at hiveTest$.main(hiveTest.scala:34)
at hiveTest.main(hiveTest.scala)
Caused by: java.lang.IllegalArgumentException: requirement failed: Duplicate SQLConfigEntry. spark.sql.hive.convertCTAS has been registered
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.sql.internal.SQLConf$.org$apache$spark$sql$internal$SQLConf$$register(SQLConf.scala:44)
at org.apache.spark.sql.internal.SQLConf$SQLConfigBuilder$$anonfun$apply$1.apply(SQLConf.scala:51)
at org.apache.spark.sql.internal.SQLConf$SQLConfigBuilder$$anonfun$apply$1.apply(SQLConf.scala:51)
at org.apache.spark.internal.config.TypedConfigBuilder$$anonfun$createWithDefault$1.apply(ConfigBuilder.scala:122)
at org.apache.spark.internal.config.TypedConfigBuilder$$anonfun$createWithDefault$1.apply(ConfigBuilder.scala:122)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.internal.config.TypedConfigBuilder.createWithDefault(ConfigBuilder.scala:122)
at org.apache.spark.sql.hive.HiveUtils$.<init>(HiveUtils.scala:103)
at org.apache.spark.sql.hive.HiveUtils$.<clinit>(HiveUtils.scala)
... 14 more
Any suggestion is appreciated
Thanks
Robin
This is what I am using:
import org.apache.spark.sql.SparkSession

object LoadCortexDataLake extends App {
  val spark = SparkSession.builder().appName("Cortex-Batch").enableHiveSupport().getOrCreate()
  spark.read.parquet(file).createOrReplaceTempView("temp")
  spark.sql(s"insert overwrite table $table_nm partition(year='$yr',month='$mth',day='$dt') select * from temp")
}
I think you should use 'sparkSession.sql' instead of 'sparkSession2.sql'.
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

val spark = SparkSession
  .builder()
  .appName("Connect to Hive")
  .config("hive.metastore.uris", "thrift://cdh-hadoop-master:Port")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.sql("SELECT * FROM table_name")