How to load files in Spark SQL from remote Hive storage (S3, ORC) using Spark/Scala (code and configuration)

IntelliJ (Spark) ---> Hive (remote) ---> storage on S3 (ORC format)
I am not able to read a remote Hive table through Spark/Scala. I can read the table schema, but not the table data.
Error:
Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql.types.StructType
object mainclas {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .config("hive.metastore.uris", "thrift://")
      .config("format", "orc")
      .enableHiveSupport()
      .getOrCreate()
    // show() prints the result and returns Unit, so there is nothing useful to assign
    spark.sql("show tables").show()
    spark.sql("select * from ace.visit limit 5").show()
  }
}

Try this:
val spark = SparkSession.builder
  .config("hive.metastore.uris", "thrift://")
  .config("format", "orc")
  .enableHiveSupport()
  .getOrCreate()

You need to prefix all the fs.* options with spark.hadoop. if you are setting them in the Spark config. And, as noted, use s3a rather than s3n if you can.
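
A minimal sketch of what that prefixing looks like, assuming the credentials are read from the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY and that the hadoop-aws module is on the classpath. The fs.s3.* property names are the ones quoted in the error message; the fs.s3a.* names are the S3A connector's equivalents and only take effect if the table location uses an s3a:// URL. The object name, app name and local master are placeholders, and the metastore URI is left incomplete as in the question.

```scala
import org.apache.spark.sql.SparkSession

object S3HiveRead {  // hypothetical object name
  def main(args: Array[String]): Unit = {
    // Assumption: credentials are supplied through the environment, not hard-coded.
    val accessKey = sys.env.getOrElse("AWS_ACCESS_KEY_ID", "")
    val secretKey = sys.env.getOrElse("AWS_SECRET_ACCESS_KEY", "")

    val spark = SparkSession.builder
      .appName("s3-hive-read")                      // placeholder name
      .master("local[*]")                           // assumption: local run from the IDE
      .config("hive.metastore.uris", "thrift://")   // incomplete, as in the question
      // Properties named in the error message, with the spark.hadoop. prefix:
      .config("spark.hadoop.fs.s3.awsAccessKeyId", accessKey)
      .config("spark.hadoop.fs.s3.awsSecretAccessKey", secretKey)
      // S3A equivalents, used when the table location is an s3a:// URL:
      .config("spark.hadoop.fs.s3a.access.key", accessKey)
      .config("spark.hadoop.fs.s3a.secret.key", secretKey)
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("select * from ace.visit limit 5").show()
    spark.stop()
  }
}
```

Spark copies every spark.hadoop.* entry into the Hadoop configuration with the prefix removed, which is why the unprefixed fs.* names from the error message are not picked up when set directly on the Spark config.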


saveToMongoDB fails with java.lang.ClassNotFoundException

I want to convert a DataFrame to an RDD and save it to MongoDB. Here is my code:
import pymongo
import pymongo_spark
from pyspark import SparkConf, SparkContext
from pyspark import BasicProfiler
from pyspark.sql import SparkSession

class MyCustomProfiler(BasicProfiler):
    def show(self, id):
        print("My custom profiles for RDD:%s" % id)

conf = SparkConf().set("spark.python.profile", "true")
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Word Count") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# Important: activate pymongo_spark.
pymongo_spark.activate()

on_time_dataframe = spark.read.parquet('\data\on_time_performance.parquet')

# Note we have to convert the row to a dict to avoid
as_dict = on_time_dataframe.rdd.map(lambda row: row.asDict())
Some errors occur:
py4j.protocol.Py4JJavaError: An error occurred while calling
: java.lang.ClassNotFoundException:
I have installed the mongo-hadoop package, but it seems I don't have the BSONWritable class. I'm not good at Java, so I would appreciate some help.

How to fix 22: error: not found: value SparkSession in Scala?

I am new to Spark and I would like to read a CSV file into a DataFrame.
Spark 1.3.0 / Scala 2.3.0
This is what I have so far:
# Start Scala with CSV Package Module
spark-shell --packages com.databricks:spark-csv_2.10:1.3.0
# Import Spark Classes
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import sqlCtx ._
# Create SparkConf
val conf = new SparkConf().setAppName("local").setMaster("master")
val sc = new SparkContext(conf)
# Create SQLContext
val sqlCtx = new SQLContext(sc)
# Create SparkSession and use it for all purposes:
val session = SparkSession.builder().appName("local").master("master").getOrCreate()
# Read CSV-File and turn it into Dataframe.
val df_fc = sqlCtx.read.format("com.databricks.spark.csv").option("header", "true").load("/home/Desktop/test.csv")
However, at SparkSession.builder() it gives the following error:
22: error: not found: value SparkSession
How can I fix this error?
SparkSession is only available from Spark 2 onward. In Spark 2 there is no need to create a SparkContext yourself; SparkSession itself is the single entry point.
Since you are using version 1.x, drop the SparkSession line and read through the SQLContext instead:
val df_fc = sqlCtx.read.format("com.databricks.spark.csv").option("header", "true").load("/home/Desktop/test.csv")
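
For comparison, a sketch of the same read on Spark 2.x, where SparkSession exists and CSV support is built in, so the spark-csv package is not needed. The object name and the local[*] master are assumptions for a local run; the path is the one from the question.

```scala
import org.apache.spark.sql.SparkSession

object ReadCsvSpark2 {  // hypothetical object name
  def main(args: Array[String]): Unit = {
    // In Spark 2.x the session replaces the separate SparkContext/SQLContext setup.
    val spark = SparkSession.builder()
      .appName("local")
      .master("local[*]")   // assumption: run locally on all cores
      .getOrCreate()

    // Built-in CSV reader; the first line of the file is treated as the header.
    val dfFc = spark.read
      .option("header", "true")
      .csv("/home/Desktop/test.csv")

    dfFc.printSchema()
    dfFc.show(5)
    spark.stop()
  }
}
```

Note that "master" is not a valid master URL in either version; local[*], a spark:// URL, or yarn are the usual choices.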

Spark Scala Cassandra CSV insert into cassandra

Here is the code below:
Scala Version: 2.11.
Spark Version:
Cassandra Version: cqlsh 5.0.1 | Cassandra | DSE 5.1.3 | CQL spec 3.4.4 | Native protocol v4
I am trying to read from a CSV file and write to a Cassandra table. I am new to Scala and Spark. Please correct me where I am going wrong.
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, Logger}
import com.datastax
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import com.datastax.spark.connector._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}
import org.apache.spark.sql._
import com.datastax.spark.connector.UDTValue
import com.datastax.spark.connector.mapper.DefaultColumnMapper
object dataframeset {
  def main(args: Array[String]): Unit = {
    // Cassandra Part
    val conf = new SparkConf().setAppName("Sample1").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val rdd1 = sc.cassandraTable("tdata", "map")
    // Scala Read CSV Part
    val spark1 = org.apache.spark.sql.SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .getOrCreate()
    val df = spark1.read.format("csv")
      .option("inferschema", "true")
      .load("/path/to/file.csv") // placeholder: the CSV path is not shown in the post
    import spark1.implicits._
    val dfprev = df.select("Year", "Measure").filter("Category = 'Prevention'")
    // dfprev.collect().foreach(println)
    val a = dfprev.select("YEAR")
    val b = dfprev.select("Measure")
    // This builds an RDD of two DataFrames rather than an RDD of rows
    val collection = sc.parallelize(Seq(a, b))
    collection.saveToCassandra("tdata", "map", SomeColumns("sno", "name"))
  }
}
Exception in thread "main" java.lang.IllegalArgumentException: Multiple constructors with the same number of parameters not allowed.
Cassandra Table
cqlsh:tdata> desc map
sno int PRIMARY KEY,
name text;
I know I am missing something, especially in trying to write the entire DataFrame into Cassandra in one shot, but I don't know what needs to be done.
You can write a DataFrame (Dataset[Row] in Spark 2.x) directly to Cassandra.
To connect, you have to define the Cassandra host in the Spark conf, plus the username and password if authentication is enabled, using something like
val conf = new SparkConf(true)
.set("", "CASSANDRA_HOST")
.set("spark.cassandra.auth.username", "CASSANDRA_USERNAME")
.set("spark.cassandra.auth.password", "CASSANDRA_PASSWORD")
val spark1 = org.apache.spark.sql.SparkSession
  .builder()
  .config("", "CASSANDRA_HOST")
  .config("spark.cassandra.auth.username", "CASSANDRA_USERNAME")
  .config("spark.cassandra.auth.password", "CASSANDRA_PASSWORD")
  .appName("Spark SQL basic example")
  .getOrCreate()
import org.apache.spark.sql.functions.col
val dfprev = df.filter("Category = 'Prevention'").select(col("Year").as("yearAdded"), col("Measure").as("Recording"))
dfprev.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "map", "keyspace" -> "tdata"))
  .save()
See the DataFrame documentation of spark-cassandra-connector.
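
Putting the answer together, a hedged end-to-end sketch for Spark 2.x with spark-cassandra-connector on the classpath. Several things here are assumptions rather than part of the original posts: the connection key spark.cassandra.connection.host (the connector's documented host setting, presumably what the empty key above stood for), the CSV path, the object and app names, and the choice to alias the columns to sno and name so that they match the table shown in the question.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.col

object CsvToCassandra {  // hypothetical object name
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-to-cassandra")                                  // placeholder name
      .master("local[*]")                                           // assumption: local run
      .config("spark.cassandra.connection.host", "CASSANDRA_HOST")  // replace with the real host
      .getOrCreate()

    // Assumption: the file has a header row with Year, Measure and Category columns.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/path/to/file.csv")                                     // hypothetical path

    // DataFrame column names must match the Cassandra column names (sno int, name text).
    val out = df
      .filter("Category = 'Prevention'")
      .select(col("Year").cast("int").as("sno"), col("Measure").as("name"))

    // Append adds rows to the existing table instead of failing when it is not empty.
    out.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "map", "keyspace" -> "tdata"))
      .mode(SaveMode.Append)
      .save()

    spark.stop()
  }
}
```

Because sno is the primary key, rows that share the same year overwrite one another in Cassandra, so a table keyed only on the year keeps one measure per year.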

How to query data stored in Hive table using SparkSession of Spark2?

I am trying to query data stored in a Hive table from Spark2. Environment:
1. cloudera-quickstart-vm-5.7.0-0-vmware
2. Eclipse with Scala 2.11.8 plugin
3. Spark2 and Maven under
I did not change the default Spark configuration. Do I need to configure anything in Spark or Hive?
import org.apache.spark._
import org.apache.spark.sql.SparkSession
object hiveTest {
  def main (args: Array[String]){
    val sparkSession = SparkSession.builder.
      enableHiveSupport().
      getOrCreate()
    val data= sparkSession2.sql("select * from test.mark")
  }
}
I get the following error:
16/08/29 00:18:10 INFO SparkSqlParser: Parsing command: select * from test.mark
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:48)
at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:47)
at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:54)
at org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:54)
at org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
at org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
at org.apache.spark.sql.hive.HiveSessionState$$anon$1.<init>(HiveSessionState.scala:63)
at org.apache.spark.sql.hive.HiveSessionState.analyzer$lzycompute(HiveSessionState.scala:63)
at org.apache.spark.sql.hive.HiveSessionState.analyzer(HiveSessionState.scala:62)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
at hiveTest$.main(hiveTest.scala:34)
at hiveTest.main(hiveTest.scala)
Caused by: java.lang.IllegalArgumentException: requirement failed: Duplicate SQLConfigEntry. spark.sql.hive.convertCTAS has been registered
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.sql.internal.SQLConf$.org$apache$spark$sql$internal$SQLConf$$register(SQLConf.scala:44)
at org.apache.spark.sql.internal.SQLConf$SQLConfigBuilder$$anonfun$apply$1.apply(SQLConf.scala:51)
at org.apache.spark.sql.internal.SQLConf$SQLConfigBuilder$$anonfun$apply$1.apply(SQLConf.scala:51)
at org.apache.spark.internal.config.TypedConfigBuilder$$anonfun$createWithDefault$1.apply(ConfigBuilder.scala:122)
at org.apache.spark.internal.config.TypedConfigBuilder$$anonfun$createWithDefault$1.apply(ConfigBuilder.scala:122)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.internal.config.TypedConfigBuilder.createWithDefault(ConfigBuilder.scala:122)
at org.apache.spark.sql.hive.HiveUtils$.<init>(HiveUtils.scala:103)
at org.apache.spark.sql.hive.HiveUtils$.<clinit>(HiveUtils.scala)
... 14 more
Any suggestion is appreciated
This is what I am using:
import org.apache.spark.sql.SparkSession
object LoadCortexDataLake extends App {
  val spark = SparkSession.builder().appName("Cortex-Batch").enableHiveSupport().getOrCreate()
  // ... a DataFrame is registered as the temporary view "temp" here; table_nm, yr, mth and dt are defined elsewhere ...
  spark.sql(s"insert overwrite table $table_nm partition(year='$yr',month='$mth',day='$dt') select * from temp")
}
I think you should use 'sparkSession.sql' instead of 'sparkSession2.sql', since the session variable is named sparkSession.
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
val spark = SparkSession.
  builder().
  appName("Connect to Hive").
  enableHiveSupport().
  getOrCreate()
val df = spark.sql("SELECT * FROM table_name")