How do spark-shell or Zeppelin notebooks set HiveContext on the SparkSession? - scala

Does anyone know why I can access an existing Hive table from spark-shell or a Zeppelin notebook by doing this:
val df = spark.sql("select * from hive_table")
But when I submit a Spark jar with a SparkSession created this way,
val spark = SparkSession
.builder()
.appName("Yet another spark app")
.config("spark.sql.shuffle.partitions", 18)
.config("spark.executor.memory", "2g")
.config("spark.serializer","org.apache.spark.serializer.KryoSerializer")
.getOrCreate()
I got this error:
Table or view not found
What I really want to understand is what the shell and the notebooks do for us in order to provide the Hive context to the SparkSession.

When working with Hive, one must instantiate the SparkSession with Hive support.
You need to call enableHiveSupport() on the session builder; that is effectively what spark-shell and Zeppelin do when they create the spark variable for you.
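A minimal sketch, reusing the builder from the question (the config values are just the ones shown above):
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Yet another spark app")
  .config("spark.sql.shuffle.partitions", 18)
  .config("spark.executor.memory", "2g")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .enableHiveSupport() // picks up the Hive metastore, so existing tables become visible
  .getOrCreate()

val df = spark.sql("select * from hive_table")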

Related

HDInsight Spark Session issue with Parquet

I'm using HDInsight to run Spark with a Scala script.
I'm using the example scripts provided by the Azure plugin in IntelliJ.
It provides me with the following code:
val conf = new SparkConf().setAppName("MyApp")
val sc = new SparkContext(conf)
Fair enough. And I can do things like:
val rdd = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
and I can save files:
rdd1.saveAsTextFile("wasb:///HVACout2")
However, I am looking to load in a parquet file. The code I have found (elsewhere) for reading parquet files is:
val df = spark.read.parquet("resources/Parquet/MyFile.parquet/")
The line above gives an error in HDInsight (when I submit the jar via IntelliJ).
Why don't you use the following?
val spark = SparkSession.builder
.master("local[*]") // adjust accordingly
.config("spark.sql.warehouse.dir", "E:/Exp/") //change accordingly
.appName("MySparkSession") //change accordingly
.getOrCreate()
When I put in a SparkSession and get rid of the SparkContext, HDInsight breaks.
What am I doing wrong?
Using HDInsight, how do I go about creating either a Spark session or context that allows me to read in text files, parquet, and all the rest? How do I get the best of both worlds?
My understanding is that SparkSession is the newer and better way, and what we should be using. So how do I get it running in HDInsight?
Thanks in advance
Turns out that if I add
val spark = SparkSession.builder().appName("Spark SQL basic").getOrCreate()
after the SparkContext line and before the parquet read part, it works.
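Putting the pieces together, a rough sketch of what that ends up looking like (the CSV path is the sample one from the question; the parquet path is only an illustration):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

val conf = new SparkConf().setAppName("MyApp")
val sc = new SparkContext(conf)

// getOrCreate() reuses the SparkContext created above rather than starting a second one
val spark = SparkSession.builder().appName("Spark SQL basic").getOrCreate()

val rdd = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
val df = spark.read.parquet("wasb:///path/to/MyFile.parquet") // hypothetical path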

Spark Local Session with custom maven library

In my Scala code, which I run via the sbt run command, I am creating a local Spark session, and I need to make use of the following library: com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.17
My code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.eventhubs._
...
val spark = SparkSession.builder
.master("local")
.appName("RandomForestClassifierExample")
.getOrCreate()
...
val connectionString = ConnectionStringBuilder("<connectionstring>")
.setEventHubName("energinet")
.build
val eventHubsConf = EventHubsConf(connectionString)
.setStartingPosition(EventPosition.fromEndOfStream)
.setConsumerGroup("$default")
val eventhubs = spark.readStream
.format("eventhubs")
.options(eventHubsConf.toMap)
.load()
Of course it fails because of the missing Event Hubs library. I know I can run spark-submit and pull the library in by setting the --packages parameter, but I want to run my app using the sbt run command. Is there a way to make the library available to the local Spark sessions I create from Scala code?
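One common approach (a sketch, not verified against this exact project) is to declare the library as a normal sbt dependency, so it is on the classpath that sbt run uses instead of relying on --packages:
// build.sbt (sketch): add the connector next to your existing Spark dependencies
// with scalaVersion set to 2.12.x, %% resolves this to azure-eventhubs-spark_2.12:2.3.17
libraryDependencies += "com.microsoft.azure" %% "azure-eventhubs-spark" % "2.3.17"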

How to use TestHiveContext using Spark 2.2

I am trying to upgrade from Spark 1.6 to Spark 2.2. The existing unit tests depend on a HiveContext that was initialised using TestHiveContext.
val conf = new SparkConf().set("spark.driver.allowMultipleContexts", "true")
val sc = new SparkContext("local", "sc", conf)
sc.setLogLevel("WARN")
val sqlContext = new TestHiveContext(sc)
In Spark 2.2, HiveContext is deprecated and SparkSession.builder.enableHiveSupport is the advised replacement. I tried to create a new SparkSession using SparkSession.builder, but I couldn't find a way to initialise a SparkSession that uses TestHiveContext.
Is it possible to do that, or should I change my approach?
HiveContext and SQLContext have been replaced by SparkSession, as stated in the migration guide:
SparkSession is now the new entry point of Spark that replaces the old SQLContext and HiveContext. Note that the old SQLContext and HiveContext are kept for backward compatibility. A new catalog interface is accessible from SparkSession - existing API on databases and tables access such as listTables, createExternalTable, dropTempView, cacheTable are moved here.
https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html#upgrading-from-spark-sql-16-to-20
So you create a SparkSession instance with your test configuration and use it instead of HiveContext.
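A minimal sketch of such a test session (the warehouse directory is a hypothetical test path; point it wherever your tests keep temporary data):
import org.apache.spark.sql.SparkSession

// a local, Hive-enabled session for unit tests
val spark = SparkSession
  .builder()
  .master("local[*]")
  .appName("test-session")
  .config("spark.sql.warehouse.dir", "target/spark-warehouse") // hypothetical test path
  .enableHiveSupport()
  .getOrCreate()
spark.sparkContext.setLogLevel("WARN")

// the old SQLContext-style API is still reachable if legacy test helpers need it
val sqlContext = spark.sqlContext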

Engine used when creating a Hive table with joins using Spark SQL

I am not sure from the documentation whether, when creating a Hive table using HiveContext from Spark, it will use the Spark engine or a standard Hive MapReduce job to perform the task.
val sc = new SparkContext()
val hc = new HiveContext(sc)
hc.sql("""
  CREATE TABLE db.new_table
  STORED AS PARQUET
  AS SELECT
    field1,
    field2,
    field3
  FROM db.src1
  JOIN db.src2
    ON (x = y)
""")
Spark 1.6
Spark SQL supports Apache Hive using HiveContext. It uses the Spark SQL execution engine to work with data stored in Hive.
Spark 2.x and above
val spark = SparkSession
  .builder()
  .appName("SparkSessionExample")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()
When doing this, Spark will use the Spark APIs and not MapReduce. HiveContext no longer needs to be referenced explicitly, as it is deprecated, even in spark-submit / program mode.
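For illustration, a sketch of the same CREATE TABLE ... AS SELECT from the question running through a Hive-enabled SparkSession (executed by Spark's own engine, not MapReduce):
spark.sql("""
  CREATE TABLE db.new_table
  STORED AS PARQUET
  AS SELECT field1, field2, field3
  FROM db.src1
  JOIN db.src2 ON (x = y)
""")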

Spark-Scala with Cassandra

I am a beginner with Spark, Scala and Cassandra, and I am working on ETL programming.
My project's ETL POCs require Spark, Scala and Cassandra. I configured Cassandra on my Ubuntu system under /usr/local/Cassandra/* and after that I installed Spark and Scala. I am now using a Scala editor to start my work; I managed to simply load a file into a landing location, but after that I am trying to connect to Cassandra from Scala and I cannot find any help on how to connect and process the data into the destination database.
Can anyone tell me whether this is the correct way, or where I am going wrong? Please help me understand how to achieve this process with the above combination.
Thanks in advance!
Add spark-cassandra-connector to your pom or sbt following its instructions (a dependency sketch is shown below), then work this way.
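For sbt, the dependency looks roughly like this (a sketch; the version shown is an assumption, so pick the one matching your Spark release from the connector's compatibility table):
// build.sbt (sketch)
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.10" // assumed version for Spark 2.2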
Import these in your file:
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.cassandra._
Spark Scala file:
object SparkCassandraConnector {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .setAppName("UpdateCassandra")
      .setMaster("spark://spark:7077")                            // Spark master URL
      .set("spark.cassandra.input.split.size_in_mb", "67108864")  // note: this property is expressed in MB
      .set("spark.cassandra.connection.host", "192.168.3.167")    // Cassandra host
      .set("spark.cassandra.auth.username", "cassandra")
      .set("spark.cassandra.auth.password", "cassandra")

    // connect to Cassandra and create the SparkSession for SQL queries
    val spark = SparkSession.builder()
      .config(conf)
      .getOrCreate()

    // load data from the given table into a DataFrame
    val df = spark
      .read
      .cassandraFormat("table_name", "keyspace_name")
      .load()
  }
}
This will work for Spark 2.2 and Cassandra 2.
You can perform this easily with spark-cassandra-connector.