How to use SparkSession and StreamingContext together? - scala

I'm trying to stream CSV files from a folder on my local machine (OSX). I'm using SparkSession and StreamingContext together like so:
val sc: SparkContext = createSparkContext(sparkContextName)
val sparkSess = SparkSession.builder().config(sc.getConf).getOrCreate()
val ssc = new StreamingContext(sparkSess.sparkContext, Seconds(time))
val csvSchema = new StructType().add("field_name",StringType)
val inputDF = sparkSess.readStream.format("org.apache.spark.csv").schema(csvSchema).csv("file:///Users/userName/Documents/Notes/MoreNotes/tmpFolder/")
If I run ssc.start() after this, I get this error:
java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
Instead if I try to start the SparkSession like this:
inputDF.writeStream.format("console").start()
I get:
java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
Clearly I'm not understanding how SparkSession and StreamingContext should work together. If I get rid of SparkSession, StreamingContext only has textFileStream on which I need to impose a CSV schema. Would appreciate any clarifications on how to get this working.

You cannot have a spark session and spark context together. With the release of Spark 2.0.0 there is a new abstraction available to developers - the Spark Session - which can be instantiated and called upon just like the Spark Context that was previously available.
You can still access spark context from the spark session builder:
val sparkSess = SparkSession.builder().appName("My App").getOrCreate()
val sc = sparkSess.sparkContext
val ssc = new StreamingContext(sc, Seconds(time))
One more thing that is causing your job to fail is you are performing the transformation and no action is called. Some action should be called in the end such as inputDF.show()

Related

Hdinsight Spark Session issue with Parquet

Using HDinsight to run spark and a scala script.
I'm using the example scripts provided by the Azure plugin in intellij.
It provides me with the following code:
val conf = new SparkConf().setAppName("MyApp")
val sc = new SparkContext(conf)
Fair enough. And I can do things like:
val rdd = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
and I can save files:
rdd1.saveAsTextFile("wasb:///HVACout2")
However, I am looking to load in a parquet file. The code I have found (elsewhere) for parquet files coming in is:
val df = spark.read.parquet("resources/Parquet/MyFile.parquet/")
Line above gives an error on this in HDinsight (when I submit the jar via intellij).
Why don't you use?:
val spark = SparkSession.builder
.master("local[*]") // adjust accordingly
.config("spark.sql.warehouse.dir", "E:/Exp/") //change accordingly
.appName("MySparkSession") //change accordingly
.getOrCreate()
When I put in spark session and get rid of spark context, HD insight breaks.
What am I doing wrong?
How using HdInsight do I go about creating either a spark session or context, that allows me to read in text files, parquet and all the rest? How do I get the best of both worlds
My understanding is SparkSession, is the better and more recent way. And what we should be using. So how do I get it running in HDInsight?
Thanks in advance
Turns out if I add
val spark = SparkSession.builder().appName("Spark SQL basic").getOrCreate()
After the spark context line and before the parquet, read part, it works.

How to use TestHiveContext using Spark 2.2

I am trying to upgrade to Spark 2.2 from Spark 1.6. The existing unit tests are depending on a defined HiveContext which was initialised using TestHiveContext.
val conf = new SparkConf().set("spark.driver.allowMultipleContexts", "true")
val sc = new SparkContext("local", "sc", conf)
sc.setLogLevel("WARN")
val sqlContext = new TestHiveContext(sc)
In spark 2.2, HiveContext is deprecated and SparkSession.builder.enableHiveSupport is advised to be used. I tried to create a new SparkSession using SparkSession.builder but I couldn't find a way to initialise a SparkSession that uses TestHiveContext.
Is it possible to do that or should I change my approach ?
HiveContext and SQLContext has been replaced by SparkSession as stated in the migration guide :
SparkSession is now the new entry point of Spark that replaces the old
SQLContext and
HiveContext. Note that the old SQLContext and HiveContext are kept for
backward compatibility. A new catalog interface is accessible from
SparkSession - existing API on databases and tables access such as
listTables, createExternalTable, dropTempView, cacheTable are moved
here.
https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html#upgrading-from-spark-sql-16-to-20
So you create a Sparksession instance with your test configuration and use it instead of HiveContext

How to load a local file with a local spark session

I'm running a local spark session on my mac via my intellij sbt console and I get a
org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/myuser/Documents/data/dataset.csv; error.
my current code looks like this:
val data = spark.read.csv("file:///Users/myuser/Documents/data/dataset.csv")
I've also tried:
val data = spark.read.csv("/Users/myuser/Documents/data/dataset.csv")
my spark session looks like this
import org.apache.spark.sql.SparkSession
trait SparkSessionWrapper {
lazy val spark: SparkSession = {
SparkSession
.builder()
.master("local")
.appName("avro_test")
.getOrCreate()
}
}
I know this is the same issue as the one found here: How to load local file in sc.textFile, instead of HDFS
but none of the answers here (and others i've looked at) are helping me or else i'm not fully understanding them. any suggestions?

spark Cassandra tuning

How to set following Cassandra write parameters in spark scala code for
version - DataStax Spark Cassandra Connector 1.6.3.
Spark version - 1.6.2
spark.cassandra.output.batch.size.rows
spark.cassandra.output.concurrent.writes
spark.cassandra.output.batch.size.bytes
spark.cassandra.output.batch.grouping.key
Thanks,
Chandra
In DataStax Spark Cassandra Connector 1.6.X, you can pass these parameters as part of your SparkConf.
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "192.168.123.10")
.set("spark.cassandra.auth.username", "cassandra")
.set("spark.cassandra.auth.password", "cassandra")
.set("spark.cassandra.output.batch.size.rows", "100")
.set("spark.cassandra.output.concurrent.writes", "100")
.set("spark.cassandra.output.batch.size.bytes", "100")
.set("spark.cassandra.output.batch.grouping.key", "partition")
val sc = new SparkContext("spark://192.168.123.10:7077", "test", conf)
You can refer to this readme for more information.
The most flexible way is to add those variables in a file, such as spark.conf:
spark.cassandra.output.concurrent.writes 10
etc...
and then create your spark context in your app with something like:
val conf = new SparkConf()
val sc = new SparkContext(conf)
and finally, when you submit your app, you can specify your properties file with:
spark-submit --properties-file spark.conf ...
Spark will automatically read your configuration from spark.conf when creating the spark context
That way, you can modify the properties on your spark.conf without needing to recompile your code each time.

SparkSession not picking up Runtime Configuration

In my application I'm creating a SparkSession object and then trying to Read my properties file and setting the properties at runtime. But it is not picking up the properties that I am passing at runtime.
I am submitting my App in YARN Cluster Mode
This is my inital Spark session object which I am creating in a Trait
val spark = SparkSession.builder().appName("MyApp").enableHiveSupport().getOrCreate()
Then in my main function which is inside an object, i am extending this Trait so my spark session is Initialized in Trait and in my Object (containing main) i am setting this :
spark.conf.set(spark.sql.hive.convertMetastoreParquet, false)
spark.conf.set(mapreduce.input.fileinputformat.input.dir.recursive,true)
spark.conf.set(spark.dynamicAllocation.enabled, true)
spark.conf.set(spark.shuffle.service.enabled, true)
spark.conf.set(spark.dynamicAllocation.minExecutors,40)
So Ideally my App must start with 40 Executors but it is starting and then running Entirely using the Default 2 executors ..
There is nothing unexpected here. Only certain subset of Spark SQL properties (prefixed with spark.sql) can be set on runtime (see SparkConf documentation):
Once a SparkConf object is passed to Spark, it is cloned and can no longer be modified by the user. Spark does not support modifying the configuration at runtime.
Remaining options have to be set before SparkContext is initalized. It means initalizing SparkSession with SparkContext:
val conf: SparkConf = ... // Set options here
val sc = SparkContext(conf)
val spark = SparkSession(sc)
with config method of SparkSession.Builder and SparkConf
val conf: SparkConf = ... // Set options here
val spark = SparkSession.builder.config(conf).getOrCreate
or key-value pairs:
val spark = SparkSession.builder.config("spark.some.key", "some_value").getOrCreate
This applies in particular to spark.dynamicAllocation.enabled,
spark.shuffle.service.enabled and spark.dynamicAllocation.minExecutors.
mapreduce.input.fileinputformat.input.dir.recursive from the other hand, is a property of Hadoop configuration, not Spark, and should be set there:
spark.sparkContext.hadoopConfiguration.set("some.hadoop.property", "some_value")