Can I call the SparkContext constructor twice? - scala

I need to do something like the following.
val conf = new SparkConf().setAppName("MyApp")
val master = new SparkContext(conf).master
if (master == "local[*]") // running locally
{
conf.set(...)
conf.set(...)
}
else // running on a cluster
{
conf.set(...)
conf.set(...)
}
val sc = new SparkContext(conf)
I first check whether I am running in local mode or cluster mode, and set the conf properties accordingly. But just to find out the master, I first have to create a SparkContext object. After setting the conf properties, I then create another SparkContext object. Is this fine, or would Spark just ignore my second constructor call? If that is the case, how else can I find out the master (that is, whether I am running locally or on a cluster) before creating the SparkContext object?

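Starting multiple contexts in the same JVM at the same time will give an error.
You can get around this by stopping the first context before creating the second. Note that master in the question is a String, so stop() has to be called on the context itself:
val first = new SparkContext(conf)
val master = first.master
first.stop()
val sc = new SparkContext(conf)
It's silly to do this though; you can get the master from the SparkConf without starting a SparkContext at all. When the application is launched through spark-submit, the value passed to --master is available as the spark.master property:
conf.get("spark.master")
If the property is not set, get throws a NoSuchElementException; conf.get("spark.master", "local[*]") supplies a default instead.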
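A minimal sketch of that approach, assuming the job is launched with spark-submit so that spark.master is normally already present in the SparkConf. The object name MasterAwareConf, the helper isLocal, the "local[*]" fallback, and the two properties set in the branches are illustrative choices, not part of the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MasterAwareConf {
  // Hypothetical helper: true for "local", "local[4]", "local[*]", ...
  def isLocal(master: String): Boolean =
    master == "local" || master.startsWith("local[")

  // Reads the master from the conf (falling back to an assumed default)
  // and applies mode-specific settings before any context exists.
  def configure(conf: SparkConf): SparkConf = {
    val master = conf.get("spark.master", "local[*]")
    if (isLocal(master)) {
      conf.set("spark.ui.enabled", "false") // example local-only setting
    } else {
      conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // example cluster setting
    }
    conf.setIfMissing("spark.master", master)
  }

  def main(args: Array[String]): Unit = {
    // Only one SparkContext is ever created.
    val sc = new SparkContext(configure(new SparkConf().setAppName("MyApp")))
    try println(s"Running with master ${sc.master}")
    finally sc.stop()
  }
}
```

Because the decision is made on the SparkConf alone, there is no throwaway context to stop.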

Related

Spark program taking hadoop configurations from an unspecified location

I have a few test cases, such as reading and writing a file on HDFS, that I want to automate using Scala and run with Maven. I took the Hadoop configuration files of the test environment and put them in the resources directory of my Maven project. The project runs fine against the desired cluster, whichever cluster I launch it from.
What I don't understand is how Spark picks up the Hadoop configuration from the resources directory even though I have not referenced it anywhere in the project. Below is a code snippet from the project.
def getSparkContext(hadoopConfiguration: Configuration): SparkContext ={
val conf = new SparkConf().setAppName("SparkTest").setMaster("local")
val hdfsCoreSitePath = new Path("/etc/hadoop/conf/core-site.xml","core-site.xml")
val hdfsHDFSSitePath = new Path("/etc/hadoop/conf/hdfs-site.xml","hdfs-site.xml")
val hdfsYarnSitePath = new Path("/etc/hadoop/conf/yarn-site.xml","yarn-site.xml")
val hdfsMapredSitePath = new Path("/etc/hadoop/conf/mapred-site.xml","mapred-site.xml")
hadoopConfiguration.addResource(hdfsCoreSitePath)
hadoopConfiguration.addResource(hdfsHDFSSitePath)
hadoopConfiguration.addResource(hdfsYarnSitePath)
hadoopConfiguration.addResource(hdfsMapredSitePath)
hadoopConfiguration.set("hadoop.security.authentication", "Kerberos")
UserGroupInformation.setConfiguration(hadoopConfiguration)
UserGroupInformation.loginUserFromKeytab("alice", "/etc/security/keytab/alice.keytab")
println("-----------------Logged-in via keytab---------------------")
FileSystem.get(hadoopConfiguration)
val sc=new SparkContext(conf)
return sc
}
@Test
def testCase(): Unit = {
var hadoopConfiguration: Configuration = new Configuration()
val sc=getSparkContext(hadoopConfiguration)
//rest of the code
//...
//...
}
Here I have used a hadoopConfiguration object, but I am not passing it to the SparkContext anywhere, as this will run the tests on the cluster I am using to run the project and not on some remote test environment.
If this is not the correct way, can anyone please explain how I should go about running Spark test cases against the test environment from a remote cluster?

How to use SparkSession and StreamingContext together?

I'm trying to stream CSV files from a folder on my local machine (OSX). I'm using SparkSession and StreamingContext together like so:
val sc: SparkContext = createSparkContext(sparkContextName)
val sparkSess = SparkSession.builder().config(sc.getConf).getOrCreate()
val ssc = new StreamingContext(sparkSess.sparkContext, Seconds(time))
val csvSchema = new StructType().add("field_name",StringType)
val inputDF = sparkSess.readStream.format("org.apache.spark.csv").schema(csvSchema).csv("file:///Users/userName/Documents/Notes/MoreNotes/tmpFolder/")
If I run ssc.start() after this, I get this error:
java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
Instead if I try to start the SparkSession like this:
inputDF.writeStream.format("console").start()
I get:
java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
Clearly I'm not understanding how SparkSession and StreamingContext should work together. If I get rid of SparkSession, StreamingContext only has textFileStream on which I need to impose a CSV schema. Would appreciate any clarifications on how to get this working.

You should not create a SparkContext and a SparkSession independently of each other. Spark 2.0.0 introduced a new abstraction, the SparkSession, which is instantiated and used much like the SparkContext that was previously available, and which wraps a SparkContext of its own.
You can still access the SparkContext from the session:
val sparkSess = SparkSession.builder().appName("My App").getOrCreate()
val sc = sparkSess.sparkContext
val ssc = new StreamingContext(sc, Seconds(time))
The other thing causing your job to fail is that no output operation is registered on the StreamingContext: inputDF is a Structured Streaming DataFrame, not a DStream, so ssc.start() finds nothing to execute. A streaming DataFrame is started with inputDF.writeStream.format("console").start() and does not need a StreamingContext at all; batch actions such as inputDF.show() are not supported on it.
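As a rough sketch, the CSV stream from the question can be run with a SparkSession alone. The object name CsvStreamSketch, the local[*] master, and the input path /tmp/csv-input/ are assumptions made for illustration; the directory has to exist before the query starts.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructType}

object CsvStreamSketch {
  // Same single-column schema as in the question.
  val csvSchema: StructType = new StructType().add("field_name", StringType)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CsvStreamSketch")
      .master("local[*]") // assumption: running on a local machine
      .getOrCreate()

    // Streaming file sources need an explicit schema.
    val inputDF = spark.readStream
      .schema(csvSchema)
      .csv("file:///tmp/csv-input/") // hypothetical directory

    // writeStream ... start() is the output operation that launches the query.
    val query = inputDF.writeStream
      .format("console")
      .start()

    // Block until the query is stopped or fails.
    query.awaitTermination()
  }
}
```

No StreamingContext is created here, so there is nothing to start with ssc.start().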

SparkSession not picking up Runtime Configuration

In my application I create a SparkSession object, then read my properties file and set the properties at runtime. But Spark is not picking up the properties that I set at runtime.
I am submitting my app in YARN cluster mode.
This is my initial SparkSession object, which I create in a trait:
val spark = SparkSession.builder().appName("MyApp").enableHiveSupport().getOrCreate()
Then my main function is inside an object that extends this trait, so the SparkSession is initialized in the trait, and in the object (containing main) I set the following:
spark.conf.set("spark.sql.hive.convertMetastoreParquet", false)
spark.conf.set("mapreduce.input.fileinputformat.input.dir.recursive", true)
spark.conf.set("spark.dynamicAllocation.enabled", true)
spark.conf.set("spark.shuffle.service.enabled", true)
spark.conf.set("spark.dynamicAllocation.minExecutors", 40)
So ideally my app should start with 40 executors, but it starts and then runs entirely on the default 2 executors.

There is nothing unexpected here. Only a certain subset of Spark SQL properties (prefixed with spark.sql) can be set at runtime (see the SparkConf documentation):
Once a SparkConf object is passed to Spark, it is cloned and can no longer be modified by the user. Spark does not support modifying the configuration at runtime.
The remaining options have to be set before the SparkContext is initialized. That means either creating the SparkSession on top of an already configured SparkContext:
val conf: SparkConf = ... // Set options here
val sc = new SparkContext(conf)
val spark = SparkSession.builder.getOrCreate() // reuses the active SparkContext
or using the config method of SparkSession.Builder with a SparkConf:
val conf: SparkConf = ... // Set options here
val spark = SparkSession.builder.config(conf).getOrCreate
or key-value pairs:
val spark = SparkSession.builder.config("spark.some.key", "some_value").getOrCreate
This applies in particular to spark.dynamicAllocation.enabled, spark.shuffle.service.enabled and spark.dynamicAllocation.minExecutors.
mapreduce.input.fileinputformat.input.dir.recursive, on the other hand, is a property of the Hadoop configuration, not of Spark, and should be set there:
spark.sparkContext.hadoopConfiguration.set("some.hadoop.property", "some_value")
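Putting the pieces together for the properties in the question, a sketch might look like the following. The object name StartupConfSketch and the helper startupConf are hypothetical, and the master is assumed to come from spark-submit (YARN cluster mode, as in the question).

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object StartupConfSketch {
  // Options that must be fixed before the SparkContext starts.
  def startupConf(): SparkConf = new SparkConf()
    .setAppName("MyApp")
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.shuffle.service.enabled", "true")
    .set("spark.dynamicAllocation.minExecutors", "40")

  def main(args: Array[String]): Unit = {
    // Requires Hive classes on the classpath, as in the question.
    val spark = SparkSession.builder()
      .config(startupConf())
      .enableHiveSupport()
      .getOrCreate()

    // A spark.sql property: can still be changed on the running session.
    spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

    // A Hadoop property: belongs in the Hadoop configuration.
    spark.sparkContext.hadoopConfiguration
      .set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

    spark.stop()
  }
}
```

If the values come from a properties file, the file would have to be read before the builder is called, so that they can be placed on the SparkConf first.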

Scala Code to connect to Spark and Cassandra

I have Scala (IntelliJ) running on my laptop. I also have Spark and Cassandra running on machines A, B and C (a 3-node cluster using DataStax, running in Analytics mode).
I tried running Scala programs on the cluster, and they run fine.
I need to write code and run it from IntelliJ on my laptop. How do I connect and run it? I know the code below is wrong; I used placeholder values and need help writing the specific code. For example, localhost is incorrect.
import org.apache.spark.{SparkContext, SparkConf}
object HelloWorld {
def main(args: Array[String]) {
val conf = new SparkConf(true).set("spark:master", "localhost")
val sc = new SparkContext(conf)
val data = sc.cassandraTable("my_keyspace", "my_table")
}
}

val conf = new SparkConf().setAppName("APP_NAME")
.setMaster("local")
.set("spark.cassandra.connection.host", "localhost")
.set("spark.cassandra.auth.username", "")
.set("spark.cassandra.auth.password", "")
Use the code above to connect to a local Spark and Cassandra. If your Cassandra cluster has authentication enabled, supply the username and password.
To connect to a remote Spark and Cassandra cluster, replace localhost with the Cassandra host and pass spark://SPARK_HOST:PORT to setMaster (7077 is the default port of a standalone master).
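For the remote case, a sketch along these lines may help. It assumes the DataStax spark-cassandra-connector is on the classpath (its import provides cassandraTable) and that the master listens on the default standalone port 7077; SPARK_HOST and CASSANDRA_HOST are placeholders, and the names RemoteCassandraSketch and buildConf are made up for this example.

```scala
import com.datastax.spark.connector._ // adds cassandraTable to SparkContext
import org.apache.spark.{SparkConf, SparkContext}

object RemoteCassandraSketch {
  // Builds a conf pointing at a remote standalone master and Cassandra node.
  def buildConf(sparkHost: String, cassandraHost: String): SparkConf =
    new SparkConf()
      .setAppName("RemoteCassandraSketch")
      .setMaster(s"spark://$sparkHost:7077") // assumption: default master port
      .set("spark.cassandra.connection.host", cassandraHost)

  def main(args: Array[String]): Unit = {
    // Placeholders: replace with the real host names of the cluster.
    val sc = new SparkContext(buildConf("SPARK_HOST", "CASSANDRA_HOST"))
    try {
      val data = sc.cassandraTable("my_keyspace", "my_table")
      println(s"Rows in my_table: ${data.count()}")
    } finally {
      sc.stop()
    }
  }
}
```

When the driver runs on a laptop, the executors also need the application classes and the connector jar, and must be able to reach the driver over the network; how that is arranged depends on the cluster setup.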

java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream with Spark on local mode

I have used Spark before in yarn-cluster mode and it's been good so far.
However, I wanted to run it in "local" mode, so I created a simple Scala app, added Spark as a dependency via Maven, and then tried to run the app like a normal application.
However, I get the above exception on the very first line, where I try to create a SparkConf object.
I don't understand why I need Hadoop to run a standalone Spark app. Could someone point out what's going on here?
My two-line app:
val sparkConf = new SparkConf().setMaster("local").setAppName("MLPipeline.AutomatedBinner")//.set("spark.default.parallelism", "300").set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").set("spark.kryoserializer.buffer.mb", "256").set("spark.akka.frameSize", "256").set("spark.akka.timeout", "1000") //.set("spark.akka.threads", "300")//.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") //.set("spark.akka.timeout", "1000")
val sc = new SparkContext(sparkConf)