SparkSession not picking up runtime configuration - Scala

In my application I'm creating a SparkSession object and then trying to read my properties file and set the properties at runtime. But it is not picking up the properties that I pass at runtime.
I am submitting my app in YARN cluster mode.
This is my initial Spark session object, which I am creating in a trait:
val spark = SparkSession.builder().appName("MyApp").enableHiveSupport().getOrCreate()
Then in my main function, which is inside an object, I extend this trait, so my Spark session is initialized in the trait, and in the object containing main I set the following:
spark.conf.set("spark.sql.hive.convertMetastoreParquet", false)
spark.conf.set("mapreduce.input.fileinputformat.input.dir.recursive", true)
spark.conf.set("spark.dynamicAllocation.enabled", true)
spark.conf.set("spark.shuffle.service.enabled", true)
spark.conf.set("spark.dynamicAllocation.minExecutors", 40)
So ideally my app should start with 40 executors, but it starts and then runs entirely with the default 2 executors.

There is nothing unexpected here. Only a certain subset of Spark SQL properties (prefixed with spark.sql) can be set at runtime (see the SparkConf documentation):
Once a SparkConf object is passed to Spark, it is cloned and can no longer be modified by the user. Spark does not support modifying the configuration at runtime.
The remaining options have to be set before the SparkContext is initialized. That means either initializing the SparkSession from a SparkContext:
val conf: SparkConf = ... // Set options here
val sc = new SparkContext(conf)
val spark = SparkSession.builder().getOrCreate() // picks up the already-created SparkContext
or using the config method of SparkSession.Builder with a SparkConf:
val conf: SparkConf = ... // Set options here
val spark = SparkSession.builder.config(conf).getOrCreate
or key-value pairs:
val spark = SparkSession.builder.config("spark.some.key", "some_value").getOrCreate
This applies in particular to spark.dynamicAllocation.enabled,
spark.shuffle.service.enabled and spark.dynamicAllocation.minExecutors.
mapreduce.input.fileinputformat.input.dir.recursive, on the other hand, is a Hadoop configuration property, not a Spark one, and should be set on the Hadoop configuration:
spark.sparkContext.hadoopConfiguration.set("some.hadoop.property", "some_value")
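Putting this together for the exact properties from the question (values copied from the question), a minimal sketch could look like this:
import org.apache.spark.sql.SparkSession

// Static options go through the builder, before the underlying SparkContext exists.
val spark = SparkSession.builder()
  .appName("MyApp")
  .enableHiveSupport()
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "40")
  .getOrCreate()

// spark.sql.* properties can still be changed at runtime.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", false)

// Hadoop properties belong on the Hadoop configuration, not on the Spark conf.
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.input.fileinputformat.input.dir.recursive", "true")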

Related

How to use TestHiveContext using Spark 2.2

I am trying to upgrade from Spark 1.6 to Spark 2.2. The existing unit tests depend on a HiveContext that was initialised using TestHiveContext.
val conf = new SparkConf().set("spark.driver.allowMultipleContexts", "true")
val sc = new SparkContext("local", "sc", conf)
sc.setLogLevel("WARN")
val sqlContext = new TestHiveContext(sc)
In Spark 2.2, HiveContext is deprecated and the advice is to use SparkSession.builder.enableHiveSupport instead. I tried to create a new SparkSession using SparkSession.builder, but I couldn't find a way to initialise a SparkSession that uses TestHiveContext.
Is it possible to do that, or should I change my approach?
HiveContext and SQLContext have been replaced by SparkSession, as stated in the migration guide:
SparkSession is now the new entry point of Spark that replaces the old SQLContext and HiveContext. Note that the old SQLContext and HiveContext are kept for backward compatibility. A new catalog interface is accessible from SparkSession - existing API on databases and tables access such as listTables, createExternalTable, dropTempView, cacheTable are moved here.
https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html#upgrading-from-spark-sql-16-to-20
So you create a SparkSession instance with your test configuration and use it instead of HiveContext.
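For example, a test setup along these lines could replace the TestHiveContext (a sketch only; the local master, warehouse directory and log level are assumptions, not part of the original answer):
import org.apache.spark.sql.SparkSession

// Hive-enabled session for unit tests; adjust the warehouse path to your build layout.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("unit-tests")
  .config("spark.sql.warehouse.dir", "target/spark-warehouse") // assumed test-local location
  .enableHiveSupport()
  .getOrCreate()
spark.sparkContext.setLogLevel("WARN")

// Tests that still expect the old API can use the session's SQLContext for now.
val sqlContext = spark.sqlContext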

Spark on HDInsights - No FileSystem for scheme: adl

I am writing an application that processes files from ADLS. When I read the files from the cluster by running the code within spark-shell, it has no problem accessing the files. However, when I attempt to sbt run the project on the cluster, it gives me:
[error] java.io.IOException: No FileSystem for scheme: adl
implicit val spark = SparkSession.builder().master("local[*]").appName("AppMain").getOrCreate()
import spark.implicits._
val listOfFiles = spark.sparkContext.binaryFiles("adl://adlAddressHere/FolderHere/")
val fileList = listOfFiles.collect()
This is Spark 2.2 on HDI 3.6.
In your build.sbt add:
libraryDependencies += "org.apache.hadoop" % "hadoop-azure-datalake" % "2.8.0" % Provided
I use Spark 2.3.1 instead of 2.2. That version works well with hadoop-azure-datalake 2.8.0.
Then, configure your spark context:
val spark: SparkSession = SparkSession.builder.master("local").getOrCreate()
import spark.implicits._
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
hadoopConf.set("fs.AbstractFileSystem.adl.impl", "org.apache.hadoop.fs.adl.Adl")
hadoopConf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
hadoopConf.set("dfs.adls.oauth2.client.id", clientId)
hadoopConf.set("dfs.adls.oauth2.credential", clientSecret)
hadoopConf.set("dfs.adls.oauth2.refresh.url", s"https://login.microsoftonline.com/$tenantId/oauth2/token")
TL;DR:
If you are using RDDs through the SparkContext, you can tell the Hadoop configuration where to find the implementation of org.apache.hadoop.fs.adl.AdlFileSystem.
The key has the format fs.<fs-prefix>.impl, and the value is the fully qualified name of a class that extends org.apache.hadoop.fs.FileSystem.
In your case, you need fs.adl.impl, which is implemented by org.apache.hadoop.fs.adl.AdlFileSystem.
val spark: SparkSession = SparkSession.builder.master("local").getOrCreate()
import spark.implicits._
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
I usually work with Spark SQL, so I need to configure the Spark session too:
val spark: SparkSession = SparkSession.builder.master("local").getOrCreate()
spark.conf.set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.client.id", clientId)
spark.conf.set("dfs.adls.oauth2.credential", clientSecret)
spark.conf.set("dfs.adls.oauth2.refresh.url", s"https://login.microsoftonline.com/$tenantId/oauth2/token")
Well, I found that if I package the jar and spark-submit it, it works fine, so that will do for the meantime. I'm still surprised it wouldn't work in local[*] mode, though.

How to use SparkSession and StreamingContext together?

I'm trying to stream CSV files from a folder on my local machine (OSX). I'm using SparkSession and StreamingContext together like so:
val sc: SparkContext = createSparkContext(sparkContextName)
val sparkSess = SparkSession.builder().config(sc.getConf).getOrCreate()
val ssc = new StreamingContext(sparkSess.sparkContext, Seconds(time))
val csvSchema = new StructType().add("field_name",StringType)
val inputDF = sparkSess.readStream
  .format("org.apache.spark.csv")
  .schema(csvSchema)
  .csv("file:///Users/userName/Documents/Notes/MoreNotes/tmpFolder/")
If I run ssc.start() after this, I get this error:
java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
Instead if I try to start the SparkSession like this:
inputDF.writeStream.format("console").start()
I get:
java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
Clearly I'm not understanding how SparkSession and StreamingContext should work together. If I get rid of SparkSession, StreamingContext only has textFileStream, on which I would need to impose a CSV schema. I would appreciate any clarification on how to get this working.
You should not create a SparkSession and a separate SparkContext. With the release of Spark 2.0.0 there is a new abstraction available to developers - the SparkSession - which can be instantiated and called upon just like the SparkContext that was previously available, and which wraps its own SparkContext.
You can still access the SparkContext from the SparkSession:
val sparkSess = SparkSession.builder().appName("My App").getOrCreate()
val sc = sparkSess.sparkContext
val ssc = new StreamingContext(sc, Seconds(time))
One more thing that is causing your job to fail: you only define transformations and never call an action or register an output operation. Some action should be called at the end, such as inputDF.show().
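Putting that together, a minimal DStream sketch (the folder path is the one from the question; the local master, the 10-second batch interval and print() as the output operation are arbitrary choices):
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Reuse the session's SparkContext and register at least one output operation before start().
val sparkSess = SparkSession.builder().master("local[*]").appName("My App").getOrCreate()
val ssc = new StreamingContext(sparkSess.sparkContext, Seconds(10))

val lines = ssc.textFileStream("file:///Users/userName/Documents/Notes/MoreNotes/tmpFolder/")
lines.print() // output operation, so start() no longer fails with "No output operations registered"

ssc.start()
ssc.awaitTermination()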

spark Cassandra tuning

How do I set the following Cassandra write parameters in Spark Scala code, for DataStax Spark Cassandra Connector 1.6.3 and Spark 1.6.2?
spark.cassandra.output.batch.size.rows
spark.cassandra.output.concurrent.writes
spark.cassandra.output.batch.size.bytes
spark.cassandra.output.batch.grouping.key
Thanks,
Chandra
In DataStax Spark Cassandra Connector 1.6.X, you can pass these parameters as part of your SparkConf.
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "192.168.123.10")
.set("spark.cassandra.auth.username", "cassandra")
.set("spark.cassandra.auth.password", "cassandra")
.set("spark.cassandra.output.batch.size.rows", "100")
.set("spark.cassandra.output.concurrent.writes", "100")
.set("spark.cassandra.output.batch.size.bytes", "100")
.set("spark.cassandra.output.batch.grouping.key", "partition")
val sc = new SparkContext("spark://192.168.123.10:7077", "test", conf)
You can refer to this readme for more information.
The most flexible way is to add those variables in a file, such as spark.conf:
spark.cassandra.output.concurrent.writes 10
etc...
and then create your spark context in your app with something like:
val conf = new SparkConf()
val sc = new SparkContext(conf)
and finally, when you submit your app, you can specify your properties file with:
spark-submit --properties-file spark.conf ...
Spark will automatically read your configuration from spark.conf when creating the SparkContext.
That way, you can modify the properties in your spark.conf without needing to recompile your code each time.
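If you want to sanity-check that the submitted properties actually reached the context, something along these lines (the property name is from the question; launching via spark-submit with --properties-file is assumed) would do:
import org.apache.spark.{SparkConf, SparkContext}

// Assumes the app was launched with spark-submit --properties-file spark.conf,
// which supplies the master, app name and the Cassandra settings.
val conf = new SparkConf()
val sc = new SparkContext(conf)
println(sc.getConf.get("spark.cassandra.output.concurrent.writes"))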

Can I call the SparkContext constructor twice?

I need to do something like the following.
val conf = new SparkConf().setAppName("MyApp")
val master = new SparkContext(conf).master
if (master == "local[*]") // running locally
{
conf.set(...)
conf.set(...)
}
else // running on a cluster
{
conf.set(...)
conf.set(...)
}
val sc = new SparkContext(conf)
I first check whether I am running in local mode or cluster mode and set the conf properties accordingly. But just to know the master, I first have to create a SparkContext object. And after setting the conf properties, I obviously create another SparkContext object. Is this fine, or would Spark just ignore my second constructor? If that is the case, how else can I find out the master (whether local or cluster mode) before creating the SparkContext object?
Starting multiple contexts at the same time will give an error.
You can get around this by keeping a reference to the first context and stopping it before creating the second (in your snippet, master is just a String, so keep the context itself in a val):
firstSc.stop() // firstSc being the first SparkContext created to read the master
val sc = new SparkContext(conf)
It's silly to do this, though; you can get the master from the SparkConf without needing to start a SparkContext:
conf.get("spark.master")
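So a single-context version of the snippet from the question could look like this (the conf.set(...) calls are the question's own placeholders, and falling back to "local[*]" when spark.master is absent is an assumption):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("MyApp")

if (conf.get("spark.master", "local[*]") == "local[*]") { // running locally
  // conf.set(...)
} else { // running on a cluster
  // conf.set(...)
}

val sc = new SparkContext(conf)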