Best practice for properties in Scala Spark - scala

I'm starting a project using Spark on Hadoop. I'll be developing in Scala.
I'm creating the project from scratch and I was wondering what to do with properties.
I come from a Java background where I use a .properties file and load it at the start. Then I have a class used to access the different values of my properties.
Is this also a good practice in Scala ?
I tried googling, but I couldn't find anything relating to this.

You can read a properties file in Scala much as you would in Java:
import scala.io.Source.fromURL
val reader = fromURL(getClass.getResource("conf/fp.properties")).bufferedReader()
You can read more about the I/O package in the Scala Standard Library I/O documentation.
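If you prefer the familiar java.util.Properties API on top of that, here is a minimal sketch (assuming, as above, that the file sits on the classpath under conf/fp.properties; the key name is hypothetical):

import java.util.Properties
import scala.io.Source

// Load the properties file from the classpath into a java.util.Properties,
// so values can be looked up by key just as in a Java application.
val props = new Properties()
val source = Source.fromURL(getClass.getResource("conf/fp.properties"))
try props.load(source.bufferedReader()) finally source.close()

val someValue = props.getProperty("some.key", "defaultValue")  // hypothetical key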
If you are looking to provide Spark properties, there are different ways of doing that, e.g. providing them at the time you submit the Spark job.
Hope this helps.

Here is what we do:
1. Use scopt.OptionParser to parse command line arguments.
2. Key/value arguments given via the conf option are replicated to System.properties.
3. The command line argument config-file is used to read the config file (using the Spark context to be able to read from S3/HDFS, with a custom code path to be able to read from jar resources).
4. The config file is parsed using com.typesafe.config.ConfigFactory.
5. Default configs from resources and from the read file are combined using the withFallback mechanism. The order is important since we want Typesafe Config to use values from (2) to override those from the files.
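A minimal sketch of steps 4 and 5, assuming the file content has already been read into a string and that the bundled defaults live in a reference.conf on the classpath (the method and resource names here are placeholders):

import com.typesafe.config.{Config, ConfigFactory}

// fileContent: raw text of the config file read from S3/HDFS or from jar resources (step 3)
def buildConfig(fileContent: String): Config = {
  val fromFile = ConfigFactory.parseString(fileContent)
  val defaults = ConfigFactory.parseResources("reference.conf")
  // System properties (step 2) override the file, which overrides the bundled defaults.
  ConfigFactory.systemProperties()
    .withFallback(fromFile)
    .withFallback(defaults)
    .resolve()
}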

There are three ways to determine properties for Spark:
Spark Properties in SparkConf (original spec):
Spark properties control most application settings and are configured
separately for each application. These properties can be set directly
on a SparkConf passed to your SparkContext.
Dynamically Loading Spark Properties (original spec); this avoids hard-coding certain configurations in a SparkConf:
./bin/spark-submit --name "My app" --master local[*] --conf spark.eventLog.enabled=false
--conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
Overriding spark-defaults.conf, the default Spark properties file (original spec).
I listed these by priority: SparkConf has the highest priority and spark-defaults.conf has the lowest priority. For more details check this post.
If you want to store all properties in a single place, just use Typesafe Config. Typesafe Config removes the need to handle input streams for reading the file yourself, and it's a widely used approach in Scala apps.
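For the first option, a minimal sketch of setting properties directly on a SparkConf (the app name and master here are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// Properties set directly on the SparkConf take the highest precedence.
val conf = new SparkConf()
  .setAppName("My app")                   // placeholder app name
  .setMaster("local[*]")                  // placeholder master
  .set("spark.eventLog.enabled", "false")

val sc = new SparkContext(conf)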

Related


What is the difference between Spark-submit "--master" defined in the CLI and spark application code, defining the master?
In Spark we can specify the master URI in either the application code like below:
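(For example, something along these lines, with a placeholder master URL:)

import org.apache.spark.sql.SparkSession

// Master URL hard-coded in the application when building the session.
val spark = SparkSession.builder()
  .appName("MyApp")                    // placeholder app name
  .master("spark://localhost:7077")    // placeholder master URL
  .getOrCreate()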
Or we can specify the master URI in the spark-submit as an argument to a parameter, like below:
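(For example, something like this, where the class and jar names are placeholders:)

./bin/spark-submit --master spark://localhost:7077 --class com.example.MyApp myApp.jar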
Does one take precedence over the other? Do they have to agree, so that I have two instances of the same URI referenced in both the spark-submit command and the Spark application code creating the SparkSession? Will one override the other? What will the SparkSession do differently with the master argument, and what will the spark-submit master parameter do differently?
Any help would be greatly appreciated. Thank you!
To quote the official documentation:
The spark-submit script can load default Spark configuration values from a properties file and pass them on to your application. By default, it will read options from conf/spark-defaults.conf in the Spark directory. For more detail, see the section on loading default configurations.
Loading default Spark configurations this way can obviate the need for certain flags to spark-submit. For instance, if the spark.master property is set, you can safely omit the --master flag from spark-submit. In general, configuration values explicitly set on a SparkConf take the highest precedence, then flags passed to spark-submit, then values in the defaults file.
If you are ever unclear where configuration options are coming from, you can print out fine-grained debugging information by running spark-submit with the --verbose option.
So all are valid options, and there is a well-defined hierarchy which defines precedence if the same option is set in multiple places. From highest to lowest:
Explicit settings in the application.
Commandline arguments.
Options from the configuration files.
From the Spark documentation:
In general,
configuration values explicitly set on a SparkConf take the highest precedence,
then flags passed to spark-submit,
then values in the defaults file.
It strikes me that the most flexible approach is flags passed to spark-submit.
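For example (the class and master below are placeholders; --verbose, mentioned in the quote above, prints where each effective setting came from):

./bin/spark-submit --verbose \
  --master spark://localhost:7077 \
  --conf spark.executor.memory=4g \
  --class com.example.MyApp myApp.jar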

How to stop Spark from loading defaults?

When I do a spark-submit, the defaults conf set up in the SPARK_HOME directory is found and loaded into the System properties.
I want to stop the defaults conf from being loaded, and just get the command line arguments, so that I may re-order how spark is configured before creating my spark context.
Is this possible?
There are a couple ways to modify configurations.
According to the spark docs, you can modify configs at runtime with flags (http://spark.apache.org/docs/latest/configuration.html):
The Spark shell and spark-submit tool support two ways to load
configurations dynamically. The first are command line options, such
as --master, as shown above. spark-submit can accept any Spark
property using the --conf flag... Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf.
which means you can kick off your jobs like this:
./bin/spark-submit --conf spark.eventLog.enabled=false --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
Or, you can edit spark-defaults.conf and not have to pass additional flags in your spark-submit command.
Here's a solution I found acceptable for my issue:
Create a blank "blank.conf" file, and supply it to Spark using --properties-file:
${SPARK_HOME}/bin/spark-submit --master local --properties-file "blank.conf" # etc
Spark will use the conf in its configuration instead of finding the defaults conf. You can then manually load up the defaults conf later, before creating your SparkContext, if that's your desire.
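A minimal sketch of that last step, assuming spark-defaults.conf lives in the usual ${SPARK_HOME}/conf location and that keys you have already set explicitly should win:

import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("MyApp")   // your own settings, applied first

// Manually read spark-defaults.conf and only fill in keys that are not already set.
val defaultsPath = Paths.get(sys.env("SPARK_HOME"), "conf", "spark-defaults.conf")
Files.readAllLines(defaultsPath).asScala
  .map(_.trim)
  .filter(line => line.nonEmpty && !line.startsWith("#"))
  .foreach { line =>
    val parts = line.split("\\s+", 2)
    if (parts.length == 2 && !conf.contains(parts(0))) conf.set(parts(0), parts(1).trim)
  }

val sc = new SparkContext(conf)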

Spark: difference or conflicts between setMaster in the app conf and the --master flag on spark-submit

I'm trying to understand the importance of setting the master property when running a spark application.
The cluster location is at the default port of 7077. I'm running this app from a test machine where it will hit an S3 bucket.
Currently the Spark configuration in the app reads:
val sparkConf = new SparkConf()
.setMaster("spark://127.0.0.1:7077")
but I'm also setting the flag on the command line with spark submit:
--master spark://127.0.0.1:7077
So, does having both of these set cause problems? Does one get overridden by the other? Are they both necessary?
So, does having both of these set cause problems? Does one get
overridden by the other? Are they both necessary?
The Spark Configuration page is very clear (emphasis mine):
Any values specified as flags or in the properties file will be passed
on to the application and merged with those specified through
SparkConf. Properties set directly on the SparkConf take highest
precedence, then flags passed to spark-submit or spark-shell, then
options in the spark-defaults.conf file. A few configuration keys have
been renamed since earlier versions of Spark; in such cases, the older
key names are still accepted, but take lower precedence than any
instance of the newer key.
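In other words, having both set to the same URI causes no conflict; if they ever differ, the SparkConf value wins. A quick way to check what was actually used, as a sketch:

import org.apache.spark.{SparkConf, SparkContext}

val sparkConf = new SparkConf()
  .setAppName("MyApp")                     // placeholder name
  .setMaster("spark://127.0.0.1:7077")     // value set in the application code

val sc = new SparkContext(sparkConf)
// Because setMaster was called on the SparkConf, this prints the in-code value
// even if a different --master was passed to spark-submit.
println(sc.master)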

Best practice for keeping local vs. test vs. production configuration properties in Spark/Scala

We have some input directories which we use to load files and process in Spark SQL.
The directories are, of course, different on local machines vs. test vs. production.
What is the best way to parametrize these,
so that we can build, run tests
and deploy with sbt without having to
a) change much of the configuration settings by hand,
b) have developers use their own configuration settings,
c) have the build target different deployments?
You can choose from many options:
Pass as arguments in spark-submit
Very simple, but won't scale, if the number of settings increases
I'd only use it to pass a single argument which defines the environment (dev, test, prod, ...)
Use property files
Use an argument passed to spark-submit to specify the file to be read from HDFS (example: hdfs://localhost:9000/conf/dev.properties); a sketch follows after this list
Store in a JSON file, and read in as DataFrame
If you want to query the configuration using SQL
Store in a RDBMS, and read in as DataFrame
If you have access to a running RDBMS (or you can install one)
(Possibly there is already an RDBMS, if you have a Hive metastore backed by one)
Offers batch updates/deletes using SQL
Might require some effort if you want high availability
Use a distributed configuration service
If you have access to a running ZooKeeper et al.
In case of ZooKeeper:
You can update values
You can register callbacks if a value changes
Use a key/value store
If you have access to Infinispan, Redis, Memcached et al.
For example, Infinispan provides a distributed, replicated, persistent java.util.Map
There are certainly other options (LDAP for example), but I'd opt for property files: immutable configuration values are normally sufficient, this possibly does not introduce new dependencies, and it's easy to manage from the command line and/or an sbt task.
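A minimal sketch of the property-file option, assuming the first program argument is a path such as hdfs://localhost:9000/conf/dev.properties and that the file is readable through the Hadoop FileSystem API (object and key names are placeholders):

import java.util.Properties
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

object MyJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MyJob").getOrCreate()

    // args(0) selects the environment-specific file,
    // e.g. hdfs://localhost:9000/conf/dev.properties
    val confPath = new Path(args(0))
    val fs = confPath.getFileSystem(spark.sparkContext.hadoopConfiguration)
    val props = new Properties()
    val in = fs.open(confPath)
    try props.load(in) finally in.close()

    val inputDir = props.getProperty("input.dir")    // hypothetical key
    val df = spark.read.parquet(inputDir)            // environment-specific input directory
    df.show()
  }
}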
Well, if you don't want to have your own config or settings package that your Spark application simply reads from, then you could have the application take in the file path as a parameter when submitting the Spark app.
For example:
./bin/spark-submit \
--class org.YourApp \
--master yarn-cluster \ # can also be `yarn-client` for client mode
--executor-memory 20G \
--num-executors 50 \
/path/to/yourApp.jar \
/path/to/directory
And if you use Oozie then you can change the path parameter in the XML as needed.

spark on yarn; how to send metrics to graphite sink?

I am new to Spark and we are running Spark on YARN. I can run my test applications just fine. I am trying to collect the Spark metrics in Graphite. I know what changes to make to the metrics.properties file. But how will my Spark application see this conf file?
/xxx/spark/spark-0.9.0-incubating-bin-hadoop2/bin/spark-class org.apache.spark.deploy.yarn.Client --jar /xxx/spark/spark-0.9.0-incubating-bin-hadoop2/examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.0-incubating.jar --addJars "hdfs://host:port/spark/lib/spark-assembly_2.10-0.9.0-incubating-hadoop2.2.0.jar" --class org.apache.spark.examples.Test --args yarn-standalone --num-workers 50 --master-memory 1024m --worker-memory 1024m --args "xx"
Where should I be specifying the metrics.properties file?
I made these changes to it:
*.sink.Graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.Graphite.host=machine.domain.com
*.sink.Graphite.port=2003
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
I have found a different solution to the same problem. It looks like Spark can also take these metrics settings from its config properties. For example, the following line from metrics.properties:
*.sink.Graphite.class=org.apache.spark.metrics.sink.GraphiteSink
Can also be specified as a Spark property with key spark.metrics.conf.*.sink.graphite.class and value org.apache.spark.metrics.sink.GraphiteSink. You just need to prepend spark.metrics.conf. to each key.
I have ended up putting all these settings in the code like this:
val sparkConf = new spark.SparkConf()
.set("spark.metrics.conf.*.sink.graphite.class", "org.apache.spark.metrics.sink.GraphiteSink")
.set("spark.metrics.conf.*.sink.graphite.host", graphiteHostName)
// etc.
val sc = new spark.SparkContext(sparkConf)
This way I've got the metrics sink set up for both the driver and the executors. I was using Spark 1.6.0.
I struggled with the same thing. I have it working using these flags:
--files=/path/to/metrics.properties --conf spark.metrics.conf=metrics.properties
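(A full command might look something like this, with placeholder class, jar and path names:)

./bin/spark-submit \
  --class com.example.MyApp \
  --files /path/to/metrics.properties \
  --conf spark.metrics.conf=metrics.properties \
  myApp.jar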
It's tricky because the --files flag makes it so your /path/to/metrics.properties file ends up in every executor's local disk space as metrics.properties; AFAIK there's no way to specify more complex directory structure there, or have two files with the same basename.
Relatedly, I filed SPARK-5152 about letting the spark.metrics.conf file be read from HDFS, but that seems like it would require a fairly invasive change, so I'm not holding my breath on that one.