Best practice for keeping local vs. test vs. production configuration properties in Spark/Scala - scala

We have some input directories which we use to load files and process in Spark SQL.
The directories are, of course, different on local machines vs. test vs. production.
What is the best way to parametrize these,
so that we can build, run tests
and deploy with sbt without having to
a) change much of the configuration settings by hand,
b) have developers use their own configuration settings,
c) have build target different deployments?

You can choose from many options:
Pass as arguments in spark-submit
Very simple, but won't scale, if the number of settings increases
I'd only use it to pass a single argument which defines the environment (dev, test, prod, ...)
Use property files
Use an argument passed to spark-submit to speficy the file to be read from HDFS (example: hdfs://localhost:9000/conf/dev.properties)
Store in a JSON file, and read in as DataFrame
If you want to query the configuration using SQL
Store in a RDBMS, and read in as DataFrame
If you have access to a running RDBMS (or you can install one)
Possibly there is already a RDBMS, if you have a Hive metastore backed by one)
Offers batch updates/deletes using SQL
Might require some effort if you want high availability
Use a distributed configuration service
If you have access to a running ZooKeeper et. al.
In case of ZooKeeper:
You can update values
You can register call backs, if a values changes
Use a key/value store
If you have access to Infinispan, Redis, Memcached et. al.
For example, Infinispan provides a distributed, replicated, persistent java.util.Map
There are certainly other options (LDAP for example), but I'd opt for properties: Immutable configuration values are normally sufficient, it possibly does not introduce new dependencies, and it's easy to manage from the command line and/or an sbt task.

Well if you don't want to have your own config or settings package that your Spark application will simply read off then you could have the application take in the file path as a parameter when submitting the spark app.
For example:
./bin/spark-submit \
--class org.YourApp \
--master yarn-cluster \ # can also be `yarn-client` for client mode
--executor-memory 20G \
--num-executors 50 \
/path/to/yourApp.jar \
/path/to/directory
And if you use Oozie then you can change the path parameter in the XML as needed.

Related

What is the difference between defining Spark Master in the CLI vs defining 'master' in the Spark application code?

What is the difference between Spark-submit "--master" defined in the CLI and spark application code, defining the master?
In Spark we can specify the master URI in either the application code like below:
Or we can specify the master URI in the spark-submit as an argument to a parameter, like below:
Does one take precendence over the other? Do they have to agree contractually, so I have two instances of the same URI referenced in the program spark-submit and the spark application code, creating the SparkSession? Will one override the other? What will the SparkSession do differently with the master argument, and what will the spark-submit master parameter do differently?
Any help would be greatly appreciated. Thank you!
To quote the official documentation
The spark-submit script can load default Spark configuration values from a properties file and pass them on to your application. By default, it will read options from conf/spark-defaults.conf in the Spark directory. For more detail, see the section on loading default configurations.
Loading default Spark configurations this way can obviate the need for certain flags to spark-submit. For instance, if the spark.master property is set, you can safely omit the --master flag from spark-submit. In general, configuration values explicitly set on a SparkConf take the highest precedence, then flags passed to spark-submit, then values in the defaults file.
If you are ever unclear where configuration options are coming from, you can print out fine-grained debugging information by running spark-submit with the --verbose option.
So all are valid options, and there is a well defined hierarchy which defines precedence if the same option is set in multiple place. From highest to lowest:
Explicit settings in the application.
Commandline arguments.
Options from the configuration files.
From the Spark documentation:
In general,
configuration values explicitly set on a SparkConf take the highest precedence,
then flags passed to spark-submit,
then values in the defaults file.
It strikes me the most flexible approach is flags passed to spark-submit.

Best Practice for properties in ScalaSpark

I'm starting a project using Hadoop Spark. I'll be developing in Scala.
I'm creating the project from scratch and I was wondering what to do with properties.
I come from a Java Background where I use .properties file and load them at the start. Then I have a class used to access the different value of my properties.
Is this also a good practice in Scala ?
Tried googling, but there isn't anything relating to this.
You can read the properties file in scala similar to Java
import scala.io.Source.fromUrl
val reader = fromURL(getClass.getResource("conf/fp.properties")).bufferedReader()
You can read more about I/O package at Scala Standard Library I/O
If you are looking to provide spark properties then that have different way of doing it e.g. providing them at time when you submit spark job.
Hoping this helps.
Here we do:
scopt.OptionParser to parse command line arguments.
key/value arguments conf are replicated to System.properties
command line arg config-file is used to read config file (using spark context to be able to read from S3/HDFS with custom code path to be able to read from jar resources)
config file parsed using com.typesafe.config.ConfigFactory.
Default configs from resources and from read file are combined using the withFallback mechanism. The order is important since we want typesafe to use values from (2) to override thoses from the files.
There are three ways to determine properties for Spark:
Spark Propertis in SparkConf original spec:
Spark properties control most application settings and are configured
separately for each application. These properties can be set directly
on a SparkConf passed to your SparkContext.
Dynamically Loading Spark Properties original spec, it avoids hard-coding certain configurations in a SparkConf:
./bin/spark-submit --name "My app" --master local[*] --conf spark.eventLog.enabled=false
--conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
Overriding spark-defaults.conf — Default Spark Properties File - original spec
I described properties by priority - SparkConf has the highest priority and spark-conf has the lowest priority. For more details check this post
If you want to store all properties in the single place, just you Typesafe Config. Typesafe Config gets rid of using input streams for reading file, it's widely used the approach in scala app.

Spark (Scala) Writing (and reading) to local file system from driver

1st Question :
I have a 2 node virtual cluster with hadoop.
I have a jar that runs a spark job.
This jar accepts as a cli argument : a path to a commands.txt file which tells the jar which commands to run.
I run the job with spark-submit, and i have noticed that my slave node wasn't running because it couldn't find the commands.txt file which was local on the master.
This is the command i used to run it :
./spark-1.6.1-bin-hadoop2.6/bin/spark-submit --class
univ.bigdata.course.MainRunner --master yarn\
--deploy-mode cluster --executor-memory 1g \
--num-executors 4 \
final-project-1.0-SNAPSHOT.jar commands commands.txt
Do i need to upload commands.txt to the hdfs and give the hdfs path instead as follows ? :
hdfs://master:9000/user/vagrant/commands.txt
2nd Question :
How do i write to a file on the driver machine in the cwd ?
I used a normal scala filewriter to write output to queries_out.txt and it worked fine when using spark submit with
-master local[]
But, when running in
-master yarn
I cant find the file, No exceptions are thrown but i just cant locate the file. It doesn't exist as if it was never written. Is there a way to write the results to a file on the driver machine locally ? Or should i only write results to HDFS ?
Thanks.
Question 1: Yes, uploading it to hdfs or any network accessible file system is how you solve your problem.
Question 2:
This is a bit tricky. Assuming your results are in a RDD you could call collect(), that will aggregate all the data on your driver process. Then, you have a standard collection in your hands which you could simply write on disk. Note that you should give your driver's process enough memory to be able to hold all results in memory, do not forget to also increase the maximum result size. The parameters are:
--driver-memory 16G
--conf "spark.driver.maxResultSize=15g"
This is has absolutely poor scaling behaviour in both communication complexity and memory (both in the size of the result RDD). This is the easiest way and perfectly fine for a toy project or when the data set is always small. In all other cases it will certainly blow up at some point.
The better way, as you may have mentioned, is to use the build-in "saveAs" methods to write to i.e. hdfs (or another storage format). You can check the documentation for that: http://spark.apache.org/docs/latest/programming-guide.html#actions
Note that if you only want to persist the RDD, because you are reusing it in several computations (like cache, but instead of holding it in memory hold it in disk) there is also a persist method on RDDs.
Solution was very simple, i changed --deploy-mode cluster to --deploy-mode client and then the file writes were done correctly on the machine where i ran the driver.
Answer to Question 1:
Submitting spark job with the --files tag followed by path to a local file downloads the file from the driver node to the cwd of all the worker nodes and thus be accessed just by using its name.

How we can deploy my existing kafka - spark - cassandra project to kafka - dataproc -cassandra in google-cloud-platform?

My existing project is kafka-spark-cassandra. Now I have got gcp account and have to migrate spark jobs to dataproc. In my existing spark jobs parameters like masterip,memory,cores etc are passed through command line which is triggerd by a linux shell script and create new sparkConf.
val conf = new SparkConf(true)
.setMaster(master)
.setAppName("xxxx")
.setJars(List(path+"/xxxx.jar"))
.set("spark.executor.memory", memory)
.set("spark.cores.max",cores)
.set("spark.scheduler.mode", "FAIR")
.set("spark.cassandra.connection.host", cassandra_ip)
1) How this can configure in dataproc?
2) Wheather there will be any compatibility issue b/w Spark 1.3(existing project) and Spark 1.6 provided by dataproc ? How it can resolve?
3) Is there any other connector needed for dataproc to get connected with Kafka and cassandra? I couldnt find any.
1) When submitting a job, you can specify arguments and properties: https://cloud.google.com/sdk/gcloud/reference/dataproc/jobs/submit/spark. When determining which properties to set, keep in mind that Dataproc submits Spark jobs in yarn-client mode.
In general, this means you should avoid specifying master directly in code, instead letting it come from the spark.master value inside of spark-defaults.conf, and then your local setup would have that config set to local while Dataproc would automatically have it set to yarn-client with the necessary yarn config settings alongside it.
Likewise, keys like spark.executor.memory, etc., should make use of Spark's first-class command-line if running spark-submit directly:
spark-submit --conf spark.executor.memory=42G --conf spark.scheduler.mode=FAIR
or if submitting to Dataproc with gcloud:
gcloud dataproc jobs submit spark \
--properties spark.executor.memory=42G,spark.scheduler.mode=FAIR
You'll also want to look at the equivalent --jars flags for jars instead of specifying it in code.
2) When building your project to deploy, ensure you exclude spark (e.g., in maven, mark spark as provided). You may hit compatibility issues, but without knowing all APIs in use, I can't say one way or the other. The simplest way to find out is to bump Spark to 1.6.1 in your build config and see what happens.
In general Spark core is considered GA and should thus be mostly backwards compatible in 1.X versions, but the compatibility guidelines didn't apply yet to subprojects like mllib and SparkSQL, so if you use those you're more likely to need to recompile against the newer Spark version.
3) Connectors should either be included in a fat jar, specified as --jars, or installed onto the cluster at creation via initialization actions.

spark-jobserver - managing multiple EMR clusters

I have a production environment that consists of several (persistent and ad-hoc) EMR Spark clusters.
I would like to use one instance of spark-jobserver to manage the job JARs for this environment in general, and be able to specify the intended master right when I POST /jobs, and not permanently in the config file (using master = "local[4]" configuration key).
Obviously I would prefer to have spark-jobserver running on a standalone machine, and not on any of the masters.
Is this somehow possible?
You can write a SparkMasterProvider
https://github.com/spark-jobserver/spark-jobserver/blob/master/job-server/src/spark.jobserver/util/SparkMasterProvider.scala
A complex example is here https://github.com/spark-jobserver/jobserver-cassandra/blob/master/src/main/scala/spark.jobserver/masterLocators/dse/DseSparkMasterProvider.scala
I think all you have to do is write one that will return the config input as spark master, that way you can pass it as part of job config.