How to pass multiple environment variable for spark in cluster mode? - scala

I have a spark setup currently running in PROD.
In which we are calling spark submit using shell script.
We export some variables inside shell script before spark submit in yarn client mode.
Those exported variable will refer inside scala program using "System.getenv(<variable_name_exportrd>)".
Problem:
Now the problem is we are switching to yarn cluster mode in spark-submit.
If we submit that job using Cluster mode. Those exported variables coming as null inside the program.
As per below blog, if I use "spark.yarn.appMasterEnv." i am able to access those exported variables. We are exporting nearly 40 variables in shell script. So buliding --conf for 40 variables is tedious task. (Variable changes dynamically)
How to pass environment variables to spark driver in cluster mode with spark-submit
Now my question is : Is there a way to specify multiple environment variables in a file and pass that in spark-submit.
This makes code change very less.
Please help. Thanks in advance.

Related

What is the difference between defining Spark Master in the CLI vs defining 'master' in the Spark application code?

What is the difference between Spark-submit "--master" defined in the CLI and spark application code, defining the master?
In Spark we can specify the master URI in either the application code like below:
Or we can specify the master URI in the spark-submit as an argument to a parameter, like below:
Does one take precendence over the other? Do they have to agree contractually, so I have two instances of the same URI referenced in the program spark-submit and the spark application code, creating the SparkSession? Will one override the other? What will the SparkSession do differently with the master argument, and what will the spark-submit master parameter do differently?
Any help would be greatly appreciated. Thank you!
To quote the official documentation
The spark-submit script can load default Spark configuration values from a properties file and pass them on to your application. By default, it will read options from conf/spark-defaults.conf in the Spark directory. For more detail, see the section on loading default configurations.
Loading default Spark configurations this way can obviate the need for certain flags to spark-submit. For instance, if the spark.master property is set, you can safely omit the --master flag from spark-submit. In general, configuration values explicitly set on a SparkConf take the highest precedence, then flags passed to spark-submit, then values in the defaults file.
If you are ever unclear where configuration options are coming from, you can print out fine-grained debugging information by running spark-submit with the --verbose option.
So all are valid options, and there is a well defined hierarchy which defines precedence if the same option is set in multiple place. From highest to lowest:
Explicit settings in the application.
Commandline arguments.
Options from the configuration files.
From the Spark documentation:
In general,
configuration values explicitly set on a SparkConf take the highest precedence,
then flags passed to spark-submit,
then values in the defaults file.
It strikes me the most flexible approach is flags passed to spark-submit.

Spark on YARN and spark-bigquery connector

I have developed a Scala Spark application for streaming data directly into Google BigQuery, using the spark-bigquery connector by Spotify.
Locally it works correctly, I have configured my application as described here https://github.com/spotify/spark-bigquery
val ssc = new StreamingContext(sc, Seconds(120))
val sqlContext = new SQLContext(sc)
sqlContext.setGcpJsonKeyFile("/opt/keyfile.json")
sqlContext.setBigQueryProjectId("projectid")
sqlContext.setBigQueryGcsBucket("gcsbucketname")
sqlContext.setBigQueryDatasetLocation("US")
but when I submit the application on my Spark on YARN cluster the job fails looking for GOOGLE_APPLICATION_CREDENTIALS environment variable...
The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials.
I set the variable as OS env var for root user to the .json file containing the credentials required, but it still fails.
I have also tried with the following line
System.setProperty("GOOGLE_APPLICATION_CREDENTIALS", "/opt/keyfile.json")
without success.
Any idea on what I'm missing?
Thank you,
Leonardo
the documentation suggests:
"Environment variables need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName] property in your conf/spark-defaults.conf file.
Environment variables that are set in spark-env.sh will not be reflected in the YARN Application Master process in cluster mode."

How to stop Spark from loading defaults?

When I do a spark-submit, the defaults conf set up in the SPARK_HOME directory is found and loaded into the System properties.
I want to stop the defaults conf from being loaded, and just get the command line arguments, so that I may re-order how spark is configured before creating my spark context.
Is this possible?
There are a couple ways to modify configurations.
According to the spark docs, you can modify configs at runtime with flags (http://spark.apache.org/docs/latest/configuration.html):
The Spark shell and spark-submit tool support two ways to load
configurations dynamically. The first are command line options, such
as --master, as shown above. spark-submit can accept any Spark
property using the --conf flag... Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf.
which means you can kick off your jobs like this:
./bin/spark-submit --conf spark.eventLog.enabled=false --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
OR, you can go edit the spark-defaults.conf and not have to pass additional flags in your spark-submit command.
Here's a solution I found acceptable for my issue:
Create a blank "blank.conf" file, and supply it to spark using --properties
${SPARK_HOME}/bin/spark-submit --master local --properties-file "blank.conf" # etc
Spark will use the conf in its configuration instead of finding the defaults conf. You can then manually load up the defaults conf later, before creating your SparkContext, if that's your desire.

Spark (Scala) Writing (and reading) to local file system from driver

1st Question :
I have a 2 node virtual cluster with hadoop.
I have a jar that runs a spark job.
This jar accepts as a cli argument : a path to a commands.txt file which tells the jar which commands to run.
I run the job with spark-submit, and i have noticed that my slave node wasn't running because it couldn't find the commands.txt file which was local on the master.
This is the command i used to run it :
./spark-1.6.1-bin-hadoop2.6/bin/spark-submit --class
univ.bigdata.course.MainRunner --master yarn\
--deploy-mode cluster --executor-memory 1g \
--num-executors 4 \
final-project-1.0-SNAPSHOT.jar commands commands.txt
Do i need to upload commands.txt to the hdfs and give the hdfs path instead as follows ? :
hdfs://master:9000/user/vagrant/commands.txt
2nd Question :
How do i write to a file on the driver machine in the cwd ?
I used a normal scala filewriter to write output to queries_out.txt and it worked fine when using spark submit with
-master local[]
But, when running in
-master yarn
I cant find the file, No exceptions are thrown but i just cant locate the file. It doesn't exist as if it was never written. Is there a way to write the results to a file on the driver machine locally ? Or should i only write results to HDFS ?
Thanks.
Question 1: Yes, uploading it to hdfs or any network accessible file system is how you solve your problem.
Question 2:
This is a bit tricky. Assuming your results are in a RDD you could call collect(), that will aggregate all the data on your driver process. Then, you have a standard collection in your hands which you could simply write on disk. Note that you should give your driver's process enough memory to be able to hold all results in memory, do not forget to also increase the maximum result size. The parameters are:
--driver-memory 16G
--conf "spark.driver.maxResultSize=15g"
This is has absolutely poor scaling behaviour in both communication complexity and memory (both in the size of the result RDD). This is the easiest way and perfectly fine for a toy project or when the data set is always small. In all other cases it will certainly blow up at some point.
The better way, as you may have mentioned, is to use the build-in "saveAs" methods to write to i.e. hdfs (or another storage format). You can check the documentation for that: http://spark.apache.org/docs/latest/programming-guide.html#actions
Note that if you only want to persist the RDD, because you are reusing it in several computations (like cache, but instead of holding it in memory hold it in disk) there is also a persist method on RDDs.
Solution was very simple, i changed --deploy-mode cluster to --deploy-mode client and then the file writes were done correctly on the machine where i ran the driver.
Answer to Question 1:
Submitting spark job with the --files tag followed by path to a local file downloads the file from the driver node to the cwd of all the worker nodes and thus be accessed just by using its name.

Is there a way to submit a Spark application to a cluster without creating a jar?

I am building a service which takes in a string of Spark code to execute on a cluster. Is there any way for me to set the Spark context to the cluster and execute without building a jar and submitting it?
Indeed, you can use the spark shell, or look at something like the IBM Spark Kernel, Zeppelin, etc. to have a long running Spark Context you can submit code to and have it run. As you are almost certainly already aware of, be very careful with accepting strings and executing them on the cluster (e.g. only from a trusted source).