How to use spark-submit's --properties-file option to launch a Spark application in IntelliJ IDEA?

I'm starting a Spark project developed in Scala with the IntelliJ IDE.
I was wondering how to set --properties-file with a specific Spark configuration in the IntelliJ run configuration.
I'm reading configuration values like this: "param1" -> sc.getConf.get("param1")
When I execute the Spark job from the command line it works like a charm:
/opt/spark/bin/spark-submit --class "com.class.main" --master local --properties-file properties.conf ./target/scala-2.11/main.jar arg1 arg2 arg3 arg4
The problem is when I execute the job from an IntelliJ Run Configuration using VM Options:
I succeeded with the --master param as -Dspark.master=local
I succeeded with --conf params as -Dspark.param1=value1
I failed with --properties-file
Can anyone point me to the right way to set this up?

I don't think it's possible to use --properties-file to launch a Spark application from within IntelliJ IDEA.
spark-submit is the shell script used to submit a Spark application for execution, and it does a few extra things to create a proper submission environment for the Spark application.
You can, however, mimic the behaviour of --properties-file by leveraging conf/spark-defaults.conf, which a Spark application loads by default.
You could create a conf/spark-defaults.conf under src/test/resources (or src/main/resources) with the content of properties.conf. That is supposed to work.
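If you'd rather keep a single properties.conf for both the command line and the IDE, another option is to load the file yourself before building the SparkConf. This is only a minimal sketch, assuming the file uses standard key=value (or key value) Java properties syntax; the object name and file path are illustrative, and spark.master must still come either from the file or from -Dspark.master=local in the run configuration:

import java.io.FileInputStream
import java.util.Properties

import scala.collection.JavaConverters._

import org.apache.spark.{SparkConf, SparkContext}

object LocalRunner {
  def main(args: Array[String]): Unit = {
    // Load the same properties.conf you pass to spark-submit (path is illustrative)
    val props = new Properties()
    val in = new FileInputStream("properties.conf")
    try props.load(in) finally in.close()

    // Copy every entry into the SparkConf, mimicking what --properties-file would do
    val conf = new SparkConf()
    props.asScala.foreach { case (key, value) => conf.set(key, value.trim) }

    val sc = new SparkContext(conf)
    println(sc.getConf.get("param1")) // same lookup as in the question
    sc.stop()
  }
}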

Related

Spark - "A master URL must be set in your configuration" when trying to run an app

I know this question has a duplicate, but my use case is a little specific. I want to run my Spark job (compiled to a .jar) on EMR (via spark-submit) and pass 2 options like this:
spark-submit --master yarn --deploy-mode cluster <rest of command>
To achieve this, I wrote the code like this:
val sc = new SparkContext(new SparkConf())
val spark = SparkSession.builder.config(sc.getConf).getOrCreate()
However, this gives the following error when building the jar:
org.apache.spark.SparkException: A master URL must be set in your configuration
So what's a workaround? How do I set these 2 variables in code so that the master and deploy-mode options are picked up at submit time, while still being able to use the variables sc and spark in my code (e.g. val x = spark.read())?
You can access command-line arguments as shown below and pass as many values as you want.
val spark = SparkSession.builder().appName("Test App")
.master(args(0))
.getOrCreate()
spark-submit --master yarn --deploy-mode cluster <your-app.jar> <master-url>
If you need a fancier command-line parser, take a look at scopt: https://github.com/scopt/scopt
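A related sketch (an assumption on my part, not part of the answer above): if you omit .master(...) entirely, whatever you pass to spark-submit via --master and --deploy-mode is picked up automatically, and an explicit master is only needed for IDE/local runs. The "--local" flag below is invented for this example:

import org.apache.spark.sql.SparkSession

object TestApp {
  def main(args: Array[String]): Unit = {
    // No .master(...): when launched via spark-submit, the master and deploy mode
    // are taken from --master / --deploy-mode on the command line.
    val builder = SparkSession.builder().appName("Test App")

    // Optional fallback for IDE/local runs; "--local" is a flag made up for this sketch.
    val spark =
      if (args.contains("--local")) builder.master("local[*]").getOrCreate()
      else builder.getOrCreate()

    val sc = spark.sparkContext // still available wherever the question used `sc`
    println(sc.defaultParallelism)
    spark.stop()
  }
}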

Output results of spark-submit

I'm a beginner in Spark and Scala programming. I tried running an example with spark-submit in local mode; it runs to completion without any error or other message, but I can't see any output result in the console or the Spark history web UI. Where and how can I see the results of my program run with spark-submit?
This is the command that I run with Spark:
spark-submit --master local[*] --conf spark.history.fs.logDirectory=/tmp/spark-events --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=/tmp/spark-events --conf spark.history.ui.port=18080 --class com.intel.analytics.bigdl.models.autoencoder.Train dist/lib/bigdl-0.5.0-SNAPSHOT-jar-with-dependencies.jar -f /opt/work/mnist -b 8
and this is a screenshot from the end of the program run
You can also locate your spark-defaults.conf (or copy spark-defaults.conf.template to spark-defaults.conf),
create a logging directory (like /tmp/spark-events/),
and add these 2 lines:
spark.eventLog.enabled true
spark.eventLog.dir file:///tmp/spark-events/
Then run sbin/start-history-server.sh
This makes all jobs run by spark-submit log to the event dir, with their overviews available in the History Server web UI (http://localhost:18080/), without keeping your Spark job running.
More info: https://spark.apache.org/docs/latest/monitoring.html
PS: On macOS via Homebrew, all of this lives under /usr/local/Cellar/apache-spark/[version]/libexec/
Try adding while (true) Thread.sleep(1000) to your code to keep the application (and its web UI) running, then check the Spark tasks in the browser. Normally you should see your application running.
Thanks a lot for your answer. I already made these settings in the spark-submit command using --conf, and I can see the history web UI with spark-class org.apache.spark.deploy.history.HistoryServer, but I don't have access to start-history-server.sh. I see tasks and jobs completed in the history web UI; I checked all the tabs (jobs, stages, storage, executors) and found the output results nowhere.
Can you explain to me where the results are in the history web UI, or even the console? (My goal is the numerical results produced from the dataset passed in the spark-submit command.)
screenshot from the history web UI
Regards
To get the output from spark-submit, you can add the command below to your code.scala file, which we create and save under src/main/scala before running the sbt package command.
code.scala contents:
........
........
result.saveAsTextFile("file:///home/centos/project")
Now run "sbt package" followed by "spark-submit". This will create a project folder at the given location containing two files: part-00000 and _SUCCESS. You can check your output in part-00000.
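For completeness: the history web UI only shows job, stage and executor metadata, never your computed values, so numerical output has to be printed or written by the driver itself. A minimal sketch along those lines (the RDD here is just a stand-in for whatever `result` your program produces):

import org.apache.spark.{SparkConf, SparkContext}

object ShowResults {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ShowResults"))

    // Stand-in for whatever `result` is computed in code.scala
    val result = sc.parallelize(1 to 10).map(_ * 2)

    // These lines appear on stdout in the terminal where spark-submit was launched.
    // Only do this for small results: collect() pulls everything to the driver.
    result.collect().foreach(println)

    // Or write to disk, as in the answer above:
    // result.saveAsTextFile("file:///home/centos/project")

    sc.stop()
  }
}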

How to stop Spark from loading defaults?

When I do a spark-submit, the defaults conf set up in the SPARK_HOME directory is found and loaded into the System properties.
I want to stop the defaults conf from being loaded, and just get the command line arguments, so that I may re-order how spark is configured before creating my spark context.
Is this possible?
There are a couple of ways to modify configurations.
According to the spark docs, you can modify configs at runtime with flags (http://spark.apache.org/docs/latest/configuration.html):
The Spark shell and spark-submit tool support two ways to load configurations dynamically. The first are command line options, such as --master, as shown above. spark-submit can accept any Spark property using the --conf flag... Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf.
which means you can kick off your jobs like this:
./bin/spark-submit --conf spark.eventLog.enabled=false --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
OR, you can go edit the spark-defaults.conf and not have to pass additional flags in your spark-submit command.
Here's a solution I found acceptable for my issue:
Create a blank "blank.conf" file, and supply it to Spark using --properties-file:
${SPARK_HOME}/bin/spark-submit --master local --properties-file "blank.conf" # etc
Spark will use the conf from that file instead of finding the defaults conf. You can then manually load the defaults conf later, before creating your SparkContext, if that's what you want.
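If you then want to fold the defaults back in on your own terms, the "manual load" can be plain properties parsing. A sketch only, assuming SPARK_HOME is set and that command-line values should take precedence:

import java.io.FileInputStream
import java.util.Properties

import scala.collection.JavaConverters._

import org.apache.spark.{SparkConf, SparkContext}

object ReorderedConfig {
  def main(args: Array[String]): Unit = {
    // Command-line values (--conf, --master, ...) are already in the system properties,
    // and the blank --properties-file kept spark-defaults.conf out of the picture.
    val conf = new SparkConf()

    // Now apply the defaults manually, but only for keys not already set, so the
    // command line wins. java.util.Properties accepts the whitespace-separated
    // format used by spark-defaults.conf. The path is illustrative.
    val defaults = new Properties()
    val in = new FileInputStream(sys.env("SPARK_HOME") + "/conf/spark-defaults.conf")
    try defaults.load(in) finally in.close()

    defaults.asScala.foreach { case (key, value) =>
      if (!conf.contains(key)) conf.set(key, value.trim)
    }

    val sc = new SparkContext(conf)
    // ... your job ...
    sc.stop()
  }
}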

How can I deploy my existing Kafka - Spark - Cassandra project to Kafka - Dataproc - Cassandra on Google Cloud Platform?

My existing project is Kafka-Spark-Cassandra. Now I have a GCP account and have to migrate the Spark jobs to Dataproc. In my existing Spark jobs, parameters like the master IP, memory, cores etc. are passed on the command line by a Linux shell script and used to create a new SparkConf:
val conf = new SparkConf(true)
.setMaster(master)
.setAppName("xxxx")
.setJars(List(path+"/xxxx.jar"))
.set("spark.executor.memory", memory)
.set("spark.cores.max",cores)
.set("spark.scheduler.mode", "FAIR")
.set("spark.cassandra.connection.host", cassandra_ip)
1) How can this be configured in Dataproc?
2) Will there be any compatibility issue between Spark 1.3 (existing project) and Spark 1.6 provided by Dataproc? How can it be resolved?
3) Are any other connectors needed for Dataproc to connect to Kafka and Cassandra? I couldn't find any.
1) When submitting a job, you can specify arguments and properties: https://cloud.google.com/sdk/gcloud/reference/dataproc/jobs/submit/spark. When determining which properties to set, keep in mind that Dataproc submits Spark jobs in yarn-client mode.
In general, this means you should avoid specifying master directly in code, instead letting it come from the spark.master value inside of spark-defaults.conf, and then your local setup would have that config set to local while Dataproc would automatically have it set to yarn-client with the necessary yarn config settings alongside it.
Likewise, keys like spark.executor.memory, etc., should make use of Spark's first-class command-line flags if running spark-submit directly:
spark-submit --conf spark.executor.memory=42G --conf spark.scheduler.mode=FAIR
or if submitting to Dataproc with gcloud:
gcloud dataproc jobs submit spark \
--properties spark.executor.memory=42G,spark.scheduler.mode=FAIR
You'll also want to look at the equivalent --jars flag instead of specifying jars in code.
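Putting point 1 together, the SparkConf from the question would shrink to something like this on Dataproc (a sketch; passing the Cassandra host as the first job argument is just an illustration, and memory/cores move to --properties as shown above):

import org.apache.spark.{SparkConf, SparkContext}

object DataprocJob {
  def main(args: Array[String]): Unit = {
    val cassandraIp = args(0) // illustrative: pass the Cassandra host as the first job argument

    // No setMaster / setJars: on Dataproc the master comes from the cluster's
    // spark-defaults.conf (yarn-client), and the jar is the one submitted with gcloud.
    val conf = new SparkConf(true)
      .setAppName("xxxx")
      .set("spark.scheduler.mode", "FAIR")
      .set("spark.cassandra.connection.host", cassandraIp)

    val sc = new SparkContext(conf)
    // ... your Kafka/Cassandra job ...
    sc.stop()
  }
}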
2) When building your project to deploy, ensure you exclude Spark (e.g., in Maven, mark Spark as provided). You may hit compatibility issues, but without knowing all the APIs in use, I can't say one way or the other. The simplest way to find out is to bump Spark to 1.6.1 in your build config and see what happens.
In general, Spark core is considered GA and should thus be mostly backwards compatible across 1.X versions, but the compatibility guidelines didn't yet apply to subprojects like MLlib and Spark SQL, so if you use those you're more likely to need to recompile against the newer Spark version.
3) Connectors should either be included in a fat jar, specified as --jars, or installed onto the cluster at creation via initialization actions.
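For points 2 and 3 together, a typical sbt setup marks Spark itself as provided and ships only the connectors in the fat jar (or via --jars). This is a sketch; the artifact versions are illustrative and should be matched to your cluster:

// build.sbt (sketch; assumes the sbt-assembly plugin for building the fat jar with `sbt assembly`)
name := "kafka-spark-cassandra"

scalaVersion := "2.10.6"

libraryDependencies ++= Seq(
  // Spark is already on the Dataproc cluster, so keep it out of the fat jar
  "org.apache.spark" %% "spark-core"      % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.6.1" % "provided",
  // Connectors are NOT on the cluster, so they go into the assembly (or --jars)
  "org.apache.spark"   %% "spark-streaming-kafka"     % "1.6.1",
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.6.0"
)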

Spark and YARN. How does the SparkPi example use the first argument as the master url?

I'm just starting out learning Spark, and I'm trying to replicate the SparkPi example by copying the code into a new project and building a jar. The source for SparkPi is: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkPi.scala
I have a working YARN cluster (running CDH 5.0.1), and I've uploaded the Spark assembly jar and set its HDFS location in SPARK_JAR.
If I run this command, the example works:
$ SPARK_CLASSPATH=/usr/lib/spark/examples/lib/spark-examples_2.10-0.9.0-cdh5.0.1.jar /usr/lib/spark/bin/spark-class org.apache.spark.examples.SparkPi yarn-client 10
However, if I copy the source into a new project and build a jar and run the same command (with a different jar and classname), I get the following error:
$ SPARK_CLASSPATH=Spark.jar /usr/lib/spark/bin/spark-class spark.SparkPi yarn-client 10
Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:113)
at spark.SparkPi$.main(SparkPi.scala:9)
at spark.SparkPi.main(SparkPi.scala)
Somehow, the first argument isn't being passed as the master in the SparkContext in my version, but it works fine in the example.
Looking at the SparkPi code, it seems to only expect a single numeric argument.
So is there something about the Spark examples jar file which intercepts the first argument and somehow sets the spark.master property to be that?
This is a recent change — you are running old code in the first case and running new code in the second.
Here is the change: https://github.com/apache/spark/commit/44dd57fb66bb676d753ad8d9757f9f4c03364113
I think this would be the right command now:
/usr/lib/spark/bin/spark-submit --class spark.SparkPi --master yarn-client Spark.jar 10
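For reference, after that commit the example no longer takes the master from args at all; the post-change pattern looks roughly like this (a sketch, not the exact upstream file), with the master supplied entirely by spark-submit:

import scala.math.random

import org.apache.spark.{SparkConf, SparkContext}

object SparkPi {
  def main(args: Array[String]): Unit = {
    // No master in the code and no master in args: spark-submit's --master supplies it.
    val conf = new SparkConf().setAppName("Spark Pi")
    val sc = new SparkContext(conf)

    val slices = if (args.length > 0) args(0).toInt else 2
    val n = 100000 * slices
    val count = sc.parallelize(1 to n, slices).map { _ =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)

    println("Pi is roughly " + 4.0 * count / n)
    sc.stop()
  }
}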