Spark - "A master URL must be set in your configuration" when trying to run an app - Scala

I know this question has a duplicate, but my use case is a little specific. I want to run my Spark job (compiled to a .jar) on EMR (via spark-submit) and pass two options like this:
spark-submit --master yarn --deploy-mode cluster <rest of command>
To achieve this, I wrote the code like this:
val sc = new SparkContext(new SparkConf())
val spark = SparkSession.builder.config(sc.getConf).getOrCreate()
However, this gives the following error while building the jar:
org.apache.spark.SparkException: A master URL must be set in your configuration
So what's the workaround? How do I set these two variables in code so that the master and deploy-mode options are picked up at submit time, while still being able to use the variables sc and spark in my code (e.g. val x = spark.read())?

You could simply access command-line arguments as below and pass as many values as you want.
val spark = SparkSession.builder().appName("Test App")
.master(args(0))
.getOrCreate()
spark-submit --master yarn --deploy-mode cluster <your-app.jar> master-url
If you need a fancier command-line parser, you can take a look at scopt: https://github.com/scopt/scopt
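If you'd rather keep relying on spark-submit's --master and --deploy-mode flags, here is a minimal sketch of my own (the object name and argument handling are made up): set the master only when it is passed as a program argument, and otherwise defer to whatever spark-submit supplies.
import org.apache.spark.sql.SparkSession
object MyJob {
  def main(args: Array[String]): Unit = {
    val builder = SparkSession.builder().appName("Test App")
    // Only hard-code a master when one is passed as the first program argument
    // (e.g. "local[*]" for IDE runs); otherwise defer to the --master and
    // --deploy-mode flags given to spark-submit.
    val spark = args.headOption.map(builder.master).getOrElse(builder).getOrCreate()
    val sc = spark.sparkContext // sc and spark are then usable as usual, e.g. spark.read
    spark.stop()
  }
}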

Related

Workflow and Scheduling Framework for Spark with Scala in Maven, done with IntelliJ IDEA

I have created a Spark project with Scala. It's a Maven project with all dependencies configured in the POM.
I am using Spark for ETL: the source is a file generated by an API, all kinds of transformations happen in Spark, and the result is then loaded into Cassandra.
Is there any workflow software that can use the jar to automate the process, with email triggering on success or failure of the job flow?
Can someone please help me? Can Airflow be used for this purpose? I have used Scala and NOT Python.
Kindly share your thoughts.
There is no built-in mechanism in Spark that will help. A cron job seems reasonable for your case. If you find yourself continuously adding dependencies to the scheduled job, try Azkaban.
One example of such a shell script is:
#!/bin/bash
# Usage: <script> <main-class> <master> "<extra spark-submit args>" "<application args>"
cd /locm/spark_jobs

export SPARK_HOME=/usr/hdp/2.2.0.0-2041/spark
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HADOOP_USER_NAME=hdfs
export HADOOP_GROUP=hdfs
#export SPARK_CLASSPATH=$SPARK_CLASSPATH:/locm/spark_jobs/configs/*

CLASS=$1        # fully qualified main class
MASTER=$2       # e.g. yarn, local[*], spark://host:7077
ARGS=$3         # extra spark-submit options
CLASS_ARGS=$4   # arguments passed to the application itself

echo "Running $CLASS with master: $MASTER, spark-submit args: $ARGS, class args: $CLASS_ARGS"

$SPARK_HOME/bin/spark-submit --class "$CLASS" --master "$MASTER" --num-executors 4 --executor-cores 4 $ARGS "application jar file" $CLASS_ARGS
You can also try spark-launcher, which can be used to start a Spark application programmatically:
First, create a sample Spark application and build a jar file for it.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object SparkApp extends App {
  val conf = new SparkConf().setMaster("local[*]").setAppName("spark-app")
  val sc = new SparkContext(conf)
  val rdd = sc.parallelize(Array(2, 3, 2, 1))
  rdd.saveAsTextFile("result")
  sc.stop()
}
This is our simple Spark application. Make a jar of this application using sbt assembly. Now we write a Scala application that launches this Spark application as follows:
import org.apache.spark.launcher.SparkLauncher

object Launcher extends App {
  val spark = new SparkLauncher()
    .setSparkHome("/home/knoldus/spark-1.4.0-bin-hadoop2.6")
    .setAppResource("/home/knoldus/spark_launcher-assembly-1.0.jar")
    .setMainClass("SparkApp")
    .setMaster("local[*]")
    .launch()

  spark.waitFor()
}
In the above code we create a SparkLauncher object and set values on it:
setSparkHome("/home/knoldus/spark-1.4.0-bin-hadoop2.6") sets the Spark home, which is used internally to call spark-submit.
.setAppResource("/home/knoldus/spark_launcher-assembly-1.0.jar") specifies the jar of our Spark application.
.setMainClass("SparkApp") sets the entry point of the Spark program, i.e. the driver program.
.setMaster("local[*]") sets the address of the master; here we run it on the local machine.
.launch() simply starts our Spark application.
This is the minimal requirement; you can also set many other configurations, such as passing arguments, adding jars, setting configuration properties, etc.
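For example, a sketch of a launcher that also forwards application arguments and driver settings (the paths and values below are placeholders, not from the original post):
import org.apache.spark.launcher.SparkLauncher
object LauncherWithArgs extends App {
  val process = new SparkLauncher()
    .setSparkHome("/opt/spark")                       // placeholder Spark home
    .setAppResource("/path/to/your-app-assembly.jar") // placeholder assembly jar
    .setMainClass("SparkApp")
    .setMaster("local[*]")
    .setConf(SparkLauncher.DRIVER_MEMORY, "2g") // i.e. spark.driver.memory
    .addJar("/path/to/extra-dependency.jar")    // extra jar on the classpath
    .addAppArgs("arg1", "arg2")                 // arguments forwarded to SparkApp's main
    .launch()
  process.waitFor()
}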

Set spark.driver.memory for Spark running inside a web application

I have a REST API in Scala Spray that triggers Spark jobs like the following:
path("vectorize") {
get {
parameter('apiKey.as[String]) { (apiKey) =>
if (apiKey == API_KEY) {
MoviesVectorizer.calculate() // Spark Job run in a Thread (returns Future)
complete("Ok")
} else {
complete("Wrong API KEY")
}
}
}
}
I'm trying to find a way to specify the Spark driver memory for these jobs. As I found, configuring driver.memory from within the application code doesn't affect anything.
The whole web application, along with Spark, is packaged in a fat jar.
I run it with
java -jar app.jar
Thus, as I understand it, spark-submit is not relevant here (or is it?), so I cannot pass the --driver-memory option when running the app.
Is there any way to set the driver memory for Spark within the web app?
Here's my current Spark configuration:
val spark: SparkSession = SparkSession.builder()
  .appName("Recommender")
  .master("local[*]")
  .config("spark.mongodb.input.uri", uri)
  .config("spark.mongodb.output.uri", uri)
  .config("spark.mongodb.keep_alive_ms", "100000")
  .getOrCreate()
spark.conf.set("spark.executor.memory", "10g")
val sc = spark.sparkContext
sc.setCheckpointDir("/tmp/checkpoint/")
val sqlContext = spark.sqlContext
As it is said in the documentation, the Spark UI Environment tab shows only the variables that are affected by the configuration. Everything I set is there - apart from spark.executor.memory.
This happens because you use local mode. In local mode there is no real executor - all Spark components run in a single JVM with a single heap configuration, so executor-specific settings don't matter.
spark.executor options are applicable only when an application is submitted to a cluster.
Also, Spark supports only a single application per JVM instance. This means that all core Spark properties are applied only when the SparkContext is initialized, and they persist as long as the context (not the SparkSession) is kept alive. Since SparkSession initializes the SparkContext, no additional "core" settings can be applied after getOrCreate.
This means that all "core" options should be provided using the config method of SparkSession.builder.
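For illustration, a minimal sketch (mine, not from the answer above) of keeping all "core" options on the builder so they are in place when getOrCreate creates the SparkContext; in local mode the driver heap is simply the web app's JVM heap, so it is set with -Xmx on the java command line (e.g. java -Xmx10g -jar app.jar) rather than through Spark properties:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
object RecommenderSession {
  // All "core" properties go through the SparkConf/builder before getOrCreate;
  // values set later via spark.conf.set are ignored for properties like
  // spark.executor.memory.
  def build(uri: String): SparkSession = {
    val conf = new SparkConf()
      .setAppName("Recommender")
      .setMaster("local[*]") // hard-coded only because the app is launched with plain `java -jar`
      .set("spark.mongodb.input.uri", uri)
      .set("spark.mongodb.output.uri", uri)
      .set("spark.mongodb.keep_alive_ms", "100000")
    SparkSession.builder().config(conf).getOrCreate()
  }
}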
If you're looking for alternatives to embedding, you can check an exemplary answer to Best Practice to launch Spark Applications via Web Application? by T. Gawęda.
Note: Officially Spark doesn't support applications running outside spark-submit and there are some elusive bugs related to that.

How to use spark-submit's --properties-file option to launch Spark application in IntelliJ IDEA?

I'm starting a project using Spark, developed with Scala and the IntelliJ IDE.
I was wondering how to set --properties-file with a specific Spark configuration in the IntelliJ run configuration.
I'm reading the configuration like this: "param1" -> sc.getConf.get("param1")
When I execute the Spark job from the command line it works like a charm:
/opt/spark/bin/spark-submit --class "com.class.main" --master local --properties-file properties.conf ./target/scala-2.11/main.jar arg1 arg2 arg3 arg4
The problem is when I execute the job using an IntelliJ run configuration with VM options:
I succeeded with the --master param as -Dspark.master=local
I succeeded with --conf params as -Dspark.param1=value1
I failed with --properties-file
Can anyone point me at the right way to set this up?
I don't think it's possible to use --properties-file to launch a Spark application from within IntelliJ IDEA.
spark-submit is the shell script used to submit a Spark application for execution; it does a few extra things to create a proper submission environment for the Spark application.
You can however mimic the behaviour of --properties-file by leveraging conf/spark-defaults.conf that a Spark application loads by default.
You could create a conf/spark-defaults.conf under src/test/resources (or src/main/resources) with the content of properties.conf. That is supposed to work.
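Alternatively (my own workaround sketch, not part of the answer above), you could load the properties file yourself and apply it to the SparkConf before the context is created when running the main class from IntelliJ; the file path and property names here are assumptions:
import java.io.FileInputStream
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
object LocalRunner {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.load(new FileInputStream("properties.conf")) // assumed to sit in the working directory
    val conf = new SparkConf().setIfMissing("spark.master", "local[*]")
    props.asScala.foreach { case (k, v) => conf.set(k, v.trim) } // apply every key/value pair
    val spark = SparkSession.builder().config(conf).getOrCreate()
    println(spark.sparkContext.getConf.get("param1")) // read back a custom property
    spark.stop()
  }
}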

A master URL must be set in your configuration (Spark Scala on AWS)

This is what I wrote via IntelliJ. I plan on eventually writing larger Spark Scala files.
Anyway, I uploaded it to an AWS cluster that I had made. The "master" line, line 11, was master("local"). I ran into this error.
The second picture shows the error that was returned by AWS when it did not run successfully. I changed line 11 to "yarn" instead of local (see the first picture for its current state).
It still returns the same error. I put in the following flags when I uploaded it manually:
--steps Type=CUSTOM_JAR,Name="SimpleApp"
It worked two weeks ago. My friend did almost exactly the same thing as me. I am not sure why it isn't working.
I am looking for both a brief explanation and an answer. It looks like I need a little more knowledge of how Spark works.
I am working with Amazon EMR.
I think on line 9 you are creating the SparkContext with the "old way" approach from Spark 1.6.x and earlier - you need to set the master in the default configuration file (usually conf/spark-defaults.conf) or pass it to spark-submit (it is required by new SparkConf()).
On line 10 you are creating the "spark" context with SparkSession, which is the Spark 2.0.0 approach. So in my opinion your problem is line 9: I think you should remove it and work with SparkSession, or set the required configuration on the SparkContext in case you still need sc.
You can access the SparkContext with sparkSession.sparkContext.
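For instance, a minimal sketch of the SparkSession-only approach (my example; on EMR you would drop .master() entirely and pass --master yarn to spark-submit instead):
import org.apache.spark.sql.SparkSession
object SimpleApp {
  def main(args: Array[String]): Unit = {
    // Create only a SparkSession; do not create a separate SparkContext first.
    val spark = SparkSession.builder()
      .appName("spark-application-name")
      .master("local[*]") // for local testing only
      .getOrCreate()
    val sc = spark.sparkContext // reuse the session's context if you need the RDD API
    println(s"Running against master: ${sc.master}")
    spark.stop()
  }
}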
If you still want to use SparkConf, you need to define the master programmatically:
val sparkConf = new SparkConf()
  .setAppName("spark-application-name")
  .setMaster("local[4]")
  .set("spark.executor.memory", "512m")
or with a declarative approach in conf/spark-defaults.conf:
spark.master local[4]
spark.executor.memory 512m
or simply at runtime:
./bin/spark-submit --name "spark-application-name" --master local[4] --executor-memory 512m your-spark-job.jar
Try using the code below:
val spark = SparkSession.builder().master("spark://ec2-xx-xxx-xxx-xxx.compute-1.amazonaws.com:xxxx").appName("example").getOrCreate()
You need to provide the proper URL of your AWS cluster.

Spark on standalone cluster throws java.lang.IllegalStateException

I have an app that reads data from MongoDB.
It runs well in local mode; however, it throws java.lang.IllegalStateException when I use standalone cluster mode.
In local mode the SparkContext is val sc = new SparkContext("local", "Scala Word Count")
In standalone cluster mode the SparkContext is val sc = new SparkContext() and the submit shell is ./spark-submit --class "xxxMain" /usr/local/jarfile/xxx.jar --master spark://master:7077
It tries 4 times and then throws the error when it reaches the first action.
My code:
// configOriginal is a Hadoop Configuration created earlier in the app
configOriginal.set("mongo.input.uri", "mongodb://172.16.xxx.xxx:20000/xxx.Original")
configOriginal.set("mongo.output.uri", "mongodb://172.16.xxx.xxx:20000/xxx.sfeature")
val mongoRDDOriginal = sc.newAPIHadoopRDD(configOriginal,
  classOf[com.mongodb.hadoop.MongoInputFormat], classOf[Object], classOf[BSONObject])
I learned from this example: mongo-spark.
I searched and someone said it was because of mongo-hadoop-core-1.3.2, but whether I upgraded to mongo-hadoop-core-1.4.0 or downgraded to mongo-hadoop-core-1.3.1, it didn't work.
Please help me!
Finally, I got the solution.
Each of my workers has many cores, and mongo-hadoop-core-1.3.2 doesn't support multiple threads; this was fixed in mongo-hadoop-core-1.4.0. The reason my app still got the error was the IntelliJ IDEA cache. You should also add the mongo-java-driver dependency.
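For reference, a sketch of the corresponding sbt dependencies (the coordinates are the usual ones for these libraries, but the mongo-java-driver version is an assumption; adjust to your setup):
// build.sbt (sketch)
libraryDependencies ++= Seq(
  "org.mongodb.mongo-hadoop" % "mongo-hadoop-core" % "1.4.0",
  "org.mongodb" % "mongo-java-driver" % "3.0.4" // version is an assumption
)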