Set spark.driver.memory for Spark running inside a web application - scala

I have a REST API in Scala Spray that triggers Spark jobs like the following:
path("vectorize") {
  get {
    parameter('apiKey.as[String]) { (apiKey) =>
      if (apiKey == API_KEY) {
        MoviesVectorizer.calculate() // Spark Job run in a Thread (returns Future)
        complete("Ok")
      } else {
        complete("Wrong API KEY")
      }
    }
  }
}
I'm trying to find a way to specify the Spark driver memory for these jobs. As far as I can tell, setting spark.driver.memory from within the application code has no effect.
The whole web application along with the Spark is packaged in a fat Jar.
I start it with
java -jar app.jar
So, as I understand it, spark-submit is not relevant here (or is it?), and I cannot pass the --driver-memory option when starting the app.
Is there any way to set the driver memory for Spark within the web app?
Here's my current Spark configuration:
val spark: SparkSession = SparkSession.builder()
  .appName("Recommender")
  .master("local[*]")
  .config("spark.mongodb.input.uri", uri)
  .config("spark.mongodb.output.uri", uri)
  .config("spark.mongodb.keep_alive_ms", "100000")
  .getOrCreate()
spark.conf.set("spark.executor.memory", "10g")
val sc = spark.sparkContext
sc.setCheckpointDir("/tmp/checkpoint/")
val sqlContext = spark.sqlContext
As the documentation says, the Environment tab of the Spark UI shows only the variables that are affected by the configuration. Everything I set is there, apart from spark.executor.memory.

This happens because you use local mode. In local mode there is no real executor: all Spark components run in a single JVM with a single heap configuration, so executor-specific configuration doesn't matter.
spark.executor options apply only when the application is submitted to a cluster.
Also, Spark supports only a single application per JVM instance. This means that all core Spark properties are applied only when the SparkContext is initialized, and persist as long as the context (not the SparkSession) is kept alive. Since SparkSession initializes the SparkContext, no additional "core" settings can be applied after getOrCreate.
This means that all "core" options should be provided using the config method of SparkSession.builder.
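Because the driver in local mode is the JVM that is already running, its heap is whatever that JVM was started with. As a minimal sketch (not part of the original answer; it assumes the heap is set with the standard JVM flag, for example java -Xmx4g -jar app.jar, and the object name HeapCheck is made up for illustration), the application can log the heap it actually received:

```scala
object HeapCheck {
  // Maximum heap the current JVM may use, in megabytes.
  // In local mode this JVM hosts the Spark driver and all tasks.
  def maxHeapMb: Long = Runtime.getRuntime.maxMemory / (1024L * 1024L)

  def main(args: Array[String]): Unit =
    println(s"JVM max heap: $maxHeapMb MB")
}
```

Calling HeapCheck.maxHeapMb at startup, before the SparkSession is built, makes it easy to confirm that the flag passed to java took effect.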
If you're looking for alternatives to embedding, check the exemplary answer to Best Practice to launch Spark Applications via Web Application? by T. Gawęda.
Note: Officially, Spark doesn't support applications running outside spark-submit, and there are some elusive bugs related to that.

Related

How to use TypeSafe config with Apache Spark?

I have a Spark application which I am trying to package as a fat jar and deploy to the local cluster with spark-submit. I am using Typesafe config to create config files for various deployment environments - local.conf, staging.conf, and production.conf - and trying to submit my jar.
The command I am running is the following:
/opt/spark-3.0.1-bin-hadoop2.7/bin/spark-submit \
--master spark://127.0.0.1:7077 \
--files ../files/local.conf \
--driver-java-options '-Dconfig.file=local.conf' \
target/scala-2.12/spark-starter-2.jar
I built the command incrementally by adding options one after another. With --files, logs suggest that the file is being uploaded to Spark but when I add --driver-java-options, submitting fails with file not being found.
Caused by: java.io.FileNotFoundException: local.conf (No such file or directory)
at java.base/java.io.FileInputStream.open0(Native Method)
at java.base/java.io.FileInputStream.open(FileInputStream.java:219)
at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
at com.typesafe.config.impl.Parseable$ParseableFile.reader(Parseable.java:629)
at com.typesafe.config.impl.Parseable.reader(Parseable.java:99)
at com.typesafe.config.impl.Parseable.rawParseValue(Parseable.java:233)
at com.typesafe.config.impl.Parseable.parseValue(Parseable.java:180)
... 35 more
Code:
import com.example.spark.settings.Settings
import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.SparkSession

object App extends App {
  val config = ConfigFactory.load()
  val settings = Settings(config = config)
  val spark = SparkSession
    .builder()
    .getOrCreate()
  spark.stop()
}
What do I need to change so that I can provide config files separately?
According to the Spark docs, files passed with --files are placed in the working directory of each executor, whereas you are trying to access the file from the driver, not from an executor.
In order to load config on driver side, try something like this:
/opt/spark-3.0.1-bin-hadoop2.7/bin/spark-submit \
--master spark://127.0.0.1:7077 \
--driver-java-options '-Dconfig.file=../files/local.conf' \
target/scala-2.12/spark-starter-2.jar
If what you want is to load the config on the executor side, you need to use the spark.executor.extraJavaOptions property. In this case you need to load the config inside the lambda that runs on the executor. An example for the RDD API:
myRdd.map { row =>
  val config = ConfigFactory.load()
  ...
}
Visibility of the config will be limited to the scope of the lambda. This is quite a complicated approach, and I'll describe a better option below.
My general recommendation on how to work with custom configs in Spark:
1) Read this chapter of Spark Docs
2) Load the config on driver side
3) Map settings that you need to immutable case class
4) Pass this case class to executors via closures
Keep in mind that the case class with settings should contain as little data as possible, and every field type should be either primitive or implement java.io.Serializable.
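To illustrate the serialization requirement, here is a minimal sketch that needs no Spark dependency. The names JobSettings and SettingsRoundTrip are hypothetical. It pushes a small settings case class through Java serialization, which is the mechanism Spark uses when it ships closures to executors, and checks that the value survives the round trip:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical settings holder: small, immutable, only String and primitive fields.
// Scala case classes are Serializable by default.
final case class JobSettings(inputPath: String, batchSize: Int)

object SettingsRoundTrip {
  // Serializes the settings to bytes and reads them back, roughly what
  // happens to a value captured by a closure that runs on an executor.
  def roundTrip(value: JobSettings): JobSettings = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(value) finally out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[JobSettings] finally in.close()
  }
}
```

In a real job the settings value would be created on the driver and simply referenced inside the function passed to map or foreach. A field whose type is not serializable would make the writeObject step above fail in the same way a Spark task would fail with a serialization error.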
A specific issue with EMR is that it is hard to reach the driver's filesystem, so it is preferable to store the config in external storage, typically S3.
The Typesafe config library cannot load files directly from S3, so you can pass the path to the config as an application argument rather than as a -D property, read it from S3 using AmazonS3Client, and then load it as a config using ConfigFactory.parseString(). See this answer as an example.
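The parsing half of that approach might look like the sketch below. It assumes the Typesafe config library is on the classpath. The object name RemoteConfig and the fetchText parameter are hypothetical: fetchText stands in for whatever code reads the object from S3 through the AWS SDK, which is deliberately not shown here.

```scala
import com.typesafe.config.{Config, ConfigFactory}

object RemoteConfig {
  // `fetchText` is a placeholder for the code that downloads the file
  // contents as a String (for S3, via the AWS SDK client); it is not a real API.
  def load(location: String, fetchText: String => String): Config =
    ConfigFactory
      .parseString(fetchText(location)) // parse the HOCON/JSON text
      .resolve()                        // resolve ${...} substitutions
}
```

For a quick local check the fetcher can be replaced by a function that returns a literal string, for example RemoteConfig.load("ignored", _ => "app.name = demo").getString("app.name").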

Workflow and Scheduling Framework for Spark with Scala in Maven Done with Intellij IDEA

I have created a Spark project in Scala. It's a Maven project with all dependencies configured in the POM.
I am using Spark for ETL. The source is a file generated by an API; Spark applies all kinds of transformations and then loads the result into Cassandra.
Is there any workflow software that can use the jar to automate the process, with email notifications on job success or failure?
Can someone please help? Can Airflow be used for this purpose? I have used Scala and not Python.
Kindly share your thoughts.
There is no built-in mechanism in Spark that will help. A cron job seems reasonable for your case. If you find yourself continuously adding dependencies to the scheduled job, try Azkaban.
One example of such a shell script:
#!/bin/bash
cd /locm/spark_jobs
export SPARK_HOME=/usr/hdp/2.2.0.0-2041/spark
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HADOOP_USER_NAME=hdfs
export HADOOP_GROUP=hdfs
#export SPARK_CLASSPATH=$SPARK_CLASSPATH:/locm/spark_jobs/configs/*
CLASS=$1
MASTER=$2
ARGS=$3
CLASS_ARGS=$4
echo "Running $CLASS With Master: $MASTER With Args: $ARGS And Class Args: $CLASS_ARGS"
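# $ARGS carries extra spark-submit options, $CLASS_ARGS is passed to the application's main class
"$SPARK_HOME/bin/spark-submit" --class "$CLASS" --master "$MASTER" --num-executors 4 --executor-cores 4 $ARGS "application jar file" $CLASS_ARGS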
You can also try spark-launcher, which can be used to start a Spark application programmatically.
First, create a sample Spark application and build a jar file for it.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object SparkApp extends App {
  val conf = new SparkConf().setMaster("local[*]").setAppName("spark-app")
  val sc = new SparkContext(conf)
  val rdd = sc.parallelize(Array(2, 3, 2, 1))
  rdd.saveAsTextFile("result")
  sc.stop()
}
This is our simple Spark application. Build a jar of it using sbt assembly. Next, we write a Scala application that starts this Spark application, as follows:
import org.apache.spark.launcher.SparkLauncher

object Launcher extends App {
  val spark = new SparkLauncher()
    .setSparkHome("/home/knoldus/spark-1.4.0-bin-hadoop2.6")
    .setAppResource("/home/knoldus/spark_launcher-assembly-1.0.jar")
    .setMainClass("SparkApp")
    .setMaster("local[*]")
    .launch()
  spark.waitFor()
}
In the above code we create a SparkLauncher object and set the following values on it:
.setSparkHome("/home/knoldus/spark-1.4.0-bin-hadoop2.6") sets the Spark home, which is used internally to call spark-submit.
.setAppResource("/home/knoldus/spark_launcher-assembly-1.0.jar") specifies the jar of our Spark application.
.setMainClass("SparkApp") sets the entry point of the Spark program, i.e. the driver program.
.setMaster("local[*]") sets the address of the master; here we run it on the local machine.
.launch() simply starts our Spark application.
These are the minimal requirements; you can also set many other options, such as passing arguments, adding jars, and setting configuration properties.
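As a sketch of those extra options, the launcher below reuses the paths from the example above and additionally sets the driver memory and passes two application arguments. The value "2g", the argument strings, and the object name ConfiguredLauncher are illustrative assumptions, not part of the original example.

```scala
import org.apache.spark.launcher.SparkLauncher

object ConfiguredLauncher extends App {
  val process = new SparkLauncher()
    .setSparkHome("/home/knoldus/spark-1.4.0-bin-hadoop2.6")
    .setAppResource("/home/knoldus/spark_launcher-assembly-1.0.jar")
    .setMainClass("SparkApp")
    .setMaster("local[*]")
    .setConf(SparkLauncher.DRIVER_MEMORY, "2g") // key is spark.driver.memory; value is an example
    .addAppArgs("arg1", "arg2")                 // example arguments handed to SparkApp
    .launch()

  // launch() returns a java.lang.Process; wait for it and report the exit code
  val exitCode = process.waitFor()
  println(s"Spark application finished with exit code $exitCode")
}
```

Because the launcher starts a separate JVM through spark-submit, memory settings passed this way are applied before the driver starts, unlike settings changed on an already running context.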

How we can deploy my existing kafka - spark - cassandra project to kafka - dataproc -cassandra in google-cloud-platform?

My existing project is kafka-spark-cassandra. I now have a GCP account and have to migrate the Spark jobs to Dataproc. In my existing Spark jobs, parameters like the master IP, memory, cores, etc. are passed on the command line by a Linux shell script and used to create a new SparkConf.
val conf = new SparkConf(true)
  .setMaster(master)
  .setAppName("xxxx")
  .setJars(List(path + "/xxxx.jar"))
  .set("spark.executor.memory", memory)
  .set("spark.cores.max", cores)
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.cassandra.connection.host", cassandra_ip)
1) How can this be configured in Dataproc?
2) Will there be any compatibility issues between Spark 1.3 (existing project) and Spark 1.6 provided by Dataproc? How can they be resolved?
3) Is any other connector needed for Dataproc to connect to Kafka and Cassandra? I couldn't find any.
1) When submitting a job, you can specify arguments and properties: https://cloud.google.com/sdk/gcloud/reference/dataproc/jobs/submit/spark. When determining which properties to set, keep in mind that Dataproc submits Spark jobs in yarn-client mode.
In general, this means you should avoid specifying master directly in code, instead letting it come from the spark.master value inside of spark-defaults.conf, and then your local setup would have that config set to local while Dataproc would automatically have it set to yarn-client with the necessary yarn config settings alongside it.
Likewise, keys like spark.executor.memory, etc., should be passed through Spark's first-class command-line options if running spark-submit directly:
spark-submit --conf spark.executor.memory=42G --conf spark.scheduler.mode=FAIR
or if submitting to Dataproc with gcloud:
gcloud dataproc jobs submit spark \
--properties spark.executor.memory=42G,spark.scheduler.mode=FAIR
You'll also want to look at the equivalent --jars flags for jars instead of specifying them in code.
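On the code side, this advice could look roughly like the sketch below. It keeps only application-specific keys from the question's SparkConf and leaves the master, jars, and executor memory to the submit command. The object name PortableJob and the choice to read the Cassandra host from the first program argument are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PortableJob {
  def main(args: Array[String]): Unit = {
    // No setMaster, setJars or executor memory here: spark-submit or
    // `gcloud dataproc jobs submit spark` supplies them as flags and properties.
    val conf = new SparkConf(true)
      .setAppName("xxxx")
      // Hypothetical: the Cassandra host arrives as the first program argument.
      .set("spark.cassandra.connection.host", args(0))

    val sc = new SparkContext(conf)
    try {
      // Shows which master was picked up from the environment.
      println(s"Running with master: ${sc.master}")
    } finally {
      sc.stop()
    }
  }
}
```

The same jar can then run locally and on Dataproc without code changes, since only the submit-time properties differ.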
2) When building your project to deploy, ensure you exclude spark (e.g., in maven, mark spark as provided). You may hit compatibility issues, but without knowing all APIs in use, I can't say one way or the other. The simplest way to find out is to bump Spark to 1.6.1 in your build config and see what happens.
In general Spark core is considered GA and should thus be mostly backwards compatible in 1.X versions, but the compatibility guidelines didn't apply yet to subprojects like mllib and SparkSQL, so if you use those you're more likely to need to recompile against the newer Spark version.
3) Connectors should either be included in a fat jar, specified as --jars, or installed onto the cluster at creation via initialization actions.

Deploying Play Framework (needing a SparkContext) on cluster

We are deploying a Play! Framework (version 2.3) app to a cluster. The app runs jobs that need a SparkContext (e.g. a search in HBase, or WordCount). Our cluster is a YARN cluster.
To start our app we run:
activator run
(after compiling and packaging)
We are experiencing some issues with the SparkContext configuration. We use this configuration:
val sparkConf = new SparkConf(false)
  .setMaster("yarn-cluster")
  .setAppName("MyApp")
val sc = new SparkContext(sparkConf)
When we call a route that runs a job using the SparkContext, we end up with this error:
[RuntimeException: java.lang.ExceptionInInitializerError]
or this one (when we reload):
[RuntimeException: java.lang.NoClassDefFoundError: Could not initialize class mypackage.MyClass$]
To test our code we changed the setMaster parameter to
.setMaster("local[4]")
And it worked well. But of course our code is then not distributed, and we do not use Spark's ability to distribute our work (e.g. RDDs).
Is running our app with the spark-submit command a solution? If so, how can this be done?
We would rather still use the activator command.

Spark on standalone cluster throws java.lang.IllegalStateException

I have an app that reads data from MongoDB.
If I use local mode, it runs well; however, it throws java.lang.IllegalStateException when I use standalone cluster mode.
In local mode, the SparkContext is val sc = new SparkContext("local","Scala Word Count")
In standalone cluster mode, the SparkContext is val sc = new SparkContext() and the submit command is ./spark-submit --class "xxxMain" /usr/local/jarfile/xxx.jar --master spark://master:7077
It retries 4 times and then throws the error when it reaches the first action.
My code
configOriginal.set("mongo.input.uri","mongodb://172.16.xxx.xxx:20000/xxx.Original")
configOriginal.set("mongo.output.uri","mongodb://172.16.xxx.xxx:20000/xxx.sfeature")
mongoRDDOriginal =sc.newAPIHadoopRDD(configOriginal,classOf[com.mongodb.hadoop.MongoInputFormat],classOf[Object], classOf[BSONObject])
I learned from this example
mongo-spark
I searched, and someone said it was because of mongo-hadoop-core-1.3.2, but whether I upgraded to mongo-hadoop-core-1.4.0 or downgraded to mongo-hadoop-core-1.3.1, it didn't work.
Please help me!
Finally, I found the solution.
Each of my workers has many cores, and mongo-hadoop-core-1.3.2 doesn't support multiple threads; this was fixed in mongo-hadoop-core-1.4.0. The reason my app still got the error was the IntelliJ IDEA cache. You should add the mongo-java-driver dependency, too.