How to connect to ElasticSearch from Apache Spark Service on Bluemix - scala

I use Apache Spark service on Bluemix to create demo (collecting/parsing twitter data).I want to transport Elastic Search.
I created my scala app according to the following URL [1]:
[1] https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
However, when using Jupyter notebook on Bluemix, I couldn't run my app properly. A special interpreter-aware SparkContext "sc" was already running, but I cloudn't add properties to "sc" such as "es.nodes", "es.port" ,and so on to connect Elastic Search.
Q1.
Does anyone know how to add extra properties to a special interpreter-aware SparkContext on Bluemix? In my local spark environment, it's easy to add.
Q2.
I tried to create another SparkContext as follows and use for streaming, but it was uncontrollable on Jupyter notebook..
var conf = sc.getConf
conf.set("es.index.auto.create", "true")
conf.set("es.nodes", "XXXXXXXX")
conf.set("es.port", "9020")
conf.set("spark.driver.allowMultipleContexts", "true")
val sc1 = new SparkContext(conf)
My procedure to create extra SparkContext may not be right, I think.
Does anyone know how to create 2nd SparkContext properly on Bluemix?

If I'm not mistaken, you're already setting the properties on the configuration object within the existing SparkContext.
These lines (correcting what I assume is a typo) should be setting the option on the existing SparkContext's configuration:
val conf = sc.getConf
conf.set("es.index.auto.create", "true")
conf.set("es.nodes", "XXXXXXXX")
conf.set("es.port", "9020")
conf.set("spark.driver.allowMultipleContexts", "true")
You mentioned you couldn't add these properties -- can you elaborate on the problem it was causing doing it this way?

Related

Set spark.driver.memory for Spark running inside a web application

I have a REST API in Scala Spray that triggers Spark jobs like the following:
path("vectorize") {
get {
parameter('apiKey.as[String]) { (apiKey) =>
if (apiKey == API_KEY) {
MoviesVectorizer.calculate() // Spark Job run in a Thread (returns Future)
complete("Ok")
} else {
complete("Wrong API KEY")
}
}
}
}
I'm trying to find the way to specify Spark driver memory for the jobs. As I found, configuring driver.memory from within the application code doesn't effect anything.
The whole web application along with the Spark is packaged in a fat Jar.
I run it by running
java -jar app.jar
Thus, as I understand, spark-submit is not relevant here (or is it?). So, I can not specify --driver-memory option when running the app.
Is there any way to set the driver memory for Spark within the web app?
Here's my current Spark configuration:
val spark: SparkSession = SparkSession.builder()
.appName("Recommender")
.master("local[*]")
.config("spark.mongodb.input.uri", uri)
.config("spark.mongodb.output.uri", uri)
.config("spark.mongodb.keep_alive_ms", "100000")
.getOrCreate()
spark.conf.set("spark.executor.memory", "10g")
val sc = spark.sparkContext
sc.setCheckpointDir("/tmp/checkpoint/")
val sqlContext = spark.sqlContext
As it is said in the documentation, Spark UI Environment tab shows only variables that are effected by the configuration. Everything I set is there - apart from spark.executor.memory.
This happens because you use local mode. In local mode there is no real executor - all Spark components run in a single JVM, with single heap configuration, so executor specific configuration doesn't matter.
spark.executor options are applicable only when applications is submitted to a cluster.
Also, Spark supports only a single application per JVM instance. This means that all core Spark properties, will be applied only when SparkContext is initialized, and persist as long as context (not SparkSession) is kept alive. Since SparkSession initializes SparkContext, no additional "core" settings will can applied after getOrCreate.
This means that all "core" options should be provided using config method of the SparkSession.builder.
If you're looking for alternatives to embedding you check an exemplary answer to Best Practice to launch Spark Applications via Web Application? by T. Gawęda.
Note: Officially Spark doesn't support applications running outside spark-submit and there are some elusive bugs related to that.

configuring scheduling pool in spark using zeppelin, scala and EMR

In pyspark I'm able to change to a fair scheduler within zeppelin (on AWS EMR) by doing the following:
conf = sc.getConf()
conf.set('spark.scheduler.allocation.file',
'/etc/spark/conf.dist/fairscheduler.xml.template')
sc.setLocalProperty("spark.scheduler.pool", 'production')
However if I try something similar in a scala cell it then things continue to run in the FIFO pool
val conf = sc.getConf()
conf.set("spark.scheduler.allocation.file",
"/etc/spark/conf.dist/fairscheduler.xml.template")
sc.setLocalProperty("spark.scheduler.pool", "FAIR")
I've tried so many combinations, but nothing has worked. Any advice is appreciated.
I ran into a similar issue with Spark 2.4. In my case, the problem was resolved by removing the default "spark.scheduler.pool" option in my Spark config. It might be that your Scala Spark interpreter is set up with spark.scheduler.pool but your python isn't.
I traced the issue to a bug in Spark - https://issues.apache.org/jira/browse/SPARK-26988. The problem is that if you set the config property "spark.scheduler.pool" in the base configuration, you can't then override it using setLocalProperty. Removing it from the base configuration made it work correctly. See the bug description for more detail.

A master url must be set to your configuration (Spark scala on AWS)

This is what I wrote via intellij. I plan on eventually writing larger spark scala files.
Anyways, I uploaded it on an AWS cluster that I had made. The "master" line, line 11 was "master("local")". I ran into this error
The second picture is the error that was returned by AWS when it did not run successfully. i changed line 11 to "yarn" instead of local (see the first picture for its current state)
It still is returning the same error. I put in the following flags when I uploaded it manually
--steps Type=CUSTOM_JAR,Name="SimpleApp"
It worked two weeks ago. My friend did almost the exact same thing as me. I am not sure why it isn't working.
I am looking for both a brief explanation and an answer. Looks like I need a little more knowledge on how spark works.
I am working with amazon EMR.
I think on the line 9 you are creating SparkContext with "old way" approach in spark 1.6.x and older version - you need to set master in default configuration file (usually location conf/spark-defaults.conf) or pass it to spark-submit (it is required in new SparkConf())...
On line 10 you are creating "spark" context with SparkSesion which is approach in spark 2.0.0. So in my opinion your problem is line num. 9 and I think you should remove it and work with SparkSesion or set reqiered configuration for SparkContext In case when you need sc.
You can access to sparkContext with sparkSession.sparkContext();
If you still want to use SparkConf you need to define master programatically:
val sparkConf = new SparkConf()
.setAppName("spark-application-name")
.setMaster("local[4]")
.set("spark.executor.memory","512m");
or with declarative approach in conf/spark-defaults.conf
spark.master local[4]
spark.executor.memory 512m
or simply at runtime:
./bin/spark-submit --name "spark-application-name" --master local[4] --executor-memory 512m your-spark-job.jar
Try using the below code:
val spark = SparkSession.builder().master("spark://ec2-xx-xxx-xxx-xxx.compute-1.amazonaws.com:xxxx").appName("example").getOrCreate()
you need to provide the proper link to your aws cluster.

Spark on YARN and spark-bigquery connector

I have developed a Scala Spark application for streaming data directly into Google BigQuery, using the spark-bigquery connector by Spotify.
Locally it works correctly, I have configured my application as described here https://github.com/spotify/spark-bigquery
val ssc = new StreamingContext(sc, Seconds(120))
val sqlContext = new SQLContext(sc)
sqlContext.setGcpJsonKeyFile("/opt/keyfile.json")
sqlContext.setBigQueryProjectId("projectid")
sqlContext.setBigQueryGcsBucket("gcsbucketname")
sqlContext.setBigQueryDatasetLocation("US")
but when I submit the application on my Spark on YARN cluster the job fails looking for GOOGLE_APPLICATION_CREDENTIALS environment variable...
The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials.
I set the variable as OS env var for root user to the .json file containing the credentials required, but it still fails.
I have also tried with the following line
System.setProperty("GOOGLE_APPLICATION_CREDENTIALS", "/opt/keyfile.json")
without success.
Any idea on what I'm missing?
Thank you,
Leonardo
the documentation suggests:
"Environment variables need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName] property in your conf/spark-defaults.conf file.
Environment variables that are set in spark-env.sh will not be reflected in the YARN Application Master process in cluster mode."

Spark on standalone cluster throws java.lang.illegalStateException

I hava a app and read data from MongoDB.
If I use local pattern, it runs well, however, it throws java.lang.illegalStateExcetion when I use standalone cluster pattern
With local pattern, the SparkContext is val sc = new SparkContext("local","Scala Word Count")
With Standalone cluster pattern, the SparkContext is val sc = new SparkContext() and submit shell is ./spark-submit --class "xxxMain" /usr/local/jarfile/xxx.jar --master spark://master:7077
It trys 4 times then throw error when it runs to the first action
My code
configOriginal.set("mongo.input.uri","mongodb://172.16.xxx.xxx:20000/xxx.Original")
configOriginal.set("mongo.output.uri","mongodb://172.16.xxx.xxx:20000/xxx.sfeature")
mongoRDDOriginal =sc.newAPIHadoopRDD(configOriginal,classOf[com.mongodb.hadoop.MongoInputFormat],classOf[Object], classOf[BSONObject])
I learned from this example
mongo-spark
I searched and someone said it was because of mongo-hadoop-core-1.3.2, but either I up the version to mongo-hadoop-core-1.4.0 or down to 'mongo-hadoop-core-1.3.1', it didn't work.
Please help me!
Finally, I got the solution.
Because each of my workers have many cores and mongo-hadoop-core-1.3.2 doesn't support multiple threads, however it fixed in mongo-hadoop-core-1.4.0. But why my app still get error is because of "intellij idea" cache. You should add mongo-java-driver dependency, too.