Code errors out from IntelliJ but runs well on the Databricks Notebook - scala

I develop Spark code using the Scala APIs in IntelliJ, and when I run it I get the error below, although the same code runs fine in a Databricks notebook.
I am using Databricks Connect to connect from my local IntelliJ installation to the Databricks Spark cluster. I am connected to the cluster and was able to submit a job from IntelliJ to the cluster as well. As a matter of fact, everything else works except the piece below.
Databricks Connect is 6.1, Databricks Runtime is 6.2.
I imported the jar files from the cluster (using databricks-connect get-jar-dir) and set up the SBT project with those jars in the project library.
Source code:
val sparkSession = SparkSession.builder.getOrCreate()
val sparkContext = sparkSession.sparkContext
import sparkSession.implicits._
val v_textFile_read = sparkContext.textFile(v_filename_path)
v_textFile_read.take(2).foreach(println)
Error:
cannot assign instance of scala.Some to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of
type scala.collection.Seq in instance of org.apache.spark.rdd.HadoopRDD
The reason I use an RDD reader for text is so I can pass the output to the createDataFrame API; as you know, createDataFrame takes an RDD and a schema as input parameters.
Step 1: val v_RDD_textFile_read = sparkContext.textFile(v_filename_path).map(x => MMRSplitRowIntoStrings(x))
Step 2: val v_DF_textFile_read = sparkSession.sqlContext.createDataFrame(v_RDD_textFile_read, v_schema)
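For reference, a DataFrame-only version of the same two steps is sketched below; it avoids the RDD text reader entirely by going through spark.read.textFile. This is only a sketch: the MmrRecord case class and the split logic are hypothetical stand-ins for MMRSplitRowIntoStrings and v_schema, and the delimiter and path are assumptions.
import org.apache.spark.sql.SparkSession

object TextToDataFrameSketch {
  // Hypothetical record type standing in for whatever v_schema describes.
  case class MmrRecord(field1: String, field2: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._

    // Dataset[String] instead of RDD[String]; the path stands in for v_filename_path.
    val lines = spark.read.textFile("/path/to/input.txt")

    // Stand-in for MMRSplitRowIntoStrings: split each line into typed fields.
    val df = lines.map { line =>
      val parts = line.split('|') // assumed delimiter
      MmrRecord(parts(0), parts(1))
    }.toDF()

    df.show(2)
    spark.stop()
  }
}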

Related

Workflow and Scheduling Framework for Spark with Scala in Maven Done with Intellij IDEA

I have created a Spark project with Scala. It's a Maven project with all dependencies configured in the POM.
I am using Spark for ETL: the source is a file generated by an API, all kinds of transformations happen in Spark, and the result is then loaded into Cassandra.
Is there any workflow software that can use the jar to automate the process, with email notifications and a success/failure job flow?
Can someone please help me: can Airflow be used for this purpose? I have used Scala, not Python.
Kindly share your thoughts.
There is no built-in mechanism in Spark that will help. A cron job seems reasonable for your case. If you find yourself continuously adding dependencies to the scheduled job, try Azkaban.
One example of such a shell script is:
#!/bin/bash
cd /locm/spark_jobs
export SPARK_HOME=/usr/hdp/2.2.0.0-2041/spark
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HADOOP_USER_NAME=hdfs
export HADOOP_GROUP=hdfs
#export SPARK_CLASSPATH=$SPARK_CLASSPATH:/locm/spark_jobs/configs/*
CLASS=$1
MASTER=$2
ARGS=$3
CLASS_ARGS=$4
echo "Running $CLASS With Master: $MASTER With Args: $ARGS And Class Args: $CLASS_ARGS"
$SPARK_HOME/bin/spark-submit --class $CLASS --master $MASTER --num-executors 4 --executor-cores 4 "application jar file"
You can even try using SparkLauncher, which can be used to start a Spark application programmatically:
First, create a sample Spark application and build a jar file for it.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
object SparkApp extends App {
  val conf = new SparkConf().setMaster("local[*]").setAppName("spark-app")
  val sc = new SparkContext(conf)
  val rdd = sc.parallelize(Array(2, 3, 2, 1))
  rdd.saveAsTextFile("result")
  sc.stop()
}
This is our simple Spark application. Make a jar of it using sbt assembly; now we write a Scala application that launches this Spark application, as follows:
import org.apache.spark.launcher.SparkLauncher
object Launcher extends App {
  val spark = new SparkLauncher()
    .setSparkHome("/home/knoldus/spark-1.4.0-bin-hadoop2.6")
    .setAppResource("/home/knoldus/spark_launcher-assembly-1.0.jar")
    .setMainClass("SparkApp")
    .setMaster("local[*]")
    .launch()
  spark.waitFor()
}
In the above code we create a SparkLauncher object and set values on it:
.setSparkHome("/home/knoldus/spark-1.4.0-bin-hadoop2.6") sets the Spark home, which is used internally to call spark-submit.
.setAppResource("/home/knoldus/spark_launcher-assembly-1.0.jar") specifies the jar of our Spark application.
.setMainClass("SparkApp") is the entry point of the Spark program, i.e. the driver program.
.setMaster("local[*]") sets the address of the master; here we run it on the local machine.
.launch() simply starts our Spark application.
This is the minimal requirement; you can also set many other options, such as passing arguments, adding jars, and setting Spark configurations, as in the sketch below.
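For example, a sketch with those extra options might look like this; the paths, class name, arguments, and configuration values below are placeholders rather than values from a real project:
import org.apache.spark.launcher.SparkLauncher

object LauncherWithConfig extends App {
  val handle = new SparkLauncher()
    .setSparkHome("/opt/spark")                   // assumed Spark installation
    .setAppResource("/opt/jobs/my-spark-app.jar") // assumed application jar
    .setMainClass("com.example.SparkApp")         // assumed driver class
    .setMaster("yarn")
    .addAppArgs("2020-01-01", "full-load")        // arguments passed to the application's main method
    .addJar("/opt/jobs/extra-dependency.jar")     // extra jar added to the classpath
    .setConf(SparkLauncher.EXECUTOR_MEMORY, "4g") // arbitrary Spark configuration
    .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
    .launch()

  handle.waitFor()
}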

Apache spark - loading data from elasticsearch is too slow

I'm new to Apache Spark and I'm trying to load some Elasticsearch data from a Scala script I'm running on it.
Here is my script:
import org.apache.spark.sql.SparkSession
val sparkSession = SparkSession.builder.appName("Simple Application").getOrCreate()
val options = Map("es.nodes" -> "x.x.x.x:9200", "pushdown" -> "true")
import sparkSession.implicits._
val df = sparkSession.read.format("org.elasticsearch.spark.sql").options(options).load("my_index-07.05.2018/_doc").limit(5).select("SomeField", "AnotherField", "AnotherOne")
df.cache()
df.show()
And it works, but it's terribly slow. Am I doing anything wrong here?
Connectivity shouldn't be an issue at all; the index I'm trying to query has around 200k documents, but I'm limiting the query to 5 results.
By the way, I had to run spark-shell (or spark-submit) passing the elasticsearch-hadoop dependency as a command-line parameter (--packages org.elasticsearch:elasticsearch-hadoop:6.3.0). Is that the right way to do it? Is there any way to just run sbt package and include all the dependencies?
Thanks a lot
Are you running this locally on a single machine? If so, it could be normal... You will have to check your network, your Spark web UI, etc.
As for submitting all the dependencies without specifying them on the spark-submit command line: what we usually do is create a fat jar using sbt assembly (see the sketch after the link below).
http://queirozf.com/entries/creating-scala-fat-jars-for-spark-on-sbt-with-sbt-assembly-plugin
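For reference, a minimal sbt-assembly setup looks roughly like the following; the plugin and library versions are only illustrative and should be matched to your Spark and Scala versions.
// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

// build.sbt
name := "es-spark-job"
scalaVersion := "2.11.12"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "2.3.0" % "provided",  // provided by the cluster at runtime
  "org.elasticsearch" %% "elasticsearch-spark-20" % "6.3.0"  // bundled into the fat jar
)
Running sbt assembly then produces a single jar that spark-submit can use without the --packages flag.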

Set spark.driver.memory for Spark running inside a web application

I have a REST API in Scala Spray that triggers Spark jobs like the following:
path("vectorize") {
get {
parameter('apiKey.as[String]) { (apiKey) =>
if (apiKey == API_KEY) {
MoviesVectorizer.calculate() // Spark Job run in a Thread (returns Future)
complete("Ok")
} else {
complete("Wrong API KEY")
}
}
}
}
I'm trying to find a way to specify Spark driver memory for the jobs. As far as I can tell, configuring spark.driver.memory from within the application code doesn't affect anything.
The whole web application, along with Spark, is packaged in a fat jar.
I run it by running
java -jar app.jar
Thus, as I understand it, spark-submit is not relevant here (or is it?), so I cannot specify the --driver-memory option when running the app.
Is there any way to set the driver memory for Spark within the web app?
Here's my current Spark configuration:
val spark: SparkSession = SparkSession.builder()
.appName("Recommender")
.master("local[*]")
.config("spark.mongodb.input.uri", uri)
.config("spark.mongodb.output.uri", uri)
.config("spark.mongodb.keep_alive_ms", "100000")
.getOrCreate()
spark.conf.set("spark.executor.memory", "10g")
val sc = spark.sparkContext
sc.setCheckpointDir("/tmp/checkpoint/")
val sqlContext = spark.sqlContext
As stated in the documentation, the Spark UI Environment tab shows only variables that are affected by the configuration. Everything I set is there, apart from spark.executor.memory.
This happens because you use local mode. In local mode there is no real executor: all Spark components run in a single JVM with a single heap configuration, so executor-specific configuration doesn't matter.
spark.executor options are applicable only when an application is submitted to a cluster.
Also, Spark supports only a single application per JVM instance. This means that all core Spark properties are applied only when the SparkContext is initialized, and they persist as long as the context (not the SparkSession) is kept alive. Since SparkSession initializes the SparkContext, no additional "core" settings can be applied after getOrCreate.
This means that all "core" options should be provided using the config method of SparkSession.builder, as sketched below.
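For illustration, a minimal sketch of that pattern follows; the option and its value are placeholders. Note that in local mode the driver shares the web application's JVM, so the effective driver memory is simply the JVM heap, and the practical way to raise it for java -jar app.jar is a JVM flag such as -Xmx.
import org.apache.spark.sql.SparkSession

// "Core" options go through the builder, before any SparkContext exists.
// In local mode the driver heap is the web app's JVM heap, so start the app with
// something like: java -Xmx10g -jar app.jar
val spark = SparkSession.builder()
  .appName("Recommender")
  .master("local[*]")
  .config("spark.driver.maxResultSize", "2g") // example of a core option; the value is illustrative
  .getOrCreate()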
If you're looking for alternatives to embedding, check the exemplary answer to "Best Practice to launch Spark Applications via Web Application?" by T. Gawęda.
Note: officially, Spark doesn't support applications running outside spark-submit, and there are some elusive bugs related to that.

How to connect to ElasticSearch from Apache Spark Service on Bluemix

I use the Apache Spark service on Bluemix to create a demo (collecting/parsing Twitter data), and I want to send the data to Elasticsearch.
I created my Scala app according to the following URL [1]:
[1] https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
However, when using a Jupyter notebook on Bluemix, I couldn't run my app properly. A special interpreter-aware SparkContext "sc" was already running, but I couldn't add properties to "sc" such as "es.nodes", "es.port", and so on to connect to Elasticsearch.
Q1.
Does anyone know how to add extra properties to the special interpreter-aware SparkContext on Bluemix? In my local Spark environment it's easy to add them.
Q2.
I tried to create another SparkContext as follows and use it for streaming, but it was uncontrollable in the Jupyter notebook.
var conf = sc.getConf
conf.set("es.index.auto.create", "true")
conf.set("es.nodes", "XXXXXXXX")
conf.set("es.port", "9020")
conf.set("spark.driver.allowMultipleContexts", "true")
val sc1 = new SparkContext(conf)
I think my procedure for creating an extra SparkContext may not be right.
Does anyone know how to create a second SparkContext properly on Bluemix?
If I'm not mistaken, you're already setting the properties on the configuration object within the existing SparkContext.
These lines (correcting what I assume is a typo) should set the options on the existing SparkContext's configuration:
val conf = sc.getConf
conf.set("es.index.auto.create", "true")
conf.set("es.nodes", "XXXXXXXX")
conf.set("es.port", "9020")
conf.set("spark.driver.allowMultipleContexts", "true")
You mentioned you couldn't add these properties -- can you elaborate on the problem you ran into when doing it this way?
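If setting the properties on the shared context turns out not to work on the hosted notebook, one possible workaround (a sketch, assuming a SQLContext is available in the notebook and the elasticsearch-spark connector is on the classpath) is to pass the es.* settings per read instead of on the SparkContext:
// Placeholder host, port, and index; these mirror the values in the question.
val esReadOptions = Map(
  "es.nodes" -> "XXXXXXXX",
  "es.port" -> "9020",
  "es.index.auto.create" -> "true"
)

val tweets = sqlContext.read
  .format("org.elasticsearch.spark.sql")
  .options(esReadOptions)
  .load("twitter/tweet") // "index/type" resource name is illustrative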

Deploying Play Framework (needing a SparkContext) on cluster

We are deploying to a cluster a Play! Framework (version 2.3) app that runs jobs needing a SparkContext (e.g. searching in HBase, or WordCount). Our cluster is a YARN cluster.
To start our app we execute the command line :
activator run
(after compiling and packaging)
We are experiencing some issues with the SparkContext configuration. We use this configuration:
val sparkConf = new SparkConf(false)
  .setMaster("yarn-cluster")
  .setAppName("MyApp")
val sc = new SparkContext(sparkConf)
When we call a route that runs a job using the SparkContext, we end up with this error:
[RuntimeException: java.lang.ExceptionInInitializerError]
or this one (when we reload):
[RuntimeException: java.lang.NoClassDefFoundError: Could not initialize class mypackage.MyClass$]
To test our code we changed the setMaster parameter to
.setMaster("local[4]")
And it worked well. But of course our code is not distributed, and we do not use Spark's capability to distribute our code (e.g. RDDs).
Is running our app with the spark-submit command a solution? If it is, how can this be done?
We would rather still use the activator command.