Creating new SparkContext for each SparkStep in MRJob/ pySpark - pyspark

I am new to pySpark and I'm trying to implement a multi-step EMR/Spark job using MRJob, do I need to create a new SparkContext for each SparkStep, or can I share the same SparkContext for all SparkSteps?
I tried to look up the MRJob manual but unfortunately it was not clear on this.
Can someone please advise what's the correct approach?
Creating a separate SparkContext:
class MRSparkJob(MRJob):
def spark_step1(self, input_path, output_path):
from pyspark import SparkContext
sc = SparkContext(appName='appname')
...
sc.stop()
def spark_step2(self, input_path, output_path):
from pyspark import SparkContext
sc = SparkContext(appName='appname')
...
sc.stop()
def steps(self):
return [SparkStep(spark=self.spark_step1),
SparkStep(spark=self.spark_step2)]
if __name__ == '__main__':
MRSparkJob.run()
Create a single SparkContext and share it among differnt SparkSteps
class MRSparkJob(MRJob):
sc = None
def spark_step1(self, input_path, output_path):
from pyspark import SparkContext
self.sc = SparkContext(appName='appname')
...
def spark_step2(self, input_path, output_path):
from pyspark import SparkContext
... (reuse the same self.sc)
self.sc.stop()
def steps(self):
return [SparkStep(spark=self.spark_step1),
SparkStep(spark=self.spark_step2)]
if __name__ == '__main__':
MRSparkJob.run()

According to Dave at MRJob discussion group, we should create a new SparkContext for each step, as each step is a completely new invocation of Hadoop and Spark (ie. #1 above is the correct approach).

Related

spark - scala script for wordcount not working Only one SparkContext should be running

I am trying to run a scala code in spark. The wordcount code was copied from internet.
I have the hdfs running.
I keep having the same """Only one SparkContext should be running""" even if I stop de sc
I have also mounted the hdfs and pointed there the file value, but I have the same error.
DonĀ“t know what else could do.
scala> sc.stop()
scala> :load /Users/dvegamar/spark_ej/wordcount.scala
Loading /Users/dvegamar/spark_ej/wordcount.scala...
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
defined object WordCount
scala> WordCount.main()
org.apache.spark.SparkException: Only one SparkContext should be running in this JVM (see SPARK-2243).The currently running SparkContext was created at:
org.apache.spark.SparkContext.<init>(SparkContext.scala:85)
WordCount$.main(/Users/dvegamar/spark_ej/wordcount.scala:72)
<init>(<console>:59)
<init>(<console>:63)
<init>(<console>:65)
<init>(<console>:67)
<init>(<console>:69)
<init>(<console>:71)
<init>(<console>:73)
<init>(<console>:75)
<init>(<console>:77)
<init>(<console>:79)
<init>(<console>:81)
<init>(<console>:83)
<init>(<console>:85)
<init>(<console>:87)
<init>(<console>:89)
<init>(<console>:91)
<init>(<console>:93)
<init>(<console>:95)
at org.apache.spark.SparkContext$.$anonfun$assertNoOtherContextIsRunning$2(SparkContext.scala:2647)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(SparkContext.scala:2644)
at org.apache.spark.SparkContext$.markPartiallyConstructed(SparkContext.scala:2734)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:95)
at WordCount$.main(/Users/dvegamar/spark_ej/wordcount.scala:83)
... 63 elided
The scala code is:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
object WordCount {
def main() {
//Configuration for a Spark application.
val conf = new SparkConf()
conf.setAppName("SparkWordCount").setMaster("local")
conf.set("spark.driver.allowMultipleContexts", "true")
//Create Spark Context
val sc = new SparkContext(conf)
//Create MappedRDD by reading from HDFS file from path command line parameter
val rdd = sc.textFile("file:///Users/dvegamar/spark_ej/texto_ejemplo.txt")
//WordCount
rdd.flatMap(_.split(" ")).
map((_, 1)).
reduceByKey(_ + _).
map(x => (x._2, x._1)).
sortByKey(false).
map(x => (x._2, x._1)).
saveAsTextFile("SparkWordCountResult")
//stop context,
sc.stop
}
}

SparkSession and SparkContext initiation in PySpark

I would like to know the PySpark equivalent of the following code in Scala. I am using databricks. I need the same output as below:-
to create new Spark session and output the session id (SparkSession#123d0e8)
val new_spark = spark.newSession()
**Output**
new_spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession#123d0e8
to view SparkContext and output the SparkContext id (SparkContext#2dsdas33)
new_spark.sparkContext
**Output**
org.apache.spark.SparkContext = org.apache.spark.SparkContext#2dsdas33
It's very similar. If you have already a session and want to open another one, you can use
my_session = spark.newSession()
print(my_session)
This will produce the new session object I think you are trying to create
<pyspark.sql.session.SparkSession object at 0x7fc3bae3f550>
spark is a session object already running, because you are using a databricks notebook
SparkSession could be created as http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html
>>> from pyspark.sql import SparkSession
>>> from pyspark.conf import SparkConf
>>> SparkSession.builder.config(conf=SparkConf())
or
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName('FirstSparkApp').getOrCreate()

spark scala datastax csv load file and print schema

Spark version 2.0.2.6
Scala version 2.11.11
Using DataStax 5.0
import org.apache.log4j.{Level, Logger}
import java.util.Calendar
import org.apache.spark.sql.functions._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import com.datastax.spark.connector._
import org.apache.spark.sql._
object csvtocassandra {
def main(args: Array[String]): Unit = {
val key_space = scala.io.StdIn.readLine("Please enter cassandra Key Space Name: ")
val table_name = scala.io.StdIn.readLine("Please enter cassandra Table Name: ")
// Cassandra Part
val conf = new SparkConf().setAppName("Sample1").setMaster("local[*]")
val sc = new SparkContext(conf)
sc.setLogLevel("ERROR")
println(Calendar.getInstance.getTime)
// Scala Read CSV Part
val spark1 = org.apache.spark.sql.SparkSession.builder().master("local").config("spark.cassandra.connection.host", "127.0.0.1")
.appName("Spark SQL basic example").getOrCreate()
val csv_input = scala.io.StdIn.readLine("Please enter csv file location: ")
val df_csv = spark1.read.format("csv").option("header", "true").option("inferschema", "true").load(csv_input)
df_csv.printSchema()
}
}
Why am I not able to run this program as a Job trying to submit it to spark. When I run this program using IntelliJ it works.
But When I create a JAR and run it I am getting following Error.
Command:
> dse spark-submit --class "csvtospark" /Users/del/target/scala-2.11/csvtospark_2.11-1.0.jar
I am getting following Error:
ERROR 2017-11-02 11:46:10,245 org.apache.spark.deploy.DseSparkSubmitBootstrapper: Failed to start or submit Spark application
org.apache.spark.sql.AnalysisException: Path does not exist: dsefs://127.0.0.1/Users/Desktop/csv/example.csv;
Why is it appending dsefs://127.0.0.1 part even though I am giving just the path /Users/Desktop/csv/example.csv when asked.
I tried giving --mater option as well. How ever I am getting the same error. I am running DataStax Spark in Local Machine. No Cluster.
Please correct me where I am doing things wrong.
Got it. Never mind. Sorry about that.
input should be file:///file_name

Best practice to create SparkSession object in Scala to use both in unittest and spark-submit

I have tried to write a transform method from DataFrame to DataFrame.
And I also want to test it by scalatest.
As you know, in Spark 2.x with Scala API, you can create SparkSession object as follows:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.bulider
.config("spark.master", "local[2]")
.getOrCreate()
This code works fine with unit tests.
But, when I run this code with spark-submit, the cluster options did not work.
For example,
spark-submit --master yarn --deploy-mode client --num-executors 10 ...
does not create any executors.
I have found that the spark-submit arguments are applied when I remove config("master", "local[2]") part of the above code.
But, without master setting the unit test code did not work.
I tried to split spark (SparkSession) object generation part to test and main.
But there is so many code blocks needs spark, for example import spark.implicit,_ and spark.createDataFrame(rdd, schema).
Is there any best practice to write a code to create spark object both to test and to run spark-submit?
One way is to create a trait which provides the SparkContext/SparkSession, and use that in your test cases, like so:
trait SparkTestContext {
private val master = "local[*]"
private val appName = "testing"
System.setProperty("hadoop.home.dir", "c:\\winutils\\")
private val conf: SparkConf = new SparkConf()
.setMaster(master)
.setAppName(appName)
.set("spark.driver.allowMultipleContexts", "false")
.set("spark.ui.enabled", "false")
val ss: SparkSession = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()
val sc: SparkContext = ss.sparkContext
val sqlContext: SQLContext = ss.sqlContext
}
And your test class header then looks like this for example:
class TestWithSparkTest extends BaseSpec with SparkTestContext with Matchers{
I made a version where Spark will close correctly after tests.
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite, Matchers}
trait SparkTest extends FunSuite with BeforeAndAfterAll with Matchers {
var ss: SparkSession = _
var sc: SparkContext = _
var sqlContext: SQLContext = _
override def beforeAll(): Unit = {
val master = "local[*]"
val appName = "MyApp"
val conf: SparkConf = new SparkConf()
.setMaster(master)
.setAppName(appName)
.set("spark.driver.allowMultipleContexts", "false")
.set("spark.ui.enabled", "false")
ss = SparkSession.builder().config(conf).getOrCreate()
sc = ss.sparkContext
sqlContext = ss.sqlContext
super.beforeAll()
}
override def afterAll(): Unit = {
sc.stop()
super.afterAll()
}
}
The spark-submit command with parameter --master yarn is setting yarn master.
And this will be conflict with your code master("x"), even using like master("yarn").
If you want to use import sparkSession.implicits._ like toDF ,toDS or other func,
you can just use a local sparkSession variable created like below:
val spark = SparkSession.builder().appName("YourName").getOrCreate()
without setting master("x") in spark-submit --master yarn, not in local machine.
I advice : do not use global sparkSession in your code. That may cause some errors or exceptions.
hope this helps you.
good luck!
How about defining an object in which a method creates a singleton instance of SparkSession, like MySparkSession.get(), and pass it as a paramter in each of your unit tests.
In your main method, you can create a separate SparkSession instance, which can have different configurations.

Running Mlib via Spark Job Server

I was practising developing sample model using online resources provided in spark website. I managed to create the model and run it for sample data using Spark-Shell , But how to do actually run the model in production environment ? Is it via Spark Job server ?
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
val data = sc.textFile("hdfs://mycluster/user/Cancer.csv")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts.last.toDouble, Vectors.dense(parts.take(9).map(_.toDouble)))
}
var svm = new SVMWithSGD().setIntercept(true)
val model = svm.run(parsedData)
var predictedValue = model.predict(Vectors.dense(5,1,1,1,2,1,3,1,1))
println(predictedValue)
The above code works perfect when i run it in spark-shell , But i have no idea how do we actually run model in production environment. I tried to run it via spark jobserver but i get error ,
curl -d "input.string = 1, 2, 3, 4, 5, 6, 7, 8, 9" 'ptfhadoop01v:8090/jobs?appName=SQL&classPath=spark.jobserver.SparkPredict'
I am sure its because am passing a String value whereas the program expects it be vector elements , Can someone guide me on how to achieve this . And also is this how the data being passed to Model in production environment ? Or is it some other way.
Spark Job-server is used in production use-cases, where you want to design pipelines of Spark jobs, and also (optionally) use the SparkContext across jobs, over a REST API. Sparkplug is an alternative to Spark Job-server, providing similar constructs.
However, to answer your question on how to run a (singular) Spark job in production environments, the answer is you do not need a third-party library to do so. You only need to construct a SparkContext object, and use it to trigger Spark jobs. For instance, for your code snippet, all that is needed is;
package runner
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import com.typesafe.config.{ConfigFactory, Config}
import org.apache.spark.{SparkConf, SparkContext}
/**
*
*/
object SparkRunner {
def main (args: Array[String]){
val config: Config = ConfigFactory.load("app-default-config") /*Use a library to read a config file*/
val sc: SparkContext = constructSparkContext(config)
val data = sc.textFile("hdfs://mycluster/user/Cancer.csv")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts.last.toDouble, Vectors.dense(parts.take(9).map(_.toDouble)))
}
var svm = new SVMWithSGD().setIntercept(true)
val model = svm.run(parsedData)
var predictedValue = model.predict(Vectors.dense(5,1,1,1,2,1,3,1,1))
println(predictedValue)
}
def constructSparkContext(config: Config): SparkContext = {
val conf = new SparkConf()
conf
.setMaster(config.getString("spark.master"))
.setAppName(config.getString("app.name"))
/*Set more configuration values here*/
new SparkContext(conf)
}
}
Optionally, you can also use the wrapper for spark-submit script, SparkSubmit, provided in the Spark library itself.