Creating new SparkContext for each SparkStep in MRJob/ pySpark

I am new to PySpark and am trying to implement a multi-step EMR/Spark job using MRJob. Do I need to create a new SparkContext for each SparkStep, or can I share the same SparkContext across all SparkSteps?
I looked through the MRJob manual, but it is not clear on this point.
Can someone please advise on the correct approach?
Approach 1: create a separate SparkContext in each step:
    from mrjob.job import MRJob
    from mrjob.step import SparkStep

    class MRSparkJob(MRJob):

        def spark_step1(self, input_path, output_path):
            from pyspark import SparkContext
            sc = SparkContext(appName='appname')
            ...
            sc.stop()

        def spark_step2(self, input_path, output_path):
            from pyspark import SparkContext
            sc = SparkContext(appName='appname')
            ...
            sc.stop()

        def steps(self):
            return [SparkStep(spark=self.spark_step1),
                    SparkStep(spark=self.spark_step2)]

    if __name__ == '__main__':
        MRSparkJob.run()
Approach 2: create a single SparkContext and share it among the different SparkSteps:
    from mrjob.job import MRJob
    from mrjob.step import SparkStep

    class MRSparkJob(MRJob):
        sc = None

        def spark_step1(self, input_path, output_path):
            from pyspark import SparkContext
            self.sc = SparkContext(appName='appname')
            ...

        def spark_step2(self, input_path, output_path):
            from pyspark import SparkContext
            ...  # reuse the same self.sc
            self.sc.stop()

        def steps(self):
            return [SparkStep(spark=self.spark_step1),
                    SparkStep(spark=self.spark_step2)]

    if __name__ == '__main__':
        MRSparkJob.run()

According to Dave on the MRJob discussion group, you should create a new SparkContext for each step, because each step is a completely new invocation of Hadoop and Spark (i.e., approach #1 above is the correct one).
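A minimal sketch of the per-step pattern described above, assuming you also want the context stopped when a step raises. The helper name `step_context` and its `factory` parameter are hypothetical (they are not part of MRJob or PySpark); `factory` exists only so the lifecycle can be exercised without a cluster, and it defaults to `pyspark.SparkContext`. `FakeContext` is a stand-in for demonstration only.

```python
from contextlib import contextmanager


@contextmanager
def step_context(app_name, factory=None):
    """Create a fresh context for one step and always stop it afterwards.

    `step_context` and `factory` are illustrative names, not MRJob/PySpark API.
    """
    if factory is None:
        # Imported lazily, as in the question, so this module loads without Spark.
        from pyspark import SparkContext
        factory = SparkContext
    sc = factory(appName=app_name)
    try:
        yield sc
    finally:
        sc.stop()  # runs even if the step body raises


class FakeContext:
    """Stand-in used only to show the lifecycle without a Spark install."""

    def __init__(self, appName):
        self.appName = appName
        self.stopped = False

    def stop(self):
        self.stopped = True


# Each step gets its own context, and each context is stopped on exit.
with step_context("step1", factory=FakeContext) as sc:
    first = sc
with step_context("step2", factory=FakeContext) as sc:
    second = sc
```

Inside a real step method you would wrap the body in `with step_context('appname') as sc:` and leave `factory` unset.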

Related

spark - scala script for wordcount not working Only one SparkContext should be running

I am trying to run Scala code in Spark. The word-count code was copied from the internet.
I have HDFS running.
I keep getting the same "Only one SparkContext should be running" error, even if I stop sc first.
I have also mounted HDFS and pointed the file path there, but I get the same error.
I don't know what else to try.
scala> sc.stop()
scala> :load /Users/dvegamar/spark_ej/wordcount.scala
Loading /Users/dvegamar/spark_ej/wordcount.scala...
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
defined object WordCount
scala> WordCount.main()
org.apache.spark.SparkException: Only one SparkContext should be running in this JVM (see SPARK-2243).The currently running SparkContext was created at:
org.apache.spark.SparkContext.<init>(SparkContext.scala:85)
WordCount$.main(/Users/dvegamar/spark_ej/wordcount.scala:72)
<init>(<console>:59)
<init>(<console>:63)
<init>(<console>:65)
<init>(<console>:67)
<init>(<console>:69)
<init>(<console>:71)
<init>(<console>:73)
<init>(<console>:75)
<init>(<console>:77)
<init>(<console>:79)
<init>(<console>:81)
<init>(<console>:83)
<init>(<console>:85)
<init>(<console>:87)
<init>(<console>:89)
<init>(<console>:91)
<init>(<console>:93)
<init>(<console>:95)
at org.apache.spark.SparkContext$.$anonfun$assertNoOtherContextIsRunning$2(SparkContext.scala:2647)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(SparkContext.scala:2644)
at org.apache.spark.SparkContext$.markPartiallyConstructed(SparkContext.scala:2734)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:95)
at WordCount$.main(/Users/dvegamar/spark_ej/wordcount.scala:83)
... 63 elided
The Scala code is:
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkConf
    object WordCount {
      def main() {
        //Configuration for a Spark application.
        val conf = new SparkConf()
        conf.setAppName("SparkWordCount").setMaster("local")
        conf.set("spark.driver.allowMultipleContexts", "true")
        //Create Spark Context
        val sc = new SparkContext(conf)
        //Create MappedRDD by reading from HDFS file from path command line parameter
        val rdd = sc.textFile("file:///Users/dvegamar/spark_ej/texto_ejemplo.txt")
        //WordCount
        rdd.flatMap(_.split(" ")).
          map((_, 1)).
          reduceByKey(_ + _).
          map(x => (x._2, x._1)).
          sortByKey(false).
          map(x => (x._2, x._1)).
          saveAsTextFile("SparkWordCountResult")
        //stop context,
        sc.stop
      }
    }

SparkSession and SparkContext initiation in PySpark

I would like to know the PySpark equivalent of the following Scala code. I am using Databricks. I need the same output as below.
To create a new Spark session and output the session ID (SparkSession#123d0e8):
    val new_spark = spark.newSession()
Output:
    new_spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession#123d0e8
To view the SparkContext and output the SparkContext ID (SparkContext#2dsdas33):
    new_spark.sparkContext
Output:
    org.apache.spark.SparkContext = org.apache.spark.SparkContext#2dsdas33

It's very similar. If you already have a session and want to open another one, you can use:
    my_session = spark.newSession()
    print(my_session)
This will produce the new session object I think you are trying to create:
    <pyspark.sql.session.SparkSession object at 0x7fc3bae3f550>
spark is already a running session object, because you are using a Databricks notebook.

A SparkSession can be created as described at http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html:
>>> from pyspark.sql import SparkSession
>>> from pyspark.conf import SparkConf
>>> SparkSession.builder.config(conf=SparkConf())
or
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName('FirstSparkApp').getOrCreate()

spark scala datastax csv load file and print schema

Spark version 2.0.2.6
Scala version 2.11.11
Using DataStax 5.0
    import org.apache.log4j.{Level, Logger}
    import java.util.Calendar
    import org.apache.spark.sql.functions._
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkConf
    import com.datastax.spark.connector._
    import org.apache.spark.sql._
    object csvtocassandra {
      def main(args: Array[String]): Unit = {
        val key_space = scala.io.StdIn.readLine("Please enter cassandra Key Space Name: ")
        val table_name = scala.io.StdIn.readLine("Please enter cassandra Table Name: ")
        // Cassandra Part
        val conf = new SparkConf().setAppName("Sample1").setMaster("local[*]")
        val sc = new SparkContext(conf)
        sc.setLogLevel("ERROR")
        println(Calendar.getInstance.getTime)
        // Scala Read CSV Part
        val spark1 = org.apache.spark.sql.SparkSession.builder().master("local").config("spark.cassandra.connection.host", "127.0.0.1")
          .appName("Spark SQL basic example").getOrCreate()
        val csv_input = scala.io.StdIn.readLine("Please enter csv file location: ")
        val df_csv = spark1.read.format("csv").option("header", "true").option("inferschema", "true").load(csv_input)
        df_csv.printSchema()
      }
    }
Why am I not able to run this program as a job submitted to Spark? When I run it from IntelliJ, it works.
But when I build a JAR and run it, I get the following error.
Command:
> dse spark-submit --class "csvtospark" /Users/del/target/scala-2.11/csvtospark_2.11-1.0.jar
I get the following error:
ERROR 2017-11-02 11:46:10,245 org.apache.spark.deploy.DseSparkSubmitBootstrapper: Failed to start or submit Spark application
org.apache.spark.sql.AnalysisException: Path does not exist: dsefs://127.0.0.1/Users/Desktop/csv/example.csv;
Why is it prepending dsefs://127.0.0.1 even though I enter just the path /Users/Desktop/csv/example.csv when asked?
I tried giving the --master option as well; however, I get the same error. I am running DataStax Spark on my local machine, with no cluster.
Please point out what I am doing wrong.

Got it, never mind. Sorry about that.
The input should be given as file:///file_name.
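The fix above can be sketched as a small helper that adds an explicit file:// scheme to a bare local path before it is handed to Spark. This rests on an assumption: a path without a scheme is resolved against the cluster's default file system, which would explain the dsefs://127.0.0.1 prefix in the error. The function name `to_local_uri` is hypothetical, and the sketch is written in Python rather than the question's Scala.

```python
def to_local_uri(path):
    """Return `path` with an explicit file:// scheme (hypothetical helper).

    Paths that already carry a scheme (file://, hdfs://, dsefs://, ...) are
    returned unchanged. Bare paths must be absolute, POSIX-style paths.
    """
    if "://" in path:
        return path
    if not path.startswith("/"):
        raise ValueError("expected an absolute path, got: " + path)
    return "file://" + path


# The path from the question, as it would be passed to the CSV reader.
csv_input = to_local_uri("/Users/Desktop/csv/example.csv")
print(csv_input)  # file:///Users/Desktop/csv/example.csv
```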

Best practice to create SparkSession object in Scala to use both in unittest and spark-submit

I have written a transform method from DataFrame to DataFrame, and I want to test it with ScalaTest.
As you know, in Spark 2.x with the Scala API, you can create a SparkSession object as follows:
    import org.apache.spark.sql.SparkSession
    val spark = SparkSession.builder
      .config("spark.master", "local[2]")
      .getOrCreate()
This code works fine in unit tests.
But when I run it with spark-submit, the cluster options do not take effect.
For example,
    spark-submit --master yarn --deploy-mode client --num-executors 10 ...
does not create any executors.
I found that the spark-submit arguments are applied when I remove the config("spark.master", "local[2]") part of the code above.
But without the master setting, the unit test code does not work.
I tried to split the creation of the spark (SparkSession) object between test and main.
But many code blocks need spark, for example import spark.implicits._ and spark.createDataFrame(rdd, schema).
Is there a best practice for creating the spark object so that it works both in tests and under spark-submit?

One way is to create a trait which provides the SparkContext/SparkSession, and use that in your test cases, like so:
    trait SparkTestContext {
      private val master = "local[*]"
      private val appName = "testing"
      System.setProperty("hadoop.home.dir", "c:\\winutils\\")
      private val conf: SparkConf = new SparkConf()
        .setMaster(master)
        .setAppName(appName)
        .set("spark.driver.allowMultipleContexts", "false")
        .set("spark.ui.enabled", "false")
      val ss: SparkSession = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()
      val sc: SparkContext = ss.sparkContext
      val sqlContext: SQLContext = ss.sqlContext
    }
And your test class header then looks like this for example:
class TestWithSparkTest extends BaseSpec with SparkTestContext with Matchers{

I made a version where Spark will close correctly after tests.
    import org.apache.spark.sql.{SQLContext, SparkSession}
    import org.apache.spark.{SparkConf, SparkContext}
    import org.scalatest.{BeforeAndAfterAll, FunSuite, Matchers}
    trait SparkTest extends FunSuite with BeforeAndAfterAll with Matchers {
      var ss: SparkSession = _
      var sc: SparkContext = _
      var sqlContext: SQLContext = _
      override def beforeAll(): Unit = {
        val master = "local[*]"
        val appName = "MyApp"
        val conf: SparkConf = new SparkConf()
          .setMaster(master)
          .setAppName(appName)
          .set("spark.driver.allowMultipleContexts", "false")
          .set("spark.ui.enabled", "false")
        ss = SparkSession.builder().config(conf).getOrCreate()
        sc = ss.sparkContext
        sqlContext = ss.sqlContext
        super.beforeAll()
      }
      override def afterAll(): Unit = {
        sc.stop()
        super.afterAll()
      }
    }

The spark-submit command with the --master yarn parameter sets the master to YARN.
This conflicts with a master("x") call in your code, even one like master("yarn").
If you want to use import sparkSession.implicits._ for toDF, toDS, or other functions,
you can use a local sparkSession variable created like this:
    val spark = SparkSession.builder().appName("YourName").getOrCreate()
That is, do not call master("x") when you run with spark-submit --master yarn rather than on a local machine.
My advice: do not use a global sparkSession in your code. That may cause errors or exceptions.
Hope this helps. Good luck!

How about defining an object with a method that creates a singleton instance of SparkSession, like MySparkSession.get(), and passing that instance as a parameter to each of your unit tests?
In your main method, you can create a separate SparkSession instance, which can have a different configuration.
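One way to picture the suggestion above, sketched in Python rather than Scala: a single accessor that sets the master only when the caller asks for it, so tests can pass "local[2]" while code launched through spark-submit leaves the master to the command line. The names `builder_options` and `get_session` are hypothetical; `spark.app.name` and `spark.master` are standard Spark configuration keys, and `getOrCreate()` returns an already running session if one exists.

```python
def builder_options(app_name, master=None):
    """Collect the settings to apply to the session builder (hypothetical helper).

    The master is included only when given explicitly, so spark-submit's
    --master flag is not overridden by application code.
    """
    options = {"spark.app.name": app_name}
    if master is not None:
        options["spark.master"] = master
    return options


def get_session(app_name, master=None):
    """Return a shared SparkSession configured from builder_options()."""
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in builder_options(app_name, master).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Tests would request a local master; production code would not.
test_options = builder_options("unit-tests", master="local[2]")
prod_options = builder_options("my-job")
```

In a test suite you would call `get_session("unit-tests", master="local[2]")`; in the job's entry point, `get_session("my-job")`.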

Running MLlib via Spark Job Server

I was practising developing a sample model using the online resources on the Spark website. I managed to create the model and run it on sample data using spark-shell, but how do I actually run the model in a production environment? Is it via Spark Job Server?
    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.linalg.Vectors
    val data = sc.textFile("hdfs://mycluster/user/Cancer.csv")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts.last.toDouble, Vectors.dense(parts.take(9).map(_.toDouble)))
    }
    var svm = new SVMWithSGD().setIntercept(true)
    val model = svm.run(parsedData)
    var predictedValue = model.predict(Vectors.dense(5,1,1,1,2,1,3,1,1))
    println(predictedValue)
The above code works perfectly when I run it in spark-shell, but I have no idea how to actually run the model in a production environment. I tried to run it via Spark Job Server, but I get an error:
    curl -d "input.string = 1, 2, 3, 4, 5, 6, 7, 8, 9" 'ptfhadoop01v:8090/jobs?appName=SQL&classPath=spark.jobserver.SparkPredict'
I am sure it is because I am passing a String value whereas the program expects vector elements. Can someone guide me on how to achieve this? Also, is this how data is passed to a model in a production environment, or is there some other way?

Spark Job-server is used in production use cases where you want to design pipelines of Spark jobs and, optionally, share the SparkContext across jobs over a REST API. Sparkplug is an alternative to Spark Job-server that provides similar constructs.
However, to answer your question on how to run a single Spark job in a production environment: you do not need a third-party library to do so. You only need to construct a SparkContext object and use it to trigger Spark jobs. For instance, for your code snippet, all that is needed is:
    package runner
    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.linalg.Vectors
    import com.typesafe.config.{ConfigFactory, Config}
    import org.apache.spark.{SparkConf, SparkContext}
    object SparkRunner {
      def main(args: Array[String]): Unit = {
        val config: Config = ConfigFactory.load("app-default-config") /*Use a library to read a config file*/
        val sc: SparkContext = constructSparkContext(config)
        val data = sc.textFile("hdfs://mycluster/user/Cancer.csv")
        val parsedData = data.map { line =>
          val parts = line.split(',')
          LabeledPoint(parts.last.toDouble, Vectors.dense(parts.take(9).map(_.toDouble)))
        }
        var svm = new SVMWithSGD().setIntercept(true)
        val model = svm.run(parsedData)
        var predictedValue = model.predict(Vectors.dense(5,1,1,1,2,1,3,1,1))
        println(predictedValue)
      }
      def constructSparkContext(config: Config): SparkContext = {
        val conf = new SparkConf()
        conf
          .setMaster(config.getString("spark.master"))
          .setAppName(config.getString("app.name"))
        /*Set more configuration values here*/
        new SparkContext(conf)
      }
    }
Optionally, you can also use SparkSubmit, the wrapper for the spark-submit script, which is provided in the Spark library itself.
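On the question's point that the job receives a string while `predict` expects numeric vector elements: a minimal sketch of the conversion step, written in Python for brevity. The function name `parse_features` is hypothetical, and the expected count of 9 is an assumption taken from the nine values in the question's example. The resulting list of floats is what you would hand to a dense-vector constructor.

```python
def parse_features(raw, expected=9):
    """Turn a comma-separated string into a list of floats (hypothetical helper).

    `expected` is the number of features the model was trained on; 9 is
    assumed here from the example in the question.
    """
    # float() tolerates surrounding whitespace, so "1, 2" parses fine.
    values = [float(token) for token in raw.split(",")]
    if len(values) != expected:
        raise ValueError(
            "expected %d features, got %d" % (expected, len(values))
        )
    return values


# The payload from the curl command in the question.
features = parse_features("1, 2, 3, 4, 5, 6, 7, 8, 9")
```

Validating the count before prediction turns a malformed request into a clear error instead of a failure deep inside the model call.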
