How to stop a running SparkContext before opening a new one - scala

I am executing tests in Scala with Spark, creating a SparkContext as follows:
val conf = new SparkConf().setMaster("local").setAppName("test")
val sc = new SparkContext(conf)
After the first execution there was no error. But now I am getting this message (and a failed test notification):
Only one SparkContext may be running in this JVM (see SPARK-2243).
It looks like I need to check if there is any running SparkContext and stop it before launching a new one (I do not want to allow multiple contexts).
How can I do this?
UPDATE:
I tried this, but I get the same error (I am running the tests from IntelliJ IDEA and I rebuild the code before executing it):
val conf = new SparkConf().setMaster("local").setAppName("test")
// also tried: .set("spark.driver.allowMultipleContexts", "true")
UPDATE 2:
class TestApp extends SparkFunSuite with TestSuiteBase {
// use longer wait time to ensure job completion
override def maxWaitTimeMillis: Int = 20000
System.clearProperty("spark.driver.port")
System.clearProperty("spark.hostPort")
var ssc: StreamingContext = _
val config: SparkConf = new SparkConf().setMaster("local").setAppName("test")
.set("spark.driver.allowMultipleContexts", "true")
val sc: SparkContext = new SparkContext(config)
//...
test("Test1")
{
sc.stop()
}
}

To stop an existing context you can use the stop method on a given SparkContext instance.
import org.apache.spark.{SparkContext, SparkConf}
val conf: SparkConf = ???
val sc: SparkContext = new SparkContext(conf)
...
sc.stop()
To reuse an existing context or create a new one you can use the SparkContext.getOrCreate method.
val sc1 = SparkContext.getOrCreate(conf)
...
val sc2 = SparkContext.getOrCreate(conf)
When used in test suites both methods can be used to achieve different things:
stop - stopping the context in the afterAll method (see for example MLlibTestSparkContext.afterAll)
getOrCreate - getting an active instance in individual test cases (see for example QuantileDiscretizerSuite)
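For example, a minimal ScalaTest suite combining both (a sketch; the suite name, configuration, and test body are placeholders rather than anything from the Spark test sources mentioned above):
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class MyTestSuite extends FunSuite with BeforeAndAfterAll {

  private val conf = new SparkConf().setMaster("local[2]").setAppName("test")

  test("count elements") {
    // getOrCreate reuses the active context if one exists, otherwise creates it
    val sc = SparkContext.getOrCreate(conf)
    assert(sc.parallelize(1 to 10).count() == 10)
  }

  override def afterAll(): Unit = {
    // stop the single context once all tests in this suite have finished
    SparkContext.getOrCreate(conf).stop()
    super.afterAll()
  }
}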

Related

Best practice to create SparkSession object in Scala to use both in unit tests and spark-submit

I have tried to write a transform method from DataFrame to DataFrame.
And I also want to test it with ScalaTest.
As you know, in Spark 2.x with the Scala API, you can create a SparkSession object as follows:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder
.config("spark.master", "local[2]")
.getOrCreate()
This code works fine with unit tests.
But, when I run this code with spark-submit, the cluster options did not work.
For example,
spark-submit --master yarn --deploy-mode client --num-executors 10 ...
does not create any executors.
I have found that the spark-submit arguments are applied when I remove the config("spark.master", "local[2]") part of the above code.
But without the master setting, the unit test code does not work.
I tried to split the spark (SparkSession) object creation into test and main parts.
But there are so many code blocks that need spark, for example import spark.implicits._ and spark.createDataFrame(rdd, schema).
Is there any best practice for writing code that creates the spark object both for tests and for spark-submit?
One way is to create a trait which provides the SparkContext/SparkSession, and use that in your test cases, like so:
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.{SparkConf, SparkContext}

trait SparkTestContext {
private val master = "local[*]"
private val appName = "testing"
System.setProperty("hadoop.home.dir", "c:\\winutils\\")
private val conf: SparkConf = new SparkConf()
.setMaster(master)
.setAppName(appName)
.set("spark.driver.allowMultipleContexts", "false")
.set("spark.ui.enabled", "false")
val ss: SparkSession = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()
val sc: SparkContext = ss.sparkContext
val sqlContext: SQLContext = ss.sqlContext
}
And your test class header then looks like this for example:
class TestWithSparkTest extends BaseSpec with SparkTestContext with Matchers{
I made a version where Spark shuts down correctly after the tests.
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite, Matchers}
trait SparkTest extends FunSuite with BeforeAndAfterAll with Matchers {
var ss: SparkSession = _
var sc: SparkContext = _
var sqlContext: SQLContext = _
override def beforeAll(): Unit = {
val master = "local[*]"
val appName = "MyApp"
val conf: SparkConf = new SparkConf()
.setMaster(master)
.setAppName(appName)
.set("spark.driver.allowMultipleContexts", "false")
.set("spark.ui.enabled", "false")
ss = SparkSession.builder().config(conf).getOrCreate()
sc = ss.sparkContext
sqlContext = ss.sqlContext
super.beforeAll()
}
override def afterAll(): Unit = {
sc.stop()
super.afterAll()
}
}
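A concrete suite would then just mix the trait in; the suite and test names below are placeholders:
class WordCountSuite extends SparkTest {
  test("counts words") {
    // sc is initialized by the trait's beforeAll
    val counts = sc.parallelize(Seq("a", "b", "a")).countByValue()
    counts("a") shouldBe 2L
  }
}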
The spark-submit command with the parameter --master yarn sets the YARN master.
This conflicts with a master("x") call in your code, even if you use master("yarn").
If you want to use import sparkSession.implicits._ (for toDF, toDS, and other functions),
you can just use a local sparkSession variable created like below:
val spark = SparkSession.builder().appName("YourName").getOrCreate()
Do not call master("x") in the code; set the master via spark-submit --master yarn instead (this works on the cluster, not on a local machine).
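For example (a sketch, not part of the original answer), you can set a local master only when spark-submit has not already supplied one, so the same code runs in unit tests and on the cluster:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// SparkConf picks up spark.master when it has been provided via spark-submit
val hasMaster = new SparkConf().contains("spark.master")
val builder = SparkSession.builder().appName("YourName")

// fall back to local mode only when no master was supplied (e.g. in unit tests)
val spark =
  if (hasMaster) builder.getOrCreate()
  else builder.master("local[2]").getOrCreate()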
My advice: do not use a global sparkSession in your code. That may cause errors or exceptions.
Hope this helps.
Good luck!
How about defining an object in which a method creates a singleton instance of SparkSession, like MySparkSession.get(), and passing it as a parameter to each of your unit tests?
In your main method, you can create a separate SparkSession instance, which can have different configurations.
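A minimal sketch of such an object; the name MySparkSession.get() comes from the suggestion above, and the config values are placeholders:
import org.apache.spark.sql.SparkSession

object MySparkSession {
  // getOrCreate already guarantees a single active session per JVM
  def get(): SparkSession =
    SparkSession.builder()
      .master("local[2]")
      .appName("test")
      .getOrCreate()
}
Each unit test then receives MySparkSession.get() as its SparkSession argument, while main builds its own session without the hard-coded master.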

Can SparkContext and StreamingContext co-exist in the same program?

I am trying to set up Spark Streaming code which reads lines from a Kafka server but processes them using rules written in another local file. I am creating a StreamingContext for the streaming data and a SparkContext for applying all other Spark features, like string manipulation, reading local files, etc.
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("ReadLine")
val ssc = new StreamingContext(sparkConf, Seconds(15))
ssc.checkpoint("checkpoint")
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
val sentence = lines.toString
val conf = new SparkConf().setAppName("Bi Gram").setMaster("local[2]")
val sc = new SparkContext(conf)
val stringRDD = sc.parallelize(Array(sentence))
But this throws the following error
Exception in thread "main" org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at:
org.apache.spark.SparkContext.<init>(SparkContext.scala:82)
org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:874)
org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:81)
One application can only have ONE SparkContext. A StreamingContext is created on top of a SparkContext. You just need to create the ssc StreamingContext using the SparkContext:
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(15))
If you use the following constructor:
StreamingContext(conf: SparkConf, batchDuration: Duration)
it internally creates another SparkContext:
this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
The SparkContext can be obtained from the StreamingContext via
ssc.sparkContext
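Applied to the snippet from the question, the fix looks like this (a sketch; the Kafka connection values are placeholders standing in for the undefined variables in the question):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val zkQuorum = "localhost:2181" // placeholder
val group = "my-group"          // placeholder
val topics = "my-topic"         // placeholder
val numThreads = 1              // placeholder

val sparkConf = new SparkConf().setMaster("local[*]").setAppName("ReadLine")
val sc = new SparkContext(sparkConf)            // the only SparkContext in this JVM
val ssc = new StreamingContext(sc, Seconds(15)) // built on top of the same context
ssc.checkpoint("checkpoint")

val topicMap = topics.split(",").map((_, numThreads)).toMap
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)

// sc is now free for batch-style work such as parallelizing local data
val stringRDD = sc.parallelize(Seq("some sentence"))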
Yes, you can do it.
You have to first start a Spark session and
then use its context to start any number of streaming contexts:
val spark = SparkSession.builder().appName("someappname").
config("spark.sql.warehouse.dir",warehouseLocation).getOrCreate()
val ssc = new StreamingContext(spark.sparkContext, Seconds(1))
Simple!!!

Spark Streaming StreamingContext error

Hi, I have started learning Spark Streaming but I can't run a simple application.
My code is here:
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
val conf = new SparkConf().setMaster("spark://beyhan:7077").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
And I am getting an error like the following:
scala> val newscc = new StreamingContext(conf, Seconds(1))
15/10/21 13:41:18 WARN SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
Thanks
If you are using spark-shell, and it seems like you are, you should not instantiate a StreamingContext from a SparkConf object; you should pass the shell-provided sc directly.
This means:
val conf = new SparkConf().setMaster("spark://beyhan:7077").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
becomes,
val ssc = new StreamingContext(sc, Seconds(1))
It looks like you are working in the Spark shell.
There is already a SparkContext defined for you there, so you don't need to create a new one. The SparkContext in the shell is available as sc.
If you need a StreamingContext you can create one using the existing SparkContext:
val ssc = new StreamingContext(sc, Seconds(1))
You only need the SparkConf and SparkContext if you create an application.

Good practice to ensure only one Spark context in an application

I am looking for a good way to ensure that my app uses only one single SparkContext (sc). While developing I often run into errors and have to restart my Play! server to re-test my modifications.
Would a singleton pattern be the solution?
object sparckContextSingleton {
@transient private var instance: SparkContext = _
private val conf : SparkConf = new SparkConf()
.setMaster("local[2]")
.setAppName("myApp")
def getInstance(): SparkContext = {
if (instance == null){
instance = new SparkContext(conf)
}
instance
}
}
This does not do a good job. Should I stop the SparkContext?
This should be enough to do the trick; the important thing is to use val and not var.
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object SparkContextKeeper {
val conf = new SparkConf().setAppName("SparkApp")
val context = new SparkContext(conf)
val sqlContext = new SQLContext(context)
}
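Anywhere in the application you can then reach the single context through the object, for example (a usage sketch):
val sc = SparkContextKeeper.context
val count = sc.parallelize(1 to 100).count()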
In Play you should write a plugin that exposes the SparkContext. Use the plugin's start and stop hooks to start and stop the context.
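A sketch of such a plugin, assuming the legacy play.api.Plugin API (Play 2.4 and earlier); on newer Play versions the same idea would use an ApplicationLifecycle stop hook instead:
import org.apache.spark.{SparkConf, SparkContext}
import play.api.{Application, Plugin}

class SparkPlugin(app: Application) extends Plugin {
  private var sc: SparkContext = _

  // expose the single context to the rest of the application
  def sparkContext: SparkContext = sc

  override def onStart(): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("myApp")
    sc = new SparkContext(conf)
  }

  override def onStop(): Unit = {
    if (sc != null) sc.stop()
  }
}
The plugin is then registered in conf/play.plugins with a priority and its fully qualified class name.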

What are all the ways we can run Scala code in Apache Spark?

I know there are two ways to run Scala code in Apache Spark:
1- Using spark-shell
2- Making a jar file from our project and using spark-submit to run it
Is there any other way to run Scala code in Apache Spark? For example, can I run a Scala object (e.g. object.scala) in Apache Spark directly?
Thanks
1. Using spark-shell
2. Making a jar file from our project and using spark-submit to run it
3. Running Spark Job programmatically
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

String sourcePath = "hdfs://hdfs-server:54310/input/*";
SparkConf conf = new SparkConf().setAppName("TestLineCount");
conf.setJars(new String[] { App.class.getProtectionDomain()
.getCodeSource().getLocation().getPath() });
conf.setMaster("spark://spark-server:7077");
conf.set("spark.driver.allowMultipleContexts", "true");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> log = sc.textFile(sourcePath);
JavaRDD<String> lines = log.filter(x -> {
return true;
});
System.out.println(lines.count());
Scala version:
import org.apache.log4j.Logger
import org.apache.log4j.Level
import org.apache.spark.{SparkConf, SparkContext}
object SimpleApp {
def main(args: Array[String]) {
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("okka").setLevel(Level.OFF)
val logFile = "/tmp/logs.txt"
val conf = new SparkConf()
.setAppName("Simple Application")
.setMaster("local")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile, 2).cache
println("line count: " + logData.count())
}
}
For more detail, refer to this blog post.