How should you end a Spark job inside an if statement? - scala

What is the recommended way to end a Spark job inside a conditional statement?
I am validating my data, and if the validation fails, I want to end the Spark job gracefully.
Right now I have:
if (!isValid(data)) {
  sparkSession.sparkContext.stop()
}
However, I get the following error:
Exception in thread "main" java.lang.IllegalStateException: SparkContext has been shutdown
Then it shows a stacktrace.
Is sparkContext.stop() not the proper way to end a spark job gracefully?

Once you stop the SparkSession, its SparkContext is shut down on that JVM; sc is no longer active.
So you can't call anything that relies on the SparkContext, such as creating an RDD or a DataFrame.
If you then use the same SparkSession again later in the program flow, you get exactly the exception shown above.
For example.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder.appName("stop-example").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val schema = StructType(Seq(
  StructField("fname", StringType),
  StructField("lname", StringType),
  StructField("age", StringType)))

val rdd = sc.parallelize(Seq(Row("RAMA", "DAS", "25"), Row("smritu", "ranjan", "26")))
val df = spark.createDataFrame(rdd, schema)
df.show() // works fine

if (df.select("fname").collect()(0).getAs[String]("fname") == "MAA") {
  println("continue")
} else {
  spark.stop() // stopping the SparkSession also stops its SparkContext
  println("inside stopping condition")
}

println("code continues")

// Throws java.lang.IllegalStateException: SparkContext has been shutdown
val rdd1 = sc.parallelize(Seq(Row("afdaf", "DAS", "56"), Row("sadfeafe", "adsadaf", "27")))
val df1 = spark.createDataFrame(rdd1, schema)
df1.show()

There is nothing to say that you can't call stop in an if statement, but there is very little reason to do so, and it is probably a mistake. It seems implicit in your question that you may be attempting to open multiple Spark sessions.
The Spark session is intended to be left open for the life of the program - if you try to start two, you will find that Spark throws an exception and prints some background to the logs, including a JIRA ticket that discusses the topic.
If you wish to run multiple Spark tasks, you may submit them to the same context. One context can run multiple tasks at once.
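For instance, here is a minimal sketch (the app name and ranges are made up) of one SparkSession running two jobs at the same time, simply by launching the actions from different threads:
import org.apache.spark.sql.SparkSession

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}

val spark = SparkSession.builder
  .appName("one-context-many-jobs")
  .master("local[*]")
  .getOrCreate()

// Two independent jobs submitted to the same SparkContext from different threads.
val evens  = Future { spark.range(0, 5000000).where("id % 2 = 0").count() }
val thirds = Future { spark.range(0, 5000000).where("id % 3 = 0").count() }

println(Await.result(evens, Duration.Inf))
println(Await.result(thirds, Duration.Inf))

spark.stop() // stop once, at the very end of the program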

Related

How to apply spark.sql into worker nodes combining with df.createOrReplaceTempView()

When I executed the following code,
df6.createOrReplaceTempView("table2")

def func(partition_id, r):
    flmgd = str(r.FNAME) + str(r.LNAME) + str(r.MNAME) + str(r.GENDER) + str(r.DOB)
    query = """
        SELECT PID, FNAMELNAMEMNAMEGENDERDOB
        FROM table2
        WHERE FNAMELNAMEMNAMEGENDERDOB = "%s"
    """ % flmgd
    df = spark.sql(query)
    list1 = df.select('FNAMELNAMEMNAMEGENDERDOB').rdd.collect()
    if list1 == []:
        ID = None
    else:
        ID = df.select(['PID']).collect()
    yield Row(**r.asDict(), PIDF=ID)

df3 = df2.rdd.mapPartitionsWithIndex(func).toDF()
then I received the warning message
PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
It looks like the main issue is that SparkContext cannot be used on the worker nodes. I googled but failed to find a solution. Would you mind helping me get around this issue? Thank you.
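The usual way around SPARK-5063 is to avoid touching spark (or any DataFrame API) inside the function that runs on the executors, and to express the per-row lookup as a driver-side join instead. A rough sketch of that idea, written in Scala to keep one language across this page (it assumes the name/DOB columns are strings; the equivalent withColumn/join calls exist in PySpark):
import org.apache.spark.sql.functions.{col, concat}

// Build on df2 the same concatenated key that df6/table2 already has,
// then resolve the lookup with one join instead of a spark.sql call
// per row inside mapPartitions.
val withKey = df2.withColumn("FNAMELNAMEMNAMEGENDERDOB",
  concat(col("FNAME"), col("LNAME"), col("MNAME"), col("GENDER"), col("DOB")))

val lookup = df6.select(col("FNAMELNAMEMNAMEGENDERDOB"), col("PID").as("PIDF"))

// A left join keeps every df2 row; PIDF is null where there is no match,
// which plays the role of ID = None in the original function.
val df3 = withKey.join(lookup, Seq("FNAMELNAMEMNAMEGENDERDOB"), "left")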

Spark context not invoking stop() and shutdown hook not called when saving to s3

I'm reading data from parquet files, processing it, and then saving the result to S3. The problem is relevant only to the last (saving) part.
When running locally, after saving the data, the sparkContext's stop() is not invoked, and the shutdown hook is not called. I need to manually invoke/call them by clicking the IDE's (IntelliJ) stop button.
When saving to a local folder, the process finishes correctly.
When running on EMR, the process finishes correctly.
Changing the format/header/etc. doesn't solve the problem.
Removing any/all of the transformations/joins doesn't solve the problem.
Problem occurs when using both dataframes and datasets.
EDIT: Following the comments, I tried using both s3:// and s3a://, as well as separating the commands, but the issue remains.
Example code:
package test

import org.apache.spark.sql.SparkSession

object test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("s3_output_test")
      .master("local[*]")
      .getOrCreate

    spark
      .read
      .parquet("path-to-input-data")
      // .transform(someFunc)
      // .join(someDF)
      .coalesce(1)
      .write
      .format("csv")
      .option("header", true)
      .save("s3a://bucket-name/output-folder")

    spark.stop()
  }
}
Any ideas on a solution or how to further debug the issue will be greatly appreciated, thanks!
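One way to dig a little further (just a debugging aid, not a fix) is to dump the non-daemon threads that are still alive after the write and the explicit spark.stop() have returned; whatever shows up there is what keeps the local JVM from exiting:
import scala.collection.JavaConverters._

// Run this right after spark.stop(): any non-daemon thread listed here
// (other than main itself) is what prevents the JVM from shutting down.
Thread.getAllStackTraces.keySet.asScala
  .filter(t => !t.isDaemon && t.getName != "main")
  .foreach(t => println(s"still alive: ${t.getName}"))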

What happens if SparkSession is not closed?

What's the difference between the following 2?
import org.apache.spark.sql.SparkSession

object Example1 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.getOrCreate
    try {
      // spark code here
    } finally {
      spark.close
    }
  }
}

object Example2 {
  val spark = SparkSession.builder.getOrCreate
  def main(args: Array[String]): Unit = {
    // spark code here
  }
}
I know that SparkSession implements Closeable and it hints that it needs to be closed. However, I can't think of any issues if the SparkSession is just created as in Example2 and never closed directly.
In case of success or failure of the Spark application (and exit from main method), the JVM will terminate and the SparkSession will be gone with it. Is this correct?
IMO: The fact that the SparkSession is a singleton should not make a big difference either.
You should always close your SparkSession when you are done with it (if only to follow the good practice of giving back what you've been given).
Closing a SparkSession may trigger freeing cluster resources that could be given to some other application.
A SparkSession is a session and as such maintains some resources that consume JVM memory. You can have as many SparkSessions as you want (see SparkSession.newSession to create a session afresh), but each of them holds on to memory, so close any session you no longer need.
SparkSession is Spark SQL's wrapper around Spark Core's SparkContext, so under the covers (as in any Spark application) you have cluster resources, i.e. vcores and memory, assigned to your SparkSession (through its SparkContext). That means that as long as your SparkContext is in use, those cluster resources are not available to other tasks (not necessarily Spark's; the same goes for non-Spark applications submitted to the cluster). The resources are yours until you say "I'm done", which translates to calling close.
If, however, you simply exit the Spark application without calling close, you don't have to worry either: the JVMs for the driver and executors terminate, and with them the (heartbeat) connection to the cluster, so the resources are eventually given back to the cluster manager, which can offer them to some other application.
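As a rough illustration of the point about multiple sessions (names are made up), newSession gives you an additional session that shares the one underlying SparkContext but keeps its own SQL configuration and temporary views:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("sessions-demo")
  .master("local[*]")
  .getOrCreate()

// A second session: same SparkContext, separate SQL conf and temp views.
val spark2 = spark.newSession()

spark.range(5).createOrReplaceTempView("t")
spark2.range(10).createOrReplaceTempView("t") // independent view, same name

println(spark.sparkContext eq spark2.sparkContext) // true: one context per JVM

spark.stop() // stops the shared SparkContext for both sessions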
Both are the same!
SparkSession's stop/close eventually calls the SparkContext's stop:
def stop(): Unit = {
  sparkContext.stop()
}

override def close(): Unit = stop()
SparkContext registers a runtime shutdown hook that stops the context before the JVM exits. Here is the Spark code that adds the shutdown hook while the context is being created:
_shutdownHookRef = ShutdownHookManager.addShutdownHook(
  ShutdownHookManager.SPARK_CONTEXT_SHUTDOWN_PRIORITY) { () =>
  logInfo("Invoking stop() from shutdown hook")
  stop()
}
So stop() is invoked regardless of how the JVM exits. If you call stop() manually, the shutdown hook is removed to avoid stopping the context twice:
def stop(): Unit = {
  if (LiveListenerBus.withinListenerThread.value) {
    throw new SparkException(
      s"Cannot stop SparkContext within listener thread of ${LiveListenerBus.name}")
  }
  // Use the stopping variable to ensure no contention for the stop scenario.
  // Still track the stopped variable for use elsewhere in the code.
  if (!stopped.compareAndSet(false, true)) {
    logInfo("SparkContext already stopped.")
    return
  }
  if (_shutdownHookRef != null) {
    ShutdownHookManager.removeShutdownHook(_shutdownHookRef)
  }
  // ...

Has the limitation of a single SparkContext actually been lifted in Spark 2.0?

There has been plenty of chatter about Spark 2.0 supporting multiple SparkContexts. A configuration variable to support it has been around for much longer, but it has never actually been effective.
In $SPARK_HOME/conf/spark-defaults.conf :
spark.driver.allowMultipleContexts true
Let's verify that the property was recognized:
scala> println(s"allowMultiCtx = ${sc.getConf.get("spark.driver.allowMultipleContexts")}")
allowMultiCtx = true
Here is a small proof-of-concept program for it:
import org.apache.spark._
import org.apache.spark.streaming._

println(s"allowMultiCtx = ${sc.getConf.get("spark.driver.allowMultipleContexts")}")

def createAndStartFileStream(dir: String) = {
  val sc = new SparkContext("local[1]", s"Spark-$dir" /*, conf */)
  val ssc = new StreamingContext(sc, Seconds(4))
  val dstream = ssc.textFileStream(dir)
  val valuesCounts = dstream.countByValue()
  ssc.start
  ssc.awaitTermination
}

val dirs = Seq("data10m", "data50m", "dataSmall").map { d =>
  s"/shared/demo/data/$d"
}
dirs.foreach { d =>
  createAndStartFileStream(d)
}
However, attempts to use that capability are not succeeding:
16/08/14 11:38:55 WARN SparkContext: Multiple running SparkContexts detected
in the same JVM!
org.apache.spark.SparkException: Only one SparkContext may be running in
this JVM (see SPARK-2243). To ignore this error,
set spark.driver.allowMultipleContexts = true.
The currently running SparkContext was created at:
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:814)
org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
Anyone have any insight on how to use the multiple contexts?
Per @LostInOverflow, this feature will not be supported. Here is the info from that JIRA:
SPARK-2243 Support multiple SparkContexts in the same JVM
https://issues.apache.org/jira/browse/SPARK-2243
Sean Owen added a comment - 16/Jan/16 17:35 You say you're concerned
with over-utilizing a cluster for steps that don't require much
resource. This is what dynamic allocation is for: the number of
executors increases and decreases with load. If one context is already
using all cluster resources, yes, that doesn't do anything. But then,
neither does a second context; the cluster is already fully used. I
don't know what overhead you're referring to, but certainly one
context running N jobs is busier than N contexts running N jobs. Its
overhead is higher, but the total overhead is lower. This is more an
effect than a cause that would make you choose one architecture over
another. Generally, Spark has always assumed one context per JVM and I
don't see that changing, which is why I finally closed this. I don't
see any support for making this happen.
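If the goal of the POC was simply to watch several directories at once, one context is enough: a single StreamingContext can host several file streams. A rough sketch, reusing the (illustrative) paths from the question:
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// One SparkContext, one StreamingContext, several input streams.
val sc = new SparkContext("local[*]", "Spark-multi-dir")
val ssc = new StreamingContext(sc, Seconds(4))

val dirs = Seq("data10m", "data50m", "dataSmall").map(d => s"/shared/demo/data/$d")

dirs.foreach { dir =>
  ssc.textFileStream(dir).countByValue().print()
}

ssc.start()
ssc.awaitTermination()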

Using Spark From Within My Application

I am relatively new to Spark. I have created scripts that run offline, batch-processing tasks, but now I am attempting to use Spark from within a live application (i.e. not using spark-submit).
My application looks like the following:
An HTTP request will be received to initiate a data-processing task. The payload of the request will contain the data needed for the Spark job (time window to look at, data types to ignore, weights to use for ranking algorithms, etc.).
Processing begins, which will eventually produce a result. The result is expected to fit into memory.
The result is sent to an endpoint (written to a single file, sent to a client through a WebSocket or SSE, etc.).
The fact that the job is initiated by an HTTP request, and that the Spark job depends on data in that request, seems to imply that I can't use spark-submit for this problem. Although this originally seemed fairly straightforward to me, I've run into some issues that no amount of Googling has been able to solve. The main issue is exemplified by the following:
object MyApplication {
  def main(args: Array[String]): Unit = {
    val config = ConfigFactory.load()
    val sparkConf = new SparkConf(true)
      .setMaster(config.getString("spark.master-uri"))
      .setAppName(config.getString("spark.app-name"))
    val sc = new SparkContext(sparkConf)

    val localData = Seq.fill(10000)(1)

    // The following produces ClassNotFoundException for Test
    val data = sc.parallelize[Int](localData).map(i => new Test(i))

    println(s"My Result ${data.count()}")
    sc.stop()
  }
}

class Test(num: Int) extends Serializable
Which is then run
sbt run
...
[error] (run-main-0) org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 5, 127.0.0.1): java.lang.ClassNotFoundException: my.package.Test
...
I have classes within my application that I want to create and manipulate while using Spark. I've considered extracting all of these classes into a stand-alone library which I can send using .setJars("my-library-of-classes") but that will be very awkward to work with and deploy. I've also considered not using any classes within the spark logic and only using tuples. That would be just as awkward to use.
Is there a way for me to do this? Have I missed something obvious or are we really forced to use spark-submit to communicate with Spark?
Currently spark-submit is the only way to submit Spark jobs to a cluster. So you have to create a fat jar consisting of all your classes and dependencies and use it with spark-submit. You can use command-line arguments to pass either data or paths to configuration files.
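For example, assuming the fat jar is built with sbt-assembly (the jar path, master URI, and the application argument below are all illustrative), the job would be launched like this, with everything after the jar passed straight to MyApplication's main:
spark-submit \
  --class my.package.MyApplication \
  --master spark://master-host:7077 \
  target/scala-2.11/my-application-assembly-1.0.jar \
  --job-config /etc/my-app/job.conf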