What's the difference between the following 2?
object Example1 {
def main(args: Array[String]): Unit = {
try {
val spark = SparkSession.builder.getOrCreate
// spark code here
} finally {
spark.close
}
}
}
object Example2 {
val spark = SparkSession.builder.getOrCreate
def main(args: Array[String]): Unit = {
// spark code here
}
}
I know that SparkSession implements Closeable and it hints that it needs to be closed. However, I can't think of any issues if the SparkSession is just created as in Example2 and never closed directly.
In case of success or failure of the Spark application (and exit from main method), the JVM will terminate and the SparkSession will be gone with it. Is this correct?
IMO: The fact that the SparkSession is a singleton should not make a big difference either.
You should always close your SparkSession when you are done with its use (even if the final outcome were just to follow a good practice of giving back what you've been given).
Closing a SparkSession may trigger freeing cluster resources that could be given to some other application.
SparkSession is a session and as such maintains some resources that consume JVM memory. You can have as many SparkSessions as you want (see SparkSession.newSession to create a session afresh) but you don't want them to use memory they should not if you don't use one and hence close the one you no longer need.
SparkSession is Spark SQL's wrapper around Spark Core's SparkContext and so under the covers (as in any Spark application) you'd have cluster resources, i.e. vcores and memory, assigned to your SparkSession (through SparkContext). That means that as long as your SparkContext is in use (using SparkSession) the cluster resources won't be assigned to other tasks (not necessarily Spark's but also for other non-Spark applications submitted to the cluster). These cluster resources are yours until you say "I'm done" which translates to...close.
If however, after close, you simply exit a Spark application, you don't have to think about executing close since the resources will be closed automatically anyway. The JVMs for the driver and executors terminate and so does the (heartbeat) connection to the cluster and so eventually the resources are given back to the cluster manager so it can offer them to use by some other application.
Both are same!
Spark session's stop/close eventually calls spark context's stop
def stop(): Unit = {
sparkContext.stop()
}
override def close(): Unit = stop()
Spark context has run time shutdown hook to close the spark context before exiting the JVM. Please find the spark code below for adding shutdown hook while creating the context
ShutdownHookManager.addShutdownHook(
_shutdownHookRef = ShutdownHookManager.SPARK_CONTEXT_SHUTDOWN_PRIORITY) { () =>
logInfo("Invoking stop() from shutdown hook")
stop()
}
So this will be called irrespective of how JVM exits. If you stop() manually, this shutdown hook will be cancelled to avoid duplication
def stop(): Unit = {
if (LiveListenerBus.withinListenerThread.value) {
throw new SparkException(
s"Cannot stop SparkContext within listener thread of ${LiveListenerBus.name}")
}
// Use the stopping variable to ensure no contention for the stop scenario.
// Still track the stopped variable for use elsewhere in the code.
if (!stopped.compareAndSet(false, true)) {
logInfo("SparkContext already stopped.")
return
}
if (_shutdownHookRef != null) {
ShutdownHookManager.removeShutdownHook(_shutdownHookRef)
}
Related
What is the recommended way to end a spark job inside a conditional statement?
I am doing validation on my data, and if false, I want to end the spark job gracefully.
Right now I have:
if (isValid(data)) {
sparkSession.sparkContext.stop()
}
However, I get the following error:
Exception in thread "main" java.lang.IllegalStateException: SparkContext has been shutdown
Then it shows a stacktrace.
Is sparkContext.stop() not the proper way to end a spark job gracefully?
Once you stop the SparkSession means, your SparkContext is killed on the JVM. sc is no longer active now.
So You can't call any sparkContext related objects/functions for creating RDD/Dataframe or anything else.
If you call the same sparksession again in the flow of program.. you should find the above Exception.
For example.
` val rdd=sc.parallelize(Seq(Row("RAMA","DAS","25"),Row("smritu","ranjan","26")))
val df=spark.createDataFrame(rdd,schema)
df.show() //It works fine
if(df.select("fname").collect()(0).getAs[String]("fname")=="MAA"){
println("continue")
}
else{
spark.stop() //stopping sparkSession
println("inside Stopiing condition")
}
println("code continues")
val rdd1=sc.parallelize(Seq(Row("afdaf","DAS","56"),Row("sadfeafe","adsadaf","27")))
//Throws Exception...
val df1=spark.createDataFrame(rdd1,schema)
df1.show()
`
There is nothing to say that you can't call stop in an if statement, but there is very little reason to do so and it is probably a mistake to do so. It seems implicit in your question that you may be attempting to open multiple Spark sessions.
The Spark session is intended to be left open for the life of the program - if you try to start two you will find that Spark throws an exception and prints some background including a JIRA ticket that discusses the topic to the logs.
If you wish to run multiple Spark tasks, you may submit them to the same context. One context can run multiple tasks at once.
Willing to be the most efficient in writing data back into kafka, i am interested in using Akka Stream to write my RDD partition back into Kafka.
The problem is that i need a way to create an actor system per executor and not per partition which would be ridiculous. One may end up with 8 actorSystems on one node on one JVM. However having a Stream per partition is fine.
Has anyone already done that ?
My understanding, an actor system can't be serialized, hence can't be
sent has broadcast variable which would be per executor.
If one has had the experience around figuring a solution to that and tested please would you share ?
Else i can always fall back to https://index.scala-lang.org/benfradet/spark-kafka-writer/spark-kafka-0-10-writer/0.3.0?target=_2.11 but i am not sure it is the most efficient way.
You can always define a global lazy val with an actor system:
object Execution {
implicit lazy val actorSystem: ActorSystem = ActorSystem()
implicit lazy val materializer: Materializer = ActorMaterializer()
}
Then you just import it in any of the classes where you want to use Akka Streams:
import Execution._
val stream: DStream[...] = ...
stream.foreachRDD { rdd =>
...
rdd.foreachPartition { records =>
val (queue, done) = Source.queue(...)
.via(Producer.flow(...))
.toMat(Sink.ignore)(Keep.both)
.run() // implicitly pulls `Execution.materializer` from scope,
// which in turn will initialize `Execution.actorSystem`
... // push records to the queue
// wait until the stream is completed
Await.result(done, 10.minutes)
}
}
The above is kind of pseudocode but I think it should convey the general idea.
This way the system is going to be initialized on every executor JVM only once when it is needed. Additionally you can make the actor system "daemonic" in order for it to shut down automatically when the JVM finishes:
object Execution {
private lazy val config = ConfigFactory.parseString("akka.daemonic = on")
.withFallback(ConfigFactory.load())
implicit lazy val actorSystem: ActorSystem = ActorSystem("system", config)
implicit lazy val materializer: Materializer = ActorMaterializer()
}
We're doing this in our Spark jobs and it works flawlessly.
This works without any kind of broadcast variables, and, naturally, can be used in all kinds of Spark jobs, streaming or otherwise. Because the system is defined in a singleton object, it is guaranteed to be initialized only once per JVM instance (modulo various classloader shenanigans, but it doesn't really matter in the context of Spark), therefore even if some of the partitions get placed onto the same JVM (maybe in different threads), it will only initialize the actor system one time. lazy val ensures the thread-safety of the initialization, and ActorSystem is thread-safe, so this won't cause problems in this regard as well.
I have a scala program that runs for a while and then terminates. I'd like to provide a library to this program that, behind the scenes, schedules an asynchronous task to run every N seconds. I'd also like the program to terminate when the main entrypoint's work is finished without needing to explicitly tell the background work to shut down (since it's inside a library).
As best I can tell the idiomatic way to do polling or scheduled work in Scala is with Akka's ActorSystem.scheduler.schedule, but using an ActorSystem makes the program hang after main waiting for the actors. I then tried and failed to add another actor that joins on the main thread, seemingly because "Anything that blocks a thread is not advised within Akka"
I could introduce a custom dispatcher; I could kludge something together with a polling isAlive check, or adding a similar check inside each worker; or I could give up on Akka and just use raw Threads.
This seems like a not-too-unusual thing to want to do, so I'd like to use idiomatic Scala if there's a clear best way.
I don't think there is an idiomatic Scala way.
The JVM program terminates when all non-daemon thread are finished. So you can schedule your task to run on a daemon thread.
So just use Java functionality:
import java.util.concurrent._
object Main {
def main(args: Array[String]): Unit = {
// Make a ThreadFactory that creates daemon threads.
val threadFactory = new ThreadFactory() {
def newThread(r: Runnable) = {
val t = Executors.defaultThreadFactory().newThread(r)
t.setDaemon(true)
t
}
}
// Create a scheduled pool using this thread factory
val pool = Executors.newSingleThreadScheduledExecutor(threadFactory)
// Schedule some function to run every second after an initial delay of 0 seconds
// This assumes Scala 2.12. In 2.11 you'd have to create a `new Runnable` manually
// Note that scheduling will stop, if there is an exception thrown from the function
pool.scheduleAtFixedRate(() => println("run"), 0, 1, TimeUnit.SECONDS)
Thread.sleep(5000)
}
}
You can also use guava to create a daemon thread factory with new ThreadFactoryBuilder().setDaemon(true).build().
If you use Akka scheduler you will be relying on highly tuned and optimized implementation that is well tested. Bringing up an actor system is a bit heavy weight though, I agree. Additionally you have to bring in a dependency on akka. If you are ok with that you can explicitly call system.shutdown from main when you are done, or wrap it in a function that will do it for you.
Alternatively, you could try something along these lines:
import scala.concurrent._
import ExecutionContext.Implicits.global
object Main extends App {
def repeatEvery[T](timeoutMillis: Int)(f: => T): Future[T] = {
val p = Promise[T]()
val never = p.future
f
def timeout = Future {
Thread.sleep(timeoutMillis)
throw new TimeoutException
}
val failure = Future.firstCompletedOf(List(never, timeout))
failure.recoverWith { case _ => repeatEvery(timeoutMillis)(f) }
}
repeatEvery(1000) {
println("scheduled job called")
}
println("main started doing its work")
Thread.sleep(10000)
println("main finished")
}
Prints:
scheduled job called
main started doing its work
scheduled job called
scheduled job called
scheduled job called
scheduled job called
scheduled job called
scheduled job called
scheduled job called
scheduled job called
scheduled job called
main finished
I don't like that it uses Thread.sleep, but that is done to avoid using any other 3rd party schedulers and Scala Future does not provide timeout options. So you'll be wasting one thread on that scheduling task, but that's what Akka scheduler seems to do anyway. The difference is that perhaps you want a single scheduler for the whole JVM not to waste too many threads. The code I provided albeit simpler will waste a thread per job.
In my spark application, there is an object ResourceFactory which contains an akka ActorSystem for providing resource clients. So when I run this spark application, every worker node will create an ActorSystem. The problem is that when the spark application finishes its works and gets shutdown. The ActorSystem still keeps alive on every worker node and prevents the whole application to terminate, it's just hung on.
Is there a way to register some listener to the SparkContext so that when the sc gets shutdown, then the ActorSystem on every worker node will get notified to shutdown themselves?
UPDATE:
Following is the simplified skeleton:
There is a ResourceFactory, which is an object and it contains an actor system. And it also provides a fetchData method.
object ResourceFactory{
val actorSystem = ActorSystem("resource-akka-system")
def fetchData(): SomeData = ...
}
And then, there is a user-defined RDD class, in its compute method, it needs to fetch data from the ResourceFactory.
class MyRDD extends RDD[SomeClass] {
override def compute(...) {
...
ResourceFactory.fetchData()
...
someIterator
}
}
So on every node there will be one ActorSystem named "resource-akka-system", and those MyRDD instances distributed on those worker nodes can get data from the "resource-akka-system".
The problem is that, when the SparkContext gets shutdown, there is no need for those "resource-akka-system"s, but I don't know how to notify the ResourceFactory to shutdown the "resource-akka-system" when the SparkContext gets shutdown. So now, the "resouce-akka-system" keeps alive on each worker node and prevents the whole program to exit.
UPDATE2:
With some more experiments, I find that in local mode the program is hung on, but in yarn-cluster mode, the program will exit successfully. May be this is because yarn will kill the threads on worker nodes when the sc is shutdown?
UPDATE3:
To check whether every node contains an ActorSystem, I change the code as following(following is the real skeleton, as I add another class definition):
object ResourceFactory{
println("creating resource factory")
val actorSystem = ActorSystem("resource-akka-system")
def fetchData(): SomeData = ...
}
class MyRDD extends RDD[SomeClass] {
println("creating my rdd")
override def compute(...) {
new RDDIterator(...)
}
}
class RDDIterator(...) extends Iterator[SomeClass] {
println("creating rdd iterator")
...
lazy val reader = {
...
ResourceFactory.fetchData()
...
}
...
override next() = {
...
reader.xx()
}
}
After adding those printlns, I run the code on spark on yarn-cluster mode. I find that on the driver I have following prints:
creating my rdd
creating resource factory
creating my rdd
...
While on some of the workers, I have following prints:
creating rdd iterator
creating resource factory
And some of the workers, it prints nothing (and all of them are not assigned any tasks).
Based on the above, I think the object is initialized in driver eagerly, since it prints creating resource factory on the driver even when no thing refers to it, and object is initialized in worker lazily because it prints creating resource factory after printing creating rdd iterator as resource factory is lazily referenced by the first created RDDIterator.
And I find that in my use case the MyRDD class is only created in the driver.
I am not very sure about the laziness of the initialization of the object on driver and worker, it's my guess, because maybe it's caused by other part of the program to make it looks like that. But I think it should be right that there is one actor system on each worker node when it is necessary.
I don't think that there is a way to tap into each Worker lifecycle.
Also I have some questions regarding your implementation:
If you have object that contains val, that is used from function run on worker, my understanding is that this val gets serialized and broadcasted to worker. Can you confirm, that you have one ActorSystem running per worker?
Actor System usually terminated immediately if you don't explicitly wait for it's termination. Are you calling something like system.awaitTermination or blocking on system.whenTerminated?
Anyway, there is another way, how you can shutdown actor systems on remote workers:
Make your ActorSystem on each node part of the akka cluster. Here are some docs how to do that programmatically.
Have address of your "coordination" Actor on driver node (where your sc is) broadcasted to each worker. In simple words, just have val with that address.
When your akka system is started on each worker use that "coordination" Actor address to register this particular actor system (send corresponding message to coordination Actor).
Coordination Actor keeps track of all registered "worker" Actors
When your computation is completed and you want to shut down Akka system on every worker, send messages to all registered Actors from coordination Actor on driver node.
Shutdown on worker Akka systems when "shutdown" message is received.
It seems like there is no need in a class with a main method in it to be able to run Akka How to run akka actors in IntelliJ IDEA. However, here is what I have:
object Application extends App {
val system = ActorSystem()
val supervisor = system.actorOf(Props[Supervisor])
implicit val timeout = Timeout(100 seconds)
import system.dispatcher
system.scheduler.schedule(1 seconds, 600 seconds) {
val future = supervisor ? Supervisor.Start
val list = Await.result(future, timeout.duration).asInstanceOf[List[Int]]
supervisor ! list
}
}
I know I have to specify a main method called "akka.Main" in the configuration. But nonetheless, where should I move the current code from object Application ?
You can write something like
import _root_.akka.Main
object Application extends App {
Main.main(Array("somepackage.Supervisor"))
}
and Supervisor actor should have overriden preStart function as #cmbaxter suggested.
Then run sbt console in intellij and write run.
I agree with #kdrakon that your code is fine the way it is, but if you wanted to leverage the akka.Main functionality, then a simple refactor like so will make things work:
package code
class ApplicationActor extends Actor {
override def preStart = {
val supervisor = context.actorOf(Props[Supervisor])
implicit val timeout = Timeout(100 seconds)
import context.dispatcher
context.system.scheduler.schedule(1 seconds, 600 seconds) {
val future = (supervisor ? Supervisor.Start).mapTo[List[Int]]
val list = Await.result(future, timeout.duration)
supervisor ! list
}
}
def receive = {
case _ => //Not sure what to do here
}
}
In this case, the ApplicationActor is the arg you would pass to akka.Main and it would basically be the root supervisor to all other actors created in your hierarchy. The only fishy thing here is that being an Actor, it needs a receive implementation and I don't imagine any other actors will be sending messages here thus it doesn't really do anything. But the power to this approach is that when the ApplicationActor is stopped, the stop will also be cascaded down to all other actors that it started, simplifying a graceful shutdown. I suppose you could have the ApplicationActor handle a message to shutdown the actor system given some kind of input (maybe a ShutdownHookThread could initiate this) and give this actor some kind of a purpose after all. Anyway, as stated earlier, your current approach seems fine but this could also be an option if you so desire.
EDIT
So if you wanted to run this ApplicationActor via akka.Main, according to the instructions here, you would execute this from your command prompt:
java -classpath <all those JARs> akka.Main code.ApplicationActor
You will of course need to supply <all those JARS> with your dependencies including akka. At a minimum you will need scala-library and akka-actor in your classpath to make this run.
If you refer to http://doc.akka.io/docs/akka/snapshot/scala/hello-world.html, you'll find that akka.Main expects your root/parent Actor. In your case, Supervisor. As for your already existing code, it can be copied directly into the actors code, possibly in some initialisation calls. For example, refer to the HelloWorld's preStart function.
However, in my opinion, your already existing code is just fine too. Akka.main is a nice helper, as is the microkernel binary. But creating your own main executable is a viable option too.