How to clean up other resources when spark gets stopped


In my Spark application, there is an object, ResourceFactory, which contains an Akka ActorSystem for providing resource clients. So when I run this Spark application, every worker node creates an ActorSystem. The problem is that when the Spark application finishes its work and shuts down, the ActorSystem stays alive on every worker node and prevents the whole application from terminating; it just hangs.
Is there a way to register a listener on the SparkContext so that when the sc shuts down, the ActorSystem on every worker node is notified to shut itself down?
UPDATE:
Following is the simplified skeleton:
There is a ResourceFactory, which is an object and it contains an actor system. And it also provides a fetchData method.
object ResourceFactory{
val actorSystem = ActorSystem("resource-akka-system")
def fetchData(): SomeData = ...
}
And then there is a user-defined RDD class whose compute method needs to fetch data from the ResourceFactory.
class MyRDD extends RDD[SomeClass] {
override def compute(...) {
...
ResourceFactory.fetchData()
...
someIterator
}
}
So on every node there will be one ActorSystem named "resource-akka-system", and the MyRDD instances distributed across those worker nodes can get data from it.
The problem is that when the SparkContext shuts down, those "resource-akka-system"s are no longer needed, but I don't know how to notify the ResourceFactory to shut down the "resource-akka-system" at that point. So now the "resource-akka-system" stays alive on each worker node and prevents the whole program from exiting.
UPDATE2:
After some more experiments, I found that in local mode the program hangs, but in yarn-cluster mode the program exits successfully. Maybe this is because YARN kills the threads on the worker nodes when the sc is shut down?
UPDATE3:
To check whether every node contains an ActorSystem, I changed the code as follows (this is the real skeleton, as I added another class definition):
object ResourceFactory{
println("creating resource factory")
val actorSystem = ActorSystem("resource-akka-system")
def fetchData(): SomeData = ...
}
class MyRDD extends RDD[SomeClass] {
println("creating my rdd")
override def compute(...) {
new RDDIterator(...)
}
}
class RDDIterator(...) extends Iterator[SomeClass] {
println("creating rdd iterator")
...
lazy val reader = {
...
ResourceFactory.fetchData()
...
}
...
override def next() = {
...
reader.xx()
}
}
After adding those printlns, I ran the code on Spark in yarn-cluster mode. On the driver I get the following output:
creating my rdd
creating resource factory
creating my rdd
...
While on some of the workers, I get the following output:
creating rdd iterator
creating resource factory
And some of the workers print nothing (none of them were assigned any tasks).
Based on the above, I think the object is initialized eagerly on the driver, since it prints creating resource factory there even when nothing refers to it, and lazily on the workers, since there creating resource factory is printed after creating rdd iterator: the resource factory is first referenced by the first RDDIterator that is created.
I also find that in my use case the MyRDD class is only instantiated on the driver.
I am not sure about the eager versus lazy initialization of the object on the driver and the workers. It's a guess, and some other part of the program may be making it look that way. But I think it is right that there is one actor system on each worker node when it is needed.

I don't think there is a way to tap into each worker's lifecycle.
I also have some questions regarding your implementation:
If you have an object containing a val that is used from a function run on a worker, my understanding is that this val gets serialized and broadcast to the worker. Can you confirm that you have one ActorSystem running per worker?
An actor system usually terminates immediately if you don't explicitly wait for its termination. Are you calling something like system.awaitTermination or blocking on system.whenTerminated?
Anyway, there is another way to shut down the actor systems on remote workers:
Make the ActorSystem on each node part of an Akka cluster. Here are some docs on how to do that programmatically.
Have the address of a "coordination" actor on the driver node (where your sc is) broadcast to each worker. In simple words, just have a val with that address.
When the Akka system is started on a worker, use that "coordination" actor address to register this particular actor system (send a corresponding message to the coordination actor).
The coordination actor keeps track of all registered "worker" actors.
When your computation is complete and you want to shut down the Akka system on every worker, send messages to all registered actors from the coordination actor on the driver node.
Shut down the worker's Akka system when the "shutdown" message is received.
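The registration and shutdown steps above can be sketched without Akka or a network, purely to show the bookkeeping. This is a minimal in-process sketch under stated assumptions: WorkerSystem and Coordinator are hypothetical names. In the real setup they would be the ActorSystem on each worker and the coordination actor on the driver, and register and shutdownAll would be messages rather than method calls.

```scala
import scala.collection.mutable

// Hypothetical handle for a worker-side system; stands in for an ActorSystem.
final class WorkerSystem(val name: String) {
  @volatile private var running = true
  def isRunning: Boolean = running
  // Corresponds to handling the "shutdown" message on the worker.
  def shutdown(): Unit = running = false
}

// Hypothetical driver-side coordinator: tracks registered workers
// and tells all of them to stop once the computation is complete.
final class Coordinator {
  private val workers = mutable.ListBuffer.empty[WorkerSystem]

  // Called once by each worker system when it starts.
  def register(worker: WorkerSystem): Unit = synchronized {
    workers += worker
  }

  // Notifies every registered worker and returns how many were notified.
  def shutdownAll(): Int = synchronized {
    workers.foreach(_.shutdown())
    val notified = workers.size
    workers.clear()
    notified
  }
}
```

On the driver, shutdownAll() would be called right before or after stopping the sc, so that no worker-side system outlives the job.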

Related

Akka Stream from within a Spark Job to write into kafka

To be as efficient as possible when writing data back into Kafka, I am interested in using Akka Streams to write my RDD partitions back into Kafka.
The problem is that I need a way to create an actor system per executor and not per partition, which would be ridiculous: one may end up with 8 actor systems on one node in one JVM. Having a stream per partition, however, is fine.
Has anyone already done that?
My understanding is that an actor system can't be serialized, hence it can't be sent as a broadcast variable, which would be per executor.
If anyone has worked out and tested a solution to this, would you please share it?
Otherwise I can always fall back to https://index.scala-lang.org/benfradet/spark-kafka-writer/spark-kafka-0-10-writer/0.3.0?target=_2.11 but I am not sure it is the most efficient way.

You can always define a global lazy val with an actor system:
object Execution {
implicit lazy val actorSystem: ActorSystem = ActorSystem()
implicit lazy val materializer: Materializer = ActorMaterializer()
}
Then you just import it in any of the classes where you want to use Akka Streams:
import Execution._
val stream: DStream[...] = ...
stream.foreachRDD { rdd =>
...
rdd.foreachPartition { records =>
val (queue, done) = Source.queue(...)
.via(Producer.flow(...))
.toMat(Sink.ignore)(Keep.both)
.run() // implicitly pulls `Execution.materializer` from scope,
// which in turn will initialize `Execution.actorSystem`
... // push records to the queue
// wait until the stream is completed
Await.result(done, 10.minutes)
}
}
The above is kind of pseudocode but I think it should convey the general idea.
This way the system is going to be initialized on every executor JVM only once when it is needed. Additionally you can make the actor system "daemonic" in order for it to shut down automatically when the JVM finishes:
object Execution {
private lazy val config = ConfigFactory.parseString("akka.daemonic = on")
.withFallback(ConfigFactory.load())
implicit lazy val actorSystem: ActorSystem = ActorSystem("system", config)
implicit lazy val materializer: Materializer = ActorMaterializer()
}
We're doing this in our Spark jobs and it works flawlessly.
This works without any kind of broadcast variables, and, naturally, can be used in all kinds of Spark jobs, streaming or otherwise. Because the system is defined in a singleton object, it is guaranteed to be initialized only once per JVM instance (modulo various classloader shenanigans, but it doesn't really matter in the context of Spark), therefore even if some of the partitions get placed onto the same JVM (maybe in different threads), it will only initialize the actor system one time. lazy val ensures the thread-safety of the initialization, and ActorSystem is thread-safe, so this won't cause problems in this regard as well.
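The two properties described above (one lazily created instance per JVM, and daemon threads that do not keep the JVM alive) can be shown with the standard library alone. This is a minimal sketch, not the Akka setup itself: SharedPool and SharedPoolDemo are hypothetical names, and a plain thread pool stands in for the actor system. The demo submits all tasks from one thread, so it shows single initialization and the daemon flag, not concurrent first access.

```scala
import java.util.concurrent.{Callable, ExecutorService, Executors, ThreadFactory}
import java.util.concurrent.atomic.AtomicInteger

object SharedPool {
  // Counts how often the lazy initializer runs; expected to stay at 1 per JVM.
  val initCount = new AtomicInteger(0)

  // Daemon threads do not keep the JVM alive once all non-daemon threads finish.
  private val daemonFactory: ThreadFactory = new ThreadFactory {
    def newThread(r: Runnable): Thread = {
      val t = new Thread(r, "shared-pool-worker")
      t.setDaemon(true)
      t
    }
  }

  // Created on first access only; the singleton object exists once per JVM.
  lazy val pool: ExecutorService = {
    initCount.incrementAndGet()
    Executors.newFixedThreadPool(4, daemonFactory)
  }
}

object SharedPoolDemo {
  // Submits one task per simulated partition and reports, for each task,
  // whether it ran on a daemon thread.
  def run(partitions: Int): Seq[Boolean] = {
    val futures = (1 to partitions).map { _ =>
      SharedPool.pool.submit(new Callable[Boolean] {
        def call(): Boolean = Thread.currentThread().isDaemon
      })
    }
    futures.map(_.get())
  }
}
```

Every simulated partition reuses the same pool, which is the behaviour you want from the per-executor actor system.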

What happens if SparkSession is not closed?

What's the difference between the following 2?
object Example1 {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder.getOrCreate
try {
// spark code here
} finally {
spark.close
}
}
}
object Example2 {
val spark = SparkSession.builder.getOrCreate
def main(args: Array[String]): Unit = {
// spark code here
}
}
I know that SparkSession implements Closeable and it hints that it needs to be closed. However, I can't think of any issues if the SparkSession is just created as in Example2 and never closed directly.
In case of success or failure of the Spark application (and exit from main method), the JVM will terminate and the SparkSession will be gone with it. Is this correct?
IMO: The fact that the SparkSession is a singleton should not make a big difference either.

You should always close your SparkSession when you are done with its use (even if the final outcome were just to follow a good practice of giving back what you've been given).
Closing a SparkSession may trigger freeing cluster resources that could be given to some other application.
SparkSession is a session and as such maintains some resources that consume JVM memory. You can have as many SparkSessions as you want (see SparkSession.newSession to create a fresh session), but you don't want unused sessions holding memory, so close the ones you no longer need.
SparkSession is Spark SQL's wrapper around Spark Core's SparkContext and so under the covers (as in any Spark application) you'd have cluster resources, i.e. vcores and memory, assigned to your SparkSession (through SparkContext). That means that as long as your SparkContext is in use (using SparkSession) the cluster resources won't be assigned to other tasks (not necessarily Spark's but also for other non-Spark applications submitted to the cluster). These cluster resources are yours until you say "I'm done" which translates to...close.
If, however, you simply exit the Spark application once you are done, you don't have to think about executing close, since the resources will be released automatically anyway. The JVMs for the driver and executors terminate, and so does the (heartbeat) connection to the cluster, so eventually the resources are given back to the cluster manager, which can offer them to some other application.

Both are the same!
SparkSession's stop/close eventually calls SparkContext's stop:
def stop(): Unit = {
sparkContext.stop()
}
override def close(): Unit = stop()
SparkContext has a runtime shutdown hook that stops the context before the JVM exits. Here is the Spark code that adds the shutdown hook while creating the context:
_shutdownHookRef = ShutdownHookManager.addShutdownHook(
ShutdownHookManager.SPARK_CONTEXT_SHUTDOWN_PRIORITY) { () =>
logInfo("Invoking stop() from shutdown hook")
stop()
}
So this is called when the JVM shuts down, whether the application succeeded or failed (shutdown hooks do not run if the process is killed forcibly, e.g. with SIGKILL). If you call stop() manually, the shutdown hook is removed to avoid stopping twice:
def stop(): Unit = {
if (LiveListenerBus.withinListenerThread.value) {
throw new SparkException(
s"Cannot stop SparkContext within listener thread of ${LiveListenerBus.name}")
}
// Use the stopping variable to ensure no contention for the stop scenario.
// Still track the stopped variable for use elsewhere in the code.
if (!stopped.compareAndSet(false, true)) {
logInfo("SparkContext already stopped.")
return
}
if (_shutdownHookRef != null) {
ShutdownHookManager.removeShutdownHook(_shutdownHookRef)
}
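The same pattern can be reproduced outside Spark with the Scala standard library. The following is a minimal sketch, not Spark's code: ManagedResource is a hypothetical class that registers a shutdown hook via sys.addShutdownHook, makes stop() idempotent with compareAndSet, and removes the hook when stop() is called manually.

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical resource; stands in for something like a SparkContext.
final class ManagedResource {
  private val stopped = new AtomicBoolean(false)

  // Registered at construction so that stop() also runs on JVM shutdown.
  private val hook = sys.addShutdownHook(stop())

  def isStopped: Boolean = stopped.get()

  def stop(): Unit = {
    // compareAndSet lets only the first caller through, so stop() is idempotent.
    if (stopped.compareAndSet(false, true)) {
      // Deregister the hook. This throws IllegalStateException if the JVM is
      // already shutting down, i.e. when stop() is invoked by the hook itself.
      try hook.remove()
      catch { case _: IllegalStateException => () }
      // ... release the actual resources here ...
    }
  }
}
```

Calling stop() explicitly and relying on the hook therefore end in the same state; the explicit call just releases the resources earlier.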

How to name an actor?

The data layer in my web application is comprised of Akka actors. Whenever I need to access data, I invoke the ActorSystem mechanism like so:
val myActor = system.actorOf(Props[MyActor], name = "myactor")
implicit val timeout = Timeout(120 seconds)
val future = myActor ? Request1
val result = Await.result(future, timeout.duration)
I'm using Play, and the ActorSystem variable is obtained through injection:
class MyClass @Inject() (system: ActorSystem)
But the second time I call the function, I get the following exception saying that the actor name is not unique. How can I fix this? How should I name the actor, taking into account that it can be used concurrently by more than one thread?
play.api.http.HttpErrorHandlerExceptions$$anon$1: Execution
exception[[InvalidActorNameException: actor name [myactor] is not
unique!]]
** EDIT **
What I'm trying to achieve is something similar to having a container of Entity Beans in the EJB model, where each actor would be an Entity Bean. The difference I'm noticing is that the actors are not created/destroyed automatically as needed.

Depending on your goal, the question may be not how to name an actor, but when to create it. You are creating a new actor every time you need to access some data. I suppose you aren't stopping old actors when they are no longer needed.
You should probably create an actor once (or multiple times if you want a pool of actors, but using different names) and reuse it later by keeping an ActorRef somewhere or using dependency injected actors. You can also use system.actorFor or system.actorSelection (depending on Akka version you're using) if you really need to.
Most of the time you don't even need an explicit ActorRef because you want to reply to a sender of some message.
If you have to create a separate actor each time, then see Wonpyo's answer. In my opinion, though, you could simply use a Future directly instead.
There is a great guide on Actors in the Akka documentation.
Edit:
Since you specified you want each actor to act like a DAO class, I think it should look something like:
// Somewhere in some singleton object (injected as dependency)
val personDao : ActorRef = system.actorOf(Props[PersonDaoActor], name = "personDao")
val fruitDao : ActorRef = system.actorOf(Props[FruitDaoActor], name = "fruitDao")
Then, when you need to access some data:
val johnSmithFuture = personDao ? Get("John Smith")
johnSmithFuture.map {
case Person(name, age) => println(s"${name} ${age}")
}
Alternatively, instead of personDao you can use system.actorFor("personDao") (or system.actorSelection equivalent in Akka 2.4). You can also inject actors directly.
If you want multiple actors to process your messages in parallel you can use routers. Example:
val personDao: ActorRef =
system.actorOf(RoundRobinPool(5).props(Props[PersonDaoActor]), "personDao")
It would create 5 instances of your PersonDaoActor and distribute any messages sent to personDao among those 5 actors, so you could process 5 queries in parallel. If all 5 actors are busy, messages will be queued.
Using Await defeats the purpose of Akka in this case. There are some cases when this is the only option (legacy code, mostly), but using it every time effectively makes your code completely blocking, maybe even single-threaded (depending on your actor code). This is especially true in Play, which is designed to do everything asynchronously, so there's no need to Await.
It may be a good idea to reconsider if actors are really the best solution to your problem. If all you want is parallel execution, then Futures are much simpler. Some people still use actors in such case because they like the abstraction and the simplicity of routing. I found an interesting article describing this in detail: "Don't use Actors for concurrency" (also read the comments for opposing views).
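If parallel, non-blocking lookups are all that is needed, a Future-only version can look like the following. This is a minimal sketch under stated assumptions: PersonQueries and its in-memory map are hypothetical stand-ins for a real data layer, and Person mirrors the case class used in the example above.

```scala
import scala.concurrent.{ExecutionContext, Future}

final case class Person(name: String, age: Int)

object PersonQueries {
  // Hypothetical in-memory data standing in for a real database.
  private val people = Map("John Smith" -> Person("John Smith", 42))

  // Runs the lookup asynchronously on the given execution context.
  def find(name: String)(implicit ec: ExecutionContext): Future[Option[Person]] =
    Future(people.get(name))

  // Transforms the result with map instead of blocking with Await.
  def describe(name: String)(implicit ec: ExecutionContext): Future[String] =
    find(name).map {
      case Some(Person(n, age)) => s"$n $age"
      case None                 => s"$name not found"
    }
}
```

In Play, a Future like the one returned by describe can be returned from an asynchronous action, so no thread blocks while the lookup runs.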

The actor system requires a unique name (path) for each actor.
A path has the following format: akka://system@host:port/user/{your-actor-path}
For example:
val system = ActorSystem("hello")
val myActor = system.actorOf(Props[MyActor], name ="myactor")
// myActor Path
// "akka://hello/user/myactor" // purely local
// "akka.tcp://hello@ip:port/user/myactor" // remote
In your code, myActor is created every time you make a call, which creates an actor at the same path every time.
Thus, a bad solution is to change the code as follows:
val myActor = system.actorOf(Props[MyActor])
If you don't assign a name to an actor, the actor system assigns a random name, so myActor will not have the same path on each function call.
But this is a really bad solution, since myActor will never be destroyed (an actor is not terminated by the GC).
If you keep calling the function, you will eventually run out of memory.
So please stop myActor once you are done with it in the function.

How to run Akka

It seems there is no need for a class with a main method in order to run Akka (see "How to run akka actors in IntelliJ IDEA"). However, here is what I have:
object Application extends App {
val system = ActorSystem()
val supervisor = system.actorOf(Props[Supervisor])
implicit val timeout = Timeout(100 seconds)
import system.dispatcher
system.scheduler.schedule(1 seconds, 600 seconds) {
val future = supervisor ? Supervisor.Start
val list = Await.result(future, timeout.duration).asInstanceOf[List[Int]]
supervisor ! list
}
}
I know I have to specify a main class called "akka.Main" in the configuration. But even so, where should I move the current code from object Application?

You can write something like
import _root_.akka.Main
object Application extends App {
Main.main(Array("somepackage.Supervisor"))
}
and the Supervisor actor should have an overridden preStart function, as @cmbaxter suggested.
Then run the sbt console in IntelliJ and type run.

I agree with @kdrakon that your code is fine the way it is, but if you wanted to leverage the akka.Main functionality, then a simple refactor like so will make things work:
package code
class ApplicationActor extends Actor {
override def preStart = {
val supervisor = context.actorOf(Props[Supervisor])
implicit val timeout = Timeout(100 seconds)
import context.dispatcher
context.system.scheduler.schedule(1 seconds, 600 seconds) {
val future = (supervisor ? Supervisor.Start).mapTo[List[Int]]
val list = Await.result(future, timeout.duration)
supervisor ! list
}
}
def receive = {
case _ => //Not sure what to do here
}
}
In this case, the ApplicationActor is the arg you would pass to akka.Main and it would basically be the root supervisor to all other actors created in your hierarchy. The only fishy thing here is that being an Actor, it needs a receive implementation and I don't imagine any other actors will be sending messages here thus it doesn't really do anything. But the power to this approach is that when the ApplicationActor is stopped, the stop will also be cascaded down to all other actors that it started, simplifying a graceful shutdown. I suppose you could have the ApplicationActor handle a message to shutdown the actor system given some kind of input (maybe a ShutdownHookThread could initiate this) and give this actor some kind of a purpose after all. Anyway, as stated earlier, your current approach seems fine but this could also be an option if you so desire.
EDIT
So if you wanted to run this ApplicationActor via akka.Main, according to the instructions here, you would execute this from your command prompt:
java -classpath <all those JARs> akka.Main code.ApplicationActor
You will of course need to supply <all those JARS> with your dependencies including akka. At a minimum you will need scala-library and akka-actor in your classpath to make this run.

If you refer to http://doc.akka.io/docs/akka/snapshot/scala/hello-world.html, you'll find that akka.Main expects your root/parent actor, in your case Supervisor. As for your existing code, it can be copied directly into the actor's code, possibly in some initialisation calls. For example, refer to the HelloWorld's preStart function.
However, in my opinion, your existing code is just fine too. akka.Main is a nice helper, as is the microkernel binary. But creating your own main executable is a viable option too.

Scala how to use akka actors to handle a timing out operation efficiently

I am currently evaluating JavaScript scripts using Rhino in a RESTful service. I want evaluations to time out.
I have created a mock example actor (using Scala 2.10 and Akka actors).
case class Evaluate(expression: String)
class RhinoActor extends Actor {
override def preStart() = { println("Start context'"); super.preStart()}
def receive = {
case Evaluate(expression) ⇒ {
Thread.sleep(100)
sender ! "complete"
}
}
override def postStop() = { println("Stop context'"); super.postStop()}
}
Now I use this actor as follows:
def run {
val t = System.currentTimeMillis()
val system = ActorSystem("MySystem")
val actor = system.actorOf(Props[RhinoActor])
implicit val timeout = Timeout(50 milliseconds)
val future = (actor ? Evaluate("10 + 50")).mapTo[String]
val result = Try(Await.result(future, Duration.Inf))
println(System.currentTimeMillis() - t)
println(result)
actor ! PoisonPill
system.shutdown()
}
Is it wise to use the ActorSystem in a closure like this which may have simultaneous requests on it?
Should I make the ActorSystem global, and will that be ok in this context?
Is there a more appropriate alternative approach?
EDIT: I think I need to use futures directly, but I will need the preStart and postStop. Currently investigating.
EDIT: Seems you don't get those hooks with futures.

I'll try and answer some of your questions for you.
First, an ActorSystem is a very heavyweight construct. You should not create one per request that needs an actor. You should create one globally and then use that single instance to spawn your actors (and you won't need system.shutdown() anymore in run). I believe this covers your first two questions.
Your approach of using an actor to execute JavaScript here seems sound to me. But instead of spinning up an actor per request, you might want to pool a bunch of the RhinoActors behind a Router, with each instance having its own Rhino engine that is set up during preStart. Doing this will eliminate per-request Rhino initialization costs, speeding up your JS evaluations. Just make sure you size your pool appropriately. Also, you won't need to send PoisonPill messages per request if you adopt this approach.
You also might want to look into the non-blocking callbacks onComplete, onSuccess and onFailure as opposed to using the blocking Await. These callbacks also respect timeouts and are preferable to blocking for higher throughput. As long as whatever is way way upstream waiting for this response can handle the asynchronicity (i.e. an async capable web request), then I suggest going this route.
The last thing to keep in mind is that even though code will return to the caller after the timeout if the actor has yet to respond, the actor still goes on processing that message (performing the evaluation). It does not stop and move onto the next message just because a caller timed out. Just wanted to make that clear in case it wasn't.
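That last point can be observed with a plain Future, without any actors. This is a minimal sketch under stated assumptions: TimeoutDemo is a hypothetical name, and Thread.sleep stands in for a slow script evaluation. The caller gives up after 50 milliseconds, yet the work keeps running and still produces its result.

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Try

object TimeoutDemo {
  // Returns (did the caller time out?, the value the work eventually produced).
  def run(): (Boolean, String) = {
    // Stand-in for a slow evaluation that takes longer than the caller will wait.
    val work: Future[String] = Future {
      Thread.sleep(300)
      "complete"
    }

    // The caller only waits 50 ms, so Await.result throws a TimeoutException.
    val callerTimedOut = Try(Await.result(work, 50.millis))
      .failed
      .toOption
      .exists(_.isInstanceOf[TimeoutException])

    // The work was not cancelled by the timeout; waiting longer yields its result.
    val eventual = Await.result(work, 5.seconds)
    (callerTimedOut, eventual)
  }
}
```

The timeout only bounds how long the caller waits. Stopping the work itself needs cooperation from the code doing the work, which is why the Rhino engine has to enforce its own limit.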
EDIT
In response to your comment about stopping a long execution, there are some things related to Akka to consider first. You can stop the actor, or send it a Kill or a PoisonPill, but none of these will stop it from processing the message that it's currently processing. They just prevent it from receiving new messages. In your case, with Rhino, if infinite script execution is a possibility, then I suggest handling this within Rhino itself. I would dig into the answers on this post (Stopping the Rhino Engine in middle of execution) and set up your Rhino engine in the actor in such a way that it will stop itself if it has been executing for too long. That failure will kick out to the supervisor (if pooled) and cause that pooled instance to be restarted, which will init a new Rhino in preStart. This might be the best approach for dealing with the possibility of long-running scripts.
