Akka Stream from within a Spark job to write into Kafka - Scala

Wanting to be as efficient as possible when writing data back into Kafka, I am interested in using Akka Streams to write my RDD partitions back into Kafka.
The problem is that I need a way to create an actor system per executor and not per partition, which would be ridiculous: one may end up with 8 actor systems on one node in one JVM. However, having a stream per partition is fine.
Has anyone already done that?
My understanding is that an actor system can't be serialized, hence it can't be sent as a broadcast variable, which would otherwise give one per executor.
If anyone has figured out and tested a solution to this, would you please share it?
Otherwise I can always fall back to https://index.scala-lang.org/benfradet/spark-kafka-writer/spark-kafka-0-10-writer/0.3.0?target=_2.11, but I am not sure it is the most efficient way.

You can always define a global lazy val with an actor system:
object Execution {
  implicit lazy val actorSystem: ActorSystem = ActorSystem()
  implicit lazy val materializer: Materializer = ActorMaterializer()
}
Then you just import it in any of the classes where you want to use Akka Streams:
import Execution._
val stream: DStream[...] = ...
stream.foreachRDD { rdd =>
  ...
  rdd.foreachPartition { records =>
    val (queue, done) = Source.queue(...)
      .via(Producer.flow(...))
      .toMat(Sink.ignore)(Keep.both)
      .run() // implicitly pulls `Execution.materializer` from scope,
             // which in turn will initialize `Execution.actorSystem`

    ... // push records to the queue

    // wait until the stream is completed
    Await.result(done, 10.minutes)
  }
}
The above is kind of pseudocode but I think it should convey the general idea.
This way the system is going to be initialized on each executor JVM only once, when it is first needed. Additionally, you can make the actor system "daemonic" so that it shuts down automatically when the JVM exits:
object Execution {
  private lazy val config = ConfigFactory.parseString("akka.daemonic = on")
    .withFallback(ConfigFactory.load())

  implicit lazy val actorSystem: ActorSystem = ActorSystem("system", config)
  implicit lazy val materializer: Materializer = ActorMaterializer()
}
We're doing this in our Spark jobs and it works flawlessly.
This works without any kind of broadcast variables and, naturally, can be used in all kinds of Spark jobs, streaming or otherwise. Because the system is defined in a singleton object, it is guaranteed to be initialized only once per JVM instance (modulo various classloader shenanigans, but it doesn't really matter in the context of Spark). Therefore, even if some of the partitions get placed onto the same JVM (maybe in different threads), the actor system will only be initialized one time. lazy val ensures the thread-safety of the initialization, and ActorSystem itself is thread-safe, so this won't cause problems in that regard either.
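For concreteness, here is a minimal, self-contained sketch of what the per-partition write could look like with this pattern, using Alpakka Kafka's plain producer sink. The topic name, the bootstrap servers, and the Iterator[String] record shape are assumptions made purely for illustration:
import akka.Done
import akka.actor.ActorSystem
import akka.kafka.ProducerSettings
import akka.kafka.scaladsl.Producer
import akka.stream.{ActorMaterializer, Materializer}
import akka.stream.scaladsl.Source
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer

import scala.concurrent.duration._
import scala.concurrent.{Await, Future}

object Execution {
  implicit lazy val actorSystem: ActorSystem = ActorSystem()
  implicit lazy val materializer: Materializer = ActorMaterializer()

  // built once per executor JVM, just like the actor system
  lazy val producerSettings: ProducerSettings[String, String] =
    ProducerSettings(actorSystem, new StringSerializer, new StringSerializer)
      .withBootstrapServers("localhost:9092") // assumption: adjust to your brokers
}

object KafkaPartitionWriter {
  import Execution._

  // intended to be called from rdd.foreachPartition { records => ... }
  def writePartition(records: Iterator[String]): Unit = {
    val done: Future[Done] =
      Source.fromIterator(() => records)
        .map(value => new ProducerRecord[String, String]("my-topic", value)) // assumption: topic name
        .runWith(Producer.plainSink(producerSettings))

    // block the Spark task until this partition has been written
    Await.result(done, 10.minutes)
  }
}
Since both the ActorSystem and the ProducerSettings live in the singleton, each executor JVM builds them once, and each partition only materializes a short-lived stream on top of them.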

Related

akka streaming file lines to actor router and writing with single actor. how to handle the backpressure

I want to stream a file from S3 to an actor to be parsed and enriched, and to write the output to another file.
The number of parser actors should be limited, e.g.:
application.conf
akka {
  actor {
    deployment {
      HereClient/router1 {
        router = round-robin-pool
        nr-of-instances = 28
      }
    }
  }
}
code
val writerActor = actorSystem.actorOf(WriterActor.props())
val parser = actorSystem.actorOf(FromConfig.props(ParsingActor.props(writerActor)), "router1")
However, the actor that is writing to a file should be limited to 1 (a singleton).
I tried doing something like:
val reader: ParquetReader[GenericRecord] = AvroParquetReader.builder[GenericRecord](file).withConf(conf).build()
val source: Source[GenericRecord, NotUsed] = AvroParquetSource(reader)
source.map(record => parser ! record)
but I am not sure that the backpressure is handled correctly. Any advice?
Indeed your solution is disregarding backpressure.
The correct way to have a stream interact with an actor while maintaining backpressure is to use the ask pattern support of akka-stream (reference).
From my understanding of your example you have 2 separate actor interaction points:
send records to the parsing actors (via a router)
send parsed records to the singleton write actor
What I would do is something similar to the following:
val writerActor = actorSystem.actorOf(WriterActor.props())
val parserActor = actorSystem.actorOf(FromConfig.props(ParsingActor.props(writerActor)), "router1")
val reader: ParquetReader[GenericRecord] = AvroParquetReader.builder[GenericRecord](file).withConf(conf).build()
val source: Source[GenericRecord, NotUsed] = AvroParquetSource(reader)
source.ask[ParsedRecord](28)(parserActor)
  .ask[WriteAck](writerActor)
  .runWith(Sink.ignore)
The idea is that you send all the GenericRecord elements to the parserActor, which will reply with a ParsedRecord. Here, as an example, we specify a parallelism of 28 since that's the number of instances you have configured; as long as you use a value at least as high as the actual number of actor instances, no actor should suffer from work starvation.
Once the parserActor replies with the parsing result (here represented by the ParsedRecord), we apply the same pattern to interact with the singleton writer actor. Note that here we don't specify the parallelism, as we have a single instance, so it doesn't make sense to send more than 1 message at a time (in reality this happens anyway due to buffering at async boundaries, but this is just a built-in optimization). In this case we expect the writer actor to reply with a WriteAck to inform us that the write has been successful and we can send the next element.
Using this method you are maintaining backpressure throughout your whole stream.
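For the ask stages to work, both actors need to reply to the sender of each message; below is a minimal sketch of what that could look like (ParsedRecord, WriteAck and the parsing/writing bodies are placeholders, not taken from the question):
import akka.actor.{Actor, ActorRef, Props}
import org.apache.avro.generic.GenericRecord

case class ParsedRecord(payload: String) // placeholder for the real parsed representation
case object WriteAck

class ParsingActor(writer: ActorRef) extends Actor {
  def receive: Receive = {
    case record: GenericRecord =>
      // parse/enrich the record, then reply so the stream can pull the next element
      sender() ! ParsedRecord(record.toString)
  }
}
object ParsingActor {
  def props(writer: ActorRef): Props = Props(new ParsingActor(writer))
}

class WriterActor extends Actor {
  def receive: Receive = {
    case parsed: ParsedRecord =>
      // write to the output file here, then acknowledge
      sender() ! WriteAck
  }
}
object WriterActor {
  def props(): Props = Props(new WriterActor)
}
Also note that the ask stages need an implicit akka.util.Timeout in scope where the stream is defined; if an actor fails to reply within that timeout, the stream fails.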
I think you should be using one of the "async" operations.
Perhaps this other Q&A gives you some inspiration: Processing an akka stream asynchronously and writing to a file sink
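For instance, a rough sketch along those lines, using mapAsync for the bounded-parallelism parsing step and a single FileIO sink for the writes; the parallelism value, the file path, and the parse placeholder are all illustrative:
import java.nio.file.Paths

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{FileIO, Source}
import akka.util.ByteString

import scala.concurrent.Future

object AsyncPipelineSketch extends App {
  implicit val system: ActorSystem = ActorSystem("sketch")
  implicit val materializer: ActorMaterializer = ActorMaterializer()
  import system.dispatcher

  // placeholder for the real parsing/enrichment step
  def parse(line: String): Future[String] = Future(line.toUpperCase)

  Source(List("a", "b", "c"))
    .mapAsync(parallelism = 28)(parse)            // at most 28 parses in flight, backpressured
    .map(s => ByteString(s + "\n"))
    .runWith(FileIO.toPath(Paths.get("out.txt"))) // a single file sink does all the writing
    .onComplete(_ => system.terminate())
}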

Right way of handling multiple future callbacks using threadpool in Scala

I am trying to do a very simple thing and want to understand the right way of doing it. I need to periodically make some REST API calls to a separate service and then process the results asynchronously. I am using the actor system's default scheduler to schedule the HTTP requests and have created a separate thread pool to handle the Future callbacks. Since there is no dependency between the requests and the responses, I thought a separate thread pool for handling the future callbacks should be fine.
Is there some problem with this approach?
I read the Scala docs and they mention some issue here (though I am not clear on it).
Generally, what is the recommended way of handling these scenarios?
implicit val system = ActorSystem("my-actor-system") // define an actor system
implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(10)) // create a thread pool

// define a task which periodically does some work using the actor system's scheduler
system.scheduler.scheduleWithFixedDelay(5.seconds, 5.seconds)(new Runnable {
  override def run(): Unit = {
    val urls = getUrls() // get list of urls
    val futureResults = urls.map(entry => getData[MyData](entry)) // get data for each url
    futureResults.foreach(_.onComplete {
      case Success(res) => // do something with the result
      case Failure(e)   => // do something with the error
    })
  }
})

def getData[T](url: String): Future[Option[Future[T]]] = {
  implicit val ec1 = system.dispatcher
  val responseFuture: Future[HttpResponse] = execute(url)
  responseFuture map { result =>
    // transform the response and return data in format T
  }
}
Whether or not having a separate thread pool really depends on the use case. If the service integration is very critical and is designed to take a lot of resources, then a separate thread pool may make sense; otherwise, just using the default one should be fine. Feel free to refer to Levi's answer for more in-depth discussion of this part.
Regarding "job scheduling in an actor system", I think Akka Streams are a perfect fit here. I give you an example below. Feel free to refer to the blog post https://blog.colinbreck.com/rethinking-streaming-workloads-with-akka-streams-part-i/ regarding how many things Akka Streams can simplify for you.
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}

import scala.concurrent.duration._
import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Success}

object Timer {
  def main(args: Array[String]): Unit = {
    implicit val system: ActorSystem = ActorSystem("Timer")

    // default thread pool
    implicit val ec: ExecutionContext = system.dispatcher
    // comment out below if custom thread pool is needed
    // also make sure you read https://doc.akka.io/docs/akka/current/dispatchers.html#setting-the-dispatcher-for-an-actor
    // to define the custom thread pool
    // implicit val ec: ExecutionContext = system.dispatchers.lookup("my-custom-dispatcher")

    Source
      .tick(5.seconds, 5.seconds, getUrls())
      .mapConcat(identity)
      .mapAsync(1)(url => fetch(url))
      .runWith(Sink.seq)
      .onComplete {
        case Success(responses) =>
          // handle responses
        case Failure(ex) =>
          // handle exceptions
      }
  }

  def getUrls(): Seq[String] = ???

  def fetch(url: String): Future[Response] = ???

  case class Response(body: String)
}
In addition to Yik San Chan's answer above (especially regarding using Akka Streams), I'd also point out that what exactly you're doing in the .onComplete block is quite relevant to the choice of which ExecutionContext to use for the onComplete callback.
In general, if what you're doing in the callback will be doing blocking I/O, it's probably best to do it in a thread pool which is large relative to the number of cores (note that each thread on the JVM consumes about 1 MB or so of memory for its stack, so it's probably not a great idea to use an ExecutionContext that spawns an unbounded number of threads; a fixed pool of about 10x your core count is probably OK).
Otherwise, it's probably OK to use an ExecutionContext with a threadpool roughly equal in size to the number of cores: the default Akka dispatcher is such an ExecutionContext. The only real reason to consider not using the Akka dispatcher, in my experience/opinion, is if the callback is going to occupy the CPU for a long time. The phenomenon known as "thread starvation" can occur in that scenario, with adverse impacts on performance and cluster stability (if using, e.g. Akka Cluster or health-checks). In such a scenario, I'd tend to use a dispatcher with fewer threads than cores and consider configuring the default dispatcher with fewer threads than the default (while the kernel's scheduler can and will manage more threads ready-to-run than cores, there are strong arguments for not letting it do so).
In an onComplete callback (in comparison to the various transformation methods on Future like map/flatMap and friends), since all you can do is side-effect, it's probably more likely than not that you're doing blocking I/O.
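If the callback does turn out to be blocking I/O, one common way to get such a larger pool is a dedicated Akka dispatcher looked up from the same ActorSystem rather than a hand-rolled Executors pool. A minimal sketch; the dispatcher name and the pool size of 32 are arbitrary illustration values, and the config string would normally live in application.conf:
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

import scala.concurrent.ExecutionContext

object BlockingPoolExample {
  private val config = ConfigFactory.parseString(
    """
      |blocking-io-dispatcher {
      |  type = Dispatcher
      |  executor = "thread-pool-executor"
      |  thread-pool-executor {
      |    fixed-pool-size = 32
      |  }
      |  throughput = 1
      |}
    """.stripMargin).withFallback(ConfigFactory.load())

  implicit val system: ActorSystem = ActorSystem("my-actor-system", config)

  // use this only for callbacks that block; keep system.dispatcher for everything else
  val blockingEc: ExecutionContext = system.dispatchers.lookup("blocking-io-dispatcher")
}
You can then pass blockingEc explicitly to onComplete (it takes the ExecutionContext in an implicit second parameter list) for the callbacks that actually block.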

flink increase parallelism of async operation

We have an AsyncFunction; the async operation is done using the Akka HTTP client:
class Foo[A, B] extends AsyncFunction[A, B] {

  val akkaConfig = ConfigFactory.load()
  implicit lazy val executor: ExecutionContext = ExecutionContext.fromExecutor(Executors.directExecutor())
  implicit lazy val system = ActorSystem("MyActorSystem", akkaConfig)
  implicit lazy val materializer = ActorMaterializer()

  def postReq(uriStr: String, str: String): Future[HttpResponse] = {
    Http().singleRequest(HttpRequest(
      method = HttpMethods.POST,
      uri = uriStr,
      entity = HttpEntity(ContentTypes.`application/json`, str))
    )
  }

  override def asyncInvoke(input: A, resultFuture: ResultFuture[B]): Unit = {
    val resultFutureRequested: Future[HttpResponse] = postReq(...)
    //the rest of the class ...
Questions:
If I want to increase the parallelism of the HTTP requests, should I do it using the Akka config, or is there a way to configure it via flink.yaml?
Since Flink is using Akka as well, is that the correct way to create the ActorSystem and the ExecutionContext?
As for the first question, you have three different settings that can affect the performance and the number of actual requests executed:
1. Parallelism: this will cause Flink to create multiple instances of your AsyncFunction, including multiple instances of your HTTP client.
2. The number of concurrent requests in the function itself: when you call orderedWait or unorderedWait you should provide the capacity, which will limit the number of concurrent requests (see the sketch below).
3. The actual settings of your HTTP client.
As you can see, points 2 and 3 are connected, since Flink can limit the number of possible concurrent requests, so sometimes changes in your HTTP client settings may have no effect because the number of requests is bounded by Flink itself.
Increasing the throughput of your AsyncFunction depends on the case. You need to remember that an AsyncFunction is called IN A SINGLE THREAD. This basically means that if the response time of the service you are calling is large, you will simply have the allowed number of requests all waiting for responses, and thus the only way forward is to increase the parallelism. Generally, however, changing the settings of the HTTP client and the capacity of the function should allow you to obtain better throughput.
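For reference, the capacity mentioned in point 2 is the last argument of the orderedWait/unorderedWait call. A minimal sketch of where it goes; the timeout and capacity values and the helper name are arbitrary illustration choices:
import java.util.concurrent.TimeUnit

import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.streaming.api.scala.{AsyncDataStream, DataStream}
import org.apache.flink.streaming.api.scala.async.AsyncFunction

object AsyncCapacityExample {
  // capacity bounds the number of in-flight async requests per parallel subtask,
  // independently of whatever the underlying HTTP client would allow
  def withAsyncEnrichment[A, B: TypeInformation](input: DataStream[A],
                                                 fn: AsyncFunction[A, B]): DataStream[B] =
    AsyncDataStream.unorderedWait(
      input,
      fn,
      30, TimeUnit.SECONDS, // timeout per async request (arbitrary)
      100                   // capacity: max concurrent requests per subtask (arbitrary)
    )
}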
As for the second question, I don't see an issue with creating multiple ActorSystems. You can see a similar question answered here.

How to clean up other resources when spark gets stopped

In my Spark application, there is an object ResourceFactory which contains an Akka ActorSystem for providing resource clients. So when I run this Spark application, every worker node will create an ActorSystem. The problem is that when the Spark application finishes its work and gets shut down, the ActorSystem keeps running on every worker node and prevents the whole application from terminating; it just hangs.
Is there a way to register a listener on the SparkContext so that when the sc gets shut down, the ActorSystem on every worker node gets notified to shut itself down?
UPDATE:
Following is the simplified skeleton:
There is a ResourceFactory, which is an object containing an actor system; it also provides a fetchData method.
object ResourceFactory {
  val actorSystem = ActorSystem("resource-akka-system")
  def fetchData(): SomeData = ...
}
And then, there is a user-defined RDD class, in its compute method, it needs to fetch data from the ResourceFactory.
class MyRDD extends RDD[SomeClass] {
  override def compute(...) = {
    ...
    ResourceFactory.fetchData()
    ...
    someIterator
  }
}
So on every node there will be one ActorSystem named "resource-akka-system", and those MyRDD instances distributed on those worker nodes can get data from the "resource-akka-system".
The problem is that, when the SparkContext gets shut down, there is no need for those "resource-akka-system"s, but I don't know how to notify the ResourceFactory to shut down the "resource-akka-system" when the SparkContext gets shut down. So now the "resource-akka-system" keeps running on each worker node and prevents the whole program from exiting.
UPDATE2:
With some more experiments, I found that in local mode the program hangs, but in yarn-cluster mode the program exits successfully. Maybe this is because YARN kills the threads on the worker nodes when the sc is shut down?
UPDATE3:
To check whether every node contains an ActorSystem, I changed the code as follows (this is the real skeleton, with another class definition added):
object ResourceFactory {
  println("creating resource factory")
  val actorSystem = ActorSystem("resource-akka-system")
  def fetchData(): SomeData = ...
}

class MyRDD extends RDD[SomeClass] {
  println("creating my rdd")
  override def compute(...) = {
    new RDDIterator(...)
  }
}

class RDDIterator(...) extends Iterator[SomeClass] {
  println("creating rdd iterator")
  ...
  lazy val reader = {
    ...
    ResourceFactory.fetchData()
    ...
  }
  ...
  override def next() = {
    ...
    reader.xx()
  }
}
After adding those printlns, I ran the code on Spark in yarn-cluster mode. On the driver I have the following prints:
creating my rdd
creating resource factory
creating my rdd
...
While on some of the workers, I have the following prints:
creating rdd iterator
creating resource factory
And on some of the workers, nothing is printed (and none of them were assigned any tasks).
Based on the above, I think the object is initialized eagerly in the driver, since it prints creating resource factory on the driver even when nothing refers to it, and the object is initialized lazily in the workers, because it prints creating resource factory after printing creating rdd iterator, as the resource factory is lazily referenced by the first created RDDIterator.
And I found that in my use case the MyRDD class is only created in the driver.
I am not very sure about the laziness of the object's initialization on the driver and the workers; it's my guess, because maybe it's caused by some other part of the program making it look like that. But I think it should be right that there is one actor system on each worker node when it is needed.
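As a reference point outside Spark, a Scala object body runs at most once per JVM (per classloader), on first reference; the tiny standalone snippet below is only meant to illustrate that semantics and is not part of the application:
object ResourceFactoryDemo {
  println("creating resource factory") // runs at most once per JVM, on first reference
  val label: String = "resource"
}

object InitCheck extends App {
  println("before first reference")
  println(ResourceFactoryDemo.label) // first reference: triggers the println above
  println(ResourceFactoryDemo.label) // already initialized: nothing extra is printed
}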
I don't think that there is a way to tap into each Worker lifecycle.
Also I have some questions regarding your implementation:
If you have an object that contains a val that is used from a function run on a worker, my understanding is that this val gets serialized and broadcast to the worker. Can you confirm that you have one ActorSystem running per worker?
An actor system usually terminates immediately if you don't explicitly wait for its termination. Are you calling something like system.awaitTermination or blocking on system.whenTerminated?
Anyway, there is another way you can shut down the actor systems on the remote workers (a minimal sketch follows after these steps):
Make your ActorSystem on each node part of an Akka cluster. Here are some docs on how to do that programmatically.
Have the address of your "coordination" actor on the driver node (where your sc is) broadcast to each worker. In simple words, just have a val with that address.
When your Akka system is started on each worker, use that "coordination" actor address to register this particular actor system (send the corresponding message to the coordination actor).
The coordination actor keeps track of all registered "worker" actors.
When your computation is completed and you want to shut down the Akka system on every worker, send messages to all registered actors from the coordination actor on the driver node.
Shut down the worker Akka systems when the "shutdown" message is received.
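A minimal sketch of that register/shutdown message flow; all names here are made up for illustration, and the actual remoting/cluster configuration from the linked docs is omitted:
import akka.actor.{Actor, ActorRef, PoisonPill}

case object Register
case object ShutdownWorkers

// runs on the driver: keeps track of the worker systems and tells them to stop
class CoordinationActor extends Actor {
  private var workers = Set.empty[ActorRef]

  def receive: Receive = {
    case Register =>
      workers += sender()
    case ShutdownWorkers =>
      workers.foreach(_ ! PoisonPill) // or a custom "shutdown" message
  }
}

// runs inside each worker's ActorSystem: registers itself and terminates its system when stopped
class WorkerGuardian(coordinatorPath: String) extends Actor {
  override def preStart(): Unit =
    context.actorSelection(coordinatorPath) ! Register

  override def postStop(): Unit =
    context.system.terminate()

  def receive: Receive = Actor.emptyBehavior
}
On the driver you would send ShutdownWorkers to the coordination actor right before (or from a hook around) sc.stop().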

How to run Akka

It seems like there is no need for a class with a main method in it to be able to run Akka (see How to run akka actors in IntelliJ IDEA). However, here is what I have:
object Application extends App {
  val system = ActorSystem()
  val supervisor = system.actorOf(Props[Supervisor])
  implicit val timeout = Timeout(100 seconds)
  import system.dispatcher

  system.scheduler.schedule(1 seconds, 600 seconds) {
    val future = supervisor ? Supervisor.Start
    val list = Await.result(future, timeout.duration).asInstanceOf[List[Int]]
    supervisor ! list
  }
}
I know I have to specify "akka.Main" as the main class in the configuration. But nonetheless, where should I move the current code from object Application?
You can write something like
import _root_.akka.Main

object Application extends App {
  Main.main(Array("somepackage.Supervisor"))
}
and the Supervisor actor should have an overridden preStart function, as @cmbaxter suggested.
Then run the sbt console in IntelliJ and type run.
I agree with #kdrakon that your code is fine the way it is, but if you wanted to leverage the akka.Main functionality, then a simple refactor like so will make things work:
package code
class ApplicationActor extends Actor {

  override def preStart = {
    val supervisor = context.actorOf(Props[Supervisor])
    implicit val timeout = Timeout(100 seconds)
    import context.dispatcher

    context.system.scheduler.schedule(1 seconds, 600 seconds) {
      val future = (supervisor ? Supervisor.Start).mapTo[List[Int]]
      val list = Await.result(future, timeout.duration)
      supervisor ! list
    }
  }

  def receive = {
    case _ => //Not sure what to do here
  }
}
In this case, the ApplicationActor is the arg you would pass to akka.Main, and it would basically be the root supervisor of all other actors created in your hierarchy. The only fishy thing here is that, being an Actor, it needs a receive implementation, and I don't imagine any other actors will be sending messages here, thus it doesn't really do anything. But the power of this approach is that when the ApplicationActor is stopped, the stop will also be cascaded down to all other actors that it started, simplifying a graceful shutdown. I suppose you could have the ApplicationActor handle a message to shut down the actor system given some kind of input (maybe a ShutdownHookThread could initiate this) and give this actor some kind of purpose after all. Anyway, as stated earlier, your current approach seems fine, but this could also be an option if you so desire.
EDIT
So if you wanted to run this ApplicationActor via akka.Main, according to the instructions here, you would execute this from your command prompt:
java -classpath <all those JARs> akka.Main code.ApplicationActor
You will of course need to supply <all those JARS> with your dependencies including akka. At a minimum you will need scala-library and akka-actor in your classpath to make this run.
If you refer to http://doc.akka.io/docs/akka/snapshot/scala/hello-world.html, you'll find that akka.Main expects your root/parent Actor. In your case, Supervisor. As for your already existing code, it can be copied directly into the actors code, possibly in some initialisation calls. For example, refer to the HelloWorld's preStart function.
However, in my opinion, your already existing code is just fine too. akka.Main is a nice helper, as is the microkernel binary. But creating your own main executable is a viable option too.