How to use an Akka actor from the Spark nodes of a cluster - scala

I have a Spark cluster running a certain application, and I would like to use an Akka actor to stream data from within each of the Spark nodes. That is: the nodes process data in some way, and in parallel an actor on each node sends some other data to an external process.
Now, these are the possible options:
Just create the ActorRef through a regular ActorSystem: not possible, as the ActorSystem instance is not Serializable and the job will fail at runtime (a minimal sketch of this failure is shown below)
Use the Spark internal ActorSystem to create the actor: not a good option since Spark 1.4, as SparkEnv.get.actorSystem is deprecated
So what is the best way for the Spark nodes to instantiate a given actor if the options above are not valid? Is it possible at all?
This question is somewhat related to this one, although formulated with a wider scope.
Note: I know I could somehow use Spark Streaming for this scenario, but at the moment I would like to explore the feasibility of a pure Akka option
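For reference, here is a minimal sketch of the failing pattern from the first option above. The class and application names are made up for illustration; the point is only that referencing the driver-side ActorSystem inside a task closure forces Spark to serialize it, which fails:

import akka.actor.{Actor, ActorSystem, Props}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical actor that would forward data to the external process.
class MetricsActor extends Actor {
  def receive: Receive = {
    case msg => println(msg) // placeholder: would stream msg to the external process
  }
}

object FailingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("actor-from-nodes").setMaster("local[2]"))

    // Option 1: create the actor through a regular ActorSystem on the driver...
    val system = ActorSystem("driver-system")

    // ...and reference that system inside a task. The closure must then carry the
    // ActorSystem to the executors, and since ActorSystem is not Serializable,
    // Spark rejects the job at runtime with "Task not serializable".
    sc.parallelize(1 to 100).foreach { n =>
      val ref = system.actorOf(Props[MetricsActor])
      ref ! n
    }
  }
}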

Related

When we initialize an actor system and create an actor using the actorOf method, how many actors get created?

I have 2 questions:
How many actors does the below code create?
How do I create 1000 actors at the same time?
val system = ActorSystem("DonutStoreActorSystem")
val donutInfoActor = system.actorOf(Props[DonutInfoActor], name = "DonutInfoActor")
When you start the classic actor system and use actorOf like that, it creates one instance of your DonutInfoActor plus a few internal Akka system actors related to the event bus, logging, and clustering (if you are using that).
Just as texasbruce said in a comment, a loop lets you create any number of actors from a single spot (see the sketch below). Startup is asynchronous, so you will get back an ActorRef that is ready to use, but the actor it references may still be starting up.
Note that if you are building something new, we recommend the new "typed" actor APIs, completed in Akka 2.6, over the classic API used in your sample.
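A minimal sketch of that loop, using the classic API from your sample; the message and the actor's behaviour are placeholders:

import akka.actor.{Actor, ActorSystem, Props}

class DonutInfoActor extends Actor {
  def receive: Receive = {
    case msg => println(s"${self.path.name} received: $msg") // placeholder behaviour
  }
}

object ManyActors extends App {
  val system = ActorSystem("DonutStoreActorSystem")

  // One actorOf call per iteration: 1000 actors, each with a unique name.
  // actorOf returns immediately; the ActorRef is usable even if the actor
  // behind it is still starting up.
  val donutActors = (1 to 1000).map { i =>
    system.actorOf(Props[DonutInfoActor], name = s"DonutInfoActor-$i")
  }

  donutActors.foreach(_ ! "hello")
}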

Can a spark job use akka actors?

Can my Spark job use the Akka actor system, or is that not possible and/or a bad idea?
Can someone explain if it is a bad idea or not?
The challenge is that you would need to serialize your actor to send it to each node in the cluster. Actors are often used for things like sharing mutable state safely across threads by confining it to a single thread -- if the actor exists on every node in the cluster, that's not going to work. It's probably hypothetically possible to use an actor in a Spark operation, but I'm not sure what problem it would solve given the limitations you'd face.

Dependency Injection/IoC within Apache Spark

I am trying to improve my Spark skills and am currently working on a generic Spark Streaming job.
The job assumes a Flume-like pipeline for ingesting data into other databases/storage systems: read -> deserialize -> split -> serialize -> sink to whatever system.
When implementing that outside the context of Spark, there would be a class requiring implementations of a deserializer, splitter, serializer and sink.
But Spark needs to serialize its tasks to be able to distribute them across the cluster, so - afaik - it's not a good idea to pass class instances into Spark task closures.
Now my question is: how would dependency injection a la Guice/Spring/whatever do its job so that the injected components are available on the executors rather than only on the driver?
Are there any example jobs where I can see how DI is done with Spark?
There is a link on the official Spark website (http://spark.apache.org/documentation.html) to a talk about the abstraction of Spark jobs. The speaker mentions DI and the usage of Guice, but the code shown is very high level and doesn't show how that integrates into the whole setup.
https://youtu.be/C7gWtxelYNM?list=PL-x35fyliRwiP3YteXbnhk0QGOtYLBT3a&t=2002
Any hints will be awesome!
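One pattern that is sometimes used for this (offered only as a sketch, not a definitive answer): ship a small serializable module to the executors and build the non-serializable components lazily there, via @transient lazy val, so they are constructed once per executor JVM. All the names below (PipelineModule, Sink, Deserializer, buildSink) are hypothetical; a Guice/Spring injector would then be invoked inside buildSink on the executor side rather than on the driver:

import org.apache.spark.rdd.RDD

// Hypothetical component interfaces for the read -> deserialize -> ... -> sink pipeline.
trait Deserializer[A] extends Serializable { def deserialize(bytes: Array[Byte]): A }
trait Sink[A] extends Serializable { def write(a: A): Unit }

// Instead of injecting live instances into the closure, ship a small serializable
// "module" that knows how to build them. The heavy or non-serializable parts are
// created lazily, once per executor JVM.
class PipelineModule(config: Map[String, String]) extends Serializable {
  @transient lazy val sink: Sink[String] = buildSink(config) // materialized on the executor

  private def buildSink(c: Map[String, String]): Sink[String] =
    new Sink[String] { def write(a: String): Unit = println(a) } // placeholder; a DI container could run here
}

object DiSketch {
  def run(raw: RDD[Array[Byte]], module: PipelineModule, deser: Deserializer[String]): Unit =
    raw.foreachPartition { records =>
      // `module` travels with the task; `module.sink` is built here, on the
      // executor, the first time it is touched.
      records.map(deser.deserialize).foreach(module.sink.write)
    }
}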

Akka: "Trying to deserialize a serialized ActorRef without an ActorSystem in scope" error

I am integrating the use of Akka actors and Spark in the following way: when a task is distributed among the Spark nodes, each node, while processing those tasks, also periodically sends metrics data to a separate collector process that sits somewhere else on the network, through an Akka actor (connected to the remote process via akka-remote).
The actor-based metrics sending/receiving functionality works just fine when used in standalone mode, but when integrated into a Spark task the following error is thrown:
java.lang.IllegalStateException: Trying to deserialize a serialized ActorRef without an ActorSystem in scope. Use 'akka.serialization.Serialization.currentSystem.withValue(system) { ... }'
at akka.actor.SerializedActorRef.readResolve(ActorRef.scala:407) ~[akka-actor_2.10-2.3.11.jar:na]
If I understood it correctly, the source of the problem is the Spark node being unable to deserialize the ActorRef because it does not have the full information required to do it. I understand that putting an ActorSystem in scope would fix it, but I am not sure how to use the suggested akka.serialization.Serialization.currentSystem.withValue(system) { ... }
The official Akka docs are very good on pretty much every topic they cover. Unfortunately, the chapter devoted to Serialization could be improved IMHO.
Note: there is a similar SO question here but the accepted solution is too specific and thus not really useful in the general case
An ActorSystem is responsible for all of the functionality involved with ActorRef objects.
When you program something like
actorRef ! message
you're actually invoking a bunch of work within the ActorSystem, not the ActorRef, to put the message in the right mailbox, tee up the Actor to run the receive method within the thread pool, etc. From the documentation:
An actor system manages the resources it is configured to use in order to run the actors which it contains. There may be millions of actors within one such system, after all the mantra is to view them as abundant and they weigh in at an overhead of only roughly 300 bytes per instance. Naturally, the exact order in which messages are processed in large systems is not controllable by the application author.
That is why your code works fine "standalone" but not in Spark: each of your Spark nodes is missing the ActorSystem machinery, so even if you could deserialize the ActorRef on a node, there would be no ActorSystem to process the ! in your node function.
You can establish an ActorSystem within each node and use (i) remoting to send messages to your ActorRef in the "master" ActorSystem via actorSelection, or (ii) the serialization method you mentioned, where each node's ActorSystem would be the system in the example you quoted.
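A minimal sketch of option (i), assuming akka-remote is enabled in each node's configuration; the collector path, system names and record type are placeholders for whatever your setup actually uses:

import akka.actor.{ActorSelection, ActorSystem}
import org.apache.spark.rdd.RDD

// One ActorSystem per executor JVM, created lazily in a singleton object so that
// nothing Akka-related has to travel inside the task closure.
object NodeActorSystem {
  lazy val system: ActorSystem = ActorSystem("node-system") // remoting enabled via application.conf

  // Hypothetical path: host/port of the collector's ActorSystem and the name of its actor.
  lazy val collector: ActorSelection =
    system.actorSelection("akka.tcp://CollectorSystem@collector-host:2552/user/metrics-collector")
}

object MetricsFromTasks {
  def send(records: RDD[String]): Unit =
    records.foreachPartition { partition =>
      partition.foreach { record =>
        // Resolved on the executor through the executor's own ActorSystem, so no
        // ActorRef ever has to be deserialized without a system in scope.
        NodeActorSystem.collector ! record
      }
    }
}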

Best practices for actors lookup from actors

I am using Akka 2.2.x in cluster mode. My question is related to actor lookup.
I have two kinds of actors:
processor. It accepts text and processes it somehow.
result collector. It accepts messages from the processor and summarizes the results.
The processor actor needs to send messages to the result collector, so I need to have its ActorRef inside the processor.
The question is: how to pass or look up this ActorRef in the processor.
I have 3 different solutions for now:
Look up the ActorRef at creation time and pass it to the processor as a constructor parameter. Looks possibly wrong because it does not handle the actor restart process and does not fit a cluster environment.
Look up in preStart with context.actorSelection("../result-collector"). After this I have an ActorSelection object and can send messages with !. In this solution I am wary of performance degradation because of a cluster lookup before every call. Or am I wrong here?
Look up in preStart with context.actorSelection("../result-collector") and call resolveOne to obtain an ActorRef. Looks OK, but may not handle Akka cluster changes.
Thanks!
There is no problem with either restarts or clustering when using #1. Lookups are only useful in cases where the ActorRef is not available by any other means.
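A minimal sketch of #1 with the classic API; the actor behaviour and the process method are placeholders:

import akka.actor.{Actor, ActorRef, ActorSystem, Props}

// Result collector: summarizes results sent by processors.
class ResultCollector extends Actor {
  def receive: Receive = {
    case result => println(s"collected: $result") // placeholder aggregation
  }
}

// Processor: receives the collector's ActorRef as a constructor parameter (option #1).
// An ActorRef stays valid across restarts of the actor it points to, so this works
// with supervision and in a cluster.
class Processor(collector: ActorRef) extends Actor {
  def receive: Receive = {
    case text: String => collector ! process(text)
  }
  private def process(text: String): String = text.toUpperCase // placeholder processing
}

object Wiring extends App {
  val system    = ActorSystem("pipeline")
  val collector = system.actorOf(Props[ResultCollector], "result-collector")
  val processor = system.actorOf(Props(classOf[Processor], collector), "processor")

  processor ! "some text"
}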