Scala Akka Consumer/Producer: Return Value

Problem Statement
Assume I have a file with sentences that is processed line by line. In my case, I need to extract Named Entities (Persons, Organizations, ...) from these lines. Unfortunately, the tagger is quite slow. Therefore, I decided to parallelize the computation so that lines can be processed independently of each other and the results collected in a central location.
Current Approach
My current approach uses a single-producer, multiple-consumer design. I'm relatively new to Akka, but I think my problem fits its capabilities well. Let me show you some code:
Producer
The Producer reads the file line by line and sends it to the Consumer. If it reaches the total line limit, it propagates the result back to WordCount.
class Producer(consumers: ActorRef) extends Actor with ActorLogging {
var master: Option[ActorRef] = None
var result = immutable.List[String]()
var totalLines = 0
var linesProcessed = 0
override def receive = {
case StartProcessing() => {
master = Some(sender)
Source.fromFile("sent.txt", "utf-8").getLines.foreach { line =>
consumers ! Sentence(line)
totalLines += 1
}
context.stop(self)
}
case SentenceProcessed(list) => {
linesProcessed += 1
result :::= list
//If we are done, we can propagate the result to the creator
if (linesProcessed == totalLines) {
master.map(_ ! result)
}
}
case _ => log.error("message not recognized")
}
}
Consumer
class Consumer extends Actor with ActorLogging {
def tokenize(line: String): Seq[String] = {
line.split(" ").map(_.toLowerCase)
}
override def receive = {
case Sentence(sent) => {
//Assume: This is representative for the extensive computation method
val tokens = tokenize(sent)
sender() ! SentenceProcessed(tokens.toList)
}
case _ => log.error("message not recognized")
}
}
WordCount (Master)
class WordCount extends Actor {
val consumers = context.actorOf(Props[Consumer].
withRouter(FromConfig()).
withDispatcher("consumer-dispatcher"), "consumers")
val producer = context.actorOf(Props(new Producer(consumers)), "producer")
context.watch(consumers)
context.watch(producer)
def receive = {
case Terminated(`producer`) => consumers ! Broadcast(PoisonPill)
case Terminated(`consumers`) => context.system.shutdown
}
}
object WordCount {
def getActor() = new WordCount
def getConfig(routerType: String, dispatcherType: String)(numConsumers: Int) = s"""
akka.actor.deployment {
/WordCount/consumers {
router = $routerType
nr-of-instances = $numConsumers
dispatcher = consumer-dispatcher
}
}
consumer-dispatcher {
type = $dispatcherType
executor = "fork-join-executor"
}"""
}
The WordCount actor is responsible for creating the other actors. When the Consumers have finished, the Producer sends a message with all tokens. But how do I propagate that message further, and how do I accept and wait for it? The architecture with the third WordCount actor might be wrong.
Main Routine
case class Run(name: String, actor: () => Actor, config: (Int) => String)
object Main extends App {
val run = Run("push_implementation", WordCount.getActor _, WordCount.getConfig("balancing-pool", "Dispatcher") _)
def execute(run: Run, numConsumers: Int) = {
val config = ConfigFactory.parseString(run.config(numConsumers))
val system = ActorSystem("Counting", ConfigFactory.load(config))
val startTime = System.currentTimeMillis
system.actorOf(Props(run.actor()), "WordCount")
/*
How to get the result here?!
*/
system.awaitTermination
System.currentTimeMillis - startTime
}
execute(run, 4)
}
Problem
As you can see, the actual problem is propagating the result back to the Main routine. Can you tell me how to do this properly? A related question is how to wait for the result until the consumers have finished. I had a brief look at the Futures section of the Akka documentation, but the whole system is a little overwhelming for a beginner. Something like val future = actor ? message seems suitable, but I'm not sure how to use it. Using the WordCount actor also adds complexity. Maybe there is a solution that doesn't need this actor?

Consider using the Akka Aggregator Pattern. That takes care of the low-level primitives (watching actors, poison pill, etc). You can focus on managing state.
Your call to system.actorOf() returns an ActorRef, but you're not using it. You should ask that actor for results. Something like this:
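import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Await, scala.concurrent.duration._
implicit val timeout = Timeout(5.seconds)
val wCount = system.actorOf(Props(run.actor()), "WordCount")
val answer = Await.result(wCount ? "sent.txt", timeout.duration)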
This means your WordCount class needs a receive method that accepts a String message. That section of code should aggregate the results and tell the sender(), like this:
class WordCount extends Actor {
def receive: Receive = {
case filename: String =>
// do all of your code here, using filename
sender() ! results
}
}
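To make the "aggregate, then tell the sender()" step more concrete, here is a minimal sketch assuming classic Akka actors. The names Worker, Aggregator and Start are hypothetical, and the worker's tokenizer merely stands in for the slow tagger. The aggregator remembers who asked, counts outstanding replies, and answers once the count reaches zero.

```scala
import akka.actor.{Actor, ActorRef, ActorSystem, Props}
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Await
import scala.concurrent.duration._

// Hypothetical message protocol for this sketch.
case object Start
final case class Sentence(line: String)
final case class SentenceProcessed(tokens: List[String])

// Stand-in for the slow tagger: lower-cases and splits on spaces.
class Worker extends Actor {
  def receive: Receive = {
    case Sentence(line) =>
      sender() ! SentenceProcessed(line.toLowerCase.split(" ").toList)
  }
}

// Fans lines out to workers, collects the replies, then answers the original asker.
class Aggregator(lines: List[String]) extends Actor {
  private var pending = 0
  private var tokens = List.empty[String]
  private var replyTo: Option[ActorRef] = None

  def receive: Receive = {
    case Start =>
      replyTo = Some(sender())
      pending = lines.size
      if (pending == 0) finish()
      else lines.foreach(l => context.actorOf(Props(new Worker)) ! Sentence(l))
    case SentenceProcessed(ts) =>
      tokens = tokens ++ ts // order follows arrival, not input order
      pending -= 1
      if (pending == 0) finish()
  }

  private def finish(): Unit = {
    replyTo.foreach(_ ! tokens)
    context.stop(self) // also stops the child workers
  }
}

object AggregatorDemo {
  def run(lines: List[String]): List[String] = {
    val system = ActorSystem("aggregator-demo")
    implicit val timeout: Timeout = Timeout(5.seconds)
    try {
      val agg = system.actorOf(Props(new Aggregator(lines)))
      Await.result((agg ? Start).mapTo[List[String]], timeout.duration)
    } finally system.terminate()
  }
}
```

Because replies arrive in whatever order the workers finish, callers that need the input order would have to tag each Sentence with an index and sort at the end.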
Also, rather than blocking on the results with Await above, you can apply some techniques for handling Futures.
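As a rough illustration of the non-blocking alternative, the sketch below uses only the Scala standard library. slowCount is a hypothetical stand-in for the wCount ? "sent.txt" ask; the point is the map/recover chaining, with a single Await kept at the outermost edge only so that the demo does not exit early.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object FutureHandlingDemo {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Hypothetical stand-in for the ask: counts tokens asynchronously.
  def slowCount(lines: List[String]): Future[Int] =
    Future(lines.map(_.split(" ").length).sum)

  // Transform the eventual result without blocking the calling thread.
  def report(lines: List[String]): Future[String] =
    slowCount(lines)
      .map(n => s"tokens: $n")
      .recover { case e: Exception => s"failed: ${e.getMessage}" }

  def main(args: Array[String]): Unit =
    // Block once, at the very edge of the program.
    println(Await.result(report(List("a b c", "d e")), 5.seconds))
}
```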

Related

Ask Akka actor for a result only when all the messages are processed

I am trying to split a big chunk of text into multiple paragraphs and process it concurrently by calling an external API.
An immutable list is updated each time the response comes from the API for the paragraph.
Once the paragraphs are processed and the list is updated, I would like to ask the Actor for the final status to be used in the next steps.
The problem with the below approach is that I would never know when all the paragraphs are processed.
I need to get back the targetStore once all the paragraphs are processed and the list is final.
def main(args: Array[String]) {
val source = Source.fromFile("input.txt")
val extDelegator = new ExtractionDelegator()
source.getLines().foreach(line => extDelegator.processParagraph(line))
extDelegator.getFinalResult()
}
case class Extract(uuid: UUID, text: String)
case class UpdateList(text: String)
case class DelegateLambda(text: String)
case class FinalResult()
class ExtractionDelegator {
val system = ActorSystem("ExtractionDelegator")
val extActor = system.actorOf(Props(classOf[ExtractorDelegateActor]).withDispatcher("fixed-thread-pool"))
implicit val executionContext = system.dispatchers.lookup("fixed-thread-pool")
def processParagraph(text: String) = {
extActor ! Extract(uuid, text)
}
def getFinalResult(): java.util.List[String] = {
implicit val timeout = Timeout(5 seconds)
val askActor = system.actorOf(Props(classOf[ExtractorDelegateActor]))
val future = askActor ? FinalResult()
val result = Await.result(future, timeout.duration).asInstanceOf[java.util.List[String]]
result
}
def shutdown(): Unit = {
system.terminate()
}
}
/* Extractor Delegator actor*/
class ExtractorDelegateActor extends Actor with ActorLogging {
var targetStore:scala.collection.immutable.List[String] = scala.collection.immutable.List.empty
def receive = {
case Extract(uuid, text) => {
context.actorOf(Props[ExtractProcessor].withDispatcher("fixed-thread-pool")) ! DelegateLambda(text)
}
case UpdateList(res) => {
targetStore = targetStore :+ res
}
case FinalResult() => {
val senderActor=sender()
senderActor ! targetStore
}
}
}
/* Aggregator actor*/
class ExtractProcessor extends Actor with ActorLogging {
def receive = {
case DelegateLambda(text) => {
val res =callLamdaService(text)
sender ! UpdateList(res)
}
}
def callLamdaService(text: String): String = {
// This is where the external API is called.
Thread.sleep(1000)
result
}
}
I'm not sure why you want to use actors here. The simplest approach would be:
// because you call an external service, the response is most likely asynchronous already
def callLamdaService(text: String): Future[String]
and to process your text you do
implicit val ec = scala.concurrent.ExecutionContext.Implicits.global // use your own execution context here
Future.sequence(source.getLines().toList.map(callLamdaService)).map { results =>
// do what you want with results
}
If you still want to use actors, you can replace callLamdaService with processParagraph, which internally does an ask on a worker actor that returns the result (so the signature becomes def processParagraph(text: String): Future[String]).
If you still want to start multiple tasks and only then ask for the result, use context.become with a behaviour such as receive(workers: Int): increment the worker count on each Extract message and decrement it on each UpdateList message. You will also need to delay the reply to FinalResult while the number of running workers is non-zero.
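The context.become idea could look roughly like the following sketch, assuming classic Akka actors. The message names are reused from the question in simplified form (Extract without the UUID, FinalResult as a case object); Processor, Delegator and the upper-casing "API call" are hypothetical. All state lives in the parameters of the current behaviour, and a FinalResult that arrives while workers are still running is parked until the last UpdateList comes in.

```scala
import akka.actor.{Actor, ActorRef, ActorSystem, Props}
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Await
import scala.concurrent.duration._

final case class Extract(text: String)
final case class UpdateList(text: String)
case object FinalResult

// Hypothetical worker: upper-cases the text in place of the external API call.
class Processor extends Actor {
  def receive: Receive = {
    case Extract(text) =>
      sender() ! UpdateList(text.toUpperCase)
      context.stop(self)
  }
}

class Delegator extends Actor {
  def receive: Receive = running(pending = 0, store = Vector.empty, waiting = None)

  private def running(pending: Int, store: Vector[String], waiting: Option[ActorRef]): Receive = {
    case msg: Extract =>
      context.actorOf(Props(new Processor)) ! msg
      context.become(running(pending + 1, store, waiting))
    case UpdateList(res) =>
      val updated = store :+ res
      val left = pending - 1
      waiting match {
        case Some(asker) if left == 0 =>
          asker ! updated.toList // deferred reply, now that all workers are done
          context.become(running(0, updated, None))
        case _ =>
          context.become(running(left, updated, waiting))
      }
    case FinalResult =>
      if (pending == 0) sender() ! store.toList
      else context.become(running(pending, store, Some(sender()))) // answer later
  }
}

object DelegatorDemo {
  def run(paragraphs: List[String]): List[String] = {
    val system = ActorSystem("delegator-demo")
    implicit val timeout: Timeout = Timeout(5.seconds)
    try {
      val delegator = system.actorOf(Props(new Delegator))
      paragraphs.foreach(p => delegator ! Extract(p))
      // Sent after all Extract messages, so the reply covers every paragraph.
      Await.result((delegator ? FinalResult).mapTo[List[String]], timeout.duration)
    } finally system.terminate()
  }
}
```

Note that this only works if FinalResult is sent after the last Extract; a FinalResult that arrives first sees zero pending workers and gets an empty list.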

Is there a limit to how many Akka Streams can run at the same time?

I am trying to implement a simple one-to-many pub/sub pattern using a BroadcastHub. This fails silently for large numbers of subscribers, which makes me think I am hitting some limit on the number of streams I can run.
First, let's define some events:
sealed trait Event
case object EX extends Event
case object E1 extends Event
case object E2 extends Event
case object E3 extends Event
case object E4 extends Event
case object E5 extends Event
I have implemented the publisher using a BroadcastHub, adding a Sink.actorRefWithAck each time I want to add a new subscriber. Publishing the EX event ends the broadcast:
trait Publisher extends Actor with ActorLogging {
implicit val materializer = ActorMaterializer()
private val sourceQueue = Source.queue[Event](Publisher.bufferSize, Publisher.overflowStrategy)
private val (
queue: SourceQueueWithComplete[Event],
source: Source[Event, NotUsed]
) = {
val (q,s) = sourceQueue.toMat(BroadcastHub.sink(bufferSize = 256))(Keep.both).run()
s.runWith(Sink.ignore)
(q,s)
}
def publish(evt: Event) = {
log.debug("Publishing Event: {}", evt.getClass().toString())
queue.offer(evt)
evt match {
case EX => queue.complete()
case _ => ()
}
}
def subscribe(actor: ActorRef, ack: ActorRef): Unit =
source.runWith(
Sink.actorRefWithAck(
actor,
onInitMessage = Publisher.StreamInit(ack),
ackMessage = Publisher.StreamAck,
onCompleteMessage = Publisher.StreamDone,
onFailureMessage = onErrorMessage))
def onErrorMessage(ex: Throwable) = Publisher.StreamFail(ex)
def publisherBehaviour: Receive = {
case Publisher.Subscribe(sub, ack) => subscribe(sub, ack.getOrElse(sender()))
case Publisher.StreamAck => ()
}
override def receive = LoggingReceive { publisherBehaviour }
}
object Publisher {
final val bufferSize = 5
final val overflowStrategy = OverflowStrategy.backpressure
case class Subscribe(sub: ActorRef, ack: Option[ActorRef])
case object StreamAck
case class StreamInit(ack: ActorRef)
case object StreamDone
case class StreamFail(ex: Throwable)
}
Subscribers can implement the Subscriber trait to separate the logic:
trait Subscriber {
def onInit(publisher: ActorRef): Unit = ()
def onInit(publisher: ActorRef, k: KillSwitch): Unit = onInit(publisher)
def onEvent(event: Event): Unit = ()
def onDone(publisher: ActorRef, subscriber: ActorRef): Unit = ()
def onFail(e: Throwable, publisher: ActorRef, subscriber: ActorRef): Unit = ()
}
The actor logic is quite simple:
class SubscriberActor(subscriber: Subscriber) extends Actor with ActorLogging {
def subscriberBehaviour: Receive = {
case Publisher.StreamInit(ack) => {
log.debug("Stream initialized.")
subscriber.onInit(sender())
sender() ! Publisher.StreamAck
ack.forward(Publisher.StreamInit(ack))
}
case Publisher.StreamDone => {
log.debug("Stream completed.")
subscriber.onDone(sender(),self)
}
case Publisher.StreamFail(ex) => {
log.error(ex, "Stream failed!")
subscriber.onFail(ex,sender(),self)
}
case e: Event => {
log.debug("Observing Event: {}",e)
subscriber.onEvent(e)
sender() ! Publisher.StreamAck
}
}
override def receive = LoggingReceive { subscriberBehaviour }
}
One of the key points is that all subscribers must receive all messages sent by the publisher, so we have to know that all streams have materialized and all actors are ready to receive before starting the broadcast. This is why the StreamInit message is forwarded to another, user-provided actor.
To test this, I define a simple MockPublisher that just broadcasts a list of events when told to do so:
class MockPublisher(events: Event*) extends Publisher {
def receiveBehaviour: Receive = {
case MockPublish => events map publish
}
override def receive = LoggingReceive { receiveBehaviour orElse publisherBehaviour }
}
case object MockPublish
I also define a MockSubscriber who merely counts how many events it has seen:
class MockSubscriber extends Subscriber {
var count = 0
val promise = Promise[Int]()
def future = promise.future
override def onInit(publisher: ActorRef): Unit = count = 0
override def onEvent(event: Event): Unit = count += 1
override def onDone(publisher: ActorRef, subscriber: ActorRef): Unit = promise.success(count)
override def onFail(e: Throwable, publisher: ActorRef, subscriber: ActorRef): Unit = promise.failure(e)
}
And a small method for subscription:
object MockSubscriber {
def sub(publisher: ActorRef, ack: ActorRef)(implicit system: ActorSystem): Future[Int] = {
val s = new MockSubscriber()
implicit val tOut = Timeout(1.minute)
val a = system.actorOf(Props(new SubscriberActor(s)))
val f = publisher ! Publisher.Subscribe(a, Some(ack))
s.future
}
}
I put everything together in a unit test:
class SubscriberTests extends TestKit(ActorSystem("SubscriberTests")) with
WordSpecLike with Matchers with BeforeAndAfterAll with ImplicitSender {
override def beforeAll:Unit = {
system.eventStream.setLogLevel(Logging.DebugLevel)
}
override def afterAll:Unit = {
println("Shutting down...")
TestKit.shutdownActorSystem(system)
}
"The Subscriber" must {
"publish events to many observers" in {
val n = 9
val p = system.actorOf(Props(new MockPublisher(E1,E2,E3,E4,E5,EX)))
val q = scala.collection.mutable.Queue[Future[Int]]()
for (i <- 1 to n) {
q += MockSubscriber.sub(p,self)
}
for (i <- 1 to n) {
expectMsgType[Publisher.StreamInit](70.seconds)
}
p ! MockPublish
q.map { f => Await.result(f, 10.seconds) should be (6) }
}
}
}
This test succeeds for relatively small values of n, but fails for, say, val n = 90000. No caught or uncaught exception appears anywhere and neither does any out-of-memory complaint from Java (which does occur if I go even higher).
What am I missing?
Edit: Tried this on multiple computers with different specs. Debug info shows no messages reach any of the subscribers once n is high enough.
Akka Streams (like any other Reactive Streams implementation) provides backpressure. As long as your consumers are built sensibly (for example, you don't build a 1 GB JSON document in memory and only then chop it into smaller pieces), memory usage is effectively upper-bounded, because backpressure governs the push-pull mechanics. Once you have measured where that upper bound lies, you can size your JVM and container memory so that the application runs without fear of out-of-memory errors (provided nothing else in the JVM causes a memory spike).
So there is a constraint on how many streams you can run in parallel: as many as your memory allows. CPU should not be a hard limit (you will have multiple threads), but if you start too many streams on one machine, the CPU will inevitably have to switch between them, making each one slower. That may not be a technical blocker, but processing can become so slow that it no longer fulfils its business purpose (though I guess you would have to run far more than a few streams at once).
In your tests you might run into other issues as well. For example, if you run blocking operations on the same thread pool as the actor system without telling the pool that they block, you can end up with a deadlock (as a matter of fact, you should run all blocking I/O on a different thread pool from "computing" operations). Having 90000(!) concurrent things happening at the same time (probably on the same small thread pool) almost guarantees trouble (I guess you could hit it even if you ran the code directly on futures instead of actors). Here you are using the actor system in tests, which AFAIR uses blocking logic, and that only highlights the possible issues with small thread pools that mix blocking and non-blocking tasks.
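The advice about keeping blocking work off the "computing" pool can be sketched with the standard library alone. The pool size of 16 and the Thread.sleep standing in for a blocking call are assumptions made for illustration; in an Akka application the equivalent would be a separately configured dispatcher.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object BlockingPoolDemo {
  // Runs `tasks` simulated blocking calls on a dedicated pool and sums the doubled results.
  def run(tasks: Int): Int = {
    // Dedicated pool for blocking work; passed explicitly, never made implicit.
    val blockingEc = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(16))
    // Default pool for short, CPU-bound steps.
    implicit val computeEc: ExecutionContext = ExecutionContext.global
    try {
      val results = (1 to tasks).toList.map { i =>
        Future { Thread.sleep(20); i }(blockingEc) // blocking call stays on its own pool
          .map(_ * 2)                              // cheap transformation on the compute pool
      }
      Await.result(Future.sequence(results), 10.seconds).sum
    } finally blockingEc.shutdown()
  }
}
```

With this split, a burst of blocking calls can exhaust only the dedicated pool, so the threads that drive the rest of the application stay available.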

Akka-Streams ActorPublisher does not receive any Request messages

I am trying to continuously read the wikipedia IRC channel using this lib: https://github.com/implydata/wikiticker
I created a custom Akka Publisher, which will be used in my system as a Source.
Here are some of my classes:
class IrcPublisher() extends ActorPublisher[String] {
import scala.collection._
var queue: mutable.Queue[String] = mutable.Queue()
override def receive: Actor.Receive = {
case Publish(s) =>
println(s"->MSG, isActive = $isActive, totalDemand = $totalDemand")
queue.enqueue(s)
publishIfNeeded()
case Request(cnt) =>
println("Request: " + cnt)
publishIfNeeded()
case Cancel =>
println("Cancel")
context.stop(self)
case _ =>
println("Hm...")
}
def publishIfNeeded(): Unit = {
while (queue.nonEmpty && isActive && totalDemand > 0) {
println("onNext")
onNext(queue.dequeue())
}
}
}
object IrcPublisher {
case class Publish(data: String)
}
I am creating all these objects like so:
def createSource(wikipedias: Seq[String]) {
val dataPublisherRef = system.actorOf(Props[IrcPublisher])
val dataPublisher = ActorPublisher[String](dataPublisherRef)
val listener = new MessageListener {
override def process(message: Message) = {
dataPublisherRef ! Publish(Jackson.generate(message.toMap))
}
}
val ticker = new IrcTicker(
"irc.wikimedia.org",
"imply",
wikipedias map (x => s"#$x.wikipedia"),
Seq(listener)
)
ticker.start() // if I comment this...
Thread.currentThread().join() //... and this I get Request(...)
Source.fromPublisher(dataPublisher)
}
So the problem I am facing is with this Source object. Although this implementation works well with other sources (for example, a local file), the ActorPublisher doesn't receive Request() messages.
If I comment out the two marked lines, I can see that my actor receives the Request(count) message from my flow. Otherwise all messages are pushed into the queue but never reach my flow (I can see the MSG lines printed).
I think it's something with multithreading/synchronization here.
I am not familiar enough with wikiticker to solve your problem as given. One question I would have is: why is it necessary to join to the current thread?
However, I think you have overcomplicated the usage of Source. It would be easier for you to work with the stream as a whole rather than create a custom ActorPublisher.
You can use Source.actorRef to materialize a stream into an ActorRef and work with that ActorRef. This lets Akka do the enqueueing/dequeueing onto the buffer while you focus on the "business logic".
Say, for example, your entire stream is only to filter lines above a certain length and print them to the console. This could be accomplished with:
def dispatchIRCMessages(actorRef : ActorRef) = {
val ticker =
new IrcTicker("irc.wikimedia.org",
"imply",
wikipedias map (x => s"#$x.wikipedia"),
Seq(new MessageListener {
override def process(message: Message) =
actorRef ! Jackson.generate(message.toMap) // send the plain String: the stream below is typed as String
}))
ticker.start()
Thread.currentThread().join()
}
//these variables control the buffer behavior
val bufferSize = 1024
val overFlowStrategy = akka.stream.OverflowStrategy.dropHead
val minMessageSize = 32
//no need for a custom Publisher/Queue
val streamRef =
Source.actorRef[String](bufferSize, overFlowStrategy)
.via(Flow[String].filter(_.size > minMessageSize))
.to(Sink.foreach[String](println))
.run()
dispatchIRCMessages(streamRef)
The dispatchIRCMessages has the added benefit that it will work with any ActorRef so you aren't required to only work with streams/publishers.
Hopefully this solves your underlying problem...
I think the main problem is Thread.currentThread().join(). This line hangs the current thread, because the thread is waiting for itself to die. Please read https://docs.oracle.com/javase/8/docs/api/java/lang/Thread.html#join-long- .
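The self-join behaviour can be observed safely with the timed variant join(millis), which gives up after the timeout instead of hanging forever. The sketch below uses only the JDK; SelfJoinDemo and the 200 ms timeout are illustrative choices.

```scala
object SelfJoinDemo {
  // Returns how long (in ms) a thread waited when joining on itself with a timeout,
  // and whether it was still alive afterwards.
  def run(timeoutMs: Long): (Long, Boolean) = {
    var waited = 0L
    var alive = false
    val t = new Thread(() => {
      val start = System.nanoTime()
      Thread.currentThread().join(timeoutMs) // waits for *itself* to finish
      waited = (System.nanoTime() - start) / 1000000
      alive = Thread.currentThread().isAlive
    })
    t.start()
    t.join() // joining a *different* thread is fine and makes the writes above visible
    (waited, alive)
  }

  def main(args: Array[String]): Unit = {
    val (waited, alive) = run(200)
    println(s"self-join returned after about $waited ms, still alive: $alive")
  }
}
```

The timed call returns only because the timeout expires; the untimed join() used in the question has no such exit, so every statement after it (including the line that returns the Source) is never reached.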

Akka Actors Still Available After Stopped by PoisonPill

I'm using akka to dynamically create actors and destroy them when they're finished with a particular job. I've got a handle on actor creation, however stopping the actors keeps them in memory regardless of how I've terminated them. Eventually this causes an out of memory exception, despite the fact that I should only have a handful of active actors at any given time.
I've used:
self.tell(PoisonPill, self)
and:
context.stop(self)
to try and destroy the actors. Any ideas?
Edit: Here's a bit more to flesh out what I'm trying to do. The program opens up and spawns ten actors.
val system = ActorSystem("system")
(1 to 10) foreach { x =>
Entity.count += 1
system.actorOf(Props[Entity], name = Entity.count.toString())
}
Here's the code for the Entity:
class Entity () extends Actor {
Entity.entities += this
val id = Entity.count
import context.dispatcher
val tick = context.system.scheduler.schedule(0 millis, 100 millis, self, "update")
def receive = {
case "update" => {
Entity.entities.foreach(that => collide(that))
}
}
override def postStop() = tick.cancel()
def collide(that:Entity) {
if (!this.isBetterThan(that)) {
destroyMe()
spawnNew()
}
}
def isBetterThan(that: Entity): Boolean = {
//computationally intensive logic
}
private def destroyMe(){
Entity.entities.remove(Entity.entities.indexOf(this))
self.tell(PoisonPill, self)
//context.stop(self)
}
private def spawnNew(){
val system = ActorSystem("system")
Entity.count += 1
system.actorOf(Props[Entity], name = Entity.count.toString())
}
}
object Entity {
val entities = new ListBuffer[Entity]()
var count = 0
}
Thanks @AmigoNico, you pointed me in the right direction. It turns out that neither
self.tell(PoisonPill, self)
nor
context.stop(self)
worked for timely Actor disposal; I switched the line to:
system.stop(self)
and everything works as expected.

akka Actor selection without race condition

I have a pool of futures, and each future works with the same Akka actor system: some actors in the system should be global, while others are used by only one future.
val longFutures = for (i <- 0 until 2 ) yield Future {
val p:Page = PhantomExecutor(isDebug=true)
Await.result( p.open("http://www.stackoverflow.com/") ,timeout = 10.seconds)
}
PhantomExecutor tries to use one shared global actor (a simple increment counter) via system.actorSelection:
def selectActor[T <: Actor : ClassTag](system:ActorSystem,name:String) = {
val timeout = Timeout(0.1 seconds)
val myFutureStuff = system.actorSelection("akka://"+system.name+"/user/"+name)
val aid:ActorIdentity = Await.result(myFutureStuff.ask(Identify(1))(timeout).mapTo[ActorIdentity],
0.1 seconds)
aid.ref match {
case Some(cacher) =>
cacher
case None =>
system.actorOf(Props[T],name)
}
}
But in a concurrent environment this approach does not work because of a race condition.
I know only one solution to this problem: create the global actors before splitting into futures. But this means that I can't encapsulate a lot of this hidden work away from the top-level library user.
You're right that making sure the global actors are initialized first is the right approach. Can't you tie them to a companion object and reference them from there, so you know they will only ever be initialized once? If you really can't go with such an approach, you could try something like this to look up or create the actor. It is similar to your code, but it includes logic to go back through the lookup/create steps (recursively) if the race condition is hit (only up to a maximum number of times):
def findOrCreateActor[T <: Actor : ClassTag](system:ActorSystem, name:String, maxAttempts:Int = 5):ActorRef = {
import system.dispatcher
val timeout = 0.1 seconds
def doFindOrCreate(depth:Int = 0):ActorRef = {
if (depth >= maxAttempts)
throw new RuntimeException(s"Can not create actor with name $name and reached max attempts of $maxAttempts")
val selection = system.actorSelection(s"/user/$name")
val fut = selection.resolveOne(timeout).map(Some(_)).recover{
case ex:ActorNotFound => None
}
val refOpt = Await.result(fut, timeout)
refOpt match {
case Some(ref) => ref
case None => util.Try(system.actorOf(Props[T],name)).getOrElse(doFindOrCreate(depth + 1))
}
}
doFindOrCreate()
}
Now the retry logic would fire for any exception when creating the actor, so you might want to further specify that (probably via another recover combinator) to only recurse when it gets an InvalidActorNameException, but you get the idea.
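One way to narrow the retry, sketched under the assumption of classic Akka actors: match on the Try result and treat only InvalidActorNameException (which actorOf throws when the name is already taken) as "someone else won the race". NarrowRetry, createOrLookup and Counter are hypothetical names, and the one-second timeout is arbitrary.

```scala
import akka.actor.{Actor, ActorRef, ActorSystem, InvalidActorNameException, Props}
import scala.concurrent.Await
import scala.concurrent.duration._
import scala.reflect.ClassTag
import scala.util.{Failure, Success, Try}

// Hypothetical shared actor: a simple increment counter.
class Counter extends Actor {
  private var n = 0
  def receive: Receive = { case "inc" => n += 1; sender() ! n }
}

object NarrowRetry {
  // Creates the actor, or looks it up if another caller already claimed the name.
  def createOrLookup[T <: Actor : ClassTag](system: ActorSystem, name: String): ActorRef =
    Try(system.actorOf(Props[T](), name)) match {
      case Success(ref) => ref
      case Failure(_: InvalidActorNameException) =>
        // Name already taken: resolve the existing actor instead of failing.
        Await.result(system.actorSelection(s"/user/$name").resolveOne(1.second), 1.second)
      case Failure(other) => throw other // anything else is a real error
    }

  def run(): Boolean = {
    val system = ActorSystem("narrow-retry-demo")
    try {
      val first = createOrLookup[Counter](system, "counter")
      val second = createOrLookup[Counter](system, "counter") // name clash, falls back to lookup
      first == second
    } finally system.terminate()
  }
}
```

Trying creation first and falling back to lookup also removes the need for a retry counter in the common case, since a name clash implies the actor exists.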
You may want to consider creating a manager actor that takes care of creating the "counter" actors. This way you ensure that counter actor creation requests are serialized.
object CounterManagerActor {
case class SelectActorRequest(name : String)
case class SelectActorResponse(name : String, actorRef : ActorRef)
}
class CounterManagerActor extends Actor {
def receive = {
case SelectActorRequest(name) => {
sender() ! SelectActorResponse(name, selectActor(name))
}
}
private def selectActor(name : String) = {
// a slightly modified version of the original selectActor() method
???
}
}
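The body left open above could be filled in along these lines; this is only a sketch, assuming classic Akka actors and assuming the counters are created as children of the manager (the original selectActor created them under /user instead). CounterActor, CounterManager and the "hits" name are hypothetical. Since an actor handles one message at a time, the check-then-create step cannot race.

```scala
import akka.actor.{Actor, ActorRef, ActorSystem, Props}
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Await
import scala.concurrent.duration._

final case class SelectActorRequest(name: String)
final case class SelectActorResponse(name: String, actorRef: ActorRef)

// Hypothetical shared actor: a simple increment counter.
class CounterActor extends Actor {
  private var count = 0
  def receive: Receive = { case "inc" => count += 1; sender() ! count }
}

class CounterManager extends Actor {
  def receive: Receive = {
    case SelectActorRequest(name) =>
      // Reuse the child with this name if it exists, otherwise create it.
      val ref = context.child(name).getOrElse(context.actorOf(Props(new CounterActor), name))
      sender() ! SelectActorResponse(name, ref)
  }
}

object CounterManagerDemo {
  def run(): Boolean = {
    val system = ActorSystem("manager-demo")
    implicit val timeout: Timeout = Timeout(3.seconds)
    try {
      val manager = system.actorOf(Props(new CounterManager), "manager")
      def select(): ActorRef =
        Await.result((manager ? SelectActorRequest("hits")).mapTo[SelectActorResponse], 3.seconds).actorRef
      select() == select() // both requests resolve to the same counter
    } finally system.terminate()
  }
}
```

The manager itself still has to be created exactly once, for example at library initialization, but that is a single well-known actor rather than one per counter.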