Akka Thread Tuning - scala

I have 100 threads and need to process only 12 of them at a time, no more. After those 12 complete, the next 12 should be processed, and so on. However, only the first set of 12 threads is processed, and then the program terminates.
Here is my Logic :
class AkkaProcessing extends Actor {
def receive = {
case message: List[Any] =>
var meterName = message(0) // Contains only 12 threads; it processes them and terminates. I am unable to get the remaining threads
val sqlContext = message(1).asInstanceOf[SQLContext]
val FlagDF = message(2).asInstanceOf[DataFrame]
// All the business logic here
context.system.shutdown()
}
}
object Processing {
def main(args: Array[String]) = {
val rawBuff = new ArrayBuffer[Any]()
val actorSystem = ActorSystem("ActorSystem") // Creating ActorSystem
val actor = actorSystem.actorOf(Props[AkkaProcessing].withRouter(RoundRobinPool(200)), "my-Actor")
implicit val executionContext = actorSystem.dispatchers.lookup("akka.actor.my-dispatcher")
for (i <- 0 until meter_list.length) {
var meterName = meter_list(i) // All 100 Meters here
rawBuff.append(meterName, sqlContext, FlagDF)
actor ! rawBuff.toList
}
}
}
Any Inputs highly appreciated

I think your best option is to create two actor types: consumers (which run in parallel) and a coordinator (which takes the batch of 12 tasks and passes them to the consumers). The coordinator would wait for the consumers to finish and then run the next batch.
See this answer for a code example: Can Scala actors process multiple messages simultaneously?
Failing that, you could just use Futures in a similar manner.
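As a rough illustration of the Futures route, the sketch below runs items in batches and submits the next batch only after every Future of the current batch has completed. The names `BatchedFutures` and `processInBatches` are assumptions made for this example and are not part of the question's code; `work` stands in for the business logic.

```scala
import scala.concurrent.{ExecutionContext, Future}

object BatchedFutures {
  // Runs `work` over `items`, at most `batchSize` items at a time.
  // Batch n+1 is only submitted once all Futures of batch n have completed.
  def processInBatches[A, B](items: Seq[A], batchSize: Int)(work: A => B)(
      implicit ec: ExecutionContext): Future[Seq[B]] =
    items.grouped(batchSize).foldLeft(Future.successful(Vector.empty[B])) { (previous, batch) =>
      previous.flatMap { done =>
        // Start every item of this batch, then wait (without blocking) for all of them.
        Future.traverse(batch)(item => Future(work(item))).map(results => done ++ results)
      }
    }
}
```

With the question's data this might be called as `BatchedFutures.processInBatches(meter_list, 12)(process)`, where `process` is a hypothetical function wrapping the per-meter logic. The real parallelism inside a batch is still capped by the number of threads in the ExecutionContext that is in scope.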

Related

Is there a limit to how many Akka Streams can run at the same time?

I am trying to implement a simple one-to-many pub/sub pattern using a BroadcastHub. This fails silently for large numbers of subscribers, which makes me think I am hitting some limit on the number of streams I can run.
First, let's define some events:
sealed trait Event
case object EX extends Event
case object E1 extends Event
case object E2 extends Event
case object E3 extends Event
case object E4 extends Event
case object E5 extends Event
I have implemented the publisher using a BroadcastHub, adding a Sink.actorRefWithAck each time I want to add a new subscriber. Publishing the EX event ends the broadcast:
trait Publisher extends Actor with ActorLogging {
implicit val materializer = ActorMaterializer()
private val sourceQueue = Source.queue[Event](Publisher.bufferSize, Publisher.overflowStrategy)
private val (
queue: SourceQueueWithComplete[Event],
source: Source[Event, NotUsed]
) = {
val (q,s) = sourceQueue.toMat(BroadcastHub.sink(bufferSize = 256))(Keep.both).run()
s.runWith(Sink.ignore)
(q,s)
}
def publish(evt: Event) = {
log.debug("Publishing Event: {}", evt.getClass().toString())
queue.offer(evt)
evt match {
case EX => queue.complete()
case _ => Unit
}
}
def subscribe(actor: ActorRef, ack: ActorRef): Unit =
source.runWith(
Sink.actorRefWithAck(
actor,
onInitMessage = Publisher.StreamInit(ack),
ackMessage = Publisher.StreamAck,
onCompleteMessage = Publisher.StreamDone,
onFailureMessage = onErrorMessage))
def onErrorMessage(ex: Throwable) = Publisher.StreamFail(ex)
def publisherBehaviour: Receive = {
case Publisher.Subscribe(sub, ack) => subscribe(sub, ack.getOrElse(sender()))
case Publisher.StreamAck => Unit
}
override def receive = LoggingReceive { publisherBehaviour }
}
object Publisher {
final val bufferSize = 5
final val overflowStrategy = OverflowStrategy.backpressure
case class Subscribe(sub: ActorRef, ack: Option[ActorRef])
case object StreamAck
case class StreamInit(ack: ActorRef)
case object StreamDone
case class StreamFail(ex: Throwable)
}
Subscribers can implement the Subscriber trait to separate the logic:
trait Subscriber {
def onInit(publisher: ActorRef): Unit = ()
def onInit(publisher: ActorRef, k: KillSwitch): Unit = onInit(publisher)
def onEvent(event: Event): Unit = ()
def onDone(publisher: ActorRef, subscriber: ActorRef): Unit = ()
def onFail(e: Throwable, publisher: ActorRef, subscriber: ActorRef): Unit = ()
}
The actor logic is quite simple:
class SubscriberActor(subscriber: Subscriber) extends Actor with ActorLogging {
def subscriberBehaviour: Receive = {
case Publisher.StreamInit(ack) => {
log.debug("Stream initialized.")
subscriber.onInit(sender())
sender() ! Publisher.StreamAck
ack.forward(Publisher.StreamInit(ack))
}
case Publisher.StreamDone => {
log.debug("Stream completed.")
subscriber.onDone(sender(),self)
}
case Publisher.StreamFail(ex) => {
log.error(ex, "Stream failed!")
subscriber.onFail(ex,sender(),self)
}
case e: Event => {
log.debug("Observing Event: {}",e)
subscriber.onEvent(e)
sender() ! Publisher.StreamAck
}
}
override def receive = LoggingReceive { subscriberBehaviour }
}
One of the key points is that all subscribers must receive all messages sent by the publisher, so we have to know that all streams have materialized and all actors are ready to receive before starting the broadcast. This is why the StreamInit message is forwarded to another, user-provided actor.
To test this, I define a simple MockPublisher that just broadcasts a list of events when told to do so:
class MockPublisher(events: Event*) extends Publisher {
def receiveBehaviour: Receive = {
case MockPublish => events map publish
}
override def receive = LoggingReceive { receiveBehaviour orElse publisherBehaviour }
}
case object MockPublish
I also define a MockSubscriber who merely counts how many events it has seen:
class MockSubscriber extends Subscriber {
var count = 0
val promise = Promise[Int]()
def future = promise.future
override def onInit(publisher: ActorRef): Unit = count = 0
override def onEvent(event: Event): Unit = count += 1
override def onDone(publisher: ActorRef, subscriber: ActorRef): Unit = promise.success(count)
override def onFail(e: Throwable, publisher: ActorRef, subscriber: ActorRef): Unit = promise.failure(e)
}
And a small method for subscription:
object MockSubscriber {
def sub(publisher: ActorRef, ack: ActorRef)(implicit system: ActorSystem): Future[Int] = {
val s = new MockSubscriber()
implicit val tOut = Timeout(1.minute)
val a = system.actorOf(Props(new SubscriberActor(s)))
val f = publisher ! Publisher.Subscribe(a, Some(ack))
s.future
}
}
I put everything together in a unit test:
class SubscriberTests extends TestKit(ActorSystem("SubscriberTests")) with
WordSpecLike with Matchers with BeforeAndAfterAll with ImplicitSender {
override def beforeAll:Unit = {
system.eventStream.setLogLevel(Logging.DebugLevel)
}
override def afterAll:Unit = {
println("Shutting down...")
TestKit.shutdownActorSystem(system)
}
"The Subscriber" must {
"publish events to many observers" in {
val n = 9
val p = system.actorOf(Props(new MockPublisher(E1,E2,E3,E4,E5,EX)))
val q = scala.collection.mutable.Queue[Future[Int]]()
for (i <- 1 to n) {
q += MockSubscriber.sub(p,self)
}
for (i <- 1 to n) {
expectMsgType[Publisher.StreamInit](70.seconds)
}
p ! MockPublish
q.map { f => Await.result(f, 10.seconds) should be (6) }
}
}
}
This test succeeds for relatively small values of n, but fails for, say, val n = 90000. No caught or uncaught exception appears anywhere and neither does any out-of-memory complaint from Java (which does occur if I go even higher).
What am I missing?
Edit: Tried this on multiple computers with different specs. Debug info shows no messages reach any of the subscribers once n is high enough.
Akka Streams (like any other Reactive Streams implementation) provides backpressure. As long as you haven't made a mistake in how you create your consumers (e.g. allowing a 1 GB JSON document to be fetched into memory whole and only then chopped into smaller pieces), you should be in a comfortable situation where memory usage is effectively upper-bounded (because of how backpressure manages the push-pull mechanics). Once you have measured where that upper bound lies, you can size your JVM and container memory so that the application runs without fear of out-of-memory errors (provided nothing else in your JVM causes a memory usage spike).
So there is a constraint on how many streams you can run in parallel: you can run only as many as your memory allows. CPU should not be a hard limit (as you will have multiple threads), but if you start too many streams on one machine, the CPU will inevitably have to switch between them, making each of them slower. That might not be a technical blocker, but you might end up in a situation where processing is so slow that it no longer fulfills its business purpose (though I guess you would have to run far more than a few streams at once).
In your tests you might run into other issues as well. For example, if you reuse the actor system's thread pool for blocking operations without informing the pool that they are blocking, you might end up with a deadlock (as a matter of fact, you should run all blocking IO operations on a different thread pool than "computing" operations). Having 90000(!) concurrent things happening at the same time (probably on the same small thread pool) almost guarantees running into issues (I guess you could run into them even if you ran the code directly on futures instead of actors). Here you are using the actor system in tests, which AFAIR use blocking logic, and that only highlights the possible issues with small thread pools that mix blocking and non-blocking tasks.
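The advice about keeping blocking calls off the "computing" pool can be sketched with plain Scala Futures. This is only an illustration under assumptions: `SeparatePools`, `slowLookup` and `square` are made-up names, the pool sizes are arbitrary, and in an Akka application the second pool would more likely be a dedicated dispatcher defined in configuration.

```scala
import java.util.concurrent.{Executors, ThreadFactory}
import scala.concurrent.{ExecutionContext, Future}

object SeparatePools {
  // Daemon threads, so that this sketch never keeps the JVM alive on its own.
  private def daemonThreads(name: String): ThreadFactory = new ThreadFactory {
    def newThread(r: Runnable): Thread = {
      val t = new Thread(r, name)
      t.setDaemon(true)
      t
    }
  }

  // Small, fixed pool reserved for CPU-bound, non-blocking work.
  val computeEc: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(2, daemonThreads("compute")))

  // Separate pool that may grow, reserved for calls that block a thread.
  val blockingEc: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newCachedThreadPool(daemonThreads("blocking-io")))

  // Stand-in for a blocking call (database, file, HTTP client, ...).
  def slowLookup(id: Int): Future[Int] =
    Future { Thread.sleep(100); id }(blockingEc)

  // Stand-in for CPU-bound work; it is not delayed by the sleeping lookups.
  def square(n: Int): Future[Int] =
    Future(n * n)(computeEc)
}
```

Because the sleeping lookups occupy only `blockingEc`, the two `computeEc` threads stay free for short tasks no matter how many lookups are in flight.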

Compose two Scala futures with callbacks, WITHOUT a third ExecutionContext

I have two methods, let's call them load() and init(). Each one starts a computation in its own thread and returns a Future on its own execution context. The two computations are independent.
val loadContext = ExecutionContext.fromExecutor(...)
def load(): Future[Unit] = {
Future { ... }(loadContext)
}
val initContext = ExecutionContext.fromExecutor(...)
def init(): Future[Unit] = {
Future { ... }(initContext)
}
I want to call both of these from some third thread -- say it's from main() -- and perform some other computation when both are finished.
def onBothComplete(): Unit = ...
Now:
I don't care which completes first
I don't care what thread the other computation is performed on, except:
I don't want to block either thread waiting for the other;
I don't want to block the third (calling) thread; and
I don't want to have to start a fourth thread just to set the flag.
If I use for-comprehensions, I get something like:
val loading = load()
val initialization = init()
for {
loaded <- loading
initialized <- initialization
} yield { onBothComplete() }
and I get Cannot find an implicit ExecutionContext.
I take this to mean Scala wants a fourth thread to wait for the completion of both futures and set the flag, either an explicit new ExecutionContext or ExecutionContext.Implicits.global. So it would appear that for-comprehensions are out.
I thought I might be able to nest callbacks:
initialization.onComplete {
case Success(_) =>
loading.onComplete {
case Success(_) => onBothComplete()
case Failure(t) => log.error("Unable to load", t)
}
case Failure(t) => log.error("Unable to initialize", t)
}
Unfortunately onComplete also takes an implicit ExecutionContext, and I get the same error. (Also this is ugly, and loses the error message from loading if initialization fails.)
Is there any way to compose Scala Futures without blocking and without introducing another ExecutionContext? If not, I might have to just throw them over for Java 8 CompletableFutures or Javaslang Vavr Futures, both of which have the ability to run callbacks on the thread that did the original work.
Updated to clarify that blocking either thread waiting for the other is also not acceptable.
Updated again to be less specific about the post-completion computation.
Why not just reuse one of your own execution contexts? Not sure what your requirements for those are but if you use a single thread executor you could just reuse that one as the execution context for your comprehension and you won't get any new threads created:
implicit val loadContext = ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor)
If you really can't reuse them you may consider this as the implicit execution context:
implicit val currentThreadExecutionContext = ExecutionContext.fromExecutor(
(runnable: Runnable) => {
runnable.run()
})
This will run futures on the current thread. However, the Scala docs explicitly recommend against this, as it introduces nondeterminism in which thread runs the Future (but as you stated, you don't care which thread it runs on, so this may not matter).
See Synchronous Execution Context for why this isn't advisable.
An example with that context:
val loadContext = ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor)
def load(): Future[Unit] = {
Future(println("loading thread " + Thread.currentThread().getName))(loadContext)
}
val initContext = ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor)
def init(): Future[Unit] = {
Future(println("init thread " + Thread.currentThread().getName))(initContext)
}
val doneFlag = new AtomicBoolean(false)
val loading = load()
val initialization = init()
implicit val currentThreadExecutionContext = ExecutionContext.fromExecutor(
(runnable: Runnable) => {
runnable.run()
})
for {
loaded <- loading
initialized <- initialization
} yield {
println("yield thread " + Thread.currentThread().getName)
doneFlag.set(true)
}
prints:
loading thread pool-1-thread-1
init thread pool-2-thread-1
yield thread main
Though the yield line may print either pool-1-thread-1 or pool-2-thread-1 depending on the run.
In Scala, a Future represents a piece of work to be executed asynchronously (i.e. concurrently with other units of work). An ExecutionContext represents a pool of threads for executing Futures. In other words, the ExecutionContext is the team of workers that performs the actual work.
For efficiency and scalability, it's better to have big team(s) (e.g. a single ExecutionContext with 10 threads to execute 10 Futures) rather than small teams (e.g. 5 ExecutionContexts with 2 threads each to execute 10 Futures).
In your case if you want to limit the number of threads to 2, you can:
def load()(implicit teamOfWorkers: ExecutionContext): Future[Unit] = {
Future { ... } /* will use the teamOfWorkers implicitly */
}
def init()(implicit teamOfWorkers: ExecutionContext): Future[Unit] = {
Future { ... } /* will use the teamOfWorkers implicitly */
}
implicit val bigTeamOfWorkers = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(2))
/* All async work in the following will use
the same bigTeamOfWorkers implicitly, and the work will be shared by
the 2 workers (i.e. threads) in the team */
val doneFlag = new AtomicBoolean(false)
val loading = load()
val initialization = init()
for {
loaded <- loading
initialized <- initialization
} yield doneFlag.set(true)
The Cannot find an implicit ExecutionContext error does not mean that Scala wants additional threads. It only means that Scala wants an ExecutionContext to do the work. An additional ExecutionContext does not necessarily imply an additional thread; e.g. the following ExecutionContext, instead of creating new threads, will execute work on the current thread:
val currThreadExecutor = ExecutionContext.fromExecutor(new Executor {
override def execute(command: Runnable): Unit = command.run()
})
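To make this concrete, here is a minimal, self-contained sketch that combines the two futures with `zip` and runs the final callback on a caller-runs context, so no additional pool is created. The names `ComposeOnCompletingThread`, `CallerRuns` and `both` are assumptions for this example, and `load`/`init` here simply return the name of the thread they ran on.

```scala
import java.util.concurrent.{Executors, ThreadFactory}
import scala.concurrent.{ExecutionContext, Future}

object ComposeOnCompletingThread {
  // One named daemon thread per context, mirroring loadContext and initContext.
  private def singleThread(name: String): ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor(new ThreadFactory {
      def newThread(r: Runnable): Thread = {
        val t = new Thread(r, name)
        t.setDaemon(true)
        t
      }
    }))

  val loadContext: ExecutionContext = singleThread("load-thread")
  val initContext: ExecutionContext = singleThread("init-thread")

  // Runs each callback on whichever thread hands it over; starts no thread itself.
  object CallerRuns extends ExecutionContext {
    def execute(runnable: Runnable): Unit = runnable.run()
    def reportFailure(cause: Throwable): Unit = cause.printStackTrace()
  }

  def load(): Future[String] = Future(Thread.currentThread().getName)(loadContext)
  def init(): Future[String] = Future(Thread.currentThread().getName)(initContext)

  // Completes once both futures are done; the third element records which
  // thread ended up running the combining callback.
  def both(): Future[(String, String, String)] =
    load().zip(init()).map { case (l, i) => (l, i, Thread.currentThread().getName) }(CallerRuns)
}
```

The callback runs on whichever of the two worker threads finishes last, or on the calling thread if both futures have already completed by the time `map` is registered, which matches the nondeterminism described above.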

How can Akka streams be materialized continually?

I am using Akka Streams in Scala to poll from an AWS SQS queue using the AWS Java SDK. I created an ActorPublisher which dequeues messages on a two second interval:
class SQSSubscriber(name: String) extends ActorPublisher[Message] with ActorLogging {
implicit val materializer = ActorMaterializer()
val schedule = context.system.scheduler.schedule(0 seconds, 2 seconds, self, "dequeue")
val client = new AmazonSQSClient()
client.setRegion(RegionUtils.getRegion("us-east-1"))
val url = client.getQueueUrl(name).getQueueUrl
val MaxBufferSize = 100
var buf = Vector.empty[Message]
override def receive: Receive = {
case "dequeue" =>
val messages = iterableAsScalaIterable(client.receiveMessage(new ReceiveMessageRequest(url)).getMessages).toList
messages.foreach(self ! _)
case message: Message if buf.size == MaxBufferSize =>
log.error("The buffer is full")
case message: Message =>
if (buf.isEmpty && totalDemand > 0)
onNext(message)
else {
buf :+= message
deliverBuf()
}
case Request(_) =>
deliverBuf()
case Cancel =>
context.stop(self)
}
@tailrec final def deliverBuf(): Unit =
if (totalDemand > 0) {
if (totalDemand <= Int.MaxValue) {
val (use, keep) = buf.splitAt(totalDemand.toInt)
buf = keep
use foreach onNext
} else {
val (use, keep) = buf.splitAt(Int.MaxValue)
buf = keep
use foreach onNext
deliverBuf()
}
}
}
In my application, I am attempting to run the flow at a 2 second interval as well:
val system = ActorSystem("system")
val sqsSource = Source.actorPublisher[Message](SQSSubscriber.props("queue-name"))
val flow = Flow[Message]
.map { elem => system.log.debug(s"${elem.getBody} (${elem.getMessageId})"); elem }
.to(Sink.ignore)
system.scheduler.schedule(0 seconds, 2 seconds) {
flow.runWith(sqsSource)(ActorMaterializer()(system))
}
However, when I run my application I receive java.util.concurrent.TimeoutException: Futures timed out after [20000 milliseconds] and subsequent dead letter notices which is caused by the ActorMaterializer.
Is there a recommended approach for continually materializing an Akka Stream?
I don't think you need to create a new ActorPublisher every 2 seconds. This seems redundant and wasteful of memory. Also, I don't think an ActorPublisher is necessary. From what I can tell of the code, your implementation will have an ever growing number of Streams all querying the same data. Each Message from the client will be processed by N different akka Streams and, even worse, N will grow over time.
Iterator For Infinite Loop Querying
You can get the same behavior as your ActorPublisher by using Scala's Iterator. It is possible to create an Iterator which continuously queries the client:
//setup the client
val client = {
val sqsClient = new AmazonSQSClient()
sqsClient setRegion (RegionUtils getRegion "us-east-1")
sqsClient
}
val url = client.getQueueUrl(name).getQueueUrl
//single query
def queryClientForMessages : Iterable[Message] = iterableAsScalaIterable {
(client receiveMessage new ReceiveMessageRequest(url)).getMessages
}
def messageListIterator : Iterator[Iterable[Message]] =
Iterator continually queryClientForMessages
//messages one-at-a-time "on demand", no timer pushing you around
def messageIterator() : Iterator[Message] = messageListIterator flatMap identity
This implementation only queries the client when all previous Messages have been consumed and is therefore truly reactive. No need to keep track of a buffer with fixed size. Your solution needs a buffer because the creation of Messages (via a timer) is de-coupled from the consumption of Messages (via println). In my implementation, creation & consumption are tightly coupled via back-pressure.
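The "query only when the previous batch is consumed" behaviour can be checked without AWS or Akka. In this sketch `LazyPolling`, `queryBatch` and the `queries` counter are hypothetical stand-ins: `queryBatch` plays the role of one `receiveMessage` call and returns two fake messages per call.

```scala
object LazyPolling {
  // Counts how often the (fake) queue has been queried.
  var queries = 0

  // Hypothetical stand-in for one receiveMessage call returning a batch.
  def queryBatch(): List[String] = {
    queries += 1
    List(s"msg-$queries-a", s"msg-$queries-b")
  }

  // Same shape as messageIterator above: batches are fetched lazily and flattened.
  def messageIterator(): Iterator[String] =
    Iterator.continually(queryBatch()).flatMap(batch => batch)
}
```

Creating the iterator triggers no query at all, and pulling three messages with `LazyPolling.messageIterator().take(3).toList` triggers only the two queries needed to produce them.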
Akka Stream Source
You can then use this Iterator generator-function to feed an akka stream Source:
def messageSource : Source[Message, _] = Source fromIterator messageIterator
Flow Formation
And finally you can use this Source to perform the println (As a side note: your flow value is actually a Sink since Flow + Sink = Sink). Using your flow value from the question:
messageSource runWith flow
One akka Stream processing all messages.

Scala Akka Consumer/Producer: Return Value

Problem Statement
Assume I have a file with sentences that is processed line by line. In my case, I need to extract Named Entities (Persons, Organizations, ...) from these lines. Unfortunately, the tagger is quite slow. Therefore, I decided to parallelize the computation, so that lines can be processed independently of each other and the results are collected in a central location.
Current Approach
My current approach uses a single-producer, multiple-consumer concept. I'm relatively new to Akka, but I think my problem fits its capabilities well. Let me show you some code:
Producer
The Producer reads the file line by line and sends it to the Consumer. If it reaches the total line limit, it propagates the result back to WordCount.
class Producer(consumers: ActorRef) extends Actor with ActorLogging {
var master: Option[ActorRef] = None
var result = immutable.List[String]()
var totalLines = 0
var linesProcessed = 0
override def receive = {
case StartProcessing() => {
master = Some(sender)
Source.fromFile("sent.txt", "utf-8").getLines.foreach { line =>
consumers ! Sentence(line)
totalLines += 1
}
context.stop(self)
}
case SentenceProcessed(list) => {
linesProcessed += 1
result :::= list
//If we are done, we can propagate the result to the creator
if (linesProcessed == totalLines) {
master.map(_ ! result)
}
}
case _ => log.error("message not recognized")
}
}
Consumer
class Consumer extends Actor with ActorLogging {
def tokenize(line: String): Seq[String] = {
line.split(" ").map(_.toLowerCase)
}
override def receive = {
case Sentence(sent) => {
//Assume: This is representative for the extensive computation method
val tokens = tokenize(sent)
sender() ! SentenceProcessed(tokens.toList)
}
case _ => log.error("message not recognized")
}
}
WordCount (Master)
class WordCount extends Actor {
val consumers = context.actorOf(Props[Consumer].
withRouter(FromConfig()).
withDispatcher("consumer-dispatcher"), "consumers")
val producer = context.actorOf(Props(new Producer(consumers)), "producer")
context.watch(consumers)
context.watch(producer)
def receive = {
case Terminated(`producer`) => consumers ! Broadcast(PoisonPill)
case Terminated(`consumers`) => context.system.shutdown
}
}
object WordCount {
def getActor() = new WordCount
def getConfig(routerType: String, dispatcherType: String)(numConsumers: Int) = s"""
akka.actor.deployment {
/WordCount/consumers {
router = $routerType
nr-of-instances = $numConsumers
dispatcher = consumer-dispatcher
}
}
consumer-dispatcher {
type = $dispatcherType
executor = "fork-join-executor"
}"""
}
The WordCount actor is responsible for creating the other actors. When the Consumer is finished the Producer sends a message with all tokens. But, how to propagate the message again and also accept and wait for it? The architecture with the third WordCount actor might be wrong.
Main Routine
case class Run(name: String, actor: () => Actor, config: (Int) => String)
object Main extends App {
val run = Run("push_implementation", WordCount.getActor _, WordCount.getConfig("balancing-pool", "Dispatcher") _)
def execute(run: Run, numConsumers: Int) = {
val config = ConfigFactory.parseString(run.config(numConsumers))
val system = ActorSystem("Counting", ConfigFactory.load(config))
val startTime = System.currentTimeMillis
system.actorOf(Props(run.actor()), "WordCount")
/*
How to get the result here?!
*/
system.awaitTermination
System.currentTimeMillis - startTime
}
execute(run, 4)
}
Problem
As you see, the actual problem is to propagate the result back to the Main routine. Can you tell me how to do this in a proper way? The question is also how to wait for the result until the consumers are finished? I had a brief look into the Akka Future documentation section, but the whole system is a little bit overwhelming for beginners. Something like var future = message ? actor seems suitable. Not sure, how to do this. Also using the WordCount actor causes additional complexity. Maybe it is possible to come up with a solution that doesn't need this actor?
Consider using the Akka Aggregator Pattern. That takes care of the low-level primitives (watching actors, poison pill, etc). You can focus on managing state.
Your call to system.actorOf() returns an ActorRef, but you're not using it. You should ask that actor for results. Something like this:
implicit val timeout = Timeout(5 seconds)
val wCount = system.actorOf(Props(run.actor()), "WordCount")
val answer = Await.result(wCount ? "sent.txt", timeout.duration)
This means your WordCount class needs a receive method that accepts a String message. That section of code should aggregate the results and tell the sender(), like this:
class WordCount extends Actor {
def receive: Receive = {
case filename: String =>
// do all of your code here, using filename
sender() ! results
}
}
Also, rather than blocking on the results with Await above, you can apply some techniques for handling Futures.
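One such technique is to transform the Future instead of waiting on it. The sketch below is an assumption-laden illustration: `askForTokens` is a hypothetical stand-in for `wCount ? "sent.txt"` (an ask yields a `Future[Any]`), and `tokenCounts` is a made-up helper showing `mapTo`, `map` and `recover`.

```scala
import scala.concurrent.{ExecutionContext, Future}
import scala.util.control.NonFatal

object NonBlockingResult {
  // Hypothetical stand-in for `wCount ? "sent.txt"`, which yields a Future[Any].
  def askForTokens(filename: String)(implicit ec: ExecutionContext): Future[Any] =
    Future(List("hello", "akka", "hello"))

  // Transforms the reply without blocking: cast it, count the tokens,
  // and fall back to an empty result if anything fails.
  def tokenCounts(filename: String)(implicit ec: ExecutionContext): Future[Map[String, Int]] =
    askForTokens(filename)
      .mapTo[List[String]]
      .map(tokens => tokens.groupBy(identity).map { case (word, hits) => word -> hits.size })
      .recover { case NonFatal(_) => Map.empty[String, Int] }
}
```

A caller can then register a callback such as `NonBlockingResult.tokenCounts("sent.txt").foreach(println)` and carry on, so that `Await` is needed, if at all, only at the very edge of the program.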

How do I throttle messages in Akka (2.1.2)?

Could you please show an example of throttling messages in Akka?
Here is my code
object Program {
def main(args: Array[String]) {
val system = ActorSystem()
val actor: ActorRef = system.actorOf(Props[HelloActor].withDispatcher("akka.actor.my-thread-pool-dispatcher"))
val zzz : Function0[Unit] = () => {
println(System.currentTimeMillis())
Thread.sleep(5000)
}
var i: Int = 0
while (i < 100) {
actor ! zzz
i += 1
}
println("DONE")
// system.shutdown()
}
}
class HelloActor extends Actor {
def receive = {
case func : Function0[Unit] => func()
}
}
and here is my config
akka {
actor {
my-thread-pool-dispatcher {
type = Dispatcher
executor = "thread-pool-executor"
thread-pool-executor {
task-queue-type = "array"
task-queue-size = 4
}
}
}
}
But when I run it, it appears to be single-threaded, whereas I expect 4 messages to be processed at the same time.
What am I missing here ?
I don't see the connection between the question's title and the content.
Here is an article about throttling messages in Akka:
http://letitcrash.com/post/28901663062/throttling-messages-in-akka-2
However, you seem puzzled about the fact that your actor is processing only one message at a time. But that's how Akka actors work. They have a single mailbox of messages and they process only one message at a time in a continuous loop.
If you want to handle multiple tasks concurrently with the same work processing unit I suggest you take a look at routers:
http://doc.akka.io/docs/akka/2.1.2/scala/routing.html
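The difference can be illustrated with plain JVM executors. Note that this is only an analogy and not Akka code: a single worker thread stands in for one actor draining its mailbox, and a pool of four workers stands in for a router with four routees. `OneAtATimeVsPool` and `maxConcurrency` are names invented for this sketch.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

object OneAtATimeVsPool {
  // Submits `tasks` sleeping jobs to a pool of `workers` threads and returns
  // the highest number of jobs that were observed running at the same time.
  def maxConcurrency(workers: Int, tasks: Int): Int = {
    val pool = Executors.newFixedThreadPool(workers)
    val running = new AtomicInteger(0)
    val peak = new AtomicInteger(0)
    (1 to tasks).foreach { _ =>
      pool.execute(new Runnable {
        def run(): Unit = {
          val now = running.incrementAndGet()
          // Record the new peak if this job raised the number of running jobs.
          peak.synchronized { if (now > peak.get) peak.set(now) }
          Thread.sleep(50) // stand-in for the work done per message
          running.decrementAndGet()
        }
      })
    }
    pool.shutdown()
    pool.awaitTermination(30, TimeUnit.SECONDS)
    peak.get
  }
}
```

With one worker the jobs can only run one after another, however many are queued, which is what the question observes with a single actor. Several jobs overlap only once there are several workers.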
Typesafe has recently announced akka reactive streams. Throttling can be achieved using its backpressure capability.
http://java.dzone.com/articles/reactive-queue-akka-reactive