akka timeout when using spray client for multiple request - scala

Using spray 1.3.2 with akka 2.3.6. (akka is used only for spray).
I need to read huge files and for each line make a http request.
I read the files line by line with iterator, and for each item make the request.
It runs successfully for some of the lines, but at some point it starts to fail with:
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://default/user/IO-HTTP#-35162984]] after [60000 ms].
I first thought I was overloading the service, so I set "spray.can.host-connector.max-connections" to 1. It ran much slower, but I got the same errors.
Here is the code:
import spray.http.MediaTypes._

val EdnType = register(
  MediaType.custom(
    mainType = "application",
    subType = "edn",
    compressible = true,
    binary = false,
    fileExtensions = Seq("edn")))

val pipeline = (
  addHeader("Accept", "application/json")
    ~> sendReceive
    ~> unmarshal[PipelineResponse])

def postData(data: String) = {
  val request = Post(pipelineUrl).withEntity(HttpEntity.apply(EdnType, data))
  val responseFuture: Future[PipelineResponse] = pipeline(request)
  responseFuture
}

dataLines.map { d =>
  val f = postData(d)
  f.onFailure { case e => println("Error - " + e) } // This is where the errors are displayed
  f.map { p => someMoreLogic(d, p) }
}

aggrigateResults(dataLines)
I do it this way since I don't need the entire data, just some aggregations.
How can I solve this and keep it entirely async?

Akka ask timeout is implemented via firstCompletedOf, so the timer starts when the ask is initialized.
What you seem to be doing is spawning a Future for each line (during the map), so all your calls execute at nearly the same time. The timeouts start counting when the futures are initialized, but there are no executor threads left for all the spawned actors to do their work. Hence the asks time out.
Instead of processing "all at once", I would suggest a more flexible approach - somewhat similar to using iteratees, or akka-streams: Work Pulling Pattern. (Github)
You provide the iterator that you already have as an Epic. Introduce a Worker actor which will perform the call and some logic. If you spawn N workers, then there will be at most N lines being processed concurrently (and the processing pipeline may involve multiple steps). This way you can ensure that you are not overloading the executors, and the timeouts shouldn't happen.
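For illustration, here is a rough sketch of that idea with plain Akka actors (the message names and the WorkMaster/LineWorker actors are mine, not taken from the linked pattern; postData is the function from the question, assumed to return a Future):

import akka.actor.{Actor, ActorRef, ActorSystem, Props}
import scala.concurrent.Future

// Simple pull protocol (illustrative names)
case object GimmeWork
case class Work(line: String)
case object NoMoreWork

// Holds the line iterator and hands out one line per request.
class WorkMaster(lines: Iterator[String]) extends Actor {
  def receive = {
    case GimmeWork =>
      if (lines.hasNext) sender() ! Work(lines.next())
      else sender() ! NoMoreWork
  }
}

// Runs the HTTP call and only asks for the next line once the previous
// one has completed, so at most one request per worker is in flight.
class LineWorker(master: ActorRef, postData: String => Future[_]) extends Actor {
  import context.dispatcher
  override def preStart(): Unit = master ! GimmeWork
  def receive = {
    case Work(line) =>
      postData(line).onComplete { _ => master ! GimmeWork }
    case NoMoreWork =>
      context.stop(self)
  }
}

// Wiring: N workers means at most N lines being processed concurrently.
// val system = ActorSystem()
// val master = system.actorOf(Props(new WorkMaster(dataLines)))
// (1 to 4).foreach(_ => system.actorOf(Props(new LineWorker(master, postData))))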

Related

How to specify the exact execution order of asynchronous calls in Scala unit tests?

I have written many different unit tests for futures in Scala.
All asynchronous calls use an execution context.
To make sure that the asynchronous calls are always executed in the same order, I need to delay some tasks, which is rather difficult and slows the tests down.
The executor might still (depending on its implementation) complete some tasks before others.
What is the best way to test concurrent code with a specific execution order? For example, I have the following test case:
"firstSucc" should "complete the final future with the first one" in {
  val executor = getExecutor
  val util = getUtil
  val f0 = util.async(executor, () => 10)
  f0.sync
  val f1 = util.async(executor, () => { delay(); 11 })
  val f = f0.firstSucc(f1)
  f.get should be(10)
}
where delay is def delay() = Thread.sleep(4000) and sync synchronizes the future (calls Await.ready(future, Duration.Inf)).
That's how I want to make sure that f0 is already completed and f1 completes AFTER f0. It is not enough that f0 is completed since firstSucc could be shuffling the futures. Therefore, f1 should be delayed until after the check of f.get.
Another idea is to create futures from promises and complete them at a certain point in time:
"firstSucc" should "complete the final future with the first one" in {
  val executor = getExecutor
  val util = getUtil
  val f0 = util.async(executor, () => 10)
  val p = getPromise
  val f1 = p.future
  val f = f0.firstSucc(f1)
  f.get should be(10)
  p.trySuccess(11)
}
Is there any easier/better approach to define the execution order? Maybe another execution service where one can configure the order of submitted tasks?
For this specific case it might be enough to delay the second future until after the result has been checked, but in some cases ALL futures have to be completed, just in a certain order.
The complete code can be found here: https://github.com/tdauth/scala-futures-promises
The test case is part of this class: https://github.com/tdauth/scala-futures-promises/blob/master/src/test/scala/tdauth/futuresandpromises/AbstractFutureTest.scala
This question might be related since Scala can use Java Executor Services: Controlling Task execution order with ExecutorService
For most simple cases, I'd say a single-threaded executor should be enough: if you start your futures one by one, they'll be executed serially and complete in the same order.
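For instance, a minimal sketch of the single-threaded variant (plain scala.concurrent futures; the names here are illustrative):

import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// Every Future body runs on the same single thread, so the bodies are
// executed, and therefore completed, strictly in submission order.
implicit val singleThreadEc: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor())

val f0 = Future { 10 } // runs first
val f1 = Future { 11 } // starts only after f0's body has finished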
But it looks like your problem is actually more complex than what you are describing: you are not only looking for a way to ensure one future completes later than the other, but in general for a way to make a sequence of arbitrary events happen in a particular order. For example, the snippet in your question verifies that the second future starts after the first one completes (I have no idea what the delay is for in that case, by the way).
You can use eventually to wait for a particular event to occur before continuing:
val f = Future(doSomething)

eventually {
  someFlag shouldBe true
}

val f1 = Future(doSomethingElse)

eventually {
  f.isCompleted shouldBe true
}

someFlag = false

eventually {
  someFlag shouldBe true
}

f1.futureValue shouldBe false

What's a good way in Akka to wait for a group of Actors to respond?

In Akka, I want to send out a "status" message to actors in a cluster to ask for their status. These actors may be in various states of health, including dead/unable to respond.
I want to wait up to some amount of time, say 10 seconds, then proceed with whatever results I happened to receive back within that time limit. I don't want to fail the whole thing because 1 or 2 were having issues and didn't respond / timed out at 10 seconds.
I've tried this:
object GetStats {
  def unapply(n: ActorRef)(implicit system: ActorSystem): Option[Future[Any]] = Try {
    implicit val t: Timeout = Timeout(10 seconds)
    n ? "A"
  }.toOption
}
...
val z = List(a, b, c, d) // where a-d are ActorRefs to nodes I want to status
val q = z.collect {
  case GetStats(s) => s
}
// OK, so here 'q' is a List[Future[Any]]
val allInverted = Future.sequence(q) // now we have Future[List[Any]]
val ok = Await.result(allInverted, 10 seconds).asInstanceOf[List[String]]
println(ok)
The problem with this code is that it seems to throw a TimeoutException if 1 or more don't respond. Then I can't get to the responses that did come back.
Assuming you really need to collect at least partial statistics every 10 seconds, the solution is to convert "not responding" into an actual failure.
To achieve this, just make the Await timeout a bit larger than the implicit val t: Timeout used for the ask. After that, the futures themselves (returned from ?) will fail earlier, so you can recover them:
// Works only when AwaitTimeout > AskTimeout
val qfiltered = q.map(_.map(Some(_)).recover { case _ => None }) // it's better to match TimeoutException here instead of `_`
val allInverted = Future.sequence(qfiltered).map(_.flatten)
Example:
scala> class MyActor extends Actor{ def receive = {case 1 => sender ! 2; case _ =>}}
defined class MyActor
scala> val a = sys.actorOf(Props[MyActor])
a: akka.actor.ActorRef = Actor[akka://1/user/$c#1361310022]
scala> implicit val t: Timeout = Timeout(1 seconds)
t: akka.util.Timeout = Timeout(1 second)
scala> val l = List(a ? 1, a ? 100500).map(_.map(Some(_)).recover{case _ => None})
l: List[scala.concurrent.Future[Option[Any]]] = List(scala.concurrent.impl.Promise$DefaultPromise#7faaa183, scala.concurrent.impl.Promise$DefaultPromise#1b51e0f0)
scala> Await.result(Future.sequence(l).map(_.flatten), 3 seconds)
warning: there were 1 feature warning(s); re-run with -feature for details
res29: List[Any] = List(2)
If you want to know which Future didn't respond - remove flatten.
Receiving a partial response should be enough for continuously collecting statistics: if some worker actor didn't respond in time, it will respond next time with actual data and without any data lost. But you should correctly manage the worker's lifecycle and not lose (if it matters) any data inside the actor itself.
If the reason for the timeouts is just high pressure on the system, you may consider:
a separate pool for the workers
backpressure
caching of input requests (when the system is overloaded).
If the reason for such timeouts is some remote storage, then a partial response is the correct way to handle it, provided the client is ready for that. A web UI, for example, may warn the user that the shown data may not be complete using some spinner. But don't forget not to block actors with storage requests (futures may help), or at least move them to a separate thread pool.
If a worker actor didn't respond because of a failure (like an exception), you can still send a notification to the sender from your preRestart, so you can also receive the reason why there are no statistics from that worker. The only thing here: you should check that the sender is available (it may not be).
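For example, a rough sketch of that notification (assuming the failing message is the one currently being processed; Status.Failure makes the pending ask fail with the given reason):

import akka.actor.Status

// Sketch only: tell whoever asked us why there will be no statistics,
// before the actor restarts.
override def preRestart(reason: Throwable, message: Option[Any]): Unit = {
  message.foreach { _ =>
    // sender() is the sender of the message that caused the failure;
    // check it is available before replying
    if (sender() != context.system.deadLetters)
      sender() ! Status.Failure(reason)
  }
  super.preRestart(reason, message)
}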
P.S. I hope you don't do Await.result inside some actor - blocking an actor is not recommended, if only for your application's performance. In some cases it may even cause deadlocks or memory leaks. So awaits should be placed somewhere in the facade of your system (if the underlying framework doesn't support futures).
So it may make sense to process your answers asynchronously (you will still need to recover them from failure if some actor doesn't respond):
// actor:
val parent = sender
for (list <- Future.sequence(qfiltered)) {
  parent ! process(list)
}

// in the sender (outside of the actors):
Await.result(actor ? Get, 10 seconds)

Why does Akka application fail with Out-of-Memory Error while executing NLP task?

I notice that my program has a severe memory leak (the memory consumption spirals up). I had to parallelize this NLP task (using StanfordNLP EnglishPCFG Parser and Tregex Matcher). So I built a pipeline of actors (only 6 actors for each task):
val listOfTregexActors = (0 to 5).map(m => system.actorOf(Props(new TregexActor(timer, filePrinter)), "TregexActor" + m))
val listOfParsers = (0 to 5).map(n => system.actorOf(Props(new ParserActor(timer, listOfTregexActors(n), lp)), "ParserActor" + n))
val listOfSentenceSplitters = (0 to 5).map(j => system.actorOf(Props(new SentenceSplitterActor(listOfParsers(j), timer)), "SplitActor" + j))
My actors are pretty standard. They need to stay alive to process all the information (there's no PoisonPill along the way). The memory consumption goes up and up, and I don't have a single clue what's wrong. If I run single-threaded, the memory consumption is just fine. I read somewhere that if actors don't die, nothing inside them will be released. Should I manually release things?
There are two heavy-lifting actors:
https://github.com/windweller/parallelAkka/blob/master/src/main/scala/blogParallel/ParserActor.scala
https://github.com/windweller/parallelAkka/blob/master/src/main/scala/blogParallel/TregexActor.scala
I wonder if it could be Scala's closure or other mechanism that retains too much information, and GC can't collect it somehow.
Here's part of TregexActor:
def receive = {
  case Match(rows, sen) =>
    println("Entering Pattern matching: " + rows(0))
    val result = patternSearching(sen)
    filePrinter ! Print(rows :+ sen.toString, result)
}

def patternSearching(tree: Tree): List[Array[Int]] = {
  val statsFuture = search(patternFuture, tree)
  val statsPast = search(patternsPast, tree)
  List(statsFuture, statsPast)
}

def search(patterns: List[String], tree: Tree) = {
  val stats = Array.fill[Int](patterns.size)(0)
  for (i <- 0 to patterns.size - 1) {
    val searchPattern = TregexPattern.compile(patterns(i))
    val matcher = searchPattern.matcher(tree)
    if (matcher.find()) {
      stats(i) = stats(i) + 1
    }
    timer ! PatternAddOne
  }
  stats
}
Or, if my code checks out, could it be a memory leak in the StanfordNLP parser or the Tregex matcher? Is there a strategy to manually release memory, or do I need to kill those actors after a while and assign their mailbox tasks to a new actor to release memory? (If so, how?)
After some struggling with profiling tools, I was finally able to use VisualVM with IntelliJ. Here are the snapshots: one of the monitor view (GC never ran), the other of the heap dump.
Summary of pipeline:
Raw Files -> SentenceSplit Actors (6) -> Parser Actors (6) -> Tregex Actors (6) -> File Output Actors (done)
Patterns are defined in Entry.scala file: https://github.com/windweller/parallelAkka/blob/master/src/main/scala/blogParallel/Entry.scala
This may not be the correct answer, but I don't have enough space to write it in a comment.
Try moving the actor creation inside a companion object.
val listOfTregexActors = (0 to 5).map(m => system.actorOf(Props(new TregexActor(timer, filePrinter)), "TregexActor" + m))
val listOfParsers = (0 to 5).map(n => system.actorOf(Props(new ParserActor(timer, listOfTregexActors(n), lp)), "ParserActor" + n))
val listOfSentenceSplitters = (0 to 5).map(j => system.actorOf(Props(new SentenceSplitterActor(listOfParsers(j), timer)), "SplitActor" + j))
OR don't use new to create your actors.
I suspect that when you create the actors you are closing over your App, which is preventing the GC from collecting any garbage.
You can easily verify whether this is the issue by looking at the heap in VisualVM once you have made the change.
Also, how long does it take for you to run out of memory, and what is the max heap memory you are giving your JVM?
EDIT
See - Creating Actors with Props here
A few other things to consider:
Make sure your actors are not dying and being restarted automatically.
Create your NLP objects outside your actors and pass them in when you create the actors.
Use an Akka router instead of your hashing logic to distribute work between the actors (see the sketch below).
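As an illustration of the last two points, a sketch only (the router setup and the tregexRouter name are mine; the constructor arguments mirror the question's code, and Props(classOf[...], ...) avoids closing over the enclosing scope):

import akka.actor.Props
import akka.routing.RoundRobinPool

// lp (the parser) is created once, outside the actors, and passed in.
// One router in front of 6 ParserActor instances replaces the hand-rolled
// indexing over an IndexedSeq of refs; tregexRouter would be built the same way.
val parserRouter = system.actorOf(
  RoundRobinPool(6).props(Props(classOf[ParserActor], timer, tregexRouter, lp)),
  "parser-router")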
I see these pieces of code in your actors:
val cleanedSentences = new java.util.ArrayList[java.util.List[HasWord]]()
Are these objects ever freed? If not, that might explain why memory goes up, especially taking into account that you return these newly created objects (e.g. in the cleanSentence method).
UPDATE: instead of creating a new object, you might try modifying the object you have received and then flagging its availability as the response (instead of sending the new object back), though from the point of view of thread-safety it might also be funky. Another option might be to use external storage (e.g. a database or a Redis key-value store), put the resulting sentences there, and send "sentence cleaned" as the reply, so that the client can then access the sentence from that key-value store or DB.
In general, using mutable objects (like java.util.List) which might potentially be leaked in actors is not a good idea, so it might be worth redesigning your application to use immutable objects whenever possible.
E.g. your cleanSentence method would then look like:
def cleanSentence(sentences: List[HasWord]): List[HasWord] = {
  import TwitterRegex._
  sentences.filter { ref =>
    val word = ref.word() // do not call the same function several times
    !word.contains("#") &&
    !word.contains("@") &&
    !word.matches(searchPattern.toString())
  }
}
You can convert your java.util.List to a Scala list (before sending it to the actor) in the following manner:
import scala.collection.JavaConverters._

val javaList: java.util.List[HasWord] = ...
javaList.asScala

Parallel file processing in Scala

Suppose I need to process files in a given folder in parallel. In Java I would create a FolderReader thread to read file names from the folder and a pool of FileProcessor threads. FolderReader reads file names and submits the file processing function (Runnable) to the pool executor.
In Scala I see two options:
create a pool of FileProcessor actors and schedule a file processing function with Actors.Scheduler.
create an actor for each file name while reading the file names.
Does it make sense? What is the best option?
Depending on what you're doing, it may be as simple as
for (file <- files.par) {
  // process the file
}
I suggest with all my energy that you keep as far away from threads as you can. Luckily we have better abstractions which take care of what's happening underneath, and in your case it appears to me that you do not need to use actors (though you can); you can use a simpler abstraction called Futures. They are part of the Akka open source library, and I think in the future they will be part of the Scala standard library as well.
A Future[T] is simply something that will return a T in the future.
All you need to run a future is an implicit ExecutionContext, which you can derive from a Java executor service. Then you will be able to enjoy the elegant API and the fact that a future is a monad, i.e. you can transform collections into collections of futures, collect the results, and so on. I suggest you take a look at http://doc.akka.io/docs/akka/2.0.1/scala/futures.html
import java.util.concurrent.Executors
import akka.dispatch.{Await, ExecutionContext, Future}
import akka.util.duration._

object TestingFutures {
  implicit val executorService = Executors.newFixedThreadPool(20)
  implicit val executorContext = ExecutionContext.fromExecutorService(executorService)

  def testFutures(myList: List[String]): List[String] = {
    val listOfFutures: Future[List[String]] = Future.traverse(myList) { aString =>
      Future {
        aString.reverse
      }
    }
    val result: List[String] = Await.result(listOfFutures, 1 minute)
    result
  }
}
There's a lot going on here:
I am using Future.traverse, which takes as its first parameter an M[T] <: Traversable[T] and as its second parameter a T => Future[T] (a Function1[T, Future[T]], if you prefer), and returns Future[M[T]].
I am using the Future.apply method to create an anonymous class of type Future[T]
There are many other reasons to look at Akka futures.
Futures can be mapped because they are monads, i.e. you can chain Future executions:
Future { 3 }.map { _ * 2 }.map { _.toString }
Futures have callbacks: future.onComplete, onSuccess, onFailure, andThen, etc.
Futures support not only traverse, but also for comprehensions (see the sketch below).
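For instance, a tiny sketch of the last two points (assuming an implicit ExecutionContext such as the executorContext above is in scope; the names are illustrative):

val fa = Future { 21 }
val fb = Future { 2 }

// The for comprehension desugars to flatMap/map on the futures
val product = for {
  a <- fa
  b <- fb
} yield a * b

product onSuccess { case v => println("product = " + v) } // prints 42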
Ideally you should use two actors. One for reading the list of files, and one for actually reading the file.
You start the process by simply sending a single "start" message to the first actor. The actor can then read the list of files, and send a message to the second actor. The second actor then reads the file and processes the contents.
Having multiple actors, which might seem complicated, is actually a good thing in the sense that you have a bunch of objects communicating with each other, like in a theoretical OO system.
Edit: you REALLY shouldn't be doing concurrent reading of a single file.
I was going to write up exactly what @Edmondo1984 did, except he beat me to it. :) I second his suggestion in a big way. I'll also suggest that you read the documentation for Akka 2.0.2. As well, I'll give you a slightly more concrete example:
import akka.dispatch.{ExecutionContext, Future, Await}
import akka.util.duration._
import java.util.concurrent.Executors
import java.io.File

val execService = Executors.newCachedThreadPool()
implicit val execContext = ExecutionContext.fromExecutorService(execService)

val tmp = new File("/tmp/")
val files = tmp.listFiles()

val workers = files.map { f =>
  Future {
    f.getAbsolutePath()
  }
}.toSeq

val result = Future.sequence(workers)

result.onSuccess {
  case filenames =>
    filenames.foreach { fn =>
      println(fn)
    }
}

// Artificial just to make things work for the example
Thread.sleep(100)
execContext.shutdown()
Here I use sequence instead of traverse, but the difference is going to depend on your needs.
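In other words, for this example the two are roughly interchangeable (a sketch, same imports and execContext as above):

// Build the futures yourself and sequence them...
val viaSequence = Future.sequence(files.toSeq.map(f => Future(f.getAbsolutePath)))
// ...or let traverse build them for you from the plain values.
val viaTraverse = Future.traverse(files.toSeq)(f => Future(f.getAbsolutePath))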
Go with the Future, my friend; the Actor is just a more painful approach in this instance.
But if we use actors, what's wrong with that?
Say we have to read/write some property files. Here is my Java example, but still with Akka Actors.
Let's say we have an actor ActorFile that represents one file. Hm... probably it cannot represent just one file, right? (It would be nice if it could.) So then it represents several files, like a PropertyFilesActor:
Why not use something like this:
public class PropertyFilesActor extends UntypedActor {
    Map<String, String> filesContent = new LinkedHashMap<String, String>();

    { // here we should use real files of course
        filesContent.put("file1.xml", "");
        filesContent.put("file2.xml", "");
    }

    @Override
    public void onReceive(Object message) throws Exception {
        if (message instanceof WriteMessage) {
            WriteMessage writeMessage = (WriteMessage) message;
            String content = filesContent.get(writeMessage.fileName);
            String newContent = content + writeMessage.stringToWrite;
            filesContent.put(writeMessage.fileName, newContent);
        }
        else if (message instanceof ReadMessage) {
            ReadMessage readMessage = (ReadMessage) message;
            String currentContent = filesContent.get(readMessage.fileName);
            // Send the current content back to the sender
            getSender().tell(new ReadMessage(readMessage.fileName, currentContent), getSelf());
        }
        else unhandled(message);
    }
}
...a message will go with a parameter (fileName).
It has its own inbox, accepting messages like:
WriteLine(fileName, string)
ReadLine(fileName, string)
Those messages will be stored in the inbox in order, one after another. The actor does its work by receiving messages from the box - storing/reading - and meanwhile sends feedback (sender ! message) back.
Thus, let's say we write to the property file and then show the content on a web page. We can start showing the page (right after we send the message to store the data to the file) and, as soon as we receive the feedback, update part of the page with the data from the just-updated file (via AJAX).
Well, grab your files and stick them in a parallel structure
scala> new java.io.File("/tmp").listFiles.par
res0: scala.collection.parallel.mutable.ParArray[java.io.File] = ParArray( ... )
Then...
scala> res0 map (_.length)
res1: scala.collection.parallel.mutable.ParArray[Long] = ParArray(4943, 1960, 4208, 103266, 363 ... )

Processing concurrently in Scala

As in my own answer to my own question, I have a situation whereby I am processing a large number of events which arrive on a queue. Each event is handled in exactly the same manner, and each event can be handled independently of all other events.
My program takes advantage of the Scala concurrency framework, and many of the processes involved are modelled as Actors. As Actors process their messages sequentially, they are not well-suited to this particular problem (even though my other actors are performing actions which are sequential). As I want Scala to "control" all thread creation (which I assume is the point of it having a concurrency system in the first place), it seems I have 2 choices:
Send the events to a pool of event processors, which I control
get my Actor to process them concurrently by some other mechanism
I would have thought that #1 negates the point of using the actor subsystem: "how many processor actors should I create?" being one obvious question. These things are supposedly hidden from me and solved by the subsystem.
My answer was to do the following:
val eventProcessor = actor {
  loop {
    react {
      case MyEvent(x) =>
        // I want to be able to handle multiple events at the same time
        // create a new actor to handle it
        actor {
          // processing code here
          process(x)
        }
    }
  }
}
Is there a better approach? Is this incorrect?
edit: A possibly better approach is:
val eventProcessor = actor {
  loop {
    react {
      case MyEvent(x) =>
        // Pass processing to the underlying ForkJoin framework
        Scheduler.execute(process(x))
    }
  }
}
This seems like a duplicate of another question. So I'll duplicate my answer
Actors process one message at a time. The classic pattern to process multiple messages is to have one coordinator actor front for a pool of consumer actors. If you use react then the consumer pool can be large but will still only use a small number of JVM threads. Here's an example where I create a pool of 10 consumers and one coordinator to front for them.
import scala.actors.Actor
import scala.actors.Actor._

case class Request(sender: Actor, payload: String)
case class Ready(sender: Actor)
case class Result(result: String)
case object Stop

def consumer(n: Int) = actor {
  loop {
    react {
      case Ready(sender) =>
        sender ! Ready(self)
      case Request(sender, payload) =>
        println("request to consumer " + n + " with " + payload)
        // some silly computation so the process takes a while
        val result = ((payload + payload + payload) map { case '0' => 'X'; case '1' => "-"; case c => c }).mkString
        sender ! Result(result)
        println("consumer " + n + " is done processing " + result)
      case Stop => exit
    }
  }
}

// a pool of 10 consumers
val consumers = for (n <- 0 to 10) yield consumer(n)

val coordinator = actor {
  loop {
    react {
      case msg @ Request(sender, payload) =>
        consumers foreach { _ ! Ready(self) }
        react {
          // send the request to the first available consumer
          case Ready(consumer) => consumer ! msg
        }
      case Stop =>
        consumers foreach { _ ! Stop }
        exit
    }
  }
}

// a little test loop - note that it's not doing anything with the results or telling the coordinator to stop
for (i <- 0 to 1000) coordinator ! Request(self, i.toString)
This code tests to see which consumer is available and sends a request to that consumer. Alternatives are to just randomly assign to consumers or to use a round robin scheduler.
Depending on what you are doing, you might be better served with Scala's Futures. For instance, if you don't really need actors then all of the above machinery could be written as
import scala.actors.Futures._

def transform(payload: String) = {
  val result = ((payload + payload + payload) map { case '0' => 'X'; case '1' => "-"; case c => c }).mkString
  println("transformed " + payload + " to " + result)
  result
}

val results = for (i <- 0 to 1000) yield future(transform(i.toString))
If the events can all be handled independently, why are they on a queue? Knowing nothing else about your design, this seems like an unnecessary step. If you could compose the process function with whatever is firing those events, you could potentially obviate the queue.
An actor essentially is a concurrent effect equipped with a queue. If you want to process multiple messages simultaneously, you don't really want an actor. You just want a function (Any => ()) to be scheduled for execution at some convenient time.
Having said that, your approach is reasonable if you want to stay within the actors library and if the event queue is not within your control.
Scalaz makes a distinction between Actors and concurrent Effects. While its Actor is very light-weight, scalaz.concurrent.Effect is lighter still. Here's your code roughly translated to the Scalaz library:
val eventProcessor = effect(x => process(x))
This is with the latest trunk head, not yet released.
This sounds like a simple consumer/producer problem. I'd use a queue with a pool of consumers. You could probably write this with a few lines of code using java.util.concurrent.
The purpose of an actor (well, one of them) is to ensure that the state within the actor can only be accessed by a single thread at a time. If the processing of a message doesn't depend on any mutable state within the actor, then it would probably be more appropriate to just submit a task to a scheduler or a thread pool to process. The extra abstraction that the actor provides is actually getting in your way.
There are convenient methods in scala.actors.Scheduler for this, or you could use an Executor from java.util.concurrent.
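A bare-bones sketch of that idea (the pool size and the handle helper are mine; MyEvent and process are from the question):

import java.util.concurrent.Executors

val pool = Executors.newFixedThreadPool(4)

// No actor at all: each event is handed straight to the thread pool.
def handle(e: MyEvent): Unit = e match {
  case MyEvent(x) =>
    pool.execute(new Runnable {
      def run(): Unit = process(x)
    })
}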
Actors are much more lightweight than threads, and as such one other option is to use actor objects like the Runnable objects you are used to submitting to a thread pool. The main difference is that you do not need to worry about the thread pool - it is managed for you by the actor framework and is mostly a configuration concern.
def submit(e: MyEvent) = actor {
  // no loop - the actor exits immediately after processing the first message
  react {
    case MyEvent(x) =>
      process(x)
  }
} ! e // immediately send the new actor a message
Then to submit a message, say this:
submit(new MyEvent(x))
, which corresponds to
eventProcessor ! new MyEvent(x)
from your question.
Tested this pattern successfully with 1 million messages sent and received in about 10 seconds on a quad-core i7 laptop.
Hope this helps.