How to page REST calls in a future with Dispatch and Scala

I use Scala and Dispatch to get JSON from a paged REST API. I use futures with Dispatch because I want to run the calls to fetchIssuesByFile in parallel: each call can trigger many REST requests per lookupId (one findComponentKeyByRest, then n fetchIssuePage calls, where n is the number of pages reported by the REST API).
Here's my code so far:
def fetchIssuePage(componentKey: String, pageIndex: Int): Future[json4s.JValue]
def extractPagingInfo(json: org.json4s.JValue): PagingInfo
def extractIssues(json: org.json4s.JValue): Seq[Issue]
def findAllIssuesByRest(componentKey: String): Future[Seq[Issue]] = {
Future {
var paging = PagingInfo(pageIndex = 0, pages = 0)
val allIssues = mutable.ListBuffer[Issue]()
do {
fetchIssuePage(componentKey, paging.pageIndex) onComplete {
case Success(json) =>
allIssues ++= extractIssues(json)
paging = extractPagingInfo(json)
case _ => //TODO error handling
}
} while (paging.pageIndex < paging.pages)
allIssues // (1)
}
}
def findComponentKeyByRest(lookupId: String): Future[Option[String]]
def fetchIssuesByFile(lookupId: String): Future[(String, Option[Seq[Issue]])] =
findComponentKeyByRest(lookupId).flatMap {
case Some(key) => findAllIssuesByRest(key).map(
issues => // (2) process issues
)
case None => //TODO error handling
}
The actual problem is that I never get the collected issues from findAllIssuesByRest (1) (i.e., the sequence of issues is always empty) when I try to process them at (2). Any ideas? Also, the pagination code isn't very functional, so I'm also open to ideas on how to improve this with Scala.
Any help is much appreciated.
Thanks,
Michael

I think you could do something like:
def findAllIssuesByRest(componentKey: String): Future[Seq[Issue]] =
// fetch first page
fetchIssuePage(componentKey, 0).flatMap { json =>
val nbPages = extractPagingInfo(json).pages // get the total number of pages
val firstIssues = extractIssues(json) // the first issues
// get the other pages
Future.sequence((1 to nbPages).map(page => fetchIssuePage(componentKey, page)))
// get the issues from the other pages
.map(pages => pages.map(extractIssues))
// combine first issues with other issues
.map(issues => (firstIssues +: issues).flatten)
}
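
To make the pattern above concrete, here is a minimal, self-contained sketch that runs without Dispatch. The Page case class, the in-memory fake backend, and the zero-based page indices are all assumptions made for illustration; they are not part of the original API.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object PagedFetch {
  // Hypothetical stand-in for one page of the REST response.
  final case class Page(pageIndex: Int, pages: Int, issues: Seq[String])

  // Fake backend (assumption): 7 issues, 3 per page, zero-based page indices.
  private val allIssues = (1 to 7).map(i => s"issue-$i")
  private val pageSize = 3
  private val pageCount = (allIssues.size + pageSize - 1) / pageSize

  def fetchIssuePage(pageIndex: Int): Future[Page] = Future {
    Page(pageIndex, pageCount, allIssues.slice(pageIndex * pageSize, (pageIndex + 1) * pageSize))
  }

  def findAllIssues(): Future[Seq[String]] =
    // The first page tells us how many pages exist.
    fetchIssuePage(0).flatMap { first =>
      // Start all remaining requests at once, then combine them into one future.
      val rest = (1 until first.pages).map(fetchIssuePage)
      Future.sequence(rest).map(others => first.issues ++ others.flatMap(_.issues))
    }

  // Blocking only at the very edge, for demonstration purposes.
  def demo(): Seq[String] = Await.result(findAllIssues(), 5.seconds)
}
```

Future.sequence keeps the order of its input, so the issues come back in page order even though the requests run concurrently.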

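That's because fetchIssuePage returns a Future and the loop never waits for its result: onComplete only registers a callback and returns immediately. The loop condition is typically checked while paging is still PagingInfo(0, 0), so the loop exits after one iteration and allIssues is returned before any callback has added to it.
The solution is to build up a Seq of Futures from the fetchIssuePage calls, then use Future.sequence to turn that Seq into a single Future and return it instead. That future completes once all of the page requests have completed, so the results are ready for your flatMap code.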
Update: Although Michael understood the above well (see comments), I thought I'd put in a much simplified code for the benefit of other readers, just to illustrate the point:
def fetch(n: Int): Future[Int] = Future { n+1 }
val fetches = Seq(1, 2, 3).map(fetch)
// a simplified version of your while loop, just to illustrate the point
Future.sequence(fetches).flatMap(results => ...)
// where results will be Seq[Int] - i.e., 2, 3, 4

Related

Trying to understand Scala enumerator/iteratees

I am new to Scala and Play!, but have a reasonable amount of experience of building webapps with Django and Python and of programming in general.
I've been doing an exercise of my own to try to improve my understanding: simply pull some records from a database and output them as a JSON array. I'm trying to use the Enumerator/Iteratee functionality to do this.
My code follows:
TestObjectController.scala:
def index = Action {
db.withConnection { conn=>
val stmt = conn.createStatement()
val result = stmt.executeQuery("select * from datatable")
logger.debug(result.toString)
val resultEnum:Enumerator[TestDataObject] = Enumerator.generateM {
logger.debug("called enumerator")
result.next() match {
case true =>
val obj = TestDataObject(result.getString("name"), result.getString("object_type"),
result.getString("quantity").toInt, result.getString("cost").toFloat)
logger.info(obj.toJsonString)
Future(Some(obj))
case false =>
logger.warn("reached end of iteration")
stmt.close()
null
}
}
val consume:Iteratee[TestDataObject,Seq[TestDataObject]] = {
Iteratee.fold[TestDataObject,Seq[TestDataObject]](Seq.empty[TestDataObject]) { (result,chunk) => result :+ chunk }
}
val newIteree = Iteratee.flatten(resultEnum(consume))
val eventuallyResult:Future[Seq[TestDataObject]] = newIteree.run
eventuallyResult.onSuccess { case x=> println(x)}
Ok("")
}
}
TestDataObject.scala:
package models
case class TestDataObject (name: String, objtype: String, quantity: Int, cost: Float){
def toJsonString: String = {
val mapper = new ObjectMapper()
mapper.registerModule(DefaultScalaModule)
mapper.writeValueAsString(this)
}
}
I have two main questions:
How do I signal that the input is complete from the Enumerator callback? The documentation says "this method takes a callback function e: => Future[Option[E]] that will be called each time the iteratee this Enumerator is applied to is ready to take some input." but I am unable to pass any kind of EOF that I've found because it's the wrong type. Wrapping it in a Future does not help, but instinctively I am not sure that's the right approach.
How do I get the final result out of the Future to return from the controller view? My understanding is that I would effectively need to pause the main thread to wait for the subthreads to complete, but the only examples I've seen and the only thing I've found in the Future class is the onSuccess callback, so how can I then return that from the view? Does Iteratee.run block until all input has been consumed?
A couple of sub-questions as well, to help my understanding:
Why do I need to wrap my object in Some() when it's already in a Future? What exactly does Some() represent?
When I run the code for the first time, I get a single record logged from logger.info and then it reports "reached end of iteration". Subsequent runs in the same session call nothing. I am closing the statement though, so why do I get no results the second time around? I was expecting it to loop indefinitely as I don't know how to signal the correct termination for the loop.
Thanks a lot in advance for any answers, I thought I was getting the hang of this but obviously not yet!
How do I signal that the input is complete from the Enumerator callback?
You return a Future(None).
How do I get the final result out of the Future to return from the controller view?
You can use Action.async:
def index = Action.async {
db.withConnection { conn=>
...
val eventuallyResult:Future[Seq[TestDataObject]] = newIteree.run
eventuallyResult map { data =>
Ok(...)
}
}
}
Why do I need to wrap my object in Some() when it's already in a Future? What exactly does Some() represent?
The Future represents the (potentially asynchronous) processing to obtain the next element. The Option represents the availability of the next element: Some(x) if another element is available, None if the enumeration is completed.
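
The Future[Option[E]] contract can be illustrated without Play at all. The following is a plain-Scala sketch, not the Play API: collectAll is a hypothetical helper that keeps calling a callback until it yields None, which mirrors how the end of input is signalled.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object GenerateUntilNone {
  // Hypothetical helper: call `next` repeatedly and collect elements until it yields None.
  // The Future models "the element may arrive later"; the Option models "is there one at all".
  def collectAll[E](next: () => Future[Option[E]], acc: Vector[E] = Vector.empty): Future[Vector[E]] =
    next().flatMap {
      case Some(e) => collectAll(next, acc :+ e) // another element: keep going
      case None    => Future.successful(acc)     // end of input: stop
    }

  def demo(): Vector[Int] = {
    // Stand-in for a ResultSet: an iterator that is read one element per callback.
    val rows = Iterator(1, 2, 3)
    val next = () => Future(if (rows.hasNext) Some(rows.next()) else None)
    Await.result(collectAll(next), 5.seconds)
  }
}
```

Returning null instead of Future(None), as in the question, gives the consumer nothing to flatMap over, which is why the enumeration never terminated cleanly.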

Scala - Batched Stream from Futures

I have instances of a case class Thing, and I have a bunch of queries to run that return a collection of Things like so:
def queries: Seq[Future[Seq[Thing]]]
I need to collect all Things from all futures (like above) and group them into equally sized collections of 10,000 so they can be serialized to files of 10,000 Things.
def serializeThings(things: Seq[Thing]): Future[Unit]
I want it to be implemented in such a way that I don't wait for all queries to run before serializing. As soon as there are 10,000 Things returned after the futures of the first queries complete, I want to start serializing.
If I do something like:
Future.sequence(queries)
It will collect the results of all the queries, but my understanding is that operations like map won't be invoked until all queries complete and all the Things must fit into memory at once.
What's the best way to implement a batched stream pipeline using Scala collections and concurrent libraries?
I think I managed to put something together. The solution is based on my previous answer. It accumulates the results of the Future[List[Thing]] queries until it reaches a threshold of BatchSize, then calls serializeThings; when that future finishes, the fold continues with the remaining Things.
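import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
object BatchFutures extends App {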
case class Thing(id: Int)
def getFuture(id: Int): Future[List[Thing]] = {
Future.successful {
List.fill(3)(Thing(id))
}
}
def serializeThings(things: Seq[Thing]): Future[Unit] = Future.successful {
//Thread.sleep(2000)
println("processing: " + things)
}
val ids = (1 to 4).toList
val BatchSize = 5
val future = ids.foldLeft(Future.successful[List[Thing]](Nil)) {
case (acc, id) =>
acc flatMap { processed =>
getFuture(id) flatMap { res =>
val all = processed ++ res
val (batch, rest) = all.splitAt(BatchSize)
if (batch.length == BatchSize) { // if futures filled the batch with needed amount
serializeThings(batch) map { _ =>
rest // process the rest
}
} else {
Future.successful(all) //if we need more Things for a batch
}
}
}
}.flatMap { rest =>
serializeThings(rest)
}
Await.result(future, Duration.Inf)
}
The result prints:
processing: List(Thing(1), Thing(1), Thing(1), Thing(2), Thing(2))
processing: List(Thing(2), Thing(3), Thing(3), Thing(3), Thing(4))
processing: List(Thing(4), Thing(4))
When the number of Things isn't divisible by BatchSize, we have to call serializeThings once more (the last flatMap). I hope it helps! :)
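
One caveat with the fold above is that it writes at most one batch per query, so a single query returning more than BatchSize Things would leave an oversized remainder. A possible refinement, sketched here under the assumption that batchSize is positive, is a small helper that writes every full batch before continuing. The name drain and the recording write function are made up for this example.

```scala
import scala.collection.mutable.ListBuffer
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object BatchDrain {
  // Writes every full batch in `buffered`, one after another,
  // and returns the leftover elements (always fewer than batchSize).
  def drain[A](buffered: List[A], batchSize: Int)(write: List[A] => Future[Unit]): Future[List[A]] =
    if (buffered.length < batchSize) Future.successful(buffered)
    else {
      val (batch, rest) = buffered.splitAt(batchSize)
      write(batch).flatMap(_ => drain(rest, batchSize)(write))
    }

  // Returns (batches written, leftover) for 12 elements and a batch size of 5.
  def demo(): (List[List[Int]], List[Int]) = {
    val written = ListBuffer.empty[List[Int]]
    val write = (b: List[Int]) => Future { written.synchronized { written += b }; () }
    val leftover = Await.result(drain((1 to 12).toList, 5)(write), 5.seconds)
    (written.synchronized(written.toList), leftover)
  }
}
```

Inside the fold, the if/else could then be replaced by a call such as drain(all, BatchSize)(serializeThings), which yields the remainder to carry into the next step.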
Before you call Future.sequence, do whatever per-result work you need on each individual future, and only then combine them with Future.sequence.
//this can be used for serializing
def doSomething(): Unit = ???
//do something with the failed future
def doSomethingElse(): Unit = ???
def doSomething(list: List[_]) = ???
val list: List[Future[_]] = List.fill(10000)(Future(doSomething()))
val newList =
list.map { f =>
f.map { result =>
doSomething()
}.recover { case throwable =>
doSomethingElse()
}
}
Future.sequence(newList).map ( list => doSomething(list)) //wait till all are complete
Instead of generating newList, you could use Future.traverse:
Future.traverse(list)(f => f.map( x => doSomething()).recover {case th => doSomethingElse() }).map ( completeListOfValues => doSomething(completeListOfValues))

How to handle an unknown number of futures that depend on the previous result?

I have a method that makes an API call and returns a Future[Seq[Post]], with each post having a sequential id. The API call returns 200 posts per call. In order to get a complete sequence of posts, I submit the smallest id (as maxId) in subsequent calls. Each Post object also contains a User object that contains a field specifying the total number of posts made by the user. What's the best way to accomplish getting all of a user's posts in a non-blocking manner?
Thus far, I have the following blocking code:
def completeTimeline(userId: Long) = {
def getTimeline(maxId: Option[Long] = None): Seq[Post] = {
val timeline: Future[Seq[Post]] = userPosts(userId) // calls method that returns Future[Seq[Post]]
Await.result(timeline, 5.seconds)
}
def recurseTimeline(timeline: Seq[Post]): Seq[Post] = {
if (timeline.length < timeline.head.user.map(user => user.post_count).getOrElse(0)) {
val maxId = timeline.min(Ordering.by((p: Post) => p.id)).id - 1 // API call with maxId is inclusive, subtract 1 to avoid duplicates
val nextTimeline = getTimeline(maxId = Some(maxId))
val newTimeline = timeline ++ nextTimeline
recurseTimeline(newTimeline)
} else timeline
}
recurseTimeline(getTimeline())
}
Something like this:
def completeTimeline(userId: Long): Future[Seq[Post]] = {
def getTimeline(maxId: Option[Long] = None): Future[Seq[Post]] = {
val timeline: Future[Seq[Post]] = userPosts(userId) // calls method that returns Future[Seq[Post]]
timeline
}
def recurseTimeline(timeline: Seq[Post]): Future[Seq[Post]] = {
if (timeline.length < timeline.head.user.map(user => user.post_count).getOrElse(0)) {
val maxId = timeline.min(Ordering.by((p: Post) => p.id)).id - 1 // API call with maxId is inclusive, subtract 1 to avoid duplicates
for {
nextTimeline <- getTimeline(maxId = Some(maxId))
full <- recurseTimeline(timeline ++ nextTimeline)
} yield full
} else Future.successful(timeline)
}
getTimeline() flatMap recurseTimeline
}
Essentially, you are scheduling things to happen once each future completes (map, flatMap, and for-comprehension), instead of waiting for them to complete.
The code can probably be simplified if I understood a bit better what you're trying to do.
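
As a self-contained illustration of chaining dependent requests without blocking, here is a sketch against a fake in-memory API. The Post class, the page size of 4, and the userPosts signature taking maxId are assumptions for the example, not the real API from the question. It also stops on an empty page rather than comparing against a post count, which is a deliberate simplification.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object CursorPaging {
  final case class Post(id: Long)

  // Fake backend (assumption): posts with ids 10 down to 1, newest first.
  private val allPosts = (10 to 1 by -1).map(i => Post(i.toLong))

  // Hypothetical API: up to `pageSize` posts whose id is <= maxId (inclusive).
  def userPosts(maxId: Option[Long], pageSize: Int = 4): Future[Seq[Post]] =
    Future(allPosts.filter(p => maxId.forall(p.id <= _)).take(pageSize))

  def completeTimeline(): Future[Seq[Post]] = {
    // Each request is only issued from inside the flatMap of the previous one.
    def loop(acc: Seq[Post], maxId: Option[Long]): Future[Seq[Post]] =
      userPosts(maxId).flatMap { page =>
        if (page.isEmpty) Future.successful(acc)
        else loop(acc ++ page, Some(page.map(_.id).min - 1)) // subtract 1 to avoid duplicates
      }
    loop(Vector.empty, None)
  }

  // Blocking only here, so the result can be inspected.
  def demo(): List[Long] = Await.result(completeTimeline(), 5.seconds).map(_.id).toList
}
```

The recursion does not grow the call stack, because each recursive call happens in a callback that runs after the previous future has completed.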
Saying "non-blocking" seems incompatible with needing to check all of the Posts: you do need to wait for all of the Futures to complete so that you can find the minimum id before proceeding. Is that correct, or would you like to explain your intent further?
Proceeding on that assumption, you can add each of the Futures to a list, then iterate over the list and call Await on each one. This is in fact the standard blocking mechanism for obtaining the results of a set of futures.
If there are further considerations or assumptions, please provide them so that this general approach can be tweaked for your particulars.

Running future n times

I'd like to run my future call n times, for example 5. Each future's execution will take some time, and I want to start a new one only when the previous one has completed. Something like:
def fun(times: Int): Future[AnyRef] = {
def _fun(times: Int) = {
createFuture()
}
(1 to times).foldLeft(_fun)((a,b) => {
println(s"Fun: $b of $times")
a.flatMap(_ => _fun)
})
}
So I want to call the "_fun" function n times, one after another. "createFuture()" will take some time, so "_fun" shouldn't be called again before the previous future has completed. Also, I want a non-blocking solution. Currently, this code executes without waiting for the previous future to end.
Any ideas how to make it work?
Thanks for your answers!
Without understanding what exactly you want the final future to return (I'm going to just return the result of the last completed future), you could try something like this:
def fun(times: Int): Future[AnyRef] = {
val prom = Promise[AnyRef]()
def _fun(t: Int): Unit = {
val fut = createFuture()
fut onComplete {
case Failure(ex) => prom.failure(ex)
case Success(result) if t >= times => prom.success(result)
case Success(result) => _fun(t + 1)
}
}
_fun(1)
prom.future
}
This is a sort of recursive solution that will chain the futures together on completion, stopping the chaining when the max number of times has been reached. This code is not perfect, but certainly conveys one possible solution for making sure the successive futures do not fire until the previous future completed successfully.
I think it will be nicer if you make it recursive using flatMap.
Let's imagine you have your createFuture defined as:
def createFuture() = Future( println("Create future"))
We can create a function to compose the result of createFuture with:
def compose(f: () => Future[Unit])(b: Future[Unit]) = b.flatMap(_ => f())
And then you can define fun as:
def fun(n : Int) = {
def nTimes(n : Int, f : Future[Unit] => Future[Unit], acc : Future[Unit]): Future[Unit] = if (n == 0) acc else nTimes(n-1,f,f(acc))
nTimes(n,compose(createFuture),Future.successful(()))
}
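
The same chaining can also be written as a fold. The sketch below is one possible variant, not either answerer's code: runTimes is a hypothetical name, and the demo measures how many runs were ever in flight at once to check that the futures really run one after another.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object RunSequentially {
  // Starts `create` once, then again only after each previous future has completed.
  def runTimes[A](times: Int)(create: () => Future[A]): Future[A] = {
    require(times >= 1, "times must be at least 1")
    (2 to times).foldLeft(create()) { (prev, _) => prev.flatMap(_ => create()) }
  }

  // Returns (number of runs, highest number of runs in flight at the same time).
  def demo(): (Int, Int) = {
    val lock = new Object
    var inFlight, maxInFlight, runs = 0
    val create = () => Future {
      lock.synchronized { inFlight += 1; maxInFlight = math.max(maxInFlight, inFlight) }
      Thread.sleep(10) // simulate work that takes some time
      lock.synchronized { inFlight -= 1; runs += 1; runs }
    }
    Await.result(runTimes(5)(create), 5.seconds)
    lock.synchronized((runs, maxInFlight))
  }
}
```

The important detail is that create() is called inside the flatMap callback. Calling it eagerly, for example by building a list of futures first, would start all of them immediately.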

Incremental processing in an akka actor

I have actors that need to do very long-running and computationally expensive work, but the computation itself can be done incrementally. So while the complete computation itself takes hours to complete, the intermediate results are actually extremely useful, and I'd like to be able to respond to any requests of them. This is the pseudo code of what I want to do:
var intermediateResult = ...
loop {
while (mailbox.isEmpty && computationNotFinished)
intermediateResult = computationStep(intermediateResult)
receive {
case GetCurrentResult => sender ! intermediateResult
...other messages...
}
}
The best way to do this is very close to what you are doing already:
case class Continue(todo: ToDo)
class Worker extends Actor {
var state: IntermediateState = _
def receive = {
case Work(x) =>
val (next, todo) = calc(state, x)
state = next
self ! Continue(todo)
case Continue(todo) if todo.isEmpty => // done
case Continue(todo) =>
val (next, rest) = calc(state, todo)
state = next
self ! Continue(rest)
}
def calc(state: IntermediateState, todo: ToDo): (IntermediateState, ToDo) = ??? // one computation step, to be implemented
}
EDIT: more background
When an actor sends messages to itself, Akka’s internal processing will basically run those within a while loop; the number of messages processed in one go is determined by the actor’s dispatcher’s throughput setting (defaults to 5), after this amount of processing the thread will be returned to the pool and the continuation be enqueued to the dispatcher as a new task. Hence there are two tunables in the above solution:
process multiple steps for a single message (if processing steps are REALLY small)
increase throughput setting for increased throughput and decreased fairness
The original problem seems to have hundreds of such actors running, presumably on common hardware which does not have hundreds of CPUs, so the throughput setting should probably be set such that each batch takes no longer than ca. 10ms.
Performance Assessment
Let’s play a bit with Fibonacci:
Welcome to Scala version 2.10.0-RC1 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_07).
Type in expressions to have them evaluated.
Type :help for more information.
scala> def fib(x1: BigInt, x2: BigInt, steps: Int): (BigInt, BigInt) = if(steps>0) fib(x2, x1+x2, steps-1) else (x1, x2)
fib: (x1: BigInt, x2: BigInt, steps: Int)(BigInt, BigInt)
scala> def time(code: =>Unit) { val start = System.currentTimeMillis; code; println("code took " + (System.currentTimeMillis - start) + "ms") }
time: (code: => Unit)Unit
scala> time(fib(1, 1, 1000))
code took 1ms
scala> time(fib(1, 1, 1000))
code took 1ms
scala> time(fib(1, 1, 10000))
code took 5ms
scala> time(fib(1, 1, 100000))
code took 455ms
scala> time(fib(1, 1, 1000000))
code took 17172ms
Which means that in a presumably quite optimized loop, fib_100000 takes half a second. Now let’s play a bit with actors:
scala> case class Cont(steps: Int, batch: Int)
defined class Cont
scala> val me = inbox()
me: akka.actor.ActorDSL.Inbox = akka.actor.dsl.Inbox$Inbox#32c0fe13
scala> val a = actor(new Act {
var s: (BigInt, BigInt) = _
become {
case Cont(x, y) if y < 0 => s = (1, 1); self ! Cont(x, -y)
case Cont(x, y) if x > 0 => s = fib(s._1, s._2, y); self ! Cont(x - 1, y)
case _: Cont => me.receiver ! s
}
})
a: akka.actor.ActorRef = Actor[akka://repl/user/$c]
scala> time{a ! Cont(1000, -1); me.receive(10 seconds)}
code took 4ms
scala> time{a ! Cont(10000, -1); me.receive(10 seconds)}
code took 27ms
scala> time{a ! Cont(100000, -1); me.receive(10 seconds)}
code took 632ms
scala> time{a ! Cont(1000000, -1); me.receive(30 seconds)}
code took 17936ms
This is already interesting: given a long enough time per step (with the huge BigInts behind the scenes in the last line), actors don’t add much overhead. Now let’s see what happens when doing smaller calculations in a more batched way:
scala> time{a ! Cont(10000, -10); me.receive(30 seconds)}
code took 462ms
This is pretty close to the result for the direct variant above.
Conclusion
Sending messages to self is NOT expensive for almost all applications, just keep the actual processing step slightly larger than a few hundred nanoseconds.
I assume from your comment on Roland Kuhn's answer that you have some work which can be considered recursive, at least in blocks. If this is not the case, I don't think there is any clean solution to your problem, and you will have to deal with complicated pattern matching blocks.
If my assumptions are correct, I would schedule the computation asynchronously and let the actor be free to answer other messages. The key point is to use Future monadic capabilities and having a simple receive block. You would have to handle three messages (startComputation, changeState, getState)
You will end up having the following receive:
def receive = {
case StartComputation(myData) => expensiveStuff(myData)
case ChangeState(newstate) => this.state = newstate
case GetState => sender ! this.state
}
And then you can leverage the map method on Future, by defining your own recursive map:
def mapRecursive[A](f:Future[A], handler: A => A, exitConditions: A => Boolean):Future[A] = {
f.flatMap { a=>
if (exitConditions(a))
f
else {
val newFuture = f.flatMap{ a=> Future(handler(a))}
mapRecursive(newFuture,handler,exitConditions)
}
}
}
Once you have this tool, everything is easier. If you look to the following example :
def main(args: Array[String]): Unit = {
val baseFuture: Future[Int] = Future.successful(64) // with scala.concurrent, Promise.successful returns a Promise, not a Future
val newFuture:Future[Int] = mapRecursive(baseFuture,
(a:Int) => {
val result = a/2
println("Additional step done: the current a is " + result)
result
}, (a:Int) => (a<=1))
val one = Await.result(newFuture,Duration.Inf)
println("Computation finished, result = " + one)
}
Its output is:
Additional step done: the current a is 32
Additional step done: the current a is 16
Additional step done: the current a is 8
Additional step done: the current a is 4
Additional step done: the current a is 2
Additional step done: the current a is 1
Computation finished, result = 1
You can do the same inside your expensiveStuff method:
def expensiveStuff(myData:MyData):Future[MyData]= {
val firstResult = Future.successful(myData)
val handler : MyData => MyData = (myData) => {
val result = myData.copy(myData.value/2)
self ! ChangeState(result)
result
}
val exitCondition : MyData => Boolean = (myData:MyData) => myData.value==1
mapRecursive(firstResult,handler,exitCondition)
}
EDIT - MORE DETAILED
If you don't want to block the Actor, which processes messages from its mailbox in a thread-safe and synchronous manner, the only thing you can do is to execute the computation on a different thread. That is exactly what a high-performance non-blocking receive does.
However, you were right in saying that the approach I propose pays a high performance penalty. Every step is done in a different future, which might not be necessary at all. You can therefore recurse the handler to obtain a single-threaded or multi-threaded execution. There is no magic formula after all:
If you want to schedule asynchronously and minimize the cost, all the work should be done by a single thread.
This, however, could prevent other work from starting, because if all the threads in a thread pool are taken, the futures will queue. You might therefore want to break the operation into multiple futures so that even at full usage some new work can be scheduled before old work has been completed.
def recurseFuture[A](entryFuture: Future[A], handler: A => A, exitCondition: A => Boolean, maxNestedRecursion: Long = Long.MaxValue): Future[A] = {
def recurse(a:A, handler: A => A, exitCondition: A => Boolean, maxNestedRecursion: Long, currentStep: Long): Future[A] = {
if (exitCondition(a))
Future.successful(a)
else
if (currentStep==maxNestedRecursion)
Future.successful(handler(a)).flatMap(a => recurse(a,handler,exitCondition,maxNestedRecursion,0))
else{
recurse(handler(a),handler,exitCondition,maxNestedRecursion,currentStep+1)
}
}
entryFuture.flatMap { a => recurse(a,handler,exitCondition,maxNestedRecursion,0) }
}
I have enhanced for testing purposes my handler method:
val handler: Int => Int = (a: Int) => {
val result = a / 2
println("Additional step done: the current a is " + result + " on thread " + Thread.currentThread().getName)
result
}
Approach 1: Recurse the handler on itself so that all steps execute on a single thread.
println("Starting strategy with all the steps on the same thread")
val deepestRecursion: Future[Int] = recurseFuture(baseFuture,handler, exitCondition)
Await.result(deepestRecursion, Duration.Inf)
println("Completed strategy with all the steps on the same thread")
println("")
Approach 2: Recurse the handler on itself for a limited depth.
println("Starting strategy with the steps grouped by three")
val threeStepsInSameFuture: Future[Int] = recurseFuture(baseFuture,handler, exitCondition,3)
val threeStepsInSameFuture2: Future[Int] = recurseFuture(baseFuture,handler, exitCondition,4)
Await.result(threeStepsInSameFuture, Duration.Inf)
Await.result(threeStepsInSameFuture2, Duration.Inf)
println("Completed strategy with all the steps grouped by three")
executorService.shutdown()
You should not use Actors to make long running computations as these will block the threads that are supposed to run the Actors code.
I would try to go with a design that uses a separate Thread/ThreadPool for the computations and use AtomicReferences to store/query the intermediate results in the lines of the following pseudo code:
val cancelled = new AtomicBoolean(false)
val intermediateResult = new AtomicReference[IntermediateResult]()
object WorkerThread extends Thread {
override def run(): Unit = {
while(!cancelled.get) {
intermediateResult.set(computationStep(intermediateResult.get))
}
}
}
loop {
react {
case StartComputation => WorkerThread.start()
case CancelComputation => cancelled.set(true)
case GetCurrentResult => sender ! intermediateResult.get
}
}
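
For readers who want to try the AtomicReference idea outside an actor system, here is a small runnable sketch using only the JDK and the Scala standard library. The Partial type, the summing computationStep, and the limit of 1000 steps are invented for the example; a real computation would substitute its own state and step function.

```scala
import java.util.concurrent.atomic.{AtomicBoolean, AtomicReference}

object IncrementalWorkerDemo {
  // Hypothetical intermediate state: how many steps ran and the running sum 1 + 2 + ... + steps.
  final case class Partial(steps: Int, sum: Long)

  def computationStep(p: Partial): Partial = Partial(p.steps + 1, p.sum + p.steps + 1)

  final class IncrementalWorker(maxSteps: Int) {
    private val cancelled = new AtomicBoolean(false)
    private val intermediate = new AtomicReference(Partial(0, 0L))

    // Only this thread writes the reference; any thread may read it.
    private val worker = new Thread(new Runnable {
      def run(): Unit =
        while (!cancelled.get && intermediate.get.steps < maxSteps)
          intermediate.set(computationStep(intermediate.get))
    })

    def start(): Unit = worker.start()
    def current: Partial = intermediate.get // never blocks, may be called at any time
    def cancel(): Unit = cancelled.set(true)
    def awaitDone(): Unit = worker.join()
  }

  def demo(): Partial = {
    val w = new IncrementalWorker(1000)
    w.start()
    w.current   // an intermediate snapshot is available while the work is running
    w.awaitDone()
    w.current
  }
}
```

Because the state is an immutable case class swapped atomically, a reader always sees a consistent snapshot, never a half-updated one.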
This is a classic concurrency problem. You want several routines/actors (or whatever you want to call them). The code below is mostly correct Go, with obscenely long variable names for context. The first routine handles queries and intermediate results:
func serveIntermediateResults(
computationChannel chan *IntermediateResult,
queryChannel chan chan<-*IntermediateResult) {
var latestIntermediateResult *IntermediateResult // initial result
for {
select {
// an update arrives
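case result, notClosed := <-computationChannel:
if !notClosed {
// the computation has finished, stop checking
computationChannel = nil
} else {
// keep the newest result; assigning here avoids shadowing the outer variable
latestIntermediateResult = result
}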
// a query arrived
case queryResponseChannel, notClosed := <-queryChannel:
if !notClosed {
// no more queries, so we're done
return
}
// respond with the latest result
queryResponseChannel<-latestIntermediateResult
}
}
}
In your long computation, you update your intermediate result wherever appropriate:
func longComputation(intermediateResultChannel chan *IntermediateResult) {
for notFinished {
// lots of stuff
intermediateResultChannel<-currentResult
}
close(intermediateResultChannel)
}
Finally to ask for the current result, you have a wrapper to make this nice:
func getCurrentResult() *IntermediateResult {
responseChannel := make(chan *IntermediateResult)
// queryChannel was given to the intermediate result server routine earlier
queryChannel<-responseChannel
return <-responseChannel
}