I'm implementing a iterator to a HTTP resource, which I can recover a list of elements paged, I tried to do this with a plain Iterator, but it's a blocking implementation, and since I'm using akka it makes my dispatcher go a little crazy.
My will it's to implement the same iterator using akka-stream. The problem is I need bit different retry strategy.
The service returns a list of elements, identified by a id, and sometimes when I query for the next page, the service returns the same products on the current page.
My current algorithm is this.
var seenIds = Set.empty
var position = 0
def isProblematicPage(elements: Seq[Element]) Boolean = {
val currentIds = elements.map(_.id)
val intersection = seenIds & currentIds
val hasOnlyNewIds = intersection.isEmpty
if (hasOnlyNewIds) {
seenIds = seenIds | currentIds
}
!hasOnlyNewIds
}
def incrementPage(): Unit = {
position += 10
}
def doBackOff(attempt: Int): Unit = {
// Backoff logic
}
#tailrec
def fetchPage(attempt: Int = 0): Iterator[Element] = {
if (attempt > MaxRetries) {
incrementPage()
return Iterator.empty
}
val eventualPage = service.retrievePage(position, position + 10)
val page = Await.result(eventualPage, 5 minutes)
if (isProblematicPage(page)) {
doBackOff(attempt)
fetchPage(attempt + 1)
} else {
incrementPage()
page.iterator
}
}
I'm doing the implementation using akka-streams but I can't figure out how to accumulate the pages and test for repetition using the streams structure.
Any suggestions?
The Flow.scan method is useful in such situations.
I would start your stream with a source of positions:
type Position = Int
//0,10,20,...
def positionIterator() : Iterator[Position] = Iterator from (0,10)
val positionSource : Source[Position,_] = Source fromIterator positionIterator
This position source can then be directed to a Flow.scan which utilizes a function similar to your fetchPage (side note: you should avoid awaits as much as possible, there is a way to not have awaits in your code but that is outside the scope of your original question). The new function needs to take in the "state" of already seen Elements:
def fetchPageWithState(service : Service)
(seenEls : Set[Element], position : Position) : Set[Elements] = {
val maxRetries = 10
val seenIds = seenEls map (_.id)
#tailrec
def readPosition(attempt : Int) : Seq[Elements] = {
if(attempt > maxRetries)
Iterator.empty
else {
val eventualPage : Seq[Element] =
Await.result(service.retrievePage(position, position + 10), 5 minutes)
if(eventualPage.map(_.id).exists(seenIds.contains)) {
doBackOff(attempt)
readPosition(attempt + 1)
}
else
eventualPage
}
}//end def readPosition
seenEls ++ readPosition(0).toSet
}//end def fetchPageWithState
This can now be used within a Flow:
def fetchFlow(service : Service) : Flow[Position, Set[Element],_] =
Flow[Position].scan(Set.empty[Element])(fetchPageWithState(service))
The new Flow can be easily connected to your Position Source to create a Source of Set[Element]:
def elementsSource(service : Service) : Source[Set[Element], _] =
positionSource via fetchFlow(service)
Each new value from elementsSource will be an ever growing Set of unique Elements from fetched pages.
The Flow.scan stage was a good advice, but it lacked the feature to deal with futures, so I implemented it asynchronous version Flow.scanAsync it's now available on akka 2.4.12.
The current implementation is:
val service: WebService
val maxTries: Int
val backOff: FiniteDuration
def retry[T](zero: T, attempt: Int = 0)(f: => Future[T]): Future[T] = {
f.recoverWith {
case ex if attempt >= maxAttempts =>
Future(zero)
case ex =>
akka.pattern.after(backOff, system.scheduler)(retry(zero, attempt + 1)(f))
}
}
def isProblematicPage(lastPage: Seq[Element], currPage: Seq[Element]): Boolean = {
val lastPageIds = lastPage.map(_.id).toSet
val currPageIds = currPage.map(_.id).toSet
val intersection = lastPageIds & currPageIds
intersection.nonEmpty
}
def retrievePage(lastPage: Seq[Element], startIndex: Int): Future[Seq[Element]] = {
retry(Seq.empty) {
service.fetchPage(startIndex).map { currPage: Seq[Element] =>
if (isProblematicPage(lastPage, currPage)) throw new ProblematicPageException(startIndex)
else currPage
}
}
}
val pagesRange: Range = Range(0, maxItems, pageSize)
val scanAsyncFlow = Flow[Int].via(ScanAsync(Seq.empty)(retrievePage))
Source(pagesRange)
.via(scanAsyncFlow)
.mapConcat(identity)
.runWith(Sink.seq)
Thanks Ramon for the advice :)
Related
I'm new in Scala and i'm facing a few problems in my assignment :
I want to build a stream class that can do 3 main tasks : filter,map,and forEach.
My streams data is an array of elements. Each of the 3 main tasks should run in 2 different threads on my streams array.
In addition, I need to divde the logic of the action and its actual run to two different parts. First declare all tasks in stream and only when I run stream.run() I want the actual actions to happen.
My code :
class LearningStream[A]() {
val es: ExecutorService = Executors.newFixedThreadPool(2)
val ec = ExecutionContext.fromExecutorService(es)
var streamValues: ArrayBuffer[A] = ArrayBuffer[A]()
var r: Runnable = () => "";
def setValues(streamv: ArrayBuffer[A]) = {
streamValues = streamv;
}
def filter(p: A => Boolean): LearningStream[A] = {
var ls_filtered: LearningStream[A] = new LearningStream[A]()
r = () => {
println("running real filter..")
val (l,r) = streamValues.splitAt(streamValues.length/2)
val a:ArrayBuffer[A]=es.submit(()=>l.filter(p)).get()
val b:ArrayBuffer[A]=es.submit(()=>r.filter(p)).get()
ms_filtered.setValues(a++b)
}
return ls_filtered
}
def map[B](f: A => B): LearningStream[B] = {
var ls_map: LearningStream[B] = new LearningStream[B]()
r = () => {
println("running real map..")
val (l,r) = streamValues.splitAt(streamValues.length/2)
val a:ArrayBuffer[B]=es.submit(()=>l.map(f)).get()
val b:ArrayBuffer[B]=es.submit(()=>r.map(f)).get()
ls_map.setValues(a++b)
}
return ls_map
}
def forEach(c: A => Unit): Unit = {
r=()=>{
println("running real forEach")
streamValues.foreach(c)}
}
def insert(a: A): Unit = {
streamValues += a
}
def start(): Unit = {
ec.submit(r)
}
def shutdown(): Unit = {
ec.shutdown()
}
}
my main :
def main(args: Array[String]): Unit = {
var factorial=0
val s = new LearningStream[String]
s.filter(str=>str.startsWith("-")).map(s=>s.toInt*(-1)).forEach(i=>factorial=factorial*i)
for(i <- -5 to 5){
s.insert(i.toString)
}
println(s.streamValues)
s.start()
println(factorial)
}
The main prints only the filter`s output and the factorial isnt changed (still 1).
What am I missing here ?
My solution: #Levi Ramsey left a few good hints in the comments if you want to get hints and not the real solution.
First problem: Only one command (filter) run and the other didn't. solution: insert to the runnable of each command a call for the next stream via:
ec.submit(ms_map.r)
In order to be able to close all sessions, we need to add another LearningStream data member to the class. However we can't add just a regular LearningStream object because it depends on parameter [A]. Therefore, I implemented a trait that has the close function and my data member was of that trait type.
Trying to execute a function in a given time frame, but if computation fails by TimeOut get a partial result instead of an empty exception.
The attached code solves it.
The timedRun function is from Computation with time limit
Any better approach?.
package ga
object Ga extends App {
//this is the ugly...
var bestResult = "best result";
try {
val result = timedRun(150)(bestEffort())
} catch {
case e: Exception =>
print ("timed at = ")
}
println(bestResult)
//dummy function
def bestEffort(): String = {
var res = 0
for (i <- 0 until 100000) {
res = i
bestResult = s" $res"
}
" " + res
}
//This is the elegant part from stackoverflow gruenewa
#throws(classOf[java.util.concurrent.TimeoutException])
def timedRun[F](timeout: Long)(f: => F): F = {
import java.util.concurrent.{ Callable, FutureTask, TimeUnit }
val task = new FutureTask(new Callable[F]() {
def call() = f
})
new Thread(task).start()
task.get(timeout, TimeUnit.MILLISECONDS)
}
}
I would introduce a small intermediate class for more explicitly communicating the partial results between threads. That way you don't have to modify non-local state in any surprising ways. Then you can also just catch the exception within the timedRun method:
class Result[A](var result: A)
val result = timedRun(150)("best result")(bestEffort)
println(result)
//dummy function
def bestEffort(r: Result[String]): Unit = {
var res = 0
for (i <- 0 until 100000) {
res = i
r.result = s" $res"
}
r.result = " " + res
}
def timedRun[A](timeout: Long)(initial: A)(f: Result[A] => _): A = {
import java.util.concurrent.{ Callable, FutureTask, TimeUnit }
val result = new Result(initial)
val task = new FutureTask(new Callable[A]() {
def call() = { f(result); result.result }
})
new Thread(task).start()
try {
task.get(timeout, TimeUnit.MILLISECONDS)
} catch {
case e: java.util.concurrent.TimeoutException => result.result
}
}
It's admittedly a bit awkward since you don't usually have the "return value" of a function passed in as a parameter. But I think it's the least-radical modification of your code that makes sense. You could also consider modeling your computation as something that returns a Stream or Iterator of partial results, and then essentially do .takeWhile(notTimedOut).last. But how feasible that is really depends on the actual computation.
First, you need to use one of the solution to recover after the future timed out which are unfortunately not built-in in Scala:
See: Scala Futures - built in timeout?
For example:
def withTimeout[T](fut:Future[T])(implicit ec:ExecutionContext, after:Duration) = {
val prom = Promise[T]()
val timeout = TimeoutScheduler.scheduleTimeout(prom, after)
val combinedFut = Future.firstCompletedOf(List(fut, prom.future))
fut onComplete{case result => timeout.cancel()}
combinedFut
}
Then it is easy:
var bestResult = "best result"
val expensiveFunction = Future {
var res = 0
for (i <- 0 until 10000) {
Thread.sleep(10)
res = i
bestResult = s" $res"
}
" " + res
}
val timeoutFuture = withTimeout(expensiveFunction) recover {
case _: TimeoutException => bestResult
}
println(Await.result(timeoutFuture, 1 seconds))
I have two Scala functions that are expensive to run. Each one is like below, they start improving the value of a variable and I'd like to run them simultaneously and after 5 minutes (or some other time). I'd like to terminate the two functions and take their latest value up to that time.
def func1(n: Int): Double = {
var a = 0.0D
while (not terminated) {
/// improve value of 'a' with algorithm 1
}
}
def func2(n: Int): Double = {
var a = 0.0D
while (not terminated) {
/// improve value of 'a' with algorithm 2
}
}
I would like to know how I should structure my code for doing that and what is the best practice here? I was thinking about running them in two different threads with a timeout and return their latest value at time out. But it seems there can be other ways for doing that. I am new to Scala so any insight would be tremendously helpful.
It is not hard. Here is one way of doing it:
#volatile var terminated = false
def func1(n: Int): Double = {
var a = 0.0D
while (!terminated) {
a = 0.0001 + a * 0.99999; //some useless formula1
}
a
}
def func2(n: Int): Double = {
var a = 0.0D
while (!terminated) {
a += 0.0001 //much simpler formula2, just for testing
}
a
}
def main(args: Array[String]): Unit = {
val f1 = Future { func1(1) } //work starts here
val f2 = Future { func2(2) } //and here
//aggregate results into one common future
val aggregatedFuture = for{
f1Result <- f1
f2Result <- f2
} yield (f1Result, f2Result)
Thread.sleep(500) //wait here for some calculations in ms
terminated = true //this is where we actually command to stop
//since looping to while() takes time, we need to wait for results
val res = Await.result(aggregatedFuture, 50.millis)
//just a printout
println("results:" + res)
}
But, of course, you would want to maybe look at your while loops and create a more manageable and chainable calculations.
Output: results:(9.999999999933387,31206.34691883926)
I am not 100% sure if this is something you would want to do, but here is one approach (not for 5 minutes, but you can change that) :
object s
{
def main(args: Array[String]): Unit = println(run())
def run(): (Int, Int) =
{
val (s, numNanoSec, seedVal) = (System.nanoTime, 500000000L, 0)
Seq(f1 _, f2 _).par.map(f =>
{
var (i, id) = f(seedVal)
while (System.nanoTime - s < numNanoSec)
{
i = f(i)._1
}
(i, id)
}).seq.maxBy(_._1)
}
def f1(a: Int): (Int, Int) = (a + 1, 1)
def f2(a: Int): (Int, Int) = (a + 2, 2)
}
Output:
me#ideapad:~/junk> scala s.scala
(34722678,2)
me#ideapad:~/junk> scala s.scala
(30065688,2)
me#ideapad:~/junk> scala s.scala
(34650716,2)
Of course this all assumes you have at least two threads available to distribute tasks to.
You can use Future with Await result to do that:
def fun2(): Double = {
var a = 0.0f
val f = Future {
// improve a with algorithm 2
a
}
try {
Await.result(f, 5 minutes)
} catch {
case e: TimeoutException => a
}
}
use the Await.result to wait algorithm with timeout, when we met this timeout, we return the a directly
There is some data that I have pulled from a remote API, for which I use a Future-style interface. The data is structured as a linked-list. A relevant example data container is shown below.
case class Data(information: Int) {
def hasNext: Boolean = ??? // Implemented
def next: Future[Data] = ??? // Implemented
}
Now I'm interested in adding some functionality to the data class, such as map, foreach, reduce, etc. To do so I want to implement some form of IterableLike such that it inherets these methods.
Given below is the trait Data may extend, such that it gets this property.
trait AsyncIterable[+T]
extends IterableLike[Future[T], AsyncIterable[T]]
{
def hasNext : Boolean
def next : Future[T]
// How to implement?
override def iterator: Iterator[Future[T]] = ???
override protected[this] def newBuilder: mutable.Builder[Future[T], AsyncIterable[T]] = ???
override def seq: TraversableOnce[Future[T]] = ???
}
It should be a non-blocking implementation, which when acted on, starts requesting the next data from the remote data source.
It is then possible to do cool stuff such as
case class Data(information: Int) extends AsyncIterable[Data]
val data = Data(1) // And more, of course
// Asynchronously print all the information.
data.foreach(data => println(data.information))
It is also acceptable for the interface to be different. But the result should in some way represent asynchronous iteration over the collection. Preferably in a way that is familiar to developers, as it will be part of an (open source) library.
In production I would use one of following:
Akka Streams
Reactive Extensions
For private tests I would implement something similar to following.
(Explanations are below)
I have modified a little bit your Data:
abstract class AsyncIterator[T] extends Iterator[Future[T]] {
def hasNext: Boolean
def next(): Future[T]
}
For it we can implement this Iterable:
class AsyncIterable[T](sourceIterator: AsyncIterator[T])
extends IterableLike[Future[T], AsyncIterable[T]]
{
private def stream(): Stream[Future[T]] =
if(sourceIterator.hasNext) {sourceIterator.next #:: stream()} else {Stream.empty}
val asStream = stream()
override def iterator = asStream.iterator
override def seq = asStream.seq
override protected[this] def newBuilder = throw new UnsupportedOperationException()
}
And if see it in action using following code:
object Example extends App {
val source = "Hello World!";
val iterator1 = new DelayedIterator[Char](100L, source.toCharArray)
new AsyncIterable(iterator1).foreach(_.foreach(print)) //prints 1 char per 100 ms
pause(2000L)
val iterator2 = new DelayedIterator[String](100L, source.toCharArray.map(_.toString))
new AsyncIterable(iterator2).reduceLeft((fl: Future[String], fr) =>
for(l <- fl; r <- fr) yield {println(s"$l+$r"); l + r}) //prints 1 line per 100 ms
pause(2000L)
def pause(duration: Long) = {println("->"); Thread.sleep(duration); println("\n<-")}
}
class DelayedIterator[T](delay: Long, data: Seq[T]) extends AsyncIterator[T] {
private val dataIterator = data.iterator
private var nextTime = System.currentTimeMillis() + delay
override def hasNext = dataIterator.hasNext
override def next = {
val thisTime = math.max(System.currentTimeMillis(), nextTime)
val thisValue = dataIterator.next()
nextTime = thisTime + delay
Future {
val now = System.currentTimeMillis()
if(thisTime > now) Thread.sleep(thisTime - now) //Your implementation will be better
thisValue
}
}
}
Explanation
AsyncIterable uses Stream because it's calculated lazily and it's simple.
Pros:
simplicity
multiple calls to iterator and seq methods return same iterable with all items.
Cons:
could lead to memory overflow because stream keeps all prevously obtained values.
first value is eagerly gotten during creation of AsyncIterable
DelayedIterator is very simplistic implementation of AsyncIterator, don't blame me for quick and dirty code here.
It's still strange for me to see synchronous hasNext and asynchronous next()
Using Twitter Spool I've implemented a working example.
To implement spool I modified the example in the documentation.
import com.twitter.concurrent.Spool
import com.twitter.util.{Await, Return, Promise}
import scala.concurrent.{ExecutionContext, Future}
trait AsyncIterable[+T <: AsyncIterable[T]] { self : T =>
def hasNext : Boolean
def next : Future[T]
def spool(implicit ec: ExecutionContext) : Spool[T] = {
def fill(currentPage: Future[T], rest: Promise[Spool[T]]) {
currentPage foreach { cPage =>
if(hasNext) {
val nextSpool = new Promise[Spool[T]]
rest() = Return(cPage *:: nextSpool)
fill(next, nextSpool)
} else {
val emptySpool = new Promise[Spool[T]]
emptySpool() = Return(Spool.empty[T])
rest() = Return(cPage *:: emptySpool)
}
}
}
val rest = new Promise[Spool[T]]
if(hasNext) {
fill(next, rest)
} else {
rest() = Return(Spool.empty[T])
}
self *:: rest
}
}
Data is the same as before, and now we can use it.
// Cool stuff
implicit val ec = scala.concurrent.ExecutionContext.global
val data = Data(1) // And others
// Print all the information asynchronously
val fut = data.spool.foreach(data => println(data.information))
Await.ready(fut)
It will trow an exception on the second element, because the implementation of next was not provided.
How can i consume a service that returns pages as a Stream of items?
Amazon S3, for example, lets you fetch the initial object listing or the next object listing from the previous one.
For example consider this code that simulates such behavior:
import math._
case class Page(number: Int)
case class Pages(pages: Seq[Page], truncated: Boolean)
class PagesService(pageSize: Int, pagesServed: Int) {
def getPages =
Pages((1 to pageSize).map(Page), pageSize < pagesServed)
def nextPages(previous: Pages) = {
val first = previous.pages.last.number + 1
val last = min(first + pageSize, pagesServed)
Pages((first to last).map(Page), last < pagesServed)
}
}
object PagesClient extends App {
val service = new PagesService(10, 100)
val first = service.getPages
assert(first.truncated)
first.pages.foreach(println(_))
val second = service.nextPages(first)
second.pages.foreach(println(_))
val book: Stream[Page] = ???
}
How could i write that last expression?
val book: Stream[Pages] = first #:: book.map(service.nextPages).takeWhile(_.pages.nonEmpty)
val pages: Stream[Page] = book.flatten(_.pages)
I do not know if it is a typo. If you mean Stream[Pages] it is simple:
val book: Stream[Pages] = first #:: book.map(x => service.nextPages(x))
If you meant Stream[Page] i.e. a stream of Page from all pages, then:
val first = service.getPages
val second = service.nextPages(first)
val books: Stream[Page] = {
val currentPages = book.iterator
val firstPages = currentPages.next.pages.iterator
def inner(current: Iterator[Page]): Stream[Page] = {
if (current.hasNext) {
current.next #:: inner(current)
} else {
val i = currentPages.next
inner(i.pages.iterator)
}
}
inner(firstPages);
}
The above basically takes a Pages, returns its Page as part of stream. If the Pages is exhausted, then goes over to the next Pages and so on.