Serving multiple result Futures to a client as soon as they are available - Scala

I have a page that is populated by data I get from different calls to distant servers. Some requests take longer than others; the way I do things now is that I fire all the calls at once, wrap the whole thing in a Future, and put it in an Action.async for Play to handle.
This does the job in theory, but I don't want my users waiting a long time; the page should instead start loading part by part. Meaning that as soon as the data for a given request to a distant server is available, it should be sent to the client as JSON or whatever.
I was able to partially achieve this with EventSource by modifying Play's event-source sample to do something like this:
Ok.chunked((enumerator1 &> EventSource()) >- (enumerator2 &> EventSource())).as("text/event-stream")
and the enumerators as follows:
val enumerator1: Enumerator[String] = Enumerator.generateM {
  Future[Option[String]] { Thread.sleep(1500); Some("Hello") }
}

val enumerator2: Enumerator[String] = Enumerator.generateM {
  Future[Option[String]] { Thread.sleep(2000); Some("World!") }
}
As you have probably guessed, I was expecting "Hello" to be sent to the client after 1.5s and "World!" 0.5s later, but I ended up receiving "Hello" every 1.5s and "World!" every 2s.
My questions are:
Is there a way to stop sending a piece of information once it has been correctly delivered to the client using the method above?
Is there a better way to achieve what I want?

You don't want generateM; it's for building enumerators that can return multiple values. generateM takes a function that returns either a Some, to produce the next value of the Enumerator, or a None, to signal that the Enumerator is complete. Because your function always returns Some, you create Enumerators of infinite length.
You just want to convert a Future into an Enumerator, to create an Enumerator with a single element:
Enumerator.flatten(future.map(Enumerator(_)))
Also, you can interleave your enumerators and then feed the result into EventSource(). The parentheses are unnecessary as well (methods that start with > take precedence over methods that start with &):
enumerator1 >- enumerator2 &> EventSource()
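Putting the two pieces together, a minimal sketch of the corrected controller code, assuming the Play 2.2-era iteratee API and keeping the Thread.sleep calls as stand-ins for the remote requests:

import play.api.libs.EventSource
import play.api.libs.iteratee.Enumerator
import play.api.libs.concurrent.Execution.Implicits.defaultContext
import scala.concurrent.Future

def page = Action {
  // Each remote call is a Future that completes exactly once.
  val hello: Future[String] = Future { Thread.sleep(1500); "Hello" }
  val world: Future[String] = Future { Thread.sleep(2000); "World!" }

  // A single-element Enumerator per Future: it emits when the Future completes.
  val enumerator1: Enumerator[String] = Enumerator.flatten(hello.map(Enumerator(_)))
  val enumerator2: Enumerator[String] = Enumerator.flatten(world.map(Enumerator(_)))

  // Interleave the two and stream the results as server-sent events.
  Ok.chunked(enumerator1 >- enumerator2 &> EventSource()).as("text/event-stream")
}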

Related

How to convert `fs2.Stream[IO, T]` to `Iterator[T]` in Scala

Need to fill in the methods next and hasNext and preserve laziness
new Iterator[T] {
  val stream: fs2.Stream[IO, T] = ...
  def next(): T = ???
  def hasNext: Boolean = ???
}
But I cannot figure out how on earth to do this from an fs2.Stream. All the methods on a Stream (or on the "compiled" thing) seem fairly useless for this.
If this is simply impossible to do in a reasonable amount of code, then that itself is a satisfactory answer and we will just rip out fs2.Stream from the codebase - just want to check first!
fs2.Stream, while similar in concept to Iterator, cannot be converted to one while preserving laziness. I'll try to elaborate on why...
Both represent a pull-based series of items, but the way in which they represent that series and implement the laziness differs too much.
As you already know, Iterator represents its pull in terms of the next() and hasNext methods, both of which are synchronous and blocking. To consume the iterator and return a value, you can directly call those methods e.g. in a loop, or use one of its many convenience methods.
fs2.Stream supports two capabilities that make it incompatible with that interface:
cats.effect.Resource can be included in the construction of a Stream. For example, you could construct an fs2.Stream[IO, Byte] representing the contents of a file. When consuming that stream, even if you abort early or do some strange flatMap, the underlying Resource is honored and your file handle is guaranteed to be closed. If you were trying to do the same thing with Iterator, the "abort early" case would pose problems, forcing you to do something like an Iterator[Byte] with Closeable that the caller has to remember to .close(), or some other pattern.
Evaluation of "effects". In this context, effects are types like IO or Future, where the process of obtaining the value may perform some possibly-asynchronous action, and may perform side-effects. Asynchrony poses a problem when trying to force the process into a synchronous interface, since it forces you to block your current thread to wait for the asynchronous answer, which can cause deadlocks if you aren't careful. Libraries like cats-effect strongly discourage you from calling methods like unsafeRunSync.
fs2.Stream does allow for a special case that excludes Resource and effects, via its Pure type alias, which you can use in place of IO. That gets you access to Stream.PureOps, but those methods only consume the whole stream by building a collection; the laziness you want to preserve would be lost.
Side note: you can convert an Iterator to a Stream.
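For reference, that direction is a one-liner; a sketch assuming fs2 3.x, where fromIterator takes a chunk size (older versions take just the iterator):

import cats.effect.IO
import fs2.Stream

val iter: Iterator[Int] = Iterator(1, 2, 3)
// Laziness is preserved: elements are pulled from the iterator on demand,
// in chunks of the given size.
val s: Stream[IO, Int] = Stream.fromIterator[IO](iter, chunkSize = 16)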
The only way to "convert" a Stream to an Iterator is to consume it to some collection type via e.g. .compile.toList, which gets you an IO[List[T]], and then .map(_.iterator) that to get an IO[Iterator[T]]. But ultimately that doesn't fit what you're asking for, since it forces you to consume the stream into a buffer, breaking laziness.
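In code, with stream being the fs2.Stream[IO, T] from the question, that lossy route looks like this:

// Consumes the entire stream into memory before exposing an Iterator,
// so laziness is lost.
val lossy: IO[Iterator[T]] = stream.compile.toList.map(_.iterator)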
@Dima mentioned the "XY Problem", which was poorly received since they didn't (initially) elaborate on the incompatibility, but they're right. It would be helpful to know why you're trying to make a Stream-to-Iterator conversion, in case there's some other approach that would serve your overall goal instead.

Akka Streams - Backpressure for Source.unfoldAsync

I'm currently trying to read a paginated HTTP resource. Each page is a Multipart document, and the response for a page includes a next link in the headers if there is a page with more content. An automated parser can then start at the oldest page and read page by page, using the headers to construct the request for the next page.
I'm using Akka Streams and Akka Http for the implementation, because my goal is to create a streaming solution. I came up with this (I will include only the relevant parts of the code here, feel free to have a look at this gist for the whole code):
def read(request: HttpRequest): Source[HttpResponse, _] =
  Source.unfoldAsync[Option[HttpRequest], HttpResponse](Some(request))(Crawl.crawl)

val parse: Flow[HttpResponse, General.BodyPart, _] = Flow[HttpResponse]
  .flatMapConcat(r => Source.fromFuture(Unmarshal(r).to[Multipart.General]))
  .flatMapConcat(_.parts)
...
def crawl(reqOption: Option[HttpRequest]): Future[Option[(Option[HttpRequest], HttpResponse)]] = reqOption match {
  case Some(req) =>
    Http().singleRequest(req).map { response =>
      if (response.status.isFailure()) Some((None, response))
      else nextRequest(response, HttpMethods.GET)
    }
  case None => Future.successful(None)
}
So the general idea is to use Source.unfoldAsync to crawl through the pages and do the HTTP requests (the idea and implementation are very close to what's described in this answer). This creates a Source[HttpResponse, _] that can then be consumed (unmarshalled to Multipart, split up into the individual parts, ...).
My problem now is that consuming the HttpResponses might take a while (unmarshalling takes some time if the pages are large, maybe there will be some database requests at the end to persist some data, ...). So I would like Source.unfoldAsync to backpressure if the downstream is slower. By default, the next HTTP request starts as soon as the previous one has finished.
So my question is: Is there some way to make Source.unfoldAsync backpressure on a slow downstream? If not, is there an alternative that makes backpressuring possible?
I can imagine a solution that makes use of the Host-Level Client-Side API that akka-http provides, as described here, together with a cyclic graph where the response of the first request is used as input to generate the second request, but I haven't tried that yet and I'm not sure whether it would work.
EDIT: After some days of playing around and reading the docs and some blogs, I'm not sure if I was on the right track with my assumption that the backpressure behavior of Source.unfoldAsync is the root cause. To add some more observations:
When the stream is started, I see several requests going out. This is no problem in the first place, as long as the resulting HttpResponse is consumed in a timely fashion (see here for a description)
If I don't change the default response-entity-subscription-timeout, I will run into the following error (I stripped out the URLs):
[WARN] [03/30/2019 13:44:58.984] [default-akka.actor.default-dispatcher-16] [default/Pool(shared->http://....)] [1 (WaitingForResponseEntitySubscription)] Response entity was not subscribed after 1 seconds. Make sure to read the response entity body or call discardBytes() on it. GET ... Empty -> 200 OK Chunked
This leads to an IllegalStateException that terminates the stream: java.lang.IllegalStateException: Substream Source cannot be materialized more than once
I observed that the unmarshalling of the response is the slowest part of the stream, which might make sense because the response body is a Multipart document and therefore relatively large. However, I would expect this part of the stream to signal less demand to the upstream (which is the Source.unfoldAsync part in my case). This should lead to fewer requests being made.
Some googling led me to a discussion about an issue that seems to describe a similar problem. They also discuss the problems that occur when a response is not processed fast enough. The associated merge request brings documentation changes that propose completely consuming the HttpResponse before continuing with the stream. In the discussion of the issue there are also doubts about whether it's a good idea to combine Akka HTTP with Akka Streams at all. So maybe I would have to change the implementation to do the unmarshalling directly inside the function called by unfoldAsync.
According to the implementation of Source.unfoldAsync, the passed-in function is only called when the source is pulled:
def onPull(): Unit = f(state).onComplete(asyncHandler)(akka.dispatch.ExecutionContexts.sameThreadExecutionContext)
So if the downstream is not pulling (backpressuring) the function passed in to the source is not called.
In your gist you use runForeach (which is the same as runWith(Sink.foreach)), which pulls the upstream as soon as the println has finished. So it is hard to observe backpressure there.
Try changing your example to runWith(Sink.queue), which gives you a SinkQueueWithCancel as the materialized value. Then, unless you call pull on the queue, the stream is backpressured and issues no requests.
Note that there could be one or more initial requests until the backpressure propagates through all of the stream.
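For illustration, a sketch of that experiment reusing the read method from the question (initialRequest is a placeholder; an implicit Materializer is assumed in scope):

import akka.stream.scaladsl.Sink
import scala.concurrent.Future

// Materialize the stream with a pull-driven sink: no demand is signalled
// until pull() is called, so unfoldAsync is not invoked either.
val queue = read(initialRequest).runWith(Sink.queue[HttpResponse]())

// Each pull() requests exactly one element from upstream, which translates
// into (roughly) one HTTP request, modulo internal buffering.
val next: Future[Option[HttpResponse]] = queue.pull()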
I think I figured it out. As I already mentioned in the edit of my question, I found this comment on an issue in Akka HTTP, where the author says:
...it is simply not best practice to mix Akka http into a larger processing stream. Instead, you need a boundary around the Akka http parts of the stream that ensures they always consume their response before allowing the outer processing stream to proceed.
So I went ahead and tried it: instead of doing the HTTP request and the unmarshalling in different stages of the stream, I directly unmarshal the response by flatMapping the Future[HttpResponse] into a Future[Multipart.General]. This makes sure the HttpResponse is consumed immediately and avoids the Response entity was not subscribed after 1 second errors. The crawl function looks slightly different now, because it has to return the unmarshalled Multipart.General object (for further processing) as well as the original HttpResponse (to be able to construct the next request from the headers):
def crawl(reqOption: Option[HttpRequest])(implicit actorSystem: ActorSystem, materializer: Materializer, executionContext: ExecutionContext): Future[Option[(Option[HttpRequest], (HttpResponse, Multipart.General))]] = {
  reqOption match {
    case Some(request) =>
      Http().singleRequest(request)
        .flatMap(response => Unmarshal(response).to[Multipart.General].map(multipart => (response, multipart)))
        .map {
          case tuple @ (response, multipart) =>
            if (response.status.isFailure()) Some((None, tuple))
            else nextRequest(response, HttpMethods.GET).map { case (req, res) => (req, (res, multipart)) }
        }
    case None => Future.successful(None)
  }
}
The rest of the code has to change accordingly. I created another gist that contains code equivalent to the gist from the original question.
I was expecting the two Akka projects to integrate better (the docs don't mention this limitation at the moment, and instead the HTTP API seems to encourage the user to use Akka HTTP and Akka Streams together), so this feels a bit like a workaround, but it solves my problem for now. I still have to figure out some other problems I encounter when integrating this part into my larger use case, but this is not part of this question here.

How to create a stream with Scalaz-Stream?

It must be damn simple. But for some reason I cannot make it work.
If I do io.linesR(...), I have a stream of lines of the file, it's ok.
If I do Process.emitAll(...), I have a stream of pre-defined values. It also works.
But what I actually need is to produce values for scalaz-stream asynchronously (well, from an Akka actor).
I have tried:
async.unboundedQueue[String]
async.signal[String]
Then I called queue.enqueueOne(...).run or signal.set(...).run and listened to queue.dequeue or signal.discrete, just with .map and .to, in an example proved to work with another kind of stream (either with Process.emitAll or lines from the file).
What is the secret? What is the preferred way to create a channel to be streamed later? How to feed it with values from another context?
Thanks!
If the values are produced asynchronously, but in a way that can be driven from the stream, I've found it easiest to use the "primitive" await method and construct the process "by hand". You need an indirectly recursive function:

def processStep(v: Int): Process[Future, Int] =
  Process.emit(v) ++ Process.await((myActor ? NextValuePlease()).mapTo[Int])(w => processStep(w))
But if you need a truly async process, driven from elsewhere, I've never done that.
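That said, the queue approach from the question is the usual route for feeding a process from another context; a minimal sketch, assuming scalaz-stream 0.5+ (API names vary between versions):

import scalaz.concurrent.Task
import scalaz.stream.{Process, async}

val queue = async.unboundedQueue[String]

// The stream side: consume values as they arrive.
val out: Process[Task, String] = queue.dequeue

// The producer side, e.g. inside an actor's receive:
queue.enqueueOne("value").run

// Signal end-of-stream when no more values will come.
queue.close.run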

How to buffer based on time and count, but stopping the timer if no events occur

I'm producing a sequence of 50 items every three seconds. I then want to batch them in groups of at most 20 items, while waiting no more than one second before releasing the buffer.
That works great!
But since the interval never dies, Buffer keeps firing empty batches...
How can I avoid that? Sure, Where(buf => buf.Count > 0) should help, but that seems like a hack.
Observable
    .Interval(TimeSpan.FromSeconds(3))
    .Select(n => Observable.Repeat(n, 50))
    .Merge()
    .Buffer(TimeSpan.FromSeconds(1), 20)
    .Subscribe(e => Console.WriteLine(e.Count));
Output:
0-0-0-20-20-10-0-20-20-10-0-0-20-20
The Where filter you propose is a sound approach, I'd go with that.
You could wrap the Buffer and Where into a single, suitably named helper method to make the intent clearer perhaps, but rest assured the Where clause is idiomatic Rx in this scenario.
Think of it this way; an empty Buffer is relaying information that no events occurred in the last second. While you can argue that this is implicit, it would require extra work to detect this if Buffer didn't emit an empty list. It just so happens it's not information you are interested in - so Where is an appropriate way to filter this information out.
A lazy timer solution
Following from your comment ("...the timer... be[ing] lazily initiated...") you can do this to create a lazy timer and omit the zero counts:
var source = Observable.Interval(TimeSpan.FromSeconds(3))
                       .Select(n => Observable.Repeat(n, 50))
                       .Merge();

var xs = source.Publish(pub =>
    pub.Buffer(() => pub.Take(1).Delay(TimeSpan.FromSeconds(1))
                        .Merge(pub.Skip(19)).Take(1)));

xs.Subscribe(x => Console.WriteLine(x.Count));
Explanation
Publishing
This query requires subscribing to the source events multiple times. To avoid unexpected side-effects, we use Publish to give us pub which is a stream that multicasts the source creating just a single subscription to it. This replaces the older Publish().RefCount() technique that achieved the same end, effectively giving us a "hot" version of the source stream.
In this case, this is necessary to ensure the subsequent buffer closing streams produced after the first will start with the current events - if the source was cold they would start over each time. I wrote a bit about publishing here.
The main query
We use an overload of Buffer that accepts a factory function that is called for every buffer emitted to obtain an observable stream whose first event is a signal to terminate the current buffer.
In this case, we want to terminate the buffer when either the first event into the buffer has been there for a full second, or when 20 events have appeared from the source - whichever comes first.
To achieve this we Merge streams that describe each case - the Take(1).Delay(...) combo describes the first condition, and the Skip(19).Take(1) describes the second.
However, I would still test performance the easy way, because I still suspect this is overkill, but a lot depends on the precise details of the platform and scenario etc.
After using the accepted answer for quite a while, I would now suggest a different implementation (inspired by James' Skip/Take approach and this answer):
var source = Observable.Interval(TimeSpan.FromSeconds(3))
                       .Select(n => Observable.Repeat(n, 50))
                       .Merge();

var xs = source.BufferOmitEmpty(TimeSpan.FromSeconds(1), 20);

xs.Subscribe(x => Console.WriteLine(x.Count));
With an extension method BufferOmitEmpty like:
public static IObservable<IList<TSource>> BufferOmitEmpty<TSource>(this IObservable<TSource> observable, TimeSpan maxDelay, int maxBufferCount)
{
    return observable
        .GroupByUntil(x => 1, g => Observable.Timer(maxDelay).Merge(g.Skip(maxBufferCount - 1).Take(1).Select(x => 1L)))
        .Select(x => x.ToArray())
        .Switch();
}
It is "lazy" because no groups are created as long as there are no elements in the source sequence, so there are no empty buffers. As in Tom's answer, there is another nice advantage over the Buffer / Where implementation: the buffer is started when the first element arrives, so elements that follow each other within the buffer time after a quiet period are processed in the same buffer.
Why not to use the Buffer method
Three problems occurred when I was using the Buffer approach (they might be irrelevant for the scope of the question, so consider this a warning to people who, like me, use Stack Overflow answers in different contexts):
Because of the Delay one thread is used per subscriber.
In scenarios with long running subscribers elements from the source sequence can be lost.
With multiple subscribers it sometimes creates buffers with count greater than maxBufferCount.
(I can supply sample code for 2. and 3., but I'm unsure whether to post it here or in a separate question, because I cannot fully explain why it behaves this way.)
RxJS 5 has hidden features buried in its source code. It turns out this is pretty easy to achieve with bufferTime.
From the source code, the signature looks like this:
export function bufferTime<T>(this: Observable<T>, bufferTimeSpan: number, bufferCreationInterval: number, maxBufferSize: number, scheduler?: IScheduler): Observable<T[]>;
So your code would be like this:
observable.bufferTime(1000, null, 20)

Traversing lists and streams with a function returning a future

Introduction
Scala's Future (new in 2.10 and now 2.9.3) is an applicative functor, which means that if we have a traversable type F, we can take an F[A] and a function A => Future[B] and turn them into a Future[F[B]].
This operation is available in the standard library as Future.traverse. Scalaz 7 also provides a more general traverse that we can use here if we import the applicative functor instance for Future from the scalaz-contrib library.
These two traverse methods behave differently in the case of streams. The standard library traversal consumes the stream before returning, while Scalaz's returns the future immediately:
import scala.concurrent._
import ExecutionContext.Implicits.global
// Hangs.
val standardRes = Future.traverse(Stream.from(1))(future(_))
// Returns immediately.
val scalazRes = Stream.from(1).traverse(future(_))
There's also another difference, as Leif Warner observes here. The standard library's traverse starts all of the asynchronous operations immediately, while Scalaz's starts the first, waits for it to complete, starts the second, waits for it, and so on.
Different behavior for streams
It's pretty easy to show this second difference by writing a function that will sleep for a few seconds for the first value in the stream:
def howLong(i: Int) = if (i == 1) 10000 else 0

import scalaz._, Scalaz._
import scalaz.contrib.std._

def toFuture(i: Int)(implicit ec: ExecutionContext) = future {
  printf("Starting %d!\n", i)
  Thread.sleep(howLong(i))
  printf("Done %d!\n", i)
  i
}
Now Future.traverse(Stream(1, 2))(toFuture) will print the following:
Starting 1!
Starting 2!
Done 2!
Done 1!
And the Scalaz version (Stream(1, 2).traverse(toFuture)):
Starting 1!
Done 1!
Starting 2!
Done 2!
Which probably isn't what we want here.
And for lists?
Strangely enough the two traversals behave the same in this respect on lists—Scalaz's doesn't wait for one future to complete before starting the next.
Another future
Scalaz also includes its own concurrent package with its own implementation of futures. We can use the same kind of setup as above:
import scalaz.concurrent.{ Future => FutureZ, _ }

def toFutureZ(i: Int) = FutureZ {
  printf("Starting %d!\n", i)
  Thread.sleep(howLong(i))
  printf("Done %d!\n", i)
  i
}
And then we get the behavior Scalaz exhibits on streams for lists as well as streams:
Starting 1!
Done 1!
Starting 2!
Done 2!
Perhaps less surprisingly, traversing an infinite stream still returns immediately.
Question
At this point we really need a table to summarize, but a list will have to do:
Streams with standard library traversal: consume before returning; don't wait for each future.
Streams with Scalaz traversal: return immediately; do wait for each future to complete.
Scalaz futures with streams: return immediately; do wait for each future to complete.
And:
Lists with standard library traversal: don't wait.
Lists with Scalaz traversal: don't wait.
Scalaz futures with lists: do wait for each future to complete.
Does this make any sense? Is there a "correct" behavior for this operation on lists and streams? Is there some reason that the "most asynchronous" behavior—i.e., don't consume the collection before returning, and don't wait for each future to complete before moving on to the next—isn't represented here?
I cannot answer it all, but I'll try on some parts:
Is there some reason that the "most asynchronous" behavior—i.e., don't consume the collection before returning, and don't wait for each future to complete before moving on to the next—isn't represented here?
If you have dependent calculations and a limited number of threads, you can experience deadlocks. For example, if you have two futures depending on a third one (all three in the list of futures) and only two threads, you can end up in a situation where the first two futures block both threads and the third one never gets executed. (Of course, if your pool size is one, i.e. you execute one calculation after the other, you can get similar situations.)
To avoid this, you need one thread per future, without any limit. This works for small lists of futures, but not for big ones. So if you run everything in parallel, small examples will always run, while bigger ones will deadlock. (Example: developer tests run fine, production deadlocks.)
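A contrived sketch of such a deadlock with standard library futures (hypothetical: a two-thread pool, two tasks blocking on a third that can never be scheduled):

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future, Promise}
import scala.concurrent.duration.Duration

implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(2))

val third = Promise[Int]()

// These two occupy both pool threads, waiting for the third result...
val f1 = Future { Await.result(third.future, Duration.Inf) }
val f2 = Future { Await.result(third.future, Duration.Inf) }

// ...which never runs, because no thread is left to execute it.
val f3 = Future { third.success(42) }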
Is there a "correct" behavior for this operation on lists and streams?
I think it is impossible with futures. If you know more about the dependencies, or if you know for sure that the calculations will not block, a more concurrent solution might be possible. But executing lists of futures looks "broken by design" to me. The best solution seems to be one that already fails for small examples when there are deadlocks (i.e. executing one Future after the other).
Scalaz futures with lists: do wait for each future to complete.
I think Scalaz uses for comprehensions internally for traversal. With for comprehensions, it is not guaranteed that the calculations are independent, so I guess Scalaz is doing the right thing here: doing one calculation after the other. In the case of futures, this will always work, provided you have unlimited threads in your operating system.
So in other words: you see just an artifact of how for comprehensions (must) work.
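To make "one calculation after the other" concrete, here is a sketch of a sequential traversal over standard library futures, written as a foldLeft of flatMaps (an illustration, not Scalaz's actual internals):

import scala.concurrent.{ExecutionContext, Future}

def traverseSequentially[A, B](as: List[A])(f: A => Future[B])(
    implicit ec: ExecutionContext): Future[List[B]] =
  as.foldLeft(Future.successful(List.empty[B])) { (acc, a) =>
    // f(a) is only invoked once acc has completed, so the futures
    // run strictly one after the other.
    acc.flatMap(bs => f(a).map(bs :+ _))
  }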
I hope this makes some sense.
If I understand the question correctly, I think it really comes down to the semantics of streams vs lists.
Traversing a list does what we'd expect from the docs:
Transforms a TraversableOnce[A] into a Future[TraversableOnce[B]] using the provided function A => Future[B]. This is useful for performing a parallel map. For example, to apply a function to all items of a list in parallel:
With streams, it's up to the developer to decide how they want it to work, because it depends on more knowledge of the stream than the compiler has (streams can be infinite, but the type system doesn't know about it). If my stream is reading lines from a file, I want to consume it first, since chaining futures line by line wouldn't actually parallelize things; in this case, I would want the parallel approach.
On the other hand, if my stream is an infinite list generating sequential integers and hunting for the first prime greater than some large number, it would be impossible to consume the stream first in one sweep (the chained Future approach would be required, and we'd probably want to run over batches from the stream).
Rather than trying to figure out a canonical way to handle this, I wonder if there are missing types that would help make the different cases more explicit.