How to convert an Akka Source to a scala.Stream

I already have a Source[T], but I need to pass it to a function that requires a Stream[T].
I could .run the source and materialize everything to a list and then do a .toStream on the result but that removes the lazy/stream aspect that I want to keep.
Is this the only way to accomplish this or am I missing something?
EDIT:
After reading Vladimir's comment, I believe I'm approaching my issue in the wrong way. Here's a simple example of what I have and what I want to create:
// Given a start index, returns a list from startIndex until startIndex+10. Stops at 50.
def getMoreData(startIndex: Int)(implicit ec: ExecutionContext): Future[List[Int]] = {
  println(s"f running with $startIndex")
  val result: List[Int] = (startIndex until Math.min(startIndex + 10, 50)).toList
  Future.successful(result)
}
So getMoreData just emulates a service which returns data by the page to me.
My first goal is to create the following function:
def getStream(startIndex: Int)(implicit ec: ExecutionContext): Stream[Future[List[Int]]]
where the next Future[List[Int]] element in the stream depends on the previous one, taking its start index from the last index read in the previous Future's value. E.g. with a startIndex of 0:
getStream(0)(0) would return Future[List[0 until 10]]
getStream(0)(1) would return Future[List[10 until 20]]
... etc
Once I have that function, I then want to create a second function to flatten it down to a simple Stream[Int], e.g.:
def getFlattenedStream(stream: Stream[Future[List[Int]]]): Stream[Int]
Streams are beginning to feel like the wrong tool for the job and I should just write a simple loop instead. I liked the idea of streams because the consumer can map/modify/consume the stream as they see fit without the producer needing to know about it.

Scala Streams are a fine way of accomplishing your task within getStream; here is a basic way to construct what you're looking for:
def getStream(startIndex: Int)(implicit ec: ExecutionContext): Stream[Future[List[Int]]] =
  Stream
    .from(startIndex, 10)
    .map(getMoreData)
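For example, forcing the first two pages (using the global ExecutionContext just for illustration):
import scala.concurrent.ExecutionContext.Implicits.global

val pages = getStream(0)
pages(0).foreach(println) // prints List(0, 1, ..., 9)
pages(1).foreach(println) // prints List(10, 11, ..., 19)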
Where things get tricky is with your getFlattenedStream function. It is possible to eliminate the Future wrapper around your List[Int] values, but it will require an Await.result function call which is usually a mistake.
More often than not it is best to operate on the Futures and allow asynchronous operations to happen on their own. If you analyze your ultimate requirement/goal it is usually not necessary to wait on a Future.
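For instance, a minimal sketch (the sumPerPage helper is hypothetical) that transforms each page inside its Future instead of awaiting it:
def sumPerPage(stream: Stream[Future[List[Int]]])(implicit ec: ExecutionContext): Stream[Future[Int]] =
  stream.map(pageF => pageF.map(_.sum)) // map inside the Future; nothing blocks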
But if you absolutely must drop the Future, then here is code that can accomplish it:
import scala.concurrent.duration._

val awaitDuration = 10.seconds

def getFlattenedStream(stream: Stream[Future[List[Int]]]): Stream[Int] =
  stream
    .map(f => Await.result(f, awaitDuration))
    .flatten

Related

Lazy Pagination in Scala (Stream/Iterator of Iterators?)

I'm reading a very large number of records sequentially from a database API, one page at a time (with an unknown number of records per page), via a call to def readPage(pageNumber: Int): Iterator[Record]
I'm trying to wrap this API in something like either Stream[Iterator[Record]] or Iterator[Iterator[Record]] lazily, in a functional way, ideally with no mutable state and a constant memory footprint, so that I can treat it as an infinite stream of pages or sequence of Iterators and abstract away the pagination from the client. The client can iterate over the result; calling next() will retrieve the next page (Iterator[Record]).
What is the most idiomatic and efficient way to implement this in Scala?
Edit: I need to fetch & process the records one page at a time and cannot keep all the records from all pages in memory. If one page fails, an exception should be thrown. The large number of pages/records means infinite for all practical purposes. I want to treat it as an infinite stream (or iterator) of pages, with each page being an iterator over a finite number of records (e.g. fewer than 1000, but the exact number is unknown ahead of time).
I looked at BatchCursor in Monix but it serves a different purpose.
Edit 2: this is the current version, using Tomer's answer below as a starting point, but using Stream instead of Iterator.
This eliminates the need for tail recursion (per https://stackoverflow.com/a/10525539/165130) and gives O(1) time for the stream prepend operation #:: (whereas concatenating iterators via the ++ operation would be O(n)).
Note: while streams are lazily evaluated, Stream memoization may still cause memory blow-up, and memory management gets tricky. Changing from val to def to define the Stream in def pages = readAllPages below doesn't seem to have any effect.
def readAllPages(pageNumber: Int = 0): Stream[Iterator[Record]] = {
  val iter: Iterator[Record] = readPage(pageNumber)
  if (iter.isEmpty)
    Stream.empty
  else
    iter #:: readAllPages(pageNumber + 1) // #:: is lazy in its tail: the next page is fetched only on demand
}
// usage
val pages = readAllPages()
for {
  page   <- pages
  record <- page
  if isValid(record)
} process(record)
Edit 3:
The second suggestion by Tomer seems to be the best: its runtime and memory footprint are similar to the solution above, but it is much more concise and less error-prone.
val pages = Stream.from(1).map(readPage).takeWhile(_.nonEmpty)
Note: Stream.from(1) creates a stream starting from 1 and incrementing by 1; it's in the API docs.
You can try to implement the logic like this:
def readPage(pageNumber: Int): Iterator[Record] = ???

def readAllPages(pageNumber: Int): Iterator[Iterator[Record]] = {
  val iter = readPage(pageNumber)
  if (iter.nonEmpty) {
    // Compute on records
    // When finished computing:
    // Iterator's ++ is lazy in its right operand, so the recursive call is
    // deferred. (The call is not in tail position, so @tailrec would not apply.)
    Iterator(iter) ++ readAllPages(pageNumber + 1)
  } else {
    Iterator.empty
  }
}

readAllPages(0)
A shorter option would be:
Stream.from(1).map(readPage).takeWhile(_.nonEmpty)
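If the client ultimately wants a flat, lazy stream of records rather than pages, one hypothetical way to flatten it:
val records: Stream[Record] =
  Stream.from(1).map(readPage).takeWhile(_.nonEmpty).flatMap(_.toStream) // each page's Iterator is drained only when reached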

Order of execution of Future - Making sequential inserts in a db non-blocking

A simple scenario here: I am using Akka Streams to read from Kafka and write into an external store, in my case Cassandra.
The Akka Streams (reactive-kafka) library equips me with backpressure and other nifty things to make this possible.
With Kafka as the Source and Cassandra as the Sink, I get batches of events through Kafka which are, for example, Cassandra queries that are supposed to be executed sequentially (e.g. an INSERT, an UPDATE and a DELETE, which must stay in that order).
I cannot use mapAsync and execute the statements there: Future is eager, and there is a chance that the DELETE or UPDATE might get executed before the INSERT.
I am forced to use Cassandra's execute as opposed to executeAsync, which is non-blocking.
There may be no way to make a completely async solution to this issue, but is there a more elegant way to do it?
For example: make the Future lazy and sequential and offload it to a different execution context of sorts.
mapAsync gives a parallelism option as well.
Can Monix Task be of help here?
This is a general design question: what are the approaches one can take?
UPDATE:
Flow[In].mapAsync(3)(input => {
  input match {
    case INSERT => // do insert - returns future
    case UPDATE => // do update - returns future
    case DELETE => // do delete - returns future
  }
})
The scenario is a little more complex. There could be thousands of inserts, updates and deletes coming in order for specific key(s) (in Kafka).
I would ideally want to execute the 3 futures of a single key in sequence. I believe Monix's Task can help?
If you process things with parallelism of 1, they will get executed in strict sequence, which will solve your problem.
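In Akka Streams terms that would be something like this sketch (In, Out and databaseOp are stand-ins, borrowed from the answer further down):
import akka.NotUsed
import akka.stream.scaladsl.Flow

// With parallelism 1, mapAsync waits for each Future to complete
// before pulling the next element, so effects run strictly in sequence.
val strictlySequential: Flow[In, Out, NotUsed] =
  Flow[In].mapAsync(1)(databaseOp)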
But that's not interesting. If you want, you can run operations for different keys in parallel - if processing for different keys is independent, which, I assume from your description, is possible. To do this, you have to buffer the incoming values and then regroup them. Let's see some code:
import monix.reactive.Observable
import monix.eval.Task
import scala.concurrent.duration._

// Your domain logic - I'll use these stubs
trait Event
trait Acknowledgement // whatever your DB functions return, if you need it

def toKey(e: Event): String = ???

def processOne(event: Event): Task[Acknowledgement] = Task.deferFuture {
  event match {
    case _ => ??? // insert/update/delete
  }
}

// Monix Task.traverse is strictly sequential, which is what you need
def processMany(evs: Seq[Event]): Task[Seq[Acknowledgement]] =
  Task.traverse(evs)(processOne)

def processEventStreamInParallel(source: Observable[Event]): Observable[Acknowledgement] =
  source
    // Process a bunch of events, but don't wait too long for the whole 100. Fine-tune for your data source.
    .bufferTimedAndCounted(2.seconds, 100)
    .concatMap { batch =>
      Observable
        .fromIterable(batch.groupBy(toKey).values) // Standard collection methods FTW
        .mapAsync(3)(processMany) // processing up to 3 different keys in parallel - though 3 is not necessary, it depends on your DB throughput
        .flatMap(Observable.fromIterable) // flattening it back
    }
The concatMap operator here will ensure that your chunks are processed sequentially as well. So even if one buffer has key1 -> insert, key1 -> update and the other has key1 -> delete, that causes no problems. In Monix, this is the same as flatMap, but in other Rx libraries flatMap might be an alias for mergeMap which has no ordering guarantee.
This can be done with Futures too, though there's no standard "sequential traverse", so you have to roll your own, something like:
def processMany(evs: Seq[Event]): Future[Seq[Acknowledgement]] =
  evs.foldLeft(Future.successful(Vector.empty[Acknowledgement])) { (acksF, ev) =>
    for {
      acks <- acksF
      next <- processOne(ev) // here processOne returns a Future instead of a Task
    } yield acks :+ next
  }
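Each step of the fold only constructs the next Future inside the for-comprehension's flatMap, i.e. after the previous one has completed, which is exactly what enforces the sequential execution.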
You can use akka-streams subflows to group by key, and then merge the substreams if you want to do something with what you get from your database operations:
def databaseOp(input: In): Future[Out] = input match {
  case INSERT => ...
  case UPDATE => ...
  case DELETE => ...
}

val databaseFlow: Flow[In, Out, NotUsed] =
  Flow[In].groupBy(Int.MaxValue, _.key).mapAsync(1)(databaseOp).mergeSubstreams
Note that, unlike a plain mapAsync stage, the order of the input source won't be kept in the output, but all operations on the same key will still be in order.
You are looking for Future.flatMap:
def doSomething: Future[Unit]
def doSomethingElse: Future[Unit]
val result = doSomething.flatMap { _ => doSomethingElse }
This executes the first function and then, when its Future is satisfied, starts the second one. The result is a new Future that completes when the second Future completes.
The result of the first future is passed into the function you give to .flatMap, so the second function can depend on the result of the first one. For example:
def getUserID: Future[Int]
def getUser(id: Int): Future[User]
val userName: Future[String] = getUserID.flatMap(getUser).map(_.name)
You can also write this as a for-comprehension:
for {
  id   <- getUserID
  user <- getUser(id)
} yield user.name
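The compiler desugars this for-comprehension into exactly the flatMap/map chain shown above.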

Understanding NotUsed and Done

I am having a hard time understanding the purpose and significance of NotUsed and Done in Akka Streams.
Let us see the following 2 simple examples:
Using NotUsed:
implicit val system = ActorSystem("akka-streams")
implicit val materializer = ActorMaterializer()

val myStream: RunnableGraph[NotUsed] =
  Source.single("stackoverflow")
    .map(s => s.toUpperCase())
    .to(Sink.foreach(println))

val runResult: NotUsed = myStream.run()
Using Done:
implicit val system = ActorSystem("akka-streams")
implicit val materializer = ActorMaterializer()

val myStream: RunnableGraph[Future[Done]] =
  Source.single("stackoverflow")
    .map(s => s.toUpperCase())
    .toMat(Sink.foreach(println))(Keep.right)

val runResult: Future[Done] = myStream.run()
When I run these examples, I get the same output in both cases:
STACKOVERFLOW //output
So what exactly are NotUsed and Done? What are the differences, and when should I prefer one over the other?
First of all, the choice you are making is between NotUsed and Future[Done] (not just Done).
Now, you are essentially deciding the materialized value of your graph, by using the different combinators (to and toMat with Keep.right).
The materialized value is a way to interact with your stream while it's running. This choice does not affect the data processed by your stream, and for this reason you see the same output in both cases. The same element (the string "stackoverflow") goes through both streams.
The choice depends on what your main program is supposed to do after running the stream:
- in case you are not interested in interacting with it, NotUsed is the right choice. It is just a dummy object, and it conveys the information that no interaction with the stream is allowed nor needed.
- in case you need to listen for the completion of the stream to perform some other action, you need to expose the Future[Done]. This way you can attach a callback to it using (e.g.) onComplete or map, as in the sketch below.
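For instance, a minimal sketch of the second case, reusing the Future[Done] example above (system.dispatcher supplies the ExecutionContext for the callback):
import scala.util.{Failure, Success}
import system.dispatcher

runResult.onComplete {
  case Success(Done) => println("stream completed")
  case Failure(e)    => println(s"stream failed: $e")
}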

Scala fs2 Streams with Chunks and Tasks?

// Simulated external API that synchronously returns elements one at a time indefinitely.
def externalApiGet[A](): A = ???
// This wraps with the proper fs2 stream that will indefinitely return values.
def wrapGetWithFS2[A](): Stream[Task, A] = Stream.eval(Task.delay(externalApiGet))
// Simulated external API that synchronously returns "chunks" of elements at a time indefinitely.
def externalApiGetSeq[A](): Seq[A] = ???
// How do I wrap this with a stream that hides the internal chunks and just provides a stream of A values.
// The following doesn't compile. I need help fixing this.
def wrapGetSeqWithFS2[A](): Stream[Task, A] = Stream.eval(Task.delay(externalApiGetSeq))
You need to lift each sequence into a stream (Stream.emits does this, treating the Seq as a single chunk) and then use flatMap to flatten the stream of streams.
def wrapGetSeqWithFS2[A](): Stream[Task, A] =
  Stream.eval(Task.delay(externalApiGetSeq()))
    .flatMap(Stream.emits)
(Edited to simplify solution)
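Note that Stream.eval runs its effect once. If, as the comments in the question suggest, the API should be polled indefinitely, one hypothetical variant uses fs2's repeat combinator:
def wrapGetSeqWithFS2Forever[A](): Stream[Task, A] =
  Stream.eval(Task.delay(externalApiGetSeq()))
    .flatMap(Stream.emits)
    .repeat // re-runs the effect each time the inner stream is exhausted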

Transform `Future[Option[X]]` into `Option[Future[X]]`

How would one transform a Future[Option[X]] into an Option[Future[X]]?
val futOpt: Future[Option[Int]] = Future(Some(1))
val optFut: Option[Future[Int]] = ?
Update:
This is a follow-up to this question. I suppose I'm trying to get a grasp on elegantly transforming nested futures. I'm trying to achieve with Options what can be done with Sequences, where you turn a Future[Seq[Future[Seq[X]]]] into a Future[Future[Seq[Seq[X]]]] and then flatMap the double layers. As Ionut has clarified, I phrased the question in flipped order; it was supposed to be Option[Future[X]] -> Future[Option[X]].
Unfortunately, this isn't possible without losing the non-blocking properties of the computation. It's pretty simple if you think about it: you don't know whether the result of the computation is None or Some until the Future has completed, so you have to Await on it. At that point, it makes no sense to have a Future anymore; you can simply return the Option[X], as the Future has already completed.
Take a look at the code below. It always returns Future.successful, which does no computation, just wraps o in a Future for no good reason.
def transform[A](f: Future[Option[A]]): Option[Future[A]] =
  Await.result(f, 2.seconds).map(o => Future.successful(o))
So, if in your context it makes sense to block, you're better off using this:
def transform[A](f: Future[Option[A]]): Option[A] =
  Await.result(f, 2.seconds)
Response to comments:
def transform[A](o: Option[Future[A]])(implicit ec: ExecutionContext): Future[Option[A]] =
  o.map(f => f.map(Option(_))).getOrElse(Future.successful(None))
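A quick usage sketch (values are just for illustration):
import scala.concurrent.ExecutionContext.Implicits.global

val some: Future[Option[Int]] = transform(Some(Future.successful(1))) // completes with Some(1)
val none: Future[Option[Int]] = transform(None)                       // completes with None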
Technically you could do this - in the snippet below I use a context bound to a Monoid so my types hold:
def fut2Opt[A: Monoid](futOpt: Future[Option[A]])(implicit ec: ExecutionContext): Option[Future[A]] =
  Try(futOpt.flatMap {
    case Some(a) => Future.successful(a)
    case None    => Future.fromTry(Try(Monoid[A].empty))
  }).toOption
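A usage sketch, assuming the Monoid instance comes from cats (where Monoid[Int].empty is 0):
import cats.Monoid
import cats.instances.int._
import scala.concurrent.ExecutionContext.Implicits.global

val a: Option[Future[Int]] = fut2Opt(Future.successful(Some(41))) // Some of a Future completing with 41
val b: Option[Future[Int]] = fut2Opt(Future.successful(None))     // Some of a Future completing with 0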