I'm reading a very large number of records sequentially from a database API, one page at a time (with an unknown number of records per page), via a call to def readPage(pageNumber: Int): Iterator[Record]
I'm trying to wrap this API lazily in something like either Stream[Iterator[Record]] or Iterator[Iterator[Record]], in a functional way, ideally with no mutable state and a constant memory footprint, so that I can treat it as an infinite stream of pages (or a sequence of Iterators) and abstract the pagination away from the client. The client can iterate over the result; each call to next() retrieves the next page (an Iterator[Record]).
What is the most idiomatic and efficient way to implement this in Scala?
Edit: I need to fetch and process the records one page at a time and cannot keep all the records from all pages in memory. If one page fails, an exception should be thrown. The large number of pages/records makes the input infinite for all practical purposes. I want to treat it as an infinite stream (or iterator) of pages, with each page being an iterator over a finite number of records (e.g. fewer than 1000, but the exact number is unknown ahead of time).
I looked at BatchCursor in Monix but it serves a different purpose.
Edit 2: this is the current version, using Tomer's answer below as a starting point, but using Stream instead of Iterator.
This eliminates the need for tail recursion, as per https://stackoverflow.com/a/10525539/165130, and gives O(1) time for the stream prepend operation #:: (whereas concatenating iterators via the ++ operation would be O(n)).
Note: while streams are lazily evaluated, Stream memoization may still cause memory to blow up, and memory management gets tricky. Changing val to def to define the Stream in def pages = readAllPages below doesn't seem to have any effect.
def readAllPages(pageNumber: Int = 0): Stream[Iterator[Record]] = {
  val iter: Iterator[Record] = readPage(pageNumber)
  if (iter.isEmpty)
    Stream.empty
  else
    iter #:: readAllPages(pageNumber + 1)
}
//usage
val pages = readAllPages()
for {
  page <- pages
  record <- page
  if isValid(record)
} process(record)
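One way to reduce the risk of memoization retaining every fetched page is to avoid binding the stream to a stable val and to consume it in a single pass. A minimal sketch (whether earlier pages actually become collectible still depends on the Scala version and how the traversal is compiled):

// consume the stream without keeping a named reference to its head
readAllPages().foreach { page =>
  page.filter(isValid).foreach(process)
}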
Edit 3:
The second suggestion by Tomer seems to be the best: its runtime and memory footprint are similar to the above solution, but it is much more concise and less error-prone.
val pages = Stream.from(1).map(readPage).takeWhile(_.nonEmpty)
Note: Stream.from(1) creates a stream starting from 1 and incrementing by 1; it's in the API docs.
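If memoization remains a concern, an Iterator-based variant of the same one-liner is a possible alternative, since an iterator keeps no reference to pages that have already been consumed (a sketch using the same readPage):

val pageIterator: Iterator[Iterator[Record]] =
  Iterator.from(1).map(readPage).takeWhile(_.nonEmpty)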
You can try implementing logic like this:
def readPage(pageNumber: Int): Iterator[Record] = ???

def readAllPages(pageNumber: Int): Iterator[Iterator[Record]] = {
  val iter = readPage(pageNumber)
  if (iter.nonEmpty) {
    // Compute on records
    // When finished computing:
    // Note: the recursive call is not in tail position; Iterator.++ takes its
    // argument by name, so it is only evaluated once the first iterator is exhausted.
    Iterator(iter) ++ readAllPages(pageNumber + 1)
  } else {
    Iterator.empty
  }
}

readAllPages(0)
A shorter option would be:
Stream.from(1).map(readPage).takeWhile(_.nonEmpty)
How do I get the k-th minimum element of a Priority Queue in Scala?
I tried the following but it seems to be wrong!
import collection.mutable

object Main {
  def main(args: Array[String]): Unit = {
    val asc = Ordering.by((_: (Double, Vector[Double]))._1).reverse
    val pq = mutable.PriorityQueue.empty[(Double, Vector[Double])](asc)
    pq.enqueue(12.4 -> Vector(22.0, 3.4))
    pq.enqueue(1.2 -> Vector(2.3, 3.2))
    pq.enqueue(9.1 -> Vector(12.0, 3.2))
    pq.enqueue(32.4 -> Vector(22.0, 13.4))
    pq.enqueue(13.2 -> Vector(32.3, 23.2))
    pq.enqueue(93.1 -> Vector(12.0, 43.2))
    val k = 3
    val kthMinimum = pq.take(k).last
    println(kthMinimum)
  }
}
It's explicitly stated in the Scala API docs:
Only the dequeue and dequeueAll methods will return elements in
priority order (while removing elements from the heap). Standard
collection methods including drop, iterator, and toString will remove
or traverse the heap in whichever order seems most convenient.
If you want to stick with PriorityQueue, it seems dequeue-ing k times or indexing into the result of pq.dequeueAll (e.g. pq.dequeueAll(k-1)) are the only ways to achieve retrieval in priority order.
The problem is the incompatibility between PriorityQueue properties and inherited collection methods like take. It is another example of odd implementation corners in the Scala collections.
The same problem exists with Java's PriorityQueue.
import java.util.PriorityQueue
val pQueue = new PriorityQueue[Integer]
pQueue.add(10)
pQueue.add(20)
pQueue.add(4)
pQueue.add(15)
pQueue.add(9)
val iter = pQueue.iterator()
iter.next() // 4
iter.next() // 9
iter.next() // 10
iter.next() // 20
iter.next() // 15
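For completeness, the reliable way on the Java side is likewise to poll k times on a copy. A sketch (the copy constructor and poll are standard java.util.PriorityQueue API):

val copy = new PriorityQueue[Integer](pQueue)
val kthMin = (1 to 3).map(_ => copy.poll()).last // 10, the 3rd smallest of the values above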
PriorityQueue maintains your data in an underlying ArrayBuffer (not exactly, but a special inherited class). This "array" is kept heapified, and the inherited take API works on top of this heapified array-like data structure. The first k elements of a min-heapified array are certainly not the same as the minimum k elements.
Now, by definition a PriorityQueue is supposed to support enqueue and dequeue. It only maintains the highest-priority (first) element and is simply incapable of reliably providing the k-th element in the queue.
Although I say this is a problem with both the Java and Scala implementations, it's just not possible to come up with a sane implementation for this. I just wonder why these misleading methods are still present in PriorityQueue implementations. Can't we just remove them?
I strongly suggest staying with the strictest API suited to your requirement. In other words, stick with the Queue API and avoid the inherited collection methods (which might do weird things).
That said, there is no really good way of doing it (as it is not something a PriorityQueue is required to support).
You can achieve it by simply dequeueing k times in a loop, with time complexity O(k log n).
val kThMinimum = {
  val pqc = pq.clone()
  (1 until k).foreach(_ => pqc.dequeue())
  pqc.dequeue()
}
I'm using a service that returns paginated resources. It exposes one single call, which is defined by the following interface:
trait Service {
  def getPage(pageSize: Int, pageCursor: String): AsyncPage[Resource]
}
The getPage function returns an AsyncPage[T] object, which is implemented like this:
/**
 * A page of contents that are retrieved asynchronously from their origin
 *
 * @param content        The resources contained in the page
 * @param nextPageCursor The token representing the next page, or empty if there are no more pages to consume
 * @tparam T The type of resource within the page
 */
case class AsyncPage[T](
  content: Future[Iterable[T]],
  nextPageCursor: Future[String]
)
The contents of the page are retrieved asynchronously from whichever storage system the service uses.
Because of the needs of my application, I don't really care about pages. I'd like to code something that allows me to consume the resources of the service as if it were a single Iterable[T].
However, I want to maintain the laziness of the service. I don't want to request more pages than necessary; that is, I don't want to request the next page until I have consumed all the elements of the previous one.
Whenever I have consumed the whole Iterable[T] of one page, I want the code to request the following page using the getPage(...) function, passing as the pageCursor parameter the nextPageCursor of the last page.
Can you guide me on how to achieve that?
Well, if you don't mind blocking, you can do something like this:
import scala.collection.AbstractIterator
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

class FutureIter[+P](fu: => Future[Iterator[P]]) extends AbstractIterator[P] {
  lazy val iter = Await.result(fu, Duration.Inf)
  def hasNext = iter.hasNext
  def next = iter.next
}
def fold[T](fs: Stream[Future[Iterator[T]]]): Iterator[T] = fs match {
  case hd #:: tail => new FutureIter(hd) ++ fold(tail)
  case _           => Iterator.empty
}
val pages = Stream
  .iterate(getPage(size, "")) { page =>
    // blocks to obtain each next cursor, consistent with the blocking approach
    getPage(size, Await.result(page.nextPageCursor, Duration.Inf))
  }
  .map(page => page.content.map(_.iterator))

val result: Iterator[T] = fold(pages)
This will block before the first page, and at the end of each subsequent page to load the next batch. I don't think there is a way to do this without blocking, because you can't tell where the page ends until the future is satisfied.
Also, note that the iterator this code produces is infinite, because you didn't mention any criteria when to stop looking for more pages. You can tuck some .takeWhile call onto the pages to correct that.
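For instance, assuming an empty page signals the end, a sketch of such a cut-off (it blocks to inspect each page, consistent with the blocking approach above; hasNext does not consume any element):

val finitePages = pages.takeWhile(fu => Await.result(fu, Duration.Inf).hasNext)
val finiteResult: Iterator[T] = fold(finitePages)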
You may also want to replace Stream with Iterator so that pages you are done with get discarded immediately, rather than being cached. I just used Stream because it makes fold a little nicer (you can't match on iterators; you would have to use an ugly if (it.hasNext) ... instead).
BTW, fold looks recursive, but it actually is not: ++ is lazy, so the fold(tail) piece will not be executed until you have iterated all the way through the left-hand side, well after the outer fold is off the stack.
Since you mentioned akka, you could create a Source[T], which can sort of be thought of as an async Iterable[T]:
Source.unfoldAsync[String, T](startPageCursor) { cursor =>
  val page = getPage(pageSize, cursor)
  for {
    nextCursor <- page.nextPageCursor
    it         <- page.content
  } yield Some((nextCursor, it))
}.mapConcat(identity)
This is much cleaner and completely non-blocking. But it is up to your use case if this is suitable.
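As a hypothetical usage sketch (handle and the name resourceSource are assumptions, not part of the question; in recent Akka versions an implicit ActorSystem provides the materializer):

import akka.Done
import akka.actor.ActorSystem

implicit val system: ActorSystem = ActorSystem("paged-service")
// resourceSource is the Source built with unfoldAsync above
val done: Future[Done] = resourceSource.runForeach(resource => handle(resource))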
Given I have a very long-running stream of events flowing through something as shown below, after a long time there will be lots of sub-streams created that are no longer needed.
Is there a way to clean up a specific substream at a given time, for
example the substream created by id 3 should be cleaned up and the state
in the scan method lost at 13:00 (the expires property of Wid)?
case class Wid(id: Int, v: String, expires: LocalDateTime)

test("Substream with scan") {
  val (pub, sub) = TestSource.probe[Wid]
    .groupBy(Int.MaxValue, _.id)
    .scan("")((a: String, b: Wid) => a + b.v)
    .mergeSubstreams
    .toMat(TestSink.probe[String])(Keep.both)
    .run()
}
TL;DR You can close a substream after some time. However, using input to dynamically set the time with built-in stages is another matter.
Closing a substream
To close a flow, you usually complete it (from upstream), but you can also cancel it (from downstream). For instance, the take(n: Int) flow will cancel once n elements have gone through.
Now, in the groupBy case, you cannot complete a substream, since upstream is shared by all substreams, but you can cancel it. How to do so depends on what condition you want to put on its end.
However, be aware that groupBy drops inputs for subflows that have already been closed: if a new element with id 3 comes from upstream to the groupBy after the 3-substream has been closed, it will simply be ignored and the next element will be pulled in. The reason for this is probably that some elements might be lost in the process between closing and re-opening of the substream. Also, if your stream is supposed to run for a very long time, this will affect performance, because each element will be checked against the list of closed substreams before being forwarded to the relevant (live) substream. You might want to implement your own stateful filter (say, with a Bloom filter) if you're not satisfied with its performance.
To close a substream, I usually use either take (if you want only a given number of elements, but that's probably not the case on an infinite stream), or some kind of timeout: completionTimeout if you want a fixed time from materialization to closure, or idleTimeout if you want to close when no elements have come through for some time. Note that these flows do not cancel the stream but fail it, so you have to catch the exception with a recover or recoverWith stage to change the failure into a cancel (recoverWith allows you to cancel without sending any last element, by recovering with Source.empty).
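A sketch of the fixed-timeout variant applied to the example from the question (idleTimeout and recoverWith are standard flow operators; the 10-minute value is an arbitrary assumption):

import java.util.concurrent.TimeoutException
import scala.concurrent.duration._

TestSource.probe[Wid]
  .groupBy(Int.MaxValue, _.id)
  .scan("")((a: String, b: Wid) => a + b.v)
  .idleTimeout(10.minutes)                                  // fail the substream after 10 idle minutes
  .recoverWith { case _: TimeoutException => Source.empty } // turn the failure into a clean cancel
  .mergeSubstreams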
Dynamically set the timeout
Now what you want is to set the closing time dynamically, according to the first element that passes through. This is more complicated because materialization of streams is independent of the elements that pass through them. Indeed, in the usual case (without groupBy), streams are materialized before any element goes through them, so it makes no sense to use elements to materialize them.
I had similar issues in that question, and ended up using a modified version of groupBy with signature
paramGroupBy[K, OO, MM](maxSubstreams: Int, f: Out => K, paramSubflow: K => Flow[Out, OO, MM])
that allows defining every substream using the key that defines it. This can be modified to take the first element (instead of the key) as the parameter.
Another (probably simpler, in your case) way would be to write your own stage that does exactly what you want: get the end time from the first element and cancel the stream at that time. Here is an example implementation (I used a scheduler instead of keeping state):
import akka.stream.{Attributes, FlowShape, Inlet, Outlet}
import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler, OutHandler, TimerGraphStageLogic}
import scala.concurrent.duration.FiniteDuration

object CancelAfterTimer

class CancelAfter[T](getTimeout: T => FiniteDuration) extends GraphStage[FlowShape[T, T]] {
  val in = Inlet[T]("CancelAfter.in")
  val out = Outlet[T]("CancelAfter.out")
  override val shape: FlowShape[T, T] = FlowShape(in, out)

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new TimerGraphStageLogic(shape) with InHandler with OutHandler {
      override def onPush(): Unit = {
        val elem = grab(in)
        // start the timer on the first element only
        if (!isTimerActive(CancelAfterTimer))
          scheduleOnce(CancelAfterTimer, getTimeout(elem))
        push(out, elem)
      }

      override def onTimer(timerKey: Any): Unit =
        completeStage() // this will cancel the upstream and complete the downstream

      override def onPull(): Unit = pull(in)

      setHandlers(in, out, this)
    }
}
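A hypothetical way to wire this stage into the test from the question, deriving each substream's timeout from the first element's expires field (the ChronoUnit-based duration computation is an assumption about how the expiry is encoded):

import java.time.LocalDateTime
import java.time.temporal.ChronoUnit
import scala.concurrent.duration._

TestSource.probe[Wid]
  .groupBy(Int.MaxValue, _.id)
  .via(new CancelAfter[Wid](w => ChronoUnit.MILLIS.between(LocalDateTime.now, w.expires).millis))
  .scan("")((a: String, b: Wid) => a + b.v)
  .mergeSubstreams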
For an immutable flavour, Iterator does the job.
val x = Iterator.fill(100000)(someFn)
Now I want to implement a mutable version of Iterator, with three guarantees:
thread-safe on all transformations (fold, foldLeft, ...) and append
lazily evaluated
traversable only once! Once used, an object from this Iterator should be destroyed.
Is there an existing implementation to give me these guarantees? Any library or framework example would be great.
Update
To illustrate the desired behaviour.
class SomeThing {}

class Test(val list: Iterator[SomeThing]) {
  def add(thing: SomeThing): Test = {
    new Test(list ++ Iterator(thing))
  }
}

new Test(Iterator.empty).add(new SomeThing).add(new SomeThing)
In this example, SomeThing is an expensive construct; it needs to be lazy.
Re-iterating over list is never required, so Iterator is a good fit.
This is supposed to asynchronously and lazily sequence 10 million SomeThing instances without depleting the executor (a cached thread pool executor) or running out of memory.
You don't need a mutable Iterator for this, just daisy-chain the immutable form:
class SomeThing {}

case class Test(list: Iterator[SomeThing]) {
  def add(thing: => SomeThing) = Test(list ++ Iterator(thing))
}

Test(Iterator.empty).add(new SomeThing).add(new SomeThing)
Although you don't really need the extra boilerplate of Test here:
Iterator(new SomeThing) ++ Iterator(new SomeThing)
Note that Iterator.++ takes a by-name param, so the ++ operation is already lazy.
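A quick way to see that laziness (a sketch; the println is only there to show when the right-hand side gets built):

val it = Iterator(1) ++ { println("building the second iterator"); Iterator(2) }
it.next() // 1, nothing printed yet
it.next() // prints "building the second iterator", then returns 2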
You might also want to try this, to avoid building intermediate Iterators:
Iterator.continually(new SomeThing) take 2
UPDATE
If you don't know the size in advance, then I'll often use a tactic like this:
def mkSomething = if(cond) Some(new Something) else None
Iterator.continually(mkSomething) takeWhile (_.isDefined) map { _.get }
The trick is to have your generator function wrap its output in an Option, which then gives you a way to flag that the iteration is finished by returning None.
Of course... If you're really pushing the boat out, you can even use the dreaded null:
def mkSomething = if(cond) { new Something } else null
Iterator.continually(mkSomething) takeWhile (_ != null)
It seems like you need to hide the fact that the iterator is mutable, while still allowing it to grow mutably. What I'm going to propose is the same sort of trick I've used to speed up ::: in the past:
abstract class AppendableIterator[A] extends Iterator[A] {
  protected var inner: Iterator[A]

  def hasNext = inner.hasNext
  def next() = inner.next()

  def append(that: Iterator[A]) = synchronized {
    inner = new JoinedIterator(inner, that)
  }
}

// You might need to add some more things, this is a skeleton
class JoinedIterator[A](first: Iterator[A], second: Iterator[A]) extends Iterator[A] {
  def hasNext = first.hasNext || second.hasNext
  def next() =
    if (first.hasNext) first.next()
    else if (second.hasNext) second.next()
    else Iterator.empty.next()
}
So what you're really doing is leaving the Iterator at whatever point in its iteration it might be, while still preserving the thread safety of the append by "joining" another Iterator in non-destructively. You avoid the need to recompute the two together because you never actually force them through a CanBuildFrom.
This is also a generalization of just adding one item: you can always wrap a single A in a one-element Iterator[A] if you so choose.
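A minimal sketch of using the skeleton above (the anonymous subclass just supplies the initial inner iterator):

val it = new AppendableIterator[Int] { protected var inner: Iterator[Int] = Iterator(1, 2) }
it.append(Iterator(3)) // appending a single element = appending a one-element Iterator
it.toList              // List(1, 2, 3)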
Have you looked at the mutable.ParIterable in the collection.parallel package?
To access an iterator over elements you can do something like
val x = ParIterable.fill(100000)(someFn).iterator
From the docs:
Parallel operations are implemented with divide and conquer style algorithms that parallelize well. The basic idea is to split the collection into smaller parts until they are small enough to be operated on sequentially.
...
The higher-order functions passed to certain operations may contain side-effects. Since implementations of bulk operations may not be sequential, this means that side-effects may not be predictable and may produce data-races, deadlocks or invalidation of state if care is not taken. It is up to the programmer to either avoid using side-effects or to use some form of synchronization when accessing mutable data.
I'm writing an application server with a message-sending loop. A message is composed of fields and thus can be viewed as an iterator over its fields. There is a message queue that is processed by the message loop, but the loop is breakable at any time (e.g. when the socket buffer is full) and can be resumed later. The current implementation looks like:
private val messageQueue: Queue[Iterator[Field]]

var sent = 0
try breakable {
  for (iterator <- messageQueue) {
    for (field <- iterator) {
      ... breakable ...
    }
    sent += 1
  }
} finally messageQueue.trimStart(sent)
This works and is not bad, but then I thought I could make the code a bit cleaner by replacing the queue with an iterator that concatenates the iterators using the ++ operator. That is:
private val messageQueue: Iterator[Field] = message1.iterator ++ message2.iterator ++ ...

breakable {
  for (field <- messageQueue) {
    ... breakable ...
  }
}
Now the code looks much cleaner, but there's a performance issue. Concatenated iterators form an (unbalanced) tree internally, so the next() operation takes O(n) time, and the iteration takes O(n^2) time overall.
To summarize, the messages need to be processed just once, so the queue doesn't need to be a Traversable; an Iterator (TraversableOnce) would do. I'd like to view the message queue as a collection of consecutive iterators, but ++ has a performance issue. Is there a nice solution that makes the code cleaner but is efficient at the same time?
What if you just flatten them?
def flattenIterator[T](l: List[Iterator[T]]): Iterator[T] = l.iterator.flatten
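For example (a sketch reusing message1 and message2 from the question), the queue then becomes a single flat iterator, and since flatten walks each inner iterator in turn rather than a chain of ++ nodes, next() stays O(1) no matter how many messages are queued:

val messageQueue: Iterator[Field] = flattenIterator(List(message1.iterator, message2.iterator))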
Have you thought about using Stream and #::: to lazily concatenate your messages together?
private val messageQueue: Stream[Field] = message1.toStream #::: message2.toStream #::: ...

breakable {
  for (field <- messageQueue) {
    ... breakable ...
  }
}
As for the time complexity here, I believe it would be O(n) in the number of iterators you're concatenating (you need to call toStream for each iterator and #::: them together). However, the individual toStream and #::: operations should be O(1) since they're lazy. Here's the toStream implementation for Iterator:
def toStream: Stream[A] =
  if (self.hasNext) Stream.cons(self.next, self.toStream)
  else Stream.empty[A]
This will take constant time because the 2nd argument to Stream.cons is call-by-name, so it won't get evaluated until you actually access the tail.
However, the conversion to Stream will add a constant factor of overhead for each element access, i.e. instead of just calling next on the iterator it will have to do a few extra method calls to force the lazy tail of the stream and access the contained value.