Scala paging: iterating over a legacy Java method until a condition is met

In Scala code, I need to loop over a legacy paginated Java endpoint until no more calls are required.
Naive pseudo-solution:
var data: Seq[Data] = Seq.empty
while (data.length < 1000) { // paging.limit is 1000
  // getData returns a java.util.List[Data]
  val newData = getData(paging.offset, paging.limit).asScala
  data = data ++ newData
  // (advance paging.offset here)
}
What is the more idiomatic Scala way to do it?

A simple way to repeat a task an indefinite number of times is to use LazyList:
// Generates a list with an unknown number of elements (fewer than 10)
def getData(pageNumber: Int): List[Int] =
  List.fill(scala.util.Random.between(1, 10))(scala.util.Random.between(1, 100))

LazyList
  .from(0)      // create an infinite LazyList and track the page number
  .map(getData)
  .flatten      // flatten is needed because getData returns a list; map + flatten can be combined into flatMap.
                // If it returned a single element, flatten would not be needed.
  .take(1000)   // take at most 1000 elements; because the list is lazy, it will never fetch more than that
  .toList       // materialize the LazyList
EDIT
For Scala 2.12 and older, use Stream.
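Applied to the original paginated endpoint, a minimal sketch could look like the following (assuming getData(offset, limit) returns a java.util.List[Data] and that an empty page signals the end):
import scala.jdk.CollectionConverters._

val limit = 1000
val allData: List[Data] =
  LazyList
    .from(0)                                              // page numbers 0, 1, 2, ...
    .map(page => getData(page * limit, limit).asScala.toList)
    .takeWhile(_.nonEmpty)                                // stop at the first empty page
    .flatten
    .toList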

Related

How to generate the materialized value from the elements in Source or Flow?

Suppose there is a source of type Source[Int, NotUsed]. How can this be turned into a Source[Int, T] where the materialized value T is computed based on the elements of the source?
Example: I would like to sum the elements of the stream; how do I implement dumbFlow so that result is 6 instead of 42?
val dumbFlow = ??? //Flow[Int].mapMaterializedValue(_ => 42)
//code below cannot be changed
val source = Source(List(1, 2, 3)).viaMat(dumbFlow)(Keep.right)
val result = source.toMat(Sink.ignore)(Keep.left).run()
//result: Int = 42
I know how to achieve the same result using Sink.fold or Sink.head but I need the materialization logic in the Source; cannot change .to(Sink.ignore).
Strictly speaking, the materialized value is always computed (including any mapMaterializedValue/toMat/viaMat etc.) before a single element goes through the stream and thus cannot depend on the elements of the stream.
If the materialized value happens to be a Future (in the Scala API), the future can be constructed (though not yet completed) and the stream can complete the Future based on the elements. In general, such Future materialized values come from sinks (e.g., as you note, Sink.fold/Sink.head).
The alsoTo operator on a Source or Flow lets you embed a Sink to the side of a Source/Flow. It has an alsoToMat companion which lets you combine the Sink's materialized value with the Source/Flow's.
So one could have:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

val summingSink = Sink.fold[Int, Int](0)(_ + _)
val dumbFlow: Flow[Int, Int, Future[Int]] =
  Flow[Int].alsoToMat(summingSink)(Keep.right)
val result: Future[Int] = source.toMat(Sink.ignore)(Keep.left).run()
result.foreach(println) // will eventually print 6

Lazy Pagination in Scala (Stream/Iterator of Iterators?)

I'm reading a very large number of records sequentially from a database API, one page at a time (with an unknown number of records per page), via a call to def readPage(pageNumber: Int): Iterator[Record]
I'm trying to wrap this API in something like Stream[Iterator[Record]] or Iterator[Iterator[Record]], lazily and in a functional way, ideally with no mutable state and a constant memory footprint, so that I can treat it as an infinite stream of pages or a sequence of iterators, and abstract the pagination away from the client. The client can iterate over the result; each call to next() retrieves the next page (Iterator[Record]).
What is the most idiomatic and efficient way to implement this in Scala?
Edit: I need to fetch and process the records one page at a time and cannot keep all the records from all pages in memory. If one page fails, an exception should be thrown. The large number of pages/records means infinite for all practical purposes. I want to treat it as an infinite stream (or iterator) of pages, with each page being an iterator over a finite number of records (e.g. fewer than 1000, but the exact number is unknown ahead of time).
I looked at BatchCursor in Monix but it serves a different purpose.
Edit 2: this is the current version, using Tomer's answer below as a starting point, but using Stream instead of Iterator.
This eliminates the need for tail recursion, as per https://stackoverflow.com/a/10525539/165130, and gives O(1) time for the stream prepend operation #:: (whereas concatenating iterators via the ++ operation would be O(n)).
Note: while streams are lazily evaluated, Stream memoization may still cause memory blow-up, and memory management gets tricky. Changing from val to def to define the Stream in def pages = readAllPages below doesn't seem to have any effect.
def readAllPages(pageNumber: Int = 0): Stream[Iterator[Record]] = {
  val iter: Iterator[Record] = readPage(pageNumber)
  if (iter.isEmpty) Stream.empty
  else iter #:: readAllPages(pageNumber + 1)
}
// usage
val pages = readAllPages()
for {
  page <- pages
  record <- page
  if isValid(record)
} process(record)
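If the Stream memoization noted above becomes a problem, an Iterator-based pipeline avoids it entirely, since an Iterator is single-pass and holds no reference to already-consumed pages. A sketch, using the same readPage:
val pages: Iterator[Iterator[Record]] =
  Iterator.from(0).map(readPage).takeWhile(_.nonEmpty)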
Edit 3:
Tomer's second suggestion seems to be the best; its runtime and memory footprint are similar to the above solution, but it is much more concise and less error-prone.
val pages = Stream.from(1).map(readPage).takeWhile(_.nonEmpty)
Note: Stream.from(1) creates a stream starting from 1 and incrementing by 1, as documented in the API docs.
You can try to implement such logic:
def readPage(pageNumber: Int): Iterator[Record] = ???

// Note: this is not tail-recursive (so @tailrec would not compile here),
// but Iterator's ++ is lazy in its right operand, so the recursive call
// is deferred until the current page has been consumed.
def readAllPages(pageNumber: Int): Iterator[Iterator[Record]] = {
  val iter = readPage(pageNumber)
  if (iter.nonEmpty) {
    // Compute on records.
    // When finished computing:
    Iterator(iter) ++ readAllPages(pageNumber + 1)
  } else {
    Iterator.empty
  }
}

readAllPages(0)
A shorter option would be:
Stream.from(1).map(readPage).takeWhile(_.nonEmpty)
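On Scala 2.13 and later, where Stream is deprecated in favour of LazyList, the same one-liner becomes:
LazyList.from(1).map(readPage).takeWhile(_.nonEmpty)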

How do I get the k-th minimum element of a Priority Queue in Scala?

I tried the following but it seems to be wrong!
import scala.collection.mutable

object Main {
  def main(args: Array[String]): Unit = {
    val asc = Ordering.by((_: (Double, Vector[Double]))._1).reverse
    val pq = mutable.PriorityQueue.empty[(Double, Vector[Double])](asc)
    pq.enqueue(12.4 -> Vector(22.0, 3.4))
    pq.enqueue(1.2 -> Vector(2.3, 3.2))
    pq.enqueue(9.1 -> Vector(12.0, 3.2))
    pq.enqueue(32.4 -> Vector(22.0, 13.4))
    pq.enqueue(13.2 -> Vector(32.3, 23.2))
    pq.enqueue(93.1 -> Vector(12.0, 43.2))
    val k = 3
    val kthMinimum = pq.take(k).last
    println(kthMinimum)
  }
}
It's explicitly stated in the Scala API docs:
Only the dequeue and dequeueAll methods will return elements in
priority order (while removing elements from the heap). Standard
collection methods including drop, iterator, and toString will remove
or traverse the heap in whichever order seems most convenient.
If you want to stick with PriorityQueue, it seems that dequeueing k times, or indexing into the result of dequeueAll (pq.dequeueAll(k - 1)), might be the only ways to retrieve elements in priority order.
The problem is the incompatibility between PriorityQueue's properties and inherited collection methods like take. It is another example of weird implementation issues with Scala collections.
The same problem exists with Java's PriorityQueue.
import java.util.PriorityQueue
val pQueue = new PriorityQueue[Integer]
pQueue.add(10)
pQueue.add(20)
pQueue.add(4)
pQueue.add(15)
pQueue.add(9)
val iter = pQueue.iterator()
iter.next() // 4
iter.next() // 9
iter.next() // 10
iter.next() // 20
iter.next() // 15
So, PriorityQueue maintains your data in an underlying ArrayBuffer (not exactly, but a special inherited class). This "array" is kept heapified, and the inherited take API works on top of this heapified array-like data structure. The first k elements of a min-heapified array are certainly not the same as the minimum k elements.
Now, by definition a PriorityQueue is supposed to support enqueue and dequeue. It just maintains the highest-priority (first) element and is simply incapable of reliably providing the k-th element in the queue.
Although I say that this is a problem with both the Java and Scala implementations, it's just not possible to come up with a sane implementation for this. I just wonder why these misleading methods are still present in PriorityQueue implementations. Can't we just remove them?
I strongly suggest staying with the strictest API suited to your requirement. In other words, stick with the Queue API and avoid the inherited API methods (which might do weird things).
There is no good way of doing it, though, as it is not something a PriorityQueue is explicitly required to support.
You can achieve it by simply dequeueing k times in a loop, with time complexity O(k * log(n)):
val kThMinimum = {
  val pqc = pq.clone()
  (1 until k).foreach(_ => pqc.dequeue())
  pqc.dequeue()
}
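With the queue from the question and k = 3, the two discarded dequeues return the 1.2 and 9.1 entries, so this evaluates to the third minimum:
println(kThMinimum) // (12.4,Vector(22.0, 3.4))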

Char count in string

I'm new to Scala and FP in general and trying to practice it on a dummy example.
val counts = ransomNote.map(e=>(e,1)).reduceByKey{case (x,y) => x+y}
The following error is raised:
Line 5: error: value reduceByKey is not a member of IndexedSeq[(Char, Int)] (in solution.scala)
The above example looks similar to a starter FP primer on word count; I'd appreciate it if you pointed out my mistake.
It looks like you are trying to use a Spark method on a Scala collection. The two APIs have a few similarities, but reduceByKey is not one of them.
In pure Scala you can do it like this:
val counts =
  ransomNote.foldLeft(Map.empty[Char, Int].withDefaultValue(0)) {
    (counts, c) => counts.updated(c, counts(c) + 1)
  }
foldLeft iterates over the collection from the left, using the empty map of counts (which returns 0 if no value is found) as the accumulated state; the function passed as an argument updates the map with the current count of each character, incremented by one.
Note that accessing a map directly (counts(c)) is likely to be unsafe in most situations (since it will throw an exception if no item is found). In this situation it's fine because in this scope I know I'm using a map with a default value. When accessing a map you will more often than not want to use get, which returns an Option. More on that on the official Scala documentation (here for version 2.13.2).
You can play around with this code here on Scastie.
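For instance, a quick check with an example input (the result is shown as a comment):
val counts = "hello".foldLeft(Map.empty[Char, Int].withDefaultValue(0)) {
  (counts, c) => counts.updated(c, counts(c) + 1)
}
// counts: Map[Char, Int] = Map(h -> 1, e -> 1, l -> 2, o -> 1)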
On Scala 2.13 you can use the new groupMapReduce, which groups the characters by identity, maps each occurrence to 1, and reduces the counts per key with +:
ransomNote.groupMapReduce(identity)(_ => 1)(_ + _)
val str = "hello"
val countsMap: Map[Char, Int] = str
  .groupBy(identity)
  .mapValues(_.length) // on Scala 2.13+, use .view.mapValues(_.length).toMap instead
println(countsMap)     // e.g. Map(h -> 1, e -> 1, l -> 2, o -> 1)

Write a Scala program to find even numbers and the number just before each of them

I wrote the code below to find the even numbers, and the numbers just before them, in an RDD object. In it I first convert the RDD to a List and try to use my own function to find the even numbers and the numbers just before them. The following is my code, in which I make an empty list and try to append the numbers to it one by one.
object EvenandOdd {
  def mydef(nums: Iterator[Int]): Iterator[Int] = {
    val mylist = nums.toList
    val len = mylist.size
    var elist = List()
    var i: Int = 0
    var flag = 0
    while (flag != 1) {
      if (mylist(i) % 2 == 0) {
        elist.++=List(mylist(i))
        elist.++=List(mylist(i-1))
      }
      if (i == len - 1) {
        flag = 1
      }
      i = i + 1
    }
  }

  def main(args: Array[String]) {
    val myrdd = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 2)
    val myx = myrdd.mapPartitions(mydef)
    myx.collect
  }
}
I am not able to execute this code in either the Scala shell or Eclipse, and I cannot figure out the error, as I am just a beginner in Scala.
The following are the errors I got in Scala Shell.
<console>:35: error: value ++= is not a member of List[Nothing]
elist.++=List(mylist(i))
^
<console>:36: error: value ++= is not a member of List[Nothing]
elist.++=List(mylist(i-1))
^
<console>:31: error: type mismatch;
found : Unit
required: Iterator[Int]
while(flag!=1)
^
Your code looks too complicated and not functional. It also introduces a potential memory problem: you take an Iterator as a parameter and return an Iterator as output. An Iterator can be lazy and back a huge amount of data, so materializing it inside the method as a List could cause an OOM. Your task is to pull only as much data from the initial iterator as is needed to answer the new iterator's two methods: hasNext and next.
For example (based on your implementation, which outputs duplicates in the case of a run of consecutive even numbers), it could be:
def mydef(nums: Iterator[Int]): Iterator[Int] = {
  var before: Option[Int] = None
  // Helper iterator: pairs each element with the element that precedes it
  val helperIterator = new Iterator[(Option[Int], Int)] {
    override def hasNext: Boolean = nums.hasNext
    override def next(): (Option[Int], Int) = {
      val result = (before, nums.next())
      before = Some(result._2)
      result
    }
  }
  // Keep only pairs whose second element is even;
  // emit the predecessor too when one exists
  helperIterator.withFilter(_._2 % 2 == 0).flatMap {
    case (None, next)       => Iterator(next)
    case (Some(prev), next) => Iterator(prev, next)
  }
}
Here you have two iterators: a helper, which just prepares the data by providing the previous element for each next one, and the resulting iterator, based on the helper, which keeps only the pairs whose second element is even and outputs both elements when required (or just the even element if it is the first in the sequence).
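A quick check on a plain iterator (example input; the result is shown as a comment):
mydef(Iterator(1, 4, 5, 8, 9)).toList
// List(1, 4, 5, 8): 4 is preceded by 1, and 8 is preceded by 5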
For the initial code:
In addition to @pedrorijo91's answer, the initial code also does not return anything (presumably you wanted to convert elist to an Iterator).
It will be easier if you use a functional coding style rather than an iterative one. In a functional style the basic operation is straightforward.
Given a list of numbers, the following code will find all the even numbers and the values that precede them:
nums.sliding(2,1).filter(_(1) % 2 == 0)
The sliding operation creates an iterator over all pairs of adjacent values in the original list.
The filter operation keeps only those pairs where the second value is even.
The result is an Iterator[List[Int]] where each List[Int] has two elements. You should be able to use this in your RDD framework.
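For example, on a plain list:
List(1, 2, 3, 4, 5, 6).sliding(2, 1).filter(_(1) % 2 == 0).toList
// List(List(1, 2), List(3, 4), List(5, 6))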
It's marked as part of the developer API, so there's no guarantee it'll stick around, but the RDDFunctions object (org.apache.spark.mllib.rdd.RDDFunctions) actually defines sliding for RDDs. You will have to make sure it sees the elements in the order you want.
With it, this becomes something like:
rdd.sliding(2).filter(x => x(1) % 2 == 0) // pairs of (preceding number, even number)
For the first 2 errors:
there's no ++= method on an immutable List; you will have to reassign, e.g. elist = elist ++ List(mylist(i)). Note that elist also needs an explicit element type (var elist = List[Int]()), because List() is inferred as List[Nothing].