How to collect paginated results in Scala

I am trying to collect paginated results by trying to do the following logic in Scala and failed pathetically:
def python_version():
    books, cursor = fetch_results()
    while cursor is not None:
        # fetch the next page and advance the cursor
        new_books, cursor = fetch_results(cursor)
        books = books + new_books
    return books

def fetch_results(cursor=None):
    # do some fetching...
    return books, next_cursor

Here is an alternative solution using a recursive function, which avoids mutable values:
import scala.annotation.tailrec

def fetchResults(c: Option[Cursor] = None): (List[Book], Option[Cursor]) = ...

def fetchAllResults(): List[Book] = {
  @tailrec
  def loop(cursor: Option[Cursor], res: List[Book]): List[Book] = {
    val (books, newCursor) = fetchResults(cursor)
    val newBooks = res ::: books
    newCursor match {
      case Some(_) =>
        loop(newCursor, newBooks)
      case None =>
        newBooks
    }
  }
  loop(None, Nil)
}
This is a fairly standard pattern for recursive functions in Scala where the actual recursion is done in an internal function. The result of the previous iteration is passed down to the next iteration and then returned from the function. This means that loop is a tail-recursive function and can be optimised by the compiler into a while loop. (The @tailrec annotation tells the compiler to report an error if the function is not actually tail-recursive.)
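To make the recursive version concrete, here is a self-contained sketch that runs against an in-memory fake of the paginated API. The Book and Cursor case classes, the pages data, and the body of fetchResults are assumptions invented for illustration; only the shape of fetchAllResults follows the answer above.

```scala
import scala.annotation.tailrec

object PaginationSketch {
  // Hypothetical types standing in for the real API's models.
  final case class Book(title: String)
  final case class Cursor(page: Int)

  // Hypothetical in-memory "server": three pages of results.
  private val pages: Vector[List[Book]] = Vector(
    List(Book("A"), Book("B")),
    List(Book("C")),
    List(Book("D"), Book("E"))
  )

  // Returns one page plus the cursor for the next page, if there is one.
  def fetchResults(c: Option[Cursor] = None): (List[Book], Option[Cursor]) = {
    val page = c.map(_.page).getOrElse(0)
    val next = if (page + 1 < pages.length) Some(Cursor(page + 1)) else None
    (pages(page), next)
  }

  def fetchAllResults(): List[Book] = {
    @tailrec
    def loop(cursor: Option[Cursor], acc: List[Book]): List[Book] = {
      val (books, next) = fetchResults(cursor)
      val all = acc ::: books
      next match {
        case Some(_) => loop(next, all) // more pages: recurse in tail position
        case None    => all             // last page: return everything collected
      }
    }
    loop(None, Nil)
  }
}
```

Note that acc ::: books copies the accumulated list on every page; for a very large number of pages, prepending pages and reversing once at the end would likely be cheaper.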

Something like this, perhaps:
Iterator.iterate(fetch_results()) {
  case (_, None) => (Nil, None)
  case (_, cursor) => fetch_results(cursor)
}.map(_._1).takeWhile(_.nonEmpty).flatten.toList
Iterator.iterate takes the initial element of the iterator as its first parameter; the second is a function that, given the previous element, computes the next one.
So, this creates an iterator of tuples (Seq[Book], Option[Cursor]), starting with the initial return of fetch_results, and keeps fetching more pages until the next cursor is None, after which it only produces empty pages (I used None instead of null, because nulls are evil and shouldn't really be used in a language like Scala, which provides enough facilities to avoid them).
Then .map(_._1) discards the cursors (we don't need them any more), so we now have an iterator of pages; .takeWhile truncates the iterator at the first page that is empty; .flatten concatenates all the inner Seqs together; and finally toList materializes all the elements and returns the entire list of books.

Related

Scala: Compare elements at same position in two arrays

I'm in the process of learning Scala and am trying to write some sort of function that will compare one element in a list against an element in another list at the same index. I know that there has to be a more Scalatic way to do this than to write two for loops and keep track of the current index of each manually.
Let's say that we're comparing URLs, for example. Say that we have the following two Lists that are URLs split by the / character:
val incomingUrl = List("users", "profile", "12345")
and
val urlToCompare = List("users", "profile", ":id")
Say that I want to treat any element that begins with the : character as a match, but any element that does not begin with a : will not be a match.
What is the best and most Scalatic way to go about doing this comparison?
Coming from an OOP background, I would immediately jump to a for loop, but I know that there has to be a good FP way to go about it that will teach me a thing or two about Scala.
EDIT
For completion, I found this outdated question shortly after posting mine that relates to the problem.
EDIT 2
The implementation that I chose for this specific use case:
def doRoutesMatch(incomingURL: List[String], urlToCompare: List[String]): Boolean = {
  // if the lengths don't match, return immediately
  if (incomingURL.length != urlToCompare.length) return false
  // merge the lists into a list of tuples
  urlToCompare.zip(incomingURL)
    // iterate over it
    .foreach {
      // get each pair of path segments
      case (existingPath, pathToCompare) =>
        if (
          // check if this is some value supplied to the url, such as `:id`
          !existingPath.startsWith(":") &&
          // if this isn't a placeholder for a value that the route needs, then check if the strings are equal
          pathToCompare != existingPath
        )
          // if neither matches, it doesn't match the existing route
          return false
    }
  // return true if a `false` didn't get returned in the above foreach loop
  true
}
You can use zip: invoked on a Seq[A] with a Seq[B], it results in a Seq[(A, B)]. In other words, it creates a sequence of tuples pairing up the elements of both sequences at the same index (stopping at the end of the shorter one):
incomingUrl.zip(urlToCompare).map { case(incoming, pattern) => f(incoming, pattern) }
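As a minimal sketch of how zip could answer the original question, the comparison function can be folded into forall, so the check stops at the first mismatch. The routesMatch name and its exact matching rule (a pattern segment starting with ":" matches anything) are assumptions based on the question, not part of the answer above.

```scala
object RouteMatchSketch {
  // True when both routes have the same number of segments and every
  // pattern segment is either a placeholder (":id") or equal to the
  // incoming segment at the same position.
  def routesMatch(incoming: List[String], pattern: List[String]): Boolean =
    incoming.length == pattern.length &&
      incoming.zip(pattern).forall { case (in, pat) =>
        pat.startsWith(":") || in == pat
      }
}
```

The explicit length check is needed because zip silently drops the extra elements of the longer list.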
There is already another answer to the question, but I am adding another one since there is one corner case to watch out for. If you don't know the lengths of the two Lists, you need zipAll. Since zipAll needs a default value to insert if no corresponding element exists in the List, I am first wrapping every element in a Some, and then performing the zipAll.
object ZipAllTest extends App {
val incomingUrl = List("users", "profile", "12345", "extra")
val urlToCompare = List("users", "profile", ":id")
val list1 = incomingUrl.map(Some(_))
val list2 = urlToCompare.map(Some(_))
val zipped = list1.zipAll(list2, None, None)
println(zipped)
}
One thing that might bother you is that we are making several passes through the data. If that's something you are worried about, you can use lazy collections or else write a custom fold that makes only one pass over the data. That is probably overkill though. If someone wants to, they can add those alternative implementations in another answer.
Since the OP is curious to see how we would use lazy collections or a custom fold to do the same thing, I have included a separate answer with those implementations.
The first implementation uses lazy collections. Note that lazy collections have poor cache properties, so in practice it often does not make sense to use them as a micro-optimization. Although lazy collections will minimize the number of times you traverse the data, as has already been mentioned, the underlying data structure does not have good cache locality. To understand why lazy collections minimize the number of passes you make over the data, read chapter 5 of Functional Programming in Scala.
object LazyZipTest extends App{
val incomingUrl = List("users", "profile", "12345", "extra").view
val urlToCompare = List("users", "profile", ":id").view
val list1 = incomingUrl.map(Some(_))
val list2 = urlToCompare.map(Some(_))
val zipped = list1.zipAll(list2, None, None)
println(zipped)
}
The second implementation uses a custom fold to go over lists only one time. Since we are appending to the rear of our data structure, we want to use IndexedSeq, not List. You should rarely be using List anyway. Otherwise, if you are going to convert from List to IndexedSeq, you are actually making one additional pass over the data, in which case, you might as well not bother and just use the naive implementation I already wrote in the other answer.
Here is the custom fold.
object FoldTest extends App{
val incomingUrl = List("users", "profile", "12345", "extra").toIndexedSeq
val urlToCompare = List("users", "profile", ":id").toIndexedSeq
def onePassZip[T, U](l1: IndexedSeq[T], l2: IndexedSeq[U]): IndexedSeq[(Option[T], Option[U])] = {
val folded = l1.foldLeft((l2, IndexedSeq[(Option[T], Option[U])]())) { (acc, e) =>
acc._1 match {
case x +: xs => (xs, acc._2 :+ (Some(e), Some(x)))
case IndexedSeq() => (IndexedSeq(), acc._2 :+ (Some(e), None))
}
}
folded._2 ++ folded._1.map(x => (None, Some(x)))
}
println(onePassZip(incomingUrl, urlToCompare))
println(onePassZip(urlToCompare, incomingUrl))
}
If you have any questions, I can answer them in the comments section.

Scala: what is the interest in using Iterators?

I have used Iterators after having worked with Regexes in Scala, but I don't really understand the interest.
I know that it has a state and if I call the next() method on it, it will output a different result every time, but I don't see anything I can do with it and that is not possible with an Iterable.
And it doesn't seem to work as Akka Streams (for example) since the following example directly prints all the numbers (without waiting one second as I would expect it):
lazy val a = Iterator({Thread.sleep(1000); 1}, {Thread.sleep(1000); 2}, {Thread.sleep(1000); 3})
while(a.hasNext){ println(a.next()) }
So what is the purpose of using Iterators?
Perhaps, the most useful property of iterators is that they are lazy.
Consider something like this:
(1 to 10000)
.map { x => x * x }
.map { _.toString }
.find { _ == "4" }
This snippet will square 10000 numbers, then generate 10000 strings, and then return the second one.
This on the other hand:
(1 to 10000)
.iterator
.map { x => x * x }
.map { _.toString }
.find { _ == "4" }
... only computes two squares, and generates two strings.
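The difference can be checked with a small sketch that counts how often the squaring function runs. The countSquares helper and its useIterator flag are names made up for this illustration.

```scala
object LazinessSketch {
  // Runs the pipeline from the answer either strictly or through an
  // iterator, and returns the result together with the number of times
  // the squaring function was called.
  def countSquares(useIterator: Boolean): (Option[String], Int) = {
    var calls = 0
    def square(x: Int): Int = { calls += 1; x * x }

    val result =
      if (useIterator)
        (1 to 10000).iterator.map(square).map(_.toString).find(_ == "4")
      else
        (1 to 10000).map(square).map(_.toString).find(_ == "4")

    (result, calls)
  }
}
```

With the strict pipeline the counter ends at 10000; with the iterator it ends at 2, because find stops pulling elements as soon as it sees "4".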
Iterators are also often useful when you need to wrap around some poorly designed (java?) objects in order to be able to handle them in functional style:
val rs: ResultSet = jdbcQuery.executeQuery()
new Iterator[ResultSet] {
  // note: hasNext advances the cursor, so call it exactly once per next()
  def hasNext = rs.next()
  def next() = rs
}.map { rs =>
  fetchData(rs)
}
Streams are similar to iterators - they are also lazy, and also useful for wrapping:
Stream.continually(rs).takeWhile { _.next }.map(fetchData)
The main difference though is that streams remember the data that gets materialized, so that you can traverse them more than once. This is convenient, but may be costly if the original amount of data is very large, especially, if it gets filtered down to much smaller size:
Source
.fromFile("huge_file.txt")
.getLines
.filter(_ == "")
.toList
This only uses, roughly (ignoring buffering, object overhead, and other implementation specific details), the amount of memory, necessary to keep one line in memory, plus however many empty lines there are in the file.
This on the other hand:
// readLine is defined on BufferedReader, not on FileReader
val reader = new BufferedReader(new FileReader("huge_file.txt"))
Stream
  .continually(reader.readLine)
  .takeWhile(_ != null)
  .filter(_ == "")
  .toList
... will end up with the entire content of the huge_file.txt in memory.
Finally, if I understand the intent of your example correctly, here is how you could do it with iterators:
val iterator = Seq(1,2,3).iterator.map { n => Thread.sleep(1000); n }
iterator.foreach(println)
// Or while(iterator.hasNext) { println(iterator.next) } as you had it.
There is a good explanation of what an iterator is at http://www.scala-lang.org/docu/files/collections-api/collections_43.html
An iterator is not a collection, but rather a way to access the
elements of a collection one by one. The two basic operations on an
iterator it are next and hasNext. A call to it.next() will return the
next element of the iterator and advance the state of the iterator.
Calling next again on the same iterator will then yield the element
one beyond the one returned previously. If there are no more elements
to return, a call to next will throw a NoSuchElementException.
First of all you should understand what is wrong with your example:
lazy val a = Iterator({Thread.sleep(1); 1}, {Thread.sleep(1); 2}, {Thread.sleep(2); 3})
while(a.hasNext){ println(a.next()) }
If you look at the apply method of Iterator, you'll see that its parameters are not by-name, so all the Thread.sleep calls are evaluated up front, when apply is called, and not when the elements are consumed. Also, Thread.sleep takes the time to sleep in milliseconds, so if you want the thread to sleep for one second you should pass Thread.sleep(1000).
The companion object has additional methods which allow you to do the following:
val a = Iterator.iterate(1)(x => {Thread.sleep(1000); x+1})
Iterator is very useful when you need to work with large data. Also you can implement your own:
val it = new Iterator[Int] {
var i = -1
def hasNext = true
def next(): Int = { i += 1; i }
}
I don't see anything I can do with it and that is not possible with an Iterable
In fact, most of what the collections can do can also be done with an Array, but we don't do that because it's much less convenient.
The same reasoning applies to Iterator: if you want to model mutable state, an iterator makes more sense.
For example, Random is implemented in a way that resembles an iterator, because its use case fits an iterator more naturally than an iterable.

How to handle Option of List in Scala?

Suppose I have two functions, getCustomer and getOrdersByCustomer.
def getCustomer():List[Customer] = ...
def getOrdersByCustomer(cust: Customer): List[Order] = ...
Now I can easily define a function getOrdersOfAllCustomers
def getOrdersOfAllCustomers(): List[Order] =
for(cust <- getCustomer(); order <- getOrdersByCustomer(cust)) yield order
So far, so good but what if getCustomer and getOrdersByCustomer return Options of the lists ?
def getCustomer():Option[List[Customer]] = ...
def getOrdersByCustomer(cust: Customer): Option[List[Order]] = ...
Now I would like to implement two different flavors of getOrdersOfAllCustomers():
Return None if one of the functions returns None;
Return None if getCustomer returns None and do not care if getOrdersByCustomer returns None.
How would you suggest implementing it?
I think you should consider three possibilities--a populated list, an empty list, or an error--and avoid a lot of inelegant testing to figure out which one happened.
So use Try with List:
def getOrdersOfAllCustomers(): Try[List[Order]] = {
Try(functionReturningListOfOrders())
}
If all goes well, you will come out with a Success[List[Order]]; if not, Failure[List[Order]].
The beauty of this approach is that no matter which happens--a populated list, an empty list, or an error--you can do all the stuff you want with lists. This is because Try is a monad just like Option is. Go ahead and filter, foreach, map, etc. to your heart's content without caring which of those three occurred.
The one thing is that awkward moment when you do have to figure out if success or failure happened. Then use a match expression:
getOrdersOfAllCustomers() match {
case Success(orders) => println(s"Awww...yeah!")
case Failure(ex) => println(s"Stupid Scala")
}
Even if you don't go with the Try, I implore you not to treat empty lists different from populated lists.
Try this,
// A None from either function is treated as an empty list, so the result is a plain List[Order]
def getOrdersOfAllCustomers(): List[Order] =
  for {
    cust  <- getCustomer().toList.flatten
    order <- getOrdersByCustomer(cust).toList.flatten
  } yield order
This should do it:
def getOrdersOfAllCustomers(): Option[List[Order]] = {
  getCustomer() flatMap { customers =>
    // optOrders is a List[Option[List[Order]]]
    val optOrders = customers map { getOrdersByCustomer }
    // Any result must be wrapped in an Option because we're flatMapping
    // the return from the initial getCustomer call
    if (optOrders contains None) None
    else {
      // turn each Option[List[Order]] into a List[Order], which gives a
      // List[List[Order]], then flatten that into a single List[Order]
      Some(optOrders.map(_.toList.flatten).flatten)
    }
  }
}
The hard part is handling the case where one of the nested invocations of getOrderByCustomer returns None and bubbling that result back to the outer scope (which is why using empty lists is so much easier)
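For comparison, here is a sketch of both flavors asked for in the question, side by side. The Customer and Order case classes, the function names allOrNothing and lenient, and passing the two data sources in as parameters are all assumptions made so that the example is self-contained.

```scala
object OrdersSketch {
  // Hypothetical models, for illustration only.
  final case class Customer(name: String)
  final case class Order(id: Int)

  // Flavor 1: None if getCustomer or any getOrdersByCustomer call returns None.
  def allOrNothing(
      getCustomer: () => Option[List[Customer]],
      getOrdersByCustomer: Customer => Option[List[Order]]
  ): Option[List[Order]] =
    getCustomer().flatMap { customers =>
      // Combine the per-customer results; a single None makes the whole fold None.
      customers.foldRight(Option(List.empty[Order])) { (cust, acc) =>
        for {
          orders <- getOrdersByCustomer(cust)
          rest   <- acc
        } yield orders ::: rest
      }
    }

  // Flavor 2: None only if getCustomer returns None; a None from
  // getOrdersByCustomer counts as "no orders" for that customer.
  def lenient(
      getCustomer: () => Option[List[Customer]],
      getOrdersByCustomer: Customer => Option[List[Order]]
  ): Option[List[Order]] =
    getCustomer().map { customers =>
      customers.flatMap(c => getOrdersByCustomer(c).getOrElse(Nil))
    }
}
```

The fold in allOrNothing still calls getOrdersByCustomer for every customer; it does not stop at the first None.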

Functional style early exit from depth-first recursion

I have a question about writing recursive algorithms in a functional style. I will use Scala for my example here, but the question applies to any functional language.
I am doing a depth-first enumeration of an n-ary tree where each node has a label and a variable number of children. Here is a simple implementation that prints the labels of the leaf nodes.
case class Node[T](label:T, ns:Node[T]*)
def dfs[T](r:Node[T]):Seq[T] = {
if (r.ns.isEmpty) Seq(r.label) else for (n<-r.ns;c<-dfs(n)) yield c
}
val r = Node('a, Node('b, Node('d), Node('e, Node('f))), Node('c))
dfs(r) // returns Seq[Symbol] = ArrayBuffer('d, 'f, 'c)
Now say that sometimes I want to be able to give up on parsing oversize trees by throwing an exception. Is this possible in a functional language? Specifically is this possible without using mutable state? That seems to depend on what you mean by "oversize". Here is a purely functional version of the algorithm that throws an exception when it tries to handle a tree with a depth of 3 or greater.
def dfs[T](r:Node[T], d:Int = 0):Seq[T] = {
require(d < 3)
if (r.ns.isEmpty) Seq(r.label) else for (n<-r.ns;c<-dfs(n, d+1)) yield c
}
But what if a tree is oversized because it is too broad rather than too deep? Specifically what if I want to throw an exception the n-th time the dfs() function is called recursively regardless of how deep the recursion goes? The only way I can see how to do this is to have a mutable counter that is incremented with each call. I can't see how to do it without a mutable variable.
I'm new to functional programming and have been working under the assumption that anything you can do with mutable state can be done without, but I don't see the answer here. The only thing I can think to do is write a version of dfs() that returns a view over all the nodes in the tree in depth-first order.
def dfs[T](r:Node[T]):TraversableView[T, Traversable[_]] = ...
Then I could impose my limit by saying dfs(r).take(n), but I don't see how to write this function. In Python I'd just create a generator by yielding nodes as I visited them, but I don't see how to achieve the same effect in Scala. (Scala's equivalent to a Python-style yield statement appears to be a visitor function passed in as a parameter, but I can't figure out how to write one of these that will generate a sequence view.)
EDIT Getting close to the answer.
Here is a function that returns a Stream of nodes in depth-first order.
def dfs[T](r: Node[T]): Stream[Node[T]] = {
(r #:: Stream.empty /: r.ns)(_ ++ dfs(_))
}
That is almost it. The only problem is that Stream memoizes all results, which is a waste of memory. I want a traversable view. The following is the idea, but does not compile.
def dfs[T](r: Node[T]): TraversableView[Node[T], Traversable[Node[T]]] = {
(Traversable(r).view /: r.ns)(_ ++ dfs(_))
}
It gives a "found TraversableView[Node[T], Traversable[Node[T]]], required TraversableView[Node[T], Traversable[_]]" error for the ++ operator. If I change the return type to TraversableView[Node[T], Traversable[_]], I get the same problem with the "found" and "required" clauses switched. So there's some magic type variance incantation I haven't lit upon yet, but this is close.
It can be done: you just have to write some code to actually iterate through the children in the way you want (as opposed to relying on for).
More explicitly, you'll have to write code to iterate through a list of children and check if the "depth" crossed your threshold. Here's some Haskell code (I'm really sorry, I'm not fluent in Scala, but this can probably be easily transliterated):
http://ideone.com/O5gvhM
In this code, I've basically replaced the for loop with an explicit recursive version. This allows me to stop the recursion once too many nodes have been visited (i.e., limit is not positive). When I recurse to examine the next child, I subtract the number of nodes the dfs of the previous child visited and set this as the limit for the next child.
Functional languages are fun, but they're a huge leap from imperative programming. It really makes you pay attention to the concept of state, because all of it is excruciatingly explicit in the arguments when you go functional.
EDIT: Explaining this a bit more.
I ended up converting from "print just the leaf nodes" (which was the original algorithm from the OP) to "print all nodes". This enabled me to have access to the number of nodes the subcall visited through the length of the resulting list. If you want to stick to the leaf nodes, you'll have to carry around how many nodes you have already visited:
http://ideone.com/cIQrna
EDIT again: To clear up this answer, I'm putting all the Haskell code on ideone, and I've transliterated my Haskell code to Scala, so this can stay here as the definitive answer to the question:
case class Node[T](label:T, children:Seq[Node[T]])
case class TraversalResult[T](num_visited:Int, labels:Seq[T])
def dfs[T](node:Node[T], limit:Int):TraversalResult[T] =
limit match {
case 0 => TraversalResult(0, Nil)
case limit =>
node.children match {
case Nil => TraversalResult(1, List(node.label))
case children => {
val result = traverse(node.children, limit - 1)
TraversalResult(result.num_visited + 1, result.labels)
}
}
}
def traverse[T](children:Seq[Node[T]], limit:Int):TraversalResult[T] =
limit match {
case 0 => TraversalResult(0, Nil)
case limit =>
children match {
case Nil => TraversalResult(0, Nil)
case first :: rest => {
val trav_first = dfs(first, limit)
val trav_rest =
traverse(rest, limit - trav_first.num_visited)
TraversalResult(
trav_first.num_visited + trav_rest.num_visited,
trav_first.labels ++ trav_rest.labels
)
}
}
}
val n = Node(0, List(
Node(1, List(Node(2, Nil), Node(3, Nil))),
Node(4, List(Node(5, List(Node(6, Nil))))),
Node(7, Nil)
))
for (i <- 1 to 8)
println(dfs(n, i))
Output:
TraversalResult(1,List())
TraversalResult(2,List())
TraversalResult(3,List(2))
TraversalResult(4,List(2, 3))
TraversalResult(5,List(2, 3))
TraversalResult(6,List(2, 3))
TraversalResult(7,List(2, 3, 6))
TraversalResult(8,List(2, 3, 6, 7))
P.S. this is my first attempt at Scala, so the above probably contains some horrid non-idiomatic code. I'm sorry.
You can convert breadth into depth by passing along an index or taking the tail:
// recursive methods need an explicit result type
def suml(xs: List[Int], total: Int = 0): Int = xs match {
  case Nil => total
  case x :: rest => suml(rest, total+x)
}
def suma(xs: Array[Int], from: Int = 0, total: Int = 0): Int = {
  if (from >= xs.length) total
  else suma(xs, from+1, total + xs(from))
}
In the latter case, you already have something to limit your breadth if you want; in the former, just add a width or somesuch.
The following implements a lazy depth-first search over nodes in a tree.
import collection.TraversableView
case class Node[T](label: T, ns: Node[T]*)
def dfs[T](r: Node[T]): TraversableView[Node[T], Traversable[Node[T]]] =
(Traversable[Node[T]](r).view /: r.ns) {
(a, b) => (a ++ dfs(b)).asInstanceOf[TraversableView[Node[T], Traversable[Node[T]]]]
}
This prints the labels of all the nodes in depth-first order.
val r = Node('a, Node('b, Node('d), Node('e, Node('f))), Node('c))
dfs(r).map(_.label).force
// returns Traversable[Symbol] = List('a, 'b, 'd, 'e, 'f, 'c)
This does the same thing, quitting after 3 nodes have been visited.
dfs(r).take(3).map(_.label).force
// returns Traversable[Symbol] = List('a, 'b, 'd)
If you want only leaf nodes you can use filter, and so forth.
Note that the fold clause of the dfs function requires an explicit asInstanceOf cast. See "Type variance error in Scala when doing a foldLeft over Traversable views" for a discussion of the Scala typing issues that necessitate this.
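A different technique that avoids the cast altogether is to return an Iterator instead of a TraversableView. This is a minimal sketch, not the approach of the answer above; it uses String labels in place of symbol literals, and the LazyDfsSketch object name is made up for the example.

```scala
object LazyDfsSketch {
  final case class Node[T](label: T, ns: Node[T]*)

  // Pre-order traversal as an Iterator. Both ++ (by-name argument) and
  // flatMap are lazy on iterators, so a subtree is only descended into
  // when the consumer asks for more elements.
  def dfs[T](r: Node[T]): Iterator[Node[T]] =
    Iterator.single(r) ++ r.ns.iterator.flatMap(n => dfs(n))

  val r: Node[String] =
    Node("a", Node("b", Node("d"), Node("e", Node("f"))), Node("c"))

  // All labels in depth-first order.
  val all: List[String] = dfs(r).map(_.label).toList

  // Give up after three nodes, without visiting the rest of the tree.
  val firstThree: List[String] = dfs(r).take(3).map(_.label).toList
}
```

Unlike a Stream, an Iterator does not memoize what it has produced, so it can be traversed only once; call dfs again for a second pass.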

Is there a way to handle the last case differently in a Scala for loop?

For example suppose I have
for (line <- myData) {
println("}, {")
}
Is there a way to get the last line to print
println("}")
Can you refactor your code to take advantage of built-in mkString?
scala> List(1, 2, 3).mkString("{", "}, {", "}")
res1: String = {1}, {2}, {3}
Before going any further, I'd recommend you avoid println in a for-comprehension. It can sometimes be useful for tracking down a bug that occurs in the middle of a collection, but otherwise leads to code that's harder to refactor and test.
More generally, life usually becomes easier if you can restrict where any sort of side-effect occurs. So instead of:
for (line <- myData) {
println("}, {")
}
You can write:
val lines = for (line <- myData) yield "}, {"
println(lines mkString "\n")
I'm also going to take a guess here that you wanted the content of each line in the output!
val lines = for (line <- myData) yield (line + "}, {")
println(lines mkString "\n")
Though you'd be better off still if you just used mkString directly - that's what it's for!
val lines = myData.mkString("{", "\n}, {", "}")
println(lines)
Note how we're first producing a String, then printing it in a single operation. This approach can easily be split into separate methods and used to implement toString on your class, or to inspect the generated String in tests.
I agree fully with what has been said before about using mkString, and about distinguishing the first iteration rather than the last one. Should you still need to treat the last element differently, Scala collections have an init method, which returns all elements but the last.
So you can do
for(x <- coll.init) workOnNonLast(x)
workOnLast(coll.last)
(init and last are sort of the opposite of head and tail, which are the first and all but the first.) Note however that depending on the structure, they may be costly. On Vector, all of them are fast. On List, while head and tail are basically free, init and last are both linear in the length of the list. headOption and lastOption may help you when the collection may be empty, replacing workOnLast by
for (x <- coll.lastOption) workOnLast(x)
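Put together, a sketch of the init/lastOption approach for the separators in the question might look like this. The renderLines name, the Vector[String] input type, and returning the lines instead of printing them are assumptions made for the example.

```scala
object LastElementSketch {
  // Appends "}, {" to every element but the last and "}" to the last one;
  // returns an empty Vector for an empty input.
  def renderLines(coll: Vector[String]): Vector[String] =
    coll.lastOption match {
      case Some(last) => coll.init.map(_ + "}, {") :+ (last + "}")
      case None       => Vector.empty
    }
}
```

Going through lastOption first means init is never called on an empty collection, where it would throw.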
You may take the addString function of the TraversableOnce trait as an example.
def addString(b: StringBuilder, start: String, sep: String, end: String): StringBuilder = {
var first = true
b append start
for (x <- self) {
if (first) {
b append x
first = false
} else {
b append sep
b append x
}
}
b append end
b
}
In your case, the separator is }, { and the end is }
If you don't want to use built-in mkString function, you can make something like
for (line <- lines)
if (line == lines.last) println("last")
else println(line)
UPDATE: As didierd mentioned in the comments, this solution is wrong because the last value can occur several times; he provides a better solution in his answer.
Calling last is fine for Vectors, because it takes "effectively constant time" for them. For Lists it takes linear time, so you can use pattern matching instead:
@tailrec
def printLines[A](l: List[A]): Unit = {
  l match {
    case Nil =>
    case x :: Nil => println("last")
    case x :: xs => println(x); printLines(xs)
  }
}
Other answers have rightly pointed to mkString, and for a normal amount of data I would also use that.
However, mkString builds (accumulates) the end-result in-memory through a StringBuilder. This is not always desirable, depending on the amount of data we have.
In this case, if all we want is to "print" we don't need to build the big-result first (and maybe we even want to avoid this).
Consider the implementation of this helper function:
def forEachIsLast[A](iterator: Iterator[A])(operation: (A, Boolean) => Unit): Unit = {
while(iterator.hasNext) {
val element = iterator.next()
val isLast = !iterator.hasNext // if there is no "next", this is the last one
operation(element, isLast)
}
}
It iterates over all elements and invokes operation passing each element in turn, with a boolean value. The value is true if the element passed is the last one.
In your case it could be used like this:
forEachIsLast(myData) { (line, isLast) =>
if(isLast)
println("}")
else
println("}, {")
}
We have the following advantages here:
It operates on each element, one by one, without necessarily accumulating the result in memory (unless you want to).
It does not need to load the whole collection into memory to check its size; it is enough to ask the Iterator whether it is exhausted. You could read data from a big file, or from the network, etc.