Split an iterator by a predicate - scala

I need a method that can split an Iterator[Char] into lines (separated by \n and \r).
For that, I wrote a general method that takes an iterator and a predicate and splits the iterator every time the predicate is true.
This is similar to span, but it splits every time the predicate is true, not only the first time.
This is my implementation:
def iterativeSplit[T](iterO: Iterator[T])(breakOn: T => Boolean): Iterator[List[T]] =
  new Iterator[List[T]] {
    private var iter = iterO
    def hasNext = iter.hasNext
    def next = {
      val (i1, i2) = iter.span(el => !breakOn(el))
      val cur = i1.toList
      iter = i2.dropWhile(breakOn)
      cur
    }
  }.withFilter(l => l.nonEmpty)
It works well on small inputs, but on large inputs it runs very slowly, and sometimes I get a stack overflow exception.
Here is the code that reproduces the issue:
val iter = ("aaaaaaaaabbbbbbbbbbbccccccccccccc\r\n" * 10000).iterator
iterativeSplit(iter)(c => c == '\r' || c == '\n').length
the stack trace during the run is:
...
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:615)
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
at scala.collection.Iterator$$anon$18.hasNext(Iterator.scala:591)
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:615)
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
at scala.collection.Iterator$$anon$18.hasNext(Iterator.scala:591)
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:615)
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
at scala.collection.Iterator$$anon$18.hasNext(Iterator.scala:591)
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
...
Looking at the source code (I'm using Scala 2.10.4):
line 591 is the hasNext of the second iterator returned by span, and line 615 is the hasNext of the iterator returned by dropWhile.
I guess I'm using those two iterators incorrectly, but I can't see why.

You can simplify your code as follows, which seems to solve the problem:
def iterativeSplit2[T](iter: Iterator[T])(breakOn: T => Boolean): Iterator[List[T]] =
  new Iterator[List[T]] {
    def hasNext = iter.hasNext
    def next = {
      val cur = iter.takeWhile(!breakOn(_)).toList
      iter.dropWhile(breakOn)
      cur
    }
  }.withFilter(l => l.nonEmpty)
Rather than using span (so you need to replace iter on each call to next), simply use takeWhile and dropWhile on the original iter. Then there's no need for the var.
I think the cause of your original stack overflow is that repeatedly calling span creates a long chain of Iterators, each of whose hasNext methods calls the hasNext of its parent Iterator. If you look at the source code for Iterator, you can see that each span creates new Iterators that forward calls to hasNext to the original iterator (via a BufferedIterator, which increases the call stack even further).
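To make that concrete, here is a small sketch (using the question's own setup) of what the var iter ends up pointing at after a few calls to next:
val breakOn = (c: Char) => c == '\r' || c == '\n'
var iter: Iterator[Char] = ("ab\r\n" * 3).iterator
for (_ <- 1 to 3) {
  val (i1, i2) = iter.span(el => !breakOn(el))
  i1.toList                    // consume the current chunk
  iter = i2.dropWhile(breakOn) // each call wraps another layer around the old iterator
}
// After n chunks, a single hasNext on the outermost wrapper has to walk down
// through all the layers below it, which is what produces the repeated
// hasNext frames in the stack trace above.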
Update: having consulted the documentation, it seems that although my solution above appears to work, it is not recommended. See in particular:
It is of particular importance to note that, unless stated otherwise,
one should never use an iterator after calling a method on it.
[...] Using the old iterator is undefined, subject to change, and may result in changes to the new iterator as well.
which applies to takeWhile and dropWhile (and span), but not next or hasNext.
It's possible to use span as in your original solution, but using streams rather than iterators, and recursion:
def split3[T](s: Stream[T])(breakOn: T => Boolean): Stream[List[T]] = s match {
  case Stream.Empty => Stream.empty
  case s => {
    val (a, b) = s.span(!breakOn(_))
    a.toList #:: split3(b.dropWhile(breakOn))(breakOn)
  }
}
But the performance is pretty terrible. I'm sure there must be a better way...
Update 2: Here is a very imperative solution that has better performance:
import scala.collection.mutable.ListBuffer

def iterativeSplit4[T](iter: Iterator[T])(breakOn: T => Boolean): Iterator[List[T]] =
  new Iterator[List[T]] {
    val word = new ListBuffer[T]
    def hasNext = iter.hasNext
    def next = {
      var looking = true
      while (looking) {
        val c = iter.next
        if (breakOn(c)) looking = false
        else word += c
      }
      val w = word.toList
      word.clear()
      w
    }
  }.withFilter(_.nonEmpty)
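For reference, running the imperative version against the question's original input should now complete without the deep wrapper chain (a quick sanity check, not a benchmark):
val iter = ("aaaaaaaaabbbbbbbbbbbccccccccccccc\r\n" * 10000).iterator
// Each line yields one non-empty chunk (up to the '\r') plus one empty chunk
// (between '\r' and '\n') that the withFilter drops, so this should be 10000.
iterativeSplit4(iter)(c => c == '\r' || c == '\n').length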

Related

scala view filter not lazy?

While trying to understand the differences between streams, iterators, and views of collections, I stumbled upon the following strange behavior.
Here is the code (map and filter simply print their input and forward it unchanged):
object ArrayViewTest {
  def main(args: Array[String]) {
    val array = Array.range(1, 10)
    print("stream-map-head: ")
    array.toStream.map(x => {print(x); x}).head
    print("\nstream-filter-head: ")
    array.toStream.filter(x => {print(x); true}).head
    print("\niterator-map-head: ")
    array.iterator.map(x => {print(x); x}).take(1).toArray
    print("\niterator-filter-head: ")
    array.iterator.filter(x => {print(x); true}).take(1).toArray
    print("\nview-map-head: ")
    array.view.map(x => {print(x); x}).head
    print("\nview-filter-head: ")
    array.view.filter(x => {print(x); true}).head
  }
}
And its output:
stream-map-head: 1
stream-filter-head: 1
iterator-map-head: 1
iterator-filter-head: 1
view-map-head: 1
view-filter-head: 123456789 // <------ WHY ?
Why does filter called on a view process the whole array?
I expected the evaluation of filter to be driven only as far as needed by the call to head, just as in all the other cases, and in particular just as when using map on the view.
What insight am I missing?
(Mini-side-question for a comment, why is there no head on an iterator?)
edit:
The same strange behavior (as here for scala.Array.range(1,10)) is achieved by scala.collection.mutable.ArraySeq.range(1,10), scala.collection.mutable.ArrayBuffer.range(1,10), and scala.collection.mutable.StringBuilder.newBuilder.append("123456789").
However, for all other mutable collections, and all immutable collections, the filter on the view works as expected and outputs 1.
It seems the head uses isEmpty
trait IndexedSeqOptimized[+A, +Repr] extends Any with IndexedSeqLike[A, Repr] { self =>
  ...
  override /*IterableLike*/
  def head: A = if (isEmpty) super.head else this(0)
And isEmpty uses length
trait IndexedSeqOptimized[+A, +Repr] extends Any with IndexedSeqLike[A, Repr] { self =>
  ...
  override /*IterableLike*/
  def isEmpty: Boolean = { length == 0 }
The implementation of length comes from Filtered, which loops through the whole array:
trait Filtered extends super.Filtered with Transformed[A] {
  protected[this] lazy val index = {
    var len = 0
    val arr = new Array[Int](self.length)
    for (i <- 0 until self.length)
      if (pred(self(i))) {
        arr(len) = i
        len += 1
      }
    arr take len
  }
  def length = index.length
  def apply(idx: Int) = self(index(idx))
}
The Filtered trait is only used when calling filter
protected override def newFiltered(p: A => Boolean): Transformed[A] =
  new { val pred = p } with AbstractTransformed[A] with Filtered
This is why it happens when using filter and not when using map.
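Based on the question's edit (non-indexed collections behave lazily here), a quick side-by-side check might look like this:
val array = Array.range(1, 10)
array.view.filter(x => { print(x); true }).head        // prints 123456789: the index is built eagerly
array.toList.view.filter(x => { print(x); true }).head // prints only 1, per the question's edit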
I think it has to do with the fact that Array is a mutable indexed sequence, and its view is also a mutable collection :) So when it creates a view, it creates an index that maps between the original collection and the filtered collection. It doesn't really make sense to create this index lazily, because when someone requests the i-th element the whole source array may have to be traversed anyway. It is still lazy in the sense that the index is not created until you call head. Still, this is not explicitly stated in the Scala documentation, and it looks like a bug at first sight.
For the mini side question, I think the problem with head on an iterator is that people expect head to be a pure function: you should be able to call it n times and it should return the same result each time. An iterator is an inherently mutable data structure, which by contract can only be traversed once. This could be overcome by caching the first element of the iterator, but I find that very confusing.
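For what it's worth, BufferedIterator already provides exactly that kind of cached head, without advancing the iterator:
val it = Iterator(1, 2, 3).buffered
it.head   // 1
it.head   // still 1; nothing has been consumed yet
it.next() // 1, and only now does the iterator advance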

Queue in Scala using List

In Java we can implement a queue simply by using an ArrayList, but in Scala Lists are immutable, so how can I implement a queue using a List in Scala? Could somebody give me a hint?
This is from Scala's immutable Queue:
Queue is implemented as a pair of Lists, one containing the in elements and the other the out elements. Elements are added to the in list and removed from the out list. When the out list runs dry, the queue is pivoted by replacing the out list by in.reverse, and in by Nil.
So:
object Queue {
  def empty[A]: Queue[A] = new Queue(Nil, Nil)
}

class Queue[A] private (in: List[A], out: List[A]) {
  def isEmpty: Boolean = in.isEmpty && out.isEmpty
  def push(elem: A): Queue[A] = new Queue(elem :: in, out)
  def pop(): (A, Queue[A]) =
    out match {
      case head :: tail => (head, new Queue(in, tail))
      case Nil =>
        val head :: tail = in.reverse // throws exception if empty
        (head, new Queue(Nil, tail))
    }
}
var q = Queue.empty[Int]
(1 to 10).foreach(i => q = q.push(i))
while (!q.isEmpty) { val (i, r) = q.pop(); println(i); q = r }
With immutable Lists, you have to return a new List after any modifying operation. Once you've grasped that, it's straightforward. A minimal (but inefficient) implementation where the Queue is also immutable might be:
class Queue[T](content: List[T]) {
  def pop() = new Queue(content.init)
  def push(element: T) = new Queue(element :: content)
  def peek() = content.last
  override def toString() = "Queue of:" + content.toString
}
val q= new Queue(List(1)) //> q : lists.queue.Queue[Int] = Queue of:List(1)
val r = q.push(2) //> r : lists.queue.Queue[Int] = Queue of:List(2, 1)
val s = r.peek() //> s : Int = 1
val t = r.pop() //> t : lists.queue.Queue[Int] = Queue of:List(2)
If we talk about mutable Lists, they wouldn't be an efficient structure for implementing a Queue for the following reason: Adding elements to the beginning of a list works very well (takes constant time), but popping elements off the end is not efficient at all (takes longer the more elements there are in the list).
You do, however, have Arrays in Scala. Accessing any element in an array takes constant time. Unfortunately arrays are not dynamically sized, so they wouldn't make good queues: they cannot grow as your queue grows. However, ArrayBuffers do grow as your queue grows, so they would be a great place to start.
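As a rough sketch of that idea (my own illustrative class, not a standard one), an ArrayBuffer-backed queue might look like this; note that remove(0) shifts every remaining element, so dequeue is O(n) unless you also track a head index:
import scala.collection.mutable.ArrayBuffer

class BufferQueue[A] {
  private val buf = ArrayBuffer.empty[A]
  def isEmpty: Boolean = buf.isEmpty
  def enqueue(a: A): Unit = buf += a // amortized constant time at the end
  def dequeue(): A = buf.remove(0)   // O(n): shifts the remaining elements left
}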
Also, note that Scala already has a Queue class: scala.collection.mutable.Queue.
The only way to implement a Queue with an immutable List would be to use a var. Good luck!

Scala - weird behaviour with Iterator.toList

I am new to Scala and I have a function as follows:
def selectSame(messages: BufferedIterator[Int]) = {
  val head = messages.head
  messages.takeWhile(_ == head)
}
which selects from a buffered iterator only the elements matching the head. I am subsequently using this code:
val messageStream = List(1, 1, 1, 2, 2, 3, 3).iterator.buffered // buffered, to match selectSame's signature
if (!messageStream.isEmpty) {
  var lastTimeStamp = messageStream.head.timestamp
  while (!messageStream.isEmpty) {
    val messages = selectSame(messageStream).toList
    println(messages)
  }
}
Upon first execution I am getting List(1, 1, 1) as expected, but then I only get List(2), as if I lost one element down the line... Probably I am doing something wrong with the iterators/lists, but I am a bit lost here.
Scaladoc of Iterator says about takeWhile:
Reuse: After calling this method, one should discard the iterator it
was called on, and use only the iterator that was returned. Using the
old iterator is undefined, subject to change, and may result in
changes to the new iterator as well.
So that's why. This basically means you cannot directly do what you want with Iterators and takeWhile. IMHO, the easiest would be to quickly write your own recursive function to do that.
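A rough sketch of such a recursive function (mine, not the answer's code; it groups a plain List rather than an iterator) might be:
def groupSame[A](xs: List[A]): List[List[A]] = xs match {
  case Nil => Nil
  case head :: _ =>
    val (same, rest) = xs.span(_ == head) // consecutive elements equal to the head
    same :: groupSame(rest)
}

groupSame(List(1, 1, 1, 2, 2, 3, 3)) // List(List(1, 1, 1), List(2, 2), List(3, 3))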
If you want to stick with Iterators, you could use the duplicate method on the Iterator to obtain a second copy on which you'd call dropWhile.
Even better: Use span repeatedly:
def selectSame(messages: BufferedIterator[Int]) = {
  val head = messages.head
  messages.span(_ == head)
}

def iter(msgStream: BufferedIterator[Int]): Unit = if (!msgStream.isEmpty) {
  val (msgs, rest) = selectSame(msgStream)
  println(msgs.toList)
  iter(rest.buffered) // span returns plain Iterators, so re-buffer the remainder
}

val messageStream = List(1, 1, 1, 2, 2, 3, 3).iterator.buffered
if (!messageStream.isEmpty) {
  var lastTimeStamp = messageStream.head.timestamp
  iter(messageStream)
}

Efficient way to fold list in scala, while avoiding allocations and vars

I have a bunch of items in a list, and I need to analyze the content to find out how many of them are "complete". I started out with partition, but then realized that I didn't need the two lists back, so I switched to a fold:
val counts = groupRows.foldLeft( (0, 0) )( (pair, row) =>
  if (row.time == 0) (pair._1 + 1, pair._2)
  else (pair._1, pair._2 + 1)
)
but I have a lot of rows to go through for a lot of parallel users, and it is causing a lot of GC activity (assumption on my part...the GC could be from other things, but I suspect this since I understand it will allocate a new tuple on every item folded).
for the time being, I've rewritten this as
var complete = 0
var incomplete = 0
list.foreach(row => if(row.time != 0) complete += 1 else incomplete += 1)
which fixes the GC, but introduces vars.
I was wondering if there was a way of doing this without using vars while also not abusing the GC?
EDIT:
Hard call on the answers I've received. A var implementation seems to be considerably faster on large lists (like by 40%) than even a tail-recursive optimized version that is more functional but should be equivalent.
The first answer from dhg seems to be on-par with the performance of the tail-recursive one, implying that the size pass is super-efficient...in fact, when optimized it runs very slightly faster than the tail-recursive one on my hardware.
The cleanest two-pass solution is probably to just use the built-in count method:
val complete = groupRows.count(_.time == 0)
val counts = (complete, groupRows.size - complete)
But you can do it in one pass if you use partition on an iterator:
val (complete, incomplete) = groupRows.iterator.partition(_.time == 0)
val counts = (complete.size, incomplete.size)
This works because the new returned iterators are linked behind the scenes and calling next on one will cause it to move the original iterator forward until it finds a matching element, but it remembers the non-matching elements for the other iterator so that they don't need to be recomputed.
Example of the one-pass solution:
scala> val groupRows = List(Row(0), Row(1), Row(1), Row(0), Row(0)).view.map{x => println(x); x}
scala> val (complete, incomplete) = groupRows.iterator.partition(_.time == 0)
Row(0)
Row(1)
complete: Iterator[Row] = non-empty iterator
incomplete: Iterator[Row] = non-empty iterator
scala> val counts = (complete.size, incomplete.size)
Row(1)
Row(0)
Row(0)
counts: (Int, Int) = (3,2)
I see you've already accepted an answer, but you rightly mention that that solution will traverse the list twice. The way to do it efficiently is with recursion.
def counts(xs: List[...], complete: Int = 0, incomplete: Int = 0): (Int, Int) =
  xs match {
    case Nil => (complete, incomplete)
    case row :: tail =>
      if (row.time == 0) counts(tail, complete + 1, incomplete)
      else counts(tail, complete, incomplete + 1)
  }
This is effectively just a customized fold, except we use 2 accumulators which are just Ints (primitives) instead of tuples (reference types). It should also be just as efficient as a while-loop with vars - in fact, the bytecode should be identical.
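If you want the compiler to check that claim, you can add a @tailrec annotation to the same function (shown here assuming the question's Row type with a time field):
import scala.annotation.tailrec

// Identical to the code above, with @tailrec so compilation fails
// if the recursion cannot be turned into a loop.
@tailrec
def counts(xs: List[Row], complete: Int = 0, incomplete: Int = 0): (Int, Int) =
  xs match {
    case Nil => (complete, incomplete)
    case row :: tail =>
      if (row.time == 0) counts(tail, complete + 1, incomplete)
      else counts(tail, complete, incomplete + 1)
  }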
Maybe it's just me, but I prefer using the various specialized folds (.size, .exists, .sum, .product) if they are available. I find it clearer and less error-prone than the heavy-duty power of general folds.
val complete = groupRows.view.filter(_.time==0).size
(complete, groupRows.length - complete)
How about this one? No import tax.
import scala.collection.generic.CanBuildFrom
import scala.collection.Traversable
import scala.collection.mutable.Builder

case class Count(n: Int, total: Int) {
  def not = total - n
}
object Count {
  implicit def cbf[A]: CanBuildFrom[Traversable[A], Boolean, Count] = new CanBuildFrom[Traversable[A], Boolean, Count] {
    def apply(): Builder[Boolean, Count] = new Counter
    def apply(from: Traversable[A]): Builder[Boolean, Count] = apply()
  }
}

class Counter extends Builder[Boolean, Count] {
  var n = 0
  var ttl = 0
  override def +=(b: Boolean) = { if (b) n += 1; ttl += 1; this }
  override def clear() { n = 0; ttl = 0 }
  override def result = Count(n, ttl)
}

object Counting extends App {
  val vs = List(4, 17, 12, 21, 9, 24, 11)
  val res: Count = vs map (_ % 2 == 0)
  Console println s"${vs} have ${res.n} evens out of ${res.total}; ${res.not} were odd."
  val res2: Count = vs collect { case i if i % 2 == 0 => i > 10 }
  Console println s"${vs} have ${res2.n} evens over 10 out of ${res2.total}; ${res2.not} were smaller."
}
OK, inspired by the answers above, but really wanting to only pass over the list once and avoid GC, I decided that, in the face of a lack of direct API support, I would add this to my central library code:
class RichList[T](private val theList: List[T]) {
  def partitionCount(f: T => Boolean): (Int, Int) = {
    var matched = 0
    var unmatched = 0
    theList.foreach(r => { if (f(r)) matched += 1 else unmatched += 1 })
    (matched, unmatched)
  }
}

object RichList {
  implicit def apply[T](list: List[T]): RichList[T] = new RichList(list)
}
Then in my application code (if I've imported the implicit), I can write var-free expressions:
val (complete, incomplete) = groupRows.partitionCount(_.time != 0)
and get what I want: an optimized GC-friendly routine that prevents me from polluting the rest of the program with vars.
However, I then saw Luigi's benchmark, and updated it to:
Use a longer list so that multiple passes on the list were more obvious in the numbers
Use a boolean function in all cases, so that we are comparing things fairly
http://pastebin.com/2XmrnrrB
The var implementation is definitely considerably faster, even though Luigi's routine should be identical (as one would expect with optimized tail recursion). Surprisingly, dhg's dual-pass original is just as fast (slightly faster if compiler optimization is on) as the tail-recursive one. I do not understand why.
It is slightly tidier to use a mutable accumulator pattern, like so, especially if you can re-use your accumulator:
case class Accum(var complete: Int = 0, var incomplete: Int = 0) {
  def inc(compl: Boolean): this.type = {
    if (compl) complete += 1 else incomplete += 1
    this
  }
}

val counts = groupRows.foldLeft( Accum() ){ (a, row) => a.inc( row.time == 0 ) }
If you really want to, you can hide your vars as private; if not, you still are a lot more self-contained than the pattern with vars.
You could just calculate it using the difference like so:
def counts(groupRows: List[Row]) = {
  val complete = groupRows.foldLeft(0) { (count, row) =>
    if (row.time == 0) count + 1 else count
  }
  (complete, groupRows.length - complete)
}

How to yield a single element from for loop in scala?

Much like this question:
Functional code for looping with early exit
Say the code is
def findFirst[T](objects: List[T]): Option[T] = {
  for (obj <- objects) {
    if (expensiveFunc(obj) != null) return /*???*/ Some(obj)
  }
  None
}
How to yield a single element from a for loop like this in scala?
I do not want to use find, as proposed in the original question; I am curious whether and how it could be implemented using the for loop.
* UPDATE *
First, thanks for all the comments, but I guess I was not clear in the question. I am shooting for something like this:
val seven = for {
  x <- 1 to 10
  if x == 7
} return x
And that does not compile. The two errors are:
- return outside method definition
- method main has return statement; needs result type
I know find() would be better in this case, I am just learning and exploring the language. And in a more complex case with several iterators, I think finding with for can actually be useful.
Thanks commenters, I'll start a bounty to make up for the bad posing of the question :)
If you want to use a for loop, which uses a nicer syntax than chained invocations of .find, .filter, etc., there is a neat trick. Instead of iterating over strict collections like list, iterate over lazy ones like iterators or streams. If you're starting with a strict collection, make it lazy with, e.g. .toIterator.
Let's see an example.
First let's define a "noisy" int, that will show us when it is invoked
def noisyInt(i : Int) = () => { println("Getting %d!".format(i)); i }
Now let's fill a list with some of these:
val l = List(1, 2, 3, 4).map(noisyInt)
We want to look for the first element which is even.
val r1 = for(e <- l; val v = e() ; if v % 2 == 0) yield v
The above line results in:
Getting 1!
Getting 2!
Getting 3!
Getting 4!
r1: List[Int] = List(2, 4)
...meaning that all elements were accessed. That makes sense, given that the resulting list contains all even numbers. Let's iterate over an iterator this time:
val r2 = (for(e <- l.toIterator; val v = e() ; if v % 2 == 0) yield v)
This results in:
Getting 1!
Getting 2!
r2: Iterator[Int] = non-empty iterator
Notice that the loop was executed only up to the point where it could figure out whether the result was an empty or non-empty iterator.
To get the first result, you can now simply call r2.next.
If you want a result of an Option type, use:
if(r2.hasNext) Some(r2.next) else None
Edit: Your second example in this encoding is just:
val seven = (for {
  x <- (1 to 10).toIterator
  if x == 7
} yield x).next
...of course, you should be sure that there is always at least a solution if you're going to use .next. Alternatively, use headOption, defined for all Traversables, to get an Option[Int].
You can turn your list into a stream, so that any filters the for loop contains are only evaluated on demand. However, yielding from the stream will always return a stream, and what you want is, I suppose, an option; so, as a final step, you can check whether the resulting stream has at least one element and return its head as an option. The headOption function does exactly that.
def findFirst[T](objects: List[T], expensiveFunc: T => Boolean): Option[T] =
  (for (obj <- objects.toStream if expensiveFunc(obj)) yield obj).headOption
Why not do exactly what you sketched above, that is, return from the loop early? If you are interested in what Scala actually does under the hood, run your code with -print. Scala desugars the loop into a foreach and then uses an exception to leave the foreach prematurely.
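Concretely, the early-return version of the question's own sketch could look like this (expensiveFunc is taken as a parameter here, since the question never defines it):
def findFirst[T](objects: List[T])(expensiveFunc: T => AnyRef): Option[T] = {
  for (obj <- objects) {
    // the for loop desugars to foreach; `return` leaves it early via a
    // control-flow exception inserted by the compiler, as described above
    if (expensiveFunc(obj) != null) return Some(obj)
  }
  None
}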
So what you are trying to do is break out of a loop after your condition is satisfied. The answer here might be what you are looking for: How do I break out of a loop in Scala?
Overall, a for comprehension in Scala is translated into map, flatMap and filter operations, so it will not be possible to break out of these functions unless you throw an exception.
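For completeness, the linked Breaks approach (which also throws an exception under the hood) might look roughly like this:
import scala.util.control.Breaks._

var result: Option[Int] = None
breakable {
  for (x <- 1 to 10) {
    if (x == 7) { result = Some(x); break() }
  }
}
// result is now Some(7)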
If you are wondering, this is how find is implemented in LinearSeqOptimized.scala, which List inherits:
override /*IterableLike*/
def find(p: A => Boolean): Option[A] = {
  var these = this
  while (!these.isEmpty) {
    if (p(these.head)) return Some(these.head)
    these = these.tail
  }
  None
}
This is a horrible hack. But it would get you the result you wished for.
Idiomatically you'd use a Stream or View and just compute the parts you need.
def findFirst[T](objects: List[T]): T = {
  def expensiveFunc(o: T) = // unclear what should be returned here
  case class MissusedException(val data: T) extends Exception
  try {
    (for (obj <- objects) {
      if (expensiveFunc(obj) != null) throw new MissusedException(obj)
    })
    objects.head // T must be returned from loop, dummy
  } catch {
    case MissusedException(obj) => obj
  }
}
Why not something like
object Main {
  def main(args: Array[String]): Unit = {
    val seven = (for (
      x <- 1 to 10
      if x == 7
    ) yield x).headOption
  }
}
The variable seven will be an Option holding Some(value) if a value satisfies the condition, and None otherwise.
I hope this helps. Here are some implementations without 'return':
object TakeWhileLoop extends App {
  println("first non-null: " + func(Seq(null, null, "x", "y", "z")))
  def func[T](seq: Seq[T]): T = if (seq.isEmpty) null.asInstanceOf[T] else
    seq(seq.takeWhile(_ == null).size)
}

object OptionLoop extends App {
  println("first non-null: " + func(Seq(null, null, "x", "y", "z")))
  def func[T](seq: Seq[T], index: Int = 0): T = if (seq.isEmpty) null.asInstanceOf[T] else
    Option(seq(index)) getOrElse func(seq, index + 1)
}

object WhileLoop extends App {
  println("first non-null: " + func(Seq(null, null, "x", "y", "z")))
  def func[T](seq: Seq[T]): T = if (seq.isEmpty) null.asInstanceOf[T] else {
    var i = 0
    def obj = seq(i)
    while (obj == null)
      i += 1
    obj
  }
}
objects.iterator filter { obj => expensiveFunc(obj) != null } next
The trick is to get some lazily evaluated view on the collection, either an iterator, a Stream, or objects.view. The filter will only execute as far as needed.
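If you would rather get an Option back than risk the NoSuchElementException that next throws when nothing matches, a small variation (same objects and expensiveFunc as above) could be:
val it = objects.iterator.filter(obj => expensiveFunc(obj) != null)
val first = if (it.hasNext) Some(it.next()) else None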