What is the difference between Iterator and Iterable in scala?
I thought that Iterable represents a set that I can iterate through, and Iterator is a "pointer" to one of the items in the iterable set.
However, Iterator has functions like forEach, map, foldLeft. It can be converted to Iterable via toIterable. And, for example, scala.io.Source.getLines returns Iterator, not Iterable.
But I cannot do groupBy on Iterator and I can do it on Iterable.
So, what's the relation between those two, Iterator and Iterable?
In short: An Iterator does have state, whereas an Iterable does not.
See the API docs for both.
Iterable:
A base trait for iterable collections.
This is a base trait for all Scala collections that define an iterator
method to step through one-by-one the collection's elements.
[...] This trait implements Iterable's foreach method by stepping
through all elements using iterator.
Iterator:
Iterators are data structures that allow to iterate over a sequence of
elements. They have a hasNext method for checking if there is a next
element available, and a next method which returns the next element
and discards it from the iterator.
An iterator is mutable: most operations on it change its state. While
it is often used to iterate through the elements of a collection, it
can also be used without being backed by any collection (see
constructors on the companion object).
With an Iterator you can stop an iteration and continue it later if you want. If you try to do this with an Iterable it will begin from the head again:
scala> val iterable: Iterable[Int] = 1 to 4
iterable: Iterable[Int] = Range(1, 2, 3, 4)
scala> iterable.take(2)
res8: Iterable[Int] = Range(1, 2)
scala> iterable.take(2)
res9: Iterable[Int] = Range(1, 2)
scala> val iterator = iterable.iterator
iterator: Iterator[Int] = non-empty iterator
scala> if (iterator.hasNext) iterator.next
res23: AnyVal = 1
scala> if (iterator.hasNext) iterator.next
res24: AnyVal = 2
scala> if (iterator.hasNext) iterator.next
res25: AnyVal = 3
scala> if (iterator.hasNext) iterator.next
res26: AnyVal = 4
scala> if (iterator.hasNext) iterator.next
res27: AnyVal = ()
Note, that I didn't use take on Iterator. The reason for this is that it is tricky to use. hasNext and next are the only two methods that are guaranteed to work as expected on Iterator. See the Scaladoc again:
It is of particular importance to note that, unless stated otherwise,
one should never use an iterator after calling a method on it. The two
most important exceptions are also the sole abstract methods: next and
hasNext.
Both these methods can be called any number of times without having to
discard the iterator. Note that even hasNext may cause mutation --
such as when iterating from an input stream, where it will block until
the stream is closed or some input becomes available.
Consider this example for safe and unsafe use:
def f[A](it: Iterator[A]) = {
if (it.hasNext) { // Safe to reuse "it" after "hasNext"
it.next // Safe to reuse "it" after "next"
val remainder = it.drop(2) // it is *not* safe to use "it" again after this line!
remainder.take(2) // it is *not* safe to use "remainder" after this line!
} else it
}
Another explanation from Martin Odersky and Lex Spoon:
There's an important difference between the foreach method on
iterators and the same method on traversable collections: When called
to an iterator, foreach will leave the iterator at its end when it is
done. So calling next again on the same iterator will fail with a
NoSuchElementException. By contrast, when called on on a collection,
foreach leaves the number of elements in the collection unchanged
(unless the passed function adds to removes elements, but this is
discouraged, because it may lead to surprising results).
Source: http://www.scala-lang.org/docu/files/collections-api/collections_43.html
Note also (thanks to Wei-Ching Lin for this tip) Iterator extends the TraversableOnce trait while Iterable doesn't.
Related
I am trying to convert java.util.StringTokenizer to Scala's Iterator and the following approach fails:
def toIterator(st: StringTokenizer): Iterator[String] =
Iterator.continually(st.nextToken()).takeWhile(_ => st.hasMoreTokens()))
But this works:
def toIterator(st: StringTokenizer): Iterator[String] =
Iterator.fill(st.countTokens())(st.nextToken())
You can see this in Scala console:
scala> Iterator("a b", "c d").map(new java.util.StringTokenizer(_)).flatMap(st => Iterator.continually(st.nextToken()).takeWhile(_ => st.hasMoreTokens())).toList
res1: List[String] = List(a, c)
scala> Iterator("a b", "c d").map(new java.util.StringTokenizer(_)).flatMap(st => Iterator.fill(st.countTokens())(st.nextToken())).toList
res2: List[String] = List(a, b, c, d)
As you can see res1 is incorrect and res2 is correct. What am I doing wrong? The first version should work and is preferred since it is 2x faster than second approach since it does not scan the string twice
takeWhile is not intended to be used statefully. It should take a pure function that determines, solely based on the input, whether or not to continue.
Specifically, the iterator must produce the value before the takeWhile predicate gets called. Even though your function ignores the takeWhile argument, it still gets evaluated. So nextToken gets called and then we check for more tokens.
To be perfectly precise, in your "a b" case,
First, we call nextToken, which is what Iterator.continually does. There's a next token, so it returns "a".
Now, to determine if we should include the next token, we call your predicate with "a" as argument. Your predicate ignores "a" and calls hasMoreTokens. Our tokenizer has more tokens (namely, "b"), so it returns true. Continue.
Now we call nextToken again. This returns "b".
We need to determine if we should include this in our result, so our takeWhile predicate runs with "b" as argument. Our takeWhile predicate ignores its argument and calls hasMoreTokens. We have no more tokens anymore, so this returns false. We should not include this element.
takeWhile has returned false, so we stop at the last element for which it returned true. Our resulting list is List("a").
Since we're abusing a pure functional technique like takeWhile to be stateful, we get unintuitive results.
As much as it looks snazzy and clever to have a one-line solution, what you have is a stateful, imperative object that you want to adapt to the Iterator interface. Hiding that statefulness in a bunch of pure function calls is not a good idea, so we should just write our own subclass of Iterator and do it properly.
import java.util.StringTokenizer
final class StringTokenizerIterator(
private val tokenizer: StringTokenizer
) extends Iterator[String] {
def hasNext: Boolean = tokenizer.hasMoreTokens
def next(): String = tokenizer.nextToken()
}
object Example {
def toIterator(st: StringTokenizer): Iterator[String] =
new StringTokenizerIterator(st)
def main(args: Array[String]) = {
println(Iterator("a b", "c d")
.map(new java.util.StringTokenizer(_))
.flatMap(toIterator(_))
.toList)
}
}
We're doing the same work you were doing calling the appropriate StringTokenizer functions, but we're doing it in a full class that encapsulates the state, rather than pretending the stateful part is not there. It's longer code, yes, but it should be longer. We don't want the messy parts of it to go unnoticed.
I am new to Scala and I just learned that LazyList was created to replace Stream, and at the same time they added the .view methods to all collections.
So, I am wondering why was LazyList added to Scala collections library, when we can do List.view?
I just looked at the Scaladoc, and it seems that the only difference is that LazyList has memoization, while View does not. Am I right or wrong?
Stream elements are realized lazily except for the 1st (head) element. That was seen as a deficiency.
A List view is re-evaluated lazily but, as far as I know, has to be completely realized first.
def bang :Int = {print("BANG! ");1}
LazyList.fill(4)(bang) //res0: LazyList[Int] = LazyList(<not computed>)
Stream.fill(3)(bang) //BANG! res1: Stream[Int] = Stream(1, <not computed>)
List.fill(2)(bang).view //BANG! BANG! res2: SeqView[Int] = SeqView(<not computed>)
In 2.13, you can't force your way back from a view to the original collection type:
scala> case class C(n: Int) { def bump = new C(n+1).tap(i => println(s"bump to $i")) }
defined class C
scala> List(C(42)).map(_.bump)
bump to C(43)
res0: List[C] = List(C(43))
scala> List(C(42)).view.map(_.bump)
res1: scala.collection.SeqView[C] = SeqView(<not computed>)
scala> .force
^
warning: method force in trait View is deprecated (since 2.13.0): Views no longer know about their underlying collection type; .force always returns an IndexedSeq
bump to C(43)
res2: scala.collection.IndexedSeq[C] = Vector(C(43))
scala> LazyList(C(42)).map(_.bump)
res3: scala.collection.immutable.LazyList[C] = LazyList(<not computed>)
scala> .force
bump to C(43)
res4: res3.type = LazyList(C(43))
A function taking a view and optionally returning a strict realization would have to also take a "forcing function" such as _.toList, if the caller needs to choose the result type.
I don't do this sort of thing at my day job, but this behavior surprises me.
The difference is that LazyList can be generated from huge/infinite sequence, so you can do something like:
val xs = (1 to 1_000_000_000).to(LazyList)
And that won't run out of memory. After that you can operate on the lazy list with transformers. You won't be able to do the same by creating a List and taking a view from it. Having said that, SeqView has a much reacher set of methods compared to LazyList and that's why you can actually take a view of a LazyList like:
val xs = (1 to 1_000_000_000).to(LazyList)
val listView = xs.view
In Scala collections, if one wants to iterate over a collection (without returning results, i.e. doing a side effect on every element of collection), it can be done either with
final def foreach(f: (A) ⇒ Unit): Unit
or
final def map[B](f: (A) ⇒ B): SomeCollectionClass[B]
With the exception of possible lazy mapping(*), from an end-user perspective, I see zero differences in these invocations:
myCollection.foreach { element =>
doStuffWithElement(element);
}
myCollection.map { element =>
doStuffWithElement(element);
}
given that I can just ignore what map outputs. I can't think of any specific reason why two different methods should exist & be used, when map seems to include all the functionality of foreach, and, in fact, I would be pretty much impressed if an intelligent compiler & VM won't optimize out that collection object creation given that it's not assigned to anything, or read, or used anywhere.
So, the question is - am I right - and there are no reasons to call foreach anywhere in one's code?
Notes:
(*) The lazy mapping concept, as throughly illustrated in this question, might change things a bit and justify usage of foreach, but as far as I can see, one specifically needs to stumble upon a LazyMap, normal
(**) If one's not using a collection, but writing one, then one would quickly stumble upon the fact that for comprehension syntax syntax is in fact a syntax sugar that generates "foreach" call, i.e. these two lines generate fully equivalent code:
for (element <- myCollection) { doStuffWithElement(element); }
myCollection.foreach { element => doStuffWithElement(element); }
So if one cares about other people using that collection class with for syntax, one might still want to implement foreach method.
I can think of a couple motivations:
When the foreach is the last line of a method that is of type Unit your compiler will not give an warning but will with map (and you need -Ywarn-value-discard on). Sometimes you get warning: a pure expression does nothing in statement position; you may be omitting necessary parentheses using map but wouldn't with foreach.
General readability - a reader can know that your mutating some state without returning something at a glance, but greater cognitive resources would be required to understand the same operation if map was used
Further to 1. you also can have type checking when passing named functions around, then into map and foreach
Using foreach won't build a new list, so will be more efficient (thanks #Vishnu)
scala> (1 to 5).iterator map println
res0: Iterator[Unit] = non-empty iterator
scala> (1 to 5).iterator foreach println
1
2
3
4
5
I'd be impressed if the builder machinery could be optimized away.
scala> :pa
// Entering paste mode (ctrl-D to finish)
implicit val cbf = new collection.generic.CanBuildFrom[List[Int],Int,List[Int]] {
def apply() = new collection.mutable.Builder[Int, List[Int]] {
val b = new collection.mutable.ListBuffer[Int]
override def +=(i: Int) = { println(s"Adding $i") ; b +=(i) ; this }
override def clear() = () ; override def result() = b.result() }
def apply(from: List[Int]) = apply() }
// Exiting paste mode, now interpreting.
cbf: scala.collection.generic.CanBuildFrom[List[Int],Int,List[Int]] = $anon$2#e3cee7b
scala> List(1,2,3) map (_ + 1)
Adding 2
Adding 3
Adding 4
res1: List[Int] = List(2, 3, 4)
scala> List(1,2,3) foreach (_ + 1)
I know that when you type:
val list = List(2,3)
you are accessing the apply method of the List object which returns a List. What I can't understand is why is this possible when the List class is abstract and therefore cannot be directly instanciated(new List() won't compile)?
I'd also like to ask what is the difference between:
val arr = Array(4,5,6)
and
val arr = new Array(4, 5, 6)
The List class is sealed and abstract. It has two concreate implementations
Nil which represents an empty list
::[B] which represents a non empty list with head and tail. ::[B] in the documentation
When you call List.apply it will jump through some hoops and supply you with an instance of the ::[B] case class.
About array: new Array(4, 5, 6) will throw a compile error as the constructor of array is defined like this: new Array(_length: Int). The apply method of the Array companion object uses the arguments to create a new instance of an Array (with the help of ArrayBuilder).
I started writing that the easy way to determine this is to look at the sources for the methods you're calling, which are available from the ScalaDoc. However, the various levels of indirection that are gone through to actually build a list give lie to the term 'easy'! It's worth having a look through if you want, starting from the apply method in the List object which is defined as follows:
override def apply[A](xs: A*): List[A] = xs.toList
You may or may not know that a parameter of the form xs : A* is treated internally as a Seq, which means that we're calling the toList method on a Seq, which is defined in TraversableOnce. This then delegates to a generic to method, which looks for an implicit
CanBuildFrom which actually constructs the list. So what you're getting back is some implementation of List which is chosen by the CanBuildFrom. What you actually get is a scala.collection.immutable.$colon$colon, which implements a singly-linked list.
Luckily, the behaviour of Array.apply is a little easier to look up:
def apply[T: ClassTag](xs: T*): Array[T] = {
val array = new Array[T](xs.length)
var i = 0
for (x <- xs.iterator) { array(i) = x; i += 1 }
array
}
So, Array.apply just delegates to new Array and then sets elements appropriately.
I have looked at this question but still don't understand the difference between Iterable and Traversable traits. Can someone explain ?
Think of it as the difference between blowing and sucking.
When you have call a Traversables foreach, or its derived methods, it will blow its values into your function one at a time - so it has control over the iteration.
With the Iterator returned by an Iterable though, you suck the values out of it, controlling when to move to the next one yourself.
To put it simply, iterators keep state, traversables don't.
A Traversable has one abstract method: foreach. When you call foreach, the collection will feed the passed function all the elements it keeps, one after the other.
On the other hand, an Iterable has as abstract method iterator, which returns an Iterator. You can call next on an Iterator to get the next element at the time of your choosing. Until you do, it has to keep track of where it was in the collection, and what's next.
tl;dr Iterables are Traversables that can produce stateful Iterators
First, know that Iterable is subtrait of Traversable.
Second,
Traversable requires implementing the foreach method, which is used by everything else.
Iterable requires implementing the iterator method, which is used by everything else.
For example, the implemetation of find for Traversable uses foreach (via a for comprehension) and throws a BreakControl exception to halt iteration once a satisfactory element has been found.
trait TravserableLike {
def find(p: A => Boolean): Option[A] = {
var result: Option[A] = None
breakable {
for (x <- this)
if (p(x)) { result = Some(x); break }
}
result
}
}
In contrast, the Iterable subtract overrides this implementation and calls find on the Iterator, which simply stops iterating once the element is found:
trait Iterable {
override /*TraversableLike*/ def find(p: A => Boolean): Option[A] =
iterator.find(p)
}
trait Iterator {
def find(p: A => Boolean): Option[A] = {
var res: Option[A] = None
while (res.isEmpty && hasNext) {
val e = next()
if (p(e)) res = Some(e)
}
res
}
}
It'd be nice not to throw exceptions for Traversable iteration, but that's the only way to partially iterate when using just foreach.
From one perspective, Iterable is the more demanding/powerful trait, as you can easily implement foreach using iterator, but you can't really implement iterator using foreach.
In summary, Iterable provides a way to pause, resume, or stop iteration via a stateful Iterator. With Traversable, it's all or nothing (sans exceptions for flow control).
Most of the time it doesn't matter, and you'll want the more general interface. But if you ever need more customized control over iteration, you'll need an Iterator, which you can retrieve from an Iterable.
Daniel's answer sounds good. Let me see if I can to put it in my own words.
So an Iterable can give you an iterator, that lets you traverse the elements one at a time (using next()), and stop and go as you please. To do that the iterator needs to keep an internal "pointer" to the element's position. But a Traversable gives you the method, foreach, to traverse all elements at once without stopping.
Something like Range(1, 10) needs to have only 2 integers as state as a Traversable. But Range(1, 10) as an Iterable gives you an iterator which needs to use 3 integers for state, one of which is an index.
Considering that Traversable also offers foldLeft, foldRight, its foreach needs to traverse the elements in a known and fixed order. Therefore it's possible to implement an iterator for a Traversable. E.g.
def iterator = toList.iterator