Scala Seq vs List performance - scala

So I am a bit perplexed. I have a piece of code in Scala (the exact code is not really all that important). I had all my methods written to take Seq[T]. These methods are mostly tail recursive and use the Seq[T] as an accumulator, which is fed initially like Seq(). Interestingly enough, when I swap all the signatures to the concrete List implementation, I observe a three-fold improvement in performance.
Isn't it the case that Seq's default implementation is in fact an immutable List? So if that is the case, what is really going on?

Calling Seq(1,2,3) and calling List(1,2,3) will both result in a 1 :: 2 :: 3 :: Nil. The Seq.apply method is just a very generic method that looks like this:
def apply[A](elems: A*): CC[A] = {
  if (elems.isEmpty) empty[A]
  else {
    val b = newBuilder[A]
    b ++= elems
    b.result()
  }
}
newBuilder is the thing that sort of matters here. That method delegates to scala.collection.immutable.Seq.newBuilder:
def newBuilder[A]: Builder[A, Seq[A]] = new mutable.ListBuffer
So the Builder for a Seq is a mutable.ListBuffer. A Seq gets constructed by appending the elements to the empty ListBuffer and then calling result on it, which is implemented like this:
def result: List[A] = toList

/** Converts this buffer to a list. Takes constant time. The buffer is
 *  copied lazily, the first time it is mutated.
 */
override def toList: List[A] = {
  exported = !isEmpty
  start
}
List also has a ListBuffer as its Builder. It goes through a slightly different but similar building process. It is not going to make a big difference anyway, since I assume most of your algorithm consists of prepending things to a Seq, not calling Seq.apply(...) the whole time. Even if you did, it shouldn't make much difference.
It's really not possible to say what is causing the behavior you're seeing without seeing the code that has that behavior.
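For reference, a hypothetical sketch of the accumulator pattern the question describes (the real code was not shown): the List version conses directly with ::, while the Seq version goes through the generic +: method.

import scala.annotation.tailrec

@tailrec
def buildSeq(n: Int, acc: Seq[Int] = Seq()): Seq[Int] =
  if (n <= 0) acc
  else buildSeq(n - 1, n +: acc) // generic prepend, dispatched through SeqLike

@tailrec
def buildList(n: Int, acc: List[Int] = Nil): List[Int] =
  if (n <= 0) acc
  else buildList(n - 1, n :: acc) // direct cons onto an immutable List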

Related

How to case match a View in scala

I'm implementing a class to constrain the access on an iterable. Intermediate steps of the sequence (after some map, etc.) are expected to be too big for memory, so map (and the likes: scanLeft, reduce, ...) should be lazy.
Internally I use map(...) = iterable.view.map( ... ). But it seems IterableView.view does not return itself, which produces a useless level of indirection when calling map multiple times. It is probably not critical, but I'd like to call .view only if the iterable is not already a view.
So, how can I case-match a View?
class LazyIterable[A](iterable: Iterable[A]) {
  def map[B](f: A => B) = {
    val mapped = (iterable match {
      case v: View[A]     => v // what should be here?
      case i: Iterable[A] => i.view
    }).map(f)
    new LazyIterable(mapped)
  }
  def compute() = iterable.toList
}
Note that I don't know what the input Iterable is: a concrete Seq (e.g. List, Vector) or a View. And if a View, I don't know which concrete view type (e.g. IterableView, SeqView, ...). And I got lost in the class hierarchy of the View and ViewLike traits.
v: IterableView[A,_] is probably what you are looking for ...
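For instance, a sketch against the 2.12 collections, where plain collection views are IterableView[A, _] (the existential second parameter is the underlying collection type):

import scala.collection.IterableView

class LazyIterable[A](iterable: Iterable[A]) {
  def map[B](f: A => B): LazyIterable[B] = {
    val lazily: Iterable[B] = (iterable match {
      case v: IterableView[A, _] => v      // already a lazy view: reuse it
      case i                     => i.view // wrap a strict collection
    }).map(f)
    new LazyIterable(lazily)
  }
  def compute(): List[A] = iterable.toList
}

The static type degrades to Iterable[B] here, but the runtime value is still a lazy view, so no intermediate collection is built.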
But I don't think you need any of this to begin with.
I simply don't see what having this wrapper buys you at all. What benefits does writing
new LazyIterable(myThing).map(myFunc).compute
have over
myThing.view.map(myFunc).toList

Scala: "map" vs "foreach" - is there any reason to use "foreach" in practice?

In Scala collections, if one wants to iterate over a collection (without returning results, i.e. doing a side effect on every element of collection), it can be done either with
final def foreach(f: (A) ⇒ Unit): Unit
or
final def map[B](f: (A) ⇒ B): SomeCollectionClass[B]
With the exception of possible lazy mapping(*), from an end-user perspective, I see zero differences in these invocations:
myCollection.foreach { element =>
  doStuffWithElement(element)
}

myCollection.map { element =>
  doStuffWithElement(element)
}
given that I can just ignore what map outputs. I can't think of any specific reason why two different methods should exist & be used when map seems to include all the functionality of foreach; in fact, I would be pretty impressed if an intelligent compiler & VM didn't optimize away the creation of that collection object, given that it's not assigned to anything, read, or used anywhere.
So, the question is - am I right - and there are no reasons to call foreach anywhere in one's code?
Notes:
(*) The lazy mapping concept, as thoroughly illustrated in this question, might change things a bit and justify usage of foreach, but as far as I can see, one specifically needs to stumble upon a lazy collection; a normal strict collection evaluates map eagerly.
(**) If one's not using a collection, but writing one, then one quickly stumbles upon the fact that the for-comprehension syntax is in fact syntactic sugar that generates a foreach call, i.e. these two lines generate fully equivalent code:
for (element <- myCollection) { doStuffWithElement(element); }
myCollection.foreach { element => doStuffWithElement(element); }
So if one cares about other people using that collection class with for syntax, one might still want to implement foreach method.
I can think of a couple motivations:
When the foreach is the last line of a method whose return type is Unit, the compiler will not give a warning, but it will with map (with -Ywarn-value-discard on); see the sketch after this list. Sometimes you get warning: a pure expression does nothing in statement position; you may be omitting necessary parentheses using map but wouldn't with foreach.
General readability: a reader can tell at a glance that you're mutating some state without returning anything, whereas greater cognitive resources would be required to understand the same operation if map were used.
Further to 1., you also get type checking when passing named functions around into map and foreach.
Using foreach won't build a new list, so it will be more efficient (thanks @Vishnu).
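A small sketch of point 1 (hypothetical method names; compile with -Ywarn-value-discard to see the difference):

def logAllMap(xs: List[Int]): Unit =
  xs.map(println)     // warns: discarded non-Unit value (a List[Unit] is thrown away)

def logAllForeach(xs: List[Int]): Unit =
  xs.foreach(println) // no warning: foreach already returns Unit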
scala> (1 to 5).iterator map println
res0: Iterator[Unit] = non-empty iterator
scala> (1 to 5).iterator foreach println
1
2
3
4
5
I'd be impressed if the builder machinery could be optimized away.
scala> :pa
// Entering paste mode (ctrl-D to finish)
implicit val cbf = new collection.generic.CanBuildFrom[List[Int], Int, List[Int]] {
  def apply() = new collection.mutable.Builder[Int, List[Int]] {
    val b = new collection.mutable.ListBuffer[Int]
    override def +=(i: Int) = { println(s"Adding $i"); b += i; this }
    override def clear() = ()
    override def result() = b.result()
  }
  def apply(from: List[Int]) = apply()
}
// Exiting paste mode, now interpreting.
cbf: scala.collection.generic.CanBuildFrom[List[Int],Int,List[Int]] = $anon$2#e3cee7b
scala> List(1,2,3) map (_ + 1)
Adding 2
Adding 3
Adding 4
res1: List[Int] = List(2, 3, 4)
scala> List(1,2,3) foreach (_ + 1)

Scala: Thread safe mutable lazy Iterator with append

For an immutable flavour, Iterator does the job.
val x = Iterator.fill(100000)(someFn)
Now I want to implement a mutable version of Iterator, with three guarantees:
thread-safe on all transformations (fold, foldLeft, ...) and append
lazy evaluated
traversable only once! Once used, an object from this Iterator should be destroyed.
Is there an existing implementation to give me these guarantees? Any library or framework example would be great.
Update
To illustrate the desired behaviour.
class SomeThing {}

class Test(val list: Iterator[SomeThing] = Iterator.empty) { // default added so new Test() compiles
  def add(thing: SomeThing): Test =
    new Test(list ++ Iterator(thing))
}
(new Test()).add(new SomeThing).add(new SomeThing);
In this example, SomeThing is an expensive construct, it needs to be lazy.
Re-iterating over list is never required, Iterator is a good fit.
This is supposed to asynchronously and lazily sequence 10 million SomeThing instances without depleting the executor (a cached thread pool executor) or running out of memory.
You don't need a mutable Iterator for this, just daisy-chain the immutable form:
class SomeThing {}

case class Test(list: Iterator[SomeThing] = Iterator.empty) {
  def add(thing: => SomeThing) = Test(list ++ Iterator(thing))
}
(new Test()).add(new SomeThing).add(new SomeThing)
Although you don't really need the extra boilerplate of Test here:
Iterator(new SomeThing) ++ Iterator(new SomeThing)
Note that Iterator.++ takes a by-name param, so the ++ operation is already lazy.
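A quick sketch to see that laziness in action (hypothetical expensive constructor; the right-hand side is not built until the iterator is consumed):

def expensive(i: Int): Int = { println(s"built $i"); i }

val it = Iterator(expensive(1)) ++ Iterator(expensive(2))
// "built 1" has printed (the receiver is strict), but "built 2" has not
println(it.toList) // now "built 2" prints, followed by List(1, 2)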
You might also want to try this, to avoid building intermediate Iterators:
Iterator.continually(new SomeThing) take 2
UPDATE
If you don't know the size in advance, then I'll often use a tactic like this:
def mkSomething = if(cond) Some(new Something) else None
Iterator.continually(mkSomething) takeWhile (_.isDefined) map { _.get }
The trick is to have your generator function wrap its output in an Option, which then gives you a way to flag that the iteration is finished by returning None.
Of course... If you're really pushing out the boat, you can even use the dreaded null:
def mkSomething = if(cond) { new Something } else null
Iterator.continually(mkSomething) takeWhile (_ != null)
Seems like you need to hide the fact that the iterator is mutable but at the same time allow it to grow mutably. What I'm going to propose is the same sort of trick I've used to speed up ::: in the past:
abstract class AppendableIterator[A] extends Iterator[A] {
  protected var inner: Iterator[A]

  def hasNext = inner.hasNext
  def next() = inner.next()

  def append(that: Iterator[A]) = synchronized {
    inner = new JoinedIterator(inner, that)
  }
}

// You might need to add some more things, this is a skeleton
class JoinedIterator[A](first: Iterator[A], second: Iterator[A]) extends Iterator[A] {
  def hasNext = first.hasNext || second.hasNext
  def next() =
    if (first.hasNext) first.next()
    else if (second.hasNext) second.next()
    else Iterator.empty.next() // throws NoSuchElementException, as an exhausted Iterator should
}
So what you're really doing is leaving the Iterator at whatever place in its iteration you might have it while still preserving the thread safety of the append by "joining" another Iterator in non-destructively. You avoid the need to recompute the two together because you never actually force them through a CanBuildFrom.
This is also a generalization of just adding one item. You can always wrap some A in an Iterator[A] of one element if you so choose.
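For example, a hypothetical usage of this skeleton, supplying the abstract inner iterator through an anonymous subclass:

val it = new AppendableIterator[Int] {
  protected var inner: Iterator[Int] = Iterator(1, 2, 3)
}
it.append(Iterator(4, 5)) // synchronized join, nothing is recomputed
println(it.toList)        // List(1, 2, 3, 4, 5)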
Have you looked at the mutable.ParIterable in the collection.parallel package?
To access an iterator over elements you can do something like
val x = ParIterable.fill(100000)(someFn).iterator
From the docs:
Parallel operations are implemented with divide and conquer style algorithms that parallelize well. The basic idea is to split the collection into smaller parts until they are small enough to be operated on sequentially.
...
The higher-order functions passed to certain operations may contain side-effects. Since implementations of bulk operations may not be sequential, this means that side-effects may not be predictable and may produce data-races, deadlocks or invalidation of state if care is not taken. It is up to the programmer to either avoid using side-effects or to use some form of synchronization when accessing mutable data.

Scala: What is the difference between Traversable and Iterable traits in Scala collections?

I have looked at this question but still don't understand the difference between the Iterable and Traversable traits. Can someone explain?
Think of it as the difference between blowing and sucking.
When you call a Traversable's foreach, or its derived methods, it will blow its values into your function one at a time - so it has control over the iteration.
With the Iterator returned by an Iterable though, you suck the values out of it, controlling when to move to the next one yourself.
To put it simply, iterators keep state, traversables don't.
A Traversable has one abstract method: foreach. When you call foreach, the collection will feed the passed function all the elements it keeps, one after the other.
On the other hand, an Iterable has as abstract method iterator, which returns an Iterator. You can call next on an Iterator to get the next element at the time of your choosing. Until you do, it has to keep track of where it was in the collection, and what's next.
tl;dr Iterables are Traversables that can produce stateful Iterators
First, know that Iterable is a subtrait of Traversable.
Second,
Traversable requires implementing the foreach method, which is used by everything else.
Iterable requires implementing the iterator method, which is used by everything else.
For example, the implementation of find for Traversable uses foreach (via a for comprehension) and throws a BreakControl exception to halt iteration once a satisfactory element has been found.
trait TraversableLike[+A] { // simplified from the standard library
  def find(p: A => Boolean): Option[A] = {
    var result: Option[A] = None
    breakable {
      for (x <- this)
        if (p(x)) { result = Some(x); break }
    }
    result
  }
}
In contrast, the Iterable subtrait overrides this implementation and calls find on the Iterator, which simply stops iterating once the element is found:
trait Iterable[+A] {
  override /*TraversableLike*/ def find(p: A => Boolean): Option[A] =
    iterator.find(p)
}

trait Iterator[+A] {
  def find(p: A => Boolean): Option[A] = {
    var res: Option[A] = None
    while (res.isEmpty && hasNext) {
      val e = next()
      if (p(e)) res = Some(e)
    }
    res
  }
}
It'd be nice not to throw exceptions for Traversable iteration, but that's the only way to partially iterate when using just foreach.
From one perspective, Iterable is the more demanding/powerful trait, as you can easily implement foreach using iterator, but you can't really implement iterator using foreach.
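To illustrate the easy direction, a sketch (simplified trait, not the real library code) of foreach written in terms of iterator:

trait MyIterable[+A] {
  def iterator: Iterator[A] // the single abstract method

  def foreach(f: A => Unit): Unit = {
    val it = iterator
    while (it.hasNext) f(it.next())
  }
}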
In summary, Iterable provides a way to pause, resume, or stop iteration via a stateful Iterator. With Traversable, it's all or nothing (sans exceptions for flow control).
Most of the time it doesn't matter, and you'll want the more general interface. But if you ever need more customized control over iteration, you'll need an Iterator, which you can retrieve from an Iterable.
Daniel's answer sounds good. Let me see if I can put it in my own words.
So an Iterable can give you an iterator that lets you traverse the elements one at a time (using next()), stopping and resuming as you please. To do that, the iterator needs to keep an internal "pointer" to the element's position. But a Traversable gives you one method, foreach, to traverse all elements at once without stopping.
Something like Range(1, 10) needs only two integers of state as a Traversable. But Range(1, 10) as an Iterable gives you an iterator, which needs three integers of state, one of which is an index.
Considering that Traversable also offers foldLeft and foldRight, its foreach needs to traverse the elements in a known and fixed order. Therefore it's possible to implement an iterator for a Traversable, e.g.
def iterator = toList.iterator
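For example, a minimal Traversable defined by foreach alone (a sketch; the Countdown name is made up), with its iterator recovered via toList exactly as above:

class Countdown(from: Int) extends Traversable[Int] {
  // The one abstract method: push each element into f, in a fixed order.
  def foreach[U](f: Int => U): Unit = {
    var i = from
    while (i > 0) { f(i); i -= 1 }
  }

  def iterator: Iterator[Int] = toList.iterator // stateful, built eagerly
}

new Countdown(3).foreach(println)        // prints 3, 2, 1
new Countdown(3).iterator.take(2).toList // List(3, 2)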

Pros and Cons of choosing def over val

I'm asking a slight different question than this one. Suppose I have a code snippet:
def foo(i: Int): List[String] = {
  val s = i.toString + "!" // using val
  s :: Nil
}
This is functionally equivalent to the following:
def foo(i: Int): List[String] = {
  def s = i.toString + "!" // using def
  s :: Nil
}
Why would I choose one over the other? Obviously I would assume the second has slight disadvantages in:
creating more bytecode (the inner def is lifted to a method in the class)
a runtime performance overhead of invoking a method over accessing a value
non-strict evaluation means I could easily access s twice (i.e. unnecessarily redo a calculation)
The only advantage I can think of is:
non-strict evaluation of s means it is only called if it is used (but then I could just use a lazy val)
What are people's thoughts here? Is there a significant drawback to making all my inner vals defs?
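To make the evaluation differences concrete, a small sketch (the printlns mark when each right-hand side runs):

def demo(): Unit = {
  val a = { println("val: evaluated now"); 1 }          // once, immediately
  lazy val b = { println("lazy val: on first use"); 2 } // once, on demand
  def c = { println("def: on every use"); 3 }           // on each access

  println(a + a) // no further "val" print
  println(b + b) // "lazy val" prints once
  println(c + c) // "def" prints twice
}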
1)
One answer I didn't see mentioned is that the stack frame for the method you're describing could actually be smaller. Each val you declare occupies a slot on the JVM stack; a value obtained through a def, however, is consumed in the first expression you use it in. Even if the def references something from the environment, the compiler can pass the captured values along as method parameters.
HotSpot should optimize both of these things, or so some people claim. See:
http://www.ibm.com/developerworks/library/j-jtp12214/
Since the inner method gets compiled into a regular private method behind the scenes, and it is usually very small, the JIT compiler might choose to inline it and then optimize it. This could save the time spent allocating smaller stack frames (?), or, by having fewer elements on the stack, make local variable access quicker.
But take this with a (big) grain of salt - I haven't actually made extensive benchmarks to back up this claim.
2)
In addition, to expand on Kevin's valid reply, the stability a val provides also means that you can use it in path-dependent types - something you can't do with a def, since the compiler doesn't check its purity.
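A sketch of that point (hypothetical Graph/Node names): only a stable identifier can appear in a path-dependent type.

class Graph { class Node }

val g1 = new Graph           // g1 is a stable identifier
val n: g1.Node = new g1.Node // so g1.Node is a legal path-dependent type

def g2 = new Graph // g2 is a def, hence not stable
// val m: g2.Node = new g2.Node // rejected: stable identifier required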
3)
For another reason you might want to use a def, see a related question asked not so long ago:
Functional processing of Scala streams without OutOfMemory errors
Essentially, using defs to produce Streams ensures that no additional references to these objects exist, which is important for the GC. Since Streams are lazy anyway, the overhead of creating them is probably negligible even if you have multiple defs.
The val is strict, it's given a value as soon as you define the thing.
Internally, the compiler will mark it as STABLE, equivalent to final in Java. This should allow the JVM to make all sorts of optimisations - I just don't know what they are :)
I can see an advantage in the fact that you are less bound to a location when using a def than when using a val.
This is not a technical advantage but allows for better structuring in some cases.
So, stupid example (please edit this answer, if you’ve got a better one), this is not possible with val:
def foo(i: Int): List[String] = {
  def ret = s :: Nil
  def s = i.toString + "!"
  ret
}
There may be cases where this is important or just convenient.
(So, basically, you can achieve the same with a lazy val; but if it is called at most once, a def will probably be faster than a lazy val.)
For a local declaration like this (with no arguments, evaluated precisely once and with no code evaluated between the point of declaration and the point of evaluation) there is no semantic difference. I wouldn't be surprised if the "val" version compiled to simpler and more efficient code than the "def" version, but you would have to examine the bytecode and possibly profile to be sure.
In your example I would use a val. I think the val/def choice is more meaningful when declaring class members:
class A { def a0 = "a"; def a1 = "a" }

class B extends A {
  var c = 0
  override def a0 = { c += 1; "a" + c }
  override val a1 = "b"
}
Using def in the base class allows the subclass to override with a def that does not return a constant, or with a val, so it gives more flexibility than a val.
Edit: one more use case for def over val is when an abstract class has a "value" whose content should be provided by a subclass:
abstract class C { def f: SomeObject }
new C { val f = new SomeObject(...) }