Scala Buffer: Size or Length? - scala

I am using a mutable Buffer and need to find out how many elements it has.
Both size and length methods are defined, inherited from separate traits.
Is there any actual performance difference, or can they be considered exact synonyms?

They are synonyms, mostly a result of Java's decision to use size for collections and length for Array and String. One will always be defined in terms of the other, and you can easily see which is which by looking at the source code, which is linked from the scaladoc. Just find the defining trait, open the source code, and search for def size or def length.

In this case, they can be considered synonyms. You may want to watch out with some other cases such as Array - whilst length and size will always return the same result, in versions prior to Scala 2.10 there may be a boxing overhead for calling size (which is provided by a Scala wrapper around the Array), whereas length is provided by the underlying Java Array.
In Scala 2.10, this overhead has been removed by use of a value class providing the size method, so you should feel free to use whichever method you like.
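As a quick sanity check of the "same result" claim, here is a minimal sketch; the object and value names are made up for illustration and are not from the question:

```scala
import scala.collection.mutable.ArrayBuffer

object SizeAndLength {
  val buffer: ArrayBuffer[String] = ArrayBuffer("a", "b", "c")
  val array: Array[Int] = Array(1, 2, 3, 4)

  // size and length report the same element count on a Buffer...
  val bufferAgrees: Boolean = buffer.size == buffer.length
  // ...and on an Array, where size comes from the Scala wrapper
  // and length from the underlying Java array.
  val arrayAgrees: Boolean = array.size == array.length
}
```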

As of Scala-2.11, these methods may have different performance. For example, consider this code:
val bigArray = Array.fill(1000000)(0)
val beginTime = System.nanoTime()
var i = 0
while (i < 2000000000) {
  i += 1
  bigArray.length
}
val endTime = System.nanoTime()
println(endTime - beginTime)
sys.exit(-1)
Running this on my amd64 machine takes about 2423834 ns (it varies from run to run).
If I change length to size, it takes about 70764719 ns.
This is more than 20x slower.
Why does this happen? I haven't dug into it, so I don't know. But there are scenarios where length and size perform drastically differently.
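A loop whose result is never used is easy for the JIT to optimize away, so numbers like these are worth double-checking. The following is only a sketch of a slightly more careful measurement: it sums the returned values so the calls cannot simply be discarded. The names (LengthVsSize, sumLength, sumSize, timed) are hypothetical, and a dedicated harness such as JMH would still be more trustworthy than hand-rolled timing.

```scala
object LengthVsSize {
  // Accumulates arr.length so the loop body has an observable result.
  def sumLength(arr: Array[Int], iterations: Int): Long = {
    var acc = 0L
    var i = 0
    while (i < iterations) {
      acc += arr.length
      i += 1
    }
    acc
  }

  // Same loop, but going through size instead of length.
  def sumSize(arr: Array[Int], iterations: Int): Long = {
    var acc = 0L
    var i = 0
    while (i < iterations) {
      acc += arr.size
      i += 1
    }
    acc
  }

  // Runs `body` once and returns (result, elapsed nanoseconds).
  def timed(body: => Long): (Long, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, System.nanoTime() - start)
  }
}
```

Calling each function a few times before timing it (to let the JIT warm up) and comparing several runs should give a fairer picture than a single measurement.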

They are synonyms, as the scaladoc for Buffer.size states:
The size of this buffer, equivalent to length.
The scaladoc for Buffer.length is explicit too:
The length of the buffer. Note: xs.length and xs.size yield the same result.
Simple advice: refer to the scaladoc before asking a question.
UPDATE: Just saw your edit adding mention of performance. As Daniel C. Sobral said, one is normally always implemented in terms of the other, so they have the same performance.

Related

Scala | FilterNot vs diff

Given:
val mySet = Array(1,2,3).toSet
val myArr = Array(1,2,2)
Code snippet 1:
val difference = mySet.filterNot(myArr.toSet)
Code snippet 2:
val difference = mySet diff myArr.toSet
Of the two ways of finding the difference above, which one will be faster for huge sets? I am new to Scala. Is the predicate passed to filterNot going to initialize a new set for each value of mySet?
Once the size of a set is > 4 then it will be a HashSet.
I suspect diff will be faster because it is implemented for diffing two HashSets whereas filterNot is more general purpose.
Considering that we have no idea what kind of implementation is used underneath (Set might be a HashSet or a ListSet), I would be very careful about guessing at the performance. One version of the library might pick the implementation one way, and the next version might pick it differently. I suggest that you pick an implementation explicitly (e.g. arr.to(HashSet) in 2.13) and do some actual benchmarks to check that performance is acceptable.
And if the type you use underneath is Int then probably you would benefit from using something like BitSet or other specialized data structure.
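A minimal sketch of the two variants with the implementation chosen explicitly, assuming Scala 2.13 (for to(HashSet)); the object and value names are illustrative only. Note that the set passed to filterNot is an ordinary argument, so it is built once before filterNot runs, not once per element; the set itself then acts as the predicate.

```scala
import scala.collection.immutable.HashSet

object SetDifference {
  val mySet: HashSet[Int] = HashSet(1, 2, 3)
  // Built exactly once, then reused by both variants below.
  val other: HashSet[Int] = Array(1, 2, 2).to(HashSet)

  // A Set[Int] is also an Int => Boolean, so it can serve as the predicate.
  val viaFilterNot: Set[Int] = mySet.filterNot(other)
  val viaDiff: Set[Int] = mySet.diff(other)
}
```

Both values contain the same elements, so any choice between them comes down to measured performance on your data.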

Scala build an arraybuffer of the same length with integers

I have an ArrayBuffer of objects and want to write each object's size into an existing buffer:
trait Obj {
  def size: Int
}
def copySizes(input: ArrayBuffer[Obj], output: ArrayBuffer[Int]): Unit = {
  output.clear()
  input foreach { obj =>
    output += obj.size
  }
}
Is there a better, more idiomatic way to write copySizes in Scala?
I was thinking about a syntax like:
input.mapTo(_.size, output)
You could write
output ++= input.view.map(_.size)
which has a non-negligible additional overhead (~2x runtime) but is more compact. You can write your version more compactly, though:
input.foreach{ output += _.size }
so I don't see much reason not to use it.
Have you considered trying to foldLeft on the input buffer using the output buffer as the accumulator and doing the append in the function body?
val output = input.foldLeft(ArrayBuffer[Int]())(_ += _.size)
As @Rex Kerr mentioned, there might be a performance hit here vs the foreach that you were doing, but I wasn't sure that this was a performance-critical piece of code. I guess it depends on how many items are in the buffer and how often this code is hit. If it's a low number of items or this is not a piece of code that is consistently hit, then you might be better off with something that is more functional (the fold) vs something that is more side-effect based (the foreach).
There generally comes a time when writing Scala code where you have to make that decision: do I care about functional purity/pretty Scala code, or is this something that needs to be optimized? I try to stay on the purely functional side when possible and then performance test my system and find hotspots and optimize when needed. Premature optimization is the root of all evil (or something like that) =)
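To compare the variants discussed above side by side, here is a sketch under a few assumptions: Blob is a hypothetical implementation of Obj added only so the code can run, and copySizesView and sizesOf are names invented for this example. Unlike a bare output ++= ..., the view-based version here also clears the buffer first so that it matches the original copySizes.

```scala
import scala.collection.mutable.ArrayBuffer

object CopySizes {
  trait Obj { def size: Int }
  // Hypothetical concrete Obj, used only to exercise the functions below.
  final case class Blob(size: Int) extends Obj

  // Side-effecting version: reuses the caller's buffer.
  def copySizes(input: ArrayBuffer[Obj], output: ArrayBuffer[Int]): Unit = {
    output.clear()
    input.foreach(obj => output += obj.size)
  }

  // View-based version: same result, without building an intermediate collection.
  def copySizesView(input: ArrayBuffer[Obj], output: ArrayBuffer[Int]): Unit = {
    output.clear()
    output ++= input.view.map(_.size)
  }

  // Fold-based version: allocates and returns a fresh buffer instead.
  def sizesOf(input: ArrayBuffer[Obj]): ArrayBuffer[Int] =
    input.foldLeft(ArrayBuffer.empty[Int])((acc, obj) => acc += obj.size)
}
```

The first two reuse an existing buffer, which is what the question asked for; the fold is the right shape only if allocating a new buffer on each call is acceptable.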

Why do Scala's index methods return -1 instead of None if the element is not found?

I've always been wondering why in Scala the various index methods for determining the position of an element in a collection (e.g. List.indexOf, List.indexWhere) return -1 to indicate the absence of the given element in the collection, instead of a more idiomatic Option[Int]. Is there some particular advantage to returning -1 instead of None, or is this just for historical reasons?
It is just for historical reasons, but then one wants to know what the historical reasons are: what was the history, and why did it turn out that way?
The immediate history is the java.lang.String.indexOf method, which returns the index, or -1 if no matching character is found. But this is hardly new; the Fortran SCAN function returns 0 if no character is found in a string, which is the same thing given that Fortran uses 1-indexing.
The reason to do this is that valid indices are never negative, so any negative value can be used as a None value without any overhead of boxing. -1 is the most convenient negative number, so that's it.
And that overhead can add up if the compiler isn't smart enough to realize that all the boxing and unboxing and everything is irrelevant. In particular, an object creation tends to take 5-10 ns, while a function call or comparison typically takes more like 1-2 ns, so if the collection is short, creating a new object can have a sizable fractional penalty (more so if your memory is already taxed and the GC has a lot of work to do).
If Scala had initially had an amazing optimizer, then the choice probably would have been different, as one would just write things with options, which is safer and less of a special case, and then trust the compiler to convert it into appropriately high-performance code.
Speed? (not sure)
def a(): Option[Int] = Some(Math.random().toInt)
def b(): Int = Math.random().toInt
val t0 = System.nanoTime; (0 to 1000000).foreach(_ => a()); println("" + (System.nanoTime - t0))
// 53988000
val t1 = System.nanoTime; (0 to 1000000).foreach(_ => b()); println("" + (System.nanoTime - t1))
// 49273000
And you also should always check for index < 0 in Some(index)
There is also the benefit that just returning an Int can use Java's built-in types, whereas Option[Int] would need to wrap the integer in an Object. This means not only worse speed (as indicated by @idonnie) but also more memory usage.
While Option is great as a general tool (and I use it a lot), other ways of representing a missing value, such as Double.NaN or an empty string, are also perfectly valid and useful.
One of the benefits of using Option is the ability to pass it to for comprehensions etc. as a collection. If you are not likely to do that, checking for -1 or NaN may be more concise than matching on None/Some.
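If you do want an Option at the call site, wrapping the sentinel is a one-liner. This is a minimal sketch; indexOfOption is a hypothetical helper, not something the standard library provides under that name.

```scala
object IndexOps {
  // Translates the -1 sentinel from indexOf into None.
  def indexOfOption[A](xs: Seq[A], elem: A): Option[Int] = {
    val i = xs.indexOf(elem)
    if (i >= 0) Some(i) else None
  }
}
```

This pays the allocation cost only where the caller has asked for it, and leaves the plain Int version available for tight loops.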

Scala: read and save all elements of an Iterable

I have an Iterable[T] that is really a stream of unknown length, and want to read it all and save it into something that is still an instance of Iterable. I really do have to read it and save it; I can't do it in a lazy way. The original Iterable can have a few thousand elements, at least. What's the most efficient/best/canonical way? Should I use an ArrayBuffer, a List, a Vector?
Suppose xs is my Iterable. I can think of doing these possibilities:
xs.toArray.toIterable // Ugh?
xs.toList // Fast?
xs.copyToBuffer(anArrayBuffer)
Vector(xs: _*) // There's no toVector, sadly. Is this construct as efficient?
EDIT: I see by the questions I should be more specific. Here's a strawman example:
def f(xs: Iterable[SomeType]): Unit = { // xs might be a stream, though I can't be sure
  val allOfXS = <xs all read in at once>
  g(allOfXS)
  h(allOfXS) // Both g() and h() take an Iterable[SomeType]
}
This is easy. A few thousand elements is nothing, so it hardly matters unless it's a really tight loop. So the flippant answer is: use whatever you feel is most elegant.
But, okay, let's suppose that this is actually in some tight loop, and you can predict or have benchmarked your code enough to know that this is performance-limiting.
Your best performance for an immutable solution will likely be a Vector, used like so:
Vector() ++ xs
In my hands, this can copy a 10k iterable about 4k-5k times per second. List is about half the speed.
If you're willing to try a mutable solution under the hood, xs.toArray.toIterable usually takes the cake with about 10k copies per second. ArrayBuffer is about the same speed as List.
If you actually know the size of the target (i.e. size is O(1) or you know it from somewhere else), you can shave off another 20-30% of the execution time by allocating just the right size and writing a while loop.
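A sketch of what "allocate the right size and write a while loop" could look like, assuming the caller already knows the element count n. The name toArrayOfSize is made up for this example, and I have not re-measured the 20-30% figure with it.

```scala
import scala.reflect.ClassTag

object SizedCopy {
  // Copies up to `n` elements of `xs` into a freshly allocated array.
  // Assumes the caller knows `n`; if `xs` is shorter, the tail keeps
  // the array's default values.
  def toArrayOfSize[T: ClassTag](xs: Iterable[T], n: Int): Array[T] = {
    val out = new Array[T](n)
    val it = xs.iterator
    var i = 0
    while (i < n && it.hasNext) {
      out(i) = it.next()
      i += 1
    }
    out
  }
}
```

The resulting array can be handed to anything expecting an Iterable via toIterable, as in the xs.toArray.toIterable variant.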
If it's actually primitives, you can gain a factor of 10 by writing your own specialized Iterable-like-thing that acts on arrays and converts to regular collections via the underlying array.
Bottom line: for a great blend of power, speed, and flexibility, use Vector() ++ xs in most situations. xs.toIndexedSeq defaults to the same thing, with the benefit that if it's already a Vector it will take no time at all (and it chains nicely without using parens), and the drawback that you are relying upon a convention, not a specification for behavior (and it takes 1-3 more characters to type).
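Applied to the strawman f from the question, the recommendation might look like the sketch below. Here g and h are hypothetical stand-ins for the real consumers, and Int replaces SomeType so the example can run. Scala versions from 2.10 onward also offer xs.toVector, which avoids the Vector() ++ spelling.

```scala
object Materialize {
  // Forces every element of `xs` into a strict, immutable Vector.
  def readAll[T](xs: Iterable[T]): Vector[T] = Vector() ++ xs

  // Hypothetical consumers standing in for g() and h() from the question.
  def g(xs: Iterable[Int]): Int = xs.sum
  def h(xs: Iterable[Int]): Int = xs.size

  def f(xs: Iterable[Int]): (Int, Int) = {
    val allOfXs = readAll(xs) // the source is traversed once, here
    (g(allOfXs), h(allOfXs))  // both consumers see the saved copy
  }
}
```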
How about Stream.force?
Forces evaluation of the whole stream and returns it.
This is hard. An Iterable's methods are defined in terms of its iterator, but that gets overridden by subtraits. For instance, IndexedSeq methods are usually defined in terms of apply.
There is the question of why you want to copy the Iterable, but I suppose you might be guarding against the possibility of it being mutable. If you do not want to copy it, then you need to rephrase your question.
If you are going to copy it, and you want to be sure all elements are copied in a strict manner, you could use .toList. That will not copy a List, but a List does not need to be copied. For anything else, it will produce a new copy.
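The "will not copy a List" part can be checked with reference equality. A small sketch with illustrative names, relying on List.toList returning the receiver itself, which is how the standard library implements it as far as I know:

```scala
object StrictCopy {
  val original: List[Int] = List(1, 2, 3)
  // For a List, toList hands back the very same instance.
  val sameInstance: Boolean = original.toList eq original

  // For another collection type, toList builds a new List.
  val fromVector: List[Int] = Vector(1, 2, 3).toList
}
```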

Scala linked list stackoverflow

Using Scala I have added about 100000 nodes to a linked list. When I call length, for example mylist.length, I get a 'java.lang.StackOverflowError'. Is my list too big to process? The list contains only string objects.
It appears the library implementation is not tail-recursive: override def length: Int = if (isEmpty) 0 else next.length + 1. It seems like this is something that could be discussed on the mailing list to check if an enhancement ticket should be opened.
You can compute the length like this:
def length[T](l: LinkedList[T], acc: Int = 0): Int =
  if (l.isEmpty) acc else length(l.tail, acc + 1)
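Since mutable.LinkedList was later deprecated and then removed from the standard library, here is the same accumulator technique as a self-contained sketch on a hypothetical minimal list type (Cell, Empty, Node, and build are all invented for this example). The @tailrec annotation makes the compiler reject the method if it cannot turn the recursion into a loop.

```scala
import scala.annotation.tailrec

object ListLength {
  // Hypothetical minimal singly linked list, standing in for LinkedList.
  sealed trait Cell[+A]
  case object Empty extends Cell[Nothing]
  final case class Node[A](head: A, tail: Cell[A]) extends Cell[A]

  // The recursive call is the last action, so it compiles to a loop
  // and uses constant stack space regardless of list length.
  @tailrec
  def length[A](cell: Cell[A], acc: Int = 0): Int = cell match {
    case Empty         => acc
    case Node(_, tail) => length(tail, acc + 1)
  }

  // Builds a list of n string nodes iteratively.
  def build(n: Int): Cell[String] = {
    var cell: Cell[String] = Empty
    var i = 0
    while (i < n) {
      cell = Node("x", cell)
      i += 1
    }
    cell
  }
}
```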
In Scala, computing the length of a List is an O(n) operation, so you should try to avoid it. You might consider switching to an Array, whose length is a constant-time operation.
You could try increasing the stack/heap size available to the JVM.
JAVA_OPTS="-Xmx512M -Xms16M -Xss16M" scala MyClass.scala
Where
-Xss<size> maximum native stack size for any thread
-Xms<size> set initial Java heap size
-Xmx<size> set maximum Java heap size
This question has some more information.
See also this Scala document.
Can you confirm that you truly need to use the length method? It sounds like you might not be using the correct collection type for your use-case (hard to tell without any extra information). Lists are optimised to be mapped over using folds or a tail-recursive function.
Despite saying that, this is absolutely an oversight that can easily be fixed in the standard library with a tail-recursive function. Hopefully we can get it in time for 2.9.0.