I've tried different collections in Scala to sum their elements, and they are much slower than Java summing its arrays (with a for loop). Is there a way for Scala to be as fast as Java arrays?
I've heard that in Scala 2.8 arrays will be the same as in Java, but they are much slower in practice.

Indexing into arrays in a while loop is as fast in Scala as in Java. (Scala's "for" loop is not the low-level construct that Java's is, so that won't work the way you want.)
Thus if in Java you see
for (int i = 0; i < array.length; i++) sum += array[i];
in Scala you should write
var i = 0
while (i < array.length) {
  sum += array(i)
  i += 1
}
and if you do your benchmarks appropriately, you'll find no difference in speed.
If you have iterators anyway, then Scala is as fast as Java in most things. For example, if you have an ArrayList of doubles and in Java you add them using
for (double d : arraylist) { sum += d }
then in Scala you'll be approximately as fast--if using an equivalent data structure like ArrayBuffer--with
arraybuffer.foreach( sum += _ )
and not too far off the mark with either of
sum = (0.0 /: arraybuffer)(_ + _)
sum = arraybuffer.sum // 2.8 only
Keep in mind, though, that there's a penalty to mixing high-level and low-level constructs. For example, if you decide to start with an array but then use "foreach" on it instead of indexing into it, Scala has to wrap it in a collection (ArrayOps in 2.8) to get it to work, and often will have to box the primitives as well.
Anyway, for benchmark testing, these two functions are your friends:
def time[F](f: => F) = {
  val t0 = System.nanoTime
  val ans = f
  printf("Elapsed: %.3f\n", 1e-9 * (System.nanoTime - t0))
  ans // return the computed value so calls can be nested
}
def lots[F](n: Int, f: => F): F = if (n <= 1) f else { f; lots(n-1, f) }
For example:
val a = Array.tabulate(1000000)(_.toDouble)
val ab = new collection.mutable.ArrayBuffer[Double] ++ a
def adSum(ad: Array[Double]) = {
  var sum = 0.0
  var i = 0
  while (i < ad.length) { sum += ad(i); i += 1 }
  sum
}
// Mixed array + high-level; convenient, not so fast
scala> lots(3, time( lots(100,(0.0 /: a)(_ + _)) ) )
Elapsed: 2.434
Elapsed: 2.085
Elapsed: 2.081
res4: Double = 4.999995E11
// High-level container and operations, somewhat better
scala> lots(3, time( lots(100,(0.0 /: ab)(_ + _)) ) )
Elapsed: 1.694
Elapsed: 1.679
Elapsed: 1.635
res5: Double = 4.999995E11
// High-level collection with simpler operation
scala> lots(3, time( lots(100,{var s=0.0;ab.foreach(s += _);s}) ) )
Elapsed: 1.171
Elapsed: 1.166
Elapsed: 1.162
res7: Double = 4.999995E11
// All low level operations with primitives, no boxing, fast!
scala> lots(3, time( lots(100,adSum(a)) ) )
Elapsed: 0.185
Elapsed: 0.183
Elapsed: 0.186
res6: Double = 4.999995E11

You can now simply use sum.
val values = Array.fill[Double](numValues)(0)
val sumOfValues = values.sum

The proper Scala or functional way to do this would be:
val numbers = Array(1, 2, 3, 4, 5)
val sum = numbers.reduceLeft[Int](_+_)
Check out this link for the full explanation of the syntax:
I doubt this would be faster than doing it in the ways described in the other answers but I haven't tested it so I'm not sure. In my opinion this is the proper way to do it though since Scala is a functional language.

It is very difficult to explain why some code you haven't shown performs worse than some other code you haven't shown in some benchmark you haven't shown.
You may be interested in this question and its accepted answer, for one thing. But benchmarking JVM code is hard, because the JIT will optimize code in ways that are difficult to predict (which is why JIT beats traditional optimization at compile time).
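One way to reduce the JIT noise mentioned above is to run the code a few times before starting the clock. The following is only a minimal sketch of that idea: `WarmBench`, `timeIt`, and `sumArray` are made-up names for illustration, and the warm-up and run counts are arbitrary. For serious measurements a dedicated benchmarking harness is probably a better bet.

```scala
object WarmBench {
  // Plain while-loop sum over a primitive array (no boxing involved).
  def sumArray(xs: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    while (i < xs.length) { sum += xs(i); i += 1 }
    sum
  }

  // Evaluates `body` `warmups` times untimed so the JIT has a chance to
  // compile it, then returns the last result together with the mean
  // number of seconds per timed run.
  def timeIt[A](warmups: Int, runs: Int)(body: => A): (A, Double) = {
    require(runs >= 1, "need at least one timed run")
    var w = 0
    while (w < warmups) { body; w += 1 }
    val t0 = System.nanoTime
    var result = body
    var r = 1
    while (r < runs) { result = body; r += 1 }
    (result, (System.nanoTime - t0) / 1e9 / runs)
  }
}
```

A call might look like `WarmBench.timeIt(20, 100)(WarmBench.sumArray(data))`, where `data` is whatever array you want to measure.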

Scala 2.8 Arrays are JVM / Java arrays and as such have identical performance characteristics. But that means they cannot directly have extra methods that unify them with the rest of the Scala collections. To provide the illusion that arrays have these methods, there are implicit conversions to wrapper classes that add those capabilities. If you are not careful you'll incur inordinate overhead using those features.
In those cases where iteration overhead is critical, you can explicitly get an iterator (or maintain an integer index, for indexed sequential structures like Array or other IndexedSeq) and use a while loop, which is a language-level construct that need not operate on functions (literals or otherwise) but can compile in-line code blocks.
val l1 = List(...) // or any Iterable
val i1 = l1.iterator
while (i1.hasNext) {
  val e = i1.next
  // Do stuff with e
}
Such code will execute essentially as fast as a Java counterpart.

Timing is not the only concern.
With sum you might find an overflow issue:
scala> Array(2147483647,2147483647).sum
res0: Int = -2
in this case seeding foldLeft with a Long is preferable
scala> Array(2147483647,2147483647).foldLeft(0L)(_+_)
res1: Long = 4294967294
A Long can also be used from the beginning:
scala> Array(2147483647L,2147483647L).sum
res1: Long = 4294967294
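If you would rather keep the fast while loop and still avoid the overflow, the accumulator can simply be a Long. A rough sketch follows; `SafeSum`, `sumAsLong`, and `sumChecked` are hypothetical names, and the checked variant assumes Java 8 or later for `java.lang.Math.addExact`.

```scala
object SafeSum {
  // Sums Ints into a Long accumulator, so the total cannot wrap around
  // for any array that fits in memory.
  def sumAsLong(xs: Array[Int]): Long = {
    var total = 0L
    var i = 0
    while (i < xs.length) { total += xs(i); i += 1 }
    total
  }

  // Stays in Int, but reports overflow as None instead of silently wrapping.
  def sumChecked(xs: Array[Int]): Option[Int] =
    try Some(xs.foldLeft(0)((acc, x) => java.lang.Math.addExact(acc, x)))
    catch { case _: ArithmeticException => None }
}
```

With the array from the example above, `SafeSum.sumAsLong` gives 4294967294 and `SafeSum.sumChecked` gives `None`.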


For until square root

In Scala, I want to write the equivalent of the following C++ code:
for(int i = 1 ; i * i < n ; i++)
So far I did this, but it looks ugly and I think it goes up until n:
for(i <- 1 to n
if(i * i < n))
Is there a nicer way of writing this code?
Not nicer, but a different approach
Using a stream
(1 to n).toStream map (i => i * i) takeWhile (_ < n)
Example for n = 100
scala> val res = (1 to 100).toStream map(i => i * i) takeWhile (_ < 100)
res: scala.collection.immutable.Stream[Int] = Stream(1, ?)
scala> res.toList
res16: List[Int] = List(1, 4, 9, 16, 25, 36, 49, 64, 81)
A Stream lets you request values on demand, i.e. it is evaluated lazily. So the mapped function is only applied when the next value is requested.
First of all, declare a function to generate a lazy stream of squares:
def squares(i: Int = 1): Stream[Int] = Stream.cons(i * i, squares(i + 1))
then use takeWhile to get the value when i * i is smaller than n. For example:
scala> squares().takeWhile(_ < 50).foreach(println)
The solution you have might not be the nicest but it might be the most efficient, everything else is internally more complicated, so it might have some overhead. (In most situations not a notable overhead though, and it might be optimized very well.)
I would not go for the solution using Streams suggested in another answer. While streams are computed lazily, they do cache the computed results, which is not required in this case and might take a lot of memory if the range iterated over is large. Instead I would use an Iterator. Operations on Iterators are typically lazy as well and do not cache anything.
If you need this more often, you could add another "operator" using an implicit class like this:
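implicit class UntilHelper(start: Int) {
  // counts up from start for as long as cond holds
  def aslong(cond: Int => Boolean) = Iterator.from(start).takeWhile(cond)
}
Your loop then looks like this:
for (i <- 1 aslong (Math.pow(_, 2) < 1000)) {
  // loop body
}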
From a quick micro-benchmark it looks like this is about 3 times faster than the stream solution and a little bit slower than a simple while loop. These things are however notoriously hard to measure without any context.
Remark on computing squares
A nice way of computing a sequence of Squares is by adding the difference between squares. This can be done using the scanLeft method on a Stream or an Iterator.
val itr = Iterator.from(1).scanLeft(1)((a,b)=>a + 2*b+1)
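Putting the pieces from this thread together, here is a small sketch of the Iterator-based loop. `Squares`, `rootsBelow`, and `squaresBelow` are illustrative names only; the arithmetic is done in Long, which is an assumption made here so that `i * i` cannot overflow for large `n`.

```scala
object Squares {
  // The values of i visited by: for (int i = 1; i * i < n; i++)
  def rootsBelow(n: Int): List[Int] =
    Iterator.from(1).takeWhile(i => i.toLong * i < n).toList

  // The squares themselves, built by adding successive differences
  // ((i + 1)^2 - i^2 = 2 * i + 1) instead of multiplying.
  def squaresBelow(n: Int): List[Long] =
    Iterator.from(1)
      .scanLeft(1L)((sq, i) => sq + 2L * i + 1)
      .takeWhile(_ < n)
      .toList
}
```

Neither method caches anything beyond the final list, and both stop as soon as the condition fails.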

Scala ParArray Sorting

How to sort in ascending order a ParArray collection such as
or else, which parallel collections may be more suitable for this purpose ?
How to implement a parallel algorithm on ParArray that may prove more efficient than casting to a non parallel collection for sequential sorting ?
My first observation would be that there doesn't seem to be much performance penalty for "converting" parallel arrays to sequential and back:
import scala.util.Random

def time[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block // call-by-name
  val t1 = System.nanoTime()
  val diff: Long = t1 - t0
  println(s"Elapsed time: ${diff * 1.0 / 1E9}s")
  result
}
def main(args: Array[String]): Unit = {
  val size: Int = args.headOption.map(_.toInt).getOrElse(1000000)
  val input = Array.fill(size)(Random.nextInt())
  val arrayCopy: Array[Int] = Array.ofDim(size)
  input.copyToArray(arrayCopy) // so that both runs sort the same data
  time { input.sorted }
  val parArray = arrayCopy.par
  val result = time { parArray.seq.sorted.toArray.par }
}
> run 1000000
[info] Running Runner 1000000
Elapsed time: 0.344659236s
Elapsed time: 0.321363896s
For all Array sizes I tested the results are very similar and usually somewhat in favor of the second expression. So in case you were worried that converting to sequential collections and back will kill the performance gains you achieved on other operations - I don't think you should be.
When it comes to utilizing Scala's parallel collections to achieve parallel sorting that in some cases would perform better than the default - I don't think there's an obvious good way of doing that, but it wouldn't hurt to try:
What I thought should work would be splitting the input array into as many subarrays as you have cores in your computer (preferably without any unnecessary copying) and sorting the parts concurrently. Afterwards one might merge (as in merge sort) the parts together. Here's what the code might look like:
val maxThreads = 8 //for simplicity we're not configuring the thread pool explicitly
val groupSize:Int = size/maxThreads + 1
val ranges: IndexedSeq[(Int, Int)] = (0 until maxThreads).map(i => (i * groupSize, (i + 1) * groupSize))
time {
  //parallelizing sorting for each range
  ranges.par.foreach { case (from, to) =>
    input.view(from, to).sortWith(_ < _)
  }
  //TODO merge the parts together
}
Unfortunately there's this old bug that prevents us from doing anything fun with views. There doesn't seem to be any Scala built-in mechanism (other than views) for sorting just a part of a collection. This is why I tried coding my own merge sort algorithm with the signature def mergeSort(a: Array[Int], r: Range): Unit to use it as I described above. Unfortunately it seems to be more than 4 times slower than the Scala Array.sorted method, so I don't think it could be used to gain efficiency over the standard sequential approach.
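For what it's worth, the split, sort, and merge idea can also be sketched without views by sorting index ranges in place with `java.util.Arrays.sort(array, from, to)` from the Java standard library. This is not the merge sort mentioned above, just an untuned illustration: `ChunkedSort`, `chunkedSort`, and `merge` are made-up names, it uses Futures on the global execution context rather than parallel collections, and the final merge is a simple sequential fold. No claim is made that it beats `Array.sorted`.

```scala
import java.util.Arrays
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object ChunkedSort {
  // Merges two already sorted arrays into a new sorted array.
  def merge(a: Array[Int], b: Array[Int]): Array[Int] = {
    val out = new Array[Int](a.length + b.length)
    var i = 0; var j = 0; var k = 0
    while (i < a.length && j < b.length) {
      if (a(i) <= b(j)) { out(k) = a(i); i += 1 } else { out(k) = b(j); j += 1 }
      k += 1
    }
    while (i < a.length) { out(k) = a(i); i += 1; k += 1 }
    while (j < b.length) { out(k) = b(j); j += 1; k += 1 }
    out
  }

  // Sorts disjoint index ranges of a private copy concurrently,
  // then merges the sorted parts one after another.
  def chunkedSort(input: Array[Int], chunks: Int): Array[Int] = {
    require(chunks >= 1, "need at least one chunk")
    val data = input.clone() // leave the caller's array untouched
    val groupSize = math.max(1, (data.length + chunks - 1) / chunks)
    val bounds = (0 until data.length by groupSize)
      .map(from => (from, math.min(from + groupSize, data.length)))
    val parts = bounds.map { case (from, to) =>
      Future {
        Arrays.sort(data, from, to) // in-place sort of one range
        Arrays.copyOfRange(data, from, to)
      }
    }
    Await.result(Future.sequence(parts), Duration.Inf)
      .foldLeft(Array.empty[Int])(merge)
  }
}
```

Each Future only touches its own range of `data`, so the concurrent in-place sorts do not interfere with one another.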
If I understand your situation correctly, your dataset fits in memory, so using something like Hadoop and MapReduce would be premature. What you might try though would be Apache Spark - other than adding a dependency you wouldn't need to set up any cluster or install anything for Spark to use all cores of your machine in a basic configuration. Its RDDs are ideologically similar to Scala's Parallel Collections, but with additional functionality. And they (in a way) support parallel sorting.
If you build your Scala project against Java 8, there is the new Arrays.parallelSort you can use:
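def sort[T <: Comparable[T]](parArray: ParArray[T])(implicit c: ClassTag[T]): ParArray[T] = {
  var array = new Array[T](parArray.size) // Or, to prevent copying, var array = parArray.seq.array.asInstanceOf[Array[T]] might work?
  parArray.copyToArray(array)          // copy the elements out of the parallel array
  java.util.Arrays.parallelSort(array) // sort in place on the fork/join pool (Java 8+)
  ParArray.createFromCopy(array)       // wrap the sorted copy in a new ParArray
}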
There are no parallel sorting algorithms available in the Scala standard library. For this reason, the parallel collections don't provide sorted, sortBy, or sortWith methods. You will have to convert to an appropriate sequential class (e.g. with toArray) before sorting.
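As a rough illustration of the convert-then-sort advice, the sketch below takes a plain `Array[Int]` (standing in for whatever `toArray` returns from your parallel collection) and sorts a copy of it in place. `SeqSort`, `sortedCopy`, and `parallelSortedCopy` are hypothetical names; the second method assumes Java 8 or later for `java.util.Arrays.parallelSort`.

```scala
object SeqSort {
  // Sequential in-place sort of a copy, using the Scala standard library.
  def sortedCopy(xs: Array[Int]): Array[Int] = {
    val copy = xs.clone()
    scala.util.Sorting.quickSort(copy)
    copy
  }

  // The same, but delegating to the JDK's parallel sort.
  def parallelSortedCopy(xs: Array[Int]): Array[Int] = {
    val copy = xs.clone()
    java.util.Arrays.parallelSort(copy)
    copy
  }
}
```

Which of the two is faster depends on the array size and the machine, so it is worth timing both on your own data.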
If your data can fit in memory, then a single-threaded in-memory sort is fast enough. If you need to load a lot of data from disk or HDFS, then you can do the sort on a distributed system like Hadoop or Spark.
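def parallelSort[A : Ordering](seq: ParIterable[A]): TreeSet[A] = {
  // note: a TreeSet keeps only one copy of equal elements
  seq.aggregate[TreeSet[A]](TreeSet.empty[A])(
    (set, a) => set + a,
    (set1, set2) => set1 ++ set2)
}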

What is the fastest way to subtract two arrays in scala

I have two arrays (that I have pulled out of a matrix, an Array[Array[Int]]) and I need to subtract one from the other.
At the moment I am using this method; however, when I profile it, it is the bottleneck.
def subRows(a: Array[Int], b: Array[Int], sizeHint: Int): Array[Int] = {
  val l: Array[Int] = new Array(sizeHint)
  var i = 0
  while (i < sizeHint) {
    l(i) = a(i) - b(i)
    i += 1
  }
  l
}
I need to do this billions of times so any improvement in speed is a plus.
I have tried using a List instead of an Array to collect the differences and it is MUCH faster but I lose all benefit when I convert it back to an Array.
I did modify the downstream code to take a List to see if that would help but I need to access the contents of the list out of order so again there is loss of any gains there.
It seems like any conversion of one type to another is expensive and I am wondering if there is some way to use a map etc. that might be faster.
Is there a better way?
Not sure what I did the first time!?
So the code I used to test it was this:
def subRowsArray(a: Array[Int], b: Array[Int], sizeHint: Int): Array[Int] = {
  val l: Array[Int] = new Array(sizeHint)
  var i = 0
  while (i < sizeHint) {
    l(i) = a(i) - b(i)
    i += 1
  }
  l
}
def subRowsList(a: Array[Int], b: Array[Int], sizeHint: Int): List[Int] = {
  var l: List[Int] = Nil
  var i = 0
  while (i < sizeHint) {
    l = a(i) - b(i) :: l
    i += 1
  }
  l
}
val a = Array.fill(100, 100)(scala.util.Random.nextInt(2))
val loops = 30000 * 10000
def runArray = for (i <- 1 to loops) subRowsArray(a(scala.util.Random.nextInt(100)), a(scala.util.Random.nextInt(100)), 100)
def runList = for (i <- 1 to loops) subRowsList(a(scala.util.Random.nextInt(100)), a(scala.util.Random.nextInt(100)), 100)
def optTimer(f: => Unit) = {
  val s = System.currentTimeMillis
  f
  System.currentTimeMillis - s
}
The results I thought I got the first time I did this were the exact opposite... I must have misread or mixed up the methods.
My apologies for asking a bad question.
That code is the fastest you can manage single-threaded using a standard JVM. If you think List is faster, you're either fooling yourself or not actually telling us what you're doing. Putting an Int into List requires two object creations: one to create the list element, and one to box the integer. Object creations take about 10x longer than an array access. So it's really not a winning proposition to do it any other way.
If you really, really need to go faster, and must stay with a single thread, you should probably switch to C++ or the like and explicitly use SSE instructions. See this question, for example.
If you really, really need to go faster and can use multiple threads, then the easiest is to package up a chunk of work like this (i.e. a sensible number of pairs of vectors that need to be subtracted--probably at least a few million elements per chunk) into a list as long as the number of processors on your machine, and then call list.par.map(yourSubtractionRoutineThatActsOnTheChunkOfWork).
Finally, if you can be destructive,
a(i) -= b(i)
in the inner loop is, of course, faster. Likewise, if you can reuse space (e.g. with System.arraycopy), you're better off than if you have to keep allocating it. But that changes the interface from what you've shown.
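To make the last two suggestions concrete, here is a small sketch of both variants. `SubRows`, `subInto`, and `subInPlace` are names invented for this example, and as noted above both change the interface compared with the original `subRows`.

```scala
object SubRows {
  // Writes a - b into a caller-supplied buffer, so no array is
  // allocated per call. `out` may be reused across many calls.
  def subInto(a: Array[Int], b: Array[Int], out: Array[Int]): Unit = {
    var i = 0
    while (i < out.length) {
      out(i) = a(i) - b(i)
      i += 1
    }
  }

  // Destructive variant: overwrites `a` with a - b.
  def subInPlace(a: Array[Int], b: Array[Int]): Unit = {
    var i = 0
    while (i < a.length) {
      a(i) -= b(i)
      i += 1
    }
  }
}
```

The destructive variant is only safe when the rows of the matrix are not needed again in their original form.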
You can use ScalaMeter to benchmark the two implementations. It requires at least JRE 7 update 4 and Scala 2.10 to run. I used Scala 2.10 RC2.
Compile with scalac -cp scalameter_2.10-0.2.jar RangeBenchmark.scala.
Run with scala -cp scalameter_2.10-0.2.jar:. RangeBenchmark.
Here's the code I used:
import org.scalameter.api._
object RangeBenchmark extends PerformanceTest.Microbenchmark {
  val limit = 100
  val a = new Array[Int](limit)
  val b = new Array[Int](limit)
  val array: Array[Int] = new Array(limit)
  var list: List[Int] = Nil
  val ranges = for {
    size <- Gen.single("size")(limit)
  } yield 0 until size
  measure method "subRowsArray" in {
    using(ranges) curve("Range") in {
      var i = 0
      while (i < limit) {
        array(i) = a(i) - b(i)
        i += 1
      }
      r => array
    }
  }
  measure method "subRowsList" in {
    using(ranges) curve("Range") in {
      var i = 0
      while (i < limit) {
        list = a(i) - b(i) :: list
        i += 1
      }
      r => list
    }
  }
}
Here are the results:
::Benchmark subRowsArray::
Parameters(size -> 100): 8.26E-4
::Benchmark subRowsList::
Parameters(size -> 100): 7.94E-4
You can draw your own conclusions. :)
The stack blew up on larger values of limit. I'll guess it's because it's measuring the performance many times.

Scala - folding on values that result from object interaction

In Scala I have a list of objects that represent points and contain x and y values. The list describes a path that goes through all these points sequentially. My question is how to use folding on that list in order to find the total length of the path? Or maybe there is even a better functional or Scala way to do this?
What I have come up with is this:
def distance = (0 /: wps)(Waypoint.distance(_, _))
but of course this is totally wrong, because distance returns a Float but accepts two Waypoint objects.
Thanks for the proposed solutions! They are definitely interesting, but I think they are too functional for real-time calculations that may become heavy. So far I have come up with these lines:
val distances = for(i <- 0 until wps.size - 1) yield wps(i).distanceTo(wps(i + 1))
val distance = (0f /: distances)(_ + _)
I feel this is a fair imperative/functional mix that is fast and also keeps the distance between each pair of waypoints for possible later reference, which is also a benefit in my case.
UPDATE 2: Actually, to determine what is faster, I will have to benchmark all the proposed solutions on all types of sequences.
This should work.
(wps, wps drop 1).zipped.map(Waypoint.distance).sum
Don't know if fold can be used here, but try this:
wps.sliding(2).map(segment => Waypoint.distance(segment(0), segment(1))).sum
wps.sliding(2) returns an iterator over all consecutive pairs. Or if you prefer pattern matching:
wps.sliding(2).collect{case start :: end :: Nil => Waypoint.distance(start, end)}.sum
BTW consider defining:
def distanceTo(to: Waypoint)
on the Waypoint class directly, not on the companion object, as it looks more object-oriented and will allow you to write nice DSL-like code:
point1.distanceTo(point2)
or even:
point1 distanceTo point2
case start :: end :: Nil => start distanceTo end
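Since the question never shows the Waypoint class, here is a self-contained sketch of the sliding and zip approaches with a stand-in definition. The `Waypoint` case class, its `x`/`y` fields, the use of Double rather than Float, and the names `PathLength`, `viaSliding`, and `viaZip` are all assumptions made for the example.

```scala
object PathLength {
  // Hypothetical stand-in for the question's Waypoint class.
  final case class Waypoint(x: Double, y: Double) {
    def distanceTo(to: Waypoint): Double = math.hypot(to.x - x, to.y - y)
  }

  // Sums the distance over consecutive pairs produced by sliding(2).
  // A path with fewer than two points has no pairs, so the length is 0.
  def viaSliding(wps: Seq[Waypoint]): Double =
    wps.sliding(2).collect { case Seq(a, b) => a.distanceTo(b) }.sum

  // Same result, pairing each point with its successor via zip.
  def viaZip(wps: Seq[Waypoint]): Double =
    wps.zip(wps.drop(1)).map { case (a, b) => a.distanceTo(b) }.sum
}
```

Matching on `Seq(a, b)` instead of `start :: end :: Nil` keeps the code working for Vector and other non-List sequences.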
Your comment "too much functional for real-time calculations that may become heavy" makes this interesting. Benchmarking and profiling are critical, since you don't want to write a bunch of hard-to-maintain code for the sake of performance, only to find out that it's not a performance critical part of your application in the first place! Or, even worse, find out that your performance optimizations make things worse for your specific workload.
The best performing implementation will depend on your specifics (How long are the paths? How many cores are on the system?) But I think blending imperative and functional approaches may give you the worst of both worlds. You could lose out on both readability and performance if you're not careful!
I would very slightly modify missingfaktor's answer to allow you to have performance gains from parallel collections. The fact that simply adding .par could give you a tremendous performance boost demonstrates the power of sticking with functional programming!
def distancePar(wps: collection.GenSeq[Waypoint]): Double = {
  val parwps = wps.par
  parwps.zip(parwps drop 1).map(Function.tupled(distance)).sum
}
My guess is that this would work best if you have several cores to throw at the problem, and wps tends to be somewhat long. If you have few cores or short paths, then parallelism will probably hurt more than it helps.
The other extreme would be a fully imperative solution. Writing imperative implementations of individual, performance critical, functions is usually acceptable, so long as you avoid shared mutable state. But once you get used to FP, you'll find this sort of function more difficult to write and maintain. And it's also not easy to parallelize.
def distanceImp(wps: collection.GenSeq[Waypoint]): Double = {
  if (wps.size <= 1) {
    0.0
  } else {
    var r = 0.0
    var here = wps.head
    var remaining = wps.tail
    while (!remaining.isEmpty) {
      r += distance(here, remaining.head)
      here = remaining.head
      remaining = remaining.tail
    }
    r
  }
}
Finally, if you're looking for a middle ground between FP and imperative, you might try recursion. I haven't profiled it, but my guess is that this will be roughly equivalent to the imperative solution in terms of performance.
def distanceRec(wps: collection.GenSeq[Waypoint]): Double = {
  def helper(acc: Double, here: Waypoint, remaining: collection.GenSeq[Waypoint]): Double =
    if (remaining.isEmpty)
      acc
    else
      helper(acc + distance(here, remaining.head), remaining.head, remaining.tail)
  if (wps.size <= 1)
    0.0
  else
    helper(0.0, wps.head, wps.tail)
}
If you are doing indexing of any kind you want to be using Vector, not List:
scala> def timed(op: => Unit) = { val start = System.nanoTime; op; (System.nanoTime - start) / 1e9 }
timed: (op: => Unit)Double
scala> val l = List.fill(100000)(1)
scala> val v = Vector.fill(100000)(1)
scala> timed { var t = 0; for (i <- 0 until l.length - 1) yield t += l(i) + l(i + 1) }
res2: Double = 16.252194583
scala> timed { var t = 0; for (i <- 0 until v.length - 1) yield t += v(i) + v(i + 1) }
res3: Double = 0.047047654
ListBuffer offers fast appends, but it doesn't offer fast random access.

Why stream fold operation throws Out of memory exception?

I have the following simple code
def fib(i:Long,j:Long):Stream[Long] = i #:: fib(j, i+j)
(0l /: fib(1,1).take(10000000)) (_+_)
And it throws an OutOfMemoryError.
I cannot understand why, because I think all the parts use constant memory, i.e. lazily evaluated streams and foldLeft...
This code doesn't work either:
fib(1,1).take(10000000).sum (or max, min, etc.)
How do I correctly implement infinite streams and perform iterative operations on them?
Scala version: 2.9.0
Also, the Scala documentation says that the foldLeft operation is memory safe for streams
/** Stream specialization of foldLeft which allows GC to collect
 * along the way.
 */
override final def foldLeft[B](z: B)(op: (B, A) => B): B = {
  if (this.isEmpty) z
  else tail.foldLeft(op(z, head))(op)
}
An implementation with iterators is still not useful, since it throws a StackOverflowError
def fib(i:Long,j:Long): Iterator[Long] = Iterator(i) ++ fib(j, i + j)
How do I correctly define an infinite stream/iterator in Scala?
I don't care about int overflow, I just want to understand how to create an infinite stream/iterator etc. in Scala without side effects.
The reason to use Stream instead of Iterator is so that you don't have to calculate all the small terms in the series over again. But this means that you need to store ten million stream nodes. These are pretty large, unfortunately, so that could be enough to overflow the default memory. The only realistic way to overcome this is to start with more memory (e.g. scala -J-Xmx2G). (Also, note that you're going to overflow Long by an enormous margin; the Fibonacci series increases pretty quickly.)
P.S. The iterator implementation I have in mind is completely different; you don't build it out of concatenated singleton Iterators:
def fib(i: Long, j: Long) = Iterator.iterate((i,j)){ case (a,b) => (b,a+b) }.map(_._1)
Now when you fold, past results can be discarded.
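As a quick illustration of that iterator, the sketch below swaps Long for BigInt (an assumption made here so that nothing overflows) and folds a prefix of the sequence. `Fib`, `fibs`, and `sumFirst` are names chosen for the example.

```scala
object Fib {
  // Each step only carries the current pair (a, b); earlier terms are
  // not retained, so they can be garbage collected.
  def fibs: Iterator[BigInt] =
    Iterator.iterate((BigInt(1), BigInt(1))) { case (a, b) => (b, a + b) }.map(_._1)

  // Sum of the first n Fibonacci numbers, consumed one element at a time.
  def sumFirst(n: Int): BigInt = fibs.take(n).sum
}
```

Because `fibs` is a def, every call starts a fresh iterator, which matters since an Iterator can only be traversed once.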
The OutOfMemoryError happens independently of the fact that you use Stream. As Rex Kerr mentioned above, Stream -- unlike Iterator -- stores everything in memory. The difference from List is that the elements of a Stream are calculated lazily, but once you reach 10000000, there will be 10000000 elements, just like with a List.
Try with new Array[Int](10000000), you will have the same problem.
To calculate the Fibonacci number as above you may want to use a different approach. You can take into account the fact that you only need to keep two numbers, instead of all the Fibonacci numbers discovered so far.
For example:
scala> def fib(i:Long,j:Long): Iterator[Long] = Iterator(i) ++ fib(j, i + j)
fib: (i: Long,j: Long)Iterator[Long]
And to get, for example, the index of the first fibonacci number exceeding 1000000:
scala> fib(1, 1).indexWhere(_ > 1000000)
res12: Int = 30
Edit: I added the following lines to cope with the StackOverflow
If you really want to work with the 1 millionth Fibonacci number, the iterator definition above will not work either, because of a StackOverflowError. The following is the best I have in mind at the moment:
class FibIterator extends Iterator[BigDecimal] {
  var i: BigDecimal = 1
  var j: BigDecimal = 1
  def next = {
    val temp = i
    i = i + j
    j = temp
    j
  }
  def hasNext = true
}
scala> new FibIterator().take(1000000).foldLeft(0:BigDecimal)(_ + _)
res49: BigDecimal = 82742358764415552005488531917024390424162251704439978804028473661823057748584031
@yura's problem:
def fib(i:Long,j:Long):Stream[Long] = i #:: fib(j, i+j)
(0l /: fib(1,1).take(10000000)) (_+_)
besides using a Long which can't possibly hold the Fibonacci of 10,000,000, it does work. That is, if the foldLeft is written as:
Looking at the Streams.scala source, foldLeft() is clearly designed for Garbage Collection, but /: is not def'd.
The other answers alluded to another problem. The Fibonacci of 10 million is a big number, and if BigInt is used, instead of just overflowing like with a Long, absolutely enormous numbers are being added to each other over and over again.
Since Stream.foldLeft is optimized for GC it does look like the way to solve for really big Fibonacci numbers, rather than using a zip or tail recursion.
// Fibonacci using BigInt
def fib(i:BigInt,j:BigInt):Stream[BigInt] = i #:: fib(j, i+j)
Results of the above code: 10,000,000 is an 8-figure number. How many figures in fib(10000000)? 2,089,877
fib(1,1).take(10000000) is the "this" of the method /:, it is likely that the JVM will consider the reference alive as long as the method runs, even if in this case, it might get rid of it.
So you keep a reference on the head of the stream all along, hence on the whole stream as you build it to 10M elements.
You could just use recursion, which is about as simple:
def fibSum(terms: Int, i: Long = 1, j: Long = 1, total: Long = 2): Long = {
  if (terms == 2) total
  else fibSum(terms - 1, j, i + j, total + i + j)
}
With this, you can "fold" a billion elements in only a couple of seconds, but as Rex points out, summing the Fibonacci sequence overflows Long very quickly.
If you really wanted to know the answer to your original problem and don't mind sacrificing some accuracy you could do this:
def fibSum(terms: Int, i: Double = 1, j: Double = 1, tot: Double = 2,
           exp: Int = 0): String = {
  if (terms == 2) "%.6f".format(tot) + " E+" + exp
  else {
    val (i1, j1, tot1, exp1) =
      if (tot + i + j > 10) (i/10, j/10, tot/10, exp + 1)
      else (i, j, tot, exp)
    fibSum(terms - 1, j1, i1 + j1, tot1 + i1 + j1, exp1)
  }
}
scala> fibSum(10000000)
res54: String = 2.957945 E+2089876