I am not asking whether I should use recursion or iteration, or which of them is faster. I was trying to understand the time taken by iteration and recursion, and I came across an interesting pattern: whichever one is at the top of the file takes more time than the other.
For example: if I write the loop first, it takes more time than the recursion, and vice versa. The difference between the two timings is significant, approximately 30 to 40 times.
My questions are:
Does the order of the loop and the recursion matter?
Is it something related to print?
What could be the possible reason for such behaviour?
The following is the code I have in the same file; the language I am using is Scala.
def count(x: Int): Unit = {
if (x <= 1000) {
print(s"$x ")
count(x + 1)
}
}
val t3 = System.currentTimeMillis()
count(1)
val t4 = System.currentTimeMillis()
println(s"\ntime taken by the recursion look = ${t4 - t3} mili second")
var c = 1
val t1 = System.currentTimeMillis()
while(c <= 1000)
{
print(s"$c ")
c+=1
}
val t2 = System.currentTimeMillis()
println(s"\ntime taken by the while loop = ${t2 - t1} mili second")
In this situation the times taken by the recursion and the while loop are 986 ms and 20 ms respectively.
When I switch the positions of the loop and the recursion, so that the loop comes first and the recursion second, the times taken by the recursion and the while loop are 1.69 s and 28 ms respectively.
Edit 1:
I can see the same behaviour with BufferedWriter if the recursion code is on top, but not when the recursion is below the loop. When the recursion is below the loop, both take almost the same time, with a difference of 2 to 3 ms.
If you wanted to convince yourself that the tailrec-optimization works, without relying on any profiling tools, here is what you could try:
Use way more iterations
Throw away the first few iterations to give the JIT the time to wake up and do the hotspot-optimizations
Throw away all unpredictable side effects like printing to stdout
Throw away all costly operations that are the same in both approaches (formatting numbers etc.)
Measure in multiple rounds
Randomize the number of repetitions in each round
Randomize the order of variants within each round, to avoid any "catastrophic resonance" with the cycles of the garbage collector
Preferably, don't run anything else on the computer
Something along these lines:
def compare(
xs: Array[(String, () => Unit)],
maxRepsPerBlock: Int = 10000,
totalRounds: Int = 100000,
warmupRounds: Int = 1000
): Unit = {
val n = xs.size
val times: Array[Long] = Array.ofDim[Long](n)
val rng = new util.Random
val indices = (0 until n).toList
var totalReps: Long = 0
for (round <- 1 to totalRounds) {
val order = rng.shuffle(indices)
val reps = rng.nextInt(maxRepsPerBlock / 2) + maxRepsPerBlock / 2
for (i <- order) {
var r = 0
while (r < reps) {
r += 1
val start = System.currentTimeMillis
(xs(i)._2)()
val end = System.currentTimeMillis
if (round > warmupRounds) {
times(i) += (end - start)
}
}
}
if (round > warmupRounds) {
totalReps += reps
}
}
for (i <- 0 until n) {
println(f"${xs(i)._1}%20s : ${times(i) / totalReps.toDouble}")
}
}
def gaussSumWhile(n: Int): Long = {
var acc: Long = 0
var i = 0
while (i <= n) {
acc += i
i += 1
}
acc
}
@annotation.tailrec
def gaussSumRec(n: Int, acc: Long = 0, i: Int = 0): Long = {
if (i <= n) gaussSumRec(n, acc + i, i + 1)
else acc
}
compare(Array(
("while", { () => gaussSumWhile(1000) }),
("@tailrec", { () => gaussSumRec(1000) })
))
Here is what it prints:
while : 6.737733046257334E-5
@tailrec : 6.70325653896487E-5
Even the simple hints above are sufficient for creating a benchmark that shows that the while loop and the tail-recursive function take roughly the same time.
Scala does not compile into machine code but into bytecode for the "Java Virtual Machine" (JVM), which then interprets that code on the native processor. The JVM uses multiple mechanisms to optimise code that is run frequently, eventually converting the frequently called functions ("hotspots") into pure machine code.
This means that testing the first run of a function does not give a good measure of eventual performance. You need to "warm up" the JIT compiler by running the test code many times before attempting to measure the time taken.
Also, as noted in the comments, doing any kind of I/O is going to make timings very unreliable because there is a danger that the I/O will block. Write a test case that does not do any blocking, if possible.
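The two points above (warm up first, keep I/O out of the timed region) can be sketched as a small helper. This is only a rough illustration, not a replacement for a real benchmarking harness; the names `WarmupTiming`, `meanNanos` and `sumTo` are made up for this example.

```scala
object WarmupTiming {
  // Hypothetical helper: calls `block` `warmup` times without timing it, then
  // returns the mean time per call in nanoseconds over `measured` timed calls,
  // together with the last result. Assumes warmup >= 1 and measured >= 1.
  def meanNanos[A](warmup: Int, measured: Int)(block: => A): (Double, A) = {
    var last = block // the first call counts as part of the warm-up
    var i = 1
    while (i < warmup) { last = block; i += 1 }
    val start = System.nanoTime()
    i = 0
    while (i < measured) { last = block; i += 1 }
    val elapsed = System.nanoTime() - start
    (elapsed.toDouble / measured, last)
  }

  // Pure computation with no printing, so nothing can block while it is timed.
  def sumTo(n: Int): Long = {
    var acc = 0L
    var i = 1
    while (i <= n) { acc += i; i += 1 }
    acc
  }
}
```

A call such as `WarmupTiming.meanNanos(warmup = 10000, measured = 10000)(WarmupTiming.sumTo(1000))` returns the mean and the last computed sum. The absolute numbers will vary from machine to machine and from run to run.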
Related
Clearly, if you need to count up, count up. If you need to count down, count down. However, other things being equal, is one faster than the other?
Here is my Scala code for a well-known puzzle - checking if a number is divisible by 13.
In the first example, I reverse my array and count upwards in the subsequent for-loop. In the second example I leave the array alone and do a decrementing for-loop. On the surface, the second example looks faster. Unfortunately, on the site where I run the code, it always times out.
// works every time
object Thirteen {
import scala.annotation.tailrec
@tailrec
def thirt(n: Long): Long = {
val getNum = (n: Int) => Array(1, 10, 9, 12, 3, 4)(n % 6)
val ni = n.toString.split("").reverse.map(_.toInt)
var s: Long = 0
for (i <- 0 to ni.length-1) {
s += ni(i) * getNum(i)
}
if (s == n) s else thirt(s)
}
}
// times out every time
object Thirteen {
import scala.annotation.tailrec
@tailrec
def thirt(n: Long): Long = {
val getNum = (n: Int) => Array(1, 10, 9, 12, 3, 4)(n % 6)
val ni = n.toString.split("").map(_.toInt)
var s: Long = 0
for (i <- ni.length-1 to 0 by -1) {
s = s + ni(i) * getNum(i)
}
if (s == n) s else thirt(s)
}
}
I ask the following questions:
Is there an obvious rule I am unaware of?
What is an easy way to test two code versions for performance? Reliably measuring performance in the JVM appears difficult.
Does it help to look at the underlying byte code?
Is there a better piece of code solving the same problem? If so, I'd be very grateful to see it.
Whilst I've seen similar questions, I can't find a definitive answer.
Here's how I'd be tempted to tackle it.
val nums :Stream[Int] = 1 #:: 10 #:: 9 #:: 12 #:: 3 #:: 4 #:: nums
def thirt(n :Long) :Long = {
val s :Long = Stream.iterate(n)(_ / 10)
.takeWhile(_ > 0)
.zip(nums)
.foldLeft(0L){case (sum, (i, num)) => sum + i%10 * num}
if (s == n) s else thirt(s)
}
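For comparison, here is a sketch that avoids both the string conversion and the lazy collection by peeling digits off arithmetically. It assumes a non-negative input; the object name `ThirteenDigits` and the helper `step` are made up for this example.

```scala
object ThirteenDigits {
  import scala.annotation.tailrec

  // Remainders of 10^0, 10^1, 10^2, ... modulo 13; the sequence repeats with period 6.
  private val weights = Array(1, 10, 9, 12, 3, 4)

  // One pass: multiply the digits of n, from right to left, by the repeating
  // weights and add up the products.
  def step(n: Long): Long = {
    var rest = n
    var pos = 0
    var sum = 0L
    while (rest > 0) {
      sum += (rest % 10) * weights(pos % 6)
      rest /= 10
      pos += 1
    }
    sum
  }

  // Repeat the pass until the sum stops changing (assumes n >= 0).
  @tailrec
  def thirt(n: Long): Long = {
    val s = step(n)
    if (s == n) s else thirt(s)
  }
}
```

For example, `ThirteenDigits.thirt(1234567L)` goes 1234567 → 178 → 87 and then stays at 87.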
I am very naively trying to use Scala .par, and the result turns out to be slower than the non-parallel version, by quite a bit. What is the explanation for that?
Note: the question is not to make this faster, but to understand why this naive use of .par doesn't yield an immediate speed-up.
Note 2: timing method: I ran both methods with N = 10000. The first one returned in about 20s. The second one I killed after 3 minutes. Not even close. If I let it run longer I get into a Java heap space exception.
def pi_random(N: Long): Double = {
val count = (0L until N * N)
.map { _ =>
val (x, y) = (rng.nextDouble(), rng.nextDouble())
if (x*x + y*y <= 1) 1 else 0
}
.sum
4 * count.toDouble / (N * N)
}
def pi_random_parallel(N: Long): Double = {
val count = (0L until N * N)
.par
.map { _ =>
val (x, y) = (rng.nextDouble(), rng.nextDouble())
if (x*x + y*y <= 1) 1 else 0
}
.sum
4 * count.toDouble / (N * N)
}
Hard to know for sure without doing some actual profiling, but I have two theories:
First, you may be losing some benefits of the Range class, specifically near-zero memory usage. When you do (0L until N * N), you create a Range object, which is lazy. It does not actually create any object holding every single number in the range. Neither does map, I think. And sum calculates and adds numbers one at a time, so also allocates barely any memory.
I'm not sure the same is all true about ParRange. Seems like it would have to allocate some amount per split, and after map is called, perhaps it might have to store some amount of intermediate results in memory as "neighboring" splits wait for the other to complete. Especially the heap space exception makes me think something like this is the case. So you'll lose a lot of time to GC and such.
Second, the calls to rng.nextDouble are probably by far the most expensive part of that inner function. But I believe both the Java and Scala Random classes are essentially single-threaded. They synchronize and block internally. So you won't gain that much from parallelism anyway, and in fact lose some to overhead.
There is not enough work per task; the task granularity is too fine.
Creating each task requires some overhead:
Some object representing the task must be created
It must be ensured that only one thread executes one task at a time
In the case that some threads become idle, some job-stealing procedure must be invoked.
For N = 10000, you instantiate 100,000,000 tiny tasks. Each of those tasks does almost nothing: it generates two random numbers and performs some basic arithmetic and an if-branch. The overhead of creating a task dwarfs the work that each task is doing.
The tasks must be much larger, so that each thread has enough work to do. Furthermore, it's probably faster if you make each RNG thread local, so that the threads can do their job in parallel, without permanently locking the default random number generator.
Here is an example:
import java.util.concurrent.ThreadLocalRandom
import scala.util.Random
def pi_random(N: Long): Double = {
val rng = new Random
val count = (0L until N * N)
.map { _ =>
val (x, y) = (rng.nextDouble(), rng.nextDouble())
if (x*x + y*y <= 1) 1 else 0
}
.sum
4 * count.toDouble / (N * N)
}
def pi_random_parallel(N: Long): Double = {
val rng = new Random
val count = (0L until N * N)
.par
.map { _ =>
val (x, y) = (rng.nextDouble(), rng.nextDouble())
if (x*x + y*y <= 1) 1 else 0
}
.sum
4 * count.toDouble / (N * N)
}
def pi_random_properly(n: Long): Double = {
val count = (0L until n).par.map { _ =>
val rng = ThreadLocalRandom.current
var sum = 0
var idx = 0
while (idx < n) {
val (x, y) = (rng.nextDouble(), rng.nextDouble())
if (x*x + y*y <= 1.0) sum += 1
idx += 1
}
sum
}.sum
4 * count.toDouble / (n * n)
}
Here is a little demo and timings:
def measureTime[U](repeats: Long)(block: => U): Double = {
val start = System.currentTimeMillis
var iteration = 0
while (iteration < repeats) {
iteration += 1
block
}
val end = System.currentTimeMillis
(end - start).toDouble / repeats
}
// basic sanity check that all algos return roughly same result
println(pi_random(2000))
println(pi_random_parallel(2000))
println(pi_random_properly(2000))
// time comparison (N = 2k, 10 repetitions for each algorithm)
val N = 2000
val Reps = 10
println("Sequential: " + measureTime(Reps)(pi_random(N)))
println("Naive: " + measureTime(Reps)(pi_random_parallel(N)))
println("My version: " + measureTime(Reps)(pi_random_properly(N)))
Output:
3.141333
3.143418
3.14142
Sequential: 621.7
Naive: 3032.6
My version: 44.7
Now the parallel version is roughly an order of magnitude faster than the sequential version (result will obviously depend on the number of cores etc.).
I couldn't test it with N = 10000, because the naively parallelized version crashed everything with a "GC overhead limit exceeded" error, which also illustrates that the overhead for creating the tiny tasks is too large.
In my implementation, I've additionally turned the inner iteration into a plain while loop: you need only one counter in one register, and there is no need to create a huge collection by mapping over the range.
Edit: Replaced everything by ThreadLocalRandom, so it shouldn't matter any more whether your compiler version supports SAM or not, and it should work with earlier versions of 2.11 too.
I'm trying to write some code as below -
def kthSmallest(matrix: Array[Array[Int]], k: Int): Int = {
val pq = new PriorityQueue[Int]() //natural ordering
var count = 0
for (
i <- matrix.indices;
j <- matrix(0).indices
) yield {
pq += matrix(i)(j)
count += 1
} //This would yield Any!
pq.dequeue() //kth smallest.
}
My question is this: I only want to loop while count is less than k (something like takeWhile(count != k)), but as I'm also inserting elements into the priority queue in the yield, this won't work in its current state.
My other option is to write a nested loop and return once count reaches k. Is it possible to do this with yield? I could not find an idiomatic way of doing it yet. Any pointers would be helpful.
It's not idiomatic in Scala to use vars or to break out of loops. You can go for recursion or lazy evaluation, or duct-tape a break onto the loop and give up some performance (just like a non-local return, break is implemented by throwing an exception, so it will not perform as well). Here are the options broken down:
Use recursion - recursive algorithms are the analog of loops in functional languages
def kthSmallest(matrix: Array[Array[Int]], k: Int): Int = {
val pq = new PriorityQueue[Int]() //natural ordering
@tailrec
def fillQueue(i: Int, j: Int, count: Int): Unit =
if (count >= k || i >= matrix.length) ()
else {
pq += matrix(i)(j)
fillQueue(
if (j >= matrix(i).length - 1) i + 1 else i,
if (j >= matrix(i).length - 1) 0 else j + 1,
count + 1)
}
fillQueue(0, 0, 0)
pq.dequeue() //kth smallest.
}
Use a lazy structure, as chengpohi suggested, although this doesn't sound very much like a pure function. I'd suggest using an Iterator instead of a Stream in this case, as iterators don't memoize the steps they've gone through (which might spare some memory for large matrices).
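A minimal sketch of the Iterator variant, mirroring only the "insert the first k cells" part of the question. The names `LazyFill` and `firstK` are made up for this example.

```scala
import scala.collection.mutable.PriorityQueue

object LazyFill {
  // Walks the cells lazily in row-major order and stops after the first k of
  // them, so the remaining cells are never visited.
  def firstK(matrix: Array[Array[Int]], k: Int): PriorityQueue[Int] = {
    val pq = PriorityQueue.empty[Int] // natural ordering
    val cells = for {
      row <- matrix.iterator
      x   <- row.iterator
    } yield x
    cells.take(k).foreach(pq += _)
    pq
  }
}
```

Note that with the natural ordering, `dequeue()` returns the largest element currently held in the queue, exactly as in the code from the question.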
For those desperately willing to use break, Scala supports it in an attachable fashion (note the performance caveat mentioned above):
import scala.util.control.Breaks._
breakable {
// loop code
break()
}
There is a way to do this using the lazy evaluation of Stream. Since a for/yield desugars to flatMap and map calls, you can rewrite the for/yield as flatMap and map over a Stream:
matrix.indices.toStream.flatMap(i => {
matrix(0).indices.toStream.map(j => {
pq += matrix(i)(j)
count += 1
})
}).takeWhile(_ => count <= k)
toStream converts the collection to a Stream. Because a Stream is evaluated lazily, takeWhile can test count and stop the traversal early, without evaluating the remaining elements.
My code is equivalent to this:
def iterate(prev: Vector[Int], acc: Int): Vector[Int] = {
val next = (for { i <- 1.to(1000000) }
yield (prev(Random.nextInt(i))) ).toVector
if (acc < 20) iterate(next, acc + 1)
else next
}
iterate(1.to(1000000).toVector, 1)
For a large number of iterations, it does an operation on a collection, and yields the value. At the end of the iterations, it converts everything to a vector. Finally, it proceeds to the next recursive self-call, but it cannot proceed until it has all the iterations done. The number of the recursive self-calls is very small.
I want to parallelize this, so I tried to use .par on the 1.to(1000000) range. This used 8 processes instead of 1, and the result was only twice as fast! .toParArray was only slightly faster than .par. I was told it could be much faster if I used something different, like maybe ThreadPool. This makes sense, because all of the time is spent in constructing next, and I assume that concatenating the outputs of different processes onto shared memory will not result in huge slowdowns, even for very large outputs (this is a key assumption and it might be wrong). How can I do it? If you provide code, parallelizing the code I gave will be sufficient.
Note that the code I gave is not my actual code. My actual code is much longer and more complex (Held-Karp algorithm for TSP with constraints, BitSets and more stuff), and the only notable difference is that in my code, prev's type is ParMap instead of Vector.
Edit, extra information: the ParMap has 350k elements on the worst iteration at the biggest sample size I can handle, and otherwise it's typically 5k-200k (that varies on a log scale). If it inherently needs a lot of time to concatenate the results from the processes into one single process (I assume this is what's happening), then there is nothing much I can do, but I rather doubt this is the case.
I implemented a few versions alongside the original proposed in the question:
rec0 is the original with a for loop;
rec1 uses par.map instead of for loop;
rec2 follows rec1 yet it employs parallel collection ParArray for lazy builders (and fast access on bulk traversal operations);
rec3 is a non-idiomatic non-parallel version with mutable ArrayBuffer.
Thus
import scala.collection.mutable.ArrayBuffer
import scala.collection.parallel.mutable.ParArray
import scala.util.Random
// Original
def rec0() = {
def iterate(prev: Vector[Int], acc: Int): Vector[Int] = {
val next = (for { i <- 1.to(1000000) }
yield (prev(Random.nextInt(i))) ).toVector
if (acc < 20) iterate(next, acc + 1)
else next
}
iterate(1.to(1000000).toVector, 1)
}
// par map
def rec1() = {
def iterate(prev: Vector[Int], acc: Int): Vector[Int] = {
val next = (1 to 1000000).par.map { i => prev(Random.nextInt(i)) }.toVector
if (acc < 20) iterate(next, acc + 1)
else next
}
iterate(1.to(1000000).toVector, 1)
}
// ParArray par map
def rec2() = {
def iterate(prev: ParArray[Int], acc: Int): ParArray[Int] = {
val next = (1 to 1000000).par.map { i => prev(Random.nextInt(i)) }.toParArray
if (acc < 20) iterate(next, acc + 1)
else next
}
iterate((1 to 1000000).toParArray, 1).toVector
}
// Non-idiomatic non-parallel
def rec3() = {
def iterate(prev: ArrayBuffer[Int], acc: Int): ArrayBuffer[Int] = {
var next = ArrayBuffer.tabulate(1000000){i => i+1}
var i = 0
while (i < 1000000) {
next(i) = prev(Random.nextInt(i+1))
i = i + 1
}
if (acc < 20) iterate(next, acc + 1)
else next
}
iterate(ArrayBuffer.tabulate(1000000){i => i+1}, 1).toVector
}
Then a little testing on averaging elapsed times,
def elapsed[A] (f: => A): Double = {
val start = System.nanoTime()
f
val stop = System.nanoTime()
(stop-start)*1e-6d
}
val times = 10
val e0 = (1 to times).map { i => elapsed(rec0) }.sum / times
val e1 = (1 to times).map { i => elapsed(rec1) }.sum / times
val e2 = (1 to times).map { i => elapsed(rec2) }.sum / times
val e3 = (1 to times).map { i => elapsed(rec3) }.sum / times
// time in ms.
e0: Double = 2782.341
e1: Double = 2454.828
e2: Double = 3455.976
e3: Double = 1275.876
shows that the non-idiomatic, non-parallel version proves the fastest on average. Perhaps for larger input data, the parallel, idiomatic versions may be beneficial.
I have two arrays (that I have pulled out of a matrix, Array[Array[Int]]) and I need to subtract one from the other.
At the moment I am using this method; however, when I profile it, it is the bottleneck.
def subRows(a: Array[Int], b: Array[Int], sizeHint: Int): Array[Int] = {
val l: Array[Int] = new Array(sizeHint)
var i = 0
while (i < sizeHint) {
l(i) = a(i) - b(i)
i += 1
}
l
}
I need to do this billions of times so any improvement in speed is a plus.
I have tried using a List instead of an Array to collect the differences and it is MUCH faster but I lose all benefit when I convert it back to an Array.
I did modify the downstream code to take a List to see if that would help but I need to access the contents of the list out of order so again there is loss of any gains there.
It seems like any conversion of one type to another is expensive and I am wondering if there is some way to use a map etc. that might be faster.
Is there a better way?
EDIT
Not sure what I did the first time!?
So the code I used to test it was this:
def subRowsArray(a: Array[Int], b: Array[Int], sizeHint: Int): Array[Int] = {
val l: Array[Int] = new Array(sizeHint)
var i = 0
while (i < sizeHint) {
l(i) = a(i) - b(i)
i += 1
}
l
}
def subRowsList(a: Array[Int], b: Array[Int], sizeHint: Int): List[Int] = {
var l: List[Int] = Nil
var i = 0
while (i < sizeHint) {
l = a(i) - b(i) :: l
i += 1
}
l
}
val a = Array.fill(100, 100)(scala.util.Random.nextInt(2))
val loops = 30000 * 10000
def runArray = for (i <- 1 to loops) subRowsArray(a(scala.util.Random.nextInt(100)), a(scala.util.Random.nextInt(100)), 100)
def runList = for (i <- 1 to loops) subRowsList(a(scala.util.Random.nextInt(100)), a(scala.util.Random.nextInt(100)), 100)
def optTimer(f: => Unit) = {
val s = System.currentTimeMillis
f
System.currentTimeMillis - s
}
The results I thought I got the first time I did this were the exact opposite... I must have misread or mixed up the methods.
My apologies for asking a bad question.
That code is the fastest you can manage single-threaded using a standard JVM. If you think List is faster, you're either fooling yourself or not actually telling us what you're doing. Putting an Int into List requires two object creations: one to create the list element, and one to box the integer. Object creations take about 10x longer than an array access. So it's really not a winning proposition to do it any other way.
If you really, really need to go faster, and must stay with a single thread, you should probably switch to C++ or the like and explicitly use SSE instructions. See this question, for example.
If you really, really need to go faster and can use multiple threads, then the easiest is to package up a chunk of work like this (i.e. a sensible number of pairs of vectors that need to be subtracted--probably at least a few million elements per chunk) into a list as long as the number of processors on your machine, and then call list.par.map(yourSubtractionRoutineThatActsOnTheChunkOfWork).
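The chunking idea can be sketched as follows. Note the swap: instead of list.par.map, this sketch uses Futures from the standard library, because parallel collections live in a separate module since Scala 2.13. The names `ChunkedSub`, `Pair` and `subAll` are made up for this example, and the chunk sizing is only a rough assumption.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object ChunkedSub {
  type Pair = (Array[Int], Array[Int])

  // Same inner loop as in the question, sized from the first array.
  def subRows(a: Array[Int], b: Array[Int]): Array[Int] = {
    val out = new Array[Int](a.length)
    var i = 0
    while (i < a.length) { out(i) = a(i) - b(i); i += 1 }
    out
  }

  // Splits the pairs into roughly one chunk per available processor, subtracts
  // each chunk in its own Future, and returns the results in the original order.
  def subAll(pairs: Vector[Pair]): Vector[Array[Int]] = {
    val workers = Runtime.getRuntime.availableProcessors()
    val chunkSize = math.max(1, (pairs.length + workers - 1) / workers)
    val tasks = pairs.grouped(chunkSize).toVector.map { chunk =>
      Future(chunk.map { case (a, b) => subRows(a, b) })
    }
    Await.result(Future.sequence(tasks), Duration.Inf).flatten
  }
}
```

Whether this pays off depends on each chunk carrying enough work to outweigh the scheduling overhead, which is the point made above about chunk size.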
Finally, if you can be destructive,
a(i) -= b(i)
in the inner loop is, of course, faster. Likewise, if you can reuse space (e.g. with System.arraycopy), you're better off than if you have to keep allocating it. But that changes the interface from what you've shown.
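As a rough sketch of those two variants (the names `InPlaceSub`, `subInPlace` and `subInto` are made up for this example):

```scala
object InPlaceSub {
  // Destructive variant: overwrites a with a - b and allocates nothing.
  def subInPlace(a: Array[Int], b: Array[Int], size: Int): Unit = {
    var i = 0
    while (i < size) { a(i) -= b(i); i += 1 }
  }

  // Reuse variant: writes a - b into a buffer supplied by the caller, so the
  // same buffer can be recycled across many calls instead of being reallocated.
  def subInto(a: Array[Int], b: Array[Int], out: Array[Int], size: Int): Unit = {
    var i = 0
    while (i < size) { out(i) = a(i) - b(i); i += 1 }
  }
}
```

Both change the interface: the caller has to accept that `a` is modified, or has to manage the lifetime of the output buffer.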
You can use ScalaMeter to benchmark the two implementations; it requires at least JRE 7 update 4 and Scala 2.10 to run. I used Scala 2.10 RC2.
Compile with scalac -cp scalameter_2.10-0.2.jar RangeBenchmark.scala.
Run with scala -cp scalameter_2.10-0.2.jar:. RangeBenchmark.
Here's the code I used:
import org.scalameter.api._
object RangeBenchmark extends PerformanceTest.Microbenchmark {
val limit = 100
val a = new Array[Int](limit)
val b = new Array[Int](limit)
val array: Array[Int] = new Array(limit)
var list: List[Int] = Nil
val ranges = for {
size <- Gen.single("size")(limit)
} yield 0 until size
measure method "subRowsArray" in {
using(ranges) curve("Range") in {
var i = 0
while (i < limit) {
array(i) = a(i) - b(i)
i += 1
}
r => array
}
}
measure method "subRowsList" in {
using(ranges) curve("Range") in {
var i = 0
while (i < limit) {
list = a(i) - b(i) :: list
i += 1
}
r => list
}
}
}
Here are the results:
::Benchmark subRowsArray::
Parameters(size -> 100): 8.26E-4
::Benchmark subRowsList::
Parameters(size -> 100): 7.94E-4
You can draw your own conclusions. :)
The stack blew up on larger values of limit. I'll guess it's because it's measuring the performance many times.