Scala recursion depth differences and StackOverflowError below allowed depth - scala

The following program was written to check how deep recursion in Scala can go on my machine. I assume the limit mostly depends on the stack size assigned to the program. However, after measuring the maximal recursion depth (by catching the exception when it appears) and then running a recursion of that depth, I got odd results.
object RecursionTest {
  def sumArrayRec(elems: Array[Int]) = {
    def goFrom(i: Int, size: Int, elms: Array[Int]): Int = {
      if (i < size) elms(i) + goFrom(i + 1, size, elms)
      else 0
    }
    goFrom(0, elems.length, elems)
  }

  def main(args: Array[String]) {
    val recDepth = recurseTest(0, 0, Array(1))
    println("sumArrayRec = " + sumArrayRec((0 until recDepth).toArray))
  }

  def recurseTest(i: Int, j: Int, arr: Array[Int]): Int = {
    try {
      recurseTest(i + 1, j + 1, arr)
    } catch { case e: java.lang.StackOverflowError =>
      println("Recursion depth on this system is " + i + ".")
      i
    }
  }
}
The results vary among executions of the program.
One of them is the desired output:
/usr/lib/jvm/java-8-oracle/bin/java
Recursion depth on this system is 7166.
sumArrayRec = 25672195
Process finished with exit code 0
The other output I sometimes get, however, indicates an error:
/usr/lib/jvm/java-8-oracle/bin/java
Recursion depth on this system is 8129.
Exception in thread "main" java.lang.StackOverflowError
at RecursionTest$.goFrom$1(lab44.scala:6)
at RecursionTest$.goFrom$1(lab44.scala:6)
at RecursionTest$.goFrom$1(lab44.scala:6)
(...)
at RecursionTest$.goFrom$1(lab44.scala:6)
at RecursionTest$.goFrom$1(lab44.scala:6)
Process finished with exit code 1
I couldn't find any pattern to which of the two outputs I get; one run produces the first and another run the second.
All of this led me to the following questions:
Why do I get an error, and what causes it?
Is this a stack overflow error, and if so...
Why does the stack size change while the program is running? Is that even possible?
When I changed println("sumArrayRec = " + sumArrayRec((0 until recDepth).toArray)) to
println("sumArrayRec = " + sumArrayRec((0 until recDepth - 5000).toArray)), i.e. a much smaller depth, the behaviour remained the same.

You get a stack overflow because you cause a stack overflow
Yes
The max stack size doesn't change; the JVM sets it. Your two methods, recurseTest and sumArrayRec, push different amounts onto the stack per call. recurseTest is basically pushing two ints onto the stack with each call, while sumArrayRec is pushing three, since you have a call to elms(i), and probably more for the addition and if/else (more instructions). It's tough to tell exactly, since the JVM handles all of this, and its purpose is to hide these details from you. Also, who knows what optimizations the JVM is doing behind the scenes, which can drastically affect the stack space consumed by each method call. On multiple runs you'll get different depths due to system timing and so on; maybe the JVM optimizes something in the short time the program has run, maybe it doesn't. It's non-deterministic in this scenario. If you ran some warmup code beforehand, your tests would be more deterministic, since any optimizations would already have (or not have) taken place.
You should look into using something like JMH for your tests; it will help with the determinism.
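For instance, a minimal JMH sketch (assuming the sbt-jmh plugin is set up; the depth of 5000 is just an arbitrary illustration). JMH runs warmup iterations before measuring, which is exactly what helps here:
import org.openjdk.jmh.annotations.{Benchmark, Scope, State}

@State(Scope.Thread)
class RecursionBench {
  // the warmup iterations let the JIT settle before measurement starts
  @Benchmark
  def sumArray: Int = RecursionTest.sumArrayRec((0 until 5000).toArray)
}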
Side note: you can also manually change your stack size with -Xss2M, common for SBT usage.
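For example, a sketch of the relevant build.sbt settings (the 2M value is just illustrative):
// build.sbt: fork the run so the JVM option applies to your program, not to sbt itself
fork := true
javaOptions += "-Xss2M"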

Related

Write from compute shader to persistently mapped SSBO fails

I'm trying to write to an SSBO with a compute shader and read the data back on the CPU.
The compute shader is just a 1x1x1 toy example that writes 24 floats:
#version 450 core

layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;

layout (std430, binding = 0) buffer particles {
    float Particle[];
};

void main() {
    for (int i = 0; i < 24; ++i) {
        Particle[i] = i + 1;
    }
}
This is how I run the shader and read the data:
val bufferFlags = GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT
val bufferSize = 24 * 4
val bufferId = glCreateBuffers()
glNamedBufferStorage(bufferId, bufferSize, bufferFlags)
val mappedBuffer = glMapNamedBufferRange(bufferId, 0, bufferSize, bufferFlags)
mappedBuffer.rewind()
val mappedFloatBuffer = mappedBuffer.asFloatBuffer()
mappedFloatBuffer.rewind()
val ssboIndex = glGetProgramResourceIndex(progId, GL_SHADER_STORAGE_BLOCK, "particles")
val props = Array(GL_BUFFER_BINDING)
val params = Array(-1)
glGetProgramResourceiv(progId, GL_SHADER_STORAGE_BLOCK, ssboIndex, props, null, params)
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, params(0), bufferId)
glUseProgram(progId)
val sync = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0)
glDispatchCompute(1, 1, 1)
glClientWaitSync(sync, 0, 1000000000) match {
  case GL_TIMEOUT_EXPIRED =>
    println("Timeout expired")
  case GL_WAIT_FAILED =>
    println("Wait failed. " + glGetError())
  case _ =>
    println("Result:")
    while (mappedFloatBuffer.hasRemaining) {
      println(mappedFloatBuffer.get())
    }
}
I expect it to print the numbers 1 to 24, but instead it prints 24 zeros. From the CPU I can read and write to the buffer just fine (if GL_MAP_WRITE_BIT is set). The same happens if I don't use DSA (glBindBuffer / glBufferStorage / glMapBufferRange instead). However, if the buffer isn't mapped while the shader runs and I only map it just before printing the contents, everything works correctly. Isn't this exactly what persistently mapped buffers are for, so I can keep the buffer mapped while the GPU is using it?
I checked for any errors, with glGetError as well as with the newer debug output but I don't get any.
Here (pastebin) is a fully working example. You need LWJGL to run it.
There are a number of problems in your code.
First, you put the fence sync before the command you want to sync with. Syncing with a fence syncs with all commands executed before the fence, not after. If you want to sync with the compute shader execution, then you have to insert the fence after the dispatch call, not before.
Second, synchronization alone is not enough. Writes to an SSBO are incoherent, so you must follow the rules for incoherent memory accesses in order to make them visible to you. In this case, you need to insert an appropriate memory barrier, via glMemoryBarrier, between the compute operation and the point where you read from the buffer. Since you're reading the data through a persistent mapping, the proper barrier to use is GL_CLIENT_MAPPED_BUFFER_BARRIER_BIT.
Your code appears to work when you use non-persistent mapping, but this is mere appearance. It's still undefined behavior due to the improper incoherent memory access (i.e., the lack of a memory barrier). It just so happens that the UB does what you want... in that case.
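A minimal sketch of the corrected ordering, reusing the bindings from the question (this assumes the same LWJGL static imports as the linked full example; the constants are standard GL names):
glUseProgram(progId)
glDispatchCompute(1, 1, 1)

// make the shader's SSBO writes visible through the persistently mapped pointer
glMemoryBarrier(GL_CLIENT_MAPPED_BUFFER_BARRIER_BIT)

// fence *after* the dispatch, then wait for it (flush so the commands actually reach the GPU)
val sync = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0)
glClientWaitSync(sync, GL_SYNC_FLUSH_COMMANDS_BIT, 1000000000L) match {
  case GL_ALREADY_SIGNALED | GL_CONDITION_SATISFIED =>
    while (mappedFloatBuffer.hasRemaining) println(mappedFloatBuffer.get())
  case other =>
    println("Sync failed or timed out: " + other)
}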

Expensive flatMap() operation on streams originating from Stream.emits()

I just encountered an issue with degrading fs2 performance when using a stream of strings to be written to a file via text.utf8encode. I tried to change my source to use chunked strings to increase performance, but observed performance degradation instead.
As far as I can see, it boils down to the following: invoking flatMap on a stream that originates from Stream.emits() can be very expensive. Time usage seems to grow exponentially with the size of the sequence passed to Stream.emits(). The code snippet below shows an example:
/*
Test done with scala 2.11.11 and fs2 version 0.10.0-M7.
*/
val rangeSize = 20000
val integers = (1 to rangeSize).toVector
// Note that the last flatMaps are just added to show extreme load for streamA.
val streamA = Stream.emits(integers).flatMap(Stream.emit(_))
val streamB = Stream.range(1, rangeSize + 1).flatMap(Stream.emit(_))
streamA.toVector // Uses approx. 25 seconds (!)
streamB.toVector // Uses approx. 15 milliseconds
Is this a bug, or should usage of Stream.emits() for large sequences be avoided?
TLDR: Allocations.
Longer answer:
Interesting question. I ran a JFR profile on both methods separately and looked at the results. The first thing that immediately caught my eye was the amount of allocation.
Stream.emit: (JFR allocation profile, image omitted)
Stream.range: (JFR allocation profile, image omitted)
We can see that Stream.emit allocates a significant number of Append instances, the concrete implementation of Catenable[A], which is the type used by Stream.emit for folding:
private[fs2] final case class Append[A](left: Catenable[A], right: Catenable[A]) extends Catenable[A]
This comes from the way the emitted sequence is folded into a Catenable[A]:
foldLeft(empty: Catenable[B])((acc, a) => acc :+ f(a))
Where :+ allocates a new Append object for each element. This means we're at least generating 20000 such Append objects.
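To make that concrete, here is a toy model of the structure being built (my own simplified stand-in, not the actual fs2 code):
// simplified stand-in for Catenable: every append allocates one node
sealed trait Cat[+A]
case object Empty extends Cat[Nothing]
final case class Single[A](a: A) extends Cat[A]
final case class Append[A](left: Cat[A], right: Cat[A]) extends Cat[A]

// folding 20000 elements the way `acc :+ f(a)` does allocates one Append (plus one Single) per element
val built = (1 to 20000).foldLeft(Empty: Cat[Int])((acc, a) => Append(acc, Single(a)))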
There is also a hint in the documentation of Stream.range: it produces the range lazily, whereas emits produces the whole sequence in a single chunk, which may be undesirable if we're generating a big range:
/**
  * Lazily produce the range `[start, stopExclusive)`. If you want to produce
  * the sequence in one chunk, instead of lazily, use
  * `emits(start until stopExclusive)`.
  *
  * @example {{{
  * scala> Stream.range(10, 20, 2).toList
  * res0: List[Int] = List(10, 12, 14, 16, 18)
  * }}}
  */
def range(start: Int, stopExclusive: Int, by: Int = 1): Stream[Pure,Int] =
  unfold(start){ i =>
    if ((by > 0 && i < stopExclusive && start < stopExclusive) ||
        (by < 0 && i > stopExclusive && start > stopExclusive))
      Some((i, i + by))
    else None
  }
You can see that there is no additional wrapping here, only the integers that get emitted as part of the range. Stream.emits, on the other hand, creates an Append object for every element in the sequence, with left holding everything accumulated so far and right holding the current value.
Is this a bug? I would say no, but I would definitely raise it as a performance issue with the fs2 library maintainers.

Data-processing takes too long if pre-processing just before

So I've been trying to perform a cumsum operation on a data-set. I want to emphasize that the cumsum has to happen over partitions of my data-set (e.g. the cumsum of feature1 over time for personA).
I know how to do it, and it works perfectly "on its own" - I'll explain that part later. Here's the piece of code doing it:
// this DF contains all the data I need,
// with one column per possible value and only 1/0 in each line:
//   1 <-> the feature has the value
//   0 <-> the feature doesn't have the value
// this DF is the one I get after the one-hot operation
// (performed so I can apply ML algorithms to features
// that have multiple values simultaneously)
df_after_onehot.createOrReplaceTempView("test_table")

// @param values DataFrame containing all possible values, e.g. A, B, C
def cumSumForFeatures(values: DataFrame) = {
  values
    .map(value => "CAST(sum(" + value(0) + ") OVER (PARTITION BY person ORDER BY date) as Integer) as sum_" + value(0))
    .reduce(_ + ", " + _)
}

val req = "SELECT *, " + cumSumForFeatures(possible_segments) + " FROM test_table"
// val req = "SELECT * FROM test_table"
println("executing: " + req)

val data_after_cumsum = sqLContext.sql(req).orderBy("person", "date")
data_after_cumsum.show(10, false)
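For comparison, the same per-person cumulative sums can also be expressed with the DataFrame window API instead of building a SQL string; a rough sketch, assuming the same df_after_onehot and possible_segments as above:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

// same PARTITION BY person ORDER BY date window as in the SQL request
val w = Window.partitionBy("person").orderBy("date")
val data_after_cumsum_alt = possible_segments.collect().foldLeft(df_after_onehot) { (df, row) =>
  val feature = row.get(0).toString
  df.withColumn("sum_" + feature, sum(col(feature)).over(w).cast("integer"))
}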
The problem happens when I try to perform the same operation with some pre-processing before (like the one-hot operation, or adding computed features before). I tried with a very small dataset and it doesn't work.
Here is the printed stack trace (the part that should interest you, at least):
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
[Executor task launch worker-3] ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-3,5,main]
java.lang.OutOfMemoryError: GC overhead limit exceeded
So it seems to be related to a GC issue / JVM heap size? I just don't understand how it is related to my pre-processing.
I tried an unpersist operation on the no-longer-used DFs.
I tried modifying the options on my machine (e.g. -Xmx2048m).
The issue is the same once I deploy on AWS.
Extract of my pom.xml (for versions of Java, Spark, Scala):
<spark.version>2.1.0</spark.version>
<scala.version>2.10.4</scala.version>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
Would you know how I could fix my issue?
Thanks
From what I understand, there could be a few reasons for this:
the JVM's heap overflows because of DataFrames that are kept in memory but no longer used
the cumsum request may be too big to be processed with the small amount of RAM left
show/print operations increase the number of steps necessary for the job, and may interfere with Spark's internal optimizations
Considering that, I decided to "unpersist" the no-longer-used DataFrames. That did not seem to change much.
Then I removed all unnecessary show/print operations. That reduced the number of steps considerably.
I also changed my code to be more functional, but kept 3 separate values to help debugging. That did not change much, but my code is cleaner.
Finally, here is the thing that helped me deal with the problem. Instead of making my request go through the dataset in one pass, I partitioned the list of features into slices:
def listOfSlices[T](list: List[T], sizeOfSlices: Int): List[List[T]] =
  (for (i <- 0 until list.length by sizeOfSlices) yield list.slice(i, i + sizeOfSlices)).toList
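Incidentally, the standard library's grouped does the same slicing, so the helper could also be written as:
def listOfSlices[T](list: List[T], sizeOfSlices: Int): List[List[T]] =
  list.grouped(sizeOfSlices).toList // grouped returns an Iterator of slices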
I perform the request for each slice with a map operation, then join the results together to get my final DataFrame. That way, I kind of distribute the computation, and it seems to be more efficient.
val possible_features_slices = listOfSlices[String](possible_features, 5)

val df_cum_sum = possible_features_slices
  .map(possible_features_slice =>
    dfWithCumSum(sqLContext, my_df, possible_features_slice, "feature", "time")) // operation described in the original post
  .foldLeft[DataFrame](null)((a, b) => if (a == null) b else if (b == null) a else a.join(b, Seq("person", "list_features", "time")))
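Since the list of slices is never empty here, the null seed in the foldLeft can be avoided with a plain reduce; a small sketch of the same join:
val df_cum_sum_alt = possible_features_slices
  .map(slice => dfWithCumSum(sqLContext, my_df, slice, "feature", "time"))
  .reduce((a, b) => a.join(b, Seq("person", "list_features", "time"))) // safe because the list is non-empty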
I just really want to emphasize that I still don't understand the reason behind my problem, and I'm still hoping for an answer on that point.

Common practice how to deal with Integer overflows?

What's the common practice to deal with Integer overflows like 999999*999999 (result > Integer.MAX_VALUE) from an Application Development Team point of view?
One could just make BigInt mandatory and prohibit the use of Integer, but is that a good/bad idea?
If it is extremely important that the integer not overflow, you can define your own overflow-catching operations, e.g.:
def +?+(i: Int, j: Int) = {
  val ans = i.toLong + j.toLong
  if (ans < Int.MinValue || ans > Int.MaxValue) {
    throw new ArithmeticException("Int out of bounds")
  }
  ans.toInt
}
You may be able to use the enrich-your-library pattern to turn this into operators; if the JVM manages to do escape analysis properly, you won't get too much of a penalty for it:
class SafePlusInt(i: Int) {
  def +?+(j: Int) = { /* as before, except without the i param */ }
}
implicit def int_can_be_safe(i: Int) = new SafePlusInt(i)
For example:
scala> 1000000000 +?+ 1000000000
res0: Int = 2000000000
scala> 2000000000 +?+ 2000000000
java.lang.ArithmeticException: Int out of bounds
at SafePlusInt.$plus$qmark$plus(<console>:12)
...
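As an aside, since Java 8 the JDK also provides exact arithmetic helpers that throw ArithmeticException on overflow, so the hand-rolled check above can be delegated:
def safeAdd(i: Int, j: Int): Int = Math.addExact(i, j)           // throws ArithmeticException on overflow
def safeMultiply(i: Int, j: Int): Int = Math.multiplyExact(i, j) // likewise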
If it is not extremely important, then standard unit testing and code reviews and such should catch the problem in the large majority of cases. Using BigInt is possible, but will slow your arithmetic down by a factor of 100 or so, and won't help you when you have to use an existing method that takes an Int.
By far the most common practice regarding integer overflows is that programmers are expected to know that the issue exists, to watch for cases where overflows might happen, and to make the appropriate checks or rearrange the math so that they can't happen - for example doing a * (b / c) rather than (a * b) / c. If the project uses unit tests, they will include cases that try to force overflows to happen.
I have never worked on or seen code from a team that required more than that, so I'm going to say that's good enough for almost all software.
The one embedded application I've seen that actually, honest-to-spaghetti-monster NEEDED to prevent overflows, they did it by proving that overflows weren't possible in each line where it looked like they might happen.
If you're using Scala (and based on the tag I'm assuming you are), one very generic solution is to write your library code against the scala.math.Integral type class:
def naturals[A](implicit f: Integral[A]) =
  Stream.iterate(f.one)(f.plus(_, f.one))
You can also use context bounds and Integral.Implicits for nicer syntax:
import scala.math.Integral.Implicits._
def squares[A: Integral] = naturals.map(n => n * n)
Now you can use these methods with either Int or Long or BigInt as needed, since instances of Integral exist for all of them:
scala> squares[Int].take(10).toList
res0: List[Int] = List(1, 4, 9, 16, 25, 36, 49, 64, 81, 100)
scala> squares[Long].take(10).toList
res0: List[Long] = List(1, 4, 9, 16, 25, 36, 49, 64, 81, 100)
scala> squares[BigInt].take(10).toList
res1: List[BigInt] = List(1, 4, 9, 16, 25, 36, 49, 64, 81, 100)
No need to change the library code: just use Long or BigInt where overflow is a concern and Int otherwise.
You will pay some penalty in terms of performance, but the genericity and the ability to defer the Int-or-BigInt decision may be worth it.
In addition to simple mindfulness, as noted by @mjfgates, there are a couple of practices that I always use when dealing with scaled-decimal (non-floating-point) real-world quantities. This may not be on point for your particular application - apologies in advance if not.
First, if there are multiple units of measure in use, values must always clearly identify what they are. This can be by naming convention, or by using a separate class for each unit of measure. I've always just used names - a suffix on every variable name. In addition to eliminating errors from confusion over the units, it encourages thinking about overflow because the measures are less likely to be thought of as just numbers.
Second, my most frequent source of overflow concern is usually rescaling - converting from one measure to another - when it requires a lot of significant digits. For example, the conversion factor from cm to inches is 0.393700787402. In order to avoid both overflow and loss of significant digits, you need to be careful to multiply and divide in the right order. I haven't done this in a long time, but I believe what you want is something like:
Add to Rational.scala, from The Book:
def rescale(i: Int): Int = {
  (i * (numer / denom)) + (i / denom * (numer % denom))
}
Then you get as results (shortened from a specs2 test):
val InchesToCm = new Rational(1000000000,393700787)
InchesToCm.rescale(393700787) must_== 1000000000
InchesToCm.rescale(1) must_== 2
This doesn't round, or deal with negative scaling factors.
A production implementation may want to factor out numer/denom and numer % denom.

For loop in scala without sequence?

So, while working my way through "Scala for the Impatient" I found myself wondering: Can you use a Scala for loop without a sequence?
For example, there is an exercise in the book that asks you to build a counter object that cannot be incremented past Integer.MAX_VALUE. In order to test my solution, I wrote the following code:
var c = new Counter
for( i <- 0 to Integer.MAX_VALUE ) c.increment()
This throws an error: sequences cannot contain more than Int.MaxValue elements.
It seems to me that this means Scala first allocates and populates a sequence object with the values 0 through Integer.MaxValue, and then runs a foreach loop over that sequence.
I realize that I could do this instead:
var c = new Counter
while(c.value < Integer.MAX_VALUE ) c.increment()
But is there any way to do a traditional C-style for loop with the for statement?
In fact, 0 to N does not actually populate anything with the integers from 0 to N. Instead it creates an instance of scala.collection.immutable.Range, which generates the integers on the fly as its methods are applied.
The error you ran into is only because you have to be able to fit the number of elements (whether they actually exist or not) into the positive part of an Int in order to maintain the contract for the length method. 1 to Int.MaxValue works fine, as does 0 until Int.MaxValue. And the latter is what your while loop is doing anyway (to includes the right endpoint, until omits it).
Anyway, since the Scala for is a very different (much more generic) creature than the C for, the short answer is no, you can't do exactly the same thing. But you can probably do what you want with for (though maybe not as fast as you want, since there is some performance penalty).
Wow, some nice technical answers for a simple question (which is good!) But in case anyone is just looking for a simple answer:
// start from 0, stop at 9 inclusive
for (i <- 0 until 10) {
  println("Hi " + i)
}

// or start from 0, stop at 9 inclusive
for (i <- 0 to 9) {
  println("Hi " + i)
}
As Rex pointed out, "to" includes the right endpoint, "until" omits it.
Yes and no, it depends on what you are asking for. If you're asking whether you can iterate over a sequence of integers without having to build that sequence first, then yes you can, for instance using streams:
def fromTo(from: Int, to: Int): Stream[Int] =
  if (from > to) {
    Stream.empty
  } else {
    // println("one more.") // uncomment to see when it is called
    Stream.cons(from, fromTo(from + 1, to))
  }
Then:
for(i <- fromTo(0, 5)) println(i)
Writing your own iterator by defining hasNext and next is another option.
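For instance, a minimal hand-rolled iterator could look like this (a sketch):
def countFromTo(from: Int, to: Int): Iterator[Int] = new Iterator[Int] {
  private var i = from
  def hasNext: Boolean = i <= to
  def next(): Int = { val r = i; i += 1; r } // return the current value, then advance
}

for (i <- countFromTo(0, 5)) println(i)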
If you're asking whether you can use the 'for' syntax to write a "native" loop, i.e. a loop that works by incrementing some native integer rather than iterating over values produced by an instance of an object, then the answer is, as far as I know, no. As you may know, 'for' comprehensions are syntactic sugar for a combination of calls to flatMap, filter, map and/or foreach (all defined in the FilterMonadic trait), depending on the nesting of generators and their types. You can try to compile some loop and print its compiler intermediate representation with
scalac -Xprint:refchecks
to see how they are expanded.
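For example, a simple loop like the following is expanded into (roughly) a foreach call on a Range:
for (i <- 0 until 10) println("Hi " + i)

// is compiled to (roughly):
(0 until 10).foreach(i => println("Hi " + i))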
There's a bunch of these out there, but I can't be bothered googling them at the moment. The following is pretty canonical:
@scala.annotation.tailrec
def loop(from: Int, until: Int)(f: Int => Unit): Unit = {
  if (from < until) {
    f(from)
    loop(from + 1, until)(f)
  }
}

loop(0, 10) { i =>
  println("Hi " + i)
}