Why doesn't the JVM optimize a simple callback (in Scala)? - scala

I'm creating a Scala method that adds elements to an ArrayBuffer. I'm considering two approaches:
def addToArrayBuffer(b: ArrayBuffer[Int])
def addToArrayBuffer(cb: Int => Unit)
The first approach is a method that takes a collection and adds elements to it. The second approach is a method that takes a callback cb and calls it for every element I want to add to the collection.
The second approach is more flexible because I can transform or filter elements before adding them to the collection.
Unfortunately the second approach is slower (72 ops/s vs 57 ops/s):
Benchmark                                   Mode  Cnt   Score    Error  Units
TestBenchmark.addToArrayBufferDirectly     thrpt    9  72.808 ± 13.394  ops/s
TestBenchmark.addToArrayBufferViaCallback  thrpt    9  57.786 ±  3.532  ops/s
My question: why is the JVM unable to optimize the callback and achieve the same speed as adding directly to the collection? And how can I improve the speed?
I'm using Java version 1.8.0_162 on Mac. Here is the source of the benchmark:
package bench
import org.openjdk.jmh.annotations.{Benchmark, Fork, Measurement, Scope, State, Warmup}
import org.openjdk.jmh.infra.Blackhole
import scala.collection.mutable.ArrayBuffer
@State(Scope.Thread)
@Warmup(iterations = 5)
@Measurement(iterations = 3)
@Fork(3)
class TestBenchmark {
  val size = 1000000
  @Benchmark
  def addToArrayBufferDirectly(blackhole: Blackhole) = {
    def addToArrayBuffer(b: ArrayBuffer[Int]) = {
      var i = 0
      while (i < size) {
        b.append(i)
        i += 1
      }
    }
    val ab = new ArrayBuffer[Int](size)
    addToArrayBuffer(ab)
    blackhole.consume(ab)
  }
  @Benchmark
  def addToArrayBufferViaCallback(blackhole: Blackhole) = {
    def addToArrayBuffer(cb: Int => Unit) = {
      var i = 0
      while (i < size) {
        cb(i)
        i += 1
      }
    }
    val ab = new ArrayBuffer[Int](size)
    addToArrayBuffer(i => ab.append(i))
    blackhole.consume(ab)
  }
}

The Scala compiler can optimize this itself if you enable its inliner with these flags:
scalacOptions ++= Seq(
"-opt-inline-from:bench.**",
"-opt:l:inline"
)
No code changes are necessary. More about Scala inlining: https://www.lightbend.com/blog/scala-inliner-optimizer
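A method can also be marked as an inlining candidate explicitly. The following is a minimal sketch, assuming Scala 2.12 with the optimizer flags above enabled; without them the `@inline` annotation is only a hint and is ignored. The names `InlineSketch`, `addRange` and `fill` are made up for illustration and do not come from the benchmark.

```scala
import scala.collection.mutable.ArrayBuffer

object InlineSketch {
  // @inline asks the Scala optimizer to inline this method at its call sites.
  // It only takes effect when scalac runs with the -opt inliner flags.
  @inline final def addRange(size: Int)(cb: Int => Unit): Unit = {
    var i = 0
    while (i < size) {
      cb(i)
      i += 1
    }
  }

  // Hypothetical caller: once addRange is inlined here, the lambda body
  // can be inlined into the loop as well.
  def fill(size: Int): ArrayBuffer[Int] = {
    val ab = new ArrayBuffer[Int](size)
    addRange(size)(i => ab += i)
    ab
  }
}
```

Whether this closes the gap in a given benchmark still has to be measured; the annotation changes nothing about the result, only about how the call is compiled.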

Related

inner .par collection breaks outer ForkJoinTaskSupport in Scala

I want to make a parallel collection that uses a fixed number of threads.
The standard advice for this is to set tasksupport for the parallel collection to use a ForkJoinTaskSupport with a ForkJoinPool with a fixed number of threads. That works fine UNTIL the processing you are doing in your parallel collection itself uses a parallel collection. When this is the case it appears that the limit set for the ForkJoinPool goes away.
A simple test looks something like the following:
import java.util.concurrent.atomic.AtomicInteger
import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport
object InnerPar {
def forkJoinPoolIsSuccess(useInnerPar:Boolean): Boolean = {
val numTasks = 100
val numThreads = 10
// every thread in the outer collection will increment
// and decrement this counter as it starts and exits
val threadCounter = new AtomicInteger(0)
// function that returns the thread count when we first
// started running and creates an inner parallel collection
def incrementAndCountThreads(idx:Int):Int = {
val otherThreadsRunning:Int = threadCounter.getAndAdd(1)
if (useInnerPar) {
(0 until 20).toSeq.par.map { elem => elem + 1 }
}
Thread.sleep(10)
threadCounter.getAndAdd(-1)
otherThreadsRunning + 1
}
// create parallel collection using a ForkJoinPool with numThreads
val parCollection = (0 until numTasks).toVector.par
parCollection.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(numThreads))
val threadCountLogList = parCollection.map { idx =>
incrementAndCountThreads(idx)
}
// the total number of threads running should not have exceeded
// numThreads at any point, similarly we hope that the number of
// simultaneously executing threads was close to numThreads at some point
val respectsNumThreadsCapSuccess = threadCountLogList.max <= numThreads
respectsNumThreadsCapSuccess
}
def main(args:Array[String]):Unit = {
val testConfigs = Seq(true, false, true, false)
testConfigs.foreach { useInnerPar =>
val isSuccess = forkJoinPoolIsSuccess(useInnerPar)
println(f"useInnerPar $useInnerPar%6s, success is $isSuccess%6s")
}
}
}
And from this we get the following output, showing that more than numThreads (in the example 10) threads are running simultaneously if we create a parallel collection inside of incrementAndCountThreads().
useInnerPar true, success is false
useInnerPar false, success is true
useInnerPar true, success is false
useInnerPar false, success is true
Also note that using a ForkJoinTaskSupport in the inner collection does not fix the problem. In other words you get the same results if you use the following code for the inner collection:
if (useInnerPar) {
val innerParCollection = (0 until 20).toVector.par
innerParCollection.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(3))
innerParCollection.map { elem => elem + 1 }
}
I'm using Scala 2.12.5 and Java OpenJDK 1.8.0_161-b14 on a Linux 3.10.0 x86_64 kernel.
Am I missing something? If not, is there a way to work around this?
Thanks!
The core issue is that in Java 8 the parallelism value passed to the ForkJoinPool constructor (numThreads here) is just a target, not a hard limit: the pool may start extra threads to compensate for workers that block. Java 9 added a ForkJoinPool constructor with a maximumPoolSize parameter, which should provide a hard limit for the number of threads in the pool and solve this problem directly. I don't know of a great way to solve this in Java 8.
See the following for more details:
https://github.com/scala/bug/issues/11036
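As a rough sketch of the Java 9+ route, the long ForkJoinPool constructor can be called from Scala as below. The object name `CappedPool` and the chosen values (keep-alive of 60 seconds, a saturate predicate that always returns true) are assumptions for illustration. Whether this fully fixes the nested .par case has not been verified here, and the JDK documentation notes that the maximum may be transiently exceeded.

```scala
import java.util.concurrent.{ForkJoinPool, TimeUnit}
import java.util.function.Predicate

object CappedPool {
  // Requires Java 9+: this 10-argument constructor does not exist on Java 8.
  def create(numThreads: Int): ForkJoinPool =
    new ForkJoinPool(
      numThreads,                                      // parallelism
      ForkJoinPool.defaultForkJoinWorkerThreadFactory, // thread factory
      null,                                            // no uncaught-exception handler
      false,                                           // asyncMode
      numThreads,                                      // corePoolSize
      numThreads,                                      // maximumPoolSize: the cap
      1,                                               // minimumRunnable
      new Predicate[ForkJoinPool] {                    // saturate: when the cap is hit,
        def test(p: ForkJoinPool): Boolean = true      // continue instead of throwing
      },
      60L,                                             // keepAliveTime for idle threads
      TimeUnit.SECONDS
    )
}
```

The pool would then be passed to the collection as in the question: parCollection.tasksupport = new ForkJoinTaskSupport(CappedPool.create(numThreads)).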

Scala how to decrease execution time

I have a method that generates UUIDs; the code is below:
def generate(number : Int): List[String] = {
List.fill(number)(Generators.randomBasedGenerator().generate().toString.replaceAll("-",""))
}
and I call it like this:
for(i <-0 to 100) {
val a = generate(1000000)
println(a)
}
But running the for loop above takes almost 8-9 minutes. Is there any way to reduce the execution time?
Note: I added the for loop here only for illustration; in the real application the generate method will be called thousands of times by other requests at the same time.
The problem is the List. Filling a List with 1,000,000 generated and processed elements is going to take time (and memory) because every one of those elements has to be materialized.
You can generate an infinite number of processed UUID strings instantly if you don't have to materialize them until they are actually needed.
def genUUID :Stream[String] = Stream.continually {
Generators.randomBasedGenerator().generate().toString.filterNot(_ == '-')
}
val next5 = genUUID.take(5) //only the 1st (head) is materialized
next5.length //now all 5 are materialized
You can use Stream or Iterator for the infinite collection, whichever you find most conducive (or least annoying) to your work flow.
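The same idea with an Iterator, as a minimal sketch. To keep it self-contained it uses java.util.UUID.randomUUID() from the JDK instead of the java-uuid-generator library from the question; that swap, and the names `LazyIds` and `ids`, are assumptions for illustration.

```scala
import java.util.UUID

object LazyIds {
  // Nothing is generated until the caller pulls an element from the iterator.
  def ids: Iterator[String] =
    Iterator.continually(UUID.randomUUID().toString.filterNot(_ == '-'))
}
```

A caller takes only what it needs, for example LazyIds.ids.take(5).toList, and only those five UUIDs are ever created.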
Basically, you are not using the fastest implementation. The generator is much faster when you pass a Random to it: Generators.randomBasedGenerator(new Random(System.currentTimeMillis())). Here is what I changed:
Used Array instead of List (Array is faster)
Removed the string replacing, to measure the pure performance of generation
Dependency: "com.fasterxml.uuid" % "java-uuid-generator" % "3.1.5"
Result:
Generators.randomBasedGenerator(). Per iteration: 1579.6 ms
Generators.randomBasedGenerator() with passing Random Per iteration: 59.2 ms
Code:
import java.util.{Random, UUID}
import com.fasterxml.uuid.impl.RandomBasedGenerator
import com.fasterxml.uuid.{Generators, NoArgGenerator}
import org.scalatest.{FunSuiteLike, Matchers}
import scala.concurrent.duration.Deadline
class GeneratorTest extends FunSuiteLike
with Matchers {
val nTimes = 10
// Let's use Array instead of List - Array is faster!
// and use pure UUID generators
def generate(uuidGen: NoArgGenerator, number: Int): Seq[UUID] = {
Array.fill(number)(uuidGen.generate())
}
test("Generators.randomBasedGenerator() without passed Random (secure one)") {
// Slow generator
val uuidGen = Generators.randomBasedGenerator()
// Warm up JVM
benchGeneration(uuidGen, 3)
val startTime = Deadline.now
benchGeneration(uuidGen, nTimes)
val endTime = Deadline.now
val perItermTimeMs = (endTime - startTime).toMillis / nTimes.toDouble
println(s"Generators.randomBasedGenerator(). Per iteration: $perItermTimeMs ms")
}
test("Generators.randomBasedGenerator() with passing Random (not secure)") {
// Fast generator
val uuidGen = Generators.randomBasedGenerator(new Random(System.currentTimeMillis()))
// Warm up JVM
benchGeneration(uuidGen, 3)
val startTime = Deadline.now
benchGeneration(uuidGen, nTimes)
val endTime = Deadline.now
val perItermTimeMs = (endTime - startTime).toMillis / nTimes.toDouble
println(s"Generators.randomBasedGenerator() with passing Random Per iteration: $perItermTimeMs ms")
}
private def benchGeneration(uuidGen: RandomBasedGenerator, nTimes: Int) = {
var r: Long = 0
for (i <- 1 to nTimes) {
val a = generate(uuidGen, 1000000)
r += a.length
}
println(r)
}
}
You could use scala's parallel collections to split the load on multiple cores/threads.
You could also avoid creating a new generator every time:
class Generator {
val gen = Generators.randomBasedGenerator()
def generate(number : Int): List[String] = {
List.fill(number)(gen.generate().toString.replaceAll("-",""))
}
}

Scala: what does def actually do?

In Scala, def defines a function, but I don't understand the code below.
Ex.
def v = 10
What is the definition of v? Is v a variable, a function, or something else?
it's a function that always returns 10. in Java, the equivalent would be
public int v() { return 10; }
this might seem pointless, but the difference is real, and sometimes importantly useful. for example, suppose i define a trait like this:
trait Wrench {
val size = 14 //millimeters, the default, most common size
}
if i need a different size wrench, i can refine the trait
val bigWrench = new Wrench {
override val size = 21
}
but what if I want an adjustable wrench?
// mutable! not thread safe!
class AdjustableWrench extends Wrench {
var adjustment = 0
override val size = 14 + (3 * adjustment) // oops!
def adjust( turns : Int ) : Unit = {
adjustment += turns
}
}
this won't work! size will always be 14!
if I had defined my trait originally as
trait Wrench {
def size = 14 //millimeters, the default, most common size
}
i'd be able to define bigWrench exactly as I did above, because a val can override a def. but now i can write a functional adjustable wrench too:
// mutable! not thread safe!
class AdjustableWrench extends Wrench {
var adjustment = 0
override def size = 14 + (3 * adjustment) // this works
def adjust( turns : Int ) : Unit = {
adjustment += turns
}
}
by originally defining size as a def, rather than a val in the base trait, even though it looked dumb, I preserved the flexibility to override with def or val. it's quite common to define a base trait with very simple defaults, but where implementations might want to do something more complicated. so statements like
def v = 10
are not at all rare.
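the two overrides can be put side by side in one runnable sketch. the names `Wrenches` and `FrozenWrench` are made up for illustration; `FrozenWrench` is the broken val version from above, renamed so both fit in one file.

```scala
object Wrenches {
  trait Wrench {
    def size: Int = 14 // millimeters, the default
  }

  // val override: the right-hand side runs once, during construction,
  // while adjustment is still 0, so size stays 14 forever.
  class FrozenWrench extends Wrench {
    var adjustment = 0
    override val size: Int = 14 + (3 * adjustment)
    def adjust(turns: Int): Unit = adjustment += turns
  }

  // def override: the right-hand side runs again on every access.
  class AdjustableWrench extends Wrench {
    var adjustment = 0
    override def size: Int = 14 + (3 * adjustment)
    def adjust(turns: Int): Unit = adjustment += turns
  }
}
```

after adjust(2), the FrozenWrench still reports 14, while the AdjustableWrench reports 20.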
to get your head around the difference a bit more, compare these two:
def vDef = {
println("vDef")
10
}
and
val vVal = {
println("vVal")
10
}
both vDef and vVal will evaluate to 10 whenever you access them. but each time you access vDef, you will see the side effect, a printout of vDef. no matter how many times you access vVal, you will see vVal printed out just once.
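the same comparison can be made checkable by counting evaluations instead of printing. a small sketch; the object name `DefVsVal` and the two counters are made up for illustration.

```scala
object DefVsVal {
  var defEvaluations = 0
  var valEvaluations = 0

  // The body runs on every access.
  def vDef: Int = {
    defEvaluations += 1
    10
  }

  // The body runs once, when the enclosing object is initialized.
  val vVal: Int = {
    valEvaluations += 1
    10
  }
}
```

reading each member three times leaves defEvaluations at 3 and valEvaluations at 1.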
v is a function that always returns 10.
Equivalent code in Java would be:
public int v() {
return 10;
}
Also see 2nd chapter from "Programming in Scala" book.

ParSeq.fill running sequentially?

I am trying to initialize an array in Scala, using parallelization. However, when using the ParSeq.fill method, the performance doesn't seem to be any better than sequential initialization (Seq.fill). If I do the same task, but initialize the collection with map, then it is much faster.
To show my point, I set up the following example:
import scala.collection.parallel.immutable.ParSeq
import scala.util.Random
object Timer {
def apply[A](f: => A): (A, Long) = {
val s = System.nanoTime
val ret = f
(ret, System.nanoTime - s)
}
}
object ParallelBenchmark extends App {
def randomIsPrime: Boolean = {
val n = Random.nextInt(1000000)
n > 1 && !(2 until n).exists(i => n % i == 0) // prime = no divisor in [2, n)
}
val seqSize = 100000
val (_, timeSeq) = Timer { Seq.fill(seqSize)(randomIsPrime) }
println(f"Time Seq:\t\t $timeSeq")
val (_, timeParFill) = Timer { ParSeq.fill(seqSize)(randomIsPrime) }
println(f"Time Par Fill:\t $timeParFill")
val (_, timeParMap) = Timer { (0 until seqSize).par.map(_ => randomIsPrime) }
println(f"Time Par map:\t $timeParMap")
}
And the result is:
Time Seq: 32389215709
Time Par Fill: 32730035599
Time Par map: 17270448112
This clearly shows that the fill method is not running in parallel.
The parallel collections library in Scala can only parallelize existing collections; a parallel fill hasn't been implemented yet (and may never be). Using a Range to generate a cheap placeholder collection, as you did, is probably your best option if you want to see a speed boost.
Here's the underlying method being called by ParSeq.fill, obviously not parallel.
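The library source is not reproduced here. As a rough sketch of the general shape of a builder-based fill (an illustration written for this answer, not the actual library code, with the made-up name `SequentialFillSketch`), note that the by-name element is evaluated inside a plain while loop on the calling thread:

```scala
object SequentialFillSketch {
  // elem is by-name, so it is re-evaluated for each slot,
  // but always one at a time and on the thread that called fill.
  def fill[A](n: Int)(elem: => A): Vector[A] = {
    val b = Vector.newBuilder[A]
    b.sizeHint(n)
    var i = 0
    while (i < n) {
      b += elem
      i += 1
    }
    b.result()
  }
}
```

Filling with Thread.currentThread().getName yields a single distinct thread name, which is the behaviour the timings in the question point to.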

Efficient way to fold list in scala, while avoiding allocations and vars

I have a bunch of items in a list, and I need to analyze the content to find out how many of them are "complete". I started out with partition, but then realized that I didn't need two lists back, so I switched to a fold:
val counts = groupRows.foldLeft( (0,0) )( (pair, row) =>
if(row.time == 0) (pair._1+1,pair._2)
else (pair._1, pair._2+1)
)
but I have a lot of rows to go through for a lot of parallel users, and it is causing a lot of GC activity (assumption on my part...the GC could be from other things, but I suspect this since I understand it will allocate a new tuple on every item folded).
for the time being, I've rewritten this as
var complete = 0
var incomplete = 0
list.foreach(row => if(row.time != 0) complete += 1 else incomplete += 1)
which fixes the GC, but introduces vars.
I was wondering if there was a way of doing this without using vars while also not abusing the GC?
EDIT:
Hard call on the answers I've received. A var implementation seems to be considerably faster on large lists (like by 40%) than even a tail-recursive optimized version that is more functional but should be equivalent.
The first answer from dhg seems to be on-par with the performance of the tail-recursive one, implying that the size pass is super-efficient...in fact, when optimized it runs very slightly faster than the tail-recursive one on my hardware.
The cleanest two-pass solution is probably to just use the built-in count method:
val complete = groupRows.count(_.time == 0)
val counts = (complete, groupRows.size - complete)
But you can do it in one pass if you use partition on an iterator:
val (complete, incomplete) = groupRows.iterator.partition(_.time == 0)
val counts = (complete.size, incomplete.size)
This works because the new returned iterators are linked behind the scenes and calling next on one will cause it to move the original iterator forward until it finds a matching element, but it remembers the non-matching elements for the other iterator so that they don't need to be recomputed.
Example of the one-pass solution:
scala> val groupRows = List(Row(0), Row(1), Row(1), Row(0), Row(0)).view.map{x => println(x); x}
scala> val (complete, incomplete) = groupRows.iterator.partition(_.time == 0)
Row(0)
Row(1)
complete: Iterator[Row] = non-empty iterator
incomplete: Iterator[Row] = non-empty iterator
scala> val counts = (complete.size, incomplete.size)
Row(1)
Row(0)
Row(0)
counts: (Int, Int) = (3,2)
I see you've already accepted an answer, but you rightly mention that that solution will traverse the list twice. The way to do it efficiently is with recursion.
def counts(xs: List[...], complete: Int = 0, incomplete: Int = 0): (Int,Int) =
xs match {
case Nil => (complete, incomplete)
case row :: tail =>
if (row.time == 0) counts(tail, complete + 1, incomplete)
else counts(tail, complete, incomplete + 1)
}
This is effectively just a customized fold, except we use 2 accumulators which are just Ints (primitives) instead of tuples (reference types). It should also be just as efficient as a while-loop with vars - in fact, the bytecode should be identical.
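A runnable version of the same recursion, as a sketch. The element type is left open above, so the `Row` case class here is a hypothetical stand-in with only a time field, and `TailRecCounts` is a made-up wrapper name. The @tailrec annotation makes the compiler reject the method if it cannot turn the recursion into a loop.

```scala
import scala.annotation.tailrec

object TailRecCounts {
  // Hypothetical row type; only the field used for counting is modeled.
  final case class Row(time: Int)

  @tailrec
  def counts(xs: List[Row], complete: Int = 0, incomplete: Int = 0): (Int, Int) =
    xs match {
      case Nil => (complete, incomplete)
      case row :: tail =>
        if (row.time == 0) counts(tail, complete + 1, incomplete)
        else counts(tail, complete, incomplete + 1)
    }
}
```

Only one tuple is allocated, at the very end, and a list of a million rows does not overflow the stack.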
Maybe it's just me, but I prefer using the various specialized folds (.size, .exists, .sum, .product) if they are available. I find it clearer and less error-prone than the heavy-duty power of general folds.
val complete = groupRows.view.filter(_.time==0).size
(complete, groupRows.length - complete)
How about this one? No import tax.
import scala.collection.generic.CanBuildFrom
import scala.collection.Traversable
import scala.collection.mutable.Builder
case class Count(n: Int, total: Int) {
def not = total - n
}
object Count {
implicit def cbf[A]: CanBuildFrom[Traversable[A], Boolean, Count] = new CanBuildFrom[Traversable[A], Boolean, Count] {
def apply(): Builder[Boolean, Count] = new Counter
def apply(from: Traversable[A]): Builder[Boolean, Count] = apply()
}
}
class Counter extends Builder[Boolean, Count] {
var n = 0
var ttl = 0
override def +=(b: Boolean) = { if (b) n += 1; ttl += 1; this }
override def clear() { n = 0 ; ttl = 0 }
override def result = Count(n, ttl)
}
object Counting extends App {
val vs = List(4, 17, 12, 21, 9, 24, 11)
val res: Count = vs map (_ % 2 == 0)
Console println s"${vs} have ${res.n} evens out of ${res.total}; ${res.not} were odd."
val res2: Count = vs collect { case i if i % 2 == 0 => i > 10 }
Console println s"${vs} have ${res2.n} evens over 10 out of ${res2.total}; ${res2.not} were smaller."
}
OK, inspired by the answers above, but really wanting to only pass over the list once and avoid GC, I decided that, in the face of a lack of direct API support, I would add this to my central library code:
class RichList[T](private val theList: List[T]) {
def partitionCount(f: T => Boolean): (Int, Int) = {
var matched = 0
var unmatched = 0
theList.foreach(r => { if (f(r)) matched += 1 else unmatched += 1 })
(matched, unmatched)
}
}
object RichList {
implicit def apply[T](list: List[T]): RichList[T] = new RichList(list)
}
Then in my application code (if I've imported the implicit), I can write var-free expressions:
val (complete, incomplete) = groupRows.partitionCount(_.time != 0)
and get what I want: an optimized GC-friendly routine that prevents me from polluting the rest of the program with vars.
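The same enrichment can be sketched more compactly as an implicit value class, which typically avoids allocating a wrapper object per call. The names `ListSyntax` and `PartitionCountOps` are made up for illustration; the method body is unchanged from the version above.

```scala
object ListSyntax {
  // Implicit value class: adds partitionCount to any List[T] once
  // ListSyntax._ is imported.
  implicit class PartitionCountOps[T](private val xs: List[T]) extends AnyVal {
    def partitionCount(f: T => Boolean): (Int, Int) = {
      var matched = 0
      var unmatched = 0
      xs.foreach(x => if (f(x)) matched += 1 else unmatched += 1)
      (matched, unmatched)
    }
  }
}
```

Usage is the same as before: import ListSyntax._ and call groupRows.partitionCount(_.time != 0).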
However, I then saw Luigi's benchmark, and updated it to:
Use a longer list so that multiple passes on the list were more obvious in the numbers
Use a boolean function in all cases, so that we are comparing things fairly
http://pastebin.com/2XmrnrrB
The var implementation is definitely considerably faster, even though Luigi's routine should be identical (as one would expect with optimized tail recursion). Surprisingly, dhg's dual-pass original is just as fast (slightly faster if compiler optimization is on) as the tail-recursive one. I do not understand why.
It is slightly tidier to use a mutable accumulator pattern, like so, especially if you can re-use your accumulator:
case class Accum(var complete: Int = 0, var incomplete: Int = 0) {
def inc(compl: Boolean): this.type = {
if (compl) complete += 1 else incomplete += 1
this
}
}
val counts = groupRows.foldLeft( Accum() ){ (a, row) => a.inc( row.time == 0 ) }
If you really want to, you can hide your vars as private; if not, you still are a lot more self-contained than the pattern with vars.
You could just calculate it using the difference like so:
def counts(groupRows: List[Row]) = {
val complete = groupRows.foldLeft(0){ (acc, row) =>
if(row.time == 0) acc + 1 else acc
}
(complete, groupRows.length - complete)
}