inner .par collection breaks outer ForkJoinTaskSupport in Scala

I want to make a parallel collection that uses a fixed number of threads.
The standard advice for this is to set the parallel collection's tasksupport to a ForkJoinTaskSupport backed by a ForkJoinPool with a fixed number of threads. That works fine UNTIL the processing you are doing in your parallel collection itself uses a parallel collection. When that is the case, the limit set for the ForkJoinPool appears to go away.
A simple test looks something like the following:
import java.util.concurrent.atomic.AtomicInteger
import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

object InnerPar {
  def forkJoinPoolIsSuccess(useInnerPar: Boolean): Boolean = {
    val numTasks = 100
    val numThreads = 10
    // every thread in the outer collection will increment
    // and decrement this counter as it starts and exits
    val threadCounter = new AtomicInteger(0)

    // function that returns the thread count when we first
    // started running and creates an inner parallel collection
    def incrementAndCountThreads(idx: Int): Int = {
      val otherThreadsRunning: Int = threadCounter.getAndAdd(1)
      if (useInnerPar) {
        (0 until 20).toSeq.par.map { elem => elem + 1 }
      }
      Thread.sleep(10)
      threadCounter.getAndAdd(-1)
      otherThreadsRunning + 1
    }

    // create parallel collection using a ForkJoinPool with numThreads
    val parCollection = (0 until numTasks).toVector.par
    parCollection.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(numThreads))
    val threadCountLogList = parCollection.map { idx =>
      incrementAndCountThreads(idx)
    }

    // the total number of threads running should not have exceeded
    // numThreads at any point; similarly we hope that the number of
    // simultaneously executing threads was close to numThreads at some point
    val respectsNumThreadsCapSuccess = threadCountLogList.max <= numThreads
    respectsNumThreadsCapSuccess
  }

  def main(args: Array[String]): Unit = {
    val testConfigs = Seq(true, false, true, false)
    testConfigs.foreach { useInnerPar =>
      val isSuccess = forkJoinPoolIsSuccess(useInnerPar)
      println(f"useInnerPar $useInnerPar%6s, success is $isSuccess%6s")
    }
  }
}
And from this we get the following output, showing that more than numThreads (10 in this example) threads are running simultaneously when we create a parallel collection inside incrementAndCountThreads():
useInnerPar true, success is false
useInnerPar false, success is true
useInnerPar true, success is false
useInnerPar false, success is true
Also note that using a ForkJoinTaskSupport in the inner collection does not fix the problem. In other words you get the same results if you use the following code for the inner collection:
if (useInnerPar) {
  val innerParCollection = (0 until 20).toVector.par
  innerParCollection.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(3))
  innerParCollection.map { elem => elem + 1 }
}
I'm using Scala 2.12.5 and Java OpenJDK 1.8.0_161-b14 on a Linux 3.10.0 x86_64 kernel.
Am I missing something? If not is there a way to work around this?
Thanks!

The core issue is that in Java 8 the numThreads parameter passed to the ForkJoinPool is just a target, not a hard limit: when tasks block, the pool can spawn additional compensation threads. In Java 9 there is a maximumPoolSize parameter you can set, which provides a hard limit on the number of threads in the pool and should solve this problem directly. I don't know of a great way to solve this in Java 8.
See the following for more details:
https://github.com/scala/bug/issues/11036
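For reference, here is a rough sketch of what that could look like on Java 9 or later (my own illustration, not from the linked issue): the extended ForkJoinPool constructor takes a maximumPoolSize argument that acts as a hard cap on the total number of workers, including compensation threads.
import java.util.concurrent.{ForkJoinPool, TimeUnit}
import scala.collection.parallel.ForkJoinTaskSupport

// Requires Java 9+; all values other than numThreads are illustrative.
val numThreads = 10
val cappedPool = new ForkJoinPool(
  numThreads,                                      // parallelism (target)
  ForkJoinPool.defaultForkJoinWorkerThreadFactory, // default worker factory
  null,                                            // no uncaught-exception handler
  false,                                           // asyncMode (false = default LIFO scheduling)
  numThreads,                                      // corePoolSize
  numThreads,                                      // maximumPoolSize: the hard cap
  1,                                               // minimumRunnable
  null,                                            // saturate predicate (null = reject rather than exceed the cap)
  60, TimeUnit.SECONDS                             // keepAliveTime for unused threads
)

val parCollection = (0 until 100).toVector.par
parCollection.tasksupport = new ForkJoinTaskSupport(cappedPool)
Note that, as far as I understand the Javadoc, leaving the saturate predicate at null means the pool throws RejectedExecutionException rather than growing past the cap when blocked threads would otherwise need compensation, so this trades unbounded thread creation for possible rejections.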

Related

Why JVM doesn't optimize simple callback (in Scala)?

I'm creating a Scala method to add elements to an ArrayBuffer. I'm considering two approaches:
def addToArrayBuffer(b: ArrayBuffer[Int])
def addToArrayBuffer(cb: Int => Unit)
The first approach is a method that takes the collection and adds elements to it. The second approach is a method that takes a callback cb and calls it for every element I want to add to the collection.
The second approach is more flexible, because I can transform/filter elements before adding them to the collection.
Unfortunately, the second approach is slower (57 ops/s vs 72 ops/s):
Benchmark                                   Mode  Cnt   Score    Error  Units
TestBenchmark.addToArrayBufferDirectly     thrpt    9  72.808 ± 13.394  ops/s
TestBenchmark.addToArrayBufferViaCallback  thrpt    9  57.786 ±  3.532  ops/s
My question is: why is the JVM unable to optimize the callback and achieve the same speed as adding directly to the collection? And how can I improve the speed?
I'm using Java 1.8.0_162 on a Mac. Here is the source of the benchmark:
package bench

import org.openjdk.jmh.annotations.{Benchmark, Fork, Measurement, Scope, State, Warmup}
import org.openjdk.jmh.infra.Blackhole

import scala.collection.mutable.ArrayBuffer

@State(Scope.Thread)
@Warmup(iterations = 5)
@Measurement(iterations = 3)
@Fork(3)
class TestBenchmark {
  val size = 1000000

  @Benchmark
  def addToArrayBufferDirectly(blackhole: Blackhole) = {
    def addToArrayBuffer(b: ArrayBuffer[Int]) = {
      var i = 0
      while (i < size) {
        b.append(i)
        i += 1
      }
    }
    val ab = new ArrayBuffer[Int](size)
    addToArrayBuffer(ab)
    blackhole.consume(ab)
  }

  @Benchmark
  def addToArrayBufferViaCallback(blackhole: Blackhole) = {
    def addToArrayBuffer(cb: Int => Unit) = {
      var i = 0
      while (i < size) {
        cb(i)
        i += 1
      }
    }
    val ab = new ArrayBuffer[Int](size)
    addToArrayBuffer(i => ab.append(i))
    blackhole.consume(ab)
  }
}
It can be optimized by the Scala compiler by enabling the inliner flags:
scalacOptions ++= Seq(
  "-opt-inline-from:bench.**",
  "-opt:l:inline"
)
No changes to the code are necessary. More about Scala inlining: https://www.lightbend.com/blog/scala-inliner-optimizer
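For context, a minimal build.sbt sketch showing where those settings go (assuming sbt 1.x and a 2.12+ compiler with the new optimizer; the project layout is hypothetical):
// build.sbt (sketch)
lazy val bench = (project in file("."))
  .settings(
    scalacOptions ++= Seq(
      "-opt:l:inline",              // enable the inliner
      "-opt-inline-from:bench.**"   // allow inlining from the bench package
    )
  )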

Scala how to decrease execution time

I have a method that generates UUIDs, with code as below:
def generate(number: Int): List[String] = {
  List.fill(number)(Generators.randomBasedGenerator().generate().toString.replaceAll("-", ""))
}
and I call it like this:
for (i <- 0 to 100) {
  val a = generate(1000000)
  println(a)
}
Running the above for loop takes almost 8-9 minutes. Is there any other way to minimise the execution time?
Note: the for loop here is just for illustration; in the real situation the generate method will be called thousands of times at the same time, from other requests.
The problem is the List. Filling a List with 1,000,000 generated and processed elements is going to take time (and memory) because every one of those elements has to be materialized.
You can generate an infinite number of processed UUID strings instantly if you don't have to materialize them until they are actually needed.
def genUUID: Stream[String] = Stream.continually {
  Generators.randomBasedGenerator().generate().toString.filterNot(_ == '-')
}

val next5 = genUUID.take(5) // only the 1st (head) is materialized
next5.length // now all 5 are materialized
You can use Stream or Iterator for the infinite collection, whichever you find most conducive (or least annoying) to your work flow.
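For completeness, a sketch of the Iterator variant of the same idea (my own illustration, using the same Generators API as above); unlike Stream, an Iterator does not memoize what it has produced, which also keeps memory usage flat:
import com.fasterxml.uuid.Generators

// Nothing is generated until elements are actually pulled from the iterator.
val uuidIterator: Iterator[String] =
  Iterator.continually(Generators.randomBasedGenerator().generate().toString.filterNot(_ == '-'))

val next5 = uuidIterator.take(5).toList // only now are 5 UUIDs materialized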
Basically, you are not using the fastest implementation. You should use the one where you pass a Random to the constructor: Generators.randomBasedGenerator(new Random(System.currentTimeMillis())). I did the following:
Used Array instead of List (Array is faster)
Removed the string replacement, to measure the pure performance of generation
Dependency: "com.fasterxml.uuid" % "java-uuid-generator" % "3.1.5"
Result:
Generators.randomBasedGenerator(). Per iteration: 1579.6 ms
Generators.randomBasedGenerator() with passing Random Per iteration: 59.2 ms
Code:
import java.util.{Random, UUID}

import com.fasterxml.uuid.impl.RandomBasedGenerator
import com.fasterxml.uuid.{Generators, NoArgGenerator}
import org.scalatest.{FunSuiteLike, Matchers}

import scala.concurrent.duration.Deadline

class GeneratorTest extends FunSuiteLike with Matchers {

  val nTimes = 10

  // Let's use Array instead of List - Array is faster!
  // and use pure UUID generators
  def generate(uuidGen: NoArgGenerator, number: Int): Seq[UUID] = {
    Array.fill(number)(uuidGen.generate())
  }

  test("Generators.randomBasedGenerator() without passed Random (secure one)") {
    // Slow generator
    val uuidGen = Generators.randomBasedGenerator()
    // Warm up JVM
    benchGeneration(uuidGen, 3)
    val startTime = Deadline.now
    benchGeneration(uuidGen, nTimes)
    val endTime = Deadline.now
    val perItermTimeMs = (endTime - startTime).toMillis / nTimes.toDouble
    println(s"Generators.randomBasedGenerator(). Per iteration: $perItermTimeMs ms")
  }

  test("Generators.randomBasedGenerator() with passing Random (not secure)") {
    // Fast generator
    val uuidGen = Generators.randomBasedGenerator(new Random(System.currentTimeMillis()))
    // Warm up JVM
    benchGeneration(uuidGen, 3)
    val startTime = Deadline.now
    benchGeneration(uuidGen, nTimes)
    val endTime = Deadline.now
    val perItermTimeMs = (endTime - startTime).toMillis / nTimes.toDouble
    println(s"Generators.randomBasedGenerator() with passing Random Per iteration: $perItermTimeMs ms")
  }

  private def benchGeneration(uuidGen: RandomBasedGenerator, nTimes: Int) = {
    var r: Long = 0
    for (i <- 1 to nTimes) {
      val a = generate(uuidGen, 1000000)
      r += a.length
    }
    println(r)
  }
}
You could use Scala's parallel collections to split the load across multiple cores/threads.
You could also avoid creating a new generator every time:
class Generator {
  val gen = Generators.randomBasedGenerator()

  def generate(number: Int): List[String] = {
    List.fill(number)(gen.generate().toString.replaceAll("-", ""))
  }
}
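If you want to combine both suggestions, here is a rough sketch (mine, with a hypothetical chunks parameter, and assuming number is divisible by chunks) that reuses one generator per parallel chunk:
import com.fasterxml.uuid.Generators

def generatePar(number: Int,
                chunks: Int = Runtime.getRuntime.availableProcessors()): Vector[String] =
  (0 until chunks).par.flatMap { _ =>
    // one generator per chunk avoids re-creating it for every UUID
    val gen = Generators.randomBasedGenerator()
    Vector.fill(number / chunks)(gen.generate().toString.replaceAll("-", ""))
  }.toVector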

In Apache Spark, how to make an RDD/DataFrame operation lazy?

Assuming that I would like to write a function foo that transforms a DataFrame:
object Foo {
  def foo(source: DataFrame): DataFrame = {
    ...complex iterative algorithm with a stopping condition...
  }
}
Since the implementation of foo has many "Actions" (collect, reduce, etc.), calling foo immediately triggers the expensive execution.
This is not a big problem; however, since foo only converts one DataFrame into another, by convention it would be better to allow lazy execution: the implementation of foo should be executed only if the resulting DataFrame or its derivative(s) are actually used on the driver (through another "Action").
So far, the only way I know to reliably achieve this is by writing all the implementations into a SparkPlan and superimposing it onto the DataFrame's SparkExecution; this is very error-prone and involves a lot of boilerplate code. What is the recommended way to do this?
It is not exactly clear to me what you are trying to achieve, but Scala itself provides at least a few tools which you may find useful:
lazy vals:
val rdd = sc.range(0, 10000)
lazy val count = rdd.count // Nothing is executed here
// count: Long = <lazy>
count // count is evaluated only when it is actually used
// Long = 10000
call-by-name (denoted by => in the function definition):
def foo(first: => Long, second: => Long, takeFirst: Boolean): Long =
  if (takeFirst) first else second

val rdd1 = sc.range(0, 10000)
val rdd2 = sc.range(0, 10000)

foo(
  { println("first"); rdd1.count },
  { println("second"); rdd2.count },
  true // Only first will be evaluated
)
// first
// Long = 10000
Note: in practice you should create a local lazy binding to make sure that the arguments are not evaluated on every access.
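For example, a sketch of that note applied to the foo above:
def foo(first: => Long, second: => Long, takeFirst: Boolean): Long = {
  lazy val f = first   // evaluated at most once, and only if actually used
  lazy val s = second
  if (takeFirst) f else s
}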
infinite lazy collections like Stream
import org.apache.spark.mllib.random.RandomRDDs._

val initial = normalRDD(sc, 1000000L, 10)

// Infinite stream of RDDs and actions and nothing blows :)
val stream: Stream[RDD[Double]] = Stream(initial).append(
  stream.map {
    case rdd if !rdd.isEmpty =>
      val mu = rdd.mean
      rdd.filter(_ > mu)
    case _ => sc.emptyRDD[Double]
  }
)
Some subset of these should be more than enough to implement complex lazy computations.
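Applied to the original foo, a minimal sketch (my own; fooLazily is a hypothetical wrapper name) could look like this:
import org.apache.spark.sql.DataFrame

object Foo {
  def foo(source: DataFrame): DataFrame = {
    ??? // the expensive iterative algorithm with actions inside
  }

  // Returns a thunk; foo runs only when the result is first requested,
  // and at most once thanks to the lazy val.
  def fooLazily(source: DataFrame): () => DataFrame = {
    lazy val result = foo(source)
    () => result
  }
}

// val lazyResult = Foo.fooLazily(df)   // nothing expensive happens here
// val out = lazyResult()               // foo is executed on first use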

ParSeq.fill running sequentially?

I am trying to initialize an array in Scala using parallelization. However, when using the ParSeq.fill method, the performance doesn't seem to be any better than sequential initialization (Seq.fill). If I do the same task but initialize the collection with map, it is much faster.
To show my point, I set up the following example:
import scala.collection.parallel.immutable.ParSeq
import scala.util.Random

object Timer {
  def apply[A](f: => A): (A, Long) = {
    val s = System.nanoTime
    val ret = f
    (ret, System.nanoTime - s)
  }
}

object ParallelBenchmark extends App {
  def randomIsPrime: Boolean = {
    val n = Random.nextInt(1000000)
    (2 until n).exists(i => n % i == 0)
  }

  val seqSize = 100000

  val (_, timeSeq) = Timer { Seq.fill(seqSize)(randomIsPrime) }
  println(f"Time Seq:\t\t $timeSeq")

  val (_, timeParFill) = Timer { ParSeq.fill(seqSize)(randomIsPrime) }
  println(f"Time Par Fill:\t $timeParFill")

  val (_, timeParMap) = Timer { (0 until seqSize).par.map(_ => randomIsPrime) }
  println(f"Time Par map:\t $timeParMap")
}
And the result is:
Time Seq: 32389215709
Time Par Fill: 32730035599
Time Par map: 17270448112
Clearly showing that the fill method is not running in parallel.
The parallel collections library in Scala can only parallelize existing collections; fill hasn't been implemented yet (and may never be). Your method of using a Range to generate a cheap placeholder collection is probably your best option if you want to see a speed boost.
Here's the underlying method being called by ParSeq.fill, obviously not parallel.
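If you want to keep a fill-like call site, a small helper along these lines (my own sketch, not part of the standard library) wraps that Range trick:
import scala.collection.parallel.immutable.ParSeq

// Build a cheap parallel Range first, then evaluate the (expensive)
// element expression in parallel for each index.
def parFill[A](n: Int)(elem: => A): ParSeq[A] =
  (0 until n).par.map(_ => elem)

// e.g. val results = parFill(seqSize)(randomIsPrime)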

Filtering Scala's Parallel Collections with early abort when desired number of results found

Given a very large instance of collection.parallel.mutable.ParHashMap (or any other parallel collection), how can one abort a filtering parallel scan once a given number of matches, say 50, has been found?
Attempting to accumulate intermediate matches in a thread-safe "external" data structure or keeping an external AtomicInteger with result count seems to be 2 to 3 times slower on 4 cores than using a regular collection.mutable.HashMap and pegging a single core at 100%.
I am aware that find or exists on Par* collections do abort "on the inside". Is there a way to generalize this to find more than one result ?
Here's the code, which still seems to be 2 to 3 times slower on a ParHashMap with ~79,000 entries, and which also has the problem of stuffing more than maxResults results into the results CHM (probably because a thread can be preempted after incrementAndGet but before break, which allows other threads to add more elements). Update: it seems the slowdown is due to worker threads contending on counter.incrementAndGet(), which of course defeats the purpose of the whole parallel scan :-(
def find(filter: Node => Boolean, maxResults: Int): Iterable[Node] = {
  val counter = new AtomicInteger(0)
  val results = new ConcurrentHashMap[Key, Node](maxResults)
  import util.control.Breaks._
  breakable {
    for ((key, node) <- parHashMap if filter(node)) {
      results.put(key, node)
      val total = counter.incrementAndGet()
      if (total > maxResults) break
    }
  }
  results.values.toArray(new Array[Node](results.size))
}
I would first do a parallel scan in which the maxResults variable is thread-local. This would find up to (maxResults * numberOfThreads) results.
Then I would do a single-threaded pass to reduce it to maxResults.
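A rough sketch of that idea using aggregate (a hypothetical helper, names mine): each worker keeps its own bounded partial buffer, and the partial buffers are merged and trimmed to maxResults at the end. Note that this bounds memory and contention, but it does not abort the traversal early the way break does in the question's code.
import scala.collection.parallel.ParMap

def findBounded[K, Node](map: ParMap[K, Node],
                         filter: Node => Boolean,
                         maxResults: Int): Seq[Node] =
  map.aggregate(Vector.empty[Node])(
    // per-worker: stop accumulating once this partial buffer is full
    (acc, kv) => if (acc.size >= maxResults || !filter(kv._2)) acc else acc :+ kv._2,
    // merge partial buffers, never keeping more than maxResults
    (a, b) => (a ++ b).take(maxResults)
  )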
I performed an interesting investigation of your case.
Investigation reasoning
I suspected the problem was the mutability of the input Map, and I will try to explain why: the HashMap implementation organizes the data into different buckets, as one can see on Wikipedia.
The first thread-safe collections in Java, the synchronized collections, were based on synchronizing all methods around the underlying implementation, and resulted in poor performance. Further research and thinking led to the more performant concurrent collections, such as ConcurrentHashMap, whose approach is smarter: why not protect each bucket with its own lock?
According to my feeling, the performance problem occurs because:
when you run your filter in parallel, some threads conflict on accessing the same bucket at once and hit the same lock, because your map is mutable;
you hold a counter to see how many results you have, while you could simply check the size of your result. If you have a thread-safe way to build a collection, you don't need a thread-safe counter too.
Investigation result
I developed a test case and found out I was wrong. The problem is the concurrent nature of the output map. In fact, that is where the collision occurs: when you are putting elements into the map, rather than when you are iterating over it. Additionally, since you only want the values as the result, you don't need the keys, the hashing, and all the other map features. It might be interesting to test how your version performs if you remove the AtomicInteger and use only the result map to check whether you have collected enough elements.
Please be careful with the following code in Scala 2.9.2. I explain in another post why I need two different functions for the parallel and the non-parallel version: Calling map on a parallel collection via a reference to an ancestor type
import scala.collection.immutable.VectorBuilder
import scala.collection.parallel.ParMap
import scala.collection.parallel.immutable.{ParHashMap => ImmutableParHashMap}
import scala.collection.parallel.mutable.{ParHashMap => MutableParHashMap}

object MapPerformance {

  val size = 100000
  val items = Seq.tabulate(size)(x => (x, x * 2))

  val concurrentParallelMap = ImmutableParHashMap(items: _*)
  val concurrentMutableParallelMap = MutableParHashMap(items: _*)
  val unparallelMap = Map(items: _*)

  class ThreadSafeIndexedSeqBuilder[T](maxSize: Int) {
    val underlyingBuilder = new VectorBuilder[T]()
    var counter = 0

    def sizeHint(hint: Int) { underlyingBuilder.sizeHint(hint) }

    def +=(item: T): Boolean = {
      synchronized {
        if (counter >= maxSize)
          false
        else {
          underlyingBuilder += item
          counter += 1
          true
        }
      }
    }

    def result(): Vector[T] = underlyingBuilder.result()
  }

  def find(map: ParMap[Int, Int], filter: Int => Boolean, maxResults: Int): Iterable[Int] = {
    // we already know the maximum size
    val resultsBuilder = new ThreadSafeIndexedSeqBuilder[Int](maxResults)
    resultsBuilder.sizeHint(maxResults)
    import util.control.Breaks._
    breakable {
      for ((key, node) <- map if filter(node)) {
        val newItemAdded = resultsBuilder += node
        if (!newItemAdded)
          break()
      }
    }
    resultsBuilder.result().seq
  }

  def findUnParallel(map: Map[Int, Int], filter: Int => Boolean, maxResults: Int): Iterable[Int] = {
    // we already know the maximum size
    val resultsBuilder = Array.newBuilder[Int]
    resultsBuilder.sizeHint(maxResults)
    var counter = 0
    for {
      (key, node) <- map if filter(node)
      if counter < maxResults
    } {
      resultsBuilder += node
      counter += 1
    }
    resultsBuilder.result()
  }

  def measureTime[K](f: => K): (Long, K) = {
    val startMutable = System.currentTimeMillis()
    val result = f
    val endMutable = System.currentTimeMillis()
    (endMutable - startMutable, result)
  }

  def main(args: Array[String]) = {
    val maxResultSetting = 10
    (1 to 10).foreach { tryNumber =>
      println("Try number " + tryNumber)
      val (mutableTime, mutableResult) = measureTime(find(concurrentMutableParallelMap, _ % 2 == 0, maxResultSetting))
      val (immutableTime, immutableResult) = measureTime(find(concurrentParallelMap, _ % 2 == 0, maxResultSetting))
      val (unparallelTime, unparallelResult) = measureTime(findUnParallel(unparallelMap, _ % 2 == 0, maxResultSetting))
      assert(mutableResult.size == maxResultSetting)
      assert(immutableResult.size == maxResultSetting)
      assert(unparallelResult.size == maxResultSetting)
      println(" The mutable version has taken " + mutableTime + " milliseconds")
      println(" The immutable version has taken " + immutableTime + " milliseconds")
      println(" The unparallel version has taken " + unparallelTime + " milliseconds")
    }
  }
}
With this code, the parallel version (with both the mutable and the immutable input map) is consistently about 3.5 times faster than the unparallel one on my machine.
You could try to get an iterator and then create a lazy list (a Stream) where you filter (with your predicate) and take the number of elements you want. Because a Stream is non-strict, this 'taking' of elements is not evaluated until you force it.
Afterwards you can force the execution by adding ".par" to the whole thing and achieve parallelization.
Example code:
A parallelized map with random values (simulating your parallel hash map):
scala> myMap
res14: scala.collection.parallel.immutable.ParMap[Int,Int] = ParMap(66978401 -> -1331298976, 256964068 -> 126442706, 1698061835 -> 1622679396, -1556333580 -> -1737927220, 791194343 -> -591951714, -1907806173 -> 365922424, 1970481797 -> 162004380, -475841243 -> -445098544, -33856724 -> -1418863050, 1851826878 -> 64176692, 1797820893 -> 405915272, -1838192182 -> 1152824098, 1028423518 -> -2124589278, -670924872 -> 1056679706, 1530917115 -> 1265988738, -808655189 -> -1742792788, 873935965 -> 733748120, -1026980400 -> -163182914, 576661388 -> 900607992, -1950678599 -> -731236098)
Get an iterator, create a Stream from it, and filter the Stream.
In this case my predicate only accepts pairs whose value (the second member of the map entry) is even.
I want to get 10 even elements, so I take 10 elements, which will only get evaluated when I force the Stream:
scala> val mapIterator = myMap.toIterator
mapIterator: Iterator[(Int, Int)] = HashTrieIterator(20)
scala> val r = Stream.continually(mapIterator.next()).filter(_._2 % 2 == 0).take(10)
r: scala.collection.immutable.Stream[(Int, Int)] = Stream((66978401,-1331298976), ?)
Finally, I force the evaluation which only gets 10 elements as planned
scala> r.force
res16: scala.collection.immutable.Stream[(Int, Int)] = Stream((66978401,-1331298976), (256964068,126442706), (1698061835,1622679396), (-1556333580,-1737927220), (791194343,-591951714), (-1907806173,365922424), (1970481797,162004380), (-475841243,-445098544), (-33856724,-1418863050), (1851826878,64176692))
This way you only get the number of elements you want (without needing to process the remaining elements) and you parallelize the process without locks, atomics or breaks.
Please compare this to your solutions to see if it is any good.