Scala how to decrease execution time - scala

I have one method which generate UUID and code as below :
def generate(number : Int): List[String] = {
List.fill(number)(Generators.randomBasedGenerator().generate().toString.replaceAll("-",""))
}
and I called this as below :
for(i <-0 to 100) {
val a = generate(1000000)
println(a)
}
But for running the above for loop it take almost 8-9 minutes for execution, is there any other way to minimised execution time ?
Note: Here for understanding I added for loop but in real situation the generate method will call thousand of times from other request at same time.

The problem is the List. Filling a List with 1,000,000 generated and processed elements is going to take time (and memory) because every one of those elements has to be materialized.
You can generate an infinite number of processed UUID strings instantly if you don't have to materialize them until they are actually needed.
def genUUID :Stream[String] = Stream.continually {
Generators.randomBasedGenerator().generate().toString.filterNot(_ == '-')
}
val next5 = genUUID.take(5) //only the 1st (head) is materialized
next5.length //now all 5 are materialized
You can use Stream or Iterator for the infinite collection, whichever you find most conducive (or least annoying) to your work flow.

Basically you used not the fastest implementation. You should use that one when you pass Random to the constructor Generators.randomBasedGenerator(new Random(System.currentTimeMillis())). I did next things:
Use Array instead of List (Array is faster)
Removed string replacing, let's measure pure performance of generation
Dependency: "com.fasterxml.uuid" % "java-uuid-generator" % "3.1.5"
Result:
Generators.randomBasedGenerator(). Per iteration: 1579.6 ms
Generators.randomBasedGenerator() with passing Random Per iteration: 59.2 ms
Code:
import java.util.{Random, UUID}
import com.fasterxml.uuid.impl.RandomBasedGenerator
import com.fasterxml.uuid.{Generators, NoArgGenerator}
import org.scalatest.{FunSuiteLike, Matchers}
import scala.concurrent.duration.Deadline
class GeneratorTest extends FunSuiteLike
with Matchers {
val nTimes = 10
// Let use Array instead of List - Array is faster!
// and use pure UUID generators
def generate(uuidGen: NoArgGenerator, number: Int): Seq[UUID] = {
Array.fill(number)(uuidGen.generate())
}
test("Generators.randomBasedGenerator() without passed Random (secure one)") {
// Slow generator
val uuidGen = Generators.randomBasedGenerator()
// Warm up JVM
benchGeneration(uuidGen, 3)
val startTime = Deadline.now
benchGeneration(uuidGen, nTimes)
val endTime = Deadline.now
val perItermTimeMs = (endTime - startTime).toMillis / nTimes.toDouble
println(s"Generators.randomBasedGenerator(). Per iteration: $perItermTimeMs ms")
}
test("Generators.randomBasedGenerator() with passing Random (not secure)") {
// Fast generator
val uuidGen = Generators.randomBasedGenerator(new Random(System.currentTimeMillis()))
// Warm up JVM
benchGeneration(uuidGen, 3)
val startTime = Deadline.now
benchGeneration(uuidGen, nTimes)
val endTime = Deadline.now
val perItermTimeMs = (endTime - startTime).toMillis / nTimes.toDouble
println(s"Generators.randomBasedGenerator() with passing Random Per iteration: $perItermTimeMs ms")
}
private def benchGeneration(uuidGen: RandomBasedGenerator, nTimes: Int) = {
var r: Long = 0
for (i <- 1 to nTimes) {
val a = generate(uuidGen, 1000000)
r += a.length
}
println(r)
}
}

You could use scala's parallel collections to split the load on multiple cores/threads.
You could also avoid creating a new generator every time:
class Generator {
val gen = Generators.randomBasedGenerator()
def generate(number : Int): List[String] = {
List.fill(number)(gen.generate().toString.replaceAll("-",""))
}
}

Related

Why JVM doesn't optimize simple callback (in Scala)?

I'm creating Scala method to add elements into a ArrayBuffer. I'm thinking about 2 approaches:
def addToArrayBuffer(b: ArrayBuffer[Int])
def addToArrayBuffer(cb: Int => Unit)
The first approach is a method which gets collection and adds elements into it. The second approach is a method which gets callback cb and calls this callback for every element I want to add into collection.
The second approach is more flexible because I can transform/filter elements before adding them into collection.
Unfortunately the second approach is slower (72 ops/s vs 57 ops/s):
Benchmark Mode Cnt Score Error Units
TestBenchmark.addToArrayBufferDirectly thrpt 9 72.808 ? 13.394 ops/s
TestBenchmark.addToArrayBufferViaCallback thrpt 9 57.786 ? 3.532 ops/s
My question is why is JVM unable to optimize callback and achieve the same speed as direct adding into collection? And how can I improve speed?
I'm using java version 1.8.0_162 on Mac. Here is the source of benchmark:
package bench
import org.openjdk.jmh.annotations.{Benchmark, Fork, Measurement, Scope, State, Warmup}
import org.openjdk.jmh.infra.Blackhole
import scala.collection.mutable.ArrayBuffer
#State(Scope.Thread)
#Warmup(iterations = 5)
#Measurement(iterations = 3)
#Fork(3)
class TestBenchmark {
val size = 1000000
#Benchmark
def addToArrayBufferDirectly(blackhole: Blackhole) = {
def addToArrayBuffer(b: ArrayBuffer[Int]) = {
var i = 0
while (i < size) {
b.append(i)
i += 1
}
}
val ab = new ArrayBuffer[Int](size)
addToArrayBuffer(ab)
blackhole.consume(ab)
}
#Benchmark
def addToArrayBufferViaCallback(blackhole: Blackhole) = {
def addToArrayBuffer(cb: Int => Unit) = {
var i = 0
while (i < size) {
cb(i)
i += 1
}
}
val ab = new ArrayBuffer[Int](size)
addToArrayBuffer(i => ab.append(i))
blackhole.consume(ab)
}
}
It can be optimized by Scala compiler by using flags
scalacOptions ++= Seq(
"-opt-inline-from:bench.**",
"-opt:l:inline"
)
No changes in code are necessary. More about Scala inlining: https://www.lightbend.com/blog/scala-inliner-optimizer

Surprisingly slow of mutable.array.drop

I am a newbie in Scala, and when I am trying to profile my Scala code with YourKit, I have some surprising finding regarding the usage of array.drop.
Here is what I write:
...
val items = s.split(" +") // s is a string
...
val s1 = items.drop(2).mkString(" ")
...
In a 1 mins run of my code, YourKit told me that function call items.drop(2) takes around 11% of the total execution time..
Lexer.scala:33 scala.collection.mutable.ArrayOps$ofRef.drop(int) 1054 11%
This is really surprising to me, is there any internal memory copy that slow down the processing? If so, what is the best practice to optimize my simple code snippet? Thank you.
This is really surprising to me, is there any internal memory copy
that slow down the processing?
ArrayOps.drop internally calls IterableLike.slice, which allocates a builder that produces a new Array for each call:
override def slice(from: Int, until: Int): Repr = {
val lo = math.max(from, 0)
val hi = math.min(math.max(until, 0), length)
val elems = math.max(hi - lo, 0)
val b = newBuilder
b.sizeHint(elems)
var i = lo
while (i < hi) {
b += self(i)
i += 1
}
b.result()
}
You're seeing the cost of the iteration + allocation. You didn't specify how many times this happens and what's the size of the collection, but if it's large this could be time consuming.
One way of optimizing this is to generate a List[String] instead which simply iterates the collection and drops it's head element. Note this will occur an additional traversal of the Array[T] to create the list, so make sure to benchmark this to see you actually gain anything:
val items = s.split(" +").toList
val afterDrop = items.drop(2).mkString(" ")
Another possibility is to enrich Array[T] to include your own version of mkString which manually populates a StringBuilder:
object RichOps {
implicit class RichArray[T](val arr: Array[T]) extends AnyVal {
def mkStringWithIndex(start: Int, end: Int, separator: String): String = {
var idx = start
val stringBuilder = new StringBuilder(end - start)
while (idx < end) {
stringBuilder.append(arr(idx))
if (idx != end - 1) {
stringBuilder.append(separator)
}
idx += 1
}
stringBuilder.toString()
}
}
}
And now we have:
object Test {
def main(args: Array[String]): Unit = {
import RichOps._
val items = "hello everyone and welcome".split(" ")
println(items.mkStringWithIndex(2, items.length, " "))
}
Yields:
and welcome

Yield a result from a list of Scala Future

I am brand new to Scala Futures and I am working on some simple task.
I have the following function that returns a list of Future and all I want to do is to read the result (and block until all the future finished).
private def findAll(className: String): List[Future[Vector[ParseObject]]]= {
def find(query: ParseQuery[ParseObject], from: Int, limit: Int) = {
query.skip(from)
query.limit(limit)
Future(query.find().asScala.toVector)
}
val count = ParseQuery.getQuery(className).count()
val skip = 1000
val fromAndLimit = for (from <- 0 to count by skip) yield (from, if (from + skip < count) skip else count - from )
println("fromAndLimit: " + fromAndLimit)
(for((from, limit) <- fromAndLimit if limit > 0) yield find(ParseQuery.getQuery(className), from, limit)).toList
}
As appears, the function try to read all objects from Parse.com and return all the objects in one big Vector.
(code snippet is very appreciated; as right now I am not trying to learn Future, I just want a solution for this case).
If you want to compose futures to one you could call:
val compositeFuture: Future[List[Vector[ParseObject]]] = Future.sequence(findAll(???))
If you want to wait for completion, then:
val result: List[Vector[ParseObject]] = Await.result(compositeFuture, 1 minute)

ParSeq.fill running sequentially?

I am trying to initialize an array in Scala, using parallelization. However, when using ParSeq.fill method, the performance doesn't seem to be better any better than sequential initialization (Seq.fill). If I do the same task, but initializing the collection with map, then it is much faster.
To show my point, I set up the following example:
import scala.collection.parallel.immutable.ParSeq
import scala.util.Random
object Timer {
def apply[A](f: => A): (A, Long) = {
val s = System.nanoTime
val ret = f
(ret, System.nanoTime - s)
}
}
object ParallelBenchmark extends App {
def randomIsPrime: Boolean = {
val n = Random.nextInt(1000000)
(2 until n).exists(i => n % i == 0)
}
val seqSize = 100000
val (_, timeSeq) = Timer { Seq.fill(seqSize)(randomIsPrime) }
println(f"Time Seq:\t\t $timeSeq")
val (_, timeParFill) = Timer { ParSeq.fill(seqSize)(randomIsPrime) }
println(f"Time Par Fill:\t $timeParFill")
val (_, timeParMap) = Timer { (0 until seqSize).par.map(_ => randomIsPrime) }
println(f"Time Par map:\t $timeParMap")
}
And the result is:
Time Seq: 32389215709
Time Par Fill: 32730035599
Time Par map: 17270448112
Clearly showing that the fill method is not running in parallel.
The parallel collections library in Scala can only parallelize existing collections, fill hasn't been implemented yet (and may never be). Your method of using a Range to generate a cheap placeholder collection is probably your best option if you want to see a speed boost.
Here's the underlying method being called by ParSeq.fill, obviously not parallel.

Factorial calculation using Scala actors

How to compute the factorial using Scala actors ?
And would it prove more time efficient compared to for instance
def factorial(n: Int): BigInt = (BigInt(1) to BigInt(n)).par.product
Many Thanks.
Problem
You have to split up your input in partial products. This partial products can then be calculated in parallel. The partial products are then multiplied to get the final product.
This can be reduced to a broader class of problems: The so called Parallel prefix calculation. You can read up about it on Wikipedia.
Short version: When you calculate a*b*c*d with an associative operation _ * _, you can structure the calculation a*(b*(c*d)) or (a*b)*(c*d). With the second approach, you can then calculate a*b and c*d in parallel and then calculate the final result from these partial results. Of course you can do this recursively, when you have a bigger number of input values.
Solution
Disclaimer
This sounds a little bit like a homework assignment. So I will provide a solution that has two properties:
It contains a small bug
It shows how to solve parallel prefix in general, without solving the problem directly
So you can see how the solution should be structured, but no one can use it to cheat on her homework.
Solution in detail
First I need a few imports
import akka.event.Logging
import java.util.concurrent.TimeUnit
import scala.concurrent.duration.FiniteDuration
import akka.actor._
Then I create some helper classes for the communication between the actors
case class Calculate[T](values : Seq[T], segment : Int, parallelLimit : Int, fn : (T,T) => T)
trait CalculateResponse
case class CalculationResult[T](result : T, index : Int) extends CalculateResponse
case object Busy extends CalculateResponse
Instead of telling the receiver you are busy, the actor could also use the stash or implement its own queue for partial results. But in this case I think the sender shoudl decide how much parallel calculations are allowed.
Now I create the actor:
class ParallelPrefixActor[T] extends Actor {
val log = Logging(context.system, this)
val subCalculation = Props(classOf[ParallelPrefixActor[BigInt]])
val fanOut = 2
def receive = waitForCalculation
def waitForCalculation : Actor.Receive = {
case c : Calculate[T] =>
log.debug(s"Start calculation for ${c.values.length} values, segment nr. ${c.index}, from ${c.values.head} to ${c.values.last}")
if (c.values.length < c.parallelLimit) {
log.debug("Calculating result direct")
val result = c.values.reduceLeft(c.fn)
sender ! CalculationResult(result, c.index)
}else{
val groupSize: Int = Math.max(1, (c.values.length / fanOut) + Math.min(c.values.length % fanOut, 1))
log.debug(s"Splitting calculation for ${c.values.length} values up to ${fanOut} children, ${groupSize} elements each, limit ${c.parallelLimit}")
def segments=c.values.grouped(groupSize)
log.debug("Starting children")
segments.zipWithIndex.foreach{case (values, index) =>
context.actorOf(subCalculation) ! c.copy(values = values, index = index)
}
val partialResults: Vector[T] = segments.map(_.head).to[Vector]
log.debug(s"Waiting for ${partialResults.length} results (${partialResults.indices})")
context.become(waitForResults(segments.length, partialResults, c, sender), discardOld = true)
}
}
def waitForResults(outstandingResults : Int, partialResults : Vector[T], originalRequest : Calculate[T], originalSender : ActorRef) : Actor.Receive = {
case c : Calculate[_] => sender ! Busy
case r : CalculationResult[T] =>
log.debug(s"Putting result ${r.result} on position ${r.index} in ${partialResults.length}")
val updatedResults = partialResults.updated(r.index, r.result)
log.debug("Killing sub-worker")
sender ! PoisonPill
if (outstandingResults==1) {
log.debug("Calculating result from partial results")
val result = updatedResults.reduceLeft(originalRequest.fn)
originalSender ! CalculationResult(result, originalRequest.index)
context.become(waitForCalculation, discardOld = true)
}else{
log.debug(s"Still waiting for ${outstandingResults-1} results")
// For fanOut > 2 one could here already combine consecutive partial results
context.become(waitForResults(outstandingResults-1, updatedResults, originalRequest, originalSender), discardOld = true)
}
}
}
Optimizations
Using parallel prefix calculation is not optimal. The actors calculating the the product of the bigger numbers will do much more work than the actors calculating the product of the smaller numbers (e.g. when calculating 1 * ... * 100 , it is faster to calculate 1 * ... * 10 than 90 * ... * 100). So it might be a good idea to shuffle the numbers, so big numbers will be mixed with small numbers. This works in this case, because we use an commutative operation. Parallel prefix calculation in general only needs an associative operation to work.
Performance
In theory
Performance of the actor solution is worse than the "naive" solution (using parallel collections) for small amounts of data. The actor solution will shine, when you make complex calculations or distribute your calculation on specialized hardware (e.g. graphics card or FPGA) or on multiple machines. With the actor you can control, who does which calculation and you can even restart "hanging calculations". This can give a big speed up.
On a single machine, the actor solution might help when you have a non-uniform memory architecture. You could then organize the actors in a way that pins memory to a certain processor.
Some measurement
I did some real performance measurement using a Scala worksheet in IntelliJ IDEA.
First I set up the actor system:
// Setup the actor system
val system = ActorSystem("root")
// Start one calculation actor
val calculationStart = Props(classOf[ParallelPrefixActor[BigInt]])
val calcolon = system.actorOf(calculationStart, "Calcolon-BigInt")
val inbox = Inbox.create(system)
Then I defined a helper method to measure time:
// Helper function to measure time
def time[A] (id : String)(f: => A) = {
val start = System.nanoTime()
val result = f
val stop = System.nanoTime()
println(s"""Time for "${id}": ${(stop-start)*1e-6d}ms""")
result
}
And then I did some performance measurement:
// Test code
val limit = 10000
def testRange = (1 to limit).map(BigInt(_))
time("par product")(testRange.par.product)
val timeOut = FiniteDuration(240, TimeUnit.SECONDS)
inbox.send(calcolon, Calculate[BigInt]((1 to limit).map(BigInt(_)), 0, 10, _ * _))
time("actor product")(inbox.receive(timeOut))
time("par sum")(testRange.par.sum)
inbox.send(calcolon, Calculate[BigInt](testRange, 0, 5, _ + _))
time("actor sum")(inbox.receive(timeOut))
I got the following results
> Time for "par product": 134.38289ms
res0: scala.math.BigInt = 284625968091705451890641321211986889014805140170279923
079417999427441134000376444377299078675778477581588406214231752883004233994015
351873905242116138271617481982419982759241828925978789812425312059465996259867
065601615720360323979263287367170557419759620994797203461536981198970926112775
004841988454104755446424421365733030767036288258035489674611170973695786036701
910715127305872810411586405612811653853259684258259955846881464304255898366493
170592517172042765974074461334000541940524623034368691540594040662278282483715
120383221786446271838229238996389928272218797024593876938030946273322925705554
596900278752822425443480211275590191694254290289169072190970836905398737474524
833728995218023632827412170402680867692104515558405671725553720158521328290342
799898184493136...
Time for "actor product": 1310.217247ms
res2: Any = CalculationResult(28462596809170545189064132121198688901480514017027
992307941799942744113400037644437729907867577847758158840621423175288300423399
401535187390524211613827161748198241998275924182892597878981242531205946599625
986706560161572036032397926328736717055741975962099479720346153698119897092611
277500484198845410475544642442136573303076703628825803548967461117097369578603
670191071512730587281041158640561281165385325968425825995584688146430425589836
649317059251717204276597407446133400054194052462303436869154059404066227828248
371512038322178644627183822923899638992827221879702459387693803094627332292570
555459690027875282242544348021127559019169425429028916907219097083690539873747
452483372899521802363282741217040268086769210451555840567172555372015852132829
034279989818449...
> Time for "par sum": 6.488620999999999ms
res3: scala.math.BigInt = 50005000
> Time for "actor sum": 657.752832ms
res5: Any = CalculationResult(50005000,0)
You can easily see that the actor version is much slower than using parallel collections.