Slow Performance When Using Scalaz Task - scala

I'm looking to improve the performance of a Monte Carlo simulation I am developing.
I first did an implementation which does the simulation of each paths sequentially as follows:
def simulate() = {
for (path <- 0 to 30000) {
(0 to 100).foreach(
x => // do some computation
)
}
}
This basically is simulating 30,000 paths and each path has 100 discretised random steps.
The above function runs very quickly on my machine (about 1s) for the calculation I am doing.
I then thought about speeding it up even further by making the code run in a multithreaded fashion.
I decided to use Task for this and I coded the following:
val simulation = (1 |-> 30000 ).map(n => Task {
(1 |-> 100).map(x => // do some computation)
})
I then use this as follows:
Task.gatherUnordered(simulation).run
When I kick this off, I know my machine is doing a lot of work as I can
see that in the activity monitor and also the machine fan is going ballistic.
After about two minutes of heavy activity on the machine, the work it seems
to be doing finishes but I don't get any value returned (I am expected a collection
of Doubles from each task that was processed).
My questions are:
Why does this take longer than the sequential example? I am more
than likely doing something wrong but I can't see it.
Why don't I get any returned collection of values from the tasks that are apparently being processed?

I'm not sure why Task.gatherUnordered is so slow, but if you change Task.gatherUnordered to Nondeterminism.gatherUnordered everything will be fine:
import scalaz.Nondeterminism
Nondeterminism[Task].gatherUnordered(simulation).run
I'm going to create an issue on Github about Task.gatherUnordered. This definitely should be fixed.

Related

using scastie and scalafiddle to evaluate code execution time

I'm trying to reduce execution time of some small scala routine, say, concatenation of strings, since I'm too lazy to setup local environment, I'm using online scala compilers, but found the comparison result differs between scastie and scalafiddle w/ the following code:
// routine 1
var startT1 = System.nanoTime()
(1 until 100 * 1000).foreach{ x=>
val sb = new StringBuilder("a")
sb.append("b").append("c").append("d").append("e").append("f")
}
println(System.nanoTime() - startT1)
// routine 2
var startT2 = System.nanoTime()
(1 until 100 * 1000).foreach{ x=>
val arr = Array[Char]('a', 'b', 'c', 'd', 'e', 'f')
}
println(System.nanoTime() - startT2)
In scalafiddle routine 1 is faster but in scastie routine 2 is faster.
I'v read this article https://medium.com/#otto.chrons/what-makes-scalafiddle-so-fast-9a3edf33ed4d, so it seems that scalafiddle actually runs JavaScript instead of scala. But the remaining question is, can I really use scastie for execution time benchmarks?
Short answer - NO I don't you can not rely on ANY online running tools like scastie and scalafiddle to verify performance.
Because, there is 1000 and more reasons why some benchmark will show X millis execution for some operation, and 99% of that reasons is running environment: hardware, operating system, CPU architecture, load on machine, used Scala compiler, used JVM etc. And we don't know environment change between runs on Scatie for instance, so you can get totally different numbers and don't know why, hence benchmark results won't be reliable.
If you would like to get some results, you would like rely one at least a bit, take a look at https://openjdk.java.net/projects/code-tools/jmh/ and it's sbt helper plugin https://github.com/ktoso/sbt-jmh and run known environment.
And along with posting benchmark results - please post environment details, where it was run.

how does building a big task computation compare to execute synchronously several steps?

I have the following two pieces of code written in Scala/Monix:
def f1(input) =
for {
a <- task1(input)
b <- task2(a)
c <- task3(b)
} yield (c).runSyncUnsafe
and
def f2(input) = {
val a = task1(input).runSyncUnsafe
val b = task2(a).runSyncUnsafe
task3(b).runSyncUnsafe
}
I think the version f1 is better as it fully async and it doesn't block threads and my assumption is that, if there are many tasks running, the first should perform better in multithreading.
I know I should write a test to compare the two implementations but it would require a lot of refactoring of the legacy code. Also the profiling of the two versions is not easy in our specific situation so I'm asking here first, hoping for an answer from somebody with a lot of Scala/Monix experience:
How should the two compare in terms of performance under heavy load? Is this a real concern or is it a non-issue?
As a general rule is better to stay async for as long as possible. So you could write f1 like this:
def f1(input) =
for {
a <- task1(input)
b <- task2(a)
c <- task3(b)
} yield c
The caller can then decide whether to call runSyncUnsafe or an async call (runAsync, runOnComplete) or flatMap it with another task. This removes the Unsafe call from your code and leaves it to the caller to decide whether to be safe or not.
As far as performance goes, the tasks will be evaluated sequentially either way because later tasks depend on the results of earlier tasks.

Spark running time in local changes a lot with "println"

A tricky problem happened related to the sharp increase of executing time.
I run my scala code in local spark, part of which is to build a n*n matrix.
When running a small dataset, it just takes 5s to finish. The most time-consuming part is to build 2000*2000 matrix. And this part is executed within map, which just deals with array data structure.
However, just out of curiosity, I add "println" within the matrix-building code to see the number of iterations. Suddenly, the whole running time increases to 1min23s.
And the final results are same.
I am new to Spark and have no idea what really causes this situation.
The codes are simply:
val x = someRDD.map(buildMatrix)
def buildMatrix(stringVect:Array[String]): Array[Array[Double]] = {
//var count = 0
val num = stringVect.length
var simi_matrix = Array[Array[Double]]()
for (i<- 0 until num-1){
for (j<- (i+1) until num){
"build the matrix with some computation"
//println(count)
//count += 1
}
}
}
TL;DR
This does not have to do anything with Spark. I/O access to the console is synchronized and costly. It will slow down any program on the JVM (Scala/Java/Clojure/...).
println defaults to java.lang.System.out which is a PrintStream. println delegates to PrintStream#println, hence entering the synchronized block of the println implementation to output to the console: There are two expenses:
Getting a synchronized lock
I/O to the console OutputStream
The slowdown observed is to be expected. Just don't use println in hot parts of the code (like a tight loop in this case).

scalaz-stream consume stream based on computed value

I've got two streams and I want to be able to consume only one based on a computation that I run every x seconds.
I think I basically need to create a third tick stream - something like every(3.seconds) - that does the computation and then come up with sort of a switch between the other two.
I'm kind of stuck here (and I've only just started fooling around with scalaz-stream).
Thanks!
There are several ways we can approach this problem. One way to approach it is using awakeEvery. For concrete example, see here.
To describe the example briefly, consider that we would like to query twitter in every 5 sec and get the tweets and perform sentiment analysis. We can compose this pipeline as follows:
val source =
awakeEvery(5 seconds) |> buildTwitterQuery(query) through queryChannel flatMap {
Process emitAll _
}
Note that the queryChannel can be stated as follows.
def statusTask(query: Query): Task[List[Status]] = Task {
twitterClient.search(query).getTweets.toList
}
val queryChannel: Channel[Task, Query, List[Status]] = channel lift statusTask
Let me know if you have any question. As stated earlier, for the complete example, see this.
I hope it helps!

Efficiency/scalability of parallel collections in Scala (graphs)

So I've been working with parallel collections in Scala for a graph project I'm working on, I've got the basics of the graph class defined, it is currently using a scala.collection.mutable.HashMap where the key is Int and the value is ListBuffer[Int] (adjacency list). (EDIT: This has since been change to ArrayBuffer[Int]
I had done a similar thing a few months ago in C++, with a std::vector<int, std::vector<int> >.
What I'm trying to do now is run a metric between all pairs of vertices in the graph, so in C++ I did something like this:
// myVec = std::vector<int> of vertices
for (std::vector<int>::iterator iter = myVec.begin(); iter != myVec.end(); ++iter) {
for (std::vector<int>::iterator iter2 = myVec.begin();
iter2 != myVec.end(); ++iter2) {
/* Run algorithm between *iter and *iter2 */
}
}
I did the same thing in Scala, parallelized, (or tried to) by doing this:
// vertexList is a List[Int] (NOW CHANGED TO Array[Int] - see below)
vertexList.par.foreach(u =>
vertexList.foreach(v =>
/* Run algorithm between u and v */
)
)
The C++ version is clearly single-threaded, the Scala version has .par so it's using parallel collections and is multi-threaded on 8 cores (same machine). However, the C++ version processed 305,570 pairs in a span of roughly 3 days, whereas the Scala version so far has only processed 23,573 in 17 hours.
Assuming I did my math correctly, the single-threaded C++ version is roughly 3x faster than the Scala version. Is Scala really that much slower than C++, or am I completely mis-using Scala (I only recently started - I'm about 300 pages into Programming in Scala)?
Thanks!
-kstruct
EDIT To use a while loop, do I do something like..
// Where vertexList is an Array[Int]
vertexList.par.foreach(u =>
while (i <- 0 until vertexList.length) {
/* Run algorithm between u and vertexList(i) */
}
}
If you guys mean use a while loop for the entire thing, is there an equivalent of .par.foreach for whiles?
EDIT2 Wait a second, that code isn't even right - my bad. How would I parallelize this using while loops? If I have some var i that keeps track of the iteration, then wouldn't all threads be sharing that i?
From your comments, I see that your updating a shared mutable HashMap at the end of each algorithm run. And if you're randomizing your walks, a shared Random is also a contention point.
I recommend two changes:
Use .map and .flatMap to return an immutable collection instead of modifying a shared collection.
Use a ThreadLocalRandom (from either Akka or Java 7) to reduce contention on the random number generator
Check the rest of your algorithm for further possible contention points.
You may try running the inner loop in parallel, too. But without knowing your algorithm, it's hard to know if that will help or hurt. Fortunately, running all combinations of parallel and sequential collections is very simple; just switch out pVertexList and vertexList in the code below.
Something like this:
val pVertexList = vertexList.par
val allResult = for {
u <- pVertexList
v <- pVertexList
} yield {
/* Run algorithm between u and v */
((u -> v) -> result)
}
The value allResult will be a ParVector[((Int, Int), Int)]. You may call .toMap on it to convert that into a Map.
Why mutable? I don't think there's a good parallel mutable map on Scala 2.9.x -- particularly because just such a data structure was added to the upcoming Scala 2.10.
On the other hand... you have a List[Int]? Don't use that, use a Vector[Int]. Also, are you sure you aren't wasting time elsewhere, doing the conversions from your mutable maps and buffers into immutable lists? Scala data structures are different than C++'s so you might well be incurring in complexity problems elsewhere in the code.
Finally, I think dave might be onto something when he asks about contention. If you have contention, parallelism might well make things slower. How faster/slower does it run if you do not make it parallel? If making it not parallel makes it faster, then you most likely do have contention issues.
I'm not completely sure about it, but I think foreach loops in foreach loops are rather slow, because lots of objects get created. See: http://scala-programming-language.1934581.n4.nabble.com/for-loop-vs-while-loop-performance-td1935856.html
Try rewriting it using a while loop.
Also Lists are only efficient for head access, Arrays are probably faster.