In the SparkPi example that comes with the Spark distribution, is the reduce on the RDD executed in parallel (each slice calculating its own total), or not?
val count: Int = spark.sparkContext.parallelize(1 until n, slices).map { i =>
  val x = random * 2 - 1
  val y = random * 2 - 1
  if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
Yes, it is.
By default, this example will operate on 2 slices. As a result, your collection will be split into 2 parts. Then Spark will execute the map transformation and reduce action on each partition in parallel. Finally, Spark will merge the individual results into the final value.
You can observe 2 tasks in the console output if the example is executed using the default config.
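Conceptually, the computation is equivalent to the following sketch (illustrative only, not Spark's internal implementation), which makes the per-partition work explicit:
// Roughly equivalent formulation that makes the per-partition parallelism explicit.
// Each task reduces its own slice; the driver then combines the partial sums.
val count: Int = spark.sparkContext
  .parallelize(1 until n, slices)
  .mapPartitions { iter =>
    val localSum = iter.map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x * x + y * y < 1) 1 else 0
    }.sum
    Iterator(localSum)   // one partial result per partition
  }
  .reduce(_ + _)         // merge the per-partition totals into the final count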
This piece of Scala code mixes a view with a strict List in a for expression:
val list = List.range(1, 4)

def compute(n: Int) = {
  println("Computing " + n)
  n * 2
}

val view = for (n <- list.view; k <- List(1, 2)) yield compute(n)
val x = view(0)
The output is:
Computing 1
Computing 1
Computing 2
Computing 2
Computing 3
Computing 3
Computing 1
Computing 1
I expected only the last 2 "Computing 1" lines in the output. Why does it compute all the values eagerly? And why does it then recompute them again?
Arguably, access by index forces the view to be computed. Also, notice that you flatMap the list with something that is not lazy (the inner List(1, 2) is strict; k does not come from a view).
Compare the following:
// 0) Your example
val v0 = List.range(1, 4).view.flatMap(n => List(1,2).map(k => compute(n)))
v0(0) // Computing 1
// Computing 1
// Computing 2
// Computing 2
// Computing 3
// Computing 3
// Computing 1
// Computing 1
v0(0) // Computing 1
// Computing 1
// 1) Your example, but access by head and not by index
val v1 = List.range(1, 4).view.flatMap(n => List(1,2).map(k => compute(n)))
v1.head // Computing 1
// Computing 1
// 2) Do not mix views and strict lists
val v2 = List.range(1, 4).view.flatMap(n => List(1,2).view.map(k => compute(n)))
v2(0) // Computing 1
Regarding example 0, notice that views are not like streams: while streams do cache (memoize) their results, lazy views do not; they just compute lazily, i.e. by need, on each access. It also seems that indexed access on this flatMapped view requires computing the entire list (to build its index), and then another computation is needed to actually access the element by index.
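A minimal sketch of that difference, assuming Scala 2.12-style collections (where Stream is the memoizing lazy sequence and views are non-memoizing):
def compute(n: Int) = { println("Computing " + n); n * 2 }

// Stream memoizes: each element is computed at most once
val s = Stream.range(1, 4).map(compute)  // prints "Computing 1" (the head is eager)
s(2)                                     // prints "Computing 2", "Computing 3"
s(2)                                     // prints nothing; the results are cached

// A view does not memoize: elements are recomputed on every access
val v = List.range(1, 4).view.map(compute)  // prints nothing
v(2)                                        // prints "Computing 3"
v(2)                                        // prints "Computing 3" again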
You may ask why indexed access in example 2 does not compute the entire list. This requires an understanding of how things work underneath; in particular, the following excerpts show how the method calls in example 0 differ from those in example 2:
Example 0
java.lang.Exception scala.collection.SeqViewLike$FlatMapped.$anonfun$index$1(SeqViewLike.scala:75)
at scala.collection.SeqViewLike$FlatMapped.index(SeqViewLike.scala:74)
at scala.collection.SeqViewLike$FlatMapped.index$(SeqViewLike.scala:71)
at scala.collection.SeqViewLike$$anon$5.index$lzycompute(SeqViewLike.scala:197)
at scala.collection.SeqViewLike$$anon$5.index(SeqViewLike.scala:197)
at scala.collection.SeqViewLike$FlatMapped.length(SeqViewLike.scala:84)
at scala.collection.SeqViewLike$FlatMapped.length$(SeqViewLike.scala:84)
at scala.collection.SeqViewLike$$anon$5.length(SeqViewLike.scala:197)
at scala.collection.SeqViewLike$FlatMapped.apply(SeqViewLike.scala:86)
at scala.collection.SeqViewLike$FlatMapped.apply$(SeqViewLike.scala:85)
at scala.collection.SeqViewLike$$anon$5.apply(SeqViewLike.scala:197)
at scala.collection.immutable.List.foreach(List.scala:389)
Computing 1
Example 2
java.lang.Exception scala.runtime.java8.JFunction1$mcII$sp.apply(JFunction1$mcII$sp.java:12)
at scala.collection.SeqViewLike$Mapped.apply(SeqViewLike.scala:67)
at scala.collection.SeqViewLike$Mapped.apply$(SeqViewLike.scala:67)
at scala.collection.SeqViewLike$$anon$4.apply(SeqViewLike.scala:196)
at scala.collection.SeqViewLike$FlatMapped.apply(SeqViewLike.scala:88)
at scala.collection.SeqViewLike$FlatMapped.apply$(SeqViewLike.scala:85)
at scala.collection.SeqViewLike$$anon$5.apply(SeqViewLike.scala:197)
at scala.collection.immutable.List.foreach(List.scala:389)
Computing 1
In particular, you can see that example 0 results in a call to FlatMapped.length (which needs to evaluate the entire list).
view here is a SeqView[Int,Seq[_]], which is immutable and recomputes every item when iterated over.
You could access just the first element by explicitly using the .iterator:
# view.iterator.next
Computing 1
Computing 1
res11: Int = 2
Or explicitly make it a List (e.g. if you need to reuse many entries):
# val view2: List[Int] = view.toList
Computing 1
Computing 1
Computing 2
Computing 2
Computing 3
Computing 3
view2: List[Int] = List(2, 2, 4, 4, 6, 6)
# view2(0)
res13: Int = 2
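If you need the computed values more than once, another option (assuming the pre-2.13 collections API used above, where views have a force method) is to force the view into a strict collection once and then reuse it:
val forced = view.force   // evaluates the whole view once; each "Computing i" prints twice here
forced(0)                 // == 2; no recomputation, nothing is printed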
I am very naively trying to use Scala .par, and the result turns out to be slower than the non-parallel version, by quite a bit. What is the explanation for that?
Note: the question is not to make this faster, but to understand why this naive use of .par doesn't yield an immediate speed-up.
Note 2: timing method: I ran both methods with N = 10000. The first one returned in about 20s. The second one I killed after 3 minutes; not even close. If I let it run longer, I run into a Java heap space exception.
// rng is a Random instance defined outside this snippet (shared by both functions)
def pi_random(N: Long): Double = {
  val count = (0L until N * N)
    .map { _ =>
      val (x, y) = (rng.nextDouble(), rng.nextDouble())
      if (x*x + y*y <= 1) 1 else 0
    }
    .sum
  4 * count.toDouble / (N * N)
}

def pi_random_parallel(N: Long): Double = {
  val count = (0L until N * N)
    .par
    .map { _ =>
      val (x, y) = (rng.nextDouble(), rng.nextDouble())
      if (x*x + y*y <= 1) 1 else 0
    }
    .sum
  4 * count.toDouble / (N * N)
}
Hard to know for sure without doing some actual profiling, but I have two theories:
First, you may be losing some benefits of the Range class, specifically near-zero memory usage. When you do (0L until N * N), you create a Range object, which is lazy. It does not actually create any object holding every single number in the range. Neither does map, I think. And sum calculates and adds numbers one at a time, so also allocates barely any memory.
I'm not sure the same is all true about ParRange. Seems like it would have to allocate some amount per split, and after map is called, perhaps it might have to store some amount of intermediate results in memory as "neighboring" splits wait for the other to complete. Especially the heap space exception makes me think something like this is the case. So you'll lose a lot of time to GC and such.
Second, the calls to rng.nextDouble are probably by far the most expensive part of that inner function. But I believe both the Java and Scala Random classes are essentially single-threaded: they synchronize on shared internal state, so concurrent calls block each other. So you won't gain that much from parallelism anyway, and in fact lose some to overhead.
There is not enough work per task; the task granularity is too fine.
Creating each task requires some overhead:
Some object representing the task must be created
It must be ensured that only one thread executes one task at a time
In the case that some threads become idle, a work-stealing procedure must be invoked.
For N = 10000, you instantiate 100,000,000 tiny tasks. Each of those tasks does almost nothing: it generates two random numbers and performs some basic arithmetic and an if-branch. The overhead of creating a task is not comparable to the work that each task is doing.
The tasks must be much larger, so that each thread has enough work to do. Furthermore, it's probably faster if you make each RNG thread local, so that the threads can do their job in parallel, without permanently locking the default random number generator.
Here is an example:
import java.util.concurrent.ThreadLocalRandom

import scala.util.Random

def pi_random(N: Long): Double = {
  val rng = new Random
  val count = (0L until N * N)
    .map { _ =>
      val (x, y) = (rng.nextDouble(), rng.nextDouble())
      if (x*x + y*y <= 1) 1 else 0
    }
    .sum
  4 * count.toDouble / (N * N)
}

def pi_random_parallel(N: Long): Double = {
  val rng = new Random
  val count = (0L until N * N)
    .par
    .map { _ =>
      val (x, y) = (rng.nextDouble(), rng.nextDouble())
      if (x*x + y*y <= 1) 1 else 0
    }
    .sum
  4 * count.toDouble / (N * N)
}

def pi_random_properly(n: Long): Double = {
  val count = (0L until n).par.map { _ =>
    val rng = ThreadLocalRandom.current
    var sum = 0
    var idx = 0
    while (idx < n) {
      val (x, y) = (rng.nextDouble(), rng.nextDouble())
      if (x*x + y*y <= 1.0) sum += 1
      idx += 1
    }
    sum
  }.sum
  4 * count.toDouble / (n * n)
}
Here is a little demo and timings:
def measureTime[U](repeats: Long)(block: => U): Double = {
  val start = System.currentTimeMillis
  var iteration = 0
  while (iteration < repeats) {
    iteration += 1
    block
  }
  val end = System.currentTimeMillis
  (end - start).toDouble / repeats
}
// basic sanity check that all algos return roughly same result
println(pi_random(2000))
println(pi_random_parallel(2000))
println(pi_random_properly(2000))
// time comparison (N = 2k, 10 repetitions for each algorithm)
val N = 2000
val Reps = 10
println("Sequential: " + measureTime(Reps)(pi_random(N)))
println("Naive: " + measureTime(Reps)(pi_random_parallel(N)))
println("My proposal: " + measureTime(Reps)(pi_random_properly(N)))
Output:
3.141333
3.143418
3.14142
Sequential: 621.7
Naive: 3032.6
My proposal: 44.7
Now the parallel version is roughly an order of magnitude faster than the sequential version (result will obviously depend on the number of cores etc.).
I couldn't test it with N = 10000, because the naively parallelized version crashed everything with a "GC overhead limit exceeded" error, which also illustrates that the overhead of creating the tiny tasks is too large.
In my implementation, I've additionally written the inner loop as a plain while: you need only one counter in one register, and there is no need to create a huge collection by mapping over the range.
Edit: Replaced everything with ThreadLocalRandom; it now shouldn't matter whether your compiler version supports SAM or not, so it should work with earlier versions of 2.11 too.
Below I have a Scala example of a Spark fold action:
val rdd1 = sc.parallelize(List(1,2,3,4,5), 3)
rdd1.fold(5)(_ + _)
This produces the output 35. Can somebody explain in detail how this output gets computed?
Taken from the Scaladocs here (emphasis mine):
@param zeroValue the initial value for the accumulated result of each partition for the op operator, and also the initial value for the combine results from different partitions for the op operator - this will typically be the neutral element (e.g. Nil for list concatenation or 0 for summation)
The zeroValue is in your case added four times (one for each partition, plus one when combining the results from the partitions). So the result is:
(5 + 1) + (5 + 2 + 3) + (5 + 4 + 5) + 5 // (extra one for combining results)
zeroValue is added once for each partition (and once more when merging the partition results on the driver) and should be a neutral element - in the case of + it should be 0. The exact result will depend on the number of partitions, but up to that final driver-side zeroValue it is equivalent to:
rdd1.mapPartitions(iter => Iterator(iter.foldLeft(zeroValue)(_ + _))).reduce(_ + _)
so:
val rdd1 = sc.parallelize(List(1,2,3,4,5),3)
distributes data as:
scala> rdd1.glom.collect
res1: Array[Array[Int]] = Array(Array(1), Array(2, 3), Array(4, 5))
and a whole expression is equivalent to:
(5 + 1) + (5 + 2 + 3) + (5 + 4 + 5)
plus 5 for jobResult (the driver-side accumulator, which also starts at the zero value).
You know that Spark RDDs perform distributed computations.
So, this line here,
val rdd1 = sc.parallelize(List(1,2,3,4,5), 3)
tells Spark to create 3 partitions in this RDD, which lets it run the computation as 3 independent tasks in parallel.
Now, this line here,
rdd1.fold(5)(_ + _)
tells Spark to fold each of those partitions using 5 as the initial value, and then to fold the partition results again with 5 as the initial value.
A plain Scala equivalent can be written as:
val list = List(1, 2, 3, 4, 5)
val listOfList = list.grouped(2).toList
val listOfFolds = listOfList.map(l => l.fold(5)(_ + _))
val fold = listOfFolds.fold(5)(_ + _)
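For illustration, the intermediate values in this local equivalent work out as follows (grouped(2) splits the list slightly differently than Spark's 3 partitions above, but the final result is the same because + is associative and the zero value is added the same number of times):
// listOfList  == List(List(1, 2), List(3, 4), List(5))
// listOfFolds == List(5 + 1 + 2, 5 + 3 + 4, 5 + 5) == List(8, 12, 10)
// fold        == 5 + 8 + 12 + 10 == 35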
So... if you are using fold on RDDs you need to provide a zero value.
But then you will ask: why or when would someone use fold instead of reduce?
Your confusion lies in your perception of the zero value. The thing is that the zero value for RDD[T] does not depend only on the type T but also on the nature of the computation. So your zero value does not need to be 0.
Let's consider a simple example where we want to calculate the "largest number greater than 15", or 15 itself if no element exceeds it, in our RDD.
Can we do that using reduce? The answer is NO. But we can do it using fold.
val n15GT15 = rdd1.fold(15)({ case (acc, i) => Math.max(acc, i) })
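Worked through on the rdd1 above (partitioned as Array(1), Array(2, 3), Array(4, 5)), the zero value 15 flows through like this:
// per-partition folds:  max(15, 1) == 15;  max(max(15, 2), 3) == 15;  max(max(15, 4), 5) == 15
// combining step:       max(max(max(15, 15), 15), 15) == 15
// so n15GT15 == 15: no element exceeds 15, and the zero value itself is the answer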
Error:
ERROR TaskSetManager: Total size of serialized results of XXXX tasks (2.0 GB) is bigger than spark.driver.maxResultSize (2.0 GB)
Goal: Obtain recommendations for all users using the model, overlap them with each user's test data, and generate an overlap ratio.
I have built a recommendation model using Spark MLlib. I evaluate the overlap ratio of test data per user against recommended items per user and generate a mean overlap ratio.
def overlapRatio(model: MatrixFactorizationModel, test_data: org.apache.spark.rdd.RDD[Rating]): Double = {
  val testData: RDD[(Int, Iterable[Int])] = test_data.map(r => (r.user, r.product)).groupByKey

  val n = testData.count

  val recommendations: RDD[(Int, Array[Int])] = model.recommendProductsForUsers(20)
    .mapValues(_.map(r => r.product))

  val overlaps = testData.join(recommendations).map(x => {
    val moviesPerUserInRecs = x._2._2.toSet
    val moviesPerUserInTest = x._2._1.toSet
    val localHitRatio = moviesPerUserInRecs.intersect(moviesPerUserInTest)
    if (localHitRatio.size > 0) 1 else 0
  }).filter(x => x != 0).count

  var r = 0.0
  if (overlaps != 0)
    r = overlaps / n

  return r
}
But the problem here is that it ends up throwing the maxResultSize error above. In my Spark configuration I did the following to increase maxResultSize:
val conf = new SparkConf()
conf.set("spark.driver.maxResultSize", "6g")
But that didn't solve the problem. I increased it to almost the amount of memory I had allocated to the driver, yet the issue didn't get resolved. While the code was executing I kept an eye on my Spark job, and what I saw was a bit puzzling:
[Stage 281:==> (47807 + 100) / 1000000]15/12/01 12:27:03 ERROR TaskSetManager: Total size of serialized results of 47809 tasks (6.0 GB) is bigger than spark.driver.maxResultSize (6.0 GB)
At the above stage the code is executing the MatrixFactorization code in spark-mllib, recommendForAll, around line 277 (not exactly sure of the line number).
private def recommendForAll(
    rank: Int,
    srcFeatures: RDD[(Int, Array[Double])],
    dstFeatures: RDD[(Int, Array[Double])],
    num: Int): RDD[(Int, Array[(Int, Double)])] = {
  val srcBlocks = blockify(rank, srcFeatures)
  val dstBlocks = blockify(rank, dstFeatures)
  val ratings = srcBlocks.cartesian(dstBlocks).flatMap {
    case ((srcIds, srcFactors), (dstIds, dstFactors)) =>
      val m = srcIds.length
      val n = dstIds.length
      val ratings = srcFactors.transpose.multiply(dstFactors)
      val output = new Array[(Int, (Int, Double))](m * n)
      var k = 0
      ratings.foreachActive { (i, j, r) =>
        output(k) = (srcIds(i), (dstIds(j), r))
        k += 1
      }
      output.toSeq
  }
  ratings.topByKey(num)(Ordering.by(_._2))
}
The recommendForAll method gets called from the recommendProductsForUsers method.
But it looks like the method is spinning off 1M tasks. The data being fed in comes from 2000 part files, so I am confused how it started to spit out 1M tasks, and I think that might be the problem.
My question is how I can actually resolve this problem. Without using this approach it is really hard to calculate the overlap ratio or recall@K. This is on Spark 1.5 (Cloudera 5.5).
The 2GB problem is not new to the Spark community: https://issues.apache.org/jira/browse/SPARK-6235
Regarding a partition size greater than 2GB, try to repartition your RDD (myRdd.repartition(parallelism)) to a greater number of partitions (relative to your current level of parallelism), thus reducing each single partition's size.
Regarding the number of tasks spun up (and hence partitions created), my hypothesis is that it comes from the srcBlocks.cartesian(dstBlocks) call, which produces an output RDD with (number of srcBlocks partitions * number of dstBlocks partitions) partitions.
In this case, you might consider using the myRdd.coalesce(parallelism) API instead of repartition to avoid a shuffle (and the partition-serialization problems that come with it).
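As a rough sketch of those two suggestions (the RDD name and parallelism value are illustrative, not taken from the code above):
// myRdd stands in for whichever RDD has too-large or too-many partitions
val myRdd = sc.parallelize(1 to 1000000)
val parallelism = 200                      // illustrative value; tune to your cluster

// Option 1: repartition (full shuffle) into more, smaller partitions
val repartitioned = myRdd.repartition(parallelism)

// Option 2: coalesce (no shuffle by default) to merge an exploded partition
// count back down, e.g. after a cartesian product
val coalesced = myRdd.coalesce(parallelism)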
I'm relatively new to Spark and have tried running the SparkPi example on a standalone 12-core, three-machine cluster. What I'm failing to understand is that running this example with a single slice gives better performance than using 12 slices. The same was the case when I was using the parallelize function. The time scales almost linearly with each slice added. Please let me know if I'm doing anything wrong. The code snippet is given below:
import scala.math.random  // needed for the random function used below

val spark = new SparkContext("spark://telecom:7077", "SparkPi",
  System.getenv("SPARK_HOME"), List("target/scala-2.10/sparkpii_2.10-1.0.jar"))
val slices = 1
val n = 10000000 * slices
val count = spark.parallelize(1 to n, slices).map { i =>
  val x = random * 2 - 1
  val y = random * 2 - 1
  if (x * x + y * y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / n)
spark.stop()
Update: The problem was with the random function; since it is a synchronized method, it couldn't scale to multiple cores.
The random function used in the SparkPi example is a synchronized method and can't scale to multiple cores. It's an easy enough example to deploy on your cluster, but don't use it to check Spark's performance and scalability.
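One hedged way around the shared generator, without pulling in Spark-internal classes, is to give each partition its own Random inside mapPartitions (a sketch based on the question's snippet; spark, n and slices are as defined there):
import scala.util.Random

val count = spark.parallelize(1 to n, slices).mapPartitions { iter =>
  val rng = new Random()   // one generator per partition: no cross-thread contention
  iter.map { _ =>
    val x = rng.nextDouble() * 2 - 1
    val y = rng.nextDouble() * 2 - 1
    if (x * x + y * y < 1) 1 else 0
  }
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / n)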
As Ahsan mentioned in his answer, the problem was with 'scala.math.random'.
I have replaced it with 'org.apache.spark.util.random.XORShiftRandom', and now using multiple processors makes the Pi calculation run much faster.
Below is my code, which is a modified version of SparkPi example from Spark distribution:
// scalastyle:off println
package org.apache.spark.examples

import org.apache.spark.util.random.XORShiftRandom
import org.apache.spark._

/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Spark Pi").setMaster(args(0))
    val spark = new SparkContext(conf)
    val slices = if (args.length > 1) args(1).toInt else 2
    val n = math.min(100000000L * slices, Int.MaxValue).toInt // avoid overflow
    val rand = new XORShiftRandom()
    val count = spark.parallelize(1 until n, slices).map { i =>
      val x = rand.nextDouble * 2 - 1
      val y = rand.nextDouble * 2 - 1
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}
// scalastyle:on println
When I run the program above using one core with the parameters 'local[1] 16', it takes about 60 seconds on my laptop. The same program using 8 cores ('local[*] 16') takes 17 seconds.