SparkPi running slow with more than 1 slice

SparkPi running slow with more than 1 slice - scala

Relatively new on spark and have tried running SparkPi example on a standalone 12 core three machine cluster. What I'm failing to understand is, that running this example with a single slice gives better performance as compared to using 12 slices. Same was the case when I was using parallelize function. The time is scaling almost linearly with adding each slice. Please let me know if I'm doing anything wrong. The code snippet is given below:
val spark = new SparkContext("spark://telecom:7077", "SparkPi",
System.getenv("SPARK_HOME"), List("target/scala-2.10/sparkpii_2.10-1.0.jar"))
val slices = 1
val n = 10000000 * slices
val count = spark.parallelize(1 to n, slices).map {
i =>
val x = random * 2 - 1
val y = random * 2 - 1
if (x * x + y * y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / n)
spark.stop()
Update: Problem was with random function, since it was a synchronized method, it couldn't scale to multiple cores.

The random function used in sparkpi example is a synchronized method and can't scale to multiple cores. It's an easy enough example to deploy on your cluster but don't use it to check Spark's performance and scalability.

As Ahsan mentioned in his answer, the problem was with 'scala.math.random'.
I have replaced it with 'org.apache.spark.util.random.XORShiftRandom', and now using multiple processors makes the Pi calculations to run much faster.
Below is my code, which is a modified version of SparkPi example from Spark distribution:
// scalastyle:off println
package org.apache.spark.examples
import org.apache.spark.util.random.XORShiftRandom
import org.apache.spark._
/** Computes an approximation to pi */
object SparkPi {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Spark Pi").setMaster(args(0))
val spark = new SparkContext(conf)
val slices = if (args.length > 1) args(1).toInt else 2
val n = math.min(100000000L * slices, Int.MaxValue).toInt // avoid overflow
val rand = new XORShiftRandom()
val count = spark.parallelize(1 until n, slices).map { i =>
val x = rand.nextDouble * 2 - 1
val y = rand.nextDouble * 2 - 1
if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / n)
spark.stop()
}
}
// scalastyle:on println
When I run the program above using one core with parameters 'local[1] 16' it takes about 60 seconds on my laptop. Same program using 8 cores ('local[*] 16') it takes 17 seconds.

Related

Optimizing Flink transformation

I have the following method that computes the probability of a value in a DataSet:
/**
* Compute the probabilities of each value on the given [[DataSet]]
*
* #param x single colum [[DataSet]]
* #return Sequence of probabilites for each value
*/
private[this] def probs(x: DataSet[Double]): Seq[Double] = {
val counts = x.groupBy(_.doubleValue)
.reduceGroup(_.size.toDouble)
.name("X Probs")
.collect
val total = counts.sum
counts.map(_ / total)
}
The problem is that when I submit my flink job, that uses this method, its causing flink to kill the job due to a task TimeOut. I am executing this method for each attribute on a DataSet with only 40.000 instances and 9 attributes.
Is there a way I could do this code more efficient?
After a few tries, I made it work with mapPartition, this method is part of a class InformationTheory, which does some computations to calculate Entropy, mutual information etc. So, for example, SymmetricalUncertainty is computed as this:
/**
* Computes 'symmetrical uncertainty' (SU) - a symmetric mutual information measure.
*
* It is defined as SU(X, y) = 2 * (IG(X|Y) / (H(X) + H(Y)))
*
* #param xy [[DataSet]] with two features
* #return SU value
*/
def symmetricalUncertainty(xy: DataSet[(Double, Double)]): Double = {
val su = xy.mapPartitionWith {
case in ⇒
val x = in map (_._2)
val y = in map (_._1)
val mu = mutualInformation(x, y)
val Hx = entropy(x)
val Hy = entropy(y)
Some(2 * mu / (Hx + Hy))
}
su.collect.head.head
}
With this, I can compute efficiently entropy, mutual information etc. The catch is, it only works with a level of parallelism of 1, the problem resides in mapPartition.
Is there a way I could do something similar to what I am doing here with SymmetricalUncertainty, but with whatever level of parallelism?

I finally did it, don't know if its the best solution, but its working with n levels of parallelism:
def symmetricalUncertainty(xy: DataSet[(Double, Double)]): Double = {
val su = xy.reduceGroup { in ⇒
val invec = in.toVector
val x = invec map (_._2)
val y = invec map (_._1)
val mu = mutualInformation(x, y)
val Hx = entropy(x)
val Hy = entropy(y)
2 * mu / (Hx + Hy)
}
su.collect.head
}
You can check the entire code at InformationTheory.scala, and its tests InformationTheorySpec.scala

Scala parallel collections

I am very naively trying to use Scala .par, and the result turns out to be slower than the non-parallel version, by quite a bit. What is the explanation for that?
Note: the question is not to make this faster, but to understand why this naive use of .par doesn't yield an immediate speed-up.
Note 2: timing method: I ran both methods with N = 10000. The first one returned in about 20s. The second one I killed after 3 minutes. Not even close. If I let it run longer I get into a Java heap space exception.
def pi_random(N: Long): Double = {
val count = (0L until N * N)
.map { _ =>
val (x, y) = (rng.nextDouble(), rng.nextDouble())
if (x*x + y*y <= 1) 1 else 0
}
.sum
4 * count.toDouble / (N * N)
}
def pi_random_parallel(N: Long): Double = {
val count = (0L until N * N)
.par
.map { _ =>
val (x, y) = (rng.nextDouble(), rng.nextDouble())
if (x*x + y*y <= 1) 1 else 0
}
.sum
4 * count.toDouble / (N * N)
}

Hard to know for sure without doing some actual profiling, but I have two theories:
First, you may be losing some benefits of the Range class, specifically near-zero memory usage. When you do (0L until N * N), you create a Range object, which is lazy. It does not actually create any object holding every single number in the range. Neither does map, I think. And sum calculates and adds numbers one at a time, so also allocates barely any memory.
I'm not sure the same is all true about ParRange. Seems like it would have to allocate some amount per split, and after map is called, perhaps it might have to store some amount of intermediate results in memory as "neighboring" splits wait for the other to complete. Especially the heap space exception makes me think something like this is the case. So you'll lose a lot of time to GC and such.
Second, probably the calls to rng.nextDouble are by far the most expensive part of that inner function. But I believe both java and scala Random classes are essentially single-threaded. They synchronize and block internally. So you won't gain that much from parallelism anyway, and in fact lose some to overhead.

There is not enough work per task, the task granularity is too fine-grained.
Creating each task requires some overhead:
Some object representing the task must be created
It must be ensured that only one thread executes one task at a time
In the case that some threads become idle, some job-stealing procedure must be invoked.
For N = 10000, you instantiate 100,000,000 tiny tasks. Each of those tasks does almost nothing: it generates two random numbers and performs some basic arithmetic and an if-branch. The overhead of creating a task is not comparable to the work that each task is doing.
The tasks must be much larger, so that each thread has enough work to do. Furthermore, it's probably faster if you make each RNG thread local, so that the threads can do their job in parallel, without permanently locking the default random number generator.
Here is an example:
import scala.util.Random
def pi_random(N: Long): Double = {
val rng = new Random
val count = (0L until N * N)
.map { _ =>
val (x, y) = (rng.nextDouble(), rng.nextDouble())
if (x*x + y*y <= 1) 1 else 0
}
.sum
4 * count.toDouble / (N * N)
}
def pi_random_parallel(N: Long): Double = {
val rng = new Random
val count = (0L until N * N)
.par
.map { _ =>
val (x, y) = (rng.nextDouble(), rng.nextDouble())
if (x*x + y*y <= 1) 1 else 0
}
.sum
4 * count.toDouble / (N * N)
}
def pi_random_properly(n: Long): Double = {
val count = (0L until n).par.map { _ =>
val rng = ThreadLocalRandom.current
var sum = 0
var idx = 0
while (idx < n) {
val (x, y) = (rng.nextDouble(), rng.nextDouble())
if (x*x + y*y <= 1.0) sum += 1
idx += 1
}
sum
}.sum
4 * count.toDouble / (n * n)
}
Here is a little demo and timings:
def measureTime[U](repeats: Long)(block: => U): Double = {
val start = System.currentTimeMillis
var iteration = 0
while (iteration < repeats) {
iteration += 1
block
}
val end = System.currentTimeMillis
(end - start).toDouble / repeats
}
// basic sanity check that all algos return roughly same result
println(pi_random(2000))
println(pi_random_parallel(2000))
println(pi_random_properly(2000))
// time comparison (N = 2k, 10 repetitions for each algorithm)
val N = 2000
val Reps = 10
println("Sequential: " + measureTime(Reps)(pi_random(N)))
println("Naive: " + measureTime(Reps)(pi_random_parallel(N)))
println("My proposal: " + measureTime(Reps)(pi_random_properly(N)))
Output:
3.141333
3.143418
3.14142
Sequential: 621.7
Naive: 3032.6
My version: 44.7
Now the parallel version is roughly an order of magnitude faster than the sequential version (result will obviously depend on the number of cores etc.).
I couldn't test it with N = 10000, because the naively parallelized version crashed everything with an "GC overhead exceeded"-error, which also illustrates that the overhead for creating the tiny tasks is too large.
In my implementation, I've additionaly unrolled the inner while: you need only one counter in one register, no need to create a huge collection by mapping on the range.
Edit: Replaced everything by ThreadLocalRandom, it now shouldn't matter whether your compiler versions supports SAM or not, so it should work with earlier versions of 2.11 too.

Spark SparkPi example

In the SparkPi example that comes with the distribution of Spark, is the reduce on the RDD, executed in parallel (each slice calculates its total), or not?
val count: Int = spark.sparkContext.parallelize(1 until n, slices).map { i =>
val x = random * 2 - 1
val y = random * 2 - 1
if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)

Yes, it is.
By default, this example will operate on 2 slices. As a result, your collection will be split into 2 parts. Then Spark will execute the map transformation and reduce action on each partition in parallel. Finally, Spark will merge the individual results into the final value.
You can observe 2 tasks in the console output if the example is executed using the default config.

Spark - LinearRegressionWithSGD on Coursera Machine Learning by Stanford University samples

Software Version: Apache Spark v1.3
Context: I've been trying to "translate" Octave/MATLAB code to Scala on Apache Spark. More precisely, I work on ex1data1.txt and ex1data2.txt from coursera practical part ex1. I've made such translation into Julia lang (it went smoothly) and now I've been struggling with Spark...without success.
Problem: Performance of my implementation on Spark is very poor. I cannot even say it works correctly. That's why for ex1data1.txt I added polynomial feature, and I also worked with: theta0 using setIntercept(true) and with extra non-normalized column of 1.0 values(in this case I set Intercept to false). I receive only silly results.
So, then I 've decided to start working with ex1data2.txt. Below you can find the code and the expected result. Of course Spark result is wrong.
Did you have similar experience? I will be grateful for your help.
The Scala code for the exd1data2.txt:
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.SquaredL2Updater
import org.apache.spark.mllib.regression.{LinearRegressionModel, LinearRegressionWithSGD, LabeledPoint}
import org.apache.spark.{SparkContext, SparkConf}
object MLibOnEx1data2 extends App {
val conf = new SparkConf()
conf.set("spark.app.name", "coursera ex1data2.txt test")
val sc = new SparkContext(conf)
val input = sc.textFile("hdfs:///ex1data2.txt")
val trainData = input.map { line =>
val parts = line.split(',')
val y = parts(2).toDouble
val features = Vectors.dense(parts(0).toDouble, parts(1).toDouble)
println(s"x = $features y = $y")
LabeledPoint(y, features)
}.cache()
// Building the model
val numIterations = 1500
val alpha = 0.01
// Scale the features
val scaler = new StandardScaler(withMean = true, withStd = true)
.fit(trainData.map(x => x.features))
val scaledTrainData = trainData.map{ td =>
val normFeatures = scaler.transform(td.features)
println(s"normalized features = $normFeatures")
LabeledPoint(td.label, normFeatures)
}.cache()
val tsize = scaledTrainData.count()
println(s"Training set size is $tsize")
val alg = new LinearRegressionWithSGD().setIntercept(true)
alg.optimizer
.setNumIterations(numIterations)
.setStepSize(alpha)
.setUpdater(new SquaredL2Updater)
.setRegParam(0.0) //regularization - off
val model = alg.run(scaledTrainData)
println(s"Theta is $model.weights")
val total1 = model.predict(scaler.transform(Vectors.dense(1650, 3)))
println(s"Estimate the price of a 1650 sq-ft, 3 br house = $total1 dollars") //it should give ~ $289314.620338
// Evaluate model on training examples and compute training error
val valuesAndPreds = scaledTrainData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
val MSE = ((valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()) / 2)
println("Training Mean Squared Error = " + MSE)
// Save and load model
val trySaveAndLoad = util.Try(model.save(sc, "myModelPath"))
.flatMap { _ => util.Try(LinearRegressionModel.load(sc, "myModelPath")) }
.getOrElse(-1)
println(s"trySaveAndLoad result is $trySaveAndLoad")
}
STDOUT result is:
Training set size is 47
Theta is (weights=[52090.291641275864,19342.034885388926],
intercept=181295.93717032953).weights
Estimate the price of a 1650 sq-ft, 3 br house = 153983.5541846754
dollars
Training Mean Squared Error = 1.5876093757127676E10
trySaveAndLoad result is -1

Well, after some digging I believe there is nothing here. First I saved content of the valuesAndPreds to text file:
valuesAndPreds.map{
case {x, y} => s"$x,$y"}.repartition(1).saveAsTextFile("results.txt")'
Rest of the code is written in R.
First lets create a model using closed form solution:
# Load data
df <- read.csv('results.txt/ex1data2.txt', header=FALSE)
# Scale features
df[, 1:2] <- apply(df[, 1:2], 2, scale)
# Build linear model
model <- lm(V3 ~ ., df)
For reference:
> summary(model)
Call:
lm(formula = V3 ~ ., data = df)
Residuals:
Min 1Q Median 3Q Max
-130582 -43636 -10829 43698 198147
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 340413 9637 35.323 < 2e-16 ***
V1 110631 11758 9.409 4.22e-12 ***
V2 -6650 11758 -0.566 0.575
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 66070 on 44 degrees of freedom
Multiple R-squared: 0.7329, Adjusted R-squared: 0.7208
F-statistic: 60.38 on 2 and 44 DF, p-value: 2.428e-13
Now prediction:
closedFormPrediction <- predict(model, df)
closedFormRMSE <- sqrt(mean((closedFormPrediction - df$V3)**2))
plot(
closedFormPrediction, df$V3,
ylab="Actual", xlab="Predicted",
main=paste("Closed form, RMSE: ", round(closedFormRMSE, 3)))
'
Now we can compare above to SGD results:
sgd <- read.csv('results.txt/part-00000', header=FALSE)
sgdRMSE <- sqrt(mean(sgd$V2 - sgd$V1)**2)
plot(
sgd$V2, sgd$V1, ylab="Actual",
xlab="Predicted", main=paste("SGD, RMSE: ", round(sgdRMSE, 3)))
Finally lets compare both:
plot(
sgd$V2, closedFormPrediction,
xlab="SGD", ylab="Closed form", main="SGD vs Closed form")
So, result are clearly not perfect but nothing seems to be completely off here.

TaskSchedulerImpl: Initial job has not accepted any resources. (Error in Spark)

I'm trying to run SparkPi example on my standalone mode cluster.
package org.apache.spark.examples
import scala.math.random
import org.apache.spark._
/** Computes an approximation to pi */
object SparkPi {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("SparkPi")
.setMaster("spark://192.168.17.129:7077")
.set("spark.driver.allowMultipleContexts", "true")
val spark = new SparkContext(conf)
val slices = if (args.length > 0) args(0).toInt else 2
val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
val count = spark.parallelize(1 until n, slices).map { i =>
val x = random * 2 - 1
val y = random * 2 - 1
if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / n)
spark.stop()
}
}
Note: I made a little change in this line:
val conf = new SparkConf().setAppName("SparkPi")
.setMaster("spark://192.168.17.129:7077")
.set("spark.driver.allowMultipleContexts", "true")
Problem: I'm using spark-shell (Scala interface) to run this code. When I try this code, I receive this error repeatedly:
15/02/09 06:39:23 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
Note: I can see my workers in my Master's WebUI and also I can see a new job in the Running Applications section. But there is no end for this application and I see error repeatedly.
What is the problem?
Thanks

If you want to run this from spark shell, then start the shell with argument --master spark://192.168.17.129:7077 and enter the following code:
import scala.math.random
import org.apache.spark._
val slices = 10
val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
val count = sc.parallelize(1 until n, slices).map { i =>
val x = random * 2 - 1
val y = random * 2 - 1
if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / n)
Otherwise, compile the code into a jar and run it with spark-submit. But remove setMaster from the code and add it as 'master' argument to spark-submit script. Also remove the allowMultipleContexts argument from the code.
You need only one spark context.