TaskSchedulerImpl: Initial job has not accepted any resources. (Error in Spark) - scala

I'm trying to run the SparkPi example on my standalone-mode cluster.
package org.apache.spark.examples
import scala.math.random
import org.apache.spark._
/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SparkPi")
      .set("spark.driver.allowMultipleContexts", "true")
    val spark = new SparkContext(conf)
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = spark.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
  }
}
Note: I made a little change in this line:
val conf = new SparkConf().setAppName("SparkPi")
.set("spark.driver.allowMultipleContexts", "true")
Problem: I'm using spark-shell (the Scala interface) to run this code. When I run it, I receive this warning repeatedly:
15/02/09 06:39:23 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
Note: I can see my workers in my Master's web UI, and I can also see a new application in the Running Applications section. But the application never finishes, and the warning keeps repeating.
What is the problem?

If you want to run this from the Spark shell, start the shell with the argument --master spark:// and enter the following code:
import scala.math.random
import org.apache.spark._
val slices = 10
val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
val count = sc.parallelize(1 until n, slices).map { i =>
val x = random * 2 - 1
val y = random * 2 - 1
if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / n)
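Otherwise, compile the code into a jar and run it with spark-submit. In that case, do not call setMaster in the code; pass the master URL through the --master argument of spark-submit instead. Also remove the allowMultipleContexts setting from the code.
You need only one Spark context. spark-shell already creates one for you as sc, and a second context created inside the shell registers as a separate application, which can be left waiting for resources that the shell's own application already holds.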


Getting error while running below code on jupyter notebook

After installing pyspark, I am trying to run the code below and am getting IndexError: list index out of range.
import findspark
import random
from pyspark import SparkContext
sc = SparkContext(appName="EstimatePi")
def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1
NUM_SAMPLES = 1000000
count = sc.parallelize(range(0, NUM_SAMPLES)) \
    .filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))

Spark - LinearRegressionWithSGD on Coursera Machine Learning by Stanford University samples

Software Version: Apache Spark v1.3
Context: I've been trying to "translate" Octave/MATLAB code to Scala on Apache Spark. More precisely, I am working on ex1data1.txt and ex1data2.txt from the practical part (ex1) of the Coursera course. I did the same translation into Julia (it went smoothly), and now I've been struggling with Spark, without success.
Problem: The performance of my implementation on Spark is very poor. I cannot even say it works correctly. For ex1data1.txt I tried adding a polynomial feature, and I tried two ways of handling theta0: setIntercept(true), and an extra non-normalized column of 1.0 values (in that case with the intercept set to false). I receive only silly results.
So I've decided to start working with ex1data2.txt. Below you can find the code and the expected result. Of course, the Spark result is wrong.
Have you had a similar experience? I will be grateful for your help.
The Scala code for ex1data2.txt:
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.SquaredL2Updater
import org.apache.spark.mllib.regression.{LinearRegressionModel, LinearRegressionWithSGD, LabeledPoint}
import org.apache.spark.{SparkContext, SparkConf}
object MLibOnEx1data2 extends App {
  val conf = new SparkConf()
  conf.set("spark.app.name", "coursera ex1data2.txt test")
  val sc = new SparkContext(conf)
  val input = sc.textFile("hdfs:///ex1data2.txt")
  // Each line is "<size>,<bedrooms>,<price>"
  val trainData = input.map { line =>
    val parts = line.split(',')
    val y = parts(2).toDouble
    val features = Vectors.dense(parts(0).toDouble, parts(1).toDouble)
    println(s"x = $features y = $y")
    LabeledPoint(y, features)
  }
  // Building the model
  val numIterations = 1500
  val alpha = 0.01
  // Scale the features
  val scaler = new StandardScaler(withMean = true, withStd = true)
    .fit(trainData.map(x => x.features))
  val scaledTrainData = trainData.map{ td =>
    val normFeatures = scaler.transform(td.features)
    println(s"normalized features = $normFeatures")
    LabeledPoint(td.label, normFeatures)
  }
  val tsize = scaledTrainData.count()
  println(s"Training set size is $tsize")
  val alg = new LinearRegressionWithSGD().setIntercept(true)
  // The iteration count, step size and updater are set on the optimizer
  alg.optimizer
    .setNumIterations(numIterations)
    .setStepSize(alpha)
    .setUpdater(new SquaredL2Updater)
    .setRegParam(0.0) //regularization - off
  val model = alg.run(scaledTrainData)
  println(s"Theta is $model.weights")
  val total1 = model.predict(scaler.transform(Vectors.dense(1650, 3)))
  println(s"Estimate the price of a 1650 sq-ft, 3 br house = $total1 dollars") //it should give ~ $289314.620338
  // Evaluate model on training examples and compute training error
  val valuesAndPreds = scaledTrainData.map { point =>
    val prediction = model.predict(point.features)
    (point.label, prediction)
  }
  val MSE = ((valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()) / 2)
  println("Training Mean Squared Error = " + MSE)
  // Save and load model
  val trySaveAndLoad = util.Try(model.save(sc, "myModelPath"))
    .flatMap { _ => util.Try(LinearRegressionModel.load(sc, "myModelPath")) }
  println(s"trySaveAndLoad result is $trySaveAndLoad")
}
STDOUT result is:
Training set size is 47
Theta is (weights=[52090.291641275864,19342.034885388926],
Estimate the price of a 1650 sq-ft, 3 br house = 153983.5541846754
Training Mean Squared Error = 1.5876093757127676E10
trySaveAndLoad result is -1
Well, after some digging I believe there is nothing wrong here. First I saved the content of valuesAndPreds to a text file:
valuesAndPreds.map{case (x, y) => s"$x,$y"}.repartition(1).saveAsTextFile("results.txt")
The rest of the code is written in R.
First, let's create a model using the closed-form solution:
# Load data
df <- read.csv('results.txt/ex1data2.txt', header=FALSE)
# Scale features
df[, 1:2] <- apply(df[, 1:2], 2, scale)
# Build linear model
model <- lm(V3 ~ ., df)
For reference:
> summary(model)
Call:
lm(formula = V3 ~ ., data = df)
Residuals:
    Min      1Q  Median      3Q     Max
-130582  -43636  -10829   43698  198147
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   340413       9637  35.323  < 2e-16 ***
V1            110631      11758   9.409 4.22e-12 ***
V2             -6650      11758  -0.566    0.575
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 66070 on 44 degrees of freedom
Multiple R-squared: 0.7329, Adjusted R-squared: 0.7208
F-statistic: 60.38 on 2 and 44 DF, p-value: 2.428e-13
Now the prediction:
closedFormPrediction <- predict(model, df)
closedFormRMSE <- sqrt(mean((closedFormPrediction - df$V3)**2))
plot(
  closedFormPrediction, df$V3,
  ylab="Actual", xlab="Predicted",
  main=paste("Closed form, RMSE: ", round(closedFormRMSE, 3)))
Now we can compare the above to the SGD results:
sgd <- read.csv('results.txt/part-00000', header=FALSE)
sgdRMSE <- sqrt(mean((sgd$V2 - sgd$V1)**2))
plot(
  sgd$V2, sgd$V1, ylab="Actual",
  xlab="Predicted", main=paste("SGD, RMSE: ", round(sgdRMSE, 3)))
Finally, let's compare both:
plot(
  sgd$V2, closedFormPrediction,
  xlab="SGD", ylab="Closed form", main="SGD vs Closed form")
So the results are clearly not perfect, but nothing seems to be completely off here.
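One detail worth double-checking in any RMSE computation like the ones in this answer: the residuals have to be squared before the mean is taken. Squaring the mean of the residuals instead yields only the absolute value of the average error, which can be close to zero even for a poor fit. Here is a minimal sketch in plain Python; the numbers are made up for illustration and are not taken from ex1data2.txt.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: square each residual, average, then take the root."""
    residuals = [p - a for a, p in zip(actual, predicted)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))

def abs_mean_error(actual, predicted):
    """What sqrt(mean(residuals) ** 2) computes: the absolute value of the mean residual."""
    residuals = [p - a for a, p in zip(actual, predicted)]
    return math.sqrt((sum(residuals) / len(residuals)) ** 2)

# Hypothetical values, chosen so that the errors cancel out on average.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

print("RMSE:", rmse(actual, predicted))
print("abs(mean error):", abs_mean_error(actual, predicted))
```

Every prediction here is off by 10, yet the second quantity comes out as zero because the positive and negative errors cancel.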

How can I run Spark job programmatically

I want to run a Spark job programmatically: submit the SparkPi calculation to a remote cluster directly from IDEA (my laptop):
import scala.math.random
import org.apache.spark.{SparkConf, SparkContext}
object SparkPi {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Spark Pi")
    val spark = new SparkContext(conf)
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = 100000 * slices
    val count = spark.parallelize(1 to n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
  }
}
However, when I run it, I observe the following error:
14/12/08 11:31:20 ERROR security.UserGroupInformation: PriviledgedActionException as:remeniuk (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1421)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:156)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
... 4 more
When I run the same script with spark-submit from my laptop, I see the same error.
Only when I upload the jar to the remote cluster (the machine where the master is running) does the job complete successfully:
./bin/spark-submit --master spark://host-name:7077 --class com.viaden.crm.spark.experiments.SparkPi ../spark-experiments_2.10-0.1-SNAPSHOT.jar
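Judging by the exception stack trace, this is most likely a firewall issue on your local machine: the trace comes from CoarseGrainedExecutorBackend, that is, from an executor on the cluster that timed out while trying to connect back to the driver running on your laptop.
Please refer to this similar case:
Intermittent Timeout Exception using Spark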
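To check the firewall theory, it can help to test from a cluster node whether a TCP connection to the driver machine can be opened at all. Below is a minimal sketch in plain Python. The function name can_connect is made up for this example, and the demo only talks to a throwaway listener on localhost so that it runs anywhere. In a real check you would pass your laptop's address and the port the driver listens on (spark.driver.port, which is chosen at random unless you set it).

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False

# Demo against a temporary local listener (a stand-in for the driver).
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]

print("listener up:  ", can_connect("127.0.0.1", demo_port))
listener.close()
print("listener down:", can_connect("127.0.0.1", demo_port))
```

If the check fails from the workers but succeeds from the laptop itself, a firewall or NAT between the cluster and the laptop is a plausible culprit.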

How to construct graph in graphx

I am new to Scala and GraphX and am having problems converting a TSV file to a graph.
I have a flat tab-separated file like the one below:
n1 P1 n2
n3 P1 n4
n2 P2 n3
n3 P2 n1
n1 P3 n4
n3 P3 n2
where n1, n2, n3, n4 are the nodes of the graph and P1, P2, P3 are the properties which should form the edges between the nodes.
How can I construct a graph from the above file in Spark GraphX?
Example code would be very helpful.
Here is some code for you (of course, you should build it into a jar file using sbt):
package vinnie.pooh
import org.apache.spark.SparkContext._
import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
object Main {
  def main(args: Array[String]) {
    if (args.length != 1) {
      System.err.println(
        "Should be one parameter: <path/to/edges>")
      System.exit(1)
    }
    val conf = new SparkConf()
      .setAppName("Load graph")
    val sc = new SparkContext(conf)
    // Each line is "<srcId> <property> <dstId>"; the ids must be numeric
    val edges: RDD[Edge[String]] =
      sc.textFile(args(0)).map { line =>
        val fields = line.split(" ")
        Edge(fields(0).toLong, fields(2).toLong, fields(1))
      }
    val graph : Graph[Any, String] = Graph.fromEdges(edges, "defaultProperty")
    println("num edges = " + graph.numEdges);
    println("num vertices = " + graph.numVertices);
  }
}
and I have edge.txt:
1 Prop12 2
2 Prop24 4
4 Prop45 5
5 Prop52 2
6 Prop65 7
and then, for example, you can launch it locally:
$SPARK_HOME>./bin/spark-submit --class vinnie.pooh.Main --master local[2] ~/justBuiltJar.jar ~/edge.txt
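Note that GraphX vertex ids are Longs, and the code above calls toLong on the first and third fields, so node names such as n1 from the question would first have to be mapped to numbers. A minimal sketch of that mapping step in plain Python (not Spark code): the function name build_edges and the numbering scheme are assumptions made for this illustration.

```python
def build_edges(lines):
    """Parse "<src> <property> <dst>" lines into (src_id, dst_id, property) triples.

    Node names get consecutive numeric ids in order of first appearance.
    Returns the triples together with the name -> id mapping.
    """
    ids = {}
    edges = []
    for line in lines:
        fields = line.split()            # splits on tabs as well as spaces
        if len(fields) != 3:
            continue                     # skip blank or malformed lines
        src, prop, dst = fields
        for name in (src, dst):
            ids.setdefault(name, len(ids) + 1)
        edges.append((ids[src], ids[dst], prop))
    return edges, ids

# First three lines of the file from the question, tab-separated.
sample = ["n1\tP1\tn2", "n3\tP1\tn4", "n2\tP2\tn3"]
edges, ids = build_edges(sample)
print(edges)
print(ids)
```

The resulting triples have the same shape as the Edge(srcId, dstId, property) values built in the Scala code, and the mapping can be kept around to translate ids back to node names.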

SparkPi running slow with more than 1 slice

I'm relatively new to Spark and have tried running the SparkPi example on a standalone 12-core, three-machine cluster. What I fail to understand is that running this example with a single slice gives better performance than using 12 slices. The same was the case when I was using the parallelize function. The time scales almost linearly with each added slice. Please let me know if I'm doing anything wrong. The code snippet is given below:
val spark = new SparkContext("spark://telecom:7077", "SparkPi",
System.getenv("SPARK_HOME"), List("target/scala-2.10/sparkpii_2.10-1.0.jar"))
val slices = 1
val n = 10000000 * slices
val count = spark.parallelize(1 to n, slices).map {
i =>
val x = random * 2 - 1
val y = random * 2 - 1
if (x * x + y * y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / n)
Update: The problem was with the random function; since it is a synchronized method, it couldn't scale to multiple cores.
The random function used in the SparkPi example is a synchronized method and can't scale to multiple cores (scala.math.random delegates to java.lang.Math.random, which uses one generator shared by all threads). It's an easy enough example to deploy on your cluster, but don't use it to check Spark's performance and scalability.
As Ahsan mentioned in his answer, the problem was with 'scala.math.random'.
I have replaced it with 'org.apache.spark.util.random.XORShiftRandom', and now using multiple processors makes the Pi calculation run much faster.
Below is my code, which is a modified version of the SparkPi example from the Spark distribution:
// scalastyle:off println
package org.apache.spark.examples
import org.apache.spark.util.random.XORShiftRandom
import org.apache.spark._
/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Spark Pi").setMaster(args(0))
    val spark = new SparkContext(conf)
    val slices = if (args.length > 1) args(1).toInt else 2
    val n = math.min(100000000L * slices, Int.MaxValue).toInt // avoid overflow
    val rand = new XORShiftRandom()
    val count = spark.parallelize(1 until n, slices).map { i =>
      val x = rand.nextDouble * 2 - 1
      val y = rand.nextDouble * 2 - 1
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
  }
}
// scalastyle:on println
When I run the program above using one core with the parameters 'local[1] 16', it takes about 60 seconds on my laptop. The same program using 8 cores ('local[*] 16') takes 17 seconds.
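The general idea behind this fix is that each slice should draw from its own generator instead of contending for a shared one. Here is a minimal sketch of that structure in plain Python. It runs sequentially, so it shows only how the work is organized and not the speedup, and the names count_hits, estimate_pi and base_seed are made up for this example. Seeding each slice differently also keeps the slices from repeating one another's samples.

```python
import random

def count_hits(slice_index, samples, base_seed=42):
    """Count random points inside the unit circle, using a generator private to this slice."""
    rng = random.Random(base_seed + slice_index)   # own generator, own seed per slice
    hits = 0
    for _ in range(samples):
        x = rng.random() * 2 - 1
        y = rng.random() * 2 - 1
        if x * x + y * y < 1:
            hits += 1
    return hits

def estimate_pi(slices, samples_per_slice):
    """Combine the per-slice counts, as reduce(_ + _) does in the Scala code."""
    total_hits = sum(count_hits(i, samples_per_slice) for i in range(slices))
    return 4.0 * total_hits / (slices * samples_per_slice)

print("Pi is roughly", estimate_pi(slices=4, samples_per_slice=50000))
```

In Spark, the analogous step would probably be to create the generator once per partition (for example inside mapPartitions) rather than once on the driver.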