How can I run a Spark job programmatically - scala

I want to run a Spark job programmatically: submit the SparkPi calculation to a remote cluster directly from IDEA (on my laptop):
import scala.math.random
import org.apache.spark.{SparkConf, SparkContext}
object SparkPi {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Spark Pi")
    val spark = new SparkContext(conf)
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = 100000 * slices
    // Monte Carlo estimate: count random points that fall inside the unit circle
    val count = spark.parallelize(1 to n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}
However, when I run it, I observe the following error:
14/12/08 11:31:20 ERROR security.UserGroupInformation: PriviledgedActionException as:remeniuk (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:156)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at Method)
... 4 more
When I run the same script with spark-submit from my laptop, I see the same error.
Only when I upload the jar to the remote cluster (the machine where the master is running) does the job complete successfully:
./bin/spark-submit --master spark://host-name:7077 --class com.viaden.crm.spark.experiments.SparkPi ../spark-experiments_2.10-0.1-SNAPSHOT.jar

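Judging by the exception stack trace, this is most likely a firewall issue on your local machine: the executors on the cluster time out while trying to connect back to the driver running on your laptop.
Please refer to this similar case.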
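A quick way to test the firewall theory is to check whether a TCP port on the driver machine is reachable from a cluster node. The sketch below is plain Scala with no Spark dependency. `PortCheck` and `canConnect` are hypothetical names, and the host and port you pass in are assumptions about your setup (in Spark, the driver's address and port are controlled by the `spark.driver.host` and `spark.driver.port` properties).

```scala
import java.net.{InetSocketAddress, Socket}
import scala.util.Try

object PortCheck {
  /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
  def canConnect(host: String, port: Int, timeoutMs: Int = 2000): Boolean =
    Try {
      val socket = new Socket()
      // connect throws an IOException on refusal or timeout, which Try turns into a Failure
      try socket.connect(new InetSocketAddress(host, port), timeoutMs)
      finally socket.close()
    }.isSuccess
}
```

Run from a worker node against the laptop's address and the driver port, a `false` result would be consistent with the connection being blocked somewhere between the cluster and the laptop.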
Apache Zeppelin Not Showing Full Stack Trace

I have the following paragraph that does some outlier detection using the interquartile range method. Strangely, it runs into an error, but Apache Zeppelin truncates the stack trace so much that it is not useful.
Here is the code:
def interQuartileRangeFiltering(df: DataFrame): DataFrame = {
  def inner(cols: Seq[String], acc: DataFrame): DataFrame = cols match {
    case Nil => acc
    case column :: xs =>
      val quantiles = acc.stat.approxQuantile(column, Array(0.25, 0.75), 0.0) // TODO: values should come from config
      val q1 = quantiles(0)
      val q3 = quantiles(1)
      val iqr = q1 - q3
      val lowerRange = q1 - 1.5 * iqr
      val upperRange = q3 + 1.5 * iqr
      inner(xs, acc.filter(s"$column < $lowerRange or value > $upperRange"))
  }
  inner(df.columns.toSeq, df)
}
Here is the error when run in Apache Zeppelin:
scala.MatchError: WrappedArray(NEAR BAY, ISLAND, NEAR OCEAN, housing_median_age, population, total_bedrooms, <1H OCEAN, median_house_value, longitude, INLAND, latitude, total_rooms, households, median_income) (of class scala.collection.mutable.WrappedArray$ofRef)
at inner$1(<console>:74)
at interQuartileRangeFiltering(<console>:85)
... 56 elided
I have verified that the corresponding setting in the Spark interpreter is set to true.
Any ideas as to what is wrong here with my approach and how to get Apache Zeppelin to print the whole stacktrace so that I can find out what the actual problem is?
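On the first half of the question, the `MatchError` itself already points at the cause. `df.columns.toSeq` is an array-backed `Seq` (the `WrappedArray` in the trace), while the patterns `Nil` and `::` only destructure a `List`, so a non-empty sequence that is not a `List` matches neither case. Below is a minimal sketch in plain Scala, with hypothetical names, that reproduces the failure and shows a pattern that works for any `Seq`:

```scala
object SeqMatchDemo {
  // Nil / :: only destructure scala.List, so this throws a MatchError
  // for a non-empty Seq that is backed by an array.
  def lengthViaListPattern(cols: Seq[String]): Int = cols match {
    case Nil     => 0
    case _ :: xs => 1 + lengthViaListPattern(xs)
  }

  // Seq() and +: work for every Seq implementation.
  def lengthViaSeqPattern(cols: Seq[String]): Int = cols match {
    case Seq()   => 0
    case _ +: xs => 1 + lengthViaSeqPattern(xs)
  }
}
```

Applied to the question's code, either calling `inner(df.columns.toList, df)` or switching to the `Seq()` and `+:` patterns should avoid the `MatchError`. Separately, the interquartile range is conventionally `q3 - q1`, so `q1 - q3` is negative here, and the filter string refers to `value` rather than to the current column; both are probably worth a second look once the match is fixed.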

XGBoost failing after using windowing functions on label column

I have successfully trained an XGBoost model where trainDF is a dataframe having two columns, features and label, with 11k 1's and 57M 0's (an unbalanced dataset). Everything works fine.
val undersample = 0.1
// Undersampling of 0's -- choosing 10%
val training1 = output1.filter($"datestr" < end_period1 &&
  $"label" === 1)
val training0 = output1.filter($"datestr" < end_period1 &&
  $"label" === 0).sample(
  false, undersample)
val training = training0.unionAll(training1)
val trainDF = training.select("label",
  "features").toDF("label", "features")
val paramMap = List("eta" -> 0.05,
  "max_depth" -> 6,
  "objective" -> "binary:logistic").toMap
val num_trees = 400
val num_cores = 200
val XGBModel = XGBoost.trainWithDataFrame(trainDF, paramMap, num_trees, num_cores,
  useExternalMemory = true)
Then, I want to change the y label with some windowing, so that in each group, I can predict y label earlier.
val sum_label = "sum_label"
val label_window_length = 19
// Window covering the current row and the next 19 rows within each id
val sliding_window_label = Window.partitionBy("id").orderBy(
  asc("timestamp")).rowsBetween(0, label_window_length)
val training_source = output1.filter($"datestr" < end_period1)
  .withColumn(sum_label, sum($"label").over(sliding_window_label))
  .drop("label").withColumnRenamed(sum_label, "label")
val training1 = training_source.filter(col("label") === 1)
val training0 = training_source.filter(col("label") === 0).sample(false, 0.099685)
val training = training0.unionAll(training1)
val trainDF = training.select("label",
  "features").toDF("label", "features")
The result has 57M 0's and 214k 1's (roughly the same number of rows, though). There are no NAs in the "label" column of trainDF, and the type is still double (nullable=true). Then XGBoost fails:
Message: XGBoostModel training failed
StackTrace: at ml.dmlc.xgboost4j.scala.spark.XGBoost$.postTrackerReturnProcessing(XGBoost.scala:316)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainWithRDD(XGBoost.scala:293)
at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.train(XGBoostEstimator.scala:138)
at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.train(XGBoostEstimator.scala:35)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainWithDataFrame(XGBoost.scala:169)
I can include the logs as needed. My confusion is that using the windowing function and literally not changing any other setting, causes XGB to fail. I would appreciate any thoughts on this.
It turns out that saving the table trainDF to Hive and reloading it into Spark solves the problem.
Once it is saved, you can easily load the table:
val trainDF = spark.sql("""select * from database.tablename""")
This trick solved the problem. It seems that Spark's windowing function is a bit unstable, and saving the result into a Hive table makes it work.
A better way to do this is to use windowing functions in Hive instead of Spark.
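For reference, the label that `rowsBetween(0, label_window_length)` produces in the question can be sketched in plain Scala: each row receives the sum of its own label and the labels of the following rows in the window. The names `ForwardWindowLabel` and `forwardSums` are hypothetical, and the sketch covers a single `id` group that is already sorted by timestamp.

```scala
object ForwardWindowLabel {
  /** Sum of the current element and the next `following` elements,
    * mirroring a window defined with rowsBetween(0, following). */
  def forwardSums(labels: Seq[Double], following: Int): Seq[Double] =
    labels.indices.map(i => labels.slice(i, i + following + 1).sum)
}
```

One thing this makes visible: when two positives fall inside the same window, the summed label is 2 or more, so a filter on `label === 1` or `label === 0` keeps neither row. Whether that matters here depends on the data, which is not shown in the question.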

spark stanford parser out of memory

I'm using StanfordCoreNLP 2.4.1 on Spark 1.5 to parse Chinese sentences, but I ran into a Java heap OOM exception. The code looks like this:
val modelpath = "edu/stanford/nlp/models/lexparser/xinhuaFactored.ser.gz"
val lp = LexicalizedParser.loadModel(modelpath)
// `data` stands for the input RDD[String]; its real name did not survive in the post
val dataWords = data.map(x => {
  val tokens = x.split("\t")
  val id = tokens(0)
  val word_seg = tokens(2)
  // keep only well-formed "word:tag" pairs
  val comm_words = word_seg.split("\1").filter(_.split(":").length == 2).map(y => (y.split(":")(0), y.split(":")(1)))
  (id, comm_words)
})
val dataSenSlice = dataWords.map(x => {
  val id = x._1
  val comm_words = x._2
  // entries tagged "34" are treated as punctuation and used as sentence boundaries
  val punctuationIndex = Array(0) ++ comm_words.zipWithIndex.filter(_._1._2 == "34").map(_._2) ++ Array(comm_words.length - 1)
  val senIndex = (punctuationIndex zip punctuationIndex.tail).filter(z => z._1 != z._2)
  val senSlice = senIndex.map(z => {
    val begin = if (z._1 > 0) z._1 + 1 else z._1
    val end = if (z._2 == comm_words.length - 1) z._2 + 1 else z._2
    if (comm_words.slice(begin, end).filter(_._2 != "34").nonEmpty) {
      comm_words.slice(begin, end).filter(_._2 != "34").map(_._1).mkString(" ").trim
    } else ""
  }).filter(l => l.nonEmpty && l.length < 20)
  (id, senSlice)
})
val dataPoint = dataSenSlice.map(x => {
  val id = x._1
  val senSlice = x._2
  val senParse = senSlice.map(y => {
    StanfordNLPParser.senParse(lp, y) // java code wrapped sentence parser
  })
  id + "\t" + senParse.mkString("\1")
})
Each sentence I feed into the parser is built by joining the segmented words with spaces.
The exception I ran into is:
17/08/09 10:28:15 WARN TaskSetManager: Lost task 1062.0 in stage 0.0 (TID 1219, rz-data-hdp-dn15004.rz.******.com): java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.regex.Pattern.union(
at java.util.regex.Pattern.clazz(
at java.util.regex.Pattern.sequence(
at java.util.regex.Pattern.expr(
at java.util.regex.Pattern.compile(
at java.util.regex.Pattern.<init>(
at java.util.regex.Pattern.compile(
at java.util.regex.Pattern.matches(
at java.lang.String.matches(
at edu.stanford.nlp.parser.lexparser.ChineseUnknownWordModel.score(
at edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel.score(
at edu.stanford.nlp.parser.lexparser.ChineseLexicon.score(
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(
I'm wondering whether this is the right way to do sentence parsing, or whether something else is wrong.
Increase the number of partitions, e.g. with repartition, which the Spark documentation describes as follows:
Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
Also increase executor and driver memory, e.g. add these 'spark-submit' parameters:
--executor-memory 8G
--driver-memory 4G
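As a rough illustration of the first suggestion, the sketch below repartitions a toy RDD and reports the partition count before and after. It assumes Spark is on the classpath and runs in local mode. `RepartitionSketch`, `partitionCounts`, the sample data and the target of 8 partitions are all made up for the example and would need tuning for the real job.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RepartitionSketch {
  /** Returns (partitions before, partitions after) for a small sample RDD. */
  def partitionCounts(sc: SparkContext, target: Int): (Int, Int) = {
    // Start with everything in a single partition
    val sentences = sc.parallelize(Seq("a b c", "d e", "f g h i"), numSlices = 1)
    // repartition shuffles the data into `target` partitions
    val spread = sentences.repartition(target)
    (sentences.partitions.length, spread.partitions.length)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("repartition-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(partitionCounts(sc, target = 8))
    finally sc.stop()
  }
}
```

In the question's job, the same call would presumably go on the RDD of sentences just before the parsing step, so that each task parses fewer sentences and holds less parser state in memory at once.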

TaskSchedulerImpl: Initial job has not accepted any resources. (Error in Spark)

I'm trying to run the SparkPi example on my standalone-mode cluster.
package org.apache.spark.examples
import scala.math.random
import org.apache.spark._
/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SparkPi")
      .set("spark.driver.allowMultipleContexts", "true")
    val spark = new SparkContext(conf)
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = spark.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}
Note: I made a little change in this line:
val conf = new SparkConf().setAppName("SparkPi")
.set("spark.driver.allowMultipleContexts", "true")
Problem: I'm using spark-shell (Scala interface) to run this code. When I try this code, I receive this error repeatedly:
15/02/09 06:39:23 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
Note: I can see my workers in my Master's web UI, and I can also see a new job in the Running Applications section. But the application never finishes, and I see this warning repeatedly.
What is the problem?
If you want to run this from the Spark shell, start the shell with the argument --master spark:// and enter the following code (the shell already provides a SparkContext as sc):
import scala.math.random
import org.apache.spark._
val slices = 10
val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
val count = sc.parallelize(1 until n, slices).map { i =>
  val x = random * 2 - 1
  val y = random * 2 - 1
  if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / n)
Otherwise, compile the code into a jar and run it with spark-submit. In that case, remove setMaster from the code and pass the master URL through the --master argument of the spark-submit script. Also remove the allowMultipleContexts setting from the code.
You need only one Spark context; the Spark shell already creates one for you (sc).
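As a side note on the `// avoid overflow` comment in the SparkPi code above: multiplying two `Int` values wraps around silently once the result exceeds `Int.MaxValue`, whereas multiplying as `Long` and clamping keeps the sample count valid. Here is a small plain-Scala sketch, with hypothetical names, comparing the two:

```scala
object OverflowDemo {
  /** Int multiplication: wraps around for large slice counts. */
  def naive(slices: Int): Int = 100000 * slices

  /** Long multiplication clamped to Int.MaxValue, as in the SparkPi example. */
  def clamped(slices: Int): Int = math.min(100000L * slices, Int.MaxValue).toInt
}
```

With 30000 slices, `naive` yields a negative number while `clamped` yields `Int.MaxValue`; for small slice counts both agree.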

SparkPi running slow with more than 1 slice

I'm relatively new to Spark and have tried running the SparkPi example on a standalone 12-core, three-machine cluster. What I fail to understand is why running this example with a single slice gives better performance than using 12 slices. The same happened when I used the parallelize function. The run time scales almost linearly with each added slice. Please let me know if I'm doing anything wrong. The code snippet is given below:
val spark = new SparkContext("spark://telecom:7077", "SparkPi",
  System.getenv("SPARK_HOME"), List("target/scala-2.10/sparkpii_2.10-1.0.jar"))
val slices = 1
val n = 10000000 * slices
val count = spark.parallelize(1 to n, slices).map { i =>
  val x = random * 2 - 1
  val y = random * 2 - 1
  if (x * x + y * y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / n)
Update: The problem was with the random function. Since it is a synchronized method, it couldn't scale to multiple cores.
The random function used in the SparkPi example (scala.math.random, which delegates to java.lang.Math.random) is synchronized and backed by a single generator shared by all threads in the JVM, so it can't scale to multiple cores. It's an easy enough example to deploy on your cluster, but don't use it to check Spark's performance and scalability.
As Ahsan mentioned in his answer, the problem was with 'scala.math.random'.
I have replaced it with 'org.apache.spark.util.random.XORShiftRandom', and now using multiple processors makes the Pi calculation run much faster.
Below is my code, a modified version of the SparkPi example from the Spark distribution:
// scalastyle:off println
package org.apache.spark.examples
import org.apache.spark.util.random.XORShiftRandom
import org.apache.spark._
/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    // args(0): master URL, args(1): optional number of slices
    val conf = new SparkConf().setAppName("Spark Pi").setMaster(args(0))
    val spark = new SparkContext(conf)
    val slices = if (args.length > 1) args(1).toInt else 2
    val n = math.min(100000000L * slices, Int.MaxValue).toInt // avoid overflow
    val rand = new XORShiftRandom()
    val count = spark.parallelize(1 until n, slices).map { i =>
      val x = rand.nextDouble * 2 - 1
      val y = rand.nextDouble * 2 - 1
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}
// scalastyle:on println
When I run the program above on one core with the parameters 'local[1] 16', it takes about 60 seconds on my laptop. The same program on 8 cores ('local[*] 16') takes 17 seconds.
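The underlying idea, giving each slice its own generator instead of sharing one across threads, can be sketched without Spark. The example below is only an illustration under stated assumptions: it swaps in `java.util.Random` for `XORShiftRandom`, uses Scala futures in place of Spark tasks, and the names `LocalPi`, `countInside` and `estimatePi` as well as the seeds are made up.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object LocalPi {
  /** Counts random points inside the unit circle, using a generator private to this slice. */
  def countInside(samples: Int, seed: Long): Int = {
    val rng = new java.util.Random(seed) // no sharing, hence no contention between slices
    var inside = 0
    var i = 0
    while (i < samples) {
      val x = rng.nextDouble() * 2 - 1
      val y = rng.nextDouble() * 2 - 1
      if (x * x + y * y < 1) inside += 1
      i += 1
    }
    inside
  }

  /** Runs the slices concurrently, each with a different seed, and combines the counts. */
  def estimatePi(slices: Int, samplesPerSlice: Int): Double = {
    val counts = Future.sequence(
      (0 until slices).map(s => Future(countInside(samplesPerSlice, seed = 42L + s))))
    val total = Await.result(counts, 1.minute).map(_.toLong).sum
    4.0 * total / (slices.toLong * samplesPerSlice)
  }
}
```

In Spark the same structure could probably be expressed with `mapPartitionsWithIndex`, creating one generator per partition and seeding it from the partition index, so that partitions neither contend on a shared generator nor repeat the same random sequence.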