Spark atop Docker not accepting jobs - scala

I'm trying to make a hello world example work with spark+docker, and here is my code.
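import org.apache.spark.SparkContext
object Generic {
  def main(args: Array[String]) {
    // Connect to the standalone master running in the Docker container
    val sc = new SparkContext("spark://172.17.0.3:7077", "Generic", "/opt/spark-0.9.0")
    val NUM_SAMPLES = 100000
    // Monte Carlo estimate of pi: count the random points that land inside the unit circle
    val count = sc.parallelize(1 to NUM_SAMPLES).map { i =>
      val x = Math.random * 2 - 1
      val y = Math.random * 2 - 1
      if (x * x + y * y < 1) 1.0 else 0.0
    }.reduce(_ + _)
    println("Pi is roughly " + 4 * count / NUM_SAMPLES)
  }
}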
When I run sbt run, I get
14/05/28 15:19:58 INFO client.AppClient$ClientActor: Connecting to master spark://172.17.0.3:7077...
14/05/28 15:20:08 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
I checked both the cluster UI, where I have 3 nodes that each have 1.5g of memory, and the namenode UI, where I see the same thing.
The docker logs show no output from the workers and the following from the master
14/05/28 21:20:38 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster@master:7077] -> [akka.tcp://spark@10.0.3.1:48085]: Error [Association failed with [akka.tcp://spark@10.0.3.1:48085]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@10.0.3.1:48085]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: /10.0.3.1:48085
]
This happens a couple times, and then the program times out and dies with
[error] (run-main-0) org.apache.spark.SparkException: Job aborted: Spark cluster looks down
When I ran tcpdump on the docker0 interface, it looked like the workers and the master nodes were talking.
However, the Spark console works.
If I set sc as val sc = new SparkContext("local", "Generic", System.getenv("SPARK_HOME")), the program runs.

I've been there. It looks like the Akka actor subsystem in Spark is binding to a different interface than the one Spark uses on docker0.
Your master IP is: spark://172.17.0.3:7077
But Akka is binding on: akka.tcp://spark@10.0.3.1:48085
If your masters/slaves are Docker containers, they should be communicating through the docker0 interface in the 172.17.x.x range.
Try providing the master and slaves with their correct local IP using the SPARK_LOCAL_IP environment variable. See the config docs for details.
In our docker setup for Spark 0.9 we are using this command to start the slaves:
${SPARK_HOME}/bin/spark-class org.apache.spark.deploy.worker.Worker $MASTER_IP -i $LOCAL_IP
Which directly provides the local IP to the worker.
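To make the mismatch above concrete, here is a minimal sketch in plain Scala (no Spark dependency) that picks, from a list of local addresses, the one sharing the master's network. The object BindAddress and its methods are hypothetical helpers, not part of Spark, and the check assumes the Docker bridge is a /16 network, so only the first two octets are compared. The address 172.17.42.1 in the usage comment is also just an assumed docker0 address for illustration.

```scala
object BindAddress {
  // First two octets of an IPv4 literal, e.g. "172.17" for "172.17.0.3".
  // Assumption: the Docker bridge is a /16 network.
  private def prefix16(ip: String): String = ip.split('.').take(2).mkString(".")

  // True when both addresses fall in the same /16 network.
  def sameSubnet16(a: String, b: String): Boolean = prefix16(a) == prefix16(b)

  // Returns the first local address that lives on the master's network, if any.
  def pick(localAddresses: Seq[String], masterIp: String): Option[String] =
    localAddresses.find(sameSubnet16(_, masterIp))
}

// Usage: 10.0.3.1 is the address Akka picked in the log above; 172.17.42.1 is an
// assumed docker0 address on the host.
//   BindAddress.pick(Seq("10.0.3.1", "172.17.42.1"), "172.17.0.3")  // Some("172.17.42.1")
```

The address it returns is the one you would then export as SPARK_LOCAL_IP before starting the driver or worker.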

To run Spark on Docker, it's crucial to:
Expose all necessary ports
Set the correct spark.broadcast.factory
Handle Docker aliases
Unless all three issues are handled, the parts of the Spark cluster (master, worker, driver) can't communicate. You can read about each issue in detail at http://sometechshit.blogspot.ru/2015/04/running-spark-standalone-cluster-in.html or use a container prepared for Spark from https://registry.hub.docker.com/u/epahomov/docker-spark/

If you are on a Windows host, check the firewall and make sure java.exe is allowed to access the public network, or change the DockerNAT network to private. In general, the worker must be able to connect back to the driver (the program you submitted).

Related

Unable to submit Spark job to yarn cluster using Scala

I am trying to submit a Spark job through the SparkSubmit class from a Scala application on my local Windows machine to a remote YARN cluster, but it always tries to connect to the ResourceManager at 0.0.0.0.
val args = Array(
"--master", "yarn",
"--verbose",
"--class", "application-class",
"--num-executors", "1",
"--executor-cores", "1",
"--executor-memory", "10g",
"--deploy-mode", "cluster",
"--driver-memory", "10g",
"path-to-jar", "1")
SparkSubmit.main(args)
Below is the error
Failed to connect to server: 0.0.0.0/0.0.0.0:8032: retries get failed due to exceeded maximum allowed retries number: 10
When I submit the Spark job from the Command Prompt/Windows shell with the same arguments as in Scala, it works fine and the job is submitted to the cluster.
I already have HADOOP_CONF_DIR and YARN_CONF_DIR set as environment variables, and my yarn-site.xml has yarn.resourcemanager.address defined with the remote IP.
Am I missing anything here?

Checkpoint data corruption in Spark Streaming

I am testing checkpointing and write ahead logs with the basic Spark Streaming code below. I am checkpointing into a local directory. After starting and stopping the application a few times (using Ctrl-C), it refuses to start, apparently because of some data corruption in the checkpoint directory. I am getting:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 80.0 failed 1 times, most recent failure: Lost task 0.0 in stage 80.0 (TID 17, localhost): com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 13994
at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
Full code:
import org.apache.hadoop.conf.Configuration
import org.apache.spark._
import org.apache.spark.streaming._
object ProtoDemo {
  def createContext(dirName: String) = {
    val conf = new SparkConf().setAppName("mything")
    conf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint(dirName)
    val lines = ssc.socketTextStream("127.0.0.1", 9999)
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    val runningCounts = wordCounts.updateStateByKey[Int] {
      (values: Seq[Int], oldValue: Option[Int]) =>
        val s = values.sum
        Some(oldValue.fold(s)(_ + s))
    }
    // Print the first ten elements of each RDD generated in this DStream to the console
    runningCounts.print()
    ssc
  }
  def main(args: Array[String]) = {
    val hadoopConf = new Configuration()
    val dirName = "/tmp/chkp"
    val ssc = StreamingContext.getOrCreate(dirName, () => createContext(dirName), hadoopConf)
    ssc.start()
    ssc.awaitTermination()
  }
}
What you are doing is essentially a driver failure scenario. For this to work, depending on the cluster you are running, you have to follow the instructions below to monitor the driver process and relaunch the driver if it fails.
Configuring automatic restart of the application driver - To automatically recover from a driver failure, the deployment infrastructure that is used to run the streaming application must monitor the driver process and relaunch the driver if it fails. Different cluster managers have different tools to achieve this.
Spark Standalone - A Spark application driver can be submitted to run within the Spark Standalone cluster (see cluster deploy mode), that is, the application driver itself runs on one of the worker nodes. Furthermore, the Standalone cluster manager can be instructed to supervise the driver, and relaunch it if the driver fails either due to non-zero exit code, or due to failure of the node running the driver. See cluster mode and supervise in the Spark Standalone guide for more details.
YARN - YARN supports a similar mechanism for automatically restarting an application. Please refer to the YARN documentation for more details.
Mesos - Marathon has been used to achieve this with Mesos.
You also need to configure write ahead logs as described below; there are special instructions for S3 that you need to follow.
While using S3 (or any file system that does not support flushing) for write ahead logs, please remember to enable
spark.streaming.driver.writeAheadLog.closeFileAfterWrite
spark.streaming.receiver.writeAheadLog.closeFileAfterWrite
See Spark Streaming Configuration for more details.

This looks like a Kryo serializer issue rather than checkpoint corruption.
In the code example (including the GitHub project), Kryo serialization is not configured.
Since it is not configured, this code alone should not be able to produce a KryoException.
When write ahead logs are used and the context is restored from a checkpoint directory, the entire Spark configuration is read from that directory.
In your example, the createContext method is not called when starting from the checkpoint.
I assume that another application, one with the Kryo serializer configured, was tested earlier with the same checkpoint directory,
and the current application fails to restore from that checkpoint.

Simple Spark program eats all resources

I have a server running a Spark master and slave. Spark was built manually with the following flags:
build/mvn -Pyarn -Phadoop-2.6 -Dscala-2.11 -DskipTests clean package
I'm trying to execute the following simple program remotely:
def main(args: Array[String]) {
  val conf = new SparkConf().setAppName("testApp").setMaster("spark://sparkserver:7077")
  val sc = new SparkContext(conf)
  println(sc.parallelize(Array(1,2,3)).reduce((a, b) => a + b))
}
Spark dependency:
"org.apache.spark" %% "spark-core" % "1.6.1"
Log output when the program runs:
16/04/12 18:45:46 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
My cluster WebUI:
Why does such a simple application use all available resources?
P.S. I also noticed that if I allocate more memory to my app (e.g. 10 GB), the following log lines appear many times:
16/04/12 19:23:40 INFO AppClient$ClientEndpoint: Executor updated: app-20160412182336-0008/208 is now RUNNING
16/04/12 19:23:40 INFO AppClient$ClientEndpoint: Executor updated: app-20160412182336-0008/208 is now EXITED (Command exited with code 1)
I think the reason is the connection between master and slave. This is how I set up the master and slave (on the same machine):
sbin/start-master.sh
sbin/start-slave.sh spark://sparkserver:7077
P.P.S. When I'm connecting to spark master with spark-shell all is good:
spark-shell --master spark://sparkserver:7077
By default, YARN will allocate all "available" resources if YARN dynamic resource allocation is set to true and your job still has queued tasks. You can also check your YARN configuration, namely the number of executors and the memory allocated to each one, and tune them to your needs.

In conf/spark-defaults.conf, set spark.cores.max (for example, spark.cores.max 4) to cap the number of cores the application takes.

It was a driver issue. The driver (my Scala app) was running on my local computer, and the workers had no access to it. As a result, all resources were eaten up by attempts to reconnect to the driver.
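A quick way to confirm this kind of problem is to test, from a worker machine, whether a TCP connection to the driver's host and port can be opened at all. The sketch below uses only the JDK socket API. Reachability and canConnect are hypothetical names, not Spark APIs, and the host and port in the usage comment are placeholders. By default the driver listens on a random port, so I am assuming you either read it from the driver log or pin it with spark.driver.port.

```scala
import java.io.IOException
import java.net.{InetSocketAddress, Socket}

object Reachability {
  // Tries to open a TCP connection to host:port within timeoutMs milliseconds.
  // Returns false on refusal, timeout, or an unresolvable host name.
  def canConnect(host: String, port: Int, timeoutMs: Int = 2000): Boolean = {
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress(host, port), timeoutMs)
      true
    } catch {
      case _: IOException => false
    } finally {
      socket.close()
    }
  }
}

// Usage from a worker node (placeholder host and port):
//   Reachability.canConnect("my-laptop.example", 51000)
```

If this returns false from the worker, the executors cannot call back to the driver either, which matches the repeated RUNNING/EXITED executor updates in the log.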

Running Apache Spark Example Application in IntelliJ Idea

I am trying to run the SparkPi.scala example program in Netbeans. Unfortunately I am quite new to Spark and have not been able to execute it successfully.
My preference is to work in Netbeans only and execute from there. I know spark also allows executing from the spark console - I however prefer not to take that approach.
This is my build.sbt file contents:
name := "SBTScalaSparkPi"
version := "1.0"
scalaVersion := "2.10.6"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
This is my plugins.sbt file contents:
logLevel := Level.Warn
This is the program I am trying to execute:
import scala.math.random
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Spark Pi")
    val spark = new SparkContext(conf)
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = spark.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}
JDK version: 1.8.
The error I get when trying to execute the code is given below:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/03/25 07:50:25 INFO SparkContext: Running Spark version 1.6.1
16/03/25 07:50:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/03/25 07:50:26 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:401)
at SparkPi.main(SparkPi.scala)
16/03/25 07:50:26 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>
at SparkPi$.main(SparkPi.scala:28)
at SparkPi.main(SparkPi.scala)
Process finished with exit code 1
Thanks in advance for any help.
A master URL must be set in your configuration
You must set a spark.master in your SparkConf. There are only two mandatory parameters you must set - the master and the AppName that you've already set. For more details, see Initializing Spark section in the docs.
Which master should you use? See Master URLs section for all options. The simplest option for testing is local, which runs an entire Spark system (driver, master, worker) on your local machine, with no extra configuration.
To set the master through the Scala API:
val conf = new SparkConf().setAppName("Spark Pi").setMaster("local")
val spark = new SparkContext(conf)
The start of your program just lacks the URL that points to the Spark master endpoint. You can specify this as a command line parameter in IntelliJ. The master URL is the host and port where the Spark master of your cluster is running. An example command line parameter looks like this:
-Dspark.master=spark://myhost:7077
See the answer to this question for details:
How to set Master address for Spark examples from command line
Perhaps for your first runs you want to just start a local Spark standalone environment. How to get that running is well documented here: http://spark.apache.org/docs/latest/spark-standalone.html
Once you have this running, you can set up your Spark master config like this:
-Dspark.master=spark://localhost:7077
The master URL needs to be set. Using the setMaster("local") method solved the issue.
val conf = new SparkConf().setAppName("Spark Pi").setMaster("local")
val spark = new SparkContext(conf)
As a matter of fact, both @Matthias and @Tzach are right. Choose the solution that is easier for you (maybe prefer the first option for now). Once you start running your Spark job on a real cluster, it is far better not to hardcode the "master" parameter, so that you can run the job under multiple cluster managers (YARN, Mesos, Standalone with spark-submit) and still run it locally from Netbeans (-Dspark.master=local[*]).
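One way to avoid hardcoding the master while still being able to run from the IDE is to fall back to a local master only when nothing else was supplied. A new SparkConf() already picks up spark.* JVM system properties, so -Dspark.master=... works on its own. The sketch below only adds the fallback. MasterConfig and resolveMaster are hypothetical names, not part of Spark, and the helper is written against a plain map so it runs without the Spark dependency.

```scala
object MasterConfig {
  // Hypothetical helper, not part of Spark: returns the master URL found under
  // "spark.master" in the given properties, or the default when it is missing or empty.
  def resolveMaster(props: scala.collection.Map[String, String],
                    default: String = "local[*]"): String =
    props.get("spark.master").filter(_.nonEmpty).getOrElse(default)
}

// Intended use inside the application (requires the Spark dependency):
//   val conf = new SparkConf().setAppName("Spark Pi")
//     .setMaster(MasterConfig.resolveMaster(sys.props))
//   val spark = new SparkContext(conf)
```

With this in place, running from the IDE with no arguments uses local[*], while a run with -Dspark.master=spark://myhost:7077 uses the cluster.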

StreamingContext couldn't bind to a port used by Java

I have started the Spark master and workers and can easily run a MapReduce-style word count on HDFS.
Now I want to run streaming on a text stream, but when I try to create a new StreamingContext I get this error:
scala> val ssc = new StreamingContext("spark://master:7077","test", Seconds(2))
13/07/17 11:13:45 INFO slf4j.Slf4jEventHandler: Slf4jEventHandler started
org.jboss.netty.channel.ChannelException: Failed to bind to: /192.168.2.105:48594
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:298)
....
I checked the port and it was used by Java. I killed the process and I got out of Spark-shell.
Is there any way I can change the StreamingContext's port to a random free port?
Java is the underlying process for Spark (Scala runs on the JVM). It is possible that you have multiple copies of Spark/Spark Streaming running. Can you look into that?
Specifically: I get the same result if I already have a spark-shell running.
You can check for other Spark processes:
ps -ef | grep spark | grep -v grep