Simple Spark program eats all resources - scala

I have a server running a Spark master and a slave. Spark was built manually with the following flags:
build/mvn -Pyarn -Phadoop-2.6 -Dscala-2.11 -DskipTests clean package
I'm trying to execute the following simple program remotely:
import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]) {
  val conf = new SparkConf().setAppName("testApp").setMaster("spark://sparkserver:7077")
  val sc = new SparkContext(conf)
  println(sc.parallelize(Array(1, 2, 3)).reduce((a, b) => a + b))
}
Spark dependency:
"org.apache.spark" %% "spark-core" % "1.6.1"
Log on program executing:
16/04/12 18:45:46 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
My cluster Web UI (screenshot omitted):
Why does such a simple application use all available resources?
P.S. I also noticed that if I allocate more memory for my app (e.g. 10 GB), the following log lines appear many times:
16/04/12 19:23:40 INFO AppClient$ClientEndpoint: Executor updated: app-20160412182336-0008/208 is now RUNNING
16/04/12 19:23:40 INFO AppClient$ClientEndpoint: Executor updated: app-20160412182336-0008/208 is now EXITED (Command exited with code 1)
I think the reason lies in the connection between the master and the slave. This is how I set up the master and slave (on the same machine):
sbin/start-master.sh
sbin/start-slave.sh spark://sparkserver:7077
P.P.S. When I connect to the Spark master with spark-shell, everything works fine:
spark-shell --master spark://sparkserver:7077

By default, YARN will allocate all "available" resources if YARN dynamic resource allocation is set to true and your job still has queued tasks. You can also look at your YARN configuration, namely the number of executors and the memory allocated to each one, and tune them according to your needs.
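If the job does run on YARN, a minimal sketch of the tuning this answer refers to, set programmatically (the property names are standard Spark settings; the values and the yarn-client master are placeholders to adjust to your cluster):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: cap executor resources explicitly instead of letting dynamic
// allocation take everything that is free (values are placeholders).
val conf = new SparkConf()
  .setAppName("testApp")
  .setMaster("yarn-client")                        // placeholder: YARN deployment, Spark 1.6 style
  .set("spark.dynamicAllocation.enabled", "false") // do not grab every free container
  .set("spark.executor.instances", "2")            // number of executors
  .set("spark.executor.memory", "2g")              // memory per executor
  .set("spark.executor.cores", "2")                // cores per executor
val sc = new SparkContext(conf)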

In the file spark-defaults.conf, set: spark.cores.max=4
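For a standalone cluster like the one in the question, the same cap can also be set on the SparkConf directly; a minimal sketch (spark.cores.max and spark.executor.memory are standard settings; the values are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: limit how many cores and how much memory this application may take
// from the standalone cluster.
val conf = new SparkConf()
  .setAppName("testApp")
  .setMaster("spark://sparkserver:7077")
  .set("spark.cores.max", "4")         // total cores for this app across the cluster
  .set("spark.executor.memory", "1g")  // memory per executor (placeholder value)
val sc = new SparkContext(conf)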

It was a driver issue. The driver (my Scala app) was running on my local computer, and the workers had no access to it. As a result, all resources were consumed by repeated attempts to connect back to the driver.
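If the driver has to stay on the local computer, a minimal sketch of making it reachable from the workers (spark.driver.host and spark.driver.port are standard settings; the address and port below are placeholders and must be reachable from the cluster and open in any firewall):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: advertise an address and port the workers can actually reach back to.
val conf = new SparkConf()
  .setAppName("testApp")
  .setMaster("spark://sparkserver:7077")
  .set("spark.driver.host", "192.168.1.10") // placeholder: an IP of this machine visible to the workers
  .set("spark.driver.port", "51000")        // placeholder: a fixed port open in the firewall
val sc = new SparkContext(conf)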

Related

Can a PySpark Kernel(JupyterHub) run in yarn-client mode?

My Current Setup:
Spark EC2 Cluster with HDFS and YARN
JupyterHub (0.7.0)
PySpark Kernel with python27
The very simple code that I am using for this question:
rdd = sc.parallelize([1, 2])
rdd.collect()
The PySpark kernel that works as expected in Spark standalone has the following environment variable in the kernel json file:
"PYSPARK_SUBMIT_ARGS": "--master spark://<spark_master>:7077 pyspark-shell"
However, when I try to run in yarn-client mode it gets stuck forever, while the output from the JupyterHub logs is:
16/12/12 16:45:21 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:45:36 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:45:51 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:46:06 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
As described here, I have added the HADOOP_CONF_DIR environment variable to point to the directory where the Hadoop configuration is, and changed the PYSPARK_SUBMIT_ARGS --master property to "yarn-client". I can also confirm that there are no other jobs running during this and that the workers are correctly registered.
I am under the impression that it is possible to configure a JupyterHub notebook with a PySpark kernel to run on YARN, as other people have done it. If this is indeed the case, what am I doing wrong?
In order to have PySpark work in YARN mode, you'll have to do some additional configuration:
Configure YARN for a remote connection by copying the hadoop-yarn-server-web-proxy-<version>.jar of your YARN cluster into the <local hadoop directory>/hadoop-<version>/share/hadoop/yarn/ of your Jupyter instance (you need a local Hadoop installation)
Copy the hive-site.xml of your cluster into <local spark directory>/spark-<version>/conf/
Copy the yarn-site.xml of your cluster into <local hadoop directory>/hadoop-<version>/etc/hadoop/
Set environment variables:
export HADOOP_HOME=<local hadoop directory>/hadoop-<version>
export SPARK_HOME=<local spark directory>/spark-<version>
export HADOOP_CONF_DIR=<local hadoop directory>/hadoop-<version>/etc/hadoop
export YARN_CONF_DIR=<local hadoop directory>/hadoop-<version>/etc/hadoop
Now you can create your kernel in the file /usr/local/share/jupyter/kernels/pyspark/kernel.json:
{
  "display_name": "pySpark (Spark 2.1.0)",
  "language": "python",
  "argv": [
    "/opt/conda/envs/python35/bin/python",
    "-m",
    "ipykernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "PYSPARK_PYTHON": "/opt/conda/envs/python35/bin/python",
    "SPARK_HOME": "/opt/mapr/spark/spark-2.1.0",
    "PYTHONPATH": "/opt/mapr/spark/spark-2.1.0/python/lib/py4j-0.10.4-src.zip:/opt/mapr/spark/spark-2.1.0/python/",
    "PYTHONSTARTUP": "/opt/mapr/spark/spark-2.1.0/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master yarn pyspark-shell"
  }
}
Relaunch your JupyterHub and you should see the pyspark kernel. Note that the root user doesn't usually have YARN permission because of its uid; you should connect to JupyterHub as another user.
I hope my case can help you.
I configure the master URL by simply passing a parameter:
import findspark
findspark.init()

from pyspark import SparkContext
sc = SparkContext("yarn-client", "First App")

Checkpoint data corruption in Spark Streaming

I am testing checkpointing and write ahead logs with the basic Spark Streaming code below. I am checkpointing into a local directory. After starting and stopping the application a few times (using Ctrl-C), it refuses to start, due to what looks like data corruption in the checkpoint directory. I am getting:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 80.0 failed 1 times, most recent failure: Lost task 0.0 in stage 80.0 (TID 17, localhost): com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 13994
at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
Full code:
import org.apache.hadoop.conf.Configuration
import org.apache.spark._
import org.apache.spark.streaming._

object ProtoDemo {
  def createContext(dirName: String) = {
    val conf = new SparkConf().setAppName("mything")
    conf.set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint(dirName)

    val lines = ssc.socketTextStream("127.0.0.1", 9999)
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    val runningCounts = wordCounts.updateStateByKey[Int] {
      (values: Seq[Int], oldValue: Option[Int]) =>
        val s = values.sum
        Some(oldValue.fold(s)(_ + s))
    }

    // Print the first ten elements of each RDD generated in this DStream to the console
    runningCounts.print()
    ssc
  }

  def main(args: Array[String]) = {
    val hadoopConf = new Configuration()
    val dirName = "/tmp/chkp"
    val ssc = StreamingContext.getOrCreate(dirName, () => createContext(dirName), hadoopConf)
    ssc.start()
    ssc.awaitTermination()
  }
}
Basically, what you are testing is a driver failure scenario. For this to work, depending on the cluster you are running on, you have to follow the instructions below to monitor the driver process and relaunch it if it fails.
Configuring automatic restart of the application driver - To automatically recover from a driver failure, the deployment infrastructure that is used to run the streaming application must monitor the driver process and relaunch the driver if it fails. Different cluster managers have different tools to achieve this.
Spark Standalone - A Spark application driver can be submitted to
run within the Spark Standalone cluster (see cluster deploy
mode), that is, the application driver itself runs on one of the
worker nodes. Furthermore, the Standalone cluster manager can be
instructed to supervise the driver, and relaunch it if the driver
fails either due to non-zero exit code, or due to failure of the
node running the driver. See cluster mode and supervise in the Spark
Standalone guide for more details.
YARN - Yarn supports a similar mechanism for automatically restarting an application. Please refer to YARN documentation for
more details.
Mesos - Marathon has been used to achieve this with Mesos.
You also need to configure write ahead logs as below; there are special instructions for S3 which you need to follow:
While using S3 (or any file system that does not support flushing) for write ahead logs, please remember to enable
spark.streaming.driver.writeAheadLog.closeFileAfterWrite
spark.streaming.receiver.writeAheadLog.closeFileAfterWrite.
See Spark Streaming Configuration for more details.
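A minimal sketch of where those flags would go in the streaming example above (the property names are the ones quoted; the values simply enable them):

import org.apache.spark.SparkConf

// Sketch: the WAL-related settings quoted above, applied to the streaming app's SparkConf.
val conf = new SparkConf()
  .setAppName("mything")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
  // Needed when the checkpoint/WAL directory is on S3 or any FS without flush support:
  .set("spark.streaming.driver.writeAheadLog.closeFileAfterWrite", "true")
  .set("spark.streaming.receiver.writeAheadLog.closeFileAfterWrite", "true")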
The issue looks more like a Kryo serializer issue than checkpoint corruption.
In the code example (including the GitHub project), Kryo serialization is not configured, and since it is not configured, this KryoException should not be able to happen.
When using write ahead logs and restoring from a checkpoint directory, all of the Spark configuration is read from there; in your example, the createContext method is not called when starting from the checkpoint.
I assume another application was tested earlier with the same checkpoint directory, one in which the Kryo serializer was configured, and the current application fails to restore from that checkpoint.
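If Kryo is actually wanted, a minimal sketch of configuring it explicitly inside createContext, so that a fresh run and a checkpoint restore agree on the serializer (spark.serializer is a standard setting; the registered class is only a placeholder). Otherwise, deleting /tmp/chkp before restarting drops the old configuration along with the checkpoint.

import org.apache.spark.SparkConf

// Sketch: make the serializer explicit so it is the same whether the context is
// built fresh by createContext or restored from the checkpoint directory.
val conf = new SparkConf()
  .setAppName("mything")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Optionally register the classes you shuffle/checkpoint; placeholder example below.
  .registerKryoClasses(Array(classOf[String]))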

Running Apache Spark Example Application in IntelliJ Idea

I am trying to run the SparkPi.scala example program in NetBeans. Unfortunately I am quite new to Spark and have not been able to execute it successfully.
My preference is to work in NetBeans only and execute from there. I know Spark also allows executing from the Spark console; however, I prefer not to take that approach.
This is my build.sbt file contents:
name := "SBTScalaSparkPi"
version := "1.0"
scalaVersion := "2.10.6"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
This is my plugins.sbt file contents:
logLevel := Level.Warn
This is the program I am trying to execute:
import scala.math.random

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Spark Pi")
    val spark = new SparkContext(conf)
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = spark.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}
JDK version: 1.8.
The error I get when trying to execute the code is given below:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/03/25 07:50:25 INFO SparkContext: Running Spark version 1.6.1
16/03/25 07:50:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/03/25 07:50:26 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:401)
at SparkPi.main(SparkPi.scala)
16/03/25 07:50:26 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>
at SparkPi$.main(SparkPi.scala:28)
at SparkPi.main(SparkPi.scala)
Process finished with exit code 1
Thanks in advance for any help.
A master URL must be set in your configuration
You must set spark.master in your SparkConf. There are only two mandatory parameters: the master and the app name, and you have already set the latter. For more details, see the Initializing Spark section in the docs.
Which master should you use? See the Master URLs section for all the options. The simplest option for testing is local, which runs an entire Spark system (driver, master, worker) on your local machine, with no extra configuration.
To set the master through the Scala API:
val conf = new SparkConf().setAppName("Spark Pi").setMaster("local")
val spark = new SparkContext(conf)
The start of your program just lacks the URL that points to the Spark master endpoint. You can specify this as a command-line parameter in IntelliJ. The master URL is the URL and port where the Spark master of your cluster is running. An example command-line parameter looks like this:
-Dspark.master=spark://myhost:7077
See the answer to this question for details:
How to set Master address for Spark examples from command line
Perhaps for your first runs you want to just start a local Spark standalone environment. How to get that running is well documented here: http://spark.apache.org/docs/latest/spark-standalone.html
Once you have this running, you can set up your Spark master config like this:
-Dspark.master=spark://localhost:7077
The master URL needs to be set. Using the setMaster("local") method solved the issue.
val conf = new SparkConf().setAppName("Spark Pi").setMaster("local")
val spark = new SparkContext(conf)
As a matter of fact, both @Matthias and @Tzach are right. You should choose your solution based on what is easier for you (maybe prefer the first option for now). As soon as you start running your Spark job on a real cluster, it is far better not to hardcode the "master" parameter, so that you can run your job in multiple cluster modes (YARN, Mesos, standalone with spark-submit) and still keep it running locally with NetBeans (-Dspark.master=local[*]).
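A minimal sketch of that pattern, assuming the master arrives via -Dspark.master or spark-submit and local[*] is only a development fallback:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: take the master from -Dspark.master (or spark-submit) when provided,
// otherwise fall back to a local run for development in the IDE.
val conf = new SparkConf().setAppName("Spark Pi")
if (!conf.contains("spark.master")) {
  conf.setMaster("local[*]") // fallback for local development only
}
val spark = new SparkContext(conf)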

Spark atop of Docker not accepting jobs

I'm trying to make a hello world example work with spark+docker, and here is my code.
import org.apache.spark.SparkContext

object Generic {
  def main(args: Array[String]) {
    val sc = new SparkContext("spark://172.17.0.3:7077", "Generic", "/opt/spark-0.9.0")

    val NUM_SAMPLES = 100000
    val count = sc.parallelize(1 to NUM_SAMPLES).map { i =>
      val x = Math.random * 2 - 1
      val y = Math.random * 2 - 1
      if (x * x + y * y < 1) 1.0 else 0.0
    }.reduce(_ + _)
    println("Pi is roughly " + 4 * count / NUM_SAMPLES)
  }
}
When I run sbt run, I get
14/05/28 15:19:58 INFO client.AppClient$ClientActor: Connecting to master spark://172.17.0.3:7077...
14/05/28 15:20:08 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
I checked both the cluster UI, where I have 3 nodes that each have 1.5g of memory, and the namenode UI, where I see the same thing.
The docker logs show no output from the workers and the following from the master
14/05/28 21:20:38 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster#master:7077] -> [akka.tcp://spark#10.0.3.1:48085]: Error [Association failed with [akka.tcp://spark#10.0.3.1:48085]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark#10.0.3.1:48085]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: /10.0.3.1:48085
]
This happens a couple of times, and then the program times out and dies with
[error] (run-main-0) org.apache.spark.SparkException: Job aborted: Spark cluster looks down
When I do a tcpdump over the docker0 interface, it looks like the workers and the master node are talking.
However, the Spark console works: if I set sc as val sc = new SparkContext("local", "Generic", System.getenv("SPARK_HOME")), the program runs.
I've been there. The issue looks like the Akka actor subsystem in Spark is binding to a different interface than docker0.
While your master IP is spark://172.17.0.3:7077,
Akka is binding to akka.tcp://spark#10.0.3.1:48085.
If your masters/slaves are Docker containers, they should be communicating through the docker0 interface in the 172.17.x.x range.
Try providing the master and slaves with their correct local IP using the environment variable SPARK_LOCAL_IP. See the configuration docs for details.
In our docker setup for Spark 0.9 we are using this command to start the slaves:
${SPARK_HOME}/bin/spark-class org.apache.spark.deploy.worker.Worker $MASTER_IP -i $LOCAL_IP
Which directly provides the local IP to the worker.
For running Spark on Docker it's crucial to:
Expose all necessary ports
Set the correct spark.broadcast.factory
Handle Docker aliases
Without handling all three issues, the Spark cluster parts (master, worker, driver) can't communicate. You can read about each issue in detail at http://sometechshit.blogspot.ru/2015/04/running-spark-standalone-cluster-in.html or use a Spark-ready container from https://registry.hub.docker.com/u/epahomov/docker-spark/
If you are on a Windows host, check the firewall and make sure java.exe is allowed to access the public network, or change DockerNAT to private. In general, the worker must be able to connect back to the driver (the program you submitted).

StreamingContext couldn't bind to a port used by Java

I have started the Spark master and workers and can easily run a MapReduce job like wordcount on HDFS.
Now I want to run streaming on a text stream, but when I try to create a new StreamingContext I get this error:
scala> val ssc = new StreamingContext("spark://master:7077","test", Seconds(2))
13/07/17 11:13:45 INFO slf4j.Slf4jEventHandler: Slf4jEventHandler started
org.jboss.netty.channel.ChannelException: Failed to bind to: /192.168.2.105:48594
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:298)
....
I checked the port and it was being used by Java. I killed the process and exited spark-shell.
Is there any way I can change the StreamingContext's port to a random free port?
Java is the underlying process for Spark (Scala runs on the JVM). It is possible that you have multiple copies of Spark / Spark Streaming running. Can you look into that?
Specifically: I get the same result if I have a spark-shell already running.
You can check for other Spark processes:
ps -ef | grep spark | grep -v grep
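As for the original question about the port: assuming the address that failed to bind is the driver's actor-system port, a minimal sketch of building the StreamingContext from an explicit SparkConf so that port can be controlled (spark.driver.port is a standard setting; 0 asks Spark to pick a free port):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: use a SparkConf instead of the (master, appName) constructor so the
// driver port can be set explicitly (0 = let Spark pick a random free port).
val conf = new SparkConf()
  .setMaster("spark://master:7077")
  .setAppName("test")
  .set("spark.driver.port", "0")
val ssc = new StreamingContext(conf, Seconds(2))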