Unable to submit Spark job to YARN cluster using Scala

I am trying to submit a Spark job through the SparkSubmit class in a Scala application, from my local Windows machine to a remote YARN cluster, but it always tries to connect to the ResourceManager at 0.0.0.0.
import org.apache.spark.deploy.SparkSubmit

val args = Array(
  "--master", "yarn",
  "--verbose",
  "--class", "application-class",
  "--num-executors", "1",
  "--executor-cores", "1",
  "--executor-memory", "10g",
  "--deploy-mode", "cluster",
  "--driver-memory", "10g",
  "path-to-jar", "1") // application JAR followed by its single argument
SparkSubmit.main(args)
Below is the error:
Failed to connect to server: 0.0.0.0/0.0.0.0:8032: retries get failed due to exceeded maximum allowed retries number: 10
When I submit the Spark job from the Command Prompt/Windows shell with the same arguments as in Scala, it works fine and the job is submitted to the cluster.
I have already set HADOOP_CONF_DIR and YARN_CONF_DIR as environment variables, and my yarn-site.xml defines yarn.resourcemanager.address with the remote IP.
Am I missing anything here?
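A possible first check: 0.0.0.0:8032 is the stock default for yarn.resourcemanager.address, so the error suggests that the in-process SparkSubmit call did not load the yarn-site.xml that the command-line launch picked up. The sketch below (Python, standard library only) prints what a configuration directory actually defines. The helper name read_hadoop_property is made up for illustration, and the script only inspects the file; it does not change how Spark loads it.

```python
import os
import xml.etree.ElementTree as ET


def read_hadoop_property(conf_dir, name, filename="yarn-site.xml"):
    """Return a property value from a Hadoop XML config file.

    Returns None if the file or the property is absent.
    """
    path = os.path.join(conf_dir, filename)
    if not os.path.isfile(path):
        return None
    root = ET.parse(path).getroot()
    # Hadoop config files are a flat list of <property><name/><value/></property>.
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


if __name__ == "__main__":
    conf_dir = os.environ.get("HADOOP_CONF_DIR") or os.environ.get("YARN_CONF_DIR")
    if conf_dir is None:
        print("Neither HADOOP_CONF_DIR nor YARN_CONF_DIR is visible to this process")
    else:
        print(read_hadoop_property(conf_dir, "yarn.resourcemanager.address"))
```

Running it from the same environment that launches the Scala application (for example the IDE's run configuration) shows whether the variables are visible to that process at all, which may differ from what the Command Prompt sees.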

Related

Hudi: Access to timeserver times out in embedded mode

I am testing Hudi 0.5.3 (supported by AWS Athena) by running it with Spark in embedded mode, i.e. from unit tests. At first the test succeeded, but now it fails with a timeout when accessing Hudi's timeserver.
The following is based on the Hudi Getting Started guide.
Spark Session setup:
private val spark = addSparkConfigs(SparkSession.builder()
.appName("spark testing")
.master("local"))
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.config("spark.ui.port", "4041")
.enableHiveSupport()
.getOrCreate()
Code which causes timeout exception:
val inserts = convertToStringList(dataGen.generateInserts(10))
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
df.write.format("hudi").
options(getQuickstartWriteConfigs).
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
mode(Overwrite).
save(basePath)
The timeout and the resulting exception:
170762 [Executor task launch worker for task 47] INFO org.apache.hudi.common.table.view.FileSystemViewManager - Creating remote view for basePath /var/folders/z9/_9mf84p97hz1n45b0gnpxlj40000gp/T/HudiQuickStartSpec-hudi_trips_cow2193648737745630661. Server=xxx:59520
170766 [Executor task launch worker for task 47] INFO org.apache.hudi.common.table.view.FileSystemViewManager - Creating InMemory based view for basePath /var/folders/z9/_9mf84p97hz1n45b0gnpxlj40000gp/T/HudiQuickStartSpec-hudi_trips_cow2193648737745630661
170769 [Executor task launch worker for task 47] INFO org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView - Sending request : (http://xxx:59520/v1/hoodie/view/datafiles/beforeoron/latest/?partition=americas%2Funited_states%2Fsan_francisco&maxinstant=20201221180946&basepath=%2Fvar%2Ffolders%2Fz9%2F_9mf84p97hz1n45b0gnpxlj40000gp%2FT%2FHudiQuickStartSpec-hudi_trips_cow2193648737745630661&lastinstantts=20201221180946&timelinehash=70f7aa073fa3d86033278a59cbda71c6488f4883570d826663ebb51934a25abf)
246649 [Executor task launch worker for task 47] ERROR org.apache.hudi.common.table.view.PriorityBasedFileSystemView - Got error running preferred function. Trying secondary
org.apache.hudi.exception.HoodieRemoteException: Connect to xxx:59520 [/xxx] failed: Operation timed out (Connection timed out)
at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestBaseFilesFromParams(RemoteHoodieTableFileSystemView.java:223)
at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestBaseFilesBeforeOrOn(RemoteHoodieTableFileSystemView.java:230)
at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.execute(PriorityBasedFileSystemView.java:97)
at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.getLatestBaseFilesBeforeOrOn(PriorityBasedFileSystemView.java:134)
at org.apache.hudi.index.bloom.HoodieBloomIndex.lambda$loadInvolvedFiles$19c2c1bb$1(HoodieBloomIndex.java:201)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:125)
I couldn't experiment with different ports for the Hudi timeserver because I couldn't find the config setting that controls the port.
Any ideas why access to the timeserver times out?
The problem turned out to be rooted in the way Hudi resolves the Spark driver host. It seems that although Hudi starts its web server and binds it to localhost, its client subsequently uses the machine's IP address to call the server it started:
5240 [pool-1-thread-1-ScalaTest-running-HudiSimpleCdcSpec] INFO io.javalin.Javalin - Starting Javalin ...
5348 [pool-1-thread-1-ScalaTest-running-HudiSimpleCdcSpec] INFO io.javalin.Javalin - Listening on http://localhost:59520/
...
org.apache.hudi.exception.HoodieRemoteException: Connect to xxx:59520 [/xxx] failed: Operation timed out (Connection timed out)
The solution is to configure the "spark.driver.host" setting explicitly. The following worked for me:
private val spark = addSparkConfigs(SparkSession.builder()
.appName("spark testing")
.master("local"))
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.config("spark.driver.host", "localhost")
.config("spark.ui.port", "4041")
.enableHiveSupport()
.getOrCreate()

Spark refuses connection to master

I am trying to set up a small Spark cluster for testing. The cluster consists of 3 workers and one master.
On each node I installed Java, Scala and Spark.
The configuration files are as follows:
spark-defaults.conf:
spark.master spark://test01.scem:7077
spark.eventLog.enabled true
spark.eventLog.dir hdfs://test01.scem/user/spark/applicationHistory
spark.executor.memory 4g
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 5g
spark.yarn.archive hdfs://test01.scem/user/spark
spark-env.sh:
export SPARK_CONF_DIR=/usr/hadoop/spark-2.1.0-bin-hadoop2.7/conf
export SPARK_LOG_DIR=/var/log/spark
export SPARK_PID_DIR=/var/run/spark
export HADOOP_HOME=${HADOOP_HOME:-/usr/hadoop}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/usr/hadoop/etc/hadoop}
I am able to start all nodes with start-all.sh, but I receive an error message when starting the shell (spark-shell).
I tried every method I could find to view the UI for the Spark cluster, but had no luck. Any help, please.
The error message I receive is:
WARN client.StandaloneAppClient$ClientEndpoint: Failed to connect to master test01.scem:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
The jps output on each node is:
Master {18097 JobHistoryServer, 21249 Jps, 20758 NameNode, 20440 ResourceManager}
slaves {11456 JobHistoryServer, 15409 Jps, 15092 DataNode, 14799 NodeManager}
Check whether you can ping the master. If you can, use the netstat command on the master to check whether something is listening on port 7077. If both are true, it may be a firewall issue.
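The port check can also be run from the machine where spark-shell is launched. This is a minimal sketch in Python using only the standard library; the function name can_connect is made up for illustration. A False result does not say whether nothing is listening on the port or a firewall is dropping the traffic, so it complements the netstat check on the master rather than replacing it.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

For the setup in the question, the call would be can_connect("test01.scem", 7077).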

Can a PySpark Kernel(JupyterHub) run in yarn-client mode?

My Current Setup:
Spark EC2 Cluster with HDFS and YARN
JupyterHub (0.7.0)
PySpark Kernel with python27
The very simple code that I am using for this question:
rdd = sc.parallelize([1, 2])
rdd.collect()
The PySpark kernel that works as expected in Spark standalone has the following environment variable in the kernel json file:
"PYSPARK_SUBMIT_ARGS": "--master spark://<spark_master>:7077 pyspark-shell"
However, when I try to run in yarn-client mode, it hangs forever, and the JupyterHub log output is:
16/12/12 16:45:21 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:45:36 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:45:51 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:46:06 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
As described here, I have added the HADOOP_CONF_DIR environment variable, pointing to the directory that holds the Hadoop configuration, and changed the --master property in PYSPARK_SUBMIT_ARGS to "yarn-client". I can also confirm that no other jobs are running during this and that the workers are correctly registered.
I am under the impression that it is possible to configure a JupyterHub notebook with a PySpark kernel to run with YARN, as other people have done it. If this is indeed the case, what am I doing wrong?
To get PySpark working in YARN mode, you'll need some additional configuration:
Configure YARN for remote connections by copying the hadoop-yarn-server-web-proxy-<version>.jar of your YARN cluster into <local hadoop directory>/hadoop-<version>/share/hadoop/yarn/ on your Jupyter instance (you need a local Hadoop installation)
Copy the hive-site.xml of your cluster into <local spark directory>/spark-<version>/conf/
Copy the yarn-site.xml of your cluster into <local hadoop directory>/hadoop-<version>/etc/hadoop/
Set the environment variables:
export HADOOP_HOME=<local hadoop directory>/hadoop-<version>
export SPARK_HOME=<local spark directory>/spark-<version>
export HADOOP_CONF_DIR=<local hadoop directory>/hadoop-<version>/etc/hadoop
export YARN_CONF_DIR=<local hadoop directory>/hadoop-<version>/etc/hadoop
Now, you can create your kernel in file /usr/local/share/jupyter/kernels/pyspark/kernel.json
{
  "display_name": "pySpark (Spark 2.1.0)",
  "language": "python",
  "argv": [
    "/opt/conda/envs/python35/bin/python",
    "-m",
    "ipykernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "PYSPARK_PYTHON": "/opt/conda/envs/python35/bin/python",
    "SPARK_HOME": "/opt/mapr/spark/spark-2.1.0",
    "PYTHONPATH": "/opt/mapr/spark/spark-2.1.0/python/lib/py4j-0.10.4-src.zip:/opt/mapr/spark/spark-2.1.0/python/",
    "PYTHONSTARTUP": "/opt/mapr/spark/spark-2.1.0/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master yarn pyspark-shell"
  }
}
Relaunch JupyterHub and you should see the pyspark kernel. The root user usually doesn't have YARN permission because of its uid (root is uid 0), so connect to JupyterHub as another user.
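Since every path in the kernel file derives from SPARK_HOME and the Python interpreter, the file can also be generated instead of edited by hand. This is a rough sketch, not part of the original setup: the function name build_pyspark_kernel_spec and the display name are made up, and the py4j archive name must match the one shipped with your Spark version.

```python
import json
import posixpath


def build_pyspark_kernel_spec(spark_home, python_bin, py4j_zip, master="yarn"):
    """Build a Jupyter kernel spec (as a dict) for a PySpark kernel."""
    python_dir = posixpath.join(spark_home, "python")
    return {
        "display_name": "pySpark ({})".format(master),
        "language": "python",
        "argv": [python_bin, "-m", "ipykernel", "-f", "{connection_file}"],
        "env": {
            "PYSPARK_PYTHON": python_bin,
            "SPARK_HOME": spark_home,
            # py4j archive first, then the pyspark sources.
            "PYTHONPATH": ":".join(
                [posixpath.join(python_dir, "lib", py4j_zip), python_dir]
            ),
            "PYTHONSTARTUP": posixpath.join(python_dir, "pyspark", "shell.py"),
            "PYSPARK_SUBMIT_ARGS": "--master {} pyspark-shell".format(master),
        },
    }


if __name__ == "__main__":
    spec = build_pyspark_kernel_spec(
        "/opt/mapr/spark/spark-2.1.0",
        "/opt/conda/envs/python35/bin/python",
        "py4j-0.10.4-src.zip",
    )
    print(json.dumps(spec, indent=2))
```

The printed JSON can be saved as the kernel.json file described above.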
I hope my case can help you.
I configure the master URL by simply passing it as a parameter:
import findspark
findspark.init()
from pyspark import SparkContext
sc = SparkContext("yarn-client", "First App")

Simple Spark program eats all resources

I have a server running a Spark master and a slave. Spark was built manually with the following flags:
build/mvn -Pyarn -Phadoop-2.6 -Dscala-2.11 -DskipTests clean package
I'm trying to execute the following simple program remotely:
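import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]): Unit = {
  // The driver runs locally and connects to the remote standalone master.
  val conf = new SparkConf().setAppName("testApp").setMaster("spark://sparkserver:7077")
  val sc = new SparkContext(conf)
  println(sc.parallelize(Array(1, 2, 3)).reduce((a, b) => a + b))
}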
Spark dependency:
"org.apache.spark" %% "spark-core" % "1.6.1"
Log output when the program runs:
16/04/12 18:45:46 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
My cluster WebUI:
Why does such a simple application use all available resources?
P.S. I also noticed that if I allocate more memory to my app (e.g. 10 GB), the following log lines appear many times:
16/04/12 19:23:40 INFO AppClient$ClientEndpoint: Executor updated: app-20160412182336-0008/208 is now RUNNING
16/04/12 19:23:40 INFO AppClient$ClientEndpoint: Executor updated: app-20160412182336-0008/208 is now EXITED (Command exited with code 1)
I think the reason lies in the connection between master and slave. This is how I set up the master and the slave (on the same machine):
sbin/start-master.sh
sbin/start-slave.sh spark://sparkserver:7077
P.P.S. When I connect to the Spark master with spark-shell, everything works:
spark-shell --master spark://sparkserver:7077
By default, YARN will allocate all "available" resources if YARN dynamic resource allocation is set to true and your job still has queued tasks. You can also check your YARN configuration, namely the number of executors and the memory allocated to each one, and tune them to your needs.
In the file spark-defaults.conf, set spark.cores.max=4
It was a driver issue. The driver (my Scala app) was running on my local computer, and the workers had no access to it. As a result, all resources were eaten up by attempts to reconnect to the driver.

Spark atop of Docker not accepting jobs

I'm trying to make a hello world example work with spark+docker, and here is my code.
object Generic {
def main(args: Array[String]) {
val sc = new SparkContext("spark://172.17.0.3:7077", "Generic", "/opt/spark-0.9.0")
val NUM_SAMPLES = 100000
val count = sc.parallelize(1 to NUM_SAMPLES).map{i =>
val x = Math.random * 2 - 1
val y = Math.random * 2 - 1
if (x * x + y * y < 1) 1.0 else 0.0
}.reduce(_ + _)
println("Pi is roughly " + 4 * count / NUM_SAMPLES)
}
}
When I run sbt run, I get
14/05/28 15:19:58 INFO client.AppClient$ClientActor: Connecting to master spark://172.17.0.3:7077...
14/05/28 15:20:08 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
I checked both the cluster UI, where I have 3 nodes that each have 1.5g of memory, and the namenode UI, where I see the same thing.
The docker logs show no output from the workers and the following from the master
14/05/28 21:20:38 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster@master:7077] -> [akka.tcp://spark@10.0.3.1:48085]: Error [Association failed with [akka.tcp://spark@10.0.3.1:48085]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@10.0.3.1:48085]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: /10.0.3.1:48085
]
This happens a couple times, and then the program times out and dies with
[error] (run-main-0) org.apache.spark.SparkException: Job aborted: Spark cluster looks down
When I ran tcpdump on the docker0 interface, it looked like the workers and the master nodes were talking.
However, the Spark console works.
If I set sc as val sc = new SparkContext("local", "Generic", System.getenv("SPARK_HOME")), the program runs.
I've been there. It looks like the Akka actor subsystem in Spark is binding to a different interface than the one Spark uses on docker0.
While your master IP is: spark://172.17.0.3:7077
Akka is binding on: akka.tcp://spark@10.0.3.1:48085
If your masters/slaves are Docker containers, they should be communicating through the docker0 interface in the 172.17.x.x range.
Try providing the master and slaves with their correct local IP using the environment variable SPARK_LOCAL_IP. See the config docs for details.
In our docker setup for Spark 0.9 we are using this command to start the slaves:
${SPARK_HOME}/bin/spark-class org.apache.spark.deploy.worker.Worker $MASTER_IP -i $LOCAL_IP
Which directly provides the local IP to the worker.
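To see which local address a machine would use to reach the master (a candidate value for SPARK_LOCAL_IP), the routing table can be queried without sending any traffic. This is a small sketch in Python using only the standard library; the function name local_ip_towards is made up, and the result reflects only the OS routing decision on the machine where it runs.

```python
import socket


def local_ip_towards(host, port=7077):
    """Return the local IP the OS would use as source address to reach host."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        # connect() on a UDP socket sends no packets; it only selects a route
        # and fixes the local address for this socket.
        s.connect((host, port))
        return s.getsockname()[0]
```

For the setup in the question, local_ip_towards("172.17.0.3") run on the driver machine shows whether the driver would come from a 172.17.x.x address or from another interface such as 10.0.3.1.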
For running Spark on Docker it's crucial to:
Expose all necessary ports
Set the correct spark.broadcast.factory
Handle Docker aliases
Unless all three issues are handled, the parts of the Spark cluster (master, worker, driver) can't communicate. You can read about each issue in detail at http://sometechshit.blogspot.ru/2015/04/running-spark-standalone-cluster-in.html or use a ready-made Spark container from https://registry.hub.docker.com/u/epahomov/docker-spark/
If you are on a Windows host, check the firewall and make sure java.exe is allowed to access the public network, or change DockerNAT to private. In general, the worker must be able to connect back to the driver (the program you submitted).