Can a PySpark kernel (JupyterHub) run in yarn-client mode? - pyspark

My Current Setup:
Spark EC2 Cluster with HDFS and YARN
JupyterHub (0.7.0)
PySpark Kernel with python27
The very simple code that I am using for this question:
rdd = sc.parallelize([1, 2])
rdd.collect()
The PySpark kernel that works as expected in Spark standalone has the following environment variable in the kernel json file:
"PYSPARK_SUBMIT_ARGS": "--master spark://<spark_master>:7077 pyspark-shell"
However, when I try to run in yarn-client mode, it hangs forever, and the JupyterHub logs show:
16/12/12 16:45:21 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:45:36 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:45:51 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:46:06 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
As described here, I have added the HADOOP_CONF_DIR environment variable pointing to the directory that holds the Hadoop configuration, and changed the --master property in PYSPARK_SUBMIT_ARGS to "yarn-client". I can also confirm that no other jobs are running during this and that the workers are correctly registered.
I am under the impression that a JupyterHub notebook with a PySpark kernel can be configured to run on YARN, since other people have done it. If that is indeed the case, what am I doing wrong?

To get PySpark working in YARN mode, you'll need some additional configuration:
Configure YARN for remote connections by copying the hadoop-yarn-server-web-proxy-<version>.jar of your YARN cluster into <local hadoop directory>/hadoop-<version>/share/hadoop/yarn/ on your Jupyter instance (you need a local Hadoop installation).
Copy the hive-site.xml of your cluster into <local spark directory>/spark-<version>/conf/
Copy the yarn-site.xml of your cluster into <local hadoop directory>/hadoop-<version>/etc/hadoop/
Set environment variables:
export HADOOP_HOME=<local hadoop directory>/hadoop-<version>
export SPARK_HOME=<local spark directory>/spark-<version>
export HADOOP_CONF_DIR=<local hadoop directory>/hadoop-<version>/etc/hadoop
export YARN_CONF_DIR=<local hadoop directory>/hadoop-<version>/etc/hadoop
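Before moving on, it can be worth checking that the directory behind HADOOP_CONF_DIR really contains the files you copied. This is only a rough sketch in Python: the helper name missing_hadoop_configs and the list of expected files are my own assumptions for illustration, not anything Spark or Hadoop provides.

```python
import os

# Files a YARN client usually needs; this list is an assumption,
# adjust it for your distribution.
EXPECTED_FILES = ("core-site.xml", "yarn-site.xml")


def missing_hadoop_configs(conf_dir, expected=EXPECTED_FILES):
    """Return the expected config files that are absent from conf_dir."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(conf_dir, name))]


if __name__ == "__main__":
    conf_dir = os.environ.get("HADOOP_CONF_DIR", "")
    print("HADOOP_CONF_DIR =", conf_dir or "<unset>")
    print("missing:", missing_hadoop_configs(conf_dir))
```

Run it as the same user and in the same environment that JupyterHub uses to spawn kernels, since a shell export is not necessarily visible to the kernel process.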
Now you can create your kernel in the file /usr/local/share/jupyter/kernels/pyspark/kernel.json:
{
  "display_name": "pySpark (Spark 2.1.0)",
  "language": "python",
  "argv": [
    "/opt/conda/envs/python35/bin/python",
    "-m",
    "ipykernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "PYSPARK_PYTHON": "/opt/conda/envs/python35/bin/python",
    "SPARK_HOME": "/opt/mapr/spark/spark-2.1.0",
    "PYTHONPATH": "/opt/mapr/spark/spark-2.1.0/python/lib/py4j-0.10.4-src.zip:/opt/mapr/spark/spark-2.1.0/python/",
    "PYTHONSTARTUP": "/opt/mapr/spark/spark-2.1.0/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master yarn pyspark-shell"
  }
}
Restart JupyterHub and you should see the pyspark kernel. The root user usually doesn't have YARN permission because of its uid (0), so connect to JupyterHub as another user.
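If you maintain several kernels, the kernel.json can also be generated rather than written by hand. A minimal sketch, assuming the same layout as the file above: the function name build_pyspark_kernel_spec is hypothetical, the /etc/hadoop/conf path is a placeholder, and putting HADOOP_CONF_DIR into the kernel's env block is my own assumption (the original file relies on the exported variables instead).

```python
import json


def build_pyspark_kernel_spec(spark_home, python_bin, py4j_zip, hadoop_conf_dir,
                              master="yarn"):
    """Build a Jupyter kernel spec dict for a PySpark kernel that talks to YARN."""
    return {
        "display_name": "pySpark (YARN)",
        "language": "python",
        "argv": [python_bin, "-m", "ipykernel", "-f", "{connection_file}"],
        "env": {
            "PYSPARK_PYTHON": python_bin,
            "SPARK_HOME": spark_home,
            # Assumption: set here so the kernel sees it even if the
            # JupyterHub service does not inherit shell exports.
            "HADOOP_CONF_DIR": hadoop_conf_dir,
            "PYTHONPATH": "{0}/python/lib/{1}:{0}/python/".format(spark_home, py4j_zip),
            "PYTHONSTARTUP": spark_home + "/python/pyspark/shell.py",
            "PYSPARK_SUBMIT_ARGS": "--master {} pyspark-shell".format(master),
        },
    }


if __name__ == "__main__":
    spec = build_pyspark_kernel_spec(
        spark_home="/opt/mapr/spark/spark-2.1.0",
        python_bin="/opt/conda/envs/python35/bin/python",
        py4j_zip="py4j-0.10.4-src.zip",
        hadoop_conf_dir="/etc/hadoop/conf",  # hypothetical path
    )
    print(json.dumps(spec, indent=2))
```

Redirect the output into kernel.json in the kernel directory. The py4j zip name changes between Spark releases, so check $SPARK_HOME/python/lib for the exact file.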

I hope my case can help you.
I configure the master URL by simply passing it as a parameter:
import findspark
findspark.init()
from pyspark import SparkContext
sc = SparkContext("yarn-client", "First App")
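Note that since Spark 2.0 the "yarn-client" master string is deprecated in favour of master "yarn" plus a deploy mode. An alternative to passing the master to SparkContext is to set PYSPARK_SUBMIT_ARGS from the notebook itself. A small sketch, where the helper name pyspark_submit_args is hypothetical:

```python
import os


def pyspark_submit_args(master="yarn", deploy_mode="client", extra=()):
    """Compose a PYSPARK_SUBMIT_ARGS value; it has to end with 'pyspark-shell'."""
    parts = ["--master", master, "--deploy-mode", deploy_mode]
    parts.extend(extra)
    parts.append("pyspark-shell")
    return " ".join(parts)


# Must be set before the first SparkContext is created in this process.
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args()
print(os.environ["PYSPARK_SUBMIT_ARGS"])
# prints: --master yarn --deploy-mode client pyspark-shell
```

After this, findspark.init() and SparkContext(appName="First App") can be called as in the snippet above, without a master argument.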

Related

Spark files not found in cluster deploy mode

I'm trying to run a Spark job in cluster deploy mode by issuing in the EMR cluster master node:
spark-submit --master yarn \
--deploy-mode cluster \
--files truststore.jks,kafka.properties,program.properties \
--class com.someOrg.somePackage.someClass s3://someBucket/someJar.jar kafka.properties program.properties
I'm getting the following error, which states that the file can not be found at the Spark executor working directory:
//This is me printing the Spark executor working directory through SparkFiles.getRootDirectory()
20/07/03 17:53:40 INFO Program$: This is the path: /mnt1/yarn/usercache/hadoop/appcache/application_1593796195404_0011/spark-46b7fe4d-ba16-452a-a5a7-fbbab740bf1e/userFiles-9c6d4cae-2261-43e8-8046-e49683f9fd3e
//This is me trying to list the content for that working directory, which turns out empty.
20/07/03 17:53:40 INFO Program$: This is the content for the path:
//This is me getting the error:
20/07/03 17:53:40 ERROR ApplicationMaster: User class threw exception: java.nio.file.NoSuchFileException: /mnt1/yarn/usercache/hadoop/appcache/application_1593796195404_0011/spark-46b7fe4d-ba16-452a-a5a7-fbbab740bf1e/userFiles-9c6d4cae-2261-43e8-8046-e49683f9fd3e/program.properties
java.nio.file.NoSuchFileException: /mnt1/yarn/usercache/hadoop/appcache/application_1593796195404_0011/spark-46b7fe4d-ba16-452a-a5a7-fbbab740bf1e/userFiles-9c6d4cae-2261-43e8-8046-e49683f9fd3e/program.properties
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361)
at java.nio.file.Files.newByteChannel(Files.java:407)
at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384)
at java.nio.file.Files.newInputStream(Files.java:152)
at com.someOrg.somePackage.someHelpers$.loadPropertiesFromFile(Helpers.scala:142)
at com.someOrg.somePackage.someClass$.main(someClass.scala:33)
at com.someOrg.somePackage.someClass.main(someClass.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:685)
This is the function I use to attempt to read the properties files passed as arguments:
def loadPropertiesFromFile(path: String): Properties = {
val inputStream = Files.newInputStream(Paths.get(path), StandardOpenOption.READ)
val properties = new Properties()
properties.load(inputStream)
properties
}
Invoked as:
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val kafkaProperties = loadPropertiesFromFile(SparkFiles.get(args(1)))
val programProperties = loadPropertiesFromFile(SparkFiles.get(args(2)))
//Also tried loadPropertiesFromFile(args({1,2}))
The program works as expected when issued with client deploy mode:
spark-submit --master yarn \
--deploy-mode client \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 \
--files truststore.jks program.jar com.someOrg.somePackage.someClass kafka.properties program.properties
This happens in Spark 2.4.5 / EMR 5.30.1.
Additionally, when I try to configure this job as an EMR step, it does not work even in client mode. Any clue on how the resource files passed through the --files option are managed/persisted/made available in EMR?
Option 1: Put those files in S3 and pass the S3 path.
Option 2: Copy those files to a specific location on each node (using a bootstrap action) and pass the absolute paths of the files.
Solved with suggestions from the above comments:
spark-submit --master yarn \
--deploy-mode cluster \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 \
--files s3://someBucket/resources/truststore.jks,s3://someBucket/resources/kafka.properties,s3://someBucket/resources/program.properties \
--class com.someOrg.someClass.someMain \
s3://someBucket/resources/program.jar kafka.properties program.properties
I was previously assuming that in cluster deploy mode the files under --files were also shipped alongside the driver deployed to a worker node (and thereby available in the working directory), if accessible from the machine where spark-submit is issued.
Bottom line: regardless of where you issue spark-submit from and whether the files are available on that machine, in cluster mode you must ensure that the files are accessible from every worker node.
It is now working with the file locations pointing to S3.
Thank you all!
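For anyone scripting this (for example when generating EMR steps), the working command can be assembled programmatically. A rough sketch in Python, assuming every resource lives under one shared location; the function name build_submit_command is hypothetical, and the rule "pass the *.properties base names as program arguments" simply mirrors the command above.

```python
import posixpath
from urllib.parse import urlparse


def build_submit_command(jar, main_class, files, packages=()):
    """Return a spark-submit argv list for YARN cluster mode."""
    cmd = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster"]
    if packages:
        cmd += ["--packages", ",".join(packages)]
    # --files takes one comma-separated value, with no spaces.
    cmd += ["--files", ",".join(files), "--class", main_class, jar]
    # The program receives bare file names, not the full S3 URIs.
    cmd += [posixpath.basename(urlparse(f).path)
            for f in files if f.endswith(".properties")]
    return cmd


if __name__ == "__main__":
    base = "s3://someBucket/resources/"
    print(" ".join(build_submit_command(
        jar=base + "program.jar",
        main_class="com.someOrg.someClass.someMain",
        files=[base + "truststore.jks", base + "kafka.properties",
               base + "program.properties"],
        packages=["org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5"],
    )))
```

The list form can be handed to subprocess.run directly, which avoids shell quoting problems.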

Unable to submit Spark job to yarn cluster using Scala

I am trying to submit a Spark job through the SparkSubmit class in a Scala application from my local Windows machine to a remote YARN cluster, but Spark always tries to connect to the ResourceManager at 0.0.0.0.
val args = Array(
"--master", "yarn",
"--verbose",
"--class", "application-class",
"--num-executors", "1",
"--executor-cores", "1",
"--executor-memory", "10g",
"--deploy-mode", "cluster",
"--driver-memory", "10g",
"path-to-jar", "1")
SparkSubmit.main(args)
Below is the error
Failed to connect to server: 0.0.0.0/0.0.0.0:8032: retries get failed due to exceeded maximum allowed retries number: 10
When I submit the Spark job from the Command Prompt/Windows shell with the same arguments as in Scala, it works fine and the job is submitted to the cluster.
I already have HADOOP_CONF_DIR and YARN_CONF_DIR set as environment variables, and my yarn-site.xml defines yarn.resourcemanager.address with the remote IP.
Am I missing anything here?
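One observation that may help narrow this down: 0.0.0.0:8032 is the built-in default for yarn.resourcemanager.address, so the error suggests that the JVM running SparkSubmit.main never loaded yarn-site.xml (for instance because the configuration directory is not on that application's classpath). That is a guess, not a confirmed diagnosis. To rule out the file itself, a small Python sketch can print what it actually defines; the helper name hadoop_property and the sample address are made up for illustration.

```python
import xml.etree.ElementTree as ET


def hadoop_property(xml_text, name):
    """Return the value of property `name` from Hadoop-style XML, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


# Hypothetical sample; in practice read %YARN_CONF_DIR%\yarn-site.xml instead.
SAMPLE = """<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>10.0.0.5:8032</value>
  </property>
</configuration>"""

print(hadoop_property(SAMPLE, "yarn.resourcemanager.address"))
# prints: 10.0.0.5:8032
```

If the file reports the expected address, the remaining question is whether the Scala application sees the same environment variables and classpath as the shell where spark-submit works.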

Spark refuses connection to master

I am trying to set up a small Spark cluster for testing. The cluster consists of 3 workers and one master.
On each node I set up Java, Scala and Spark.
The configuration files are as follows:
spark-defaults.conf:
spark.master spark://test01.scem:7077
spark.eventLog.enabled true
spark.eventLog.dir hdfs://test01.scem/user/spark/applicationHistory
spark.executor.memory 4g
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 5g
spark.yarn.archive hdfs://test01.scem/user/spark
spark-env.sh
export SPARK_CONF_DIR=/usr/hadoop/spark-2.1.0-bin-hadoop2.7/conf
export SPARK_LOG_DIR=/var/log/spark
export SPARK_PID_DIR=/var/run/spark
export HADOOP_HOME=${HADOOP_HOME:-/usr/hadoop}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/usr/hadoop/etc/hadoop}
I am able to start all nodes (with start-all.sh), but I receive an error message when starting the shell (spark-shell).
I tried all available methods to view the UI of the Spark cluster, but no luck. Any help, please?
The error message I receive is:
WARN client.StandaloneAppClient$ClientEndpoint: Failed to connect to master test01.scem:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
The jps output of each node is:
Master {18097 JobHistoryServer, 21249 Jps, 20758 NameNode, 20440 ResourceManager}
slaves {11456 JobHistoryServer, 15409 Jps, 15092 DataNode, 14799 NodeManager}
Check whether you can ping the master. If you can, use the netstat command on the master to check whether something is listening on port 7077. If both are true, it may be a firewall issue.
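The port check can also be done from a worker node without netstat. A minimal sketch using only the Python standard library; the function name can_connect is my own, and the host and port are the ones from the question.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    print(can_connect("test01.scem", 7077))
```

False from a worker combined with True on the master itself would point towards a firewall or bind-address problem, while False everywhere suggests nothing is listening on that port.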

Failed to submit local jar to spark cluster: java.nio.file.NoSuchFileException

~/spark/spark-2.1.1-bin-hadoop2.7/bin$ ./spark-submit --master spark://192.168.42.80:32141 --deploy-mode cluster file:///home/me/workspace/myproj/target/scala-2.11/myproj-assembly-0.1.0.jar
Running Spark using the REST application submission protocol.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/06/20 16:41:30 INFO RestSubmissionClient: Submitting a request to launch an application in spark://192.168.42.80:32141.
17/06/20 16:41:31 INFO RestSubmissionClient: Submission successfully created as driver-20170620204130-0005. Polling submission state...
17/06/20 16:41:31 INFO RestSubmissionClient: Submitting a request for the status of submission driver-20170620204130-0005 in spark://192.168.42.80:32141.
17/06/20 16:41:31 INFO RestSubmissionClient: State of driver driver-20170620204130-0005 is now ERROR.
17/06/20 16:41:31 INFO RestSubmissionClient: Driver is running on worker worker-20170620203037-172.17.0.5-45429 at 172.17.0.5:45429.
17/06/20 16:41:31 ERROR RestSubmissionClient: Exception from the cluster:
java.nio.file.NoSuchFileException: /home/me/workspace/myproj/target/scala-2.11/myproj-assembly-0.1.0.jar
sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:526)
sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253)
java.nio.file.Files.copy(Files.java:1274)
org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:608)
org.apache.spark.util.Utils$.copyFile(Utils.scala:579)
org.apache.spark.util.Utils$.doFetchFile(Utils.scala:664)
org.apache.spark.util.Utils$.fetchFile(Utils.scala:463)
org.apache.spark.deploy.worker.DriverRunner.downloadUserJar(DriverRunner.scala:154)
org.apache.spark.deploy.worker.DriverRunner.prepareAndRunDriver(DriverRunner.scala:172)
org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:91)
17/06/20 16:41:31 INFO RestSubmissionClient: Server responded with CreateSubmissionResponse:
{
"action" : "CreateSubmissionResponse",
"message" : "Driver successfully submitted as driver-20170620204130-0005",
"serverSparkVersion" : "2.1.1",
"submissionId" : "driver-20170620204130-0005",
"success" : true
}
Log from spark-worker:
2017-06-20T20:41:30.807403232Z 17/06/20 20:41:30 INFO Worker: Asked to launch driver driver-20170620204130-0005
2017-06-20T20:41:30.817248508Z 17/06/20 20:41:30 INFO DriverRunner: Copying user jar file:///home/me/workspace/myproj/target/scala-2.11/myproj-assembly-0.1.0.jar to /opt/spark/work/driver-20170620204130-0005/myproj-assembly-0.1.0.jar
2017-06-20T20:41:30.883645747Z 17/06/20 20:41:30 INFO Utils: Copying /home/me/workspace/myproj/target/scala-2.11/myproj-assembly-0.1.0.jar to /opt/spark/work/driver-20170620204130-0005/myproj-assembly-0.1.0.jar
2017-06-20T20:41:30.885217508Z 17/06/20 20:41:30 INFO DriverRunner: Killing driver process!
2017-06-20T20:41:30.885694618Z 17/06/20 20:41:30 WARN Worker: Driver driver-20170620204130-0005 failed with unrecoverable exception: java.nio.file.NoSuchFileException: home/me/workspace/myproj/target/scala-2.11/myproj-assembly-0.1.0.jar
Any idea why? Thanks
UPDATE
Is the following command right?
./spark-submit --master spark://192.168.42.80:32141 --deploy-mode cluster file:///home/me/workspace/myproj/target/scala-2.11/myproj-assembly-0.1.0.jar
UPDATE
I think I understand a little more about Spark now, and why I had this problem as well as the spark-submit error: ClassNotFoundException. The key point is that, although the word REST appears here (REST URL: spark://127.0.1.1:6066 (cluster mode)), the application jar is not uploaded to the cluster on submission, which differs from what I had understood. So the Spark cluster cannot find the application jar and cannot load the main class.
I will try to find out how to set up the Spark cluster and use cluster mode to submit the application. I have no idea whether client mode will use more resources for streaming jobs.
> UPDATE
> I think I understand a little more about Spark now, and why I had this problem as well as the spark-submit error: ClassNotFoundException. The key point is that, although the word REST appears here (REST URL: spark://127.0.1.1:6066 (cluster mode)), the application jar is not uploaded to the cluster on submission, which differs from what I had understood. So the Spark cluster cannot find the application jar and cannot load the main class.
That's why you have to place the jar file on the master node OR put it into HDFS before running spark-submit.
This is how to do it:
1.) Transferring the file to the master node with the scp command:
$ scp <file> <username>@<IP address or hostname>:<Destination>
For example:
$ scp mytext.txt tom@128.140.133.124:~/
2.) Transferring the file to HDFS:
$ hdfs dfs -put mytext.txt
Hope I could help you.
You are submitting the application in cluster mode, which means the Spark driver will be created somewhere in the cluster, and the file must exist there.
That's why, with Spark, it's recommended to use a distributed file system such as HDFS or S3.
In standalone cluster mode, the jar should be put on HDFS, because the driver may run on any node in the cluster.
hdfs dfs -put xxx.jar /user/
spark-submit --master spark://xxx:7077 \
--deploy-mode cluster \
--supervise \
--driver-memory 512m \
--total-executor-cores 1 \
--executor-memory 512m \
--executor-cores 1 \
--class com.xiyou.bi.streaming.game.common.DmMoGameviewOnlineLogic \
hdfs://xxx:8020/user/hutao/xxx.jar
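A quick guard against this mistake is to check the jar URI before submitting in cluster mode. A rough Python sketch; the function name is_shared_location is hypothetical, and the set of schemes treated as "shared" is my own assumption.

```python
from urllib.parse import urlparse

# Schemes treated as shared storage here; this set is an assumption
# for the sketch, extend it to match your cluster.
SHARED_SCHEMES = {"hdfs", "s3", "s3a", "s3n", "http", "https"}


def is_shared_location(uri):
    """Return True if the URI scheme points at storage that any node can reach."""
    return urlparse(uri).scheme.lower() in SHARED_SCHEMES


if __name__ == "__main__":
    for uri in ("file:///home/me/workspace/myproj/target/scala-2.11/myproj-assembly-0.1.0.jar",
                "hdfs://xxx:8020/user/hutao/xxx.jar"):
        print(uri, "->", is_shared_location(uri))
```

A file:// or scheme-less path is only safe in cluster mode if the same file exists at the same path on every node that might host the driver.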

Simple Spark program eats all resources

I have a server running a Spark master and slave. Spark was built manually with the following flags:
build/mvn -Pyarn -Phadoop-2.6 -Dscala-2.11 -DskipTests clean package
I'm trying to run the following simple program remotely:
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("testApp").setMaster("spark://sparkserver:7077")
val sc = new SparkContext(conf)
println(sc.parallelize(Array(1,2,3)).reduce((a, b) => a + b))
}
Spark dependency:
"org.apache.spark" %% "spark-core" % "1.6.1"
Log output when the program runs:
16/04/12 18:45:46 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
My cluster WebUI:
Why does such a simple application use all available resources?
P.S. I also noticed that if I allocate more memory to my app (e.g. 10 GB), the following log lines appear many times:
16/04/12 19:23:40 INFO AppClient$ClientEndpoint: Executor updated: app-20160412182336-0008/208 is now RUNNING
16/04/12 19:23:40 INFO AppClient$ClientEndpoint: Executor updated: app-20160412182336-0008/208 is now EXITED (Command exited with code 1)
I think the reason lies in the connection between master and slave. This is how I set up the master and slave (on the same machine):
sbin/start-master.sh
sbin/start-slave.sh spark://sparkserver:7077
P.P.S. When I connect to the Spark master with spark-shell, all is good:
spark-shell --master spark://sparkserver:7077
By default, YARN will allocate all "available" resources if dynamic resource allocation is set to true and your job still has queued tasks. You can also check your YARN configuration, namely the number of executors and the memory allocated to each one, and tune them to your needs.
In the file spark-defaults.conf, set: spark.cores.max=4
It was a driver issue. The driver (my Scala app) was running on my local computer, and the workers had no access to it. As a result, all resources were eaten up by attempts to reconnect to the driver.
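When the driver has to stay outside the cluster, one thing that sometimes helps is advertising an address the workers can reach via spark.driver.host, together with a core cap via spark.cores.max. Whether that is sufficient depends on routing and firewalls between the machines, so treat this as a sketch only; the helper name driver_conf_args and the example IP are made up.

```python
def driver_conf_args(driver_host, cores_max=None):
    """Build --conf arguments for spark-submit or spark-shell."""
    args = ["--conf", "spark.driver.host={}".format(driver_host)]
    if cores_max is not None:
        # Caps the total cores the application may take on a standalone cluster.
        args += ["--conf", "spark.cores.max={}".format(cores_max)]
    return args


if __name__ == "__main__":
    # 192.168.1.10 is a placeholder for an address the workers can route to.
    print(" ".join(driver_conf_args("192.168.1.10", cores_max=4)))
    # prints: --conf spark.driver.host=192.168.1.10 --conf spark.cores.max=4
```

The same two properties can be set on the SparkConf object in the application instead of on the command line.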