I created a PySpark job and it works perfectly fine when submitted through spark-submit. But when I try it through Oozie, it fails. I suspect the fields I enter have issues. These fields are required for the Spark action in Oozie:
Spark Master : local
Mode : client
Main class : Do I need to enter anything here, since it's Python + Spark code (PySpark)?
Jars/py files : My py module
The stdout log is as below:
=================================================================
>>> Invoking Main class now >>>
Fetching child yarn jobs
tag id : oozie-653992fdf1609a2d4e19a863dff21a1
Child yarn jobs are found -
Spark Action Main class : org.apache.spark.deploy.SparkSubmit
Oozie Spark action configuration
=================================================================
--master
local[*]
--deploy-mode
client
--name
POC1L
--verbose
/user/sachinkerala6174/pgm/poc1l.py
=================================================================
>>> Invoking Spark class now >>>
python: can't open file '/user/sachinkerala6174/pgm/poc1l.py': [Errno 2] No such file or directory
Intercepting System.exit(2)
<<< Invocation of Main class completed <<<
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], exit code [2]
Oozie Launcher failed, finishing Hadoop job gracefully
Oozie Launcher, uploading action data to HDFS sequence file: hdfs://ip-172-31-53-48.ec2.internal:8020/user/sachinkerala6174/oozie-oozi/0000509-170711051319609-oozie-oozi-W/spark-fea0--spark/action-data.seq
Oozie Launcher ends
You don't need to put anything in the "Main class" input. Just add the hdfs:// prefix to the Python file path, and change Master to yarn and Mode to cluster (AFAIR that's required when your source code is on HDFS).
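For reference, a minimal workflow.xml sketch of the resulting Spark action (element names follow the uri:oozie:spark-action schema; the action name, transitions, and the exact HDFS URI are placeholders/assumptions):

<action name="spark-poc">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <master>yarn</master>
        <mode>cluster</mode>
        <name>POC1L</name>
        <jar>hdfs:///user/sachinkerala6174/pgm/poc1l.py</jar>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
</action>

This also explains the original error: with --master local[*] the launcher's Python interpreter looked for /user/sachinkerala6174/pgm/poc1l.py on the launcher node's local filesystem, while the script actually lives in HDFS.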
Related
We have a PySpark-based application and we are doing a spark-submit as shown below. The application is working as expected; however, we are seeing a weird warning message. Is there any way to handle this, or why is it coming up?
Note: The cluster is Azure HDI Cluster.
spark-submit --master yarn --deploy-mode cluster --jars file:/<localpath>/* --py-files pyFiles/__init__.py,pyFiles/<abc>.py,pyFiles/<abd>.py --files files/<env>.properties,files/<config>.json main.py
The warning seen is:
/usr/hdp/current/spark3-client/python/pyspark/context.py:256: RuntimeWarning: Failed to add file
[file:///home/sshuser/project/pyFiles/abc.py] speficied in 'spark.submit.pyFiles' to Python path:
  /mnt/resource/hadoop/yarn/local/usercache/sshuser/filecache/929
  warnings.warn(
The above warning comes for all files, i.e. abc.py, abd.py, etc. (whichever files are passed to --py-files).
I am using MapR 5.2 with Spark version 2.1.0,
and I am running my Spark app JAR in YARN cluster mode.
I have tried all the available options I found, but have been unable to succeed.
This is our production environment, but I need my particular Spark job to pick up the log4j-Driver.properties file that is present in my src/main/resources folder (I also confirmed the property file is present by opening the JAR).
1) Content of My Log File -> log4j-Driver.properties
log4j.rootCategory=DEBUG, FILE
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.appender.FILE=org.apache.log4j.RollingFileAppender
log4j.appender.FILE.File=/users/myuser/logs/Myapp.log
log4j.appender.FILE.ImmediateFlush=true
log4j.appender.FILE.Threshold=debug
log4j.appender.FILE.Append=true
log4j.appender.FILE.MaxFileSize=100MB
log4j.appender.FILE.MaxBackupIndex=10
log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=WARN
# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark_project.jetty=WARN
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
2) My Script for Spark-Submit Command
propertyFile=application.properties
spark-submit --class MyAppDriver \
--conf "spark.driver.extraJavaOptions -Dlog4j.configuration=file:/users/myuser/log4j-Driver.properties" \
--master yarn --deploy-mode cluster \
--files /users/myuser/log4j-Driver.properties,/opt/mapr/spark/spark-2.1.0/conf/hive-site.xml,/users/myuser/application.properties \
/users/myuser/myapp_2.11-1.0.jar $propertyFile
All I need for now is to write my driver logs to the directory mentioned in my properties file (shown above). If I am successful with this, I will try the executor logs as well. But first I need to make the driver log write to my local filesystem (and it's an edge node of our cluster).
/users/myuser/log4j-Driver.properties seems to be the path to the file on your local computer, so you were right to use it for the --files argument.
The problem is that there is no such file on the driver and/or executors, so when you use file:/users/myuser/log4j-Driver.properties as the argument to -Dlog4j.configuration, Log4j will fail to find it.
Since you run on YARN, files listed as arguments to --files are uploaded to HDFS. Each application submission gets its own staging directory in HDFS where spark-submit puts all those files, and YARN then localizes them into each container's working directory.
To refer to these files, use relative paths. In your case --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j-Driver.properties" should work (note the = between the property name and its value).
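Putting that together, a corrected submit script might look like this (a sketch reusing the paths from the question; the only changes are the = after spark.driver.extraJavaOptions and the relative path given to -Dlog4j.configuration):

propertyFile=application.properties
spark-submit --class MyAppDriver \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j-Driver.properties" \
  --master yarn --deploy-mode cluster \
  --files /users/myuser/log4j-Driver.properties,/opt/mapr/spark/spark-2.1.0/conf/hive-site.xml,/users/myuser/application.properties \
  /users/myuser/myapp_2.11-1.0.jar "$propertyFile"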
I'm using the spark-shell for learning purposes, and for that I created several Scala files containing frequently used code, like class definitions. I use the files by calling the ":load" command within the shell.
Now I would like to use the spark-shell on a YARN cluster. I start it using spark-shell --master yarn --deploy-mode client.
The shell starts without any issues, but when I try to run the code loaded by ":load", I get execution errors.
17/05/04 07:59:36 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_e68_1493271022021_0168_01_000002 on host: xxxw03.mine.de. Exit status: 50. Diagnostics: Exception from container-launch.
Container id: container_e68_1493271022021_0168_01_000002
Exit code: 50
Stack trace: ExitCodeException exitCode=50:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:933)
at org.apache.hadoop.util.Shell.run(Shell.java:844)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1123)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:225)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:317)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I think I will have to share the code loaded in the shell with the workers. But how do I do this?
The spark-shell is useful for quick testing, but once you have an idea of what you want to do and put together a complete program, its usefulness plummets.
You probably want to now move on to using the spark-submit command.
See the docs on submitting an application: https://spark.apache.org/docs/latest/submitting-applications.html
Using this command you provide a JAR file instead of individual class files.
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
<main-class> is the Java-style path to your class, e.g. com.example.MyMainClass.
<application-jar> is the path to the JAR file containing the classes in your project. The other params are documented at the link included above, but these two are the key differences in terms of how you supply your code to the cluster.
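For example, a minimal sketch of how the definitions you currently :load could be wrapped into a compiled entry point (package, object, and app name here are placeholders; this assumes Spark 2.x — with 1.x you would build a SparkContext instead):

package com.example

import org.apache.spark.sql.SparkSession

object MyMainClass {
  def main(args: Array[String]): Unit = {
    // Equivalent of the `spark` session the shell gives you for free
    val spark = SparkSession.builder().appName("MyApp").getOrCreate()

    // ... the class definitions and logic you previously :load-ed go here,
    // or in other source files of the same sbt/Maven project ...

    spark.stop()
  }
}

Package it (e.g. with sbt package) and pass the resulting JAR as <application-jar>; YARN then ships it to the executors for you.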
I am creating a self-contained Scala program that uses Spark for parallelization in some parts. In my specific situation, the Spark cluster is available through mesos.
I create the Spark context like this:
val conf = new SparkConf().setMaster("mesos://zk://<mesos-url1>,<mesos-url2>/spark/mesos-rtspark").setAppName("foo")
val sc = new SparkContext(conf)
I found out from searching around that you have to specify MESOS_NATIVE_JAVA_LIBRARY env var to point to the libmesos library, so when running my Scala program I do this:
MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.dylib sbt run
But this results in a SparkException:
ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Could not parse Master URL: 'mesos://zk://<mesos-url1>,<mesos-url2>/spark/mesos-rtspark'
At the same time, using spark-submit seems to work fine after exporting the MESOS_NATIVE_JAVA_LIBRARY env var.
MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.dylib spark-submit --class <MAIN CLASS> ./target/scala-2.10/<APP_JAR>.jar
Why?
How can I make the standalone program run like spark-submit?
Add the spark-mesos JAR to your classpath.
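For example, a minimal sbt sketch (coordinates assumed for Spark 2.1.0; adjust the version and Scala suffix to your build). Since Spark 2.1 the Mesos cluster manager lives in its own module, and without it on the classpath the mesos:// master URL cannot be parsed:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.1.0",
  // provides the Mesos cluster manager that understands mesos:// master URLs
  "org.apache.spark" %% "spark-mesos" % "2.1.0"
)

spark-submit works without this because the Spark distribution already ships the Mesos integration on its own classpath; a plain sbt run only sees what your build declares.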
My Current Setup:
Spark EC2 Cluster with HDFS and YARN
JupyterHub (0.7.0)
PySpark kernel with Python 2.7
The very simple code that I am using for this question:
rdd = sc.parallelize([1, 2])
rdd.collect()
The PySpark kernel that works as expected in Spark standalone has the following environment variable in the kernel json file:
"PYSPARK_SUBMIT_ARGS": "--master spark://<spark_master>:7077 pyspark-shell"
However, when I try to run in yarn-client mode it gets stuck forever, while the output from the JupyterHub logs is:
16/12/12 16:45:21 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:45:36 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:45:51 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:46:06 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
As described here, I have added the HADOOP_CONF_DIR env variable to point to the directory where the Hadoop configurations are, and changed the PYSPARK_SUBMIT_ARGS --master property to "yarn-client". I can also confirm that there are no other jobs running during this and that the workers are correctly registered.
I am under the impression that it is possible to configure a JupyterHub notebook with a PySpark kernel to run with YARN, as other people have done it. If this is indeed the case, what am I doing wrong?
In order to have PySpark work in YARN mode, you'll have to do some additional configuration:
Configure YARN for a remote YARN connection by copying the
hadoop-yarn-server-web-proxy-<version>.jar of your YARN cluster into the <local hadoop directory>/hadoop-<version>/share/hadoop/yarn/ of your Jupyter instance (you need a local Hadoop)
Copy the hive-site.xml of your cluster into the <local spark directory>/spark-<version>/conf/
Copy the yarn-site.xml of your cluster into the <local hadoop directory>/hadoop-<version>/etc/hadoop/ (a shell sketch of these copy steps follows the environment variables below)
Set environment variables:
export HADOOP_HOME=<local hadoop directory>/hadoop-<version>
export SPARK_HOME=<local spark directory>/spark-<version>
export HADOOP_CONF_DIR=<local hadoop directory>/hadoop-<version>/etc/hadoop
export YARN_CONF_DIR=<local hadoop directory>/hadoop-<version>/etc/hadoop
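The copy steps above might look roughly like this in shell (a sketch; the host and source paths are placeholders for wherever those files live on your cluster):

scp <cluster-node>:<path-on-cluster>/hadoop-yarn-server-web-proxy-<version>.jar \
    <local hadoop directory>/hadoop-<version>/share/hadoop/yarn/
scp <cluster-node>:<path-on-cluster>/hive-site.xml <local spark directory>/spark-<version>/conf/
scp <cluster-node>:<path-on-cluster>/yarn-site.xml <local hadoop directory>/hadoop-<version>/etc/hadoop/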
Now, you can create your kernel in file /usr/local/share/jupyter/kernels/pyspark/kernel.json
{
  "display_name": "pySpark (Spark 2.1.0)",
  "language": "python",
  "argv": [
    "/opt/conda/envs/python35/bin/python",
    "-m",
    "ipykernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "PYSPARK_PYTHON": "/opt/conda/envs/python35/bin/python",
    "SPARK_HOME": "/opt/mapr/spark/spark-2.1.0",
    "PYTHONPATH": "/opt/mapr/spark/spark-2.1.0/python/lib/py4j-0.10.4-src.zip:/opt/mapr/spark/spark-2.1.0/python/",
    "PYTHONSTARTUP": "/opt/mapr/spark/spark-2.1.0/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master yarn pyspark-shell"
  }
}
Relaunch your JupyterHub and you should see pyspark. Note that the root user doesn't usually have YARN permission (because of uid=1), so you should connect to JupyterHub with another user.
I hope my case can help you.
I configure the URL by simply passing a parameter:
import findspark
findspark.init()
from pyspark import SparkContext
sc = SparkContext("yarn-clinet", "First App")