Launching a Spark program using an Oozie workflow - Scala

I am working on a Scala program that uses Spark packages.
Currently I run the program with this bash command from the gateway:
/homes/spark/bin/spark-submit --master yarn-cluster --class "com.xxx.yyy.zzz" --driver-java-options "-Dyyy.num=5" a.jar arg1 arg2
I would like to start using Oozie to run this job. I have a few questions:
Where should I put the spark-submit executable? On HDFS?
How do I define the Spark action? Where should the --driver-java-options appear?
What should the Oozie action look like? Is it similar to the one appearing here?

If you have a new enough version of Oozie, you can use Oozie's Spark action:
https://github.com/apache/oozie/blob/master/client/src/main/resources/spark-action-0.1.xsd
Otherwise you need to run a java action that calls Spark. Something like:
<java>
    <main-class>org.apache.spark.deploy.SparkSubmit</main-class>
    <arg>--class</arg>
    <arg>${spark_main_class}</arg> <!-- this is the class com.xxx.yyy.zzz -->
    <arg>--deploy-mode</arg>
    <arg>cluster</arg>
    <arg>--master</arg>
    <arg>yarn</arg>
    <arg>--queue</arg>
    <arg>${queue_name}</arg> <!-- depends on your oozie config -->
    <arg>--num-executors</arg>
    <arg>${spark_num_executors}</arg>
    <arg>--executor-cores</arg>
    <arg>${spark_executor_cores}</arg>
    <arg>${spark_app_file}</arg> <!-- jar that contains your spark job, written in scala -->
    <arg>${input}</arg> <!-- some arg -->
    <arg>${output}</arg> <!-- some other arg -->
    <file>${spark_app_file}</file>
    <file>${name_node}/user/spark/share/lib/spark-assembly.jar</file>
</java>
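For the first option, a Spark action for the command in the question might look roughly like the sketch below. The element names follow my reading of the linked spark-action-0.1 schema, so check them against the XSD for your Oozie version. The action name, the ${job_tracker} property, and the "end"/"fail" transition targets are placeholders I made up; ${name_node} and ${spark_app_file} are reused from the java example. It also assumes the Spark share lib is installed for Oozie, in which case there is no spark-submit executable to place anywhere.

```xml
<action name="spark-job">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${job_tracker}</job-tracker>
        <name-node>${name_node}</name-node>
        <master>yarn-cluster</master>
        <name>my-spark-job</name>
        <class>com.xxx.yyy.zzz</class>
        <!-- application jar, assumed to be reachable from the cluster (e.g. on HDFS) -->
        <jar>${spark_app_file}</jar>
        <!-- extra spark-submit options such as driver-java-options go here -->
        <spark-opts>--driver-java-options -Dyyy.num=5</spark-opts>
        <arg>arg1</arg>
        <arg>arg2</arg>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
</action>
```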

Related

Read local/linux files in Spark Scala code executing in Yarn Cluster Mode

How can I access and read local file data in Spark running in YARN cluster mode?
local/linux file: /home/test_dir/test_file.csv
spark-submit --class "" --master yarn --deploy-mode cluster --files /home/test_dir/test_file.csv test.jar
Spark code to read the CSV:
val test_data = spark.read.option("inferSchema", "true").option("header", "true").csv("/home/test_dir/test_file.csv")
val test_file_data = spark.read.option("inferSchema", "true").option("header", "true").csv("file:///home/test_dir/test_file.csv")
The above spark-submit fails with a local file-not-found error (/home/test_dir/test_file.csv).
By default Spark looks for the file in hdfs://, but my file is on the local file system; it should not be copied into HDFS and should be read only from the local file system.
Any suggestions to resolve this error?
Using the file:// prefix reads files from the YARN NodeManager's filesystem, not from the machine where you submitted the job.
To access your --files use csv("#test_file.csv").
Regarding "should not be copied into hdfs": using --files copies the files into a temporary location that's mounted by the YARN executor, and you can see them from the YARN UI.
The solution below worked for me:
local/linux file: /home/test_dir/test_file.csv
spark-submit --class "" --master yarn --deploy-mode cluster --files /home/test_dir/test_file.csv test.jar
To access the file passed to spark-submit:
import scala.io.Source
val lines = Source.fromFile("test_file.csv").getLines().mkString("\n")
Instead of specifying the complete path, specify only the name of the file to read. Because Spark already copies the file to the nodes, its contents can be accessed by file name alone.
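If you read the file this way more than once, a small helper that also closes the file handle may be worth having. This is only a sketch: LocalFileReader and readLines are names I made up, and it assumes the file shipped with --files sits in the container's working directory, as described above.

```scala
import scala.io.Source

object LocalFileReader {
  // Read all lines of the given file and always close the underlying handle.
  // A bare file name is resolved against the current working directory.
  def readLines(fileName: String): List[String] = {
    val source = Source.fromFile(fileName)
    try source.getLines().toList // materialize the lines before the source is closed
    finally source.close()
  }
}
```

In the driver this would be called as LocalFileReader.readLines("test_file.csv").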

How can I add configuration files to a Spark job running in YARN-CLUSTER mode?

I am using Spark 1.6.0. I want to upload a file using the --files option and read its content after initializing the Spark context.
My spark-submit command looks like this:
spark-submit \
--master yarn-cluster \
--files /home/user/test.csv \
/home/user/spark-test-0.1-SNAPSHOT.jar
I read the Spark documentation, which suggests using SparkFiles.get("test.csv"), but this does not work in yarn-cluster mode.
If I change the master to local, the code works fine, but in yarn-cluster mode I get a file not found exception.
I can see in the logs that my file is uploaded to the hdfs://host:port/user/guest/.sparkStaging/application_1452310382039_0019/test.csv directory, while SparkFiles.get looks for the file in /tmp/test.csv, which is not correct. If someone has used this successfully, please help me solve this.
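Spark submit command (in yarn-client mode the driver runs on the submitting machine, so it can read the local path passed as the first argument):
spark-submit \
--master yarn-client \
--files /home/user/test.csv \
/home/user/spark-test-0.1-SNAPSHOT.jar /home/user/test.csv
Read the file in the main program:
import java.io.FileInputStream
def main(args: Array[String]): Unit = {
  val fis = new FileInputStream(args(0)) // args(0) is /home/user/test.csv
  // read content of file, then call fis.close()
}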

Pass opt arguments in an application executed as a .jar through spark-submit --class and use the existing context

I am writing a Scala project whose classes I want to be executable from a jar via spark-submit (e.g. spark-submit --class org.project).
My problems are the following:
I want to use the Spark context configuration that the user sets when running spark-submit, and optionally override some parameters such as the application name. Example: spark-submit --num-executors 6 --class org.project should set the number of executors to 6 in the Spark context configuration.
I want to be able to pass optional parameters like --inputFile or --verbose to my project without interfering with the Spark parameters (ideally avoiding name overlap).
Example: spark-submit --num-executors 6 --class org.project --inputFile ./data/mystery.txt should pass "--inputFile ./data/mystery.txt" to the args of the main method of class org.project.
My progress on these problems so far:
I run
val conf = new SparkConf().setAppName("project")
val sc = new SparkContext(conf)
in my main method, but I am not sure whether this does what I expect.
Spark treats those optional arguments as arguments to spark-submit and reports an error.
Note 1: My Java class project currently does not inherit from any other class.
Note 2: I am new to the world of Spark and couldn't find anything relevant with a basic search.
You will have to handle parameter parsing yourself. Here we use Scopt.
When you spark-submit your job, it must enter through an object's def main(args: Array[String]). Take these args and parse them with your favorite argument parser, set your SparkConf and SparkSession accordingly, and launch your process.
Spark has examples of that whole idea:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/DenseKMeans.scala
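To illustrate the "parse the args yourself" idea without pulling in a library, here is a minimal hand-rolled sketch in plain Scala. It is not Scopt and not taken from the Spark examples; AppOptions and AppArgs are hypothetical names, and only the two flags mentioned in the question are handled.

```scala
// Hypothetical option holder; the fields mirror the flags from the question.
final case class AppOptions(inputFile: Option[String] = None, verbose: Boolean = false)

object AppArgs {
  // Walk the argument list that spark-submit forwards to main().
  // Returns Left with a message when an option is unknown or lacks its value.
  @annotation.tailrec
  def parse(args: List[String], acc: AppOptions = AppOptions()): Either[String, AppOptions] =
    args match {
      case Nil                           => Right(acc)
      case "--inputFile" :: path :: rest => parse(rest, acc.copy(inputFile = Some(path)))
      case "--verbose" :: rest           => parse(rest, acc.copy(verbose = true))
      case other :: _                    => Left(s"Unknown or incomplete option: $other")
    }
}
```

In main you would call AppArgs.parse(args.toList) and build the SparkConf from the result. Note that spark-submit only forwards what comes after the application jar, so the application options have to follow the jar on the command line; options placed before it are interpreted by spark-submit itself.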

ClassNotFoundException: com.databricks.spark.csv.DefaultSource

I am trying to export data from Hive using Spark with Scala, but I am getting the following error:
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.csv.DefaultSource
My Scala script is shown below.
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc)
val df = sqlContext.sql("SELECT * FROM sparksdata")
df.write.format("com.databricks.spark.csv").save("/root/Desktop/home.csv")
I have also tried this command, but the problem is still not resolved. Please help.
spark-shell --packages com.databricks:spark-csv_2.10:1.5.0
If you want to run the script this way, you need to pass --jars for local jars or --packages for artifacts from a remote repository when you run the command.
So the script should be run like this:
spark-shell -i /path/to/script/scala --packages com.databricks:spark-csv_2.10:1.5.0
If you also want spark-shell to exit after the job is done, add:
System.exit(0)
at the end of your script.
PS: You won't need to fetch this dependency with Spark 2.x and later, where CSV support is built in.

Submit a PySpark job to a cluster with the '--py-files' argument

I was trying to submit a job with the GCS URI of the zip of the Python files to use (via the --py-files argument) and the Python file name as the PY_FILE argument value.
This did not seem to work. Do I need to provide some relative path for the PY_FILE value? The PY_FILE is also included in the zip.
e.g. in
gcloud beta dataproc jobs submit pyspark --cluster clustername --py-files gcsuriofzip PY_FILE
what should the value of PY_FILE be?
Good question. To answer it, I am going to use the PySpark wordcount example.
In this case, I created two files, one called test.py which is the file I want to execute and another called wordcount.py.zip which is a zip containing a modified wordcount.py file designed to mimic a module I want to call.
My test.py file looks like this:
import wordcount
import sys
if __name__ == "__main__":
    wordcount.wctest(sys.argv[1])
I modified the wordcount.py file to eliminate the main method and to add a named method:
...
from pyspark import SparkContext
...
def wctest(path):
    sc = SparkContext(appName="PythonWordCount")
    ...
I can call the whole thing on Dataproc by using the following gcloud command:
gcloud beta dataproc jobs submit pyspark --cluster <cluster-name> \
--py-files gs://<bucket>/wordcount.py.zip gs://<bucket>/test.py \
gs://<bucket>/input/input.txt
In this example <bucket> is the name (or path) to my bucket and <cluster-name> is the name of my Dataproc cluster.