Passing parameters into dataproc pyspark job - google-cloud-dataproc

How do you pass parameters into the python script being called in a dataproc pyspark job submit? Here is a cmd I've been mucking with:
gcloud dataproc jobs submit pyspark --cluster my-dataproc \
file:///usr/test-pyspark.py \
--properties=^:^p1="7day":p2="2017-10-01"
This is the output returned:
Job [vvvvvvv-vvvv-vvvv-vvvv-0vvvvvv] submitted. Waiting for job output...
Warning: Ignoring non-spark config property: p2=2017-10-01
Warning: Ignoring non-spark config property: p1=7day
Found script=/usr/test-pyspark.py
Traceback (most recent call last):
File "/usr/test-pyspark.py", line 52, in <module>
print(sys.argv[1])
IndexError: list index out of range
Clearly doesn't recognize the 2 params I'm trying to pass in. I also tried:
me@my-dataproc-m:~$ gcloud dataproc jobs submit pyspark --cluster=my-dataproc test-pyspark.py 7day 2017-11-01
But that returned with:
ERROR: (gcloud.dataproc.jobs.submit.pyspark) unrecognized arguments:
7day
2017-11-01
The pattern I use to pass params with the hive jobs doesn't work for pyspark.
Any help appreciated!
Thanks,
Melissa

The second form is close; use '--' to separate arguments to your job from arguments to gcloud:
$ gcloud dataproc jobs submit pyspark --cluster=my-dataproc \
test-pyspark.py -- 7day 2017-11-01
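On the script side, anything after '--' shows up as positional entries in sys.argv. A minimal sketch of how test-pyspark.py might read the two values (the variable names here are hypothetical, not from the original script):

import sys

from pyspark.sql import SparkSession

if __name__ == "__main__":
    # sys.argv[0] is the script path; job arguments start at index 1
    window = sys.argv[1]    # e.g. "7day"
    run_date = sys.argv[2]  # e.g. "2017-11-01"

    spark = SparkSession.builder.appName("test-pyspark").getOrCreate()
    print(window, run_date)
    spark.stop()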

This worked for me for passing multiple arguments:
gcloud dataproc jobs submit pyspark --cluster <cluster_name> \
  --region europe-west1 \
  --properties spark.master=yarn \
  --properties spark.submit.deployMode=client \
  --properties spark.sql.adaptive.enabled=true \
  --properties spark.executor.memoryOverhead=8192 \
  --properties spark.driver.memoryOverhead=4096 \
  <name-of-the-script>.py -- --arg1=value1 --arg2=value2
or more simply:
gcloud dataproc jobs submit pyspark --cluster <cluster_name> <name-of-the-script>.py -- --arg1=value1 --arg2=value2
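Flags passed after '--' this way still arrive in the script as plain strings in sys.argv; one common way to parse the --arg1=value1 style is argparse. A minimal sketch (the flag names mirror the placeholder command above and are not from any real script):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--arg1")  # e.g. value1
parser.add_argument("--arg2")  # e.g. value2
args = parser.parse_args()     # parses whatever followed '--' on the gcloud command line

print(args.arg1, args.arg2)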
All the Best !

Related

Read local/linux files in Spark Scala code executing in Yarn Cluster Mode

How do I access and read local file data in Spark running in YARN cluster mode?
local/linux file: /home/test_dir/test_file.csv
spark-submit --class "" --master yarn --deploy-mode cluster --files /home/test_dir/test_file.csv test.jar
Spark code to read the csv:
val test_data = spark.read.option("inferSchema", "true").option("header", "true").csv("/home/test_dir/test_file.csv")
val test_file_data = spark.read.option("inferSchema", "true").option("header", "true").csv("file:///home/test_dir/test_file.csv")
The above sample spark-submit fails with a file-not-found error for the local file (/home/test_dir/test_file.csv).
Spark by default looks for the file in hdfs://, but my file is local; it should not be copied into HDFS and should be read only from the local file system.
Any suggestions to resolve this error?
Using the file:// prefix will pull files from the YARN NodeManager filesystem, not the system from where you submitted the code.
To access your --files, use csv("#test_file.csv")
Regarding "should not be copied into hdfs": using --files will copy the files into a temporary location that's mounted by the YARN executor, and you can see them from the YARN UI.
The solution below worked for me:
local/linux file: /home/test_dir/test_file.csv
spark-submit --class "" --master yarn --deploy-mode cluster --files /home/test_dir/test_file.csv test.jar
To access the file passed via spark-submit:
import scala.io.Source
val lines = Source.fromFile("test_file.csv").getLines().toList
Instead of specifying the complete path, specify only the name of the file to read. Since Spark has already copied the file to each node, we can access its data by file name alone.
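For readers doing the same thing from PySpark on Dataproc, here is a minimal analogue of the Scala snippet above, assuming the job was submitted with --files /home/test_dir/test_file.csv:

# In YARN cluster mode, a file shipped with --files is localized into the
# container's working directory, so the driver can open it by bare file name.
with open("test_file.csv") as f:
    lines = f.readlines()

print("read", len(lines), "lines")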

Is it possible to submit a job to a cluster using an initialization script on Google Dataproc?

I am using Dataproc with 1 job on 1 cluster.
I would like to start my job as soon as the cluster is created. I found that the best way to achieve this is to submit a job using an initialization script like below.
function submit_job() {
  echo "Submitting job..."
  gcloud dataproc jobs submit pyspark ...
}
export -f submit_job

function check_running() {
  echo "checking..."
  gcloud dataproc clusters list --region='asia-northeast1' --filter='clusterName = {{ cluster_name }}' |
    tail -n 1 |
    while read name platform worker_count preemptive_worker_count status others
    do
      if [ "$status" = "RUNNING" ]; then
        return 0
      fi
    done
}
export -f check_running

function after_initialization() {
  local role
  role=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
  if [[ "${role}" == 'Master' ]]; then
    echo "monitoring the cluster..."
    while true; do
      if check_running; then
        submit_job
        break
      fi
      sleep 5
    done
  fi
}
export -f after_initialization

echo "start monitoring..."
bash -c after_initialization & disown -h
Is it possible? When I ran this on Dataproc, the job was not submitted...
Thank you!
Consider using a Dataproc Workflow; it is designed for multi-step workflows: creating a cluster, submitting jobs, and deleting the cluster. It is better than init actions because it is a first-class Dataproc feature: there will be a Dataproc job resource, and you can view the job history.
Please consider using Cloud Composer; then you can write a single script that creates the cluster, runs the job, and terminates the cluster.
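To make the Cloud Composer suggestion concrete, here is a rough sketch of an Airflow DAG that chains the create-cluster, submit-job, and delete-cluster operators from the Google provider package; every project, region, bucket, and cluster value below is a placeholder, not taken from the question:

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)

PROJECT_ID = "my-project"      # placeholder
REGION = "asia-northeast1"
CLUSTER_NAME = "my-cluster"    # placeholder

PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/my-job.py"},  # placeholder
}

with DAG(
    dag_id="dataproc_one_shot_job",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # trigger manually; the whole lifecycle runs in one DAG run
    catchup=False,
) as dag:
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        cluster_config={
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },  # illustrative sizing only
    )
    submit_job = DataprocSubmitJobOperator(
        task_id="submit_job",
        project_id=PROJECT_ID,
        region=REGION,
        job=PYSPARK_JOB,
    )
    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        trigger_rule="all_done",  # tear the cluster down even if the job fails
    )
    create_cluster >> submit_job >> delete_cluster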
I found a way.
Put a shell script named await_cluster_and_run_command.sh on GCS. Then add the following to the initialization script.
gsutil cp gs://...../await_cluster_and_run_command.sh /usr/local/bin/
chmod 750 /usr/local/bin/await_cluster_and_run_command.sh
nohup /usr/local/bin/await_cluster_and_run_command.sh &>>/var/log/master-post-init.log &
reference: https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/post-init/master-post-init.sh

passing parameters via dataproc workflow-templates

I understand that dataproc workflow-templates is still in beta, but how do you pass parameters via add-job into the executable SQL? Here is a basic example:
#!/bin/bash
DATE_PARTITION=$1
echo DatePartition: $DATE_PARTITION
# sample job
gcloud beta dataproc workflow-templates add-job hive \
--step-id=0_first-job \
--workflow-template=my-template \
--file='gs://mybucket/first-job.sql' \
--params="DATE_PARTITION=$DATE_PARTITION"
gcloud beta dataproc workflow-templates run $WORK_FLOW
gcloud beta dataproc workflow-templates remove-job $WORK_FLOW --step-id=0_first-job
echo `date`
Here is my first-job.sql file called from the shell:
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
SET mapred.output.compress=true;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
USE mydb;
CREATE EXTERNAL TABLE if not exists data_raw (
field1 string,
field2 string
)
PARTITIONED BY (dt String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 'gs://data/first-job/';
ALTER TABLE data_raw ADD IF NOT EXISTS PARTITION(dt="${hivevar:DATE_PARTITION}");
In the ALTER TABLE statement, what is the correct syntax? I’ve tried what feels like over 15 variations but nothing works. If I hard code it like this (ALTER TABLE data_raw ADD IF NOT EXISTS PARTITION(dt="2017-10-31");) the partition gets created, but unfortunately it needs to be parameterized.
BTW – The error I receive is consistently like this:
Error: Error while compiling statement: FAILED: ParseException line 1:48 cannot recognize input near '${DATE_PARTITION}' ')' '' in constant
I am probably close but not sure what I am missing.
TIA,
Melissa
Update: Dataproc now has workflow template parameterization, a beta feature:
https://cloud.google.com/dataproc/docs/concepts/workflows/workflow-parameters
For your specific case, you can do the following:
Create an empty template
gcloud beta dataproc workflow-templates create my-template
Add a job with a placeholder for the value you want to parameterize
gcloud beta dataproc workflow-templates add-job hive \
--step-id=0_first-job \
--workflow-template=my-template \
--file='gs://mybucket/first-job.sql' \
--params="DATE_PARTITION=PLACEHOLDER"
Export the template configuration to a file
gcloud beta dataproc workflow-templates export my-template \
--destination=hive-template.yaml
Edit the file to add a parameter
jobs:
- hiveJob:
    queryFileUri: gs://mybucket/first-job.sql
    scriptVariables:
      DATE_PARTITION: PLACEHOLDER
  stepId: 0_first-job
parameters:
- name: DATE_PARTITION
  fields:
  - jobs['0_first-job'].hiveJob.scriptVariables['DATE_PARTITION']
Import the changes
gcloud beta dataproc workflow-templates import my-template \
--source=hive-template.yaml
Add a managed cluster or cluster selector
gcloud beta dataproc workflow-templates set-managed-cluster my-template \
--cluster-name=my-cluster \
--zone=us-central1-a
Run your template with parameters
gcloud beta dataproc workflow-templates instantiate my-template \
--parameters="DATE_PARTITION=${DATE_PARTITION}"
Thanks for trying out Workflows! First-class support for parameterization is part of our roadmap. However, for now your remove-job/add-job trick is the best way to go.
Regarding your specific question:
Values passed via --params are accessed as ${hivevar:PARAM} (see [1]). Alternatively, you can set --properties, which are accessed as ${PARAM}.
The brackets around params are not needed. If they are intended to handle spaces in parameter values, use quotation marks instead, e.g. --params="FOO=a b c,BAR=X"
Finally, I noticed an errant space here, DATE_PARTITION =$1, which probably results in an empty DATE_PARTITION value.
Hope this helps!
[1] How to use params/properties flag values when executing hive job on google dataproc

Submit a PySpark job to a cluster with the '--py-files' argument

I was trying to submit a job with the GCS URI of the zip of the Python files to use (via the --py-files argument) and the Python file name as the PY_FILE argument value.
This did not seem to work. Do I need to provide some relative path for the PY_FILE value? The PY_FILE is also included in the zip.
e.g. in
gcloud beta dataproc jobs submit pyspark --cluster clustername --py-files gcsuriofzip PY_FILE
what should the value of PY_FILE be?
This is a good question. To answer this question, I am going to use the PySpark wordcount example.
In this case, I created two files, one called test.py which is the file I want to execute and another called wordcount.py.zip which is a zip containing a modified wordcount.py file designed to mimic a module I want to call.
My test.py file looks like this:
import wordcount
import sys
if __name__ == "__main__":
wordcount.wctest(sys.argv[1])
I modified the wordcount.py file to eliminate the main method and to add a named method:
...
from pyspark import SparkContext
...
def wctest(path):
sc = SparkContext(appName="PythonWordCount")
...
I can call the whole thing on Dataproc by using the following gcloud command:
gcloud beta dataproc jobs submit pyspark --cluster <cluster-name> \
--py-files gs://<bucket>/wordcount.py.zip gs://<bucket>/test.py \
gs://<bucket>/input/input.txt
In this example <bucket> is the name of (or path to) my bucket and <cluster-name> is the name of my Dataproc cluster.
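For completeness, here is a self-contained sketch of what the modified wordcount.py inside wordcount.py.zip could look like; the answer above only shows fragments, so the body below is an illustrative reconstruction rather than the author's actual file:

from operator import add

from pyspark import SparkContext


def wctest(path):
    # Named entry point instead of a __main__ block, so test.py can import and call it
    sc = SparkContext(appName="PythonWordCount")
    counts = (
        sc.textFile(path)
          .flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(add)
    )
    for word, count in counts.collect():
        print(word, count)
    sc.stop()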

How to use custom spark-defaults.conf settings

I have added a custom value to conf/spark-defaults.conf but that value is not being used.
stephen#ubuntu:~/spark-1.2.2$ cat conf/spark-defaults.conf
spark.akka.frameSize 92345678
Now let us run my program LBFGSRunner
sbt/sbt '; project mllib; runMain org.apache.spark.mllib.optimization.LBFGSRunner spark://ubuntu:7077'
Notice the following error, which shows the conf setting was not being used:
[error] Exception in thread "main" org.apache.spark.SparkException:
Job aborted due to stage failure: Serialized task 0:0 was 26128706 bytes,
which exceeds max allowed: spark.akka.frameSize (10485760 bytes) -
reserved (204800 bytes). Consider increasing spark.akka.frameSize
or using broadcast variables for large values
Note: Working In Linux Mint.
If you are setting properties in spark-defaults.conf, Spark will pick up those settings only when you submit your job using spark-submit.
file: spark-defaults.conf
spark.driver.extraJavaOptions -Dlog4j.configuration=file:log4j.properties -Dspark.yarn.app.container.log.dir=app-logs -Dlogfile.name=hello-spark
spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,org.apache.spark:spark-avro_2.12:3.0.1
If you want to run your job in development mode (for example, launched directly rather than via spark-submit), set the configuration on the SparkSession builder instead:
spark = SparkSession.builder \
.appName('Hello Spark') \
.master('local[3]') \
.config("spark.streaming.stopGracefullyOnShutdown", "true") \
.config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1") \
.getOrCreate()
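Either way, it helps to confirm which settings actually reached the running session. A quick check from PySpark (a sketch; the property name queried at the end is only an example):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("conf-check").getOrCreate()

# List every property Spark actually loaded (spark-defaults.conf, --conf flags, builder .config() calls)
for key, value in spark.sparkContext.getConf().getAll():
    print(key, "=", value)

# Or look up a single property, with a fallback if it was never set
print(spark.conf.get("spark.jars.packages", "not set"))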