passing parameters via dataproc workflow-templates - google-cloud-dataproc

I understand that Dataproc workflow templates are still in beta, but how do you pass parameters via add-job into the executable SQL? Here is a basic example:
#!/bin/bash
DATE_PARTITION=$1
echo DatePartition: $DATE_PARTITION
# sample job
gcloud beta dataproc workflow-templates add-job hive \
  --step-id=0_first-job \
  --workflow-template=my-template \
  --file='gs://mybucket/first-job.sql' \
  --params="DATE_PARTITION=$DATE_PARTITION"
gcloud beta dataproc workflow-templates run $WORK_FLOW
gcloud beta dataproc workflow-templates remove-job $WORK_FLOW --step-id=0_first-job
echo `date`
Here is my first-job.sql file called from the shell:
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
SET mapred.output.compress=true;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
USE mydb;
CREATE EXTERNAL TABLE if not exists data_raw (
field1 string,
field2 string
)
PARTITIONED BY (dt String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 'gs://data/first-job/';
ALTER TABLE data_raw ADD IF NOT EXISTS PARTITION(dt="${hivevar:DATE_PARTITION}");
In the ALTER TABLE statement, what is the correct syntax? I’ve tried what feels like over 15 variations but nothing works. If I hard code it like this (ALTER TABLE data_raw ADD IF NOT EXISTS PARTITION(dt="2017-10-31");) the partition gets created, but unfortunately it needs to be parameterized.
BTW – The error I receive is consistently like this:
Error: Error while compiling statement: FAILED: ParseException line 1:48 cannot recognize input near '${DATE_PARTITION}' ')' '' in constant
I am probably close but not sure what I am missing.
TIA,
Melissa

Update: Dataproc now has workflow template parameterization, a beta feature:
https://cloud.google.com/dataproc/docs/concepts/workflows/workflow-parameters
For your specific case, you can do the following:
Create an empty template
gcloud beta dataproc workflow-templates create my-template
Add a job with a placeholder for the value you want to parameterize
gcloud beta dataproc workflow-templates add-job hive \
--step-id=0_first-job \
--workflow-template=my-template \
--file='gs://mybucket/first-job.sql' \
--params="DATE_PARTITION=PLACEHOLDER"
Export the template configuration to a file
gcloud beta dataproc workflow-templates export my-template \
--destination=hive-template.yaml
Edit the file to add a parameter
jobs:
- hiveJob:
    queryFileUri: gs://mybucket/first-job.sql
    scriptVariables:
      DATE_PARTITION: PLACEHOLDER
  stepId: 0_first-job
parameters:
- name: DATE_PARTITION
  fields:
  - jobs['0_first-job'].hiveJob.scriptVariables['DATE_PARTITION']
Import the changes
gcloud beta dataproc workflow-templates import my-template \
--source=hive-template.yaml
Add a managed cluster or cluster selector
gcloud beta dataproc workflow-templates set-managed-cluster my-template \
--cluster-name=my-cluster \
--zone=us-central1-a
Run your template with parameters
gcloud beta dataproc workflow-templates instantiate my-template \
--parameters="DATE_PARTITION=${DATE_PARTITION}"
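Putting it together, the original driver script then reduces to computing the date and instantiating the template; the template, job, parameter, and cluster only need to be set up once. A minimal sketch, reusing the names above:
#!/bin/bash
# One-time setup (create, add-job, export/import, set-managed-cluster) is already done;
# each run only needs to supply the date partition.
DATE_PARTITION=$1
gcloud beta dataproc workflow-templates instantiate my-template \
  --parameters="DATE_PARTITION=${DATE_PARTITION}"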

Thanks for trying out Workflows! First-class support for parameterization is part of our roadmap. However for now your remove-job/add-job trick is the best way to go.
Regarding your specific question:
Values passed via --params are accessed as ${hivevar:PARAM} (see [1]); a short illustration follows these notes. Alternatively, you can set --properties, which are accessed as ${PARAM}.
The brackets around --params are not needed. If they were intended to handle spaces in parameter values, use quotation marks instead, like: --params="FOO=a b c,BAR=X"
Finally, watch out for an errant space, as in DATE_PARTITION =$1, which would leave DATE_PARTITION empty.
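As a quick illustration of the first point, here is a sketch reusing the job from the question: a value supplied via --params is referenced in the HiveQL as ${hivevar:DATE_PARTITION}, while a value supplied via --properties would be referenced as ${DATE_PARTITION}.
# --params      -> referenced in first-job.sql as ${hivevar:DATE_PARTITION}
# --properties  -> referenced in first-job.sql as ${DATE_PARTITION}
gcloud beta dataproc workflow-templates add-job hive \
  --step-id=0_first-job \
  --workflow-template=my-template \
  --file='gs://mybucket/first-job.sql' \
  --params="DATE_PARTITION=$DATE_PARTITION"
# first-job.sql: ALTER TABLE data_raw ADD IF NOT EXISTS PARTITION(dt="${hivevar:DATE_PARTITION}");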
Hope this helps!
[1] How to use params/properties flag values when executing hive job on google dataproc

Related

Formatting gcloud spanner queries

Up until recently my gcloud spanner queries were nicely presented as columns across the screen, with each output line representing a single row from the query. Recently, however, for some unknown reason, the output is displayed as column:value pairs down the screen, e.g.
PKey: 9moVr4HmSy6GGIYJyVGu3A==
Ty: Pf
IsNode: Y
P: None
IX: X
I have tried various --format command line options, but have had no success reproducing the original row-per-line output format, i.e. with columns presented across the screen, as follows:
PKey                      Ty  IsNode  P     IX   <-- column names
9moVr4HmSy6GGIYJyVGu3A==  Pf  Y       None  X    <-- row data
What format option should I use?
Example of gcloud query:
gcloud spanner databases execute-sql test-sdk-db --instance=test-instance --sql="Select * from Block "
Thanks
gcloud formats the results as a table if they're being written to a file; otherwise the usual formatting rules apply.
So one easy way to see the table in the shell is to tee it somewhere:
gcloud spanner databases execute-sql test-sdk-db --instance=test-instance --sql="Select * from Block " \
| tee /dev/null
If you can't do that for some reason you can always get the same result with some --format surgery. To print the column names:
gcloud spanner databases execute-sql test-sdk-db --instance=test-instance --sql="Select * from Block " \
--format 'csv[no-heading, delimiter=" "](metadata.rowType.fields.name)'
And to print the rows:
gcloud spanner databases execute-sql test-sdk-db --instance=test-instance --sql="Select * from Block " \
--format 'csv[no-heading, delimiter="\n"](rows.map().flatten(separator=" "))'
The format of gcloud spanner databases execute-sql is a result of broader changes in formatting to better support accessibility standards for screen readers.
There are two methods to receive results in the tabular format:
Set the accessibility/screen_reader configuration property to false.
gcloud config set accessibility/screen_reader false
Similar to the other suggestion about using formatting, you can use a --format=multi(...) option in your gcloud command:
gcloud spanner databases execute-sql test-sdk-db \
--instance=test-instance --sql="Select * from Block " \
--format="multi(metadata:format='value[delimiter='\t'](rowType.fields.name)', \
rows:format='value[delimiter='\t']([])')"
The caveat of this second method is that column names and values may not align due to differences in length.
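If alignment matters, one workaround (a sketch, not a built-in gcloud option) is to keep the tab delimiters and pipe the output through column, which pads each field to a common width:
gcloud spanner databases execute-sql test-sdk-db \
  --instance=test-instance --sql="Select * from Block " \
  --format="multi(metadata:format='value[delimiter='\t'](rowType.fields.name)', \
rows:format='value[delimiter='\t']([])')" \
  | column -t -s $'\t'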

GCloud build YAML substitutions don't work

I have a build config file that looks like:
steps:
...
<i use the ${_IMAGE} variable around 4 times here>
...
images: ['${_IMAGE}']
options:
  dynamic_substitutions: true
substitutions:
  _IMAGE: http://example.com/image-${_ENVIRON}
And I trigger the build like:
gcloud builds submit . --config=config.yaml --substitutions=_ENVIRON=prod
What I expected was for gcloud to substitute the _ENVIRON variable in my script and then substitute the _IMAGE variable so that it would expand to 'http://example.com/image-prod', but instead I'm getting the following error:
ERROR: (gcloud.builds.submit) INVALID_ARGUMENT: generic::invalid_argument: key "_ENVIRON" in the substitution data is not matched in the template
What can I do to make that work? I really want to be able to change the environment easily with a sub and without the need to change anything in code
As you've seen, this isn't possible.
If the only use of _ENVIRON is by _IMAGE, why not drop the substitutions block from config.yaml and pass _IMAGE as the substitution:
ENVIRON="prod"
IMAGE="http://example.com/image-${ENVIRON}"

gcloud builds submit . \
  --config=config.yaml \
  --substitutions=_IMAGE=${IMAGE}
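For completeness, a minimal sketch of what the reworked config.yaml could look like, with a hypothetical docker build step standing in for the elided ones; it references ${_IMAGE} directly and drops the substitutions and options blocks entirely:
# Hypothetical config.yaml, written here via a heredoc; _IMAGE now comes solely from the command line.
cat > config.yaml <<'EOF'
steps:
- name: 'gcr.io/cloud-builders/docker'
  args: ['build', '-t', '${_IMAGE}', '.']
images: ['${_IMAGE}']
EOF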

Google Cloud AI Platform: I can't create a model version using the "--accelerator" parameter

In order to get online predictions, I'm creating a model version on the ai-platform. It works fine unless I want to use the --accelerator parameter.
Here is the command that works:
gcloud alpha ai-platform versions create [...] --model [...] --origin=[...] --python-version=3.5 --runtime-version=1.14 --package-uris=[...] --machine-type=mls1-c4-m4 --prediction-class=[...]
Here is the parameter that makes it not work:
--accelerator=^:^count=1:type=nvidia-tesla-k80
This is the error message I get:
ERROR: (gcloud.alpha.ai-platform.versions.create) INVALID_ARGUMENT: Request contains an invalid argument.
I expect it to work, since 1) the parameter exists and uses these two keys (count and type), 2) I use the correct syntax for the parameter, any other syntaxes would return a syntax error, and 3) the "nvidia-tesla-k80" value exists (it is described in --help) and is available in the region in which the model is deployed.
Make sure you are using a recent version of the Google Cloud SDK.
Then you can use the following command:
gcloud beta ai-platform versions create $VERSION_NAME \
--model $MODEL_NAME \
--origin gs://$MODEL_DIRECTORY_URI \
--runtime-version 1.15 \
--python-version 3.7 \
--framework tensorflow \
--machine-type n1-standard-4 \
--accelerator count=1,type=nvidia-tesla-t4
For reference, you can enable logging during model creation:
gcloud beta ai-platform models create {MODEL_NAME} \
  --regions {REGION} \
  --enable-logging \
  --enable-console-logging
The format for the --accelerator parameter, as shown in the official documentation, is:
--accelerator=count=1,type=nvidia-tesla-k80
I think that might be causing your issue; let me know.

Passing parameters into dataproc pyspark job

How do you pass parameters into the python script being called in a dataproc pyspark job submit? Here is a cmd I've been mucking with:
gcloud dataproc jobs submit pyspark --cluster my-dataproc \
file:///usr/test-pyspark.py \
--properties=^:^p1="7day":p2="2017-10-01"
This is the output returned:
Job [vvvvvvv-vvvv-vvvv-vvvv-0vvvvvv] submitted. Waiting for job output...
Warning: Ignoring non-spark config property: p2=2017-10-01
Warning: Ignoring non-spark config property: p1=7day
Found script=/usr/test-pyspark.py
Traceback (most recent call last):
File "/usr/test-pyspark.py", line 52, in <module>
print(sys.argv[1])
IndexError: list index out of range
Clearly it doesn't recognize the two params I'm trying to pass in. I also tried:
me#my-dataproc-m:~$ gcloud dataproc jobs submit pyspark --cluster=my-dataproc test-pyspark.py 7day 2017-11-01
But that returned with:
ERROR: (gcloud.dataproc.jobs.submit.pyspark) unrecognized arguments:
7day
2017-11-01
The pattern I use to pass params with the hive jobs doesn't work for pyspark.
Any help appreciated!
Thanks,
Melissa
The second form is close; use '--' to separate arguments to your job from arguments to gcloud:
$ gcloud dataproc jobs submit pyspark --cluster=my-dataproc \
test-pyspark.py -- 7day 2017-11-01
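Note that --properties is meant for Spark configuration, which is why the first attempt produced the "Ignoring non-spark config property" warnings; plain script arguments go after the bare --. A combined sketch, assuming the script simply reads sys.argv:
# Spark config goes in --properties; script arguments go after "--".
gcloud dataproc jobs submit pyspark --cluster=my-dataproc \
  --properties=spark.executor.memory=4g \
  file:///usr/test-pyspark.py -- 7day 2017-11-01
# Inside the job: sys.argv[1] == "7day", sys.argv[2] == "2017-11-01"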
This worked for me for passing multiple arguments:
gcloud dataproc jobs submit pyspark --cluster <cluster_name> --region europe-west1 \
  --properties spark.master=yarn \
  --properties spark.submit.deployMode=client \
  --properties spark.sql.adaptive.enabled=true \
  --properties spark.executor.memoryOverhead=8192 \
  --properties spark.driver.memoryOverhead=4096 \
  <name-of-the-script>.py -- --arg1=value1 --arg2=value2
or more simply:
gcloud dataproc jobs submit pyspark --cluster <cluster_name> <name-of-the-script>.py -- --arg1=value1 --arg2=value2
All the best!

Submit a PySpark job to a cluster with the '--py-files' argument

I was trying to submit a job with the GCS URI of the zip of the Python files to use (via the --py-files argument) and the Python file name as the PY_FILE argument value.
This did not seem to work. Do I need to provide some relative path for the PY_FILE value? The PY_FILE is also included in the zip.
e.g. in
gcloud beta dataproc jobs submit pyspark --cluster clustername --py-files gcsuriofzip PY_FILE
what should the value of PY_FILE be?
This is a good question. To answer this question, I am going to use the PySpark wordcount example.
In this case, I created two files, one called test.py which is the file I want to execute and another called wordcount.py.zip which is a zip containing a modified wordcount.py file designed to mimic a module I want to call.
My test.py file looks like this:
import wordcount
import sys

if __name__ == "__main__":
    wordcount.wctest(sys.argv[1])
I modified the wordcount.py file to eliminate the main method and to add a named method:
...
from pyspark import SparkContext
...
def wctest(path):
    sc = SparkContext(appName="PythonWordCount")
    ...
I can call the whole thing on Dataproc by using the following gcloud command:
gcloud beta dataproc jobs submit pyspark --cluster <cluster-name> \
--py-files gs://<bucket>/wordcount.py.zip gs://<bucket>/test.py \
gs://<bucket>/input/input.txt
In this example <bucket> is the name (or path) to my bucket and <cluster-name> is the name of my Dataproc cluster.
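To make the layout concrete: the archive passed via --py-files is added to the job's PYTHONPATH, so wordcount.py has to sit at the top level of the zip for import wordcount to resolve. A minimal sketch of staging the files, with <bucket> as a placeholder:
# Build the archive with wordcount.py at its root, then stage everything in GCS.
zip wordcount.py.zip wordcount.py
gsutil cp wordcount.py.zip test.py gs://<bucket>/
gsutil cp input.txt gs://<bucket>/input/input.txt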