passing properties argument for gcloud dataproc jobs submit pyspark - mongodb

I am trying to submit a PySpark job to Google Cloud Dataproc via the command line. These are my arguments:
gcloud dataproc jobs submit pyspark --cluster mongo-load --properties org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 mongo_load.py
I keep getting this exception:
--properties: Bad syntax for dict arg: [org.mongodb.spark:mongo-spark-connector_2.11:2.2.0]
I tried some of the escaping options from Google shown here, but nothing seems to work.

Figured out I just needed to pass
spark.jars.packages=org.mongodb.spark:mongo-spark-connector_2.11:2.2.0
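For completeness, the full corrected command then looks something like this (reusing the cluster and script names from the question):
gcloud dataproc jobs submit pyspark \
  --cluster mongo-load \
  --properties spark.jars.packages=org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 \
  mongo_load.py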

In addition to the answer by @Marlon Gray: if you need to pass more than one package, you need to escape the spark.jars.packages string, like
--properties=^#^spark.jars.packages=mavencoordinate1,mavencoordinate2
Please check https://cloud.google.com/sdk/gcloud/reference/topic/escaping for further details.
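For example, a command pulling two packages could look roughly like this (the second Maven coordinate is only a placeholder; ^#^ changes the dict delimiter to # so the commas inside the value survive):
gcloud dataproc jobs submit pyspark \
  --cluster mongo-load \
  --properties=^#^spark.jars.packages=org.mongodb.spark:mongo-spark-connector_2.11:2.2.0,com.example:another-connector_2.11:1.0.0 \
  mongo_load.py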

Related

gcloud CLI application logs to bucket

There are Scala Spark jobs that run daily in GCP. I am trying to set up a notification to be sent when a run is completed. One way I thought of doing that was to get the logs and grep them for a specific completion message (not sure if there's a better way). But I found that the logs are only shown in the console, inside the job details page, and are not saved to a file.
Is there a way to route these logs to a file in a bucket so that I can search them? Do I have to specify where to write these logs in the log4j properties file, for example by giving a bucket location to
log4j.appender.stdout = org.apache.log4j.ConsoleAppender
I tried to submit a job with this, but it gives me this error: grep:**-2022-07-08.log: No such file or directory
...
gcloud dataproc jobs submit spark \
--project $PROJECT --cluster=$CLUSTER --region=$REGION --class=***.spark.offer.Main \
--jars=gs://**.jar\
--properties=driver-memory=10G,spark.ui.filters="",spark.memory.fraction=0.6,spark.sql.files.maxPartitionBytes=5368709120,spark.memory.storageFraction=0.1,spark.driver.extraJavaOptions="-Dcq.config.name=gcp.conf",spark.executor.extraJavaOptions="-Dlog4j.configuration=log4j-executor.properties -Dcq.config.name=gcp.conf" \
--gcp.conf > gs://***-$date.log 2>&1
By default, Dataproc job driver logs are saved in GCS at the Dataproc-generated driverOutputResourceUri of the job. See this doc for more details.
But IMHO, a better way to determine whether a job has finished is through gcloud dataproc jobs describe <job-id>, or the jobs.get REST API.
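For example, a small polling sketch built on that command could look like this (JOB_ID and REGION are placeholders; DONE, ERROR and CANCELLED are the terminal states):
JOB_ID="my-job-id"
REGION="us-central1"
while true; do
  # Query only the current state of the job
  STATE=$(gcloud dataproc jobs describe "$JOB_ID" --region="$REGION" --format='value(status.state)')
  echo "Current job state: $STATE"
  case "$STATE" in
    DONE|ERROR|CANCELLED) break ;;
  esac
  sleep 30
done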

Launch a dataproc job with --files through the rest API

I am able to submit a job to dataproc via the command line
gcloud dataproc jobs submit pyspark --cluster=my_cluster --region=myregion --py-files file1.py script.py
I would like to transform this command line into a POST request to the REST API, https://cloud.google.com/dataproc/docs/guides/submit-job
However, I am not able to understand how to set the
--py-files file1.py
parameter in the request. Could somebody help me?
I have found that this can be accomplished through
"pythonFileUris": [
"gs://file1.py"
]

Updating a Dataproc cluster (metadata or labels) directly while in initialization actions script

I'd like to save more specific errors in the case of a failed initialization script of a Dataproc cluster. Is it possible to update the cluster metadata or add a label to the cluster (without using gcloud dataproc clusters update) from within the script? Or is there any other method for writing a more useful error message? Thanks in advance!
If your goal is to report an error from within an initialization action, there is a feature in Dataproc that extracts messages from init action output.
It works as long as you emit a message in this format: StructuredError{message}
For example:
message="something went wrong"
echo "StructuredError{${message}}"
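A minimal init action sketch built around that convention might look like this (the installed package is just a placeholder for whatever setup step can fail):
#!/bin/bash
# Hypothetical init action: surface a readable error if a setup step fails
set -e

report_error() {
  local message="$1"
  # Dataproc extracts messages emitted in the StructuredError{...} format
  echo "StructuredError{${message}}"
  exit 1
}

pip install some-internal-package || report_error "pip install of some-internal-package failed"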

Give custom job_id to Google Dataproc cluster for running pig/hive/spark jobs

Is there any flag available to give a custom job_id to Dataproc jobs? I am using this command to run Pig jobs.
gcloud dataproc jobs submit pig --cluster my_cluster --file my_queries.pig
I use similar commands to submit pyspark/hive jobs.
This command creates a job_id on its own, and tracking jobs later on is difficult.
Reading the gcloud code you can see that the argument called id is used as the job name:
https://github.com/google-cloud-sdk/google-cloud-sdk/blob/master/lib/googlecloudsdk/command_lib/dataproc/jobs/submitter.py#L56
Therefore you only need to add --id to your gcloud command:
gcloud dataproc jobs submit spark --id this-is-my-job-name --cluster my-cluster --class com.myClass.Main --jars gs://my.jar
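Building on that, you could generate a predictable id per run so jobs are easy to find later; this naming scheme is only a suggestion:
# Hypothetical example: date-based job id for the daily Pig run
JOB_ID="daily-pig-$(date +%Y%m%d-%H%M%S)"
gcloud dataproc jobs submit pig \
  --id "$JOB_ID" \
  --cluster my_cluster \
  --file my_queries.pig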
While it's possible to provide your own generated jobid when using the underlying REST API, there isn't currently any way to specify your own jobid when submitting with gcloud dataproc jobs submit; this feature might be added in the future. That said, typically when people want to specify job ids they also want to be able to list with more complex match expressions, or potentially to have multiple categories of jobs listed by different kinds of expressions at different points in time.
So, you might want to consider dataproc labels instead; labels are intended specifically for this kind of use case, and are optimized for efficient lookup. For example:
gcloud dataproc jobs submit pig --labels jobtype=mylogspipeline,date=20170508 ...
gcloud dataproc jobs submit pig --labels jobtype=mylogspipeline,date=20170509 ...
gcloud dataproc jobs submit pig --labels jobtype=mlpipeline,date=20170509 ...
gcloud dataproc jobs list --filter "labels.jobtype=mylogspipeline"
gcloud dataproc jobs list --filter "labels.date=20170509"
gcloud dataproc jobs list --filter "labels.date=20170509 AND labels.jobtype=mlpipeline"

How to retrieve Dataproc's jobId within a PySpark job

I run several batch jobs, and I would like to reference the jobId from Dataproc in the saved output files.
That would allow me to associate all logs for arguments and output with the results. One downside remains: as the YARN executors pass away, no logs for the individual executors can be obtained anymore.
The context of a Google Dataproc job is passed into Spark jobs by using tags, so all relevant information is present in the SparkConf and can be accessed:
pyspark.SparkConf().get("spark.yarn.application.tags", "unknown")
pyspark.SparkConf().get("spark.yarn.tags", "unknown")
The output looks like the following:
dataproc_job_3f4025a0-bce1-a254-9ddc-518a4d8b2f3d
That information can then be appended to our export folder path, so the output is saved with a Dataproc reference:
df.select("*").write \
    .format('com.databricks.spark.csv') \
    .options(header='true') \
    .save(export_folder)
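Putting it together, a minimal PySpark sketch that pulls the job tag out of the YARN tags and uses it in the output path could look like this (the bucket path and the sample DataFrame are placeholders, and the built-in csv writer is used instead of the external com.databricks.spark.csv package):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The Dataproc job id is embedded in the YARN tags as "dataproc_job_<id>";
# outside of YARN the tag is absent and we fall back to "unknown"
tags = spark.sparkContext.getConf().get("spark.yarn.tags", "")
job_tag = next(
    (t for t in tags.split(",") if t.startswith("dataproc_job_")),
    "unknown",
)

# Hypothetical bucket path; the job tag makes each run's output traceable
export_folder = "gs://my-bucket/exports/{}".format(job_tag)

df = spark.createDataFrame([(1, "example")], ["id", "value"])
df.select("*").write \
    .format('csv') \
    .options(header='true') \
    .save(export_folder)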