Updating a Dataproc cluster (metadata or labels) directly from within an initialization actions script - google-cloud-dataproc

I'd like to save more specific errors in the case of a failed initialization script of a Dataproc cluster. Is it possible to update the cluster metadata or add a label to the cluster (without using gcloud dataproc clusters update) from within the script? Or any other method to write a more useful error message? Thanks in advance!

If your goal is to report an error from within an initialization action, Dataproc has a feature that extracts messages from init action output, as long as you emit the message in this format: StructuredError{message}
For example:
message="something went wrong"
echo "StructuredError{${message}}"

Related

gcloud CLI application logs to bucket

There are Scala Spark application jobs that run daily in GCP. I am trying to set up a notification to be sent when a run is completed. One way I thought of doing that is to get the logs and grep them for a specific completion message (not sure if there's a better way). But I found that the logs are only shown in the console, inside the job details page, and are not saved to a file.
Is there a way to route these logs to a file in a bucket so that I can search them? Do I have to specify where these logs go in the log4j properties file, for example by giving a bucket location to
log4j.appender.stdout = org.apache.log4j.ConsoleAppender
I tried to submit a job with this but it's giving me this error: grep:**-2022-07-08.log: No such file or directory
...
gcloud dataproc jobs submit spark \
--project $PROJECT --cluster=$CLUSTER --region=$REGION --class=***.spark.offer.Main \
--jars=gs://**.jar \
--properties=driver-memory=10G,spark.ui.filters="",spark.memory.fraction=0.6,spark.sql.files.maxPartitionBytes=5368709120,spark.memory.storageFraction=0.1,spark.driver.extraJavaOptions="-Dcq.config.name=gcp.conf",spark.executor.extraJavaOptions="-Dlog4j.configuration=log4j-executor.properties -Dcq.config.name=gcp.conf" \
--gcp.conf > gs://***-$date.log 2>&1
By default, Dataproc job driver logs are saved in GCS at the Dataproc-generated driverOutputResourceUri of the job. See this doc for more details.
But IMHO, a better way to determine if a job has finished is through gcloud dataproc jobs describe <job-id>, or the jobs.get REST API.
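For example, assuming you already have the job id, a couple of hedged gcloud calls to check the final state and locate the driver output in GCS (field names as exposed on the Dataproc job resource):
# Print only the job's current/final state (e.g. DONE, ERROR, CANCELLED)
gcloud dataproc jobs describe <job-id> --region=<region> \
  --format="value(status.state)"
# Print the GCS prefix where Dataproc wrote the driver output for this job
gcloud dataproc jobs describe <job-id> --region=<region> \
  --format="value(driverOutputResourceUri)"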

How to run hudi on dataproc and write to gcs bucket

I want to write to a gcs bucket from dataproc using hudi.
To write to GCS using Hudi, the docs say to set the property fs.defaultFS to a gs:// value (https://hudi.apache.org/docs/gcs_hoodie).
However, when I set fs.defaultFS on Dataproc to a GCS bucket, I get errors at startup about the job not being able to find my jar. It is looking under a gs:/ prefix, presumably because I have overridden defaultFS, which it was previously using to find the jar. How would I fix this?
org.apache.spark.SparkException: Application application_1617963833977_0009 failed 2 times due to AM Container for appattempt_1617963833977_0009_000002 exited with exitCode: -1000
Failing this attempt.Diagnostics: [2021-04-12 15:36:05.142]java.io.FileNotFoundException: File not found : gs:/user/root/.sparkStaging/application_1617963833977_0009/myjar.jar
If it is relevant I am setting the defaultFs from within the code.
sparkConfig.set("spark.hadoop.fs.defaultFS", "gs://defaultFs")
You can try setting fs.defaultFS to GCS when creating the cluster. For example:
gcloud dataproc clusters create ...\
--properties 'core:fs.defaultFS=gs://my-bucket'
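If you want to verify what the cluster actually picked up, one way (run on a cluster node, e.g. over SSH) is to query the effective Hadoop configuration; this is a generic Hadoop command, not anything Dataproc-specific:
# Should print gs://my-bucket if the property was applied at cluster creation
hdfs getconf -confKey fs.defaultFS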

How can we get a list of failed dataproc jobs and their start time using gcloud or python

How can we get a list of failed dataproc jobs and their start time using gcloud or python? I don't see much info about this in the documentation.
It's tricky to do exactly what you are asking for, but this command almost matches it:
gcloud dataproc jobs list --filter="status.state=INACTIVE" --format="table(jobUuid,status.state,statusHistory[0].stateStartTime)"
This will print out the Job UUID, final state, and start time for all jobs that are no longer running.
Where this falls short of what you asked is that the returned list includes all of failed, cancelled, and done jobs, rather than just the failed jobs.
The issue is that Dataproc jobs list API supports filtering on job state, but only on the broad categories of "ACTIVE" or "INACTIVE". The "INACTIVE" category includes jobs with a state of "ERROR", but also includes "DONE" and "CANCELLED".
The simplest way I could get to a full solution to what you asked is to pipe the output of that command through grep:
gcloud dataproc jobs list --filter="status.state=INACTIVE" --format="table(jobUuid,status.state,statusHistory[0].stateStartTime)" | grep ERROR
That will list only the failed jobs, but it is Unix-specific.
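If you want output that is a bit easier to post-process, a variant of the same idea (still just a sketch) is to emit CSV without a heading row and match the state column exactly:
gcloud dataproc jobs list --region=<region> \
  --filter="status.state=INACTIVE" \
  --format="csv[no-heading](jobUuid,status.state,statusHistory[0].stateStartTime)" \
  | grep ',ERROR,'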

ERROR: gcloud crashed (TransferRetryError): Bad Request

I'm running a daily Dataproc job (PySpark) that has been working fine for a year. Today, we're getting the following error:
ERROR: gcloud crashed (TransferRetryError): Bad Request
We got the error twice in a row, near the end of the job execution. It doesn't happen at a specific point in the job though.
I don't see much info about this error so I'd like to learn more about it and what could cause it.
It looks like this may have been due to some sort of (transient?) (network?) error with gcloud rather than the job itself failing.
For future reference, if this happens in the middle of job execution, you can always rerun gcloud to poll for job completion. Doing so will print out all driver logs from the beginning and continue streaming as usual until completion:
gcloud dataproc jobs wait <job-id> [--region=<region>]
If you aren't sure of the corresponding job id, it should have been printed out at job submission time. You can also list the jobs for a given cluster:
gcloud dataproc jobs list --cluster=<cluster> [--region=<region>]
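If the job may still be running, you can also narrow that list to active jobs only, using the same state filter discussed in the failed-jobs question above:
# List only jobs that are still running on the cluster
gcloud dataproc jobs list --cluster=<cluster> --region=<region> \
  --filter="status.state=ACTIVE"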

Dataproc creation fails when updating fs.defaultFS using --properties

I created a Dataproc cluster with defaultFS as HDFS, and that works fine. But whenever I try to update fs.defaultFS to 'gs', the Dataproc dashboard shows the error "Unable to Start Master, Insufficient number of datanodes reporting."
As mentioned in the comments, this question is a duplicate of "Cannot create a Dataproc cluster when setting the fs.defaultFS property?"
Re-pasting here for easier readability/discoverability rather than just putting it in the comment thread.