How to force delete dataproc serverless batch - google-cloud-dataproc

I am running a PySpark Dataproc Serverless batch. It has been running for too long, so I decided to delete it, but neither the GCP console nor the CLI allows me to delete the batch.
The command I tried is
gcloud dataproc batches delete <Batch ID> --region=us-central1
I get the following error:
ERROR: (gcloud.dataproc.batches.delete) FAILED_PRECONDITION: Cannot delete non terminal batch 'Batch(<project-id/batch-id>)'; current state: 'RUNNING'

gcloud dataproc batches cancel is used to cancel a running batch, while gcloud dataproc batches delete is used to delete the batch resource. In this case, you want to use cancel.
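For example, something along these lines should work (the batch ID and region are placeholders; once cancellation moves the batch into a terminal state, the delete should succeed):

gcloud dataproc batches cancel <Batch ID> --region=us-central1
gcloud dataproc batches delete <Batch ID> --region=us-central1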

Related

gcloud CLI application logs to bucket

There are Scala Spark application jobs that run daily in GCP. I am trying to set up a notification to be sent when a run is completed. One way I thought of doing that was to get the logs and grep them for a specific completion message (not sure if there's a better way). But I found that the logs are only shown in the console, inside the job details page, and are not saved to a file.
Is there a way to route these logs to a file in a bucket so that I can search it? Do I have to specify where to write these logs in the log4j properties file, e.g. give a bucket location to
log4j.appender.stdout = org.apache.log4j.ConsoleAppender
I tried to submit a job with this but it's giving me this error: grep:**-2022-07-08.log: No such file or directory
...
gcloud dataproc jobs submit spark \
--project $PROJECT --cluster=$CLUSTER --region=$REGION --class=***.spark.offer.Main \
--jars=gs://**.jar\
--properties=driver-memory=10G,spark.ui.filters="",spark.memory.fraction=0.6,spark.sql.files.maxPartitionBytes=5368709120,spark.memory.storageFraction=0.1,spark.driver.extraJavaOptions="-Dcq.config.name=gcp.conf",spark.executor.extraJavaOptions="-Dlog4j.configuration=log4j-executor.properties -Dcq.config.name=gcp.conf" \
--gcp.conf > gs://***-$date.log 2>&1
By default, Dataproc job driver logs are saved in GCS at the Dataproc-generated driverOutputResourceUri of the job. See this doc for more details.
But IMHO, a better way to determine whether a job has finished is gcloud dataproc jobs describe <job-id>, or the jobs.get REST API.
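As a rough sketch, a completion check could look like this (the job ID, region variable, and format expressions are illustrative; status.state and driverOutputResourceUri are fields of the Dataproc job resource):

gcloud dataproc jobs describe <job-id> --region=$REGION --format='value(status.state)'
gcloud dataproc jobs describe <job-id> --region=$REGION --format='value(driverOutputResourceUri)'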

How to check cluster name, execution time of dataproc jobs via gcloud command

We need to investigate Dataproc jobs that ran on a particular date.
When I run the command
gcloud dataproc jobs list --region=<region_name>
I get output in the following format:
JOB_ID TYPE STATUS
where TYPE could be spark/hive and STATUS could be DONE/ERROR
Is there any way to check the job execution time and cluster name as well?
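As a sketch (the field names are taken from the Dataproc job resource and the format expression is an assumption; execution time would have to be derived from the statusHistory timestamps), the --format flag can surface the cluster name and state-change times:

gcloud dataproc jobs list --region=<region_name> \
  --format='table(reference.jobId, placement.clusterName, status.state, status.stateStartTime)'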

How to run hudi on dataproc and write to gcs bucket

I want to write to a gcs bucket from dataproc using hudi.
To write to GCS using Hudi, the docs say to set the property fs.defaultFS to a gs:// value (https://hudi.apache.org/docs/gcs_hoodie).
However, when I set fs.defaultFS on Dataproc to a GCS bucket, I get errors at startup saying the job cannot find my jar. It is looking under a gs:/ prefix, presumably because I have overridden defaultFS, which it was previously using to find the jar. How would I fix this?
org.apache.spark.SparkException: Application application_1617963833977_0009 failed 2 times due to AM Container for appattempt_1617963833977_0009_000002 exited with exitCode: -1000
Failing this attempt.Diagnostics: [2021-04-12 15:36:05.142]java.io.FileNotFoundException: File not found : gs:/user/root/.sparkStaging/application_1617963833977_0009/myjar.jar
If it is relevant I am setting the defaultFs from within the code.
sparkConfig.set("spark.hadoop.fs.defaultFS", "gs://defaultFs")
You can try setting fs.defaultFS to GCS when creating the cluster. For example:
gcloud dataproc clusters create ... \
--properties 'core:fs.defaultFS=gs://my-bucket'

Dataproc creation fails when updating fs.defaultFS using --properties

I created a Dataproc cluster with defaultFS as HDFS, and that works fine. But whenever I try to update fs.defaultFS to 'gs', the Dataproc dashboard shows the error "Unable to Start Master, Insufficient number of datanodes reporting."
As mentioned in the comments, this question is a duplicate of Cannot create a Dataproc cluster when setting the fs.defaultFS property?
Re-pasting here for easier readability/discoverability vs just putting it in the comment thread.

Give custom job_id to Google Dataproc cluster for running pig/hive/spark jobs

Is there any flag available to give a custom job_id to Dataproc jobs? I am using this command to run Pig jobs.
gcloud dataproc jobs submit pig --cluster my_cluster --file my_queries.pig
I use similar commands to submit pyspark/hive jobs.
This command creates a job_id on its own, and tracking it later on is difficult.
Reading the gcloud code, you can see that the argument called id is used as the job name:
https://github.com/google-cloud-sdk/google-cloud-sdk/blob/master/lib/googlecloudsdk/command_lib/dataproc/jobs/submitter.py#L56
Therefore, you only need to add --id to your gcloud command:
gcloud dataproc jobs submit spark --id this-is-my-job-name --cluster my-cluster --class com.myClass.Main --jars gs://my.jar
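The custom id can then be used to look the job up later, for example (the region here is an assumption):

gcloud dataproc jobs describe this-is-my-job-name --region=us-central1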
While it's possible to provide your own generated jobid when using the underlying REST API, there isn't currently any way to specify your own jobid when submitting with gcloud dataproc jobs submit; this feature might be added in the future. That said, typically when people want to specify job ids they also want to be able to list with more complex match expressions, or potentially to have multiple categories of jobs listed by different kinds of expressions at different points in time.
So, you might want to consider dataproc labels instead; labels are intended specifically for this kind of use case, and are optimized for efficient lookup. For example:
gcloud dataproc jobs submit pig --labels jobtype=mylogspipeline,date=20170508 ...
gcloud dataproc jobs submit pig --labels jobtype=mylogspipeline,date=20170509 ...
gcloud dataproc jobs submit pig --labels jobtype=mlpipeline,date=20170509 ...
gcloud dataproc jobs list --filter "labels.jobtype=mylogspipeline"
gcloud dataproc jobs list --filter "labels.date=20170509"
gcloud dataproc jobs list --filter "labels.date=20170509 AND labels.jobtype=mlpipeline"