How to check cluster name and execution time of dataproc jobs via gcloud command

We need to investigate dataproc jobs that ran on a particular date.
Running the command
gcloud dataproc jobs list --region=<region_name>
gives output in the following format:
JOB_ID TYPE STATUS
where TYPE can be spark/hive and STATUS can be DONE/ERROR.
Is there any way to check the job execution time and the cluster name as well?
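One approach that might work, as a sketch assuming the standard Dataproc Job resource fields placement.clusterName, status.stateStartTime and statusHistory, is to pass a custom --format to jobs list:
gcloud dataproc jobs list --region=<region_name> --format="table(reference.jobId, placement.clusterName, status.state, statusHistory[0].stateStartTime, status.stateStartTime)"
Here statusHistory[0].stateStartTime is roughly the submission time and status.stateStartTime is when the job entered its final state, so the difference gives an approximate execution time.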

Related

How to force delete dataproc serverless batch

I am running a pyspark dataproc serverless batch. It has been running for too long, so I decided to delete it, but neither the GCP console nor the CLI allows me to delete the batch.
The command I tried is
gcloud dataproc batches delete <Batch ID> --region=us-central1
I get the following error:
ERROR: (gcloud.dataproc.batches.delete) FAILED_PRECONDITION: Cannot delete non terminal batch 'Batch(<project-id/batch-id>)'; current state: 'RUNNING'
gcloud dataproc batches cancel is used to cancel a running batch, while gcloud dataproc batches delete is used to delete the batch resource. In this case, you want to use cancel.
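For example, reusing the batch ID and region from the question, something like this should work (the batch can be deleted once it reaches a terminal state):
gcloud dataproc batches cancel <Batch ID> --region=us-central1
gcloud dataproc batches delete <Batch ID> --region=us-central1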

gcloud CLI application logs to bucket

There are Scala Spark application jobs that run daily in GCP. I am trying to set up a notification to be sent when a run is completed. One way I thought of doing that was to get the logs and grep them for a specific completion message (not sure if there's a better way). But I found that the logs are only shown in the console, on the job details page, and are not saved to a file.
Is there a way to route these logs to a file in a bucket so that I can search them? Do I have to specify where these logs go in the log4j properties file, for example by giving a bucket location to
log4j.appender.stdout = org.apache.log4j.ConsoleAppender
I tried to submit a job with this but it's giving me this error: grep:**-2022-07-08.log: No such file or directory
...
gcloud dataproc jobs submit spark \
--project $PROJECT --cluster=$CLUSTER --region=$REGION --class=***.spark.offer.Main \
--jars=gs://**.jar\
--properties=driver-memory=10G,spark.ui.filters="",spark.memory.fraction=0.6,spark.sql.files.maxPartitionBytes=5368709120,spark.memory.storageFraction=0.1,spark.driver.extraJavaOptions="-Dcq.config.name=gcp.conf",spark.executor.extraJavaOptions="-Dlog4j.configuration=log4j-executor.properties -Dcq.config.name=gcp.conf" \
--gcp.conf > gs://***-$date.log 2>&1
By default, Dataproc job driver logs are saved in GCS at the Dataproc-generated driverOutputResourceUri of the job; see the Dataproc documentation on job output for more details.
But IMHO, a better way to determine whether a job has finished is through gcloud dataproc jobs describe <job-id>, or the jobs.get REST API.
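For example (a sketch; <job-id> and <region> are placeholders, and the field names come from the Dataproc Job resource):
# prints the current job state, e.g. PENDING, RUNNING, DONE or ERROR
gcloud dataproc jobs describe <job-id> --region=<region> --format="value(status.state)"
# prints the GCS location of the driver output, which can then be read with gsutil
gcloud dataproc jobs describe <job-id> --region=<region> --format="value(driverOutputResourceUri)"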

How can we get a list of failed dataproc jobs and their start time using gcloud or python

How can we get a list of failed dataproc jobs and their start time using gcloud or python? I don't see much info about this in the documentation.
It's tricky to do exactly what you are asking for, but this command almost matches it:
gcloud dataproc jobs list --filter="status.state=INACTIVE" --format="table(jobUuid,status.state,statusHistory[0].stateStartTime)"
This will print out the Job UUID, final state, and start time for all jobs that are no longer running.
Where this falls short of what you asked is that the returned list includes failed, cancelled, and done jobs, rather than just the failed ones.
The issue is that the Dataproc jobs list API supports filtering on job state, but only by the broad categories "ACTIVE" and "INACTIVE". The "INACTIVE" category includes jobs in the "ERROR" state, but also "DONE" and "CANCELLED" jobs.
The simplest way I could get to a full solution to what you asked is to pipe the output of that command through grep:
gcloud dataproc jobs list --filter="status.state=INACTIVE" --format="table(jobUuid,status.state,statusHistory[0].stateStartTime)" | grep ERROR
That will list only the failed jobs, but it is Unix-specific.
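If you need the same output without the table header (e.g. for scripting), a no-heading CSV format should also work; this is a sketch using the same fields:
gcloud dataproc jobs list --filter="status.state=INACTIVE" --format="csv[no-heading](jobUuid,status.state,statusHistory[0].stateStartTime)" | grep ERROR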

ERROR: gcloud crashed (TransferRetryError): Bad Request

I'm running a daily dataproc job (pyspark) that has been working fine for a year. Today, we're getting the following error:
ERROR: gcloud crashed (TransferRetryError): Bad Request
We got the error twice in a row, near the end of the job execution. It doesn't happen at a specific point in the job though.
I don't see much info about this error so I'd like to learn more about it and what could cause it.
It looks like this may have been due to some sort of (transient?) (network?) error with gcloud rather than the job itself failing.
For future reference, if this happens in the middle of job execution, you can always rerun gcloud to poll for job completion. Doing so will print out all driver logs from the beginning and continue streaming as usual until completion:
gcloud dataproc jobs wait <job-id> [--region=<region>]
If you aren't sure of the corresponding job id, it should have been printed out at job submission time. You can also list the jobs for a given cluster:
gcloud dataproc jobs list --cluster=<cluster> [--region=<region>]
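For example, a sketch that finds recent job IDs on a cluster and then re-attaches to one of them (reference.jobId is the Job resource field holding the job ID; --limit is the standard gcloud list flag):
gcloud dataproc jobs list --cluster=<cluster> --region=<region> --limit=10 --format="value(reference.jobId)"
gcloud dataproc jobs wait <job-id> --region=<region>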

Give custom job_id to Google Dataproc cluster for running pig/hive/spark jobs

Is there any flag available to give a custom job_id to dataproc jobs? I am using this command to run pig jobs:
gcloud dataproc jobs submit pig --cluster my_cluster --file my_queries.pig
I use similar commands to submit pyspark/hive jobs.
This command creates a job_id on its own, which makes tracking the jobs later difficult.
Reading the gcloud code, you can see that the argument called id is used as the job name:
https://github.com/google-cloud-sdk/google-cloud-sdk/blob/master/lib/googlecloudsdk/command_lib/dataproc/jobs/submitter.py#L56
Therefore, you only need to add --id to your gcloud command:
gcloud dataproc jobs submit spark --id this-is-my-job-name --cluster my-cluster --class com.myClass.Main --jars gs://my.jar
While it's possible to provide your own generated jobid when using the underlying REST API, there isn't currently any way to specify your own jobid when submitting with gcloud dataproc jobs submit; this feature might be added in the future. That said, typically when people want to specify job ids they also want to be able to list with more complex match expressions, or potentially to have multiple categories of jobs listed by different kinds of expressions at different points in time.
So, you might want to consider dataproc labels instead; labels are intended specifically for this kind of use case, and are optimized for efficient lookup. For example:
gcloud dataproc jobs submit pig --labels jobtype=mylogspipeline,date=20170508 ...
gcloud dataproc jobs submit pig --labels jobtype=mylogspipeline,date=20170509 ...
gcloud dataproc jobs submit pig --labels jobtype=mlpipeline,date=20170509 ...
gcloud dataproc jobs list --filter "labels.jobtype=mylogspipeline"
gcloud dataproc jobs list --filter "labels.date=20170509"
gcloud dataproc jobs list --filter "labels.date=20170509 AND labels.jobtype=mlpipeline"
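Combining a label filter with a custom --format (as in the earlier answer about failed jobs) also gets you the job ID and start time for each labelled job; a sketch:
gcloud dataproc jobs list --filter="labels.jobtype=mylogspipeline" --format="table(reference.jobId, status.state, statusHistory[0].stateStartTime)"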