How can we get a list of failed dataproc jobs and their start time using gcloud or python - google-cloud-dataproc

How can we get a list of failed dataproc jobs and their start time using gcloud or python? I don't see much info about this in the documentation.

It's tricky to do exactly what you are asking for, but this command almost matches it:
gcloud dataproc jobs list --filter="status.state=INACTIVE" --format="table(jobUuid,status.state,statusHistory[0].stateStartTime)"
This will print out the Job UUID, final state, and start time for all jobs that are no longer running.
Where this falls short of what you asked is that the returned list includes all of failed, cancelled, and done jobs, rather than just the failed jobs.
The issue is that the Dataproc jobs list API supports filtering on job state, but only on the broad categories of "ACTIVE" or "INACTIVE". The "INACTIVE" category includes jobs in the "ERROR" state, but also jobs that are "DONE" or "CANCELLED".
The simplest way I could get to a full solution to what you asked is to pipe the output of that command through grep:
gcloud dataproc jobs list --filter="status.state=INACTIVE" --format="table(jobUuid,status.state,statusHistory[0].stateStartTime)" | grep ERROR
That will list only the failed jobs, but it is Unix-specific.
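Since the question also asks about Python, here is a hedged sketch using the google-cloud-dataproc client library: list NON_ACTIVE jobs server-side, then keep only the ERROR ones client-side (mirroring the grep above). The project and region values are placeholders, and the filtering helper works on Job resources in their REST (dict) shape so it can be exercised without credentials.

```python
def failed_jobs(jobs):
    """Given Dataproc Job resources as dicts (REST shape), return
    (jobUuid, start time) pairs for jobs whose final state is ERROR."""
    return [
        (job.get("jobUuid"), job["statusHistory"][0]["stateStartTime"])
        for job in jobs
        if job.get("status", {}).get("state") == "ERROR"
    ]


if __name__ == "__main__":
    # Assumes google-cloud-dataproc is installed and credentials are set up;
    # project and region are placeholders.
    from google.cloud import dataproc_v1

    region = "us-central1"
    client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    request = dataproc_v1.ListJobsRequest(
        project_id="my-project",
        region=region,
        # Server-side filtering only distinguishes ACTIVE vs NON_ACTIVE,
        # so the ERROR-only filter has to happen client-side.
        job_state_matcher=dataproc_v1.ListJobsRequest.JobStateMatcher.NON_ACTIVE,
    )
    for job in client.list_jobs(request=request):
        if job.status.state == dataproc_v1.JobStatus.State.ERROR:
            print(job.job_uuid, job.status_history[0].state_start_time)
```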

Related

gcloud CLI application logs to bucket

There are Scala Spark application jobs that run daily in GCP. I am trying to set up a notification to be sent when a run is completed. One way I thought of doing that was to get the logs and grep them for a specific completion message (not sure if there's a better way). But I found that the logs are only shown in the console, inside the job details page, and are not saved to a file.
Is there a way to route these logs to a file in a bucket so that I can search in it? Do I have to specify where to send these logs in the log4j properties file, e.g. give a bucket location to
log4j.appender.stdout = org.apache.log4j.ConsoleAppender
I tried to submit a job with this but it's giving me this error: grep:**-2022-07-08.log: No such file or directory
...
gcloud dataproc jobs submit spark \
--project $PROJECT --cluster=$CLUSTER --region=$REGION --class=***.spark.offer.Main \
--jars=gs://**.jar\
--properties=driver-memory=10G,spark.ui.filters="",spark.memory.fraction=0.6,spark.sql.files.maxPartitionBytes=5368709120,spark.memory.storageFraction=0.1,spark.driver.extraJavaOptions="-Dcq.config.name=gcp.conf",spark.executor.extraJavaOptions="-Dlog4j.configuration=log4j-executor.properties -Dcq.config.name=gcp.conf" \
--gcp.conf > gs://***-$date.log 2>&1
By default, Dataproc job driver logs are saved in GCS at the Dataproc-generated driverOutputResourceUri of the job. See this doc for more details.
But IMHO, a better way to determine whether a job has finished is through gcloud dataproc jobs describe <job-id>, or the jobs.get REST API.
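A hedged Python sketch of that polling approach via the client library's jobs.get equivalent (the project, region, and job id are placeholders). The terminal-state check is factored into a pure helper so it can be tested without credentials:

```python
import time

# Terminal job states per the Dataproc JobStatus.State enum; ATTEMPT_FAILURE
# and the various pending/running states are not terminal.
TERMINAL_STATES = {"DONE", "ERROR", "CANCELLED"}


def is_terminal(state_name):
    """True if a Dataproc job state name means the job has finished."""
    return state_name in TERMINAL_STATES


if __name__ == "__main__":
    # Assumes google-cloud-dataproc is installed; all names are placeholders.
    from google.cloud import dataproc_v1

    region = "us-central1"
    client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    while True:
        job = client.get_job(
            request={"project_id": "my-project", "region": region,
                     "job_id": "my-job-id"}
        )
        if is_terminal(job.status.state.name):
            print("job finished with state", job.status.state.name)
            break
        time.sleep(30)
```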

How to check cluster name, execution time of dataproc jobs via gcloud command

We need to investigate Dataproc jobs run on a particular date.
Running the command
gcloud dataproc jobs list --region=<region_name>
gives output in the following format:
JOB_ID TYPE STATUS
where TYPE could be spark/hive and STATUS could be DONE/ERROR.
Is there any way to check the job execution time and cluster name as well?
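No answer is shown above, but as a hedged sketch: the Job resource also carries placement.clusterName and per-state timestamps in statusHistory (the same fields the --format projections in the first answer draw on), so both the cluster name and an approximate execution time can be derived client-side. Field names follow the Dataproc REST Job resource; the duration here is measured from the first recorded state to the final one, which is an approximation:

```python
from datetime import datetime


def parse_ts(ts):
    """Parse an RFC 3339 timestamp like '2022-07-08T12:34:56.789Z'."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def job_summary(job):
    """job: a Dataproc Job resource as a dict (REST shape).
    Returns (cluster_name, approximate duration in seconds)."""
    history = list(job.get("statusHistory", [])) + [job.get("status", {})]
    times = [parse_ts(s["stateStartTime"]) for s in history
             if "stateStartTime" in s]
    duration = (max(times) - min(times)).total_seconds() if times else None
    return job.get("placement", {}).get("clusterName"), duration
```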

ERROR: gcloud crashed (TransferRetryError): Bad Request

I'm running a daily Dataproc job (PySpark) that has been working fine for a year. Today, we're getting the following error:
ERROR: gcloud crashed (TransferRetryError): Bad Request
We got the error twice in a row, near the end of the job execution. It doesn't happen at a specific point in the job though.
I don't see much info about this error so I'd like to learn more about it and what could cause it.
It looks like this may have been due to some sort of transient (possibly network-related) error in gcloud rather than the job itself failing.
For future reference, if this happens in the middle of job execution, you can always rerun gcloud to poll for job completion. Doing so will print out all driver logs from the beginning and continue streaming as usual until completion:
gcloud dataproc jobs wait <job-id> [--region=<region>]
If you aren't sure of the corresponding job id, it should have been printed out at job submission time. You can also list the jobs for a given cluster:
gcloud dataproc jobs list --cluster=<cluster> [--region=<region>]

Google Dataproc jobs tab not listing the jobs

I have created a Dataproc cluster and run Dataproc jobs. When I select the Jobs tab, it doesn't list the jobs, even when I select all regions.
There was a recently identified bug where jobs fail to list if you have no jobs in the older "global" region; this is fixed in code, but the fix will take some time to be released everywhere, possibly up to the middle of next week or so.
In the meantime, if you run any job in the "global" region and don't delete it, you should be able to see all of your jobs in other regions as well.
Once the fix is rolled out, this workaround will no longer be necessary.

What is a use case for kubernetes job?

I'm looking to fully understand the jobs in kubernetes.
I have successfully created and executed a job, but I do not see the use case.
Not being able to rerun a job, or to actively listen for its completion, makes me think it is a bit difficult to manage.
Anyone using them? Which is the use case?
Thank you.
A job retries pods until they complete, so that you can tolerate errors that cause pods to be deleted.
If you want to run a job repeatedly and periodically, you can use the CronJob alpha or cronetes.
Some Helm Charts use Jobs to run install, setup, or test commands on clusters, as part of installing services. (Example).
If you save the YAML for the job, then you can re-run it by deleting the old job and creating it again, or by editing the YAML to change the name (or use e.g. sed in a script).
You can watch a job's status with this command:
kubectl get jobs myjob -w
The -w option watches for changes. You are looking for the SUCCESSFUL column to show 1.
Here is a shell command loop to wait for job completion (e.g. in a script):
until kubectl get jobs myjob -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}' | grep True ; do sleep 1 ; done
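The same wait can also be done programmatically; here is a hedged sketch with the official Python kubernetes client (assumes the kubernetes package is installed and a kubeconfig context is available; the job name and namespace are placeholders). The condition check is factored out so it can be tested on plain dicts:

```python
def job_complete(conditions):
    """conditions: a list of dicts shaped like a Job's .status.conditions.
    True once a condition of type "Complete" has status "True"."""
    return any(
        c.get("type") == "Complete" and c.get("status") == "True"
        for c in conditions or []
    )


if __name__ == "__main__":
    import time
    from kubernetes import client, config  # assumes the kubernetes package

    config.load_kube_config()
    batch = client.BatchV1Api()
    while True:
        job = batch.read_namespaced_job("myjob", "default")
        conds = [
            {"type": c.type, "status": c.status}
            for c in (job.status.conditions or [])
        ]
        if job_complete(conds):
            break
        time.sleep(1)
```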
One use case can be taking a backup of a DB. But as already mentioned, there are some overheads to running a Job: when a Job completes, its Pods are not deleted, so you need to delete the Job manually (which will also delete the Pods it created). So the recommended option is to use a CronJob instead of a bare Job.