I'm running a daily Dataproc job (PySpark) that has been working fine for a year. Today, we're getting the following error:
ERROR: gcloud crashed (TransferRetryError): Bad Request
We got the error twice in a row, near the end of the job execution. It doesn't happen at a specific point in the job though.
I don't see much info about this error so I'd like to learn more about it and what could cause it.
It looks like this may have been due to some kind of transient error (possibly a network issue) with gcloud rather than the job itself failing.
For future reference, if this happens in the middle of job execution, you can always rerun gcloud to poll for job completion. Doing so will print out all driver logs from the beginning and continue streaming as usual until completion:
gcloud dataproc jobs wait <job-id> [--region=<region>]
If you aren't sure of the corresponding job id, it should have been printed out at job submission time. You can also list the jobs for a given cluster:
gcloud dataproc jobs list --cluster=<cluster> [--region=<region>]
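If gcloud itself keeps crashing, you can also check the job's state through the Dataproc API from Python. This is a minimal sketch, assuming the google-cloud-dataproc client library is installed; the project, region, and job id values are placeholders:
from google.cloud import dataproc_v1

project_id = "my-project"   # placeholder
region = "us-central1"      # placeholder
job_id = "my-job-id"        # printed at job submission time

# Dataproc uses regional endpoints, so point the client at the job's region.
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = client.get_job(request={"project_id": project_id, "region": region, "job_id": job_id})
print(job.status.state.name)  # e.g. PENDING, RUNNING, DONE, ERROR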
Related
ADF Pipeline DF task is stuck in progress. It was working seamlessly for the last couple of months, but suddenly the Data Flow gets stuck in progress and times out after a certain time. We are using a managed Virtual Network IR. I am using a ForEach loop to run the data flow for multiple entities in parallel, and it always randomly gets stuck on the last entity.
What can I try to resolve this?
Error in Dev Environment:
Error code: 4508
Spark cluster not found

Error in Prod Environment:
Error code: 5000
Failure type: User configuration issue
Details: [plugins.*** ADF.adf-ir-001 WorkspaceType:<ADF> CCID:<f289c067-7c6c-4b49-b0db-783e842a5675>] [Monitoring] Livy Endpoint=[https://hubservice1.eastus.azuresynapse.net:8001/api/v1.0/publish/815b62a1-7b45-4fe1-86f4-ae4b56014311]. Livy Id=[0] Job failed during run time with state=[dead].
I tried the below steps:
Changing the IR configuration
Setting the Data Flow activity retry and retry interval
Running the ForEach loop one batch at a time instead of 4 batches in parallel
None of the above troubleshooting steps worked. These pipelines have been running for the last 3-4 months without a single failure, but over the last 3 days they have started failing consistently. The Data Flow always gets stuck in progress randomly on a different entity and eventually times out, throwing the errors above.
Error code 4508: Spark cluster not found
This error can occur for two reasons:
The debug session is being closed before the data flow finishes its transformation; in this case the recommendation is to restart the debug session.
The second reason is a resource problem or an outage in that particular region.
Error code 5000, failure type "User configuration issue", details: [plugins.*** ADF.adf-ir-001 WorkspaceType:<ADF> CCID:<f289c067-7c6c-4b49-b0db-783e842a5675>] [Monitoring] Livy Endpoint=[https://hubservice1.eastus.azuresynapse.net:8001/api/v1.0/publish/815b62a1-7b45-4fe1-86f4-ae4b56014311]. Livy Id=[0] Job failed during run time with state=[dead].
An error that says "Livy job state dead caused by unknown error" is typically a temporary one. The data flow uses a Spark cluster on the backend, and this error is generated by that Spark cluster. To get more information about the error, check the StdOut of the Spark pool execution.
The backend cluster may be experiencing a network problem, a resource problem, or an outage.
If the error persists, my suggestion is to raise a Microsoft support ticket.
Error message: the job failed with the error "The output of the notebook is too large. Cause: rpc response (of 20972488 bytes) exceeds limit of 20971520 bytes".
Details:
We are using Databricks notebooks to run the job. The job runs on a job cluster, and it is a streaming job.
The job started failing with the above-mentioned error.
We do not have any display(), show(), print(), or explain() calls in the job.
We are also not using the awaitAnyTermination() method in the job.
We also tried adding "spark.databricks.driver.disableScalaOutput true" to the job, but it still did not work; the job fails with the same error.
We have followed all the steps mentioned in this document - https://learn.microsoft.com/en-us/azure/databricks/kb/jobs/job-cluster-limit-nb-output
Do we have any option to resolve this issue, or to find out exactly which command's output is causing it to go above the 20 MB limit?
See the docs regarding structured streaming in prod.
I would recommend migrating to workflows based on jar jobs because:
Notebook workflows are not supported with long-running jobs. Therefore we don’t recommend using notebook workflows in your streaming jobs.
I have submitted a Spark job through Airflow. Sometimes the job works, and sometimes it doesn't give any output at all.
Even after 2-3 hours of waiting, the job gives no detail apart from
Waiting for job output...
I am using dataproc-1-4-deb10
It's a simple job, like pulling data from JDBC using PySpark. It works without error sometimes, and sometimes it doesn't work at all.
How can we get a list of failed dataproc jobs and their start time using gcloud or python? I don't see much info about this in the documentation.
It's tricky to do exactly what you are asking for, but this command almost matches it:
gcloud dataproc jobs list --filter="status.state=INACTIVE" --format="table(jobUuid,status.state,statusHistory[0].stateStartTime)"
This will print out the Job UUID, final state, and start time for all jobs that are no longer running.
Where this falls short of what you asked is that the returned list includes all of failed, cancelled, and done jobs, rather than just the failed jobs.
The issue is that Dataproc jobs list API supports filtering on job state, but only on the broad categories of "ACTIVE" or "INACTIVE". The "INACTIVE" category includes jobs with a state of "ERROR", but also includes "DONE" and "CANCELLED".
The simplest way I could get to a full solution to what you asked is to pipe the output of that command through grep
gcloud dataproc jobs list --filter="status.state=INACTIVE" --format="table(jobUuid,status.state,statusHistory[0].stateStartTime)" | grep ERROR
That will list only the failed jobs, but it is Unix-specific.
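If you prefer Python (and want to avoid the Unix-specific grep), the same approach works with the google-cloud-dataproc client library: request the non-active jobs and narrow them down to ERROR on the client side. A minimal sketch, with placeholder project and region values:
from google.cloud import dataproc_v1

project_id = "my-project"   # placeholder
region = "us-central1"      # placeholder

client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# The API's state matcher only distinguishes ACTIVE from NON_ACTIVE,
# so filter down to ERROR ourselves.
jobs = client.list_jobs(request={
    "project_id": project_id,
    "region": region,
    "job_state_matcher": dataproc_v1.ListJobsRequest.JobStateMatcher.NON_ACTIVE,
})
for job in jobs:
    if job.status.state == dataproc_v1.JobStatus.State.ERROR:
        # statusHistory[0].stateStartTime is the job's start time,
        # matching the gcloud format string above.
        print(job.job_uuid, job.status_history[0].state_start_time)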
I'd like to save more specific errors in the case of a failed initialization script of a Dataproc cluster. Is it possible to update the cluster metadata or add a label to the cluster (without using gcloud dataproc clusters update) from within the script? Or any other method to write a more useful error message? Thanks in advance!
If your goal is to report an error from within an initialization action, there is a feature within Dataproc to extract messages from init action output.
As long as you emit a message in this format, it will be extracted: StructuredError{message}
For example:
message="something went wrong"
echo "StructuredError{${message}}"