I have submitted a Spark job through Airflow. Sometimes the job works, and sometimes it doesn't give any output at all.
Even after 2-3 hours of waiting, the job gives no detail apart from
Waiting for job output...
I am using the dataproc-1-4-deb10 image.
It's a simple job, just pulling data over JDBC using PySpark. It works without error sometimes and sometimes doesn't at all.
ADF Pipeline DF task is stuck in progress. It was working seamlessly for the last couple of months, but suddenly the Data Flow gets stuck in progress and times out after a certain time. We are using a managed Virtual Network IR. I am using a ForEach loop to run the data flow for multiple entities in parallel, and it always randomly gets stuck on the last entity.
What can I try to resolve this?
Error in Dev Environment:
Error code: 4508, Spark cluster not found
Error in Prod Environment:
Error code: 5000
Failure type: User configuration issue
Details: [plugins.*** ADF.adf-ir-001 WorkspaceType:<ADF> CCID:<f289c067-7c6c-4b49-b0db-783e842a5675>] [Monitoring] Livy Endpoint=[https://hubservice1.eastus.azuresynapse.net:8001/api/v1.0/publish/815b62a1-7b45-4fe1-86f4-ae4b56014311]. Livy Id=[0] Job failed during run time with state=[dead].
I tried the steps below:
Changing the IR configuration
Trying DF retry and retry interval
Running the ForEach loop one batch at a time instead of 4 batches in parallel
None of these troubleshooting steps worked. These pipelines have been running for the last 3-4 months without a single failure; suddenly they started failing consistently in the last 3 days. The data flow always gets stuck in progress randomly, on a different entity each time, and eventually times out with the errors above.
Error code 4508: Spark cluster not found
This error can occur for two reasons.
The first is that the debug session gets closed before the data flow finishes its transformation; in this case the recommendation is to restart the debug session.
The second is a resource problem, or an outage in that particular region.
Error code 5000, failure type User configuration issue: [plugins.*** ADF.adf-ir-001 WorkspaceType: CCID:] [Monitoring] Livy Endpoint=[https://hubservice1.eastus.azuresynapse.net:8001/api/v1.0/publish/815b62a1-7b45-4fe1-86f4-ae4b56014311]. Livy Id=[0] Job failed during run time with state=[dead].
"Livy job state dead caused by unknown error" is usually a transient error. The data flow uses a Spark cluster on the backend, and this error is generated by that cluster. To get more information about the error, check the StdOut of the Spark pool execution.
The backend cluster may be experiencing a network problem, a resource problem, or an outage.
If the error persists, my suggestion is to raise a Microsoft support ticket.
I have a scheduled parallel Datastage (11.7) job.
This job has a Hive Connector with a Before and After Statement.
The Before statement runs OK, but the After statement remains in a running state for several hours (in the Hue log I can see the job finished in 1 hour) and I have to manually abort it in DataStage Director.
Is there a way to "program" an abort?
For example, I want to schedule the interruption of the running job every morning at 6.
I hope I was clear :)
Even though you can kill the job - as per other responses - using dsjob to stop it, this may have no effect, because the After statement has been issued synchronously; the job is waiting for it to finish and is (probably) not processing kill signals and the like in the meantime. You would be better advised to work out why the After command is taking so long, and to address that.
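If you do still want to script the stop (with the caveat above that it may not take effect while the After statement is blocking), a minimal Python sketch that shells out to the dsjob command-line interface could look like the one below; the project and job names are placeholders, and it assumes dsjob is on the PATH of the user running the script.

import subprocess

# Placeholder DataStage project and job names - replace with your own.
PROJECT = "MyProject"
JOB = "MyHiveLoadJob"

# Ask DataStage to stop the running job instance.
result = subprocess.run(["dsjob", "-stop", PROJECT, JOB], capture_output=True, text=True)
print(result.stdout)
print(result.stderr)

Scheduling that script with cron (or any scheduler) for 06:00 would give you the "programmed abort" you describe, but again, fixing the slow After statement is the better route.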
I'm running a daily Dataproc job (PySpark) that has been working fine for a year. Today, we're getting the following error:
ERROR: gcloud crashed (TransferRetryError): Bad Request
We got the error twice in a row, near the end of the job execution. It doesn't happen at a specific point in the job though.
I don't see much info about this error so I'd like to learn more about it and what could cause it.
It looks like this may have been due to some sort of (transient?) (network?) error with gcloud rather than the job itself failing.
For future reference, if this happens in the middle of job execution, you can always rerun gcloud to poll for job completion. Doing so will print out all driver logs from the beginning and continue streaming as usual until completion:
gcloud dataproc jobs wait <job-id> [--region=<region>]
If you aren't sure of the corresponding job id, it should have been printed out at job submission time. You can also list the jobs for a given cluster:
gcloud dataproc jobs list --cluster=<cluster> [--region=<region>]
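If you'd rather poll from Python instead of rerunning gcloud, a rough sketch using the google-cloud-dataproc client library is below; the project, region, and job id are placeholders, and it assumes the google-cloud-dataproc package is installed.

import time
from google.cloud import dataproc_v1

# Placeholder values - replace with your own project, region, and job id.
PROJECT_ID = "my-project"
REGION = "us-central1"
JOB_ID = "my-job-id"

# Dataproc jobs are addressed through the regional endpoint.
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

# Poll until the job reaches a terminal state.
terminal_states = {
    dataproc_v1.JobStatus.State.DONE,
    dataproc_v1.JobStatus.State.ERROR,
    dataproc_v1.JobStatus.State.CANCELLED,
}
while True:
    job = client.get_job(project_id=PROJECT_ID, region=REGION, job_id=JOB_ID)
    if job.status.state in terminal_states:
        print(f"Job finished with state {job.status.state.name}")
        break
    time.sleep(30)

Note that, unlike gcloud dataproc jobs wait, this only reports the job state; the driver output still has to be read from the job's driver output location or from Cloud Logging.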
I am currently using Airflow 1.8.2 to schedule some EMR tasks and then execute some long-running queries on our Redshift cluster. For that purpose I am using the postgres_operator. The queries take about 30 minutes to run. However, once they are done, the connection never closes and the operator runs for an hour and a half more until it's terminated at the 2-hour mark every time. The message on termination is that the server closed the connection unexpectedly.
I've checked the logs on Redshift's end, and they show the queries have run and the connection has been closed. Somehow, that is never communicated back to Airflow. Any direction on what else I could check would be helpful. To give some more info, my Airflow installation is an extension of the https://github.com/puckel/docker-airflow Docker image, runs in an ECS cluster, and uses SQLite as the backend since I am still testing Airflow out. Also, I'm using the sequential executor. I would appreciate any help in this matter.
We had a similar issue before, but I was using SQLAlchemy to connect to Redshift; if you are using postgres_operator, it should be very similar. It seems Redshift will close the connection if it doesn't see any activity on a long-running query, and in your case 30 minutes is a pretty long query.
Check https://www.postgresql.org/docs/9.5/static/runtime-config-connection.html
You have three settings, tcp_keepalives_idle, tcp_keepalives_interval, and tcp_keepalives_count, that send a keepalive message to Redshift to indicate "Hey, I am still alive."
You can pass these as arguments, something like this: connect_args={'keepalives': 1, 'keepalives_idle': 60, 'keepalives_interval': 60}
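For example, with SQLAlchemy and the psycopg2 driver those settings go into create_engine; this is just a sketch, and the Redshift endpoint and credentials below are placeholders.

from sqlalchemy import create_engine, text

# Placeholder Redshift endpoint and credentials - replace with your own.
engine = create_engine(
    "postgresql+psycopg2://user:password@my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/mydb",
    connect_args={
        "keepalives": 1,            # enable TCP keepalives on the client socket
        "keepalives_idle": 60,      # seconds of inactivity before the first keepalive
        "keepalives_interval": 60,  # seconds between keepalives after that
    },
)

with engine.connect() as conn:
    conn.execute(text("SELECT 1"))  # the socket now stays alive during long-running queries

The same keepalive keys are plain libpq connection parameters, so they should also work wherever the underlying psycopg2 connection is created (for example in an Airflow connection's extras), though I haven't tested that path with postgres_operator specifically.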
A job has been submitted and there is an entry for it in DBA_JOBS, but the job never moves into the running state, so there is no entry for it in DBA_JOBS_RUNNING. The parameter 'JOB_QUEUE_PROCESS' has the value 10 and there are no jobs in the running state. Please suggest how to solve this problem.
SELECT NEXT_DATE, NEXT_SEC, BROKEN, FAILURES, WHAT
FROM DBA_JOBS
WHERE JOB = :JOB_ID
What does that return? A BROKEN job won't kick off, and if the NEXT_DATE/NEXT_SEC is in the future, it won't kick off until that time either.
I hope you have that database parameter named correctly, i.e. 'JOB_QUEUE_PROCESSES=10'.
This is typically why a job won't run.
Also check that the user/schema running the job is correct.
An alternative is to use a different scheduling tool to run the job (e.g. cron on Linux).
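As an illustration of that last suggestion, here is a rough Python sketch that forces the queued job to run from outside the job queue via DBMS_JOB.RUN; it assumes the python-oracledb driver, the credentials, DSN, and job id are placeholders, and it must connect as the job's owner. A script like this could be called from cron.

import oracledb

# Placeholder credentials, DSN, and job id - replace with your own.
conn = oracledb.connect(user="job_owner", password="secret", dsn="dbhost:1521/orclpdb1")

with conn.cursor() as cur:
    # DBMS_JOB.RUN executes the job immediately in this session,
    # bypassing the job queue coordinator.
    cur.callproc("DBMS_JOB.RUN", [1234])

conn.commit()
conn.close()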