I have a stage with two jobs (build and deploy) that has been giving me a lot of trouble for the past two days.
Yesterday it started failing about one in every four times I ran the stage; the stage gets cancelled by itself. At first I thought it might be a problem with my code, but rerunning the same execution also failed some of the time.
Today, though, I haven't managed to complete a single run (I've tried about 20 times in the last 3 hours). Sometimes the build job is cancelled, other times the deploy job.
Is anybody facing the same problem? I can't work out where the issue is coming from...
Thanks!
ADF pipeline Data Flow task is stuck in progress. It was working seamlessly for the last couple of months, but suddenly the data flow gets stuck in progress and times out after a certain time. We are using an IR with managed virtual network. I am using a ForEach loop to run the data flow for multiple entities in parallel, and it always randomly gets stuck on the last entity.
What can I try to resolve this?
Error in Dev Environment:
Error code: 4508
Spark cluster not found
Error in Prod Environment:
Error code: 5000
Failure type: User configuration issue
Details: [plugins.*** ADF.adf-ir-001 WorkspaceType:<ADF> CCID:<f289c067-7c6c-4b49-b0db-783e842a5675>] [Monitoring] Livy Endpoint=[https://hubservice1.eastus.azuresynapse.net:8001/api/v1.0/publish/815b62a1-7b45-4fe1-86f4-ae4b56014311]. Livy Id=[0] Job failed during run time with state=[dead].
Images:
I tried the steps below:
Changing the IR configuration as shown in the screenshot
Setting a retry count and retry interval on the data flow activity
Running the ForEach loop one batch at a time instead of 4 batches in parallel (see the sketch below)
None of these troubleshooting steps worked. These pipelines have been running for the last 3-4 months without a single failure, but over the last 3 days they have started failing consistently. The data flow always gets stuck in progress, randomly on a different entity, and at some point times out, throwing the errors above.
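To clarify what "one batch at a time" means here: I just flipped the ForEach activity from parallel to sequential. Roughly, the relevant ForEach settings (shown as a Python dict mirroring the property names in the pipeline JSON; the items expression and the nested data flow activity are omitted) were:

```python
# Illustrative fragment only: the ForEach activity's typeProperties from the
# pipeline JSON, expressed as a Python dict. "items" and the nested Execute
# Data Flow activity are omitted here.
parallel_foreach = {
    "isSequential": False,  # run iterations in parallel
    "batchCount": 4,        # at most 4 entities processed at once
}

sequential_foreach = {
    "isSequential": True,   # one entity at a time; batchCount is ignored
}
```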
Error code 4508: Spark cluster not found.
This error can occur for two reasons.
The debug session gets closed before the data flow finishes its transformation; in this case the recommendation is to restart the debug session.
The second reason is a resource problem or an outage in that particular region.
Error code 5000, Failure type: User configuration issue
Details: [plugins.*** ADF.adf-ir-001 WorkspaceType: CCID:] [Monitoring] Livy Endpoint=[https://hubservice1.eastus.azuresynapse.net:8001/api/v1.0/publish/815b62a1-7b45-4fe1-86f4-ae4b56014311]. Livy Id=[0] Job failed during run time with state=[dead].
"Livy job state dead caused by unknown error" indicates a transient error. A Spark cluster is used on the back end of the data flow, and this error is generated by that Spark cluster. To get more information about the error, look at the StdOut of the Spark pool execution.
The backend cluster may be experiencing a network problem, a resource problem, or an outage.
If the error persists, my suggestion is to raise a Microsoft support ticket here.
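Separately, if it helps, one way to pull those Livy/Spark error details programmatically (rather than from the monitoring UI) is via the ADF REST API. The sketch below is illustrative only; it assumes you have an AAD bearer token with access to the factory and the pipeline run ID from the monitoring view, and all angle-bracket values are placeholders:

```python
# Rough sketch (my own addition, not part of the guidance above): pull the
# activity-run error details of a failed/stuck pipeline run via the ADF REST API.
from datetime import datetime, timedelta, timezone
import requests

SUBSCRIPTION = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
FACTORY = "<data-factory-name>"
RUN_ID = "<pipeline-run-id>"
TOKEN = "<aad-bearer-token>"  # e.g. from `az account get-access-token`

base = (
    "https://management.azure.com/subscriptions/"
    f"{SUBSCRIPTION}/resourceGroups/{RESOURCE_GROUP}"
    f"/providers/Microsoft.DataFactory/factories/{FACTORY}"
)
headers = {"Authorization": f"Bearer {TOKEN}"}

# queryActivityruns requires an update-time window; look back a few days.
now = datetime.now(timezone.utc)
window = {
    "lastUpdatedAfter": (now - timedelta(days=3)).strftime("%Y-%m-%dT%H:%M:%SZ"),
    "lastUpdatedBefore": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
}

resp = requests.post(
    f"{base}/pipelineruns/{RUN_ID}/queryActivityruns?api-version=2018-06-01",
    headers=headers,
    json=window,
)
resp.raise_for_status()

# Failed data flow activities carry the Livy/Spark error text in "error".
for run in resp.json().get("value", []):
    print(run.get("activityName"), run.get("status"), run.get("error"))
```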
I have been scratching my head for the last two days over this issue. The error is intermittent on the production server: sometimes the scheduled task works and sometimes it does not.
The same settings work on the development server.
I also checked the execution policy on both servers and it looks the same.
In your second screenshot, you can choose "Stop the existing instance" in the last dropdown list ("If the task is already running"). Then the retry option might trigger your task again correctly.
Last night was the "fall back" time change for most locations in the US. I woke up this morning to find dozens of job failure notifications. Almost all of them, though, were incorrect: the jobs showed as having completed normally, yet Rundeck sent failure notifications for them.
Interestingly, this happened in two completely separate Rundeck installations (v2.10.8-1 and v3.1.2-20190927). The commonality is that they're both on CentOS 7 (separate servers). They're both using MariaDB, although different versions of MariaDB.
The failure emails for the jobs that finished successfully showed a negative time in the "Scheduled after" line:
#1,811,391
by admin, Scheduled after 59m at 1:19 AM
• Scheduled after -33s
Execution
User: admin
Time: 59m
Started: in 59m 2019-11-03 01:19:01.0
Finished: 1s ago Sun Nov 03 01:19:28 EDT 2019
Success rate: 100%
Average duration: -45s
That job actually ran in 27s at 01:19 EDT (the first 1am hour, it is now EST). Looking at the email headers, I believe I got the message at 1:19 EST, an hour after the job ran.
So that would seem to imply to me that it's just a notification problem (somehow).
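I'm guessing something like this is going on: when local wall-clock times are compared across the fall-back hour, the arithmetic goes negative. A small Python illustration (my own reconstruction, not Rundeck code):

```python
# Illustration only: why local wall-clock arithmetic goes negative across the
# US "fall back" transition (2019-11-03, America/New_York). Requires Python 3.9+.
from datetime import datetime
from zoneinfo import ZoneInfo

tz = ZoneInfo("America/New_York")

# 01:19 happens twice that night: first in EDT (fold=0), then again in EST
# (fold=1), one real hour later.
started = datetime(2019, 11, 3, 1, 19, 1, fold=0, tzinfo=tz)    # 01:19:01 EDT
finished = datetime(2019, 11, 3, 1, 19, 28, fold=0, tzinfo=tz)  # 01:19:28 EDT
print(finished - started)        # 0:00:27 -> the real 27 s run time

# A clock reading taken in the repeated (EST) hour shows an *earlier* wall-clock
# value than a timestamp recorded in the EDT hour, even though it is later in
# real time, so a naive local-time subtraction comes out negative.
edt_hour_timestamp = datetime(2019, 11, 3, 1, 19, 28)  # recorded in the EDT 1 a.m. hour
est_hour_timestamp = datetime(2019, 11, 3, 1, 19, 1)   # read in the EST 1 a.m. hour
print(est_hour_timestamp - edt_hour_timestamp)         # -1 day, 23:59:33 (i.e. -27 seconds)
```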
But there were also a couple of jobs that run after other job executions and failed as well, apparently because the successfully finished upstream job returned an RC of 2. I'm not sure what to make of this.
We've been running Rundeck for a few years now, and this is the first time I remember seeing this problem. Of course my memory may be faulty; maybe we did see it previously and fewer jobs were affected, or some such.
The fact that it impacted two different versions of Rundeck on two different servers implies either it's a fundamental issue with Rundeck that's been around for a while or it is something else in the operating system that's somehow causing problems for Rundeck. (Although time change isn't new, so that would seem to be somewhat surprising too.)
Any thoughts about what might have gone on (and how to prevent it next year, short of the obvious option of running on UTC) would be appreciated.
You can define a specific time zone in Rundeck; check this and this.
I am using the SSIS Transfer Objects task to transfer a database from one server to another. This is a nightly task as the final part of ETL.
If I run the task manually during the day, there is no problem. It completes in around 60 to 90 minutes.
When I run the task from SQL Server Agent, it always starts but often fails. I have the Agent steps set up to retry on failure, but most nights it takes 3 attempts, and on some nights 5 or 6.
The error returned is twofold (both error messages show in the log for the same row):
1) An error occurred while transferring data. See the inner exception for details.
2) Timeout expired: The timeout period elapsed prior to completion of the operation or the server is not responding
I can't find any timeout limit to adjust that I haven't already adjusted.
Anyone have any ideas?
I tried to create a job to call a REST API every 10 minutes using the Application Lab UI of the Workload Scheduler.
The task works fine if I push Run Now.
This is the configuration of my trigger:
I intentionally left out "valid to" so that this task runs indefinitely.
Looking at the trigger you created, it seems to be right: your step will run every 10 minutes, every day.
If not, could you specify what is not working?
Thanks