ADF pipeline fails 10% of the time with a Timeout Error - azure-data-factory

I have an ADF pipeline job that runs once a night for approximately 3 to 4 hours. Over the last month, the job has failed about 10% of the time, and I get this error each time:
''Type=Microsoft.Azure.Documents.RequestTimeoutException,Message=Request
timed out. ActivityId: e9fd74da-cfef-4a86-970a-7de173c0935c, Request
URI: /dbs/Gd0sAA==/colls/Gd0sANfHJAA=/docs, RequestStats: , SDK:
documentdb-dotnet-sdk/2.5.1 Host/64-bit
MicrosoftWindowsNT/6.2.9200.0,Source=Microsoft.Azure.Documents.Client,''Type=System.Threading.Tasks.TaskCanceledException,Message=A
task was canceled.,Source=mscorlib,'
My job consists of an Azure Data Explorer command activity that copies a table from CosmosDB and migrates it to Kusto.
When I have hit this error in the past, rerunning the job seems to work fine. Do you know what the problem is? It is becoming inconvenient to rerun a 3 to 4 hour job that sporadically fails for an unclear reason.

Change the configurations below and then try again:
Batch Size: The tool defaults to a batch size of 50. If the documents to be imported are large, consider lowering the batch size. Conversely, if the documents to be imported are small, consider raising the batch size.
Number of Retries on Failure: Specifies how often to retry the connection to Azure Cosmos DB during transient failures (for example, network connectivity interruption).
Retry Interval: Specifies how long to wait between retrying the connection to Azure Cosmos DB in case of transient failures (for example, network connectivity interruption).
Refer to https://learn.microsoft.com/en-us/azure/cosmos-db/import-data#SQLSeqTarget
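For illustration, the retry settings can also be set programmatically. A minimal sketch, assuming the (now-legacy) azure-cosmos 3.x Python SDK that matches the documentdb-era error above; the account URL, key, and specific values are placeholders:

```python
# Sketch only: retry count and retry interval for transient Cosmos DB
# failures, set on the client. URL, key, and values are placeholders.
import azure.cosmos.cosmos_client as cosmos_client
import azure.cosmos.documents as documents
from azure.cosmos.retry_options import RetryOptions

policy = documents.ConnectionPolicy()
# Retry transient failures up to 9 times, waiting a fixed 500 ms
# between attempts, for at most 30 seconds overall.
policy.RetryOptions = RetryOptions(
    max_retry_attempt_count=9,
    fixed_retry_interval_in_milliseconds=500,
    max_wait_time_in_seconds=30,
)

client = cosmos_client.CosmosClient(
    "https://<account>.documents.azure.com:443/",
    {"masterKey": "<key>"},
    connection_policy=policy,
)
```

Batch size has no client-side equivalent here; in an ADF copy activity it corresponds to the write batch size on the Cosmos DB source/sink.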

Related

ADF Dataflow stuck in progress and fails with the errors below

ADF Pipeline DF task is stuck in progress. It was working seamlessly for the last couple of months, but suddenly the Dataflow gets stuck in progress and times out after a certain time. We are using an IR with a managed Virtual Network. I am using a ForEach loop to run the data flow for multiple entities in parallel, and it always randomly gets stuck on the last entity.
What can I try to resolve this?
Error in Dev Environment:
Error code: 4508
Spark cluster not found
Error in Prod Environment:
Error code: 5000
Failure type: User configuration issue
Details: [plugins.*** ADF.adf-ir-001 WorkspaceType:<ADF> CCID:<f289c067-7c6c-4b49-b0db-783e842a5675>] [Monitoring] Livy Endpoint=[https://hubservice1.eastus.azuresynapse.net:8001/api/v1.0/publish/815b62a1-7b45-4fe1-86f4-ae4b56014311]. Livy Id=[0] Job failed during run time with state=[dead].
I tried the steps below:
Changing the IR configuration
DF retry and retry interval
Running the ForEach loop one batch at a time instead of 4 batches in parallel
None of these troubleshooting steps worked. These pipelines had been running for the last 3-4 months without a single failure; then they suddenly started failing consistently in the last 3 days. The data flow always gets stuck in progress, randomly on a different entity, and at some point times out, throwing the errors above.
Error Code 4508, Spark cluster not found.
This error can occur for two reasons:
The debug session is closed before the dataflow finishes its transformation; in this case, the recommendation is to restart the debug session.
A resource problem, or an outage in that particular region.
Error code 5000, Failure type: User configuration issue, Details: [plugins.*** ADF.adf-ir-001 WorkspaceType: CCID:] [Monitoring] Livy Endpoint=[https://hubservice1.eastus.azuresynapse.net:8001/api/v1.0/publish/815b62a1-7b45-4fe1-86f4-ae4b56014311]. Livy Id=[0] Job failed during run time with state=[dead].
A "Livy job state dead caused by unknown error" is typically a transient error. A Spark cluster runs the dataflow on the backend, and this error is generated by that cluster. To get more information about the error, check the StdOut of the Spark pool execution.
The backend cluster may be experiencing a network problem, a resource problem, or an outage.
If the error persists, my suggestion is to raise a Microsoft support ticket.
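When the surfaced message is this generic, it can also help to pull the activity-level error payload programmatically rather than from the portal. A hedged sketch with the azure-mgmt-datafactory Python SDK; the subscription, resource group, factory name, and run ID are placeholders:

```python
# Sketch: list the activity runs of one pipeline run and print the
# full error for any that failed. All identifiers are placeholders.
from datetime import datetime, timedelta

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

client = DataFactoryManagementClient(
    DefaultAzureCredential(), "<subscription-id>"
)

now = datetime.utcnow()
runs = client.activity_runs.query_by_pipeline_run(
    "<resource-group>",
    "<factory-name>",
    "<pipeline-run-id>",
    RunFilterParameters(
        last_updated_after=now - timedelta(days=1),
        last_updated_before=now,
    ),
)

for run in runs.value:
    if run.status == "Failed":
        # run.error carries the underlying detail (e.g. the Livy state).
        print(run.activity_name, run.error)
```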

Why is my Azure DevOps Migration timing out after several hours?

I have a long-running migration (don't ask) being run by an Azure DevOps Release pipeline.
Specifically, it's an "Azure SQL Database deployment" activity, running a "SQL Script File" Deployment Type.
Despite having configured maximums in all the timeouts in the Invoke-Sql Additional Parameters settings, my migration is still timing out.
Specifically, I get:
We stopped hearing from agent Hosted Agent. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error.
So far it's timed out after:
6:13:15
6:13:18
6:14:41
6:10:19
So "after 6 and a bit hours". It's ~22,400 seconds, which doesn't seem like any obvious kind of number either :)
Why? And how do I fix it?
It turns out that Azure DevOps uses hosted agents to execute each task in a pipeline, and those agents have innate lifetimes, independent of whatever task they're running.
https://learn.microsoft.com/en-us/azure/devops/pipelines/troubleshooting/troubleshooting?view=azure-devops#job-time-out
A pipeline may run for a long time and then fail due to job time-out. Job timeout closely depends on the agent being used. Free Microsoft hosted agents have a max timeout of 60 minutes per job for a private repository and 360 minutes for a public repository. To increase the max timeout for a job, you can opt for any of the following.
Buy a Microsoft hosted agent which will give you 360 minutes for all jobs, irrespective of the repository used
Use a self-hosted agent to rule out any timeout issues due to the agent
Learn more about job timeout.
So I'm hitting the "360 minute" limit (presumably they give you a little extra on top, so that no-one complains?).
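A quick back-of-the-envelope check supports that (a throwaway Python sketch over the durations quoted above):

```python
# Each observed failure lands roughly 10-15 minutes past the
# documented 360-minute job cap.
for t in ["6:13:15", "6:13:18", "6:14:41", "6:10:19"]:
    h, m, s = (int(p) for p in t.split(":"))
    total = h * 3600 + m * 60 + s
    past = (total - 360 * 60) / 60
    print(f"{t} = {total} s, {past:.1f} min past the 360-min cap")
```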
The solution is to use a self-hosted agent (or to make my migration run in under 6 hours, of course).

Zombie giant unkillable task blocks Druid at restart

I'm running Apache Druid 0.17, deployed with nohup ./bin/start-nano-quickstart > mylog.log. I am using S3 as the deep storage, I have the Parquet extension enabled, and everything works fine. I could correctly ingest several small Spark-partitioned Parquet datasources from S3. All remaining configurations are untouched.
When I tried loading a giant datasource to test performance and resource usage, the task died after a couple of hours because of an OutOfMemoryError (which was expected):
2020-02-07T17:32:20,519 INFO [task-runner-0-priority-0] org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver - New segment[arc_2016-09-29T12:00:00.000Z_2016-09-29T13:00:00.000Z_2020-02-07T17:22:45.965Z] for sequenceName[index_parallel_arc_chgindko_2020-02-07T14:59:32.337Z].
Terminating due to java.lang.OutOfMemoryError: GC overhead limit exceeded
Now every time I restart Druid, it starts that giant task, and it is impossible to kill it. Even when the task apparently dies or turns to a waiting status, CPU usage stays at about 140% and I cannot submit new tasks to Druid. I tried to access the Derby database manually to find the task and remove it, but I was not successful, and that solution is really nasty anyway. I know that I can change the database in the configuration so that next time I will have a fresh Druid, but that is not a good solution either, as I would lose all my other datasources. How can I get rid of this long-running zombie task?
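One avenue worth trying before hand-editing Derby: the Overlord exposes task listing and shutdown HTTP endpoints. A hedged sketch with Python requests, assuming the quickstart's default Overlord address; the task id is guessed from the sequenceName in the log above, so confirm it against the listing first:

```python
# Sketch: ask the Overlord to shut the zombie task down via its HTTP
# API instead of editing the metadata store by hand. Host/port assume
# a default local quickstart deployment.
import requests

OVERLORD = "http://localhost:8081"

# List running tasks to find the stuck task's exact id.
for task in requests.get(f"{OVERLORD}/druid/indexer/v1/runningTasks").json():
    print(task["id"])

# Guessed from the log's sequenceName -- verify against the list above.
task_id = "index_parallel_arc_chgindko_2020-02-07T14:59:32.337Z"
resp = requests.post(f"{OVERLORD}/druid/indexer/v1/task/{task_id}/shutdown")
print(resp.status_code, resp.text)
```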

SSIS Transfer Objects task fails when run from Agent

I am using the SSIS Transfer Objects task to transfer a database from one server to another. This is a nightly task as the final part of ETL.
If I run the task manually during the day, there is no problem. It completes in around 60 to 90 minutes.
When I run the task from the Agent, it always starts but often fails. I have the agent steps set up to retry on failure, but most nights it takes 3 attempts; on some nights, 5 or 6 attempts.
The error message returned is twofold (both error messages show in the log for the same row):
1) An error occurred while transferring data. See the inner exception for details.
2) Timeout expired: The timeout period elapsed prior to completion of the operation or the server is not responding
I can't find any timeout limit to adjust that I haven't already adjusted.
Anyone have any ideas?

Timeout when running stage on Bluemix DevOps pipeline

I'm running e2e tests in a stage in my Bluemix DevOps pipeline but it is exceeding the 60 minutes limit:
The execution exceeded the time limit of 60 minutes.
One possible solution is to split up your execution.
Finished: ERRORED
Is there a way of increasing the stage timeout? I do not want to split my tests across different stages.
No, it is not possible to change the timeout for a running build. Instead of using different stages, you could try using multiple test jobs in your one stage, since each job has its own 60-minute timeout. One possible way to break it down would be one job per test suite.