SSIS Transfer Objects task fails when run from Agent - tsql

I am using the SSIS Transfer Objects task to transfer a database from one server to another. This is a nightly task as the final part of ETL.
If I run the task manually during the day, there is no problem. It completes in around 60 to 90 minutes.
When I run the task from Agent, it always starts but often fails. I have the agent steps set up to retry on failure, but most nights it takes 3 attempts; on some nights, 5 or 6 attempts.
The error message returned is twofold (both error messages show in the log for the same row):
1) An error occurred while transferring data. See the inner exception for details.
2) Timeout expired: The timeout period elapsed prior to completion of the operation or the server is not responding
I can't find any timeout limit to adjust that I haven't already adjusted.
Anyone have any ideas?
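For context, the retry behaviour mentioned above is just the Agent job step's retry settings. Below is a minimal sketch of how they can be set from PowerShell, assuming the SqlServer module is available; the job name, step id and values are placeholders, not the actual job:

# Set retry attempts/interval on the Agent step that runs the Transfer Objects package.
# Job name, step id and values below are placeholders for illustration only.
Import-Module SqlServer

$query = @"
EXEC msdb.dbo.sp_update_jobstep
    @job_name       = N'Nightly ETL - Transfer DB',
    @step_id        = 1,
    @retry_attempts = 5,   -- how many times Agent re-runs the step on failure
    @retry_interval = 10;  -- minutes to wait between attempts
"@

Invoke-Sqlcmd -ServerInstance 'MyAgentServer' -Database 'msdb' -Query $query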

Related

ADF Dataflow stuck in progress and failing with the errors below

The ADF pipeline's Data Flow task is stuck in progress. It was working seamlessly for the last couple of months, but suddenly the Data Flow gets stuck in progress and times out after a certain time. We are using an IR with a managed Virtual Network. I am using a ForEach loop to run the data flow for multiple entities in parallel, and it always randomly gets stuck on the last entity.
What can I try to resolve this?
Error in Dev Environment:
Error code: 4508
Spark cluster not found
Error in Prod Environment:
Error code: 5000
Failure type: User configuration issue
Details: [plugins.*** ADF.adf-ir-001 WorkspaceType:<ADF> CCID:<f289c067-7c6c-4b49-b0db-783e842a5675>] [Monitoring] Livy Endpoint=[https://hubservice1.eastus.azuresynapse.net:8001/api/v1.0/publish/815b62a1-7b45-4fe1-86f4-ae4b56014311]. Livy Id=[0] Job failed during run time with state=[dead].
I tried the steps below:
Changing the IR configuration
DF retry and retry interval
Running the ForEach loop one batch at a time instead of 4 batches in parallel
None of the above troubleshooting steps worked. These pipelines had been running for the last 3-4 months without a single failure; suddenly they started failing consistently in the last 3 days. The data flow always gets stuck in progress randomly for a different entity and eventually times out, throwing the errors above.
Error Code 4508 Spark cluster not found.
This error can occur for two reasons.
The debug session gets closed before the data flow finishes its transformation; in this case the recommendation is to restart the debug session.
The second reason is a resource problem, or an outage in that particular region.
Error code 5000 Failure type User configuration issue Details [plugins.*** ADF.adf-ir-001 WorkspaceType: CCID:] [Monitoring] Livy Endpoint=[https://hubservice1.eastus.azuresynapse.net:8001/api/v1.0/publish/815b62a1-7b45-4fe1-86f4-ae4b56014311]. Livy Id=[0] Job failed during run time with state=[dead].
"Livy job state dead caused by unknown error" is a transient error. A Spark cluster is used at the backend of the data flow, and this error is generated by that Spark cluster. To get more information about the error, go to the StdOut of the Spark pool execution.
The backend cluster may be experiencing a network problem, a resource problem, or an outage.
If the error persists, my suggestion is to raise a Microsoft support ticket.

ADF pipeline fails 10% of the time with a Timeout Error

I have an ADF pipeline job that runs once a night for approximately 3 to 4 hours. About 10% of the time in the last month, my job fails. I get this error each time:
'Type=Microsoft.Azure.Documents.RequestTimeoutException,Message=Request timed out. ActivityId: e9fd74da-cfef-4a86-970a-7de173c0935c, Request URI: /dbs/Gd0sAA==/colls/Gd0sANfHJAA=/docs, RequestStats: , SDK: documentdb-dotnet-sdk/2.5.1 Host/64-bit MicrosoftWindowsNT/6.2.9200.0,Source=Microsoft.Azure.Documents.Client,'
'Type=System.Threading.Tasks.TaskCanceledException,Message=A task was canceled.,Source=mscorlib,'
My job consists of an Azure Data Explorer command which then copies a table from CosmosDB and migrates it to Kusto.
When I have gotten this error in the past, I have rerun the job and it seems to work fine. Do you know what the problem is? It is becoming inconvenient to rerun a 3 to 4 hour job that sporadically fails for an unclear reason.
Change the configurations below and then try again:
Batch Size: The tool defaults to a batch size of 50. If the documents to be imported are large, consider lowering the batch size. Conversely, if the documents to be imported are small, consider raising the batch size.
Number of Retries on Failure: Specifies how often to retry the connection to Azure Cosmos DB during transient failures (for example, network connectivity interruption).
Retry Interval: Specifies how long to wait between retrying the connection to Azure Cosmos DB in case of transient failures (for example, network connectivity interruption).
Refer - https://learn.microsoft.com/en-us/azure/cosmos-db/import-data#SQLSeqTarget
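Those settings belong to the Cosmos DB data migration tool described in the linked article. If the copy runs through an ADF copy activity instead (as in the question), the analogous retry knobs live on the activity's policy block; the sketch below is only an assumption of how one might redeploy an edited pipeline definition with Azure PowerShell, and every name and value in it is a placeholder:

# The copy activity's policy in the exported pipeline JSON carries the retry settings, e.g.
#   "policy": { "timeout": "0.04:00:00", "retry": 3, "retryIntervalInSeconds": 60 }
# After editing the definition file, redeploy it (Az.DataFactory module; names are hypothetical):
Import-Module Az.DataFactory
Set-AzDataFactoryV2Pipeline -ResourceGroupName 'my-rg' `
    -DataFactoryName 'my-adf' `
    -Name 'CosmosToKustoNightly' `
    -DefinitionFile '.\CosmosToKustoNightly.json' `
    -Force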

Batch account node restarted unexpectedly

I am using an Azure Batch account to run sqlpackage.exe in order to move databases from one server to another. A task that started 6 days ago was suddenly restarted from the beginning after 4 days of running (extremely large databases). The task had run uninterrupted up until then and should have continued for about another 1-2 days.
The PowerShell script that contains all the logic handles all the exceptions that could occur during execution. Also, the retry count for the task was set to 0 in case it fails.
Unfortunately, I did not have diagnostic settings configured; I could only look at the metrics, and there was a short period when there wasn't any node.
What can be the causes of this behavior, i.e. restarting while the node is still running?
Thanks
Unfortunately, there is no way to give a definitive answer to this question. You will need to dig into the compute node (interactively log in) and check system logs to give you details on why the node restarted. There is no guarantee that a compute node will have 100% uptime as there may be hardware faults or other service interruptions.
In general, it's best practice to have long running tasks checkpoint progress combined with a retry policy. Programs that can reload state can pick up at the time of the checkpoint when the Batch service automatically reschedules the task execution. Please see the Batch best practices guide for more information.
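As an illustration of that checkpointing pattern applied to the sqlpackage wrapper from the question, here is a minimal PowerShell sketch; the paths, database list and sqlpackage arguments are placeholders, not the asker's actual script:

# Record each finished database in a checkpoint file on persistent storage, so a rescheduled
# task run skips work that has already completed. All names and paths are illustrative.
$checkpointFile = 'C:\batchwork\completed-databases.txt'
$databases = @('Db1', 'Db2', 'Db3')   # hypothetical list of databases to move
$done = @()
if (Test-Path $checkpointFile) { $done = @(Get-Content $checkpointFile) }

foreach ($db in $databases) {
    if ($done -contains $db) { continue }   # finished in an earlier attempt, skip it

    $sqlpackageArgs = @(
        '/Action:Export'
        '/SourceServerName:SRC-SQL01'        # placeholder source server
        "/SourceDatabaseName:$db"
        "/TargetFile:C:\batchwork\$db.bacpac"
    )
    & sqlpackage.exe @sqlpackageArgs

    if ($LASTEXITCODE -eq 0) {
        Add-Content -Path $checkpointFile -Value $db
    } else {
        throw "sqlpackage failed for $db"    # let the Batch retry policy reschedule the task
    }
}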

Datastage: How to keep a continuous mode job running after an unexpected termination

I have a job that uses the Kafka Connector Stage in order to read a Kafka queue and then load into the database. That job runs in continuous mode, so it never finishes, since it keeps monitoring the Kafka queue in real time.
For unexpected reasons (say, server issues, job issues, etc.) that job may terminate with a failure. In general, that happens after about 300 hours of running. So, in order to keep the job alive, I have to manually check the job status and then do a Reset and Run.
The problem is that several hours can pass between the job termination and my manual Reset and Run, which is critical. So I'm looking for a way to eliminate the manual interaction and reduce that gap by automating the job invocation.
I tried to use Control-M to run the job daily, but with no success: the first day Control-M called the job, it ran fine. But the next day, when Control-M attempted to instantiate the job again, it failed (since the job was already running). Besides, DataStage will never report back to Control-M that the job concluded successfully, since the job's nature won't allow that.
That said, I would like to hear any ideas you have that could shed some light on this.
The first thing that came to mind is to create an intermediate sequence and schedule it in Control-M. This new sequence would then call the continuous job asynchronously by using a command line stage.
For the case where just this one job terminates unexpectedly and you want it to be restarted as soon as possible, have you considered calling this job from a sequence? The sequence could be set up to loop, running this job.
Thus the sequence starts the job and waits for it to finish. When the job finishes, the sequence loops and starts the job again. You could add conditions on job exit (for example, if the job aborted, then based on that end status you could reset the job before re-running it).
This would not handle the condition where the DataStage engine itself was shut down (such as for maintenance or possibly an error), in which case all jobs end, including your new sequence. The same also applies for a server reboot or other situations where someone may have inadvertently stopped your sequence. For those cases (such as a DataStage engine stop) your team would need to have a process in place for jobs/sequences that need to be started up following a DataStage or system outage.
For the outage scenario, you could create a monitor script (regardless of whether you run the job solo or from a sequence) that sleeps/loops on 5-10 minute intervals, checks the status of your job using the dsjob command, and, if it is not running, starts that job/sequence (also via the dsjob command). You can decide whether that script would start at DataStage startup or machine startup, or run it from Control-M or another scheduler.
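A minimal sketch of such a monitor loop, written here in PowerShell on the assumption that dsjob is on the PATH of the engine tier; the project/job names, interval, and the naive status check are placeholders (on a Unix engine the same idea could be a shell script):

# Poll the job status every 10 minutes; if it is no longer running, reset it and start it again.
$project = 'MyProject'             # placeholder project name
$job     = 'KafkaToDbContinuous'   # placeholder continuous job/sequence name

while ($true) {
    $info = (& dsjob -jobinfo $project $job) -join "`n"
    if ($info -notmatch 'RUNNING') {
        & dsjob -run -mode RESET -wait $project $job   # clear the aborted state first
        & dsjob -run $project $job                     # then start the continuous job again
    }
    Start-Sleep -Seconds 600
}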

The target server failed to respond for multiple iterations in JMeter

In my JMeter script, I am getting an error on the 2nd iteration.
For multiple users with a single iteration, no errors were observed, but with multiple iterations I am getting an error with the message below:
Response code: Non HTTP response code: org.apache.http.NoHttpResponseException
Response message: Non HTTP response message: The target server failed to respond
Response data: The target server failed to respond
Could you please suggest what could be the reason behind this error?
Thanks in advance
Most likely your server becomes overloaded. Regarding the possible reason, my expectation is that a single iteration does not deliver the full concurrency, as JMeter acts like this:
JMeter starts all the virtual users within the specified ramp-up period
Each virtual user starts executing samplers
When there are no more samplers to execute and no loops to iterate, the thread is shut down
So with 1 iteration you may run into a situation where some threads have already finished their job and the others have not been started yet. When you add more iterations, the "old" threads start over while "new" ones are still arriving. The situation is explained in the JMeter Test Results: Why the Actual Users Number is Lower than Expected article, and you can monitor the actual delivered load using the Active Threads Over Time chart of the HTML Reporting Dashboard or the Active Threads Over Time listener available via JMeter Plugins.
To get to the bottom of the failure I would recommend checking the following:
component logs on the application under test side (application logs, application/web server logs, database logs)
the application under test's baseline health metrics (CPU, RAM, disk, etc.). You can use the JMeter PerfMon Plugin; this way you will be able to correlate increasing load with resource consumption.