Stop an Azure Data Factory trigger from starting a new pipeline run while the pipeline is still running

I have an Azure Data Factory pipeline, and its trigger is set to fire every 5 minutes.
Sometimes the pipeline takes more than 5 minutes to finish its work. In that case the trigger fires again and creates a second instance of the pipeline, and two instances of the same pipeline running at once cause problems in my ETL.
How can I make sure that only one instance of my pipeline runs at a time?
As you can see, there are several instances of my pipeline running at once.

A few options I can think of:
OPT 1
Specify a 5-minute timeout on your pipeline activities:
https://learn.microsoft.com/en-us/azure/data-factory/concepts-pipelines-activities
https://learn.microsoft.com/en-us/azure/data-factory/concepts-pipelines-activities#activity-policy
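For reference, the timeout lives in the activity's policy block; a fragment sketch (the activity name and values are illustrative):

```json
{
    "name": "CopySalesData",
    "type": "Copy",
    "policy": {
        "timeout": "0.00:05:00",
        "retry": 0,
        "retryIntervalInSeconds": 30
    }
}
```

Note that a timeout cancels the long-running instance rather than preventing the overlap, so this only fits if a killed run is acceptable.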
OPT 2
1) Create a one-row, one-column SQL RunStatus table: 1 will be our "completed" status, 0 our "running" status.
2) At the end of your pipeline, add a Stored Procedure activity that sets the bit to 1.
3) At the start of your pipeline, add a Lookup activity to read that bit.
4) The output of this lookup is then used in an If Condition activity:
if 1 - start the pipeline's job, but before that add another Stored Procedure activity that sets our status bit to 0;
if 0 - depending on the details of your project: do nothing, add a Wait activity, send an email, etc. A minimal T-SQL sketch of this setup follows below.
To make full use of this option, you can turn the table into a log where a new line with start and end times is added after each successful run (before initiating a new run, you can check whether the previous run has an end time). Having this log might also help you gather data on how long your pipeline takes to run, so you can either add more resources or increase the interval between runs.
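A minimal sketch of the flag table and stored procedures from steps 1, 2, and 4, assuming Azure SQL; all object names are illustrative:

```sql
-- One-row status flag: 1 = completed, 0 = running.
CREATE TABLE dbo.RunStatus (IsCompleted bit NOT NULL);
INSERT INTO dbo.RunStatus (IsCompleted) VALUES (1);
GO

-- Called by the Stored Procedure activity inside the If Condition,
-- right before the real work starts (step 4).
CREATE PROCEDURE dbo.MarkRunning AS
    UPDATE dbo.RunStatus SET IsCompleted = 0;
GO

-- Called by the Stored Procedure activity at the end of the pipeline (step 2).
CREATE PROCEDURE dbo.MarkCompleted AS
    UPDATE dbo.RunStatus SET IsCompleted = 1;
GO
```

The Lookup activity then just runs SELECT IsCompleted FROM dbo.RunStatus, and the If Condition checks the firstRow value of its output.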
OPT 3
Monitor the pipeline run with the SDKs (I have not tried this myself, so it is just a possible direction):
https://learn.microsoft.com/en-us/azure/data-factory/monitor-programmatically
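For example, with the Python SDK you could check for an in-progress run before kicking off a new one. This is a rough sketch, assuming the azure-identity and azure-mgmt-datafactory packages; the subscription, resource group, factory, and pipeline names are placeholders:

```python
from datetime import datetime, timedelta

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    RunFilterParameters,
    RunQueryFilter,
    RunQueryFilterOperand,
    RunQueryFilterOperator,
)

# Illustrative names -- replace with your own subscription, resource group,
# factory, and pipeline.
client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Look for runs of this pipeline that are still in progress.
filters = RunFilterParameters(
    last_updated_after=datetime.utcnow() - timedelta(hours=2),
    last_updated_before=datetime.utcnow(),
    filters=[
        RunQueryFilter(
            operand=RunQueryFilterOperand.PIPELINE_NAME,
            operator=RunQueryFilterOperator.EQUALS,
            values=["MyEtlPipeline"],
        ),
        RunQueryFilter(
            operand=RunQueryFilterOperand.STATUS,
            operator=RunQueryFilterOperator.EQUALS,
            values=["InProgress"],
        ),
    ],
)
runs = client.pipeline_runs.query_by_factory("my-rg", "my-factory", filters)

# Only start a new run when nothing is currently in flight.
if not runs.value:
    client.pipelines.create_run("my-rg", "my-factory", "MyEtlPipeline")
```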
Hopefully you can use at least one of these.

It sounds like you're trying to run a process more or less constantly, which is a good fit for tumbling window triggers. You can create a dependency such that the trigger is dependent on itself - so it won't run until the previous run has completed.
Start by creating a trigger that runs a pipeline on a tumbling window, then create a tumbling window trigger dependency. The section at the bottom of that article, "Tumbling window self-dependency properties", shows what the code should look like once you've set this up.
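Roughly, the resulting trigger definition looks like the following sketch, based on that article; the names, window size, and start time are illustrative:

```json
{
    "name": "SelfDependentTrigger",
    "properties": {
        "type": "TumblingWindowTrigger",
        "runtimeState": "Started",
        "pipeline": {
            "pipelineReference": {
                "referenceName": "MyEtlPipeline",
                "type": "PipelineReference"
            }
        },
        "typeProperties": {
            "frequency": "Minute",
            "interval": 5,
            "startTime": "2021-01-01T00:00:00Z",
            "maxConcurrency": 1,
            "dependsOn": [
                {
                    "type": "SelfDependencyTumblingWindowTriggerReference",
                    "size": "00:05:00",
                    "offset": "-00:05:00"
                }
            ]
        }
    }
}
```

The negative offset makes each window depend on the one before it, so a window will not start until the previous window's run has completed.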

Try changing the concurrency of the pipeline to 1.
Link: https://www.datastackpros.com/2020/05/prevent-azure-data-factory-from-running.html
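The setting lives in the pipeline JSON as the concurrency property (a fragment sketch; the pipeline name is illustrative), and it can also be set in the authoring UI under the pipeline's General settings:

```json
{
    "name": "MyEtlPipeline",
    "properties": {
        "concurrency": 1,
        "activities": [ ... ]
    }
}
```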

My first thought is that the recurrence is too frequent under these circumstances. If the graph you shared is all for the same pipeline, then most runs take close to 5 minutes, but some take 30, 40, even 60 minutes. Situations like this are where a simple recurrence trigger probably isn't sufficient. What is supposed to happen while the 60-minute run is executing? There would be 10-12 runs that couldn't start: do they still need to run, or can they be skipped?
To make sure all the pipelines run, and to manage concurrency, you're going to need to build a queue manager of some kind. ADF cannot handle this by itself, so I built such a system internally and rely on it extensively. I use a combination of Logic Apps, Stored Procedures (Azure SQL), and Azure Functions to queue, execute, and monitor pipeline executions. Here is a high-level breakdown of what you probably need:
Logic App 1: runs every 5 minutes and queues an ADF job in the SQL database.
Logic App 2: runs every 2-3 minutes and checks the queue to see whether a) there is no job currently running (status = 'InProgress') and b) there is a job in the queue waiting to run (I do this with a Stored Procedure; see the sketch after this list). If this state is met: execute the next ADF pipeline and update its status to 'InProgress'.
I use an Azure Function to submit jobs instead of the built-in Logic App activity because it gives me better control over variable parameters. Also, it can return the newly created ADF RunId, which I rely on in step 3.
Logic App 3: runs every minute and updates the status of any 'InProgress' jobs.
I use an Azure Function to check the status of the ADF pipeline run based on its RunId.
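A minimal sketch of what the queue table and the check in step 2 could look like in Azure SQL; all names are illustrative:

```sql
-- Logic App 1 inserts a 'Queued' row here every 5 minutes.
CREATE TABLE dbo.PipelineQueue (
    QueueId      int IDENTITY(1,1) PRIMARY KEY,
    PipelineName sysname     NOT NULL,
    Status       varchar(20) NOT NULL DEFAULT 'Queued',  -- Queued / InProgress / Succeeded / Failed
    AdfRunId     varchar(50) NULL,
    QueuedAt     datetime2   NOT NULL DEFAULT SYSUTCDATETIME()
);
GO

-- Called by Logic App 2: returns the next queued job, but only
-- when nothing is currently running.
CREATE PROCEDURE dbo.GetNextPipelineJob
AS
BEGIN
    IF NOT EXISTS (SELECT 1 FROM dbo.PipelineQueue WHERE Status = 'InProgress')
        SELECT TOP (1) QueueId, PipelineName
        FROM dbo.PipelineQueue
        WHERE Status = 'Queued'
        ORDER BY QueuedAt;
END
GO
```

If the procedure returns a row, Logic App 2 starts that pipeline (via the Azure Function), stores the returned RunId, and flips Status to 'InProgress'.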

Related

Integration Runtime with TTL not helping with Cluster startup time

Hi, I have a pipeline with a ForEach loop containing a Data Flow activity that runs on an integration runtime I set up with a 10-minute time to live. When I triggered the pipeline with three files (i.e. the Data Flow activity within the ForEach executes three times), I saw that the cluster startup time remained almost the same (4-6 minutes) for each data flow execution. I assumed the IR with a 10-minute TTL would substantially reduce the cluster startup time (at least for the second and third executions), but that doesn't seem to be the case.
I'm not sure if I am missing a setting on the pipeline or the IR, or if this is intended behavior. Any insight would be appreciated.
When using a ForEach with a Data Flow activity in ADF, if you wish to take advantage of shortened cluster start-up times, you must set the ForEach to execute its iterations sequentially. Allowing the ForEach to execute in parallel will fire up a new cluster for every iteration, even if you have a TTL set on the Azure IR.
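In the ForEach activity's settings this is the Sequential checkbox; in the pipeline JSON it is the isSequential flag (a fragment sketch with illustrative names):

```json
{
    "name": "ForEachFile",
    "type": "ForEach",
    "typeProperties": {
        "isSequential": true,
        "items": {
            "value": "@pipeline().parameters.files",
            "type": "Expression"
        },
        "activities": [ ... ]
    }
}
```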
I found the solution. Microsoft added a check box in the Integration Runtime creation process...

Azure Data Factory ADFV2 Trigger Overlap

I have an ADFV2 trigger that runs every 2 minutes. The pipeline it calls usually takes just over a minute to run, but sometimes it takes more than 2 minutes; when that happens, the trigger kicks in again and runs regardless of whether the previous run is still going. Is there any way to stop this overlap?
The trigger needs to run every 2 minutes.
Thanks.
There is a concurrency setting in the pipeline definition. Set it to 1. The trigger will still create an event, but it will be held in the Queued state until the previous run completes.

How to run five agent jobs simultaneously in VSTS (Azure DevOps)?

I have created a release pipeline with five agent jobs and I want to start all five jobs at the same time.
example:
In the example I need to start all agent jobs simultaneously, each executing a single task (wait 10 seconds) at the same time.
Does VSTS (Azure DevOps) have an option to do this?
You could also just use 5 different stages (depending on what exactly you're doing). Then you can leverage the full power of the pipeline model: pre and post stages, whatever you wish. As mentioned in the other answers, this is also possible with separate agent jobs, but stages are more straightforward. You can also easily clone stages.
I'm not sure what you're trying to achieve by waiting for 10 seconds, but that is very easy to do with a PowerShell step. Select the "Inline" radio button and type this:
Start-Sleep -Seconds 10
Here is an example of a pipeline that might do the simultaneous work you want, but keep in mind that each agent job (whether multiple jobs in one stage or multiple single-job stages) has to find an agent that is capable, available, and idle; otherwise the job will sit in a waiting queue.
In the release pipeline click on "Agent job", then expand the "Execution plan" and click on "Multi-agent".
I think you need to create 5 stages, since in Azure DevOps release pipelines, jobs within one stage cannot run in parallel; see the documentation from Microsoft.
Or, if you want to run the same set of tasks on multiple agents, you could use the Multi-agent option as shown below.
ADO Multi-agent option
If you want a single job to be executed in parallel, choose the multi-agent configuration; but if you have 5 (very) different jobs, you can choose "Even if a previous job has failed" from the "Run this job" dropdown.
By default this is set to "Only when all previous jobs have succeeded", which means that:
All of your 5 jobs will be executed sequentially in the order that you've set them up
The chain of jobs will come to a stop as soon as one of the jobs fails
Take note that you can individually specify which agent queue each job will execute on; by default they all go to the same queue. If you run 5 jobs in parallel on a single queue, that queue needs 5 agents that are available and idle to get what you're expecting.
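For what it's worth, in YAML pipelines (as opposed to the classic release editor discussed above), jobs that have no dependsOn relationship between them are scheduled in parallel by default, subject to available agents. A minimal sketch:

```yaml
# Jobs with no dependsOn between them run in parallel, one agent each.
jobs:
- job: Job1
  steps:
  - powershell: Start-Sleep -Seconds 10
- job: Job2
  steps:
  - powershell: Start-Sleep -Seconds 10
- job: Job3
  steps:
  - powershell: Start-Sleep -Seconds 10
```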

Run scheduler to execute jobs at an interval from the completion of the previous job

I need to create schedulers that execute jobs (class files) at specified intervals. For now, I'm using Quartz Scheduler, which triggers the jobs at defined intervals from the time they are triggered.
For example: suppose I give a cron expression to run every hour starting at 9 in the morning. My first run will be at 9, my second run at 10, and so on.
If my job takes 20 minutes to execute, this approach is not very efficient.
What I need is to schedule a job to run one hour after the completion time of the previously run job.
For example: if my hourly job is triggered at 9 and the first run takes 20 minutes, then the next run should trigger only at 10:20 instead of 10 (i.e., one hour from the completion of the previous run).
I need to know whether there is any way in Quartz Scheduler to achieve this, or what other logic I would need.
If anyone could help me out on this, it would be much appreciated.
You can easily achieve this by job-chaining your job executions. There are various approaches you can choose from:
(1) Implement a Quartz JobListener and, in its jobWasExecuted method, which Quartz invokes whenever a job finishes executing, re-fire your job (a sketch follows after this list).
(2) Look at the Quartz JobChainingJobListener that you can use to implement simple job-chaining scenarios. Please note that the functionality of this listener is very limited: it does not allow you to insert delays between job executions, there is no support for conditions that must be met before target jobs are executed, etc. But you can use it as a good starting point to implement (1).
(3) Use QuartzDesk (our commercial product) or any other product that allows you to create job chains while externalizing and managing all job dependencies outside of your application. A job chain can have multiple target jobs that can be executed immediately, with a fixed delay, or at an arbitrary time in the future produced by a JavaScript expression. It also allows you to implement somewhat more sophisticated workflows, such as firing a target job when multiple source jobs complete their execution. I am attaching screenshots showing what a simple job chain that re-executes Job1 with a 1-minute delay upon Job1's completion (with any job execution status) looks like.
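If you go with approach (1), a minimal sketch of such a listener might look like this (the class name and the one-hour delay are illustrative; register it with scheduler.getListenerManager().addJobListener(...) and make the job durable so it survives between triggers):

```java
import org.quartz.DateBuilder;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;
import org.quartz.SchedulerException;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.listeners.JobListenerSupport;

// Fires the finished job again a fixed interval after it completes.
public class RescheduleOnCompletionListener extends JobListenerSupport {

    @Override
    public String getName() {
        return "rescheduleOnCompletion";
    }

    @Override
    public void jobWasExecuted(JobExecutionContext context, JobExecutionException jobException) {
        try {
            // Build a one-shot trigger for the job that just finished,
            // starting one hour from now (i.e. one hour from completion time).
            Trigger trigger = TriggerBuilder.newTrigger()
                    .forJob(context.getJobDetail().getKey())
                    .startAt(DateBuilder.futureDate(1, DateBuilder.IntervalUnit.HOUR))
                    .build();
            context.getScheduler().scheduleJob(trigger);
        } catch (SchedulerException e) {
            getLog().error("Could not re-schedule job", e);
        }
    }
}
```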

SQL Agent Job runtime alert

I was hoping I could get some help on how to set up an email alert for a specific agent job, such that it sends an alert when the run duration exceeds 30 minutes.
Would it be easier to add this as a step in the job itself? Are there any options available in the SQL Agent GUI, or do I have to create a new job? I figured a new job is the less likely route, as I would have to query sysjobhistory in msdb, and that value is only updated once the job finishes, so it doesn't help. I need to check the real-time duration of one specific agent job while it's running.
Specifically, it has happened that the job ran into a deadlock (that's no longer an issue now), so the job just stayed stuck on the table it was locked on, and I only got notified by the end user that the report wasn't returning results.
The best method, outside of third-party monitoring software, is to create a high-frequency SQL Agent job that runs a query on active sessions (returned by something like sp_who) and checks the duration of their spids. That way, the monitoring job can email you whenever a spid goes over a threshold. Alternatively, you could have it compare the current runtime against an average runtime calculated from the msdb.dbo.sysjobhistory table.
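As a sketch of that idea, here is a variant that reads msdb.dbo.sysjobactivity (which tracks the current execution of a job directly) instead of sp_who; schedule it in its own agent job every few minutes. The job name, mail profile, and recipient are placeholders:

```sql
DECLARE @running_minutes int;

-- How long has the current run of this job been executing?
SELECT @running_minutes = DATEDIFF(MINUTE, ja.start_execution_date, GETDATE())
FROM msdb.dbo.sysjobactivity AS ja
JOIN msdb.dbo.sysjobs AS j
    ON j.job_id = ja.job_id
WHERE j.name = 'MyReportJob'
  AND ja.session_id = (SELECT MAX(session_id) FROM msdb.dbo.syssessions)
  AND ja.start_execution_date IS NOT NULL
  AND ja.stop_execution_date IS NULL;   -- still running

IF @running_minutes >= 30
    EXEC msdb.dbo.sp_send_dbmail
        @profile_name = 'DbaMailProfile',
        @recipients   = 'dba@example.com',
        @subject      = 'MyReportJob has been running for over 30 minutes';
```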