How do I re-run a pipeline with only the failed activities/datasets in Azure Data Factory V2? - azure-data-factory

I am running a pipeline where I am looping through all the tables in INFORMATION_SCHEMA.TABLES and copying them onto Azure Data Lake Store. My question is: how do I re-run this pipeline for the failed tables only, if any of the tables fail to copy?

The best approach I’ve found is to code your process to:
0. Yes, root-cause the failure and identify whether it is something wrong with the pipeline or a “feature” of your dependency that you have to code around.
1. Be idempotent. If your process ensures a clean state as its very first step, similar to the Command design pattern’s undo (but more naive), then your process can safely re-execute.
* With #1 in place, you can safely use “retry” on your pipeline activities, along with sufficient time between retries.
* This approach is compatible with both ADF v1 and v2.
2. In ADF v2 you have more options and can build more complex error-handling logic:
* Wrap the failing activity in an until-success loop, and be sure to put a bound on the number of iterations.
* You can add more activities inside the loop to handle failure and log, notify, or resolve known failure conditions caused by externalities out of your control.
3. You can also communicate asynchronously with future process executions by saving each success to a central store. A later execution can then check “was I already successful?” and stop before the activity if so (see the sketch after this list).
* This is powerful for more generalized pipelines, since you can choose where to begin.
4. The last resort I know of (and I would love to learn new ways to handle this) is manual re-execution of the failed activities.
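As a rough illustration of point 3, here is a minimal Python sketch of the “central success store” idea, assuming one copy step per table; copy_table_to_lake and the local SQLite file are placeholders for whatever actually performs the copy (for example, one pipeline run per table) and for your real central store:

    import sqlite3

    def copy_table_to_lake(table_name: str) -> None:
        # Placeholder: raise an exception on failure, return on success.
        print(f"copying {table_name} ...")

    def ensure_store(conn: sqlite3.Connection) -> None:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS copy_log (table_name TEXT PRIMARY KEY)"
        )

    def already_copied(conn: sqlite3.Connection, table_name: str) -> bool:
        row = conn.execute(
            "SELECT 1 FROM copy_log WHERE table_name = ?", (table_name,)
        ).fetchone()
        return row is not None

    def mark_copied(conn: sqlite3.Connection, table_name: str) -> None:
        conn.execute("INSERT OR IGNORE INTO copy_log VALUES (?)", (table_name,))
        conn.commit()

    def run(tables: list) -> None:
        conn = sqlite3.connect("copy_log.db")
        ensure_store(conn)
        for table in tables:
            if already_copied(conn, table):
                continue  # an earlier run already copied this table
            try:
                copy_table_to_lake(table)
                mark_copied(conn, table)
            except Exception as exc:
                # Leave the table unmarked so the next run retries only the failures.
                print(f"{table} failed: {exc}")

    if __name__ == "__main__":
        run(["dbo.Customers", "dbo.Orders", "dbo.Invoices"])

On a re-run, tables already recorded in the store are skipped, so only the failures from the previous execution are attempted again.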
Hope this helps,
J

Related

Azure DevOps: How to automatically re-deploy a stage if it fails on the first attempt

Instead of manually redeploying a stage, I want an automatic way to redeploy it (at the moment it can only be done manually).
My stage includes some disk operations, which sometimes fail on the first attempt but usually succeed on the second attempt.
I am currently re-running the task group in another job in the same stage; the second job executes only if the first one fails.
But this marks the stage as failed, because one of its two jobs has failed, even though both jobs are the same. I can't find a way to redeploy the same stage.
Your options are basically to keep doing what you have, or to replace the failing steps with a custom PowerShell/Bash script that knows how to retry (see the sketch below).
Edit: You could improve your existing solution a little by putting the second attempt in the same job as the first attempt; that way you wouldn't get a failed stage. You can put conditions on steps, not just jobs.
https://learn.microsoft.com/en-us/azure/devops/pipelines/process/conditions?view=azure-devops&tabs=yaml
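As a rough sketch of the “script that knows how to retry” idea (the answer suggests PowerShell/Bash; the shape is the same in any language), here is a Python example where do_disk_operation is a placeholder for the flaky disk step:

    import time

    def do_disk_operation() -> None:
        # Placeholder for the disk operation that sometimes fails on the first attempt.
        pass

    def run_with_retry(attempts: int = 2, delay_seconds: float = 30.0) -> None:
        for attempt in range(1, attempts + 1):
            try:
                do_disk_operation()
                return  # Success: the stage sees a single passing step.
            except Exception as exc:
                print(f"attempt {attempt} failed: {exc}")
                if attempt == attempts:
                    raise  # Out of attempts: let the step (and the stage) fail.
                time.sleep(delay_seconds)

    if __name__ == "__main__":
        run_with_retry()

Because the retry happens inside a single step, the stage only goes red when every attempt has failed.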

How to implement conditional branches in Azure Data Factory pipelines

I am implementing a pipeline to insert data updates from CSV files into a SQL DB. The plan is to first insert the data into a temporary SQL table for validation and transformation, and then move the processed data into the actual SQL table. I would like to branch the pipeline execution depending on the validation result: if the data is OK, it will be inserted into the target SQL table; if there are fatal failures, the insertion activity should be skipped.
I have tried to find instructions/guidance, but with no success so far. Any ideas whether a pipeline activity supports conditional execution, e.g. based on some properties of the input dataset?
It is possible now with Azure Data Factory version 2.
After an activity executes, downstream activities can be made dependent on four possible outcomes as standard:
- On success
- On failure
- On completion
- On skip
Also, custom ‘If’ conditions are available for branching based on expressions.
Refer to the links below for more detail:
https://www.purplefrogsystems.com/paul/2017/09/whats-new-in-azure-data-factory-version-2-adfv2/
https://learn.microsoft.com/en-us/azure/data-factory/tutorial-control-flow
The short answer is no.
I think it's worth pointing out that ADF is just an orchestration tool used to invoke other services. The current version can't do what you want because it does not have any compute of its own; it's not an SSIS-style data flow engine.
If you want this behaviour, you'll need to code it into SQL DB stored procedures, with flags etc. on the processed datasets.
Then maybe have some boilerplate code with parameters passed from ADF to perform either the insert, update, or divert operation (see the sketch below).
Handy link for calling a stored procedure with parameters from ADF: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-stored-proc-activity
Hope this helps.
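To make the branching idea concrete, here is a minimal Python sketch of the “stage, validate, then branch” pattern described above. The three functions are hypothetical placeholders for the stored procedures (or, in ADF v2, the activities wired up with on-success/on-failure dependencies) you would actually use:

    def validate_staging() -> bool:
        # Placeholder: real validation would live in a stored procedure
        # run against the temporary/staging table.
        return True

    def move_staging_to_target() -> None:
        print("moving validated rows to the target table")

    def log_and_divert_bad_rows() -> None:
        print("skipping the insert; logging/diverting the failed batch")

    def process_file() -> None:
        if validate_staging():
            move_staging_to_target()      # the "on success" branch
        else:
            log_and_divert_bad_rows()     # the "on failure" branch

    if __name__ == "__main__":
        process_file()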

Talend Force run order of joblets

My company has a couple of joblets that we put into new jobs to do things like initializing variables, getting system information from the database, and sending out error/warning emails. The issue we are running into is that if we go ahead and start creating the components of a job and then realize that we forgot to include these three joblets, we basically have to re-create the job to ensure that the joblets are added first so they run first.
Is there any way to force these joblets to run first, and possibly also in a certain order, before moving on to the contents of the job being created? Please let me know if there is any information you may need that I'm missing, as I have only been using Talend for a few days. The rest of the team has not been using it much longer than I have, so they do not have the answer I'm looking for either. Thanks in advance!
In joblets you can use the Trigger_Input and Trigger_Output components as connection points for OnSubjobOk triggers. That lets you connect joblets and other components in a job with triggers, thus enforcing execution order.
But you cannot get an OnSubjobOk trigger from a tPreJob. I am thinking of triggering from a tPreJob to a tWarn (OnComponentOk) and then from the tWarn to the joblet (OnSubjobOk).

Disable Retry When commitInterval = 1

The behavior we would like for the batch processing of our business entities is to roll back the failed transaction and not try again. I have read through the forum and it appears that this is not possible. We have set commitInterval=1 and tried the NeverRetryPolicy for this special case, but to no avail. I have read that the rationale is that the writer does not know whether the list of items it receives comes from the initial processing or from subsequent reprocessing after a failure.
Have I summarized this correctly, and does Spring Batch not currently support the behavior we are looking for?
Sounds like a candidate for Skip Logic
https://docs.spring.io/spring-batch/reference/html/configureStep.html
Check out these two sections in particular:
5.1.5 Configuring Skip Logic
5.1.7 Controlling Rollback
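For intuition only, here is a rough Python sketch of the skip-logic idea those sections describe; in Spring Batch itself this is configured declaratively on the step (skippable exception classes plus a skip limit), not hand-coded like this:

    def process_chunk(items, write_item, skip_limit=10):
        """Write items one at a time (commit interval of 1); skip bad items
        instead of retrying them, up to skip_limit failures."""
        skipped = []
        for item in items:
            try:
                write_item(item)  # one item per transaction
            except Exception as exc:
                skipped.append((item, exc))
                if len(skipped) > skip_limit:
                    raise  # too many failures: fail the step
        return skipped

The failing item's transaction is rolled back and the item is skipped rather than re-driven, which is the behavior the question is after.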

Dynamics CRM workflow failing with infinite loop detection - but why?

I want to run a plug-in every 30 minutes, to poll an external system for changes. I am in CRM Online, so I don't have ready access to a scheduling engine.
To run the plug-in, I have a 'trigger' entity with a timezone-independent date field.
Updating the field also triggers a workflow, which in pseudocode has this logic:
If (Trigger_WaitUntil >= [Process-Execution Time])
{
    Timeout until Trigger_WaitUntil
    {
        Set Trigger_WaitUntil to [Process-Execution Time] + 30 minutes
        Stop Workflow with status of: Succeeded
    }
}
If (Trigger_WaitUntil < [Process-Execution Time])
{
    Send email  // Tell an admin that the recurring task has self-terminated
    Stop Workflow with status of: Canceled
}
So, the behaviour I expect is that every 30 minutes, the 'WaitUntil' field gets updated (and the Plug-in and workflow get triggered again); unless the WaitUntil date is before the Execution time, in which case stop the workflow.
However, 4 hours or so later (probably 8 executions, although I haven't verified that yet) I get an infinite loop warning "This workflow job was canceled because the workflow that started it included an infinite loop. Correct the workflow logic and try again. For information about workflow".
My question is: why? Do workflows have a correlation id like plug-ins, which is being carried through to the child workflow? If so, is there any way I can prevent this, whilst maintaining the current basic mechanism of using a single trigger record to manage the schedule? (I've seen other solutions in which workflows create new records, but then you've got to go round tidying up the old trigger records as well.)
Yes, this behavior is well known. The only way to implement recurring workflows in Dynamics CRM without infinite-loop issues, using only out-of-the-box features, is to use the Bulk Deletion functionality. This article describes how to implement it: http://www.crmsoftwareblog.com/2012/08/using-the-bulk-deletion-process-to-schedule-recurring-workflows/
UPD: If you want to run your code every 30 minutes, you will have to create 48 bulk-delete jobs with corresponding start datetimes like 12:00, 12:30, 1:00, and so on.
The currently supported method for CRM is to use the Azure Scheduler.
Excerpt:
create a Web API application to communicate with CRM and our external provider running on a shared (free) Azure web site and also utilize the Azure Scheduler to manage the recurrence pattern.

The free version of the Azure Scheduler limits us to execution no more than once an hour and a maximum of 5 jobs. If you have a lot going on $20 a month will get you executions every minute and up to 50 jobs - which sounds like a pretty good deal.
So if you wanted to run every 30 minutes, you could create two jobs: one on the half hour and one on the hour.
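As a rough sketch of the “small Web API application that the scheduler calls” idea, here is a minimal Flask endpoint purely for illustration (the excerpt describes a Web API application on an Azure web site); poll_external_system_and_update_crm is a placeholder for your actual CRM/external-system integration code:

    from flask import Flask, jsonify

    app = Flask(__name__)

    def poll_external_system_and_update_crm() -> int:
        # Placeholder: check the external system for changes and push them
        # into CRM. Returns the number of records touched.
        return 0

    @app.route("/run-sync", methods=["POST"])
    def run_sync():
        # The scheduler (Azure Scheduler or any cron-like service) POSTs here
        # on the recurrence pattern you configure, e.g. every 30 minutes.
        updated = poll_external_system_and_update_crm()
        return jsonify({"updated": updated})

    if __name__ == "__main__":
        app.run()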
Bulk Deletion is an interesting workaround and something we've used before. It creates extra work and maintenance, though, so I try to avoid it if possible.
I would generally recommend building a Windows application and using the Windows Task Scheduler (I know you said you don't have a scheduler available, but this option is often forgotten). This approach works really well and is very easy to troubleshoot. Writing to logs and sending error email alerts is easy to add and makes it robust. The server doesn't need to be accessible externally; it only needs to reach CRM. If you had CRM on-premises, you could just use the same server.
Azure Scheduler is a great suggestion; it keeps you in the cloud, which is nice.
SSIS is another option if you already have KingswaySoft or Cozy Roc in place.
You could build a workflow that creates another record and cleans up after itself; however, this is really using the wrong tool for the job. Also, it's very easy for it to fail and then not initiate the next record.
There is a solution called "Scheduled Workflow Runner". You create a FetchXML query to build the record set to run against, and point it at an on-demand workflow that you want to run on each record.
http://alexanderdevelopment.net/post/2013/05/18/scheduling-recurring-dynamics-crm-workflows-with-fetchxml/