Retry a Google Cloud Data Fusion Pipeline if it fails - google-cloud-data-fusion

We have several Cloud Data Fusion pipelines and some of them fail, for seemingly random reasons, albeit rarely.
Is there a way to automatically retry a pipeline if it fails, say 3 times?

You can automatically retry a pipeline when it fails by using triggers. When the pipeline fails, you need to STOP and then START it again.
To use triggers with Data Fusion, select the event that should fire the trigger, which in this case is the Fails event. See the documentation about creating triggers.
Next you need to STOP and START the pipeline. See the documentation about starting and stopping pipelines.
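For reference, the stop/start step can also be scripted, because a Data Fusion instance exposes the CDAP REST API. Below is a minimal sketch, assuming a deployed batch pipeline called my-pipeline in the default namespace (batch pipelines run as the DataPipelineWorkflow program) and a placeholder API endpoint taken from your instance's details; names and retry policy are illustrative, not a definitive implementation.

import requests
import google.auth
from google.auth.transport.requests import Request

API_ENDPOINT = "https://<your-instance-api-endpoint>"   # placeholder: from the instance details
PIPELINE = "my-pipeline"                                 # placeholder pipeline name
BASE = f"{API_ENDPOINT}/v3/namespaces/default/apps/{PIPELINE}/workflows/DataPipelineWorkflow"

# Authenticate with the caller's Application Default Credentials.
credentials, _ = google.auth.default()
credentials.refresh(Request())
headers = {"Authorization": f"Bearer {credentials.token}"}

# STOP the failed or stuck run (the call just returns an error status if nothing is running) ...
requests.post(f"{BASE}/stop", headers=headers)
# ... then START the pipeline again.
requests.post(f"{BASE}/start", headers=headers)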

Related

Stop running Azure Data Factory Pipeline when it is still running

I have an Azure Data Factory pipeline. My trigger is set to run every 5 minutes.
Sometimes my pipeline takes more than 5 minutes to finish its jobs. In this case, the trigger runs again and creates another instance of my pipeline, and two instances of the same pipeline cause problems in my ETL.
How can I be sure that just one instance of my pipeline runs at a time?
As you can see, there are several running instances of my pipeline.
A few options I could think of:
OPT 1
Specify a 5-minute timeout on your pipeline activities:
https://learn.microsoft.com/en-us/azure/data-factory/concepts-pipelines-activities
https://learn.microsoft.com/en-us/azure/data-factory/concepts-pipelines-activities#activity-policy
OPT 2
1) Create a one-row, one-column SQL RunStatus table: 1 will be our "completed" status, 0 our "running" status.
2) At the end of your pipeline add a stored procedure activity that would set the bit to 1.
3) At the start of your pipeline add a lookup activity to read that bit.
4) The output of this lookup will then be used in if condition activity:
if 1 - start the pipeline's job, but before that add another stored procedure activity to set our status bit to 0.
if 0 - depending on the details of your project: do nothing, add a wait activity, send an email, etc.
To make full use of this option, you can turn the table into a log, where a new line with start and end times is added after each successful run (before initiating a new run, you can check whether the previous run has an end time). Having this log might help you gather data on how long your pipeline takes to run, so you can either add more resources or increase the interval between runs.
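Purely to make the flow of OPT 2 concrete, here is a rough Python sketch of the same gate logic. In ADF itself each step maps to the Lookup, If Condition and Stored Procedure activities described above; the table, column and helper names here are made up for illustration.

import pyodbc

def run_pipeline_job():
    """Placeholder for the actual ETL work the pipeline would do."""
    ...

conn = pyodbc.connect("<your-azure-sql-connection-string>")
cur = conn.cursor()

# Step 3: lookup - read the status bit (1 = completed, 0 = running).
cur.execute("SELECT Status FROM dbo.RunStatus")
status = cur.fetchone()[0]

if status == 1:
    # Step 4, "if 1": flip the bit to 'running', then do the work.
    cur.execute("UPDATE dbo.RunStatus SET Status = 0")
    conn.commit()
    run_pipeline_job()
    # Step 2: on success, set the bit back to 'completed'.
    cur.execute("UPDATE dbo.RunStatus SET Status = 1")
    conn.commit()
else:
    # Step 4, "if 0": previous run still in progress - do nothing, wait, alert, etc.
    pass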
OPT 3
Monitor the pipeline runs with the SDKs (I have not tried this, so it is just to point you in a possible direction):
https://learn.microsoft.com/en-us/azure/data-factory/monitor-programmatically
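As a rough, untested illustration of this option, here is a sketch using the azure-mgmt-datafactory Python SDK (resource names are placeholders, and the two-hour lookback window is arbitrary) that only starts a new run when no run of the same pipeline is still in progress:

from datetime import datetime, timedelta, timezone
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters, RunQueryFilter

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
FACTORY = "<factory-name>"
PIPELINE = "MyPipeline"

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Look for runs of this pipeline that are still in progress.
now = datetime.now(timezone.utc)
filters = RunFilterParameters(
    last_updated_after=now - timedelta(hours=2),
    last_updated_before=now,
    filters=[
        RunQueryFilter(operand="PipelineName", operator="Equals", values=[PIPELINE]),
        RunQueryFilter(operand="Status", operator="Equals", values=["InProgress"]),
    ],
)
runs = client.pipeline_runs.query_by_factory(RESOURCE_GROUP, FACTORY, filters)

# Only start a new run if nothing is currently running.
if not runs.value:
    client.pipelines.create_run(RESOURCE_GROUP, FACTORY, PIPELINE)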
Hopefully you can use at least one of these options.
It sounds like you're trying to run a process more or less constantly, which is a good fit for tumbling window triggers. You can create a dependency such that the trigger is dependent on itself - so it won't run until the previous run has completed.
Start by creating a trigger that runs a pipeline on a tumbling window, then create a tumbling window trigger dependency. The section at the bottom of that article, "Tumbling window self-dependency properties", shows what the definition should look like once you've successfully set this up.
Try changing the concurrency of the pipeline to 1.
Link: https://www.datastackpros.com/2020/05/prevent-azure-data-factory-from-running.html
My first thought is that the recurrence is too frequent under these circumstances. If the graph you shared is all for the same pipeline, then most runs take close to 5 minutes, but you have some that take 30, 40, even 60 minutes. Situations like this are where a simple recurrence trigger probably isn't sufficient. What is supposed to happen while the 60-minute run is going? There will be 10-12 runs that won't start: do they still need to run, or can they be ignored?
To make sure all the pipelines run, and manage concurrency, you're going to need to build a queue manager of some kind. ADF cannot handle this itself, so I have built such a system internally and rely on it extensively. I use a combination of Logic Apps, Stored Procedures (Azure SQL), and Azure Functions to queue, execute, and monitor pipeline executions. Here is a high level break down of what you probably need:
Logic App 1: runs every 5 minutes and queues an ADF job in the SQL database.
Logic App 2: runs every 2-3 minutes and checks the queue to see whether a) there is no job currently running (status = 'InProgress') and b) there is a job in the queue waiting to run (I do this with a stored procedure). If both conditions are met: execute the next ADF pipeline and update its status to 'InProgress'.
I use an Azure Function to submit jobs instead of the built-in Logic App activity because I have better control over variable parameters. Also, it can return the newly created ADF RunId, which I rely on in #3.
Logic App 3: runs every minute and updates the status of any 'InProgress' jobs.
I use an Azure Function to check the status of the ADF pipeline based on RunId.
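For what it's worth, that status check can be a very small function. Here is a hedged sketch with the azure-mgmt-datafactory Python SDK; the function name and resource names are placeholders, not the actual implementation described above.

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

def get_adf_run_status(subscription_id: str, resource_group: str,
                       factory_name: str, run_id: str) -> str:
    """Return the current status of an ADF pipeline run, looked up by RunId."""
    client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)
    run = client.pipeline_runs.get(resource_group, factory_name, run_id)
    # Typical values: Queued, InProgress, Succeeded, Failed, Cancelled.
    return run.status

# The queue row for this RunId would then be updated with the returned status.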

Understanding compute acquisition times across pipelines

I am struggling to optimize my Data Factory pipelines to spend as little time as possible spinning up compute for data flows.
My understanding is that if we set up a runtime with a TTL of, say, 15 minutes, then all subsequent flows executed in sequence after this should see very short compute acquisition times. But does this also hold true when switching from one pipeline to another? In the image below, would flow 3 benefit from the runtime that was already spun up for flow 1? I ask because I see very sporadic behavior.
Pipeline example
If you are using the same Azure IR inside the same factory, yes. However, the activities must be executed sequentially; otherwise ADF will spin up another pool for you. That's because parallel job executions are not supported in Databricks job clusters. I describe the techniques in this video and in this document.

Does draining a dataflow job that uses FILE_LOAD write method ensure that all elements are written?

We are writing elements to BigQuery in the following way:
pcoll.apply(BigQueryIO.writeTableRows()
.to(destination)
.withSchema(tableSchema)
.withMethod(BigQueryIO.Write.Method.FILE_LOADS)
.withTriggeringFrequency(org.joda.time.Duration.standardMinutes(10))
.withNumFileShards(10)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
When we drain the job, either via the gcloud CLI tool or the Google Cloud console, it seems that the job is considered "drained" almost instantly, even if withTriggeringFrequency had just fired beforehand. Is the behaviour of the drain function such that it triggers all writes if any are pending?
Yes. When you issue the Drain command, Dataflow immediately closes any in-process windows and fires all triggers.
Once Drain is triggered, the pipeline will stop accepting new inputs and the input watermark will be advanced to infinity. Elements already in the pipeline will continue to be processed. Drained jobs can safely be cancelled.
For reference, see the Google documentation on the effects of draining a job.

Concourse - fly CLI - Limit to specific job name

Is it possible in Concourse to limit a run to a specific task inside the pipeline? Let's say I have a pipeline with three jobs, but I want to test just job #2, not jobs 1 and 3. I tried to trigger a job by pointing to a pipeline/job-name and it kind of worked (i.e., fly -t lab tj -j bbr-backup-bosh/export-om-installation). 'Kind of' because it did start from this job, but it then fired off other jobs that I didn't want to test. I'm wondering if there is an Ansible-like option (i.e., --tag).
Thanks!!
You cannot "limit" a triggered job to itself, since a job is part of a pipeline. Each time you trigger a job, it will still put all the resources it uses, and if those resources are marked as trigger: true downstream, they will trigger the downstream jobs.
You have two possibilities:
Do not mark any resource in the pipeline as trigger: true. This obviously also means that your pipeline will never advance automatically; you will need to manually trigger each job. Not ideal, but maybe good enough while troubleshooting the pipeline itself.
Think in terms of tasks. A job is made of one or more tasks, and tasks can be run independently from the pipeline. See the documentation for fly execute and, for example, https://concoursetutorial.com/ where they explain tasks and fly execute. Note that fly execute also supports --input and --output, so it is possible to emulate the task inputs and outputs as if the task were running in the pipeline.
Marco is pretty dead on, but there's one other option: you could pause the other jobs and abort any builds that would be triggered after they're unpaused.

Graceful custom activity timeout in data factory

I'm looking to impose a timeout on custom activities in data factory via the policy.timeout property in the activity json.
However, I haven't seen any documentation about how the timeout operates on Azure Batch. I assume that the Batch task is forcibly terminated somehow.
But is the task -> custom activity informed so it can tidy up?
The reason I ask is that I could be mid-copy to Data Lake Store, and I neither want to let it run indefinitely nor stop it without some sort of clean-up (I can't see a way of doing transactions as such using the Data Lake Store SDK).
I'm considering putting the timeout within the custom activity, but it would be a shame to have timeouts defined at 2 different levels (I'd probably still want the overall timeout).
I feel your pain.
ADF simply terminates the activity if its own timeout is reached, regardless of what state the invoked service is in.
I have the same issue with U-SQL processing calls. It takes a lot of proactive monitoring via PowerShell to ensure Data Lake or Batch jobs have enough compute to complete jobs with naturally increasing data volumes before the ADF timeout kill occurs.
I'm not aware of any graceful way for ADF to handle this because it would differ for each activity type.
Time to create another feedback article for Microsoft!
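If you do end up putting the timeout inside the custom activity itself, as the question suggests, one minimal sketch of that pattern (standard library only; the work and clean-up functions are placeholders) is to run the copy in a child process so it can be terminated and tidied up before ADF's hard kill arrives:

import multiprocessing
import sys

INTERNAL_TIMEOUT_SECONDS = 50 * 60   # keep this shorter than ADF's policy.timeout

def copy_to_data_lake():
    """Placeholder for the real copy work."""
    ...

def clean_up_partial_output():
    """Placeholder: remove or flag whatever the interrupted copy left behind."""
    ...

def main():
    worker = multiprocessing.Process(target=copy_to_data_lake)
    worker.start()
    worker.join(timeout=INTERNAL_TIMEOUT_SECONDS)
    if worker.is_alive():
        worker.terminate()           # stop the in-flight copy
        worker.join()
        clean_up_partial_output()    # tidy up before reporting failure
        sys.exit(1)                  # non-zero exit so Batch/ADF mark the task failed

if __name__ == "__main__":
    main()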