I have a pipeline that needs to run daily... but the data only arrives at around 2pm on that day (for the previous day)... so when midnight ticks over, the data isn't available, and therefore everything falls over ;)
I have tried this:
"start": "2016-02-10T15:00:00Z",
"end": "2016-05-31T00:00:00Z",
but it still kicks off at midnight, I assume because my scheduler is as follows:
"scheduler": {
"frequency": "Day",
"interval": 1
},
I think I need to use either anchorDateTime or offset, but I'm not sure which?
The following in the pipeline definition just means that you want the pipeline enabled during that period.
"start": "2016-02-10T15:00:00Z",
"end": "2016-05-31T00:00:00Z"
The following in the activity definition means that you want the activity to run at the end of the day, i.e. midnight UTC.
"scheduler": {
"frequency": "Day",
"interval": 1
},
If you want the activity to kick off at 2pm, use the following:
"scheduler": {
"frequency": "Hour",
"interval": 24,
"anchorDateTime": "2016-02-10T14:00:00"
},
You should make it consistent in the target dataset definition as well:
"availability": {
"frequency": "Hour",
"interval": 24,
"anchorDateTime": "2016-02-10T14:00:00"
}
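On the offset option you mentioned: the dataset availability (and the matching activity scheduler) also accepts an offset timespan, which should shift the daily slice to 2pm while keeping the Day frequency. A rough sketch only, not something I've verified against your pipeline:
"availability": {
"frequency": "Day",
"interval": 1,
"offset": "14:00:00"
}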
Hope that helps!!
I'm trying to create a tumbling window trigger that runs every 1 hour, with a 10 minute delay before the pipeline starts executing.
I created a test trigger with time interval of 5 minutes and delay of 10 minutes.
I expected the pipeline to run every 15 minutes (5 min interval + 10 min delay).
What I actually see in the Monitor section (Pipeline Runs and Trigger Runs) is that it runs every 5 minutes.
Shouldn't the delay postpone the pipeline execution?
Am I doing something wrong here?
Updated
Here's my trigger template:
{
"name": "[concat(parameters('factoryName'), '/trigger_test')]",
"type": "Microsoft.DataFactory/factories/triggers",
"apiVersion": "2018-06-01",
"properties": {
"annotations": [],
"runtimeState": "Started",
"pipeline": {
"pipelineReference": {
"referenceName": "exportData",
"type": "PipelineReference"
},
"parameters": {}
},
"type": "TumblingWindowTrigger",
"typeProperties": {
"frequency": "Minute",
"interval": 5,
"startTime": "2021-07-25T07:46:00Z",
"delay": "00:10:00",
"maxConcurrency": 50,
"retryPolicy": {
"intervalInSeconds": 30
},
"dependsOn": []
}
},
"dependsOn": [
"[concat(variables('factoryId'), '/pipelines/exportData')]"
]
}
I haven't found a concrete example, and the docs are not very clear in terms of terminology.
From what I understand, when one trigger window finishes running, the next trigger window starts regardless of the delay specified.
According to the docs, "the delay doesn't alter the window startTime", which I assume means what I mentioned above: for example, the 07:46-07:51 window is still the 07:46-07:51 window, and the delay only pushes back when its run actually starts.
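For what it's worth, if the goal were simply one pipeline run every 15 minutes, my understanding is that the interval itself would have to be 15; the delay only postpones each window's execution, it never spaces the windows out. A sketch of the typeProperties under that assumption:
"typeProperties": {
"frequency": "Minute",
"interval": 15,
"startTime": "2021-07-25T07:46:00Z",
"delay": "00:00:00",
"maxConcurrency": 50,
"retryPolicy": {
"intervalInSeconds": 30
},
"dependsOn": []
}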
A pipeline has to trigger every December on the second Friday from the end of the month.
I am trying to do this using a scheduled trigger in ADF (see the trigger definition below), using:
Start date of Dec 1st 2021
Recurrence of 12 months
No end date
Advanced recurrence option of weekdays with occurrence as -2 and day as Friday.
"name": "Dec_Last_But_One_Friday",
"properties": {
"annotations": [],
"runtimeState": "Stopped",
"pipelines": [
{
"pipelineReference": {
"referenceName": "pipeline_test_triggers",
"type": "PipelineReference"
}
}
],
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Month",
"interval": 12,
"startTime": "2021-12-01T14:24:00Z",
"timeZone": "UTC",
"schedule": {
"monthlyOccurrences": [
{
"day": "Friday",
"occurrence": -2
}
]
}
}
}
}
}
Is this the right way? How do I know it will definitely trigger every year on the second Friday from the end of December? Is there a way to see the future schedules in ADF?
Thank you!
Viewing a trigger's future schedule is not a feature that currently exists in Data Factory V2.
Since the advanced recurrence options are not perfect, it's best to check it once a year.
I have a very simple pipeline that I have setup to test tumbling window trigger dependency. So the pipeline has a single Wait activity. Here is the pipeline code:-
{
"name": "pl-something",
"properties": {
"activities": [
{
"name": "Wait1",
"type": "Wait",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"waitTimeInSeconds": 25
}
}
],
"parameters": {
"date_id": {
"type": "string"
}
},
"annotations": []
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
I have created the following hourly trigger on it which just executes it at hourly intervals:-
{
"name": "trg-hourly",
"properties": {
"annotations": [],
"runtimeState": "Started",
"pipeline": {
"pipelineReference": {
"referenceName": "pl-something",
"type": "PipelineReference"
},
"parameters": {
"date_id": "#formatDateTime(triggerOutputs().windowStartTime, 'yyyyMMddHH')"
}
},
"type": "TumblingWindowTrigger",
"typeProperties": {
"frequency": "Hour",
"interval": 1,
"startTime": "2019-11-01T00:00:00.000Z",
"delay": "00:00:00",
"maxConcurrency": 1,
"retryPolicy": {
"intervalInSeconds": 30
},
"dependsOn": []
}
}
}
The parameter date_id exists so I know exactly which hourly window a trigger instance is running for. This executes fine. My goal is to create another trigger on the same pipeline that runs daily and depends on the hourly trigger, so that the daily trigger does not run until all 24 hours in a day have been processed. In the screenshot below you can see how I am trying to set up this new trigger as dependent on the hourly trigger (trg-hourly), but the 'OK' button is not activated whenever I try to specify a 24-hour window, and you can see the error that the window size is not valid. There is no JSON to show, since it's not even letting me create the trigger. What's the issue here?
Maybe it is expecting 1.00:00:00 instead of 0.24:00:00 because there are 24 hours in a day.
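For reference, once the window size is accepted, the daily trigger's dependency on trg-hourly would look roughly like the JSON below. This is only a sketch based on my reading of the tumbling window dependency docs; the trigger name trg-daily is made up, and the date_id parameter is just copied from the hourly trigger.
{
"name": "trg-daily",
"properties": {
"annotations": [],
"runtimeState": "Started",
"pipeline": {
"pipelineReference": {
"referenceName": "pl-something",
"type": "PipelineReference"
},
"parameters": {
"date_id": "#formatDateTime(triggerOutputs().windowStartTime, 'yyyyMMddHH')"
}
},
"type": "TumblingWindowTrigger",
"typeProperties": {
"frequency": "Hour",
"interval": 24,
"startTime": "2019-11-01T00:00:00.000Z",
"delay": "00:00:00",
"maxConcurrency": 1,
"retryPolicy": {
"intervalInSeconds": 30
},
"dependsOn": [
{
"type": "TumblingWindowTriggerDependencyReference",
"referenceTrigger": {
"referenceName": "trg-hourly",
"type": "TriggerReference"
},
"size": "1.00:00:00"
}
]
}
}
}
The size of 1.00:00:00 covers the 24 hourly windows of the day, so the daily run should only fire once all of them have completed.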
Can somebody let me know how to get the previous day's data (i.e. 2017-07-28, etc.) from my on-premises file system if my pipeline start and end dates are
"start": "2017-07-29T00:00:00Z",
"end": "2017-08-03T00:00:00Z"
My pipeline's input is "FileSystemSource" and output is "AzureDataLakeStore". I have tried the below JSON in my copy pipeline as input:
"inputs": [
{
"name": "OnPremisesFileInput2"
"startTime": "Date.AddDays(SliceStart, -1)",
"endTime": "Date.AddDays(SliceEnd, -1)"
}
]
I have also tried defining "offset" in the input and output datasets and in the pipeline as follows
"availability": {
"frequency": "Day",
"interval": 1,
"offset": "-1.00:00:00",
"style": "StartOfInterval"
},
"scheduler": {
"frequency": "Day",
"interval": 1,
"offset": "-1.00:00:00",
"style": "StartOfInterval"
},
None of the above seems to be working. Could someone please help me?
I think a good strategy here is to think of yesterday's output as today's input. Azure Data Factory lets you run activities one after another in sequence using different data sources.
There's good documentation here
This way you can either have a temporary storage in between the two activities, or use your main input data source but with a filter to get only yesterday's slice.
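Here is a rough sketch of that chaining idea, assuming a staging dataset sits between the two copy activities (StagingBlob and AzureDataLakeOutput are made-up dataset names; OnPremisesFileInput2 is the one from your question). The output of the first activity being the input of the second is what sequences them in ADF v1:
"activities": [
{
"name": "CopyYesterdayToStaging",
"type": "Copy",
"inputs": [ { "name": "OnPremisesFileInput2" } ],
"outputs": [ { "name": "StagingBlob" } ],
"typeProperties": {
"source": { "type": "FileSystemSource" },
"sink": { "type": "BlobSink" }
},
"scheduler": { "frequency": "Day", "interval": 1 }
},
{
"name": "CopyStagingToLake",
"type": "Copy",
"inputs": [ { "name": "StagingBlob" } ],
"outputs": [ { "name": "AzureDataLakeOutput" } ],
"typeProperties": {
"source": { "type": "BlobSource" },
"sink": { "type": "AzureDataLakeStoreSink" }
},
"scheduler": { "frequency": "Day", "interval": 1 }
}
]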
Your offset should be positive.
"availability": {
"frequency": "Day",
"interval": 1,
"offset": "01:00:00",
"style": "EndOfInterval"
}
In this case it will run, for example, on September 7th at 1:00 AM UTC and will process the slice from Sep 6th 00:00 UTC to Sep 7th 00:00 UTC, which is yesterday's slice.
Your input dataset should be configured to use SliceStart for the naming of the file:
"partitionedBy": [
{
"name": "Slice",
"value": {
"type": "DateTime",
"date": SliceStart",
"format": "yyyymmdd"
}
}],
"typeProperties": {
"fileName": "{slice}.csv",
}
It would look for the 20170906.csv file when executed on Sept 7th.
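Assuming the copy activity writes to that dataset, its scheduler section should mirror the availability settings above so the activity and dataset slices line up. A sketch:
"scheduler": {
"frequency": "Day",
"interval": 1,
"offset": "01:00:00",
"style": "EndOfInterval"
}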
I'm having some trouble with the execution order of scheduled pipelines in Data Factory.
My pipeline is as follows:
{
"name": "Copy_Stage_Upsert",
"properties": {
"description": "",
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlDWSink",
"writeBatchSize": 10000,
"writeBatchTimeout": "00:10:00"
}
},
"inputs": [
{
"name": "csv_extracted_file"
}
],
"outputs": [
{
"name": "stage_table"
}
],
"policy": {
"timeout": "01:00:00",
"retry": 2
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "Copy to stage table"
},
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlDWSource",
"sqlReaderQuery": "SELECT * from table WHERE id NOT IN (SELECT id from stage_table) UNION ALL SELECT * from stage_table"
},
"sink": {
"type": "SqlDWSink",
"writeBatchSize": 10000,
"writeBatchTimeout": "00:10:00"
}
},
"inputs": [
{
"name": "stage_table"
}
],
"outputs": [
{
"name": "upsert_table"
}
],
"policy": {
"timeout": "01:00:00",
"retry": 2
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "Copy"
},
{
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "sp_rename_tables"
},
"inputs": [
{
"name": "upsert_table"
}
],
"outputs": [
{
"name": "table"
}
],
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "Rename tables"
}
],
"start": "2017-02-09T18:00:00Z",
"end": "9999-02-06T15:00:00Z",
"isPaused": false,
"hubName": "",
"pipelineMode": "Scheduled"
}
}
For simplicity, imagine that I have one pipeline called A with three simple tasks:
Task 1, Task 2 and finally Task 3.
Scenario A
One execution of Pipeline A is scheduled.
It runs as:
Task 1 -> Task 2 -> Task 3
Scenario B
Two or more executions of Pipeline A are scheduled.
It runs as:
First Scheduled Pipeline Task 1 -> Second Scheduled Pipeline Task 1 -> First Scheduled Pipeline Task 2 -> Second Scheduled Pipeline Task 2 -> First Scheduled Pipeline Task 3 -> Second Scheduled Pipeline Task 3.
Is it possible to run the second scenario as:
First Scheduled Pipeline Task 1 -> First Scheduled Pipeline Task 2 -> First Scheduled Pipeline Task 3, Second Scheduled Pipeline Task 1 -> Second Scheduled Pipeline Task 2 -> Second Scheduled Pipeline Task 3
In other words, I need to finish the first scheduled pipeline before the second pipeline starts.
Thank you in advance!
It's possible. However, it will require some fake input and output datasets to enforce the dependency behaviour and create the chain as you describe. So possible, but a dirty hack!
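A minimal sketch of that hack, assuming an AzureBlob dummy dataset (DummyChainDataset and StorageLinkedService are made-up names): the final activity of the first pipeline lists it as an output and the first activity of the second pipeline lists it as an input, which makes ADF wait for the first pipeline's slice to complete before the second one starts.
{
"name": "DummyChainDataset",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "chaining/dummy"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
The availability here matches the hourly slices used in the pipeline above; as noted below, different slice intervals make this messier.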
This isn't a great solution and it will become complicated if your outputs/downstream datasets in the second pipeline have different time slice intervals to the first. I recommend testing and understanding this.
Really ADF isn't designed to do what you want. It's not a tool to sequence things like steps in a SQL Agent job. ADF is for scale and parallel work streams.
From what I understand via whispers from Microsoft peeps, there might be more event-driven scheduling coming soon in ADF. But I don't know for sure.
Hope this helps.