Monthly schedule in Azure Data Factory pipeline - azure-data-factory

How can we schedule an Azure Data Factory pipeline to run only on a particular day of the month (e.g. the 9th of every month)?

This can be achieved with the "schedule trigger" capability in the V2 ADF service: https://learn.microsoft.com/en-us/azure/data-factory/concepts-pipeline-execution-triggers#examples-recurrence-schedules
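For example, a schedule trigger that fires once a month on the 9th might look roughly like the following (the trigger name, start time, and pipeline reference are placeholders, not from the question):
{
    "name": "RunOn9thOfMonth",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Month",
                "interval": 1,
                "startTime": "2018-01-09T00:00:00Z",
                "timeZone": "UTC",
                "schedule": {
                    "monthDays": [9],
                    "hours": [0],
                    "minutes": [0]
                }
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "type": "PipelineReference",
                    "referenceName": "MyMonthlyPipeline"
                }
            }
        ]
    }
}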

If you are using Data Factory version 1, you can achieve this by setting the availability with frequency Month and interval 1, and setting the offset to the day of the month on which you want the pipeline to run.
For example, if you want it to run on the 9th of each month as you said, you would have something like this:
"availability": {
"frequency": "Month",
"interval": 1,
"offset": "9.00:00:00",
"style": "StartOfInterval"
}
If you are using Data Factory version 2, this can be achieved with triggers, as Mark Kromer said. You have to set "monthDays": [9] in the trigger's schedule (see the example above).
Cheers.

Related

Quartz job not getting triggered on the end datetime

I am quite new to Quartz scheduled jobs. I scheduled one job with a given start time and end time.
The job successfully triggers on the start time and at the recurring intervals, but on the last recurrence, which is equal to the end time, the job does not trigger anything.
"schedule": {
"startDatetime": 1664457960000,
"endDatetime": 1664717400000,
"recurrenceType": "Interval",
"messageSendTimeZone": "America/Chicago",
"recurrence": "2"
}
I want the job to trigger on all the intervals at the given recurrence.
For example, if I start the job on 28th Sept with an end time of 2nd Oct, it should trigger on the 28th, 30th and 2nd as well.
Is there something that I am missing?
Thanks,
The endDatetime property indicates when the trigger's schedule should be canceled, i.e. the point at which the scheduler stops firing, so an occurrence that falls exactly on the end time is not executed. If you want that last occurrence to run, set endDatetime slightly after the final intended fire time.
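For example, pushing endDatetime one minute past the last intended fire time (the values below are the ones from the question with 60,000 ms added; the exact amount of padding is up to you):
"schedule": {
    "startDatetime": 1664457960000,
    "endDatetime": 1664717460000,
    "recurrenceType": "Interval",
    "messageSendTimeZone": "America/Chicago",
    "recurrence": "2"
}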

Apache Beam unbounded data processing and computing aggregations like counts and durations per key per window

I am new to Beam pipelines and I have a requirement to compute aggregated stats (counts and durations for each window, like every 30 min or so) from the events received through a Kafka topic (unbounded source).
Events
{"id":"xxxxx", "state": "start", "timestamp": 1625718600000, "device": "device-1", ...}
{"id":"xxxxx", "state": "end", "timestamp": 1625721300000, "device": "device-1",. ..}
{"id":"yyyyy", "state": "start", "timestamp": 1625718600000, "device": "device-2", ...}
{"id":"yyyyy", "state": "end", "timestamp": 1625719500000, "device": "device-2", ...}
Event "xxxxx" started 10:00 and ended 10:45
Event "yyyyy" started 10:00 and ended 10:15
Expected Stats from pipeline
Device Interval Count Duration
device-1 10:00-10:30 1 30 min
device-2 10:00-10:30 1 15 min
device-1 10:30-11:00 0 15 min
I played with fixed windows, triggers, GroupByKey, CombineFn etc. and I was successful in computing the aggregated counters (incrementing the count if the event state is "start"), but I am clueless on how to compute the duration when an event overlaps window boundaries, even with stateful processing.
Note: I used the event identifier as the key while grouping the events.
Please advise me on this.
So it sounds like what you need is to bring the start and end events of the same ID onto the same worker so you can compute the difference, right?
I can think of a few ways to do this.
Use GroupByKey - you can set the event ID as the key, perform a GroupByKey to group events by that key, and then compute and output the difference.
Stateful DoFn - when you receive an event, store it in state keyed by the ID; before storing, check whether the matching event is already in state and, if it is, compute the difference.
One thing to note is that it's possible for the start and end events to fall into two different windows. Neither of the above solutions will work in that case, since different windows are processed independently. I think you'll have to adapt the pipeline to account for such (rare?) occurrences.

When to store data in different collections in MongoDB?

I want to store events that occurred during a week in MongoDB. Each week can have 300-400 events.
One week's events are independent of another week's, and I fetch or process only one week at a time (never joining two or more weeks).
All event objects have the same properties but different values.
Is it better to create a separate collection for each week, or to keep everything in the same collection?
Given the information provided, I would choose to store everything in one collection, because 300-400 events per week is really a small number of entities (documents).
The document schema depends on the project's needs/details, but as a minimum I would add separate year and week fields for filtering purposes.
Document example:
{
    "_id": ObjectId("5c530d202029a5144454f9c2"),
    "year": 2019,
    "week": 10,
    "event": {
        "date": ISODate("2019-03-07T11:00:01.022Z"),
        "name": "some event name",
        "message": "some message"
    }
}
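With that schema, fetching (and indexing) a single week is a simple equality filter. A quick sketch, assuming the collection is called events (adjust the name to your project):
db.events.createIndex({ "year": 1, "week": 1 })
db.events.find({ "year": 2019, "week": 10 })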

Is there a way to set SliceStart back a few days?

I am working on a Data Factory where I want to query data from a few days back.
I am executing a stored procedure that takes input based on the slice start:
For Example:
"value": "$$Text.Format('{0:dd}', SliceStart)"
So a run that starts on Friday should query data from Monday.
I can do some date manipulation in the stored procedure, but ideally I would like the window start and end dates to reflect the data that was copied.
In V1 you can specify "delay" in the policy of your activities. This allows you to postpone the execution of an activity. The example shows a couple of minutes, but I think you can use it to make the activity execute a few days later, while still showing the date/time of your slice.
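A minimal sketch of what that might look like in the activity definition, assuming the delay timespan accepts a day component (the two-day value is only an illustration):
"policy": {
    "timeout": "01:00:00",
    "concurrency": 1,
    "delay": "2.00:00:00"
}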
Yes, you can use the Date.AddDays function in order to accomplish this. From your screenshot and the fact you are talking about slices, I assume you are using Data Factory version 1. Here is an overview of ADF v1 functions.
For your example, to get the date that is 4 days before SliceStart, you would write something like:
"value": "$$Text.Format('{0:dd}', Date.AddDays(SliceStart, -4))"

Which day is considered the trigger day for a weekly dataset/pipeline?

If I define my dataset/pipeline as weekly, which day does ADF use by default, provided I am not adding any offset? For daily and monthly it's clear to me: for monthly, for example, it is the first day of the month, and for daily it is the first hour of the day. So what is it for weekly? On exactly which day will it get triggered?
And another question: what if I want the pipeline to execute in the middle of the week every week (e.g. every Thursday)?
It uses the ISO standard for weeks, which start on Monday.
For the second part of your question, using the offset attribute will deal with this. For example:
"availability": {
"frequency": "Week",
"interval": 1,
"offset": "04.00:00:00",
"style": "StartOfInterval"
}
Hope this helps.