Can somebody let me know how to get the previous day's data (i.e. 2017-07-28 etc.) from my on-premises file system if my pipeline's start and end dates are:
"start": "2017-07-29T00:00:00Z",
"end": "2017-08-03T00:00:00Z"
My pipeline's input is "FileSystemSource" and output is "AzureDataLakeStore". I have tried the below JSON in my copy pipeline as the input:
"inputs": [
{
"name": "OnPremisesFileInput2"
"startTime": "Date.AddDays(SliceStart, -1)",
"endTime": "Date.AddDays(SliceEnd, -1)"
}
]
I have also tried defining "offset" in the input and output datasets and in the pipeline, as follows:
"availability": {
"frequency": "Day",
"interval": 1,
"offset": "-1.00:00:00",
"style": "StartOfInterval"
},
"scheduler": {
"frequency": "Day",
"interval": 1,
"offset": "-1.00:00:00",
"style": "StartOfInterval"
},
None of the above seems to be working. Could someone please help me?
I think a good strategy for this is to think about yesterday's output as today's input. Azure Data Factory lets you run activities one after another in sequence using different data sources.
There's good documentation on this here, with an example of chaining activities.
This way you can either have temporary storage in between the two activities, or use your main input data source with a filter that picks up only yesterday's slice.
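As a rough illustration of that chaining idea (a sketch only; the staging and output dataset names below are placeholders, not part of your setup), the activities section of a pipeline might look something like this, with the second copy consuming the dataset the first one produces:

"activities": [
    {
        "name": "CopyFromOnPremFileShare",
        "type": "Copy",
        "inputs": [ { "name": "OnPremisesFileInput2" } ],
        "outputs": [ { "name": "StagingBlobDataset" } ],
        "typeProperties": {
            "source": { "type": "FileSystemSource" },
            "sink": { "type": "BlobSink" }
        }
    },
    {
        "name": "CopyStagingToDataLake",
        "type": "Copy",
        "inputs": [ { "name": "StagingBlobDataset" } ],
        "outputs": [ { "name": "AzureDataLakeStoreOutput" } ],
        "typeProperties": {
            "source": { "type": "BlobSource" },
            "sink": { "type": "AzureDataLakeStoreSink" }
        }
    }
]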
Your offset should be positive.
"availability": {
"frequency": "Day",
"interval": 1,
"offset": "01:00:00",
"style": "EndOfInterval"
}
In this case it will run, for example, on September 7th at 1:00 AM UTC and will process the slice from Sept 6th 0:00 UTC to Sept 7th 0:00 UTC, which is yesterday's slice.
Your input dataset should be configured to use SliceStart for the naming of the file:
"partitionedBy": [
{
"name": "Slice",
"value": {
"type": "DateTime",
"date": SliceStart",
"format": "yyyymmdd"
}
}],
"typeProperties": {
"fileName": "{slice}.csv",
}
It would look for 20170906.csv file when executed on Sept 7th.
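Putting those pieces together, the input dataset definition could look roughly like this (a sketch only; the linked service name and folder path are made-up placeholders):

{
    "name": "OnPremisesFileInput2",
    "properties": {
        "type": "FileShare",
        "linkedServiceName": "OnPremisesFileServerLinkedService",
        "typeProperties": {
            "folderPath": "dailyexports",
            "fileName": "{Slice}.csv",
            "partitionedBy": [
                {
                    "name": "Slice",
                    "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyyMMdd" }
                }
            ]
        },
        "external": true,
        "availability": {
            "frequency": "Day",
            "interval": 1
        }
    }
}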
A pipeline has to trigger every December on the second Friday from the end of the month.
I am trying to do this using a scheduled trigger in ADF (see Trigger Definition) by using:
Start date of Dec 1st 2021
Recurrence of 12 months
No end date
Advanced recurrence option of weekdays with occurrence as -2 and day as Friday.
"name": "Dec_Last_But_One_Friday",
"properties": {
"annotations": [],
"runtimeState": "Stopped",
"pipelines": [
{
"pipelineReference": {
"referenceName": "pipeline_test_triggers",
"type": "PipelineReference"
}
}
],
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Month",
"interval": 12,
"startTime": "2021-12-01T14:24:00Z",
"timeZone": "UTC",
"schedule": {
"monthlyOccurrences": [
{
"day": "Friday",
"occurrence": -2
}
]
}
}
}
}
}
Is this the right way? How do I know it will definitely trigger every year on the second Friday from the end of December? Is there a way to see the future schedules in ADF?
Thank you!
Viewing the future ADF schedule is a feature that does not currently exist in Data Factory V2.
Since the advanced recurrence options are not perfect, we'd better check once a year.
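One manual way to gain confidence is to compute, outside ADF, which date the -2 Friday occurrence should resolve to each year and compare that with the trigger's actual run history. A small Python sketch (just a sanity check, not anything ADF provides) could look like this:

import calendar
from datetime import date, timedelta

def second_last_friday_of_december(year: int) -> date:
    """Walk backwards from Dec 31 and return the second-to-last Friday of December."""
    day = date(year, 12, 31)
    fridays = []
    while day.month == 12:
        if day.weekday() == calendar.FRIDAY:
            fridays.append(day)
        day -= timedelta(days=1)
    # fridays[0] is the last Friday of the month, fridays[1] is the one before it
    return fridays[1]

for year in range(2021, 2027):
    print(year, second_last_friday_of_december(year))  # e.g. 2021 -> 2021-12-24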
I'm trying to add a day to my dates, which look like "2020-11-20" for November 20, 2020. However, I am encountering difficulty doing that - do I need to use the offset function? The reason I am doing this is that Vega-Lite is automatically offsetting my dates back one day through its GMT conversion, and I cannot get it to stop. Please help!
Here is an example. If you look at the timeline graph it ends at 2020-11-19, but the final date in my data is 2020-11-20, and I need to make it so 2020-11-20 is the last date on my timeline graph.
This issue comes from an unfortunate "feature" of how JavaScript parses dates. Here is a minimal example of the problem you're seeing (open in editor):
{
"data": {
"values": [
{"date": "2020-11-17", "value": 5},
{"date": "2020-11-18", "value": 6},
{"date": "2020-11-19", "value": 7},
{"date": "2020-11-20", "value": 8}
]
},
"mark": "bar",
"encoding": {
"x": {"field": "value", "type": "quantitative"},
"y": {
"field": "date",
"timeUnit": "yearmonthdate",
"type": "ordinal"
},
"tooltip": [
{
"field": "date",
"timeUnit": "yearmonthdate",
"type": "temporal"
}
]
}
}
Each of the days in the chart is off by one compared to the input. So why is this happening?
Well, it turns out that the Vega-Lite renderer makes use of JavaScript's built-in date parsing, and JavaScript's date parsing treats inputs differently depending on how they're formatted. In particular, JavaScript will parse a date-only string like "2020-11-20" as UTC, but will parse a full ISO-8601 date-time string as local time, a fact you can confirm in your browser's JavaScript console (I executed this on a computer set to PST):
> new Date('2020-11-20')
Thu Nov 19 2020 16:00:00 GMT-0800 (Pacific Standard Time)
> new Date('2020-11-20T00:00:00')
Fri Nov 20 2020 00:00:00 GMT-0800 (Pacific Standard Time)
The Vega-Lite docs recommend using UTC timeUnits and scales to work around this, but I tend to find that approach a bit clunky. Instead, I try to always specify dates in Vega-Lite via full ISO 8601 timestamps.
In your case, the best approach is probably to use a calculate transform to regularize your dates, and proceed from there. Modifying the simplified example above, it might look something like this (open in editor):
{
"data": {
"values": [
{"date": "2020-11-17", "value": 5},
{"date": "2020-11-18", "value": 6},
{"date": "2020-11-19", "value": 7},
{"date": "2020-11-20", "value": 8}
]
},
"transform": [
{"calculate": "toDate(datum.date + 'T00:00:00')", "as": "date"}
],
"mark": "bar",
"encoding": {
"x": {"field": "value", "type": "quantitative"},
"y": {
"field": "date",
"timeUnit": "yearmonthdate",
"type": "ordinal"
},
"tooltip": [
{
"field": "date",
"timeUnit": "yearmonthdate",
"type": "temporal"
}
]
}
}
Hi guys, I am currently working with the GSuite Admin SDK Reports API. I am successfully able to send the request and get the response.
Now, the issue is that I am not able to identify the date format returned by Activities.list().
Here is a snippet:
"events": [
{
"type": "event_change",
"name": "create_event",
"parameters": [
{
"name": "event_id",
"value": "jdlvhwrouwovhuwhovvwuvhw"
},
{
"name": "organizer_calendar_id",
"value": "abc#xyz.com"
},
{
"name": "calendar_id",
"value": "abc#xyz.com"
},
{
"name": "target_calendar_id",
"value": "abc#xyz.com"
},
{
"name": "event_title",
"value": "test event 3"
},
{
"name": "start_time",
"intValue": "63689520600"
},
{
"name": "end_time",
"intValue": "63689524200"
},
{
"name": "user_agent",
"value": "Mozilla/5.0"
}
]
}
]
Note: Please have a look at start_time and end_time and let me know if you have any idea about them.
Please share some info, and let me know if any other information is needed.
I ran into this same question when parsing google calendar logs.
The time format they use is the number of seconds since January 1st, 0001 (0001-01-01).
I never found documentation where they referenced that time format. Google uses this instead of epoch for some of their app logs.
You can find an online calculator here https://www.epochconverter.com/seconds-days-since-y0
Use the one under "Seconds Since 0001-01-01 AD" and not the one under year zero.
So your start_time of "63689520600" converts to March 30, 2019 5:30:00 AM GMT.
If you want start_time in epoch time, you can subtract 62135596800 seconds from the number. 62135596800 is the number of seconds from 0001-01-01 to January 1, 1970 12:00:00 AM (the Unix epoch), so subtracting it from start_time gives you the number of seconds since January 1, 1970 12:00:00 AM, aka epoch time.
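For example, a quick conversion in Python (a sketch, assuming the intValue really is seconds since 0001-01-01 as described above):

from datetime import datetime, timezone

# Seconds between 0001-01-01 00:00:00 and the Unix epoch (1970-01-01 00:00:00)
GOOGLE_TO_UNIX_OFFSET = 62135596800

def google_seconds_to_utc(value: str) -> datetime:
    """Convert a seconds-since-0001-01-01 intValue string to a UTC datetime."""
    return datetime.fromtimestamp(int(value) - GOOGLE_TO_UNIX_OFFSET, tz=timezone.utc)

print(google_seconds_to_utc("63689520600"))  # 2019-03-30 05:30:00+00:00
print(google_seconds_to_utc("63689524200"))  # 2019-03-30 06:30:00+00:00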
Hope this helps.
Is it possible to have a Druid datasource with 2 (or multiple) timestamps in it?
I know that Druid is a time-based DB and I have no problem with the concept, but I'd like to add another dimension that I can work with like a timestamp.
e.g. user retention: the metric is of course tied to a certain date, but I also need to create cohorts based on users' date of registration and roll those dates up to weeks or months, or filter to only certain time periods...
If the functionality is not supported, are there any plug-ins? Any dirty solutions?
Although I'd rather wait for full official support for timestamp dimensions in Druid, I've found the 'dirty' hack I was looking for.
DataSource Schema
First things first: I wanted to know how many users logged in each day, while being able to aggregate by date/month/year cohorts.
Here's the data schema I used:
"dataSchema": {
"dataSource": "ds1",
"parser": {
"parseSpec": {
"format": "json",
"timestampSpec": {
"column": "timestamp",
"format": "iso"
},
"dimensionsSpec": {
"dimensions": [
"user_id",
"platform",
"register_time"
],
"dimensionExclusions": [],
"spatialDimensions": []
}
}
},
"metricsSpec": [
{ "type" : "hyperUnique", "name" : "users", "fieldName" : "user_id" }
],
"granularitySpec": {
"type": "uniform",
"segmentGranularity": "HOUR",
"queryGranularity": "DAY",
"intervals": ["2015-01-01/2017-01-01"]
}
},
So the sample data should look something like this (each record is a login event):
{"user_id": 4151948, "platform": "portal", "register_time": "2016-05-29T00:45:36.000Z", "timestamp": "2016-06-29T22:18:11.000Z"}
{"user_id": 2871923, "platform": "portal", "register_time": "2014-05-24T10:28:57.000Z", "timestamp": "2016-06-29T22:18:25.000Z"}
As you can see, the "main" timestamp against which I calculate these metrics is the timestamp field, while register_time is just a dimension stored as a string in ISO 8601 UTC format.
Aggregating
And now for the fun part: I've been able to aggregate by timestamp (date) and register_time (date again) thanks to the time format extraction function.
The query looks like this:
{
"intervals": "2016-01-20/2016-07-01",
"dimensions": [
{
"type": "extraction",
"dimension": "register_time",
"outputName": "reg_date",
"extractionFn": {
"type": "timeFormat",
"format": "YYYY-MM-dd",
"timeZone": "Europe/Bratislava" ,
"locale": "sk-SK"
}
}
],
"granularity": {"timeZone": "Europe/Bratislava", "period": "P1D", "type": "period"},
"aggregations": [{"fieldName": "users", "name": "users", "type": "hyperUnique"}],
"dataSource": "ds1",
"queryType": "groupBy"
}
Filtering
The solution for filtering is based on the JavaScript extraction function, with which I can transform the date to UNIX time and use it inside (for example) a bound filter:
{
"intervals": "2016-01-20/2016-07-01",
"dimensions": [
"platform",
{
"type": "extraction",
"dimension": "register_time",
"outputName": "reg_date",
"extractionFn": {
"type": "javascript",
"function": "function(x) {return Date.parse(x)/1000}"
}
}
],
"granularity": {"timeZone": "Europe/Bratislava", "period": "P1D", "type": "period"},
"aggregations": [{"fieldName": "users", "name": "users", "type": "hyperUnique"}],
"dataSource": "ds1",
"queryType": "groupBy"
"filter": {
"type": "bound",
"dimension": "register_time",
"outputName": "reg_date",
"alphaNumeric": "true"
"extractionFn": {
"type": "javascript",
"function": "function(x) {return Date.parse(x)/1000}"
}
}
}
I've tried to filter it 'directly' with a JavaScript filter, but I haven't been able to convince Druid to return the correct records, although I've double-checked it with various JavaScript REPLs. But hey, I'm no JavaScript expert.
Unfortunately, Druid has only one timestamp column that can be used for rollup; currently Druid treats all the other columns as strings (except metrics, of course), so you can add another string column with timestamp values, but the only thing you can do with it is filtering.
I guess you might be able to hack it that way.
Hopefully in the future Druid will allow different types of columns, and maybe timestamp will be one of them.
Another solution is to add a longMin-type metric for the timestamp and store the epoch time in that field, or to convert the datetime to a number and store it (e.g. 31st March 2021 08:00 becomes 310320210800).
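For example (a sketch only; it assumes the ingested events already carry the registration time as epoch seconds in a numeric input field, here called register_time_epoch, which is a made-up name), the metricsSpec could be extended like this:

"metricsSpec": [
    { "type": "hyperUnique", "name": "users", "fieldName": "user_id" },
    { "type": "longMin", "name": "register_time_epoch", "fieldName": "register_time_epoch" }
]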
As of Druid 0.22, it is stated in the documentation that secondary timestamps should be handled/parsed as dimensions of type long. Secondary timestamps can be parsed to longs at ingestion time with a transformSpec and be transformed back, if needed, at query time (link).
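A rough sketch of what that could look like in the ingestion spec, assuming register_time arrives as an ISO 8601 string as in the schema above (the derived column name register_time_long is made up):

"transformSpec": {
    "transforms": [
        {
            "type": "expression",
            "name": "register_time_long",
            "expression": "timestamp_parse(register_time)"
        }
    ]
},
"dimensionsSpec": {
    "dimensions": [
        "user_id",
        "platform",
        { "type": "long", "name": "register_time_long" }
    ]
}

timestamp_parse returns milliseconds since the epoch, so the resulting long dimension can then be bucketed or filtered with ordinary numeric comparisons.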
I have a pipeline that needs to run daily... but the data only arrives at around 2pm on that day (for the previous day)... so when midnight ticks over, the data isn't available, and therefore everything falls over ;)
I have tried this:
"start": "2016-02-10T15:00:00Z",
"end": "2016-05-31T00:00:00Z",
but it still kicks off at midnight, I assume because my scheduler is as follows:
"scheduler": {
"frequency": "Day",
"interval": 1
},
I think I need to use either anchorDateTime or offset... but I'm not sure which?
The following in the pipeline just means that you want the pipeline enabled during that period:
"start": "2016-02-10T15:00:00Z",
"end": "2016-05-31T00:00:00Z"
The following in the activity definition means that you want the activity to run at the end of the day, i.e. at midnight UTC:
"scheduler": {
"frequency": "Day",
"interval": 1
},
If you want the activity to kick off at 2 PM, use the following:
"scheduler": {
"frequency": "Hour",
"interval": 24,
"anchorDateTime": "2016-02-10T14:00:00"
},
You should make it consistent in the target dataset definition as well:
"availability": {
"frequency": "Hour",
"interval": 24,
"anchorDateTime": "2016-02-10T14:00:00"
}
Hope that helps!!