Azure Data Factory v1 pipeline not starting

I have created an Azure Data Factory, which has the following activity JSON as viewed in the portal (excerpt):
"start": "2018-07-27T00:00:00Z",
"end": "2099-12-30T13:00:00Z",
"isPaused": false,
"runtimeInfo": {
"deploymentTime": "2020-06-08T12:42:21.2801494Z",
"activePeriodSetTime": "2020-06-08T12:23:16.2436361Z",
"pipelineState": "Running",
"activityPeriods": {
"copyXZActivity": {
"start": "2017-06-27T00:00:00Z",
"end": "2099-12-30T13:00:00Z"
}
}
},
"id": "ef896997-2046-4b2e-7074-ecb5f58dd489",
"provisioningState": "Succeeded",
"hubName": "sxdb_hub",
"pipelineMode": "Scheduled"
My AzureSQLTable inputs and outputs have the following JSON config:
"availability": {
"frequency": "Minute",
"interval": 15
},
I would expect it to run immediately, every 15 minutes, but the activity window is empty. The next scheduled run is at 5/3/2020, 4:30 PM UTC according to the activity window, which seems to be an arbitrary date in the past.
How do I get the activity to run, as expected, every 15 minutes?

The problem seems to have been this line, which caused execution to begin on 3 May 2020:
"copyXZActivity": {
"start": "2017-06-27T00:00:00Z",
Changing it to this fixed the issue:
"copyXZActivity": {
"start": "2020-06-10T00:00:00Z",
Changing the start date to the current date got the activity running. It appears that Data Factory v1, when given a start date in the distant past, picks another date a few months back and executes the activity from that day onward, continuously, until it "catches up", after which it follows the interval pattern (although I wasn't able to navigate to that date via the monitoring UI).
It exhibits this "catch up" behavior with any start date/time in the past.
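If you want to generate the new start value programmatically rather than typing it, here is a minimal Python sketch (standard library only; choosing midnight UTC of the current day is just an assumption) that prints a timestamp in the expected format:

from datetime import datetime, timezone

# Midnight UTC of the current day, in the "yyyy-MM-ddTHH:mm:ssZ" form used by
# the pipeline and activityPeriods "start"/"end" properties.
start = datetime.now(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)
print(start.strftime("%Y-%m-%dT%H:%M:%SZ"))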

Opensearch assumes wrong time zone

Opensearch increases the timestamp of the logs by one hour. It must somehow assume that the logs come from the UTC time zone. How do I change this behaviour?
2023-02-02 12:47:27,897 [INFO]: <log> becomes 2023-02-02 13:47:27,897 [INFO]: <log>
From the official documentation:
Internally, dates are converted to UTC (if the time-zone is specified)
and stored as a long number representing milliseconds-since-the-epoch.
Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/date.html
You can't change this behavior, but when you search the data you will see the correct time according to your browser's time zone. Alternatively, you can specify the time zone during the search. For example:
GET _search
{
"query": {
"range": {
"timestamp": {
"gte": "2023-02-02 12:47:27",
"format": "yyyy-MM-dd HH:mm:ss",
"time_zone": "+01:00"
}
}
}
}
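The same range query can also be sent from code. Below is a minimal Python sketch using the requests library; the cluster URL, index name, and credentials are assumptions, so adjust them to your setup:

import requests

# Hypothetical cluster URL and index; replace with your OpenSearch endpoint.
url = "https://localhost:9200/my-logs/_search"
query = {
    "query": {
        "range": {
            "timestamp": {
                "gte": "2023-02-02 12:47:27",
                "format": "yyyy-MM-dd HH:mm:ss",
                "time_zone": "+01:00"
            }
        }
    }
}
# auth and verify are placeholders for a default secured cluster; adjust as needed.
response = requests.post(url, json=query, auth=("admin", "admin"), verify=False)
print(response.json()["hits"]["total"])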

Debezium kafka connect no records produced

I am trying out Kafka and have got the io.debezium.connector.sqlserver.SqlServerConnector registered correctly, as the connector status returns:
{
"name": "test_connector",
"connector": {
"state": "RUNNING",
"worker_id": "kafka-connect:8083"
},
"tasks": [
{
"id": 0,
"state": "RUNNING",
"worker_id": "kafka-connect:8083"
}
],
"type": "source"
}
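For reference, that status comes from the Kafka Connect REST API's /connectors/<name>/status endpoint; a minimal Python sketch of the call, assuming the worker is reachable at kafka-connect:8083:

import requests

# GET /connectors/<name>/status on the Connect worker returns the connector
# and task states shown above.
status = requests.get("http://kafka-connect:8083/connectors/test_connector/status").json()
print(status["connector"]["state"], [task["state"] for task in status["tasks"]])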
When I enable CDC on the table, it appears to work according to the logs:
INFO Snapshot ended with SnapshotResult [status=COMPLETED, offset=SqlServerOffsetContext [sourceInfoSchema=Schema{io.debezium.connector.sqlserver.Source:STRUCT}, sourceInfo=SourceInfo [serverName=TestServer, changeLsn=NULL, commitLsn=0000006d:00000a46:0003, eventSerialNo=null, snapshot=FALSE, sourceTime=2022-02-28T04:51:31.813Z], snapshotCompleted=true, eventSerialNo=1]] (io.debezium.pipeline.ChangeEventSourceCoordinator)
However, edits to the table do not seem to be picked up, yet I can confirm the edits are making it into the change control table.
INFO WorkerSourceTask{id=test_connector-0} Either no records were produced by the task since the last offset commit, or every record has been filtered out by a transformation or dropped due to transformation or conversion errors. (org.apache.kafka.connect.runtime.WorkerSourceTask)
Once I confirm this, my next step is to see if I can stream the changes using KSQL.
Any suggestions?

Not able to get logs related to azure data factory mapping data flows from log analytics

We are working on implementing a custom logging solution. Most of the information we need is already present in Log Analytics via the Data Factory Analytics solution, but getting log info on data flows is a challenge. When querying, we get this error in the output: "Too large to parse".
Since data flows are a complex and critical piece of a pipeline, we urgently need data such as rows copied, skipped, and read for each activity within a data flow. Can you please help with how to get that information?
You can get the same information shown in the ADF portal UI by making a POST request to the REST endpoint below. You can find more information and read about authentication at the following link: https://learn.microsoft.com/en-us/rest/api/datafactory/pipelineruns/querybyfactory
You can choose to query by factory or for a specific pipeline run id depending on your needs.
https://management.azure.com/subscriptions/<subscription id>/resourcegroups/<resource group name>/providers/Microsoft.DataFactory/factories/<ADF resource Name>/pipelineruns/<pipeline run id>/queryactivityruns?api-version=2018-06-01
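As a rough illustration, here is a Python sketch of that call using the requests library. The bearer token and the placeholder names are values you must supply, and the lastUpdatedAfter/lastUpdatedBefore window in the body is an assumption you should adjust to your run:

import requests
from datetime import datetime, timedelta, timezone

# Placeholders -- fill in your own values.
subscription = "<subscription id>"
resource_group = "<resource group name>"
factory = "<ADF resource Name>"
run_id = "<pipeline run id>"
token = "<AAD bearer token>"

url = (
    f"https://management.azure.com/subscriptions/{subscription}"
    f"/resourcegroups/{resource_group}/providers/Microsoft.DataFactory"
    f"/factories/{factory}/pipelineruns/{run_id}"
    "/queryactivityruns?api-version=2018-06-01"
)

# The endpoint expects a lastUpdatedAfter/lastUpdatedBefore window in the body;
# a 24-hour window is used here as an example.
now = datetime.now(timezone.utc)
body = {
    "lastUpdatedAfter": (now - timedelta(days=1)).isoformat(),
    "lastUpdatedBefore": now.isoformat(),
}

resp = requests.post(url, json=body, headers={"Authorization": f"Bearer {token}"})
for activity_run in resp.json().get("value", []):
    # Data flow activity runs carry their per-stage metrics in the "output" field.
    print(activity_run.get("activityName"), activity_run.get("output"))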
Below is an example of the data you can get from one stage:
{
"stage": 7,
"partitionTimes": [
950
],
"lastUpdateTime": "2020-07-28 18:24:55.604",
"bytesWritten": 0,
"bytesRead": 544785954,
"streams": {
"CleanData": {
"type": "select",
"count": 241231,
"partitionCounts": [
950
],
"cached": false
},
"ProductData": {
"type": "source",
"count": 241231,
"partitionCounts": [
950
],
"cached": false
}
},
"target": "MergeWithDeltaLakeTable",
"time": 67589,
"progressState": "Completed"
}

Working on historical data (incrementing properties)

I'm trying to write a CEP rule that would take all the existing ACTIVE alarms and increase a specific fragment, bike_alarm.priority, by 1 every minute. This is the whole structure of the alarm:
{
"count": 1,
"creationTime": "2018-07-09T15:30:20.790+02:00",
"time": "2014-03-03T12:03:27.845Z",
"history": {
"auditRecords": [],
"self": "https://cumulocity.com/audit/auditRecords"
},
"id": "1024",
"self": "https://cumulocity.com/alarm/alarms/1024",
"severity": "WARNING",
"source": {
"id": "22022",
"name": "01BIKE_STATION_NORTH",
"self": "https://cumulocity.com/inventory/managedObjects/22022"
},
"status": "ACTIVE",
"text": "Bike disconnected",
"type": "bike_error_01",
"bike_alarm" {
"priority": 10
}
}
This is what I managed to create (based mainly on this question):
create schema Alarm as Alarm;
create schema CollectedAlarms(
alarms List
);
create schema SingleAlarm(
alarm Alarm
);
#Name("Collecting alarms")
insert into CollectedAlarms
select findAllAlarmByTimeBetween(
current_timestamp().minus(100 day).toDate(),
current_timestamp().toDate()
) as alarms
from pattern[every timer:interval(30 sec)];
#Name("Splitting alarms")
insert into SingleAlarm
select
singleAlarm as alarm
from
CollectedAlarms as alarms unidirectional,
CollectedAlarms[alarms#type(Alarm)] as singleAlarm;
#Name("Rising priority")
insert into UpdateAlarm
select
sa.alarm.id as id,
{
"bike_alarm.priority", GetNumber(sa.alarm, "bike_alarm.priority". 0) + 1
} as fragments
from pattern [every sa = SingleAlarm -> (timer:interval(1 minutes))];
The problem is that not all alarms are found, and even for those that are, the increment doesn't work: priority is set to null.
Additionally, could you point me in the direction of some better documentation? Is this something you use?
In general, the Esper documentation that you linked is the best place to look for the generic syntax.
In combination, you will probably sometimes also need the Cumulocity documentation for the platform-specific parts: http://cumulocity.com/guides/event-language/introduction
Coming to your problems:
You are misusing a real-time processing engine for cron-like batch operations. While it technically can be done, it might not be the best approach, so I will also show you a different one.
But first, addressing your approach:
Database queries like findAllAlarmByTimeBetween() will only return up to 2000 results, and there is no way to get the next page as there is on the REST API of Cumulocity. Also, since you only want to handle active alarms, you should use a function that also filters by status.
Getting null out of a function like getNumber() means either the JsonPath wasn't found or the data type is incorrect (e.g. using getNumber for a String). You can set a default value for that case as the third parameter. From the JSON that you provided, it looks correct, though. I assume the syntax errors in your code are copy-paste errors, as otherwise you wouldn't have been able to deploy it.
In my opinion, you should approach this differently:
On each incoming alarm, raise the priority after one minute if it hasn't been cleared, then trigger the one-minute timer again, like a loop, until the alarm is cleared.
The pattern for this would look like this:
from pattern [
every e = AlarmCreated(alarm.status = CumulocityAlarmStatuses.ACTIVE)
-> (timer:interval(1 minutes)
and not AlarmUpdated(alarm.status != CumulocityAlarmStatuses.ACTIVE, alarm.id.value = e.alarm.id.value))
];
You need one statement with AlarmCreated, which covers only the initial increase, and a second statement that triggers on your own schema and is then run in a loop.
In general, try to avoid as many database calls as you can. Keep the loop counter in your schema, so the only call you need to execute is the update of the alarm.

Facebook marketing API insights doesn't return data for all days within the range

I am trying to use the Insights API to get daily performance stats for my ad account. To achieve this, I am setting time_increment=1. I expect to receive stats for every day within the specified interval, even if there is no data for those days (in which case the counts would be zero). Instead, I get stats for only one day, presumably the one day my test account was active.
Here is the call:
curl "https://graph.facebook.com/v2.7/act_[redacted]/insights?
fields=clicks,impressions,cpc,ctr&access_token=
[redacted]&time_range%5Bsince%5D=2016-09-01
&time_range%5Buntil%5D=2016-09-09&time_increment=1"
Output:
{
"data": [
{
"clicks": "3",
"impressions": "89",
"cpc": 0.29333333333333,
"ctr": 3.3707865168539,
"date_start": "2016-09-06",
"date_stop": "2016-09-06"
}
],
"paging": {
"cursors": {
"before": "MAZDZD",
"after": "MAZDZD"
}
}
}
Notice that my time range is set to 2016-09-01 to 2016-09-09, but I got data only for 2016-09-06.
The question is: is it possible to get a response similar to the one above, but with an entry for every day within the time range, even if some of the entries have no data (i.e. counts are zero or null)?
We reached out to Facebook, and they said that, indeed, for time periods where there is no data, no insights objects will be returned. It is up to the caller to fill in the blanks if they want to build a time series dataset.
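As a rough sketch of that "fill in the blanks" step, the following Python zero-fills the missing days across the requested range. The sample row mirrors the response above; the zero/None placeholders are just one assumption about how to represent empty days:

from datetime import date, timedelta

# The "data" array as returned by the Insights API (only 2016-09-06 present here).
insights = [
    {"clicks": "3", "impressions": "89", "cpc": 0.29333333333333,
     "ctr": 3.3707865168539, "date_start": "2016-09-06", "date_stop": "2016-09-06"},
]

since, until = date(2016, 9, 1), date(2016, 9, 9)
by_day = {row["date_start"]: row for row in insights}

filled = []
day = since
while day <= until:
    key = day.isoformat()
    # Use the real row if present, otherwise a zeroed placeholder for that day.
    filled.append(by_day.get(key, {
        "clicks": "0", "impressions": "0", "cpc": None, "ctr": None,
        "date_start": key, "date_stop": key,
    }))
    day += timedelta(days=1)

for row in filled:
    print(row["date_start"], row["clicks"], row["impressions"])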