How to execute a Google Data Fusion pipeline from event-based triggers (CDAP) - google-cloud-data-fusion

Is there any way to run a Google Data Fusion pipeline from CDAP event-based triggers?
The 1st requirement is that whenever a new file arrives in a GCS bucket, it should trigger the Data Fusion pipeline to run automatically.
The 2nd requirement is pipeline dependency; for example, Pipeline B cannot run if Pipeline A has not started or has failed.
Thanks

Reviewing your use case, I assume that for the 2nd requirement you might consider looking at native CDAP components: Schedules, Workflows, and Triggers.
Generally, to design a run flow with conditional execution between pipelines, you create a Schedule object that defines the specific Workflow holding the logical combination of conditions between pipelines, and you attach the Trigger model that matches your event of interest.
According to the CDAP documentation:
Workflows can be controlled by the CDAP CLI and the Lifecycle HTTP RESTful API.
With that in mind, you need to compose an appropriate HTTP request to the CDAP REST API containing a JSON object that describes the schedule to be created. Based on the example from the documentation, and for further reference, I've created a schedule in which Pipeline_2 triggers only when Pipeline_1 succeeds (a request sketch follows the JSON below):
{
  "name": "Schedule_1",
  "description": "Triggers Pipeline_2 on the succeeding execution of Pipeline_1",
  "namespace": "<Pipeline_2-namespace>",
  "application": "Pipeline_2",
  "version": "<application version of the Pipeline_2>",
  "program": {
    "programName": "Workflow_name",
    "programType": "WORKFLOW"
  },
  "trigger": {
    "type": "PROGRAM_STATUS",
    "programId": {
      "namespace": "<Pipeline_1-namespace>",
      "application": "Pipeline_1",
      "version": "<application version of the Pipeline_1>",
      "type": "WORKFLOW",
      "entity": "PROGRAM",
      "program": "Workflow_name"
    },
    "programStatuses": ["COMPLETED"]
  }
}
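To actually register this schedule, you submit it to the CDAP Lifecycle REST API. Below is a minimal Python sketch under assumptions: CDAP_ENDPOINT stands for your Data Fusion instance's API endpoint, ACCESS_TOKEN for a valid OAuth2 token, and the pipeline lives in the default namespace; adjust the placeholders for your instance.

# Minimal sketch: register the schedule above via the CDAP REST API.
# CDAP_ENDPOINT, ACCESS_TOKEN and the namespace/app names are placeholders.
import json
import requests

CDAP_ENDPOINT = "https://<data-fusion-instance-api-endpoint>"
ACCESS_TOKEN = "<oauth2-access-token>"

with open("schedule_1.json") as f:   # the JSON object shown above
    schedule = json.load(f)

url = f"{CDAP_ENDPOINT}/v3/namespaces/default/apps/Pipeline_2/schedules/Schedule_1"
resp = requests.put(
    url,
    headers={
        "Authorization": f"Bearer {ACCESS_TOKEN}",
        "Content-Type": "application/json",
    },
    data=json.dumps(schedule),
)
resp.raise_for_status()

Depending on the CDAP version, the newly created schedule may also need to be enabled before it starts firing.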
For the 1st requirement, I'm not sure it is feasible to achieve with native Data Fusion/CDAP instruments alone, since I don't see a trigger event that matches continuous discovery of new files in a GCS bucket:
Triggers are fired by events such as creation of a new partition in a dataset, or fulfillment of a cron expression of a time trigger, or the status of a program.
In such a case I would look at GCP Cloud Functions and Cloud Composer; there is a nicely written example that shows how to use Cloud Functions for event-based DAG triggers, and within the Composer DAG file you can invoke sequential Data Fusion pipeline executions. Check out this Stack thread for more details.
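Alternatively, if a full Composer DAG feels heavyweight, a single Cloud Function subscribed to the bucket's object-finalize event can call the CDAP REST API directly to start the pipeline. The sketch below is an assumption-laden illustration: the endpoint, pipeline name, and the "input.path" runtime argument are placeholders, and it assumes the function's service account is permitted to call the Data Fusion instance.

# Hedged sketch of a background Cloud Function (Python) triggered by
# google.storage.object.finalize that starts a Data Fusion batch pipeline.
# CDAP_ENDPOINT, PIPELINE_NAME and the "input.path" runtime argument are
# illustrative placeholders, not fixed names.
import google.auth
import google.auth.transport.requests
import requests

CDAP_ENDPOINT = "https://<data-fusion-instance-api-endpoint>"
PIPELINE_NAME = "Pipeline_1"

def on_file_arrival(event, context):
    # Obtain an access token from the function's service account.
    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"])
    credentials.refresh(google.auth.transport.requests.Request())

    # Batch pipelines are exposed as the DataPipelineWorkflow program.
    url = (f"{CDAP_ENDPOINT}/v3/namespaces/default/apps/"
           f"{PIPELINE_NAME}/workflows/DataPipelineWorkflow/start")

    # Pass the new object as a runtime argument the pipeline can read via a macro.
    runtime_args = {"input.path": f"gs://{event['bucket']}/{event['name']}"}

    resp = requests.post(
        url,
        headers={"Authorization": f"Bearer {credentials.token}"},
        json=runtime_args,
    )
    resp.raise_for_status()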

Related

How to find which activity called another activity in my ADF Pipeline

I have created a pipeline (LogPipeline) that logs other pipelines' status to a database. The idea is that every pipeline will call the LogPipeline at the start and at the end by passing pipeline name and pipeline ID along with other parameters like started/ended/failed.
The last parameter is "Reason" where I want to capture the error message of why a pipeline may have failed.
However, in a given pipeline there are multiple activities that could fail. So I want to direct any and all failed activities to my Execute Pipeline activity and pass the error message.
But on the Execute Pipeline activity, when filling out the parameters, I can only reference an activity by its name, e.g. Reason = @activity('Caller Activity').Error.Message.
However, since multiple activities are calling this Execute Pipeline activity, is there a way to say
Reason = @activity(activityThatCalledExecutePipeline).Error.Message?
If my understanding is correct, there are multiple activities calling the LogPipeline and you want to know, inside LogPipeline, which of those activities failed. To my knowledge, this requirement is not supported in ADF.
I'm also not sure why you need to construct such a complex scenario when you just want to log the failing activities and their error messages, which is a common requirement. ADF supports many monitoring options; please see the links below:
1. https://learn.microsoft.com/en-us/azure/data-factory/monitor-using-azure-monitor#alerts
2. https://learn.microsoft.com/en-us/azure/data-factory/monitor-programmatically
I would suggest getting familiar with Alerts & Metrics in the ADF portal, where you can create an alert rule and set its Target Criteria (for example, failed pipeline or activity runs).
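If you go the programmatic route from link 2, something along these lines could pull the failed activities and their error messages for a given pipeline run. This is a rough sketch using the azure-mgmt-datafactory Python SDK; the subscription, resource group, factory name, and run id are placeholders.

# Rough sketch: list failed activities (and their errors) for one pipeline run
# via the ADF management SDK. All resource names below are placeholders.
from datetime import datetime, timedelta

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

filters = RunFilterParameters(
    last_updated_after=datetime.utcnow() - timedelta(days=1),
    last_updated_before=datetime.utcnow(),
)
activity_runs = client.activity_runs.query_by_pipeline_run(
    "<resource-group>", "<factory-name>", "<pipeline-run-id>", filters)

for run in activity_runs.value:
    if run.status == "Failed":
        print(run.activity_name, run.error)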

Triggering Kusto commands using 'ADX Command' activity in ADFv2 vs calling WebAPI on it

In ADFv2 (Azure Data Factory V2), if we need to trigger a command on an ADX (Azure Data Explorer) cluster, we have two choices:
Use the 'Azure Data Explorer Command' activity
Use the POST method provided in the 'WebActivity' activity
Having figured out that both methods work, I would say that from a development/maintenance point of view the first method seems more slick and systematic, especially because it is an out-of-the-box feature to support Kusto in ADFv2. Is there any scenario where the Web Activity method would be preferable or more performant? I am trying to figure out whether it's alright to simply use the ADX Command activity all the time to run any Kusto command from ADFv2, instead of ever using the Web Activity.
It is indeed recommended to use the "Azure Data Explorer Command" activity:
That activity is more convenient, as you don't have to construct the HTTP request yourself.
The activity takes care of a few things for you, such as:
In case you are running an async command, it will poll the Operations table until your async command is completed.
Logging.
Error handling.
In addition, you should take into consideration that the result format will be different between both cases, and that each activity has its own limits in terms of response size and timeout.
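For comparison, this is roughly the request the Web Activity would have to compose by hand against the ADX management endpoint (a hedged sketch; the cluster URL, AAD token, database, and command are placeholders). Note that for an async command you would then also have to poll the returned operation yourself, which the ADX Command activity does for you.

# Sketch of the raw management-API call a Web Activity would have to issue.
# CLUSTER, TOKEN, database and command are placeholders.
import requests

CLUSTER = "https://<cluster>.<region>.kusto.windows.net"
TOKEN = "<aad-access-token-for-the-cluster>"

resp = requests.post(
    f"{CLUSTER}/v1/rest/mgmt",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/json",
    },
    json={"db": "<database>", "csl": ".show operations"},
)
resp.raise_for_status()
print(resp.json())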

Get the list of skipped tasks on Azure DevOps Pipeline

Is there a way by which we can get the list of skipped tasks from the build?
For example, I have 2 tasks that run conditionally based on external factors. How can I tell from the Azure DevOps REST API whether those tasks were skipped or actually ran?
I need to trigger another build conditionally based on the above factor.
Any help will be appreciated!
You should look into the Build Timeline REST API. If you issue the following GET request:
GET https://dev.azure.com/{organization}/{project}/_apis/build/builds/{buildId}/timeline
where buildId is the ID of the build you're examining, it returns the Timeline object. That is essentially a collection of TimelineRecord objects, each representing an entry in the build's timeline. You should filter this collection down to the records where "type": "Task" and "result": "skipped".
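As an illustration, here is a small Python sketch that pulls the timeline and filters it this way; the organization, project, build id, PAT, and api-version are placeholders that may need adjusting.

# Sketch: list skipped tasks from a build's timeline. Values in angle
# brackets are placeholders; authentication uses a PAT via basic auth.
import requests

ORG, PROJECT, BUILD_ID = "<organization>", "<project>", "<buildId>"
PAT = "<personal-access-token>"

url = (f"https://dev.azure.com/{ORG}/{PROJECT}/_apis/build/builds/"
       f"{BUILD_ID}/timeline?api-version=6.0")
timeline = requests.get(url, auth=("", PAT)).json()

skipped_tasks = [
    record["name"]
    for record in timeline["records"]
    if record["type"] == "Task" and record.get("result") == "skipped"
]
print(skipped_tasks)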

Using DevOps Release REST API to determine time for release to complete

I'm using the DevOps REST API here:
https://learn.microsoft.com/en-us/rest/api/azure/devops/release/releases/list?view=azure-devops-rest-5.0
I have a specific release pipeline that I want to monitor for performance. I'd like to be able to query the pipeline to determine how long it has been taking to complete over the last n runs; then I can take that data and use it to determine whether there has been any degradation in performance over time.
Is it possible to determine this info using the existing APIs? The API above only seems to talk about release start time; from playing around with the various options I haven't been able to get the completion time from it.
It's not very easy to find but the following link returns the data you need:
https://vsrm.dev.azure.com/Utopia-Demo/Utopia/_apis/release/releases/1
"releaseDeployPhases": [
{
...
"deploymentJobs": [
{
"job": {
...
"dateStarted": "2019-01-23T14:40:59.603Z",
"dateEnded": "2019-01-23T14:42:49.863Z",
"startTime": "2019-01-23T14:40:59.603Z",
"finishTime": "2019-01-23T14:42:49.863Z",
...
},
Each job has a start and end time that you can use to calculate the job duration.
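Putting that together, here is a rough sketch that walks the latest releases of one definition and prints per-job durations. The organization, project, PAT, definition id, and the exact nesting of environments/deploySteps are assumptions and may need adjusting for your API version.

# Rough sketch: compute deployment-job durations for the latest releases of a
# release definition. Names in angle brackets are placeholders, and the nesting
# of environments -> deploySteps -> releaseDeployPhases may vary by API version.
from datetime import datetime

import requests

ORG, PROJECT = "<organization>", "<project>"
PAT = "<personal-access-token>"
DEFINITION_ID = "<release-definition-id>"

BASE = f"https://vsrm.dev.azure.com/{ORG}/{PROJECT}/_apis/release"

def parse(ts):
    # Timestamps look like "2019-01-23T14:40:59.603Z"; keep millisecond precision.
    return datetime.strptime(ts[:23], "%Y-%m-%dT%H:%M:%S.%f")

releases = requests.get(
    f"{BASE}/releases?definitionId={DEFINITION_ID}&$top=10&api-version=5.0",
    auth=("", PAT),
).json()["value"]

for rel in releases:
    detail = requests.get(
        f"{BASE}/releases/{rel['id']}?api-version=5.0", auth=("", PAT)).json()
    for env in detail.get("environments", []):
        for step in env.get("deploySteps", []):
            for phase in step.get("releaseDeployPhases", []):
                for deployment_job in phase.get("deploymentJobs", []):
                    job = deployment_job.get("job", {})
                    if job.get("startTime") and job.get("finishTime"):
                        duration = parse(job["finishTime"]) - parse(job["startTime"])
                        print(rel["name"], env.get("name"), duration)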

CQRS and Event Sourcing coupled with Relational Database Design

Let me start by saying, I do not have real world experience with CQRS and that is the basis for this question.
Background:
I am building a system where a new key requirement is allowing admins to "playback" user actions (admins want to be able to step through every action that has happened in the system up to any particular point). The caveats are that the company already has reports generated off their current SQL db which they will not change (at least not in parallel with this new requirement), so the store of record will be SQL. I do not have access to SQL's Change Data Capture, and creating a bunch of history tables with triggers would be incredibly difficult to maintain, so I'd like to avoid that if at all possible. Lastly, there are potentially (though not currently) a lot of data entry points that go through a versioning lifecycle resulting in changes to the SQL db (adding/removing fields), so if I tried to implement change tracking in SQL, I'd have to maintain the tables that handled the older versions of the data (a nightmare).
Potential Solution
I am thinking about using NoSQL (Azure DocumentDB) to handle data storage (writes) and then have command handlers handle updating the current SQL (Azure SQL) with the relevant data to be queried (reads). That way the audit trail is created and that idea of "playing back" can be handled while also not disturbing the current back end functionality that is provided.
This approach would handle the requirement and satisfy the caveats. I wouldn't use CQRS for the entire app, just for the pieces that need this "playback" functionality. I know that I would have to mitigate failure points along the Client -> Write to DocumentDB -> Respond to user with success/fail -> Write to SQL on successful DocumentDB write path, but my novice CQRS eyes can't see a reason why this isn't a great way to handle this.
Any advice would be greatly appreciated.
This article explains the CQRS pattern and provides an example of a CQRS implementation; please refer to it.
I am thinking about using NoSQL (Azure DocumentDB) to handle data storage (writes) and then have command handlers handle updating the current SQL (Azure SQL) with the relevant data to be queried (reads).
Here is my suggestion: when a user performs a write operation to update a record, we could always do an insert before an admin has audited the user's operation. For example, if a user wants to update a record, instead of updating the record directly we insert a new entity with the updated data and a property indicating whether the operation has been audited by an admin.
Original data in document
{
  "version1_data": {
    "data": {
      "id": "1",
      "name": "jack",
      "age": 28
    },
    "isaudit": true
  }
}
To update the age field, we insert a new version entity with the updated information instead of updating the original data directly.
{
  "version1_data": {
    "data": {
      "id": "1",
      "name": "jack",
      "age": 28
    },
    "isaudit": true
  },
  "version2_data": {
    "data": {
      "id": "1",
      "name": "jack",
      "age": 29
    },
    "isaudit": false
  }
}
An admin could then check the current document to audit the user's operations and decide whether the updates should be written to the SQL database.
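For illustration, here is a hedged sketch of the append-a-version idea using the azure-cosmos Python SDK; the database/container names, partition key, and the version-numbering scheme are all assumptions made up for this example.

# Hedged sketch: append a new, unaudited version to the document instead of
# updating it in place. Account URI/key, names and the versioning scheme are
# illustrative assumptions.
from azure.cosmos import CosmosClient

client = CosmosClient("<account-uri>", credential="<account-key>")
container = client.get_database_client("audit").get_container_client("records")

def append_version(doc_id, partition_key, new_data):
    doc = container.read_item(item=doc_id, partition_key=partition_key)
    # Next version key, e.g. "version2_data" if only "version1_data" exists.
    next_key = f"version{sum(k.startswith('version') for k in doc) + 1}_data"
    doc[next_key] = {"data": new_data, "isaudit": False}  # not yet admin-audited
    container.replace_item(item=doc, body=doc)

# Example: record the age change from 28 to 29 as a new version.
append_version("1", "1", {"id": "1", "name": "jack", "age": 29})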
One potential way to think about this is creating a transaction object that has a unique id and represents the work that needs to be done. The transaction in this case would be "write an object to DocumentDB" or "write an object to the SQL db". It could contain the in-memory object to be written and the destination db (DocumentDB, SQL, etc.) connection parameters.
Once you define your transaction, you need to adjust your workflow for proper CQRS. Instead of the client writing to DocumentDB directly and waiting on the result of that call, let the client create a transaction with a unique id (which could be something like date-time tick counts or an incremental transaction id), and write this transaction to a message queue such as Azure Queue Storage or Service Bus. Once the transaction is written to the queue, return success to the user at that point. Create worker roles that read the transaction messages from this queue and process them, writing the objects to DocumentDB. That is, you are not overwriting the same entity in DocumentDB; you are writing each transaction, with its unique incremental id, to DocumentDB for that particular entity. You could also use Azure Table Storage for that, as far as I know.
After successfully writing the transaction to DocumentDB, the same worker role could write the transaction to a different message queue, processed by its own set of worker roles that update the entity in the SQL db. If anything goes wrong in the interim, keep an error table and record failures in it so they can be queried and retried later.
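A minimal sketch of this "transaction on a queue" flow using Azure Storage Queues follows; the queue names, connection string, and transaction shape are illustrative assumptions rather than a prescribed schema.

# Hedged sketch: the client records a transaction on a queue and returns
# immediately; worker roles consume it downstream. Names and the connection
# string are placeholders.
import json
import uuid
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

from azure.storage.queue import QueueClient

@dataclass
class Transaction:
    transaction_id: str
    entity_id: str
    destination: str   # e.g. "documentdb" or "sql"
    payload: dict

def enqueue_write(entity_id: str, payload: dict) -> str:
    """Client side: enqueue the intent, then report success to the user."""
    tx = Transaction(
        transaction_id=f"{datetime.now(timezone.utc):%Y%m%d%H%M%S%f}-{uuid.uuid4()}",
        entity_id=entity_id,
        destination="documentdb",
        payload=payload,
    )
    queue = QueueClient.from_connection_string("<connection-string>", "doc-writes")
    queue.send_message(json.dumps(asdict(tx)))
    return tx.transaction_id

# A worker role would receive from "doc-writes", append the transaction to
# DocumentDB, then forward it to a second queue (e.g. "sql-writes") whose
# workers update SQL; failures land in an error table for later retry.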