Azure Data Factory dependencies - azure-data-factory

I have two activities in my Azure Data Factory.
Activity A1 = a stored procedure on a SQL DB. Input = none, output = DB (output1). The stored procedure targets the output dataset.
Activity A2 = an Azure Copy activity ("type": "Copy") which copies from blob to the same SQL DB. Input = blob, output = DB (output2).
I need to run activity A1 before A2, and I can't for the life of me figure out what dependencies to put between them.
I tried to mark A2 as having two inputs: the blob plus the DB (output1). If I do this, the Copy activity doesn't throw an error, but it does NOT copy the blob to the DB (I think it silently uses the DB as the copy source instead of the blob, and so does nothing).
If I remove the DB input (output1) on A2, it successfully copies the blob to the DB, but I no longer have the dependency chain that makes A1 run before A2.
Thanks!

I figured this out. I was able to keep two inputs on A2, but I just needed to make sure of the ordering of the two inputs. Weird. It looks like the Copy activity only acts on the FIRST input, so when I moved the blob to be the first input it worked! :) (Earlier I had the DB output1 as the first input and it silently did nothing.)
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "MyBlobInput"
},
{
"name": "MyDBOutput1"
}
],
"outputs": [
{
"name": "MyDBOutput2"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 3,
"retry": 3
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "AzureBlobtoSQL",
"description": "Copy Activity"
}
],
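For completeness, here is a hedged sketch of what A1 (the stored procedure activity) might look like alongside the Copy activity in the same "activities" array, so that its output dataset MyDBOutput1 is the dependency A2 consumes. The stored procedure name and activity name below are placeholders, not from the original pipeline:
{
    "type": "SqlServerStoredProcedure",
    "typeProperties": {
        "storedProcedureName": "usp_PrepareTables"
    },
    "outputs": [
        {
            "name": "MyDBOutput1"
        }
    ],
    "scheduler": {
        "frequency": "Day",
        "interval": 1
    },
    "name": "RunStoredProcA1",
    "description": "Stored Procedure Activity (A1)"
}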

Related

ADF - Loop through a large JSON file in a dataflow

We currently receive some metadata information from a third party supplier in the form of a JSON file.
The JSON file contains definitions of some tables which need to be loaded into SQL via ADF.
The JSON file looks like this; it's a list of tables and their data types:
"Tables": [
{
"name": "account",
"description": "account",
"$type": "LocalEntity",
"attributes": [
{
"dataType": "guid",
"maxLength": "-1",
"name": "Id"
},
{
"dataType": "string",
"maxLength": "250",
"name": "name"
}
]
},
{
"name": "customer",
"description": "account",
"$type": "LocalEntity",
"attributes": [
{
"dataType": "guid",
"maxLength": "-1",
"name": "Id"
},
{
"dataType": "string",
"maxLength": "100",
"name": "name"
}
]
}
]
What we need to do is loop through this JSON and, via an ADF data flow, create the required tables in the destination database.
We initially designed the pipeline with a Lookup activity that loads the JSON file and then passes the output to a ForEach loop. This worked really well when we had only a small JSON file, but once we started using real data the JSON file went over the Lookup activity's 4 MB limit, resulting in the Lookup activity throwing an error.
We then tried using a mapping data flow: load the JSON as a source, set the sink to cache, and write that to an output variable which we then loop through. Again this works with smaller datasets, but as soon as the dataset is large enough it can't parse it into an output.
I am sure this should be easy to do but just can't get my head around it!
Here is a sample procedure to loop through a large JSON file in a data flow.
Create a linked service and a dataset pointing to the JSON file path.
Use that dataset as the source in the data flow.
Add a Flatten transformation; it takes the input columns from the source, with the array to expand selected under the Unroll by option.
Create a linked service and dataset for the sink path.
Attach the data flow to a Data Flow activity in the pipeline (see the sketch below).
The result lands in the SQL DB as expected.
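As a rough sketch of that last step (the data flow and activity names here are hypothetical, not from the original pipeline), the Data Flow activity in the pipeline JSON could look something like this:
{
    "name": "LoadTableDefinitions",
    "type": "ExecuteDataFlow",
    "typeProperties": {
        "dataFlow": {
            "referenceName": "FlattenTablesDataFlow",
            "type": "DataFlowReference"
        }
    }
}
The flattening itself (the Unroll by choice on the attributes array) is configured inside the referenced data flow, not in this activity.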

How to fix Data Lake Analytics script

I would like to use Azure Data Factory with an Azure Data Lake Analytics activity, but without success.
This is my pipeline script:
{
    "name": "UsageStatistivsPipeline",
    "properties": {
        "description": "Standardize JSON data into CSV, with friendly column names & consistent output for all event types. Creates one output (standardized) file per day.",
        "activities": [{
            "name": "UsageStatisticsActivity",
            "type": "DataLakeAnalyticsU-SQL",
            "linkedServiceName": {
                "referenceName": "DataLakeAnalytics",
                "type": "LinkedServiceReference"
            },
            "typeProperties": {
                "scriptLinkedService": {
                    "referenceName": "BlobStorage",
                    "type": "LinkedServiceReference"
                },
                "scriptPath": "adla-scripts/usage-statistics-adla-script.json",
                "degreeOfParallelism": 30,
                "priority": 100,
                "parameters": {
                    "sourcefile": "wasb://nameofblob.blob.core.windows.net/$$Text.Format('{0:yyyy}/{0:MM}/{0:dd}/0_647de4764587459ea9e0ce6a73e9ace7_2.json', SliceStart)",
                    "destinationfile": "$$Text.Format('wasb://nameofblob.blob.core.windows.net/{0:yyyy}/{0:MM}/{0:dd}/DailyResult.csv', SliceStart)"
                }
            },
            "inputs": [{
                "type": "DatasetReference",
                "referenceName": "DirectionsData"
            }],
            "outputs": [{
                "type": "DatasetReference",
                "referenceName": "OutputData"
            }],
            "policy": {
                "timeout": "06:00:00",
                "concurrency": 10,
                "executionPriorityOrder": "NewestFirst"
            }
        }],
        "start": "2018-01-08T00:00:00Z",
        "end": "2017-01-09T00:00:00Z",
        "isPaused": false,
        "pipelineMode": "Scheduled"
    }
}
I have two parameter variables, sourcefile and destinationfile, which are dynamic (the path is built from the date).
Then I have this ADLA script for execution.
REFERENCE ASSEMBLY master.[Newtonsoft.Json];
REFERENCE ASSEMBLY master.[Microsoft.Analytics.Samples.Formats];

USING Microsoft.Analytics.Samples.Formats.Json;

@Data =
    EXTRACT jsonstring string
    FROM @sourcefile
    USING Extractors.Tsv(quoting:false);

@CreateJSONTuple =
    SELECT JsonFunctions.JsonTuple(jsonstring) AS EventData
    FROM @Data;

@records =
    SELECT JsonFunctions.JsonTuple(EventData["records"], "[*].*") AS record
    FROM @CreateJSONTuple;

@properties =
    SELECT JsonFunctions.JsonTuple(record["[0].properties"]) AS prop,
           record["[0].time"] AS time
    FROM @records;

@result =
    SELECT ...
    FROM @properties;

OUTPUT @result
TO @destinationfile
USING Outputters.Csv(outputHeader:false, quoting:true);
Job execution fails with an error.
EDIT:
It seems that Text.Format is not evaluated and is instead passed into the script as a literal string. The Data Lake Analytics job detail then shows this:
DECLARE @sourcefile string = "$$Text.Format('wasb://nameofblob.blob.core.windows.net/{0:yyyy}/{0:MM}/{0:dd}/0_647de4764587459ea9e0ce6a73e9ace7_2.json', SliceStart)";
In your code sample, the sourcefile parameter is not defined the same way as destinationfile. The latter appears to be correct while the former does not.
The whole string should be wrapped inside $$Text.Format() for both:
"paramName" : "$$Text.Format('...{0:pattern}...', param)"
Also consider passing only the formatted date like so:
"sliceStart": "$$Text.Format('{0:yyyy-MM-dd}', SliceStart)"
and then doing the rest in U-SQL:
DECLARE @sliceStartDate DateTime = DateTime.Parse(@sliceStart);
DECLARE @path string = String.Format("wasb://path/to/file/{0:yyyy}/{0:MM}/{0:dd}/file.csv", @sliceStartDate);
Hope this helps

Calling Azure SQL DW Stored Procedure from Azure Data Factory

I am getting the following error when trying to run a stored procedure in an Azure SQL Data Warehouse.
Activity 'SprocActivitySample' contains an invalid Dataset reference 'Destination-SQLDW-nna'. This dataset points to the Azure SQL DW, and the stored procedure lives in that database.
Here is the entire code.
{
    "name": "SprocActivitySamplePipeline",
    "properties": {
        "activities": [
            {
                "type": "SqlServerStoredProcedure",
                "typeProperties": {
                    "storedProcedureName": "DailyImport",
                    "storedProcedureParameters": {
                        "DateToImportFor": "$$Text.Format('{0:yyyy-MM-dd HH:mm:ss}', SliceStart)"
                    }
                },
                "outputs": [
                    {
                        "name": "Destination-SQLDW-nna"
                    }
                ],
                "scheduler": {
                    "frequency": "Day",
                    "interval": 1
                },
                "name": "SprocActivitySample"
            }
        ],
        "start": "2017-01-01T00:00:00Z",
        "end": "2017-02-20T05:00:00Z",
        "isPaused": true
    }
}
I'm afraid that Azure SQL Data Warehouse does not support table-valued parameters in stored procedures.
Read more about it here: https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-develop-stored-procedures
If you find a workaround for it, please share! I couldn't find any.
Also, it would be good if you could post the dataset JSON so we can try to find any errors in it.
Cheers!
I got this working. The problem was that I was referencing the wrong dataset:
"outputs": [
    {
        "name": "Destination-SQLDW-nna"
    }
]
After correcting the name to the right dataset, it is working.
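For reference, a hedged sketch of what a dataset such as 'Destination-SQLDW-nna' pointing at Azure SQL DW might look like in ADF v1; the linked service name and table name below are placeholders, not taken from the question:
{
    "name": "Destination-SQLDW-nna",
    "properties": {
        "type": "AzureSqlDWTable",
        "linkedServiceName": "AzureSqlDWLinkedService",
        "typeProperties": {
            "tableName": "dbo.DailyImportTarget"
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        }
    }
}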

Azure Data Factory: Parameterized the Folder and file path

Environments
Azure Data Factory
Scenario
I have an ADF pipeline which reads data from an on-premises server and writes it to Azure Data Lake.
For this, I have provided the folder structure in the ADF dataset as follows:
Folder path: DBName/RawTables/Transactional
File path: TableName.csv
Problem
Is it possible to parameterize the folder name or file path? Basically, if tomorrow I want to change the folder path (without a deployment), we should only need to update the metadata or table structure.
So the short answer here is no. You won't be able to achieve this level of dynamic flexibility with ADF on its own.
You'll need to add newly defined datasets to your pipeline as inputs for folder changes. In Data Lake you could probably get away with a single stored procedure that accepts a parameter for the file path, which could be reused. But this would still require tweaks to the ADF JSON when calling the proc.
Of course, the catch all situation here is to use an ADF custom activity and write a C# class with methods to do whatever you need. Maybe overkill though and lots of effort to setup the authentication to data lake store.
Hope this gives you a steer.
Mangesh, why don't you try the .NET custom activity in ADF? This custom activity would be your first activity; it could check for the processed folder and, if the processed folder is present, move it to a History folder (say). Since ADF is a platform for data movement and data transformation, it doesn't handle this kind of IO activity itself. You can learn more about the .NET custom activity at:
https://learn.microsoft.com/en-us/azure/data-factory/data-factory-use-custom-activities
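As a hedged sketch of how such a custom activity might be declared in an ADF v1 pipeline (the assembly name, entry point, linked service names, and package path below are all hypothetical):
{
    "name": "CheckProcessedFolder",
    "type": "DotNetActivity",
    "linkedServiceName": "AzureBatchLinkedService",
    "typeProperties": {
        "assemblyName": "MyCustomActivities.dll",
        "entryPoint": "MyCustomActivities.MoveProcessedFolder",
        "packageLinkedService": "AzureStorageLinkedService",
        "packageFile": "customactivitycontainer/MyCustomActivities.zip"
    },
    "policy": {
        "timeout": "01:00:00",
        "retry": 3
    }
}
The C# class referenced by entryPoint would implement the folder check and move; the activity then runs on the compute referenced by the linked service.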
What you want to do is possible with the new Lookup activity in Azure Data Factory V2. Documentation is here: Lookup Activity.
A JSON example would be something like this:
{
    "name": "LookupPipelineDemo",
    "properties": {
        "activities": [
            {
                "name": "LookupActivity",
                "type": "Lookup",
                "typeProperties": {
                    "dataset": {
                        "referenceName": "LookupDataset",
                        "type": "DatasetReference"
                    }
                }
            },
            {
                "name": "CopyActivity",
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "SqlSource",
                        "sqlReaderQuery": "select * from @{activity('LookupActivity').output.tableName}"
                    },
                    "sink": {
                        "type": "BlobSink"
                    }
                },
                "dependsOn": [
                    {
                        "activity": "LookupActivity",
                        "dependencyConditions": [ "Succeeded" ]
                    }
                ],
                "inputs": [
                    {
                        "referenceName": "SourceDataset",
                        "type": "DatasetReference"
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "SinkDataset",
                        "type": "DatasetReference"
                    }
                ]
            }
        ]
    }
}

Exporting a AWS Postgres RDS Table to AWS S3

I wanted to use AWS Data Pipeline to pipe data from a Postgres RDS to AWS S3. Does anybody know how this is done?
More precisely, I want to export a Postgres table to AWS S3 using Data Pipeline. The reason I am using Data Pipeline is that I want to automate this process, and this export is going to run once every week.
Any other suggestions will also work.
There is a sample on GitHub:
https://github.com/awslabs/data-pipeline-samples/tree/master/samples/RDStoS3
Here is the code:
https://github.com/awslabs/data-pipeline-samples/blob/master/samples/RDStoS3/RDStoS3Pipeline.json
You can define a copy-activity in the Data Pipeline interface to extract data from a Postgres RDS instance into S3.
Create a data node of the type SqlDataNode. Specify table name and select query.
Set up the database connection by specifying the RDS instance ID (the instance ID is in your URL, e.g. your-instance-id.xxxxx.eu-west-1.rds.amazonaws.com) along with the username, password and database name.
Create a data node of the type S3DataNode.
Create a Copy activity and set the SqlDataNode as input and the S3DataNode as output.
Another option is to use an external tool like Alooma. Alooma can replicate tables from a PostgreSQL database hosted on Amazon RDS to Amazon S3 (https://www.alooma.com/integrations/postgresql/s3). The process can be automated, and you can run it once a week.
I built a pipeline from scratch using the MySQL sample and the documentation as a reference.
You need to have the roles in place: DataPipelineDefaultResourceRole and DataPipelineDefaultRole.
I haven't loaded the parameters, so you need to go into the architect and put in your credentials and folders.
Hope it helps.
{
    "objects": [
        {
            "failureAndRerunMode": "CASCADE",
            "resourceRole": "DataPipelineDefaultResourceRole",
            "role": "DataPipelineDefaultRole",
            "pipelineLogUri": "#{myS3LogsPath}",
            "scheduleType": "ONDEMAND",
            "name": "Default",
            "id": "Default"
        },
        {
            "database": {
                "ref": "DatabaseId_WC2j5"
            },
            "name": "DefaultSqlDataNode1",
            "id": "SqlDataNodeId_VevnE",
            "type": "SqlDataNode",
            "selectQuery": "#{myRDSSelectQuery}",
            "table": "#{myRDSTable}"
        },
        {
            "*password": "#{*myRDSPassword}",
            "name": "RDS_database",
            "id": "DatabaseId_WC2j5",
            "type": "RdsDatabase",
            "rdsInstanceId": "#{myRDSId}",
            "username": "#{myRDSUsername}"
        },
        {
            "output": {
                "ref": "S3DataNodeId_iYhHx"
            },
            "input": {
                "ref": "SqlDataNodeId_VevnE"
            },
            "name": "DefaultCopyActivity1",
            "runsOn": {
                "ref": "ResourceId_G9GWz"
            },
            "id": "CopyActivityId_CapKO",
            "type": "CopyActivity"
        },
        {
            "dependsOn": {
                "ref": "CopyActivityId_CapKO"
            },
            "filePath": "#{myS3Container}#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}",
            "name": "DefaultS3DataNode1",
            "id": "S3DataNodeId_iYhHx",
            "type": "S3DataNode"
        },
        {
            "resourceRole": "DataPipelineDefaultResourceRole",
            "role": "DataPipelineDefaultRole",
            "instanceType": "m1.medium",
            "name": "DefaultResource1",
            "id": "ResourceId_G9GWz",
            "type": "Ec2Resource",
            "terminateAfter": "30 Minutes"
        }
    ],
    "parameters": []
}
You can now do this with the aws_s3.query_export_to_s3 function within Postgres itself: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/postgresql-s3-export.html
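A minimal sketch of such a call, assuming the aws_s3 extension is installed on the RDS instance; the table, bucket, key, and region names below are placeholders:
-- one-time setup on the database (also installs aws_commons)
CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;

-- export the result of a query to S3 as CSV
SELECT * FROM aws_s3.query_export_to_s3(
    'SELECT * FROM my_table',
    aws_commons.create_s3_uri('my-export-bucket', 'exports/my_table.csv', 'eu-west-1'),
    options := 'format csv'
);
For the weekly cadence you would still need a scheduler (for example a cron job, a scheduled Lambda, or Data Pipeline) to issue this statement.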