Environments
Azure Data Factory
Scenario
I have an ADF pipeline which reads data from an on-premises server and writes it to Azure Data Lake.
For this, I have provided the folder structure in the ADF dataset as follows:
Folder Path : - DBName/RawTables/Transactional
File Path : - TableName.csv
Problem
Is it possible to parameterize the folder name or file path? Basically, if tomorrow I want to change the folder path (without a deployment), we should only have to update the metadata or table structure.
So the short answer here is no. You won't be able to achieve this level of dynamic flexibility with ADF on its own.
You'll need to add newly defined datasets to your pipeline as inputs for folder changes. In Data Lake you could probably get away with a single stored procedure that accepts a parameter for the file path, which could be reused. But this would still require tweaks to the ADF JSON when calling the proc.
Of course, the catch-all option here is to use an ADF custom activity and write a C# class with methods to do whatever you need. Maybe overkill though, and a lot of effort to set up the authentication to Data Lake Store.
Hope this gives you a steer.
Mangesh, why don't you try the .NET custom activity in ADF? This custom activity would be your first activity: it checks for the processed folder and, if the folder is present, moves it to a History (say) folder. Since ADF is a platform for data movement and data transformation, it doesn't deal with I/O activity itself. You can learn more about the .NET custom activity at:
https://learn.microsoft.com/en-us/azure/data-factory/data-factory-use-custom-activities
What you want to do is possible with the new Lookup activity in Azure Data Factory V2. Documentation is here: Lookup Activity.
A JSON example would be something like this:
{
    "name": "LookupPipelineDemo",
    "properties": {
        "activities": [
            {
                "name": "LookupActivity",
                "type": "Lookup",
                "typeProperties": {
                    "dataset": {
                        "referenceName": "LookupDataset",
                        "type": "DatasetReference"
                    }
                }
            },
            {
                "name": "CopyActivity",
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "SqlSource",
                        "sqlReaderQuery": "select * from @{activity('LookupActivity').output.firstRow.tableName}"
                    },
                    "sink": {
                        "type": "BlobSink"
                    }
                },
                "dependsOn": [
                    {
                        "activity": "LookupActivity",
                        "dependencyConditions": [ "Succeeded" ]
                    }
                ],
                "inputs": [
                    {
                        "referenceName": "SourceDataset",
                        "type": "DatasetReference"
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "SinkDataset",
                        "type": "DatasetReference"
                    }
                ]
            }
        ]
    }
}
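As a side note on the original question about changing the folder path without a deployment: in ADF V2 the dataset itself can also be parameterized. Below is a minimal sketch, assuming an Azure Data Lake Store linked service; the dataset, parameter, and linked service names are illustrative and not from this thread:
{
    "name": "ParameterizedAdlsDataset",
    "properties": {
        "type": "AzureDataLakeStoreFile",
        "linkedServiceName": {
            "referenceName": "AzureDataLakeStoreLinkedService",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "folderPath": { "type": "String" },
            "fileName": { "type": "String" }
        },
        "typeProperties": {
            "folderPath": {
                "value": "@dataset().folderPath",
                "type": "Expression"
            },
            "fileName": {
                "value": "@dataset().fileName",
                "type": "Expression"
            }
        }
    }
}
The activity that uses this dataset then supplies values for folderPath and fileName in its dataset reference (e.g. from the Lookup output above), so the path can change at runtime without a redeployment.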
Related
Does anyone know if there is a way to pass a schema mapping to multiple CSVs without doing it manually? I have 30 CSVs passed through a data flow inside a ForEach activity, so I can't detect or set the fields' types (I could only do that for the first one).
Thanks for your help! :)
A Copy Activity mapping can be parameterized and changed at runtime if explicit mapping is required. The parameter is just a JSON object that you'd pass in for each of the files you are processing. It looks something like this:
{
    "type": "TabularTranslator",
    "mappings": [
        {
            "source": {
                "name": "Id"
            },
            "sink": {
                "name": "CustomerID"
            }
        },
        {
            "source": {
                "name": "Name"
            },
            "sink": {
                "name": "LastName"
            }
        },
        {
            "source": {
                "name": "LastModifiedDate"
            },
            "sink": {
                "name": "ModifiedDate"
            }
        }
    ]
}
You can read more about it here: Schema and data type mapping in copy activity
So you can either pre-generate these mappings and fetch them via a Lookup in a previous step of the pipeline, or, if they need to be dynamic, you can create them at runtime with code (e.g. have an Azure Function that looks up the current schema of the CSV and returns a properly formatted translator object).
Once you have the object as a parameter, you can pass it to the Copy activity: on the mapping properties of the Copy activity you just Add Dynamic Content and select the appropriate parameter. In the pipeline JSON it will look something like the sketch below.
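Since the screenshot from the original answer isn't reproduced here, a rough sketch of the resulting pipeline JSON is shown instead; the parameter name mapping and the source/sink types are assumptions for illustration:
"typeProperties": {
    "source": { "type": "DelimitedTextSource" },
    "sink": { "type": "AzureSqlSink" },
    "translator": {
        "value": "@pipeline().parameters.mapping",
        "type": "Expression"
    }
}
Here mapping would be a pipeline parameter of type Object; if you pass the translator in as a string instead, wrap it with @json(...) in the expression.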
I have a custom activity in Azure Data Factory, which attempts to execute the following command:
PowerShell.exe -Command "Write-Host 'Hello, world!'"
However, when I debug (run) this command from within Azure Data Factory, it runs for a long time, and finally fails.
I guess it fails because perhaps it could not locate "PowerShell.exe". How can I ensure that the ADF Custom Activity has access to PowerShell.exe?
Some sites mention specifying a package (.zip file) that contains everything the exe needs to execute. However, since PowerShell is from Microsoft, I think it would be inappropriate to zip the PowerShell directory and specify it as a package to the Custom Activity.
Please suggest how I can execute a PowerShell command from a Custom Activity in Azure Data Factory. Thanks!
Whenever I search "Execute PowerShell from Custom Activity in Azure Data Factory", the results mostly discuss which Az PowerShell command to use to trigger an ADF pipeline.
I saw two threads on Stackoverflow.com where the answer just says to use a Custom Activity, without being specific about calling a PowerShell command from ADF.
Here is the JSON for the task:
{
    "name": "ExecutePs1CustomActivity",
    "properties": {
        "activities": [
            {
                "name": "ExecutePSScriptCustomActivity",
                "type": "Custom",
                "dependsOn": [],
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "command": "PowerShell.exe -Command \"Write-Host 'Hello, world!'\"",
                    "referenceObjects": {
                        "linkedServices": [],
                        "datasets": []
                    }
                },
                "linkedServiceName": {
                    "referenceName": "Ps1CustomActivityAzureBatch",
                    "type": "LinkedServiceReference"
                }
            }
        ],
        "annotations": []
    }
}
I see "In Progress" for 3 minutes (180 seconds), and then it shows as "Failed."
I would suggest you move all your scripting into a PowerShell script file and copy it to the storage account linked with your custom activity. Once done, try to call it like below:
powershell .\script.ps1
You can also provide the path of the script in the JSON, like below:
{
    "name": "MyCustomActivityPipeline",
    "properties": {
        "description": "Custom activity sample",
        "activities": [
            {
                "type": "Custom",
                "name": "MyCustomActivity",
                "linkedServiceName": {
                    "referenceName": "AzureBatchLinkedService",
                    "type": "LinkedServiceReference"
                },
                "typeProperties": {
                    "command": "helloworld.exe",
                    "folderPath": "customactv2/helloworld",
                    "resourceLinkedService": {
                        "referenceName": "StorageLinkedService",
                        "type": "LinkedServiceReference"
                    }
                }
            }
        ]
    }
}
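Adapting that sample to the PowerShell case above (the folder path below is an illustrative assumption, not from the original answer), the typeProperties would point the command at the uploaded script:
"typeProperties": {
    "command": "powershell .\\script.ps1",
    "folderPath": "customactv2/scripts",
    "resourceLinkedService": {
        "referenceName": "StorageLinkedService",
        "type": "LinkedServiceReference"
    }
}
ADF copies the contents of folderPath from the resource storage account into the Azure Batch node's working directory before running the command, which is why the relative script path works.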
Please try it and see if it helps. I would also suggest troubleshooting the pipeline steps to look for a detailed error.
Also, to your second point ("Some sites say about specifying a package (.zip file) that contains everything needed for the exe to execute"): this is required when you are building a custom activity using .NET; in that case you must copy all the DLLs and EXEs needed for execution.
Hope it helps.
I have two activities in my Azure Data Factory.
Activity A1 = a stored proc on a SQL DB. Input = none, output = DB (output1). The stored proc targets the output dataset.
Activity A2 = an Azure copy activity ("type": "Copy") which copies from blob to the same SQL DB. Input = blob, output = DB (output2).
I need to run activity A1 before A2, and I can't for the world figure out what dependencies to put between them.
I tried to mark A2 as having two inputs: the blob + the DB (output1). If I do this, the copy activity doesn't throw an error, but it does NOT copy the blob to the DB (I think it silently uses the DB as the source of the copy instead of the blob, and somehow does nothing).
If I remove the DB input (output1) on A2, it can successfully copy the blob to the DB, but I no longer have the dependency chain ensuring that A1 runs before A2.
thanks!
I figured this out. I was able to keep two dependencies on A2, but I just needed to make sure of the ordering of the two inputs. Weird. It looks like the Copy activity just acts on the FIRST input, so when I moved the blob to be the first input it worked! :) (Earlier I had the DB output1 as the first input and it silently did nothing.)
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "MyBlobInput"
},
{
"name": "MyDBOutput1"
}
],
"outputs": [
{
"name": "MyDBOutput2"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 3,
"retry": 3
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "AzureBlobtoSQL",
"description": "Copy Activity"
}
],
I am getting the following error when trying to run a stored procedure against an Azure SQL Data Warehouse.
Activity 'SprocActivitySample' contains an invalid Dataset reference 'Destination-SQLDW-nna'. This dataset points to the Azure SQL DW, and the stored procedure is in it.
Here is the entire code.
{
    "name": "SprocActivitySamplePipeline",
    "properties": {
        "activities": [
            {
                "type": "SqlServerStoredProcedure",
                "typeProperties": {
                    "storedProcedureName": "DailyImport",
                    "storedProcedureParameters": {
                        "DateToImportFor": "$$Text.Format('{0:yyyy-MM-dd HH:mm:ss}', SliceStart)"
                    }
                },
                "outputs": [
                    {
                        "name": "Destination-SQLDW-nna"
                    }
                ],
                "scheduler": {
                    "frequency": "Day",
                    "interval": 1
                },
                "name": "SprocActivitySample"
            }
        ],
        "start": "2017-01-01T00:00:00Z",
        "end": "2017-02-20T05:00:00Z",
        "isPaused": true
    }
}
I'm afraid that Azure SQL Data Warehouse does not support table-valued parameters in stored procedures.
Read more about it here: https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-develop-stored-procedures
If you find a workaround for it, please share! I couldn't find any.
Also, it would be good if you could post the dataset JSON so we can try to find any errors in it.
Cheers!
I got this working. The problem was that I was referencing the wrong dataset:
"outputs": [
{
"name": "Destination-SQLDW-nna"
}
after correcting name to the right Dataset it is working
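For reference, an ADF V1 dataset that such an output can point to would look roughly like the sketch below; the linked service and table names are illustrative assumptions, only the dataset name comes from the post:
{
    "name": "Destination-SQLDW-nna",
    "properties": {
        "type": "AzureSqlDWTable",
        "linkedServiceName": "AzureSqlDWLinkedService",
        "typeProperties": {
            "tableName": "DailyImportTarget"
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        }
    }
}
The "name" in the activity's outputs must match such a dataset exactly, and the dataset's linked service must point at the SQL DW that contains the stored procedure.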
I'm facing an issue when running a U-SQL script with Azure Data Factory V2.
This U-SQL script works fine in the portal or in Visual Studio:
@a =
    SELECT * FROM
        (VALUES
            ("Contoso", 1500.0, "2017-03-39"),
            ("Woodgrove", 2700.0, "2017-04-10")
        ) AS D(customer, amount, date); // third column name assumed; the post listed only customer, amount

@results =
    SELECT
        customer,
        amount
    FROM @a;

OUTPUT @results
TO "test"
USING Outputters.Text();
But it doesn't work when started from an Azure Data Factory V2 activity; below are my ADF scripts.
Creating or updating linked service ADLA [adla] ...
{
    "properties": {
        "type": "AzureDataLakeAnalytics",
        "typeProperties": {
            "accountName": "adla",
            "servicePrincipalId": "ea4823f2-3b7a-4c-78d29cffa68b",
            "servicePrincipalKey": {
                "type": "SecureString",
                "value": "jKhyspEwMScDAGU0MO39FcAP9qQ="
            },
            "tenant": "41f06-8ca09e840041",
            "subscriptionId": "9cf053128b749",
            "resourceGroupName": "test"
        }
    }
}
Creating or updating linked service BLOB [BLOB] ...
{
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "DefaultEndpointsProtocol=https;AccountName=totoblob;AccountKey=qZqpKyGtWMRXZO2CNLa0qTyvLTMko4lzfgsg07pjloIPGZtJel4qvRBkoVOA==;EndpointSuffix=core.windows.net"
            }
        }
    }
}
Creating or updating pipeline ADLAPipeline...
{
    "properties": {
        "activities": [
            {
                "type": "DataLakeAnalyticsU-SQL",
                "typeProperties": {
                    "scriptPath": "src/test.usql",
                    "scriptLinkedService": {
                        "referenceName": "blob",
                        "type": "LinkedServiceReference"
                    },
                    "degreeOfParallelism": 1
                },
                "linkedServiceName": {
                    "referenceName": "adla",
                    "type": "LinkedServiceReference"
                },
                "name": "Usql-toto"
            }
        ]
    }
}
1 - I checked the connection to the blob storage; the U-SQL script is successfully found (if I rename it, it throws a not-found error).
2 - I checked the connection to Azure Data Lake Analytics; it seems to connect (if I set a wrong credential, it throws a different error).
3 - When running the pipeline I get the following error: "Activity Usql-toto failed: User is not able to access to datalake store". I don't provide a Data Lake Store credential, but there is a default Data Lake Store account attached to the ADLA account.
Any hint?
Finally found help in this post: U-SQL Job Failing in Data Factory
The system and catalog folders don't inherit permissions from their parent, so I had to reassign permissions on these two folders.
I was also having this problem. What helped me was running the "Add User Wizard" from Data Lake Analytics. Using this wizard I added the service principal that I use in the linked service for Data Lake Analytics as an Owner with R+W permissions.
Before using the wizard, I tried to configure this manually by setting the appropriate permissions from the Explore data screen, but that didn't resolve the problem. (The SP was already a Contributor on the service.)