How to fix Data Lake Analytics script - azure-data-factory

I would like to use Azure Data Factory with an Azure Data Lake Analytics (U-SQL) activity, but so far without success.
This is my pipeline definition:
{
    "name": "UsageStatistivsPipeline",
    "properties": {
        "description": "Standardize JSON data into CSV, with friendly column names & consistent output for all event types. Creates one output (standardized) file per day.",
        "activities": [
            {
                "name": "UsageStatisticsActivity",
                "type": "DataLakeAnalyticsU-SQL",
                "linkedServiceName": {
                    "referenceName": "DataLakeAnalytics",
                    "type": "LinkedServiceReference"
                },
                "typeProperties": {
                    "scriptLinkedService": {
                        "referenceName": "BlobStorage",
                        "type": "LinkedServiceReference"
                    },
                    "scriptPath": "adla-scripts/usage-statistics-adla-script.json",
                    "degreeOfParallelism": 30,
                    "priority": 100,
                    "parameters": {
                        "sourcefile": "wasb://nameofblob.blob.core.windows.net/$$Text.Format('{0:yyyy}/{0:MM}/{0:dd}/0_647de4764587459ea9e0ce6a73e9ace7_2.json', SliceStart)",
                        "destinationfile": "$$Text.Format('wasb://nameofblob.blob.core.windows.net/{0:yyyy}/{0:MM}/{0:dd}/DailyResult.csv', SliceStart)"
                    }
                },
                "inputs": [
                    {
                        "type": "DatasetReference",
                        "referenceName": "DirectionsData"
                    }
                ],
                "outputs": [
                    {
                        "type": "DatasetReference",
                        "referenceName": "OutputData"
                    }
                ],
                "policy": {
                    "timeout": "06:00:00",
                    "concurrency": 10,
                    "executionPriorityOrder": "NewestFirst"
                }
            }
        ],
        "start": "2018-01-08T00:00:00Z",
        "end": "2017-01-09T00:00:00Z",
        "isPaused": false,
        "pipelineMode": "Scheduled"
    }
}
I have two parameters, sourcefile and destinationfile, whose values are dynamic (the path is built from the date).
Then I have this ADLA script for execution:
REFERENCE ASSEMBLY master.[Newtonsoft.Json];
REFERENCE ASSEMBLY master.[Microsoft.Analytics.Samples.Formats];

USING Microsoft.Analytics.Samples.Formats.Json;

@Data =
    EXTRACT
        jsonstring string
    FROM @sourcefile
    USING Extractors.Tsv(quoting:false);

@CreateJSONTuple =
    SELECT JsonFunctions.JsonTuple(jsonstring) AS EventData
    FROM @Data;

@records =
    SELECT JsonFunctions.JsonTuple(EventData["records"], "[*].*") AS record
    FROM @CreateJSONTuple;

@properties =
    SELECT JsonFunctions.JsonTuple(record["[0].properties"]) AS prop,
           record["[0].time"] AS time
    FROM @records;

@result =
    SELECT
        ...
    FROM @properties;

OUTPUT @result
TO @destinationfile
USING Outputters.Csv(outputHeader:false,quoting:true);
Job execution fails with an error.
EDIT:
It seems that Text.Format is not evaluated and is passed into the script as a literal string. The Data Lake Analytics job detail then shows this:
DECLARE @sourcefile string = "$$Text.Format('wasb://nameofblob.blob.core.windows.net/{0:yyyy}/{0:MM}/{0:dd}/0_647de4764587459ea9e0ce6a73e9ace7_2.json', SliceStart)";

In your code sample, the sourcefile parameter is not defined the same way as destinationfile. The latter appears to be correct, while the former is not.
The whole string should be wrapped inside $$Text.Format() for both:
"paramName" : "$$Text.Format('...{0:pattern}...', param)"
Also consider passing only the formatted date like so:
"sliceStart": "$$Text.Format('{0:yyyy-MM-dd}', SliceStart)"
and then doing the rest in U-SQL:
DECLARE @sliceStartDate DateTime = DateTime.Parse(@sliceStart);
DECLARE @path string = String.Format("wasb://path/to/file/{0:yyyy}/{0:MM}/{0:dd}/file.csv", @sliceStartDate);
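With that approach, the activity's parameters block shrinks to a single date string (a sketch based on the snippet above; the parameter surfaces in the script as @sliceStart):
"parameters": {
    "sliceStart": "$$Text.Format('{0:yyyy-MM-dd}', SliceStart)"
}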
Hope this helps

Related

Azure Data Factory Copy Data activity - Use variables/expressions in mapping to dynamically select correct incoming column

I have the below mappings for a Copy activity in ADF:
"translator": {
"type": "TabularTranslator",
"mappings": [
{
"source": {
"path": "$['id']"
},
"sink": {
"name": "TicketID"
}
},
{
"source": {
"path": "$['summary']"
},
"sink": {
"name": "TicketSummary"
}
},
{
"source": {
"path": "$['status']['name']"
},
"sink": {
"name": "TicketStatus"
}
},
{
"source": {
"path": "$['company']['identifier']"
},
"sink": {
"name": "CustomerAccountNumber"
}
},
{
"source": {
"path": "$['company']['name']"
},
"sink": {
"name": "CustomerName"
}
},
{
"source": {
"path": "$['customFields'][74]['value']"
},
"sink": {
"name": "Landlord"
}
},
{
"source": {
"path": "$['customFields'][75]['value']"
},
"sink": {
"name": "Building"
}
}
],
"collectionReference": "",
"mapComplexValuesToString": false
}
The challenge I need to overcome is that the array indexes of the custom fields in the last two sources might change. So I've created an Azure Function which calculates the correct array index. However, I can't work out how to use the Azure Function output value in the source path string - I have tried to refer to it using an expression like @activity('Get Building Field Index').output, but as it's expecting a JSON path, this doesn't work and produces an error:
JSON path $['customFields'][@activity('Get Building Field Index').output]['value'] is invalid.
Is there a different way to achieve what I am trying to do?
Thanks in advance
I have a slightly similar scenario that you might be able to work with.
First, I have a JSON file that is emitted, which I then access in Synapse/ADF with a Lookup activity.
I next have a ForEach activity that runs a Copy data activity.
The ForEach activity receives my Lookup output and makes my JSON usable by setting the following in the ForEach's Settings:
@activity('Lookup').output.firstRow.childItems
My JSON roughly looks as follows:
{"childItems": [
{"subpath": "path/to/folder",
"filename": "filename.parquet",
"subfolder": "subfolder",
"outfolder": "subfolder",
"origin": "A"}]}
So this means in my copy data activity within the for each activity, I can access the parameters of my JSON like so:
@item()['subpath']
@item()['filename']
@item()['folder']
... etc.
Edit:
Adding some screen caps of the parameterization:
https://i.stack.imgur.com/aHpWk.png
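For completeness, a rough sketch of how those @item() values would typically feed a parameterized source dataset inside the ForEach's Copy activity (the dataset and parameter names here are illustrative, not taken from the screenshots):
"inputs": [
    {
        "referenceName": "ds_parquet_source",
        "type": "DatasetReference",
        "parameters": {
            "subpath": "@item()['subpath']",
            "filename": "@item()['filename']"
        }
    }
]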

'int' is a primitive and doesn't support nested properties: Azure Data Factory v2

I am trying to find a substring of a string as part of an activity in ADF. Say, for instance, I want to extract the substring 'de' out of the string 'abcde'. I have tried:
@substring(variables('target_folder_name'), 3, (int(variables('target_folder_name_length'))-3))
where int(variables('target_folder_name_length')) has a value of 5 and variables('target_folder_name') has a value of 'abcde'.
But it gives me: Unrecognized expression: (int(variables('target_folder_name_length'))-3)
On the other hand, if I try this: @substring(variables('target_folder_name'), 2, int(variables('target_folder_name_length'))-3)
This gives me: 'int' is a primitive and doesn't support nested properties
Where am I going wrong?
Use indexof. See my example below:
@substring(variables('testString'), indexof(variables('testString'), variables('de')), length(variables('de')))
Output of 'result' variable:
Since your preceding values are static, you can use the below dynamic expression to achieve the substring value as per your requirement:
@substring(variables('varInputFolderName'), 3, sub(length(variables('varInputFolderName')), 3))
Where varInputFolderName is a String variable with the value abcde, as per this sample.
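As a quick sanity check with that sample value (substring in ADF is zero-based):
@substring('abcde', 3, sub(length('abcde'), 3))
= @substring('abcde', 3, 2)
= 'de'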
Here is the pipeline JSON payload for this sample. You can play around with it for further testing.
{
    "name": "pipeline_FindSubstring",
    "properties": {
        "activities": [
            {
                "name": "setSubstringValue",
                "type": "SetVariable",
                "dependsOn": [],
                "userProperties": [],
                "typeProperties": {
                    "variableName": "varSubstringOutput",
                    "value": {
                        "value": "@substring(variables('varInputFolderName'), 3, sub(length(variables('varInputFolderName')), 3))",
                        "type": "Expression"
                    }
                }
            }
        ],
        "variables": {
            "varInputFolderName": {
                "type": "String",
                "defaultValue": "abcde"
            },
            "varSubstringOutput": {
                "type": "String"
            }
        },
        "annotations": []
    }
}

Using ADF Copy Activity with dynamic schema mapping

I'm trying to drive the columnMapping property from a database configuration table. My first activity in the pipeline pulls in the rows from the config table. My copy activity source is a Json file in Azure blob storage and my sink is an Azure SQL database.
In copy activity I'm setting the mapping using the dynamic content window. The code looks like this:
"translator": {
"value": "#json(activity('Lookup1').output.value[0].ColumnMapping)",
"type": "Expression"
}
My question is, what should the value of activity('Lookup1').output.value[0].ColumnMapping look like?
I've tried several different json formats but the copy activity always seems to ignore it.
For example, I've tried:
{
    "type": "TabularTranslator",
    "columnMappings": {
        "view.url": "url"
    }
}
and:
"columnMappings": {
    "view.url": "url"
}
and:
{
    "view.url": "url"
}
In this example, view.url is the name of the column in the JSON source, and url is the name of the column in my destination table in Azure SQL database.
The issue is due to the dot (.) sign in your column name.
To use column mapping, you should also specify structure in your source and sink dataset.
For your source dataset, you need to specify your format correctly. And since your column name has a dot, you need to specify the JSON path as follows.
You could use the ADF UI to set up a copy for a single file first to get the related format, structure, and column mapping format, then switch it over to the lookup.
As I understand it, your first format should be the right one. If it is already in JSON format, then you may not need to use the "json" function in your expression.
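A rough sketch of the relevant part of such a source dataset, assuming the legacy JsonFormat settings (jsonPathDefinition is my assumption here, not something shown in the original answer):
"structure": [
    { "name": "view.url", "type": "String" }
],
"typeProperties": {
    "format": {
        "type": "JsonFormat",
        "jsonPathDefinition": { "view.url": "$.view.url" }
    }
}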
There seems to be a disconnect between the question and the answer, so I'll hopefully provide a more straightforward answer.
When setting this up, you should have a source dataset with a dynamic structure. The sink doesn't require one, as we're going to specify it in the mapping.
Within the copy activity, format the dynamic json like the following:
{
    "structure": [
        {
            "name": "Address Number"
        },
        {
            "name": "Payment ID"
        },
        {
            "name": "Document Number"
        },
        ...
        ...
    ]
}
You would then specify your dynamic mapping like this:
{
    "translator": {
        "type": "TabularTranslator",
        "mappings": [
            {
                "source": {
                    "name": "Address Number",
                    "type": "Int32"
                },
                "sink": {
                    "name": "address_number"
                }
            },
            {
                "source": {
                    "name": "Payment ID",
                    "type": "Int64"
                },
                "sink": {
                    "name": "payment_id"
                }
            },
            {
                "source": {
                    "name": "Document Number",
                    "type": "Int32"
                },
                "sink": {
                    "name": "document_number"
                }
            },
            ...
            ...
        ]
    }
}
Assuming these were set in separate variables, you would want to send the source as a string, and the mapping as json:
source: @string(json(variables('str_dyn_structure')).structure)
mapping: @json(variables('str_dyn_translator')).translator
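In other words, both end up as Expression objects in the copy activity JSON, the same shape used elsewhere in this thread (a sketch, assuming the source dataset exposes its structure as dynamic content):
"structure": {
    "value": "@string(json(variables('str_dyn_structure')).structure)",
    "type": "Expression"
},
"translator": {
    "value": "@json(variables('str_dyn_translator')).translator",
    "type": "Expression"
}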
VladDrak - You could skip the source dynamic definition by building dynamic mapping like this:
{
    "translator": {
        "type": "TabularTranslator",
        "mappings": [
            {
                "source": {
                    "type": "String",
                    "ordinal": "1"
                },
                "sink": {
                    "name": "dateOfActivity",
                    "type": "String"
                }
            },
            {
                "source": {
                    "type": "String",
                    "ordinal": "2"
                },
                "sink": {
                    "name": "CampaignID",
                    "type": "String"
                }
            }
        ]
    }
}

How to use script arguments in AWS DataPipeline SQLActivity?

I am trying to execute an unload command on Redshift via Data Pipeline. The script looks something like:
unload ($$ SELECT *, count(*) FROM (SELECT APP_ID, CAST(record_date AS DATE) WHERE len(APP_ID)>0 AND CAST(record_date as DATE)=$1) GROUP BY APP_ID $$) to 's3://test/unload/' iam_role 'arn:aws:iam::xxxxxxxxxxx:role/Test' delimiter ',' addquotes;
The pipeline looks something like this:
{
    "objects": [
        {
            "role": "DataPipelineDefaultRole",
            "subject": "SuccessNotification",
            "name": "SNS",
            "id": "ActionId_xxxxx",
            "message": "SUCCESS: #{format(minusDays(node.@scheduledStartTime,1),'MM-dd-YYYY')}",
            "type": "SnsAlarm",
            "topicArn": "arn:aws:sns:us-west-2:xxxxxxxxxx:notification"
        },
        {
            "connectionString": "connection-url",
            "password": "password",
            "name": "Test",
            "id": "DatabaseId_xxxxx",
            "type": "RedshiftDatabase",
            "username": "username"
        },
        {
            "subnetId": "subnet-xxxxxx",
            "resourceRole": "DataPipelineDefaultResourceRole",
            "role": "DataPipelineDefaultRole",
            "name": "EC2",
            "id": "ResourceId_xxxxx",
            "type": "Ec2Resource"
        },
        {
            "failureAndRerunMode": "CASCADE",
            "resourceRole": "DataPipelineDefaultResourceRole",
            "role": "DataPipelineDefaultRole",
            "pipelineLogUri": "s3://test/logs/",
            "scheduleType": "ONDEMAND",
            "name": "Default",
            "id": "Default"
        },
        {
            "database": {
                "ref": "DatabaseId_xxxxxx"
            },
            "scriptUri": "s3://test/script.sql",
            "name": "SqlActivity",
            "scriptArgument": "#{format(minusDays(node.@scheduledStartTime,1),'MM-dd-YYYY')}",
            "id": "SqlActivityId_xxxxx",
            "runsOn": {
                "ref": "ResourceId_xxxx"
            },
            "type": "SqlActivity",
            "onSuccess": {
                "ref": "ActionId_xxxxx"
            }
        }
    ],
    "parameters": []
}
However, I keep getting the error: The column index is out of range: 1, number of columns: 0.
I just can't get it to work. I have tried using ?, $1, and I even tried putting the expression #{format(minusDays(node.@scheduledStartTime,1),'MM-dd-YYYY')} directly in the script. None of them work.
I have looked at the answers to Amazon Data Pipline: How to use a script argument in a SqlActivity? but none of them are helpful.
Does anyone have an idea of how to use a script argument in a SQL script in AWS Data Pipeline?
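For reference, SqlActivity script arguments are positional and are normally referenced in the SQL script with ? placeholders rather than $1. A minimal sketch of that pattern, assuming the ? placeholder behavior (this is not a confirmed fix for the unload-inside-$$ case above):
{
    "id": "SqlActivityId_xxxxx",
    "type": "SqlActivity",
    "database": { "ref": "DatabaseId_xxxxx" },
    "scriptUri": "s3://test/script.sql",
    "scriptArgument": ["#{format(minusDays(node.@scheduledStartTime,1),'MM-dd-YYYY')}"],
    "runsOn": { "ref": "ResourceId_xxxx" }
}
In script.sql, the date would then be referenced as ? where $1 appears in the original script.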

Copying 7 column table to 6 column table

I'm porting SQL Server Integration Services packages to Azure Data Factory.
I have two tables (Table 1 and Table 2) which live on different servers. One has seven columns, the other six. I followed the example at https://learn.microsoft.com/en-us/azure/data-factory/data-factory-map-columns
Table 1 DDL:
CREATE TABLE dbo.Table1
(
    zonename nvarchar(max),
    propertyname nvarchar(max),
    basePropertyid int,
    dfp_ad_unit_id bigint,
    MomentType nvarchar(200),
    OperatingSystemName nvarchar(50)
)
Table 2 DDL:
CREATE TABLE dbo.Table2
(
    ZoneID int IDENTITY,
    ZoneName nvarchar(max),
    propertyName nvarchar(max),
    BasePropertyID int,
    dfp_ad_unit_id bigint,
    MomentType nvarchar(200),
    OperatingSystemName nvarchar(50)
)
In ADF, I define Table 1 as:
{
    "$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/Microsoft.DataFactory.Table.json",
    "name": "Table1",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": "PlatformX",
        "structure": [
            { "name": "zonename" },
            { "name": "propertyname" },
            { "name": "basePropertyid" },
            { "name": "dfp_ad_unit_id" },
            { "name": "MomentType" },
            { "name": "OperatingSystemName" }
        ],
        "external": true,
        "typeProperties": {
            "tableName": "Platform.Zone"
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        }
    }
}
In ADF I define Table 2 as:
{
    "$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/Microsoft.DataFactory.Table.json",
    "name": "Table2",
    "properties": {
        "type": "SqlServerTable",
        "linkedServiceName": "BrixDW",
        "structure": [
            { "name": "ZoneID" },
            { "name": "ZoneName" },
            { "name": "propertyName" },
            { "name": "BasePropertyID" },
            { "name": "dfp_ad_unit_id" },
            { "name": "MomentType" },
            { "name": "OperatingSystemName" }
        ],
        "external": true,
        "typeProperties": {
            "tableName": "staging.DimZone"
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        }
    }
}
As you can see, Table2 has an identity column, which will be populated automatically.
This should be a simple Copy activity:
{
    "$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/Microsoft.DataFactory.Pipeline.json",
    "name": "Copy_Table1_to_Table2",
    "properties": {
        "description": "Copy_Table1_to_Table2",
        "activities": [
            {
                "name": "Copy_Table1_to_Table2",
                "type": "Copy",
                "inputs": [
                    { "name": "Table1" }
                ],
                "outputs": [
                    { "name": "Table2" }
                ],
                "typeProperties": {
                    "source": {
                        "type": "SqlSource",
                        "sqlReaderQuery": "select * from dbo.Table1"
                    },
                    "sink": {
                        "type": "SqlSink"
                    },
                    "translator": {
                        "type": "TabularTranslator",
                        "columnMappings": "zonename: ZoneName, propertyname: propertyName, basePropertyid: BasePropertyID, dfp_ad_unit_id: dfp_ad_unit_id, MomentType: MomentType, OperatingSystemName: OperatingSystemName"
                    }
                },
                "policy": {
                    "concurrency": 1,
                    "executionPriorityOrder": "OldestFirst",
                    "retry": 3,
                    "timeout": "01:00:00"
                },
                "scheduler": {
                    "frequency": "Day",
                    "interval": 1
                }
            }
        ],
        "start": "2017-07-23T00:00:00Z",
        "end": "2020-07-19T00:00:00Z"
    }
}
I figured that by not mapping ZoneID, it would just be ignored. But ADF gives me the following error:
Copy activity encountered a user error: GatewayNodeName=APP1250S,ErrorCode=UserErrorInvalidColumnMappingColumnCountMismatch,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Invalid column mapping provided to copy activity: 'zonename: ZoneName, propertyname: propertyName, basePropertyid: BasePropertyID, dfp_ad_unit_id: dfp_ad_unit_id, MomentType: MomentType, OperatingSystemName: OperatingSystemName', Detailed message: Different column count between target structure and column mapping. Target column count:7, Column mapping count:6. Check column mapping in table definition.,Source=Microsoft.DataTransfer.Common,'
In a nutshell I'm trying to copy a 7 column table to a 6 column table and Data Factory doesn't like it. How can I accomplish this task?
I realize this is an old question, but I ran into this issue just now. My problem was that I initially generated the destination/sink table, created a pipeline, and then added a column.
Despite clearing and reimporting the schemas, whenever triggering the pipeline, it would throw the above error. I made sure the new column (which has a default on it) was deselected in the mappings, so it would only use the default value. The error was still thrown.
The only way I managed to get things to work was by completely recreating the pipelines from scratch. It's almost as if somewhere in the meta data, the old mappings are retained.
I had the exact same issue and solved it by going into the Azure dataset and removing the identity column, then making sure I had the same number of columns in my source and target (sink). After doing this, the copy adds the records and the identity column in the table just works as expected. I did not have to modify the physical table in SQL, only the dataset for the table in Azure.
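Applied to the dataset JSON above, that just means the Table2 dataset's structure omits the identity column (a sketch, not the poster's exact definition):
"structure": [
    { "name": "ZoneName" },
    { "name": "propertyName" },
    { "name": "BasePropertyID" },
    { "name": "dfp_ad_unit_id" },
    { "name": "MomentType" },
    { "name": "OperatingSystemName" }
]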
One option would be to create a view over the 7-column table which does not include the identity column and insert into that view.
CREATE VIEW bulkLoad.Table2
AS
SELECT
    ZoneName,
    propertyName,
    BasePropertyID,
    dfp_ad_unit_id,
    MomentType,
    OperatingSystemName
FROM dbo.Table2;
GO
I can do some digging and see if some trick is possible with the column mapping but that should unblock you.
HTH
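If you go the view route, the ADF output dataset would then point at the view rather than the base table (a sketch; the original dataset above uses staging.DimZone):
"typeProperties": {
    "tableName": "bulkLoad.Table2"
}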
I was told by MSFT support to just remove the identity column from the table definition. It seems to have worked.