ADF Copy activity delimited to parquet data type mapping with extra column - azure-data-factory

I'm trying to use a Copy activity in ADF to copy data from a CSV file to a Parquet file. I can get the column name mappings to work, and most of the data types map successfully. However, I am also adding a dynamic column called LoadDate, generated from an expression in ADF, and I can't get it to map correctly. Looking at the Parquet file that is output, the "Date" column that comes from the delimited file is written as INT96, which Azure Databricks correctly reads as a date. The LoadDate column generated from the expression, however, is written as BYTE_ARRAY.
I just can't get the extra column written in the correct format. Any help would be appreciated.
Below is the mapping section of my JSON.
"mappings": [
{
"source": {
"name": "Date",
"type": "DateTime",
"physicalType": "String"
},
"sink": {
"name": "Date",
"type": "DateTime",
"physicalType": "INT_96"
}
},
{
"source": {
"name": "Item",
"type": "String",
"physicalType": "String"
},
"sink": {
"name": "Item",
"type": "String",
"physicalType": "UTF8"
}
},
{
"source": {
"name": "Opt",
"type": "INT32",
"physicalType": "String"
},
"sink": {
"name": "Opt",
"type": "Int32",
"physicalType": "INT_32"
}
},
{
"source": {
"name": "Branch",
"type": "INT32",
"physicalType": "String"
},
"sink": {
"name": "Branch",
"type": "Int32",
"physicalType": "INT_32"
}
},
{
"source": {
"name": "QTY",
"type": "INT32",
"physicalType": "String"
},
"sink": {
"name": "QTY",
"type": "Int32",
"physicalType": "INT_32"
}
},
{
"source": {
"name": "LoadDate",
"type": "DateTime",
"physicalType": "String"
},
"sink": {
"name": "LoadDate",
"type": "DateTime",
"physicalType": "INT_96"
}
}
]
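For reference, the extra column is added on the Copy activity source along these lines (a simplified sketch of the additionalColumns block; the exact expression may differ):
"source": {
    "type": "DelimitedTextSource",
    "additionalColumns": [
        {
            "name": "LoadDate",
            "value": {
                "value": "@utcnow()",
                "type": "Expression"
            }
        }
    ]
}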

Related

Invalid Schema on Confluent Controlcenter

I am just trying to set up a value schema for a topic in the web interface of Confluent Control Center.
I chose the Avro format and tried the following schema:
{
"fields": [
{"name":"date",
"type":"dates",
"doc":"Date of the count"
},
{"name":"time",
"type":"timestamp-millis",
"doc":"date in ms"
},
{"name":"count",
"type":"int",
"doc":"Number of Articles"
}
],
"name": "articleCount",
"type": "record"
}
But the interface keeps on saying the input schema is invalid.
I have no idea why.
Any help is appreciated!
There are two issues with the datatypes: dates is not a valid Avro type, and timestamp-millis is a logical type that must annotate a long.
"type":"dates" => "type": "string"
"type":"timestamp-millis" => "type": {"type": "long", "logicalType": "timestamp-millis"}
The updated schema will look like this:
{
"fields": [
{
"name": "date",
"type": "string",
"doc": "Date of the count"
},
{
"name": "time",
"type": {
"type": "long",
"logicalType": "timestamp-millis"
},
"doc": "date in ms"
},
{
"name": "count",
"type": "int",
"doc": "Number of Articles"
}
],
"name": "articleCount",
"type": "record"
}
Sample Payload:
{
"date": "2020-07-10",
"time": 12345678900,
"count": 1473217
}
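As a side note, if the date field should be a real date rather than a plain string, Avro also has a date logical type backed by int (shown only as a sketch; it is not required for the fix above):
{
    "name": "date",
    "type": {
        "type": "int",
        "logicalType": "date"
    },
    "doc": "Date of the count"
}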
More information on Avro datatypes can be found here:
https://docs.oracle.com/database/nosql-12.1.3.0/GettingStartedGuide/avroschemas.html

Getting error on null and empty string while copying a csv file from blob container to Azure SQL DB

I have tried every combination of datatypes for my data, but each time my Data Factory pipeline gives me this error:
{
"errorCode": "2200",
"message": "ErrorCode=UserErrorColumnNameNotAllowNull,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Empty or Null string found in Column Name 2. Please make sure column name not null and try again.,Source=Microsoft.DataTransfer.Common,'",
"failureType": "UserError",
"target": "xxx",
"details": []
}
My Copy activity code is something like this:
{
"name": "xxx",
"description": "uuu",
"type": "Copy",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"storeSettings": {
"type": "AzureBlobStorageReadSettings",
"recursive": true,
"wildcardFileName": "*"
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
},
"sink": {
"type": "AzureSqlSink"
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"mappings": [
{
"source": {
"name": "populationId",
"type": "Guid"
},
"sink": {
"name": "PopulationID",
"type": "String"
}
},
{
"source": {
"name": "inputTime",
"type": "DateTime"
},
"sink": {
"name": "inputTime",
"type": "DateTime"
}
},
{
"source": {
"name": "inputCount",
"type": "Decimal"
},
"sink": {
"name": "inputCount",
"type": "Decimal"
}
},
{
"source": {
"name": "inputBiomass",
"type": "Decimal"
},
"sink": {
"name": "inputBiomass",
"type": "Decimal"
}
},
{
"source": {
"name": "inputNumber",
"type": "Decimal"
},
"sink": {
"name": "inputNumber",
"type": "Decimal"
}
},
{
"source": {
"name": "utcOffset",
"type": "String"
},
"sink": {
"name": "utcOffset",
"type": "Int32"
}
},
{
"source": {
"name": "fishGroupName",
"type": "String"
},
"sink": {
"name": "fishgroupname",
"type": "String"
}
},
{
"source": {
"name": "yearClass",
"type": "String"
},
"sink": {
"name": "yearclass",
"type": "String"
}
}
]
}
},
"inputs": [
{
"referenceName": "DelimitedTextFTDimensions",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureSqlTable1",
"type": "DatasetReference"
}
]
}
Can anyone please help me understand the issue? Some blogs suggest using treatEmptyAsNull, but I am not allowed to modify the JSON. Is there another way to do this?
I suggest using a Data Flow with a Derived Column transformation; it can help you build an expression to replace the null column.
For example, a Derived Column that returns 'dd' when Column_2 is null:
iifNull(Column_2,'dd')
Then map the column in the sink.
Reference: Data transformation expressions in mapping data flow
Hope this helps.
Fixed it. It was an easy fix: one of the columns in my destination table was marked as NOT NULL; I changed it to allow NULLs and it worked.

Search and replace JSON multiline using regex in VSCode

I have a really long JSON schema. Using VSCode, I need to change the partnerName type to string, null (it appears more than 20 times; the snippet below is just one occurrence).
How can I search and replace across multiple lines for the entire partnerName entry?
Based on another question, I've tried regexes such as [\n\s]+ and (.*\n)+, for example:
"partnerName": {(.*\n)+"type": "null"(.*\n)+}
But it still doesn't match.
Search for:
"partnerName": {
"type": "null"
},
Replace with:
"partnerName": {
"type": "string, null"
},
Snippet example:
{
"type": "object",
"properties": {
"node": {
"type": "object",
"properties": {
"id": {
"type": "string"
},
"name": {
"type": "string"
},
"description": {
"type": "string"
},
"type": {
"type": "string"
},
"frequency": {
"type": "string"
},
"maxCount": {
"type": "integer"
},
"points": {
"type": "integer"
},
"startAt": {
"type": "string"
},
"endAt": {
"type": "string"
},
"partnerName": {
"type": "null"
},
"action": {
"type": "null"
}
},
"required": [
"id",
"name",
"description",
"type",
"frequency",
"maxCount",
"points",
"startAt",
"endAt",
"partnerName",
"action"
]
}
},
"required": [
"node"
]
},
Try this regex:
(partnerName".*\n\s*"type":\s*)"null"
and replace with:
$1"string, null"

Azure Data Factory V2 Dataset Dynamic Folder

In Azure Data Factory (V1) I was able to create a slice and store the output in a specific folder (i.e. {Year}/{Month}/{Day}); see the code below.
How do you create the same type of slice in Azure Data Factory V2? I did find that you have to create a parameter, but I was unable to figure out how to pass it.
"folderPath": "#{dataset().path}",
"parameters": {
"path": {
"type": "String"
Here is the original ADF V1 code.
{
"name": "EMS_EMSActivations_L1_Snapshot",
"properties": {
"published": false,
"type": "AzureDataLakeStore",
"linkedServiceName": "SalesIntelligence_ADLS_LS",
"typeProperties": {
"fileName": "EMS.FACT_EMSActivations_WTA.tsv",
"folderPath": "/Snapshots/EMS/FACT_EMSActivations_WTA/{Year}/{Month}/{Day}",
"format": {
"type": "TextFormat",
"rowDelimiter": "␀",
"columnDelimiter": "\t",
"nullValue": "#NULL#",
"quoteChar": "\""
},
"partitionedBy": [
{
"name": "Year",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyy"
}
},
{
"name": "Month",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "MM"
}
},
{
"name": "Day",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "dd"
}
},
{
"name": "Hour",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "HH"
}
},
{
"name": "Minute",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "mm"
}
}
]
},
"availability": {
"frequency": "Day",
"interval": 1
}
}
}
Here is how you create a dynamic folder path when importing data from SQL into ADL. Look at the folderPath line.
{
"name": "EBC_BriefingActivitySummary_L1_Snapshot",
"properties": {
"linkedServiceName": {
"referenceName": "SIAzureDataLakeStore",
"type": "LinkedServiceReference"
},
"type": "AzureDataLakeStoreFile",
"typeProperties": {
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "",
"nullValue": "\\N",
"treatEmptyAsNull": false,
"firstRowAsHeader": false
},
"fileName": {
"value": "EBC.rpt_BriefingActivitySummary.tsv",
"type": "Expression"
},
"folderPath": {
"value": "#concat('/Snapshots/EBC/rpt_BriefingActivitySummary/', formatDateTime(pipeline().parameters.scheduledRunTime, 'yyyy'), '/', formatDateTime(pipeline().parameters.scheduledRunTime, 'MM'), '/', formatDateTime(pipeline().parameters.scheduledRunTime, 'dd'), '/')",
"type": "Expression"
}
}
}
}
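The folderPath expression above assumes the pipeline declares a scheduledRunTime parameter, roughly like this (a sketch; the value is then supplied by the trigger or at run time):
"parameters": {
    "scheduledRunTime": {
        "type": "String"
    }
}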
Step 1:
Use WindowStartTime / WindowEndTime in the folderPath:
"folderPath": {
"value": "<<path>>/#{formatDateTime(pipeline().parameters.windowStart,'yyyy')}-#{formatDateTime(pipeline().parameters.windowStart,'MM')}-#{formatDateTime(pipeline().parameters.windowStart,'dd')}/#{formatDateTime(pipeline().parameters.windowStart,'HH')}/",
"type": "Expression"
}
Step 2: Add the parameters to the pipeline JSON:
"parameters": {
"windowStart": {
"type": "String"
},
"windowEnd": {
"type": "String"
}
}
Step 3: Add the run parameters to the tumbling window trigger
(these are the parameters referenced in Step 2):
"parameters": {
"windowStart": {
"type": "Expression",
"value": "#trigger().outputs.windowStartTime"
},
"windowEnd": {
"type": "Expression",
"value": "#trigger().outputs.windowEndTime"
}
}
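For context, the Step 3 block nests under the pipeline reference of the tumbling window trigger definition, roughly like this (a sketch with placeholder names):
{
    "name": "MyTumblingWindowTrigger",
    "properties": {
        "type": "TumblingWindowTrigger",
        "typeProperties": {
            "frequency": "Hour",
            "interval": 24,
            "startTime": "2020-01-01T00:00:00Z",
            "maxConcurrency": 1
        },
        "pipeline": {
            "pipelineReference": {
                "type": "PipelineReference",
                "referenceName": "MyPipeline"
            },
            "parameters": {
                "windowStart": {
                    "type": "Expression",
                    "value": "@trigger().outputs.windowStartTime"
                },
                "windowEnd": {
                    "type": "Expression",
                    "value": "@trigger().outputs.windowEndTime"
                }
            }
        }
    }
}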
For more details, refer to this link:
https://github.com/MicrosoftDocs/azure-docs/blob/master/articles/data-factory/how-to-create-tumbling-window-trigger.md

azure data factory start pipeline different from starting job

This issue is driving me crazy. I am running Azure Data Factory V1 and need to schedule a copy job every week from 01/03/2009 through 01/31/2009, so I defined this schedule on the pipeline:
"start": "2009-01-03T00:00:00Z",
"end": "2009-01-31T00:00:00Z",
"isPaused": false,
Monitoring the pipeline, I see that Data Factory scheduled these dates:
12/29/2008
01/05/2009
01/12/2009
01/19/2009
01/26/2009
instead of the schedule I wanted:
01/03/2009
01/10/2009
01/17/2009
01/24/2009
01/31/2009
Why doesn't the start date defined on the pipeline correspond to the scheduled dates in the monitor?
Many thanks!
Here is the JSON Pipeline:
{
"name": "CopyPipeline-blob2datalake",
"properties": {
"description": "copy from blob storage to datalake directory structure",
"activities": [
{
"type": "DataLakeAnalyticsU-SQL",
"typeProperties": {
"scriptPath": "script/dat230.usql",
"scriptLinkedService": "AzureStorageLinkedService",
"degreeOfParallelism": 5,
"priority": 100,
"parameters": {
"salesfile": "$$Text.Format('/DAT230/{0:yyyy}/{0:MM}/{0:dd}.txt', Date.StartOfDay (SliceStart))",
"lineitemsfile": "$$Text.Format('/dat230/dataloads/{0:yyyy}/{0:MM}/{0:dd}/factinventory/fact.csv', Date.StartOfDay (SliceStart))"
}
},
"inputs": [
{
"name": "InputDataset-dat230"
}
],
"outputs": [
{
"name": "OutputDataset-dat230"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"retry": 1
},
"scheduler": {
"frequency": "Day",
"interval": 7
},
"name": "DataLakeAnalyticsUSqlActivityTemplate",
"linkedServiceName": "AzureDataLakeAnalyticsLinkedService"
}
],
"start": "2009-01-03T00:00:00Z",
"end": "2009-01-11T00:00:00Z",
"isPaused": false,
"hubName": "edxlearningdf_hub",
"pipelineMode": "Scheduled"
}
}
and here are the datasets:
{
"name": "InputDataset-dat230",
"properties": {
"structure": [
{
"name": "Date",
"type": "Datetime"
},
{
"name": "StoreID",
"type": "Int64"
},
{
"name": "StoreName",
"type": "String"
},
{
"name": "ProductID",
"type": "Int64"
},
{
"name": "ProductName",
"type": "String"
},
{
"name": "Color",
"type": "String"
},
{
"name": "Size",
"type": "String"
},
{
"name": "Manufacturer",
"type": "String"
},
{
"name": "OnHandQuantity",
"type": "Int64"
},
{
"name": "OnOrderQuantity",
"type": "Int64"
},
{
"name": "SafetyStockQuantity",
"type": "Int64"
},
{
"name": "UnitCost",
"type": "Double"
},
{
"name": "DaysInStock",
"type": "Int64"
},
{
"name": "MinDayInStock",
"type": "Int64"
},
{
"name": "MaxDayInStock",
"type": "Int64"
}
],
"published": false,
"type": "AzureBlob",
"linkedServiceName": "Source-BlobStorage-dat230",
"typeProperties": {
"fileName": "*.txt.gz",
"folderPath": "dat230/{year}/{month}/{day}/",
"format": {
"type": "TextFormat",
"columnDelimiter": "\t",
"firstRowAsHeader": true
},
"partitionedBy": [
{
"name": "year",
"value": {
"type": "DateTime",
"date": "WindowStart",
"format": "yyyy"
}
},
{
"name": "month",
"value": {
"type": "DateTime",
"date": "WindowStart",
"format": "MM"
}
},
{
"name": "day",
"value": {
"type": "DateTime",
"date": "WindowStart",
"format": "dd"
}
}
],
"compression": {
"type": "GZip"
}
},
"availability": {
"frequency": "Day",
"interval": 7
},
"external": true,
"policy": {}
}
}
{
"name": "OutputDataset-dat230",
"properties": {
"structure": [
{
"name": "Date",
"type": "Datetime"
},
{
"name": "StoreID",
"type": "Int64"
},
{
"name": "StoreName",
"type": "String"
},
{
"name": "ProductID",
"type": "Int64"
},
{
"name": "ProductName",
"type": "String"
},
{
"name": "Color",
"type": "String"
},
{
"name": "Size",
"type": "String"
},
{
"name": "Manufacturer",
"type": "String"
},
{
"name": "OnHandQuantity",
"type": "Int64"
},
{
"name": "OnOrderQuantity",
"type": "Int64"
},
{
"name": "SafetyStockQuantity",
"type": "Int64"
},
{
"name": "UnitCost",
"type": "Double"
},
{
"name": "DaysInStock",
"type": "Int64"
},
{
"name": "MinDayInStock",
"type": "Int64"
},
{
"name": "MaxDayInStock",
"type": "Int64"
}
],
"published": false,
"type": "AzureDataLakeStore",
"linkedServiceName": "Destination-DataLakeStore-dat230",
"typeProperties": {
"fileName": "txt.gz",
"folderPath": "dat230/dataloads/{year}/{month}/{day}/factinventory/",
"format": {
"type": "TextFormat",
"columnDelimiter": "\t"
},
"partitionedBy": [
{
"name": "year",
"value": {
"type": "DateTime",
"date": "WindowStart",
"format": "yyyy"
}
},
{
"name": "month",
"value": {
"type": "DateTime",
"date": "WindowStart",
"format": "MM"
}
},
{
"name": "day",
"value": {
"type": "DateTime",
"date": "WindowStart",
"format": "dd"
}
}
]
},
"availability": {
"frequency": "Day",
"interval": 7
},
"external": false,
"policy": {}
}
}
You need to look at the time slices for the datasets and their activity.
The pipeline schedule (badly named) only defines the start and end of the period within which activities can provision and run their time slices.
ADF v1 doesn't use a recursive schedule like the SQL Server Agent. Each execution has to be provisioned at an interval on the timeline (the schedule) you create.
For example, if your pipeline start and end span one year, but your dataset and activity have a frequency of Month and an interval of 1, you will only get 12 executions of whatever the activity does.
Apologies, but the concept of time slices is a little difficult to explain if you aren't already familiar. Maybe read this post: https://blogs.msdn.microsoft.com/ukdataplatform/2016/05/03/demystifying-activity-scheduling-with-azure-data-factory/
Hope this helps.
Would you share the JSON for the datasets and the pipeline with us? It would be easier to help you with that.
In the meantime, check whether you are using "style": "StartOfInterval" in the scheduler property of the activity, and also check whether you are using an offset.
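For example, the activity's scheduler could be pinned to the intended start along these lines (a sketch; check the ADF v1 scheduling docs for the exact semantics of style, anchorDateTime and offset):
"scheduler": {
    "frequency": "Day",
    "interval": 7,
    "style": "StartOfInterval",
    "anchorDateTime": "2009-01-03T00:00:00Z"
}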
Cheers!