Mapping multiple csv files in Data Factory - azure-data-factory

Does anyone know if there is a way to pass a schema mapping to multiple csv without doing it manually? I have 30 csv passed through a data flow in a foreach activity, so I can't detect or set fields's type. (Because i could only for the first)
Thanks for your help! :)

A Copy Activity mapping can be parameterized and changed at runtime if explicit mapping is required. The parameter is just a json object that you'd pass in for each of the files you are processing. It looks something like this:
{
"type": "TabularTranslator",
"mappings": [
{
"source": {
"name": "Id"
},
"sink": {
"name": "CustomerID"
}
},
{
"source": {
"name": "Name"
},
"sink": {
"name": "LastName"
}
},
{
"source": {
"name": "LastModifiedDate"
},
"sink": {
"name": "ModifiedDate"
}
}
]
}
You can read more about it here: Schema and data type mapping in copy activity
So, you can either pre-generate these mapping and fetch them via a lookup in a previous step in the pipeline or if they need to be dynamic you an create them at runtime with code (e.g. have an Azure Function that looks up the current schema of the CSV and returns a properly formatted translator object).
Once you have the object as a parameter you can pass it to the copy activity. On the mapping properties of the copy activity you just Add Dynamic Content and select the appropriate parameter. It will look something like this:

Related

ADF - Loop through a large JSON file in a dataflow

We currently receive some metadata information from a third party supplier in the form of a JSON file.
The JSON file contains definitions of some tables which need to be loaded into SQL via ADF.
The JSON file looks like this, it's a list of tables and their data types
"Tables": [
{
"name": "account",
"description": "account",
"$type": "LocalEntity",
"attributes": [
{
"dataType": "guid",
"maxLength": "-1",
"name": "Id"
},
{
"dataType": "string",
"maxLength": "250",
"name": "name"
}
]
},
{
"name": "customer",
"description": "account",
"$type": "LocalEntity",
"attributes": [
{
"dataType": "guid",
"maxLength": "-1",
"name": "Id"
},
{
"dataType": "string",
"maxLength": "100",
"name": "name"
}
]
}
]
What we need to do is to loop through this JSON and via an ADF data flow we create the required tables in the destination database.
We initially designed the Pipeline with a lookup activity that loads the JSON file then pass the output to a foreach loop. This worked really well when we had only a small JSON file but as we started using real data, the JSON file was over the limit of 4MB resulting in the lookup activity throwing an error.
We then tried using a mapping dataflow by loading the JSON as a source, then setting the sink as a cache and outputting this to an output variable which we then loop through but again this works with smaller datasets but as soon as the dataset is large enough it can't parse it to an output.
I am sure this should be easy to do but just can't get my head around it!
Here is the sample procedure to loop through large JSON file in a Dataflows.
Create a Linked service and dataset to the json file path.
Provision the dataset to the source in the dataflows.
By the flatten formatter will get the input columns from the source through Unroll by option with required input.
Create linked service and dataset to the sink path.
Attach the data flow work item to the Data Flow activity.
Will get result as per the expectations in the sql db.

Kafka Connect JSON Schema does not appear to support "$ref" tags

I am using Kafka Connect with JSONSchema and am in a situation where I need to convert the JSON schema manually (to "Schema") within a Kafka Connect plugin. I can successfully retrieve the JSON Schema from the Schema Registry and am successful converting with simple JSON Schemas but I am having difficulties with ones that are complex and have valid "$ref" tags referencing components within a single JSON Schema definition.
I have several questions:
The JsonConverter.java does not appear to handle "$ref". Am I correct, or does it handle it in another way elsewhere?
Does the Schema Registry handle the referencing of sub-definitions? If yes, is there code that shows how the dereferencing is handled?
Should the JSON Schema be resolved to a string without references (ie. inline the references) before submitting to the Schema Registry and thereby remove the "$ref" issue?
I am looking at the Kafka Source code module JsonConverter.java below:
https://github.com/apache/kafka/blob/trunk/connect/json/src/main/java/org/apache/kafka/connect/json/JsonConverter.java#L428
An example of the complex schema (taken from the JSON Schema site) is shown below (notice the "$ref": "#/$defs/veggie" tag the references a later sub-definition)
{
"$id": "https://example.com/arrays.schema.json",
"$schema": "https://json-schema.org/draft/2020-12/schema",
"description": "A representation of a person, company, organization, or place",
"title": "complex-schema",
"type": "object",
"properties": {
"fruits": {
"type": "array",
"items": {
"type": "string"
}
},
"vegetables": {
"type": "array",
"items": { "$ref": "#/$defs/veggie" }
}
},
"$defs": {
"veggie": {
"type": "object",
"required": [ "veggieName", "veggieLike" ],
"properties": {
"veggieName": {
"type": "string",
"description": "The name of the vegetable."
},
"veggieLike": {
"type": "boolean",
"description": "Do I like this vegetable?"
}
}
}
}
}
Below is the actual schema returned from the Schema Registry after it the schema was successfully registered:
[
{
"subject": "complex-schema",
"version": 1,
"id": 1,
"schemaType": "JSON",
"schema": "{\"$id\":\"https://example.com/arrays.schema.json\",\"$schema\":\"https://json-schema.org/draft/2020-12/schema\",\"description\":\"A representation of a person, company, organization, or place\",\"title\":\"complex-schema\",\"type\":\"object\",\"properties\":{\"fruits\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}},\"vegetables\":{\"type\":\"array\",\"items\":{\"$ref\":\"#/$defs/veggie\"}}},\"$defs\":{\"veggie\":{\"type\":\"object\",\"required\":[\"veggieName\",\"veggieLike\"],\"properties\":{\"veggieName\":{\"type\":\"string\",\"description\":\"The name of the vegetable.\"},\"veggieLike\":{\"type\":\"boolean\",\"description\":\"Do I like this vegetable?\"}}}}}"
}
]
The actual schema is embedded in the above returned string (the contents of the "schema" field) and contains the $ref references:
{\"$id\":\"https://example.com/arrays.schema.json\",\"$schema\":\"https://json-schema.org/draft/2020-12/schema\",\"description\":\"A representation of a person, company, organization, or place\",\"title\":\"complex-schema\",\"type\":\"object\",\"properties\":{\"fruits\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}},\"vegetables\":{\"type\":\"array\",\"items\":{\"$ref\":\"#/$defs/veggie\"}}},\"$defs\":{\"veggie\":{\"type\":\"object\",\"required\":[\"veggieName\",\"veggieLike\"],\"properties\":{\"veggieName\":{\"type\":\"string\",\"description\":\"The name of the vegetable.\"},\"veggieLike\":{\"type\":\"boolean\",\"description\":\"Do I like this vegetable?\"}}}}}
Again, the JsonConverter in the Apache Kafka source code has no notion of JSONSchema, therefore, no, $ref doesn't work and it also doesn't integrate with the Registry.
You seem to be looking for the io.confluent.connect.json.JsonSchemaConverter class + logic

Using ADF Copy Activity with dynamic schema mapping

I'm trying to drive the columnMapping property from a database configuration table. My first activity in the pipeline pulls in the rows from the config table. My copy activity source is a Json file in Azure blob storage and my sink is an Azure SQL database.
In copy activity I'm setting the mapping using the dynamic content window. The code looks like this:
"translator": {
"value": "#json(activity('Lookup1').output.value[0].ColumnMapping)",
"type": "Expression"
}
My question is, what should the value of activity('Lookup1').output.value[0].ColumnMapping look like?
I've tried several different json formats but the copy activity always seems to ignore it.
For example, I've tried:
{
"type": "TabularTranslator",
"columnMappings": {
"view.url": "url"
}
}
and:
"columnMappings": {
"view.url": "url"
}
and:
{
"view.url": "url"
}
In this example, view.url is the name of the column in the JSON source, and url is the name of the column in my destination table in Azure SQL database.
The issue is due to the dot (.) sign in your column name.
To use column mapping, you should also specify structure in your source and sink dataset.
For your source dataset, you need specify your format correctly. And since your column name has dot, you need specify the json path as following.
You could use ADF UI to setup a copy for a single file first to get the related format, structure and column mapping format. Then change it to lookup.
And as my understanding, your first format should be the right format. If it is already in json format, then you may not need use "json" function in your expression.
There seems to be a disconnect between the question and the answer, so I'll hopefully provide a more straightforward answer.
When setting this up, you should have a source dataset with dynamic mapping. The sink doesn't require one, as we're going to specify it in the mapping.
Within the copy activity, format the dynamic json like the following:
{
"structure": [
{
"name": "Address Number"
},
{
"name": "Payment ID"
},
{
"name": "Document Number"
},
...
...
]
}
You would then specify your dynamic mapping like this:
{
"translator": {
"type": "TabularTranslator",
"mappings": [
{
"source": {
"name": "Address Number",
"type": "Int32"
},
"sink": {
"name": "address_number"
}
},
{
"source": {
"name": "Payment ID",
"type": "Int64"
},
"sink": {
"name": "payment_id"
}
},
{
"source": {
"name": "Document Number",
"type": "Int32"
},
"sink": {
"name": "document_number"
}
},
...
...
]
}
}
Assuming these were set in separate variables, you would want to send the source as a string, and the mapping as json:
source: #string(json(variables('str_dyn_structure')).structure)
mapping: #json(variables('str_dyn_translator')).translator
VladDrak - You could skip the source dynamic definition by building dynamic mapping like this:
{
"translator": {
"type": "TabularTranslator",
"mappings": [
{
"source": {
"type": "String",
"ordinal": "1"
},
"sink": {
"name": "dateOfActivity",
"type": "String"
}
},
{
"source": {
"type": "String",
"ordinal": "2"
},
"sink": {
"name": "CampaignID",
"type": "String"
}
}
]
}
}

Azure Data Factory: Parameterized the Folder and file path

Environments
Azure Data Factory
Scenario
I have ADF pipeline which reads the data from On premise server and writes the data to azure data lake.
For the same - I have provided Folder structure in ADF*(dataset)*as follows
Folder Path : - DBName/RawTables/Transactional
File Path : - TableName.csv
Problem
Is it possible to parameterized the Folder name or file path ? Basically - if tomorrow - I want to change the folder path*(without deployment)* then we should be updating the metadata or table structure.
So the short answer here is no. You won't be able to achieve this level of dynamic flexibility with ADF on its own.
You'll need to add new defined datasets to your pipeline as inputs for folder changes. In Data Lake you could probably get away with a single stored procedure that accepts a parameter for the file path which could be reused. But this would still require tweaks to the ADF JSON when calling the proc.
Of course, the catch all situation here is to use an ADF custom activity and write a C# class with methods to do whatever you need. Maybe overkill though and lots of effort to setup the authentication to data lake store.
Hope this gives you a steer.
Mangesh, why don't you try the .Net custom activity in ADF. This custom activity will be your first activity that will potentially check for the processed folder and if the processed folder is present it will move that to History(say) folder. As, ADF is a platform for data movement and data transformation, it doesn't deal with the IO activity. You can learn more about the .Net custom activity at:
https://learn.microsoft.com/en-us/azure/data-factory/data-factory-use-custom-activities
What you want to do is possible with the new Lookup activity in Azure Data Factory V2. Documentation is here: Lookup Activity.
A JSON example would be something like this:
{
"name": "LookupPipelineDemo",
"properties": {
"activities": [
{
"name": "LookupActivity",
"type": "Lookup",
"typeProperties": {
"dataset": {
"referenceName": "LookupDataset",
"type": "DatasetReference"
}
}
},
{
"name": "CopyActivity",
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from #{activity('LookupActivity').output.tableName}"
},
"sink": {
"type": "BlobSink"
}
},
"dependsOn": [
{
"activity": "LookupActivity",
"dependencyConditions": [ "Succeeded" ]
}
],
"inputs": [
{
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "SinkDataset",
"type": "DatasetReference"
}
]
}
]
}
}

Create large text field in Eloqua using REST API

I was trying to create custom object and corresponding fields in Eloqua. While creating a field with datatype largeText it throws validation error. I can create fields with datatypes like date, text, numeric etc. How can I create largeText fields?
This is my request body
{
"type": "CustomObject",
"description": "TestObject",
"name": "TestObject",
"fields": [
{
"type": "CustomObjectField",
"name": "Description",
"dataType": "largeText",
"displayType": "text"
}
]
}
Response is [Status=Validation error, StatusCode=400]
You should use "displayType":"textArea"for creating largeText fields.