Column defined in source Dataset could not be found in the actual source - azure-data-factory

I have an ADF Copy Data flow and I'm getting the following error at runtime:
My source is defined as follows:
In my data set, the column is defined as shown below:
As you can see from the second image, the column IsLiftStation is defined in the source. Any idea why ADF cannot find the column?

I've had the same error. You can solve this by either selecting all columns (*) in the source and then mapping those you want to the sink schema, or by 'clearing' the mapping in which case the ADF Copy component will auto map to columns in the sink schema (best if columns have the same names in source and sink). Either of these approaches works.
Unfortunately, clicking the import schema button in the mapping tab doesn't work. It does produce the correct column mappings based on the columns in the source query but I still get the original error 'the column could not be located in the actual source' after doing this mapping.

Could you check whether there is a column named 'ae_type_id' in your schema? If there is, could you remove that column and try again? The columns in the schema must be aligned with the columns in the query.

The issue is caused by an incomplete schema in one of the data sources. My solution is:
Step through the data flow, selecting the first schema, and Import projection.
Go to the flow and Data Preview.
Repeat for each step.
In my case, there were trailing commas in one of the CSV files. This caused automatically generated column names to be created on import, which let me identify and fix the data file.

Related

Delta Lake Data Load Datatype mismatch

I am loading data from SQL Server into Delta Lake tables. Recently I had to repoint the source to another table (same columns), but the data type is different in the new table. This causes an error while loading data into the Delta table:
Failed to merge fields 'COLUMN1' and 'COLUMN1'. Failed to merge incompatible data types LongType and DecimalType(32,0)
The command I use to write data to the Delta table:
DF.write.mode("overwrite").format("delta").option("mergeSchema", "true").save("s3 path")
The only option I can think of right now is to set overwriteSchema to true.
But this will rewrite my target schema completely. I am concerned that any sudden change in the source schema would then replace the existing target schema without any notification or alert.
Also, I can't explicitly convert these columns, because the Databricks notebook I am using is a parameterized one used to load data from source to target (we read all the details about the target table, source table, partition key, etc. from a CSV file).
Is there any better way to tackle this issue?
Any help is much appreciated!
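One hedged option (a minimal sketch, assuming the target Delta table already exists at the save path, the notebook has an active spark session and the source DataFrame DF from above, and the new source table really has the same column names; the path below is a placeholder): read the target table's schema at runtime and cast the matching source columns to it before writing, so the types line up without turning on overwriteSchema.

from pyspark.sql import functions as F
from delta.tables import DeltaTable

target_path = "s3://bucket/delta/target_table"  # placeholder for the real save path

# Read the schema of the Delta table that already exists at the destination.
target_schema = DeltaTable.forPath(spark, target_path).toDF().schema

# Cast each source column to the target's data type, so e.g. a DecimalType(32,0)
# source column is written as the LongType the existing table expects.
aligned_df = DF.select([
    F.col(field.name).cast(field.dataType)
    for field in target_schema.fields
    if field.name in DF.columns
])

aligned_df.write.mode("overwrite").format("delta").option("mergeSchema", "true").save(target_path)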

How to format the negative values in dataflow?

I have the below column in my table:
I need the output as below:
I am using a Data Flow in Azure Data Factory and am unable to get the above output. I used a derived column but with no success. I used the replace function, but it doesn't come out correctly. Can anyone advise how to format this in a data flow?
The source is taken in the data flow with the data as in the image below.
A derived column transformation is added next to the source.
A new column is added and the expression is given as:
iif(left(id,1)=='-', replace(replace(id,"USD",""),"-","-$"), concat("$", replace(id,"USD","")))
Output of Derived Column activity
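For reference, this is the same formatting logic expressed as a small, hypothetical Python function (assuming input values like "-123USD" or "123USD", as implied by the expression above), just to make the string handling explicit:

def format_usd(value: str) -> str:
    # Drop the "USD" suffix and add a "$" prefix, keeping any minus sign in front.
    number = value.replace("USD", "")
    if number.startswith("-"):
        return "-$" + number[1:]   # "-123USD" -> "-$123"
    return "$" + number            # "123USD"  -> "$123"

print(format_usd("-123USD"))  # -$123
print(format_usd("123USD"))   # $123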

Guid additional column is always the same

I have a Copy Data task where I am adding an additional column called "Id" whose value is @guid(). The problem is that for every row it imports, the GUID value is always the same, and the destination/sink throws a primary key violation.
Additional column definition
The copy activity will use the same guid() value for all rows if you use an additional column.
To get a unique guid() for each row, you can follow the demonstration below.
First, give your source data to a Lookup activity and pass its output to a ForEach activity.
This is my sample source data in CSV format; give this to the lookup.
source.csv:
name
"Rakesh"
"Laddu"
"Virat"
"John"
Use another dummy dataset and give it any single value. Use this in the copy activity.
Dummy.csv:
name
"Rakesh"
ForEach activity:
Inside the ForEach, use a Copy activity and give it the dummy dataset. Create additional columns and supply our source data (@item().name) and @guid().
Copy activity:
Now, in the sink, give your database dataset. Here, as a sample, I have used an Azure SQL Database table.
Go to the mapping of the copy activity and click Import schemas. Give any string value for it so that it imports the schemas of the source (here, the dummy schema) and the sink.
After the above, you will get something like this; here, map the additional columns we created to the database columns.
Pipeline Execution:
After executing the pipeline, you get the desired output with unique rows.
Output:
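Conceptually, this works because the ForEach re-evaluates the expression on every iteration, whereas a single copy run evaluates the additional column once. A rough, purely illustrative Python analogy using the sample names above:

import uuid

names = ["Rakesh", "Laddu", "Virat", "John"]

# Evaluating the GUID once and reusing it gives every row the same value.
shared_id = uuid.uuid4()
same_for_all = [(name, shared_id) for name in names]

# Evaluating it inside the loop (as the ForEach does) gives a new value per row.
unique_per_row = [(name, uuid.uuid4()) for name in names]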

Matching the columns in a source file with the sink table columns using Azure Data Factory

I have an Azure Data Factory trigger that fires when a file is placed in blob storage; the trigger starts a pipeline execution and passes the file name to the data flow activity. I would like to make sure that all the column names from the header row in the file are in the sink table. There is an identity column in the sink table that should not be part of the comparison. I'm not sure how to tackle this task. I've read about the 'derived column' transformation; is that the route I should take?
You can select or filter which columns end up in the sink dataset or table by using "Field mapping". You can optionally use a "derived column" transformation; however, the "sink transformation" gives you this by default, set to "Auto mapping". Here you can add or remove the columns that are written to the sink.
In the example below, the column "id" can be treated as similar to the "Identity" column in your table. Assuming all the files have the same columns:
Once you have modified it as per your needs, you can confirm the result from the "Inspect" tab before the run.
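If you also want an explicit check that every header column in the file exists in the sink table (ignoring the identity column), a minimal Python sketch of that comparison might look like this; the file name, sink columns, and identity column name below are placeholders:

import csv

file_name = "incoming_file.csv"                       # placeholder: comes from the trigger in practice
sink_columns = {"RowId", "CustomerName", "Quantity"}  # placeholder: sink table columns
identity_column = "RowId"                             # excluded from the comparison

# Read the header row of the file and compare it against the sink columns.
with open(file_name, newline="") as f:
    header_columns = set(next(csv.reader(f)))

missing = header_columns - (sink_columns - {identity_column})
if missing:
    raise ValueError(f"Columns in the file but not in the sink table: {missing}")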
Strategy:
Use two ADF pipelines: one to get the list of all files, and another to process each file, copying its content into a specific SQL table.
Setup:
I've created 4 CSV files, following the pattern you need, "[CustomerID]_[TableName]_[FileID].csv", and 4 SQL tables, one for each type of file.
A_inventory_0001.csv: inventory records for customer A, to be inserted into the SQL table "A_Inventory".
A_sales_0003.csv: sales records for customer A, to be inserted into the SQL table "A_Sales".
B_inventory_0002.csv: inventory records for customer B, to be inserted into the SQL table "B_Inventory".
B_sales_0004.csv: sales records for customer B, to be inserted into the SQL table "B_Sales".
Linked Services
In Azure Data Factory, the following linked services were created using Key Vault (Key Vault is optional).
Datasets
The following datasets were created. Note we have created some parameters to allow the pipeline to specify the source file and the destination SQL table.
The dataset “AzureSQLTable” has a parameter to specify the name of the destination SQL table.
The dataset “DelimitedTextFile” has a parameter to specify the name of the source CSV file.
The dataset “DelimitedTextFiles” has no parameter because it will be used to list all files from source folder.
Pipelines
The first pipeline “Get Files” will get the list of CSV files from source folder (Get Metadata activity), and then, for each file, call the second pipeline passing the CSV file name as a parameter.
Inside the foreach loop, there is a call to the second pipeline “Process File” passing the file name as a parameter.
The second pipeline has a parameter “pFileName” to receive the name of the file to be processed and a variable to calculate the name of the destination table based on the file name.
The first activity uses a split on the file name to extract the parts we need to compose the destination table name.
In the expression below, we split the file name on the "_" separator and then use the first and second parts to compose the destination table name.
@concat(string(split(pipeline().parameters.pFileName, '_')[0]), '_', string(split(pipeline().parameters.pFileName, '_')[1]))
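The same split logic, shown as a tiny Python snippet just to make the indexing concrete (using one of the sample file names above):

file_name = "A_inventory_0001.csv"
parts = file_name.split("_")           # ["A", "inventory", "0001.csv"]
table_name = f"{parts[0]}_{parts[1]}"  # "A_inventory"
print(table_name)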
The second activity then copies the file from the source "pFileName" to the destination table "vTableName" using dynamic mapping, i.e. not adding specific column names, as this will be dynamic.
The files I used in this example and the ADF code are available here:
https://github.com/diegoeick/stack-overflow/tree/main/69340699
I hope this will resolve your issue.
In case you still need to save the CustomerID and FileID in the database tables, you can use dynamic mapping with the available parameters (the file name) and create a JSON with the dynamic mapping in the mapping tab of your copy activity. You can find more details here: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping#parameterize-mapping

Azure ADF Copy Activity with Trailing Column Delimiter

I have a strange source CSV file that contains a trailing column delimiter at the end of each record, just before the carriage return/new line.
When ADF previews this data, it displays only 2 columns and all the data rows without issue. However, the copy activity fails with the following exception.
ErrorCode=DelimitedTextColumnNameNotAllowNull,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=The name of column index 3 is empty. Make sure column name is properly specified in the header
Now, I understand why it complains about this because of the trailing delimiter, but my question is whether there is a way to deal with this condition. I've tried including the trailing comma in the record delimiter (,\r\n), but then it just pivots the data so that all the columns become rows.
Is there a way to address this condition in the copy activity?
When previewing the data in the dataset, it seems correct:
But in the copy activity, the data is actually split into 3 columns by the column delimiter ","; the third column is empty or NULL. This causes the error.
If you use the Data Flow import projection from the source, you can see the third column:
For now, the copy activity doesn't support modifying the data schema. You must use a Data Flow Derived Column to create a new schema for the source. For example:
Then mapping the new column/schema to the sink will solve the problem.
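If you were handling the same file in a Spark notebook instead, the equivalent clean-up is just dropping the spurious column; this is only an illustrative sketch (the path is a placeholder, and it assumes the trailing delimiter arrives as an extra, all-null column):

from pyspark.sql import functions as F

df = (spark.read
      .option("header", "true")
      .csv("/path/to/source.csv"))   # placeholder path

# Drop any column that contains no values at all, e.g. the one created by the
# trailing "," at the end of every record.
empty_columns = [c for c in df.columns
                 if df.filter(F.col(c).isNotNull()).limit(1).count() == 0]
cleaned = df.drop(*empty_columns)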
HTH.
Use a different encoding for your CSV. CSV UTF-8 will do the trick.