Azure Data Factory DataFlow Sink to Delta fails when schema changes (column data types and similar) - azure-data-factory

We have an Azure Data Factory dataflow that sinks into Delta. We have the Overwrite and Allow insert options set and Vacuum = 1.
When we run the pipeline over and over with no change in the table structure, the pipeline is successful.
But when the structure of the table being sinked changes, e.g. data types change, the pipeline fails with the error below.
Error code: DFExecutorUserError
Failure type: User configuration issue
Details: Job failed due to reason: at Sink 'ConvertToDelta': Job aborted.
We tried setting Vacuum to 0 and back, turning Merge schema on and off, and switching between Overwrite and Truncate and back, but the pipeline still failed.

Can you try enabling Delta Lake's schema evolution? By default, Delta Lake has schema enforcement enabled, which means a change to the schema of the source table is not allowed and results in an error.
Even with overwrite enabled, the overwrite will fail unless you enable schema evolution, because by default the schema cannot be changed.
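To illustrate the same behaviour outside of ADF, here is a minimal PySpark sketch (the path and the 'difficulty' column are placeholders for this example, and a Spark session with Delta Lake configured is assumed):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Assumes a Spark session with Delta Lake (delta-spark) configured.
spark = SparkSession.builder.getOrCreate()
path = "/tmp/delta/blocks"  # placeholder path for this example

# First run: 'difficulty' is written as a long.
spark.range(5).withColumn("difficulty", col("id")) \
    .write.format("delta").mode("overwrite").save(path)

# Second run: 'difficulty' is now a double. With schema enforcement (the
# default) this overwrite is rejected because the schema has changed.
changed = spark.range(5).withColumn("difficulty", col("id").cast("double"))
try:
    changed.write.format("delta").mode("overwrite").save(path)
except Exception as e:
    print("Rejected by schema enforcement:", e)

# Requesting schema evolution makes the same overwrite succeed. mergeSchema
# only adds new columns; for a changed data type, overwriteSchema (which
# replaces the table schema) is the relevant option.
changed.write.format("delta").mode("overwrite") \
    .option("overwriteSchema", "true").save(path)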

I created an ADLS Gen2 storage account, created input and output folders, and uploaded a parquet file into the input folder.
I created a pipeline and a dataflow as below.
I took the parquet file as the source.
(Screenshots: dataflow source, source dataset, and data preview of the source.)
I created a derived column transformation to change the structure of the table.
I changed the datatype of the 'difficulty' column from long to double using the expression below:
difficulty : toDouble(difficulty)
I changed the datatype of the 'transactions_len' column from integer to float using the expression below:
transactions_len : toFloat(transactions_len)
I changed the datatype of the 'number' column from long to string using the expression below:
number : toString(number)
(Screenshots: derived column settings and data preview of the derived column.)
I took Delta as the sink.
(Screenshots: dataflow sink, sink settings, and data preview of the sink.)
I ran the pipeline and it executed successfully.
The output was stored successfully in the output folder of my storage account.
The procedure worked on my machine; please recheck from your end.
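To confirm that the sink really carried the new data types, the Delta output folder can be read back and its schema printed. A small sketch, assuming a Spark session with Delta Lake available (the output path is a placeholder for the folder above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder for the ADLS Gen2 output folder used as the Delta sink above.
output_path = "abfss://output@<storageaccount>.dfs.core.windows.net/output"

df = spark.read.format("delta").load(output_path)
# After the derived column changes, 'difficulty' should show as double,
# 'transactions_len' as float and 'number' as string.
df.printSchema()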

The source (ingestion) parquet files were generated to Azure Blob with a specific filename. Whenever we generated the source parquet files without specifying a filename, only a directory, the sink worked.

Related

Delta Lake Data Load Datatype mismatch

I am loading data from SQL Server to Delta Lake tables. Recently I had to repoint the source to another table (same columns), but the data type is different in the new table. This is causing an error while loading data to the Delta table. I am getting the following error:
Failed to merge fields 'COLUMN1' and 'COLUMN1'. Failed to merge incompatible data types LongType and DecimalType(32,0)
The command I use to write data to the Delta table:
DF.write.mode("overwrite").format("delta").option("mergeSchema", "true").save("s3 path")
The only option I can think of right now is to set overwriteSchema to true.
But this will rewrite my target schema completely. I am concerned that any sudden change in the source schema will replace the existing target schema without any notification or alert.
Also, I can't explicitly convert these columns, because the Databricks notebook I am using is a parametrized one used to load data from source to target (we read all the details about the target table, source table, partition key, etc. from a CSV file).
Is there any better way to tackle this issue?
Any help is much appreciated!
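For reference, a minimal sketch of setting overwriteSchema to true, as considered above ("s3 path" is the placeholder path from the original command, not a real location):

# overwriteSchema replaces the existing target schema on overwrite,
# unlike mergeSchema, which only adds new columns.
(DF.write
   .mode("overwrite")
   .format("delta")
   .option("overwriteSchema", "true")
   .save("s3 path"))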

Azure Data Factory schema mapping not working with SQL sink

I have a simple pipeline that loads data from a csv file to an Azure SQL db.
I have added a data flow where I have ensured all schema matches the SQL table. I have a specific field which contains numbers with leading zeros. The data type in the source - projection is set to string. The field is mapped to the SQL sink showing as string data-type. The field in SQL has nvarchar(50) data-type.
Once the pipeline is run, all the leading zeros are lost and the field appears to be treated as decimal:
Original data: 0012345
Inserted data: 12345.0
The CSV data shown in the data preview is showing correctly, however for some reason it loses its formatting during insert.
Any ideas how I can get it to insert correctly?
I reproduced this in my lab and was able to load the data as expected. Please see the repro details below.
(Screenshots: source CSV file, sink SQL table, and the ADF data flow.)
Connect the data flow source to the CSV source file. As my file is in text format, all the source columns in the projection are strings.
(Screenshot: source data preview.)
Connect the sink to the Azure SQL database to load the data to the destination table.
(Screenshot: data in the Azure SQL database table.)
Note: You can also add a derived column before the sink to convert the value to a string, as the sink data type is a string.
Thank you very much for your response.
As per your post, the ADF dataflow appears to be working correctly. I have finally discovered an issue with the transformation: I have an Azure Batch service which runs a Python script that does a basic transformation and saves the output to a CSV file.
Interestingly, when I preview the data in the dataflow, it looks as expected. However, the values stored in SQL are not.
For the sake of others having a similar issue: my existing Python script converted a 'float' datatype column to string type. Upon conversion it retained one decimal digit, and since all of my numbers are integers, they ended up with a trailing .0.
The solution was to convert values to integer and then to string:
df['col_name'] = df['col_name'].astype('Int64').astype('str')
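A self-contained sketch of the behaviour described above (the column name and values are made up for the example):

import pandas as pd

df = pd.DataFrame({"col_name": [12345.0, 67890.0]})

# Converting the float column straight to string keeps the trailing ".0" ...
print(df["col_name"].astype("str").tolist())                  # ['12345.0', '67890.0']

# ... while going through a nullable integer type first drops it.
print(df["col_name"].astype("Int64").astype("str").tolist())  # ['12345', '67890']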

Need recommendation in adf pipeline source properties while loading delimited text files from azure blob to snowflake

We are trying to load a delimited file, located in Azure Blob, which has blank data for a few columns, and we would like to get a value like NA in our target Snowflake table whenever we encounter a blank value in the source CSV file. We have been trying to provide NA against the Null value option, but it is not working. Any suggestions?
Here is a screenshot of what I mentioned above.
I used a data flow activity in Azure Data Factory to resolve this issue.
(Screenshot: source file with a NULL value in the “Name” column.)
Now use a Derived Column transformation. In the derived column's settings, select the column name and use the expression iifNull({Name}, 'NA').
In the data preview, the NULL value in the Name column is replaced with NA.
You can follow the above steps to replace NULL values and sink the data from Blob storage to Snowflake.
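If the file is instead pre-processed in Python before loading (as in the pandas answer earlier in this thread), the same replacement can be sketched like this (the column name is taken from the example above):

import pandas as pd

df = pd.DataFrame({"Name": ["Alice", None, "Bob"]})

# Equivalent of the iifNull({Name}, 'NA') derived column: replace nulls with 'NA'.
df["Name"] = df["Name"].fillna("NA")
print(df)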

How to transform data type in Azure Data Factory

I would like to copy data from a local CSV file to SQL Server in Azure Data Factory. The table in SQL Server is already created. The local CSV file is exported from MySQL.
When I use Copy data in Azure Data Factory, there is an error: "Exception occurred when converting value 'NULL' for column name 'deleted' from type 'String' to type 'DateTime'. The string was not recognized as a valid DateTime."
What I have done:
I checked that the original value in the 'deleted' column is NULL, without quotes (i.e. not 'NULL').
I cannot change the data type in the file format settings. The data type for all columns is preset to string by default.
I tried to create a data flow instead of the Copy data activity. I can change the data type in the source projection, but the sink dataset cannot select SQL Server.
What can I do to copy data from a CSV file to SQL Server via Azure Data Factory?
Data Flow doesn't support on-premises SQL Server, so we can't create the source and sink there.
You can use the Copy activity or the Copy Data tool to do that. I made example data in which 'deleted' is NULL:
As you said, the 'deleted' column data is NULL or contains NULL, and all of it will be treated as String. The key is whether your sink SQL Server table schema allows NULL.
I tested many times and it all worked well.

AWS DMS CDC task does not detect column name and type changes

I have created a CDC task that captures changes in a source PostgreSQL schema and writes them in Parquet format into a target S3 bucket. The task captures the inserts, updates and deletes correctly but fails to capture column name and type changes in the source.
When I change a column name or type of a table in the source and insert new rows to the table, the resulting Parquet file uses the old column name and type.
Is there a specific configuration I am missing, or is it not possible to achieve the desired outcome with this task in DMS?
If you change a column at the source, DMS will pick it up automatically from the source and update it at the destination. Check your DMS settings; you do not need to add the column manually at the destination.
Make sure you have the HandleSourceTableAltered parameter set to true in the task settings.[1] (The setting applies when the target metadata parameter BatchApplyEnabled is set to either true or false.)
Same goes for HandleSourceTableDropped or HandleSourceTableTruncated if this is relevant in your case.
Obviously, previously replicated Parquet files on S3 will not change to reflect this DDL change on the source.
[1] https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TaskSettings.DDLHandling.html
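A sketch of applying these settings with boto3; the task ARN is a placeholder, and the "ChangeProcessingDdlHandlingPolicy" block is an assumption based on the DDL-handling documentation linked at [1], so check it against your task's full settings:

import json
import boto3

dms = boto3.client("dms")

# DDL-handling settings named in the documentation linked at [1].
# In practice, fetch the task's current settings first and merge this block
# into them; the task must be stopped before its settings can be modified.
ddl_handling = {
    "ChangeProcessingDdlHandlingPolicy": {
        "HandleSourceTableAltered": True,
        "HandleSourceTableDropped": True,
        "HandleSourceTableTruncated": True,
    }
}

dms.modify_replication_task(
    ReplicationTaskArn="arn:aws:dms:...:task:EXAMPLE",  # placeholder ARN
    ReplicationTaskSettings=json.dumps(ddl_handling),
)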