Delta Lake Data Load Datatype mismatch - pyspark

I am loading data from SQL Server into Delta Lake tables. Recently I had to repoint the source to another table (same columns), but the data types are different in the new table. This causes an error while loading data into the Delta table. I get the following error:
Failed to merge fields 'COLUMN1' and 'COLUMN1'. Failed to merge incompatible data types LongType and DecimalType(32,0)
The command I use to write data to the Delta table:
DF.write.mode("overwrite").format("delta").option("mergeSchema", "true").save("s3 path")
The only option I can think of right now is to set overwriteSchema to true.
But this will rewrite my target schema completely. I am concerned that any sudden change in the source schema would replace the existing target schema without any notification or alert.
Also, I can't explicitly convert these columns because the Databricks notebook I am using is a parameterized one used to load data from source to target (we read the details of the target table, source table, partition key, etc. from a CSV file).
Is there any better way to tackle this issue?
Any help is much appreciated!
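One possible way to get the notification described above is to compare column types before writing and only fall back to overwriteSchema deliberately. The following is a minimal sketch, not a tested solution: the function name and the sample schemas are hypothetical, and it works on plain lists of (column, type) pairs, which is the shape PySpark's DataFrame.dtypes returns.

```python
from typing import Dict, List, Tuple

# Same shape as PySpark's DataFrame.dtypes: [(column_name, type_string), ...]
Schema = List[Tuple[str, str]]


def find_type_mismatches(source: Schema, target: Schema) -> Dict[str, Tuple[str, str]]:
    """Return {column: (source_type, target_type)} for columns whose types differ."""
    target_types = {name.lower(): dtype for name, dtype in target}
    mismatches = {}
    for name, dtype in source:
        existing = target_types.get(name.lower())
        # Columns that are new in the source are ignored here; only type conflicts are reported.
        if existing is not None and existing != dtype:
            mismatches[name] = (dtype, existing)
    return mismatches


# Hypothetical schemas that mirror the types named in the error message.
source_schema = [("COLUMN1", "decimal(32,0)"), ("COLUMN2", "string")]
target_schema = [("COLUMN1", "bigint"), ("COLUMN2", "string")]

conflicts = find_type_mismatches(source_schema, target_schema)
if conflicts:
    # In the notebook, this is where an alert could be sent or the load stopped.
    print(f"Schema drift detected: {conflicts}")
```

In a parameterized notebook, the two lists could come from DF.dtypes and from the dtypes of a DataFrame read from the existing Delta path, so no column names need to be hard-coded.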

Related

Azure Data Factory DataFlow Sink to Delta fails when schema changes (column data types and similar)

We have an Azure Data Factory data flow that sinks into Delta. We have the Overwrite and Allow Insert options set, and Vacuum = 1.
When we run the pipeline repeatedly with no change in the table structure, the pipeline is successful.
But when the structure of the table being written changes (for example, data types change), the pipeline fails with the error below.
Error code: DFExecutorUserError
Failure type: User configuration issue
Details: Job failed due to reason: at Sink 'ConvertToDelta': Job aborted.
We tried setting Vacuum to 0 and back, enabling Merge Schema, and switching between Overwrite and Truncate back and forth; the pipeline still failed.
Can you try enabling Delta Lake's schema evolution? By default, Delta Lake has schema enforcement enabled, which means a change to the source table's schema is not allowed and results in an error.
Even with overwrite enabled, the overwrite will fail unless you specify schema evolution, because by default the schema cannot be changed.
I created an ADLS Gen2 storage account, created input and output folders, and uploaded a Parquet file into the input folder.
I then created a pipeline with a data flow that uses the Parquet file as the source.
[Screenshots: data flow source, source dataset, source data preview]
I added a derived column transformation to change the structure of the table.
I changed the data type of the 'difficulty' column of the Parquet file from long to double using the expression below:
difficulty : toDouble(difficulty)
I changed the data type of the 'transactions_len' column from integer to float using the expression below:
transactions_len : toFloat(transactions_len)
I changed the data type of the 'number' column from long to string using the expression below:
number : toString(number)
[Screenshots: derived column settings, derived column data preview]
I used Delta as the sink.
[Screenshots: data flow sink, sink settings, sink data preview]
I ran the pipeline and it executed successfully.
The output was stored in the output folder of my storage account.
The procedure worked on my end; please recheck on yours.
The source (ingestion) was written to Azure Blob storage with a specific filename. Whenever we generated the source Parquet files by specifying only a directory, without a specific filename, the sink worked.

How do I map data in Data Factory (source: sqlwh, destination: blob)?

My source is SQL DB.
Sink: Blob.
The SQL table has columns.
The target file that I am creating in Blob initially has no header. The customer has given some predefined names, so the data from the SQL columns should be mapped to those fields.
In the copy activity mapping, I need to map with the proper data types and names that the customer has given.
The default mapping comes up, but I need to map as I stated.
How can I resolve this? Can someone help me?
You can simply edit the sink header names, since it's a TSV anyway.
To address data type mapping, see "Data type mapping":
Currently such data type conversion is supported when copying between
tabular data. Hierarchical sources/sinks are not supported, which
means there is no system-defined data type conversion between source
and sink interim types.

Can't use Data Explorer as a sink in Data Flow

I'm trying to do a Data Flow using ADL1 as the source and Data Explorer as the sink; I can create the source but when I select Dataset for Sink Type the only available options in the Dataset pulldown are my ADL1 Datasets. If I use Data Copy instead I can choose Data Explorer as a sink but this won't work as Data Copy won't allow null values into Data Explorer number data types. Any insight on how to fix this?
I figured out a workaround. First I Data Copy the csv file into a staging table where all columns are strings. Then I Data Copy from staging table to production table using a KQL query that converts strings to their destination data types.
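The workaround above does the conversion in a KQL query; the sketch below only illustrates the same two-stage idea in Python and is not the KQL used. The column names, sample rows, and the convert_row helper are hypothetical. Every staged value is a string, and empty strings become nulls instead of failing the numeric conversion.

```python
from typing import Callable, Dict, List, Optional

Converter = Callable[[str], object]


def convert_row(row: Dict[str, str], converters: Dict[str, Converter]) -> Dict[str, Optional[object]]:
    """Convert one all-string staging row to destination types, mapping blanks to None."""
    typed: Dict[str, Optional[object]] = {}
    for column, raw in row.items():
        if raw is None or raw.strip() == "":
            # An empty staging value becomes a null rather than a conversion error.
            typed[column] = None
        else:
            # Columns without a registered converter stay as strings.
            typed[column] = converters.get(column, str)(raw)
    return typed


# Hypothetical staging rows: everything arrives as a string.
staging_rows: List[Dict[str, str]] = [
    {"id": "1", "reading": "3.5"},
    {"id": "2", "reading": ""},
]
converters: Dict[str, Converter] = {"id": int, "reading": float}

production_rows = [convert_row(row, converters) for row in staging_rows]
print(production_rows)
```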

BigQuery Transfer Service UI - run_date parameter

Has anybody had any success in applying a run_date parameter when creating a transfer in BigQuery using the Transfer Service UI?
I'm taking a CSV file from Google Cloud Storage and I want to mirror this into my ingestion-date partitioned table, table_a.
Initially I set the destination table as table_a, which resulted in the following message in the job log:
Partition suffix has to be set for date-partitioned tables. Please recreate your transfer config with a valid table name. For example, to load new files to partition of the run date, specify your table name as transferTest${run_date} for daily partitioning or transferTest${run_time|"%Y%m%d%H"} for hourly partitioning.
I then set the destination to table_a$(run_date), which then issues the warning:
Invalid table name. Please use only alphabetic, numeric characters or underscore with supported parameters wrapped in brackets.
However it won't accept table_a_(run_date) either - could anyone please advise?
best wishes
Dave
Apologies - I've identified the correct syntax now:
table_a_{run_date}
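To illustrate what that template resolves to, here is a small Python sketch. It assumes that {run_date} expands to the run's date formatted as YYYYMMDD; the resolve_table_name helper and the example timestamp are hypothetical and are not part of the Transfer Service.

```python
from datetime import datetime, timezone


def resolve_table_name(template: str, run_time: datetime) -> str:
    """Expand {run_date} in a destination table template."""
    # Assumption: {run_date} is the run's date formatted as YYYYMMDD.
    return template.replace("{run_date}", run_time.strftime("%Y%m%d"))


# Hypothetical run timestamp, used only for illustration.
example_run = datetime(2024, 1, 15, 6, 30, tzinfo=timezone.utc)
print(resolve_table_name("table_a_{run_date}", example_run))  # table_a_20240115
```

If that assumption holds, the resolved name is a table name with a date suffix rather than a partition decorator on table_a, so it is worth checking that the data lands where expected.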

Column defined in source Dataset could not be found in the actual source

I have an ADF Copy Data flow and I'm getting the following error at runtime:
My source is defined as follows:
In my data set, the column is defined as shown below:
As you can see from the second image, the column IsLiftStation is defined in the source. Any idea why ADF cannot find the column?
I've had the same error. You can solve this by either selecting all columns (*) in the source and then mapping those you want to the sink schema, or by 'clearing' the mapping in which case the ADF Copy component will auto map to columns in the sink schema (best if columns have the same names in source and sink). Either of these approaches works.
Unfortunately, clicking the import schema button in the mapping tab doesn't work. It does produce the correct column mappings based on the columns in the source query but I still get the original error 'the column could not be located in the actual source' after doing this mapping.
Could you check whether there is a column named 'ae_type_id' in your schema? If so, could you remove that column and try again? The columns in the schema must be aligned with the columns in the query.
The issue is caused by an incomplete schema in one of the data sources. My solution is:
Step through the data flow, select the first schema, and click Import projection.
Go to the flow and open Data Preview.
Repeat for each step.
In my case, there were trailing commas in one of the CSV files. This caused automated column names to be created in the import allowing me to fix the data file.
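A quick way to look for this kind of problem before re-importing the schema is to scan the CSV for blank header names and for rows with more fields than the header. This is a minimal sketch using Python's csv module; the function name and sample data are hypothetical.

```python
import csv
import io
from typing import List, Tuple


def find_trailing_comma_issues(csv_text: str) -> Tuple[List[int], List[int]]:
    """Return (indexes of blank header names, row numbers with more fields than the header)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # A trailing comma in the header shows up as an empty column name.
    blank_headers = [index for index, name in enumerate(header) if not name.strip()]
    # Row numbers count the header as row 1.
    long_rows = [
        number for number, row in enumerate(reader, start=2) if len(row) > len(header)
    ]
    return blank_headers, long_rows


# Hypothetical file content: the third row ends with a stray comma.
sample = "id,name\n1,pump\n2,valve,\n"
print(find_trailing_comma_issues(sample))  # ([], [3])
```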