Set row as a header in Azure Data Factory [mapping data flow]

Currently, I have an Excel file that I'm processing using a mapping data flow to remove some null values.
This is my input file:
and after removing the null values I have:
I'm sinking my data into Cosmos DB, but I need to change the column names and set my first row as the header...
I need to do this (first row as header) in the step before the Sink, and I can't use the mapping option to set the column names manually because the positions of those columns can change.
Any idea how to do this?
Thanks

First row as header can only be checked in the dataset connection settings.
As a workaround, you can save your Excel data to blob storage (CSV format) after removing the null values.
Then create a Copy data activity or a data flow that uses this CSV file as the source (with First row as header checked) and Cosmos DB as the sink.
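Outside the ADF UI, the same two steps can be sketched locally with pandas (file names here are hypothetical; this is only an illustration of the workaround, not the data flow itself): drop the null rows, promote the first remaining row to the header, and stage a CSV that the Copy activity can read with First row as header checked.

import pandas as pd

# Hypothetical sketch of the workaround: clean the Excel data, promote the
# first remaining row to the header, and stage a CSV for the Copy activity.
df = pd.read_excel("input.xlsx", header=None)   # read raw, no header promotion yet
df = df.dropna(how="all")                       # remove fully-null rows

df.columns = df.iloc[0]                         # first remaining row becomes the header
df = df.iloc[1:].reset_index(drop=True)         # drop that row from the data

df.to_csv("staged.csv", index=False)            # staged CSV, header row included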
Update:
Setting of sink in data flow:
Data preview of sink:
Result:

Related

How to add header to file in Azure Data Factory

I am storing the header in a CSV file and concatenating it with the data file using a mapping data flow.
I am using a Union activity to combine these two files. While combining the header file and the data file, I can see the data, but the header row is not at the top; it appears at a random position in the sink file.
How can I make the header appear at the top?
1,Company_Reference_ID,PH2_TIC,USD,PH2_Actuals,2021-03-12,PH2_X_VB_V3,85738,V3Premium
1,Company_Reference_ID,PH2_TIC,USD,PH2_Actuals,2021-03-12,PH2_X_VB_V3,85738,V3Premium
Journey,CompanyReferenceIDType,CompanyReferenceID,Currency,LedgerType,AccountingDate,JournalSource
1,Company_Reference_ID,PH2_TIC,USD,PH2_Actuals,2021-03-12,PH2_X_VB_V3,85738,V3Commission
Update:
My debug result is as follows, I think it is what you want:
I created a simple test to merge two CSV files: one header.csv and another values.csv.
As @Mark Kromer MSFT said, we can use a Surrogate Key and then sort the rows. The Row_No of header.csv will start from 1 and that of values.csv will start from 2.
Set the header source to header.csv and don't select First row as header.
Set the values source to values.csv and don't select First row as header.
At the SurrogateKey1 activity, enter Row_No as the Key column and 1 as the Start value.
At the SurrogateKey2 activity, enter Row_No as the Key column and 2 as the Start value.
Then we can union the SurrogateKey1 stream and the SurrogateKey2 stream at the Union1 activity.
Then we can sort the rows by Row_No at the Sort1 activity.
Finally, we can use the Select1 activity to remove the Row_No column.
I think it is what you want:
For now, you would need to use a Surrogate Key for the different streams and make sure that the header row has 1 for the surrogate key value and sort by that column.
We are working on a feature for adding a header to the delimited text sink as a property in the data flow Sink. That will make it much easier and should light up in the UI soon.
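A rough local sketch of that surrogate-key-and-sort pattern with pandas (using the header.csv and values.csv from the example above; an illustration of the logic only, not the data flow itself):

import pandas as pd

# Both files are read without header promotion, mirroring "don't select
# First row as header" on the two sources.
header = pd.read_csv("header.csv", header=None)
values = pd.read_csv("values.csv", header=None)

header["Row_No"] = range(1, len(header) + 1)             # SurrogateKey1: start at 1
values["Row_No"] = range(2, len(values) + 2)             # SurrogateKey2: start at 2

merged = pd.concat([header, values])                     # Union1
merged = merged.sort_values("Row_No", kind="mergesort")  # Sort1: header row rises to the top
merged = merged.drop(columns="Row_No")                   # Select1: drop the helper column

merged.to_csv("merged.csv", index=False, header=False)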

Can't use Data Explorer as a sink in Data Flow

I'm trying to do a Data Flow using ADL1 as the source and Data Explorer as the sink; I can create the source but when I select Dataset for Sink Type the only available options in the Dataset pulldown are my ADL1 Datasets. If I use Data Copy instead I can choose Data Explorer as a sink but this won't work as Data Copy won't allow null values into Data Explorer number data types. Any insight on how to fix this?
I figured out a workaround. First I Data Copy the CSV file into a staging table where all columns are strings. Then I Data Copy from the staging table to the production table using a KQL query that converts the strings to their destination data types.
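The same two-step idea can be sketched locally with pandas (column names here are hypothetical): land everything as strings first, then convert to the target types so that empty values become nulls instead of failing the load. The KQL conversion functions, such as todouble(), behave similarly, returning null for values that can't be converted.

import pandas as pd

# Conceptual sketch of the staging workaround: ingest everything as strings,
# then convert to numeric types, letting blank values become nulls instead
# of failing the copy. "amount" and "quantity" are hypothetical columns.
staging = pd.read_csv("source.csv", dtype=str)           # staging table: all strings

production = staging.copy()
production["amount"] = pd.to_numeric(staging["amount"], errors="coerce")      # '' -> null
production["quantity"] = pd.to_numeric(staging["quantity"], errors="coerce")  # '' -> null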

Google Cloud Data Fusion is appending a column to original data

When I am loading encrypted data from a GCS source to a GCS sink, one additional column is getting added.
Original data
Employee ID,Employee First Name,Employee Last Name,Employee Joining Date,Employee location
1,Vinay,Argekar,01/01/2017,India
2,Thirukkumaran,Haridass,02/02/2017,USA
3,David,Wu,03/04/2000,Canada
4,Vinod,Kumar,04/02/2002,India
5,Joshua,Abraham,04/15/2010,France
6,Allaudin,Dastigar,09/24/2012,UK
7,Senthil,Kumar,08/15/2009,Germany
8,Sudha,Narayanan,12/14/2016,India
9,Ravi,Prasad,11/11/2011,Costa Rica
Data in the file after running the pipeline
0,Employee ID,Employee First Name,Employee Last Name,Employee Joining Date,Employee location
91,1,Vinay,Argekar,01/01/2017,India
124,2,Thirukkumaran,Haridass,02/02/2017,US
164,3,David,Wu,03/04/2000,Canada
193,4,Vinod,Kumar,04/02/2002,India
224,5,Joshua,Abraham,04/15/2010,France
259,6,Allaudin,Dastigar,09/24/2012,UK
293,7,Senthil,Kumar,08/15/2009,Germany
328,8,Sudha,Narayanan,12/14/2016,India
363,9,Ravi,Prasad,11/11/2011,Costa Rica
The first column (0) was not present in the original file.
When you are configuring the GCS source, did you specify the Format to be CSV or was it left as Text? When the Format is Text, the output schema actually contains an offset, which is the first column that you see in the output data. When you specify the format to be CSV, you have to specify the output schema of the file.
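The offset explanation is easy to verify against the sample above: with the Text format, each record is (byte offset of the line, line body), so 0, 91, 124, ... are just the cumulative line offsets. A quick check, assuming single-byte newlines:

# Recompute the unexpected first column as the byte offset of each input line.
lines = [
    "Employee ID,Employee First Name,Employee Last Name,Employee Joining Date,Employee location",
    "1,Vinay,Argekar,01/01/2017,India",
    "2,Thirukkumaran,Haridass,02/02/2017,USA",
]
offset = 0
for line in lines:
    print(f"{offset},{line}")
    offset += len(line) + 1  # +1 for the newline byte

# Prints offsets 0, 91, 124 -- matching the extra column in the pipeline output.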

How do I dynamically map files in Copy Activity to load the data into destination

Azure Data Factory V2 - Copy Activity - copying data to a destination from sources with changing column names and numbers of columns. I have to copy data from a flat file where the number of columns, and even the column names, change from file to file. How do I dynamically map them in the Copy activity to load the data into the destination in Azure Data Factory V2?
Suppose my destination has 20 columns, but the source sometimes arrives with 10 columns, sometimes 15, and sometimes 20. If the source has fewer columns than the destination, the remaining destination columns should be set to null.
Use data flows in ADF. Data Flow sinks can generate the table schema on the fly if you wish. Or you can just "auto-map" any changing schema to your target. If your source schema changes often, just use "schema drift" with no schema defined in your dataset.
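If you ever need to do the padding yourself outside data flows, the idea is simply to conform whatever columns arrive to the fixed destination schema and fill the gaps with nulls. A minimal pandas sketch with hypothetical column names:

import pandas as pd

# Hypothetical destination schema with 20 columns (col_01 .. col_20).
destination_columns = [f"col_{i:02d}" for i in range(1, 21)]

# The source file may arrive with only a subset of those columns, in any order.
source = pd.read_csv("incoming.csv")

# Conform to the destination: missing columns are created and filled with nulls;
# any extra columns are dropped.
conformed = source.reindex(columns=destination_columns)
conformed.to_csv("to_load.csv", index=False)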

Copying data from a single spreadsheet into multiple tables in Azure Data Factory

The Copy Data activity in Azure Data Factory appears to be limited to copying to only a single destination table. I have a spreadsheet containing rows that should be expanded out to multiple tables which reference each other - what would be the most appropriate way to achieve that in Data Factory?
Would multiple copy tasks running sequentially be able to perform this task, or does it require calling a custom stored procedure that would perform the inserts? Are there other options in Data Factory for transforming the data as described above?
If the columnMappings of your source and sink datasets don't go against the error conditions mentioned in this link,
1. The source data store query result does not have a column name that is specified in the input dataset "structure" section.
2. The sink data store (if with a pre-defined schema) does not have a column name that is specified in the output dataset "structure" section.
3. Either fewer or more columns in the "structure" of the sink dataset than specified in the mapping.
4. Duplicate mapping.
you could connect the copy activities in series and execute them sequentially.
Another solution is a Stored Procedure, which could meet your custom requirements. For configuration details, please refer to my previous detailed case: Azure Data Factory mapping 2 columns in one column.
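As a sketch of the "expand one spreadsheet into multiple referencing tables" shape (table and column names are hypothetical), the split itself is straightforward; the sequential copies or the stored procedure then just load each part in dependency order:

import pandas as pd

# Hypothetical spreadsheet with customer and order details repeated on every row.
rows = pd.read_excel("orders.xlsx")

# Parent table: one row per customer, with a generated surrogate key.
customers = (rows[["customer_name", "customer_email"]]
             .drop_duplicates()
             .reset_index(drop=True))
customers["customer_id"] = customers.index + 1

# Child table: order rows that reference the parent via customer_id.
orders = rows.merge(customers, on=["customer_name", "customer_email"])
orders = orders[["customer_id", "order_date", "order_amount"]]

# Load parents first, then children (mirrors running the copies in series).
customers.to_csv("customers_stage.csv", index=False)
orders.to_csv("orders_stage.csv", index=False)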