Google Cloud Data Fusion is appending a column to original data - google-cloud-data-fusion

When I am loading encrypted data from a GCS source to a GCS sink, one additional column is getting added.
Original data:
Employee ID,Employee First Name,Employee Last Name,Employee Joining Date,Employee location
1,Vinay,Argekar,01/01/2017,India
2,Thirukkumaran,Haridass,02/02/2017,USA
3,David,Wu,03/04/2000,Canada
4,Vinod,Kumar,04/02/2002,India
5,Joshua,Abraham,04/15/2010,France
6,Allaudin,Dastigar,09/24/2012,UK
7,Senthil,Kumar,08/15/2009,Germany
8,Sudha,Narayanan,12/14/2016,India
9,Ravi,Prasad,11/11/2011,Costa Rica
Data in the file after running the pipeline:
0,Employee ID,Employee First Name,Employee Last Name,Employee Joining Date,Employee location
91,1,Vinay,Argekar,01/01/2017,India
124,2,Thirukkumaran,Haridass,02/02/2017,US
164,3,David,Wu,03/04/2000,Canada
193,4,Vinod,Kumar,04/02/2002,India
224,5,Joshua,Abraham,04/15/2010,France
259,6,Allaudin,Dastigar,09/24/2012,UK
293,7,Senthil,Kumar,08/15/2009,Germany
328,8,Sudha,Narayanan,12/14/2016,India
363,9,Ravi,Prasad,11/11/2011,Costa Rica
The first column (0, 91, 124, ...) was not present in the original file.

When you were configuring the GCS source, did you specify the Format to be CSV or was it left as Text? When the Format is Text, the output schema actually contains an offset field, which is the first column that you see in the output data. When you specify the format to be CSV, you have to specify the output schema of the file.
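As a rough illustration (plain Python, not Data Fusion code), reading a file as raw text typically yields (byte offset, line) records, which is the kind of offset/body pair behind the extra first column, whereas a CSV parser returns only the declared columns. The extra values in the output above (91, 124, 164, ...) do line up with the byte offsets of each row in the original file.

    import csv

    def read_as_text(path):
        # Text-style reading: each record is (byte offset of the line, line body),
        # which mirrors the offset/body schema that produces the extra column.
        records = []
        offset = 0
        with open(path, "rb") as f:
            for line in f:
                records.append((offset, line.decode("utf-8").rstrip("\r\n")))
                offset += len(line)
        return records

    def read_as_csv(path):
        # CSV-style reading: only the declared columns come back, no offset.
        with open(path, newline="") as f:
            return list(csv.reader(f))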

Related

Azure Data Factory schema mapping not working with SQL sink

I have a simple pipeline that loads data from a CSV file to an Azure SQL DB.
I have added a data flow where I have ensured the schema matches the SQL table. I have a specific field which contains numbers with leading zeros. The data type in the source projection is set to string. The field is mapped to the SQL sink, showing as string data type. The field in SQL has the nvarchar(50) data type.
Once the pipeline is run, all the leading zeros are lost and the field appears to be treated as decimal:
Original data: 0012345
Inserted data: 12345.0
The CSV data shown in the data preview displays correctly; however, for some reason it loses its formatting during the insert.
Any ideas how I can get it to insert correctly?
I reproduced this in my lab and was able to load the data as expected. Please see the repro details below.
Source file (CSV file):
Sink table (SQL table):
ADF:
Connect the data flow source to the CSV source file. As my file is in text format, all the source columns in the projection are strings.
Source data preview:
Connect sink to Azure SQL database to load the data to the destination table.
Data in Azure SQL database table.
Note: You can also add a derived column before the sink to convert the value to a string, as the sink data type is a string.
Thank you very much for your response.
As per your post, the data flow appears to be working correctly. I have finally discovered an issue with the transformation: I have an Azure Batch service which runs a Python script that does a basic transformation and saves the output to a CSV file.
Interestingly, when I preview the data in the dataflow, it looks as expected. However, the values stored in SQL are not.
For the sake of others having a similar issue: my existing Python script converted a 'float' datatype column to string type. Upon conversion, it retained one decimal place, and as all of my numbers are integers, they were ending up with .0.
The solution was to convert values to integer and then to string:
df['col_name'] = df['col_name'].astype('Int64').astype('str')
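A minimal pandas sketch of the behaviour described above, using made-up data:

    import pandas as pd

    df = pd.DataFrame({"col_name": [12345.0, 67.0]})  # hypothetical float column

    # Converting the float column straight to string keeps the trailing ".0":
    print(df["col_name"].astype("str").tolist())                  # ['12345.0', '67.0']

    # Going through the nullable Int64 type first drops the decimal part:
    print(df["col_name"].astype("Int64").astype("str").tolist())  # ['12345', '67']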

Merge different schemas from a blob container into a single SQL table

I have to read 10 files from a folder in a blob container with different schemas (most of the schemas among the tables match) and merge them into a single SQL table.
File 1: let's say there are 25 such columns
File 2: some of the columns in file 2 match columns in file 1
File 3:
Output: a SQL table
How do I set up a pipeline in Azure Data Factory to merge these columns into a single SQL table?
My approach:
Get Metadata activity ---> ForEach over childItems ---> Copy activity
For the mapping, I constructed a JSON that contains all the source/sink columns from these files.
You can create a JSON file which contains each source file name and its Tabular Translator. Then use a Lookup activity to get this file's content (don't check "First row only"). Loop over this array in a ForEach activity and pass the source file name to your dataset. Finally, create a Copy Data activity and use the Tabular Translator as your mapping.
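To make that concrete, here is a small Python sketch (file names and column lists are hypothetical) that generates such a lookup file, with one entry per source file, each carrying the TabularTranslator mapping the Copy activity can pick up:

    import json

    # Hypothetical source files and their columns; replace with your real schemas.
    schemas = {
        "file1.csv": ["EmpID", "FirstName", "Location"],
        "file2.csv": ["EmpID", "FirstName", "JoinDate"],
    }

    entries = []
    for file_name, columns in schemas.items():
        mappings = [{"source": {"name": c}, "sink": {"name": c}} for c in columns]
        entries.append({
            "fileName": file_name,
            "tabularTranslator": {"type": "TabularTranslator", "mappings": mappings},
        })

    # The Lookup activity reads this file (with "First row only" unchecked),
    # the ForEach activity loops over the array, and the Copy activity uses the
    # mapping, e.g. via @item().tabularTranslator.
    with open("mapping_lookup.json", "w") as f:
        json.dump(entries, f, indent=2)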

Set row as a header Azure Data Factory [mapping data flow]

Currently, I have an Excel file that I'm processing using a mapping data flow to remove some null values.
This is my input file:
and after removing the null values I have:
I'm sinking my data into a Cosmos DB but I need to change the names of the columns and set my first row as headers...
I need to do this (first row as header) in the step before the sink, and I can't use the mapping option to set the column names manually because the positions of these columns can change.
Any idea how to do this?
Thanks
"First row as header" can only be checked in the dataset connection.
As a workaround, you can save your Excel file to blob storage (CSV format) after removing the null values.
Then create a Copy Data activity or data flow, using this CSV file as the source (check "first row as header") and Cosmos DB as the sink.
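For reference, the "first row as header" promotion that the asker is after can be sketched outside ADF with pandas like this (file names are hypothetical):

    import pandas as pd

    # Read the Excel file without treating any row as a header, drop the all-null
    # rows, promote the first remaining row to the header, and write the CSV that
    # the copy activity / data flow will then use as its source.
    df = pd.read_excel("input.xlsx", header=None)
    df = df.dropna(how="all")
    df.columns = df.iloc[0]
    df = df.iloc[1:].reset_index(drop=True)
    df.to_csv("cleaned.csv", index=False)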
Update
Setting of sink in data flow:
Data preview of sink:
Result:

How do I dynamically map files in Copy Activity to load the data into destination

Azure Data Factory V2, Copy activity: copying data with changing column names and numbers of columns to a destination. I have to copy data from a flat file where the number of columns, and even the column names, will change in each file. How do I dynamically map them in the Copy activity to load the data into the destination in Azure Data Factory V2?
Suppose my destination has 20 columns, but the source will sometimes come with 10 columns, 15, or sometimes 20. If the source has fewer columns than the destination, the remaining destination columns should be set to Null.
Use data flows in ADF. Data Flow sinks can generate the table schema on the fly if you wish. Or you can just "auto-map" any changing schema to your target. If your source schema changes often, just use "schema drift" with no schema defined in your dataset.
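As a conceptual sketch of what auto-mapping a drifting source onto a fixed destination amounts to (column and file names here are made up), columns missing from the source simply end up as nulls:

    import pandas as pd

    # Fixed 20-column destination schema (hypothetical names).
    destination_columns = [f"col{i}" for i in range(1, 21)]

    # A drifting source file may carry 10, 15 or 20 of those columns.
    source = pd.read_csv("incoming_file.csv")

    # Align to the destination: columns missing from the source become NaN,
    # which load as NULL, and matching columns are resolved by name.
    aligned = source.reindex(columns=destination_columns)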

Filename as Column using Data Factory V2

I have a lot of JSON files in Blob Storage, and what I would like to do is load the JSON files via Data Factory V2 into SQL Data Warehouse. I would like the filename in a column for each JSON file. I know how to do this in SSIS but I am not sure how to replicate it in Data Factory.
e.g. the file name CornerShop.csv becomes CornerShop in the filename column in SQL Data Warehouse.
Firstly, please see the limitations of copy activity column mapping:
- Source data store query result does not have a column name that is specified in the input dataset "structure" section.
- Sink data store (if with pre-defined schema) does not have a column name that is specified in the output dataset "structure" section.
- Either fewer columns or more columns in the "structure" of sink dataset than specified in the mapping.
- Duplicate mapping.
So, I don't think you can do the data transfer plus the file name in one step. My idea is:
1. First use a GetMetadata activity. It should get the file paths of each file you want to copy. Use "Child Items" in the Field list.
2. On success of the GetMetadata activity, run a ForEach activity. For the ForEach activity's Items, pass the list of file paths.
3. Inside the ForEach activity's Activities, place the Copy activity. Reference the iterated item by @item() or @item().name for the blob storage source file name.
4. Meanwhile, configure the filename as a parameter of the stored procedure. In the stored procedure, merge the filename into the fileName column.
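For what it's worth, the overall "file name as a column" idea can be sketched in plain Python (outside ADF, with hypothetical local paths and assuming line-delimited JSON) like this:

    import glob
    import os
    import pandas as pd

    frames = []
    for path in glob.glob("blobs/*.json"):
        # Read one JSON file and stamp its base name into a fileName column,
        # e.g. "CornerShop.json" -> "CornerShop".
        df = pd.read_json(path, lines=True)
        df["fileName"] = os.path.splitext(os.path.basename(path))[0]
        frames.append(df)

    # One combined frame, ready to bulk-load into the warehouse table.
    result = pd.concat(frames, ignore_index=True)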