Data Fusion pipeline - Wrangler transformation not working - google-cloud-data-fusion

Data Fusion pipeline:
Source is a CSV file in GCS.
Wrangler --> tried to validate a few columns like the one below:
Mobile_Number (column) --> send to error --> value matches regex --> ^[0]?[789]\d{9}$
When several similar transformations are added, Wrangler fails with errors.
Also, how do I check the records that were filtered out during validation? I can't even
find which validation failed.
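For reference, the intent of that regex can be sketched outside Wrangler. This is a minimal Python check using the same pattern from the question (the sample values are invented for illustration); rows that fail the check are the ones the send-to-error step would route away:

```python
import re

# Same pattern as in the Wrangler condition: an optional leading 0,
# then a digit 7/8/9, then exactly nine more digits.
MOBILE_RE = re.compile(r'^[0]?[789]\d{9}$')

def is_valid_mobile(value: str) -> bool:
    """Return True if the value matches the expected mobile-number format."""
    return MOBILE_RE.fullmatch(value) is not None

rows = ["9876543210", "09876543210", "1234567890", "98765"]
valid = [r for r in rows if is_valid_mobile(r)]
errors = [r for r in rows if not is_valid_mobile(r)]  # rows that would be sent to error
```

In Data Fusion itself, records routed by send-to-error are only visible if the pipeline has an error collector/sink attached to the Wrangler stage; without one they are silently dropped, which may be why the failed records cannot be found.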

Related

Azure Data Factory copy activity failing due to column mismatch

I am performing a copy activity in ADF with the source being a CSV file in Gen1, which is copied to SQL Server. I am getting the error below. I thoroughly checked each column; the count matches.
Error found when processing 'Csv/Tsv Format Text' source 'opportunity.csv' with row number 224: found more columns than expected column count 136
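One way to track down the offending rows before the copy runs is a quick scan of the file. This is a hedged sketch (the filename and expected count come from the error message above); it uses Python's csv parser, so properly quoted commas are handled, while unquoted embedded commas, a common cause of this error, will show up as extra columns:

```python
import csv

def find_bad_rows(path: str, expected: int):
    """Return (row_number, column_count) for every row whose parsed
    field count differs from the expected column count."""
    bad = []
    with open(path, newline="", encoding="utf-8") as f:
        for line_no, row in enumerate(csv.reader(f), start=1):
            if len(row) != expected:
                bad.append((line_no, len(row)))
    return bad

# e.g. find_bad_rows("opportunity.csv", 136) to locate rows like row 224
```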

Throw error on invalid lookup in Talend job that populates an output table

I have a tMap component in a Talend job. The objective is to get a row from an input table, perform a column lookup in another input table, and write an output table populating one of the columns with the retrieved value (see screenshot below).
If the lookup is unsuccessful, I generate a row in an "invalid rows" table. This works fine, however it is not the solution I'm looking for.
Instead, I want to stop the entire process and throw an error on the first unsuccessful lookup. Is this possible in Talend? The error that is thrown should contain the value that failed the lookup.
UPDATE
A tFileOutputDelimited component would do the trick.
So the flow would be: tMap -> invalid_row -> tFileOutputDelimited -> tDie
Note: you have to go to the advanced settings of the tFileOutputDelimited component, tick the "split output into multiple files" option, and put 1 rather than 1000.
For more flexibility, simply use two tMaps rather than one.
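The desired fail-fast behaviour can be sketched outside Talend as well. This is a minimal Python analogue of the tMap lookup (the table contents and field names are invented for illustration): the first unmatched key aborts the whole run, with the failing value in the error, which is what the invalid_row -> tDie branch achieves:

```python
# Lookup input table and main input rows (illustrative data only).
lookup = {"A1": "Alpha", "B2": "Beta"}
input_rows = [{"key": "A1"}, {"key": "C3"}]

def enrich(rows, table):
    """Join each row against the lookup table; abort on the first miss."""
    out = []
    for row in rows:
        if row["key"] not in table:
            # Equivalent of invalid_row -> tDie: stop the whole job
            # and report the value that failed the lookup.
            raise ValueError(f"lookup failed for value {row['key']!r}")
        out.append({**row, "name": table[row["key"]]})
    return out
```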

Remove or ignore last column in CSV file in Azure

I have a CSV file on an SFTP server which has 13 columns, but annoyingly the last one has no data or header, so basically there is an extra comma at the end of every record:
PGR,Concession ,Branch ,Branch Name ,Date ,Receipt ,Ref/EAN,Till No ,Qty , Price , Discount , Net Price ,
I want to import this file into a SQL table in Azure using a Copy activity in Data Factory, but I'm getting this error:
I know that if I manually open the file, right-click, and remove column M (which is completely blank), it works fine. But this needs to be an automated process; can someone assist, please? I'm not too familiar with Data Flow in ADF, so that could be an option, or I can use a Logic App to access the file if ADF is not the correct approach.
One workaround is to parse the CSV file and send only the required data directly to Azure SQL from the Logic App, using the SQL Server connector. Here is the screenshot of my Logic App.
Result:
Alternatively, you can remove the last column in ADF by using a Select transformation rule with the required condition:
name != 'null' && left(name,2) != '_c'
Because the header of that column is blank in the ADF dataset, data flows will name it "_c(<some column number>)", which is why we use left(name,2) != '_c'.
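Outside ADF, the same cleanup can be sketched in pandas (an illustrative alternative, not the Select-rule mechanism itself; the column names are adapted from the sample header, and the data row is invented). pandas labels the headerless trailing column "Unnamed: 12" rather than ADF's "_c12", so the filter keys on that prefix:

```python
import io
import pandas as pd

raw = (
    "PGR,Concession,Branch,Branch Name,Date,Receipt,"
    "Ref/EAN,Till No,Qty,Price,Discount,Net Price,\n"
    "A,B,C,D,E,F,G,H,1,2,3,4,\n"
)
df = pd.read_csv(io.StringIO(raw))

# The trailing comma creates a headerless, all-empty 13th column,
# which pandas names "Unnamed: 12"; drop every such column.
df = df.loc[:, ~df.columns.str.startswith("Unnamed")]
```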
REFERENCES:
Remove specific columns using Azure Data Factory

Azure Data Factory schema mapping not working with SQL sink

I have a simple pipeline that loads data from a csv file to an Azure SQL db.
I have added a data flow where I have ensured all schema matches the SQL table. I have a specific field which contains numbers with leading zeros. The data type in the source - projection is set to string. The field is mapped to the SQL sink showing as string data-type. The field in SQL has nvarchar(50) data-type.
Once the pipeline is run, all the leading zeros are lost and the field appears to be treated as decimal:
Original data: 0012345
Inserted data: 12345.0
The CSV data shown in the data preview is showing correctly, however for some reason it loses its formatting during insert.
Any ideas how I can get it to insert correctly?
I reproduced this in my lab and was able to load the data as expected. Please see the repro details below.
Source file (CSV file):
Sink table (SQL table):
ADF:
Connect the data flow source to the CSV source file. As my file is in text format, all the source columns in the projection are strings.
Source data preview:
Connect sink to Azure SQL database to load the data to the destination table.
Data in Azure SQL database table.
Note: You can also add a derived column before the sink to convert the value to a string, since the sink data type is a string.
Thank you very much for your response.
As per your post, the ADF data flow appears to be working correctly. I have finally discovered an issue with the transformation: I have an Azure Batch service which runs a Python script that does a basic transformation and saves the output to a CSV file.
Interestingly, when I preview the data in the dataflow, it looks as expected. However, the values stored in SQL are not.
For the sake of others having a similar issue: my existing Python script converted a 'float' column to string type. Upon conversion, it retained one decimal place, and since all of my numbers are integers, they ended up with .0.
The solution was to convert values to integer and then to string:
df['col_name'] = df['col_name'].astype('Int64').astype('str')
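A runnable sketch of that fix (the column name is from the answer above; the sample values are invented). The chain goes through pandas's nullable Int64 dtype, so any missing values survive the cast:

```python
import pandas as pd

df = pd.DataFrame({"col_name": [12345.0, 67.0]})

# Converting float -> str directly keeps the trailing ".0".
assert df["col_name"].astype("str").tolist() == ["12345.0", "67.0"]

# Converting float -> nullable Int64 -> str drops it.
df["col_name"] = df["col_name"].astype("Int64").astype("str")
```

Note that leading zeros already lost when the value was first parsed as a float (0012345 -> 12345.0) cannot be recovered this way; reading the column as a string in the first place (e.g. pandas `dtype=str`) avoids the problem entirely.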

cdap - Cloud Data Fusion - parse CSV and apply schema

I am trying to create a pipeline which performs the following tasks:
read and parse the CSV file
apply a schema on top of it
records that match the schema are written to a valid BigQuery table
records that don't match the schema (e.g. a column expects an int but the file contains a string) go to a reject bucket.
I have written the following pipeline. However, the problem is that I don't see any records going to either the reject bucket or BigQuery.
If the schema doesn't match, shouldn't the record go to reject?
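The routing the pipeline is expected to perform can be sketched outside Data Fusion. This is a minimal Python analogue (the schema and records are invented for illustration): fields that cast cleanly go to the valid output, anything else to reject, which is the behaviour the question assumes:

```python
# Minimal analogue of schema-based routing: records whose fields cast
# cleanly go to "valid" (the BigQuery side); the rest go to "reject".
schema = {"id": int, "name": str}

def route(records, schema):
    valid, reject = [], []
    for rec in records:
        try:
            valid.append({k: schema[k](rec[k]) for k in schema})
        except (ValueError, KeyError):
            reject.append(rec)
    return valid, reject

records = [{"id": "1", "name": "a"}, {"id": "oops", "name": "b"}]
valid, reject = route(records, schema)
```

In Data Fusion itself, note that the Wrangler parser only enforces types if an output schema with non-string types is set and the error handling option is configured to send failures to error; otherwise records pass through (or are dropped) without reaching either sink.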