I'm doing a large-scale project with multiple pipelines, millions of records per pipeline. I'm trying to develop a generic skipped row capture process.
What I need to do is: for every source row skipped due to any error encountered on the attempted load, I want to capture a key column value from the row and write it to a distinct log file (or separate DB table row). This can't be summary data: for each individual row that fails, I need to capture the row key from that row so we can review/re-load later (I will add in system variable values to identify pipeline, component, time stamp, etc). Pipeline must complete with all successful rows loaded, all unsuccessful rows logged.
This is no-brainer functionality in most ETL tools; I have to be overlooking something in ADF, because I can't find a way to do this. Appreciate any/all suggestions.
You can enable Fault tolerance and choose the Skip incompatible rows option. It will skip rows that are incompatible between the source and target store during the copy, e.g. type or field-count mismatches or PK violations.
Then you can enable the session log and choose the Warning log level in the copy activity so the skipped rows are logged. Finally, you can save the log file to Azure Storage or Azure Data Lake Storage Gen2.
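For orientation, the relevant copy activity settings look roughly like the sketch below when written out as pipeline JSON. The property names follow the articles linked under Reference, but treat this as an illustration and verify against them; the source/sink types, linked service name and log path here are placeholders.
"typeProperties": {
    "source": { "type": "AzureSqlSource" },
    "sink": { "type": "AzureSqlSink" },
    "enableSkipIncompatibleRow": true,
    "logSettings": {
        "enableCopyActivityLog": true,
        "copyActivityLogSettings": {
            "logLevel": "Warning",
            "enableReliableLogging": false
        },
        "logLocationSettings": {
            "linkedServiceName": {
                "referenceName": "MyADLSGen2LinkedService",
                "type": "LinkedServiceReference"
            },
            "path": "copyactivitylogs/"
        }
    }
}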
Reference:
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-fault-tolerance
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-log
In your first copy activity, enable the fault tolerance option under 'Settings' to log the skipped rows.
Make sure to place your row key column first in the mapping definition.
Get the copy activity's logFilePath from the activity output into a variable (see the expression sketch below).
Add another copy activity to load the skipped rows into a relational table:
Its source path will be the variable that holds logFilePath.
Set the file path type to 'Wildcard file path'.
Keep the wildcard folder path empty; the variable value goes in 'Wildcard file name'.
Make sure that the delimited text dataset's escape character is set to double quotes (").
The OperationItem field of the log file holds your record's fields separated by commas; because we placed the row key first in the mapping, it will appear first in OperationItem as well.
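A minimal sketch of the expressions involved, assuming the first copy activity is named 'Copy data1' and the variable is called 'logPath' (both names are just examples; adjust them to your pipeline):
Set Variable value:   @activity('Copy data1').output.logFilePath
Wildcard file name:   @variables('logPath')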
Good luck.
Related
I have an Azure Data Factory trigger that fires when a file is placed in blob storage; the trigger starts a pipeline run and passes the file name to the data flow activity. I would like to make sure that all the column names from the header row in the file are in the sink table. There is an identity column in the sink table that should not be part of the comparison. I'm not sure how to tackle this task; I've read about the 'derived column' transformation, is that the route I should take?
You can select or filter which columns end up in the sink dataset or table by using "Field mapping". You could optionally use a "Derived Column" transformation, but the sink transformation already gives you this by default through its mapping, which is set to "Auto mapping". Turn that off and you can add or remove which columns are written to the sink.
In the example below, the column "id" can be treated as the equivalent of the identity column in your table. Assuming all the files have the same columns:
Once you have modified the mapping as needed, you can confirm it from the "Inspect" tab before the run.
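Scripted out, the sink simply ends up with an explicit mapColumn list that omits the identity column. A minimal data flow script sketch, assuming hypothetical incoming columns id, name and amount where id is the identity column (stream and column names are placeholders; mirror the pattern against your own schema):
source1 sink(allowSchemaDrift: true,
    validateSchema: false,
    mapColumn(
        name,
        amount
    ),
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> sink1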
Strategy:
Use two ADF pipelines: one to get a list of all files and another to process each file, copying its content to a specific SQL table.
Setup:
I’ve created 4 CSV files, following the pattern you need, “[CustomerID]_[TableName]_[FileID].csv”, and 4 SQL tables, one for each type of file.
A_inventory_0001.csv: inventory records for customer A, to be inserted into the SQL table “A_Inventory”.
A_sales_0003.csv: sales records for customer A, to be inserted into the SQL table “A_Sales”.
B_inventory_0002.csv: inventory records for customer B, to be inserted into the SQL table “B_Inventory”.
B_sales_0004.csv: sales records for customer B, to be inserted into the SQL table “B_Sales”.
Linked Services
In Azure Data Factory, the following linked services were created using Key Vault (Key Vault is optional).
Datasets
The following datasets were created. Note we have created some parameters to allow the pipeline to specify the source file and the destination SQL table.
The dataset “AzureSQLTable” has a parameter to specify the name of the destination SQL table.
The dataset “DelimitedTextFile” has a parameter to specify the name of the source CSV file.
The dataset “DelimitedTextFiles” has no parameter because it will be used to list all files from source folder.
Pipelines
The first pipeline “Get Files” will get the list of CSV files from source folder (Get Metadata activity), and then, for each file, call the second pipeline passing the CSV file name as a parameter.
Inside the foreach loop, there is a call to the second pipeline “Process File” passing the file name as a parameter.
The second pipeline has a parameter “pFileName” to receive the name of the file to be processed and a variable to calculate the name of the destination table based on the file name.
The first activity is to use a split in the file name to extract the parts we need to compose the destination table name.
In the expression below we are splitting the file name on the “_” separator and then using the first and second parts to compose the destination table name. For example, “A_inventory_0001.csv” yields “A_inventory”.
@concat(string(split(pipeline().parameters.pFileName, '_')[0]),'_',string(split(pipeline().parameters.pFileName, '_')[1]))
The second activity will then copy the file from the source “pFileName” to the destination table “vTableName” using dynamic mapping, i.e. not adding specific column names, as this will be dynamic.
The files I used in this example and the ADF code are available here:
https://github.com/diegoeick/stack-overflow/tree/main/69340699
I hope this will resolve your issue.
In case you still need to save the CustomerID and FileID in the database tables, you can use dynamic mapping: take the available parameters (the file name) and build a JSON with the dynamic mapping in the Mapping tab of your copy activity; a sketch follows below. You can find more details here: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping#parameterize-mapping
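As a sketch of what that dynamic mapping JSON could look like (the column names here are purely illustrative; the exact shape is described in the linked article), the Mapping tab would receive a tabular translator object along these lines, which you can also assemble with expressions and pass in via Add dynamic content:
{
    "type": "TabularTranslator",
    "mappings": [
        { "source": { "name": "CustomerID" }, "sink": { "name": "CustomerID" } },
        { "source": { "name": "FileID" }, "sink": { "name": "FileID" } },
        { "source": { "name": "Quantity" }, "sink": { "name": "Quantity" } }
    ]
}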
I am setting up a Data Flow in ADF that takes an Azure Table Dataset as Source, adds a Derived Column that adds a column with the name "filename" and a dynamic value, based on a data field from the source schema.
Then the output is sent to a sink that is linked to a dataset pointing at Blob Storage (I tried both ADLS Gen2 and standard Blob Storage).
However, after executing the pipeline, instead of finding multiple files in my container, I see that folders are created with the name filename=ABC123.csv, and each of those folders contains other files (it makes me think of Parquet output):
- filename=ABC123.csv
+ _started_UNIQUEID
+ part-00000-tid-UNIQUEID-guids.c000.csv
So, I'm clearly missing something, as I would need to have single files listed in the dataset container with the name I have specified in the pipeline.
This is what the pipeline looks like:
The Optimize tab of the Sink shape looks like this:
Here you can see the settings of the Sink shape:
And this is the code of the pipeline (however some parts are edited out):
source(output(
PartitionKey as string,
RowKey as string,
Timestamp as string,
DeviceId as string,
SensorValue as double
),
allowSchemaDrift: true,
validateSchema: false,
inferDriftedColumnTypes: true) ~> devicetable
devicetable derive(filename = Isin + '.csv') ~> setoutputfilename
setoutputfilename sink(allowSchemaDrift: true,
validateSchema: false,
rowUrlColumn:'filename',
mapColumn(
RowKey,
Timestamp,
DeviceId,
SensorValue
),
skipDuplicateMapInputs: true,
skipDuplicateMapOutputs: true) ~> distributetofiles
Any suggestions or tips? (I'm rather new to ADF, so bear with me)
I recently struggled through something similar to your scenario (but not exactly the same). There are a lot of options and moving parts here, so this post is not meant to be exhaustive. Hopefully something in it will steer you towards the solution you are after.
Step 1: Source Partitioning
In Data Flow, you can group like rows together via Set Partitioning. One of the many options is by Key (a column in the source):
In this example, we have 51 US States (50 states + DC), and so will end up with 51 partitions.
Step 2: Sink Settings
As you found out, the "As data in column" option results in a structured folder name like {columnName}={columnValue}. I've been told this is because it is a standard in Hadoop/Spark type environments. Inside that folder will be a set of files, typically with non-human-friendly GUID based names.
"Default" will give much the same result you currently have, without the column based folder name. "Output to Single File" is pretty self-explanatory, and the farthest thing from the solution you are after. If you want control over the final file names, the best option I have found is the "Pattern" option. This will generate file(s) with the specified name and a variable number [n]. I honestly don't know what "Per partition" would generate, but it may get close to the results you are after, 1 file per column value.
Some caveats:
The folder name is defined in the Sink Dataset, NOT in the Data Flow, so dataset parameters are really probably "Step 0". For Blob type output, you could probably hard code the folder name like "myfolder/fileName-[n]". YMMV.
Unfortunately, none of these options will permit you to use a derived column to generate the file name. [If you open the expression editor, you'll find that "Incoming schema" is not populated.]
Step 3: Sink Optimize
The last piece you may experiment with is Sink Partitioning under the Optimize tab:
"Use current partitioning" will group the results based on the partition set in the Source configuration. "Single partition" will group all the results into a single output group (almost certainly NOT what you want). "Set partitioning" will allow you to re-group the Sink data based on a Key column. Unlike the Sink settings, this WILL permit you to access the derived column name, but my guess is that you will end up with the same folder naming problem you have now.
At the moment, this is all I know. I believe that there is a combination of these options that will produce what you want, or something close to it. You may need to approach this in multiple steps, such as having this flow output the incorrectly named folders to a staging location, then having another pipeline/flow process each folder and collapse the results under the desired name.
You're seeing the ghost files left behind by the Spark process in your dataset folder path. When you use 'As data in column', ADF will write the file using your field value starting at the container root.
You'll see this noted on the 'Column with file name' property:
So, if you navigate to your storage container root, you should see the ABC123.csv file.
Now, if you want to put that file in a folder, just prepend the folder name in your Derived Column transformation formula, something like this:
"output/folder1/{Isin}.csv"
The double-quotes activate ADF's string interpolation. You can combine literal text with formulas that way.
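Tying that back to the data flow script in the question, the derived column line would become something like the following (the "output/folder1" prefix is just an example path):
devicetable derive(filename = "output/folder1/{Isin}.csv") ~> setoutputfilename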
I have created a CDC task that captures changes in a source PostgreSQL schema and writes them in Parquet format into a target S3 bucket. The task captures the inserts, updates and deletes correctly but fails to capture column name and type changes in the source.
When I change a column name or type of a table in the source and insert new rows to the table, the resulting Parquet file uses the old column name and type.
Is there a specific configuration I am missing, or is it not possible to achieve the desired outcome with this task in DMS?
If you change a column at the source, DMS should pick the change up automatically and apply it at the destination; check your DMS task settings. You should not need to add the column manually at the destination.
Make sure you have the HandleSourceTableAltered parameter set to true in the task settings.[1] (The setting applies when the target metadata parameter BatchApplyEnabled is set to either true or false.)
Same goes for HandleSourceTableDropped or HandleSourceTableTruncated if this is relevant in your case.
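For reference, these switches live under ChangeProcessingDdlHandlingPolicy in the task settings JSON; a minimal excerpt looks like this (see the linked DDL-handling page for the full settings document):
"ChangeProcessingDdlHandlingPolicy": {
    "HandleSourceTableDropped": true,
    "HandleSourceTableTruncated": true,
    "HandleSourceTableAltered": true
}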
Obviously, previously replicated Parquet files on S3 will not change to reflect this DDL change on the source.
[1] https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TaskSettings.DDLHandling.html
I am designing an ADF pipeline that copies rows from a SQL table to a folder in Azure Data Lake. After that, the rows in SQL should be deleted. But before this delete action takes place, I want to know whether the number of rows that were copied is the same as the number of rows that I selected at the beginning of the pipeline.
Is there a way to get the row count of the copy activity and use it in another activity (like a lookup)?
Edit follow up question:
Bo Xiao's answer is OK. But then I have a follow-up question. After the copy activity I put an If Condition with the following expression:
@activity('LookUpActivity').output.firstRow.RecordsRead == @{activity('copyActivity').output.rowsCopied
But then I get the error: @activity('LookUpActivity').output.firstRow.RecordsRead == @{activity('copyActivity').output.rowsCopied
Isn't it possible to compare output parameters of two activities to see if this is True?
Extra edit: I just found an error in this piece of code. I forgot a "{" at the beginning of the code. But then the code is still wrong. To compare two outputs from earlier activities, the code must be:
@equals(activity('LookUpActivity').output.firstRow.RecordsRead,activity('copyActivity').output.rowsCopied)
You can find copied rows in activity output as pictured below.
And you can use the output value like this:
@activity('copyActivity').output.rowsCopied
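For context, a trimmed-down copy activity output looks roughly like the excerpt below (the field names come from the activity output, the values are purely illustrative), which is where rowsCopied comes from:
{
    "dataRead": 123456,
    "dataWritten": 123456,
    "rowsRead": 1000,
    "rowsCopied": 1000
}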
I am pretty new to Pentaho so my query might sound very novice.
I have written a transformation in which I am using a CSV file input step and a Table input step.
Steps I followed:
Initially, I created a parameter in the transformation properties. The parameter birthdate doesn't have any default value set.
I have used this parameter in the PostgreSQL query in the Table input step in the following manner:
select * from person where EXTRACT(YEAR FROM birthdate) > ${birthdate};
I am reading the CSV file using CSV file input step. How do I assign the birthdate value which is present in my CSV file to the parameter which I created in the transformation?
(OR)
Could you guide me through the process of assigning the CSV field value directly to the SQL query used in the Table input step, without the use of a parameter?
TLDR;
I recommend using a "database join" step like in my third suggestion below.
See the last image for reference
First idea - Using Table Input as originally asked
Well, you don't need any parameter for that, unless you are going to provide the value for that parameter when launching the transformation. If you need to read the value from a CSV, you can do that with this approach.
First, read your CSV and make sure your rows are ok.
After that, use a select values to keep only the columns to be used as parameters.
In the table input, use a placeholder (?) to determine where to place the data and ask it to run for each row that it receives from the source step.
Just keep in mind that the order of the columns received by the Table input (the columns coming out of the Select values step) is the same order used for the placeholders (?). This should not be a problem in your case, which uses only one placeholder, but keep it in mind as you ramp up using Pentaho.
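For the query in the question, the Table input version with a placeholder would look something like the sketch below, with 'Insert data from step' pointing at the step that supplies the birthdate value and 'Execute for each row?' checked (if the incoming field arrives as a string, a cast such as ?::integer, as in the other answer, may be needed):
SELECT * FROM person WHERE EXTRACT(YEAR FROM birthdate) > ?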
Second idea, using a Database Lookup
This is another approach where you can't customize the query sent to the database, but you may get better performance because you can set the "Enable cache" flag. If you don't need to use a function in your WHERE clause, this option is really recommended.
Third idea, using a Database Join
That is my recommended approach if you need a function in your WHERE clause. It looks a lot like the Table input approach, but you can skip the Select values step and choose which columns to use, repeat the same column several times, and enable an "outer join" flag that also returns the rows for which the query produced no result.
Pro tip: If the transformation feels too slow, try using multiple copies of the step (documentation here) and, obviously, make sure the table has the appropriate indexes in place.
Yes, there's a way of assigning it directly without the use of a parameter. Do as follows.
Use a "Block this step until steps finish" step to halt the Table input step until the CSV file input step completes.
Following is how you configure each step.
Note:
The Postgres query should be: select * from person where EXTRACT(YEAR FROM birthdate) > ?::integer
Check "Execute for each row" and "Replace variables in script" in the Table input step.
Select only the birthdate column in the CSV file input step.