can a modify stage bulk rename columns based on wildcards? - datastage

I need to modify columns based on business rules using RCP. For example, all source columns that end with '_ID' must be changed to '_KEY' to meet the target.
An example: Test_ID in source becomes Test_KEY in target
I have multiple tables, some with 2 "ID" columns, and some with 20. Is there a way to configure the modify stage to bulk rename columns based on wildcard?
If not, is there another way?
Thanks.

I doubt that there is an option using modify stage with wildcards for this.
One alternative could be a schema file which can be used with any of following stages:
Sequential File, File Set, External Source, External Target, Column Import, Column Export
This schema file could also be generated or modified to adjust the column names as needed.
Another way could be to generate the appropriate SQL statement if the data resides in a database or is written to one.

Related

ADF Copy function comparing watermark against isnull(date1,date2)

Forum Newbie...
I want to utilise the ADF Copy function, to carry out incremental table extracts from one Azure DB to another. Every table in the database that I need all have the same 2 relevant fields i.e. date1, date2. For Watermark comparison purposes, I need to use isnull(date1,date2), but unsure how to do this, i.e. I am not sure how I can add this consistent derived value to the Source as an additional field that can perhaps be added via the Query or Stored Procedure Option on the source, to utilise the #item().source.schema and #item().source.table values that have already been generated as parameters..?
You can use the query option in the Copy data activity source and add a new column in the query itself to get the results of isnull(date1,date2) and include the parameter values to get the table name instead of hardcoding them as shown below.
In source, select Query option under Use query and add dynamic content to concat() select statement with parameter values.
#concat('select *, isnull(date1,date2) as final_dt from ',pipeline().parameters.schema,'.',pipeline().parameters.table)
Sink table data output:

matching the columns in a source file with sink table columns to make sure they match using Azure Data Factory

I have an Azure Data factory trigger that is fired off when a file is placed in blob storage, this trigger will start pipeline execution and pass the file name to the data flow activity. I would like to make sure that all the column names from the header row in the file are in the sink table. There is an identity column in the sink table that should not be in the comparison. Not sure how to tackle this task, I've read about the 'derived column' activity, is that the route I should take?
You can select or filter which columns reside in sink dataset or table by using "Field mapping". You can optionally use "derived columns" transformation, however in the "sink transformation" you will have this by default and is set to "Auto mapping". Here you can add or remove which columns are written to sink.
In the below example the column "id" can be assumed as similar to "Identity" column in your table. Assuming all the files have same columns:
Once you have modified as per your need, you can confirm the same from the "inspect" tab before run.
Strategy:
Use two ADF pipelines, one to get a list of all files and another one to process each file copying its content to a specific SQL table.
Setup:
I’ve created 4 CSV files, following the pattern you need: “[CustomerID][TableName][FileID].csv” and 4 SQL tables, one for each type of file.
A_inventory_0001.csv: inventory records for customer A, to be
inserted into the SQL table “A_Inventory”.
A_sales_0003.csv: sales
records for customer A, to be inserted into the SQL table “A_Sales”.
B_inventory_0002.csv: inventory records for customer B, to be
inserted into the SQL table “B_Inventory”.
B_sales_0004.csv: sales
records for customer B, to be inserted into the SQL table “B_Sales”
Linked Services
In Azure Data Factory, the following linked services were create using Key Vault (Key Vault is optional).
Datasets
The following datasets were created. Note we have created some parameters to allow the pipeline to specify the source file and the destination SQL table.
The dataset “AzureSQLTable” has a parameter to specify the name of the destination SQL table.
The dataset “DelimitedTextFile” has a parameter to specify the name of the source CSV file.
The dataset “DelimitedTextFiles” has no parameter because it will be used to list all files from source folder.
Pipelines
The first pipeline “Get Files” will get the list of CSV files from source folder (Get Metadata activity), and then, for each file, call the second pipeline passing the CSV file name as a parameter.
Inside the foreach loop, there is a call to the second pipeline “Process File” passing the file name as a parameter.
The second pipeline has a parameter “pFileName” to receive the name of the file to be processed and a variable to calculate the name of the destination table based on the file name.
The first activity is to use a split in the file name to extract the parts we need to compose the destination table name.
In the expression bellow we are splitting the file name using the “__” separator and then using the first and second parts to compose the destination table name.
#concat(string(split(pipeline().parameters.pFileName, '_')[0]),'_',string(split(pipeline().parameters.pFileName, '_')[10]))
The second activity will then copy the file from the source “pFileName” to the desnation table “vTableName” using dynamic mapping, ie not adding specific column names as this will be dynamic.
The files I used in this example and the ADF code are available here:
https://github.com/diegoeick/stack-overflow/tree/main/69340699
I hope this will resolve your issue.
In case you still need to save the CustomerID and FileID in the database tables, you can use the dynamic mapping and use the available parameters (filename) and create a json with the dynamic mapping in the mapping tab of your copy activity. You can find more details here: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping#parameterize-mapping

Spark - Update target data if primary keys match?

Is it possible to overwrite a record in the target if specific conditions are met using spark without reading the target into a dataframe? For example, I know we can do this if both sets of data are loaded into dataframes, but I would like to know if there is a way to perform this action without loading the target into a dataframe. Basically, a way to specify overwrite/update conditions.
I am guessing no, but I figured I would ask before I dive into this project. I know we have the write options of append and overwrite. What I really want is, if a few specific columns already exist in the data target, then overwrite it and fill in the other columns with the new data. For example:
File1:
id,name,date,score
1,John,"1-10-17",35
2,James,"1-11-17",43
File2:
id,name,date,score
3,Michael,"1-10-17",23
4,James,"1-11-17",56
5,James,"1-12-17",58
I would like the result to look like this:
id,name,date,score
1,John,"1-10-17",35
3,Michael,"1-10-17",23
4,James,"1-11-17",56
5,James,"1-12-17",58
Basically, Name and Date columns act like primary keys in this scenario. I want updates to occur based on those two columns matching, otherwise make a new record. As you can see id 4 overwrites id 2, but id 5 appends because the date column did not match. Thanks ahead guys!

How to assign csv field value to SQL query written inside table input step in Pentaho Spoon

I am pretty new to Pentaho so my query might sound very novice.
I have written a transformation in which am using CSV file input step and table input step.
Steps I followed:
Initially, I created a parameter in transformation properties. The
parameter birthdate doesn't have any default value set.
I have used this parameter in postgresql query in table input step
in the following manner:
select * from person where EXTRACT(YEAR FROM birthdate) > ${birthdate};
I am reading the CSV file using CSV file input step. How do I assign the birthdate value which is present in my CSV file to the parameter which I created in the transformation?
(OR)
Could you guide me the process of assigning the CSV field value directly to the SQL query used in the table input step without the use of a parameter?
TLDR;
I recommend using a "database join" step like in my third suggestion below.
See the last image for reference
First idea - Using Table Input as originally asked
Well, you don't need any parameter for that, unless you are going to provide the value for that parameter when asking the transformation to run. If you need to read data from a CSV you can do that with this approach.
First, read your CSV and make sure your rows are ok.
After that, use a select values to keep only the columns to be used as parameters.
In the table input, use a placeholder (?) to determine where to place the data and ask it to run for each row that it receives from the source step.
Just keep in ming that the order of columns received by the table input (the columns out of the select values) is the same order that it will be used for the placeholders (?). This should not be a problem with your question that uses only one placeholder, but keep that in mind as you ramp up using Pentaho.
Second idea, using a Database Lookup
This is another approach where you can't personalize the query made to the database and may experience a better performance because you can set a "Enable cache" flag and if you don't need to use a function on your where clause this is really recommended.
Third idea, using a Database Join
That is my recommended approach if you need a function on your where clause. It looks a lot like the Table Input approach but you can skip the select values step and select what columns to use, repeat the same column a bunch of times and enable a "outer join" flag that returns the rows without result from the query
ProTip: If you feel the transformation running too slow, try to use multiple copies from the step (documentation here) and obviously make sure the table have the appropriate indexes in place.
Yes there's a way of assigning directly without the use of parameter. Do as follows.
Use Block this step until steps finish to halt the table input step till csv input step completes.
Following is how you configure each step.
Note:
Postgres query should be select * from person where EXTRACT(YEAR
FROM birthdate) > ?::integer
Check Execute for each row and Replace variables in in Table input step.
Select only the birthday column in CSV input step.

Redshift - Adding a column, do we have to change our previous CSVs to include it?

I currently have a redshift table in our database that has 10 columns, and I want to add another. It's trivial to do an alter table to do this.
My question - When I do this, will all my old CSV files fail to insert into redshift (via COPY from S3) given they won't have this new column?
I was hoping the columns would just be NULL vs. it failing on import, but I haven't seen any documentation on this.
Ideally I wish I could specify the actual column name in the header row of the CSV, but I haven't seen if that is possible anywhere.
FILLRECORD in COPY command does that: 'Allows data files to be loaded when contiguous columns are missing at the end of some of the records'.