Talend Data Integration: How to maintain input columns in all the components?

My input sheet has 3 fields:
I1, I2 and I3
I have to insert these fields into two tables.
Table1 should have I1 and I2 from the input, and it has ID as its PK column.
Table2 should have I3, with a foreign key relation to Table1's ID.
Using Talend, I have inserted the first two columns into Table1 (I3 is omitted in that table output component), and ID is auto-incremented. Now I want to insert ID and I3 into Table2.
But in the second table output component I cannot see the I3 column in the stream, since we omitted I3 in the first step.
Can anyone help me get the I3 column into the second table output component?
Thanks

You should be able to do so with a simple tMap with 2 outputs:
tDBInput --- tMap --- tDBOutput1
                  |--- tDBOutput2
In the tMap, you just have to select the fields to put in the right output.
You can use a Talend sequence, Numeric.sequence("s1",1,1), in the 'Var' section of the tMap (central panel), and put it in the 2 outputs as the ID.

I did this in the following way...
Overall job view:
tMap Settings:

Another way to do this would be with a two-step process. You can create two separate "processing lines", or subjobs, within one job.
Step 1 - I1, I2
The first subjob would be, as #Corentin said:
tExcelInput1 -- tMap1 -- tDBOutput1 (I1, I2)
Step 2 - I3
The second subjob, which can be connected to the first one with an OnSubjobOk connector, would look like this:
tExcelInput2 -- tMap2 -- tDBOutput2 (I3)
                  |--- tDBInput3, lookup ID (with I1)
Now the first subjob adds all values to Table1. Once that has completed successfully, all values for I3 are processed with an ID lookup on I1. This works with an auto-increment field in the database. The lookup would simply look like SELECT ID FROM Table1 and, in the tMap, you just use it as the lookup for the I3 rows.
Overview
If it is easier for you to see the job, please check the following image.
The lookup would look like this:
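In SQL terms, the lookup amounts to something like this (a sketch only; it assumes Table1 also exposes I1 alongside the auto-incremented ID so the looked-up IDs can be matched back to the Excel rows, and the Table2 column names are placeholders):
-- tDBInput3 (lookup): fetch the generated IDs together with I1
SELECT ID, I1 FROM Table1;
-- tDBOutput2 then writes one row per I3 value, roughly:
-- INSERT INTO Table2 (table1_id, I3) VALUES (<looked-up ID>, <I3 from the Excel row>);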

Related

Combine columns from two sources

I have two sources resulting from some transformations in a data flow:
I have tried using a join, but it replicates the data; no matter which join type I select, it outputs similar stuff:
I have tried a union as well, but union either creates nulls in the columns (if done by name) or extra rows (if done by position).
Shouldn't the join just concatenate the columns together, since the IDs are the same in both tables?
This is how the desired output should look:
I want to concatenate the version column onto the first source so that it looks like this:
ID   name   value   version
111  file1  0.1     3
111  file2  0.82    15
111  file3  2.2     2
Both of your source files have only one matching column (ID) and it is not unique.
When you join both sources on the ID column, each row of source1 joins with all the matching rows of source2.
Here, your row1 (111) of source1 joins with all 3 matching rows (111) of source2, hence it results in 9 rows with different version values for each row in source1.
To get only 3 rows as your expected results, you need a unique matching row in each source.
Add a window transformation for both sources and get the rowNumber() based on the ID column.
Source1->window1:
Window1 data preview:
Source2->window2:
Window2 data preview:
Add a join transformation to join the data from the two window transformations on the ID and rank columns.
Join data preview:
Add a select transformation to remove the unwanted columns.
Select data preview:
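For reference, a rough SQL equivalent of this window-plus-join approach (a sketch only; the data flow itself is configured in the transformation UI, and the ORDER BY columns used to number the rows are an assumption):
SELECT s1.ID, s1.name, s1.value, s2.version
FROM (SELECT ID, name, value,
             ROW_NUMBER() OVER (PARTITION BY ID ORDER BY name) AS rn      -- window1
      FROM source1) s1
JOIN (SELECT ID, version,
             ROW_NUMBER() OVER (PARTITION BY ID ORDER BY version) AS rn   -- window2
      FROM source2) s2
  ON s1.ID = s2.ID AND s1.rn = s2.rn;                                     -- join on ID and the rank column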
That is expected with a join. For example, when you join tables in SQL, you also supply the target projection as part of the select statement. What you need to do here is add a Select transformation after your Join transformation. In there, you will reduce the projection to just the columns that you would like to retain. You'll be able to choose which side (left or right) you would like to keep for the ID column.

Is it OK to store a transactional primary key on a data warehouse dimension table to relate fact and dim?

I have a data source (a Postgres transactional system) like this (simplified; the actual tables have more fields than this):
Then I need to create an ETL pipeline, where the required report is something like this:
order number (from sales_order_header)
item name (from sales_order_lines)
batch shift start & end (from receiving_batches)
delivered quantity, approved received quantity, rejected received quantity (from receiving_inventories)
My design for fact-dim tables is this (simplified).
What I don't know is the optimal ETL design.
Let's focus on how to insert the fact rows, and the relationship between the fact table and dim_sales_orders.
If I have staging tables like these:
The ETL runs daily. After 22:00, there will be no more receiving, so I can run the ETL at 23:00.
Then I can just fetch data from sales_order_header and sales_order_lines, so at 23:00 the script can run something like:
INSERT INTO staging_sales_orders (
    SELECT
        order_number,
        item_name
    FROM
        sales_order_header soh,
        sales_order_lines sol
    WHERE
        soh.sales_order_id = sol.sales_order_header_id
        AND date_trunc('day', sol.created_timestamp) = date_trunc('day', now())
);
And the fact table load can run at 23:30, with this query:
SELECT
    soh.order_number,
    rb.batch_shift_start,
    rb.batch_shift_end,
    sol.item_name,
    ri.delivered_quantity,
    ri.approved_received_quantity,
    ri.rejected_received_quantity
FROM
    receiving_batches rb,
    receiving_inventories ri,
    sales_order_lines sol,
    sales_order_header soh
WHERE
    rb.batch_id = ri.batch_id
    AND ri.sales_order_line_id = sol.sales_order_line_id
    AND sol.sales_order_header_id = soh.sales_order_id
    AND date_trunc('day', sol.created_timestamp) = date_trunc('day', now())
But how do I optimally load the data, particularly into the fact table?
My approach:
1. Select from staging_sales_orders and insert into dim_sales_orders, using an auto-increment primary key.
2. Before inserting into fact_receiving_inventories, I need to know the dim_sales_order_id. So in that case, I select:
SELECT
    dim_sales_order_id
FROM
    dim_sales_orders dso
WHERE
    order_number = staging_row.order_number
    AND item_name = staging_row.item_name
and then insert into the fact table.
Now what I doubt is point 2 (selecting from the existing dim). Here I select based on 2 varchar columns, which should be a performance hit. Since the source is in normalized form, I'm thinking of modifying the staging tables by adding sales_order_line_id to both of them. Then, in point 2 above, I can just do:
SELECT
    dim_sales_order_id
FROM
    dim_sales_orders dso
WHERE
    sales_order_line_id = staging_row.sales_order_line_id
But as a consequence, I will need to add sales_order_line_id to dim_sales_orders, which I don't find to be common in tutorials. I mean, adding the transactional table's PK can technically be done, since I can access the data source. But is it good DW fact-dim design to add such a transactional field (especially since it is a PK)?
Or is there any other approach, rather than selecting the existing dim row based on 2 varchars?
How do I optimally select dimension ids for the fact table?
Thanks
It is practically mandatory to include the source PK/BK in a dimension.
The standard process is to load your Dims and then load your facts. For the fact loads, you translate the source data to the appropriate Dim SKs with lookups to the Dims using the PK/BK.
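A sketch of that fact load, assuming the receiving rows land in a staging table called staging_receiving_inventories and that dim_sales_orders carries sales_order_line_id as the source PK/BK (both names are assumptions; dim_sales_order_id is the surrogate key from the question):
INSERT INTO fact_receiving_inventories
    (dim_sales_order_id, delivered_quantity, approved_received_quantity, rejected_received_quantity)
SELECT
    dso.dim_sales_order_id,            -- dim SK resolved via the PK/BK lookup
    s.delivered_quantity,
    s.approved_received_quantity,
    s.rejected_received_quantity
FROM staging_receiving_inventories s
JOIN dim_sales_orders dso
    ON dso.sales_order_line_id = s.sales_order_line_id;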

PostgreSQL: Update does not work from one table to another: Query does not return results

What I'm trying to do:
Table A has a column A2 with many different values, each of them occurring one or several times. With a foreign key in column A4, table A points to another table B. That table contains data (in column B2) specifically about each of the values in A2. Hence I want to update another column A3 in table A with these data. A left join didn't work because only one occurrence of each value in A2 would be matched with the data, not all of them.
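For reference, a minimal sketch of the two tables as described (the column types are assumptions; the names are the placeholders used above):
CREATE TABLE table_B (
    column_B1 integer PRIMARY KEY,
    column_B2 text                                       -- data about each value of A2
);
CREATE TABLE table_A (
    column_A2 text,                                      -- many repeated values
    column_A3 text,                                      -- to be filled from B2
    column_A4 integer REFERENCES table_B (column_B1)     -- FK pointing at table_B
);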
I then tried the following:
UPDATE table_A
SET column_A3 = table_B.column_B2
FROM table_B
WHERE table_A.column_A4 = table_B.column_B1
However, the script returns the following:
"Query failed: Query does not return results"
This problem might be covered already somewhere, but I couldn't make sense of the suggestions. Would be great if someone could offer some help, thanks a lot!!
best, cuezumo
Edit:
This is how the code actually looks:
UPDATE "${projectKey}_stack_dr_pa"
SET "rideshare_startdate" = "${projectKey}_accepted_rideshares"."start"
FROM "${projectKey}_accepted_rideshares"
WHERE "${projectKey}_stack_dr_pa"."rideshare_id" = "${projectKey}_accepted_rideshares"."rideshare_id"

Postgresql get references from a dictionary

I'm trying to build a query to get the data from a table, but some of those columns have foreign keys that I would like to replace with the associated keyword, all in one query.
Basically there's
table A with column 1:PKA-ID and column 2:name.
table B with column 1:PKB-ID, column 2:FKA-ID, column 3:amount.
I want to get all the lines in table B but with all foreign keys replaced by the associated names in table A.
I started building a query with a subquery + alias to get that, but of course I get more than one result per subquery, and I can't find a way to link that subquery to the ID of table B from the main query [might be exhausted, dumb or both]. I did something like this:
SELECT (SELECT "NAME" FROM A JOIN B ON ID = FKA-ID) AS name, amount FROM TABLEB;
it feels like such a simple query, yet...
You don't need a join in the subselect.
SELECT pkb_id,
       (SELECT name FROM a WHERE a.pka_id = b.fka_id),
       amount
FROM b;
(See it live in SQL Fiddle).
The subselect query runs for each and every row of its parent select and has the parent row available from the context.
You can also use a simple join.
SELECT b.pkb_id, a.name, b.amount
FROM b, a
WHERE a.pka_id = b.fka_id;
Note that the join version puts fewer restrictions on the PostgreSQL query optimizer, so in some cases the join version might work faster. (For example, in PostgreSQL 9.6 the join might utilize multiple CPU units, cf. Parallel Query.)

Redshift Copy and auto-increment does not work

I am using the Redshift COPY command to copy JSON data from S3.
The table definition is as follows:
CREATE TABLE my_raw
(
    id BIGINT IDENTITY(1,1),
    ...
    ...
) diststyle even;
The COPY command I am using is as follows:
COPY my_raw FROM 's3://dev-usage/my/2015-01-22/my-usage-1421928858909-15499f6cc977435b96e610298919db26' credentials 'aws_access_key_id=XXXX;aws_secret_access_key=XXX' json 's3://bemole-usage/schemas/json_schema' ;
I am expecting that any newly inserted id will always be > select max(id) from my_raw. In fact that's clearly not the case.
If I issue the above copy command twice, the first time the ids run from 1 to N even though the file creates only 114 records (that's a known issue with Redshift when it has multiple shards). The second time the ids are also between 1 and N, but it uses free numbers that were not used in the first copy.
See below for a demo:
usagedb=# COPY my_raw FROM 's3://bemole-usage/my/2015-01-22/my-usage-1421930213881-b8afbe07ab34401592841af5f7ddb31c' credentials 'aws_access_key_id=XXXX;aws_secret_access_key=XXXX' json 's3://bemole-usage/schemas/json_schema' COMPUPDATE OFF;
INFO: Load into table 'my_raw' completed, 114 record(s) loaded successfully.
COPY
usagedb=#
usagedb=# select max(id) from my_raw;
max
------
4556
(1 row)
usagedb=# COPY my_raw FROM 's3://bemole-usage/my/2015-01-22/my-usage-1421930213881-b8afbe07ab34401592841af5f7ddb31c' credentials 'aws_access_key_id=XXXX;aws_secret_access_key=XXXX' json 's3://bemole-usage/schemas/my_json_schema' COMPUPDATE OFF;
INFO: Load into table 'my_raw' completed, 114 record(s) loaded successfully.
COPY
usagedb=# select max(id) from my_raw;
max
------
4556
(1 row)
Thx in advance
The only solution I found to make sure the ids are sequential based on the insertion order is to maintain a pair of tables. The first one is the stage table, into which the items are inserted by the COPY command. The stage table does not actually have an ID column.
Then I have another table that is an exact replica of the stage table, except that it has an additional column for the ids. Then there is a job that takes care of filling the master table from the stage table using the ROW_NUMBER() function.
In practice, this means executing the following statement after each Redshift COPY is performed:
insert into master (id, result_code, ct_timestamp, ...)
select
    #{startIncrement} + row_number() over (order by ct_timestamp) as id,
    result_code, ...
from stage;
Then the ids are guaranteed to be sequential/consecutive in the master table.
I can't reproduce your problem; however, it is interesting how to get identity columns set correctly in conjunction with COPY. Here is a small summary:
Be aware that you can specify the columns (and their order) for a copy command.
COPY my_table (col1, col2, col3) FROM s3://...
So if:
the EXPLICIT_IDS flag is NOT set,
no columns are listed as shown above,
and your input data does not contain values for the IDENTITY column,
then the identity values in the table will be set automatically and monotonically increasing, just as we all want.
From the docs:
If an IDENTITY column is included in the column list, then EXPLICIT_IDS must also be specified; if an IDENTITY column is omitted, then EXPLICIT_IDS cannot be specified. If no column list is specified, the command behaves as if a complete, in-order column list was specified, with IDENTITY columns omitted if EXPLICIT_IDS was also not specified.
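Applied to the table above, that would mean a COPY along these lines (a sketch; result_code and ct_timestamp stand in for the real non-identity columns, as in the insert statement earlier):
-- list only the non-IDENTITY columns and leave EXPLICIT_IDS off,
-- so Redshift generates the id values itself
COPY my_raw (result_code, ct_timestamp)
FROM 's3://bemole-usage/my/2015-01-22/my-usage-1421930213881-b8afbe07ab34401592841af5f7ddb31c'
CREDENTIALS 'aws_access_key_id=XXXX;aws_secret_access_key=XXXX'
JSON 's3://bemole-usage/schemas/json_schema' COMPUPDATE OFF;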