Talend tMap: set default value for rejected inner joins and connect them with the main data flow

I've got the following problem.
I have several tMaps, each of which has a lookup, and at the end all the data is written to a DB. The following mockup illustrates it:
There can be values in the main data stream which are not found in the lookup tables. For these values there is a reject path that catches them from the specific tMap.
Requirements:
In case of a rejected inner join, the looked-up value shall be set to a default value (for example 0, which could be done in the schema of the tMap), and then these "corrected" records should be added back to the "normal" main data flow and proceed to the next lookup.
The tUnite component cannot handle these cases because it cannot be used inside a data flow loop.
Does anybody have an idea how to solve this problem?
Cheers.

The answer was so easy that I didn't see it in the first conception. I just have to change the join model from inner join to left join, so all the formerly rejected records come through with a null value in the looked-up columns. Afterwards I can check those columns in the tMap and set them to a default value if they are null:
row1.id == null ? 0 : row1.id
Cheers.

If I understand correctly what you are trying to accomplish, you will have to have staging files or staging tables in the database. Once you get the rejected rows, write them to a file or table. The accepted rows also go to a staging table (different from the rejected one). Then you can union both tables or files by reading them back. The key point is having a staging structure. I attach a picture of how it would be; in the picture the staging structure is a MySQL table.
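For illustration, a rough sketch of the union step in SQL, assuming the accepted and rejected rows were staged in two MySQL tables called stage_accepted and stage_rejected, the looked-up column is lookup_value, and 0 is the default (all names are illustrative):

-- read both staging tables back as one stream;
-- rejected rows get the default value 0 for the missing lookup
SELECT id, lookup_value
FROM   stage_accepted
UNION ALL
SELECT id, 0 AS lookup_value
FROM   stage_rejected;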
Let me know if it helps!

Related

Performing a delete in an ADF mapping data flow - Updated with potential solution

I am trying to do upsert and delete in a mapping data flow.
There is a dimension table, DimCustomer.
It is being populated with data from a file.
If a Sha2 hash does not match then upsert.
If CustomerID is missing from the rawSource data, then delete (see image below for settings).
The upsert works, but the delete does not. It's likely because in the sink I have selected the CustomerID column as the key, but this means it can never delete a record if the entire record, including the key, is missing from the source.
Is there a prescribed design pattern for this scenario?
The easiest solution I can think of is a second data flow, in which the only CustomerIDs sent to the sink are the ones with no matching CustomerID in the source (effectively a right outer join), but I want to see if this is indeed the best way to do this.
Update:
The best solution I can come up with is to add an additional column to the above data flow, the formula for which is: coalesce(RawCustomerData#CustomerID,DimCustomer#CustomerID)
This ensures there is a CustID column that always has a value.
In the sink, I change the mapping so that this CustID column maps to the sink CustomerID.
The delete now works as expected. I'm still unsure if this is the best solution, but it works and doesn't appear to cause a major performance issue.
Per my experience, I think that's the best solution: adding a new column solves the problem and is much easier than other operations. The simplest and most effective way is the best solution. You don't need to create another data flow activity or redesign the alter-row logic.
Your Solution:
Add an additional column CustID, the formula for which is: coalesce(RawCustomerData#CustomerID,DimCustomer#CustomerID)
This ensures there is a CustID column that always has a value. In the sink, you change the mapping so that this custID maps to the sink CustomerID.
The delete now works as expected.
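For reference, a rough SQL analogue of what the data flow ends up doing (table and column names follow the question; Sha2Hash is an assumed name for the hash column):

-- full outer join of the incoming file against the dimension;
-- COALESCE guarantees a non-null key even when the source row is missing,
-- which is what lets the sink find the row it has to delete
SELECT COALESCE(src.CustomerID, dim.CustomerID) AS CustID,
       CASE
           WHEN src.CustomerID IS NULL       THEN 'delete'   -- gone from the source file
           WHEN src.Sha2Hash <> dim.Sha2Hash THEN 'upsert'   -- changed
           ELSE 'ignore'
       END AS row_action
FROM DimCustomer dim
FULL OUTER JOIN RawCustomerData src
  ON src.CustomerID = dim.CustomerID;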

Informatica SQ returns different result

I am trying to pull data from DB2 via Informatica. I have an SQ query that pulls a few fields based on joins across 4 different tables.
When I run the query directly in the database, it returns the expected result; however, when I run it in Informatica and use the debugger, I see something else.
Please note that the data in all columns matches perfectly, except for one single column.
Weird thing is, this is a calculated field from the table based on a case statement:
CASE WHEN Column1='3' THEN 'N' ELSE 'Y' END.
Since this is a calculated field with a string length of one, I have connected it from the source to the SQ through one of the source ports having a 1-character length.
This returns 'Y' when executed in the database, but when I copy and paste the same query into the SQ in Informatica and run it, I get the value 'E', which should never be possible since I expect only an 'N' or a 'Y'. I have verified the column order and that the field is in the right place. This is very strange; is something going wrong because of the CASE statement?
Save yourself the hassle: put an Expression transformation after the Source Qualifier, calculate the port value there, and then forget about it.
I think I found the issue. We use Informatica PowerExchange to connect to an AS400 system (DB2), and it seems that when we set flag information in AS400 and pass it to Informatica via PowerExchange, it gets converted to binary; to solve this, there needs to be an entry in the PowerExchange configuration file.
Unfortunately, I myself was not aware that it could be related to PowerExchange instead of PowerCenter itself!
Thanks for your assistance! Below is the KB about it.
https://kb.informatica.com/solution/4/Pages/17498.aspx

Updating the table through tOracleOutput in Talend using an additional SQL query

I have a job where I am getting a flow into tOracleOutput and updating the table. Now, I have to update that table using an SQL statement, and I guess there is an option for this in the Advanced settings of tOracleOutput, but I don't know how to use it; you could say I am not understanding the settings properly. I referred to the official documentation but could not understand it. Can anyone explain the fields like Name, SQL expression, Position and Reference Column in a better way?
The SQL query which I am using is:
update set COL1=SOMETHING1
where COL2=SOMETHING2
Now, the value for COL1 is coming from the flow, but COL2 is a column in the table which is not coming from the flow.
Have a look at tOracleRow for such a case.
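A minimal sketch of the kind of statement a tOracleRow could run here, assuming the table is called MY_TABLE (a placeholder, since the real name is not given) and the flow value is bound through the component's prepared-statement parameters in the Advanced settings:

-- COL1 is bound to the value coming from the flow;
-- 'SOMETHING2' stands for whatever condition identifies the rows to update
UPDATE MY_TABLE
SET    COL1 = ?
WHERE  COL2 = 'SOMETHING2';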
Hope this helps.
TRF
Using tOracleOutput is helpful when you have a ready data source (a table or file with the same columns as the destination). The more elaborate your query is, the more you should do as TRF said (and use tOracleRow), but here's an example for your question:
The file contains 3 columns.
The destination DB table contains 4 columns, where the 4th is the date of the update (the first 3 are identical to the input).
So you add the destination column's name in Name, put the SQL function for the date (e.g. SYSDATE) in SQL expression, and specify where to put it (Position) relative to a column of your choice (Reference Column).
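Concretely, for that example the Additional Columns table in the Advanced settings would hold something like this (LAST_UPDATE and COL3 are illustrative names):

Name: LAST_UPDATE | SQL expression: SYSDATE | Position: After | Reference column: COL3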
In my view it helps avoid using a tMap for a single additional column when you want to Insert; but you want to Update, in which case the component doesn't offer the additional columns section, plus I don't think you can add the WHERE clause here.
Hope it helps

How to tell if record has changed in Postgres

I have a bit of an "upsert" type of question... but, I want to throw it out there because it's a little bit different than any that I've read on stackoverflow.
Basic problem.
I'm working on moving from MySQL to PostgreSQL 9.1.5 (hosted on Heroku). As part of that, I need to import multiple CSV files every day. Some of the data is sales information and is almost guaranteed to be new and need to be inserted. But other parts of the data are almost guaranteed to be the same. For example, the CSV files (note plural) will have POS (point of sale) information in them. This rarely changes (and is most likely only via additions). Then there is product information. There are about 10,000 products (the vast majority will be unchanged, but it's possible to have both additions and updates).
The final item (but an important one) is that I have a requirement to provide an audit trail/information for any given item. For example, if I add a new POS record, I need to be able to trace that back to the file it was found in. If I change a UPC code or description of a product, then I need to be able to trace it back to the import (and file) where the change came from.
Solution that I'm contemplating.
Since the data is provided to me via CSV, I'm working around the idea that COPY will be the best/fastest way. The structure of the data in the files is not exactly what I have in the database (i.e. the final destination). So I'm copying them into tables in a staging schema that match the CSV (note: one schema per data source). The tables in the staging schemas will have before-insert row triggers. These triggers can decide what to do with the data (insert, update or ignore).
For the tables that are most likely to contain new data, the trigger will try to insert first. If the record is already there, it will return NULL (and stop the insert into the staging table). For tables that rarely change, it will query the table and see if the record is found. If it is, then I need a way to see if any of the fields have changed (because, remember, I need to show that the record was modified by import x from file y). I obviously can just boilerplate out the code and test each column, but I was looking for something a little more "elegant" and more maintainable than that.
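For illustration, a minimal sketch of such a before-insert trigger for one of the "mostly new data" tables, assuming a staging table staging.pos_records with the same key and columns as public.pos_records (all names and columns are illustrative):

CREATE OR REPLACE FUNCTION staging.pos_records_bi()
RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
    IF NOT EXISTS (SELECT 1 FROM public.pos_records p WHERE p.pos_id = NEW.pos_id) THEN
        -- new record: push it through to the real table and keep the staging row as the audit trail
        INSERT INTO public.pos_records (pos_id, pos_name, import_id)
        VALUES (NEW.pos_id, NEW.pos_name, NEW.import_id);
        RETURN NEW;
    END IF;
    RETURN NULL;  -- already known: drop the row from the staging insert
END
$$;

CREATE TRIGGER pos_records_bi
BEFORE INSERT ON staging.pos_records
FOR EACH ROW EXECUTE PROCEDURE staging.pos_records_bi();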
In a way, what I'm doing is combining an importing system with an audit trail system. So, in researching audit trails, I reviewed the following wiki.postgresql.org article. It seems like hstore might be a nice way of getting the changes (and of easily ignoring some columns in the table that aren't important, e.g. "last_modified").
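A sketch of that hstore idea for the "rarely changes" tables, assuming the hstore extension is installed and there is a staging.products table matching public.products, keyed on product_id (again, names are illustrative):

-- list which columns actually changed, ignoring the unimportant ones
SELECT s.product_id,
       akeys(hstore(s) - hstore(p) - ARRAY['last_modified']) AS changed_columns
FROM   staging.products s
JOIN   public.products  p USING (product_id)
WHERE  hstore(s) - ARRAY['last_modified']
    <> hstore(p) - ARRAY['last_modified'];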
I'm about 90% sure it will all work... I've created some testing tables etc and played around with it.
My question?
Is there a better, more maintainable way of accomplishing this task of finding the maybe 3 records out of 10K that require a change to the database? I could certainly write a Python script (or something else) that reads the file and tries to figure out what to do with each record, but that feels horribly inefficient and will lead to lots of round trips.
A few final things:
I don't have control over the input files. I would love it if they only sent me the deltas, but they don't and it's completely outside of my control or influence.
The system will grow, and new data sources are likely to be added that will greatly increase the amount of data being processed (so I'm trying to keep things efficient).
I know this is not a nice, simple SO question (like "how to sort a list in python"), but I believe one of the great things about SO is that you can ask hard questions and people will share their thoughts about how they think the best way to solve it is.
I have lots of similar operations. What I do is COPY to temporary staging tables:
CREATE TEMP TABLE target_tmp AS
SELECT * FROM target_tbl LIMIT 0; -- only copy structure, no data
COPY target_tmp FROM '/path/to/target.csv';
For performance, run ANALYZE - temp. tables are not analyzed by autovacuum!
ANALYZE target_tmp;
Also for performance, maybe even create an index or two on the temp table, or add a primary key if the data allows for that.
ALTER TABLE target_tmp ADD CONSTRAINT target_tmp_pkey PRIMARY KEY(target_id);
You don't need the performance stuff for small imports.
Then use the full scope of SQL commands to digest the new data.
For instance, if the primary key of the target table is target_id ..
Maybe DELETE what isn't there any more?
DELETE FROM target_tbl t
WHERE NOT EXISTS (
SELECT 1 FROM target_tmp t1
WHERE t1.target_id = t.target_id
);
Then UPDATE what's already there:
UPDATE target_tbl t
SET col1 = t1.col1
FROM target_tmp t1
WHERE t.target_id = t1.target_id
To avoid empty UPDATEs, simply add:
...
AND col1 IS DISTINCT FROM t1.col1; -- repeat for relevant columns
Or, if the whole row is relevant:
...
AND t IS DISTINCT FROM t1; -- check the whole row
Then INSERT what's new:
INSERT INTO target_tbl(target_id, col1)
SELECT t1.target_id, t1.col1
FROM target_tmp t1
LEFT JOIN target_tbl t USING (target_id)
WHERE t.target_id IS NULL;
Clean up if your session goes on (temp tables are dropped at end of session automatically):
DROP TABLE target_tmp;
Or use ON COMMIT DROP or similar with CREATE TEMP TABLE.
Code untested, but should work in any modern version of PostgreSQL except for typos.

How to use Pentaho Data Integration to copy columns between tables

I thought this would be an easy task, but since I am new to PDI, I could not
find out so far which transform to choose to accomplish the following:
I am using Pentaho Data Integration (formerly Kettle), Community Edition, to map/copy values from one table ('tasksA') of one database 'A' to another table
'tasksB' in another database B. tasksA has a column 'description' and I want
to copy these values to the column 'taskName' in 'tasksB'.
Furthermore, I have to copy each value of 'description' several times, since
in 'tasksB', there are multiple lines for each value in 'taskName'.
Maybe this would be possible with direct SQL, but I wanted to try whether
I can define this in a more readable way with PDI, especially because in the next step I will have to extend it to other tables involved.
So I have to specify which value of 'description' maps onto which value of 'taskName', and that in every tuple containing this value (well, sounds like a WHERE clause...) the value in the column 'taskName' should be replaced.
My first experiments with the 'Table input' and 'Table output' steps did not work: I simply drew a hop between them and modified the 'Database fields' tab of the 'Table output' step, which generated 'drop column' statements in the resulting SQL, and that is not what I want. I don't want to modify the schema, just copy the values.
It would be great if someone could point me to the right steps/transforms needed. I worked through the first examples from the Pentaho Wiki and have got the 'Pentaho Kettle Solutions' book by Casters et al., but could not find out how to solve this. Many thanks in advance for any help.
If I got this right, you should use a Table Input connected to an "Insert/Update" step.
On the Insert/Update step you need to specify the keys from tasksA that should be looked up in tasksB. Then define which fields of tasksB should be updated: description (as the stream field) -> taskName (as the table field).
Keep in mind that if the key is not found, a row will be inserted into tasksB. If that is not what you plan, you'll need to build something like: Table Input -> Database Lookup -> Filter Rows -> Insert/Update
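For reference, a sketch of what the Insert/Update step effectively does for each incoming row (taskId is an assumed lookup key; adjust it to whatever key relates the two tables):

-- when the key is found in tasksB, the matching row is updated
UPDATE tasksB SET taskName = ? WHERE taskId = ?;
-- when the key is not found, a new row is inserted
INSERT INTO tasksB (taskId, taskName) VALUES (?, ?);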
#RFVoltolini has a good answer. Alternatively you could go
Table Input -> Update
And connect the error output to something else like a Text file output.