Pentaho kettle join streams with different fields - merge

I have multiple steps in my transformation where I join data from different streams (files). All streams have a common field, ID_Time.
Now when I try to join 2 streams on ID_Time, let's say using Multiway Merge Join, I get the fields ID_Time and ID_Time_1. I'd like to merge those into one, so that we get rid of the _1 column. If the two streams do not have the same columns, I would like the missing ones to be null.
An example:
ID Number1 Number2 ID_Time
1 5,06 215 12154
2 5121 121 151
ID CarModel CarManufacturer ID_Time
1 CX3 Mazda 12
2 V40 Volvo 39
So I would like this as the end result:
ID Number1 Number2 CarModel CarManufacturer ID_Time
1 5,06 215 null null 12154
2 5121 121 null null 151
3 null null CX3 Mazda 12
4 null null V40 Volvo 39
Maybe I have to use a different join? Or on different keys?

The output of both Merge join and Multiway Merge join will always include all fields from all streams; if fields are repeated they will be suffixed with _1, _2, etc.
One obvious case is when the join keys have the same name (which is often the case), but you may also have other fields with the same name.
There are several ways to go about it, but the easiest would be a Select values step after the join to remove the duplicated fields (one of the tabs of Select values is called "Remove"; it removes any fields you specify).
Your final ID field can be added after the merge with an Add sequence step.
Beware: always sort the inputs on the join keys before joining, as both merge steps assume the input data to be sorted.
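As a rough illustration outside of Kettle, the sketch below mimics the effect described above in plain Python: a full outer join on ID_Time, one key column instead of ID_Time and ID_Time_1, None for fields a row does not have, and a fresh ID assigned afterwards the way an Add sequence step would. The function name merge_streams, its drop parameter, and the dictionary rows are assumptions made for this example, not Kettle APIs. It also uses dictionaries rather than a sorted merge, so unlike the Kettle steps it does not need sorted input.

```python
def merge_streams(left, right, key, drop=("ID",)):
    """Full outer join of two lists of dict rows on `key`, with one key column."""
    # Non-key fields of both streams, in order of first appearance.
    fields = []
    for row in left + right:
        for name in row:
            if name != key and name not in drop and name not in fields:
                fields.append(name)

    # Index both streams by the join key (assumes the key is unique per stream).
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}

    # Left keys first, then keys that only exist on the right.
    keys = list(left_by_key) + [k for k in right_by_key if k not in left_by_key]

    merged = []
    for new_id, k in enumerate(keys, start=1):
        combined = {**right_by_key.get(k, {}), **left_by_key.get(k, {})}
        out = {"ID": new_id}                # fresh sequence, like Add sequence
        for name in fields:
            out[name] = combined.get(name)  # None when the stream lacks the field
        out[key] = k                        # single merged key column
        merged.append(out)
    return merged


# Sample rows taken from the question.
numbers = [
    {"ID": 1, "Number1": "5,06", "Number2": 215, "ID_Time": 12154},
    {"ID": 2, "Number1": "5121", "Number2": 121, "ID_Time": 151},
]
cars = [
    {"ID": 1, "CarModel": "CX3", "CarManufacturer": "Mazda", "ID_Time": 12},
    {"ID": 2, "CarModel": "V40", "CarManufacturer": "Volvo", "ID_Time": 39},
]

result = merge_streams(numbers, cars, "ID_Time")
for row in result:
    print(row)
```

With the sample data from the question, no ID_Time value appears in both streams, so every output row has nulls for the other stream's fields.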

PostgreSQL WHERE Clause what is included/excluded?

I have come across a PostgreSQL query with a lot of WHERE conditions like this
WHERE (category <> 'A' OR category IS NULL)
AND (category <> 'B' OR category IS NULL)
I am struggling to understand what data this query is including/excluding.
I tried rewriting the code above as
(1) WHERE category NOT IN ('A','B')
(2) WHERE category NOT IN ('A', 'B') OR category IS NULL
(3) WHERE (category NOT IN ('A', 'B') OR category IS NULL)
And all three gave different answers to the original code.
Could someone explain to me what data is included/excluded in each of the four cases above?
Say for example the data looked like
ID Category
1 A
2 B
3 C
4 D
5 NULL
For (1) I would just get IDs 3 and 4. But I am unsure about the others.
EDIT: WHERE (category NOT IN ('A', 'B') OR category IS NULL) and
WHERE (category <> 'A' OR category IS NULL)
AND (category <> 'B' OR category IS NULL)
give the same answer.
But WHERE category NOT IN ('A', 'B') OR category IS NULL without parentheses gives a different answer.
To understand the output of these queries, go through the rows one at a time and keep only those for which the WHERE clause evaluates to TRUE.
Query 1
WHERE category NOT IN ('A','B')
Query 1 returns all rows whose category is not in the specified set.
If you proceed step by step, one row at a time, you can see that:
the first 2 rows are not included in the output since the category column contains values in the set ('A','B')
the next 2 rows are included in the output since the category column doesn't contain values in the set ('A','B')
the last row is not included in the output since a comparison with NULL evaluates to UNKNOWN under Three-Valued Logic
To better understand the last point, the clause WHERE category NOT IN ('A','B') can be rewritten as WHERE category<>'A' AND category<>'B'. Since category is NULL, the expression becomes NULL<>'A' AND NULL<>'B', whose result is UNKNOWN, so the row is not included in the output.
Queries 2 & 3
WHERE category NOT IN ('A', 'B') OR category IS NULL
Queries 2 and 3 are the same, since the parentheses in this case don't affect the evaluation order of the logical operators. A difference can only appear when this condition is combined with further conditions using AND, because AND binds more tightly than OR.
In this particular case the last row of the example table above is included in the output, since category NOT IN ('A','B') evaluates to UNKNOWN and category IS NULL evaluates to TRUE. Under Three-Valued Logic, UNKNOWN OR TRUE is TRUE.
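The evaluation described above can be checked with a small sketch. It uses Python's built-in sqlite3 module with an in-memory SQLite database as a stand-in for PostgreSQL, on the assumption that both treat NULL the same way for <>, NOT IN, IS NULL, AND and OR. The table name t, the helper ids, and the extra condition id < 3 are made up for this example.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, category TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, "A"), (2, "B"), (3, "C"), (4, "D"), (5, None)],
)


def ids(where):
    """Return the ids selected by the given WHERE clause, in id order."""
    rows = conn.execute(f"SELECT id FROM t WHERE {where} ORDER BY id")
    return [r[0] for r in rows]


original = ids(
    "(category <> 'A' OR category IS NULL) "
    "AND (category <> 'B' OR category IS NULL)"
)
query1 = ids("category NOT IN ('A','B')")
query2 = ids("category NOT IN ('A', 'B') OR category IS NULL")
query3 = ids("(category NOT IN ('A', 'B') OR category IS NULL)")

# Hypothetical extra condition: once something is ANDed on,
# the parentheses matter because AND binds more tightly than OR.
with_parens = ids("id < 3 AND (category NOT IN ('A', 'B') OR category IS NULL)")
without_parens = ids("id < 3 AND category NOT IN ('A', 'B') OR category IS NULL")

print("original:      ", original)
print("query 1:       ", query1)
print("query 2:       ", query2)
print("query 3:       ", query3)
print("with parens:   ", with_parens)
print("without parens:", without_parens)
```

On their own, queries 2 and 3 select the same rows as the original. The last two calls suggest why the unparenthesized form can still differ inside a longer WHERE clause: the trailing OR category IS NULL then applies to the whole clause instead of only to the NOT IN test.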
I think you are struggling with the logic of the query. The first one is easy to understand:
where category not in ('A','B')
You will get ids 3 and 4; the NULL row is left out because a comparison with NULL is never true.
Note that it is better to use the ids of the categories than the letters.
In the second query
where category not in ('A','B') or category is null
a row is returned if either condition is true, so the NULL row comes back as well:
you will get ids 3, 4 and 5.
The third one has to give you the same output as the second.

How to capture the Null values(custname) and respective CustID in separate file and the rest of the CustID's in other file

CustID CustNAme
10 Ally
20 null
30 null
40 Liza
50 null
60 Mark
You need to generate an artificial key (e.g. line number) on each file. Then have the source of CustID as the stream input to a Lookup stage, and the source of CustName as the reference input of the Lookup stage, where the lookup key is LineNumber. Set the Lookup Failed rule to suit your own needs.
One way to generate line numbers is a Column Generator stage operating in sequential mode.
You can use a Transformer stage with two output links. Use the output link constraints to check for null values and split the stream.
As the constraints, just write IsNull(DSLink2.CustNAme) on one link and IsNotNull(DSLink2.CustNAme) on the other.
Note: You can also write !IsNull(col) or Not(IsNull(col)) for IsNotNull(col).
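The two link constraints amount to a simple partition of the rows. The plain Python sketch below only illustrates that logic and is not DataStage code; the function name split_by_null and the dictionary rows are assumptions for this example, while the data comes from the question.

```python
def split_by_null(rows, column):
    """Return (rows where `column` is null, all remaining rows)."""
    null_rows = [row for row in rows if row[column] is None]       # IsNull(...)
    other_rows = [row for row in rows if row[column] is not None]  # IsNotNull(...)
    return null_rows, other_rows


customers = [
    {"CustID": 10, "CustNAme": "Ally"},
    {"CustID": 20, "CustNAme": None},
    {"CustID": 30, "CustNAme": None},
    {"CustID": 40, "CustNAme": "Liza"},
    {"CustID": 50, "CustNAme": None},
    {"CustID": 60, "CustNAme": "Mark"},
]

null_rows, other_rows = split_by_null(customers, "CustNAme")
print("null names:", [row["CustID"] for row in null_rows])
print("others:    ", [row["CustID"] for row in other_rows])
```

Each list corresponds to one output link, which would then be written to its own file.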

Spring jpa combine two columns and order by one column and find top

I have a table where I need to fetch rows based on two columns and find the top records ordered by one column.
---------------------------------------------------------------------
id product_id sub_product_id created_dttm
---------------------------------------------------------------------
1 1 2 01-02-2021 07:03:25
2 2 1 01-01-2021 08:03:25
3 1 2 01-02-2021 09:03:25
4 2 1 01-02-2021 08:03:25
In the above table, I have to get the rows for each product_id and sub_product_id combination, ordered by created_dttm, so that I get the last inserted records.
In Spring Data JPA I am trying a derived query as below:
public List<DailyStock> findTopByProductIdAndSubProductIdOrderByCreatedDttmDesc();
I want rows 3 and 4 to be returned as a list.
While executing the above query, I am getting this error:
expects at least 1 arguments but only found 0. This leaves an operator of type SIMPLE_PROPERTY for property productTypeId unbound.
You should add arguments to your method, one for each property named in the method name:
public List<DailyStock> findTopByProductIdAndSubProductIdOrderByCreatedDttmDesc(
Long productId,
Long subProductId
);
I assume that both of these columns are mapped to Long properties of the DailyStock class.
UPDATE:
If you don't want to pass any arguments, you should rename your method so that these columns are no longer part of the filtering query:
public List<DailyStock> findTopByOrderByCreatedDttmDesc();

CASE in JOIN not working PostgreSQL

I got the following tables:
Teams
Matches
I want to get an output like:
matches.semana | teams.nom_equipo | teams.nom_equipo | Winner
1 AMERICA CRUZ AZUL AMERICA
1 SANTOS MORELIA MORELIA
1 LEON CHIVAS LEON
Both teams.nom_equipo columns come from the teams table: matches.num_eqpo_lo and matches.num_eqpo_v hold team ids, and each id is looked up in teams to get the name of the team.
Edit: I have used the following:
SELECT m.semana, t_loc.nom_equipo AS LOCAL, t_vis.nom_equipo AS VISITANTE,
CASE WHEN m.goles_loc > m.goles_vis THEN 'home'
WHEN m.goles_vis > m.goles_loc THEN 'visitor'
ELSE 'tie'
END AS Vencedor
FROM matches AS m
JOIN teams AS t_loc ON (m.num_eqpo_loc = t_loc.num_eqpo)
JOIN teams AS t_vis ON (m.num_eqpo_vis = t_vis.num_eqpo)
ORDER BY m.semana;
But as you can see in the Matches table, row #5 has 2 in the goles_loc column (home team) and 2 in the goles_vis column (visitor), which is a tie, yet when I run the code I get something that is not a tie:
Matches' score
Resultset from Select:
I also noticed that from row #5 onwards the names of both teams in the matches are not correct (both visitor and home team).
So the SELECT brings back correct data, but in an order different from the original order of the matches table.
The order from the second week must be:
matches.semana | teams.nom_equipo | teams.nom_equipo | Winner
5 2 CRUZ AZUL TOLUCA TIE
6 2 MORELIA LEON LEON
7 2 CHIVAS SANTOS TIE
Row 8 of the result set must be row #5, and so on.
Any help would be really appreciated!
When doing a SELECT which includes null for a column, that's the value it will always be, so winner in your case will never be populated.
Something like this is probably more along the lines of what you want:
SELECT m.semana, t_loc.nom_equipo AS loc_equipo, t_vis.nom_equipo AS vis_equipo,
CASE WHEN m.goles_loc - m.goles_vis > 0 THEN t_loc.nom_equipo
WHEN m.goles_vis - m.goles_loc > 0 THEN t_vis.nom_equipo
ELSE NULL
END AS winner
FROM matches AS m
JOIN teams AS t_loc ON (m.num_eqpo_loc = t_loc.num_eqpo)
JOIN teams AS t_vis ON (m.num_eqpo_vis = t_vis.num_eqpo)
ORDER BY m.semana;
Untested, but this should provide the general approach. Basically, you JOIN to the teams table twice, but using different conditions, and then you need to calculate the scores. I'm using NULL to indicate a tie, here.
Edit in response to comment from OP:
It's the same table -- teams -- but the JOINs produce different results, because the query uses different JOIN conditions in each JOIN.
The first JOIN, for t_loc, compares m.num_eqpo_loc to t_loc.num_eqpo. This means it gets the teams rows for the home team.
The second JOIN, for t_vis, compares m.num_eqpo_vis to t_vis.num_eqpo. This means it gets the teams rows for the visiting team.
Therefore, in the CASE expression, t_loc refers to the home team while t_vis refers to the visiting one, so the correct name can be picked for the winner.
Edit in response to follow-up comment from OP:
My original query was sorting by m.semana only, which means rows within the same semana can appear in any order (essentially whichever Postgres feels is most efficient).
If you want the result to be sorted exactly the same way as the matches table, then use the same tuple of columns in the ORDER BY.
So, the ORDER BY clause would then become:
ORDER BY m.semana, m.num_eqpo_loc, m.num_eqpo_vis
Basically, the matches table PRIMARY KEY tuple.
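To see the double JOIN and the CASE expression working together, here is a small self-contained sketch using Python's sqlite3 module with an in-memory SQLite database instead of PostgreSQL. The column names follow the query in the question; the table definitions, team ids and scores are invented for illustration, since the original tables are not shown. It labels a draw as 'TIE' as in the asker's desired output, rather than NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data: only the column names come from the question.
conn.executescript("""
CREATE TABLE teams (num_eqpo INTEGER PRIMARY KEY, nom_equipo TEXT);
CREATE TABLE matches (
    semana INTEGER,
    num_eqpo_loc INTEGER,
    num_eqpo_vis INTEGER,
    goles_loc INTEGER,
    goles_vis INTEGER
);
INSERT INTO teams VALUES
    (1, 'AMERICA'), (2, 'CRUZ AZUL'), (3, 'SANTOS'), (4, 'MORELIA');
INSERT INTO matches VALUES
    (1, 1, 2, 3, 1),
    (1, 3, 4, 0, 2),
    (2, 2, 4, 2, 2);
""")

query = """
SELECT m.semana,
       t_loc.nom_equipo AS loc_equipo,
       t_vis.nom_equipo AS vis_equipo,
       CASE WHEN m.goles_loc > m.goles_vis THEN t_loc.nom_equipo
            WHEN m.goles_vis > m.goles_loc THEN t_vis.nom_equipo
            ELSE 'TIE'
       END AS winner
FROM matches AS m
JOIN teams AS t_loc ON m.num_eqpo_loc = t_loc.num_eqpo  -- home team
JOIN teams AS t_vis ON m.num_eqpo_vis = t_vis.num_eqpo  -- visiting team
ORDER BY m.semana, m.num_eqpo_loc, m.num_eqpo_vis
"""

results = conn.execute(query).fetchall()
for row in results:
    print(row)
```

Ordering by all three columns makes the row order deterministic, which ordering by m.semana alone does not guarantee.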

Group by with projections for Query resulting more than 22 columns in Slick

Please consider following scenario
1) Table 1 contains 16 columns
2) Table 2 contains 18 columns
I need to select all the columns from these tables, and since the total number of columns (34) is more than 22 (the limit on tuple fields), I am using projections as shown below to map the fields to case class objects. Is there any other approach to achieve the same in the context of the requirement below?
I further want to apply a group by on the 34 selected columns, using the same number of columns or one or two fewer depending on user input, e.g. group by 32 columns and apply sum on the remaining two columns. The group by code looks as shown below:
implicit s =>
val query = for
{
(t1, t2) <- Table1 leftJoin Table2 on
(_.A1 === _.A2)
}yield(
(t1.A1~t1.A2........~t1.A16)<>(Table1Rec,Table1Rec.unapply _)
(t2.A1~t2.A2........~t2.A18)<>(Table2Rec,Table2Rec.unapply _)
)
query.groupBy(tt=>(tt._1,tt._2)).map{
case ttt => ????
}
But I failed to get the right syntax for doing a group by using mapped projections. Can someone please provide insight into the right way to achieve this? An earlier post on this scenario did not provide a resolution, hence posting with a simplified example.