Joining two one-to-many tables duplicates records - postgresql

I have 3 tables, Transaction, Transaction_Items and Transaction_History.
Where the Transaction is the parent table, while Transaction_Items and Transaction_History are the children tables, with one to many relationship.
When i try to join those tables together, if i have 2+ Transaction_History records, or 2+ Transaction_Items i get duplicated or triplicated record results.
This is the SQL query im currently using which works, but what worries me that in the future if i have to Join another one-to-many table, it will duplicate the results again.
I found a workaround for this, but i was just wondering if there is a better and cleaner way to do this ?
The results should be a PostgreSQL JSON array which will contain the Transaction_Items and Transaction_History
SELECT
TR.id AS transaction_id,
TR.transaction_number,
TR.status,
TR.status AS status,
to_json(TR_INV.list),
COUNT(TR_INV) item_cnt,
COUNT(THR) tr_cnt,
json_agg(THR)
FROM transaction_transaction AS TR
LEFT JOIN (
SELECT
array_agg(t) list, -- this is a workaround method
t.transaction_id
FROM (
SELECT
TR_INV.transaction_id transaction_id,
IT.id,
IT.stock_number,
CAT.key category_key,
ITP.description description,
ITP.serial_number serial_number,
ITP.color color,
ITP.manufacturer manufacturer,
ITP.inventory_model inventory_model,
ITP.average_cost average_cost,
ITP.location_in_store location_in_store,
ITP.firearm_caliber firearm_caliber,
ITP.federal_firearm_number federal_firearm_number,
ITP.sold_price sold_price
FROM transaction_transaction_item TR_INV
LEFT JOIN inventory_item IT ON IT.id = TR_INV.item_id
LEFT JOIN inventory_itemprofile ITP ON ITP.id = IT.current_profile_id
LEFT JOIN inventory_category CAT ON CAT.id = ITP.category_id
LEFT JOIN inventory_categorytype CAT_T ON CAT_T.id = CAT.category_type_id
) t
GROUP BY t.transaction_id
) TR_INV ON TR_INV.transaction_id = TR.id
LEFT JOIN transaction_transactionhistory THR ON THR.transaction_id = TR.id
AND (THR.audit_code_id = 44 OR THR.audit_code_id = 27 OR THR.audit_code_id = 28)
WHERE TR.store_id = 21
AND TR.transaction_type = 'Pawn_Loan' AND TR.date_made >= '2018-10-08'
GROUP BY TR.id, TR_INV.list

What you want to do can be achieved by not using joins, as shown below.
Because your actual tables have so many columns that I don't know and should not care. I just created the simplest forms of them for demonstration.
CREATE TABLE transactions (
tid serial PRIMARY KEY,
name varchar(40) NOT NULL
);
CREATE TABLE transaction_histories (
hid serial PRIMARY KEY ,
tid integer REFERENCES transactions(tid),
history varchar(40) NOT NULL
);
CREATE TABLE transaction_items (
iid serial PRIMARY KEY ,
tid integer REFERENCES transactions(tid),
item varchar(40) NOT NULL
);
INSERT INTO transactions(tid,name) Values(1, 'transaction');
INSERT INTO transaction_histories(tid, history) Values(1, 'history1');
INSERT INTO transaction_histories(tid, history) Values(1, 'history2');
INSERT INTO transaction_items(tid, item) Values(1, 'item1');
INSERT INTO transaction_items(tid, item) Values(1, 'item2');
select
t.*,
(select count(*) from transaction_histories h where h.tid= t.tid) h_count ,
(select json_agg(h) from transaction_histories h where h.tid= t.tid) h ,
(select count(*) from transaction_items i where i.tid= t.tid) i_count ,
(select json_agg(i) from transaction_items i where i.tid= t.tid) i
from transactions t;

Related

JOIN with array of ids returns duplicate root records instead of just one

I'm trying to join several tables and pull out each DISTINCT root record (from table_a), but for some reason I keep getting duplicates. Here is my select query:
Fiddle
select
ta.id,
ta.table_a_name as "tableName"
from my_schema.table_a ta
left join my_schema.table_b tb
on (tb.table_a_id = ta.id)
left join my_schema.table_c tc
on (tc.table_b_id = tb.id)
left join my_schema.table_d td
on (td.id = any(tc.table_d_ids))
where td.id = any(array[100]);
This returns the following:
[
{
"id": 2,
"tableName": "Root record 2"
},
{
"id": 2,
"tableName": "Root record 2"
}
]
But I am only expecting, in this case,
[
{
"id": 2,
"tableName": "Root record 2"
}
]
What am I doing wrong here?
Here's the fiddle and, just in case, the create and insert statements below:
create schema if not exists my_schema;
create table if not exists my_schema.table_a (
id serial primary key,
table_a_name varchar (255) not null
);
create table if not exists my_schema.table_b (
id serial primary key,
table_a_id bigint not null references my_schema.table_a (id)
);
create table if not exists my_schema.table_d (
id serial primary key
);
create table if not exists my_schema.table_c (
id serial primary key,
table_b_id bigint not null references my_schema.table_b (id),
table_d_ids bigint[] not null
);
insert into my_schema.table_a values
(1, 'Root record 1'),
(2, 'Root record 2'),
(3, 'Root record 3');
insert into my_schema.table_b values
(10, 2),
(11, 2),
(12, 3);
insert into my_schema.table_d values
(100),
(101),
(102),
(103),
(104);
insert into my_schema.table_c values
(1000, 10, array[]::int[]),
(1001, 10, array[100]),
(1002, 11, array[100, 101]),
(1003, 12, array[102]),
(1004, 12, array[103]);
Short answer is use distinct, and this will get the results you want:
select distinct
ta.id,
ta.table_a_name as "tableName"
from my_schema.table_a ta
left join my_schema.table_b tb
on (tb.table_a_id = ta.id)
left join my_schema.table_c tc
on (tc.table_b_id = tb.id)
left join my_schema.table_d td
on (td.id = any(tc.table_d_ids))
where td.id = any(array[100]);
That said, this doesn't sit well with me because I assume this is not the end of your query.
The root issue is that you have two records from table_b - table_d that match this criteria. If you follow the breadcrumbs back, you will see there really are two matches:
select
ta.id,
ta.table_a_name as "tableName", tb.*, tc.*, td.*
from my_schema.table_a ta
left join my_schema.table_b tb
on (tb.table_a_id = ta.id)
left join my_schema.table_c tc
on (tc.table_b_id = tb.id)
left join my_schema.table_d td
on (td.id = any(tc.table_d_ids))
where td.id = any(array[100]);
So 'distinct' is just a lazy fix to say if there are dupes, limit it to one...
My next question is, is there more to it than this? What's supposed to happen next? Do you really just want candidates from table_a, or is this part 1 of a longer issue? If there is more to it, then there is likely a better solution than a simple select distinct.
-- edit 10/1/2022 --
Based on your comment, I have one final suggestion. Because this really all there is to your output AND you don't actually need the data from the b/c/d tables, then I think a semi-join is a better solution.
It's slightly more code (not going to win any golf or de-obfuscation contents), but it's much more efficient than a distinct or group by all columns. The reason is a distinct pulls every row result and then has to order and remove dupes. A semi-join, by contrast, will "stop looking" once it finds a match. It also scales very well. Almost every time I see a distinct misused, it's better served by a semi-join.
select
ta.id,
ta.table_a_name as "tableName"
from my_schema.table_a ta
where exists (
select null
from
table_b tb,
table_c tc,
table_d tc
where
tb.table_a_id = ta.id and
tc.table_b_id = tb.id and
td.id = any(tc.table_d_ids) and
td.id = any(array[100])
)
I didn't suggest this initially because I was unclear on the "what next."

How can I SUM distinct records in a Postgres database where there are duplicate records?

Imagine a table that looks like this:
The SQL to get this data was just SELECT *
The first column is "row_id" the second is "id" - which is the order ID and the third is "total" - which is the revenue.
I'm not sure why there are duplicate rows in the database, but when I do a SUM(total), it's including the second entry in the database, even though the order ID is the same, which is causing my numbers to be larger than if I select distinct(id), total - export to excel and then sum the values manually.
So my question is - how can I SUM on just the distinct order IDs so that I get the same revenue as if I exported to excel every distinct order ID row?
Thanks in advance!
Easy - just divide by the count:
select id, sum(total) / count(id)
from orders
group by id
See live demo.
Also handles any level of duplication, eg triplicates etc.
You can try something like this (with your example):
Table
create table test (
row_id int,
id int,
total decimal(15,2)
);
insert into test values
(6395, 1509, 112), (22986, 1509, 112),
(1393, 3284, 40.37), (24360, 3284, 40.37);
Query
with distinct_records as (
select distinct id, total from test
)
select a.id, b.actual_total, array_agg(a.row_id) as row_ids
from test a
inner join (select id, sum(total) as actual_total from distinct_records group by id) b
on a.id = b.id
group by a.id, b.actual_total
Result
| id | actual_total | row_ids |
|------|--------------|------------|
| 1509 | 112 | 6395,22986 |
| 3284 | 40.37 | 1393,24360 |
Explanation
We do not know what the reasons is for orders and totals to appear more than one time with different row_id. So using a common table expression (CTE) using the with ... phrase, we get the distinct id and total.
Under the CTE, we use this distinct data to do totaling. We join ID in the original table with the aggregation over distinct values. Then we comma-separate row_ids so that the information looks cleaner.
SQLFiddle example
http://sqlfiddle.com/#!15/72639/3
Create custom aggregate:
CREATE OR REPLACE FUNCTION sum_func (
double precision, pg_catalog.anyelement, double precision
)
RETURNS double precision AS
$body$
SELECT case when $3 is not null then COALESCE($1, 0) + $3 else $1 end
$body$
LANGUAGE 'sql';
CREATE AGGREGATE dist_sum (
pg_catalog."any",
double precision)
(
SFUNC = sum_func,
STYPE = float8
);
And then calc distinct sum like:
select dist_sum(distinct id, total)
from orders
SQLFiddle
You can use DISTINCT in your aggregate functions:
SELECT id, SUM(DISTINCT total) FROM orders GROUP BY id
Documentation here: https://www.postgresql.org/docs/9.6/static/sql-expressions.html#SYNTAX-AGGREGATES
If we can trust that the total for 1 order is actually 1 row. We could eliminate the duplicates in a sub-query by selecting the the MAX of the PK id column. An example:
CREATE TABLE test2 (id int, order_id int, total int);
insert into test2 values (1,1,50);
insert into test2 values (2,1,50);
insert into test2 values (5,1,50);
insert into test2 values (3,2,100);
insert into test2 values (4,2,100);
select order_id, sum(total)
from test2 t
join (
select max(id) as id
from test2
group by order_id) as sq
on t.id = sq.id
group by order_id
sql fiddle
In difficult cases:
select
id,
(
SELECT SUM(value::int4)
FROM jsonb_each_text(jsonb_object_agg(row_id, total))
) as total
from orders
group by id
I would suggest just use a sub-Query:
SELECT "a"."id", SUM("a"."total")
FROM (SELECT DISTINCT ON ("id") * FROM "Database"."Schema"."Table") AS "a"
GROUP BY "a"."id"
The Above will give you the total of each id
Use below if you want the full total of each duplicate removed:
SELECT SUM("a"."total")
FROM (SELECT DISTINCT ON ("id") * FROM "Database"."Schema"."Table") AS "a"
Using subselect (http://sqlfiddle.com/#!7/cef1c/51):
select sum(total) from (
select distinct id, total
from orders
)
Using CTE (http://sqlfiddle.com/#!7/cef1c/53):
with distinct_records as (
select distinct id, total from orders
)
select sum(total) from distinct_records;

Get value of at least one coincidense between related tables postgres

I have 3 tables:
CREATE TABLE public.student (id serial NOT NULL PRIMARY KEY,"student" varchar(30) NOT NULL) WITH (OIDS = FALSE);
CREATE TABLE public."class" (id serial NOT NULL PRIMARY KEY,"class" varchar(30) NOT NULL) WITH (OIDS = FALSE);
CREATE TABLE public.student_class (student_id integer, class_id integer, /* Keys */ CONSTRAINT student_class_index01
PRIMARY KEY (student_id, class_id)) WITH (OIDS = FALSE);
With these values:
INSERT INTO public.student_class ("student_id","class_id") VALUES(1,1),(1,2),(1,3),(1,4),(1,5),(1,6),(1,7),(2,1),(2,2),(2,3),(3,4),(3,5),(3,6),(3,7),(3,8);
I want to now the names of the students that have at least one coincident class with other one:
Student 1 has got at least one coincidence with 2 and 3. Student 2 has
got at least one coincidence with 1. Student 3 has got at least one
coincidence with 1.
I have tried this, but it is not doing it by student.
SELECT distinct a.id,a.student from public.student a
LEFT OUTER JOIN public.student_class b ON b.student_id=a.id
LEFT OUTER JOIN public.class d ON d.id=b.class_id
WHERE class_id IN(select class_id from student_class where student_id not in(1))
AND a.id not in(1);
I think taht is easier than I think. Thanks.
What do you think about this :
with student_lst_classes as (
select student_id,array_agg(class_id order by class_id) lst_classes
from student_class group by 1
)
select st1.student,string_agg(st2.student,';' order by st2.student ASC) lst_student_coincident from
(student_lst_classes l1 join student st1 on (st1.id=l1.student_id))
,(student_lst_classes l2 join student st2 on (st2.id=l2.student_id))
where l1.student_id!=l2.student_id and l1.lst_classes && l2.lst_classes
group by 1 order by 1
Try this:
select *
from student
where exists (select null
from student_class sc1
where sc1.student_id = student.id
and exists (select null
from student_class sc2
where sc2.class_id = sc1.class_id
and sc2.student_id != sc1.student_id))
Besically a cartesian join of student X student, so there are no duplicates to take care of. (except for the symmetry s1 <--> s2)
SELECT s1.id AS a_id, s1.student AS a_student
, s2.id AS b_id, s2.student AS b_student
FROM student s1
JOIN student s2 ON EXISTS (
SELECT *
FROM student_class x1
JOIN student_class x2 ON x2.class_id = x1.class_id
WHERE x1.student_id = s1.id
AND x2.student_id = s2.id
)
WHERE s1.id < s2.id -- tie-breaker
;

What is wrong with this t-sql query using left join?

I have two sqlserver 2005 tables: Task table and Travel table.
I am trying to write a query that selects all records from the Task table for a specific Empid, and if the Task is travel, also get the mileage value from another table.
I expect only one row to be selected, but all rows in the Task table are returned.
The query I am using is
Select Tasklog.TaskLogPkey, Tasklog.Empid , Tasklog.Task, TravelLog.Mileage from
Tasklog
left join TravelLog
on
TaskLog.EmpId = 12 and TaskLog.Task ='Travel' and
TaskLog.TaskLogPkey = TravelLog.TaskLogPkey
But it makes no difference if I don’t include
-- TaskLog.EmpId = 12 and TaskLog.Task ='Travel' and
What am I doing wrong?
Clearly the following doesn't work:
Select Tasklog.TaskLogPkey, Tasklog.Empid , Tasklog.Task, TravelLog.Mileage from
Tasklog where TaskLog.EmpId = 12 and TaskLog.Task ='Travel'
left join TravelLog
on
TaskLog.TaskLogPkey = TravelLog.TaskLogPkey
These are the tables I am using:
Create table TaskLog(
TaskLogPkey int not null,
Empid int,
Task varchar(30)
)
Insert Tasklog values(1,12,'Sales')
Insert Tasklog values(2,4,'Travel')
Insert Tasklog values(3,63,'Meeting')
Insert Tasklog values(4,12,'Travel')
Insert Tasklog values(5,12,'Email')
Insert Tasklog values(6,4,'Travel')
Insert Tasklog values(7,63,'Meeting')
Insert Tasklog values(8,12,'PhoneCall')
Create table TravelLog(
TaskLogPkey int not null,
Mileage int
)
Insert TravelLog values(2,45)
Insert TravelLog values(4,25)
Insert TravelLog values(6,18)
When you perform an Outer join ( a left join is a Left Outer Join), you get all the records in the inner side of that join, --- no matter what predicates (filters) you specify on the other outer side in the join conditions.
To filter the other side you need to add a Where clause.
Select t.TaskLogPkey, t.Empid , t.Task, v.Mileage
From Tasklog t
Left Join TravelLog v
On t.TaskLogPkey = v.TaskLogPkey
Where t.Task ='Travel'
And t.EmpId = 12
TaskLog.EmpId = 12 and TaskLog.Task ='Travel' is part of the selection criteria, NOT join criteria.
And "Charles Bretana" beats me by less than 1 minute.
You should try placing a WHERE clause with the additional criteria:
Select Tasklog.TaskLogPkey, Tasklog.Empid , Tasklog.Task, TravelLog.Mileage
from Tasklog
left join TravelLog
on TaskLog.TaskLogPkey = TravelLog.TaskLogPkey
where TaskLog.EmpId = 12
and TaskLog.Task ='Travel'
See SQL Fiddle with Demo

TSQL Group By with an "OR"?

This query for creating a list of Candidate duplicates is easy enough:
SELECT Count(*), Can_FName, Can_HPhone, Can_EMail
FROM Can
GROUP BY Can_FName, Can_HPhone, Can_EMail
HAVING Count(*) > 1
But if the actual rule I want to check against is FName and (HPhone OR Email) - how can I adjust the GROUP BY to work with this?
I'm fairly certain I'm going to end up with a UNION SELECT here (i.e. do FName, HPhone on one and FName, EMail on the other and combine the results) - but I'd love to know if anyone knows an easier way to do it.
Thank you in advance for any help.
Scott in Maine
Before I can advise anything, I need to know the answer to this question:
name phone email
John 555-00-00 john#example.com
John 555-00-01 john#example.com
John 555-00-01 john-other#example.com
What COUNT(*) you want for this data?
Update:
If you just want to know that a record has any duplicates, use this:
WITH q AS (
SELECT 1 AS id, 'John' AS name, '555-00-00' AS phone, 'john#example.com' AS email
UNION ALL
SELECT 2 AS id, 'John', '555-00-01', 'john#example.com'
UNION ALL
SELECT 3 AS id, 'John', '555-00-01', 'john-other#example.com'
UNION ALL
SELECT 4 AS id, 'James', '555-00-00', 'james#example.com'
UNION ALL
SELECT 5 AS id, 'James', '555-00-01', 'james-other#example.com'
)
SELECT *
FROM q qo
WHERE EXISTS
(
SELECT NULL
FROM q qi
WHERE qi.id <> qo.id
AND qi.name = qo.name
AND (qi.phone = qo.phone OR qi.email = qo.email)
)
It's more efficient, but doesn't tell you where the duplicate chain started.
This query select all entries along with the special field, chainid, that indicates where the duplicate chain started.
WITH q AS (
SELECT 1 AS id, 'John' AS name, '555-00-00' AS phone, 'john#example.com' AS email
UNION ALL
SELECT 2 AS id, 'John', '555-00-01', 'john#example.com'
UNION ALL
SELECT 3 AS id, 'John', '555-00-01', 'john-other#example.com'
UNION ALL
SELECT 4 AS id, 'James', '555-00-00', 'james#example.com'
UNION ALL
SELECT 5 AS id, 'James', '555-00-01', 'james-other#example.com'
),
dup AS (
SELECT id AS chainid, id, name, phone, email, 1 as d
FROM q
UNION ALL
SELECT chainid, qo.id, qo.name, qo.phone, qo.email, d + 1
FROM dup
JOIN q qo
ON qo.name = dup.name
AND (qo.phone = dup.phone OR qo.email = dup.email)
AND qo.id > dup.id
),
chains AS
(
SELECT *
FROM dup do
WHERE chainid NOT IN
(
SELECT id
FROM dup di
WHERE di.chainid < do.chainid
)
)
SELECT *
FROM chains
ORDER BY
chainid
None of these answers is correct. Quassnoi's is a decent approach, but you will notice one fatal flaw in the expressions "qo.id > dup.id" and "di.chainid < do.chainid": comparisons made by ID! This is ALWAYS bad practice because it depends on some inherent ordering in the IDs. IDs should NEVER be given any implicit meaning and should ONLY participate in equality or null testing. You can easily break Quassnoi's solution in this example by simply reordering the IDs in the data.
The essential problem is a disjunctive condition with a grouping, which leads to the possibility of two records being related through an intermediate, though they are not directly relatable.
e.g., you stated these records should all be grouped:
(1) John 555-00-00 john#example.com
(2) John 555-00-01 john#example.com
(3) John 555-00-01 john-other#example.com
You can see that #1 and #2 are relatable, as are #2 and #3, but clearly #1 and #3 are not directly relatable as a group.
This establishes that a recursive or iterative solution is the ONLY possible solution.
So, recursion is not viable since you can easily end up in a looping situation. This is what Quassnoi was trying to avoid with his ID comparisons, but in doing so he broke the algorithm. You could try limiting the levels of recursion, but you may not then complete all relations, and you will still potentially be following loops back upon yourself, leading to excessive data size and prohibitive inefficiency.
The best solution is ITERATIVE: Start a result set by tagging each ID as a unique group ID, and then spin through the result set and update it, combining IDs into the same unique group ID as they match on the disjunctive condition. Repeat the process on the updated set each time until no further updates can be made.
I will create example code for this soon.
GROUP BY doesn't support OR - it's implicitly AND and must include every non-aggregator in the select list.
I assume you also have a unique ID integer as the primary key on this table. If you don't, it's a good idea to have one, for this purpose and many others.
Find those duplicates by a self-join:
select
c1.ID
, c1.Can_FName
, c1.Can_HPhone
, c1.Can_Email
, c2.ID
, c2.Can_FName
, c2.Can_HPhone
, c2.Can_Email
from
(
select
min(ID),
Can_FName,
Can_HPhone,
Can_Email
from Can
group by
Can_FName,
Can_HPhone,
Can_Email
) c1
inner join Can c2 on c1.ID < c2.ID
where
c1.Can_FName = c2.Can_FName
and (c1.Can_HPhone = c2.Can_HPhone OR c1.Can_Email = c2.Can_Email)
order by
c1.ID
The query gives you N-1 rows for each N duplicate combinations - if you want just a count along with each unique combination, count the rows grouped by the "left" side:
select count(1) + 1,
, c1.Can_FName
, c1.Can_HPhone
, c1.Can_Email
from
(
select
min(ID),
Can_FName,
Can_HPhone,
Can_Email
from Can
group by
Can_FName,
Can_HPhone,
Can_Email
) c1
inner join Can c2 on c1.ID < c2.ID
where
c1.Can_FName = c2.Can_FName
and (c1.Can_HPhone = c2.Can_HPhone OR c1.Can_Email = c2.Can_Email)
group by
c1.Can_FName
, c1.Can_HPhone
, c1.Can_Email
Granted, this is more involved than a union - but I think it illustrates a good way of thinking about duplicates.
Project the desired transformation first from a derived table, then do the aggregation:
SELECT COUNT(*)
, CAN_FName
, Can_HPhoneOrEMail
FROM (
SELECT Can_FName
, ISNULL(Can_HPhone,'') + ISNULL(Can_EMail,'') AS Can_HPhoneOrEMail
FROM Can) AS Can_Transformed
GROUP BY Can_FName, Can_HPhoneOrEMail
HAVING Count(*) > 1
Adjust your 'OR' operation as needed in the derived table project list.
I know this answer will be criticised for the use of the temp table, but it will work anyway:
-- create temp table to give the table a unique key
create table #tmp(
ID int identity,
can_Fname varchar(200) null, -- real type and len here
can_HPhone varchar(200) null, -- real type and len here
can_Email varchar(200) null, -- real type and len here
)
-- just copy the rows where a duplicate fname exits
-- (better performance specially for a big table)
insert into #tmp
select can_fname,can_hphone,can_email
from Can
where can_fname exists in (select can_fname from Can
group by can_fname having count(*)>1)
-- select the rows that have the same fname and
-- at least the same phone or email
select can_Fname, can_Hphone, can_Email
from #tmp a where exists
(select * from #tmp b where
a.ID<>b.ID and A.can_fname = b.can_fname
and (isnull(a.can_HPhone,'')=isnull(b.can_HPhone,'')
or (isnull(a.can_email,'')=isnull(b.can_email,'') )
Try this:
SELECT Can_FName, COUNT(*)
FROM (
SELECT
rank() over(partition by Can_FName order by Can_FName,Can_HPhone) rnk_p,
rank() over(partition by Can_FName order by Can_FName,Can_EMail) rnk_m,
Can_FName
FROM Can
) X
WHERE rnk_p=1 or rnk_m =1
GROUP BY Can_FName
HAVING COUNT(*)>1