How to understand the "select 1 from ..." expression in SQL - postgresql

Task:
Find the employee with the highest salary in each department
Data:
CREATE SEQUENCE employee_id_seq;
create table Employee
(
id_emp int DEFAULT nextval('employee_id_seq')
NOT NULL PRIMARY KEY UNIQUE,
name_emp varchar(255) NOT NULL,
mgr_id_fk int not null,
job_emp text NOT NULL,
salary int NOT NULL,
date_emp date NOT NULL,
dep_ID_fk int NOT NULL
);
ALTER SEQUENCE employee_id_seq
OWNED BY employee.id_emp;
create table Manager
(
id_mgr int not null primary key unique,
type_mgr varchar(255)
);
ALTER table Employee
add FOREIGN KEY (mgr_id_fk) REFERENCES Manager (id_mgr)
on update cascade
on delete set null;
create table Department
(
id_depart int NOT NULL PRIMARY KEY unique,
name_depart varchar(255) not null,
address text,
phone text
);
insert into Manager (id_mgr, type_mgr)
VALUES
(1006, 'juniormgr'),
(1004, 'middlemgr'),
(1005, 'seniormgr');
insert into Department (id_depart, name_depart, address, phone)
values (1, 'Sales', 'Sydney', '0425 198 053'),
(2, 'Accounts', 'Melbourne', '0429 198 955'),
(3, 'Admin', 'Melbourne', '0428 198 758'),
(4, 'Marketing', 'Sydney', '0427 198 757');
insert into Employee(id_emp, name_emp, mgr_id_fk, job_emp, salary, date_emp, dep_ID_fk)
values (nextval('employee_id_seq'), 'ken Adams', 1006, 'Salesman', 70000, '2008-04-12', 1),
(nextval('employee_id_seq'), 'Ru Jones', 1004, 'Salesman', 65000, '2010-01-18', 1),
(nextval('employee_id_seq'), 'Dhal Sim', 1006, 'Accountant', 88000, '2001-03-07', 2),
(nextval('employee_id_seq'), 'Ellen Honda', 1006, 'Manager', 118000, '2001-03-17', 1),
(nextval('employee_id_seq'), 'Mike Bal', 1005, 'Receptionist', 68000, '2006-06-21', 3),
(nextval('employee_id_seq'), 'Martin Bison', 1005, 'CEO', 210000, '2010-07-12', 3),
(nextval('employee_id_seq'), 'Shen Li', 1004, 'Salesman', 86000, '2014-09-18', 1),
(nextval('employee_id_seq'), 'Zang Ross', 1004, 'Salesman', 65000, '2017-02-02', 1),
(nextval('employee_id_seq'), 'Sagar Kahn', 1005, 'Salesman', 70000, '2016-03-01', 1);
This query returns the required result:
select * from employee e
where not exists (select 1 from employee e2
where e2.dep_id_fk = e.dep_id_fk
and e2.salary > e.salary);
My reasoning:
First, the subquery is executed and Postgres saves its output as a temporary result;
then the rest of the query is executed
select * from employee e
where not exists ...
where not exists excludes all rows for which the subquery finds a match
Question:
I don't understand how this code actually works,
because everything here looks illogical to me.
For example, how does this work?
select 1 from employee e2
where e2.dep_id_fk = e.dep_id_fk
and e2.salary > e.salary
select 1 from - what does this expression even do?
e2.dep_id_fk = e.dep_id_fk - this compares the same table with itself (in the subquery and in the main query), but why?
e2.salary > e.salary - and why this?

There are two things you need to understand:
What a correlated subquery is
How the (NOT) EXISTS operator works
A correlated subquery is conceptually run once for each row that is returned from the outer query. You can imagine this as a kind of nested loop, where for each row returned by the outer query, the sub-query is executed using the values of the columns from the outer query.
So if the outer query processes the row for 'ken Adams' it will take the value 1 from dep_id_fk and the value 70000 of the column salary and essentially run:
select 1
from employee e2
where e2.dep_id_fk = 1 --<< this value is from the outer query
and e2.salary > 70000 --<< this value is from the outer query
If that query returns no rows, the row from the outer query is included in the result. Then the database proceeds with the next row in the outer query and does the same again until all rows from the outer query are processed and either included in or excluded from the result.
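The nested-loop picture above can be checked with a small sketch. It assumes Python's built-in sqlite3 module as a stand-in for Postgres and a trimmed-down employee table (only name, salary and department), so it illustrates the logic, not how Postgres actually executes the query.

```python
import sqlite3

# Trimmed copy of the sample data: (name_emp, salary, dep_id_fk).
rows = [
    ("ken Adams", 70000, 1),
    ("Ru Jones", 65000, 1),
    ("Dhal Sim", 88000, 2),
    ("Ellen Honda", 118000, 1),
    ("Mike Bal", 68000, 3),
    ("Martin Bison", 210000, 3),
    ("Shen Li", 86000, 1),
    ("Zang Ross", 65000, 1),
    ("Sagar Kahn", 70000, 1),
]

# SQLite is only an assumption here, used so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name_emp TEXT, salary INT, dep_id_fk INT)")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)", rows)

# The query from the question.
sql_result = [
    r[0]
    for r in conn.execute(
        """
        SELECT name_emp FROM employee e
        WHERE NOT EXISTS (SELECT 1 FROM employee e2
                          WHERE e2.dep_id_fk = e.dep_id_fk
                            AND e2.salary > e.salary)
        ORDER BY name_emp
        """
    )
]

# The same logic written as an explicit nested loop.
loop_result = []
for name, salary, dep in rows:            # outer query: one pass per row
    higher_paid_exists = any(             # sub-query: same department, higher salary
        d2 == dep and s2 > salary for _, s2, d2 in rows
    )
    if not higher_paid_exists:            # NOT EXISTS keeps the row
        loop_result.append(name)
loop_result.sort()
```

Both sql_result and loop_result end up holding the top earner of each department, which is why the self-comparison of the employee table is needed.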
However, the NOT EXISTS and EXISTS operators only check for the presence of rows from the sub-query. The actual value(s) returned from the sub-query are completely irrelevant.
A lot of people incorrectly assume that select 1 is somehow cheaper than select * in the sub-query, but this is a totally wrong assumption. The expression is never even evaluated, so it's completely irrelevant what is selected there. If you think select * is more logical than select 1, then use that.
To prove that the expression is never evaluated (or even looked at), you can use one that would otherwise throw an exception.
select *
from employee e
where not exists (select 1/0
from employee e2
where e2.dep_id_fk = e.dep_id_fk
and e2.salary > e.salary);
If you run select 1/0 outside of a sub-query used for an EXISTS or NOT EXISTS condition, it would result in an error. But the EXISTS operator never even looks at the expressions in the SELECT list.

Let's try
select *
from employee e
where not exists(select 'foo'
from employee e2
where e2.dep_id_fk = e.dep_id_fk
and e2.salary > e.salary);
or
select *
from employee e
where not exists(select 42
from employee e2
where e2.dep_id_fk = e.dep_id_fk
and e2.salary > e.salary);
Both queries return the same result.
Therefore, the 1 is only a placeholder: the subquery exists solely to check whether a matching row exists.

Related

Postgres returning records when query result should be empty

Suppose the following,
CREATE SCHEMA IF NOT EXISTS my_schema;
CREATE TABLE IF NOT EXISTS my_schema.user (
id serial PRIMARY KEY,
chat_ids BIGINT[] NOT NULL
);
CREATE TABLE IF NOT EXISTS my_schema.chat (
id serial PRIMARY KEY,
chat_id_value BIGINT UNIQUE NOT NULL
);
INSERT INTO my_schema.chat VALUES
(1, 12321);
INSERT INTO my_schema.user VALUES
(1, '{12321}');
When I query for a user record with a nonexisting chat, I still receive a result:
SELECT u.id,
(
SELECT TO_JSON(COALESCE(ARRAY_AGG(c.*) FILTER (WHERE c IS NOT NULL), '{}'))
FROM my_schema.chat as c
WHERE c.chat_id_value = ANY (ARRAY[ 1234 ]::int[])
) AS chat_ids
FROM my_schema.user as u
Clearly, there is no my_schema.chat record with chat_id_value = 1234.
I've tried adding,
. . .
FROM my_schema.user as u
WHERE chat_ids != '{}'
But this still yields the same result:
[
{
"id": 1,
"chat_ids": []
}
]
I've tried WHERE ARRAY_LENGTH(chat_ids, 1) != 0, WHERE CARDINALITY(chat_ids) != 0, none return the expected result.
Oddly enough, WHERE ARRAY_LENGTH(chat_ids, 1) != 1 works, implying the length of chat_ids is 1 when it's actually 0? Very confusing.
What am I doing wrong here? The expected result should be [].
If the subselect on my_schema.chat returns no result, you will get NULL, which coalesce will turn into {}. Moreover, the inner query is not correlated to the outer query, so you will get the same result for each row in my_schema."user". You should use an inner join:
SELECT u.id,
TO_JSON(COALESCE(ARRAY_AGG(c.*) FILTER (WHERE c IS NOT NULL), '{}'))
FROM my_schema.user as u
JOIN my_schema.chat as c
ON c.chat_id_value = ANY (u.chat_ids)
GROUP BY u.id;
I don't think that your data model is good. You should avoid arrays and use a junction table instead. It will make for better performance and simpler queries.
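To make the two points above concrete, here is a minimal sketch. It assumes Python's built-in sqlite3 module instead of Postgres, a hypothetical junction table user_chat in place of the chat_ids array, and group_concat standing in for ARRAY_AGG; the table name app_user is also made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE app_user (id INTEGER PRIMARY KEY);
    CREATE TABLE chat (id INTEGER PRIMARY KEY,
                       chat_id_value INTEGER UNIQUE NOT NULL);
    -- Hypothetical junction table replacing the chat_ids array column.
    CREATE TABLE user_chat (
        user_id INTEGER NOT NULL REFERENCES app_user (id),
        chat_id_value INTEGER NOT NULL REFERENCES chat (chat_id_value),
        PRIMARY KEY (user_id, chat_id_value)
    );
    INSERT INTO app_user VALUES (1);
    INSERT INTO chat VALUES (1, 12321);
    INSERT INTO user_chat VALUES (1, 12321);
    """
)

# An aggregate in an uncorrelated scalar subquery always yields one value
# (NULL when nothing matches), so the outer row is still returned.
subquery_rows = conn.execute(
    """
    SELECT u.id,
           (SELECT group_concat(c.chat_id_value)
            FROM chat c
            WHERE c.chat_id_value = 1234)
    FROM app_user u
    """
).fetchall()

# With an inner join through the junction table, a user without a
# matching chat produces no row at all.
join_sql = """
    SELECT u.id, group_concat(c.chat_id_value)
    FROM app_user u
    JOIN user_chat uc ON uc.user_id = u.id
    JOIN chat c ON c.chat_id_value = uc.chat_id_value
    WHERE c.chat_id_value = ?
    GROUP BY u.id
"""
join_missing = conn.execute(join_sql, (1234,)).fetchall()
join_present = conn.execute(join_sql, (12321,)).fetchall()
```

The subquery version returns the user with a NULL aggregate, while the join version returns an empty result for chat 1234 and one row for chat 12321.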
You can do it as follows :
WITH cte as (
SELECT TO_JSON(ARRAY_AGG(c.*) FILTER (WHERE c IS NOT NULL)) as to_json
FROM my_schema.chat as c
inner join my_schema.user u on c.chat_id_value = ANY (u.chat_ids)
WHERE c.chat_id_value = ANY (ARRAY[ 12321]::int[])
)
select *
from cte where to_json is not null;
This returns no rows at all when nothing matches.

JOIN with array of ids returns duplicate root records instead of just one

I'm trying to join several tables and pull out each DISTINCT root record (from table_a), but for some reason I keep getting duplicates. Here is my select query:
select
ta.id,
ta.table_a_name as "tableName"
from my_schema.table_a ta
left join my_schema.table_b tb
on (tb.table_a_id = ta.id)
left join my_schema.table_c tc
on (tc.table_b_id = tb.id)
left join my_schema.table_d td
on (td.id = any(tc.table_d_ids))
where td.id = any(array[100]);
This returns the following:
[
{
"id": 2,
"tableName": "Root record 2"
},
{
"id": 2,
"tableName": "Root record 2"
}
]
But I am only expecting, in this case,
[
{
"id": 2,
"tableName": "Root record 2"
}
]
What am I doing wrong here?
Here's the fiddle and, just in case, the create and insert statements below:
create schema if not exists my_schema;
create table if not exists my_schema.table_a (
id serial primary key,
table_a_name varchar (255) not null
);
create table if not exists my_schema.table_b (
id serial primary key,
table_a_id bigint not null references my_schema.table_a (id)
);
create table if not exists my_schema.table_d (
id serial primary key
);
create table if not exists my_schema.table_c (
id serial primary key,
table_b_id bigint not null references my_schema.table_b (id),
table_d_ids bigint[] not null
);
insert into my_schema.table_a values
(1, 'Root record 1'),
(2, 'Root record 2'),
(3, 'Root record 3');
insert into my_schema.table_b values
(10, 2),
(11, 2),
(12, 3);
insert into my_schema.table_d values
(100),
(101),
(102),
(103),
(104);
insert into my_schema.table_c values
(1000, 10, array[]::int[]),
(1001, 10, array[100]),
(1002, 11, array[100, 101]),
(1003, 12, array[102]),
(1004, 12, array[103]);
Short answer is use distinct, and this will get the results you want:
select distinct
ta.id,
ta.table_a_name as "tableName"
from my_schema.table_a ta
left join my_schema.table_b tb
on (tb.table_a_id = ta.id)
left join my_schema.table_c tc
on (tc.table_b_id = tb.id)
left join my_schema.table_d td
on (td.id = any(tc.table_d_ids))
where td.id = any(array[100]);
That said, this doesn't sit well with me because I assume this is not the end of your query.
The root issue is that you have two chains of records from table_b through table_d that match these criteria. If you follow the breadcrumbs back, you will see there really are two matches:
select
ta.id,
ta.table_a_name as "tableName", tb.*, tc.*, td.*
from my_schema.table_a ta
left join my_schema.table_b tb
on (tb.table_a_id = ta.id)
left join my_schema.table_c tc
on (tc.table_b_id = tb.id)
left join my_schema.table_d td
on (td.id = any(tc.table_d_ids))
where td.id = any(array[100]);
So 'distinct' is just a lazy fix to say if there are dupes, limit it to one...
My next question is, is there more to it than this? What's supposed to happen next? Do you really just want candidates from table_a, or is this part 1 of a longer issue? If there is more to it, then there is likely a better solution than a simple select distinct.
-- edit 10/1/2022 --
Based on your comment, I have one final suggestion. Because this really is all there is to your output AND you don't actually need the data from the b/c/d tables, I think a semi-join is a better solution.
It's slightly more code (not going to win any golf or de-obfuscation contests), but it's much more efficient than a distinct or group by all columns. The reason is that a distinct pulls every row result and then has to order and remove dupes. A semi-join, by contrast, will "stop looking" once it finds a match. It also scales very well. Almost every time I see a distinct misused, it's better served by a semi-join.
select
ta.id,
ta.table_a_name as "tableName"
from my_schema.table_a ta
where exists (
select null
from
my_schema.table_b tb,
my_schema.table_c tc,
my_schema.table_d td
where
tb.table_a_id = ta.id and
tc.table_b_id = tb.id and
td.id = any(tc.table_d_ids) and
td.id = any(array[100])
)
I didn't suggest this initially because I was unclear on the "what next."
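The difference between the plain join and the semi-join can be reproduced with a small sketch. It assumes Python's built-in sqlite3 module rather than Postgres and, because SQLite has no array type, a hypothetical junction table table_c_d in place of the table_d_ids array.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE table_a (id INTEGER PRIMARY KEY, table_a_name TEXT NOT NULL);
    CREATE TABLE table_b (id INTEGER PRIMARY KEY, table_a_id INTEGER NOT NULL);
    CREATE TABLE table_c (id INTEGER PRIMARY KEY, table_b_id INTEGER NOT NULL);
    -- Hypothetical junction table standing in for table_c.table_d_ids.
    CREATE TABLE table_c_d (table_c_id INTEGER NOT NULL,
                            table_d_id INTEGER NOT NULL);

    INSERT INTO table_a VALUES (1, 'Root record 1'), (2, 'Root record 2'),
                               (3, 'Root record 3');
    INSERT INTO table_b VALUES (10, 2), (11, 2), (12, 3);
    INSERT INTO table_c VALUES (1000, 10), (1001, 10), (1002, 11),
                               (1003, 12), (1004, 12);
    INSERT INTO table_c_d VALUES (1001, 100), (1002, 100), (1002, 101),
                                 (1003, 102), (1004, 103);
    """
)

# Plain joins: one output row per matching path from table_a to id 100.
join_rows = conn.execute(
    """
    SELECT ta.id, ta.table_a_name
    FROM table_a ta
    JOIN table_b tb ON tb.table_a_id = ta.id
    JOIN table_c tc ON tc.table_b_id = tb.id
    JOIN table_c_d cd ON cd.table_c_id = tc.id
    WHERE cd.table_d_id = 100
    """
).fetchall()

# Semi-join: each table_a row is tested once and returned at most once.
exists_rows = conn.execute(
    """
    SELECT ta.id, ta.table_a_name
    FROM table_a ta
    WHERE EXISTS (
        SELECT NULL
        FROM table_b tb
        JOIN table_c tc ON tc.table_b_id = tb.id
        JOIN table_c_d cd ON cd.table_c_id = tc.id
        WHERE tb.table_a_id = ta.id
          AND cd.table_d_id = 100
    )
    """
).fetchall()
```

Root record 2 is reachable through two table_b rows, so the join returns it twice while the EXISTS version returns it once, with no DISTINCT needed.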

Enforcing a unique relationship over multiple columns where one column is nullable

Given the table
ID PERSON_ID PLAN EMPLOYER_ID TERMINATION_DATE
1 123 ABC 321 2020-01-01
2 123 DEF 321 (null)
3 123 ABC 321 (null)
4 123 ABC 321 (null)
I want to exclude the 4th entry. (The 3rd entry shows the person was re-hired and therefore is a new relationship. I'm only showing relevant fields)
My first attempt was to simply create a unique index over PERSON_ID / PLAN / EMPLOYER_ID / TERMINATION_DATE, thinking that DB2 for IBMi considered nulls equal in a unique index. I was evidently wrong...
Is there a way to enforce uniqueness over these columns, or,
is there a better way to approach the value of termination date? (null is not technically correct; I'm thinking of it as more true/false, but the business logic needs a date)
Edit
According to the docs for 7.3:
UNIQUE
Prevents the table from containing two or more rows with the same value of the index key. When UNIQUE is used, all null values for a column are considered equal. For example, if the key is a single column that can contain null values, that column can contain only one null value. The constraint is enforced when rows of the table are updated or new rows are inserted.
The constraint is also checked during the execution of the CREATE INDEX statement. If the table already contains rows with duplicate key values, the index is not created.
UNIQUE WHERE NOT NULL
Prevents the table from containing two or more rows with the same value of the index key, where all null values for a column are not considered equal. Multiple null values in a column are allowed. Otherwise, this is identical to UNIQUE.
So, the behavior I'm seeing looks more like UNIQUE WHERE NOT NULL. When I generate SQL for this table, I see
ADD CONSTRAINT TERMEMPPLANSSN
UNIQUE( TERMINATION_DATE , EMPLOYERID , PLAN_CODE , SSN ) ;
(note this is showing the real field names, not the ones I used in my example)
Edit 2
Bottom line, Constraint !== Index. When I went back and created an actual index, I got the desired behavior.
CREATE TABLE PERSON
(
ID INT NOT NULL
, PERSON_ID INT NOT NULL
, PLAN CHAR(3) NOT NULL
, EMPLOYER_ID INT
, TERMINATION_DATE DATE
);
INSERT INTO PERSON (ID, PERSON_ID, PLAN, EMPLOYER_ID, TERMINATION_DATE)
VALUES
(1, 123, 'ABC', 321, DATE('2020-01-01'))
, (2, 123, 'DEF', 321, CAST(NULL AS DATE))
, (3, 123, 'ABC', 321, CAST(NULL AS DATE))
WITH NC;
--- To not allow: ---
INSERT INTO PERSON (ID, PERSON_ID, PLAN, EMPLOYER_ID, TERMINATION_DATE) VALUES
(4, 123, 'ABC', 321, CAST(NULL AS DATE))
or
(4, 123, 'ABC', 321, DATE('2020-01-01'))
You may:
CREATE UNIQUE INDEX PERSON_U1 ON PERSON
(PERSON_ID, PLAN, EMPLOYER_ID, TERMINATION_DATE);
--- To not allow: ---
INSERT INTO PERSON (ID, PERSON_ID, PLAN, EMPLOYER_ID, TERMINATION_DATE) VALUES
(4, 123, 'ABC', 321, DATE('2020-01-01'))
but allow multiple:
(X, 123, 'ABC', 321, CAST(NULL AS DATE))
(Y, 123, 'ABC', 321, CAST(NULL AS DATE))
...
You may:
CREATE UNIQUE WHERE NOT NULL INDEX PERSON_U2 ON PERSON
(PERSON_ID, PLAN, EMPLOYER_ID, TERMINATION_DATE);

Joining two one-to-many tables duplicates records

I have 3 tables, Transaction, Transaction_Items and Transaction_History.
Transaction is the parent table, while Transaction_Items and Transaction_History are the child tables, each with a one-to-many relationship.
When I try to join those tables together and I have 2+ Transaction_History records or 2+ Transaction_Items, I get duplicated or triplicated rows in the result.
This is the SQL query I'm currently using, which works, but what worries me is that if I have to join another one-to-many table in the future, it will duplicate the results again.
I found a workaround for this, but I was wondering if there is a better and cleaner way to do this.
The results should be a PostgreSQL JSON array which will contain the Transaction_Items and Transaction_History
SELECT
TR.id AS transaction_id,
TR.transaction_number,
TR.status,
TR.status AS status,
to_json(TR_INV.list),
COUNT(TR_INV) item_cnt,
COUNT(THR) tr_cnt,
json_agg(THR)
FROM transaction_transaction AS TR
LEFT JOIN (
SELECT
array_agg(t) list, -- this is a workaround method
t.transaction_id
FROM (
SELECT
TR_INV.transaction_id transaction_id,
IT.id,
IT.stock_number,
CAT.key category_key,
ITP.description description,
ITP.serial_number serial_number,
ITP.color color,
ITP.manufacturer manufacturer,
ITP.inventory_model inventory_model,
ITP.average_cost average_cost,
ITP.location_in_store location_in_store,
ITP.firearm_caliber firearm_caliber,
ITP.federal_firearm_number federal_firearm_number,
ITP.sold_price sold_price
FROM transaction_transaction_item TR_INV
LEFT JOIN inventory_item IT ON IT.id = TR_INV.item_id
LEFT JOIN inventory_itemprofile ITP ON ITP.id = IT.current_profile_id
LEFT JOIN inventory_category CAT ON CAT.id = ITP.category_id
LEFT JOIN inventory_categorytype CAT_T ON CAT_T.id = CAT.category_type_id
) t
GROUP BY t.transaction_id
) TR_INV ON TR_INV.transaction_id = TR.id
LEFT JOIN transaction_transactionhistory THR ON THR.transaction_id = TR.id
AND (THR.audit_code_id = 44 OR THR.audit_code_id = 27 OR THR.audit_code_id = 28)
WHERE TR.store_id = 21
AND TR.transaction_type = 'Pawn_Loan' AND TR.date_made >= '2018-10-08'
GROUP BY TR.id, TR_INV.list
What you want can be achieved without joins, using correlated subqueries, as shown below.
Your actual tables have many columns that I don't know and don't need to care about, so I created the simplest possible versions of them for demonstration.
CREATE TABLE transactions (
tid serial PRIMARY KEY,
name varchar(40) NOT NULL
);
CREATE TABLE transaction_histories (
hid serial PRIMARY KEY ,
tid integer REFERENCES transactions(tid),
history varchar(40) NOT NULL
);
CREATE TABLE transaction_items (
iid serial PRIMARY KEY ,
tid integer REFERENCES transactions(tid),
item varchar(40) NOT NULL
);
INSERT INTO transactions(tid,name) Values(1, 'transaction');
INSERT INTO transaction_histories(tid, history) Values(1, 'history1');
INSERT INTO transaction_histories(tid, history) Values(1, 'history2');
INSERT INTO transaction_items(tid, item) Values(1, 'item1');
INSERT INTO transaction_items(tid, item) Values(1, 'item2');
select
t.*,
(select count(*) from transaction_histories h where h.tid= t.tid) h_count ,
(select json_agg(h) from transaction_histories h where h.tid= t.tid) h ,
(select count(*) from transaction_items i where i.tid= t.tid) i_count ,
(select json_agg(i) from transaction_items i where i.tid= t.tid) i
from transactions t;
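A quick way to see why the subquery form avoids the duplication is to compare it with a double join on the same toy tables. This is only a sketch, assuming Python's built-in sqlite3 module instead of Postgres and plain counts instead of json_agg.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE transactions (tid INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE transaction_histories (
        hid INTEGER PRIMARY KEY,
        tid INTEGER REFERENCES transactions (tid),
        history TEXT NOT NULL
    );
    CREATE TABLE transaction_items (
        iid INTEGER PRIMARY KEY,
        tid INTEGER REFERENCES transactions (tid),
        item TEXT NOT NULL
    );
    INSERT INTO transactions (tid, name) VALUES (1, 'transaction');
    INSERT INTO transaction_histories (tid, history)
        VALUES (1, 'history1'), (1, 'history2');
    INSERT INTO transaction_items (tid, item)
        VALUES (1, 'item1'), (1, 'item2');
    """
)

# Joining both child tables pairs every history with every item
# (2 x 2 = 4 rows), so both counts are inflated.
joined_counts = conn.execute(
    """
    SELECT t.tid, COUNT(h.hid), COUNT(i.iid)
    FROM transactions t
    LEFT JOIN transaction_histories h ON h.tid = t.tid
    LEFT JOIN transaction_items i ON i.tid = t.tid
    GROUP BY t.tid
    """
).fetchall()

# Correlated scalar subqueries aggregate each child table on its own.
subquery_counts = conn.execute(
    """
    SELECT t.tid,
           (SELECT COUNT(*) FROM transaction_histories h WHERE h.tid = t.tid),
           (SELECT COUNT(*) FROM transaction_items i WHERE i.tid = t.tid)
    FROM transactions t
    """
).fetchall()
```

Each additional one-to-many table multiplies the joined row count again, whereas adding another scalar subquery leaves the existing aggregates untouched.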

Concatenated columns should not match in 2 tables

I'll just put this in layman's terms since I'm a complete noobie:
I have 2 tables A and B, both having 2 columns of interest namely: employee_number and salary.
What I am looking to do is to extract rows of 'combination' of employee_number and salary from A that are NOT present in B, but each of employee_number and salary should be present in both.
I am looking to do it with the 2 following conditions (please forgive the wrong function
names; this is just to present the problem 'eloquently'):
1.) A.unique(employee_number) exists in B.unique(employee_number) AND A.unique(salary)
exists in B.unique(salary)
2.) A.concat(employee_number,salary) <> B.concat(employee_number,salary)
Note: A and B are in different databases, so I'm looking to use dblink to do this.
This is what I tried doing:
SELECT distinct * FROM dblink('dbname=test1 port=5432
host=test01 user=user password=password','SELECT employee_number,salary, employee_number||salary AS ENS FROM empsal.A')
AS A(employee_number int8, salary integer, ENS numeric)
LEFT JOIN empsalfull.B B on B.employee_number = A.employee_number AND B.salary = A.salary
WHERE A.ENS not in (select distinct employee_number || salary from empsalfull.B)
but it turned out to be wrong as I had it cross-checked by using spreadsheets and I don't get the same result.
Any help would be greatly appreciated. Thanks.
For easier understanding I left out the dblink.
Your query cannot work because the join selects the rows in B whose employee_number equals the one in A and whose salary equals the one in A, so their concatenated values will be equal as well (if you expect this not to be true, please provide some test data). Try something like this instead:
SELECT * FROM firsttable A
JOIN secondtable B ON
(A.employee_number = B.employee_number AND A.salary != B.salary) OR
(A.salary = B.salary AND A.employee_number != B.employee_number)
If you have trouble with rows containing nulls, you might also try something like this:
AND (A.salary != B.salary OR (A.salary IS NULL AND B.salary IS NOT NULL) OR (A.salary IS NOT NULL AND B.salary IS NULL))
I think you're looking for something along these lines.
(Sample data)
create table A (
employee_number integer primary key,
salary integer not null
);
create table B (
employee_number integer primary key,
salary integer not null
);
insert into A values
(1, 20000),
(2, 30000),
(3, 20000); -- This row isn't in B
insert into B values
(1, 20000), -- Combination in A
(2, 20000), -- Individual values in A
(3, 50000); -- Only emp number in A
select A.employee_number, A.salary
from A
where (A.employee_number, A.salary) NOT IN (select employee_number, salary from B)
and A.employee_number IN (select employee_number from B)
and A.salary IN (select salary from B)
output: 3, 20000
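One more reason to compare (employee_number, salary) as a pair rather than as a concatenated value, as the original attempt did: different pairs can concatenate to the same string. A minimal sketch, assuming Python's built-in sqlite3 module (which supports both || and row-value comparison) and made-up values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Employee 1 with salary 23 and employee 12 with salary 3 are different
# pairs, yet their concatenations are identical.
concat_a, concat_b = conn.execute("SELECT 1 || 23, 12 || 3").fetchone()

# Comparing the pairs as row values keeps them apart (0 means "not equal").
pairs_equal = conn.execute("SELECT (1, 23) = (12, 3)").fetchone()[0]
```

Here both concatenations come out as '123' while the row-value comparison reports the pairs as different, so a NOT IN on the concatenated column could wrongly drop a row that the tuple comparison keeps.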