Postgres left joining 3 tables with a condition

I have a query like this:
SELECT x.id
FROM x
LEFT JOIN (
SELECT a.id FROM a
WHERE [condition1] [condition2]
) AS A USING (id)
LEFT JOIN (
SELECT b.id FROM b
WHERE [condition1] [condition3]
) AS B USING (id)
LEFT JOIN (
SELECT c.id FROM c
WHERE [condition1] [condition4]
) AS C USING (id)
WHERE [condition1]
As you can see, [condition1] is common to the subqueries and the outer query.
When, in general, might it be worth removing [condition1] from the subqueries (the result is the same either way) for performance reasons? Please don't give answers like "run it and see": there is a lot of data and it changes over time, so we need good worst-case behaviour.
I have tried some tests, but they are far from conclusive. Will Postgres figure out that the condition applies to the subqueries as well and propagate it?
Examples for condition1:
WHERE a.id NOT IN (SELECT id FROM {ft_geom_in}) (this is slow, I know, this is just for example)
WHERE a.id > x

It is difficult to give a clear general answer, since much depends on the actual data model (especially indexes) and queries (conditions).
However, in many cases it makes sense to keep condition1 in the joined subqueries.
This applies particularly when condition2 excludes far fewer rows than condition1.
In such cases, the filter on condition1 can significantly reduce the number of rows that have to be checked against condition2.
On the other hand, it seems unlikely that the presence of condition1 in the subqueries could substantially slow the query down.
Simple tests do not give general answers, but might serve as an illustration.
create table x (id integer, something text);
create table a (id integer, something text);
insert into x select i, i::text from generate_series (1, 1000000) i;
insert into a select i, i::text from generate_series (1, 1000000) i;
Query A: condition2 excludes few rows.
A1: with condition1
explain analyse
select x.id
from x
left join (
select id from a
where id < 500000 and length(something) > 1
) as a using (id)
where id < 500000;
Average execution time: ~620 ms
A2: without condition1
explain analyse
select x.id
from x
left join (
select id from a
where length(something) > 1
) as a using (id)
where id < 500000;
Average execution time: ~810 ms
Query B: condition2 excludes many rows.
B1: with condition1
explain analyse
select x.id
from x
left join (
select id from a
where id < 500000 and length(something) = 1
) as a using (id)
where id < 500000;
Average execution time: ~220 ms
B2: without condition1
explain analyse
select x.id
from x
left join (
select id from a
where length(something) = 1
) as a using (id)
where id < 500000;
Average execution time: ~230 ms
Note that the queries do not need subqueries at all. Plain left joins with the conditions in a common WHERE clause should be a little faster. (The rewrite is equivalent here because every x.id has a match in a; with unmatched rows, conditions on a in the WHERE clause would turn the left join into an inner join.) For example, this is the equivalent of query B1:
explain analyse
select x.id
from x
left join a using(id)
where x.id < 500000
and a.id < 500000
and length(a.something) = 1
Average execution time: ~210 ms
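The equivalence the answer relies on (duplicating condition1 inside the subquery cannot change the result set, only the plan) is easy to check mechanically. A minimal sketch using Python's sqlite3 with small illustrative tables; this shows result equivalence only, not Postgres planner behaviour:

```python
import sqlite3

# Build two small tables shaped like x and a from the answer's test setup.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE x (id INTEGER, something TEXT);
    CREATE TABLE a (id INTEGER, something TEXT);
""")
rows = [(i, str(i)) for i in range(1, 1001)]
conn.executemany("INSERT INTO x VALUES (?, ?)", rows)
conn.executemany("INSERT INTO a VALUES (?, ?)", rows)

# Variant A1: condition1 duplicated inside the subquery.
with_dup = conn.execute("""
    SELECT x.id FROM x
    LEFT JOIN (SELECT id FROM a WHERE id < 500 AND length(something) > 1) AS s
           ON s.id = x.id
    WHERE x.id < 500 ORDER BY x.id
""").fetchall()

# Variant A2: condition1 only in the outer WHERE.
without_dup = conn.execute("""
    SELECT x.id FROM x
    LEFT JOIN (SELECT id FROM a WHERE length(something) > 1) AS s
           ON s.id = x.id
    WHERE x.id < 500 ORDER BY x.id
""").fetchall()

assert with_dup == without_dup  # identical rows either way
print(len(with_dup))            # 499
```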

Related

Why does PostgreSQL do a whole index scan when the condition is FALSE?

I notice a slowdown while the query runs: from 5 ms to 200 ms (+44 ms JIT).
https://explain.depesz.com/s/lZYf#l12
similar, but JIT is off
The underlined expression is NULL, so the whole filter expression can never be true.
Why does PG waste 227 ms here? What did I do wrong?
EXPLAIN( ANALYSE, FORMAT JSON, VERBOSE, settings, buffers )
WITH
_app_period AS ( select ?::tstzrange ),
ready AS (
SELECT
min( lower( o.app_period ) ) OVER ( PARTITION BY agreement_id ) <# (select * from _app_period) AS new_order,
max( upper( o.app_period ) ) OVER ( PARTITION BY agreement_id ) <# (select * from _app_period) AS del_order
,o.*
FROM "order_bt" o
LEFT JOIN acc_ready( 'Usage', (select * from _app_period), o ) acc_u ON acc_u.ready
LEFT JOIN acc_ready( 'Invoice', (select * from _app_period), o ) acc_i ON acc_i.ready
LEFT JOIN agreement a ON a.id = o.agreement_id
LEFT JOIN xcheck c ON c.doc_id = o.id and c.doctype = 'OrderDetail'
WHERE o.sys_period #> sys_time() AND o.app_period && app_period()
)
SELECT * FROM ready
UPD
Server version is 13.1
Is the second execution faster?
No. The result is reproducible every time.
Perhaps sys_time() is expensive - what is that function?
It is a STABLE function which does select coalesce( biconf( 'sys_time' )::timestamptz, now() ). app_period() is a STABLE SQL function and does a similar thing.
Are you sure that the expression is NULL for all rows?
Yes. I checked the result of app_period(): it is NULL, so it does not matter how many rows are in the table. o.app_period && NULL evaluates to NULL for all rows.
Does the execution time change if you replace the expression with a literal NULL?
Yes, changing the condition to WHERE o.sys_period #> sys_time() AND o.app_period && NULL reduces the time to 0.08 ms. The plan changes.
Do you have indexes on o.sys_period and o.app_period?
Yes. I have: "order_id_sys_period_app_period_excl" EXCLUDE USING gist (id WITH =, sys_period WITH &&, app_period WITH &&)
And what happens when you execute the query without the CTE?
Without the CTE many things are inlined and the time drops to 0.5 ms, yet a similar condition is used for the index scan (and now it is fast).
When I put (select * from _app_period) everywhere, the query also runs fast: 15 ms. The filter is planned as $3: (o.app_period && $3) AND (o.sys_period #> sys_time())
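The NULL semantics underlying this thread can be demonstrated directly. A small sketch using Python's sqlite3 (the three-valued-logic rules are the same as in Postgres): any comparison or overlap test against NULL yields NULL, WHERE treats NULL as "not true", so every row is filtered out, yet the engine may still have to visit each row to find that out unless the planner folds the constant.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100)])

# v = NULL is NULL for every row, which WHERE treats as not-true:
no_rows = conn.execute("SELECT count(*) FROM t WHERE v = NULL").fetchone()[0]
assert no_rows == 0

# NULL propagates through AND when the other operand is true:
is_null = conn.execute("SELECT (1 AND NULL) IS NULL").fetchone()[0]
assert is_null == 1
print("ok")
```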

How to get the last added record for a battery with a LEFT JOIN (PostgreSQL)

I have a query such as
select * from batteries as b ORDER BY inserted_at desc
and another query such as
select voltage, datetime, battery_id from battery_readings ORDER BY inserted_at desc limit 1
I want to combine the two queries above so that, in one go, I can get each battery's details as well as its last added voltage and datetime from battery_readings.
Postgres has a very useful syntax for this, called DISTINCT ON. This is different from plain DISTINCT in that it keeps only the first row of each set, defined by the sort order. In your case, it would be something like this:
SELECT DISTINCT ON (b.id)
b.id,
b.name,
b.source_url,
b.active,
b.user_id,
b.inserted_at,
b.updated_at,
v.voltage,
v.datetime
FROM battery b
JOIN battery_voltage v ON (b.id = v.battery_id)
ORDER BY b.id, v.datetime desc;
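DISTINCT ON is Postgres-only. On engines without it, the same "latest reading per battery" result can be produced with ROW_NUMBER(). A minimal sketch via Python's sqlite3 (requires SQLite 3.25+ for window functions; schema and data are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE battery (id INTEGER, name TEXT);
    CREATE TABLE battery_readings (battery_id INTEGER, voltage REAL, datetime TEXT);
    INSERT INTO battery VALUES (1, 'bat-1'), (2, 'bat-2');
    INSERT INTO battery_readings VALUES
        (1, 3.6, '2024-01-01'), (1, 3.8, '2024-01-02'),
        (2, 3.1, '2024-01-01'), (2, 3.0, '2024-01-03');
""")

# Number each battery's readings newest-first, then keep row 1 per battery.
rows = conn.execute("""
    SELECT b.id, b.name, r.voltage, r.datetime
    FROM battery b
    JOIN (SELECT *,
                 ROW_NUMBER() OVER (PARTITION BY battery_id
                                    ORDER BY datetime DESC) AS rn
          FROM battery_readings) r
      ON r.battery_id = b.id AND r.rn = 1
    ORDER BY b.id
""").fetchall()
print(rows)  # latest reading per battery
```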
I think that windowing will do what you expect.
Assuming two tables
create table battery (id int, name text);
create table bat_volt(measure_time int, battery_id int, val int);
One of the possible queries is like this:
with latest as (select battery_id, max(measure_time) over (partition by battery_id) from bat_volt)
select * from battery b join bat_volt bv on bv.battery_id=b.id where (b.id,bv.measure_time) in (select * from latest);
If you have a Postgres version that supports LATERAL, it might also make sense to try it (in case there are far more readings than batteries, it could perform better). Inside the lateral subquery a plain max() aggregate is enough, since it is already correlated to one battery:
select * from battery b
join bat_volt bv on bv.battery_id = b.id
join lateral
(select max(measure_time) as max_time from bat_volt bbv
where bbv.battery_id = b.id) lbb on (lbb.max_time = bv.measure_time);

Get cartesian product of two columns

How can I get the cartesian product of two columns in one table?
I have table
A 1
A 2
B 3
B 4
and I want a new table
A 1
A 2
A 3
A 4
B 1
B 2
B 3
B 4
Try this using a self-join:
select distinct b.let, a.id from [dbo].[cartesian] a join [dbo].[cartesian] b on a.id <> b.id
For this data it produces the desired result (note that it relies on every id value being distinct).
Create this table :
CREATE TABLE [dbo].[Table_1]
(
[A] [int] NOT NULL ,
[B] [nvarchar](50) NULL ,
CONSTRAINT [PK_Table_1] PRIMARY KEY CLUSTERED ( [A] ASC )
WITH ( PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON ) ON [PRIMARY]
)
ON [PRIMARY]
Fill the table like this:
INSERT INTO [dbo].[Table_1]
VALUES ( 1, 'A' )
INSERT INTO [dbo].[Table_1]
VALUES ( 2, 'A' )
INSERT INTO [dbo].[Table_1]
VALUES ( 3, 'B' )
INSERT INTO [dbo].[Table_1]
VALUES ( 4, 'C' )
SELECT *
FROM [dbo].[Table_1]
Use this query
SELECT DISTINCT
T1.B ,
T2.A
FROM dbo.Table_1 AS T1 ,
dbo.Table_1 AS T2
ORDER BY T1.B
To clarify loup's answer (in more detail than a comment allows), any join with no relevant criteria specified will naturally produce a Cartesian product (which is why a glib answer to your question might be "all too easily" -- mistakenly writing t1 INNER JOIN t2 ON t1.Key = t1.Key will produce the same result).
However, SQL Server does offer an explicit option to make your intentions known. The CROSS JOIN is essentially what you're looking for. But like INNER JOIN devolving to a Cartesian product without a useful join condition, CROSS JOIN devolves to a simple inner join if you go out of your way to add join criteria in the WHERE clause.
If this is a one-off operation, it probably doesn't matter which you use. But if you want to make it clear for posterity, consider CROSS JOIN instead.
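The explicit CROSS JOIN form can be sketched end-to-end. A minimal example using Python's sqlite3 (table name `cart` is illustrative; the CROSS JOIN syntax is the same in SQL Server and Postgres): self-joining the table with no join condition and projecting one column from each side yields the full Cartesian product the question asks for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cart (let TEXT, id INTEGER)")
conn.executemany("INSERT INTO cart VALUES (?, ?)",
                 [("A", 1), ("A", 2), ("B", 3), ("B", 4)])

# Every distinct letter paired with every distinct id: 2 x 4 = 8 rows.
rows = conn.execute("""
    SELECT DISTINCT a.let, b.id
    FROM cart a CROSS JOIN cart b
    ORDER BY a.let, b.id
""").fetchall()
print(rows)
```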

Using EXISTS as a column in TSQL

Is it possible to use the value of EXISTS as part of a query?
(Please note: unfortunately due to client constraints, I need SQLServer 2005 compatible answers!)
So when returning a set of results, one of the columns is a boolean value which states whether the subquery would return any rows.
For example, I want to return a list of usernames and whether a different table contains any rows for each user. The following is not syntactically correct, but hopefully gives you an idea of what I mean...
SELECT T1.[UserName],
(EXISTS (SELECT *
FROM [AnotherTable] T2
WHERE T1.[UserName] = T2.[UserName])
) AS [RowsExist]
FROM [UserTable] T1
Where the resultant set contains a column called [UserName] and boolean column called [RowsExist].
The obvious solution is to use a CASE, such as below, but I wondered if there was a better way of doing it...
SELECT T1.[UserName],
(CASE (SELECT COUNT(*)
FROM [AnotherTable] T2
WHERE T1.[UserName] = T2.[UserName]
)
WHEN 0 THEN CAST(0 AS BIT)
ELSE CAST(1 AS BIT) END
) AS [RowsExist]
FROM [UserTable] T1
Your second query isn't valid syntax.
SELECT T1.[UserName],
CASE
WHEN EXISTS (SELECT *
FROM [AnotherTable] T2
WHERE T1.[UserName] = T2.[UserName]) THEN CAST(1 AS BIT)
ELSE CAST(0 AS BIT)
END AS [RowsExist]
FROM [UserTable] T1
Is generally fine and will be implemented as a semi join.
The article Subqueries in CASE Expressions discusses this further.
In some cases a COUNT query can actually perform better, though, as discussed here.
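The CASE WHEN EXISTS pattern from the accepted answer can be sketched runnably. A minimal example via Python's sqlite3 (which, like SQL Server 2005, has no true boolean column type, so 0/1 stand in for the BIT casts; table names follow the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE UserTable (UserName TEXT);
    CREATE TABLE AnotherTable (UserName TEXT);
    INSERT INTO UserTable VALUES ('alice'), ('bob');
    INSERT INTO AnotherTable VALUES ('alice');
""")

# EXISTS as a derived column: 1 if the user has rows in AnotherTable, else 0.
rows = conn.execute("""
    SELECT T1.UserName,
           CASE WHEN EXISTS (SELECT * FROM AnotherTable T2
                             WHERE T1.UserName = T2.UserName)
                THEN 1 ELSE 0 END AS RowsExist
    FROM UserTable T1 ORDER BY T1.UserName
""").fetchall()
print(rows)  # [('alice', 1), ('bob', 0)]
```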
I like the other guys' SQL better, but I just wrote this:
with bla as (
select t2.username, isPresent = CAST(1 AS BIT)
from t2
group by t2.username
)
select t1.*, isPresent = isnull(bla.isPresent, CAST(0 AS BIT))
from t1
left join bla on t1.username = bla.username
From what you wrote here I would alter your first query into something like this
SELECT
T1.[UserName], ISNULL(
(
SELECT
TOP 1 1
FROM [AnotherTable]
WHERE EXISTS
(
SELECT
1
FROM [AnotherTable] AS T2
WHERE T1.[UserName] = T2.[UserName]
)
), 0)
FROM [UserTable] T1
But actually, if you use TOP 1 1 you do not need EXISTS; you could also write:
SELECT
T1.[UserName], ISNULL(
(
SELECT
TOP 1 1
FROM [AnotherTable] AS T2
WHERE T1.[UserName] = T2.[UserName]
), 0)
FROM [UserTable] T1

T-SQL query one table, get presence or absence of other table value

I'm not sure what this type of query is called so I've been unable to search for it properly. I've got two tables, Table A has about 10,000 rows. Table B has a variable amount of rows.
I want to write a query that gets all of Table A's results but with an added column, the value of that column is a boolean that says whether the result also appears in Table B.
I've written this query which works but is slow, it doesn't use a boolean but rather a count that will be either zero or one. Any suggested improvements are gratefully accepted:
SELECT u.number,u.name,u.deliveryaddress,
(SELECT COUNT(productUserid)
FROM ProductUser
WHERE number = u.number and productid = #ProductId)
AS IsInPromo
FROM Users u
UPDATE
I've run the query with actual execution plan enabled, I'm not sure how to show the results but various costs are:
Nested Loops (left semi join): 29%
Clustered Index scan (User Table): 41%
Clustered Index Scan (ProductUser table): 29%
NUMBERS
There are 7366 users in the users table and currently 18 rows in the productUser table (although this will change and could be in the thousands)
You can use EXISTS to short circuit after the first row is found rather than COUNT-ing all matching rows.
SQL Server does not have a boolean datatype. The closest equivalent is BIT
SELECT u.number,
u.name,
u.deliveryaddress,
CASE
WHEN EXISTS (SELECT *
FROM ProductUser
WHERE number = u.number
AND productid = #ProductId) THEN CAST(1 AS BIT)
ELSE CAST(0 AS BIT)
END AS IsInPromo
FROM Users u
RE: "I'm not sure what this type of query is called". This will give a plan with a semi join. See Subqueries in CASE Expressions for more about this.
Which management system are you using?
Try this:
SELECT u.number,u.name,u.deliveryaddress,
case when COUNT(p.productUserid) > 0 then 1 else 0 end
FROM Users u
left join ProductUser p on p.number = u.number and productid = #ProductId
group by u.number,u.name,u.deliveryaddress
UPD: this could be faster on MSSQL:
;with fff as
(
select distinct p.number from ProductUser p where p.productid = #ProductId
)
select u.number,u.name,u.deliveryaddress,
case when isnull(f.number, 0) = 0 then 0 else 1 end
from Users u left join fff f on f.number = u.number
Since you seem concerned about performance: this query can perform faster, as it allows an index seek on both tables rather than a scan. Note that the productid filter belongs in the ON clause (with the flag derived via CASE), otherwise the LEFT JOIN silently becomes an inner join:
SELECT u.number,
u.name,
u.deliveryaddress,
CASE WHEN p.number IS NULL THEN 0 ELSE 1 END AS IsInPromo
FROM Users u
LEFT JOIN ProductUser p ON p.number = u.number
AND p.productid = #ProductId
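The ON-versus-WHERE distinction matters here and is easy to see concretely. A sketch using Python's sqlite3 with illustrative data: with a LEFT JOIN, a filter on the right-hand table belongs in the ON clause; putting it in WHERE discards the NULL-extended rows and turns the outer join into an inner join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Users (number INTEGER);
    CREATE TABLE ProductUser (number INTEGER, productid INTEGER);
    INSERT INTO Users VALUES (1), (2);
    INSERT INTO ProductUser VALUES (1, 7);
""")

# Filter in ON: user 2 survives with IsInPromo = 0.
in_on = conn.execute("""
    SELECT u.number, CASE WHEN p.number IS NULL THEN 0 ELSE 1 END AS IsInPromo
    FROM Users u
    LEFT JOIN ProductUser p ON p.number = u.number AND p.productid = 7
    ORDER BY u.number
""").fetchall()

# Same filter in WHERE: the NULL-extended row for user 2 is discarded.
in_where = conn.execute("""
    SELECT u.number, CASE WHEN p.number IS NULL THEN 0 ELSE 1 END AS IsInPromo
    FROM Users u
    LEFT JOIN ProductUser p ON p.number = u.number
    WHERE p.productid = 7
    ORDER BY u.number
""").fetchall()

print(in_on)     # [(1, 1), (2, 0)]
print(in_where)  # [(1, 1)] -- user 2 is gone
```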