JOIN vs. subquery: why does the subquery win on performance when it should not? - postgresql

Recently I asked Why select from function is slow?.
Now, when I LEFT JOIN this function, the query takes 11500ms.
When I rewrite the LEFT JOIN as a subquery, it takes only 111ms:
SELECT
(SELECT next_ots FROM order_total_suma( next_range ) next_ots
WHERE next_ots.order_id = ots.order_id AND next_ots.consumed_period #> (ots.o).billed_to
) AS next_suma, --<< this took only 111ms. See plan
ots.* FROM (
SELECT
tstzrange(
NULLIF( (ots.o).billed_to, 'infinity' ),
NULLIF( (ots.o).billed_to +p.interval, 'infinity' )
) as next_range,
ots.*
FROM order_total_suma() ots
LEFT JOIN period p ON p.id = (ots.o).period_id
) ots
--LEFT JOIN order_total_suma( next_range ) next_ots ON next_ots.order_id = 6154
-- AND next_ots.consumed_period #> (ots.o).billed_to --<< this is fine. plan is not posted
--LEFT JOIN order_total_suma( next_range ) next_ots ON next_ots.order_id = ots.order_id
-- AND next_ots.consumed_period #> (ots.o).billed_to --<< this takes 11500ms. See Plan
WHERE ots.order_id IN ( 6154, 10805 )
Attached plans
While googling I found this blog post:
In most cases, joins are also a better solution than subqueries — Postgres will even internally “rewrite” a subquery, creating a join, whenever possible, but this of course increases the time it takes to come up with the query plan
and many SO questions like this one:
A LEFT [OUTER] JOIN can be faster than an equivalent subquery because the server might be able to optimize it better—a fact that is not specific to MySQL Server alone.
So why is LEFT JOINing the function significantly slower than the subquery?
Is there a way to make the LEFT JOIN take as little time as the subquery?

Related

inefficient subquery postgresql

Hi I have this query:
select distinct r.fparams::json->>'uuid_level_2' as uuid_level_2
from jhft.run r
where r.ts_run >= :ts_run
which returns in 323ms:
49c954c3-9d57-4777-99cb-634e59393053
4e9f3aac-b9d0-422b-badf-171c24dac138
d68726a0-7176-4bd3-aac8-b796dab074a5
I'm using it as a subquery in the IN clause of this other query:
select distinct
r.fparams::json->>'uuid_level_2' as uuid_level_2,
first_value(r.fparams) over
(partition by r.fparams::json->>'uuid_level_2' order by r.id) as first_fparams
from jhft.run r
where r.fparams::json->>'uuid_level_2' in (
select distinct r.fparams::json->>'uuid_level_2' as uuid_level_2
from jhft.run r
where r.ts_run >= :ts_run )
The results take about 20 seconds to be retrieved.
But when I run the same query with the WHERE clause written as:
where r.fparams::json->>'uuid_level_2' in (
'd68726a0-7176-4bd3-aac8-b796dab074a5',
'49c954c3-9d57-4777-99cb-634e59393053',
'4e9f3aac-b9d0-422b-badf-171c24dac138' )
the results take only about 300 ms.
It looks like a subquery in the WHERE clause causes the whole table to be scanned.
Is there any way to "simulate" the hard-coding of the keys?
An obvious candidate for a faster solution would be to use a CTE and a join (but as Erwin and a_horse_with_no_name pointed out, your question is lacking in detail to come up with a definitive solution):
WITH target AS (
SELECT DISTINCT fparams::json->>'uuid_level_2' AS uuid_level_2
FROM jhft.run
WHERE ts_run >= :ts_run
)
SELECT DISTINCT
fparams::json->>'uuid_level_2' AS uuid_level_2,
first_value(fparams) OVER
(PARTITION BY fparams::json->>'uuid_level_2' ORDER BY id) AS first_fparams
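FROM jhft.run r
-- USING (uuid_level_2) would fail here: jhft.run has no column of that name,
-- so join on the same JSON expression instead.
JOIN target t ON r.fparams::json->>'uuid_level_2' = t.uuid_level_2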
However, without any EXPLAIN ANALYZE VERBOSE output from your query as an absolute minimum, this is only an educated guess.
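To produce that output, the slow query from the question can be wrapped in EXPLAIN. This is a minimal sketch: it assumes ts_run is a timestamp column and uses a made-up literal in place of the :ts_run parameter. The alias r2 is added only for readability. BUFFERS is optional but often helps.

```sql
-- Hypothetical literal standing in for :ts_run; substitute a real value.
-- Note: EXPLAIN ANALYZE actually executes the query.
EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT DISTINCT
    r.fparams::json->>'uuid_level_2' AS uuid_level_2,
    first_value(r.fparams) OVER
        (PARTITION BY r.fparams::json->>'uuid_level_2' ORDER BY r.id) AS first_fparams
FROM jhft.run r
WHERE r.fparams::json->>'uuid_level_2' IN (
    SELECT DISTINCT r2.fparams::json->>'uuid_level_2'
    FROM jhft.run r2
    WHERE r2.ts_run >= timestamp '2020-01-01'
);
```

Running the same statement against the hard-coded IN list and comparing the two plans should show where the 20 seconds go.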

How to optimize a complex query with 17 joined tables while limiting data per join using the LIMIT syntax

I have the following example of the complex queries that get generated in our system. In this example, we're returning data that is joined to 17 other tables. For each of the joined tables, I am using the LIMIT keyword to limit the number of items returned per joined table. The goal was to retrieve a maximum of 50 items per joined table. For queries with far fewer joins (7-10), this seems to work OK.
However, using the limit of 50 in this query, I get the error: Error: temporary file size exceeds temp_file_limit (1025563kB).
If I change the limit from 50 to 5, the query runs in 36 seconds. If I change the limit from 50 to 3, it runs in 3 seconds. If I change it to 2, it runs in 260ms.
My question is: is there a more efficient way to run a complex query like this that could return those 50 items per join? Or is that too much for a single query for Postgres to process?
It's curious that it drops to 260ms when the number of returned sub-items is reduced from 5 to 2.
SELECT Count (*),
array_to_json((Array_agg(t))[0:500]) AS array
FROM (
SELECT tbl_338.id,
custom.fullname AS "CustomID",
tbl_338.field_7,
tbl_338.field_6,
tbl_338.field_5,
tbl_338.field_1,
tbl_338.field_2,
tbl_338.field_18,
tbl_338.field_17,
tbl_338.field_3,
tbl_338.field_32,
tbl_338.addedon,
tbl_338.updatedon,
tbl_338.field_16,
tbl_338.id,
tbl_338.addedby,
tbl_338.updatedby ,
jsonb_agg(DISTINCT jsonb_build_object('id',tbl_340_field_15.id,'data',tbl_340_field_15.fullname)) AS field_15,
jsonb_agg(DISTINCT jsonb_build_object('id',tbl_408_field_30.id,'data',tbl_408_field_30.fullname)) AS field_30,
jsonb_agg(DISTINCT jsonb_build_object('id',tbl_342_field_19.id,'data',tbl_342_field_19.fullname)) AS field_19 ,
jsonb_agg(DISTINCT jsonb_build_object('optionid',field_34.optionid,'data',field_34.OPTION,'attributes',field_34.attributes)) AS field_34,
jsonb_agg(DISTINCT jsonb_build_object('optionid',field_23.optionid,'data',field_23.OPTION,'attributes',field_23.attributes)) AS field_23,
jsonb_agg(DISTINCT jsonb_build_object('optionid',field_24.optionid,'data',field_24.OPTION,'attributes',field_24.attributes)) AS field_24,
jsonb_agg(DISTINCT jsonb_build_object('optionid',field_22.optionid,'data',field_22.OPTION,'attributes',field_22.attributes)) AS field_22,
jsonb_agg(DISTINCT jsonb_build_object('optionid',field_33.optionid,'data',field_33.OPTION,'attributes',field_33.attributes)) AS field_33,
jsonb_agg(DISTINCT jsonb_build_object('optionid',field_37.optionid,'data',field_37.OPTION,'attributes',field_37.attributes)) AS field_37,
jsonb_agg(DISTINCT jsonb_build_object('optionid',field_36.optionid,'data',field_36.OPTION,'attributes',field_36.attributes)) AS field_36,
jsonb_agg(DISTINCT jsonb_build_object('optionid',field_21.optionid,'data',field_21.OPTION,'attributes',field_21.attributes)) AS field_21,
jsonb_agg(DISTINCT jsonb_build_object('optionid',field_38.optionid,'data',field_38.OPTION,'attributes',field_38.attributes)) AS field_38,
jsonb_agg(DISTINCT jsonb_build_object('optionid',field_14.optionid,'data',field_14.OPTION,'attributes',field_14.attributes)) AS field_14,
jsonb_agg(DISTINCT jsonb_build_object('optionid',field_31.optionid,'data',field_31.OPTION,'attributes',field_31.attributes)) AS field_31,
jsonb_agg(DISTINCT jsonb_build_object('optionid',field_8.optionid,'data',field_8.OPTION,'attributes',field_8.attributes)) AS field_8 ,
jsonb_agg(DISTINCT jsonb_build_object('messageid',msg.messageid,'message',msg.message,'schedule',msg.schedule,'tablerowid',msg.tablerowid,'addedon',msg.addedon)) AS field_4
FROM schema_131.tbl_338 tbl_338
LEFT JOIN schema_131.tbl_338_customid custom
ON custom.id=tbl_338.id
LEFT JOIN lateral
(
SELECT DISTINCT field_15.*
FROM schema_131.tbl_338_to_tbl_340_field_15 field_15
WHERE field_15.tbl_338_field_15_id=tbl_338.id limit 50) field_15
ON true
LEFT JOIN lateral
(
SELECT DISTINCT tbl_340_field_15.*
FROM schema_131.tbl_340_customid tbl_340_field_15
WHERE tbl_340_field_15.id = field_15.tbl_340_field_5_id limit 50 ) tbl_340_field_15
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_30.*
FROM schema_131.tbl_408_to_tbl_338_field_4 field_30
WHERE field_30.tbl_338_field_30_id=tbl_338.id limit 50) field_30
ON true
LEFT JOIN lateral
(
SELECT DISTINCT tbl_408_field_30.*
FROM schema_131.tbl_408_customid tbl_408_field_30
WHERE tbl_408_field_30.id = field_30.tbl_408_field_4_id limit 50 ) tbl_408_field_30
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_19.*
FROM schema_131.tbl_338_to_tbl_342_field_19 field_19
WHERE field_19.tbl_338_field_19_id=tbl_338.id limit 50) field_19
ON true
LEFT JOIN lateral
(
SELECT DISTINCT tbl_342_field_19.*
FROM schema_131.tbl_342_customid tbl_342_field_19
WHERE tbl_342_field_19.id = field_19.tbl_342_field_5_id limit 50 ) tbl_342_field_19
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_34_join.*
FROM schema_131.tbl_338_field_34_join field_34_join
WHERE field_34_join.id=tbl_338.id limit 50) field_34_join
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_34.*
FROM schema_131.tbl_338_field_34 field_34
WHERE field_34.optionid = field_34_join.optionid
ORDER BY field_34.rank limit 5 ) field_34
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_23_join.*
FROM schema_131.tbl_338_field_23_join field_23_join
WHERE field_23_join.id=tbl_338.id limit 50) field_23_join
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_23.*
FROM schema_131.tbl_338_field_23 field_23
WHERE field_23.optionid = field_23_join.optionid
ORDER BY field_23.rank limit 5 ) field_23
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_24_join.*
FROM schema_131.tbl_338_field_24_join field_24_join
WHERE field_24_join.id=tbl_338.id limit 50) field_24_join
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_24.*
FROM schema_131.tbl_338_field_24 field_24
WHERE field_24.optionid = field_24_join.optionid
ORDER BY field_24.rank limit 5 ) field_24
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_22_join.*
FROM schema_131.tbl_338_field_22_join field_22_join
WHERE field_22_join.id=tbl_338.id limit 50) field_22_join
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_22.*
FROM schema_131.tbl_338_field_22 field_22
WHERE field_22.optionid = field_22_join.optionid
ORDER BY field_22.rank limit 5 ) field_22
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_33_join.*
FROM schema_131.tbl_338_field_33_join field_33_join
WHERE field_33_join.id=tbl_338.id limit 50) field_33_join
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_33.*
FROM schema_131.tbl_338_field_33 field_33
WHERE field_33.optionid = field_33_join.optionid
ORDER BY field_33.rank limit 5 ) field_33
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_37_join.*
FROM schema_131.tbl_338_field_37_join field_37_join
WHERE field_37_join.id=tbl_338.id limit 50) field_37_join
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_37.*
FROM schema_131.tbl_338_field_37 field_37
WHERE field_37.optionid = field_37_join.optionid
ORDER BY field_37.rank limit 5 ) field_37
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_36_join.*
FROM schema_131.tbl_338_field_36_join field_36_join
WHERE field_36_join.id=tbl_338.id limit 50) field_36_join
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_36.*
FROM schema_131.tbl_338_field_36 field_36
WHERE field_36.optionid = field_36_join.optionid
ORDER BY field_36.rank limit 5 ) field_36
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_21_join.*
FROM schema_131.tbl_338_field_21_join field_21_join
WHERE field_21_join.id=tbl_338.id limit 50) field_21_join
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_21.*
FROM schema_131.tbl_338_field_21 field_21
WHERE field_21.optionid = field_21_join.optionid
ORDER BY field_21.rank limit 5 ) field_21
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_38_join.*
FROM schema_131.tbl_338_field_38_join field_38_join
WHERE field_38_join.id=tbl_338.id limit 50) field_38_join
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_38.*
FROM schema_131.tbl_338_field_38 field_38
WHERE field_38.optionid = field_38_join.optionid
ORDER BY field_38.rank limit 5 ) field_38
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_14_join.*
FROM schema_131.tbl_338_field_14_join field_14_join
WHERE field_14_join.id=tbl_338.id limit 50) field_14_join
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_14.*
FROM schema_131.tbl_338_field_14 field_14
WHERE field_14.optionid = field_14_join.optionid
ORDER BY field_14.rank limit 5 ) field_14
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_31_join.*
FROM schema_131.tbl_338_field_31_join field_31_join
WHERE field_31_join.id=tbl_338.id limit 50) field_31_join
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_31.*
FROM schema_131.tbl_338_field_31 field_31
WHERE field_31.optionid = field_31_join.optionid
ORDER BY field_31.rank limit 5 ) field_31
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_8_join.*
FROM schema_131.tbl_338_field_8_join field_8_join
WHERE field_8_join.id=tbl_338.id limit 50) field_8_join
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_8.*
FROM schema_131.tbl_338_field_8 field_8
WHERE field_8.optionid = field_8_join.optionid
ORDER BY field_8.rank limit 5 ) field_8
ON true
LEFT JOIN lateral
(
SELECT DISTINCT msg.*
FROM schema_131.messages msg
WHERE msg.graceblockssms=tbl_338.smsnumber
AND msg.recipientsms=tbl_338.field_3
ORDER BY msg.addedon DESC limit 1) msg
ON true
GROUP BY tbl_338.id,
custom.fullname,
tbl_338.field_7,
tbl_338.field_6,
tbl_338.field_5,
tbl_338.field_1,
tbl_338.field_2,
tbl_338.field_18,
tbl_338.field_17,
tbl_338.field_3,
tbl_338.field_32,
tbl_338.addedon,
tbl_338.updatedon,
tbl_338.field_16,
tbl_338.id,
tbl_338.addedby,
tbl_338.updatedby
ORDER BY tbl_338.id ASC ) t;
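One possible reading of these timings, offered as a sketch rather than a verified diagnosis: every LEFT JOIN LATERAL that returns up to N rows multiplies the number of intermediate rows per tbl_338 row, so k independent joins can produce up to N^k rows before GROUP BY and jsonb_agg(DISTINCT ...) collapse them again. That would explain both the temp-file error and the steep drop between 5 and 2. A common way around it is to aggregate inside each lateral subquery so that it returns exactly one row. The example below shows this for two of the fields only. The aliases f15, f34, j, c and o are made up for illustration, and the original per-option ORDER BY rank LIMIT 5 is not reproduced.

```sql
SELECT t.id,
       f15.items AS field_15,
       f34.items AS field_34
FROM schema_131.tbl_338 t
LEFT JOIN LATERAL (
    -- An aggregate without GROUP BY always yields exactly one row,
    -- so this join cannot multiply the rows of tbl_338.
    SELECT jsonb_agg(jsonb_build_object('id', c.id, 'data', c.fullname)) AS items
    FROM (
        SELECT DISTINCT j.tbl_340_field_5_id
        FROM schema_131.tbl_338_to_tbl_340_field_15 j
        WHERE j.tbl_338_field_15_id = t.id
        LIMIT 50
    ) j
    JOIN schema_131.tbl_340_customid c ON c.id = j.tbl_340_field_5_id
) f15 ON true
LEFT JOIN LATERAL (
    SELECT jsonb_agg(
               jsonb_build_object('optionid', o.optionid,
                                  'data', o.option,
                                  'attributes', o.attributes)
               ORDER BY o.rank) AS items
    FROM (
        SELECT DISTINCT j.optionid
        FROM schema_131.tbl_338_field_34_join j
        WHERE j.id = t.id
        LIMIT 50
    ) j
    JOIN schema_131.tbl_338_field_34 o ON o.optionid = j.optionid
) f34 ON true
ORDER BY t.id;
```

With this shape the outer GROUP BY and the DISTINCT inside jsonb_agg are no longer needed. One difference to be aware of: when nothing matches, items is NULL instead of an array holding an object of nulls.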
First of all, PostgreSQL is not designed for complex queries... You should use another RDBMS that supports such complexity.
PostgreSQL only searches join orders exhaustively for queries with fewer FROM items than the parameter
"geqo_threshold" (default value is 12); at or above that number it switches to the genetic query optimizer (GEQO).
In PG, the time to find an optimized execution plan grows explosively with the number of JOINs (the number of possible join orders is factorial), due to the algorithm used in the optimizer...
If you set geqo_threshold to a higher value, the time taken to compute an
optimized plan can increase so much that it exceeds the
execution time of the query with a trivial execution plan.
If you leave geqo_threshold at its current value, the execution plan will
probably be computed in less time, but it is likely to be a worse execution
plan.
So you have a dilemma:
do you want a worse execution plan, or
do you want a good execution plan that takes too much time to compute?
The discussion of GEQO by the PG developers reveals a dead end...
https://www.postgresql.org/docs/7.1/geqo-pg-intro.html#GEQO-FUTURE
So, what to do?
FIRST: try to increase geqo_threshold and run some tests. But use the real-world amount of data you expect to have in 3 to 5 years, to avoid compromising your project.
SECOND: if the results from the FIRST part show that the situation is unacceptable... transfer your database to an RDBMS that does not have problems with such a situation. Microsoft SQL Server is currently the best choice for this (a better optimizer than Oracle at lower cost), and SQL Server is available on Linux.
To get an idea of the limitations of PostgreSQL and its poor performance, just read my papers:
http://mssqlserver.fr/postgresql-vs-sql-server-mssql-part-3-very-extremely-detailed-comparison/
http://mssqlserver.fr/postgresql-vs-microsoft-sql-server-comparison-part-2-count-performances/
http://mssqlserver.fr/postgresql-vs-microsoft-part-1-dba-queries-performances/
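The FIRST step above can be tried per session without touching the server configuration. A minimal sketch follows. The value 20 is an arbitrary test value, and join_collapse_limit and from_collapse_limit are not mentioned in the answer; they are two related planner settings (default 8 each) that also cap how many FROM items are reordered together, so it may be worth testing them alongside geqo_threshold.

```sql
-- Inspect the current planner limits (defaults: 12, 8, 8).
SHOW geqo_threshold;
SHOW join_collapse_limit;
SHOW from_collapse_limit;

-- Raise them for this session only; 20 is an arbitrary test value.
SET geqo_threshold = 20;
SET join_collapse_limit = 20;
SET from_collapse_limit = 20;

-- Re-run the query under EXPLAIN (ANALYZE, BUFFERS) here and compare
-- the reported planning and execution times with the earlier run.

-- Return to the server defaults afterwards.
RESET geqo_threshold;
RESET join_collapse_limit;
RESET from_collapse_limit;
```

Whether this changes anything for the query in the question is an open point: LATERAL subqueries that contain LIMIT may leave the planner little freedom to reorder joins in the first place, so the test is worth doing before drawing conclusions.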

Why does PostgreSQL do a whole index scan when the condition is FALSE?

I notice a slowdown when the query is running: from 5ms to 200ms (+44ms JIT).
https://explain.depesz.com/s/lZYf#l12
similar, but JIT is off
The underlined expression is NULL, so the whole filter expression is FALSE.
Why does PG waste 227ms here? What did I do wrong?
EXPLAIN( ANALYSE, FORMAT JSON, VERBOSE, settings, buffers )
WITH
_app_period AS ( select ?::tstzrange ),
ready AS (
SELECT
min( lower( o.app_period ) ) OVER ( PARTITION BY agreement_id ) <# (select * from _app_period) AS new_order,
max( upper( o.app_period ) ) OVER ( PARTITION BY agreement_id ) <# (select * from _app_period) AS del_order
,o.*
FROM "order_bt" o
LEFT JOIN acc_ready( 'Usage', (select * from _app_period), o ) acc_u ON acc_u.ready
LEFT JOIN acc_ready( 'Invoice', (select * from _app_period), o ) acc_i ON acc_i.ready
LEFT JOIN agreement a ON a.id = o.agreement_id
LEFT JOIN xcheck c ON c.doc_id = o.id and c.doctype = 'OrderDetail'
WHERE o.sys_period #> sys_time() AND o.app_period && app_period()
)
SELECT * FROM ready
UPD
Server version is 13.1
Is the second execution faster?
No. Result is reproducible all the time.
Perhaps sys_time() is expensive - what is that function?
This is a STABLE function which does select coalesce( biconf( 'sys_time' )::timestamptz, now() ). app_period() is a STABLE SQL function and does a similar thing.
Are you sure that the expression is NULL for all rows?
Yes. I checked the result of app_period(): it is NULL, so it does not matter how many rows are in the table. o.app_period && NULL will result in NULL for all rows.
Does the execution time change if you replace the expression with a literal NULL?
Yes, changing the condition to WHERE o.sys_period #> sys_time() AND o.app_period && NULL reduces the time to 0.08ms. The plan changes.
Do you have indexes on o.sys_period and o.app_period?
Yes. I have: "order_id_sys_period_app_period_excl" EXCLUDE USING gist (id WITH =, sys_period WITH &&, app_period WITH &&)
And what happens when you execute the query without the CTE?
Without the CTE many things are inlined and the time is reduced to 0.5ms. But a similar condition is used for the Index Scan (now it is fast).
When I put (select * from _app_period) everywhere, the query also runs fast: 15ms. The filter is planned as $3: (o.app_period && $3) AND (o.sys_period #> sys_time())
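The last observation, that the query is fast once the filter is planned with a $3 parameter, points at a general trick: an uncorrelated scalar subquery is run as an InitPlan, evaluated once, and its result is passed to the scan as a parameter. A minimal sketch of the same idea applied directly to the function calls follows. This variant is an assumption, not something tested in the question, and #> is the poster's own operator, kept as-is.

```sql
-- Wrapping each STABLE function call in a scalar subquery is expected to show
-- up in EXPLAIN as an InitPlan that feeds a $n parameter into the filter.
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.*
FROM order_bt o
WHERE o.sys_period #> (SELECT sys_time())
  AND o.app_period && (SELECT app_period());
```

If the plan then shows the same parameterised filter as the fast variant, the difference is in how the planner treats the bare function call, not in the index itself.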

Degraded SQL Query Speed By Nesting a Single Query inline vs temp table

I have a query of the following, basic form.
SELECT DISTINCT
a.field1,
b.field2,
c.agg_values
FROM a
INNER JOIN b ON a.something = b.something
LEFT JOIN (
SELECT
array_to_string(array_agg(label), ';;') AS agg_values,
some_table.some_field
FROM some_table
WHERE some_table.some_field = 'some-fixed-value'
GROUP BY some_field
) AS c ON a.some_field = c.some_field
WHERE a.some_other_field = 'some-other-fixed-value'
There's nothing too wild about this query. Pretty run of the mill!
This query runs pretty slowly in my Postgres 9.4.5 (~4 minutes), returning maybe 15k records in total. some_table has probably ~10k records.
If I move the content of that LEFT JOIN sub-query to a temp table and left join from the temp table, my performance increases substantially. My query may take only 15s now, vs 240s. To be more explicit, if I remove the SELECT array_to_string ... GROUP BY some_field query, put it into a temp table, and then left join on that temp table, BAM, fast.
CREATE TEMP TABLE temp_table_c ( ... );
INSERT INTO temp_table_c SELECT ... same query nested in LEFT JOIN from before ...;
SELECT DISTINCT
a.field1,
b.field2,
c.agg_values
FROM a
INNER JOIN b ON a.something = b.something
LEFT JOIN temp_table_c AS c ON a.some_field = c.some_field
WHERE a.some_other_field = 'some-other-fixed-value'
I would appreciate it if someone could explain why the TEMP TABLE version of the query is so much more performant.
Thanks!
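One plausible, unverified explanation: the temp table holds the aggregated result materialized once, and the planner can see its real size, whereas for the inline subquery it has to estimate the row count and may pick a poor join strategy around it. Comparing EXPLAIN ANALYZE output for both versions would confirm or refute this. A minimal sketch of the temp-table variant, built from the subquery shown in the question, with an explicit ANALYZE added as a suggestion (autovacuum does not process temp tables, so they get no statistics otherwise):

```sql
-- Materialize the aggregated subquery from the question once.
CREATE TEMP TABLE temp_table_c AS
SELECT
    array_to_string(array_agg(label), ';;') AS agg_values,
    some_table.some_field
FROM some_table
WHERE some_table.some_field = 'some-fixed-value'
GROUP BY some_field;

-- Collect statistics so the planner knows the table's real size.
ANALYZE temp_table_c;
```

The outer query then joins temp_table_c exactly as in the fast version above.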

Postgres join not respecting outer where clause

In SQL Server, I know for sure that the following query:
SELECT things.*
FROM things
LEFT OUTER JOIN (
SELECT thingreadings.thingid, reading
FROM thingreadings
INNER JOIN things on thingreadings.thingid = things.id
ORDER BY reading DESC LIMIT 1) AS readings
ON things.id = readings.thingid
WHERE things.id = '1'
would join against thingreadings only once the WHERE id = 1 had restricted the record set down. It left joins against just one row. However, for performance to be acceptable in Postgres, I have to add the WHERE id = 1 to the INNER JOIN things on thingreadings.thingid = things.id line too.
This isn't ideal. Is it possible to make Postgres understand that what I am joining against is only one row, without explicitly adding the WHERE clauses everywhere?
An example of this problem can be seen here:
I am trying to recreate the following query in a more efficient way:
SELECT things.id, things.name,
(SELECT thingreadings.id FROM thingreadings WHERE thingid = things.id ORDER BY id DESC LIMIT 1),
(SELECT thingreadings.reading FROM thingreadings WHERE thingid = things.id ORDER BY id DESC LIMIT 1)
FROM things
WHERE id IN (1,2)
http://sqlfiddle.com/#!15/a172c/2
Not really sure why you did all that work. Isn't the inner query enough?
SELECT t.*
FROM thingreadings tr
INNER JOIN things t on tr.thingid = t.id AND t.id = '1'
ORDER BY tr.reading DESC
LIMIT 1;
sqlfiddle demo
When you want to select the latest value for each thingID, you can do:
SELECT t.*,a.reading
FROM things t
INNER JOIN (
SELECT t1.*
FROM thingreadings t1
LEFT JOIN thingreadings t2
ON (t1.thingid = t2.thingid AND t1.reading < t2.reading)
WHERE t2.thingid IS NULL
) a ON a.thingid = t.id
sqlfiddle demo
The derived table gets you, for each thingid, the record with the highest reading (the one for which no other record has a greater reading), then the JOIN gets you the information from the things table for that record.
The where clause in SQL applies to the result set you're requesting, NOT to the join.
What your code is NOT saying: "do this join only for the ID of 1"...
What your code IS saying: "do this join, then pull records out of it where the ID is 1"...
This is why you need the inner where clause. Incidentally, I also think Filipe is right about the unnecessary code.
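For the query the question is trying to recreate (the latest reading per thing, by id), a LEFT JOIN LATERAL is one way to get the "do this join only for these IDs" behaviour without repeating the WHERE clause. This is a minimal sketch using the table and column names from the question. It assumes Postgres 9.3 or later, and the alias reading_id is made up.

```sql
SELECT t.id, t.name, r.id AS reading_id, r.reading
FROM things t
LEFT JOIN LATERAL (
    -- Runs once per row of things that survives the outer WHERE,
    -- so only the readings of things 1 and 2 are looked up.
    SELECT tr.id, tr.reading
    FROM thingreadings tr
    WHERE tr.thingid = t.id
    ORDER BY tr.id DESC
    LIMIT 1
) r ON true
WHERE t.id IN (1, 2);
```

Unlike the derived table in the question, the subquery here references t.id, so the restriction travels with each outer row. It also replaces the two correlated subqueries with a single lookup. An index on thingreadings (thingid, id) would likely help, though that has not been checked against the fiddle.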