I have a function that totally lost its performance after I added another filter to it in PostgreSQL.
Here is a simple example of how it looked at first, with good performance.
CREATE OR REPLACE FUNCTION my_function(param_a boolean, param_b boolean )
RETURNS TABLE(blablabla)
LANGUAGE sql
IMMUTABLE
AS $function$
with data as (
select id,amount,account_nr from transfer
)
select * from
data d
where param_a or 0.00 <> (select sum(d2.amount)
from data d2
where d2.id = d.id)
$function$
;
(cost=0.25..10.25 rows=1000 width=560)
(actual time=1162.528..1162.561 rows=306 loops=1)
Buffers: shared hit=1099180
Planning time: 2.928 ms
Execution time: 1162.630 ms
After I added another filter with a subselect and a count, I lost my performance. Is this count really so bad for performance, and can I solve it another way?
CREATE OR REPLACE FUNCTION my_function(param_a boolean, param_b boolean )
RETURNS TABLE(blablabla)
LANGUAGE sql
IMMUTABLE
AS $function$
with data as (
select id,amount,account_nr from transfer
)
select * from
data d
where (param_b or 1 < (select count(d2.account_nr)
from data d2
where d2.id = d.id
group by d2.account_nr))
and (param_a or 0.00 <> (select sum(d2.amount)
from data d2
where d2.id = d.id))
$function$
;
(cost=0.25..10.25 rows=1000 width=560)
(actual time=271191.341..271191.383 rows=306 loops=1)
Buffers: shared hit=1099180
Planning time: 2.955 ms
Execution time: 271191.463 ms
Your slow query, embedded in your stored function, is this:
with data as ( -- original query from the question.
select id,amount,account_nr from transfer
)
select *
from data d
where (param_b or 1 < (select count(d2.account_nr)
from data d2
where d2.id = d.id
group by d2.account_nr)
)
and (param_a or 0.00 <> (select sum(d2.amount)
from data d2
where d2.id = d.id)
)
This has a pointless common table expression. We can get rid of it for simplicity's sake. You can always put it back if you need it for some other purpose.
It also has a couple of correlated subqueries. Let's refactor them into a single independent subquery, starting with that subquery:
select id,
count(account_nr) account_count,
sum(amount) total_amount
from transfer
group by id
This aggregate subquery generates the number of accounts and the total amount for each id in your transfer table. Eyeball the results to convince yourself it does what you need it to do.
Then we can join this to the main query and apply your WHERE conditions.
select d.id, d.amount, d.account_nr
from transfer d
join (
select id,
count(account_nr) account_count,
sum(amount) total_amount
from transfer
group by id
) d2 ON d.id = d2.id
where (param_b or 1 < d2.account_count)
and (param_a or 0.00 <> d2.total_amount)
Using the independent subquery can speed things up a lot; sometimes the query planner decides it needs to re-evaluate the dependent subquery many times.
The following index will help the subquery run faster.
CREATE INDEX id_details ON transfer (id) INCLUDE (account_nr, amount);
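Note that the INCLUDE clause requires PostgreSQL 11 or later. On older versions (an assumption here, since your version isn't stated), a plain multicolumn index covers the same columns:

```sql
-- Fallback for PostgreSQL versions before 11, which lack INCLUDE:
-- a plain multicolumn index still lets the aggregate subquery satisfy
-- its reads from (id, account_nr, amount) via an index-only scan.
CREATE INDEX id_details ON transfer (id, account_nr, amount);
```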
Convince yourself this works and is fast enough. (I did not debug it, because I don't have your data.) You'll need to test it substituting true and false for param_a and param_b.
Then, and only then, put it into your stored function.
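Putting it together, the function body would look something like this. This is only a sketch of the refactor above: the real RETURNS TABLE column list is elided in the question, so blablabla stays as a placeholder, and note that the volatility marker should arguably be STABLE rather than IMMUTABLE, since the function reads from a table.

```sql
CREATE OR REPLACE FUNCTION my_function(param_a boolean, param_b boolean)
RETURNS TABLE(blablabla)  -- placeholder column list, as in the question
LANGUAGE sql
STABLE  -- the function reads from transfer, so IMMUTABLE was incorrect
AS $function$
select d.id, d.amount, d.account_nr
from transfer d
join (
    select id,
           count(account_nr) account_count,
           sum(amount) total_amount
    from transfer
    group by id
) d2 on d.id = d2.id
where (param_b or 1 < d2.account_count)
  and (param_a or 0.00 <> d2.total_amount)
$function$;
```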
Hi, I have this query:
select distinct r.fparams::json->>'uuid_level_2' as uuid_level_2
from jhft.run r
where r.ts_run >= :ts_run
which returns in 323ms:
49c954c3-9d57-4777-99cb-634e59393053
4e9f3aac-b9d0-422b-badf-171c24dac138
d68726a0-7176-4bd3-aac8-b796dab074a5
I'm using it as a subquery in an IN clause in this other query:
select distinct
r.fparams::json->>'uuid_level_2' as uuid_level_2,
first_value(r.fparams) over
(partition by r.fparams::json->>'uuid_level_2' order by r.id) as first_fparams
from jhft.run r
where r.fparams::json->>'uuid_level_2' in (
select distinct r.fparams::json->>'uuid_level_2' as uuid_level_2
from jhft.run r
where r.ts_run >= :ts_run )
the results take about 20 seconds to be retrieved.
BUT when I try to make the same query with the where clause as:
where r.fparams::json->>'uuid_level_2' in (
'd68726a0-7176-4bd3-aac8-b796dab074a5',
'49c954c3-9d57-4777-99cb-634e59393053',
'4e9f3aac-b9d0-422b-badf-171c24dac138' )
the results take just about 300 ms.
It looks like when there is a subquery in the WHERE clause, it causes the whole table to be scanned.
Is there any way to "simulate" the hard-coding of the keys?
An obvious candidate for a faster solution would be to use a CTE and a join (but as Erwin and a_horse_with_no_name pointed out, your question is lacking in detail to come up with a definitive solution):
WITH target AS (
SELECT DISTINCT fparams::json->>'uuid_level_2' AS uuid_level_2
FROM jhft.run
WHERE ts_run >= :ts_run
)
SELECT DISTINCT
    r.fparams::json->>'uuid_level_2' AS uuid_level_2,
    first_value(r.fparams) OVER
        (PARTITION BY r.fparams::json->>'uuid_level_2' ORDER BY r.id) AS first_fparams
FROM jhft.run r
JOIN target ON r.fparams::json->>'uuid_level_2' = target.uuid_level_2
However, without any EXPLAIN ANALYZE VERBOSE output from your query as an absolute minimum, this is only an educated guess.
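If the planner still inlines the CTE and rescans the table (possible on PostgreSQL 12 and later, where non-recursive CTEs are no longer optimization fences by default), forcing materialization is one thing to try. This is a sketch under that assumption, not verified against your data:

```sql
-- PostgreSQL 12+ only: MATERIALIZED forces the CTE to be computed once
-- instead of being inlined into (and re-evaluated by) the outer query.
WITH target AS MATERIALIZED (
    SELECT DISTINCT fparams::json->>'uuid_level_2' AS uuid_level_2
    FROM jhft.run
    WHERE ts_run >= :ts_run
)
SELECT DISTINCT
       r.fparams::json->>'uuid_level_2' AS uuid_level_2,
       first_value(r.fparams) OVER
           (PARTITION BY r.fparams::json->>'uuid_level_2' ORDER BY r.id) AS first_fparams
FROM jhft.run r
JOIN target t ON r.fparams::json->>'uuid_level_2' = t.uuid_level_2;
```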
I am running PostgreSQL 9.6.
I have a complex query that isn't using the indexes that I expect, when I break it down to this small example I am lost as to why the index isn't being used.
These examples run on a table with 1 million records, and currently all records have the value 'COMPLETED' for column state. State is a text column and I have a btree index on it.
The following query uses my index as I'd expect:
explain analyze
SELECT * FROM(
SELECT
q.state = 'COMPLETED'::text AS completed_successfully
FROM request.request q
) a where NOT completed_successfully;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
Index Only Scan using request_state_index on request q (cost=0.43..88162.19 rows=11200 width=1) (actual time=200.554..200.554 rows=0 loops=1)
Filter: (state <> 'COMPLETED'::text)
Rows Removed by Filter: 1050005
Heap Fetches: 198150
Planning time: 0.272 ms
Execution time: 200.579 ms
(6 rows)
But if I add anything else to the select that references my table, then the planner chooses to do a sequential scan instead.
explain analyze
SELECT * FROM(
SELECT
q.state = 'COMPLETED'::text AS completed_successfully,
q.type
FROM request.request q
) a where NOT completed_successfully;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
Seq Scan on request q (cost=0.00..234196.06 rows=11200 width=8) (actual time=407.713..407.713 rows=0 loops=1)
Filter: (state <> 'COMPLETED'::text)
Rows Removed by Filter: 1050005
Planning time: 0.113 ms
Execution time: 407.733 ms
(5 rows)
Even this simpler example has the same issue.
Uses Index:
SELECT
q.state
FROM request.request q
WHERE q.state = 'COMPLETED';
Doesn't use Index:
SELECT
q.state,
q.type
FROM request.request q
WHERE q.state = 'COMPLETED';
[UPDATE]
I now understand (for this case) that the index it was using there is an INDEX ONLY scan, and it stops using that here because type isn't also in the index. So the question perhaps is why it won't use the index in the 'NOT' case below:
When I use a different value that isn't in the table, it knows to use the index (which makes sense):
SELECT
q.state,
q.type
FROM request.request q
WHERE q.state = 'CREATED';
But if I negate it, it doesn't:
SELECT
q.state,
q.type
FROM request.request q
WHERE q.state != 'COMPLETED';
Why is my index not being used?
What can I do to ensure it gets used?
Most of the time, I expect nearly all the records in this table to be in one of many end states (matched using the IN operator). So when running my more complex query, I expect these records to be excluded from the more expensive part of the query early and quickly.
[UPDATES]
It looks like NOT is not a supported B-tree operation, so I'll need some other kind of approach: https://www.postgresql.org/docs/current/indexes-types.html#INDEXES-TYPES-BTREE
I tried adding the following partial indexes but they didn't seem to work:
CREATE INDEX request_incomplete_state_index ON request.request (state) WHERE state NOT IN('COMPLETED', 'FAILED', 'CANCELLED');
CREATE INDEX request_complete_state_index ON request.request (state) WHERE state IN('COMPLETED', 'FAILED', 'CANCELLED');
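For what it's worth, the planner only considers a partial index when the query's WHERE clause provably implies the index's predicate, so these two indexes would only ever be candidates for queries that repeat the predicate verbatim. A sketch of a query that should be able to match the first one (untested against this schema):

```sql
-- The WHERE clause exactly implies the predicate of
-- request_incomplete_state_index, so the planner can consider it
-- (likely as a bitmap or plain index scan, since type isn't indexed).
EXPLAIN ANALYZE
SELECT state, type
FROM request.request
WHERE state NOT IN ('COMPLETED', 'FAILED', 'CANCELLED');
```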
This partial index does work, but is not an ideal solution.
CREATE INDEX request_incomplete_state_exact_index ON request.request (state) WHERE state != 'COMPLETED';
explain analyze SELECT q.state, q.type FROM request.request q WHERE q.state != 'COMPLETED';
I also tried this expression index which, while also not ideal, didn't work either:
CREATE OR REPLACE FUNCTION request.request_is_done(in_state text)
RETURNS BOOLEAN
LANGUAGE sql
STABLE
AS $function$
SELECT in_state IN ('COMPLETED', 'FAILED', 'CANCELLED');
$function$
;
CREATE INDEX request_is_done_index ON request.request (request.request_is_done(state));
explain analyze select * from request.request q where NOT request.request_is_done(state);
Using a list of states (an IN clause) with equality works, so I may have to restructure my larger query to avoid the NOT.
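A sketch of that idea: enumerate the non-terminal states explicitly instead of negating the terminal ones. The state names other than COMPLETED are hypothetical placeholders for whatever the real non-terminal states are in this schema:

```sql
-- Equality/IN predicates are B-tree-searchable, unlike NOT / <>.
-- 'CREATED' and 'IN_PROGRESS' are placeholder state names.
SELECT q.state, q.type
FROM request.request q
WHERE q.state IN ('CREATED', 'IN_PROGRESS');
```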
I have the following example of the complex queries that get generated in our system. In this example, we're returning data that is joined to 17 other tables. For each of the joined tables, I am using the LIMIT keyword to limit the number of items returned per joined table. The goal was to retrieve a max of 50 items per joined table. For queries with far fewer joins (7-10), this seems to work OK.
However, using the limit of 50 in this query, I get the error: Error: temporary file size exceeds temp_file_limit (1025563kB).
If I change the limit from 50 to 5, the query runs in 36 seconds. If I change the limit from 50 to 3, it runs in 3 seconds. If I change it to 2, it runs in 260 ms.
My question is: is there a more efficient way to run a complex query like this that could return those 50 items per join? Or is that too much for a single query for Postgres to process?
It's curious that it drops to 260 ms when reducing the number of returned sub-items from 5 to 2.
SELECT Count (*),
array_to_json((Array_agg(t))[0:500]) AS array
FROM (
SELECT tbl_338.id,
custom.fullname AS "CustomID",
tbl_338.field_7,
tbl_338.field_6,
tbl_338.field_5,
tbl_338.field_1,
tbl_338.field_2,
tbl_338.field_18,
tbl_338.field_17,
tbl_338.field_3,
tbl_338.field_32,
tbl_338.addedon,
tbl_338.updatedon,
tbl_338.field_16,
tbl_338.id,
tbl_338.addedby,
tbl_338.updatedby ,
jsonb_agg(DISTINCT jsonb_build_object('id',tbl_340_field_15.id,'data',tbl_340_field_15.fullname)) AS field_15,
jsonb_agg(DISTINCT jsonb_build_object('id',tbl_408_field_30.id,'data',tbl_408_field_30.fullname)) AS field_30,
jsonb_agg(DISTINCT jsonb_build_object('id',tbl_342_field_19.id,'data',tbl_342_field_19.fullname)) AS field_19 ,
jsonb_agg(DISTINCT jsonb_build_object('optionid',field_34.optionid,'data',field_34.OPTION,'attributes',field_34.attributes)) AS field_34,
jsonb_agg(DISTINCT jsonb_build_object('optionid',field_23.optionid,'data',field_23.OPTION,'attributes',field_23.attributes)) AS field_23,
jsonb_agg(DISTINCT jsonb_build_object('optionid',field_24.optionid,'data',field_24.OPTION,'attributes',field_24.attributes)) AS field_24,
jsonb_agg(DISTINCT jsonb_build_object('optionid',field_22.optionid,'data',field_22.OPTION,'attributes',field_22.attributes)) AS field_22,
jsonb_agg(DISTINCT jsonb_build_object('optionid',field_33.optionid,'data',field_33.OPTION,'attributes',field_33.attributes)) AS field_33,
jsonb_agg(DISTINCT jsonb_build_object('optionid',field_37.optionid,'data',field_37.OPTION,'attributes',field_37.attributes)) AS field_37,
jsonb_agg(DISTINCT jsonb_build_object('optionid',field_36.optionid,'data',field_36.OPTION,'attributes',field_36.attributes)) AS field_36,
jsonb_agg(DISTINCT jsonb_build_object('optionid',field_21.optionid,'data',field_21.OPTION,'attributes',field_21.attributes)) AS field_21,
jsonb_agg(DISTINCT jsonb_build_object('optionid',field_38.optionid,'data',field_38.OPTION,'attributes',field_38.attributes)) AS field_38,
jsonb_agg(DISTINCT jsonb_build_object('optionid',field_14.optionid,'data',field_14.OPTION,'attributes',field_14.attributes)) AS field_14,
jsonb_agg(DISTINCT jsonb_build_object('optionid',field_31.optionid,'data',field_31.OPTION,'attributes',field_31.attributes)) AS field_31,
jsonb_agg(DISTINCT jsonb_build_object('optionid',field_8.optionid,'data',field_8.OPTION,'attributes',field_8.attributes)) AS field_8 ,
jsonb_agg(DISTINCT jsonb_build_object('messageid',msg.messageid,'message',msg.message,'schedule',msg.schedule,'tablerowid',msg.tablerowid,'addedon',msg.addedon)) AS field_4
FROM schema_131.tbl_338 tbl_338
LEFT JOIN schema_131.tbl_338_customid custom
ON custom.id=tbl_338.id
LEFT JOIN lateral
(
SELECT DISTINCT field_15.*
FROM schema_131.tbl_338_to_tbl_340_field_15 field_15
WHERE field_15.tbl_338_field_15_id=tbl_338.id limit 50) field_15
ON true
LEFT JOIN lateral
(
SELECT DISTINCT tbl_340_field_15.*
FROM schema_131.tbl_340_customid tbl_340_field_15
WHERE tbl_340_field_15.id = field_15.tbl_340_field_5_id limit 50 ) tbl_340_field_15
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_30.*
FROM schema_131.tbl_408_to_tbl_338_field_4 field_30
WHERE field_30.tbl_338_field_30_id=tbl_338.id limit 50) field_30
ON true
LEFT JOIN lateral
(
SELECT DISTINCT tbl_408_field_30.*
FROM schema_131.tbl_408_customid tbl_408_field_30
WHERE tbl_408_field_30.id = field_30.tbl_408_field_4_id limit 50 ) tbl_408_field_30
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_19.*
FROM schema_131.tbl_338_to_tbl_342_field_19 field_19
WHERE field_19.tbl_338_field_19_id=tbl_338.id limit 50) field_19
ON true
LEFT JOIN lateral
(
SELECT DISTINCT tbl_342_field_19.*
FROM schema_131.tbl_342_customid tbl_342_field_19
WHERE tbl_342_field_19.id = field_19.tbl_342_field_5_id limit 50 ) tbl_342_field_19
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_34_join.*
FROM schema_131.tbl_338_field_34_join field_34_join
WHERE field_34_join.id=tbl_338.id limit 50) field_34_join
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_34.*
FROM schema_131.tbl_338_field_34 field_34
WHERE field_34.optionid = field_34_join.optionid
ORDER BY field_34.rank limit 5 ) field_34
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_23_join.*
FROM schema_131.tbl_338_field_23_join field_23_join
WHERE field_23_join.id=tbl_338.id limit 50) field_23_join
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_23.*
FROM schema_131.tbl_338_field_23 field_23
WHERE field_23.optionid = field_23_join.optionid
ORDER BY field_23.rank limit 5 ) field_23
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_24_join.*
FROM schema_131.tbl_338_field_24_join field_24_join
WHERE field_24_join.id=tbl_338.id limit 50) field_24_join
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_24.*
FROM schema_131.tbl_338_field_24 field_24
WHERE field_24.optionid = field_24_join.optionid
ORDER BY field_24.rank limit 5 ) field_24
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_22_join.*
FROM schema_131.tbl_338_field_22_join field_22_join
WHERE field_22_join.id=tbl_338.id limit 50) field_22_join
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_22.*
FROM schema_131.tbl_338_field_22 field_22
WHERE field_22.optionid = field_22_join.optionid
ORDER BY field_22.rank limit 5 ) field_22
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_33_join.*
FROM schema_131.tbl_338_field_33_join field_33_join
WHERE field_33_join.id=tbl_338.id limit 50) field_33_join
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_33.*
FROM schema_131.tbl_338_field_33 field_33
WHERE field_33.optionid = field_33_join.optionid
ORDER BY field_33.rank limit 5 ) field_33
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_37_join.*
FROM schema_131.tbl_338_field_37_join field_37_join
WHERE field_37_join.id=tbl_338.id limit 50) field_37_join
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_37.*
FROM schema_131.tbl_338_field_37 field_37
WHERE field_37.optionid = field_37_join.optionid
ORDER BY field_37.rank limit 5 ) field_37
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_36_join.*
FROM schema_131.tbl_338_field_36_join field_36_join
WHERE field_36_join.id=tbl_338.id limit 50) field_36_join
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_36.*
FROM schema_131.tbl_338_field_36 field_36
WHERE field_36.optionid = field_36_join.optionid
ORDER BY field_36.rank limit 5 ) field_36
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_21_join.*
FROM schema_131.tbl_338_field_21_join field_21_join
WHERE field_21_join.id=tbl_338.id limit 50) field_21_join
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_21.*
FROM schema_131.tbl_338_field_21 field_21
WHERE field_21.optionid = field_21_join.optionid
ORDER BY field_21.rank limit 5 ) field_21
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_38_join.*
FROM schema_131.tbl_338_field_38_join field_38_join
WHERE field_38_join.id=tbl_338.id limit 50) field_38_join
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_38.*
FROM schema_131.tbl_338_field_38 field_38
WHERE field_38.optionid = field_38_join.optionid
ORDER BY field_38.rank limit 5 ) field_38
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_14_join.*
FROM schema_131.tbl_338_field_14_join field_14_join
WHERE field_14_join.id=tbl_338.id limit 50) field_14_join
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_14.*
FROM schema_131.tbl_338_field_14 field_14
WHERE field_14.optionid = field_14_join.optionid
ORDER BY field_14.rank limit 5 ) field_14
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_31_join.*
FROM schema_131.tbl_338_field_31_join field_31_join
WHERE field_31_join.id=tbl_338.id limit 50) field_31_join
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_31.*
FROM schema_131.tbl_338_field_31 field_31
WHERE field_31.optionid = field_31_join.optionid
ORDER BY field_31.rank limit 5 ) field_31
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_8_join.*
FROM schema_131.tbl_338_field_8_join field_8_join
WHERE field_8_join.id=tbl_338.id limit 50) field_8_join
ON true
LEFT JOIN lateral
(
SELECT DISTINCT field_8.*
FROM schema_131.tbl_338_field_8 field_8
WHERE field_8.optionid = field_8_join.optionid
ORDER BY field_8.rank limit 5 ) field_8
ON true
LEFT JOIN lateral
(
SELECT DISTINCT msg.*
FROM schema_131.messages msg
WHERE msg.graceblockssms=tbl_338.smsnumber
AND msg.recipientsms=tbl_338.field_3
ORDER BY msg.addedon DESC limit 1) msg
ON true
GROUP BY tbl_338.id,
custom.fullname,
tbl_338.field_7,
tbl_338.field_6,
tbl_338.field_5,
tbl_338.field_1,
tbl_338.field_2,
tbl_338.field_18,
tbl_338.field_17,
tbl_338.field_3,
tbl_338.field_32,
tbl_338.addedon,
tbl_338.updatedon,
tbl_338.field_16,
tbl_338.id,
tbl_338.addedby,
tbl_338.updatedby
ORDER BY tbl_338.id ASC ) t;
First of all, PostgreSQL is not designed for complex queries... You should use another RDBMS that supports such complexity.
PostgreSQL limits join optimization with a parameter
called "geqo_threshold" (default value is 12): beyond that many FROM items, it switches to a genetic (heuristic) optimizer.
In PostgreSQL, the time to find an optimized execution plan grows factorially with the number of JOINs, due to the algorithm used in the optimizer...
If you set geqo_threshold to a higher value, the time taken to compute an
optimized plan will increase too much and can exceed the
execution time of the query with a trivial execution plan.
If you leave geqo_threshold at its current value, the execution plan will
probably be computed in less time, but will probably be a worse execution
plan...
So you have a dilemma:
do you want a worse execution plan computed quickly, or
do you want a good execution plan that takes too much time to compute?
The discussion about the use of GEQO by the PostgreSQL team reveals a dead end...
https://www.postgresql.org/docs/7.1/geqo-pg-intro.html#GEQO-FUTURE
So, what to do?
FIRST: try increasing geqo_threshold and run some tests. But use a real-world amount of data, comparable to what you expect to have in 3 to 5 years, to avoid compromising your project.
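A minimal way to experiment, assuming session-level settings are acceptable (the values here are arbitrary starting points, not recommendations):

```sql
-- Raise the point at which PostgreSQL switches to the genetic optimizer,
-- and let the planner reorder more explicit JOIN / FROM items,
-- for the current session only.
SET geqo_threshold = 20;
SET join_collapse_limit = 20;
SET from_collapse_limit = 20;
-- Then re-run the query under EXPLAIN (ANALYZE, BUFFERS) and compare
-- planning time against execution time.
```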
SECOND: if your results from the FIRST part lead you to conclude that the situation is unacceptable... transfer your database to an RDBMS that does not have problems with such a situation. Microsoft SQL Server is currently the best choice for this (a better optimizer than Oracle's, at a lower cost), and SQL Server is available on Linux.
For a look at the limitations of PostgreSQL and its poor performance in these cases, read my papers:
http://mssqlserver.fr/postgresql-vs-sql-server-mssql-part-3-very-extremely-detailed-comparison/
http://mssqlserver.fr/postgresql-vs-microsoft-sql-server-comparison-part-2-count-performances/
http://mssqlserver.fr/postgresql-vs-microsoft-part-1-dba-queries-performances/
There are 3 tables that I need to join with different filters.
E.g.:
Select 'First' as Header, *
from A join B on A.ID=B.ID
where A.Type=1 and B.Startdate>Getdate()
Union
Select 'Second' as Header, *
from A join B on A.ID=B.ID
where A.Type=2 and B.Startdate=Getdate()
Union
Select 'Third' as Header, *
from A join B on A.ID=B.ID
where A.Type=3 and B.Startdate<Getdate()
Will there be any performance impact if the above is rewritten as:
With CTE
as
(
Select *
from A join B on A.ID=B.ID
Where A.Type in (1,2,3)
)
Select 'First' as Header, *
from CTE
where Type=1 and Startdate>Getdate()
Union
Select 'Second' as Header, *
from CTE
where Type=2 and Startdate=Getdate()
Union
Select 'Third' as Header, *
from CTE
where Type=3 and Startdate<Getdate()
Both table A and B have over 100k records.
I have noticed the execution query plan seems to be the same for both queries, with little variation in physical, logical and read-ahead reads, and execution timings differing by milliseconds.
Is the above CTE better performance-wise than the SELECT mentioned at the start of this segment, or not? Are there any other methods to get the above results with better performance?
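One point worth testing regardless of the CTE question (a hedged suggestion, since no schema or data was given): the three branches are mutually exclusive by construction, because each row carries a distinct Header literal, so the duplicate-elimination sort that plain UNION performs can never remove anything. UNION ALL skips that sort:

```sql
-- Each branch tags its rows with a different Header literal, so no
-- duplicates can exist across branches; UNION ALL avoids UNION's
-- pointless distinct-sort over the combined result.
SELECT 'First' AS Header, *
FROM A JOIN B ON A.ID = B.ID
WHERE A.Type = 1 AND B.Startdate > Getdate()
UNION ALL
SELECT 'Second' AS Header, *
FROM A JOIN B ON A.ID = B.ID
WHERE A.Type = 2 AND B.Startdate = Getdate()
UNION ALL
SELECT 'Third' AS Header, *
FROM A JOIN B ON A.ID = B.ID
WHERE A.Type = 3 AND B.Startdate < Getdate();
```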