Optimizing a Postgres query with arithmetic operators - postgresql

I have a simple query like this, on a very large table:
(select "table_a"."id",
"table_a"."b_id",
"table_a"."timestamp"
from "table_a"
left outer join "table_b"
on "table_b"."b_id" = "table_a"."b_id"
where ((cast("table_b"."total" ->> 'bar' as int) - coalesce(
(cast("table_b"."ok" ->> 'bar' as int) +
cast("table_b"."ko" ->> 'bar' as int)), 0)) > 0 and coalesce(
(cast("table_b"."ok" ->> 'bar' as int) +
cast("table_b"."ko" ->> 'bar' as int)),
0) > 0)
order by "table_a"."timestamp" desc fetch next 25 rows only)
The problem is that it takes quite some time:
Limit (cost=0.84..160.44 rows=25 width=41) (actual time=2267.067..2267.069 rows=0 loops=1)
-> Nested Loop (cost=0.84..124849.43 rows=19557 width=41) (actual time=2267.065..2267.066 rows=0 loops=1)
-> Index Scan using table_a_timestamp_index on table_a (cost=0.42..10523.32 rows=188976 width=33) (actual time=0.011..57.550 rows=188976 loops=1)
-> Index Scan using table_b_b_id_key on table_b (cost=0.42..0.60 rows=1 width=103) (actual time=0.011..0.011 rows=0 loops=188976)
Index Cond: ((b_id)::text = (table_a.b_id)::text)
" Filter: ((COALESCE((((ok ->> 'bar'::text))::integer + ((ko ->> 'bar'::text))::integer), 0) > 0) AND ((((total ->> 'bar'::text))::integer - COALESCE((((ok ->> 'bar'::text))::integer + ((ko ->> 'bar'::text))::integer), 0)) > 0))"
Rows Removed by Filter: 1
Planning Time: 0.411 ms
Execution Time: 2267.135 ms
I tried adding indexes:
create index table_b_bar_total ON "table_b" using BTREE (coalesce(
(cast("table_b"."ok" ->> 'bar' as int) +
cast("table_b"."ko" ->> 'bar' as int)),
0));
create index table_b_bar_remaining ON "table_b" using BTREE
((cast("table_b"."total" ->> 'bar' as int) - coalesce(
(cast("table_b"."ok" ->> 'bar' as int) +
cast("table_b"."ko" ->> 'bar' as int)), 0)));
But it doesn't change anything. How can I make this query run faster?

Ordinary column indexes don't have their own statistics; the table's statistics are sufficient for the indexed columns to be assessed for planning. Expression indexes, however, do have their own statistics collected (on the expression results) whenever the table is analyzed. The problem is that creating an expression index does not trigger an autoanalyze on the table, so those needed stats can stay uncollected for a long time. So it is a good idea to manually run ANALYZE after creating an expression index.
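For example, for the table in the question:
ANALYZE "table_b";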
Since your expressions are always compared to zero, it might be better to create one index on the larger expression (including the >0 comparisons and the ANDing of them as part of the indexed expression), rather than two indexes which need to be bitmap-ANDed. Since that expression is a boolean, it might be tempting to create a partial index with it, but I think that would be a mistake. Unlike expression indexes, partial indexes do not have statistics collected, and so do not help inform the planner on how many rows will be found.
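A sketch of what that single combined expression index could look like (untested; the index name is made up, and the planner will only consider it if the query's WHERE clause references the expression in this same combined form):
create index table_b_bar_has_remaining on "table_b" using btree ((
    (cast("table_b"."total" ->> 'bar' as int)
     - coalesce(cast("table_b"."ok" ->> 'bar' as int)
              + cast("table_b"."ko" ->> 'bar' as int), 0)) > 0
    and coalesce(cast("table_b"."ok" ->> 'bar' as int)
               + cast("table_b"."ko" ->> 'bar' as int), 0) > 0
));
-- as noted above, collect statistics on the new expression
ANALYZE "table_b";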

Related

Partial GIN index does not work with WHERE

I have the following table:
CREATE TABLE m2m_entries_n_elements(
    entry_id UUID,
    element_id UUID,
    value JSONB
);
value is a jsonb object in the following format: {<type>: <value>}
And I want to create a GIN index only for number values:
CREATE INDEX IF NOT EXISTS idx_element_value_number
ON m2m_entries_n_elements
USING GIN (element_id, CAST(value ->> 'number' AS INT))
WHERE value ? 'number';
But when I use EXPLAIN ANALYZE I see that the index is not used:
EXPLAIN ANALYZE
SELECT *
FROM m2m_entries_n_elements WHERE CAST(value ->> 'number' AS INT) = 2;
Seq Scan on m2m_entries_n_elements (cost=0.00..349.02 rows=50 width=89) (actual time=0.013..2.087 rows=1663 loops=1)
Filter: (((value ->> 'number'::text))::integer = 2)
Rows Removed by Filter: 8338
Planning Time: 0.042 ms
Execution Time: 2.150 ms
But if I remove WHERE value ? 'number' from the index definition, it starts working:
Bitmap Heap Scan on m2m_entries_n_elements (cost=6.39..70.29 rows=50 width=89) (actual time=0.284..0.819 rows=1663 loops=1)
Recheck Cond: (((value ->> 'number'::text))::integer = 2)
Heap Blocks: exact=149
-> Bitmap Index Scan on idx_elements (cost=0.00..6.38 rows=50 width=0) (actual time=0.257..0.258 rows=1663 loops=1)
Index Cond: (((value ->> 'number'::text))::integer = 2)
Planning Time: 0.207 ms
Execution Time: 0.922 ms
PostgreSQL does not have a general theorem prover. Maybe you intuit that value ->> 'number' being defined implies that value ? 'number' is true, but PostgreSQL doesn't know that. You would need to explicitly include the ? condition in your query to get use of the index.
But PostgreSQL is smart enough to know that CAST(value ->> 'number' AS INT) = 2 does imply that the LHS can't be null, so if you create the partial index WHERE value ->> 'number' IS NOT NULL then it will get used with no change to your query.
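To illustrate both options (a sketch based on the objects in the question):
-- Option 1: keep the partial index as it is and repeat its predicate in the query
SELECT *
FROM m2m_entries_n_elements
WHERE value ? 'number'
  AND CAST(value ->> 'number' AS INT) = 2;

-- Option 2: define the partial index with a predicate the planner can derive
-- from the original query's condition, so the query itself needs no change
CREATE INDEX IF NOT EXISTS idx_element_value_number
ON m2m_entries_n_elements
USING GIN (element_id, CAST(value ->> 'number' AS INT))
WHERE value ->> 'number' IS NOT NULL;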

PostgreSQL query takes a very long time

I have a table with 3 columns and a composite primary key over all 3 of them. Each individual column has lots of duplicates, and I have a separate btree index on each of them. The table has around 10 million records.
My query, with just a condition on a hardcoded value for a single column, always returns more than a million records. It takes more than 40 seconds, whereas it takes only a few seconds if I limit the query to 1 or 2 million rows without any condition.
Any help to optimize it, given that there is no bitmap index in Postgres? Since all 3 columns have lots of duplicates, would it help if I dropped the btree indexes on them?
SELECT t1.filterid,
t1.filterby,
t1.filtertype
FROM echo_sm.usernotificationfilters t1
WHERE t1.filtertype = 9
UNION
SELECT t1.filterid, '-1' AS filterby, 9 AS filtertype
FROM echo_sm.usernotificationfilters t1
WHERE NOT EXISTS (SELECT 1
FROM echo_sm.usernotificationfilters t2
WHERE t2.filtertype = 9 AND t2.filterid = t1.filterid);
The filtertype column is an integer and the other two are varchar(50). All 3 columns have separate btree indexes on them.
Explain plan:
Unique (cost=2168171.15..2201747.47 rows=3357632 width=154) (actual time=32250.340..36371.928 rows=3447159 loops=1)
-> Sort (cost=2168171.15..2176565.23 rows=3357632 width=154) (actual time=32250.337..35544.050 rows=4066447 loops=1)
Sort Key: usernotificationfilters.filterid, usernotificationfilters.filterby, usernotificationfilters.filtertype
Sort Method: external merge Disk: 142696kB
-> Append (cost=62854.08..1276308.41 rows=3357632 width=154) (actual time=150.155..16025.874 rows=4066447 loops=1)
-> Bitmap Heap Scan on usernotificationfilters (cost=62854.08..172766.46 rows=3357631 width=25) (actual time=150.154..574.297 rows=3422522 loops=1)
Recheck Cond: (filtertype = 9)
Heap Blocks: exact=39987
-> Bitmap Index Scan on index_sm_usernotificationfilters_filtertype (cost=0.00..62014.67 rows=3357631 width=0) (actual time=143.585..143.585 rows=3422522 loops=1)
Index Cond: (filtertype = 9)
-> Gather (cost=232131.85..1069965.63 rows=1 width=50) (actual time=3968.492..15133.812 rows=643925 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Hash Anti Join (cost=231131.85..1068965.53 rows=1 width=50) (actual time=4135.235..12945.029 rows=214642 loops=3)
Hash Cond: ((usernotificationfilters_1.filterid)::text = (usernotificationfilters_1_1.filterid)::text)
-> Parallel Seq Scan on usernotificationfilters usernotificationfilters_1 (cost=0.00..106879.18 rows=3893718 width=14) (actual time=0.158..646.432 rows=3114974 loops=3)
-> Hash (cost=172766.46..172766.46 rows=3357631 width=14) (actual time=4133.991..4133.991 rows=3422522 loops=3)
Buckets: 131072 Batches: 64 Memory Usage: 3512kB
-> Bitmap Heap Scan on usernotificationfilters usernotificationfilters_1_1 (cost=62854.08..172766.46 rows=3357631 width=14) (actual time=394.775..1891.931 rows=3422522 loops=3)
Recheck Cond: (filtertype = 9)
Heap Blocks: exact=39987
-> Bitmap Index Scan on index_sm_usernotificationfilters_filtertype (cost=0.00..62014.67 rows=3357631 width=0) (actual time=383.635..383.635 rows=3422522 loops=3)
Index Cond: (filtertype = 9)
Planning time: 0.467 ms
Execution time: 36531.763 ms
The second subquery in your UNION takes about 15 seconds all by itself, and that could possibly be optimized separately from the rest of the query.
The sort to implement the duplicate removal implied by UNION takes about 20 seconds all by itself. It spills to disk. You could increase "work_mem" until it either stops spilling to disk, or starts using a hash rather than a sort. Of course you do need to have the RAM to back up your setting of "work_mem".
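For example, for the current session only (the value is just a placeholder, not a recommendation; size it to the RAM you can spare):
SET work_mem = '256MB';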
A third possibility would be not to treat these steps in isolation. If you had an index which would allow the data to be read from the 2nd branch of the union already in order, then it might not have to re-sort the whole thing. That would probably be an index on (filterid, filterby, filtertype).
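For example (a sketch; the column order mirrors the sort key shown in the plan):
CREATE INDEX ON echo_sm.usernotificationfilters (filterid, filterby, filtertype);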
This is a separate independent way to approach it.
I think your
WHERE NOT EXISTS (SELECT 1...
could be correctly changed to
WHERE t1.filtertype <> 9 AND NOT EXISTS (SELECT 1...
because the case where t1.filtertype=9 would filter itself out. Is that correct? If so, you could try writing it that way, as the planner is probably not smart enough to make that transformation on its own. Once you have done that, then maybe a filtered (partial) index, something like the one below, would come in useful.
create index on echo_sm.usernotificationfilters (filterid, filterby, filtertype)
where filtertype <> 9
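Putting that together, a sketch of the rewritten second branch (assuming the extra condition really is redundant for your data, as argued above):
SELECT t1.filterid, '-1' AS filterby, 9 AS filtertype
FROM echo_sm.usernotificationfilters t1
WHERE t1.filtertype <> 9
  AND NOT EXISTS (SELECT 1
                  FROM echo_sm.usernotificationfilters t2
                  WHERE t2.filtertype = 9
                    AND t2.filterid = t1.filterid);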
But, unless you get rid of or speed up that sort, there is only so much improvement you can get with other things.
It appears you only want to retrieve one record per filterid: a record with filtertype = 9 if available, or otherwise just another one, with dummy values for the other columns. This can be done by ordering by (filtertype<>9), filtertype and picking only the first row via row_number() = 1:
-- EXPLAIN ANALYZE
SELECT xx.filterid
, case(xx.filtertype) when 9 then xx.filterby ELSE '-1' END AS filterby
, 9 AS filtertype -- xx.filtertype
-- , xx.rn
FROM (
SELECT t1.filterid , t1.filterby , t1.filtertype
, row_number() OVER (PARTITION BY t1.filterid ORDER BY (filtertype<>9), filtertype ) AS rn
FROM userfilters t1
) xx
WHERE xx.rn = 1
-- ORDER BY xx.filterid, xx.rn
;
This query can be supported by an index on the same expression:
CREATE INDEX ON userfilters ( filterid , (filtertype<>9), filtertype ) ;
But, on my machine the UNION ALL version is faster (using the same index):
EXPLAIN ANALYZE
SELECT t1.filterid
, t1.filterby
, t1.filtertype
FROM userfilters t1
WHERE t1.filtertype = 9
UNION ALL
SELECT DISTINCT t1.filterid , '-1' AS filterby ,9 AS filtertype
FROM userfilters t1
WHERE NOT EXISTS (
SELECT *
FROM userfilters t2
WHERE t2.filtertype = 9 AND t2.filterid = t1.filterid
)
;
Even simpler (and faster!) is to use DISTINCT ON (), supported by the same conditional index:
-- EXPLAIN ANALYZE
SELECT DISTINCT ON (t1.filterid)
t1.filterid
, case(t1.filtertype) when 9 then t1.filterby ELSE '-1' END AS filterby
, 9 AS filtertype -- t1.filtertype
FROM userfilters t1
ORDER BY t1.filterid , (t1.filtertype<>9), t1.filtertype
;

Why Postgres does not use indexes with OR condition on 2 separate tables

I am trying to improve the performance of a SQL query on a Postgres 9.4 database. I managed to re-write the query so that it would use indexes and it is superfast now! But I do not quite understand why.
This is the original query:
SELECT DISTINCT dt.id, dt.updated_at
FROM public.day dt
INNER JOIN public.optimized_localized_day sldt ON sldt.day_id = dt.id
INNER JOIN public.day_template_place dtp ON dtp.day_template_id = dt.id
INNER JOIN public.optimized_place op ON op.geoname_id = dtp.geoname_id
WHERE
op.alternate_localized_names ILIKE unaccent('%query%') OR
lower(sldt.unaccent_title) LIKE unaccent(lower('%query%')) OR
lower(sldt.unaccent_description) LIKE unaccent(lower('%query%'))
ORDER BY dt.updated_at DESC
LIMIT 100;
I have placed 3 trigram indexes using pg_trgm on op.alternate_localized_names, lower(sldt.unaccent_title) and lower(sldt.unaccent_description).
But Postgres is not using them; instead it performs a SeqScan on the full tables to join them, as shown by EXPLAIN:
Limit
-> Unique
-> Sort
Sort Key: dt.updated_at, dt.id
-> Hash Join
Hash Cond: (sldt.day_id = dt.id)
Join Filter: ((op.alternate_localized_names ~~* unaccent('%query%'::text)) OR (lower(sldt.unaccent_title) ~~ unaccent('%query%'::text)) OR (lower(sldt.unaccent_description) ~~ unaccent('%query%'::text)))
-> Seq Scan on optimized_localized_day sldt
-> Hash
-> Hash Join
Hash Cond: (dtp.geoname_id = op.geoname_id)
-> Hash Join
Hash Cond: (dtp.day_template_id = dt.id)
-> Seq Scan on day_template_place dtp
-> Hash
-> Seq Scan on day dt
-> Hash
-> Seq Scan on optimized_place op
However, when I split the query in 2, one to search on public.optimized_localized_day and one on public.optimized_place, it now uses their indexes:
SELECT DISTINCT dt.id, dt.updated_at
FROM public.day dt
INNER JOIN public.day_template_place dtp ON dtp.day_template_id = dt.id
INNER JOIN public.optimized_place op ON op.geoname_id = dtp.geoname_id
WHERE op.alternate_localized_names ILIKE unaccent('%query%')
UNION
SELECT DISTINCT dt.id, dt.updated_at
FROM public.day dt
INNER JOIN public.optimized_localized_day sldt ON sldt.day_id = dt.id
WHERE lower(sldt.unaccent_title) LIKE unaccent(lower('%query%'))
OR lower(sldt.unaccent_description) LIKE unaccent(lower('%query%'));
And the EXPLAIN:
HashAggregate
-> Append
-> HashAggregate
-> Nested Loop
-> Nested Loop
-> Bitmap Heap Scan on optimized_place op
Recheck Cond: (alternate_localized_names ~~* unaccent('%query%'::text))
-> Bitmap Index Scan on idx_trgm_place_lower
Index Cond: (alternate_localized_names ~~* unaccent('%jericho%'::text))
-> Bitmap Heap Scan on day_template_place dtp
Recheck Cond: (geoname_id = op.geoname_id)
-> Bitmap Index Scan on day_template_place_geoname_idx
Index Cond: (geoname_id = op.geoname_id)
-> Index Scan using day_pkey on day dt
Index Cond: (id = dtp.day_template_id)
-> HashAggregate
-> Nested Loop
-> Bitmap Heap Scan on optimized_localized_day sldt
Recheck Cond: ((lower(unaccent_title) ~~ unaccent('%query%'::text)) OR (lower(unaccent_description) ~~ unaccent('%query%'::text)))
-> BitmapOr
-> Bitmap Index Scan on tgrm_idx_localized_day_title
Index Cond: (lower(unaccent_title) ~~ unaccent('%query%'::text))
-> Bitmap Index Scan on tgrm_idx_localized_day_description
Index Cond: (lower(unaccent_description) ~~ unaccent('%query%'::text))
-> Index Scan using day_pkey on day dt_1
Index Cond: (id = sldt.day_id)
From what I understand, having conditions on 2 separate tables in an OR clause causes Postgres to join the tables first and then filter them. But I am not sure about this. The second thing that puzzles me: I would like to understand how Postgres manages the filtering in the second query.
Do you guys know how Postgres handles those 2 cases?
Thanks :)
The transformation of the original query to the UNION cannot be made automatically.
Consider a simplified case:
SELECT x.a, y.b
FROM x JOIN y USING (c)
WHERE x.a = 0 OR x.b = 0;
Imagine it has three result rows:
a | b
---+---
0 | 0
1 | 0
1 | 0
If you replace this with
SELECT x.a, y.b
FROM x JOIN y USING (c)
WHERE x.a = 0
UNION
SELECT x.a, y.b
FROM x JOIN y USING (c)
WHERE y.b = 0;
the result will only have two rows, because UNION removes duplicates.
If you use UNION ALL instead, the result will have four rows, because the row with the two zeros will appear twice, once from each branch of the query.
So this transformation cannot always be made safely.
In your case, you can get away with it, because you remove duplicates anyway.
By the way: if you use UNION, you don't need the DISTINCT any more, because duplicates will be removed anyway. Your query will become cheaper if you remove the DISTINCTs.
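For example, the second query with the redundant DISTINCTs removed (everything else unchanged):
SELECT dt.id, dt.updated_at
FROM public.day dt
INNER JOIN public.day_template_place dtp ON dtp.day_template_id = dt.id
INNER JOIN public.optimized_place op ON op.geoname_id = dtp.geoname_id
WHERE op.alternate_localized_names ILIKE unaccent('%query%')
UNION
SELECT dt.id, dt.updated_at
FROM public.day dt
INNER JOIN public.optimized_localized_day sldt ON sldt.day_id = dt.id
WHERE lower(sldt.unaccent_title) LIKE unaccent(lower('%query%'))
   OR lower(sldt.unaccent_description) LIKE unaccent(lower('%query%'));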
In the second branch of your second query, PostgreSQL can handle the OR with index scans because the conditions are on the same table. In that case, PostgreSQL can perform a bitmap index scan:
The index is scanned, and PostgreSQL builds a bitmap in memory that contains 1 for each table row where the index scan results in a match and 0 otherwise.
This bitmap is ordered in the physical order of the table rows.
The same thing happens for the other condition with the other index.
The resulting bitmaps are joined with a bit-wise OR operation.
The resulting bitmap is used to fetch the matching rows from the table.
A trigram index is only a filter that can have false positive results, so the original condition has to be re-checked during that table scan.

How can I force PostgreSQL to use a certain index?

I have two Postgres indexes on my table cache, both on a jsonb column, on the fields Date and Condition.
The first one works on an immutable function, which takes the text field and transforms it into the date type.
The second one is created only on text.
So, when I tried the second one, my btree index turned into a bitmap index scan, and it somehow works slower than the first one, which takes another two steps to work but uses only an index scan.
I have two questions: why and how?
Why does the first one use only an index scan, compared with the second, which for some reason uses a bitmap? And how can I force PostgreSQL to use only the index and not the bitmap with the second index, because I don't want to use the function.
If there is another solution, then please give me hints, because I don't have permission to install packages on the server.
Function index:
create index cache_ymd_index on cache (
to_yyyymmdd_date(((data -> 'Info'::text) ->> 'Date'::text)::character varying),
((data -> 'Info'::text) ->> 'Condition'::text)
) where (((data -> 'Info'::text) ->> 'Condition'::text) = '3'::text);
Text index:
create index cache_data_index on cache (
((data -> 'Info'::text) ->> 'Date'::text),
((data -> 'Info'::text) ->> 'Condition'::text)
) where (((data -> 'Info'::text) ->> 'Condition'::text) = '3'::text);
The function itself:
create or replace function to_yyyymmdd_date(the_date character varying) returns date
immutable language sql
as
$$
select to_date(the_date, 'YYYY-MM-DD')
$$;
EXPLAIN ANALYZE for the condition with the function index:
Index Scan using cache_ymd_index on cache (cost=0.29..1422.43 rows=364 width=585) (actual time=0.065..66.842 rows=71634 loops=1)
Index Cond: ((to_yyyymmdd_date((((data -> 'Info'::text) ->> 'Date'::text))::character varying) >= '2018-01-01'::date) AND (to_yyyymmdd_date((((data -> 'Info'::text) ->> 'Date'::text))::character varying) <= '2020-12-01'::date))
Planning Time: 0.917 ms
Execution Time: 70.464 ms
EXPLAIN ANALYZE for the condition with the text index:
Bitmap Heap Scan on cache (cost=12.15..1387.51 rows=364 width=585) (actual time=53.794..87.802 rows=71634 loops=1)
Recheck Cond: ((((data -> 'Info'::text) ->> 'Date'::text) >= '2018-01-01'::text) AND (((data -> 'Info'::text) ->> 'Date'::text) <= '2020-12-01'::text) AND (((data -> 'Info'::text) ->> 'Condition'::text) = '3'::text))
Heap Blocks: exact=16465
-> Bitmap Index Scan on cache_data_index (cost=0.00..12.06 rows=364 width=0) (actual time=51.216..51.216 rows=71634 loops=1)
Index Cond: ((((data -> 'Info'::text) ->> 'Date'::text) >= '2018-01-01'::text) AND (((data -> 'Info'::text) ->> 'Date'::text) <= '2020-12-01'::text))
Planning Time: 0.247 ms
Execution Time: 90.586 ms
A “bitmap index scan” is also an index scan. It is what PostgreSQL typically chooses if a bigger percentage of the table blocks have to be visited, because it is more efficient in that case.
For an index range scan like in your case, there are two possible explanations for this:
ANALYZE ran between the creation of the two indexes, so that PostgreSQL knows about the distribution of the indexed values in the one case, but not the other.
To figure out if that was the case, run
ANALYZE cache;
and then try the two statements again. Maybe the plans are more similar now.
The statements were run on two different tables, which contain the same data, but they are physically arranged in a different way, so that the correlation is good on the one, but bad on the other. If the correlation is close to 1 or -1, an index scan becomes cheaper. Otherwise, a bitmap index scan is the best way.
Since you indicate that it is the same table in both cases, this explanation can be ruled out.
The second column of your index is superfluous; you should just omit it.
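A sketch of the function-based index without that column (the index name is made up; the partial predicate stays the same):
create index cache_ymd_only_index on cache (
    to_yyyymmdd_date(((data -> 'Info'::text) ->> 'Date'::text)::character varying)
) where (((data -> 'Info'::text) ->> 'Condition'::text) = '3'::text);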
Otherwise, your two indexes should work about the same.
Of course all that would work much better if the table was defined with a date column in the first place...
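One hedged way to get such a column without changing the application, assuming PostgreSQL 12 or later (the column and index names are made up; the generated column reuses the immutable function from the question):
ALTER TABLE cache
    ADD COLUMN info_date date
    GENERATED ALWAYS AS (to_yyyymmdd_date(((data -> 'Info'::text) ->> 'Date'::text)::character varying)) STORED;
create index cache_info_date_index on cache (info_date)
    where (((data -> 'Info'::text) ->> 'Condition'::text) = '3'::text);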

PostGIS ST_Intersects query doesn't use existing spatial index

I have a table of suburbs and each suburb has a geom value, representing its multipolygon on the map. There is another houses table where each house has a geom value of its point on the map.
Both the geom columns are indexed using gist, and the suburbs table has the name column indexed as well. The suburbs table has 8k+ records while the houses table has 300k+ records.
Now my task is to find all houses within a suburb named 'FOO'.
QUERY #1:
SELECT * FROM houses WHERE ST_INTERSECTS((SELECT geom FROM "suburbs" WHERE "suburb_name" = 'FOO'), geom);
Query Plan Result:
Seq Scan on houses (cost=8.29..86327.26 rows=102365 width=136)
Filter: st_intersects($0, geom)
InitPlan 1 (returns $0)
-> Index Scan using suburbs_suburb_name on suburbs (cost=0.28..8.29 rows=1 width=32)
Index Cond: ((suburb_name)::text = 'FOO'::text)
running the query took ~3.5s, returning 486 records.
QUERY #2: (prefixing the ST_INTERSECTS function with _ to explicitly ask it not to use the index)
SELECT * FROM houses WHERE _ST_INTERSECTS((SELECT geom FROM "suburbs" WHERE "suburb_name" = 'FOO'), geom);
Query Plan Result: (exactly the same as Query #1)
Seq Scan on houses (cost=8.29..86327.26 rows=102365 width=136)
Filter: st_intersects($0, geom)
InitPlan 1 (returns $0)
-> Index Scan using suburbs_suburb_name on suburbs (cost=0.28..8.29 rows=1 width=32)
Index Cond: ((suburb_name)::text = 'FOO'::text)
running the query took ~1.7s, returning 486 records.
QUERY #3: (Using && operator to add a boundary box overlap check before the ST_Intersects function)
SELECT * FROM houses WHERE (geom && (SELECT geom FROM "suburbs" WHERE "suburb_name" = 'FOO')) AND ST_INTERSECTS((SELECT geom FROM "suburbs" WHERE "suburb_name" = 'FOO'), geom);
Query Plan Result:
Bitmap Heap Scan on houses (cost=21.11..146.81 rows=10 width=136)
Recheck Cond: (geom && $0)
Filter: st_intersects($1, geom)
InitPlan 1 (returns $0)
-> Index Scan using suburbs_suburb_name on suburbs (cost=0.28..8.29 rows=1 width=32)
Index Cond: ((suburb_name)::text = 'FOO'::text)
InitPlan 2 (returns $1)
-> Index Scan using suburbs_suburb_name on suburbs suburbs_1 (cost=0.28..8.29 rows=1 width=32)
Index Cond: ((suburb_name)::text = 'FOO'::text)
-> Bitmap Index Scan on houses_geom_gist (cost=0.00..4.51 rows=31 width=0)
Index Cond: (geom && $0)
running the query took 0.15s, returning 486 records.
Apparently only query #3 benefits from the spatial index, which improves the performance significantly. However, the syntax is ugly and repeats itself to some extent. My questions are:
Why is PostGIS not smart enough to use the spatial index in query #1?
Why does query #2 have (much) better performance compared to query #1, considering neither of them uses the index?
Any suggestions to make query #3 prettier? Or is there a better way to construct a query to do the same thing?
Try flattening the query into one query, without unnecessary sub-queries:
SELECT houses.*
FROM houses, suburbs
WHERE suburbs.suburb_name = 'FOO' AND ST_Intersects(houses.geom, suburbs.geom);