I have a table of suburbs and each suburb has a geom value, representing its multipolygon on the map. There is another houses table where each house has a geom value of its point on the map.
Both geom columns are indexed using GiST, and the suburbs table has the suburb_name column indexed as well. The suburbs table has 8k+ records, while the houses table has 300k+ records.
Now my task is to find all houses within a suburb named 'FOO'.
QUERY #1:
SELECT * FROM houses WHERE ST_INTERSECTS((SELECT geom FROM "suburbs" WHERE "suburb_name" = 'FOO'), geom);
Query Plan Result:
Seq Scan on houses (cost=8.29..86327.26 rows=102365 width=136)
Filter: st_intersects($0, geom)
InitPlan 1 (returns $0)
-> Index Scan using suburbs_suburb_name on suburbs (cost=0.28..8.29 rows=1 width=32)
Index Cond: ((suburb_name)::text = 'FOO'::text)
running the query took ~3.5s, returning 486 records.
QUERY #2: (prefix ST_INTERSECTS function with _ to explicitly ask it not to use index)
SELECT * FROM houses WHERE _ST_INTERSECTS((SELECT geom FROM "suburbs" WHERE "suburb_name" = 'FOO'), geom);
Query Plan Result: (exactly the same as Query #1)
Seq Scan on houses (cost=8.29..86327.26 rows=102365 width=136)
Filter: st_intersects($0, geom)
InitPlan 1 (returns $0)
-> Index Scan using suburbs_suburb_name on suburbs (cost=0.28..8.29 rows=1 width=32)
Index Cond: ((suburb_name)::text = 'FOO'::text)
running the query took ~1.7s, returning 486 records.
QUERY #3: (Using && operator to add a boundary box overlap check before the ST_Intersects function)
SELECT * FROM houses WHERE (geom && (SELECT geom FROM "suburbs" WHERE "suburb_name" = 'FOO')) AND ST_INTERSECTS((SELECT geom FROM "suburbs" WHERE "suburb_name" = 'FOO'), geom);
Query Plan Result:
Bitmap Heap Scan on houses (cost=21.11..146.81 rows=10 width=136)
Recheck Cond: (geom && $0)
Filter: st_intersects($1, geom)
InitPlan 1 (returns $0)
-> Index Scan using suburbs_suburb_name on suburbs (cost=0.28..8.29 rows=1 width=32)
Index Cond: ((suburb_name)::text = 'FOO'::text)
InitPlan 2 (returns $1)
-> Index Scan using suburbs_suburb_name on suburbs suburbs_1 (cost=0.28..8.29 rows=1 width=32)
Index Cond: ((suburb_name)::text = 'FOO'::text)
-> Bitmap Index Scan on houses_geom_gist (cost=0.00..4.51 rows=31 width=0)
Index Cond: (geom && $0)
running the query took 0.15s, returning 486 records.
Apparently only query #3 benefits from the spatial index, which improves performance significantly. However, the syntax is ugly and repeats itself to some extent. My questions are:
Why is PostGIS not smart enough to use the spatial index in query #1?
Why does query #2 perform (much) better than query #1, considering that neither uses an index?
Any suggestions to make query #3 prettier? Or is there a better way to construct a query that does the same thing?
Try flattening the query into one query, without unnecessary sub-queries:
SELECT houses.*
FROM houses, suburbs
WHERE suburbs.suburb_name = 'FOO' AND ST_Intersects(houses.geom, suburbs.geom);
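To see why query #3 is so much faster than a plain filter, here is a minimal Python sketch of the two-phase filtering it performs: a cheap bounding-box test (the role of &&) narrows the candidates, and the exact geometry test (the role of _ST_Intersects) runs only on those. The triangle, the point list and all helper names are made up for illustration. PostGIS answers the bounding-box step from the GiST index rather than by scanning a list, and its exact test handles far more than the simple ray casting used here.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def bounding_box(polygon: List[Point]) -> Tuple[float, float, float, float]:
    """Return (min_x, min_y, max_x, max_y) of the polygon's vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def in_box(p: Point, box: Tuple[float, float, float, float]) -> bool:
    """Cheap test, analogous to the && bounding-box overlap check."""
    min_x, min_y, max_x, max_y = box
    return min_x <= p[0] <= max_x and min_y <= p[1] <= max_y


def in_polygon(p: Point, polygon: List[Point]) -> bool:
    """Exact ray-casting test; boundary points get no special treatment."""
    x, y = p
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal line through p?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def houses_in_suburb(houses: List[Point], suburb: List[Point]):
    """Return the matching houses and how many exact tests were needed."""
    box = bounding_box(suburb)
    candidates = [h for h in houses if in_box(h, box)]           # cheap phase
    matches = [h for h in candidates if in_polygon(h, suburb)]   # exact phase
    return matches, len(candidates)


# Hypothetical data: a triangular "suburb" and five "houses".
suburb = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
houses = [(1.0, 1.0), (3.5, 3.5), (10.0, 10.0), (0.5, 2.0), (-5.0, 1.0)]

matches, exact_tests = houses_in_suburb(houses, suburb)
brute_force = [h for h in houses if in_polygon(h, suburb)]
```

Both approaches return the same houses, but the two-phase version runs the expensive test on 3 of the 5 points instead of all of them. With 300k+ houses and a small suburb, that ratio is what separates the bitmap scan from the sequential scan.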
I have a table named "k3_order" with jsonb column "json_delivery".
Example content of that column is:
{
"delivery_cost": "11.99",
"packageNumbers": [
"0000000596034Q"
]
}
I've created index on json_delivery->'packageNumbers':
CREATE INDEX test_idx ON k3_order USING gin(json_delivery->'packageNumbers');
Now I use these two SQL queries:
SELECT id, delivery_method_id
FROM k3_order
WHERE jsonb_exists (json_delivery->'packageNumbers', '0000000596034Q');
SELECT id, delivery_method_id
FROM k3_order
WHERE json_delivery->'packageNumbers' ? '0000000596034Q';
The second is faster and uses the index, but the first does not.
Is there any way to create index in PostgreSQL 10.4 in order for query 1) to use it?
Is this even possible in PostgreSQL 10.4 or newer versions?
EXPLAIN ANALYZE SELECT id, delivery_method_id
FROM k3_order
WHERE jsonb_exists (json_delivery->'packageNumbers', '0000000596034Q');
produces:
Seq Scan on k3_order (cost=0.00..117058.10 rows=216847 width=8) (actual time=162.001..569.863 rows=1 loops=1)
Filter: jsonb_exists((json_delivery -> 'packageNumbers'::text), '0000000596034Q'::text)
Rows Removed by Filter: 650539
Planning time: 0.748 ms
Execution time: 569.886 ms
EXPLAIN ANALYZE SELECT id, delivery_method_id
FROM k3_order
WHERE json_delivery->'packageNumbers' ? '0000000596034Q';
produces:
Bitmap Heap Scan on k3_order (cost=21.04..2479.03 rows=651 width=8) (actual time=0.022..0.022 rows=1 loops=1)
Recheck Cond: ((json_delivery -> 'packageNumbers'::text) ? '0000000596034Q'::text)
Heap Blocks: exact=1
-> Bitmap Index Scan on test_idx (cost=0.00..20.88 rows=651 width=0) (actual time=0.016..0.016 rows=1 loops=1)
Index Cond: ((json_delivery -> 'packageNumbers'::text) ? '0000000596034Q'::text)
Planning time: 0.182 ms
Execution time: 0.050 ms
Indexes can only be used by queries in the following cases:
the WHERE condition contains an expression of the form <indexed expression> <operator> <constant>, where
an index has been created on <indexed expression>
<operator> is an operator in the index family of the operator class of the index
<constant> is an expression that stays constant for the duration of the index scan
the ORDER BY clause has the same or the exact opposite ordering as the index definition, and the index access method supports sorting (from v13 on, an index can also be used if it contains the starting columns of the ORDER BY clause)
the PostgreSQL version is v12 or higher, and the WHERE condition contains an expression of the form bool_func(...), where the function returns boolean and has a planner support function.
Now json_delivery->'packageNumbers' ? '0000000596034Q' satisfies the first condition, so an index scan can be used.
jsonb_exists(json_delivery->'packageNumbers', '0000000596034Q') could only use an index if there were a planner support function for jsonb_exists, but there is none:
SELECT prosupport FROM pg_proc
WHERE proname = 'jsonb_exists';
prosupport
════════════
-
(1 row)
I have a table learners with around 3.2 million rows. It contains user-related information such as name and email. I need to optimize some queries that use ORDER BY on some columns. For testing, I created a temp_learners table with 0.8 million rows and two indexes on it:
CREATE UNIQUE INDEX "temp_learners_companyId_userId_idx"
ON temp_learners ("companyId" ASC, "userId" ASC, "learnerUserName" ASC, "learnerEmailId" ASC);
and
CREATE INDEX temp_learners_company_name_email_index
ON temp_learners ("companyId", "learnerUserName", "learnerEmailId");
The second index is just for testing.
Now when I run this query:
SELECT *
FROM temp_learners
WHERE "companyId" = 909666665757230431 AND "userId" IN (
4990609084216745771,
4990610022492247987,
4990609742667096366,
4990609476136523663,
5451985767018841230,
5451985767078553638,
5270390122102920730,
4763688819142650938,
5056979692501246449,
5279569274741647114,
5031660827132289520,
4862889373349389098,
5299864070077160421,
4740222596778406913,
5320170488686569878,
5270367618320474818,
5320170488587895729,
5228888485293847415,
4778050469432720821,
5270392314970177842,
4849087862439244546,
5270392117430427860,
5270351184072717902,
5330263074228870897,
4763688829301614114,
4763684609695916489,
5270390232949727716
) ORDER BY "learnerUserName","learnerEmailId";
The query plan used by db is this:
Sort (cost=138.75..138.76 rows=4 width=1581) (actual time=0.169..0.171 rows=27 loops=1)
Sort Key: "learnerUserName", "learnerEmailId"
Sort Method: quicksort Memory: 73kB
-> Index Scan using "temp_learners_companyId_userId_idx" on temp_learners (cost=0.55..138.71 rows=4 width=1581) (actual time=0.018..0.112 rows=27 loops=1)
Index Cond: (("companyId" = '909666665757230431'::bigint) AND ("userId" = ANY ('{4990609084216745771,4990610022492247987,4990609742667096366,4990609476136523663,5451985767018841230,5451985767078553638,5270390122102920730,4763688819142650938,5056979692501246449,5279569274741647114,5031660827132289520,4862889373349389098,5299864070077160421,4740222596778406913,5320170488686569878,5270367618320474818,5320170488587895729,5228888485293847415,4778050469432720821,5270392314970177842,4849087862439244546,5270392117430427860,5270351184072717902,5330263074228870897,4763688829301614114,4763684609695916489,5270390232949727716}'::bigint[])))
Planning time: 0.116 ms
Execution time: 0.191 ms
Here the index is not used for sorting.
But when I run this query
SELECT *
FROM temp_learners
WHERE "companyId" = 909666665757230431
ORDER BY "learnerUserName","learnerEmailId" limit 500;
This query does use an index for sorting:
Limit (cost=0.42..1360.05 rows=500 width=1581) (actual time=0.018..0.477 rows=500 loops=1)
-> Index Scan using temp_learners_company_name_email_index on temp_learners (cost=0.42..332639.30 rows=122327 width=1581) (actual time=0.018..0.442 rows=500 loops=1)
Index Cond: ("companyId" = '909666665757230431'::bigint)
Planning time: 0.093 ms
Execution time: 0.513 ms
What I don't understand is why Postgres does not use an index for sorting in the first query. I should also clarify that the normal use case for the learners table is to join it with other tables, so the first query resembles what a join does. For example:
SELECT *
FROM temp_learners AS l
INNER JOIN entity_learners_basic AS elb
ON l."companyId" = elb."companyId" AND l."userId" = elb."userId"
WHERE l."companyId" = 909666665757230431 AND elb."gameId" = 1050403501267716928
ORDER BY "learnerUserName", "learnerEmailId" limit 5000;
Even after correcting the indexes, the query plan does not use an index for sorting.
QUERY PLAN
Limit (cost=3785.11..3785.22 rows=44 width=1767) (actual time=163.554..173.135 rows=5000 loops=1)
-> Sort (cost=3785.11..3785.22 rows=44 width=1767) (actual time=163.553..172.791 rows=5000 loops=1)
Sort Key: l."learnerUserName", l."learnerEmailId"
Sort Method: external merge Disk: 35416kB
-> Nested Loop (cost=1.12..3783.91 rows=44 width=1767) (actual time=0.019..63.743 rows=21195 loops=1)
-> Index Scan using primary_index__entity_learners_basic on entity_learners_basic elb (cost=0.57..1109.79 rows=314 width=186) (actual time=0.010..6.221 rows=21195 loops=1)
Index Cond: (("companyId" = '909666665757230431'::bigint) AND ("gameId" = '1050403501267716928'::bigint))
-> Index Scan using "temp_learners_companyId_userId_idx" on temp_learners l (cost=0.55..8.51 rows=1 width=1581) (actual time=0.002..0.002 rows=1 loops=21195)
Index Cond: (("companyId" = '909666665757230431'::bigint) AND ("userId" = elb."userId"))
Planning time: 0.309 ms
Execution time: 178.422 ms
Does Postgres not use indexes when joining and ordering data?
PostgreSQL chooses the plan it thinks will be faster. Using the index that provides rows in the correct order means using a much less selective index, so it doesn't think that will be faster overall.
If you want to force PostgreSQL into believing that sorting is the worst thing in the world, you could set enable_sort=off. If it still sorts after that, then you know PostgreSQL doesn't have the right indexes to avoid sorting, as opposed to just thinking they will not actually be faster.
PostgreSQL could use an index on ("companyId", "learnerUserName", "learnerEmailId") for your first query, but the additional IN condition reduces the number of result rows to an estimated 4 rows, which means that the sort won't cost anything at all. So it chooses to use the index that can support the IN condition.
Rows returned with that index won't be in the correct order automatically, because
you specified DESC for the last index column, but ASC to the preceding one
you have more than one element in the IN list.
Without the IN condition, enough rows are returned, so that PostgreSQL thinks that it is cheaper to order by the index and filter out rows that don't match the condition.
With your first query, it is impossible to have an index that supports both the IN list in the WHERE condition and the ORDER BY clause, so PostgreSQL has to make a choice.
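The conflict between the IN list and the ORDER BY can be sketched with a toy model in Python. A sorted list of tuples stands in for a B-tree on ("companyId", "userId", "learnerUserName"), and one range lookup per IN-list element stands in for the index scan. The rows and function names are invented for illustration, and this is only an approximation of what the executor does.

```python
from bisect import bisect_left

# Hypothetical rows: (companyId, userId, learnerUserName)
rows = [
    (1, 10, "zoe"),
    (1, 20, "adam"),
    (1, 30, "mia"),
    (2, 40, "bob"),
]

# A B-tree on (companyId, userId, learnerUserName) keeps entries in this order.
index = sorted(rows)


def index_scan(company_id, user_ids):
    """Fetch rows for one company and several user ids, one range per id."""
    fetched = []
    for user_id in sorted(user_ids):
        start = bisect_left(index, (company_id, user_id))
        stop = bisect_left(index, (company_id, user_id + 1))
        fetched.extend(index[start:stop])
    return fetched


fetched = index_scan(1, [10, 20, 30])
names = [row[2] for row in fetched]   # order in which the scan returns names
names_sorted = sorted(names)          # order that ORDER BY asks for
```

The scan returns the names grouped by userId ("zoe", "adam", "mia"), which is not the order requested by ORDER BY, so a separate sort step is still needed. With only a handful of rows that sort is cheap, which is why the planner prefers the selective index.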
I have defined the following index:
CREATE INDEX
users_search_idx
ON
auth_user
USING
gin(
username gin_trgm_ops,
first_name gin_trgm_ops,
last_name gin_trgm_ops
);
I am performing the following query:
PREPARE user_search (TEXT, INT) AS
SELECT
username,
email,
first_name,
last_name,
( -- would probably do per-field weightings here
s_username + s_first_name + s_last_name
) rank
FROM
auth_user,
similarity(username, $1) s_username,
similarity(first_name, $1) s_first_name,
similarity(last_name, $1) s_last_name
WHERE
username % $1 OR
first_name % $1 OR
last_name % $1
ORDER BY
rank DESC
LIMIT $2;
The auth_user table has 6.2 million rows.
The speed of the query seems to depend very heavily on the number of results potentially returned by the similarity query.
Increasing the similarity threshold via set_limit helps, but reduces usefulness of results by eliminating partial matches.
Some searches return in 200ms, others take ~ 10 seconds.
We have an existing implementation of this feature using Elasticsearch that returns in < 200ms for any query, while doing more complicated (better) ranking.
I would like to know if there is any way we could improve this to get more consistent performance?
It's my understanding that GIN index (inverted index) is the same basic approach used by Elasticsearch so I would have thought there is some optimization possible.
An EXPLAIN ANALYZE EXECUTE user_search('mel', 20) shows:
Limit (cost=54099.81..54099.86 rows=20 width=52) (actual time=10302.092..10302.104 rows=20 loops=1)
-> Sort (cost=54099.81..54146.66 rows=18739 width=52) (actual time=10302.091..10302.095 rows=20 loops=1)
Sort Key: (((s_username.s_username + s_first_name.s_first_name) + s_last_name.s_last_name)) DESC
Sort Method: top-N heapsort Memory: 26kB
-> Nested Loop (cost=382.74..53601.17 rows=18739 width=52) (actual time=118.164..10293.765 rows=8380 loops=1)
-> Nested Loop (cost=382.74..53132.69 rows=18739 width=56) (actual time=118.150..10262.804 rows=8380 loops=1)
-> Nested Loop (cost=382.74..52757.91 rows=18739 width=52) (actual time=118.142..10233.990 rows=8380 loops=1)
-> Bitmap Heap Scan on auth_user (cost=382.74..52383.13 rows=18739 width=48) (actual time=118.128..10186.816 rows=8380 loops=1)
Recheck Cond: (((username)::text % 'mel'::text) OR ((first_name)::text % 'mel'::text) OR ((last_name)::text % 'mel'::text))
Rows Removed by Index Recheck: 2434523
Heap Blocks: exact=49337 lossy=53104
-> BitmapOr (cost=382.74..382.74 rows=18757 width=0) (actual time=107.436..107.436 rows=0 loops=1)
-> Bitmap Index Scan on users_search_idx (cost=0.00..122.89 rows=6252 width=0) (actual time=40.200..40.200 rows=88908 loops=1)
Index Cond: ((username)::text % 'mel'::text)
-> Bitmap Index Scan on users_search_idx (cost=0.00..122.89 rows=6252 width=0) (actual time=43.847..43.847 rows=102028 loops=1)
Index Cond: ((first_name)::text % 'mel'::text)
-> Bitmap Index Scan on users_search_idx (cost=0.00..122.89 rows=6252 width=0) (actual time=23.387..23.387 rows=58740 loops=1)
Index Cond: ((last_name)::text % 'mel'::text)
-> Function Scan on similarity s_username (cost=0.00..0.01 rows=1 width=4) (actual time=0.004..0.004 rows=1 loops=8380)
-> Function Scan on similarity s_first_name (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=8380)
-> Function Scan on similarity s_last_name (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=8380)
Execution time: 10302.559 ms
Server is Postgres 9.6.1 running on Amazon RDS
update
1. Shortly after posting the question I found this info: https://www.postgresql.org/message-id/464F3C5D.2000700#enterprisedb.com
So I tried
-> SHOW work_mem;
4MB
-> SET work_mem='12MB';
-> EXECUTE user_search('mel', 20);
(results returned in ~1.5s)
This made a big improvement (previously > 10s)!
1.5s is still way slower than ES for similar query so I would still like to hear any suggestions for optimising the query.
2. In response to comments, and after seeing this question (Postgresql GIN index slower than GIST for pg_trgm), I tried exactly the same setup with a GiST index in place of the GIN one.
Trying the same search above, it returned in ~3.5s, using default work_mem='4MB'. Increasing work_mem made no difference.
From this I conclude that the GiST index is more memory-efficient (it did not hit a pathological case like GIN did) but is slower than GIN when GIN is working properly. This is in line with what's described in the docs, which recommend a GIN index.
3. I still don't understand why so much time is spent in:
-> Bitmap Heap Scan on auth_user (cost=382.74..52383.13 rows=18739 width=48) (actual time=118.128..10186.816 rows=8380 loops=1)
Recheck Cond: (((username)::text % 'mel'::text) OR ((first_name)::text % 'mel'::text) OR ((last_name)::text % 'mel'::text))
Rows Removed by Index Recheck: 2434523
Heap Blocks: exact=49337 lossy=53104
I don't understand why this step is needed or what it's doing.
There are the three Bitmap Index Scan beneath it for each of the username % $1 clauses... these results then get combined with a BitmapOr step. These parts are all quite fast.
But even in the case where we don't run out of work mem, we still spend nearly a whole second in Bitmap Heap Scan.
I expect much faster results with this approach:
1. Create a GiST index with 1 column holding concatenated values:
CREATE INDEX users_search_idx ON auth_user
USING gist((username || ' ' || first_name || ' ' || last_name) gist_trgm_ops);
This assumes all 3 columns to be defined NOT NULL (you did not specify). Else you need to do more.
Why not simplify with concat_ws()?
Combine two columns and add into one new column
Faster query with pattern-matching on multiple text fields
2. Use a proper nearest-neighbor query, matching above index:
SELECT username, email, first_name, last_name
, similarity(username , $1) AS s_username
, similarity(first_name, $1) AS s_first_name
, similarity(last_name , $1) AS s_last_name
, row_number() OVER () AS rank -- greatest similarity first
FROM auth_user
WHERE (username || ' ' || first_name || ' ' || last_name) % $1 -- !!
ORDER BY (username || ' ' || first_name || ' ' || last_name) <-> $1 -- !!
LIMIT $2;
Expressions in WHERE and ORDER BY must match index expression!
In particular ORDER BY rank (like you had it) will always perform poorly for a small LIMIT picking from a much larger pool of qualifying rows, because it cannot use an index directly: The sophisticated expression behind rank has to be calculated for every qualifying row, then all have to be sorted before the small selection of best matches can be returned. This is much, much more expensive than a true nearest-neighbor query that can pick the best results from the index directly without even looking at the rest.
row_number() with empty window definition just reflects the ordering produced by the ORDER BY of the same SELECT.
Related answers:
Best index for similarity function
Search in 300 million addresses with pg_trgm
As for your item 3., I added an answer to the question you referenced, that should explain it:
PostgreSQL GIN index slower than GIST for pg_trgm?
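For intuition about what the % operator has to recheck, here is a rough Python sketch of how pg_trgm scores similarity: each word is lower-cased, padded with two leading spaces and one trailing space, and cut into three-character windows; the score is the number of shared trigrams divided by the number of distinct trigrams in either string. The word splitting below is an ASCII-only approximation, and the function names are mine, not the extension's.

```python
import re


def trigrams(text: str) -> set:
    """Approximate pg_trgm's trigram set for a string (ASCII words only)."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "   # two leading spaces, one trailing space
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by distinct trigrams across both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    shared = len(ta & tb)
    return shared / (len(ta) + len(tb) - shared)


# 'mel' yields only four trigrams, so many names share some of them.
mel_trigrams = trigrams("mel")
score_melanie = similarity("mel", "melanie")   # 3 shared / 9 distinct
score_docs = similarity("word", "two words")   # 4 shared / 11 distinct
```

A three-letter search term has so few trigrams that a large part of a 6.2-million-row table shares at least one of them. The index then hands back many candidate rows, and each must be fetched and rechecked against the threshold (0.3 by default), which is consistent with the large "Rows Removed by Index Recheck" figure in the plan.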
I have a table with about 7.5 million records and am trying to implement an autocomplete form based on said table, but the performance is pretty bad.
The schema (irrelevant fields omitted) is as follows:
COMPANIES
---------
sid (integer primary key)
world_hq_sid (integer)
name (varchar(64))
marketing_alias (varchar(64))
address_country_code (char(4))
address_state (varchar(64))
sort_order integer
search_weight integer
annual_sales integer
The fields passed in are the optional country_code and state, along with a search term. What I want is for the search term to match (case insensitive) the beginning of either name or marketing_alias. I want the top ten results, with those results that also match country and state at the top, then country only, then no state/country match. After that, I want the results sorted by sort_order.
Also, I only want one match per world_hq_sid. Finally, when I have the top match per world_hq_sid, I want the final results to be sorted by search_weight.
I'm using a window query to achieve the world_hq_sid part. Here is the query:
SELECT * FROM (
SELECT ROW_NUMBER() OVER (PARTITION BY world_hq_sid ORDER BY CASE WHEN address_country_code = 'US' AND address_state = 'CA' THEN 2 WHEN address_country_code = 'US' THEN 1 ELSE 0 END desc, sort_order asc) AS r,
companies.*
FROM companies
WHERE ((upper(name) LIKE upper('co%')) OR (upper(marketing_alias) LIKE upper('co%')))
) x
WHERE x.r = 1
ORDER BY CASE WHEN address_country_code = 'US' AND address_state = 'CA' THEN 2 WHEN address_state = 'CA' THEN 1 ELSE 0 END desc, search_weight asc, annual_sales desc
LIMIT 10;
I have normal btree indexes on address_state, address_country_code, world_hq_sid, sort_order, and search_weight.
I have the following indexes on the name and marketing_alias fields:
CREATE INDEX companies_alias_pattern_upper_idx ON companies(upper(marketing_alias) varchar_pattern_ops);
CREATE INDEX companies_name_pattern_upper_idx ON companies(upper(name) varchar_pattern_ops);
And here is the explain analyze when I pass CA as the state and 'co' as the search term
Limit (cost=676523.01..676523.03 rows=10 width=939) (actual time=18695.686..18695.687 rows=10 loops=1)
-> Sort (cost=676523.01..676526.67 rows=1466 width=939) (actual time=18695.686..18695.687 rows=10 loops=1)
Sort Key: x.search_weight, x.annual_sales
Sort Method: top-N heapsort Memory: 30kB
-> Subquery Scan on x (cost=665492.58..676491.33 rows=1466 width=939) (actual time=18344.715..18546.830 rows=151527 loops=1)
Filter: (x.r = 1)
Rows Removed by Filter: 20672
-> WindowAgg (cost=665492.58..672825.08 rows=293300 width=931) (actual time=18344.710..18511.625 rows=172199 loops=1)
-> Sort (cost=665492.58..666225.83 rows=293300 width=931) (actual time=18344.702..18359.145 rows=172199 loops=1)
Sort Key: companies.world_hq_sid, (CASE WHEN ((companies.address_state)::text = 'CA'::text) THEN 1 ELSE 0 END), companies.sort_order
Sort Method: quicksort Memory: 108613kB
-> Bitmap Heap Scan on companies (cost=17236.64..518555.98 rows=293300 width=931) (actual time=1861.665..17999.806 rows=172199 loops=1)
Recheck Cond: ((upper((name)::text) ~~ 'CO%'::text) OR (upper((marketing_alias)::text) ~~ 'CO%'::text))
Filter: ((upper((name)::text) ~~ 'CO%'::text) OR (upper((marketing_alias)::text) ~~ 'CO%'::text))
-> BitmapOr (cost=17236.64..17236.64 rows=196219 width=0) (actual time=1829.061..1829.061 rows=0 loops=1)
-> Bitmap Index Scan on companies_name_pattern_upper_idx (cost=0.00..8987.98 rows=97772 width=0) (actual time=971.331..971.331 rows=169390 loops=1)
Index Cond: ((upper((name)::text) ~>=~ 'CO'::text) AND (upper((name)::text) ~<~ 'CP'::text))
-> Bitmap Index Scan on companies_alias_pattern_upper_idx (cost=0.00..8102.02 rows=98447 width=0) (actual time=857.728..857.728 rows=170616 loops=1)
Index Cond: ((upper((marketing_alias)::text) ~>=~ 'CO'::text) AND (upper((marketing_alias)::text) ~<~ 'CP'::text))
I've bumped work_mem and shared_buffers to 100M.
As you can see, this query returns in 18 seconds. What is odd is that the results are all over the board for different starting characters, from 400ms (acceptable) to 30 seconds (very not acceptable). Postgres gurus, my question is, am I just expecting too much of postgresql to perform such a query quickly consistently? Is there a way I can speed this up?
select *
from (
select distinct on (world_hq_sid)
world_hq_sid,
(address_country_code = 'US')::int + (address_state = 'CA')::int address_weight,
sort_order,
search_weight, annual_sales,
sid, name, marketing_alias,
address_country_code, address_state
from companies
where
upper(name) LIKE upper('co%')
OR upper(marketing_alias) LIKE upper('co%')
order by 1, 2 desc, 3
) s
order by
address_weight desc,
search_weight,
annual_sales desc
limit 10
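The inner query relies on DISTINCT ON keeping only the first row of each world_hq_sid group after sorting. As a minimal sketch of that semantics, the Python below sorts by the same three keys and keeps the first row per group. The sample rows are invented, and the sketch assumes no NULLs (in SQL, casting a NULL comparison to int yields NULL, which this toy ignores).

```python
from itertools import groupby

# Hypothetical rows from the companies table.
companies = [
    {"sid": 1, "world_hq_sid": 100, "address_country_code": "US",
     "address_state": "NY", "sort_order": 1},
    {"sid": 2, "world_hq_sid": 100, "address_country_code": "US",
     "address_state": "CA", "sort_order": 5},
    {"sid": 3, "world_hq_sid": 100, "address_country_code": "US",
     "address_state": "CA", "sort_order": 3},
    {"sid": 4, "world_hq_sid": 200, "address_country_code": "DE",
     "address_state": "BE", "sort_order": 1},
]


def address_weight(company: dict) -> int:
    """Mirror of (country = 'US')::int + (state = 'CA')::int."""
    return (int(company["address_country_code"] == "US")
            + int(company["address_state"] == "CA"))


# ORDER BY world_hq_sid, address_weight DESC, sort_order
ordered = sorted(
    companies,
    key=lambda c: (c["world_hq_sid"], -address_weight(c), c["sort_order"]),
)

# DISTINCT ON (world_hq_sid): keep the first row of every group.
best_per_hq = [
    next(group)
    for _, group in groupby(ordered, key=lambda c: c["world_hq_sid"])
]
best_sids = [c["sid"] for c in best_per_hq]
```

For headquarters 100 the row with sid 3 wins: it ties with sid 2 on address weight and has the lower sort_order. This is the same row the ROW_NUMBER() ... WHERE r = 1 formulation would keep.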
For autocomplete, you can use trigram search, provided by the pg_trgm module:
CREATE EXTENSION pg_trgm;
ALTER TABLE companies ADD COLUMN name_trgm TEXT NULL;
UPDATE companies SET name_trgm = UPPER(name);
CREATE INDEX companies_name_trgm_gin_idx ON companies USING GIN (name_trgm gin_trgm_ops);
I'd like to query for a (list of) values or NULL without using OR. The reason for avoiding OR is that I need the query to use an index on that field.
A simple example to illustrate my question:
CREATE TABLE fruits
(
name text,
quantity integer
);
(The real table has lots of additional integer columns.)
The query that I'm not happy with is
SELECT * FROM fruits WHERE quantity IN (1,2,3,4) OR quantity IS NULL;
The query I'm hoping for would be something like
SELECT * FROM fruits WHERE quantity MAGIC (1,2,3,4,NULL);
I'm using Postgresql 9.1.
As far as I can tell from the docs (e.g. http://www.postgresql.org/docs/9.1/static/functions-comparisons.html) and tests there is no way to do this. But I'm hoping one of you has some magic insight.
Test table with 10k rows:
create table fruits (name text, quantity integer);
insert into fruits (name, quantity)
select left(md5(i::text), 6), i
from generate_series(1, 10000) s(i);
With plain index on quantity:
create index fruits_index on fruits(quantity);
analyze fruits;
The query with or:
explain analyze
SELECT * FROM fruits WHERE quantity IN (1,2,3,4) OR quantity IS NULL;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on fruits (cost=21.29..34.12 rows=4 width=11) (actual time=0.032..0.032 rows=4 loops=1)
Recheck Cond: ((quantity = ANY ('{1,2,3,4}'::integer[])) OR (quantity IS NULL))
-> BitmapOr (cost=21.29..21.29 rows=4 width=0) (actual time=0.025..0.025 rows=0 loops=1)
-> Bitmap Index Scan on fruits_index (cost=0.00..17.03 rows=4 width=0) (actual time=0.019..0.019 rows=4 loops=1)
Index Cond: (quantity = ANY ('{1,2,3,4}'::integer[]))
-> Bitmap Index Scan on fruits_index (cost=0.00..4.26 rows=1 width=0) (actual time=0.004..0.004 rows=0 loops=1)
Index Cond: (quantity IS NULL)
Total runtime: 0.089 ms
Without or:
explain analyze
SELECT * FROM fruits WHERE quantity IN (1,2,3,4);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Index Scan using fruits_index on fruits (cost=0.00..21.07 rows=4 width=11) (actual time=0.026..0.038 rows=4 loops=1)
Index Cond: (quantity = ANY ('{1,2,3,4}'::integer[]))
Total runtime: 0.085 ms
The coalesce version proposed by wildplasser leads to a sequential scan:
explain analyze
SELECT *
FROM fruits
WHERE COALESCE(quantity, -1) IN (-1,1,2,3,4);
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Seq Scan on fruits (cost=0.00..217.50 rows=250 width=11) (actual time=0.023..4.358 rows=4 loops=1)
Filter: (COALESCE(quantity, (-1)) = ANY ('{-1,1,2,3,4}'::integer[]))
Rows Removed by Filter: 9996
Total runtime: 4.395 ms
Unless a coalesce expression index is created:
create index fruits_coalesce_index on fruits(coalesce(quantity, -1));
analyze fruits;
explain analyze
SELECT *
FROM fruits
WHERE COALESCE(quantity, -1) IN (-1,1,2,3,4);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Index Scan using fruits_coalesce_index on fruits (cost=0.00..25.34 rows=5 width=11) (actual time=0.112..0.124 rows=4 loops=1)
Index Cond: (COALESCE(quantity, (-1)) = ANY ('{-1,1,2,3,4}'::integer[]))
Total runtime: 0.172 ms
But it is still worse than the plain or query with a plain index on quantity.
Ugly hack with COALESCE:
SELECT *
FROM fruits
WHERE COALESCE(quantity,1) IN (1,2,3,4)
;
Please check the resulting plan. IIRC, the optimiser knows about COALESCE() in cases like this.
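As a quick sanity check of the hack, this small Python sketch compares the two predicates on a list in which None stands in for NULL. It only models which rows each predicate selects, not how the planner treats them, and the helper names are hypothetical.

```python
def coalesce(value, default):
    """Python stand-in for SQL COALESCE with two arguments."""
    return default if value is None else value


def in_list_or_null(quantity, allowed):
    """quantity IN (...) OR quantity IS NULL"""
    return quantity is None or quantity in allowed


def coalesce_in_list(quantity, allowed, sentinel):
    """COALESCE(quantity, sentinel) IN (...)"""
    return coalesce(quantity, sentinel) in allowed


# Hypothetical column values; None plays the role of NULL.
quantities = [None, 1, 2, 5, None, 4, 7]
allowed = {1, 2, 3, 4}

with_or = [q for q in quantities if in_list_or_null(q, allowed)]
with_coalesce = [q for q in quantities if coalesce_in_list(q, allowed, 1)]
wrong_sentinel = [q for q in quantities if coalesce_in_list(q, allowed, -1)]
```

The two predicates select the same rows only when the replacement value is itself a member of the list. A sentinel outside the list (here -1) silently drops the NULL rows unless it is also added to the IN list.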
UPDATE: Alternative: use the EXISTS(NOT EXISTS(NOT IN)) trick (which generates a different plan here)
-- EXPLAIN ANALYZE
SELECT *
FROM fruits fr
WHERE EXISTS (
SELECT * FROM fruits ex
WHERE ex.id = fr.id
AND NOT EXISTS (
SELECT * FROM fruits nx
WHERE nx.id = ex.id
AND nx.quantity NOT IN (1,2,3,4)
)
)
;
BTW: while testing (up to 1 million rows, with only 4 plus a few qualifying), the first query (which does not use an index) was always faster than the second (which uses indexes and a hash anti-join). YMMV.
UPDATE 2: the original query IS NULL OR IN() is a clear winner here:
-- EXPLAIN ANALYZE
SELECT *
FROM fruits
WHERE quantity IS NULL
OR quantity IN (1,2,3,4)
;
This is not an answer to your exact question, but you could build a partial index tailored for your query:
CREATE INDEX idx_partial ON fruits (quantity)
WHERE quantity IN (1,2,3,4) OR quantity IS NULL;
From the docs: http://www.postgresql.org/docs/current/interactive/indexes-partial.html
This index should then be used by your query and speed it up.