I have defined the following index:
CREATE INDEX users_search_idx ON auth_user
USING gin (
    username gin_trgm_ops,
    first_name gin_trgm_ops,
    last_name gin_trgm_ops
);
I am performing the following query:
PREPARE user_search (TEXT, INT) AS
SELECT
username,
email,
first_name,
last_name,
( -- would probably do per-field weightings here
s_username + s_first_name + s_last_name
) rank
FROM
auth_user,
similarity(username, $1) s_username,
similarity(first_name, $1) s_first_name,
similarity(last_name, $1) s_last_name
WHERE
username % $1 OR
first_name % $1 OR
last_name % $1
ORDER BY
rank DESC
LIMIT $2;
The auth_user table has 6.2 million rows.
The speed of the query seems to depend very heavily on the number of results potentially returned by the similarity query.
Increasing the similarity threshold via set_limit helps, but it reduces the usefulness of the results by eliminating partial matches.
Some searches return in 200ms, others take ~ 10 seconds.
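For reference, the threshold adjustment looks like this (a sketch; 0.5 is only an example value, the pg_trgm default is 0.3):
SELECT show_limit();                      -- current threshold used by the % operator
SELECT set_limit(0.5);                    -- raise it for this session
-- equivalently, on 9.6+: SET pg_trgm.similarity_threshold = 0.5;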
We have an existing implementation of this feature using Elasticsearch that returns in < 200ms for any query, while doing more complicated (better) ranking.
I would like to know if there is any way we could improve this to get more consistent performance?
It's my understanding that a GIN index (an inverted index) is the same basic approach used by Elasticsearch, so I would have thought some optimization is possible.
An EXPLAIN ANALYZE EXECUTE user_search('mel', 20) shows:
Limit (cost=54099.81..54099.86 rows=20 width=52) (actual time=10302.092..10302.104 rows=20 loops=1)
-> Sort (cost=54099.81..54146.66 rows=18739 width=52) (actual time=10302.091..10302.095 rows=20 loops=1)
Sort Key: (((s_username.s_username + s_first_name.s_first_name) + s_last_name.s_last_name)) DESC
Sort Method: top-N heapsort Memory: 26kB
-> Nested Loop (cost=382.74..53601.17 rows=18739 width=52) (actual time=118.164..10293.765 rows=8380 loops=1)
-> Nested Loop (cost=382.74..53132.69 rows=18739 width=56) (actual time=118.150..10262.804 rows=8380 loops=1)
-> Nested Loop (cost=382.74..52757.91 rows=18739 width=52) (actual time=118.142..10233.990 rows=8380 loops=1)
-> Bitmap Heap Scan on auth_user (cost=382.74..52383.13 rows=18739 width=48) (actual time=118.128..10186.816 rows=8380 loops=1)
Recheck Cond: (((username)::text % 'mel'::text) OR ((first_name)::text % 'mel'::text) OR ((last_name)::text % 'mel'::text))
Rows Removed by Index Recheck: 2434523
Heap Blocks: exact=49337 lossy=53104
-> BitmapOr (cost=382.74..382.74 rows=18757 width=0) (actual time=107.436..107.436 rows=0 loops=1)
-> Bitmap Index Scan on users_search_idx (cost=0.00..122.89 rows=6252 width=0) (actual time=40.200..40.200 rows=88908 loops=1)
Index Cond: ((username)::text % 'mel'::text)
-> Bitmap Index Scan on users_search_idx (cost=0.00..122.89 rows=6252 width=0) (actual time=43.847..43.847 rows=102028 loops=1)
Index Cond: ((first_name)::text % 'mel'::text)
-> Bitmap Index Scan on users_search_idx (cost=0.00..122.89 rows=6252 width=0) (actual time=23.387..23.387 rows=58740 loops=1)
Index Cond: ((last_name)::text % 'mel'::text)
-> Function Scan on similarity s_username (cost=0.00..0.01 rows=1 width=4) (actual time=0.004..0.004 rows=1 loops=8380)
-> Function Scan on similarity s_first_name (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=8380)
-> Function Scan on similarity s_last_name (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=8380)
Execution time: 10302.559 ms
Server is Postgres 9.6.1 running on Amazon RDS
Update
1.
Shortly after posting the question I found this info: https://www.postgresql.org/message-id/464F3C5D.2000700@enterprisedb.com
So I tried
-> SHOW work_mem;
4MB
-> SET work_mem='12MB';
-> EXECUTE user_search('mel', 20);
(results returned in ~1.5s)
This made a big improvement (previously > 10s)!
1.5s is still way slower than ES for a similar query, so I would still like to hear any suggestions for optimising the query.
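For the record, a sketch of how the larger work_mem could be applied beyond an interactive session (role and database names here are placeholders):
SET LOCAL work_mem = '12MB';                  -- current transaction only
ALTER ROLE app_user SET work_mem = '12MB';    -- app_user is a placeholder role name
ALTER DATABASE app_db SET work_mem = '12MB';  -- app_db is a placeholder database name
-- on RDS, the cluster-wide default can also be raised via the DB parameter group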
2.
In response to comments, and after seeing this question (Postgresql GIN index slower than GIST for pg_trgm), I tried exactly the same setup with a GiST index in place of the GIN one.
Trying the same search above, it returned in ~3.5s, using the default work_mem='4MB'. Increasing work_mem made no difference.
From this I conclude that the GiST index is more memory-efficient (it did not hit a pathological case like GIN did) but is slower than GIN when GIN is working properly. This is in line with what's described in the docs recommending the GIN index.
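The GiST variant was a like-for-like swap of the operator class (a sketch; the index name is arbitrary):
CREATE INDEX users_search_gist_idx ON auth_user
USING gist (
    username   gist_trgm_ops,
    first_name gist_trgm_ops,
    last_name  gist_trgm_ops
);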
3.
I still don't understand why so much time is spent in:
-> Bitmap Heap Scan on auth_user (cost=382.74..52383.13 rows=18739 width=48) (actual time=118.128..10186.816 rows=8380 loops=1)
Recheck Cond: (((username)::text % 'mel'::text) OR ((first_name)::text % 'mel'::text) OR ((last_name)::text % 'mel'::text))
Rows Removed by Index Recheck: 2434523
Heap Blocks: exact=49337 lossy=53104
I don't understand why this step is needed or what it's doing.
There are three Bitmap Index Scans beneath it, one for each of the username/first_name/last_name % $1 clauses; those results then get combined with a BitmapOr step. These parts are all quite fast.
But even in the case where we don't run out of work_mem, we still spend nearly a whole second in the Bitmap Heap Scan.
I expect much faster results with this approach:
1.
Create a GiST index with 1 column holding concatenated values:
CREATE INDEX users_search_idx ON auth_user
USING gist((username || ' ' || first_name || ' ' || last_name) gist_trgm_ops);
This assumes all 3 columns are defined NOT NULL (you did not specify). Otherwise you need to do more; see the sketch below.
Why not simplify with concat_ws()?
Combine two columns and add into one new column
Faster query with pattern-matching on multiple text fields
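If the NOT NULL assumption does not hold, here is a minimal sketch of a NULL-safe variant. COALESCE keeps the expression indexable (plain concat_ws() is not marked IMMUTABLE, so it cannot be used directly in an index expression), and the query must repeat the index expression verbatim:
CREATE INDEX users_search_idx ON auth_user USING gist (
    (coalesce(username, '') || ' ' || coalesce(first_name, '')
                            || ' ' || coalesce(last_name, '')) gist_trgm_ops
);

SELECT username, email, first_name, last_name
FROM   auth_user
WHERE  (coalesce(username, '') || ' ' || coalesce(first_name, '') || ' ' || coalesce(last_name, '')) % $1
ORDER  BY (coalesce(username, '') || ' ' || coalesce(first_name, '') || ' ' || coalesce(last_name, '')) <-> $1
LIMIT  $2;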
2.
Use a proper nearest-neighbor query, matching above index:
SELECT username, email, first_name, last_name
, similarity(username , $1) AS s_username
, similarity(first_name, $1) AS s_first_name
, similarity(last_name , $1) AS s_last_name
, row_number() OVER () AS rank -- greatest similarity first
FROM auth_user
WHERE (username || ' ' || first_name || ' ' || last_name) % $1 -- !!
ORDER BY (username || ' ' || first_name || ' ' || last_name) <-> $1 -- !!
LIMIT $2;
Expressions in WHERE and ORDER BY must match index expression!
In particular ORDER BY rank (like you had it) will always perform poorly for a small LIMIT picking from a much larger pool of qualifying rows, because it cannot use an index directly: The sophisticated expression behind rank has to be calculated for every qualifying row, then all have to be sorted before the small selection of best matches can be returned. This is much, much more expensive than a true nearest-neighbor query that can pick the best results from the index directly without even looking at the rest.
row_number() with empty window definition just reflects the ordering produced by the ORDER BY of the same SELECT.
Related answers:
Best index for similarity function
Search in 300 million addresses with pg_trgm
As for your item 3, I added an answer to the question you referenced that should explain it:
PostgreSQL GIN index slower than GIST for pg_trgm?
Related
I have a product_details table with 30+ million records. Product attribute data of text type is stored in column Value1.
Front-end (web) users search for product details, and the search is run against column Value1.
create table product_details(
    key serial primary key,
    product_key int,
    attribute_key int,
    Value1 text[],
    Value2 int[],
    status text);
I created a GIN index on column Value1 to improve search query performance.
Query execution improved a lot for many queries.
Tables and indexes are here.
Below is one of the queries used by the application for search.
select p.key from (select x.product_key,
x.value1,
x.attribute_key,
x.status
from product_details x
where value1 IS NOT NULL
) as pr_d
join attribute_type at on at.key = pr_d.attribute_key
join product p on p.key = pr_d.product_key
where value1_search(pr_d.value1) ilike '%B s%'
and at.type = 'text'
and at.status = 'active'
and pr_d.status = 'active'
and 1 = 1
and p.product_type_key=1
and 1 = 1
group by p.key
The query executes in 2 or 3 seconds if we search for %B % or any single- or two-character word; below is the query plan:
Group (cost=180302.82..180302.83 rows=1 width=4) (actual time=49.006..49.021 rows=65 loops=1)
Group Key: p.key
-> Sort (cost=180302.82..180302.83 rows=1 width=4) (actual time=49.005..49.009 rows=69 loops=1)
Sort Key: p.key
Sort Method: quicksort Memory: 28kB
-> Nested Loop (cost=0.99..180302.81 rows=1 width=4) (actual time=3.491..48.965 rows=69 loops=1)
Join Filter: (x.attribute_key = at.key)
Rows Removed by Join Filter: 10051
-> Nested Loop (cost=0.99..180270.15 rows=1 width=8) (actual time=3.396..45.211 rows=69 loops=1)
-> Index Scan using products_product_type_key_status on product p (cost=0.43..4420.58 rows=1413 width=4) (actual time=0.024..1.473 rows=1630 loops=1)
Index Cond: (product_type_key = 1)
-> Index Scan using product_details_product_attribute_key_status on product_details x (cost=0.56..124.44 rows=1 width=8) (actual time=0.026..0.027 rows=0 loops=1630)
Index Cond: ((product_key = p.key) AND (status = 'active'))
Filter: ((value1 IS NOT NULL) AND (value1_search(value1) ~~* '%B %'::text))
Rows Removed by Filter: 14
-> Seq Scan on attribute_type at (cost=0.00..29.35 rows=265 width=4) (actual time=0.002..0.043 rows=147 loops=69)
Filter: ((value_type = 'text') AND (status = 'active'))
Rows Removed by Filter: 115
Planning Time: 0.732 ms
Execution Time: 49.089 ms
But if I search for %B s%, the query took 75 seconds; below is the query plan (a second execution took 63 seconds).
In the plan below, the DB engine didn't choose the same index scans as in the plan above. Not sure why?
Group (cost=8057.69..8057.70 rows=1 width=4) (actual time=62138.730..62138.737 rows=12 loops=1)
Group Key: p.key
-> Sort (cost=8057.69..8057.70 rows=1 width=4) (actual time=62138.728..62138.732 rows=14 loops=1)
Sort Key: p.key
Sort Method: quicksort Memory: 25kB
-> Nested Loop (cost=389.58..8057.68 rows=1 width=4) (actual time=2592.685..62138.710 rows=14 loops=1)
-> Hash Join (cost=389.15..4971.85 rows=368 width=4) (actual time=298.280..62129.956 rows=831 loops=1)
Hash Cond: (x.attribute_type = at.key)
-> Bitmap Heap Scan on product_details x (cost=356.48..4937.39 rows=681 width=8) (actual time=298.117..62128.452 rows=831 loops=1)
Recheck Cond: (value1_search(value1) ~~* '%B s%'::text)
Rows Removed by Index Recheck: 26168889
Filter: ((value1 IS NOT NULL) AND (status = 'active'))
Rows Removed by Filter: 22
Heap Blocks: exact=490 lossy=527123
-> Bitmap Index Scan on product_details_value1_gin (cost=0.00..356.31 rows=1109 width=0) (actual time=251.596..251.596 rows=2846970 loops=1)
Index Cond: (value1_search(value1) ~~* '%B s%'::text)
-> Hash (cost=29.35..29.35 rows=265 width=4) (actual time=0.152..0.153 rows=269 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 18kB
-> Seq Scan on attribute_type at (cost=0.00..29.35 rows=265 width=4) (actual time=0.010..0.122 rows=269 loops=1)
Filter: ((value_type = 'text') AND (status = 'active'))
Rows Removed by Filter: 221
-> Index Scan using product_pkey on product p (cost=0.43..8.39 rows=1 width=4) (actual time=0.009..0.009 rows=0 loops=831)
Index Cond: (key = x.product_key)
Filter: (product_type_key = 1)
Rows Removed by Filter: 1
Planning Time: 0.668 ms
Execution Time: 62138.794 ms
Any suggestions to improve the query for the %B s% search?
Thanks.
ilike '%B %' has no usable trigrams in it. The planner knows this, and punishes the pg_trgm index plan so much that the planner then goes with an entirely different plan instead.
But ilike '%B s%' does have one usable trigram in it, ' s'. It turns out that this trigram sucks because it is extremely common in the searched data, but the planner currently has no way to accurately estimate how much it sucks.
Even worse, this large number of matches means your full bitmap can't fit in work_mem, so it goes lossy. Then it needs to recheck all the tuples in any page which contains even one tuple that has the ' s' trigram in it, which looks like it is most of the pages in your table.
The first thing to do is to increase your work_mem to the point you stop getting lossy blocks. If most of your time is spent in the CPU applying the recheck condition, this should help tremendously. If most of your time is spent reading the product_details from disk (so that the recheck has the data it needs to run) then it won't help much. If you had done EXPLAIN (ANALYZE, BUFFERS) with track_io_timing turned on, then we would already know which is which.
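A sketch of how to check both at once (the work_mem value is only an example to experiment with):
SET track_io_timing = on;   -- may require superuser, depending on the version
SET work_mem = '256MB';     -- example value; raise until "lossy" disappears from Heap Blocks
-- then re-run the search query from the question, prefixed with:
-- EXPLAIN (ANALYZE, BUFFERS) ...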
Another thing you could do is have the application inspect the search parameter, and if it looks like two letters (with or without a space between), then forcibly disable that index usage, or just throw an error if there is no good reason to do that type of search. For example, changing the part of the query to look like this will disable the index:
where value1_search(pr_d.value1)||'' ilike '%B s%'
Another thing would be to rethink your data representation. '%B s%' is a peculiar thing to search for. Why would anyone search for that? Does it have some special meaning within the context of your data, which is not obvious to the outside observer? Maybe you could represent it in a different way that gets along better with pg_trgm.
Finally, you could try to improve the planning for GIN indexes generally by explicitly estimating how many tuples are going to fail recheck (due to inherent lossiness of the index, not due to overrunning work_mem). This would be a major undertaking, and you would be unlikely to see it in production for at least a couple years, if ever.
We have two tables - event_deltas and deltas_to_retrieve - which both have BTREE indexes on the same two columns:
CREATE TABLE event_deltas
(
event_id UUID REFERENCES events(id) NOT NULL,
version INT NOT NULL,
json_patch JSONB NOT NULL,
PRIMARY KEY (event_id, version)
);
CREATE TABLE deltas_to_retrieve(event_id UUID NOT NULL, version INT NOT NULL);
CREATE UNIQUE INDEX event_id_version ON deltas_to_retrieve (event_id, version);
In terms of table size, deltas_to_retrieve is a tiny lookup table of ~500 rows. The event_deltas table contains ~7,000,000 rows. Due to the size of the latter table, we want to limit how much we retrieve at once. Therefore, the tables are queried as follows:
SELECT ed.event_id, ed.version
FROM deltas_to_retrieve zz, event_deltas ed
WHERE zz.event_id = ed.event_id
AND ed.version > zz.version
ORDER BY ed.event_id, ed.version
LIMIT 5000;
Without the LIMIT, for the example I'm looking at the query returns ~30,000 rows.
What's odd about this query is the impact of the ORDER BY. Due to the existing indexes, the data comes back in the order we want with or without it. I would rather keep the explicit ORDER BY there so we're future-proofed against future changes, as well as for readability etc. However, as things stand it has a significant negative impact on performance.
According to the docs:
An important special case is ORDER BY in combination with LIMIT n: an explicit sort will have to process all the data to identify the first n rows, but if there is an index matching the ORDER BY, the first n rows can be retrieved directly, without scanning the remainder at all.
This makes me think that, given the indexes we already have in place, the ORDER BY should not slow down the query at all. However, in practice I'm seeing execution times of ~10s with the ORDER BY and <1s without. I've included the plans outputted by EXPLAIN below:
Without ORDER BY
Just EXPLAIN:
QUERY PLAN
Limit (cost=0.56..20033.38 rows=5000 width=20)
-> Nested Loop (cost=0.56..331980.39 rows=82859 width=20)
-> Seq Scan on deltas_to_retrieve zz (cost=0.00..9.37 rows=537 width=20)
-> Index Only Scan using event_deltas_pkey on event_deltas ed (cost=0.56..616.66 rows=154 width=20)
Index Cond: ((event_id = zz.event_id) AND (version > zz.version))
More detailed EXPLAIN (ANALYZE, BUFFERS):
QUERY PLAN
Limit (cost=0.56..20039.35 rows=5000 width=20) (actual time=3.675..2083.063 rows=5000 loops=1)
" Buffers: shared hit=1450 read=4783, local hit=2"
-> Nested Loop (cost=0.56..1055082.88 rows=263260 width=20) (actual time=3.673..2080.745 rows=5000 loops=1)
" Buffers: shared hit=1450 read=4783, local hit=2"
-> Seq Scan on deltas_to_retrieve zz (cost=0.00..27.00 rows=1700 width=20) (actual time=0.022..0.307 rows=241 loops=1)
Buffers: local hit=2
-> Index Only Scan using event_deltas_pkey on event_deltas ed (cost=0.56..619.07 rows=155 width=20) (actual time=1.317..8.612 rows=21 loops=241)
Index Cond: ((event_id = zz.event_id) AND (version > zz.version))
Heap Fetches: 5000
Buffers: shared hit=1450 read=4783
Planning Time: 1.150 ms
Execution Time: 2084.647 ms
With ORDER BY
Just EXPLAIN:
QUERY PLAN
Limit (cost=0.84..929199.06 rows=5000 width=20)
-> Merge Join (cost=0.84..48924145.53 rows=263260 width=20)
Merge Cond: (ed.event_id = zz.event_id)
Join Filter: (ed.version > zz.version)
-> Index Only Scan using event_deltas_pkey on event_deltas ed (cost=0.56..48873353.76 rows=12318733 width=20)
-> Materialize (cost=0.28..6178.03 rows=1700 width=20)
-> Index Only Scan using event_id_version on deltas_to_retrieve zz (cost=0.28..6173.78 rows=1700 width=20)
More detailed EXPLAIN (ANALYZE, BUFFERS):
QUERY PLAN
Limit (cost=0.84..929199.06 rows=5000 width=20) (actual time=4457.770..506706.443 rows=5000 loops=1)
" Buffers: shared hit=78806 read=1071004 dirtied=148, local hit=63"
-> Merge Join (cost=0.84..48924145.53 rows=263260 width=20) (actual time=4457.768..506704.815 rows=5000 loops=1)
Merge Cond: (ed.event_id = zz.event_id)
Join Filter: (ed.version > zz.version)
" Buffers: shared hit=78806 read=1071004 dirtied=148, local hit=63"
-> Index Only Scan using event_deltas_pkey on event_deltas ed (cost=0.56..48873353.76 rows=12318733 width=20) (actual time=4.566..505443.407 rows=1813438 loops=1)
Heap Fetches: 1814767
Buffers: shared hit=78806 read=1071004 dirtied=148
-> Materialize (cost=0.28..6178.03 rows=1700 width=20) (actual time=0.063..2.524 rows=5000 loops=1)
Buffers: local hit=63
-> Index Only Scan using event_id_version on deltas_to_retrieve zz (cost=0.28..6173.78 rows=1700 width=20) (actual time=0.056..0.663 rows=78 loops=1)
Heap Fetches: 78
Buffers: local hit=63
Planning Time: 1.088 ms
Execution Time: 506709.819 ms
I'm not very experienced at reading these plans, but it's obviously thinking that it needs to retrieve everything, sort it and then return TOP N, rather than just grabbing the first N using the index. It's doing a Seq Scan on the smaller deltas_to_retrieve table rather than an Index Only Scan - is that the problem? That table is v. small (~500 rows), so I wonder if it's just not bothering to use the index because of that?
Postgres version: 11.12
Upgrading to Postgres 13 fixed this for us, with the introduction of incremental sort. From some docs on the feature:
Incremental sorting: Sorting is a performance-intensive task, so every improvement in this area can make a difference. Now PostgreSQL 13 introduces incremental sorting, which leverages early-stage sorts of a query and sorts only the incremental unsorted fields, increasing the chances the sorted block will fit in memory and by that, improving performance.
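For what it's worth, the feature is controlled by a planner setting that defaults to on in Postgres 13, so there was nothing to enable explicitly:
SHOW enable_incremental_sort;          -- 'on' by default from PostgreSQL 13
-- SET enable_incremental_sort = off;  -- would push the planner back toward the old plan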
The new query plan from EXPLAIN is as follows, with the query now completing in <500ms consistently:
QUERY PLAN
Limit (cost=71.06..820.32 rows=5000 width=20)
-> Incremental Sort (cost=71.06..15461.82 rows=102706 width=20)
" Sort Key: ed.event_id, ed.version"
Presorted Key: ed.event_id
-> Nested Loop (cost=0.84..6659.05 rows=102706 width=20)
-> Index Only Scan using event_id_version on deltas_to_retrieve zz (cost=0.28..1116.39 rows=541 width=20)
-> Index Only Scan using event_deltas_pkey on event_deltas ed (cost=0.56..8.35 rows=190 width=20)
Index Cond: ((event_id = zz.event_id) AND (version > zz.version))
Note:
Start by running VACUUM ANALYZE on both tables (see the sketch after the query below).
Since deltas_to_retrieve only needs to contain the lowest versions, it could be unique on event_id alone.
You can simplify the query to:
SELECT event_id, version
FROM event_deltas ed
WHERE EXISTS (
SELECT * FROM deltas_to_retrieve zz
WHERE zz.event_id = ed.event_id
AND zz.version < ed.version
)
ORDER BY event_id, version
LIMIT 5000;
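A sketch of the first two notes applied (the narrower index name is a placeholder; whether it replaces the existing two-column unique index is a separate decision):
VACUUM ANALYZE event_deltas;
VACUUM ANALYZE deltas_to_retrieve;

-- if each event_id really appears only once in the lookup table:
CREATE UNIQUE INDEX deltas_to_retrieve_event_id_idx ON deltas_to_retrieve (event_id);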
I have a table
CREATE TABLE timedevent
(
id bigint NOT NULL,
eventdate timestamp with time zone NOT NULL,
newstateids character varying(255) NOT NULL,
sourceid character varying(255) NOT NULL,
CONSTRAINT timedevent_pkey PRIMARY KEY (id)
) WITH (OIDS=FALSE);
with PK id.
I have to query rows between two dates with a certain newstate and a source from a set of possible sources.
I created btree indexes on eventdate and newstateids and one more (a hash index) on sourceid. Only the index on the date made the queries faster; it seems the other two are not used. Why is that so? How could I make my queries faster?
CREATE INDEX eventdate_index ON timedevent USING btree (eventdate);
CREATE INDEX newstateids_index ON timedevent USING btree (newstateids COLLATE pg_catalog."default");
CREATE INDEX sourceid_index_hash ON timedevent USING hash (sourceid COLLATE pg_catalog."default");
Here is the query as Hibernate generates it:
select this_.id as id1_0_0_, this_.description as descript2_0_0_, this_.eventDate as eventDat3_0_0_, this_.locationId as location4_0_0_, this_.newStateIds as newState5_0_0_, this_.oldStateIds as oldState6_0_0_, this_.sourceId as sourceId7_0_0_
from TimedEvent this_
where ((this_.newStateIds=? and this_.sourceId in (?, ?, ?, ?, ?, ?)))
and this_.eventDate between ? and ?
limit ?
EDIT:
Sorry for the misleading title, but it seems Postgres uses all indexes. The problem is that my query time still remains the same. Here is the query plan I got:
Limit (cost=25130.29..33155.77 rows=321 width=161) (actual time=705.330..706.744 rows=279 loops=1)
Buffers: shared hit=6 read=8167 written=61
-> Bitmap Heap Scan on timedevent this_ (cost=25130.29..33155.77 rows=321 width=161) (actual time=705.330..706.728 rows=279 loops=1)
Recheck Cond: (((sourceid)::text = ANY ('{"root,kus-chemnitz,ize-159,Anwesend Bad","root,kus-chemnitz,ize-159,Alarmruf","root,kus-chemnitz,ize-159,Bett Alarm 1","root,kus-chemnitz,ize-159,Bett Alarm 2","root,kus-chemnitz,ize-159,Anwesend Zimmer" (...)
Filter: ((eventdate >= '2017-11-01 15:41:00+01'::timestamp with time zone) AND (eventdate <= '2018-03-20 14:58:16.724+01'::timestamp with time zone))
Buffers: shared hit=6 read=8167 written=61
-> BitmapAnd (cost=25130.29..25130.29 rows=2122 width=0) (actual time=232.990..232.990 rows=0 loops=1)
Buffers: shared hit=6 read=2152
-> Bitmap Index Scan on sourceid_index_hash (cost=0.00..1403.36 rows=39182 width=0) (actual time=1.195..1.195 rows=9308 loops=1)
Index Cond: ((sourceid)::text = ANY ('{"root,kus-chemnitz,ize-159,Anwesend Bad","root,kus-chemnitz,ize-159,Alarmruf","root,kus-chemnitz,ize-159,Bett Alarm 1","root,kus-chemnitz,ize-159,Bett Alarm 2","root,kus-chemnitz,ize-159,Anwesend Z (...)
Buffers: shared hit=6 read=26
-> Bitmap Index Scan on state_index (cost=0.00..23726.53 rows=777463 width=0) (actual time=231.160..231.160 rows=776520 loops=1)
Index Cond: ((newstateids)::text = 'ACTIV'::text)
Buffers: shared read=2126
Total runtime: 706.804 ms
After creating a btree index on (sourceid, newstateids), as a_horse_with_no_name suggested, the cost was reduced:
Limit (cost=125.03..8150.52 rows=321 width=161) (actual time=13.611..14.454 rows=279 loops=1)
Buffers: shared hit=18 read=4336
-> Bitmap Heap Scan on timedevent this_ (cost=125.03..8150.52 rows=321 width=161) (actual time=13.609..14.432 rows=279 loops=1)
Recheck Cond: (((sourceid)::text = ANY ('{"root,kus-chemnitz,ize-159,Anwesend Bad","root,kus-chemnitz,ize-159,Alarmruf","root,kus-chemnitz,ize-159,Bett Alarm 1","root,kus-chemnitz,ize-159,Bett Alarm 2","root,kus-chemnitz,ize-159,Anwesend Zimmer","r (...)
Filter: ((eventdate >= '2017-11-01 15:41:00+01'::timestamp with time zone) AND (eventdate <= '2018-03-20 14:58:16.724+01'::timestamp with time zone))
Buffers: shared hit=18 read=4336
-> Bitmap Index Scan on src_state_index (cost=0.00..124.95 rows=2122 width=0) (actual time=0.864..0.864 rows=4526 loops=1)
Index Cond: (((sourceid)::text = ANY ('{"root,kus-chemnitz,ize-159,Anwesend Bad","root,kus-chemnitz,ize-159,Alarmruf","root,kus-chemnitz,ize-159,Bett Alarm 1","root,kus-chemnitz,ize-159,Bett Alarm 2","root,kus-chemnitz,ize-159,Anwesend Zimmer (...)
Buffers: shared hit=18 read=44
Total runtime: 14.497 ms
Basically, only one index is used because the database would have to combine your indexes into one for them to be useful (or combine the results of searches over several indexes), and doing that is so expensive that in this case it chooses not to: it uses just one index, relevant to one predicate, and checks the other predicates directly against the values in the rows it finds.
One B-tree index with several columns would work better, just as a_horse_with_no_name suggests in the comments. Also note that the order of the columns matters a lot: columns used for single-value searches should come first and the one used for a range search last, because you want to limit range searches as much as possible (see the sketch below).
The database will then walk the index looking for rows that satisfy one predicate using the first column of the index (hopefully narrowing the number of rows a lot), then the second column and second predicate come into play, and so on.
Using separate B-tree indexes when predicates are combined with the AND operator does not make sense to the database, because it would have to use one index to find all rows satisfying one predicate, and then use another index, reading its blocks from disk again, only to reach rows that satisfy the condition relevant to that second index but possibly not the first one. And if they do satisfy it, it's probably cheaper to just load the row after using the first index and check the other predicates directly, without an index.
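A sketch of such an index for the table in the question (the index name is arbitrary; the equality columns come first, the range column eventdate last):
CREATE INDEX timedevent_state_src_date_idx
    ON timedevent (newstateids, sourceid, eventdate);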
Suppose I have this table with an index.
Create Table sample
(
    table_date timestamp without time zone,
    description character varying  -- column type assumed; not given in the original
);
CREATE INDEX dailyinv_index
ON sample
USING btree
(date(table_date));
And it has 33 million rows.
Why is it that running this query
select count(*) from sample where date(table_date) = '8/30/2017' and description = 'desc1'
yields a result in ~12 ms,
Using PostgreSQL EXPLAIN, this is the plan it produces:
Aggregate (cost=288678.55..288678.56 rows=1 width=0)
->Bitmap Heap Scan on sample (cost=3119.63..288647.57 rows=12393 width=0)
Recheck Cond: (date(table_date) = '2017-08-30'::date)
Filter: ((description)::text = 'desc1'::text)
-> Bitmap Index Scan on dailyinv_index (cost=0.00..3116.54 rows=168529 width=0)
Index Cond: (date(table_date) = '2017-08-30'::date)
but this one
select date(table_date) from sample where date(table_date)<='8/30/2017' order by table_date desc limit 1
yields a result after 11,460 ms?
Query Plan
Limit (cost=798243.52..798243.52 rows=1 width=8)
-> Sort (cost=798243.52..826331.69 rows=11235271 width=8)
Sort Key: table_date
-> Bitmap Heap Scan on sample (cost=210305.92..742067.16 rows=11235271 width=8)
Recheck Cond: (date(table_date) <= '2017-08-30'::date)
-> Bitmap Index Scan on dailyinv_index (cost=0.00..207497.10 rows=11235271 width=0)
Index Cond: (date(table_date) <= '2017-08-30'::date)
PostgreSQL Version: 9.4
Maybe I'm doing the indexing wrong, I don't know. I'm really not familiar with indexing. Any help would be great. Thanks a lot!
Your problem is caused by sorting on table_date rather than date(table_date). This can be corrected by modifying the query to:
SELECT DATE(table_date)
FROM sample
WHERE DATE(table_date) <= '8/30/2017'
ORDER BY DATE(table_date) DESC
LIMIT 1
I am relatively new to Postgres full text search and still trying to understand it. I am working on ways I could optimize this query in PostgreSQL Full Text search. The query looks like this:
SELECT articles.article_id, article_title, article_excerpt, article_author, article_link_perm, article_default_image, article_date_added, article_bias_avg, article_rating_avg, article_keywords,
ts_rank(search_vector, to_tsquery('snowden|obama|nsa')) AS rank
FROM development.articles
WHERE search_vector @@ to_tsquery('english', 'snowden|obama|nsa')
  AND ts_rank(search_vector, to_tsquery('snowden|obama|nsa')) > .045
ORDER BY article_date_added DESC, rank DESC
LIMIT 20
And EXPLAIN ANALYZE looks like this:
Limit (cost=20368.26..20368.31 rows=20 width=751) (actual time=276.006..276.101 rows=20 loops=1)
-> Sort (cost=20368.26..20376.91 rows=3459 width=751) (actual time=276.001..276.035 rows=20 loops=1)
Sort Key: article_date_added, (ts_rank(search_vector, to_tsquery('snowden|obama|nsa'::text)))
Sort Method: top-N heapsort Memory: 42kB
-> Bitmap Heap Scan on articles (cost=1136.19..20276.22 rows=3459 width=751) (actual time=22.735..273.558 rows=600 loops=1)
Recheck Cond: (search_vector @@ '( ''snowden'' | ''obama'' ) | ''nsa'''::tsquery)
Filter: (ts_rank(search_vector, to_tsquery('snowden|obama|nsa'::text)) > 0.045::double precision)
-> Bitmap Index Scan on article_search_vector_index (cost=0.00..1135.33 rows=10377 width=0) (actual time=20.512..20.512 rows=9392 loops=1)
Index Cond: (search_vector @@ '( ''snowden'' | ''obama'' ) | ''nsa'''::tsquery)
Total runtime: 276.674 ms
The index being used is GIN because I care more about search than update performance. One of the problems I notice with this query is that the more '|' terms I add, the slower it gets. What are some ways I could optimize this query and still get decent results quickly?
The bigger problem is the:
ORDER BY article_date_added DESC, rank DESC
It makes the planner consider a bunch of applicable rows based on the full-text match, and it then ends up re-sorting them. If you ORDER BY rank DESC instead, you should get better results. (The default order in this case is by rank DESC.)
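A sketch of the reworked query (column names are taken from the question; the 0.045 cutoff is kept as-is):
SELECT article_id, article_title, article_excerpt, article_author,
       article_link_perm, article_default_image, article_date_added,
       article_bias_avg, article_rating_avg, article_keywords,
       ts_rank(search_vector, to_tsquery('english', 'snowden|obama|nsa')) AS rank
FROM development.articles
WHERE search_vector @@ to_tsquery('english', 'snowden|obama|nsa')
  AND ts_rank(search_vector, to_tsquery('english', 'snowden|obama|nsa')) > 0.045
ORDER BY rank DESC
LIMIT 20;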
As for each additional | decreasing performance: each additional word/subquery is fetched separately as part of the bitmap index scan. The more rows qualify, the more will be fetched and considered for the top-N sort. It's perfectly normal.