I have the following PostgreSQL table:
CREATE TABLE "initialTable" (
"paramIDFKey" integer,
"featAIDFKey" integer,
"featBIDFKey" integer,
"featAPresent" boolean,
"featBPresent" boolean,
"dataID" text
);
I update this table with the following command:
UPDATE "initialTable"
SET "dataID" = "dataID" || '#' || 'NEWDATA'
where
"paramIDFKey" = parameterID
and "featAIDFKey" = featAIDFKey
and "featBIDFKey" = featBIDFKey
and "featAPresent" = featAPresent
and "featBPresent" = featBPresent
As you can see, I am updating "dataID" for each matching row. The update works as an append: it concatenates new data onto the existing value. This is too slow, especially as the "dataID" column grows larger.
Here are the EXPLAIN results:
Bitmap Heap Scan on "initialTable" (cost=4.27..8.29 rows=1 width=974)
  Recheck Cond: (("paramIDFKey" = 53) AND ("featAIDFKey" = 0) AND ("featBIDFKey" = 95))
  Filter: ("featAPresent" AND (NOT "featBPresent"))
  -> Bitmap Index Scan on "InexactIndex" (cost=0.00..4.27 rows=1 width=0)
    Index Cond: (("paramIDFKey" = 53) AND ("featAIDFKey" = 0) AND ("featBIDFKey" = 95) AND ("featAPresent" = true) AND ("featBPresent" = false))
And EXPLAIN ANALYZE:
Bitmap Heap Scan on "initialTable" (cost=4.27..8.29 rows=1 width=974) (actual time=0.621..0.675 rows=1 loops=1)
  Recheck Cond: (("paramIDFKey" = 53) AND ("featAIDFKey" = 0) AND ("featBIDFKey" = 95))
  Filter: ("featAPresent" AND (NOT "featBPresent"))
  -> Bitmap Index Scan on "InexactIndex" (cost=0.00..4.27 rows=1 width=0) (actual time=0.026..0.026 rows=1 loops=1)
    Index Cond: (("paramIDFKey" = 53) AND ("featAIDFKey" = 0) AND ("featBIDFKey" = 95) AND ("featAPresent" = true) AND ("featBPresent" = false))
Total runtime: 13.780 ms
and the version:
"PostgreSQL 8.4.14, compiled by Visual C++ build 1400, 32-bit"
Do you have any suggestions?
First, I am not at all convinced this is a real problem. If 15 ms is too long for a single query, you need to start by asking whether you are optimizing prematurely and whether this is really the bottleneck. If it is, you may want to reconsider how you are using your database. Keep in mind that queries can execute faster than EXPLAIN ANALYZE suggests (I have seen some queries run 4x slower under EXPLAIN ANALYZE). So start by profiling your application and look for real bottlenecks.
That being said, if you do find this is a bottleneck, take a close look at what is indexed. Too many indexes slow down write operations, and that includes UPDATE queries. The fix may mean adding a single new index covering all the columns in the UPDATE's WHERE clause and removing other indexes as needed. However, I really don't think you are going to get a lot more out of that query.
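For illustration, such a covering index would look something like this (a sketch; the index name is made up, and you would drop whichever existing indexes it makes redundant):
CREATE INDEX "initialTable_filter_idx" ON "initialTable"
    ("paramIDFKey", "featAIDFKey", "featBIDFKey", "featAPresent", "featBPresent");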
Related
I'm trying to optimize a pagination query that runs on a table of Videos joined with Channels, for a project that uses the Prisma ORM.
I don't have a lot of experience adding database indexes to optimize performance and could use some guidance on whether I missed something obvious or I'm simply constrained by the database server hardware.
When extracted, the query that Prisma runs looks like this, and it takes anywhere from 4 to 10 seconds to run on 136k videos even after I added some indexes:
SELECT
"public"."Video"."id",
"public"."Video"."youtubeId",
"public"."Video"."channelId",
"public"."Video"."type",
"public"."Video"."status",
"public"."Video"."reviewed",
"public"."Video"."category",
"public"."Video"."youtubeTags",
"public"."Video"."language",
"public"."Video"."title",
"public"."Video"."description",
"public"."Video"."duration",
"public"."Video"."durationSeconds",
"public"."Video"."viewCount",
"public"."Video"."likeCount",
"public"."Video"."commentCount",
"public"."Video"."scheduledStartTime",
"public"."Video"."actualStartTime",
"public"."Video"."actualEndTime",
"public"."Video"."sortTime",
"public"."Video"."createdAt",
"public"."Video"."updatedAt",
"public"."Video"."publishedAt"
FROM
"public"."Video",
(
SELECT
"public"."Video"."sortTime" AS "Video_sortTime_0"
FROM
"public"."Video"
WHERE ("public"."Video"."id") = (29949)
) AS "order_cmp"
WHERE (
("public"."Video"."id") IN(
SELECT
"t0"."id" FROM "public"."Video" AS "t0"
INNER JOIN "public"."Channel" AS "j0" ON ("j0"."id") = ("t0"."channelId")
WHERE (
(NOT "j0"."status" IN('HIDDEN', 'ARCHIVED'))
AND "t0"."id" IS NOT NULL)
)
AND "public"."Video"."status" IN('UPCOMING', 'LIVE', 'PUBLISHED')
AND "public"."Video"."sortTime" <= "order_cmp"."Video_sortTime_0")
ORDER BY
"public"."Video"."sortTime" DESC OFFSET 0;
I haven't yet defined the indexes in my Prisma schema file, I've just been setting them directly on the database. These are the current indexes on the Video and Channel tables:
CREATE UNIQUE INDEX "Video_youtubeId_key" ON public."Video" USING btree ("youtubeId")
CREATE INDEX "Video_status_idx" ON public."Video" USING btree (status)
CREATE INDEX "Video_sortTime_idx" ON public."Video" USING btree ("sortTime" DESC)
CREATE UNIQUE INDEX "Video_pkey" ON public."Video" USING btree (id)
CREATE INDEX "Video_channelId_idx" ON public."Video" USING btree ("channelId")
CREATE UNIQUE INDEX "Channel_youtubeId_key" ON public."Channel" USING btree ("youtubeId")
CREATE UNIQUE INDEX "Channel_pkey" ON public."Channel" USING btree (id)
EXPLAIN (ANALYZE,BUFFERS) for the query shows this (analyzer tool link):
Sort (cost=114760.67..114867.67 rows=42801 width=1071) (actual time=4115.144..4170.368 rows=13943 loops=1)
Sort Key: "Video"."sortTime" DESC
Sort Method: external merge Disk: 12552kB
Buffers: shared hit=19049 read=54334 dirtied=168, temp read=1569 written=1573
I/O Timings: read=11229.719
-> Nested Loop (cost=39030.38..91423.62 rows=42801 width=1071) (actual time=2720.873..4037.549 rows=13943 loops=1)
Join Filter: ("Video"."sortTime" <= "Video_1"."sortTime")
Rows Removed by Join Filter: 115529
Buffers: shared hit=19049 read=54334 dirtied=168
I/O Timings: read=11229.719
-> Index Scan using "Video_pkey" on "Video" "Video_1" (cost=0.42..8.44 rows=1 width=8) (actual time=0.852..1.642 rows=1 loops=1)
Index Cond: (id = 29949)
Buffers: shared hit=2 read=2
I/O Timings: read=0.809
-> Gather (cost=39029.96..89810.14 rows=128404 width=1071) (actual time=2719.274..4003.170 rows=129472 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=19047 read=54332 dirtied=168
I/O Timings: read=11228.910
-> Parallel Hash Semi Join (cost=38029.96..75969.74 rows=53502 width=1071) (actual time=2695.849..3959.412 rows=43157 loops=3)
Hash Cond: ("Video".id = t0.id)
Buffers: shared hit=19047 read=54332 dirtied=168
I/O Timings: read=11228.910
-> Parallel Seq Scan on "Video" (cost=0.00..37202.99 rows=53938 width=1071) (actual time=0.929..1236.450 rows=43157 loops=3)
Filter: (status = ANY ('{UPCOMING,LIVE,PUBLISHED}'::"VideoStatus"[]))
Rows Removed by Filter: 3160
Buffers: shared hit=9289 read=27118
I/O Timings: read=3526.407
-> Parallel Hash (cost=37312.18..37312.18 rows=57422 width=4) (actual time=2692.172..2692.180 rows=46084 loops=3)
Buckets: 262144 Batches: 1 Memory Usage: 7520kB
Buffers: shared hit=9664 read=27214 dirtied=168
I/O Timings: read=7702.502
-> Hash Join (cost=173.45..37312.18 rows=57422 width=4) (actual time=3.485..2666.998 rows=46084 loops=3)
Hash Cond: (t0."channelId" = j0.id)
Buffers: shared hit=9664 read=27214 dirtied=168
I/O Timings: read=7702.502
-> Parallel Seq Scan on "Video" t0 (cost=0.00..36985.90 rows=57890 width=8) (actual time=1.774..2646.207 rows=46318 loops=3)
Filter: (id IS NOT NULL)
Buffers: shared hit=9193 read=27214 dirtied=168
I/O Timings: read=7702.502
-> Hash (cost=164.26..164.26 rows=735 width=4) (actual time=1.132..1.136 rows=735 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 34kB
Buffers: shared hit=471
-> Seq Scan on ""Channel"" j0 (cost=0.00..164.26 rows=735 width=4) (actual time=0.024..0.890 rows=735 loops=3)
Filter: (status <> ALL ('{HIDDEN,ARCHIVED}'::"ChannelStatus"[]))
Rows Removed by Filter: 6
Buffers: shared hit=471
Planning Time: 8.134 ms
Execution Time: 4173.202 ms
Now, a hint from that same tool suggests that the sort had to spill to disk because my work_mem setting is too low (it needed 12560kB, and on my Lightsail Postgres DB with 1 GB of RAM, it's set to 4MB).
I'm a bit nervous about bumping work_mem to something like 16MB or even 24MB on a whim. Is that too much for my server's total RAM? Does this look like my root problem? Is there anything else I can do with indexes or my query?
If it helps, the actual Prisma query looks like this:
const videos = await ctx.prisma.video.findMany({
where: {
channel: {
NOT: {
status: {
in: [ChannelStatus.HIDDEN, ChannelStatus.ARCHIVED],
},
},
},
status: {
in: [VideoStatus.UPCOMING, VideoStatus.LIVE, VideoStatus.PUBLISHED],
},
},
include: {
channel: {
include: {
links: true,
},
},
},
cursor: _args.cursor
? {
id: _args.cursor,
}
: undefined,
skip: _args.cursor ? 1 : 0,
orderBy: {
sortTime: 'desc',
},
take: Math.min(_args.limit, config.GRAPHQL_MAX_RECENT_VIDEOS),
});
Even if I eliminate the join with the Channel table from the Prisma query entirely, the performance doesn't improve by much and a query still takes 7-8 seconds to run.
This query is a bit of an ORM-generated nested-select mess. Nested selects get in the way of the optimizer; joins are usually better.
If written by hand, the query would be something like this.
select *
from video
join channel on channel.id = video.channelId
where video.status in('UPCOMING', 'LIVE', 'PUBLISHED')
-- Not clear what this is for, might be wacky pagination?
and video.sortTime <= (
select sortTime from video where id = 29949
)
and not channel.status in('HIDDEN', 'ARCHIVED')
order by sortTime desc
offset 0
limit 100
Pretty straightforward, and much easier to understand and optimize.
As in the answer below, this query would benefit from a single composite index on (sortTime, status).
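In Postgres that index would look something like the following (a sketch; the name is made up, and camelCase identifiers need double quotes):
create index "Video_sortTime_status_idx" on "Video" ("sortTime" DESC, status);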
And since you're paginating, use LIMIT to fetch only as many rows as you need per page; that can drastically help performance. Otherwise Postgres does all the work to compute every qualifying row.
The performance is getting killed by multiple sequential scans of video.
-> Parallel Seq Scan on "Video" (cost=0.00..37202.99 rows=53938 width=1071) (actual time=0.929..1236.450 rows=43157 loops=3)
Filter: (status = ANY ('{UPCOMING,LIVE,PUBLISHED}'::"VideoStatus"[]))
Rows Removed by Filter: 3160
Buffers: shared hit=9289 read=27118
I/O Timings: read=3526.407
-> Parallel Seq Scan on "Video" t0 (cost=0.00..36985.90 rows=57890 width=8) (actual time=1.774..2646.207 rows=46318 loops=3)
Filter: (id IS NOT NULL)
Buffers: shared hit=9193 read=27214 dirtied=168
I/O Timings: read=7702.502
Looking at the where clause...
WHERE (
("public"."Video"."id") IN(
SELECT
"t0"."id" FROM "public"."Video" AS "t0"
INNER JOIN "public"."Channel" AS "j0" ON ("j0"."id") = ("t0"."channelId")
WHERE (
(NOT "j0"."status" IN('HIDDEN', 'ARCHIVED'))
AND "t0"."id" IS NOT NULL)
)
AND "public"."Video"."status" IN('UPCOMING', 'LIVE', 'PUBLISHED')
AND "public"."Video"."sortTime" <= "order_cmp"."Video_sortTime_0")
But you have indexes on video.id and video.status. Why is it doing a seq scan?
In general, Postgres will use only one index per table scan in a query. Your query needs to check three columns: id, status, and sortTime. Since Postgres can only use one index for that scan, it uses the one on sortTime and has to seq scan for the rest.
To solve this, try creating a single composite index on both sortTime and status. This will allow Postgres to use an index for both the status and sortTime parts of the query.
create index "Video_sortTime_status_idx" on "Video" ("sortTime", status);
With this index, the separate sortTime index is no longer necessary; drop it.
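For example, using the name from your index list:
drop index "Video_sortTime_idx";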
The second seq scan comes from "t0"."id" IS NOT NULL. t0 is the Video table, and id is its primary key. A primary key can never be null, so remove that condition.
I don't think any index or db setting is going to improve your existing query by much.
Two small changes to the existing query do get it to use the "sortTime" index for ordering, although I don't know whether you can get Prisma to make them. One is adding an explicit LIMIT (that part may not be strictly necessary; I don't know how to test it with Prisma's "take" instead of a literal LIMIT). The other is moving the 29949 subquery out of the join and putting it directly into the WHERE clause:
AND "public"."Video"."sortTime" <= (
SELECT
"public"."Video"."sortTime" AS "Video_sortTime_0"
FROM
"public"."Video"
WHERE ("public"."Video"."id") = (29949)
)
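Putting both changes together, the tail of the generated query would look roughly like this (a sketch based on the generated SQL above; LIMIT 100 stands in for your actual page size):
...
AND "public"."Video"."status" IN ('UPCOMING', 'LIVE', 'PUBLISHED')
AND "public"."Video"."sortTime" <= (
    SELECT "public"."Video"."sortTime"
    FROM "public"."Video"
    WHERE ("public"."Video"."id") = (29949)
)
ORDER BY "public"."Video"."sortTime" DESC OFFSET 0 LIMIT 100;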
But if Prisma allows you to inject custom queries, I would just rewrite it from scratch along the lines Schwern has suggested.
An improvement in the PostgreSQL planner might get it to work without moving the subquery. But even if we knew exactly what to change, had a high-quality implementation of it, and could convince people it was a trade-off-free improvement, it would still not be released for over a year (in v16). So it wouldn't help you immediately, and I wouldn't have much hope of getting it accepted anyway.
In my query, I just want to fetch data with exact WHERE conditions, and those conditions are covered by an index. But EXPLAIN shows a Bitmap Index Scan, and I can't understand why.
My query looks like this:
select
  r.spend,
  r.date,
  ...
from metadata m
inner join report r
  on m.org_id = r.org_id
  and m.country_or_region = r.country_or_region
  and m.campaign_id = r.campaign_id
  and m.keyword_id = r.keyword_id
where r.org_id = 1
  and m.keyword_type = 'KEYWORD'
offset 0 limit 20
Indexes:
Metadata(org_id, keyword_type, country_or_region, campaign_id, keyword_id);
Report(org_id, country_or_region, campaign_id, keyword_id, date);
EXPLAIN ANALYZE:
Limit (cost=811883.21..910327.87 rows=20 width=8) (actual time=18120.268..18235.831 rows=20 loops=1)
  -> Gather (cost=811883.21..2702020.67 rows=384 width=8) (actual time=18120.267..18235.791 rows=20 loops=1)
    Workers Planned: 2
    Workers Launched: 2
    -> Parallel Hash Join (cost=810883.21..2700982.27 rows=160 width=8) (actual time=18103.440..18103.496 rows=14 loops=3)
      Hash Cond: (((r.country_or_region)::text = (m.country_or_region)::text) AND (r.campaign_id = m.campaign_id) AND (r.keyword_id = m.keyword_id))
      -> Parallel Bitmap Heap Scan on report r (cost=260773.11..2051875.83 rows=3939599 width=35) (actual time=552.601..8532.962 rows=3162553 loops=3)
        Recheck Cond: (org_id = 479360)
        Rows Removed by Index Recheck: 21
        Heap Blocks: exact=20484 lossy=84350
        -> Bitmap Index Scan on idx_kr_org_date_camp (cost=0.00..258409.35 rows=9455038 width=0) (actual time=539.329..539.329 rows=9487660 loops=1)
          Index Cond: (org_id = 479360)
      -> Parallel Hash (cost=527278.08..527278.08 rows=938173 width=26) (actual time=7425.062..7425.062 rows=727133 loops=3)
        Buckets: 65536 Batches: 64 Memory Usage: 2656kB
        -> Parallel Bitmap Heap Scan on metadata m (cost=88007.61..527278.08 rows=938173 width=26) (actual time=1007.028..7119.233 rows=727133 loops=3)
          Recheck Cond: ((org_id = 479360) AND ((keyword_type)::text = 'KEYWORD'::text))
          Rows Removed by Index Recheck: 3
          Heap Blocks: exact=14585 lossy=11054
          -> Bitmap Index Scan on idx_primary (cost=0.00..87444.71 rows=2251615 width=0) (actual time=1014.631..1014.631 rows=2181399 loops=1)
            Index Cond: ((org_id = 479360) AND ((keyword_type)::text = 'KEYWORD'::text))
Planning Time: 0.492 ms
Execution Time: 18235.879 ms
Here, I only want to fetch 20 rows. Shouldn't that make it more efficient?
A Bitmap Index Scan happens when the planner expects a high percentage of rows to satisfy the search conditions. In that case it scans the entire index, building a bitmap of which pages on disk contain matching data; the data is then pulled from those pages during the Bitmap Heap Scan step. This is better than a Sequential Scan because it reads only the relevant pages on disk, skipping the pages it knows contain no matching data. Depending on the statistics available to the optimizer, a plain Index Scan or an Index-Only Scan may not be advantageous, but a bitmap scan is still better than a Sequential Scan.
To complete the answer to the question, an Index-Only Scan is a scan of the index that will pull the relevant data without having to visit the actual table. This is because the relevant data is already in the index. Take, for example, this table:
postgres=# create table foo (id int primary key, name text);
CREATE TABLE
postgres=# insert into foo values (generate_series(1,1000000),'foo');
INSERT 0 1000000
There is an index on the id column of this table, and suppose we call the following query:
postgres=# EXPLAIN ANALYZE SELECT * FROM foo WHERE id < 100;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Index Scan using foo_pkey on foo (cost=0.42..10.25 rows=104 width=8) (actual time=0.012..1.027 rows=99 loops=1)
Index Cond: (id < 100)
Planning Time: 0.190 ms
Execution Time: 2.067 ms
(4 rows)
This query results in an Index scan because it scans the index for the rows that have id < 100, and then visits the actual table on disk to pull the other columns included in the * portion of the SELECT query.
However, suppose we call the following query (notice SELECT id instead of SELECT *):
postgres=# EXPLAIN ANALYZE SELECT id FROM foo WHERE id < 100;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Index Only Scan using foo_pkey on foo (cost=0.42..10.25 rows=104 width=4) (actual time=0.019..0.996 rows=99 loops=1)
Index Cond: (id < 100)
Heap Fetches: 99
Planning Time: 0.098 ms
Execution Time: 1.980 ms
(5 rows)
This results in an Index-only scan because only the id column is requested, and that is included (naturally) in the index, so there's no need to visit the actual table on disk to retrieve anything else. This saves time, but its occurrence is very limited.
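As an aside, on PostgreSQL 11 or later a covering index can widen those limited cases by storing extra columns in the index leaf pages (a sketch on the example table above; the index name is made up):
postgres=# create index foo_id_incl_name_idx on foo (id) include (name);
CREATE INDEX
With that index, even SELECT * FROM foo WHERE id < 100 can use an Index-Only Scan, provided the visibility map is reasonably up to date (i.e., the table has been vacuumed recently).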
To answer your question about limiting to 20 results: the limiting occurs after the Bitmap Index Scan has already completed, so the runtime will be roughly the same whether you limit to 20, 40, or some other value. With an Index Scan or Index-Only Scan, the executor can stop scanning once it has produced as many rows as the LIMIT clause asks for. In your case, with the Bitmap Heap Scan, that isn't possible.
I have a query being run on PostgreSQL that takes a long time when executed at a high rate over large data sets, because it isn't making use of the available indexes. I found that changing the filter from multiple ORs to an IN clause makes it use the right index. Is there a way I can force the index to be used even when using ORs?
Query with Disjunction:
SELECT field1, field2,..., fieldN
FROM table1 WHERE
((((field9='val1' OR field9='val2') OR field9='val3') OR field9='val4')
AND (field6='val5'));
Query Plan:
Bitmap Heap Scan on table1 (cost=18.85..19.88 rows=1 width=395) (actual time=0.017..0.017 rows=0 loops=1)
  Recheck Cond: (((field6)::text = 'val5'::text) AND (((field9)::text = 'val1'::text) OR ((field9)::text = 'val2'::text) OR ((field9)::text = 'val3'::text) OR ((field9)::text = 'val4'::text)))
  -> BitmapAnd (cost=18.85..18.85 rows=1 width=0) (actual time=0.016..0.016 rows=0 loops=1)
    -> Bitmap Index Scan on idx_field6_field9 (cost=0.00..9.01 rows=611 width=0) (actual time=0.015..0.015 rows=0 loops=1)
      Index Cond: ((field6)::text = 'val5'::text)
    -> BitmapOr (cost=9.59..9.59 rows=516 width=0) (never executed)
      -> Bitmap Index Scan on idx_id_field9 (cost=0.00..2.40 rows=129 width=0) (never executed)
        Index Cond: ((field9)::text = 'val1'::text)
      -> Bitmap Index Scan on idx_id_field9 (cost=0.00..2.40 rows=129 width=0) (never executed)
        Index Cond: ((field9)::text = 'val2'::text)
      -> Bitmap Index Scan on idx_id_field9 (cost=0.00..2.40 rows=129 width=0) (never executed)
        Index Cond: ((field9)::text = 'val3'::text)
      -> Bitmap Index Scan on idx_id_field9 (cost=0.00..2.40 rows=129 width=0) (never executed)
        Index Cond: ((field9)::text = 'val4'::text)
Planning time: 0.177 ms
Execution time: 0.061 ms
Query with IN:
SELECT field1, field2,..., fieldN
FROM table1
WHERE
((field9 IN ('val1', 'val2', 'val3', 'val4'))
AND (field6='val5'));
Query Plan:
Index Scan using idx_field6_field9 on table1 (cost=0.43..6.77 rows=1 width=395) (actual time=0.032..0.032 rows=0 loops=1)
  Index Cond: (((field6)::text = 'val5'::text) AND ((field9)::text = ANY ('{val1,val2,val3,val4}'::text[])))
Planning time: 0.145 ms
Execution time: 0.055 ms
There is an index on field6 and field9, which the second query uses as expected and which the first one should use too. field9 is also something like a state field, so its cardinality is extremely low: there are only about 9 distinct values across the whole table. Unfortunately, it isn't straightforward for me to change the query to use an IN clause, so getting PG to choose the right plan as-is would be ideal.
There is no way to get the fast plan (a single index scan) with the OR condition; you'll have to rewrite the query.
You also want to know why, which is always difficult to explain. With optimizations like this, there are usually two reasons:
1. Nobody has gotten around to doing it.
2. It would require extra effort every time a query with an OR is planned: are there several conditions linked with OR that have the same expression on one side? Both plans, the original and the rewritten one, would have to be estimated, and it may well be that the BitmapOr is the most efficient way to process the query.
That price would have to be paid by every query with an OR in it.
I am not saying that it would be a bad idea to add an optimization like this, but there are two sides to the coin.
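For completeness, the rewrite is the IN variant you already found; Postgres normalizes IN to = ANY (as your second plan shows), so either spelling gets the single index scan on idx_field6_field9:
SELECT field1, field2, ..., fieldN
FROM table1
WHERE field9 = ANY ('{val1,val2,val3,val4}'::text[])
  AND field6 = 'val5';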
I have an index like this on my candidates table and its first_name column:
CREATE INDEX ix_public_candidates_first_name_not_null
ON public.candidates (first_name)
WHERE first_name IS NOT NULL;
Is Postgres smart enough to know that an equality operator means the value can't be null, or am I just lucky that my "is not null" index is used in this query?
select *
from public.candidates
where first_name = 'Erik'
EXPLAIN ANALYZE output:
Bitmap Heap Scan on candidates (cost=57.46..8096.88 rows=2714 width=352) (actual time=1.481..18.847 rows=2460 loops=1)
Recheck Cond: (first_name = 'Erik'::citext)
Heap Blocks: exact=2256
-> Bitmap Index Scan on ix_public_candidates_first_name_not_null (cost=0.00..56.78 rows=2714 width=0) (actual time=1.204..1.204 rows=2460 loops=1)
Index Cond: (first_name = 'Erik'::citext)
Planning time: 0.785 ms
Execution time: 19.340 ms
The PostgreSQL optimizer is not based on lucky guesses.
It can indeed infer that anything that matches an equality condition cannot be NULL; the proof is the execution plan you show.
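The planner only has to prove the index's WHERE predicate from the query's own WHERE clause. For contrast, a query from which the predicate cannot be proven will not use the partial index (a sketch against the same table):
-- first_name = 'Erik' implies first_name IS NOT NULL, so the partial index is usable:
select * from public.candidates where first_name = 'Erik';
-- nothing here implies the predicate, so the partial index cannot be used:
select * from public.candidates where first_name is null;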
I have defined the following index:
CREATE INDEX
users_search_idx
ON
auth_user
USING
gin(
username gin_trgm_ops,
first_name gin_trgm_ops,
last_name gin_trgm_ops
);
I am performing the following query:
PREPARE user_search (TEXT, INT) AS
SELECT
username,
email,
first_name,
last_name,
( -- would probably do per-field weightings here
s_username + s_first_name + s_last_name
) rank
FROM
auth_user,
similarity(username, $1) s_username,
similarity(first_name, $1) s_first_name,
similarity(last_name, $1) s_last_name
WHERE
username % $1 OR
first_name % $1 OR
last_name % $1
ORDER BY
rank DESC
LIMIT $2;
The auth_user table has 6.2 million rows.
The speed of the query seems to depend very heavily on the number of results potentially returned by the similarity query.
Increasing the similarity threshold via set_limit helps, but reduces usefulness of results by eliminating partial matches.
Some searches return in 200ms, others take ~ 10 seconds.
We have an existing implementation of this feature using Elasticsearch that returns in < 200ms for any query, while doing more complicated (better) ranking.
I would like to know if there is any way we could improve this to get more consistent performance?
It's my understanding that a GIN index (an inverted index) is the same basic approach used by Elasticsearch, so I would have thought some optimization was possible.
An EXPLAIN ANALYZE EXECUTE user_search('mel', 20) shows:
Limit (cost=54099.81..54099.86 rows=20 width=52) (actual time=10302.092..10302.104 rows=20 loops=1)
-> Sort (cost=54099.81..54146.66 rows=18739 width=52) (actual time=10302.091..10302.095 rows=20 loops=1)
Sort Key: (((s_username.s_username + s_first_name.s_first_name) + s_last_name.s_last_name)) DESC
Sort Method: top-N heapsort Memory: 26kB
-> Nested Loop (cost=382.74..53601.17 rows=18739 width=52) (actual time=118.164..10293.765 rows=8380 loops=1)
-> Nested Loop (cost=382.74..53132.69 rows=18739 width=56) (actual time=118.150..10262.804 rows=8380 loops=1)
-> Nested Loop (cost=382.74..52757.91 rows=18739 width=52) (actual time=118.142..10233.990 rows=8380 loops=1)
-> Bitmap Heap Scan on auth_user (cost=382.74..52383.13 rows=18739 width=48) (actual time=118.128..10186.816 rows=8380 loops=1)
Recheck Cond: (((username)::text % 'mel'::text) OR ((first_name)::text % 'mel'::text) OR ((last_name)::text % 'mel'::text))
Rows Removed by Index Recheck: 2434523
Heap Blocks: exact=49337 lossy=53104
-> BitmapOr (cost=382.74..382.74 rows=18757 width=0) (actual time=107.436..107.436 rows=0 loops=1)
-> Bitmap Index Scan on users_search_idx (cost=0.00..122.89 rows=6252 width=0) (actual time=40.200..40.200 rows=88908 loops=1)
Index Cond: ((username)::text % 'mel'::text)
-> Bitmap Index Scan on users_search_idx (cost=0.00..122.89 rows=6252 width=0) (actual time=43.847..43.847 rows=102028 loops=1)
Index Cond: ((first_name)::text % 'mel'::text)
-> Bitmap Index Scan on users_search_idx (cost=0.00..122.89 rows=6252 width=0) (actual time=23.387..23.387 rows=58740 loops=1)
Index Cond: ((last_name)::text % 'mel'::text)
-> Function Scan on similarity s_username (cost=0.00..0.01 rows=1 width=4) (actual time=0.004..0.004 rows=1 loops=8380)
-> Function Scan on similarity s_first_name (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=8380)
-> Function Scan on similarity s_last_name (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=8380)
Execution time: 10302.559 ms
Server is Postgres 9.6.1 running on Amazon RDS
update
1.
Shortly after posting the question I found this info: https://www.postgresql.org/message-id/464F3C5D.2000700@enterprisedb.com
So I tried
-> SHOW work_mem;
4MB
-> SET work_mem='12MB';
-> EXECUTE user_search('mel', 20);
(results returned in ~1.5s)
This made a big improvement (previously > 10s)!
1.5s is still way slower than ES for a similar query, so I would still like to hear any suggestions for optimising the query.
2.
In response to comments, and after seeing this question (Postgresql GIN index slower than GIST for pg_trgm), I tried exactly the same setup with a GiST index in place of the GIN one.
Trying the same search above, it returned in ~3.5s, using default work_mem='4MB'. Increasing work_mem made no difference.
From this I conclude that the GiST index is more memory-efficient (it did not hit a pathological case like GIN did), but it is slower than GIN when GIN is working properly. This is in line with what's described in the docs recommending the GIN index.
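For reference, the GiST version is the same index definition with just the access method and operator class swapped (a sketch):
CREATE INDEX users_search_idx ON auth_user
USING gist (username gist_trgm_ops, first_name gist_trgm_ops, last_name gist_trgm_ops);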
3.
I still don't understand why so much time is spent in:
-> Bitmap Heap Scan on auth_user (cost=382.74..52383.13 rows=18739 width=48) (actual time=118.128..10186.816 rows=8380 loops=1)
Recheck Cond: (((username)::text % 'mel'::text) OR ((first_name)::text % 'mel'::text) OR ((last_name)::text % 'mel'::text))
Rows Removed by Index Recheck: 2434523
Heap Blocks: exact=49337 lossy=53104
I don't understand why this step is needed or what it's doing.
There are three Bitmap Index Scans beneath it, one for each of the % $1 clauses; those results then get combined with a BitmapOr step. These parts are all quite fast.
But even in the case where we don't run out of work_mem, we still spend nearly a whole second in the Bitmap Heap Scan.
I expect much faster results with this approach:
1.
Create a GiST index with 1 column holding concatenated values:
CREATE INDEX users_search_idx ON auth_user
USING gist((username || ' ' || first_name || ' ' || last_name) gist_trgm_ops);
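(Both this index and your original one assume the pg_trgm extension is installed; it provides the gin_trgm_ops / gist_trgm_ops operator classes and the % and <-> operators.)
CREATE EXTENSION IF NOT EXISTS pg_trgm;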
This assumes all 3 columns are defined NOT NULL (you did not specify). Else you need to do more.
Why not simplify with concat_ws()?
Combine two columns and add into one new column
Faster query with pattern-matching on multiple text fields
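If any of the columns can be NULL, one null-safe variant wraps each column in coalesce() (a sketch; plain || on text is IMMUTABLE and therefore allowed in an index expression, while concat_ws() is only STABLE and is not):
CREATE INDEX users_search_idx ON auth_user
USING gist ((coalesce(username, '') || ' ' || coalesce(first_name, '')
          || ' ' || coalesce(last_name, '')) gist_trgm_ops);
The expressions in the WHERE and ORDER BY of the query below must then use the same coalesce'd concatenation to match the index.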
2.
Use a proper nearest-neighbor query, matching above index:
SELECT username, email, first_name, last_name
, similarity(username , $1) AS s_username
, similarity(first_name, $1) AS s_first_name
, similarity(last_name , $1) AS s_last_name
, row_number() OVER () AS rank -- greatest similarity first
FROM auth_user
WHERE (username || ' ' || first_name || ' ' || last_name) % $1 -- !!
ORDER BY (username || ' ' || first_name || ' ' || last_name) <-> $1 -- !!
LIMIT $2;
Expressions in WHERE and ORDER BY must match the index expression!
In particular ORDER BY rank (like you had it) will always perform poorly for a small LIMIT picking from a much larger pool of qualifying rows, because it cannot use an index directly: The sophisticated expression behind rank has to be calculated for every qualifying row, then all have to be sorted before the small selection of best matches can be returned. This is much, much more expensive than a true nearest-neighbor query that can pick the best results from the index directly without even looking at the rest.
row_number() with empty window definition just reflects the ordering produced by the ORDER BY of the same SELECT.
Related answers:
Best index for similarity function
Search in 300 million addresses with pg_trgm
As for your item 3., I added an answer to the question you referenced, that should explain it:
PostgreSQL GIN index slower than GIST for pg_trgm?