Limiting the results of a query slows the query - PostgreSQL

I have a PostgreSQL 9.6 database used to record debug logs of an application. It contains 130 million records. The main field is of type jsonb with a GIN index.
If I perform a query like the following it executes quickly:
select id, logentry from inettklog where
logentry @> '{"instance":"1.3.46.670589.11.0.0.11.4.2.0.8743.5.5396.2006120114440692624"}'::jsonb;
Here is the explain analyze:
Bitmap Heap Scan on inettklog (cost=2938.03..491856.81 rows=137552 width=300) (actual time=10.610..12.644 rows=128 loops=1)
Recheck Cond: (logentry @> '{"instance": "1.3.46.670589.11.0.0.11.4.2.0.8743.5.5396.2006120114440692624"}'::jsonb)
Heap Blocks: exact=128
-> Bitmap Index Scan on inettklog_ix_logentry (cost=0.00..2903.64 rows=137552 width=0) (actual time=10.564..10.564 rows=128 loops=1)
Index Cond: (logentry @> '{"instance": "1.3.46.670589.11.0.0.11.4.2.0.8743.5.5396.2006120114440692624"}'::jsonb)
Planning time: 68.522 ms
Execution time: 12.720 ms
(7 rows)
But if I simply add a limit, it suddenly becomes very slow:
select id, logentry from inettklog where
logentry @> '{"instance":"1.3.46.670589.11.0.0.11.4.2.0.8743.5.5396.2006120114440692624"}'::jsonb
limit 20;
It now takes nearly 38 seconds!
Limit (cost=0.00..1247.91 rows=20 width=300) (actual time=0.142..37791.319 rows=20 loops=1)
-> Seq Scan on inettklog (cost=0.00..8582696.05 rows=137553 width=300) (actual time=0.141..37791.308 rows=20 loops=1)
Filter: (logentry @> '{"instance": "1.3.46.670589.11.0.0.11.4.2.0.8743.5.5396.2006120114440692624"}'::jsonb)
Rows Removed by Filter: 30825572
Planning time: 0.174 ms
Execution time: 37791.351 ms
(6 rows)
Here are the results when ORDER BY is included, even after setting enable_seqscan=off:
With no limit:
set enable_seqscan = off;
set enable_indexscan = on;
select id, date, logentry from inettklog where
logentry @> '{"instance":"1.3.46.670589.11.0.0.11.4.2.0.8743.5.5396.2006120114440692624"}'::jsonb
order by date;
The explain analyze:
Sort (cost=523244.24..523588.24 rows=137600 width=308) (actual time=48.196..48.219 rows=128 loops=1)
Sort Key: date
Sort Method: quicksort Memory: 283kB
-> Bitmap Heap Scan on inettklog (cost=2658.40..491746.00 rows=137600 width=308) (actual time=31.773..47.865 rows=128 loops=1)
Recheck Cond: (logentry @> '{"instance": "1.3.46.670589.11.0.0.11.4.2.0.8743.5.5396.2006120114440692624"}'::jsonb)
Heap Blocks: exact=128
-> Bitmap Index Scan on inettklog_ix_logentry (cost=0.00..2624.00 rows=137600 width=0) (actual time=31.550..31.550 rows=128 loops=1)
Index Cond: (logentry @> '{"instance": "1.3.46.670589.11.0.0.11.4.2.0.8743.5.5396.2006120114440692624"}'::jsonb)
Planning time: 0.181 ms
Execution time: 48.254 ms
(10 rows)
And now when we add the limit:
set enable_seqscan = off;
set enable_indexscan = on;
select id, date, logentry from inettklog where
logentry @> '{"instance":"1.3.46.670589.11.0.0.11.4.2.0.8743.5.5396.2006120114440692624"}'::jsonb
order by date
limit 20;
It now takes over 98 seconds!
Limit (cost=0.57..4088.36 rows=20 width=308) (actual time=32017.438..98544.017 rows=20 loops=1)
-> Index Scan using inettklog_ix_logdate on inettklog (cost=0.57..28123416.21 rows=137597 width=308) (actual time=32017.437..98544.008 rows=20 loops=1)
Filter: (logentry @> '{"instance": "1.3.46.670589.11.0.0.11.4.2.0.8743.5.5396.2006120114440692624"}'::jsonb)
Rows Removed by Filter: 27829853
Planning time: 0.249 ms
Execution time: 98544.043 ms
(6 rows)
This is all very confusing! I want to provide a utility for quickly querying this database, but its behavior is counter-intuitive.
Can anyone explain what is going on?
Can anyone explain the rules?

The estimates are way off. Try to run ANALYZE, possibly with an increased default_statistics_target.
Since PostgreSQL thinks there are far more matching rows than there actually are, it concludes that it is best off performing a sequential scan and stopping as soon as it has found enough results.
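A minimal sketch of what that could look like (the value 500 is just an illustrative target, not a recommendation from this answer):
SET default_statistics_target = 500;  -- default is 100; ANALYZE uses the session value
ANALYZE inettklog;
-- or per column:
-- ALTER TABLE inettklog ALTER COLUMN logentry SET STATISTICS 500;
-- ANALYZE inettklog;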

Using LIMIT without an index on the column will slow the query down, because PostgreSQL scans the whole table before giving you the result. Instead, create an index on logentry and then run the query with LIMIT; it will give you much faster results. For reference, such a GIN index would look like the sketch below (per the question an equivalent index already exists, so this only illustrates the suggestion):
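-- GIN index supporting jsonb containment (@>) queries
CREATE INDEX inettklog_ix_logentry ON inettklog USING gin (logentry);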
You can check this answer for reference: PostgreSQL query very slow with limit 1

Related

GIN index not working with `SELECT 1` but it works if I do `SELECT COUNT(*)` on PostgreSQL

I have the following query
explain analyze SELECT 1 AS one FROM "orders"
WHERE "orders"."email" ILIKE '%email@gmail.com%'
LIMIT 1 OFFSET 0
QUERY PLAN
Limit (cost=0.00..470.44 rows=1 width=4) (actual time=2303.032..2303.033 rows=1 loops=1)
Output: 1
-> Seq Scan on public.orders (cost=0.00..108200.10 rows=230 width=4) (actual time=2303.031..2303.031 rows=1 loops=1)
Output: 1
Filter: ((orders.email)::text ~~* '%email@gmail.com%'::text)
Rows Removed by Filter: 2309367
Planning Time: 0.195 ms
Execution Time: 2303.047 ms
If I run the same query but use SELECT COUNT(*) instead of SELECT 1, the GIN index (gin_trgm_ops) starts to be used:
explain analyze SELECT COUNT(*) FROM "orders"
WHERE "orders"."email" ILIKE '%email@gmail.com%'
LIMIT 1 OFFSET 0
QUERY PLAN
Limit (cost=1263.98..1263.99 rows=1 width=8) (actual time=18.074..18.075 rows=1 loops=1)
-> Aggregate (cost=1263.98..1263.99 rows=1 width=8) (actual time=18.073..18.073 rows=1 loops=1)
-> Bitmap Heap Scan on orders (cost=377.78..1263.40 rows=230 width=0) (actual time=18.062..18.067 rows=3 loops=1)
Recheck Cond: ((email)::text ~~* '%email@gmail.com%'::text)
Heap Blocks: exact=2
-> Bitmap Index Scan on index_orders_on_email_gin (cost=0.00..377.72 rows=230 width=0) (actual time=18.043..18.044 rows=3 loops=1)
Index Cond: ((email)::text ~~* '%email@gmail.com%'::text)
Planning Time: 0.575 ms
Execution Time: 18.120 ms
Any idea why?
Thanks
With SELECT 1 ... LIMIT 1, it can stop early once it finds one qualifying row. Since PostgreSQL misestimates how many qualifying rows there are, it misestimates how useful this stopping early will be.
The LIMIT doesn't do anything when used with COUNT(*) but without a GROUP BY, since only one row is returned anyway. There is no stopping-early that can be done, as every qualifying row needs to be found in order to count them.
The crux of the matter is not SELECT 1 versus SELECT COUNT(*), it is a LIMIT that does something versus one that does not.
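If the estimates can't be fixed, one way to take the early-stopping bet away from the planner is the OFFSET 0 subquery fence (the same trick used in an answer further down; a sketch, not part of the original answer):
SELECT 1 AS one
FROM (SELECT 1
      FROM "orders"
      WHERE "orders"."email" ILIKE '%email@gmail.com%'
      OFFSET 0) q  -- OFFSET 0 keeps the subquery from being flattened,
                   -- so the LIMIT can no longer steer the scan choice
LIMIT 1;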

PostgreSQL phraseto_tsquery is very slow

I am using PostgreSQL 9.6
I have a database table with about 16 million records. I have a jsonb column, logentry, that has a field called "message". It has a GIN index, created like so:
CREATE INDEX inettklog_ix_ts_message
ON public.inettklog USING gin
(to_tsvector('english'::regconfig, logentry ->> 'message'::text))
TABLESPACE pg_default;
I want to do a search for "application name".
A query with the WHERE clause
to_tsvector('english', logentry->>'message') @@ plainto_tsquery('application name')
executes in 113 msecs and returns 7349 rows
EXPLAIN ANALYZE:
WindowAgg (cost=1812.98..2240.22 rows=95 width=12) (actual time=84.037..84.986 rows=7315 loops=1)
-> Bitmap Heap Scan on inettklog (cost=1812.98..2239.03 rows=95 width=4) (actual time=17.943..81.708 rows=7315 loops=1)
Recheck Cond: (to_tsvector('english'::regconfig, (logentry ->> 'message'::text)) @@ plainto_tsquery('application name'::text))
Heap Blocks: exact=7574
-> Bitmap Index Scan on inettklog_ix_ts_message (cost=0.00..1812.96 rows=95 width=0) (actual time=8.542..8.542 rows=8009 loops=1)
Index Cond: (to_tsvector('english'::regconfig, (logentry ->> 'message'::text)) @@ plainto_tsquery('application name'::text))
Planning time: 0.387 ms
Execution time: 85.243 ms
But I don't want "application" and "name", I want "application name"
But a query with a WHERE clause of
to_tsvector('english', logentry->>'message') @@ phraseto_tsquery('application name')
takes over 2 minutes to run!
EXPLAIN ANALYZE:
WindowAgg (cost=852.98..1280.22 rows=95 width=12) (actual time=145065.204..145066.127 rows=7314 loops=1)
-> Bitmap Heap Scan on inettklog (cost=852.98..1279.03 rows=95 width=4) (actual time=55.180..145030.148 rows=7314 loops=1)
Recheck Cond: (to_tsvector('english'::regconfig, (logentry ->> 'message'::text)) @@ phraseto_tsquery('application name'::text))
Heap Blocks: exact=7573
-> Bitmap Index Scan on inettklog_ix_ts_message (cost=0.00..852.96 rows=95 width=0) (actual time=8.196..8.196 rows=8008 loops=1)
Index Cond: (to_tsvector('english'::regconfig, (logentry ->> 'message'::text)) @@ phraseto_tsquery('application name'::text))
Planning time: 25.926 ms
Execution time: 145067.052 ms
Surely the "<->" operator works by first locating the rows containing "application" and "name" and then filtering the result to find those rows where "name" follows "application".
And, if so, why does it take 2 minutes to run???
The GIN index, unfortunately, cannot support ordering of the lexemes. Your first query is so much faster because it's able to handle everything using the index you built. With the phrase version, the recheck has to actually go to your table and rebuild the tsvectors to check the word order.
You may be able to use a RUM index: https://github.com/postgrespro/rum, which does include the ordering information.
This article expands on these points greatly.
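A sketch of what that could look like, assuming the rum extension is installed (the index name is made up; the operator class follows the project's README):
CREATE EXTENSION rum;
-- RUM stores lexeme positions in the index, so phrase (<->) matches
-- can be checked without rebuilding tsvectors from the heap.
CREATE INDEX inettklog_ix_rum_message
ON inettklog
USING rum (to_tsvector('english', logentry ->> 'message') rum_tsvector_ops);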

PostgreSQL array function doesn't use index

I've created a PostgreSQL array column with a GIN index, and I'm trying to do a contains query on that column. With standard PostgreSQL I can get that working correctly like this:
SELECT d.name
FROM deck d
WHERE d.card_names_array @> string_to_array('"Anger#2","Pingle Who Annoys#2"', ',')
LIMIT 20;
Explain analyze:
Limit (cost=1720.01..1724.02 rows=1 width=31) (actual time=7.787..7.787 rows=0 loops=1)
-> Bitmap Heap Scan on deck deck0_ (cost=1720.01..1724.02 rows=1 width=31) (actual time=7.787..7.787 rows=0 loops=1)
Recheck Cond: (card_names_array @> '{"\"Anger#2\"","\"Pingle Who Annoys#2\""}'::text[])
-> Bitmap Index Scan on deck_card_names_array_idx (cost=0.00..1720.01 rows=1 width=0) (actual time=7.785..7.785 rows=0 loops=1)
Index Cond: (card_names_array @> '{"\"Anger#2\"","\"Pingle Who Annoys#2\""}'::text[])
Planning time: 0.216 ms
Execution time: 7.810 ms
Unfortunately (in this instance) I'm using QueryDSL, with which, from what I've read, native array operators like @> are impossible to use. However, this answer says you can use arraycontains instead to do the same thing. I got that working, and it returns the correct results, but it doesn't use my index.
SELECT d.name
FROM deck d
WHERE arraycontains(d.card_names_array, string_to_array('"Anger#2","Pingle Who Annoys#2"', ','))=true
LIMIT 20;
Explain analyze:
Limit (cost=0.00..18.83 rows=20 width=31) (actual time=1036.151..1036.151 rows=0 loops=1)
-> Seq Scan on deck deck0_ (cost=0.00..159065.60 rows=168976 width=31) (actual time=1036.150..1036.150 rows=0 loops=1)
Filter: arraycontains(card_names_array, '{"\"Anger#2\"","\"Pingle Who Annoys#2\""}'::text[])
Rows Removed by Filter: 584014
Planning time: 0.204 ms
Execution time: 1036.166 ms
This is my QueryDSL code to create the boolean expression:
predicate.and(Expressions.booleanTemplate(
    "arraycontains({0}, string_to_array({1}, ','))=true",
    deckQ.cardNamesArray,
    filters.cards.joinToString(",") { "${it.cardName}#${it.quantity}" }
))
Is there some way to get it to use my index? Or maybe a different way to do this with QueryDSL that can use the native @> operator?

Why does Postgres do a table scan instead of using my index?

I'm working with the HackerNews dataset in Postgres. There are about 17M rows; about 14.5M of them are comments and about 2.5M are stories. There is a very active user named "rbanffy" who has 25k submissions, with a roughly equal split of stories and comments. Both "by" and "type" have separate indexes.
I have a query:
SELECT *
FROM "hn_items"
WHERE by = 'rbanffy'
and type = 'story'
ORDER BY id DESC
LIMIT 20 OFFSET 0
That runs quickly (it's using the 'by' index). If I change the type to "comment" then it's very slow. From the explain, it doesn't use either index and does a scan.
Limit (cost=0.56..56948.32 rows=20 width=1937)
-> Index Scan using hn_items_pkey on hn_items (cost=0.56..45823012.32 rows=16093 width=1937)
Filter: (((by)::text = 'rbanffy'::text) AND ((type)::text = 'comment'::text))
If I change the query to have type||''='comment', then it will use the 'by' index and executes quickly.
Why is this happening? I understand from https://stackoverflow.com/a/309814/214545 that having to do a hack like this implies something is wrong. But I don't know what.
EDIT:
This is the explain for the type='story'
Limit (cost=72553.07..72553.12 rows=20 width=1255)
-> Sort (cost=72553.07..72561.25 rows=3271 width=1255)
Sort Key: id DESC
-> Bitmap Heap Scan on hn_items (cost=814.59..72466.03 rows=3271 width=1255)
Recheck Cond: ((by)::text = 'rbanffy'::text)
Filter: ((type)::text = 'story'::text)
-> Bitmap Index Scan on hn_items_by_index (cost=0.00..813.77 rows=19361 width=0)
Index Cond: ((by)::text = 'rbanffy'::text)
EDIT:
EXPLAIN (ANALYZE,BUFFERS)
Limit (cost=0.56..59510.10 rows=20 width=1255) (actual time=20.856..545.282 rows=20 loops=1)
Buffers: shared hit=21597 read=2658 dirtied=32
-> Index Scan using hn_items_pkey on hn_items (cost=0.56..47780210.70 rows=16058 width=1255) (actual time=20.855..545.271 rows=20 loops=1)
Filter: (((by)::text = 'rbanffy'::text) AND ((type)::text = 'comment'::text))
Rows Removed by Filter: 46798
Buffers: shared hit=21597 read=2658 dirtied=32
Planning time: 0.173 ms
Execution time: 545.318 ms
EDIT: EXPLAIN (ANALYZE,BUFFERS) of type='story'
Limit (cost=72553.07..72553.12 rows=20 width=1255) (actual time=44.121..44.127 rows=20 loops=1)
Buffers: shared hit=20137
-> Sort (cost=72553.07..72561.25 rows=3271 width=1255) (actual time=44.120..44.123 rows=20 loops=1)
Sort Key: id DESC
Sort Method: top-N heapsort Memory: 42kB
Buffers: shared hit=20137
-> Bitmap Heap Scan on hn_items (cost=814.59..72466.03 rows=3271 width=1255) (actual time=6.778..37.774 rows=11630 loops=1)
Recheck Cond: ((by)::text = 'rbanffy'::text)
Filter: ((type)::text = 'story'::text)
Rows Removed by Filter: 12587
Heap Blocks: exact=19985
Buffers: shared hit=20137
-> Bitmap Index Scan on hn_items_by_index (cost=0.00..813.77 rows=19361 width=0) (actual time=3.812..3.812 rows=24387 loops=1)
Index Cond: ((by)::text = 'rbanffy'::text)
Buffers: shared hit=152
Planning time: 0.156 ms
Execution time: 44.422 ms
EDIT: latest test results
I was playing around with the type='comment' query and noticed that if I changed the limit to a higher number like 100, it used the by index. I played with the values until I found that the critical number was 47: with a limit of 47 the by index was used, with a limit of 46 it was a full scan. I assume that number isn't magical, it just happens to be the threshold for my dataset or some other variable I don't know. I don't know if this helps.
Since there are many comments by rbanffy, PostgreSQL assumes that it will be fast enough if it searches the table in the order implied by the ORDER BY clause (which can use the primary key index) until it has found 20 rows that match the search condition.
Unfortunately it happens that the guy has grown lazy lately; at any rate, PostgreSQL has to scan the 46798 highest ids until it has found its 20 hits. (You really shouldn't have removed the Backwards from the plan; that confused me.)
The best way to work around that is to confuse PostgreSQL so that it doesn't choose the primary key index, perhaps like this:
SELECT *
FROM (SELECT * FROM hn_items
WHERE by = 'rbanffy'
AND type = 'comment'
OFFSET 0) q
ORDER BY id DESC
LIMIT 20;
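The OFFSET 0 acts as an optimization fence: the subquery is not flattened, so PostgreSQL first collects all matching rows (it can use the by index for that) and only then sorts and limits them. Alternatively (my addition, not part of the original answer), a multicolumn index matching both the filter and the sort would serve this query directly:
-- hypothetical index: equality columns first, then the ORDER BY column
CREATE INDEX hn_items_by_type_id_idx ON hn_items (by, type, id DESC);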

Postgres select performance with LIMIT 1

I have a table with about 50 million records in PostgreSQL. I'm trying to select the post with the most "likes", filtering by a "tag". Both fields have b-tree indexes. For the "love" tag I get:
EXPLAIN analyse select user_id from posts where tags @> array['love'] order by likes desc nulls last limit 12
Limit (cost=0.57..218.52 rows=12 width=12) (actual time=2.658..14.243 rows=12 loops=1)
-> Index Scan using idx_likes on posts (cost=0.57..55759782.55 rows=3070010 width=12) (actual time=2.657..14.239 rows=12 loops=1)
Filter: (tags @> '{love}'::text[])
Rows Removed by Filter: 10584
Planning time: 0.297 ms
Execution time: 14.276 ms
14 ms is great, but if I try to get it for "tamir", it suddenly takes over 22 seconds! Obviously the query planner is doing something wrong.
EXPLAIN analyse select user_id from posts where tags @> array['tamir'] order by likes desc nulls last limit 12
Limit (cost=0.57..25747.73 rows=12 width=12) (actual time=17552.406..22839.503 rows=12 loops=1)
-> Index Scan using idx_likes on posts (cost=0.57..55759782.55 rows=25988 width=12) (actual time=17552.405..22839.484 rows=12 loops=1)
Filter: (tags @> '{tamir}'::text[])
Rows Removed by Filter: 11785083
Planning time: 0.253 ms
Execution time: 22839.569 ms
After reading the article I've added "user_id" to the ORDER BY, and "tamir" is blazingly fast: 0.2 ms! Now it's doing a sort and a Bitmap Heap Scan instead of an Index Scan.
EXPLAIN analyse select user_id from posts where tags @> array['tamir'] order by likes desc nulls last, user_id limit 12
Limit (cost=101566.17..101566.20 rows=12 width=12) (actual time=0.237..0.238 rows=12 loops=1)
-> Sort (cost=101566.17..101631.14 rows=25988 width=12) (actual time=0.237..0.237 rows=12 loops=1)
Sort Key: likes DESC NULLS LAST, user_id
Sort Method: top-N heapsort Memory: 25kB
-> Bitmap Heap Scan on posts (cost=265.40..100970.40 rows=25988 width=12) (actual time=0.074..0.214 rows=126 loops=1)
Recheck Cond: (tags @> '{tamir}'::text[])
Heap Blocks: exact=44
-> Bitmap Index Scan on idx_tags (cost=0.00..258.91 rows=25988 width=0) (actual time=0.056..0.056 rows=126 loops=1)
Index Cond: (tags @> '{tamir}'::text[])
Planning time: 0.287 ms
Execution time: 0.277 ms
But what happens to "love"? Now it goes from 14 ms to 2.3 seconds...
EXPLAIN analyse select user_id from posts where tags @> array['love'] order by likes desc nulls last, user_id limit 12
Limit (cost=7347142.18..7347142.21 rows=12 width=12) (actual time=2360.784..2360.786 rows=12 loops=1)
-> Sort (cost=7347142.18..7354817.20 rows=3070010 width=12) (actual time=2360.783..2360.784 rows=12 loops=1)
Sort Key: likes DESC NULLS LAST, user_id
Sort Method: top-N heapsort Memory: 25kB
-> Bitmap Heap Scan on posts (cost=28316.58..7276762.77 rows=3070010 width=12) (actual time=595.274..2171.571 rows=1517679 loops=1)
Recheck Cond: (tags @> '{love}'::text[])
Heap Blocks: exact=642705
-> Bitmap Index Scan on idx_tags (cost=0.00..27549.08 rows=3070010 width=0) (actual time=367.080..367.080 rows=1517679 loops=1)
Index Cond: (tags @> '{love}'::text[])
Planning time: 0.226 ms
Execution time: 2360.863 ms
Can somebody shed some light on why this is happening and what the fix would be?
Update
"tag" field had gin index, not b-tree, just typo.
B-tree indexes are not very useful for searching for an element in an array field. You should remove the b-tree index from the tags field and use a GIN index instead:
drop index idx_tags;
create index idx_tags on posts using gin(tags);
And don't add ORDER BY user_id; that sabotages the possibility of using your idx_likes for ordering when there are a lot of rows with the tag you search for.
Also, the likes field should probably be NOT NULL DEFAULT 0.
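A sketch of that change (it assumes existing NULL likes should become zero):
UPDATE posts SET likes = 0 WHERE likes IS NULL;  -- backfill before adding the constraint
ALTER TABLE posts
    ALTER COLUMN likes SET DEFAULT 0,
    ALTER COLUMN likes SET NOT NULL;
With no NULLs in likes, the NULLS LAST in the ORDER BY also becomes unnecessary.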