PostgreSql phraseto_tsquery is very slow - postgresql

I am using PostgreSql 9.6
I have a database table with about 16 million records. I have a jsonb column - logentry -that has a field called "message". It has a GIN index created as so:
CREATE INDEX inettklog_ix_ts_message
ON public.inettklog USING gin
(to_tsvector('english'::regconfig, logentry ->> 'message'::text))
TABLESPACE pg_default;
I want to do a search for "application name".
A query with the WHERE clause
to_tsvector('english', logentry->>'message') ## plainto_tsquery('application name')
executes in 113 msecs and returns 7349 rows
EXPLAIN ANALYZE:
WindowAgg (cost=1812.98..2240.22 rows=95 width=12) (actual time=84.037..84.986 rows=7315 loops=1)
-> Bitmap Heap Scan on inettklog (cost=1812.98..2239.03 rows=95 width=4) (actual time=17.943..81.708 rows=7315 loops=1)
Recheck Cond: (to_tsvector('english'::regconfig, (logentry ->> 'message'::text)) ## plainto_tsquery('application name'::text))
Heap Blocks: exact=7574
-> Bitmap Index Scan on inettklog_ix_ts_message (cost=0.00..1812.96 rows=95 width=0) (actual time=8.542..8.542 rows=8009 loops=1)
Index Cond: (to_tsvector('english'::regconfig, (logentry ->> 'message'::text)) ## plainto_tsquery('application name'::text))
Planning time: 0.387 ms
Execution time: 85.243 ms
But I don't want "application" and "name", I want "application name"
But a query with a WHERE clause of
to_tsvector('english', logentry->>'message') ## phraseto_tsquery('application name')
takes over 2 minutes to run!
EXPLAIN ANALYZE:
WindowAgg (cost=852.98..1280.22 rows=95 width=12) (actual time=145065.204..145066.127 rows=7314 loops=1)
-> Bitmap Heap Scan on inettklog (cost=852.98..1279.03 rows=95 width=4) (actual time=55.180..145030.148 rows=7314 loops=1)
Recheck Cond: (to_tsvector('english'::regconfig, (logentry ->> 'message'::text)) ## phraseto_tsquery('application name'::text))
Heap Blocks: exact=7573
-> Bitmap Index Scan on inettklog_ix_ts_message (cost=0.00..852.96 rows=95 width=0) (actual time=8.196..8.196 rows=8008 loops=1)
Index Cond: (to_tsvector('english'::regconfig, (logentry ->> 'message'::text)) ## phraseto_tsquery('application name'::text))
Planning time: 25.926 ms
Execution time: 145067.052 ms
Surely the "<->" operator works by first locating the rows containing "application" and "name" and then filters the result to find those rows where "name" follows "application".
And, if so, why does it take 2 minutes to run???

The GIN index, unfortunately, cannot support ordering of the lexemes. Your first query is so much faster because it's able to handle everything using the index you built. With the phrase version, the recheck has to actually go to your table and create the ts_vectors to find the order.
You may be able to use a RUM index: https://github.com/postgrespro/rum which does include the ordering information.
This article expands on these points greatly.

Related

PostgreSQL: Sequential scan despite having indexes

I have the following two tables.
person_addresses
address_normalization
The person_addresses table has a field named address_id as the primary key and address_normalization has the corresponding field address_id which has an index on it.
Now, when I explain the following query, I see a sequential scan.
SELECT
count(*)
FROM
mp_member2.person_addresses pa
JOIN mp_member2.address_normalization an ON
an.address_id = pa.address_id
WHERE
an.sr_modification_time >= 1550692189468;
-- Result: 2654
Please refer to the following screenshot.
You see that there is a sequential scan after the hash join. I'm not sure I understand this part; why would a sequential scan follow a hash join.
And as seen in the query above, the set of records returned is also low.
Is this expected behaviour or am I doing something wrong?
Update #1: I also have indices on the sr_modification_time fields of both the tables
Update #2: Full execution plan
Aggregate (cost=206944.74..206944.75 rows=1 width=0) (actual time=2807.844..2807.844 rows=1 loops=1)
Buffers: shared hit=4629 read=82217
-> Hash Join (cost=2881.95..206825.15 rows=47836 width=0) (actual time=0.775..2807.160 rows=2654 loops=1)
Hash Cond: (pa.address_id = an.address_id)
Buffers: shared hit=4629 read=82217
-> Seq Scan on person_addresses pa (cost=0.00..135924.93 rows=4911993 width=8) (actual time=0.005..1374.610 rows=4911993 loops=1)
Buffers: shared hit=4588 read=82217
-> Hash (cost=2432.05..2432.05 rows=35992 width=18) (actual time=0.756..0.756 rows=1005 loops=1)
Buckets: 4096 Batches: 1 Memory Usage: 41kB
Buffers: shared hit=41
-> Index Scan using mp_member2_address_normalization_mod_time on address_normalization an (cost=0.43..2432.05 rows=35992 width=18) (actual time=0.012..0.424 rows=1005 loops=1)
Index Cond: (sr_modification_time >= 1550692189468::bigint)
Buffers: shared hit=41
Planning time: 0.244 ms
Execution time: 2807.885 ms
Update #3: I tried with a newer timestamp and it used an index scan.
EXPLAIN (
ANALYZE
, buffers
, format TEXT
) SELECT
COUNT(*)
FROM
mp_member2.person_addresses pa
JOIN mp_member2.address_normalization an ON
an.address_id = pa.address_id
WHERE
an.sr_modification_time >= 1557507300342;
-- count: 1364
Query Plan:
Aggregate (cost=295.48..295.49 rows=1 width=0) (actual time=2.770..2.770 rows=1 loops=1)
Buffers: shared hit=1404
-> Nested Loop (cost=4.89..295.43 rows=19 width=0) (actual time=0.038..2.491 rows=1364 loops=1)
Buffers: shared hit=1404
-> Index Scan using mp_member2_address_normalization_mod_time on address_normalization an (cost=0.43..8.82 rows=14 width=18) (actual time=0.009..0.142 rows=341 loops=1)
Index Cond: (sr_modification_time >= 1557507300342::bigint)
Buffers: shared hit=14
-> Bitmap Heap Scan on person_addresses pa (cost=4.46..20.43 rows=4 width=8) (actual time=0.004..0.005 rows=4 loops=341)
Recheck Cond: (address_id = an.address_id)
Heap Blocks: exact=360
Buffers: shared hit=1390
-> Bitmap Index Scan on idx_mp_member2_person_addresses_address_id (cost=0.00..4.46 rows=4 width=0) (actual time=0.003..0.003 rows=4 loops=341)
Index Cond: (address_id = an.address_id)
Buffers: shared hit=1030
Planning time: 0.214 ms
Execution time: 2.816 ms
That is the expected behavior because you don't have index for sr_modification_time so after create the hash join db has to scan the whole set to check each row for the sr_modification_time value
You should create:
index for (sr_modification_time)
or composite index for (address_id , sr_modification_time )

limiting the results of a query slows the query

I have a PostgreSql 9.6 database used to record debug logs of an application. It contains 130 million records. The main field is a jsonb type using a GIN index.
If I perform a query like the following it executes quickly:
select id, logentry from inettklog where
logentry #> '{"instance":"1.3.46.670589.11.0.0.11.4.2.0.8743.5.5396.2006120114440692624"}'::jsonb;
Here is the explain analyze:
Bitmap Heap Scan on inettklog (cost=2938.03..491856.81 rows=137552 width=300) (actual time=10.610..12.644 rows=128 loops=1)
Recheck Cond: (logentry #> '{"instance": "1.3.46.670589.11.0.0.11.4.2.0.8743.5.5396.2006120114440692624"}'::jsonb)
Heap Blocks: exact=128
-> Bitmap Index Scan on inettklog_ix_logentry (cost=0.00..2903.64 rows=137552 width=0) (actual time=10.564..10.564 rows=128 loops=1)
Index Cond: (logentry #> '{"instance": "1.3.46.670589.11.0.0.11.4.2.0.8743.5.5396.2006120114440692624"}'::jsonb)
Planning time: 68.522 ms
Execution time: 12.720 ms
(7 rows)
But if I simply add a limit, it suddenly becomes very slow:
select id, logentry from inettklog where
logentry #> '{"instance":"1.3.46.670589.11.0.0.11.4.2.0.8743.5.5396.2006120114440692624"}'::jsonb
limit 20;
It now takes over 20 seconds!
Limit (cost=0.00..1247.91 rows=20 width=300) (actual time=0.142..37791.319 rows=20 loops=1)
-> Seq Scan on inettklog (cost=0.00..8582696.05 rows=137553 width=300) (actual time=0.141..37791.308 rows=20 loops=1)
Filter: (logentry #> '{"instance": "1.3.46.670589.11.0.0.11.4.2.0.8743.5.5396.2006120114440692624"}'::jsonb)
Rows Removed by Filter: 30825572
Planning time: 0.174 ms
Execution time: 37791.351 ms
(6 rows)
Here are the results when ORDER BY is included, even after setting enable_seqscan=off:
With no limit:
set enable_seqscan = off;
set enable_indexscan = on;
select id, date, logentry from inettklog where
logentry #> '{"instance":"1.3.46.670589.11.0.0.11.4.2.0.8743.5.5396.2006120114440692624"}'::jsonb
order by date;
The explain analyze:
Sort (cost=523244.24..523588.24 rows=137600 width=308) (actual time=48.196..48.219 rows=128 loops=1)
Sort Key: date
Sort Method: quicksort Memory: 283kB
-> Bitmap Heap Scan on inettklog (cost=2658.40..491746.00 rows=137600 width=308) (actual time=31.773..47.865 rows=128 loops=1)
Recheck Cond: (logentry #> '{"instance": "1.3.46.670589.11.0.0.11.4.2.0.8743.5.5396.2006120114440692624"}'::jsonb)
Heap Blocks: exact=128
-> Bitmap Index Scan on inettklog_ix_logentry (cost=0.00..2624.00 rows=137600 width=0) (actual time=31.550..31.550 rows=128 loops=1)
Index Cond: (logentry #> '{"instance": "1.3.46.670589.11.0.0.11.4.2.0.8743.5.5396.2006120114440692624"}'::jsonb)
Planning time: 0.181 ms
Execution time: 48.254 ms
(10 rows)
And now when we add the limit:
set enable_seqscan = off;
set enable_indexscan = on;
select id, date, logentry from inettklog where
logentry #> '{"instance":"1.3.46.670589.11.0.0.11.4.2.0.8743.5.5396.2006120114440692624"}'::jsonb
order by date
limit 20;
It now takes 90 seconds!!!
Limit (cost=0.57..4088.36 rows=20 width=308) (actual time=32017.438..98544.017 rows=20 loops=1)
-> Index Scan using inettklog_ix_logdate on inettklog (cost=0.57..28123416.21 rows=137597 width=308) (actual time=32017.437..98544.008 rows=20 loops=1)
Filter: (logentry #> '{"instance": "1.3.46.670589.11.0.0.11.4.2.0.8743.5.5396.2006120114440692624"}'::jsonb)
Rows Removed by Filter: 27829853
Planning time: 0.249 ms
Execution time: 98544.043 ms
(6 rows)
This is all very confusing! I want to be able to provide a utility to quickly query this database but it is all counter-intuitive.
Can anyone explain what is going on?
Can anyone explain the rules?
The estimates are way off. Try to run ANALYZE, possibly with an increased default_statistics_target.
Since PostgreSQL thinks there are so many results, it thinks that it is best off performing a sequential scan and stopping as soon as it has got enough results.
Using limit without indexing the column will slow it down as it will scan whole table then give you the result. So instead of doing that create a index on logentry then run the query with limit. It will give you much faster results.
You can check this answer for reference: PostgreSQL query very slow with limit 1

Postgresql array function doesn't use index

I've created a postgresql array column with a GIN index, and I'm trying to do a contains query on that column. With standard postgresql I can get that working correctly like this:
SELECT d.name
FROM deck d
WHERE d.card_names_array #> string_to_array('"Anger#2","Pingle Who Annoys#2"', ',')
LIMIT 20;
Explain analyze:
Limit (cost=1720.01..1724.02 rows=1 width=31) (actual time=7.787..7.787 rows=0 loops=1)
-> Bitmap Heap Scan on deck deck0_ (cost=1720.01..1724.02 rows=1 width=31) (actual time=7.787..7.787 rows=0 loops=1)
Recheck Cond: (card_names_array #> '{"\"Anger#2\"","\"Pingle Who Annoys#2\""}'::text[])
-> Bitmap Index Scan on deck_card_names_array_idx (cost=0.00..1720.01 rows=1 width=0) (actual time=7.785..7.785 rows=0 loops=1)
Index Cond: (card_names_array #> '{"\"Anger#2\"","\"Pingle Who Annoys#2\""}'::text[])
Planning time: 0.216 ms
Execution time: 7.810 ms
Unfortunately (in this instance) I'm using QueryDSL with which I read the native array functions like #> are impossible to use. However, this answer says you can use arraycontains instead to do the same thing. I got that working, and it returns the correct results, but it doesn't use my index.
SELECT d.name
FROM deck d
WHERE arraycontains(d.card_names_array, string_to_array('"Anger#2","Pingle Who Annoys#2"', ','))=true
LIMIT 20;
Explain analyze:
Limit (cost=0.00..18.83 rows=20 width=31) (actual time=1036.151..1036.151 rows=0 loops=1)
-> Seq Scan on deck deck0_ (cost=0.00..159065.60 rows=168976 width=31) (actual time=1036.150..1036.150 rows=0 loops=1)
Filter: arraycontains(card_names_array, '{"\"Anger#2\"","\"Pingle Who Annoys#2\""}'::text[])
Rows Removed by Filter: 584014
Planning time: 0.204 ms
Execution time: 1036.166 ms
This is my QueryDSL code to create the boolean expression:
predicate.and(Expressions.booleanTemplate(
"arraycontains({0}, string_to_array({1}, ','))=true",
deckQ.cardNamesArray,
filters.cards.joinToString(",") { "${it.cardName}#${it.quantity}" }
))
Is there some way to get it to use my index? Or maybe a different way to do this with QueryDSL to use the native #> function?

Postgres select performance with LIMIT 1

I have a table with about 50 million records in PostgreSQL. Trying to select a post with most "likes" filtering by a "tag". Both fields have b-tree index. For "love" tag I get
EXPLAIN analyse select user_id from posts where tags #> array['love'] order by likes desc nulls last limit 12
Limit (cost=0.57..218.52 rows=12 width=12) (actual time=2.658..14.243 rows=12 loops=1)
-> Index Scan using idx_likes on posts (cost=0.57..55759782.55 rows=3070010 width=12) (actual time=2.657..14.239 rows=12 loops=1)
Filter: (tags #> '{love}'::text[])
Rows Removed by Filter: 10584
Planning time: 0.297 ms
Execution time: 14.276 ms
14 ms is great, but if I try to get it for "tamir", it suddenly becomes over 22 seconds!! Obviously query planner is doing something wrong.
EXPLAIN analyse select user_id from posts where tags #> array['tamir'] order by likes desc nulls last limit 12
Limit (cost=0.57..25747.73 rows=12 width=12) (actual time=17552.406..22839.503 rows=12 loops=1)
-> Index Scan using idx_likes on posts (cost=0.57..55759782.55 rows=25988 width=12) (actual time=17552.405..22839.484 rows=12 loops=1)
Filter: (tags #> '{tamir}'::text[])
Rows Removed by Filter: 11785083
Planning time: 0.253 ms
Execution time: 22839.569 ms
After reading the article I've added "user_id" to the ORDER BY and "tamir" is blazingly fast, 0.2ms! Now it's doing sorts and Bitmap Heap Scan instead of Index scan.
EXPLAIN analyse select user_id from posts where tags #> array['tamir'] order by likes desc nulls last, user_id limit 12
Limit (cost=101566.17..101566.20 rows=12 width=12) (actual time=0.237..0.238 rows=12 loops=1)
-> Sort (cost=101566.17..101631.14 rows=25988 width=12) (actual time=0.237..0.237 rows=12 loops=1)
Sort Key: likes DESC NULLS LAST, user_id
Sort Method: top-N heapsort Memory: 25kB
-> Bitmap Heap Scan on posts (cost=265.40..100970.40 rows=25988 width=12) (actual time=0.074..0.214 rows=126 loops=1)
Recheck Cond: (tags #> '{tamir}'::text[])
Heap Blocks: exact=44
-> Bitmap Index Scan on idx_tags (cost=0.00..258.91 rows=25988 width=0) (actual time=0.056..0.056 rows=126 loops=1)
Index Cond: (tags #> '{tamir}'::text[])
Planning time: 0.287 ms
Execution time: 0.277 ms
But what happens to "love"? Now it goes from 14 ms to 2.3 seconds...
EXPLAIN analyse select user_id from posts where tags #> array['love'] order by likes desc nulls last, user_id limit 12
Limit (cost=7347142.18..7347142.21 rows=12 width=12) (actual time=2360.784..2360.786 rows=12 loops=1)
-> Sort (cost=7347142.18..7354817.20 rows=3070010 width=12) (actual time=2360.783..2360.784 rows=12 loops=1)
Sort Key: likes DESC NULLS LAST, user_id
Sort Method: top-N heapsort Memory: 25kB
-> Bitmap Heap Scan on posts (cost=28316.58..7276762.77 rows=3070010 width=12) (actual time=595.274..2171.571 rows=1517679 loops=1)
Recheck Cond: (tags #> '{love}'::text[])
Heap Blocks: exact=642705
-> Bitmap Index Scan on idx_tags (cost=0.00..27549.08 rows=3070010 width=0) (actual time=367.080..367.080 rows=1517679 loops=1)
Index Cond: (tags #> '{love}'::text[])
Planning time: 0.226 ms
Execution time: 2360.863 ms
Can somebody shed some light on why this is happening and what would the fix.
Update
"tag" field had gin index, not b-tree, just typo.
B-tree indexes are not very useful for searching of element in array field. You should remove b-tree index from tags field and use gin index instead:
drop index idx_tags;
create index idx_tags using gin(tags);
And don't add order by user_id — this sabotages possibility to use your idx_likes for ordering when there's a lot of rows with the tag you search for.
Also likes field should probably be not null default 0.

how to make my pg query faster

Im new to pg, and have a table like this one:
CREATE TABLE tbl_article (
uid TEXT PRIMARY KEY,
...
tags JSONB
)
CREATE INDEX idxgin ON tbl_article USING gin (tags);
uid is something like MongoDB's ObjectID generated in my program.
And this query:
SELECT * FROM tbl_article
WHERE tags #> '{"Category":0}'::jsonb
ORDER BY uid DESC LIMIT 20 OFFSET 10000
Here is explain:
Limit (cost=971.77..971.77 rows=1 width=1047) (actual time=121.811..121.811 rows=0 loops=1)
-> Sort (cost=971.46..971.77 rows=125 width=1047) (actual time=110.653..121.371 rows=8215 loops=1)
Sort Key: uid
Sort Method: external merge Disk: 8736kB
-> Bitmap Heap Scan on tbl_article (cost=496.97..967.11 rows=125 width=1047) (actual time=5.292..14.504 rows=8215 loops=1)
Recheck Cond: (tags #> '{"Category": 0}'::jsonb)
Heap Blocks: exact=3521
-> Bitmap Index Scan on idxgin (cost=0.00..496.93 rows=125 width=0) (actual time=4.817..4.817 rows=8216 loops=1)
Index Cond: (tags #> '{"Category": 0}'::jsonb)
Planning time: 0.105 ms
Execution time: 123.016 ms
Seems sorting is so slow. So how to make this faster
Sorry for my poor English