Postgres select performance with LIMIT 1 - postgresql

I have a table with about 50 million records in PostgreSQL. I'm trying to select the posts with the most "likes", filtered by a "tag". Both fields have a b-tree index. For the "love" tag I get:
EXPLAIN analyse select user_id from posts where tags #> array['love'] order by likes desc nulls last limit 12
Limit (cost=0.57..218.52 rows=12 width=12) (actual time=2.658..14.243 rows=12 loops=1)
-> Index Scan using idx_likes on posts (cost=0.57..55759782.55 rows=3070010 width=12) (actual time=2.657..14.239 rows=12 loops=1)
Filter: (tags #> '{love}'::text[])
Rows Removed by Filter: 10584
Planning time: 0.297 ms
Execution time: 14.276 ms
14 ms is great, but if I try to get it for "tamir", it suddenly takes over 22 seconds!! Obviously the query planner is doing something wrong.
EXPLAIN analyse select user_id from posts where tags #> array['tamir'] order by likes desc nulls last limit 12
Limit (cost=0.57..25747.73 rows=12 width=12) (actual time=17552.406..22839.503 rows=12 loops=1)
-> Index Scan using idx_likes on posts (cost=0.57..55759782.55 rows=25988 width=12) (actual time=17552.405..22839.484 rows=12 loops=1)
Filter: (tags #> '{tamir}'::text[])
Rows Removed by Filter: 11785083
Planning time: 0.253 ms
Execution time: 22839.569 ms
After reading the article I added "user_id" to the ORDER BY, and "tamir" is blazingly fast, 0.2 ms! Now it does a sort and a Bitmap Heap Scan instead of an Index Scan.
EXPLAIN analyse select user_id from posts where tags #> array['tamir'] order by likes desc nulls last, user_id limit 12
Limit (cost=101566.17..101566.20 rows=12 width=12) (actual time=0.237..0.238 rows=12 loops=1)
-> Sort (cost=101566.17..101631.14 rows=25988 width=12) (actual time=0.237..0.237 rows=12 loops=1)
Sort Key: likes DESC NULLS LAST, user_id
Sort Method: top-N heapsort Memory: 25kB
-> Bitmap Heap Scan on posts (cost=265.40..100970.40 rows=25988 width=12) (actual time=0.074..0.214 rows=126 loops=1)
Recheck Cond: (tags #> '{tamir}'::text[])
Heap Blocks: exact=44
-> Bitmap Index Scan on idx_tags (cost=0.00..258.91 rows=25988 width=0) (actual time=0.056..0.056 rows=126 loops=1)
Index Cond: (tags #> '{tamir}'::text[])
Planning time: 0.287 ms
Execution time: 0.277 ms
But what happens to "love"? Now it goes from 14 ms to 2.3 seconds...
EXPLAIN analyse select user_id from posts where tags #> array['love'] order by likes desc nulls last, user_id limit 12
Limit (cost=7347142.18..7347142.21 rows=12 width=12) (actual time=2360.784..2360.786 rows=12 loops=1)
-> Sort (cost=7347142.18..7354817.20 rows=3070010 width=12) (actual time=2360.783..2360.784 rows=12 loops=1)
Sort Key: likes DESC NULLS LAST, user_id
Sort Method: top-N heapsort Memory: 25kB
-> Bitmap Heap Scan on posts (cost=28316.58..7276762.77 rows=3070010 width=12) (actual time=595.274..2171.571 rows=1517679 loops=1)
Recheck Cond: (tags #> '{love}'::text[])
Heap Blocks: exact=642705
-> Bitmap Index Scan on idx_tags (cost=0.00..27549.08 rows=3070010 width=0) (actual time=367.080..367.080 rows=1517679 loops=1)
Index Cond: (tags #> '{love}'::text[])
Planning time: 0.226 ms
Execution time: 2360.863 ms
Can somebody shed some light on why this is happening and what the fix would be?
Update
"tag" field had gin index, not b-tree, just typo.

B-tree indexes are not very useful for searching for an element in an array field. You should remove the b-tree index from the tags field and use a gin index instead:
drop index idx_tags;
create index idx_tags on posts using gin(tags);
And don't add order by user_id: it prevents idx_likes from being used for the ordering, which hurts when there are a lot of rows with the tag you search for.
Also, the likes field should probably be NOT NULL DEFAULT 0.
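The NOT NULL change could look like this (a sketch, assuming existing NULL likes should be treated as zero):

```sql
-- Backfill existing NULLs, then enforce NOT NULL with a default of 0
UPDATE posts SET likes = 0 WHERE likes IS NULL;
ALTER TABLE posts ALTER COLUMN likes SET DEFAULT 0;
ALTER TABLE posts ALTER COLUMN likes SET NOT NULL;
```

Once there are no NULLs in likes, the NULLS LAST in the ORDER BY becomes unnecessary.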

Related

Limit 1 is slower than limit 100

I have a sql query
SELECT * FROM sharescheduledjob WHERE sharescheduledjob.status = 'SCHEDULED' ORDER BY id ASC LIMIT 1
With limit 1 it is very slow:
Limit (cost=0.43..46.43 rows=1 width=8) (actual time=3490.958..3490.959 rows=1 loops=1)
-> Index Scan using sharescheduledjob_pkey on sharescheduledjob sharesched0_ (cost=0.43..171383.41 rows=3726 width=8) (actual time=3490.956..3490.956 rows=1 loops=1)
Filter: ((status)::text = 'SCHEDULED'::text)
Rows Removed by Filter: 6058511
Total runtime: 3490.985 ms
But with limit 100 it's pretty fast:
Limit (cost=248.04..248.29 rows=100 width=8) (actual time=12.968..12.994 rows=100 loops=1)
-> Sort (cost=248.04..257.36 rows=3726 width=8) (actual time=12.966..12.978 rows=100 loops=1)
Sort Key: id
Sort Method: top-N heapsort Memory: 29kB
-> Index Scan using sharescheduledjob_status on sharescheduledjob sharesched0_ (cost=0.43..105.64 rows=3726 width=8) (actual time=0.044..8.636 rows=9284 loops=1)
Index Cond: ((status)::text = 'SCHEDULED'::text)
Total runtime: 13.042 ms
Is it possible to change database settings to force it to look up the status first?
I already tried
ANALYZE sharescheduledjob
but it didn't help.
And it is hard to change the query because it's generated by Hibernate from HQL queries.
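One option that doesn't require touching the Hibernate-generated SQL is a partial index matching the WHERE clause (a sketch; the index name is made up):

```sql
-- Partial index: contains only SCHEDULED rows, already ordered by id, so
-- "WHERE status = 'SCHEDULED' ORDER BY id LIMIT 1" can read a single entry.
CREATE INDEX sharescheduledjob_scheduled_id
    ON sharescheduledjob (id)
    WHERE status = 'SCHEDULED';
```

The planner can only use it when the query's WHERE clause provably implies the index predicate, which a literal `status = 'SCHEDULED'` condition does.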

Postgres optimize/replace DISTINCT

I'm trying to select the users with the most "followed_by", joining to posts to filter by a "tag". Both tables have millions of records. I'm using distinct to select only unique users.
select distinct u.*
from users u join posts p
on u.id=p.user_id
where p.tags #> ARRAY['love']
order by u.followed_by desc nulls last limit 21
It runs for over 16 s, seemingly because the distinct causes a Seq Scan over 6+ million users. Here is the explain analyse:
Limit (cost=15509958.30..15509959.09 rows=21 width=292) (actual time=16882.861..16883.753 rows=21 loops=1)
-> Unique (cost=15509958.30..15595560.30 rows=2282720 width=292) (actual time=16882.859..16883.749 rows=21 loops=1)
-> Sort (cost=15509958.30..15515665.10 rows=2282720 width=292) (actual time=16882.857..16883.424 rows=525 loops=1)
Sort Key: u.followed_by DESC NULLS LAST, u.id, u.username, u.fullname, u.follows, u.media, u.profile_pic_url_hd, u.is_private, u.is_verified, u.biography, u.external_url, u.updated, u.location_id, u.final_post
Sort Method: external merge Disk: 583064kB
-> Gather (cost=1000.57..14956785.06 rows=2282720 width=292) (actual time=0.377..11506.001 rows=1680890 loops=1)
Workers Planned: 9
Workers Launched: 9
-> Nested Loop (cost=0.57..14727513.06 rows=253636 width=292) (actual time=1.013..12031.634 rows=168089 loops=10)
-> Parallel Seq Scan on posts p (cost=0.00..13187797.79 rows=253636 width=8) (actual time=0.940..10872.630 rows=168089 loops=10)
Filter: (tags #> '{love}'::text[])
Rows Removed by Filter: 6251355
-> Index Scan using user_pk on users u (cost=0.57..6.06 rows=1 width=292) (actual time=0.006..0.006 rows=1 loops=1680890)
Index Cond: (id = p.user_id)
Planning time: 1.276 ms
Execution time: 16964.271 ms
I would appreciate tips on how to make this fast.
Update
Thanks to @a_horse_with_no_name, for the "love" tag it became really fast:
Limit (cost=1.14..4293986.91 rows=21 width=292) (actual time=1.735..31.613 rows=21 loops=1)
-> Nested Loop Semi Join (cost=1.14..10959887484.70 rows=53600 width=292) (actual time=1.733..31.607 rows=21 loops=1)
-> Index Scan using idx_followed_by on users u (cost=0.57..322693786.19 rows=232404560 width=292) (actual time=0.011..0.103 rows=32 loops=1)
-> Index Scan using fki_user_fk1 on posts p (cost=0.57..1943.85 rows=43 width=8) (actual time=0.983..0.983 rows=1 loops=32)
Index Cond: (user_id = u.id)
Filter: (tags #> '{love}'::text[])
Rows Removed by Filter: 1699
Planning time: 1.322 ms
Execution time: 31.656 ms
However, for some other tags like "beautiful" it's better, but still a little slow. It also takes a different execution path:
Limit (cost=3893365.84..3893365.89 rows=21 width=292) (actual time=2813.876..2813.892 rows=21 loops=1)
-> Sort (cost=3893365.84..3893499.84 rows=53600 width=292) (actual time=2813.874..2813.887 rows=21 loops=1)
Sort Key: u.followed_by DESC NULLS LAST
Sort Method: top-N heapsort Memory: 34kB
-> Nested Loop (cost=3437011.27..3891920.70 rows=53600 width=292) (actual time=1130.847..2779.928 rows=35230 loops=1)
-> HashAggregate (cost=3437010.70..3437546.70 rows=53600 width=8) (actual time=1130.809..1148.209 rows=35230 loops=1)
Group Key: p.user_id
-> Bitmap Heap Scan on posts p (cost=10484.20..3434173.21 rows=1134993 width=8) (actual time=268.602..972.390 rows=814919 loops=1)
Recheck Cond: (tags #> '{beautiful}'::text[])
Heap Blocks: exact=347002
-> Bitmap Index Scan on idx_tags (cost=0.00..10200.45 rows=1134993 width=0) (actual time=168.453..168.453 rows=814919 loops=1)
Index Cond: (tags #> '{beautiful}'::text[])
-> Index Scan using user_pk on users u (cost=0.57..8.47 rows=1 width=292) (actual time=0.045..0.046 rows=1 loops=35230)
Index Cond: (id = p.user_id)
Planning time: 1.388 ms
Execution time: 2814.132 ms
I already had a gin index on 'tags' in place.
This should be faster:
select *
from users u
where exists (select *
from posts p
where u.id=p.user_id
and p.tags #> ARRAY['love'])
order by u.followed_by desc nulls last
limit 21;
If there are only a few (<10%) posts with that tag, an index on posts.tags should help as well:
create index on posts using gin (tags);

Postgres Query Optimization w/ simple join

I have the following query:
SELECT "person_dimensions"."dimension"
FROM "person_dimensions"
join users
on users.id = person_dimensions.user_id
where users.team_id = 2
The following is the result of EXPLAIN ANALYZE:
Nested Loop (cost=0.43..93033.84 rows=452 width=11) (actual time=1245.321..42915.426 rows=827 loops=1)
-> Seq Scan on person_dimensions (cost=0.00..254.72 rows=13772 width=15) (actual time=0.022..9.907 rows=13772 loops=1)
-> Index Scan using users_pkey on users (cost=0.43..6.73 rows=1 width=4) (actual time=2.978..3.114 rows=0 loops=13772)
Index Cond: (id = person_dimensions.user_id)
Filter: (team_id = 2)
Rows Removed by Filter: 1
Planning time: 0.396 ms
Execution time: 42915.678 ms
Indexes exist on person_dimensions.user_id and users.team_id, so it is unclear why this seemingly simple query is taking so long.
Maybe it has something to do with team_id not being usable in the join condition? Any ideas how to speed this up?
EDIT:
I tried this query:
SELECT "person_dimensions"."dimension"
FROM "person_dimensions"
join users ON users.id = person_dimensions.user_id
WHERE users.id IN (2337,2654,3501,56,4373,1060,3170,97,4629,41,3175,4541,2827)
which contains the ids returned by the subquery:
SELECT id FROM users WHERE team_id = 2
The result was 380ms versus 42s as above. I could use this as a workaround, but I am really curious as to what is going on here...
I rebooted my DB server yesterday, and when it came back up this same query was performing as expected with a completely different query plan that used expected indices:
QUERY PLAN
Hash Join (cost=1135.63..1443.45 rows=84 width=11) (actual time=0.354..6.312 rows=835 loops=1)
Hash Cond: (person_dimensions.user_id = users.id)
-> Seq Scan on person_dimensions (cost=0.00..255.17 rows=13817 width=15) (actual time=0.002..2.764 rows=13902 loops=1)
-> Hash (cost=1132.96..1132.96 rows=214 width=4) (actual time=0.175..0.175 rows=60 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 11kB
-> Bitmap Heap Scan on users (cost=286.07..1132.96 rows=214 width=4) (actual time=0.032..0.157 rows=60 loops=1)
Recheck Cond: (team_id = 2)
Heap Blocks: exact=68
-> Bitmap Index Scan on index_users_on_team_id (cost=0.00..286.02 rows=214 width=0) (actual time=0.021..0.021 rows=82 loops=1)
Index Cond: (team_id = 2)
Planning time: 0.215 ms
Execution time: 6.474 ms
Does anyone have any ideas why it required a reboot to pick all of this up? Could it be that manual vacuums were needed that hadn't been done in a while, or something like that? Recall that I did run ANALYZE on the relevant tables before the reboot and it didn't change anything.
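One thing worth checking (an assumption, since the question doesn't show it) is whether autovacuum/autoanalyze had been keeping up on the involved tables; stale statistics can produce plans like the first one:

```sql
-- When were these tables last vacuumed/analyzed?
SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname IN ('person_dimensions', 'users');

-- Refresh both dead-tuple cleanup and planner statistics manually:
VACUUM ANALYZE person_dimensions;
VACUUM ANALYZE users;
```

If the last_* timestamps are old, tuning autovacuum thresholds for these tables may prevent a recurrence.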

how to make my pg query faster

I'm new to Postgres, and have a table like this one:
CREATE TABLE tbl_article (
uid TEXT PRIMARY KEY,
...
tags JSONB
)
CREATE INDEX idxgin ON tbl_article USING gin (tags);
uid is something like MongoDB's ObjectID generated in my program.
And this query:
SELECT * FROM tbl_article
WHERE tags #> '{"Category":0}'::jsonb
ORDER BY uid DESC LIMIT 20 OFFSET 10000
Here is explain:
Limit (cost=971.77..971.77 rows=1 width=1047) (actual time=121.811..121.811 rows=0 loops=1)
-> Sort (cost=971.46..971.77 rows=125 width=1047) (actual time=110.653..121.371 rows=8215 loops=1)
Sort Key: uid
Sort Method: external merge Disk: 8736kB
-> Bitmap Heap Scan on tbl_article (cost=496.97..967.11 rows=125 width=1047) (actual time=5.292..14.504 rows=8215 loops=1)
Recheck Cond: (tags #> '{"Category": 0}'::jsonb)
Heap Blocks: exact=3521
-> Bitmap Index Scan on idxgin (cost=0.00..496.93 rows=125 width=0) (actual time=4.817..4.817 rows=8216 loops=1)
Index Cond: (tags #> '{"Category": 0}'::jsonb)
Planning time: 0.105 ms
Execution time: 123.016 ms
It seems the sort is the slow part. How can I make this faster?
Sorry for my poor English
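Two things that might help here (sketches, not tested against this schema): the "external merge Disk" line suggests work_mem is too small for the sort, and sorting whole rows (SELECT *, width=1047) is much heavier than sorting just uids and joining the wide rows back afterwards:

```sql
-- Give the sort enough memory to stay in RAM (session-level; tune to taste)
SET work_mem = '32MB';

-- Sort/paginate the narrow uids first, then fetch the wide rows
SELECT a.*
FROM tbl_article a
JOIN (
    SELECT uid
    FROM tbl_article
    WHERE tags #> '{"Category":0}'::jsonb
    ORDER BY uid DESC
    LIMIT 20 OFFSET 10000
) t USING (uid)
ORDER BY a.uid DESC;
```

Large OFFSETs are still inherently costly, since the first 10000 matches must be produced and discarded; keyset pagination (WHERE uid < last_seen_uid) avoids that if the application can track the last uid.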

Accelerate query where I sort on a unique key and another one

As part of a query (one side of a join actually)
SELECT DISTINCT ON (shop_id) shop_id, id, color
FROM products
ORDER BY shop_id, id
There are btree indices on shop_id and id, but for some reason they are not used.
EXPLAIN ANALYZE says:
-> Unique (cost=724198.71..742348.37 rows=360949 width=71) (actual time=179157.101..195998.646 rows=1673170 loops=1)
-> Sort (cost=724198.71..733273.54 rows=3629931 width=71) (actual time=179157.095..191853.377 rows=3599644 loops=1)
Sort Key: products.shop_id, products.id
Sort Method: external merge Disk: 285064kB
-> Seq Scan on products (cost=0.00..328690.31 rows=3629931 width=71) (actual time=0.025..7575.905 rows=3629713 loops=1)
I also tried to create a multicolumn btree index on both shop_id and id, but it wasn't used. (Maybe I did it wrong and would have had to restart PostgreSQL or something?)
How can I accelerate this query (with an index or something)?
Adding a multicolumn index on all three ("CREATE INDEX products_idx ON products USING btree (shop_id, id, color);") doesn't get used either.
I experimented, and if I set a LIMIT of 10000, it will be used:
Limit (cost=0.00..161337.91 rows=10000 width=14) (actual time=0.043..15.973 rows=10000 loops=1)
-> Unique (cost=0.00..2925620.98 rows=181335 width=14) (actual time=0.042..15.249 rows=10000 loops=1)
-> Index Scan using products_idx on products (cost=0.00..2922753.69 rows=1146917 width=14) (actual time=0.041..12.927 rows=14004 loops=1)
Total runtime: 16.293 ms
There are around 3*10^6 entries (3 million)
For a larger LIMIT, it uses a sequential scan again :(
Limit (cost=213533.52..215114.73 rows=50000 width=14) (actual time=816.580..835.075 rows=50000 loops=1)
-> Unique (cost=213533.52..219268.11 rows=181335 width=14) (actual time=816.578..831.963 rows=50000 loops=1)
-> Sort (cost=213533.52..216400.81 rows=1146917 width=14) (actual time=816.576..823.034 rows=80830 loops=1)
Sort Key: shop_id, id
Sort Method: quicksort Memory: 107455kB
-> Seq Scan on products (cost=0.00..98100.17 rows=1146917 width=14) (actual time=0.019..296.867 rows=1146917 loops=1)
Total runtime: 840.788 ms
(I also had to raise the work_mem here to 128MB, otherwise there would be an external merge which takes even longer)
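Since the LIMIT experiment shows the (shop_id, id) index works when only a prefix is needed, a common workaround for a full DISTINCT ON scan is to emulate a "loose index scan" with a recursive CTE that hops from one shop_id to the next. This is a sketch using the column names from the question:

```sql
WITH RECURSIVE t AS (
    (SELECT shop_id, id, color
     FROM products
     ORDER BY shop_id, id
     LIMIT 1)
    UNION ALL
    SELECT p.shop_id, p.id, p.color
    FROM t
    CROSS JOIN LATERAL (
        SELECT shop_id, id, color
        FROM products
        WHERE shop_id > t.shop_id   -- jump past the current shop
        ORDER BY shop_id, id
        LIMIT 1
    ) p
)
SELECT * FROM t;
```

Each recursion step is a single index probe, so the whole query costs roughly one probe per distinct shop_id instead of sorting all 3.6 million rows.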