I'm new to PostgreSQL and have a table like this one:
CREATE TABLE tbl_article (
uid TEXT PRIMARY KEY,
...
tags JSONB
);
CREATE INDEX idxgin ON tbl_article USING gin (tags);
uid is generated in my program and is something like MongoDB's ObjectID.
I run this query:
SELECT * FROM tbl_article
WHERE tags @> '{"Category":0}'::jsonb
ORDER BY uid DESC LIMIT 20 OFFSET 10000;
Here is the EXPLAIN ANALYZE output:
Limit (cost=971.77..971.77 rows=1 width=1047) (actual time=121.811..121.811 rows=0 loops=1)
-> Sort (cost=971.46..971.77 rows=125 width=1047) (actual time=110.653..121.371 rows=8215 loops=1)
Sort Key: uid
Sort Method: external merge Disk: 8736kB
-> Bitmap Heap Scan on tbl_article (cost=496.97..967.11 rows=125 width=1047) (actual time=5.292..14.504 rows=8215 loops=1)
Recheck Cond: (tags @> '{"Category": 0}'::jsonb)
Heap Blocks: exact=3521
-> Bitmap Index Scan on idxgin (cost=0.00..496.93 rows=125 width=0) (actual time=4.817..4.817 rows=8216 loops=1)
Index Cond: (tags @> '{"Category": 0}'::jsonb)
Planning time: 0.105 ms
Execution time: 123.016 ms
The sort seems to be the slow part, and it spills to disk. How can I make this query faster?
Sorry for my poor English
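One direction the plan suggests, sketched under assumptions rather than measured: the sort reports "external merge Disk: 8736kB", so giving the session more work_mem may let it sort in memory, and keyset pagination avoids producing and discarding the rows skipped by OFFSET. The 64MB value and the 'last_seen_uid' literal are placeholders, not values from the question.

```sql
-- Assumption: 64MB is an illustrative value. An in-memory sort usually needs
-- more memory than the ~8.7 MB the plan wrote to disk, so tune it per server.
SET work_mem = '64MB';

-- Keyset pagination: instead of OFFSET, remember the last uid of the
-- previous page and continue from there.
-- 'last_seen_uid' is a placeholder for that remembered value.
SELECT *
FROM tbl_article
WHERE tags @> '{"Category":0}'::jsonb
  AND uid < 'last_seen_uid'
ORDER BY uid DESC
LIMIT 20;
```

With a small LIMIT and no OFFSET, the sort only has to keep the top 20 rows, which should fit in memory regardless of how many rows match the tag.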
Related
I have a Postgres table with ~50 columns and ~75 million records.
It has the following index among others:
"index_shipments_on_buyer_supplier_id" btree (buyer_supplier_id)
EXPLAIN shows it wants to use a sequential scan:
db=# EXPLAIN SELECT COUNT(*) FROM "shipments" WHERE (buyer_supplier_id IS NULL)
db-# ;
QUERY PLAN
--------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=15427130.32..15427130.33 rows=1 width=8)
-> Gather (cost=15427130.11..15427130.32 rows=2 width=8)
Workers Planned: 2
-> Partial Aggregate (cost=15426130.11..15426130.12 rows=1 width=8)
-> Parallel Seq Scan on shipments (cost=0.00..15354385.03 rows=28698029 width=0)
Filter: (buyer_supplier_id IS NULL)
(6 rows)
Now force use of the index:
db=# set enable_seqscan = false;
SET
db=# EXPLAIN SELECT COUNT(*) FROM "shipments" WHERE (buyer_supplier_id IS NULL);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=17314493.48..17314493.49 rows=1 width=8)
-> Gather (cost=17314493.26..17314493.47 rows=2 width=8)
Workers Planned: 2
-> Partial Aggregate (cost=17313493.26..17313493.27 rows=1 width=8)
-> Parallel Bitmap Heap Scan on shipments (cost=1922711.90..17241748.19 rows=28698029 width=0)
Recheck Cond: (buyer_supplier_id IS NULL)
-> Bitmap Index Scan on index_shipments_on_buyer_supplier_id (cost=0.00..1905493.08 rows=68875269 width=0)
Index Cond: (buyer_supplier_id IS NULL)
(8 rows)
db=# EXPLAIN ANALYZE SELECT COUNT(*) FROM "shipments" WHERE (buyer_supplier_id IS NULL);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=17314493.48..17314493.49 rows=1 width=8) (actual time=795551.977..795573.311 rows=1 loops=1)
-> Gather (cost=17314493.26..17314493.47 rows=2 width=8) (actual time=795528.063..795573.304 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=17313493.26..17313493.27 rows=1 width=8) (actual time=795519.276..795519.277 rows=1 loops=3)
-> Parallel Bitmap Heap Scan on shipments (cost=1922711.90..17241748.19 rows=28698029 width=0) (actual time=7642.771..794473.494 rows=5439073 loops=3)
Recheck Cond: (buyer_supplier_id IS NULL)
Rows Removed by Index Recheck: 10948389
Heap Blocks: exact=14343 lossy=3993510
-> Bitmap Index Scan on index_shipments_on_buyer_supplier_id (cost=0.00..1905493.08 rows=68875269 width=0) (actual time=7633.652..7633.652 rows=62174015 loops=1)
Index Cond: (buyer_supplier_id IS NULL)
Planning time: 0.102 ms
Execution time: 795573.347 ms
(13 rows)
I don't understand why getting a COUNT of NULL buyer_supplier_ids should be so taxing on the system. What am I missing here, and how can I make this count fast?
By default, Postgres places nulls last in a b-tree index. See https://www.postgresql.org/docs/current/indexes-ordering.html for more information.
In your case, if buyer_supplier_id has high cardinality, Postgres may have to walk through much of the index to reach the nulls, which may be why the planner chooses a sequential scan.
To fix this, you can either recreate the index with the NULLS FIRST option or create a partial index with a buyer_supplier_id IS NULL condition, as @a_horse_with_no_name mentioned.
Another thing to look into is index bloat. If this table is updated frequently and has not been vacuumed, the index may become bloated, which reduces performance.
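The two index options and the vacuum step could look roughly like this. This is a sketch: the partial index name is hypothetical, and the table and column names are taken from the question.

```sql
-- Option 1: a partial index that contains only the NULL rows.
-- The index name is hypothetical.
CREATE INDEX index_shipments_on_null_buyer_supplier_id
    ON shipments (buyer_supplier_id)
    WHERE buyer_supplier_id IS NULL;

-- Option 2: rebuild the existing index with nulls sorted first.
DROP INDEX index_shipments_on_buyer_supplier_id;
CREATE INDEX index_shipments_on_buyer_supplier_id
    ON shipments (buyer_supplier_id NULLS FIRST);

-- Clean up dead tuples and refresh planner statistics.
VACUUM (ANALYZE) shipments;
```

Note that the plan in the question estimates about 68.9 million NULL rows in a table of about 75 million, so either index would still cover most of the table, and the count may remain expensive.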
I have the following two tables.
person_addresses
address_normalization
The person_addresses table has a field named address_id as the primary key and address_normalization has the corresponding field address_id which has an index on it.
Now, when I explain the following query, I see a sequential scan.
SELECT
count(*)
FROM
mp_member2.person_addresses pa
JOIN mp_member2.address_normalization an ON
an.address_id = pa.address_id
WHERE
an.sr_modification_time >= 1550692189468;
-- Result: 2654
Please refer to the following screenshot.
You can see that there is a sequential scan after the hash join. I'm not sure I understand this part: why would a sequential scan follow a hash join?
And as seen in the query above, the set of records returned is also low.
Is this expected behaviour or am I doing something wrong?
Update #1: I also have indices on the sr_modification_time fields of both the tables
Update #2: Full execution plan
Aggregate (cost=206944.74..206944.75 rows=1 width=0) (actual time=2807.844..2807.844 rows=1 loops=1)
Buffers: shared hit=4629 read=82217
-> Hash Join (cost=2881.95..206825.15 rows=47836 width=0) (actual time=0.775..2807.160 rows=2654 loops=1)
Hash Cond: (pa.address_id = an.address_id)
Buffers: shared hit=4629 read=82217
-> Seq Scan on person_addresses pa (cost=0.00..135924.93 rows=4911993 width=8) (actual time=0.005..1374.610 rows=4911993 loops=1)
Buffers: shared hit=4588 read=82217
-> Hash (cost=2432.05..2432.05 rows=35992 width=18) (actual time=0.756..0.756 rows=1005 loops=1)
Buckets: 4096 Batches: 1 Memory Usage: 41kB
Buffers: shared hit=41
-> Index Scan using mp_member2_address_normalization_mod_time on address_normalization an (cost=0.43..2432.05 rows=35992 width=18) (actual time=0.012..0.424 rows=1005 loops=1)
Index Cond: (sr_modification_time >= 1550692189468::bigint)
Buffers: shared hit=41
Planning time: 0.244 ms
Execution time: 2807.885 ms
Update #3: I tried with a newer timestamp and it used an index scan.
EXPLAIN (
ANALYZE
, buffers
, format TEXT
) SELECT
COUNT(*)
FROM
mp_member2.person_addresses pa
JOIN mp_member2.address_normalization an ON
an.address_id = pa.address_id
WHERE
an.sr_modification_time >= 1557507300342;
-- count: 1364
Query Plan:
Aggregate (cost=295.48..295.49 rows=1 width=0) (actual time=2.770..2.770 rows=1 loops=1)
Buffers: shared hit=1404
-> Nested Loop (cost=4.89..295.43 rows=19 width=0) (actual time=0.038..2.491 rows=1364 loops=1)
Buffers: shared hit=1404
-> Index Scan using mp_member2_address_normalization_mod_time on address_normalization an (cost=0.43..8.82 rows=14 width=18) (actual time=0.009..0.142 rows=341 loops=1)
Index Cond: (sr_modification_time >= 1557507300342::bigint)
Buffers: shared hit=14
-> Bitmap Heap Scan on person_addresses pa (cost=4.46..20.43 rows=4 width=8) (actual time=0.004..0.005 rows=4 loops=341)
Recheck Cond: (address_id = an.address_id)
Heap Blocks: exact=360
Buffers: shared hit=1390
-> Bitmap Index Scan on idx_mp_member2_person_addresses_address_id (cost=0.00..4.46 rows=4 width=0) (actual time=0.003..0.003 rows=4 loops=341)
Index Cond: (address_id = an.address_id)
Buffers: shared hit=1030
Planning time: 0.214 ms
Execution time: 2.816 ms
That is the expected behavior because you don't have an index on sr_modification_time, so after building the hash join the database has to scan the whole set to check the sr_modification_time value of each row.
You should create either:
an index on (sr_modification_time)
or a composite index on (address_id, sr_modification_time)
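As a sketch, the two suggested indexes could be created like this. The index names are hypothetical. The plans in the question already use an index named mp_member2_address_normalization_mod_time, so the first statement may be redundant there.

```sql
-- Single-column index on the filter column (index name is hypothetical).
CREATE INDEX idx_address_normalization_mod_time
    ON mp_member2.address_normalization (sr_modification_time);

-- Composite index covering the join column and the filter column
-- (index name is hypothetical).
CREATE INDEX idx_address_normalization_address_mod_time
    ON mp_member2.address_normalization (address_id, sr_modification_time);
```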
I am using PostgreSQL 9.6.
I have a database table with about 16 million records. It has a jsonb column, logentry, that has a field called "message". The field has a GIN index created like so:
CREATE INDEX inettklog_ix_ts_message
ON public.inettklog USING gin
(to_tsvector('english'::regconfig, logentry ->> 'message'::text))
TABLESPACE pg_default;
I want to do a search for "application name".
A query with the WHERE clause
to_tsvector('english', logentry->>'message') @@ plainto_tsquery('application name')
executes in 113 msecs and returns 7349 rows
EXPLAIN ANALYZE:
WindowAgg (cost=1812.98..2240.22 rows=95 width=12) (actual time=84.037..84.986 rows=7315 loops=1)
-> Bitmap Heap Scan on inettklog (cost=1812.98..2239.03 rows=95 width=4) (actual time=17.943..81.708 rows=7315 loops=1)
Recheck Cond: (to_tsvector('english'::regconfig, (logentry ->> 'message'::text)) @@ plainto_tsquery('application name'::text))
Heap Blocks: exact=7574
-> Bitmap Index Scan on inettklog_ix_ts_message (cost=0.00..1812.96 rows=95 width=0) (actual time=8.542..8.542 rows=8009 loops=1)
Index Cond: (to_tsvector('english'::regconfig, (logentry ->> 'message'::text)) @@ plainto_tsquery('application name'::text))
Planning time: 0.387 ms
Execution time: 85.243 ms
But I don't want "application" and "name", I want "application name"
But a query with a WHERE clause of
to_tsvector('english', logentry->>'message') @@ phraseto_tsquery('application name')
takes over 2 minutes to run!
EXPLAIN ANALYZE:
WindowAgg (cost=852.98..1280.22 rows=95 width=12) (actual time=145065.204..145066.127 rows=7314 loops=1)
-> Bitmap Heap Scan on inettklog (cost=852.98..1279.03 rows=95 width=4) (actual time=55.180..145030.148 rows=7314 loops=1)
Recheck Cond: (to_tsvector('english'::regconfig, (logentry ->> 'message'::text)) @@ phraseto_tsquery('application name'::text))
Heap Blocks: exact=7573
-> Bitmap Index Scan on inettklog_ix_ts_message (cost=0.00..852.96 rows=95 width=0) (actual time=8.196..8.196 rows=8008 loops=1)
Index Cond: (to_tsvector('english'::regconfig, (logentry ->> 'message'::text)) @@ phraseto_tsquery('application name'::text))
Planning time: 25.926 ms
Execution time: 145067.052 ms
Surely the "<->" operator works by first locating the rows that contain both "application" and "name" and then filtering that result down to the rows where "name" follows "application"?
And, if so, why does it take 2 minutes to run?
Unfortunately, a GIN index does not store the positions of the lexemes, so it cannot check their order. Your first query is so much faster because it can handle everything using the index you built. With the phrase version, the recheck has to go to your table and rebuild the tsvector for each candidate row to find the order.
You may be able to use a RUM index (https://github.com/postgrespro/rum), which does include the ordering information.
This article expands on these points greatly.
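A minimal sketch of the RUM approach, assuming the extension from the repository above is installed on the server and supports your PostgreSQL version. The index name is hypothetical; rum_tsvector_ops is the operator class that the RUM README documents for tsvector values.

```sql
-- Assumption: the rum extension has been built and installed on the server.
CREATE EXTENSION rum;

-- Same indexed expression as the existing GIN index, but stored in a RUM
-- index, which keeps lexeme positions. The index name is hypothetical.
CREATE INDEX inettklog_ix_rum_message
    ON public.inettklog USING rum
    (to_tsvector('english'::regconfig, logentry ->> 'message') rum_tsvector_ops);

-- The phrase query itself is unchanged.
SELECT count(*)
FROM public.inettklog
WHERE to_tsvector('english', logentry ->> 'message')
      @@ phraseto_tsquery('application name');
```

Comparing EXPLAIN ANALYZE output before and after would show whether the planner picks the new index and whether the heap recheck cost actually drops.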
I have a table with about 50 million records in PostgreSQL. I'm trying to select the posts with the most "likes", filtered by a "tag". Both fields have a b-tree index. For the "love" tag I get:
EXPLAIN analyse select user_id from posts where tags @> array['love'] order by likes desc nulls last limit 12
Limit (cost=0.57..218.52 rows=12 width=12) (actual time=2.658..14.243 rows=12 loops=1)
-> Index Scan using idx_likes on posts (cost=0.57..55759782.55 rows=3070010 width=12) (actual time=2.657..14.239 rows=12 loops=1)
Filter: (tags @> '{love}'::text[])
Rows Removed by Filter: 10584
Planning time: 0.297 ms
Execution time: 14.276 ms
14 ms is great, but if I try the same for "tamir", it suddenly takes over 22 seconds! Obviously the query planner is doing something wrong.
EXPLAIN analyse select user_id from posts where tags @> array['tamir'] order by likes desc nulls last limit 12
Limit (cost=0.57..25747.73 rows=12 width=12) (actual time=17552.406..22839.503 rows=12 loops=1)
-> Index Scan using idx_likes on posts (cost=0.57..55759782.55 rows=25988 width=12) (actual time=17552.405..22839.484 rows=12 loops=1)
Filter: (tags @> '{tamir}'::text[])
Rows Removed by Filter: 11785083
Planning time: 0.253 ms
Execution time: 22839.569 ms
After reading the article, I added "user_id" to the ORDER BY, and "tamir" is now blazingly fast at 0.2 ms! It now uses a Sort and a Bitmap Heap Scan instead of an Index Scan.
EXPLAIN analyse select user_id from posts where tags @> array['tamir'] order by likes desc nulls last, user_id limit 12
Limit (cost=101566.17..101566.20 rows=12 width=12) (actual time=0.237..0.238 rows=12 loops=1)
-> Sort (cost=101566.17..101631.14 rows=25988 width=12) (actual time=0.237..0.237 rows=12 loops=1)
Sort Key: likes DESC NULLS LAST, user_id
Sort Method: top-N heapsort Memory: 25kB
-> Bitmap Heap Scan on posts (cost=265.40..100970.40 rows=25988 width=12) (actual time=0.074..0.214 rows=126 loops=1)
Recheck Cond: (tags @> '{tamir}'::text[])
Heap Blocks: exact=44
-> Bitmap Index Scan on idx_tags (cost=0.00..258.91 rows=25988 width=0) (actual time=0.056..0.056 rows=126 loops=1)
Index Cond: (tags @> '{tamir}'::text[])
Planning time: 0.287 ms
Execution time: 0.277 ms
But what happens to "love"? Now it goes from 14 ms to 2.3 seconds...
EXPLAIN analyse select user_id from posts where tags @> array['love'] order by likes desc nulls last, user_id limit 12
Limit (cost=7347142.18..7347142.21 rows=12 width=12) (actual time=2360.784..2360.786 rows=12 loops=1)
-> Sort (cost=7347142.18..7354817.20 rows=3070010 width=12) (actual time=2360.783..2360.784 rows=12 loops=1)
Sort Key: likes DESC NULLS LAST, user_id
Sort Method: top-N heapsort Memory: 25kB
-> Bitmap Heap Scan on posts (cost=28316.58..7276762.77 rows=3070010 width=12) (actual time=595.274..2171.571 rows=1517679 loops=1)
Recheck Cond: (tags @> '{love}'::text[])
Heap Blocks: exact=642705
-> Bitmap Index Scan on idx_tags (cost=0.00..27549.08 rows=3070010 width=0) (actual time=367.080..367.080 rows=1517679 loops=1)
Index Cond: (tags @> '{love}'::text[])
Planning time: 0.226 ms
Execution time: 2360.863 ms
Can somebody shed some light on why this is happening and what the fix would be?
Update
The "tags" field had a GIN index, not a b-tree; that was just a typo.
B-tree indexes are not very useful for searching for an element in an array field. You should remove the b-tree index from the tags field and use a GIN index instead:
drop index idx_tags;
create index idx_tags on posts using gin (tags);
And don't add order by user_id: this sabotages the possibility of using your idx_likes for ordering when there are a lot of rows with the tag you search for.
Also, the likes field should probably be not null default 0.
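The last suggestion could be applied roughly as follows. This is a sketch, not a tested migration: on a table of about 50 million rows the backfill and the constraint check can be slow and hold locks, so it may need to be batched or scheduled.

```sql
-- Backfill existing NULLs so the constraint can be added.
UPDATE posts SET likes = 0 WHERE likes IS NULL;

-- New rows get 0 when no value is supplied.
ALTER TABLE posts ALTER COLUMN likes SET DEFAULT 0;

-- Reject NULLs from now on (this scans the table to verify existing rows).
ALTER TABLE posts ALTER COLUMN likes SET NOT NULL;
```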
I have the following query:
SELECT "person_dimensions"."dimension"
FROM "person_dimensions"
join users
on users.id = person_dimensions.user_id
where users.team_id = 2
The following is the result of EXPLAIN ANALYZE:
Nested Loop (cost=0.43..93033.84 rows=452 width=11) (actual time=1245.321..42915.426 rows=827 loops=1)
-> Seq Scan on person_dimensions (cost=0.00..254.72 rows=13772 width=15) (actual time=0.022..9.907 rows=13772 loops=1)
-> Index Scan using users_pkey on users (cost=0.43..6.73 rows=1 width=4) (actual time=2.978..3.114 rows=0 loops=13772)
Index Cond: (id = person_dimensions.user_id)
Filter: (team_id = 2)
Rows Removed by Filter: 1
Planning time: 0.396 ms
Execution time: 42915.678 ms
Indexes exist on person_dimensions.user_id and users.team_id, so it is unclear as to why this seemingly simple query would be taking so long.
Maybe it has something to do with team_id not being usable in the join condition? Any ideas on how to speed this up?
EDIT:
I tried this query:
SELECT "person_dimensions"."dimension"
FROM "person_dimensions"
join users ON users.id = person_dimensions.user_id
WHERE users.id IN (2337,2654,3501,56,4373,1060,3170,97,4629,41,3175,4541,2827)
which contains the IDs returned by the subquery:
SELECT id FROM users WHERE team_id = 2
The result was 380ms versus 42s as above. I could use this as a workaround, but I am really curious as to what is going on here...
I rebooted my DB server yesterday, and when it came back up this same query was performing as expected with a completely different query plan that used expected indices:
QUERY PLAN
Hash Join (cost=1135.63..1443.45 rows=84 width=11) (actual time=0.354..6.312 rows=835 loops=1)
Hash Cond: (person_dimensions.user_id = users.id)
-> Seq Scan on person_dimensions (cost=0.00..255.17 rows=13817 width=15) (actual time=0.002..2.764 rows=13902 loops=1)
-> Hash (cost=1132.96..1132.96 rows=214 width=4) (actual time=0.175..0.175 rows=60 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 11kB
-> Bitmap Heap Scan on users (cost=286.07..1132.96 rows=214 width=4) (actual time=0.032..0.157 rows=60 loops=1)
Recheck Cond: (team_id = 2)
Heap Blocks: exact=68
-> Bitmap Index Scan on index_users_on_team_id (cost=0.00..286.02 rows=214 width=0) (actual time=0.021..0.021 rows=82 loops=1)
Index Cond: (team_id = 2)
Planning time: 0.215 ms
Execution time: 6.474 ms
Anyone have any ideas why it required a reboot to be aware of all of this? Could it be that manual vacuums were required that hadn't been done in a while, or something like this? Recall I did do an analyze on the relevant tables before the reboot and it didn't change anything.
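One way to investigate the vacuum and analyze question, offered as a diagnostic sketch rather than an explanation of the reboot: the pg_stat_user_tables view records when each table was last vacuumed and analyzed, and EXPLAIN with the BUFFERS option shows how much of the work came from cache versus disk reads.

```sql
-- When were the two tables last vacuumed or analyzed, manually or by
-- autovacuum, and how many dead tuples do they carry?
SELECT relname,
       last_vacuum, last_autovacuum,
       last_analyze, last_autoanalyze,
       n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname IN ('users', 'person_dimensions');

-- Re-run the original query with buffer statistics; "shared hit" counts
-- pages found in cache and "read" counts pages fetched from outside it.
EXPLAIN (ANALYZE, BUFFERS)
SELECT "person_dimensions"."dimension"
FROM "person_dimensions"
JOIN users ON users.id = person_dimensions.user_id
WHERE users.team_id = 2;
```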