Why is my count query on an indexed field slow? - postgresql

I have the following schema:
leadgenie-django=> \d main_lead;
Table "public.main_lead"
Column | Type | Modifiers
-----------------+--------------------------+-----------
id | uuid | not null
body | text | not null
username | character varying(255) | not null
link | character varying(255) | not null
source | character varying(10) | not null
keyword_matches | character varying(255)[] | not null
json | jsonb | not null
created_at | timestamp with time zone | not null
updated_at | timestamp with time zone | not null
campaign_id | uuid | not null
is_accepted | boolean |
is_closed | integer |
raw_body | text |
accepted_at | timestamp with time zone |
closed_at | timestamp with time zone |
score | double precision |
Indexes:
"main_lead_pkey" PRIMARY KEY, btree (id)
"main_lead_campaign_id_75034b1f" btree (campaign_id)
Foreign-key constraints:
"main_lead_campaign_id_75034b1f_fk_main_campaign_id" FOREIGN KEY (campaign_id) REFERENCES main_campaign(id) DEFERRABLE INITIALLY DEFERRED
As you can see, campaign_id is indexed.
When I do a simple WHERE with a campaign_id, the query still takes 16 seconds.
leadgenie-django=> EXPLAIN ANALYZE select count(*) from main_lead where campaign_id = '9a183263-7a60-4ec0-a354-2175f8a2e5c9';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=202866.79..202866.80 rows=1 width=8) (actual time=16715.762..16715.763 rows=1 loops=1)
-> Seq Scan on main_lead (cost=0.00..202189.94 rows=270739 width=0) (actual time=1143.886..16516.490 rows=279405 loops=1)
Filter: (campaign_id = '9a183263-7a60-4ec0-a354-2175f8a2e5c9'::uuid)
Rows Removed by Filter: 857300
Planning time: 0.080 ms
Execution time: 16715.807 ms
I would have expected this query to be fast (under 1s), since this field is indexed. Is there a reason my expectation is wrong? Anything I could do to speed it up?

The query fetches about 25% of your table, so PostgreSQL thinks that this is most cheaply done with a sequential scan of the whole table. That is probably correct.
Try running
VACUUM main_lead;
That will update the visibility map. If there are no long-running concurrent transactions, that should mark most of the table's blocks as all-visible, so that you can get a faster index-only scan for the query.
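You can then verify the effect with EXPLAIN; this sketch reuses the table and campaign_id from the question:
VACUUM main_lead;
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM main_lead
WHERE campaign_id = '9a183263-7a60-4ec0-a354-2175f8a2e5c9';
If the visibility map is mostly set, the plan should show an Index Only Scan on main_lead_campaign_id_75034b1f with a low Heap Fetches count; a high Heap Fetches number means most blocks are not yet all-visible and the table still needs vacuuming.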

Related

Postgres does not pick partial index even when the clause matches

I have a table that looks like so:
Column | Type | Collation | Nullable | Default
-----------+---------+-----------+----------+---------
app_id | uuid | | not null |
entity_id | uuid | | not null |
attr_id | uuid | | not null |
value | text | | not null |
ea_index | boolean | | |
Indexes:
"triples_pkey" PRIMARY KEY, btree (app_id, entity_id, attr_id, value)
"ea_index" UNIQUE, btree (app_id, entity_id, attr_id) WHERE ea_index
"triples_app_id" btree (app_id)
"triples_attr_id" btree (attr_id)
Foreign-key constraints:
"triples_app_id_fkey" FOREIGN KEY (app_id) REFERENCES apps(id) ON DELETE CASCADE
"triples_attr_id_fkey" FOREIGN KEY (attr_id) REFERENCES attrs(id) ON DELETE CASCADE
I have a special partial index, ea_index, covering all the rows where this column is true.
Now, when I run:
EXPLAIN (
SELECT
*
FROM triples
WHERE
app_id = '6b1ca162-0175-4188-9265-849f671d56cc' AND
entity_id = '6b1ca162-0175-4188-9265-849f671d56cc' AND
ea_index
);
I get:
Index Scan using triples_app_id on triples (cost=0.28..4.30 rows=1 width=221)
Index Cond: (app_id = '6b1ca162-0175-4188-9265-849f671d56cc'::uuid)
Filter: (ea_index AND (entity_id = '6b1ca162-0175-4188-9265-849f671d56cc'::uuid))
(3 rows)
I am a bit confused: why is this not using an index scan on ea_index? How could I debug this further?
This turned out to be a costing decision: EXPLAIN showed that only 1 row was expected, so it made no difference which index was chosen. With different uuids, the planner did pick the correct index.
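One way to debug such choices is to hide the competing index inside a transaction and see what the planner does without it. This is a debugging sketch for a test database; DROP INDEX takes an exclusive lock, so don't run it against a busy production table:
BEGIN;
DROP INDEX triples_app_id;
EXPLAIN
SELECT *
FROM triples
WHERE
    app_id = '6b1ca162-0175-4188-9265-849f671d56cc' AND
    entity_id = '6b1ca162-0175-4188-9265-849f671d56cc' AND
    ea_index;
ROLLBACK;
If the plan now uses ea_index at a similar estimated cost, the original choice was a toss-up between equally cheap one-row plans, which matches what the EXPLAIN estimate showed.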

What are the scenarios that cause postgres to do a seq scan instead of an index scan?

I've run into a very strange issue that has come up multiple times. When I create a database and first query a table, the plan shows an index scan.
But during development, some modifications may have been made to the table or its indexes (I'm not sure). Later I found that the same query no longer uses an index scan.
If I drop the table and rebuild it with the identical table and index structure, it starts using an index scan again.
I know there are scenarios where Postgres deliberately uses a seq scan, for example when the scan would return very many rows. But my query can only ever return 0 or 1 rows.
I also know that an index scan can be more costly in some cases because of its startup cost, but that is clearly not the situation here either.
Can anyone give me a clue about what I could investigate?
testdb=> \d tabverifies;
Table "public.tabverifies"
Column | Type | Collation | Nullable | Default
--------+----------+-----------+----------+------------------------------------------
vid | integer | | not null | nextval('tabverifies_vid_seq'::regclass)
lid | integer | | not null |
verify | integer | | not null |
secret | text | | not null |
Indexes:
"tabverifies_pkey" PRIMARY KEY, btree (vid)
"tabverifies_lid_verify_key" UNIQUE CONSTRAINT, btree (lid, verify)
Foreign-key constraints:
"tabverifies_lid_fkey" FOREIGN KEY (lid) REFERENCES tablogins(lid)
testdb=> explain select * from tabverifies where vid=1000;
QUERY PLAN
------------------------------------------------------------
Seq Scan on tabverifies (cost=0.00..1.04 rows=1 width=44)
Filter: (vid = 1000)
(2 rows)
testdb=> \d tabverifies;
Table "public.tabverifies"
Column | Type | Collation | Nullable | Default
--------+----------+-----------+----------+------------------------------------------
vid | integer | | not null | nextval('tabverifies_vid_seq'::regclass)
lid | integer | | not null |
verify | integer | | not null |
secret | text | | not null |
Indexes:
"tabverifies_pkey" PRIMARY KEY, btree (vid)
"tabverifies_lid_verify_key" UNIQUE CONSTRAINT, btree (lid, verify)
Foreign-key constraints:
"tabverifies_lid_fkey" FOREIGN KEY (lid) REFERENCES tablogins(lid)
sigserverdb=> explain select * from tabverifies where vid=1;
QUERY PLAN
-------------------------------------------------------------------------------------
Index Scan using tabverifies_pkey on tabverifies (cost=0.15..8.17 rows=1 width=44)
Index Cond: (vid = 1)
(2 rows)
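One way to gather a clue is to force the planner's hand and compare the two plans' costs (enable_seqscan is a debugging knob, not a production setting):
SET enable_seqscan = off;
EXPLAIN SELECT * FROM tabverifies WHERE vid = 1000;
RESET enable_seqscan;
If the index scan's estimated cost comes out higher than the seq scan's (here 1.04, i.e. roughly a single page), the table is simply so small that reading it whole is the cheapest plan; a change in behavior after rebuilding would then just reflect stale or default statistics, which ANALYZE tabverifies refreshes.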

How can I optimize a postgresql query where one dependent column is a timestamp

I have a table with a foreign key and a timestamp for when the row was most recently updated. Rows with the same foreign key value are updated at roughly the same time, plus or minus an hour. I have an index on (foreign_key, timestamp). This is on PostgreSQL 11.
When I make a query like:
select * from table where foreign_key = $1 and timestamp > $2 order by primary_key;
It will use my index in cases where the timestamp predicate is selective across the entire table. But if the timestamp is far enough in the past that the majority of rows match, it will scan the primary key index instead, assuming that will be faster. This problem goes away if I remove the ORDER BY.
I've looked at PostgreSQL's CREATE STATISTICS, but it doesn't seem to help in cases where the correlation is over a range of values, like a timestamp plus or minus five minutes, rather than a specific value.
What are the best ways to work around this? I can remove the ORDER BY, but that complicates the business logic. I can partition the table on the foreign key id, but that is also a pretty expensive change.
Specifics:
Table "public.property_home_attributes"
Column | Type | Collation | Nullable | Default
----------------------+-----------------------------+-----------+----------+------------------------------------------------------
id | integer | | not null | nextval('property_home_attributes_id_seq'::regclass)
mls_id | integer | | not null |
property_id | integer | | not null |
formatted_attributes | jsonb | | not null |
created_at | timestamp without time zone | | |
updated_at | timestamp without time zone | | |
Indexes:
"property_home_attributes_pkey" PRIMARY KEY, btree (id)
"index_property_home_attributes_on_property_id" UNIQUE, btree (property_id)
"index_property_home_attributes_on_updated_at" btree (updated_at)
"property_home_attributes_mls_id_updated_at_idx" btree (mls_id, updated_at)
The table has about 16 million rows.
psql=# EXPLAIN ANALYZE SELECT * FROM property_home_attributes WHERE mls_id = 46 AND (property_home_attributes.updated_at < '2019-10-30 16:52:06.326774') ORDER BY id ASC LIMIT 1000;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.56..10147.83 rows=1000 width=880) (actual time=1519.718..22310.674 rows=1000 loops=1)
-> Index Scan using property_home_attributes_pkey on property_home_attributes (cost=0.56..6094202.57 rows=600576 width=880) (actual time=1519.716..22310.398 rows=1000 loops=1)
Filter: ((updated_at < '2019-10-30 16:52:06.326774'::timestamp without time zone) AND (mls_id = 46))
Rows Removed by Filter: 358834
Planning Time: 0.110 ms
Execution Time: 22310.842 ms
(6 rows)
and then without the order by:
psql=# EXPLAIN ANALYZE SELECT * FROM property_home_attributes WHERE mls_id = 46 AND (property_home_attributes.updated_at < '2019-10-30 16:52:06.326774') LIMIT 1000;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.56..1049.38 rows=1000 width=880) (actual time=0.053..162.081 rows=1000 loops=1)
-> Index Scan using foo on property_home_attributes (cost=0.56..629893.60 rows=600576 width=880) (actual time=0.053..161.992 rows=1000 loops=1)
Index Cond: ((mls_id = 46) AND (updated_at < '2019-10-30 16:52:06.326774'::timestamp without time zone))
Planning Time: 0.100 ms
Execution Time: 162.140 ms
(5 rows)
If you want to keep PostgreSQL from using an index scan on property_home_attributes_pkey to support the ORDER BY, you can simply use
ORDER BY primary_key + 0
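The + 0 turns the sort key into an expression the primary key index cannot deliver presorted, so the planner falls back to the (mls_id, updated_at) index and sorts the matching rows instead. A sketch against the query from the question:
EXPLAIN ANALYZE
SELECT * FROM property_home_attributes
WHERE mls_id = 46
  AND updated_at < '2019-10-30 16:52:06.326774'
ORDER BY id + 0 ASC
LIMIT 1000;
The result order is identical, since id + 0 sorts the same as id.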

Speed up fulltext search - pgsql

I have seen millions of threads about speeding up a PostgreSQL query with full-text search. I have tried everything, but I am out of ideas.
I have a pretty big table (20,612,971 records at the moment) and search it with PostgreSQL's full-text search, then order the results by ts_rank_cd. The query takes around 3500-4000 ms to execute. Any ideas how to make it faster? If possible I don't want to use external software like Sphinx or Solr, so native PostgreSQL solutions are preferred :) Below are a description of my table and an example EXPLAIN ANALYZE.
# \d artifacts.item
Table "artifacts.item"
Column | Type | Modifiers
-------------------------+-----------------------------+-------------------------------------------------------------
add_timestamp | timestamp without time zone |
author_account_id | integer |
description | text |
id | integer | not null default nextval('artifacts.item_id_seq'::regclass)
removed_since_timestamp | timestamp without time zone |
slug | character varying(2044) | not null
thumb_height | integer |
thumb_path | character varying(2044) | default NULL::character varying
thumb_width | integer |
title | character varying(2044) | not null
search_data | tsvector |
tags | integer[] |
is_age_restricted | boolean | not null default false
is_on_homepage | boolean | not null default false
is_public | boolean | not null default false
thumb_filename | character varying(2044) |
is_removed | boolean | not null default false
Indexes:
"artifacts_item_add_timestamp_idx" btree (add_timestamp DESC NULLS LAST)
"artifacts_item_id_idx" btree (id)
"artifacts_item_is_on_homepage_add_timestamp" btree (is_on_homepage DESC, add_timestamp DESC NULLS LAST)
"artifacts_item_is_on_homepage_idx" btree (is_on_homepage)
"artifacts_item_search_results" gin (search_data) WHERE is_public IS TRUE AND is_removed IS FALSE
"artifacts_item_tags_gin_idx" gin (tags)
"artifacts_item_thumbs_list" btree (is_public, is_removed, id DESC)
"index1" btree (add_timestamp)
"itemIdx" btree (is_public, is_removed, is_age_restricted)
"item_author_account_id_idx" btree (author_account_id)
analyze:
# explain analyze SELECT i.id,
# i.title,
# i.description,
# i.slug,
# i.thumb_path,
# i.thumb_filename,
# CONCAT(
# i.thumb_path,
# '/',
# i.thumb_filename
# ) AS thumb_url,
# (CASE WHEN i.thumb_width = 0 THEN 280 ELSE i.thumb_width END) as thumb_width,
# (CASE WHEN i.thumb_height = 0 THEN 280 ELSE i.thumb_height END) as thumb_height,
# (i.thumb_height > i.thumb_width) AS is_vertical,
# i.add_timestamp
# FROM artifacts.item AS i
# WHERE i.is_public IS true
# AND i.is_removed IS false
# AND (i.search_data @@ to_tsquery('public.polish', $$'lego'$$))
# ORDER BY ts_rank_cd(i.search_data, to_tsquery('public.polish', $$'lego'$$)) desc,
# i.add_timestamp DESC NULLS LAST
# LIMIT 60
# OFFSET 0;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=358061.78..358061.93 rows=60 width=315) (actual time=335.870..335.876 rows=60 loops=1)
-> Sort (cost=358061.78..358357.25 rows=118189 width=315) (actual time=335.868..335.868 rows=60 loops=1)
Sort Key: (ts_rank_cd(search_data, '''lego'' | ''lega'''::tsquery)), add_timestamp
Sort Method: top-N heapsort Memory: 55kB
-> Bitmap Heap Scan on item i (cost=2535.96..353980.19 rows=118189 width=315) (actual time=33.163..308.371 rows=62025 loops=1)
Recheck Cond: ((search_data @@ '''lego'' | ''lega'''::tsquery) AND (is_public IS TRUE) AND (is_removed IS FALSE))
-> Bitmap Index Scan on artifacts_item_search_results (cost=0.00..2506.42 rows=118189 width=0) (actual time=23.066..23.066 rows=62085 loops=1)
Index Cond: (search_data @@ '''lego'' | ''lega'''::tsquery)
Total runtime: 335.967 ms
(9 rows)
Time: 3444.731 ms
There are 62025 rows that match the condition, and they have to be ordered…
That will take a while. Is there a chance that you can have the whole database or at least the index in RAM? That would help.
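If the machine has enough RAM, the pg_prewarm extension can pull the index (or the whole table) into PostgreSQL's buffer cache ahead of time. A sketch, assuming the extension is available on your installation and that the index lives in the artifacts schema like its table:
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('artifacts.artifacts_item_search_results');
SELECT pg_prewarm('artifacts.item');
Whether the data stays cached afterwards depends on shared_buffers and the rest of the workload, so this helps most when the buffer cache is large enough to hold the hot set.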

How can this simple query take so long?

=> SELECT * FROM "tags" WHERE ("kind" = 'View') ORDER BY "name";
Time: 278.318 ms
The tags table contains 358 rows. All of them are views at the moment.
Column | Type | Modifiers
-------------+--------------------------+-------------------------------------
id | uuid | not null default uuid_generate_v4()
name | text | not null
slug | text | not null
kind | text | not null
external_id | text |
created_at | timestamp with time zone | not null default now()
updated_at | timestamp with time zone |
filter | json |
Indexes:
"tags_pkey" PRIMARY KEY, btree (id)
"tags_kind_index" btree (kind)
"tags_name_index" btree (name)
Analyze says:
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Sort (cost=9.29..9.47 rows=358 width=124) (actual time=0.654..0.696 rows=358 loops=1)
Sort Key: name
Sort Method: quicksort Memory: 75kB
-> Seq Scan on tags (cost=0.00..6.25 rows=358 width=124) (actual time=0.006..0.108 rows=358 loops=1)
Filter: (kind = 'View'::text)
Total runtime: 0.756 ms
(6 rows)
Did you run ANALYZE tags? It will update the table's statistics.
First, if all the kind values are 'View', then an index on that column is useless. The index will only be used if the column's cardinality is high; otherwise it is cheaper to do a sequential scan of the table.
Second, with only 358 rows it is cheaper to do a sequential scan anyway.
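A quick way to confirm both points is to refresh the statistics and look at what the planner knows about the column (pg_stats is the standard statistics view):
ANALYZE tags;
SELECT n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 'tags' AND attname = 'kind';
An n_distinct of 1 means every row has the same kind, so tags_kind_index cannot narrow anything down and the planner is right to scan all 358 rows sequentially.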