PostgreSQL: finding queries using an index I want to drop

In a PostgreSQL 9.4.0 database I have a busy table with 22 indexes which are larger than the actual data in the table.
Since most of these indexes are for columns which are almost entirely NULL, I've been trying to replace some of them with partial indexes.
One of the columns is auto_decline_at timestamp without time zone, which is NULL in 5453085 of the table's 5457088 rows.
The partial index replacement is being used, but according to the stats, the old index is also still in use, so I am afraid to drop it.
Selecting from pg_tables I see:
tablename | indexname | num_rows | table_size | index_size | unique | number_of_scans | tuples_read | tuples_fetched
-----------+---------------------------------------+-------------+------------+------------+--------+-----------------+-------------+----------------
jobs | index_jobs_on_auto_decline_at | 5.45496e+06 | 1916 MB | 3123 MB | N | 17056009 | 26506058607 | 26232155810
jobs | index_jobs_on_auto_decline_at_partial | 5.45496e+06 | 1916 MB | 120 kB | N | 6677 | 26850779 | 26679802
And a few minutes later:
tablename | indexname | num_rows | table_size | index_size | unique | number_of_scans | tuples_read | tuples_fetched
-----------+---------------------------------------+-------------+------------+------------+--------+-----------------+-------------+----------------
jobs | index_jobs_on_auto_decline_at | 5.45496e+06 | 1916 MB | 3124 MB | N | 17056099 | 26506058697 | 26232155900
jobs | index_jobs_on_auto_decline_at_partial | 5.45496e+06 | 1916 MB | 120 kB | N | 6767 | 27210639 | 27039623
So number_of_scans is increasing for both of them.
The index definitions:
"index_jobs_on_auto_decline_at" btree (auto_decline_at)
"index_jobs_on_auto_decline_at_partial" btree (auto_decline_at) WHERE auto_decline_at IS NOT NULL
The only relevant query I can see in my logs follows this pattern:
SELECT "jobs".* FROM "jobs" WHERE (jobs.pending_destroy IS NULL OR jobs.pending_destroy = FALSE) AND "jobs"."deleted_at" IS NULL AND (state = 'assigned' AND auto_decline_at IS NOT NULL AND auto_decline_at < '2015-08-17 06:57:22.325324')
EXPLAIN ANALYSE gives me the following plan, which uses the partial index as expected:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using index_jobs_on_auto_decline_at_partial on jobs (cost=0.28..12.27 rows=1 width=648) (actual time=22.143..22.143 rows=0 loops=1)
Index Cond: ((auto_decline_at IS NOT NULL) AND (auto_decline_at < '2015-08-17 06:57:22.325324'::timestamp without time zone))
Filter: (((pending_destroy IS NULL) OR (NOT pending_destroy)) AND (deleted_at IS NULL) AND ((state)::text = 'assigned'::text))
Rows Removed by Filter: 3982
Planning time: 2.731 ms
Execution time: 22.179 ms
(6 rows)
My questions:
Why is index_jobs_on_auto_decline_at still being used?
Could this same query sometimes use index_jobs_on_auto_decline_at, or is there likely to be another query I am missing?
Is there a way to log the queries which are using index_jobs_on_auto_decline_at?
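For reference, the per-index scan counters shown above ultimately come from PostgreSQL's statistics collector; a minimal sketch of a snapshot query against pg_stat_user_indexes that could be rerun periodically to watch both indexes (the stats cannot name the individual queries, only show whether the old index is still being chosen):
-- Snapshot the scan counters for both indexes; rerun later and diff
SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch, now() AS sampled_at
FROM pg_stat_user_indexes
WHERE relname = 'jobs'
  AND indexrelname IN ('index_jobs_on_auto_decline_at',
                       'index_jobs_on_auto_decline_at_partial');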

Related

Postgresql index is not used for slow queries >30s

POSTGRESQL VERSION: 10
HARDWARE: 4 workers / 16 GB RAM / 50% used
I'm not a PostgreSQL expert; I have just read a lot of documentation and run a lot of tests.
I have some PostgreSQL queries which take a long time (more than 30 s) because of the 10 million rows in a table.
Column | Type | Collation | Nullable | Default
------------------------------+--------------------------+-----------+----------+----------------------------------------------------------
id | integer | | not null |
cveid | character varying(50) | | |
summary | text | | not null |
published | timestamp with time zone | | |
modified | timestamp with time zone | | |
assigner | character varying(128) | | |
vulnerable_products | character varying(250)[] | | |
cvss | double precision | | |
cvss_time | timestamp with time zone | | |
cvss_vector | character varying(250) | | |
access | jsonb | | not null |
impact | jsonb | | not null |
score | integer | | not null |
is_exploitable | boolean | | not null |
is_confirmed | boolean | | not null |
is_in_the_news | boolean | | not null |
is_in_the_wild | boolean | | not null |
reflinks | jsonb | | not null |
reflinkids | jsonb | | not null |
created_at | timestamp with time zone | | |
history_id | integer | | not null | nextval('vulns_historicalvuln_history_id_seq'::regclass)
history_date | timestamp with time zone | | not null |
history_change_reason | character varying(100) | | |
history_type | character varying(1) | | not null |
Indexes:
"vulns_historicalvuln_pkey" PRIMARY KEY, btree (history_id)
"btree_varchar" btree (history_type varchar_pattern_ops)
"vulns_historicalvuln_cve_id_850876bb" btree (cve_id)
"vulns_historicalvuln_cwe_id_2013d697" btree (cwe_id)
"vulns_historicalvuln_history_user_id_9e25ebf5" btree (history_user_id)
"vulns_historicalvuln_id_773f2af7" btree (id)
--- TRUNCATE
Foreign-key constraints:
"vulns_historicalvuln_history_user_id_9e25ebf5_fk_custusers" FOREIGN KEY (history_user_id) REFERENCES custusers_user(id) DEFERRABLE INITIALLY DEFERRED
Example of queries:
SELECT * FROM vulns_historicalvuln WHERE history_type <> '+' order by id desc fetch first 10000 rows only; -> 30s without cache
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..31878.33 rows=10000 width=1736) (actual time=0.173..32839.474 rows=10000 loops=1)
-> Index Scan Backward using vulns_historicalvuln_id_773f2af7 on vulns_historicalvuln (cost=0.43..26346955.92 rows=8264960 width=1736) (actual time=0.172..32830.958 rows=10000 loops=1)
Filter: ((history_type)::text <> '+'::text)
Rows Removed by Filter: 296
Planning time: 19.514 ms
Execution time: 32845.015 ms
SELECT DISTINCT "vulns"."id", "vulns"."uuid", "vulns"."feedid", "vulns"."cve_id", "vulns"."cveid", "vulns"."summary", "vulns"."published", "vulns"."modified", "vulns"."assigner", "vulns"."cwe_id", "vulns"."vulnerable_packages_versions", "vulns"."vulnerable_products", "vulns"."vulnerable_product_versions", "vulns"."cvss", "vulns"."cvss_time", "vulns"."cvss_version", "vulns"."cvss_vector", "vulns"."cvss_metrics", "vulns"."access", "vulns"."impact", "vulns"."cvss3", "vulns"."cvss3_vector", "vulns"."cvss3_version", "vulns"."cvss3_metrics", "vulns"."score", "vulns"."is_exploitable", "vulns"."is_confirmed", "vulns"."is_in_the_news", "vulns"."is_in_the_wild", "vulns"."reflinks", "vulns"."reflinkids", "vulns"."created_at", "vulns"."updated_at", "vulns"."id" AS "exploit_count", false AS "monitored", '42' AS "org" FROM "vulns" WHERE ("vulns"."score" >= 0 AND "vulns"."score" <= 100) ORDER BY "vulns"."updated_at" DESC LIMIT 10
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=315191.32..315192.17 rows=10 width=1691) (actual time=3013.964..3013.990 rows=10 loops=1)
-> Unique (cost=315191.32..329642.42 rows=170013 width=1691) (actual time=3013.962..3013.986 rows=10 loops=1)
-> Sort (cost=315191.32..315616.35 rows=170013 width=1691) (actual time=3013.961..3013.970 rows=10 loops=1)
Sort Key: updated_at DESC, id, uuid, feedid, cve_id, cveid, summary, published, modified, assigner, cwe_id, vulnerable_packages_versions, vulnerable_products, vulnerable_product_versions, cvss, cvss_time, cvss_version, cvss_vector, cvss_metrics, access, impact, cvss3, cvss3_vector, cvss3_version, cvss3_metrics, score, is_exploitable, is_confirmed, is_in_the_news, is_in_the_wild, reflinks, reflinkids, created_at
Sort Method: external merge Disk: 277648kB
-> Seq Scan on vulns (cost=0.00..50542.19 rows=170013 width=1691) (actual time=0.044..836.597 rows=169846 loops=1)
Filter: ((score >= 0) AND (score <= 100))
Planning time: 3.183 ms
Execution time: 3070.346 ms
I have created the B-tree varchar index "btree_varchar" btree (history_type varchar_pattern_ops) like this:
CREATE INDEX CONCURRENTLY btree_varchar ON vulns_historicalvuln (history_type varchar_pattern_ops);
I have also created an index on the vulns score column for my second query:
CREATE INDEX CONCURRENTLY ON vulns (score);
I have read a lot of posts and documentation about slow queries and indexes. I was sure an index was the solution to my slow queries, but the PostgreSQL query planner doesn't use the index I created. It estimates that a sequential scan is faster than using the index...
SELECT relname, indexrelname, idx_scan FROM pg_catalog.pg_stat_user_indexes;
relname | indexrelname | idx_scan
-------------------------------------+-----------------------------------------------------------------+------------
vulns_historicalvuln | btree_varchar | 0
Could you tell me if my index is well designed? How can I debug this? Feel free to ask for more information if needed.
Thanks
After some research, I understand that an index is not the solution to my problem here.
The low cardinality (repeated values) of this field makes the index useless.
The query time here is normal for PostgreSQL because of the 30M rows matched.
I am closing this question because there is no problem with the index here.
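For reference, the cardinality behind that conclusion can be checked with a simple aggregate, similar to the distribution query used in the "postgres not using index" answer further down; a minimal sketch:
-- If almost every row has history_type <> '+', the planner is right
-- to prefer scanning the table over using btree_varchar.
SELECT history_type, count(*) AS row_count
FROM vulns_historicalvuln
GROUP BY history_type
ORDER BY count(*) DESC;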

Evenly select Records on categorical column with Repeating Sequence and pagination in Postgres

Database: Postgres
I have a product(id, title, source, ...) table which contains almost 500K records.
An example of data is:
| Id | title | source |
|:---|---------:|:--------:|
| 1 | product1 | source1 |
| 2 | product2 | source1 |
| 3 | product3 | source1 |
| 4 | product4 | source1 |
| . | ........ | source1 |
| . | ........ | source2 |
| x | productx | source2 |
|x+n |productX+n| sourceN |
There are 5 distinct source values, and the source values are randomly distributed across the records.
I need to get paginated results in such a way that:
If I need to select 20 products, the result set should contain results equally distributed across sources, in a repeating sequence: 2 products from each source up to the last source, then the next 2 products from each source, and so on.
For example:
| # | title | source |
|:---|---------:|:--------:|
| 1 | product1 | source1 |
| 2 | product2 | source1 |
| 3 | product3 | source2 |
| 4 | product4 | source2 |
| 5 | product5 | source3 |
| 6 | product6 | source3 |
| 7 | product7 | source4 |
| 8 | product8 | source4 |
| 9 | product9 | source5 |
| 10 |product10 | source5 |
| 11 | ........ | source1 |
| 12 | ........ | source1 |
| 13 | ........ | source2 |
| 14 | ........ | source2 |
| .. | ........ | ....... |
| 20 | ........ | source5 |
What is an optimized PostgreSQL query to achieve the above scenario, considering LIMIT, OFFSET, and that the number of sources can increase or decrease?
EDIT
As suggested by George S, the solution below works; however, it is not very performant: it takes almost 6 seconds to select only 20 records.
select id, title, source
, (row_number() over(partition by source order by last_modified DESC) - 1) / 2 as ordinal
-- order here can be by created time, id, title, etc
from product p
order by ordinal, source
limit 20
offset 2;
EXPLAIN ANALYZE of the above query on real data:
Limit (cost=147621.60..147621.65 rows=20 width=92) (actual time=5956.126..5956.138 rows=20 loops=1)
-> Sort (cost=147621.60..148813.72 rows=476848 width=92) (actual time=5956.123..5956.128 rows=22 loops=1)
Sort Key: (((row_number() OVER (?) - 1) / 2)), provider
Sort Method: top-N heapsort Memory: 28kB
-> WindowAgg (cost=122683.80..134605.00 rows=476848 width=92) (actual time=5099.059..5772.821 rows=477731 loops=1)
-> Sort (cost=122683.80..123875.92 rows=476848 width=84) (actual time=5098.873..5347.858 rows=477731 loops=1)
Sort Key: provider, last_modified DESC
Sort Method: external merge Disk: 46328kB
-> Seq Scan on product p (cost=0.00..54889.48 rows=476848 width=84) (actual time=0.012..4360.000 rows=477731 loops=1)
Planning Time: 0.354 ms
Execution Time: 5961.670 ms
This can be accomplished easily with a window function:
select id, title, source
, (row_number() over(partition by source order by id) - 1) / 2 as ordinal
--ordering here can be by created time, id, title, etc
from product p
order by ordinal, source
limit 10
offset 2;
As you noted, depending on your table size and the other filters being used, this may or may not be performant. The best way to tell is to run EXPLAIN ANALYZE with the query on your actual data. If this isn't performant, you can also add the ordinal field to the table itself, provided it will always have the same value and ordering. Sadly, you can't create an index using a window function (at least not in PG 12).
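If you do decide to store the ordinal on the table itself, a hypothetical sketch (the column and index names are assumptions, and the value has to be recomputed whenever rows are added or the ordering changes) might look like:
ALTER TABLE product ADD COLUMN ordinal integer;

-- Recompute the persisted ordinal from the current ordering
UPDATE product p
SET ordinal = sub.ord
FROM (
  SELECT id,
         (row_number() OVER (PARTITION BY source ORDER BY id) - 1) / 2 AS ord
  FROM product
) sub
WHERE p.id = sub.id;

CREATE INDEX product_ordinal_source_idx ON product (ordinal, source);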
If you don't want to change the table itself, you can create a materialized view and then query that view so that the calculation only has to be done once:
CREATE MATERIALIZED VIEW ordered_product AS
select id, title, source
, (row_number() over(partition by source order by id) - 1) / 2 as ordinal
from product;
Afterwards, you can query the view like a normal table:
select * from ordered_product order by ordinal, source limit 10 offset 20;
You can also create indexes for it if necessary. Note that to refresh the view you'd run a command like:
REFRESH MATERIALIZED VIEW ordered_product;
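As for indexing the view, a minimal sketch matching the pagination ORDER BY (the index name is an assumption):
-- Lets ORDER BY ordinal, source ... LIMIT/OFFSET read straight off the index
CREATE INDEX ordered_product_ordinal_source_idx ON ordered_product (ordinal, source);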

Query planner behaviour degradation after PostgreSQL update from (10.11 to 11.6)

After updating postgres, I noticed that one of the queries I was using became much slower. After running EXPLAIN ANALYZE I see that it is now using a different index on the same query.
Among other columns, my table has an applicationid column which is a foreign key BIGINT, and an attributes column which is a jsonb key/value map.
The description of the coupons table is as follows (some irrelevant parts omitted):
+------------------------+--------------------------+-------------------------------------------------------+
| Column | Type | Modifiers |
|------------------------+--------------------------+-------------------------------------------------------|
| id | integer | not null default nextval('coupons_id_seq'::regclass) |
| created | timestamp with time zone | not null default now() |
| campaignid | bigint | |
| value | text | |
| expirydate | timestamp with time zone | |
| startdate | timestamp with time zone | |
| attributes | jsonb | not null default '{}'::jsonb |
| applicationid | bigint | |
| deleted | timestamp with time zone | |
| deleted_changelogid | bigint | not null default 0 |
| accountid | bigint | not null |
| recipientintegrationid | text | |
+------------------------+--------------------------+-------------------------------------------------------+
Indexes:
"coupons_applicationid_value_idx" UNIQUE, btree (applicationid, value) WHERE deleted IS NULL
"coupons_attrs_index" gin (attributes)
"coupons_recipientintegrationid_idx" btree (recipientintegrationid)
"coupons_value_trgm_idx" gin (value gin_trgm_ops)
The query I'm running is (some irrelevant parts were omitted):
EXPLAIN ANALYZE SELECT
*,
COUNT(*) OVER () AS total_rows
FROM
coupons
WHERE
deleted IS NULL
AND coupons.applicationid = 2
AND coupons.attributes #> '{"SessionId":"1070695459"}'
ORDER BY
id ASC
LIMIT 1000;
The applicationid condition alone doesn't help us much, since it is not very selective. The index that was previously used was coupons_attrs_index (over the attributes column), which produced very good results.
After the update, however, the query planner started preferring the index coupons_applicationid_value_idx for some reason!
Here is output from EXPLAIN ANALYZE (some irrelevant parts were omitted):
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| -> Sort (cost=64.09..64.10 rows=1 width=237) (actual time=3068.996..3068.996 rows=0 loops=1) |
| Sort Key: coupons.id |
| Sort Method: quicksort Memory: 25kB |
| -> WindowAgg (cost=0.86..64.08 rows=1 width=237) (actual time=3068.988..3068.988 rows=0 loops=1) |
| -> Nested Loop (cost=0.86..64.07 rows=1 width=229) (actual time=3068.985..3068.985 rows=0 loops=1) |
| -> Index Scan using coupons_applicationid_value_idx on coupons (cost=0.43..61.61 rows=1 width=213) (actual time=3068.984..3068.984 rows=0 loops=1) |
| Index Cond: (applicationid = 2) |
| Filter: (attributes #> '{"SessionId": "1070695459"}'::jsonb) |
| Rows Removed by Filter: 2344013 |
| Planning Time: 0.531 ms |
| Execution Time: 3069.076 ms |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
EXPLAIN
Time: 3.159s (3 seconds), executed in: 3.102s (3 seconds)
Can anyone help me understand why the query planner uses a less efficient index (coupons_applicationid_value_idx instead of coupons_attrs_index) after the update?
After adding a mixed (B-tree + GIN) index on (applicationid, attributes), that index was selected, effectively solving the issue. I would still like to understand what happened, so I can predict issues like this in the future.
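For reference, a mixed index like that typically relies on the btree_gin extension, which provides GIN operator classes for scalar types such as bigint; a minimal sketch (the index name is an assumption):
CREATE EXTENSION IF NOT EXISTS btree_gin;

-- GIN index covering both the equality column and the jsonb containment column
CREATE INDEX CONCURRENTLY coupons_applicationid_attrs_idx
ON coupons USING gin (applicationid, attributes);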
[EDIT 31-01-20 11:02]: The issue returned after 24 hours. Again the wrong index was chosen by the planner and the query became slow. Running a simple ANALYZE solved it.
It is still very strange that this only started happening after the update to PG 11.
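If the misestimate keeps coming back, the usual workaround is to refresh statistics more aggressively for that table; a hedged sketch (the scale factor value is only an assumption):
-- Refresh planner statistics for the table right away
ANALYZE coupons;

-- Hypothetical: make autoanalyze run more often on this busy table
ALTER TABLE coupons SET (autovacuum_analyze_scale_factor = 0.01);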

postgres not using index

There are lots of questions on this topic, but all of them seem to be more complex cases than what I'm looking at at the moment and the answers don't seem applicable.
OHDSI=> \d record_counts
Table "results2.record_counts"
Column | Type | Modifiers
------------------------+-----------------------+-----------
concept_id | integer |
schema | text |
table_name | text |
column_name | text |
column_type | text |
descendant_concept_ids | bigint |
rc | numeric |
drc | numeric |
domain_id | character varying(20) |
vocabulary_id | character varying(20) |
concept_class_id | character varying(20) |
standard_concept | character varying(1) |
Indexes:
"rc_dom" btree (domain_id, concept_id)
"rcdom" btree (domain_id)
"rcdomvocsc" btree (domain_id, vocabulary_id, standard_concept)
The table has 3,133,778 records, so Postgres shouldn't be ignoring the index because of small table size.
I filter on domain_id, which is indexed, and the index is ignored:
OHDSI=> explain select * from record_counts where domain_id = 'Drug';
QUERY PLAN
------------------------------------------------------------------------
Seq Scan on record_counts (cost=0.00..76744.81 rows=2079187 width=87)
Filter: ((domain_id)::text = 'Drug'::text)
I turn off seqscan and:
OHDSI=> set enable_seqscan=false;
SET
OHDSI=> explain select * from record_counts where domain_id = 'Drug';
QUERY PLAN
-------------------------------------------------------------------------------------
Bitmap Heap Scan on record_counts (cost=42042.13..105605.97 rows=2079187 width=87)
Recheck Cond: ((domain_id)::text = 'Drug'::text)
-> Bitmap Index Scan on rcdom (cost=0.00..41522.33 rows=2079187 width=0)
Index Cond: ((domain_id)::text = 'Drug'::text)
Indeed, the plan says it's going to be more expensive to use the index than not, but why? If the index lets it handle many fewer records, shouldn't it be quicker to use it?
OK, it looks like Postgres knew what it was doing. The particular value of the indexed column I was using ('Drug') happens to account for 66% of the rows in the table. So, yes, the filter makes the row set significantly smaller, but since those rows are scattered across many pages, the index doesn't allow them to be retrieved any faster.
OHDSI=> select domain_id, count(*) as rows, round((100 * count(*)::float / 3133778.0)::numeric,4) pct from record_counts group by 1 order by 2 desc;
domain_id | rows | pct
---------------------+---------+---------
Drug | 2074991 | 66.2137
Condition | 466882 | 14.8984
Observation | 217807 | 6.9503
Procedure | 165800 | 5.2907
Measurement | 127239 | 4.0602
Device | 29410 | 0.9385
Spec Anatomic Site | 28783 | 0.9185
Meas Value | 10415 | 0.3323
Unit | 2350 | 0.0750
Type Concept | 2170 | 0.0692
Provider Specialty | 1957 | 0.0624
Specimen | 1767 | 0.0564
Metadata | 1689 | 0.0539
Revenue Code | 538 | 0.0172
Place of Service | 480 | 0.0153
Race | 467 | 0.0149
Relationship | 242 | 0.0077
Condition/Obs | 182 | 0.0058
Currency | 180 | 0.0057
Condition/Meas | 115 | 0.0037
Route | 81 | 0.0026
Obs/Procedure | 78 | 0.0025
Condition/Device | 52 | 0.0017
Condition/Procedure | 25 | 0.0008
Meas/Procedure | 25 | 0.0008
Gender | 19 | 0.0006
Device/Procedure | 9 | 0.0003
Meas Value Operator | 9 | 0.0003
Visit | 8 | 0.0003
Drug/Procedure | 3 | 0.0001
Spec Disease Status | 3 | 0.0001
Ethnicity | 2 | 0.0001
When I use any other value in the where clause (including 'Condition', with 15% of the rows), Postgres does use the index.
(Somewhat surprisingly, even after I cluster the table on the domain_id index, it still doesn't use the index when I filter on 'Drug', but the possible performance improvement for a filter that only excludes 34% of the rows doesn't seem worth pursuing further.)
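For completeness, the planner sees this skew directly in its statistics; a minimal check against the standard pg_stats view:
-- most_common_vals / most_common_freqs should show 'Drug' at roughly 0.66
SELECT n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE schemaname = 'results2'
  AND tablename = 'record_counts'
  AND attname = 'domain_id';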

Postgresql very slow query on indexed column

I have a table with 50 million rows. One column, named u_sphinx, is very important; the available values are 1, 2, and 3. Right now all rows have value 3, but when I check for new rows (u_sphinx = 1) the query is very slow. What could be wrong? Maybe the index is broken? Server: Debian, 8 GB RAM, 4x Intel(R) Xeon(R) CPU E3-1220 V2 @ 3.10GHz
Table structure:
base=> \d u_user
Table "public.u_user"
Column | Type | Modifiers
u_ip | character varying |
u_agent | text |
u_agent_js | text |
u_resolution_id | integer |
u_os | character varying |
u_os_id | smallint |
u_platform | character varying |
u_language | character varying |
u_language_id | smallint |
u_language_js | character varying |
u_cookie | smallint |
u_java | smallint |
u_color_depth | integer |
u_flash | character varying |
u_charset | character varying |
u_doctype | character varying |
u_compat_mode | character varying |
u_sex | character varying |
u_age | character varying |
u_theme | character varying |
u_behave | character varying |
u_targeting | character varying |
u_resolution | character varying |
u_user_hash | bigint |
u_tech_hash | character varying |
u_last_target_data_time | integer |
u_last_target_prof_time | integer |
u_id | bigint | not null default nextval('u_user_u_id_seq'::regclass)
u_sphinx | smallint | not null default 1::smallint
Indexes:
"u_user_u_id_pk" PRIMARY KEY, btree (u_id)
"u_user_hash_index" btree (u_user_hash)
"u_user_u_sphinx_ind" btree (u_sphinx)
Slow query:
base=> explain analyze SELECT u_id FROM u_user WHERE u_sphinx = 1 LIMIT 1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..0.15 rows=1 width=8) (actual time=485146.252..485146.252 rows=0 loops=1)
-> Seq Scan on u_user (cost=0.00..3023707.80 rows=19848860 width=8) (actual time=485146.249..485146.249 rows=0 loops=1)
Filter: (u_sphinx = 1)
Rows Removed by Filter: 23170476
Total runtime: 485160.241 ms
(5 rows)
Solved:
After adding a partial index:
base=> explain analyze SELECT u_id FROM u_user WHERE u_sphinx = 1 LIMIT 1;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.27..4.28 rows=1 width=8) (actual time=0.063..0.063 rows=0 loops=1)
-> Index Scan using u_user_u_sphinx_index_1 on u_user (cost=0.27..4.28 rows=1 width=8) (actual time=0.061..0.061 rows=0 loops=1)
Index Cond: (u_sphinx = 1)
Total runtime: 0.106 ms
Thanks to @Kouber Saparev.
Try making a partial index.
CREATE INDEX u_user_u_sphinx_idx ON u_user (u_sphinx) WHERE u_sphinx = 1;
Your query plan looks like the DB is treating the query as if the value 1 were so common that it would be better off digging into a disk page or two sequentially to find a relevant row, instead of taking on the overhead of going through an index and then fetching the row from a random disk page.
This could be an indication that you forgot to run ANALYZE on the table, which the planner needs for proper stats:
ANALYZE u_user;