Slow postgres text column query - postgresql

I have a btree index on a text column which holds a status token. Queries filtering on this field run 100x slower than queries without it. What can I do to speed them up? The column has rather low cardinality. I've tried hash, btree with text_pattern_ops, and partial indexes; the collation is UTF-8. Queries without the status filter all run in about the same time.
db=# show lc_collate;
lc_collate
-------------
en_US.UTF-8
(1 row)
db=# drop index job_status_idx;
DROP INDEX
db=# CREATE INDEX job_status_idx ON job(status text_pattern_ops);
CREATE INDEX
db=# select status, count(*) from job group by 1;
status | count
-------------+--------
pending | 365027
booked | 37515
submitted | 20783
cancelled | 191707
negotiating | 30
completed | 241339
canceled | 56
(7 rows)
db=# explain analyze SELECT key_ FROM "job" WHERE active = true and start > '2014-06-15T19:23:23.691670'::timestamp and status = 'completed' ORDER BY start DESC OFFSET 450 LIMIT 150;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=5723.07..7630.61 rows=150 width=51) (actual time=634.978..638.086 rows=150 loops=1)
-> Index Scan Backward using job_start_idx on job (cost=0.42..524054.39 rows=41209 width=51) (actual time=625.776..637.023 rows=600 loops=1)
Index Cond: (start > '2014-06-15 19:23:23.69167'::timestamp without time zone)
Filter: (active AND (status = 'completed'::text))
Rows Removed by Filter: 94866
Total runtime: 638.358 ms
(6 rows)
db=# explain analyze SELECT key_ FROM "job" WHERE active = true and start > '2014-06-15T19:23:23.691670'::timestamp ORDER BY start DESC OFFSET 450 LIMIT 150;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1585.61..2114.01 rows=150 width=51) (actual time=4.620..6.333 rows=150 loops=1)
-> Index Scan Backward using job_start_idx on job (cost=0.42..523679.58 rows=148661 width=51) (actual time=0.080..5.271 rows=600 loops=1)
Index Cond: (start > '2014-06-15 19:23:23.69167'::timestamp without time zone)
Filter: active
Total runtime: 6.584 ms
(5 rows)

This is your query:
SELECT key_
FROM "job"
WHERE active = true and
start > '2014-06-15T19:23:23.691670'::timestamp and
status = 'completed'
ORDER BY start DESC
OFFSET 450 LIMIT 150;
The index on status alone is not very selective. I would suggest a composite index:
CREATE INDEX job_status_idx ON job(status text_pattern_ops, active, start, key_);
This is a covering index: it matches the WHERE clause, supports the ORDER BY on start, and also contains key_, so the query can be answered from the index alone.
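If the planner still prefers job_start_idx, a narrower variant whose column order mirrors the query may be worth trying (a sketch against the schema shown in the question, not tested here):

```sql
-- Equality columns first (status, active), range/sort column last,
-- so rows within status = 'completed' come out ordered by start;
-- PostgreSQL can scan this backwards to satisfy ORDER BY start DESC.
CREATE INDEX job_status_start_idx ON job (status, active, start);
```

Note that for a plain equality comparison the default opclass is sufficient; text_pattern_ops only matters for LIKE/pattern queries under a non-C collation.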

Related

Why is index not being used when 'Check X Min' flag is set to true?

I have a simple query for a simple table in postgres. I have a simple index on that table.
In some environments it is using the index when performing the query, in other environments (on the same RDS instance, different database) it isn't. (checked using EXPLAIN ANALYSE)
One thing I've noticed is that if the 'Check X Min' flag on the index is TRUE then index is not used. (pg_catalog.pg_index.indcheckxmin)
How do I ensure the index is used and, presumably, have the 'Check X Min' flag set to false?
Table contains 100K+ rows.
Things I have tried:
The index is valid and is always used in environments where the 'Check X Min' is set to false.
set enable_seqscan to off; still does not use the index.
Creating/recreating an index in these environments always seems to have 'Check X Min' set to true.
Vacuuming does not seem to help.
Setup of table and index:
CREATE TABLE schema_1.table_1 (
field_1 varchar(20) NOT NULL,
field_2 int4 NULL,
field_3 timestamptz NULL,
field_4 numeric(10,2) NULL
);
CREATE INDEX table_1_field_1_field_3_idx ON schema_1.table_1 USING btree (field_1, field_3 DESC);
Query:
select field_1, field_2, field_3, field_4
from schema_1.table_1
where field_1 = 'abcdef'
order by field_3 desc limit 1;
When not using index:
QUERY PLAN |
---------------------------------------------------------------------------------------------------------------------|
Limit (cost=4.41..4.41 rows=1 width=51) (actual time=3.174..3.176 rows=1 loops=1) |
-> Sort (cost=4.41..4.42 rows=6 width=51) (actual time=3.174..3.174 rows=1 loops=1) |
Sort Key: field_3 DESC |
Sort Method: top-N heapsort Memory: 25kB |
-> Seq Scan on table_1 (cost=0.00..4.38 rows=6 width=51) (actual time=3.119..3.150 rows=3 loops=1)|
Filter: ((field_1)::text = 'abcdef'::text) |
Rows Removed by Filter: 96 |
Planning time: 2.895 ms |
Execution time: 3.197 ms |
When using index:
QUERY PLAN |
--------------------------------------------------------------------------------------------------------------------------------------------------------|
Limit (cost=0.28..6.30 rows=1 width=51) (actual time=0.070..0.144 rows=1 loops=1) |
-> Index Scan using table_1_field_1_field_3_idx on table_1 (cost=0.28..12.31 rows=2 width=51) (actual time=0.049..0.066 rows=1 loops=1)|
Index Cond: ((field_1)::text = 'abcdef'::text) |
Planning time: 0.184 ms |
Execution time: 0.303 ms |
I have renamed the fields, schema, and table to avoid sharing business context.
You seem to be using CREATE INDEX CONCURRENTLY, and have long-open transactions. From the docs:
Even then, however, the index may not be immediately usable for queries: in the worst case, it cannot be used as long as transactions exist that predate the start of the index build.
You don't have a lot of options here. Hunt down and fix your long-open transactions, don't use CONCURRENTLY, or put up with the limitation.
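To confirm the diagnosis, you can check which indexes carry the flag and whether any old transactions are still open. A sketch using the standard catalogs (pg_index, pg_class, pg_stat_activity); the one-hour cutoff is an arbitrary example:

```sql
-- Indexes whose use is restricted until pre-existing transactions finish
SELECT c.relname AS index_name, i.indcheckxmin
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE i.indcheckxmin;

-- Long-open transactions that keep such indexes unusable
SELECT pid, state, xact_start, query
FROM pg_stat_activity
WHERE xact_start < now() - interval '1 hour'
ORDER BY xact_start;
```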

Do Postgres multicolumn indexes get used for OR queries?

This seems like a straightforward question, but I can't find the answer online.
I'm using Postgres 9.4 and have this table:
Table "public.title"
Column | Type | Collation | Nullable | Default
---------------------------------+-------------------------+-----------+----------+-----------------------------------
id | integer | | not null | nextval('title_id_seq'::regclass)
name1 | character varying(1000) | | |
name2 | character varying(1000) | | |
name3 | character varying(1000) | | |
name4 | character varying(1000) | | |
And I have a multicolumn index:
"idx_title_names" btree (name1, name2, name3, name4)
But for OR queries, the index isn't being used:
EXPLAIN ANALYZE SELECT * FROM "title" WHERE ("title"."name1" = 'foo'
OR "title"."name2" = 'foo' OR "title"."name3" = 'foo' OR "title"."name4" = 'foo');
Gather (cost=1000.00..436451.46 rows=659 width=4500) (actual time=561.418..1297.877 rows=3222 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on title (cost=0.00..435385.56 rows=275 width=4500) (actual time=551.627..1286.724 rows=1074 loops=3)
Filter: (((name1)::text = 'foo'::text) OR ((name2)::text = 'foo'::text) OR ((name3)::text = 'foo'::text) OR ((name4)::text = 'foo'::text))
Rows Removed by Filter: 1231911
Planning Time: 0.102 ms
Execution Time: 1298.148 ms
Is this because these indexes don't work with OR queries?
And: if so, is my best bet just to create 4 separate standard indexes?
One option is to create a GIN index on the array of the columns, then use an array operator:
create index on title using gin (array[name1,name2,name3,name4]);
Then use
SELECT *
FROM title
WHERE array[name1,name2,name3,name4] @> array['foo'];
Note that a GIN index is a bit more expensive to maintain than a BTree index.
OR is often a performance problem in SQL.
This index cannot be used for a condition like that.
Your best bet is to create four single-column indexes and hope for a Bitmap Or:
CREATE INDEX ON public.title (name1);
CREATE INDEX ON public.title (name2);
CREATE INDEX ON public.title (name3);
CREATE INDEX ON public.title (name4);
An index on (col1, col2, col3, ...) will be used for conditions/ordering on col1; on col1 and col2; on col1, col2, and col3; and so on. It will not be used for conditions/ordering on, say, col3 alone.
Look at this:
# create table t as select random() as a, random() as b from generate_series(1,1000000);
# create index i on t(a,b);
# analyze t;
# explain analyze select * from t where a > 0.9;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=2246.83..8863.15 rows=96826 width=16) (actual time=10.973..28.023 rows=99311 loops=1)
Recheck Cond: (a > '0.9'::double precision)
Heap Blocks: exact=5406
-> Bitmap Index Scan on i (cost=0.00..2222.62 rows=96826 width=0) (actual time=10.251..10.252 rows=99311 loops=1)
Index Cond: (a > '0.9'::double precision)
Planning Time: 0.348 ms
Execution Time: 31.054 ms
# explain analyze select * from t where b > 0.9;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..17906.00 rows=99117 width=16) (actual time=0.015..70.505 rows=100137 loops=1)
Filter: (b > '0.9'::double precision)
Rows Removed by Filter: 899863
Planning Time: 0.090 ms
Execution Time: 73.656 ms
However, with an OR condition the DBMS would effectively have to perform several queries: for our example, select * from t where a > 0.9 or b > 0.9 is equivalent to the combination of select * from t where a > 0.9 (where the index can be used) and select * from t where b > 0.9 (where it cannot). So instead of two actions (scan the index, then scan the whole table), the DBMS performs only one: scan the whole table.
Hope this explains why your index is not used for your query.
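The "several queries" idea can be made explicit: rewriting the OR as a UNION lets each branch use its own index (the single-column indexes suggested above, or the leading column of the multicolumn one). A sketch for the title table from the question:

```sql
-- Each branch can use an index on its own column;
-- UNION (rather than UNION ALL) removes rows matched by more than one branch.
SELECT * FROM title WHERE name1 = 'foo'
UNION
SELECT * FROM title WHERE name2 = 'foo'
UNION
SELECT * FROM title WHERE name3 = 'foo'
UNION
SELECT * FROM title WHERE name4 = 'foo';
```

This only pays off when each branch is selective and indexed; otherwise a single sequential scan remains cheaper.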

Why does postgres scan all the table?

I discovered that a query I am doing on indexed column results in a sequential scan:
mydatabase=> explain analyze SELECT account_id,num_tokens,tok_holdings FROM tokacct WHERE address='00000000000000000';
QUERY PLAN
--------------------------------------------------------------------------------------------------
Seq Scan on tokacct (cost=0.00..6.69 rows=1 width=27) (actual time=0.046..0.046 rows=0 loops=1)
Filter: (address = '00000000000000000'::text)
Rows Removed by Filter: 225
Planning time: 0.108 ms
Execution time: 0.075 ms
(5 rows)
mydatabase=>
However a \di shows I have a unique index:
mydatabase=> \di
List of relations
Schema | Name | Type | Owner | Table
--------+-------------------------+-------+--------+----------------
......
public | tokacct_address_key | index | mydb | tokacct
.....
My table is defined like this:
CREATE TABLE tokacct (
tx_id BIGINT NOT NULL,
account_id SERIAL PRIMARY KEY,
state_acct_id INT NOT NULL DEFAULT 0,
num_tokens INT DEFAULT 0,
ts_created INT DEFAULT 0,
block_created INT DEFAULT 0,
address TEXT NOT NULL UNIQUE,
tok_holdings TEXT DEFAULT ''
);
As you can see, the address field is declared as UNIQUE. The \di also confirms there is an index. So, why does it use a sequential scan on the table ?
Seq Scan on tokacct (cost=0.00..6.69 rows=1 width=27) (actual time=0.046..0.046 rows=0 loops=1)
create a one-page table:
db=# create table small as select g, chr(g) from generate_series(1,200) g;
SELECT 200
db=# create index small_i on small(g);
CREATE INDEX
db=# analyze small;
ANALYZE
seq scan:
db=# explain (analyze, verbose, buffers) select g from small where g = 200;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Seq Scan on public.small (cost=0.00..3.50 rows=1 width=4) (actual time=0.044..0.045 rows=1 loops=1)
Output: g
Filter: (small.g = 200)
Rows Removed by Filter: 199
Buffers: shared hit=1
Planning time: 1.360 ms
Execution time: 0.066 ms
(7 rows)
create a three-page table:
db=# drop table small;
DROP TABLE
db=# create table small as select g, chr(g) from generate_series(1,500) g;
SELECT 500
db=# create index small_i on small(g);
CREATE INDEX
db=# analyze small;
ANALYZE
db=# explain (analyze, verbose, buffers) select g from small where g = 200;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Index Only Scan using small_i on public.small (cost=0.27..8.29 rows=1 width=4) (actual time=3.194..3.195 rows=1 loops=1)
Output: g
Index Cond: (small.g = 200)
Heap Fetches: 1
Buffers: shared hit=1 read=2
Planning time: 0.271 ms
Execution time: 3.747 ms
(7 rows)
now the table takes three pages and the index two, thus the index is cheaper...
How do I know the number of pages? The Buffers line in the (analyze, verbose, buffers) plan says so. And for the table?
db=# select max(ctid) from small;
max
--------
(2,48)
(1 row)
Here 2 means page two (pages count from zero), so the table occupies three pages.
or again from the verbose plan:
db=# set enable_indexonlyscan to off;
SET
db=# set enable_indexscan to off;
SET
db=# set enable_bitmapscan to off;
SET
db=# explain (analyze, verbose, buffers) select g from small where g = 200;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Seq Scan on public.small (cost=0.00..9.25 rows=1 width=4) (actual time=0.124..0.303 rows=1 loops=1)
Output: g
Filter: (small.g = 200)
Rows Removed by Filter: 499
Buffers: shared hit=3
Planning time: 0.105 ms
Execution time: 0.327 ms
(7 rows)
Here, hit=3
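Besides ctid and the Buffers line, the planner's own page counts are stored in the pg_class catalog (kept up to date by ANALYZE). A sketch for the tables in this example:

```sql
-- relpages: size in 8 kB pages as seen by the planner;
-- reltuples: the planner's row-count estimate.
SELECT relname, relpages, reltuples
FROM pg_class
WHERE relname IN ('small', 'small_i');
```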

Slow postgresql query if where clause matches no rows

I have a simple table in a PostgreSQL 9.0.3 database that holds data polled from a wind turbine controller. Each row represents the value of a particular sensor at a particular time. Currently the table has around 90M rows:
wtdata=> \d integer_data
Table "public.integer_data"
Column | Type | Modifiers
--------+--------------------------+-----------
date | timestamp with time zone | not null
name | character varying(64) | not null
value | integer | not null
Indexes:
"integer_data_pkey" PRIMARY KEY, btree (date, name)
"integer_data_date_idx" btree (date)
"integer_data_name_idx" btree (name)
One query that I need is to find the last time that a variable was updated:
select max(date) from integer_data where name = '<name of variable>';
This query works fine when searching for a variable that exists in the table:
wtdata=> select max(date) from integer_data where name = 'STATUS_OF_OUTPUTS_UINT16';
max
------------------------
2011-04-11 02:01:40-05
(1 row)
However, if I try and search for a variable that doesn't exist in the table, the query hangs (or takes longer than I have patience for):
select max(date) from integer_data where name = 'Message';
I've let the query run for hours and sometimes days with no end in sight. There are no rows in the table with name = 'Message':
wtdata=> select count(*) from integer_data where name = 'Message';
count
-------
0
(1 row)
I don't understand why one query is fast and the other takes forever. Is the query somehow being forced to scan the entire table for some reason?
wtdata=> explain select max(date) from integer_data where name = 'Message';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Result (cost=13.67..13.68 rows=1 width=0)
InitPlan 1 (returns $0)
-> Limit (cost=0.00..13.67 rows=1 width=8)
-> Index Scan Backward using integer_data_pkey on integer_data (cost=0.00..6362849.53 rows=465452 width=8)
Index Cond: ((date IS NOT NULL) AND ((name)::text = 'Message'::text))
(5 rows)
Here's the query plan for a fast query:
wtdata=> explain select max(date) from integer_data where name = 'STATUS_OF_OUTPUTS_UINT16';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Result (cost=4.64..4.65 rows=1 width=0)
InitPlan 1 (returns $0)
-> Limit (cost=0.00..4.64 rows=1 width=8)
-> Index Scan Backward using integer_data_pkey on integer_data (cost=0.00..16988170.38 rows=3659570 width=8)
Index Cond: ((date IS NOT NULL) AND ((name)::text = 'STATUS_OF_OUTPUTS_UINT16'::text))
(5 rows)
Change the primary key to (name, date). With the key ordered (date, name), the backward index scan has to step over every date looking for a matching name, which never arrives; with (name, date), the maximum date for a given name sits at one end of that name's range, so max(date) becomes a single index descent.
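If changing the primary key is not practical, an extra index in that column order achieves the same effect (a sketch; the index name is illustrative, and the existing single-column indexes then become largely redundant):

```sql
-- Equality column first, then the aggregated column:
-- max(date) for one name is the rightmost entry of its range,
-- and a name with no rows is detected immediately.
CREATE INDEX integer_data_name_date_idx ON integer_data (name, date);
```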

A slow SQL statement, is there any way to optimize it?

Our application has a very slow statement; it takes more than 11 seconds, so I want to know if there is any way to optimize it.
The SQL statement
SELECT id FROM mapfriends.cell_forum_topic WHERE id in (
SELECT topicid FROM mapfriends.cell_forum_item WHERE skyid=103230293 GROUP BY topicid )
AND categoryid=29 AND hidden=false ORDER BY restoretime DESC LIMIT 10 OFFSET 0;
id
---------
2471959
2382296
1535967
2432006
2367281
2159706
1501759
1549304
2179763
1598043
(10 rows)
Time: 11444.976 ms
Plan
friends=> explain SELECT id FROM friends.cell_forum_topic WHERE id in (
friends(> SELECT topicid FROM friends.cell_forum_item WHERE skyid=103230293 GROUP BY topicid)
friends-> AND categoryid=29 AND hidden=false ORDER BY restoretime DESC LIMIT 10 OFFSET 0;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1443.15..1443.15 rows=2 width=12)
-> Sort (cost=1443.15..1443.15 rows=2 width=12)
Sort Key: cell_forum_topic.restoretime
-> Nested Loop (cost=1434.28..1443.14 rows=2 width=12)
-> HashAggregate (cost=1434.28..1434.30 rows=2 width=4)
-> Index Scan using cell_forum_item_idx_skyid on cell_forum_item (cost=0.00..1430.49 rows=1516 width=4)
Index Cond: (skyid = 103230293)
-> Index Scan using cell_forum_topic_pkey on cell_forum_topic (cost=0.00..4.40 rows=1 width=12)
Index Cond: (cell_forum_topic.id = cell_forum_item.topicid)
Filter: ((NOT cell_forum_topic.hidden) AND (cell_forum_topic.categoryid = 29))
(10 rows)
Time: 1.109 ms
Indexes
friends=> \d cell_forum_item
Table "friends.cell_forum_item"
Column | Type | Modifiers
---------+--------------------------------+--------------------------------------------------------------
id | integer | not null default nextval('cell_forum_item_id_seq'::regclass)
topicid | integer | not null
skyid | integer | not null
content | character varying(200) |
addtime | timestamp(0) without time zone | default now()
ischeck | boolean |
Indexes:
"cell_forum_item_pkey" PRIMARY KEY, btree (id)
"cell_forum_item_idx" btree (topicid, skyid)
"cell_forum_item_idx_1" btree (topicid, id)
"cell_forum_item_idx_skyid" btree (skyid)
friends=> \d cell_forum_topic
Table "friends.cell_forum_topic"
Column | Type | Modifiers
-------------+--------------------------------+--------------------------------------------------------------------------------------
id | integer | not null default nextval(('"friends"."cell_forum_topic_id_seq"'::text)::regclass)
categoryid | integer | not null
topic | character varying | not null
content | character varying | not null
skyid | integer | not null
addtime | timestamp(0) without time zone | default now()
reference | integer | default 0
restore | integer | default 0
restoretime | timestamp(0) without time zone | default now()
locked | boolean | default false
settop | boolean | default false
hidden | boolean | default false
feature | boolean | default false
picid | integer | default 29249
managerid | integer |
imageid | integer | default 0
pass | boolean | default false
ischeck | boolean |
Indexes:
"cell_forum_topic_pkey" PRIMARY KEY, btree (id)
"idx_cell_forum_topic_1" btree (categoryid, settop, hidden, restoretime, skyid)
"idx_cell_forum_topic_2" btree (categoryid, hidden, restoretime, skyid)
"idx_cell_forum_topic_3" btree (categoryid, hidden, restoretime)
"idx_cell_forum_topic_4" btree (categoryid, hidden, restore)
"idx_cell_forum_topic_5" btree (categoryid, hidden, restoretime, feature)
"idx_cell_forum_topic_6" btree (categoryid, settop, hidden, restoretime)
Explain analyze
mapfriends=> explain analyze SELECT id FROM mapfriends.cell_forum_topic
mapfriends-> join (SELECT topicid FROM mapfriends.cell_forum_item WHERE skyid=103230293 GROUP BY topicid) as tmp
mapfriends-> on mapfriends.cell_forum_topic.id=tmp.topicid
mapfriends-> where categoryid=29 AND hidden=false ORDER BY restoretime DESC LIMIT 10 OFFSET 0;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------
Limit (cost=1446.89..1446.90 rows=2 width=12) (actual time=18016.006..18016.013 rows=10 loops=1)
-> Sort (cost=1446.89..1446.90 rows=2 width=12) (actual time=18016.001..18016.002 rows=10 loops=1)
Sort Key: cell_forum_topic.restoretime
Sort Method: quicksort Memory: 25kB
-> Nested Loop (cost=1438.02..1446.88 rows=2 width=12) (actual time=16988.492..18015.869 rows=20 loops=1)
-> HashAggregate (cost=1438.02..1438.04 rows=2 width=4) (actual time=15446.735..15447.243 rows=610 loops=1)
-> Index Scan using cell_forum_item_idx_skyid on cell_forum_item (cost=0.00..1434.22 rows=1520 width=4) (actual time=302.378..15429.782 rows=7133 loops=1)
Index Cond: (skyid = 103230293)
-> Index Scan using cell_forum_topic_pkey on cell_forum_topic (cost=0.00..4.40 rows=1 width=12) (actual time=4.210..4.210 rows=0 loops=610)
Index Cond: (cell_forum_topic.id = cell_forum_item.topicid)
Filter: ((NOT cell_forum_topic.hidden) AND (cell_forum_topic.categoryid = 29))
Total runtime: 18019.461 ms
Could you give us some more information about the tables (the statistics) and the configuration?
SELECT version();
SELECT category, name, setting FROM pg_settings WHERE name IN('effective_cache_size', 'enable_seqscan', 'shared_buffers');
SELECT * FROM pg_stat_user_tables WHERE relname IN('cell_forum_topic', 'cell_forum_item');
SELECT * FROM pg_stat_user_indexes WHERE relname IN('cell_forum_topic', 'cell_forum_item');
SELECT * FROM pg_stats WHERE tablename IN('cell_forum_topic', 'cell_forum_item');
And before getting this data, use ANALYZE.
It looks like you have a problem with an index; this is where the query spends all its time:
-> Index Scan using cell_forum_item_idx_skyid on cell_forum_item (cost=0.00..1434.22 rows=1520 width=4) (actual time=302.378..15429.782 rows=7133 loops=1)
If you use VACUUM FULL on a regular basis (NOT RECOMMENDED!), index bloat might be your problem. A REINDEX might be a good idea, just to be sure:
REINDEX TABLE cell_forum_item;
And talking about indexes: you can drop a couple of them, these two are obsolete:
"idx_cell_forum_topic_6" btree (categoryid, settop, hidden, restoretime)
"idx_cell_forum_topic_3" btree (categoryid, hidden, restoretime)
Other indexes begin with the same columns, so the database can use those for the same queries.
It looks like you have a couple of problems:
1. autovacuum is turned off or it's way behind. The last autovacuum was on 2010-12-02, and you have 256734 dead tuples in one table and 451430 dead ones in the other. You have to do something about this; it is a serious problem.
2. When autovacuum is working again, you have to do a VACUUM FULL and a REINDEX to force a table rewrite and get rid of all the empty space in your tables.
3. After fixing the vacuum problem, you have to ANALYZE as well: the database expects 1520 results but gets 7133. This could be a problem with statistics; maybe you have to increase STATISTICS.
4. The query itself needs some rewriting as well: it fetches 7133 results but needs only 610. Over 90% of the results are thrown away, and getting those 7133 takes a lot of time, over 15 seconds. Get rid of the subquery by using a JOIN without the GROUP BY, or use EXISTS, also without the GROUP BY.
But first get autovacuum back on track, before you get new or other problems.
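The EXISTS rewrite mentioned above could look like this (table and column names taken from the question; a sketch, not tested against this data):

```sql
-- EXISTS stops at the first matching item per topic,
-- so no GROUP BY / HashAggregate over all 7133 item rows is needed.
SELECT t.id
FROM mapfriends.cell_forum_topic t
WHERE t.categoryid = 29
  AND t.hidden = false
  AND EXISTS (
        SELECT 1
        FROM mapfriends.cell_forum_item i
        WHERE i.topicid = t.id
          AND i.skyid = 103230293
      )
ORDER BY t.restoretime DESC
LIMIT 10 OFFSET 0;
```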
The problem isn't a lack of query-plan caching; it is most likely the planner's choice of plan, caused by the lack of appropriate indexes.