Inconsistent statistics on expression with partial index - postgresql

[PostgreSQL 9.6.1 on x86_64-pc-linux-gnu, compiled by gcc (Debian 6.2.0-10) 6.2.0 20161027, 64-bit]
I have a table with timestamp ranges:
create table testing.test as
select tsrange(d, null) ts from
generate_series(timestamp '2000-01-01', timestamp '2018-01-01', interval '1 minute') s(d);
I need to run the following query:
select *
from testing.test
where lower(ts)> '2017-06-17 20:00:00'::timestamp and upper_inf(ts)
Explain analyze result for table without indexes:
Seq Scan on test (cost=0.00..72482.26 rows=1052013 width=14) (actual time=2165.477..2239.781 rows=283920 loops=1)
Filter: (upper_inf(ts) AND (lower(ts) > '2017-06-17 20:00:00'::timestamp without time zone))
Rows Removed by Filter: 9184081
Planning time: 0.046 ms
Execution time: 2250.221 ms
Next I'm going to add a following partial index:
create index lower_rt_inf ON testing.test using btree(lower(ts)) where upper_inf(ts);
analyze testing.test;
Explain analyze result for table with partial index:
Index Scan using lower_rt_inf on test (cost=0.04..10939.03 rows=1051995 width=14) (actual time=0.037..52.083 rows=283920 loops=1)
Index Cond: (lower(ts) > '2017-06-17 20:00:00'::timestamp without time zone)
Planning time: 0.156 ms
Execution time: 62.900 ms
And:
SELECT null_frac, n_distinct, correlation FROM pg_catalog.pg_stats WHERE tablename = 'lower_rt_inf'
null_frac |n_distinct |correlation |
----------|-----------|------------|
0 |-1 |1 |
Then I create an index similar to the previous one, but without partial condition:
create index lower_rt_full ON testing.test using btree(lower(ts));
analyze testing.test;
And now the same index is used, but the cost/rows are different:
Index Scan using lower_rt_inf on test (cost=0.04..1053.87 rows=101256 width=14) (actual time=0.029..58.613 rows=283920 loops=1)
Index Cond: (lower(ts) > '2017-06-17 20:00:00'::timestamp without time zone)
Planning time: 0.280 ms
Execution time: 71.794 ms
And a bit more:
select * from testing.test where lower(ts)> '2017-06-17 20:00:00'::timestamp;
Index Scan using lower_rt_full on test (cost=0.04..3159.52 rows=303767 width=14) (actual time=0.036..64.208 rows=283920 loops=1)
Index Cond: (lower(ts) > '2017-06-17 20:00:00'::timestamp without time zone)
Planning time: 0.099 ms
Execution time: 78.759 ms
How can I effectively use partial indexes for expressions?

What happens here is that the statistics on index lower_rt_full are used to estimate the row count, but statistics on lower_rt_inf, which is a partial index, aren't.
Since function are a black box for PostgreSQL, it has no idea about the distribution of lower(ts) and uses a bad estimate.
After lower_rt_full has been created and the table analyzed, PostgreSQL has a good idea about this distribution and can estimate much better. Even if the index isn't used to execute the query, it is used for query planning.
Since upper_inf is also a function (black box), you would get an even better estimate if you had an index ON test (upper_inf(ts), lower(ts)).
For an explanation why partial indexes are not considered to estimate the number of result rows, see this comment in examine_variable in backend/utils/adt/selfuncs.c, which tries to find statistical data about an expression:
* Has it got stats? We only consider stats for
* non-partial indexes, since partial indexes probably
* don't reflect whole-relation statistics; the above
* check for uniqueness is the only info we take from
* a partial index.

Thanks for the answer.
The problem in using the function in the index (lower(rt))?
Or in that the function is used in condition of partial index.
Because, if I add a separate field "latest":
alter table testing.test add column latest boolean;
update testing.test set latest = upper_inf(ts);
create index lower_latest_rt ON testing.test using btree(lower(ts)) where latest = true;
alter index testing.lower_latest_rt alter column lower set statistics 1000;
analyze testing.test;
And execute follwing query:
select *
from testing.test
where lower(ts)> '2017-06-17 20:00:00'::timestamp and latest = true
I have result:
Index Scan using lower_latest_rt on test (cost=0.04..11406.44 rows=285833 width=23) (actual time=0.027..178.054 rows=283920 loops=1)
Index Cond: (lower(ts) > '2017-06-17 20:00:00'::timestamp without time zone)
Planning time: 1.788 ms
Execution time: 188.481 ms

Related

PostgreSQL 10.4 - how to index jsonb for sql functions not operators?

I have a table named "k3_order" with jsonb column "json_delivery".
Example content of that column is:
{
"delivery_cost": "11.99",
"packageNumbers": [
"0000000596034Q"
]
}
I've created index on json_delivery->'packageNumbers':
CREATE INDEX test_idx ON k3_order USING gin(json_delivery->'packageNumbers');
Now I use this two SQL Queries:
SELECT id, delivery_method_id
FROM k3_order
WHERE jsonb_exists (json_delivery->'packageNumbers', '0000000596034Q');
SELECT id, delivery_method_id
FROM k3_order
WHERE json_delivery->'packageNumbers' ? '0000000596034Q';
The second is faster and using index, but the first doesn't.
Is there any way to create index in PostgreSQL 10.4 in order for query 1) to use it?
Is this even possible in PostgreSQL 10.4 or newer versions?
EXPLAIN ANALYZE SELECT id, delivery_method_id
FROM k3_order
WHERE jsonb_exists (json_delivery->'packageNumbers', > '0000000596034Q');
produces:
Seq Scan on k3_order (cost=0.00..117058.10 rows=216847 width=8 (actual time=162.001..569.863 rows=1 loops=1)
Filter: jsonb_exists((json_delivery -> 'packageNumbers'::text), '0000000596034Q'::text)
Rows Removed by Filter: 650539
Planning time: 0.748 ms
Execution time: 569.886 ms
EXPLAIN ANALYZE SELECT id, delivery_method_id
FROM k3_order
WHERE json_delivery->'packageNumbers' ? '0000000596034Q';
produces:
Bitmap Heap Scan on k3_order (cost=21.04..2479.03 rows=651 width=8) (actual time=0.022..0.022 rows=1 loops=1)
Recheck Cond: ((json_delivery -> 'packageNumbers'::text) ? '0000000596034Q'::text)
Heap Blocks: exact=1
-> Bitmap Index Scan on test_idx (cost=0.00..20.88 rows=651 width=0) (actual time=0.016..0.016 rows=1 loops=1)
Index Cond: ((json_delivery -> 'packageNumbers'::text) ? '0000000596034Q'::text)
Planning time: 0.182 ms
Execution time: 0.050 ms
Indexes can only be used by queries in the following cases:
the WHERE condition contains an expression of the form <indexed expression> <operator> <constant>, where
an index has been created on <indexed expression>
<operator> is an operator in the index family of the operator class of the index
<constant> is an expression that stays constant for the duration of the index scan
the ORDER BY clause has the same or the exact opposite ordering as the index definition, and the index access method supports sorting (from v13 on, an index can also be used if it contains the starting columns of the ORDER BY clause)
the PostgreSQL version is v12 and higher, and the WHERE condition contains an expression of the form bool_func(...), where the function returns boolean and has a planner support function.
Now json_delivery->'packageNumbers' ? '0000000596034Q' satisfies the first condition, so an index scan can be used.
jsonb_exists(json_delivery->'packageNumbers', > '0000000596034Q') could only use an index if there were a planner support function for jsonb_exists, but there is none:
SELECT prosupport FROM pg_proc
WHERE proname = 'jsonb_exists';
prosupport
════════════
-
(1 row)

Query not using index on timestamp without time zone field

I have a table with more than 3 million rows, one column named creationdate is a timestamp without time zone.
I created a couple of indexes on it, like:
"idx_routingtasks_creationdate" btree (creationdate)
"idx_routingtasks_creationdate2" btree ((creationdate::date))
When filter by creationdate::date (casting by date) the index idx_routingtasks_creationdate is not used:
explain analyze select * from routingtasks where creationdate::date < (now() - interval '1 day')::date;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on routingtasks (cost=0.00..1315631.93 rows=2811638 width=715) (actual time=186642.012..413763.643 rows=2800659 loops=1)
Filter: ((creationdate)::date < ((now() - '1 day'::interval))::date)
Rows Removed by Filter: 212248
Planning time: 0.195 ms
Execution time: 413875.829 ms
(5 rows)
The same when not casting by date:
explain analyze select * from routingtasks where creationdate < now() - interval '1 day';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Seq Scan on routingtasks (cost=0.00..1300588.39 rows=2918447 width=715) (actual time=72089.312..327288.333 rows=2876756 loops=1)
Filter: (creationdate < (now() - '1 day'::interval))
Rows Removed by Filter: 141052
Planning time: 0.104 ms
Execution time: 327401.745 ms
(5 rows)
How can I create an index on the creationdate column to allow this filter use it?
The answer lies in this part of the execution plan:
Seq Scan ... (actual ... rows=2876756 ...)
...
Rows Removed by Filter: 141052
Since almost all rows are returned anyway, using a sequential scan and discarding the few rows that are filtered out is the most efficient way to process the query.
If you want to verify that, temporarily
SET enable_seqscan = off;
to make PostgreSQL avoid a sequential scan if possible. Then you can test if query execution gets faster or not.

Multi-Column index not used for index-only scan, but partial index is

My question is why a multi-column index is not used for a index only scan, when a partial index with equivalent information (I think) is.
The table:
CREATE TABLE test
(
id INT,
descr TEXT,
flag BOOLEAN
);
INSERT INTO test
SELECT GENERATE_SERIES(1,100000) AS id,
MD5(RANDOM()::TEXT) AS descr,
(RANDOM() < 0.1) AS flag;
SELECT *
FROM test LIMIT 10;
A content sample:
id descr flag
1 81978ceb5514461fbad9af1152ad78f6 true
2 cc0aee68ba3e0095cc74d53e8da55fef false
3 689a76e5897d565638f8ddd2d2019b7a true
4 9df03bc2969a6af88cd1d6e0423d0f4c true
5 318983766d11f831e9f0df34606dc908 false
6 198102bb71640a16f28263b7fb56ba2e false
7 9bef7320389db46a8ad88ffa611e81b5 false
8 c1f0d637ee0a985aa7d768a78d2d97b1 false
9 781b4064f721ae3879d95579264b0aba false
10 c4582890bb1e9af430e0f36b50f5e88c false
The query I need to run is:
SELECT id
FROM test
WHERE flag;
Now if I use a partial index. The query (eventually) gets executed as an index only scan:
CREATE INDEX i1
ON test (id) WHERE flag;
QUERY PLAN
Index Only Scan using i1 on test (cost=0.29..354.95 rows=9911 width=4) (actual time=0.120..6.268 rows=9911 loops=1)
Heap Fetches: 9911
Buffers: shared hit=834 read=29
Planning time: 0.806 ms
Execution time: 6.922 ms
What I do not understand is: why a multi column index of the following form is never used for an index-only scan?
CREATE INDEX i2
ON test (flag, id);
QUERY PLAN
Bitmap Heap Scan on test (cost=189.10..1122.21 rows=9911 width=4) (actual time=0.767..5.986 rows=9911 loops=1)
Filter: flag
Heap Blocks: exact=834
Buffers: shared hit=863
-> Bitmap Index Scan on i2 (cost=0.00..186.62 rows=9911 width=0) (actual time=0.669..0.669 rows=9911 loops=1)
Index Cond: (flag = true)
Buffers: shared hit=29
Planning time: 0.090 ms
Execution time: 6.677 ms
Can't the query visit all consecutive leaves of the btree where flag is True to determine all ids?
(Note that tuple visibility probably is not the problem since the index-only scan is used with the partial index i1.)
My Postgres version is: PostgreSQL 9.6.2 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.3.1-14ubuntu2) 5.3.1 20160413, 64-bit
Without being able to determine the exact cause of the behavior on your system, it is probably caused by bad statistics or an inaccurate visibility map that mislead the PostgreSQL optimizer.
VACUUM (ANALYZE) test;
will do two things:
It will update the visibility map, which is necessary for PostgreSQL to make an informed decision if an index only scan can be performed or not.
It will collect statistics on the table, which cause PostgreSQL to estimate the number of result rows accurately so that it can choose the best plan.

Why doesn't PostgreSQL combine two indexes in this example?

I'm creating a snapshot of an order book from "order book events". The essense of the task is demonstrated by the example below:
CREATE TABLE t AS SELECT i.event_id, 10000*(round(i.event_id/10000,0)+1) AS last_event_id FROM ( SELECT * FROM generate_series(1,1000000) AS event_id ) i;
ALTER TABLE t ADD PRIMARY KEY (event_id);
CREATE INDEX t_idx ON t USING btree (last_event_id ASC NULLS LAST);
EXPLAIN ANALYZE SELECT * FROM T WHERE event_id <= 80001 and last_event_id >= 80001;
The output of EXPLAIN ANALYZE is as follows:
QUERY PLAN
-----------------------------------------------------------------------------
Index Scan using t_pkey on t (cost=0.42..2928.77 rows=73870 width=9) (actual time=52.526..52.529 rows=2 loops=1)
Index Cond: (event_id <= 80001)
Filter: (last_event_id >= '80001'::numeric)
Rows Removed by Filter: 79999
Planning time: 0.211 ms
Execution time: 52.578 ms
Thus PostgreSQL uses only t_pkey index and ignores t_idx.
Why doesn't PostgreSQL uses both t_pkey and t_idx?
I'm using PostgreSQL 9.6 on Centos 7

Postgres execution plan order of where clause and order by

I have a very simple SQL:
select * from email.email_task where acquire_time < now() and state IN ('CREATED', 'RELEASED') order by creation_time asc limit 1;
I have 2 indexes created:
Index of state
Index of state, acquire_time, creation_time
Ideally I think Postgres should pick the 2nd one since it matches every column required in this SQL:
However the execution plan shows differently, it uses neither of the indexes:
Limit (cost=187404.36..187404.36 rows=1 width=743)
-> Sort (cost=187404.36..190753.58 rows=1339690 width=743)
Sort Key: creation_time
-> Seq Scan on email_task (cost=0.00..180705.91 rows=1339690 width=743)
Filter: (((state)::text = 'CREATED'::text) AND (acquire_time < now()))
I understand that if the number of rows returned arrives like 10% of total, then it would pick Seq Scan over Index Scan. (As explained at Why does PostgreSQL perform sequential scan on indexed column?
) So that's why index1 is not picked.
What I don't understand is why index2 is not picked since matches all the columns?
Then I tried a 3rd index:
Index of create_time, acquire_time, state
And this time it uses the index3 (I add the index using another smaller database
perf_1 because the original one has 2 million rows and it takes too much time)
Limit (cost=0.29..0.36 rows=1 width=75) (actual time=0.043..0.043 rows=1 loops=1)
-> Index Scan using perf_1 on email_task (cost=0.29..763.76 rows=9998 width=75) (actual time=0.042..0.042 rows=1 loops=1)
Index Cond: (acquire_time < now())
Filter: ((state)::text = ANY ('{CREATED,RELEASED}'::text[]))
It seems that, Postgres execution planner is picking the order by clause first then the where clause which is a little bit counter-intuitive.
Is my understanding correct or there are some other factors that impact the Postgres planner?
Thanks in advance.