Why doesn't PostgreSQL combine two indexes in this example?

I'm creating a snapshot of an order book from "order book events". The essence of the task is demonstrated by the example below:
CREATE TABLE t AS
SELECT i.event_id, 10000*(round(i.event_id/10000,0)+1) AS last_event_id
FROM (SELECT * FROM generate_series(1,1000000) AS event_id) i;
ALTER TABLE t ADD PRIMARY KEY (event_id);
CREATE INDEX t_idx ON t USING btree (last_event_id ASC NULLS LAST);
EXPLAIN ANALYZE SELECT * FROM T WHERE event_id <= 80001 and last_event_id >= 80001;
The output of EXPLAIN ANALYZE is as follows:
QUERY PLAN
-----------------------------------------------------------------------------
Index Scan using t_pkey on t (cost=0.42..2928.77 rows=73870 width=9) (actual time=52.526..52.529 rows=2 loops=1)
Index Cond: (event_id <= 80001)
Filter: (last_event_id >= '80001'::numeric)
Rows Removed by Filter: 79999
Planning time: 0.211 ms
Execution time: 52.578 ms
Thus PostgreSQL uses only the t_pkey index and ignores t_idx.
Why doesn't PostgreSQL use both t_pkey and t_idx?
I'm using PostgreSQL 9.6 on CentOS 7.
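PostgreSQL can combine two btree indexes with a BitmapAnd, but only when its cost estimates favor that plan; note the estimate of 73870 rows against the actual 2. One way to experiment (the composite index below is my own suggestion, not something from the original thread) is to give the planner a single multicolumn index covering both columns:
-- last_event_id bounds the scan and event_id is checked inside the
-- index; since the table has only these two columns, an index-only
-- scan becomes possible.
CREATE INDEX t_last_event_idx ON t (last_event_id, event_id);
VACUUM ANALYZE t;  -- refresh statistics and the visibility map
EXPLAIN ANALYZE SELECT * FROM t WHERE event_id <= 80001 AND last_event_id >= 80001;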

Related

PostgreSQL not using index on JSONB text field if > in where clause

I'm trying to index a large JSONB column based on a text field (with an ISO date string). This index works fine using = but is ignored if I use a > condition.
create table test_table (
id text not null primary key,
data jsonb,
text_test text
);
Then I add a bunch of data to the jsonb column. To make sure my JSON is valid, I also extract/copy the value I'm interested in from the JSONB column into a separate text column to test against:
update test_table set text_test = (data->>'dueDate');
A quick sample shows good ISO-formatted date strings:
select text_test, (data->>'dueDate') from test_table limit 1;
-- 2020-08-07T11:59:00 2020-08-07T11:59:00
I add btree indexes to both the JSONB and the text_test copy column. I tried adding one with explicit '::text' casting, as well as one with 'text_pattern_ops'.
create index test_table_duedate_iso on test_table using btree(text_test);
create index test_table_duedate_iso_jsonb on test_table using btree((data->>'dueDate'));
create index test_table_duedate_iso_jsonb_cast on test_table using btree(((data->>'dueDate')::text));
create index test_table_duedate_iso_jsonb_cast_pattern on test_table using btree(((data->>'dueDate')::text) text_pattern_ops);
Now if I query an exact value, explain shows it using the 'cast' version of the index. Good.
explain select * from test_table where (data->>'dueDate') = '2020-08-07T11:59:00';
"-> Bitmap Index Scan on test_table_duedate_iso_jsonb_cast (cost=0.00..10.37 rows=261 width=0)"
But if I try it with a >, it does a very slow full scan.
explain analyze select count(*) from test_table where (data->>'dueDate') > '2020-04-14';
--Aggregate (cost=10037.94..10037.95 rows=1 width=8) (actual time=1070.808..1070.813 rows=1 loops=1)
-- -> Seq Scan on test_table (cost=0.00..9994.42 rows=17409 width=0) (actual time=0.069..1057.258 rows=2930 loops=1)
-- Filter: ((data ->> 'dueDate'::text) > '2020-04-14'::text)
-- Rows Removed by Filter: 49298
--Planning Time: 0.252 ms
--Execution Time: 1070.874 ms
So just to check my sanity, I run the same query against the text_test column, and it uses its index as desired:
explain analyze select count(*) from test_table where text_test > '2020-04-14';
--Aggregate (cost=6037.02..6037.03 rows=1 width=8) (actual time=19.979..19.984 rows=1 loops=1)
-- -> Bitmap Heap Scan on test_table (cost=77.76..6030.14 rows=2754 width=0) (actual time=1.354..11.007 rows=2930 loops=1)
-- Recheck Cond: (text_test > '2020-04-14'::text)
-- Heap Blocks: exact=455
-- -> Bitmap Index Scan on test_table_duedate_iso (cost=0.00..77.07 rows=2754 width=0) (actual time=1.215..1.217 rows=2930 loops=1)
-- Index Cond: (text_test > '2020-04-14'::text)
--Planning Time: 0.145 ms
--Execution Time: 20.041 ms
I have also tested indexing a numerical field within the JSON, and that works properly, using its index for range queries. So it's something about the text field, or something I'm doing wrong with it.
PostgreSQL 11.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-14), 64-bit
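One thing worth checking (my note, not part of the original post): PostgreSQL only collects statistics for an expression index when the table is analyzed after the index exists, and the seq-scan estimate of 17409 rows looks like the planner's default one-third selectivity guess for an inequality it has no statistics for. Re-analyzing may be enough to flip the plan:
-- Gather statistics on the indexed expression (data->>'dueDate') so the
-- planner can estimate the selectivity of the ">" condition.
analyze test_table;
explain analyze select count(*) from test_table where (data->>'dueDate') > '2020-04-14';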

Database design for time series

Approximately every 10 minutes I insert ~50 records with the same timestamp.
That means ~300 records per hour, ~7,200 records per day, or roughly 2.6 million records per year.
The user wants to retrieve all records for the timestamp closest to the asked time.
Design #1 - one table with index on timestamp column:
CREATE TABLE A (t timestamp, value int);
CREATE INDEX a_idx ON A (t);
Single insert statement creates ~50 records with the same timestamp:
INSERT INTO A VALUES
('2019-01-02 10:00', 5),
('2019-01-02 10:00', 12),
('2019-01-02 10:00', 7),
...;
Get all records which are closest to the asked time
(I use the function greatest() available in PostgreSQL):
SELECT * FROM A WHERE t =
(SELECT t FROM A ORDER BY greatest(t - asked_time, asked_time - t) LIMIT 1)
I think this query is not efficient because it requires a full table scan.
I plan to partition the A table by timestamp to have one partition per year, but the approximate match above will still be slow.
Design #2 - create 2 tables:
1st table: to keep the unique timestamps and auto-incremented PK,
2nd table: to keep data and the foreign key on 1st table PK
CREATE TABLE UNIQ_TIMESTAMP (id SERIAL PRIMARY KEY, t timestamp);
CREATE TABLE DATA (id INTEGER REFERENCES UNIQ_TIMESTAMP (id), value int);
CREATE INDEX data_time_idx ON DATA (id);
Get all records which are closest to the asked time:
SELECT * FROM DATA WHERE id =
(SELECT id FROM UNIQ_TIMESTAMP ORDER BY greatest(t - asked_time, asked_time - t) LIMIT 1)
It should run faster compared to Design #1 because the nested select scans the smaller table.
Disadvantages of this approach:
- I have to insert into 2 tables instead of just one
- I lose the ability to partition the DATA table by timestamp
What would you recommend?
I'd go with the single table approach, perhaps partitioned by year so that it becomes easy to get rid of old data.
Create an index like
CREATE INDEX ON a (date_trunc('hour', t + INTERVAL '30 minutes'));
Then use your query like you wrote it, but add
AND date_trunc('hour', t + INTERVAL '30 minutes')
= date_trunc('hour', asked_time + INTERVAL '30 minutes')
The additional condition acts as a filter and can use the index.
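Spelled out (this combined form is my own, with asked_time written as a literal), the query might look like:
select *
from a
where t = (select t
           from a
           where date_trunc('hour', t + interval '30 minutes')
               = date_trunc('hour', timestamp '2019-03-01 17:00:00' + interval '30 minutes')
           order by greatest(t - timestamp '2019-03-01 17:00:00',
                             timestamp '2019-03-01 17:00:00' - t)
           limit 1);
Note that the filter is approximate near bucket edges: a timestamp just outside the asked time's hour bucket could be slightly closer than the best match inside it.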
You can use a UNION of two queries to find all timestamps closest to a given one:
(
select t
from a
where t >= timestamp '2019-03-01 17:00:00'
order by t
limit 1
)
union all
(
select t
from a
where t <= timestamp '2019-03-01 17:00:00'
order by t desc
limit 1
)
That will efficiently make use of an index on t. On a table with 10 million rows (~3 years of data), I get the following execution plan:
Append (cost=0.57..1.16 rows=2 width=8) (actual time=0.381..0.407 rows=2 loops=1)
Buffers: shared hit=6 read=4
I/O Timings: read=0.050
-> Limit (cost=0.57..0.58 rows=1 width=8) (actual time=0.380..0.381 rows=1 loops=1)
Output: a.t
Buffers: shared hit=1 read=4
I/O Timings: read=0.050
-> Index Only Scan using a_t_idx on stuff.a (cost=0.57..253023.35 rows=30699415 width=8) (actual time=0.380..0.380 rows=1 loops=1)
Output: a.t
Index Cond: (a.t >= '2019-03-01 17:00:00'::timestamp without time zone)
Heap Fetches: 0
Buffers: shared hit=1 read=4
I/O Timings: read=0.050
-> Limit (cost=0.57..0.58 rows=1 width=8) (actual time=0.024..0.025 rows=1 loops=1)
Output: a_1.t
Buffers: shared hit=5
-> Index Only Scan Backward using a_t_idx on stuff.a a_1 (cost=0.57..649469.88 rows=78800603 width=8) (actual time=0.024..0.024 rows=1 loops=1)
Output: a_1.t
Index Cond: (a_1.t <= '2019-03-01 17:00:00'::timestamp without time zone)
Heap Fetches: 0
Buffers: shared hit=5
Planning Time: 1.823 ms
Execution Time: 0.425 ms
As you can see, it requires only a few I/O operations, and that is pretty much independent of the table size.
The above can be used for an IN condition:
select *
from a
where t in (
(select t
from a
where t >= timestamp '2019-03-01 17:00:00'
order by t
limit 1)
union all
(select t
from a
where t <= timestamp '2019-03-01 17:00:00'
order by t desc
limit 1)
);
If you know you will never have more than 100 values close to the requested timestamp, you could drop the IN query entirely and simply use LIMIT 100 in both parts of the UNION (see the sketch below). That makes the query a bit more efficient, as there is no second step to evaluate the IN condition, but it might return more rows than you want.
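A sketch of that variant (100 being the assumed upper bound from the paragraph above):
(select *
 from a
 where t >= timestamp '2019-03-01 17:00:00'
 order by t
 limit 100)
union all
(select *
 from a
 where t <= timestamp '2019-03-01 17:00:00'
 order by t desc
 limit 100);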
If you always look for timestamps in the same year, then partitioning by year will indeed help with this.
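For reference, a minimal declarative-partitioning sketch (PostgreSQL 11+ syntax, so the index can be declared once on the parent; all names are my own):
CREATE TABLE a_part (t timestamp, value int) PARTITION BY RANGE (t);
CREATE TABLE a_part_2019 PARTITION OF a_part
    FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');
CREATE TABLE a_part_2020 PARTITION OF a_part
    FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');
CREATE INDEX ON a_part (t);  -- cascades to every partition
-- Getting rid of an old year is then just: DROP TABLE a_part_2019;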
You can put that into a function if it is too complicated as a query:
create or replace function get_closest(p_tocheck timestamp)
returns timestamp
as
$$
select *
from (
(select t
from a
where t >= p_tocheck
order by t
limit 1)
union all
(select t
from a
where t <= p_tocheck
order by t desc
limit 1)
) x
order by greatest(t - p_tocheck, p_tocheck - t)
limit 1;
$$
language sql stable;
The query then gets as simple as:
select *
from a
where t = get_closest(timestamp '2019-03-01 17:00:00');
Another solution is to use the btree_gist extension, which provides a "distance" operator <->.
Then you can create a GiST index on the timestamp (the extension must be installed first):
create extension if not exists btree_gist;
create index on a using gist (t);
and use the following query:
select *
from a where t in (select t
from a
order by t <-> timestamp '2019-03-01 17:00:00'
limit 1);

How to auto-extract or index day from timestamp in Postgres?

We have a Postgres table with a timestamp field created_at. On a regular basis, we need to find all the records where the day-of-month part of created_at is a certain number.
We can run a query like
select * from table where extract(day from created_at) = 3;
I suspect this isn't efficient, i.e. it's doing a full-table scan. If so, can I create an index somehow to make the above efficient?
If it's not possible, we can create a separate column called created_at_day and create an index on it.
So we can simply run the query like
select * from table where created_at_day = 3;
Let's say created_at can be updated. Whenever this happens, created_at_day should be updated, too.
Does Postgres provide any support to automatically keep created_at_day in sync with created_at? If so, how?
Of course this can be done in the application logic. So whenever created_at is created or updated, we update the created_at_day column. But just wondering if there's an easier, automated way to do this.
Thanks
You can create an index on extract(day from created_at).
To see the difference:
Create a table
knayak=# create table t as select i ,now()::timestamp + interval '1 days' * i as created_at from generate_series(1,10000) as i;
SELECT 10000
Create normal index on created_at
knayak=# create index ind_created_at on t(created_at);
CREATE INDEX
knayak=# explain analyze select * from t where extract(day from created_at) = 3;
QUERY PLAN
-------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..205.00 rows=50 width=12) (actual time=1.049..6.020 rows=328 loops=1)
Filter: (date_part('day'::text, created_at) = '3'::double precision)
Rows Removed by Filter: 9672
Planning time: 0.392 ms
Execution time: 6.070 ms
(5 rows)
Create index with extract
knayak=# drop index ind_created_at;
DROP INDEX
knayak=# create index ind_created_at on t( extract(day from created_at) );
CREATE INDEX
knayak=# explain analyze select * from t where extract(day from created_at) = 3;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=4.67..61.66 rows=50 width=12) (actual time=0.110..0.260 rows=328 loops=1)
Recheck Cond: (date_part('day'::text, created_at) = '3'::double precision)
Heap Blocks: exact=54
-> Bitmap Index Scan on ind_created_at (cost=0.00..4.66 rows=50 width=0) (actual time=0.093..0.093 rows=328 loops=1)
Index Cond: (date_part('day'::text, created_at) = '3'::double precision)
Planning time: 0.316 ms
Execution time: 0.314 ms
(7 rows)
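As for the second part of the question, keeping created_at_day in sync automatically: on PostgreSQL 12+ a stored generated column does exactly this, and older versions can use a trigger instead. A hedged sketch (the table name my_table is hypothetical), although the expression index above usually makes the extra column unnecessary:
-- PostgreSQL 12+: the column is recomputed automatically whenever
-- created_at is inserted or updated, so it can never drift out of sync.
ALTER TABLE my_table
    ADD COLUMN created_at_day int
    GENERATED ALWAYS AS (extract(day from created_at)::int) STORED;
CREATE INDEX ON my_table (created_at_day);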

Cannot find tables that would join using nested loops

I feel like I will get lots of downvotes here, but let's give it a go.
I am trying to explain nested loop vs. hash vs. merge joins to my students on real examples. However, I am struggling to find tables that would join with nested loops (I tried many different sizes, index setups, etc.). Postgres always uses a hash join regardless of the table sizes, indexes, etc.
Could someone give an example of tables (with data) that would join with nested loops, without explicitly running set enable_hashjoin = off; beforehand?
The following does a nested loop for me (without disabling hash joins) on Postgres 10.5:
create table one (id integer primary key, some_ts timestamp, some_value integer);
insert into one values (1, clock_timestamp(), 42),(2, clock_timestamp(), 42);
create table two (id integer primary key, one_id integer not null references one, some_ts timestamp);
insert into two
select i, 1, clock_timestamp()
from generate_series(1,10) i;
insert into two
select i, 2, clock_timestamp()
from generate_series(11,20) i;
create index on two (one_id);
explain (analyze)
select one.*, two.id, two.some_ts
from one
join two on one.id = two.one_id
where one.id = 1;
Results in:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.15..4.23 rows=1 width=28) (actual time=0.029..0.033 rows=10 loops=1)
-> Index Scan using one_pkey on one (cost=0.15..3.16 rows=1 width=16) (actual time=0.016..0.016 rows=1 loops=1)
Index Cond: (id = 1)
-> Seq Scan on two (cost=0.00..1.07 rows=1 width=16) (actual time=0.011..0.014 rows=10 loops=1)
Filter: (one_id = 1)
Rows Removed by Filter: 10
Planning time: 0.130 ms
Execution time: 0.058 ms
Online example: http://rextester.com/CXZZ12304
Create some tables:
CREATE TABLE a (
a_id integer PRIMARY KEY,
a_val text NOT NULL
);
CREATE TABLE b (
b_id integer PRIMARY KEY,
a_id integer REFERENCES a(a_id) NOT NULL,
b_val text NOT NULL
);
-- *never* forget an index on the foreign key column
CREATE INDEX ON b(a_id);
Add some sample data:
INSERT INTO a
SELECT i, 'value ' || i FROM generate_series(1, 1000) i;
INSERT INTO b
SELECT i, (i + 1) / 2, 'value ' || i FROM generate_series(1, 2000) i;
Analyze the tables to get good statistics:
ANALYZE a;
ANALYZE b;
Let's run a sample query:
EXPLAIN SELECT a.a_val, b.b_val FROM a JOIN b USING (a_id) WHERE a_id = 42;
QUERY PLAN
---------------------------------------------------------------------------
Nested Loop (cost=0.55..16.62 rows=2 width=19)
-> Index Scan using a_pkey on a (cost=0.28..8.29 rows=1 width=13)
Index Cond: (a_id = 42)
-> Index Scan using b_a_id_idx on b (cost=0.28..8.31 rows=2 width=14)
Index Cond: (a_id = 42)
(5 rows)

Inconsistent statistics on expression with partial index

[PostgreSQL 9.6.1 on x86_64-pc-linux-gnu, compiled by gcc (Debian 6.2.0-10) 6.2.0 20161027, 64-bit]
I have a table with timestamp ranges:
create table testing.test as
select tsrange(d, null) ts from
generate_series(timestamp '2000-01-01', timestamp '2018-01-01', interval '1 minute') s(d);
I need to run the following query:
select *
from testing.test
where lower(ts)> '2017-06-17 20:00:00'::timestamp and upper_inf(ts)
Explain analyze result for table without indexes:
Seq Scan on test (cost=0.00..72482.26 rows=1052013 width=14) (actual time=2165.477..2239.781 rows=283920 loops=1)
Filter: (upper_inf(ts) AND (lower(ts) > '2017-06-17 20:00:00'::timestamp without time zone))
Rows Removed by Filter: 9184081
Planning time: 0.046 ms
Execution time: 2250.221 ms
Next I add the following partial index:
create index lower_rt_inf ON testing.test using btree(lower(ts)) where upper_inf(ts);
analyze testing.test;
Explain analyze result for table with partial index:
Index Scan using lower_rt_inf on test (cost=0.04..10939.03 rows=1051995 width=14) (actual time=0.037..52.083 rows=283920 loops=1)
Index Cond: (lower(ts) > '2017-06-17 20:00:00'::timestamp without time zone)
Planning time: 0.156 ms
Execution time: 62.900 ms
And:
SELECT null_frac, n_distinct, correlation FROM pg_catalog.pg_stats WHERE tablename = 'lower_rt_inf'
null_frac | n_distinct | correlation
----------+------------+-------------
        0 |         -1 |           1
Then I create an index similar to the previous one, but without the partial condition:
create index lower_rt_full ON testing.test using btree(lower(ts));
analyze testing.test;
And now the same index (lower_rt_inf) is used, but the cost and row estimates are different:
Index Scan using lower_rt_inf on test (cost=0.04..1053.87 rows=101256 width=14) (actual time=0.029..58.613 rows=283920 loops=1)
Index Cond: (lower(ts) > '2017-06-17 20:00:00'::timestamp without time zone)
Planning time: 0.280 ms
Execution time: 71.794 ms
And a bit more:
select * from testing.test where lower(ts)> '2017-06-17 20:00:00'::timestamp;
Index Scan using lower_rt_full on test (cost=0.04..3159.52 rows=303767 width=14) (actual time=0.036..64.208 rows=283920 loops=1)
Index Cond: (lower(ts) > '2017-06-17 20:00:00'::timestamp without time zone)
Planning time: 0.099 ms
Execution time: 78.759 ms
How can I effectively use partial indexes for expressions?
What happens here is that the statistics on index lower_rt_full are used to estimate the row count, but statistics on lower_rt_inf, which is a partial index, aren't.
Since functions are a black box for PostgreSQL, it has no idea about the distribution of lower(ts) and uses a bad estimate.
After lower_rt_full has been created and the table analyzed, PostgreSQL has a good idea about this distribution and can estimate much better. Even if the index isn't used to execute the query, it is used for query planning.
Since upper_inf is also a function (black box), you would get an even better estimate if you had an index ON test (upper_inf(ts), lower(ts)).
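A sketch of that suggested index (the index name is my own; as before, the subsequent ANALYZE is what actually populates the statistics):
-- Gives the planner statistics on both upper_inf(ts) and lower(ts).
create index upper_inf_lower_rt on testing.test using btree (upper_inf(ts), lower(ts));
analyze testing.test;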
For an explanation why partial indexes are not considered to estimate the number of result rows, see this comment in examine_variable in backend/utils/adt/selfuncs.c, which tries to find statistical data about an expression:
* Has it got stats? We only consider stats for
* non-partial indexes, since partial indexes probably
* don't reflect whole-relation statistics; the above
* check for uniqueness is the only info we take from
* a partial index.
Thanks for the answer.
Is the problem in using a function in the index (lower(ts))? Or in that a function is used in the condition of the partial index?
Because, if I add a separate field "latest":
alter table testing.test add column latest boolean;
update testing.test set latest = upper_inf(ts);
create index lower_latest_rt ON testing.test using btree(lower(ts)) where latest = true;
alter index testing.lower_latest_rt alter column lower set statistics 1000;
analyze testing.test;
And execute the following query:
select *
from testing.test
where lower(ts)> '2017-06-17 20:00:00'::timestamp and latest = true
I get this result:
Index Scan using lower_latest_rt on test (cost=0.04..11406.44 rows=285833 width=23) (actual time=0.027..178.054 rows=283920 loops=1)
Index Cond: (lower(ts) > '2017-06-17 20:00:00'::timestamp without time zone)
Planning time: 1.788 ms
Execution time: 188.481 ms