Query not using index on timestamp without time zone field - postgresql

I have a table with more than 3 million rows, one column named creationdate is a timestamp without time zone.
I created a couple of indexes on it, like:
"idx_routingtasks_creationdate" btree (creationdate)
"idx_routingtasks_creationdate2" btree ((creationdate::date))
When filtering by creationdate::date (casting to date), the expression index idx_routingtasks_creationdate2 is not used:
explain analyze select * from routingtasks where creationdate::date < (now() - interval '1 day')::date;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on routingtasks (cost=0.00..1315631.93 rows=2811638 width=715) (actual time=186642.012..413763.643 rows=2800659 loops=1)
Filter: ((creationdate)::date < ((now() - '1 day'::interval))::date)
Rows Removed by Filter: 212248
Planning time: 0.195 ms
Execution time: 413875.829 ms
(5 rows)
The same happens when not casting to date:
explain analyze select * from routingtasks where creationdate < now() - interval '1 day';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Seq Scan on routingtasks (cost=0.00..1300588.39 rows=2918447 width=715) (actual time=72089.312..327288.333 rows=2876756 loops=1)
Filter: (creationdate < (now() - '1 day'::interval))
Rows Removed by Filter: 141052
Planning time: 0.104 ms
Execution time: 327401.745 ms
(5 rows)
How can I create an index on the creationdate column that allows this filter to use it?

The answer lies in this part of the execution plan:
Seq Scan ... (actual ... rows=2876756 ...)
...
Rows Removed by Filter: 141052
Since almost all rows are returned anyway, using a sequential scan and discarding the few rows that are filtered out is the most efficient way to process the query.
If you want to verify that, temporarily
SET enable_seqscan = off;
to make PostgreSQL avoid a sequential scan wherever possible. Then you can test whether query execution gets faster or not.
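For example, a throwaway test session might look like this (a sketch using the table and query from the question):
SET enable_seqscan = off;
EXPLAIN ANALYZE SELECT * FROM routingtasks WHERE creationdate < now() - interval '1 day';
RESET enable_seqscan;
If the forced index scan is not faster, the planner's choice of a sequential scan was justified.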

Related

How to speed up a select query on a partitioned table?

I am using PostgreSQL 12. My table contains 1 billion records, partitioned by month on a date range; every day adds more than 3 million records. When I select some set of IDs, it takes a long time.
I want to filter on 20 days, but that takes a very long time, so I selected records for one day only; even that took more than 10 seconds...
Below is my query,
Select id, col1, col2, col3, col4, col5, col6, Logtime from myLog R
where R.id in (181815459, …………**500 IDS**……….., 180556591)
and R.LogTime::date = '2019-07-29'::date
and R.Logtime >= '2019-07-29 00:00:00' and R.LogTime <= '2019-07-30 00:00:00'
Order by R.LogTime desc;
Below is my query plan,
"Sort (cost=2556.35.2556.35 rows=1 width=298) (actual time 10088.084.10088.091 rows=557 loops-1)"
" Sort Key: R.LogTime DESC”
" Sort Method: quicksort Memory: 172 kB
-> Index Scan using p_my_log201907_LogTime_id_idx on p_mylog201907 r (cost=
0.56..2556.34 rows=1 width-298) (actual time=69.056..10085.712 rows=557 loops=1)"
Index Cond: (Logtime):: date = "2019-07-29’::date)
AND (id = ANY (‘{1818154 59,…………500 IDS………..,180556591}’::bigint[])
Filter: ( Logtime>= ‘2019-07-29 00:00:00’:: timestamp without time
AND (Logtime < ‘2019-07-30 00:00:00’:: timestamp without time zone)}"
"Planning Time: 0.664 ms
"Execution Time: 10088.189 ms
Below is my index,
CREATE INDEX Idx_LogTime ON MyLog( (LogTime::date) DESC,id desc);
At the time of query execution I had set work_mem to '1GB'. Please suggest how I can optimize and speed up my query.

slow search on postgres table despite index

The following search takes 15 seconds despite an index on the last_time column, which is a timestamp:
select * from my_table where last_time>now() - interval '1 hour'
The table has around 15 M rows and the query returns around 100 rows.
I have recreated the index and have vacuumed the table, but the search is still slow.
Similar searches on similar tables of comparable size return around 1000 rows in < 1 sec.
Seq Scan on my_table (cost=0.00..625486.61 rows=3817878 width=237) (actual time=16397.053..16397.054 rows=0 loops=1)
Filter: (last_time > (now() - '01:00:00'::interval))
Rows Removed by Filter: 11453235
Buffers: shared hit=73 read=424975
Planning Time: 0.290 ms
Execution Time: 16397.097 ms
Any suggestions?

Database design for time series

Approximately every 10 min I insert ~50 records with the same timestamp.
That means ~300 records per hour, 7,200 records per day, or about 2,592,000 records per year.
The user wants to retrieve all records whose timestamp is closest to the asked time.
Design #1 - one table with index on timestamp column:
CREATE TABLE A (t timestamp, value int);
CREATE INDEX a_idx ON A (t);
Single insert statement creates ~50 records with the same timestamp:
INSERT INTO A VALUES
    ('2019-01-02 10:00', 5),
    ('2019-01-02 10:00', 12),
    ('2019-01-02 10:00', 7),
    ....
;
Get all records which are closest to the asked time
(I use the function greatest() available in PostgreSQL):
SELECT * FROM A WHERE t =
(SELECT t FROM A ORDER BY greatest(t - asked_time, asked_time - t) LIMIT 1)
I think this query is not efficient because it requires a full table scan.
I plan to partition table A by timestamp, with one partition per year, but the approximate match above will still be slow.
Design #2 - create 2 tables:
1st table: keeps the unique timestamps and an auto-incremented PK,
2nd table: keeps the data and a foreign key to the 1st table's PK
CREATE TABLE UNIQ_TIMESTAMP (id SERIAL PRIMARY KEY, t timestamp);
CREATE TABLE DATA (id INTEGER REFERENCES UNIQ_TIMESTAMP (id), value int);
CREATE INDEX data_time_idx ON DATA (id);
Get all records which are closest to the asked time:
SELECT * FROM DATA WHERE id =
(SELECT id FROM UNIQ_TIMESTAMP ORDER BY greatest(t - asked_time, asked_time - t) LIMIT 1)
It should run faster compared to Design #1 because the nested select scans the smaller table.
Disadvantages of this approach:
- I have to insert into 2 tables instead of just one
- I lose the ability to partition the DATA table by timestamp
What would you recommend?
I'd go with the single-table approach, perhaps partitioned by year so that it becomes easy to get rid of old data.
Create an index like
CREATE INDEX ON a (date_trunc('hour', t + INTERVAL '30 minutes'));
Then use your query like you wrote it, but add
AND date_trunc('hour', t + INTERVAL '30 minutes')
= date_trunc('hour', asked_time + INTERVAL '30 minutes')
The additional condition acts as a filter and can use the index.
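Putting it together, a sketch of the complete query (with a literal timestamp standing in for asked_time):
SELECT * FROM a WHERE t =
(SELECT t FROM a
 WHERE date_trunc('hour', t + INTERVAL '30 minutes')
     = date_trunc('hour', timestamp '2019-01-02 10:10' + INTERVAL '30 minutes')
 ORDER BY greatest(t - timestamp '2019-01-02 10:10', timestamp '2019-01-02 10:10' - t)
 LIMIT 1);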
You can use a UNION of two queries to find all timestamps closest to a given one:
(
select t
from a
where t >= timestamp '2019-03-01 17:00:00'
order by t
limit 1
)
union all
(
select t
from a
where t <= timestamp '2019-03-01 17:00:00'
order by t desc
limit 1
)
That will efficiently make use of an index on t. On a table with 10 million rows (~3 years of data), I get the following execution plan:
Append (cost=0.57..1.16 rows=2 width=8) (actual time=0.381..0.407 rows=2 loops=1)
Buffers: shared hit=6 read=4
I/O Timings: read=0.050
-> Limit (cost=0.57..0.58 rows=1 width=8) (actual time=0.380..0.381 rows=1 loops=1)
Output: a.t
Buffers: shared hit=1 read=4
I/O Timings: read=0.050
-> Index Only Scan using a_t_idx on stuff.a (cost=0.57..253023.35 rows=30699415 width=8) (actual time=0.380..0.380 rows=1 loops=1)
Output: a.t
Index Cond: (a.t >= '2019-03-01 17:00:00'::timestamp without time zone)
Heap Fetches: 0
Buffers: shared hit=1 read=4
I/O Timings: read=0.050
-> Limit (cost=0.57..0.58 rows=1 width=8) (actual time=0.024..0.025 rows=1 loops=1)
Output: a_1.t
Buffers: shared hit=5
-> Index Only Scan Backward using a_t_idx on stuff.a a_1 (cost=0.57..649469.88 rows=78800603 width=8) (actual time=0.024..0.024 rows=1 loops=1)
Output: a_1.t
Index Cond: (a_1.t <= '2019-03-01 17:00:00'::timestamp without time zone)
Heap Fetches: 0
Buffers: shared hit=5
Planning Time: 1.823 ms
Execution Time: 0.425 ms
As you can see it only requires very few I/O operations and that is pretty much independent of the table size.
The above can be used for an IN condition:
select *
from a
where t in (
(select t
from a
where t >= timestamp '2019-03-01 17:00:00'
order by t
limit 1)
union all
(select t
from a
where t <= timestamp '2019-03-01 17:00:00'
order by t desc
limit 1)
);
If you know you will never have more than 100 values close to the requested timestamp, you could remove the IN query completely and simply use limit 100 in both parts of the union. That makes the query a bit more efficient because there is no second step for evaluating the IN condition, but it might return more rows than you want.
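A sketch of that variant, assuming 100 is a safe upper bound:
(
select *
from a
where t >= timestamp '2019-03-01 17:00:00'
order by t
limit 100
)
union all
(
select *
from a
where t <= timestamp '2019-03-01 17:00:00'
order by t desc
limit 100
);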
If you always look for timestamps in the same year, then partitioning by year will indeed help with this.
You can put that into a function if it is too complicated as a query:
create or replace function get_closest(p_tocheck timestamp)
returns timestamp
as
$$
select *
from (
(select t
from a
where t >= p_tocheck
order by t
limit 1)
union all
(select t
from a
where t <= p_tocheck
order by t desc
limit 1)
) x
order by greatest(t - p_tocheck, p_tocheck - t)
limit 1;
$$
language sql stable;
Then the query gets as simple as:
select *
from a
where t = get_closest(timestamp '2019-03-01 17:00:00');
Another solution is to use the btree_gist extension, which provides a "distance" operator <->.
Then you can create a GiST index on the timestamp:
create extension btree_gist;
create index on a using gist (t);
and use the following query:
select *
from a where t in (select t
from a
order by t <-> timestamp '2019-03-01 17:00:00'
limit 1);

How to auto-extract or index day from timestamp in Postgres?

We have a Postgres table with a timestamp field created_at. On a regular basis, we need to find all records where the day component of created_at is a certain number.
We can run a query like
select * from table where extract(day from created_at) = 3;
I suspect this isn't efficient, i.e. it's doing a full-table scan. If so, can I create an index somehow to make the above efficient?
If it's not possible, we can create a separate column called created_at_day and create an index on it.
So we can simply run the query like
select * from table where created_at_day = 3;
Let's say created_at can be updated. Whenever this happens, created_at_day should be updated, too.
Does Postgres provide any support to automatically keep created_at_day in sync with created_at? If so, how?
Of course this can be done in the application logic. So whenever created_at is created or updated, we update the created_at_day column. But just wondering if there's an easier, automated way to do this.
Thanks
You can create an index on the expression extract(day from created_at).
To see the difference:
Create a table
knayak=# create table t as select i, now()::timestamp + interval '1 days' * i as created_at from generate_series(1,10000) as i;
SELECT 10000
Create normal index on created_at
knayak=# create index ind_created_at on t(created_at);
CREATE INDEX
knayak=# explain analyze select * from t where extract(day from created_at) = 3;
QUERY PLAN
-------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..205.00 rows=50 width=12) (actual time=1.049..6.020 rows=328 loops=1)
Filter: (date_part('day'::text, created_at) = '3'::double precision)
Rows Removed by Filter: 9672
Planning time: 0.392 ms
Execution time: 6.070 ms
(5 rows)
Create index with extract
knayak=# drop index ind_created_at;
DROP INDEX
knayak=# create index ind_created_at on t( extract(day from created_at) );
CREATE INDEX
knayak=# explain analyze select * from t where extract(day from created_at) = 3;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=4.67..61.66 rows=50 width=12) (actual time=0.110..0.260 rows=328 loops=1)
Recheck Cond: (date_part('day'::text, created_at) = '3'::double precision)
Heap Blocks: exact=54
-> Bitmap Index Scan on ind_created_at (cost=0.00..4.66 rows=50 width=0) (actual time=0.093..0.093 rows=328 loops=1)
Index Cond: (date_part('day'::text, created_at) = '3'::double precision)
Planning time: 0.316 ms
Execution time: 0.314 ms
(7 rows)
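As for the second part of the question (keeping created_at_day in sync automatically): on PostgreSQL 12 or later, a generated column is one option, so no trigger or application logic is needed. A minimal sketch, using the demo table t from above:
alter table t add column created_at_day int
    generated always as (extract(day from created_at)::int) stored;
The column is recomputed automatically whenever the row is inserted or updated, and it can be indexed like any ordinary column.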

Inconsistent statistics on expression with partial index

[PostgreSQL 9.6.1 on x86_64-pc-linux-gnu, compiled by gcc (Debian 6.2.0-10) 6.2.0 20161027, 64-bit]
I have a table with timestamp ranges:
create table testing.test as
select tsrange(d, null) ts from
generate_series(timestamp '2000-01-01', timestamp '2018-01-01', interval '1 minute') s(d);
I need to run the following query:
select *
from testing.test
where lower(ts)> '2017-06-17 20:00:00'::timestamp and upper_inf(ts)
Explain analyze result for table without indexes:
Seq Scan on test (cost=0.00..72482.26 rows=1052013 width=14) (actual time=2165.477..2239.781 rows=283920 loops=1)
Filter: (upper_inf(ts) AND (lower(ts) > '2017-06-17 20:00:00'::timestamp without time zone))
Rows Removed by Filter: 9184081
Planning time: 0.046 ms
Execution time: 2250.221 ms
Next I'm going to add the following partial index:
create index lower_rt_inf ON testing.test using btree(lower(ts)) where upper_inf(ts);
analyze testing.test;
Explain analyze result for table with partial index:
Index Scan using lower_rt_inf on test (cost=0.04..10939.03 rows=1051995 width=14) (actual time=0.037..52.083 rows=283920 loops=1)
Index Cond: (lower(ts) > '2017-06-17 20:00:00'::timestamp without time zone)
Planning time: 0.156 ms
Execution time: 62.900 ms
And:
SELECT null_frac, n_distinct, correlation FROM pg_catalog.pg_stats WHERE tablename = 'lower_rt_inf';

 null_frac | n_distinct | correlation
-----------+------------+-------------
         0 |         -1 |           1
Then I create an index similar to the previous one, but without the partial condition:
create index lower_rt_full ON testing.test using btree(lower(ts));
analyze testing.test;
And now the same index is used, but the cost/rows are different:
Index Scan using lower_rt_inf on test (cost=0.04..1053.87 rows=101256 width=14) (actual time=0.029..58.613 rows=283920 loops=1)
Index Cond: (lower(ts) > '2017-06-17 20:00:00'::timestamp without time zone)
Planning time: 0.280 ms
Execution time: 71.794 ms
And a bit more:
select * from testing.test where lower(ts)> '2017-06-17 20:00:00'::timestamp;
Index Scan using lower_rt_full on test (cost=0.04..3159.52 rows=303767 width=14) (actual time=0.036..64.208 rows=283920 loops=1)
Index Cond: (lower(ts) > '2017-06-17 20:00:00'::timestamp without time zone)
Planning time: 0.099 ms
Execution time: 78.759 ms
How can I effectively use partial indexes for expressions?
What happens here is that the statistics on index lower_rt_full are used to estimate the row count, but statistics on lower_rt_inf, which is a partial index, aren't.
Since functions are a black box for PostgreSQL, it has no idea about the distribution of lower(ts) and uses a bad estimate.
After lower_rt_full has been created and the table analyzed, PostgreSQL has a good idea about this distribution and can estimate much better. Even if the index isn't used to execute the query, it is used for query planning.
Since upper_inf is also a function (black box), you would get an even better estimate if you had an index ON test (upper_inf(ts), lower(ts)).
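A sketch of such an index (the index name here is made up); after the next ANALYZE, PostgreSQL gathers statistics on both indexed expressions:
create index lower_rt_both on testing.test using btree (upper_inf(ts), lower(ts));
analyze testing.test;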
For an explanation why partial indexes are not considered to estimate the number of result rows, see this comment in examine_variable in backend/utils/adt/selfuncs.c, which tries to find statistical data about an expression:
* Has it got stats? We only consider stats for
* non-partial indexes, since partial indexes probably
* don't reflect whole-relation statistics; the above
* check for uniqueness is the only info we take from
* a partial index.
Thanks for the answer.
Is the problem with using a function in the index (lower(ts))?
Or is it that a function is used in the condition of the partial index?
Because, if I add a separate field "latest":
alter table testing.test add column latest boolean;
update testing.test set latest = upper_inf(ts);
create index lower_latest_rt ON testing.test using btree(lower(ts)) where latest = true;
alter index testing.lower_latest_rt alter column lower set statistics 1000;
analyze testing.test;
And execute the following query:
select *
from testing.test
where lower(ts)> '2017-06-17 20:00:00'::timestamp and latest = true
I get this result:
Index Scan using lower_latest_rt on test (cost=0.04..11406.44 rows=285833 width=23) (actual time=0.027..178.054 rows=283920 loops=1)
Index Cond: (lower(ts) > '2017-06-17 20:00:00'::timestamp without time zone)
Planning time: 1.788 ms
Execution time: 188.481 ms