The following query takes 15 seconds despite an index on the last_time column, which is a timestamp:
select * from my_table where last_time > now() - interval '1 hour'
The table has around 15 M rows and the query returns around 100 rows.
I have recreated the index and vacuumed the table, but the query is still slow.
Similar queries on similar tables of comparable size return around 1000 rows in < 1 sec.
Seq Scan on my_table (cost=0.00..625486.61 rows=3817878 width=237) (actual time=16397.053..16397.054 rows=0 loops=1)
Filter: (last_time > (now() - '01:00:00'::interval))
Rows Removed by Filter: 11453235
Buffers: shared hit=73 read=424975
Planning Time: 0.290 ms
Execution Time: 16397.097 ms
Any suggestions?
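For what it's worth, the plan above estimates ~3.8 M matching rows while only a handful actually come back, which hints at stale planner statistics; as a hedged first check (a sketch, not a confirmed fix), note that plain VACUUM does not refresh the column statistics used for these estimates, so running ANALYZE and re-checking the plan is cheap:
-- Sketch only: refresh column statistics, then look at the plan again.
ANALYZE my_table;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM my_table WHERE last_time > now() - interval '1 hour';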
Related
So I have this query on a pretty big table:
SELECT * FROM datTable WHERE type='bla'
AND timestamp > (CURRENT_DATE - INTERVAL '1 day')
This query is too slow, around 5 seconds, and there is an index on type.
So I tried:
SELECT * FROM datTable WHERE type NOT IN ('blu','bli','blo')
AND timestamp > (CURRENT_DATE - INTERVAL '1 day')
This query is way better, around 1 second, but the issue is that I don't want this NOT IN type list hardcoded.
So I tried:
with res as (
SELECT * FROM datTable WHERE type NOT IN ('blu','bli','blo')
AND timestamp > (CURRENT_DATE - INTERVAL '1 day')
)
select * from res where type='bla'
And I'm back to bad performance, 5 seconds, same as before.
Any idea how I could get Postgres to deliver the 1-second performance while specifying positively the type I want ('bla')?
EDIT: EXPLAIN ANALYZE for the last query
GroupAggregate (cost=677400.59..677493.09 rows=3595 width=59) (actual time=4789.667..4803.183 rows=3527 loops=1)
Group Key: event_historic.sender
-> Sort (cost=677400.59..677412.48 rows=4756 width=23) (actual time=4789.646..4792.808 rows=68045 loops=1)
Sort Key: event_historic.sender
Sort Method: quicksort Memory: 9469kB
-> Bitmap Heap Scan on event_historic (cost=505379.21..677110.11 rows=4756 width=23) (actual time=4709.494..4769.437 rows=68045 loops=1)
Recheck Cond: (("timestamp" > (CURRENT_DATE - '1 day'::interval)) AND ((type)::text = 'NEAR_TRANSFER'::text))
Heap Blocks: exact=26404
-> BitmapAnd (cost=505379.21..505379.21 rows=44676 width=0) (actual time=4706.080..4706.082 rows=0 loops=1)
-> Bitmap Index Scan on event_historic_timestamp_idx (cost=0.00..3393.89 rows=263109 width=0) (actual time=167.838..167.838 rows=584877 loops=1)
Index Cond: ("timestamp" > (CURRENT_DATE - '1 day'::interval))
-> Bitmap Index Scan on event_historic_type_idx (cost=0.00..501982.69 rows=45316549 width=0) (actual time=4453.071..4453.071 rows=44279973 loops=1)
Index Cond: ((type)::text = 'NEAR_TRANSFER'::text)
Planning Time: 0.385 ms
JIT:
Functions: 10
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 2.505 ms, Inlining 18.102 ms, Optimization 87.745 ms, Emission 44.270 ms, Total 152.622 ms
Execution Time: 4809.099 ms
EDIT 2: After adding the index on (type, timestamp) the result is way faster:
HashAggregate (cost=156685.88..156786.59 rows=8057 width=59) (actual time=95.201..96.511 rows=3786 loops=1)
Group Key: sender
Batches: 1 Memory Usage: 2449kB
Buffers: shared hit=31041
-> Index Scan using typetimestamp on event_historic eh (cost=0.57..156087.67 rows=47857 width=44) (actual time=12.244..55.921 rows=76220 loops=1)
Index Cond: (((type)::text = 'NEAR_TRANSFER'::text) AND ("timestamp" > (CURRENT_DATE - '1 day'::interval)))
Buffers: shared hit=31041
Planning:
Buffers: shared hit=5
Planning Time: 0.567 ms
JIT:
Functions: 10
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 2.543 ms, Inlining 0.000 ms, Optimization 1.221 ms, Emission 10.819 ms, Total 14.584 ms
Execution Time: 99.496 ms
You need a two-column index on ((type::text), timestamp) to make that query fast.
Let me explain the reasoning behind the index order in detail. If type is first in the index, the index scan can start with the first index entry after ('NEAR_TRANSFER', <now - 1 day>) and scan all index entries until it hits the next type, so all the index entries that are found correspond to a result row. If the index order is the other way around, the scan has to start at the first entry after (<now - 1 day>, ...) and read all index entries up to the end of the index. It discards the index entries where type IS DISTINCT FROM 'NEAR_TRANSFER' and fetches the table rows for the remaining index entries. So this scan will fetch the same number of table rows, but has to read more index entries.
It is an old myth that the most selective column should come first in the index, but it is a myth nonetheless. For the reason described above, you should have the columns that are compared with = first in the index. The selectivity of the columns is irrelevant.
All this is speaking about a single query in isolation. But you always have to consider all the other queries in the workload, and for them it may make a difference how the columns are ordered.
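For illustration, a sketch of the recommended index; the EDIT 2 plan above shows it in use under the name typetimestamp:
CREATE INDEX typetimestamp ON event_historic ((type::text), "timestamp");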
A single index on timestamp and type might be faster:
CREATE INDEX idx1 ON datTable (timestamp, type);
Or maybe:
CREATE INDEX idx1 ON datTable (type, timestamp);
Check the query plan to see whether the new index is used. You may have to drop an old index as well, and most likely you can drop one of these two anyway.
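As a sketch of that check, using the table and filter from the question:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM datTable
WHERE type = 'bla'
  AND timestamp > (CURRENT_DATE - INTERVAL '1 day');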
I am using PostgreSQL 12. My table contains 1 billion records and is partitioned by month based on a date range. Every day adds more than 3 million records. When I select some set of ids, it takes a long time.
I want to filter on 20 days, but that takes a very long time, so I select records for one day only. That also takes more than 10 seconds...
Below is my query,
Select id, col1, col2, col3, col4, col5, col6, LogTime from myLog R
where R.id in (181815459, …………**500 IDS**……….., 180556591)
and R.LogTime::date = '2019-07-29'::date
and R.LogTime >= '2019-07-29 00:00:00' and R.LogTime <= '2019-07-30 00:00:00'
Order by R.LogTime desc;
Below is my query plan,
"Sort (cost=2556.35.2556.35 rows=1 width=298) (actual time 10088.084.10088.091 rows=557 loops-1)"
" Sort Key: R.LogTime DESC”
" Sort Method: quicksort Memory: 172 kB
-> Index Scan using p_my_log201907_LogTime_id_idx on p_mylog201907 r (cost=
0.56..2556.34 rows=1 width-298) (actual time=69.056..10085.712 rows=557 loops=1)"
Index Cond: (Logtime):: date = "2019-07-29’::date)
AND (id = ANY (‘{1818154 59,…………500 IDS………..,180556591}’::bigint[])
Filter: ( Logtime>= ‘2019-07-29 00:00:00’:: timestamp without time
AND (Logtime < ‘2019-07-30 00:00:00’:: timestamp without time zone)}"
"Planning Time: 0.664 ms
"Execution Time: 10088.189 ms
Below is my index,
CREATE INDEX Idx_LogTime ON MyLog( (LogTime::date) DESC,id desc);
At the time of query execution I have set work_mem to '1GB'. Please advise: how can I optimize and speed up my query?
I have a table with more than 3 million rows; one column, named creationdate, is a timestamp without time zone.
I created a couple of indexes on it, like:
"idx_routingtasks_creationdate" btree (creationdate)
"idx_routingtasks_creationdate2" btree ((creationdate::date))
When filtering by creationdate::date (casting to date), the index idx_routingtasks_creationdate is not used:
explain analyze select * from routingtasks where creationdate::date < (now() - interval '1 day')::date;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on routingtasks (cost=0.00..1315631.93 rows=2811638 width=715) (actual time=186642.012..413763.643 rows=2800659 loops=1)
Filter: ((creationdate)::date < ((now() - '1 day'::interval))::date)
Rows Removed by Filter: 212248
Planning time: 0.195 ms
Execution time: 413875.829 ms
(5 rows)
The same happens when not casting to date:
explain analyze select * from routingtasks where creationdate < now() - interval '1 day';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Seq Scan on routingtasks (cost=0.00..1300588.39 rows=2918447 width=715) (actual time=72089.312..327288.333 rows=2876756 loops=1)
Filter: (creationdate < (now() - '1 day'::interval))
Rows Removed by Filter: 141052
Planning time: 0.104 ms
Execution time: 327401.745 ms
(5 rows)
How can I create an index on the creationdate column that this filter can use?
The answer lies in this part of the execution plan:
Seq Scan ... (actual ... rows=2876756 ...)
...
Rows Removed by Filter: 141052
Since almost all rows are returned anyway, using a sequential scan and discarding the few rows that are filtered out is the most efficient way to process the query.
If you want to verify that, temporarily
SET enable_seqscan = off;
to make PostgreSQL avoid a sequential scan if possible. Then you can test if query execution gets faster or not.
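A minimal sketch of that test; RESET restores the default afterwards:
SET enable_seqscan = off;
EXPLAIN ANALYZE
SELECT * FROM routingtasks
WHERE creationdate::date < (now() - interval '1 day')::date;
RESET enable_seqscan;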
Approximately every 10 min I insert ~50 records with the same timestamp.
That means ~300 records per hour, 7,200 records per day, or 2,592,000 records per year.
User wants to retrieve all records for the timestamp closest to the asked time.
Design #1 - one table with index on timestamp column:
CREATE TABLE A (t timestamp, value int);
CREATE INDEX a_idx ON A (t);
Single insert statement creates ~50 records with the same timestamp:
INSERT INTO A VALUES
  ('2019-01-02 10:00', 5),
  ('2019-01-02 10:00', 12),
  ('2019-01-02 10:00', 7),
  …
Get all records which are closest to the asked time
(I use the function greatest() available in PostgreSQL):
SELECT * FROM A WHERE t =
(SELECT t FROM A ORDER BY greatest(t - asked_time, asked_time - t) LIMIT 1)
I think this query is not efficient because it requires a full table scan.
I plan to partition the A table by timestamp to have 1 partition per year, but the approximate match above will still be slow.
Design #2 - create 2 tables:
1st table: to keep the unique timestamps and auto-incremented PK,
2nd table: to keep data and the foreign key on 1st table PK
CREATE TABLE UNIQ_TIMESTAMP (id SERIAL PRIMARY KEY, t timestamp);
CREATE TABLE DATA (id INTEGER REFERENCES UNIQ_TIMESTAMP (id), value int);
CREATE INDEX data_time_idx ON DATA (id);
Get all records which are closest to the asked time:
SELECT * FROM DATA WHERE id =
(SELECT id FROM UNIQ_TIMESTAMP ORDER BY greatest(t - asked_time, asked_time - t) LIMIT 1)
It should run faster compared to Design #1 because the nested select scans the smaller table.
Disadvantages of this approach:
- I have to insert into 2 tables instead of just one
- I lose the ability to partition the DATA table by timestamp
What would you recommend?
I'd go with the single table approach, perhaps partitioned by year so that it becomes easy to get rid of old data.
Create an index like
CREATE INDEX ON a (date_trunc('hour', t + INTERVAL '30 minutes'));
Then use your query like you wrote it, but add
AND date_trunc('hour', t + INTERVAL '30 minutes')
= date_trunc('hour', asked_time + INTERVAL '30 minutes')
The additional condition acts as a filter and can use the index.
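A minimal sketch of the resulting query, assuming the extra condition is added inside the subquery (where the full scan used to happen), with asked_time standing for the requested timestamp as in the question:
-- sketch: asked_time is a placeholder for the requested timestamp
SELECT * FROM A WHERE t =
(SELECT t FROM A
 WHERE date_trunc('hour', t + INTERVAL '30 minutes')
     = date_trunc('hour', asked_time + INTERVAL '30 minutes')
 ORDER BY greatest(t - asked_time, asked_time - t) LIMIT 1)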
You can use a UNION of two queries to find all timestamps closest to a given one:
(
select t
from a
where t >= timestamp '2019-03-01 17:00:00'
order by t
limit 1
)
union all
(
select t
from a
where t <= timestamp '2019-03-01 17:00:00'
order by t desc
limit 1
)
That will efficiently make use of an index on t. On a table with 10 million rows (~3 years of data), I get the following execution plan:
Append (cost=0.57..1.16 rows=2 width=8) (actual time=0.381..0.407 rows=2 loops=1)
Buffers: shared hit=6 read=4
I/O Timings: read=0.050
-> Limit (cost=0.57..0.58 rows=1 width=8) (actual time=0.380..0.381 rows=1 loops=1)
Output: a.t
Buffers: shared hit=1 read=4
I/O Timings: read=0.050
-> Index Only Scan using a_t_idx on stuff.a (cost=0.57..253023.35 rows=30699415 width=8) (actual time=0.380..0.380 rows=1 loops=1)
Output: a.t
Index Cond: (a.t >= '2019-03-01 17:00:00'::timestamp without time zone)
Heap Fetches: 0
Buffers: shared hit=1 read=4
I/O Timings: read=0.050
-> Limit (cost=0.57..0.58 rows=1 width=8) (actual time=0.024..0.025 rows=1 loops=1)
Output: a_1.t
Buffers: shared hit=5
-> Index Only Scan Backward using a_t_idx on stuff.a a_1 (cost=0.57..649469.88 rows=78800603 width=8) (actual time=0.024..0.024 rows=1 loops=1)
Output: a_1.t
Index Cond: (a_1.t <= '2019-03-01 17:00:00'::timestamp without time zone)
Heap Fetches: 0
Buffers: shared hit=5
Planning Time: 1.823 ms
Execution Time: 0.425 ms
As you can see it only requires very few I/O operations and that is pretty much independent of the table size.
The above can be used for an IN condition:
select *
from a
where t in (
(select t
from a
where t >= timestamp '2019-03-01 17:00:00'
order by t
limit 1)
union all
(select t
from a
where t <= timestamp '2019-03-01 17:00:00'
order by t desc
limit 1)
);
If you know you will never have more than 100 values close to that requested timestamp, you could remove the IN query completely and simply use a limit 100 in both parts of the union. That makes the query a bit more efficient as there is no second step for evaluating the IN condition, but might return more rows than you want.
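A sketch of that variant, with the IN step removed and limit 100 in both branches:
(select *
 from a
 where t >= timestamp '2019-03-01 17:00:00'
 order by t
 limit 100)
union all
(select *
 from a
 where t <= timestamp '2019-03-01 17:00:00'
 order by t desc
 limit 100);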
If you always look for timestamps in the same year, then partitioning by year will indeed help with this.
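If you go that route, a minimal sketch of declarative yearly range partitioning (illustrative names, shown as if creating the table from scratch):
-- illustrative sketch of partitioning by year
create table a (t timestamp, value int) partition by range (t);
create table a_2018 partition of a for values from ('2018-01-01') to ('2019-01-01');
create table a_2019 partition of a for values from ('2019-01-01') to ('2020-01-01');
-- on PostgreSQL 11+ an index on the parent is created on every partition
create index on a (t);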
You can put that into a function if it is too complicated as a query:
create or replace function get_closest(p_tocheck timestamp)
returns timestamp
as
$$
select *
from (
(select t
from a
where t >= p_tocheck
order by t
limit 1)
union all
(select t
from a
where t <= p_tocheck
order by t desc
limit 1)
) x
order by greatest(t - p_tocheck, p_tocheck - t)
limit 1;
$$
language sql stable;
The query then gets as simple as:
select *
from a
where t = get_closest(timestamp '2019-03-01 17:00:00');
Another solution is to use the btree_gist extension which provides a "distance" operator <->
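The extension ships with the standard contrib modules, but has to be installed in the database first:
create extension if not exists btree_gist;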
Then you can create a GiST index on the timestamp:
create index on a using gist (t) ;
and use the following query:
select *
from a where t in (select t
from a
order by t <-> timestamp '2019-03-01 17:00:00'
limit 1);
[PostgreSQL 9.6.1 on x86_64-pc-linux-gnu, compiled by gcc (Debian 6.2.0-10) 6.2.0 20161027, 64-bit]
I have a table with timestamp ranges:
create table testing.test as
select tsrange(d, null) ts from
generate_series(timestamp '2000-01-01', timestamp '2018-01-01', interval '1 minute') s(d);
I need to run the following query:
select *
from testing.test
where lower(ts)> '2017-06-17 20:00:00'::timestamp and upper_inf(ts)
Explain analyze result for table without indexes:
Seq Scan on test (cost=0.00..72482.26 rows=1052013 width=14) (actual time=2165.477..2239.781 rows=283920 loops=1)
Filter: (upper_inf(ts) AND (lower(ts) > '2017-06-17 20:00:00'::timestamp without time zone))
Rows Removed by Filter: 9184081
Planning time: 0.046 ms
Execution time: 2250.221 ms
Next I'm going to add the following partial index:
create index lower_rt_inf ON testing.test using btree(lower(ts)) where upper_inf(ts);
analyze testing.test;
Explain analyze result for table with partial index:
Index Scan using lower_rt_inf on test (cost=0.04..10939.03 rows=1051995 width=14) (actual time=0.037..52.083 rows=283920 loops=1)
Index Cond: (lower(ts) > '2017-06-17 20:00:00'::timestamp without time zone)
Planning time: 0.156 ms
Execution time: 62.900 ms
And:
SELECT null_frac, n_distinct, correlation FROM pg_catalog.pg_stats WHERE tablename = 'lower_rt_inf'
null_frac |n_distinct |correlation |
----------|-----------|------------|
0 |-1 |1 |
Then I create an index similar to the previous one, but without the partial condition:
create index lower_rt_full ON testing.test using btree(lower(ts));
analyze testing.test;
And now the same index is used, but the cost/rows are different:
Index Scan using lower_rt_inf on test (cost=0.04..1053.87 rows=101256 width=14) (actual time=0.029..58.613 rows=283920 loops=1)
Index Cond: (lower(ts) > '2017-06-17 20:00:00'::timestamp without time zone)
Planning time: 0.280 ms
Execution time: 71.794 ms
And one more query, this time without the upper_inf(ts) condition:
select * from testing.test where lower(ts)> '2017-06-17 20:00:00'::timestamp;
Index Scan using lower_rt_full on test (cost=0.04..3159.52 rows=303767 width=14) (actual time=0.036..64.208 rows=283920 loops=1)
Index Cond: (lower(ts) > '2017-06-17 20:00:00'::timestamp without time zone)
Planning time: 0.099 ms
Execution time: 78.759 ms
How can I effectively use partial indexes for expressions?
What happens here is that the statistics on index lower_rt_full are used to estimate the row count, but statistics on lower_rt_inf, which is a partial index, aren't.
Since functions are a black box for PostgreSQL, it has no idea about the distribution of lower(ts) and uses a bad estimate.
After lower_rt_full has been created and the table analyzed, PostgreSQL has a good idea about this distribution and can estimate much better. Even if the index isn't used to execute the query, it is used for query planning.
Since upper_inf is also a function (black box), you would get an even better estimate if you had an index ON test (upper_inf(ts), lower(ts)).
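A sketch of that suggested index (function calls can be used directly as index expressions):
create index ON testing.test using btree (upper_inf(ts), lower(ts));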
For an explanation why partial indexes are not considered to estimate the number of result rows, see this comment in examine_variable in backend/utils/adt/selfuncs.c, which tries to find statistical data about an expression:
* Has it got stats? We only consider stats for
* non-partial indexes, since partial indexes probably
* don't reflect whole-relation statistics; the above
* check for uniqueness is the only info we take from
* a partial index.
Thanks for the answer.
Is the problem in using the function in the index (lower(ts))? Or in that the function is used in the condition of the partial index?
Because, if I add a separate field "latest":
alter table testing.test add column latest boolean;
update testing.test set latest = upper_inf(ts);
create index lower_latest_rt ON testing.test using btree(lower(ts)) where latest = true;
alter index testing.lower_latest_rt alter column lower set statistics 1000;
analyze testing.test;
And execute the following query:
select *
from testing.test
where lower(ts)> '2017-06-17 20:00:00'::timestamp and latest = true
I have result:
Index Scan using lower_latest_rt on test (cost=0.04..11406.44 rows=285833 width=23) (actual time=0.027..178.054 rows=283920 loops=1)
Index Cond: (lower(ts) > '2017-06-17 20:00:00'::timestamp without time zone)
Planning time: 1.788 ms
Execution time: 188.481 ms