I'm running the following query, and the index I have created is not used; PostgreSQL uses a sequential scan instead of an index scan.
Query:
EXPLAIN (analyze, buffers)
SELECT date,
company,
host
FROM mytable
where date < ( now() - INTERVAL '10 days' )
The index I have created:
CREATE INDEX ix_mytable_company ON mytable USING btree (
host,
company,
date
);
Thanks for the reply, but I still have problems.
I get this no matter what I try:
Seq Scan on mytable  (cost=0.00..5577644.22 rows=65056015 width=34) (actual time=120.238..43372.108 rows=63058234 loops=1)
  Filter: (date < (now() - '10 days'::interval))
  Rows Removed by Filter: 2543550
  Buffers: shared hit=581914 read=3847699
Planning Time: 1.263 ms
JIT:
  Functions: 4
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 0.495 ms, Inlining 66.265 ms, Optimization 37.358 ms, Emission 16.325 ms, Total 120.443 ms
Execution Time: 46204.494 ms
Viewing the index shows this:
CREATE INDEX ix_mytable_company ON public.mytable USING btree (date) INCLUDE (host, company)
and I still run:
EXPLAIN (analyze, buffers) SELECT date, company, host FROM mytable where date < (now() - INTERVAL '10 days' )
6.5 million rows should qualify for an index scan, I think.
The problem is that date is not the first column in the index, so the index cannot be used for the query. It would be enough to have an index on date alone, but if you are targeting an index-only scan, you could use this index:
CREATE INDEX myindex ON mytable (date) INCLUDE (company, host);
/* for good performance, refresh the visibility map */
VACUUM mytable;
Note that an index will only be used if the table is big enough.
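A minimal sketch of the simpler alternative mentioned above, an index on date alone (the index name is only illustrative):
CREATE INDEX ix_mytable_date ON mytable (date);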
Related
I am using PostgreSQL 12. My table contains 1 billion records. It is partitioned by month based on a date range. Every day contains more than 3 million records. When I select some set of IDs, it takes a long time.
I want to filter on 20 days, but it takes a very long time. So I select records for one day only. That also takes more than 10 seconds...
Below is my query:
Select id,col1,col2,col3,col4,col5,col6,Logtime from myLog R
where R.id in(1818154 59,…………**500 IDS**………..,180556591)
and R.LogTime='2019-07-29'::date
and R.Logtime>='2019-07-29 00:00:00' and R.LogTime<='2019-07-30 00:00:00'
Order by R.LogTime desc;
Below is my query plan:
"Sort (cost=2556.35.2556.35 rows=1 width=298) (actual time 10088.084.10088.091 rows=557 loops-1)"
" Sort Key: R.LogTime DESC”
" Sort Method: quicksort Memory: 172 kB
-> Index Scan using p_my_log201907_LogTime_id_idx on p_mylog201907 r (cost=
0.56..2556.34 rows=1 width-298) (actual time=69.056..10085.712 rows=557 loops=1)"
Index Cond: (Logtime):: date = "2019-07-29’::date)
AND (id = ANY (‘{1818154 59,…………500 IDS………..,180556591}’::bigint[])
Filter: ( Logtime>= ‘2019-07-29 00:00:00’:: timestamp without time
AND (Logtime < ‘2019-07-30 00:00:00’:: timestamp without time zone)}"
"Planning Time: 0.664 ms
"Execution Time: 10088.189 ms
Below is my index:
CREATE INDEX Idx_LogTime ON MyLog ((LogTime::date) DESC, id DESC);
At the time of query execution I have set work_mem to '1GB'. How can I optimize and speed up my query?
I have a table with more than 3 million rows; one column, named creationdate, is a timestamp without time zone.
I created a couple of indexes on it, like:
"idx_routingtasks_creationdate" btree (creationdate)
"idx_routingtasks_creationdate2" btree ((creationdate::date))
When filtering by creationdate::date (casting to date), neither index is used:
explain analyze select * from routingtasks where creationdate::date < (now() - interval '1 day')::date;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on routingtasks (cost=0.00..1315631.93 rows=2811638 width=715) (actual time=186642.012..413763.643 rows=2800659 loops=1)
Filter: ((creationdate)::date < ((now() - '1 day'::interval))::date)
Rows Removed by Filter: 212248
Planning time: 0.195 ms
Execution time: 413875.829 ms
(5 rows)
The same happens without the cast to date:
explain analyze select * from routingtasks where creationdate < now() - interval '1 day';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Seq Scan on routingtasks (cost=0.00..1300588.39 rows=2918447 width=715) (actual time=72089.312..327288.333 rows=2876756 loops=1)
Filter: (creationdate < (now() - '1 day'::interval))
Rows Removed by Filter: 141052
Planning time: 0.104 ms
Execution time: 327401.745 ms
(5 rows)
How can I create an index on the creationdate column to allow this filter to use it?
The answer lies in this part of the execution plan:
Seq Scan ... (actual ... rows=2876756 ...)
...
Rows Removed by Filter: 141052
Since almost all rows are returned anyway, using a sequential scan and discarding the few rows that are filtered out is the most efficient way to process the query.
If you want to verify that, temporarily
SET enable_seqscan = off;
to make PostgreSQL avoid a sequential scan if possible. Then you can test if query execution gets faster or not.
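For example, a sketch of such a test session (table and column names taken from the question; SET LOCAL confines the change to one transaction):
BEGIN;
SET LOCAL enable_seqscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM routingtasks
WHERE creationdate < now() - INTERVAL '1 day';
ROLLBACK;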
The following search is taking 15 seconds despite an index on the last_time column, which is a timestamp:
select * from my_table where last_time>now() - interval '1 hour'
The table has around 15 million rows and the query returns around 100 rows.
I have recreated the index and vacuumed the table, but the search is still slow.
Similar searches on similar tables of comparable size return around 1000 rows in < 1 sec.
Seq Scan on my_table (cost=0.00..625486.61 rows=3817878 width=237) (actual time=16397.053..16397.054 rows=0 loops=1)
Filter: (last_time > (now() - '01:00:00'::interval))
Rows Removed by Filter: 11453235
Buffers: shared hit=73 read=424975
Planning Time: 0.290 ms
Execution Time: 16397.097 ms
Any suggestions?
Approximately every 10 minutes I insert ~50 records with the same timestamp.
That means ~300 records per hour, ~7,200 records per day, or ~2.6 million records per year.
The user wants to retrieve all records with the timestamp closest to the asked time.
Design #1 - one table with index on timestamp column:
CREATE TABLE A (t timestamp, value int);
CREATE INDEX a_idx ON A (t);
A single INSERT statement creates ~50 records with the same timestamp:
INSERT INTO A VALUES
('2019-01-02 10:00', 5),
('2019-01-02 10:00', 12),
('2019-01-02 10:00', 7),
…;
Get all records which are closest to the asked time
(I use the function greatest() available in PostgreSQL):
SELECT * FROM A WHERE t =
(SELECT t FROM A ORDER BY greatest(t - asked_time, asked_time - t) LIMIT 1)
I think this query is not efficient because it requires a full table scan.
I plan to partition the A table by timestamp to have one partition per year, but the approximate match above will still be slow.
Design #2 - create 2 tables:
1st table: to keep the unique timestamps and an auto-incremented PK,
2nd table: to keep data and a foreign key to the 1st table's PK
CREATE TABLE UNIQ_TIMESTAMP (id SERIAL PRIMARY KEY, t timestamp);
CREATE TABLE DATA (id INTEGER REFERENCES UNIQ_TIMESTAMP (id), value int);
CREATE INDEX data_time_idx ON DATA (id);
Get all records which are closest to the asked time:
SELECT * FROM DATA WHERE id =
(SELECT id FROM UNIQ_TIMESTAMP ORDER BY greatest(t - asked_time, asked_time - t) LIMIT 1)
It should run faster compared to Design #1 because the nested select scans the smaller table.
Disadvantages of this approach:
- I have to insert into 2 tables instead of just one
- I lose the ability to partition the DATA table by timestamp
What would you recommend?
I'd go with the single-table approach, perhaps partitioned by year so that it becomes easy to get rid of old data.
Create an index like
CREATE INDEX ON a (date_trunc('hour', t + INTERVAL '30 minutes'));
Then use your query like you wrote it, but add
AND date_trunc('hour', t + INTERVAL '30 minutes')
= date_trunc('hour', asked_time + INTERVAL '30 minutes')
The additional condition acts as a filter and can use the index.
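A sketch of the combined query, using one of the timestamps from the question in place of asked_time:
SELECT *
FROM a
WHERE t = (SELECT t
           FROM a
           ORDER BY greatest(t - timestamp '2019-01-02 10:00',
                             timestamp '2019-01-02 10:00' - t)
           LIMIT 1)
  AND date_trunc('hour', t + INTERVAL '30 minutes')
    = date_trunc('hour', timestamp '2019-01-02 10:00' + INTERVAL '30 minutes');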
You can use a UNION of two queries to find all timestamps closest to a given one:
(
select t
from a
where t >= timestamp '2019-03-01 17:00:00'
order by t
limit 1
)
union all
(
select t
from a
where t <= timestamp '2019-03-01 17:00:00'
order by t desc
limit 1
)
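This assumes a plain b-tree index on t; a minimal sketch (the name matches the a_t_idx that appears in the plan below):
create index a_t_idx on a (t);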
That will efficiently make use of an index on t. On a table with 10 million rows (~3 years of data), I get the following execution plan:
Append (cost=0.57..1.16 rows=2 width=8) (actual time=0.381..0.407 rows=2 loops=1)
Buffers: shared hit=6 read=4
I/O Timings: read=0.050
-> Limit (cost=0.57..0.58 rows=1 width=8) (actual time=0.380..0.381 rows=1 loops=1)
Output: a.t
Buffers: shared hit=1 read=4
I/O Timings: read=0.050
-> Index Only Scan using a_t_idx on stuff.a (cost=0.57..253023.35 rows=30699415 width=8) (actual time=0.380..0.380 rows=1 loops=1)
Output: a.t
Index Cond: (a.t >= '2019-03-01 17:00:00'::timestamp without time zone)
Heap Fetches: 0
Buffers: shared hit=1 read=4
I/O Timings: read=0.050
-> Limit (cost=0.57..0.58 rows=1 width=8) (actual time=0.024..0.025 rows=1 loops=1)
Output: a_1.t
Buffers: shared hit=5
-> Index Only Scan Backward using a_t_idx on stuff.a a_1 (cost=0.57..649469.88 rows=78800603 width=8) (actual time=0.024..0.024 rows=1 loops=1)
Output: a_1.t
Index Cond: (a_1.t <= '2019-03-01 17:00:00'::timestamp without time zone)
Heap Fetches: 0
Buffers: shared hit=5
Planning Time: 1.823 ms
Execution Time: 0.425 ms
As you can see it only requires very few I/O operations and that is pretty much independent of the table size.
The above can be used for an IN condition:
select *
from a
where t in (
(select t
from a
where t >= timestamp '2019-03-01 17:00:00'
order by t
limit 1)
union all
(select t
from a
where t <= timestamp '2019-03-01 17:00:00'
order by t desc
limit 1)
);
If you know you will never have more than 100 values close to that requested timestamp, you could remove the IN query completely and simply use a limit 100 in both parts of the union. That makes the query a bit more efficient as there is no second step for evaluating the IN condition, but might return more rows than you want.
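A sketch of that simplified variant:
(
select *
from a
where t >= timestamp '2019-03-01 17:00:00'
order by t
limit 100
)
union all
(
select *
from a
where t <= timestamp '2019-03-01 17:00:00'
order by t desc
limit 100
);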
If you always look for timestamps in the same year, then partitioning by year will indeed help with this.
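If you go that route, a possible sketch using declarative partitioning on PostgreSQL 11 or later (the table has to be created as partitioned from the start; the partition name is illustrative):
create table a (t timestamp, value int) partition by range (t);
create table a_2019 partition of a
    for values from ('2019-01-01') to ('2020-01-01');
create index on a (t);   -- on a partitioned table this cascades to every partition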
You can put that into a function if it is too complicated as a query:
create or replace function get_closest(p_tocheck timestamp)
returns timestamp
as
$$
select *
from (
(select t
from a
where t >= p_tocheck
order by t
limit 1)
union all
(select t
from a
where t <= p_tocheck
order by t desc
limit 1)
) x
order by greatest(t - p_tocheck, p_tocheck - t)
limit 1;
$$
language sql stable;
The query then becomes as simple as:
select *
from a
where t = get_closest(timestamp '2019-03-01 17:00:00');
Another solution is to use the btree_gist extension, which provides a "distance" operator <->.
Then you can install the extension and create a GiST index on the timestamp:
create extension btree_gist;
create index on a using gist (t);
and use the following query:
select *
from a where t in (select t
from a
order by t <-> timestamp '2019-03-01 17:00:00'
limit 1);
[PostgreSQL 9.6.1 on x86_64-pc-linux-gnu, compiled by gcc (Debian 6.2.0-10) 6.2.0 20161027, 64-bit]
I have a table with timestamp ranges:
create table testing.test as
select tsrange(d, null) ts from
generate_series(timestamp '2000-01-01', timestamp '2018-01-01', interval '1 minute') s(d);
I need to run the following query:
select *
from testing.test
where lower(ts)> '2017-06-17 20:00:00'::timestamp and upper_inf(ts)
Explain analyze result for table without indexes:
Seq Scan on test (cost=0.00..72482.26 rows=1052013 width=14) (actual time=2165.477..2239.781 rows=283920 loops=1)
Filter: (upper_inf(ts) AND (lower(ts) > '2017-06-17 20:00:00'::timestamp without time zone))
Rows Removed by Filter: 9184081
Planning time: 0.046 ms
Execution time: 2250.221 ms
Next I add the following partial index:
create index lower_rt_inf ON testing.test using btree(lower(ts)) where upper_inf(ts);
analyze testing.test;
Explain analyze result for table with partial index:
Index Scan using lower_rt_inf on test (cost=0.04..10939.03 rows=1051995 width=14) (actual time=0.037..52.083 rows=283920 loops=1)
Index Cond: (lower(ts) > '2017-06-17 20:00:00'::timestamp without time zone)
Planning time: 0.156 ms
Execution time: 62.900 ms
And:
SELECT null_frac, n_distinct, correlation FROM pg_catalog.pg_stats WHERE tablename = 'lower_rt_inf'
null_frac |n_distinct |correlation |
----------|-----------|------------|
0 |-1 |1 |
Then I create an index similar to the previous one, but without the partial condition:
create index lower_rt_full ON testing.test using btree(lower(ts));
analyze testing.test;
And now the same index is used, but the cost/rows are different:
Index Scan using lower_rt_inf on test (cost=0.04..1053.87 rows=101256 width=14) (actual time=0.029..58.613 rows=283920 loops=1)
Index Cond: (lower(ts) > '2017-06-17 20:00:00'::timestamp without time zone)
Planning time: 0.280 ms
Execution time: 71.794 ms
And a bit more:
select * from testing.test where lower(ts)> '2017-06-17 20:00:00'::timestamp;
Index Scan using lower_rt_full on test (cost=0.04..3159.52 rows=303767 width=14) (actual time=0.036..64.208 rows=283920 loops=1)
Index Cond: (lower(ts) > '2017-06-17 20:00:00'::timestamp without time zone)
Planning time: 0.099 ms
Execution time: 78.759 ms
How can I effectively use partial indexes for expressions?
What happens here is that the statistics on index lower_rt_full are used to estimate the row count, but statistics on lower_rt_inf, which is a partial index, aren't.
Since functions are a black box for PostgreSQL, it has no idea about the distribution of lower(ts) and uses a bad estimate.
After lower_rt_full has been created and the table analyzed, PostgreSQL has a good idea about this distribution and can estimate much better. Even if the index isn't used to execute the query, it is used for query planning.
Since upper_inf is also a function (black box), you would get an even better estimate if you had an index ON test (upper_inf(ts), lower(ts)).
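A minimal sketch of that suggested index, followed by a fresh ANALYZE so the planner can pick up the statistics:
create index on testing.test (upper_inf(ts), lower(ts));
analyze testing.test;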
For an explanation why partial indexes are not considered to estimate the number of result rows, see this comment in examine_variable in backend/utils/adt/selfuncs.c, which tries to find statistical data about an expression:
* Has it got stats? We only consider stats for
* non-partial indexes, since partial indexes probably
* don't reflect whole-relation statistics; the above
* check for uniqueness is the only info we take from
* a partial index.
Thanks for the answer.
Is the problem in using the function in the index (lower(ts))?
Or in that the function is used in the condition of the partial index?
Because, if I add a separate field "latest":
alter table testing.test add column latest boolean;
update testing.test set latest = upper_inf(ts);
create index lower_latest_rt ON testing.test using btree(lower(ts)) where latest = true;
alter index testing.lower_latest_rt alter column lower set statistics 1000;
analyze testing.test;
And execute the following query:
select *
from testing.test
where lower(ts)> '2017-06-17 20:00:00'::timestamp and latest = true
I get this result:
Index Scan using lower_latest_rt on test (cost=0.04..11406.44 rows=285833 width=23) (actual time=0.027..178.054 rows=283920 loops=1)
Index Cond: (lower(ts) > '2017-06-17 20:00:00'::timestamp without time zone)
Planning time: 1.788 ms
Execution time: 188.481 ms