Partial index on timestamp against current time - postgresql

I have a query where I filter rows by comparing their insertion timestamps against a point five minutes in the past.
This field never gets updated; we can treat it as immutable if that helps.
CREATE TABLE events (
id serial PRIMARY KEY,
inserted_at timestamp without time zone DEFAULT now() NOT NULL
);
SELECT *
FROM events e
WHERE e.inserted_at >= (now() - '5 minutes'::interval);
And EXPLAIN ANALYZE VERBOSE:
Seq Scan on public.events e (cost=0.00..459.00 rows=57 width=12) (actual time=0.738..33.127 rows=56 loops=1)
Output: id, inserted_at
Filter: (e.inserted_at >= (now() - '5 minutes'::interval))
Rows Removed by Filter: 19944
Planning time: 0.156 ms
Execution time: 33.180 ms
It seems PostgreSQL performs a sequential scan on the table, which drives up the cost.
Can I create a partial B-tree index on this field, or do anything else to optimize this query?

A partial index on the last 5 minutes will need a rebuild every so often. You can build it concurrently (since your relation is under intensive use) from cron, dropping the old indexes. Such an approach would of course give you faster selects on the most recently inserted data, but consider the fact that at least every 5 minutes you have to rescan the table to build the short partial index.
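A minimal sketch of that cron-driven rebuild, assuming placeholder index names and cutoff (a partial-index predicate cannot call now(), so the cron job has to compute and inline a literal timestamp):
-- build the new generation without blocking writes; the cutoff is a precomputed literal
CREATE INDEX CONCURRENTLY events_recent_20240101_1205
ON events (inserted_at)
WHERE inserted_at >= '2024-01-01 12:00:00';
-- then drop the previous generation
DROP INDEX CONCURRENTLY events_recent_20240101_1200;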
The workaround is math: you can split the index build into stages (as a function):
select now()- inserted_at >= '5 minutes'::interval
from events
where id > (currval('events_id_seq') - 5*(1000000/30))
that is, get the rows whose id is greater than the last id value minus the approximate number of rows inserted in the last 5 minutes.
If the result is true, build the index with a dynamic query using the same math; if not, enlarge the step.
This way you scan only the primary key to build the index on the timestamp, which will be much cheaper.
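One possible reading of that "dynamic query with the same math" step, sketched with a hypothetical index name and a placeholder boundary (the literal must be computed by the calling code, since a partial-index predicate cannot call currval()):
-- hypothetical: partial index restricted by the estimated id boundary
CREATE INDEX CONCURRENTLY events_recent_by_id_idx
ON events (inserted_at)
WHERE id > 123456;  -- currval('events_id_seq') - 5*(1000000/30), computed beforehand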
Another point: if you apply such calculations, you might not need a partial index at all.

Related

Postgresql - Index scan - Slow filtering

I am trying to improve query performance on a big (500M rows) time-partitioned table. Here is the simplified table structure:
CREATE TABLE execution (
start_time TIMESTAMP WITH TIME ZONE NOT NULL,
end_time TIMESTAMP WITH TIME ZONE,
restriction_criteria VARCHAR(36) NOT NULL
) PARTITION BY RANGE (start_time);
Time partitioning
- is based on the start_time column, because the end_time value is not known when the row is created;
- is used to implement the retention policy efficiently.
Queries follow this generic pattern:
SELECT *
FROM execution
WHERE start_time BETWEEN :from AND :to
AND restriction_criteria IN ('123', '456')
ORDER BY end_time DESC, id
FETCH NEXT 20 ROWS ONLY;
I've got the "best" performances using this index
CREATE INDEX IF NOT EXISTS end_time_desc_start_time_index ON execution USING btree (end_time DESC, start_time);
Yet, performance is not good enough.
Limit (cost=1303.21..27189.31 rows=20 width=1674) (actual time=6791.191..6791.198 rows=20 loops=1)
-> Incremental Sort (cost=1303.21..250693964.74 rows=193689 width=1674) (actual time=6791.189..6791.194 rows=20 loops=1)
" Sort Key: execution.end_time DESC, execution.id"
Presorted Key: execution.end_time
Full-sort Groups: 1 Sort Method: quicksort Average Memory: 64kB Peak Memory: 64kB
-> Merge Append (cost=8.93..250685248.74 rows=193689 width=1674) (actual time=4082.161..6791.047 rows=21 loops=1)
Sort Key: execution.end_time DESC
Subplans Removed: 15
-> Index Scan using execution_2021_10_end_time_start_time_idx on execution_2021_10 execution_1 (cost=0.56..113448316.66 rows=93103 width=1674) (actual time=578.896..578.896 rows=1 loops=1)
Index Cond: ((start_time <= '2021-12-05 02:00:04+00'::timestamp with time zone) AND (start_time >= '2021-10-02 02:00:04+00'::timestamp with time zone))
" Filter: (((restriction_criteria)::text = ANY ('{123,456}'::text[])))"
Rows Removed by Filter: 734
-> Index Scan using execution_2021_11_end_time_start_time_idx on execution_2021_11 execution_2 (cost=0.56..113653576.54 rows=87605 width=1674) (actual time=116.841..116.841 rows=1 loops=1)
Index Cond: ((start_time <= '2021-12-05 02:00:04+00'::timestamp with time zone) AND (start_time >= '2021-10-02 02:00:04+00'::timestamp with time zone))
" Filter: (((restriction_criteria)::text = ANY ('{123,456}'::text[])))"
Rows Removed by Filter: 200
-> Index Scan using execution_2021_12_end_time_start_time_idx on execution_2021_12 execution_3 (cost=0.56..16367185.18 rows=12966 width=1674) (actual time=3386.416..6095.261 rows=21 loops=1)
Index Cond: ((start_time <= '2021-12-05 02:00:04+00'::timestamp with time zone) AND (start_time >= '2021-10-02 02:00:04+00'::timestamp with time zone))
" Filter: (((restriction_criteria)::text = ANY ('{123,456}'::text[])))"
Rows Removed by Filter: 5934
Planning Time: 4.108 ms
Execution Time: 6791.317 ms
The index Filter seems to be very slow.
I set up a multi-column index hoping the filtering would be done in the Index Cond, but it doesn't work:
CREATE INDEX IF NOT EXISTS pagination_index ON execution USING btree (end_time DESC, start_time, restriction_criteria);
My feeling is that the first index column should be end_time, because we want to leverage the btree index's sorting capability. The second one should be restriction_criteria, so that the index condition filters out rows which don't match the restriction_criteria. However, this doesn't work, because the query planner also needs to check the start_time clause.
The alternative I imagine is to get rid of the partitioning, because a multi-column end_time, restriction_criteria index would work just fine.
Yet, this is not a perfect solution, because dealing with our retention policy would become a pain.
Is there another alternative that allows keeping the start_time partitioning?
I set up a multi-column index hoping the filtering would be done in the Index cond
The index machinery is very circumspect about what code it runs inside the index. It won't call any operators that it doesn't 'trust', because if the operator throws an error then the whole query will error out, possibly due to rows that weren't even user 'visible' in the first place (i.e. ones that were already deleted or created but never committed). No one wants that. Now the =ANY construct could be considered trustable, but it is not. That means it won't be applied in the Index Cond, but must be applied against the table row, which in turn means you need to visit the table, which is probably where all your time is going, visiting random table rows.
I don't know what it would take code-wise to make =ANY trusted. I've made efforts to investigate that in the past but never really got anywhere; the code around the ANY is too complicated for me to grasp. That would be a nice improvement for the future, but won't help you now anyway.
One way around this is to get an index-only scan. At that point it will be willing to call arbitrary code in the index, as it already knows the tuple is visible. But it won't do that for you, because you are selecting at least one column not in the index (and also not shown in your CREATE command, but obviously present anyway).
If you create an index like your widest one but with "id" added to the end, and only select from among those columns, then you should get much faster index-only scans with merge appends; see the sketch below.
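A sketch of such an index, with an assumed name (the column list is the "widest" index from the question, pagination_index, with id appended):
CREATE INDEX execution_covering_idx
ON execution USING btree (end_time DESC, start_time, restriction_criteria, id);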
If you have even more columns than the ones you've shown plus "id", and you really need to select those columns, and don't want to add all of them to the index, then you can use a trick to get an index-only scan anyway by doing a dummy self-join:
with t as (SELECT id
FROM execution
WHERE start_time BETWEEN :from AND :to
AND restriction_criteria IN ('123', '456')
ORDER BY end_time DESC, id
FETCH NEXT 20 ROWS ONLY
)
select real.* from execution real join t using (id)
ORDER BY end_time DESC, id
(If "id" is not unique, then you might need to join on additional column. Also, you would need an index on "id", which you probably already have)
This one will still need to visit the table to fetch the extra columns, but only for the 20 rows being returned, not for all the ones failing the restriction_criteria.
If the restriction_criteria is very selective, another approach might be better: an index on or leading with that column. It will need to read and sort all of those rows (in the relevant partitions) before applying the LIMIT, but if it is very selective this will not take long.
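A hedged sketch of that alternative, with an assumed index name (end_time and id are included only so the ORDER BY columns are in the index; the suggestion above only requires leading with restriction_criteria):
CREATE INDEX execution_restriction_first_idx
ON execution USING btree (restriction_criteria, end_time DESC, id);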
While you can have the output sorted if the leading column is end_time, you can reduce the amount of data processed if you use start_time as the leading column.
Since your filter on start_time and restriction_criteria excludes ~7000 rows in order to retrieve 20, maybe speeding up the filtering is more important than speeding up the sorting.
CREATE INDEX IF NOT EXISTS execution_start_time_restriction_idx
ON execution USING btree (start_time, restriction_criteria);
CREATE INDEX IF NOT EXISTS execution_restriction_start_time_idx
ON execution USING btree (restriction_criteria, start_time);
ANALYZE execution;
If the number of rows matched by
FROM execution
WHERE start_time BETWEEN :from AND :to
AND restriction_criteria IN ('123', '456')
is larger than the number of rows removed by the filter, then having end_time as the leading column might be a good idea. But the planner should be able to figure that out for you.
In the end, if some of those indexes are not used, you can drop them.

Postgres TIMESTAMP index and query performance

I have this table:
CREATE TABLE IF NOT EXISTS CHANGE_REQUESTS (
ID UUID PRIMARY KEY,
FIELD_ID INTEGER NOT NULL,
LAST_CHANGE_DATE TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL
);
And I'm always going to be running the exact same query on it:
select * from change_requests where last_change_date > now() - INTERVAL '10 min';
The size of the table is going to be anywhere from 750k to 1 million rows on average.
My question is how can I make sure the query is always very fast? I'm thinking of adding an index on last_change_date, but I'm not sure if that will do anything. I tried it (with only 1 row in the table right now) and got this explain:
create index change_requests__dt_index
on change_requests (last_change_date);
Seq Scan on change_requests (cost=0.00..1.02 rows=1 width=28)
Filter: (last_change_date > (now() - '00:10:00'::interval))
So it doesn't appear to use the index at all.
Will this index actually help? If not, what else could I do? Thanks!
Your index is perfect for the task. You see the sequential scan in the execution plan because you don't have a realistic amount of test data in the table, and for very small tables the overhead of using the index is not worth the effort (you'd have to process more 8kB database blocks).
Always test with realistic amounts of data. That will save you some pain later on.
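A quick sketch for loading a realistic volume of test data and re-checking the plan (the row count and value distributions are assumptions; gen_random_uuid() is built in from PostgreSQL 13, older versions need the pgcrypto extension):
INSERT INTO change_requests (id, field_id, last_change_date)
SELECT gen_random_uuid(),
(random() * 100)::int,
now() - random() * interval '30 days'
FROM generate_series(1, 1000000);

ANALYZE change_requests;

EXPLAIN
SELECT * FROM change_requests
WHERE last_change_date > now() - INTERVAL '10 min';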

PSQL - Performance of query filtered by intervals on two separate fields

I have got a PostgreSQL table that covers time intervals.
This is a simplified structure of my table
CREATE TABLE intervals (
name varchar(40),
time_from timestamp,
time_to timestamp
);
The table contains millions of records, but if you filter on a specific point of time in the past, the number of records for which
time_from <= [requested time] <= time_to
is always very limited (no more than 3k results). So a query like this one
SELECT *
FROM intervals
WHERE time_from <= '2020-01-01T10:00:00' and time_to >= '2020-01-01T10:00:00'
is supposed to return a relatively small number of results and, in theory, should be quite fast if I used the correct index. But it's not fast at all.
I tried adding a combined index on time_from and time_to, but the engine doesn't pick it.
Seq Scan on intervals (cost=0.00..156152.46 rows=428312 width=32) (actual time=13.223..3599.840 rows=4981 loops=1)
Filter: ((time_from <= '2020-01-01T10:00:00') AND (time_to >= '2020-01-01T10:00:00'))
Rows Removed by Filter: 2089650
Planning Time: 0.159 ms
Execution Time: 3600.618 ms
What type of index should I add, in order to speed up this query?
A btree index cannot be very efficient here. It can quickly throw out everything whose time_from > '2020-01-01T10:00:00', but that is probably not all that much of the table (at least, not if your table goes back for many years). Once the first column of the index has been consumed in this way, the next column cannot be used very efficiently. It can only jump to a specific part of time_to values within the ties of time_from, and that is just not very useful as there are probably not all that many ties. (At least, not that it can prove to itself while planning your query).
What you need is a gist index, which specializes in this kind of multi-dimensional thing:
create extension btree_gist;
create index on intervals using gist (time_from,time_to);
This index will support your query as written. Another possibility is to build the index on the time ranges themselves, rather than on the separate begin and end points.
-- this one does not need btree_gist.
create index on intervals using gist (tsrange(time_from,time_to));
But this index forces you to write the query differently:
SELECT * FROM intervals
WHERE tsrange(time_from,time_to) @> '2020-01-01T10:00:00'::timestamp
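One caveat, as a hedged aside rather than part of the answer above: tsrange defaults to '[)' bounds, so the upper endpoint is exclusive, while the original query's time_to comparison is inclusive. If that matters, a closed-bound range can be indexed and queried the same way:
create index on intervals using gist (tsrange(time_from, time_to, '[]'));

SELECT *
FROM intervals
WHERE tsrange(time_from, time_to, '[]') @> '2020-01-01T10:00:00'::timestamp;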

Fetching rows not updated within last 24 hours

I have a large table (40+ million records) with a structure like the following:
CREATE TABLE collected_data(
id TEXT NOT NULL,
status TEXT NOT NULL,
PRIMARY KEY(id, status),
blob JSONB,
updated_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()
);
I need to get all (or at least 100,000) records that have an updated_at older than 24 hours, have a certain status, and have a blob that is not null.
So the query becomes:
SELECT
id
FROM
collected_data
WHERE
status = 'waiting'
AND blob IS NOT NULL
AND updated_at < NOW() - '24 hours'::interval
LIMIT 100000;
Which results in the execution plan of something like:
Limit (cost=0.00..234040.07 rows=100000 width=12)
-> Seq Scan on collected_data (cost=0.00..59236150.00 rows=25310265 width=12)
" Filter: ((blob IS NOT NULL) AND (type = 'waiting'::text) AND (updated_at >= (now() - '24:00:00'::interval)))"
It almost always results in a full table scan, which means that some queries are really slow.
I have tried to create indexes like CREATE INDEX idx_special ON collected_data(status, updated_at); but it does not help.
Is there any way I can make this query faster?
The planner thinks that 25,310,265 rows will meet your conditions, so it thinks it will be spoiled for choice in getting a mere 100,000 of them by a seq scan and then stopping early. If there aren't really that many, or there are that many but they are all clustered in the wrong part of the table, this won't actually be so fast. This is especially likely to be the case if, after selecting 100,000 of them, the next thing you do is update them in a way such that they no longer meet the criteria. Because then you have to keep walking past the accumulating remnants of the ones that used to qualify to find the next batch.
You can encourage it to use the index by adding 'order by updated_at' to your query. You could also stack the deck in your favor by creating a partial index CREATE INDEX ON collected_data(status, updated_at) where blob is not null or maybe CREATE INDEX ON collected_data(updated_at) where status='waiting' and blob is not null.
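Put together, a sketch of those suggestions (the index names are assumptions; either partial index on its own should be enough):
CREATE INDEX collected_data_status_updated_idx
ON collected_data (status, updated_at)
WHERE blob IS NOT NULL;

-- or the narrower variant
CREATE INDEX collected_data_waiting_updated_idx
ON collected_data (updated_at)
WHERE status = 'waiting' AND blob IS NOT NULL;

SELECT id
FROM collected_data
WHERE status = 'waiting'
AND blob IS NOT NULL
AND updated_at < NOW() - '24 hours'::interval
ORDER BY updated_at  -- encourages the planner to walk the index instead of a seq scan
LIMIT 100000;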

How do I improve date-based query performance on a large table?

This is related to 2 other questions I posted (it sounds like I should post this as a new question) - the feedback helped, but I think the same issue will come back the next time I need to insert data. Things were still running slowly, which forced me to temporarily remove some of the older data so that only 2 months' worth remained in the table that I'm querying.
Indexing strategy for different combinations of WHERE clauses incl. text patterns
How to get date_part query to hit index?
Giving further detail this time - hopefully it will help pinpoint the issue:
PG version 10.7 (running on Heroku)
Total DB size: 18.4GB (this contains 2 months worth of data, and it will grow at approximately the same rate each month)
15GB RAM
Total available storage: 512GB
The largest table (the one that the slowest query is acting on) is 9.6GB (it's the largest chunk of the total DB) - about 10 million records
Schema of the largest table:
-- Table Definition ----------------------------------------------
CREATE TABLE reportimpression (
datelocal timestamp without time zone,
devicename text,
network text,
sitecode text,
advertisername text,
mediafilename text,
gender text,
agegroup text,
views integer,
impressions integer,
dwelltime numeric
);
-- Indices -------------------------------------------------------
CREATE INDEX reportimpression_feb2019_index ON reportimpression(datelocal timestamp_ops) WHERE datelocal >= '2019-02-01 00:00:00'::timestamp without time zone AND datelocal < '2019-03-01 00:00:00'::timestamp without time zone;
CREATE INDEX reportimpression_mar2019_index ON reportimpression(datelocal timestamp_ops) WHERE datelocal >= '2019-03-01 00:00:00'::timestamp without time zone AND datelocal < '2019-04-01 00:00:00'::timestamp without time zone;
CREATE INDEX reportimpression_jan2019_index ON reportimpression(datelocal timestamp_ops) WHERE datelocal >= '2019-01-01 00:00:00'::timestamp without time zone AND datelocal < '2019-02-01 00:00:00'::timestamp without time zone;
Slow query:
SELECT
date_part('hour', datelocal) AS hour,
SUM(CASE WHEN gender = 'male' THEN views ELSE 0 END) AS male,
SUM(CASE WHEN gender = 'female' THEN views ELSE 0 END) AS female
FROM reportimpression
WHERE
datelocal >= '3-1-2019' AND
datelocal < '4-1-2019'
GROUP BY date_part('hour', datelocal)
ORDER BY date_part('hour', datelocal)
The date range in this query will generally be for an entire month (it accepts user input from a web based report) - as you can see, I tried creating an index for each month's worth of data. That helped, but as far as I can tell, unless the query has recently been run (putting the results into the cache), it can still take up to a minute to run.
Explain analyze results:
Finalize GroupAggregate (cost=1035890.38..1035897.86 rows=1361 width=24) (actual time=3536.089..3536.108 rows=24 loops=1)
Group Key: (date_part('hour'::text, datelocal))
-> Sort (cost=1035890.38..1035891.06 rows=1361 width=24) (actual time=3536.083..3536.087 rows=48 loops=1)
Sort Key: (date_part('hour'::text, datelocal))
Sort Method: quicksort Memory: 28kB
-> Gather (cost=1035735.34..1035876.21 rows=1361 width=24) (actual time=3535.926..3579.818 rows=48 loops=1)
Workers Planned: 1
Workers Launched: 1
-> Partial HashAggregate (cost=1034735.34..1034740.11 rows=1361 width=24) (actual time=3532.917..3532.933 rows=24 loops=2)
Group Key: date_part('hour'::text, datelocal)
-> Parallel Index Scan using reportimpression_mar2019_index on reportimpression (cost=0.09..1026482.42 rows=3301168 width=17) (actual time=0.045..2132.174 rows=2801158 loops=2)
Planning time: 0.517 ms
Execution time: 3579.965 ms
I wouldn't think 10 million records would be too much to handle, especially given that I recently bumped up the PG plan that I'm on to try to throw resources at it, so I assume that the issue is still just either my indexes or my queries not being very efficient.
A materialized view is the way to go for what you outlined. Querying past months of read-only data works without refreshing it. You may want to special-case the current month if you need to cover that, too.
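A minimal sketch of such a materialized view, assuming the name and the hourly pre-aggregation below (the report query would then group by date_part('hour', hour_local) over the requested date range):
CREATE MATERIALIZED VIEW reportimpression_hourly AS
SELECT date_trunc('hour', datelocal) AS hour_local,
SUM(CASE WHEN gender = 'male' THEN views ELSE 0 END) AS male,
SUM(CASE WHEN gender = 'female' THEN views ELSE 0 END) AS female
FROM reportimpression
GROUP BY 1;

-- refresh after loading new data, e.g. once per month
REFRESH MATERIALIZED VIEW reportimpression_hourly;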
The underlying query can still benefit from an index, and there are two directions you might take:
First off, partial indexes like you have now won't buy you much in your scenario; they are not worth it. If you collect many more months of data and mostly query by month (and add / drop rows by month), table partitioning might be an idea; then you have your indexes partitioned automatically, too. I'd consider Postgres 11 or even the upcoming Postgres 12 for this, though.
If your rows are wide, create an index that allows index-only scans. Like:
CREATE INDEX reportimpression_covering_idx ON reportimpression(datelocal, views, gender);
Related:
How does PostgreSQL perform ORDER BY if a b-tree index is built on that field?
Or INCLUDE additional columns in Postgres 11 or later:
CREATE INDEX reportimpression_covering_idx ON reportimpression(datelocal) INCLUDE (views, gender);
Else, if your rows are physically sorted by datelocal, consider a BRIN index. It's extremely small and probably about as fast as a B-tree index for your case. (But being so small it will stay cached much easier and not push other data out as much.)
CREATE INDEX reportimpression_brin_idx ON reportimpression USING BRIN (datelocal);
You may be interested in CLUSTER or pg_repack to physically sort table rows. pg_repack can do it without exclusive locks on the table and even without a btree index (required by CLUSTER). But it's an additional module not shipped with the standard distribution of Postgres.
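Hedged sketches of both options (the index name reuses the covering index example from above; the pg_repack invocation assumes the module is installed, and "mydb" is a placeholder database name):
-- CLUSTER needs a clusterable (e.g. btree) index and takes an exclusive lock
CLUSTER reportimpression USING reportimpression_covering_idx;

-- pg_repack, run from the shell, can reorder rows without an exclusive lock:
-- pg_repack --table=reportimpression --order-by=datelocal mydb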
Related:
Optimize Postgres deletion of orphaned records
How to reclaim disk space after delete without rebuilding table?
Your execution plan seems to be doing the right thing.
Things you can do to improve, in descending order of effectiveness:
Use a materialized view that pre-aggregates the data
Don't use a hosted database, use your own iron with good local storage and lots of RAM.
Use only one index instead of several partial ones. This is not primarily performance advice (the query will probably not be measurably slower unless you have a lot of indexes), but it will ease the management burden.