Postgres TIMESTAMP index and query performance

I have this table:
CREATE TABLE IF NOT EXISTS CHANGE_REQUESTS (
ID UUID PRIMARY KEY,
FIELD_ID INTEGER NOT NULL,
LAST_CHANGE_DATE TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL
);
And I'm always going to be running the exact same query on it:
select * from change_requests where last_change_date > now() - INTERVAL '10 min';
The size of the table is going to be anywhere from 750k to 1million rows on average.
My question is how can I make sure the query is always very fast? I'm thinking of adding an index on last_change_date, but I'm not sure if that will do anything. I tried it (with only 1 row in the table right now) and got this explain:
create index change_requests__dt_index
on change_requests (last_change_date);
Seq Scan on change_requests (cost=0.00..1.02 rows=1 width=28)
Filter: (last_change_date > (now() - '00:10:00'::interval))
So it doesn't appear to use the index at all.
Will this index actually help? If not, what else could I do? Thanks!

Your index is perfect for the task. You see the sequential scan in the execution plan because you don't have a realistic amount of test data in the table, and for very small tables the overhead of using the index is not worth the effort (you'd have to process more 8kB database blocks).
Always test with realistic amounts of data. That will save you some pain later on.
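For example, one way to load roughly a million rows of throwaway test data and look at the plan again (gen_random_uuid() needs PostgreSQL 13+ or the pgcrypto extension; the 30-day spread is arbitrary):
-- ~1 million rows with timestamps spread over the last 30 days
INSERT INTO change_requests (id, field_id, last_change_date)
SELECT gen_random_uuid(),
       (random() * 1000)::int,
       now() - random() * interval '30 days'
FROM generate_series(1, 1000000);

ANALYZE change_requests;

EXPLAIN ANALYZE
SELECT * FROM change_requests
WHERE last_change_date > now() - INTERVAL '10 min';
With only a few hundred of those million rows falling inside the 10-minute window, the plan should now show the index being used.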

Related

Incorrect index usage Postgresql Version 12

Query Plan:
db=> explain
db-> SELECT MIN("id"), MAX("id") FROM "public"."tablename" WHERE ( "updated_at" >= '2022-07-24 09:08:05.926533' AND "updated_at" < '2022-07-28 09:16:54.95459' );
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Result  (cost=128.94..128.95 rows=1 width=16)
  InitPlan 1 (returns $0)
    ->  Limit  (cost=0.57..64.47 rows=1 width=8)
          ->  Index Scan using tablename_pkey on tablename  (cost=0.57..416250679.26 rows=6513960 width=8)
                Index Cond: (id IS NOT NULL)
                Filter: ((updated_at >= '2022-07-24 09:08:05.926533'::timestamp without time zone) AND (updated_at < '2022-07-28 09:16:54.95459'::timestamp without time zone))
  InitPlan 2 (returns $1)
    ->  Limit  (cost=0.57..64.47 rows=1 width=8)
          ->  Index Scan Backward using tablename_pkey on tablename tablename_1  (cost=0.57..416250679.26 rows=6513960 width=8)
                Index Cond: (id IS NOT NULL)
                Filter: ((updated_at >= '2022-07-24 09:08:05.926533'::timestamp without time zone) AND (updated_at < '2022-07-28 09:16:54.95459'::timestamp without time zone))
(11 rows)
Indexes:
"tablename_pkey" PRIMARY KEY, btree (id)
"tablename_updated_at_incl_id_partial_idx" btree (updated_at) INCLUDE (id) WHERE updated_at >= '2022-07-01 00:00:00'::timestamp without time zone
The idea is: when there is already a filtered (partial) index containing only a small subset of records, why does the query do an index scan on the primary key instead of using tablename_updated_at_incl_id_partial_idx? Also, this is a heap table, not a clustered table.
Because you're using MIN and MAX, try redefining your second index so id is part of the BTREE index, not just INCLUDEd in it. That may make searching for the MIN and MAX items faster.
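A sketch of that redefinition, keeping the same partial predicate (the index name here is illustrative):
CREATE INDEX tablename_updated_at_id_partial_idx
    ON tablename (updated_at, id)
    WHERE updated_at >= '2022-07-01 00:00:00'::timestamp without time zone;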
Since a small fraction of your table is still over 6 million rows, your table must be huge. And I am guessing that id and updated_at are nearly perfectly correlated with each other, so selecting specifically for recent updated_at means you are also selecting for higher id. But the planner doesn't know about that. It thinks that by walking up the id index it can stop after walking about 1/6513960 of it, once it finds the first row qualifying on the time column. But instead it has to walk most of the index before finding that row.
The simplest solution is probably to introduce some dummy arithmetic into the aggregates: SELECT MIN("id"+0), MAX("id"+0) ... This will force it not to use the index on id. This is probably the most robust and simplest solution, as long as you have the flexibility to change the query text in your app. But even if you can't change the app, it should at least allow you to verify my assumptions and capture an EXPLAIN (ANALYZE) of it while it is not using the pk index.
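Written out against the query from the question, the rewrite would look roughly like this (same WHERE clause; only the aggregate expressions change):
SELECT MIN("id" + 0), MAX("id" + 0)
FROM "public"."tablename"
WHERE "updated_at" >= '2022-07-24 09:08:05.926533'
  AND "updated_at" < '2022-07-28 09:16:54.95459';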
None of PostgreSQL's advanced statistics will (as of yet) fix this problem, so you are stuck with fixing it by changing the query or the indexes. Changing the query in the silly way I described is the best currently available solution, but if you need to do it with indexes alone, there are some less-good options that will likely still be better than what you currently have.
One is to make the horrible index scan at least into a horrible index-only scan. You could replace your existing primary key index with one like
create unique index on tablename (id) include (updated_at);
Here the INCLUDE is necessary because otherwise the UNIQUE would not do what you want. It will still have to walk a large part of the index, but at least it won't need to keep jumping between index and table to fetch the time column. (Make sure the table is well-vacuumed.)
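If you go that route, the swap might look roughly like this, assuming PostgreSQL 11 or later (for INCLUDE) and that nothing such as a foreign key depends on the existing primary key index; the new index name is illustrative:
-- build the replacement index without blocking writes
CREATE UNIQUE INDEX CONCURRENTLY tablename_id_incl_updated_at_idx
    ON tablename (id) INCLUDE (updated_at);

-- then swap it in as the table's primary key
BEGIN;
ALTER TABLE tablename DROP CONSTRAINT tablename_pkey;
ALTER TABLE tablename ADD CONSTRAINT tablename_pkey
    PRIMARY KEY USING INDEX tablename_id_incl_updated_at_idx;
COMMIT;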
Or, you could provide a partial index that the planner would find attractive, by switching the order of the columns in it:
create index on tablename (id, updated_at) WHERE updated_at >= '2022-07-01 00:00:00'::timestamp without time zone;
The only thing that makes this better than your existing partial index is that this one would actually get used.

PSQL - Performance of query filtered by intervals on two separate fields

I have got a PostgreSQL table that covers time intervals.
This is a simplified structure of my table
CREATE TABLE intervals (
name varchar(40),
time_from timestamp,
time_to timestamp
);
The table contains millions of records, but, if you apply a filter at a specific point of time in the past, the number of records for which
time_from <= [requested time] <= time_to
is always very limited (not more than 3k results). So, a query like this one
SELECT *
FROM intervals
WHERE time_from <= '2020-01-01T10:00:00' and time_to >= '2020-01-01T10:00:00'
is supposed to return a relatively small number of results and, in theory, it should be quite fast with the correct index. But it's not fast at all.
I tried adding a combined index on time_from and time_to, but the engine doesn't pick it.
Seq Scan on intervals (cost=0.00..156152.46 rows=428312 width=32) (actual time=13.223..3599.840 rows=4981 loops=1)
Filter: ((time_from <= '2020-01-01T10:00:00') AND (time_to >= '2020-01-01T10:00:00'))
Rows Removed by Filter: 2089650
Planning Time: 0.159 ms
Execution Time: 3600.618 ms
What type of index should I add, in order to speed up this query?
A btree index cannot be very efficient here. It can quickly throw out everything whose time_from > '2020-01-01T10:00:00', but that is probably not all that much of the table (at least, not if your table goes back for many years). Once the first column of the index has been consumed in this way, the next column cannot be used very efficiently. It can only jump to a specific part of time_to values within the ties of time_from, and that is just not very useful as there are probably not all that many ties. (At least, not that it can prove to itself while planning your query).
What you need is a gist index, which specializes in this kind of multi-dimensional thing:
create extension btree_gist ;
create index on intervals using gist (time_from,time_to);
This index will support your query as written. Another possibility is to build time ranges out of the two columns and index those, rather than the separate begin and end points.
-- this one does not need btree_gist.
create index on intervals using gist (tsrange(time_from,time_to));
But this index forces you to write the query differently:
SELECT * FROM intervals
WHERE tsrange(time_from,time_to) @> '2020-01-01T10:00:00'::timestamp
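One detail to double-check, which the answer above does not mention: tsrange(time_from, time_to) builds a half-open range ('[)') by default, so a row whose time_to is exactly the requested timestamp would not match, unlike the original time_to >= ... condition. If that edge case matters, inclusive bounds can be used in both the index expression and the query:
create index on intervals using gist (tsrange(time_from, time_to, '[]'));

SELECT * FROM intervals
WHERE tsrange(time_from, time_to, '[]') @> '2020-01-01T10:00:00'::timestamp;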

Fetching rows not updated within last 24 hours

I have a large table (40+ million records) with a structure like the following:
CREATE TABLE collected_data(
id TEXT NOT NULL,
status TEXT NOT NULL,
PRIMARY KEY(id, status),
blob JSONB,
updated_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()
);
I need to get all (or at least 100,000) records that have an updated_at older than 24 hours, are of a certain status, and have a blob that is not null.
So the query becomes:
SELECT
id
FROM
collected_data
WHERE
status = 'waiting'
AND blob IS NOT NULL
AND updated_at < NOW() - '24 hours'::interval
LIMIT 100000;
Which results in the execution plan of something like:
Limit  (cost=0.00..234040.07 rows=100000 width=12)
  ->  Seq Scan on collected_data  (cost=0.00..59236150.00 rows=25310265 width=12)
        Filter: ((blob IS NOT NULL) AND (type = 'waiting'::text) AND (updated_at >= (now() - '24:00:00'::interval)))
It almost always results in a full table scan, which means that some queries are really slow.
I have tried to create indexes like CREATE INDEX idx_special ON collected_data(status, updated_at); but it does not help.
Is there any way I can make this query faster?
The planner thinks that 25,310,265 rows will meet your conditions, so it thinks it will be spoiled for choice in getting a mere 100,000 of them by a seq scan and then stopping early. If there aren't really that many, or there are that many but they are all clustered in the wrong part of the table, this won't actually be so fast. This is especially likely to be the case if, after selecting 100,000 of them, the next thing you do is update them in a way such that they no longer meet the criteria. Because then you have to keep walking past the accumulating remnants of the ones that used to qualify to find the next batch.
You can encourage it to use the index by adding ORDER BY updated_at to your query. You could also stack the deck in your favor by creating a partial index:
CREATE INDEX ON collected_data (status, updated_at) WHERE blob IS NOT NULL;
or maybe:
CREATE INDEX ON collected_data (updated_at) WHERE status = 'waiting' AND blob IS NOT NULL;
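With the second partial index in place, the query from the question only needs the ORDER BY added so the planner can walk that index in updated_at order; a sketch, assuming the index has been created and the table analyzed:
SELECT id
FROM collected_data
WHERE status = 'waiting'
  AND blob IS NOT NULL
  AND updated_at < NOW() - '24 hours'::interval
ORDER BY updated_at
LIMIT 100000;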

Selecting primary key: Why Postgres prefers to do sequential scan vs index scan

I have the following table
create table log
(
id bigint default nextval('log_id_seq'::regclass) not null
constraint log_pkey
primary key,
level integer,
category varchar(255),
log_time timestamp,
prefix text,
message text
);
It contains around 3 million rows.
I'm comparing the following queries:
EXPLAIN SELECT id
FROM log
WHERE log_time < now() - INTERVAL '3 month'
LIMIT 100000
which yields the following plan:
Limit (cost=0.00..19498.87 rows=100000 width=8)
-> Seq Scan on log (cost=0.00..422740.48 rows=2168025 width=8)
Filter: (log_time < (now() - '3 mons'::interval))
And the same query with ORDER BY id instruction added:
EXPLAIN SELECT id
FROM log
WHERE log_time < now() - INTERVAL '3 month'
ORDER BY id ASC
LIMIT 100000
which yields
Limit (cost=0.43..25694.15 rows=100000 width=8)
-> Index Scan using log_pkey on log (cost=0.43..557048.28 rows=2168031 width=8)
Filter: (log_time < (now() - '3 mons'::interval))
I have the following questions:
The absence of an ORDER BY instruction allows Postgres not to care about the order of rows; they may just as well be delivered sorted. Why does it not use the index without ORDER BY?
How can Postgres use the index at all in such a query? The WHERE clause contains a non-indexed column, and fetching that column should require a sequential scan of the table, yet the plan for the query with ORDER BY doesn't show one.
The Postgres manual page says:
For a query that requires scanning a large fraction of the table, an explicit sort is likely to be faster than using an index because it requires less disk I/O due to following a sequential access pattern
Can you please clarify this statement for me? An index is always ordered, and reading an ordered structure is always faster; it is sequential access (at least in terms of page scanning), as opposed to reading non-ordered data and then ordering it manually.
Can you please clarify this statement for me? An index is always ordered, and reading an ordered structure is always faster; it is sequential access (at least in terms of page scanning), as opposed to reading non-ordered data and then ordering it manually.
The index is read sequentially, yes, but postgres needs to follow up with a read of the rows from the table. That is, in most cases, if an index identifies 100 rows, then postgres will need to perform up to 100 random reads against the table.
Internally, the postgres planner weighs sequential and random reads differently, with random reads generally much more expensive. The settings seq_page_cost and random_page_cost determine those. There are other settings you can view and tinker with if you want, though I recommend being very conservative with modifications.
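For context, these two settings can be inspected and experimented with per session; the values below are only illustrative (1.1 is a common starting point for SSD storage), not a recommendation for this particular table:
SHOW seq_page_cost;      -- 1.0 by default
SHOW random_page_cost;   -- 4.0 by default

-- a lower random_page_cost makes index scans look cheaper to the planner;
-- SET only affects the current session, so it is safe for experimenting
SET random_page_cost = 1.1;

EXPLAIN SELECT id
FROM log
WHERE log_time < now() - INTERVAL '3 month'
LIMIT 100000;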
Let's go back to your earlier questions:
The absence of an ORDER BY instruction allows Postgres not to care about the order of rows; they may just as well be delivered sorted. Why does it not use the index without ORDER BY?
The reason is the sort. As you note later, the index doesn't include the constraining column, so it doesn't make any sense to use the index. Instead, the planner is basically saying "read the whole table, figure out which rows conform to the constraint, and then return the first 100000 of them, in whatever order we find them".
The sort changes things. In that case, the planner is saying "we need to sort by this field, and we have an index which is already sorted, so read rows from the table in index order, checking against the constraint, until we have 100000 of them, and return that set".
You'll note that the cost estimates (e.g. '0.43..25694.15') are much higher for the second query -- the planner thinks that doing so many random reads from the index scan is going to cost significantly more than just reading the whole table at once with no sorting.
Hope that helps, and let me know if you have further questions.

Partial index on timestamp against current time

I have a query where I filter the rows by comparing their insertion timestamps against five minutes ago.
This field does not get updated; we may think of it as immutable if that helps.
CREATE TABLE events (
id serial PRIMARY KEY,
inserted_at timestamp without time zone DEFAULT now() NOT NULL
);
SELECT *
FROM events e
WHERE e.inserted_at >= (now() - '5 minutes'::interval);
And EXPLAIN ANALYZE VERBOSE:
Seq Scan on public.events e (cost=0.00..459.00 rows=57 width=12) (actual time=0.738..33.127 rows=56 loops=1)
Output: id, inserted_at
Filter: (e.inserted_at >= (now() - '5 minutes'::interval))
Rows Removed by Filter: 19944
Planning time: 0.156 ms
Execution time: 33.180 ms
It seems PostgreSQL performs a sequential scan over the table, which drives up the cost.
Is there a way to create a partial B-tree index, or anything else I can do to optimize that query?
A partial index on the last 5 minutes would need to be rebuilt every so often. You can build it concurrently (since your relation is under heavy use) from cron, dropping the old indexes as you go. Such an approach would give you faster selects on the most recently inserted data, of course, but consider the fact that at least every 5 minutes you have to rescan the table to build the short partial index.
The workaround is arithmetic: you can split the index build into stages (e.g. inside a function):
select now() - inserted_at >= '5 minutes'::interval
from events
where id > (currval('events_id_seq') - 5*(1000000/30));
that is, only look at ids greater than the last sequence value minus the approximate number of rows inserted in the last 5 minutes.
If the result is true, build the index with a dynamic query using the same arithmetic; if not, enlarge the step.
This way you scan only the primary key to build the index on the timestamp, which is much cheaper.
Another point: if you apply such calculations, you might not need a partial index at all.
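For reference, the cron-driven rebuild described at the start of this answer might look roughly like the following; the index names and the cutoff are illustrative, and the cutoff has to be a literal because now() is not immutable and therefore not allowed in an index predicate:
-- run periodically (e.g. from cron); each rebuild bakes in a fresh cutoff
CREATE INDEX CONCURRENTLY events_inserted_at_recent_new
    ON events (inserted_at)
    WHERE inserted_at >= '2020-01-01 10:00:00';

-- once the new index is valid, retire the previous generation
DROP INDEX CONCURRENTLY events_inserted_at_recent_old;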