Postgres query optimization (forcing an index scan)

Postgres query optimization (forcing an index scan) - postgresql

Below is my query. I am trying to get it to use an index scan, but it will only seq scan.
By the way the metric_data table has 130 million rows. The metrics table has about 2000 rows.
metric_data table columns:
metric_id integer
, t timestamp
, d double precision
, PRIMARY KEY (metric_id, t)
How can I get this query to use my PRIMARY KEY index?
SELECT
S.metric,
D.t,
D.d
FROM metric_data D
INNER JOIN metrics S
ON S.id = D.metric_id
WHERE S.NAME = ANY (ARRAY ['cpu', 'mem'])
AND D.t BETWEEN '2012-02-05 00:00:00'::TIMESTAMP
AND '2012-05-05 00:00:00'::TIMESTAMP;
EXPLAIN:
Hash Join (cost=271.30..3866384.25 rows=294973 width=25)
Hash Cond: (d.metric_id = s.id)
-> Seq Scan on metric_data d (cost=0.00..3753150.28 rows=29336784 width=20)
Filter: ((t >= '2012-02-05 00:00:00'::timestamp without time zone)
AND (t <= '2012-05-05 00:00:00'::timestamp without time zone))
-> Hash (cost=270.44..270.44 rows=68 width=13)
-> Seq Scan on metrics s (cost=0.00..270.44 rows=68 width=13)
Filter: ((sym)::text = ANY ('{cpu,mem}'::text[]))

For testing purposes you can force the use of the index by "disabling" sequential scans - best in your current session only:
SET enable_seqscan = OFF;
Do not use this on a productive server. Details in the manual here.
I quoted "disabling", because you cannot actually disable sequential table scans. But any other available option is now preferable for Postgres. This will prove that the multicolumn index on (metric_id, t) can be used - just not as effective as an index on the leading column.
You probably get better results by switching the order of columns in your PRIMARY KEY (and the index used to implement it behind the curtains with it) to (t, metric_id). Or create an additional index with reversed columns like that.
Is a composite index also good for queries on the first field?
You do not normally have to force better query plans by manual intervention. If setting enable_seqscan = OFF leads to a much better plan, something is probably not right in your database. Consider this related answer:
Keep PostgreSQL from sometimes choosing a bad query plan

You cannot force index scan in this case because it will not make it faster.
You currently have index on metric_data (metric_id, t), but server cannot take advantage of this index for your query, because it needs to be able to discriminate by metric_data.t only (without metric_id), but there is no such index. Server can use sub-fields in compound indexes, but only starting from the beginning. For example, searching by metric_id will be able to employ this index.
If you create another index on metric_data (t), your query will make use of that index and will work much faster.
Also, you should make sure that you have an index on metrics (id).

Have you tried to use:
WHERE S.NAME = ANY (VALUES ('cpu'), ('mem'))
instead of
ARRAY
like here

It appears you are lacking suitable FK constraints:
CREATE TABLE metric_data
( metric_id integer
, t timestamp
, d double precision
, PRIMARY KEY (metric_id, t)
, FOREIGN KEY metrics_xxx_fk (metric_id) REFERENCES metrics (id)
)
and in table metrics:
CREATE TABLE metrics
( id INTEGER PRIMARY KEY
...
);
Also check if your statistics are sufficient (and fine-grained enough, since you intend to select 0.2 % of the metrics_data table)

Related

Postgresql - Index scan - Slow filtering

I try to improve query performances on a big (500M rows) time partitioned table. Here is the simplified table structure:
CREATE TABLE execution (
start_time TIMESTAMP WITH TIME ZONE NOT NULL,
end_time TIMESTAMP WITH TIME ZONE,
restriction_criteria VARCHAR(36) NOT NULL
PARTITION BY RANGE (start_time);
Time partitioning
is based on the start_time column because the end_time value is not known when the row is created.
is used to implement efficiently the retention policy.
Request looks like to this generic pattern
SELECT *
FROM execution
WHERE start_time BETWEEN :from AND start_time :to
AND restriction_criteria IN ('123', '456')
ORDER BY end_time DESC, id
FETCH NEXT 20 ROWS ONLY;
I've got the "best" performances using this index
CREATE INDEX IF NOT EXISTS end_time_desc_start_time_index ON execution USING btree (end_time DESC, start_time);
Yet, performances are not good enough.
Limit (cost=1303.21..27189.31 rows=20 width=1674) (actual time=6791.191..6791.198 rows=20 loops=1)
-> Incremental Sort (cost=1303.21..250693964.74 rows=193689 width=1674) (actual time=6791.189..6791.194 rows=20 loops=1)
" Sort Key: execution.end_time DESC, execution.id"
Presorted Key: execution.end_time
Full-sort Groups: 1 Sort Method: quicksort Average Memory: 64kB Peak Memory: 64kB
-> Merge Append (cost=8.93..250685248.74 rows=193689 width=1674) (actual time=4082.161..6791.047 rows=21 loops=1)
Sort Key: execution.end_time DESC
Subplans Removed: 15
-> Index Scan using execution_2021_10_end_time_start_time_idx on execution_2021_10 execution_1 (cost=0.56..113448316.66 rows=93103 width=1674) (actual time=578.896..578.896 rows=1 loops=1)
Index Cond: ((start_time <= '2021-12-05 02:00:04+00'::timestamp with time zone) AND (start_time >= '2021-10-02 02:00:04+00'::timestamp with time zone))
" Filter: (((restriction_criteria)::text = ANY ('{123,456}'::text[])))"
Rows Removed by Filter: 734
-> Index Scan using execution_2021_11_end_time_start_time_idx on execution_2021_11 execution_2 (cost=0.56..113653576.54 rows=87605 width=1674) (actual time=116.841..116.841 rows=1 loops=1)
Index Cond: ((start_time <= '2021-12-05 02:00:04+00'::timestamp with time zone) AND (start_time >= '2021-10-02 02:00:04+00'::timestamp with time zone))
" Filter: (((restriction_criteria)::text = ANY ('{123,456}'::text[])))"
Rows Removed by Filter: 200
-> Index Scan using execution_2021_12_end_time_start_time_idx on execution_2021_12 execution_3 (cost=0.56..16367185.18 rows=12966 width=1674) (actual time=3386.416..6095.261 rows=21 loops=1)
Index Cond: ((start_time <= '2021-12-05 02:00:04+00'::timestamp with time zone) AND (start_time >= '2021-10-02 02:00:04+00'::timestamp with time zone))
" Filter: (((restriction_criteria)::text = ANY ('{123,456}'::text[])))"
Rows Removed by Filter: 5934
Planning Time: 4.108 ms
Execution Time: 6791.317 ms
The index Filter looks is very slow.
I set up a multi-column index hoping the filtering would be done in the Index cond. But it doesn't work
CREATE INDEX IF NOT EXISTS pagination_index ON execution USING btree (end_time DESC, start_time, restriction_criteria);
My feeling is that the first index column should be end_time because we want to leverage the btree index sorting capability. The second one should be restriction_criteria so that an index cond filters rows which doesn't match the restriction_criteria. However, this doesn't work because the query planner need to also check the start_time clause.
The alternative I imagine is to get rid of the partitioning because a multi-column end_time, restriction_critera index would work just fine.
Yet, this is not a perfect solution because dealing with our retention policy would become a pain.
Is there another alternative allowing to keep the start_time partitioning ?

I set up a multi-column index hoping the filtering would be done in the Index cond
The index machinery is very circumspect about what code it runs inside the index. It won't call any operators that it doesn't 'trust', because if the operator throws an error then the whole query will error out, possibly due to rows that weren't even user 'visible' in the first place (i.e. ones that were already deleted or created but never committed). No one wants that. Now the =ANY construct could be considered trustable, but it is not. That means it won't be applied in the Index Cond, but must be applied against the table row, which in turn means you need to visit the table, which is probably where all your time is going, visiting random table rows.
I don't know what it would take code-wise to make =ANY trusted. I've made efforts to investigate that in the past but really never got anywhere, the code around the ANY is too complicated for me to grasp. That would be a nice improvement for the future, but won't help you now anyway.
One way around this is to get an index-only scan. At that point it will call arbitrary code in the index, as it already knows the tuple is visible. But it won't do that for you, because you are selecting at least one column not in the index (and also not shown in your CREATE command, but obviously present anyway)
If you create an index like your widest one but adding "id" to the end, and only select from among those columns, then you should be get a much faster index only scans with merge appends.
If you have even more columns than the ones you've shown plus "id", and you really need to select those columns, and don't want to add all of them to the index, then you can use a trick to use an index-only scan anyway by doing a dummy self join:
with t as (SELECT id
FROM execution
WHERE start_time BETWEEN :from AND :to
AND restriction_criteria IN ('123', '456')
ORDER BY end_time DESC, id
FETCH NEXT 20 ROWS ONLY
)
select real.* from execution real join t using (id)
ORDER BY end_time DESC, id
(If "id" is not unique, then you might need to join on additional column. Also, you would need an index on "id", which you probably already have)
This one will still need to visit the table to fetch the extra columns, but only for the 20 rows being returned, not for all the ones failing the restriction_criteria.
If the restriction_criteria is very selective, another approach might be better: an index on or leading with that column. It will need to read and sort all of those rows (in the relevant partitions) before applying the LIMIT, but if it is very selective this will not take long.

While you can have the output sorted if the leading column is end_time you can reduce the amount of data processed if you use start_time as a leading column.
Since your filter in start_time and restriction_criteria, is excluding ~7000 rows in order to retrieve 20, maybe speeding up the filtering is more important that speeding up the sorting.
CREATE INDEX IF NOT EXISTS execution_start_time_restriction_idx
ON execution USING btree (start_time, restriction_criteria);
CREATE INDEX IF NOT EXISTS execution_restriction_start_time_idx
ON execution USING btree (restriction_criteria, start_time);
ANALYZE execution
If
FROM execution
WHERE start_time BETWEEN :from AND start_time :to
AND restriction_criteria IN ('123', '456')
Is more than the number of rows removed by the filter then having the `end_time as the leading column might be a good idea. But the planner should be able to figure that out for you.
In the end if some of those indices are not used you can drop it.

Incorrect index usage Postgresql Version 12

Query Plan:
db=> explain
db-> SELECT MIN("id"), MAX("id") FROM "public"."tablename" WHERE ( "updated_at" >= '2022-07-24 09:08:05.926533' AND "updated_at" < '2022-07-28 09:16:54.95459' );
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Result (cost=128.94..128.95 rows=1 width=16)
InitPlan 1 (returns $0)
-> Limit (cost=0.57..64.47 rows=1 width=8)
-> Index Scan using tablename_pkey on tablename (cost=0.57..416250679.26 rows=6513960 width=8)
Index Cond: (id IS NOT NULL)
Filter: ((updated_at >= '2022-07-24 09:08:05.926533'::timestamp without time zone) AND (updated_at < '2022-07-28 09:16:54.95459'::timestamp without time zone))
InitPlan 2 (returns $1)
-> Limit (cost=0.57..64.47 rows=1 width=8)
-> Index Scan Backward using tablename_pkey on tablename tablename_1 (cost=0.57..416250679.26 rows=6513960 width=8)
Index Cond: (id IS NOT NULL)
Filter: ((updated_at >= '2022-07-24 09:08:05.926533'::timestamp without time zone) AND (updated_at < '2022-07-28 09:16:54.95459'::timestamp without time zone))
(11 rows)
Indexes:
"tablename_pkey" PRIMARY KEY, btree (id)
"tablename_updated_at_incl_id_partial_idx" btree (updated_at) INCLUDE (id) WHERE updated_at >= '2022-07-01 00:00:00'::timestamp without time zone
Idea is when there is already a filtered index which only has small subset of records, why is query doing index scan on primary key, instead of tablename_updated_at_incl_id_partial_idx. Also this is a heap table not clustered table.

Because you're using MIN and MAX, try redefining your second index so id is part of the BTREE index, not just INCLUDEd in it. That may make searching for the MIN and MAX items faster.

Since a small fraction of your table really is over 6e6 rows, then your data must be huge. And I am guessing that id and updated_at are nearly perfectly correlated with each other, so selecting specifically for recent updated_at means you are also selecting for higher id. But the planner doesn't now about that. It thinks that by walking up the id index it can stop after walking about 1/6513960 of it, once it finds the first row qualifying on the time column. But instead it has to walk most of the index before finding that row.
The simplest solution probably to introduce some dummy arithmetic into the aggregates: SELECT MIN("id"+0), MAX("id"+0) ... This will force it not to use the index on id. This will probably be the most robust and simplest solution as long as you have the flexibility to change the query text in your app. But even if you can't change the app, this should at least allow you to verify my assumptions and capture an EXPLAIN (ANALYZE) of it while it is not using the pk index.
None of PostgreSQL's advanced statistics will (as of yet) fix this problem. so you are stuck with fixing it by changing the query or the indexes. Changing the query in the silly way I described is the best currently available solution, but if you need to do just with indexes there are some other less-good options but which will likely still be better than what you currently have.
One is to make the horrible index scan at least into a horrible index-only scan. You could replace your existing primary key index with one like create unique index on tablename (id) include (updated_at). Here the INCLUDE is necessary because otherwise the UNIQUE would not do what you want. It will still have to walk a large part of the index, but at least it won't need to keep jumping between index and table to fetch the time column. (Make sure the table is well-vacuumed)
Or, you could provide a partial index that the planner would find attractive, by switching the order of the columns in it: create index on tablename (id, updated_at) WHERE updated_at >= '2022-07-01 00:00:00'::timestamp without time zone The only thing that makes this better than your existing partial index is that this one would actually get used.

Postgres TIMESTAMP index and query performance

I have this table:
CREATE TABLE IF NOT EXISTS CHANGE_REQUESTS (
ID UUID PRIMARY KEY,
FIELD_ID INTEGER NOT NULL,
LAST_CHANGE_DATE TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL
);
And I'm always going to be running the exact same query on it:
select * from change_requests where last_change_date > now() - INTERVAL '10 min';
The size of the table is going to be anywhere from 750k to 1million rows on average.
My question is how can I make sure the query is always very fast? I'm thinking of adding an index on last_change_date, but I'm not sure if that will do anything. I tried it (with only 1 row in the table right now) and got this explain:
create index change_requests__dt_index
on change_requests (last_change_date);
Seq Scan on change_requests (cost=0.00..1.02 rows=1 width=28)
Filter: (last_change_date > (now() - '00:10:00'::interval))
So it doesn't appear to use the index at all.
Will this index actually help? If not, what else could I do? Thanks!

Your index is perfect for the task. You see the sequential scan in the execution plan because you don't have a realistic amount of test data in the table, and for very small tables the overhead of using the index is not worth the effort (you'd have to process more 8kB database blocks).
Always test with realistic amounts of data. That will safe you some pain later on.

Does Postgres use indexes if casting timestamp to date?

Let's say I have a table with some columns and a column dt which is of type TIMESTAMP.
I create a (non functional) index on this column.
Then I execute a query
SELECT *
FROM tbl
WHERE
dt::DATE = NOW()::DATE
The question is will Postgres use the index I've created earlier and under which circumstances it will/will not?
I understand that a functional index would cover this case, but does a simple index cover both cases or not when it's a TIMESTAMP -> DATE type conversion?
EDIT:
performing an EXPLAIN ANALYZE on the query tells us it does not use index and performs a Seq scan (table with 3+ mil records:
Seq Scan on tbl (cost=0.00..192289.92 rows=17043 width=12) (actual time=7.237..2493.496 rows=4928 loops=1)
Filter: ((dt)::date = (now())::date)
Rows Removed by Filter: 3397155
Total runtime: 2494.546 ms
Let me ask a question differently then, is it possible to make Postgres utilize this index or should I create another one?

A simple index will not work in this case; try it with EXPLAIN.
What you could do to use the simple index is
WHERE dt >= current_date::timestamptz
AND dt < (current_date + 1)::timestamptz
I think that this is pretty readable and the best solution, but if you want to go with your current query, you'll have to add a second index on (dt::date).
Don't forget that every additional index costs space and slows down the performance of data modifying statements.

Postgresql Sorting a Joined Table with an index

I'm currently working on a complex sorting problem in Postgres 9.2
You can find the Source Code used in this Question(simplified) here: http://sqlfiddle.com/#!12/9857e/11
I have a Huge (>>20Mio rows) table containing various columns of different types.
CREATE TABLE data_table
(
id bigserial PRIMARY KEY,
column_a character(1),
column_b integer
-- ~100 more columns
);
Lets say i want to sort this table over 2 Columns (ASC).
But i don't want to do that with a simply Order By, because later I might need to insert rows in the sorted output and the user probably only wants to see 100 Rows at once (of the sorted output).
To achieve these goals i do the following:
CREATE TABLE meta_table
(
id bigserial PRIMARY KEY,
id_data bigint NOT NULL -- refers to the data_table
);
--Function to get the Column A of the current row
CREATE OR REPLACE FUNCTION get_column_a(bigint)
RETURNS character AS
'SELECT column_a FROM data_table WHERE id=$1'
LANGUAGE sql IMMUTABLE STRICT;
--Function to get the Column B of the current row
CREATE OR REPLACE FUNCTION get_column_b(bigint)
RETURNS integer AS
'SELECT column_b FROM data_table WHERE id=$1'
LANGUAGE sql IMMUTABLE STRICT;
--Creating a index on expression:
CREATE INDEX meta_sort_index
ON meta_table
USING btree
(get_column_a(id_data), get_column_b(id_data), id_data);
And then I copy the Id's of the data_table to the meta_table:
INSERT INTO meta_table(id_data) (SELECT id FROM data_table);
Later I can add additional rows to the table with a similar simple insert.
To get the Rows 900000 - 900099 (100 Rows) i can now use:
SELECT get_column_a(id_data), get_column_b(id_data), id_data
FROM meta_table
ORDER BY 1,2,3 OFFSET 900000 LIMIT 100;
(With an additional INNER JOIN on data_table if I want all the data.)
The Resulting Plan is:
Limit (cost=498956.59..499012.03 rows=100 width=8)
-> Index Only Scan using meta_sort_index on meta_table (cost=0.00..554396.21 rows=1000000 width=8)
This is a pretty efficient plan (Index Only Scans are new in Postgres 9.2).
But what is if I want to get Rows 20'000'000 - 20'000'099 (100 Rows)? Same Plan, much longer execution time. Well, to improve the Offset Performance (Improving OFFSET performance in PostgreSQL) I can do the following (Let's assume I saved every 100'000th Row away into another table).
SELECT get_column_a(id_data), get_column_b(id_data), id_data
FROM meta_table
WHERE (get_column_a(id_data), get_column_b(id_data), id_data ) >= (get_column_a(587857), get_column_b(587857), 587857 )
ORDER BY 1,2,3 LIMIT 100;
This runs much faster. The Resulting Plan is:
Limit (cost=0.51..61.13 rows=100 width=8)
-> Index Only Scan using meta_sort_index on meta_table (cost=0.51..193379.65 rows=318954 width=8)
Index Cond: (ROW((get_column_a(id_data)), (get_column_b(id_data)), id_data) >= ROW('Z'::bpchar, 27857, 587857))
So far everything works perfect and postgres does a great job!
Let's assume I want to change the Order of the 2nd Column to DESC.
But then I would have to change my WHERE Clause, because the > Operator compares both Columns ASC. The same query as above (ASC Ordering) could also be written as:
SELECT get_column_a(id_data), get_column_b(id_data), id_data
FROM meta_table
WHERE
(get_column_a(id_data) > get_column_a(587857))
OR (get_column_a(id_data) = get_column_a(587857) AND ((get_column_b(id_data) > get_column_b(587857))
OR ( (get_column_b(id_data) = get_column_b(587857)) AND (id_data >= 587857))))
ORDER BY 1,2,3 LIMIT 100;
Now the Plan Changes and the Query becomes slow:
Limit (cost=0.00..1095.94 rows=100 width=8)
-> Index Only Scan using meta_sort_index on meta_table (cost=0.00..1117877.41 rows=102002 width=8)
Filter: (((get_column_a(id_data)) > 'Z'::bpchar) OR (((get_column_a(id_data)) = 'Z'::bpchar) AND (((get_column_b(id_data)) > 27857) OR (((get_column_b(id_data)) = 27857) AND (id_data >= 587857)))))
How can I use the efficient older plan with DESC-Ordering?
Do you have any better ideas how to solve the Problem?
(I already tried to declare a own Type with own Operator Classes, but that's too slow)

You need to rethink your approach. Where to begin? This is a clear example, basically of the limits, performance-wise, of the sort of functional approach you are taking to SQL. Functions are largely planner opaque, and you are forcing two different lookups on data_table for every row retrieved because the stored procedure's plans cannot be folded together.
Now, far worse, you are indexing one table based on data in another. This might work for append-only workloads (inserts allowed but no updates) but it will not work if data_table can ever have updates applied. If the data in data_table ever changes, you will have the index return wrong results.
In these cases, you are almost always better off writing in the join as explicit, and letting the planner figure out the best way to retrieve the data.
Now your problem is that your index becomes a lot less useful (and a lot more intensive disk I/O-wise) when you change the order of your second column. On the other hand, if you had two different indexes on the data_table and had an explicit join, PostgreSQL could more easily handle this.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse