Multi-column indexes and ORDER BY - postgresql

The PostgreSQL documentation states that if we run a query ... ORDER BY x ASC, y DESC on a table with an index ... (x ASC, y ASC), the index cannot be used because the directions do not match.
Does this mean that this index is completely useless, or can the database engine use the index to order the x ASC part (and then manually sort the y DESC part)?
What if we run a query ... WHERE x = 999 ORDER BY y DESC, can this index be used?

Answer to question 1.
Postgres 12 or older
No, the index cannot be used, like the manual suggests. You can verify by creating such an index on any table and then, for the test session only:
SET enable_seqscan = OFF;
Then:
EXPLAIN
SELECT * FROM tbl ORDER BY ORDER BY x, y DESC;
Now, if the index could be used in any way, it would be. But you'll still see a sequential scan.
Corner case exception: If an index-only scan is possible, the index might still be used if it's substantially smaller than the table. But rows have to be sorted from scratch.
Related:
Optimizing queries on a range of timestamps (two columns)
Postgres 13 or newer
Postgres 13 added "incremental sort", which can be controlled with the GUC setting enable_incremental_sort that is on by default. The release notes:
If an intermediate query result is known to be sorted by one or more
leading keys of a required sort ordering, the additional sorting can
be done considering only the remaining keys, if the rows are sorted in
batches that have equal leading keys.
Corner case problems have been fixed with version 13.1 and 13.2. So - as always - be sure to run the latest point release.
Now, the index can be used. You'll see something like this in the EXPLAIN plan:
Sort Key: x, y DESC
Presorted Key: x
It's not as efficient as an index with matching (switched) sort order where readily sorted rows can be read from the index directly (with no sort step at all). But it can be a huge improvement, especially with a small LIMIT, where Postgres had to sort all rows historically. Now it can look at each (set of) leading column(s), sort only those, and stop as soon as LIMIT is satisfied.
Answer to question 2.
Yes, the index fits perfectly.
(Even works if the index has y ASC. It can be scanned backwards. Only NULL placement is a drawback in this case.)
Of course, if x = 999 is a stable predicate (it's always 999 we are interested in) and more than a few rows have a different x, then a partial index would be even more efficient:
CREATE INDEX ON tbl (y DESC) WHERE x = 999;
db<>fiddle here - Postgres 10
db<>fiddle here - Postgres 13 (2nd demo uses incremental sort now)

Related

How to optimize the following query by adding more indexes?

I am trying to optimize a query which has been destroying my DB.
https://explain.depesz.com/s/isM1
If you have any insights into how to make this better please let me know.
We are using RDS/Postgres 11.9
explain analyze SELECT "src_rowdifference"."key",
"src_rowdifference"."port_id",
"src_rowdifference"."shipping_line_id",
"src_rowdifference"."container_type_id",
"src_rowdifference"."shift_id",
"src_rowdifference"."prev_availability_id",
"src_rowdifference"."new_availability_id",
"src_rowdifference"."date",
"src_rowdifference"."prev_last_update",
"src_rowdifference"."new_last_update"
FROM "src_rowdifference"
INNER JOIN "src_containertype" ON ("src_rowdifference"."container_type_id" = "src_containertype"."key")
WHERE ("src_rowdifference"."container_type_id" IN
(SELECT U0."key"
FROM "src_containertype" U0
INNER JOIN "notification_tablenotification_container_types" U1 ON (U0."key" = U1."containertype_id")
WHERE U1."tablenotification_id" = 'test#test.com')
AND "src_rowdifference"."new_last_update" >= '2020-01-15T03:11:06.291947+00:00'::timestamptz
AND "src_rowdifference"."port_id" IN
(SELECT U0."key"
FROM "src_port" U0
INNER JOIN "notification_tablenotification_ports" U1 ON (U0."key" = U1."port_id")
WHERE U1."tablenotification_id" = 'test#test.com')
AND "src_rowdifference"."shipping_line_id" IN
(SELECT U0."key"
FROM "src_shippingline" U0
INNER JOIN "notification_tablenotification_shipping_lines" U1 ON (U0."key" = U1."shippingline_id")
WHERE U1."tablenotification_id" = 'test#test.com')
AND "src_rowdifference"."prev_last_update" IS NOT NULL
AND NOT ("src_rowdifference"."prev_availability_id" = 'na'
AND "src_rowdifference"."prev_availability_id" IS NOT NULL)
AND NOT ("src_rowdifference"."key" IN
(SELECT V1."rowdifference_id"
FROM "notification_tablenotificationtrigger_row_differences" V1
WHERE V1."tablenotificationtrigger_id" IN
(SELECT U0."id"
FROM "notification_tablenotificationtrigger" U0
WHERE U0."notification_id" = 'test#test.com'))));
All my indexes are btree + btree(varchar_pattern_ops)
"src_rowdifference_port_id_shipping_line_id_9b3465fc_uniq" UNIQUE CONSTRAINT, btree (port_id, shipping_line_id, container_type_id, shift_id, date, new_last_update)
Edit: A little unrelated change that I made was added some more ssd disk space to my RDS instance. That made a huge difference to the CPU usage and in turn made a huge difference to the number of connections we have.
It is hard to think about the plan as a whole, as I don't understand what it is looking for. But looking at the individual pieces, there are two which together dominate the run time.
One is the index scan on src_rowdifference_port_id_shipping_line_id_9b3465fc, which seems pretty slow given the number of rows returned. Comparing the Index Condition to the index columns, I can see that the condition on new_last_update cannot be applied efficiently in the index because two columns in the index come before it and have no equality conditions in the node. So instead that >= is applied as an "in-index filter" where it needs to test each row and reject it, rather than just skipping it in bulk. I don't know how many rows that removes as the "Rows Removed by Filter" does not count in-index filters, but it is potentially large. So one thing to try would be to make a new index on (port_id, shipping_line_id, container_type_id, new_last_update). Or maybe replace that index with a reordered version (port_id, shipping_line_id, container_type_id, new_last_update, shift_id, date) but of course that might make some other query worse.
The other time consuming thing is kicking the materialized node 47 thousand times (each one looping over up to 22 thousand rows) to implement NOT (SubPlan 1). That should be using a hashed subplan, rather than a linear searched subplan. The only reason I can think of that it not doing the hashed subplan is that work_mem is not large enough to anticipate fitting it into memory. What is your setting for work_mem? What happens if you bump it up to "100MB" or so?
The NOT (SubPlan 1) from the EXPLAIN corresponds to the part of your query AND NOT ("src_rowdifference"."key" IN (...)). If bumping up work_mem doesn't work, you could try rewriting that into a NOT EXISTS clause instead.

PostgreSQL 11.5 doing sequential scan for SELECT EXISTS query

I have a multi tenant environment where each tenant (customer) has its own schema to isolate their data. Not ideal I know, but it was a quick port of a legacy system.
Each tenant has a "reading" table, with a composite index of 4 columns:
site_code char(8), location_no int, sensor_no int, reading_dtm timestamptz.
When a new reading is added, a function is called which first checks if there has already been a reading in the last minute (for the same site_code.location_no.sensor_no):
IF EXISTS (
SELECT
FROM reading r
WHERE r.site_code = p_site_code
AND r.location_no = p_location_no
AND r.sensor_no = p_sensor_no
AND r.reading_dtm > p_reading_dtm - INTERVAL '1 minute'
)
THEN
RETURN;
END IF;
Now, bare in mind there are many tenants, all behaving fine except 1. In 1 of the tenants, the call is taking nearly half a second rather than the usual few milliseconds because it is doing a sequential scan on a table with nearly 2 million rows instead of an index scan.
My random_page_cost is set to 1.5.
I could understand a sequential scan if the query was returning possibly many rows, checking for the existance of any.
I've tried ANALYZE on the table, VACUUM FULL, etc but it makes no difference.
If I put "SET LOCAL enable_seqscan = off" before the query, it works perfectly... but it feels wrong, but it will have to be a temporary solution as this is a live system and it needs to work.
What else can I do to help Postgres make what is clearly the better decision of using the index?
EDIT: If I do a similar query manually (outside of a function) it chooses an index.
My guess is that the engine is evaluating the predicate and considers is not selective enough (thinks too many rows will be returned), so decides to use a table scan instead.
I would do two things:
Make sure you have the correct index in place:
create index ix1 on reading (site_code, location_no,
sensor_no, reading_dtm);
Trick the optimizer by making the selectivity look better. You can do that by adding the extra [redundant] predicate and r.reading_dtm < :p_reading_dtm:
select 1
from reading r
where r.site_code = :p_site_code
and r.location_no = :p_location_no
and r.sensor_no = :p_sensor_no
and r.reading_dtm > :p_reading_dtm - interval '1 minute'
and r.reading_dtm < :p_reading_dtm

Selecting primary key:Why postgres prefers to do sequential scan vs index scan

I have the following table
create table log
(
id bigint default nextval('log_id_seq'::regclass) not null
constraint log_pkey
primary key,
level integer,
category varchar(255),
log_time timestamp,
prefix text,
message text
);
It contains like 3 million of rows.
I'm comparing the following queries:
EXPLAIN SELECT id
FROM log
WHERE log_time < now() - INTERVAL '3 month'
LIMIT 100000
which yields the following plan:
Limit (cost=0.00..19498.87 rows=100000 width=8)
-> Seq Scan on log (cost=0.00..422740.48 rows=2168025 width=8)
Filter: (log_time < (now() - '3 mons'::interval))
And the same query with ORDER BY id instruction added:
EXPLAIN SELECT id
FROM log
WHERE log_time < now() - INTERVAL '3 month'
ORDER BY id ASC
LIMIT 100000
which yields
Limit (cost=0.43..25694.15 rows=100000 width=8)
-> Index Scan using log_pkey on log (cost=0.43..557048.28 rows=2168031 width=8)
Filter: (log_time < (now() - '3 mons'::interval))
I have the following questions:
The absence of ORDER BY instruction allows Postgres not to care about the order of rows. They may be as well delivered sorted. Why it does not use index without ORDER BY?
How can Postgres use index in the first place in such a query? WHERE clause of the query contains a non-indexed column and to fetch that column, sequential database scan will be required, but the query with ORDER BY doesn't indicate that.
The Postgres manual page says:
For a query that requires scanning a large fraction of the table, an explicit sort is likely to be faster than using an index because it requires less disk I/O due to following a sequential access pattern
Can you please clarify this statement for me? Index is always ordered. And reading an ordered structure is always faster, it is always a sequential access (at least in terms of page scanning) than reading non-ordered data and then ordering it manually.
Can you please clarify this statement for me? Index is always ordered. And reading an ordered structure is always faster, it is always a sequential access (at least in terms of page scanning) than reading non-ordered data and then ordering it manually.
The index is read sequentially, yes, but postgres needs to follow up with a read of the rows from the table. That is, in most cases, if an index identifies 100 rows, then postgres will need to perform up to 100 random reads against the table.
Internally, the postgres planner weighs sequential and random reads differently, with random reads generally much more expensive. The settings seq_page_cost and random_page_cost determine those. There are other settings you can view and tinker with if you want, though I recommend being very conservative with modifications.
Let's go back to your earlier questions:
The absence of ORDER BY instruction allows Postgres not to care about the order of rows. They may be as well delivered sorted. Why it does not use index without ORDER BY?
The reason is the sort. As you note later, the index doesn't include the constraining column, so it doesn't make any sense to use the index. Instead, the planner is basically saying "read the whole table, figure out which rows conform to the constraint, and then return the first 100000 of them, in whatever order we find them".
The sort changes things. In that case, the planner is saying "we need to sort by this field, and we have an index which is already sorted, so read rows from the table in index order, checking against the constraint, until we have 100000 of them, and return that set".
You'll note that the cost estimates (e.g. '0.43..25694.15') are much higher for the second query -- the planner thinks that doing so many random reads from the index scan is going to cost significantly more than just reading the whole table at once with no sorting.
Hope that helps, and let me know if you have further questions.

Postgresql 9.4 slow [duplicate]

I have table
create table big_table (
id serial primary key,
-- other columns here
vote int
);
This table is very big, approximately 70 million rows, I need to query:
SELECT * FROM big_table
ORDER BY vote [ASC|DESC], id [ASC|DESC]
OFFSET x LIMIT n -- I need this for pagination
As you may know, when x is a large number, queries like this are very slow.
For performance optimization I added indexes:
create index vote_order_asc on big_table (vote asc, id asc);
and
create index vote_order_desc on big_table (vote desc, id desc);
EXPLAIN shows that the above SELECT query uses these indexes, but it's very slow anyway with a large offset.
What can I do to optimize queries with OFFSET in big tables? Maybe PostgreSQL 9.5 or even newer versions have some features? I've searched but didn't find anything.
A large OFFSET is always going to be slow. Postgres has to order all rows and count the visible ones up to your offset. To skip all previous rows directly you could add an indexed row_number to the table (or create a MATERIALIZED VIEW including said row_number) and work with WHERE row_number > x instead of OFFSET x.
However, this approach is only sensible for read-only (or mostly) data. Implementing the same for table data that can change concurrently is more challenging. You need to start by defining desired behavior exactly.
I suggest a different approach for pagination:
SELECT *
FROM big_table
WHERE (vote, id) > (vote_x, id_x) -- ROW values
ORDER BY vote, id -- needs to be deterministic
LIMIT n;
Where vote_x and id_x are from the last row of the previous page (for both DESC and ASC). Or from the first if navigating backwards.
Comparing row values is supported by the index you already have - a feature that complies with the ISO SQL standard, but not every RDBMS supports it.
CREATE INDEX vote_order_asc ON big_table (vote, id);
Or for descending order:
SELECT *
FROM big_table
WHERE (vote, id) < (vote_x, id_x) -- ROW values
ORDER BY vote DESC, id DESC
LIMIT n;
Can use the same index.
I suggest you declare your columns NOT NULL or acquaint yourself with the NULLS FIRST|LAST construct:
PostgreSQL sort by datetime asc, null first?
Note two things in particular:
The ROW values in the WHERE clause cannot be replaced with separated member fields. WHERE (vote, id) > (vote_x, id_x) cannot be replaced with:
WHERE vote >= vote_x
AND id > id_x
That would rule out all rows with id <= id_x, while we only want to do that for the same vote and not for the next. The correct translation would be:
WHERE (vote = vote_x AND id > id_x) OR vote > vote_x
... which doesn't play along with indexes as nicely, and gets increasingly complicated for more columns.
Would be simple for a single column, obviously. That's the special case I mentioned at the outset.
The technique does not work for mixed directions in ORDER BY like:
ORDER BY vote ASC, id DESC
At least I can't think of a generic way to implement this as efficiently. If at least one of both columns is a numeric type, you could use a functional index with an inverted value on (vote, (id * -1)) - and use the same expression in ORDER BY:
ORDER BY vote ASC, (id * -1) ASC
Related:
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
Improve performance for order by with columns from many tables
Note in particular the presentation by Markus Winand I linked to:
"Pagination done the PostgreSQL way"
Have you tried partioning the table ?
Ease of management, improved scalability and availability, and a
reduction in blocking are common reasons to partition tables.
Improving query performance is not a reason to employ partitioning,
though it can be a beneficial side-effect in some cases. In terms of
performance, it is important to ensure that your implementation plan
includes a review of query performance. Confirm that your indexes
continue to appropriately support your queries after the table is
partitioned, and verify that queries using the clustered and
nonclustered indexes benefit from partition elimination where
applicable.
http://sqlperformance.com/2013/09/sql-indexes/partitioning-benefits

Postgres multi-column index (integer, boolean, and array)

I have a Postgres 9.4 database with a table like this:
| id | other_id | current | dn_ids | rank |
|----|----------|---------|---------------------------------------|------|
| 1 | 5 | F | {123,234,345,456,111,222,333,444,555} | 1 |
| 2 | 7 | F | {123,100,200,900,800,700,600,400,323} | 2 |
(update) I already have a couple indexes defined. Here is the CREATE TABLE syntax:
CREATE TABLE mytable (
id integer NOT NULL,
other_id integer,
rank integer,
current boolean DEFAULT false,
dn_ids integer[] DEFAULT '{}'::integer[]
);
CREATE SEQUENCE mytable_id_seq START WITH 1 INCREMENT BY 1 NO MINVALUE NO MAXVALUE CACHE 1;
ALTER TABLE ONLY mytable ALTER COLUMN id SET DEFAULT nextval('mytable_id_seq'::regclass);
ALTER TABLE ONLY mytable ADD CONSTRAINT mytable_pkey PRIMARY KEY (id);
CREATE INDEX ind_dn_ids ON mytable USING gin (dn_ids);
CREATE INDEX index_mytable_on_current ON mytable USING btree (current);
CREATE INDEX index_mytable_on_other_id ON mytable USING btree (other_id);
CREATE INDEX index_mytable_on_other_id_and_current ON mytable USING btree (other_id, current);
I need to optimize queries like this:
SELECT id, dn_ids
FROM mytable
WHERE other_id = 5 AND current = F AND NOT (ARRAY[100,200] && dn_ids)
ORDER BY rank ASC
LIMIT 500 OFFSET 1000
This query works fine, but I'm sure it could be much faster with smart indexing. There are about 250,000 rows in the table and I always have current = F as a predicate. The input array I'm comparing to the stored array will have 1-9 integers, as well. The other_id can vary. But generally, before limiting, the scan will match between 0-25,000 rows.
Here's an example EXPLAIN:
Limit (cost=36944.53..36945.78 rows=500 width=65)
-> Sort (cost=36942.03..37007.42 rows=26156 width=65)
Sort Key: rank
-> Seq Scan on mytable (cost=0.00..35431.42 rows=26156 width=65)
Filter: ((NOT current) AND (NOT ('{-1,35257,35314}'::integer[] && dn_ids)) AND (other_id = 193))
Other answers on this site and the Postgres docs suggest it's possible to add a compound index to improve performance. I already have one on [other_id, current]. I've also read in various places that indexing can improve the performance of the ORDER BY in addition to the WHERE clause.
What's the right type of compound index to use for this query? I don't care about space at all.
Does it matter much how I order the terms in the WHERE clause?
What's the right type of compound index to use for this query? I don't care about space at all.
This depends on the complete situation. Either way, the GIN index you already have is most probably superior to a GiST index in your case:
Difference between GiST and GIN index
You can combine either with integer columns once you install the additional module btree_gin (or btree_gist, respectively).
Multicolumn index on 3 fields with heterogenous data types
However, that does not cover the boolean data type, which typically doesn't make sense as index column to begin with. With just two (three incl. NULL) possible values it's not selective enough.
And a plain btree index is more efficient for integer. While a multicolumn btree index on two integer columns would certainly help, you'll have to test carefully if combining (other_id, dn_ids) in a multicolumn GIN index is worth more than it costs. Probably not. Postgres can combine multiple indexes in a bitmap index scan rather efficiently.
Finally, while indexes can be used for sorted output, this will probably not pay to apply for a query like you display (unless you select large parts of the table).
Not applicable to updated question.
Partial indexes might be an option. Other than that, you already have all the indexes you need.
I would drop the pointless index on the boolean column current completely, and the index on just rank is probably never used for this query.
Does it matter much how I order the terms in the WHERE clause?
The order of WHERE conditions is completely irrelevant.
Addendum after question update
The utility of indexes is bound to selective criteria. If more than roughly 5 % (depends on various factors) of the table are selected, a sequential scan of the whole table is typically faster than dealing with the overhead on any indexes - except for pre-sorting output, that's the one thing an index is still good for in such cases.
For a query that fetches 25,000 of 250,000 rows, indexes are mostly just for that - which gets all the more interesting if you attach a LIMIT clause. Postgres can stop fetching rows from an index once the LIMIT is satisfied.
Be aware that Postgres always needs to read OFFSET + LIMIT rows, so performance deteriorate with the sum of both.
Even with your added information, much of what's relevant is still in the dark. I am going to assume that:
Your predicate NOT (ARRAY[100,200] && dn_ids) is not very selective. Ruling out 1 to 10 ID values should typically retain the majority of rows unless you have very few distinct elements in dn_ids.
The most selective predicate is other_id = 5.
A substantial part of the rows is eliminated with NOT current.
Aside: current = F isn't valid syntax in standard Postgres. Must be NOT current or current = FALSE;
While a GIN index would be great to identify few rows with matching arrays faster than any other index type, this seems hardly relevant for your query. My best guess is this partial, multicolumn btree index:
CREATE INDEX foo ON mytable (other_id, rank, dn_ids)
WHERE NOT current;
The array column dn_ids in a btree index cannot support the && operator, I just include it to allow index-only scans and filter rows before accessing the heap (the table). May even be faster without dn_ids in the index:
CREATE INDEX foo ON mytable (other_id, rank) WHERE NOT current;
GiST indexes may become more interesting in Postgres 9.5 due to this new feature:
Allow GiST indexes to perform index-only scans (Anastasia Lubennikova,
Heikki Linnakangas, Andreas Karlsson)
Aside: current is a reserved word in standard SQL, even if it's allowed as identifier in Postgres.
Aside 2: I assume id is an actual serial column with the column default set. Just creating a sequence like you demonstrate, would do nothing.
Auto increment SQL function
Unfortunately I don't think you can combine a BTree and a GIN/GIST index into a single compound index, so the planner is going to have to choose between using the other_id index or the dn_ids index. One advantage of using other_id, as you pointed out, is that you could use a multicolumn index to improve the sort performance. The way you would do this would be
CREATE INDEX index_mytable_on_other_id_and_current
ON mytable (other_id, rank) WHERE current = F;
This is using a partial index, and will allow you to skip the sort step when you are sorting by rank and querying on other_id.
Depending on the cardinality of other_id, the only benefit of this might be the sorting. Because your plan has a LIMIT clause, it's difficult to tell. SEQ scans can be the fastest option if you're using > 1/5 of the table, especially if you're using a standard HDD instead of solid state. If you're planner insists on SEQ scanning when you know an IDX scan is faster (you've tested with enable_seqscan false, you may want to try fine tuning your random_page_cost or effective_cache_size.
Finally, I'd recomment not keeping all of these indexes. Find the ones you need, and cull the rest. Indexes cause huge performance degradation in inserts (especially mutli-column and GIN/GIST indexes).
The simplest index for your query is mytable(other_id, current). This handles the first two conditions. This would be a normal b-tree type index.
You can satisfy the array condition using a GIST index on mytable(dn_ids).
However, I don't think you can mix the different data types in one index, at least not without extensions.