I have a date field on a large table that I mostly query and sort in DESC order. I have an index on that field with the default ASC order. I read that if an index is on a single field it does not matter if it is in ASC or DESC order since an index can be read from both directions. Will I benefit from changing my index to DESC?
Operating systems are generally more efficient at reading files in the forward direction, so you may get a slight speedup by creating a DESC index.
For a big speedup, create the DESC index and CLUSTER the table on it:
CLUSTER tablename USING indexname;
Clustering on the ASC index will also give an improvement, but a smaller one.
Within my db I have table prediction_fsd with about 5 million entries. The site table contains approx 3 million entries. I need to execute queries that look like
SELECT prediction_fsd.id AS prediction_fsd_id,
prediction_fsd.site_id AS prediction_fsd_site_id,
prediction_fsd.html_hash AS prediction_fsd_html_hash,
prediction_fsd.prediction AS prediction_fsd_prediction,
prediction_fsd.algorithm AS prediction_fsd_algorithm,
prediction_fsd.model_version AS prediction_fsd_model_version,
prediction_fsd.timestamp AS prediction_fsd_timestamp,
site_1.id AS site_1_id,
site_1.url AS site_1_url,
site_1.status AS site_1_status
FROM prediction_fsd
LEFT OUTER JOIN site AS site_1
ON site_1.id = prediction_fsd.site_id
WHERE 95806 = prediction_fsd.site_id
AND prediction_fsd.algorithm = 'xgboost'
ORDER BY prediction_fsd.timestamp DESC
LIMIT 1
At the moment this query takes about 4 seconds. I'd like to reduce that by introducing an index. Which tables and fields should I include in that index? I'm having trouble understanding the EXPLAIN ANALYZE output of Postgres.
CREATE INDEX prediction_fsd_site_id_algorithm_timestamp
ON public.prediction_fsd USING btree
(site_id, algorithm, "timestamp" DESC)
TABLESPACE pg_default;
By introducing the combined index above, as suggested by Frank Heikens, I was able to bring the query execution time down to 0.25s.
These three SQL lines point to a possible BTREE index to help you.
WHERE 95806 = prediction_fsd.site_id
AND prediction_fsd.algorithm = 'xgboost'
ORDER BY prediction_fsd.timestamp DESC
You're filtering the rows of the table by equality on two columns, and ordering by the third column. So try this index.
CREATE INDEX site_alg_ts ON prediction_fsd
(site_id, algorithm, timestamp DESC);
This BTREE index lets PostgreSQL random-access it to the first eligible row, which happens also to be the row you want with your ORDER BY ... LIMIT 1 clause.
The query plan in your question says that PostgreSQL did an expensive Parallel Sequential Scan on all five megarows of that table. This index will almost certainly change that to a cheap index lookup.
On the other table, it appears that you already look up rows in it via the primary key id. So you don't need any other index for that one.
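The equality-plus-ordering pattern described above can be tried without a PostgreSQL server. The following is only a sketch under assumptions: it uses Python's built-in sqlite3 module as a stand-in, a cut-down prediction_fsd table, and synthetic rows, so the plan text is SQLite's and not what PostgreSQL's EXPLAIN would print. It checks that the query returns the newest row for one site and algorithm and that the composite index is chosen without a separate sort step.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for PostgreSQL; the data is synthetic.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prediction_fsd ("
    " id INTEGER PRIMARY KEY, site_id INTEGER, algorithm TEXT,"
    ' prediction REAL, "timestamp" INTEGER)'
)
# 200 sites x 2 algorithms x 50 timestamps each.
rows = [
    (site, algo, 0.5, ts)
    for site in range(200)
    for algo in ("xgboost", "svm")
    for ts in range(50)
]
conn.executemany(
    'INSERT INTO prediction_fsd (site_id, algorithm, prediction, "timestamp")'
    " VALUES (?, ?, ?, ?)",
    rows,
)
# Equality columns first, then the ORDER BY column.
conn.execute(
    "CREATE INDEX site_alg_ts ON prediction_fsd"
    ' (site_id, algorithm, "timestamp" DESC)'
)

query = (
    'SELECT site_id, algorithm, "timestamp" FROM prediction_fsd'
    " WHERE site_id = ? AND algorithm = ?"
    ' ORDER BY "timestamp" DESC LIMIT 1'
)
params = (95, "xgboost")

latest = conn.execute(query, params).fetchone()
# The fourth column of each EXPLAIN QUERY PLAN row is the readable detail.
plan = " | ".join(
    row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query, params)
)
print(latest)
print(plan)
```

On a real PostgreSQL table, EXPLAIN ANALYZE before and after creating the index is still the way to confirm the change from a sequential scan to an index scan.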
I'm building a TimescaleDB local server and I'm creating my first "production" hypertables. The point is that, at the moment, all the future consumers of my DB are going to use the data in ASC order, but by default timescale creates a DESC index in the time column.
My question is: is it worth changing the default behaviour and making the index ASC?
I don't know if it's DESC by default for a good reason, in which case changing it would cost me something. I have also read that indexes in PostgreSQL can be read backward, so a DESC index could be used by an ASC query, but I don't know if there is a performance penalty.
On the other hand, is it safe to simply delete the default index and create a new one with a different order? I'm also not sure whether deleting it would break some internal Timescale functionality.
Thanks for your time,
H25E
For a single-column index, it does not matter at all if it is created ASC or DESC, because indexes can be read in both directions with the same efficiency.
The only time when you really need to specify DESC in an index is if the index is supposed to support an ORDER BY clause like ORDER BY a, b DESC. Then one of the index columns must be sorted ASC and the other DESC — but again it doesn't matter which one is ASC and which DESC, as the index can be read in both directions.
So, for a single column index, there is no need to build the index again, and there was no good reason to create it DESC in the first place (but it doesn't matter).
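A quick way to see both points (a single-column index serves either direction, and a two-column index serves a mixed ORDER BY only if it matches the index or its exact mirror) is to ask a planner whether it still needs a sort step. This is a rough sketch, assuming SQLite through Python's sqlite3 module as a stand-in for PostgreSQL; the tables s and t are made up for the illustration, and "needs a sort" is read off SQLite's plan text, not PostgreSQL's.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; tables s and t are hypothetical.
conn = sqlite3.connect(":memory:")

# Single-column case: one ASC index.
conn.execute("CREATE TABLE s (d INTEGER)")
conn.executemany("INSERT INTO s VALUES (?)", [(i,) for i in range(1000)])
conn.execute("CREATE INDEX s_d_asc ON s (d ASC)")

# Two-column case: one index with mixed directions.
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(a, b) for a in range(50) for b in range(50)],
)
conn.execute("CREATE INDEX t_a_b_desc ON t (a ASC, b DESC)")


def needs_sort(table, columns, order_by):
    """Return True if SQLite's plan contains an explicit sort step."""
    sql = f"EXPLAIN QUERY PLAN SELECT {columns} FROM {table} ORDER BY {order_by}"
    return any("TEMP B-TREE" in row[3] for row in conn.execute(sql))


single = {ob: needs_sort("s", "d", ob) for ob in ("d ASC", "d DESC")}
mixed = {
    ob: needs_sort("t", "a, b", ob)
    for ob in (
        "a ASC, b DESC",   # matches the index
        "a DESC, b ASC",   # exact mirror: index read backward
        "a ASC, b ASC",    # neither: a sort is needed
        "a DESC, b DESC",  # neither: a sort is needed
    )
}
print(single)
print(mixed)
```

In PostgreSQL the equivalent check is to run EXPLAIN on each ORDER BY variant and look for a Sort node above the index scan.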
I have a table with 2000000 records.
I have one index:
CREATE INDEX test
ON public."PaymentReceipt" USING btree
("CompanyID" DESC NULLS LAST, created_at ASC NULLS LAST)
TABLESPACE pg_default;
If I run this query, it will use the index:
explain
select *
from public."PaymentReceipt"
where "CompanyID" = '5c67762bd0949'
order by "created_at" desc
limit 100 offset 1600589
But if I run this query, it won't use the index:
explain
select *
from public."PaymentReceipt"
where "CompanyID" = '5c67762bd0949'
order by "created_at" desc
limit 100 offset 1600590
I'm not sure what happened to the index!
OFFSET is bad for query performance.
The reason is that in order to skip the first 1000000 rows, the database has to find and then discard them. So the work for the database is the same as if you omit the OFFSET clause.
Now, per row, a sequential scan of a table is cheaper than an index scan, so the index scan offers an advantage only if it has to visit far fewer rows than the sequential scan would. As the OFFSET grows, the query plan will at some point “tilt”, and PostgreSQL will estimate that a sequential scan followed by a sort is cheaper. You have found exactly that point.
If your random_page_cost setting reflects your hardware correctly, PostgreSQL will estimate this correctly. In a nutshell, PostgreSQL is doing the right thing.
I have a table:
create table big_table (
id serial primary key,
-- other columns here
vote int
);
This table is very big (approximately 70 million rows), and I need to run this query:
SELECT * FROM big_table
ORDER BY vote [ASC|DESC], id [ASC|DESC]
OFFSET x LIMIT n -- I need this for pagination
As you may know, when x is a large number, queries like this are very slow.
For performance optimization I added indexes:
create index vote_order_asc on big_table (vote asc, id asc);
and
create index vote_order_desc on big_table (vote desc, id desc);
EXPLAIN shows that the above SELECT query uses these indexes, but it's very slow anyway with a large offset.
What can I do to optimize queries with OFFSET in big tables? Maybe PostgreSQL 9.5 or even newer versions have some features? I've searched but didn't find anything.
A large OFFSET is always going to be slow. Postgres has to order all rows and count the visible ones up to your offset. To skip all previous rows directly you could add an indexed row_number to the table (or create a MATERIALIZED VIEW including said row_number) and work with WHERE row_number > x instead of OFFSET x.
However, this approach is only sensible for read-only (or mostly) data. Implementing the same for table data that can change concurrently is more challenging. You need to start by defining desired behavior exactly.
I suggest a different approach for pagination:
SELECT *
FROM big_table
WHERE (vote, id) > (vote_x, id_x) -- ROW values
ORDER BY vote, id -- needs to be deterministic
LIMIT n;
Where vote_x and id_x are from the last row of the previous page (for both DESC and ASC). Or from the first if navigating backwards.
Comparing row values is supported by the index you already have - a feature that complies with the ISO SQL standard, but not every RDBMS supports it.
CREATE INDEX vote_order_asc ON big_table (vote, id);
Or for descending order:
SELECT *
FROM big_table
WHERE (vote, id) < (vote_x, id_x) -- ROW values
ORDER BY vote DESC, id DESC
LIMIT n;
This can use the same index.
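To make the paging loop concrete, here is a minimal sketch of the row-value approach next to plain OFFSET paging. It assumes SQLite (3.15 or later, for row values) through Python's sqlite3 module as a stand-in for PostgreSQL, a 500-row synthetic big_table, and a made-up PAGE_SIZE; it only shows that both methods return the same pages, not how fast either one is.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; the 500 rows are synthetic.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE big_table (id INTEGER PRIMARY KEY, vote INTEGER NOT NULL)"
)
# Many duplicate vote values, so the id tiebreaker actually matters.
conn.executemany(
    "INSERT INTO big_table (id, vote) VALUES (?, ?)",
    [(i, (i * 7) % 13) for i in range(1, 501)],
)
conn.execute("CREATE INDEX vote_order_asc ON big_table (vote, id)")

PAGE_SIZE = 50  # hypothetical page size


def keyset_pages():
    """Walk the table page by page, seeking past the last row already seen."""
    pages, last = [], None
    while True:
        if last is None:
            page = conn.execute(
                "SELECT vote, id FROM big_table ORDER BY vote, id LIMIT ?",
                (PAGE_SIZE,),
            ).fetchall()
        else:
            page = conn.execute(
                "SELECT vote, id FROM big_table"
                " WHERE (vote, id) > (?, ?)"
                " ORDER BY vote, id LIMIT ?",
                (last[0], last[1], PAGE_SIZE),
            ).fetchall()
        if not page:
            return pages
        pages.append(page)
        last = page[-1]  # (vote_x, id_x) for the next request


def offset_pages():
    """Same pages via OFFSET, which must read and discard all skipped rows."""
    pages, offset = [], 0
    while True:
        page = conn.execute(
            "SELECT vote, id FROM big_table ORDER BY vote, id LIMIT ? OFFSET ?",
            (PAGE_SIZE, offset),
        ).fetchall()
        if not page:
            return pages
        pages.append(page)
        offset += PAGE_SIZE


keyset = keyset_pages()
offset = offset_pages()
print(len(keyset), keyset == offset)
```

In a web application the (vote, id) pair of the last row would typically travel to the client and come back with the request for the next page.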
I suggest you declare your columns NOT NULL or acquaint yourself with the NULLS FIRST|LAST construct:
PostgreSQL sort by datetime asc, null first?
Note two things in particular:
The ROW values in the WHERE clause cannot be replaced with separated member fields. WHERE (vote, id) > (vote_x, id_x) cannot be replaced with:
WHERE vote >= vote_x
AND id > id_x
That would rule out all rows with id <= id_x, while we only want to do that for the same vote and not for the next. The correct translation would be:
WHERE (vote = vote_x AND id > id_x) OR vote > vote_x
... which doesn't play along with indexes as nicely, and gets increasingly complicated for more columns.
Would be simple for a single column, obviously. That's the special case I mentioned at the outset.
The technique does not work for mixed directions in ORDER BY like:
ORDER BY vote ASC, id DESC
At least I can't think of a generic way to implement this as efficiently. If at least one of the two columns is a numeric type, you could use a functional index with an inverted value on (vote, (id * -1)) - and use the same expression in ORDER BY:
ORDER BY vote ASC, (id * -1) ASC
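Both notes can be checked on a small data set. The sketch below is an illustration under assumptions: SQLite through Python's sqlite3 module stands in for PostgreSQL, the 100 rows are synthetic, and cursor_row is a hypothetical "last row of the previous page". It compares the row-value predicate with its correct expansion and with the incorrect member-wise version, then applies the negated-id trick for vote ASC, id DESC. The expression index itself is not created here, since only the result sets are compared.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; the 100 rows are synthetic.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE big_table (id INTEGER PRIMARY KEY, vote INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO big_table (id, vote) VALUES (?, ?)",
    [(i, i % 5) for i in range(1, 101)],
)

cursor_row = (2, 52)  # hypothetical last row of the previous page (vote, id)
vote_x, id_x = cursor_row


def ids(where, params):
    """Return the ids matching a WHERE clause, in (vote, id) order."""
    sql = "SELECT id FROM big_table WHERE " + where + " ORDER BY vote, id"
    return [row[0] for row in conn.execute(sql, params)]


# ROW value comparison and its correct expansion select the same rows.
row_value = ids("(vote, id) > (?, ?)", (vote_x, id_x))
expanded = ids("(vote = ? AND id > ?) OR vote > ?", (vote_x, id_x, vote_x))
# The member-wise version wrongly drops rows with a higher vote but a lower id.
naive = ids("vote >= ? AND id > ?", (vote_x, id_x))

# Mixed directions (vote ASC, id DESC): compare and sort on the negated id.
mixed_next = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM big_table"
        " WHERE (vote, id * -1) > (?, ?)"
        " ORDER BY vote, id * -1 LIMIT 3",
        (vote_x, -id_x),
    )
]

print(len(row_value), len(expanded), len(naive))
print(mixed_next)
```

The rows missing from naive are exactly those with vote > vote_x and id <= id_x, which is the case described in the first note.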
Related:
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
Improve performance for order by with columns from many tables
Note in particular the presentation by Markus Winand I linked to:
"Pagination done the PostgreSQL way"
Have you tried partitioning the table?
Ease of management, improved scalability and availability, and a
reduction in blocking are common reasons to partition tables.
Improving query performance is not a reason to employ partitioning,
though it can be a beneficial side-effect in some cases. In terms of
performance, it is important to ensure that your implementation plan
includes a review of query performance. Confirm that your indexes
continue to appropriately support your queries after the table is
partitioned, and verify that queries using the clustered and
nonclustered indexes benefit from partition elimination where
applicable.
http://sqlperformance.com/2013/09/sql-indexes/partitioning-benefits
I'm having trouble with a query that becomes ghastly slow as the database grows.
The problem seems to be the sorting, which depends on three conditions - importance, urgency and timestamp.
The query currently in use is plain old
ORDER BY urgent DESC, important DESC, date_published DESC
Fields are boolean for urgent and important, and date_published is an integer (UNIX timestamp).
Create indexes for the columns you sort by regularly. You can even create a compound index that covers all three:
CREATE INDEX foo ON table_name (urgent DESC, important DESC, date_published DESC);
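As a rough check of what the compound index buys, the sketch below compares the plan for the three-column ORDER BY before and after creating it. It rests on assumptions: SQLite through Python's sqlite3 module stands in for the actual database, the articles table name and its rows are invented for the example, and a LIMIT 20 is added to mimic a typical listing page.

```python
import sqlite3

# Assumption: SQLite stand-in; the articles table and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles ("
    " id INTEGER PRIMARY KEY, urgent INTEGER NOT NULL,"
    " important INTEGER NOT NULL, date_published INTEGER NOT NULL)"
)
# Booleans stored as 0/1, date_published as a UNIX timestamp.
conn.executemany(
    "INSERT INTO articles (urgent, important, date_published) VALUES (?, ?, ?)",
    [(i % 2, (i // 2) % 2, 1_600_000_000 + i * 60) for i in range(1000)],
)

ORDER = "urgent DESC, important DESC, date_published DESC"
QUERY = f"SELECT id FROM articles ORDER BY {ORDER} LIMIT 20"


def plan():
    """Return the readable detail column of SQLite's query plan."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY)]


plan_without_index = plan()
conn.execute(f"CREATE INDEX foo ON articles ({ORDER})")
plan_with_index = plan()
top_ids = [row[0] for row in conn.execute(QUERY)]

print(plan_without_index)
print(plan_with_index)
```

Because all three columns use the same direction, an index declared with all three ASC would serve this ORDER BY equally well by being read backward.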