I have a table with 2000000 records.
I have one index:
CREATE INDEX test
ON public."PaymentReceipt" USING btree
("CompanyID" DESC NULLS LAST, created_at ASC NULLS LAST)
TABLESPACE pg_default;
If I run this query, it will use the index:
explain
select *
from public."PaymentReceipt"
where "CompanyID" = '5c67762bd0949'
order by "created_at" desc
limit 100 offset 1600589
But if I run this query, it won't use the index:
explain
select *
from public."PaymentReceipt"
where "CompanyID" = '5c67762bd0949'
order by "created_at" desc
limit 100 offset 1600590
I'm not sure what happened to the index!
OFFSET is bad for query performance.
The reason is that in order to skip the first 1000000 rows, the database has to find and then discard them. So the work for the database is the same as if you had asked for all of those rows plus the ones you actually want, without an OFFSET clause.
Now sequential scans of a table are cheaper per row than index scans, so the latter offer an advantage only if the number of rows retrieved is much smaller than with a sequential scan. At some point the query plan will “tilt” and PostgreSQL will assume that a sequential scan is cheaper. You have found exactly that point.
If your random_page_cost setting reflects your hardware correctly, PostgreSQL will estimate this correctly. In a nutshell, PostgreSQL is doing the right thing.
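To see where the plan tilts on your own data, you can compare the estimated plans under different settings. This is only a sketch, assuming the table and index from the question; the value 1.1 is an illustrative setting sometimes used for SSD-backed storage, not a recommendation for your hardware:

```sql
-- Show the current planner setting (the PostgreSQL default is 4.0).
SHOW random_page_cost;

-- Try a lower value for this session only; 1.1 is an illustrative assumption.
SET random_page_cost = 1.1;

EXPLAIN
SELECT *
FROM public."PaymentReceipt"
WHERE "CompanyID" = '5c67762bd0949'
ORDER BY "created_at" DESC
LIMIT 100 OFFSET 1600590;

-- Go back to the configured value.
RESET random_page_cost;
```

If the index scan comes back with the lower setting, the switch you observed was a cost estimate crossing over rather than a problem with the index.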
Related
Within my db I have table prediction_fsd with about 5 million entries. The site table contains approx 3 million entries. I need to execute queries that look like
SELECT prediction_fsd.id AS prediction_fsd_id,
prediction_fsd.site_id AS prediction_fsd_site_id,
prediction_fsd.html_hash AS prediction_fsd_html_hash,
prediction_fsd.prediction AS prediction_fsd_prediction,
prediction_fsd.algorithm AS prediction_fsd_algorithm,
prediction_fsd.model_version AS prediction_fsd_model_version,
prediction_fsd.timestamp AS prediction_fsd_timestamp,
site_1.id AS site_1_id,
site_1.url AS site_1_url,
site_1.status AS site_1_status
FROM prediction_fsd
LEFT OUTER JOIN site AS site_1
ON site_1.id = prediction_fsd.site_id
WHERE 95806 = prediction_fsd.site_id
AND prediction_fsd.algorithm = 'xgboost'
ORDER BY prediction_fsd.timestamp DESC
LIMIT 1
At the moment this query takes about 4 seconds. I'd like to reduce that by introducing an index. Which tables and fields should I include in that index? I'm having trouble properly understanding the EXPLAIN ANALYZE output of Postgres.
CREATE INDEX prediction_fsd_site_id_algorithm_timestamp
ON public.prediction_fsd USING btree
(site_id, algorithm, "timestamp" DESC)
TABLESPACE pg_default;
By introducing a combined index as suggested by Frank Heikens I was able to bring down the query execution time to 0.25s
These three SQL lines point to a possible BTREE index to help you.
WHERE 95806 = prediction_fsd.site_id
AND prediction_fsd.algorithm = 'xgboost'
ORDER BY prediction_fsd.timestamp DESC
You're filtering the rows of the table by equality on two columns, and ordering by the third column. So try this index.
CREATE INDEX site_alg_ts ON prediction_fsd
(site_id, algorithm, timestamp DESC);
This BTREE index lets PostgreSQL jump directly to the first eligible row, which also happens to be the row you want with your ORDER BY ... LIMIT 1 clause.
The query plan in your question says that PostgreSQL did an expensive Parallel Sequential Scan on all five megarows of that table. This index will almost certainly change that to a cheap index lookup.
On the other table, it appears that you already look up rows in it via the primary key id. So you don't need any other index for that one.
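A quick way to check whether the new index is picked up is to run the query under EXPLAIN. A sketch, trimmed to the prediction_fsd table for brevity; the exact plan text depends on your data and PostgreSQL version:

```sql
-- ANALYZE actually executes the query and reports real timings;
-- BUFFERS adds how many pages were read.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, site_id, prediction, "timestamp"
FROM prediction_fsd
WHERE site_id = 95806
  AND algorithm = 'xgboost'
ORDER BY "timestamp" DESC
LIMIT 1;
```

In the output, look for an Index Scan node on the new index where the Parallel Seq Scan node used to be.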
I have a table similar to this one:
create table request_journal
(
id bigint,
request_body text,
request_date timestamp,
user_id bigint
);
It is used for request logging purposes, so frequent inserts in it are expected (2k+ rps).
I want to create a composite index on columns request_date and user_id to speed up execution of select queries like this:
select *
from request_journal
where request_date between '2021-07-08 10:00:00' and '2021-07-08 16:00:00'
and user_id = 123
order by request_date desc;
I tested select queries with a (request_date desc, user_id) btree index and a (user_id, request_date desc) btree index. With the index that leads with request_date, select queries run about 10% faster, but in general the performance of either index is acceptable.
So my question is: does the index column order affect insertion time? I have not spotted any differences using EXPLAIN/EXPLAIN ANALYZE on the insert query. Which index will be cheaper to build and maintain under "high load"?
It is hard to believe your tests were done on any vaguely realistic data size.
At the rate you indicate, a 6 hour range would include over 43 million records. If the user_ids are spread evenly over 1e6 different values, I get the index leading with user_id to be a thousand times faster for that query than the one leading with request_date.
But anyway, for loading new data, assuming the new data is all from recent times, the index leading with request_date should be faster, because the pages needing maintenance during loading will be concentrated in one part of the index and so better cached. But this would depend on how much RAM you have, what your disk system is like, and how many distinct user_ids you are loading data for.
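The 43 million figure above follows from the stated insert rate. A sketch of the arithmetic, assuming a steady 2000 inserts per second and, as in the estimate, 1e6 evenly distributed user_ids:

```sql
SELECT 2000 * 6 * 60 * 60             AS rows_in_range,  -- 43200000
       2000 * 6 * 60 * 60 / 1000000.0 AS rows_per_user;  -- about 43.2
```

Under those assumptions, the index leading with request_date has to walk tens of millions of entries in the date range and filter them by user_id, while the index leading with user_id only visits the few dozen entries for that user.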
I'm using Postgres.
I have queries of the form:
`SELECT COUNT(*) from clicks where link_id=1`
The clicks table has millions of rows.
These queries are taking 10-20 seconds.
Are there any elegant ways to accelerate this?
Edit: Query plan:
Indexes
CREATE INDEX clicks_link_id_index
ON public.clicks USING btree
(link_id ASC NULLS LAST)
TABLESPACE pg_default;
CREATE INDEX clicks_link_workspace_id_index
ON public.clicks USING btree
(link_workspace_id ASC NULLS LAST)
TABLESPACE pg_default;
CREATE INDEX index_date_trunc
ON public.clicks USING btree
(date_trunc('day'::text, inserted_at) ASC NULLS LAST)
TABLESPACE pg_default;
SET ENABLE_SEQSCAN TO OFF:
It thinks the index-only scan is going to be 4 times cheaper than the sequential scan, so I am confused about why it didn't pick the index-only scan in the first place (before you set enable_seqscan=off). Now if you set that back to the default, does it go back to using the seq scan?
If you freshly VACUUM the table, does it get much faster with the index-only scan? While 99% of the rows are fetched without heap fetches, the remaining 1% could account for most of the time.
Or maybe you should just throw more hardware at it. Having more RAM in which to cache data (either in shared_buffers or in filesystem cache) can't hurt.
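The checks suggested above could look roughly like this. It is only a sketch against the clicks table from the question; how much the timing changes depends on how stale the visibility map was:

```sql
-- Return to the default planner behaviour.
RESET enable_seqscan;

-- Refresh the visibility map and the planner statistics.
VACUUM (ANALYZE) clicks;

-- Compare the plan and timing again; the "Heap Fetches" figure on the
-- Index Only Scan node should drop towards 0 after the vacuum.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM clicks WHERE link_id = 1;
```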
I have a table
create table big_table (
id serial primary key,
-- other columns here
vote int
);
This table is very big, approximately 70 million rows. I need to query:
SELECT * FROM big_table
ORDER BY vote [ASC|DESC], id [ASC|DESC]
OFFSET x LIMIT n -- I need this for pagination
As you may know, when x is a large number, queries like this are very slow.
For performance optimization I added indexes:
create index vote_order_asc on big_table (vote asc, id asc);
and
create index vote_order_desc on big_table (vote desc, id desc);
EXPLAIN shows that the above SELECT query uses these indexes, but it's very slow anyway with a large offset.
What can I do to optimize queries with OFFSET in big tables? Maybe PostgreSQL 9.5 or even newer versions have some features? I've searched but didn't find anything.
A large OFFSET is always going to be slow. Postgres has to order all rows and count the visible ones up to your offset. To skip all previous rows directly you could add an indexed row_number to the table (or create a MATERIALIZED VIEW including said row_number) and work with WHERE row_number > x instead of OFFSET x.
However, this approach is only sensible for read-only (or mostly) data. Implementing the same for table data that can change concurrently is more challenging. You need to start by defining desired behavior exactly.
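For the read-mostly case, the materialized view variant might look like this. A sketch only; the names big_table_numbered, big_table_numbered_rn and rn are made up for illustration:

```sql
-- Number all rows once, in the desired sort order.
CREATE MATERIALIZED VIEW big_table_numbered AS
SELECT row_number() OVER (ORDER BY vote, id) AS rn, *
FROM big_table;

CREATE UNIQUE INDEX big_table_numbered_rn ON big_table_numbered (rn);

-- Same page as OFFSET 1000000 LIMIT 20, but found via the index on rn.
SELECT *
FROM big_table_numbered
WHERE rn > 1000000
ORDER BY rn
LIMIT 20;

-- The numbering is a snapshot; refresh it after the table changes.
REFRESH MATERIALIZED VIEW big_table_numbered;
```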
I suggest a different approach for pagination:
SELECT *
FROM big_table
WHERE (vote, id) > (vote_x, id_x) -- ROW values
ORDER BY vote, id -- needs to be deterministic
LIMIT n;
Where vote_x and id_x are from the last row of the previous page (for both DESC and ASC). Or from the first if navigating backwards.
Comparing row values is supported by the index you already have - a feature that complies with the ISO SQL standard, but not every RDBMS supports it.
CREATE INDEX vote_order_asc ON big_table (vote, id);
Or for descending order:
SELECT *
FROM big_table
WHERE (vote, id) < (vote_x, id_x) -- ROW values
ORDER BY vote DESC, id DESC
LIMIT n;
Can use the same index.
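One way to wire this into an application is a prepared statement that takes the last row of the previous page as parameters. A sketch; the statement name next_page and the sample values are assumptions for illustration:

```sql
PREPARE next_page(int, int, int) AS
SELECT *
FROM big_table
WHERE (vote, id) > ($1, $2)  -- last (vote, id) of the previous page
ORDER BY vote, id
LIMIT $3;

-- Previous page ended at vote = 42, id = 1000; fetch the next 20 rows.
EXECUTE next_page(42, 1000, 20);
```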
I suggest you declare your columns NOT NULL or acquaint yourself with the NULLS FIRST|LAST construct:
PostgreSQL sort by datetime asc, null first?
Note two things in particular:
The ROW values in the WHERE clause cannot be replaced with separated member fields. WHERE (vote, id) > (vote_x, id_x) cannot be replaced with:
WHERE vote >= vote_x
AND id > id_x
That would rule out all rows with id <= id_x, while we only want to do that for the same vote and not for the next. The correct translation would be:
WHERE (vote = vote_x AND id > id_x) OR vote > vote_x
... which doesn't play along with indexes as nicely, and gets increasingly complicated for more columns.
Would be simple for a single column, obviously. That's the special case I mentioned at the outset.
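To see the difference between the three predicates, here is a small self-contained comparison on made-up rows, with (1, 20) standing in for the last row of the previous page:

```sql
WITH t(vote, id) AS (
   VALUES (1, 10), (1, 20), (2, 5), (2, 30), (3, 1)
)
SELECT vote, id,
       (vote, id) > (1, 20)               AS row_value,
       (vote = 1 AND id > 20) OR vote > 1 AS expanded,
       vote >= 1 AND id > 20              AS naive
FROM t
ORDER BY vote, id;
```

The row_value and expanded columns agree on every row, while the naive column is false for (2, 5) and (3, 1), so those rows would be missing from the next page.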
The technique does not work for mixed directions in ORDER BY like:
ORDER BY vote ASC, id DESC
At least I can't think of a generic way to implement this as efficiently. If at least one of the two columns is a numeric type, you could use a functional index with an inverted value on (vote, (id * -1)) - and use the same expression in ORDER BY:
ORDER BY vote ASC, (id * -1) ASC
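Put together, the inverted-value workaround could look like this. A sketch only; the index name and the sample values are made up, and it is worth confirming with EXPLAIN that the planner actually uses the expression index for the row comparison:

```sql
CREATE INDEX vote_asc_id_desc ON big_table (vote, (id * -1));

-- Next page for ORDER BY vote ASC, id DESC,
-- where the previous page ended at vote = 42, id = 1000.
SELECT *
FROM big_table
WHERE (vote, id * -1) > (42, 1000 * -1)
ORDER BY vote, (id * -1)
LIMIT 20;
```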
Related:
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
Improve performance for order by with columns from many tables
Note in particular the presentation by Markus Winand I linked to:
"Pagination done the PostgreSQL way"
Have you tried partitioning the table?
Ease of management, improved scalability and availability, and a
reduction in blocking are common reasons to partition tables.
Improving query performance is not a reason to employ partitioning,
though it can be a beneficial side-effect in some cases. In terms of
performance, it is important to ensure that your implementation plan
includes a review of query performance. Confirm that your indexes
continue to appropriately support your queries after the table is
partitioned, and verify that queries using the clustered and
nonclustered indexes benefit from partition elimination where
applicable.
http://sqlperformance.com/2013/09/sql-indexes/partitioning-benefits
I have a date field on a large table that I mostly query and sort in DESC order. I have an index on that field with the default ASC order. I read that if an index is on a single field it does not matter if it is in ASC or DESC order since an index can be read from both directions. Will I benefit from changing my index to DESC?
Operating systems are generally more efficient at reading files in a forward direction, so you may get a slight speedup by creating a DESC index.
For a big speedup, create the DESC index and CLUSTER the table on it.
CLUSTER tablename USING indexname;
Clustering on the ASC index will also give an improvement, but it will be smaller.
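As a concrete sketch, with a hypothetical table events and date column created_at standing in for yours:

```sql
CREATE INDEX events_created_at_desc ON events (created_at DESC);

-- Rewrites the table in index order; takes an ACCESS EXCLUSIVE lock
-- for the duration, so run it in a maintenance window.
CLUSTER events USING events_created_at_desc;

-- Update planner statistics after the rewrite.
ANALYZE events;
```

Keep in mind that CLUSTER is a one-time operation: rows inserted or updated afterwards are not kept in index order, so it has to be repeated periodically if the table keeps changing.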