I have a table similar to this one:
create table request_journal
(
id bigint,
request_body text,
request_date timestamp,
user_id bigint,
);
It is used for request logging purposes, so frequent inserts in it are expected (2k+ rps).
I want to create composite index on columns request_date and user_id to speed up execution of select queries like this:
select *
from request_journal
where request_date between '2021-07-08 10:00:00' and '2021-07-08 16:00:00'
and user_id = 123
order by request_date desc;
I tested select queries with (request_date desc, user_id) btree index and (user_id, request_date desc) btree index. With request_date leading column index select queries are executed about 10% faster, but in general performance of any of this indexes is acceptable.
So my question is does the index column order affect insertion time? I have not spotted any differences using EXPLAIN/EXPLAIN ANALYZE on insert query. Which index will be more build time efficient under "high load"?
It is hard to believe your test were done on any vaguely realistic data size.
At the rate you indicate, a 6 hour range would include over 43 million records. If the user_ids are spread evenly over 1e6 different values, I get the index leading with user_id to be a thousand times faster for that query than the one leading with request_date.
But anyway, for loading new data, assuming the new data is all from recent times, then the one with request_date should be faster as the part of the index needing maintenance while loading will be more concentrated in part of the index, and so better cached. But this would depend on how much RAM you have, what your disk system is like, and how many distinct user_ids you are loading data for.
Related
I have a table with the following columns:
ID (VARCHAR)
CUSTOMER_ID (VARCHAR)
STATUS (VARCHAR) (4 different status possible)
other not relevant columns
I try to find all the lines with customer_id = and status = two different status.
The query looks like:
SELECT *
FROM my_table
WHERE customer_id = '12345678' AND status IN ('STATUS1', 'STATUS2');
The table contains about 1 mio lines. I added two indexes on customer_id and status. The query still needs about 1 second to run.
The explain plan is:
1. Gather
2. Seq Scan on my_table
Filter: (((status)::text = ANY ('{SUBMITTED,CANCELLED}'::text[])) AND ((customer_id)::text = '12345678'::text))
I ran the 'analyze my_table' after creating the indexes. What could I do to improve the performance of this quite simple query?
You need a compound (multi-column) index to help satisfy your query.
Guessing, it seems like the most selective column (the one with the most distinct values) is customer_id. status probably has only a few distinct values. So customer_id should go first in the index. Try this.
ALTER TABLE my_table ADD INDEX customer_id_status (customer_id, status);
This creates a BTREE index. A useful mental model for such an index is an old-fashioned telephone book. It's sorted in order. You look up the first matching entry in the index, then scan it sequentially for the items you want.
You may also want to try running ANALYZE my_table; to update the statistics (about selectivity) used by the query planner to choose an appropriate index.
Pro tip Avoid SELECT * if you can. Instead name the columns you want. This can help performance a lot.
Pro tip Your question said some of your columns aren't relevant to query optimization. That's probably not true; index design is a weird art. SELECT * makes it less true.
Within my db I have table prediction_fsd with about 5 million entries. The site table contains approx 3 million entries. I need to execute queries that look like
SELECT prediction_fsd.id AS prediction_fsd_id,
prediction_fsd.site_id AS prediction_fsd_site_id,
prediction_fsd.html_hash AS prediction_fsd_html_hash,
prediction_fsd.prediction AS prediction_fsd_prediction,
prediction_fsd.algorithm AS prediction_fsd_algorithm,
prediction_fsd.model_version AS prediction_fsd_model_version,
prediction_fsd.timestamp AS prediction_fsd_timestamp,
site_1.id AS site_1_id,
site_1.url AS site_1_url,
site_1.status AS site_1_status
FROM prediction_fsd
LEFT OUTER JOIN site AS site_1
ON site_1.id = prediction_fsd.site_id
WHERE 95806 = prediction_fsd.site_id
AND prediction_fsd.algorithm = 'xgboost'
ORDER BY prediction_fsd.timestamp DESC
LIMIT 1
at the moment this query takes about ~4 seconds. I'd like to reduce that by introducing an index. Which tables and fields should I include in that index. I'm having troubles properly understanding the EXPLAIN ANALYZE output of Postgres
CREATE INDEX prediction_fsd_site_id_algorithm_timestamp
ON public.prediction_fsd USING btree
(site_id, algorithm, "timestamp" DESC)
TABLESPACE pg_default;
By introducing a combined index as suggested by Frank Heikens I was able to bring down the query execution time to 0.25s
These three SQL lines point to a possible BTREE index to help you.
WHERE 95806 = prediction_fsd.site_id
AND prediction_fsd.algorithm = 'xgboost'
ORDER BY prediction_fsd.timestamp DESC
You're filtering the rows of the table by equality on two columns, and ordering by the third column. So try this index.
CREATE INDEX site_alg_ts ON prediction_fsd
(site_id, algorithm, timestamp DESC);
This BTREE index lets PostgreSQL random-access it to the first eligible row, which happens also to be the row you want with your ORDER BY ... LIMIT 1 clause.
The query plan in your question says that PostgreSQL did an expensive Parallel Sequential Scan on all five megarows of that table. This index will almost certainly change that to a cheap index lookup.
On the other table, it appears that you already look up rows in it via the primary key id. So you don't need any other index for that one.
I have 2 tables, approximately this:
Parent_table: Parent_id bigint, Loc geometry
Child_table: Child_id bigint,
parent_id bigint,
record_date timestamp,
value double precision,
category character varying(10)
I need to query subsets of child table for varying conditions (location, date range, value range, category). As part of this, I winnow down the locations from the parent table and then want to get all the matching records
The obvious way to do this is:
with limited_parents as
(
select parent_id from parent_table where [location condition]
)
select [ columns ] from child_table where parent_id in
(select parent_id from limited_parents)
and [ other conditions for record_date, value, category ]
Child_table has >200m records, partioned by year. It has an index of the parent_key and the other columns are all included in an index, in that order.
Parent_table has <10k records. Each Parent could easily have > 1m+ child records (the numbers of child records per parent are widely distributed from a few hundred to million+). The set of parents which are in scope in any query (and therefore included in that sub-select) might be from 1 to several hundred.
Database is currently Postgres 10.
The query is performant for ranges of a couple of years/partitions, but gets significantly slower as the amount of date in scope in increases.
I have freedom to adjust indexes and change the queries. Is there a more efficient way of doing this query? (Flattening the two tables, and putting the location on the child table and doing the geo intersection there, makes the whole thing orders of magnitude slower)
Your query is written in a complicated and inefficient fashion. In particular, CTEs are an optimizer fence in older PostgreSQL versions.
Try this:
SELECT [ columns ] FROM child_table AS c
WHERE EXISTS (SELECT 1 FROM parent_table AS p
WHERE [location condition on p]
AND p.parent_id = c.parent_id)
AND [ other conditions for record_date, value, category ]
Create an index on the foreign key column of the large table to speed up a nested loop join. Set work_mem high to speed up a hash join. PostgreSQL will autonatically pick the best solution.
I have a Notifications table with approximately 7,000,000 records where the relevant columns are:
id: integer
time_created: timestamp with time zone
device_id: integer (foreign key to another table)
And the indexes:
CREATE INDEX notifications_device ON notifications (device_id);
CREATE INDEX notifications_time ON notifications (time_created);
And my query:
SELECT COUNT(*) AS "__count"
FROM "notifications"
WHERE ("notifications"."device_id" IN (
SELECT "id" FROM device WHERE (
device."device_type" = 'iOS' AND
device."registration_id" IN (
'XXXXXXX',
'YYYYYYY',
'ZZZZZZZ'
)
)
)
AND "notifications"."time_created" BETWEEN
'2020-10-26 00:00:00' AND '2020-10-26 17:33:00')
;
For most of the day, this query will use the index on device_id, and will run in under 1ms. But once the table is written to very quickly (logging notifications sent) the planner switches to using the index on time_created and the query blows out to 300ms.
Running an ANALYZE NOTIFICATIONS immediately fixes the problem, and the index on device_id is used again.
The table is pruned to the last 30 days each night, which is why there is a separate index on the time_created column.
Can I fix this issue, so that the planner always chooses the index on device_id, by forcing postgres to maintain better statistics on this table? Alternatively, can I re-write the time_created index (perhaps by using a different index type like BRIN) so that it'd only be considered for a WHERE clause like time_created < ..30 days ago.. and not WHERE time_created BETWEEN midnight and now?
EXPLAIN ANALYZE stats:
Bad Plan (time_created):
Rows Removed by Filter = 20926
Shared Hit Blocks = 143934
Plan Rows = 38338
Actual Rows = 84479
Good Plan (device_id):
Rows Removed by Filter = 95
Shared Hit Blocks = 34
Plan Rows = 1
Actual Rows = 0
I would actually suggest a composite index on the notifications table:
CREATE INDEX idx1 ON notifications (device_id, time_created);
This index would cover both restrictions in the current WHERE clause. I would also add an index on the device table:
CREATE INDEX idx2 ON device (device_type, registration_id, id);
The first two columns of this 3-column index would cover the WHERE clause of the subquery. It also includes the id column to completely cover the SELECT clause. If used, Postgres could more rapidly evaluate the subquery on the device table.
You could also play around with some slight variants of the above two indices, by changing column order. For example, you could also try:
CREATE INDEX idx1 ON notifications (time_created, device_id);
CREATE INDEX idx2 ON device (registration_id , device_type, id);
The table is pruned to the last 30 days each night, which is why there is a separate index on the time_created column.
But, is that a good reason to have the index? Does it matter if the nightly query takes a little longer? Indeed, for deleting 3% of a table, does it even use the index and if it does, does that actually make it faster? Maybe you could replace the index with partitioning, or with nothing.
In any case, you can use this ugly hack to force it not to use the index:
AND "notifications"."time_created" + interval '0 seconds' BETWEEN '2020-10-26 00:00:00' AND '2020-10-26 17:33:00'
I have table
create table big_table (
id serial primary key,
-- other columns here
vote int
);
This table is very big, approximately 70 million rows, I need to query:
SELECT * FROM big_table
ORDER BY vote [ASC|DESC], id [ASC|DESC]
OFFSET x LIMIT n -- I need this for pagination
As you may know, when x is a large number, queries like this are very slow.
For performance optimization I added indexes:
create index vote_order_asc on big_table (vote asc, id asc);
and
create index vote_order_desc on big_table (vote desc, id desc);
EXPLAIN shows that the above SELECT query uses these indexes, but it's very slow anyway with a large offset.
What can I do to optimize queries with OFFSET in big tables? Maybe PostgreSQL 9.5 or even newer versions have some features? I've searched but didn't find anything.
A large OFFSET is always going to be slow. Postgres has to order all rows and count the visible ones up to your offset. To skip all previous rows directly you could add an indexed row_number to the table (or create a MATERIALIZED VIEW including said row_number) and work with WHERE row_number > x instead of OFFSET x.
However, this approach is only sensible for read-only (or mostly) data. Implementing the same for table data that can change concurrently is more challenging. You need to start by defining desired behavior exactly.
I suggest a different approach for pagination:
SELECT *
FROM big_table
WHERE (vote, id) > (vote_x, id_x) -- ROW values
ORDER BY vote, id -- needs to be deterministic
LIMIT n;
Where vote_x and id_x are from the last row of the previous page (for both DESC and ASC). Or from the first if navigating backwards.
Comparing row values is supported by the index you already have - a feature that complies with the ISO SQL standard, but not every RDBMS supports it.
CREATE INDEX vote_order_asc ON big_table (vote, id);
Or for descending order:
SELECT *
FROM big_table
WHERE (vote, id) < (vote_x, id_x) -- ROW values
ORDER BY vote DESC, id DESC
LIMIT n;
Can use the same index.
I suggest you declare your columns NOT NULL or acquaint yourself with the NULLS FIRST|LAST construct:
PostgreSQL sort by datetime asc, null first?
Note two things in particular:
The ROW values in the WHERE clause cannot be replaced with separated member fields. WHERE (vote, id) > (vote_x, id_x) cannot be replaced with:
WHERE vote >= vote_x
AND id > id_x
That would rule out all rows with id <= id_x, while we only want to do that for the same vote and not for the next. The correct translation would be:
WHERE (vote = vote_x AND id > id_x) OR vote > vote_x
... which doesn't play along with indexes as nicely, and gets increasingly complicated for more columns.
Would be simple for a single column, obviously. That's the special case I mentioned at the outset.
The technique does not work for mixed directions in ORDER BY like:
ORDER BY vote ASC, id DESC
At least I can't think of a generic way to implement this as efficiently. If at least one of both columns is a numeric type, you could use a functional index with an inverted value on (vote, (id * -1)) - and use the same expression in ORDER BY:
ORDER BY vote ASC, (id * -1) ASC
Related:
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
Improve performance for order by with columns from many tables
Note in particular the presentation by Markus Winand I linked to:
"Pagination done the PostgreSQL way"
Have you tried partioning the table ?
Ease of management, improved scalability and availability, and a
reduction in blocking are common reasons to partition tables.
Improving query performance is not a reason to employ partitioning,
though it can be a beneficial side-effect in some cases. In terms of
performance, it is important to ensure that your implementation plan
includes a review of query performance. Confirm that your indexes
continue to appropriately support your queries after the table is
partitioned, and verify that queries using the clustered and
nonclustered indexes benefit from partition elimination where
applicable.
http://sqlperformance.com/2013/09/sql-indexes/partitioning-benefits