Postgres Index to speed up LEFT OUTER JOIN

Postgres Index to speed up LEFT OUTER JOIN - postgresql

Within my db I have table prediction_fsd with about 5 million entries. The site table contains approx 3 million entries. I need to execute queries that look like
SELECT prediction_fsd.id AS prediction_fsd_id,
prediction_fsd.site_id AS prediction_fsd_site_id,
prediction_fsd.html_hash AS prediction_fsd_html_hash,
prediction_fsd.prediction AS prediction_fsd_prediction,
prediction_fsd.algorithm AS prediction_fsd_algorithm,
prediction_fsd.model_version AS prediction_fsd_model_version,
prediction_fsd.timestamp AS prediction_fsd_timestamp,
site_1.id AS site_1_id,
site_1.url AS site_1_url,
site_1.status AS site_1_status
FROM prediction_fsd
LEFT OUTER JOIN site AS site_1
ON site_1.id = prediction_fsd.site_id
WHERE 95806 = prediction_fsd.site_id
AND prediction_fsd.algorithm = 'xgboost'
ORDER BY prediction_fsd.timestamp DESC
LIMIT 1
at the moment this query takes about ~4 seconds. I'd like to reduce that by introducing an index. Which tables and fields should I include in that index. I'm having troubles properly understanding the EXPLAIN ANALYZE output of Postgres

CREATE INDEX prediction_fsd_site_id_algorithm_timestamp
ON public.prediction_fsd USING btree
(site_id, algorithm, "timestamp" DESC)
TABLESPACE pg_default;
By introducing a combined index as suggested by Frank Heikens I was able to bring down the query execution time to 0.25s

These three SQL lines point to a possible BTREE index to help you.
WHERE 95806 = prediction_fsd.site_id
AND prediction_fsd.algorithm = 'xgboost'
ORDER BY prediction_fsd.timestamp DESC
You're filtering the rows of the table by equality on two columns, and ordering by the third column. So try this index.
CREATE INDEX site_alg_ts ON prediction_fsd
(site_id, algorithm, timestamp DESC);
This BTREE index lets PostgreSQL random-access it to the first eligible row, which happens also to be the row you want with your ORDER BY ... LIMIT 1 clause.
The query plan in your question says that PostgreSQL did an expensive Parallel Sequential Scan on all five megarows of that table. This index will almost certainly change that to a cheap index lookup.
On the other table, it appears that you already look up rows in it via the primary key id. So you don't need any other index for that one.

Related

Postgres: Optimization of query with simple where clause

I have a table with the following columns:
ID (VARCHAR)
CUSTOMER_ID (VARCHAR)
STATUS (VARCHAR) (4 different status possible)
other not relevant columns
I try to find all the lines with customer_id = and status = two different status.
The query looks like:
SELECT *
FROM my_table
WHERE customer_id = '12345678' AND status IN ('STATUS1', 'STATUS2');
The table contains about 1 mio lines. I added two indexes on customer_id and status. The query still needs about 1 second to run.
The explain plan is:
1. Gather
2. Seq Scan on my_table
Filter: (((status)::text = ANY ('{SUBMITTED,CANCELLED}'::text[])) AND ((customer_id)::text = '12345678'::text))
I ran the 'analyze my_table' after creating the indexes. What could I do to improve the performance of this quite simple query?

You need a compound (multi-column) index to help satisfy your query.
Guessing, it seems like the most selective column (the one with the most distinct values) is customer_id. status probably has only a few distinct values. So customer_id should go first in the index. Try this.
ALTER TABLE my_table ADD INDEX customer_id_status (customer_id, status);
This creates a BTREE index. A useful mental model for such an index is an old-fashioned telephone book. It's sorted in order. You look up the first matching entry in the index, then scan it sequentially for the items you want.
You may also want to try running ANALYZE my_table; to update the statistics (about selectivity) used by the query planner to choose an appropriate index.
Pro tip Avoid SELECT * if you can. Instead name the columns you want. This can help performance a lot.
Pro tip Your question said some of your columns aren't relevant to query optimization. That's probably not true; index design is a weird art. SELECT * makes it less true.

Does the index column order matter on row insert in Postgresql?

I have a table similar to this one:
create table request_journal
(
id bigint,
request_body text,
request_date timestamp,
user_id bigint,
);
It is used for request logging purposes, so frequent inserts in it are expected (2k+ rps).
I want to create composite index on columns request_date and user_id to speed up execution of select queries like this:
select *
from request_journal
where request_date between '2021-07-08 10:00:00' and '2021-07-08 16:00:00'
and user_id = 123
order by request_date desc;
I tested select queries with (request_date desc, user_id) btree index and (user_id, request_date desc) btree index. With request_date leading column index select queries are executed about 10% faster, but in general performance of any of this indexes is acceptable.
So my question is does the index column order affect insertion time? I have not spotted any differences using EXPLAIN/EXPLAIN ANALYZE on insert query. Which index will be more build time efficient under "high load"?

It is hard to believe your test were done on any vaguely realistic data size.
At the rate you indicate, a 6 hour range would include over 43 million records. If the user_ids are spread evenly over 1e6 different values, I get the index leading with user_id to be a thousand times faster for that query than the one leading with request_date.
But anyway, for loading new data, assuming the new data is all from recent times, then the one with request_date should be faster as the part of the index needing maintenance while loading will be more concentrated in part of the index, and so better cached. But this would depend on how much RAM you have, what your disk system is like, and how many distinct user_ids you are loading data for.

Slow Postgres Query when using GROUP BY

I have table in postgres database. After the row reach 2 mio rows, the query become slower. This is my query
SELECT
c.source,
c.destination,
c.product_id,
sum(c.weight),
count(c.weight),
c.owner_id
FROM stock c
GROUP BY c.source, c.destination, c.product_id, c.owner_id;
I already add index
CREATE INDEX stock_custom_idx ON public.stock USING btree (source, destination, product_id, owner_id)
The query is very slow, so I do explain analyze, but the index not called.
So, how to optimize this query, because its take too long time and not return data?

Try this index:
CREATE INDEX better_index ON public.stock USING btree
(source, destination, product_id, owner_id, weight);
If you do not include the weight, this information still needs to be fetched from the table, thus you have a full table scan.
With the new index, you should have an index only scan. Also, the query planner can make use of the sorting order of the index for the grouping (just as it could have done with your index).
In newer versions of PostgreSQL, there would also exist the INCLUDE clause, where you can "add" columns to an index without this having any impact on the sorting order (the data is there, but this part of the data is not sorted). This would make the index yet another bit more performant for your query, I guess.

Postgresql 9.4 slow [duplicate]

I have table
create table big_table (
id serial primary key,
-- other columns here
vote int
);
This table is very big, approximately 70 million rows, I need to query:
SELECT * FROM big_table
ORDER BY vote [ASC|DESC], id [ASC|DESC]
OFFSET x LIMIT n -- I need this for pagination
As you may know, when x is a large number, queries like this are very slow.
For performance optimization I added indexes:
create index vote_order_asc on big_table (vote asc, id asc);
and
create index vote_order_desc on big_table (vote desc, id desc);
EXPLAIN shows that the above SELECT query uses these indexes, but it's very slow anyway with a large offset.
What can I do to optimize queries with OFFSET in big tables? Maybe PostgreSQL 9.5 or even newer versions have some features? I've searched but didn't find anything.

A large OFFSET is always going to be slow. Postgres has to order all rows and count the visible ones up to your offset. To skip all previous rows directly you could add an indexed row_number to the table (or create a MATERIALIZED VIEW including said row_number) and work with WHERE row_number > x instead of OFFSET x.
However, this approach is only sensible for read-only (or mostly) data. Implementing the same for table data that can change concurrently is more challenging. You need to start by defining desired behavior exactly.
I suggest a different approach for pagination:
SELECT *
FROM big_table
WHERE (vote, id) > (vote_x, id_x) -- ROW values
ORDER BY vote, id -- needs to be deterministic
LIMIT n;
Where vote_x and id_x are from the last row of the previous page (for both DESC and ASC). Or from the first if navigating backwards.
Comparing row values is supported by the index you already have - a feature that complies with the ISO SQL standard, but not every RDBMS supports it.
CREATE INDEX vote_order_asc ON big_table (vote, id);
Or for descending order:
SELECT *
FROM big_table
WHERE (vote, id) < (vote_x, id_x) -- ROW values
ORDER BY vote DESC, id DESC
LIMIT n;
Can use the same index.
I suggest you declare your columns NOT NULL or acquaint yourself with the NULLS FIRST|LAST construct:
PostgreSQL sort by datetime asc, null first?
Note two things in particular:
The ROW values in the WHERE clause cannot be replaced with separated member fields. WHERE (vote, id) > (vote_x, id_x) cannot be replaced with:
WHERE vote >= vote_x
AND id > id_x
That would rule out all rows with id <= id_x, while we only want to do that for the same vote and not for the next. The correct translation would be:
WHERE (vote = vote_x AND id > id_x) OR vote > vote_x
... which doesn't play along with indexes as nicely, and gets increasingly complicated for more columns.
Would be simple for a single column, obviously. That's the special case I mentioned at the outset.
The technique does not work for mixed directions in ORDER BY like:
ORDER BY vote ASC, id DESC
At least I can't think of a generic way to implement this as efficiently. If at least one of both columns is a numeric type, you could use a functional index with an inverted value on (vote, (id * -1)) - and use the same expression in ORDER BY:
ORDER BY vote ASC, (id * -1) ASC
Related:
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
Improve performance for order by with columns from many tables
Note in particular the presentation by Markus Winand I linked to:
"Pagination done the PostgreSQL way"

Have you tried partioning the table ?
Ease of management, improved scalability and availability, and a
reduction in blocking are common reasons to partition tables.
Improving query performance is not a reason to employ partitioning,
though it can be a beneficial side-effect in some cases. In terms of
performance, it is important to ensure that your implementation plan
includes a review of query performance. Confirm that your indexes
continue to appropriately support your queries after the table is
partitioned, and verify that queries using the clustered and
nonclustered indexes benefit from partition elimination where
applicable.
http://sqlperformance.com/2013/09/sql-indexes/partitioning-benefits

Postgres multi-column index (integer, boolean, and array)

I have a Postgres 9.4 database with a table like this:
| id | other_id | current | dn_ids | rank |
|----|----------|---------|---------------------------------------|------|
| 1 | 5 | F | {123,234,345,456,111,222,333,444,555} | 1 |
| 2 | 7 | F | {123,100,200,900,800,700,600,400,323} | 2 |
(update) I already have a couple indexes defined. Here is the CREATE TABLE syntax:
CREATE TABLE mytable (
id integer NOT NULL,
other_id integer,
rank integer,
current boolean DEFAULT false,
dn_ids integer[] DEFAULT '{}'::integer[]
);
CREATE SEQUENCE mytable_id_seq START WITH 1 INCREMENT BY 1 NO MINVALUE NO MAXVALUE CACHE 1;
ALTER TABLE ONLY mytable ALTER COLUMN id SET DEFAULT nextval('mytable_id_seq'::regclass);
ALTER TABLE ONLY mytable ADD CONSTRAINT mytable_pkey PRIMARY KEY (id);
CREATE INDEX ind_dn_ids ON mytable USING gin (dn_ids);
CREATE INDEX index_mytable_on_current ON mytable USING btree (current);
CREATE INDEX index_mytable_on_other_id ON mytable USING btree (other_id);
CREATE INDEX index_mytable_on_other_id_and_current ON mytable USING btree (other_id, current);
I need to optimize queries like this:
SELECT id, dn_ids
FROM mytable
WHERE other_id = 5 AND current = F AND NOT (ARRAY[100,200] && dn_ids)
ORDER BY rank ASC
LIMIT 500 OFFSET 1000
This query works fine, but I'm sure it could be much faster with smart indexing. There are about 250,000 rows in the table and I always have current = F as a predicate. The input array I'm comparing to the stored array will have 1-9 integers, as well. The other_id can vary. But generally, before limiting, the scan will match between 0-25,000 rows.
Here's an example EXPLAIN:
Limit (cost=36944.53..36945.78 rows=500 width=65)
-> Sort (cost=36942.03..37007.42 rows=26156 width=65)
Sort Key: rank
-> Seq Scan on mytable (cost=0.00..35431.42 rows=26156 width=65)
Filter: ((NOT current) AND (NOT ('{-1,35257,35314}'::integer[] && dn_ids)) AND (other_id = 193))
Other answers on this site and the Postgres docs suggest it's possible to add a compound index to improve performance. I already have one on [other_id, current]. I've also read in various places that indexing can improve the performance of the ORDER BY in addition to the WHERE clause.
What's the right type of compound index to use for this query? I don't care about space at all.
Does it matter much how I order the terms in the WHERE clause?

What's the right type of compound index to use for this query? I don't care about space at all.
This depends on the complete situation. Either way, the GIN index you already have is most probably superior to a GiST index in your case:
Difference between GiST and GIN index
You can combine either with integer columns once you install the additional module btree_gin (or btree_gist, respectively).
Multicolumn index on 3 fields with heterogenous data types
However, that does not cover the boolean data type, which typically doesn't make sense as index column to begin with. With just two (three incl. NULL) possible values it's not selective enough.
And a plain btree index is more efficient for integer. While a multicolumn btree index on two integer columns would certainly help, you'll have to test carefully if combining (other_id, dn_ids) in a multicolumn GIN index is worth more than it costs. Probably not. Postgres can combine multiple indexes in a bitmap index scan rather efficiently.
Finally, while indexes can be used for sorted output, this will probably not pay to apply for a query like you display (unless you select large parts of the table).
Not applicable to updated question.
Partial indexes might be an option. Other than that, you already have all the indexes you need.
I would drop the pointless index on the boolean column current completely, and the index on just rank is probably never used for this query.
Does it matter much how I order the terms in the WHERE clause?
The order of WHERE conditions is completely irrelevant.
Addendum after question update
The utility of indexes is bound to selective criteria. If more than roughly 5 % (depends on various factors) of the table are selected, a sequential scan of the whole table is typically faster than dealing with the overhead on any indexes - except for pre-sorting output, that's the one thing an index is still good for in such cases.
For a query that fetches 25,000 of 250,000 rows, indexes are mostly just for that - which gets all the more interesting if you attach a LIMIT clause. Postgres can stop fetching rows from an index once the LIMIT is satisfied.
Be aware that Postgres always needs to read OFFSET + LIMIT rows, so performance deteriorate with the sum of both.
Even with your added information, much of what's relevant is still in the dark. I am going to assume that:
Your predicate NOT (ARRAY[100,200] && dn_ids) is not very selective. Ruling out 1 to 10 ID values should typically retain the majority of rows unless you have very few distinct elements in dn_ids.
The most selective predicate is other_id = 5.
A substantial part of the rows is eliminated with NOT current.
Aside: current = F isn't valid syntax in standard Postgres. Must be NOT current or current = FALSE;
While a GIN index would be great to identify few rows with matching arrays faster than any other index type, this seems hardly relevant for your query. My best guess is this partial, multicolumn btree index:
CREATE INDEX foo ON mytable (other_id, rank, dn_ids)
WHERE NOT current;
The array column dn_ids in a btree index cannot support the && operator, I just include it to allow index-only scans and filter rows before accessing the heap (the table). May even be faster without dn_ids in the index:
CREATE INDEX foo ON mytable (other_id, rank) WHERE NOT current;
GiST indexes may become more interesting in Postgres 9.5 due to this new feature:
Allow GiST indexes to perform index-only scans (Anastasia Lubennikova,
Heikki Linnakangas, Andreas Karlsson)
Aside: current is a reserved word in standard SQL, even if it's allowed as identifier in Postgres.
Aside 2: I assume id is an actual serial column with the column default set. Just creating a sequence like you demonstrate, would do nothing.
Auto increment SQL function

Unfortunately I don't think you can combine a BTree and a GIN/GIST index into a single compound index, so the planner is going to have to choose between using the other_id index or the dn_ids index. One advantage of using other_id, as you pointed out, is that you could use a multicolumn index to improve the sort performance. The way you would do this would be
CREATE INDEX index_mytable_on_other_id_and_current
ON mytable (other_id, rank) WHERE current = F;
This is using a partial index, and will allow you to skip the sort step when you are sorting by rank and querying on other_id.
Depending on the cardinality of other_id, the only benefit of this might be the sorting. Because your plan has a LIMIT clause, it's difficult to tell. SEQ scans can be the fastest option if you're using > 1/5 of the table, especially if you're using a standard HDD instead of solid state. If you're planner insists on SEQ scanning when you know an IDX scan is faster (you've tested with enable_seqscan false, you may want to try fine tuning your random_page_cost or effective_cache_size.
Finally, I'd recomment not keeping all of these indexes. Find the ones you need, and cull the rest. Indexes cause huge performance degradation in inserts (especially mutli-column and GIN/GIST indexes).

The simplest index for your query is mytable(other_id, current). This handles the first two conditions. This would be a normal b-tree type index.
You can satisfy the array condition using a GIST index on mytable(dn_ids).
However, I don't think you can mix the different data types in one index, at least not without extensions.