PostgreSQL - should I create an index only for the columns that don't have an index yet? - postgresql

If table_name has 3 columns (a, b, c), and I'm going to create an index on those 3 columns:
CREATE INDEX idx_table_name_a_b_c ON table_name (a,b,c);
But there's already an index on column a that I created previously:
CREATE INDEX idx_table_name_a ON table_name (a);
Should I create an index on only the other 2 columns, or on all 3 columns including column a (with the query above)?

Note that index considerations are only possible if you have a query. It never makes sense to index a table as such, but only to index a table for a query.
So let's assume that you have a query that would benefit from a three-column index, like
SELECT count(*) FROM table_name
WHERE a = 12 AND b = 42 AND c BETWEEN 7 AND 22;
The best option is to create that index and drop the existing one, because the three-column index can serve all purposes that the single-column index can (that is because a is the leading column in the index).
Such an index will lead to a single index-only scan on the table, which (if you have VACUUMed the table) is the most efficient way to execute the query.
The second best option is to create the two-column index you proposed and leave the single-column index on a.
Then the optimizer's strategy will depend on the distribution of values.
If the condition on a is selective enough, PostgreSQL will ignore your new index and just scan the one on a.
If the condition on b and c is selective, PostgreSQL will scan only your new index.
If all conditions together are not selective, PostgreSQL may choose a sequential scan of the table and ignore all your indexes.
If neither the condition on a nor the conditions on b and c together are selective, but all three conditions together are selective, PostgreSQL can opt to perform a bitmap index scan on both indexes and combine the result.
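The first option above could be tried out roughly as follows. This is only a sketch: the column types are assumed to be integer, and the plan you actually get depends on your data and PostgreSQL version.

```sql
-- Hypothetical setup mirroring the question (integer columns are an assumption).
CREATE TABLE table_name (a integer, b integer, c integer);
CREATE INDEX idx_table_name_a ON table_name (a);

-- Create the three-column index, then drop the single-column one it replaces.
CREATE INDEX idx_table_name_a_b_c ON table_name (a, b, c);
DROP INDEX idx_table_name_a;

-- VACUUM sets the visibility map (needed for efficient index-only scans);
-- ANALYZE refreshes the planner statistics.
VACUUM (ANALYZE) table_name;

-- Inspect the plan the optimizer actually chooses.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM table_name
WHERE a = 12 AND b = 42 AND c BETWEEN 7 AND 22;
```

On an empty or very small table the planner may still prefer a sequential scan, so test against realistic data.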

Related

Postgres Index to speed up LEFT OUTER JOIN

Within my db I have table prediction_fsd with about 5 million entries. The site table contains approx 3 million entries. I need to execute queries that look like
SELECT prediction_fsd.id AS prediction_fsd_id,
prediction_fsd.site_id AS prediction_fsd_site_id,
prediction_fsd.html_hash AS prediction_fsd_html_hash,
prediction_fsd.prediction AS prediction_fsd_prediction,
prediction_fsd.algorithm AS prediction_fsd_algorithm,
prediction_fsd.model_version AS prediction_fsd_model_version,
prediction_fsd.timestamp AS prediction_fsd_timestamp,
site_1.id AS site_1_id,
site_1.url AS site_1_url,
site_1.status AS site_1_status
FROM prediction_fsd
LEFT OUTER JOIN site AS site_1
ON site_1.id = prediction_fsd.site_id
WHERE 95806 = prediction_fsd.site_id
AND prediction_fsd.algorithm = 'xgboost'
ORDER BY prediction_fsd.timestamp DESC
LIMIT 1
At the moment this query takes about 4 seconds. I'd like to reduce that by introducing an index. Which tables and fields should I include in that index? I'm having trouble properly understanding the EXPLAIN ANALYZE output of Postgres.
CREATE INDEX prediction_fsd_site_id_algorithm_timestamp
ON public.prediction_fsd USING btree
(site_id, algorithm, "timestamp" DESC)
TABLESPACE pg_default;
By introducing a combined index as suggested by Frank Heikens I was able to bring down the query execution time to 0.25s
These three SQL lines point to a possible BTREE index to help you.
WHERE 95806 = prediction_fsd.site_id
AND prediction_fsd.algorithm = 'xgboost'
ORDER BY prediction_fsd.timestamp DESC
You're filtering the rows of the table by equality on two columns, and ordering by the third column. So try this index.
CREATE INDEX site_alg_ts ON prediction_fsd
(site_id, algorithm, timestamp DESC);
This BTREE index lets PostgreSQL random-access it to the first eligible row, which happens also to be the row you want with your ORDER BY ... LIMIT 1 clause.
The query plan in your question says that PostgreSQL did an expensive Parallel Sequential Scan on all five megarows of that table. This index will almost certainly change that to a cheap index lookup.
On the other table, it appears that you already look up rows in it via the primary key id. So you don't need any other index for that one.
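To confirm that the new index is used, you can look at the plan for the filtering and ordering part of the query on its own. This is a sketch using the table and column names from the question; the exact plan shape is not guaranteed.

```sql
-- Reduced form of the question's query: two equality filters, then
-- ORDER BY ... DESC LIMIT 1, which is what the index is built for.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, site_id, "timestamp"
FROM prediction_fsd
WHERE site_id = 95806
  AND algorithm = 'xgboost'
ORDER BY "timestamp" DESC
LIMIT 1;
```

If the plan still shows a Parallel Seq Scan on prediction_fsd, running ANALYZE prediction_fsd first may help the planner notice the index is worthwhile.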

Is it possible to index the position of an array column in PostgreSQL?

Let's say I want to find rows in the table my_table that have the value 5 at the first position of the array column my_array_column. To prepare the table, I executed the following statements:
CREATE TABLE my_table (
id serial primary key,
my_array_column integer[]
);
CREATE INDEX my_table_my_array_column_index on "my_table" USING GIN ("my_array_column");
SET enable_seqscan TO off;
INSERT INTO my_table (my_array_column) VALUES ('{5,7,10}');
Now, the query can look like this:
select * from my_table where my_array_column[1] = 5;
This works, but it doesn't use the created GIN index. Is it possible to search for the value 5 at a specific position with an index?
I want to find rows in the table my_table that have the value 5 at the first position of the array column
A partial index would be most efficient for that definition:
CREATE INDEX my_table_my_array_special_idx ON my_table ((true))
WHERE my_array_column[1] = 5;
If only a small fraction of rows qualifies, a partial index is accordingly smaller. Plus, the actual index column only occupies minimum space (typically 8 bytes). And, on top of that, Postgres 13 or later can apply index deduplication to make the index much smaller, yet.
Once the index is fully cached, its small size does not make it much faster, but still.
And most writes do not have to manipulate the index, which may be the most important benefit, depending on the workload.
Oh, and Postgres collects statistics for a partial index. So you can expect the query planner to make a fully educated choice when that index is involved.
Related:
PostgreSQL partial index unused when created on a table with existing data
Index that is not used, yet influences query
It's applicable when the query repeats the same condition.
Typically, you have something useful as index field on top of your declared purpose. But if you don't, just use any small constant - true in my example, but anything < 8 bytes is equally good.
Minor disclaimer: The "first position" in a Postgres array does not necessarily have index 1. If non-standard array indexes are possible, consider:
...
WHERE (my_array_column[:])[1] = 5;
In index and queries.
See:
Normalize array subscripts for 1-dimensional array so they start with 1
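The subscript caveat can be illustrated with an array literal that carries explicit bounds. This is a sketch with made-up values; the omitted-bound slice syntax [:] requires Postgres 9.6 or later.

```sql
SELECT ('[5:7]={5,7,10}'::int[])[1]      AS plain_subscript_1,  -- NULL: no element at subscript 1
       ('[5:7]={5,7,10}'::int[])[5]      AS plain_subscript_5,  -- 5: the first element sits at subscript 5
       (('[5:7]={5,7,10}'::int[])[:])[1] AS normalized_first;   -- 5: the slice is renumbered from 1
```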
You can index just the first position. You need an extra set of parentheses in the create statement to do that:
create index on my_table ((my_array_column[1]));
Or you could augment your query to work with your gin index, on the theory that an array can't have the first element be 5 unless at least one element is 5.
select * from my_table where my_array_column[1] = 5 and my_array_column @> ARRAY[5];
Of course this won't be very efficient if a lot of your arrays contain 5, but in some other spot in the array. It would have to recheck all of those "false matches" to eliminate them. So if you only care about the first element, the first index I showed is better. (Of course, if you only care about the first element, why use an array to start with?)
If you always look at the first position a regular B-Tree index will do:
create index on my_table ( (my_array_column[1]) );
If you don't know the position, then a GIN index is indeed needed, but you need to use an operator that is supported by a GIN index, e.g. the @> ("contains") operator. For that you need a different query:
select *
from my_table
where my_array_column @> array[5];
That would find all rows where the array column contains the value 5.
But you should heed the advice given in the manual regarding the use of arrays:
Arrays are not sets; searching for specific array elements can be a sign of database misdesign. Consider using a separate table with a row for each item that would be an array element. This will be easier to search, and is likely to scale better for a large number of elements.
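One possible shape for the separate table the manual suggests is sketched below. The table my_table_item and its columns pos and val are hypothetical names chosen for illustration, not something from the question.

```sql
-- Hypothetical normalized layout: one row per former array element.
CREATE TABLE my_table_item (
    my_table_id integer NOT NULL REFERENCES my_table (id),
    pos         integer NOT NULL,  -- position of the element in the former array
    val         integer NOT NULL,  -- the element itself
    PRIMARY KEY (my_table_id, pos)
);

-- A plain B-tree index serves "value val at position pos" lookups.
CREATE INDEX my_table_item_pos_val_idx ON my_table_item (pos, val);

-- Rows of my_table whose first element is 5.
SELECT t.*
FROM my_table AS t
JOIN my_table_item AS i ON i.my_table_id = t.id
WHERE i.pos = 1 AND i.val = 5;
```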

SQL Server: Optimizer intelligence with multiple indexes on the same table in a UNION ALL query

I'm trying to write a query for a rather large table (10 million+ rows would be a typical size), the result of which needs to be filtered on various predicates / conditions based on some business logic. My question: does the query optimizer (in SQL Server 2008+) attempt to use a single index for the whole query, or does it attempt to use different indexes on a per-query basis?
Consider the following:
--Use Index A
SELECT Set1
FROM ATable
WHERE AColumn = sarg-able value
UNION ALL
--Are we stuck with Index A?
SELECT Set2
FROM ATable
WHERE BColumn = sarg-able value
If we choose Index A for Set1, are we stuck with Index A for the entire query, or is the optimizer smart enough to use a different index for Set2 (assuming one exists)?
Everything @andreyNikolov said is 100% correct. This is the kind of thing you can easily figure out on your own by reviewing the Actual Execution Plan (not the Estimated Execution Plan). Note the following sample data, table and index structure:
USE tempdb -- safe place in Dev to test this kind of thing...
GO
-- sample data and indexes
IF OBJECT_ID('dbo.ATable','U') IS NOT NULL DROP TABLE dbo.ATable
CREATE TABLE dbo.ATable
(
Set1 INT NOT NULL,
Set2 INT NOT NULL,
AColumn INT NOT NULL,
BColumn INT NOT NULL
);
INSERT dbo.ATable (Set1, Set2, AColumn, BColumn)
VALUES (1,2,3,3),(1,2,4,4),(5,5,6,6),(11,22,40,40),(11,20,40,44),(11,22,14,4),(1,2,3,3);
CREATE NONCLUSTERED INDEX indexA ON dbo.ATable(AColumn) INCLUDE(Set1);
CREATE NONCLUSTERED INDEX indexB ON dbo.ATable(BColumn) INCLUDE(Set2);
Now run the following with "Include Actual Execution Plan" turned on.
SELECT Set1 --Use Index A
FROM dbo.ATable
WHERE AColumn = 3
UNION ALL
SELECT Set2 --Use Index B
FROM dbo.ATable
WHERE BColumn = 4;
... and the execution plan:
The query above the UNION ALL performs a nonclustered index seek against IndexA's key column (AColumn). Because I included Set1 as an include column on IndexA, IndexA can satisfy the query without a RID or Key lookup. This is how indexes should be designed. The same is true of the query below the UNION ALL, except that it uses IndexB.
Again, this is the kind of thing that is easy to figure out on your own once you have a full understanding of how to read the execution plans.
Yes, the optimizer is smart enough. These are two separate operations, each of which can be performed as a table/index scan or as a seek. The decision for each one is made independently, and it is perfectly normal to use a different index for each of them. The results of both operations are then combined.

Postgres multi-column index (integer, boolean, and array)

I have a Postgres 9.4 database with a table like this:
| id | other_id | current | dn_ids | rank |
|----|----------|---------|---------------------------------------|------|
| 1 | 5 | F | {123,234,345,456,111,222,333,444,555} | 1 |
| 2 | 7 | F | {123,100,200,900,800,700,600,400,323} | 2 |
(update) I already have a couple indexes defined. Here is the CREATE TABLE syntax:
CREATE TABLE mytable (
id integer NOT NULL,
other_id integer,
rank integer,
current boolean DEFAULT false,
dn_ids integer[] DEFAULT '{}'::integer[]
);
CREATE SEQUENCE mytable_id_seq START WITH 1 INCREMENT BY 1 NO MINVALUE NO MAXVALUE CACHE 1;
ALTER TABLE ONLY mytable ALTER COLUMN id SET DEFAULT nextval('mytable_id_seq'::regclass);
ALTER TABLE ONLY mytable ADD CONSTRAINT mytable_pkey PRIMARY KEY (id);
CREATE INDEX ind_dn_ids ON mytable USING gin (dn_ids);
CREATE INDEX index_mytable_on_current ON mytable USING btree (current);
CREATE INDEX index_mytable_on_other_id ON mytable USING btree (other_id);
CREATE INDEX index_mytable_on_other_id_and_current ON mytable USING btree (other_id, current);
I need to optimize queries like this:
SELECT id, dn_ids
FROM mytable
WHERE other_id = 5 AND current = F AND NOT (ARRAY[100,200] && dn_ids)
ORDER BY rank ASC
LIMIT 500 OFFSET 1000
This query works fine, but I'm sure it could be much faster with smart indexing. There are about 250,000 rows in the table and I always have current = F as a predicate. The input array I'm comparing to the stored array will have 1-9 integers, as well. The other_id can vary. But generally, before limiting, the scan will match between 0-25,000 rows.
Here's an example EXPLAIN:
Limit (cost=36944.53..36945.78 rows=500 width=65)
-> Sort (cost=36942.03..37007.42 rows=26156 width=65)
Sort Key: rank
-> Seq Scan on mytable (cost=0.00..35431.42 rows=26156 width=65)
Filter: ((NOT current) AND (NOT ('{-1,35257,35314}'::integer[] && dn_ids)) AND (other_id = 193))
Other answers on this site and the Postgres docs suggest it's possible to add a compound index to improve performance. I already have one on [other_id, current]. I've also read in various places that indexing can improve the performance of the ORDER BY in addition to the WHERE clause.
What's the right type of compound index to use for this query? I don't care about space at all.
Does it matter much how I order the terms in the WHERE clause?
What's the right type of compound index to use for this query? I don't care about space at all.
This depends on the complete situation. Either way, the GIN index you already have is most probably superior to a GiST index in your case:
Difference between GiST and GIN index
You can combine either with integer columns once you install the additional module btree_gin (or btree_gist, respectively).
Multicolumn index on 3 fields with heterogenous data types
However, that does not cover the boolean data type, which typically doesn't make sense as index column to begin with. With just two (three incl. NULL) possible values it's not selective enough.
And a plain btree index is more efficient for integer. While a multicolumn btree index on two integer columns would certainly help, you'll have to test carefully if combining (other_id, dn_ids) in a multicolumn GIN index is worth more than it costs. Probably not. Postgres can combine multiple indexes in a bitmap index scan rather efficiently.
Finally, while indexes can be used for sorted output, this will probably not pay to apply for a query like you display (unless you select large parts of the table).
Not applicable to updated question.
Partial indexes might be an option. Other than that, you already have all the indexes you need.
I would drop the pointless index on the boolean column current completely, and the index on just rank is probably never used for this query.
Does it matter much how I order the terms in the WHERE clause?
The order of WHERE conditions is completely irrelevant.
Addendum after question update
The utility of indexes is bound to selective criteria. If more than roughly 5 % (depends on various factors) of the table are selected, a sequential scan of the whole table is typically faster than dealing with the overhead on any indexes - except for pre-sorting output, that's the one thing an index is still good for in such cases.
For a query that fetches 25,000 of 250,000 rows, indexes are mostly just for that - which gets all the more interesting if you attach a LIMIT clause. Postgres can stop fetching rows from an index once the LIMIT is satisfied.
Be aware that Postgres always needs to read OFFSET + LIMIT rows, so performance deteriorates with the sum of both.
Even with your added information, much of what's relevant is still in the dark. I am going to assume that:
Your predicate NOT (ARRAY[100,200] && dn_ids) is not very selective. Ruling out 1 to 10 ID values should typically retain the majority of rows unless you have very few distinct elements in dn_ids.
The most selective predicate is other_id = 5.
A substantial part of the rows is eliminated with NOT current.
Aside: current = F isn't valid syntax in standard Postgres. Must be NOT current or current = FALSE;
While a GIN index would be great to identify few rows with matching arrays faster than any other index type, this seems hardly relevant for your query. My best guess is this partial, multicolumn btree index:
CREATE INDEX foo ON mytable (other_id, rank, dn_ids)
WHERE NOT current;
The array column dn_ids in a btree index cannot support the && operator, I just include it to allow index-only scans and filter rows before accessing the heap (the table). May even be faster without dn_ids in the index:
CREATE INDEX foo ON mytable (other_id, rank) WHERE NOT current;
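To see whether the planner picks up a partial index like this, the query's WHERE clause has to imply the index predicate. A sketch of the question's query written that way (whether the index is actually chosen depends on your data distribution):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, dn_ids
FROM   mytable
WHERE  other_id = 5
AND    NOT current                      -- matches the index predicate
AND    NOT (ARRAY[100,200] && dn_ids)   -- applied as a filter, not an index condition
ORDER  BY rank
LIMIT  500 OFFSET 1000;
```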
GiST indexes may become more interesting in Postgres 9.5 due to this new feature:
Allow GiST indexes to perform index-only scans (Anastasia Lubennikova,
Heikki Linnakangas, Andreas Karlsson)
Aside: current is a reserved word in standard SQL, even if it's allowed as identifier in Postgres.
Aside 2: I assume id is an actual serial column with the column default set. Just creating a sequence like you demonstrate, would do nothing.
Auto increment SQL function
Unfortunately I don't think you can combine a BTree and a GIN/GIST index into a single compound index, so the planner is going to have to choose between using the other_id index or the dn_ids index. One advantage of using other_id, as you pointed out, is that you could use a multicolumn index to improve the sort performance. The way you would do this would be
CREATE INDEX index_mytable_on_other_id_and_current
ON mytable (other_id, rank) WHERE current = FALSE;
This is using a partial index, and will allow you to skip the sort step when you are sorting by rank and querying on other_id.
Depending on the cardinality of other_id, the only benefit of this might be the sorting. Because your plan has a LIMIT clause, it's difficult to tell. Sequential scans can be the fastest option if you're reading more than about 1/5 of the table, especially if you're using a standard HDD instead of solid state. If your planner insists on a sequential scan when you know an index scan is faster (because you've tested with enable_seqscan set to false), you may want to try fine-tuning your random_page_cost or effective_cache_size.
Finally, I'd recommend not keeping all of these indexes. Find the ones you need, and cull the rest. Indexes cause huge performance degradation in inserts (especially multi-column and GIN/GiST indexes).
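The testing described above might look like the following session-local experiment. It is a sketch only: the value 1.1 for random_page_cost is an illustrative assumption for storage with cheap random reads, not a recommendation for your system.

```sql
-- Step 1: discourage sequential scans for this session and compare timings.
SET enable_seqscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, dn_ids
FROM mytable
WHERE other_id = 5 AND current = FALSE
ORDER BY rank
LIMIT 500 OFFSET 1000;
RESET enable_seqscan;

-- Step 2: if the index plan was faster, try a lower random_page_cost
-- (illustrative value) and check whether the planner now picks it unaided.
SET random_page_cost = 1.1;
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, dn_ids
FROM mytable
WHERE other_id = 5 AND current = FALSE
ORDER BY rank
LIMIT 500 OFFSET 1000;
RESET random_page_cost;
```

Both settings are changed with SET, so they affect only the current session and are undone by RESET.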
The simplest index for your query is mytable(other_id, current). This handles the first two conditions. This would be a normal b-tree type index.
You can satisfy the array condition using a GIST index on mytable(dn_ids).
However, I don't think you can mix the different data types in one index, at least not without extensions.

Why index seek and not a scan for the following setup in SQL Server 2005

I have created a table
create table #temp(a int, b int, c int)
I have 2 indexes on this table:
Non clustered non unique index on c
Clustered Index on a
When I try to execute the following query:
select b from #temp where c = 3
I see that the system goes for index scan. This is fine, because the non clustered index doesn't have b as the key value. Hence it does an index scan from column a.
But when I try to execute the below query:-
select b from #temp where c= 3 and a = 3
I see that the execute plan has got only index seek. No scan. Why is that?
Neither the clustered index nor the nonclustered index has b as one of its columns.
I was expecting an index scan.
Please clarify
If you have a as your clustering key, then that column is included in all non-clustered indices on that table.
So your index on c also includes a, so the condition
where c= 3 and a = 3
can be found in that index using an index seek. Most likely, the query optimizer decided that doing an index seek to find a and c and a key lookup to get the rest of the data is faster/more efficient here than using an index scan.
BTW: why did you expect / prefer an index scan over an index seek? The index seek typically is faster and uses a lot less resources - I would always strive to get index seeks over scans.
This is fine, because the non clustered index doesn't have b as the key value. Hence it does an index scan from column a.
This assumption is not right. Whether an index seek or an index scan is used depends on the WHERE clause, not on the SELECT list.
Now to your question:
The WHERE clause is optimized by the SQL Server optimizer, and because it contains the condition a = 3, the clustered index can be used.