Why index seek and not a scan for the following setup in SQL Server 2005 - tsql

I have created a table
create table #temp(a int, b int, c int)
I have 2 indexes on this table:
Non clustered non unique index on c
Clustered Index on a
When I try to execute the following query:
select b from #temp where c = 3
I see that the system goes for an index scan. This is fine, because the nonclustered index doesn't have b as a key column, so it falls back to scanning the clustered index on a.
But when I try to execute the below query:-
select b from #temp where c = 3 and a = 3
I see that the execution plan shows only an index seek, no scan. Why is that?
Neither the clustered index nor the nonclustered index has b as one of its columns.
I was expecting an index scan.
Please clarify.

If you have a as your clustering key, then that column is included in all non-clustered indices on that table.
So your index on c also includes a, so the condition
where c = 3 and a = 3
can be satisfied by that index using an index seek. Most likely, the query optimizer decided that doing an index seek to find a and c plus a key lookup to fetch the rest of the data is faster/more efficient here than an index scan.
BTW: why did you expect / prefer an index scan over an index seek? An index seek typically is faster and uses far fewer resources - I would always strive to get index seeks over scans.
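The mechanics can be sketched in SQLite (via Python's sqlite3), used here purely as an analogue: in a SQLite rowid table, an INTEGER PRIMARY KEY column behaves like the clustering key, and every secondary index implicitly carries it, much as SQL Server appends the clustering key to nonclustered indexes. Table and column names mirror the question; the plan text is SQLite's, with SEARCH playing the role of a seek and SCAN of a scan.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# a INTEGER PRIMARY KEY acts like the clustered key: every secondary
# index entry implicitly carries it (as the rowid).
con.execute("CREATE TABLE tmp(a INTEGER PRIMARY KEY, b INT, c INT)")
con.execute("CREATE INDEX idx_c ON tmp(c)")

def plan(sql):
    """Concatenate the detail column of EXPLAIN QUERY PLAN output."""
    return " ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

# With predicates on both a and c, every filtered column is an index key
# somewhere, so the planner can seek (SEARCH) instead of scanning (SCAN):
print(plan("SELECT b FROM tmp WHERE c = 3 AND a = 3"))
```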

This is fine, because the non clustered index doesn't have b as the key value. Hence it does an index scan from column a.
That assumption is not right: whether you get an index seek or a scan is determined by the WHERE clause, not by the SELECT list.
Now to your question:
The WHERE clause is optimized by the SQL optimizer, and since there is an a = 3 condition, the clustered index can be applied.

Postgres Index to speed up LEFT OUTER JOIN

Within my db I have a table prediction_fsd with about 5 million entries. The site table contains approx. 3 million entries. I need to execute queries that look like
SELECT prediction_fsd.id AS prediction_fsd_id,
prediction_fsd.site_id AS prediction_fsd_site_id,
prediction_fsd.html_hash AS prediction_fsd_html_hash,
prediction_fsd.prediction AS prediction_fsd_prediction,
prediction_fsd.algorithm AS prediction_fsd_algorithm,
prediction_fsd.model_version AS prediction_fsd_model_version,
prediction_fsd.timestamp AS prediction_fsd_timestamp,
site_1.id AS site_1_id,
site_1.url AS site_1_url,
site_1.status AS site_1_status
FROM prediction_fsd
LEFT OUTER JOIN site AS site_1
ON site_1.id = prediction_fsd.site_id
WHERE 95806 = prediction_fsd.site_id
AND prediction_fsd.algorithm = 'xgboost'
ORDER BY prediction_fsd.timestamp DESC
LIMIT 1
At the moment this query takes about 4 seconds. I'd like to reduce that by introducing an index. Which tables and fields should I include in that index? I'm having trouble properly understanding the EXPLAIN ANALYZE output of Postgres.
CREATE INDEX prediction_fsd_site_id_algorithm_timestamp
ON public.prediction_fsd USING btree
(site_id, algorithm, "timestamp" DESC)
TABLESPACE pg_default;
By introducing a combined index as suggested by Frank Heikens, I was able to bring the query execution time down to 0.25 s.
These three SQL lines point to a possible BTREE index to help you.
WHERE 95806 = prediction_fsd.site_id
AND prediction_fsd.algorithm = 'xgboost'
ORDER BY prediction_fsd.timestamp DESC
You're filtering the rows of the table by equality on two columns, and ordering by the third column. So try this index.
CREATE INDEX site_alg_ts ON prediction_fsd
(site_id, algorithm, timestamp DESC);
This BTREE index lets PostgreSQL jump straight to the first eligible row, which happens to be exactly the row you want given your ORDER BY ... LIMIT 1 clause.
The query plan in your question says that PostgreSQL did an expensive Parallel Sequential Scan over all five million rows of that table. This index will almost certainly change that to a cheap index lookup.
On the other table, it appears that you already look up rows in it via the primary key id. So you don't need any other index for that one.
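The shape of the win can be illustrated with a runnable sketch, here in SQLite via Python's sqlite3 rather than Postgres (so the plan wording differs, but the principle is the same): without the index the planner scans and sorts; with it, it seeks and reads rows already in timestamp order.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE prediction_fsd(
    id INTEGER PRIMARY KEY, site_id INT, algorithm TEXT, timestamp TEXT)""")

def plan(sql):
    return " ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

q = """SELECT * FROM prediction_fsd
       WHERE site_id = 95806 AND algorithm = 'xgboost'
       ORDER BY timestamp DESC LIMIT 1"""

before = plan(q)  # full scan plus a temp b-tree for the ORDER BY

con.execute("""CREATE INDEX site_alg_ts ON prediction_fsd
               (site_id, algorithm, timestamp DESC)""")
after = plan(q)   # index seek; rows arrive already in timestamp order
print(before)
print(after)
```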

Why can PostgreSQL efficiently use the index in this query?

Given the following ddl
CREATE TABLE test
(
c1 SMALLINT NOT NULL,
c2 INTERVAL NOT NULL,
c3 TIMESTAMP NOT NULL,
c4 VARCHAR NOT NULL,
PRIMARY KEY (c1, c2, c3)
);
CREATE INDEX test_index ON test (c3, c2);
The following query
SELECT *
FROM test
WHERE c2 = '1 minute'::INTERVAL
ORDER BY c3
LIMIT 1000
gives the following query plan in PostgreSQL 13.3
Limit (cost=0.43..49.92 rows=1000 width=60)
-> Index Scan using test_index on test (cost=0.43..316739.07 rows=6400526 width=60)
Index Cond: (c2 = '00:01:00'::interval)
Considering that test_index has its columns in the order (c3, c2), why can Postgres efficiently filter by c2 and sort by c3 using this index? From my understanding, columns that appear in ORDER BY must come last in the index definition, otherwise the index will not be used. It also works the same with ORDER BY c3 DESC.
Without the actual run statistics (EXPLAIN ANALYZE), we don't know that it is efficient. We only know that the planner thinks it is more efficient than the alternatives.
By addressing the rows already in the desired order, it can filter out the ones that fail the c2 condition, then stop once it accumulates 1000 which pass the condition. It thinks it will accomplish this after reading only about 1/6000 of the index.
The plan doesn't explicitly say that the index is being used to provide ordering. We can infer that based on the absence of a Sort node. PostgreSQL knows how to follow an index in either direction, which is why the index also works if the order is DESC.
Whatever efficiency this does have mostly comes from stopping early and avoiding the sort. The filtering on c2 = '00:01:00'::interval is not very efficient. It can't jump to the part of the index where it knows that condition to be true; rather, it has to scan the index and individually assess the index tuples to filter them. But at least it can apply the filter to the index tuple without needing to visit the table tuple, which can save a lot of random IO. (I think it would be a good idea if the plan somehow distinguished jump-to index usage from in-index filtering usage, but that is easier said than done.)
An index just on c3 could still be read in order and stop early, but it would have to visit the table for every tuple, even the ones that end up failing on c2. The better index would be on (c2, c3). That way it can jump to the part of the index satisfying the c2 condition, and then read in order by c3 within just that part.
It may not be efficient.
However, the optimizer chose to filter on the index.
That means it will read index entries that are sorted according to the expected ordering, but not all of them will be useful. That's why it added the filtering predicate c2 = '00:01:00'::interval on the index scan.
Maybe the cost of the index scan that discards entries is still lower than a table scan, especially considering it will keep 1000 rows at the most.
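The recommended (c2, c3) column order can be demonstrated with a small SQLite sketch (via Python's sqlite3; SQLite has no interval or timestamp types, so TEXT stands in for both): equality on the leading column plus ORDER BY on the second column lets the planner seek and skip the sort step entirely.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# SQLite has neither INTERVAL nor TIMESTAMP types; TEXT stands in here.
con.execute("""CREATE TABLE test(
    c1 INT, c2 TEXT NOT NULL, c3 TEXT NOT NULL, c4 TEXT)""")
con.execute("CREATE INDEX c2_c3 ON test(c2, c3)")

def plan(sql):
    return " ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

# Equality on the leading column, ordering on the second: the planner
# seeks to c2 = ... and reads rows already sorted by c3 - no sort step.
p = plan("SELECT * FROM test WHERE c2 = '1 minute' ORDER BY c3 LIMIT 1000")
print(p)
```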

PostgreSQL - should I only create an index for the rest of the columns that don't have an index yet?

If table_name has 3 columns (a, b, c), and I'm going to create an index on those 3 columns:
CREATE INDEX idx_table_name_a_b_c ON table_name (a,b,c);
But there's already an index on column a that I previously created:
CREATE INDEX idx_table_name_a ON table_name (a);
Should I create an index only for the other 2 columns, or one for all 3 columns that also includes column a (with the above query)?
Note that index considerations are only possible if you have a query. It never makes sense to index a table as such, but only to index a table for a query.
So let's assume that you have a query that would benefit from a three-column index, like
SELECT count(*) FROM table_name
WHERE a = 12 AND b = 42 AND c BETWEEN 7 AND 22;
The best option is to create that index and drop the existing one, because the three-column index can serve all purposes that the single-column index can (that is because a is the leading column in the index).
Such an index will lead to a single index-only scan on the table, which (if you have VACUUMed the table) is the most efficient way to execute the query.
The second best option is to create the two-column index you proposed and leave the single-column index on a.
Then the optimizer's strategy will depend on the distribution of values.
If the condition on a is selective enough, PostgreSQL will ignore your new index and just scan the one on a.
If the condition on b and c is selective, PostgreSQL will scan only your new index.
If all conditions together are not selective, PostgreSQL may choose a sequential scan of the table and ignore all your indexes.
If neither the condition on a nor the conditions on b and c together are selective, but all three conditions together are selective, PostgreSQL can opt to perform a bitmap index scan on both indexes and combine the result.
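The first point - that the three-column index makes the single-column index redundant because a is its leading column - can be checked with a quick SQLite sketch (via Python's sqlite3; names mirror the question):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table_name(a INT, b INT, c INT)")
con.execute("CREATE INDEX idx_table_name_a_b_c ON table_name(a, b, c)")

def plan(sql):
    return " ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

# The three-column index serves the full predicate ...
full = plan("SELECT count(*) FROM table_name "
            "WHERE a = 12 AND b = 42 AND c BETWEEN 7 AND 22")
# ... and, since a is its leading column, a query on a alone as well,
# which is what makes a separate single-column index on a redundant.
a_only = plan("SELECT count(*) FROM table_name WHERE a = 12")
print(full)
print(a_only)
```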

Why doesn't PostgreSQL use indexes on "WHERE NOT IN" conditions?

I have two tables db100 and db60 with the same fields: x, y, z.
Indexes are created for both the tables on field z like this:
CREATE INDEX db100_z_idx
ON db100
USING btree
(z COLLATE pg_catalog."default");
CREATE INDEX db60_z_idx
ON db60
USING btree
(z COLLATE pg_catalog."default");
Trying to find z values from db60 that don't exist in db100:
select db60.z from db60 where db60.z not in (select db100.z from db100)
As far as I understand, all the information required to execute the query is present in the indexes, so I would expect only the indexes to be used.
However, it uses sequential scans on the tables instead:
"Seq Scan on db60 (cost=0.00..25951290012.84 rows=291282 width=4)"
" Filter: (NOT (SubPlan 1))"
" SubPlan 1"
" -> Materialize (cost=0.00..80786.26 rows=3322884 width=4)"
" -> Seq Scan on db100 (cost=0.00..51190.84 rows=3322884 width=4)"
Can someone please explain why PostgreSQL doesn't use the indexes in this example?
Both tables contain a few million records and execution takes a while.
I know that using a left join with an "is null" condition gives better results. However, the question is about this particular syntax.
I'm on PG v9.5
SubPlan 1 is for select db100.z from db100. You select all rows, and hence an index is useless. You really want SELECT DISTINCT z FROM db100 here, and then the index should be used.
In the main query you have select db60.z from db60 where db60.z not in .... Again, you select all rows except where a condition is not true, so again the index does not apply because it applies to the inverse condition.
In general, an index is only used if the planner thinks that such a use will speed up the query processing. It always depends on how many distinct values there are and how the rows are distributed over the physical pages on disk. An index to search for all rows having a column with a certain value is not the same as finding the rows that do not have that same value; the index indicates on which pages and at which locations to find the rows, but that set can not simply be inversed.
Given that - in your case - z is some text type, a meaningful "negative" index cannot be constructed (this is almost a truism, although in some cases a "negative" index could be conceivable). You should look into trigram indexes, as these tend to work much faster than btree for text indexing.
Do you really want to extract all 291,282 rows with a matching z value, or could you use a DISTINCT clause here too? That should speed things up quite a bit.
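For completeness, the anti-join rewrite the asker mentions can be sketched like this (SQLite via Python's sqlite3, with toy data of my own invention). Note the two forms are only equivalent because z is declared NOT NULL here; if db100.z could contain NULLs, NOT IN would return no rows at all.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE db60(z TEXT NOT NULL);
CREATE TABLE db100(z TEXT NOT NULL);
CREATE INDEX db100_z_idx ON db100(z);
INSERT INTO db60 VALUES ('a'), ('b'), ('c');
INSERT INTO db100 VALUES ('b'), ('x');
""")

not_in = [r[0] for r in con.execute(
    "SELECT z FROM db60 WHERE z NOT IN (SELECT z FROM db100)")]
# The anti-join form gives the planner a per-row probe that the
# index on db100.z can answer directly:
not_exists = [r[0] for r in con.execute(
    "SELECT z FROM db60 WHERE NOT EXISTS "
    "(SELECT 1 FROM db100 WHERE db100.z = db60.z)")]
print(not_in, not_exists)  # both: ['a', 'c']
```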

Postgres multi-column index (integer, boolean, and array)

I have a Postgres 9.4 database with a table like this:
| id | other_id | current | dn_ids | rank |
|----|----------|---------|---------------------------------------|------|
| 1 | 5 | F | {123,234,345,456,111,222,333,444,555} | 1 |
| 2 | 7 | F | {123,100,200,900,800,700,600,400,323} | 2 |
(update) I already have a couple indexes defined. Here is the CREATE TABLE syntax:
CREATE TABLE mytable (
id integer NOT NULL,
other_id integer,
rank integer,
current boolean DEFAULT false,
dn_ids integer[] DEFAULT '{}'::integer[]
);
CREATE SEQUENCE mytable_id_seq START WITH 1 INCREMENT BY 1 NO MINVALUE NO MAXVALUE CACHE 1;
ALTER TABLE ONLY mytable ALTER COLUMN id SET DEFAULT nextval('mytable_id_seq'::regclass);
ALTER TABLE ONLY mytable ADD CONSTRAINT mytable_pkey PRIMARY KEY (id);
CREATE INDEX ind_dn_ids ON mytable USING gin (dn_ids);
CREATE INDEX index_mytable_on_current ON mytable USING btree (current);
CREATE INDEX index_mytable_on_other_id ON mytable USING btree (other_id);
CREATE INDEX index_mytable_on_other_id_and_current ON mytable USING btree (other_id, current);
I need to optimize queries like this:
SELECT id, dn_ids
FROM mytable
WHERE other_id = 5 AND current = F AND NOT (ARRAY[100,200] && dn_ids)
ORDER BY rank ASC
LIMIT 500 OFFSET 1000
This query works fine, but I'm sure it could be much faster with smart indexing. There are about 250,000 rows in the table and I always have current = F as a predicate. The input array I'm comparing to the stored array will have 1-9 integers, as well. The other_id can vary. But generally, before limiting, the scan will match between 0-25,000 rows.
Here's an example EXPLAIN:
Limit (cost=36944.53..36945.78 rows=500 width=65)
-> Sort (cost=36942.03..37007.42 rows=26156 width=65)
Sort Key: rank
-> Seq Scan on mytable (cost=0.00..35431.42 rows=26156 width=65)
Filter: ((NOT current) AND (NOT ('{-1,35257,35314}'::integer[] && dn_ids)) AND (other_id = 193))
Other answers on this site and the Postgres docs suggest it's possible to add a compound index to improve performance. I already have one on [other_id, current]. I've also read in various places that indexing can improve the performance of the ORDER BY in addition to the WHERE clause.
What's the right type of compound index to use for this query? I don't care about space at all.
Does it matter much how I order the terms in the WHERE clause?
What's the right type of compound index to use for this query? I don't care about space at all.
This depends on the complete situation. Either way, the GIN index you already have is most probably superior to a GiST index in your case:
Difference between GiST and GIN index
You can combine either with integer columns once you install the additional module btree_gin (or btree_gist, respectively).
Multicolumn index on 3 fields with heterogenous data types
However, that does not cover the boolean data type, which typically doesn't make sense as index column to begin with. With just two (three incl. NULL) possible values it's not selective enough.
And a plain btree index is more efficient for integer. While a multicolumn btree index on two integer columns would certainly help, you'll have to test carefully if combining (other_id, dn_ids) in a multicolumn GIN index is worth more than it costs. Probably not. Postgres can combine multiple indexes in a bitmap index scan rather efficiently.
Finally, while indexes can be used for sorted output, this will probably not pay to apply for a query like you display (unless you select large parts of the table).
Not applicable to updated question.
Partial indexes might be an option. Other than that, you already have all the indexes you need.
I would drop the pointless index on the boolean column current completely, and the index on just rank is probably never used for this query.
Does it matter much how I order the terms in the WHERE clause?
The order of WHERE conditions is completely irrelevant.
Addendum after question update
The utility of indexes is bound to selective criteria. If more than roughly 5 % (depends on various factors) of the table are selected, a sequential scan of the whole table is typically faster than dealing with the overhead on any indexes - except for pre-sorting output, that's the one thing an index is still good for in such cases.
For a query that fetches 25,000 of 250,000 rows, indexes are mostly just for that - which gets all the more interesting if you attach a LIMIT clause. Postgres can stop fetching rows from an index once the LIMIT is satisfied.
Be aware that Postgres always needs to read OFFSET + LIMIT rows, so performance deteriorates with the sum of both.
Even with your added information, much of what's relevant is still in the dark. I am going to assume that:
Your predicate NOT (ARRAY[100,200] && dn_ids) is not very selective. Ruling out 1 to 10 ID values should typically retain the majority of rows unless you have very few distinct elements in dn_ids.
The most selective predicate is other_id = 5.
A substantial part of the rows is eliminated with NOT current.
Aside: current = F isn't valid syntax in standard Postgres. It must be NOT current or current = FALSE.
While a GIN index would be great to identify few rows with matching arrays faster than any other index type, this seems hardly relevant for your query. My best guess is this partial, multicolumn btree index:
CREATE INDEX foo ON mytable (other_id, rank, dn_ids)
WHERE NOT current;
The array column dn_ids in a btree index cannot support the && operator, I just include it to allow index-only scans and filter rows before accessing the heap (the table). May even be faster without dn_ids in the index:
CREATE INDEX foo ON mytable (other_id, rank) WHERE NOT current;
GiST indexes may become more interesting in Postgres 9.5 due to this new feature:
Allow GiST indexes to perform index-only scans (Anastasia Lubennikova,
Heikki Linnakangas, Andreas Karlsson)
Aside: current is a reserved word in standard SQL, even if it's allowed as identifier in Postgres.
Aside 2: I assume id is an actual serial column with the column default set. Just creating a sequence like you demonstrate, would do nothing.
Auto increment SQL function
Unfortunately, I don't think you can combine a btree and a GIN/GiST index into a single compound index, so the planner is going to have to choose between using the other_id index or the dn_ids index. One advantage of using other_id, as you pointed out, is that you could use a multicolumn index to improve the sort performance. The way you would do this is:
CREATE INDEX index_mytable_on_other_id_and_rank
ON mytable (other_id, rank) WHERE NOT current;
This uses a partial index, and will allow you to skip the sort step when you are sorting by rank and filtering on other_id.
Depending on the cardinality of other_id, the only benefit of this might be the sorting. Because your plan has a LIMIT clause, it's difficult to tell. Seq scans can be the fastest option if you're reading more than about 1/5 of the table, especially if you're on a standard HDD instead of solid state. If your planner insists on seq scanning when you know an index scan is faster (you've tested with enable_seqscan = off), you may want to try fine-tuning your random_page_cost or effective_cache_size.
Finally, I'd recommend not keeping all of these indexes. Find the ones you need and cull the rest. Indexes cause significant performance degradation on inserts (especially multi-column and GIN/GiST indexes).
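A minimal SQLite sketch of the partial-index idea (via Python's sqlite3; "current" is quoted because it is a reserved word, as noted above): the query repeats the index predicate NOT "current", so the partial index qualifies, and (other_id, rank) delivers the rows pre-sorted by rank.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE mytable(
    id INTEGER PRIMARY KEY, other_id INT, rank INT, "current" INT)""")
con.execute("""CREATE INDEX foo ON mytable(other_id, rank)
               WHERE NOT "current" """)

def plan(sql):
    return " ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

# The WHERE clause repeats the index predicate verbatim, so the partial
# index is usable; equality on other_id then yields rank-sorted rows.
p = plan("""SELECT id FROM mytable
            WHERE other_id = 5 AND NOT "current"
            ORDER BY rank LIMIT 500""")
print(p)
```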
The simplest index for your query is mytable(other_id, current). This handles the first two conditions. This would be a normal b-tree type index.
You can satisfy the array condition using a GIST index on mytable(dn_ids).
However, I don't think you can mix the different data types in one index, at least not without extensions.