SQL Top Function with a clustered index - tsql

The order that the records come in is not always guaranteed unless I use an order by clause.
If I throw a clustered index on a table and then do a select top 100, for example, would the 100 rows returned always be the same?
I am asking this because a clustered index sorts the data physically on the key value.
I am lead to believe so from my observations, but wanted to see what others thought.

No. The rule is simple: SQL tables and result sets represent unordered sets. The only exception is a result set associated with a query that has an ORDER BY in the outermost SELECT.
A clustered index affects how data is stored on each page. However, it does not guarantee that a result set built on that table will even use the clustered index.
Consider a table that has a primary, clustered key on id and a query that returns:
select top (100) othercol
from t;
This query could use an index on othercol -- avoiding the clustered index altogether.

Related

Order of output values when select one field from table. SQL server

In what order will the field values be displayed for "select field from table".
Table is a clustered index
Anything or as in the table?
it looks like in the table. But I would like to know for sure
Does it matter in this case the table is a clustered index or not?
Without an explicit ORDER BY - there's no defined / reliable ordering.
Relational database are inherently unordered - so you cannot rely on any ordering - regardless of whether that table has a clustered index or not - UNLESS you explicitly specify what ordering you need by using ORDER BY .....

Are these indexes doing the same thing in respect to customer_id?

I'm pretty new to PostgreSQL so apologies if I'm asking the obvious.
I've got a table called customer_products. It contains the following two indexes:
CREATE INDEX customer_products_customer_id
ON public.customer_products USING btree (customer_id)
CREATE UNIQUE INDEX customer_products_customer_id_product_id
ON public.customer_products USING btree (customer_id, product_id)
Are they both doing the same thing in respect to customer_id or do they function in a different way? I'm not sure if I should leave them or remove customer_products_customer_id.
There is nothing that the first index can do that the second cannot, so you should drop the first index.
The only advantage of the first index over the second when it comes to queries whose WHERE (or ORDER BY) clause involves customer_id only is that the index is smaller. That makes a range scan over many index entries somewhat faster.
The price for an extra index in terms of size and data modification speed usually outweighs that advantage. In a read-only data warehouse where I have a query that profits significantly I may be tempted to keep both indexes, otherwise I wouldn't.
You should definitely not drop the UNIQUE index, because it has a valuable use that has nothing to do with performance: it prevents the table from containing two rows that have the save values for the indexed columns. If that is what you want to guarantee, a UNIQUE index will make sure that your data keep in good shape.
Side remark: even though the effect is the same, it is better if the table has a unique constraint (which is backed by a unique index) than just having the index. If nothing else, it documents the purpose better.

Which index is used to answer aggregates when we have several indexes?

I have a table which is partitioned on daily basis, each partition has certainly a primary key, and several other indexes on columns which are not null. If I get the query plane for the following:
SELECT COUNT(*) FROM parent_table;
I can see different indexes are used, sometimes the primary key index is used and some times others. How postgres is able to decide which index to use. Note that, my table is not clustered and never clustered before. Also, the primary key is serial.
What are the catalog / statistics tables which are used to make this decision.

Should this PostgreSQL query use the indexes?

I have two tables:
CREATE TABLE soils (
sample_id TEXT PRIMARY KEY,
project_id TEXT,
technician_id TEXT
);
CREATE INDEX soils_idx
ON soils
USING btree
(sample_id COLLATE pg_catalog."default");
CREATE TABLE assays (
sample_id TEXT PRIMARY KEY,
mo_ppm NUMERIC
);
CREATE INDEX assays_idx
ON assays
USING btree
(sample_id COLLATE pg_catalog."default");
Each table contains about a half million records, and, in reality, about 20 additional columns each, of type TEXT (omitted in the DDL posted above to save time here).
When I perform the query:
EXPLAIN SELECT
s.sample_id, s.project_id, s.technician_id, a.mo_ppm
FROM
soils AS s INNER JOIN assays AS a ON s.sample_id = a.sample_id
I get 2 SEQ SCANs, rather than a lookup to the index. Is that expected behaviour?
Since you have no WHERE conditions, you effectively read the whole table. It's cheaper to run sequential scans and not involve any indexes at all.
Try:
EXPLAIN
SELECT s.sample_id, s.project_id, s.technician_id, a.mo_ppm
FROM soils s
JOIN assays a USING (sample_id)
WHERE <some condition that returns few rows>;
... and an index matching the WHERE condition should be used.
You don't need to define an index on a PRIMARY KEY column. A PK constraint is implemented with a unique index automatically. Your additional index is redundant and of no use.
An index on a foreign key column would be a good idea, but there isn't one in your example, which looks odd. Like the two tables could be combined into one. Probably just over-simplification for the test case.
Finally, for big tables, I would consider using a simple integer primary key instead of text, possibly a serial column. That's typically faster.
Yes, that's expected behaviour. On the other hand it depends on your random_page_cost, seq_page_cost and effective_cache_size settings. Your query doesn't have WHERE clause hence it might be faster to read everything sequentially. You can try to penalise sequential scan:
set enable_seqscan = off;
explain analyse <your query>;
and then compare plan/cost/IO wait (it is not possible to disable seq-scan but it gets very high cost -- ~1e7 (or 1e8)).
If you have SSD and WHERE clause in your query then you can lower random_page_cost to 1.5..2.5 and encourage PG to use index.

Doubt in clustered and non Clustered index

I have a doubt that if my table do n't have any constraint like Primary Key,Foreign key,Unique key etc. then can i create the clustered index on table and clustered index can have the douplicate records ?
My 2nd question is where should we exectly use the non clustered index and when it is useful and benificial to create in table?
My 3rd question is How can we create the 249 non clustered index in a table .Is it the meaning, Creating the non clustered index on 249 columns ?
Can you anyone help me to remove my confusion in this.
First, the definition of a clustered index is that it is physical ordering of data on the disk. Every time you do an insert into that table, the new record will be placed on the physical disk in its order based on its value in the clustered index column. Because it is the physical location on the disk, it is (A) the most rapidly accessible column in the table but (B) only possible to define a single clustered index per table. Which column (or columns) you use as the clustered index depend on the data itself and its use. Primary keys are typically the clustered index, especially if the primary key is sequential (e.g. an integer that increments automatically with each insert). This will provide the fastest insert/update/delete functionality. If you are more interested in performing reads (select * from table), you may want to cluster on a Date column, as most queries have either a date in the where clause, the group by clause or both.
Second, clustered indexes (at least in the DB's I know) need not be unique (they CAN have duplicates). Constraining the column to be unique is separate matter. If the clustered index is a primary key its uniqueness is a function of being a primary key.
Third, I can't follow you questions concerning 249 columns. A non-clustered index is basically a tool for accelerating queries at the expense of extra disk space. It's hard to think of a case where creating an index on each column is necessary. If you want a quick rule of thumb...
Write a query using your table.
If a column is required to do a join, index it.
If a column is used in a where column, index it.
Remember all the indexes are doing for you is speeding up your queries. If queries run fast, don't worry about them.
This is just a thumbnail sketch of a large topic. There are tons of more informative/comprehensive resources on this matter, and some depend on the database system ... just google it.