Confused between clustered and nonclustered index. Contains 5 doubts - rdbms

Do clustered and non-clustered indexes both work on B-Trees? I read that a clustered index affects how the data is physically stored in the table, whereas with a non-clustered index a separate copy of the column is created and stored in sorted order. Also, SQL Server creates a clustered index on the primary key by default.
Does that mean :
1) Do non-clustered indexes occupy more space than clustered indexes, since a separate copy of the column is stored for a non-clustered index?
2) How do clustered and non-clustered indexes work when the primary key is based on two columns, say (StudentName, Marks)?
3) Are there only 2 types of indexes? If so, then what are bitmap indexes? I can't seem to find any such index type in SQL Server Management Studio, but my data warehousing book mentions all of these types.
4) Is creating a clustered or non-clustered index on the primary key efficient?
5) Suppose we create a clustered index on name, i.e. the data is physically stored sorted by name, and then a new record is created. How will the new record find its place in the table?
Thanks in advance :)

A non-clustered index is a structure stored separately from the actual data pages that simply contains pointers to the rows; a clustered index is the exception, because its leaf level is the data pages themselves. In SQL Server both kinds are B-Trees.
Clustered indexes sort and store the data pages in the table according to the columns defined for the index. In SQL Server 2005 you can add additional columns to an index, so a composite primary key should not be a problem. You can think of a clustered index like a set of filing cabinets with folders. In the first drawer you have documents starting with A, and in the first folder of that drawer you may have documents starting from AA to AC, and so on. To search for "Spider", then, you can jump straight to the S drawer, look for the folder containing "SP" and quickly find what you are looking for. But it is obvious that if you sort all documents physically by one index, you cannot physically sort the same set of documents by another index. Hence, only one clustered index per table.
A non-clustered index is a separate structure, much like the table of contents or the index at the back of a book. That only answers some of your questions, so more specifically:
Yes, the index does occupy space, but not as much as the original table. That is why you must choose your indexes carefully. There is also a small performance hit for update operations, since the index has to be maintained.
Your book will mention all the theoretical types of indexes. Bitmap indexes are useful in data warehousing applications or for data that has only a few distinct values, like days of the week. So they are not generally used in your basic RDBMS. I know that Oracle has some implementations, but I don't know much about that.
I think that the efficiency of an index is determined by how the field is used. If the majority of the data scanning in your table is expected to be done on the primary key, then an index on the primary key makes sense. You usually add indexes to columns that appear in the WHERE clause or the join condition of your queries.
On insert the index has to be maintained, so there is a little extra work that has to be done by the system to rearrange things a bit.
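To make the filing-cabinet picture and the insert question a bit more concrete, here is a minimal sketch in Python. It is only a toy model, not how SQL Server is implemented: plain sorted lists stand in for B-Trees, the names `table`, `marks_index`, `insert`, `find_by_name` and `find_by_marks` are made up for illustration, and names are assumed to be unique. The secondary index stores the clustering key as its row locator, which is what SQL Server does when the table has a clustered index.

```python
import bisect

# "Clustered" storage: the rows themselves, kept sorted by the key (name).
table = []          # list of (name, marks) tuples in name order
# "Non-clustered" index on marks: a separate sorted list of (marks, name).
# It stores the clustering key as the row locator, not a physical position.
marks_index = []

def insert(name, marks):
    # bisect finds the slot that keeps each list sorted. A B-tree does the
    # same job page by page instead of shifting one big array.
    bisect.insort(table, (name, marks))
    bisect.insort(marks_index, (marks, name))

def find_by_name(name):
    # Seek directly in the clustered structure.
    i = bisect.bisect_left(table, (name,))
    if i < len(table) and table[i][0] == name:
        return table[i]
    return None

def find_by_marks(marks):
    # Seek in the secondary index, then follow each locator to the row.
    i = bisect.bisect_left(marks_index, (marks,))
    found = []
    while i < len(marks_index) and marks_index[i][0] == marks:
        found.append(find_by_name(marks_index[i][1]))
        i += 1
    return found

# Rows arrive in arbitrary order but end up stored in name order.
for name, marks in [("Spider", 70), ("Adams", 85), ("Miller", 70)]:
    insert(name, marks)

print(table)
print(find_by_marks(70))
```

Every insert touches both structures, which is the "little extra work" on insert mentioned above, and the second list is the extra space a non-clustered index costs.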

Related

Are these indexes doing the same thing in respect to customer_id?

I'm pretty new to PostgreSQL so apologies if I'm asking the obvious.
I've got a table called customer_products. It contains the following two indexes:
CREATE INDEX customer_products_customer_id
ON public.customer_products USING btree (customer_id);
CREATE UNIQUE INDEX customer_products_customer_id_product_id
ON public.customer_products USING btree (customer_id, product_id);
Are they both doing the same thing in respect to customer_id or do they function in a different way? I'm not sure if I should leave them or remove customer_products_customer_id.
There is nothing that the first index can do that the second cannot, so you should drop the first index.
The only advantage of the first index over the second when it comes to queries whose WHERE (or ORDER BY) clause involves customer_id only is that the index is smaller. That makes a range scan over many index entries somewhat faster.
The price for an extra index in terms of size and data modification speed usually outweighs that advantage. In a read-only data warehouse where I have a query that profits significantly I may be tempted to keep both indexes, otherwise I wouldn't.
You should definitely not drop the UNIQUE index, because it has a valuable use that has nothing to do with performance: it prevents the table from containing two rows that have the same values for the indexed columns. If that is what you want to guarantee, a UNIQUE index will make sure that your data stays in good shape.
Side remark: even though the effect is the same, it is better if the table has a unique constraint (which is backed by a unique index) than just having the index. If nothing else, it documents the purpose better.
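If you want to see both points in action without touching your PostgreSQL database, here is a small sketch that uses SQLite through Python's standard `sqlite3` module as a stand-in. The planner and its output differ from PostgreSQL's `EXPLAIN`, so take it as an illustration of the principle only. The `quantity` column and the `plan` helper are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_products (
        customer_id INTEGER NOT NULL,
        product_id  INTEGER NOT NULL,
        quantity    INTEGER NOT NULL DEFAULT 1  -- hypothetical extra column
    );
    CREATE UNIQUE INDEX customer_products_customer_id_product_id
        ON customer_products (customer_id, product_id);
""")

def plan(sql):
    # The fourth column of EXPLAIN QUERY PLAN holds the readable description.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)

# A filter on the leading column alone can use the two-column index.
leading_plan = plan("SELECT * FROM customer_products WHERE customer_id = 42")
print(leading_plan)

# The unique index also rejects a second row with the same pair of values.
conn.execute(
    "INSERT INTO customer_products (customer_id, product_id) VALUES (42, 7)")
try:
    conn.execute(
        "INSERT INTO customer_products (customer_id, product_id) VALUES (42, 7)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)
```

The printed plan should name the two-column index even though only `customer_id` is filtered, and the second insert should fail with an integrity error.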

Understanding indexes and performance as they relate to indexed column and non-indexed column data in the same row

I have some tables that are around 100 columns wide. I haven't normalized them because to put it back together would require almost 3 dozen joins and am not sure it would perform any better... haven't tested it yet (I will) so can't say for sure.
Anyway, that really isn't the question. I have been indexing columns in these tables that I know will be pulled frequently, so something like 50 indexes per table.
I got to thinking though. These columns will never be pulled by themselves and are meaningless without the primary key (basically an item number). The PK will always be used for the join and even in simple SELECT queries, it will have to be a specified column so the data makes sense.
That got me thinking further about indexes and how they work. As I understand them, the locations of the values in a column are kept in memory so that they can be found quickly in a query.
For example, if you have:
SELECT itemnumber, expdate
FROM items;
And both itemnumber and expdate are indexed, is that excessive and really adding any benefit? Is it sufficient to just index itemnumber and the index will know that expdate, or anything else that is queried for that item, is on the same row?
Secondly, if multiple columns constitute a primary key, should the index include them grouped together, or is individually sufficient?
For example,
CREATE INDEX test_index ON table (pk_col1, pk_col2, pk_col3);
vs.
CREATE INDEX test_index1 ON table (pk_col1);
CREATE INDEX test_index2 ON table (pk_col2);
CREATE INDEX test_index3 ON table (pk_col3);
Thanks for clearing that up in advance!
Uh oh, there is a mountain of basics that you still have to learn.
I'd recommend that you read the PostgreSQL documentation and the excellent book “SQL Performance Explained”.
I'll give you a few pointers to get you started:
Whenever you create a PRIMARY KEY or UNIQUE constraint, PostgreSQL automatically creates a unique index over all the columns of that constraint. So you don't have to create that index explicitly (but if it is a multicolumn index, it sometimes is useful to create another index on any but the first column).
Indexes are relevant to conditions in the WHERE clause and the GROUP BY clause and to some extent for table joins. They are irrelevant for entries in the SELECT list. An index provides an efficient way to get the part of a table that satisfies a certain condition; an (unsorted) access to all rows of a table will never benefit from an index.
Don't sprinkle your schema with indexes randomly, since indexes use space and make all data modification slow.
Use them where you know that they will do good: on columns on which a foreign key is defined, on columns that appear in WHERE clauses and contain many different values, on columns where your examination of the execution plan (with EXPLAIN) suggests that you can expect a performance benefit.
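To illustrate the point about the SELECT list versus the WHERE clause, here is a rough sketch using SQLite via Python's `sqlite3` module rather than PostgreSQL, so the plan text will not look like PostgreSQL's `EXPLAIN` output. The `descr` column, the index name `items_expdate` and the `plan` helper are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (
        itemnumber TEXT PRIMARY KEY,  -- gets its own unique index implicitly
        expdate    TEXT,
        descr      TEXT               -- hypothetical extra column
    );
    CREATE INDEX items_expdate ON items (expdate);
""")

def plan(sql):
    # The fourth column of EXPLAIN QUERY PLAN holds the readable description.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)

# No condition: every row is needed, so the index on expdate does not help.
full_read = plan("SELECT itemnumber, expdate FROM items")

# A condition on the indexed column lets the planner seek into the index.
filtered = plan(
    "SELECT itemnumber, expdate FROM items WHERE expdate = '2030-01-01'")

print(full_read)
print(filtered)
```

In SQLite's wording, a plan that starts with SCAN reads everything, while SEARCH means an index was used to narrow the rows. Only the second query should mention `items_expdate`.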

Which index is used to answer aggregates when we have several indexes?

I have a table which is partitioned on a daily basis. Each partition certainly has a primary key, and several other indexes on columns which are not null. If I get the query plan for the following:
SELECT COUNT(*) FROM parent_table;
I can see that different indexes are used: sometimes the primary key index is used and sometimes others. How does PostgreSQL decide which index to use? Note that my table is not clustered and has never been clustered before. Also, the primary key is serial.
What are the catalog / statistics tables which are used to make this decision?

DB associative entities and indexing

This is kind of a general DB design question. If one has an associative entity table, i.e. a cross-reference, containing records that basically just consist of two FK references, should it be indexed in some way? Is it necessary to explicitly index that table, since the PKs in the associated tables are already indexed by definition? If one should index it, should it be a combination index, consisting of the two FK fields together?
Indexes on the referenced pk columns in the other tables do not cover it.
By defining the two fk columns as composite primary key of the "associative entity" table (as you should in most cases - provided that associations are unique), you implicitly create a multi-column index.
That covers all queries involving both columns or just the first column optimally.
It also covers queries on the second column, but in a less effective way.
If you have important queries involving just the second column, create an additional index on that one, too.
Read all the details about the topic at this related question on dba.SE.
Or this question on SO, also covering this topic.
Suppose your associative table has a schema such as:
CREATE TABLE Association
(
ReferenceA INTEGER NOT NULL REFERENCES TableA CONSTRAINT FK1_Association,
ReferenceB INTEGER NOT NULL REFERENCES TableB CONSTRAINT FK2_Association,
PRIMARY KEY(ReferenceA, ReferenceB) CONSTRAINT PK_Association
);
The chances are that your DBMS will automatically create some indexes.
Some DBMS will create an index for each of the two foreign keys and also a unique index for the primary key. This is slightly wasteful since the PK index could be used for accessing ReferenceA too.
Ideally, there will be just two indexes: the PK (unique) index and the (duplicates allowed) FK index for ReferenceB, assuming that the PK index has ReferenceA as the first column.
If a DBMS does not automatically create indexes to enforce the referential integrity constraints, you'll want to create the RI or FK duplicates-allowed index. If it doesn't automatically create an index to enforce the PK constraint, you'll want to create that unique index too. The upside is that you'll only create the indexes for the ideal case.
Depending on your DBMS, you might find it more effective to create the table without the constraints, then to add the indexes, and then to add the constraints (which will then use the indexes you created). Things like fragmentation schemes can also factor into this; I ignored them above.
The concept remains simple — you want two indexes in total, one to enforce uniqueness on both columns and provide fast access on the leading column, and a non-unique or duplicates-allowed index on the trailing column.
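As a hedged illustration of the "two indexes in total" idea, the sketch below uses SQLite through Python's `sqlite3` module, which is only a stand-in for whatever DBMS you actually run. The lower-case table and column names, the index name `association_reference_b` and the `plan` helper are invented for the example, and the foreign key references are left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE association (
        reference_a INTEGER NOT NULL,
        reference_b INTEGER NOT NULL,
        PRIMARY KEY (reference_a, reference_b)  -- implicit unique index
    );
""")

def plan(sql):
    # The fourth column of EXPLAIN QUERY PLAN holds the readable description.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)

query = "SELECT reference_a FROM association WHERE reference_b = 5"

# With only the primary key index, a lookup on the trailing column
# has to walk through all entries.
before = plan(query)

# The second, duplicates-allowed index on the trailing column.
conn.execute(
    "CREATE INDEX association_reference_b ON association (reference_b)")
after = plan(query)

print(before)
print(after)
```

Before the extra index exists the plan should be a SCAN, and afterwards a SEARCH that names `association_reference_b`.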

Doubt in clustered and non Clustered index

If my table doesn't have any constraints such as a primary key, foreign key or unique key, can I still create a clustered index on it, and can the clustered index contain duplicate records?
My 2nd question is: where exactly should we use a non-clustered index, and when is it useful and beneficial to create one on a table?
My 3rd question is: how can we create 249 non-clustered indexes on a table? Does it mean creating non-clustered indexes on 249 columns?
Can anyone help clear up my confusion about this?
First, the definition of a clustered index is that it is the physical ordering of the data on disk. Every time you do an insert into that table, the new record will be placed on the physical disk in its order based on its value in the clustered index column. Because it is the physical location on the disk, it is (A) the most rapidly accessible column in the table but (B) only possible to define a single clustered index per table. Which column (or columns) you use as the clustered index depends on the data itself and its use. Primary keys are typically the clustered index, especially if the primary key is sequential (e.g. an integer that increments automatically with each insert). This will provide the fastest insert/update/delete functionality. If you are more interested in performing reads (select * from table), you may want to cluster on a date column, as most queries have a date in the where clause, the group by clause, or both.
Second, clustered indexes (at least in the DBs I know) need not be unique (they CAN have duplicates). Constraining the column to be unique is a separate matter. If the clustered index is a primary key, its uniqueness is a function of being a primary key.
Third, I can't follow your question concerning 249 columns. A non-clustered index is basically a tool for accelerating queries at the expense of extra disk space. It's hard to think of a case where creating an index on each column is necessary. If you want a quick rule of thumb...
Write a query using your table.
If a column is required to do a join, index it.
If a column is used in a where clause, index it.
Remember all the indexes are doing for you is speeding up your queries. If queries run fast, don't worry about them.
This is just a thumbnail sketch of a large topic. There are tons of more informative/comprehensive resources on this matter, and some depend on the database system ... just google it.