This is kind of a general DB design question. If one has an associative entity table, i.e. a cross-reference, containing records that basically just consist of two FK references, should it be indexed in some way? Is it necessary to explicitly index that table, since the PKs in the associated tables are already indexed by definition? If one should index it, should it be a combination index, consisting of the two FK fields together?
Indexes on the referenced PK columns in the other tables do not cover it.
By defining the two FK columns as the composite primary key of the "associative entity" table (as you should in most cases, provided the associations are unique), you implicitly create a multi-column index.
That covers all queries involving both columns, or just the first column, optimally.
It also covers queries on the second column alone, but less efficiently.
If you have important queries involving just the second column, create an additional index on that one, too.
Read all the details about the topic at this related question on dba.SE.
Or this question on SO, also covering this topic.
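For illustration, a minimal sketch of that advice, assuming a hypothetical junction table film_actor linking film and actor:

CREATE TABLE film_actor
(
    film_id  integer NOT NULL REFERENCES film,
    actor_id integer NOT NULL REFERENCES actor,
    PRIMARY KEY (film_id, actor_id)  -- implicitly creates the multi-column index
);

-- Only needed if important queries filter on actor_id alone.
CREATE INDEX film_actor_actor_id_idx ON film_actor (actor_id);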
Suppose your associative table has a schema such as:
CREATE TABLE Association
(
    ReferenceA INTEGER NOT NULL CONSTRAINT FK1_Association REFERENCES TableA,
    ReferenceB INTEGER NOT NULL CONSTRAINT FK2_Association REFERENCES TableB,
    CONSTRAINT PK_Association PRIMARY KEY (ReferenceA, ReferenceB)
);
The chances are that your DBMS will automatically create some indexes.
Some DBMSs will create an index for each of the two foreign keys and also a unique index for the primary key. This is slightly wasteful, since the PK index could be used for accessing ReferenceA too.
Ideally, there will be just two indexes: the (unique) PK index and a (duplicates-allowed) FK index on ReferenceB, given that the PK index has ReferenceA as its first column.
If a DBMS does not automatically create indexes to enforce the referential integrity constraints, you'll want to create the duplicates-allowed FK index yourself. If it doesn't automatically create an index to enforce the PK constraint, you'll want to create that unique index too. The upside is that you then create only the indexes the ideal case calls for.
Depending on your DBMS, you might find it more effective to create the table without the constraints, then add the indexes, and then add the constraints (which will then use the indexes you created). Things like fragmentation schemes can also factor into this; I've ignored them above.
The concept remains simple: you want two indexes in total, a unique index that enforces uniqueness on both columns and provides fast access on the leading column, and a non-unique (duplicates-allowed) index on the trailing column.
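If you do have to create them yourself, a minimal sketch of the ideal pair (index names are illustrative):

-- Unique index backing the PK; also covers lookups on ReferenceA.
CREATE UNIQUE INDEX PK_Association ON Association (ReferenceA, ReferenceB);

-- Duplicates-allowed index covering lookups on ReferenceB.
CREATE INDEX FK2_Association ON Association (ReferenceB);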
In Amazon's guide, they mention specifying PRIMARY and FOREIGN KEYs for all of your tables, and then designating distribution keys where it makes sense, like on columns that often get used to join tables together. I understand that even with a single table query, the right DISTKEY specification would help in doing GROUP BY, but for JOINing two or more tables, do the DISTKEY columns have to be specified as FOREIGN KEYs as well? Or will Redshift co-locate rows from different tables to the same nodes based on the data-type (and maybe name) of columns used as the DISTKEY?
The reason I'm asking is that I'm not really using dimension tables in my application. I could create them simply to use as a foreign key reference to help with the distribution, but then the dimension tables would have to be maintained.
Consider the following example where I have two tables that are frequently joined:
CREATE TABLE motorcycles
(
id INT,
hexcolor CHAR(6)
);
CREATE TABLE helmets
(
id INT,
hexcolor CHAR(6)
);
Now suppose in my application, we frequently join the motorcycles table to the helmets table on the hexcolor column. Then it would make sense to use DISTSTYLE KEY and use DISTKEY (hexcolor), right? However, you can't really say that the hexcolor column from the motorcycles table is a foreign key to the helmets table or vice-versa. I could create a dimension table that just had a list of all the possible hexcolor values, and then both the motorcycles and helmets tables could have a foreign key to this dimension table, but it would be a pain to have to maintain this dimension table (Amazon's guide also warns against specifying primary or foreign keys that are not properly maintained, because it will confuse the query planner).
So, with my motorcycles and helmets example, would a foreign key to a dimension table be necessary? Or will Redshift make an assumption that it should distribute the rows for both of these tables the same way based on the fact that the data type of the column used as the distribution key is the same?
As long as the columns have the same data type, you should expect Redshift to distribute the motorcycles and helmets tables in the same fashion.
There is no justification for a foreign key in your case. The query planner will be able to take advantage of the fact that both tables are distributed on the same key.
But it's always good to read the execution plan and make sure it says DS_DIST_NONE, which means that no data redistribution was needed.
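For illustration, a minimal sketch of the matching DDL and the check, in Redshift syntax (reusing the question's tables):

CREATE TABLE motorcycles
(
    id       INT,
    hexcolor CHAR(6)
)
DISTSTYLE KEY
DISTKEY (hexcolor);

CREATE TABLE helmets
(
    id       INT,
    hexcolor CHAR(6)
)
DISTSTYLE KEY
DISTKEY (hexcolor);

-- Look for DS_DIST_NONE in the join step of the plan.
EXPLAIN
SELECT *
FROM motorcycles m
JOIN helmets h ON m.hexcolor = h.hexcolor;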
Can a table have multiple primary indexes in Progress OpenEdge?
No - each table (and temp-table) has one and only one primary index.
This is also true for tables with "no" indexes defined for them: the database engine will create a default index (based on the recid, I believe) for such tables.
There is nothing very special about the designation "primary". It is a potential tie-breaker used when other index selection rules result in multiple possible index choices -- if everything else is equal the "primary" index will be used. It is also the default index used when various utilities are run. But beyond that there is no magic associated with the "primary" attribute.
No. Each table can have only one primary index.
A table can have multiple unique indexes, though. A primary index usually is (and should be) unique, but this is not enforced.
(Just in case there is confusion about primary and unique indexes.)
On Postgres, a unique index is automatically created for primary key columns. From the docs,
When an index is declared unique, multiple table rows with equal indexed values are not allowed. Null values are not considered equal. A multicolumn unique index will only reject cases where all indexed columns are equal in multiple rows.
From my understanding, it seems like this index only checks uniqueness and isn't actually used for faster access when querying by primary key IDs. Does this mean that the index structure doesn't consist of a sorted table (or a tree) for the primary key column? Is this correct?
In theory a unique or primary key constraint could be enforced without the presence of an index, but it would be a painful process. The index is mainly there for performance purposes.
However, some databases (e.g. Oracle) allow a unique or primary key constraint to be supported by a non-unique index. Primarily this allows enforcement of the constraint to be deferred until the end of a transaction, so a lack of uniqueness can be permitted temporarily during the transaction, but it also allows indexes to be built in parallel, with the constraint then defined as a secondary step.
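A hedged sketch of that arrangement in Oracle syntax (table and index names are hypothetical):

-- A non-unique index can back the primary key constraint.
CREATE TABLE t (id NUMBER NOT NULL);
CREATE INDEX t_id_ix ON t (id);

-- Deferrable constraints require a non-unique index; uniqueness is checked at commit.
ALTER TABLE t ADD CONSTRAINT t_pk PRIMARY KEY (id)
  DEFERRABLE INITIALLY IMMEDIATE
  USING INDEX t_id_ix;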
Also, I'm not sure how the internals work on a PostgreSQL B-tree index, but all Oracle B-trees are internally declared to be unique, either:
on the key column(s), for an index that is intended to be UNIQUE, or
on the key column(s) plus the indexed row's ROWID, for a non-unique index.
Quite the contrary: the index is created in order to allow faster access, mainly to check for duplicates when a new record is inserted, but it can also be used by other queries against PK columns. The best structure for unique-key indexes is a B-tree, because the index is updated during the insert; if the RDBMS detects a collision in a leaf, it raises a unique-constraint violation.
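For illustration, a minimal sketch in PostgreSQL (table and values are hypothetical):

CREATE TABLE items (id integer PRIMARY KEY, label text);

-- The implicit PK index serves ordinary lookups, not just uniqueness checks.
EXPLAIN SELECT * FROM items WHERE id = 42;
-- Expect an index scan on items_pkey (tiny tables may show a sequential scan instead).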
Let's say I have the following table
CREATE TABLE subgroups (
    group_id t_group_id NOT NULL REFERENCES groups(group_id),  -- t_group_id is a custom domain
    subgroup_name t_subgroup_name NOT NULL,  -- t_subgroup_name is a custom domain
    more attributes ...
);
subgroup_name is UNIQUE within a group (group_id).
A group can have many subgroups.
The subgroup names are user-supplied. (I would like to avoid using a subgroup_id column; subgroup_name has meaning in the model and is more than just a label. I am providing a list of predetermined names but allow a user to add his own for flexibility.)
This table has two levels of referencing child tables containing subgroup attributes (with many-to-one relations).
I would like to have a PRIMARY KEY on (group_id, upper(trim(subgroup_name)));
From what I know, Postgres doesn't allow PRIMARY KEY or UNIQUE constraints on an expression.
IIRC, the relational model also requires columns to be used as stored.
CREATE UNIQUE INDEX ON subgroups (group_id, upper(trim(subgroup_name))); doesn't solve my problem
as other tables in my model will have FOREIGN KEYs pointing to those two columns.
I see two options.
Option A)
Store a cleaned-up subgroup name in subgroup_name
Add an extra column called subgroup_name_raw that would contain the uncleaned string
Option B)
Create both a UNIQUE INDEX and PRIMARY KEY on my key pair. (seems like a huge waste)
Any insights?
Note: I'm using Postgres 9.2
Actually, you can enforce uniqueness on the output of a function. You can't do it in the table definition, though. What you need to do is create a unique index afterwards. So something like:
CREATE UNIQUE INDEX subgroups_ukey2 ON subgroups(group_id, upper(trim(subgroup_name)));
PostgreSQL has a number of absolutely amazing indexing capabilities, and the ability to create unique (and partial unique) indexes on function output is quite underrated.
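For illustration, a quick sketch of how that index behaves (values are hypothetical):

INSERT INTO subgroups (group_id, subgroup_name) VALUES (1, 'Alpha');

-- Fails: upper(trim('  alpha ')) and upper(trim('Alpha')) are both 'ALPHA'.
INSERT INTO subgroups (group_id, subgroup_name) VALUES (1, '  alpha ');
-- ERROR: duplicate key value violates unique constraint "subgroups_ukey2"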
I have a doubt: if my table doesn't have any constraints such as a primary key, foreign key, or unique key, can I still create a clustered index on the table, and can the clustered index contain duplicate records?
My second question is: where exactly should we use a non-clustered index, and when is it useful and beneficial to create one on a table?
My third question is: how can we create 249 non-clustered indexes on a table? Does that mean creating a non-clustered index on 249 columns?
Can anyone help me clear up my confusion on this?
First, the definition of a clustered index is that it determines the physical ordering of data on the disk. Every time you do an insert into that table, the new record is placed on the physical disk in order, based on its value in the clustered index column. Because it defines the physical location on the disk, it is (A) the most rapidly accessible column in the table but (B) only possible to define a single clustered index per table. Which column (or columns) you use as the clustered index depends on the data itself and its use. Primary keys are typically the clustered index, especially if the primary key is sequential (e.g. an integer that increments automatically with each insert). This will provide the fastest insert/update/delete performance. If you are more interested in performing reads (select * from table), you may want to cluster on a date column, as most queries have a date in the where clause, the group by clause, or both.
Second, clustered indexes (at least in the DBs I know) need not be unique; they CAN have duplicates. Constraining the column to be unique is a separate matter. If the clustered index is a primary key, its uniqueness is a function of being a primary key.
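For illustration, a minimal sketch in SQL Server syntax (table and index names are hypothetical):

CREATE TABLE Orders
(
    OrderId    INT,
    CustomerId INT,
    OrderDate  DATE
);

-- No PK or unique constraint required; duplicate OrderDate values are allowed.
CREATE CLUSTERED INDEX IX_Orders_OrderDate ON Orders (OrderDate);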
Third, I can't follow your question concerning 249 columns. A non-clustered index is basically a tool for accelerating queries at the expense of extra disk space. It's hard to think of a case where creating an index on each column is necessary. If you want a quick rule of thumb...
Write a query using your table.
If a column is required to do a join, index it.
If a column is used in a where clause, index it.
Remember, all the indexes are doing for you is speeding up your queries. If your queries run fast, don't worry about them.
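Following that rule of thumb, a hedged sketch against the hypothetical Orders table above:

-- Non-clustered index to speed up joins and WHERE filters on CustomerId.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId ON Orders (CustomerId);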
This is just a thumbnail sketch of a large topic. There are tons of more informative/comprehensive resources on this matter, and some depend on the database system ... just google it.