I have a relation that I'll be querying on two columns, e.g. findByXIdAndStatus:
select * from X where XId = '12345' and status = 'INACTIVE'
Status is a column that holds two values, ACTIVE or INACTIVE. Is it sufficient to create an index on the XId column, do I need a composite index on both XId and status, or should I create separate indexes on the two columns? I am currently using a Postgres DB.
The most performant index would cover the entire WHERE clause, and so would be on (XId, status). However, if the XId column alone is already very selective, Postgres might use a single-column index on just XId; in that case, it has to fetch the row from the table (the heap; Postgres has no clustered indexes) to check the status value and complete the filtering. In any case, you can try both types of indexes, check the EXPLAIN plans, and then decide which option to use.
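For instance, a minimal sketch of comparing the two options (index names are illustrative):
-- Composite index covering the whole WHERE clause:
CREATE INDEX x_xid_status_idx ON x (xid, status);
-- Single-column alternative (create one at a time so the planner's choice is clear):
CREATE INDEX x_xid_idx ON x (xid);
-- Compare the plans:
EXPLAIN ANALYZE SELECT * FROM x WHERE xid = '12345' AND status = 'INACTIVE';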
The single-column index on xid is probably good enough.
If only a small percentage of the rows are inactive, a partial index would be beneficial:
CREATE INDEX ON x (xid) WHERE status = 'INACTIVE';
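Note that the planner can only use this partial index for queries whose WHERE clause implies status = 'INACTIVE', such as the query above.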
The order in which the records come back is not guaranteed unless I use an ORDER BY clause.
If I throw a clustered index on a table and then do a select top 100, for example, would the 100 rows returned always be the same?
I am asking this because a clustered index sorts the data physically on the key value.
I am led to believe so from my observations, but wanted to see what others thought.
No. The rule is simple: SQL tables and result sets represent unordered sets. The only exception is a result set associated with a query that has an ORDER BY in the outermost SELECT.
A clustered index affects how data is stored on each page. However, it does not guarantee that a result set built on that table will even use the clustered index.
Consider a table that has a primary, clustered key on id and a query that returns:
select top (100) othercol
from t;
This query could use an index on othercol -- avoiding the clustered index altogether.
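If you need a deterministic result, a minimal sketch of the fix (assuming id is the clustered primary key from the example):
select top (100) othercol
from t
order by id;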
I need to create a varchar category column in a table and search for rows that belong to a particular category,
i.e. ALTER TABLE items ADD COLUMN category VARCHAR(30)
The number of categories is very small (repeated across the table)
and the intention is to only use = in the where clause.
i.e. select * from items where category = 'food'
What kind of index would be ideal in postgres?
Especially since the table is never expected to be very big (always fewer than 5,000 rows).
This is a textbook use case for a hash index: you have a very small number of distinct values and only use the equality operator to query them. A hash index lets you index a relatively small hash of the value, which allows for faster querying.
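A minimal sketch (the index name is illustrative):
CREATE INDEX items_category_hash_idx ON items USING HASH (category);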
I have some tables that are around 100 columns wide. I haven't normalized them because putting them back together would require almost three dozen joins, and I am not sure it would perform any better... I haven't tested it yet (I will), so I can't say for sure.
Anyway, that really isn't the question. I have been indexing columns in these tables that I know will be pulled frequently, so something like 50 indexes per table.
I got to thinking though. These columns will never be pulled by themselves and are meaningless without the primary key (basically an item number). The PK will always be used for the join and even in simple SELECT queries, it will have to be a specified column so the data makes sense.
That got me thinking further about indexes and how they work. As I understand it, an index records the locations of the values in a column so that they can be found quickly in a query.
For example, if you have:
SELECT itemnumber, expdate
FROM items;
And both itemnumber and expdate are indexed, is that excessive, or is it really adding any benefit? Is it sufficient to just index itemnumber, and will the index know that expdate, or anything else queried for that item, is on the same row?
Secondly, if multiple columns constitute a primary key, should the index include them grouped together, or are individual indexes sufficient?
For example,
CREATE INDEX test_index ON table (pk_col1, pk_col2, pk_col3);
vs.
CREATE INDEX test_index1 ON table (pk_col1);
CREATE INDEX test_index2 ON table (pk_col2);
CREATE INDEX test_index3 ON table (pk_col3);
Thanks for clearing that up in advance!
Uh oh, there is a mountain of basics that you still have to learn.
I'd recommend that you read the PostgreSQL documentation and the excellent book “SQL Performance Explained”.
I'll give you a few pointers to get you started:
Whenever you create a PRIMARY KEY or UNIQUE constraint, PostgreSQL automatically creates a unique index over all the columns of that constraint, so you don't have to create that index explicitly (but if it is a multicolumn index, it is sometimes useful to create another index on a column other than the first, as sketched after these pointers).
Indexes are relevant to conditions in the WHERE clause and the GROUP BY clause and to some extent for table joins. They are irrelevant for entries in the SELECT list. An index provides an efficient way to get the part of a table that satisfies a certain condition; an (unsorted) access to all rows of a table will never benefit from an index.
Don't sprinkle your schema with indexes randomly, since indexes use space and make all data modification slow.
Use them where you know that they will do good: on columns on which a foreign key is defined, on columns that appear in WHERE clauses and contain many different values, on columns where your examination of the execution plan (with EXPLAIN) suggests that you can expect a performance benefit.
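To illustrate the first pointer, a minimal sketch (table and column names are hypothetical):
-- The PRIMARY KEY constraint automatically creates a unique index on (col1, col2).
CREATE TABLE t (col1 integer, col2 integer, payload text, PRIMARY KEY (col1, col2));
-- A query filtering only on col2 cannot use that index efficiently,
-- so an additional index on the non-leading column can pay off:
CREATE INDEX t_col2_idx ON t (col2);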
I want to remove duplicates from a large table that has about 1 million rows and is growing every hour. It has no unique id and about 575 columns, sparsely filled.
The table is 'like' a log table where new entries are appended every hour without a unique timestamp.
The duplicates are only about 1-3%, but I want to remove them anyway ;) Any ideas?
I tried the ctid column (as here), but it's very slow.
The basic idea that works generally well with PostgreSQL is to create an index on the hash of the set of columns as a whole.
Example:
CREATE INDEX index_name ON tablename (md5((tablename.*)::text));
This will work unless there are columns that don't play well with the requirement of immutability (mostly timestamp with time zone, whose cast-to-text value is session-dependent).
Once this index is created, duplicates can be found quickly by self-joining with the hash, with a query looking like this:
SELECT t1.ctid, t2.ctid
FROM tablename t1 JOIN tablename t2
ON (md5((t1.*)::text) = md5((t2.*)::text))
WHERE t1.ctid > t2.ctid;
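To actually delete the duplicates, a hedged sketch building on that query (it keeps the copy with the smallest ctid and assumes no md5 collisions):
DELETE FROM tablename t1
USING tablename t2
WHERE md5((t1.*)::text) = md5((t2.*)::text)
  AND t1.ctid > t2.ctid;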
You may also use this index to avoid duplicate rows in the future rather than periodically de-duplicating them, by making it UNIQUE (duplicate rows would then be rejected at INSERT or UPDATE time).
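For example, a sketch of that unique variant (the index name is illustrative):
CREATE UNIQUE INDEX tablename_row_md5_uniq ON tablename (md5((tablename.*)::text));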
I'm not really sure how multiple indexes involving the same column work.
So let's say I have an id column and a country column, and I have an index on id and another index on id and country. When I look at my query plan, it looks like it's using both of those indexes. I was just wondering how that works? Can I force it to use just the (id, country) index?
Also, is it bad practice to do that? When is it a good idea to index the same column multiple times?
It is common to have indexes on both (id) and (country, id), or alternatively (country) and (country, id), if you have queries that benefit from each of them. You might also have (id) and (id, country) if you want the "covering" index on (id, country) to support index-only scans but still need the stand-alone index to enforce a unique constraint.
In theory you could just have (id,country) and still use it to enforce uniqueness of id, but PostgreSQL does not support that at this time.
You could also sensibly have different indexes on the same column if you need to support different collations or operator classes.
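A minimal sketch of that last case (table and index names are illustrative):
-- A plain b-tree index for equality and ordering:
CREATE INDEX t_name_idx ON t (name);
-- A second index on the same column using the text_pattern_ops operator class,
-- which supports prefix searches such as WHERE name LIKE 'abc%':
CREATE INDEX t_name_pattern_idx ON t (name text_pattern_ops);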
If you want to force PostgreSQL not to use a particular index, to see what happens with it gone, you can drop it in a transaction and then roll the transaction back when done:
BEGIN;
drop index table_id_country_idx;
explain analyze select * from ....;
ROLLBACK;