I'm pretty new to PostgreSQL so apologies if I'm asking the obvious.
I've got a table called customer_products. It contains the following two indexes:
CREATE INDEX customer_products_customer_id
ON public.customer_products USING btree (customer_id);
CREATE UNIQUE INDEX customer_products_customer_id_product_id
ON public.customer_products USING btree (customer_id, product_id);
Are they both doing the same thing with respect to customer_id, or do they function differently? I'm not sure whether I should keep both or remove customer_products_customer_id.
There is nothing that the first index can do that the second cannot, so you should drop the first index.
The only advantage of the first index over the second when it comes to queries whose WHERE (or ORDER BY) clause involves customer_id only is that the index is smaller. That makes a range scan over many index entries somewhat faster.
The price for an extra index in terms of size and data modification speed usually outweighs that advantage. In a read-only data warehouse where I have a query that profits significantly I may be tempted to keep both indexes, otherwise I wouldn't.
You should definitely not drop the UNIQUE index, because it has a valuable use that has nothing to do with performance: it prevents the table from containing two rows that have the same values for the indexed columns. If that is what you want to guarantee, a UNIQUE index will make sure that your data stays in good shape.
Side remark: even though the effect is the same, it is better if the table has a unique constraint (which is backed by a unique index) than just having the index. If nothing else, it documents the purpose better.
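A minimal sketch of both suggestions, using the index names from the question. The constraint name customer_products_customer_id_product_id_key is made up for illustration; any name will do. ADD CONSTRAINT ... UNIQUE USING INDEX attaches the existing unique index to a new constraint instead of building a second one, and renames the index to match the constraint name.

```sql
-- Drop the redundant single-column index; the composite index
-- (customer_id, product_id) already serves lookups on customer_id.
DROP INDEX public.customer_products_customer_id;

-- Promote the existing unique index to a unique constraint.
-- The constraint name below is a hypothetical choice.
ALTER TABLE public.customer_products
    ADD CONSTRAINT customer_products_customer_id_product_id_key
    UNIQUE USING INDEX customer_products_customer_id_product_id;
```

If you want to see whether the smaller index was actually being used before dropping it, the idx_scan column of pg_stat_user_indexes is one place to look.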
I have a mapping table that looks like
group_id (int)
item_id (int)
There already exist two composite indexes: (group_id, item_id) and (item_id, group_id).
I'm finding that deleting all records for a group_id from the table is very slow (e.g. DELETE FROM table_name WHERE group_id = 1). From what I've read, and from what I see with EXPLAIN, the composite index with group_id as its leading column, (group_id, item_id), gets used even though there is no single-column index on group_id. I've seen people mention on here that you can get even better performance from a dedicated single-column index on the first column. How much of a performance benefit should I expect? Would it be a marginal improvement or
On a side note, I'm also curious whether it's the (item_id, group_id) index that is hurting delete performance by needing index cleanup.
A smaller index might help by fitting in cache more easily. But that would help when you are jumping all around the index reading only one row from each spot, not when you are reading a big chunk of adjacent index entries like you are here. Deletes don't incur direct index maintenance cost. They do create work for some future vacuum to clean up, but that doesn't seem to be what is happening here (and it is mostly independent of the number of columns in the index anyway). Whatever is slowing down your delete, it is not this. The biggest culprits for slow non-join deletes are triggers and FK constraints.
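One way to check for those culprits, sketched with the placeholder table_name from the question. EXPLAIN ANALYZE really executes the DELETE, so the sketch wraps it in a transaction that is rolled back. Time spent in triggers, including the internal triggers that enforce foreign keys, is reported on separate "Trigger ..." lines below the plan.

```sql
-- Measure the delete without keeping its effects.
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
DELETE FROM table_name WHERE group_id = 1;
ROLLBACK;

-- Foreign keys in other tables that reference this table.
-- Each deleted row triggers a lookup in the referencing table.
SELECT conname,
       conrelid::regclass AS referencing_table
FROM pg_constraint
WHERE contype = 'f'
  AND confrelid = 'table_name'::regclass;

-- User-defined triggers on the table itself.
SELECT tgname
FROM pg_trigger
WHERE tgrelid = 'table_name'::regclass
  AND NOT tgisinternal;
```

If a referencing table shows up and has no index on its referencing column, every deleted row forces a scan of that table, which is a common cause of slow deletes.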
What is the difference between a BRIN index and a table partition in PostgreSQL? When I should use one instead of another? It seems that they provide very similar benefits and also have similar use cases
Example
Suppose we have the following table structure
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
store_id INT,
client_id INT,
created_at timestamp,
information jsonb
);
that has the following characteristics:
orders can only be inserted, deletions are not allowed and updates are very rare and they don't involve the created_at column
the created_at column contains the timestamp of the insertion of the row in the database thus the values in the column are strictly increasing
almost every query uses the created_at column in a condition and some of them may use the store_id and client_id columns
the most accessed rows are the most recent ones in terms of the created_at column
some queries may return a few records (example: analyzing a single record or the records created in a small time interval) while others may scan a vast amount of records (example: aggregate functions for a dashboard functionality)
I have chosen this example because it's very common and because both approaches could be used (in my opinion). In this case, which should I choose: a BRIN index on the whole table, or a partitioned table with maybe a btree index (or just a simple btree index without partitioning)? Does the table size influence the choice?
I have used both features (although I'll caveat that my experience with partitioning is from back when you had to use inheritance + constraints, before the introduction of CREATE TABLE ... PARTITION BY). You are correct that they seem similar-ish on a surface level, but they function by completely different mechanisms.
Table partitioning basically works as follows: replace all references to table with (select * from table_partition1 union all select * from table_partition2 /* repeat for all partitions */). The partitions will have a constraint on the partition columns, so that if those columns appear in a WHERE, the constraints can be applied up-front to prune which partitions are actually scanned. IOW, if table_partition1 has CHECK(client_id=1), and your WHERE has client_id=2, table_partition1 will be skipped, since the table constraint automatically excludes every row in this partition from passing that WHERE.
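With the declarative syntax (PostgreSQL 10 and later), the same idea might look like the sketch below for the orders example. The table name orders_part, the partition names, and the monthly boundaries are all assumptions made for illustration. The primary key is left out on purpose: on a partitioned table it would have to include the partition column.

```sql
-- Parent table: holds no rows itself, only routes them to partitions.
CREATE TABLE orders_part (
    id          bigint    NOT NULL,
    store_id    int,
    client_id   int,
    created_at  timestamp NOT NULL,
    information jsonb
) PARTITION BY RANGE (created_at);

-- One partition per month; the upper bound is exclusive.
CREATE TABLE orders_part_2024_01 PARTITION OF orders_part
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE orders_part_2024_02 PARTITION OF orders_part
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

-- The plan should only touch the partition whose bounds can match.
EXPLAIN
SELECT count(*)
FROM orders_part
WHERE created_at >= '2024-02-10'
  AND created_at <  '2024-02-11';
```

Each partition is an ordinary table underneath, so it can carry its own btree or BRIN indexes.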
BRIN indexes, in contrast, divide the table into block ranges (groups of adjacent pages, 128 per range by default) and record, for each range, a min/max bound of the indexed column. This allows WHERE conditions to skip entire block ranges when we can see, say, that the maximum created_at in a particular range of rows is below a created_at>={some_value} clause in your WHERE.
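As a rough sketch on the orders table from the question (the index name is an assumption, and pages_per_range = 128 simply spells out the default):

```sql
-- One min/max summary of created_at per range of 128 heap pages.
CREATE INDEX orders_created_at_brin
    ON orders USING brin (created_at)
    WITH (pages_per_range = 128);

-- If the planner uses the index, the plan shows a Bitmap Index Scan
-- on orders_created_at_brin feeding a Bitmap Heap Scan; ranges whose
-- min/max cannot match the condition are never read.
EXPLAIN
SELECT *
FROM orders
WHERE created_at >= now() - interval '1 hour';
```

A smaller pages_per_range gives finer-grained skipping at the cost of a larger index; which value works best is something to measure on your own data.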
I can't tell you a definitive answer for your case as to which would work better. Well, that's not true, actually: the definitive answer is, "benchmark it for your own data" ;)
This is kind of fuzzy, but my general feeling is that BRIN is lightweight, and table partitioning is not. BRIN is something that can be added on to an existing table without much trouble, the indexes themselves are very small, and the impact on writes is not major (at least, not without inordinately many indices). Table partitioning, on the other hand, is a different way of representing the data on-disk; you are actually determining into which data files particular rows will be written. This requires a much more involved migration process when introducing it to an existing dataset.
However, the set of query optimizations available for table partitioning is much greater. Not only is there the constraint exclusion I described above, but you can also have indices (even BRIN ones!) on each individual partition. Of course, you can also have BRIN + other indices on a single-big-table, but I'm not sure that is particularly helpful IRL.
A few other thoughts: BRIN is good for monotonic data (timestamps, incrementing IDs, etc.); the more correlated the on-disk ordering is to the indexed value, the more effective a BRIN index can be at pruning blocks to be scanned. Things like customer IDs, however, are unlikely to work well with BRIN; any given block of rows is likely to have at least one relatively low and one relatively high ID. However, fields like that work quite well for partitioning: a partition-per-client, or partitioning on the modulus of a customer ID (which would more commonly be called sharding), is a good way of scaling horizontally, almost without bound.
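The planner's own statistics give a quick indication of how well a column's values follow the on-disk order. A small sketch for the orders table from the question:

```sql
-- Refresh the statistics first.
ANALYZE orders;

-- correlation close to 1 or -1: on-disk order tracks the column
-- closely (a good BRIN candidate); close to 0: values are scattered.
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'orders'
  AND attname IN ('created_at', 'client_id');
```

For an insert-only table you would expect created_at to come out near 1 and client_id much lower, but that is worth confirming rather than assuming.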
Any update, even if it does not change the indexed column, will make a BRIN index pretty useless (unless it is a HOT update). Even without that, there are differences, for example:
partitioning allows you to get rid of lots of data efficiently, a BRIN index won't
a partitioned table allows one autovacuum worker per partition, which improves autovacuum performance
But if your only concern is to efficiently select all rows for a certain value of the index or partitioning key, both may offer about the same benefit.
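To illustrate the first point, here is a sketch of removing a month of data from a range-partitioned table. The table and partition names are hypothetical.

```sql
-- Hypothetical setup: a table partitioned by month.
CREATE TABLE orders_by_month (
    id         bigint    NOT NULL,
    created_at timestamp NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE orders_by_month_2024_01 PARTITION OF orders_by_month
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- Detach the old month; it becomes a standalone table that can be
-- archived or inspected independently of the parent.
ALTER TABLE orders_by_month DETACH PARTITION orders_by_month_2024_01;

-- Dropping it removes the data files directly, with no row-by-row
-- DELETE and no dead tuples left behind for vacuum.
DROP TABLE orders_by_month_2024_01;
```

On a single big table with a BRIN index, the equivalent would be a DELETE with a created_at condition, which has to visit every affected row and leaves the cleanup to vacuum.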
I want to improve the performance of queries on a table in a PostgreSQL database that I need to use.
CREATE TABLE mytable (
article_number text NOT NULL,
description text NOT NULL,
feature text NOT NULL,
...
);
The table is just an example, but the point is that there are no unique columns. article_number is the column used in the WHERE clause, but a single value such as article_number='000.002-00A' can match anywhere from 3 to 300 rows. The total number of rows is 102,165,920. What would be the best index to use in such a situation?
I know there are B-tree, Hash, GiST, SP-GiST, GIN and BRIN index types in Postgres, but which one would be best for this?
If the lookups are filtered on article_number then an index should be created on that. Not quite sure what else you're asking.
The default index is a btree and that'll work fine. If you're only checking for strict equality hash would also be an option but it has issues before Postgres 10, so I wouldn't recommend it.
Other index types are for more complicated forms of querying or custom data types, there's no reason to even consider them if you just want to perform equality filters.
btrees are useful for strict equality and range searches (which includes prefix search e.g. foo like 'bar%')
hash indexes are useful only for strict equality; they can be faster and smaller than btrees in some rare cases
GIN indexes are useful when you have multiple index values per row (arrays, json, gis, some FTS cases)
GiST indexes are useful for more complex querying than equality and range (geom/gis, FTS)
I've never looked into BRIN indexes, so I'm not sure what their use case would be. But my understanding is that there's no reason to even consider them before you have huge numbers of rows.
Basically, use btree unless you know that you can not.
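For the table in the question, that could be as simple as the following sketch. The index names are made up; the second index is only relevant if you also need prefix searches with LIKE and the database does not use the C locale.

```sql
-- Default btree index: handles equality (and range) filters on
-- article_number, including values that match many rows.
CREATE INDEX mytable_article_number_idx
    ON mytable (article_number);

-- Check that the planner uses it for the typical lookup.
EXPLAIN
SELECT *
FROM mytable
WHERE article_number = '000.002-00A';

-- Optional: in a non-C locale, LIKE 'prefix%' needs an index built
-- with the text_pattern_ops operator class.
CREATE INDEX mytable_article_number_prefix_idx
    ON mytable (article_number text_pattern_ops);
```

On a table with about 100 million rows that is in active use, CREATE INDEX CONCURRENTLY avoids blocking writes during the build, at the cost of a slower build.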
I have some tables that are around 100 columns wide. I haven't normalized them because to put it back together would require almost 3 dozen joins and am not sure it would perform any better... haven't tested it yet (I will) so can't say for sure.
Anyway, that really isn't the question. I have been indexing columns in these tables that I know will be pulled frequently, so something like 50 indexes per table.
I got to thinking though. These columns will never be pulled by themselves and are meaningless without the primary key (basically an item number). The PK will always be used for the join and even in simple SELECT queries, it will have to be a specified column so the data makes sense.
That got me thinking further about indexes and how they work. As I understand them, the locations of the values in a column are committed to memory so that they can be found quickly in a query.
For example, if you have:
SELECT itemnumber, expdate
FROM items;
And both itemnumber and expdate are indexed, is that excessive and really adding any benefit? Is it sufficient to just index itemnumber and the index will know that expdate, or anything else that is queried for that item, is on the same row?
Secondly, if multiple columns constitute a primary key, should the index include them grouped together, or is individually sufficient?
For example,
CREATE INDEX test_index ON table (pk_col1, pk_col2, pk_col3);
vs.
CREATE INDEX test_index1 ON table (pk_col1);
CREATE INDEX test_index2 ON table (pk_col2);
CREATE INDEX test_index3 ON table (pk_col3);
Thanks for clearing that up in advance!
Uh oh, there is a mountain of basics that you still have to learn.
I'd recommend that you read the PostgreSQL documentation and the excellent book “SQL Performance Explained”.
I'll give you a few pointers to get you started:
Whenever you create a PRIMARY KEY or UNIQUE constraint, PostgreSQL automatically creates a unique index over all the columns of that constraint. So you don't have to create that index explicitly (but if it is a multicolumn index, it is sometimes useful to create another index on any column but the first).
Indexes are relevant to conditions in the WHERE clause and the GROUP BY clause and to some extent for table joins. They are irrelevant for entries in the SELECT list. An index provides an efficient way to get the part of a table that satisfies a certain condition; an (unsorted) access to all rows of a table will never benefit from an index.
Don't sprinkle your schema with indexes randomly, since indexes use space and make all data modification slower.
Use them where you know that they will do good: on columns on which a foreign key is defined, on columns that appear in WHERE clauses and contain many different values, on columns where your examination of the execution plan (with EXPLAIN) suggests that you can expect a performance benefit.
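A small sketch that ties these pointers together. The table items_demo and its columns are hypothetical and borrow the column names from the question.

```sql
-- The PRIMARY KEY constraint creates a unique index on
-- (pk_col1, pk_col2, pk_col3) automatically.
CREATE TABLE items_demo (
    pk_col1 int,
    pk_col2 int,
    pk_col3 int,
    expdate date,
    PRIMARY KEY (pk_col1, pk_col2, pk_col3)
);

-- List the indexes that exist on the table, including the
-- automatically created one.
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'items_demo';

-- A condition on pk_col2 alone cannot use the leading column of the
-- primary key index efficiently; if such queries are frequent,
-- a separate index can pay off.
CREATE INDEX items_demo_pk_col2_idx ON items_demo (pk_col2);

-- No WHERE clause: every row is read, so no index helps here,
-- whether or not expdate is indexed.
EXPLAIN SELECT pk_col1, expdate FROM items_demo;

-- A selective condition is where an index becomes relevant.
EXPLAIN SELECT expdate FROM items_demo WHERE pk_col2 = 42;
```

On an empty or tiny table the planner may still choose a sequential scan for the last query, so run the EXPLAIN statements against realistic data before drawing conclusions.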
Do clustered and non-clustered indexes both work on B-Trees? I read that clustered indexes affect the way the data is physically stored in the table, whereas with non-clustered indexes a separate copy of the column is created and stored in sorted order. Also, SQL Server creates a clustered index on the primary key by default.
Does that mean:
1) Non-clustered indexes occupy more space than clustered indexes, since a separate copy of the column is stored for a non-clustered index?
2) How do clustered and non-clustered indexes work when we have a primary key based on two columns, say (StudentName, Marks)?
3) Are there only 2 types of indexes? If so, then what are bitmap indexes? I can't seem to find any such index type in SQL Server Management Studio, but in my data warehousing book all these types are mentioned.
4) Is creating a clustered or non-clustered index on the primary key efficient?
5) Suppose we create a clustered index on name, i.e. the data is physically stored sorted by name, and then a new record is created. How will the new record find its place in the table?
Thanks in advance :)
Indexes are structures stored separately from the actual datapages and simply contain pointers to the datapages. In SQL Server indexes are B-Trees.
Clustered indexes sort and store the datapages in the table according to the columns defined for the index. In SQL Server 2005 you can add additional columns to an index, so it should not be a problem when you have composite primary keys. You can think of a clustered index like a set of filing cabinets with folders. In the first drawer you have documents starting with A, and in the first folder of that drawer you may have documents starting from AA to AC and so on. To search for "Spider", then, you can jump straight to the S drawer, look for the folder containing "SP" and quickly find what you are looking for. But it is obvious that if you sort all documents physically by one index, then you cannot physically sort the same set of documents by another index. Hence, only one clustered index per table.
A Non Clustered index is a separate structure much like the table of contents or the index at the back of a book. So I think I have only answered some of your questions specifically:
Yes the index does occupy space but not as much as the original table. That is why you must choose your indexes carefully. There is also a small performance hit for update operations since the index has to be maintained.
Your book will mention all the theoretical types of indexes. Bitmap indexes are useful in data warehousing applications or for data that has a few distinct values like days of the week etc. So they are not generally used in your basic RDBMS. I know that Oracle has some implementations but I don't know much about that.
I think that the efficiency of an index is determined by how the field is used. If it is expected that the majority of the data scanning in your table will be done on the primary key, then an index on the primary key makes sense. You usually add indexes to columns that appear in the WHERE clause or the join condition of your queries.
On insert the index has to be maintained, so there is a little extra work that has to be done by the system to rearrange things a bit.
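A short T-SQL sketch of the two kinds of index, using the (StudentName, Marks) key from the question. The table, constraint and index names are made up for illustration.

```sql
-- The composite primary key is backed by the clustered index:
-- rows are stored in (StudentName, Marks) order.
CREATE TABLE Students (
    StudentName varchar(100) NOT NULL,
    Marks       int          NOT NULL,
    CONSTRAINT PK_Students PRIMARY KEY CLUSTERED (StudentName, Marks)
);

-- A non-clustered index is a separate B-Tree sorted by Marks whose
-- entries point back to the rows in the clustered index.
CREATE NONCLUSTERED INDEX IX_Students_Marks
    ON Students (Marks);

-- A new row is placed on the page where its key belongs in sort
-- order; if that page is full, SQL Server splits it.
INSERT INTO Students (StudentName, Marks) VALUES ('Spider', 87);
```

Declaring the key as PRIMARY KEY NONCLUSTERED instead would leave the clustered slot free for a different column, which is a design choice that depends on how the table is mostly queried.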