Partitioned table - is adding an index on the partition column unnecessary? - postgresql

We have a table partitioned on a date column.
Some of my colleagues believe that this means that column is automatically indexed. Having looked for evidence of this I don't believe that is so. Who is right?
The manual https://www.postgresql.org/docs/current/ddl-partitioning.html (section 5.11.2.1. Example) says:
Create an index on the key column(s), as well as any other indexes you
might want, on the partitioned table. (The key index is not strictly
necessary, but in most scenarios it is helpful.) This automatically
creates a matching index on each partition, and any partitions you
create or attach later will also have such an index. An index or
unique constraint declared on a partitioned table is “virtual” in the
same way that the partitioned table is: the actual data is in child
indexes on the individual partition tables.
This suggests to me we should create the index.
Each partition has ~350K rows. Since we often query by date range on that column would each partition get its own index? Or one massive one across all partitions?
Would adding an index on this column improve or degrade performance?

There is not automatically an index on the partition column.
If you did list partitioning and every list only contains one date (i.e. every date has its own partition) then I don't think also having an index on that column would be helpful. There is not extra information in the column beyond what the partitioning already knows about.
If you did range partitioning on quarter or year, but often query by a specific date, then the index would likely be useful as it provides a lot of extra specificity.

Related

Difference between BRIN index and table partitioning in PostgreSQL

What is the difference between a BRIN index and a table partition in PostgreSQL? When I should use one instead of another? It seems that they provide very similar benefits and also have similar use cases
Example
Suppose we have the following table structure
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
store_id INT,
client_id INT,
created_at timestamp,
information jsonb
)
that has the following characteristics:
orders can only be inserted, deletions are not allowed and updates are very rare and they don't involve the created_at column
the created_at column contains the timestamp of the insertion of the row in the database thus the values in the column are strictly increasing
almost every query use the created_at column in a condition and some of them may use the store_id and client_id columns
the most accessed rows are the most recent ones in terms of the created_at column
some queries may return a few records (example: analyzing a single record or the records created in a small time interval) while others may scan a vast amount of records (example: aggregate functions for a dashboard functionality)
I have chosen this example because it's very common and also both approach could be used (in my opinion). In this case which choice should I use between a BRIN index on the whole table or a partitioned table with maybe a btree index (or just a simple btree index without partitioning)? Does the table dimension influence the choice?
I have used both features (although I'll caveat that my experience with partitioning is from back when you had to use inheritance + constraints, before the introduction of CREATE TABLE ... PARTITION BY). You are correct that they seem similar-ish on a surface level, but they function by completely different mechanisms.
Table partitioning basically works as follows: replace all references to table with (select * from table_partition1 union all select * from table_partition2 /* repeat for all partitions */). The partitions will have a constraint on the partition columns, so that if those columns appear in a WHERE, the constraints can be applied up-front to prune which partitions are actually scanned. IOW, if table_partition1 has CHECK(client_id=1), and your WHERE Has client_id=2, table_partition1 will be skipped since the table constraint automatically excludes all rows from this partition from passing that WHERE.
BRIN indexes, in contrast, choose a block size for the table, and then for each block, records a min/max bound of the indexed column. This allows WHERE conditions to skip entire blocks when we can see, say, that the maximum created_at in a particular block of rows is below a created_at>={some_value} clause in your WHERE.
I can't tell you a definitive answer for your case as to which would work better. Well, that's not true, actually: the definitive answer is, "benchmark it for your own data" ;)
This is kind of fuzzy, but my general feeling is that BRIN is lightweight, and table partitioning is not. BRIN is something that can be added on to an existing table without much trouble, the indexes themselves are very small, and the impact on writes is not major (at least, not without inordinately many indices). Table partitioning, on the other hand, is a different way of representing the data on-disk; you are actually determining into which data files particular rows will be written. This requires a much more involved migration process when introducing it to an existing dataset.
However, the set of query optimizations available for table partitioning is much greater. Not only is there the constraint exclusion I described above, but you can also have indices (even BRIN ones!) on each individual partition. Of course, you can also have BRIN + other indices on a single-big-table, but I'm not sure that is particularly helpful IRL.
A few other thoughts: BRIN is good for monotonic data (timestamps, incremnting IDs, etc); the more correlated the on-disk ordering is to the indexed value, the more effective a BRIN index can be at pruning blocks to be scanned. Things like customer IDs, however, are unlikely to work well with BRIN; any given block of rows is likely to have at least one relatively low and relatively high ID. However, fields that like work quite well for partitioning: a partition-per-client, or partitioning on the modulus of a customer ID (which would more commonly be called sharding), is a good way of scaling horizontally, almost without bound.
Any update, even if it does not change the indexed column, will make a BRIN index pretty useless (unless it is a HOT update). Even without that, there are differences, for example:
partitioning allows you to get rid of lots of data efficiently, a BRIN index won't
a partitioned table allows one autovacuum worker per partition, which improves autovacuum performance
But if your only concern is to efficiently select all rows for a certain value of the index or partitioning key, both may offer about the same benefit.

Is there a difference between DynamoDB Local Secondary Index vs Global Secondary Index if the partition key is the same?

So currently I have a table that is
playerName:String (Partition)
playerAge:Number (Sort)
player_str_dex_int_luck:String (Local Secondary Index Sort Key)
I want to add a key to my table player_dex_str_int_luck:String to sort on while keeping playerName as my partition key.
I have to use a GSI for this since you cant create LSI after the table is made. Is there any difference between a GSI and LSI since I'm keeping the partition key of my GSI the same as that of the original table?
LSI - no extra cost, automatically have all attributes of the record. Automatically consistent with the table.
GSI - extra cost, must specify attributes you want in the GSI. For any others, have to read the table. "Updated in an eventually consistent fashion."
However, one benefit of using GSIs instead of LSIs, is that a table without LSI's can have partitions larger than 10GB.
Item Collection Size Limit
The maximum size of any item collection is
10 GB. This limit does not apply to tables without local secondary
indexes. Only tables that have one or more local secondary indexes are
affected.

Understanding indexes and performance as they relate to indexed column and non-indexed column data in the same row

I have some tables that are around 100 columns wide. I haven't normalized them because to put it back together would require almost 3 dozen joins and am not sure it would perform any better... haven't tested it yet (I will) so can't say for sure.
Anyway, that really isn't the question. I have been indexing columns in these tables that I know will be pulled frequently, so something like 50 indexes per table.
I got to thinking though. These columns will never be pulled by themselves and are meaningless without the primary key (basically an item number). The PK will always be used for the join and even in simple SELECT queries, it will have to be a specified column so the data makes sense.
That got me thinking further about indexes and how they work. As I understand them the locations of a values are committed to memory for that column so it is quickly found in a query.
For example, if you have:
SELECT itemnumber, expdate
FROM items;
And both itemnumber and expdate are indexed, is that excessive and really adding any benefit? Is it sufficient to just index itemnumber and the index will know that expdate, or anything else that is queried for that item, is on the same row?
Secondly, if multiple columns constitute a primary key, should the index include them grouped together, or is individually sufficient?
For example,
CREATE INDEX test_index ON table (pk_col1, pk_col2, pk_col3);
vs.
CREATE INDEX test_index1 ON table (pk_col1);
CREATE INDEX test_index2 ON table (pk_col2);
CREATE INDEX test_index3 ON table (pk_col3);
Thanks for clearing that up in advance!
Uh oh, there is a mountain of basics that you still have to learn.
I'd recommend that you read the PostgreSQL documentation and the excellent book “SQL Performance Explained”.
I'll give you a few pointers to get you started:
Whenever you create a PRIMARY KEY or UNIQUE constraint, PostgreSQL automatically creates a unique index over all the columns of that constraint. So you don't have to create that index explicitly (but if it is a multicolumn index, it sometimes is useful to create another index on any but the first column).
Indexes are relevant to conditions in the WHERE clause and the GROUP BY clause and to some extent for table joins. They are irrelevant for entries in the SELECT list. An index provides an efficient way to get the part of a table that satisfies a certain condition; an (unsorted) access to all rows of a table will never benefit from an index.
Don't sprinkle your schema with indexes randomly, since indexes use space and make all data modification slow.
Use them where you know that they will do good: on columns on which a foreign key is defined, on columns that appear in WHERE clauses and contain many different values, on columns where your examination of the execution plan (with EXPLAIN) suggests that you can expect a performance benefit.

Partitioning Oracle tables used for logging

I have an application that records activity in a table (Oracle 10g). The logging records should be kept for at least 30 days. I expect about 20 million rows to be added to this table every month.
The DBA suggested that the table be split in partitions containing one week of data. The weekly maintenance script would then delete the oldest partition (leaving only 4 weeks of data in the table).
What would be the best way of partitioning this logging table?
Partitioning a table isn't hard - it appears that you will be removing the data on a weekly basis, so the partition clauses will look like
PARTITION "P2009_45" VALUES LESS THAN
(TO_DATE(' 2009-11-02 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')),
PARTITION "P2009_46" VALUES LESS THAN
(TO_DATE(' 2009-11-09 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')),
... etc
where your partitioning column is your date column of interest in the table.
Additional comments:
If you can upgrade to 11g you can
take advantage of interval
partitioning, which is similar to
this range partitioning, but Oracle
will manage creating new partitions
for you.
If you're going to routinely drop off
partitions, I would advise making all
indexes on the table
locally-partitioned to avoid the
rebuilds that would be necessary with
global partitions after partition
operations.
If you have a good idea of the number
of log entries per month, and it
stays relatively constant, you might
consider using a sequence (as a primary key) that is
capped at this number and then
recycles back to 0. Then your
logging statements must become "MERGE
INTO... " statements that either
create a new row or overwrite the row
if it exists. This only guarantees
that you'll retain the number of rows
allowed by the sequence max value and
NOT a certain time interval, but this
might be an alternative to
partitioning (which as DvE points
out, is an extra-expense option)
The most likely partitioning scheme would be to range-partition your data on the creation date. Each week you would create a new partition and drop the oldest one. The impact will depend on how this table is used / indexed.
Since it is a logging table perhaps it is not indexed, in that case dropping a partition will have little impact: referencing objects won't be invalidated, the drop will be just require a partition lock (and the oldest partition shouldn't be inserted at that time).
If the table is indexed, you will have to decide if your indexes will be global or partitionned. Global indexes will have to be rebuilt when you drop a partition (which takes time, although 20M rows is still manageable). You can use the UPDATE GLOBAL INDEXES clause to keep the indexes valid after the partition drop.
Local indexes will be partitionned like the table and may be less efficient than global indexes (index range scans will have to scan each local index instead of a common index if you do not query by date). These indexes won't have to be updated after a partition drop.
20 million rows every month and you only have to keep 30 days of data? (That's about a months worth).
Even with 12 months worth of data it wouldn't be hard to query this table (as one big table) with the correct index.
Inserting is no problemen either with 1 row in the logging table or 20 million.
Partitioning in Oracle is also a feature that needs to be paid for if I'm correct, so it's costly too (if you don't have a license already).

Doubt in clustered and non Clustered index

I have a doubt that if my table do n't have any constraint like Primary Key,Foreign key,Unique key etc. then can i create the clustered index on table and clustered index can have the douplicate records ?
My 2nd question is where should we exectly use the non clustered index and when it is useful and benificial to create in table?
My 3rd question is How can we create the 249 non clustered index in a table .Is it the meaning, Creating the non clustered index on 249 columns ?
Can you anyone help me to remove my confusion in this.
First, the definition of a clustered index is that it is physical ordering of data on the disk. Every time you do an insert into that table, the new record will be placed on the physical disk in its order based on its value in the clustered index column. Because it is the physical location on the disk, it is (A) the most rapidly accessible column in the table but (B) only possible to define a single clustered index per table. Which column (or columns) you use as the clustered index depend on the data itself and its use. Primary keys are typically the clustered index, especially if the primary key is sequential (e.g. an integer that increments automatically with each insert). This will provide the fastest insert/update/delete functionality. If you are more interested in performing reads (select * from table), you may want to cluster on a Date column, as most queries have either a date in the where clause, the group by clause or both.
Second, clustered indexes (at least in the DB's I know) need not be unique (they CAN have duplicates). Constraining the column to be unique is separate matter. If the clustered index is a primary key its uniqueness is a function of being a primary key.
Third, I can't follow you questions concerning 249 columns. A non-clustered index is basically a tool for accelerating queries at the expense of extra disk space. It's hard to think of a case where creating an index on each column is necessary. If you want a quick rule of thumb...
Write a query using your table.
If a column is required to do a join, index it.
If a column is used in a where column, index it.
Remember all the indexes are doing for you is speeding up your queries. If queries run fast, don't worry about them.
This is just a thumbnail sketch of a large topic. There are tons of more informative/comprehensive resources on this matter, and some depend on the database system ... just google it.