In PostgreSQL 12, does partitioning via inheritance improve query performance if queries are contained within a child table?

Using PostgreSQL 12, I'd like to take advantage of partitioning to 1: aid query performance, 2: allow removing historic data more easily to mitigate database growth.
Unfortunately, declarative partitioning requires the partition key to be part of the primary key. A temporal field in the primary key doesn't work well for my model -- so I'm exploring using inheritance instead (as per the docs).
My question is whether this approach will similarly limit the number of rows that my SELECT statement is exposed to when a condition in my WHERE clause restricts the results to a single child table.
e.g.
Books => BooksJan2020, BooksFeb2020, BooksMar2020.
SELECT * FROM Books WHERE created < '2020-01-20' AND author LIKE 'John%';
In declarative partitioning, I would expect the LIKE predicate to only be applied to rows within the January table. Can I expect the same with inheritance? When studying how to create inherited tables, I don't see a mechanism that would tell the planner which child tables to scan.
SteveJ

You can do that by creating the appropriate check constraints on the inheritance children and leaving constraint_exclusion at its default value of partition.
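A minimal sketch of that setup, assuming the schema implied by the question (column types are guesses):
CREATE TABLE Books (
    id      bigint,
    created timestamp NOT NULL,
    author  text
);
CREATE TABLE BooksJan2020 (
    CHECK (created >= '2020-01-01' AND created < '2020-02-01')
) INHERITS (Books);
-- With constraint_exclusion = partition, the planner compares each child's
-- CHECK constraint against the WHERE clause and skips children that cannot
-- contain matching rows:
SELECT * FROM Books WHERE created < '2020-01-20' AND author LIKE 'John%';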
But I want to dissuade you from using anything but declarative partitioning in v12. Partitioning by inheritance hurts. Besides, you cannot get a true primary key on anything that does not contain the partitioning key that way: even though you have a primary key on all partitions, nothing can prevent you from inserting the same key in different partitions.
My advice is to go with a primary key on (id, created). True, that does not guarantee global uniqueness of id, but it goes a long way towards that goal. With values generated from a single sequence, the risk of duplicates is marginal.
The remaining downside of a composite primary key is that you have to include both columns in any table that has a foreign key constraint to the partitioned table, but I'd say that is the price you pay for the advantages of partitioning. Besides, with inheritance partitioning you couldn't have foreign keys pointing to the partitioned table at all.
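For comparison, a sketch of the suggested design with declarative partitioning and the composite key (names and types assumed):
CREATE TABLE Books (
    id      bigint GENERATED ALWAYS AS IDENTITY,
    created timestamp NOT NULL,
    author  text,
    PRIMARY KEY (id, created)
) PARTITION BY RANGE (created);
CREATE TABLE BooksJan2020 PARTITION OF Books
    FOR VALUES FROM ('2020-01-01') TO ('2020-02-01');
-- Removing a month of history later is then simply:
-- DROP TABLE BooksJan2020;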

Related

How do you manage UPSERTs on PostgreSQL partitioned tables for unique constraints on columns outside the partitioning strategy?

This question is for a database using PostgreSQL 12.3; we are using declarative partitioning and ON CONFLICT against the partitioned table is possible.
We had a single table representing application event data from client activity. Each row has a client_id int4 field and a dttm timestamp field. There is also an event_id text field and a project_id int4 field, which together formed the basis of a composite primary key. (While rare, it was possible for two event records to have the same event_id but different project_id values for the same client_id.)
The table became non-performant, and we saw that queries most often targeted a single client in a specific timeframe. So we shifted the data into a partitioned table: first by LIST (client_id) and then each partition is further partitioned by RANGE(dttm).
We are running into problems shifting our upsert strategy to work with this new table. We used to perform a query of INSERT INTO table SELECT * FROM staging_table ON CONFLICT (event_id, project_id) DO UPDATE ...
But since the columns that determine uniqueness (event_id and project_id) are not part of the partitioning strategy (dttm and client_id), I can't do the same thing with the partitioned table. I thought I could get around this by building UNIQUE indexes on each partition on (project_id, event_id), but ON CONFLICT still does not fire because there is no such unique index on the parent table (there can't be, since it doesn't contain all the partitioning columns). So now a single upsert query appears impossible.
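To make the shape concrete, here is a minimal sketch (partition bounds and names are assumptions; the statement PostgreSQL rejects is shown commented out):
CREATE TABLE events (
    client_id  int4      NOT NULL,
    dttm       timestamp NOT NULL,
    event_id   text      NOT NULL,
    project_id int4      NOT NULL
) PARTITION BY LIST (client_id);
CREATE TABLE events_c1 PARTITION OF events
    FOR VALUES IN (1) PARTITION BY RANGE (dttm);
CREATE TABLE events_c1_2020_06 PARTITION OF events_c1
    FOR VALUES FROM ('2020-06-01') TO ('2020-07-01');
-- Rejected: a unique index on a partitioned table must include all
-- partition key columns (client_id, dttm):
-- CREATE UNIQUE INDEX ON events (event_id, project_id);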
I've found two solutions so far but both require additional changes to the upsert script that seem like they'd be less performant:
I can still do an INSERT INTO table_partition_subpartition ... ON CONFLICT (event_id, project_id) DO UPDATE ... but that requires explicitly determining the name of the partition for each row instead of just INSERT INTO table ... once for the entire dataset.
I could implement the "old way" UPSERT procedure: https://www.postgresql.org/docs/9.4/plpgsql-control-structures.html#PLPGSQL-UPSERT-EXAMPLE but this again requires looping through all rows.
Is there anything else I could do to retain the cleanliness of a single, one-and-done INSERT INTO table SELECT * FROM staging_table ON CONFLICT () DO UPDATE ... while still keeping the partitioning strategy as-is?
Edit: if it matters, concurrency is not an issue here; there's just one machine executing the UPSERT into the main table from the staging table on a schedule.

Difference between BRIN index and table partitioning in PostgreSQL

What is the difference between a BRIN index and table partitioning in PostgreSQL? When should I use one instead of the other? It seems that they provide very similar benefits and have similar use cases.
Example
Suppose we have the following table structure
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
store_id INT,
client_id INT,
created_at timestamp,
information jsonb
);
that has the following characteristics:
orders can only be inserted; deletions are not allowed, updates are very rare, and they never involve the created_at column
the created_at column contains the timestamp of the row's insertion into the database, so the values in the column are strictly increasing
almost every query uses the created_at column in a condition, and some may also use the store_id and client_id columns
the most accessed rows are the most recent ones in terms of the created_at column
some queries may return a few records (example: analyzing a single record or the records created in a small time interval) while others may scan a vast amount of records (example: aggregate functions for a dashboard functionality)
I have chosen this example because it's very common and both approaches could be used (in my opinion). In this case, which should I choose: a BRIN index on the whole table, a partitioned table with perhaps a btree index, or just a simple btree index without partitioning? Does the table size influence the choice?
I have used both features (although I'll caveat that my experience with partitioning is from back when you had to use inheritance + constraints, before the introduction of CREATE TABLE ... PARTITION BY). You are correct that they seem similar-ish on a surface level, but they function by completely different mechanisms.
Table partitioning basically works as follows: replace all references to table with (select * from table_partition1 union all select * from table_partition2 /* repeat for all partitions */). The partitions will have a constraint on the partition columns, so that if those columns appear in a WHERE, the constraints can be applied up-front to prune which partitions are actually scanned. IOW, if table_partition1 has CHECK(client_id=1), and your WHERE has client_id=2, table_partition1 will be skipped, since the table constraint automatically excludes all rows of this partition from passing that WHERE.
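If orders were partitioned on client_id, you could watch this happen with EXPLAIN (a hypothetical sketch; the partitioned setup is assumed):
EXPLAIN SELECT * FROM orders WHERE client_id = 2;
-- The plan lists only the partitions whose constraints are compatible
-- with client_id = 2; all others are pruned before execution.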
BRIN indexes, in contrast, choose a block range size for the table and then, for each range, record the min/max bounds of the indexed column. This allows WHERE conditions to skip entire block ranges when we can see, say, that the maximum created_at in a particular range of rows is below the created_at >= {some_value} clause in your WHERE.
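For the example table, a BRIN index is a one-liner; pages_per_range controls the block range size described above (128 heap pages if omitted):
CREATE INDEX orders_created_at_brin ON orders USING brin (created_at)
    WITH (pages_per_range = 128);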
I can't give you a definitive answer as to which would work better for your case. Well, that's not true, actually: the definitive answer is, "benchmark it for your own data" ;)
This is kind of fuzzy, but my general feeling is that BRIN is lightweight, and table partitioning is not. BRIN is something that can be added on to an existing table without much trouble, the indexes themselves are very small, and the impact on writes is not major (at least, not without inordinately many indices). Table partitioning, on the other hand, is a different way of representing the data on-disk; you are actually determining into which data files particular rows will be written. This requires a much more involved migration process when introducing it to an existing dataset.
However, the set of query optimizations available for table partitioning is much greater. Not only is there the constraint exclusion I described above, but you can also have indices (even BRIN ones!) on each individual partition. Of course, you can also have BRIN + other indices on a single-big-table, but I'm not sure that is particularly helpful IRL.
A few other thoughts: BRIN is good for monotonic data (timestamps, incrementing IDs, etc.); the more correlated the on-disk ordering is to the indexed value, the more effective a BRIN index can be at pruning blocks to be scanned. Things like customer IDs, however, are unlikely to work well with BRIN; any given block of rows is likely to have at least one relatively low and relatively high ID. However, fields like that work quite well for partitioning: a partition per client, or partitioning on the modulus of a customer ID (which would more commonly be called sharding), is a good way of scaling horizontally, almost without bound.
Any update, even if it does not change the indexed column, will make a BRIN index pretty useless (unless it is a HOT update). Even without that, there are differences, for example:
partitioning allows you to get rid of lots of data efficiently; a BRIN index won't help with that (see the sketch below)
a partitioned table allows one autovacuum worker per partition, which improves autovacuum performance
But if your only concern is to efficiently select all rows for a certain value of the index or partitioning key, both may offer about the same benefit.
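As an illustration of the data-removal point (hypothetical partition name):
-- Partitioned table: dropping a month of history is a cheap metadata change.
DROP TABLE orders_2010_04;
-- Single big table with a BRIN index: every row must be deleted and
-- the space reclaimed by vacuum afterwards.
DELETE FROM orders WHERE created_at < '2010-05-01';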

Is there a way to reserve a range in a postgres sequence?

I'm writing a program that generates large numbers of rows to be inserted into a PostgreSQL database. Due to the presence of multiple indices, this process is getting slower over time. That's why I want to move to using COPY which seems to be significantly faster. The problem is that one of the tables has a foreign key into another, and I do not have the IDs for the foreign key column.
I was thinking that maybe if I could reserve a range in the sequence used for the primary key of the first table, I could do the ID assignment manually, but I don't think Postgres natively supports such an operation. Is there another way to achieve this?
First, identify from your source data the business key for the parent and child tables. Create those tables with a unique constraint on the business key. This is not the surrogate (auto-generated) PK. Now create a staging table with all the columns necessary (except the FK). Since you will most likely be using it across sessions, this is a permanent table, but the intent is single-time usage. With this, insert into the parent table, generating the PK from the sequence. Then insert into the child, selecting the FK by referencing the business key from the parent.
insert into parent ( <columns> )
select <columns>
from stage
on conflict (business_key) do nothing;
insert into child ( <columns>, parent_id )
select s.<columns>, a.id
from stage s
join parent a on s.business_key = a.business_key
on conflict (parent_id, child_bk) do nothing;
Since the above is rather obscure in the abstract, see a concrete example here. There is no need to attempt to "reserve a range"; just let the PK/FK develop naturally.
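In that spirit, here is a self-contained sketch under assumed names (author as parent, book as child; name and (author_id, title) as business keys):
create table author (
    id   bigint generated always as identity primary key,
    name text not null unique  -- business key
);
create table book (
    id        bigint generated always as identity primary key,
    author_id bigint not null references author,
    title     text not null,
    unique (author_id, title)  -- business key
);
create table stage (
    author_name text not null,
    title       text not null
);
insert into author (name)
select distinct author_name
from stage
on conflict (name) do nothing;
insert into book (author_id, title)
select a.id, s.title
from stage s
join author a on a.name = s.author_name
on conflict (author_id, title) do nothing;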

cassandra 2.0.9: best practices for write-heavy columns

I am a little confused by clustering in Cassandra. I have an application that is very write-heavy and update-heavy. With a traditional relational database, I'd partition data into two tables: one table for data that changes infrequently; and one table (with shorter rows) for the columns that change frequently:
For example:
create table user_def ( id int primary key, email list<varchar> );  -- stable
create table user_var ( id int primary key, state int );            -- changes all the time
But Cassandra seems to be optimized for accessing sparsely-populated columns, so I'm not sure there is any advantage in mimicking this approach for Cassandra schemas.
With Cassandra, is there any advantage in separating frequently-updated columns to a separate table/column-family (away from infrequently-updated columns) or should I combine all the columns together into one table/column-family? Do circumstances change if I have a compound primary key and clustering comes into play?
Cassandra treats primary keys like this:
The first key in the primary key (which can be a composite) is used to partition your data. This determines which node(s) your data is saved on (and replicated to). The other fields in the primary key are then used to sort entries within a partition. A partition always lives on one node (and its replica nodes) in its entirety, and entries within a partition are sorted by those "other" fields. [The first element of the primary key is called the partition key, while the remaining fields are called clustering keys.]
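A small sketch of a compound key (hypothetical table):
create table user_events (
    user_id    int,        -- partition key: picks the node
    event_time timestamp,  -- clustering key: sort order inside the partition
    state      int,
    primary key (user_id, event_time)
);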
Based on that, I'd say you might as well simply have a single table with id, state, and email. It looks like you're using skinny rows, and I don't think you'd gain anything (if at all) by creating the separate tables.
I had accepted ashic's answer until I came upon this:
http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets
which states (for delete-heavy access):
...consider partitioning data with heavy churn rate into separate rows and deleting the entire rows when you no longer need them. Alternatively, partition it into separate tables and truncate them when they aren’t needed anymore...
This falls under the 'queue' anti-pattern for the product.

When to use inherited tables in PostgreSQL?

In which situations should you use inherited tables? I tried to use them very briefly, and inheritance didn't seem to work like it does in the OOP world.
I thought it worked like this:
Table users has all fields required for all user levels. Tables like moderators, admins, bloggers, etc. inherit from it, but fields are not checked against the parent. For example, users has an email field, and the inherited bloggers table now has it too, but it's not unique across users and bloggers at the same time, i.e. it's the same as adding an email field to both tables separately.
The only usage I could think of is for commonly shared fields, like row_is_deleted, created_at, modified_at. Is this the only use for inherited tables?
There are some major reasons for using table inheritance in postgres.
Let's say, we have some tables needed for statistics, which are created and filled each month:
statistics
- statistics_2010_04 (inherits statistics)
- statistics_2010_05 (inherits statistics)
In this example, we have 2,000,000 rows in each table. Each table has a CHECK constraint to make sure only data for the matching month gets stored in it.
So what makes the inheritance a cool feature - why is it cool to split the data?
PERFORMANCE: When selecting data, we SELECT * FROM statistics WHERE date BETWEEN x AND y, and Postgres only scans the tables where it makes sense. E.g., SELECT * FROM statistics WHERE date BETWEEN '2010-04-01' AND '2010-04-15' only scans the table statistics_2010_04; all other tables won't get touched - fast!
Index size: We have no big fat table with a big fat index on column date. We have small tables per month, with small indexes - faster reads.
Maintenance: We can run VACUUM FULL, REINDEX, or CLUSTER on each month's table without locking all the other data
For the correct use of table inheritance as a performance booster, look at the postgresql manual.
You need to set CHECK constraints on each table to tell the database on which key your data gets split (partitioned).
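For the statistics example above, that looks roughly like this (the column name date is assumed):
CREATE TABLE statistics (date date NOT NULL, hits bigint);
CREATE TABLE statistics_2010_04 (
    CHECK (date >= '2010-04-01' AND date < '2010-05-01')
) INHERITS (statistics);
CREATE TABLE statistics_2010_05 (
    CHECK (date >= '2010-05-01' AND date < '2010-06-01')
) INHERITS (statistics);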
I make heavy use of table inheritance, especially when it comes to storing log data grouped by month. Hint: if you store data that will never change (log data), create your indexes with CREATE INDEX ... WITH (fillfactor=100); this means no space for updates is reserved in the index, so the index is smaller on disk.
UPDATE:
For tables, fillfactor defaults to 100; from http://www.postgresql.org/docs/9.1/static/sql-createtable.html:
The fillfactor for a table is a percentage between 10 and 100. 100 (complete packing) is the default
Note that B-tree indexes default to a fillfactor of 90, so setting 100 explicitly still makes a difference there.
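Applied to the monthly tables above, the hint is simply:
CREATE INDEX ON statistics_2010_04 (date) WITH (fillfactor = 100);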
"Table inheritance" means something different than "class inheritance" and they serve different purposes.
Postgres is all about data definitions. Sometimes really complex data definitions. OOP (in the common Java-colored sense of things) is about subordinating behaviors to data definitions in a single atomic structure. The purpose and meaning of the word "inheritance" is significantly different here.
In OOP land I might define (being very loose with syntax and semantics here):
import life

class Animal(life.Autonomous):
    metabolism = biofunc(alive=True)

    def die(self):
        self.metabolism = False

class Mammal(Animal):
    hair_color = color(foo=bar)

    def gray(self, mate):
        self.hair_color = age_effect('hair', self.age)

class Human(Mammal):
    alcoholic = vice_boolean(baz=balls)
The tables for this might look like:
CREATE TABLE animal
(name varchar(20) PRIMARY KEY,
metabolism boolean NOT NULL);
CREATE TABLE mammal
(hair_color varchar(20) REFERENCES hair_color(code) NOT NULL,
PRIMARY KEY (name))
INHERITS (animal);
CREATE TABLE human
(alcoholic boolean NOT NULL,
FOREIGN KEY (hair_color) REFERENCES hair_color(code),
PRIMARY KEY (name))
INHERITS (mammal);
But where are the behaviors? They don't fit anywhere. This is not the purpose of "objects" as they are discussed in the database world, because databases are concerned with data, not procedural code. You could write functions in the database to do calculations for you (often a very good idea, but not really something that fits this case) but functions are not the same thing as methods -- methods as understood in the form of OOP you are talking about are deliberately less flexible.
There is one more thing to point out about inheritance as a schematic device: as of Postgres 9.2 there is no way to reference a foreign key constraint across all of the partitions/table family members at once. You can write checks to do this or get around it another way, but it's not a built-in feature (it comes down to issues with complex indexing, really, and nobody has written the bits necessary to make that automatic). Instead of using table inheritance for this purpose, often a better match in the database for object inheritance is to make schematic extensions to tables. Something like this:
CREATE TABLE animal
(name varchar(20) PRIMARY KEY,
ilk varchar(20) REFERENCES animal_ilk NOT NULL,
metabolism boolean NOT NULL);
CREATE TABLE mammal
(animal varchar(20) REFERENCES animal PRIMARY KEY,
ilk varchar(20) REFERENCES mammal_ilk NOT NULL,
hair_color varchar(20) REFERENCES hair_color(code) NOT NULL);
CREATE TABLE human
(mammal varchar(20) REFERENCES mammal PRIMARY KEY,
alcoholic boolean NOT NULL);
Now we have a canonical reference for the instance of the animal that we can reliably use as a foreign key reference, and we have an "ilk" column that references a table of xxx_ilk definitions which points to the "next" table of extended data (or indicates there is none if the ilk is the generic type itself). Writing table functions, views, etc. against this sort of schema is so easy that most ORM frameworks do exactly this sort of thing in the background when you resort to OOP-style class inheritance to create families of object types.
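For illustration, a hypothetical insert sequence against that schema (assuming the *_ilk and hair_color lookup tables exist and contain these values):
INSERT INTO animal (name, ilk, metabolism) VALUES ('socrates', 'mammal', true);
INSERT INTO mammal (animal, ilk, hair_color) VALUES ('socrates', 'human', 'gray');
INSERT INTO human (mammal, alcoholic) VALUES ('socrates', false);
-- 'socrates' is now one canonical animal row, extended by one mammal row
-- and one human row, all sharing the same key.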
Inheritance can be used in an OOP paradigm as long as you do not need to create foreign keys referencing the parent table. For example, if you have an abstract class vehicle stored in a vehicle table and a table car that inherits from it, all cars will be visible in the vehicle table, but a foreign key from a driver table to the vehicle table won't match these records.
Inheritance can also be used as a partitioning tool. This is especially useful when you have tables meant to grow forever (log tables, etc.).
The main use of inheritance is for partitioning, but sometimes it's useful in other situations. In my database there are many tables differing only in a foreign key. My "abstract class" table "image" contains an "ID" (the primary key, which must be in every table) and a PostGIS 2.0 raster. Inherited tables such as "site_map" or "artifact_drawing" have a foreign key column (a "site_name" text column for "site_map", an "artifact_id" integer column for the "artifact_drawing" table, etc.) and primary and foreign key constraints; the rest is inherited from the "image" table. I suspect I might have to add a "description" column to all the image tables in the future, so this might save me quite a lot of work without causing real issues (well, the database might run a little slower).
EDIT: another good use: with two-table handling of unregistered users, other RDBMSs have problems handling the two tables, but in PostgreSQL it is easy - just add ONLY when you are not interested in the data in the inherited "unregistered user" table.
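A sketch of that trick (table names assumed):
SELECT * FROM users;       -- rows of users plus all inheriting children
SELECT * FROM ONLY users;  -- only rows physically stored in users itself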
The only experience I have with inherited tables is in partitioning. It works fine, but it's not the most sophisticated or easiest-to-use part of PostgreSQL.
Last week we were looking at the same OOP issue, but we had too many problems with Hibernate - we didn't like our setup, so we didn't use inheritance in PostgreSQL.
I use inheritance when I have more than one 1:1 relationship between tables.
Example: suppose you want to store object map locations with attributes x, y, rotation, scale.
Now suppose you have several different kinds of objects to display on the map and each object has its own map location parameters, and map parameters are never reused.
In these cases table inheritance would be quite useful to avoid having to maintain unnormalised tables, or having to create location IDs and cross-reference them to other tables.
I tried some operations on it. I won't say whether there is any actual use case for database inheritance, but I will give you some details to inform your decision. Here is an example from PostgreSQL: https://www.postgresql.org/docs/15/tutorial-inheritance.html
You can try the SQL script below.
CREATE TABLE IF NOT EXISTS cities (
name text,
population real,
elevation int -- (in ft)
);
CREATE TABLE IF NOT EXISTS capitals (
state char(2) UNIQUE NOT NULL
) INHERITS (cities);
ALTER TABLE cities
ADD test_id varchar(255); -- Both tables will now contain this column
DROP TABLE cities; -- Cannot drop because capitals depends on it
ALTER TABLE cities
ADD CONSTRAINT fk_test FOREIGN KEY (test_id) REFERENCES sometable (id);
As you can see from my comments, let me summarize:
When you add/delete/update fields in the parent, the inheriting table is also affected.
Cannot drop the parent table.
Foreign keys would not be inherited.
From my perspective, in growing applications we cannot easily predict future changes, so I would avoid applying this early in database development.
When features have stabilized and we want to create a database model much like an existing one, we can consider that use case.
Use it as little as possible. That usually means never, as it boils down to a way of creating structures that violate the relational model, for instance by breaking the information principle and by creating bags instead of relations.
Instead, use table partitioning combined with proper relational modelling, including further normal forms.