I am building a database in Postgres 11, and I would like to segment the information by partitioning tables. The appointment table is already partitioned by date ranges, and I would also like to partition the patient table; a partition of patients by each doctor.
The question is: How can I partition the patient table with list partitioning? That is to say, for this table I would have to make a direct partition relationship with the doctor table or I would have to use the intermediate table, since between the two mentioned tables there is a relationship of many to many.
Attached is an illustrative image.
For a many-to-many relationship you will need a mapping table, partitioning or not.
I wouldn't use an artificial primary key for the mapping table, but the combination of id_doctor and id_patient (they are artificial anyway). The same holds for the appointment table.
Since id_doctor is not part of the patient table (and shouldn't be), you cannot partition the patient table per doctor. Why would you want to do that? Partitioning is mostly useful for mass deletions (and to some extent for speeding up sequential scans) — is that your objective?
There is a wide-spread assumption that bigger tables should be partitioned just because they are big, but that is not the case. Index access to a partitioned table is — if anything — slightly slower than index access to a non-partitioned table. Do you have billions of patients?
Related
What is the difference between a BRIN index and a table partition in PostgreSQL? When I should use one instead of another? It seems that they provide very similar benefits and also have similar use cases
Example
Suppose we have the following table structure
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
store_id INT,
client_id INT,
created_at timestamp,
information jsonb
)
that has the following characteristics:
orders can only be inserted, deletions are not allowed and updates are very rare and they don't involve the created_at column
the created_at column contains the timestamp of the insertion of the row in the database thus the values in the column are strictly increasing
almost every query use the created_at column in a condition and some of them may use the store_id and client_id columns
the most accessed rows are the most recent ones in terms of the created_at column
some queries may return a few records (example: analyzing a single record or the records created in a small time interval) while others may scan a vast amount of records (example: aggregate functions for a dashboard functionality)
I have chosen this example because it's very common and also both approach could be used (in my opinion). In this case which choice should I use between a BRIN index on the whole table or a partitioned table with maybe a btree index (or just a simple btree index without partitioning)? Does the table dimension influence the choice?
I have used both features (although I'll caveat that my experience with partitioning is from back when you had to use inheritance + constraints, before the introduction of CREATE TABLE ... PARTITION BY). You are correct that they seem similar-ish on a surface level, but they function by completely different mechanisms.
Table partitioning basically works as follows: replace all references to table with (select * from table_partition1 union all select * from table_partition2 /* repeat for all partitions */). The partitions will have a constraint on the partition columns, so that if those columns appear in a WHERE, the constraints can be applied up-front to prune which partitions are actually scanned. IOW, if table_partition1 has CHECK(client_id=1), and your WHERE Has client_id=2, table_partition1 will be skipped since the table constraint automatically excludes all rows from this partition from passing that WHERE.
BRIN indexes, in contrast, choose a block size for the table, and then for each block, records a min/max bound of the indexed column. This allows WHERE conditions to skip entire blocks when we can see, say, that the maximum created_at in a particular block of rows is below a created_at>={some_value} clause in your WHERE.
I can't tell you a definitive answer for your case as to which would work better. Well, that's not true, actually: the definitive answer is, "benchmark it for your own data" ;)
This is kind of fuzzy, but my general feeling is that BRIN is lightweight, and table partitioning is not. BRIN is something that can be added on to an existing table without much trouble, the indexes themselves are very small, and the impact on writes is not major (at least, not without inordinately many indices). Table partitioning, on the other hand, is a different way of representing the data on-disk; you are actually determining into which data files particular rows will be written. This requires a much more involved migration process when introducing it to an existing dataset.
However, the set of query optimizations available for table partitioning is much greater. Not only is there the constraint exclusion I described above, but you can also have indices (even BRIN ones!) on each individual partition. Of course, you can also have BRIN + other indices on a single-big-table, but I'm not sure that is particularly helpful IRL.
A few other thoughts: BRIN is good for monotonic data (timestamps, incremnting IDs, etc); the more correlated the on-disk ordering is to the indexed value, the more effective a BRIN index can be at pruning blocks to be scanned. Things like customer IDs, however, are unlikely to work well with BRIN; any given block of rows is likely to have at least one relatively low and relatively high ID. However, fields that like work quite well for partitioning: a partition-per-client, or partitioning on the modulus of a customer ID (which would more commonly be called sharding), is a good way of scaling horizontally, almost without bound.
Any update, even if it does not change the indexed column, will make a BRIN index pretty useless (unless it is a HOT update). Even without that, there are differences, for example:
partitioning allows you to get rid of lots of data efficiently, a BRIN index won't
a partitioned table allows one autovacuum worker per partition, which improves autovacuum performance
But if your only concern is to efficiently select all rows for a certain value of the index or partitioning key, both may offer about the same benefit.
We have a table partitioned on a date column.
Some of my colleagues believe that this means that column is automatically indexed. Having looked for evidence of this I don't believe that is so. Who is right?
The manual https://www.postgresql.org/docs/current/ddl-partitioning.html (section 5.11.2.1. Example) says:
Create an index on the key column(s), as well as any other indexes you
might want, on the partitioned table. (The key index is not strictly
necessary, but in most scenarios it is helpful.) This automatically
creates a matching index on each partition, and any partitions you
create or attach later will also have such an index. An index or
unique constraint declared on a partitioned table is “virtual” in the
same way that the partitioned table is: the actual data is in child
indexes on the individual partition tables.
This suggests to me we should create the index.
Each partition has ~350K rows. Since we often query by date range on that column would each partition get its own index? Or one massive one across all partitions?
Would adding an index on this column improve or degrade performance?
There is not automatically an index on the partition column.
If you did list partitioning and every list only contains one date (i.e. every date has its own partition) then I don't think also having an index on that column would be helpful. There is not extra information in the column beyond what the partitioning already knows about.
If you did range partitioning on quarter or year, but often query by a specific date, then the index would likely be useful as it provides a lot of extra specificity.
Since Postgres also supports partitioned tables, what is the use of child table.
Suppose there is a table of users which has a column created_date. We can store data in 2 ways:
We create many child tables of this user table and distribute the data of users on the basis of created_date (say, one table for every date, like user_jan01_21).
We can create a partitioned table with the partitioning key created_date
Then what is the difference between these solution?
Basically, I want to know what problem table inheritance can solve that partitioning cannot.
Another doubt I have: if I follow solution 1, and I query the user table without the ONLY keyword, will it scan all the child tables?
For example:
SELECT * FROM WHERE where created_date = current_date - 10;
If the objective is partitioning, as in your example, then there is no advantage in using table inheritance. Declarative partitioning is far superior in ease of use, performance and available features.
Table inheritance has uses that are unrelated to partitioning. Features that partitioning doesn't offer are:
the child table can have additional columns
a table can inherit from more than one table
With table inheritance, if you select from the parent table, you will also get all results from the child tables, just as if you had used UNION ALL to combine the results.
Using PostgreSQL 12, I'd like to take advantage of partitioning to 1: Aid in query performance, 2: Allow removing historic data more easily to keep mitigate database growth.
Unfortunately, declarative partitioning requires the key to be part of the PKs. A temporal field as primary key doesn't work well for my model -- so I'm exploring using inheritance instead (as per the docs).
My question is whether using this approach will similarly isolate the amount of rows that my SELECT statement will be exposed to if an item in my WHERE statement limits the results to a single child table.
eg.
Books => BooksJan2020, BooksFeb2020, BooksMar2020.
SELECT * FROM Books WHERE created < '01 20 2020' and author LIKE 'John%';
In declarative partitioning, I would expect the 'LIKE' statement to only be exposed to rows within the January table. Can I expect the same with inheritance? When studying how to create inherited tables, I don't see a mechanism that would tell the planner which child table to pull from.
SteveJ
You can do that by creating the appropriate check constraints on the inheritance children and leaving constraint_exclusion at its default value on.
But I want to dissuade you from using anything but declarative partitioning in v12. Partitioning by inheritance hurts. Besides, you cannot get a true primary key on anything that does not contain the partitioning key that way: even though you have a primary key on all partitions, nothing can prevent you from inserting the same key in different partitions.
My advice is to go with a primary key on (id, created). True, that does not guarantee global uniqueness of id, but it goes a long way towards that goal. With values generated from a single sequence, the risk of duplicates is marginal.
The remaining down side of a composite primary key is that you have to include both columns into any table that has a foreign key constraint to the partitioned table, but I'd say that is the price you pay for the advantages of partitioning. Besides, with inheritance partitioning you couldn't have foreign keys pointing to the partitioned table at all.
I have a schema with one table with the majority of data, customer, and three other tables with foreign key references to customer.entry_id which is a BIGSERIAL field. The three other tables are called location, devices and urls where we store various data related to a specific entry in the customer table.
I want to partition the customer table into monthly child tables, and have that part worked out; customer will stay as-is, each month will have a table customer_YYYY_MM that inherits from the master table with the right CHECK constraint and indexes will be created on each individual child table. Data will be moved to the correct child tables while the master table stays empty.
My question is about the other three tables, as I want to partition them as well. However, they have no date information (at all), only the reference to the primary key from the master table. How can I setup the constraints on these tables? Is it even meaningful or possible without date information?
My application logic knows where to insert all the data (it's fairly trivial), but I expect to be able to do simple SELECT queries without specifying which child tables to get it from. So this should work as you would expect from non-partitioned tables:
SELECT l.*
FROM customer c
JOIN location l USING entry_id
WHERE c.date_field > '2015-01-01'
I would partition them by the reference key. The foreign key is used in join conditions and is not usually subject to change so it fulfills the following important points:
Partition by the information that is used mostly in the WHERE clauses of the queries or other parts where partitioning can be used to filter out tables that don't need to be scanned. As one guide puts it:
The objective when defining partitions should be to allow as many queries as possible to fetch data from as few partitions as possible - ideally one.
Partition by information that is not going to be changed so that rows don't constantly need to be thrown from one subtable to another
This all depends of the size of the tables too of course. If the sizes stay small then there is no need to partition.
Read more about partitioning here.
Use views:
create view customer as
select * from customer_jan_15 union all
select * from customer_feb_15 union all
select * from customer_mar_15;
create view location as
select * from location_jan_15 union all
select * from location_feb_15 union all
select * from location_mar_15;