Postgres table partitioning with star schema - postgresql

I have a schema with one table with the majority of data, customer, and three other tables with foreign key references to customer.entry_id which is a BIGSERIAL field. The three other tables are called location, devices and urls where we store various data related to a specific entry in the customer table.
I want to partition the customer table into monthly child tables, and have that part worked out; customer will stay as-is, each month will have a table customer_YYYY_MM that inherits from the master table with the right CHECK constraint and indexes will be created on each individual child table. Data will be moved to the correct child tables while the master table stays empty.
My question is about the other three tables, as I want to partition them as well. However, they have no date information (at all), only the reference to the primary key from the master table. How can I setup the constraints on these tables? Is it even meaningful or possible without date information?
My application logic knows where to insert all the data (it's fairly trivial), but I expect to be able to do simple SELECT queries without specifying which child tables to get it from. So this should work as you would expect from non-partitioned tables:
SELECT l.*
FROM customer c
JOIN location l USING entry_id
WHERE c.date_field > '2015-01-01'

I would partition them by the reference key. The foreign key is used in join conditions and is not usually subject to change so it fulfills the following important points:
Partition by the information that is used mostly in the WHERE clauses of the queries or other parts where partitioning can be used to filter out tables that don't need to be scanned. As one guide puts it:
The objective when defining partitions should be to allow as many queries as possible to fetch data from as few partitions as possible - ideally one.
Partition by information that is not going to be changed so that rows don't constantly need to be thrown from one subtable to another
This all depends of the size of the tables too of course. If the sizes stay small then there is no need to partition.

Read more about partitioning here.
Use views:
create view customer as
select * from customer_jan_15 union all
select * from customer_feb_15 union all
select * from customer_mar_15;
create view location as
select * from location_jan_15 union all
select * from location_feb_15 union all
select * from location_mar_15;

Related

How do you manage UPSERTs on PostgreSQL partitioned tables for unique constraints on columns outside the partitioning strategy?

This question is for a database using PostgreSQL 12.3; we are using declarative partitioning and ON CONFLICT against the partitioned table is possible.
We had a single table representing application event data from client activity. Therefore, each row has fields client_id int4 and a dttm timestamp field. There is also an event_id text field and a project_id int4 field which together formed the basis of a composite primary key. (While rare, it was possible for two event records to have the same event_id but different project_id values for the same client_id.)
The table became non-performant, and we saw that queries most often targeted a single client in a specific timeframe. So we shifted the data into a partitioned table: first by LIST (client_id) and then each partition is further partitioned by RANGE(dttm).
We are running into problems shifting our upsert strategy to work with this new table. We used to perform a query of INSERT INTO table SELECT * FROM staging_table ON CONFLICT (event_id, project_id) DO UPDATE ...
But since the columns that determine uniqueness (event_id and project_id) are not part of the partitioning strategy (dttm and client_id), I can't do the same thing with the partitioned table. I thought I could get around this by building UNIQUE indexes on each partition on (project_id, event_id) but the ON CONFLICT is still not firing because there is no such unique index on the parent table (there can't be, since it doesn't contain all partitioning columns). So now a single upsert query appears impossible.
I've found two solutions so far but both require additional changes to the upsert script that seem like they'd be less performant:
I can still do an INSERT INTO table_partition_subpartition ... ON CONFLICT (event_id, project_id) DO UPDATE ... but that requires explicitly determining the name of the partition for each row instead of just INSERT INTO table ... once for the entire dataset.
I could implement the "old way" UPSERT procedure: https://www.postgresql.org/docs/9.4/plpgsql-control-structures.html#PLPGSQL-UPSERT-EXAMPLE but this again requires looping through all rows.
Is there anything else I could do to retain the cleanliness of a single, one-and-done INSERT INTO table SELECT * FROM staging_table ON CONFLICT () DO UPDATE ... while still keeping the partitioning strategy as-is?
Edit: if it matters, concurrency is not an issue here; there's just one machine executing the UPSERT into the main table from the staging table on a schedule.

Postgres table with thousands of partition

I have a postgres table (in postgres12) which is supposed to have thousands of partitions (200k at least) in near future.
Here is how I am creating the parent table:
create table if not exists content (
key varchar(20) not NULL,
value json not null default '[]'::json
) PARTITION BY LIST(key)
And then adding any given child tables like:
create table if not exists content_123 PARTITION OF content for VALUES in ('123');
Also I am adding an index on top of the child table for quick access (since I will be accessing the child table directly):
create index if not exists content_123_idx on content_123 using btree(key)
Here is my question: I have never in the past managed this many partitions in a postgres table so I am just wondering is there any downside of doing what I am doing? Also, (as mentioned above) I will not be querying from the parent table directly, but will read directly from individual child tables.
With these table definitions, the index is completely useless.
With 200000 partitions, query planning will become intolerably slow, and each SQL statement will need very many locks and open files. This won't work well.
Lump several keys together into a single partition (then the index might make sense).

Use of Postgres table inheritance

Since Postgres also supports partitioned tables, what is the use of child table.
Suppose there is a table of users which has a column created_date. We can store data in 2 ways:
We create many child tables of this user table and distribute the data of users on the basis of created_date (say, one table for every date, like user_jan01_21).
We can create a partitioned table with the partitioning key created_date
Then what is the difference between these solution?
Basically, I want to know what problem table inheritance can solve that partitioning cannot.
Another doubt I have: if I follow solution 1, and I query the user table without the ONLY keyword, will it scan all the child tables?
For example:
SELECT * FROM WHERE where created_date = current_date - 10;
If the objective is partitioning, as in your example, then there is no advantage in using table inheritance. Declarative partitioning is far superior in ease of use, performance and available features.
Table inheritance has uses that are unrelated to partitioning. Features that partitioning doesn't offer are:
the child table can have additional columns
a table can inherit from more than one table
With table inheritance, if you select from the parent table, you will also get all results from the child tables, just as if you had used UNION ALL to combine the results.

Create unique integer id column for result rows of union query

I have a view as below in which I union several tables and I'm thinking it might be a good idea to have a unique row number for each row in the result set. The prescient reason is I have an admin tool which doesn't know I'm using a view rather than an ordinary table, and which expects a unique id to be present, but I'm now speculating it might be worth doing more generally (i.e. it may make sense to do this in certain theoretical terms - discussion on this would be welcome). Wondering how to do this in postgresql.
CREATE VIEW subscriptions AS (
SELECT subscriber_id, course, end_at
FROM subscriptions_individual_stripe
UNION ALL SELECT subscriber_id, course, end_at
FROM subscriptions_individual_bank_transfer
ORDER BY end_at DESC);
Discussion
The reason these are separate tables is of course that they are actually different entities, and yet I also need to be able to contemplate them in a combined way, hence the VIEW. This is my way of avoiding so-called 'polymorphic relationships' in certain popular web frameworks.
I have a tool that expects an id and while my first thought was that views don't need a unique key, on the other hand, maybe they do...?
Reason being two records could exist in one of the UNIONed tables which were only unique by virtue of the primary key. If one does not include the primary key, the union should remove one of those, so a record would be lost. Should we also take that into account, i.e. select the primary key (here an integer id) for each of the UNIONed tables, but, "convert it" to some other unique id, so the view has its own unique integer primary key? Of course this won't be usable in terms of referencing anything in the original UNIONed tables, but I'm OK with that (The view is a terminal point of my analysis, I don't intend to do anything further with it, and of course it is not writable).
Update
I'm accepting S-Man's answer below because it is a solution to the question I asked, however, as pointed out, the row_number() must not be treated as if it was a real identifier because it will not be.
So as an important aside, I'm left wondering what row_number() is really intended for then. Perhaps it's (mainly? occasionally?) useful where you want to output some query when you plan to export the data somewhere else (i.e. seems almost spreadsheet-ish), and you abandon any sense of it being integrated with the rest of your database?
Table inheritance may be better as Abelisto has pointed out in the comments.
You can add a row count to the UNION using the row_number() window function:
demo:db<>fiddle
CREATE VIEW v_myview AS
SELECT
row_number() OVER (ORDER BY ...) AS id,
*
FROM (
SELECT ...
UNION
SELECT ...
) AS foo;
The main problem with this is: You should never deal with this id as an real identifier because the data of the table can change. So it could be that one table today generates a few records more than yesterday. So, the generated row numbers wouldn't match to the same record as before.
Edit: Removed the md5 solution I added before because of some problems with uniqueness on same data.

DB2 Partitioning

I know how partitioning in DB2 works but I am unaware about where this partition values exactly get stored. After writing a create partition query, for example:
CREATE TABLE orders(id INT, shipdate DATE, …)
PARTITION BY RANGE(shipdate)
(
STARTING '1/1/2006' ENDING '12/31/2006'
EVERY 3 MONTHS
)
after running the above query we know that partitions are created on order for every 3 month but when we run a select query the query engine refers this partitions. I am curious to know where this actually get stored, whether in the same table or DB2 has a different table where partition value for every table get stored.
Thanks,
table partitions in DB2 are stored in tablespaces.
For regular tables (if table partitioning is not used) table data is stored in a single tablespace (not considering LOBs).
For partitioned tables multiple tablespaces can used for its partitions.
This is achieved by the "" clause of the CREATE TABLE statement.
CREATE TABLE parttab
...
in TBSP1, TBSP2, TBSP3
In this example the first partition will be stored in TBSP1, the second in TBSP2, The third in TBSP3, the fourth in TBSP1 and so on.
Table partitions are named in DB2 - by default PART1 ..PARTn - and all these details can be looked up in the system catalog view SYSCAT.DATAPARTITIONS including the specified partition ranges.
See also http://www-01.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r0021353.html?cp=SSEPGG_10.5.0%2F2-12-8-27&lang=en
The column used as partitioning key can be looked up in syscat.datapartitionexpression.
There is also a long syntax for creating partitioned tables where partition names can be explizitly specified as well as the tablespace where the partitions will get stored.
For applications partitioned tables look like a single normal table.
Partitions can be detached from a partitioned table. In this case a partition is "disconnected" from the partitioned table and converted to a table without moving the data (or vice versa).
best regards
Michael
After a bit of research I finally figure it out and want to share this information with others, I hope it may come useful to others.
How to see this key values ? => For LUW (Linux/Unix/Windows) you can see the keys in the Table Object Editor or the Object Viewer Script tab. For z/OS there is an Object Viewer tab called "Limit Keys". I've opened issue TDB-885 to create an Object Viewer tab for LUW tables.
A simple query to check this values:
SELECT * FROM SYSCAT.DATAPARTITIONS
WHERE TABSCHEMA = ? AND TABNAME = ?
ORDER BY SEQNO
reference: http://www-01.ibm.com/support/knowledgecenter/SSEPGG_9.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r0021353.html?lang=en
DB2 will create separate Physical Locations for each partition. So each partition will have its own Table-space. When you SELECT on this partitioned Table your SQL may directly go to a single partition or it may span across many depending on how your SQL is. Also, this may allow your SQL to run in parallel i.e. many TS can be accessed concurrently to speed up the SELECT.