How to re-partition a hash-partitioned table in PostgreSQL?

I'm currently designing a table and want to partition it by account_name.
For now I'm thinking of going with a small number of partitions (e.g. 8), but since I expect a lot of data, there is a chance I will need to re-partition it and create more partitions.
What is the best way to do this? If I understand correctly, I can't just attach new partitions, since I would need to change the modulus for the previously used ones.
Should I copy and re-insert all the data, or is there an easier way?

Repartitioning would mean completely rewriting the table, as in
INSERT INTO new_tab SELECT * FROM old_tab;
which will cause extensive downtime. One way around this is to use logical replication with new_tab on the standby side (possible from v13 on).
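For illustration, a minimal sketch of such a rewrite, with made-up table and column names and assuming the old table uses MODULUS 8 and the new one MODULUS 16:

-- New table with twice as many hash partitions (columns are invented).
CREATE TABLE new_tab (
    account_name text NOT NULL,
    payload      jsonb
) PARTITION BY HASH (account_name);

-- One partition per remainder; repeat for remainders 1 through 15.
CREATE TABLE new_tab_p0 PARTITION OF new_tab
    FOR VALUES WITH (MODULUS 16, REMAINDER 0);

-- The rewrite itself; this is what causes the downtime.
INSERT INTO new_tab SELECT * FROM old_tab;

-- Swap the two tables once the copy is done.
BEGIN;
ALTER TABLE old_tab RENAME TO old_tab_retired;
ALTER TABLE new_tab RENAME TO old_tab;
COMMIT;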
But my recommendation is not to do that. Choose a reasonable number of partitions and stick with that.

Related

Partitioning of related tables in PostgreSQL

I've checked the documentation, watched some presentations, and read blogs, but I can't find examples of partitioning more than a single table in PostgreSQL - and that's what we need. Our tables are an insert-only audit trail with a master-detail structure, and we aim to solve our problem of slow data removal, which is currently done using DELETE.
The simplified structure and some queries are shown in the following fiddle: https://www.db-fiddle.com/f/2mRXT4wGjM2ZSftjgKyZce/46
The issue I'm investigating right now is how to query the detail table effectively, be it in a JOIN or directly. Because the timestamp field is part of the partition key, I understand that using it in the query is essential. What I don't understand is why the JOIN is not able to figure this out when timestamp equality is used in the ON clause (a couple of EXPLAIN examples are in the fiddle).
Then there are broader questions:
What is the generally recommended strategy for a case like this? We expect the timestamp to be essential in our queries, so it feels natural to use it as the partitioning key.
I've made a short experiment (so no real experience with it yet) and based the partitioning solely on an id range. This seems to have one advantage - more or less predictable partition sizes (depending on the size of the variable-length columns, of course). It is possible to add check timestamp ... conditions to any full partition (and an open-interval check to the active one too!), which helps with partition pruning. This has the nice benefit that the detail table needs a single-column FK referencing only master.id (and perhaps prunes better during JOINs). Any ideas or experiences with something similar?
We would rather have time-based partitioning, which seems more natural, but it's not a hard requirement. The need to drag the timestamp into another table and into its FK, etc., makes it less compelling.
Obviously, we want both tables (all of them, to be precise - we will have more detail table types) partitioned along the same ranges, be it id or timestamp. I guess not doing so defeats the whole purpose of partitioning, as we would not be able to remove data related to the master partitions.
I welcome any pointers or ideas on how to do it properly. In the end we will decide for ourselves, but there is not much material to help with the decision right now. Thanks.
Your strategy is good. Partition related tables by the common timestamp and make sure that the partition boundaries are the same.
You probably didn't get the efficient partitionwise join because you didn't set enable_partitionwise_join to on. That parameter is turned off by default because it can consume substantial query planning time that you don't want to expend unless you know you can benefit.
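For illustration, a sketch with invented table names: when both tables have identical partition boundaries and the parameter is on, the planner can join the matching partition pairs individually.

-- Made-up master/detail tables partitioned by the same monthly ranges.
CREATE TABLE master (
    id bigint NOT NULL,
    ts timestamptz NOT NULL,
    PRIMARY KEY (id, ts)
) PARTITION BY RANGE (ts);

CREATE TABLE detail (
    master_id bigint NOT NULL,
    ts        timestamptz NOT NULL,
    info      text
) PARTITION BY RANGE (ts);

CREATE TABLE master_2023_01 PARTITION OF master
    FOR VALUES FROM ('2023-01-01') TO ('2023-02-01');
CREATE TABLE detail_2023_01 PARTITION OF detail
    FOR VALUES FROM ('2023-01-01') TO ('2023-02-01');
-- ... further months with the same boundaries on both tables ...

SET enable_partitionwise_join = on;

-- The join condition includes the partition key, so each partition of
-- "detail" is joined only against the matching partition of "master".
EXPLAIN
SELECT *
FROM master m
JOIN detail d ON d.master_id = m.id AND d.ts = m.ts;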

Do you need to add an index on a partitioned table (postgres 11)?

My team is looking at moving our non-partitioned table with ~1 TB of data over to a partitioned table.
We would be using range partitioning based on a timestamp column.
One thing I don't understand is whether we need to add an index on the timestamp column if it's being used as the partition key. If we make our partitions quite small (e.g. partition for every day), would this act in a similar way to an index?
We would only be querying at a resolution of one day or coarser.
I am reluctant to add an index, as we've tried this in the past and it never completed (probably because we didn't turn off writes; turning off writes for an extended period is not really an option).
Your feeling is right: omitting the index on the partitioning column is one of the few places where partitioning actually makes queries faster.
You can then get away with a sequential scan of a single partition, and you don't have to maintain the index with every data modifying statement.
The other advantage is that partitioning makes mass deletion of data (along the partition boundaries) so much more efficient. And finally, autovacuum's job will become easier.
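As a sketch (table and partition names invented): a query restricted to one day is pruned to a single small partition, which can then be scanned sequentially without any index on the timestamp column.

-- Hypothetical table with one range partition per day and no index on "ts".
CREATE TABLE measurement (
    ts    timestamptz NOT NULL,
    value double precision
) PARTITION BY RANGE (ts);

CREATE TABLE measurement_2023_06_01 PARTITION OF measurement
    FOR VALUES FROM ('2023-06-01') TO ('2023-06-02');
-- ... one partition per day ...

-- The WHERE clause covers exactly one partition, so the planner prunes the
-- rest and the single remaining partition is read with a sequential scan.
EXPLAIN
SELECT avg(value)
FROM measurement
WHERE ts >= '2023-06-01' AND ts < '2023-06-02';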
Two points about partitioning:
Upgrade to v12; there have been substantial performance improvements that concern partitioning.
Don't use too many partitions. With v12 you can probably go up to a few thousand; in earlier versions you will run into performance problems with considerably fewer.

Archive old data in PostgreSQL

I'm currently looking for somebody to advise me on the process I'm going to follow for DB archiving.
I have a database (DB-1) with 2 very large tables, one holding 25 GB of data and the other 20 GB, which cause major performance issues even though I have indexes.
So we are considering archiving the old data with the process below:
Clone a new database (DB-2) from the existing database (DB-1).
Delete the old data from DB-1, so that it only holds the last 2 years of records. In case I need the old data, I can connect to DB-2.
Every month, move old data from DB-1 to DB-2 and delete the moved rows from DB-1.
That is the wrong approach.
What you are looking for is partitioning.
You can create range partitions covering one year each. To remove old data all you need to do is to drop the partition for the year(s) no longer needed.
If you need to keep the data for some reason, you can also just detach the partition from the table. Then the data is still "lying around" but no longer shows up in the (partitioned) table. You could query the (detached) partition directly to access that data. You could even move that (detached) partition to a slower hard disk to free up space on your fast disks, if you have more than one.
You might even find that partitioning alone already improves performance, but that depends a lot on your queries.
Note that you should use Postgres 11 for that, as partitioning wasn't that sophisticated in older versions.
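A minimal sketch of that layout, with invented names:

-- Hypothetical table with one range partition per year.
CREATE TABLE audit_log (
    created_at timestamptz NOT NULL,
    payload    text
) PARTITION BY RANGE (created_at);

CREATE TABLE audit_log_2021 PARTITION OF audit_log
    FOR VALUES FROM ('2021-01-01') TO ('2022-01-01');
CREATE TABLE audit_log_2022 PARTITION OF audit_log
    FOR VALUES FROM ('2022-01-01') TO ('2023-01-01');

-- Removing a whole year is fast compared to DELETE:
DROP TABLE audit_log_2021;

-- Alternatively, keep the data but take it out of the partitioned table:
ALTER TABLE audit_log DETACH PARTITION audit_log_2022;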
While you should no doubt upgrade your current version (I'd suggest moving away from the EDB system you are working on now and going to community-based Postgres 11), even if you can't upgrade, partitioning is still a much better answer than creating a second database.
By recreating your table as a set of partitions within the same database, you will be able to add and remove data in a much cleaner fashion, and it will make dealing with vacuums much easier. Even on 9.5 you can take advantage of table inheritance to build out partitions: first add partitions for incoming data, then create partitions at various intervals (probably monthly, since you want to run a monthly cleanup) and move the existing data into them. This can be accomplished atomically with a series of INSERT INTO partition SELECT * FROM table WHERE <timestamp> style statements.
I suspect you can probably manage this yourself (you need basic sql and the ability to write simple triggers/functions... here is a link to the 9.5 docs), but if you need help, you can engage with one of the Postgres chat communities, or contact a support company if you want a deeper dive.
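A rough sketch of that inheritance-based approach (table and column names invented, the insert trigger for routing new rows is omitted, and it assumes no concurrent writes to the affected range):

-- Child table for one month, with a CHECK constraint so that constraint
-- exclusion can skip it for queries outside that range.
CREATE TABLE events_2019_01 (
    CHECK (created_at >= '2019-01-01' AND created_at < '2019-02-01')
) INHERITS (events);

-- Move the existing rows for that month out of the parent in one transaction.
BEGIN;
INSERT INTO events_2019_01
    SELECT * FROM ONLY events
    WHERE created_at >= '2019-01-01' AND created_at < '2019-02-01';
DELETE FROM ONLY events
    WHERE created_at >= '2019-01-01' AND created_at < '2019-02-01';
COMMIT;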

Is belated partitioning possible?

I've got a PostgreSQL 10+ (probably 11) database to work with. As the tables are growing fast but implementation time is scarce, my approach is to first set up the whole schema so that the app can run on it. After that, I'd like to introduce partitioning (by date) on some tables.
Is belated partitioning possible?
Practically, I mean that I create partitions by date and that already-inserted data is then automatically re-assigned to those partitions (if applicable). Also, if it is possible, is it a good approach, or are there better alternatives?
I guess that reorganizing a big table takes its time, but that's OK.

SQL Server stats and indexes after a partition swap

This question might be a bit too general but I thought I would ask. I'm working with a terabyte scale data warehouse in SQL Server 2008 R2. There is a large fact table with data going back 5 years. I have aggregated a lot of this old data to a different table at a higher level of granularity. The next step is to remove the old data from my fact table.
I've decided that partition switching is probably the best way to remove the older rows from the fact table and put them into an archive table, but I was wondering what a partition switch will do to the stats and indexes on my fact table. Should I consider manually updating statistics after a partition switch (auto update is set to off)? Will my indexes be fragmented and need reorganizing or rebuilding?
Thanks for your help!
Partition switching is a metadata-only operation, so it's not going to cause fragmentation, as no physical data actually moves - just the logical references to it.
You should probably be updating statistics on a large table regularly, but it's not especially needed after a partition switch.
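For illustration (object names and the partition number are invented), the switch and a manual statistics update could look like this:

-- Switch the oldest partition of the fact table into an archive table with a
-- matching structure; only metadata changes, no rows are physically moved.
ALTER TABLE dbo.FactSales SWITCH PARTITION 1 TO dbo.FactSales_Archive;

-- Optional here, since automatic statistics updates are turned off:
UPDATE STATISTICS dbo.FactSales;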