I've got a PostgreSQL 10+ (probably 11) database to work with. As the tables are growing fast but implementation time is scarce, my approach is to first set up the whole schema so that the app can run on it. After that, I'd like to introduce partitioning (by date) on some tables.
Is belated partitioning possible?
In practice, I mean that I create partitions by date and the already-inserted data is then automatically reassigned to those partitions (where applicable). Also, if it's possible, is this a good approach or are there better alternatives?
I guess that reorganizing a big table needs its time, but that's ok.
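One possible shape for a belated switch, sketched as a small Python helper that only builds the SQL text. It assumes a technique not mentioned above: the existing table is renamed and attached as a single "legacy" partition of a new partitioned parent, so old rows are not redistributed. The table name events, the column created_at and the date bounds are hypothetical, and indexes or keys on the new parent are left out.

```python
def retrofit_partitioning_ddl(table, column, legacy_from, legacy_to):
    """Build SQL that turns an existing table into the first partition
    of a new range-partitioned parent with the same name."""
    legacy = f"{table}_legacy"
    return [
        "BEGIN;",
        # Keep the existing data where it is, under a new name.
        f"ALTER TABLE {table} RENAME TO {legacy};",
        # New empty parent with the same columns, partitioned by the date column.
        f"CREATE TABLE {table} (LIKE {legacy} INCLUDING DEFAULTS) "
        f"PARTITION BY RANGE ({column});",
        # Assumption: every existing row falls inside [legacy_from, legacy_to).
        f"ALTER TABLE {table} ATTACH PARTITION {legacy} "
        f"FOR VALUES FROM ('{legacy_from}') TO ('{legacy_to}');",
        "COMMIT;",
    ]


for statement in retrofit_partitioning_ddl(
    "events", "created_at", "2000-01-01", "2022-01-01"
):
    print(statement)
```

New partitions for later date ranges would then be added next to the legacy one. Attaching makes PostgreSQL check that the existing rows fit the stated bounds, so the step can take a while on a big table.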
Related
I am unaware of the best practices for automating the addition of future partitions to a table. The situation is this: in, say, December 2021, we want to create partitions for the next year (2022) for some tables in Postgres. We can obviously do it manually, but we want to automate it. So far, I could think of (and found by researching, talking to some people, etc.) the following ways:
Using PL/pgSQL (In my opinion, there could be issues related to version control and deployment here)
Writing a script (in Python, say) and executing it as a cron job annually
Adding the partitioning logic to the code (that inserts records in the database) i.e., whenever a record is inserted in a table, you check whether the partition corresponding to the record exists, if not, you create the partition (fetching the metadata of the table with each incoming record can be expensive, but we can try and optimize it in my opinion)
Is there any other way that I am missing? If not, which of the above would be the best way to automate the addition of future partitions to a table in Postgres?
Also, please point out if this is not the right platform for such questions (it would be great if you could direct me to the right one).
Thank you for reading this.
Option 1 is not sufficient, because you need a way to run the code automatically (that's the hard part). It doesn't matter much if you use PL/pgSQL or a client side language for the procedural parts of the operation.
Option 3 is not easy to achieve, and certainly not in an efficient fashion.
I would say that the best way is to schedule a job for partition creation, either with the operating system scheduler (cron) or with a PostgreSQL extension like pg_timetable or pg_cron.
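A minimal sketch of what such a scheduled job could run, assuming monthly range partitions on a date column. The parent table name measurements and the child naming scheme are made up for illustration; the function only builds the SQL text, and running it (for example by piping the output into psql from cron) is left to the scheduler.

```python
from datetime import date


def monthly_partition_ddl(table, year):
    """Build one CREATE TABLE ... PARTITION OF statement per month of `year`."""
    statements = []
    for month in range(1, 13):
        start = date(year, month, 1)
        # The upper bound is exclusive: the first day of the following month.
        end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
        statements.append(
            f"CREATE TABLE IF NOT EXISTS {table}_{start:%Y_%m} "
            f"PARTITION OF {table} "
            f"FOR VALUES FROM ('{start.isoformat()}') TO ('{end.isoformat()}');"
        )
    return statements


# Hypothetical parent table; run in December 2021 to prepare 2022.
for statement in monthly_partition_ddl("measurements", 2022):
    print(statement)
```

IF NOT EXISTS makes the job safe to rerun, so a failed or repeated run does no harm.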
I'm currently designing a table and want to partition it by account_name.
For now I'm thinking of going with a small number of partitions (e.g. 8) but since I expect a lot of data there is a chance I will need to re-partition it and make more partitions.
What is the best way to do this? If I understand correctly I can't just attach new partitions since I need to change modulus for previously used ones.
Should I copy and re-insert all the data, or is there an easier way?
Repartitioning means rewriting the table completely, as in
INSERT INTO new_tab SELECT * FROM old_tab;
which will cause extensive downtime. One way around this is to use logical replication with new_tab on the standby side (possible from v13 on).
But my recommendation is not to do that. Choose a reasonable number of partitions and stick with that.
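To illustrate, here is a small sketch that builds the DDL for a fixed set of hash partitions and shows why a different modulus reshuffles rows. The parent name accounts is hypothetical, and plain integers stand in for PostgreSQL's internal hash values, which are computed differently.

```python
def hash_partition_ddl(parent, modulus):
    """One partition per remainder 0..modulus-1 of a hash-partitioned parent."""
    return [
        f"CREATE TABLE {parent}_p{remainder} PARTITION OF {parent} "
        f"FOR VALUES WITH (MODULUS {modulus}, REMAINDER {remainder});"
        for remainder in range(modulus)
    ]


def fraction_moved(old_modulus, new_modulus, sample=range(1600)):
    """Share of sample hash values whose remainder changes with the modulus."""
    moved = sum(1 for h in sample if h % old_modulus != h % new_modulus)
    return moved / len(sample)


for statement in hash_partition_ddl("accounts", 8):
    print(statement)

# Going from 8 to 16 partitions changes the remainder for half of the sample.
print(fraction_moved(8, 16))
```

That reshuffling is what makes a later change of the partition count expensive, which is why picking a reasonable number up front pays off.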
I've checked the documentation, watched some presentations and read blogs, but I can't find examples of partitioning more than a single table in PostgreSQL, and that's what we need. Our tables are an insert-only audit trail with a master-detail structure, and we aim to solve our slow data removal problem, which is currently handled with DELETE.
The simplified structure and some queries are shown in the following fiddle: https://www.db-fiddle.com/f/2mRXT4wGjM2ZSftjgKyZce/46
The issue I'm investigating right now is how to query the detail table efficiently, whether in a JOIN or directly. Because the timestamp field is part of the partition key, I understand that using it in the query is essential. I don't understand why a JOIN cannot figure this out when timestamp equality is used in the ON clause (a couple of EXPLAIN examples are in the fiddle).
Then there are broader questions:
What is the generally recommended strategy for a case like this? We expect the timestamp to be essential for our queries, so it feels natural to use it as the partitioning key.
I've made a short experiment (so no real experience from it yet) and based the partitioning solely on id range. This seems to have one advantage: predictable partition sizes (more or less, depending on the size of variable-length columns, of course). It is possible to add check timestamp ... conditions on any full partition (and an open-interval check on the active one too!), which helps with partition pruning. A nice benefit is that the detail table needs only a single-column FK referencing master.id (and pruning may even work better during JOINs). Any ideas or experiences with something similar?
We would rather have time-based partitioning, which seems more natural, but it's not a hard requirement. The need to drag the timestamp into another table and into its FK, etc. makes it less compelling.
Obviously, we want both tables (all of them, to be precise, as we will have more detail table types) partitioned along the same range, be it id or timestamp. I guess not doing so defeats the whole purpose of partitioning, as we would not be able to remove the data related to the master partitions.
I welcome any pointers or ideas on how to do it properly. In the end we will decide for ourselves, but there is not much material to help with the decision right now. Thanks.
Your strategy is good. Partition related tables by the common timestamp and make sure that the partition boundaries are the same.
You probably didn't get the efficient partitionwise join because you didn't set enable_partitionwise_join to on. That parameter is turned off by default because it can consume substantial query planning time that you don't want to expend unless you know you can benefit.
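A rough sketch of what "the same partition boundaries" could look like in practice: one helper builds matching monthly partitions for both tables, and the session setting is turned on before the join. The table names master and detail are placeholders, not the names from the fiddle. As far as I know, the planner also needs the join condition to include the partition key for a partitionwise join to apply.

```python
from datetime import date

# Placeholder table names, assumed to be range-partitioned on the same timestamp.
TABLES = ("master", "detail")

# Session-level switch; it is off by default.
ENABLE_PARTITIONWISE_JOIN = "SET enable_partitionwise_join = on;"


def month_bounds(year, month):
    """Inclusive start and exclusive end of a calendar month."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start, end


def aligned_partition_ddl(tables, year, months):
    """Emit one partition per table for each month, all with identical bounds."""
    statements = []
    for month in months:
        start, end = month_bounds(year, month)
        for table in tables:
            statements.append(
                f"CREATE TABLE {table}_{start:%Y_%m} PARTITION OF {table} "
                f"FOR VALUES FROM ('{start.isoformat()}') TO ('{end.isoformat()}');"
            )
    return statements


for statement in aligned_partition_ddl(TABLES, 2021, [11, 12]):
    print(statement)
print(ENABLE_PARTITIONWISE_JOIN)
```

Generating both sets of partitions from one list of bounds keeps the tables from drifting apart, which also keeps dropping a master partition together with its detail partition straightforward.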
I need to maintain an audit table, and since the number of changes is going to be huge, I need an efficient way of dealing with the problem. The solution I have in mind is to record only the changed column in the audit table and partition it on the createdon column, quarterly or half-yearly.
I wanted to know if there is anything like Oracle's 'interval partitioning'. If not, how can I achieve it?
I want a new partition to be created automatically every 6 months as rows are inserted.
I am using postgres 11 as my db.
I do not think there is any magic configuration that makes your life easier on this point:
https://www.postgresql.org/docs/11/ddl-partitioning.html
If you want the tables created automatically, I think you have two major possibilities:
Check each row as it enters the parent ('mother') table to see whether it fits an existing partition (with a trigger; this could be a problem with a large volume of inserts).
Check once in a while that the partitions you will need in the future already exist. For this one, pg_partman is going to be your best ally.
As an example, a few years ago I built a partitioning mechanism when only declarative partitioning was available and adding pg_partman was not an option. With the trigger mechanism handling 15 million rows per month, it still works like a charm.
If you never want to hurt performance (especially if you do not know how large your system is going to grow), I recommend the same thing as the comment by a_horse_with_no_name: use pg_partman.
If you cannot use it, as was the case for me, adopt one of the two approaches (a trigger, or creating tables in advance with a cron task, for example).
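For the "create tables in advance" approach with half-yearly partitions, the date logic could look roughly like this. The parent name audit_log and the _h1/_h2 suffix are assumptions for illustration; the helper only builds the statement, and a cron task would run it for the upcoming half-year.

```python
from datetime import date


def half_year_bounds(day):
    """Inclusive start and exclusive end of the half-year that contains `day`."""
    if day.month <= 6:
        return date(day.year, 1, 1), date(day.year, 7, 1)
    return date(day.year, 7, 1), date(day.year + 1, 1, 1)


def half_year_partition_ddl(table, day):
    """CREATE statement for the half-year partition that would hold `day`."""
    start, end = half_year_bounds(day)
    half = 1 if start.month == 1 else 2
    return (
        f"CREATE TABLE IF NOT EXISTS {table}_{start.year}_h{half} "
        f"PARTITION OF {table} "
        f"FOR VALUES FROM ('{start.isoformat()}') TO ('{end.isoformat()}');"
    )


# Hypothetical audit table partitioned by RANGE on createdon.
print(half_year_partition_ddl("audit_log", date(2019, 8, 15)))
```

Running it a few weeks before each half-year boundary, with a date in the next period, keeps the partition ready before the first row arrives.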
My software runs a cronjob every 30 minutes, which pulls data from Google Analytics / Social networks and inserts the results into a Postgres DB.
The data looks like this:
url text NOT NULL,
rangeStart timestamp NOT NULL,
rangeEnd timestamp NOT NULL,
createdAt timestamp DEFAULT now() NOT NULL,
...
(various integer columns)
Since one query returns 10 000+ items, it's obviously not a good idea to store this data in a single table. At this rate, the cronjob will generate about 480 000 records a day and about 14.5 million a month.
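The arithmetic behind those estimates, as a quick check. It assumes exactly 10 000 rows per run and a 30-day month, both round numbers picked for illustration.

```python
ROWS_PER_RUN = 10_000          # lower bound quoted above
RUNS_PER_DAY = 24 * 60 // 30   # one cron run every 30 minutes

rows_per_day = ROWS_PER_RUN * RUNS_PER_DAY
rows_per_month = rows_per_day * 30  # assuming a 30-day month

print(RUNS_PER_DAY, rows_per_day, rows_per_month)
```

With a 30-day month this comes to 14.4 million rows, in line with the rough monthly figure above.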
I think the solution would be using several tables, for example I could use a specific table to store data generated in a given month: stats_2015_09, stats_2015_10, stats_2015_11 etc.
I know Postgres supports table partitioning. However, I'm new to this concept, so I'm not sure what's the best way to do this. Do I need partitioning in this case, or should I just create these tables manually? Or maybe there is a better solution?
The data will be queried later in various ways, and those queries are expected to run fast.
EDIT:
If I end up with 12-14 tables, each storing 10-20 million rows, Postgres should still be able to run SELECT statements quickly, right? Inserts don't have to be super fast.
Partitioning is a good idea under various circumstances. Two that come to mind are:
Your queries have a WHERE clause that can be readily mapped onto one or a handful of partitions.
You want a speedy way to delete historical data (dropping a partition is faster than deleting records).
Without knowledge of the types of queries that you want to run, it is difficult to say if partitioning is a good idea.
I think I can say that splitting the data into different tables is a bad idea because it is a maintenance nightmare:
You can't have foreign key references into the table.
Queries spanning multiple tables are cumbersome, so simple questions are hard to answer.
Maintaining tables becomes a nightmare (adding/removing a column).
Permissions have to be carefully maintained, if you have users with different roles.
In any case, the place to start is with Postgres's documentation on partitioning, which is here. I should note that Postgres's implementation is a bit more awkward than in other databases, so you might want to review the documentation for MySQL or SQL Server to get an idea of what it is doing.
Firstly, I would like to challenge the premise of your question:
Since one query returns 10 000+ items, it's obviously not a good idea to store this data in a single table.
As far as I know, there is no fundamental reason why the database would not cope fine with a single table of many millions of rows. At the extreme, if you created a table with no indexes, and simply appended rows to it, Postgres could simply carry on writing these rows to disk until you ran out of storage space. (There may be other limits internally, I'm not sure; but if so, they're big.)
The problems only come when you try to do something with that data, and the exact problems - and therefore exact solutions - depend on what you do.
If you want to regularly delete all rows which were inserted more than a fixed timescale ago, you could partition the data on the createdAt column. The DELETE would then become a very efficient DROP TABLE, and all INSERTs would be routed through a trigger to the "current" partition (or could even bypass it if your import script was aware of the partition naming scheme). SELECTs, however, would probably not be able to specify a range of createdAt values in their WHERE clause, and would thus need to query all partitions and combine the results. The more partitions you keep around at a time, the less efficient this would be.
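As a sketch of an import script that is "aware of the partition naming scheme", assuming monthly tables named like stats_2015_09 as suggested in the question (the prefix and the format are that assumption, nothing more):

```python
from datetime import datetime


def partition_for(created_at, prefix="stats"):
    """Name of the monthly partition that a row with this createdAt belongs to."""
    return f"{prefix}_{created_at:%Y_%m}"


# The import script could insert straight into this table and skip the trigger.
print(partition_for(datetime(2015, 9, 30, 23, 59)))
```

The script would then issue its INSERTs against that table name directly, which only works as long as the partition has already been created.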
Alternatively, you might examine the workload on the table and see that all queries either already do, or easily can, explicitly state a rangeStart value. In that case, you could partition on rangeStart, and the query planner would be able to eliminate all but one or a few partitions when planning each SELECT query. INSERTs would need to be routed through a trigger to the appropriate table, and maintenance operations (such as deleting old data that is no longer needed) would be much less efficient.
Or perhaps you know that once rangeEnd becomes "too old" you will no longer need the data, and can get both benefits: partition by rangeEnd, ensure all your SELECT queries explicitly mention rangeEnd, and drop partitions containing data you are no longer interested in.
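Dropping old partitions could be scripted along these lines. This is only a sketch: it assumes monthly partitions whose names end in _YYYY_MM, and the retention window of three months is an arbitrary example.

```python
from datetime import date


def expired_partition_drops(names, today, keep_months):
    """DROP TABLE statements for monthly partitions older than the retention window."""
    # Index months as year * 12 + (month - 1) so subtraction works across years.
    current = today.year * 12 + today.month - 1
    drops = []
    for name in sorted(names):
        year, month = (int(part) for part in name.rsplit("_", 2)[-2:])
        age_in_months = current - (year * 12 + month - 1)
        if age_in_months > keep_months:
            drops.append(f"DROP TABLE {name};")
    return drops


existing = ["stats_2015_06", "stats_2015_09", "stats_2015_11"]
for statement in expired_partition_drops(existing, date(2015, 11, 15), keep_months=3):
    print(statement)
```

Only stats_2015_06 is older than three months in this example, so it is the only table dropped.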
To borrow Linus Torvalds's terminology from git, the "plumbing" for partitioning is built into Postgres in the form of table inheritance, as documented here, but there is little in the way of "porcelain" other than examples in the manual. However, there is a very good extension called pg_partman which provides functions for managing partition sets based on either IDs or date ranges; it's well worth reading through the documentation to understand the different modes of operation. In my case, none quite matched, but forking that extension was significantly easier than writing everything from scratch.
Remember that partitioning does not come free, and if there is no obvious candidate for a column to partition by based on the kind of considerations above, you may actually be better off leaving the data in one table, and considering other optimisation strategies. For instance, partial indexes (CREATE INDEX ... WHERE) might be able to handle the most commonly queried subset of rows; perhaps combined with "covering indexes", where Postgres can return the query results directly from the index without reference to the main table structure ("index-only scans").