Postgres partitioning? - postgresql

My software runs a cronjob every 30 minutes, which pulls data from Google Analytics / Social networks and inserts the results into a Postgres DB.
The data looks like this:
url text NOT NULL,
rangeStart timestamp NOT NULL,
rangeEnd timestamp NOT NULL,
createdAt timestamp DEFAULT now() NOT NULL,
...
(various integer columns)
Since one query returns 10 000+ items, it's obviously not a good idea to store this data in a single table. At this rate, the cronjob will generate about 480 000 records a day and about 14.5 million a month.
I think the solution would be using several tables, for example I could use a specific table to store data generated in a given month: stats_2015_09, stats_2015_10, stats_2015_11 etc.
I know Postgres supports table partitioning. However, I'm new to this concept, so I'm not sure what's the best way to do this. Do I need partitioning in this case, or should I just create these tables manually? Or maybe there is a better solution?
The data will be queried later in various ways, and those queries are expected to run fast.
EDIT:
If I end up with 12-14 tables, each storing 10-20 millions rows, Postgres should be still able to run select statements quickly, right? Inserts don't have to be super fast.

Partitioning is a good idea under various circumstances. Two that come to mind are:
Your queries have a WHERE clause that can be readily mapped onto one or a handful of partitions.
You want a speedy way to delete historical data (dropping a partition is faster than deleting records).
Without knowledge of the types of queries that you want to run, it is difficult to say if partitioning is a good idea.
I think I can say that splitting the data into different tables is a bad idea because it is a maintenance nightmare:
You can't have foreign key references into the table.
Queries spanning multiple tables are cumbersome, so simple questions are hard to answer.
Maintaining tables becomes a nightmare (adding/removing a column).
Permissions have to be carefully maintained, if you have users with different roles.
In any case, the place to start is with Postgres's documentation on partitioning, which is here. I should note that Postgres's implementation is a bit more awkward than in other databases, so you might want to review the documentation for MySQL or SQL Server to get an idea of what it is doing.

Firstly, I would like to challenge the premise of your question:
Since one query returns 10 000+ items, it's obviously not a good idea to store this data in a single table.
As far as I know, there is no fundamental reason why the database would not cope fine with a single table of many millions of rows. At the extreme, if you created a table with no indexes, and simply appended rows to it, Postgres could simply carry on writing these rows to disk until you ran out of storage space. (There may be other limits internally, I'm not sure; but if so, they're big.)
The problems only come when you try to do something with that data, and the exact problems - and therefore exact solutions - depend on what you do.
If you want to regularly delete all rows which were inserted more than a fixed timescale ago, you could partition the data on the createdAt column. The DELETE would then become a very efficient DROP TABLE, and all INSERTs would be routed through a trigger to the "current" partition (or could even by-pass it if your import script was aware of the partition naming scheme). SELECTs, however, would probably not be able to specify a range of createAt values in their WHERE clause, and would thus need to query all partitions and combine the results. The more partitions you keep around at a time, the less efficient this would be.
Alternatively, you might examine the workload on the table and see that all queries either already do, or easily can, explicitly state a rangeStart value. In that case, you could partition on rangeStart, and the query planner would be able to eliminate all but one or a few partitions when planning each SELECT query. INSERTs would need to be routed through a trigger to the appropriate table, and maintenance operations (such as deleting old data that is no longer needed) would be much less efficient.
Or perhaps you know that once rangeEnd becomes "too old" you will no longer need the data, and can get both benefits: partition by rangeEnd, ensure all your SELECT queries explicitly mention rangeEnd, and drop partitions containing data you are no longer interested in.
To borrow Linus Torvald's terminology from git, the "plumbing" for partitioning is built into Postgres in the form of table inheritance, as documented here, but there is little in the way of "porcelain" other than examples in the manual. However, there is a very good extension called pg_partman which provides functions for managing partition sets based on either IDs or date ranges; it's well worth reading through the documentation to understand the different modes of operation. In my case, none quite matched, but forking that extension was significantly easier than writing everything from scratch.
Remember that partitioning does not come free, and if there is no obvious candidate for a column to partition by based on the kind of considerations above, you may actually be better off leaving the data in one table, and considering other optimisation strategies. For instance, partial indexes (CREATE INDEX ... WHERE) might be able to handle the most commonly queried subset of rows; perhaps combined with "covering indexes", where Postgres can return the query results directly from the index without reference to the main table structure ("index-only scans").

Related

Partitioning of related tables in PostgreSQL

I've checked documentation and saw some presentations, read blogs, but can't find examples of partitioning of more than a single table in PostgreSQL - and that's what we need. Our tables are insert only audit trail with master-detail structure and we aim to solve our problem with slow data removal problem, currently done using delete.
The simplified structure and some queries are shown in the following fiddle: https://www.db-fiddle.com/f/2mRXT4wGjM2ZSftjgKyZce/46
The issue I'm investigating right now is how to effectively query the detail table, be it in JOIN or directly. Because the timestamp field is part of the partition key I understand that using it in query is essential. I don't understand why JOIN is not able to figure this out when timestamp equality is used in ON clause (couple of explain examples are in the fiddle).
Then there are broader questions:
What is general recommended strategy for similar case? We expect that timestamp is essential for our query, so it feels natural to use it as partitioning key.
I've made a short experiment (so no real experiences from it yet) and based the partitioning solely on id range. This seems to have one advantage - predictable partition table sizes (more or less, depending on the size of variable columns, of course). It is possible to add check timestamp ... conditions on any full partition (and open interval check on active one too!) which helps with partition pruning. This has nice benefit that detail table needs single column FK referencing only master.id (perhaps even pruning better during JOINs). Any ideas or experiences with something similar?
We would rather have time-based partitioning, seems more natural, but it's not a hard condition. The need of dragging timestamp to another table and to its FK, etc. makes it less compelling.
Obviously, we want both tables (all, to be precise, we will have more detail table types) partitioned along the same range, be it id or timestamp. I guess not doing so beats the whole purpose of partitioning as we would not be able to remove data related to the master partitions.
I welcome any pointers or ideas on how to do it properly. In the end we will decide for ourselves, but there is not much material to help with the decision right now. Thanks.
Your strategy is good. Partition related tables by the common timestamp and make sure that the partition boundaries are the same.
You probably didn't get the efficient partitionwise join because you didn't set enable_partitionwise_join to on. That parameter is turned off by default because it can consume substantial query planning time that you don't want to expend unless you know you can benefit.

Criteria and strategy for partitioning a large table in Postgres

We are looking into migrating our app to a multi-tenant database. Currently, the app runs with one database per tenant. There are currently around 400 tenants. When combined, the largest table would have around 1 billion rows and would grow as tenants are added. Size by tenant varies wildly, with one tenant alone having 180 million records in that table, some having less than a million. There are a few other tables in the hundred millions, most tables would have much less. My main concerns revolve around planning for scalability for the large tables, and I'll focus on the largest one. The parameters for it are that it's a linking/many-to-many table with basic audit fields for created by and created date (though am questioning if those are even necessary for this one). Date/time is not relevant to this, this is an assignment table and applies at all times. Records can get deleted or inserted, not updated, some times in bulk, probably not frequently but can happen at any time. Data cardinality would be relatively high on both foreign keys I think, though I'm not sure what constitutes high cardinality as a ratio to total number of records. For some perspective, the tenant with 180 million records has around 100,000 distinct records for one foreign key and 165,000 for the other. Meanwhile, another client has around 180,000 records, with 500 distinct values in one field and 5000 in the other. So as I said, a lot of variability.
Would the kind of table I described above (billions of rows, high data cardinality, not time based, tenant segmented, bulk insert/deletes at any time) in the kind of scenario I described (400+ tenants with varying amounts of data) be a good candidate for partitioning? The reason I'm concerned about this now is that I've read in a number of places that partitioning is something that can be much less painful to deal with if you plan for it ahead of time rather than try to partition later after the table is huge and harder to work with without requiring down-time or jumping through hoops. At this point, my main concern is not so much querying the data, I tested with a table with 1 billion records and with a proper index select queries run very fast. I'm more worried about concurrency with the read/write/delete, running into blocking because of locks, etc. If partitioning is warranted, what would a good strategy be? Partition by tenant? Just partition large ones and keep smaller ones bundled together?
Given that you said that query performance is not an issue, the only reason I can think of to consider partitioning is to make mass purging easier to accomplish.
Do you have contractual or legal retention policies in place?
The most common scenario would be using time periods as your partition key so that rolling-off old data is simply a matter of dropping partitions, but since you clearly state that date/time is not relevant, I do not see how that would help.
Is it common for you to roll-on/roll-off individual customers? Is there a purging or retention requirement? If so, then partitioning by customer, no matter how imbalanced the partitions would be, would make sense since you could purge a large customer's data without affecting other customers' access to their data.
As for any concurrency issues, partitioning by customer should help contain these problems within a specific customer that is showing heavy activity.
I recommend testing this thoroughly for a few reasons:
I have not seen multiple active partitions in action because I have worked only with time series partitions
I have not looked deeply into PostgreSQL 12's foreign key enhancements and wonder whether a foreign key with a partitioned table on both sides would complicate dropping parititons
I have never explored the practical limits of the number of partitions a database could contain
I may be reading things from my experience into your question about partitioning, but have you considered a schema per customer?

Will one Foreign key to two tables cause confusion?

If I have table A, and table B, and I have data that start off in table A but end up in table B, and I have a table C which has a foreign key that points to the primary key of A, but when the data get removed from A and ends up in table B, it should point to B instead (having the same id as A's data did). Will this cause confusion. Heres and example to show what I mean:
A (Pending results)
id =3
B( Completed Results)
null
C(user)
id = 1
results id = 3 (foreign key to both A and B)
After three minutes, the results have been posted.
A (Pending results)
null
B( Completed Results)
id = 3
C(user)
id = 1
results id = 3 (foreign key to both A and B)
Is there anything wrong with this implementation. Or would it be better to have A and B as one table? The table could grow very large which is what I am worried about. As separate tables, the reads to table A would be far greater than the reads to table B and table A would be much smaller, as it is just pending results. If A and B were combined into one table, then it would be both pending and a history of all completed results, so finding the ones which are pending would take much more time I assume. All of this is being done is postgresql if that makes a difference.
So I guess my question is: Is this implementation fine for a medium scale, or given the information I just said, should I combine table A and B (Even though B will grow infinitely large whereas A only contains present data and is significantly smaller).
Sounds like you've already found that this does not work. I couldn't follow your example properly because "A", "B", and "C" never work for me. I suspect those kinds of formulaic labels are better than specifics for other people. You just can't win ;-) In any case, it sounds like you're facing a practical concern about table size, and are being tempted to use a design that splits a natural table into two parts. (Hot and old.) As you found, that doesn't really work with the keys in a system. The relational model (etc., etc.) doesn't have a concept for "this thing is a child of this or that." So, you're swimming up stream there. Regardless, this kind of setup is very commonplace in the wild, so much so that it's got a name. Well, several names. "Polymorphic Association" from Bill Karwin's SQL Anti-Patterns is common. That's a good book, and short, by the way. Similarly, "promiscuous association" is a term you'll see. Or sometimes you'll see the table itself listed as a "jump table", or a "hub", etc.
I suspect there's a reason this non-relational pattern is so widely used: It makes sense to humans. An area where the relational model is always a tight pinch is when you have things which are kinds of things. Like, people who are staff or student. So many fields in common, several that are distinct to their specific type. One table? Two? Three? Table inheritance in Postgres might help...at least it's trying to. Anyway, polymorphic relations are problematic in an RDBMS because they're not able to be modeled or constrained automatically. You need custom code to figure out that this record is a child of that table...or the other table. You can't bake that into the relations. If you're interested in various solutions to this design problem, Karwin's chapter is quite good, easy to read, and full of alternative designs. If you don't feel like tracking down the book but are a bit interested, check out this article from a few years ago:
https://hashrocket.com/blog/posts/modeling-polymorphic-associations-in-a-relational-database
Chances are, your interest right now is more day-to-day. Is sounds like you've got a processing pipeline with a few active records and an ever-increasing collection of older records. You don't mention your Postgres version, but you might have less to worry about than you imagine. First up, you could consider partitioning the table. A partitioned table has a single logical table that you talk to in your queries with a collection of smaller physical tables under the hood. You can get at the partitions directly, but you don't need to. You just talk to my_big_table and Postgres figures out where to look. So, you could split the data on week, month, etc. so that no one bucket every gets too big for you. In this case, the individual partitions have their own indexes too. So, you'll end up with smaller tables and smaller indexes under the hood. For this, you're best off using PG 11, or maybe PG 10. Partitioning is a big topic, and the Postgres feature set isn't a perfect match for every situation...you have to work within its limits. I'll leave it at that now as it's likely not what you need first.
Simpler than partitioning is an awesome Postgres feature you may not know about, partial indexes. This isn't unique to Postgres (SQL Server calls the same sort of feature a "filtered" index), but I don't think MySQL has it. Okay, the idea is really simple: Build an index that only includes rows that match a condition. Here's an example:
CREATE INDEX tasks_pending
ON tasks (status)
WHERE status = 'Pending'
If you're table has 100M records, a full B-tree has to catalog all 100M rows. You need that for a uniqueness check on a primary key...but it's big and expensive. Now imagine your 100M records have only 1,000 rows where status = pending. You've got an index with just those 1,000 rows. Tiny, fast, perfect. The beauty part here is that the partial index doesn't necessarily get bigger as your historical data set grows. And, shout out to historical data sets, they're very nice to have when you need to get aggregates, etc. in a simple search. If you split things into multiple tables, you'll need to write longer queries with UNION. (That wouldn't be the case with partitions where the physical division is masked by the logical partition master table.)
HTH

postgres many tables vs one huge table

I am using postgresql db.
my application manages many objects of the same type.
for each object my application performs intense db writing - each object has a line inserted to db at least once every 30 seconds. I also need to retrieve the data by object id.
my question is how it's best to design the database? use one huge table for all the objects (slower inserts) or use table for each object (more complicated retrievals)?
Tables are meant to hold a huge number of objects of the same type. So, your second option, that is one table per object, doesn't seem to look right. But of course, more information is needed.
My tip: start with one table. If you run into problems - mainly performance - try to split it up. It's not that hard.
Logically, you should use one table.
However, so called "write amplification" problem exhibited by PostgreSQL seems to have been one of the main reasons why Uber switeched from PostgreSQL to MySQL. Quote:
"For tables with a large number of secondary indexes, these
superfluous steps can cause enormous inefficiencies. For instance, if
we have a table with a dozen indexes defined on it, an update to a
field that is only covered by a single index must be propagated into
all 12 indexes to reflect the ctid for the new row."
Whether this is a problem for your workload, only measurement can tell - I'd recommend starting with one table, measuring performance, and then switching to multi-table (or partitioning, or perhaps switching the DBMS altogether) only if the measurements justify it.
A single table is probably the best solution if you are certain that all objects will continue to have the same attributes.
INSERT does not get significantly slower as the table grows – it is the number of indexes that slows down data modification.
I'd rather be worried about data growth. Do you have a design for getting rid of old data? Big DELETEs can be painful; sometimes partitioning helps.

ef-core - bool index vs historical table

I´m using aspnet-core, ef-core with sql server. I have an 'order' entity. As I'm expecting the orders table to be large and a the most frequent query would get the active orders only for certain customer (active orders are just a tiny fraction of the whole table) I like to optimize the speed of the query but I can decide from this two approaches:
1) I don't know if this is possible as I haven't done this before, but I was thinking about creating a Boolean column named 'IsActive' and make it an index thus when querying only Active orders would be faster.
2) When an order becomes not active, move the order to another table, i.e HistoricalOrders, thus keeping the orders table small.
Which of the two would have better results?, or none of this is a good solution and a third approach could be suggested?
If you want to partition away cold data then a leading boolean index column is a valid way to do that. That column must be added to all indexes that you want to hot/cold partition. This includes the clustered index. This is quite awkward. The query optimizer requires that you add a dummy predicate where IsActive IN (0, 1) to make it able to still seek on such indexes. Of course, this will now also touch the cold data. So you probably need to know the concrete value of IsActive or try the 1 value first and be sure that it matches 99% of the time.
Depending on the schema this can be impractical. I have never seen a good case for this but I'm sure it exists.
A different way to do that is to use partitioning. Here, the query optimizer is used to probing multiple partitions anyway but again you don't want it to probe cold data. Even if it does not find anything this will pull pages into memory making the partitioning moot.
The historical table idea (e.g. HistoricalOrders) is the same thing in different clothes.
So in order to make this work you need:
Modify all indexes that you are about (likely all), or partition, or create a history table.
Have a way to almost never need to probe the cold partition.
I think (2) kills it for most cases.
Among the 3 solutions I'd probably pick the indexing solution because it is simplest. If you are worried about people making mistakes by writing bad queries all the time, I'd pick a separate table. That makes mistakes hard but makes the code quite awkward.
Note, that many indexes are already naturally partitioned. Indexes on the identity column or on an increasing datetime column are hot at the end and cold elsewhere. An index on (OrderStatus INT, CreateDateTime datetime2) would have one hot spot per order status and be cold otherwise. So those are already solved.
Some further discussion.
Before think about the new table HistoricalOrders,Just create a column name IsActive and test it with your data.You don't need to make it as index column.B'cos Indexes eat up storage and it slows down writes and updates.So we must very careful when we create an index.When you query the data do it as shown below.On the below query where data selection (or filter) is done on the SQL srever side (IQueryable).So it is very fast.
Note : Use AsNoTracking() too.It will boost the perfromnace too.
var activeOrders =_context.Set<Orders>().Where(o => o.IsActive == true).AsNoTracking()
.ToList();
Reference : AsNoTracking()