I would like to partition a table with 1M+ rows by date range. How is this commonly done without requiring much downtime or risking losing data? Here are the strategies I am considering, but open to suggestions:
1. The existing table is the master and the children inherit from it. Over time, move data from the master to the children, but there will be a period when some of the data is in the master table and some is in the children.
2. Create a new master and child tables. Copy the data from the existing table into the child tables (so the data will reside in two places). Once the child tables have the most recent data, point all inserts going forward to the new master table and delete the existing table.
First you have to ask yourself whether partitioning the table is really warranted. Go through the partitioning documentation:
https://www.postgresql.org/docs/9.6/static/ddl-partitioning.html
Remember this very important piece of information about partitioning (from the link above):
The benefits will normally be worthwhile only when a table would otherwise be very large. The exact point at which a table will benefit from partitioning depends on the application, although a rule of thumb is that the size of the table should exceed the physical memory of the database server.
You can check the size of your table (including its indexes) with this SQL:
SELECT pg_size_pretty(pg_total_relation_size('<table_name>'))
If you are having performance problems, try reindexing or re-evaluating your indexes, and check your Postgres log for autovacuum activity.
A table with only 1M+ rows does not need partitioning.
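If it helps, here is a rough sketch of those checks (the table and index names are placeholders, adjust them to your schema):

-- When autovacuum last processed the table:
SELECT relname, last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'my_table';

-- Indexes that are rarely or never scanned (candidates for re-evaluation):
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'my_table'
ORDER BY idx_scan;

-- Rebuild a bloated index:
REINDEX INDEX my_table_some_idx;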
Related
Assume I'm doing everything in one PostgreSQL database. I have 10 source tables I'm using to create one huge denormalized table. These source tables change frequently and have triggers firing after insert/update/delete to modify the denormalized table in near-real-time. The problem is that some of the source tables I'm joining are huge (one table has 120M rows and another 25M), and the statements that insert new rows into the denormalized table run for a long time (20+ minutes for 50-100k rows).
So I was wondering what the best solution would be for applying changes (insert/update/delete) to this denormalized table based on the changes coming into the source tables. Should I run these operations on a schedule, should I dedicate a specific database replica just for this, or should I keep trying to use triggers?
I'm open to using a totally different approach, as long as it's doable on the same database.
It sounds like there is no good and simple solution.
Perhaps you don't need that one huge denormalized table, and denormalizing a few attributes would be good enough for your query speed.
If not, you will probably need a kind of data warehouse for the denormalized data and refresh it daily with increments. Ideally, the tables there are already pre-aggregated.
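If the data allows it, one way to sketch such a pre-aggregated store is a materialized view refreshed on a schedule (the table and column names below are made up, and note that REFRESH recomputes the whole view rather than applying true increments):

-- Pre-aggregated "warehouse" table as a materialized view:
CREATE MATERIALIZED VIEW daily_summary AS
SELECT o.customer_id,
       date_trunc('day', o.created_at) AS day,
       sum(oi.quantity * oi.price)     AS revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.id
GROUP BY o.customer_id, date_trunc('day', o.created_at);

-- A unique index is required for REFRESH ... CONCURRENTLY:
CREATE UNIQUE INDEX ON daily_summary (customer_id, day);

-- Run once a day (e.g. from cron or pg_cron); readers are not blocked:
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_summary;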
I have a script which imports data from CSV files and loads it into database tables. The tables are partitioned by a column like customer_id. The script first loads the data into a staging table, checks unique key/foreign key constraints, and deletes the rows that violate them.
Then it drops the existing partition from the main table and adds the staging table as a partition. My question is: is this the best approach to import data? Another approach I can think of is: don't use partitions, just use the staging table and, after cleaning the data, import it into the main table after deleting the existing data.
If you can load data partitionwise, that is going to be better.
Deleting many rows in a PostgreSQL table is painful, and DROP TABLE will always win.
Keep in mind that in PostgreSQL, a DELETE doesn't actually delete data from the table -- it merely flags the rows as invisible to later transactions. Later, when the table is vacuumed, those invisible rows are marked as re-usable for future INSERTs and UPDATEs. Only when a VACUUM FULL is run on the table is the space reclaimed. Therefore, importing into the main table after deleting the existing data would cause bloat.
A DROP TABLE will immediately reclaim space. As such, it would make more sense to work with partitions and drop partitions to reclaim space.
More information about this behavior can be found in the documentation.
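For illustration, a minimal sketch of the drop-and-attach flow, assuming declarative list partitioning on customer_id and made-up table names:

BEGIN;

-- Remove the old data for this customer instantly:
ALTER TABLE main_table DETACH PARTITION main_table_cust_42;
DROP TABLE main_table_cust_42;

-- Make the cleaned staging table the new partition.
-- Adding CHECK (customer_id = 42) to the staging table beforehand
-- lets PostgreSQL skip the validation scan during ATTACH.
ALTER TABLE main_table
    ATTACH PARTITION staging_cust_42 FOR VALUES IN (42);

COMMIT;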
I'm migrating from a proprietary DBMS to PostgreSQL. In the proprietary DBMS, "offlining" and "onlining" data partitions is a very lightweight operation. I'm looking to implement similar functionality with PostgreSQL by backing up and restoring individual tables (partitions). Obviously I need to avoid a performance regression, so my question is what the fastest way is of:
Backing up a table (partition), both data and indexes
Taking the table offline (meaning that the data is now gone from the database)
Restoring the table (partition), both data and indexes
Once I have some advice I can design more targeted performance comparisons. Thanks in advance for any pointers.
What is fast and needs to be fast is adding or removing a partition (ALTER TABLE ... ATTACH/DETACH PARTITION).
After you have detached the partition, you are in no great hurry to back up/export the data. This can be done comfortably with pg_dump.
Similarly, importing the data for a table that is to become a new partition is normally not time critical.
If you need this to happen faster (for example, you want the old partition to be visible in another database as soon as it is detached in the old one), you could use logical replication to replicate the partition to another PostgreSQL database before you detach it. As soon as replication has caught up, you detach or drop the original partition and attach the copy in the other database.
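A possible sequence, with placeholder names; the pg_dump/pg_restore invocations are shown as comments because they run from the shell, not from SQL:

-- 1. Take the partition "offline" (fast, metadata-only):
ALTER TABLE measurements DETACH PARTITION measurements_2023_q1;

-- 2. Back up the detached table, data and indexes, e.g. from the shell:
--      pg_dump -Fc -t measurements_2023_q1 mydb > measurements_2023_q1.dump
--    then drop it to remove the data from the database:
DROP TABLE measurements_2023_q1;

-- 3. Later, restore it from the shell:
--      pg_restore -d mydb measurements_2023_q1.dump
--    and put it back "online":
ALTER TABLE measurements
    ATTACH PARTITION measurements_2023_q1
    FOR VALUES FROM ('2023-01-01') TO ('2023-04-01');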
I have a table that stores information about weather for specific events and for specific timestamps. I do insert, update and select (more often than delete) on this table. All of my queries query on timestamp and event_id. Since this table is blowing up, I was considering doing table partitioning in postgres.
I could also think of having multiple tables named "table_<event_id>_<timestamp>" to store information for specific timestamps, instead of using Postgres declarative/inheritance partitioning. But I noticed that no one on the internet seems to have done or written about an approach like this. Is there something I am missing?
I see that in postgres partitioning, the data is both kept in master as well as child tables. Why keep in both places? It seems less efficient to do inserts and updates to me.
Is there a generic limit on the number of tables when postgres will start to choke?
Thank you!
re 1) Don't do it. Why re-invent the wheel when the Postgres devs have already done it for you by providing declarative partitioning?
re 2) You are mistaken. The data is only kept in the partition to which it belongs. It just looks as if it were stored in the "master".
re 3) There is no built-in limit, but anything beyond a "few thousand" partitions is probably too much. It will still work, but query planning in particular will be slower, and sometimes query execution may suffer as well because runtime partition pruning is no longer as efficient.
Given your description, you probably want to do hash partitioning on the event ID and then create range sub-partitions on the timestamp value (so each partition for an event is again partitioned on the range of the timestamps).
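A sketch of that layout, with assumed table/column names and four hash partitions (tune the modulus and time ranges to your data):

CREATE TABLE weather (
    event_id    bigint      NOT NULL,
    ts          timestamptz NOT NULL,
    temperature numeric,
    payload     jsonb
) PARTITION BY HASH (event_id);

-- One hash partition (repeat for remainders 1..3), itself range-partitioned:
CREATE TABLE weather_h0 PARTITION OF weather
    FOR VALUES WITH (MODULUS 4, REMAINDER 0)
    PARTITION BY RANGE (ts);

-- One monthly sub-partition of that hash partition:
CREATE TABLE weather_h0_2024_01 PARTITION OF weather_h0
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');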
Based on the above image, there are certain tables I want to be in the Internal Database (right hand side). The other tables I want to be replicated in the external database.
In reality there's only one set of values that SHOULD NOT be replicated across. The rest of the database can be replicated. Basically the actual price columns in the prices table cannot be replicated across. It should stay within the internal database.
Because the vendors are external to the network, they have no access to the internal app.
My plan is to create a replicated version of the same app and allow vendors to submit quotations and picking items.
Let's say the replicated tables are at least quotations and quotation_line_items. These tables should be writable (for INSERTs, UPDATEs, and DELETEs) in both the external and the internal database, and the data in them should be replicated across in both directions.
The data in the other tables are going to be replicated in a single direction (from internal to external) except for the actual raw prices columns in the prices table.
The quotation_line_items table will have a price_id column. However, the raw price values in the prices table should not appear in the external database.
Ultimately, I want the data in the replicated tables to be consistent on both databases. I do not need synchronous replication, so a bit of delay (say, a couple of seconds for the write operations) is fine.
I came across pglogical https://github.com/2ndQuadrant/pglogical/tree/REL2_x_STABLE
and they have the concept of PUBLISHER and SUBSCRIBER.
I cannot tell from the README which one would act as the publisher and which as the subscriber, or how to configure it for my situation.
That won't work. With the setup you are dreaming of, you will necessarily end up with replication conflicts.
How do you want to prevent data from being modified in conflicting ways in the two databases? If you say that won't happen, think again.
I believe that you would be much better off using a single database with two users: one that can access the “secret” table and one that cannot.
If you want to restrict access only to certain columns, use a view. Simple views are updateable in PostgreSQL.
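As a sketch (the column and role names are assumptions), hiding only the raw price columns could look like this; a simple single-table view like this is automatically updatable:

-- The sensitive columns stay in "prices"; the view exposes everything else:
CREATE VIEW prices_public AS
SELECT price_id, item_id, currency, valid_from   -- raw price columns omitted
FROM prices;

-- The vendor-facing role only ever sees the view:
REVOKE ALL ON prices FROM vendor_role;
GRANT SELECT, INSERT, UPDATE, DELETE ON prices_public TO vendor_role;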
It is possible with BDR replication, which uses pglogical. On a basic level it works by allocating ranges of key IDs to each node, so that writes are possible in both locations without conflict. However, BDR is now a commercial, paid-for product.
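The key-range idea can also be approximated by hand with staggered sequences, so the two nodes never generate the same primary key (a sketch with assumed names; it avoids key collisions, but not conflicting updates to the same row):

-- On the internal node (odd ids):
CREATE SEQUENCE quotations_id_seq START 1 INCREMENT 2;

-- On the external node (even ids):
CREATE SEQUENCE quotations_id_seq START 2 INCREMENT 2;

-- On both nodes:
CREATE TABLE quotations (
    id         bigint PRIMARY KEY DEFAULT nextval('quotations_id_seq'),
    vendor_id  bigint,
    created_at timestamptz DEFAULT now()
);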