Archive old data in Postgresql - postgresql

I'm looking for advice on the process I plan to use for DB archiving.
I have a database (DB-1) with 2 very large tables, one holding 25 GB of data and the other 20 GB. This causes major performance issues even though I have indexes.
So we are considering archiving the old data with the process below:
Clone a new database (DB-2) from the existing database (DB-1).
Delete the old data from DB-1 so that it keeps only the last 2 years of records. If I need the old data, I can connect to DB-2.
Every month, move the old data from DB-1 to DB-2 and delete the moved rows from DB-1.

That is the wrong approach.
What you are looking for is partitioning.
You can create range partitions covering one year each. To remove old data all you need to do is to drop the partition for the year(s) no longer needed.
If you need to keep the data for some reason, you can also just detach the partition from the table. Then the data is still "lying around", but no longer shows up in the (partitioned) table. You can query the (detached) partition directly to access that data. You could even move that (detached) partition to a slower hard disk to free up space on your fast disks, if you have more than one.
You might even find that partitioning alone already improves performance, but that depends a lot on your queries.
Note that you should use Postgres 11 for that, as partitioning wasn't that sophisticated in older versions.
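To make the yearly layout concrete, here is a minimal sketch that only builds the SQL text for it. The table name events and the column created_at are hypothetical, and the helper functions are illustrative, not part of any library. The generated statements use the declarative partitioning syntax of Postgres 10 and later.

```python
from datetime import date

# Hypothetical parent table; the partition key column is an assumption.
PARENT_DDL = (
    "CREATE TABLE events (id bigint, created_at date NOT NULL, payload text) "
    "PARTITION BY RANGE (created_at);"
)


def yearly_partition_ddl(parent: str, first_year: int, last_year: int) -> list:
    """Build one CREATE TABLE ... PARTITION OF statement per calendar year."""
    statements = []
    for year in range(first_year, last_year + 1):
        lower = date(year, 1, 1)
        upper = date(year + 1, 1, 1)  # the upper bound is exclusive
        statements.append(
            f"CREATE TABLE {parent}_{year} PARTITION OF {parent} "
            f"FOR VALUES FROM ('{lower}') TO ('{upper}');"
        )
    return statements


def retire_partition_ddl(parent: str, year: int, keep_data: bool) -> str:
    """Detach the partition to keep its rows, or drop it to delete them."""
    if keep_data:
        return f"ALTER TABLE {parent} DETACH PARTITION {parent}_{year};"
    return f"DROP TABLE {parent}_{year};"


if __name__ == "__main__":
    print(PARENT_DDL)
    for statement in yearly_partition_ddl("events", 2017, 2019):
        print(statement)
    print(retire_partition_ddl("events", 2017, keep_data=True))
```

After a detach, the table events_2017 still exists and can be queried by name, which matches the "lying around" case described above.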

While you should no doubt upgrade your current version (I'd suggest moving away from the EDB system you are working on now, and going to community based Postgres 11) even if you can't upgrade, partitioning is still a much better answer than creating a second database.
By recreating your table as a set of partitions within the same database, you will be able to add/remove data in a much cleaner fashion, and it will make dealing with Vacuums much easier. Even in 9.5, you can take advantage of table inheritance to build out partitions by first adding partitions for incoming data, and then creating partitions at various intervals (probably monthly, since you want to run monthly cleanup) and moving the data into those partitions. This can be accomplished atomically with a series of INSERT INTO partition SELECT * FROM table WHERE <timestamp> style statements.
I suspect you can probably manage this yourself (you need basic sql and the ability to write simple triggers/functions... here is a link to the 9.5 docs), but if you need help, you can engage with one of the Postgres chat communities, or contact a support company if you want a deeper dive.
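As a rough illustration of the inheritance approach, the sketch below builds the statements for moving one month of rows out of the parent table. The names events and created_at are hypothetical. Note that it swaps in a variation on the plain INSERT INTO ... SELECT mentioned above: a data-modifying CTE (DELETE ... RETURNING feeding an INSERT), so that each month is copied and removed from the parent in a single statement.

```python
from datetime import date


def next_month(d: date) -> date:
    """Return the first day of the month following d."""
    return date(d.year + (d.month == 12), d.month % 12 + 1, 1)


def move_month_sql(parent: str, ts_column: str, month: date) -> list:
    """Build the statements that move one month of rows into a child table.

    `month` must be the first day of the month to move.
    """
    lower = month
    upper = next_month(month)  # exclusive upper bound
    child = f"{parent}_{lower:%Y_%m}"
    in_range = f"{ts_column} >= '{lower}' AND {ts_column} < '{upper}'"
    return [
        # The child inherits the parent's columns; the CHECK constraint lets
        # the planner skip it when constraint exclusion applies.
        f"CREATE TABLE {child} (CHECK ({in_range})) INHERITS ({parent});",
        # ONLY restricts the delete to rows stored in the parent itself.
        f"WITH moved AS (DELETE FROM ONLY {parent} WHERE {in_range} RETURNING *) "
        f"INSERT INTO {child} SELECT * FROM moved;",
    ]


if __name__ == "__main__":
    for statement in move_month_sql("events", "created_at", date(2016, 3, 1)):
        print(statement)
```

Running both statements for a month inside one transaction keeps the move atomic from the point of view of other sessions. Routing of newly inserted rows (the trigger part) is not covered by this sketch.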

Related

Table partitioning in PostgreSQL 11 with automatic partition creation?

I need to maintain an audit table, and since the number of changes is going to be huge, I need an efficient way of dealing with the problem. The solution I have thought of is to record only the changed column in the audit table and partition it on the createdon column quarterly or half-yearly.
I wanted to know if there is anything like Oracle's 'interval partition'. If not, how can I achieve it?
I want a new partition to be created automatically every 6 months as rows are inserted.
I am using postgres 11 as my db.
I do not think there is any magic configuration that makes your life easier on this point:
https://www.postgresql.org/docs/11/ddl-partitioning.html
If you want the tables to be created automatically, I think you have two main options:
Check each row as it enters the parent ('mother') table to see whether it fits into an already existing partition (with a trigger; this could be a problem with a huge number of inserts).
Check once in a while that the partitions that will be needed in the future already exist. For this one, pg_partman is going to be your best ally.
As an example, a few years ago I built a partitioning mechanism at a time when only declarative partitioning was available and adding pg_partman was not an option. With the trigger mechanism and 15 million rows per month, it still works like a charm.
If you never want to hurt your performance (especially if you do not know how large your system is going to grow), I recommend the same thing as a_horse_with_no_name's comment: use pg_partman.
If you cannot use it, as was the case for me, adopt one of the two approaches: a trigger, or creating the tables in advance (with a cron task, for example).
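The "create tables in advance" option could look roughly like the sketch below, meant to be run periodically (from cron, for example). It only builds the SQL text for half-yearly partitions. The table name audit and the partition naming scheme are hypothetical, and this is not how pg_partman itself works.

```python
from datetime import date


def half_year_start(d: date) -> date:
    """Return the first day of the half-year (Jan-Jun or Jul-Dec) containing d."""
    return date(d.year, 1 if d.month <= 6 else 7, 1)


def next_half_year(start: date) -> date:
    """Return the first day of the half-year after the one starting at `start`."""
    if start.month == 1:
        return date(start.year, 7, 1)
    return date(start.year + 1, 1, 1)


def upcoming_partitions_ddl(parent: str, today: date, periods_ahead: int = 2) -> list:
    """Build DDL for the current half-year partition plus `periods_ahead` more."""
    statements = []
    lower = half_year_start(today)
    for _ in range(periods_ahead + 1):
        upper = next_half_year(lower)  # exclusive upper bound
        half = 1 if lower.month == 1 else 2
        name = f"{parent}_{lower.year}_h{half}"
        # IF NOT EXISTS makes the job safe to run repeatedly.
        statements.append(
            f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF {parent} "
            f"FOR VALUES FROM ('{lower}') TO ('{upper}');"
        )
        lower = upper
    return statements


if __name__ == "__main__":
    for statement in upcoming_partitions_ddl("audit", date.today()):
        print(statement)
```

Creating partitions a period or two ahead means an insert never has to wait for a partition to appear, which is the main advantage over the trigger approach.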

What are some strategies to efficiently store a lot of data (millions of rows) in Postgres?

I host a popular website and want to store certain user events to analyze later. Things like: clicked on item, added to cart, removed from cart, etc. I imagine about 5,000,000+ new events would be coming in every day.
My basic idea is to take the event, and store it in a row in Postgres along with a unique user id.
What are some strategies to handle this much data? I can't imagine one giant table is realistic. I've had a couple people recommend things like: dumping the tables into Amazon Redshift at the end of every day, Snowflake, Google BigQuery, Hadoop.
What would you do?
I would partition the table, and as soon as you don't need the detailed data in the live system, detach a partition and export it to an archive and/or aggregate it and put the results into a data warehouse for analyses.
We have a similar use case with PostgreSQL 10 and 11. We collect different metrics from customers' websites.
We have several partitioned tables for different data, and together they collect more than 300 million rows per day, i.e. 50-80 GB of data daily. On some special days it is even 2x-3x more.
The collecting database keeps data for the current and the previous day (because, especially around midnight, there can be a big mess with timestamps from different parts of the world).
On the previous PG 9.x versions we transferred the data once per day to our main PostgreSQL warehouse DB (currently 20+ TB). We have now implemented logical replication from the collecting database into the warehouse, because syncing whole partitions had lately become really heavy and slow.
Besides that, we copy new data to BigQuery daily for really heavy analytical processing that would take something like 24+ hours on PostgreSQL (real-life results, trust me). On BQ we get results in minutes, but sometimes pay a lot for it...
So daily partitions are a reasonable segmentation. With logical replication in particular, you do not need to worry. From our experience, I would recommend not doing any exports to BQ etc. from the collecting database, only from the warehouse.

postgres many tables vs one huge table

I am using a PostgreSQL DB.
My application manages many objects of the same type.
For each object, my application performs intense DB writing: each object has a row inserted into the DB at least once every 30 seconds. I also need to retrieve the data by object id.
My question is how best to design the database. Should I use one huge table for all the objects (slower inserts) or a table for each object (more complicated retrievals)?
Tables are meant to hold a huge number of objects of the same type. So, your second option, that is one table per object, doesn't seem to look right. But of course, more information is needed.
My tip: start with one table. If you run into problems - mainly performance - try to split it up. It's not that hard.
Logically, you should use one table.
However, the so-called "write amplification" problem exhibited by PostgreSQL seems to have been one of the main reasons why Uber switched from PostgreSQL to MySQL. Quote:
"For tables with a large number of secondary indexes, these
superfluous steps can cause enormous inefficiencies. For instance, if
we have a table with a dozen indexes defined on it, an update to a
field that is only covered by a single index must be propagated into
all 12 indexes to reflect the ctid for the new row."
Whether this is a problem for your workload, only measurement can tell - I'd recommend starting with one table, measuring performance, and then switching to multi-table (or partitioning, or perhaps switching the DBMS altogether) only if the measurements justify it.
A single table is probably the best solution if you are certain that all objects will continue to have the same attributes.
INSERT does not get significantly slower as the table grows – it is the number of indexes that slows down data modification.
I'd rather be worried about data growth. Do you have a design for getting rid of old data? Big DELETEs can be painful; sometimes partitioning helps.

SQL Server stats and indexes after a partition swap

This question might be a bit too general but I thought I would ask. I'm working with a terabyte scale data warehouse in SQL Server 2008 R2. There is a large fact table with data going back 5 years. I have aggregated a lot of this old data to a different table at a higher level of granularity. The next step is to remove the old data from my fact table.
I've decided that partition swapping is probably the best way to go to remove the older rows from the fact table and put them in an archive table, but I was wondering what a partition swap will do to stats and indexes on my fact table? Should I consider manually updating statistics after a partition swap? (auto update is set to off), will my indexes be fragmented and need reorganising or rebuilding?
Thanks for your help!
Partition switching is a metadata operation, so it's not going to cause fragmentation: no physical data is actually moving, just the logical references to it.
You should probably be updating statistics on a large table regularly, but it's not especially needed after a partition switch.

How to verify large postgresql Databases running different version have the same data without dumping

How would I verify that the data in an 8.3 PostgreSQL DB is the same as the data in a 9.0 DB?
When I did an SQL dump of an example table, many differences showed up, but these were due to 9.0 truncating 0's at the end and beginning of date fields. The order of the dump was also not fixed; that can be sorted out with sort (no pun intended), but it does not allow validation, because a sorted SQL dump loses track of which table each line was part of and becomes a meaningless splat of SQL commands with dump settings thrown in for extra.
count(*) is also not adequate.
I would like to be 100% sure that the data in one is equal to the data in the other, despite the version differences and the way that, at the very least, dates are held in 9.0.
I should add that I have several hundred tables and many hundred GB of data, so I need an automated process like diff DUMPa.sql DUMP2.sql. A SHA of the data (not the format) would be ideal, but one cannot diff binary dumps of PostgreSQL for well-known reasons. I am aware MySQL has a checksum feature, but I'm using PostgreSQL.
First the bad news. There is really no way to address all of your concerns without loading all the data into an intermediary program and comparing it directly. This will take time and it will drag your system down load-wise, so my recommendation is to set up some sort of replication and compare the replicas.
One thing you might be able to do is to use something like Slony or Bucardo to replicate, and then triggers to move data into secondary child partitions and replicate those onto a consolidated server for comparison. You could then compare within PostgreSQL. This would reduce the load and it would mean your reporting data would be relatively easy to manage compared to other approaches. But all the data is going to have to be loaded and compared line-by-line.
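As a rough sketch of what the intermediary program could compute, the code below builds a checksum over the values of a table rather than over the dump text, so that row order and formatting differences (such as trailing zeros) do not matter. It assumes the rows arrive as tuples of Python values, for example from a database driver cursor; the normalization rules are illustrative and would need to be adapted to the column types actually in use. It still has to read every row, and equal checksums are strong evidence of equal data, not a proof.

```python
import hashlib
from datetime import date, datetime
from decimal import Decimal


def normalize(value) -> str:
    """Render a value in one canonical text form, independent of dump format."""
    if isinstance(value, datetime):  # checked before date: datetime is a subclass
        return value.isoformat(timespec="microseconds")
    if isinstance(value, date):
        return value.isoformat()
    if isinstance(value, Decimal):
        return format(value.normalize(), "f")  # 1.50 and 1.5 both become "1.5"
    return str(value)


def row_digest(row) -> int:
    """Hash one row; fields are length-prefixed so they cannot run together."""
    h = hashlib.sha256()
    for value in row:
        if value is None:
            h.update(b"-1:")  # NULL gets a marker no real value can produce
        else:
            text = normalize(value).encode("utf-8")
            h.update(str(len(text)).encode("ascii") + b":" + text)
    return int.from_bytes(h.digest(), "big")


def table_checksum(rows) -> str:
    """Combine row digests by modular addition, so row order does not matter."""
    total = 0
    count = 0
    for row in rows:
        total = (total + row_digest(row)) % (1 << 256)
        count += 1
    return f"{count}:{total:064x}"
```

Computing one such checksum per table on each server and comparing the results narrows the line-by-line comparison down to the tables that actually differ.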