db partition vs data delete and import - postgresql

I have a script which imports data from CSV files and loads it into database tables. The tables are partitioned by a column such as customer_id. The script first loads the data into a staging table, checks the unique key/FK constraints, and deletes the rows that violate them.
Then it drops the existing partition from the main table and attaches this staging table as a partition. My question: is this the best approach to import data? The other approach I can think of is: don't use partitions at all, just use a staging table and, after cleaning the data, delete the existing data from the main table and import the cleaned data into it.

If you can load data partitionwise, that is going to be better.
Deleting many rows in a PostgreSQL table is painful, and DROP TABLE will always win.

Keep in mind that in PostgreSQL, a DELETE doesn't actually remove data from the table: it merely marks the rows as invisible to later transactions. When the table is later vacuumed, those invisible rows are flagged as reusable space for future INSERTs and UPDATEs, but the space is only returned to the operating system when a VACUUM FULL is run on the table. Therefore, "import[ing] it into the main table after deleting the existing data" would cause bloat.
A DROP TABLE, on the other hand, reclaims the space immediately. As such, it makes more sense to work with partitions and drop a partition when its data is replaced.
More information about this behavior can be found in the PostgreSQL documentation.
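A minimal sketch of the drop-and-attach workflow, assuming declarative partitioning (PostgreSQL 10 or later), LIST partitioning on customer_id, and hypothetical table and file names:

-- Load and validate one customer's data in a standalone staging table.
CREATE TABLE staging_customer_42 (LIKE main_table INCLUDING DEFAULTS INCLUDING CONSTRAINTS);
COPY staging_customer_42 FROM '/path/to/customer_42.csv' WITH (FORMAT csv, HEADER true);
DELETE FROM staging_customer_42 s
WHERE NOT EXISTS (SELECT 1 FROM customers c WHERE c.customer_id = s.customer_id);

-- Swap it in: drop the old partition, attach the staging table in its place.
BEGIN;
ALTER TABLE main_table DETACH PARTITION main_table_customer_42;
DROP TABLE main_table_customer_42;
ALTER TABLE main_table ATTACH PARTITION staging_customer_42 FOR VALUES IN (42);
COMMIT;

Adding a CHECK (customer_id = 42) constraint to the staging table before the ATTACH lets PostgreSQL skip the validation scan while it holds the lock on the parent table.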

Related

pg_repack and logical replication: any risk to missing out on changes from the table while running pg_repack?

As I understand it, pg_repack creates a temporary 'mirror' table (table B), copies the rows from the original table (table A) into it, re-indexes them, and then replaces the original with the mirror. The mirroring step creates a lot of noise with logical replication (a lot of inserts at once), so I'd like to exclude the mirror table from being replicated.
I'm a bit confused with what happens during the switch over though. Is there a risk with losing some changes? I don't think there is since all actual writes are still going to the original table before and after the switch, so it should be safe right?
We're running Postgres 10.7 on AWS Aurora, using wal2json as the output plugin for replication.
I have neither used pg_repack nor logical replication, but according to the pg_repack GitHub repository there is a possible issue when using pg_repack with logical replication; see
https://github.com/reorg/pg_repack/issues/135
To perform a repack, pg_repack will:
1. create a log table to record changes made to the original table.
2. add a trigger onto the original table, logging INSERTs, UPDATEs, and DELETEs into that log table.
3. create a new table containing all the rows in the old table.
4. build indexes on this new table.
5. apply all changes which have occurred in the log table to the new table.
6. swap the tables, including indexes and toast tables, using the system catalogs.
7. drop the original table.
In my experience, the log table keeps all the changes and applies them after the indexes are built; it is also used if pg_repack needs to roll back changes that were applied to the original table.
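For illustration, here is a rough sketch of the log-table-plus-trigger pattern described in those steps, using hypothetical table and column names (pg_repack's real implementation is written in C and swaps the tables via the system catalogs, which plain SQL cannot do):

-- Hypothetical names; assumes the original table has a bigint primary key "id".
CREATE TABLE repack_log (
    seq     bigserial PRIMARY KEY,
    op      text NOT NULL,      -- 'INSERT', 'UPDATE' or 'DELETE'
    row_id  bigint NOT NULL,    -- primary key of the affected row
    new_row jsonb               -- new row image for INSERT/UPDATE
);

CREATE FUNCTION repack_log_changes() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO repack_log (op, row_id) VALUES (TG_OP, OLD.id);
        RETURN OLD;
    END IF;
    INSERT INTO repack_log (op, row_id, new_row) VALUES (TG_OP, NEW.id, to_jsonb(NEW));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER repack_track_changes
AFTER INSERT OR UPDATE OR DELETE ON original_table
FOR EACH ROW EXECUTE PROCEDURE repack_log_changes();

-- Copy the rows, build indexes, replay repack_log against the copy,
-- then swap the two tables and drop the original.
CREATE TABLE original_table_new (LIKE original_table INCLUDING ALL);
INSERT INTO original_table_new SELECT * FROM original_table;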

How to partition existing table in postgres?

I would like to partition a table with 1M+ rows by date range. How is this commonly done without requiring much downtime or risking losing data? Here are the strategies I am considering, but open to suggestions:
1. The existing table is the master and children inherit from it. Over time, move data from the master to the children, but there will be a period of time where some of the data is in the master table and some in the children.
2. Create new master and child tables. Create a copy of the data from the existing table in the child tables (so the data will reside in two places). Once the child tables have the most recent data, change all inserts going forward to point to the new master table and delete the existing table.
First you have to ask yourself whether partitioning the table is really warranted. Go through the partitioning documentation:
https://www.postgresql.org/docs/9.6/static/ddl-partitioning.html
Remember this very important piece of information about partitioning (from the link above):
The benefits will normally be worthwhile only when a table would otherwise be very large. The exact point at which a table will benefit from partitioning depends on the application, although a rule of thumb is that the size of the table should exceed the physical memory of the database server.
You can check the size of your table (including its indexes and TOAST data) with this SQL:
SELECT pg_size_pretty(pg_total_relation_size('<table_name>'));
If you are having performance problems, try re-indexing or re-evaluating your indexes, and check your Postgres log for autovacuum activity.
1M+ rows do not need partitioning.
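If partitioning does turn out to be warranted, here is a minimal sketch of strategy 2 using declarative partitioning (PostgreSQL 10 or later; the question and the link above describe the older inheritance-based approach), with hypothetical table and column names:

-- "events" is the existing table; partition the copy by a date column.
CREATE TABLE events_new (LIKE events INCLUDING DEFAULTS)
    PARTITION BY RANGE (created_at);

CREATE TABLE events_2017 PARTITION OF events_new
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');
CREATE TABLE events_2018 PARTITION OF events_new
    FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');

-- Copy the data, then swap the names so new writes go to the partitioned table.
-- In a live system, rows written to events during the copy would also need to be synced.
BEGIN;
INSERT INTO events_new SELECT * FROM events;
ALTER TABLE events RENAME TO events_old;
ALTER TABLE events_new RENAME TO events;
COMMIT;

-- Drop events_old once the copy has been verified.

Note that on PostgreSQL 10 indexes have to be created on each partition separately; from version 11 onward they can be created once on the parent.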

Postgres SET UNLOGGED takes a long time

Running Postgres 9.5. I have a large table that I'm running ALTER TABLE table SET UNLOGGED on. I have already dropped all foreign key constraints targeting the table, since a table referenced by foreign keys from permanent tables can't be made unlogged. The query took about 20 minutes and consumed 100% CPU the whole time. I can understand it taking a long time to make a table logged, but making it unlogged doesn't seem like it should be difficult... but is it?
Is there anything I could do to make it faster to set a table unlogged?
SET UNLOGGED involves a table rewrite, so for a large table, you can expect it to take quite a while.
As you said, it doesn't seem like making a table UNLOGGED should be that difficult. And simply converting the table isn't that difficult; the complicating factor is the need to make it crash-safe. An UNLOGGED table has an additional file associated with it (the init fork), and there's no way to synchronise the creation of this file with the rest of the commit.
So instead, SET UNLOGGED builds a copy of the table, with an init fork attached, and then swaps in the new relfilenode, which the commit can handle atomically. A more efficient implementation would be possible, but not without changing the representation of unlogged tables (which predate SET UNLOGGED by quite a while) or the logic behind COMMIT itself, both of which were deemed too intrusive for this relatively minor feature. You can read the discussion behind the design on the pgsql-hackers list.
If you really need to minimise downtime, you could take a similar approach to that taken by SET UNLOGGED: create a new UNLOGGED table, copy all of the records across, briefly lock the old table while you sync the last few changes, and swap the new table in with a RENAME when you're done.
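A rough sketch of that manual approach, with hypothetical names (indexes and constraints come across via INCLUDING ALL, but any rows changed between the bulk copy and the lock still have to be synced by hand):

CREATE UNLOGGED TABLE big_table_unlogged (LIKE big_table INCLUDING ALL);

-- Bulk-copy the existing rows while big_table stays available.
INSERT INTO big_table_unlogged SELECT * FROM big_table;

-- Briefly block writers, sync the last few changes, and swap the names.
BEGIN;
LOCK TABLE big_table IN ACCESS EXCLUSIVE MODE;
-- ... copy any rows that changed since the bulk copy ...
ALTER TABLE big_table RENAME TO big_table_old;
ALTER TABLE big_table_unlogged RENAME TO big_table;
COMMIT;

-- Drop big_table_old once everything checks out.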

Adding Default value to oracle database with high volumes of data

I am trying to add a new column to a table with upwards of 9 million records.
The issue is that the column needs a default value of 'N'. When updating the table, the database is running into problems with the temp space filling up, and it is taking a huge amount of time.
I was wondering if anyone knows of any way to make this faster, or a better way of doing this that avoids the problem of the temp space filling up.
The database is Oracle 10g.
If you could move to 11g and the column were NOT NULL, Oracle has an optimization where the default value doesn't need to be stored in each row, so you can add the column very quickly. Unfortunately, it sounds like you're stuck with a deprecated version of Oracle where that isn't available.
Most likely, you don't have a lot of really good options other than waiting. Assuming you're doing this during a period of downtime, it may be more efficient to create a new table with the new column, do a direct-path insert of all the data from the old table into the new table, rename the tables, and re-point any constraints at the new table. Whether this is actually more efficient than waiting for the UPDATE will depend on your hardware and your table, but an INSERT is likely to be more efficient than an UPDATE. On the other hand, for a new single-character column that isn't going to create a lot of migrated rows, you're probably better off waiting for the UPDATE rather than going to this level of effort: there are a lot of things that could potentially go wrong that you'd need to test and validate (e.g. making sure that you updated all the constraints correctly).
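For reference, a rough sketch of that copy-and-rename approach in Oracle SQL, with hypothetical names (indexes, grants, triggers, and constraints on the old table would all need to be recreated on the new one). On 11g, the fast path mentioned above would simply be ALTER TABLE big_table ADD (flag_col CHAR(1) DEFAULT 'N' NOT NULL).

-- Empty copy of the table, plus the new defaulted column.
CREATE TABLE big_table_new AS SELECT * FROM big_table WHERE 1 = 0;
ALTER TABLE big_table_new ADD (flag_col CHAR(1) DEFAULT 'N' NOT NULL);

-- Direct-path insert of the existing rows, filling in the default as we go.
INSERT /*+ APPEND */ INTO big_table_new SELECT t.*, 'N' FROM big_table t;
COMMIT;

-- Swap the names once indexes and constraints are in place on the new table.
ALTER TABLE big_table RENAME TO big_table_old;
ALTER TABLE big_table_new RENAME TO big_table;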

Most efficient way of bulk loading unnormalized dataset into PostgreSQL?

I have loaded a huge CSV dataset (Eclipse's Filtered Usage Data) using PostgreSQL's COPY, and it's taking a huge amount of space because it's not normalized: three of the TEXT columns would be much more efficiently refactored into separate tables, to be referenced from the main table with foreign key columns.
My question is: is it faster to refactor the database after loading all the data, or to create the intended tables with all the constraints, and then load the data? The former involves repeatedly scanning a huge table (close to 10^9 rows), while the latter would involve doing multiple queries per CSV row (e.g. has this action type been seen before? If not, add it to the actions table, get its ID, create a row in the main table with the correct action ID, etc.).
Right now each refactoring step is taking roughly a day or so, and the initial loading also takes about the same time.
From my experience, you want to get all the data you care about into a staging table in the database and go from there; after that, do as much set-based logic as you can, most likely via stored procedures. When you load into the staging table, don't have any indexes on it. Create the indexes after the data is loaded.
Check out this link for some tips: http://www.postgresql.org/docs/9.0/interactive/populate.html
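To illustrate the set-based approach, here is a minimal sketch with hypothetical table and column names, normalizing a single TEXT column (action) out of the staging table:

-- Staging table loaded with COPY, no indexes yet.
CREATE TABLE actions (
    action_id serial PRIMARY KEY,
    name      text NOT NULL UNIQUE
);

-- One pass to build the lookup table from the distinct values.
INSERT INTO actions (name)
SELECT DISTINCT action FROM staging;

-- One pass to populate the normalized main table via a join,
-- instead of one lookup query per CSV row.
CREATE TABLE events (
    event_id  bigserial PRIMARY KEY,
    action_id int NOT NULL REFERENCES actions (action_id)
    -- ... the remaining, already-normalized columns ...
);

INSERT INTO events (action_id)
SELECT a.action_id
FROM   staging s
JOIN   actions a ON a.name = s.action;

-- Create further indexes only after the bulk load has finished.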