Any disadvantage to leaving a read only table in unlogged mode after bulk loading? - postgresql

Loading a large table (150GB+) from Teradata into PostgreSQL 9.6 on AWS. Conventional load was extremely slow so I set the table to "unlogged" and the load is going much faster. Once loaded, this database will be strictly read only for archival purposes. Is there any need to alter it back to logged when loaded? I understand that process takes a lot of time and would like to avoid that time if possible. we will be taking a backup of the data once this table is loaded and the data verified.
Edit: I should note I am using the copy command reading from a named pipe for this.

If you leave the table UNLOGGED, it will be empty after PostgreSQL recovers from a crash. So that is a bad idea if you need these data.

Related

Is there a way to show everything that was changed in a PostgreSQL database during a transaction?

I often have to execute complex sql scripts in a single transaction on a large PostgreSQL database and I would like to verify everything that was changed during the transaction.
Verifying each single entry on each table "by hand" would take ages.
Dumping the database before and after the script to plain sql and using diff on the dumps isn't really an option since each dump would be about 50G of data.
Is there a way to show all the data that was added, deleted or modified during a single transaction?
Dude, What are you looking for is the most searchable thing on the internet when it comes to capturing Database changes. It is a kind of version control we can say.
But as long as I know, sadly there are no in-built approaches are available in PostgreSQL or MySql. But you can overcome it by setting/adding some triggers for your most usable operations.
You can create some backup schemas, and tables to capture your changes that are changed(updated), created, or deleted.
In this way you can achieve what you want. I know this process is fully manual, But really effective.
If you need to analyze the script's behaviour only sporadically, then the easiest approach would be to change server configuration parameter log_min_duration_statement to 0 and then back to any value it had before the analysis. Then all of the script activity will be written to the instance log.
This approach is not suitable if your storage is not prepared to accommodate this amount of data, or for systems in which you don't want sensitive client data to be written to a plain-text log file.

PostgreSQL how to find what is causing Deadlock in vacuum when using --jobs parameter

How to find in PostgreSQL 9.5 what is causing deadlock error/failure when doing full vacuumdb over database with option --jobs to run full vacuum in parallel.
I just get some process numbers and table names... How to prevent this so I could successfuly do full vacuum over database in parallel?
Completing a VACUUM FULL under load is a pretty hard task. The problem is that Postgres is contracting space taken by the table, thus any data manipulation interferes with that.
To achieve a full vacuum you have these options:
Lock access to the vacuumed table. Not sure if acquiring some exclusive lock will help, though. You may need to prevent access to the table on application level.
Use a create new table - swap (rename tables) - move data - drop original technique. This way you do not contract space under the original table, you free it by simply dropping the table. Of course you are rebuilding all indexes, redirecting FKs, etc.
Another question is: do you need to VACUUM FULL? The only thing it does that VACUUM ANALYZE does not is contracting the table on the file system. If you are not very limited by disk space you do not need doing a full vacuum that much.
Hope that helps.

Dump postgres data with indexes

I've got a Postgres 9.0 database which frequently I took data dumps of it.
This database has a lot of indexes and everytime I restore a dump postgres starts background task vacuum cleaner (is that right?). That task consumes much processing time and memory to recreate indexes of the restored dump.
My question is:
Is there a way to dump the database data and the indexes of that database?
If there is a way, will worth the effort (I meant dumping the data with the indexes will perform better than vacuum cleaner)?
Oracle has some the "data pump" command a faster way to imp and exp. Does postgres have something similar?
Thanks in advance,
Andre
If you use pg_dump twice, once with --schema-only, and once with --data-only, you can cut the schema-only output in two parts: the first with the bare table definitions and the final part with the constraints and indexes.
Something similar can probably be done with pg_restore.
Best Practice is probably to
restore the schema without indexes
and possibly without constraints,
load the data,
then create the constraints,
and create the indexes.
If an index exists, a bulk load will make PostgreSQL write to the database and to the index. And a bulk load will make your table statistics useless. But if you load data first, then create the index, the stats are automatically up to date.
We store scripts that create indexes and scripts that create tables in different files under version control. This is why.
In your case, changing autovacuum settings might help you. You might also consider disabling autovacuum for some tables or for all tables, but that might be a little extreme.

How do I ALTER a set of partitioned tables in Postgres?

I created a set of partitioned tables in Postgres, and started inserting a lot of rows via the master table. When the load process blew up on me, I realized I should have declared the id row BIGSERIAL (BIGINT with a sequence, behind the scenes), but inadvertently set it as SERIAL (INTEGER). Now that I have a couple of billion rows loaded, I am trying to ALTER the column to BIGINT. The process seems to be working, but is taking a long time. So, in reality, I don't really know if it is working or it is hung. I'd rather not restart the entire load process again.
Any suggestions?
When you update a row to alter it in PostgreSQL, that writes out a new copy of the row and then does some cleanup later to remove the original. This means that trying to fix the problem by doing updates can take longer than just loading all the data in from scratch again--it's more disk I/O than loading a new copy, and some extra processing time too. The only situation where you'd want to do an update instead of a reload is when the original load was very inefficient, for example if a slow client programs is inserting the data and it's the bottleneck on the process.
To figure out if the process is still working, see if it's using CPU when you run top (UNIX-ish systems) or the Task Manager (Windows). On Linux, "top -c" will even show you what the PostgreSQL client processes are doing. You probably just expected it to take less time than the original load, which it won't, and it's still running rather than hung up.
Restart it (clarifying edit: restart the entire load process again).
Altering a column value requires a new row version, and all indexes pointing to the old version to be updated to point to the new version.
Additionally, see how much of the advise on populating databases you can follow.
Correction from #archnid:
altering the type of the column will trigger a table rewrite, so the row versioning isn't a big problem, but it will still take lots of disk space temporarily. you can usually monitor progress by looking at which files in the database directory are being appended to...

How to prevent Write Ahead Logging on just one table in PostgreSQL?

I am considering log-shipping of Write Ahead Logs (WAL) in PostgreSQL to create a warm-standby database. However I have one table in the database that receives a huge amount of INSERT/DELETEs each day, but which I don't care about protecting the data in it. To reduce the amount of WALs produced I was wondering, is there a way to prevent any activity on one table from being recorded in the WALs?
Ran across this old question, which now has a better answer. Postgres 9.1 introduced "Unlogged Tables", which are tables that don't log their DML changes to WAL. See the docs for more info, but at least now there is a solution for this problem.
See Waiting for 9.1 - UNLOGGED tables by depesz, and the 9.1 docs.
Unfortunately, I don't believe there is. The WAL logging operates on the page level, which is much lower than the table level and doesn't even know which page holds data from which table. In fact, the WAL files don't even know which pages belong to which database.
You might consider moving your high activity table to a completely different instance of PostgreSQL. This seems drastic, but I can't think of another way off the top of my head to avoid having that activity show up in your WAL files.
To offer one option to my own question. There are temp tables - "temporary tables are automatically dropped at the end of a session, or optionally at the end of the current transaction (see ON COMMIT below)" - which I think don't generate WALs. Even so, this might not be ideal as the table creation & design will be have to be in the code.
I'd consider memcached for use-cases like this. You can even spread the load over a bunch of cheap machines too.