Backing up the DB vs. backing up the VM - postgresql

We're serving a Django/Postgres site running on a VM hypervisor. We're now trying to figure out our back up strategy and have two probable options:
Back up the DB directly using pg_dump
Back up the VM directly by copying the VM image
I'm with the latter as I think, I could simply back up everything that has to do with the site. I'm not sure whether I have to shut down the VM for this though.
What is a better and more recommended way of backing up a DB? Are there any reasons for not using the VM backup?
Thanks

The question basically boils down to, can you consider a hot copy of PostgreSQL's data files a backup?
The answer is: not really. PostgreSQL tries very hard through the use of WAL to ensure that its files are in a consistent state all the time and that it can survive a power failure, but starting it up from a copy of these files puts PostgreSQL into recovery mode. If the backup happened at the wrong second and PostgreSQL can't recover from the state of these files, your backup is useless. You don't want your backup/restore mechanism to depend on the recovery mechanism (unless you're dealing with "crash only" software, which PostgreSQL is not).
The probability of PostgreSQL not being able to recover from these files is not high, but it's not zero either. The probability of PostgreSQL not being able to load an SQL dump that it made, on the other hand, is zero. I prefer backup choices with lower probabilities of failure. pg_dump was designed for doing backups.

PostgreSQL recommends using pg_dump for backups, as a file system (or VM) backup requires the database to be shut down (and has other drawbacks):
http://www.postgresql.org/docs/8.1/static/backup-file.html
Edit: Also, a pg_dump backup will be significantly smaller than a filesystem dump of the same database.

There is an additional option. With PostgreSQL you can make an online backup that allows you to snapshot the file system and maintain consistency. You can see details here:
http://www.postgresql.org/docs/9.0/static/continuous-archiving.html
We use this exact method for making backups when we run PostgreSQL in a VM.

Related

Postgres pg_dumpall consistency across databases that are backed up

Is there a way to backup all of the databases on a hyperscaler managed postgres server as of a certain time in order to maintain data consistency between the databases with either pg_dumpall, pg_dump or something else?
Background:
With the utilization of micro-services, an application may have many databases associated to it on a single hyperscaler managed postgres server. The hyperscalers do perform a functional snapshot backup; however, when a hyperscaler managed postgres server is accidentally deleted, the postgres backups are lost as well. These hyperscalers provide locks to prevent accidental deletes of a postgres server and mention that their support teams can be contacted to restore a deleted server, however, we still had a postgres server get deleted. We were able to recover by contacting the hyperscalers support team but would like to have a second way of backing up a hyperscaler managed postgres server.
I realize that the micro-services should be able to auto-recover to a data consistent point but the reality is that many of the micro-services have not been designed nor written to that requirement. I really do not want to get into the aspect of micro-service design and want to retain this to be a DBA backup question.
pg_dumpall will not take a consistent backup across all databases. Each database backup will be consistent, but the snapshots for the backups of the different databases will be taken at different times.
If you need a consistent backup across several databases in a single cluster, use an online file system backup with pg_basebackup.
You can use pg_basebackup which most of the PostgreSQL DBA’s would end up using on a daily basis be it via scripts or manually. This creates the base backup of the database which can help in recovering in multiple situations. This takes an online backup of the database and hence is very useful when being used in production
you can review this for more details

Database restore from a hacked system

A linux VM with postgres 9.4 was hacked into. (Two processes taking 100% cpu, weird files in /tmp, did not reoccur after kill(s) and restart.) It was decided to install the system from scratch on a new machine (with postgres 9.6). The only data needed was in one of postgres databases. A pg_dump of the database was made after the attack.
Regardless of whether the data - the tables/rows/etc. - were modified during the attack: is it safe to restore the database in the new system?
I consider using pg_restore with the -O option (ignores the user permissions)
The two dangers are:
important data could have been modified
back doors could have been installed in your database
With the first, you're on your own how to verify that your data are ok. The safest thing would be to use a backup from before the machine was compromized, but this would mean data loss.
For the second, I would run a pg_dumpall -s and spend a day reading it carefully. Compare it with a dump from a backup made before the breach. Watch out for weird object and column names and functions with SECURITY DEFINER.

Can I copy the postgresql /base directory as a DB backup?

Don't shoot me, I'm only the OP!
When needing to backup our DB, we always are able to shutdown postgresql completely. After it is down, I found I could copy the "/base" directory with the binary data in it to another location. Checksum the accuracy and am later able to restore that if/when necessary. This has even worked when upgrading to a later version of postgresql. Integrity of the various 'conf' files is not an issue as that is done elsewhere (ie. by other processes/procedures!) in the system.
Is there any risk to this approach that I am missing?
The "File System Level Backup" link in Abelisto's comment is what JoeG is talking about doing. https://www.postgresql.org/docs/current/static/backup-file.html
To be safe I would go up one more level, to "main" on our ubuntu systems to take the snapshot, and thoroughly go through the caveats of doing file-level backups. I was tempted to post the caveats here, but I'd end up quoting the entire page.
The thing to be most aware of (in a 'simple' postgres environment ) is the relationship between the postgres database, a user database and the pg_clog and pg_xlog files. If you only get the "base" you lose the transaction and WAL information, and in more complex installations, other 'necessary' information.
If those caveat conditions listed do not exist in your environment, and you can do a full shutdown, this is a valid backup strategy, which can be much faster than a pg_dump.

Best method for PostgreSQL incremental backup

I am currently using pg_dump piped to gzip piped to split. But the problem with this is that all output files are always changed. So checksum-based backup always copies all data.
Are there any other good ways to perform an incremental backup of a PostgreSQL database, where a full database can be restored from the backup data?
For instance, if pg_dump could make everything absolutely ordered, so all changes are applied only at the end of the dump, or similar.
Update: Check out Barman for an easier way to set up WAL archiving for backup.
You can use PostgreSQL's continuous WAL archiving method. First you need to set wal_level=archive, then do a full filesystem-level backup (between issuing pg_start_backup() and pg_stop_backup() commands) and then just copy over newer WAL files by configuring the archive_command option.
Advantages:
Incremental, the WAL archives include everything necessary to restore the current state of the database
Almost no overhead, copying WAL files is cheap
You can restore the database at any point in time (this feature is called PITR, or point-in-time recovery)
Disadvantages:
More complicated to set up than pg_dump
The full backup will be much larger than a pg_dump because all internal table structures and indexes are included
Does not work well for write-heavy databases, since recovery will take a long time.
There are some tools such as pitrtools and omnipitr that can simplify setting up and restoring these configurations. But I haven't used them myself.
Also check out http://www.pgbackrest.org
pgBackrest is another backup tool for PostgreSQL which you should be evaluating as it supports:
parallel backup (tested to scale almost linearly up to 32 cores but can probably go much farther..)
compressed-at-rest backups
incremental and differential (compressed!) backups
streaming compression (data is compressed only once at the source and then transferred across the network and stored)
parallel, delta restore (ability to update an older copy to the latest)
Fully supports tablespaces
Backup rotation and archive expiration
Ability to resume backups which failed for some reason
etc, etc..
Another method is to backup to plain text and use rdiff to create incremental diffs.

Compile PostgreSQL with reverse endianness

My old iMac G5 died recently. I had a PostgreSQL instance running there, and I kept backups. But when I went to restore them in my newer Macbook Pro, I realized that I couldn't do the restore due to byte endianness differences.
Is there any compile flags that I can pass to PostgreSQL's configure script to use a reversed endianness on Intel, allowing me to restore from backup, then do a SQL dump, and restore it on a default endianness instance?
Actually, I don't even know if I need to recompile PostgreSQL, perhaps there is some setting that someone can point me to to allow me to do the restore?
No. If your backups were made with pg_dump they'd be endian-independent, even if you used the binary format.
If you took a copy of the data directory (with the database shutdown, or using pg_start_backup()/pg_stop_backup()) you need to restore on a machine with the same endianness. You're probably best off importing it in a virtual machine with the same endianness somewhere, then dump from there with pg_dump and reload on your new machine.
If you took the backup from disk without shutting down PostgreSQL, the same basic thing holds, except you're also going to have to try to recover from a corrupt backup. That can be little work or hundreds of hours or more work depending on how much traffic you had on the database, and how lucky you are.