Advice for Backup - google-cloud-sql

I have a cron job that runs on a stateless server. In this cron job, I am trying to take a snapshot of my GCP Cloud SQL Postgres database (PRODUCTION_DATABASE), save it to S3, and then load it into my staging, qa-1, and dev databases. The problem is that one table, call it LARGE_TABLE, needs to be shrunk: it is growing rapidly, which causes problems and exceeds timeouts. Does anyone have any advice on how to get this done?
I tried running pg_dump through cloud_sql_proxy, but that did not work. Is there a way I can truncate one table and make a backup?

Related

Persisting a single, static, large Postgres database beyond removal of the db cluster?

I have an application which, for local development, has multiple Docker containers (organized under Docker Compose). One of those containers is a Postgres 10 instance, based on the official postgres:10 image. That instance has its data directory mounted as a Docker volume, which persists data across container runs. All fine so far.
As part of testing the creation and initialization of the postgres cluster, it is frequently the case that I need to remove the Docker volume that holds the data. (The official postgres image runs cluster init if-and-only-if the data directory is found to be empty at container start.) This is also fine.
However! I now have a situation where in order to test and use a third party Postgres extension, I need to load around 6GB of (entirely static) geocoding lookup data into a database on the cluster, from Postgres backup dump files. It's certainly possible to load the data from a local mount point at container start, and the resulting (very large) tables would persist across container restarts in the volume that holds the entire cluster.
Unfortunately, they won't survive the removal of the docker volume which, again, needs to happen with some frequency. I am looking for a way to speed up or avoid the rebuilding of the single database which holds the geocoding data.
Approaches I have been or currently am considering:
Using a separate Docker volume on the same container to create persistent storage for a separate Postgres tablespace that holds only the geocoder database. This appears to be unworkable because while I can definitely set it up, the official PG docs say that tablespaces and clusters are inextricably linked such that the loss of the rest of the cluster would render the additional tablespace unusable. I would love to be wrong about this, since it seems like the simplest solution.
Creating an entirely separate container running Postgres, which mounts a volume to hold a separate cluster containing only the geocoding data. Presumably I would then need to do something kludgy with foreign data wrappers (or some more arcane postgres admin trickery that I don't know of at this point) to make the data seamlessly accessible from the application code.
So, my question: Does anyone know of a way to persist a single database from a dockerized Postgres cluster, without resorting to a dump and reload strategy?
If you want to speed this up, you could convert your database dump to a data directory: import your dump into a clean postgres container, stop it, create a tarball of the data directory, and upload it somewhere. Then, when you need to create a new postgres container, use an init script to stop the database, download and unpack your tarball into the data directory, and start the database again. This way you skip the whole restore process.
Note: The data tarball has to match the postgres major version so that the container can start from it without problems.
If you want to speed things up even more, create a custom postgres image with the tarball and init script bundled, so that every time it starts it wipes the empty cluster and copies in your own.
You could even change the entrypoint to use your custom script and load the database data, then call docker-entrypoint.sh so there is no need to delete a possible empty cluster.
This will only work if you are OK with replacing the whole cluster every time you want to run your tests; otherwise you are stuck with importing the database dump.
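The pack and unpack steps described above could be sketched roughly as follows. This is only an illustration, not the answerer's actual script: the function names and paths are hypothetical, the cluster is assumed to be stopped while the tarball is created, and downloading the tarball and stopping/starting postgres are left out.

```python
import tarfile
from pathlib import Path


def pack_data_dir(data_dir: Path, tarball: Path) -> None:
    """Archive the contents of a stopped cluster's data directory."""
    with tarfile.open(tarball, "w:gz") as tar:
        # arcname="." stores member paths relative to the data directory,
        # so the archive can be unpacked into any target directory.
        tar.add(data_dir, arcname=".")


def unpack_data_dir(tarball: Path, data_dir: Path) -> None:
    """Unpack the tarball into an empty data directory."""
    data_dir.mkdir(parents=True, exist_ok=True)
    if any(data_dir.iterdir()):
        # Refuse to mix the packed cluster with existing files.
        raise RuntimeError(f"{data_dir} is not empty; wipe it first")
    # The archive is one we built ourselves, so keep recorded file modes
    # where the running Python supports extraction filters.
    extra = {"filter": "fully_trusted"} if hasattr(tarfile, "fully_trusted_filter") else {}
    with tarfile.open(tarball, "r:gz") as tar:
        tar.extractall(data_dir, **extra)
```

An init script would call something like unpack_data_dir(Path("/tmp/geocoder.tar.gz"), Path("/var/lib/postgresql/data")) before starting the server; both paths here are placeholders.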

Do backups using pg_dump cause server outage if the database is too busy?

I have a Postgres database in a production environment, with millions of records in its tables. I want to take a backup using pg_dump for some investigation.
But this database is very busy, so I am afraid the backup operation could cause server issues, such as slowing down the server or crashing the database.
Can anyone tell me whether there is any risk? And please share some best practices for taking a backup from Postgres with no risk.
Running pg_dump will not cause a server crash, but it will add some extra CPU and particularly I/O load. You can test whether that is a problem, since pg_dump can be canceled at any time.
On a busy database, it can also lead to table bloat, because old row versions have to be retained for the duration of pg_dump and cannot be vacuumed.
There are some alternatives:
Run pg_dump against a standby server.
Use pg_basebackup to perform a physical backup. That can be throttled to reduce the I/O load.
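As a rough illustration of the throttled physical backup, here is a sketch that assembles a pg_basebackup invocation with its --max-rate option. The host, user, and directory values are placeholders, the 10M default rate is an arbitrary assumption, and --wal-method assumes PostgreSQL 10 or later.

```python
import subprocess
from typing import List


def basebackup_command(host: str, user: str, target_dir: str,
                       max_rate: str = "10M") -> List[str]:
    """Build a pg_basebackup command line with a transfer rate limit."""
    return [
        "pg_basebackup",
        "--host", host,            # may point at a standby instead of the primary
        "--username", user,        # needs replication privileges
        "--pgdata", target_dir,    # where the backup is written
        "--format", "tar",
        "--gzip",
        "--wal-method", "stream",  # include the WAL needed for a consistent restore
        "--max-rate", max_rate,    # throttle: kB/s by default, or with a k/M suffix
    ]


def run_basebackup(host: str, user: str, target_dir: str) -> None:
    """Run the backup; raises CalledProcessError if pg_basebackup fails."""
    subprocess.run(basebackup_command(host, user, target_dir), check=True)
```

Lowering max_rate trades a longer backup for less I/O pressure on the server.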

What's a good way to backup a (AWS) Postgres DB

What's a good way to back up a Postgres DB (running on Amazon RDS)?
The built-in snapshotting from RDS is daily by default, and you cannot export the snapshots. Besides that, it can take quite a long time to import a snapshot.
Is there a good service that takes dumps on a regular basis and stores them on e.g. S3? We don't want to spin up and maintain an EC2 instance that does that.
Thank you!
I want the backups to be automated, so I would prefer to have a dedicated service for that.
Your choices:
run pg_dump from an EC2 instance on a schedule. This is a great use case for Spot instances.
restore a snapshot to a new RDS instance, then run pg_dump as above. This reduces database load.
Want to run a RDS snapshot more often than daily? Kick it off manually.
These can all be automated. For "free" (low effort on your part) you get daily snapshots. I agree, I wish they could be sent to S3.
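A scheduled job for the first option could look roughly like the sketch below. It is an assumption-laden illustration, not a finished tool: the bucket, prefix, and host names are placeholders, the upload goes through the aws CLI (assumed to be installed and configured), and the database password is expected to come from PGPASSWORD or a .pgpass file.

```python
import os
import subprocess
from datetime import datetime, timezone
from typing import List


def dump_filename(database: str, now: datetime) -> str:
    """Timestamped file name so that successive dumps do not overwrite each other."""
    return f"{database}-{now:%Y%m%dT%H%M%SZ}.dump"


def pg_dump_command(host: str, user: str, database: str, outfile: str) -> List[str]:
    """pg_dump in custom format, which is compressed and restorable with pg_restore."""
    return ["pg_dump", "--host", host, "--username", user,
            "--format", "custom", "--file", outfile, database]


def s3_upload_command(outfile: str, bucket: str, prefix: str) -> List[str]:
    """Copy the dump file to S3 using the aws CLI."""
    key = f"{prefix}/{os.path.basename(outfile)}"
    return ["aws", "s3", "cp", outfile, f"s3://{bucket}/{key}"]


def backup_once(host: str, user: str, database: str, bucket: str, prefix: str) -> None:
    """Dump the database, upload the dump, then remove the local copy."""
    outfile = dump_filename(database, datetime.now(timezone.utc))
    subprocess.run(pg_dump_command(host, user, database, outfile), check=True)
    subprocess.run(s3_upload_command(outfile, bucket, prefix), check=True)
    os.remove(outfile)
```

Calling backup_once from cron on a short-lived instance keeps the long-running part limited to the dump itself.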
SOLUTION: Now you can do a pg_dumpall and dump all Postgres databases on a single AWS RDS Instance.
It has caveats, so it's better to read the post before going ahead and compiling your own version of pg_dumpall for this. Details here.

How to salvage data from Heroku Postgres

We are using Heroku Postgres with Ruby on Rails 3.2.
A few days ago, we deleted important data by mistake by running 'heroku run db:load' with a misconfigured data.yml; that is, it dropped the tables and then recreated them with almost no data.
The most recent backup is from 2 weeks ago, so we lost 2 weeks of data.
So we need to recover not from PG Backups/pg_dump but from PostgreSQL's system data files.
I think, the only way to recover data is to restore data from xlog or archive file, but of course we don't have permission to be Super User/Replication Role to copy postgres database on heroku (or Amazon EC2) to local server.
Is there anyone who confronted such a case and resolved the problem?
Your only option is the backups provided by the PgBackups service (if you had that running). If not, Heroku support might have more options available.
At a minimum, you will have some data loss, but you can guarantee you won't do it again ;)

Backing up the DB vs. backing up the VM

We're serving a Django/Postgres site running on a VM hypervisor. We're now trying to figure out our backup strategy and have two probable options:
Back up the DB directly using pg_dump
Back up the VM directly by copying the VM image
I favor the latter, since I think I could simply back up everything that has to do with the site. I'm not sure whether I have to shut down the VM for this, though.
What is a better and more recommended way of backing up a DB? Are there any reasons for not using the VM backup?
Thanks
The question basically boils down to, can you consider a hot copy of PostgreSQL's data files a backup?
The answer is: not really. PostgreSQL tries very hard through the use of WAL to ensure that its files are in a consistent state all the time and that it can survive a power failure, but starting it up from a copy of these files puts PostgreSQL into recovery mode. If the backup happened at the wrong second and PostgreSQL can't recover from the state of these files, your backup is useless. You don't want your backup/restore mechanism to depend on the recovery mechanism (unless you're dealing with "crash only" software, which PostgreSQL is not).
The probability of PostgreSQL not being able to recover from these files is not high, but it's not zero either. The probability of PostgreSQL not being able to load an SQL dump that it made, on the other hand, is zero. I prefer backup choices with lower probabilities of failure. pg_dump was designed for doing backups.
PostgreSQL recommends using pg_dump for backups, as a file system (or VM) backup requires the database to be shut down (and has other drawbacks):
http://www.postgresql.org/docs/8.1/static/backup-file.html
Edit: Also, a pg_dump backup will be significantly smaller than a filesystem dump of the same database.
There is an additional option. With PostgreSQL you can make an online backup that allows you to snapshot the file system and maintain consistency. You can see details here:
http://www.postgresql.org/docs/9.0/static/continuous-archiving.html
We use this exact method for making backups when we run PostgreSQL in a VM.