Installing PostgreSQL in AWS EC2 CentOS 7 on a secondary volume

My AWS EC2 has two volumes, primary and secondary, with the secondary volume being larger. I am looking to install Postgres on this EC2. As the database gets used, I anticipate it will overrun the size of the primary volume. So,
1 - How can I install it such that the database sits on the secondary volume? I am referencing this article for installation. In particular, the following command installs it on the primary volume:
sudo yum install postgresql postgresql-server postgresql-devel postgresql-contrib postgresql-docs
2 - Is it advisable to install it on the secondary volume? If not, why?
Thanks.

1 - How can I install it such that the database sits on the secondary volume?
See the documentation; basically, you can initialize a database cluster in any folder:
https://www.postgresql.org/docs/13/app-initdb.html
Example:
initdb -D /mnt/data
2 - Is it advisable to install it on the secondary volume? If not, why?
Sure, it's easier to maintain and resize a non-root volume.
Regardless of that, with AWS you could also consider running AWS RDS, where many maintenance tasks (e.g. storage auto-scaling) are offloaded to AWS.
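For the packaged CentOS 7 service, the data directory is taken from the PGDATA environment variable in the systemd unit, so a common approach is to override it to point at the secondary volume. A minimal sketch, assuming the volume is already mounted at /mnt/data (a placeholder path):
sudo mkdir -p /mnt/data/pgdata
sudo chown postgres:postgres /mnt/data/pgdata
sudo systemctl edit postgresql     # add a drop-in containing:
#   [Service]
#   Environment=PGDATA=/mnt/data/pgdata
sudo -u postgres initdb -D /mnt/data/pgdata
sudo systemctl enable --now postgresql
# note: with SELinux enforcing on CentOS 7, a non-default data directory may also need its file context adjusted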

The standard pattern I see for this is to install the Postgres packages the normal way in the normal place, and then set the PG data directory to a mountpoint on a different volume. This separates the Postgres application files (which would be on the same volume as the rest of the OS filesystem) from the Postgres data (which would be on the secondary). It can be advisable for a few reasons - isolating DB data disk usage from system disk usage is a good one. Another reason is to be able to scale throughput and size independently and to see usage independently.
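A hedged sketch of that pattern on CentOS 7, where the device name /dev/xvdb and the mount point /pgdata are placeholders for your secondary EBS volume:
sudo mkfs -t xfs /dev/xvdb
sudo mkdir /pgdata
echo '/dev/xvdb  /pgdata  xfs  defaults,nofail  0 2' | sudo tee -a /etc/fstab
sudo mount /pgdata
sudo chown postgres:postgres /pgdata
# then initialize the cluster there and point the service's PGDATA at it,
# as in the initdb / systemd override sketch above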

Related

Are there any quick ways to move PostgreSQL database between clusters on the same server?

We have two big databases (200GB and 330GB) in our "9.6 main" PostgreSQL cluster.
What if we create another cluster (instance) on the same server, is there any way to quickly move database files to new cluster's folder?
Without using pg_dump and pg_restore, with minimum downtime.
We want to be able to replicate the 200GB database to another server without pumping all 530GB of data.
Databases aren't portable, so the only way to move them to another cluster is to use pg_dump (which I'm aware you want to avoid), or to use logical replication to copy them to another cluster. You would just need to set wal_level to 'logical' in postgresql.conf, and create a publication that includes all tables.
CREATE PUBLICATION my_pub FOR ALL TABLES;
Then, on your new cluster, you'd create a subscription:
CREATE SUBSCRIPTION my_sub
CONNECTION 'host=172.100.100.1 port=5432 dbname=postgres'
PUBLICATION my_pub;
More information on this is available in the PostgreSQL documentation: https://www.postgresql.org/docs/current/logical-replication.html
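Note that CREATE PUBLICATION / CREATE SUBSCRIPTION require PostgreSQL 10 or newer. The wal_level change can be made either by editing postgresql.conf as described, or with ALTER SYSTEM; a minimal sketch:
sudo -u postgres psql -c "ALTER SYSTEM SET wal_level = 'logical';"
# wal_level only takes effect after a restart of the source cluster; verify afterwards with:
sudo -u postgres psql -c "SHOW wal_level;"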
TL;DR: no.
PostgreSQL itself does not allow you to move the data files of a single database from one source PG cluster to another target PG cluster, whether the target cluster runs on the same machine or on another machine. In this respect it is less flexible than, for example, Oracle transportable tablespaces or SQL Server attach/detach database commands.
The usual way to clone a PG cluster is to use streaming physical replication to build a physical standby cluster of all databases, but this requires backing up and restoring all databases with pg_basebackup (a physical backup): it can be slow depending on database size, but once the standby cluster is synchronized it should be very fast to fail over by promoting it, so minimal downtime is possible. After promotion you can drop the databases you don't need.
However, it may be possible to use storage snapshots to quickly copy all data files from one cluster to another (and then drop the unneeded databases in the target cluster). But I have not tried it myself and it does not seem to be widely used (except maybe in some managed services in the cloud).
(PG cluster means PG instance).
If you would like to avoid pg_dump/pg_restore, then use:
1. logical replication (lets you replicate only the desired databases)
2. streaming replication via a replication slot (move the whole cluster to another one and then drop the undesired databases)
Option 1 is described above, so I will briefly describe option 2:
a) Create a role with replication privileges on the master (the cluster you want to copy from):
master# psql> CREATE USER replikator WITH REPLICATION ENCRYPTED PASSWORD 'replikator123';
b) Log in to the slave cluster and switch to the postgres user. Stop the PostgreSQL instance and delete the DB data files. Then initiate replication from the slave (watch versions and directories!):
pg_basebackup -h MASTER_IP -U replikator -D /var/lib/pgsql/11/data -r 50M -R --waldir /var/lib/pgwal/11/pg_wal -X stream -c fast -C -S master1_to_slave1 -v -P
What does this command do? It connects to the master with the replikator credentials and starts pg_basebackup via a replication slot that will be created (-C -S). There is bandwidth throttling (50M) as well as other options... Right after the base backup, the slave will start streaming replication and you've got failsafe replication.
c) Then, whenever you want, promote the slave to be standalone and delete the undesired databases:
rm -f /var/lib/pgsql/11/data/recovery.conf
systemctl restart postgresql11.service
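Before promoting, you may want to confirm on the master that the standby has caught up; a quick, hedged check:
sudo -u postgres psql -c "SELECT client_addr, state, sent_lsn, replay_lsn FROM pg_stat_replication;"
# state = 'streaming' with sent_lsn equal to replay_lsn means the slave is fully caught up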

Persisting a single, static, large Postgres database beyond removal of the db cluster?

I have an application which, for local development, has multiple Docker containers (organized under Docker Compose). One of those containers is a Postgres 10 instance, based on the official postgres:10 image. That instance has its data directory mounted as a Docker volume, which persists data across container runs. All fine so far.
As part of testing the creation and initialization of the postgres cluster, it is frequently the case that I need to remove the Docker volume that holds the data. (The official postgres image runs cluster init if-and-only-if the data directory is found to be empty at container start.) This is also fine.
However! I now have a situation where in order to test and use a third party Postgres extension, I need to load around 6GB of (entirely static) geocoding lookup data into a database on the cluster, from Postgres backup dump files. It's certainly possible to load the data from a local mount point at container start, and the resulting (very large) tables would persist across container restarts in the volume that holds the entire cluster.
Unfortunately, they won't survive the removal of the docker volume which, again, needs to happen with some frequency. I am looking for a way to speed up or avoid the rebuilding of the single database which holds the geocoding data.
Approaches I have been or currently am considering:
Using a separate Docker volume on the same container to create persistent storage for a separate Postgres tablespace that holds only the geocoder database. This appears to be unworkable because while I can definitely set it up, the official PG docs say that tablespaces and clusters are inextricably linked such that the loss of the rest of the cluster would render the additional tablespace unusable. I would love to be wrong about this, since it seems like the simplest solution.
Creating an entirely separate container running Postgres, which mounts a volume to hold a separate cluster containing only the geocoding data. Presumably I would then need to do something kludgy with foreign data wrappers (or some more arcane postgres admin trickery that I don't know of at this point) to make the data seamlessly accessible from the application code.
So, my question: Does anyone know of a way to persist a single database from a dockerized Postgres cluster, without resorting to a dump and reload strategy?
If you want to speed things up, you could convert your database dump to a data directory: import your dump into a clean postgres container, stop it and create a tarball of the data directory, then upload it somewhere. Now, when you need to create a new postgres container, use an init script to stop the database, download and unpack your tarball into the data directory, and start the database again. This way you skip the whole DB restore process.
Note: The data tarball has to match the postgres major version so the container has no problem to start from it.
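A hedged sketch of producing such a tarball (the container, volume, and file names below are placeholders):
# seed a throwaway postgres:10 container, restore the geocoding dump into it, then stop it
docker run -d --name geo-seed -e POSTGRES_PASSWORD=postgres \
    -v geo_seed_data:/var/lib/postgresql/data postgres:10
# ... pg_restore your dump into it here ...
docker stop geo-seed
# archive the resulting data directory from the named volume
docker run --rm -v geo_seed_data:/var/lib/postgresql/data -v "$PWD":/backup alpine \
    tar czf /backup/geocoder-pgdata-10.tar.gz -C /var/lib/postgresql/data .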
If you want to speed things up even more, create a custom postgres image with the tarball and init script bundled, so that every time it starts it wipes the empty cluster and copies in your own.
You could even change the entrypoint to use your custom script to load the database data and then call docker-entrypoint.sh, so there is no need to delete a possibly empty cluster.
This will only work if you are OK with replacing the whole cluster every time you want to run your tests; otherwise you are stuck with importing the database dump.
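A hedged sketch of such a wrapper entrypoint for the official image (the tarball path is an assumption; COPY the tarball and this script into your custom image, set the script as ENTRYPOINT, and keep CMD ["postgres"] as in the stock image):
#!/bin/sh
# custom-entrypoint.sh: unpack a prebuilt cluster if the data directory is empty,
# then hand off to the stock entrypoint of the official postgres image
set -e
if [ -z "$(ls -A "$PGDATA" 2>/dev/null)" ]; then
    tar xzf /geocoder-pgdata-10.tar.gz -C "$PGDATA"
fi
exec docker-entrypoint.sh "$@"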

How to reduce storage(scale down) my RDS instance?

I have an RDS (Postgres) instance with 1000GB of SSD storage, but the data is only 100GB in size.
How can I easily scale down the storage of this RDS instance?
RDS does not allow you to reduce the amount of storage allocated to a database instance, only increase it.
To move your database to less storage you would have to create a new RDS instance with your desired storage space, then use something like pg_dump/pg_restore to move the data from the old database to the new one.
Also be aware that an RDS instance with 1,000GB of SSD storage has a base IOPS of 3,000. An RDS instance with 100GB of SSD storage has a base IOPS of 300, with occasional bursts of up to 3,000.
Based on AWS's help here, this is the full process that worked for me:
1) Dump the database to a file: run this on a machine that has network access to the database:
pg_dump -Fc -v -h your-rds-endpoint.us-west-2.rds.amazonaws.com -U your-username your-databasename > your-databasename.dump
2) In the AWS console, create a new RDS instance with smaller storage. (You probably want to set it up with the same username, password, and database name.)
3) Restore the database on the new RDS instance: run this command (obviously on the same machine as the previous command):
pg_restore -v -h the-new-rds-endpoint.us-west-2.rds.amazonaws.com -U your-username -d your-databasename your-databasename.dump
(Note, in step 3, that I'm using the endpoint of the new RDS instance. Also note there's no :5432 at the end of the endpoint addresses.)
Amazon doesn't allow you to reduce the storage size of an RDS instance; you have two options to reduce it.
1: If you can afford downtime, a dump of the old instance (e.g. pg_dump) can be restored to a new instance with less storage.
2: You can use the Database Migration Service (DMS) to move data from one instance to another without any downtime.
When using RDS, instead of doing typical hardware "capacity planning", you just provision enough disk space for the short or medium term (it depends) and expand it when needed.
As @Mark B mentioned, you need to watch out for the IOPS as well. You can use "provisioned IOPS" if you need a high-performance DB.
You should weigh cost vs. performance before jumping into the disk storage part.
E.g. if you reduce 1000GB to 120GB, for US West you will save 0.125 x 880GB = $110/month. But the max IOPS will be 120 x 3 = 360 IOPS.
It costs $0.10 per IOPS to provision additional IOPS to increase performance. Say you actually need 800 IOPS for faster online user response:
(800 - 360) x $0.10 = $44, so the actual saving may end up being less. You will not save any money if your RDS needs a constant 1100 IOPS. Other discount factors may also come into play.
You can do this by migrating the DB to Aurora.
If you don't want Aurora, the Data Migration Service is the best option in my opinion. We're moving production to Aurora, so this didn't matter, and we can always get it back out of Aurora using pg_dump or DMS. (I assume this will apply to MySQL as well, but haven't tested it.)
My specific goal was to reduce RDS Postgres final snapshot sizes after decommissioning some instances that were initially created with 1TB+ storage each.
1. Create the normal snapshot. The full provisioned storage size is allocated to the snapshot.
2. Upgrade the snapshot to an engine version supported by Aurora, if not already supported. I chose 10.7.
3. Migrate the snapshot to Aurora. This creates a new Aurora DB.
4. Snapshot the new Aurora DB. The snapshot storage size starts as the full provisioned size, but drops to the actual used storage after completion.
5. Remove the new Aurora DB.
6. Confirm your Aurora snapshot is good by restoring it again and poking around in the new new DB until you're satisfied that the original snapshots can be deleted.
7. Remove the new new Aurora DB and the original snapshot.
You can stop at 3 if you want and just use the Aurora DB going forward.
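For reference, step 3 can also be done from the AWS CLI; a hedged sketch with placeholder identifiers (you may also need --engine-version to match your upgraded snapshot):
aws rds restore-db-cluster-from-snapshot \
    --db-cluster-identifier my-aurora-copy \
    --snapshot-identifier my-final-snapshot \
    --engine aurora-postgresql
aws rds create-db-instance \
    --db-instance-identifier my-aurora-copy-1 \
    --db-cluster-identifier my-aurora-copy \
    --db-instance-class db.r5.large \
    --engine aurora-postgresql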
The #2 answer does not work on Windows 10 because, per this DBA Stack Exchange question, the shell re-encodes the output when the > operator is used. pg_dump will generate a file, but pg_restore gives the cryptic error:
pg_restore: [archiver] did not find magic string in file header
Add -E UTF8 and -f instead of >:
pg_dump -Fc -v -E UTF8 -h your-rds-endpoint.us-west-2.rds.amazonaws.com -U your-username your-databasename -f your-databasename.dump
What about creating a read replica with smaller disk space, promoting it to standalone, and then switching that to be primary? I think that's what I will do.

MongoDB does not see database or collections after migrating from localhost to EBS volume

full disclosure: I am a complete n00b to mongodb and am just getting my feet wet with using mongo on AWS (but have 2 decades working in IT so not a total n00b :P)
I set up an EBS volume and installed mongo on an EC2 instance.
My problem is that I provisioned too small an EBS volume initially.
When I realized this I:
created a new larger EBS volume
mounted it on the server
stopped mongo ( $ sudo service mongod stop)
copied all my /data/db files into the new volume
updated conf files and fstab (dbpath, logpath, pidfilepath and mount point for new volume respectively)
restarted mongod
When I execute: $ sudo service mongod start
- everything runs fine.
- I can futz about in the admin and local databases.
However, when I run the mongo shell command: > show databases
- I only see the admin and local.
- the database I copied into the new volume (named encompass) is not listed.
I still have a working local copy of the database so my data is not lost, just not sure how best to move mongo data around other than:
A) start all over importing the data to the db on the AWS server (not what I would like since it is already loaded in my local db)
B) copy the local db to the new EBS volume again (also not preferred, but better than importing all the data from scratch again!).
NOTE: originally I secure copied the data into the EBS volume with this command:
$ scp -r -i <key file> <local db path> ec2-user@<ec2 host>:<remote path>
then when I copied between volumes I used a vanilla cp command.
Did I miss something here?
The best I could find on SO and the web was this process (How to scale MongoDB?), but perhaps I missed a switch in a command or a nuance to the process that rendered my database files inert/useless?
Any idea how I can get mongo to see my other database files and collections?
Or did I make an irreversible error somewhere along the way?
Thanks for any help!!
Are you sure your conf file is being loaded? You can, for a test, run mongod.exe and specify the path directly to your db, i.e.:
mongod --dbpath c:\mongo\data\db (unix syntax may vary a bit, this is windows)
run this from the command line and see what, if anything, mongo complains about.
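A hedged way to confirm what the running daemon actually loaded (the paths below are typical CentOS/Amazon Linux defaults and may differ on your install):
ps -ef | grep [m]ongod                    # shows the --config/-f file mongod was started with
grep -iE 'dbpath' /etc/mongod.conf        # confirm it points at the new volume's data directory
sudo ls -l /path/to/new/volume/db         # the copied files must be readable by the mongod service user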
Database files are finicky and easy to damage when copied around. Before copying from one database to another, you should probably seed the database; a few dummy entries will tell you whether the database is working.

Mongodb EC2 EBS Backups

I am confused about what I need to do here. I am new to Mongo. I have set up a small Mongo server on Amazon EC2, with EBS volumes, one for data, one for logs. I need to do a backup. It's okay to take the DB down in the middle of the night, at least currently.
Using the boto library, EBS snapshots and python to do the backup, I built a simple script that does the following:
sudo service mongodb stop
run backup of data
run backup of logs
sudo service mongodb start
The script ran through and restarted Mongo, but I noted in the AWS console that the snapshots were still being created even though boto had returned and Mongo had restarted. Certainly not ideal.
I checked the Mongo docs, and found this explanation on what to do for backups:
http://docs.mongodb.org/ecosystem/tutorial/backup-and-restore-mongodb-on-amazon-ec2/#ec2-backup-database-files
This is good info, but a bit unclear. If you are using journaling, which we are, it says:
If the dbpath is mapped to a single EBS volume then proceed to Backup the Database Files.
We have a single volume for data. So, I'm assuming that means to bypass the steps on flushing and locking. But at the end of Backup the Database Files, it discusses removing the locks.
So, I'm a bit confused. As I read it, I don't actually need to do anything - I can just run the backup and not worry about flushing/locking, period. I probably don't need to take the DB down. But the paranoid part of me says no, that sounds suspicious.
Any thoughts from anyone on this, or experience, or good old fashioned knowledge?
Since you are using journaling, you can just run the snapshot without taking the DB down. This will be fine as long as the journal files are on the same EBS volume, which they would be unless you symlink them elsewhere.
We run a lot of mongodb servers on Amazon and this is how we do it too.
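For illustration only, here is the equivalent snapshot call using the AWS CLI rather than boto (the volume and snapshot IDs are placeholders). Since an EBS snapshot is point-in-time as of the moment the API call is made, you do not need to keep mongod down, or delay its restart, while the snapshot finishes uploading:
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 \
    --description "mongodb data volume backup"
# optionally block until the snapshot has finished copying:
aws ec2 wait snapshot-completed --snapshot-ids snap-0123456789abcdef0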