How to update a postgres database in kubernetes using scripts [CI/CD]? - postgresql

I have one postgres database deployed in kubernetes attached to a pvc [with RWX access mode]. What is the right way to update (create table) the database through my CI/CD instead of logging in to the pod and running queries [Without deleting the pvc] ?

My understanding is that the background reason for your question is how to deploy DB structure changes onto production with minimal downtime. For that I'd go with Blue-Green Deployments 1 .
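Whatever deployment strategy you settle on, the mechanical part of the question (running a CREATE TABLE from CI/CD instead of logging in to the pod) can be handled by pointing psql at the in-cluster service from a pipeline step. A rough sketch, assuming a Service named postgres in namespace db, a Secret postgres-credentials holding the password, and a migration file checked into the repo (all of these names are hypothetical):

#!/usr/bin/env bash
# Hypothetical CI step: apply a SQL migration against the in-cluster PostgreSQL
# service instead of exec-ing into the pod and typing queries by hand.
set -euo pipefail

NAMESPACE=db          # assumed namespace
SERVICE=postgres      # assumed Service in front of the database pod

PGPASSWORD="$(kubectl -n "$NAMESPACE" get secret postgres-credentials \
  -o jsonpath='{.data.password}' | base64 -d)"
export PGPASSWORD

# Tunnel the service port for the duration of the migration.
kubectl -n "$NAMESPACE" port-forward "svc/$SERVICE" 5432:5432 &
PF_PID=$!
trap 'kill "$PF_PID"' EXIT
sleep 3   # crude wait for the tunnel to come up

# Run the migration; ON_ERROR_STOP makes psql exit non-zero if a statement
# fails, which in turn fails the pipeline step.
psql -h 127.0.0.1 -U postgres -d mydb -v ON_ERROR_STOP=1 \
     -f migrations/001_create_table.sql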
After your comments I assume that you already have a running instance of PostgreSQL and would like to modify the content of the DB by altering the file structure directly on a "disk" (in this case a PVC).
Modifying the data structure directly on disk is not the best idea if we are speaking about data integrity, etc.
The reasons for that statement are explained in this article 2, which describes exactly how PostgreSQL stores data on disk.
PostgreSQL (by default) writes blocks of data (what PostgreSQL calls pages) to disk in 8k chunks.
Additionally, there is a mapping between a table and its file path, so PostgreSQL knows exactly which file stores which table.
SELECT pg_relation_filepath('test_data');
pg_relation_filepath
----------------------
base/20886/186770
In this example the file /database/base/20886/186770 contains the actual data for the table test_data.
"What is the right way to update the database instead of logging in to the pod and running queries?"
However, if you are sure that you have a complete set of files for the DB to operate on (like the ones you use during pg_dump / pg_restore), you can try placing that data on another PVC and recreating the pod; however, that will still result in downtime.
Hope that helps.

Related

PostgreSQL: even read access changes data files on disk, leading to large incremental backups using pgbackrest

We are using pgbackrest to backup our database to Amazon S3. We do full backups once a week and an incremental backup every other day.
Size of our database is around 1TB, a full backup is around 600GB and an incremental backup is also around 400GB!
We found out that even read access (pure select statements) on the database has the effect that the underlying data files (in /usr/local/pgsql/data/base/xxxxxx) change. This results in large incremental backups and also in very large storage (costs) on Amazon S3.
Usually the files with low index names (e.g. 391089.1) change on read access.
On an update, we see changes in one or more files - the index could correlate to the age of the row in the table.
Some more facts:
Postgres version 13.1
Database is running in docker container (docker version 20.10.0)
OS is CentOS 7
We see the phenomenon on multiple servers.
Can someone explain why PostgreSQL changes data files on pure read access?
We tested on a pure database without any other resources accessing the database.
This is normal. Some cases I can think of right away are:
a SELECT or other SQL statement setting a hint bit
This is a shortcut for subsequent statements that access the data, so they don't have to consult the commit log any more.
a SELECT ... FOR UPDATE writing a row lock
autovacuum removing dead row versions
These are leftovers from DELETE or UPDATE.
autovacuum freezing old visible row versions
This is necessary to prevent data corruption if the transaction ID counter wraps around.
The only way to fairly reliably prevent PostgreSQL from modifying a table in the future is:
never perform an INSERT, UPDATE or DELETE on it
run VACUUM (FREEZE) on the table and make sure that there are no concurrent transactions
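For the freezing step, a minimal sketch (the table name archive_2020 is a placeholder):

-- Freeze all visible rows so later reads and autovacuum runs have nothing
-- left to rewrite in this table's data files (assumes no concurrent writers).
VACUUM (FREEZE, VERBOSE) archive_2020;

-- Optional check: the table's frozen transaction ID horizon after the run.
SELECT relname, relfrozenxid FROM pg_class WHERE relname = 'archive_2020';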

RDS instance unusably slow after restoring from snapshot

Details:
Database: Postgres.
Version: 9.6
Host: Amazon RDS
Problem: After restoring from snapshot, the database is unusably slow.
Why: Something AWS calls the "first touch penalty". When a newly restored instance becomes available, the EBS volume attachment is complete but not all the data has been migrated to the attached EBS volume from S3. Only after initially "touching" the data will RDS realize the data isn't on the EBS volume and it needs to pull it from S3. This completely destroys our performance. We also cannot use dd or fio to pre-touch the data because RDS does not allow access to the mounted EBS volumes.
What I've done: Contact AWS support. They acknowledged that it's a problem, that they are working on it and that the only solution is to select * from all tables.
Why I still need help: The select * strategy did speed things up (I selected everything from the public schema), but not as much as is needed. So I read up on how postgres stores data to disk. There's a heck of a lot on disk that wouldn't be "touched" by a simple select from user-defined tables.
My question: Being limited to only SQL queries/functions and not having direct access to the underlying disk, what are the best sql statements I can use to "touch" as much as possible on the disk in order to get it loaded on the EBS volume from S3?
My suggestion would be to manually trigger a VACUUM ANALYZE; this will do a full table scan of each table within scope to update the planner with fresh statistics. You can scope it fairly easily; limiting it to a certain schema or just to the database in question, for example, can help keep total time down if you have multiple databases within the one host.
The operation is rather time consuming and I'm not aware of a good way to parallelize it. There is also the vacuumdb utility, but this just runs a query with a VACUUM statement in it.
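For illustration, a rough sketch of what scoping that by hand could look like (the table and database names are placeholders):

-- Touch and re-analyze one table at a time; each table is read in full,
-- which also pulls its blocks from S3 onto the restored EBS volume.
VACUUM (ANALYZE, VERBOSE) public.orders;
VACUUM (ANALYZE, VERBOSE) public.customers;

-- Or, from the shell, the equivalent for a whole database:
-- vacuumdb --analyze --verbose --dbname=mydb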
Source: I asked RDS support this very question a few days ago.
[1] https://www.postgresql.org/docs/9.5/static/sql-vacuum.html

Best way to backup and restore data in PostgreSQL for testing

I'm trying to migrate our database engine from MsSql to PostgreSQL. In our automated tests, we restore the database back to a "clean" state at the start of every test. We do this by comparing the "diff" between the working copy of the database and the clean copy (table by table), then copying over any records that have changed, or deleting any records that have been added. So far this strategy seems to be the best way to go for us because, per test, not a lot of data is changed and the size of the database is not very big.
Now I'm looking for a way to essentially do the same thing but with PostgreSQL. I'm considering doing the exact same thing with PostgreSQL. But before doing so, I was wondering if anyone else has done something similar and what method you used to restore data in your automated tests.
On a side note - I considered using MsSql's snapshot or backup/restore strategy. The main problem with these methods is that I have to re-establish the db connection from the app after every test, which is not possible at the moment.
If you're okay with some extra storage, and if you (like me) are particularly not interested in re-inventing the wheel in terms of checking for diffs via your own code, you should try creating a new DB (per run) via the templates feature of the createdb command (or the CREATE DATABASE statement) in PostgreSQL.
For example:
(from bash) createdb todayDB -T snapshotDB
or
(from psql) CREATE DATABASE todayDB TEMPLATE snapshotDB;
Pros:
In theory, always the exact same DB by design (no custom logic)
Creating the copy is a file-level transfer (not a DB restore), so far less time is taken (i.e. it doesn't run SQL again, doesn't recreate indexes / restore tables, etc.)
Cons:
Takes 2x the disk space (although the template could be on low-performance NFS, etc.)
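By way of usage, a minimal per-test-run reset this approach implies (the database names are the same placeholders as above; no sessions may be connected to either database while these run):

# Throw away the mutated copy and clone a fresh one from the template.
dropdb --if-exists todayDB
createdb todayDB -T snapshotDB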
For my specific situation, I decided to go back to the original solution, which is to compare the "working" copy of the database with the "clean" copy of the database.
There are 3 types of changes (INSERT, UPDATE and DELETE), handled in two steps (a SQL sketch follows below):
For INSERT records - find max(id) in the clean table and delete any record in the working table that has a higher id.
For UPDATE or DELETE records - find all records in the clean table EXCEPT the records found in the working table, then UPSERT those records into the working table.
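A rough SQL sketch of those two steps for a single, hypothetical table (working.orders is mutated by the tests, clean.orders holds the pristine copy, and both are keyed by an integer id column):

-- 1. INSERTs: delete any row the tests added (id above the clean table's max).
DELETE FROM working.orders
WHERE id > (SELECT max(id) FROM clean.orders);

-- 2. UPDATEs / DELETEs: every clean row that no longer has an identical
--    counterpart in the working table is upserted back.
INSERT INTO working.orders
SELECT * FROM clean.orders
EXCEPT
SELECT * FROM working.orders
ON CONFLICT (id) DO UPDATE
SET customer = EXCLUDED.customer,   -- list the non-key columns of the real table
    total    = EXCLUDED.total;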

Postgres - Is it necessary to create tablespace in my case?

I have a mobile/web project, using pg9.3 as database, and linux as server.
The data won't be huge, but as time goes on, it will increase.
With the long term in mind, I want to know:
Questions:
1. Is it necessary for me to create tablespace for my database, or just use the default one?
2. If I create new tablespace, what is the proper location on linux to create the folder, and why?
3. If I don't create it now, and wait until I have to, till then, will it be easy for me to migrate db with data to new tablespace?
Just use the default tablespace and do not create new tablespaces. Tablespaces are only useful if you have multiple physical disks, so you can define which data is stored on which physical disk. The directory where your data is located is not that important for the workings of Postgres, so if you only have one disk there is no point in using tablespaces.
Should your data grow beyond the capacity of one disk, you will have to perform a full data migration anyway to move it to another physical disk, so you can configure tablespaces at that time.
The idea behind defining which data is located on which disk (with tablespaces) is that you can do things like putting a big table which is hardly used on a slow disk, and putting a very intensively used table on a separate, faster disk. But I assume you're not there yet, so don't overcomplicate things.
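If and when you do get there, the mechanics are small. A minimal sketch, assuming a second disk mounted at /mnt/ssd1 (the tablespace and table names are placeholders, and the directory must be empty and owned by the postgres OS user):

-- Define a tablespace on the second, faster disk...
CREATE TABLESPACE fast_disk LOCATION '/mnt/ssd1/pgdata';

-- ...and move one intensively used table onto it. The table's files are
-- rewritten in the new location; the table is locked while this runs.
ALTER TABLE hot_table SET TABLESPACE fast_disk;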

Moving Postgres tablespaces and tables across EC2 instance

I have a postgres database running on an Amazon EC2 instance. I have a few tablespaces created for some monthly tables, such that each table is on its own tablespace. To get the maximum performance, I have created each tablespace on an individual Amazon EBS volume.
I want to move some of these tables to a different instance and database. I will explain it with one example.
Let's say:
I have EC2 instance A with postgres setup as explained above.
I have another Amazon instance B running and I installed postgres on it as well.
I want to create the same table structure on B for some of the tables present on A. I want to detach the volumes from instance A and attach them to instance B.
Also, I want to create tablespaces on instance B, which will point to the newly attached volumes.
And when I start up the newly set-up postgres, I expect to see the tables populated with data from those volumes (database).
Finally, I will delete those tables from A.
I know I am being rusty in writing, but I couldn't find a better way to ask the question.
Is something along these lines possible? Are there any pointers for achieving something like this?
No.
The data in the tablespace directory is only the data. You also need the metadata that's in the tables in the pg_catalog schema, as well as the information from pg_clog and pg_xlog to access it.
If you want to move things across using volumes, you must move the entire installation at once (all the tablespaces, including pg_default). Otherwise, you need to use pg_dump/pg_restore to transfer the data over.
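For the pg_dump/pg_restore route, a minimal sketch of moving one monthly table from A to B (the hostnames, database name and table name are placeholders):

# Dump only the monthly table from instance A in custom format...
pg_dump -h instance-a.example.com -U postgres -d mydb \
        -t monthly_jan -Fc -f monthly_jan.dump

# ...restore it into the database on instance B (which must already exist)...
pg_restore -h instance-b.example.com -U postgres -d mydb monthly_jan.dump

# ...and only then drop the original on A.
psql -h instance-a.example.com -U postgres -d mydb -c 'DROP TABLE monthly_jan;'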