Postgres subchart not recommended for production environment for Airflow in Kubernetes - postgresql

I am new to working with Airflow and Kubernetes. I am trying to use Apache Airflow on Kubernetes.
To deploy it I used this chart: https://github.com/apache/airflow/tree/master/chart.
When I deploy it as in the link above, a PostgreSQL database is created. When I explored the values.yaml file of the chart I found this:
# Configuration for postgresql subchart
# Not recommended for production
postgresql:
  enabled: true
  postgresqlPassword: postgres
  postgresqlUsername: postgres
I cannot find why it is not recommended for production.
I also found this:
data:
  # If secret names are provided, use those secrets
  metadataSecretName: ~
  resultBackendSecretName: ~
  # Otherwise pass connection values in
  metadataConnection:
    user: postgres
    pass: postgres
    host: ~
    port: 5432
    db: postgres
    sslmode: disable
  resultBackendConnection:
    user: postgres
    pass: postgres
    host: ~
    port: 5432
    db: postgres
    sslmode: disable
What is recommended for production? Should I use my own PostgreSQL database outside Kubernetes? If so, how can I use it instead of this one? How do I have to modify the chart to use my own PostgreSQL?

The reason it is not recommended for production is that the chart provides only a very basic Postgres setup.
In the container world, containers are transient, unlike processes in the VM world, so the likelihood of the database getting restarted or killed is high. If we are running stateful components on K8s, someone needs to make sure that the Pod is always running with its configured storage backend.
The following tools help to run Postgres with high availability on K8s/containers and provide various other benefits:
Patroni
Stolon
We have used Stolon to run 80+ Postgres instances on Kubernetes in a microservices environment. These are public-facing products, so the services are heavily loaded as well.
It's very easy to set up a Stolon cluster once you understand its architecture. Apart from HA, it also provides replication, standby clusters, and a CLI for cluster administration.
Please also consider this blog when making your decision. It brings in the perspective of how much Ops work will be involved in the different solutions.

Managing databases in Kubernetes is a pain and is not recommended, because scaling, replication, backups, and other common tasks are not as easy to do. What you should do is set up your own Postgres in a VM or use a managed cloud service such as AWS RDS or GCP Cloud SQL. More information:
https://cloud.google.com/blog/products/databases/to-run-or-not-to-run-a-database-on-kubernetes-what-to-consider
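If you go that route, you switch off the bundled subchart and fill in the external connection details in the same values shown in the question. A minimal sketch (the host, user, password, and database names below are placeholders; check the keys against the values.yaml of the chart version you deploy):
# Sketch: disable the bundled Postgres subchart and point Airflow at an
# external database (hostname and credentials are placeholders)
postgresql:
  enabled: false

data:
  metadataConnection:
    user: airflow_user
    pass: airflow_pass
    host: my-external-postgres.example.com
    port: 5432
    db: airflow
    sslmode: require
  resultBackendConnection:
    user: airflow_user
    pass: airflow_pass
    host: my-external-postgres.example.com
    port: 5432
    db: airflow
    sslmode: require
Alternatively, the metadataSecretName / resultBackendSecretName fields shown in the question let you supply the connection from a Kubernetes Secret instead of putting credentials directly in values.yaml.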

Related

Is there any way to make streaming replication of the RDS Postgres DB to a Kubernetes Statefulset postgres-replica?

We currently have a staging setup in RDS and we need to replicate that RDS database into our dev Kubernetes environment. What are the steps required to do this?

Using airflow's postgres database for application data

I'm starting a new project which requires a data warehouse, for which we'll be using postgres.
(Shameless plug: swarm64 makes postgres a great DW option for datasets up to terabytes large)
I'm using apache airflow to orchestrate the workloads, but as I'm new to airflow, I'm not sure what the best practice is for the application's DB needs.
For some more context, I'm using airflow's docker-compose.yml, and I'm also an airflow newbie.
Noticing that the docker-compose already defines a postgres db:
...
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - postgres-db-volume:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 5s
      retries: 5
    restart: always
...
I'm immediately wondering whether it would be a better idea to add another postgres service, or to configure the existing one to have two users and two databases...
Eventually, I'll move this project to the cloud, and will probably use an AWS Postgres RDS instance or similar.
My question then is:
What is the best practice here?
If there isn't one, what are the tradeoffs between the different approaches?
Airflow doesn't care what your DWH is; you will be able to interact with it using Hooks and Operators. Many of them are available as Providers for Airflow, and you can always write custom ones if needed. You need to separate the Airflow backend metadata DB (which can be PostgreSQL or MySQL) from your analytical storage, where you store your processed data and which can be anything you want (PostgreSQL, MySQL, S3, BigQuery, and many others).
Do NOT make the Airflow backend database also your analytical database, even if they are both PostgreSQL!
As for your question, the answer is:
Use plain, regular PostgreSQL/MySQL for your Airflow installation.
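If you do decide to keep the application database in the same docker-compose file for local development, one way (just a sketch; the dwh-postgres service name, credentials, port mapping, and volume are placeholders made up for illustration) is to add a second Postgres service alongside the existing one:
  # Hypothetical second Postgres service for application/DWH data,
  # kept separate from Airflow's metadata database
  dwh-postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: dwh
      POSTGRES_PASSWORD: dwh
      POSTGRES_DB: warehouse
    ports:
      - "5433:5432"   # different host port so it doesn't clash locally
    volumes:
      - dwh-db-volume:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "dwh"]
      interval: 5s
      retries: 5
    restart: always
You would also add dwh-db-volume under the top-level volumes: section, and then create an Airflow connection pointing at dwh-postgres:5432 for your Hooks/Operators to use. When you later move to AWS, only that connection needs to change to point at RDS.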

Helm Chart available for highly available PostgreSQL on OpenShift

I was looking for options to run Postgres HA in production on OpenShift.
I tried the incubator/patroni chart (added the SA and SCC), but sometimes it runs properly and sometimes the lock is not acquired by either the master or the replica instance of Postgres.
Also, there is no way to create the schema automatically; it needs to be created manually by exec-ing into the pod.
I also tried stable/postgresql, but there are still issues with that Helm chart when running it on OpenShift.
I saw some Helm charts for production-grade setups, such as the Zalando Postgres Operator with Patroni and the Crunchy Postgres Operator, but with a single Helm chart I am not able to run a full setup of highly available PostgreSQL. There are manual steps involved, like installing the pgo client and connecting it to psql.
So, is there any highly available Postgres Helm chart which can be run in production on OpenShift with one or two commands, just by changing the values.yaml file?
Bitnami has many interesting Helm charts, including one for PostgreSQL:
https://github.com/bitnami/charts/tree/master/bitnami/postgresql
For HA: https://github.com/bitnami/charts/tree/master/bitnami/postgresql-ha

Are there any quick ways to move PostgreSQL database between clusters on the same server?

We have two big databases (200GB and 330GB) in our "9.6 main" PostgreSQL cluster.
If we create another cluster (instance) on the same server, is there any way to quickly move the database files into the new cluster's folder?
Without using pg_dump and pg_restore, and with minimal downtime.
We want to be able to replicate the 200GB database to another server without pumping all 530GB of data.
Databases aren't portable, so the only way to move them to another cluster is to use pg_dump (which I'm aware you want to avoid), or to use logical replication to copy them to another cluster. You would just need to set wal_level to 'logical' in postgresql.conf and create a publication that includes all tables.
CREATE PUBLICATION my_pub FOR ALL TABLES;
Then, on your new cluster, you'd create a subscription:
CREATE SUBSCRIPTION my_sub
CONNECTION 'host=172.100.100.1 port=5432 dbname=postgres'
PUBLICATION my_pub;
More information on this is available in the PostgreSQL documentation: https://www.postgresql.org/docs/current/logical-replication.html
TL;DR: no.
PostgreSQL itself does not allow moving the data files of a single database from one source PG cluster to another target PG cluster, whether the clusters run on the same machine or on different machines. In this respect it is less flexible than, for example, Oracle transportable tablespaces or SQL Server's attach/detach database commands.
The usual way to clone a PG cluster is to use streaming physical replication to build a physical standby cluster of all databases, but this requires backing up and restoring all databases with pg_basebackup (a physical backup): it can be slow depending on the database sizes, but once the standby cluster is synchronized it should be really fast to fail over to it by promoting it; minimal downtime is possible. After promotion you can drop the databases you don't need.
However, it may be possible to use storage snapshots to quickly copy all data files from one cluster to another (and then drop the unneeded databases in the target cluster). But I have not practiced this and it does not seem to be widely used (except maybe in some managed cloud services).
(PG cluster means PG instance).
If you would like to avoid pg_dump/pg_restore, then use:
1. logical replication (lets you replicate only the desired databases)
2. streaming replication via a replication slot (move the whole cluster to another one and then drop the undesired databases)
While option 1 is described above, I will briefly describe option 2:
a) Create a role with replication privileges on the master (the cluster you want to copy from):
master# psql> CREATE USER replikator WITH REPLICATION ENCRYPTED PASSWORD 'replikator123';
b) Log in to the slave cluster and switch to the postgres user. Stop the PostgreSQL instance and delete the DB data files. Then initiate the replication from the slave (watch versions and directories!):
pg_basebackup -h MASTER_IP -U replikator -D /var/lib/pgsql/11/data -r 50M -R --waldir /var/lib/pgwal/11/pg_wal -X stream -c fast -C -S master1_to_slave1 -v -P
What does this command do? It connects to the master with the replikator credentials and starts pg_basebackup via a replication slot that will be created (-C -S). There is bandwidth throttling (50M) as well as other options... Right after the base backup, the slave will start streaming replication and you've got failsafe replication.
c) Then, when you want, promote the slave to be standalone and delete the undesired databases:
rm -f /var/lib/pgsql/11/data/recovery.conf
systemctl restart postgresql11.service

Backing up a PostgreSQL database in OpenShift to outside the pod

Looking for some guidance/best practice on how to back up a PostgreSQL database running in an OpenShift pod. I've come across this rsync solution for application data - https://docs.openshift.com/enterprise/3.2/admin_guide/backup_restore.html#backup-application-data - but I was wondering how to use pg_dump instead?
I'd like the pg_dump to dump the database to a volume outside the pod.
Nowadays best practice is most likely to use an Operator to run PostgreSQL on Kubernetes, for example with the Crunchy Data PostgreSQL Operator.
If you do not want to use Operators, another option would be to use a CronJob to run pg_dump at a regular interval and put the dump on a PersistentVolume mounted into the Pod.
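As a rough sketch of that second approach (everything here is a placeholder: the schedule, the postgresql Service host, the postgres-credentials Secret, and the postgres-backups PVC are assumed to exist; older clusters may need apiVersion: batch/v1beta1):
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
spec:
  schedule: "0 2 * * *"              # nightly at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: pg-dump
              image: postgres:13     # client tools should match the server major version
              command:
                - /bin/sh
                - -c
                - pg_dump -h postgresql -U airflow -d airflow -F c -f /backups/airflow-$(date +%F).dump
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: postgres-credentials   # placeholder Secret
                      key: password
              volumeMounts:
                - name: backups
                  mountPath: /backups
          volumes:
            - name: backups
              persistentVolumeClaim:
                claimName: postgres-backups        # placeholder PVC
The dump then lives on the PersistentVolume, outside the database Pod, and can be restored later with pg_restore.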