Using Airflow's Postgres database for application data

I'm starting a new project which requires a data warehouse, for which we'll be using postgres.
(Shameless plug: Swarm64 makes Postgres a great DW option for datasets up to terabytes in size.)
I'm using Apache Airflow to orchestrate the workloads, but as an Airflow newbie I'm not sure what the best practice is for the application's DB needs.
For some more context, I'm using Airflow's docker-compose.yml.
I noticed that the docker-compose.yml already defines a postgres DB:
...
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - postgres-db-volume:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 5s
      retries: 5
    restart: always
...
I'm immediately wondering whether it would be better to add another postgres service, or to configure the existing one with two users and two databases...
Eventually, I'll move this project to the cloud, and will probably use an AWS Postgres RDS instance or similar.
My question then is:
What is the best practice here?
If there isn't one, what are the tradeoffs between the different approaches?

Airflow doesn't care what your DWH is; you will be able to interact with it using Hooks and Operators. Many of them are available as Providers to Airflow, and you can always write custom ones if needed. You need to separate the Airflow backend metadata DB (which can be PostgreSQL or MySQL) from your analytical storage, where you store your processed data and which can be anything you want (PostgreSQL, MySQL, S3, BigQuery, and many others).
Do NOT make the Airflow backend database also your analytical database, even if they are both PostgreSQL!
As for your question, the answer is:
Use plain, regular PostgreSQL/MySQL for your Airflow installation.
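If you do want to keep everything in one docker-compose.yml for local development, one option is to add a second, completely separate postgres service for the warehouse next to Airflow's metadata DB. A minimal sketch (the dwh-postgres service name, credentials, port, and dwh-db-volume volume are placeholders, not part of the official Airflow compose file):

  dwh-postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: dwh              # separate user from the 'airflow' metadata user
      POSTGRES_PASSWORD: dwh          # placeholder password
      POSTGRES_DB: warehouse
    volumes:
      - dwh-db-volume:/var/lib/postgresql/data   # its own volume, not postgres-db-volume
    ports:
      - "5433:5432"                   # different host port so local clients can reach it
    restart: always

Register dwh-db-volume under the top-level volumes: key alongside postgres-db-volume. When you later move to RDS, you simply drop this service and point your Airflow connection at the RDS endpoint; the metadata DB stays untouched.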

Contents of EFS-backed volume vary depending on the container reading it

Why do the contents of my EFS-backed volume vary depending on the container reading it?
I'm seeing divergent, EFS-related behavior depending on whether I run two processes in a single container or each in their own containers.
I'm using the Docker Compose ECS integration to launch the containers on Fargate.
The two processes are Database and Verifier.
Verifier directly inspects the on-disk storage of Database.
For this reason they share a volume, and the natural docker-compose.yml looks like this (simplifying):
services:
  database:
    image: database
    volumes:
      - database-state:/database-state
  verifier:
    image: verifier
    volumes:
      - database-state:/database-state
    depends_on:
      - database
volumes:
  database-state: {}
However, if I launch in this configuration the volume database-state is often in an inconsistent state when read by Verifier, causing it to error.
OTOH, if I combine the services so both Database and Verifier run in the same container there are no consistency issues:
services:
  database-and-verifier:
    image: database-and-verifier
    volumes:
      - database-state:/database-state
volumes:
  database-state: {}
Note that in both cases the database state is stored in database-state. This issue doesn't appear if I run locally, so it is specific to Fargate / EFS.
Any ideas what's going on and how to fix it?
This feels to me like a write-caching issue, but I doubt EFS would have such a basic problem.
It also feels like it could be a permissions issue, where key files are somehow hidden from Verifier.
Thanks!

How to run .sql script against docker container after a different (dependent) container starts?

I have a SpringBoot application container myApi that depends on another SpringBoot application container configApi; they both use Flyway. They also both depend on a postgres container. configApi exposes an endpoint that myApi uses to fetch all relevant configs (DB details etc).
What currently happens is:
1. postgres container starts and inits appropriate db's and user
2. configApi container starts
   a) it connects to postgres
   b) it runs a flyway migration (creates required schema and tables)
   c) api launches and is ready
3. myApi container starts
   a) it hits a config endpoint exposed by configApi
   b) the request fails because configApi cannot find any useful data in postgres since none was inserted
My restrictions are:
- I cannot modify configApi code to contain anything specific to myApi or an environment
- Flyway migration during configApi launch is what creates the tables that would contain any required data
- I cannot create the tables and populate them when postgres is launched (using init.sql) because then the configApi flyway migration will fail
- myApi cannot contain any hard coded or environmental info about postgres since it's all supposed to be fetched from configApi endpoints
Problem summary TLDR:
How do I execute a sql script against the postgres container after configApi has launched but before myApi has launched without modifying configApi or myApi to contain anything specific to each other's environments?
I have the following docker-compose file:
version: "3"
volumes:
db_data:
services:
postgres:
image: postgres:10.14
volumes:
- ./init-local.sql:/docker-entrypoint-initdb.d/init.sql
- db_data:/var/lib/postgresql
ports:
- 5432:5432
configApi:
image: org/config-api:latest
ports:
- 8234:8234
environment:
- DB_PORT=5432
- DB_HOST=postgres
depends_on:
- postgres
myApi:
build: ./my-api
image: org/my-api:latest
container_name: my-api
ports:
- 9080:9080
environment:
- CONFIG_MANAGER_API_URL=http://configApi:8234/
depends_on:
- postgres
- configApi
Notes (I'll be adding more as questions come in):
I am using a single postgres container because this is for local/test; both APIs use their own DB within that postgres instance
So here's my solution.
I modified my Flyway code to dynamically include extra scripts if they exist, as follows.
In my database Java config in configApi I read an env variable that specifies a dir with extra, app-external scripts:
// on class level
@Value("${FLYWAY_FOLDER}")
private String externalFlywayFolder;

// when creating the DataSource bean
List<String> flywayFolders = new ArrayList<>();
flywayFolders.add("classpath:db/migrations");
if (externalFlywayFolder != null && !externalFlywayFolder.isEmpty()) {
    flywayFolders.add("filesystem:" + externalFlywayFolder);
}
String[] flywayFoldersArray = new String[flywayFolders.size()];
flywayFolders.toArray(flywayFoldersArray);

Flyway flyway = Flyway
        .configure()
        .dataSource(dataSource)
        .baselineOnMigrate(true)
        .schemas("flyway")
        .mixed(true)
        .locations(flywayFoldersArray)
        .load();
flyway.migrate();
Then I modified the docker compose to attach extra files to the container and set the FLYWAY_FOLDER env variable:
  configApi:
    image: org/config-api:latest
    volumes:
      - ./scripts/flyway/configApi:/flyway # attach scripts to container
    ports:
      - 8234:8234
    environment:
      - DB_PORT=5432
      - DB_HOST=postgres
      - FLYWAY_FOLDER=flyway # specify script dir for api
    depends_on:
      - postgres
Then it's just a case of adding the files, but the trick is to make them repeatable migrations so they don't interfere with any versioned migrations that may be done for configApi itself.
Repeatable migrations are applied after versioned migrations, and they get reapplied if their checksum changes.

How can a docker-compose configuration transition from using anonymous volumes to named volumes while maintaining existing data?

Is there a way to migrate from a docker-compose configuration using all anonymous volumes to one using named volumes without needing manual intervention to maintain data (e.g. manually copying folders)? This could entail having users run a script on the host machine but there would need to be some safeguard against a subsequent docker-compose up succeeding if the script hadn't been run.
I contribute to an open source server application that users install on a range of infrastructure. Our users are typically not very technical and are resource-constrained. We have provided a simple docker-compose-based setup. Persistent data is in a containerized postgres database which stores its data on an anonymous volume. All of our administration instructions involve stopping running containers but not bringing them down.
This works well for most users but some users have ended up doing docker-compose down either because they have a bit of Docker experience or by simple analogy to up. When they bring their server back up, they get new anonymous volumes and it looks like they have lost their data. We have provided instructions for recovering from this state but it's happening often enough that we're reconsidering our configuration and exploring transitioning to named volumes.
We have many users happily using anonymous volumes and following our administrative instructions exactly. These are our least technical users and we want to make sure that they are not negatively affected by any change we make to the docker-compose configuration. For that reason, we can't "just" change the docker-compose configuration to use named volumes and provide a script to migrate data. There's too high of a risk that users would forget/fail to run the script and end up thinking they had lost all their data. This kind of approach would be fine if we could somehow ensure that bringing the service back up with the new configuration only succeeds if the data migration has been completed.
Side note for those wondering about our choice to use a containerized database: we also have a path for users to specify an external db server (e.g. RDS) but this is only accessible to our most resourced users.
Edit: Here is a similar ServerFault question.
Given that you're using an official PostgreSQL image, you can exploit its database initialization system:
If you would like to do additional initialization in an image derived from this one, add one or more *.sql, *.sql.gz, or *.sh scripts under /docker-entrypoint-initdb.d (creating the directory if necessary). After the entrypoint calls initdb to create the default postgres user and database, it will run any *.sql files, run any executable *.sh scripts, and source any non-executable *.sh scripts found in that directory to do further initialization before starting the service.
with a change of PGDATA
This optional variable can be used to define another location - like a subdirectory - for the database files. The default is /var/lib/postgresql/data. If the data volume you're using is a filesystem mountpoint (like with GCE persistent disks) or remote folder that cannot be chowned to the postgres user (like some NFS mounts), Postgres initdb recommends a subdirectory be created to contain the data.
to solve the problem. The idea is that you define a different location for the Postgres files and mount a named volume there. The new location will be empty initially, which triggers the database initialization scripts. You can use this to move the data from the anonymous volume, and it happens exactly once.
I've prepared an example for you to test this out. First, create a database on an anonymous volume with some sample data in it:
docker-compose.yml:
version: "3.7"
services:
postgres:
image: postgres
environment:
POSTGRES_PASSWORD: test
volumes:
- ./test.sh:/docker-entrypoint-initdb.d/test.sh
test.sh:
#!/bin/bash
set -e
psql -v ON_ERROR_STOP=1 --username "postgres" --dbname "postgres" <<-EOSQL
CREATE TABLE public.test_table (test_column integer NOT NULL);
INSERT INTO public.test_table VALUES (1);
INSERT INTO public.test_table VALUES (2);
INSERT INTO public.test_table VALUES (3);
INSERT INTO public.test_table VALUES (4);
INSERT INTO public.test_table VALUES (5);
EOSQL
Note how this test.sh is mounted: it has to be in the /docker-entrypoint-initdb.d/ directory in order to be executed at the initialization stage. Bring the stack up and down to initialize the database with this sample data.
Now create a script to move the data:
move.sh:
#!/bin/bash
set -e
rm -rf $PGDATA/*
mv /var/lib/postgresql/data/* "$PGDATA/"
and update the docker-compose.yml with a named volume and a custom location for data:
docker-compose.yml:
version: "3.7"
services:
postgres:
image: postgres
environment:
POSTGRES_PASSWORD: test
# set a different location for data
PGDATA: /pgdata
volumes:
# mount the named volume
- pgdata:/pgdata
- ./move.sh:/docker-entrypoint-initdb.d/move.sh
volumes:
# define a named volume
pgdata: {}
When you bring this stack up it won't find a database (because the named volume is initially empty), so Postgres will run the initialization scripts. First it runs its own initdb to create an empty database, then it runs the custom scripts from the /docker-entrypoint-initdb.d directory. In this example I mounted move.sh into that directory, which will erase the freshly created database and move the old database to the new location.

Postgres subchart not recommended for production environment for Airflow in Kubernetes

I am new to working with Airflow and Kubernetes, and I am trying to use Apache Airflow in Kubernetes.
To deploy it I used this chart: https://github.com/apache/airflow/tree/master/chart.
When I deploy it as in the link above, a PostgreSQL database is created. When I explored the chart's values.yaml file I found this:
# Configuration for postgresql subchart
# Not recommended for production
postgresql:
  enabled: true
  postgresqlPassword: postgres
  postgresqlUsername: postgres
I cannot find why it is not recommended for production.
and also this:
data:
  # If secret names are provided, use those secrets
  metadataSecretName: ~
  resultBackendSecretName: ~
  # Otherwise pass connection values in
  metadataConnection:
    user: postgres
    pass: postgres
    host: ~
    port: 5432
    db: postgres
    sslmode: disable
  resultBackendConnection:
    user: postgres
    pass: postgres
    host: ~
    port: 5432
    db: postgres
    sslmode: disable
What is recommended for production? Should I use my own PostgreSQL database outside Kubernetes? If so, how can I use it instead of this one? How do I have to modify the chart to use my own PostgreSQL?
The reason it is not recommended for production is that the chart provides a very basic Postgres setup.
In the container world, containers are transient, unlike processes in the VM world, so the likelihood of the database getting restarted or killed is high. If you run stateful components in K8s, someone needs to make sure that the Pod is always running with its configured storage backend.
The following tools help to run Postgres with High Availability on K8s/containers and provide various other benefits:
Patroni
Stolon
We have used Stolon to run 80+ Postgres instances on Kubernetes in a microservices environment. These are for public facing products so services are heavily loaded as well.
It's very easy to set up a Stolon cluster once you understand its architecture. Apart from HA it also provides replication, standby clusters, and a CLI for cluster administration.
Please also consider this blog when making your decision; it brings in the perspective of how much Ops work is involved in the different solutions.
Managing databases in Kubernetes is a pain and is not recommended, because scaling, replication, backups, and other common tasks are not as easy to do. What you should do is set up your own Postgres on a VM or use a managed cloud service such as RDS on AWS or Cloud SQL on GCP; more information:
https://cloud.google.com/blog/products/databases/to-run-or-not-to-run-a-database-on-kubernetes-what-to-consider
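If you go the external-database route, the values shown in the question already contain the relevant settings. A minimal sketch of a values override (the hostname and credentials below are placeholders, not real chart defaults):

postgresql:
  enabled: false                 # don't deploy the bundled Postgres subchart

data:
  metadataConnection:
    user: airflow
    pass: change-me              # placeholder; prefer a secret in practice
    host: my-airflow-db.example.com   # e.g. your RDS / Cloud SQL endpoint
    port: 5432
    db: airflow
    sslmode: require

As the comments in the values file suggest, you can instead create a Kubernetes secret with the connection details and reference it via metadataSecretName (and resultBackendSecretName) rather than putting credentials directly in values.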

Dockerfile for backend and a separate one for DBMS because compose won't let me copy a SQL file into the DBMS container?

I have a dockerfile for frontend, one for backend, and one for the database.
In the backend portion of the project, I have a dockerfile and a docker-compose.yml file.
The Dockerfile is great for the backend because it configures the backend, copies and sets up the information, etc. I like it a lot.
The issue I've come to, though, is that I can easily create a Dockerfile for the DBMS, but it requires me to put it in a different directory, whereas I was hoping to just define it in the same directory as the backend; and because the backend and the DBMS are so tightly coupled, I figured this is where docker-compose would come in.
The issue I ran into is that in a compose file I can't do a COPY into the DBMS container. I would have to create another Dockerfile to set that up. I was thinking that would work.
When looking on GitHub, there was a big enhancement thread about it, but the closest people would get is just creating a volume relationship, which fails to do what I want.
Ideally, all I want to be able to do is stand up a Postgres DBMS in such a way that I could do load balancing on it later down the line, with 1 write and 5 read replicas or something, and have its initial DB defined in my one SQL file.
Am I missing something? I thought I was going about it correctly, but maybe I need to create a whole new directory with a Dockerfile for the DBMS.
Thoughts on how I should accomplish this?
Right now I'm doing something like:
version: '2.0'
services:
  backend:
    build: .
    ports:
      - "8080:8080"
  database:
    image: "postgres:10"
    environment:
      POSTGRES_USER: "test"
      POSTGRES_PASSWORD: "password"
      POSTGRES_DB: "foo"
    # I shouldn't have volumes like this, as it would copy the entire folder and its contents to the db.
    volumes:
      - ./:/var/lib/postgresql/data
To copy things with Docker there is an infinite set of possibilities.
At image build time:
- use COPY or ADD instructions
- use shell commands, including cp, ssh, wget and many others
From the docker command line:
- use docker cp to copy from/to hosts and containers
- use docker exec to run arbitrary shell commands including cp, ssh and many others...
In docker-compose / kubernetes (or through the command line):
- use a volume to share data between containers
- volumes can be local or distant file systems (a network disk, for example)
- potentially combine that with shell commands, for example to perform backups
Still, how you should do it depends heavily on the use case.
If the data you copy is linked to the code and versioned (in the git repo...), then treat it as if it were code and build the image with it via the Dockerfile. This is, for me, a best practice.
If the data is configuration dependent on the environment (like test vs prod, or farm 1 vs farm 2), then go for docker config/secret + ENV variables.
If the data is dynamic and generated at production time (like a DB that is filled with user data as the app is used), use persistent volumes and be sure you understand the impact of container failure on your data.
For a database in a test system it can make sense to relaunch the DB from a backup dump or a read-only persistent volume, or, much simpler, to back up the whole container at a known state (with docker commit).
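In your specific case, since the goal is just to load one initial SQL file into the official postgres image, a minimal sketch (assuming the file is named init.sql and sits next to the compose file; the dbdata named volume is also made up here) is to mount it into /docker-entrypoint-initdb.d instead of mounting the whole project folder over the data directory:

version: '2.0'
services:
  database:
    image: "postgres:10"
    environment:
      POSTGRES_USER: "test"
      POSTGRES_PASSWORD: "password"
      POSTGRES_DB: "foo"
    volumes:
      # executed once, when the data directory is first initialized
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql
      # keep the actual data on a named volume instead of the project folder
      - dbdata:/var/lib/postgresql/data
volumes:
  dbdata: {}

This avoids a separate Dockerfile for the DBMS entirely; if you later prefer the build-time approach recommended above, the same file can instead be COPY'd into /docker-entrypoint-initdb.d in a small Dockerfile.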