How is airflow database managed periodically?

I am running airflow using postgres.
The web server became slow during operation.
The cause was data continuing to accumulate in the dag_run and log tables of the database (it became faster after I connected to postgres and deleted the data directly).
Are there any airflow options to clean the db periodically?
If there is no such option, we will try to delete the data directly using a DAG script.
Also, I find it strange that the web server slows down just because there is a lot of data. Does the web server fetch all the data each time another window is opened?

You can purge old records by running:
airflow db clean [-h] --clean-before-timestamp CLEAN_BEFORE_TIMESTAMP [--dry-run] [--skip-archive] [-t TABLES] [-v] [-y]
(cli reference)
It is quite common to include this command in a DAG that runs periodically.
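As a rough sketch of how the periodic cleanup could be parameterised, the helper below builds the `airflow db clean` command line for a rolling retention window. The function name `build_db_clean_command` and the 30-day retention are assumptions made for illustration; only the flags shown in the usage line above are used.

```python
import shlex
from datetime import datetime, timedelta, timezone
from typing import List, Optional


def build_db_clean_command(
    retention_days: int,
    tables: Optional[List[str]] = None,
    dry_run: bool = False,
    now: Optional[datetime] = None,
) -> List[str]:
    """Build the argv for `airflow db clean` with a rolling retention window.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    cmd = [
        "airflow", "db", "clean",
        "--clean-before-timestamp", cutoff.isoformat(),
        "-y",  # answer the confirmation prompt so the command can run unattended
    ]
    if tables:
        # -t takes a comma-separated list of table names
        cmd += ["-t", ",".join(tables)]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


if __name__ == "__main__":
    # Assumed retention of 30 days, limited to the two tables from the question.
    argv = build_db_clean_command(30, tables=["dag_run", "log"], dry_run=True)
    print(shlex.join(argv))
```

The resulting string could be passed to a BashOperator in a DAG with a daily or weekly schedule. Starting with --dry-run is a cautious way to see what would be deleted before removing the flag.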

Related

Apache Airflow Init Db

I am trying to initialize a database for my project, which is based on Apache Airflow. I am not too familiar with what happened, but I changed the value in my airflow.cfg file to sql_alchemy_conn =postgresql+psycopg2:////Users/gabeuy/airflow/airflow.db. When I saved the changes and ran the command airflow db init, an error occurred and the database was not initialized.
I tried looking up different ways to change it and made sure that I had Postgres and psycopg installed, but it still resulted in an error when I ran the command. I was expecting it to run so that I could access the Airflow DB on localhost with the DAGs.

Your sql_alchemy_conn is pointing to a local file path (indicating a SQLite DB), but the protocol is indicating a PostgreSQL DB. The error is telling you it's missing a password, which is required by PostgreSQL.
For PostgreSQL, the expected URL format is:
postgresql+psycopg2://<user>:<password>@<host>/<db>
And for a SQLite DB, the expected URL format is:
sqlite:////<path/to/airflow.db>
A SQLite DB is convenient for testing purposes. A SQLite DB is stored as a single file on your computer which makes it easy to set up (airflow db init will generate the file if it doesn't exist). A PostgreSQL DB takes a bit more work to set up, but is generally advised for a production scenario.
For more information about Airflow database configuration, see: https://airflow.apache.org/docs/apache-airflow/stable/howto/set-up-database.html.
And for more information about airflow db CLI commands, see: https://airflow.apache.org/docs/apache-airflow/stable/cli-and-env-variables-ref.html#db.
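To make the mismatch between scheme and path easier to spot, here is a minimal sketch that inspects a connection string with Python's standard urllib.parse. The function `check_conn_url` is a hypothetical helper, not part of Airflow, and its rules are simplified assumptions (for example, some PostgreSQL setups legitimately work without a password in the URL).

```python
from typing import List
from urllib.parse import urlsplit


def check_conn_url(url: str) -> List[str]:
    """Return a list of likely problems with a sql_alchemy_conn value.

    Hypothetical helper with simplified rules, for illustration only.
    """
    parts = urlsplit(url)
    problems = []
    if parts.scheme.startswith("postgresql"):
        # A PostgreSQL URL normally carries user, password, host and database.
        if not parts.hostname:
            problems.append("missing host")
        if not parts.username:
            problems.append("missing user")
        if not parts.password:
            problems.append("missing password")
        if not parts.path.strip("/"):
            problems.append("missing database name")
    elif parts.scheme.startswith("sqlite"):
        # A SQLite URL has no host part, only a file path.
        if parts.netloc or not parts.path.strip("/"):
            problems.append("expected sqlite:////absolute/path/to/file.db")
    else:
        problems.append("unrecognised scheme: " + repr(parts.scheme))
    return problems


if __name__ == "__main__":
    # The value from the question: PostgreSQL scheme combined with a file path.
    print(check_conn_url("postgresql+psycopg2:////Users/gabeuy/airflow/airflow.db"))
    # Placeholder credentials, assumed for the example.
    print(check_conn_url("postgresql+psycopg2://airflow:secret@localhost/airflow"))
```

For the value from the question, the check reports a missing host, user and password, which matches the diagnosis above: the path part looks like a SQLite file while the scheme asks for PostgreSQL.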

Advice for Backup

I have a cron job that runs on a stateless server. In this cron job, I am trying to take a snapshot of my Postgres GCP SQL database (PRODUCTION_DATABASE), save it to S3, and then upload it to my staging, qa-1, and dev databases. The problem is that one table, call it LARGE_TABLE, needs to be shrunk because it is growing rapidly, causing problems and exceeding timeouts. Does anyone have any advice on how to get this done?
I tried running the cloud_sql_proxy to run pg_dump, but that method did not work. Is there a way I can truncate one table and make a backup?

Does Airflow clear Tasks remove data from the database also?

We have a use case where we have to populate fresh data in our DB. Old data from a successful DAG run is already present in our DB. Now we need to delete the old data and rerun the task.
Airflow already provides a command to clear a selection:
airflow clear -dx occupancy_reports.* -t building -s 2022-04-01 -e 2022-04-30
Will running this also delete the data from the database and then populate fresh data?

I guess you meant: airflow tasks clear ...
It only clears the state of the selected task instances, as if they never ran. It is not a rollback, so data the tasks already wrote to your database stays there.
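Since clearing does not touch the data, the task itself has to remove the old rows before writing new ones. One common way to do that is a delete-then-insert inside a single transaction, sketched below with Python's built-in sqlite3 so it runs anywhere. The table occupancy_reports, its columns, and the function name are assumptions made up for this example, not something taken from the DAG in the question.

```python
import sqlite3
from typing import Iterable, Tuple


def reload_building_reports(
    conn: sqlite3.Connection,
    start: str,
    end: str,
    rows: Iterable[Tuple[str, str, int]],
) -> None:
    """Replace all rows in [start, end] so that a rerun never duplicates data.

    Hypothetical table and column names, for illustration only.
    """
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "DELETE FROM occupancy_reports WHERE report_date BETWEEN ? AND ?",
            (start, end),
        )
        conn.executemany(
            "INSERT INTO occupancy_reports (report_date, building, occupancy) "
            "VALUES (?, ?, ?)",
            rows,
        )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE occupancy_reports "
        "(report_date TEXT, building TEXT, occupancy INTEGER)"
    )
    fresh = [("2022-04-01", "A", 10), ("2022-04-02", "A", 12)]
    reload_building_reports(conn, "2022-04-01", "2022-04-30", fresh)
    # Simulate the task being cleared and run again for the same date range.
    reload_building_reports(conn, "2022-04-01", "2022-04-30", fresh)
    print(conn.execute("SELECT COUNT(*) FROM occupancy_reports").fetchone()[0])
```

With a task written this way, running airflow tasks clear for a date range and letting the scheduler rerun it leaves exactly one copy of the data for that range.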

Upgrade from Sequential executor to Celery executor in Apache Airflow

I have Apache Airflow running on an EC2 instance (Ubuntu). Everything is running fine.
The DB is SQLite and the executor is the Sequential Executor (provided as default). But now I would like to run some DAGs that need to run at the same time, every hour and every 2 minutes.
My question is: how can I upgrade my current setup to the Celery executor and a Postgres DB to get the advantage of parallel execution?
Will it work if I install and set up Postgres, RabbitMQ and Celery, and make the necessary changes in the airflow.cfg configuration file?
Or do I need to re-install everything from scratch (including airflow)?
Please guide me on this.
Thanks

You can, indeed, install Postgres/RabbitMQ/Celery, then update your configuration file (airflow.cfg), initialise the database, and restart the Airflow services.
However, there is a side note: if required, you'd also have to migrate data from SQLite to Postgres. Most importantly, the database contains your connections and variables. It's possible to export variables beforehand and import them again using the Airflow CLI (see this answer, and the Airflow documentation).
It's also possible to import your connections using the CLI, as described in this Airflow guide (or the documentation).
If you just switched to the new database set up and you see something's missing, you can still easily switch back to the SQLite setup by reverting the changes to airflow.cfg.
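For reference, the settings involved in the switch can be sketched as environment variables, which Airflow reads in the form AIRFLOW__{SECTION}__{KEY} and which correspond to the same keys in airflow.cfg. This is a sketch under assumptions: the helper name, the credentials and the hosts are placeholders, and the section holding sql_alchemy_conn depends on the Airflow version ([database] in newer releases, [core] in older ones).

```python
from typing import Dict


def celery_executor_env(
    db_url: str,
    broker_url: str,
    result_backend: str,
    new_db_section: bool = True,
) -> Dict[str, str]:
    """Return the environment overrides for a CeleryExecutor setup.

    Hypothetical helper: it only assembles setting names and values.
    """
    # Newer Airflow versions keep sql_alchemy_conn under [database],
    # older ones under [core]; check the docs for the installed version.
    section = "DATABASE" if new_db_section else "CORE"
    return {
        "AIRFLOW__CORE__EXECUTOR": "CeleryExecutor",
        "AIRFLOW__" + section + "__SQL_ALCHEMY_CONN": db_url,
        "AIRFLOW__CELERY__BROKER_URL": broker_url,
        "AIRFLOW__CELERY__RESULT_BACKEND": result_backend,
    }


if __name__ == "__main__":
    # Placeholder credentials and hosts, assumed for the example.
    env = celery_executor_env(
        db_url="postgresql+psycopg2://airflow:secret@localhost/airflow",
        broker_url="amqp://guest:guest@localhost:5672//",
        result_backend="db+postgresql://airflow:secret@localhost/airflow",
    )
    for key, value in sorted(env.items()):
        print(key + "=" + value)
```

Whether the values live in airflow.cfg or in the environment, the database still has to be initialised and the Airflow services restarted afterwards, as described above.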

App to monitor PostgreSQL queries in real time?

I'd like to monitor the queries getting sent to my database from an application. To that end, I've found pg_stat_activity, but more often than not, the rows which are returned read "<IDLE> in transaction". I'm either doing something wrong, am not fast enough to see the queries come through, am confused, or all of the above!
Can someone recommend the most idiot-proof way to monitor queries running against PostgreSQL? I'd prefer some sort of easy-to-use UI based solution (example: SQL Server's "Profiler"), but I'm not too choosy.

pgAdmin offers a pretty easy-to-use tool called server monitor
(Tools -> Server Status)

With PostgreSQL 8.4 or higher you can use the contrib module pg_stat_statements to gather query execution statistics of the database server.
Run the SQL script of this contrib module, pg_stat_statements.sql (on Ubuntu it can be found in /usr/share/postgresql/<version>/contrib), in your database and add this sample configuration to your postgresql.conf (requires a restart):
custom_variable_classes = 'pg_stat_statements'
pg_stat_statements.max = 1000
pg_stat_statements.track = top # top,all,none
pg_stat_statements.save = off
To see what queries are executed in real time you might want to just configure the server log to show all queries or queries with a minimum execution time. To do so set the logging configuration parameters log_statement and log_min_duration_statement in your postgresql.conf accordingly.

pg_activity is what we use.
https://github.com/dalibo/pg_activity
It's a great tool with a top-like interface.
You can install and run it on Ubuntu 21.10 with:
sudo apt install pg-activity
pg_activity

If you are using Docker Compose, you can add this line to your docker-compose.yaml file:
command: ["postgres", "-c", "log_statement=all"]
Now you can see the Postgres query logs with:
docker-compose logs -f
or, if you want to see only the Postgres logs:
docker-compose logs -f [postgres-service-name]
https://stackoverflow.com/a/58806511/10053470

I haven't tried it myself, unfortunately, but I think that pgFouine can show you some statistics.
It seems it does not show queries in real time but rather generates a report of queries afterwards; perhaps that still meets your needs?
You can take a look at
http://pgfouine.projects.postgresql.org/