Upgrade from Sequential executor to Celery executor in Apache Airflow

I have Apache Airflow running on an EC2 instance (Ubuntu). Everything is running fine.
The DB is SQLite and the executor is the Sequential Executor (provided as default). But now I would like to run some DAGs which need to run at the same time, every hour and every 2 minutes.
My question is: how can I upgrade my current setup to the Celery executor and a Postgres DB to get the advantage of parallel execution?
Will it work if I install and set up Postgres, RabbitMQ, and Celery, and make the necessary changes in the airflow.cfg configuration file?
Or do I need to re-install everything from scratch (including airflow)?
Please guide me on this.
Thanks

You can, indeed, install Postgres/RabbitMQ/Celery, then update your configuration file (airflow.cfg), initialise the database, and restart the Airflow services.
However, there is a side note: if required, you'd also have to migrate data from SQLite to Postgres. Most importantly, the database contains your connections and variables. It's possible to export variables beforehand and import them again using the Airflow CLI (see this answer, and the Airflow documentation).
It's also possible to import your connections using the CLI, as described in this Airflow guide (or the documentation).
If you switch to the new database setup and notice something is missing, you can still easily switch back to the SQLite setup by reverting the changes to airflow.cfg.
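As a rough sketch (exact config section names and CLI subcommands vary by Airflow version, and every value below is a placeholder), the switch typically looks something like this:

# Back up variables and connections from the current SQLite-backed install
airflow variables export /tmp/variables.json
airflow connections export /tmp/connections.json

# In airflow.cfg, point Airflow at Postgres and Celery, e.g.:
#   executor = CeleryExecutor
#   sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow
#   broker_url = amqp://guest:guest@localhost:5672//
#   result_backend = db+postgresql://airflow:airflow@localhost:5432/airflow

# Initialise the new metadata database and re-import your data
airflow db init
airflow variables import /tmp/variables.json
airflow connections import /tmp/connections.json

# Restart the services; the Celery executor also needs workers
airflow webserver -D
airflow scheduler -D
airflow celery worker -D

Keep the old SQLite file around until you've confirmed that variables, connections, and DAG runs look right on the Postgres side.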

Related

How is airflow database managed periodically?

I am running airflow using postgres.
The web server was slow during operation.
The problem was caused by data continually accumulating in the dag_run and log tables of the DB (it became faster after connecting to Postgres and deleting the data directly).
Are there any Airflow options to clean the DB periodically?
If there is no such option, we will try to delete the data directly using a DAG script.
I also find it strange that the web server slows down just because there is a lot of data. Does the web server fetch all the data when another window is opened?
You can purge old records by running:
airflow db clean [-h] --clean-before-timestamp CLEAN_BEFORE_TIMESTAMP [--dry-run] [--skip-archive] [-t TABLES] [-v] [-y]
(cli reference)
It is quite common to include this command in a DAG that runs periodically.
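For example, a hedged sketch of a daily cleanup (the 30-day retention and the --skip-archive choice are arbitrary; preview with --dry-run first):

# Preview which records older than 30 days would be purged
airflow db clean --clean-before-timestamp "$(date -d '-30 days' +%F)" --dry-run

# Actually purge, skip the archive tables, and don't prompt for confirmation
airflow db clean --clean-before-timestamp "$(date -d '-30 days' +%F)" --skip-archive -y

Wrapping these commands in a BashOperator task scheduled, say, daily gives you the periodic cleanup without touching Postgres directly.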

PostgreSQL log configuration on Ubuntu

I have PostgreSQL 9.5 (yes, I know it's not supported anymore) installed on Ubuntu Server 18.04 using these instructions: https://www.postgresql.org/download/linux/ubuntu/
I want to change the log path and have a separate log for every database. But it's configured by the package maintainer in such a way that PostgreSQL ignores the log* settings in its configuration and logs everything to files some other way, and I can't find out how. Currently it logs to /var/log/postgresql/postgresql-9.5-clustername.log. I want it to be /var/log/postgresql/clustername/database.log, but I don't know where to configure that. In PostgreSQL, log_destination is set to stderr.
The Ubuntu packages have logging_collector disabled by default, so the log is not handled by PostgreSQL, but by the startup script.
However, there is no way in PostgreSQL to get a separate log file per database, so the only way to get what you want is to put the databases in individual clusters rather than into a single cluster.
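To illustrate (the cluster names below are made up, and pg_createcluster is specific to Debian/Ubuntu packaging), you could first confirm the current behaviour and then split the databases into one cluster each:

# Confirm that PostgreSQL itself is not collecting the logs
sudo -u postgres psql -c "SHOW logging_collector;"
sudo -u postgres psql -c "SHOW log_destination;"

# One cluster per database, each with its own log file under /var/log/postgresql/
sudo pg_createcluster --start 9.5 salesdb
sudo pg_createcluster --start 9.5 reportsdb
# -> postgresql-9.5-salesdb.log, postgresql-9.5-reportsdb.log, ...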

AWS DMS Streaming replication : Logical Decoding Output Plugins(test_decoding) not accessible

I'm trying to migrate a PostgreSQL DB persisted on cloud (on DO droplet) to RDS using AWS Database Migration Service (DMS).
I've successfully configured the replication instance and endpoints.
I've created a task with "Migrate existing data and replicate ongoing changes". When I start the task, it fails with the error: ERROR: could not access file "test_decoding": No such file or directory.
I've tried to create a replication slot manually from my DB console, and it throws the same error.
I've followed the procedure suggested in the DMS documentation for Postgres.
I'm using PostgreSQL 9.4.6 on my source endpoint.
I presume the problem is that the output plugin test_decoding is not accessible for replication.
Please assist me to resolve this. Thanks in advance!
You must install the postgresql-contrib package (additional supplied modules) on your source endpoint.
If it is already installed, make sure the directory where the test_decoding module is located is the same as the directory where PostgreSQL expects it.
On *nix, you can check the module directory with:
pg_config --pkglibdir
If it is not the same, copy the module, make a symlink, or use whatever other solution you prefer.
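A hedged sketch for a Debian/Ubuntu source host (the package name and the slot name are assumptions; adjust them to your distro and PostgreSQL 9.4 packaging):

# Install the contrib modules, which include test_decoding
sudo apt-get install postgresql-contrib-9.4

# Check that the plugin sits where PostgreSQL looks for it
ls "$(pg_config --pkglibdir)/test_decoding.so"

# Verify logical decoding end to end by creating and dropping a throwaway slot
# (requires wal_level = logical and max_replication_slots > 0 in postgresql.conf)
sudo -u postgres psql -c "SELECT * FROM pg_create_logical_replication_slot('dms_test_slot', 'test_decoding');"
sudo -u postgres psql -c "SELECT pg_drop_replication_slot('dms_test_slot');"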

PostgreSQL turn off durability

I want to make a script that will run postgres in-memory without durability.
I read this page: http://www.postgresql.org/docs/9.1/static/non-durability.html
But I don't understand how I can set these parameters in a script. Could you please help me?
Thanks for the help!
Most of those parameters, like fsync, can only be set in postgresql.conf. Changes are applied by re-starting PostgreSQL. They apply to the whole database cluster - all the databases in that PostgreSQL install. That's because the databases all share a single postmaster, write-ahead log, and set of shared system tables.
The only parameter listed there that you can set at the SQL level in a script is synchronous_commit. By setting synchronous_commit = 'off' you can say "it's OK to lose this transaction if the database crashes in the next few seconds, just make sure it still applies atomically".
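As an illustration (the database and table names are made up), you can toggle it per session from plain SQL:

# Only this session risks losing its most recent commits on a crash
psql mydb <<'SQL'
SET synchronous_commit = off;
INSERT INTO test_table VALUES (1);
SQL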
I wrote more on this topic in a previous answer, Optimise PostgreSQL for fast testing.
If you want to set the other params with a script, you can do so, but you have to modify postgresql.conf from the script and then restart PostgreSQL. Text-processing tools like sed make this kind of job easier.
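A sketch assuming the Debian/Ubuntu config layout (the path and the chosen parameters are placeholders; the non-durability page lists more you may want to change):

CONF=/etc/postgresql/9.1/main/postgresql.conf
# Flip the durability-related settings off, whether or not they are commented out
sudo sed -i \
    -e "s/^#\?fsync = .*/fsync = off/" \
    -e "s/^#\?synchronous_commit = .*/synchronous_commit = off/" \
    -e "s/^#\?full_page_writes = .*/full_page_writes = off/" \
    "$CONF"
# These parameters only take effect after a restart
sudo service postgresql restart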
If you're running a Debian-based Linux distro, you can just do something like:
pg_createcluster -d /dev/shm/mypgcluster 8.4 ramcluster
to create a RAM-based cluster. Note that you'll have to do:
pg_dropcluster 8.4 ramcluster
and recreate it on reboot, etc.
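One possible way to handle the recreate-on-reboot part (a sketch; whether you hook it into rc.local, a cron @reboot entry, or a systemd unit is up to you, and the version/name match the answer above):

# /dev/shm is wiped on boot, so recreate and start the RAM-backed cluster
pg_createcluster -d /dev/shm/mypgcluster 8.4 ramcluster
pg_ctlcluster 8.4 ramcluster start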

Postgresql cluster initialization

SQL distributes a pre-initialized catalog cluster, but for PostgreSQL we need to initialize the cluster using initdb and a network service account. It fails in a few cases, causing a bit of misery!
Can we initialize the cluster ourselves and distribute the pre-initialized cluster?
Thanks
The "cluster" (or data directory) depends on the operating system and the architecture. So a data directory that was initialized with initdb on a 32bit Linux will not work on a 64bit Windows.
But you don't need to do that. A service account is only necessary if you want to run PostgreSQL as a service.
You can easily use the ZIP distribution to install and start Postgres without the need for a full-fledged installation or a service account.
The steps to do so are:
1) Unzip the binaries.
2) Run initdb, pointing it to the directory where the database cluster should be created.
3) Run pg_ctl to start the server.
Note that steps 2) and 3) must be run using the same user, otherwise the server will have no privileges to write to the data directory.
These steps can easily be put into a batch file or shell script.
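A minimal sketch of those steps as a shell script (all paths and the archive name are placeholders; on Windows the same commands go into a batch file):

# 1) Unzip the binaries
unzip postgresql-binaries.zip -d /opt/pgsql

# 2) Initialize a cluster in a directory of your choice, as the same OS user
/opt/pgsql/pgsql/bin/initdb -D /opt/pgsql/data -U postgres -E UTF8

# 3) Start the server against that data directory
/opt/pgsql/pgsql/bin/pg_ctl -D /opt/pgsql/data -l /opt/pgsql/pg.log start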
Hard to understand your question, but I think you are talking about the Windows installer for PostgreSQL, right? What version, what installer, and what about error messages, logs, etc.?
The installer can be found here.
SQL = database language, SQL Server = Microsoft database product