Airflow parallelism failure while changing the DB to PostgreSQL

I have installed Airflow locally and I am changing the executor to run parallel tasks.
For that, I changed:
1- the database to Postgres 13.3
2- in the config file:
sql_alchemy_conn = postgresql+psycopg2://postgres:postgres@localhost/postgres
3- executor = LocalExecutor
I have checked the DB and there are no errors, using:
airflow db check --> INFO - Connection successful.
airflow db init --> Initialization done
Errors that I receive (and I don't use SQLite at all):
1- {dag_processing.py:515} WARNING - Because we cannot use more than 1 thread (parsing_processes = 2 ) when using SQLite. So we set parallelism to 1.
2- I receive this error from the Airflow web interface:
The scheduler does not appear to be running.
The DAGs list may not update, and new tasks will not be scheduled.
So shall I make any other changes?

Did you actually restart your Airflow webserver/scheduler after you changed the config?

The following logging statement:
{dag_processing.py:515} WARNING - Because we cannot use more than 1 thread (parsing_processes = 2 ) when using SQLite. So we set parallelism to 1.
It comes from Airflow 2.0.1, from the following code fragment:
if 'sqlite' in conf.get('core', 'sql_alchemy_conn') and self._parallelism > 1:
    self.log.warning(
        "Because we cannot use more than 1 thread (parsing_processes = "
        "%d ) when using sqlite. So we set parallelism to 1.",
        self._parallelism,
    )
    self._parallelism = 1
This means that, based on your [core] sql_alchemy_conn setting, it is somehow still on 'sqlite'. If you are certain you changed airflow.cfg and restarted all Airflow services, it might be picking up another copy of airflow.cfg than the one you expect. Please inspect the logs to verify it is using the correct one.
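As a quick sanity check (assuming Airflow 2.x), you can ask the installation which connection string it actually resolves and which home directory the config is read from (AIRFLOW_HOME defaults to ~/airflow when unset):
echo $AIRFLOW_HOME
airflow config get-value core sql_alchemy_conn
If the second command still prints a sqlite:/// URI, the scheduler and webserver are reading a different airflow.cfg than the one you edited.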

Related

How is the Airflow database managed periodically?

I am running Airflow using Postgres.
There was a phenomenon where the web server was slow during operation.
It was a problem caused by data continuing to accumulate in the dag_run and log tables of the DB (it became faster after accessing Postgres and deleting the data directly).
Are there any Airflow options to clean the DB periodically?
If there is no such option, we will try to delete the data directly using a DAG script.
Also, I think it's strange that the web server slows down because there is a lot of data. Does the web server fetch all the data when opening another window?
You can purge old records by running:
airflow db clean [-h] --clean-before-timestamp CLEAN_BEFORE_TIMESTAMP [--dry-run] [--skip-archive] [-t TABLES] [-v] [-y]
(cli reference)
It is quite a common setup to include this command in a DAG that runs periodically, for example:
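A minimal sketch of such a maintenance DAG (the DAG id, weekly schedule and 30-day retention window are arbitrary choices, and airflow db clean requires Airflow 2.3+):
from airflow import DAG
from airflow.operators.bash import BashOperator
import pendulum

# Purge metadata rows older than 30 days once a week. Drop --skip-archive
# if you want the cleaned rows copied into archive tables first.
with DAG(
    dag_id="airflow_db_cleanup",
    schedule_interval="@weekly",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
) as dag:
    BashOperator(
        task_id="db_clean",
        bash_command=(
            "airflow db clean "
            "--clean-before-timestamp '{{ macros.ds_add(ds, -30) }}' "
            "--skip-archive -y"
        ),
    )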

Why is Postgres not respecting the configs passed via the config file?

I'm trying to change Postgres settings using the /var/lib/pgsql/12/data/postgresql file.
Specifically, the settings wal_level to minimal or max_wal_senders to 10, in order to restart a broken Postgres service. However, even after changing the config file, it still outputs the same error message: "WAL streaming (max_wal_senders > 0) requires wal_level "replica" or "logical"".
In case anyone runs into a similar issue: the problem was that I initially changed the config from logical to minimal, but it seems that performing that change doesn't adjust the dependent max_wal_senders setting. Upon restarting Postgres, wal_level = minimal is set in an auto-config file that is applied on startup, which conflicts with max_wal_senders in the normal config file. To get it running, we changed the auto-config file back to logical and Postgres started again.
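For context, the auto-config file here is postgresql.auto.conf in the same data directory: it is written by ALTER SYSTEM and its values override postgresql.conf when the server starts, which is why editing postgresql.conf alone was not enough. A sketch of the working end state described above (paths follow the question; the exact values are assumed):
# /var/lib/pgsql/12/data/postgresql.conf
max_wal_senders = 10

# /var/lib/pgsql/12/data/postgresql.auto.conf -- normally managed via ALTER SYSTEM,
# edited by hand here only because the server would not start
wal_level = logical        # was minimal, which conflicts with max_wal_senders > 0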

Docker run MongoDB replica set... how is it working?

So I'm creating some software that makes heavy use of Mongo transactions.
So far, I've tried it only with Testcontainers Mongo and pure unit testing.
Now I'm moving to testing it manually, and I get an error that says something like: Transaction numbers are only allowed on a replica set ..., yet that error doesn't happen during unit tests.
I read that this error happens because transactions are only possible on a replica set, but then, how is Testcontainers working? I checked docker ps while the tests were running and only one Mongo docker container is up.
I checked the args passed by Testcontainers, and it turned out they pass --replSet docker-rs. So I did the same, but then I get this error: NotYetInitialized: Cannot use non-local read concern until replica set is finished initializing.
I'm scratching my head, wondering how Testcontainers is running ONE Mongo docker container that behaves like a replica set.
Assuming you're using the Testcontainers MongoDB module, the missing part in your manual setup is most probably the Mongo replica set initiation.
This is mentioned in the testcontainers module docs as:
Initialize a single replica set via executing a proper command
Also feel free to take a look at the module sources themselves to dig into the implementation details, for example the initReplicaSet() part.
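To reproduce that manually, a rough sketch with docker and mongosh (the container name, image tag and replica-set name are arbitrary; images older than 5.0 ship the legacy mongo shell instead of mongosh):
docker run -d --name mongo-rs -p 27017:27017 mongo:6.0 --replSet docker-rs
docker exec mongo-rs mongosh --eval "rs.initiate()"
Once rs.initiate() has run, the single member elects itself primary and transactions become available, which is essentially what Testcontainers automates for you.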

Upgrade from Sequential executor to Celery executor in Apache Airflow

I have Apache Airflow running on an EC2 instance (Ubuntu). Everything is running fine.
The DB is SQLite and the executor is the Sequential Executor (provided as default). But now I would like to run some DAGs which need to run at the same time, every hour and every 2 minutes.
My question is: how can I upgrade my current setup to the Celery executor and a Postgres DB to get the advantage of parallel execution?
Will it work if I install and set up Postgres, RabbitMQ and Celery, and make the necessary changes in the airflow.cfg configuration file?
Or do I need to re-install everything from scratch (including airflow)?
Please guide me on this.
Thanks
You can, indeed, install Postgres/RabbitMQ/Celery, then update your configuration file (airflow.cfg), initialise the database, and restart the Airflow services.
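As a rough sketch of the airflow.cfg changes involved (the hostnames, credentials and database names below are placeholders, and section/key names can differ slightly between Airflow versions):
[core]
executor = CeleryExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow

[celery]
broker_url = amqp://guest:guest@localhost:5672//
result_backend = db+postgresql://airflow:airflow@localhost:5432/airflow
After saving, initialise the new database (airflow db init, or airflow initdb on 1.10.x) and restart the webserver, scheduler and worker processes.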
However, there is a side note: if required, you'd also have to migrate data from SQLite to Postgres. Most importantly, the database contains your connections and variables. It's possible to export variables beforehand and import them again using the Airflow CLI (see this answer, and the Airflow documentation).
It's also possible to import your connections using the CLI, as described in this Airflow guide (or the documentation).
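With the Airflow 2.x CLI, the variables round-trip looks roughly like this (the file names are arbitrary; connections have an analogous airflow connections export command):
airflow variables export /tmp/variables.json
# switch sql_alchemy_conn to Postgres and run airflow db init, then:
airflow variables import /tmp/variables.json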
If you just switched to the new database setup and you see something's missing, you can still easily switch back to the SQLite setup by reverting the changes to airflow.cfg.

Can't make Airflow use PostgreSQL (RDS) instead of SQLite

I installed Airflow on 3 EC2 nodes: webserver, scheduler and worker. I set the same config in /airflow/airflow.cfg on all 3 nodes; the DB configuration is sql_alchemy_conn = postgresql+psycopg2://airflow:password@rdsdatabaseaddreess.com/airflow.
After that I restarted the Airflow services and executed the command airflow initdb:
[ec2-user@ip-10-0-0-143 airflow]$ /usr/local/bin/airflow initdb
DB: sqlite:////airflow/airflow.db
[2019-11-21 01:39:30,325] {db.py:368} INFO - Creating tables
INFO [alembic.runtime.migration] Context impl SQLiteImpl.
INFO [alembic.runtime.migration] Will assume non-transactional DDL.
WARNI [airflow.utils.log.logging_mixin.LoggingMixin] empty cryptography key - values will not be stored encrypted.
Done.
However, Airflow is still using SQLite: DB: sqlite:////airflow/airflow.db
Please advise.
With best regards.
Yeah, I found a solution: the setting sql_alchemy_conn = postgresql+psycopg2://airflow:██████████@rdsdatabaseaddreess.com/airflow must be in the [core] section. It is mentioned in the official docs. However, it is located in the [database] section in the example config.
So, finally, my settings look like this:
[core]
executor = LocalExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow:████████@rdsdatabaseaddreess.com/airflow
[database]
sql_alchemy_conn = postgresql+psycopg2://airflow:████████@rdsdatabaseaddreess.com/airflow