Airflow database performance impact due to multiple connections to the database (PostgreSQL)

I have set up Airflow 1.10 to schedule Python DAGs. I am also working on another project that needs data from Airflow's external PostgreSQL backend database, queried at a regular 5-minute interval.
Now I am trying to understand how multiple connections affect the performance of the Airflow database, so that I can plan how to pull the data from it for this other purpose.

Related

SQLAlchemy with Aurora Serverless V2 PostgreSQL - many connections

I have an AWS Serverless V2 database setup (PostgreSQL) that is being accessed from a compute cluster. The cluster launches a large number of jobs (>1000) and each job independently puts/pulls some data from the database. The Serverless cluster is set up to autoscale from 2 to 32 units as needed.
The code being run by each cluster job is using SQLAlchemy (either the ORM or the core). I am setting up each database connection with a null pool and pessimistic disconnect handling (i.e., pool_pre_ping=True). From my reading of the docs this should be handling disconnects due to being idle mid-connection.
Code is also written to access the DB, get the results, close the connection (to avoid idle connections), and then reopen the connection after processing (5-30 minutes). This is working well because once processing is completed, the new connections are staggered and the DB has scaled up.
My logs show the standard "all connections are taken" error, psycopg2.OperationalError: FATAL: remaining connection slots are reserved for non-replication superuser and rds_superuser connections, until the DB scales the available units high enough.
Questions:
Should I be configuring the SQLAlchemy connection differently? It feels like an anti-pattern to put in a custom retry to grab a connection while waiting for the DB to scale the number of available units as this type of capability seems to be built into SQLAlchemy usually.
Should I be using an RDS Proxy in front of the database? This also seems like an anti-pattern, adding a proxy in front of an autoscaling DB.
PG version is 10.
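For reference, the custom retry that the question mentions could look roughly like the sketch below. It is not a SQLAlchemy feature: connect_with_backoff and its parameters are hypothetical names, and the sketch assumes that in real use connect would be something like engine.connect and retry_on would be sqlalchemy.exc.OperationalError. It uses only the standard library so that it runs on its own.

```python
import random
import time


def connect_with_backoff(connect, retry_on=(Exception,), max_attempts=6,
                         base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call connect() until it succeeds, backing off exponentially with jitter.

    connect  -- zero-argument callable returning a connection (assumption:
                e.g. engine.connect from a NullPool SQLAlchemy engine)
    retry_on -- exception types treated as "no free connection slot yet"
    sleep    -- injectable so the wait can be skipped in tests
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except retry_on:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # 1s, 2s, 4s, ... capped at max_delay, plus up to 10% jitter
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay * 0.1))
```

The jitter spreads out jobs that all fail at the same moment, so they do not all retry in lockstep while the database is still scaling up.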

High number of connections to Airflow Metadata DB

I tried to find information about the number of connections that Airflow establishes with the metadata database instance (Postgres in my case).
By running select * from pg_stat_activity I realized it creates at least 7 connections whose states change between idle and idle in transaction. The queries are registered as COMMIT or SELECT 1 (mostly). This was using the LocalExecutor on Airflow 2.1, but I tested with an installation of Airflow 1.10 having the same results.
Is anyone aware of where these connections come from? And, is there a way (and a reason) to change this?
Yes. Airflow will open a large number of connections: basically every process it creates will almost certainly open at least one connection. This is a known characteristic of Apache Airflow.
If you are using MySQL, this is not a big issue, as MySQL is good at handling multiple connections (it multiplexes incoming connections via threads). Postgres uses a process-per-connection approach, which is much more resource-hungry.
The recommended way to handle that (Postgres is the most stable backend for Airflow) is to use PgBouncer to proxy such connections to Postgres.
In our official Helm chart, PgBouncer is used by default when Postgres is used. https://airflow.apache.org/docs/helm-chart/stable/index.html
I highly recommend this approach.
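To see how many connections are open before putting a pooler in front, the pg_stat_activity check from the question can be grouped by client and state. A minimal sketch, assuming a DB-API connection such as one returned by psycopg2.connect; the function name connections_by_state is made up for illustration, and application_name may simply be empty if the client does not set it.

```python
from collections import defaultdict

# pg_stat_activity is a standard PostgreSQL view; %s is the psycopg2
# parameter style (an assumption about the driver in use).
ACTIVITY_SQL = """
    SELECT application_name, state, count(*)
    FROM pg_stat_activity
    WHERE datname = %s
    GROUP BY application_name, state
"""


def connections_by_state(conn, dbname):
    """Return {application_name: {state: count}} for one database."""
    cur = conn.cursor()
    try:
        cur.execute(ACTIVITY_SQL, (dbname,))
        rows = cur.fetchall()
    finally:
        cur.close()
    summary = defaultdict(dict)
    for app, state, count in rows:
        # Empty names and NULL states are replaced with readable labels.
        summary[app or "(unnamed)"][state or "(no state)"] = count
    return dict(summary)
```

Running this a few times a minute apart shows whether the connection count is stable or grows with the number of running tasks.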

Best way to set up jupyter notebook project in AWS

My current project has the following structure:
It starts with a script in a Jupyter notebook which downloads data from a CRM API and puts it in a local PostgreSQL database that I run with pgAdmin. After that it runs a cluster analysis, returns some scoring values, creates a table in the database with the results and updates these values in the CRM with another API call. This process will take between 10 and 20 hours (the API only allows 400 requests per minute).
The second notebook reads the database, detects the last update, runs an API call to update the database since the last call, runs a k-means analysis to cluster the data, compares the results with the previous call, and updates the new ones and the CRM via the API. This second process takes less than 2 hours by my estimation and I want this script to run every 24 hours.
After testing, this works fine. Now I'm evaluating how to put this in production in AWS. I understand that for the notebooks I need SageMaker, and from what I have seen it is not that complicated; my only doubt here is whether I can call the API without implementing additional code or whether it needs some configuration. My second problem is the database. I don't understand the difference between RDS which is the one I think I have to use for this and Aurora or S3. My goal is to write as little code as possible, but I have tried some RDS tutorials like this one: [1]: https://www.youtube.com/watch?v=6fDTre5gikg&t=10s, and I understand this connects my local Postgres to AWS, but I can't find the data in the Amazon page; does it only create an instance? And how do I connect to it to analyze this data from SageMaker? My final goal is to run the notebooks in the cloud and connect to my postgres in the cloud. Just some orientation about how to use these tools would be appreciated.
I don't understand the difference between RDS which is the one I think I have to use for this and Aurora or S3
RDS and Aurora are relational databases fully managed by AWS. "Regular" RDS allows you to launch existing popular databases such as MySQL, PostgreSQL and others, which you can also run at home or at work.
Aurora is an in-house, cloud-native database implementation compatible with MySQL and PostgreSQL. It can store the same data as RDS MySQL or PostgreSQL, but provides a number of features not available in RDS, such as more read replicas, distributed storage, global databases and more.
S3 is not a database but an object store, where you can store files such as images, CSVs and Excel spreadsheets, much as you would store them on your computer.
I understand this connects my local Postgres to AWS, but I can't find the data in the Amazon page; does it only create an instance?
You can migrate your data from your local Postgres to RDS or Aurora if you wish. But neither RDS nor Aurora will connect to your existing local database, as they are databases themselves.
My final goal is to run the notebooks in the cloud and connect to my postgres in the cloud.
I don't see a reason why you wouldn't be able to connect to the database. You can try to make it work, and if you encounter difficulties you can post a new question on SO with your RDS/Aurora setup details.
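As a rough sketch of what connecting from a notebook involves: an RDS or Aurora PostgreSQL instance exposes an endpoint hostname, and the notebook connects to it like any other Postgres server, provided the instance's security group allows the traffic. The helper below only builds the connection URL with the standard library. The function name, endpoint, user and database names are all placeholders, and passing the URL to sqlalchemy.create_engine is an assumption about the client library.

```python
from urllib.parse import quote_plus


def postgres_url(user, password, host, dbname, port=5432):
    """Build a PostgreSQL connection URL, escaping special characters."""
    return "postgresql://{}:{}@{}:{}/{}".format(
        quote_plus(user), quote_plus(password), host, port, dbname
    )


# Placeholder values: substitute the endpoint shown for your own instance.
url = postgres_url(
    user="analyst",
    password="p@ss/word",
    host="mydb.abc123.us-east-1.rds.amazonaws.com",
    dbname="crm",
)
print(url)
```

Escaping matters because characters such as @ or / in a password would otherwise be read as part of the URL structure.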

High latency, connection and queue depth in Airflow postgres backend

We are running Airflow with Postgres (in RDS) as the metadata DB and result backend, and Redis as a Celery backend. Over the weekend the read/write latency, and subsequently the number of connections and the queue depth, increased so much that:
A number of tasks failed with Timeout
All of Airflow slowed down
Eventually, changing the DB instance to a larger instance type resolved the issue; however, the root cause is not obvious yet.
Has anyone else faced this issue and found a solution?
Airflow: 1.10.5
Postgres: 9.6.15

Multi Node Cluster in Postgres

Is there a way we can set up a multi-node cluster in Postgres like Oracle RAC?
Oracle RAC allows multiple computers to run Oracle RDBMS software simultaneously while accessing a single database, thus providing clustering.
So far I have gone through several articles, but it seems Postgres does not support it.
PGPool is the only way we can do load balancing, but it won't be the same as RAC.