Airflow with PostgreSQL bottleneck issues

I have set up Airflow 1.10.10 with Celery as the executor, Postgres as the result backend and SQLAlchemy (metadata) connection, and Redis as the broker/message queue.
I'm using one pod for each Airflow component (scheduler, webserver, broker, and 1 worker), with 2 GiB of memory and 2 CPU cores. My Postgres instance is running in Azure with 2 CPU cores.
The main issue is that whenever I start scheduling some of the example DAGs, the CPU usage of Postgres hits ~95% and tasks start to fail because of connection issues (like PID timeouts in the scheduler, or the "FATAL: remaining connection slots are reserved for non-replication superuser connections" error).
I've tried changing some of the SQLAlchemy pool parameters in airflow.cfg, but I'm still seeing the issue.
My question would be: is a Postgres DB running in Azure with 2 CPU cores good enough to handle these DAGs? What would be an appropriate setup? Or how can I prevent Airflow from overloading Postgres? Thanks!
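For context, as far as I can tell the SQLAlchemy pool options I've been tweaking in airflow.cfg correspond to the engine-level arguments below. This is only a rough sketch with placeholder values and a made-up connection string, not Airflow's exact internals:

from sqlalchemy import create_engine

# Placeholder connection string; the real one lives in airflow.cfg as sql_alchemy_conn.
engine = create_engine(
    "postgresql+psycopg2://airflow:***@my-azure-postgres:5432/airflow",
    pool_size=5,         # airflow.cfg: sql_alchemy_pool_size
    max_overflow=10,     # airflow.cfg: sql_alchemy_max_overflow
    pool_recycle=1800,   # airflow.cfg: sql_alchemy_pool_recycle (seconds)
    pool_pre_ping=True,  # airflow.cfg: sql_alchemy_pool_pre_ping
)

Note that the scheduler, webserver, and every worker each hold their own pool, so the total connection count against Postgres is roughly (pool_size + max_overflow) multiplied by the number of Airflow processes.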

Related

SQLAlchemy with Aurora Serverless V2 PostgreSQL - many connections

I have an AWS Aurora Serverless V2 database (PostgreSQL) that is being accessed from a compute cluster. The cluster launches a large number of jobs (>1000), and each job independently puts/pulls some data from the database. The Serverless cluster is set up to autoscale from 2 to 32 units as needed.
The code being run by each cluster job uses SQLAlchemy (either the ORM or the Core). I am setting up each database connection with a null pool and pessimistic disconnect handling (i.e., pool_pre_ping=True). From my reading of the docs, this should handle disconnects caused by connections going stale while idle.
The code is also written to access the DB, get the results, close the connection (to avoid idle connections), and then reopen a connection after processing (5-30 minutes). This works well because, by the time processing is complete, the new connections are staggered and the DB has scaled up.
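The connection handling looks roughly like this (a simplified sketch; the endpoint and the query are placeholders):

from sqlalchemy import create_engine, text
from sqlalchemy.pool import NullPool

# Placeholder endpoint for the Aurora Serverless V2 cluster.
engine = create_engine(
    "postgresql+psycopg2://user:***@my-serverless-cluster:5432/mydb",
    poolclass=NullPool,   # no pooling: close() really closes the connection
    pool_pre_ping=True,   # pessimistic disconnect handling
)

with engine.connect() as conn:
    rows = conn.execute(text("SELECT 1")).fetchall()
# connection is closed here; the job then processes for 5-30 minutes before reconnecting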
My logs show the standard "all connection slots are taken" error, psycopg2.OperationalError: FATAL: remaining connection slots are reserved for non-replication superuser and rds_superuser connections, until the DB has scaled up enough capacity units.
Questions:
Should I be configuring the SQLAlchemy connection differently? It feels like an anti-pattern to put in a custom retry that grabs a connection while waiting for the DB to scale up the available units (a sketch of what I mean is below), since this kind of capability usually seems to be built into SQLAlchemy.
Should I be using an RDS Proxy in front of the database? This also seems like an anti-pattern, adding a proxy in front of an autoscaling DB.
PG version is 10.
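For clarity, the kind of hand-rolled retry I'd rather not write looks roughly like this (a sketch; the attempt count and delays are made up):

import time
from sqlalchemy.exc import OperationalError

def connect_with_retry(engine, attempts=8, base_delay=2.0):
    """Retry the connect while the Serverless cluster scales up more capacity."""
    for attempt in range(attempts - 1):
        try:
            return engine.connect()
        except OperationalError:
            # most likely "remaining connection slots are reserved..."; back off and retry
            time.sleep(base_delay * (2 ** attempt))
    return engine.connect()  # last attempt: let the error propagate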

High latency, connection and queue depth in Airflow postgres backend

We are running Airflow with Postgres (on RDS) as the metadata DB and result backend, and Redis as the Celery backend. Over the weekend the read/write latency, and subsequently the number of connections and the queue depth, increased to the point that:
A number of tasks failed with timeouts
All of Airflow slowed down
Eventually, modifying the DB to a larger instance resolved the issue; however, the root cause is still not obvious.
Has anyone else faced this issue and found a solution?
Airflow: 1.10.5
Postgres: 9.6.15
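While this is happening, a query along these lines shows where the connections are piling up (a sketch; the DSN is a placeholder and assumes direct access to the metadata DB):

import psycopg2

# Placeholder DSN for the RDS metadata database.
conn = psycopg2.connect("host=my-rds-endpoint dbname=airflow user=airflow password=***")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT state, client_addr, count(*) AS connections
        FROM pg_stat_activity
        GROUP BY state, client_addr
        ORDER BY connections DESC;
    """)
    for state, client_addr, connections in cur.fetchall():
        print(state, client_addr, connections)
conn.close()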

Number of concurrent database connections

We are using an Amazon r3.8xlarge Postgres RDS instance for our production server. I checked the max connections limit of the RDS instance; it happens to be 8192.
I have a service deployed in ECS, and each ECS task can take one database connection. The task count goes up to 2000 during peak load, which means we will have 2000 concurrent connections to the database.
I want to check whether it is OK to have 2000 concurrent connections to the database and, secondly, whether it will impact the performance of Amazon Postgres RDS.
Having 2000 connections at a time should not cause any performance issues, since AWS manages the performance side. There are many DB load-testing tools available if you want to be completely sure about this.
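If you want to double-check the headroom yourself, something along these lines compares the current connection count against max_connections (a sketch; the DSN is a placeholder):

import psycopg2

# Placeholder DSN for the r3.8xlarge RDS instance.
conn = psycopg2.connect("host=my-rds-endpoint dbname=mydb user=admin password=***")
with conn, conn.cursor() as cur:
    cur.execute("SHOW max_connections;")
    max_connections = int(cur.fetchone()[0])
    cur.execute("SELECT count(*) FROM pg_stat_activity;")
    in_use = cur.fetchone()[0]
    print(f"{in_use} of {max_connections} connection slots in use")
conn.close()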

Idle (not idle in transaction) connections are not released/closed in PostgreSQL AWS RDS

I'm using the C3P0 connection pool and PostgreSQL (10.3) on AWS RDS.
I ran a load test at low TPS (1 TPS) for 2 minutes. After the load test finished, the number of connections did not drop according to the monitoring dashboard in AWS RDS, and neither did CPU utilization.
I'm still new to databases and not sure if this is expected. It looks like it's reaching the RDS instance's max_connections. I did a SELECT from pg_stat_activity: 99% of the connections are idle, and most of the queries are SHOW TRANSACTION ISOLATION LEVEL and SELECT 1.
Here's my C3P0 config:
maxConnection: 100
initialPoolSize: 1
minPoolSize: 1
acquireIncrement: 1
idleConnectionTestPeriod: 40
maxIdleTime: 20
maxConnectionAge: 30
maxStatements: 0
numHelperThread: 5
preferredTestQuery: SELECT 1
propertyCycle: 0
testConnectionOnCheckIn: false
testConnectionOnCheckOut: false
debugUnreturnedConnectionStacktraces: false
unreturnedConnectionTimeout: 60
acquireRetryAttempts: 10
acquireRetryDelay: 1000
checkoutTimeout: 10000
Any help will be appreciated! Thanks in advance!
Load test tool: it's a company-internal load test tool. Generally speaking, it generates load against the service (5+ hosts) to hit my API; the API gets a connection from the pool via connectionPool.getDataSource().getConnection() (ComboPooledDataSource). The connection pool is a singleton instance within the service, while each call to the API runs in its own thread.
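The pg_stat_activity check mentioned above was essentially the following (a sketch; the DSN is a placeholder):

import psycopg2

# Placeholder DSN for the RDS instance.
conn = psycopg2.connect("host=my-rds-endpoint dbname=mydb user=app password=***")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT pid, application_name, state,
               now() - state_change AS idle_for,
               query
        FROM pg_stat_activity
        WHERE state = 'idle'
        ORDER BY idle_for DESC;
    """)
    for row in cur.fetchall():
        print(row)
conn.close()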

High CPU Utilisation on AWS RDS - Postgres

I attempted to migrate my production environment from a native Postgres environment (hosted on AWS EC2) to RDS Postgres (9.4.4), but it failed miserably. The CPU utilisation of the RDS Postgres instances shot up drastically compared to that of the native Postgres instances.
My environment details are as follows:
Master: db.m3.2xlarge instance
Slave1: db.m3.2xlarge instance
Slave2: db.m3.2xlarge instance
Slave3: db.m3.xlarge instance
Slave4: db.m3.xlarge instance
[Note: All the slaves were at Level 1 replication]
I had configured the master to receive only write requests, and this instance was fine. The write count was 50 to 80 per second, and the CPU utilisation was around 20 to 30%.
But apart from this instance, all my slaves performed very badly. The slaves were configured to receive only read requests, and I assume any writes that were happening were due to replication.
Provisioned IOPS on these boxes was 1000.
On average, there were 5 to 7 read requests hitting each slave, and the CPU utilisation was 60%.
Whereas on native Postgres, we stay well within 30% for this traffic.
I couldn't figure out what is going wrong with the RDS setup, and AWS support has not been able to provide good leads.
Did anyone face similar things with RDS Postgres?
There are lots of factors that can push CPU utilization up on PostgreSQL, like:
Free disk space
CPU Usage
I/O usage, etc.
I came across the same issue a few days ago. For me, the reason was that some transactions were getting stuck and had been running for a long time, and as a result CPU utilization increased. I found this out by running a PostgreSQL monitoring query:
SELECT max(now() - xact_start) FROM pg_stat_activity
WHERE state IN ('idle in transaction', 'active');
This query shows how long the oldest transaction has been running; that time should not be greater than one hour. So killing the transactions that had been running for a long time, or that were stuck at some point, worked for me. I followed this post for monitoring and solving my issue; it includes lots of useful commands for monitoring this situation.
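The cleanup itself was essentially the following (a sketch; the DSN is a placeholder, the one-hour threshold is arbitrary, and pg_terminate_backend kills sessions, so use it carefully):

import psycopg2

# Placeholder DSN; needs a role allowed to terminate other sessions (e.g. rds_superuser).
conn = psycopg2.connect("host=my-rds-endpoint dbname=mydb user=admin password=***")
with conn, conn.cursor() as cur:
    # terminate sessions whose current transaction has been open for over an hour
    cur.execute("""
        SELECT pid, pg_terminate_backend(pid)
        FROM pg_stat_activity
        WHERE state IN ('idle in transaction', 'active')
          AND xact_start < now() - interval '1 hour'
          AND pid <> pg_backend_pid();
    """)
    print(cur.fetchall())
conn.close()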
I would suggest increasing your work_mem value, as it might be too low, and doing normal query optimization research to see if you're using queries without proper indexes.
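As a rough starting point for that, you can check the current value and try a larger one per session before touching the instance-wide setting (a sketch; the DSN is a placeholder, and on RDS the global value is changed through the parameter group):

import psycopg2

# Placeholder DSN.
conn = psycopg2.connect("host=my-rds-endpoint dbname=mydb user=admin password=***")
with conn, conn.cursor() as cur:
    cur.execute("SHOW work_mem;")
    print("current work_mem:", cur.fetchone()[0])
    # try a larger value for this session only, then re-run the slow query with EXPLAIN ANALYZE
    cur.execute("SET work_mem = '64MB';")
    cur.execute("EXPLAIN ANALYZE SELECT 1;")  # replace SELECT 1 with the query being tuned
    for (line,) in cur.fetchall():
        print(line)
conn.close()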