Utilizing multiple database pool connections within multiple gunicorn workers - postgresql

I am running a Flask server that initialises the app by creating 10 connections in a psycopg2 connection pool (backed by Postgres). The server receives 40 requests every second.
Every request uses 1 connection and spends approximately 5 seconds in the database. If no connection is free in the pool, a new one is created.
The PostgreSQL server allows a maximum of 150 database connections, and I am struggling to choose the maximum number of connections for the pool. For the pool initialization, I use:
import psycopg2.pool

# minconn=10 connections are opened up front; maxconn=145 is the cap for this pool
app.config['pool'] = psycopg2.pool.ThreadedConnectionPool(
    10, 145,
    host=config["HOST"],
    database=config["DATABASE"],
    user=config["USER"],
    password=config["PASSWORD"],
)
I know it may not be possible to share connections across multiple workers. What is the best practice for utilizing these 150 connections across multiple workers?
For reference, my tech stack is Flask + PostgreSQL (on Azure). For deployment, I use gunicorn and nginx with Flask.
This is my gunicorn command:
gunicorn --bind 0.0.0.0:8000 --worker-class=gevent --worker-connections=1000 --workers=3 --timeout=1000 manage:app

The easiest solution is to change the worker-class from gevent to sync or possibly gthread.
With the gevent worker, psycopg2's blocking database calls are not cooperative by default, so a query that takes 5 seconds stalls every greenlet in that worker. The gunicorn documentation addresses this directly: "For full greenlet support applications might need to be adapted. When using, e.g., Gevent and Psycopg it makes sense to ensure psycogreen is installed and setup." (https://docs.gunicorn.org/en/stable/design.html)
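To put numbers on the question, here is a minimal sketch of the connection arithmetic. It assumes that each gunicorn worker process builds its own pool (pools are not shared between processes) and that the 5 connections of headroom implied by the question's 145 are kept free. The function names are hypothetical.

```python
import math


def concurrent_connections_needed(requests_per_second, seconds_per_request):
    """Little's law: average requests in flight = arrival rate x time per request."""
    return math.ceil(requests_per_second * seconds_per_request)


def per_worker_maxconn(server_limit, workers, reserved=5):
    """Split the usable server connections evenly across worker processes."""
    return (server_limit - reserved) // workers


needed = concurrent_connections_needed(40, 5)  # 40 req/s, 5 s in the database
budget = per_worker_maxconn(150, 3)            # 150-connection limit, 3 workers

print(needed)      # 200
print(budget)      # 48
print(budget * 3)  # 144
```

Under these assumptions each of the 3 workers could be given a maxconn of 48 rather than 145, since three pools capped at 145 could together ask for 435 connections. The steady-state demand of about 200 connections also exceeds the 150-connection limit however the budget is split, so some requests have to wait for a connection or the time spent in the database has to come down.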

Related

SQLAlchemy with Aurora Serverless V2 PostgreSQL - many connections

I have an AWS Serverless V2 database setup (postgresql) that is being accessed from a compute cluster. The cluster launches a large number of jobs (>1000) and each job independently puts/pulls some data from the database. The Serverless cluster is setup to autoscale from 2 to 32 units as needed.
The code being run by each cluster job is using SQLAlchemy (either the ORM or the core). I am setting up each database connection with a null pool and pessimistic disconnect handling (i.e., pool_pre_ping=True). From my reading of the docs this should be handling disconnects due to being idle mid-connection.
Code is also written to access the DB, get the results, close the connection (to avoid idle connections), and then reopen the connection after processing (5-30 minutes). This is working well because once processing is completed, the new connections are staggered and the DB has scaled up.
My logs show the standard "all connections are taken" error, psycopg2.OperationalError: FATAL: remaining connection slots are reserved for non-replication superuser and rds_superuser connections, until the DB has scaled the available units high enough.
Questions:
Should I be configuring the SQLAlchemy connection differently? It feels like an anti-pattern to put in a custom retry to grab a connection while waiting for the DB to scale the number of available units as this type of capability seems to be built into SQLAlchemy usually.
Should I be using an RDS Proxy in front of the database? This also seems like an anti-pattern, adding a proxy in front of an autoscaling DB.
PG version is 10.
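For reference, the kind of custom retry the question describes could look roughly like the sketch below: exponential backoff with jitter around whatever call opens the connection. Everything here is an assumption for illustration: connect_with_backoff, its parameters, and the default delays are hypothetical names and values, not SQLAlchemy features. With SQLAlchemy, the connect argument might be engine.connect and is_retryable might check for sqlalchemy.exc.OperationalError.

```python
import random
import time


def connect_with_backoff(connect, is_retryable, attempts=6,
                         base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call connect() until it succeeds, backing off between failed attempts.

    connect      -- zero-argument callable that returns a connection
    is_retryable -- callable(exc) -> bool, True for "no free slots" style errors
    sleep        -- injectable for testing; defaults to time.sleep
    """
    for attempt in range(attempts):
        try:
            return connect()
        except Exception as exc:
            # Give up on the last attempt or on errors that are not transient.
            if attempt == attempts - 1 or not is_retryable(exc):
                raise
            # Exponential backoff capped at max_delay, with full jitter so that
            # many jobs starting at once do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

The jitter matters in this scenario because more than 1000 jobs may start at the same moment; without it they would all retry at the same instants while the database is still scaling.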

High number of connections to Airflow Metadata DB

I tried to find information about the number of connections that Airflow establishes with the metadata database instance (Postgres in my case).
By running select * from pg_stat_activity I realized it creates at least 7 connections whose states change between idle and idle in transaction. The queries are registered as COMMIT or SELECT 1 (mostly). This was using the LocalExecutor on Airflow 2.1, but I tested with an installation of Airflow 1.10 having the same results.
Is anyone aware of where these connections come from? And, is there a way (and a reason) to change this?
Yes. Airflow opens a large number of connections: almost every process it creates opens at least one. This is a known characteristic of Apache Airflow.
If you are using MySQL, this is not a big issue, because MySQL handles many connections well (it multiplexes incoming connections via threads). Postgres uses a process-per-connection approach, which is much more resource-hungry.
The recommended way to handle that (Postgres is the most stable backend for Airflow) is to use PgBouncer to proxy these connections to Postgres.
In our official Helm chart, PgBouncer is used by default when Postgres is used. https://airflow.apache.org/docs/helm-chart/stable/index.html
I highly recommend this approach.

Why does my mongoDB account have 292 connections?

I only write data into my mongoDB database once a day and I am not currently writing any data into it but there have been a consistent 292 connections into my database for the past three hours. No reads or writes, just connections and a consistent 29 commands per second since this started.
Concerned by this, I adjusted settings to only allow access from one specific IP, and changed all my passwords but the number hasn't changed, still 292 connections and 29 commands per second. Any idea what is causing this or perhaps how I can dig in further?
The number of connections depends on the cluster setup. A connection can be external (e.g. your app or monitoring tools) or internal (e.g. to replicate your data to secondary nodes or a backup process).
You can use db.currentOp() to list in-progress operations and the connections they run on; db.currentOp(true) also includes idle connections.
Consider that your app instance(s) may not open just 1 connection, but several, depending on the driver that connects to the DB and how it handles connection pooling. The connection pool size can be thought of as the max number of concurrent requests that your driver can service. For example, the default connection pool size for the Node.js MongoDB driver is 5 in the 3.x series (later versions use a larger default). If you have set a high pool size, either with the driver or connection string, your app may open many connections to concurrently process the write commands.
You can start by process of elimination:
Completely cut your app off from the DB. There is a keep-alive time, so connections won't close immediately unless the driver closes them formally. You may have to wait some time, depending on the keep-alive setting. You can also restart your cluster and see how many connections there are initially.
Connect your app to the DB and check how the connection number changes with each request. Check whether your app properly closes connections to the DB at some point after opening them.

How to close SQL connections of old Cloud Run revisions?

Context
I am running a SpringBoot application on Cloud Run which connects to a postgres11 CloudSQL database using a Hikari connection pool. I am using the smallest PSQL instance (1vcpu/614mb/25connection limit). For the setup, I have followed these resources:
Connecting to Cloud SQL from Cloud Run
Managing database connections
Problem
After deploying the third revision, I get the following error:
FATAL: remaining connection slots are reserved for non-replication superuser connections
What I found out
Default connection pool size is 10, hence why it fails on the third deployment (30 > 25).
When deleting an old revision, active connections shown in the Cloud SQL admin panel drop by 10, and the next deployment succeeds.
Question
It seems, that old Cloud Run revisions are being kept in a "cold" state, maintaining their connection pools. Is there a way to close these connections without deleting the revisions?
In the best practices section it says:
"...we recommend that you use a client library that supports connection pools that automatically reconnect broken client connections."
What is the recommended way of managing connection pools in Cloud Run, given that it seems old revisions somehow manage to maintain their connections?
Thanks!
Currently, Cloud Run doesn't provide any guarantees on how long an instance will remain warm after it has started up. When not in use, the instance is severely throttled but not necessarily shut down. Thus, you have some revisions that hold on to connections even when no traffic is being directed to them.
Even in this situation, I disagree with the idea that you should avoid connection pooling. Connection pooling can lower latency, improve stability, and help put an upper limit on the number of open connections. Instead, you can use some of the following configuration options to help keep your pool in check:
minimumIdle - This property controls the minimum number of idle connections that HikariCP tries to maintain in the pool. If the idle connections dip below this value and total connections in the pool are less than maximumPoolSize, HikariCP will make a best effort to add additional connections quickly and efficiently.
maximumPoolSize - This property controls the maximum size that the pool is allowed to reach, including both idle and in-use connections.
idleTimeout - This property controls the maximum amount of time that a connection is allowed to sit idle in the pool. This setting only applies when minimumIdle is defined to be less than maximumPoolSize. Idle connections will not be retired once the pool reaches minimumIdle connections.
If you set minimumIdle to 0, your application will still be able to use up to maximumPoolSize connections at once. However, once a connection has been idle in the pool for longer than idleTimeout (which HikariCP takes in milliseconds), it will be closed. If you set idleTimeout to something small like 1 minute (60000), the number of connections your pool holds can scale down to 0 when not in use.
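The interaction between minimumIdle and idleTimeout can be illustrated with a simplified Python model of one idle sweep. This is a sketch of the rule as described above, not HikariCP code; sweep_idle and its arguments are hypothetical names, and all times are in milliseconds.

```python
def sweep_idle(idle_ms, minimum_idle, idle_timeout_ms):
    """Return the idle times of the connections that survive one sweep.

    A connection is retired only if it has been idle longer than
    idle_timeout_ms AND retiring it keeps at least minimum_idle idle
    connections in the pool.
    """
    removable = len(idle_ms) - minimum_idle
    survivors = []
    for age in idle_ms:
        if removable > 0 and age > idle_timeout_ms:
            removable -= 1  # this connection is closed
        else:
            survivors.append(age)
    return survivors


# Ten connections that have all been idle for 70 s, with a 60 s timeout:
print(len(sweep_idle([70_000] * 10, minimum_idle=0, idle_timeout_ms=60_000)))   # 0
print(len(sweep_idle([70_000] * 10, minimum_idle=10, idle_timeout_ms=60_000)))  # 10
```

With minimumIdle at 0 an unused revision ends up holding no connections, whereas with minimumIdle equal to the pool size of 10 the timeout never retires anything, which matches the behaviour seen with the old revisions.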
Hope this helps!
The issue here is that the connections don't get closed by HikariCP when they are opened. I don't know much about Hikari but I found this which explains how connections should be handled through Hikari. I hope that helps!

docker swarm - connections from wildfly to postgres randomly hang

I'm experiencing a weird problem when deploying a docker stack (compose file).
I have a three node docker swarm - master and two workers.
All machines are CentOS 7.5 with kernel 3.10.0 and docker 18.03.1-ce.
Most things run on the master, one of which is a wildfly (v9.x) application server.
On one of the workers is a postgres database.
After deploying the stack things work normally, but after a while (or maybe after a specific action in the web app) requests start to hang.
Running netstat -ntp inside the wildfly container shows 52 bytes stuck in the Send-Q:
tcp 0 52 10.0.0.72:59338 10.0.0.37:5432 ESTABLISHED -
On the postgres side the connection is also in ESTABLISHED state, but the send and receive queues are 0.
It's always exactly 52 bytes. I read somewhere that ACK packets with timestamps are also 52 bytes. Is there any way I can verify that?
We have the following sysctl tunables set:
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_timestamps = 0
The first three were needed because of this.
All services in the stack are connected to the same default network that docker creates.
If I move the postgres service onto the same host as the wildfly service, the problem doesn't seem to surface. It also doesn't seem to appear if I declare a separate network for postgres and attach it only to the services that need the database (and to the database itself, of course).
Has anyone come across a similar issue? Can anyone provide any pointers on how I can debug the problem further?
It turns out this is a known issue with pooled connections in swarm when the services run on different nodes.
Basically, the workaround is to set the tunables above and enable TCP keepalive on the socket. See here and here for more details.
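The net.ipv4.tcp_keepalive_* sysctls only affect sockets that have SO_KEEPALIVE switched on, which is why the second half of the workaround is needed. As an illustration of what "enable TCP keepalive on the socket" means at the socket level, here is a minimal Python sketch; the actual application in the question is Java-based, so this is not its configuration. enable_keepalive is a hypothetical helper, and the defaults simply mirror the sysctl values from the question.

```python
import socket


def enable_keepalive(sock, idle=600, interval=60, probes=3):
    """Turn on TCP keepalive for one socket and, where supported, tune it."""
    # Without this flag the kernel never sends keepalive probes on the socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket overrides of the system-wide sysctls. These constants are
    # platform-specific, so each one is applied only if it exists.
    per_socket = (
        ("TCP_KEEPIDLE", idle),       # seconds of idleness before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", probes),      # failed probes before the connection is dropped
    )
    for name, value in per_socket:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)  # True
sock.close()
```

Database drivers and connection pools usually expose this as a connection option rather than requiring direct socket access, so in practice the equivalent setting would go in the datasource configuration.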