After moving to a new version of our platform, we have seen significantly increased wait statistics, culminating in errors such as: could not receive data from client: An existing connection was forcibly closed by the remote host
The top three wait events, in descending order, are:
1. ClientRead
2. IO: DataFileRead
3. LWLock: buffer_io
We are using PostgreSQL v11 and have tables that receive heavy bursts of access within short time intervals.
Before our migration we did not use WebSockets.
Any hints on where to look for solutions would be most welcome. We have found several resources on Azure, but nothing that explains how to dig into these wait statistics in a systematic, procedural way.
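If it helps, one procedural way to dig further is to sample the wait events yourself from pg_stat_activity while the load is running. A minimal sketch, assuming psycopg2 and placeholder connection details:

    # Sketch: periodically sample wait events from pg_stat_activity (PostgreSQL 9.6+).
    # The DSN, interval and sample count are placeholders.
    import time
    import psycopg2

    QUERY = """
        SELECT wait_event_type, wait_event, count(*) AS backends
        FROM pg_stat_activity
        WHERE state <> 'idle'
        GROUP BY wait_event_type, wait_event
        ORDER BY backends DESC
    """

    def sample_wait_events(dsn, interval_s=5, samples=12):
        conn = psycopg2.connect(dsn)
        conn.autocommit = True  # each sample sees fresh pg_stat_activity data
        try:
            with conn.cursor() as cur:
                for _ in range(samples):
                    cur.execute(QUERY)
                    for wait_type, wait_event, backends in cur.fetchall():
                        print(f"{wait_type or '-'} / {wait_event or '-'}: {backends}")
                    print("---")
                    time.sleep(interval_s)
        finally:
            conn.close()

    if __name__ == "__main__":
        sample_wait_events("host=myserver dbname=mydb user=monitor password=...")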
I have an AWS Aurora Serverless v2 database (PostgreSQL) that is accessed from a compute cluster. The cluster launches a large number of jobs (>1000), and each job independently puts/pulls some data from the database. The Serverless cluster is set up to autoscale from 2 to 32 capacity units as needed.
The code run by each cluster job uses SQLAlchemy (either the ORM or the Core). I am setting up each database connection with a null pool and pessimistic disconnect handling (i.e., pool_pre_ping=True). From my reading of the docs, this should handle disconnects caused by a connection being dropped while idle.
The code is also written to access the DB, fetch the results, close the connection (to avoid idle connections), and then reopen a connection after processing (5-30 minutes). This works well because, by the time processing completes, the new connections are staggered and the DB has scaled up.
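For illustration, the engine setup is roughly along these lines (a sketch only; the URL, table and query are placeholders):

    # Sketch of the setup described above: no client-side pooling (NullPool)
    # plus pessimistic disconnect handling (pool_pre_ping).
    # The URL, table and column names are placeholders.
    from sqlalchemy import create_engine, text
    from sqlalchemy.pool import NullPool

    engine = create_engine(
        "postgresql+psycopg2://user:password@cluster-endpoint:5432/mydb",
        poolclass=NullPool,   # no connections are held open between uses
        pool_pre_ping=True,   # test each connection with SELECT 1 before use
    )

    def fetch_batch(job_id):
        # Open a connection, read, and close it again so nothing sits idle
        # while the job spends 5-30 minutes processing.
        with engine.connect() as conn:
            return conn.execute(
                text("SELECT payload FROM jobs WHERE job_id = :id"), {"id": job_id}
            ).fetchall()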
My logs show the standard all-connection-slots-taken error until the DB scales the available capacity high enough: psycopg2.OperationalError: FATAL: remaining connection slots are reserved for non-replication superuser and rds_superuser connections
Questions:
1. Should I be configuring the SQLAlchemy connection differently? It feels like an anti-pattern to add a custom retry that grabs a connection only once the DB has scaled up its available capacity, since this kind of capability usually seems to be built into SQLAlchemy (a sketch of the kind of retry I mean follows below).
2. Should I be using an RDS Proxy in front of the database? This also seems like an anti-pattern: putting a proxy in front of an autoscaling DB.
PG version is 10.
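For reference, the kind of custom retry I mean in question 1 would be something like this hypothetical wrapper (error matching and backoff values are illustrative only):

    # Hypothetical retry wrapper: keep retrying the connection while the
    # serverless DB scales up. Backoff values and error matching are illustrative.
    import time
    from sqlalchemy.exc import OperationalError

    def connect_with_retry(engine, attempts=8, base_delay=2.0):
        for attempt in range(attempts):
            try:
                return engine.connect()
            except OperationalError as exc:
                if "remaining connection slots are reserved" not in str(exc):
                    raise  # unrelated failure, don't mask it
                time.sleep(base_delay * (2 ** attempt))  # exponential backoff
        raise RuntimeError("database did not scale up in time")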
https://cloud.google.com/sql/docs/quotas mentions that "Cloud Run services are limited to 100 connections to a Cloud SQL database." Assuming I deploy my service on Cloud Run, what is the right way to handle 1 million concurrent connections? Can Cloud Spanner enable this? I can't find documentation discussing the maximum number of concurrent connections for Cloud Spanner when used with Cloud Run.
Do you want Cloud Run to handle a million concurrent connections, or do you want Cloud SQL to handle a million concurrent connections?
If you want Cloud SQL to handle a million concurrent connections, you are probably wrong. Check out this article about Pool sizing (it's on a Java repo, but is general enough to apply to all connection pooling). If you are at the point where you need a million concurrent connections, you would need to invest in more advanced architectures (such as sharding).
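To make the pool-sizing point concrete: each Cloud Run instance should hold a small, bounded pool and serve many requests from it, rather than opening one database connection per client. A rough sketch with SQLAlchemy (URL, sizes and query are illustrative):

    # Sketch: one small, shared pool per Cloud Run instance instead of one
    # database connection per incoming request. URL, sizes and query are illustrative.
    from sqlalchemy import create_engine, text

    engine = create_engine(
        "postgresql+psycopg2://user:password@10.0.0.3:5432/mydb",
        pool_size=5,       # a handful of connections per instance is usually plenty
        max_overflow=2,
        pool_timeout=30,   # requests wait for a free connection instead of opening more
    )

    def handle_request(user_id):
        with engine.connect() as conn:
            return conn.execute(
                text("SELECT name FROM users WHERE id = :id"), {"id": user_id}
            ).scalar()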
I tried to find information about the number of connections that Airflow establishes with the metadata database instance (Postgres in my case).
By running select * from pg_stat_activity I realized it creates at least 7 connections whose states alternate between idle and idle in transaction. The queries are mostly registered as COMMIT or SELECT 1. This was with the LocalExecutor on Airflow 2.1, but I tested with an Airflow 1.10 installation and got the same results.
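A more targeted query than select * can help attribute those connections, e.g. grouping by user, application name, client address and state. A small sketch, assuming psycopg2 and placeholder connection details:

    # Sketch: group pg_stat_activity to see which clients hold the connections.
    # Connection parameters and database name are placeholders.
    import psycopg2

    SQL = """
        SELECT usename, application_name, client_addr, state, count(*) AS conns
        FROM pg_stat_activity
        WHERE datname = %s
        GROUP BY usename, application_name, client_addr, state
        ORDER BY conns DESC
    """

    conn = psycopg2.connect("host=localhost dbname=airflow user=airflow password=...")
    with conn, conn.cursor() as cur:
        cur.execute(SQL, ("airflow",))
        for row in cur.fetchall():
            print(row)
    conn.close()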
Is anyone aware of where these connections come from? And, is there a way (and a reason) to change this?
Yes. Airflow will open a large number of connections: basically every process it creates will almost certainly open at least one connection. This is a "known" characteristic of Apache Airflow.
If you are using MySQL, this is not a big issue, as MySQL is good at handling multiple connections (it multiplexes incoming connections via threads). Postgres uses a process-per-connection approach, which is much more resource-hungry.
The recommended way to handle that (Postgres is the most stable backend for Airflow) is to use PGBouncer to proxy such connections to Postgres.
In our official Helm chart, PGBouncer is used by default when Postgres is used: https://airflow.apache.org/docs/helm-chart/stable/index.html
I highly recommend this approach.
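To illustrate what that change amounts to in practice (host names, port and credentials below are placeholders, not Airflow defaults): PgBouncer sits between Airflow and Postgres, and Airflow's metadata connection string points at the bouncer instead of Postgres.

    # Illustration: with PgBouncer (typically run with pool_mode=transaction)
    # in front of Postgres, every Airflow process still opens a "connection",
    # but it lands on the bouncer, which multiplexes a small number of real
    # server connections to Postgres. Values below are placeholders.

    # Direct: one Postgres backend per Airflow process.
    SQL_ALCHEMY_CONN_DIRECT = "postgresql+psycopg2://airflow:secret@postgres:5432/airflow"

    # Through PgBouncer (conventionally port 6432): many client connections,
    # few server connections.
    SQL_ALCHEMY_CONN_POOLED = "postgresql+psycopg2://airflow:secret@pgbouncer:6432/airflow"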
What does the MongoDB ActiveClientsWriting metric in mongostat refer to?
If I am performing multiple writes from an application on the same connection, does it get counted as a single active client?
In that case, does a connection correspond to a thread, i.e. a single funnel of writes?
Or does Mongo have internal worker threads that fork off parallel writes on a connection?
If so, what is the metric/configuration that shows how many threads are actively writing at a given time?
We are using Mongo 4.x.x.
It is a statistic from the lock manager: how many active client connections hold a write lock at the time of reporting.
https://docs.mongodb.com/manual/reference/command/serverStatus/#serverstatus.globalLock.activeClients reads:
globalLock.activeClients.total
The total number of internal client connections to the database including system threads as well as queued readers and writers. This metric will be higher than the total of activeClients.readers and activeClients.writers due to the inclusion of system threads.
globalLock.activeClients.readers
The number of the active client connections performing read operations.
globalLock.activeClients.writers
The number of active client connections performing write operations.
The metric itself is calculated in https://github.com/mongodb/mongo/blob/3a508dcd9755cc5012288068ce88afb9117ac8b8/src/mongo/db/stats/lock_server_status_section.cpp#L55
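If you want to read these counters from an application rather than mongostat, a minimal sketch assuming pymongo (the connection string is a placeholder):

    # Sketch: read globalLock.activeClients from serverStatus via pymongo.
    # The connection string is a placeholder.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    status = client.admin.command("serverStatus")

    active = status["globalLock"]["activeClients"]
    print("total  :", active["total"])    # includes system threads
    print("readers:", active["readers"])  # active client connections reading
    print("writers:", active["writers"])  # active client connections writing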
We use MongoDB with the mgo driver for Go. There are two app servers connecting to MongoDB, which runs alongside the apps (Go binaries). MongoDB runs as a replica set, and each server connects to the primary or a secondary depending on the replica set's current state.
We experienced a SocketException handling request, closing client connection: 9001 socket exception on one of the mongo servers, which caused the connections to MongoDB from our apps to die. After that, the replica set remained functional, but the connection on our second app server (on which the error did not happen) died as well.
In the Go logs this manifested as:
read tcp 10.10.0.5:37698->10.10.0.7:27017: i/o timeout
Why did this happen? How can this be prevented?
As I understand it, mgo connects to the whole replica set from a URL (it discovers the full topology from a single instance's URL), so why did the connection dying on one of the servers kill it on the second one as well?
Edit:
The full package path used is "gopkg.in/mgo.v2".
Unfortunately, I can't share the mongo log files here, but besides the socket exception the mongo logs don't contain anything useful. There is some indication of lock contention, with lock acquisition times that are quite high at times, but nothing beyond that.
MongoDB does some heavy indexing at times, but there weren't any unusual spikes recently, so nothing beyond normal.
First, the mgo driver you are using, gopkg.in/mgo.v2, developed by Gustavo Niemeyer (hosted at https://github.com/go-mgo/mgo), is not maintained anymore.
Instead, use the community-supported fork github.com/globalsign/mgo. This one continues to get patched and evolve.
Its changelog includes "Improved connection handling", which seems to relate directly to your issue.
The details can be read at https://github.com/globalsign/mgo/pull/5, which points to the original pull request https://github.com/go-mgo/mgo/pull/437:
If mongoServer fail to dial server, it will close all sockets that are alive, whether they're currently use or not.
There are two cons:
Inflight requests will be interrupt rudely.
All sockets closed at the same time, and likely to dial server at the same time. Any occasional fail in the massive dial requests (high concurrency scenario) will make all sockets closed again, and repeat...(It happened in our production environment)
So I think sockets currently in use should closed after idle.
Note that github.com/globalsign/mgo has a backward-compatible API: it basically just adds a few new features (besides the fixes and patches), which means you should be able to just change the import paths and everything should keep working without further changes.