We have a production system: an ASP.NET Web API (classic, not .NET Core) application published to Azure. Data storage is Azure SQL Database, and we use Entity Framework to access the data. The API has a medium load of 10-60 requests per second, and upper_90 latency is 100-200 ms, which is the target latency in our case. Some time ago we noticed that approximately every 20-30 minutes our service stalls and latency jumps to roughly 5-10 seconds. All requests become slow for about a minute and then the system recovers by itself. At the same time, no requests are dropped; they all just take longer to execute for that short period (usually about a minute).
We see the following picture in our HTTP request telemetry (Azure):
We can also see a correlation with our Azure SQL Database metrics, such as DTU (drop) and connections (increase):
We've analyzed the server and didn't see any correlation with host CPU/memory usage (we have just one host); it's stable at a 20-30% CPU level and 50% memory usage.
We also have an alternative source of telemetry that shows the same behavior. It measures API latency and database metrics such as active connection count and pooled connection count (ADO.NET connection pool):
What is interesting is that every system stall is accompanied by a rise in the number of pooled connections. And our tests show that the more connections are pooled, the longer you wait for a connection from that pool to execute your next database operation. We analyzed a few suggestions but were unable to prove or disprove any of them:
ADO.NET connection leak (all our DB access happens in a using statement with proper connection disposal/return to the pool; see the sketch after this list)
Socket/port exhaustion - we were unable to properly track telemetry for that metric
CPU/memory bottleneck - the charts show there is none
DTU (database units) bottleneck - the charts show there is none
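For reference, our data access follows roughly the pattern sketched below (the class, query and connection string handling are illustrative, not our actual code); wrapping the connection in a using block returns it to the ADO.NET pool even when a query throws:

using System.Data.SqlClient;
using System.Threading.Tasks;

public class OrderRepository
{
    private readonly string _connectionString;

    public OrderRepository(string connectionString)
    {
        _connectionString = connectionString;
    }

    public async Task<int> GetOrderCountAsync()
    {
        // The using blocks guarantee Dispose() runs even if the query throws,
        // which closes the command and returns the connection to the ADO.NET pool.
        using (var connection = new SqlConnection(_connectionString))
        using (var command = new SqlCommand("SELECT COUNT(*) FROM Orders", connection))
        {
            await connection.OpenAsync();                    // draws a connection from the pool
            return (int)await command.ExecuteScalarAsync();  // COUNT(*) comes back as int
        }                                                    // connection returned to the pool here
    }
}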
As of now we are trying to identify the possible culprit of this behavior. Unfortunately, we cannot identify the changes that led to it because of missing telemetry, so the only way to deal with the issue is to properly diagnose it. And, of course, we can only reproduce it in production, under constant load (even when the load is not high, e.g. 10 requests per second).
What are the possible causes for this behavior and what is the proper way to diagnose and troubleshoot it?
There can be several possible reasons:
The problem could be in your application code. Create a staging environment and re-run your test with profiler telemetry (e.g. using YourKit .NET Profiler) - this will allow you to detect the heaviest methods, largest objects, slowest DB queries, etc. Also run a load test on your API with JMeter.
I would also recommend trying the Kudu Process API to look at the list of currently running processes and to get more info about them, such as their CPU time.
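As a minimal sketch, assuming the /api/processes endpoint and basic auth with the site's deployment credentials (the app name and credentials below are placeholders):

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

class KuduProcessList
{
    static async Task Main()
    {
        // Placeholder Kudu (deployment) credentials and app name.
        var user = "$your-app";
        var password = "your-deployment-password";
        var token = Convert.ToBase64String(Encoding.ASCII.GetBytes(user + ":" + password));

        using (var client = new HttpClient())
        {
            client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Basic", token);

            // /api/processes lists the processes on the instance;
            // /api/processes/{id} returns details such as CPU time, threads and handles.
            var json = await client.GetStringAsync("https://your-app.scm.azurewebsites.net/api/processes");
            Console.WriteLine(json);
        }
    }
}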
The articles below describe how to monitor CPU usage in Azure App Service:
https://azure.microsoft.com/en-in/documentation/articles/web-sites-monitor/
https://azure.microsoft.com/en-in/documentation/articles/app-insights-web-monitor-performance/
We ended up separating a few web apps that were hosted on a single App Service Plan. Even though the metrics were not showing any CPU bottleneck for our app, other apps on the plan were causing CPU usage spikes and, as a result, connection pool queue growth with huge latency spikes.
When we checked the App Service Plan usage and compared it to the database plan usage, it became clear that the bottleneck was in the App Service Plan. It's still hard to explain why a CPU bottleneck causes uneven latency spikes, but we decided to move the most loaded web app to a separate plan and deal with it in isolation. After the separation the app behaves normally, with no CPU or latency spikes, and it looks very stable (same picture as between the spikes above):
We will continue to analyze the other apps and will eventually find the culprit, but at this point the mission-critical web app is isolated and very stable. The lesson here is to monitor not only the Web App's resource usage but also the hosting App Service Plan, which may have other apps consuming resources (CPU, memory).
I'm looking for more detailed guidance / other people's experience of using Npgsql in production with Pgbouncer.
Basically we have the following setup using GKE and Google Cloud SQL:
Right now I've got Npgsql configured as if pgbouncer weren't in place, using a local connection pool. I've added pgbouncer as a deployment in my GKE cluster because Cloud SQL has very low max connection limits, and to be able to scale my application horizontally inside Kubernetes I need to protect against overwhelming it.
My problem is one of reliability when one of the pgbouncer pods dies (due to a node failure or as I'm scaling up/down).
When that happens, (1) the existing open connections held by the client-side connection pools in the application pods don't immediately close, and (2) they then surface as exceptions in my application as it tries to execute commands. Not ideal!
As I see it (and looking at the advice at https://www.npgsql.org/doc/compatibility.html) I have three options.
Live with it, and handle retries of SQL commands within my application. Possible, but seems like a lot of effort and creates lots of possible bugs if I get it wrong.
Turn on keepalives and let Npgsql itself 'fail out' the bad connections relatively quickly when they break. I'm not even sure if this will work or if it will cause further problems.
Turn off client-side connection pooling entirely. This seems to be the official advice, but I am loath to do this for performance reasons: it seems very wasteful for Npgsql to have to open a connection to pgbouncer for each session, and it runs counter to all of my experience with other RDBMSes like SQL Server.
Am I on the right track with one of those options? Or am I missing something?
You are generally on the right track and your analysis seems accurate. Some comments:
Option 2 (turning on keepalives) will help remove idle connections in Npgsql's pool which have been broken. As you've written, your application will still see some failures (as some bad idle connections may not be removed in time). There is no particular reason to think this would cause further problems - it should be pretty safe to turn on.
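A minimal sketch of enabling this, assuming the standard Keepalive connection-string parameter (host, port and credentials below are placeholders):

using System.Threading.Tasks;
using Npgsql;

class KeepaliveCheck
{
    static async Task Main()
    {
        // Keepalive=30 makes Npgsql ping the server every 30 seconds on idle connections,
        // so connections broken by a dead pgbouncer pod get detected and evicted from the
        // pool instead of being handed back out to the application.
        var connString = "Host=pgbouncer;Port=6432;Username=app;Password=secret;" +
                         "Database=appdb;Keepalive=30";

        using (var conn = new NpgsqlConnection(connString))
        using (var cmd = new NpgsqlCommand("SELECT 1", conn))
        {
            await conn.OpenAsync();
            await cmd.ExecuteScalarAsync();
        }
    }
}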
Option 3 is indeed problematic for perf, as a TCP connection to pgbouncer would have to be established every single time a database connection is needed. It will also not provide a 100% fail-proof mechanism, since pgbouncer may still drop out while a connection is in use.
At the end of the day, you're asking about resiliency in the face of arbitrary network/server failure, which isn't an easy thing to achieve. The only 100% reliable way to deal with this is in your application, via a dedicated layer which retries operations when a transient exception occurs. You may want to look at Polly, and note that Npgsql helps out a bit by exposing an IsTransient flag on its exceptions which can be used as a trigger to retry (Entity Framework Core also includes a similar "retry strategy"). If you do go down this path, note that transactions are particularly difficult to handle correctly.
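As a rough sketch, a Polly-based retry layer keyed off that flag could look like this (the retry count, delays, connection string and query are placeholders, not recommendations):

using System;
using System.Threading.Tasks;
using Npgsql;
using Polly;

class TransientRetryExample
{
    static async Task Main()
    {
        // Retry up to 3 times, with a small increasing delay, but only when Npgsql
        // flags the failure as transient.
        var retryPolicy = Policy
            .Handle<NpgsqlException>(ex => ex.IsTransient)
            .WaitAndRetryAsync(3, attempt => TimeSpan.FromMilliseconds(200 * attempt));

        var count = await retryPolicy.ExecuteAsync(async () =>
        {
            // Placeholder connection string and query.
            using (var conn = new NpgsqlConnection("Host=pgbouncer;Username=app;Password=secret;Database=appdb"))
            using (var cmd = new NpgsqlCommand("SELECT count(*) FROM orders", conn))
            {
                await conn.OpenAsync();
                return (long)await cmd.ExecuteScalarAsync();   // count(*) comes back as bigint
            }
        });

        Console.WriteLine(count);
    }
}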
Similar question: Lot of SHOW TRANSACTION ISOLATION LEVEL queries in postgres
I did look at the questions above and did lots of research, but I'm still confused.
My service's database is PostgreSQL on AWS RDS, and I use C3P0 for connection pooling. However, I found that almost all of the connections in pg_stat_activity show either SHOW TRANSACTION ISOLATION LEVEL or SELECT 1 as their query, and all of these connections are idle.
My concern is that even though the service TPS stays the same, say 10 TPS, the number of connections in the database keeps growing.
Even though all these connections are idle, it seems like every time a client hits the service database, a new connection is created and stays around forever.
Also, even when the TPS trends down, the number of connections still keeps growing, just a little more slowly.
Can anybody share some insight into this? Thanks in advance!
I have one standard web dyno and one worker dyno connected to the same Standard 0 database.
The worker dyno runs background jobs that insert a lot of data into the database. I feel like I have noticed slower response times while browsing my site when the workers are running.
I'm always well below the 120 connection limit. Am I imagining this or does it have an impact on read time? If so, how do people mitigate it?
From the database's perspective, there is no difference between connections originating from the web dynos and worker dynos; they're both just clients of the database.
If your worker dynos are doing heavy inserts all the time, then they could certainly impact query performance as this places a lot of load on the database; how this impacts your web response times is specific to your particular application.
I would recommend starting by looking at Heroku Postgres tools for database performance tuning.
https://devcenter.heroku.com/articles/heroku-postgres-database-tuning
Without knowing more about your application, I would say you could start by looking at the slowest queries related to your web requests and comparing their execution times with and without the workers enabled.
My application uses PostgreSQL 9.0 and is composed of one or more stations that interact with a global database: it is like a common client-server application, but to avoid any additional hardware, every station includes both client and server: a main station is promoted to also act as the server, and all the others act as clients to it. This solution lets me scale: a user may initially need a single station but can decide to expand to more in the future without needing a separate server in the initial phase.
I'm trying to avoid the situation where, if the main station goes down, all the others stop working; to do that, the best solution seems to be to continuously replicate the main database to an unused database on one or more of the other stations.
Searching around, I've found that pgpool can be used for my needs, but from all the examples and tutorials it seems that the point of failure simply moves from the main database to the server that runs pgpool.
I read something about multiple pgpool instances and the heartbeat tool, but it isn't clear how to set that up.
Considering my architecture, where there are no separate, specialized servers, can someone give me some hints? In a failover situation it seems that pgpool does everything automatically; can I assume that failover can be handled by a standard user without the intervention of an administrator?
For this kind of application I really like Amazon's Dynamo design. The document at the link is quite big, but it is worth reading. In fact, there are applications that already implement this approach:
mongoDB
Cassandra
Project Voldemort
There may be others, but I'm not aware of them. Cassandra started at Facebook, and Voldemort is the one used by LinkedIn. By making things distributed and adding redundancy to your data distribution, you step away from traditional master-slave replication approaches.
If you'd like to stay with PostgreSQL, it shouldn't be a big deal to implement such an approach. You will need to implement an extra layer (a proxy) that decides, based on pre-configured options, how to retrieve/save the data.
The proxying layer can be implemented in:
application (requires lots of work IMHO);
database;
as a middleware.
You can use PL/Proxy at the middleware layer; the project originated at Skype. It is deeply integrated into PostgreSQL, so I'd say it is a combination of options 2 and 3. PL/Proxy will require you to use functions for all kinds of queries against the database.
If you hit performance issues, PgBouncer can be used.
Last note: whichever way you decide to go, a certain amount of development will be required.
EDIT:
It all depends on what you call “failure” and what you consider system being in an interrupted state.
Let's look at the pgpool features.
Connection Pooling: PostgreSQL uses a single process (fork) per session. Obviously, if you have a very busy site, you'll hit the OS limit. To overcome this, connection poolers are used. They also allow you to use your resources evenly, so it's generally a good idea to have a pooler in front of your database. In case of a pgpool outage you'll have a big number of clients unable to reach your database; if you point them directly at the database, bypassing the pooler, you'll face performance issues.
Replication: All your queries are auto-replicated to the slave instances. This applies to DML and DDL queries. In case of a pgpool outage your replication stops and the slaves will not be able to catch up with the master, as there's no change tracking done outside pgpool (as far as I know).
Load Balancing: Your read-only queries are spread across several instances, achieving nice response times and allowing you to put more load on the system. In case of a pgpool outage your queries will suddenly run much slower, assuming the system can handle such a load at all, and that is only if the master database takes over for the failed pgpool.
Limiting Exceeding Connections: pgpool queues connections when they cannot be processed immediately. In case of a pgpool outage all such connections will be aborted, which might break the DB/application protocol, i.e. if the application was designed to never get connection aborts.
Parallel Query: A single query is executed on several nodes to reduce response time. In case of a pgpool outage such queries will not be possible, resulting in longer processing times.
If you're fine with facing such conditions and you don't treat them as a failure, then pgpool can serve you well. But if 5 minutes of outage would cost your company several thousand dollars, then you should look for a more solid solution.
The higher the cost of the outage, the more finely tuned the failover system should be.
Typically, it is not just a single tool that is used to achieve failover automation.
In each failure scenario you will have to deal with:
DNS, unless you want to reconfigure all clients;
re-initializing backups and failover procedures;
making sure the old master will not try to fight for its role if it comes back (STONITH);
and, in my experience, it is people from the DBA, SysAdmin, Architecture and Operations departments who decide on the proper strategies.
Finally, in my view, pgpool is a good tool, and I do use it. But it is not designed as a complete failover solution - not without extra thinking, measures taken, and scripts written. That's why I've provided links to the distributed databases: they provide a much higher level of availability.
And PostgreSQL can be made distributed with a little effort thanks to its great extensibility.
First of all, I'd recommend checking out pgBouncer rather than pgpool. Next, what level of scaling are you attempting to reach? You might just choose to run your connection pooler on all your client systems (bouncer is light enough for this to work).
That said, vyegorov's answer is probably the direction you should really be looking at in this day and age. Are you sure you really need a database?
EDIT
So, the rather obvious answer is that pgPool creates a single point of failure if you only have one box running it. The obvious solution is to run multiple poolers across multiple boxes. You then need to engineer your application code to handle database disconnections. This is not as easy as it sounds, but basically you need to use two-phase commit for non-idempotent changes. So, to the greatest extent possible, you should make your changes idempotent.
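As an illustration of the idempotency point (sketched in C# with Npgsql only because the question doesn't specify a client stack; the table, columns and error handling are made up), one common trick is to have the client generate the row's key up front so that a retried insert after a dropped connection is harmless:

using System;
using Npgsql;

class IdempotentInsert
{
    // The client generates the primary key up front, so re-running the same insert
    // after a dropped connection either succeeds or hits a unique-constraint violation
    // that can safely be treated as "already applied".
    static void Apply(string connString, Guid paymentId, decimal amount)
    {
        using (var conn = new NpgsqlConnection(connString))
        using (var cmd = new NpgsqlCommand(
            "INSERT INTO payments (id, amount) VALUES (@id, @amount)", conn))
        {
            cmd.Parameters.AddWithValue("id", paymentId);
            cmd.Parameters.AddWithValue("amount", amount);
            conn.Open();
            try
            {
                cmd.ExecuteNonQuery();
            }
            catch (PostgresException e) when (e.SqlState == "23505") // unique_violation
            {
                // An earlier attempt already inserted this row: safe to ignore.
            }
        }
    }
}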
Based on your comments, I'd guess that maybe you have limited experience dealing with database replication? pgPool does statement-based replication. There are tradeoffs here. The benefit is that it's very easy to set up. The downside is that there is no guarantee that the data on the replicated databases will be identical. It is also (I believe, but haven't checked lately) not compatible with 2PC.
My prior comment asking if you really need a database was driven by my perception that you have designed a system without going into much detail about this part of it. I have about two decades of experience working on "this part" of similar systems. I expect you will find that there are no out-of-the-box solutions and that the issues involved get very complicated. In other words, I'm suggesting you reconsider your design.
Try reading this blog (with lots of information about PostgreSQL and PgPool-II):
https://www.itenlight.com/blog/2016/05/21/PostgreSQL+HA+with+pgpool-II+-+Part+5
Search for "WATCHDOG" on that same blog. With that you can configure a PgPool-II cluster. Two machines on the same subnet are required, though, and a virtual IP on the same subnet.
Hope that this is useful for anyone trying the same thing (even if this answer is a lot late).
PGPool certainly becomes a single point of failure, but it is a much smaller one than a Postgres instance.
Though I have not attempted it yet, it should be possible to have two machines with PGPool installed, but only running on one. You can then use Linux-HA to restart PGPool on the standby host if the primary becomes unavailable, and to optionally fail it back again when the primary comes back. You can at the same time use Linux-HA to move a single virtual IP over as well, so that your clients can connect to a single IP for their Postgres services.
Death of the postgres server will make PGPool send queries to the backup Postgres (promoting it to master if necessary).
Death of the PGPool server will cause a brief outage (configurable, but likely in the region of <1min) until PGPool starts up on the standby, the IP address is claimed, and a gratuitous ARP sent out. Of course, the client will have to be intelligent enough to reconnect without dying.