Error using the connection to database when RDS scales out - postgresql

We have a .NET API hosted in ECS that queries data from a Serverless v1 cluster using Entity Framework. Under normal load this service performs very well, but when there's a large spike in traffic that requires the RDS cluster to scale out to more ACUs, we see a lot of connection errors in our API.
An error occurred using the connection to database '\"ourdatabasename\"' on server '\"tcp://ourcluster.region.rds.amazonaws.com:5432\"'.
The high-level overview of the infrastructure looks like this:
CloudFront >> Load Balancer >> ECS Fargate >> RDS Aurora PostgreSQL Serverless v1
Stack information:
.Net 6 API compiled for Linux
Entity Framework Core 6.x
Npgsql.EntityFrameworkCore.PostgreSQL 6.x
PostgreSQL 10.18
We have opened AWS support cases about this issue over the past year, but those basically always resulted in the answer that this is an implementation issue and not an infrastructure issue.
We can easily reproduce the issue by running a k6 stress test on our API (bypassing the CloudFront caching layer of course) to generate a spike high enough to trigger scaling of the RDS cluster.
For the past year we have worked around this issue by configuring RDS at a capacity at which it basically never needs to scale out. This is of course wasting money, and not the point of serverless at all, so we would like to find the underlying root cause and solve that.
Some things we have tried already:
We have experimented with Serverless v2, which is supposed to scale in a completely different fashion, since it's the same VM consuming more resources from the host machine. But our preliminary conclusion is that this was even worse. We do not yet understand why, but it appears to trigger the same effect, only faster and harder, since v2 scales faster and further. With v1 we get into trouble around 400 requests per second; with v2 it was at 150 rps.
EnableRetryOnFailure seemed to help a tiny bit, but not a lot. We have left it at the default configuration as implemented by Npgsql for now.
We have experimented with the Maximum Pool Size connection string parameter. At 300 things appear to be a bit better, but it does not solve the issue (a sketch showing how we set both this and the retry option follows this list).
Changing the scaling behaviour of ECS/the ALB, or even just pre-scaling them to handle peak load, did not change anything.
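For reference, the retry and pool settings mentioned above boil down to something like this. It's a simplified sketch, not our exact code: the AppDbContext name and the credentials are placeholders, while the host and database names are the ones from the error message.

    using System;
    using Microsoft.EntityFrameworkCore;

    public class AppDbContext : DbContext
    {
        protected override void OnConfiguring(DbContextOptionsBuilder options)
        {
            // Npgsql connection string; "Maximum Pool Size" caps the client-side
            // pool per ECS task (we experimented with 300). Username/password are
            // placeholders.
            const string connectionString =
                "Host=ourcluster.region.rds.amazonaws.com;Port=5432;" +
                "Database=ourdatabasename;Username=app;Password=secret;" +
                "Maximum Pool Size=300";

            options.UseNpgsql(connectionString, npgsql =>
                // Transient-failure retries via EF Core's retrying execution
                // strategy. Parameterless = Npgsql's defaults; it can also be
                // tuned, e.g. EnableRetryOnFailure(6, TimeSpan.FromSeconds(30), null).
                npgsql.EnableRetryOnFailure());
        }
    }

EnableRetryOnFailure only wraps operations in a retrying execution strategy, which matches what we observe: it softens, but does not remove, the errors during a scale event.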
We have not tried:
RDS Proxy: it's supposed to solve all your connection pooling issues. But we're not sure it's even a pooling issue, we're not keen on trusting yet another black-box service to solve the issues our first black-box service (Aurora Serverless) has, and it's not exactly cheap. If all of SO now convinces us this is the holy grail, then surely we'll try it out.
Data API for RDS: you can't have connection management issues if you're not making connections, right? But it's a huge investment to rewrite all the EF code as Data API requests, and I'm not sure what it says about the service that it's still not available for Serverless v2. So, not for now I think.
The first purpose of this question here on SO is to try to find someone who can help us understand what is even going on: to help us understand the error and where it comes from. We understand that you cannot expect ECS + RDS to just magically handle all the load you throw at it. But if we do not fully understand how it breaks, we cannot come up with potential failover mechanisms or ways to make the system fail more gracefully.
If someone knows the magic setting but not the why that's also great of course :) We can then maybe figure out the why ourselves and share that back with the community ;)
Feel free to ask more questions where needed.

Related

Google Cloud SQL Postgres - randomly slow queries from Google Compute / Kubernetes

I've been testing Google Cloud SQL with Postgresql, but I have random queries taking ~3s instead of a few ms.
Troubleshooting I did:
The queries themselves aren't the problem; rerunning the same query works.
Indexes are properly set. The database is also very, very small; it shouldn't do this even if there weren't any indexes.
The Kubernetes container is connecting to the database through SQL Proxy (I followed this https://cloud.google.com/sql/docs/postgres/connect-kubernetes-engine). It is not the problem though, as I tried connecting directly to the database and had the same issue.
I configured net.ipv4.tcp_keepalive_time to 60 to make sure the connections weren't dropping.
I also have a pool of connections that are never disconnected, to make sure it wasn't coming from that.
When I run queries directly through my local Postgresql client, I never have the problem.
I don't have this issue when developing locally either and connecting to my local database.
What I'm getting at is: I feel there's some weird connection/link issue between my Google Compute instances and my Google SQL instance that I can't seem to figure out.
Any idea?
Edit:
I also noticed these logs in my Cloud SQL instance every 30s:
ERROR: recovery is not in progress
HINT: Recovery control functions can only be executed during recovery.
STATEMENT: SELECT pg_is_xlog_replay_paused(), current_timestamp
That's an interesting problem you are facing. My knowledge of Kubernetes isn't that great, but I do have a general understanding, so let's see if I can provide some suggestions.
To start with, the API that you linked to in your question does mention that it is still in beta, so I do believe there would still be issues to patch when it comes to maximizing speed and performance.
Secondly, from what I understand, Kubernetes is a great tool for handling stateless workloads. Thus, handling data where state is required for queries would be a slow operation. This article (although not entirely related) does explain some of the pitfalls of Kubernetes (not all the questions are relevant).
Thirdly, could you explain your use case a little bit? Do you really need to use Kubernetes, or would another tool like a powerful Compute Engine instance or a Dataflow job resolve the issue? Are you making your database queries through a programming language or an application call?
Thanks, and do let me know!

"Hibernating" service fabric application

I have a fairly small Service Fabric application that I'm building, and ever since I converted to Service Fabric I've been annoyed by the slow startup time, and not only after a release but also after 10-15 minutes of inactivity.
I have added a project whose sole purpose is to go to each service and make a small DB request every 10 seconds, thinking that will keep the application and EF running. This helped me stop getting timeouts, and now the first requests are in the 5-15s range. After some warming up the requests are usually in the 300ms range, so they are quite light requests and there isn't much communication between the services (4 services in total).
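Boiled down, that pinger project is essentially just a loop like the one below (simplified; the endpoint URLs are placeholders for whichever cheap, DB-touching endpoint each service exposes):

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    public static class ServicePinger
    {
        private static readonly HttpClient Http = new HttpClient();

        // Placeholder addresses; in reality these point at the four services.
        private static readonly string[] Endpoints =
        {
            "http://localhost:8080/api/ping",
            "http://localhost:8081/api/ping",
            "http://localhost:8082/api/ping",
            "http://localhost:8083/api/ping",
        };

        public static async Task Main()
        {
            while (true)
            {
                foreach (var url in Endpoints)
                {
                    // Each ping makes the target service run a small DB query,
                    // which keeps the service and its EF context warm.
                    try { await Http.GetAsync(url); }
                    catch { /* a failed ping must not kill the pinger */ }
                }
                await Task.Delay(TimeSpan.FromSeconds(10));
            }
        }
    }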
After a lot of searching I found a profiler that seems to work, as most don't, including the one in Visual Studio. Unfortunately it didn't really say that much, except that it waits on threads a lot and that the waiting doesn't seem to be in my code. All my external requests use async/await. Also, when following a request it kind of seemed like information was missing...
At first I thought that the slowness might come from EF generating the search query, so I migrated that part to use Dapper instead (the full request still uses some EF), but that didn't really change anything.
The application has all the latest Service Fabric, .NET Core, EF Core, and Application Insights packages. All services except the one validating tokens are stateless. And it is of course built in release mode.
At this point I'm kind of lost, as I cannot find the reason it's so slow. In the old days this was usually because IIS shut the application down or recycled it, but now that IIS isn't there, what can it be?
A similar issue happened to us. We use a DI container, and until the first call to our service none of the dependencies are resolved, and it takes time to create those instances, for example a singleton of a class. Another one was the EF DB context. To overcome that we have a process to "warm" the services first; a rough sketch of that idea follows below.
Hope that helps
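To give an idea, the warm-up is roughly along these lines (a simplified sketch, not our exact code; MyDbContext is a placeholder for the real context type, and the class is registered with AddHostedService<WarmUpService>()):

    using System;
    using System.Threading;
    using System.Threading.Tasks;
    using Microsoft.EntityFrameworkCore;
    using Microsoft.Extensions.DependencyInjection;
    using Microsoft.Extensions.Hosting;

    public class MyDbContext : DbContext
    {
        public MyDbContext(DbContextOptions<MyDbContext> options) : base(options) { }
    }

    public class WarmUpService : IHostedService
    {
        private readonly IServiceProvider _services;

        public WarmUpService(IServiceProvider services) => _services = services;

        public async Task StartAsync(CancellationToken cancellationToken)
        {
            using var scope = _services.CreateScope();

            // Resolving the context forces the DI registration and the EF model
            // to be built; the query opens (and pools) a database connection.
            var db = scope.ServiceProvider.GetRequiredService<MyDbContext>();
            await db.Database.ExecuteSqlRawAsync("SELECT 1", cancellationToken);

            // Any other lazily created singletons can be resolved here as well.
        }

        public Task StopAsync(CancellationToken cancellationToken) => Task.CompletedTask;
    }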
This might be a shot in the dark: Are your services communicating using the Service Fabric remoting options or using HTTP? In the case of HTTP, might the hibernation and warmup time be caused by HttpSys/Kestrel?
Regarding your slow responses (300ms), that does seem a bit odd; we have multiple stateless services (using HTTP and Kestrel) with EF in the back, and have sub-50ms response times.

Using MongoDB in AWS Lambda with the mLab API

Usually you can't use MongoDB in Lambdas, because Lambda functions are stateless and operations on MongoDB require a connection, so you suffer a large performance hit setting up a DB connection each time a function is run.
A solution I have thought of is to use mLab's REST API (http://docs.mlab.com/data-api/); that way I don't need to open a new connection each time my Lambda function is called.
One problem I can see is that mLab's REST service could become a bottleneck, plus I'm relying on it never going down.
Thoughts?
I have a couple of alternative suggestions for you on this, only because I've never used mLab.
Set up http://restheart.org/ and have it sit between your Lambda microservices and your MongoDB instance. I've used this with pretty decent success on another project. It does come with the downside of now having an EC2 instance to maintain. However, setting up RESTHeart is pretty easy, and the crew maintaining it and giving support is pretty great.
You can set up a Lambda function that pays the cost of connecting and keeping a connection open. All of your other microservices can then call that Lambda function for the data they need. If it is hit frequently enough, you will not have to pay the cost of the DB connection as often. However, that first connection can be pretty brutal, so you may need something keeping it warm. You will also have the potential issue of connections never getting properly closed and eventually running out. A rough sketch of this pattern follows at the end of this answer.
Those two options aside, if mLab is hosting your DB you have already put a lot of faith in their ability to keep a system alive. If you can't trust them to keep an API up, that lack of faith should also extend to their ability to keep your DB alive.
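To make option 2 concrete, the idea is roughly the following (an untested sketch; the connection string, database and collection names are placeholders). The MongoClient lives in a static field, so it is created once per Lambda container and reused across warm invocations instead of reconnecting on every call:

    using System.Threading.Tasks;
    using Amazon.Lambda.Core;
    using MongoDB.Bson;
    using MongoDB.Driver;

    public class MongoQueryFunction
    {
        // Created once per container, reused by every warm invocation.
        private static readonly MongoClient Client =
            new MongoClient("mongodb://user:password@ds012345.mlab.com:12345/mydb");

        // Other microservices invoke this function with the id they need.
        public async Task<string> Handler(string id, ILambdaContext context)
        {
            var items = Client.GetDatabase("mydb").GetCollection<BsonDocument>("items");

            var doc = await items
                .Find(Builders<BsonDocument>.Filter.Eq("_id", id))
                .FirstOrDefaultAsync();

            return doc?.ToJson();
        }
    }

The first invocation on a cold container still pays the connection cost, which is the warm-up concern mentioned above.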

Google Cloud SQL very slow from time to time

It's been almost 3 months since I switched my platform to Google Cloud (Compute Engine + Cloud SQL + Cloud Storage).
I am very happy with it, but from time to time I notice big latency on the Cloud SQL server. My VMs from Compute Engine and my Cloud SQL instance are all in the same location (us-1) datacenter.
Since my Java backend makes a lot of SQL queries to generate a server response, the response times may vary from 250-300ms (normal) up to 2s!
In the console, I notice absolutely nothing: no CPU peaks, no read/write peaks, no backup running, nothing. No alert. Last time it happened, it lasted for a few days and then the response times suddenly became better than ever.
I am pretty sure Google works on the infrastructure behind the scenes... but there is no way to confirm that.
So here's my questions:
Has anybody else ever had noticed the same kind of problem?
It is really annoying for me because my web pages get very slow and I have absolutely no control over it. Plus, I lose a lot of time because I generally never suspect a hardware problem / maintenance first, but instead something that we introduced in our app. Is this normal, or do I have a problem on my SQL instance?
Is there anywhere I can get visibility into what Google is doing on the hardware? I know there are maintenance alerts, but for my zone it always seems empty when it happens.
The only option I have for now is to wait, and that is really not acceptable.
I suspect that Google does some sort of IO throttling and their algorithm is not very sophisticated. We have a build server which slows down to a crawl if we do more than two builds within an hour. A build that normally takes 15 minutes will run for more than an hour, and we usually terminate it and re-run it manually later. This question describes a similar problem, and the recommended solution is to use larger volumes, as they come with a larger IO allowance.

pgpool-II for Postgres - Is it what I need?

I just stumbled upon pgpool-II in my search for a way to cluster my Postgres DB (just getting ready to deploy a web app in a couple of months). I still have the shakes from excitement, but I'm nervous, as each time I find something this excellent I am soon let down. Do you have any experience with pgpool-II, and will it help me run my database in multiple VMs, and later on multiple physical servers altogether? Is it all I need for backups, load balancing, and providing higher availability for my DB server!?
Also, is it easy to use the parallel query function (for instance, in Django or through Python's psycopg2)? This would be most excellent for reporting and aggregation!
One last thing: it seems to sit between Postgres and psycopg2. Is this a correct understanding of it, so I can use psycopg2 the same as normal, without regard for pgpool-II?
pgpool-II works fine for what it claims to do, and it fits between your application and the database the way you expect it to; just point psycopg2 at it instead of directly at the database and off you go.
The main thing you have to note is that while it supports many different types of features (replication, load balancing, parallel query), you can't use them all at once. It sounds like you may be under the impression that you can, and it doesn't work that way. The documentation is not all that clear on this subject (the English version at least; I can't speak for the original Japanese one).
For example, if you run pgpool-II in its "Master/Slave" mode, so that it supports load balancing for scaling reads, you have to use another program to actually do the replication between those nodes. Slony was the supported replication solution to put underneath it in earlier PostgreSQL versions; as of pgpool-II 3.0 and PostgreSQL 9.0 you can also use the soon-to-be-released Streaming Replication/Hot Standby features of that new version.
pgpool-II is a useful component and you can use it in a lot of interesting ways, but I doubt it will be "all you need" for every requirement you hope to achieve with it.