Google Cloud SQL Postgres - randomly slow queries from Google Compute / Kubernetes

I've been testing Google Cloud SQL with PostgreSQL, but I have random queries taking ~3s instead of a few ms.
Troubleshooting I did:
The queries themselves aren't the problem; rerunning the same query works fine.
Indexes are properly set. The database is also very small; it shouldn't behave like this even if there were no indexes at all.
The Kubernetes container connects to the database through the SQL Proxy (I followed https://cloud.google.com/sql/docs/postgres/connect-kubernetes-engine). The proxy is not the problem, though, as I tried connecting directly to the database and hit the same issue.
I configured net.ipv4.tcp_keepalive_time to 60 to make sure the connections weren't being dropped.
I also keep a pool of connections that are never closed, to rule that out as well (a rough sketch of this setup follows this list).
When I run queries directly through my local PostgreSQL client, I never have the problem.
I don't have this issue when developing locally and connecting to my local database either.
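For illustration, the keepalive and pool setup looks roughly like this in Python/psycopg2 (the host, credentials, and pool sizes here are placeholders, not our real values):

```python
# Rough sketch of the client-side setup described above: libpq TCP
# keepalives on every connection, plus a pool whose connections are
# handed back rather than closed. Host/credentials are placeholders.
from psycopg2 import pool

conn_pool = pool.ThreadedConnectionPool(
    minconn=2,
    maxconn=10,
    host="10.0.0.5",        # placeholder: Cloud SQL IP or the SQL Proxy address
    dbname="mydb",
    user="myuser",
    password="secret",
    keepalives=1,           # libpq equivalents of net.ipv4.tcp_keepalive_* tuning
    keepalives_idle=60,
    keepalives_interval=10,
    keepalives_count=3,
)

conn = conn_pool.getconn()
try:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
        print(cur.fetchone())
finally:
    conn_pool.putconn(conn)  # return to the pool; the connection stays open
```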
What I'm getting at is: there seems to be some weird connection/link issue between my Google Compute instances and my Cloud SQL instance that I can't figure out.
Any idea?
Edit:
I also noticed these logs in my Cloud SQL instance every 30s:
ERROR: recovery is not in progress
HINT: Recovery control functions can only be executed during recovery.
STATEMENT: SELECT pg_is_xlog_replay_paused(), current_timestamp

That's an interesting problem you are facing. My knowledge of Kubernetes isn't that great, but I do have a general understanding, so let's see if I can provide some suggestions.
To start with, the documentation you linked to in your question does mention that the feature is still in beta, so I do believe there could still be performance issues left to patch.
Secondly, from what I understand, Kubernetes is a great tool for handling stateless workloads, so handling data where state is required for queries can be a slow operation. This article (although not entirely related) does explain some of the pitfalls of Kubernetes (not all of the questions in it are relevant).
Thirdly, could you explain your use case a little bit? Do you really need to use Kubernetes, or would another tool like a powerful Compute Engine instance or a Dataflow job resolve the issue? Are you making your database queries through a programming language or an application call?
Thanks, and do let me know!

Related

Error using the connection to database when RDS scales out

We have a .NET API hosted in ECS that queries data from a Serverless v1 cluster using Entity Framework. Under normal load this service performs very well, but when there's a large spike in traffic that requires the RDS cluster to scale out to more ACUs, we see a lot of connection errors in our API.
An error occurred using the connection to database '"ourdatabasename"' on server '"tcp://ourcluster.region.rds.amazonaws.com:5432"'.
The high-level overview of the infrastructure looks like this:
CloudFront >> Load Balancer >> ECS Fargate >> RDS Aurora PostgreSQL Serverless v1
Stack information:
.NET 6 API compiled for Linux
Entity Framework Core 6.x
Npgsql.EntityFrameworkCore.PostgreSQL 6.x
PostgreSQL 10.18
We did open AWS support cases about this issue in the past year, but those basically always resulted in the answer that this is an implementation issue and not an infrastructure issue.
We can easily reproduce the issue by running a k6 stress test on our API (bypassing the CloudFront caching layer, of course) to generate a spike high enough to trigger scaling of the RDS cluster.
For the past year we have worked around this issue by configuring RDS at a capacity at which it basically never needs to scale out. This is of course wasting money, and not the purpose of serverless at all, so we would like to find the underlying root cause and solve it.
Some things we have tried already:
We have experimented with Serverless v2, which should scale in a completely different fashion, as it's just the same VM consuming more resources from the hosting machine. But our preliminary conclusion is that this was even worse. We do not yet understand why, but it appears to trigger the same effect, only faster and harder, since v2 scales a lot faster and further. With v1 we get into trouble around 400 requests per second; with v2 it was at 150 rps.
EnableRetryOnFailure seemed to help a tiny bit, but not a lot. We have left it at the default configuration as implemented by Npgsql for now (the retry pattern we mean is sketched after this list).
We have experimented with the Maximum Pool Size connection string parameter. At 300 it appears to be a bit better, but it does not solve the issue.
Changing the scaling behaviour of ECS/the ALB, or even just pre-scaling to handle peak load, did not change anything.
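For clarity, the retry behaviour we mean is essentially the pattern below, sketched in Python/psycopg2 only because it's compact (our actual stack is .NET, where EnableRetryOnFailure wraps the same idea with Npgsql's own transient-error detection and backoff; the DSN, attempt count, and delays are placeholders):

```python
# Sketch of retry-on-transient-failure, the idea behind EF's
# EnableRetryOnFailure. Attempt count, delays, and the DSN are
# placeholders for illustration only.
import time
import psycopg2

def query_with_retry(dsn, sql, attempts=5, base_delay=0.5):
    last_error = None
    for attempt in range(attempts):
        conn = None
        try:
            conn = psycopg2.connect(dsn)
            with conn.cursor() as cur:
                cur.execute(sql)
                return cur.fetchall()
        except psycopg2.OperationalError as e:
            # Connection-level failures (e.g. dropped during a scale event)
            # are retried with exponential backoff; other errors propagate.
            last_error = e
            time.sleep(base_delay * (2 ** attempt))
        finally:
            if conn is not None:
                conn.close()
    raise last_error

# usage: rows = query_with_retry("host=db.example dbname=app user=app", "SELECT 1")
```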
We have not tried:
RDS Proxy, which is supposed to solve all your connection pooling issues. But we're not sure it's even a pooling issue. We're not keen on trusting yet another black-box service to solve the issues our first black-box service (Aurora Serverless) has. And it's not exactly cheap. If all of SO now convinces us this is the holy grail, then surely we'll try it out.
The Data API for RDS: you can't have connection management issues if you're not making connections, right? It's a huge investment to rewrite all our EF code as Data API requests, and I'm not sure what it says about the service that it's still not available for Serverless v2. So, not for now, I think.
The first purpose of this question here on SO is to find someone who can help us understand what is even going on, i.e. help us understand the error and where it comes from. We understand that you cannot expect ECS+RDS to just magically handle any load you throw at it. But if we do not fully understand how it breaks, we cannot come up with potential failover mechanisms or ways to make the system fail more gracefully.
If someone knows the magic setting but not the why, that's also great of course :) We can then maybe figure out the why ourselves and share it back with the community ;)
Feel free to ask more questions where needed.

Cloud SQL MySQL - Stuck in failover operation in progress

My Cloud SQL MySQL 5.7.37 highly available instance is stuck in a "Failover operation in progress. This may take a few minutes. While this operation is running, you may continue to view information about the instance" state. It is a fairly small database, it has been stuck like this for 5 hours, and failover is not available, so no DB queries can be executed; hence our system is currently down.
No commands can be executed on the DB since it is in an updating process; the error log is empty, and the operations log contains only this update and successful backups.
Does anyone have any suggestions? I am not paying for Google Support, so I can't get help directly from them (which I think is terrible, since this is a fully managed service).
Best,
Carl-Fredrik

Google Cloud SQL Migration Job stuck on Running

I've got a database on Google Cloud SQL that is used by our application running on Kubernetes in GKE.
The MySQL instance is running 5.6, and I need to update it to 5.7, so I tried using the new migration jobs.
I've set up the connection profile and all the required permissions for the source DB, then followed the instructions to set up a continuous migration.
The job says it's running, migrating the ~450GB database. After about a day, it's still running, the storage used seems to have stopped growing, and the replication delay is at 0. The source database is not currently in use (that's why I'm using it to try this out before doing the same with a more important DB).
According to this, if the dump phase is done, I should be able to promote the instance, but the promote button remains greyed out, and there's no way to check the running state (it only says "running", and I don't see any way to check whether it's dumping, in the CDC phase, or anything else).
The documentation seems a bit lacking, and I couldn't find anything by googling around. Has anyone been using this?
In short, my questions are:
Why can't I promote the instance?
and how can I check in what phase is the migration?
Here's a screencap of my job:
link because SO doesn't let me embed images yet
Thanks.
P.S.: the tag that the documentation says should be used on Stack Overflow is google-cloud-database-migration-service, which is too long and Stack Overflow doesn't allow, so I used google-cloud-sql instead :/
I am seeing an issue like this, but possibly more frustrating: after a week on a 2TB database, storage resets to near zero and the full dump restarts, without any errors or any indication of what happened.

Using MongoDB in AWS Lambda with the mLab API

Usually you can't use MongoDB in Lambdas because Lambda functions are stateless and operations on MongoDB require a connection, so you suffer a large performance hit setting up a DB connection each time a function runs.
A solution I have thought of is to use mLab's REST API (http://docs.mlab.com/data-api/); that way I don't need to open a new connection each time my Lambda function is called.
One problem I can see is that mLab's REST service could become a bottleneck, plus I'm relying on it never going down.
Thoughts?
I have a couple of alternative suggestions for you on this, only because I've never used mLab.
Set up http://restheart.org/ and have it sit between your Lambda microservices and your MongoDB instance. I've used this with pretty decent success on another project. It does come with the downside of now having an EC2 instance to maintain. However, setting up RESTHeart is pretty easy, and the crew maintaining it and giving support is pretty great.
You can set up a Lambda function that pays the cost of connecting and keeping a connection open. All of your other microservices can then call that Lambda function for the data they need. If it is hit frequently, you will not have to pay the cost of the DB connection as often. However, that first connection can be pretty brutal, so you may need something keeping it warm. You will also have the potential issue of connections never getting properly closed and eventually running out.
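The connection-reuse half of that second option looks roughly like this (the URI, database, and collection names are placeholders): the client is created outside the handler, so only cold starts pay the connection cost.

```python
# Rough sketch of keeping a MongoDB connection open across warm Lambda
# invocations: the client lives at module level, i.e. once per container.
# The URI and collection names are placeholders.
import os
from pymongo import MongoClient

client = MongoClient(os.environ["MONGODB_URI"])  # created once per container
collection = client["mydb"]["mycollection"]

def handler(event, context):
    # Warm invocations reuse the already-open connection; only a cold
    # start pays the connection setup cost described above.
    doc = collection.find_one({"_id": event["id"]})
    return {"found": doc is not None}
```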
Those two options aside, if mLab is hosting your DB, you have already put a lot of faith in their ability to keep a system alive. If you doubt they can keep an API up, that lack of faith should also extend to their ability to keep your DB alive.

pgpool-II for Postgres - Is it what I need?

I just stumbled upon pgpool-II in my search for a way to cluster my Postgres DB (I'm just getting ready to deploy a web app in a couple of months). I still have the shakes from excitement, but I'm nervous, as each time I find something this excellent I am soon let down. Do you have any experience with pgpool-II, and will it help me run my database across multiple VMs, and later across multiple physical servers altogether? Is it all I need for backups, load balancing, and higher availability for my DB server?!
Also, is it easy to use the parallel query function (for instance, from Django or through Python's psycopg2)? This would be most excellent for reporting and aggregation!
One last thing: it seems to sit between Postgres and psycopg2. Is this a correct understanding, so that I can use psycopg2 the same as normal, without regard for pgpool-II?
pgpool-II works fine for what it claims to do. And it fits between your application and the database the way you expect it to; just point psycopg2 toward it instead of directly at the database and off you go.
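Concretely, the only thing that changes on the psycopg2 side is the host and port you point at; a minimal sketch (host, database, and credentials are placeholders; 9999 is pgpool-II's default listen port):

```python
# Connecting through pgpool-II instead of straight at PostgreSQL: the
# application code is identical, only host/port differ. Names and
# credentials here are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="pgpool.internal",  # the pgpool-II machine, not the database itself
    port=9999,               # pgpool-II's default listen port
    dbname="mydb",
    user="myuser",
    password="secret",
)
with conn.cursor() as cur:
    cur.execute("SELECT version()")
    print(cur.fetchone())
conn.close()
```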
The main thing you have to note is that while it supports many different types of features (replication, load balancing, parallel query), you can't use them all at once. It sounds like you may be under the impression that you can, and it doesn't work that way. The documentation is not all that clear on this subject (the English version at least; I can't speak for the original Japanese one).
For example, if you run pgpool-II in its "Master/Slave" mode, so that it supports load balancing for scaling reads, you have to use another program to actually do the replication between those nodes. Slony was the supported replication solution to put underneath it in earlier PostgreSQL versions; as of pgpool-II 3.0 and PostgreSQL 9.0, you can also use the soon-to-be-released Streaming Replication/Hot Standby features of that new version.
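For a rough idea of what that looks like, these are the pgpool.conf settings that select that mode in the 3.0-era configuration (hostnames are placeholders; check the documentation for your exact version):

```
# Sketch of a pgpool.conf for load-balanced reads on top of PostgreSQL's
# own streaming replication (pgpool-II 3.0 era). Hostnames are
# placeholders; everything not shown stays at its default.
listen_addresses = '*'
port = 9999

# One primary and one standby backend
backend_hostname0 = 'primary.internal'
backend_port0 = 5432
backend_weight0 = 1
backend_hostname1 = 'standby.internal'
backend_port1 = 5432
backend_weight1 = 1

# Spread read-only queries across the backends...
load_balance_mode = on
# ...while delegating replication to PostgreSQL itself.
master_slave_mode = on
master_slave_sub_mode = 'stream'
replication_mode = off
```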
pgpool-II is a useful component and you can use it in a lot of interesting ways, but I doubt it will be "all you need" for every requirement you hope to achieve with it.