Cloud SQL errors - google-cloud-sql

Various errors started occurring in Google Cloud SQL. The system says the service is temporarily unavailable, but it's been quite a while, and roughly 1 in 10 queries now returns a 500/502 error. Here is an example stack trace: http://pastebin.com/MNk06PT4
This is a follow-up to "Severe delays in cloud SQL responses" and could be the same issue. Same conditions: Google Compute Engine connected to a Cloud SQL instance, no zone preference. Hope that sheds more light on the issue.

Between 11:00 and 11:30 PST there was an issue that interrupted many Cloud SQL instances. The problem should now be resolved.
We apologize for the inconvenience and thank you for your patience and continued support. Please rest assured that system reliability is a top priority for the Google Cloud Platform, and we are making continuous improvements to make our systems better.
To be kept informed of other Google Cloud SQL issues and launches, please join google-cloud-sql-announce@googlegroups.com
https://groups.google.com/forum/#!forum/google-cloud-sql-announce

It seems all of Google Cloud SQL is down; is there any expected recovery time?
https://cloud.google.com/console
Error: Server Error
The server encountered an error and could not complete your request.
Please try again in 30 seconds.
Best regards
Sergio

Also, I noticed that you are using the deprecated JDBC driver, which has much lower performance than using the MySQL wire protocol natively. See https://developers.google.com/cloud-sql/docs/external for information on connecting with the standard drivers. That will help latency as well as consistency of performance.
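For what it's worth, a minimal sketch of testing that native path once the instance has an assigned IP, using the stock mysql client (the host and user below are placeholders; the page above covers the actual setup):
# connect over the MySQL wire protocol to the instance's assigned IP
mysql --host=<instance-ip> --user=<user> --password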

I also got this error using CodeIgniter v2.1 and v3 on App Engine.
It happens when using $autoload['libraries'] = array('database');
then, after a few random page refreshes, the error pops up.
After changing the following in database.php:
'pconnect' => TRUE,
into
'pconnect' => FALSE,
(i.e. disabling persistent connections), the error is gone in my application. Both versions 2.1 and 3 are now working for me.
Maybe there is a similar setting in the framework or code you're using.

Related

Error using the connection to database when RDS scales out

We have a .NET API hosted in ECS that queries data from a serverless v1 cluster using Entity Framework. Under normal load this service performs very well, but when there's a large spike in traffic that requires the RDS cluster to scale out to more ACUs, we see a lot of connection errors in our API.
An error occurred using the connection to database '"ourdatabasename"' on server '"tcp://ourcluster.region.rds.amazonaws.com:5432"'.
The high-level overview of the infrastructure looks like this:
CloudFront >> Load Balancer >> ECS Fargate >> RDS Aurora PostgreSQL Serverless v1
Stack information:
.NET 6 API compiled for Linux
Entity Framework Core 6.x
Npgsql.EntityFrameworkCore.PostgreSQL 6.x
PostgreSQL 10.18
We did open AWS support cases about this issue over the past year, but those basically always resulted in the answer that this is an implementation issue and not an infrastructure issue.
We can easily reproduce the issue by running a k6 stress test on our API (bypassing the CloudFront caching layer of course) to generate a spike high enough to trigger scaling of the RDS cluster.
For the past year we have worked around this issue by configuring RDS at a capacity at which it basically never needs to scale out. This is of course wasting money, and not the purpose of serverless at all, so we would like to find the underlying root cause and solve that.
Some things we have tried already:
We have experimented with serverless v2, which should scale in a completely different fashion, as it's the same VM consuming more resources from the hosting machine. But our preliminary conclusion is that this was even worse. We do not yet understand why, but it appears to trigger the same effect, only faster and more often, since v2 scales faster and more aggressively. With v1 we get into trouble around 400 requests per second; with v2 it was at 150 rps.
EnableRetryOnFailure seemed to help a tiny bit, but not a lot. We have left it at the default configuration as implemented by Npgsql for now (see the sketch after this list).
We have experimented with the Maximum Pool Size connection string parameter. At 300 it appears to be a bit better, but it does not solve the issue.
Changing the scaling behaviour of ECS/the ALB, or even pre-scaling them to handle peak load, did not change anything.
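To make the two knobs above concrete, here is a minimal sketch of how they are wired up with EF Core and Npgsql in a .NET 6 minimal API. The DbContext, credentials, and the exact retry values are placeholders/assumptions, not our production configuration:

using Microsoft.EntityFrameworkCore;

var builder = WebApplication.CreateBuilder(args);

// "Maximum Pool Size" caps the physical connections Npgsql keeps in its
// pool per connection string (the default is 100); we experimented with 300.
const string connectionString =
    "Host=ourcluster.region.rds.amazonaws.com;Port=5432;" +
    "Database=ourdatabasename;Username=api_user;Password=<secret>;" +
    "Maximum Pool Size=300;Timeout=30";

builder.Services.AddDbContext<AppDbContext>(options =>
    options.UseNpgsql(connectionString, npgsql =>
        // Retry transient failures (for example, connections dropped while
        // the cluster scales) with exponential backoff between attempts.
        npgsql.EnableRetryOnFailure(
            maxRetryCount: 6,
            maxRetryDelay: TimeSpan.FromSeconds(30),
            errorCodesToAdd: null)));

var app = builder.Build();
app.Run();

// Hypothetical stand-in for our real DbContext.
public class AppDbContext : DbContext
{
    public AppDbContext(DbContextOptions<AppDbContext> options) : base(options) { }
}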
We have not tried:
RDS Proxy: it's supposed to solve all your connection pooling issues, but we're not sure it's even a pooling issue. We're not keen on trusting yet another black-box service to solve the issues our first black-box service (Aurora Serverless) has, and it's not exactly cheap. If all of SO now convinces us this is the holy grail, then surely we'll try it out.
Data API for RDS: you can't have connection management issues if you're not making connections, right? But it's a huge investment to rewrite all the EF code as Data API requests, and I'm not sure what it says about the service that it's still not out for serverless v2. So, not for now I think.
The first purpose of this question here on SO is to find someone who can help us understand what is even going on, i.e. to understand the error and where it comes from. We understand that you cannot expect ECS+RDS to just magically handle all the load you throw at it. But if we do not fully understand how it breaks, we cannot come up with potential failover mechanisms or ways to make the system fail more gracefully.
If someone knows the magic setting but not the why, that's also great of course :) We can then maybe figure out the why ourselves and share that back with the community ;)
Feel free to ask more questions where needed.

Google Cloud SQL Migration Job stuck on Running

I've got a database on Google Cloud SQL that is used by our application running on Kubernetes in GKE.
The MySQL instance is running 5.6 and I need to update it to 5.7, so I tried using the new migration jobs.
I've set up the connection profile and all the required permissions for the source DB, then followed the instructions to make a continuous migration.
The job says it's running, migrating the ~450GB database. After about a day it's still running, the storage used seems to have stopped growing, and the replication delay is at 0. The source database is not currently in use (that's why I'm using it to try this out before doing the same with a more important DB).
According to this, if the dump phase is done I should be able to promote the instance, but the promote button remains greyed out, and there's no way to check the running state (it only says "running", and I don't see any way to check whether it's dumping, on CDC, or anything else).
The documentation seems a bit lacking, and I couldn't find anything by googling around. Has anyone been using this?
In short, my questions are:
Why can't I promote the instance?
And how can I check what phase the migration is in?
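For reference, both can at least be attempted from the CLI, which sometimes reports more detail than the console. A sketch, assuming the Database Migration Service gcloud command group and placeholder job/region names:
# show the job state and phase (e.g. FULL_DUMP vs CDC) as the API reports them
gcloud database-migration migration-jobs describe MY_JOB --region=us-central1 --format="value(state,phase)"
# attempt the promotion; if the job is not ready, the error should say why
gcloud database-migration migration-jobs promote MY_JOB --region=us-central1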
Here's a screencap of my job:
link because SO doesn't let me embed images yet
Thanks.
P.S.: the tag that the documentation says should be used on Stack Overflow is google-cloud-database-migration-service, which is too long and Stack Overflow doesn't allow it, so I used google-cloud-sql instead :/
I am seeing an issue like this, but possibly more frustrating: after a week on a 2TB database, storage resets to near zero and the full dump restarts, without any errors or indication of what happened.

Unable to investigate a ZEIT Now 502 error for a Next.js app

I've been investigating for days, with no results, an exception that my Next.js app is currently throwing, in particular when I try to open one specific URL:
502: BAD_GATEWAY
Code: NO_STATUS_CODE_FROM_FUNCTION
ID: zrh1:4zx5l-1572269318137-64d401b5d058
Basically, I have a page at https://lucacattide.dev/about/en that this app should open. It is linked to a third-party MongoDB cloud API platform, Squidex, which is responsible for populating the page via GraphQL queries. The app uses Apollo as its GraphQL client and is hosted on the ZEIT.co serverless cloud, on Now 2.0.
During development everything works fine: the page loads up and data is fetched the right way. Note that for development I'm working in the now-dev environment instead of a custom Express server, in order to reproduce the production one, as suggested by ZEIT itself.
The exception is thrown only in the production environment (the live one on the hosting platform, not localhost), and the main problem is that no errors show up in the live logs or in local development, so I'm struggling to inspect the possible cause.
I've already tried testing the involved page by splitting it into sections and excluding child components, and by focusing the inspection on the GraphQL query. The former produced no results and the latter works fine in every environment.
As a last try, I deleted and re-created the back-end content related to that page, because in the past I had a similar issue due to an edited GraphQL schema whose modifications weren't reflected through the API, which also produced 502 errors. But this time it didn't work.
Could anyone help me understand what's going on, please?
Thanks everyone in advance
The issue was caused by an incompatibility between the d3-cloud library and the Now environment.
Replacing it with react-wordcloud solved the error.
Thanks everyone for your assistance.

Google Cloud SQL Postgres - randomly slow queries from Google Compute / Kubernetes

I've been testing Google Cloud SQL with PostgreSQL, but I have random queries taking ~3s instead of a few ms.
Troubleshooting I did:
The queries themselves aren't the problem; rerunning the same query works.
Indexes are properly set. The database is also very, very small; it shouldn't behave like this even if there weren't any indexes.
The Kubernetes container connects to the database through the SQL Proxy (I followed https://cloud.google.com/sql/docs/postgres/connect-kubernetes-engine). It is not the problem, though, as I tried connecting directly to the database and had the same issue.
I configured net.ipv4.tcp_keepalive_time to 60 to make sure the connections weren't being dropped (see the sketch after this list).
I also keep a pool of connections that are never disconnected, to make sure the problem wasn't coming from there.
When I run queries directly through my local Postgresql client, I never have the problem.
I don't have this issue when developing locally either and connecting to my local database.
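For reference, a sketch of the keepalive setup mentioned in the list above, with placeholder connection details; the second variant uses libpq's standard per-connection parameters instead of kernel settings:
# kernel-level TCP keepalive (requires root; on Kubernetes, a pod-level sysctl)
sysctl -w net.ipv4.tcp_keepalive_time=60
# per-connection alternative via libpq connection parameters
psql "host=<instance-ip> dbname=<db> user=<user> keepalives=1 keepalives_idle=60 keepalives_interval=10 keepalives_count=5"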
What I'm getting at is: I feel there's some weird connection/link issue between my Google Compute instances and my Google SQL instance that I can't seem to figure out.
Any idea?
Edit:
I also noticed these logs in my Cloud SQL instance every 30s:
ERROR: recovery is not in progress
HINT: Recovery control functions can only be executed during recovery.
STATEMENT: SELECT pg_is_xlog_replay_paused(), current_timestamp
That's an interesting problem you are facing. My knowledge of Kubernetes isn't that great, but I do have a general understanding, so let's see if I can provide some suggestions.
To start with, the API that you linked to in your question does mention that it is still in beta, so I believe there may still be issues to patch in maximizing speed performance.
Secondly, from what I understand, Kubernetes is a great tool for handling stateless workloads, so handling data where state is required for queries can be a slow operation. This article (although not entirely related) explains some of the pitfalls of Kubernetes (not all the questions are relevant).
Thirdly, could you explain your use case a little? Do you really need to use Kubernetes, or would another tool like a powerful Compute Engine instance or a Dataflow job resolve the issue? Are you making your database queries through a programming language or an application call?
Thanks, and do let me know!

Severe delays in cloud SQL responses

In the past 4-5 hours there have been tens of simple read queries that took 40-70 seconds to return results from the Cloud SQL DB. Usually they take 50ms or so. Is there some ongoing issue? I can provide DB IDs and specific times if needed.
Thanks.
Between 11:00 and 11:30 PST there was an issue that interrupted many Cloud SQL instances. The problem should now be resolved.
(from Joe Faith in another thread)