CloudSQL Proxy intermittently refuses connection - google-cloud-sql

I'm using a Cloud SQL proxy sidecar on my nodejs API service.
It appears to work great, except that approximately 1% of my API requests come back with an error indicating that the DB connection failed with:
connect ECONNREFUSED 127.0.0.1:3306
My backend logs show that this was thrown from my ORM when it attempted to connect to the DB.
Sidecar logs show nothing, and the Cloud SQL instance in question shows nothing out of the ordinary (17/4000 connections, <1% CPU usage, 1.5/3.5GiB memory usage, <100KiB ingress/egress per time slice over a 6-hour window).
What might be causing this?
Edit: additional information:
All my pods have been up for many hours with 0 restarts, so the intermittent failure isn't a transient startup failure.
Logs show that this has been occurring intermittently for the past 30 days.

Here are a few reasons that can cause a Cloud SQL instance to become inaccessible:
1) Connection failure between your instance and the agents Cloud SQL uses to monitor the health of your instance
2) Synchronization of operations between your instance and the Cloud SQL service
3) Underprovisioning of resources, such as CPU cores, RAM, and/or storage, to your Cloud SQL instance (see Cloud SQL's Operational Guidelines [1] for additional information).
Since there are several reasons that could cause connections to be dropped (many of which are intricately tied to the specifics of your project's implementation and environment), diagnosing abnormal connection rejections is quite complex. Additionally, Cloud SQL continuously monitors for any issues that can make an instance inaccessible and automatically takes action to resolve them.
Under normal circumstances the error rate will not go away entirely, but it should stay at a very low level [2]. There are, of course, some conditions that can make it worse, both production issues and certain combinations of operations.
In any case, the recommendation under such circumstances is to implement a retry strategy with exponential backoff when reconnecting to the instance. Some of the client libraries already have supporting code in place, but it depends a bit on exactly what you're using.
[1] https://cloud.google.com/sql/docs/mysql/operational-guidelines
[2] https://cloud.google.com/sql/sla
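To make the retry recommendation concrete, here is a minimal sketch of connecting with exponential backoff and jitter. It assumes the mysql2 driver is used directly (since the proxy listens on 127.0.0.1:3306); most ORMs expose equivalent pool/retry settings, so treat the names and parameters here as illustrative rather than prescriptive.
// Minimal sketch: open a MySQL connection through the Cloud SQL proxy sidecar,
// retrying transient failures (like ECONNREFUSED) with exponential backoff and jitter.
// Assumes the mysql2 driver; adapt the connect call to whatever ORM/driver you use.
import { createConnection, Connection } from "mysql2/promise";

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));
const TRANSIENT = new Set(["ECONNREFUSED", "ECONNRESET", "ETIMEDOUT"]);

export async function connectWithBackoff(maxAttempts = 5): Promise<Connection> {
  let delayMs = 100; // initial backoff
  for (let attempt = 1; ; attempt++) {
    try {
      return await createConnection({
        host: "127.0.0.1", // Cloud SQL proxy listens on localhost in the pod
        port: 3306,
        user: process.env.DB_USER,
        password: process.env.DB_PASS,
        database: process.env.DB_NAME,
      });
    } catch (err: any) {
      if (attempt >= maxAttempts || !TRANSIENT.has(err.code)) {
        throw err; // give up: out of attempts, or a non-transient error
      }
      await sleep(delayMs + Math.random() * delayMs); // full jitter
      delayMs *= 2; // 100ms, 200ms, 400ms, ...
    }
  }
}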

Related

Expected unavailability during Cloud SQL Postgres failovers and CPU/memory upgrades?

I have some experience with AWS RDS MySQL multi-AZ (HA). I'm looking at GCP Cloud SQL Postgres HA for a new project.
I'm trying to figure out how certain maintenance operations work, but I can't tell from the Cloud SQL docs.
How much unavailability does a failover cause?
How much unavailability does a CPU/memory upgrade cause?
After a failover, is it important to eventually "failback" to the original primary instance? Or can I leave it running on the standby instance indefinitely? (The Cloud SQL HA failover diagram makes it seem like the two instances aren't totally symmetric.)
Just FYI, here are the answers for AWS RDS:
Failover: usually under 70 seconds of unavailability before my application is able to issue queries again.
This is for planned failovers. (For unplanned failovers, it may take a little longer for RDS to detect that the primary instance is unresponsive before it actually initiates the failover.)
A lot of the failover lag is likely due to DNS. Using the AWS RDS Proxy service may reduce that time (they claim by ~80%). The Cloud SQL HA failover diagram shows both instances sharing a virtual IP, which might mean no DNS lag?
CPU/memory upgrade: I think AWS can accomplish this with a single failover's worth of unavailability. It upgrades the standby instance (no unavailability), performs a failover, then upgrades the other instance.
On RDS, I think the two instances that are part of the HA setup are symmetric. So if you fail over to the standby, it's fine to leave it that way. There's no need (as far as RDS is concerned) to fail back to the original.
To answer your following questions:
As you mentioned, the duration of the unavailability varies depending on whether it is a planned (manual) failover or an unplanned one. It's best to test by manually initiating a failover so you can see how long your instance takes to recover; usually it takes a minute or so. For unplanned failovers, the docs state that when failover occurs, any existing connections to the primary instance and read replicas are closed, and it takes approximately 2-3 minutes for connections to be reestablished.
To address this question, you need to understand the requirements for your instance to allow failover:
The primary instance must be in a normal operating state (not stopped, undergoing maintenance, or performing a long-running Cloud SQL instance operation such as a backup, import or export operation).
That means failover is not available while your instance is being upgraded: changing your hardware specs (CPU/memory) will incur downtime, so you should plan ahead when making these changes.
To understand the importance of failback, here's an excerpt from this link:
High availability solutions continuously replicate data to a remote site or cloud. In the event that a primary system goes down, the remote, secondary system can be spun up and users are rerouted. This process is commonly referred to as “failover,” and it reduces downtime to seconds or minutes.
However, failover isn’t a permanent state. Once primary servers are up and running, data and applications must be restored so normal operations can resume. This process is known as failback, and it is very important from a DR testing standpoint. Here’s why: Not all replication technology is created equally when it comes to failback. In some cases, failing back to production servers can be painfully slow.
UPDATE 1:
HA on Cloud SQL provisions the standby instance with the same specs as your primary, which is why you'll get billed double the price of a non-HA instance. Also, the importance of failback is not specific to any one cloud provider. It is simply good practice to make sure that all operations return to your primary instance instead of just leaving them on a standby instance. In that case, failback (on Cloud SQL specifically) is really necessary to make sure that everything is back to normal after an outage.
UPDATE 2:
If you don't fail back, what could happen is that when there's an outage in the zone where your standby instance is running (you can't control which zone your standby instance comes from), you won't be able to do a failover because the operation will be blocked. (See the docs.)
Unfortunately there's pretty much no way around it, as downtime is required whenever you change hardware. The procedure requires the instance to restart. Here's a link to see how long it would take.
Additional resources: https://severalnines.com/database-blog/achieving-mysql-failover-failback-google-cloud-platform-gcp

Pgbouncer: how to run within a kubernetes cluster properly

The background: I currently run some Kubernetes pods with a pgbouncer sidecar container. I've been running into annoying behavior with sidecars (which will be addressed in k8s 1.18) that has workarounds, but it has raised a more basic question about how to run pgbouncer inside k8s.
Many folks recommend the sidecar approach for pgbouncer, but I wonder why running one pgbouncer per, say, machine in the k8s cluster wouldn't be better? I admit I don't have a deep enough understanding of either pgbouncer or k8s networking to understand the implications of either approach.
EDIT:
Adding context, as it seems like my question wasn't clear enough.
I'm trying to decide between two approaches of running pgbouncer in a kubernetes cluster. The PostgreSQL server is not running in this cluster. The two approaches are:
Running pgbouncer as a sidecar container in all of my pods. I have a number of pods: some replicas on a webserver deployment, an async job deployment, and a couple cron jobs.
Running pgbouncer as a separate deployment. I'd plan on running 1 pgbouncer instance per node on the k8s cluster.
I worry that (1) will not scale well. If my PostgreSQL master has a max of 100 connections, and each pool has a max of 20 connections, I potentially risk saturating connections pretty early. Additionally, I risk saturating connections on the master during pushes, as new pgbouncer sidecars come up alongside the old pods that are being removed.
I, however, almost never see (2) recommended. It seems like everyone recommends (1), but the drawbacks seem quite obvious to me. Is the networking penalty I'd incur by connecting to pgbouncer outside of my pod large enough to notice? Is pgbouncer perhaps smart enough to deal with many other pgbouncer instances that could potentially saturate connections?
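To make the saturation worry concrete, here is a quick back-of-the-envelope sketch using the numbers above; the replica counts are hypothetical.
// Back-of-the-envelope worst case for the sidecar approach, using the numbers from
// the question (100 server connections, pool max of 20 per sidecar). Replica counts
// are hypothetical.
const SERVER_MAX_CONNECTIONS = 100;
const POOL_MAX_PER_SIDECAR = 20;

const replicas = { webserver: 3, asyncJobs: 1, cronJobs: 2 }; // hypothetical
const sidecars = Object.values(replicas).reduce((sum, n) => sum + n, 0); // 6

// Worst case if every sidecar pool fills up:
const steadyState = sidecars * POOL_MAX_PER_SIDECAR; // 6 * 20 = 120 > 100
// During a rolling update of the webserver deployment, old and new pods overlap:
const duringDeploy = steadyState + replicas.webserver * POOL_MAX_PER_SIDECAR; // 180

console.log({ steadyState, duringDeploy, SERVER_MAX_CONNECTIONS });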
We run pgbouncer in production on Kubernetes. I expect the best way to do it is use-case dependent. We do not take the sidecar approach, but instead run pgbouncer as a separate "deployment", and it's accessed by the application via a "service". This is because for our use case we have 1 postgres instance (i.e. one physical DB machine) and many copies of the same application accessing that same instance (but using different databases within that instance). Pgbouncer is used to manage the active connections resource.
We are pooling connections independently for each application because the nature of our application is to have many concurrent connections and not too many transactions. We are currently running with 1 pod (no replicas) because that is acceptable for our use case if pgbouncer restarts quickly. Many applications all run their own pgbouncers, and each application has multiple components that need to access the DB (so each pgbouncer is pooling connections for one instance of the application). It is done like this: https://github.com/astronomer/airflow-chart/tree/master/templates/pgbouncer
The above does not include getting the credentials set up correctly for accessing the database. The linked template above expects a secret to already exist. I expect you will need to adapt the template to your use case, but it should help you get the idea.
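For the application side of this layout, here is a minimal sketch of what connecting through the pgbouncer Service might look like, assuming node-postgres; the Service host name, port, and environment variable names are placeholders for however pgbouncer is exposed in your cluster.
// Minimal sketch: point the application's pool at the pgbouncer Service instead of
// a localhost sidecar. The host name and env var names here are hypothetical.
import { Pool } from "pg";

export const pool = new Pool({
  host: process.env.PGBOUNCER_HOST ?? "pgbouncer.default.svc.cluster.local",
  port: Number(process.env.PGBOUNCER_PORT ?? 6432), // pgbouncer's conventional port
  user: process.env.PGUSER,
  password: process.env.PGPASSWORD,
  database: process.env.PGDATABASE,
  max: 5, // keep per-pod pools small; pgbouncer does the real server-side pooling
});

// Log (rather than crash on) errors from idle clients, e.g. when the pgbouncer
// pod behind the Service is replaced; the pool opens fresh connections as needed.
pool.on("error", (err) => {
  console.error("idle pg client error", err);
});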
We have had some production concerns. Primarily, we still need to do more investigation on how to replace or move pgbouncer without interrupting existing connections. We have found that the application's connection to pgbouncer is stateful (of course, because it's pooling the transactions), so if the pgbouncer container (pod) is swapped out behind the service for a new one, existing connections are dropped from the application's perspective. This should be fine, even running pgbouncer replicas, if your application ensures that the occasional dropped connection is retried and you make use of Kubernetes sticky sessions on the "service". More investigation is still required by our organization to make it work perfectly.
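On the point about retrying the occasional dropped connection, a small wrapper along these lines is one possible approach; this sketch assumes node-postgres, and the set of retryable error codes is illustrative rather than exhaustive.
// Minimal sketch: retry a query once when the pooled connection was dropped, e.g.
// because the pgbouncer pod behind the Service was swapped out.
import { Pool, QueryResult } from "pg";

const RETRYABLE = new Set(["ECONNRESET", "ECONNREFUSED", "EPIPE", "57P01"]); // 57P01 = admin_shutdown

export async function queryWithRetry(
  pool: Pool,
  text: string,
  params: unknown[] = []
): Promise<QueryResult> {
  try {
    return await pool.query(text, params);
  } catch (err: any) {
    if (!RETRYABLE.has(err.code)) throw err;
    // The pool hands out a fresh connection on the next call, so a single retry
    // usually covers a pod swap; add backoff if you expect longer disruptions.
    return pool.query(text, params);
  }
}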

What to do when a Google Cloud SQL postgres server upgrade (to a bigger machine) takes considerably longer than "a few minutes"?

We upgraded our Google Cloud SQL Postgres server to a bigger machine and the upgrade is not terminating. In our experience this usually takes less than 5 minutes, but we've been waiting for about 1.5 hours now and nothing is happening. There are no logs after the server shut down (except for failed connection attempts). We cannot switch to the failover, because there is already an operation in progress (namely the upgrade that's causing the problem in the first place). Restarting is disabled because the operation is in progress. It seems like there's nothing we can do right now, except maybe apply the last backup, though we're not sure if that's even possible while an operation is in progress.
Is there anything we can do to restart the DB or fix the problem?
When you upgrade a Cloud SQL server, the instance is rebooted. Occasionally rebooting takes longer than expected, which seems to be what happened to your server, but this is not unexpected behaviour.
This being said, be sure to check the status of the CloudSQL service. And if upgrades get stuck too often or never finish, contact support.
To reduce the chances of having this issue again:
Configure High Availability for your instance, so it has failover capability.
Make sure that the maintenance window of failover replicas is different from that of the master instance. To change the maintenance schedule, in the GCP console go to SQL, click on an instance, and use "Edit maintenance schedule" -> "Set maintenance schedule", then choose a window.

AWS RDS with Postgres: Is the OOM killer configured?

We are running load test against an application that hits a Postgres database.
During the test, we suddenly get an increase in error rate.
After analysing the platform and application behaviour, we notice that:
CPU of Postgres RDS is 100%
Freeable memory drops on this same server
And in the postgres logs, we see:
2018-08-21 08:19:48 UTC::#:[XXXXX]:LOG: server process (PID XXXX) was terminated by signal 9: Killed
After investigating and reading the documentation, it appears one possibility is that the Linux OOM killer ran and killed the process.
But since we're on RDS, we cannot access the system logs (/var/log/messages) to confirm this.
So can somebody:
confirm that the OOM killer really runs on AWS RDS for Postgres
give us a way to check this?
give us a way to compute the max memory used by Postgres based on the number of connections?
I didn't find the answer here:
http://postgresql.freeideas.cz/server-process-was-terminated-by-signal-9-killed/
https://www.postgresql.org/message-id/CAOR%3Dd%3D25iOzXpZFY%3DSjL%3DWD0noBL2Fio9LwpvO2%3DSTnjTW%3DMqQ%40mail.gmail.com
https://www.postgresql.org/message-id/04e301d1fee9%24537ab200%24fa701600%24%40JetBrains.com
AWS maintains a page with best practices for their RDS service: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_BestPractices.html
In terms of memory allocation, this is their recommendation:
An Amazon RDS performance best practice is to allocate enough RAM so that your working set resides almost completely in memory. To tell if your working set is almost all in memory, check the ReadIOPS metric (using Amazon CloudWatch) while the DB instance is under load. The value of ReadIOPS should be small and stable. If scaling up the DB instance class (to a class with more RAM) results in a dramatic drop in ReadIOPS, your working set was not almost completely in memory. Continue to scale up until ReadIOPS no longer drops dramatically after a scaling operation, or ReadIOPS is reduced to a very small amount.
For information on monitoring a DB instance's metrics, see Viewing DB Instance Metrics.
Also, here is their recommendation for troubleshooting possible OS issues:
Amazon RDS provides metrics in real time for the operating system (OS) that your DB instance runs on. You can view the metrics for your DB instance using the console, or consume the Enhanced Monitoring JSON output from Amazon CloudWatch Logs in a monitoring system of your choice. For more information about Enhanced Monitoring, see Enhanced Monitoring.
There's a lot of good recommendations there, including query tuning.
Note that, as a last resort, you could switch to Aurora, which is compatible with PostgreSQL:
Aurora features a distributed, fault-tolerant, self-healing storage system that auto-scales up to 64TB per database instance. Aurora delivers high performance and availability with up to 15 low-latency read replicas, point-in-time recovery, continuous backup to Amazon S3, and replication across three Availability Zones.
EDIT: talking specifically about your issue with PostgreSQL, check this Stack Exchange thread; they had a long connection with auto commit set to false.
We had a long connection with auto commit set to false:
connection.setAutoCommit(false)
During that time we were doing a lot of small queries and a few queries with a cursor:
statement.setFetchSize(SOME_FETCH_SIZE)
In JDBC you create a connection object, and from that connection you create statements. When you execute the statements you get a result set.
Now, every one of these objects needs to be closed, but if you close the statement, the result set is closed, and if you close the connection, all the statements are closed along with their result sets.
We were used to short-lived queries with connections of their own, so we never closed statements, assuming the connection would take care of those things once it was closed.
The problem was with this long transaction (~24 hours) which never closed the connection. The statements were never closed. Apparently, the statement object holds resources both on the server that runs the code and on the PostgreSQL database.
My best guess as to which resources are left in the DB is those related to the cursor. The statements that used the cursor were never closed, so the result sets they returned were never closed either. This meant the database didn't free the relevant cursor resources, and since the cursor was over a huge table it took a lot of RAM.
Hope it helps!
TLDR: If you need PostgreSQL on AWS and you need rock-solid stability, run PostgreSQL on EC2 (for now) and do some kernel tuning for memory overcommit.
I'll try to be concise, but you're not the only one who has seen this and it is a known (internal to Amazon) issue with RDS and Aurora PostgreSQL.
OOM Killer on RDS/Aurora
The OOM killer does run on RDS and Aurora instances because they are backed by Linux VMs, and the OOM killer is an integral part of the kernel.
Root Cause
The root cause is that the default Linux kernel configuration assumes that you have virtual memory (a swap file or partition), but EC2 instances (and the VMs that back RDS and Aurora) do not have virtual memory by default. There is a single partition and no swap file is defined. When Linux thinks it has virtual memory, it uses a strategy called "overcommitting", which means it allows processes to request, and be granted, more memory than the amount of RAM the system actually has. Two tunable parameters govern this behavior:
vm.overcommit_memory - governs whether the kernel allows overcommitting (0=yes=default)
vm.overcommit_ratio - with strict accounting enabled (vm.overcommit_memory = 2), the percentage of physical RAM that, together with swap, the kernel will allow to be committed. If you have 8GB of RAM and 8GB of swap, and your vm.overcommit_ratio = 75, the kernel will grant up to 14GB of memory (75% of 8GB RAM + 8GB swap) to processes.
We set up an EC2 instance (where we could tune these parameters) and the following settings completely stopped PostgreSQL backends from getting killed:
vm.overcommit_memory = 2
vm.overcommit_ratio = 75
vm.overcommit_memory = 2 tells Linux not to overcommit (to work within the constraints of available memory), and vm.overcommit_ratio = 75 limits commitments to 75% of physical RAM plus any swap (so user processes can only be granted memory up to that limit).
We have an open case with AWS and they have committed to coming up with a long-term fix (using kernel tuning params or cgroups, etc) but we don't have an ETA yet. If you are having this problem, I encourage you to open a case with AWS and reference case #5881116231 so they are aware that you are impacted by this issue, too.
In short, if you need stability in the near term, use PostgreSQL on EC2. If you must use RDS or Aurora PostgreSQL, you will need to oversize your instance (at additional cost to you) and hope for the best, as oversizing doesn't guarantee you won't still have the problem.

Cassandra giving TTransportException after some inserts/updates

Around 15 processes were inserting/updating unique entries in Cassandra. Everything was working fine, but after some time I get this error.
(When I restart the process, everything is fine again for a while.)
An attempt was made to connect to each of the servers twice, but none of the attempts succeeded. The last failure was TTransportException: Could not connect to 10.204.81.77:9160
I did a CPU/memory analysis of all the Cassandra machines. CPU usage sometimes goes to around 110%, and memory usage was between 60% and 77%. Not sure if this might be the cause, as it was working fine with such memory and CPU usage most of the time.
P.S.: How can I ensure Cassandra updates/insertions work error-free?
Cassandra will throw an exception if anything goes wrong with your inserts; otherwise, you can assume it was error free.
Connection failures are a network problem, not a Cassandra problem. Some places to start: is the Cassandra process still alive? Does netstat show it still listening on 9160? Can you connect to non-Cassandra services on that machine? Is your server or router configured to firewall off frequent connection attempts?