I have noticed that my Google Cloud SQL instance is losing connectivity periodically and it seems to be associated with some read spikes on the Cloud SQL instance. See the attached screenshot for examples.
The instance is mostly idle, but my application recycles the connections in its connection pool every 60 seconds, so this is not a wait_timeout issue. I have verified that the connections are recycled. Also, it occurred twice in 30 minutes, and wait_timeout is 8 hours.
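For reference, this is roughly how the recycling is set up. The sketch below uses SQLAlchemy purely as an illustration (my actual stack, credentials, and host are not shown here, so treat those details as placeholders):

```python
# Illustrative only: a pool that discards connections older than 60 seconds,
# so an 8-hour wait_timeout can never explain the dropped connections.
# SQLAlchemy and the connection URL are stand-ins, not the real stack.
from sqlalchemy import create_engine, text

engine = create_engine(
    "mysql+pymysql://app:secret@10.0.0.5/appdb",  # placeholder Cloud SQL address
    pool_size=5,
    pool_recycle=60,      # recycle idle connections every 60 seconds
    pool_pre_ping=True,   # validate a connection before handing it out
)

with engine.connect() as conn:
    conn.execute(text("SELECT 1"))
```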
I would suspect a backup process but you can see from the screenshot that no backups have run.
The first outage lasted 17 seconds from the time the connection loss was detected until the connection was reestablished. The second lasted only 5 seconds, but since my connections sit idle for 60 seconds between recycles, the loss could have begun at any point in that window, so the actual downtime could be up to 1:17 and 1:05 respectively. They occurred at 2014-06-05 15:29:08 UTC and 2014-06-05 16:05:32 UTC. The read spikes were not initiated by me; my app remained idle during the issue, so this is some sort of internal Cloud SQL process.
This is not a big deal for my idle app, but it will become a big deal when the app is not idle.
Has anyone else run into this issue? Is this a known issue with Google Cloud SQL? Is there a known fix?
Any help would be appreciated.
****Update****
The root cause of the symptoms above has been identified as a restart of the MySQL instance. I did not restart the instance, and the operations section of the web console does not list any events at that time, so now the question becomes: what would cause the instance to restart twice in 30 minutes? Why would a production database instance restart at all?
That was caused by one of our regular releases. Because of the way the update takes place, an instance might be restarted more than once during the push to production.
Was your instance restarted? During a restart, spinning an instance down and back up will generate reads and writes.
That may be one reason why you are seeing the read/write activity.
Related
I had a number of jobs scheduled, but it seems none of them were running. On further debugging, I found that there are no available servers, and in the jobrunr_backgroundjobservers table it seems there has not been a heartbeat for any of the servers. What would cause this issue? How would I restart a heartbeat? And how would I know when such an issue occurs and the servers go down again, given that the schedules are time sensitive?
JobRunr will stop polling if the connection to the database is lost or the database goes down for a while.
The JobRunr Pro version adds extra features, one of which is database fault tolerance - if such an issue occurs, JobRunr Pro will go into standby and start processing again once the connection to the database is stable again.
See https://www.jobrunr.io/en/documentation/pro/database-fault-tolerance/ for more info.
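For the "how would I know when the servers go down again" part, one low-tech option is to poll the heartbeat table yourself and alert when it goes stale. A minimal sketch follows; Python and psycopg2 are used only for illustration, and the lastheartbeat column name, the PostgreSQL backing store, and the 60-second threshold are assumptions you should check against your JobRunr schema and pollIntervalInSeconds:

```python
# Minimal staleness check against the JobRunr server heartbeat table.
# Assumptions: PostgreSQL backing store, a "lastheartbeat" column, and a
# 60-second threshold; adjust to your actual schema and poll interval.
import psycopg2

STALE_AFTER_SECONDS = 60  # assumed threshold; tie it to pollIntervalInSeconds

conn = psycopg2.connect("dbname=app user=monitor host=db.internal")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT count(*)
        FROM jobrunr_backgroundjobservers
        WHERE lastheartbeat > now() - interval '%s seconds'
        """ % STALE_AFTER_SECONDS
    )
    alive = cur.fetchone()[0]

if alive == 0:
    # Hook this into whatever alerting you already have (Slack webhook, cron mail, ...)
    print("ALERT: no JobRunr background job server has sent a heartbeat recently")
```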
As usual, we had our Redshift maintenance window set for Saturday morning, and we got several errors during that maintenance window.
* Query Processing Error at 5:07:01 AM
[Amazon](500051) ERROR processing query/statement. Error: Query execution failed
[SQL State=HY000, DB Errorcode=500051]
* Connection Error at 5:07:27.79 AM
[Amazon](500150) Error setting/closing connection: Connection refused: connect.
I guess that's due to Redshift internal maintenance.
How can I find evidence on the Redshift side to confirm this? I checked svl_qlog with aborted=1, but couldn't find anything conclusive.
And is there any way to have the maintenance window skipped while user sessions are connected?
--
Thanks to the useful information from Schepo and Bill, we were able to confirm that the connection errors were due to a reboot during the Redshift maintenance window.
We also checked the Redshift events in the console to see exactly what time the reboot started and ended.
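For anyone who prefers to pull that same event history programmatically rather than from the console, the cluster events are also available through the AWS SDK. A rough sketch with boto3 (the cluster identifier, region, and time range below are placeholders):

```python
# List Redshift cluster events (e.g. maintenance reboots) around the error time.
# The cluster identifier, region, and window below are placeholders.
from datetime import datetime, timezone
import boto3

redshift = boto3.client("redshift", region_name="ap-northeast-1")  # assumed region

events = redshift.describe_events(
    SourceIdentifier="my-cluster",   # placeholder cluster id
    SourceType="cluster",
    StartTime=datetime(2021, 6, 5, 4, 0, tzinfo=timezone.utc),
    EndTime=datetime(2021, 6, 5, 7, 0, tzinfo=timezone.utc),
)

for e in events["Events"]:
    print(e["Date"], e["EventId"], e["Message"])
```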
Probably the best way to check if the connection errors were due to Redshift maintenance would be to check the Maintenance tab in your cluster configuration. In the example screenshot below, it's some time between 06:30 and 07:00 am every Wednesday.
There's no way to stop it happening while user sessions are connected. Although you do have the option of deferring all maintenance for up to 45 days if you need (follow the Edit button on the same screen).
For evidence, you can check the audit log of past maintenance events by looking in the AWS Config service under the "timeline" of your cluster. Follow the View Config Timeline button to open AWS Config for that cluster. In the example screenshot below you can see the exact time (08:49:20) of one past maintenance window.
Another way to document that the maintenance window was used is to check the "healthy" dashboard metric on the console or in CloudWatch. If the cluster went unhealthy and then returned to healthy during the maintenance window, it is very likely that AWS performed an update on the systems.
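If you want that health check programmatically as well, the same metric can be pulled from CloudWatch. A sketch (cluster identifier, region, and time range are placeholders; in CloudWatch the metric is reported as 1 while healthy and 0 while unhealthy):

```python
# Pull the Redshift HealthStatus metric around the maintenance window.
# A dip from 1 to 0 marks the span during which the cluster was unhealthy.
# Cluster identifier, region, and time range are placeholders.
from datetime import datetime, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Redshift",
    MetricName="HealthStatus",
    Dimensions=[{"Name": "ClusterIdentifier", "Value": "my-cluster"}],
    StartTime=datetime(2021, 6, 5, 4, 0, tzinfo=timezone.utc),
    EndTime=datetime(2021, 6, 5, 7, 0, tzinfo=timezone.utc),
    Period=60,
    Statistics=["Minimum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Minimum"])
```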
I'm using a Cloud SQL proxy sidecar on my nodejs API service.
It appears to work great, except that approximately 1% of my API requests come back with an error indicating that the DB connection failed with:
connect ECONNREFUSED 127.0.0.1:3306
My backend logs show that this was thrown from my ORM when it attempted to connect to the DB.
Sidecar logs show nothing, and the CloudSQL instance in question shows nothing out of the ordinary (17/4000 connections, <1% CPU usage, 1.5/3.5GiB memory usage, <100KiB ingress/egress per time slice on 6 hour window).
What might be causing this?
Edit: additional information:
All my pods have been up for many hours with 0 restarts, so the intermittent failure isn't a transient startup failure.
Logs show that this has been occurring intermittently for at least the past 30 days.
Here are a few reasons that can cause a Cloud SQL instance to become inaccessible:
1) Connection failure between your instance and the agents Cloud SQL uses to monitor the health of your instance
2) Synchronization of operations between your instance and the Cloud SQL service
3) Underprovisioning of resources, such as CPU cores, RAM, and/or storage, to your Cloud SQL instance (see Cloud SQL's Operational Guidelines [1] for additional information).
Since there are several reasons which could cause connections to be dropped (many of which are intricately related to the specifics of your project's implementation and environment), it's extremely complex to diagnose abnormal connection rejection. Additionally, Cloud SQL continuously monitors for any issues that can make an instance inaccessible and automatically takes action to resolve these issues.
Under normal circumstances the error rate will not fully go away, but it should stay at a very low level [2]. There are, of course, some conditions that can make it worse, both production issues and certain combinations of operations.
In any case, the recommendation under such circumstances is to implement a retry strategy with exponential backoff when reconnecting to the instances; a minimal sketch follows the links below. Some of the client libraries already have supporting code in place, but it depends a bit on exactly what you're using.
[1] https://cloud.google.com/sql/docs/mysql/operational-guidelines
[2] https://cloud.google.com/sql/sla
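As an illustration of that retry recommendation, here is a minimal reconnect-with-exponential-backoff sketch. Python and PyMySQL are used only as an example (the service in question is Node.js, and your ORM or driver may already expose equivalent retry options); the host and credentials are placeholders:

```python
# Minimal reconnect-with-exponential-backoff sketch.
# Python/PyMySQL are illustrative only; apply the same pattern in your own ORM/driver.
import random
import time

import pymysql

def connect_with_backoff(max_attempts=5, base_delay=0.5):
    """Try to connect, doubling the wait (plus jitter) after each failure."""
    for attempt in range(max_attempts):
        try:
            return pymysql.connect(host="127.0.0.1", port=3306,
                                   user="app", password="secret", database="appdb")
        except pymysql.err.OperationalError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)

conn = connect_with_backoff()
```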
I am testing logical replication between 2 PostgreSQL 11 databases for use in our production environment (I was able to set it up thanks to this answer - PostgreSQL logical replication - create subscription hangs) and it worked well.
Now I am testing the scripts and procedure that would set it up automatically on the production databases, but I am facing a strange problem with logical replication slots.
I had to restart the logical replica due to a settings change that required a restart - which of course could also happen on replicas in the future. But the logical replication slot on the master did not disconnect and is still active for a certain PID.
I dropped the subscription on the master (I am still only testing) and tried to repeat the whole process with a new logical replication slot, but I am facing a strange situation.
I cannot create a new logical replication slot with the new name. The process holding the old logical replication slot is still active and shows wait_event_type=Lock and wait_event=transaction.
When I try to use pg_create_logical_replication_slot to create a new logical replication slot I get a similar situation. The new slot is created - I can see it in pg_catalog - but it is marked active for the PID of the session that issued the command, and the command hangs indefinitely. When I check the processes I can see this command active with the same wait values, Lock/transaction.
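For context, the state described above can be inspected with queries like the following, which show which backend holds each slot and which sessions are stuck on a lock. The sketch uses psycopg2 purely for illustration (the DSN is a placeholder); running the same SQL from psql works just as well:

```python
# Inspect which backend holds each replication slot and which sessions wait on a lock.
# psycopg2 is used only for illustration; the same queries work from psql.
import psycopg2

conn = psycopg2.connect("dbname=postgres user=postgres host=master.internal")  # placeholder DSN
with conn, conn.cursor() as cur:
    # Slots and the PID that currently has each one open.
    cur.execute("SELECT slot_name, active, active_pid FROM pg_replication_slots")
    for row in cur.fetchall():
        print("slot:", row)

    # Backends stuck on a lock (the hung pg_create_logical_replication_slot call
    # and the old walsender should both show up here).
    cur.execute("""
        SELECT pid, backend_type, state, wait_event_type, wait_event, query
        FROM pg_stat_activity
        WHERE wait_event_type = 'Lock' OR backend_type = 'walsender'
    """)
    for row in cur.fetchall():
        print("backend:", row)
```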
I tried setting the "lock_timeout" parameter in postgresql.conf and reloading the configuration, but it did not help.
Killing that old hanging process would most likely bring down the whole Postgres server because it is a "walsender" process. It is still visible in the process list with the IP of the replica and the status "idle waiting".
I tried to find some parameter(s) that could force Postgres to stop this walsender, but setting wal_keep_segments or wal_sender_timeout did not change anything. I even tried stopping the replica for a longer time - no effect.
Is there some way to deal with this situation without restarting the whole Postgres server? For example, forcing a timeout for the walsender or for the transaction lock?
Because if something like this happens in production I would not be able to use a restart or any other "brute force". Thanks...
UPDATE:
"Walsender" process "died out" after some time but log does not show anything about it so I do not know when exactly it happened. I can only guess it depends on tcp_keepalives_* parameters. Default on Debian 9 is 2 hours to keep idle process. So I tried to set these parameters in postgresql.conf and will see in following tests.
Strangely enough, today everything works without any problems, and no matter how I try to simulate yesterday's problems, I cannot reproduce them. Maybe some network communication problems in the cloud datacenter were involved - we experienced occasional timeouts in connections to other databases too.
So I really do not know the answer except for "wait until the walsender process on the master dies" - which can most likely be influenced by the tcp_keepalives_* settings. Therefore I recommend setting them to reasonable values in postgresql.conf, because the OS defaults are usually far too long.
We actually use this on our big analytical databases (set both in PostgreSQL and at the OS level) because of similar problems. Go and Node.js programs calculating statistics would occasionally fail to recognize that a database session had ended or died, and would hang until the OS closed the connection after 2 hours (the default on Debian). All of it always seemed to be connected with network communication problems. With proper tcp_keepalives_* settings the reaction is much quicker when problems occur.
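For completeness, the same behaviour can also be requested from the client side: libpq-based drivers accept per-connection keepalive parameters, so a hung session is detected in minutes rather than after the 2-hour OS default. A sketch with psycopg2 (the host, database, and the specific values are examples only; tune them to how quickly you need dead sessions detected):

```python
# Request aggressive TCP keepalives on the client connection so a dead session
# is detected in minutes, not after the OS default of 2 hours.
# Host, database, and the values below are examples only.
import psycopg2

conn = psycopg2.connect(
    host="db.internal",          # placeholder host
    dbname="analytics",
    user="stats",
    keepalives=1,                # enable TCP keepalives (libpq parameter)
    keepalives_idle=60,          # seconds of idle time before the first probe
    keepalives_interval=10,      # seconds between probes
    keepalives_count=5,          # lost probes before the connection is dropped
)
```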
After the old walsender process dies on the master you can repeat all the steps and it should work. So it looks like I just had bad luck yesterday...
We upgraded our Google Cloud SQL Postgres server to a bigger machine and the upgrade is not completing. In our experience this usually takes less than 5 minutes, but we've been waiting for about 1.5 hours now and nothing is happening. There are no logs after the server shut down (except for failed connection attempts). We cannot switch to the failover, because there is already an operation in progress (namely the upgrade that's causing the problem in the first place). Restarting is disabled because the operation is in progress. It seems like there's nothing we can do right now, except maybe restore the last backup, though we're not sure if that's even possible while an operation is in progress.
Is there anything we can do to restart the DB or fix the problem?
When you upgrade a Cloud SQL server, the instance is rebooted. It can occasionally happen that rebooting takes longer than expected, which seems to be what happened to your server, but this is not unexpected behaviour.
This being said, be sure to check the status of the CloudSQL service. And if upgrades get stuck too often or never finish, contact support.
To reduce the chances of having this issue again:
* Configure High Availability for your instance, so it has failover capability.
* Make sure that the maintenance window of failover replicas is different from that of the master instance. To change the maintenance schedule, in the GCP console go to SQL, click on an instance, then "Edit maintenance schedule" -> "Set maintenance schedule" and choose a window.