Google Cloud SQL: Can I change machine type with zero downtime? - google-cloud-sql

We need to change our Google Cloud SQL instance from db-g1-small to db-n1-standard-1. Can this be done with zero downtime?
Edit 1
I think I found the answer. It seems that it will take a few seconds of downtime.
You can change an instance's tier at any time, with just a few seconds
of downtime.
https://cloud.google.com/sql/pricing
Edit 2
I tried it on our dev env. The downtime was about 10 sec.
while true; do date; curl https://api.xxx.com/v1/items; echo ""; sleep 1; done
2016/8/29 16:24:50 JST
{"OK"}
2016/8/29 16:24:51 JST
Error
2016/8/29 16:25:01 JST
{"OK"}
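
To turn a probe log like the one above into a downtime figure, the gap between the first failed probe and the next successful one can be computed directly. This is only a minimal sketch: the function name `longest_outage` and the sample list are illustrative, transcribed by hand from the log above.

```python
from datetime import datetime, timedelta

def longest_outage(samples):
    """Return the longest span from a first failed probe to the next success.

    samples: list of (timestamp, ok) tuples in chronological order.
    """
    longest = timedelta(0)
    outage_start = None
    for ts, ok in samples:
        if not ok and outage_start is None:
            outage_start = ts  # first failure of a new outage
        elif ok and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None  # service is back
    return longest

# Samples transcribed from the log above (timestamps in JST).
samples = [
    (datetime(2016, 8, 29, 16, 24, 50), True),
    (datetime(2016, 8, 29, 16, 24, 51), False),
    (datetime(2016, 8, 29, 16, 25, 1), True),
]
print(longest_outage(samples).total_seconds())  # prints 10.0
```

Because the probe runs once per second and a failed request may itself block for a while, the result is an estimate with roughly one-second resolution.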

The note about changing tier in a few seconds is under the First Generation section of that page.
For a Second Generation instance, it may take several minutes.

Related

First 10 long running transactions

I have a fairly small cluster of 6 nodes: 3 client nodes and 3 server nodes. Important configuration:
storeKeepBinary = true
cacheMode = Partitioned
atomicityMode = Atomic (about 5-8 of the 25 caches use TRANSACTIONAL instead)
backups = 1
readFromBackups = false
no persistence
When I run the app for a load/performance test on-prem on 2 large boxes (3 clients on one box and 3 servers on the other, all in Docker containers), I get decent performance.
However, when I move them to AWS and run them in EKS, the only change I make is switching cluster discovery from standard TCP (the default) to Kubernetes-based discovery, and I run the same test.
Now the performance is very bad, and I keep getting:
WARN [sys-#145%test%] - [org.apache.ignite] First 10 long-running transactions [total=3]
Here the transactions have been running for more than a minute.
In other cases, I get:
WARN [sys-#196%test-2%] - [org.apache.ignite] First 10 long-running cache futures [total=1]
Here the associated future has been running for more than 3 minutes.
Most of what a Google search turned up points to a flaky or inconsistent network as the cause.
The app and the test seem to be OK, since on-prem this works fine and the performance is decent as well.
I wanted to check whether others have faced this, or whether something else needs to be done when running on Kubernetes in a public cloud. For example, I read somewhere that nodes need to be pinned to the host in a cloud/virtual environment, but that it's not mandatory.
TIA

How to check about Redshift maintenance windows

As usual, our Redshift maintenance window is set to Saturday morning, and we got several errors during that window.
* Query Processing Error AM5:07:01
[Amazon](500051) ERROR processing query/statement. Error: Query execution failed
[SQL State=HY000, DB Errorcode=500051]
* Connection Error AM5:07:27.79
[Amazon](500150) Error setting/closing connection: Connection refused: connect.
I guess that's due to Redshift internal maintenance.
How can I find evidence on the Redshift side to prove that? I checked svl_qlog for rows with aborted=1, but couldn't find a conclusive one.
And is there any way to make the maintenance window skip when a user session is running?
--
Thanks to the useful information from Schepo and Bill, we could prove that the connection error was due to a reboot in the Redshift maintenance window.
We also checked the Redshift events in the console to see exactly when the reboot started and ended.
Probably the best way to check if the connection errors were due to Redshift maintenance would be to check the Maintenance tab in your cluster configuration. In the example screenshot below, it's some time between 06:30 and 07:00 am every Wednesday.
There's no way to stop it happening while user sessions are connected, although you do have the option of deferring all maintenance for up to 45 days if you need to (follow the Edit button on the same screen).
For evidence, you can check the audit log of past maintenance events by looking in the AWS Config service under the "timeline" of your cluster. Follow the View Config Timeline button to open AWS Config for that cluster. In the below example screenshot you can see the exact time (08:49:20) of one maintenance window in the past.
Another way to document that the maintenance window was used is to check the "healthy" dashboard metric on the console or in CloudWatch. If the cluster went unhealthy and then returned to healthy during the maintenance window, it is very likely that AWS performed an update on the systems.
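
Once the configured window is known, checking whether an error timestamp falls inside it is simple arithmetic. The sketch below assumes the window is written in the `ddd:hh24:mi-ddd:hh24:mi` form that Redshift uses for the cluster's preferred maintenance window, and that the error timestamp has already been converted to the same timezone as the window. The window value and the date in the usage lines are hypothetical; only the 05:07 time of day comes from the question.

```python
from datetime import datetime

DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]  # matches datetime.weekday()
MINUTES_PER_DAY = 24 * 60

def _minute_of_week(day, hhmm):
    hours, minutes = hhmm.split(":")
    return DAYS.index(day.lower()) * MINUTES_PER_DAY + int(hours) * 60 + int(minutes)

def in_maintenance_window(window, ts):
    """Return True if ts falls inside window ('ddd:hh24:mi-ddd:hh24:mi').

    ts must be in the same timezone as the window definition.
    """
    start_text, end_text = window.split("-")
    start = _minute_of_week(*start_text.split(":", 1))
    end = _minute_of_week(*end_text.split(":", 1))
    now = ts.weekday() * MINUTES_PER_DAY + ts.hour * 60 + ts.minute
    if start <= end:
        return start <= now < end
    # Window wraps around the end of the week (for example sun -> mon).
    return now >= start or now < end

# Hypothetical window and date; 2021-01-02 was a Saturday.
window = "sat:05:00-sat:05:30"
print(in_maintenance_window(window, datetime(2021, 1, 2, 5, 7)))  # prints True
```

A match only shows that the error happened during the window. The cluster events or the AWS Config timeline mentioned above are still needed to confirm that a reboot actually took place.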

What to do when a Google Cloud SQL postgres server upgrade (to a bigger machine) takes considerably longer than "a few minutes"?

We upgraded our Google Cloud SQL Postgres server to a bigger machine and the upgrade is not terminating. In our experience, this usually takes less than 5 minutes, but we've been waiting for about 1.5 hours now and nothing is happening. There are no logs after the server shut down (except for failed connection attempts). We cannot switch to the failover, because there is already an operation in progress (namely the upgrade that's causing the problem in the first place). Restarting is disabled because the operation is in progress. It seems like there's nothing we can do right now, except maybe apply the last backup, though we're not sure if that's even possible while an operation is in progress.
Is there anything we can do to restart the DB or fix the problem?
When you upgrade a Cloud SQL server, the instance is rebooted. Occasionally the reboot takes longer than expected, which seems to be what happened to your server, but this is not unexpected behaviour.
That said, be sure to check the status of the Cloud SQL service. If upgrades get stuck too often or never finish, contact support.
To reduce the chances of having this issue again:
* Configure High Availability for your instance, so it has failover capability.
* Make sure that the maintenance window of failover replicas is different from that of the master instance. To change the maintenance schedule, on the GCP console, go to SQL, click on an instance, and "Edit maintenance schedule"->"Set maintenance schedule". Then choose a window.

RDS snapshot restore taking too long

As part of our blue-green deployment strategy, we snapshot the prod RDS instance, restore the snapshot into a new instance, apply DB migrations to it, and link the new Green application to it.
Our RDS instance has 100 GB of storage, but our DB uses only 10 MB at the moment.
Taking a snapshot takes less than 2 minutes.
Restoring from the snapshot takes 25 minutes!
25 minutes for the restore is too long, considering that users are forced to stay in read-only mode for the whole period and that our DB size is less than 10 MB at the moment.
I am wondering if this restore time is the usual time for Amazon RDS or if we are doing something wrong.
Amazon RDS Postgres.
Multi AZ: Yes
Instance Class: Medium
General Purpose (SSD)
IOPS: disabled.
After some experimentation we were able to reduce the restore time from 25 minutes to 5 minutes. RDS first restores the snapshot (in our case this took 5 minutes) and afterwards applies the Multi-AZ change to the new instance (this was taking about 20 minutes).
Previously we waited for the DB to finish the Multi-AZ change and reach status="available" before continuing with our deployment. After we contacted AWS, they confirmed that it is safe to start using the new instance even while it is being modified to apply the Multi-AZ change. So we now continue our deployment process as soon as the restored instance's status changes from "creating" to "modifying".
As has correctly been said, this solution might not scale very well, but at the moment this is not a concern, as we are not expecting this DB to grow significantly.
We consider this approach to be very safe, as any DB schema changes won't affect the live DB and we can safely test the whole Green stack before switching to prod. The only caveat is that the application needs to be in read-only mode so as not to lose information between the blue and green environments.
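
The "continue once the status leaves creating" step can be sketched as a small polling helper. This is an illustration under assumptions, not our actual deployment script: `wait_until_usable` and its parameters are hypothetical names, and the status is supplied by a caller-provided function. With boto3, for example, that function could read the `DBInstanceStatus` field from a `describe_db_instances` response.

```python
import time

def wait_until_usable(get_status, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll until the restored instance reports 'modifying' or 'available'.

    get_status: zero-argument callable returning the current status string.
    Returns the status that ended the wait; raises TimeoutError otherwise.
    """
    usable = {"modifying", "available"}
    waited = 0
    while True:
        status = get_status()
        if status in usable:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"instance still '{status}' after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds

# Usage with a fake status source, so the sketch runs without AWS access.
fake_statuses = iter(["creating", "creating", "modifying"])
print(wait_until_usable(lambda: next(fake_statuses), sleep=lambda s: None))  # prints modifying
```

Passing the status source and the sleep function as arguments keeps the helper testable without touching a real instance.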

Google Cloud SQL: Periodic Read Spikes Associated With Loss of Connectivity

I have noticed that my Google Cloud SQL instance is losing connectivity periodically and it seems to be associated with some read spikes on the Cloud SQL instance. See the attached screenshot for examples.
The instance is mostly idle, but my application recycles connections in the connection pool every 60 seconds, so this is not a wait_timeout issue. I have verified that the connections are recycled. Also, it occurred twice in 30 minutes and the wait_timeout is 8 hours.
I would suspect a backup process but you can see from the screenshot that no backups have run.
The first instance lasted 17 seconds from the time the connection loss was detected until it was reestablished. The second was only 5 seconds, but given that my connections are idle for 60 seconds the actual downtime could be up to 1:17 and 1:05 respectively. They occurred at 2014-06-05 15:29:08 UTC and 2014-06-05 16:05:32 UTC respectively. The read spikes are not initiated by me. My app continued to be idle during the issue so this is some sort of internal GCS process.
This is not a big deal for my idle app, but it will become a big deal when the app is not idle.
Has anyone else run into this issue? Is this a known issue with Google Cloud SQL? Is there a known fix?
Any help would be appreciated.
****Update****
The root cause of the symptoms above has been identified as a restart of the MySQL instance. I did not restart the instance and the operations section of the web console does not list any events at that time, so now the question becomes: what would cause the instance to restart twice in 30 minutes? Why would a production database instance restart, period?
That was caused by one of our regular releases. Because of the way the update takes place, an instance might be restarted more than once during the push to production.
Was your instance restarted? During a restart, spinning the instance down and up triggers reads and writes.
That may be one reason why you are seeing the read/write activity.