Auto updates causing 100% CPU utilization - google-cloud-sql

We have set up a Google Cloud SQL server in the asia-south1 region. The issue is that Google triggers automatic updates on the Cloud SQL instance during IST daytime hours, which drives the server to 100% CPU utilization and causes system downtime.
Is there a way to block these updates during IST daytime hours and receive them only during non-critical (night) hours?
We increased vCPUs from 1 to 2, but that didn't help.

According to the official documentation:
"If you do not set the Maintenance timing setting, Cloud SQL chooses the timing of updates to your instance (within its Maintenance window, if applicable)."
You can set the Maintenance window day and time and choose the Order of update "Any", and the updates will then be triggered during the Maintenance window.
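For reference, here is a minimal sketch of setting those values programmatically with the Cloud SQL Admin API and its Python client; the project and instance names are placeholders, and the specific day/hour is an assumption chosen to fall outside IST business hours:

from googleapiclient import discovery

# Assumes the google-api-python-client package and Application Default Credentials.
sqladmin = discovery.build("sqladmin", "v1beta4")

body = {
    "settings": {
        # Day of week 1-7 (Monday = 1) and hour 0-23, both interpreted in UTC.
        # 20:00 UTC is roughly 01:30 IST, i.e. a non-critical night hour.
        "maintenanceWindow": {"day": 7, "hour": 20, "updateTrack": "stable"},
    }
}

operation = sqladmin.instances().patch(
    project="my-project", instance="my-instance", body=body
).execute()
print(operation["status"])

The same settings can also be applied from the gcloud CLI with the --maintenance-window-day, --maintenance-window-hour and --maintenance-release-channel flags of gcloud sql instances patch.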

As per Cloud SQL best practices, it is recommended to configure a maintenance window for your primary instance.
With a maintenance window you can control when maintenance restarts are performed. You can also specify whether an instance gets updates earlier or later than other instances in your project with the Maintenance timing setting.
Maintenance window
The day and hour when disruptive updates (updates that require an instance restart) to this Cloud SQL instance can be made. If the maintenance window is set for an instance, Cloud SQL does not initiate a disruptive update to that instance outside of the window. The update is not guaranteed to complete before the end of the maintenance window, but restarts typically complete within a couple of minutes.
Maintenance timing
This setting lets you provide a preference about the relative timing of instance updates that require a restart. Receiving updates earlier lets you test your application with an update before your instances that get the update later.
The relative timing of updates is not observed between projects; if you have instances with an earlier timing setting in a different project than your instances with a later timing setting, Cloud SQL makes no attempt to update the instances with the earlier timing setting first.
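To verify what is currently configured (and, where the API exposes it, whether any maintenance has already been announced), you can read the instance back. This is a hedged sketch using the same Python client with placeholder names; the scheduledMaintenance field may be absent when nothing is planned:

from googleapiclient import discovery

sqladmin = discovery.build("sqladmin", "v1beta4")
instance = sqladmin.instances().get(
    project="my-project", instance="my-instance"
).execute()

print(instance["settings"].get("maintenanceWindow"))  # configured window, if any
print(instance.get("scheduledMaintenance"))           # upcoming maintenance, if announced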

Maintenance happens once every few months, which means that every few months the database will be unresponsive once while it is being maintained.
Is it possible to cancel maintenance, or to run it manually?

Related

Issues with matching service in Cadence

Two days ago, we started seeing some issues with our Cadence setup.
The first thing we noticed is that open workflows were not disappearing from the list once they completed. For example, this workflow appears as Open in the list:
But when you click on it, you will see that it’s actually completed:
At the same time this started happening, we noticed that several workflows would take quite a long time to complete; several of them would get stuck in the "Schedule" state and never progress from there. After checking the logs, the only error we saw was this:
{"level":"error","ts":"2021-03-06T19:12:04.865Z","msg":"Persistent store operation failure","service":"cadence-matching","component":"matching-engine","wf-task-list-name":"cadence-sys-history-scanner-tasklist-0","wf-task-list-type":1,"store-operation":"create-task","error":"InternalServiceError{Message: CreateTasks operation failed. Error : Request on table cadence.tasks with ttl of 630720000 seconds exceeds maximum supported expiration date of 2038-01-19T03:14:06+00:00. In order to avoid this use a lower TTL, change the expiration date overflow policy or upgrade to a version where this limitation is fixed. See CASSANDRA-14092 for more details.}","wf-task-list-name":"cadence-sys-history-scanner-tasklist-0","wf-task-list-type":1,"number":6300094,"next-number":6300094,"logging-call-at":"taskWriter.go:176","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/service/matching.(*taskWriter).taskWriterLoop\n\t/cadence/service/matching/taskWriter.go:176"}
Does somebody have an idea of why this is happening?
The first one is because visibility sampling is enabled by default (to protect the default core DB). You can disable it by setting system.enableVisibilitySampling to false.
When you do that, it's better to separate the visibility store and the default store into different database clusters, so that visibility load doesn't bring down the default (core data model) DB.
see more in https://github.com/uber/cadence/issues/3884
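If you are running Cadence with the file-based dynamic config client, the override would look roughly like the snippet below; the key name comes from the answer above, while the file location and any constraints depend on your deployment:

system.enableVisibilitySampling:
  - value: false
    constraints: {}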
The second is a bug fixed in 0.16.0; it should be resolved if you upgrade the server.
See https://github.com/uber/cadence/pull/3627
and https://docs.datastax.com/en/dse-trblshoot/doc/troubleshooting/recoveringTtlYear2038Problem.html

Starting and Stopping PostgreSQL Amazon RDS Instance Automatically Based on Usage

We're a team of 4 data scientists that use Amazon RDS PostgreSQL for analysis purposes, so we're looking for a way to start and stop the instance automatically, but based on usage as opposed to time.
For example, there are clearly solutions for starting and stopping automatically during regular business hours (Stopping an Amazon RDS DB Instance Temporarily).
However, this doesn't quite work for us because we all have different schedules and don't necessarily adhere to a standard one. I would like a script that checks whether the DB has been used in the past, say, 30 minutes, and if not, turns off the instance. Then, if someone tries to connect to the DB while it's turned off, it should automatically turn back on. My intuition tells me that the latter is harder than the former, but I'm not sure. Is this possible?
To do this you would need a CloudWatch alarm that relies on metrics available to CloudWatch, such as the number of connections or CPU utilization.
This alarm could trigger a Lambda function that stops your RDS instance. Be aware that a stopped RDS instance is automatically restarted after 7 days.
Alternatively, if you're able to use it, you could look into Aurora Serverless with the PostgreSQL-compatible version; this option automatically handles the stop/start functionality when no one is using the database.
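As a rough illustration of the Lambda side (not a complete solution), this sketch assumes the alarm, for example on the DatabaseConnections metric, invokes the function, and that the instance identifier is a placeholder:

import boto3

rds = boto3.client("rds")
DB_INSTANCE_ID = "analytics-postgres"  # placeholder identifier

def handler(event, context):
    # Stop the instance only if it is currently running; stopping an
    # instance that is not in the "available" state raises an error.
    instance = rds.describe_db_instances(
        DBInstanceIdentifier=DB_INSTANCE_ID
    )["DBInstances"][0]
    if instance["DBInstanceStatus"] == "available":
        rds.stop_db_instance(DBInstanceIdentifier=DB_INSTANCE_ID)
    return {"previousStatus": instance["DBInstanceStatus"]}

The "turn it back on when someone connects" half is harder, since a stopped instance cannot accept the connection that would signal demand; start_db_instance would have to be triggered by something outside the database itself.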

Google Cloud SQL Instance stuck in "Instance is being updated"

We resized a Google Cloud SQL instance, and it's been stuck in the status "Instance is being updated. This may take a few minutes" for at least two hours now.
An update process runs when you resize a Google Cloud SQL instance. If it is still stuck after a few hours, you have to open a support ticket.
Once you open a support case, a Cloud SQL specialist can cancel the update process and your instance will be operational again.
To avoid this kind of disruptive update at arbitrary times of day, enable the maintenance window so that these updates happen only during that window.

Is there a way to automate the monitoring and termination of AWS ECS tasks that are silently progressing?

I've been using AWS Fargate for quite a while and have been a big fan of the service.
This week, I created a monitoring dashboard that details the latest runtimes of my containers, and the timestamp watermark of each of my tables (the MAX date updated value). I have SNS topics set up to email me whenever a container exits with code 1.
However, I encountered a tricky issue today that slipped past these safeguards because of what I suspect was a deadlock situation related to a Postgres RDS instance.
I have several tasks running at different points in the day on a scheduler (usually every X or Y hours). Most of these tasks will perform some business logic calculations and insert / update an RDS instance.
One of my tasks (when I checked the CloudWatch logs later) was stuck making an update to a table, and basically just hung there waiting. My guess is that a user (perhaps me) was manually running a small update statement against the same table, triggering some sort of lock.
Because my tasks run on a recurring schedule, the same task had another container provisioned a few hours later, which attempted to update the same table and also hung.
I only noticed this issue because my monitoring dashboard showed that the date updated watermark was still a few days in the past, even though I hadn't gotten any alerts or notifications for errors during my container run time. By this time, I had 3 containers all running, each stuck on the same update to the same table.
After I logged into the ECS console, I saw that my cluster had 3 task instances running - all the same task, all stuck making the same insert.
So my questions are:
Is there a way to specify a maximum runtime for these tasks (i.e. if a task doesn't finish within 2 hours, terminate it with an exit code of 1)?
What is the best way to prevent this type of "silent failure" in the future? I've added application logic that runs a query against my RDS instance to check for blocked process IDs, and if it notices any blocked PIDs it skips the update. But are there any more graceful ways of detecting and handling this issue?
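For the first question, I haven't found a task-definition parameter that caps total runtime, so one approach I'm considering is a small scheduled function that stops any RUNNING task older than a limit; the cluster name and threshold below are placeholders:

from datetime import datetime, timedelta, timezone
import boto3

ecs = boto3.client("ecs")
CLUSTER = "my-cluster"            # placeholder cluster name
MAX_RUNTIME = timedelta(hours=2)  # placeholder threshold

def handler(event, context):
    now = datetime.now(timezone.utc)
    task_arns = ecs.list_tasks(cluster=CLUSTER, desiredStatus="RUNNING")["taskArns"]
    if not task_arns:
        return
    # describe_tasks accepts up to 100 ARNs per call; paginate for larger clusters.
    for task in ecs.describe_tasks(cluster=CLUSTER, tasks=task_arns)["tasks"]:
        started_at = task.get("startedAt")
        if started_at and now - started_at > MAX_RUNTIME:
            ecs.stop_task(cluster=CLUSTER, task=task["taskArn"],
                          reason="Exceeded maximum runtime")

For reference, a blocked-PID check like the one described can be built on pg_blocking_pids(pid) joined against pg_stat_activity, which reports which sessions are waiting on which blockers.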

Google Cloud SQL very slow from time to time

It's been almost 3 months since I switched my platform to Google Cloud (Compute Engine + Cloud SQL + Cloud Storage).
I am very happy with it, but from time to time I notice big latency on the Cloud SQL server. My VMs from Compute Engine and my Cloud SQL instance are all in the same location (us-1) datacenter.
Since my Java backend makes a lot of SQL queries to generate a server response, the response times may vary from 250-300ms (normal) up to 2s!
In the console, I notice absolutely nothing: no CPU peaks, no read/write peaks, no backup running, nothing. No alert. Last time it happened, it lasted for a few days and then the response times suddenly became better than ever.
I am pretty sure Google works on the infrastructure behind the scenes... but there is no way to confirm that.
So here's my questions:
Has anybody else ever noticed the same kind of problem?
It is really annoying for me because my web pages get very slow and I have absolutely no control over it. Plus, I lose a lot of time because I generally never suspect a hardware problem or maintenance first, but instead something that we introduced in our app. Is it normal or do I have a problem on my SQL instance?
Is there anywhere I can get visibility into what Google is doing on the hardware? I know there are maintenance alerts, but for my zone the list always seems empty when this happens.
The only option I have for now is to wait and that is really not acceptable.
I suspect that Google does some sort of IO throttling and their algorithm is not very sophisticated. We have a build server which slows down to a crawl if we do more than two builds within an hour. The build that normally takes 15 minutes will run for more than an hour and we usually terminate it and re-run manually later. This question describes a similar problem and the recommended solution is to use larger volumes as they come with more IO allowance.