Is there a way to automate the monitoring and termination of AWS ECS tasks that are silently hanging?

I've been using AWS Fargate for quite a while and have been a big fan of the service.
This week, I created a monitoring dashboard that details the latest runtimes of my containers, and the timestamp watermark of each of my tables (the MAX date updated value). I have SNS topics set up to email me whenever a container exits with code 1.
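For reference, the watermark part of that dashboard boils down to a MAX query per table; a minimal sketch, with placeholder connection details and table/column names:

    import os
    import psycopg2  # assumes the RDS instance is Postgres

    # DATABASE_URL and the table/column names below are placeholders.
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    with conn, conn.cursor() as cur:
        cur.execute("SELECT MAX(date_updated) FROM my_schema.my_table;")
        (watermark,) = cur.fetchone()
        print(f"my_schema.my_table watermark: {watermark}")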
However, I encountered a tricky issue today that slipped past these safeguards because of what I suspect was a deadlock situation related to a Postgres RDS instance.
I have several tasks running at different points in the day on a scheduler (usually every X or Y hours). Most of these tasks will perform some business logic calculations and insert / update an RDS instance.
One of my tasks (as I found when checking the CloudWatch logs later) was stuck making an update to a table and basically just hung there waiting. My guess is that a user (perhaps me) was manually running a small update statement against the same table, triggering some sort of lock that blocked my task's update.
Because my tasks run on a recurring schedule, another container for the same task was provisioned a few hours later, attempted to update the same table, and also hung.
I only noticed this issue because my monitoring dashboard showed that the date-updated watermark was still a few days in the past, even though I hadn't received any alerts or notifications of errors during my container runs. By this time, I had three containers running, each stuck on the same update to the same table.
After I logged into the ECS console, I saw that my cluster had 3 task instances running - all the same task, all stuck making the same insert.
So my questions are:
Is there a way to specify a maximum runtime for these tasks (i.e. if a task doesn't finish within 2 hours, terminate it with an exit code of 1)? One direction I'm considering is sketched below.
What is the best way to prevent this type of "silent failure" in the future? I've added application logic that queries my RDS instance for blocked process IDs and skips the update if it finds any blocked PIDs (roughly the second sketch below). But are there more graceful ways of detecting and handling this issue?
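On the first question: ECS doesn't seem to offer a per-task maximum runtime for standalone scheduled tasks, so what I'm considering is a small watchdog (e.g. a Lambda on its own EventBridge schedule) that stops any task older than a threshold. A minimal boto3 sketch; the cluster name and the 2-hour limit are placeholders, not AWS defaults:

    from datetime import datetime, timedelta, timezone
    import boto3

    CLUSTER = "my-cluster"            # placeholder cluster name
    MAX_RUNTIME = timedelta(hours=2)  # placeholder threshold

    ecs = boto3.client("ecs")

    def stop_stale_tasks():
        """Stop every RUNNING task that started more than MAX_RUNTIME ago."""
        now = datetime.now(timezone.utc)
        for page in ecs.get_paginator("list_tasks").paginate(
                cluster=CLUSTER, desiredStatus="RUNNING"):
            if not page["taskArns"]:
                continue
            for task in ecs.describe_tasks(
                    cluster=CLUSTER, tasks=page["taskArns"])["tasks"]:
                started = task.get("startedAt")  # absent until the container starts
                if started and now - started > MAX_RUNTIME:
                    ecs.stop_task(cluster=CLUSTER, task=task["taskArn"],
                                  reason="exceeded max runtime watchdog")

    if __name__ == "__main__":
        stop_stale_tasks()

Note that StopTask sends SIGTERM (then SIGKILL) rather than forcing an exit code of 1, so the SNS alert may need to key off the stop reason instead; and if the tasks were orchestrated through Step Functions, a state-level TimeoutSeconds would be a more managed alternative.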
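And for reference, the blocked-PID check I added, plus the session timeouts that would make a stuck UPDATE fail loudly instead of hanging, look roughly like this. A sketch only: the DATABASE_URL variable and the timeout values are my own choices, and pg_blocking_pids() requires Postgres 9.6+:

    import os
    import psycopg2

    conn = psycopg2.connect(os.environ["DATABASE_URL"])  # placeholder DSN
    with conn, conn.cursor() as cur:
        # Fail fast instead of hanging: abort any statement that waits on a
        # lock for more than 30s, and cap total statement time at 30 minutes.
        cur.execute("SET lock_timeout = '30s';")
        cur.execute("SET statement_timeout = '30min';")

        # Look for backends that are currently blocked by another session.
        cur.execute("""
            SELECT pid, pg_blocking_pids(pid) AS blocked_by, query
            FROM pg_stat_activity
            WHERE cardinality(pg_blocking_pids(pid)) > 0;
        """)
        blocked = cur.fetchall()
        if blocked:
            print(f"Skipping update, blocked sessions: {blocked}")
            raise SystemExit(1)  # exit 1 so the existing SNS alert fires

With the timeouts set, a blocked UPDATE raises an error (and the container exits non-zero) instead of sitting there for days.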

Related

Cloud SQL MySQL - stuck in "Failover operation in progress"

My Cloud SQL MySQL 5.7.37 highly available instance is stuck in a "Failover operation in progress. This may take a few minutes. While this operation is running, you may continue to view information about the instance" state. It is a fairly small database; it has been stuck like this for 5 hours, and failover is not available, so no DB queries can be executed. As a result, our system is currently down.
No commands can be executed on the DB since it is in an updating process; the error log is empty, and the operations log contains only this update and successful backups.
Does anyone have any suggestions? I am not paying for Google Support, so I can't get support directly from them (which I think is terrible, since this is a fully managed service).
Best,
Carl-Fredrik

Azure Data Sync stops working between two Azure SQL Databases (without showing an error message)

I use Azure Data Sync to synchronize two tables between two Azure SQL Databases. The synchronization always works well for 6-7 days, but it seems that every time I make a new deployment in the release pipeline (in Azure DevOps), without changing anything in those tables, the sync group stops working.
The sync group status still shows "Good" even when the synchronization isn't working. I also tried starting the synchronization manually by clicking the Sync button, but nothing happens.
Now I always have to delete the sync group when it stops working and create a new one for the same tables to make everything work again.
I would like to know why the synchronization stops working.
Has anyone run into the same problem, or does anyone know how to fix it?

Issues with matching service in Cadence

Two days ago, we started seeing some issues with our Cadence setup.
The first thing we noticed is that open workflows were not disappearing from the list once they completed. For example, this workflow appears as Open in the list:
But when you click on it, you will see that it’s actually completed:
At the same time this started happening, we noticed that several workflows were taking quite a long time to complete; several of them would get stuck in the "Scheduled" state and never progress further. After checking the logs, the only error we saw was this:
{"level":"error","ts":"2021-03-06T19:12:04.865Z","msg":"Persistent store operation failure","service":"cadence-matching","component":"matching-engine","wf-task-list-name":"cadence-sys-history-scanner-tasklist-0","wf-task-list-type":1,"store-operation":"create-task","error":"InternalServiceError{Message: CreateTasks operation failed. Error : Request on table cadence.tasks with ttl of 630720000 seconds exceeds maximum supported expiration date of 2038-01-19T03:14:06+00:00. In order to avoid this use a lower TTL, change the expiration date overflow policy or upgrade to a version where this limitation is fixed. See CASSANDRA-14092 for more details.}","wf-task-list-name":"cadence-sys-history-scanner-tasklist-0","wf-task-list-type":1,"number":6300094,"next-number":6300094,"logging-call-at":"taskWriter.go:176","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/service/matching.(*taskWriter).taskWriterLoop\n\t/cadence/service/matching/taskWriter.go:176"}
Does somebody have an idea of why this is happening?
The first one is because visibility sampling is enabled by default (to protect the default core DB). You can disable it by configuring system.enableVisibilitySampling to false.
But when you do that, it's better to separate the visibility store and the default store into different database clusters, so that visibility load doesn't bring down the default (core data model) DB.
See more in https://github.com/uber/cadence/issues/3884
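If you use the file-based dynamic config, the override would look something like this; a sketch only, since the exact file path depends on your deployment:

    # e.g. config/dynamicconfig/development.yaml (path depends on your setup)
    system.enableVisibilitySampling:
      - value: false
        constraints: {}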
The second is a bug that was fixed in 0.16.0; it should be resolved if you upgrade the server.
See https://github.com/uber/cadence/pull/3627
and https://docs.datastax.com/en/dse-trblshoot/doc/troubleshooting/recoveringTtlYear2038Problem.html

Auto updates causing 100% CPU utilization

We have set up a Google Cloud SQL server in the asia-south1 zone. The issue is that Google triggers auto updates on the GCP Cloud SQL instance during IST day hours, which drives the server to 100% CPU utilization and causes system downtime.
Is there a way to block these updates during IST day hours and receive them only in non-critical (night) hours?
I increased vCPUs from 1 to 2; however, that didn't help.
According to the official documentation link:
"If you do not set the Maintenance timing setting, Cloud SQL chooses the timing of updates to your instance (within its Maintenance window, if applicable)."
You can set the Maintenance window day and time and choose an "Any" Order of update, and the updates will then be triggered during the maintenance window.
As per Cloud SQL best practices, it is recommended to configure a maintenance window for your primary instance. With a maintenance window you can control when maintenance restarts are performed, and with Maintenance timing you can specify whether an instance gets updates earlier or later than other instances in your project.
Maintenance window
The day and hour when disruptive updates (updates that require an instance restart) to this Cloud SQL instance can be made. If the maintenance window is set for an instance, Cloud SQL does not initiate a disruptive update to that instance outside of the window. The update is not guaranteed to complete before the end of the maintenance window, but restarts typically complete within a couple of minutes.
Maintenance timing
This setting lets you provide a preference about the relative timing of instance updates that require a restart. Receiving updates earlier lets you test your application with an update before your instances that get the update later.
The relative timing of updates is not observed between projects; if you have instances with an earlier timing setting in a different project than your instances with a later timing setting, Cloud SQL makes no attempt to update the instances with the earlier timing setting first.
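For example, the window can be set from the CLI as well as the console; a sketch with gcloud, where the instance name, day, and hour are placeholders (the hour is specified in UTC, so pick one that falls at night in IST):

    # Placeholder instance name; hour is in UTC (22:00 UTC = 03:30 IST)
    gcloud sql instances patch my-instance \
      --maintenance-window-day=SAT \
      --maintenance-window-hour=22

The "Order of update" (maintenance timing) preference can then be chosen in the console as described above.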
Maintenance runs once every few months, which means that every few months the database will be unresponsive for a time while it is being maintained.
Is it possible to cancel maintenance, or to run maintenance manually?

MS CRM recursive workflow and performance

I’m about to write a workflow in CRM that calls itself every day. This is a recursive workflow.
It will run on half a million entities each day and deactivate any record that has not been updated in the past 3 days.
I'm worried about performance. Has anyone else done this?
I haven't personally implemented anything like this, but that's 500,000 records that are floating around in the DB that the async service has to keep track of, which is going to tax your hardware. In addition, CRM keeps track of recursive workflow instances. I don't have the exact specs in front of me, but if a workflow calls itself a set number of times within a certain timeframe, CRM will kill the workflow.
Could you just write a console app that asks the Crm Service for records that haven't been updated in three days, and then deactivate them? Run it as a scheduled task once a day, and then your CRM system doesn't have the burden of keeping track of all those running workflow instances.
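For illustration, a rough sketch of that scheduled job, written here against the modern Dynamics 365 Web API rather than the CRM 4.0 SOAP service the question is about; the org URL, the accounts entity, and the bearer token are placeholders:

    from datetime import datetime, timedelta, timezone
    import requests

    BASE = "https://myorg.api.crm.dynamics.com/api/data/v9.2"  # placeholder org
    HEADERS = {"Authorization": "Bearer <token>",  # placeholder auth token
               "Content-Type": "application/json"}

    cutoff = (datetime.now(timezone.utc) - timedelta(days=3)) \
        .strftime("%Y-%m-%dT%H:%M:%SZ")

    # Fetch only the records we intend to deactivate (keep the query narrow,
    # mirroring the Query Expression advice below).
    resp = requests.get(
        f"{BASE}/accounts",
        params={"$select": "accountid",
                "$filter": f"modifiedon lt {cutoff} and statecode eq 0"},
        headers=HEADERS)
    resp.raise_for_status()

    for record in resp.json()["value"]:
        # statecode 1 / statuscode 2 = Inactive for the account entity.
        requests.patch(
            f"{BASE}/accounts({record['accountid']})",
            json={"statecode": 1, "statuscode": 2},
            headers=HEADERS).raise_for_status()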
EDIT: Ah, I see now you might have been thinking of one workflow that runs on all the records as opposed to workflows running on each record. benjynito's advice makes sense if you go this route, although I still think a scheduled task would be more appropriate than using workflow.
You'll want to make sure your workflow is running in non-peak hours. Assuming you have an on-premise installation you should be able to get away with that. If you're using a hosted instance, you might be worried about one organization running the workflow while another organization is using the system. Use the timeout and maybe a custom workflow activity, if necessary, to force the start time to a certain period.
I'm assuming you'll be as efficient as possible in figuring out which records to deactivate. (i.e. Query Expression would only bring back the records you'll be deactivating).
The built-in infinite loop-protection offered by CRM shouldn't kill your workflow instances. It stops after a call depth of 8, but it resets to 1 if no calls are made for an hour. So the fact that you're doing this once a day should make you OK on the recursive workflow front.