Rate for operation ChangeResourceRecordSets exceeded [closed] - amazon-route53

I am trying to delete a record set in the Route 53 console (web interface), but I get this error:
Rate for operation ChangeResourceRecordSets exceeded
I tried deleting the record set via API, but I get the same error. Which limit have I exceeded?

Have a look at https://status.aws.amazon.com.
At the moment (Mar 14, 2017 PDT) it shows an ongoing incident for Route 53:
4:44 PM PDT We are investigating slow propagation of DNS edits to the Route 53 DNS servers. This does not impact queries to existing DNS records.
5:11 PM PDT We continue to investigate slow propagation of DNS edits to the Route 53 DNS servers. This does not impact queries to existing DNS records.
6:34 PM PDT We have identified root cause of the slow propagation of DNS edits to the Route 53 DNS servers and are working towards recovery. This does not impact queries to existing DNS records.
7:40 PM PDT We continue to experience slow propagation times and continue to work towards full recovery. This does not impact queries to existing DNS records.
10:12 PM PDT We continue to work on resolving the slow propagation times. Requests to the ChangeResourceRecordSets API are currently being throttled. Queries to existing DNS records remain unaffected.
Mar 14, 12:22 AM PDT While changes are propagating, we continue to work through the backlog of pending changes that have accumulated. We expect full recovery to take several more hours. We have also throttled ChangeResourceRecordSets API call. Queries to existing DNS records remain unaffected
As the last statement says, AWS is throttling ChangeResourceRecordSets calls, which is the API used for every change to DNS records (including deletes). That is why you see the error.
Mar 14, 1:40 AM PDT Record changes are slowly propagating, while we work through the backlog of pending changes that have accumulated. We still expect full recovery to take several more hours. We are continuing to throttle ChangeResourceRecordSets API calls. Queries to existing DNS records remain unaffected.
Mar 14, 3:01 AM PDT Record changes are still propagating, while we work through the backlog of pending changes that have accumulated. We expect full recovery to take several more hours. We are continuing to throttle ChangeResourceRecordSets API calls. Queries to existing DNS records remain unaffected.
Mar 14, 4:07 AM PDT All outstanding DNS record changes have completed propagating. ChangeResourceRecordSets API calls are still being throttled. Queries to existing DNS records remain unaffected.
Mar 14, 5:12 AM PDT ChangeResourceRecordSets API calls are still being throttled while we continue to recover. Queries to existing DNS records remain unaffected.
07:12 AM PDT We continue to throttle some ChangeResourceRecordSets API calls as we make progress towards recovery. Queries to existing DNS records remain unaffected.
07:53 AM PDT We are continuing to throttle some ChangeResourceRecordSets API calls while we work towards full recovery. Retries for throttled requests should succeed. Queries to existing DNS records remain unaffected.
10:30 AM PDT We continue to throttle some ChangeResourceRecordSets API calls while we make progress towards recovery. Retries for throttled requests should be successful. Queries to existing DNS records remain unaffected.
It might take some more time until everything has recovered again. There should however be nothing wrong with your account or DNS setup.
Update as of Mar 14 2:54 PM PDT. All throttling of the Route 53 processes has been removed and service has been restored. This incident took about 20 hours.
Mar 14, 1:11 PM PDT We continue to remove throttling for the ChangeResourceRecordSets API as we continue towards recovery. At this stage, many customers are seeing recovery as DNS updates complete successful. For those customers that are still experiencing throttling, we continue to recommend retrying API requests or making use of change batches http://docs.aws.amazon.com/Route53/latest/APIReference/API_ChangeResourceRecordSets.html to update multiple DNS records in a single request. Queries to existing DNS records remain unaffected.
Mar 14, 2:54 PM PDT We have removed throttling for the ChangeResourceRecordSets API and are seeing recovery. All DNS update operations are now completing successfully. Queries to existing DNS records were not affected. The issue has been resolved and the service is operating normally.
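The status updates recommend retrying throttled requests. A minimal sketch of retrying with exponential backoff and jitter follows. `ThrottledError` and `retry_with_backoff` are hypothetical names used for illustration only; in a real client you would catch whatever exception your SDK raises for throttling and pass in the function that submits your change batch.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a 'Rate ... exceeded' throttling error."""


def retry_with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call `call()` until it succeeds or max_attempts is reached."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the throttling error
            # Exponential backoff capped at max_delay, with full jitter so
            # that many clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

Putting several record changes into one change batch, as the status page suggests, also reduces the number of calls that can be throttled in the first place.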


Jobrunr background server stopped polling

I had a number of jobs scheduled, but it seems none of them were running. On further debugging, I found that there are no available servers, and in the jobrunr_backgroundjobservers table there has not been a heartbeat from any of the servers. What would cause this issue? How would I restart the heartbeat? And how would I know when such an issue occurs and the servers go down again, given that the schedules are time sensitive?
A background job server stops polling if the connection to the database is lost or the database goes down for a while.
The JobRunr Pro version adds extra features, and one of them is database fault tolerance: if such an issue occurs, JobRunr Pro goes into standby and starts processing again once the connection to the database is stable.
See https://www.jobrunr.io/en/documentation/pro/database-fault-tolerance/ for more info.
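For the "how would I know" part, one option is an external check that flags servers whose last heartbeat is too old. The sketch below is not a JobRunr API. It assumes you have already read each server's last heartbeat timestamp from the jobrunr_backgroundjobservers table into a dictionary, and the one-minute threshold is an arbitrary value chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def stale_servers(last_heartbeats, now, max_age=timedelta(minutes=1)):
    """Return ids of servers whose last heartbeat is older than max_age."""
    return sorted(
        server_id
        for server_id, beat in last_heartbeats.items()
        if now - beat > max_age
    )


# Hypothetical data: server-a reported 10 seconds ago, server-b 5 minutes ago.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
heartbeats = {
    "server-a": now - timedelta(seconds=10),
    "server-b": now - timedelta(minutes=5),
}
print(stale_servers(heartbeats, now))  # ['server-b']
```

Running a check like this on a schedule outside the application, and alerting when the list is non-empty, would surface the problem even when every background job server is down.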

Incorrect failure notification from Rundeck during fall time change

Last night was "fall back" time change for most locations in the US. I woke up this morning to find dozens of job failure notifications. Almost all of them though were incorrect: the jobs showed as having completed normally, yet Rundeck sent a failure notification for it.
Interestingly, this happened in two completely separate Rundeck installations (v2.10.8-1 and v3.1.2-20190927). The commonality is that they're both on CentOS 7 (separate servers). They're both using MariaDB, although different versions of MariaDB.
The failure emails for the jobs that finished successfully showed a negative time in the "Scheduled after" line:
#1,811,391
by admin Scheduled after 59m at 1:19 AM
• Scheduled after -33s - View Output »
• Download Output
Execution
User: admin
Time: 59m
Started: in 59m 2019-11-03 01:19:01.0
Finished: 1s ago Sun Nov 03 01:19:28 EDT 2019
Executions Success rate Average duration
100% -45s
That job actually ran in 27s at 01:19 EDT (the first 1am hour, it is now EST). Looking at the email headers, I believe I got the message at 1:19 EST, an hour after the job ran.
So that would seem to imply to me that it's just a notification problem (somehow).
But there were also a couple of jobs, triggered to follow other job executions, that failed as well, apparently because the successfully finished job returned RC 2. I'm not sure what to make of this.
We've been running Rundeck for a few years now, this is the first I remember seeing this problem. Of course my memory may be faulty--maybe we did see it previously, only there were fewer jobs affected or some such.
The fact that it impacted two different versions of Rundeck on two different servers implies either it's a fundamental issue with Rundeck that's been around for a while or it is something else in the operating system that's somehow causing problems for Rundeck. (Although time change isn't new, so that would seem to be somewhat surprising too.)
Any thoughts about what might have gone on (and how to prevent it next year, short of the obvious run on UTC) would be appreciated.
You can define a specific time zone in Rundeck; check this and this.
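To see why local wall-clock time can produce negative durations like the ones in the email, here is a small illustration using fixed UTC offsets for EDT and EST. It does not claim to reproduce what Rundeck does internally; the job times are hypothetical and only show the arithmetic around the fall-back transition.

```python
from datetime import datetime, timedelta, timezone

EDT = timezone(timedelta(hours=-4), "EDT")
EST = timezone(timedelta(hours=-5), "EST")

# A hypothetical job that starts just before the fall-back transition
# (02:00 EDT becomes 01:00 EST) and runs for 20 real seconds.
started = datetime(2019, 11, 3, 1, 59, 50, tzinfo=EDT)
finished = datetime(2019, 11, 3, 1, 0, 10, tzinfo=EST)

# Aware datetimes are compared as instants, so the duration is correct.
real_duration = finished - started

# Dropping the offsets compares wall-clock readings only, which go
# backwards across the transition and yield a negative duration.
naive_duration = finished.replace(tzinfo=None) - started.replace(tzinfo=None)

print(real_duration.total_seconds())   # 20.0
print(naive_duration.total_seconds())  # -3580.0
```

Any component that stores or compares local times without an offset is exposed to this during the repeated 1 AM hour, which is why running on UTC or pinning an explicit time zone avoids it.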

What happens when Eureka instance skips a heartbeat against a Eureka server with self preservation turned off?

Consider this set-up:
Eureka server with self preservation mode disabled i.e. enableSelfPreservation: false
2 Eureka instances each for 2 services (say service#1 and service#2). Total 4 instances.
And one of the instances (say srv#1inst#1, an instance of service#1) sent a heartbeat, but it did not reach the Eureka server.
AFAIK, the following actions take place in sequence on the server side:
ServerStep1: Server observes that a particular instance has missed a heartbeat.
ServerStep2: Server marks the instance for eviction.
ServerStep3: Server's eviction scheduler (which runs periodically) evicts the instance from registry.
Now on instance (srv#1inst#1) side:
InstanceStep1: It skips a heartbeat.
InstanceStep2: It realizes heartbeat did not reach Eureka Server. It retries with exponential back-off.
AFAIK, the eviction and registration do not happen immediately. The Eureka server runs a separate scheduler for each task periodically.
I have some questions related to this process:
Are the sequences correct? If not, what did I miss?
Is the assumption about eviction and registration scheduler correct?
An instance of service#2 requests fresh registry copy from server right after ServerStep2.
Will srv#1inst#1 be in the fresh registry copy, because it has not been evicted yet?
If yes, will srv#1inst#1 be marked UP or DOWN?
The retry request from InstanceStep2 of srv#1inst#1 reaches server right after ServerStep2.
Will there be an immediate change in registry?
How will that affect the response to the service#2 instance's request for a fresh registry? How will it affect the eviction scheduler?
This question was answered by qiangdavidliu in one of the issues of eureka's GitHub repository.
I'm adding his explanations here for sake of completeness.
Before I answer the questions specifically, here's some high level information regarding heartbeats and evictions (based on default configs):
instances are only evicted if they miss 3 consecutive heartbeats
(most) heartbeats do not retry; they are best effort every 30s. The only time a heartbeat will retry is if there is a thread-level error on the heartbeating thread (i.e. Timeout or RejectedExecution), but this should be very rare.
Let me try to answer your questions:
Are the sequences correct? If not, what did I miss?
A: The sequences are correct, with the above clarifications.
Is the assumption about eviction and registration scheduler correct?
A: The eviction is handled by an internal scheduler. The registration is processed by the handler thread for the registration request.
An instance of service#2 requests fresh registry copy from server right after ServerStep2.
Will srv#1inst#1 be in the fresh registry copy, because it has not been evicted yet?
If yes, will srv#1inst#1 be marked UP or DOWN?
A: There are a few things here:
until the instance is actually evicted, it will be part of the result
eviction does not involve changing the instance's status, it merely removes the instance from the registry
the server holds 30s caches of the state of the world, and it is this cache that's returned. So the exact result as seen by the client, in an eviction scenario, still depends on when it falls within the cache's update cycle.
The retry request from InstanceStep2 of srv#1inst#1 reaches server right after ServerStep2.
Will there be an immediate change in registry?
How will that affect the response to the service#2 instance's request for a fresh registry? How will it affect the eviction scheduler?
A: again a few things:
When the actual eviction happens, we check each evictee's time to see if it is eligible to be evicted. If an instance is able to renew its heartbeats before this event, then it is no longer a target for eviction.
The 3 events in question (evaluation of eviction eligibility at eviction time, updating the heartbeat status of an instance, generation of the result to be returned to the read operations) all happen asynchronously and their result will depend on the evaluation of the above described criteria at execution time.
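The eviction rule described in this answer can be modelled roughly as follows. This is a simplified sketch and not Eureka's actual code: the 30s interval and the 3 missed heartbeats come from the explanation above, while the function names, the registry layout, and the exact boundary condition are assumptions made for illustration.

```python
HEARTBEAT_INTERVAL_S = 30   # default heartbeat period quoted above
MISSED_BEFORE_EVICTION = 3  # consecutive misses quoted above


def is_evictable(last_renewal_s, now_s):
    """True once 3 consecutive heartbeats could have been missed."""
    return now_s - last_renewal_s >= HEARTBEAT_INTERVAL_S * MISSED_BEFORE_EVICTION


def run_eviction(registry, now_s):
    """Drop evictable instances; status fields are left untouched."""
    return {
        name: info
        for name, info in registry.items()
        if not is_evictable(info["last_renewal_s"], now_s)
    }
```

Eligibility is evaluated only when the eviction pass runs, so an instance that renews before that pass stays in the registry, and an instance that is removed never has its status changed to DOWN.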

Google Cloud SQL: Periodic Read Spikes Associated With Loss of Connectivity

I have noticed that my Google Cloud SQL instance is losing connectivity periodically and it seems to be associated with some read spikes on the Cloud SQL instance. See the attached screenshot for examples.
The instance is mostly idle, but my application recycles connections in the connection pool every 60 seconds, so this is not a wait_timeout issue. I have verified that the connections are recycled. Also, it occurred twice in 30 minutes and the wait_timeout is 8 hours.
I would suspect a backup process but you can see from the screenshot that no backups have run.
The first instance lasted 17 seconds from the time the connection loss was detected until it was reestablished. The second was only 5 seconds, but given that my connections are idle for 60 seconds the actual downtime could be up to 1:17 and 1:05 respectively. They occurred at 2014-06-05 15:29:08 UTC and 2014-06-05 16:05:32 UTC respectively. The read spikes are not initiated by me. My app continued to be idle during the issue so this is some sort of internal GCS process.
This is not a big deal for my idle app, but it will become a big deal when the app is not idle.
Has anyone else run into this issue? Is this a known issue with Google Cloud SQL? Is there a known fix?
Any help would be appreciated.
****Update****
The root cause of the symptoms above has been identified as a restart of the MySQL instance. I did not restart the instance and the operations section of the web console does not list any events at that time, so now the question becomes: what would cause the instance to restart twice in 30 minutes? Why would a production database instance restart, period?
That was caused by one of our regular releases. Because of the way the update takes place, an instance might be restarted more than once during the push to production.
Was your instance restarted? During a restart, spinning the instance down and up will trigger reads and writes.
That may be one reason why you are seeing the read/write activity.

LiveRebel Update Strategy

I am trying to use LiveRebel in my production environment. After most parts were configured, I tried to update my application from, let's say, version 1.1 to 1.3, as shown below.
Does this mean that LiveRebel requires two server installations on 2 physical IP addresses? Can I have two servers on 2 virtual IP addresses?
Rolling restarts use request routing to achieve zero downtime for the users. Sessions are first drained by waiting for old sessions to expire and routing new ones to an identical application on another server. When all sessions are drained, the application is updated while the other server handles the requests.
So, as you can see, for zero downtime you need an additional server to handle the requests while the application is updated. A full restart doesn't have that requirement, but results in downtime for users.
As for the question about IPs: as long as the two server (virtual) machines can see each other, it doesn't really make much difference.
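The rolling restart described above can be sketched as a loop that drains and updates one server at a time. This is only an illustrative model with hypothetical field names, not LiveRebel's implementation; draining is reduced to setting the session count to zero instead of waiting for sessions to expire.

```python
def rolling_update(servers, new_version):
    """Update servers one at a time, keeping at least one in rotation."""
    if len(servers) < 2:
        raise ValueError("zero-downtime rolling restart needs at least two servers")
    log = []
    for server in servers:
        server["in_rotation"] = False    # route new sessions elsewhere
        server["sessions"] = 0           # stand-in for waiting until old sessions expire
        serving = [s["name"] for s in servers if s["in_rotation"]]
        assert serving, "another server must handle requests during the update"
        server["version"] = new_version  # update while drained
        server["in_rotation"] = True     # put it back behind the router
        log.append((server["name"], serving))
    return log
```

The check for at least two servers mirrors the point made in the answer: with a single server there is nothing left to handle requests during the update, so only a full restart with downtime is possible.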