Can I force delete an AWS CloudFormation stack whose rollback is still in progress? - aws-cloudformation

An AWS CloudFormation rollback (e.g., UPDATE_ROLLBACK_IN_PROGRESS) has been in progress forever, like over an hour and a half. I want to delete the stack altogether or force stop any activity. Is this possible?
Thanks!

Another common cause of blocked stack updates and rollbacks is a failed ECS::Service resource update, which CloudFormation does not appear to detect, at least in some cases. CloudFormation waits for the service event that signals the service has reached a steady state, so updating the service to something that works (e.g., setting the desired task count to 0) will unblock it. Before sending more updates, though, try to get the service back to the state CloudFormation expects, to avoid further problems.
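As a rough sketch of the "desired tasks to 0" workaround, the snippet below builds the AWS CLI call that sets a service's desired count to zero. It assumes the AWS CLI is installed and configured; the cluster and service names are placeholders, not values from this thread.

```python
import subprocess


def scale_to_zero_command(cluster: str, service: str) -> list:
    """Build the AWS CLI argv that sets an ECS service's desired count to 0."""
    return [
        "aws", "ecs", "update-service",
        "--cluster", cluster,
        "--service", service,
        "--desired-count", "0",
    ]


def run(argv: list) -> None:
    """Run the command; raises CalledProcessError if the CLI reports a failure."""
    subprocess.run(argv, check=True)


if __name__ == "__main__":
    # "my-cluster" and "my-service" are hypothetical names used for illustration.
    print(" ".join(scale_to_zero_command("my-cluster", "my-service")))
```

Note the original desired count before running this, so you can restore the state CloudFormation expects once the stack is unblocked.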

I guess your stack's resources were changed or deleted outside of CloudFormation.
The official guide says the following:
Manually sync resources so that they match the original stack's template, and then continue rolling back the update. For example, if you manually deleted a resource that AWS CloudFormation is attempting to roll back to, you must manually create that resource with the same name and properties it had in the original stack.
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/troubleshooting.html#troubleshooting-errors-update-rollback-failed
or (as @talentedmrjones said)
To fix the stack, contact AWS customer support.
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/troubleshooting.html#troubleshooting-errors-nested-stacks-are-stuck
In my case, I resolved the same situation by re-creating the deleted resource.
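Once the resources match the original template again, the "continue rolling back" step from the guide can be triggered from the AWS CLI. A minimal sketch, assuming the AWS CLI is installed and configured; the stack name and logical resource ID are placeholders:

```python
import subprocess


def continue_rollback_command(stack_name: str, resources_to_skip=()) -> list:
    """Build the argv for `aws cloudformation continue-update-rollback`."""
    argv = [
        "aws", "cloudformation", "continue-update-rollback",
        "--stack-name", stack_name,
    ]
    if resources_to_skip:
        # Logical IDs of resources CloudFormation should skip during rollback.
        argv.append("--resources-to-skip")
        argv.extend(resources_to_skip)
    return argv


def run(argv: list) -> None:
    """Run the command; raises CalledProcessError if the CLI reports a failure."""
    subprocess.run(argv, check=True)


if __name__ == "__main__":
    # "my-stack" and "MyBrokenResource" are hypothetical names.
    print(" ".join(continue_rollback_command("my-stack", ["MyBrokenResource"])))
```

Skipping resources is a last resort: the stack may no longer reflect the real state of a skipped resource, so fix or re-create it where you can.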

In my case it is an EC2 security group that cannot be deleted because it is referenced from another EC2 security group.

When dealing with a custom resource, it is possible to construct a mocked-up version of the return URL.
The easiest way to do this is to grab the URL that was used during the create. If you can get your hands on it, replace the section after the last %2F with the "Client Request Token", which you can get from the stack's event log in CloudFormation.
If not, here is the format of the URL you will have to construct:
https://{region}.console.aws.amazon.com/cloudformation/home?region={region}#/stacks?filter=active&tab=events&stackId={stack arn}%2F{stack name}%2F{client request token}
Issue a GET request to that URL and it will cause the resource to fail the rollback or delete.

You can try to delete the resources and then the update rollback will complete successfully.

Sometimes this will occur if your user role is missing permissions to delete roles. This can be tested by trying to manually delete roles or users that have been created by the CloudFormation stack.

I had something like this happen once, and the stack seemed stuck forever in UPDATE_ROLLBACK_IN_PROGRESS status. I'd recommend submitting a ticket to AWS support. That was the only way I was able to resolve it.

Was able to delete mine by manually deleting everything via AWS dashboard. I ended up having a couple dangling roles that just needed deletion.

I ran into the same problem.
The console told me that some resources depended on others and so couldn't be deleted. In that state, rollback is unavailable.
I just deleted the whole VPC and the resources in that VPC.
CloudFormation retries deleting a resource every 10-20 minutes, so when it retried, it found the resources had already been deleted, skipped the deletion, and everything went smoothly after that.

Yes, use this command to delete stacks stuck in the 'DELETE_IN_PROGRESS' state.
You can also easily run it in AWS CloudShell.
Go to Lambda Function -> Monitor -> CloudWatch Logs. Look for the log entry where "RequestType" is "Delete" and copy the necessary fields into the command below:
curl -H 'Content-Type: ''' -X PUT -d '{"Status": "SUCCESS","PhysicalResourceId": "Add your physical resource ID", "StackId": "Add your StackId","RequestId": "Add your RequestID","LogicalResourceId": "LambdaFunction"}' 'Add your ResponseURL Here'
Example:
curl -H 'Content-Type: ''' -X PUT -d '{"Status": "SUCCESS","PhysicalResourceId": "cutomRes-LambdaFunction-1NC1ORF", "StackId": "arn:aws:cloudformation:us-east-1:3343:stack/cutomRes/f52a-11eb-b5df-0a5c2cc1","RequestId": "d70931a2-364b-413e-a2","LogicalResourceId": "LambdaFunction"}' 'https://cloudformation-custom-resource-response-useast1.s3.amazonaws.com/arn%3Aaws%/cutomRes/f5466f6Expires=7200&X-Amz-Credential=AKIA6L7Q4OWT3GW5BT7K%2F20210330%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=1db1f83f'
Note that the URL in the example has been modified so that it does not work, for security reasons. It is for demonstration purposes only.
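The same response can be sent from Python using only the standard library. This is a sketch, not an official helper: the function names are made up for illustration, and it assumes you have copied the logged "Delete" event (with its StackId, RequestId, LogicalResourceId and ResponseURL fields) out of CloudWatch Logs into a dict.

```python
import json
import urllib.request


def build_response_body(event: dict, status: str = "SUCCESS",
                        reason: str = "Manually signalled") -> bytes:
    """Build the JSON body CloudFormation expects from a custom resource."""
    body = {
        "Status": status,
        "Reason": reason,
        # Fall back to a placeholder ID if the event did not carry one.
        "PhysicalResourceId": event.get("PhysicalResourceId", "manual-response"),
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
    }
    return json.dumps(body).encode("utf-8")


def send_response(response_url: str, body: bytes) -> int:
    """PUT the body to the presigned ResponseURL with an empty Content-Type."""
    request = urllib.request.Request(
        response_url,
        data=body,
        method="PUT",
        headers={"Content-Type": ""},
    )
    with urllib.request.urlopen(request) as reply:
        return reply.status
```

A call would look like `send_response(event["ResponseURL"], build_response_body(event))`. The presigned URL expires, so this only works while the URL logged with the event is still valid.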

You will need to investigate why exactly the rollback is taking so long (e.g., if it's due to a missing resource modified outside of the CloudFormation stack, or a Custom Resource that failed to return the expected signals).

I went to the stack's Resources tab, checked why some of them couldn't be deleted, and then deleted them manually first.

Usually, it works with just a quick refresh.

Go to the Resources section and check which resource the stack is trying to delete. Then go to that resource and check why CloudFormation is not able to delete it.
Try deleting the resource manually and look at the error or dependency that comes up. Fix that and the stack will continue. Depending on your stack design and dependencies, you might have to delete or fix multiple resources manually.
Check this for more details:
https://aws.amazon.com/premiumsupport/knowledge-center/cloudformation-stack-stuck-progress/

Related

OpenSearch 1.3 > 2.3 upgrade, CloudFormation fails on domain update

I recently updated our CDK code to move our OpenSearch cluster from version 1.3 to 2.3. The cluster itself seems to have upgraded to a healthy state and is still accessible / usable by our application, but CloudFormation failed when attempting to update our domain resource with:
Resource handler returned message: "Resource handler returned message: "Invalid request provided: DP Nodes are OOS, Tags operation is not allowed"
This kicked the stack into UPDATE_ROLLBACK_FAILED, because the rollback is not allowed: the cluster cannot be downgraded back to 1.3.
I'm struggling to find any information about this error it's kicking out and not quite sure how to resolve it to unblock the CloudFormation stack.
Things I have tried:
Digging through CloudWatch logs only revealed information pertaining to queries.
Forcing the rollback to occur without Domain resource. This got me back to an UPDATE_COMPLETE state, but each subsequent deploy of this stack will cause it to fail again since the core issue is not resolved.
This was an odd presentation of a permissions issue. As I was reading through some docs, I stumbled upon this section, which discusses changes to tag-based access control.
This led me to start looking into CloudTrail, where I found the exact error that fired when this deploy happened. It was a little odd because the assumed role granted admin access to CloudFormation, but the last line of this event record caught my eye:
"sourceIPAddress": "cloudformation.amazonaws.com",
"userAgent": "cloudformation.amazonaws.com",
"errorCode": "ValidationException",
"errorMessage": "DP Nodes are OOS, Tags operation is not allowed",
"eventSource": "es.amazonaws.com",
Upon adding es.amazonaws.com to the trust relationship of that role, the deploy fully re-ran successfully.
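For illustration, here is a small sketch of what "adding es.amazonaws.com to the trust relationship" means in terms of the role's trust policy document. The starting policy in the usage example is an assumption (a role trusted only by cloudformation.amazonaws.com); the thread does not show the actual policy, and the helper name is hypothetical.

```python
import copy
import json


def add_service_principal(trust_policy: dict, service: str) -> dict:
    """Return a copy of the trust policy that also trusts `service`."""
    policy = copy.deepcopy(trust_policy)
    for statement in policy.get("Statement", []):
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        principal = statement.get("Principal")
        if (statement.get("Effect") != "Allow"
                or "sts:AssumeRole" not in actions
                or not isinstance(principal, dict)):
            continue
        services = principal.get("Service", [])
        if isinstance(services, str):
            services = [services]
        if service not in services:
            services = services + [service]
        principal["Service"] = services
        return policy
    # No suitable statement found: append a new one for this service.
    policy.setdefault("Version", "2012-10-17")
    policy.setdefault("Statement", []).append({
        "Effect": "Allow",
        "Principal": {"Service": [service]},
        "Action": "sts:AssumeRole",
    })
    return policy


if __name__ == "__main__":
    # Assumed starting point, not taken from the thread.
    current = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "cloudformation.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }
    print(json.dumps(add_service_principal(current, "es.amazonaws.com"), indent=2))
```

The resulting document is what you would paste into the role's trust relationship, or set in the CDK/CloudFormation definition of the role.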
Hopefully this helps someone else.

AWS CloudFormation stack fails create due to resource already exists in stack that has been deleted

When creating a new CloudFormation stack, CREATE fails with the following error:
[RESOURCE] already exists in stack [DELETE_COMPLETE status stack ARN]
I have already verified the resource is no longer in the AWS account.
It's happening to me too (us-east-1), since yesterday (2020-09-30).
I tried to redeploy the same stack several times WITHOUT SUCCESS.
I also tried:
listing the stack ARN with the AWS CLI to try to delete it manually
listing the old resource ARN to try to force-delete the old stack with the AWS CLI ... all WITHOUT SUCCESS
It seems to be an AWS bug. I suggest opening a ticket with AWS support to try to solve it (I'm doing so myself).
One alternative is to create a new stack with a different resource name (here we usually add a suffix to the resource name, based on the stack name variable, so that we can deploy the stack multiple times).

Taking over existing Domains (HostedZones) in CloudFormation

I've been setting up CloudFormation templates for some new infrastructure for a project and I've made it to Route 53 Hosted Zones.
Now ideally I'd like to create a "core-domains" stack with all our hosted zones and base configuration. Thing is we already have these created manually using the AWS console (and they're used for test/live infrastructure), is there any way to supply the existing "HostedZoneId" as a property to the resource definition and essentially have it introspect what we already have and then apply the diff? (If I've done my job there shouldn't be a diff hopefully so should just be a no-op!).
I can't see a "HostedZoneId" property in the docs: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-route53-hostedzone.html
Any suggestions?
PS. I'm assuming this isn't possible and I'll have to recreate all the HostedZones under CloudFormation but I thought I'd check :)
Found it. It's fine to make other HostedZones, they're issued with new Nameservers, just use the "HostedZoneConfig: Comment" field to note down which hosted zone is which and then you can switch the nameservers over when you're ready!

Get-AzureRmResourceGroupDeployment lists machines I cannot see in the web interface

I'm tasked with automating the creation of Azure VMs, and naturally I go through a number of more or less broken iterations of trying to deploy a VM image. As part of this, I automatically allocate serial hostnames, but for some strange reason it's not working:
The code in the link above works very well, but the contents of my ResourceGroup are not as expected. Every time I deploy (successfully or not), a new entry is created in whatever list is returned by Get-AzureRmResourceGroupDeployment; however, in the Azure web interface I can only see a few of these entries. If, for instance, I omit a parameter for the JSON file, Azure cannot even begin to deploy something -- but the hostname is somehow reserved anyway.
Where is this list? How can I clean up after broken deployments?
Currently, Get-AzureRmResourceGroupDeployment returns:
azure-w10-tfs13
azure-w10-tfs12
azure-w10-tfs11
azure-w10-tfs10
azure-w10-tfs09
azure-w10-tfs08
azure-w10-tfs07
azure-w10-tfs06
azure-w10-tfs05
azure-w10-tfs02
azure-w7-tfs01
azure-w10-tfs19
azure-w10-tfs1
although the web interface only lists:
azure-w10-tfs12
azure-w10-tfs13
azure-w10-tfs09
azure-w10-tfs05
azure-w10-tfs02
Solved using the code $siblings = (Get-AzureRmResource).Name | Where-Object{$_ -match "^$hostname\d+$"}
(PS. If you have tips for better tags, please feel free to edit this question!)
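The filter in the "Solved using" line can be illustrated outside PowerShell as well. This Python sketch applies the same pattern (the hostname prefix followed by one or more digits) to a plain list of resource names; the function name is hypothetical. Unlike the PowerShell one-liner, it escapes the prefix so that characters in the hostname are not treated as regex syntax.

```python
import re


def find_siblings(names, hostname):
    """Return the names that consist of `hostname` followed only by digits."""
    # PowerShell's -match is case-insensitive by default, so mirror that here.
    pattern = re.compile(rf"^{re.escape(hostname)}\d+$", re.IGNORECASE)
    return [name for name in names if pattern.match(name)]


if __name__ == "__main__":
    resources = ["azure-w10-tfs12", "azure-w10-tfs13", "azure-w7-tfs01", "other"]
    print(find_siblings(resources, "azure-w10-tfs"))
```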
If you create a VM in Azure Resource Management mode, it will have a deployment attached to it. In fact if you create any resource at all, it will have a resource deployment attached.
If you delete the resource you will still have the deployment record there, because you still deployed it at some stage. Consider deployments as part of the audit trail of what has happened within the account.
You can delete deployment records with Remove-AzureRmResourceGroupDeployment, but there is very little point, since deployments have no bearing on the operation of Azure. There is no cost associated with them; they are just historical records.
Querying deployments with Get-AzureRmResourceGroupDeployment will yield you the following fields.
DeploymentName
Mode
Outputs
OutputsString
Parameters
ParametersString
ProvisioningState
ResourceGroupName
TemplateLink
TemplateLinkString
Timestamp
So you can tell whether a deployment was successful via ProvisioningState, see which templates you used via TemplateLink and TemplateLinkString, check the outputs of the deployment, and so on. This can be useful for figuring out which template worked and which didn't.
If you want to see actual resources, the ones you are potentially being charged for, you can use Get-AzureRmResource.
If you just want to retrieve a list of the names of VMs that exist within an Azure subscription, you can use
(Get-AzureRmVM).Name

Trouble adding a new service

I have followed the instructions at https://github.com/cloudfoundry/oss-docs/tree/master/vcap/adding_a_system_service and copied the echo service and created my new service. (That document is somewhat out-of-date in that "excluded components" no longer exists.)
In any case, my service shows up as running with a gateway and a node when I look at 'vcap status' on the server. However, when I look at 'vmc services' from the client my service is not in the list. Where is this list maintained and why is my service not on the list?
Various services, including blob, filesystem, mongodb, etc., are shown on the 'vmc services' list even though they have never been included in my config. Where is this maintained and why are other services on this list?
The cloud_controller.log file shows a "Create service request:" for echo every minute. This service is not in my config file (it was once but it was removed and I repeated the deployment). What is prompting this request for a service that was not defined in the config?
The _gateway.log for my service shows the following:
INFO -- Sending info to cloud controller: ...api.vcap.me/services/v1/offerings
INFO -- Fetching handles from cloud controller .../offerings/.../handles
ERROR -- Failed registering with cloud controller, status=400
DEBUG -- [GaaS-Provisioner] Connected to node mbus..
ERROR -- Failed fetching handles, status=404
Why does my gateway fail to register with the cloud controller? I have found some reports that suggest that the problem is with domain name mapping. I have verified that the server can find itself:
$curl api.vcap.me
Welcome to VMware's Cloud Application Platform
What can I do to register my service?
You can also try asking your question on the vcap_dev google group.
https://groups.google.com/a/cloudfoundry.org/forum/?fromgroups#!forum/vcap-dev
They focus on answering and discussing OSS subjects for Cloud Foundry!
If you follow the document correctly, things should work just fine. I understand that the mechanism for maintaining the excluded list of components has changed and can be a point of confusion when following the steps in the article (just ignore that step entirely).
ERROR -- Failed registering with cloud controller, status=400
Well, this is the worrying part. I recently followed the article step by step and was able to add a new service.
Is the echo service showing up in vmc services?
Have you copied the yml files for the node and gateway to ./cloudfoundry/.deployments/devbox/config?
Are the tokens for your gateway unique, and do they match in the two files, ./cloudfoundry/.deployments/devbox/config/cloud_controller.yml and ./cloudfoundry/.deployments/devbox/config/_gateway.yml?
I would recommend that you first concentrate on getting the echo service listed in the vmc services output. Once that works, replicate the steps (taking great care to modify things like the token) to get your custom service working.
Cheers,
Ankit
You should follow this guide.
It worked for me.
Regards.