AWS::CloudFormation::Stack failed to ROLLBACK

My CloudFormation template had a couple of AWS::SNS::Subscription resources. I removed them and deployed the template. One of the two subscriptions failed to delete and ended up in DELETE_FAILED. I expected the AWS::CloudFormation::Stack to ROLLBACK on the failed deletion, but to my surprise it ended up in the UPDATE_COMPLETE state.

Generally, if CloudFormation can't delete a resource during the cleanup step of an update, it does not roll back; the update still succeeds and the undeletable resource is simply left behind, reported as DELETE_FAILED in the stack events.

No worries: AWS has since added the ability to retry stack operations from the point of failure.
This is a big improvement. I faced the same problem using CloudFormation: when any resource failed to launch for any reason, we had to wait for the rollback and then launch the stack again from scratch. Now we can retry the stack operation from the point of failure instead.
If you still have any questions about this, please let me know in the comments section.
Get the full details here: https://aws.amazon.com/blogs/aws/new-for-aws-cloudformation-quickly-retry-stack-operations-from-the-point-of-failure/
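For the API side of this, here is a minimal boto3 sketch; the stack name and template file are hypothetical placeholders. With rollback disabled, a failed update pauses instead of reverting, the failed resources can be inspected in the stack events, and the stack can either be retried with another update or rolled back explicitly.

import boto3

cfn = boto3.client("cloudformation")

# Update with rollback disabled: on failure the stack pauses in
# UPDATE_FAILED instead of reverting, so you can fix the problem
# and retry from the point of failure with another update.
with open("template.yaml") as f:  # hypothetical template file
    cfn.update_stack(
        StackName="my-stack",     # hypothetical stack name
        TemplateBody=f.read(),
        DisableRollback=True,
    )

# Inspect the stack events to find the resource(s) that failed.
events = cfn.describe_stack_events(StackName="my-stack")["StackEvents"]
for event in events:
    if event["ResourceStatus"].endswith("_FAILED"):
        print(event["LogicalResourceId"], event.get("ResourceStatusReason"))

# To abandon the change instead, roll the stack back explicitly:
# cfn.rollback_stack(StackName="my-stack")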

Related

CloudFormation: stack is stuck, CloudTrail events shows repeating DeleteNetworkInterface event

I am deploying a stack with CDK. It gets stuck in CREATE_IN_PROGRESS, and CloudTrail shows the following events repeating:
DeleteNetworkInterface
CreateLogStream
What should I look at next to continue debugging? Is there a known reason for this to happen?
I saw the exact same issue with a CDK-based ECS/Fargate deployment.
In my case, I was able to diagnose the issue by following the AWS support article https://aws.amazon.com/premiumsupport/knowledge-center/cloudformation-stack-stuck-progress/
What specifically diagnosed and then resolved it for me:
I updated my ECS service to set its desired task count to 0. At that point the CloudFormation stack completed successfully.
From that, it became obvious that the actual issue was related to the creation of the initial task for my ECS service. I was able to diagnose that by reviewing the output in the Deployment and Events tabs of the ECS service in the AWS Management Console. In my case, the task creation was failing because of an issue with accessing the associated ECR repository. There could of course be other reasons, but they should show up there.
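If you prefer to apply the same workaround from code, here is a minimal boto3 sketch; the cluster and service names are hypothetical placeholders.

import boto3

ecs = boto3.client("ecs")

# Drop the desired task count to 0 so the service stops trying to
# start tasks and the stuck CloudFormation operation can finish.
ecs.update_service(
    cluster="my-cluster",   # hypothetical cluster name
    service="my-service",   # hypothetical service name
    desiredCount=0,
)

Once the stack has stabilized and the underlying task failure (e.g. the ECR access issue) is fixed, the desired count can be raised again the same way.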

Service Fabric - How to repair a failing stateful application

I have a stateful service that configures state backups for the primary replica on RunAsync using an Azure storage account.
The other day someone inadvertently deleted the storage account being used for backups. On our next deployment, the services began throwing errors as they initialized, due to the resulting 404 error response.
I have noticed that during a deployment, Service Fabric apparently shuffles the old version of the service around, spinning up new primaries as needed to free up the VM it is upgrading. If the old version of the code fails to instantiate by throwing an exception, the upgrade process will fail, causing a rollback.
My problem is that once I create a new storage account, I am seemingly left with no way to bring the existing services back to a healthy state. My existing services are using storage account URLs with account keys that no longer exist in Azure. Attempts to upgrade fail because the old service instances can't instantiate due to the now-invalid configuration.
Are there any ways to deal with this situation?
The simplest thing would be to use an unmonitored manual upgrade to force through the change that would point the service to the new storage account.
However, this puts a lot of management overhead on you, particularly if there are many other services, since you need to be careful to perform all safety and functionality checks manually so as not to regress anything.
The recommended solution is to use the ServiceTypeHealthPolicyMap described here to "mask out" the unhealthy service (since you expect it to be unhealthy during the upgrade). You may also need to adjust some of the other upgrade parameters depending on the exact situation.
A third recommendation, or maybe something to improve in the future, would be to make the upgrade that changes the account information a configuration-only upgrade. This ensures that Service Fabric tries to change the config in place without restarting the services (by default), which prevents the existing services from failing over during the upgrade and encountering these issues. This is demonstrated in this example.

Updating a kubernetes job: what happens?

I'm looking for a definitive answer for k8s' response to a job being updated - specifically, if I update the container spec (image / args).
If the containers are starting up, will it stop & restart them?
If the job's pods are all running, will it stop & restart them?
If it's Completed, will it run it again with the new setup?
If it failed, will it run it again with the new setup?
I've not been able to find documentation on this point, but if there is some I'd be very happy to get some signposting.
The .spec.template field of a Job cannot be updated; the field is immutable. The Job would need to be deleted and recreated, which covers all of your questions: in every one of your scenarios, the update is simply rejected.
The reasoning behind this isn't spelled out in the GitHub commit or PR, but the change came soon after Jobs were originally added. Your questions are likely part of that reasoning, as making the field immutable removes the ambiguity.
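A minimal sketch of the delete-and-recreate workflow with the official Python client; the job name, namespace, image, and args are hypothetical placeholders.

from kubernetes import client, config

config.load_kube_config()
batch = client.BatchV1Api()

# Delete the old Job; Foreground propagation removes its pods too.
batch.delete_namespaced_job(
    name="my-job",
    namespace="default",
    body=client.V1DeleteOptions(propagation_policy="Foreground"),
)

# Recreate the Job from scratch with the new image/args. (The old
# Job must be fully gone first, or the create fails on a name clash.)
new_job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="my-job"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="main",
                        image="myrepo/myimage:v2",  # the updated spec
                        args=["--new-arg"],
                    )
                ],
            )
        )
    ),
)
batch.create_namespaced_job(namespace="default", body=new_job)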

AWS Mongo QuickStart never completes

Problem
I am trying to complete the MongoDB on AWS quickstart to create a simple MongoDB cluster. Unfortunately it never completes the rollout; it gives up after one final installation step (PrimaryReplicaNodeXYWaitForNodeInstallGP2) fails to complete within an hour.
Background
My Settings were the following:
AvailabilityZone0 eu-central-1a
AvailabilityZone1 eu-central-1b
AvailabilityZone2 eu-central-1b
BuildBucket quickstart-reference/mongodb/latest
ClusterReplicaSetCount 0
ClusterShardCount 1
ConfigServerInstanceType t2.micro
Iops 100
KeyName my_definitely_working_keypair
MongoDBVersion 3.2
NATInstanceType t2.small
NodeInstanceType m3.medium
PrimaryReplicaSubnet 10.0.2.0/24
PublicSubnet 10.0.1.0/24
RemoteAccessCIDR XXX.XXX.0.0/16
SecondaryReplicaSubnet0 10.0.3.0/24
SecondaryReplicaSubnet1 10.0.4.0/24
ShardsPerNode 0
VolumeSize 40
VolumeType gp2
VPCCIDR 10.0.0.0/16
This caused a rollback with the same behaviour described in the AWS forum:
In "Resources", all but one subtask never gets completed and stays forever as "PrimaryReplicaNode0WaitForNodeInstallGP2 - PrimaryReplicaNode0WaitForNodeInstallWaitHandle - Create in Progress - Resource creation initiated"
So I researched the issue further. The post referred to another forum thread, which suggested that users with this problem delete their DynamoDB entries and set ClusterReplicaSetCount to 3.
The problem here: in DynamoDB there are no entries, and changing ClusterReplicaSetCount to 3 also causes a rollback with a similar error:
ConfigServer2WaitForNodeInstall WaitCondition timed out. Received 0 conditions when expecting 1
and later
MONGODBSTACK1 The following resource(s) failed to create:
[ConfigServer1WaitForNodeInstall,
PrimaryReplicaNode00WaitForNodeInstallGP2,
ConfigServer0WaitForNodeInstall,
SecondaryReplicaNode00WaitForNodeInstallGP2,
SecondaryReplicaNode01WaitForNodeInstallGP2,
ConfigServer2WaitForNodeInstall].
Summary
In both cases, PrimaryReplicaNodeXYWaitForNodeInstallGP2 (where XY is the number of the node) fails while all other parts of the installation complete successfully. I am totally in the dark.
Has anyone got around this? The quickstart is from 2016, so there must be people who have successfully created this Mongo stack!?
After days and days of hard struggle and no solution, there was an update to the manual and the template (the first in over a year; it feels as if my prayers were heard):
https://docs.aws.amazon.com/quickstart/latest/mongodb/welcome.html
So this also comes with a completely revised infrastructure and a more sophisticated setup form; the changes are described as:
Upgraded MongoDB to version 3.4; removed sharding configuration; updated security groups and added database security; updated parameters
Following the tutorial is quite similar to the former versions, so no struggle here.
Everything went fine and I got my stack completed, now consisting of a
mongoDB
mongoDB Replica
bastion stack
vpc stack
So this part is basically done. If something else comes up, I will open a new question for that.
I noticed this after tearing down a dev cluster and attempting to stand up a new one with the same name.
The torn-down cluster had orphaned a DynamoDB table with the same name the new stack was trying to publish the worker nodes' status to. I deleted this DynamoDB table manually, reattempted to spin up the stack with the same name a third time, and it succeeded.
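A minimal boto3 sketch of that cleanup; the table name is a hypothetical placeholder for whatever orphaned table you find.

import boto3

dynamodb = boto3.client("dynamodb")

# List the account's tables to spot the one orphaned by the old stack.
for name in dynamodb.list_tables()["TableNames"]:
    print(name)

# Delete the orphaned table so the new stack can recreate it.
dynamodb.delete_table(TableName="my-old-stack-worker-status")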

How to disable rollback of a substack from within cloudformation template?

Is there a way to specify that a substack is not to be rolled back on failure when calling other CFTs from a CFT?
I.e., the master CFT is invoked (at invocation time you can use --disable-rollback or provide the option to CloudFormation) -> substack 1 is successfully created -> substack 2 fails.
Now substack 2 rolls back, I lose the record of what happened, and the master CFT just sits there failed.
Is there a place to specify whether or not to allow rollback inside of a CFT, either in the invoking template (master) or the child (substack)?
Yes, you can disable rollback on failure for CloudFormation stacks.
On the Options page while creating the stack, you will find the Advanced section.
In the expanded Advanced section, you will find the Rollback on failure option.
With that disabled, the stack won't roll back on failures; even when a child stack fails, it won't initiate a rollback. Note that this is a setting on the stack operation itself (console, CLI, or API), not something you can declare inside the template.
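Equivalently from code, a minimal boto3 sketch; the stack name and template URL are hypothetical placeholders.

import boto3

cfn = boto3.client("cloudformation")

# DisableRollback is set per stack operation, not in the template
# body; it is passed when the root (master) stack is created.
cfn.create_stack(
    StackName="master-stack",                                      # hypothetical
    TemplateURL="https://s3.amazonaws.com/my-bucket/master.yaml",  # hypothetical
    DisableRollback=True,
)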