I know that we can disable rollback on failure for a normal CloudFormation stack. Is there any way to set that up for the stacks created by a CloudFormation StackSet? I tried failure tolerance, but the failed stack still gets rolled back. Any advice?
I am deploying a stack with CDK. It gets stuck in CREATE_IN_PROGRESS. CloudTrail shows these events repeating:
DeleteNetworkInterface
CreateLogStream
What should I look at next to continue debugging? Is there a known reason for this to happen?
I saw exactly the same issue when deploying a CDK-based ECS/Fargate service.
In my case, I was able to diagnose the issue by following the AWS support article https://aws.amazon.com/premiumsupport/knowledge-center/cloudformation-stack-stuck-progress/
What specifically diagnosed and then resolved it for me:
I set the desired task count of my ECS service to 0. At that point the CloudFormation stack completed successfully.
From that, it became obvious that the actual issue was the creation of the initial task for my ECS service. I diagnosed it by reviewing the Deployments and Events tabs of the ECS service in the AWS Management Console. In my case, task creation was failing because the task could not access the associated ECR repository. There could be other reasons, but they should show up in the same place.
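The two steps above (scaling to zero, then reading the service events) can also be scripted. This is a minimal sketch, assuming a boto3 ECS client is passed in; the function names are hypothetical, and the cluster and service names are placeholders you would supply.

```python
def scale_service_to_zero(ecs_client, cluster, service):
    """Set the desired task count to 0 so the stuck stack can finish."""
    response = ecs_client.update_service(
        cluster=cluster, service=service, desiredCount=0
    )
    return response["service"]["desiredCount"]


def service_event_messages(ecs_client, cluster, service, limit=10):
    """Return up to `limit` event messages, as shown on the Events tab."""
    response = ecs_client.describe_services(cluster=cluster, services=[service])
    events = response["services"][0]["events"]
    return [event["message"] for event in events[:limit]]
```

With real credentials, the client would come from boto3.client("ecs"). The event messages are where a failure such as an ECR access problem should appear.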
I have the following Celery setup in production:
Using RabbitMQ as the broker
I am running multiple Celery worker instances on ECS Fargate
Logs are sent to CloudWatch using the default awslogs driver
Result backend - MongoDB
The issue I am facing is this: a lot of my tasks are not showing logs in CloudWatch.
I just see this log
Task task_name[3d56f396-4530-4471-b37c-9ad2985621dd] received
But I do not see the actual logs for the execution of this task. Nor do I see the logs for completion - for example something like this is nowhere in the logs to be found
Task task_name[3d56f396-4530-4471-b37c-9ad2985621dd] succeeded
This does not happen all the time. It happens intermittently but regularly, and many other tasks do print their logs.
I can see that the result backend has the task results, so I know the task executed, but the logs for the task are completely missing. It is not specific to any task_name.
On my local setup, I have not been able to reproduce the issue.
I am not sure if this is a Celery logging issue or an awslogs issue. What can I do to troubleshoot this?
** UPDATE **
Found the root cause: some code in the codebase was removing handlers from the root logger. Leaving this question on Stack Overflow in case someone else faces this issue.
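To illustrate that root cause, here is a small standard-library sketch (not the original codebase, and independent of Celery) showing how removing a handler from the root logger silently drops INFO-level records. The logger name "tasks.demo" and the function names are made up for the example.

```python
import io
import logging


def describe_root_handlers():
    """List the handler class names currently attached to the root logger."""
    return [type(handler).__name__ for handler in logging.getLogger().handlers]


def demo_removed_root_handler():
    """Log once with a root handler attached and once after removing it."""
    root = logging.getLogger()
    saved_handlers, saved_level = root.handlers[:], root.level
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    root.handlers = [handler]
    root.setLevel(logging.INFO)
    try:
        task_log = logging.getLogger("tasks.demo")
        task_log.info("task started")      # reaches the stream
        root.removeHandler(handler)        # what the offending code did
        task_log.info("task succeeded")    # dropped: no handler is left
    finally:
        # Restore the previous logging configuration.
        root.handlers = saved_handlers
        root.setLevel(saved_level)
    return stream.getvalue()
```

Calling describe_root_handlers() at the start and end of a task is a cheap way to find out whether some code path is stripping handlers at runtime.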
Does anyone know why an ECS Fargate task would fail with this error?
"Timeout waiting for network interface provisioning to complete."
I am running an ECS Fargate task using Step Functions. The IAM role for the step function has access to the task definition, and the state machine code also looks good. The same step function worked fine before, but I ran into this error just now. Why would this happen? Is it an occasional failure?
According to AWS support, intermittent failures of this nature are to be expected (with relatively low probability).
The recommendation was to set retryAttempts > 1 to handle these situations.
This can happen if there are problems within AWS. You can view the Network Interfaces page on the EC2 console and you may see errors loading, which is an indication of API problems within EC2. You can also check status.aws.amazon.com to look for errors. Note that AWS can be slow to acknowledge problems there, so you may experience the errors before they update the status page!
AWS support has a detailed post on resolving network interface provisioning errors for ECS on Fargate. Here is an excerpt:
If the Fargate service tries to attach an elastic network interface to the underlying infrastructure that the task is meant to run on, then you can receive the following error message: "Timeout waiting for network interface provisioning to complete."
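Fargate occasionally hits intermittent API issues while starting tasks, most often when they are launched from Step Functions or AWS Batch jobs. As recommended in another answer, you can set MaxAttempts in the Retry block of the state definition. Note that a retrier also requires an ErrorEquals field:
"Retry": [
{
"ErrorEquals": ["States.TaskFailed"],
"MaxAttempts": 3
}
]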
Additionally, reattempts can be automated with an exponential backoff and retry logic in AWS Step Functions.
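As a rough sketch of what such a retry policy with exponential backoff could look like, the snippet below builds a Retry block as a Python structure and computes the waits it implies. The field names come from the Amazon States Language. The helper names, the specific values, and the choice of "States.TaskFailed" as the error to match are assumptions for illustration.

```python
import json


def build_retry_policy(max_attempts=3, interval_seconds=5, backoff_rate=2.0):
    """Build a Step Functions Retry block with exponential backoff."""
    return [
        {
            "ErrorEquals": ["States.TaskFailed"],  # assumed error name to match
            "IntervalSeconds": interval_seconds,   # wait before the first retry
            "MaxAttempts": max_attempts,           # number of retries
            "BackoffRate": backoff_rate,           # multiplier applied per retry
        }
    ]


def wait_schedule(policy):
    """Return the wait in seconds before each retry of the first retrier."""
    retrier = policy[0]
    return [
        retrier["IntervalSeconds"] * retrier["BackoffRate"] ** attempt
        for attempt in range(retrier["MaxAttempts"])
    ]


if __name__ == "__main__":
    policy = build_retry_policy()
    print(json.dumps({"Retry": policy}, indent=2))
    print(wait_schedule(policy))  # [5.0, 10.0, 20.0]
```

The printed JSON can be pasted into the task state that launches the Fargate task.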
I was hitting the same issue until I switched over to Fargate platform version 1.4.0.
It looks like there were some changes made to the networking side of things.
https://aws.amazon.com/blogs/containers/aws-fargate-launches-platform-version-1-4/
At the time of writing, the default version is still 1.3.0, so try setting the platform version to 1.4.0 explicitly and see if that fixes it for you.
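If the task is launched through boto3, pinning the platform version might look like the sketch below. The helper name and its arguments are hypothetical; platformVersion, launchType and networkConfiguration are parameters of the ECS RunTask API. In a Step Functions state you would set the equivalent field in the task parameters instead.

```python
def fargate_run_task_args(cluster, task_definition, subnets, platform_version="1.4.0"):
    """Build keyword arguments for ECS run_task with a pinned platform version."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",
        "platformVersion": platform_version,  # pin instead of relying on the default
        "networkConfiguration": {
            "awsvpcConfiguration": {"subnets": list(subnets)}
        },
    }
```

With a real client this would be passed as boto3.client("ecs").run_task(**args).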
My CloudFormation template had a couple of AWS::SNS::Subscription resources. I removed them and deployed the template. One of the two AWS::SNS::Subscription resources failed to delete and ended up in DELETE_FAILED. I expected the AWS::CloudFormation::Stack to roll back after failing to delete the AWS::SNS::Subscription, but to my surprise it ended up in the UPDATE_COMPLETE state.
Generally, if CloudFormation can't delete a resource as part of the cleanup step, it does not roll back; the update still succeeds.
No need to worry: AWS has since added the ability to retry stack operations from the point of failure.
When I was using CloudFormation, I faced the same problem. Whenever a resource failed to launch for any reason, we had to wait for the stack to roll back and then launch it again from scratch. Now we can fix the problem and retry the stack operation from the point of failure.
If you still have questions about this, please let me know in the comments.
Get the full details here : https://aws.amazon.com/blogs/aws/new-for-aws-cloudformation-quickly-retry-stack-operations-from-the-point-of-failure/
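A minimal sketch of how that workflow might be driven from boto3, assuming a CloudFormation client is passed in. The function names are hypothetical; DisableRollback on create_stack and update_stack, and the rollback_stack call, are the API pieces this relies on.

```python
def create_preserving_failures(cfn_client, stack_name, template_body):
    """Create a stack that keeps its resources in place if creation fails."""
    response = cfn_client.create_stack(
        StackName=stack_name,
        TemplateBody=template_body,
        DisableRollback=True,
    )
    return response["StackId"]


def retry_or_roll_back(cfn_client, stack_name, fixed_template_body=None):
    """Retry with a corrected template, or roll back if none is given."""
    if fixed_template_body is not None:
        cfn_client.update_stack(
            StackName=stack_name,
            TemplateBody=fixed_template_body,
            DisableRollback=True,
        )
        return "retried"
    cfn_client.rollback_stack(StackName=stack_name)
    return "rolled back"
```

With real credentials, the client would come from boto3.client("cloudformation").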
Is there a way to specify that a substack is not to be rolled back on failure when calling other CFTs from a CFT?
I.e., master CFT invoked (when invoking it, you can use --disable-rollback or provide the option to CFN) -> substack 1 successfully created -> substack 2 fails.
Now substack 2 rolls back, I lose the record of what happened, and the master CFT just sits there in a failed state.
Is there a place to specify whether or not to allow rollback inside of a CFT, either in the invoking template (master) or the child (substack)?
Yes, you can disable rollback on failure for CloudFormation stacks.
In the Options step while creating the stack, expand the Advanced section.
In the expanded Advanced section, find the Rollback on failure option and disable it.
Now the CFT won't roll back on failures. Even when a child stack fails, it won't initiate a rollback.