CloudFormation: stack is stuck, CloudTrail shows repeating DeleteNetworkInterface events - aws-cloudformation

I am deploying a stack with CDK. It gets stuck in CREATE_IN_PROGRESS. CloudTrail shows these events repeating:
DeleteNetworkInterface
CreateLogStream
What should I look at next to continue debugging? Is there a known reason for this to happen?

I saw the exact same issue when deploying a CDK-based ECS/Fargate service.
In my case, I was able to diagnose the issue by following the AWS support article https://aws.amazon.com/premiumsupport/knowledge-center/cloudformation-stack-stuck-progress/
What specifically diagnosed and then resolved it for me:
I set the desired task count of the ECS service to 0. At that point the CloudFormation stack completed successfully.
From that, it became obvious that the actual issue was the creation of the initial task for my ECS service: CloudFormation keeps waiting until the service reaches its desired count, so a task that never starts leaves the stack in CREATE_IN_PROGRESS. I then found the cause by reviewing the Deployments and Events tabs of the ECS service in the AWS Management Console. In my case, task creation was failing because the task could not access the associated ECR repository. There could be other reasons, but they should show up in the same place.
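If you prefer to read the service events outside the console, here is a minimal Python sketch of the same check. It assumes boto3 is installed and credentials are configured; the cluster and service names are placeholders, and the sample response below is made up purely to show the shape that describe_services returns.

```python
from datetime import datetime
from typing import Any


def recent_event_messages(response: dict[str, Any], limit: int = 5) -> list[str]:
    """Return the newest event messages from a DescribeServices-shaped response."""
    messages: list[str] = []
    for service in response.get("services", []):
        # Sort explicitly by timestamp so the result does not depend on API ordering.
        events = sorted(service.get("events", []), key=lambda e: e["createdAt"], reverse=True)
        for event in events[:limit]:
            messages.append(f'{service.get("serviceName", "?")}: {event.get("message", "")}')
    return messages


def fetch_service_events(cluster: str, service: str) -> list[str]:
    """Query ECS for one service's events (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above runs without boto3

    ecs = boto3.client("ecs")
    return recent_event_messages(ecs.describe_services(cluster=cluster, services=[service]))


def scale_to_zero(cluster: str, service: str) -> None:
    """Set the desired count to 0 so a stuck stack can finish, as described above."""
    import boto3

    boto3.client("ecs").update_service(cluster=cluster, service=service, desiredCount=0)


# Offline example with a hypothetical response; no AWS call is made here.
sample = {
    "services": [
        {
            "serviceName": "web",
            "events": [
                {"createdAt": datetime(2024, 1, 1, 12, 0), "message": "older event"},
                {"createdAt": datetime(2024, 1, 1, 12, 5), "message": "newer event"},
            ],
        }
    ]
}
print(recent_event_messages(sample, limit=1))
```

A failing image pull or similar problem should appear in these messages, which is the same information the Events tab shows.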

Related

Why can't I see my cluster when I'm trying to set up a scheduled task?

I have a cluster in ECS with about 20+ services all happily running in it.
I've just uploaded a new image which I want to set up as a daily task. I can create it as a task and run it - the logs indicate it is running to completion.
I've gone into EventBridge and created a rule, set the detail and cron expression, selected the target type (AWS service) and then ECS task. But when I open the Cluster dropdown it is empty, so I can't select a cluster.
Is this a security issue perhaps, or am I missing something elsewhere? Can't this be done?
Any help would be much appreciated.
Eventually managed to get this to work. The problem was that I was starting the EventBridge creation process in the wrong region - rookie mistake - so it couldn't see the cluster in the other region. D'Oh!
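A quick way to rule out a region mismatch is to list clusters per region and compare with the region the rule is being created in. This is only a sketch: it assumes boto3 and credentials are available, the region names are examples, and only the first page of list_clusters results is read.

```python
def region_of_arn(arn: str) -> str:
    """Return the region field of an ARN (arn:partition:service:region:account:resource)."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    return parts[3]


def clusters_by_region(regions: list[str]) -> dict[str, list[str]]:
    """List ECS cluster ARNs in each given region (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so region_of_arn works without boto3

    found: dict[str, list[str]] = {}
    for region in regions:
        ecs = boto3.client("ecs", region_name=region)
        # First page only; use a paginator if you have many clusters.
        found[region] = ecs.list_clusters()["clusterArns"]
    return found


# Offline example with a hypothetical ARN; no AWS call is made here.
print(region_of_arn("arn:aws:ecs:eu-west-1:123456789012:cluster/demo"))
```

Calling clusters_by_region(["us-east-1", "eu-west-1"]) would show which region actually holds the cluster, and the EventBridge rule has to be created in that region.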

Container App Environment creation timing out

Where I work has just started migrating to the cloud. We've successfully deployed a number of resources using Terraform and Pipelines into Azure.
Where we are running into issues is deploying a Container App Environment. We have code that was working in a less locked-down environment (set up for a proof of concept), but we are now having issues using that code in our go-forward environment.
When deploying, the Container App Environment spends 30 minutes attempting to create before it returns a "context deadline exceeded" error. Looking in the Azure Portal, I can see the resource in the "Waiting" provisioning state, and I can also see the MC_ and AKS resources that get generated. It then fails around 4 hours later.
Any advice?
I suspect it's related to security on the Virtual Network that the subnets sit on, but I'm not seeing any logs on the deployment to confirm. The original subnets had a Network Security Group (NSG) assigned, and I configured the rules that Microsoft provides. I then added a couple of subnets without an NSG assigned, with no luck.
My next step is to try provisioning it via the GUI and see if that works.
I managed to break our build in the "anything goes" environment.
The root cause is an incomplete configuration of the Virtual Network which has custom DNS entries. This has now been passed to our network architects to resolve. If I can get more details on the fix they apply I'll include that here for anyone else that runs into the issue.

AWS ECS won't start tasks: http request timed out enforced after 4999ms

I have an ECS cluster (Fargate), task, and service that I have had set up in Terraform for at least a year. I haven't touched it for a long while. My normal deployment for updating the code is to push a new container to the registry and then stop all tasks on the cluster with a script. Today, my service did not run a new task in response to that task being stopped. Its desired count is fixed at so it should.
I have gone in and tried to run this manually, and I'm seeing this error.
Unable to run task
Http request timed out enforced after 4999ms
When I try to do this, a new stopped task is added to my stopped tasks list. When I look into that task, the stopped reason is "Deployment restart", and two of them are now showing "Task provisioning failed.", which I think might be tasks the service tried to start. These tasks do not show a started timestamp, while the ones I start in the console do.
My site is now down and I can't get it back up. Does anyone know of a way to debug this? Is AWS ECS experiencing problems right now? I checked the health monitors and I see no issues.
This was an AWS outage affecting Fargate in us-east-1. It's fixed now.

Timeout waiting for network interface provisioning to complete

Does anyone know why an ECS Fargate task would fail with this error?
Timeout waiting for network interface provisioning to complete.
I am running an ECS Fargate task using Step Functions. The IAM role for the step function has access to the task definition, and the state machine code also looks good. The same step function worked fine before, but I ran into this error just now. Why would this happen? Is it an occasional failure?
According to AWS support, intermittent failures of this nature are to be expected (with relatively low probability).
The recommendation was to set retryAttempts > 1 to handle these situations.
This can happen if there are problems within AWS. You can view the Network Interfaces page on the EC2 console and you may see errors loading, which is an indication of API problems within EC2. You can also check status.aws.amazon.com to look for errors. Note that AWS can be slow to acknowledge problems there, so you may experience the errors before they update the status page!
AWS support has a detailed post on resolving network interface provisioning errors for ECS on Fargate. Here's an excerpt:
If the Fargate service tries to attach an elastic network interface to the underlying infrastructure that the task is meant to run on, then you can receive the following error message: "Timeout waiting for network interface provisioning to complete."
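Fargate hits intermittent API issues, usually while tasks are spinning up from Step Functions and AWS Batch jobs. As recommended in another answer, you can set MaxAttempts for retries in the state definition. Note that each retrier also needs an ErrorEquals list; States.TaskFailed is used here as an example.
"Retry": [
  {
    "ErrorEquals": ["States.TaskFailed"],
    "MaxAttempts": 3
  }
]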
Additionally, reattempts can be automated with an exponential backoff and retry logic in AWS Step Functions.
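To make the backoff concrete, here is a small Python sketch that builds a retrier with the standard Amazon States Language fields (IntervalSeconds, BackoffRate, MaxAttempts) and prints the wait before each retry. The chosen values and the States.TaskFailed error name are assumptions for illustration; match the error name to how your task actually fails.

```python
import json


def build_retrier(interval_seconds: int, backoff_rate: float, max_attempts: int) -> dict:
    """Build one entry for a state's "Retry" list."""
    return {
        "ErrorEquals": ["States.TaskFailed"],  # assumed error name, adjust as needed
        "IntervalSeconds": interval_seconds,   # wait before the first retry
        "BackoffRate": backoff_rate,           # multiplier applied after each retry
        "MaxAttempts": max_attempts,           # number of retries before giving up
    }


def retry_delays(interval_seconds: int, backoff_rate: float, max_attempts: int) -> list[float]:
    """Seconds waited before each retry: interval * rate^0, rate^1, rate^2, ..."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


retrier = build_retrier(interval_seconds=2, backoff_rate=2.0, max_attempts=3)
print(json.dumps({"Retry": [retrier]}, indent=2))
print(retry_delays(2, 2.0, 3))
```

With these example values the retries wait 2, 4 and 8 seconds, so a short-lived provisioning failure has time to clear before the last attempt.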
I was hitting the same issue until I switched over to Fargate platform version 1.4.0.
It looks like there were some changes made to the networking side of things.
https://aws.amazon.com/blogs/containers/aws-fargate-launches-platform-version-1-4/
The default version is currently still set to 1.3.0, so try setting 1.4.0 explicitly and see if that fixes it for you.

Deployed jobs stopped working with an image error?

In the last few hours I am no longer able to execute deployed Data Fusion pipeline jobs - they just end in an error state almost instantly.
I can run the jobs in Preview mode, but when trying to run deployed jobs this error appears in the logs:
com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Selected software image version '1.2.65-deb9' can no longer be used to create new clusters. Please select a more recent image
I've tried with both an existing instance and a new instance, and all deployed jobs including the sample jobs give this error.
Any ideas? I cannot find any config options for what image is used for execution
We are currently investigating an issue with the image for Cloud Dataproc used by Cloud Data Fusion. We had pinned a version of Dataproc VM image for the launch that is causing an issue.
We apologize for the inconvenience. We are working to resolve the issue as soon as possible.
We will provide updates on this thread.
Nitin