Container App Environment creation timing out - azure-devops

Where I work has just started migrating to the cloud. We've successfully deployed a number of resources using Terraform and Pipelines into Azure.
Where we are running into issues is deploying a Container App Environment: we have code that was working in a less locked-down environment (set up for a proof of concept), but are now having issues using that code in our go-forward environment.
When deploying, the Container App Environment spends 30 minutes attempting to create before Terraform returns a "context deadline exceeded" error. Looking in the Azure Portal, I can see the resource in the "Waiting" provisioning state, and I can also see the MC_ and AKS resources that get generated. The resource then fails around 4 hours later.
Any advice?
I suspect it's related to security on the Virtual Network that the subnets sit on, but I'm not seeing any logs on the deployment to confirm it. The original subnets had a Network Security Group (NSG) assigned and I configured the rules that Microsoft provides; I then added a couple of subnets without an NSG assigned, but had no luck either way.
My next step is to try provisioning it via the GUI and see if that works.

I managed to break our build in the "anything goes" environment.
The root cause is an incomplete configuration of the Virtual Network, which has custom DNS entries. This has now been passed to our network architects to resolve. If I can get more details on the fix they apply, I'll include them here for anyone else who runs into this issue.
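For anyone who wants to rule custom DNS in or out before involving their network team, here is a minimal sketch (not the fix itself) using the azure-mgmt-network SDK to check whether a VNet has custom DNS servers configured; the subscription, resource group, and VNet names are placeholders.

from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

# Placeholders: substitute your own subscription ID, resource group, and VNet name.
client = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")
vnet = client.virtual_networks.get("<resource-group>", "<vnet-name>")

# No dhcp_options (or an empty dns_servers list) means the VNet uses Azure-provided DNS;
# any entries here are the custom DNS servers the Container App Environment will inherit.
dns_servers = vnet.dhcp_options.dns_servers if vnet.dhcp_options else []
print("Custom DNS servers:", dns_servers or "none (Azure default)")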

Related

How to find the auto-created service connection when deploying to AKS

During a pipeline run, under a deployment job, providing a deployment environment eliminates the need to provide a service connection manually. I'd guess it either creates a new service connection (SC) at that time, or it created one when the environment was created and keeps reusing it.
Either way, is there a way to find out which service connection is being used, from the logs of the pipeline run or from anywhere else?
In our setup, I see a lot of service connections for one environment, and a cleanup is necessary to get things in order.
I tried specifying the SC manually along with the environment and it works as expected, so going forward I can use this method. But for cleanup, I'd still like to know which one gets used when it's not specified! (None of the auto-created SCs show any execution history, but I know the deployment has happened multiple times.)
Since a Kubernetes resource in an environment references a Kubernetes service connection, you can use this API to get the serviceEndpointId of the Kubernetes resource, which is the ID of the referenced service connection.
GET https://dev.azure.com/{organization}/{project}/_apis/distributedtask/environments/{environmentId}/providers/kubernetes/{resourceId}?api-version=7.0
Using the serviceEndpointId value from the response of the above API as the endpointId, we can then call this API to get the details of the referenced service connection.
GET https://dev.azure.com/{organization}/{project}/_apis/serviceendpoint/endpoints/{endpointId}?api-version=7.0
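For reference, here is a minimal sketch chaining the two calls with Python and requests; the organisation, project, IDs, and PAT are placeholders, and the PAT needs read access to environments and service connections.

import requests

# Placeholders: fill in your own organisation, project, environment, and resource IDs.
org, project = "my-org", "my-project"
environment_id, resource_id = 12, 34
auth = ("", "<personal-access-token>")  # PAT passed as the password for basic auth

base = f"https://dev.azure.com/{org}/{project}/_apis"

# First call: the Kubernetes resource of the environment, which includes serviceEndpointId.
resource = requests.get(
    f"{base}/distributedtask/environments/{environment_id}/providers/kubernetes/{resource_id}",
    params={"api-version": "7.0"}, auth=auth,
).json()

# Second call: the referenced service connection itself.
endpoint_id = resource["serviceEndpointId"]
endpoint = requests.get(
    f"{base}/serviceendpoint/endpoints/{endpoint_id}",
    params={"api-version": "7.0"}, auth=auth,
).json()
print(endpoint["name"], endpoint["type"])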

CloudFormation: stack is stuck, CloudTrail events shows repeating DeleteNetworkInterface event

I am deploying a stack with CDK. It gets stuck in CREATE_IN_PROGRESS. CloudTrail shows the following events repeating:
DeleteNetworkInterface
CreateLogStream
What should I look at next to continue debugging? Is there a known reason for this to happen?
I saw the exact same issue with the deployment of a CDK-based ECS/Fargate stack.
In my case, I was able to diagnose the issue by following the AWS support article: https://aws.amazon.com/premiumsupport/knowledge-center/cloudformation-stack-stuck-progress/
What specifically diagnosed and then resolved it for me:
I updated my ECS service to set its desired task count to 0. At that point the CloudFormation stack completed successfully.
From that, it became obvious that the actual issue was related to the creation of the initial task for my ECS service. I was able to diagnose it by reviewing the output in the Deployment and Events tabs of the ECS service in the AWS Management Console. In my case, task creation was failing because of an issue with accessing the associated ECR repository. Obviously there could be other reasons, but they should show up there.
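For anyone who prefers a script over the console, a boto3 call along these lines should achieve the same "desired count to 0" step (cluster and service names are placeholders):

import boto3

# Placeholders: substitute your own cluster and service names.
# With desiredCount at 0, ECS stops trying to launch the failing task, which lets
# the stuck CloudFormation stack finish and confirms the task launch is the culprit.
ecs = boto3.client("ecs")
ecs.update_service(cluster="my-cluster", service="my-service", desiredCount=0)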

Timeout waiting for network interface provisioning to complete

Does anyone know why an ECS Fargate task would fail with this error?
"Timeout waiting for network interface provisioning to complete." I am running an ECS Fargate task using Step Functions. The IAM role for the Step Function has access to the task definition, and the state machine code also looks good. The same Step Function worked fine before, but I ran into this error just now. Why would this happen? Is it occasional?
According to AWS support, intermittent failures of this nature are to be expected (with relatively low probability).
The recommendation was to set retryAttempts > 1 to handle these situations.
This can happen if there are problems within AWS. You can view the Network Interfaces page on the EC2 console and you may see errors loading, which is an indication of API problems within EC2. You can also check status.aws.amazon.com to look for errors. Note that AWS can be slow to acknowledge problems there, so you may experience the errors before they update the status page!
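If you want to check the same thing programmatically rather than through the console, a rough boto3 sketch like the one below lists the ENIs in the task's subnet (the subnet ID is a placeholder):

import boto3

# Placeholder: the subnet your Fargate task runs in.
SUBNET_ID = "subnet-0123456789abcdef0"

ec2 = boto3.client("ec2")
resp = ec2.describe_network_interfaces(
    Filters=[{"Name": "subnet-id", "Values": [SUBNET_ID]}]
)
for eni in resp["NetworkInterfaces"]:
    # "Status" shows whether each ENI is available, attaching, or in use.
    print(eni["NetworkInterfaceId"], eni["Status"], eni.get("Description", ""))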
AWS support has a detailed post on resolving network interface provisioning errors for ECS on Fargate. Here's an excerpt from it:
If the Fargate service tries to attach an elastic network interface to the underlying infrastructure that the task is meant to run on, then you can receive the following error message: "Timeout waiting for network interface provisioning to complete."
Fargate usually faces intermittent API issues while spinning up tasks from Step Functions and AWS Batch jobs. As recommended in another answer, you can update the MaxAttempts for the retry in the state machine definition, for example (ErrorEquals is required; the interval and backoff values below are illustrative):
"Retry": [
{
"MaxAttempts": 3,
}
]
Additionally, reattempts can be automated with exponential backoff in AWS Step Functions via the BackoffRate shown above.
I was hitting the same issue until I switched over to Fargate platform version 1.4.0.
It looks like there were some changes made to the networking side of things.
https://aws.amazon.com/blogs/containers/aws-fargate-launches-platform-version-1-4/
The default version is currently still 1.3.0, so maybe give 1.4.0 a try and see if it fixes it for you.
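As a rough sketch of pinning the version, assuming you start the task directly with boto3's run_task (cluster, task definition, and subnet are placeholders); in a Step Functions ECS task state the equivalent is the PlatformVersion parameter:

import boto3

# Placeholders: cluster, task definition, and subnet are illustrative only.
ecs = boto3.client("ecs")
ecs.run_task(
    cluster="my-cluster",
    taskDefinition="my-task-def",
    launchType="FARGATE",
    platformVersion="1.4.0",  # pin explicitly instead of relying on the default (1.3.0)
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "assignPublicIp": "DISABLED",
        }
    },
)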

Why would running a container on GCE get stuck Metadata request unsuccessful forbidden (403)

I'm trying to run a container in a custom VM on Google Compute Engine. This is to perform a heavy ETL process so I need a large machine but only for a couple of hours a month. I have two versions of my container with small startup changes. Both versions were built and pushed to the same google container registry by the same computer using the same Google login. The older one works fine but the newer one fails by getting stuck in an endless list of the following error:
E0927 09:10:13 7f5be3fff700 api_server.cc:184 Metadata request unsuccessful: Server responded with 'Forbidden' (403): Transport endpoint is not connected
Can anyone tell me exactly what's going on here? Why does one of my images not have this problem (well, it gives a few of these messages but gets past them) while the other does (thousands of these messages, and it ran for over 24 hours before I killed it)?
If I SSH in to a GCE instance, both versions of the container pull and run just fine. I suspect the INTEGRITY_RULE checking from the logs, but I know nothing about how that works.
MORE INFO: this is down to "restart policy: never". Even a simple CentOS 7 container that says "hello world", deployed from the console, triggers this if the restart policy is "never". At least in the short term I can fix this in the entrypoint script, as the instance will be destroyed when the monitor realises that the process has finished.
I suggest you try creating a 3rd container that's focused on the metadata service functionality to isolate the issue. It may be that there's a timing difference between the 2 containers that's not being overcome.
Make sure you can curl the metadata service from the VM and that the request to the metadata service is using the VM's service account.
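A quick way to do that check, run on the VM itself, is something like the following (these are the standard GCE metadata endpoints; the Metadata-Flavor header is required):

import requests

# Query the metadata server for the attached service account and a token;
# a 403/404 here points at the same problem the container-startup agent is hitting.
headers = {"Metadata-Flavor": "Google"}
base = "http://metadata.google.internal/computeMetadata/v1"

email = requests.get(f"{base}/instance/service-accounts/default/email", headers=headers)
token = requests.get(f"{base}/instance/service-accounts/default/token", headers=headers)
print("service account:", email.status_code, email.text)
print("token request:", token.status_code)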

How to auto retry deployments to agents when they come online again (after having been offline)

When using Azure pipelines and deployment groups it is possible to re-deploy the "last successful" release to new agents with given "tags" using the instructions found here:
https://learn.microsoft.com/en-us/azure/devops/release-notes/2018/jul-10-vsts#automatically-deploy-to-new-targets-in-a-deployment-group
My issue is when releasing to a deployment group consisting of 3 machines: 2 are online and 1 is periodically offline. In this situation my release fails when that 1 machine is offline. This would be OK by me if Azure Pipelines retried the deployment when the offline machine comes back online. I thought this would work in the same way as "new targets", but I still haven't figured out how.
This is just a small test. When going in production my deployment group will consist of hundreds of machines and not all of them will be online at the same time.
So, is it possible to automate the process so that all machines eventually end up up to date once they have all been online?
Octopus Deploy seems to have this feature:
https://help.octopusdeploy.com/discussions/questions/9351-possibility-to-deploy-when-agent-become-online
https://octopus.com/docs/deployment-patterns/elastic-and-transient-environments/deploying-to-transient-targets
Status after failed deployment (and target is online again):
Well, in general queued deployments will be triggered automatically once the agent is online. But failed deployments have to be re-deployed manually; there is no way to retry them automatically when the agent comes back online...
Based on my test, to redeploy to all not-yet-updated agents, you have to remove the other target machines (those that passed in the previous deployment) from the deployment group...
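If it helps with the cleanup, the deployment group targets can also be listed over the REST API to see which agents are offline before re-deploying. A rough sketch: the organisation, project, group ID, and PAT are placeholders, and the api-version and exact response field names may need checking for your organisation.

import requests

# Placeholders: organisation, project, deployment group ID, and PAT are illustrative,
# and the api-version below may need adjusting for your organisation.
org, project, deployment_group_id = "my-org", "my-project", 7
auth = ("", "<personal-access-token>")

url = (f"https://dev.azure.com/{org}/{project}/_apis/distributedtask/"
       f"deploymentgroups/{deployment_group_id}/targets")
targets = requests.get(url, params={"api-version": "7.1-preview.1"}, auth=auth).json()

for target in targets.get("value", []):
    # Each target carries its agent record, including an online/offline status.
    agent = target.get("agent", {})
    print(agent.get("name"), agent.get("status"))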