ECS pulling an image from Quay.io and spinning up an EC2 Spot instance: infinitely waiting for task to start - desiredCount = 1, pendingCount = 0

I've set up a pipeline which talks to ECS and spins up an EC2 Spot instance.
It gets stuck on the following message:
PRIMARY task ******:5 - runningCount = 0, desiredCount = 1, pendingCount = 0
which basically means I'm waiting for the task to start, but something is off in the setup and it never gets started. Any suggestions on where to look?
Notes:
This is a testing app which spins up a browser, so no ports are required.
There is no load balancer.
It's possibly a quay.io integration miss, but I can't figure it out with no logs.
The CloudTrail log is empty, with only success messages upon task definition create and update.
Thanks

After about 8 hours of hammering my head against the wall, this issue was solved.
It had been answered long ago by this fella: https://stackoverflow.com/a/36533601/5332494
Steps it took me to figure it out:
1. Look in CloudTrail => Event history => Event name column (UpdateService) => click on View event => find the error message there (was unable to place a task because no container instance met all of its requirements. Reason: No Container Instances were found in your cluster. For more information, see the Troubleshooting section of the Amazon ECS Developer Guide), which will take you to https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-event-messages.html#service-event-messages-1
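The same message also shows up in the ECS service's own event feed, which can be quicker to query than CloudTrail. A minimal boto3 sketch, assuming placeholder cluster and service names:

    import boto3

    ecs = boto3.client("ecs")

    # "my-cluster" and "my-service" are placeholders for your own names.
    resp = ecs.describe_services(cluster="my-cluster", services=["my-service"])

    # Recent service events carry the placement failure reason, e.g.
    # "... unable to place a task because no container instance met
    # all of its requirements ...".
    for event in resp["services"][0]["events"][:10]:
        print(event["createdAt"], event["message"])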
2. The page linked above lists the possible issues you are having if you got the same message as I did (see step 1). The first option on that page,
No container instances were found in your cluster
took me to https://docs.aws.amazon.com/AmazonECS/latest/developerguide/launch_container_instance.html
That's where I added a container instance to my ECS cluster and was finally able to add an EC2 Spot instance through the Codefresh pipeline talking to ECS.
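If you'd rather script that container-instance step than use the console wizard, here is a sketch of launching a Spot instance that registers itself with the cluster. The AMI ID, instance type, profile name, and cluster name are placeholder assumptions; the instance profile must carry the standard ECS instance role:

    import boto3

    ec2 = boto3.client("ec2")

    # The ECS agent on an ECS-optimized AMI joins whatever cluster is
    # named in /etc/ecs/ecs.config; "my-cluster" is a placeholder.
    user_data = "#!/bin/bash\necho ECS_CLUSTER=my-cluster >> /etc/ecs/ecs.config\n"

    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder: an ECS-optimized AMI
        InstanceType="t3.medium",         # placeholder size
        MinCount=1,
        MaxCount=1,
        IamInstanceProfile={"Name": "ecsInstanceRole"},
        InstanceMarketOptions={"MarketType": "spot"},  # Spot, as in the original setup
        UserData=user_data,
    )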
Notes:
ECS had to talk to Quay.io to pull the Docker image from their private registry. All I had to do was create a secret in AWS Secrets Manager with the following format:

    {
      "username": "your-Quay-Username",
      "password": "your-Quay-password"
    }
That's it :)
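For completeness, here is a sketch of the same wiring in boto3. ECS consumes the secret through the container definition's repositoryCredentials field, and the task execution role must be allowed to read it; all names, ARNs, and the image path below are placeholders:

    import json
    import boto3

    sm = boto3.client("secretsmanager")
    ecs = boto3.client("ecs")

    # Store the Quay credentials in the format shown above.
    secret = sm.create_secret(
        Name="quay-credentials",  # placeholder name
        SecretString=json.dumps({"username": "your-Quay-Username",
                                 "password": "your-Quay-password"}),
    )

    # Point the container definition at the secret so ECS can
    # authenticate against the private Quay registry when pulling.
    ecs.register_task_definition(
        family="my-task",  # placeholder
        # Placeholder role; it needs secretsmanager:GetSecretValue.
        executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
        containerDefinitions=[{
            "name": "app",
            "image": "quay.io/myorg/myapp:latest",  # placeholder image
            "memory": 512,
            "repositoryCredentials": {"credentialsParameter": secret["ARN"]},
        }],
    )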

Related

Why can't I see my cluster when I'm trying to set up a scheduled task?

I have a cluster in ECS with 20+ services all happily running in it.
I've just uploaded a new image which I want to set up as a daily task. I can create it as a task and run it - the logs indicate it runs to completion.
I've gone into EventBridge and created a rule, set the detail and cron expression, and selected the target (AWS service), then ECS task; but when I drop the Cluster dropdown it is empty. I can't select a cluster - there are none.
Is this perhaps a security issue, or am I missing something elsewhere - can't this be done?
Any help would be much appreciated.
Eventually I managed to get this to work. The problem was that I was starting the EventBridge creation process in the wrong region - rookie mistake - so it couldn't see the cluster in the other region. D'oh!
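If you create the rule with the SDK instead of the console, the same mistake is easy to avoid by pinning the client to the cluster's region explicitly. A minimal boto3 sketch; every name, ARN, region, and the cron expression below are placeholders, and the role must allow EventBridge to run ECS tasks:

    import boto3

    # Pin the client to the region the cluster actually lives in.
    events = boto3.client("events", region_name="eu-west-1")  # placeholder region

    events.put_rule(
        Name="daily-task",                       # placeholder
        ScheduleExpression="cron(0 3 * * ? *)",  # placeholder: 03:00 UTC daily
    )

    events.put_targets(
        Rule="daily-task",
        Targets=[{
            "Id": "run-daily-task",
            "Arn": "arn:aws:ecs:eu-west-1:123456789012:cluster/my-cluster",
            "RoleArn": "arn:aws:iam::123456789012:role/ecsEventsRole",
            "EcsParameters": {
                "TaskDefinitionArn": "arn:aws:ecs:eu-west-1:"
                                     "123456789012:task-definition/my-task",
                "TaskCount": 1,
                "LaunchType": "FARGATE",
                "NetworkConfiguration": {
                    "awsvpcConfiguration": {
                        "Subnets": ["subnet-0123456789abcdef0"],  # placeholder
                        "AssignPublicIp": "ENABLED",
                    },
                },
            },
        }],
    )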

CloudFormation: stack is stuck, CloudTrail shows a repeating DeleteNetworkInterface event

I am deploying a stack with CDK. It gets stuck in CREATE_IN_PROGRESS, and CloudTrail shows these events repeating in the logs:
DeleteNetworkInterface
CreateLogStream
What should I look at next to continue debugging? Is there a known reason for this to happen?
I saw the exact same issue with a CDK-based ECS/Fargate deployment.
In my case, I was able to diagnose the issue by following the AWS support article https://aws.amazon.com/premiumsupport/knowledge-center/cloudformation-stack-stuck-progress/
What specifically diagnosed and then resolved it for me:
I updated my ECS service to set its desired task count to 0. At that point the CloudFormation stack completed successfully.
From that, it became obvious that the actual issue was related to the creation of the initial task for my ECS service. I was able to diagnose that by reviewing the output in the Deployment and Events tabs of the ECS service in the AWS Management Console. In my case, the task creation was failing because of an issue with accessing the associated ECR repository. Obviously there could be other reasons, but they should show up there.
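The "set desired count to 0" step is a one-liner if you'd rather script it; the cluster and service names here are placeholders:

    import boto3

    ecs = boto3.client("ecs")

    # Dropping the desired count to 0 lets CloudFormation finish, which
    # confirms the stack is fine and the initial task is the problem.
    ecs.update_service(cluster="my-cluster", service="my-service", desiredCount=0)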

AWS ECS won't start tasks: http request timed out enforced after 4999ms

I have an ECS cluster (Fargate), task, and service that I have had set up in Terraform for at least a year. I haven't touched it in a long while. My normal deployment for updating the code is to push a new container to the registry and then stop all tasks on the cluster with a script (roughly like the sketch below). Today, my service did not run a new task in response to that task being stopped. Its desired count is fixed, so it should.
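A sketch of that kind of stop-all-tasks script, assuming boto3 and placeholder names (not the asker's actual script):

    import boto3

    ecs = boto3.client("ecs")
    cluster = "my-cluster"  # placeholder

    # Stop every running task; the service's fixed desired count should
    # then make ECS start fresh tasks from the newly pushed image.
    paginator = ecs.get_paginator("list_tasks")
    for page in paginator.paginate(cluster=cluster, desiredStatus="RUNNING"):
        for task_arn in page["taskArns"]:
            ecs.stop_task(cluster=cluster, task=task_arn, reason="deploy: refresh image")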
I have gone in and tried to run this manually, and I'm seeing this error:
Unable to run task
Http request timed out enforced after 4999ms
When I try to do this, a new stopped task is added to my stopped tasks list. When I look into that task, the stopped reason is "Deployment restart", and two of them are now showing "Task provisioning failed.", which I think might be tasks the service tried to start. But these tasks do not show a started timestamp. The ones I start in the console have a started timestamp.
My site is now down and I can't get it back up. Does anyone know of a way to debug this? Is AWS ECS experiencing problems right now? I checked the health monitors and I see no issues.
This was an AWS outage affecting Fargate in us-east-1. It's fixed now.

Why does the New Relic ECS integration task require AmazonEC2ContainerServiceforEC2Role?

We are trying to use the AWS CloudFormation way of installing the ECS integration for our clusters with New Relic, as described in this link.
I observed that this CloudFormation template first creates a few IAM roles for the task that will be executed as a daemon service, and one of the roles attached to the task is AmazonEC2ContainerServiceforEC2Role, which includes permissions to operate on container instances, including deregistering the container instance.
I am interested to understand under what circumstances this daemon task would be required to deregister an instance, or for that matter create a cluster or register an instance. The complete list of permissions granted by the IAM policy is below. Can someone please elaborate on why we would need this in the first place?
I tried putting this in the New Relic discussion forums but haven't had any luck yet.
"ec2:DescribeTags", "ecs:CreateCluster", "ecs:DeregisterContainerInstance", "ecs:DiscoverPollEndpoint", "ecs:Poll", "ecs:RegisterContainerInstance", "ecs:StartTelemetrySession", "ecs:UpdateContainerInstancesState", "ecs:Submit*", "ecr:GetAuthorizationToken", "ecr:BatchCheckLayerAvailability", "ecr:GetDownloadUrlForLayer", "ecr:BatchGetImage", "logs:CreateLogStream", "logs:PutLogEvents"

Why would running a container on GCE get stuck on "Metadata request unsuccessful: forbidden (403)"?

I'm trying to run a container in a custom VM on Google Compute Engine. This is to perform a heavy ETL process, so I need a large machine, but only for a couple of hours a month. I have two versions of my container with small startup changes. Both versions were built and pushed to the same Google Container Registry by the same computer using the same Google login. The older one works fine, but the newer one fails by getting stuck in an endless list of the following error:
E0927 09:10:13 7f5be3fff700 api_server.cc:184 Metadata request unsuccessful: Server responded with 'Forbidden' (403): Transport endpoint is not connected
Can anyone tell me exactly what's going on here? Can anyone explain why one of my images doesn't have this problem (well, it gives a few of these messages but gets past them) and the other does (thousands of this message, taking over 24 hours before I killed it)?
If I SSH into a GCE instance, then both versions of the container pull and run just fine. I suspect the INTEGRITY_RULE checking from the logs, but I know nothing about how that works.
MORE INFO: this is down to "restart policy: never". Even a simple centos:7 container that says "hello world", deployed from the console, triggers this if the restart policy is never. At least in the short term I can fix this in the entrypoint script, as the instance will be destroyed when the monitor realises that the process has finished.
I suggest you try creating a third container that's focused on the metadata service functionality, to isolate the issue. It may be that there's a timing difference between the two containers that's not being overcome.
Make sure you can curl the metadata service from the VM and that the request to the metadata service is using the VM's service account.
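A quick way to run that check from inside the VM or container, sketched with Python's requests; the endpoint and Metadata-Flavor header are the standard GCE metadata-server conventions:

    import requests

    # GCE's metadata server rejects requests that lack this header,
    # which surfaces as 403/Forbidden responses.
    resp = requests.get(
        "http://metadata.google.internal/computeMetadata/v1/"
        "instance/service-accounts/default/email",
        headers={"Metadata-Flavor": "Google"},
        timeout=5,
    )
    print(resp.status_code, resp.text)  # expect 200 and the service-account email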