ECS task stuck in PENDING state - amazon-ecs

I am trying out AWS App Mesh. I have pushed an image to ECR that starts a web server on port 8080, and created an ECS service for it. I have been following this guide just to try out the service: https://docs.aws.amazon.com/app-mesh/latest/userguide/getting-started-ecs.html . When I get to the part where I update my service with App Mesh enabled, my task gets stuck in the PENDING state and the Envoy container is unhealthy (screenshot attached).
I am using
840364872350.dkr.ecr.us-east-1.amazonaws.com/aws-appmesh-envoy:v1.21.1.2-prod
as the Envoy image.
To be honest, I don't really understand how this works, and I would like to know how I can debug this to understand the problem. Thank you in advance!

With the ECS App Mesh integration, a container dependency (container startup order) is added to the application container so that it only starts once the Envoy container is healthy. See the images below:
Application Container with Container Ordering (View container)
Application Container with Container Ordering (Edit task/container)
To find out why the Envoy container is UNHEALTHY, I would suggest enabling logging on it. In my scenario the Envoy container couldn't retrieve credentials from the EC2 metadata service (see the snippet below), so it stayed UNHEALTHY and the ECS task remained in PENDING indefinitely.
[error][aws] [source/extensions/common/aws/credentials_provider_impl.cc:118] Could not retrieve credentials listing from the EC2MetadataService
After adding the necessary permissions to the ECS task role, the issue was resolved: the Envoy container became healthy and the application container could start as well. Hope the above helps.
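For reference, a minimal sketch of the relevant parts of such a task definition, assuming an application container named "app", the standard App Mesh Envoy health check from the AWS docs, and placeholder values for the account, virtual node ARN, and log group:

{
  "containerDefinitions": [
    {
      "name": "app",
      "essential": true,
      "dependsOn": [
        { "containerName": "envoy", "condition": "HEALTHY" }
      ]
    },
    {
      "name": "envoy",
      "image": "840364872350.dkr.ecr.us-east-1.amazonaws.com/aws-appmesh-envoy:v1.21.1.2-prod",
      "essential": true,
      "environment": [
        { "name": "APPMESH_RESOURCE_ARN", "value": "arn:aws:appmesh:us-east-1:123456789012:mesh/my-mesh/virtualNode/my-node" }
      ],
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -s http://localhost:9901/server_info | grep state | grep -q LIVE"],
        "startPeriod": 10,
        "interval": 5,
        "timeout": 2,
        "retries": 3
      },
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/envoy",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "envoy"
        }
      }
    }
  ]
}

Exactly which permission is missing depends on your setup, but the task role typically needs App Mesh access (for example appmesh:StreamAggregatedResources, which the AWSAppMeshEnvoyAccess managed policy grants) so the proxy can pull its configuration.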

Related

Why would Storm UI running on ECS give me a jetty 404 error?

I'm running a Storm (v1.2.1) container on ECS with the command:
storm,ui
The UI container runs in the same task as the ZooKeeper and Nimbus containers.
I've deployed the task on Fargate as a service with service discovery enabled.
The containers are all running fine with no errors in the logs.
I've configured the task definition to map port 8080 so that I can access the storm UI.
However, when I try, all I get is a Jetty 404 page. This tells me that I'm hitting the container, but somehow the Storm UI is not there. I suppose an alternative is that I'm hitting a different container, but I'm not sure how that would be possible.
This is the error I see (screenshot). Why is the Storm UI giving me a 404?
I finally got access to the container logs using ECS Exec (execute-command), and the error was that the UI couldn't bind to the port.
I modified the Storm UI container to include the following command flag:
-c ui.port=[not 8080]
and mapped the new port from the container to the host. After that, the UI worked.
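For reference, a sketch of the two pieces involved, assuming 8081 as the replacement port (the real port is redacted above) and placeholder cluster and task names. First the container definition excerpt, then the ECS Exec command used to get a shell and read the logs:

"containerDefinitions": [
  {
    "name": "storm-ui",
    "command": ["storm", "ui", "-c", "ui.port=8081"],
    "portMappings": [
      { "containerPort": 8081, "hostPort": 8081, "protocol": "tcp" }
    ]
  }
]

aws ecs execute-command --cluster my-cluster --task <task-id> \
    --container storm-ui --interactive --command "/bin/sh"

Note that on Fargate (awsvpc networking) the host port must match the container port.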

How to avoid starting kubernetes pod liveness checks until all containers are running

I'm having trouble with health checks starting too early on a kubernetes pod with multiple containers. My pod is set up like this:
main-container (nodejs)
sidecar container (http proxy)
Currently the health checks are configured on the sidecar container, and end up hitting both containers (proxy, then main container).
If the main container starts quickly, then everything is fine. But if the sidecar starts quickly and the main container starts slowly (e.g. if the image needs to be pulled) then the initial liveness checks start on the sidecar before the other container has even started.
Is there a way of telling kubernetes: don't start running any probes (liveness or readiness checks) until all the containers in the pod have started?
I know I can use a startupProbe to be more generous while waiting for startup, but ideally, and to avoid other monitoring warnings, I'd prefer to suppress the health/liveness probes completely until all the containers have started.
Answering your question: yes, there is a way of doing so, using a startupProbe on your sidecar container that points to your main application's open port. As per the documentation, all other probes on that container are disabled until the startup probe succeeds. For more information about how to set up a startup probe, see the Kubernetes documentation.
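A minimal sketch of what that could look like, assuming the main Node.js container listens on port 3000 and exposes a /healthz path (both placeholders); since containers in a pod share the network namespace, the sidecar's startup probe can target the main container over localhost:

apiVersion: v1
kind: Pod
metadata:
  name: app-with-proxy
spec:
  containers:
  - name: main-container            # Node.js app, assumed to listen on 3000
    image: my-node-app:latest       # placeholder image
    ports:
    - containerPort: 3000
  - name: sidecar                   # HTTP proxy
    image: my-proxy:latest          # placeholder image
    ports:
    - containerPort: 8080
    # Until this startup probe succeeds, the liveness and readiness
    # probes on this container are not run at all.
    startupProbe:
      httpGet:
        path: /healthz              # hypothetical health path served by the main app
        port: 3000                  # main container's port, reachable via the shared pod network
      periodSeconds: 5
      failureThreshold: 60          # allow up to ~5 minutes for the main container to come up
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10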

Connection refused error in outbound request in k8s app container. Istio?

Updated
I have a script that initializes our service.
The script fails when it runs in the container because of a connection refused error on the first outbound request (to an external service) in the script.
We tried adding a loop that runs curl and, if it fails, retries; once it succeeds, the script continues.
Sometimes it succeeds on the first try, sometimes it fails 10-15 times in a row.
We recently started using Istio.
What might be the reason?
It is a well-known Istio bug: https://github.com/istio/istio/issues/11130 (app container unable to connect to the network before Istio's sidecar is fully running). It seems the Istio proxy does not start in parallel; it waits for the app container to be ready. As one blogger mentioned (https://medium.com/@marko.luksa/delaying-application-start-until-sidecar-is-ready-2ec2d21a7b74): "most Kubernetes users assume that after a pod's init containers have finished, the pod's regular containers are started in parallel. It turns out that's not the case."
Containers start in the order defined in the Deployment spec YAML.
So the biggest question is whether the Istio proxy (Envoy) will start while the first container is stuck in a curl loop (a chicken-and-egg problem).
The app container's script performs:
until curl --head localhost:15000 ; do echo "Waiting for Istio Proxy to start" ; sleep 3 ; done
As far as I saw, that script doesn't help at all: the proxy is up, but connections to the external hostname still return "connection refused".
Istio 1.7 introduces a new feature that configures the pod to start the sidecar first and hold every other container until the sidecar has started.
Just set values.global.proxy.holdApplicationUntilProxyStarts to true.
Please note that the feature is still experimental.
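A sketch of how it can be switched on: mesh-wide via the value path mentioned above, and, on newer releases, per pod via the proxy config annotation (the pod name and image are placeholders):

# Mesh-wide, at install/upgrade time:
istioctl install --set values.global.proxy.holdApplicationUntilProxyStarts=true

# Per-pod override via the proxy config annotation:
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  annotations:
    proxy.istio.io/config: |
      holdApplicationUntilProxyStarts: true
spec:
  containers:
  - name: app
    image: my-app:latest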

Starting a container/pod after running the istio-proxy

I am trying to add a service mesh to a service on Kubernetes using Istio and Envoy. I was able to set up the service and istio-proxy, but I am not able to control the order in which the container and istio-proxy are started.
My container is started first and tries to access an external resource via TCP, but at that point istio-proxy has not completely loaded, and neither has the ServiceEntry for the external resource.
I tried adding a panic in my service and also tried a 5-second sleep before accessing the external resource.
Is there a way that I can control the order of these?
On Istio 1.7.x and above you can set the configuration option values.global.proxy.holdApplicationUntilProxyStarts, which causes the sidecar injector to inject the sidecar at the start of the pod's container list and configures it to block the start of all other containers until the proxy is ready. This option is disabled by default.
According to https://istio.io/latest/news/releases/1.7.x/announcing-1.7/change-notes/
I don't think you can control the order other than by listing the containers in a particular order in your pod spec. So I recommend you configure a readiness probe so that your pod is not marked ready until your service can actually send traffic to the outside.
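If you go the readiness-probe route, a rough sketch could look like this; the external URL, image, and timings are placeholders, and the probe assumes curl is available in the application image:

apiVersion: v1
kind: Pod
metadata:
  name: my-service
spec:
  containers:
  - name: app
    image: my-service:latest        # placeholder image containing curl
    readinessProbe:
      exec:
        # The pod is only marked Ready (and receives traffic) once an
        # outbound request through the sidecar actually succeeds.
        command: ["sh", "-c", "curl -fsS --max-time 2 https://external.example.com/health"]
      initialDelaySeconds: 5
      periodSeconds: 5
      failureThreshold: 12

Note that a readiness probe only gates traffic to the pod; it does not delay your container's own startup code, so the application still has to tolerate the proxy being unavailable for the first few seconds.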
Github issue here:
Support startup dependencies between containers on the same Pod
We're currently recommending that developers solve this problem themselves by running a startup script on their application container which delays application startup until Envoy has received its initial configuration. However, this is a bit of a hack and requires changes to every one of the developer's containers.

Application container unable to access network before sidecar ready

I was trying the Fortio server/client application on Istio. I used istioctl to inject the Istio sidecar, and my server pod came up fine. But the client pod was giving a connection refused error because the proxy sidecar was not yet ready to handle the client's connection requests. Please help me address this issue. For reference, I am attaching my YAML files.
This is by design and there is no way around it.
The part responsible for configuring the iptables rules that capture the traffic runs as an init container, which ensures that the required rules are in place before any of the normal pod containers start up. If you route all traffic through Istio, then until its sidecar container is ready, no network traffic will get in or out of the application container.
You should make sure your application handles this correctly. Apps should be able to withstand temporary unavailability of their dependencies, both at startup and during operation. In the worst case you can introduce your own handling, e.g. a custom entrypoint that waits for connectivity to be up.
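A sketch of that kind of custom entrypoint, expressed as a container command; the sidecar readiness endpoint port (15021 on current Istio releases, 15020 on older ones), the image, and the final binary path are assumptions for illustration:

apiVersion: v1
kind: Pod
metadata:
  name: client
spec:
  containers:
  - name: app
    image: my-app:latest             # placeholder; must contain sh and curl
    command: ["sh", "-c"]
    args:
    - |
      # Wait until the istio-agent reports the sidecar as ready,
      # then hand over to the real workload.
      until curl -fsS http://localhost:15021/healthz/ready >/dev/null 2>&1; do
        echo "waiting for the Envoy sidecar..."
        sleep 2
      done
      exec /app/start                # placeholder for the real entrypoint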