ECS Fargate health check issue - amazon-ecs

I am using the ECS Fargate launch type. I have two task definitions, and in each definition I have one service's container running (I have placed 1 container in each TD).
Then I have two ECS services which run these two task definitions.
The network mode used is awsvpc (required for Fargate).
I have created a load balancer which has two target groups, and I have defined rules to divert traffic.
In the target groups I have added the IPs of the tasks running from the two task definitions (as the network mode is awsvpc, each task gets an ENI).
I have seen that for one of the target groups the health check is failing continuously with HTTP code 502.
I am observing that the IP from the ENI for that service keeps changing, and the same is updated in the target group.
Questions:
Does ECS change the IP in the target group automatically?
How do I troubleshoot this HTTP code 502? Since this is Fargate, I cannot even log in to the container.
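For the troubleshooting part, a minimal boto3 sketch (the target group ARN and the cluster/service names below are placeholders, not values from this setup) that surfaces both the health check failure reason reported by the target group and the reason ECS gives for stopping tasks:

import boto3

# Placeholders: substitute your own target group ARN, cluster and service names.
TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:<region>:<account>:targetgroup/my-tg/<id>"
CLUSTER = "my-cluster"
SERVICE = "my-service"

elbv2 = boto3.client("elbv2")
ecs = boto3.client("ecs")

# 1. Why is the target group marking targets unhealthy? (Reason/Description fields)
health = elbv2.describe_target_health(TargetGroupArn=TARGET_GROUP_ARN)
for desc in health["TargetHealthDescriptions"]:
    print(desc["Target"]["Id"],
          desc["TargetHealth"].get("Reason"),
          desc["TargetHealth"].get("Description"))

# 2. Why were recent tasks stopped and replaced? (stoppedReason on stopped tasks)
stopped = ecs.list_tasks(cluster=CLUSTER, serviceName=SERVICE, desiredStatus="STOPPED")
if stopped["taskArns"]:
    for task in ecs.describe_tasks(cluster=CLUSTER, tasks=stopped["taskArns"])["tasks"]:
        print(task["taskArn"], task.get("stoppedReason"))

A 502 generally means the load balancer reached the task but got a bad or dropped response, so the container logs in CloudWatch (if the awslogs driver is configured) are the other place to look.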

Related

Is AWS NLB supported for ECS?

Question
Is NLB supported for ECS with dynamic port mapping?
Background
It looks like there have been attempts to use NLB with ECS, but with problems around health checks.
Network Load Balancer for inter-service communication
Health check interval for Network Load Balancer Target Group
NLB Target Group health checks are out of control
When we talked with AWS, they acknowledged that the NLB documentation of the health check interval is not accurate, as an NLB has multiple instances each sending health checks; hence the interval at which an ECS task gets health checked does not follow HealthCheckIntervalSeconds.
Also, the ECS task page specifically mentions ALB for dynamic port mapping.
Hence, I suppose NLB is not supported for ECS? If there is documentation which states that NLB is supported for ECS, please point to it.
Update
Why are properly functioning Amazon ECS tasks registered to ELB marked as unhealthy and replaced?
Elastic Load Balancing is repeatedly flagging properly functioning Amazon Elastic Container Service (Amazon ECS) tasks as unhealthy. These incorrectly flagged tasks are stopped and new tasks are started to replace them. How can I troubleshoot this?
change the Health check grace period to an appropriate time period for your service
A Network Load Balancer makes routing decisions at the transport layer (TCP/SSL). It can handle millions of requests per second. After the load balancer receives a connection, it selects a target from the target group for the default rule using a flow hash routing algorithm. It attempts to open a TCP connection to the selected target on the port specified in the listener configuration. It forwards the request without modifying the headers. Network Load Balancers support dynamic host port mapping.
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/load-balancer-types.html#nlb
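For reference, the grace period suggestion above can be applied when updating the service; a minimal boto3 sketch, with placeholder cluster/service names and an arbitrary 120-second value:

import boto3

ecs = boto3.client("ecs")

# Give newly started tasks time to warm up before failed load balancer health
# checks can cause ECS to stop and replace them. Names and value are placeholders.
ecs.update_service(
    cluster="my-cluster",
    service="my-service",
    healthCheckGracePeriodSeconds=120,
)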

ECS+NLB does not support dynamic port hence only 1 task per EC2 instance?

Please confirm whether these are true, or point to the official AWS documentation that describes how to use dynamic port mapping with NLB and run multiple copies of the same task on an ECS EC2 instance. I am not using Fargate.
ECS+NLB does NOT support dynamic port mapping, hence
ECS+NLB can only allow 1 task (docker container) per EC2 instance in an ECS service
This is because:
AWS ECS Developer Guide - Creating a Load Balancer only mentions that ALB can use dynamic ports, and does not mention NLB.
Application Load Balancers offer several features that make them attractive for use with Amazon ECS services:
* Application Load Balancers allow containers to use dynamic host port mapping (so that multiple tasks from the same service are allowed per container instance).
The ECS task creation page clearly states that dynamic ports are for ALB.
Network Load Balancer for inter-service communication quotes a response from the AWS support:
"However, I would like to point out that there is currently an ongoing issue with the NLB functionality with ECS, mostly seen with dynamic port mapping where the container is not able to stabilize due to health check errors, I believe the error you're seeing is related to that issue. I can only recommend that you use the ALB for now, as the NLB is still quite new so it's not fully compatible with ECS yet."
Updates
Found a document stating that NLB supports dynamic ports. However, if I switch from ALB to NLB, the ECS service does not work. When I log into an EC2 instance, the ECS agent is running but no Docker container is running.
If someone has managed to make ECS (EC2 launch type) + NLB work, please provide step-by-step details of how it was done.
Amazon ECS Developer Guide - Service Load Balancing - Load Balancer Types - NLB
Network Load Balancers support dynamic host port mapping. For example, if your task's container definition specifies port 80 for an NGINX container port, and port 0 for the host port, then the host port is dynamically chosen from the ephemeral port range of the container instance (such as 32768 to 61000 on the latest Amazon ECS-optimized AMI). When the task is launched, the NGINX container is registered with the Network Load Balancer as an instance ID and port combination, and traffic is distributed to the instance ID and port corresponding to that container. This dynamic mapping allows you to have multiple tasks from a single service on the same container instance.
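To make the quoted example concrete, here is a hedged boto3 sketch of an EC2 launch type task definition with an NGINX container on container port 80 and host port 0 (the family name and image tag are arbitrary):

import boto3

ecs = boto3.client("ecs")

# Host port 0 asks the container agent to pick a port from the ephemeral range,
# which is what allows several copies of this task on one container instance.
ecs.register_task_definition(
    family="nginx-dynamic-port",        # arbitrary family name
    networkMode="bridge",               # dynamic host ports apply to bridge mode
    requiresCompatibilities=["EC2"],
    containerDefinitions=[
        {
            "name": "nginx",
            "image": "nginx:latest",
            "memory": 128,
            "portMappings": [
                {"containerPort": 80, "hostPort": 0, "protocol": "tcp"},
            ],
        }
    ],
)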

ECS auto scaling cluster with EC2 count

To deploy my docker-compose setup, I am using AWS ECS.
Everything works fine, except auto scaling.
When creating the ECS cluster, I can decide the number of instances, so I set it to 1.
Next, when creating a service on my cluster, I can also decide the number of tasks.
I know that tasks run on the instance, so I set it to 1 as well.
And I specified an auto scaling policy like this.
As you can see, if the CPU percentage goes above 50 for 5 minutes, it automatically adds a task.
After finishing the configuration, I ran a benchmark to test it.
In the service description, the desired task count increased to 2.
But an instance was not added automatically.
In the event log, it says the service was unable to place a task.
Maybe because I defined the number of instances as 1 in my cluster, it can't start a new task.
Why does auto scaling not automatically add a new instance to my cluster?
Is there any problem with my configuration?
Thanks.
Your ECS service auto scaling does not scale the number of instances; it scales the number of tasks running inside your existing cluster. An EC2 instance can have multiple tasks running. To auto scale the instance count, you will need to use CloudWatch alarms:
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/cloudwatch_alarm_autoscaling.html
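A rough boto3 sketch of that approach, assuming the cluster's instances belong to an Auto Scaling group (the group name, cluster name and threshold below are placeholders):

import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# 1. A simple scale-out policy on the Auto Scaling group backing the cluster.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="my-ecs-asg",       # placeholder ASG name
    PolicyName="scale-out-one-instance",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
    Cooldown=300,
)

# 2. A CloudWatch alarm on the cluster's CPU reservation that triggers the
#    policy, so a new EC2 instance is added when the cluster runs out of room.
cloudwatch.put_metric_alarm(
    AlarmName="ecs-cpu-reservation-high",
    Namespace="AWS/ECS",
    MetricName="CPUReservation",
    Dimensions=[{"Name": "ClusterName", "Value": "my-cluster"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=1,
    Threshold=75.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)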
You are seeing this issue because of a port conflict: ECS attempts to use the "closest matching container instance", which in this case is the one ending in 9e5e.
When attempting to spin up a task on that instance, it notices that this instance "is already using a port required by your task".
In order to resolve this issue,
You need to use dynamic port mapping for your ECS cluster.
There is a tutorial on how to do this that Amazon provides here:
https://aws.amazon.com/premiumsupport/knowledge-center/dynamic-port-mapping-ecs/
Essentially,
You will need to modify the port mapping in the task definition that has the Docker container you are trying to run and scale.
The port mapping should be 0 for the host port, and the port your application listens on for the container port.
The zero value makes each container launched in the ECS cluster use a different host port, eliminating the port conflict you are experiencing.
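Once the host port is 0, you can confirm the dynamically assigned ports on the running tasks; a small boto3 sketch, with placeholder cluster/service names:

import boto3

ecs = boto3.client("ecs")

# Placeholder names; print the host port the agent picked for each container.
task_arns = ecs.list_tasks(cluster="my-cluster", serviceName="my-service")["taskArns"]
if task_arns:
    for task in ecs.describe_tasks(cluster="my-cluster", tasks=task_arns)["tasks"]:
        for container in task["containers"]:
            for binding in container.get("networkBindings", []):
                print(task["taskArn"], binding["containerPort"], "->", binding["hostPort"])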

Connect to On Premises Service Fabric Cluster

I've followed the steps from Microsoft to create a Multi-Node On-Premises Service Fabric cluster. I've deployed a stateless app to the cluster and it seems to be working fine. When I have been connecting to the cluster I have used the IP Address of one of the nodes. Doing that, I can connect via Powershell using Connect-ServiceFabricCluster nodename:19000 and I can connect to the Service Fabric Explorer website (http://nodename:19080/explorer/index.html).
The examples online suggest that if I hosted in Azure I could connect to http://mycluster.eastus.cloudapp.azure.com:19000 and it would resolve; however, I can't work out what the equivalent is on my local setup. I tried connecting to my sample cluster with Connect-ServiceFabricCluster sampleCluster.domain.local:19000, but that returns:
WARNING: Failed to contact Naming Service. Attempting to contact Failover Manager Service...
WARNING: Failed to contact Failover Manager Service, Attempting to contact FMM...
False
WARNING: No such host is known
Connect-ServiceFabricCluster : No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue.
Am I missing something in my setup? Should there be a central DNS entry somewhere that allows me to connect to the cluster? Or am I trying to do something that isn't supported On-Premises?
Yup, you're missing a load balancer.
This is the best resource I could find to help, I'll paste relevant contents in the event of it becoming unavailable.
Reverse Proxy — When you provision a Service Fabric cluster, you have an option of installing Reverse Proxy on each of the nodes on the cluster. It performs the service resolution on the client’s behalf and forwards the request to the correct node which contains the application. In majority of the cases, services running on the Service Fabric run only on the subset of the nodes. Since the load balancer will not know which nodes contain the requested service, the client libraries will have to wrap the requests in a retry-loop to resolve service endpoints. Using Reverse Proxy will address the issue since it runs on each node and will know exactly on what nodes is the service running on. Clients outside the cluster can reach the services running inside the cluster via Reverse Proxy without any additional configuration.
Source: Azure Service Fabric is amazing
I have an Azure Service Fabric resource running, but the same rules apply. As the article states, you'll need a reverse proxy/load balancer to resolve not only what nodes are running the API, but also to balance the load between the nodes running that API. So, health probes are necessary too so that the load balancer knows which nodes are viable options for sending traffic to.
As an example, Azure creates 2 rules off the bat:
1. LBHttpRule on TCP/19080 with a TCP probe on port 19080 every 5 seconds with a 2 count error threshold.
2. LBRule on TCP/19000 with a TCP probe on port 19000 every 5 seconds with a 2 count error threshold.
What you need to add to make this forward-facing is a rule where you forward port 80 to your service http port. Then the health probe can be an http probe that hits a path to test a 200 return.
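As a rough illustration of what those probes do (not specific to Service Fabric), each one is essentially a periodic TCP connect, or an HTTP GET expecting a 200, against every node; a minimal Python sketch with placeholder node addresses:

import socket

# Placeholder node addresses; the load balancer probes each node the same way.
NODES = ["10.0.0.4", "10.0.0.5", "10.0.0.6"]
PORTS = [19000, 19080]  # management and explorer endpoints from the rules above

for node in NODES:
    for port in PORTS:
        try:
            with socket.create_connection((node, port), timeout=2):
                print(f"{node}:{port} reachable")
        except OSError:
            print(f"{node}:{port} unreachable - a probe would take this node out of rotation")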
Once you get into the cluster, you can resolve the services normally and SF will take care of availability.
In Azure-land, this is abstracted again to using something like API Management to further reverse proxy it to SSL. What a mess but it works.
Once your load balancer is set up, you'll have a single IP to hit for management, publishing, and regular traffic.

How to specify multiple backends for TCP Load Balancer with Google Cloud Deployment Manager

When creating a TCP load balancer in the web console, I can add multiple backend services (see image below). I got everything working and now I'm trying to replicate it with Cloud Deployment Manager, but I can't figure out how to set multiple backend services to a TCP load balancer.
The Cloud Deployment Manager ForwardingRule documentation only seems to allow a single target. Maybe a single target is all I need and instead I just need to connect multiple instance group managers to a single target pool?
The problem with that, for me, is my instance group managers were created by Kubernetes and I don't see a way to connect an instance group manager to a target pool without redefining the instance group manager.
Is there a way to add multiple backends/instance groups to a forwarding rule when the instance groups weren't created with deployment manager?
Kubernetes
First of all, if you are creating a cluster making use of Kubernetes and you want to make the containers running on the nodes reachable through a single entry point, you have to create a service of type LoadBalancer.
Google Cloud Deployment Manager
However, it is also possible with Cloud Deployment Manager to create a TCP load balancer that directs traffic to more than one backend.
In order to see the needed underlying components, I suggest you create a temporary TCP load balancer through the Developers Console and check in the advanced settings all the components that get created.
It turns out that you need to create a ForwardingRule pointing to a TargetPool that has several managed instance groups in the same region attached to it.
Therefore you need to modify the managed instance groups and set the target pool for each of them.
You can use the following YAML to update an existing managed instance group named test:
resources:
- name: test
  type: compute.v1.instanceGroupManager
  properties:
    zone: europe-west1-c
    targetSize: 2
    targetPools:
    - https://www.googleapis.com/compute/v1/projects/<<projectID>>/regions/europe-west1/targetPools/mytargetpool
    baseInstanceName: <<baseName>>
    instanceTemplate: https://www.googleapis.com/compute/v1/projects/<<projectID>>/global/instanceTemplates/<<instanceTemplateName>>
You’ll need a similar structure for each of the managed instance groups.
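If some of the managed instance groups were created by Kubernetes and you would rather not redefine them in Deployment Manager, one possible workaround (outside Deployment Manager) is to attach the target pool through the Compute Engine API directly; a hedged sketch using google-api-python-client, with all project/zone/group names as placeholders:

from googleapiclient import discovery

compute = discovery.build("compute", "v1")

# Placeholders throughout; attach the target pool to an existing managed
# instance group (for example one created by Kubernetes) without redefining it.
compute.instanceGroupManagers().setTargetPools(
    project="my-project",
    zone="europe-west1-c",
    instanceGroupManager="gke-created-instance-group",
    body={"targetPools": [
        "https://www.googleapis.com/compute/v1/projects/my-project"
        "/regions/europe-west1/targetPools/mytargetpool"
    ]},
).execute()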
On the other hand you can create the Target pool with the following snippet:
resources:
- name: mytargetpool
  type: compute.v1.targetPool
  properties:
    region: europe-west1
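For completeness, the forwarding rule mentioned earlier simply points at this target pool. If you want to sanity-check it outside Deployment Manager as well, a hedged sketch of creating it through the Compute Engine API (names and port are placeholders):

from googleapiclient import discovery

compute = discovery.build("compute", "v1")

# Placeholder names; a regional forwarding rule whose target is the target pool.
compute.forwardingRules().insert(
    project="my-project",
    region="europe-west1",
    body={
        "name": "my-tcp-forwarding-rule",
        "IPProtocol": "TCP",
        "portRange": "80",
        "target": (
            "https://www.googleapis.com/compute/v1/projects/my-project"
            "/regions/europe-west1/targetPools/mytargetpool"
        ),
    },
).execute()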