I am deploying a Laravel web application on ECS and, to enable autoscaling, I am using an Application Load Balancer. The application worked (and scaled) perfectly until I introduced a heavyweight page, at which point I started getting 504 Gateway Timeout errors after a minute or so.
I am fairly sure the web server itself has a higher timeout (this never happens when the application is tested locally), so the problem must be related to the AWS environment (ECS / ALB).
Below is a snippet of the ALB configuration:
AdminLoadBalancer:
  Type: AWS::ElasticLoadBalancingV2::LoadBalancer
  Properties:
    SecurityGroups:
      - !Ref 'AlbSecurityGroup'
    Subnets:
      - !Ref 'PublicSubnetAz1'
      - !Ref 'PublicSubnetAz2'
    Scheme: internet-facing
    Name: !Join ['-', [!Ref 'AWS::StackName', 'lb']]
After some attempts, I solved the issue by setting the idle timeout attribute of the load balancer, as explained here. Nothing was wrong with the individual ECS tasks; in CloudFormation it was enough to add the attribute and double its default value of 60 seconds:
AdminLoadBalancer:
  Type: AWS::ElasticLoadBalancingV2::LoadBalancer
  Properties:
    LoadBalancerAttributes:
      - Key: 'idle_timeout.timeout_seconds'
        Value: 120
    SecurityGroups:
      - !Ref 'AlbSecurityGroup'
    Subnets:
      - !Ref 'PublicSubnetAz1'
      - !Ref 'PublicSubnetAz2'
    Scheme: internet-facing
    Name: !Join ['-', [!Ref 'AWS::StackName', 'lb']]
Related
I am trying to define an ECS cluster deployment using CloudFormation. So far I have been successful in defining and executing the template.
I decided to externalize the container's environment variables by using the EnvironmentFiles property in the AWS::ECS::TaskDefinition resource.
I think I'm using the correct syntax according to the documentation:
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-ecs-taskdefinition-containerdefinitions.html
However, running the template in CloudFormation generates an error telling me that the keys I'm using for the EnvironmentFiles definition are not permitted.
The strangest thing is that the stack update seems to complete successfully, and I can see the property when I look at the task definition in the console. Is this an error I should ignore, or is there a more correct way to define this property?
CloudFormation snippet:
TaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Family: !Ref 'ServiceName'
    Cpu: !Ref 'ContainerCpu'
    Memory: !Ref 'ContainerMemory'
    NetworkMode: awsvpc
    RequiresCompatibilities:
      - FARGATE
    ExecutionRoleArn: !Ref 'ECSTaskExecutionRole'
    TaskRoleArn:
      Fn::If:
        - 'HasCustomRole'
        - !Ref 'Role'
        - !Ref "AWS::NoValue"
    ContainerDefinitions:
      - Name: !Ref 'ServiceName'
        Cpu: !Ref 'ContainerCpu'
        Memory: !Ref 'ContainerMemory'
        Image: !Ref 'ImageUrl'
        EnvironmentFiles:
          - value: !Ref EnvFile
            type: s3
        PortMappings:
          - ContainerPort: !Ref 'ContainerPort'
        LogConfiguration:
          LogDriver: awslogs
          Options:
            awslogs-group: !Ref ApplicationLogGroup
            awslogs-region: !Ref AWS::Region
            awslogs-stream-prefix: !Sub ${AWS::StackName}-ecs-service
Reported error:
Resource template validation failed for resource TaskDefinition as the template has invalid properties.
Please refer to the resource documentation to fix the template.
Properties validation failed for resource TaskDefinition with message:
#/ContainerDefinitions/0/EnvironmentFiles/0: extraneous key [type] is not permitted
#/ContainerDefinitions/0/EnvironmentFiles/0: extraneous key [value] is not permitted
OK, I'm answering this to close it. After trying several things I realized that the value and type properties were in lowercase, and CloudFormation requires the property names to start with an uppercase letter. Making this change removed the error:
EnvironmentFiles:
  - Value: !Ref EnvFile
    Type: s3
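One thing worth keeping in mind (a hedged note, not part of the original question): the environment file is downloaded by the ECS agent using the task execution role, so that role (ECSTaskExecutionRole in the snippet above) also needs read access to the file in S3. A minimal sketch of the extra policy entry, with a placeholder bucket and object key:
- PolicyName: env-file-access
  PolicyDocument:
    Statement:
      - Effect: Allow
        Action:
          - s3:GetObject
        Resource: arn:aws:s3:::my-config-bucket/app.env   # placeholder: the object holding the env file
      - Effect: Allow
        Action:
          - s3:GetBucketLocation
        Resource: arn:aws:s3:::my-config-bucket            # placeholder: the bucket containing it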
I have this template that was working till February.
https://datameetgeobk.s3.amazonaws.com/cftemplates/EyeOfCustomer_updated.yaml.txt
Something related to fine-grained access control changed, and I get the error...
Enable fine-grained access control or apply a restrictive access
policy to your domain (Service: AWSElasticsearch; Status Code: 400;
Error Code: ValidationException
This is just a test server and I do not want to protect it using Advanced security options.
The error you receive is because Amazon enabled fine-grained access control as part of its February 2020 release.
You can enable VPCOptions for the domain, create a subnet and a security group, and allow access through that security group. Add the VPC ID as a parameter, say pVpc (the default VPC in this case).
Add vpc parameter
pVpc:
  Description: VPC ID
  Type: String
  Default: default-xxadssad   # your default VPC ID
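The subnet below also references a pVpcCIDR parameter that is not declared above; a sketch of that declaration (the default CIDR is a placeholder, pick an unused range inside your VPC):
pVpcCIDR:
  Description: CIDR block for the Elasticsearch subnet
  Type: String
  Default: 172.31.96.0/20   # placeholder; must be a free range within the VPC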
Add subnet & security group
ESSubnetA:
  Type: AWS::EC2::Subnet
  Properties:
    VpcId: !Ref pVpc
    AvailabilityZone: !Sub '${AWS::Region}a'   # first AZ in the region
    CidrBlock: !Ref pVpcCIDR
    Tags:
      - Key: Name
        Value: es-subneta
ESSecurityGroup:
  Type: AWS::EC2::SecurityGroup
  Properties:
    GroupDescription: SecurityGroup for Elasticsearch
    VpcId: !Ref pVpc
    SecurityGroupIngress:
      - FromPort: '443'
        IpProtocol: tcp
        ToPort: '443'
        CidrIp: 0.0.0.0/0
    Tags:
      - Key: Name
        Value: es-sg
Enable VPCOptions
VPCOptions:
  SubnetIds:
    - !Ref ESSubnetA
  SecurityGroupIds:
    - !Ref ESSecurityGroup
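For context, a minimal sketch of where that block sits on the domain resource (the domain name and Elasticsearch version below are placeholders, not taken from the linked template):
ElasticsearchDomain:
  Type: AWS::Elasticsearch::Domain
  Properties:
    DomainName: eyeofcustomer-test      # placeholder name
    ElasticsearchVersion: '7.4'         # placeholder version
    VPCOptions:
      SubnetIds:
        - !Ref ESSubnetA
      SecurityGroupIds:
        - !Ref ESSecurityGroup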
Another engineer introduced a deploy date parameter into our AMIFinder custom resource in the prod stack, which means we can no longer update the dev stack without attempting to recreate the EC2 instance.
Is it possible to introduce a condition based purely on the DeployDate parameter, so I can still use one template for both stacks?
FindAmiResource:
  Type: 'Custom::FindAmiFunction'
  Properties:
    ServiceToken:
      Fn::ImportValue:
        !Sub
          - cfn:find-ami:${AWSAccount}:arn
          - {AWSAccount: !FindInMap [AccountIDMap, Accounts, !Ref "AWS::AccountId"]}
    AmiName: 'Corp_w2016_Std-*'
    AmiOwner: '9999999999999'
    DeployDate: !Ref AMIDeployDate
Assuming you have some information to key off (like a known AccountId or a parameter in the stack) you can create a condition that defines the stack as dev. Then you can use the 'Fn::If' function, like this:
FindAmiResource:
  Type: 'Custom::FindAmiFunction'
  Properties:
    ServiceToken:
      Fn::ImportValue:
        !Sub
          - cfn:find-ami:${AWSAccount}:arn
          - {AWSAccount: !FindInMap [AccountIDMap, Accounts, !Ref "AWS::AccountId"]}
    AmiName: 'Corp_w2016_Std-*'
    AmiOwner: '9999999999999'
    DeployDate:
      Fn::If:
        - DevCondition
        - !Ref AWS::NoValue
        - !Ref AMIDeployDate
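A sketch of one way to define DevCondition, assuming the dev account ID is known (the account ID below is a placeholder):
Conditions:
  DevCondition: !Equals [!Ref "AWS::AccountId", "111111111111"]   # placeholder dev account ID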
I am having trouble deploying a Fargate cluster; it fails on the Docker image pull with the error "CannotPullContainerError". I am creating the stack with CloudFormation, which is not optional. It creates the full stack, but fails when trying to start the task with the above error.
I have attached the CloudFormation stack file, which might highlight the problem, and I have double-checked that the subnet has a route to a NAT gateway (below). I also SSH'ed into an instance in the same subnet, which was able to route externally. I am wondering whether I have placed the pieces incorrectly, i.e. the service + load balancer are in the private subnet, or should I not be placing the internal load balancer in the same subnet?
This subnet is the one that currently has the placement, but all three subnets in the file have the same NAT settings.
subnet routable (subnet-34b92250)
* 0.0.0.0/0 -> nat-05a00385366da527a
cheers in advance.
YAML CloudFormation script:
AWSTemplateFormatVersion: 2010-09-09
Description: Cloudformation stack for the new GRPC endpoints within existing vpc/subnets and using fargate
Parameters:
  StackName:
    Type: String
    Default: cf-core-ci-grpc
    Description: The name of the parent Fargate networking stack that you created. Necessary
  vpcId:
    Type: String
    Default: vpc-0d499a68
    Description: The name of the parent Fargate networking stack that you created. Necessary
Resources:
  CoreGrcpInstanceSecurityGroupOpenWeb:
    Type: 'AWS::EC2::SecurityGroup'
    Properties:
      GroupName: sgg-core-ci-grpc-ingress
      GroupDescription: Allow http to client host
      VpcId: !Ref vpcId
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: '80'
          ToPort: '80'
          CidrIp: 0.0.0.0/0
      SecurityGroupEgress:
        - IpProtocol: tcp
          FromPort: '80'
          ToPort: '80'
          CidrIp: 0.0.0.0/0
  LoadBalancer:
    Type: 'AWS::ElasticLoadBalancingV2::LoadBalancer'
    DependsOn:
      - CoreGrcpInstanceSecurityGroupOpenWeb
    Properties:
      Name: lb-core-ci-int-grpc
      Scheme: internal
      Subnets:
        # # pub
        # - subnet-f13995a8
        # - subnet-f13995a8
        # - subnet-f13995a8
        # pri
        - subnet-34b92250
        - subnet-82d85af4
        - subnet-ca379b93
      LoadBalancerAttributes:
        - Key: idle_timeout.timeout_seconds
          Value: '50'
      SecurityGroups:
        - !Ref CoreGrcpInstanceSecurityGroupOpenWeb
  TargetGroup:
    Type: 'AWS::ElasticLoadBalancingV2::TargetGroup'
    DependsOn:
      - LoadBalancer
    Properties:
      Name: tg-core-ci-grpc
      Port: 3000
      TargetType: ip
      Protocol: HTTP
      HealthCheckIntervalSeconds: 30
      HealthCheckProtocol: HTTP
      HealthCheckTimeoutSeconds: 10
      HealthyThresholdCount: 4
      Matcher:
        HttpCode: '200'
      TargetGroupAttributes:
        - Key: deregistration_delay.timeout_seconds
          Value: '20'
      UnhealthyThresholdCount: 3
      VpcId: !Ref vpcId
  LoadBalancerListener:
    Type: 'AWS::ElasticLoadBalancingV2::Listener'
    DependsOn:
      - TargetGroup
    Properties:
      DefaultActions:
        - Type: forward
          TargetGroupArn: !Ref TargetGroup
      LoadBalancerArn: !Ref LoadBalancer
      Port: 80
      Protocol: HTTP
  EcsCluster:
    Type: 'AWS::ECS::Cluster'
    DependsOn:
      - LoadBalancerListener
    Properties:
      ClusterName: ecs-core-ci-grpc
  EcsTaskRole:
    Type: 'AWS::IAM::Role'
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service:
                # - ecs.amazonaws.com
                - ecs-tasks.amazonaws.com
            Action:
              - 'sts:AssumeRole'
      Path: /
      Policies:
        - PolicyName: iam-policy-ecs-task-core-ci-grpc
          PolicyDocument:
            Statement:
              - Effect: Allow
                Action:
                  - 'ecr:**'
                Resource: '*'
  CoreGrcpTaskDefinition:
    Type: 'AWS::ECS::TaskDefinition'
    DependsOn:
      - EcsCluster
      - EcsTaskRole
    Properties:
      NetworkMode: awsvpc
      RequiresCompatibilities:
        - FARGATE
      ExecutionRoleArn: !Ref EcsTaskRole
      Cpu: '1024'
      Memory: '2048'
      ContainerDefinitions:
        - Name: container-core-ci-grpc
          Image: 'nginx:latest'
          Cpu: '256'
          Memory: '1024'
          PortMappings:
            - ContainerPort: '80'
              HostPort: '80'
          Essential: 'true'
  EcsService:
    Type: 'AWS::ECS::Service'
    DependsOn:
      - CoreGrcpTaskDefinition
    Properties:
      Cluster: !Ref EcsCluster
      LaunchType: FARGATE
      DesiredCount: '1'
      DeploymentConfiguration:
        MaximumPercent: 150
        MinimumHealthyPercent: 0
      LoadBalancers:
        - ContainerName: container-core-ci-grpc
          ContainerPort: '80'
          TargetGroupArn: !Ref TargetGroup
      NetworkConfiguration:
        AwsvpcConfiguration:
          AssignPublicIp: DISABLED
          SecurityGroups:
            - !Ref CoreGrcpInstanceSecurityGroupOpenWeb
          Subnets:
            - subnet-34b92250
            - subnet-82d85af4
            - subnet-ca379b93
      TaskDefinition: !Ref CoreGrcpTaskDefinition
Unfortunately, AWS Fargate only supports images hosted in ECR or public repositories on Docker Hub, and does not support private repositories hosted on Docker Hub. For more info - https://forums.aws.amazon.com/thread.jspa?threadID=268415
We faced the same problem with AWS Fargate a couple of months back. You have only two options right now:
1. Migrate your images to Amazon ECR (see the sketch below).
2. Use AWS Batch with a custom AMI, where the custom AMI is built with the Docker Hub credentials in the ECS config (which we are using right now).
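As a rough illustration of option 1, the container definition from the template above would simply point at an ECR image URI (the account ID, region, repository, and tag below are placeholders):
ContainerDefinitions:
  - Name: container-core-ci-grpc
    Image: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/core-ci-grpc:latest   # placeholder ECR URI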
Edit: As mentioned by Christopher Thomas in the comments, ECS Fargate now supports pulling images from Docker Hub private repositories. More info on how to set it up can be found here.
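For reference, the CloudFormation hook for that is the RepositoryCredentials property on the container definition. A minimal sketch, assuming the Docker Hub credentials already live in a Secrets Manager secret (the secret ARN and image name are placeholders) and the task execution role is allowed to read that secret:
ContainerDefinitions:
  - Name: app
    Image: myorg/private-image:latest   # placeholder private Docker Hub image
    RepositoryCredentials:
      CredentialsParameter: arn:aws:secretsmanager:eu-west-1:123456789012:secret:dockerhub-creds   # placeholder ARN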
Define this policy on your ECR repository and attach the IAM role to your task:
{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Sid": "new statement",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::99999999999:role/ecsEventsRole"
      },
      "Action": [
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
        "ecr:BatchCheckLayerAvailability",
        "ecr:PutImage",
        "ecr:InitiateLayerUpload",
        "ecr:UploadLayerPart",
        "ecr:CompleteLayerUpload"
      ]
    }
  ]
}
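To attach that role to the task, reference it as the task execution role; a minimal fragment, with the other required task definition properties omitted:
TaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    # ...other required properties omitted...
    ExecutionRoleArn: arn:aws:iam::99999999999:role/ecsEventsRole   # the role granted pull access in the policy above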
Has anybody succeeded in using an Auto Scaling group with an ELB health check? It replaces the instances over and over. Is there a way to prevent that?
My template looks like that:
Resources:
  ECSAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      AvailabilityZones:
        - Fn::Select:
            - '0'
            - Fn::GetAZs:
                Ref: AWS::Region
        - Fn::Select:
            - '1'
            - Fn::GetAZs:
                Ref: AWS::Region
        - Fn::Select:
            - '2'
            - Fn::GetAZs:
                Ref: AWS::Region
      VPCZoneIdentifier:
        - Fn::ImportValue: !Sub ${EnvironmentName}-PrivateEC2Subnet1
        - Fn::ImportValue: !Sub ${EnvironmentName}-PrivateEC2Subnet2
        - Fn::ImportValue: !Sub ${EnvironmentName}-PrivateEC2Subnet3
      HealthCheckGracePeriod: !Ref ASGHealthCheckGracePeriod
      HealthCheckType: !Ref ASGHealthCheckType
      LaunchTemplate:
        LaunchTemplateId: !Ref ECSLaunchTemplate
        Version: 1
      MetricsCollection:
        - Granularity: 1Minute
      ServiceLinkedRoleARN:
        !Sub arn:aws:iam::${AWS::AccountId}:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling
      DesiredCapacity: !Ref ASGDesiredCapacity
      MinSize: !Ref ASGMinSize
      MaxSize: !Ref ASGMaxSize
      TargetGroupARNs:
        - Fn::ImportValue: !Sub ${EnvironmentName}-WebTGARN
        - Fn::ImportValue: !Sub ${EnvironmentName}-DataTGARN
        - Fn::ImportValue: !Sub ${EnvironmentName}-GeneratorTGARN
      TerminationPolicies:
        - OldestInstance
The launch template looks like this:
ECSLaunchTemplate:
  Type: AWS::EC2::LaunchTemplate
  Properties:
    LaunchTemplateName: ECSLaunchtemplate
    LaunchTemplateData:
      ImageId: !FindInMap [AWSRegionToAMI, !Ref "AWS::Region", AMI]
      InstanceType: !Ref InstanceType
      SecurityGroupIds:
        - Fn::ImportValue: !Sub ${EnvironmentName}-ECSInstancesSecurityGroupID
      IamInstanceProfile:
        Arn:
          Fn::ImportValue:
            !Sub ${EnvironmentName}-ecsInstanceProfileARN
      Monitoring:
        Enabled: true
      CreditSpecification:
        CpuCredits: standard
      TagSpecifications:
        - ResourceType: instance
          Tags:
            - Key: "keyname1"
              Value: "value1"
      KeyName:
        Fn::ImportValue:
          !Sub ${EnvironmentName}-ECSKeyPairName
      UserData:
        "Fn::Base64": !Sub
          - |
            #!/bin/bash
            yum update -y
            yum install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_amd64/amazon-ssm-agent.rpm
            yum update -y aws-cfn-bootstrap hibagent
            /opt/aws/bin/cfn-init -v --region ${AWS::Region} --stack ${AWS::StackName} --resource ECSLaunchTemplate --region ${AWS::Region}
            /opt/aws/bin/cfn-signal -e $? --region ${AWS::Region} --stack ${AWS::StackName} --resource ECSAutoScalingGroup
            /usr/bin/enable-ec2-spot-hibernation
            echo ECS_CLUSTER=${ECSCluster} >> /etc/ecs/ecs.config
            PATH=$PATH:/usr/local/bin
          - ECSCluster:
              Fn::ImportValue:
                !Sub ${EnvironmentName}-ECSClusterName
The load balancer config looks like this:
ApplicationLoadBalancerInternet:
  Type: AWS::ElasticLoadBalancingV2::LoadBalancer
  Properties:
    Name: !Sub ${EnvironmentName}-${Project}-ALB-Internet
    IpAddressType: !Ref ELBIpAddressType
    Type: !Ref ELBType
    Scheme: internet-facing
    Subnets:
      - Fn::ImportValue:
          !Sub ${EnvironmentName}-PublicSubnet1
      - Fn::ImportValue:
          !Sub ${EnvironmentName}-PublicSubnet2
      - Fn::ImportValue:
          !Sub ${EnvironmentName}-PublicSubnet3
    SecurityGroups:
      - Fn::ImportValue:
          !Sub ${EnvironmentName}-ALBInternetSecurityGroupID
As said, it's working fine with EC2 health checks, but when I switch to ELB health checks the instances are drained and the ASG spins up new ones.
Merci A
I would troubleshoot it like this:
1. Delete this stack.
2. Edit your template and change the ASG health-check type to ELB (for now).
3. Create the new stack from either the CLI or the console. I recommend the CLI, since you might have to recreate the stack and it's far simpler/quicker than the console. The most important part is to enable the "disable rollback" feature for when the stack fails; otherwise, you won't be able to find out the reason for the failure. I believe you will also be creating some IAM resources as part of this template, so an example CLI command would be this, for your quick reference:
aws cloudformation create-stack --stack-name Name-of-your-stack --template-body file://template.json --tags Key=Name,Value=Your_Tag_Value --profile default --region region --capabilities CAPABILITY_NAMED_IAM --disable-rollback
For more information on the requirement of CAPABILITY_NAMED_IAM, see this SO answer.
4. Now, when you create the stack, it's still going to fail, but now we can troubleshoot it. The reason we kept the health-check type as ELB in step 2 is that we actually want the ASG to replace the instances on failed health checks, so we can find the reason in the ASG's "Activity History" tab in the console. Chances are high that you will see a message far more meaningful than the one returned by CloudFormation.
5. Now that you have that error message, change the ASG's health-check type to EC2 from the console, because we do not want the ASG to start a loop of launching and terminating instances.
6. Log in to your EC2 instance and look in the access logs for the hits from your ELB health check. In httpd, a successful health check shows up as an HTTP 408.
7. Also note that if the ELB health-check type is TCP:80, make sure there is no port conflict on your server; if you have selected HTTP:80, make sure you have specified a path/file as the ping target.
8. Since your template has some user data as well, review /var/log/cfn-init.log and the other logs for any error messages. A simple option is grep error /var/log/*.
9. At this point, you just have to get the ELB health check passing and the instance "in service" behind the ELB. The most important step is to document all the troubleshooting steps, because you never know which of the many things you tried actually fixed the health check.
10. Once you have found the cause, just put it in the template and you should be good to go. I have seen many templates go wrong at Step 8.
Also, do not forget to change the ASG health check back to ELB once again.
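As a rough illustration of that last switch, using the parameters already defined in the template above (e.g. by passing ASGHealthCheckType=ELB), the resolved ASG properties end up looking like this; the 300-second grace period is an illustrative assumption, not a value from the original template:
HealthCheckType: ELB                  # switch back from EC2 once the check passes
HealthCheckGracePeriod: 300           # seconds; long enough for user data / cfn-init to finish before checks count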