AWS Fargate ECS service shutting down after 10~15 mins - amazon-ecs

We have two environments, acpt and acpt-contigiency.
In the acpt-contigiency environment, the ECS Fargate services are configured with desired count = 0, min capacity = 0 and max capacity = 0, since we only use it if acpt goes down.
When we switched over from acpt to acpt-contigiency and only updated the desired count to 2 (leaving min capacity = 0 and max capacity = 0), we observed that the ECS services went down after 10~15 minutes.
There is nothing in the CloudWatch logs; the ECS service events show that tasks are being brought down, but no reason is given.
Under the ECS service -> Tasks -> Stopped, the reason is shown as stopped due to a scaling policy.
Any idea which autoscaling policy might be bringing the tasks down? We are using target tracking on ECSServiceAverageCPUUtilization with a target of 70.

What we found out was that the scale-in alarm was triggered (CPUUtilization < 63 for 15 datapoints within 15 minutes), which in turn reset the desired count to 0 (possibly resetting it to the previously saved value, or clamping it to the registered min/max capacity of 0) and shut down the tasks.
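One way to avoid this is to raise the registered min/max capacity together with the desired count during the switchover, so the target-tracking policy cannot scale the service back to 0. A minimal boto3 sketch, assuming the service is already registered as a scalable target; the cluster, service name, and capacities below are hypothetical:

import boto3

aas = boto3.client("application-autoscaling")
ecs = boto3.client("ecs")

# Hypothetical identifiers for the contingency environment
resource_id = "service/contingency-cluster/my-service"

# Raise min/max capacity first so the target-tracking policy cannot scale the service back to 0
aas.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=4,
)

# Then set the desired count on the service itself
ecs.update_service(
    cluster="contingency-cluster",
    service="my-service",
    desiredCount=2,
)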

Related

Configure ECS to scale to zero when not in use

I'm running Superset in AWS ECS using Fargate. This instance of Superset is for internal use only. I want to be able to configure ECS to scale to zero tasks when not in use. I am aware that it will take time (possibly minutes) to come back up; the end users of this application are content with waiting a few minutes.
Situation:
AWS ECS deployed using Fargate
Autoscaling set to a max of 2 and a min of 0
Want to scale to 0 when not in use (after, say, an hour)
Scaling ECS down to zero when not in use is not possible. ECS is designed to run continuously, unlike Lambda functions that can be turned on and off as requests arrive.
However, if your internal users only access the application during known hours (say, business hours), then you can use scheduled scaling to scale down to zero outside those hours.
You can use put-scheduled-action for that.
aws application-autoscaling put-scheduled-action --service-namespace ecs \
--schedule "cron(15 12 * * ? *)" \
...
This AWS Blog post explains it in more detail: https://aws.amazon.com/blogs/containers/optimizing-amazon-elastic-container-service-for-cost-using-scheduled-scaling/
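As a rough boto3 equivalent, a pair of scheduled actions might look like the sketch below; the cluster/service names, schedules, and capacities are assumptions for illustration:

import boto3

aas = boto3.client("application-autoscaling")

# Hypothetical scalable target for the Superset service
resource_id = "service/superset-cluster/superset"

# Scale the service down to zero every evening at 20:00 UTC
aas.put_scheduled_action(
    ServiceNamespace="ecs",
    ScheduledActionName="superset-scale-in",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    Schedule="cron(0 20 * * ? *)",
    ScalableTargetAction={"MinCapacity": 0, "MaxCapacity": 0},
)

# Bring it back up every weekday morning at 07:00 UTC
aas.put_scheduled_action(
    ServiceNamespace="ecs",
    ScheduledActionName="superset-scale-out",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    Schedule="cron(0 7 ? * MON-FRI *)",
    ScalableTargetAction={"MinCapacity": 1, "MaxCapacity": 2},
)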

Elastic Cloud APM Server - Queue is full

I have many Java microservices running in a Kubernetes cluster. All of them have APM agents sending data to an APM server in our Elastic Cloud cluster.
Everything was working fine, but suddenly every microservice started receiving the error shown below in its logs.
I tried restarting the cluster, increasing the hardware, and following the hints, but with no success.
Note: the disk is almost empty and memory usage is fine.
Everything is on version 7.5.2.
I deleted all the indexes related to APM and everything worked again after a few minutes.
For better performance you can fine-tune the following fields in the apm-server.yml file:
queue.mem.events (internal queue size): increase it to roughly output.elasticsearch.worker * output.elasticsearch.bulk_max_size; the default is 4096.
output.elasticsearch.worker: increase it; the default is 1.
output.elasticsearch.bulk_max_size: increase it; the default of 50 is very low.
Example: for my use case I used the following settings for 2 apm-server nodes and 3 ES nodes (1 master, 2 data nodes):
queue.mem.events=40000
output.elasticsearch.worker=4
output.elasticsearch.bulk_max_size=10000

Auto-Scaling removes running task in ECS service (FARGATE)

I am running an ECS service using Fargate on AWS. Each task completes a single operation and dies (fetching a message from an SQS queue and decoding/encoding a video file). I designed an autoscaling policy like the one below:
If the SQS queue size is more than 5, increment the desired count by 1 (repeat every 60 seconds).
If the SQS queue size is less than 2, decrement the desired count by 1 (repeat every 60 seconds).
But what AWS does is that when the queue size drops below 2, it kills running tasks, leaving the corresponding operation "broken". I don't want AWS to kill the running tasks (they will die on their own once the command completes) but just to set the desired count to 0 so that the tasks don't get respawned. So literally I want my tasks to be unstoppable during scale-in.
How can I achieve this with an ECS service and aws_ecs_autoscaling_target? Please note that I am using Terraform to provision the service.
Thanks in advance.
I had to solve this issue with a different approach. I created a small Lambda function that is triggered by the CloudWatch alarm and starts a Fargate task using StartTask. This workflow suited the use case better than an autoscaling policy.
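A minimal sketch of such a Lambda handler using boto3 (note that for the FARGATE launch type the RunTask API is the call that accepts it; the cluster, task definition, subnet, and security group below are hypothetical placeholders):

import boto3

ecs = boto3.client("ecs")

def handler(event, context):
    # Triggered by the CloudWatch alarm on SQS queue depth; launches one worker task.
    ecs.run_task(
        cluster="video-processing",           # hypothetical cluster name
        taskDefinition="video-encoder",       # hypothetical task definition family
        launchType="FARGATE",
        count=1,
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-0123456789abcdef0"],     # placeholder subnet
                "securityGroups": ["sg-0123456789abcdef0"],  # placeholder security group
                "assignPublicIp": "ENABLED",
            }
        },
    )

Because tasks are launched explicitly and no scale-in policy is attached, running tasks are never stopped mid-operation; they exit on their own when the work completes.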

How may I check ECS container agent configuration in ECS Fargate Task?

For the EC2 launch type I'm able to check the agent configuration in the /etc/ecs/ecs.config file on the EC2 container instance. But is it possible to find the same information for an ECS Fargate task? For example, I'd like to know the timeout between SIGTERM and SIGKILL (ECS_CONTAINER_STOP_TIMEOUT). Should it be possible to retrieve such info from the Amazon ECS Task Metadata Endpoint?
In Fargate, the timeout between SIGTERM and SIGKILL uses the default setting of 30 seconds.
For newer Fargate platform versions, you can use the stopTimeout container definition parameter. Note the maximum value of 120 seconds:
For tasks that use the Fargate launch type, the task or service requires platform version 1.3.0 or later (Linux) or 1.0.0 or later (for Windows). The max stop timeout value is 120 seconds. However, if the parameter isn't specified, the default value of 30 seconds is used.
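A minimal boto3 sketch of registering a Fargate task definition with stopTimeout set; the family name, image, and CPU/memory sizes are placeholder assumptions:

import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="my-fargate-task",              # hypothetical task family
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",
    memory="512",
    containerDefinitions=[
        {
            "name": "app",
            "image": "nginx:latest",       # placeholder image
            "essential": True,
            "stopTimeout": 120,            # seconds between SIGTERM and SIGKILL (max 120 on Fargate)
        }
    ],
)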

YARN cluster doesn't equally manage vcores, queue resource limit exceeded

I have 3 YARN NodeManagers working in a YARN cluster, and an issue with vcore availability per node.
For example, I have:
on the first node: 15 vcores available,
on the second node: no vcores available,
on the third node: 37 vcores available.
Now a job tries to start and fails with the error:
"Queue's AM resource limit exceeded"
Is this connected to the lack of available vcores on the second node, or can I somehow increase the resource limit of the queue?
I also want to mention that I have the following setting:
yarn.scheduler.capacity.maximum-am-resource-percent=1.0
That error means that your drivers (the ApplicationMasters) have exceeded the maximum memory shown as Max Application Master Resources for the queue; that limit is derived from the queue's capacity multiplied by yarn.scheduler.capacity.maximum-am-resource-percent. Since that percentage is already at 1.0, you can either increase the memory available to the queue (and thus to the AMs) or decrease the driver memory in your jobs.