How to deploy a new autoscaled ECS service with high initial load? Instances fail the health check and get removed before autoscaling kicks in - amazon-ecs

I want to deploy a new autoscaled ECS service that will receive a high initial load immediately.
While the autoscaling policy is quite wide (it scales from a minimum of 1 instance up to 20, as traffic varies a lot during the day), I am running into the problem that the initial load on the containers is so high that all of them fail the health checks, and the system removes the containers before it can scale them up.
What's the recommended way of dealing with this? Right now I'm simply trying to increase the minimum number of instances, but this will result in many unnecessary instances in the morning when the load is low.

Related

Is it relevant to set “requests” values if I’m not using HPA?

I was wondering if it is really relevant to set “requests” (CPU/MEM) values if I’m not using HPA.
If those values are not used to scale pods up or down, what is the point?
It will work if you don't provide requests (CPU/MEM) for your workloads.
But consider this scenario: suppose you have 1-2 nodes with a capacity of 1 GB each and you have not set any requests.
An application that is already running uses about half of a node, around 0.5 GB. Your new app needs 1 GB to start, yet K8s may still schedule its pods onto that node because, without requests, the scheduler is not aware of the minimum resources the application needs to start.
Whatever happens after that, we call it a crash.
If you have spare resources in the cluster, set affinity, and have confidence in the application code, you can go without setting requests, but it is not best practice.
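As a rough illustration of how requests make that minimum visible to the scheduler, a minimal pod spec might look like this (the name, image, and values are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-demo          # illustrative name
spec:
  containers:
    - name: app
      image: my-app:latest   # illustrative image
      resources:
        requests:
          memory: "1Gi"      # the scheduler now knows the pod needs 1 GB to start
          cpu: "250m"
        limits:
          memory: "1Gi"
```

With the 1 Gi request set, the scheduler will skip the half-full node from the scenario above instead of placing the pod there.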

How to provision jobs in Kubernetes with very wide range of memory use

I am fairly new to Kubernetes, and I think I understand the basics of provisioning nodes and setting memory limits for pods. Here's the problem I have: my application can require dramatically different amounts of memory, depending on the input (and there is no fool-proof way to predict it). Some jobs require 50MB, some require 50GB. How can I set up my K8s deployment to handle this situation?
I have one strategy that I'd like to try out, but I don't know how to do it: start with small instances (nodes with not a lot of memory), and if the job fails with out-of-memory, then automatically send it to increasingly bigger instances until it succeeds. How hard would this be to implement in Kubernetes?
Thanks!
Natively, K8s supports horizontal autoscaling, i.e. automatically deploying more replicas of a deployment based on a chosen metric like CPU usage, memory usage, etc.: Horizontal Pod Autoscaling
What you are describing here though is vertical scaling. It is not supported out of the box, but there is a subproject that seems to be able to fulfill your requirements: vertical-pod-autoscaler
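If you go the VPA route, a minimal sketch of a VerticalPodAutoscaler object might look like the following (it assumes the VPA components are installed in the cluster and that the jobs run under a controller the VPA can target; the names and bounds are illustrative):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: worker-vpa                  # illustrative name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment                # the controller running the jobs (illustrative)
    name: memory-hungry-worker
  updatePolicy:
    updateMode: "Auto"              # let VPA evict and resize pods as their usage changes
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          memory: 50Mi              # matches the smallest jobs
        maxAllowed:
          memory: 50Gi              # matches the largest jobs
```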

Will pods consume the full resources specified in their requests or limits while they are getting created?

I would like to clarify pods' resource consumption when they are getting created, or restarted as part of a rolling update or scale-up.
I am looking to understand:
whether pods will consume the entire resources specified in their requests (or limits) while they are getting created,
or whether they will just consume as much as they need to start, which may be less than their requests.
We are currently facing an issue with our AKS cluster where pods generate high CPU usage alerts (more than 95%) when new pods are created as part of a rollout or a scale-up, even though our applications are lightweight and need little CPU for their normal functionality.
So I am looking for a solution for this:
whether we can use a CPU initialization period / initial readiness delay, which would let the pods manage their resource consumption during startup, or
whether we can tweak the HPA settings for scale-up activities, or use a policy window or stabilization window during pod startup.
That really depends on the resource type and the actual values you specify.
As your question focuses on CPU I will as well.
The first consideration is what QoS class your pod ends up in:
https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/
Pods generally do not "consume" any CPU resources; the software running within them does, so what happens with CPU depends strictly on what software you're running. Some software has a CPU-heavy startup phase (oh, what affection I have for Java in k8s), and in that case an initial CPU spike is perfectly normal. But also, due to the scaling logic, that initial spike is discarded from the HPA scale-up computation if it happens before the pod is in the ready state.
https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
So my ultimate advice would be to set your readinessProbe correctly.
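As a rough illustration, a container spec with a readiness probe that covers a slow, CPU-heavy startup might look like this (the endpoint, port, and timings are illustrative and depend on your application):

```yaml
containers:
  - name: app                    # illustrative
    image: my-app:latest         # illustrative
    resources:
      requests:
        cpu: 200m
      limits:
        cpu: "1"
    readinessProbe:
      httpGet:
        path: /healthz           # assumes the app exposes a health endpoint
        port: 8080
      initialDelaySeconds: 30    # long enough to cover the startup CPU spike
      periodSeconds: 5
      failureThreshold: 3
```

Until the probe passes, the pod is not ready, so the startup spike is not fed into the HPA scale-up calculation.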

Handle sudden increase in traffic size (multiple orders of magnitude) with GKE

If a website has a door crasher sale where many people (~50K) are waiting for the countdown to finish and enter the page, how would one tackle this with GKE in a cost efficient way?
That seems to be the reason GKE exists; the idea is that with the cluster autoscaler and HPA, GKE can handle the traffic. In practice, however, it is a different story: when the autoscaler tries to create nodes and pull the images for the containers, it can take some time (perhaps up to a minute or two in some cases). During that time users see 5XX errors, which is not ideal.
To tackle that, over-provisioning with paused pods comes to mind (a rough sketch of what I mean is below). However, considering the servers are generally very small (they only need to handle ~100 requests on a normal day) and then suddenly 50K in a second, how would this be a feasible solution? Paused pods only seem to make sure the autoscaler doesn't remove nodes that are not doing work, so in that case ~50 nodes would always have to be occupied by paused pods, and I am assuming those running hours are still billable in GKE (since the nodes are there, just not doing anything).
What would be a feasible solution to serve 100 requests a day with n1-standard-1, but also be able to scale to ~50K in less than 10 seconds?
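For reference, by over-provisioning with paused pods I mean roughly the low-priority "balloon" pod pattern below (the replica count, sizes, and names are illustrative); the placeholder pods reserve capacity and are preempted as soon as real pods need the room:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10                          # lower than any real workload, so these are evicted first
globalDefault: false
description: "Placeholder pods that only reserve capacity"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 50                      # roughly the head-room to keep pre-warmed
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: k8s.gcr.io/pause   # does nothing, just holds the reservation
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
```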
Not as fast as 10 seconds. That's reachable only if you go serverless.
Pod autoscaling is 20-30 seconds at best (depending on your readiness probes, the load balancer's probes, image cache, etc.). But you still have to have a pool of nodes to fit that capacity, which is the same money - you're right.
Node + pod autoscaling takes around 5 minutes.
If you go serverless, make sure you know (and perhaps increase) your account limits. Because it scales so fast and is billed per invocation, it was very easy to accidentally blow up your bill. Thus all providers limit the default number of concurrent function executions; e.g. AWS allows 1000 per account by default: https://aws.amazon.com/about-aws/whats-new/2017/05/aws-lambda-raises-default-concurrent-execution-limit/. This can be increased through support.
I recall this post for AWS: https://aws.amazon.com/blogs/startups/from-0-to-100-k-in-seconds-instant-scale-with-aws-lambda/. Unfortunately I haven't seen similar write-ups for Google Cloud Functions, but I'm sure they have very similar capabilities.

Kubernetes Scheduling: How would one achieve depth-first scheduling behavior?

// I'm almost certain this must be a dup or at least a solved problem, but I could not find what I was after through searching the many k8s communities.
We have jobs that run for between a minute and many hours. Given that we assign them resource values that afford them Guaranteed QoS status, how could we minimize resource waste across the nodes?
The problem is that downscaling rarely happens, because each node eventually gets assigned one of the long-running jobs. They are not common, but they keep all of the nodes running, even when we have no need for them.
The dumb strategy that seems to avoid this would be a depth-first scheduling algorithm, wherein among the nodes that have capacity, the one that is already most filled gets assigned first. In other words, if we have two nodes, one at 90% CPU/memory usage and one at 10%, the 90% node would always be assigned first, provided it has sufficient capacity.
Open to any input here and/or ideas. Thanks kindly.
As of now there seems to be this kube-scheduler profile plugin:
NodeResourcesMostAllocated: Favors nodes that have a high allocation of resources.
But it has been in alpha stage since k8s v1.18+, so it is probably not safe to use in production.
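If you did want to try it anyway, a rough sketch of a scheduler profile enabling that plugin might look like this (the API version and exact field names vary by Kubernetes version, and kube-scheduler would need to be started with --config pointing at such a file):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        enabled:
          - name: NodeResourcesMostAllocated   # pack pods onto the fullest nodes
        disabled:
          - name: NodeResourcesLeastAllocated  # turn off the default spreading score
```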
There is also this parameter you can set for kube-scheduler that I have found here:
MostRequestedPriority: Favors nodes with most requested resources. This policy will fit the scheduled Pods onto the smallest number of Nodes needed to run your overall set of workloads.
and here is an example of how to configure it.
One last thing that comes to my mind is using node affinity.
Using nodeAffinity on long-running pods (more specifically, with preferredDuringSchedulingIgnoredDuringExecution) will prefer to schedule these pods on the nodes that run all the time, and prefer not to schedule them on nodes that are being autoscaled. This approach requires excluding some nodes from autoscaling and labeling them appropriately so that the scheduler can make use of node affinity.
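A rough sketch of what that could look like in the long-running pods' spec, assuming the always-on nodes carry a label such as node-pool=long-running (an illustrative name):

```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: node-pool        # illustrative label on the non-autoscaled nodes
              operator: In
              values:
                - long-running
```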