Is it relevant to set “requests” values if I'm not using HPA? - kubernetes

I was wondering if it is really relevant to set “requests” (CPU/MEM) values if I'm not using HPA?
If those values are not used to scale pods up or down, what is the point?

Workloads will run fine if you don't set requests (CPU/MEM), but the scheduler loses important information.
Consider this scenario: you have 1-2 nodes with a capacity of 1 GB each and you have not set any requests. An application already running on a node uses about 0.5 GB of it. Your new app needs 1 GB to start, but because Kubernetes is not aware of that minimum requirement, it may still schedule the pod onto that node.
Whatever happens after that, we call it a crash.
If you have spare resources in the cluster, careful affinity settings and confidence in the application code, you can get by without setting requests, but it is not best practice.
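For illustration, a minimal sketch of what setting requests (and limits) looks like; the workload name, image and values below are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                      # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: registry.example.com/my-app:1.0   # placeholder image
        resources:
          requests:                 # what the scheduler reserves on a node for this container
            cpu: "250m"
            memory: "512Mi"
          limits:                   # hard ceiling enforced at runtime
            cpu: "500m"
            memory: "1Gi"
```

Even without HPA, the requests drive scheduling decisions and the pod's QoS class, which is what prevents the crash scenario above.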

Related

How to deploy new autoscaled ECS service with high initial load? Instances fail the health check and get removed before autoscaling kicks in

I want to deploy a new autoscaled ECS service that will receive a high initial load immediately.
While the autoscaling policy is quite wide (it scales from a minimum of 1 instance to 20, as traffic varies a lot during the day), I am running into the problem that the initial load on the containers is so high that all of them fail the health checks and the system removes the containers before they can scale up.
What's the recommended way of dealing with this? Right now I'm simply trying to increase the minimum number of instances, but this will result in many unnecessary instances in the morning when the load is low.
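For reference, a hedged sketch of the kind of scaling configuration described above (a CloudFormation-style scalable target going from 1 to 20 tasks); the cluster and service names are hypothetical:

```yaml
# Hypothetical snippet mirroring the setup described in the question:
# an ECS service whose desired count can scale between 1 and 20.
ServiceScalingTarget:
  Type: AWS::ApplicationAutoScaling::ScalableTarget
  Properties:
    MinCapacity: 1            # raising this is the workaround mentioned above
    MaxCapacity: 20
    ResourceId: service/my-cluster/my-service   # placeholder cluster/service names
    ScalableDimension: ecs:service:DesiredCount
    ServiceNamespace: ecs
```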

How to provision jobs in Kubernetes with very wide range of memory use

I am fairly new to Kubernetes, and I think I understand the basics of provisioning nodes and setting memory limits for pods. Here's the problem I have: my application can require dramatically different amounts of memory, depending on the input (and there is no fool-proof way to predict it). Some jobs require 50MB, some require 50GB. How can I set up my K8s deployment to handle this situation?
I have one strategy that I'd like to try out, but I don't know how to do it: start with small instances (nodes with not a lot of memory), and if the job fails with out-of-memory, then automatically send it to increasingly bigger instances until it succeeds. How hard would this be to implement in Kubernetes?
Thanks!
Natively, Kubernetes supports horizontal autoscaling, i.e. automatically deploying more replicas of a Deployment based on a chosen metric such as CPU or memory usage: Horizontal Pod Autoscaling
What you are describing, though, is vertical scaling. It is not supported out of the box, but there is a subproject that seems to be able to fulfill your requirements: vertical-pod-autoscaler
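For illustration only, a minimal VerticalPodAutoscaler object; this assumes the VPA components are installed in the cluster, and the target workload name is a placeholder:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: memory-hungry-job-vpa       # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: memory-hungry-job         # placeholder workload name
  updatePolicy:
    updateMode: "Auto"              # VPA recreates pods with adjusted requests as it learns usage
```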

Will pods consume the full resources specified in their requests or limits while they are being created?

I would like clarity on how pods consume resources when they are created or restarted as part of a rolling update or scale-up.
I am looking to understand:
whether pods consume the entire resources specified in their requests (or limits) while they are being created, or
whether they just consume what they need to start, which may be less than the request.
We are currently facing an issue with our AKS cluster: pods generate high CPU usage alerts (more than 95%) when new pods are created as part of a rollout or a scale-up, even though our applications are lightweight and need little CPU to function.
So I am looking for a solution:
can we use a CPU initialization period / initial readiness delay so that pods manage their resource consumption during startup?
can we tweak the HPA settings during scale-up, or use a scaling policy or stabilization window during pod startup?
That really depends on the resource type and the actual values you specify.
As your question focuses on CPU, I will as well.
The first consideration is what QoS class your pod ends up in:
https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/
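As a quick illustration (values are arbitrary): a container whose requests equal its limits lands in the Guaranteed class, requests lower than limits give Burstable, and no requests/limits at all give BestEffort.

```yaml
# Requests equal to limits on every container => QoS class "Guaranteed".
apiVersion: v1
kind: Pod
metadata:
  name: qos-demo                    # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0      # placeholder image
    resources:
      requests:
        cpu: "500m"
        memory: "256Mi"
      limits:
        cpu: "500m"
        memory: "256Mi"
```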
Pods generally do not "consume" any CPU resources; the software running inside them does, so what happens with CPU depends strictly on what software you are running. Some software has a CPU-heavy startup phase (oh, what affection I have for Java in k8s), and in that case an initial CPU spike is perfectly normal. Also, due to the scaling logic, a spike that happens before the pod is in the Ready state is discarded when HPA computes a scale-up.
https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
So my ultimate advice would be to set your readinessProbe correctly.
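A minimal sketch of what that could look like; the endpoints, port and timings below are placeholders, and the startupProbe is optional but helps slow-starting apps:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-probes             # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0      # placeholder image
    ports:
    - containerPort: 8080
    startupProbe:                   # gives a slow-starting app time before other probes apply
      httpGet:
        path: /healthz              # placeholder endpoint
        port: 8080
      failureThreshold: 30
      periodSeconds: 5
    readinessProbe:                 # pod only receives traffic once this passes,
      httpGet:                      # and the startup CPU spike stays out of the HPA math
        path: /ready                # placeholder endpoint
        port: 8080
      periodSeconds: 10
```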

kubernetes - multiple pods with same app on the same node

We are migrating our infrastructure to Kubernetes. I am talking about one part of it, which consists of an API for, let's say, customers (we have this case for many other resources). Let's say we have a billion customers, each with some data etc., and we decided they deserve a specialized API just for them, with its own DB, server, domain, etc.
Kubernetes has the notion of nodes and pods. So we said "OK, we dedicate node X with all its resources to this particular API". And now the question:
Why would I use multiple pods, each of them containing the same nginx + fpm and code, limit each to a share of the traffic and resources, and add an internal LB, autoscaling, etc., instead of having a single pod with all the node's resources?
Since each pod adds a bit of extra memory consumption, this seems like a waste to me. The only upside is that if something fails, only part of it goes down (so maybe 2 pods would be optimal in this case?).
Obviously, we would scale the nodes when needed.
Note: I'm not talking about a case where you have multiple pods with different stuff; I'm talking about this particular case.
Note 2: The DB is already outside this node, in its own pod.
Google fails me on this topic. I find hundreds of posts on "how to configure things", but 0 on WHY.
Why would I use multiple pods, each of them containing the same nginx + fpm and code, limit each to a share of the traffic and resources, and add an internal LB, autoscaling, etc., instead of having a single pod with all the node's resources?
Since each pod adds a bit of extra memory consumption, this seems like a waste to me. The only upside is that if something fails, only part of it goes down (so maybe 2 pods would be optimal in this case?).
This comes down to the question: should I scale my app vertically (a larger instance) or horizontally (more instances)?
First, try to avoid using only a single instance, since you probably want some redundancy when, for example, a Node is upgraded. A single instance may be a good option if you are OK with occasional downtime.
Scale app vertically
Scaling an app vertically, by moving to a bigger instance, is a viable alternative that sometimes is a good option, especially when the app cannot be scaled horizontally, e.g. an app that uses the leader election pattern and typically listens for a specific event and reacts. There is, however, a limit to how much you can scale an app vertically.
Scale app horizontally
For a stateless app, it is usually much easier and cheaper to scale horizontally by adding more instances. You typically want more than one instance anyway, since you want to tolerate a Node going down for maintenance. This also works for large-scale apps with very many instances, and the cost scales linearly. However, not every app can scale horizontally; e.g. a replicated distributed database typically cannot scale well horizontally unless you shard the data. You can even use the Horizontal Pod Autoscaler to automatically adjust the number of instances depending on how busy the app is.
Trade-offs
As described above, horizontal scaling is usually easier and preferred. But there are trade-offs: you would probably not want to run thousands of instances when you have low traffic, since each instance has some resource overhead and maintainability cost. For availability you should run at least 2 pods and make sure that they do not run on the same node; if you have a regional cluster, also make sure that they do not run in the same Availability Zone. Consider 2-3 pods when your traffic is low, and use the Horizontal Pod Autoscaler to automatically scale up to more instances when you need them. In the end, this is a numbers game - resources cost money - but you want to provide a good service for your customers as well.
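To make the "at least 2 pods, spread across nodes/zones" advice concrete, a hedged sketch (names, image and thresholds are placeholders) combining a topology spread constraint with an HPA:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: customers-api               # hypothetical name
spec:
  replicas: 2                       # availability baseline
  selector:
    matchLabels:
      app: customers-api
  template:
    metadata:
      labels:
        app: customers-api
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone   # spread replicas across zones
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: customers-api
      containers:
      - name: api
        image: registry.example.com/customers-api:1.0   # placeholder image
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: customers-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: customers-api
  minReplicas: 2
  maxReplicas: 10                   # arbitrary ceiling
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70      # arbitrary target
```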

Handle sudden increase in traffic size (multiple orders of magnitude) with GKE

If a website has a door crasher sale where many people (~50K) are waiting for the countdown to finish and enter the page, how would one tackle this with GKE in a cost efficient way?
Handling this seems to be the reason GKE exists: in theory, with the cluster autoscaler and HPA, GKE can handle the traffic. In practice, however, it is a different story: when the autoscaler has to create nodes and pull the container images, it can take some time (perhaps up to a minute or two in some cases). During that time users see 5XX errors, which is not ideal.
To tackle that, over-provisioning with paused pods comes to mind. However, considering the servers are generally very small (they only need to handle ~100 requests on a normal day) and suddenly face 50K in a second, how would this be a feasible solution? Paused pods seem to only make sure the autoscaler doesn't remove nodes that are not doing work, so in that case 50 nodes would always have to be occupied with paused pods, and I assume those running hours are still billable in GKE (since the nodes are there, just not doing anything).
What would a feasible solution to serve 100 requests with n1-standard-1 everyday but also be able to scale to ~50k in less than 10 seconds?
Not as fast as 10 seconds. That's reachable only if you go serverless.
Pod autoscaling at best takes 20-30 seconds (depending on your readiness probes, load balancer probes, image cache, etc.). But you still have to have a pool of nodes to fit that capacity, which costs the same money - you're right.
Node + pod autoscaling takes around 5 minutes.
If you go serverless, make sure you know (and possibly increase) your account limits. Because it scales so fast and is billed per function run, it used to be very easy to accidentally blow up your bill, so all providers limit the default number of concurrent function executions; e.g. AWS allows 1000 per account by default: https://aws.amazon.com/about-aws/whats-new/2017/05/aws-lambda-raises-default-concurrent-execution-limit/. This can be increased through support.
I recall this post for AWS: https://aws.amazon.com/blogs/startups/from-0-to-100-k-in-seconds-instant-scale-with-aws-lambda/. Unfortunately, I haven't seen similar write-ups for Google Cloud Functions, but I'm sure they have very similar capabilities.
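If you stay on GKE, the over-provisioning with paused pods mentioned in the question is usually implemented with a low-priority placeholder Deployment. A hedged sketch (names, image tag and sizing are placeholders): the cluster autoscaler keeps capacity for these pods, and the scheduler preempts them the moment real workloads need the room.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning            # hypothetical name
value: -10                          # lower than real workloads, so these pods are evicted first
globalDefault: false
description: "Placeholder pods that reserve headroom for traffic bursts."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 10                      # tune to the headroom you want pre-warmed
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9     # does nothing; just holds the reserved resources
        resources:
          requests:
            cpu: "1"                # placeholder sizing
            memory: "1Gi"
```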