HPA + Cluster Autoscaler + OPA within a federated Kubernetes cluster on GKE

I'm setting up a federated kubernetes cluster with kubefed on the Google Container Engine (GKE) 1.8.3-gke.0.
And it seems like for a good HPA and cluster autoscaler I have to use Open Policy Agent as a kubernetes Admission Controller because of this:
By default, replicas are spread equally in all the underlying
clusters. For example: if you have 3 registered clusters and you
create a Federated Deployment with spec.replicas = 9, then each
Deployment in the 3 clusters will have spec.replicas=3.
But in my case, the load changes dynamically in every region, so every cluster should have a dynamic number of pods.
I can't find (or just can't see) examples or manuals regarding cases like mine. So, the question is:
What scenario should a policy have, if I have three clusters in my federated context, one for every region of GKE:
eu (1000 rps, nodes labeled with "region=eu")
us (200 rps, nodes labeled with "region=us")
asia (100 rps, nodes labeled with "region=asia")
It should be a single deployment to dynamically spread pods across those three clusters.
One pod should:
serve 100 rps
request 2 vCPUs + 2 GB RAM
be the only such pod on its node (with anti-affinity)
How can I configure OPA to make that schema work, if this is possible?
Thanks in advance for any links to corresponding manuals.

What you are trying to do should be achievable through "Federated Horizontal Pod Autoscalers"; one of their main use cases is exactly your scenario.
Quoting from the Requirements & Design Document of the Federated Pod Autoscaler:
Users can schedule replicas of same application, across the federated clusters, using replicaset (or deployment). Users however further might need to let the replicas be scaled independently in each cluster, depending on the current usage metrics of the replicas; including the CPU, memory and application defined custom metrics.
And from the actual documentation, this passage from the conclusion describes the behaviour:
The use of federated HPA is to ensure workload replicas move to the cluster(s) where they are needed most, or in other words where the load is beyond expected threshold. The federated HPA feature achieves this by manipulating the min and max replicas on the HPAs it creates in the federated clusters. It actually relies on the in-cluster HPA controllers to monitor the metrics and update relevant fields [...] The federated HPA controller, on the other hand, monitors only the cluster-specific HPA object fields and updates the min replica and max replica fields of those in cluster HPA objects, which have replicas matching thresholds.
Therefore, if I haven't misunderstood your needs, there is no reason to use a third-party product like Open Policy Agent or to create policies.
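To make the numbers in the question concrete, here is a minimal Python sketch (not part of the original answer) comparing the equal spread quoted in the question with the per-cluster replica counts implied by the load figures. The function names are hypothetical, and the only inputs are the question's own figures: 100 rps per pod and 1000/200/100 rps per region. It only illustrates the target that a per-cluster HPA would be expected to converge towards; it does not call any Kubernetes or federation API.

```python
import math

def equal_spread(total_replicas, clusters):
    """Default federated behaviour quoted in the question: split replicas evenly."""
    base, extra = divmod(total_replicas, len(clusters))
    # The first `extra` clusters get one more replica when the split is uneven.
    return {name: base + (1 if i < extra else 0) for i, name in enumerate(clusters)}

def replicas_for_load(rps_by_cluster, rps_per_pod):
    """Replicas each cluster needs if one pod serves `rps_per_pod` requests/second."""
    return {name: max(1, math.ceil(rps / rps_per_pod))
            for name, rps in rps_by_cluster.items()}

# Figures taken from the question.
load = {"eu": 1000, "us": 200, "asia": 100}

needed = replicas_for_load(load, rps_per_pod=100)
print(needed)  # {'eu': 10, 'us': 2, 'asia': 1}

# The same 13 replicas spread equally would starve eu and over-provision the rest.
print(equal_spread(sum(needed.values()), list(load)))  # {'eu': 5, 'us': 4, 'asia': 4}
```

Since each pod is meant to sit alone on a node, the same counts are also the number of nodes each cluster would need to provide.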

Related

Is it possible to use Kubernetes autoscale on cron job pods

Some context: I have multiple cron jobs running daily, weekly, hourly and some of which require significant processing power.
I would like to add requests and limits to these cron job containers to try to enable vertical scaling and ensure that the assigned node has enough capacity when the pod is initialized. This would save me from having to keep multiple large nodes available at all times, and would also let me easily change how many cron jobs I can run in parallel.
I would like to try and avoid timed scaling since the cron jobs processing time can increase as the application grows.
Edit - Additional Information :
Currently I am using DigitalOcean and its UI for cluster autoscaling. I have it working with HPAs on Deployments but not on cron jobs. Adding limits to cron jobs does not trigger cluster autoscaling, to my knowledge.
I have tried to enable HPA scaling with the cron job, but with no success. Basically the pod just sits in Pending status, signalling that there is insufficient CPU available, and no new node is created.
Does HPA scaling work with cron job pods and is there a way to achieve the same type of scaling?
HPA is used to scale more pods when pod loads are high, but this won't increase the resources on your cluster.
I think you're looking for the cluster autoscaler (works on AWS, GKE and Azure), which will increase cluster capacity when pods can't be scheduled.
This is a Community Wiki answer so feel free to edit it and add any additional details you consider important.
As Dom already mentioned, "this won't increase the resources on your cluster." More specifically, it won't create an additional node, as the Horizontal Pod Autoscaler has no such capability; in fact, it has nothing to do with cluster scaling. Its name is pretty self-explanatory. HPA can only scale Pods, and it scales them horizontally; in other words, it can automatically increase or decrease the number of replicas of your "replication controller, deployment, replica set or stateful set based on observed CPU utilization (or, with custom metrics support, on some other application-provided metrics)", as per the docs.
As to cluster autoscaling, as Dom already said, such solutions are implemented in so-called managed Kubernetes offerings such as GKE on GCP, EKS on AWS or AKS on Azure, and many more. You typically don't need to do anything to enable them as they are available out of the box.
You may wonder how HPA and CA fit together. It's really well explained in FAQ section of the Cluster Autoscaler project:
How does Horizontal Pod Autoscaler work with Cluster Autoscaler?
Horizontal Pod Autoscaler changes the deployment's or replicaset's
number of replicas based on the current CPU load. If the load
increases, HPA will create new replicas, for which there may or may
not be enough space in the cluster. If there are not enough resources,
CA will try to bring up some nodes, so that the HPA-created pods have
a place to run. If the load decreases, HPA will stop some of the
replicas. As a result, some nodes may become underutilized or
completely empty, and then CA will terminate such unneeded nodes.
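As a rough illustration of the interplay described in the FAQ, here is a small Python sketch. The first function follows the scaling rule given in the Kubernetes HPA documentation, desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), and ignores the tolerance and stabilization behaviour of the real controller. The second function is a deliberately simplified stand-in for the Cluster Autoscaler, and all the numbers (utilization, free slots, pods per node) are hypothetical.

```python
import math

def hpa_desired_replicas(current_replicas, current_utilization, target_utilization):
    """HPA rule from the docs: ceil(currentReplicas * currentMetric / desiredMetric)."""
    return math.ceil(current_replicas * current_utilization / target_utilization)

def nodes_to_add(pending_pods, pods_per_new_node):
    """Simplified Cluster Autoscaler stand-in: enough new nodes to fit the pending pods."""
    return math.ceil(pending_pods / pods_per_new_node)

current = 4
# Pods run at 90% of their CPU request while the HPA target is 60%.
desired = hpa_desired_replicas(current, current_utilization=90, target_utilization=60)
print(desired)  # 6

free_slots = 1  # hypothetical: the existing nodes can host one more pod
pending = max(0, desired - current - free_slots)
print(pending)  # 1 pod stays Pending, which is what the Cluster Autoscaler reacts to

print(nodes_to_add(pending, pods_per_new_node=2))  # 1
```

The point for the cron job case is that the Cluster Autoscaler reacts to unschedulable (Pending) pods and their resource requests, not to the HPA itself.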

Kubernetes node CPU utilization

I'm trying(learning) to figure out the best way to utilize CPU (and RAM) on k8s nodes.
My final goal is to make sure CPU utilization on each node in the cluster is above X%
Till now I've read about cluster-autoscaler and HPA, but not sure if they'd help me with the use case.
From what I've read:
cluster-autoscaler is used to autoscale nodes based on a comparison between replica count and resources.request Vs available CPU on the target ec2 instance - which is NOT based on traffic/actual CPU usage
HPA is based on CPU/actual cpu usage, but for individual pods
I essentially wanna get to a point where kubectl top nodes would show all nodes are using > X% (let's say 60%) - and ideally trigger the autoscaling if we reach X2% (let's say 80%)
any suggestion/pointer on how to go about this use case? (or I should somehow use the combination of these 2 autoscaling mechanisms)
You can use a combination of the HPA, the Cluster Autoscaler and/or your cloud provider's autoscaling group (ASG):
HPA: scales your K8s Deployments (for example) up and down based on the CPU/memory usage of your pods.
Cloud provider ASG (autoscaling group): works at the VM/instance level, so you can scale nodes up and down based on their own CPU and memory metrics.
Cluster autoscaler: kicks in when pods are pending and have nowhere to run. If you are already handling the case above, this is more of a fail-safe mechanism, or perhaps suited to workloads that don't need to come up very quickly.
In summary, you can use all three of the above (or fewer), but you have to see what works for you so that they don't conflict with each other. One potential problem: if your cloud ASG starts scaling down while you also have pods in Pending state, the cluster autoscaler (if enabled) will kick in, and the two may try to do opposite things, leaving your cluster unable to schedule any pods.
✌️☮️
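To relate the two views in the question (requests-based node scaling versus usage-based pod scaling), here is a rough Python sketch. It relies on the fact that an HPA CPU utilization target is expressed as a percentage of the pods' CPU requests, so if nodes are packed by requests and the HPA holds pods near its target, node utilization is roughly the packing ratio times the HPA target. Function names and all figures are hypothetical, and the estimate ignores system daemons and pods without an HPA.

```python
def node_utilization(pod_usage_millicores, allocatable_millicores):
    """Actual utilization: summed pod CPU usage over the node's allocatable CPU."""
    return sum(pod_usage_millicores) / allocatable_millicores

def expected_node_utilization(requested_millicores, allocatable_millicores, hpa_target):
    """Rough estimate: (requests / allocatable) * HPA target utilization."""
    packing_ratio = requested_millicores / allocatable_millicores
    return packing_ratio * hpa_target

# Hypothetical node with 4000m allocatable CPU.
print(f"{node_utilization([500, 700, 600], 4000):.0%}")  # 45%

# Node packed to 90% by requests, HPA holding pods at 70% of their request.
print(f"{expected_node_utilization(3600, 4000, hpa_target=0.7):.0%}")  # 63%
```

Under these assumptions, getting nodes above 60% is mostly a matter of setting requests close to real usage and choosing the HPA target accordingly, rather than of scaling nodes on their own CPU metrics.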

Container CPU request value for Kubernetes

I have several microservices and I am deploying them to Kubernetes on GCP. I am using the free credits to try out my deployments. My question is: when we define CPU requests, what do we base them on? I have set the request to 250mCPU, but that fills up my cluster nodes, which are small in terms of CPU.
Currently, I have 3 nodes with 940mCPU allocatable CPU and 3 nodes of the same kind. Now I deployed one API with 3 replicas and assigned 250mCPU to each. With all the Kubernetes internal items, all nodes are almost full.
So my question is: what can we base the CPU value for a service on? 250mCPU was a random value. What are others doing to find a minimum CPU for Kubernetes? I have one ASP.NET Core API and 8 NodeJS APIs. If it's based on usage, what are the best values to start with for a new product?
Mostly by stress testing your application. You could also use a vertical pod autoscaler in "recommendation" mode (updateMode "Off"), which will monitor your application for some time and then recommend resource requests for you to set.
Documentation: https://cloud.google.com/kubernetes-engine/docs/concepts/verticalpodautoscaler
Remember that you cannot use a vertical pod autoscaler and a horizontal pod autoscaler on the same resource metric (CPU or memory) at the same time, unless the vertical pod autoscaler is in recommendation mode.
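As a rough illustration of turning stress-test measurements into a request, here is a Python sketch. It uses a simple heuristic (a high percentile of observed usage plus some headroom); this is an assumption for illustration, not the algorithm the vertical pod autoscaler uses. The usage samples and the 300m reserved for system pods are hypothetical; only the 940m allocatable and 250m request figures come from the question.

```python
import math

def suggest_cpu_request(samples_millicores, percentile=90, headroom_percent=25):
    """Heuristic: nearest-rank percentile of observed usage, plus headroom."""
    ordered = sorted(samples_millicores)
    rank = math.ceil(percentile * len(ordered) / 100)  # 1-based nearest rank
    value = ordered[max(0, rank - 1)]
    return math.ceil(value * (100 + headroom_percent) / 100)

def pods_per_node(allocatable_millicores, system_millicores, request_millicores):
    """How many pods with this request fit after system pods take their share."""
    return max(0, (allocatable_millicores - system_millicores) // request_millicores)

# Hypothetical CPU usage samples (millicores) collected during a stress test.
samples = [40, 55, 60, 45, 120, 70, 65, 50, 80, 58]

request = suggest_cpu_request(samples)
print(request)  # 100

# 940m allocatable is from the question; 300m for system pods is an assumption.
print(pods_per_node(940, 300, 250))      # 2
print(pods_per_node(940, 300, request))  # 6
```

With a measured request instead of an arbitrary 250m, noticeably more replicas fit on the same small nodes.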

Duplicate metrics with multiple instances of kube-state-metrics

Problem:
Duplicate data when querying from prometheus for metrics from kube-state-metrics.
Sample query and result with 3 instances of kube-state-metrics running:
Query:
kube_pod_container_resource_requests_cpu_cores{namespace="ns-dummy"}
Metrics
kube_pod_container_resource_requests_cpu_cores{container="appname",endpoint="http",instance="172.232.35.142:8080",job="kube-state-metrics",namespace="ns-dummy",node="ip-172-232-34-25.ec2.internal",pod="app1-appname-6bd9d8d978-gfk7f",service="prom-kube-state-metrics"}
1
kube_pod_container_resource_requests_cpu_cores{container="appname",endpoint="http",instance="172.232.35.142:8080",job="kube-state-metrics",namespace="ns-dummy",node="ip-172-232-35-22.ec2.internal",pod="app2-appname-ccbdfc7c8-g9x6s",service="prom-kube-state-metrics"}
1
kube_pod_container_resource_requests_cpu_cores{container="appname",endpoint="http",instance="172.232.35.17:8080",job="kube-state-metrics",namespace="ns-dummy",node="ip-172-232-34-25.ec2.internal",pod="app1-appname-6bd9d8d978-gfk7f",service="prom-kube-state-metrics"}
1
kube_pod_container_resource_requests_cpu_cores{container="appname",endpoint="http",instance="172.232.35.17:8080",job="kube-state-metrics",namespace="ns-dummy",node="ip-172-232-35-22.ec2.internal",pod="app2-appname-ccbdfc7c8-g9x6s",service="prom-kube-state-metrics"}
1
kube_pod_container_resource_requests_cpu_cores{container="appname",endpoint="http",instance="172.232.37.171:8080",job="kube-state-metrics",namespace="ns-dummy",node="ip-172-232-34-25.ec2.internal",pod="app1-appname-6bd9d8d978-gfk7f",service="prom-kube-state-metrics"}
1
kube_pod_container_resource_requests_cpu_cores{container="appname",endpoint="http",instance="172.232.37.171:8080",job="kube-state-metrics",namespace="ns-dummy",node="ip-172-232-35-22.ec2.internal",pod="app2-appname-ccbdfc7c8-g9x6s",service="prom-kube-state-metrics"}
Observation:
Every metric is coming up Nx when N pods are running for kube-state-metrics. If it's a single pod running, we get the correct info.
Possible solutions:
Scale down to single instance of kube-state-metrics. (Reduced availability is a concern)
Enable sharding. (Solves duplication problem, still less available)
According to the docs, for horizontal scaling we have to pass sharding arguments to the pods.
Shards are zero indexed. So we have to pass the index and total number of shards for each pod.
We are using Helm chart and it is deployed as a deployment.
Questions:
How can we pass different arguments to different pods in this scenario, if it's possible?
Should we be worried about availability of the kube-state-metrics considering the self-healing nature of k8s workloads?
When should we really scale it to multiple instances and how?
You could run kube-state-metrics as a 'self-healing' Deployment with only a single replica: if the container goes down, the Deployment will start a new one. A short gap is usually acceptable, since kube-state-metrics is not focused on the health of the individual Kubernetes components. A single replica will only affect you if your cluster is very large and changes many objects per second.
It is not focused on the health of the individual Kubernetes components, but rather on the health of the various objects inside, such as deployments, nodes and pods.
For a small cluster there is no problem with running it this way, but if you really need a highly available monitoring platform, I recommend you take a look at these two articles:
creating a well designed and highly available monitoring stack for kubernetes and
kubernetes monitoring
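On the question of passing different arguments to different pods: pods of a Deployment get random name suffixes, so one common approach is to run the shards as a StatefulSet, whose pods have stable ordinal names (name-0, name-1, ...), and derive the shard index from the ordinal. The Python sketch below shows only that derivation. The helper name is hypothetical; `--shard` and `--total-shards` are the kube-state-metrics sharding flags as I understand them, so check the documentation for your version before relying on them.

```python
import re

def shard_args(pod_name, total_shards):
    """Derive kube-state-metrics sharding flags from a StatefulSet pod name."""
    match = re.search(r"-(\d+)$", pod_name)
    if match is None:
        raise ValueError(f"{pod_name!r} has no StatefulSet-style ordinal suffix")
    shard = int(match.group(1))  # shards are zero-indexed, like the ordinals
    if shard >= total_shards:
        raise ValueError(f"shard {shard} is out of range for {total_shards} shards")
    return [f"--shard={shard}", f"--total-shards={total_shards}"]

print(shard_args("kube-state-metrics-0", 3))  # ['--shard=0', '--total-shards=3']
print(shard_args("kube-state-metrics-2", 3))  # ['--shard=2', '--total-shards=3']
```

A Deployment pod name such as app1-appname-6bd9d8d978-gfk7f has no ordinal, so the helper raises a ValueError for it, which is why a plain Deployment cannot give each replica its own shard index this way.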

What's the maximum number of Kubernetes namespaces?

Is there a maximum number of namespaces supported by a Kubernetes cluster? My team is designing a system to run user workloads via K8s and we are considering using one namespace per user to offer logical segmentation in the cluster, but we don't want to hit a ceiling with the number of users who can use our service.
We are using Amazon's EKS managed Kubernetes service and Kubernetes v1.11.
This is difficult to answer because it depends on many factors. Here are some figures measured on a k8s 1.7 cluster (kubernetes-thresholds): the number of namespaces (ns) is 10000, under a few assumptions.
There are no limits from the code's point of view, because a namespace is just a Go type that gets instantiated as a variable.
In addition to the link that #SureshVishnoi posted, the limits will depend on your setup, but some of the factors that can contribute to how your namespaces (and resources in a cluster) scale are:
Physical or VM hardware size where your masters are running
  Unfortunately, EKS doesn't provide that yet (it's a managed service after all)
The number of nodes your cluster is handling.
The number of pods in each namespace
The number of overall K8s resources (deployments, secrets, service accounts, etc)
The hardware size of your etcd database.
  Storage: how many resources can you persist.
  Raw performance: how much memory and CPU you have.
The network connectivity between your master components and etcd store if they are on different nodes.
  If they are on the same nodes then you are bound by the server's memory, CPU and storage.
There is no limit on the number of namespaces. You can create as many as you want. A namespace by itself doesn't actually consume cluster resources like CPU or memory.