One of our microservices (a short-lived worker component) is deployed on Kubernetes pods and autoscaled; based on load the pod count can reach a few thousand. Each worker has to open connections to various persistent backing services, and since those services come with resource limits we are getting bottlenecked at the access level. My question: does Kubernetes offer something (a gateway/proxy of some sort) that can multiplex or narrow down requests so they stay under those resource limits? For example, every pod makes one connection to a MySQL server that has an active-connection limit of 50, so we cannot run more than 50 pods concurrently.
You can set up a Pod Quota for a Namespace.
If you can run those Pods in a separate Namespace, you can limit the number of running Pods by creating a ResourceQuota object; let's call it quota-pod.yaml:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pod-demo
spec:
  hard:
    pods: "2"
kubectl create -f quota-pod.yaml --namespace=quota-pod-example
If you check kubectl get resourcequota pod-demo --namespace=quota-pod-example --output=yaml, you would get something like:
spec:
  hard:
    pods: "2"
status:
  hard:
    pods: "2"
  used:
    pods: "0"
In the description of, for example, a 3-replica nginx Deployment you would then see:
Replicas: 1 desired | 1 updated | 1 total | 1 available | 0 unavailable
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 2m deployment-controller Scaled up replica set nginx-1-7cb5b65464 to 3
Normal ScalingReplicaSet 16s deployment-controller Scaled down replica set nginx-1-7cb5b65464 to 1
And kubectl get deployment nginx -o yaml would show:
...
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: 2018-12-05T10:42:45Z
    lastUpdateTime: 2018-12-05T10:42:45Z
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: 2018-12-05T10:42:45Z
    lastUpdateTime: 2018-12-05T10:42:45Z
    message: 'pods "nginx-6bd764c757-4gkfq" is forbidden: exceeded quota: pod-demo,
      requested: pods=1, used: pods=2, limited: pods=2'
I recommend checking the K8s docs page Create a ResourceQuota for more information.
Related
I created an HPA on our k8s cluster which should autoscale at 90% memory utilization. However, it scales UP without hitting the target percentage. I use the following config:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  namespace: {{ .Values.namespace }}
  name: {{ include "helm-generic.fullname" . }}
  labels:
    {{- include "helm-generic.labels" . | nindent 4 }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ include "helm-generic.fullname" . }}
  minReplicas: 1
  maxReplicas: 2
  metrics:
  - type: Resource
    resource:
      name: memory
      targetAverageUtilization: 90
With this config it creates 2 pods, which is the maxReplicas number. If I set maxReplicas to 4 it will create 3.
This is what I get from kubectl describe hpa:
$ kubectl describe hpa -n trunkline
Name: test-v1
Namespace: trunkline
Labels: app.kubernetes.io/instance=test-v1
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=helm-generic
app.kubernetes.io/version=0.0.0
helm.sh/chart=helm-generic-0.1.3
Annotations: meta.helm.sh/release-name: test-v1
meta.helm.sh/release-namespace: trunkline
CreationTimestamp: Wed, 12 Oct 2022 17:36:54 +0300
Reference: Deployment/test-v1
Metrics: ( current / target )
**resource memory on pods (as a percentage of request): 59% (402806784) / 90%**
resource cpu on pods (as a percentage of request): 11% (60m) / 80%
Min replicas: 1
Max replicas: 2
Deployment pods: **2 current / 2** desired
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True ReadyForNewScale recommended size matches current size
ScalingActive True ValidMetricFound the HPA was able to successfully calculate a replica count from memory resource utilization (percentage of request)
ScalingLimited False DesiredWithinRange the desired count is within the acceptable range
Events: <none>
As you see, the pods' memory % is 59 with a target of 90, which I expect to produce only 1 pod.
This is working as intended.
targetAverageUtilization is a target for the average over all the matching Pods.
The idea of HPA is:
scale up? We have 2 Pods and the average memory utilization is only 59%, which is under 90%, so there is no need to scale up.
scale down? 59% is the average for 2 Pods under the current load, so if only one Pod were taking all the load it would rise to roughly 59% * 2 = 118% utilization, which is over 90% and would force a scale-up again, so the HPA does not scale down.
The horizontal pod autoscaler has a very specific formula for calculating the target replica count:
desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]
With the output you show, currentMetricValue is 59% and desiredMetricValue is 90%. Multiplying that by the currentReplicas of 2, you get about 1.3 replicas, which gets rounded up to 2.
This formula, and especially the ceil() round-up behavior, can make HPA very slow to scale down, especially with a small number of replicas.
More broadly, autoscaling on Kubernetes-observable memory might not work the way you expect. Most programming languages are garbage-collected (C, C++, and Rust are the most notable exceptions), and garbage collectors as a rule tend to allocate a large block of operating-system memory and reuse it rather than return it to the operating system when load decreases. If you have a pod that reaches 90% memory from the Kubernetes point of view, it's possible that its memory usage will never decrease. You might need to autoscale on a different metric, or attach an external metrics system like Prometheus to get more detailed memory-manager statistics you can act on.
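As a side note, the autoscaling/v2beta1 API used in the question has since been removed from current Kubernetes releases. A minimal sketch of the same HPA written against the stable autoscaling/v2 API (assuming your cluster serves it) moves the utilization target under target.averageUtilization:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  namespace: {{ .Values.namespace }}
  name: {{ include "helm-generic.fullname" . }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ include "helm-generic.fullname" . }}
  minReplicas: 1
  maxReplicas: 2
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 90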
So I wish to limit the resources used by the pods running in each of my namespaces, and therefore want to use a resource quota.
I am following this tutorial.
It works well, but I want something a little different.
When trying to schedule a pod which will go over the limit of my quota, I am getting a 403 error.
What I want is for the request to be accepted but to wait in a pending state until one of the other pods ends and frees some resources.
Any advice?
Instead of using straight Pod definitions (kind: Pod), use a Deployment.
Why?
Pods in Kubernetes are designed as relatively ephemeral, disposable entities:
You'll rarely create individual Pods directly in Kubernetes—even singleton Pods. This is because Pods are designed as relatively ephemeral, disposable entities. When a Pod gets created (directly by you, or indirectly by a controller), the new Pod is scheduled to run on a Node in your cluster. The Pod remains on that node until the Pod finishes execution, the Pod object is deleted, the Pod is evicted for lack of resources, or the node fails.
Kubernetes assumes that for managing Pods you should use workload resources instead of creating Pods directly:
Pods are generally not created directly and are created using workload resources. See Working with Pods for more information on how Pods are used with workload resources.
Here are some examples of workload resources that manage one or more Pods:
Deployment
StatefulSet
DaemonSet
By using a Deployment you will get behaviour very similar to the one you want.
Example below:
Let's suppose that I created a pod quota for a custom namespace, set to "2" as in this example, and that I have two pods running in this namespace:
kubectl get pods -n quota-demo
NAME READY STATUS RESTARTS AGE
quota-demo-1 1/1 Running 0 75s
quota-demo-2 1/1 Running 0 6s
Third pod definition:
apiVersion: v1
kind: Pod
metadata:
  name: quota-demo-3
spec:
  containers:
  - name: quota-demo-3
    image: nginx
    ports:
    - containerPort: 80
Now I will try to apply this third pod in this namespace:
kubectl apply -f pod.yaml -n quota-demo
Error from server (Forbidden): error when creating "pod.yaml": pods "quota-demo-3" is forbidden: exceeded quota: pod-demo, requested: pods=1, used: pods=2, limited: pods=2
Not what we want: the Pod is rejected outright instead of waiting.
Now I will change the Pod definition into a Deployment definition:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: quota-demo-3-deployment
  labels:
    app: quota-demo-3
spec:
  selector:
    matchLabels:
      app: quota-demo-3
  template:
    metadata:
      labels:
        app: quota-demo-3
    spec:
      containers:
      - name: quota-demo-3
        image: nginx
        ports:
        - containerPort: 80
I will apply this deployment:
kubectl apply -f deployment-v3.yaml -n quota-demo
deployment.apps/quota-demo-3-deployment created
The Deployment is created successfully, but there is no new pod. Let's check this Deployment:
kubectl get deploy -n quota-demo
NAME READY UP-TO-DATE AVAILABLE AGE
quota-demo-3-deployment 0/1 0 0 12s
We can see that the pod quota is working: the Deployment is monitoring resources and waiting until it is possible to create a new Pod.
Let's now delete one of the Pods and check the Deployment again:
kubectl delete pod quota-demo-2 -n quota-demo
pod "quota-demo-2" deleted
kubectl get deploy -n quota-demo
NAME READY UP-TO-DATE AVAILABLE AGE
quota-demo-3-deployment 1/1 1 1 2m50s
The Pod from the Deployment is created automatically after the other Pod is deleted:
kubectl get pods -n quota-demo
NAME READY STATUS RESTARTS AGE
quota-demo-1 1/1 Running 0 5m51s
quota-demo-3-deployment-7fd6ddcb69-nfmdj 1/1 Running 0 29s
It works the same way for memory and CPU quotas on a namespace: when resources are freed, the Deployment will automatically create new Pods.
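For reference, a minimal sketch of such a CPU/memory quota; the limit values here are arbitrary examples, not recommendations:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: mem-cpu-demo
spec:
  hard:
    requests.cpu: "1"
    requests.memory: 1Gi
    limits.cpu: "2"
    limits.memory: 2Gi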
I would like to know how it's possible to set a priorityClass by default for all pods in a specific namespace without using
globalDefault: true
Maybe with an admission controller, but I don't know.
Do you have a concrete example for that?
PriorityClass : A PriorityClass is a non-namespaced object
PriorityClass also has two optional fields: globalDefault and description.
The globalDefault field indicates that the value of this PriorityClass should be used for Pods without a priorityClassName.
Only one PriorityClass with globalDefault set to true can exist in the system. If there is no PriorityClass with globalDefault set, the priority of Pods with no priorityClassName is zero.
Create a PriorityClass using the below YAML (no globalDefault flag is set):
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
description: "This priority class should be used for pods."
$ kubectl get priorityclasses.scheduling.k8s.io
NAME VALUE GLOBAL-DEFAULT AGE
high-priority 1000000 false 10s
Now add the priority class to a Pod manifest and schedule it in your namespace:
$ kubectl create namespace priority-test
namespace/priority-test created
$ kubectl get namespaces
NAME STATUS AGE
default Active 43m
kube-node-lease Active 43m
kube-public Active 43m
kube-system Active 43m
priority-test Active 5s
Example : pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  priorityClassName: high-priority
$ kubectl apply -f pod.yaml -n priority-test
pod/nginx created
ubuntu@k8s-master-1:~$ kubectl get all -n priority-test
NAME READY STATUS RESTARTS AGE
pod/nginx 1/1 Running 0 25s
$ kubectl describe pod -n priority-test nginx | grep -i priority
Namespace: priority-test
Priority: 1000000
Priority Class Name: high-priority
Normal Scheduled <unknown> default-scheduler Successfully assigned priority-test/nginx to worker-1
Currently per namespace priorities are not possible.
But you can achieve a similar result if instead you set a default priorityClass with globalDefault: true and e.g. value: 1000, then create another, lower priority class with e.g. value: 100 and add it to all dev/staging pods, as sketched below.
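A minimal sketch of that setup; the class names and values are just examples:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: default-priority
value: 1000
globalDefault: true
description: "Default priority for Pods that do not set priorityClassName."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 100
description: "Lower priority; set priorityClassName: low-priority in dev/staging Pod specs."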
Btw., not directly related to the question, but it would be much easier to accomplish what you need if you used nodeSelectors and scheduled dev pods to separate nodes. That way production pods don't have to compete for resources with non-essential pods.
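A rough sketch of that approach, assuming the dev nodes carry a label such as env=dev (the label key/value and pod name are only examples):
apiVersion: v1
kind: Pod
metadata:
  name: dev-pod
spec:
  nodeSelector:
    env: dev
  containers:
  - name: app
    image: nginx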
We have a GKE cluster (1.11) and have implemented HPA based on memory utilization for pods. During our testing we have observed that the HPA behavior is not consistent: HPA is not scaling pods even though the target value is met. We have also noticed that HPA events are not giving us any response data (either scaling or downscaling related info).
Example
kubectl get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
com-manh-cp-organization Deployment/com-manh-cp-organization 95%/90% 1 25 1 1d
kubectl describe hpa com-manh-cp-organization
Name: com-manh-cp-organization
Namespace: default
Labels: app=com-manh-cp-organization
stereotype=REST
Annotations: kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"autoscaling/v2beta1","kind":"HorizontalPodAutoscaler","metadata":{"annotations":{},"labels":{"app":"com-manh-cp-organizatio...
CreationTimestamp: Tue, 12 Feb 2019 18:02:12 +0530
Reference: Deployment/com-manh-cp-organization
Metrics: ( current / target )
resource memory on pods (as a percentage of request): 95% (4122087424) / 90%
Min replicas: 1
Max replicas: 25
Deployment pods: 1 current / 1 desired
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True ReadyForNewScale the last scale time was sufficiently old as to warrant a new scale
ScalingActive True ValidMetricFound the HPA was able to successfully calculate a replica count from memory resource utilization (percentage of request)
ScalingLimited False DesiredWithinRange the desired count is within the acceptable range
Events: <none>
Cluster version : 1.11.6
Cloud service : GKE
Metric : memory
Target : targetAverageUtilization
Any input will be much appreciated; also let us know how we can debug the HPA implementation.
Thanks.
There is a tolerance on the threshold values that the HPA uses when calculating the replica count, as specified in this link.
This tolerance is 0.1 by default, and with your configuration you might not be crossing the threshold at the 90% target because of it; the arithmetic below shows why. I would recommend changing the target to 80% and seeing if it works then.
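A rough sketch of the arithmetic, assuming the default 0.1 tolerance and the values from the describe output above (95% current, 90% target, 1 replica):
ratio = currentMetricValue / desiredMetricValue = 95 / 90 ≈ 1.06
|ratio - 1| = 0.06, which is within the 0.1 tolerance, so the HPA skips scaling
with an 80% target: 95 / 80 ≈ 1.19, outside the tolerance, so desiredReplicas = ceil(1 * 1.19) = 2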
I'm attempting to configure a Horizontal Pod Autoscaler to scale a deployment based on the duty cycle of attached GPUs.
I'm using GKE, and my Kubernetes master version is 1.10.7-gke.6 .
I'm working off the tutorial at https://cloud.google.com/kubernetes-engine/docs/tutorials/external-metrics-autoscaling . In particular, I ran the following command to set up custom metrics:
kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter.yaml
This appears to have worked, or at least I can access a list of metrics at /apis/custom.metrics.k8s.io/v1beta1 .
This is my YAML:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: images-srv-hpa
spec:
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metricName: container.googleapis.com|container|accelerator|duty_cycle
      targetAverageValue: 50
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: images-srv-deployment
I believe that the metricName exists because it's listed in /apis/custom.metrics.k8s.io/v1beta1 , and because it's described on https://cloud.google.com/monitoring/api/metrics_gcp .
This is the error I get when describing the HPA:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedGetExternalMetric 18s (x3 over 1m) horizontal-pod-autoscaler unable to get external metric prod/container.googleapis.com|container|accelerator|duty_cycle/nil: no metrics returned from external metrics API
Warning FailedComputeMetricsReplicas 18s (x3 over 1m) horizontal-pod-autoscaler failed to get container.googleapis.com|container|accelerator|duty_cycle external metric: unable to get external metric prod/container.googleapis.com|container|accelerator|duty_cycle/nil: no metrics returned from external metrics API
I don't really know how to go about debugging this. Does anyone know what might be wrong, or what I could do next?
You are using type: External. For the External Metrics list, you need to use kubernetes.io instead of container.googleapis.com [1].
Replace
metricName: container.googleapis.com|container|accelerator|duty_cycle
with
metricName: kubernetes.io|container|accelerator|duty_cycle
[1]https://cloud.google.com/monitoring/api/metrics_other#other-kubernetes.io
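Applied to the manifest from the question, the metrics section would then look roughly like this (everything else stays as posted):
metrics:
- type: External
  external:
    metricName: kubernetes.io|container|accelerator|duty_cycle
    targetAverageValue: 50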
This problem went away on its own once I placed the system under load. It's working fine now with the same configuration.
I'm not sure why. My best guess is that Stackdriver wasn't reporting a duty cycle value until it went above 1%.