How to limit the maximum number of running pods - kubernetes

We currently have around 20 jobs. These jobs create one pod each, but we want to make sure that only one of these pods can run at a time, keeping the rest of them in pending status. Increasing the resource limits makes them run one by one, but I want to be sure that this is always the behaviour.
Is there any way of limiting this concurrency to 1, let's say per label or something similar?

Use ResourceQuota resource:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pod-demo
spec:
  hard:
    pods: "5"
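To match the question's requirement of at most one running pod, the same quota can be created with pods: "1". A minimal sketch, assuming the jobs run in a hypothetical namespace called batch-jobs:
kubectl create quota pod-demo --hard=pods=1 -n batch-jobs
kubectl describe resourcequota pod-demo -n batch-jobs
Note that pods over the quota are not created in a Pending state; the Job controller's create requests are simply rejected and retried, so the remaining jobs start as soon as the running pod finishes and frees the quota.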

Related

Programmatic calculation of Kubernetes Limit Range

I am looking for a way to calculate appropriate Limit Range and Resource Quota settings for Kubernetes based on the sizing of our Load Test (LT) environment. We want to keep the LT environment flexible so we can experiment with things, and I feel that's a great way to figure out how to set up the limits, etc.
I might also have a fundamental misunderstanding of how this works, so feel free to correct that.
Does anyone have a set of equations or anything that takes into account (I know it won't be an exact science, but I am looking mostly for a jumping-off point):
Container CPU
Container memory
Right now I am pulling the requested CPU limits using this (memory similarly, and using some nifty shell script things I found to do the full calculations for me):
kubectl get pods -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources.limits.cpu}{"\n"}{end}' -n my-namespace
We made sure all of our containers are explicitly making requests for CPU/memory, so that works nicely.
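For the summation step, a rough shell sketch (my own, not from the question) that adds up the CPU limits across the namespace, assuming every limit is expressed in millicores such as 250m (whole-core values like "1" would need converting first):
kubectl get pods -n my-namespace -o jsonpath='{range .items[*]}{.spec.containers[*].resources.limits.cpu}{"\n"}{end}' | tr ' ' '\n' | sed 's/m$//' | awk 'NF {sum += $1} END {print sum "m total CPU limits"}'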
The machine type is based on our testing and target number of pods per node. We have nodeSelector declarations in use as we need to separate things out for some very specific needs by the services being deployed and to be able to leverage multiple machine types.
For the Limit Range I was thinking (adding 10% just for padding):
Maximum [CPU/memory] + 10% (ensuring that the machine type holds 2x that calculation) as:
apiVersion: v1
kind: LimitRange
metadata:
  name: ns-my-namespace
  namespace: my-namespace
spec:
  limits:
  - max:
      cpu: [calculation_from_above]
      memory: [calculation_from_above]
    type: Container
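As a worked example of that rule of thumb (numbers are purely illustrative): if the largest container observed in the LT environment peaks at 400m CPU and 800Mi memory, adding 10% gives max values of 440m and 880Mi, and the chosen machine type should then hold at least twice that, i.e. roughly 880m CPU and 1.76Gi of allocatable capacity per node.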
For the Resource Quota I was thinking (50% to handle estimated overflow in "emergency" HPA):
Total of all CPU/memory in the Load Test environment + 50% as:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ns-my-namespace
  namespace: my-namespace
spec:
  hard:
    limits.cpu: [calculation_from_above]
    limits.memory: [calculation_from_above]
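A worked example for the quota rule (again, illustrative numbers): if all containers in the LT namespace together are limited to 8 CPU and 16Gi memory, adding 50% for the HPA headroom gives limits.cpu: 12 and limits.memory: 24Gi as the hard quota.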

Can we have multiple targets in K8s Horizontal Pod Autoscaler?

We are considering using HPA to scale the number of pods in our cluster. This is how a typical HPA object would look:
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-demo
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hpa-deployment
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 20
My question is - can we have multiple targets (scaleTargetRef) for HPA? Or each deployment/RS/SS/etc. has to have its own HPA?
Tried to look into K8s doc, but could not find any info on this. Any help appreciated, thanks.
https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#support-for-metrics-apis
https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/
Can we have multiple targets (scaleTargetRef) for HPA?
One HorizontalPodAutoscaler has only one scaleTargetRef, which refers to a single resource.
A HorizontalPodAutoscaler controls the scale of a single resource - Deployment/StatefulSet/ReplicaSet. It is actually stated in the documentation, though not that directly:
Here there is a reference to it as well: a single target resource is defined by the scaleTargetRef; the horizontal pod autoscaler learns the current resource consumption for it and sets the desired number of pods by using its Scale subresource.
From practical experience, referencing multiple workload resources in a single HorizontalPodAutoscaler definition will work for only one of them. In addition, when the kubectl autoscale command is applied to several resources, a separate HPA object is created for each of them.
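For illustration, a hedged sketch of that last point (the deployment names web and api are hypothetical): passing several deployments to kubectl autoscale yields one HPA per deployment, which you can confirm afterwards with kubectl get hpa.
kubectl autoscale deployment web api --cpu-percent=50 --min=1 --max=10
kubectl get hpa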

Batch horizontal pod autoscaling

Looking at HPA (I'm pretty new to this), the use case I'm dealing with is applying the same HPA rules to all deployments in a specific namespace.
So I'd ideally want to implement something like this:
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: generalHpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: [deploymentObject1, deploymentObject2, deploymentObject3,...]
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50
I was hoping to handle this via a label/selector, where all deployment objects are marked with a specific label (e.g. enableHpa) and somehow use a selector/matchLabels inside the HorizontalPodAutoscaler to apply it to all those objects.
But it looks like name is required and needs to target a specific deployment object.
Any suggestion on how to handle this case and avoid creating HPAs one by one for every single deployment by name?
There are two ways of setting up a new HorizontalPodAutoscaler object:
Declarative approach described here:
Creating the autoscaler declaratively
Instead of using the kubectl autoscale command to create a HorizontalPodAutoscaler imperatively, we can use the following file to create it declaratively:
application/hpa/php-apache.yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50
We will create the autoscaler by executing the following command:
kubectl create -f https://k8s.io/examples/application/hpa/php-apache.yaml
Imperative approach i.e. by invoking kubectl autoscale command:
kubectl autoscale deployment nginx-deployment --cpu-percent=50 --min=1 --max=5
The first approach doesn't leave much room for further interpretation. The syntax is strictly specified and you cannot do much about it. As you can see, both the kind and name of the scaling target must be specified, and although your pseudo code may seem like an interesting proposal, it has no chance of working. According to the specification, the name field is a single string, so a list simply cannot be used in this context.
When it comes to the imperative approach, actually you can automate it by using a fairly simple bash one-liner and make your life a bit easier. If you have... let's say 50 different deployments and you want to autoscale all of them, it can save you a lot of time.
For the sake of simplicity I've created only 3 different deployments:
$ kubectl get deployments
NAME                 READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment-1   3/3     3            3           4m3s
nginx-deployment-2   3/3     3            3           3m58s
nginx-deployment-3   3/3     3            3           3m54s
In order not to create the HPAs one by one manually, I used the following bash one-liner:
$ for i in $(kubectl get deployments -o jsonpath='{.items[*].metadata.name}');do kubectl autoscale deployment $i --cpu-percent=50 --min=1 --max=3; done
the result of which is:
horizontalpodautoscaler.autoscaling/nginx-deployment-1 autoscaled
horizontalpodautoscaler.autoscaling/nginx-deployment-2 autoscaled
horizontalpodautoscaler.autoscaling/nginx-deployment-3 autoscaled
Command:
kubectl get deployments -o jsonpath='{.items[*].metadata.name}'
returns only the names of your deployments, so they can be easily iterated through with a for loop. Notice that we still have a 1-to-1 relation here: each Deployment corresponds to exactly one HorizontalPodAutoscaler object. If you additionally need to deal with different namespaces, the script can be further expanded.
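As a hedged sketch of such an expansion (it assumes the Deployments you want covered carry the enableHpa=true label mentioned in the question, and it loops over every namespace):
for ns in $(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}'); do
  for d in $(kubectl get deployments -n "$ns" -l enableHpa=true -o jsonpath='{.items[*].metadata.name}'); do
    kubectl autoscale deployment "$d" -n "$ns" --cpu-percent=50 --min=1 --max=3
  done
done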
Going back to your specific requirement, the question arises as to the legitimacy of such a solution. Although it may seem quite tempting to manage all your Deployments with one single HorizontalPodAutoscaler object (less work in the very beginning), if you take a closer look at all the potential downsides of such an approach, you will probably change your mind quickly. First of all, such a solution isn't very scalable. In fact, it is not scalable at all. Just imagine that for some reason you want to change the targetCPUUtilizationPercentage for a single Deployment object. Well... you have a problem. It is managed by one global autoscaler, and you would need to quickly redesign your environment and create a separate HPA. So the 1-to-1 relation between a HorizontalPodAutoscaler and a Deployment/ReplicationController/ReplicaSet makes perfect sense. What you usually need is a more granular level of control rather than the possibility of managing everything with one huge general object.

Exactly one Pod

I'm working on deploying the Thanos monitoring system and one of its components, the metric compactor, warns that there should never be more than one compactor running at the same time. If this constraint is violated it will likely lead to corruption of metric data.
Is there any way to codify "Exactly One" pod via Deployment/StatefulSet/etc, aside from "just set replicas: 1 and never scale"? We're using Rancher as an orchestration layer and it's real easy to hit that + button without thinking about it.
Limit the replicas in your deployment to 1
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
Be careful with Deployments, because they can be configured with two update strategies:
RollingUpdate: new pods are added while old pods are terminated. This means that, depending on the maxSurge option, even if you set replicas to 1 you may still have up to 2 pods running during an update.
Recreate: all the previous pods are terminated before any new pods are created.
StatefulSets, instead, guarantee that there will never be more than one instance of a pod at any given time.
apiVersion: apps/v1
kind: StatefulSet
spec:
  replicas: 1
Unlike with Deployments, a pod is not replaced until the previous one has been terminated.
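If staying with a Deployment is preferred, a hedged sketch (reusing the my-app name from above; selector and template omitted as in the snippets above) is to switch the update strategy so no extra pod is surged during a rollout:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  strategy:
    type: Recreate
Keep in mind that this only affects rollouts; it does not stop anyone from scaling replicas up, which is why the StatefulSet approach above is the safer option for a strict at-most-one guarantee.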

Have kube jobs start on waiting pods

I am working on a scenario where I want to be able to maintain some X number of pods in waiting (and managed by kube) and then upon user request (via some external system) have a kube job start on one of those waiting pods. So now the waiting pods count is X-1 and kube starts another pod to bring this number back to X.
This way I'll be able to cut down on the time taken to create a pod, start a container and get it ready to start the actual processing. The processing data can be sent to those pods via some sort of messaging (Akka or RabbitMQ).
I think ReplicationControllers are the best place to keep idle pods, but when I create a job, how can I specify that I want to be able to use one of the pods that are waiting and are managed by the ReplicationController?
I think I got this to work up to a state on top of which I can build this solution.
So what I am doing is starting a RC with replicas: X (X is the number of idle pods I wish to maintain, usually not a very large number). The pods that it starts have custom label status: idle or something like that. The RC spec.selector has the same custom label value to match with the pods that it manages, so spec.selector.status: idle. When creating this RC, kube ensures that it creates X pods with their status=idle. Somewhat like below:
apiVersion: v1
kind: ReplicationController
metadata:
  name: testrc
spec:
  replicas: 3
  selector:
    status: idle
  template:
    metadata:
      name: idlepod
      labels:
        status: idle
    spec:
      containers:
      ...
On the other hand I have a job yaml that has spec.manualSelector: true (and yes I have taken into account that the label set has to be unique). With manualSelector enabled, I can now define selectors on the job like below.
apiVersion: batch/v1
kind: Job
metadata:
  generateName: testjob-
spec:
  manualSelector: true
  selector:
    matchLabels:
      status: active
  ...
So clearly, RC creates pods with status=idle and job expects to use pods with status=active because of the selector.
So now, whenever I have a request to start a new job, I'll update the label on one of the pods managed by the RC so that its status=active. The selector on the RC will effect the release of this pod from its control and start another one, because of the replicas: X set on it. The released pod is no longer controlled by the RC and is now an orphan. Finally, when I create a job, the selector on this job template will match the label of the orphaned pod, and this pod will then be controlled by the new job. I'll send messages to this pod that will start the actual processing and finally bring it to completion.
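The hand-off itself can be a single relabel. A minimal sketch (label keys and values taken from the manifests above; the pod selection and the file name testjob.yaml are my own assumptions):
POD=$(kubectl get pods -l status=idle -o jsonpath='{.items[0].metadata.name}')
kubectl label pod "$POD" status=active --overwrite
kubectl create -f testjob.yaml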
P.S.: Pardon my formatting. I am new here.