Verifying resources in a deployment yaml - kubernetes

In a deployment YAML, how can we verify that the resources we need for the running pods are guaranteed by Kubernetes?
Is there a way to figure that out?

Specify your resource requests in the Deployment YAML. The kube-scheduler checks that the requested resources are available on a node before scheduling the Pods onto it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  selector:
    matchLabels:
      app: guestbook
      tier: frontend
  replicas: 3
  template:
    metadata:
      labels:
        app: guestbook
        tier: frontend
    spec:
      containers:
      - name: php-redis
        image: gcr.io/google-samples/gb-frontend:v4
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
How are Pods with resource requests scheduled? Ref
When you create a Pod, the Kubernetes scheduler selects a node for the Pod to run on. Each node has a maximum capacity for each of the resource types: the amount of CPU and memory it can provide for Pods. The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled Containers is less than the capacity of the node. Note that although actual memory or CPU resource usage on nodes is very low, the scheduler still refuses to place a Pod on a node if the capacity check fails. This protects against a resource shortage on a node when resource usage later increases, for example, during a daily peak in request rate.
N.B.: However, if you want a container not to use more than its allowed resources, specify the limit too.
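To check this on a running cluster, you can inspect a node's capacity, allocatable resources, and the requests already scheduled onto it (the node name below is a placeholder):
# Shows Capacity, Allocatable, and the "Allocated resources" summary of the
# CPU/memory requests and limits already placed on the node
kubectl describe node <node-name>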

There are QoS (Quality of Service) classes for running Pods in k8s. The class that guarantees (and restricts) requests and limits is qosClass: Guaranteed.
To make your Pods' QoS class Guaranteed:
Every Container in the Pod must have a memory limit and a memory request.
For every Container in the Pod, the memory limit must equal the memory request.
Every Container in the Pod must have a CPU limit and a CPU request.
For every Container in the Pod, the CPU limit must equal the CPU request.
These restrictions apply to init containers and app containers equally.
Also check out the reference page for more info:
https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/
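For illustration, here is a minimal sketch of a Pod that would be classified as Guaranteed (the Pod name and image are placeholders); every container sets its CPU and memory limits equal to its requests:
apiVersion: v1
kind: Pod
metadata:
  name: qos-demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
      limits:
        cpu: 100m        # limit equals request
        memory: 100Mi    # limit equals request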

Related

Kubernetes scheduling: control preference and remove pods in priority on a node

Context
I have a Kubernetes cluster with 2 pools:
the first one, bare-metal, is composed of a finite number of bare metal instances. Every node in this pool has a label node-type: bare-metal
the second one, vms, is composed of virtual machines. This is used if the cluster needs to autoscale, since provisioning VMs is faster than provisioning a new bare metal instance. Every node in this pool has a label node-type: vms
For performance reasons, I would like all pods to run on a bare-metal node when possible. If that is not possible (not enough memory, too many pods already existing), and only in that case, it is OK for me to run pods on a vms node. I have 2 questions (tightly related) about this.
About preferredDuringSchedulingIgnoredDuringExecution behavior
To schedule pods in priority on bare-metal, I've used preferredDuringSchedulingIgnoredDuringExecution. Example with a simple nginx deployment with 90 replicas (a bare-metal pool node could hold all of them):
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  replicas: 90
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
        resources:
          requests:
            cpu: 50m
            memory: 50M
          limits:
            cpu: 50m
            memory: 50M
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: node-type
                operator: In
                values:
                - bare-metal
This works fine, even if I don't really understand why, out of the 90 pods, 3 are scheduled on a node of the vms pool (even though scheduling them on the bare-metal pool is still possible). Well, I understand that this is preferred instead of required, but I'm wondering if we can tune this "preference".
Question: Can I configure my deployment to say something like "preferred only if there is no other solution"?
Scaling down
When I scale down my Deployment, I would like the pods scheduled on the vms pool to be deleted first, because my pods should run in priority on a bare-metal pool node. However, if I run:
kubectl scale deployment nginx --replicas 3
I can observe that pods on bare-metal pool nodes are deleted first, leaving only 3 pods on a vms pool node.
Question: Is there a way to configure my deployment to tell Kubernetes to remove the pods on vms pool first?
Edit: It seems that the proposal to introduce PodDeletionCost could solve this, but it is not implemented for now.
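For reference, if a cluster version eventually ships the PodDeletionCost feature from that proposal, the idea (a sketch based on the proposal, not something available at the time of writing; the pod name is a placeholder) would be to give vms-pool pods a lower deletion cost so the ReplicaSet removes them first on scale-down:
# Hypothetical example: pods with a lower pod-deletion-cost are removed first on scale-down
kubectl annotate pod <pod-on-a-vms-node> controller.kubernetes.io/pod-deletion-cost=-100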

Kubernetes giving each pod access to GPU

I'm new to Kubernetes; my goal is to create a serverless-like architecture on GPUs (i.e. fan out to 1000+ pods).
I understand a node may be a virtual or physical machine. I am using GKE to help manage k8s. My node machine config is n1-standard-4 with 1 x NVIDIA Tesla T4.
With that setup it seems I could only have 4 pods; if I wanted, let's say, 16 pods per node, I could use n1-standard-16.
Let's say we are using n1-standard-4 and run 4 pods on that node, how can we give each pod access to the GPU? Currently I am only able to run one pod, while the other pods stay pending. This seems to only happen when I add the GPU resource in my YAML file.
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app.kubernetes.io/name: load-balancer-example
name: hello-world
spec:
replicas: 4
selector:
matchLabels:
app.kubernetes.io/name: load-balancer-example
template:
metadata:
labels:
app.kubernetes.io/name: load-balancer-example
spec:
containers:
- image: CUSTOM_IMAGE_WITH_NVIDIA/CUDA/UBUNTU
name: test
ports:
- containerPort: 3000
resources:
limits:
nvidia.com/gpu: 1
Without the GPU resource and with a basic node container it seems to fan out fine. With the GPU resource I can only get one POD to run.
What you are creating is not a Pod but a Deployment with a replica count of 4, which is essentially 4 pods. All 4 of these pods are using your n1-standard-4 type of node.
There are certain limitations when it comes to using GPUs with pods. This is very different from CPU sharing. In short, GPUs are only supposed to be specified in the limits section, which means:
You can specify GPU limits without specifying requests because Kubernetes will use the limit as the request value by default.
You can specify GPU in both limits and requests but these two values must be equal.
You cannot specify GPU requests without specifying limits.
Containers (and Pods) do not share GPUs. There's no overcommitting of GPUs.
Each container can request one or more GPUs. It is not possible to request a fraction of a GPU.
You can read more about these limitations here.
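As a small sketch of the second point above, the container's resources section from the Deployment in the question could specify both, as long as the two values match:
resources:
  requests:
    nvidia.com/gpu: 1   # must equal the limit
  limits:
    nvidia.com/gpu: 1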
Your best option will be to create a node pool with your desired GPU type. This node pool will have # nodes = # pods in your deployment; each node will host only 1 pod and will have 1 GPU of your choice. I suggest this instead of multiple GPUs per node because you want a fan-out/scale-out architecture, so more, smaller nodes will be better than fewer, larger nodes.
You can read more about how to do this on the GKE docs here.
Note that having n1-standard-4 doesn't mean you can have 4 pods on the node. It simply means the node has 4 vCPUs which you can share across as many pods as needed. But since you want to run GPU workloads, this node type should not matter much, as long as you attach the right amount of GPU resources.
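For illustration only, and using placeholder names (gpu-pool, my-cluster), creating such a pool on GKE could look roughly like this; check the GKE docs linked above for the exact flags for your setup:
gcloud container node-pools create gpu-pool \
  --cluster my-cluster \
  --machine-type n1-standard-4 \
  --accelerator type=nvidia-tesla-t4,count=1 \
  --num-nodes 4
On GKE you also need to install the NVIDIA device drivers (Google provides a driver-installer DaemonSet) before nvidia.com/gpu becomes allocatable on the nodes.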

What is the relationship between scheduling policies and scheduling Configuration in k8s

The k8s scheduling implementation comes in two forms: Scheduling Policies and Scheduling Profiles.
What is the relationship between the two? They seem to overlap but have some differences. For example, there is a NodeUnschedulable in the profiles but not in the policy. CheckNodePIDPressure is in the policy, but not in the profiles
In addition, there is a default startup option in the scheduling configuration, but it is not specified in the scheduling policy. How can I know about the default startup policy?
I really appreciate any help with this.
The difference is simple: Kubernetes doesn't support 'Scheduling Policies' from v1.19 onwards; 'Scheduling Profiles' replace them. Kubernetes v1.19 supports configuring multiple scheduling profiles with a single scheduler, and we are using this to define a bin-packing scheduling profile by default in all our v1.19 clusters.
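As a sketch of how such a bin-packing profile could be defined on a v1.19 cluster (the profile name bin-packing-scheduler is just the one used in this answer, and the kube-scheduler has to be started with --config pointing at this file):
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: bin-packing-scheduler
  plugins:
    score:
      enabled:
      - name: NodeResourcesMostAllocated    # prefer nodes that are already well utilised
      disabled:
      - name: NodeResourcesLeastAllocated   # the default spreading-style scoring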
To use that scheduling profile, all that is required is to specify the scheduler name bin-packing-scheduler in the Pod spec. For example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      schedulerName: bin-packing-scheduler
      containers:
      - name: nginx
        image: nginx:1.17.8
        resources:
          requests:
            cpu: 200m
The pods of this deployment will be scheduled onto the nodes which already have the highest resource utilisation, to optimise for autoscaling or to ensure efficient pod placement when mixing large and small pods in the same cluster.
If a scheduler name is not specified then the default spreading algorithm will be used to distribute pods across all nodes.

What is the default allocation when resources are not specified in Kubernetes?

Below is a Kubernetes Pod definition:
apiVersion: v1
kind: Pod
metadata:
  name: static-web
  labels:
    role: myrole
spec:
  containers:
  - name: web
    image: nginx
    ports:
    - name: web
      containerPort: 80
      protocol: TCP
As I have not specified the resources, how much memory & CPU will be allocated? Is there a kubectl command to find what is allocated for the Pod?
If resources are not specified for the Pod, the Pod will be scheduled to any node and resources are not considered when choosing a node.
The Pod might be "terminated" if it uses more memory than is available, or it may get little CPU time, as Pods with specified resources will be prioritized. It is good practice to set resources for your Pods.
See Configure Quality of Service for Pods - your Pod will be classified as "Best Effort":
For a Pod to be given a QoS class of BestEffort, the Containers in the Pod must not have any memory or CPU limits or requests.
In your case, Kubernetes will assign the QoS class BestEffort to your Pod.
That means the kube-scheduler has no resource information to take into account when scheduling your Pod and just does its best.
That also means your Pod can consume any amount of resources (CPU/memory) it wants (but the kubelet will evict it first if the node runs out of resources).
To see the current resource usage of your Pod, you can use kubectl top pod xxxx.
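You can also confirm the QoS class Kubernetes assigned from the Pod status (using the static-web Pod from the question):
# Prints the assigned QoS class, e.g. BestEffort for the Pod above
kubectl get pod static-web -o jsonpath='{.status.qosClass}'

# Shows current CPU/memory usage (requires metrics-server in the cluster)
kubectl top pod static-web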

ceph-mds pod fails to launch with Insufficient cpu, MatchNodeSelector, PodToleratesNodeTaints

I tracked down the CPU usage. Even after increasing the number of nodes I still get a persistent scheduling error with the following terms: Insufficient cpu, MatchNodeSelector, PodToleratesNodeTaints.
My hint came from this article. It mentions:
Do not allow new pods to schedule onto the node unless they tolerate the taint, but allow all pods submitted to Kubelet without going through the scheduler to start, and allow all already-running pods to continue running. Enforced by the scheduler.
The configuration contains the following.
spec:
  replicas: 1
  template:
    metadata:
      name: ceph-mds
      namespace: ceph
      labels:
        app: ceph
        daemon: mds
    spec:
      nodeSelector:
        node-type: storage
      ... and more ...
Notice the node-type selector. I had to run kubectl label nodes node-type=storage --all to label all nodes with node-type=storage. I could also choose to dedicate only some nodes as storage nodes.
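For example, to dedicate only specific nodes and verify the result (the node names here are placeholders):
# Label only the nodes that should run storage pods
kubectl label nodes node-a node-b node-type=storage

# Check which nodes carry the label
kubectl get nodes -l node-type=storage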
In kops edit ig nodes, according to this hint, you can add this label as follows:
spec:
  nodeLabels:
    node-type: storage