Kubernetes giving each pod access to GPU

I'm new to Kubernetes. My goal is to create a serverless-like architecture on GPUs (i.e. fan out to 1000+ pods).
I understand a node may be a virtual or physical machine. I am using GKE to help manage k8s. My node machine config is n1-standard-4 with 1 x NVIDIA Tesla T4.
With that setup it seems I could only have 4 pods; if I wanted, say, 16 pods per node, I could use n1-standard-16.
Let's say we are using n1-standard-4 and run 4 pods on that node: how can we give each pod access to the GPU? Currently I am only able to run one pod, while the other pods stay pending. This only seems to happen when I add the GPU resource in my YAML file.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: load-balancer-example
  name: hello-world
spec:
  replicas: 4
  selector:
    matchLabels:
      app.kubernetes.io/name: load-balancer-example
  template:
    metadata:
      labels:
        app.kubernetes.io/name: load-balancer-example
    spec:
      containers:
      - image: CUSTOM_IMAGE_WITH_NVIDIA/CUDA/UBUNTU
        name: test
        ports:
        - containerPort: 3000
        resources:
          limits:
            nvidia.com/gpu: 1
Without the GPU resource, and with a basic node container, it fans out fine. With the GPU resource I can only get one pod to run.

What you are creating is not a Pod but a Deployment with a replica count of 4, which is essentially 4 pods. All 4 of these pods have to be scheduled onto your n1-standard-4 nodes.
There are certain limitations when it comes to using GPUs with pods. This is very different from CPU sharing. In short, GPUs are only supposed to be specified in the limits section, which means:
You can specify GPU limits without specifying requests because Kubernetes will use the limit as the request value by default.
You can specify GPU in both limits and requests but these two values must be equal.
You cannot specify GPU requests without specifying limits.
Containers (and Pods) do not share GPUs. There's no overcommitting of GPUs.
Each container can request one or more GPUs. It is not possible to request a fraction of a GPU.
You can read more about these limitations here.
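For example, a container resources block like the following (a minimal sketch) is valid because the GPU request equals the limit:

resources:
  requests:
    nvidia.com/gpu: 1
  limits:
    nvidia.com/gpu: 1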
Your best option is to create a node pool with your desired GPU type. This node pool will have as many nodes as there are pods in your deployment; each node will host only 1 pod and carry 1 GPU of your choice. I suggest this instead of multiple GPUs per node because you want a fan-out/scale-out architecture, so many smaller nodes will serve you better than a few larger ones.
You can read more about how to do this on the GKE docs here.
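As a rough sketch (the cluster, pool and zone names below are placeholders), such a pool could be created and autoscaled with something like:

gcloud container node-pools create gpu-pool \
  --cluster my-cluster --zone us-central1-a \
  --machine-type n1-standard-4 \
  --accelerator type=nvidia-tesla-t4,count=1 \
  --enable-autoscaling --min-nodes 0 --max-nodes 1000 \
  --num-nodes 1

Note that on GKE you typically also have to deploy NVIDIA's driver-installation DaemonSet before nvidia.com/gpu shows up as an allocatable resource; the GKE docs referenced above cover that step.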
Note that having n1-standard-4 doesn't mean you can have 4 pods on the node. It simply means the node has 4 vCPUs which you can share across as many pods as needed. But since you want to run GPU workloads, this node type should not matter much, as long as you attach the right amount of GPU resources.
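To see what a node actually exposes to the scheduler (and to confirm the GPU device plugin is working), you can inspect its capacity; the Capacity and Allocatable sections list nvidia.com/gpu alongside cpu and memory:

kubectl describe node <node-name>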

Related

Kubernetes scheduling: control preference and remove pods in priority on a node

Context
I have a Kubernetes cluster with 2 pools:
the first one, bare-metal, is composed of a finite number of bare metal instances. Every node in this pool has a label node-type: bare-metal
the second one, vms, is composed of virtual machines. This is used if the cluster needs to autoscale, since provisioning VMs is faster than provisioning a new bare metal instance. Every node in this pool has a label node-type: vms
For performance reasons, I would like all pods to run on a bare-metal node whenever possible. If, and only if, that's not possible (not enough memory, too many pods already running), I'm fine with pods running on a vms node. I have 2 (tightly related) questions about this.
About preferredDuringSchedulingIgnoredDuringExecution behavior
To schedule pods preferentially on bare-metal, I've used preferredDuringSchedulingIgnoredDuringExecution. Here is an example with a simple nginx deployment with 90 replicas (a single bare-metal pool node could hold all of them):
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  replicas: 90
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
        resources:
          requests:
            cpu: 50m
            memory: 50M
          limits:
            cpu: 50m
            memory: 50M
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: node-type
                operator: In
                values:
                - bare-metal
This works fine, even if I don't really understand why, out of the 90 pods, 3 are scheduled on a vms pool node (even though scheduling them on the bare-metal pool is still possible). Well, I understand that this is preferred rather than required, but I'm wondering whether this "preference" can be tuned.
Question: Can I configure my deployment to say something like "preferred only if there is no other solution"?
Scaling down
When I scale down my Deployment, I would like the pods scheduled on the vms pool to be deleted first, because my pods should preferentially run on a bare-metal pool node. However, if I run:
kubectl scale deployment nginx --replicas 3
I can observe that pods on bare-metal pool nodes are deleted first, leaving only 3 pods on a vms pool node.
Question: Is there a way to configure my deployment to tell Kubernetes to remove the pods on the vms pool first?
Edit: It seems that the PodDeletionCost proposal could solve this, but it is not implemented yet.
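For reference, that proposal introduces a pod annotation; once the feature ships it would look roughly like this (a sketch based on the proposal, not something usable today). Pods with a lower deletion cost are removed first, so pods landing on vms nodes would be given a lower cost than those on bare-metal nodes:

metadata:
  annotations:
    controller.kubernetes.io/pod-deletion-cost: "-1"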

Verifying resources in a deployment yaml

In a deployment YAML, how can we verify that the resources we need for the running pods are guaranteed by Kubernetes?
Is there a way to figure that out?
Specify your resource requests in the deployment YAML. The kube-scheduler checks those requests against each node's available capacity before scheduling the pods, so a pod is only placed on a node that can satisfy them.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  selector:
    matchLabels:
      app: guestbook
      tier: frontend
  replicas: 3
  template:
    metadata:
      labels:
        app: guestbook
        tier: frontend
    spec:
      containers:
      - name: php-redis
        image: gcr.io/google-samples/gb-frontend:v4
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
How are Pods with resource requests scheduled? (Ref)
When you create a Pod, the Kubernetes scheduler selects a node for the Pod to run on. Each node has a maximum capacity for each of the resource types: the amount of CPU and memory it can provide for Pods. The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled Containers is less than the capacity of the node. Note that although actual memory or CPU resource usage on nodes is very low, the scheduler still refuses to place a Pod on a node if the capacity check fails. This protects against a resource shortage on a node when resource usage later increases, for example, during a daily peak in request rate.
N.B.: However, if you want a container not to use more than its allowed resources, specify the limit too.
There are QoS (Quality of Service) classes for running pods in k8s. The class that both guarantees and restricts requests and limits is qosClass: Guaranteed.
To make your pods' QoS class Guaranteed:
Every Container in the Pod must have a memory limit and a memory request.
For every Container in the Pod, the memory limit must equal the memory request.
Every Container in the Pod must have a CPU limit and a CPU request.
For every Container in the Pod, the CPU limit must equal the CPU request.
These restrictions apply to init containers and app containers equally.
Also check out the reference page for more info:
https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/
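For example, giving a container equal requests and limits (a sketch reusing the frontend example above) makes the pod Guaranteed:

resources:
  requests:
    cpu: 100m
    memory: 100Mi
  limits:
    cpu: 100m
    memory: 100Mi

You can then confirm the class Kubernetes assigned with:

kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'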

Distribute pods for a deployment across different node pools

In my GKE Kubernetes cluster, I have 2 node pools; one with regular nodes and the other with pre-emptible nodes. I'd like some of the pods to be on pre-emptible nodes so I can save costs while I have at least 1 pod on a regular non-pre-emptible node to reduce the risk of downtime.
I'm aware of using podAntiAffinity to encourage pods to be scheduled on different nodes, but is there a way to have k8s schedule pods for a single deployment across both pools?
Yes! You can use Pod Topology Spread Constraints, based on a label key on your nodes. For example, the label could be type and the values could be regular and preemptible. Then you can have something like this:
kind: Pod
apiVersion: v1
metadata:
  name: mypod
  labels:
    foo: bar
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: type
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        foo: bar
  containers:
  - name: app
    image: myimage
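For this to work the nodes must actually carry the type label. On GKE you could either point topologyKey at the built-in cloud.google.com/gke-nodepool label instead, or label the nodes yourself (node names below are placeholders):

kubectl label nodes gke-regular-pool-node-1 type=regular
kubectl label nodes gke-preemptible-pool-node-1 type=preemptible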
The maxSkew field is the maximum allowed difference in the number of matching pods between any two label values (node types).
You can also combine multiple Pod Topology Spread Constraints, and use them together with PodAffinity/PodAntiAffinity and NodeAffinity, depending on what best fits your use case.
Note: This feature is alpha in 1.16 and beta in 1.18. Beta features are enabled by default but with alpha features, you need an alpha cluster in GKE.

ceph-mds pod fails to launch with Insufficient cpu, MatchNodeSelector, PodToleratesNodeTaints

I tracked down the CPU usage. Even after increasing the number of nodes I still get a persistent scheduling error with the following terms: Insufficient cpu, MatchNodeSelector, PodToleratesNodeTaints.
My hint came from this article. It mentions:
Do not allow new pods to schedule onto the node unless they tolerate the taint, but allow all pods submitted to Kubelet without going through the scheduler to start, and allow all already-running pods to continue running. Enforced by the scheduler.
The configuration contains the following.
spec:
  replicas: 1
  template:
    metadata:
      name: ceph-mds
      namespace: ceph
      labels:
        app: ceph
        daemon: mds
    spec:
      nodeSelector:
        node-type: storage
      ... and more ...
Notice the node-type selector. I had to run kubectl label nodes --all node-type=storage so that every node carries the node-type=storage label; I could also choose to dedicate only some nodes as storage nodes.
In kops edit ig nodes, according to this hint, you can add that label as follows.
spec:
  nodeLabels:
    node-type: storage
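If the scheduler also reports PodToleratesNodeTaints, run kubectl describe node <node-name> and look at the Taints field; if the storage nodes are tainted, the pod template additionally needs a matching toleration, for example (a sketch, the taint key/value/effect are placeholders):

spec:
  tolerations:
  - key: dedicated
    operator: Equal
    value: storage
    effect: NoSchedule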

How are minions chosen for a given pod creation request

How does Kubernetes choose the minion among the many available for a given pod creation command? Is it something that can be controlled/tweaked?
If replicated pods are submitted for deployment, is Kubernetes intelligent enough to place them on different minions if they expose the same container/host port pair? Or does it always place different replicas on different minions?
What about corner cases, e.g. two different pods (not necessarily replicas) that expose the same host/container port pair being submitted? Will they be carefully placed on different minions?
If a pod has specific compute/memory requirements, can it be placed on a minion/host that has sufficient resources left to meet those requirements?
In summary, is there detailed documentation on kubernetes pod placement strategy?
Pods are scheduled onto nodes using the algorithm in generic_scheduler.go.
There are rules that prevent host-port conflicts and that make sure a node has enough memory and CPU to satisfy the pod's requirements; see predicates.go.
One way to choose the minion for pod creation is to use a nodeSelector. Inside the pod's YAML file, specify the label of the minion on which you want the pod to run.
apiVersion: v1
kind: Pod
metadata:
  name: nginx1
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    key: value
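For the nodeSelector above to match anything, the target minion needs that label first (key=value is a placeholder, as in the spec above):

kubectl label nodes <minion-name> key=value
kubectl get nodes --show-labels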