ceph-mds pod fails to launch with Insufficient cpu, MatchNodeSelector, PodToleratesNodeTaints - kubernetes

I tracked down the CPU usage. Even after increasing the number of nodes I still get a persistent scheduling error with the following terms: Insufficient cpu, MatchNodeSelector, PodToleratesNodeTaints.

My hint came from this article. It mentions:
Do not allow new pods to schedule onto the node unless they tolerate
the taint, but allow all pods submitted to Kubelet without going
through the scheduler to start, and allow all already-running pods to
continue running. Enforced by the scheduler.
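For reference, a pod opts in to such a tainted node by declaring a toleration. A minimal sketch, assuming a hypothetical node-type=storage:NoSchedule taint (the key, value and effect are only illustrative):
spec:
  tolerations:
  - key: "node-type"
    operator: "Equal"
    value: "storage"
    effect: "NoSchedule"    # illustrative; must match the taint on the node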
The configuration contains the following.
spec:
  replicas: 1
  template:
    metadata:
      name: ceph-mds
      namespace: ceph
      labels:
        app: ceph
        daemon: mds
    spec:
      nodeSelector:
        node-type: storage
      ... and more ...
Notice the node-type selector. I have to run kubectl label nodes node-type=storage --all so that all nodes are labelled with node-type=storage. I could also choose to dedicate only some nodes as storage nodes.
In kops edit ig nodes, according to this hint, you can add this label as follows:
spec:
  nodeLabels:
    node-type: storage


Kubernetes giving each pod access to GPU

I'm new to Kubernetes; my goal is to create a serverless-like architecture on GPUs (i.e. fan out to 1000+ pods).
I understand a node may be a virtual or physical machine. I am using GKE to help manage Kubernetes. My node machine type is n1-standard-4 with 1 x NVIDIA Tesla T4.
With that setup it seems I could only have 4 pods; if I wanted, say, 16 pods per node, I could use n1-standard-16.
Say we are using n1-standard-4 and run 4 pods on that node: how can we give each pod access to the GPU? Currently I am only able to run one pod, while the other pods stay Pending. This seems to happen only when I add the GPU resource to my YAML file.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: load-balancer-example
  name: hello-world
spec:
  replicas: 4
  selector:
    matchLabels:
      app.kubernetes.io/name: load-balancer-example
  template:
    metadata:
      labels:
        app.kubernetes.io/name: load-balancer-example
    spec:
      containers:
      - image: CUSTOM_IMAGE_WITH_NVIDIA/CUDA/UBUNTU
        name: test
        ports:
        - containerPort: 3000
        resources:
          limits:
            nvidia.com/gpu: 1
Without the GPU resource, and with a basic node container, it seems to fan out fine. With the GPU resource I can only get one pod to run.
What you are creating is not a Pod but a Deployment with a replica count of 4, which is essentially 4 pods. All 4 of these pods are using your n1-standard-4 type of node.
There are certain limitations when it comes to using GPUs with pods. This is very different from CPU sharing. In short, GPUs are only supposed to be specified in the limits section, which means:
You can specify GPU limits without specifying requests because Kubernetes will use the limit as the request value by default.
You can specify GPU in both limits and requests but these two values must be equal.
You cannot specify GPU requests without specifying limits.
Containers (and Pods) do not share GPUs. There's no overcommitting of GPUs.
Each container can request one or more GPUs. It is not possible to request a fraction of a GPU.
You can read more about these limitations here.
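As a minimal sketch of the first two points above, a container's resources section can set the GPU in limits only, or in both limits and requests as long as the values match:
resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1   # optional; if present, it must equal the limit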
Your best option will be to create a node pool with your desired GPU type. This node pool will have as many nodes as pods in your deployment; each node will host only 1 pod and will have 1 GPU of your choice. I suggest this instead of multiple GPUs per node because you want a fan-out/scale-out architecture, so more smaller nodes will be better than fewer larger nodes.
You can read more about how to do this on the GKE docs here.
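As a sketch of how a pod could target such a GPU node pool, assuming the pool was created with Tesla T4 accelerators (GKE adds the cloud.google.com/gke-accelerator label to GPU nodes):
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-tesla-t4   # label set by GKE on GPU nodes
  containers:
  - name: test
    image: CUSTOM_IMAGE_WITH_NVIDIA/CUDA/UBUNTU
    resources:
      limits:
        nvidia.com/gpu: 1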
Note that having n1-standard-4 doesn't mean you can have 4 pods on the node. It simply means the node has 4 vCPUs which you can share across as many pods as needed. But since you want to run GPU workloads, this node type should not matter much, as long as you attach the right amount of GPU resources.

Verifying resources in a deployment yaml

In a deployment YAML, how can we verify that the resources we need for the running pods are guaranteed by Kubernetes?
Is there a way to figure that out?
Specify your resource requests in the deployment YAML. The kube-scheduler takes these requests into account and ensures they can be satisfied before scheduling the pods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  selector:
    matchLabels:
      app: guestbook
      tier: frontend
  replicas: 3
  template:
    metadata:
      labels:
        app: guestbook
        tier: frontend
    spec:
      containers:
      - name: php-redis
        image: gcr.io/google-samples/gb-frontend:v4
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
How are Pods with resource requests scheduled? (Ref)
When you create a Pod, the Kubernetes scheduler selects a node for the Pod to run on. Each node has a maximum capacity for each of the resource types: the amount of CPU and memory it can provide for Pods. The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled Containers is less than the capacity of the node. Note that although actual memory or CPU resource usage on nodes is very low, the scheduler still refuses to place a Pod on a node if the capacity check fails. This protects against a resource shortage on a node when resource usage later increases, for example, during a daily peak in request rate.
N.B.: However, if you want a container not to use more than its allowed resources, specify the limit too.
There are QoS (Quality of Service) classes for running pods in Kubernetes. The class that both guarantees and restricts requests and limits is qosClass: Guaranteed.
To make your pods' QoS class Guaranteed:
Every Container in the Pod must have a memory limit and a memory request.
For every Container in the Pod, the memory limit must equal the memory request.
Every Container in the Pod must have a CPU limit and a CPU request.
For every Container in the Pod, the CPU limit must equal the CPU request.
These restrictions apply to init containers and app containers equally.
Also check out the reference page for more info: https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/
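As a minimal sketch, a pod spec like the following (name and image are placeholders) would be assigned the Guaranteed QoS class, because requests and limits are set and equal for both CPU and memory:
spec:
  containers:
  - name: app               # placeholder name
    image: nginx:1.7.9      # placeholder image
    resources:
      requests:
        cpu: 500m
        memory: 256Mi
      limits:
        cpu: 500m           # equal to the request
        memory: 256Mi       # equal to the request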

upgrading to bigger node-pool in GKE

I have a node pool (default-pool) in a GKE cluster with 3 nodes of machine type n1-standard-1. They host 6 pods with a Redis cluster (3 masters and 3 slaves) and 3 pods with a Node.js example app.
I want to upgrade to a bigger machine type (n1-standard-2), also with 3 nodes.
In the documentation, Google gives an example of upgrading to a different machine type (in a new node pool).
I tested it during development, and my node pool was unreachable for a while when I executed the following command:
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=default-pool -o=name); do
  kubectl cordon "$node";
done
In my terminal, I got a message that my connection with the server was lost (I could not execute kubectl commands). After a few minutes, I could reconnect and I got the desired output as shown in the documentation.
The second time, I tried leaving out the cordon command and I skipped to the following command:
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=default-pool -o=name); do
  kubectl drain --force --ignore-daemonsets --delete-local-data --grace-period=10 "$node";
done
This is because, if I interpret the Kubernetes documentation correctly, nodes are automatically cordoned when using the drain command. But I got the same result as with the cordon command: I lost connection to the cluster for a few minutes, and I could not reach the Node.js example app that was hosted on the same nodes. After a few minutes, it restored itself.
I found a workaround to upgrade to a new node pool with bigger machine types: I edited the deployment/statefulset YAML files and changed the nodeSelector. Node pools in GKE are labelled with:
cloud.google.com/gke-nodepool=NODE_POOL_NAME
so I added the correct nodeSelector to the deployment.yaml file:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-deployment
  labels:
    app: example
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: new-default-pool
      containers:
      - name: example
        image: IMAGE
        ports:
        - containerPort: 3000
This works without downtime, but I'm not sure this is the right way to do in a production environment.
What is wrong with the cordon/drain command, or am I not using them correctly?
Cordoning a node will cause it to be removed from the load balancer's backend list, as will a drain. The correct way to do it is to set up anti-affinity rules on the deployment so the pods are not deployed on the same node (or the same region, for that matter). That will give an even distribution of pods throughout your node pool.
Then you have to disable autoscaling on the old node pool if you have it enabled, slowly drain 1-2 nodes at a time and wait for the evicted pods to appear on the new node pool, making sure at all times to keep at least one pod of the deployment alive so it can handle traffic.
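As a minimal sketch of such an anti-affinity rule, assuming the deployment's pods carry the label app: example (the weight and topologyKey are illustrative choices), this spreads replicas across different nodes:
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: example
              topologyKey: kubernetes.io/hostname   # spread across nodes; use a zone key to spread across zones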

Is it possible for running pods on kubernetes to share the same PVC

I've currently set up a PVC with the name minio-pvc and created a deployment based on the stable/minio chart with the values
mode: standalone
replicas: 1
persistence:
  enabled: true
  existingClaim: minio-pvc
What happens if I increase the number of replicas? Do I run the risk of corrupting data if more than one pod tries to write to the PVC at the same time?
Don't use a Deployment for stateful containers; use a StatefulSet instead.
StatefulSets are specifically designed for running stateful workloads like databases, and they are used to persist the state of the container.
Note that with a StatefulSet each pod binds its own PersistentVolume via a per-pod PVC, so there is no possibility of multiple pod instances writing to the same PV. Hope I answered your question.
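As a minimal sketch of that per-pod storage (the names, image and size are only illustrative), a StatefulSet with volumeClaimTemplates creates one PVC per replica, e.g. data-minio-0, data-minio-1, and so on:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: minio
spec:
  serviceName: minio
  replicas: 3
  selector:
    matchLabels:
      app: minio
  template:
    metadata:
      labels:
        app: minio
    spec:
      containers:
      - name: minio
        image: minio/minio            # illustrative image
        args: ["server", "/data"]
        volumeMounts:
        - name: data
          mountPath: /data
  volumeClaimTemplates:               # one PVC per pod, not a shared claim
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi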
In case you are sticking with Deployments instead of StatefulSets, it won't be feasible for multiple replicas to write to the same PVC, since there is no guarantee that the different replicas are scheduled on the same node; you might end up with a Pending pod waiting to attach the volume and failing. The solution is to choose a specific node and have all your replicas run on that node.
Run the following and assign a label to one of your nodes:
kubectl label nodes <node-name> <label-key>=<label-value>
Say we choose label-key to be labelKey and label-value to be node1. Then you can go ahead and add the following to your YAML file and have the pods scheduled on the same node:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  labels:
    app: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      nodeSelector:
        labelKey: node1
      containers:
      ...

How do I make Kubernetes scale my deployment based on the "ready"/ "not ready" status of my Pods?

I have a deployment with a defined number of replicas. I use a readiness probe to communicate whether my Pod is ready or not ready to handle new connections; my Pods toggle between the ready and not-ready states during their lifetime.
I want Kubernetes to scale the deployment up/ down to ensure that there is always the desired number of pods in a ready state.
Example:
If replicas is 4 and there are 4 Pods in ready state, then Kubernetes should keep the current replica count.
If replicas is 4 and there are 2 ready pods and 2 not ready pods, then Kubernetes should add 2 more pods.
How do I make Kubernetes scale my deployment based on the "ready"/ "not ready" status of my Pods?
I don't think this is possible. If a pod is not ready, Kubernetes will not make it ready, because readiness is something related to your application; even if it created a new pod, there would be no guarantee that the new pod becomes ready either. So you have to resolve the reasons behind the not-ready status yourself. The only thing Kubernetes does is keep not-ready pods from receiving traffic, to avoid request failures.
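For context, the ready/not ready status comes from the pod's readiness probe; a minimal sketch (name, image, path and port are placeholders) looks like the following, and a pod failing the probe is simply removed from the Service's endpoints rather than replaced:
spec:
  containers:
  - name: app                 # placeholder
    image: my-app:latest      # placeholder
    ports:
    - containerPort: 3000
    readinessProbe:
      httpGet:
        path: /healthz        # placeholder health endpoint
        port: 3000
      initialDelaySeconds: 5
      periodSeconds: 10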
Ensuring you always have 4 pods running can be done by specifying the replicas property in your deployment definition:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 4 # here we define a requirement for 4 replicas
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80
Kubernetes will ensure that if any pods crash, replacement pods will be created so that a total of 4 are always available.
You cannot schedule pods onto unhealthy nodes in the cluster. The control plane will only place pods on nodes that are healthy, schedulable, and have enough remaining capacity to fit the requested resources.
Moreover, what you describe is essentially the self-healing behaviour of Kubernetes, which, in basic terms, is taken care of for you.