Kubernetes add Toleration from CLI

I'm using the Oracle Cloud Infrastructure with Kubernetes and Docker. I've got the following pod:
$ kubectl describe pod $podname -n $namespace
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 19m default-scheduler 0/1 nodes are available: 1 node(s) had taint {nvidia.com/gpu: }, that the pod didn't tolerate.
Warning FailedScheduling 18m default-scheduler 0/1 nodes are available: 1 node(s) had taint {nvidia.com/gpu: }, that the pod didn't tolerate.
I want to add a toleration to this pod. Is there a command to do so without creating the pod config YAML file? This pod is created by another system that I don't want to edit; I just want to add the toleration to resolve this issue.
Thanks.
====================
gpu-config.yaml
apiVersion: v1 # What version of the Kubernetes API to use
kind: Pod # What kind of object you want to create
metadata: # Data that helps uniquely identify the object, including a name string, UID and optional namespace
  name: nvidia-gpu-workload
spec: # What state you desire for the object, differs for every type of Kubernetes object.
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: k8s.gcr.io/cuda-vector-add:v0.1
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    effect: "NoSchedule"
# Update command
$ kubectl create -f ./gpu-config.yaml
# All this seems to do is create a pod by the name of nvidia-gpu-workload-v2, and it doesn't add these configurations to the pod that I require.
Just to note that this issue is occurring on a pod called hook-image-awaiter-5tq5 and I don't think I should re-create that pod with a different config as it seems to be configured by part of the system.
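One approach that may work here, sketched under assumptions: the API server treats spec.tolerations as one of the few pod fields that can be changed after creation (additions only), so a JSON patch sent with kubectl patch could append a toleration to the running pod without editing whatever created it. The Python below only builds the patch and prints the command; it does not run anything. The namespace is a placeholder, the NoSchedule effect is assumed from the YAML above (the event message does not show it), and operator Exists is chosen here so that the taint's value does not matter. It also assumes the pod already has a spec.tolerations list, which is usually the case because default tolerations are injected.

```python
import json
import shlex


def toleration_patch(key, effect="NoSchedule"):
    """Build a JSON Patch that appends one toleration to spec.tolerations."""
    toleration = {"key": key, "operator": "Exists", "effect": effect}
    # "-" as the last path segment means "append to the end of the array".
    return [{"op": "add", "path": "/spec/tolerations/-", "value": toleration}]


def kubectl_patch_command(pod, namespace, patch):
    """Render the kubectl invocation as a shell-safe string (nothing is executed)."""
    args = ["kubectl", "patch", "pod", pod, "-n", namespace,
            "--type=json", "-p", json.dumps(patch)]
    return " ".join(shlex.quote(a) for a in args)


patch = toleration_patch("nvidia.com/gpu")
# "my-namespace" is a placeholder; the pod name is the one from the question.
print(kubectl_patch_command("hook-image-awaiter-5tq5", "my-namespace", patch))
```

If the pod is recreated by a controller such as a Job or Deployment, the new pod will not carry the patch, so the lasting fix would still be in the template that generates it.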

Related

Can a Pod tolerate one of a set of taints

Consider a cluster in which each node has a given taint (let's say NodeType) and a Pod can tolerate a set of NodeType. For example, there are nodes tainted NodeType=A, NodeType=B and NodeType=C.
I'd like to be able to specify for example that some Pods tolerate NodeType=A or NodeType=C, but not NodeType=B. Other Pods (in different Deployments) would tolerate different sets. Is there a way to do this?
Yes, it appears it is possible to do so by adding multiple tolerations with the same key on the pod's spec. An example of the same is given in the official docs.
Here is a demo I tried which works to produce the desired result.
The cluster has three nodes:
kubectl get nodes
NAME STATUS AGE VERSION
dummy-0 Ready 3m17s v1.17.14
dummy-1 Ready 26m v1.17.14
dummy-2 Ready 26m v1.17.14
I tainted them as mentioned in the question using the kubectl taint command:
kubectl taint node dummy-0 NodeType=A:NoSchedule
kubectl taint node dummy-1 NodeType=B:NoSchedule
kubectl taint node dummy-2 NodeType=C:NoSchedule
I then created a Deployment with three replicas that tolerates NodeType=A and NodeType=B (but not NodeType=C):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx-nfs
      tolerations:
      - key: "NodeType"
        operator: "Equal"
        value: "A"
        effect: "NoSchedule"
      - key: "NodeType"
        operator: "Equal"
        value: "B"
        effect: "NoSchedule"
From the kubectl get pods command, we can see that the pods of the Deployment were scheduled only on the nodes dummy-0 and dummy-1 and not on dummy-2 which has a different taint:
kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE
nginx-deployment-5fc8f985d8-2pfvm 1/1 Running 0 8s 100.96.2.11 dummy-0
nginx-deployment-5fc8f985d8-hkrcz 1/1 Running 0 8s 100.96.6.10 dummy-1
nginx-deployment-5fc8f985d8-xfxsx 1/1 Running 0 8s 100.96.6.11 dummy-1
Further, it is important to understand that taints and tolerations only keep pods away from nodes: a toleration allows a pod onto a tainted node but does not attract it there.
To make sure that pods are scheduled onto a particular node, use node affinity (or a nodeSelector) instead.
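To make the matching rule concrete, here is a small Python model of how a list of tolerations is checked against node taints. It is a simplified sketch, not the scheduler's code: it treats every taint effect as a hard constraint and ignores everything else the scheduler considers. The function names are made up for this illustration; the node names and taints are the ones from the demo above.

```python
def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint."""
    # An empty or missing effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    # An empty or missing key (only valid with Exists) matches every key.
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def schedulable_nodes(tolerations, node_taints):
    """Nodes whose every taint is tolerated by at least one toleration."""
    return [node for node, taints in node_taints.items()
            if all(any(tolerates(tol, t) for tol in tolerations) for t in taints)]


nodes = {
    "dummy-0": [{"key": "NodeType", "value": "A", "effect": "NoSchedule"}],
    "dummy-1": [{"key": "NodeType", "value": "B", "effect": "NoSchedule"}],
    "dummy-2": [{"key": "NodeType", "value": "C", "effect": "NoSchedule"}],
}
pod_tolerations = [
    {"key": "NodeType", "operator": "Equal", "value": "A", "effect": "NoSchedule"},
    {"key": "NodeType", "operator": "Equal", "value": "B", "effect": "NoSchedule"},
]
print(schedulable_nodes(pod_tolerations, nodes))  # ['dummy-0', 'dummy-1']
```

Swapping the second toleration's value to "C" would give the A-or-C behaviour asked about in the question.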

deploying Portainer on Kubernetes Cluster failed

After deploying Portainer on a Kubernetes cluster (1 master, 2 workers), following https://documentation.portainer.io/v2.0/deploy/ceinstallk8s/, with
helm install --create-namespace -n portainer portainer portainer/portainer --set persistence.storageClass=slow
I got the status:
kubectl get all -n portainer
NAME                             READY   STATUS    RESTARTS   AGE
pod/portainer-6cb48f955f-qmtdq   0/1     Pending   0          2d

NAME                TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)                          AGE
service/portainer   NodePort   10.97.158.200   <none>        9000:30777/TCP,30776:30776/TCP   2d3h

NAME                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/portainer   0/1     1            0           2d

NAME                                   DESIRED   CURRENT   READY   AGE
replicaset.apps/portainer-6cb48f955f   1         1         0       2d
So,
The pod is not READY, with STATUS Pending.
The service is up but has no EXTERNAL-IP.
The deployment is not READY or AVAILABLE.
The ReplicaSet is not READY.
And I can't access the instance on port 30777.
i.e. http://20.199.64.113:30777/
More 'kubectl describe' info:
root@kubemaster:/home/kubemaster# kubectl describe pod portainer -n portainer
Name: portainer-7b94d88f67-plz9d
Namespace: portainer
Priority: 0
Node: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 129m default-scheduler 0/3 nodes are available: 3 pod has unbound immediate Persiste
root@kubemaster:/home/kubemaster# kubectl describe pvc portainer -n portainer
Name: portainer
Namespace: portainer
StorageClass: slow
Status: Pending
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ProvisioningFailed 2m22s (x259 over 9h) persistentvolume-controller Failed to provision volume with S
root@kubemaster:/home/kubemaster# kubectl describe pv portainer -n portainer
Error from server (NotFound): persistentvolumes "portainer" not found
I researched the errors/warnings below:
Warning FailedScheduling 129m default-scheduler 0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims.
Warning ProvisioningFailed 2m22s (x259 over 9h) persistentvolume-controller Failed to provision volume with StorageClass "slow": AzureDisk - failed to get Azure Cloud Provider. GetCloudProvider returned <nil> instead
But I still wasn't able to bring up the Portainer instance.
Is there anything I missed, or any way to debug this?
Thanks in advance.
If you are using a PersistentVolumeClaim, you need a volume provisioner for dynamic volume provisioning. The bigger cloud providers typically have this. Here the StorageClass slow points at the AzureDisk provisioner, but the cluster has no Azure cloud provider configured, so provisioning fails and the claim stays Pending.
If you don't have a volume provisioner in your cluster, you have to create a PersistentVolume resource, and possibly also a StorageClass, and declare how to use your storage system.
Take a look: portainer-on-kubernetes.
So in your case, as you have mentioned, you can install an external volume provisioner, for example the NFS subdir external provisioner.
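For the manual route (no provisioner), a PersistentVolume has to be created by hand. The Python sketch below builds such a manifest for an NFS export and prints it as JSON, which kubectl apply -f accepts as well as YAML. The volume name, server address, export path and size are placeholders invented for this illustration; only the storage class name slow comes from the question. The claim should only bind if the class name, access mode and capacity are compatible with what the chart's PVC requests, so check kubectl describe pvc first.

```python
import json


def nfs_persistent_volume(name, server, path, size, storage_class):
    """Build a PersistentVolume manifest backed by an NFS export."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": name},
        "spec": {
            "capacity": {"storage": size},
            "accessModes": ["ReadWriteOnce"],
            "persistentVolumeReclaimPolicy": "Retain",
            "storageClassName": storage_class,
            "nfs": {"server": server, "path": path},
        },
    }


# Name, server, path and size are placeholders, not values from the question.
pv = nfs_persistent_volume("portainer-pv", "10.0.0.10", "/exports/portainer", "10Gi", "slow")
print(json.dumps(pv, indent=2))
```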

FailedScheduling: 0/3 nodes are available: 3 Insufficient pods

I'm trying to deploy my NodeJS application to EKS and run 3 pods with exactly the same container.
Here's the error message:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
cm-deployment-7c86bb474c-5txqq 0/1 Pending 0 18s
cm-deployment-7c86bb474c-cd7qs 0/1 ImagePullBackOff 0 18s
cm-deployment-7c86bb474c-qxglx 0/1 ImagePullBackOff 0 18s
public-api-server-79b7f46bf9-wgpk6 0/1 ImagePullBackOff 0 2m30s
$ kubectl describe pod cm-deployment-7c86bb474c-5txqq
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 23s (x4 over 2m55s) default-scheduler 0/3 nodes are available: 3 Insufficient pods.
So it says that 0/3 nodes are available. However, if I run kubectl get nodes --watch, I see:
$ kubectl get nodes --watch
NAME STATUS ROLES AGE VERSION
ip-192-168-163-73.ap-northeast-2.compute.internal Ready <none> 6d7h v1.14.6-eks-5047ed
ip-192-168-172-235.ap-northeast-2.compute.internal Ready <none> 6d7h v1.14.6-eks-5047ed
ip-192-168-184-236.ap-northeast-2.compute.internal Ready <none> 6d7h v1.14.6-eks-5047ed
All 3 nodes are running.
Here are my configurations:
aws-auth-cm.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: [MY custom role ARN]
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cm-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: cm-literal
  template:
    metadata:
      name: cm-literal-pod
      labels:
        app: cm-literal
    spec:
      containers:
      - name: cm
        image: docker.io/cjsjyh/public_test:1
        imagePullPolicy: Always
        ports:
        - containerPort: 80
        #imagePullSecrets:
        # - name: regcred
        env:
          [my environment variables]
I applied both .yaml files
How can I solve this?
Thank you
My guess, without running the manifests you've got, is that the image tag 1 on your image doesn't exist, so you're getting ImagePullBackOff, which usually means that the container runtime can't find the image to pull.
Looking at the Docker Hub page, there's no 1 tag there, just latest.
So either removing the tag or replacing 1 with latest may resolve your issue.
I have also seen the Insufficient pods error on small AWS instance types, where the maximum number of pods per node is low.
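The "Insufficient pods" message refers to the per-node pod limit rather than CPU or memory. On EKS with the default AWS VPC CNI, that limit is commonly derived from the instance type's network interfaces. The sketch below applies the usual formula; the ENI and address counts in the table are assumptions to verify against the AWS documentation for your instance type, and the function name is made up for this example.

```python
def eks_max_pods(enis, ipv4_per_eni):
    """Default pod limit per node with the AWS VPC CNI (no prefix delegation)."""
    # Each ENI keeps one address for itself; the extra 2 accounts for
    # pods that run on the host network and need no pod IP.
    return enis * (ipv4_per_eni - 1) + 2


# (ENIs, IPv4 addresses per ENI): assumed values, check them for your instance type.
limits = {"t3.micro": (2, 2), "t3.small": (3, 4), "t3.medium": (3, 6)}
for instance_type, (enis, ips) in limits.items():
    print(instance_type, eks_max_pods(enis, ips))
```

With only a handful of slots per node, the system pods in kube-system can use most of them. kubectl describe node shows the actual limit as pods under Allocatable.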

Pod failed to schedule: OpenStack over Kubernetes installation

I am new to Kubernetes and am trying to deploy OpenStack on a Kubernetes cluster by following the OpenStack docs. Below is the error I see when I try to deploy:
kube-system ingress-error-pages-56b4446784-crl85 0/1 Pending 0 1d
kube-system ingress-error-pages-56b4446784-m7jrw 0/1 Pending 0 5d
I have a Kubernetes cluster with one master and one node running on Debian 9. I encountered this error during the OpenStack installation on Kubernetes.
kubectl describe pod shows the events below:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 2m (x7684 over 1d) default-scheduler 0/2 nodes are available: 1 PodToleratesNodeTaints, 2 MatchNodeSelector.
All I see is a failed scheduling event. Even the container logs for the kube-scheduler show that it failed to schedule a pod but don't say why. I have been stuck at this step for the past few hours trying to debug it.
PS: I am running debian9, kube version: v1.9.2+coreos.0, Docker - 17.03.1-ce
Any help appreciated ....
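The message lists two failed predicates: one node has a taint that your Pod does not tolerate (1 PodToleratesNodeTaints), and neither node carries the labels your Pod asks for in its nodeSelector (2 MatchNodeSelector). It would help to post the definition for your Ingress and its corresponding Deployment or DaemonSet.
If the taint is intentional, you would generally taint your node(s) like this (the effect must be one of NoSchedule, PreferNoSchedule or NoExecute):
kubectl taint nodes <your-node> key=value:NoSchedule
Then add a matching toleration to your PodSpec, something like this:
tolerations:
- key: "key"
  operator: "Equal"
  value: "value"
  effect: "NoSchedule"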
It could also be because of missing labels on your node that your Pod needs in the nodeSelector field:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    cpuType: haswell
Then you'd add that label to your node:
kubectl label nodes kubernetes-foo-node-1 cpuType=haswell
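As a rough model of the MatchNodeSelector check, a node qualifies only when every key/value pair in the Pod's nodeSelector is present in the node's labels. The Python sketch below illustrates that rule with the label from the example above; the function name and the hostname label value are just for illustration.

```python
def matches_node_selector(node_selector, node_labels):
    """True when every key/value in the selector appears in the node's labels."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


selector = {"cpuType": "haswell"}
labels_before = {"kubernetes.io/hostname": "kubernetes-foo-node-1"}
labels_after = dict(labels_before, cpuType="haswell")  # after kubectl label

print(matches_node_selector(selector, labels_before))  # False
print(matches_node_selector(selector, labels_after))   # True
```

Comparing the Pod's nodeSelector with the output of kubectl get nodes --show-labels should reveal which label is missing.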
Hope it helps!

Trying to create a Kubernetes deployment but it shows 0 pods available

I'm new to k8s, so some of my terminology might be off. But basically, I'm trying to deploy a simple web api: one load balancer in front of n pods (where right now, n=1).
However, when I try to visit the load balancer's IP address it doesn't show my web application. When I run kubectl get deployments, I get this:
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
tl-api 1 1 1 0 4m
Here's my YAML file. Let me know if anything looks off--I'm very new to this!
---
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: tl-api
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: tl-api
    spec:
      containers:
      - name: tl-api
        image: tlk8s.azurecr.io/devicecloudwebapi:v1
        ports:
        - containerPort: 80
      imagePullSecrets:
      - name: acr-auth
      nodeSelector:
        beta.kubernetes.io/os: windows
---
apiVersion: v1
kind: Service
metadata:
  name: tl-api
spec:
  type: LoadBalancer
  ports:
  - port: 80
  selector:
    app: tl-api
Edit 2: When I try using ACS (which supports Windows), I get this:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 11m default-scheduler Successfully assigned tl-api-3466491809-vd5kg to dc9ebacs9000
Normal SuccessfulMountVolume 11m kubelet, dc9ebacs9000 MountVolume.SetUp succeeded for volume "default-token-v3wz9"
Normal Pulling 4m (x6 over 10m) kubelet, dc9ebacs9000 pulling image "tlk8s.azurecr.io/devicecloudwebapi:v1"
Warning FailedSync 1s (x50 over 10m) kubelet, dc9ebacs9000 Error syncing pod
Normal BackOff 1s (x44 over 10m) kubelet, dc9ebacs9000 Back-off pulling image "tlk8s.azurecr.io/devicecloudwebapi:v1"
I then try examining the failed pod:
PS C:\users\<me>\source\repos\DeviceCloud\DeviceCloud\1- Presentation\DeviceCloud.Web.API> kubectl logs tl-api-3466491809-vd5kg
Error from server (BadRequest): container "tl-api" in pod "tl-api-3466491809-vd5kg" is waiting to start: trying and failing to pull image
When I run docker images I see the following:
REPOSITORY TAG IMAGE ID CREATED SIZE
devicecloudwebapi latest ee3d9c3e231d 24 hours ago 7.85GB
tlk8s.azurecr.io/devicecloudwebapi v1 ee3d9c3e231d 24 hours ago 7.85GB
devicecloudwebapi dev bb33ab221910 25 hours ago 7.76GB
Your problem is that the container image tlk8s.azurecr.io/devicecloudwebapi:v1 is in a private container registry. See the events at the bottom of the following command:
$ kubectl describe po -l=app=tl-api
The official Kubernetes docs describe how to resolve this issue, see Pull an Image from a Private Registry, essentially:
Create a secret with kubectl create secret docker-registry.
Use it in your Deployment, under the pod template's spec.imagePullSecrets key (spec.template.spec.imagePullSecrets).
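To show what such a pull secret contains, the Python sketch below builds a Secret manifest of roughly the shape that kubectl create secret docker-registry produces: a kubernetes.io/dockerconfigjson Secret whose single data key holds a base64-encoded Docker config. The credentials are placeholders; the secret name acr-auth and the registry host are taken from the manifest in the question. Treat it as an illustration of the format, not a replacement for the kubectl command.

```python
import base64
import json


def docker_registry_secret(name, server, username, password):
    """Build a dockerconfigjson Secret manifest for one registry."""
    auth = base64.b64encode(f"{username}:{password}".encode()).decode()
    config = {"auths": {server: {"username": username,
                                 "password": password,
                                 "auth": auth}}}
    encoded = base64.b64encode(json.dumps(config).encode()).decode()
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "kubernetes.io/dockerconfigjson",
        "data": {".dockerconfigjson": encoded},
    }


# "<user>" and "<password>" are placeholders for the registry credentials.
secret = docker_registry_secret("acr-auth", "tlk8s.azurecr.io", "<user>", "<password>")
print(json.dumps(secret, indent=2))
```

The secret must live in the same namespace as the pods that reference it, and its name must match the entry under imagePullSecrets.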