why does the pod remain in pending state despite having toleration set - kubernetes

I applied the following taint, and label to a node but the pod never reaches a running status and I cannot seem to figure out why
kubectl taint node k8s-worker-2 dedicated=devs:NoSchedule
kubectl label node k8s-worker-2 dedicated=devs
and here is a sample of my pod yaml file:
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: null
labels:
security: s1
name: pod-1
spec:
containers:
- image: nginx
name: bear
resources: {}
tolerations:
- key: "dedicated"
operator: "Equal"
value: "devs"
effect: "NoSchedule"
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: dedicated
operator: In
values:
- devs
dnsPolicy: ClusterFirst
restartPolicy: Always
nodeName: k8s-master-2
status: {}
on creating the pod, it gets scheduled on the k8s-worker-2 node but remains in a pending state before it's finally evicted. Here are sample outputs:
kubectl describe no k8s-worker-2 | grep -i taint
Taints: dedicated=devs:NoSchedule
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-1 0/1 Pending 0 9s <none> k8s-master-2 <none> <none>
# second check
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-1 0/1 Pending 0 59s <none> k8s-master-2 <none> <none>
Name: pod-1
Namespace: default
Priority: 0
Node: k8s-master-2/
Labels: security=s1
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Containers:
bear:
Image: nginx
Port: <none>
Host Port: <none>
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dzvml (ro)
Volumes:
kube-api-access-dzvml:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: dedicated=devs:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
Also, here is output of kubectl describe node
root#k8s-master-1:~/scheduling# kubectl describe nodes k8s-worker-2
Name: k8s-worker-2
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
dedicated=devs
kubernetes.io/arch=amd64
kubernetes.io/hostname=k8s-worker-2
kubernetes.io/os=linux
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
node.alpha.kubernetes.io/ttl: 0
projectcalico.org/IPv4Address: 10.128.0.4/32
projectcalico.org/IPv4IPIPTunnelAddr: 192.168.140.0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Sun, 18 Jul 2021 16:18:41 +0000
Taints: dedicated=devs:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: k8s-worker-2
AcquireTime: <unset>
RenewTime: Sun, 10 Oct 2021 18:54:46 +0000
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Sun, 10 Oct 2021 18:48:50 +0000 Sun, 10 Oct 2021 18:48:50 +0000 CalicoIsUp Calico is running on this node
MemoryPressure False Sun, 10 Oct 2021 18:53:40 +0000 Mon, 04 Oct 2021 07:52:58 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Sun, 10 Oct 2021 18:53:40 +0000 Mon, 04 Oct 2021 07:52:58 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Sun, 10 Oct 2021 18:53:40 +0000 Mon, 04 Oct 2021 07:52:58 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Sun, 10 Oct 2021 18:53:40 +0000 Mon, 04 Oct 2021 07:52:58 +0000 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 10.128.0.4
Hostname: k8s-worker-2
Capacity:
cpu: 2
ephemeral-storage: 20145724Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 8149492Ki
pods: 110
Allocatable:
cpu: 2
ephemeral-storage: 18566299208
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 8047092Ki
pods: 110
System Info:
Machine ID: 3c2709a436fa0c630680bac68ad28669
System UUID: 3c2709a4-36fa-0c63-0680-bac68ad28669
Boot ID: 18a3541f-f3b4-4345-ba45-8cfef9fb1364
Kernel Version: 5.8.0-1038-gcp
OS Image: Ubuntu 20.04.2 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://20.10.7
Kubelet Version: v1.21.3
Kube-Proxy Version: v1.21.3
PodCIDR: 192.168.2.0/24
PodCIDRs: 192.168.2.0/24
Non-terminated Pods: (2 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system calico-node-gp4tk 250m (12%) 0 (0%) 0 (0%) 0 (0%) 84d
kube-system kube-proxy-6xxgx 0 (0%) 0 (0%) 0 (0%) 0 (0%) 81d
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 250m (12%) 0 (0%)
memory 0 (0%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 6m25s kubelet Starting kubelet.
Normal NodeAllocatableEnforced 6m25s kubelet Updated Node Allocatable limit across pods
Normal NodeHasSufficientMemory 6m19s (x7 over 6m25s) kubelet Node k8s-worker-2 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 6m19s (x7 over 6m25s) kubelet Node k8s-worker-2 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 6m19s (x7 over 6m25s) kubelet Node k8s-worker-2 status is now: NodeHasSufficientPID
Warning Rebooted 6m9s kubelet Node k8s-worker-2 has been rebooted, boot id: 18a3541f-f3b4-4345-ba45-8cfef9fb1364
Normal Starting 6m7s kube-proxy Starting kube-proxy.
Included the following to show that the pod never issues events and it terminates later on by itself.
root#k8s-master-1:~/format/scheduling# kubectl get po
No resources found in default namespace.
root#k8s-master-1:~/format/scheduling# kubectl create -f nginx.yaml
pod/pod-1 created
root#k8s-master-1:~/format/scheduling# kubectl get po pod-1
NAME READY STATUS RESTARTS AGE
pod-1 0/1 Pending 0 10s
root#k8s-master-1:~/format/scheduling# kubectl describe po pod-1
Name: pod-1
Namespace: default
Priority: 0
Node: k8s-master-2/
Labels: security=s1
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Containers:
bear:
Image: nginx
Port: <none>
Host Port: <none>
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5hsq4 (ro)
Volumes:
kube-api-access-5hsq4:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: dedicated=devs:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
root#k8s-master-1:~/format/scheduling# kubectl get po pod-1
NAME READY STATUS RESTARTS AGE
pod-1 0/1 Pending 0 45s
root#k8s-master-1:~/format/scheduling# kubectl get po pod-1
NAME READY STATUS RESTARTS AGE
pod-1 0/1 Pending 0 62s
root#k8s-master-1:~/format/scheduling# kubectl get po pod-1
NAME READY STATUS RESTARTS AGE
pod-1 0/1 Pending 0 74s
root#k8s-master-1:~/format/scheduling# kubectl get po pod-1
Error from server (NotFound): pods "pod-1" not found
root#k8s-master-1:~/format/scheduling# kubectl get po
No resources found in default namespace.
root#k8s-master-1:~/format/scheduling#

I was able to figure this one out later. On reproducing the same case on another cluster, the pod got created on the node having the scheduling parameters set. Then it occurred to me that the only change I had to make on the manifest was setting nodeName: node-1 to match the right node on other cluster.
I was literally assigning the pod to a control plane node nodeName: k8s-master-2 and this was causing conflicts.

on creating the pod, it gets scheduled on the k8s-worker-2 node but
remains in a pending state before it's finally evicted.
Hope you node have proper resource left and free, that could be also reason behind pod getting evicted due to resources issue.
https://sysdig.com/blog/kubernetes-pod-evicted/

Related

Kubernetes v1.21.2: 'selfLink was empty, can't make reference'

I get this log error for a pod like below but I updated kubernetes orchestrator, clusters, and nodes to kubernetes v1.21.2. Before updating it, they were v1.20.7. I found a reference that from v1.21, selfLink is completely removed. Why am I getting this error? How can I resolve this issue?
error log for kubectl logs (podname)
...
2021-08-10T03:07:19.535Z INFO setup starting manager
2021-08-10T03:07:19.536Z INFO controller-runtime.manager starting metrics server {"path": "/metrics"}
E0810 03:07:19.550636 1 event.go:247] Could not construct reference to: '&v1.ConfigMap{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"controller-leader-election-helper", GenerateName:"", Namespace:"kubestone-system", SelfLink:"", UID:"b01651ed-7d54-4815-a047-57b16d26cfdf", ResourceVersion:"65956", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63764161639, loc:(*time.Location)(0x21639e0)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string{"control-plane.alpha.kubernetes.io/leader":"{\"holderIdentity\":\"kubestone-controller-manager-f467b7c47-cv7ws_1305bc36-f988-11eb-81fc-a20dfb9758a2\",\"leaseDurationSeconds\":15,\"acquireTime\":\"2021-08-10T03:07:19Z\",\"renewTime\":\"2021-08-10T03:07:19Z\",\"leaderTransitions\":0}"}, OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry{v1.ManagedFieldsEntry{Manager:"manager", Operation:"Update", APIVersion:"v1", Time:(*v1.Time)(0xc0000956a0), Fields:(*v1.Fields)(nil)}}}, Data:map[string]string(nil), BinaryData:map[string][]uint8(nil)}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Normal' 'LeaderElection' 'kubestone-controller-manager-f467b7c47-cv7ws_1305bc36-f988-11eb-81fc-a20dfb9758a2 became leader'
2021-08-10T03:07:21.636Z INFO controller-runtime.controller Starting Controller {"controller": "kafkabench"}
...
kubectl get nodes to show kubernetes version: the node that the pod is scheduled is aks-default-41152893-vmss000000
PS C:\Users\user> kubectl get nodes -A
NAME STATUS ROLES AGE VERSION
aks-default-41152893-vmss000000 Ready agent 5h32m v1.21.2
aks-default-41152893-vmss000001 Ready agent 5h29m v1.21.2
aksnpwi000000 Ready agent 5h32m v1.21.2
aksnpwi000001 Ready agent 5h26m v1.21.2
aksnpwi000002 Ready agent 5h19m v1.21.2
kubectl describe pods (pod name: kubestone-controller-manager-f467b7c47-cv7ws)
PS C:\Users\user> kubectl describe pods kubestone-controller-manager-f467b7c47-cv7ws -n kubestone-system
Name: kubestone-controller-manager-f467b7c47-cv7ws
Namespace: kubestone-system
Priority: 0
Node: aks-default-41152893-vmss000000/10.240.0.4
Start Time: Mon, 09 Aug 2021 23:07:16 -0400
Labels: control-plane=controller-manager
pod-template-hash=f467b7c47
Annotations: <none>
Status: Running
IP: 10.240.0.21
IPs:
IP: 10.240.0.21
Controlled By: ReplicaSet/kubestone-controller-manager-f467b7c47
Containers:
manager:
Container ID: containerd://01594df678a2c1d7163c913eff33881edf02e39633b1a4b51dcf5fb769d0bc1e
Image: user2/imagename
Image ID: docker.io/user2/imagename#sha256:aa049f135931192630ceda014d7a24306442582dbeeaa36ede48e6599b6135e1
Port: <none>
Host Port: <none>
Command:
/manager
Args:
--enable-leader-election
State: Running
Started: Mon, 09 Aug 2021 23:07:18 -0400
Ready: True
Restart Count: 0
Limits:
cpu: 100m
memory: 30Mi
Requests:
cpu: 100m
memory: 20Mi
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jvjjh (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-jvjjh:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 23m default-scheduler Successfully assigned kubestone-system/kubestone-controller-manager-f467b7c47-cv7ws to aks-default-41152893-vmss000000
Normal Pulling 23m kubelet Pulling image "user2/imagename"
Normal Pulled 23m kubelet Successfully pulled image "user2/imagename" in 354.899039ms
Normal Created 23m kubelet Created container manager
Normal Started 23m kubelet Started container manager
Kubestone has had no releases since 2019, it needs to upgrade its copy of the Kubernetes Go client. That said, this appears to only impact the event recorder system so probably not a huge deal.

Argo workflow stuck in pending due to liveness probe fail?

I am trying to set up a Hyperledger Fabric network on Kubernetes by using this.
I am at the step where I am trying to create channels. I run the command argo submit output.yaml -v where output.yaml is the output of the command helm template channel-flow/ -f samples/simple/network.yaml -f samples/simple/crypto-config.yaml but with spec.securityContext added as follows:
...
spec:
securityContext:
runAsNonRoot: true
#runAsUser: 8737 (I commented out this because I don't know my user ID; not sure if this could cause a problem)
entrypoint: channels
...
My argo workflow ends up getting stuck in the pending state. I say this because I check my orderer and peer logs but I see no movement in their logs.
I referenced Argo sample workflows stuck in the pending state and I start with getting the argo logs:
[user#vmmock3 fabric-kube]$ kubectl logs -n argo -l app=workflow-controller
time="2021-05-31T05:02:41.145Z" level=info msg="Get leases 200"
time="2021-05-31T05:02:41.150Z" level=info msg="Update leases 200"
time="2021-05-31T05:02:46.162Z" level=info msg="Get leases 200"
time="2021-05-31T05:02:46.168Z" level=info msg="Update leases 200"
time="2021-05-31T05:02:51.179Z" level=info msg="Get leases 200"
time="2021-05-31T05:02:51.185Z" level=info msg="Update leases 200"
time="2021-05-31T05:02:56.193Z" level=info msg="Get leases 200"
time="2021-05-31T05:02:56.199Z" level=info msg="Update leases 200"
time="2021-05-31T05:03:01.213Z" level=info msg="Get leases 200"
time="2021-05-31T05:03:01.219Z" level=info msg="Update leases 200"
I try to describe the workflow controller pod:
[user#vmmock3 fabric-kube]$ kubectl -n argo describe pod workflow-controller-57fcfb5df8-qvn74
Name: workflow-controller-57fcfb5df8-qvn74
Namespace: argo
Priority: 0
Node: hlf-pool1-8rnem/10.104.0.8
Start Time: Tue, 25 May 2021 13:44:56 +0800
Labels: app=workflow-controller
pod-template-hash=57fcfb5df8
Annotations: <none>
Status: Running
IP: 10.244.0.158
IPs:
IP: 10.244.0.158
Controlled By: ReplicaSet/workflow-controller-57fcfb5df8
Containers:
workflow-controller:
Container ID: containerd://78c7f8dcb0f3a3b861293559ae0a11b92ce6843065e6f9459556a6b7099c8961
Image: argoproj/workflow-controller:v3.0.5
Image ID: docker.io/argoproj/workflow-controller#sha256:740dca63b11168490d9cc7b2d1b08c1364f4a4064e1d9b7a778ca2ab12a63158
Ports: 9090/TCP, 6060/TCP
Host Ports: 0/TCP, 0/TCP
Command:
workflow-controller
Args:
--configmap
workflow-controller-configmap
--executor-image
argoproj/argoexec:v3.0.5
--namespaced
State: Running
Started: Mon, 31 May 2021 13:08:11 +0800
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Mon, 31 May 2021 12:59:05 +0800
Finished: Mon, 31 May 2021 13:03:04 +0800
Ready: True
Restart Count: 1333
Liveness: http-get http://:6060/healthz delay=90s timeout=1s period=60s #success=1 #failure=3
Environment:
LEADER_ELECTION_IDENTITY: workflow-controller-57fcfb5df8-qvn74 (v1:metadata.name)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from argo-token-hflpb (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
argo-token-hflpb:
Type: Secret (a volume populated by a Secret)
SecretName: argo-token-hflpb
Optional: false
QoS Class: BestEffort
Node-Selectors: kubernetes.io/os=linux
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 7m44s (x3994 over 5d23h) kubelet Liveness probe failed: Get "http://10.244.0.158:6060/healthz": dial tcp 10.244.0.158:6060: connect: connection refused
Warning BackOff 3m46s (x16075 over 5d22h) kubelet Back-off restarting failed container
Could this failure be why my argo workflow is stuck in the pending state? How should I go about troubleshooting this?
EDIT: Output of kubectl get pods --all-namespaces (FYI these are being run on Digital Ocean):
[user#vmmock3 fabric-kube]$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
argo argo-server-5695555c55-867bx 1/1 Running 1 6d19h
argo minio-58977b4b48-r2m2h 1/1 Running 0 6d19h
argo postgres-6b5c55f477-7swpp 1/1 Running 0 6d19h
argo workflow-controller-57fcfb5df8-qvn74 0/1 CrashLoopBackOff 1522 6d19h
default hlf-ca--atlantis-58bbd79d9d-x4mz4 1/1 Running 0 21h
default hlf-ca--karga-547dbfddc8-7w6b5 1/1 Running 0 21h
default hlf-ca--nevergreen-7ffb98484c-nlg4j 1/1 Running 0 21h
default hlf-orderer--groeifabriek--orderer0-0 1/1 Running 0 21h
default hlf-peer--atlantis--peer0-0 2/2 Running 0 21h
default hlf-peer--karga--peer0-0 2/2 Running 0 21h
default hlf-peer--nevergreen--peer0-0 2/2 Running 0 21h
kube-system cilium-2kjfz 1/1 Running 3 26d
kube-system cilium-operator-84bdd6f7b6-kp9vb 1/1 Running 1 6d20h
kube-system cilium-operator-84bdd6f7b6-pkkf9 1/1 Running 1 6d20h
kube-system coredns-55ff57f948-jb5jc 1/1 Running 0 6d20h
kube-system coredns-55ff57f948-r2q4g 1/1 Running 0 6d20h
kube-system csi-do-node-4r9gj 2/2 Running 0 26d
kube-system do-node-agent-sbc8b 1/1 Running 0 26d
kube-system kube-proxy-hpsc7 1/1 Running 0 26d
I will answer partially on your question as I'm not promising everything else will work fine, however I know how to fix the issue with argo workflow-controller pod.
Answer
In short words you need to update argo workflows to a new version (at least 3.0.6, ideally 3.0.7 which available) because it looks like a bug in 3.0.5 version.
How I got there
First I installed argo 3.0.5 version (which is not production ready)
Ended up with workflow-controller pod restarts:
kubectl get pods -n argo
NAME READY STATUS RESTARTS AGE
argo-server-645cf8bc47-sbnqv 1/1 Running 0 9m7s
workflow-controller-768565d958-9lftf 1/1 Running 2 9m7s
curl-pod 1/1 Running 0 6m47s
And the same liveness probe failed:
kubectl describe pod workflow-controller-768565d958-9lftf -n argo
Name: workflow-controller-768565d958-9lftf
Namespace: argo
Priority: 0
Node: worker1/10.186.0.3
Start Time: Tue, 01 Jun 2021 14:25:00 +0000
Labels: app=workflow-controller
pod-template-hash=768565d958
Annotations: <none>
Status: Running
IP: 10.244.1.151
IPs:
IP: 10.244.1.151
Controlled By: ReplicaSet/workflow-controller-768565d958
Containers:
workflow-controller:
Container ID: docker://4b797b57ae762f9fc3f7acdd890d25434a8d9f6f165bbb7a7bda35745b5f4092
Image: argoproj/workflow-controller:v3.0.5
Image ID: docker-pullable://argoproj/workflow-controller#sha256:740dca63b11168490d9cc7b2d1b08c1364f4a4064e1d9b7a778ca2ab12a63158
Ports: 9090/TCP, 6060/TCP
Host Ports: 0/TCP, 0/TCP
Command:
workflow-controller
Args:
--configmap
workflow-controller-configmap
--executor-image
argoproj/argoexec:v3.0.5
State: Running
Started: Tue, 01 Jun 2021 14:33:00 +0000
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Tue, 01 Jun 2021 14:29:00 +0000
Finished: Tue, 01 Jun 2021 14:33:00 +0000
Ready: True
Restart Count: 2
Liveness: http-get http://:6060/healthz delay=90s timeout=1s period=60s #success=1 #failure=3
Environment:
LEADER_ELECTION_IDENTITY: workflow-controller-768565d958-9lftf (v1:metadata.name)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ts9zf (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-ts9zf:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: kubernetes.io/os=linux
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 8m57s default-scheduler Successfully assigned argo/workflow-controller-768565d958-9lftf to worker1
Normal Pulled 57s (x3 over 8m56s) kubelet Container image "argoproj/workflow-controller:v3.0.5" already present on machine
Normal Created 57s (x3 over 8m56s) kubelet Created container workflow-controller
Normal Started 57s (x3 over 8m56s) kubelet Started container workflow-controller
Warning Unhealthy 57s (x6 over 6m57s) kubelet Liveness probe failed: Get "http://10.244.1.151:6060/healthz": dial tcp 10.244.1.151:6060: connect: connection refused
Normal Killing 57s (x2 over 4m57s) kubelet Container workflow-controller failed liveness probe, will be restarted
I also tested this endpoint with a pod in the same namespace based on curlimages/curl image - it has built-in curl.
here's a pod.yaml
apiVersion: v1
kind: Pod
metadata:
namespace: argo
labels:
app: curl
name: curl-pod
spec:
containers:
- image: curlimages/curl
name: curl-pod
command: ['sh', '-c', 'while true ; do sleep ; done']
dnsPolicy: ClusterFirst
restartPolicy: Always
kubectl exec -it curl-pod -n argo -- curl http://10.244.1.151:6060/healthz
Which resulted in the same error:
curl: (7) Failed to connect to 10.244.1.151 port 6060: Connection refused
Next step was trying a newer version (3.10rc and then 3.0.7). And it succeeded!
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 27m default-scheduler Successfully assigned argo/workflow-controller-74b4b5455d-skb2f to worker1
Normal Pulling 27m kubelet Pulling image "argoproj/workflow-controller:v3.0.7"
Normal Pulled 27m kubelet Successfully pulled image "argoproj/workflow-controller:v3.0.7" in 15.728042003s
Normal Created 27m kubelet Created container workflow-controller
Normal Started 27m kubelet Started container workflow-controller
And check it with curl:
kubectl exec -it curl-pod -n argo -- curl 10.244.1.169:6060/healthz
ok
The problem is due to the usage of old version of a workflow-controller.If you are following docs it downloads the old version of the work-flow controller which ends up causing a lot of issues .Use below commands instead
or find out the latest one in here
kubectl create namespace argo
kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/v3.3.6/install.yaml

Minikube pod keep a forever pending status and failed to be scheduled

I am very begginer on kubernetes. Sorry if this a is dumb question.
I am using minikube and kvm2(5.0.0). Here is the info about minikube and kubectl version
Minikube status output
host: Running
kubelet: Running
apiserver: Running
kubeconfig: Configured
kubectl cluster-info output:
Kubernetes master is running at https://127.0.0.1:32768
KubeDNS is running at https://127.0.0.1:32768/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
I am trying to deploy a pod using kubectl apply -f client-pod.yaml. Here is my client-pod.yaml configuration
apiVersion: v1
kind: Pod
metadata:
name: client-pod
labels:
component: web
spec:
containers:
- name: client
image: stephengrider/multi-client
ports:
- containerPort: 3000
This is the kubectl get pods output:
NAME READY STATUS RESTARTS AGE
client-pod 0/1 Pending 0 4m15s
kubectl describe pods output:
Name: client-pod
Namespace: default
Priority: 0
Node: <none>
Labels: component=web
Annotations: kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"labels":{"component":"web"},"name":"client-pod","namespace":"default"},"spec...
Status: Pending
IP:
IPs: <none>
Containers:
client:
Image: stephengrider/multi-client
Port: 3000/TCP
Host Port: 0/TCP
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-z45bq (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
default-token-z45bq:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-z45bq
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling <unknown> default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling <unknown> default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
I have been searching for a way to see which taints is stopping to pod to initialize wihout luck.
Is there a way to see the taint that is failing?
kubectl get nodes output:
NAME STATUS ROLES AGE VERSION
m01 Ready master 11h v1.17.3
-- EDIT --
kubectl describe nodes output:
Name: home-pc
Roles: master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=home-pc
kubernetes.io/os=linux
minikube.k8s.io/commit=eb13446e786c9ef70cb0a9f85a633194e62396a1
minikube.k8s.io/name=minikube
minikube.k8s.io/updated_at=2020_03_17T22_51_28_0700
minikube.k8s.io/version=v1.8.2
node-role.kubernetes.io/master=
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Tue, 17 Mar 2020 22:51:25 -0500
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: home-pc
AcquireTime: <unset>
RenewTime: Tue, 17 Mar 2020 22:51:41 -0500
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Tue, 17 Mar 2020 22:51:41 -0500 Tue, 17 Mar 2020 22:51:21 -0500 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Tue, 17 Mar 2020 22:51:41 -0500 Tue, 17 Mar 2020 22:51:21 -0500 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Tue, 17 Mar 2020 22:51:41 -0500 Tue, 17 Mar 2020 22:51:21 -0500 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Tue, 17 Mar 2020 22:51:41 -0500 Tue, 17 Mar 2020 22:51:41 -0500 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 192.168.0.12
Hostname: home-pc
Capacity:
cpu: 12
ephemeral-storage: 227688908Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 8159952Ki
pods: 110
Allocatable:
cpu: 12
ephemeral-storage: 209838097266
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 8057552Ki
pods: 110
System Info:
Machine ID: 339d426453b4492da92f75d06acc1e0d
System UUID: 62eedb55-444f-61ce-75e9-b06ebf3331a0
Boot ID: a9ae9889-d7cb-48c5-ae75-b2052292ac7a
Kernel Version: 5.0.0-38-generic
OS Image: Ubuntu 19.04
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://19.3.5
Kubelet Version: v1.17.3
Kube-Proxy Version: v1.17.3
Non-terminated Pods: (7 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system coredns-6955765f44-mbwqt 100m (0%) 0 (0%) 70Mi (0%) 170Mi (2%) 10s
kube-system coredns-6955765f44-sblf2 100m (0%) 0 (0%) 70Mi (0%) 170Mi (2%) 10s
kube-system etcd-home-pc 0 (0%) 0 (0%) 0 (0%) 0 (0%) 13s
kube-system kube-apiserver-home-pc 250m (2%) 0 (0%) 0 (0%) 0 (0%) 13s
kube-system kube-controller-manager-home-pc 200m (1%) 0 (0%) 0 (0%) 0 (0%) 13s
kube-system kube-proxy-lk7xs 0 (0%) 0 (0%) 0 (0%) 0 (0%) 10s
kube-system kube-scheduler-home-pc 100m (0%) 0 (0%) 0 (0%) 0 (0%) 12s
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 750m (6%) 0 (0%)
memory 140Mi (1%) 340Mi (4%)
ephemeral-storage 0 (0%) 0 (0%)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 24s kubelet, home-pc Starting kubelet.
Normal NodeHasSufficientMemory 23s (x4 over 24s) kubelet, home-pc Node home-pc status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 23s (x3 over 24s) kubelet, home-pc Node home-pc status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 23s (x3 over 24s) kubelet, home-pc Node home-pc status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 23s kubelet, home-pc Updated Node Allocatable limit across pods
Normal Starting 13s kubelet, home-pc Starting kubelet.
Normal NodeHasSufficientMemory 13s kubelet, home-pc Node home-pc status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 13s kubelet, home-pc Node home-pc status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 13s kubelet, home-pc Node home-pc status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 13s kubelet, home-pc Updated Node Allocatable limit across pods
Normal Starting 9s kube-proxy, home-pc Starting kube-proxy.
Normal NodeReady 3s kubelet, home-pc Node home-pc status is now: NodeReady
You have some taints on the node which is stopping the scheduler from deploying the pod.Either remove the taint from master node or add tolerations in the pod spec.

NoExecuteTaintManager falsely deleting Pod?

I am receiving NoExecuteTaintManager events that are deleting my pod but I can't figure out why. The node is healthy and the Pod has the appropriate tolerations.
This is actually causing infinite scale up because my Pod is setup so that it uses 3/4 Node CPUs and has a Toleration Grace Period > 0. This forces a new node when a Pod terminates. Cluster Autoscaler tries to keep the replicas == 2.
How do I figure out which taint is causing it specifically? Any then why it thinks that node had that taint? Currently the pods are being killed at exactly 600 seconds (which I have changed tolerationSeconds to be for node.kubernetes.io/unreachable and node.kubernetes.io/not-ready) however the node does not appear to undergo either of those situations.
NAME READY STATUS RESTARTS AGE
my-api-67df7bd54c-dthbn 1/1 Running 0 8d
my-api-67df7bd54c-mh564 1/1 Running 0 8d
my-pod-6d7b698b5f-28rgw 1/1 Terminating 0 15m
my-pod-6d7b698b5f-2wmmg 1/1 Terminating 0 13m
my-pod-6d7b698b5f-4lmmg 1/1 Running 0 4m32s
my-pod-6d7b698b5f-7m4gh 1/1 Terminating 0 71m
my-pod-6d7b698b5f-8b47r 1/1 Terminating 0 27m
my-pod-6d7b698b5f-bb58b 1/1 Running 0 2m29s
my-pod-6d7b698b5f-dn26n 1/1 Terminating 0 25m
my-pod-6d7b698b5f-jrnkg 1/1 Terminating 0 38m
my-pod-6d7b698b5f-sswps 1/1 Terminating 0 36m
my-pod-6d7b698b5f-vhqnf 1/1 Terminating 0 59m
my-pod-6d7b698b5f-wkrtg 1/1 Terminating 0 50m
my-pod-6d7b698b5f-z6p2c 1/1 Terminating 0 47m
my-pod-6d7b698b5f-zplp6 1/1 Terminating 0 62m
14:22:43.678937 8 taint_manager.go:102] NoExecuteTaintManager is deleting Pod: my-pod-6d7b698b5f-dn26n
14:22:43.679073 8 event.go:221] Event(v1.ObjectReference{Kind:"Pod", Namespace:"prod", Name:"my-pod-6d7b698b5f-dn26n", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'TaintManagerEviction' Marking for deletion Pod prod/my-pod-6d7b698b5f-dn26n
# kubectl -n prod get pod my-pod-6d7b698b5f-8b47r -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
checksum/config: bcdc41c616f736849a6bef9c726eec9bf704ce7d2c61736005a6fedda0ee14d0
kubernetes.io/psp: eks.privileged
creationTimestamp: "2019-10-25T14:09:17Z"
deletionGracePeriodSeconds: 172800
deletionTimestamp: "2019-10-27T14:20:40Z"
generateName: my-pod-6d7b698b5f-
labels:
app.kubernetes.io/instance: my-pod
app.kubernetes.io/name: my-pod
pod-template-hash: 6d7b698b5f
name: my-pod-6d7b698b5f-8b47r
namespace: prod
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: my-pod-6d7b698b5f
uid: c6360643-f6a6-11e9-9459-12ff96456b32
resourceVersion: "2408256"
selfLink: /api/v1/namespaces/prod/pods/my-pod-6d7b698b5f-8b47r
uid: 08197175-f731-11e9-9459-12ff96456b32
spec:
containers:
- args:
- -c
- from time import sleep; sleep(10000)
command:
- python
envFrom:
- secretRef:
name: pix4d
- secretRef:
name: rabbitmq
image: python:3.7-buster
imagePullPolicy: Always
name: my-pod
ports:
- containerPort: 5000
name: http
protocol: TCP
resources:
requests:
cpu: "3"
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-gv6q5
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
nodeName: ip-10-142-54-235.ec2.internal
nodeSelector:
nodepool: zeroscaling-gpu-accelerated-p2-xlarge
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 172800
tolerations:
- key: specialized
operator: Exists
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 600
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 600
volumes:
- name: default-token-gv6q5
secret:
defaultMode: 420
secretName: default-token-gv6q5
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2019-10-25T14:10:40Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2019-10-25T14:11:09Z"
status: "True"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2019-10-25T14:11:09Z"
status: "True"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2019-10-25T14:10:40Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: docker://15e2e658c459a91a86573c1096931fa4ac345e06f26652da2a58dc3e3b3d5aa2
image: python:3.7-buster
imageID: docker-pullable://python#sha256:f0db6711abee8d406121c9e057bc0f7605336e8148006164fea2c43809fe7977
lastState: {}
name: my-pod
ready: true
restartCount: 0
state:
running:
startedAt: "2019-10-25T14:11:09Z"
hostIP: 10.142.54.235
phase: Running
podIP: 10.142.63.233
qosClass: Burstable
startTime: "2019-10-25T14:10:40Z"
# kubectl -n prod describe pod my-pod-6d7b698b5f-8b47r
Name: my-pod-6d7b698b5f-8b47r
Namespace: prod
Priority: 0
PriorityClassName: <none>
Node: ip-10-142-54-235.ec2.internal/10.142.54.235
Start Time: Fri, 25 Oct 2019 10:10:40 -0400
Labels: app.kubernetes.io/instance=my-pod
app.kubernetes.io/name=my-pod
pod-template-hash=6d7b698b5f
Annotations: checksum/config: bcdc41c616f736849a6bef9c726eec9bf704ce7d2c61736005a6fedda0ee14d0
kubernetes.io/psp: eks.privileged
Status: Terminating (lasts 47h)
Termination Grace Period: 172800s
IP: 10.142.63.233
Controlled By: ReplicaSet/my-pod-6d7b698b5f
Containers:
my-pod:
Container ID: docker://15e2e658c459a91a86573c1096931fa4ac345e06f26652da2a58dc3e3b3d5aa2
Image: python:3.7-buster
Image ID: docker-pullable://python#sha256:f0db6711abee8d406121c9e057bc0f7605336e8148006164fea2c43809fe7977
Port: 5000/TCP
Host Port: 0/TCP
Command:
python
Args:
-c
from time import sleep; sleep(10000)
State: Running
Started: Fri, 25 Oct 2019 10:11:09 -0400
Ready: True
Restart Count: 0
Requests:
cpu: 3
Environment Variables from:
pix4d Secret Optional: false
rabbitmq Secret Optional: false
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-gv6q5 (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
default-token-gv6q5:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-gv6q5
Optional: false
QoS Class: Burstable
Node-Selectors: nodepool=zeroscaling-gpu-accelerated-p2-xlarge
Tolerations: node.kubernetes.io/not-ready:NoExecute for 600s
node.kubernetes.io/unreachable:NoExecute for 600s
specialized
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 12m (x2 over 12m) default-scheduler 0/13 nodes are available: 1 Insufficient pods, 13 Insufficient cpu, 6 node(s) didn't match node selector.
Normal TriggeredScaleUp 12m cluster-autoscaler pod triggered scale-up: [{prod-worker-gpu-accelerated-p2-xlarge 7->8 (max: 13)}]
Warning FailedScheduling 11m (x5 over 11m) default-scheduler 0/14 nodes are available: 1 Insufficient pods, 1 node(s) had taints that the pod didn't tolerate, 13 Insufficient cpu, 6 node(s) didn't match node selector.
Normal Scheduled 11m default-scheduler Successfully assigned prod/my-pod-6d7b698b5f-8b47r to ip-10-142-54-235.ec2.internal
Normal Pulling 11m kubelet, ip-10-142-54-235.ec2.internal pulling image "python:3.7-buster"
Normal Pulled 10m kubelet, ip-10-142-54-235.ec2.internal Successfully pulled image "python:3.7-buster"
Normal Created 10m kubelet, ip-10-142-54-235.ec2.internal Created container
Normal Started 10m kubelet, ip-10-142-54-235.ec2.internal Started container
# kubectl -n prod describe node ip-10-142-54-235.ec2.internal
Name: ip-10-142-54-235.ec2.internal
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=p2.xlarge
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=us-east-1
failure-domain.beta.kubernetes.io/zone=us-east-1b
kubernetes.io/hostname=ip-10-142-54-235.ec2.internal
nodepool=zeroscaling-gpu-accelerated-p2-xlarge
Annotations: node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Fri, 25 Oct 2019 10:10:20 -0400
Taints: specialized=true:NoExecute
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Fri, 25 Oct 2019 10:23:11 -0400 Fri, 25 Oct 2019 10:10:19 -0400 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 25 Oct 2019 10:23:11 -0400 Fri, 25 Oct 2019 10:10:19 -0400 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 25 Oct 2019 10:23:11 -0400 Fri, 25 Oct 2019 10:10:19 -0400 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Fri, 25 Oct 2019 10:23:11 -0400 Fri, 25 Oct 2019 10:10:40 -0400 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 10.142.54.235
ExternalIP: 3.86.112.24
Hostname: ip-10-142-54-235.ec2.internal
InternalDNS: ip-10-142-54-235.ec2.internal
ExternalDNS: ec2-3-86-112-24.compute-1.amazonaws.com
Capacity:
attachable-volumes-aws-ebs: 39
cpu: 4
ephemeral-storage: 209702892Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 62872868Ki
pods: 58
Allocatable:
attachable-volumes-aws-ebs: 39
cpu: 4
ephemeral-storage: 200777747706
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 61209892Ki
pods: 58
System Info:
Machine ID: 0e76fec3e06d41a6bf2c49a18fbe1795
System UUID: EC29973A-D616-F673-6899-A96C97D5AE2D
Boot ID: 4bc510b6-f615-48a7-9e1e-47261ddf26a4
Kernel Version: 4.14.146-119.123.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://18.6.1
Kubelet Version: v1.13.11-eks-5876d6
Kube-Proxy Version: v1.13.11-eks-5876d6
ProviderID: aws:///us-east-1b/i-0f5b519aa6e38e04a
Non-terminated Pods: (5 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
amazon-cloudwatch cloudwatch-agent-4d24j 50m (1%) 250m (6%) 50Mi (0%) 250Mi (0%) 12m
amazon-cloudwatch fluentd-cloudwatch-wkslq 50m (1%) 0 (0%) 150Mi (0%) 300Mi (0%) 12m
prod my-pod-6d7b698b5f-8b47r 3 (75%) 0 (0%) 0 (0%) 0 (0%) 14m
kube-system aws-node-6nr6g 10m (0%) 0 (0%) 0 (0%) 0 (0%) 13m
kube-system kube-proxy-wf8k4 100m (2%) 0 (0%) 0 (0%) 0 (0%) 13m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 3210m (80%) 250m (6%)
memory 200Mi (0%) 550Mi (0%)
ephemeral-storage 0 (0%) 0 (0%)
attachable-volumes-aws-ebs 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 13m kubelet, ip-10-142-54-235.ec2.internal Starting kubelet.
Normal NodeHasSufficientMemory 13m (x2 over 13m) kubelet, ip-10-142-54-235.ec2.internal Node ip-10-142-54-235.ec2.internal status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 13m (x2 over 13m) kubelet, ip-10-142-54-235.ec2.internal Node ip-10-142-54-235.ec2.internal status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 13m (x2 over 13m) kubelet, ip-10-142-54-235.ec2.internal Node ip-10-142-54-235.ec2.internal status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 13m kubelet, ip-10-142-54-235.ec2.internal Updated Node Allocatable limit across pods
Normal Starting 12m kube-proxy, ip-10-142-54-235.ec2.internal Starting kube-proxy.
Normal NodeReady 12m kubelet, ip-10-142-54-235.ec2.internal Node ip-10-142-54-235.ec2.internal status is now: NodeReady
# kubectl get node ip-10-142-54-235.ec2.internal -o yaml
apiVersion: v1
kind: Node
metadata:
annotations:
node.alpha.kubernetes.io/ttl: "0"
volumes.kubernetes.io/controller-managed-attach-detach: "true"
creationTimestamp: "2019-10-25T14:10:20Z"
labels:
beta.kubernetes.io/arch: amd64
beta.kubernetes.io/instance-type: p2.xlarge
beta.kubernetes.io/os: linux
failure-domain.beta.kubernetes.io/region: us-east-1
failure-domain.beta.kubernetes.io/zone: us-east-1b
kubernetes.io/hostname: ip-10-142-54-235.ec2.internal
nodepool: zeroscaling-gpu-accelerated-p2-xlarge
name: ip-10-142-54-235.ec2.internal
resourceVersion: "2409195"
selfLink: /api/v1/nodes/ip-10-142-54-235.ec2.internal
uid: 2d934979-f731-11e9-89b8-0234143df588
spec:
providerID: aws:///us-east-1b/i-0f5b519aa6e38e04a
taints:
- effect: NoExecute
key: specialized
value: "true"
status:
addresses:
- address: 10.142.54.235
type: InternalIP
- address: 3.86.112.24
type: ExternalIP
- address: ip-10-142-54-235.ec2.internal
type: Hostname
- address: ip-10-142-54-235.ec2.internal
type: InternalDNS
- address: ec2-3-86-112-24.compute-1.amazonaws.com
type: ExternalDNS
allocatable:
attachable-volumes-aws-ebs: "39"
cpu: "4"
ephemeral-storage: "200777747706"
hugepages-1Gi: "0"
hugepages-2Mi: "0"
memory: 61209892Ki
pods: "58"
capacity:
attachable-volumes-aws-ebs: "39"
cpu: "4"
ephemeral-storage: 209702892Ki
hugepages-1Gi: "0"
hugepages-2Mi: "0"
memory: 62872868Ki
pods: "58"
conditions:
- lastHeartbeatTime: "2019-10-25T14:23:51Z"
lastTransitionTime: "2019-10-25T14:10:19Z"
message: kubelet has sufficient memory available
reason: KubeletHasSufficientMemory
status: "False"
type: MemoryPressure
- lastHeartbeatTime: "2019-10-25T14:23:51Z"
lastTransitionTime: "2019-10-25T14:10:19Z"
message: kubelet has no disk pressure
reason: KubeletHasNoDiskPressure
status: "False"
type: DiskPressure
- lastHeartbeatTime: "2019-10-25T14:23:51Z"
lastTransitionTime: "2019-10-25T14:10:19Z"
message: kubelet has sufficient PID available
reason: KubeletHasSufficientPID
status: "False"
type: PIDPressure
- lastHeartbeatTime: "2019-10-25T14:23:51Z"
lastTransitionTime: "2019-10-25T14:10:40Z"
message: kubelet is posting ready status
reason: KubeletReady
status: "True"
type: Ready
daemonEndpoints:
kubeletEndpoint:
Port: 10250
images:
- names:
- python#sha256:f0db6711abee8d406121c9e057bc0f7605336e8148006164fea2c43809fe7977
- python:3.7-buster
sizeBytes: 917672801
- names:
- 602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni#sha256:5b7e7435f88a86bbbdb2a5ecd61e893dc14dd13c9511dc8ace362d299259700a
- 602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:v1.5.4
sizeBytes: 290739356
- names:
- fluent/fluentd-kubernetes-daemonset#sha256:582770d951f81e0971e852089239ced0186e0bdc3226daf16b99ca4cc22de4f7
- fluent/fluentd-kubernetes-daemonset:v1.3.3-debian-cloudwatch-1.4
sizeBytes: 261867521
- names:
- amazon/cloudwatch-agent#sha256:877106acbc56e747ebe373548c88cd37274f666ca11b5c782211db4c5c7fb64b
- amazon/cloudwatch-agent:latest
sizeBytes: 131360039
- names:
- 602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/kube-proxy#sha256:4767b441ddc424b0ea63c305b79be154f65fb15ebefe8a3b2832ce55aa6de2f0
- 602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/kube-proxy:v1.13.8
sizeBytes: 80183964
- names:
- busybox#sha256:fe301db49df08c384001ed752dff6d52b4305a73a7f608f21528048e8a08b51e
- busybox:latest
sizeBytes: 1219782
- names:
- 602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause-amd64#sha256:bea77c323c47f7b573355516acf927691182d1333333d1f41b7544012fab7adf
- 602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause-amd64:3.1
sizeBytes: 742472
nodeInfo:
architecture: amd64
bootID: 4bc510b6-f615-48a7-9e1e-47261ddf26a4
containerRuntimeVersion: docker://18.6.1
kernelVersion: 4.14.146-119.123.amzn2.x86_64
kubeProxyVersion: v1.13.11-eks-5876d6
kubeletVersion: v1.13.11-eks-5876d6
machineID: 0e76fec3e06d41a6bf2c49a18fbe1795
operatingSystem: linux
osImage: Amazon Linux 2
systemUUID: EC29973A-D616-F673-6899-A96C97D5AE2D
Unfortunately, I don't have an exact answer to your issue, but I may have some workaround.
I think I had the same issue with Amazon EKS cluster, version 1.13.11 - my pod was triggering node scale-up, pod was scheduled, works for 300s and then evicted:
74m Normal TaintManagerEviction pod/master-3bb760a7-b782-4138-b09f-0ca385db9ad7-workspace Marking for deletion Pod project-beta/master-3bb760a7-b782-4138-b09f-0ca385db9ad7-workspace
Interesting, that the same pod was able to run with no problem if it was scheduled on existing node and not a just created one.
From my investigation, it really looks like some issue with this specific Kubernetes version. Maybe some edge case of the TaintBasedEvictions feature(I think it was enabled by default in 1.13 version of Kubernetes).
To "fix" this issue I updated cluster version to 1.14. After that, mysterious pod eviction did not happen anymore.
So, if it's possible to you, I suggest updating your cluster to 1.14 version(together with cluster-autoscaler).

kubernetes cluster master node not ready

i do not know why ,my master node in not ready status,all pods on cluster run normally, and i use cabernets v1.7.5 ,and network plugin use calico,and os version is "centos7.2.1511"
# kubectl get nodes
NAME STATUS AGE VERSION
k8s-node1 Ready 1h v1.7.5
k8s-node2 NotReady 1h v1.7.5
# kubectl get all --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system po/calico-node-11kvm 2/2 Running 0 33m
kube-system po/calico-policy-controller-1906845835-1nqjj 1/1 Running 0 33m
kube-system po/calicoctl 1/1 Running 0 33m
kube-system po/etcd-k8s-node2 1/1 Running 1 15m
kube-system po/kube-apiserver-k8s-node2 1/1 Running 1 15m
kube-system po/kube-controller-manager-k8s-node2 1/1 Running 2 15m
kube-system po/kube-dns-2425271678-2mh46 3/3 Running 0 1h
kube-system po/kube-proxy-qlmbx 1/1 Running 1 1h
kube-system po/kube-proxy-vwh6l 1/1 Running 0 1h
kube-system po/kube-scheduler-k8s-node2 1/1 Running 2 15m
NAMESPACE NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default svc/kubernetes 10.96.0.1 <none> 443/TCP 1h
kube-system svc/kube-dns 10.96.0.10 <none> 53/UDP,53/TCP 1h
NAMESPACE NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
kube-system deploy/calico-policy-controller 1 1 1 1 33m
kube-system deploy/kube-dns 1 1 1 1 1h
NAMESPACE NAME DESIRED CURRENT READY AGE
kube-system rs/calico-policy-controller-1906845835 1 1 1 33m
kube-system rs/kube-dns-2425271678 1 1 1 1h
update
it seems master node can not recognize the calico network plugin, i use kubeadm to install k8s cluster ,due to kubeadm start etcd on 127.0.0.1:2379 on master node,and calico on other nodes can not talk with etcd,so i modify etcd.yaml as following ,and all calico pods run fine, i do not very familiar with calico ,how to fix it ?
apiVersion: v1
kind: Pod
metadata:
annotations:
scheduler.alpha.kubernetes.io/critical-pod: ""
creationTimestamp: null
labels:
component: etcd
tier: control-plane
name: etcd
namespace: kube-system
spec:
containers:
- command:
- etcd
- --listen-client-urls=http://127.0.0.1:2379,http://10.161.233.80:2379
- --advertise-client-urls=http://10.161.233.80:2379
- --data-dir=/var/lib/etcd
image: gcr.io/google_containers/etcd-amd64:3.0.17
livenessProbe:
failureThreshold: 8
httpGet:
host: 127.0.0.1
path: /health
port: 2379
scheme: HTTP
initialDelaySeconds: 15
timeoutSeconds: 15
name: etcd
resources: {}
volumeMounts:
- mountPath: /etc/ssl/certs
name: certs
- mountPath: /var/lib/etcd
name: etcd
- mountPath: /etc/kubernetes
name: k8s
readOnly: true
hostNetwork: true
volumes:
- hostPath:
path: /etc/ssl/certs
name: certs
- hostPath:
path: /var/lib/etcd
name: etcd
- hostPath:
path: /etc/kubernetes
name: k8s
status: {}
[root#k8s-node2 calico]# kubectl describe node k8s-node2
Name: k8s-node2
Role:
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/hostname=k8s-node2
node-role.kubernetes.io/master=
Annotations: node.alpha.kubernetes.io/ttl=0
volumes.kubernetes.io/controller-managed-attach-detach=true
Taints: node-role.kubernetes.io/master:NoSchedule
CreationTimestamp: Tue, 12 Sep 2017 15:20:57 +0800
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk False Wed, 13 Sep 2017 10:25:58 +0800 Tue, 12 Sep 2017 15:20:57 +0800 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Wed, 13 Sep 2017 10:25:58 +0800 Tue, 12 Sep 2017 15:20:57 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Wed, 13 Sep 2017 10:25:58 +0800 Tue, 12 Sep 2017 15:20:57 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
Ready False Wed, 13 Sep 2017 10:25:58 +0800 Tue, 12 Sep 2017 15:20:57 +0800 KubeletNotReady runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Addresses:
InternalIP: 10.161.233.80
Hostname: k8s-node2
Capacity:
cpu: 2
memory: 3618520Ki
pods: 110
Allocatable:
cpu: 2
memory: 3516120Ki
pods: 110
System Info:
Machine ID: 3c6ff97c6fbe4598b53fd04e08937468
System UUID: C6238BF8-8E60-4331-AEEA-6D0BA9106344
Boot ID: 84397607-908f-4ff8-8bdc-ff86c364dd32
Kernel Version: 3.10.0-514.6.2.el7.x86_64
OS Image: CentOS Linux 7 (Core)
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://1.12.6
Kubelet Version: v1.7.5
Kube-Proxy Version: v1.7.5
PodCIDR: 10.68.0.0/24
ExternalID: k8s-node2
Non-terminated Pods: (5 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
kube-system etcd-k8s-node2 0 (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system kube-apiserver-k8s-node2 250m (12%) 0 (0%) 0 (0%) 0 (0%)
kube-system kube-controller-manager-k8s-node2 200m (10%) 0 (0%) 0 (0%) 0 (0%)
kube-system kube-proxy-qlmbx 0 (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system kube-scheduler-k8s-node2 100m (5%) 0 (0%) 0 (0%) 0 (0%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
550m (27%) 0 (0%) 0 (0%) 0 (0%)
Events: <none>
It's good practice to run a describe command in order to see what's wrong with your node:
kubectl describe nodes <NODE_NAME>
e.g.: kubectl describe nodes k8s-node2
You should be able to start your investigations from there and add more info to this question if needed.
You need install a Network Policy Provider, this is one of supported provider:
Weave Net for NetworkPolicy.
command line to install:
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
After a few seconds, a Weave Net pod should be running on each Node and any further pods you create will be automatically attached to the Weave network.
I think you may need to add tolerations and update the annotations for calico-node in the manifest you are using so that it can run on a master created by kubeadm. Kubeadm taints the master so that pods cannot run on it unless they have a toleration for that taint.
I believe you are using the https://docs.projectcalico.org/v2.5/getting-started/kubernetes/installation/hosted/calico.yaml manifest which has the annotations (that include tolerations) for K8s v1.5, you should check https://docs.projectcalico.org/v2.5/getting-started/kubernetes/installation/hosted/kubeadm/1.6/calico.yaml, it has the toleration syntax for K8s v1.6+.
Here is a snippet from the above with annotations and tolerations
metadata:
labels:
k8s-app: calico-node
annotations:
# Mark this pod as a critical add-on; when enabled, the critical add-on scheduler
# reserves resources for critical add-on pods so that they can be rescheduled after
# a failure. This annotation works in tandem with the toleration below.
scheduler.alpha.kubernetes.io/critical-pod: ''
spec:
hostNetwork: true
tolerations:
- key: node-role.kubernetes.io/master
effect: NoSchedule
# Allow this pod to be rescheduled while the node is in "critical add-ons only" mode.
# This, along with the annotation above marks this pod as a critical add-on.
- key: CriticalAddonsOnly
operator: Exists