All pods stay in Pending state with Events: <none> in a K3s ARM64 + AMD64 cluster - kubernetes

I found many questions about this on StackOverflow; most are unanswered or overcomplicated.
I have reduced my issue to a simple "Hello world" test in a brand-new, empty cluster.
I have a K3s cluster: the master is an online bare-metal AMD64 server, and the nodes are local PI400 ARM64 Debian hosts.
I'm trying to deploy this DaemonSet:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hello-world
  labels:
    app: hello-world
spec:
  selector:
    matchLabels:
      app: hello-world
  template:
    metadata:
      labels:
        app: hello-world
    spec:
      containers:
        - name: hello-world
          image: nginxdemos/hello
Then all my pods stay in the Pending state:
kubectl get pods
NAME READY STATUS RESTARTS AGE
hello-world-qsv2d 0/1 Pending 0 7m53s
hello-world-6rn5d 0/1 Pending 0 7m53s
Describing one of the pods gives:
kubectl describe pod hello-world-6rn5d
Name: hello-world-6rn5d
Namespace: default
Priority: 0
Node: <none>
Labels: app=hello-world
controller-revision-hash=649569d94c
pod-template-generation=1
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: DaemonSet/hello-world
Containers:
hello-world:
Image: hello-world
Port: <none>
Host Port: <none>
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8bh8p (ro)
Volumes:
kube-api-access-8bh8p:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events: <none>
I already use these same nodes in a local ARM64-only cluster, and they work fine there.
kubectl get nodes
NAME STATUS ROLES AGE VERSION
pi417 Ready <none> 11h v1.24.4+k3s1
pi400 Ready <none> 11h v1.24.4+k3s1
kubectl version --output=yaml
clientVersion:
buildDate: "2022-06-15T14:22:29Z"
compiler: gc
gitCommit: f66044f4361b9f1f96f0053dd46cb7dce5e990a8
gitTreeState: clean
gitVersion: v1.24.2
goVersion: go1.18.3
major: "1"
minor: "24"
platform: windows/amd64
kustomizeVersion: v4.5.4
serverVersion:
buildDate: "2022-08-25T03:45:26Z"
compiler: gc
gitCommit: c3f830e9b9ed8a4d9d0e2aa663b4591b923a296e
gitTreeState: clean
gitVersion: v1.24.4+k3s1
goVersion: go1.18.1
major: "1"
minor: "24"
platform: linux/amd64
A node description:
kubectl.exe describe node pi400
Name: pi400
Roles: <none>
Labels: adb=true
beta.kubernetes.io/arch=arm64
beta.kubernetes.io/instance-type=k3s
beta.kubernetes.io/os=linux
egress.k3s.io/cluster=true
kubernetes.io/arch=arm64
kubernetes.io/hostname=pi400
kubernetes.io/os=linux
node.kubernetes.io/instance-type=k3s
Annotations: flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"26:a8:bd:f3:1d:fd"}
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: true
flannel.alpha.coreos.com/public-ip: 192.168.3.25
k3s.io/hostname: pi400
k3s.io/internal-ip: 192.168.3.25
k3s.io/node-args: ["agent"]
k3s.io/node-config-hash: CBEQF3QV5PMMQWO2GECMRPJVEIFSCEFARQFZKX4RNV4K5FPB7FGQ====
k3s.io/node-env:
{"K3S_DATA_DIR":"/var/lib/rancher/k3s/data/8...2a","K3S_NODE_NAME":"pi400" ...}
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Mon, 12 Sep 2022 20:44:50 +0300
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: pi400
AcquireTime: <unset>
RenewTime: Tue, 13 Sep 2022 08:53:29 +0300
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Tue, 13 Sep 2022 08:51:08 +0300 Mon, 12 Sep 2022 21:33:41 +0300 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Tue, 13 Sep 2022 08:51:08 +0300 Mon, 12 Sep 2022 21:33:41 +0300 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Tue, 13 Sep 2022 08:51:08 +0300 Mon, 12 Sep 2022 21:33:41 +0300 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Tue, 13 Sep 2022 08:51:08 +0300 Mon, 12 Sep 2022 21:33:41 +0300 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 192.168.3.25
Hostname: pi400
Capacity:
cpu: 4
ephemeral-storage: 30473608Ki
memory: 3885428Ki
pods: 110
Allocatable:
cpu: 4
ephemeral-storage: 29644725840
memory: 3885428Ki
pods: 110
System Info:
Machine ID: d2eb1415b12e45ebac766cc20ce58012
System UUID: d2eb1415b12e45ebac766cc20ce58012
Boot ID: c2531ffa-96b0-4463-9f51-08e0dce6d5c3
Kernel Version: 5.15.61-v8+
OS Image: Debian GNU/Linux 11 (bullseye)
Operating System: linux
Architecture: arm64
Container Runtime Version: containerd://1.6.6-k3s1
Kubelet Version: v1.24.4+k3s1
Kube-Proxy Version: v1.24.4+k3s1
PodCIDR: 10.42.1.0/24
PodCIDRs: 10.42.1.0/24
ProviderID: k3s://pi400
Non-terminated Pods: (0 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 0 (0%) 0 (0%)
memory 0 (0%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
Events: <none>
The issue may be due to:
the mix of AMD64 and ARM64 nodes
a connection issue between the nodes
K3s itself
I do not know how to get more information about the situation, and I do not have any AMD64 node available for further tests.

After more tests:
I had the same issue when all nodes used the same architecture.
The issue was due to an incompatibility with the distro on my master.
I could have spotted it right after the K3s setup on the master.
Once the setup is done on the server, run kubectl get nodes. On a K3s setup the master must be visible as a node; if it is not, do not try to add any more nodes.
So here is my K3s setup:
STEP 1 preconfigure your k3s
mkdir -p /etc/rancher/k3s/
nano /etc/rancher/k3s/config.yaml
Add all options that are not available via environment variables to config.yaml before starting the setup script.
The content may look like:
write-kubeconfig-mode: "0644"
tls-san:
  - "1.2.3.4"
STEP 2 start the master node setup
curl -sfL https://get.k3s.io | sh -
STEP 3 check that the master node is live
kubectl get nodes
If you do not see the master node, start investigating (kernel options, and more).
STEP 4 get your credentials
To access your K3s cluster remotely, take the file /etc/rancher/k3s/k3s.yaml and change the clusters.cluster.server IP from 127.0.0.1 to a valid remote IP. Paste the new config file into ~/.kube/config.
STEP 5 try to connect with kubectl
kubectl get node
STEP 6 add nodes
Start adding your nodes using your token from /var/lib/rancher/k3s/server/node-token
# For an agent node: optionally customise the hostname
# export K3S_NODE_NAME=pi417
export K3S_TOKEN=<token from /var/lib/rancher/k3s/server/node-token>
export K3S_URL=https://<remote-ip>:6443
curl -sfL https://get.k3s.io | sh -
STEP 7 if the setup gets stuck
CTRL+C
sudo systemctl restart kubepods.slice kubepods-besteffort.slice
or reboot
STEP 8 start a hello world on all nodes to check that everything is OK
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hello-world
  labels:
    app: hello-world
spec:
  selector:
    matchLabels:
      app: hello-world
  template:
    metadata:
      labels:
        app: hello-world
    spec:
      containers:
        - name: hello-world
          image: nginxdemos/hello
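The kubeconfig edit described in STEP 4 can also be scripted. Below is a minimal Python sketch, assuming the file contains the loopback entry "server: https://127.0.0.1:6443" (port 6443 is the one used in K3S_URL above). The function name point_kubeconfig_at and the abbreviated sample text are hypothetical and only for illustration; 1.2.3.4 is the same placeholder IP as in the tls-san example.

```python
def point_kubeconfig_at(kubeconfig_text: str, remote_host: str) -> str:
    """Return the kubeconfig text with the loopback API server address replaced."""
    local = "server: https://127.0.0.1:6443"
    remote = f"server: https://{remote_host}:6443"
    if local not in kubeconfig_text:
        # Fail loudly rather than silently writing an unchanged file.
        raise ValueError("expected loopback server entry not found")
    return kubeconfig_text.replace(local, remote)


# Abbreviated, hypothetical stand-in for /etc/rancher/k3s/k3s.yaml.
sample = """\
apiVersion: v1
clusters:
- cluster:
    server: https://127.0.0.1:6443
  name: default
"""

updated = point_kubeconfig_at(sample, "1.2.3.4")
print(updated)
```

The result would then be saved as ~/.kube/config on the workstation. The remote IP should be one of the addresses listed under tls-san, otherwise the TLS check is likely to fail.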

Related

Minikube Service URL not working | Windows 11 [duplicate]

I'm new to Kubernetes. I successfully created a deployment with 2 replicas of my Angular frontend application, but when I expose it with a service and try to access the service with 'minikube service service-name', the browser can't show me the application.
This is my Dockerfile:
FROM registry.gitlab.informatica.aci.it/ccsc/images/nodejs/10_15
LABEL maintainer="d.vaccaro#informatica.aci.it" name="assistenza-fo" version="v1.0.0" license=""
WORKDIR /usr/src/app
ARG PRODUCTION_MODE="false"
ENV NODE_ENV='development'
ENV HTTP_PORT=4200
COPY package*.json ./
RUN if [ "${PRODUCTION_MODE}" = "true" ] || [ "${PRODUCTION_MODE}" = "1" ]; then \
echo "Build di produzione"; \
npm ci --production ; \
else \
echo "Build di sviluppo"; \
npm ci ; \
fi
RUN npm audit fix
RUN npm install -g @angular/cli
COPY dockerize /usr/local/bin
RUN chmod +x /usr/local/bin/dockerize
COPY . .
EXPOSE 4200
CMD ng serve --host 0.0.0.0
Pod description:
Name: assistenza-fo-674f85c547-bzf8g
Namespace: default
Priority: 0
Node: minikube/172.17.0.2
Start Time: Sun, 19 Apr 2020 12:41:06 +0200
Labels: pod-template-hash=674f85c547
run=assistenza-fo
Annotations: <none>
Status: Running
IP: 172.18.0.6
Controlled By: ReplicaSet/assistenza-fo-674f85c547
Containers:
assistenza-fo:
Container ID: docker://ef2bfb66d22dea56b2dc0e49e875376bf1edff369274015445806451582703a0
Image: registry.gitlab.informatica.aci.it/apra/sta-r/assistenza/assistenza-fo:latest
Image ID: docker-pullable://registry.gitlab.informatica.aci.it/apra/sta-r/assistenza/assistenza-fo@sha256:8d02a3e69d6798c1ac88815ef785e05aba6e394eb21f806bbc25fb761cca5a98
Port: 4200/TCP
Host Port: 0/TCP
State: Running
Started: Sun, 19 Apr 2020 12:41:08 +0200
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-zdrwg (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
default-token-zdrwg:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-zdrwg
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events: <none>
My deployment description:
Name: assistenza-fo
Namespace: default
CreationTimestamp: Sun, 19 Apr 2020 12:41:06 +0200
Labels: run=assistenza-fo
Annotations: deployment.kubernetes.io/revision: 1
Selector: run=assistenza-fo
Replicas: 2 desired | 2 updated | 2 total | 2 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: run=assistenza-fo
Containers:
assistenza-fo:
Image: registry.gitlab.informatica.aci.it/apra/sta-r/assistenza/assistenza-fo:latest
Port: 4200/TCP
Host Port: 0/TCP
Environment: <none>
Mounts: <none>
Volumes: <none>
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
OldReplicaSets: <none>
NewReplicaSet: assistenza-fo-674f85c547 (2/2 replicas created)
Events: <none>
And my service description:
Name: assistenza-fo
Namespace: default
Labels: run=assistenza-fo
Annotations: <none>
Selector: run=assistenza-fo
Type: LoadBalancer
IP: 10.97.3.206
Port: <unset> 4200/TCP
TargetPort: 4200/TCP
NodePort: <unset> 30375/TCP
Endpoints: 172.18.0.6:4200,172.18.0.7:4200
Session Affinity: None
External Traffic Policy: Cluster
Events: <none>
When I run the command
minikube service assistenza-fo
I get the following output:
|-----------|---------------|-------------|-------------------------|
| NAMESPACE | NAME | TARGET PORT | URL |
|-----------|---------------|-------------|-------------------------|
| default | assistenza-fo | 4200 | http://172.17.0.2:30375 |
|-----------|---------------|-------------|-------------------------|
* Opening service default/assistenza-fo in default browser...
but Chrome prints "unable to reach the site" after a timeout.
Thank you
EDIT
I created the service again, this time as a NodePort service. It is still not working. This is the service description:
Name: assistenza-fo
Namespace: default
Labels: run=assistenza-fo
Annotations: <none>
Selector: run=assistenza-fo
Type: NodePort
IP: 10.107.46.43
Port: <unset> 4200/TCP
TargetPort: 4200/TCP
NodePort: <unset> 30649/TCP
Endpoints: 172.18.0.7:4200,172.18.0.8:4200
Session Affinity: None
External Traffic Policy: Cluster
Events: <none>
I was able to reproduce your issue.
It is actually a bug in the latest version of Minikube for Windows when running with the Docker driver: --driver=docker
You can see it here: Issue - minikube service not working with Docker driver on Windows 10 Pro #7644
It was patched with the merge: Pull - docker driver: Add Service & Tunnel features to windows
The fix is now available in Minikube v1.10.0-beta.0
To make it work, download the beta version from the website:
https://github.com/kubernetes/minikube/releases/download/v1.10.0-beta.0/minikube-windows-amd64.exe
Move it to your working folder and rename it to minikube.exe
C:\Kubernetes>rename minikube-windows-amd64.exe minikube.exe
C:\Kubernetes>dir
22/04/2020 21:10 <DIR> .
22/04/2020 21:10 <DIR> ..
22/04/2020 21:04 55.480.832 minikube.exe
22/04/2020 20:05 489 nginx.yaml
2 File(s) 55.481.321 bytes
If you haven't yet, stop and uninstall the older version, then start Minikube with the new binary:
C:\Kubernetes>minikube.exe start --driver=docker
* minikube v1.10.0-beta.0 on Microsoft Windows 10 Pro 10.0.18363 Build 18363
* Using the docker driver based on existing profile
* Starting control plane node minikube in cluster minikube
* Pulling base image ...
* Restarting existing docker container for "minikube" ...
* Preparing Kubernetes v1.18.0 on Docker 19.03.2 ...
- kubeadm.pod-network-cidr=10.244.0.0/16
* Enabled addons: dashboard, default-storageclass, storage-provisioner
* Done! kubectl is now configured to use "minikube"
C:\Kubernetes>kubectl get all
NAME READY STATUS RESTARTS AGE
pod/nginx-76df748b9-t6q59 1/1 Running 1 78m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 85m
service/nginx-svc NodePort 10.100.212.15 <none> 80:31027/TCP 78m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/nginx 1/1 1 1 78m
NAME DESIRED CURRENT READY AGE
replicaset.apps/nginx-76df748b9 1 1 1 78m
Minikube is now running version v1.10.0-beta.0, so you can run the service as intended (note that the command prompt will be unavailable, because it will be tunneling the connection).
The browser will open automatically and your service will be available.
If you have any doubts, let me know in the comments.

Getting CrashLoopBackOff error when deploying a pod

I am new to Kubernetes and am trying to deploy a pod from a private registry. Whenever I deploy this YAML, the pod goes into a crash loop. I added a sleep with a large value, thinking that might fix it, but it still does not work.
apiVersion: v1
kind: Pod
metadata:
  name: privetae-image-testing
spec:
  containers:
    - name: private-image-test
      image: buildforjenkin.azurecr.io/nginx:latest
      imagePullPolicy: IfNotPresent
      command: ['echo','success','sleep 1000000']
Here are the logs:
Name: privetae-image-testing
Namespace: default
Priority: 0
Node: docker-desktop/192.168.65.4
Start Time: Sun, 24 Oct 2021 15:52:25 +0530
Labels: <none>
Annotations: <none>
Status: Running
IP: 10.1.1.49
IPs:
IP: 10.1.1.49
Containers:
private-image-test:
Container ID: docker://46520936762f17b70d1ec92a121269e90aef2549390a14184e6c838e1e6bafec
Image: buildforjenkin.azurecr.io/nginx:latest
Image ID: docker-pullable://buildforjenkin.azurecr.io/nginx@sha256:7250923ba3543110040462388756ef099331822c6172a050b12c7a38361ea46f
Port: <none>
Host Port: <none>
Command:
echo
success
sleep 1000000
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Sun, 24 Oct 2021 15:52:42 +0530
Finished: Sun, 24 Oct 2021 15:52:42 +0530
Ready: False
Restart Count: 2
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ld6zz (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-ld6zz:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 34s default-scheduler Successfully assigned default/privetae-image-testing to docker-desktop
Normal Pulled 17s (x3 over 33s) kubelet Container image "buildforjenkin.azurecr.io/nginx:latest" already present on machine
Normal Created 17s (x3 over 33s) kubelet Created container private-image-test
Normal Started 17s (x3 over 33s) kubelet Started container private-image-test
Warning BackOff 2s (x5 over 31s) kubelet Back-off restarting failed container
I am running the cluster on Docker Desktop on Windows. TIA
Notice that you are using a standard nginx image? Try deleting your pod and re-applying it with:
apiVersion: v1
kind: Pod
metadata:
  name: private-image-testing
  labels:
    run: my-nginx
spec:
  restartPolicy: Always
  containers:
    - name: private-image-test
      image: buildforjenkin.azurecr.io/nginx:latest
      imagePullPolicy: IfNotPresent
      ports:
        - containerPort: 80
          name: http
If your pod runs, you should be able to open a shell in it with kubectl exec -it private-image-testing -- sh; then wget -O- localhost should print a welcome message. If it still fails, paste the output of kubectl logs -f -l run=my-nginx into your question.
Check my previous answer to understand step by step what happens after you launch the container.
You are launching an nginx:latest container whose main process is meant to run forever, which is what keeps the container alive. Then you override it with a command that (quoting David) will "print the words success and sleep 1000000, and having printed those words, then exit".
Instead of letting the container run continuously and serve requests, you are shooting yourself in the foot: the command replaces the image's default process with echo, and sleep 1000000 is only a string that echo prints, so nothing keeps the container running.
And sure enough, your command is executed and the container exits. Check the output: it exited correctly with status 0, it has already done so 2 times, and it will keep doing so.
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Sun, 24 Oct 2021 15:52:42 +0530
Finished: Sun, 24 Oct 2021 15:52:42 +0530
Think carefully about whether you really need command: ['echo','success','sleep 1000000']
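To see why the container exits right away, here is a minimal Python sketch that runs the same argument vector locally. It assumes a POSIX system with echo and sh on the PATH. The sh -c variant at the end is only an illustration of how two commands could be chained; it is not taken from the answers above, and it uses a one-second sleep so the sketch finishes quickly.

```python
import subprocess

# The same argv as the Pod's `command:` field. The first element is the
# program; every other element is passed to it as a plain argument.
argv = ["echo", "success", "sleep 1000000"]
result = subprocess.run(argv, capture_output=True, text=True)

# echo prints its arguments and exits at once; "sleep 1000000" is just text.
print(repr(result.stdout))
print(result.returncode)

# Illustration only: to really run two commands, a shell has to interpret them.
shell_argv = ["sh", "-c", "echo success && sleep 1"]
shell_result = subprocess.run(shell_argv, capture_output=True, text=True)
print(repr(shell_result.stdout))
```

In the first call nothing ever sleeps, which matches the "Completed" state with exit code 0 in the pod description.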

Why does the pod remain in Pending state despite having a toleration set

I applied the following taint and label to a node, but the pod never reaches the Running status and I cannot figure out why:
kubectl taint node k8s-worker-2 dedicated=devs:NoSchedule
kubectl label node k8s-worker-2 dedicated=devs
And here is my pod YAML file:
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    security: s1
  name: pod-1
spec:
  containers:
  - image: nginx
    name: bear
    resources: {}
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "devs"
    effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: dedicated
            operator: In
            values:
            - devs
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  nodeName: k8s-master-2
status: {}
on creating the pod, it gets scheduled on the k8s-worker-2 node but remains in a pending state before it's finally evicted. Here are sample outputs:
kubectl describe no k8s-worker-2 | grep -i taint
Taints: dedicated=devs:NoSchedule
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-1 0/1 Pending 0 9s <none> k8s-master-2 <none> <none>
# second check
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-1 0/1 Pending 0 59s <none> k8s-master-2 <none> <none>
Name: pod-1
Namespace: default
Priority: 0
Node: k8s-master-2/
Labels: security=s1
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Containers:
bear:
Image: nginx
Port: <none>
Host Port: <none>
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dzvml (ro)
Volumes:
kube-api-access-dzvml:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: dedicated=devs:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
Also, here is the output of kubectl describe node:
root@k8s-master-1:~/scheduling# kubectl describe nodes k8s-worker-2
Name: k8s-worker-2
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
dedicated=devs
kubernetes.io/arch=amd64
kubernetes.io/hostname=k8s-worker-2
kubernetes.io/os=linux
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
node.alpha.kubernetes.io/ttl: 0
projectcalico.org/IPv4Address: 10.128.0.4/32
projectcalico.org/IPv4IPIPTunnelAddr: 192.168.140.0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Sun, 18 Jul 2021 16:18:41 +0000
Taints: dedicated=devs:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: k8s-worker-2
AcquireTime: <unset>
RenewTime: Sun, 10 Oct 2021 18:54:46 +0000
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Sun, 10 Oct 2021 18:48:50 +0000 Sun, 10 Oct 2021 18:48:50 +0000 CalicoIsUp Calico is running on this node
MemoryPressure False Sun, 10 Oct 2021 18:53:40 +0000 Mon, 04 Oct 2021 07:52:58 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Sun, 10 Oct 2021 18:53:40 +0000 Mon, 04 Oct 2021 07:52:58 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Sun, 10 Oct 2021 18:53:40 +0000 Mon, 04 Oct 2021 07:52:58 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Sun, 10 Oct 2021 18:53:40 +0000 Mon, 04 Oct 2021 07:52:58 +0000 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 10.128.0.4
Hostname: k8s-worker-2
Capacity:
cpu: 2
ephemeral-storage: 20145724Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 8149492Ki
pods: 110
Allocatable:
cpu: 2
ephemeral-storage: 18566299208
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 8047092Ki
pods: 110
System Info:
Machine ID: 3c2709a436fa0c630680bac68ad28669
System UUID: 3c2709a4-36fa-0c63-0680-bac68ad28669
Boot ID: 18a3541f-f3b4-4345-ba45-8cfef9fb1364
Kernel Version: 5.8.0-1038-gcp
OS Image: Ubuntu 20.04.2 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://20.10.7
Kubelet Version: v1.21.3
Kube-Proxy Version: v1.21.3
PodCIDR: 192.168.2.0/24
PodCIDRs: 192.168.2.0/24
Non-terminated Pods: (2 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system calico-node-gp4tk 250m (12%) 0 (0%) 0 (0%) 0 (0%) 84d
kube-system kube-proxy-6xxgx 0 (0%) 0 (0%) 0 (0%) 0 (0%) 81d
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 250m (12%) 0 (0%)
memory 0 (0%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 6m25s kubelet Starting kubelet.
Normal NodeAllocatableEnforced 6m25s kubelet Updated Node Allocatable limit across pods
Normal NodeHasSufficientMemory 6m19s (x7 over 6m25s) kubelet Node k8s-worker-2 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 6m19s (x7 over 6m25s) kubelet Node k8s-worker-2 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 6m19s (x7 over 6m25s) kubelet Node k8s-worker-2 status is now: NodeHasSufficientPID
Warning Rebooted 6m9s kubelet Node k8s-worker-2 has been rebooted, boot id: 18a3541f-f3b4-4345-ba45-8cfef9fb1364
Normal Starting 6m7s kube-proxy Starting kube-proxy.
I have included the following to show that the pod never emits events and is later terminated by itself.
root@k8s-master-1:~/format/scheduling# kubectl get po
No resources found in default namespace.
root@k8s-master-1:~/format/scheduling# kubectl create -f nginx.yaml
pod/pod-1 created
root@k8s-master-1:~/format/scheduling# kubectl get po pod-1
NAME READY STATUS RESTARTS AGE
pod-1 0/1 Pending 0 10s
root@k8s-master-1:~/format/scheduling# kubectl describe po pod-1
Name: pod-1
Namespace: default
Priority: 0
Node: k8s-master-2/
Labels: security=s1
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Containers:
bear:
Image: nginx
Port: <none>
Host Port: <none>
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5hsq4 (ro)
Volumes:
kube-api-access-5hsq4:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: dedicated=devs:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
root@k8s-master-1:~/format/scheduling# kubectl get po pod-1
NAME READY STATUS RESTARTS AGE
pod-1 0/1 Pending 0 45s
root@k8s-master-1:~/format/scheduling# kubectl get po pod-1
NAME READY STATUS RESTARTS AGE
pod-1 0/1 Pending 0 62s
root@k8s-master-1:~/format/scheduling# kubectl get po pod-1
NAME READY STATUS RESTARTS AGE
pod-1 0/1 Pending 0 74s
root@k8s-master-1:~/format/scheduling# kubectl get po pod-1
Error from server (NotFound): pods "pod-1" not found
root@k8s-master-1:~/format/scheduling# kubectl get po
No resources found in default namespace.
root@k8s-master-1:~/format/scheduling#
I was able to figure this one out later. When I reproduced the same case on another cluster, the pod was created on the node that had the scheduling parameters set. Then it occurred to me that the only change I had made to the manifest was setting nodeName: node-1 to match the right node on the other cluster.
I was literally assigning the pod to a control plane node with nodeName: k8s-master-2, and this was causing conflicts.
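A side note on why the toleration and nodeAffinity in the manifest made no difference here: a pod with spec.nodeName already set is not handled by the scheduler, so those fields are never evaluated for placement. The toleration itself was fine. The sketch below is a simplified Python model of how a toleration is compared with a taint (it is not the Kubernetes source code). The Taint and Toleration classes and the tolerates function are hypothetical names, and the control-plane taint used for k8s-master-2 is an assumption, since the question does not show that node's taints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"
    value: str = ""
    effect: str = ""


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Simplified model of toleration/taint matching."""
    # An empty effect on the toleration matches every effect.
    if tol.effect and tol.effect != taint.effect:
        return False
    # A non-empty key must match the taint's key exactly.
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    if tol.operator == "Equal":
        return tol.value == taint.value
    return False


# Toleration from the pod manifest in the question.
pod_toleration = Toleration(key="dedicated", operator="Equal",
                            value="devs", effect="NoSchedule")
# Taint applied to k8s-worker-2 in the question.
worker_taint = Taint(key="dedicated", value="devs", effect="NoSchedule")
# Assumed taint on the control plane node (not shown in the question).
master_taint = Taint(key="node-role.kubernetes.io/master", value="",
                     effect="NoSchedule")

print(tolerates(pod_toleration, worker_taint))  # True
print(tolerates(pod_toleration, master_taint))  # False
```

Removing nodeName from the manifest lets the scheduler apply the toleration and the nodeAffinity rule, which is the behaviour the question expected.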
on creating the pod, it gets scheduled on the k8s-worker-2 node but
remains in a pending state before it's finally evicted.
I hope your node has enough free resources; a lack of resources could also be the reason the pod is getting evicted.
https://sysdig.com/blog/kubernetes-pod-evicted/

Kubernetes fix gke-metrics-agent stuck in terminating state on GKE

GKE had an outage about 2 days ago in their London datacentre (https://status.cloud.google.com/incident/compute/20013), since which time one of my nodes has been acting up. I've had to manually terminate a number of pods running on it and I'm having issues with a couple of sites, I assume due to their liveness checks failing temporarily which might have something to do with the below error in gke-metrics-agent?
Looking at the system pods I can see one instance of gke-metrics-agent is stuck in a terminating state and has been since last night:
kubectl get pods -n kube-system
reports:
...
gke-metrics-agent-k47g8 0/1 Terminating 0 32d
gke-metrics-agent-knr9h 1/1 Running 0 31h
gke-metrics-agent-vqkpw 1/1 Running 0 32d
...
I've looked at the describe output for the pod but can't see anything that helps me understand what it needs done:
kubectl describe pod gke-metrics-agent-k47g8 -n kube-system
Name: gke-metrics-agent-k47g8
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Node: <node-name>/<IP>
Start Time: Mon, 09 Nov 2020 03:41:14 +0000
Labels: component=gke-metrics-agent
controller-revision-hash=f8c5b8bfb
k8s-app=gke-metrics-agent
pod-template-generation=4
Annotations: components.gke.io/component-name: gke-metrics-agent
components.gke.io/component-version: 0.27.1
configHash: <config-hash>
Status: Terminating (lasts 15h)
Termination Grace Period: 30s
IP: <IP>
IPs:
IP: <IP>
Controlled By: DaemonSet/gke-metrics-agent
Containers:
gke-metrics-agent:
Container ID: docker://<id>
Image: gcr.io/gke-release/gke-metrics-agent:0.1.3-gke.0
Image ID: docker-pullable://gcr.io/gke-release/gke-metrics-agent@sha256:<hash>
Port: <none>
Host Port: <none>
Command:
/otelsvc
--config=/conf/gke-metrics-agent-config.yaml
--metrics-level=NONE
State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 09 Nov 2020 03:41:17 +0000
Finished: Thu, 10 Dec 2020 21:16:50 +0000
Ready: False
Restart Count: 0
Limits:
memory: 50Mi
Requests:
cpu: 3m
memory: 50Mi
Environment:
NODE_NAME: (v1:spec.nodeName)
POD_NAME: gke-metrics-agent-k47g8 (v1:metadata.name)
POD_NAMESPACE: kube-system (v1:metadata.namespace)
KUBELET_HOST: 127.0.0.1
ARG1: ${1}
ARG2: ${2}
Mounts:
/conf from gke-metrics-agent-config-vol (rw)
/etc/ssl/certs from ssl-certs (ro)
/var/run/secrets/kubernetes.io/serviceaccount from gke-metrics-agent-token-cn6ss (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
gke-metrics-agent-config-vol:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: gke-metrics-agent-conf
Optional: false
ssl-certs:
Type: HostPath (bare host directory volume)
Path: /etc/ssl/certs
HostPathType:
gke-metrics-agent-token-cn6ss:
Type: Secret (a volume populated by a Secret)
SecretName: gke-metrics-agent-token-cn6ss
Optional: false
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: :NoExecute
:NoSchedule
components.gke.io/gke-managed-components
node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/network-unavailable:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/pid-pressure:NoSchedule
node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unschedulable:NoSchedule
Events: <none>
I'm not used to working on the system pods. In the past, my troubleshooting has often fallen back on force deleting pods when all else fails:
kubectl delete pod <pod-name> -n <ns> --grace-period=0 --force
My concern is I don't fully understand what this might do for a system pod and was hoping someone with expertise could advise on a sensible way forward?
I'm also looking at draining this node so Kubernetes can rebuild a new one. Would this potentially be the easiest way to go?
Following up on this: I found that the node experiencing the issues with gke-metrics-agent became even less stable as the day went on.
I therefore had to drain it. The workloads it was running are now on new nodes, which are working as expected, and all system pods are running as expected (including gke-metrics-agent).
Prior to draining this node, I ensured Pod Disruption Budgets were in place, as a number of services run on 1 or 2 instances:
https://kubernetes.io/docs/tasks/run-application/configure-pdb/
This meant I could run:
kubectl drain <node-name>
The deployments then ensured they had enough live pods before the bad node was taken offline, which seems to have avoided any downtime.
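As a rough illustration of how a Pod Disruption Budget limits what a drain may evict, here is a small Python sketch of the arithmetic for a budget expressed as minAvailable. The function name allowed_disruptions is hypothetical, and the minAvailable value of 1 is an assumption; the budgets actually used are not stated above.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Voluntary evictions that a minAvailable budget permits right now."""
    return max(0, healthy_pods - min_available)


# With minAvailable: 1 (assumed), a service on a single instance cannot be
# evicted until a second healthy pod exists elsewhere.
for healthy in (1, 2, 3):
    print(healthy, allowed_disruptions(healthy, min_available=1))
```

When the result is 0, the eviction request is refused and kubectl drain keeps retrying, so a single-instance service with such a budget needs a replacement pod running on another node before the drain can finish.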

kubernetes cluster master node not ready

I do not know why my master node is in NotReady status. All pods on the cluster run normally. I use Kubernetes v1.7.5, the network plugin is Calico, and the OS version is "centos7.2.1511".
# kubectl get nodes
NAME STATUS AGE VERSION
k8s-node1 Ready 1h v1.7.5
k8s-node2 NotReady 1h v1.7.5
# kubectl get all --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system po/calico-node-11kvm 2/2 Running 0 33m
kube-system po/calico-policy-controller-1906845835-1nqjj 1/1 Running 0 33m
kube-system po/calicoctl 1/1 Running 0 33m
kube-system po/etcd-k8s-node2 1/1 Running 1 15m
kube-system po/kube-apiserver-k8s-node2 1/1 Running 1 15m
kube-system po/kube-controller-manager-k8s-node2 1/1 Running 2 15m
kube-system po/kube-dns-2425271678-2mh46 3/3 Running 0 1h
kube-system po/kube-proxy-qlmbx 1/1 Running 1 1h
kube-system po/kube-proxy-vwh6l 1/1 Running 0 1h
kube-system po/kube-scheduler-k8s-node2 1/1 Running 2 15m
NAMESPACE NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default svc/kubernetes 10.96.0.1 <none> 443/TCP 1h
kube-system svc/kube-dns 10.96.0.10 <none> 53/UDP,53/TCP 1h
NAMESPACE NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
kube-system deploy/calico-policy-controller 1 1 1 1 33m
kube-system deploy/kube-dns 1 1 1 1 1h
NAMESPACE NAME DESIRED CURRENT READY AGE
kube-system rs/calico-policy-controller-1906845835 1 1 1 33m
kube-system rs/kube-dns-2425271678 1 1 1 1h
Update:
It seems the master node cannot recognize the Calico network plugin. I used kubeadm to install the k8s cluster. Because kubeadm starts etcd on 127.0.0.1:2379 on the master node, Calico on the other nodes could not talk to etcd, so I modified etcd.yaml as follows, and now all Calico pods run fine. I am not very familiar with Calico; how can I fix this?
apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
  creationTimestamp: null
  labels:
    component: etcd
    tier: control-plane
  name: etcd
  namespace: kube-system
spec:
  containers:
  - command:
    - etcd
    - --listen-client-urls=http://127.0.0.1:2379,http://10.161.233.80:2379
    - --advertise-client-urls=http://10.161.233.80:2379
    - --data-dir=/var/lib/etcd
    image: gcr.io/google_containers/etcd-amd64:3.0.17
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /health
        port: 2379
        scheme: HTTP
      initialDelaySeconds: 15
      timeoutSeconds: 15
    name: etcd
    resources: {}
    volumeMounts:
    - mountPath: /etc/ssl/certs
      name: certs
    - mountPath: /var/lib/etcd
      name: etcd
    - mountPath: /etc/kubernetes
      name: k8s
      readOnly: true
  hostNetwork: true
  volumes:
  - hostPath:
      path: /etc/ssl/certs
    name: certs
  - hostPath:
      path: /var/lib/etcd
    name: etcd
  - hostPath:
      path: /etc/kubernetes
    name: k8s
status: {}
[root@k8s-node2 calico]# kubectl describe node k8s-node2
Name: k8s-node2
Role:
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/hostname=k8s-node2
node-role.kubernetes.io/master=
Annotations: node.alpha.kubernetes.io/ttl=0
volumes.kubernetes.io/controller-managed-attach-detach=true
Taints: node-role.kubernetes.io/master:NoSchedule
CreationTimestamp: Tue, 12 Sep 2017 15:20:57 +0800
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk False Wed, 13 Sep 2017 10:25:58 +0800 Tue, 12 Sep 2017 15:20:57 +0800 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Wed, 13 Sep 2017 10:25:58 +0800 Tue, 12 Sep 2017 15:20:57 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Wed, 13 Sep 2017 10:25:58 +0800 Tue, 12 Sep 2017 15:20:57 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
Ready False Wed, 13 Sep 2017 10:25:58 +0800 Tue, 12 Sep 2017 15:20:57 +0800 KubeletNotReady runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Addresses:
InternalIP: 10.161.233.80
Hostname: k8s-node2
Capacity:
cpu: 2
memory: 3618520Ki
pods: 110
Allocatable:
cpu: 2
memory: 3516120Ki
pods: 110
System Info:
Machine ID: 3c6ff97c6fbe4598b53fd04e08937468
System UUID: C6238BF8-8E60-4331-AEEA-6D0BA9106344
Boot ID: 84397607-908f-4ff8-8bdc-ff86c364dd32
Kernel Version: 3.10.0-514.6.2.el7.x86_64
OS Image: CentOS Linux 7 (Core)
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://1.12.6
Kubelet Version: v1.7.5
Kube-Proxy Version: v1.7.5
PodCIDR: 10.68.0.0/24
ExternalID: k8s-node2
Non-terminated Pods: (5 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
kube-system etcd-k8s-node2 0 (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system kube-apiserver-k8s-node2 250m (12%) 0 (0%) 0 (0%) 0 (0%)
kube-system kube-controller-manager-k8s-node2 200m (10%) 0 (0%) 0 (0%) 0 (0%)
kube-system kube-proxy-qlmbx 0 (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system kube-scheduler-k8s-node2 100m (5%) 0 (0%) 0 (0%) 0 (0%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
550m (27%) 0 (0%) 0 (0%) 0 (0%)
Events: <none>
It's good practice to run a describe command in order to see what's wrong with your node:
kubectl describe nodes <NODE_NAME>
e.g.: kubectl describe nodes k8s-node2
You should be able to start your investigations from there and add more info to this question if needed.
You need to install a Network Policy Provider. This is one of the supported providers:
Weave Net for NetworkPolicy.
Command line to install it:
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
After a few seconds, a Weave Net pod should be running on each Node and any further pods you create will be automatically attached to the Weave network.
I think you may need to add tolerations and update the annotations for calico-node in the manifest you are using so that it can run on a master created by kubeadm. Kubeadm taints the master so that pods cannot run on it unless they have a toleration for that taint.
I believe you are using the https://docs.projectcalico.org/v2.5/getting-started/kubernetes/installation/hosted/calico.yaml manifest, which has the annotations (including tolerations) for K8s v1.5. You should check https://docs.projectcalico.org/v2.5/getting-started/kubernetes/installation/hosted/kubeadm/1.6/calico.yaml instead; it has the toleration syntax for K8s v1.6+.
Here is a snippet from that manifest with the annotations and tolerations:
metadata:
  labels:
    k8s-app: calico-node
  annotations:
    # Mark this pod as a critical add-on; when enabled, the critical add-on scheduler
    # reserves resources for critical add-on pods so that they can be rescheduled after
    # a failure. This annotation works in tandem with the toleration below.
    scheduler.alpha.kubernetes.io/critical-pod: ''
spec:
  hostNetwork: true
  tolerations:
  - key: node-role.kubernetes.io/master
    effect: NoSchedule
  # Allow this pod to be rescheduled while the node is in "critical add-ons only" mode.
  # This, along with the annotation above marks this pod as a critical add-on.
  - key: CriticalAddonsOnly
    operator: Exists