Kubernetes pod stuck in waiting state

Trying to start this pod
apiVersion: v1
kind: Pod
metadata:
  name: tinyproxy
spec:
  containers:
    - name: master
      image: asdrepo.isus.emc.com:8091/francisbesset/tinyproxy
      env:
        - name: MASTER
          value: "true"
      ports:
        - containerPort: 6379
      resources:
        limits:
          cpu: "0.1"
      volumeMounts:
        - mountPath: /tinyproxy-data
          name: data
  volumes:
    - name: data
      emptyDir: {}
This gets stuck in the Pending state. I looked in the troubleshooting guide, but this pod does not seem to have any events:
$ kubectl describe pods tinyproxy
Name: tinyproxy
Namespace: default
Node: /
Labels: name=tinyproxy
Status: Pending
IP:
Controllers: <none>
Containers:
master:
Image: asdrepo.isus.emc.com:8091/francisbesset/tinyproxy
Port: 6379/TCP
QoS Tier:
cpu: Guaranteed
memory: BestEffort
Limits:
cpu: 100m
Requests:
cpu: 100m
Environment Variables:
MASTER: true
Volumes:
data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
No events.
Also
$ kubectl get events
FIRSTSEEN LASTSEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
13m 13m 1 10.0.0.5 Node Normal Starting {kubelet 10.0.0.5} Starting kubelet.
13m 13m 2 10.0.0.5 Node Warning MissingClusterDNS {kubelet 10.0.0.5} kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. pod: "kube-proxy-10.0.0.5_kube-system(9fa6e0ea64b9f19ad6996367402408eb)". Falling back to DNSDefault policy.
13m 13m 1 10.0.0.5 Node Normal NodeHasSufficientDisk {kubelet 10.0.0.5} Node 10.0.0.5 status is now: NodeHasSufficientDisk
13m 13m 1 10.0.0.5 Node Normal Starting {kubelet 10.0.0.5} Starting kubelet.
13m 13m 1 10.0.0.5 Node Normal NodeHasSufficientDisk {kubelet 10.0.0.5} Node 10.0.0.5 status is now: NodeHasSufficientDisk
13m 13m 1 k8-dvawxybzux-0-a7m3diiryehx-kube-minion-itahxn4icom6 Node Normal Starting {kube-proxy k8-dvawxybzux-0-a7m3diiryehx-kube-minion-itahxn4icom6} Starting kube-proxy.
The kube-proxy container does seem to be running and is not restarting:
bash-4.3# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
d6dd779b301f gcr.io/google_containers/hyperkube:v1.2.0 "/hyperkube proxy --m" 15 minutes ago Up 15 minutes k8s_kube-proxy.d87e83d4_kube-proxy-10.0.0.5_kube-system_9fa6e0ea64b9f19ad6996367402408eb_caae92ac
8191770f15d9 gcr.io/google_containers/pause:2.0 "/pause" 15 minutes ago Up 15 minutes k8s_POD.6059dfa2_kube-proxy-10.0.0.5_kube-system_9fa6e0ea64b9f19ad6996367402408eb_e4da5a30
How do I debug this?

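It turned out the scheduler service had not started (this is in an OpenStack VM). All services were supposed to be configured and started automatically. With no scheduler running, nothing assigns the pod to a node, which is why the Node field is empty and the pod has no events. It worked after I started the service manually.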

Related

kubernetes pod (mssql-tools) failing with CrashLoopBackOff error and restarting

I'm using Rancher Desktop for Kubernetes on WSL 2 in Windows 11.
I'm trying to create a pod using this simple YAML:
apiVersion: v1
kind: Pod
metadata:
  name: mssql-tools
  labels:
    name: mssql-tools
spec:
  containers:
    - name: mssql-tools
      image: mcr.microsoft.com/mssql-tools:latest
But it keeps failing with a CrashLoopBackOff error.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
mssql-tools 0/1 CrashLoopBackOff 11 (8s ago) 14m
And here is the result of kubectl describe pod mssql-tools:
$ kubectl describe pod mssql-tools
Name: mssql-tools
Namespace: default
Priority: 0
Service Account: default
Node: desktop-2ohsprk/172.22.97.204
Start Time: Mon, 26 Dec 2022 04:34:19 +0500
Labels: name=mssql-tools
Annotations: <none>
Status: Running
IP: 10.42.0.57
IPs:
IP: 10.42.0.57
Containers:
mssql-tools:
Container ID: docker://76343010f4344a5d26fb35f3b0278271d3336e8e10d695cc22e78520262f34bf
Image: mcr.microsoft.com/mssql-tools:latest
Image ID: docker-pullable://mcr.microsoft.com/mssql-tools#sha256:62556500522072535cb3df2bb5965333dded9be47000473e9e0f84118e248642
Port: <none>
Host Port: <none>
State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 26 Dec 2022 04:46:20 +0500
Finished: Mon, 26 Dec 2022 04:46:20 +0500
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 26 Dec 2022 04:45:51 +0500
Finished: Mon, 26 Dec 2022 04:45:51 +0500
Ready: False
Restart Count: 9
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wkqlg (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-wkqlg:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 12m default-scheduler Successfully assigned default/mssql-tools to desktop-2ohsprk
Normal Pulled 12m kubelet Successfully pulled image "mcr.microsoft.com/mssql-tools:latest" in 1.459473213s
Normal Pulled 12m kubelet Successfully pulled image "mcr.microsoft.com/mssql-tools:latest" in 823.403008ms
Normal Pulled 11m kubelet Successfully pulled image "mcr.microsoft.com/mssql-tools:latest" in 835.697509ms
Normal Pulled 11m kubelet Successfully pulled image "mcr.microsoft.com/mssql-tools:latest" in 873.802598ms
Normal Created 11m (x4 over 12m) kubelet Created container mssql-tools
Normal Started 11m (x4 over 12m) kubelet Started container mssql-tools
Normal Pulling 10m (x5 over 12m) kubelet Pulling image "mcr.microsoft.com/mssql-tools:latest"
Normal Pulled 10m kubelet Successfully pulled image "mcr.microsoft.com/mssql-tools:latest" in 740.64559ms
Warning BackOff 6m56s (x25 over 11m) kubelet Back-off restarting failed container
Normal SandboxChanged 50s kubelet Pod sandbox changed, it will be killed and re-created.
Normal Pulled 48s kubelet Successfully pulled image "mcr.microsoft.com/mssql-tools:latest" in 951.332457ms
Normal Pulled 32s kubelet Successfully pulled image "mcr.microsoft.com/mssql-tools:latest" in 828.839917ms
Normal Pulling 4s (x3 over 49s) kubelet Pulling image "mcr.microsoft.com/mssql-tools:latest"
Normal Pulled 3s kubelet Successfully pulled image "mcr.microsoft.com/mssql-tools:latest" in 713.951656ms
Normal Created 3s (x3 over 48s) kubelet Created container mssql-tools
Normal Started 3s (x3 over 48s) kubelet Started container mssql-tools
Warning BackOff 2s (x5 over 47s) kubelet Back-off restarting failed container
The same container works perfectly if I run it via docker and I can use its shell to execute sqlcmd properly.
I can't figure out any reason for this.
Any help would be really appreciated.
Thanks
CrashLoopBackOff is a common error which indicates that a pod failed to start and kept failing each time Kubernetes tried to restart it.
To troubleshoot this issue, follow the steps below:
Check for "Back-off restarting failed container" events by running kubectl describe pod [name].
If you see both "Liveness probe failed" and "Back-off restarting failed container" messages from the kubelet, the container is not responding and is being restarted.
Check the logs of the previous container instance. Run kubectl get pods to identify the pod that is in CrashLoopBackOff, then run kubectl logs [name] --previous --tail 10 to get the last ten log lines from the previous instance.
Check deployment logs by running the command: kubectl logs -f deploy/ -n
Refer to this link for more detailed troubleshooting steps.
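So after trying and digging through multiple options, it finally worked once I overrode the container command with sleep 3600000. The describe output above shows the container exiting within the same second with Exit Code 0 (Reason: Completed), so nothing was actually crashing: most likely the image's default command is an interactive shell that exits straight away when no terminal is attached. Because the default restartPolicy is Always, the kubelet kept restarting the finished container, which shows up as CrashLoopBackOff. A long-running sleep keeps the main process alive, so the pod stays Running and you can exec into it to use sqlcmd.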
Here is the working yaml:
apiVersion: v1
kind: Pod
metadata:
  name: mssql-tools
  labels:
    name: mssql-tools
spec:
  containers:
    - name: mssql-tools
      image: mcr.microsoft.com/mssql-tools:latest
      command: ["sleep"]
      args:
        - "3600000"
      imagePullPolicy: IfNotPresent
The command and its arguments can also be written like this:
apiVersion: v1
...
...
spec:
  containers:
    - name: mssql-tools
      image: mcr.microsoft.com/mssql-tools:latest
      command:
        - sleep
        - "3600000"
...
You can also start the container with a command directly from the kubectl run command line:
kubectl run mssql --image=mcr.microsoft.com/mssql-tools -n myNameSpace --command -- sleep 3600000
Note: You can omit -n myNameSpace if you are deploying to the default namespace.

kubernetes: when pod in CrashLoopBackOff status, related events won't update?

I'm testing how Kubernetes behaves when a pod runs into errors.
I now have a pod in CrashLoopBackOff status caused by a failing liveness probe. From what I can see in the events, the pod goes into CrashLoopBackOff after 3 attempts and starts backing off its restarts, but the related "Liveness probe failed" events don't seem to update?
➜ ~ kubectl describe pods/my-nginx-liveness-err-59fb55cf4d-c6p8l
Name: my-nginx-liveness-err-59fb55cf4d-c6p8l
Namespace: default
Priority: 0
Node: minikube/192.168.99.100
Start Time: Thu, 15 Jul 2021 12:29:16 +0800
Labels: pod-template-hash=59fb55cf4d
run=my-nginx-liveness-err
Annotations: <none>
Status: Running
IP: 172.17.0.3
IPs:
IP: 172.17.0.3
Controlled By: ReplicaSet/my-nginx-liveness-err-59fb55cf4d
Containers:
my-nginx-liveness-err:
Container ID: docker://edc363b76811fdb1ccacdc553d8de77e9d7455bb0d0fb3cff43eafcd12ee8a92
Image: nginx
Image ID: docker-pullable://nginx#sha256:353c20f74d9b6aee359f30e8e4f69c3d7eaea2f610681c4a95849a2fd7c497f9
Port: 80/TCP
Host Port: 0/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 15 Jul 2021 13:01:36 +0800
Finished: Thu, 15 Jul 2021 13:02:06 +0800
Ready: False
Restart Count: 15
Liveness: http-get http://:8080/ delay=0s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r7mh4 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-r7mh4:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 37m default-scheduler Successfully assigned default/my-nginx-liveness-err-59fb55cf4d-c6p8l to minikube
Normal Created 35m (x4 over 37m) kubelet Created container my-nginx-liveness-err
Normal Started 35m (x4 over 37m) kubelet Started container my-nginx-liveness-err
Normal Killing 35m (x3 over 36m) kubelet Container my-nginx-liveness-err failed liveness probe, will be restarted
Normal Pulled 31m (x7 over 37m) kubelet Container image "nginx" already present on machine
Warning Unhealthy 16m (x32 over 36m) kubelet Liveness probe failed: Get "http://172.17.0.3:8080/": dial tcp 172.17.0.3:8080: connect: connection refused
Warning BackOff 118s (x134 over 34m) kubelet Back-off restarting failed container
The BackOff event was updated 118s ago, but the Unhealthy event was last updated 16m ago?
And why is the Restart Count only 15 while the BackOff event has been seen 134 times?
I'm using minikube and my deployment looks like this:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-nginx-liveness-err
spec:
  selector:
    matchLabels:
      run: my-nginx-liveness-err
  replicas: 1
  template:
    metadata:
      labels:
        run: my-nginx-liveness-err
    spec:
      containers:
        - name: my-nginx-liveness-err
          image: nginx
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 80
          livenessProbe:
            httpGet:
              path: /
              port: 8080
I think you might be confusing Status Conditions and Events. Events don't "update", they just exist. They are a stream of event data from the controllers, meant for debugging or alerting. The Age column shows how long ago the most recent instance of that event occurred, and the "(x134 over 34m)" style suffix shows that some basic de-duplication is applied. Events also age out (after one hour with the API server's default --event-ttl) to keep the database from exploding.
So your issue has nothing to do with the liveness probe, your container is crashing on startup.
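As a rough illustration of why the Restart Count (15) can be much lower than the BackOff count (134): the Kubernetes documentation describes the restart delay as exponential (10s, 20s, 40s, ...) and capped at five minutes, so restarts become progressively rarer, while the kubelet appears to record a BackOff event each time it re-examines the pod and finds it still waiting. The sketch below only models that documented delay sequence. The function name and defaults are illustrative assumptions, not kubelet code, and the real counts also depend on how long each run lasts.

```python
def backoff_delays(restarts, base=10, cap=300):
    """Return the delay (in seconds) applied before each of `restarts` restarts.

    Models the documented sequence: start at `base`, double each time,
    never exceed `cap`.
    """
    delays = []
    delay = base
    for _ in range(restarts):
        delays.append(delay)
        delay = min(delay * 2, cap)
    return delays


if __name__ == "__main__":
    # Print the delay before each of the first eight restarts.
    for attempt, delay in enumerate(backoff_delays(8), start=1):
        print(f"restart {attempt}: wait {delay}s")
```

Once the cap is reached, each restart is preceded by a five-minute wait, and many BackOff events can accumulate during that wait without the restart counter moving.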

how to scale daemon set about kubernetes using kubectl

I currently only have terminal access to the Kubernetes cluster. I checked the ingress controller like this:
$ k get daemonset --all-namespaces
NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
kube-system traefik-ingress-controller 0 0 0 0 0 IngressProxy=true 60d
logging fluentd-es 0 0 0 0 0 beta.kubernetes.io/fluentd-ds-ready=true 28d
I am using kubectl (v1.15.2) to try to scale the DaemonSet like this:
kubectl scale --replicas=1 DaemonSet/traefik-ingress-controller -n kube-system
but it shows:
Error from server (NotFound): the server could not find the requested resource
What should I do to start traefik from the command line? This is the describe output of my DaemonSet:
$ k describe daemonset traefik-ingress-controller -n kube-system
Name: traefik-ingress-controller
Selector: app=traefik
Node-Selector: IngressProxy=true
Labels: app=traefik
Annotations: deprecated.daemonset.template.generation: 18
kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"apps/v1","kind":"DaemonSet","metadata":{"annotations":{},"labels":{"app":"traefik"},"name":"traefik-ingress-controller","na...
Desired Number of Nodes Scheduled: 0
Current Number of Nodes Scheduled: 0
Number of Nodes Scheduled with Up-to-date Pods: 0
Number of Nodes Scheduled with Available Pods: 0
Number of Nodes Misscheduled: 0
Pods Status: 0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: app=traefik
Service Account: traefik-ingress-controller
Containers:
traefik-ingress-lb:
Image: traefik:v2.1.6
Ports: 80/TCP, 443/TCP, 8080/TCP
Host Ports: 80/TCP, 443/TCP, 0/TCP
Args:
--configfile=/config/traefik.yaml
--logLevel=INFO
--metrics=true
--metrics.prometheus=true
--entryPoints.metrics.address=:8080
--metrics.prometheus.entryPoint=metrics
--metrics.prometheus.addServicesLabels=true
--metrics.prometheus.addEntryPointsLabels=true
--metrics.prometheus.buckets=0.100000, 0.300000, 1.200000, 5.000000
Limits:
cpu: 2
memory: 1Gi
Requests:
cpu: 1
memory: 1Gi
Environment: <none>
Mounts:
/config from config (rw)
Volumes:
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: traefik-config
Optional: false
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedDaemonPod 3h32m daemonset-controller Found failed daemon pod kube-system/traefik-ingress-controller-wdpsq on node azshara-k8s03, will try to kill it
Normal SuccessfulDelete 3h32m daemonset-controller Deleted pod: traefik-ingress-controller-wdpsq
Normal SuccessfulCreate 3h32m daemonset-controller Created pod: traefik-ingress-controller-qmttl
Warning FailedDaemonPod 3h32m daemonset-controller Found failed daemon pod kube-system/traefik-ingress-controller-qmttl on node azshara-k8s03, will try to kill it
Normal SuccessfulDelete 3h32m daemonset-controller Deleted pod: traefik-ingress-controller-qmttl
Normal SuccessfulCreate 3h32m daemonset-controller Created pod: traefik-ingress-controller-nlxwc
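You do not need to scale a DaemonSet on Kubernetes, and you cannot: a DaemonSet has no replica count and no scale subresource, which is why kubectl scale fails with NotFound.
A DaemonSet ensures that all eligible nodes run a copy of a Pod.
As nodes are added to the cluster, Pods are added to them. So you need to add a new node to the cluster and the DaemonSet will be scheduled there, unless the node carries a taint that the DaemonSet does not tolerate. Note that this DaemonSet also has the node selector IngressProxy=true, so a node is only eligible if it carries that label; a DESIRED count of 0 suggests that no node currently does.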
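To make "eligible nodes" concrete, here is a minimal sketch of how a node selector such as IngressProxy=true narrows the set of nodes a DaemonSet targets. It is a simplified model under stated assumptions: the node names and labels are hypothetical, and taints, tolerations and node affinity are ignored.

```python
def eligible_nodes(nodes, node_selector):
    """Return names of nodes whose labels include every key/value in node_selector."""
    return [
        name
        for name, labels in nodes.items()
        if all(labels.get(key) == value for key, value in node_selector.items())
    ]


# Hypothetical cluster: node names and labels are made up for illustration.
nodes = {
    "node-a": {"kubernetes.io/hostname": "node-a"},
    "node-b": {"kubernetes.io/hostname": "node-b"},
}
selector = {"IngressProxy": "true"}

print(eligible_nodes(nodes, selector))    # no node carries the label yet
nodes["node-b"]["IngressProxy"] = "true"  # analogous to labelling the node
print(eligible_nodes(nodes, selector))    # node-b is now eligible
```

On a real cluster the equivalent of the labelling step is kubectl label node <node-name> IngressProxy=true, after which the DaemonSet controller should create a pod on that node.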

kube-scheduler Liveness probe failed: Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect: connection refused

So I have this unhealthy cluster partially working in the datacenter. This is probably the 10th time I have rebuilt from the instructions at: https://kubernetes.io/docs/setup/independent/high-availability/
I can apply some pods to this cluster and it seems to work but eventually it starts slowing down and crashing as you can see below. Here is the scheduler manifest:
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
    - command:
        - kube-scheduler
        - --bind-address=127.0.0.1
        - --kubeconfig=/etc/kubernetes/scheduler.conf
        - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
        - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
        - --leader-elect=true
      image: k8s.gcr.io/kube-scheduler:v1.14.2
      imagePullPolicy: IfNotPresent
      livenessProbe:
        failureThreshold: 8
        httpGet:
          host: 127.0.0.1
          path: /healthz
          port: 10251
          scheme: HTTP
        initialDelaySeconds: 15
        timeoutSeconds: 15
      name: kube-scheduler
      resources:
        requests:
          cpu: 100m
      volumeMounts:
        - mountPath: /etc/kubernetes/scheduler.conf
          name: kubeconfig
          readOnly: true
  hostNetwork: true
  priorityClassName: system-cluster-critical
  volumes:
    - hostPath:
        path: /etc/kubernetes/scheduler.conf
        type: FileOrCreate
      name: kubeconfig
status: {}
$ kubectl -n kube-system get pods
NAME READY STATUS RESTARTS AGE
coredns-fb8b8dccf-42psn 1/1 Running 9 88m
coredns-fb8b8dccf-x9mlt 1/1 Running 11 88m
docker-registry-dqvzb 1/1 Running 1 2d6h
kube-apiserver-kube-apiserver-1 1/1 Running 44 2d8h
kube-apiserver-kube-apiserver-2 1/1 Running 34 2d7h
kube-controller-manager-kube-apiserver-1 1/1 Running 198 2d2h
kube-controller-manager-kube-apiserver-2 0/1 CrashLoopBackOff 170 2d7h
kube-flannel-ds-amd64-4mbfk 1/1 Running 1 2d7h
kube-flannel-ds-amd64-55hc7 1/1 Running 1 2d8h
kube-flannel-ds-amd64-fvwmf 1/1 Running 1 2d7h
kube-flannel-ds-amd64-ht5wm 1/1 Running 3 2d7h
kube-flannel-ds-amd64-rjt9l 1/1 Running 4 2d8h
kube-flannel-ds-amd64-wpmkj 1/1 Running 1 2d7h
kube-proxy-2n64d 1/1 Running 3 2d7h
kube-proxy-2pq2g 1/1 Running 1 2d7h
kube-proxy-5fbms 1/1 Running 2 2d8h
kube-proxy-g8gmn 1/1 Running 1 2d7h
kube-proxy-wrdrj 1/1 Running 1 2d8h
kube-proxy-wz6gv 1/1 Running 1 2d7h
kube-scheduler-kube-apiserver-1 0/1 CrashLoopBackOff 198 2d2h
kube-scheduler-kube-apiserver-2 1/1 Running 5 18m
nginx-ingress-controller-dz8fm 1/1 Running 3 2d4h
nginx-ingress-controller-sdsgg 1/1 Running 3 2d4h
nginx-ingress-controller-sfrgb 1/1 Running 1 2d4h
$ kubectl -n kube-system describe pod kube-scheduler-kube-apiserver-1
Containers:
kube-scheduler:
Container ID: docker://c04f3c9061cafef8749b2018cd66e6865d102f67c4d13bdd250d0b4656d5f220
Image: k8s.gcr.io/kube-scheduler:v1.14.2
Image ID: docker-pullable://k8s.gcr.io/kube-scheduler#sha256:052e0322b8a2b22819ab0385089f202555c4099493d1bd33205a34753494d2c2
Port: <none>
Host Port: <none>
Command:
kube-scheduler
--bind-address=127.0.0.1
--kubeconfig=/etc/kubernetes/scheduler.conf
--authentication-kubeconfig=/etc/kubernetes/scheduler.conf
--authorization-kubeconfig=/etc/kubernetes/scheduler.conf
--leader-elect=true
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Tue, 28 May 2019 23:16:50 -0400
Finished: Tue, 28 May 2019 23:19:56 -0400
Ready: False
Restart Count: 195
Requests:
cpu: 100m
Liveness: http-get http://127.0.0.1:10251/healthz delay=15s timeout=15s period=10s #success=1 #failure=8
Environment: <none>
Mounts:
/etc/kubernetes/scheduler.conf from kubeconfig (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kubeconfig:
Type: HostPath (bare host directory volume)
Path: /etc/kubernetes/scheduler.conf
HostPathType: FileOrCreate
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: :NoExecute
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Created 4h56m (x104 over 37h) kubelet, kube-apiserver-1 Created container kube-scheduler
Normal Started 4h56m (x104 over 37h) kubelet, kube-apiserver-1 Started container kube-scheduler
Warning Unhealthy 137m (x71 over 34h) kubelet, kube-apiserver-1 Liveness probe failed: Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect: connection refused
Normal Pulled 132m (x129 over 37h) kubelet, kube-apiserver-1 Container image "k8s.gcr.io/kube-scheduler:v1.14.2" already present on machine
Warning BackOff 128m (x1129 over 34h) kubelet, kube-apiserver-1 Back-off restarting failed container
Normal SandboxChanged 80m kubelet, kube-apiserver-1 Pod sandbox changed, it will be killed and re-created.
Warning Failed 76m kubelet, kube-apiserver-1 Error: context deadline exceeded
Normal Pulled 36m (x7 over 78m) kubelet, kube-apiserver-1 Container image "k8s.gcr.io/kube-scheduler:v1.14.2" already present on machine
Normal Started 36m (x6 over 74m) kubelet, kube-apiserver-1 Started container kube-scheduler
Normal Created 32m (x7 over 74m) kubelet, kube-apiserver-1 Created container kube-scheduler
Warning Unhealthy 20m (x9 over 40m) kubelet, kube-apiserver-1 Liveness probe failed: Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect: connection refused
Warning BackOff 2m56s (x85 over 69m) kubelet, kube-apiserver-1 Back-off restarting failed container
I feel like I am overlooking a simple option or configuration, but I can't find it, and after days of dealing with this problem and reading documentation I am at my wits' end.
The load balancer is a TCP load balancer and seems to be working as expected as I can query the cluster from my desktop.
Any suggestions or troubleshooting tips are definitely welcome at this time.
Thank you.
The problem with our configuration was that a well-intentioned technician had removed one of the rules on the Kubernetes master firewall, which prevented the master from looping back to the ports it needed to probe. This caused all kinds of weird issues and led us to misdiagnose the problem, which sent us in the wrong direction. After we allowed all ports on the servers, Kubernetes was back to its normal behavior.
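A quick way to test for this kind of problem is to try the same TCP connection the liveness probe makes, from the master itself. The sketch below is a minimal check using only the Python standard library; the helper name port_open is an illustrative assumption, and 127.0.0.1:10251 is simply the address from the probe above. A refused connection means either nothing is listening or a firewall rule is rejecting the traffic, while a timeout usually points to packets being silently dropped.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Address and port taken from the kube-scheduler liveness probe.
    print("scheduler healthz port reachable:", port_open("127.0.0.1", 10251))
```

If this reports False while the scheduler process is running and listening on that port, the loopback firewall rules are a likely suspect.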

Cluster-autoscaler not triggering scale-up on Daemonset deployment

I deployed the Datadog agent using the Datadog Helm chart which deploys a Daemonset in Kubernetes. However when checking the state of the Daemonset I saw it was not creating all pods:
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
datadog-agent-datadog 5 2 2 2 2 <none> 1h
When describing the Daemonset to figure out what was going wrong I saw it did not have enough resources:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedPlacement 42s (x6 over 42s) daemonset-controller failed to place pod on "ip-10-0-1-124.eu-west-1.compute.internal": Node didn't have enough resource: cpu, requested: 200, used: 1810, capacity: 2000
Warning FailedPlacement 42s (x6 over 42s) daemonset-controller failed to place pod on "<ip>": Node didn't have enough resource: cpu, requested: 200, used: 1810, capacity: 2000
Warning FailedPlacement 42s (x5 over 42s) daemonset-controller failed to place pod on "<ip>": Node didn't have enough resource: cpu, requested: 200, used: 1860, capacity: 2000
Warning FailedPlacement 42s (x7 over 42s) daemonset-controller failed to place pod on "<ip>": Node didn't have enough resource: cpu, requested: 200, used: 1860, capacity: 2000
Normal SuccessfulCreate 42s daemonset-controller Created pod: datadog-agent-7b2kp
However, I have the Cluster-autoscaler installed in the cluster and configured properly (It does trigger on regular Pod deployments that do not have enough resources to schedule), but it does not seem to trigger on the Daemonset:
I0424 14:14:48.545689 1 static_autoscaler.go:273] No schedulable pods
I0424 14:14:48.545700 1 static_autoscaler.go:280] No unschedulable pods
The AutoScalingGroup has enough nodes left:
Did I miss something in the configuration of the Cluster-autoscaler? What can I do to make sure it triggers on Daemonset resources as well?
Edit:
Describe of the Daemonset
Name: datadog-agent
Selector: app=datadog-agent
Node-Selector: <none>
Labels: app=datadog-agent
chart=datadog-1.27.2
heritage=Tiller
release=datadog-agent
Annotations: deprecated.daemonset.template.generation: 1
Desired Number of Nodes Scheduled: 5
Current Number of Nodes Scheduled: 2
Number of Nodes Scheduled with Up-to-date Pods: 2
Number of Nodes Scheduled with Available Pods: 2
Number of Nodes Misscheduled: 0
Pods Status: 2 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: app=datadog-agent
Annotations: checksum/autoconf-config: 38e0b9de817f645c4bec37c0d4a3e58baecccb040f5718dc069a72c7385a0bed
checksum/checksd-config: 38e0b9de817f645c4bec37c0d4a3e58baecccb040f5718dc069a72c7385a0bed
checksum/confd-config: 38e0b9de817f645c4bec37c0d4a3e58baecccb040f5718dc069a72c7385a0bed
Service Account: datadog-agent
Containers:
datadog:
Image: datadog/agent:6.10.1
Port: 8125/UDP
Host Port: 0/UDP
Limits:
cpu: 200m
memory: 256Mi
Requests:
cpu: 200m
memory: 256Mi
Liveness: http-get http://:5555/health delay=15s timeout=5s period=15s #success=1 #failure=6
Environment:
DD_API_KEY: <set to the key 'api-key' in secret 'datadog-secret'> Optional: false
DD_LOG_LEVEL: INFO
KUBERNETES: yes
DD_KUBERNETES_KUBELET_HOST: (v1:status.hostIP)
DD_HEALTH_PORT: 5555
Mounts:
/host/proc from procdir (ro)
/host/sys/fs/cgroup from cgroups (ro)
/var/run/docker.sock from runtimesocket (ro)
/var/run/s6 from s6-run (rw)
Volumes:
runtimesocket:
Type: HostPath (bare host directory volume)
Path: /var/run/docker.sock
HostPathType:
procdir:
Type: HostPath (bare host directory volume)
Path: /proc
HostPathType:
cgroups:
Type: HostPath (bare host directory volume)
Path: /sys/fs/cgroup
HostPathType:
s6-run:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedPlacement 33m (x6 over 33m) daemonset-controller failed to place pod on "ip-10-0-2-144.eu-west-1.compute.internal": Node didn't have enough resource: cpu, requested: 200, used: 1810, capacity: 2000
Normal SuccessfulCreate 33m daemonset-controller Created pod: datadog-agent-7b2kp
Warning FailedPlacement 16m (x25 over 33m) daemonset-controller failed to place pod on "ip-10-0-1-124.eu-west-1.compute.internal": Node didn't have enough resource: cpu, requested: 200, used: 1810, capacity: 2000
Warning FailedPlacement 16m (x25 over 33m) daemonset-controller failed to place pod on "ip-10-0-2-174.eu-west-1.compute.internal": Node didn't have enough resource: cpu, requested: 200, used: 1860, capacity: 2000
Warning FailedPlacement 16m (x25 over 33m) daemonset-controller failed to place pod on "ip-10-0-3-250.eu-west-1.compute.internal": Node didn't have enough resource: cpu, requested: 200, used: 1860, capacity: 2000
You can add a priorityClassName that points to a high-priority PriorityClass to your DaemonSet. Kubernetes will then preempt (evict) lower-priority pods in order to run the DaemonSet's pods. If that results in unschedulable pods, cluster-autoscaler should add a node to schedule them on.
See the docs, on which most of these examples are based. (For some pre-1.14 versions, the apiVersion is likely a beta (1.11-1.13) or alpha (1.8-1.10) version instead.)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "High priority class for essential pods"
Apply it to your workload:
---
# apps/v1 replaces extensions/v1beta1, which was removed in Kubernetes 1.16
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: datadog-agent
spec:
  selector:  # required by apps/v1; must match the template labels
    matchLabels:
      app: datadog-agent
  template:
    metadata:
      labels:
        app: datadog-agent
      name: datadog-agent
    spec:
      priorityClassName: high-priority
      serviceAccountName: datadog-agent
      containers:
        - image: datadog/agent:latest
############ Rest of template goes here
You should understand how cluster autoscaler works. It is responsible only for adding or removing nodes, not for creating or destroying pods. In your case cluster autoscaler does nothing because adding a node would not help: even with one more node, the DaemonSet pods would still have to run on the existing nodes that do not have enough CPU. That is why it is not adding nodes.
What you should do is manually remove some pods from the full nodes. The DaemonSet pods can then be scheduled there.
Alternatively, you can reduce the CPU request of the Datadog agent to, for example, 100m or 50m. This should be enough to start those pods.
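The arithmetic behind that suggestion can be checked against the numbers in the FailedPlacement events (requested 200, used 1810 or 1860, capacity 2000, all in millicores). The sketch below is a simplified model: it only compares CPU and assumes the "capacity" figure from the event is the limit that applies; the function name fits is illustrative.

```python
def fits(requested_m, used_m, capacity_m):
    """Return True if a pod requesting `requested_m` millicores fits on the node."""
    return used_m + requested_m <= capacity_m


# (used, capacity) pairs in millicores, taken from the FailedPlacement events.
nodes = [(1810, 2000), (1860, 2000)]

for request in (200, 100, 50):
    results = [fits(request, used, cap) for used, cap in nodes]
    print(f"request={request}m -> fits on each node: {results}")
```

With a 200m request neither node has room (1810 + 200 and 1860 + 200 both exceed 2000), whereas a 100m or 50m request fits on both.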