promethues operator alertmanager-main-0 pending and display - kubernetes

What happened?
kubernetes version: 1.12
promethus operator: release-0.1
I follow the README:
$ kubectl create -f manifests/
# It can take a few seconds for the above 'create manifests' command to fully create the following resources, so verify the resources are ready before proceeding.
$ until kubectl get customresourcedefinitions servicemonitors.monitoring.coreos.com ; do date; sleep 1; echo ""; done
$ until kubectl get servicemonitors --all-namespaces ; do date; sleep 1; echo ""; done
$ kubectl apply -f manifests/ # This command sometimes may need to be done twice (to workaround a race condition).
and then I use the command and then is showed like:
[root#VM_8_3_centos /data/hansenwu/kube-prometheus/manifests]# kubectl get pod -n monitoring
NAME READY STATUS RESTARTS AGE
alertmanager-main-0 2/2 Running 0 66s
alertmanager-main-1 1/2 Running 0 47s
grafana-54f84fdf45-kt2j9 1/1 Running 0 72s
kube-state-metrics-65b8dbf498-h7d8g 4/4 Running 0 57s
node-exporter-7mpjw 2/2 Running 0 72s
node-exporter-crfgv 2/2 Running 0 72s
node-exporter-l7s9g 2/2 Running 0 72s
node-exporter-lqpns 2/2 Running 0 72s
prometheus-adapter-5b6f856dbc-ndfwl 1/1 Running 0 72s
prometheus-k8s-0 3/3 Running 1 59s
prometheus-k8s-1 3/3 Running 1 59s
prometheus-operator-5c64c8969-lqvkb 1/1 Running 0 72s
[root#VM_8_3_centos /data/hansenwu/kube-prometheus/manifests]# kubectl get pod -n monitoring
NAME READY STATUS RESTARTS AGE
alertmanager-main-0 0/2 Pending 0 0s
grafana-54f84fdf45-kt2j9 1/1 Running 0 75s
kube-state-metrics-65b8dbf498-h7d8g 4/4 Running 0 60s
node-exporter-7mpjw 2/2 Running 0 75s
node-exporter-crfgv 2/2 Running 0 75s
node-exporter-l7s9g 2/2 Running 0 75s
node-exporter-lqpns 2/2 Running 0 75s
prometheus-adapter-5b6f856dbc-ndfwl 1/1 Running 0 75s
prometheus-k8s-0 3/3 Running 1 62s
prometheus-k8s-1 3/3 Running 1 62s
prometheus-operator-5c64c8969-lqvkb 1/1 Running 0 75s
I don't know why the pod altertmanager-main-0 pending and disaply then restart.
And I see the event, it is showed as:
72s Warning FailedCreate StatefulSet create Pod alertmanager-main-0 in StatefulSet alertmanager-main failed error: The POST operation against Pod could not be completed at this time, please try again.
72s Warning FailedCreate StatefulSet create Pod alertmanager-main-0 in StatefulSet alertmanager-main failed error: The POST operation against Pod could not be completed at this time, please try again.
72s Warning^Z FailedCreate StatefulSet
[10]+ Stopped kubectl get events -n monitoring

Most likely the alertmanager does not get enough time to start correctly.
Have a look at this answer : https://github.com/coreos/prometheus-operator/issues/965#issuecomment-460223268
You can set the paused field to true, and then modify the StatefulSet to try if extending the liveness/readiness solves your issue.

Related

calico-kube-controller stays in pending state

I have a new install of kubernetes on Ubuntu-18 using version 1.24.3 with Calico. The calico-controller will not start:
$ sudo kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-kube-controllers-555bc4b957-z4q2p 0/1 Pending 0 5m14s
kube-system calico-node-jz2j7 1/1 Running 0 5m15s
kube-system coredns-6d4b75cb6d-hwfx9 1/1 Running 0 5m14s
kube-system coredns-6d4b75cb6d-wdh55 1/1 Running 0 5m14s
kube-system etcd-ubuntu-18-extssd 1/1 Running 1 5m27s
kube-system kube-apiserver-ubuntu-18-extssd 1/1 Running 1 5m28s
kube-system kube-controller-manager-ubuntu-18-extssd 1/1 Running 1 5m26s
kube-system kube-proxy-t5z2r 1/1 Running 0 5m15s
kube-system kube-scheduler-ubuntu-18-extssd 1/1 Running 1 5m27s
Someone suggested setting a couple of Calico timeouts to 60 seconds, but that didn't work either.
What could be causing the calico-controller to fail to start, especially since the calico-node is running?
Also, is there a more trouble-free CNI implementation to use? Calico seems very error-prone.
I solved this by installing Weave:
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
with this cidr:
sudo kubeadm init --pod-network-cidr=192.168.0.0/16

How to debug GKE internal network issue?

UPDATE 1:
Some more logs from api-servers:
https://gist.github.com/nvcnvn/47df8798e798637386f6e0777d869d4f
This question is more about debugging method for current GKE but welcome for solution.
We're using GKE version 1.22.3-gke.1500 with following configuration:
We recently facing issue that commands like kubectl logs and exec doesn't work, deleting a namespace taking forever.
Checking some service inside the cluster, it seem for some reason some network operation just randomly failed. For example metric-server keep crashing with these error logs:
message: "pkg/mod/k8s.io/client-go#v0.19.10/tools/cache/reflector.go:156: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://10.97.0.1:443/api/v1/nodes?resourceVersion=387681528": net/http: TLS handshake timeout"
HTTP request timeout also:
unable to fully scrape metrics: unable to fully scrape metrics from node gke-staging-n2d-standard-8-78c35b3a-6h16: unable to fetch metrics from node gke-staging-n2d-standard-8-78c35b3a-6h16: Get "http://10.148.15.217:10255/stats/summary?only_cpu_and_memory=true": context deadline exceeded
and I also try to restart (by kubectl delete) most of the pod in this list:
kubectl get pod
NAME READY STATUS RESTARTS AGE
event-exporter-gke-5479fd58c8-snq26 2/2 Running 0 4d7h
fluentbit-gke-gbs2g 2/2 Running 0 4d7h
fluentbit-gke-knz2p 2/2 Running 0 85m
fluentbit-gke-ljw8h 2/2 Running 0 30h
gke-metadata-server-dtnvh 1/1 Running 0 4d7h
gke-metadata-server-f2bqw 1/1 Running 0 30h
gke-metadata-server-kzcv6 1/1 Running 0 85m
gke-metrics-agent-4g56c 1/1 Running 12 (3h6m ago) 4d7h
gke-metrics-agent-hnrll 1/1 Running 13 (13h ago) 30h
gke-metrics-agent-xdbrw 1/1 Running 0 85m
konnectivity-agent-87bc84bb7-g9nd6 1/1 Running 0 2m59s
konnectivity-agent-87bc84bb7-rkhhh 1/1 Running 0 3m51s
konnectivity-agent-87bc84bb7-x7pk4 1/1 Running 0 3m50s
konnectivity-agent-autoscaler-698b6d8768-297mh 1/1 Running 0 83m
kube-dns-77d9986bd5-2m8g4 4/4 Running 0 3h24m
kube-dns-77d9986bd5-z4j62 4/4 Running 0 3h24m
kube-dns-autoscaler-f4d55555-dmvpq 1/1 Running 0 83m
kube-proxy-gke-staging-n2d-standard-8-78c35b3a-8299 1/1 Running 0 11s
kube-proxy-gke-staging-n2d-standard-8-78c35b3a-fp5u 1/1 Running 0 11s
kube-proxy-gke-staging-n2d-standard-8-78c35b3a-rkdp 1/1 Running 0 11s
l7-default-backend-7db896cb4-mvptg 1/1 Running 0 83m
metrics-server-v0.4.4-fd9886cc5-tcscj 2/2 Running 82 33h
netd-5vpmc 1/1 Running 0 30h
netd-bhq64 1/1 Running 0 85m
netd-n6jmc 1/1 Running 0 4d7h
Some logs from metrics server
https://gist.github.com/nvcnvn/b77eb02705385889961aca33f0f841c7
if you cannot use kubectl to get info from your cluster, can you try to access them by using their restfull api
http://blog.madhukaraphatak.com/understanding-k8s-api-part-2/
try to delete "metric-server" pods or get logs from it using podman or curl command.

Debug a pod stuck in pending state [duplicate]

This question already has an answer here:
Error "pod has unbound immediate PersistentVolumeClaim" during statefulset deployment
(1 answer)
Closed 2 years ago.
How could I debug a pod stuck in pending state? I am using k8ssandra https://k8ssandra.io/docs/ to create a Cassandra cluster. It uses helm files. I created a 3 nodes cluster and changed size value to 3 in local values.yaml file to create a 3 node cluster - https://github.com/k8ssandra/k8ssandra/blob/main/charts/k8ssandra-cluster/values.yaml
no_reply#cloudshell:~ (k8ssandra-299315)$ kubectl get pods
NAME READY STATUS RESTARTS AGE
cass-operator-86d4dc45cd-588c8 1/1 Running 0 29h
grafana-deployment-66557855cc-j7476 1/1 Running 0 29h
k8ssandra-cluster-a-grafana-operator-k8ssandra-5b89b64f4f-8pbxk 1/1 Running 0 29h
k8ssandra-cluster-a-reaper-k8ssandra-847c99ccd8-dsnj4 1/1 Running 0 28h
k8ssandra-cluster-a-reaper-k8ssandra-schema-5fzpn 0/1 Completed 0 28h
k8ssandra-cluster-a-reaper-operator-k8ssandra-87d56d56f-wn8hw 1/1 Running 0 29h
k8ssandra-dc1-default-sts-0 2/2 Running 0 29h
**k8ssandra-dc1-default-sts-1 0/2 Pending 0 14m**
k8ssandra-dc1-default-sts-2 2/2 Running 0 14m
k8ssandra-tools-kube-prome-operator-6bcdf668d4-ndhw9 1/1 Running 0 29h
prometheus-k8ssandra-cluster-a-prometheus-k8ssandra-0 2/2 Running 1 29h
The best way as described by Arghya is checking the events of the pod.
kubectl describe pod k8ssandra-dc1-default-sts-1
You could also check for the logs of the pod:
kubectl logs k8ssandra-dc1-default-sts-1

Kubernetes dashboard (web UI) not working

I have just started a new Kubernetes 1.8.0 environment using minikube (0.27) on Windows 10.
I followed this steps but it didn't work:
https://kubernetes.io/docs/tasks/access-application-cluster/web-ui-dashboard/
When I list pods this is the result:
C:\WINDOWS\system32>kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system etcd-minikube 1/1 Running 0 23m
kube-system heapster-69b5d4974d-s9vrf 1/1 Running 0 5m
kube-system kube-addon-manager-minikube 1/1 Running 0 23m
kube-system kube-apiserver-minikube 1/1 Running 0 23m
kube-system kube-controller-manager-minikube 1/1 Running 0 23m
kube-system kube-dns-545bc4bfd4-xkt7l 3/3 Running 3 1h
kube-system kube-proxy-7jnk6 1/1 Running 0 23m
kube-system kube-scheduler-minikube 1/1 Running 0 23m
kube-system kubernetes-dashboard-5569448c6d-8zqnc 1/1 Running 2 52m
kube-system kubernetes-dashboard-869db7f6b4-ddlmq 0/1 CrashLoopBackOff 19 51m
kube-system monitoring-influxdb-78d4c6f5b6-b66m9 1/1 Running 0 4m
kube-system storage-provisioner 1/1 Running 2 1h
As you can see, I have 2 kubernets-dashboard pods now, one of then is running and the other one is CrashLookBackOff.
When I try to run minikube dashboard this is the result:
"Waiting, endpoint for service is not ready yet..."
I have tried to remove kubernetes-dashboard-869db7f6b4-ddlmq pod:
kubectl delete pod kubernetes-dashboard-869db7f6b4-ddlmq
This is the result:
"Error from server (NotFound): pods "kubernetes-dashboard-869db7f6b4-ddlmq" not found"
"Error from server (NotFound): pods "kubernetes-dashboard-869db7f6b4-ddlmq" not found"
You failed to delete the pod due to the lack of namespace (add -n kube-system). And it should be 1 dashboard pod if no modification's applied. If it still fails to run minikube dashboard after you delete the abnormal pod, more logs should be provided.

kube-dns kubedns/dnsmasq/sidecar fails to start

This is a really odd issue I've started to experience. Everything was working with out issue, however, now when I startup a cluster (kubeadm), setup flannel, kube-dns never starts up. Eventually, it errors out with the following output from kubectl describe
Error: failed to start container "sidecar": Error response from daemon: {"message":"invalid header field value \"oci runtime error: container_linux.go:240: creating new parent process caused \\\"container_linux.go:1245: running lstat on namespace path \\\\\\\"/proc/7420/ns/ipc\\\\\\\" caused \\\\\\\"lstat /proc/7420/ns/ipc: no such file or directory\\\\\\\"\\\"\\n\""}
Any ideas what this error really means? I get the same looking error for dnsmasq and kubedns as well.
I am using the switch "--pod-network-cidr 10.244.0.0/16" as always. As I said ,this was working, and then a few days later, it's not....
Here's the get pods output:
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system etcd-machiato-0 1/1 Running 0 3m
kube-system kube-apiserver-machiato-0 1/1 Running 0 3m
kube-system kube-controller-manager-machiato-0 1/1 Running 0 2m
kube-system kube-dns-2258483030-pd8qj 0/3 ContainerCreating 0 3m
kube-system kube-flannel-ds-0z0dd 2/2 Running 0 1m
kube-system kube-flannel-ds-3dccg 2/2 Running 0 1m
kube-system kube-proxy-gc8ft 1/1 Running 0 3m
kube-system kube-proxy-tjgzn 1/1 Running 0 1m
kube-system kube-scheduler-machiato-0 1/1 Running 0 3m
Eventually, "ContainerCreating" switches to "CrashLoopBackOff" then I see the lstat error above.
most likely its overlay network issue. can you check the dns pods log message and see any error message?
kubectl logs -n kube-system kube-dns-2258483030-pd8qj -c kubedns
kubectl logs -n kube-system kube-dns-2258483030-pd8qj -c dnsmasq
kubectl logs -n kube-system kube-dns-2258483030-pd8qj -c sidecar