AWS Kubernetes cluster using KOPS - kube-dns and kube-proxy go down - kubernetes

I have created a Kubernetes cluster using KOPS on AWS. The cluster gets created without any issues and runs fine for 10-15 hours. I have deployed SAP Vora 2.1 on this cluster. However, generally after 12-15 hours the KOPS cluster runs into problems with kube-proxy and kube-dns: these pods either go down or show up in a Completed state, and there are a lot of restarts as well. This eventually causes my application pods to run into problems and the application also goes down. The application uses Consul for service discovery, but since the Kubernetes foundation services are not working properly, the application does not reach a steady state even if I try to restore the kube-proxy/kube-dns pods.
This is a 3-node cluster (1 master and 2 nodes) set up in fully autoscaling mode. The overlay network uses the default kubenet. Below is a snapshot of pod statuses once the system runs into issues:
[root@ip-172-31-18-162 ~]# kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
infyvora vora-catalog-1549734119-cfnhz 0/2 CrashLoopBackOff 188 20h
infyvora vora-consul-0 0/1 CrashLoopBackOff 101 20h
infyvora vora-consul-1 1/1 Running 34 20h
infyvora vora-consul-2 0/1 CrashLoopBackOff 95 20h
infyvora vora-deployment-operator-293895365-4b3t6 0/1 Completed 104 20h
infyvora vora-disk-0 1/2 CrashLoopBackOff 187 20h
infyvora vora-dlog-0 0/2 CrashLoopBackOff 226 20h
infyvora vora-dlog-1 1/2 CrashLoopBackOff 155 20h
infyvora vora-doc-store-2451237348-dkrm6 0/2 CrashLoopBackOff 229 20h
infyvora vora-elasticsearch-logging-v1-444540252-mwfrz 0/1 CrashLoopBackOff 100 20h
infyvora vora-elasticsearch-logging-v1-444540252-vrr63 1/1 Running 14 20h
infyvora vora-elasticsearch-retention-policy-137762458-ns5pc 1/1 Running 13 20h
infyvora vora-fluentd-kubernetes-v1.21-9f4pt 1/1 Running 12 20h
infyvora vora-fluentd-kubernetes-v1.21-s2t1j 0/1 CrashLoopBackOff 99 20h
infyvora vora-grafana-2929546178-vrf5h 1/1 Running 13 20h
infyvora vora-graph-435594712-47lcg 0/2 CrashLoopBackOff 157 20h
infyvora vora-kibana-logging-3693794794-2qn86 0/1 CrashLoopBackOff 99 20h
infyvora vora-landscape-2532068267-w1f5n 0/2 CrashLoopBackOff 232 20h
infyvora vora-nats-streaming-1569990702-kcl1v 1/1 Running 13 20h
infyvora vora-prometheus-node-exporter-k4c3g 0/1 CrashLoopBackOff 102 20h
infyvora vora-prometheus-node-exporter-xp511 1/1 Running 13 20h
infyvora vora-prometheus-pushgateway-399610745-tcfk7 0/1 CrashLoopBackOff 103 20h
infyvora vora-prometheus-server-3955170982-xpct0 2/2 Running 24 20h
infyvora vora-relational-376953862-w39tc 0/2 CrashLoopBackOff 237 20h
infyvora vora-security-operator-2514524099-7ld0k 0/1 CrashLoopBackOff 103 20h
infyvora vora-thriftserver-409431919-8c1x9 2/2 Running 28 20h
infyvora vora-time-series-1188816986-f2fbq 1/2 CrashLoopBackOff 184 20h
infyvora vora-tools5tlpt-100252330-mrr9k 0/1 rpc error: code = 4 desc = context deadline exceeded 272 17h
infyvora vora-tools6zr3m-3592177467-n7sxd 0/1 Completed 1 20h
infyvora vora-tx-broker-4168728922-hf8jz 0/2 CrashLoopBackOff 151 20h
infyvora vora-tx-coordinator-3910571185-l0r4n 0/2 CrashLoopBackOff 184 20h
infyvora vora-tx-lock-manager-2734670982-bn7kk 0/2 Completed 228 20h
infyvora vsystem-1230763370-5ckr0 0/1 CrashLoopBackOff 115 20h
infyvora vsystem-auth-1068224543-0g59w 0/1 CrashLoopBackOff 102 20h
infyvora vsystem-vrep-1427606801-zprlr 0/1 CrashLoopBackOff 121 20h
kube-system dns-controller-3110272648-chwrs 1/1 Running 0 22h
kube-system etcd-server-events-ip-172-31-64-102.ap-southeast-1.compute.internal 1/1 Running 0 22h
kube-system etcd-server-ip-172-31-64-102.ap-southeast-1.compute.internal 1/1 Running 0 22h
kube-system kube-apiserver-ip-172-31-64-102.ap-southeast-1.compute.internal 1/1 Running 0 22h
kube-system kube-controller-manager-ip-172-31-64-102.ap-southeast-1.compute.internal 1/1 Running 0 22h
kube-system kube-dns-1311260920-cm1fs 0/3 Completed 309 22h
kube-system kube-dns-1311260920-hm5zd 3/3 Running 39 22h
kube-system kube-dns-autoscaler-1818915203-wmztj 1/1 Running 12 22h
kube-system kube-proxy-ip-172-31-64-102.ap-southeast-1.compute.internal 1/1 Running 0 22h
kube-system kube-proxy-ip-172-31-64-110.ap-southeast-1.compute.internal 0/1 CrashLoopBackOff 98 22h
kube-system kube-proxy-ip-172-31-64-15.ap-southeast-1.compute.internal 1/1 Running 13 22h
kube-system kube-scheduler-ip-172-31-64-102.ap-southeast-1.compute.internal 1/1 Running 0 22h
kube-system tiller-deploy-352283156-97hhb 1/1 Running 34 22h
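Next time this happens, this is roughly the diagnostics I plan to collect before restarting kube-proxy/kube-dns (a sketch; the pod and node names are taken from the listing above, and the kube-dns container name is an assumption):
# Events and previous-container logs for the failing kube-system pods
kubectl -n kube-system describe pod kube-proxy-ip-172-31-64-110.ap-southeast-1.compute.internal
kubectl -n kube-system logs kube-proxy-ip-172-31-64-110.ap-southeast-1.compute.internal --previous
kubectl -n kube-system describe pod kube-dns-1311260920-cm1fs
kubectl -n kube-system logs kube-dns-1311260920-cm1fs -c kubedns --previous   # container name may differ
# Node-level conditions (memory/disk pressure) and recent events
kubectl describe node ip-172-31-64-110.ap-southeast-1.compute.internal
kubectl get events -n kube-system --sort-by=.lastTimestamp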
Has anyone come across a similar issue with a KOPS Kubernetes cluster on AWS? Any pointers to solve this would be appreciated.
Regards,
Deepak

Related

MK_ADDON_ENABLE : run callbacks: running callbacks: waiting for app.kubernetes.io/name=ingress-nginx pods: timed out waiting for the condition

I'm trying to enable the ingress addon by typing:
minikube addons enable ingress
I get this Error:
`X Exiting due to MK_ADDON_ENABLE : run callbacks: running callbacks: [waiting for app.kubernetes.io/name=ingress-nginx pods: timed out waiting for the condition]`
I tried listing the pods with kubectl and it shows:
`NAMESPACE NAME READY STATUS RESTARTS AGE
ingress-nginx ingress-nginx-admission-create-6xqc7 0/1 Completed 0 105m
ingress-nginx ingress-nginx-admission-patch-5qxwp 0/1 Completed 1 105m
ingress-nginx ingress-nginx-controller-5959f988fd-wngnn 0/1 ImageInspectError 0 105m
kube-system coredns-565d847f94-kdcf6 1/1 Running 1 (23m ago) 107m
kube-system etcd-minikube 1/1 Running 1 (23m ago) 107m
kube-system kube-apiserver-minikube 1/1 Running 1 (23m ago) 107m
kube-system kube-controller-manager-minikube 1/1 Running 1 (23m ago) 107m
kube-system kube-proxy-zzrwv 1/1 Running 1 (23m ago) 107m
kube-system kube-scheduler-minikube 1/1 Running 1 (23m ago) 107m
kube-system storage-provisioner 1/1 Running 3 (20m ago) 107m`
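The controller pod is the one failing with ImageInspectError, so what I could check next (a sketch; the pod name is from the listing above, and I'm assuming minikube is using the Docker runtime):
# Events for the failing controller pod usually name the image that cannot be inspected
kubectl -n ingress-nginx describe pod ingress-nginx-controller-5959f988fd-wngnn
# Check what the runtime inside the minikube VM actually pulled
minikube ssh -- docker images | grep ingress-nginx
# Re-try the addon from scratch
minikube addons disable ingress
minikube addons enable ingress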

deploy exceptionless on k8s! Error Back-off restarting failed container

I got the exceptionless Helm chart; my values.yaml is https://github.com/mypublicuse/myfile/blob/main/el-values.yaml
I ran into two errors.
1:
Error: INSTALLATION FAILED: Deployment.apps "exceptionless-elasticsearch" is invalid: spec.template.spec.initContainers[0].image: Required value
So I edited elasticsearch.yaml and added:
spec:
  initContainers:
    - name: sysctl
      image: mydockerhost/busybox:1.35
so that helm could install it.
2: After helm install,
I found:
exless-nfsclient-nfs-subdir-external-provisioner-7fc86846fmlbgz 1/1 Running 0 52m
exceptionless-redis-85956947f-7vkpg 1/1 Running 0 49m
exceptionless-app-6547d4d88d-2hkbg 1/1 Running 0 49m
exceptionless-elasticsearch-76f6cc9b9-2jgks 1/1 Running 0 49m
exceptionless-jobs-web-hooks-7bb9d7477c-kpmwv 0/1 CrashLoopBackOff 14 (2m53s ago) 49m
exceptionless-jobs-event-notifications-844cb87665-bd7bt 0/1 CrashLoopBackOff 14 (2m53s ago) 49m
exceptionless-jobs-mail-message-647d6bd897-s8jmq 0/1 CrashLoopBackOff 14 (2m55s ago) 49m
exceptionless-jobs-event-usage-75c6d6d54d-m5rjr 0/1 CrashLoopBackOff 14 (2m46s ago) 49m
exceptionless-jobs-work-item-c74d77b55-th4g7 0/1 CrashLoopBackOff 14 (2m34s ago) 49m
exceptionless-jobs-daily-summary-6c99dfbc87-7zq5k 0/1 CrashLoopBackOff 14 (2m34s ago) 49m
exceptionless-jobs-event-posts-75777759b8-nsmbw 0/1 CrashLoopBackOff 14 (2m32s ago) 49m
exceptionless-jobs-close-inactive-sessions-b49595f49-hmfxm 0/1 CrashLoopBackOff 14 (2m14s ago) 49m
exceptionless-jobs-event-user-descriptions-5c9d5dc768-8h27z 0/1 CrashLoopBackOff 14 (2m16s ago) 49m
exceptionless-jobs-stack-event-count-54ffcfb4b6-gk6mz 0/1 CrashLoopBackOff 14 (2m ago) 49m
exceptionless-jobs-maintain-indexes-27669970-s28cg 0/1 CrashLoopBackOff 5 (94s ago) 4m30s
exceptionless-collector-5c774fd8ff-6ksvx 0/1 CrashLoopBackOff 2 (11s ago) 37s
exceptionless-api-66fc9cc659-zckzz 0/1 CrashLoopBackOff 3 (9s ago) 55s
The api, collector and jobs pods are not successful!
I need help! Thanks!
The pod log is just:
Back-off restarting failed container
Yes, just that!
I guess the program starts and immediately crashes, so ... (a sketch of what I plan to try next is below).
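To get past the generic back-off message, I plan to pull the events and the logs of the crashed containers themselves (a sketch; pod names are taken from the listing above):
# Events show the exit code, OOM kills, or probe failures behind the back-off
kubectl describe pod exceptionless-api-66fc9cc659-zckzz
# Logs of the previous (crashed) container instance
kubectl logs exceptionless-api-66fc9cc659-zckzz --previous
kubectl logs exceptionless-jobs-web-hooks-7bb9d7477c-kpmwv --previous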

rook-ceph-osd-prepare pod stuck for hours

I am new to Ceph and am using Rook to install Ceph in a k8s cluster. I see that the pod rook-ceph-osd-prepare has been in Running status forever, stuck on the line below:
2020-06-15 20:09:02.260379 D | exec: Running command: ceph auth get-or-create-key
client.bootstrap-osd mon allow profile bootstrap-osd --connect-timeout=15 --cluster=rook-ceph
--conf=/var/lib/rook/rook-ceph/rook-ceph.config
--name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring
--format json --out-file /tmp/180401029
When I logged into the container and ran the same command, I saw that it was stuck too; after pressing ^C it showed this:
Traceback (most recent call last):
File "/usr/bin/ceph", line 1266, in <module>
retval = main()
File "/usr/bin/ceph", line 1197, in main
verbose)
File "/usr/bin/ceph", line 622, in new_style_command
ret, outbuf, outs = do_command(parsed_args, target, cmdargs, sigdict, inbuf, verbose)
File "/usr/bin/ceph", line 596, in do_command
return ret, '', ''
Below are all my pods:
rook-ceph csi-cephfsplugin-9k9z2 3/3 Running 0 9h
rook-ceph csi-cephfsplugin-mjsbk 3/3 Running 0 9h
rook-ceph csi-cephfsplugin-mrqz5 3/3 Running 0 9h
rook-ceph csi-cephfsplugin-provisioner-5ffbdf7856-59cf7 5/5 Running 0 9h
rook-ceph csi-cephfsplugin-provisioner-5ffbdf7856-m4bhr 5/5 Running 0 9h
rook-ceph csi-cephfsplugin-xgvz4 3/3 Running 0 9h
rook-ceph csi-rbdplugin-6k4dk 3/3 Running 0 9h
rook-ceph csi-rbdplugin-klrwp 3/3 Running 0 9h
rook-ceph csi-rbdplugin-provisioner-68d449986d-2z9gr 6/6 Running 0 9h
rook-ceph csi-rbdplugin-provisioner-68d449986d-mzh9d 6/6 Running 0 9h
rook-ceph csi-rbdplugin-qcmrj 3/3 Running 0 9h
rook-ceph csi-rbdplugin-zdg8z 3/3 Running 0 9h
rook-ceph rook-ceph-crashcollector-k8snode001-76ffd57d58-slg5q 1/1 Running 0 9h
rook-ceph rook-ceph-crashcollector-k8snode002-85b6d9d699-s8m8z 1/1 Running 0 9h
rook-ceph rook-ceph-crashcollector-k8snode004-847bdb4fc5-kk6bd 1/1 Running 0 9h
rook-ceph rook-ceph-mgr-a-5497fcbb7d-lq6tf 1/1 Running 0 9h
rook-ceph rook-ceph-mon-a-6966d857d9-s4wch 1/1 Running 0 9h
rook-ceph rook-ceph-mon-b-649c6845f4-z46br 1/1 Running 0 9h
rook-ceph rook-ceph-mon-c-67869b76c7-4v6zn 1/1 Running 0 9h
rook-ceph rook-ceph-operator-5968d8f7b9-hsfld 1/1 Running 0 9h
rook-ceph rook-ceph-osd-prepare-k8snode001-j25xv 1/1 Running 0 7h48m
rook-ceph rook-ceph-osd-prepare-k8snode002-6fvlx 0/1 Completed 0 9h
rook-ceph rook-ceph-osd-prepare-k8snode003-cqc4g 0/1 Completed 0 9h
rook-ceph rook-ceph-osd-prepare-k8snode004-jxxtl 0/1 Completed 0 9h
rook-ceph rook-discover-28xj4 1/1 Running 0 9h
rook-ceph rook-discover-4ss66 1/1 Running 0 9h
rook-ceph rook-discover-bt8rd 1/1 Running 0 9h
rook-ceph rook-discover-q8f4x 1/1 Running 0 9h
Please let me know if anyone has any hints to resolve or troubleshoot this.
In my case, the problem was that my Kubernetes hosts were not on the same kernel version.
Once I upgraded the kernel version to match all the other nodes, the issue was resolved.
In my case, one of my nodes' system clock was not synchronized with its hardware clock, so there was a time gap between the nodes.
Maybe you should check the output of the timedatectl command.
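A quick way to check both suggestions is to compare kernel versions from the Kubernetes side and look at the clock status on each node; a minimal sketch (nothing here is specific to Rook):
# Kernel version per node is in the KERNEL-VERSION column
kubectl get nodes -o wide
# On each node: is the clock synchronized, and does the RTC match system time?
timedatectl
# If NTP is off, enable it (this uses systemd-timesyncd; chrony setups differ)
sudo timedatectl set-ntp true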

Gluster cluster in Kubernetes: Glusterd inactive (dead) after node reboot. How to debug?

I don't know what to do to debug it. I have 1 Kubernetes master node and three slave nodes. I deployed a Gluster cluster on the three slave nodes just fine with this guide: https://github.com/gluster/gluster-kubernetes/blob/master/docs/setup-guide.md
I created volumes and everything was working. But when I reboot a slave node and it reconnects to the master node, glusterd.service inside the slave node shows up as dead and nothing works after that.
[root@kubernetes-node-1 /]# systemctl status glusterd.service
● glusterd.service - GlusterFS, a clustered file-system server
Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
Active: inactive (dead)
I don't know what to do from here. For example, /var/log/glusterfs/glusterd.log was last updated 3 days ago (it's not being updated with errors after a reboot or a pod deletion+recreation).
I just want to know where glusterd crashes so I can find out why.
How can I debug this crash?
All the nodes (master + slaves) run as Ubuntu Desktop 18.04 LTS 64-bit VirtualBox VMs.
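I can run commands like these at the same prompt where I checked the status above, but I'm not sure what to look for (a sketch; I'm assuming the systemd journal keeps glusterd's output even when the log file is stale):
# Why did glusterd exit? Check the journal for the service since boot
journalctl -u glusterd.service -b --no-pager
# Try starting it by hand and watch the result
systemctl start glusterd.service
systemctl status glusterd.service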
requested logs (kubectl get all --all-namespaces):
NAMESPACE NAME READY STATUS RESTARTS AGE
glusterfs pod/glusterfs-7nl8l 0/1 Running 62 22h
glusterfs pod/glusterfs-wjnzx 1/1 Running 62 2d21h
glusterfs pod/glusterfs-wl4lx 1/1 Running 112 41h
glusterfs pod/heketi-7495cdc5fd-hc42h 1/1 Running 0 22h
kube-system pod/coredns-86c58d9df4-n2hpk 1/1 Running 0 6d12h
kube-system pod/coredns-86c58d9df4-rbwjq 1/1 Running 0 6d12h
kube-system pod/etcd-kubernetes-master-work 1/1 Running 0 6d12h
kube-system pod/kube-apiserver-kubernetes-master-work 1/1 Running 0 6d12h
kube-system pod/kube-controller-manager-kubernetes-master-work 1/1 Running 0 6d12h
kube-system pod/kube-flannel-ds-amd64-785q8 1/1 Running 5 3d19h
kube-system pod/kube-flannel-ds-amd64-8sj2z 1/1 Running 8 3d19h
kube-system pod/kube-flannel-ds-amd64-v62xb 1/1 Running 0 3d21h
kube-system pod/kube-flannel-ds-amd64-wx4jl 1/1 Running 7 3d21h
kube-system pod/kube-proxy-7f6d9 1/1 Running 5 3d19h
kube-system pod/kube-proxy-7sf9d 1/1 Running 0 6d12h
kube-system pod/kube-proxy-n9qxq 1/1 Running 8 3d19h
kube-system pod/kube-proxy-rwghw 1/1 Running 7 3d21h
kube-system pod/kube-scheduler-kubernetes-master-work 1/1 Running 0 6d12h
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 6d12h
elastic service/glusterfs-dynamic-9ad03769-2bb5-11e9-8710-0800276a5a8e ClusterIP 10.98.38.157 <none> 1/TCP 2d19h
elastic service/glusterfs-dynamic-a77e02ca-2bb4-11e9-8710-0800276a5a8e ClusterIP 10.97.203.225 <none> 1/TCP 2d19h
elastic service/glusterfs-dynamic-ad16ed0b-2bb6-11e9-8710-0800276a5a8e ClusterIP 10.105.149.142 <none> 1/TCP 2d19h
glusterfs service/heketi ClusterIP 10.101.79.224 <none> 8080/TCP 2d20h
glusterfs service/heketi-storage-endpoints ClusterIP 10.99.199.190 <none> 1/TCP 2d20h
kube-system service/kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP 6d12h
NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
glusterfs daemonset.apps/glusterfs 3 3 0 3 0 storagenode=glusterfs 2d21h
kube-system daemonset.apps/kube-flannel-ds-amd64 4 4 4 4 4 beta.kubernetes.io/arch=amd64 3d21h
kube-system daemonset.apps/kube-flannel-ds-arm 0 0 0 0 0 beta.kubernetes.io/arch=arm 3d21h
kube-system daemonset.apps/kube-flannel-ds-arm64 0 0 0 0 0 beta.kubernetes.io/arch=arm64 3d21h
kube-system daemonset.apps/kube-flannel-ds-ppc64le 0 0 0 0 0 beta.kubernetes.io/arch=ppc64le 3d21h
kube-system daemonset.apps/kube-flannel-ds-s390x 0 0 0 0 0 beta.kubernetes.io/arch=s390x 3d21h
kube-system daemonset.apps/kube-proxy 4 4 4 4 4 <none> 6d12h
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
glusterfs deployment.apps/heketi 1/1 1 0 2d20h
kube-system deployment.apps/coredns 2/2 2 2 6d12h
NAMESPACE NAME DESIRED CURRENT READY AGE
glusterfs replicaset.apps/heketi-7495cdc5fd 1 1 0 2d20h
kube-system replicaset.apps/coredns-86c58d9df4 2 2 2 6d12h
requested:
tasos@kubernetes-master-work:~$ kubectl logs -n glusterfs glusterfs-7nl8l
env variable is set. Update in gluster-blockd.service
Please check these similar topics:
GlusterFS deployment on k8s cluster-- Readiness probe failed: /usr/local/bin/status-probe.sh
and
https://github.com/gluster/gluster-kubernetes/issues/539
Check the tcmu-runner.log file to debug it.
UPDATE:
I think this will be your issue:
https://github.com/gluster/gluster-kubernetes/pull/557
The PR is prepared, but not merged yet.
UPDATE 2:
https://github.com/gluster/glusterfs/issues/417
Be sure that rpcbind is installed.
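If rpcbind turns out to be missing or stopped on the rebooted node, a minimal sketch of what to check (assuming a Debian/Ubuntu host like the VMs described above; the glusterfs pod name is from the listing):
# Is rpcbind installed and running on the node?
dpkg -l rpcbind
systemctl status rpcbind
# Install/enable it if needed
sudo apt-get install -y rpcbind
sudo systemctl enable --now rpcbind
# Then recreate the glusterfs pod on that node so glusterd starts cleanly
kubectl -n glusterfs delete pod glusterfs-7nl8l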

After installing Kubernetes add-on heapster it takes a long time to load dashboard pod & node status page

After installing the Kubernetes heapster add-on, it takes a long time to load the dashboard pod & node status pages. During this time I didn't see high CPU or memory usage on my nodes. I deployed these manifests:
deploy/kube-config/influxdb/grafana.yaml
deploy/kube-config/influxdb/heapster.yaml
deploy/kube-config/influxdb/influxdb.yaml
Here are running pods and services.
[root@master01 heapster]# kubectl get svc --namespace=kube-system
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default-http-backend 10.108.228.89 <none> 80/TCP 9h
heapster 10.100.200.252 <none> 80/TCP 3h
kube-dns 10.96.0.10 <none> 53/UDP,53/TCP 14d
kubernetes-dashboard 10.96.25.153 <nodes> 80:31511/TCP 8d
monitoring-grafana 10.102.103.4 <none> 80/TCP 3h
monitoring-influxdb 10.101.51.148 <none> 8086/TCP 3h
[root@master01 heapster]# kubectl get pods --namespace=kube-system
NAME READY STATUS RESTARTS AGE
default-http-backend-2198840601-99gk4 1/1 Running 1 5h
etcd-master01 1/1 Running 14 11d
heapster-1428305041-hnw3k 1/1 Running 1 3h
kube-apiserver-master01 1/1 Running 31 14d
kube-controller-manager-master01 1/1 Running 27 14d
kube-dns-3913472980-5kpg2 3/3 Running 568 14d
kube-flannel-ds-66x5q 2/2 Running 34 10d
kube-flannel-ds-6ls9n 2/2 Running 39 10d
kube-flannel-ds-htggq 2/2 Running 41 10d
kube-proxy-0cp1q 1/1 Running 28 14d
kube-proxy-98p5n 1/1 Running 20 14d
kube-proxy-rgjw2 1/1 Running 22 14d
kube-scheduler-master01 1/1 Running 27 14d
kubernetes-dashboard-3858955849-srhqf 1/1 Running 21 7d
monitoring-grafana-3975459543-r9v50 1/1 Running 1 3h
monitoring-influxdb-3480804314-t6nt7 1/1 Running 1 3h
I have restarted all three nodes, but it still takes a long time to load the pod status page.
This page took 4.2 minutes to load: http://localhost/api/v1/namespaces/kube-system/services/kubernetes-dashboard/proxy/api/v1/pod/default?itemsPerPage=10&page=1
If I remove these three apps (heapster, grafana, influxdb), it loads normally.
Any tips to debug this issue? A sketch of what I plan to check next is below.
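What I plan to check next, roughly (a sketch; pod names are from the listing above):
# Does the dashboard reach heapster? The dashboard log shows the metric calls it makes
kubectl -n kube-system logs kubernetes-dashboard-3858955849-srhqf --tail=100
# Heapster and influxdb logs for slow or failing scrapes
kubectl -n kube-system logs heapster-1428305041-hnw3k --tail=100
kubectl -n kube-system logs monitoring-influxdb-3480804314-t6nt7 --tail=100
# Is the heapster service backed by a ready endpoint?
kubectl -n kube-system get endpoints heapster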
thanks
SR