rook-ceph-osd-prepare pod stuck for hours - kubernetes

I am new to Ceph and am using Rook to install Ceph in a Kubernetes cluster. The rook-ceph-osd-prepare pod has been in Running status forever, stuck on the line below:
2020-06-15 20:09:02.260379 D | exec: Running command: ceph auth get-or-create-key
client.bootstrap-osd mon allow profile bootstrap-osd --connect-timeout=15 --cluster=rook-ceph
--conf=/var/lib/rook/rook-ceph/rook-ceph.config
--name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring
--format json --out-file /tmp/180401029
When I logged into the container and ran the same command, it hung as well; after pressing ^C it showed this:
Traceback (most recent call last):
  File "/usr/bin/ceph", line 1266, in <module>
    retval = main()
  File "/usr/bin/ceph", line 1197, in main
    verbose)
  File "/usr/bin/ceph", line 622, in new_style_command
    ret, outbuf, outs = do_command(parsed_args, target, cmdargs, sigdict, inbuf, verbose)
  File "/usr/bin/ceph", line 596, in do_command
    return ret, '', ''
Below are all my pods:
rook-ceph csi-cephfsplugin-9k9z2 3/3 Running 0 9h
rook-ceph csi-cephfsplugin-mjsbk 3/3 Running 0 9h
rook-ceph csi-cephfsplugin-mrqz5 3/3 Running 0 9h
rook-ceph csi-cephfsplugin-provisioner-5ffbdf7856-59cf7 5/5 Running 0 9h
rook-ceph csi-cephfsplugin-provisioner-5ffbdf7856-m4bhr 5/5 Running 0 9h
rook-ceph csi-cephfsplugin-xgvz4 3/3 Running 0 9h
rook-ceph csi-rbdplugin-6k4dk 3/3 Running 0 9h
rook-ceph csi-rbdplugin-klrwp 3/3 Running 0 9h
rook-ceph csi-rbdplugin-provisioner-68d449986d-2z9gr 6/6 Running 0 9h
rook-ceph csi-rbdplugin-provisioner-68d449986d-mzh9d 6/6 Running 0 9h
rook-ceph csi-rbdplugin-qcmrj 3/3 Running 0 9h
rook-ceph csi-rbdplugin-zdg8z 3/3 Running 0 9h
rook-ceph rook-ceph-crashcollector-k8snode001-76ffd57d58-slg5q 1/1 Running 0 9h
rook-ceph rook-ceph-crashcollector-k8snode002-85b6d9d699-s8m8z 1/1 Running 0 9h
rook-ceph rook-ceph-crashcollector-k8snode004-847bdb4fc5-kk6bd 1/1 Running 0 9h
rook-ceph rook-ceph-mgr-a-5497fcbb7d-lq6tf 1/1 Running 0 9h
rook-ceph rook-ceph-mon-a-6966d857d9-s4wch 1/1 Running 0 9h
rook-ceph rook-ceph-mon-b-649c6845f4-z46br 1/1 Running 0 9h
rook-ceph rook-ceph-mon-c-67869b76c7-4v6zn 1/1 Running 0 9h
rook-ceph rook-ceph-operator-5968d8f7b9-hsfld 1/1 Running 0 9h
rook-ceph rook-ceph-osd-prepare-k8snode001-j25xv 1/1 Running 0 7h48m
rook-ceph rook-ceph-osd-prepare-k8snode002-6fvlx 0/1 Completed 0 9h
rook-ceph rook-ceph-osd-prepare-k8snode003-cqc4g 0/1 Completed 0 9h
rook-ceph rook-ceph-osd-prepare-k8snode004-jxxtl 0/1 Completed 0 9h
rook-ceph rook-discover-28xj4 1/1 Running 0 9h
rook-ceph rook-discover-4ss66 1/1 Running 0 9h
rook-ceph rook-discover-bt8rd 1/1 Running 0 9h
rook-ceph rook-discover-q8f4x 1/1 Running 0 9h
Does anyone have hints on how to troubleshoot or resolve this?

In my case, the problem was that one of my Kubernetes hosts was not on the same kernel version as the others.
Once I upgraded the kernel to match the other nodes, the issue was resolved.
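For example, the kernel version of every node is visible straight from the API, so you can confirm a mismatch without logging into each host:
kubectl get nodes -o wide    # compare the KERNEL-VERSION column across nodes
uname -r                     # or run this on each host directly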

In my case, the system clock on one of my nodes was not synchronized with the hardware clock, so there was a time gap between the nodes (Ceph monitors are sensitive to clock skew).
Check the output of the timedatectl command on each node.
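For example, run this on every node and compare (the exact field names vary slightly between systemd versions):
timedatectl                  # look for "System clock synchronized: yes" and an active NTP service
date +%s                     # the epoch seconds should be nearly identical on all nodes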

Related

Why does the official helm chart for fluent-bit start 20 pods

I followed the official helm chart for fluent-bit and ended up with 20 pods in a single namespace. How do I configure it to run just 1 pod?
The replicaCount attribute in values.yaml is already set to 1.
https://github.com/fluent/helm-charts/tree/main/charts/fluent-bit
helm upgrade -i fluent-bit helm/efk/fluent-bit --namespace some-ns
kubectl get pods -n some-ns
NAME READY STATUS RESTARTS AGE
fluent-bit-22dx4 1/1 Running 0 15h
fluent-bit-2x6rn 1/1 Running 0 15h
fluent-bit-42rfd 1/1 Running 0 15h
fluent-bit-54drx 1/1 Running 0 15h
fluent-bit-8f8pl 1/1 Running 0 15h
fluent-bit-8rtp9 1/1 Running 0 15h
fluent-bit-8wfcc 1/1 Running 0 15h
fluent-bit-bffh8 1/1 Running 0 15h
fluent-bit-lgl9k 1/1 Running 0 15h
fluent-bit-lqdrs 1/1 Running 0 15h
fluent-bit-mdvlc 1/1 Running 0 15h
fluent-bit-qgvww 1/1 Running 0 15h
fluent-bit-qqwh6 1/1 Running 0 15h
fluent-bit-qxbjt 1/1 Running 0 15h
fluent-bit-rqr8g 1/1 Running 0 15h
fluent-bit-t8vbv 1/1 Running 0 15h
fluent-bit-vkcfl 1/1 Running 0 15h
fluent-bit-wnwtq 1/1 Running 0 15h
fluent-bit-xqwxk 1/1 Running 0 15h
fluent-bit-xxj8q 1/1 Running 0 15h
Note that the chart's templates support two workload kinds: DaemonSet and Deployment.
This is controlled by the kind field of the values file.
Your values currently set kind to DaemonSet, so one replica is started on every node, regardless of replicaCount or affinity, which is why you see 20 pods.
If you only want a single pod, set kind to Deployment and replicaCount to 1, then redeploy (see below).
values.yaml
# Default values for fluent-bit.
# kind -- DaemonSet or Deployment
kind: Deployment
# replicaCount -- Only applicable if kind=Deployment
replicaCount: 1
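With kind set to Deployment and replicaCount left at 1, re-running the upgrade should leave a single pod. For example, using the chart path from your question (you can also override the values on the command line instead of editing values.yaml):
helm upgrade -i fluent-bit helm/efk/fluent-bit --namespace some-ns --set kind=Deployment --set replicaCount=1
kubectl get pods -n some-ns    # should now show a single fluent-bit pod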
https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/
https://kubernetes.io/docs/concepts/workloads/controllers/deployment/

Delete all the pods created by applying Helm 2.13.1

I'm new to Helm. I'm trying to deploy a simple server on the master node. When I run helm install and then check with kubectl get po,svc, I see a lot of pods created other than the ones I intended to deploy. So, my precise questions are:
Why were so many pods created?
How do I delete all of those pods?
Below is the output of the command kubectl get po,svc:
NAME READY STATUS RESTARTS AGE
pod/altered-quoll-stx-sdo-chart-6446644994-57n7k 1/1 Running 0 25m
pod/austere-garfish-stx-sdo-chart-5b65d8ccb7-jjxfh 1/1 Running 0 25m
pod/bald-hyena-stx-sdo-chart-9b666c998-zcfwr 1/1 Running 0 25m
pod/cantankerous-pronghorn-stx-sdo-chart-65f5699cdc-5fkf9 1/1 Running 0 25m
pod/crusty-unicorn-stx-sdo-chart-7bdcc67546-6d295 1/1 Running 0 25m
pod/exiled-puffin-stx-sdo-chart-679b78ccc5-n68fg 1/1 Running 0 25m
pod/fantastic-waterbuffalo-stx-sdo-chart-7ddd7b54df-p78h7 1/1 Running 0 25m
pod/gangly-quail-stx-sdo-chart-75b9dd49b-rbsgq 1/1 Running 0 25m
pod/giddy-pig-stx-sdo-chart-5d86844569-5v8nn 1/1 Running 0 25m
pod/hazy-indri-stx-sdo-chart-65d4c96f46-zmvm2 1/1 Running 0 25m
pod/interested-macaw-stx-sdo-chart-6bb7874bbd-k9nnf 1/1 Running 0 25m
pod/jaundiced-orangutan-stx-sdo-chart-5699d9b44b-6fpk9 1/1 Running 0 25m
pod/kindred-nightingale-stx-sdo-chart-5cf95c4d97-zpqln 1/1 Running 0 25m
pod/kissing-snail-stx-sdo-chart-854d848649-54m9w 1/1 Running 0 25m
pod/lazy-tiger-stx-sdo-chart-568fbb8d65-gr6w7 1/1 Running 0 25m
pod/nonexistent-octopus-stx-sdo-chart-5f8f6c7ff8-9l7sm 1/1 Running 0 25m
pod/odd-boxer-stx-sdo-chart-6f5b9679cc-5stk7 1/1 Running 1 15h
pod/orderly-chicken-stx-sdo-chart-7889b64856-rmq7j 1/1 Running 0 25m
pod/redis-697fb49877-x5hr6 1/1 Running 0 25m
pod/rv.deploy-6bbffc7975-tf5z4 1/2 CrashLoopBackOff 93 30h
pod/sartorial-eagle-stx-sdo-chart-767d786685-ct7mf 1/1 Running 0 25m
pod/sullen-gnat-stx-sdo-chart-579fdb7df7-4z67w 1/1 Running 0 25m
pod/undercooked-cow-stx-sdo-chart-67875cc5c6-mwvb7 1/1 Running 0 25m
pod/wise-quoll-stx-sdo-chart-5db8c766c9-mhq8v 1/1 Running 0 21m
You can run helm ls to see all the Helm releases deployed in your cluster.
To remove a release (and every resource it created, including the pods), run: helm delete RELEASE_NAME --purge.
If you want to delete all the pods in your namespace without deleting your Helm release (I DON'T think this is what you're looking for), you can run: kubectl delete pods --all.
On a side note, if you're new to Helm, consider starting with Helm v3: it brings many improvements, and the migration from v2 to v3 can be cumbersome, so if you can avoid it, you should.
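Concretely, the helm ls / helm delete flow mentioned above could look like this with Helm v2 (double-check the release names before purging anything):
helm ls                                             # find the release name(s) your install created
helm delete RELEASE_NAME --purge                    # remove one release and everything it created
helm ls --short | xargs -L1 helm delete --purge     # or purge every release in the cluster (use with care)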

Failed to open topo server on vitess with etcd

I'm running a simple example with Helm. Take a look at the values.yaml file below:
cat << EOF | helm install helm/vitess -n vitess -f -
topology:
  cells:
    - name: 'zone1'
      keyspaces:
        - name: 'vitess'
          shards:
            - name: '0'
              tablets:
                - type: 'replica'
                  vttablet:
                    replicas: 1
      mysqlProtocol:
        enabled: true
        authType: secret
        username: vitess
        passwordSecret: vitess-db-password
      etcd:
        replicas: 3
      vtctld:
        replicas: 1
      vtgate:
        replicas: 3
vttablet:
  dataVolumeClaimSpec:
    storageClassName: nfs-slow
EOF
Here is the output of the pods currently running:
$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-fb8b8dccf-8f5kt 1/1 Running 0 32m
kube-system coredns-fb8b8dccf-qbd6c 1/1 Running 0 32m
kube-system etcd-master1 1/1 Running 0 32m
kube-system kube-apiserver-master1 1/1 Running 0 31m
kube-system kube-controller-manager-master1 1/1 Running 0 32m
kube-system kube-flannel-ds-amd64-bkg9z 1/1 Running 0 32m
kube-system kube-flannel-ds-amd64-q8vh4 1/1 Running 0 32m
kube-system kube-flannel-ds-amd64-vqmnz 1/1 Running 0 32m
kube-system kube-proxy-bd8mf 1/1 Running 0 32m
kube-system kube-proxy-nlc2b 1/1 Running 0 32m
kube-system kube-proxy-x7cd5 1/1 Running 0 32m
kube-system kube-scheduler-master1 1/1 Running 0 32m
kube-system tiller-deploy-8458f6c667-cx2mv 1/1 Running 0 27m
vitess etcd-global-6pwvnv29th 0/1 Init:0/1 0 16m
vitess etcd-operator-84db9bc774-j4wml 1/1 Running 0 30m
vitess etcd-zone1-zwgvd7spzc 0/1 Init:0/1 0 16m
vitess vtctld-86cd78b6f5-zgfqg 0/1 CrashLoopBackOff 7 16m
vitess vtgate-zone1-58744956c4-x8ms2 0/1 CrashLoopBackOff 7 16m
vitess zone1-vitess-0-init-shard-master-mbbph 1/1 Running 0 16m
vitess zone1-vitess-0-replica-0 0/6 Init:CrashLoopBackOff 7 16m
Checking the logs, I see this error:
$ kubectl logs -n vitess vtctld-86cd78b6f5-zgfqg
++ cat
+ eval exec /vt/bin/vtctld '-cell="zone1"' '-web_dir="/vt/web/vtctld"' '-web_dir2="/vt/web/vtctld2/app"' -workflow_manager_init -workflow_manager_use_election -logtostderr=true -stderrthreshold=0 -port=15000 -grpc_port=15999 '-service_map="grpc-vtctl"' '-topo_implementation="etcd2"' '-topo_global_server_address="etcd-global-client.vitess:2379"' -topo_global_root=/vitess/global
++ exec /vt/bin/vtctld -cell=zone1 -web_dir=/vt/web/vtctld -web_dir2=/vt/web/vtctld2/app -workflow_manager_init -workflow_manager_use_election -logtostderr=true -stderrthreshold=0 -port=15000 -grpc_port=15999 -service_map=grpc-vtctl -topo_implementation=etcd2 -topo_global_server_address=etcd-global-client.vitess:2379 -topo_global_root=/vitess/global
ERROR: logging before flag.Parse: E0422 02:35:34.020928 1 syslogger.go:122] can't connect to syslog
F0422 02:35:39.025400 1 server.go:221] Failed to open topo server (etcd2,etcd-global-client.vitess:2379,/vitess/global): grpc: timed out when dialing
I'm running this on Vagrant with 1 master and 2 nodes. I suspect this is an issue with eth1.
The storage is configured to use NFS.
$ kubectl logs etcd-operator-84db9bc774-j4wml
time="2019-04-22T17:26:51Z" level=info msg="skip reconciliation: running ([]), pending ([etcd-zone1-zwgvd7spzc])" cluster-name=etcd-zone1 cluster-namespace=vitess pkg=cluster
time="2019-04-22T17:26:51Z" level=info msg="skip reconciliation: running ([]), pending ([etcd-zone1-zwgvd7spzc])" cluster-name=etcd-global cluster-namespace=vitess pkg=cluster
It appears that etcd is not fully initializing. Note that neither the pod for the global lockserver (etcd-global-6pwvnv29th) nor the local one for cell zone1 (etcd-zone1-zwgvd7spzc) is ready.
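To dig further, it may help to look at why the etcd members never leave the Init state; for example, with plain kubectl (the init-container name is shown in the describe output):
kubectl -n vitess describe pod etcd-global-6pwvnv29th            # check the Events section and the init container status
kubectl -n vitess logs etcd-global-6pwvnv29th -c <init-container-name>
kubectl -n vitess get svc etcd-global-client                     # the address vtctld is timing out against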

Calico etcd has no key named calico

I have a 2-node Kubernetes cluster with Calico networking. All the pods are up and running.
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-etcd-94466 1/1 Running 0 21h
kube-system calico-kube-controllers-5fdcfdbdf7-xsjxb 1/1 Running 0 14d
kube-system calico-node-hmnf5 2/2 Running 0 14d
kube-system calico-node-vmmmk 2/2 Running 0 14d
kube-system coredns-78fcdf6894-dlqg6 1/1 Running 0 14d
kube-system coredns-78fcdf6894-zwrd6 1/1 Running 0 14d
kube-system etcd-kube-master-01 1/1 Running 0 14d
kube-system kube-apiserver-kube-master-01 1/1 Running 0 14d
kube-system kube-controller-manager-kube-master-01 1/1 Running 0 14d
kube-system kube-proxy-nxfht 1/1 Running 0 14d
kube-system kube-proxy-qnn45 1/1 Running 0 14d
kube-system kube-scheduler-kube-master-01 1/1 Running 0 14d
I wanted to query calico-etcd using etcdctl, but I got the following error.
# etcdctl --debug --endpoints "http://10.142.137.11:6666" get calico
start to sync cluster using endpoints(http://10.142.137.11:6666)
cURL Command: curl -X GET http://10.142.137.11:6666/v2/members
got endpoints(http://10.142.137.11:6666) after sync
Cluster-Endpoints: http://10.142.137.11:6666
cURL Command: curl -X GET http://10.142.137.11:6666/v2/keys/calico?quorum=false&recursive=false&sorted=false
Error: 100: Key not found (/calico) [4]
Any pointers on why I get this error?
As @JakubBujny mentioned, set ETCDCTL_API=3 to get the appropriate result: Calico stores its data through the etcd v3 API, while etcdctl defaults to the v2 API, which cannot see those keys; hence the "Key not found" error.
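For example, the equivalent query through the v3 API would look roughly like this (same endpoint as above):
ETCDCTL_API=3 etcdctl --endpoints=http://10.142.137.11:6666 get /calico --prefix --keys-only
# lists every key stored under the /calico prefix via the v3 API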

kubectl cannot exec into or get logs from a pod on another node

Kubernetes v1.8.2, installed with kubeadm.
2 nodes:
NAME STATUS ROLES AGE VERSION
192-168-99-102.node Ready <none> 8h v1.8.2
192-168-99-108.master Ready master 8h v1.8.2
I ran nginx to test:
NAME READY STATUS RESTARTS AGE IP NODE
curl-6896d87888-smvjm 1/1 Running 0 7h 10.244.1.99 192-168-99-102.node
nginx-fbb985966-5jbxd 1/1 Running 0 7h 10.244.1.94 192-168-99-102.node
nginx-fbb985966-8vp9g 1/1 Running 0 8h 10.244.1.93 192-168-99-102.node
nginx-fbb985966-9bqzh 1/1 Running 1 7h 10.244.0.85 192-168-99-108.master
nginx-fbb985966-fd22h 1/1 Running 1 7h 10.244.0.83 192-168-99-108.master
nginx-fbb985966-lmgmf 1/1 Running 0 7h 10.244.1.98 192-168-99-102.node
nginx-fbb985966-lr2rh 1/1 Running 0 7h 10.244.1.96 192-168-99-102.node
nginx-fbb985966-pm2p7 1/1 Running 0 7h 10.244.1.97 192-168-99-102.node
nginx-fbb985966-t6d8b 1/1 Running 0 7h 10.244.1.95 192-168-99-102.node
kubectl exec into a pod on the master works fine,
but when I exec into a pod on the other node, it returns an error:
kubectl exec -it nginx-fbb985966-8vp9g bash
error: unable to upgrade connection: pod does not exist
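A common first check for this error (just a diagnostic sketch, this thread does not include a confirmed fix) is whether the kubelet on the worker node registered with an address the API server can actually reach, since exec and logs are proxied through the kubelet:
kubectl get nodes -o wide                          # compare INTERNAL-IP with the address the node is reachable on
kubectl get pod nginx-fbb985966-8vp9g -o wide      # confirm which node the pod is scheduled on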