Single-node ceph cluster unresponsive from client - ceph

I have attempted to set up a small one-node Ceph cluster for some proof-of-concept work with CephFS. The cluster is running CentOS 7 with:
# ceph --version
ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)
The cluster appears healthy:
# ceph -s
  cluster:
    id:     fa18d061-b6fd-4092-bbe3-31f4f8493360
    health: HEALTH_OK
  services:
    mon: 1 daemons, quorum se-ceph1-dev
    mgr: se-ceph1-dev(active)
    mds: cephfs-1/1/1 up {0=se-ceph1-dev=up:active}
    osd: 1 osds: 1 up, 1 in
  data:
    pools:   2 pools, 64 pgs
    objects: 22 objects, 2.2 KiB
    usage:   1.0 GiB used, 39 GiB / 40 GiB avail
    pgs:     64 active+clean
All ceph commands work perfectly on the OSD node (which is also the mon, mgr, and mds). However, any attempt to access the cluster as a client (default user admin) from another machine is completely ignored.
For instance:
cephcli$ ceph status
2020-07-08 08:12:58.358 7fa4c568e700 0 monclient(hunting): authenticate timed out after 300
2020-07-08 08:17:58.360 7fa4c568e700 0 monclient(hunting): authenticate timed out after 300
2020-07-08 08:22:58.362 7fa4c568e700 0 monclient(hunting): authenticate timed out after 300
2020-07-08 08:27:58.364 7fa4c568e700 0 monclient(hunting): authenticate timed out after 300
2020-07-08 08:32:58.363 7fa4c568e700 0 monclient(hunting): authenticate timed out after 300
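(A raw TCP probe of the monitor port from the client, e.g. with netcat, is a quick way to check whether the connection ever completes; 10.19.4.159 is the OSD/mon node and 6789 the monitor port:)
cephcli$ nc -zv 10.19.4.159 6789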
The client machine is running Ubuntu 18.04.1 and has the same release of Ceph installed as the OSD node:
cephcli$ ceph --version
ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)
I have verified that no clients are blacklisted:
# ceph osd blacklist ls
listed 0 entries
I have verified that the various ceph agents are listening on their respective ports on the OSD node:
# netstat -tnlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:6800 0.0.0.0:* LISTEN 32591/ceph-osd
tcp 0 0 0.0.0.0:6801 0.0.0.0:* LISTEN 32591/ceph-osd
tcp 0 0 0.0.0.0:6802 0.0.0.0:* LISTEN 32591/ceph-osd
tcp 0 0 0.0.0.0:6803 0.0.0.0:* LISTEN 32591/ceph-osd
tcp 0 0 0.0.0.0:6804 0.0.0.0:* LISTEN 33279/ceph-mds
tcp 0 0 0.0.0.0:6805 0.0.0.0:* LISTEN 32579/ceph-mgr
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 13881/sshd
tcp 0 0 127.0.0.1:25 0.0.0.0:* LISTEN 14038/master
tcp 0 0 10.19.4.159:6789 0.0.0.0:* LISTEN 32580/ceph-mon
tcp6 0 0 :::22 :::* LISTEN 13881/sshd
I have verified that the client is indeed sending requests to the OSD node using tcpdump on port 6789:
# tcpdump -i ens192 port 6789 -x -n
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens192, link-type EN10MB (Ethernet), capture size 262144 bytes
08:42:05.183071 IP 10.19.4.84.37170 > 10.19.4.159.smc-https: Flags [S], seq 4146143942, win 64240, options [mss 1460,sackOK,TS val 1566694440 ecr 0,nop,wscale 7], length 0
0x0000: 4500 003c c7d9 4000 4006 55ca 0a13 0454
0x0010: 0a13 049f 9132 1a85 f721 22c6 0000 0000
0x0020: a002 faf0 30cd 0000 0204 05b4 0402 080a
0x0030: 5d61 dc28 0000 0000 0103 0307
08:42:05.383784 IP 10.19.4.84.37172 > 10.19.4.159.smc
I have verified on the client that the /etc/ceph/ceph.client.admin.keyring file contains the same key as is on the OSD node.
I've checked the monitor log and see entries when I make requests on the OSD node:
2020-07-08 10:17:12.414 7f06268a3700 0 log_channel(audit) log [DBG] : from='client.? 10.19.4.159:0/3709075926' entity='client.admin' cmd=[{"prefix": "status"}]: dispatch
However, there is nothing reflecting the requests I'm making from the client node.
So requests are making it to the OSD node, but I'm not getting any response. Where have I gone wrong?

In case anyone stumbles upon this, I found the answer! At least - the answer for my specific issue.
My OSD host was set up in the default "defensive" mode, with an iptables rule that rejected all incoming packets except for SSH. To delete the rule (in my case):
sudo iptables -D INPUT -j REJECT --reject-with icmp-host-prohibited
Once I did that, the client could immediately connect.
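To see which INPUT rules are in effect (and spot the offending REJECT rule), you can list them with line numbers:
# iptables -L INPUT -n --line-numbers
If you would rather keep the firewall up, opening the Ceph ports explicitly should also work (a sketch assuming firewalld, the CentOS 7 default, and the public zone; 6789 is the monitor port and 6800-7300 the default range for the other daemons):
# firewall-cmd --zone=public --permanent --add-port=6789/tcp
# firewall-cmd --zone=public --permanent --add-port=6800-7300/tcp
# firewall-cmd --reload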
The Ceph troubleshooting guide actually mentions this in the "clock skews" section:
https://docs.ceph.com/docs/mimic/rados/troubleshooting/troubleshooting-mon/#clock-skews

Related

Kubernetes node failed to join master due to "Timeout exceeded while awaiting headers" error

I am trying to set up a k8s cluster with a master and two worker nodes in DigitalOcean.
My Config:
I have created three droplets as follows:
Master: 2cpu, 3GB Mem
Worker Node1: 1cpu, 2GB Mem
Worker Node2: 1cpu, 2GB Mem
I was able to set up the master node successfully:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
master Ready master 139m v1.18.3
I am unable to add a worker to the master.
Command I ran to join:
$ kubeadm join <PUBLIC IP>:6443 --token <token> --discovery-token-ca-cert-hash <hash>
The token had 23h of validity left at the time I executed the above command.
Error that I got:
W0528 14:13:09.920404 25129 join.go:346] [preflight] WARNING: JoinControlPane.controlPlane settings will be ignored when control-plane flag is not set.
[preflight] Running pre-flight checks
error execution phase preflight: couldn't validate the identity of the API Server: Get https://PUBLIC_IP:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
To see the stack trace of this error execute with --v=5 or higher
My observations on this issue:
$ netstat -pnltu
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 127.0.0.1:40389 0.0.0.0:* LISTEN 25074/kubelet
tcp 0 0 127.0.0.1:10248 0.0.0.0:* LISTEN 25074/kubelet
tcp 0 0 127.0.0.1:10249 0.0.0.0:* LISTEN 25478/kube-proxy
tcp 0 0 127.0.0.1:9099 0.0.0.0:* LISTEN 29823/calico-node
tcp 0 0 127.0.0.1:10257 0.0.0.0:* LISTEN 24580/kube-controll
tcp 0 0 127.0.0.1:10259 0.0.0.0:* LISTEN 24742/kube-schedule
tcp6 0 0 :::10250 :::* LISTEN 25074/kubelet
tcp6 0 0 :::10251 :::* LISTEN 24742/kube-schedule
tcp6 0 0 :::6443 :::* LISTEN 24725/kube-apiserve
tcp6 0 0 :::10252 :::* LISTEN 24580/kube-controll
tcp6 0 0 :::10256 :::* LISTEN 25478/kube-proxy
Is it because the API server is listening on IPv6 instead of IPv4?
Here is the output of cluster-info:
$ kubectl cluster-info
Kubernetes master is running at https://<PUBLIC_IP>:6443
KubeDNS is running at https://<PUBLIC_IP>:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
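The preflight check in the error is just an HTTPS GET, so it can be reproduced directly from the worker node to narrow things down (URL taken from the error message above; -k skips certificate verification):
$ curl -k "https://<PUBLIC_IP>:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s"
If this also times out, the problem is reachability (firewall or network) rather than IPv6: a socket shown as tcp6 on :::6443 normally accepts IPv4 connections as well.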
Any help to fix this issue is much appreciated.

How to enable listening on port 10255 in my kubelet service

I am learning to work with Kubernetes and trying to configure monitoring of my Kubernetes cluster. For this I use Metricbeat and ELK.
After deploying and configuring metricbeat, I get an error:
error making http request: Get http://172.16.0.205:10255/stats/summary: dial tcp 172.16.0.205:10255: connect: connection refused
I found that my Kubelet is not listening on port 10255:
[root@kube2 /]# netstat -ap | grep -i "listen" | grep "kubelet"
tcp 0 0 localhost:40450 0.0.0.0:* LISTEN 8560/kubelet
tcp 0 0 localhost:10248 0.0.0.0:* LISTEN 8560/kubelet
tcp6 0 0 [::]:10250 [::]:* LISTEN 8560/kubelet
How can I enable this port? I found information that I need to use the parameter --read-only-port=10255, but I do not quite understand how to apply it to my kubelet. For example:
[root@kube2 /]# kubelet --config --read-only-port=10255
F1010 13:32:48.592306 15851 server.go:196] failed to load Kubelet config file --read-only-port=10255, error failed to read kubelet config file "/--read-only-port=10255", error: open /--read-only-port=10255: no such file or directory
It doesn't work. Which file does it need?
Can anyone help me with a solution to this problem?
I resolved this issue. I added the flag in /var/lib/kubelet/kubelet-flags on every one of my Kubernetes nodes:
KUBELET_KUBEADM_ARGS="--cgroup-driver=systemd --network-plugin=cni --pod-infra-container-image=k8s.gcr.io/pause:3.1 --read-only-port=10255"
and restarted the kubelet service.
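(On a systemd-managed node, that restart is typically:)
systemctl daemon-reload
systemctl restart kubelet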
Now port 10255 is open:
[root@kube2 7.1]# netstat -ap | grep -i "listen" | grep "kubelet"
tcp 0 0 localhost:44799 0.0.0.0:* LISTEN 6281/kubelet
tcp 0 0 localhost:10248 0.0.0.0:* LISTEN 6281/kubelet
tcp6 0 0 [::]:10250 [::]:* LISTEN 6281/kubelet
tcp6 0 0 [::]:10255 [::]:* LISTEN 6281/kubelet
And I now see Kubernetes data in my Kibana.

Kubernetes master starts only on tcp6, how to join a node?

I have a local Kubernetes master listening on tcp6 :::6443 but not on IPv4 tcp, so how do I run kubeadm join against the right address and port?
tcp6 0 0 :::10250 :::* LISTEN -
tcp6 0 0 :::6443 :::* LISTEN -
tcp6 0 0 :::10251 :::* LISTEN -
Starting Nmap 7.01 ( https://nmap.org ) at 2019-09-25 15:40 CEST
Nmap scan report for 10.0.2.15
Host is up (0.000081s latency).
PORT STATE SERVICE
6443/tcp closed unknown
You should run the command below (on the master host):
$ kubeadm init --apiserver-advertise-address=<private-ip of master host>
--apiserver-advertise-address: if the node should host a new control plane instance, this is the IP address the API Server will advertise it is listening on. If not set, the default network interface will be used.
Now try to run the join command that was generated in the output of kubeadm init. It should work fine.
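If the original join command from kubeadm init is no longer at hand, it can be regenerated on the master:
$ kubeadm token create --print-join-command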
Also, check whether a firewall is running on your master node and blocking incoming traffic. If so, stop it:
systemctl stop firewalld
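If you prefer to keep firewalld running, opening the API server port should work as well (a sketch assuming the default zone):
firewall-cmd --permanent --add-port=6443/tcp
firewall-cmd --reload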

Network connectivity/DNS issues on a GKE 1.10 kubernetes cluster

I'm running into DNS issues on a GKE 1.10 kubernetes cluster. Occasionally pods start without any network connectivity. Restarting the pod tends to fix the issue.
Here's the result of the same few commands inside a container without network, and one with.
BROKEN:
kc exec -it -n iotest app1-b67598997-p9lqk -c userapp sh
/app $ nslookup www.google.com
nslookup: can't resolve '(null)': Name does not resolve
/app $ cat /etc/resolv.conf
nameserver 10.63.240.10
search iotest.svc.cluster.local svc.cluster.local cluster.local c.myproj.internal google.internal
options ndots:5
/app $ curl -I 10.63.240.10
curl: (7) Failed to connect to 10.63.240.10 port 80: Connection refused
/app $ netstat -antp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 127.0.0.1:8001 0.0.0.0:* LISTEN 1/python
tcp 0 0 ::1:50051 :::* LISTEN 1/python
tcp 0 0 ::ffff:127.0.0.1:50051 :::* LISTEN 1/python
WORKING:
kc exec -it -n iotest app1-7d985bfd7b-h5dbr -c userapp sh
/app $ nslookup www.google.com
nslookup: can't resolve '(null)': Name does not resolve
Name: www.google.com
Address 1: 74.125.206.147 wk-in-f147.1e100.net
Address 2: 74.125.206.105 wk-in-f105.1e100.net
Address 3: 74.125.206.99 wk-in-f99.1e100.net
Address 4: 74.125.206.104 wk-in-f104.1e100.net
Address 5: 74.125.206.106 wk-in-f106.1e100.net
Address 6: 74.125.206.103 wk-in-f103.1e100.net
Address 7: 2a00:1450:400c:c04::68 wk-in-x68.1e100.net
/app $ cat /etc/resolv.conf
nameserver 10.63.240.10
search iotest.svc.cluster.local svc.cluster.local cluster.local c.myproj.internal google.internal
options ndots:5
/app $ curl -I 10.63.240.10
HTTP/1.1 404 Not Found
date: Sun, 29 Jul 2018 15:13:47 GMT
server: envoy
content-length: 0
/app $ netstat -antp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 127.0.0.1:15000 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:15001 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:8001 0.0.0.0:* LISTEN 1/python
tcp 0 0 10.60.2.6:56508 10.60.48.22:9091 ESTABLISHED -
tcp 0 0 127.0.0.1:57768 127.0.0.1:50051 ESTABLISHED -
tcp 0 0 10.60.2.6:43334 10.63.255.44:15011 ESTABLISHED -
tcp 0 0 10.60.2.6:15001 10.60.45.26:57160 ESTABLISHED -
tcp 0 0 10.60.2.6:48946 10.60.45.28:9091 ESTABLISHED -
tcp 0 0 127.0.0.1:49804 127.0.0.1:50051 ESTABLISHED -
tcp 0 0 ::1:50051 :::* LISTEN 1/python
tcp 0 0 ::ffff:127.0.0.1:50051 :::* LISTEN 1/python
tcp 0 0 ::ffff:127.0.0.1:50051 ::ffff:127.0.0.1:49804 ESTABLISHED 1/python
tcp 0 0 ::ffff:127.0.0.1:50051 ::ffff:127.0.0.1:57768 ESTABLISHED 1/python
These pods are identical, just one was restarted.
Does anyone have advice about how to analyse and fix this issue?
Some steps to try:
1) ifconfig eth0 or whatever the primary interface is.
Is the interface up? Are the tx and rx packet counts increasing?
2) If the interface is up, you can try tcpdump while running the nslookup command that you posted, to see if the DNS request packets are getting sent out (see the sketch after this list).
3) See which node the pod is scheduled on when network connectivity gets broken. Maybe it is the same node every time? If yes, are other pods on that node running into a similar problem?
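For step 2, a minimal capture of the DNS traffic while re-running the lookup could look like this (assuming tcpdump is available in the container and eth0 is its primary interface):
/app $ tcpdump -ni eth0 udp port 53 &
/app $ nslookup www.google.com
If no packets show up, the traffic is not leaving the pod; if requests go out to 10.63.240.10 but nothing comes back, the problem is upstream of the pod.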
I also faced the same problem, and I simply worked around it for now by switching to the 1.9.x GKE version (after spending many hours trying to debug why my app wasn't working).
Hope this helps!

Cannot curl kubelet read-only port

I have a heapster pod running on one of the nodes in my Kubernetes cluster. It is able to get http://<node-with-heapster-pod>:10255/stats/summary just fine, but whenever it runs the same GET request against another node, it cannot. When I run curl from within any given node I can access that port, but when I curl any node from another machine I get the following error:
Failed to connect to 128.180.120.229 port 10255: No route to host
The following is the netstat output for all ports on which the kubelet is listening:
netstat -ap | grep -i "listen" | grep "kubelet"
tcp 0 0 localhost:10248 0.0.0.0:* LISTEN 7562/kubelet
tcp6 0 0 [::]:4194 [::]:* LISTEN 7562/kubelet
tcp6 0 0 [::]:10250 [::]:* LISTEN 7562/kubelet
tcp6 0 0 [::]:10255 [::]:* LISTEN 7562/kubelet
unix 2 [ ACC ] STREAM LISTENING 621349 7562/kubelet /var/run/dockershim.sock
I apologize for the messy last column. Any ideas why this may be? My iptables rules are set up to accept all incoming connections, and any node can reach port 10250 fine, just not 10255.
You may not have ip_forward enabled on your system. Can you check this setting?
sysctl -n net.ipv4.ip_forward
If anybody still cares, port 10255 is the kubelet's read-only port and may or may not be configured. You can confirm this by accessing the worker node in question and then looking at the kubelet's startup command:
systemctl status kubelet-worker.service
Some on-prem Kubernetes solutions set this to 0, as mentioned below:
https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/
--read-only-port int32 The read-only port for the Kubelet to serve on with no authentication/authorization (set to 0 to disable) (default 10255) (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
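For reference, since the flag is deprecated, the equivalent setting in the kubelet config file (commonly /var/lib/kubelet/config.yaml with kubeadm, though the path depends on your setup) looks roughly like this:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
readOnlyPort: 10255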