Ceph (cephadm) quincy: can't add OSD from remote nodes (command hanging) - ceph

I'm stuck with a problem while trying to create a cluster of 3 nodes (AWS EC2 instances):
fa11.something.com ~ # ceph orch host ls
HOST ADDR LABELS STATUS
fa11.something.com 172.16.24.67 _admin
fa12.something.com 172.16.23.159 _admin
fa13.something.com 172.16.25.119 _admin
3 hosts in cluster
Each of them has 2 disks (all accepted by Ceph):
fa11.something.com ~ # ceph orch device ls
HOST PATH TYPE DEVICE ID SIZE AVAILABLE REFRESHED REJECT REASONS
fa11.something.com /dev/nvme1n1 ssd Amazon_Elastic_Block_Store_vol016651cf7f3b9c9dd 8589M Yes 7m ago
fa11.something.com /dev/nvme2n1 ssd Amazon_Elastic_Block_Store_vol034082d7d364dfbdb 5368M Yes 7m ago
fa12.something.com /dev/nvme1n1 ssd Amazon_Elastic_Block_Store_vol0ec193fa3f77fee66 8589M Yes 3m ago
fa12.something.com /dev/nvme2n1 ssd Amazon_Elastic_Block_Store_vol018736f7eeab725f5 5368M Yes 3m ago
fa13.something.com /dev/nvme1n1 ssd Amazon_Elastic_Block_Store_vol0443a031550be1024 8589M Yes 84s ago
fa13.something.com /dev/nvme2n1 ssd Amazon_Elastic_Block_Store_vol0870412d37717dc2c 5368M Yes 84s ago
fa11.something.com is the first host, from which I manage the cluster.
Adding an OSD on fa11.something.com itself works fine:
fa11.something.com ~ # ceph orch daemon add osd fa11.something.com:/dev/nvme1n1
Created osd(s) 0 on host 'fa11.something.com'
But it doesn't work for the other 2 hosts (it hangs forever):
fa11.something.com ~ # ceph orch daemon add osd fa12.something.com:/dev/nvme1n1
^CInterrupted
Logs on fa12.something.com show that it hangs at the following step:
fa12.something.com ~ # tail /var/log/ceph/a9ef6c26-ac38-11ed-9429-06e6bc29c1db/ceph-volume.log
...
[2023-02-14 07:38:20,942][ceph_volume.process][INFO ] Running command: /usr/bin/ceph-authtool --gen-print-key
[2023-02-14 07:38:20,964][ceph_volume.process][INFO ] Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new a51506c2-e910-4763-9a0c-f6c2194944e2
I'm not sure what might be the reason for this hang.
Additional details:
cephadm was installed using curl (https://docs.ceph.com/en/quincy/cephadm/install/#curl-based-installation)
I use the user ceph instead of root, and port 2222 instead of 22. The first node was bootstrapped using the command below:
cephadm bootstrap --mon-ip 172.16.24.67 --allow-fqdn-hostname --ssh-user ceph --ssh-config /home/anton/ceph/ssh_config --cluster-network 172.16.16.0/20 --skip-monitoring-stack
Content of /home/anton/ceph/ssh_config:
fa11.something.com ~ # cat /home/anton/ceph/ssh_config
Host *
User ceph
Port 2222
IdentityFile /home/ceph/.ssh/id_rsa
StrictHostKeyChecking no
UserKnownHostsFile=/dev/null
Hosts fa12.something.com and fa13.something.com were added using the commands:
ceph orch host add fa12.something.com 172.16.23.159 --labels _admin
ceph orch host add fa13.something.com 172.16.25.119 --labels _admin
I'm not sure whether I should check that some specific ports are not blocked? I just expected to get an error at an early stage if Ceph couldn't reach some port...
Thanks in advance for any suggestion!

It turned out the problem was caused by some blocked port(s). I tried dropping all restrictions at the iptables level: iptables -F; iptables -P INPUT ACCEPT, and it started to work. I'll check later exactly which ports are needed. I guess they should be described somewhere here: https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/
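For reference, a rough sketch of rules that should be enough, assuming the default ports from the network config reference linked above (monitors on 3300/6789, OSDs/MGRs/MDS in the 6800-7300 range); adjust to your own iptables policy:
# allow the Ceph monitors (msgr2 and legacy msgr1)
iptables -A INPUT -p tcp -m multiport --dports 3300,6789 -j ACCEPT
# allow OSD, MGR and MDS daemons, which bind in the 6800-7300 range
iptables -A INPUT -p tcp --dport 6800:7300 -j ACCEPT
Since these are EC2 instances, the security group also has to allow the same ports between the nodes.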

Related

Debug Alpine Image in K8s: No `netstat`, no `ip`, no `apk`

There is a container in my Kubernetes cluster which I want to debug.
But there is no netstat, no ip and no apk.
Is there a way to upgrade this image, so that the common tools are installed?
In this case it is the nginx container image in a K8s 1.23 cluster.
Alpine is a stripped-down image used to reduce the footprint, so the absence of those tools is expected. However, since Kubernetes 1.23 you can use the kubectl debug command to attach a debug container to the subject pod.
Syntax:
kubectl debug -it <POD_TO_DEBUG> --image=ubuntu --target=<CONTAINER_TO_DEBUG> --share-processes
Example:
In the example below, an ubuntu container is attached to the nginx (alpine-based) pod that needs debugging. Note that the ps -eaf output shows the nginx processes and cat /etc/os-release shows Ubuntu, indicating that the process namespace is shared/visible between the two containers.
ps@kube-master:~$ kubectl debug -it nginx --image=ubuntu --target=nginx --share-processes
Targeting container "nginx". If you don't see processes from this container, the container runtime doesn't support this feature.
Defaulting debug container name to debugger-2pgtt.
If you don't see a command prompt, try pressing enter.
root@nginx:/# ps -eaf
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 19:50 ? 00:00:00 nginx: master process nginx -g daemon off;
101 33 1 0 19:50 ? 00:00:00 nginx: worker process
101 34 1 0 19:50 ? 00:00:00 nginx: worker process
101 35 1 0 19:50 ? 00:00:00 nginx: worker process
101 36 1 0 19:50 ? 00:00:00 nginx: worker process
root 248 0 1 20:00 pts/0 00:00:00 bash
root 258 248 0 20:00 pts/0 00:00:00 ps -eaf
root@nginx:/#
Debugging with Ubuntu, as seen here, arms us with all sorts of tools:
root@nginx:/# cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.3 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.3 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
root@nginx:/#
If ephemeral containers need to be enabled in your cluster, you can enable them via feature gates as described here.
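If your control plane is older than v1.23 (where the gate is still off by default), a rough sketch of enabling it on a kubeadm-managed cluster; the manifest path is the kubeadm default and may differ in your setup:
# /etc/kubernetes/manifests/kube-apiserver.yaml - add under the container's command:
- --feature-gates=EphemeralContainers=true
# the kubelet on each node needs the same feature gate (e.g. via its config file), then:
sudo systemctl restart kubelet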
The whole point of using containers is to optimize the resource utilization in your cluster. The images used should only include the packages that are needed to run your app.
The unwanted packages should be removed from your images (especially in prod) to reduce compute utilization and shrink the attack surface.
This appears to be a stripped-down image that has only the libraries needed to run that application.
In order to debug, you will have to create a new container in the same pid and network namespace as the container you are trying to debug.
Build container first
Dockerfile
FROM alpine
RUN apk update && apk add strace
CMD ["strace", "-p", "1"]
Build
$ docker build -t strace .
Run
docker run -t --pid=container:<targetContainer> \
--net=container:<targetContainer> \
--cap-add sys_admin \
--cap-add sys_ptrace \
strace
strace: Process 1 attached
futex(0xd72e90, FUTEX_WAIT, 0, NULL
https://rothgar.medium.com/how-to-debug-a-running-docker-container-from-a-separate-container-983f11740dc6

Resizing Pool to 1 has been disabled by default on Ceph Pacific stable 6.0

I implemented Ceph Pacific Stable 6.0 with ceph-ansible on Ubuntu 20.04 LTS,
but when I want to change the size of my pool from 3 to 1 with the following command:
sudo ceph -n client.admin --keyring=/etc/ceph/ceph.client.admin.keyring osd pool set cephfs_data size 1
I get the following error:
Error EPERM: configuring pool size as 1 is disabled by default.
Add the following option under section [global] in ceph.conf:
mon_allow_pool_size_one = true
Then restart the monitor service.
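Roughly, for a ceph-ansible/package-based deployment where the monitors run as ceph-mon@<hostname> systemd units (the unit name may differ in your environment):
[global]
mon_allow_pool_size_one = true

# then, on each monitor host:
sudo systemctl restart ceph-mon@$(hostname -s)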
For those working with Quincy (17.2.3) you can do:
$ ceph config set global mon_allow_pool_size_one true
$ ceph osd pool set data_pool min_size 1
$ ceph osd pool set data_pool size 1 --yes-i-really-mean-it

The connection to the server x.x.x.:6443 was refused - did you specify the right host or port? Kubernetes

I've installed Docker, kubectl and kubeadm.
I want to create my device model and device CRDs (I'm following this guide).
So, when I run the command:
kubectl create -f devices_v1alpha1_devicemodel.yaml
as a user, I get the following output:
The connection to the server 10.0.0.68:6443 was refused - did you
specify the right host or port?
(I have added the permission for the user to access the .kube folder)
With netstat, I get:
ubuntu@kubernetesmaster:~/src/github.com/kubeedge/kubeedge/build/crds/devices$ sudo netstat -atunp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address            Foreign Address       State        PID/Program name
tcp        0      0 0.0.0.0:22               0.0.0.0:*             LISTEN       1298/sshd
tcp        0    224 10.0.0.68:22             160.98.31.160:52503   ESTABLISHED  2061/sshd: ubuntu [
tcp6       0      0 :::22                    :::*                  LISTEN       1298/sshd
udp        0      0 0.0.0.0:68               0.0.0.0:*                          910/dhclient
udp        0      0 10.0.0.68:123            0.0.0.0:*                          1241/ntpd
udp        0      0 127.0.0.1:123            0.0.0.0:*                          1241/ntpd
udp        0      0 0.0.0.0:123              0.0.0.0:*                          1241/ntpd
udp6       0      0 fe80::f816:3eff:fe0:123  :::*                               1241/ntpd
udp6       0      0 2001:620:5ca1:2f0:f:123  :::*                               1241/ntpd
udp6       0      0 ::1:123                  :::*                               1241/ntpd
udp6       0      0 :::123                   :::*                               1241/ntpd
With lsof -i:
ubuntu@kubernetesmaster:~/src/github.com/kubeedge/kubeedge/build/crds/devices$ sudo lsof -i
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
dhclient 910 root 6u IPv4 12765 0t0 UDP *:bootpc
ntpd 1241 ntp 16u IPv6 15340 0t0 UDP *:ntp
ntpd 1241 ntp 17u IPv4 15343 0t0 UDP *:ntp
ntpd 1241 ntp 18u IPv4 15347 0t0 UDP localhost:ntp
ntpd 1241 ntp 19u IPv4 15349 0t0 UDP 10.0.0.68:ntp
ntpd 1241 ntp 20u IPv6 15351 0t0 UDP ip6-localhost:ntp
ntpd 1241 ntp 21u IPv6 15353 0t0 UDP [2001:620:5ca1:2f0:f816:3eff:fe0a:874a]:ntp
ntpd 1241 ntp 22u IPv6 15355 0t0 UDP [fe80::f816:3eff:fe0a:874a]:ntp
sshd 1298 root 3u IPv4 18821 0t0 TCP *:ssh (LISTEN)
sshd 1298 root 4u IPv6 18830 0t0 TCP *:ssh (LISTEN)
sshd 2061 root 3u IPv4 18936 0t0 TCP 10.0.0.68:ssh->160.98.31.160:52503 (ESTABLISHED)
sshd 2124 ubuntu 3u IPv4 18936 0t0 TCP 10.0.0.68:ssh->160.98.31.160:52503 (ESTABLISHED)
I've already tried this
and: sudo swapoff -a
Please perform the steps below on the master node. It works like a charm.
1. sudo -i
2. swapoff -a
3. exit
4. strace -eopenat kubectl version
I was facing a similar problem with the following error while deploying the pod network into the cluster using flannel:
$ kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
The connection to the server 192.168.1.101:6443 was refused - did you specify the right host or port?
I performed the steps below to solve the issue:
$ sudo systemctl stop kubelet
$ sudo systemctl start kubelet
$ strace -eopenat kubectl version
then apply the yml file
$ kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
podsecuritypolicy.policy/psp.flannel.unprivileged created
clusterrole.rbac.authorization.k8s.io/flannel created
clusterrolebinding.rbac.authorization.k8s.io/flannel created
serviceaccount/flannel created
configmap/kube-flannel-cfg created
daemonset.apps/kube-flannel-ds created
kubelet must be down. You need to check the kubelet logs on the master and ensure the API server is running and online. Only then will you be able to deploy.
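For example, a quick check (assuming a systemd-managed kubelet and Docker as the runtime; use crictl ps instead of docker ps with containerd):
sudo systemctl status kubelet
sudo journalctl -u kubelet --no-pager | tail -n 50
docker ps | grep kube-apiserver   # the apiserver container should be Up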
I'll add another reason for this error that was the issue in my case.
I had exported the wrong kubeconfig file to the shell, and the error message was very accurate in that case: the endpoint for the API server was wrong (and of course other fields like the cluster name and the certificates too, but the server endpoint is the first step in the chain).
I've encountered this problem too, and swapoff -a worked for me:
sudo -i
swapoff -a
exit
strace -eopenat kubectl version
This is because Docker is down. Start Docker on your machine.
I tried many ways but couldn't get it to work, then accidentally found the solution to my own situation:
In ~/.kube/ I have
drwxr-x--- 4 staff 128 25 Jul 22:31 cache
-rw------- 1 staff 8781 25 Jul 22:46 config
drwxr-xr-x 8 staff 256 25 Jul 22:46 configs
-rw-r--r-- 1 staff 14 25 Jul 22:31 kubectx
drwxr-xr-x 4 staff 128 29 Jun 16:59 kubens
My assumption is that something was messed up in the .kube configuration, but I couldn't figure out which files, so I removed most of the directories/files, including cache and config. (If you don't need to keep your configs, maybe you should just remove all of them.)
Then, from the Docker dashboard, re-enable Kubernetes to get all the files installed back.
Re-configure docker-desktop with kubectl config use-context docker-desktop.
Finally, my 6443 responded.
Another suggestion is to restart your container runtime: sudo systemctl restart containerd. In my situation I'm working with containerd and not a typical Docker setup.
After doing this it should fix the issue for you, but if it doesn't work then try sudo swapoff -a.
Assuming that's all the output from your netstat command and that you ran it on the master node (the one you installed Kubernetes via kubeadm on), it looks to me like the installation did not complete correctly, as none of the usual ports you would expect to see on a Kubernetes master node are present.
Usually on a Kubernetes master node you'd expect to see kube-apiserver, kube-scheduler, kube-controller-manager, kubelet and possibly etcd all listening on the network.
What was the output of your kubeadm init command?
I ran into this issue as well; I tried the solutions noted above and they did not work for me. Here is what worked for me:
FIX:
kubeadm init --apiserver-advertise-address=10.139.0.42 --ignore-preflight-errors all --pod-network-cidr=172.17.0.1/16 --token-ttl 0
source:
https://www.c-sharpcorner.com/article/kubernetes-installation-in-redhat-and-centos/
I faced this issue recently due to expired certificates for my K8S cluster.
I followed this blog link to renew the certificates and also replace the kube config file that I was using.
Note: it is important to replace the kubeconfig after renewing the certificates, or else you will end up getting the following error message from kubectl:
error: You must be logged in to the server (Unauthorized)
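On kubeadm 1.19+ (older releases use kubeadm alpha certs instead) the renewal itself is roughly the following sketch; the linked blog has the full procedure:
# check which certificates have expired
sudo kubeadm certs check-expiration
# renew them all
sudo kubeadm certs renew all
# restart the control-plane static pods (kube-apiserver etc.) so they pick up the new certs,
# then replace the kubeconfig you use, since admin.conf is regenerated
mkdir -p $HOME/.kube
sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config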
I faced the same issue with the same error. You need to check that your container runtime (docker/containerd) is active and running:
systemctl status containerd
systemctl restart containerd
systemctl restart kubelet
Then if you check its status, it should be up and running, and you will be able to create k8s objects.
In my case KUBECONFIG was the problem causing the same error.
This solved it:
export KUBECONFIG=/home/$(whoami)/.kube/config
I solved this exact problem by making sure that in the /etc/hosts file the IP address and host name was set correctly:
192.168.10.11 kube-01.testing kube-01
instead of:
127.0.1.1 kube-01.testing kube-01
(or 127.0.0.1).
This happens on Ubuntu and Debian as far as I know.
If you did all the above steps (sudo swapoff -a, kubeconfig file permissions, kubelet and containerd status, etc.) and nothing works for you, it is a good idea to take a look at the kubelet logs:
journalctl -xeu kubelet
In my case, I realized that the kubelet wanted to download images but was failing.
So I turned on the VPN on the node, and after a couple of seconds everything worked!
I have faced the same issue, "The connection to the server {IP}:6443 was refused - did you specify the right host or port?"
The reason was that since Kubernetes 1.24, dockershim has been removed.
So, when installing a Kubernetes cluster with kubeadm and using Docker as the container runtime, cri-dockerd must be installed as well (otherwise I got the above error).
As mentioned in the Kubernetes documentation:
On each of your nodes, install Docker for your Linux distribution as
per Install Docker Engine.
Install cri-dockerd, following the
instructions in that source code repository.
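Roughly, after installing cri-dockerd from its releases, you point kubeadm at its socket; a sketch assuming the systemd units and default socket path shipped with the cri-dockerd packages:
sudo systemctl enable --now cri-docker.socket
sudo kubeadm init --cri-socket unix:///var/run/cri-dockerd.sock
kubeadm join on worker nodes takes the same --cri-socket flag.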
As per the error message, it clearly says the connection to port 6443 was refused. That means either:
the port is blocked by a firewall, or,
if there is no firewall, nothing is listening on port 6443 on the specified host. You can cross-verify this using the command below:
# netstat -tulpn | grep -i 6443
Solution:
6443 is the kube-apiserver port in k8s. If it is not listening, make sure kube-apiserver is running properly. I faced the same problem, and fixed it by setting the correct arguments so the API server binds to that port:
/usr/local/bin/kube-apiserver \\
--advertise-address=${INTERNAL_IP} \\
--allow-privileged=true \\
--apiserver-count=3 \\
--audit-log-maxage=30 \\
--audit-log-maxbackup=3 \\
--audit-log-maxsize=100 \\
--audit-log-path=/var/log/audit.log \\
--authorization-mode=Node,RBAC \\
--bind-address=0.0.0.0 \\
--client-ca-file=/var/lib/kubernetes/ca.crt \\
--enable-admission-plugins=NodeRestriction,ServiceAccount \\
--enable-swagger-ui=true \\
--enable-bootstrap-token-auth=true \\
--etcd-cafile=/var/lib/kubernetes/ca.crt \\
--etcd-certfile=/var/lib/kubernetes/etcd-server.crt \\
--etcd-keyfile=/var/lib/kubernetes/etcd-server.key \\
--etcd-servers=https://192.168.5.11:2379,https://192.168.5.12:2379 \\
--event-ttl=1h \\
--encryption-provider-config=/var/lib/kubernetes/encryption-config.yaml \\
--kubelet-certificate-authority=/var/lib/kubernetes/ca.crt \\
--kubelet-client-certificate=/var/lib/kubernetes/kube-apiserver.crt \\
--kubelet-client-key=/var/lib/kubernetes/kube-apiserver.key \\
--kubelet-https=true \\
--runtime-config=api/all \\
--service-account-key-file=/var/lib/kubernetes/service-account.crt \\
--service-cluster-ip-range=10.96.0.0/24 \\
--service-node-port-range=30000-32767 \\
--tls-cert-file=/var/lib/kubernetes/kube-apiserver.crt \\
--tls-private-key-file=/var/lib/kubernetes/kube-apiserver.key \\
--v=2

Why does kubeadm not start even after disabling swap?

I am trying to install Kubernetes with kubeadm on my laptop, which runs Ubuntu 16.04. I have disabled swap, since kubelet does not work with swap on. The command I used is:
swapoff -a
I also commented out the reference to swap in /etc/fstab.
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# <file system> <mount point> <type> <options> <dump> <pass>
# / was on /dev/sda1 during installation
UUID=1d343a19-bd75-47a6-899d-7c8bc93e28ff / ext4 errors=remount-ro 0 1
# swap was on /dev/sda5 during installation
#UUID=d0200036-b211-4e6e-a194-ac2e51dfb27d none swap sw 0 0
I confirmed swap is turned off by running the following:
free -m
total used free shared buff/cache available
Mem: 15936 2108 9433 954 4394 12465
Swap: 0 0 0
When I start kubeadm, I get the following error:
kubeadm init --pod-network-cidr=10.244.0.0/16
[init] Using Kubernetes version: v1.14.2
[preflight] Running pre-flight checks
[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR Swap]: running with swap on is not supported. Please disable swap
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
I also tried restarting my laptop, but I get the same error. What could the reason be?
Below was the root cause:
detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd".
You need to update the Docker cgroup driver.
Follow the fix below:
cat > /etc/docker/daemon.json <<EOF
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2",
  "storage-opts": [
    "overlay2.override_kernel_check=true"
  ]
}
EOF
mkdir -p /etc/systemd/system/docker.service.d
# Restart Docker
systemctl daemon-reload
systemctl restart docker
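You can confirm the driver change took effect before re-running kubeadm, for example:
docker info | grep -i "cgroup driver"
# expected after the change:
# Cgroup Driver: systemd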
You could try kubeadm reset, then kubeadm init --ignore-preflight-errors Swap.
First try with sudo:
sudo swapoff -a
Then check if anything is still swapped:
cat /proc/swaps
and
free -h
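If swap comes back after a reboot even with the /etc/fstab entry commented out, a systemd swap unit may still be activating it; a rough check (the unit name below assumes the /dev/sda5 swap partition shown in the fstab above, adjust to yours):
systemctl --type swap --all
# mask the unit so it cannot be re-activated, e.g.:
sudo systemctl mask dev-sda5.swap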

Kubernetes cluster deployment in packstack

I am trying to deploy a k8s cluster in OpenStack Rocky, but after a long time it fails. I've checked the orchestration stack and see that the kube_minions resource never completes. Checking the log output of all the instances created:
[ 196.817505] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 215.082433] random: crng init done
Fedora 27 (Atomic Host)
Kernel 4.14.18-300.fc27.x86_64 on an x86_64 (ttyS0)
host-10-0-0-3 login: [  691.438618] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[ 691.516277] Bridge firewalling registered
[ 692.149217] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
[ 701.932912] IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready
Digging deeper into the instances, I've found that the master node cannot start the heat-container-agent service...
_prefix=docker.io/openstackmagnum/
atomic install --storage ostree --system --system-package no --set REQUESTS_CA_BUNDLE=/etc/pki/tls/certs/ca-bundle.crt --name heat-container-agent docker.io/openstackmagnum/heat-container-agent:rocky-stable
systemctl start heat-container-agent
Failed to start heat-container-agent.service: Unit heat-container-agent.service not found.
2019-04-04 14:57:40,238 - util.py[WARNING]: Failed running
/var/lib/cloud/instance/scripts/part-013 [5]