Resizing Pool to 1 has been disabled by default on Ceph Pacific stable 6.0 - ceph

I implemented Ceph Pacific Stable 6.0 with Ceph-Ansible on Ubuntu 20.04 LTS,
but when I try to change the size of my pool from 3 to 1 with the following command:
sudo ceph -n client.admin --keyring=/etc/ceph/ceph.client.admin.keyring osd pool set cephfs_data size 1
I get the following error:
Error EPERM: configuring pool size as 1 is disabled by default.

Add the following option under the [global] section in ceph.conf:
mon_allow_pool_size_one = true
Then restart the monitor service.
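A minimal sketch of the resulting config and restart, assuming a systemd-managed monitor (the exact mon unit name depends on your deployment):
[global]
mon_allow_pool_size_one = true
# restart the monitor on each mon host (unit name is deployment-specific)
sudo systemctl restart ceph-mon@$(hostname -s)
Note that recent releases may also require the --yes-i-really-mean-it flag when setting size to 1, as in the Quincy example below.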

For those working with Quincy (17.2.3), you can do:
$ ceph config set global mon_allow_pool_size_one true
$ ceph osd pool set data_pool min_size 1
$ ceph osd pool set data_pool size 1 --yes-i-really-mean-it

Related

Ceph (cephadm) Quincy: can't add OSD from remote nodes (command hanging)

I'm stuck with a problem while trying to create a cluster of 3 nodes (AWS EC2 instances):
fa11.something.com ~ # ceph orch host ls
HOST ADDR LABELS STATUS
fa11.something.com 172.16.24.67 _admin
fa12.something.com 172.16.23.159 _admin
fa13.something.com 172.16.25.119 _admin
3 hosts in cluster
Each of them has 2 disks (all accepted by Ceph):
fa11.something.com ~ # ceph orch device ls
HOST PATH TYPE DEVICE ID SIZE AVAILABLE REFRESHED REJECT REASONS
fa11.something.com /dev/nvme1n1 ssd Amazon_Elastic_Block_Store_vol016651cf7f3b9c9dd 8589M Yes 7m ago
fa11.something.com /dev/nvme2n1 ssd Amazon_Elastic_Block_Store_vol034082d7d364dfbdb 5368M Yes 7m ago
fa12.something.com /dev/nvme1n1 ssd Amazon_Elastic_Block_Store_vol0ec193fa3f77fee66 8589M Yes 3m ago
fa12.something.com /dev/nvme2n1 ssd Amazon_Elastic_Block_Store_vol018736f7eeab725f5 5368M Yes 3m ago
fa13.something.com /dev/nvme1n1 ssd Amazon_Elastic_Block_Store_vol0443a031550be1024 8589M Yes 84s ago
fa13.something.com /dev/nvme2n1 ssd Amazon_Elastic_Block_Store_vol0870412d37717dc2c 5368M Yes 84s ago
fa11.something.com is the first host, from which I manage the cluster.
Adding an OSD on fa11.something.com itself works fine:
fa11.something.com ~ # ceph orch daemon add osd fa11.something.com:/dev/nvme1n1
Created osd(s) 0 on host 'fa11.something.com'
But it doesn't work for the other 2 hosts (it hangs forever):
fa11.something.com ~ # ceph orch daemon add osd fa12.something.com:/dev/nvme1n1
^CInterrupted
Logs on fa12.something.com show that it hangs at the following step:
fa12.something.com ~ # tail /var/log/ceph/a9ef6c26-ac38-11ed-9429-06e6bc29c1db/ceph-volume.log
...
[2023-02-14 07:38:20,942][ceph_volume.process][INFO ] Running command: /usr/bin/ceph-authtool --gen-print-key
[2023-02-14 07:38:20,964][ceph_volume.process][INFO ] Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new a51506c2-e910-4763-9a0c-f6c2194944e2
I'm not sure what might be the reason for this hang.
Additional details:
cephadm was installed using curl (https://docs.ceph.com/en/quincy/cephadm/install/#curl-based-installation)
I use the user ceph instead of root, and port 2222 instead of 22. The first node was bootstrapped using the command below:
cephadm bootstrap --mon-ip 172.16.24.67 --allow-fqdn-hostname --ssh-user ceph --ssh-config /home/anton/ceph/ssh_config --cluster-network 172.16.16.0/20 --skip-monitoring-stack
Content of /home/anton/ceph/ssh_config:
fa11.something.com ~ # cat /home/anton/ceph/ssh_config
Host *
User ceph
Port 2222
IdentityFile /home/ceph/.ssh/id_rsa
StrictHostKeyChecking no
UserKnownHostsFile=/dev/null
Hosts fa12.something.com and fa13.something.com were added using the following commands:
ceph orch host add fa12.something.com 172.16.23.159 --labels _admin
ceph orch host add fa13.something.com 172.16.25.119 --labels _admin
I'm not sure whether I should check that some specific ports are not blocked; I just expected to get an error at an early stage if Ceph couldn't reach some port...
Thanks in advance for any suggestions!
It turned out the problem was caused by blocked port(s). I dropped all restrictions at the iptables level (iptables -F; iptables -P INPUT ACCEPT) and it started to work. I will check later exactly which ports are needed; I guess they are described here: https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/
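If you prefer opening only the ports Ceph needs rather than flushing iptables entirely, here is a sketch based on the defaults from the network configuration reference (the source network is taken from this setup; adjust it to yours):
# monitors: 3300 (msgr2) and 6789 (msgr1)
iptables -A INPUT -p tcp -m multiport --dports 3300,6789 -s 172.16.16.0/20 -j ACCEPT
# OSDs, MGRs and MDSs: 6800-7300
iptables -A INPUT -p tcp --dport 6800:7300 -s 172.16.16.0/20 -j ACCEPT
# cephadm reaches the other hosts over SSH (port 2222 in this setup)
iptables -A INPUT -p tcp --dport 2222 -s 172.16.16.0/20 -j ACCEPT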

Debug Alpine Image in K8s: No `netstat`, no `ip`, no `apk`

There is a container in my Kubernetes cluster which I want to debug.
But there is no netstat, no ip, and no apk.
Is there a way to upgrade this image, so that the common tools are installed?
In this case it is the nginx container image in a K8s 1.23 cluster.
Alpine is a stripped-down image intended to reduce the footprint, so the absence of those tools is expected. However, since Kubernetes 1.23 you can use the kubectl debug command to attach an ephemeral debug container to the pod in question.
Syntax:
kubectl debug -it <POD_TO_DEBUG> --image=ubuntu --target=<CONTAINER_TO_DEBUG> --share-processes
Example:
In the example below, an ubuntu container is attached to the nginx-alpine pod that requires debugging. Note that the ps -eaf output shows the nginx processes running, and cat /etc/os-release shows Ubuntu, indicating that processes are shared/visible between the two containers.
ps@kube-master:~$ kubectl debug -it nginx --image=ubuntu --target=nginx --share-processes
Targeting container "nginx". If you don't see processes from this container, the container runtime doesn't support this feature.
Defaulting debug container name to debugger-2pgtt.
If you don't see a command prompt, try pressing enter.
root@nginx:/# ps -eaf
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 19:50 ? 00:00:00 nginx: master process nginx -g daemon off;
101 33 1 0 19:50 ? 00:00:00 nginx: worker process
101 34 1 0 19:50 ? 00:00:00 nginx: worker process
101 35 1 0 19:50 ? 00:00:00 nginx: worker process
101 36 1 0 19:50 ? 00:00:00 nginx: worker process
root 248 0 1 20:00 pts/0 00:00:00 bash
root 258 248 0 20:00 pts/0 00:00:00 ps -eaf
root@nginx:/#
Debugging with Ubuntu, as seen here, arms us with all sorts of tools:
root@nginx:/# cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.3 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.3 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
root@nginx:/#
If ephemeral containers need to be enabled in your cluster, you can enable them via feature gates as described here.
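For clusters older than 1.23, where ephemeral containers are still behind a feature gate, a rough sketch for a kubeadm-managed control plane (the file path assumes kubeadm static pods):
# /etc/kubernetes/manifests/kube-apiserver.yaml: add to the kube-apiserver command
- --feature-gates=EphemeralContainers=true
# the kubelet on the target node may also need the gate in its KubeletConfiguration:
# featureGates:
#   EphemeralContainers: true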
The whole point of using containers is to optimize the resource utilization in your cluster. The images used should only include the packages that are needed to run your app.
Unwanted packages should be removed from your images (especially in prod) to reduce compute utilization and the attack surface.
This appears to be a stripped-down image that has only the libraries needed to run that application.
In order to debug, you will have to create a new container in the same PID and network namespace as the container you are trying to debug.
Build the container first.
Dockerfile
FROM alpine
RUN apk update && apk add strace
CMD ["strace", "-p", "1"]
Build
$ docker build -t strace .
Run
docker run -t --pid=container:<targetContainer> \
--net=container:<targetContainer> \
--cap-add sys_admin \
--cap-add sys_ptrace \
strace
strace: Process 1 attached
futex(0xd72e90, FUTEX_WAIT, 0, NULL
https://rothgar.medium.com/how-to-debug-a-running-docker-container-from-a-separate-container-983f11740dc6
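If you want an interactive shell with the networking tools the Alpine image lacks, a similar invocation works (a sketch; <targetContainer> is a placeholder for the container you are inspecting):
docker run -it --pid=container:<targetContainer> \
--net=container:<targetContainer> \
--cap-add sys_ptrace \
alpine sh
# inside the debug container, install the missing tools
apk add --no-cache iproute2 net-tools   # provides ip and netstat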

Ceph got timeout

I upgraded my Proxmox cluster from 6 to 7. After upgrading, the Ceph services are not responding. Any command in the console, for example ceph -s, hangs and does not return any result, and in the web interface I see Error got timeout (500).
In the log I saw the error cluster [WRN] overall HEALTH_WARN noout flag(s) set; 1/4 mons down, quorum server-1, server-2, server-3
But when I try to execute ceph osd unset noout I see this error:
2021-12-03T18:59:17.068+0300 7fac8c30d700 0 monclient(hunting): authenticate timed out after 300

Why does kubeadm not start even after disabling swap?

I am trying to install Kubernetes with kubeadm on my laptop, which runs Ubuntu 16.04. I have disabled swap, since kubelet does not work with swap enabled. The command I used is:
swapoff -a
I also commented out the reference to swap in /etc/fstab.
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# <file system> <mount point> <type> <options> <dump> <pass>
# / was on /dev/sda1 during installation
UUID=1d343a19-bd75-47a6-899d-7c8bc93e28ff / ext4 errors=remount-ro 0 1
# swap was on /dev/sda5 during installation
#UUID=d0200036-b211-4e6e-a194-ac2e51dfb27d none swap sw 0 0
I confirmed swap is turned off by running the following:
free -m
total used free shared buff/cache available
Mem: 15936 2108 9433 954 4394 12465
Swap: 0 0 0
When I start kubeadm, I get the following error:
kubeadm init --pod-network-cidr=10.244.0.0/16
[init] Using Kubernetes version: v1.14.2
[preflight] Running pre-flight checks
[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR Swap]: running with swap on is not supported. Please disable swap
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
I also tried restarting my laptop, but I get the same error. What could the reason be?
The warning below was the root cause:
detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd".
You need to update the Docker cgroup driver. Follow the fix below:
cat > /etc/docker/daemon.json <<EOF
{
"exec-opts": ["native.cgroupdriver=systemd"],
"log-driver": "json-file",
"log-opts": {
"max-size": "100m"
},
"storage-driver": "overlay2",
"storage-opts": [
"overlay2.override_kernel_check=true"
]
}
EOF
mkdir -p /etc/systemd/system/docker.service.d
# Restart Docker
systemctl daemon-reload
systemctl restart docker
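After the restart, you can verify that the driver change took effect before re-running kubeadm:
docker info | grep -i 'cgroup driver'
# expected: Cgroup Driver: systemd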
You could try kubeadm reset, then kubeadm init --ignore-preflight-errors Swap.
First try with sudo:
sudo swapoff -a
Then check if there's anything swapped:
cat /proc/swaps
and
free -h

Ran out of Docker disk space

I have this Docker command:
docker run -d mongo
This will pull and run a MongoDB server in a Docker container.
However, I get an error:
no space left on device
I am on macOS, using the newer versions of Docker which use HyperKit instead of VirtualBox (I think that's correct).
Here is the exact error message from the mongo container:
$ docker logs efee16702c5756659d563b98d4ae0f58ecf1f1bba8a54f63443c0ae4b520ab4e
about to fork child process, waiting until server is ready for connections.
forked process: 21
2017-05-04T20:23:51.412+0000 I CONTROL [main] ***** SERVER RESTARTED *****
2017-05-04T20:23:51.430+0000 I CONTROL [main] ERROR: Cannot write pid file to /tmp/tmp.Lo035QkbfL: No space left on device
ERROR: child process failed, exited with error number 1
Any idea how to fix this and prevent it from happening in future?
As suggested, the output of df -h is:
Filesystem Size Used Avail Capacity iused ifree %iused Mounted on
/dev/disk1 465Gi 116Gi 349Gi 25% 1963838 4293003441 0% /
devfs 183Ki 183Ki 0Bi 100% 634 0 100% /dev
map -hosts 0Bi 0Bi 0Bi 100% 0 0 100% /net
map auto_home 0Bi 0Bi 0Bi 100% 0 0 100% /home
Output of docker info is:
$ docker info
Containers: 5
Running: 0
Paused: 0
Stopped: 5
Images: 741
Server Version: 17.03.1-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 4ab9917febca54791c5f071a9d1f404867857fcc
runc version: 54296cf40ad8143b62dbcaa1d90e520a2136ddfe
init version: N/A (expected: 949e6facb77383876aeff8a6944dde66b3089574)
Security Options:
seccomp
Profile: default
Kernel Version: 4.9.13-moby
Operating System: Alpine Linux v3.5
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 1.952 GiB
Name: moby
ID: OR4L:WYWW:FFAP:IDX3:B6UK:O2AN:UVTO:EPH6:GYSV:4GV4:L5WP:BQTH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
File Descriptors: 17
Goroutines: 30
System Time: 2017-05-04T20:45:27.056157913Z
EventsListeners: 1
No Proxy: *.local, 169.254/16
Registry: https://index.docker.io/v1/
Experimental: true
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
As you state in the comments to the question, ls -altrh ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/Docker.qcow2 returns the following:
-rw-r--r--@ 1 alexamil staff 53G
This is a known bug on macOS (and not only macOS); an official dev comment can be found here. One caveat: I read that different people get different size limits. In the comment it is 64 GB, but for another person it was 20 GB.
There are a couple of workarounds, but no definitive solution that I could find.
The manual one
Run docker ps -a and manually remove all unused containers, then run docker images and manually remove all the intermediate and unused images.
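If you prefer not to pick containers and images by hand, Docker's prune commands (available since 1.13) do the same in bulk; note that the -a variants remove everything not referenced by a container:
docker container prune   # remove all stopped containers
docker image prune       # remove dangling images
docker image prune -a    # remove all images not used by any container
docker system prune -a   # stopped containers, unused networks, images and build cache in one go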
The simplest one
Delete the Docker.qcow2 file entirely. But you will lose all images and containers. Completely.
The less simple
Another way is to run docker volume prune, which will remove all unused volumes.
The resizing one (keeps the data)
Another idea that comes to me is to expand the disk image size with QEMU or something like it:
$ brew install qemu
$ /Applications/Docker.app/Contents/MacOS/qemu-img resize ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/Docker.qcow2 +5G
After you have expanded the image, you will need to boot a VM in which you run GParted against Docker.qcow2 and grow the partition to use the added space. You could use the GParted Live ISO for that:
$ qemu-system-x86_64 -drive file=Docker.qcow2 -m 512 -cdrom ~/Downloads/gparted-live.iso -boot d -device usb-mouse -usb
Some people report this either doesn't work or doesn't help.
Yet another resizing one (wipes the data)
Create a substitute image with the desired size (120 GB):
$ qemu-img create -f qcow2 ~/data.qcow2 120G
$ cp ~/data.qcow2 /Applications/Docker.app/Contents/Resources/moby/data.qcow2
$ rm ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/Docker.qcow2
data.qcow2 is copied to ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/Docker.qcow2 when you restart docker.
This workaround comes from this comment.
Hope this helps. Good luck!