Ceph deploy cannot create osd

I am new to Ceph and have followed some tutorials. Unfortunately, when I now try to execute the command to create an OSD,
ceph-deploy osd create --data /dev/vdb node1
I encounter this error:
[ceph-vm2][INFO ] Running command: sudo /usr/sbin/ceph-volume --cluster ceph lvm create --bluestore --data /dev/sdb
[ceph-vm2][WARNIN] --> RuntimeError: Unable to create a new OSD id
[ceph-vm2][DEBUG ] Running command: /usr/bin/ceph-authtool --gen-print-key
[ceph-vm2][DEBUG ] Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new d64885d8-866c-4e26-bdda-94a6b8a79366
[ceph-vm2][DEBUG ] stderr: [errno 1] error connecting to the cluster
[ceph-vm2][ERROR ] RuntimeError: command returned non-zero exit status: 1
[ceph_deploy.osd][ERROR ] Failed to execute command: /usr/sbin/ceph-volume --cluster ceph lvm create --bluestore --data /dev/sdb
[ceph_deploy][ERROR ] GenericError: Failed to create 1 OSDs

Be sure /dev/sdb holds no important data first!
1. Unmount any existing partitions:
umount /dev/sdb1   # or /dev/sdb2
2. Edit /etc/fstab and comment out the /dev/sdb UUID mount entry:
vim /etc/fstab
3. Relabel the disk and create a fresh partition:
parted -s /dev/sdb mklabel gpt mkpart primary xfs 0% 100%
4. Reboot:
reboot
5. Create a new XFS filesystem:
mkfs.xfs -f /dev/sdb
6. Create the OSD:
ceph-deploy osd create --data /dev/sdb node1
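Alternatively, if the disk can be wiped anyway, ceph-deploy can zap the device before the OSD is recreated. A minimal sketch, assuming the same node/device names as above (this also destroys all data on /dev/sdb):
ceph-deploy disk zap node1 /dev/sdb
ceph-deploy osd create --data /dev/sdb node1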


Ceph (cephadm) quincy: can't add osd from remote nodes (command hanging)

I am stuck with a problem while trying to create a cluster of 3 nodes (AWS EC2 instances):
fa11.something.com ~ # ceph orch host ls
HOST ADDR LABELS STATUS
fa11.something.com 172.16.24.67 _admin
fa12.something.com 172.16.23.159 _admin
fa13.something.com 172.16.25.119 _admin
3 hosts in cluster
Each of them has 2 disks (all accepted by Ceph):
fa11.something.com ~ # ceph orch device ls
HOST PATH TYPE DEVICE ID SIZE AVAILABLE REFRESHED REJECT REASONS
fa11.something.com /dev/nvme1n1 ssd Amazon_Elastic_Block_Store_vol016651cf7f3b9c9dd 8589M Yes 7m ago
fa11.something.com /dev/nvme2n1 ssd Amazon_Elastic_Block_Store_vol034082d7d364dfbdb 5368M Yes 7m ago
fa12.something.com /dev/nvme1n1 ssd Amazon_Elastic_Block_Store_vol0ec193fa3f77fee66 8589M Yes 3m ago
fa12.something.com /dev/nvme2n1 ssd Amazon_Elastic_Block_Store_vol018736f7eeab725f5 5368M Yes 3m ago
fa13.something.com /dev/nvme1n1 ssd Amazon_Elastic_Block_Store_vol0443a031550be1024 8589M Yes 84s ago
fa13.something.com /dev/nvme2n1 ssd Amazon_Elastic_Block_Store_vol0870412d37717dc2c 5368M Yes 84s ago
fa11.something.com is the first host, from which I manage the cluster.
Adding an OSD on fa11.something.com itself works fine:
fa11.something.com ~ # ceph orch daemon add osd fa11.something.com:/dev/nvme1n1
Created osd(s) 0 on host 'fa11.something.com'
But it doesn't work for the other 2 hosts (it hangs forever):
fa11.something.com ~ # ceph orch daemon add osd fa12.something.com:/dev/nvme1n1
^CInterrupted
Logs on fa12.something.com show that it hangs at the following step:
fa12.something.com ~ # tail /var/log/ceph/a9ef6c26-ac38-11ed-9429-06e6bc29c1db/ceph-volume.log
...
[2023-02-14 07:38:20,942][ceph_volume.process][INFO ] Running command: /usr/bin/ceph-authtool --gen-print-key
[2023-02-14 07:38:20,964][ceph_volume.process][INFO ] Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new a51506c2-e910-4763-9a0c-f6c2194944e2
I'm not sure what might be the reason for this hang.
Additional details:
cephadm was installed using curl (https://docs.ceph.com/en/quincy/cephadm/install/#curl-based-installation)
I use the user ceph instead of root, and port 2222 instead of 22. The first node was bootstrapped using the command below:
cephadm bootstrap --mon-ip 172.16.24.67 --allow-fqdn-hostname --ssh-user ceph --ssh-config /home/anton/ceph/ssh_config --cluster-network 172.16.16.0/20 --skip-monitoring-stack
Content of /home/anton/ceph/ssh_config:
fa11.something.com ~ # cat /home/anton/ceph/ssh_config
Host *
User ceph
Port 2222
IdentityFile /home/ceph/.ssh/id_rsa
StrictHostKeyChecking no
UserKnownHostsFile=/dev/null
Hosts fa12.something.com and fa13.something.com were added using these commands:
ceph orch host add fa12.something.com 172.16.23.159 --labels _admin
ceph orch host add fa13.something.com 172.16.25.119 --labels _admin
I am not sure whether I have to check that some specific ports are not blocked; I expected to get an error at an early stage if Ceph couldn't reach some port...
Thanks in advance for any suggestions!
It turned out the problem was caused by blocked port(s). I dropped all restrictions at the iptables level (iptables -F; iptables -P INPUT ACCEPT) and it started to work. I will check later exactly which ports are needed; I guess they are described here: https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/
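If you would rather not leave INPUT wide open, here is a narrower sketch based on the default ports in that network reference (monitors on 3300 and 6789, OSDs/MGRs/MDSes on 6800-7300), plus the non-default SSH port 2222 used by cephadm in this setup:
# Ceph monitor traffic (msgr2 and legacy msgr1)
iptables -A INPUT -p tcp -m multiport --dports 3300,6789 -j ACCEPT
# OSD/MGR/MDS port range
iptables -A INPUT -p tcp --dport 6800:7300 -j ACCEPT
# cephadm SSH connections on the custom port
iptables -A INPUT -p tcp --dport 2222 -j ACCEPT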

Failed to start PostgreSQL Cluster 10-main when booting

When I try to boot Ubuntu, it never finishes the boot process because the message "Failed to start PostgreSQL Cluster 10-main" appears. I also get the same message with 9.5-main, but let's focus on 10.
When I execute:
systemctl status postgresql@10-main.service
I get the following message:
postgresql@10-main.service - PostgreSQL Cluster 10-main
Loaded: loaded (/lib/systemd/system/postgresql@.service; indirect; vendor preset: enabled)
Active: failed (Result: protocol) since Wed 2020-02-19 17:57:22 CET; 30 min ago
Process: 1602 ExecStart=/usr/bin/pg_ctlcluster --skip-systemctl-redirect 10-main start (code=exited, status=1/FAILURE)
PC_info systemd[1]: Starting PostgreSQL Cluster 10-main...
PC_info postgresql@10-main[1602]: Error: /usr/lib/postgresql/10/bin/pg_ctl /usr/lib/postgresql/10/bin/pg_ctl start -D /var/lib/postgresql/10/main -l /var/log/postgresql/postgresql-10-main.log -s -o -c config_file="/etc/postgresql/10/main/postgresql.conf" exited with status 1:
PC_info systemd[1]: postgresql@10-main.service: Can't open PID file /var/run/postgresql/10-main.pid (yet?) after start: No such file or directory
PC_info systemd[1]: postgresql@10-main.service: Failed with result 'protocol'.
PC_info systemd[1]: Failed to start PostgreSQL Cluster 10-main.
PC_info is information about my computer (user, date, ...) and is not relevant.
I got this error from one day to the next without touching anything related to the database servers.
I tried to fix it myself, but nothing worked.
Running the command
service postgresql@10-main start
I get
Job for postgresql@10-main.service failed because the service did not take the steps required by its unit configuration.
See "systemctl status postgresql@10-main.service" and "journalctl -xe" for details.
Running these two commands I get the message from the beginning.
Does anyone have an idea of what is happening? How can I fix it?
I had the same issue; I followed the steps below.
Error status:
pg_lsclusters
Ver Cluster Port Status Owner    Data directory               Log file
10  main    5432 down   postgres /var/lib/postgresql/10/main /var/log/postgresql/postgresql-10-main.log
Applied solution:
sudo chmod 700 -R /var/lib/postgresql/10/main
sudo -i -u postgres
postgres@abc:~$ /usr/lib/postgresql/10/bin/pg_ctl restart -D /var/lib/postgresql/10/main
Status after the solution:
pg_lsclusters
Ver Cluster Port Status Owner    Data directory               Log file
10  main    5432 online postgres /var/lib/postgresql/10/main /var/log/postgresql/postgresql-10-main.log
As mentioned by @gruentee in the comment above,
/usr/lib/postgresql/10/bin/pg_ctl restart -D /var/lib/postgresql/10/main -l /var/log/postgresql/postgresql-10-main.log -s -o '-c config_file="/etc/postgresql/10/main/postgresql.conf"'
started the PostgreSQL DB. Don't forget to run
sudo -i -u postgres
before issuing the above command.
I was facing the same challenge. I tried a lot of methods, but none worked for me until I tried the commands below:
sudo apt-get -y install postgresql
sudo systemctl start postgresql@15-main.service
pg_lsclusters
Then everything started working fine. Note that the version of PostgreSQL I'm using is 15, which is why you see 15 in the second command; in your case, substitute the version you are using.

Why does kubeadm not start even after disabling swap?

I am trying to install Kubernetes with kubeadm on my laptop, which runs Ubuntu 16.04. I have disabled swap, since kubelet does not work with swap on. The command I used is:
swapoff -a
I also commented out the reference to swap in /etc/fstab.
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# <file system> <mount point> <type> <options> <dump> <pass>
# / was on /dev/sda1 during installation
UUID=1d343a19-bd75-47a6-899d-7c8bc93e28ff / ext4 errors=remount-ro 0 1
# swap was on /dev/sda5 during installation
#UUID=d0200036-b211-4e6e-a194-ac2e51dfb27d none swap sw 0 0
I confirmed swap is turned off by running the following:
free -m
total used free shared buff/cache available
Mem: 15936 2108 9433 954 4394 12465
Swap: 0 0 0
When I start kubeadm, I get the following error:
kubeadm init --pod-network-cidr=10.244.0.0/16
[init] Using Kubernetes version: v1.14.2
[preflight] Running pre-flight checks
[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR Swap]: running with swap on is not supported. Please disable swap
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
I also tried restarting my laptop, but I get the same error. What could the reason be?
Below was the root cause:
detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd".
You need to update the Docker cgroup driver. Follow the fix below:
cat > /etc/docker/daemon.json <<EOF
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2",
  "storage-opts": [
    "overlay2.override_kernel_check=true"
  ]
}
EOF
mkdir -p /etc/systemd/system/docker.service.d
# Restart Docker
systemctl daemon-reload
systemctl restart docker
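After Docker restarts, you can verify the driver change took effect before re-running kubeadm (assuming a standard Docker install):
docker info | grep -i "cgroup driver"
# should now report: Cgroup Driver: systemd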
You could try kubeadm reset, then kubeadm init --ignore-preflight-errors=Swap.
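For reference, a sketch of that sequence, reusing the pod network CIDR from the question (note that --ignore-preflight-errors=Swap only bypasses the preflight check; it does not make running with swap supported):
kubeadm reset
kubeadm init --pod-network-cidr=10.244.0.0/16 --ignore-preflight-errors=Swap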
First, try with sudo:
sudo swapoff -a
Then check whether anything is still swapped:
cat /proc/swaps
and
free -h
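If swap keeps reappearing after a reboot, it can also be worth checking for systemd swap units (a sketch assuming systemd and util-linux are available):
swapon --show                  # any swap devices currently in use
systemctl --type swap --all    # swap units systemd knows about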

Resizing disk on google cloud kubernetes

Hi, I'm trying to resize a disk for a pod in my Kubernetes cluster, following the steps in the docs. I SSH into the instance's node to follow the steps, but it gives me an error:
sudo growpart /dev/sdb 1
WARN: unknown label
failed [sfd_dump:1] sfdisk --unit=S --dump /dev/sdb
/dev/sdb: device contains a valid 'ext4' signature; it is strongly recommended to wipe the device with wipefs(8)
if this is unexpected, in order to avoid possible collisions
sfdisk: failed to dump partition table: Success
FAILED: failed to dump sfdisk info for /dev/sdb
I tried running the commands from inside the pod, but it doesn't even locate the disk even though it's there:
root@rc-test-r2cfg:/# df -h
Filesystem Size Used Avail Use% Mounted on
overlay 59G 2.5G 56G 5% /
/dev/sdb 49G 22G 25G 47% /var/lib/postgresql/data
root@rc-test-r2cfg:/# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb 8:16 0 96G 0 disk /var/lib/postgresql/data
sda 8:0 0 60G 0 disk
└─sda1 8:1 0 60G 0 part /etc/hosts
root@rc-test-r2cfg:/# growpart /dev/sdb 1
FAILED: /dev/sdb: does not exist
where /dev/sdb is the disk location
This can now be done easily by updating the storage request in the Persistent Volume Claim spec directly. See these posts for reference:
https://kubernetes.io/blog/2018/07/12/resizing-persistent-volumes-using-kubernetes/
https://dev.to/bzon/resizing-persistent-volumes-in-kubernetes-like-magic-4f96 (GKE example)
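For example, a minimal sketch of that approach, assuming a hypothetical PVC named postgres-data and a StorageClass with allowVolumeExpansion: true:
kubectl patch pvc postgres-data -p '{"spec":{"resources":{"requests":{"storage":"96Gi"}}}}'
kubectl describe pvc postgres-data   # watch the resize conditions/events
Once the provisioner has grown the underlying disk, the filesystem is expanded when the volume is remounted (or online, where supported), so no manual growpart on the node should be needed.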

Docker disk space issue with space left on host

I have PostgreSQL running in a Docker container (Docker 17.09.0-ce-mac35 on OS X 10.11.6) and I'm inserting data from a Python application on the host. After a while I consistently get the following error in Python while there is still plenty of disk space available on the host:
psycopg2.OperationalError: could not extend file "base/16385/24599.49": wrote only 4096 of 8192 bytes at block 6543502
HINT: Check free disk space.
This is my docker-compose.yml:
version: "2"
services:
  rabbitmq:
    container_name: rabbitmq
    build: ../messaging/
    ports:
      - "4369:4369"
      - "5672:5672"
      - "25672:25672"
      - "15672:15672"
      - "5671:5671"
  database:
    container_name: database
    build: ../database/
    ports:
      - "5432:5432"
The database Dockerfile looks like this:
FROM ubuntu:17.04
RUN echo "deb http://apt.postgresql.org/pub/repos/apt/ zesty-pgdg main" > /etc/apt/sources.list.d/pgdg.list
RUN apt-get update && apt-get install -y --allow-unauthenticated python-software-properties software-properties-common postgresql-10 postgresql-client-10 postgresql-contrib-10
USER postgres
RUN /etc/init.d/postgresql start &&\
psql --command "CREATE USER ****** WITH SUPERUSER PASSWORD '******';" &&\
createdb -O ****** ******
RUN echo "host all all 0.0.0.0/0 md5" >> /etc/postgresql/10/main/pg_hba.conf
RUN echo "listen_addresses='*'" >> /etc/postgresql/10/main/postgresql.conf
EXPOSE 5432
VOLUME ["/etc/postgresql", "/var/log/postgresql", "/var/lib/postgresql"]
CMD ["/usr/lib/postgresql/10/bin/postgres", "-D", "/var/lib/postgresql/10/main", "-c", "config_file=/etc/postgresql/10/main/postgresql.conf"]
df -k output:
Filesystem 1024-blocks Used Available Capacity iused ifree %iused Mounted on
/dev/disk2 1088358016 414085004 674017012 39% 103585249 168504253 38% /
devfs 190 190 0 100% 658 0 100% /dev
map -hosts 0 0 0 100% 0 0 100% /net
map auto_home 0 0 0 100% 0 0 100% /home
Update 1:
It seems like the container has now shut down. I'll start over and try to df -k in the container before it shuts down.
2017-11-14 14:48:25.117 UTC [18] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2017-11-14 14:48:25.120 UTC [17] WARNING: terminating connection because of crash of another server process
2017-11-14 14:48:25.120 UTC [17] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2017-11-14 14:48:25.120 UTC [17] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2017-11-14 14:48:25.132 UTC [1] LOG: all server processes terminated; reinitializing
2017-11-14 14:48:25.175 UTC [1] FATAL: could not access status of transaction 0
2017-11-14 14:48:25.175 UTC [1] DETAIL: Could not write to file "pg_notify/0000" at offset 0: No space left on device.
2017-11-14 14:48:25.181 UTC [1] LOG: database system is shut down
Update 2:
This is df -k in the container; /dev/vda2 seems to be filling up quickly:
$ docker exec -it database df -k
Filesystem 1K-blocks Used Available Use% Mounted on
none 61890340 15022448 43700968 26% /
tmpfs 65536 0 65536 0% /dev
tmpfs 1023516 0 1023516 0% /sys/fs/cgroup
/dev/vda2 61890340 15022448 43700968 26% /etc/postgresql
shm 65536 8 65528 1% /dev/shm
tmpfs 1023516 0 1023516 0% /sys/firmware
Update 3:
This seems to be related to the ~64 GB file size limit on Docker.qcow2. Solved using qemu and gparted as follows:
cd ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/
qemu-img info Docker.qcow2           # check the current virtual size
qemu-img resize Docker.qcow2 +200G   # grow the image by 200 GB
qemu-img info Docker.qcow2           # confirm the new size
qemu-system-x86_64 -drive file=Docker.qcow2 -m 512 -cdrom ~/Downloads/gparted-live-0.30.0-1-i686.iso -boot d -device usb-mouse -usb   # boot GParted Live against the image to grow the partition