ceph rbd import hangs

My Ceph cluster is 48 ms of ping away from the Ceph client. An rbd import of an 8 GB image on the client hangs at some point during the copy and never progresses. Ctrl-C-ing out of rbd import leaves the image locked in the cluster. When I scp the image to the cluster and then rbd import locally, the problem goes away. I suspect a timeout is occurring but cannot tell which config parameter to modify. Any suggestions? This is Ceph 15.2.13, the Octopus release.
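For what it's worth, a minimal sketch of how the stale lock left by the interrupted import can be inspected and cleared, assuming the image is rbd/myimage (placeholder pool/image names, not from the post); which timeout is actually firing would still need to be confirmed separately:
# show who holds a lock on the image after the aborted import
$ rbd lock list rbd/myimage
# remove the stale lock using the lock id and locker printed by the previous command
$ rbd lock remove rbd/myimage "<lock-id>" <locker>
# check for leftover watchers before retrying the import
$ rbd status rbd/myimage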

Related

All ceph daemons container images disappeared on a single node after reconfiguring docker logs driver

I changed log_driver to "local" in the daemon.json Docker configuration file, because a high level of activity on the RADOS Gateway logs had saturated the disk space. My intention was to switch to journald so that I could use logrotate. Unfortunately, after restarting the Docker daemon, many Ceph services disappeared, as did their container images. That node has now caused a HEALTH_ERR, because it lost 1 mgr, 1 mon and 3 OSD services at the same time.
I tried to run some ceph commands inside a cephadm shell (on another node), but it just freezes and nothing happens. What can I do to restore the node's services and the cluster's health?
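For reference, a minimal sketch of the kind of daemon.json change involved and a first diagnostic step, assuming a stock cephadm/Docker deployment (the file contents are illustrative, not copied from the node):
# /etc/docker/daemon.json - default log driver for all containers
{
  "log-driver": "journald"
}
# after restarting docker, check which daemons cephadm still knows about on this host
$ sudo systemctl restart docker
$ sudo cephadm ls
$ sudo systemctl list-units 'ceph-*'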

Monitor daemon running but not in quorum

I'm currently testing OS and version upgrades for a Ceph cluster. Starting info:
The cluster is currently on CentOS 7 and Ceph Nautilus. I'm trying to move the OS to Ubuntu 20.04 and the Ceph version to Octopus. I started with upgrading mon1 first. I will write down the things I did, in order.
First off, I stopped the monitor service - systemctl stop ceph-mon@mon1
Then I removed the monitor from the cluster - ceph mon remove mon1
Then installed Ubuntu 20.04 on mon1, updated the system and configured ufw.
Installed the Ceph Octopus packages.
Copied ceph.client.admin.keyring and ceph.conf to mon1 /etc/ceph/
Copied ceph.mon.keyring to mon1 to a temporary folder and changed ownership to ceph:ceph
Got the monmap - ceph mon getmap -o ${MONMAP}. The thing is, I did this after removing the monitor.
Created /var/lib/ceph/mon/ceph-mon1 folder and changed ownership to ceph:ceph
Created the filesystem for monitor - sudo -u ceph ceph-mon --mkfs -i mon1 --monmap /folder/monmap --keyring /folder/ceph.mon.keyring
After noticing that I had fetched the monmap after the monitor's removal, I added the monitor back manually - ceph mon add mon1 <ip> --fsid <fsid> (the whole sequence is collected in the sketch below).
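For reference, a consolidated sketch of the steps above, assuming the monitor id is mon1 and the staging directory is /folder as in the post:
# on the old node: stop the monitor and remove it from the cluster
$ sudo systemctl stop ceph-mon@mon1
$ ceph mon remove mon1
# on the reinstalled node: fetch a monmap to bootstrap the new monitor from
$ ceph mon getmap -o /folder/monmap
# build the monitor store with the staged monmap and keyring
$ sudo mkdir /var/lib/ceph/mon/ceph-mon1 && sudo chown ceph:ceph /var/lib/ceph/mon/ceph-mon1
$ sudo -u ceph ceph-mon --mkfs -i mon1 --monmap /folder/monmap --keyring /folder/ceph.mon.keyring
# register the monitor again and start it
$ ceph mon add mon1 <ip>
$ sudo systemctl start ceph-mon@mon1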
After starting it manually and checking the cluster state with ceph -s, I can see that mon1 is listed but is not in quorum. The monitor daemon runs fine on that mon1 node. I noticed in the logs that mon1 is stuck in the "probing" state, and the other monitors' logs show output such as mon1 (rank 2) addr [v2:<ip>:3300/0,v1:<ip>:6789/0] is down (out of quorum). As I said, the monitor daemon is running on mon1 without any visible errors, just stuck in the probing state.
I wondered if it was caused by the OS and version change, so I first tried configuring the manager, MDS and RadosGW daemons by creating the respective folders in /var/lib/ceph/... and copying the keyrings. All of these services work fine: I was able to reach my buckets, the Octopus dashboard opens, and the metadata server is listed as active in ceph -s. So evidently my problem is only with the monitor configuration.
After doing some checking I found this in the Red Hat Ceph documentation:
If the Ceph Monitor is in the probing state longer than expected, it cannot find the other Ceph Monitors. This problem can be caused by networking issues, or the Ceph Monitor can have an outdated Ceph Monitor map (monmap) and be trying to reach the other Ceph Monitors on incorrect IP addresses. Alternatively, if the monmap is up-to-date, the Ceph Monitor's clock might not be synchronized.
There is no network error on the monitor; I can reach all the other machines in the cluster, and the clocks are synchronized. If this problem is caused by the monmap situation, how can I fix it?
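One commonly documented way to rule out a stale monmap is to inject a current map into the stuck monitor. A minimal sketch, assuming the remaining monitors are healthy and the stuck daemon is mon1:
# grab the current monmap from a monitor that is in quorum
$ ceph mon getmap -o /tmp/monmap
# confirm that mon1's address and rank in the map look right
$ monmaptool --print /tmp/monmap
# stop the stuck monitor, inject the map into its store, and start it again
$ sudo systemctl stop ceph-mon@mon1
$ sudo -u ceph ceph-mon -i mon1 --inject-monmap /tmp/monmap
$ sudo systemctl start ceph-mon@mon1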
OK, so as a result: going directly from CentOS 7 / Nautilus to Ubuntu 20.04 / Octopus is not possible for the monitor services only; apparently the issue is hostname resolution under the different operating systems. The rest of the services are fine. There is a longer way to do this without issues, and it is the correct solution: first change the OS from CentOS 7 to Ubuntu 18.04, install the Ceph Nautilus packages and add the machines back to the cluster (no issues at all). Then update and upgrade the system and run "do-release-upgrade". Works like a charm. I think this is what eblock was referring to.
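A short sketch of that path on one node, assuming it has already been reinstalled with Ubuntu 18.04, given the Nautilus packages, and re-added to the cluster:
# bring the Ubuntu 18.04 node fully up to date first
$ sudo apt update && sudo apt full-upgrade -y
# then upgrade the distribution to 20.04 and move the node to the Octopus packages afterwards
$ sudo do-release-upgrade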

how to drop ceph osd block?

I built a Ceph cluster with Kubernetes, and it created an OSD block volume on the sdb disk.
I have since deleted the Ceph cluster and cleaned up all of the Kubernetes instances that the Ceph cluster had created, but that did not delete the OSD block volume sitting on sdb.
I am a beginner with Kubernetes. How can I remove the OSD block volume from sdb?
And why does the OSD block volume take up all of the disk space?
I found a way to remove the OSD block volume from the disk on Ubuntu 18.04:
Use this command to show the logical volume information:
$ sudo lvm lvdisplay
You will get output listing each logical volume; the field you need is the LV Path of the Ceph OSD volume.
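For illustration only (placeholders, not output from the original post), an OSD volume created by ceph-volume typically appears like this:
--- Logical volume ---
LV Path     /dev/ceph-<uuid>/osd-block-<uuid>
LV Name     osd-block-<uuid>
VG Name     ceph-<uuid>
LV Size     <size of the whole disk>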
Then execute this command to remove the OSD block volume:
$ sudo lvm lvremove <LV Path>
Check that the volume was removed successfully:
$ lsblk
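If the Ceph OSD tooling is still installed, a shorter alternative (a sketch, not verified against this exact setup) is to let ceph-volume tear down its own LVM artifacts, or to wipe the disk signatures directly:
# remove the OSD's logical volume, volume group and any partition data in one step
$ sudo ceph-volume lvm zap /dev/sdb --destroy
# or, without Ceph tooling, clear filesystem and LVM signatures from the disk
$ sudo wipefs -a /dev/sdb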

CephFS meets bonnie++: No space left on device

CephFS is evaluated as a distributed filesystem in my lab with bonnie++ in the following setup:
Virtualbox VMs are used with the following settings:
RAM: 1024 MiB
One SATA hard drive for OS: 8 GiB
One dedicated SATA hard drive for distributed file system for each OSD: 8 GiB
OS: CentOS 7
Ceph version: 10.2
Ceph is configured for three (3) replicas, i.e. three VMs are operated as OSDs:
node1 hosts ceph-mon, ceph-osd and ceph-mds
node2 and node3 host ceph-mon and ceph-osd.
The admin node is idle.
The client runs bonnie++ on a CephFS mount.
Somehow, bonnie++ fails to run through. Instead, it exits with the following error message:
Writing intelligently...Can't write block.: No space left on device
Funnily enough, df -h reports 8.7 GiB of available disk space.
Is this normal ceph behaviour? What do you recommend to fix it?
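With three replicas on 8 GiB OSDs the cluster only has roughly one disk's worth of usable space, and a single OSD crossing the full ratio is usually enough for writes to fail with "No space left on device" even while the client's df still shows free space. A minimal sketch of the checks that normally confirm this:
# cluster-wide and per-pool usage, including the replication overhead
$ ceph df
# per-OSD utilisation; one nearly full OSD is enough to block writes
$ ceph osd df
# reports explicit full/nearfull warnings once an OSD crosses the ratio
$ ceph health detail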

Kubernetes in vmware vsphere issues

I am following this guide to set up my cluster. It all works fine.
However, when I install fabric8 in this cluster I run out of disk space on the minions. The image, kube.vmdk, is only about 6 GB. It is /var/lib/docker that gets filled up. How do I solve this?
Using the VMware GUI, the option to resize the disk is greyed out.
Should I attach a second disk to the minions and then mount that disk? Where should I mount it? /var/lib/docker?
I would appreciate any input.
Docker's images are stored in /var/lib/docker (more precisely, in the storage driver's directory, e.g. /var/lib/docker/aufs when using the aufs storage driver), so when Kubernetes reports that the disk is getting full, check that directory.
So you can:
Remove all the images in Docker (not strictly necessary; you can instead copy everything over to the new directory).
Stop the Docker daemon.
Mount your new disk at /var/lib/docker/aufs or /var/lib/docker (see the sketch after this list).
Start the Docker daemon.
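A minimal sketch of that sequence, assuming the new disk is already formatted and visible as /dev/sdb1 (the device name is an assumption, not taken from the question):
# stop docker so /var/lib/docker is not in use while we move it
$ sudo systemctl stop docker
# mount the new disk somewhere temporary and copy the existing data over
$ sudo mkdir -p /mnt/docker-data
$ sudo mount /dev/sdb1 /mnt/docker-data
$ sudo rsync -a /var/lib/docker/ /mnt/docker-data/
# remount the new disk at /var/lib/docker (add an /etc/fstab entry to make it permanent)
$ sudo umount /mnt/docker-data
$ sudo mount /dev/sdb1 /var/lib/docker
$ sudo systemctl start docker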
If you are not sure which storage driver your Docker is using, run docker info on the node; you will get output containing something like this:
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 139
Dirperm1 Supported: true
It seems that you have run out of disk space. You can remove all the files in /var/lib/docker and mount the second disk there. Finally, you need to restart dockerd.