Monitor daemon running but not in quorum - ceph

I'm currently testing OS and version upgrades for a ceph cluster. Starting info:
The cluster is currently on CentOS 7 with Ceph Nautilus. I'm trying to move the OS to Ubuntu 20.04 and the Ceph version to Octopus. I started by upgrading mon1 first. I will write down the steps in order.
First off, I stopped the monitor service - systemctl stop ceph-mon@mon1
Then I removed the monitor from the cluster - ceph mon remove mon1
Then installed Ubuntu 20.04 on mon1, updated the system and configured ufw.
Installed ceph octopus packages.
Copied ceph.client.admin.keyring and ceph.conf to mon1 /etc/ceph/
Copied ceph.mon.keyring to mon1 to a temporary folder and changed ownership to ceph:ceph
Got the monmap with ceph mon getmap -o ${MONMAP} - the thing is, I did this after removing the monitor.
Created /var/lib/ceph/mon/ceph-mon1 folder and changed ownership to ceph:ceph
Created the filesystem for monitor - sudo -u ceph ceph-mon --mkfs -i mon1 --monmap /folder/monmap --keyring /folder/ceph.mon.keyring
After noticing that I had fetched the monmap only after the monitor's removal, I added the monitor back manually - ceph mon add mon1 <ip> --fsid <fsid>
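To double-check whether mon1 ever made it into the map I used for mkfs, the monmap file can be inspected (and, if needed, amended) with monmaptool - a rough sketch, reusing the temporary path from above:
# Show which monitors the on-disk monmap actually contains
monmaptool --print /folder/monmap
# If mon1 is missing, it could be added to the file before re-running ceph-mon --mkfs
monmaptool --add mon1 <ip>:6789 /folder/monmap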
After starting the daemon manually and checking the cluster state with ceph -s, I can see that mon1 is listed but not in quorum. The monitor daemon itself runs fine on the mon1 node. In the logs I noticed that mon1 is stuck in the "probing" state, and the other monitors' logs show output such as mon1 (rank 2) addr [v2:<ip>:3300/0,v1:<ip>:6789/0] is down (out of quorum). As I said, the monitor daemon is running on mon1 without any visible errors; it is just stuck in the probing state.
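For what it's worth, the probing state and the peer list mon1 believes in can also be confirmed through the admin socket on the node itself - a sketch:
# Ask the local monitor for its own status, including its state and the monmap it holds
ceph daemon mon.mon1 mon_status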
I wondered if it was caused by the OS and version change, so I first tried configuring the manager, MDS and RadosGW daemons by creating the respective folders in /var/lib/ceph/... and copying the keyrings. All of these services work fine: I was able to reach my buckets, open the Octopus dashboard, and the metadata server is listed as active in ceph -s. So evidently my problem is only with the monitor configuration.
After doing some checking I found this in the Red Hat Ceph documentation:
If the Ceph Monitor is in the probing state longer than expected, it cannot find the other Ceph Monitors. This problem can be caused by networking issues, or the Ceph Monitor can have an outdated Ceph Monitor map (monmap) and be trying to reach the other Ceph Monitors on incorrect IP addresses. Alternatively, if the monmap is up-to-date, Ceph Monitor's clock might not be synchronized.
There is no network error on the monitor; I can reach all the other machines in the cluster. The clocks are synchronized. If this problem is caused by the monmap situation, how can I fix it?
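For reference, if the stale monmap really is the cause, what I'm considering is grabbing a fresh map from a monitor that is still in quorum and rebuilding the mon store on mon1 with it - a sketch, reusing the paths from above:
# On a monitor that is still in quorum: export the current monmap
ceph mon getmap -o /tmp/monmap
# On mon1: wipe the store that was built from the stale map and recreate it
systemctl stop ceph-mon@mon1
rm -rf /var/lib/ceph/mon/ceph-mon1
sudo -u ceph mkdir /var/lib/ceph/mon/ceph-mon1
sudo -u ceph ceph-mon --mkfs -i mon1 --monmap /tmp/monmap --keyring /folder/ceph.mon.keyring
systemctl start ceph-mon@mon1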

OK, so as a result: going directly from CentOS 7 / Nautilus to Ubuntu 20.04 / Octopus is not possible for the monitor services only; apparently the issue is about hostname resolution differing between operating systems. The rest of the services are fine. There is a longer way to do this without issues, and it is the correct solution: first change the OS from CentOS 7 to Ubuntu 18.04, install the Ceph Nautilus packages and add the machines back to the cluster (no issues at all). Then update & upgrade the system and run "do-release-upgrade". Works like a charm. I think this is what eblock mentioned.
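For completeness, the in-place part of that longer path is just the standard Ubuntu release upgrade once each node is back in the cluster on 18.04/Nautilus - a rough sketch:
sudo apt update && sudo apt full-upgrade
sudo do-release-upgrade    # 18.04 -> 20.04
# afterwards, upgrade the Ceph packages to Octopus and restart the daemons node by node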

Related

All ceph daemons container images disappeared on a single node after reconfiguring docker logs driver

I changed log_driver to "local" in the daemon.json Docker configuration file, because high activity on the RADOS Gateway logs had saturated the disk space; my intention was to switch to journald in order to get log rotation. Unfortunately, after restarting the Docker daemon, many Ceph services disappeared, including their container images. That node now causes a HEALTH_ERR, because it lost 1 mgr, 1 mon and 3 OSD services at the same time.
I've tried to use some ceph commands inside the cephadm shell (on another node), but they freeze and nothing happens. What can I do to restore the node's services and the cluster's health?
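For reference, here is a sketch of the kind of state that can still be inspected locally on the broken node, even without quorum; the fsid, daemon type and hostname below are placeholders:
# What cephadm still knows about on this node
cephadm ls
# cephadm daemons are plain systemd units named after the cluster fsid
systemctl list-units 'ceph-*'
systemctl restart ceph-<fsid>@mon.<hostname>.service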

Rados Gateway installation (CEPH)

I have installed a Ceph cluster using cephadm (Octopus version).
Now I'm having problems installing the RADOS Gateway for the Ceph cluster using these instructions:
https://docs.ceph.com/en/latest/man/8/radosgw/
I'm following each step, but at the end the command:
sudo /etc/init.d/ceph-radosgw start
does not work, as that script cannot be found.
So I’m running:
systemctl start ceph-radosgw.target
And that helps; checking the status of the service then shows that it's running.
But I don't see any gateway in the UI, and radosgw-admin just hangs indefinitely, so I cannot create users. There are also no errors in the logs.
Has anyone faced the same problem?
Maybe there is something I have to check and do additionally? Also, when I run the above commands, it says that the monitor configuration is not found - is that a related issue?
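For context: with a cephadm-deployed Octopus cluster the gateway is normally created through the orchestrator rather than the init scripts from that man page. A sketch of that route, with myrealm, myzone and the hostname as placeholders:
# Deploy an RGW daemon through the orchestrator (Octopus expects a realm and zone name)
ceph orch apply rgw myrealm myzone --placement="1 <hostname>"
# "monitor configuration not found" usually means the node has no /etc/ceph/ceph.conf;
# either run the commands inside "cephadm shell" or generate a minimal conf on a mon node
ceph config generate-minimal-conf > /etc/ceph/ceph.conf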

Minikube Start Error (Kubernetes) When Using hyperv Driver on Windows server 2016

I am trying to install Kubernetes on Windows Server 2016.
I tried to install minikube and got some errors.
This is the tutorial that I followed:
https://www.assistanz.com/installing-minikube-on-windows-2016-server/
This is the command + error that I got:
PS C:\Windows\system32> minikube start –vm-driver=hyperv –hyperv-virtual-switch=Minikube
Starting local Kubernetes v1.10.0 cluster...
Starting VM... Downloading Minikube ISO
170.78 MB / 170.78 MB [============================================] 100.00% 0s
E1106 19:29:10.616564 11852 start.go:168] Error starting host: Error creating host: Error executing step: Running precreate checks.
: VBoxManage not found. Make sure VirtualBox is installed and VBoxManage is in the path.
Retrying.
E1106 19:29:10.689675 11852 start.go:174] Error starting host: Error creating host: Error executing step: Running precreate checks.
: VBoxManage not found. Make sure VirtualBox is installed and VBoxManage is in the path
================================================================================
An error has occurred. Would you like to opt in to sending anonymized crash
information to minikube to help prevent future errors?
To opt out of these messages, run the command:
minikube config set WantReportErrorPrompt false
================================================================================
Please enter your response [Y/n]:
Does anyone know how to solve this?
I googled it, but had no luck.
Thanks!
I was never able to get the config parameters to work with minikube start.
I was able to get past this error using the minikube config commands in PowerShell (should also work at a command prompt):
minikube config set vm-driver hyperv
minikube config set hyperv-virtual-switch ExternalSwitch
minikube config view
minikube delete
minikube start
For more information on the command run: minikube config -h
Looking at the documentation you have provided, I noticed that the screenshot shows a slight difference from the command they quote.
I have also found this command in another piece of documentation from Kubernetes here, showing the same command as the one in the screenshot.
I suggest you try the following command;
minikube start --vm-driver=hyperv --hyperv-virtual-switch=Minikube
It is true that the OP has pasted the incorrect command, because there is - instead of --. I tried to pass these arguments to minikube and all you get is an instant error, so the issue must be somewhere else. I remember having a similar issue, and it got resolved after deleting the .kube and .minikube folders and trying to run it again.
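A sketch of that cleanup from PowerShell (the paths are the defaults under the user profile):
minikube delete
Remove-Item -Recurse -Force $HOME\.kube, $HOME\.minikube
minikube start --vm-driver=hyperv --hyperv-virtual-switch=Minikube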
After taking a closer look, this tutorial is intended for installing minikube inside a Windows Server 2016 virtual machine, so you need hardware capable of nested virtualization:
Prerequisites: The Hyper-V host and guest must both be Windows Server 2016/Windows 10 Anniversary Update or later. VM configuration version 8.0 or greater. An Intel processor with VT-x and EPT technology -- nesting is currently Intel-only. There are some differences with virtual networking for second-level virtual machines. See "Nested Virtual Machine Networking".
So the main question is: is that true in your scenario? Are you trying to perform your steps on a Windows Server Hyper-V virtual machine with the nested virtualization feature?
If you confirm that, I have the technical means to check it in that scenario.
Otherwise I recommend using the "traditional way" of running minikube in Windows, according for example to this tutorial.
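For reference, whether a VM has nested virtualization enabled can be verified from the Hyper-V host with PowerShell - a sketch, where the VM name is a placeholder:
# On the Hyper-V host: check whether the VM exposes virtualization extensions
Get-VMProcessor -VMName "WinServer2016" | Select-Object ExposeVirtualizationExtensions
# Enable it (the VM must be powered off first)
Set-VMProcessor -VMName "WinServer2016" -ExposeVirtualizationExtensions $true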

Unable to bootstrap (cloud type: localhost) - Error when installing Kubernetes cluster locally with LXD/Conjure-up

Using Ubuntu 18.04.
I am trying to install a kubernetes cluster on my local machine (localhost) using this guide (LXD + conjure-up kubernetes):
https://kubernetes.io/docs/getting-started-guides/ubuntu/local/#before-you-begin
When I run:
conjure-up kubernetes
I select the Kubernetes installation in the wizard (screenshot omitted),
and select localhost for "Choose a cloud" and use the defaults for the rest of the install wizard. It then starts to install, and after 30-40 minutes it completes with an error (screenshot omitted).
Here is the log:
https://pastebin.com/raw/re1UvrUU
Where one error says:
2018-07-25 20:09:38,125 [ERROR] conjure-up/canonical-kubernetes - events.py:161 - Unhandled exception in <Task finished coro=<BaseBootstrapController.run() done, defined at /snap/conjure-up/1015/lib/python3.6/site-packages/conjureup/controllers/juju/bootstrap/common.py:15> exception=BootstrapError('Unable to bootstrap (cloud type: localhost)',)>
but that does not really help much.
Any suggestions as to why the install wizard/conjure-up fails?
Also based on this post:
https://github.com/conjure-up/conjure-up/issues/1308
I have tried to first disable the firewall:
sudo ufw disable
and then re-run the installation/conjure-up install wizard, but I get the same error.
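In case it helps narrow things down, the bootstrap error comes from the Juju layer underneath conjure-up, so that layer can be probed directly - a sketch, assuming a juju client is available and using arbitrary container/controller names:
# Check that LXD itself can start a container
lxc launch ubuntu:18.04 test-container
lxc list
# Try bootstrapping Juju against localhost by hand to get a fuller error
juju bootstrap localhost test-controller --debug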
Some more details on how I installed and configured LXD/conjure-up below:
$ snap install lxd
lxd 3.2 from 'canonical' installed
$ /snap/bin/lxd init
Would you like to use LXD clustering? (yes/no) [default=no]:
Do you want to configure a new storage pool? (yes/no) [default=yes]:
Name of the new storage pool [default=default]:
Name of the storage backend to use (btrfs, ceph, dir, lvm) [default=btrfs]:
Create a new BTRFS pool? (yes/no) [default=yes]:
Would you like to use an existing block device? (yes/no) [default=no]:
Size in GB of the new loop device (1GB minimum) [default=26GB]:
Would you like to connect to a MAAS server? (yes/no) [default=no]:
Would you like to create a new local network bridge? (yes/no) [default=yes]:
What should the new bridge be called? [default=lxdbr0]:
What IPv4 address should be used? (CIDR subnet notation, “auto” or “none”) [default=auto]:
What IPv6 address should be used? (CIDR subnet notation, “auto” or “none”) [default=auto]:
Would you like LXD to be available over the network? (yes/no) [default=no]:
Would you like stale cached images to be updated automatically? (yes/no) [default=yes]
Would you like a YAML "lxd init" preseed to be printed? (yes/no) [default=no]:
Configured group membership:
sudo usermod -a -G lxd $USER
newgrp lxd
Next installed:
sudo snap install conjure-up --classic
And then ran installation:
conjure-up kubernetes
I wasn't able to reproduce your exact problem, but I got conjure-up + LXD installed, and in the end Kubernetes, on my freshly installed VirtualBox Ubuntu 18.04 (Desktop) VM. Hopefully this answer helps you somehow!
I looked through the kubernetes.io documentation page and it lacked tiny bits of information; it does mention lxd, but not the part with lxd init, which I assume you picked up in the conjure-up user manual.
So with that said, I followed the conjure-up user manual with some minor changes along the way. I'm assuming it's OK for you to use the edge version of conjure-up; I started off with the stable one but changed to edge when testing different combinations.
Also, please ensure that you have the recommended resources available as stated by the user manual; conjure-up and the Canonical Distribution of Kubernetes launch a number of containers for you. You might not need 3 x etcd, 3 x worker nodes and 2 x master, and if you don't, just tune the number of containers down in the conjure-up wizard.
These are the steps I performed (as my local user):
Make sure your Ubuntu box is updated: sudo apt update && sudo apt upgrade
Install conjure-up by running: sudo snap install conjure-up --classic --edge
Install lxd by running: sudo snap install lxd
With lxd comes the client part, lxc; if you run e.g. lxc list you should get an empty table (no containers started yet). I got a permission error at this point, so I ran the following: sudo chown -R lxd:lxd /var/snap/lxd/ to change the owner and group of the lxd directory containing the socket you'll be communicating with using lxc.
Add your user to the lxd group: sudo usermod -a -G lxd $USER && newgrp lxd, then log off and on again to make this permanent and not only active in your current shell.
Now create a lxd bridge manually with the following command: lxc network create lxdbr1 ipv4.address=auto ipv4.nat=true ipv6.address=none ipv6.nat=false
Now let's run the init part of lxd with lxd init. Remember to answer no when asked whether to create a new local network bridge; in the next prompt, provide your newly created network bridge instead (lxdbr1). The rest of the answers can be left as defaults.
Now continue by running conjure-up kubernetes and choose localhost as your cloud type. For me the localhost choice was greyed out at first; it only worked once I had created the network bridge manually rather than via the lxd init step.
Skip the additional components you can install like Rancher, Prometheus etc.
Choose your new network bridge and the default storage pool, proceed to the next step.
In the next step customize your Kubernetes cluster if needed and then hit Deploy. And now you wait!
You can always troubleshoot and list all the containers created with the lxc tool. If you've ever used Docker, the lxc tool feels a lot like the docker client.
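A couple of lxc commands that are handy for that kind of troubleshooting (the container name is whatever lxc list shows):
# List all containers conjure-up/Juju created, with their state and IPs
lxc list
# Open a shell inside one of them to inspect it
lxc exec <container-name> -- /bin/bash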
And finally some thoughts and observations: there are a lot of moving parts to conjure-up, as you might have seen. It's actually described as "a thin layer spanning a few different underlying technologies - Juju, MAAS and LXD".
For reference, I ended up having the following versions installed:
lxd version 3.3
conjure-up version 2.6.1

Ceph configuration file and ceph-deploy

I set up a test cluster following the documentation.
I created the cluster with the command ceph-deploy new node1. After that, a ceph configuration file appeared in the current directory, containing information about the monitor on the node with hostname node1. Then I added two OSDs to the cluster.
So now I have a cluster with 1 monitor and 2 OSDs. The ceph status command says that the status is HEALTH_OK.
Following the same documentation, I moved on to the section "Expanding your cluster" and added two new monitors with the commands ceph-deploy mon add node2 and ceph-deploy mon add node3. Now I have a cluster with three monitors in quorum and status HEALTH_OK, but there is one little discrepancy for me: ceph.conf is still the same. It contains the old information about only one monitor. Why didn't the ceph-deploy mon add {node-name} command update the configuration file? And the main question: why does ceph status display correct information about the new cluster state with 3 monitors while ceph.conf doesn't contain this information? Where is the real configuration file, and why does ceph-deploy know about it while I don't?
And it works even after a reboot. All ceph daemons start, read the outdated ceph.conf (I checked this with strace) and, ignoring it, work fine with the new configuration.
And the last question: why didn't the ceph-deploy osd activate {ceph-node}:/path/to/directory command update the configuration file either? After all, why do we need the ceph.conf file if we now have such a smart ceph-deploy?
You have multiple questions here.
1) ceph.conf doesn't need to be identical on all nodes for them to run. For example, OSDs only need the OSD configuration they care about, and MONs only need the MON configuration they care about (unless you run everything on the same node, which is not recommended anyway). So maybe your mon1 only lists mon1, mon2 only lists mon2 and mon3 only lists mon3.
2) When a monitor is created and then added, the monitor map is updated, so the monitors themselves already know which other monitors are required for quorum. A monitor doesn't rely on ceph.conf for quorum information; ceph.conf is only used to change runtime configuration.
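If it helps, a quick way to see the difference - the authoritative list of monitors lives in the monmap kept by the monitors themselves, not in ceph.conf:
# Show the live monitor map the cluster actually uses
ceph mon dump
# ceph.conf only needs enough information (e.g. mon_initial_members / mon_host) for daemons and clients to find a monitor
cat /etc/ceph/ceph.conf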
3) ceph-deploy is just a Python script that prepares and runs the ceph commands for you. If you look at the details, ceph-deploy uses e.g. ceph-disk zap, prepare and activate.
Once an OSD has been prepared and activated, and the disk is formatted as a Ceph partition, udev knows where to mount it. The ceph-osd systemd unit then activates the ceph-osd daemon at boot. That's why no OSD information is needed in ceph.conf at all.
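That is also visible on an OSD node: the OSDs run from the systemd template unit and their data is found under /var/lib/ceph, with no per-OSD entries in ceph.conf - a sketch:
# OSD instances are systemd template units, not ceph.conf entries
systemctl list-units 'ceph-osd@*'
# Their data directories are mounted by udev/ceph-disk under /var/lib/ceph/osd
ls /var/lib/ceph/osd/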