Ceph Manual Deployment no ceph -s output after mon installs (Nautilus) - ceph

I'm trying to build a cluster to test things before I apply them to our production cluster. We're using Ceph Nautilus, so I decided to install Nautilus first as well.
Used the docs below:
https://docs.ceph.com/en/latest/install/manual-deployment/
Everything seemed to go fine. I installed 3 monitors, generated the monmap, copied the keyrings to the other monitors, and started the services; they are all up. But when I type ceph -s to check the cluster status, it just hangs forever without any output. Any ceph command hangs the same way. As a result I can't continue building the cluster, since I need working ceph commands after the monitor deployment to install the other services.
Systemctl outputs are the same for all 3 monitors in the current state:
[root@mon2 ~]# systemctl status ceph-mon@mon2
● ceph-mon@mon2.service - Ceph cluster monitor daemon
Loaded: loaded (/usr/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2021-04-28 09:55:24 +03; 25min ago
Main PID: 4725 (ceph-mon)
CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@mon2.service
└─4725 /usr/bin/ceph-mon -f --cluster ceph --id mon2 --setuser ceph --setgroup ceph
Apr 28 09:55:24 mon2 systemd[1]: Started Ceph cluster monitor daemon.

Resolved: the problem was caused by missing firewalld and SELinux configuration. After applying those and restarting the deployment process, my issue was solved.
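The post does not say which rules were missing. As a rough sketch, assuming the monitors listen on the default ports 3300 (msgr2) and 6789 (msgr1) and that the firewalld zone is called public (an assumption, adjust to your setup), the commands to open them could be generated like this. The helper name is made up; it only prints the commands so they can be reviewed before being run as root. SELinux is not covered here.

```shell
# Print (do not run) the firewalld commands that open the Ceph monitor ports.
# Assumptions: default monitor ports 3300 and 6789, zone name passed as $1.
mon_firewall_cmds() {
  local zone="${1:-public}"
  local port
  for port in 3300 6789; do
    printf 'firewall-cmd --zone=%s --permanent --add-port=%s/tcp\n' "$zone" "$port"
  done
  printf 'firewall-cmd --reload\n'
}

mon_firewall_cmds public
```

The same rules have to be present on every monitor node, since each monitor must be able to reach the others before a quorum can form.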

Related

Monitor daemon running but not in quorum

I'm currently testing OS and version upgrades for a ceph cluster. Starting info:
The cluster is currently on CentOS 7 and Ceph Nautilus. I'm trying to move to Ubuntu 20.04 and Octopus. I started by upgrading mon1. I will write down the steps in order.
First, I stopped the monitor service - systemctl stop ceph-mon@mon1
Then I removed the monitor from cluster - ceph mon remove mon1
Then installed ubuntu 20.04 on mon1. Updated the system and configured ufw.
Installed ceph octopus packages.
Copied ceph.client.admin.keyring and ceph.conf to mon1 /etc/ceph/
Copied ceph.mon.keyring to mon1 to a temporary folder and changed ownership to ceph:ceph
Got the monmap ceph mon getmap -o ${MONMAP} - The thing is i did this after removing the monitor.
Created /var/lib/ceph/mon/ceph-mon1 folder and changed ownership to ceph:ceph
Created the filesystem for monitor - sudo -u ceph ceph-mon --mkfs -i mon1 --monmap /folder/monmap --keyring /folder/ceph.mon.keyring
After noticing that I had fetched the monmap after the monitor's removal, I added the monitor manually - ceph mon add mon1 <ip> --fsid <fsid>
After starting the monitor manually and checking the cluster state with ceph -s, I can see mon1 is listed but is not in quorum. The monitor daemon runs fine on mon1. I noticed in the logs that mon1 is stuck in the "probe" state, and the other monitors' logs contain output such as mon1 (rank 2) addr [v2:<ip>:3300/0,v1:<ip>:6789/0] is down (out of quorum). As I said, the monitor daemon is running on mon1 without any visible errors, just stuck in the probe state.
I wondered if it was caused by the OS and version change, so I first tried configuring the manager, mds and radosgw daemons by creating the respective folders in /var/lib/ceph/... and copying keyrings. All these services work fine: I was able to reach my buckets and open the Octopus dashboard, and the metadata server is listed as active in ceph -s. So evidently my problem is only with the monitor configuration.
After some checking, I found this in the Red Hat Ceph documentation:
If the Ceph Monitor is in the probing state longer than expected, it
cannot find the other Ceph Monitors. This problem can be caused by
networking issues, or the Ceph Monitor can have an outdated Ceph
Monitor map (monmap) and be trying to reach the other Ceph Monitors on
incorrect IP addresses. Alternatively, if the monmap is up-to-date,
Ceph Monitor’s clock might not be synchronized.
There is no network error on the monitor, I can reach all the other machines in the cluster. The clocks are synchronized. If this problem is caused by the monmap situation how can I fix this?
OK, so the result: going directly from CentOS 7 with Nautilus to Ubuntu 20.04 with Octopus does not work for the monitor service; apparently the issue is related to hostname resolution between the different operating systems. The rest of the services are fine. There is a longer way that works without issues and is the correct solution. First change the OS from CentOS 7 to Ubuntu 18.04, install the ceph-nautilus packages, and add the machines to the cluster (no issues at all). Then update and upgrade the system and run do-release-upgrade. Works like a charm. I think this is what eblock mentioned.
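For the monmap question above, one way to see whether a monitor is actually listed in the map the cluster hands out is to print the map with monmaptool and search for the monitor's name. The helper below is only a sketch and not part of Ceph: the function name and the sample addresses are made up, and it assumes each monitor line printed by monmaptool --print ends in mon.<name>.

```shell
# Succeeds if the named monitor appears in `monmaptool --print` output on stdin.
# Assumption: monitor lines end with "mon.<name>".
monmap_has_mon() {
  grep -Eq "[[:space:]]mon\.$1\$"
}

# Real usage on a node with an admin keyring (not run here):
#   ceph mon getmap -o /tmp/monmap
#   monmaptool --print /tmp/monmap | monmap_has_mon mon1

# Demo with made-up sample output in which mon1 has been removed.
sample='0: [v2:10.0.0.2:3300/0,v1:10.0.0.2:6789/0] mon.mon2
1: [v2:10.0.0.3:3300/0,v1:10.0.0.3:6789/0] mon.mon3'

if printf '%s\n' "$sample" | monmap_has_mon mon1; then
  echo "mon1 is in the map"
else
  echo "mon1 is missing from the map"
fi
```

If the monitor is missing, a store created with ceph-mon --mkfs from that map will not know about itself, which would fit the probing symptom described in the quoted documentation.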

Problem with kafka - Failed with result 'exit-code', status=1/FAILURE

I tried to install apache-kafka several times but always ran into this problem. I'm using Ubuntu in a virtual machine. When I start the kafka service with sudo systemctl start kafka
and then check whether it's working, the output is "active (running)" at first, but when I check again the output is "failed (Result: exit-code)". I also tried sudo systemctl enable kafka, but it didn't help.
This is the output:
● kafka.service
Loaded: loaded (/etc/systemd/system/kafka.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Wed 2021-05-26 05:40:22 PDT; 3s ago
Process: 8098 ExecStart=/bin/sh -c /home/kafka/kafka/bin/kafka-server-start.sh /home/kafka/kafka/co>
Main PID: 8098 (code=exited, status=1/FAILURE)
May 26 05:40:19 ubuntu systemd[1]: Started kafka.service.
May 26 05:40:22 ubuntu systemd[1]: kafka.service: Main process exited, code=exited, status=1/FAILURE
May 26 05:40:22 ubuntu systemd[1]: kafka.service: Failed with result 'exit-code'.
You can see the full output attached
I also tried journalctl -xe, and it recommended using ./gradlew jar -PscalaVersion=2.13.5. I did that, and at first it seemed to work, but the following day I had the same problem (kafka.service: Failed with result 'exit-code'.). When I ran journalctl -xe again, I got the output you can see attached.
With zookeeper I had no problem, it's always active.
Thank you in advance.
Open the file meta.properties.
In my case, it was located at the path /home/kafka/logs/meta.properties
Just comment out the cluster.id line with a #
Restart zookeeper and kafka.
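The meta.properties edit can also be scripted. This is a minimal sketch that keeps a .bak copy and prefixes the cluster.id line with #. The function name is made up, and the demo runs against a temporary file with placeholder values rather than the real /home/kafka/logs/meta.properties.

```shell
# Comment out the cluster.id line in a meta.properties file, keeping a backup.
comment_cluster_id() {
  local file="$1"
  cp "$file" "$file.bak"
  # "&" in the replacement is the matched text, so the line becomes "#cluster.id=..."
  sed 's/^cluster\.id=/#&/' "$file.bak" > "$file"
}

# Demo on a temporary file; the values are placeholders.
demo="$(mktemp)"
printf 'version=0\nbroker.id=0\ncluster.id=abc123\n' > "$demo"
comment_cluster_id "$demo"
cat "$demo"
```

On a real broker, stop Kafka first and point the function at the meta.properties file inside the directory configured as log.dirs.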
I had the same issue after following a tutorial from a well-known site. I fixed the problem by redoing everything from scratch, this way.
sudo apt update
sudo apt install default-jdk
I downloaded latest BINARY release from here https://kafka.apache.org/downloads. I used https://dlcdn.apache.org/kafka/3.0.0/kafka_2.13-3.0.0.tgz
sudo wget https://dlcdn.apache.org/kafka/3.0.0/kafka_2.13-3.0.0.tgz
Unpack and move
tar xzf kafka_2.13-3.0.0.tgz
sudo mv kafka_2.13-3.0.0 /usr/local/kafka
edit zookeeper unit file
sudo vi /etc/systemd/system/zookeeper.service
add this content
[Unit]
Description=Apache Zookeeper server
Documentation=http://zookeeper.apache.org
Requires=network.target remote-fs.target
After=network.target remote-fs.target
[Service]
Type=simple
ExecStart=/usr/local/kafka/bin/zookeeper-server-start.sh /usr/local/kafka/config/zookeeper.properties
ExecStop=/usr/local/kafka/bin/zookeeper-server-stop.sh
Restart=on-abnormal
[Install]
WantedBy=multi-user.target
Edit Kafka systemd unit file
sudo vi /etc/systemd/system/kafka.service
and add the content below. Note: you must replace the JAVA_HOME value with your own Java path
[Unit]
Description=Apache Kafka Server
Documentation=http://kafka.apache.org/documentation.html
Requires=zookeeper.service
[Service]
Type=simple
Environment="JAVA_HOME=REPLACE-THIS-WITH-YOUR-PATH"
ExecStart=/usr/local/kafka/bin/kafka-server-start.sh /usr/local/kafka/config/server.properties
ExecStop=/usr/local/kafka/bin/kafka-server-stop.sh
[Install]
WantedBy=multi-user.target
Reload the systemd daemon to apply new changes.
sudo systemctl daemon-reload
Start zookeeper and kafka
sudo systemctl start zookeeper
sudo systemctl start kafka
check kafka status now, it should be running
sudo systemctl status kafka
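For the JAVA_HOME placeholder in the Kafka unit file above, one common way to find a candidate path is to resolve the java binary and strip the trailing /bin/java. A small sketch, assuming java is on PATH and GNU readlink is available; the helper name is made up.

```shell
# Derive a JAVA_HOME candidate from the fully resolved path of the java binary.
java_home_from_bin() {
  printf '%s\n' "${1%/bin/java}"
}

# Only attempt the lookup when java is actually installed.
if command -v java >/dev/null 2>&1; then
  java_home_from_bin "$(readlink -f "$(command -v java)")"
fi
```

The printed path is what would go into the Environment="JAVA_HOME=..." line, followed by sudo systemctl daemon-reload.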
All you need to do is to build kafka project before running it:
./gradlew jar -PscalaVersion=2.13.6
Note that you need to have Java installed
> tried to install apache-kafka several times
Kafka doesn't come with Systemd scripts. Follow the official Apache Kafka website to see how you start it without systemctl
If you want to install on Ubuntu, Confluent Community edition allows you to do apt-get install to get both Kafka and Zookeeper
Your error shows an InconsistentClusterIdException, which means you need to wipe the data directories for Zookeeper and Kafka so that the broker will start in a fresh state
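To check whether this is the error you are hitting before wiping anything, you can search the service log for the exception name. A sketch, assuming Kafka runs as a systemd unit named kafka as in the question; the helper name is made up, and the demo input is only a stand-in for real log output.

```shell
# Succeeds if the text on stdin mentions the cluster ID mismatch exception.
has_cluster_id_mismatch() {
  grep -q 'InconsistentClusterIdException'
}

# Real usage (not run here):
#   journalctl -u kafka --no-pager | has_cluster_id_mismatch && echo "cluster.id mismatch"

# Stand-in input containing only the exception name.
if printf 'InconsistentClusterIdException\n' | has_cluster_id_mismatch; then
  echo "cluster.id mismatch found"
fi
```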
In my case, I found that the system actually had two kafka folders, so the service failed with "exit-code" when it started.
My solution was to delete one of the folders and keep /home/kafka
In my case Kafka didn't start in the first place. I assigned a different log folder in the server.properties file, gave that folder the necessary permissions, and restarted both the ZooKeeper and Kafka services; after that they seemed to work.
In my case, I was using a source download:
kafka-3.3.1-src.tgz
Use the binary version instead:
Scala 2.13 - kafka_2.13-3.3.1.tgz
You can download it from https://kafka.apache.org/downloads

Kubelet failing start attempts pollutes logs

I have a bunch of fresh CentOS servers installed on AWS. The kubelet service pollutes the log file (/var/log/messages) with its attempts to start, and as I have no use for it, I would like to remove it. Is this an optional component of CentOS that I can safely remove (or disable kubelet.service)? I believe so, but I would not expect a brand new server to push out so many errors.
Currently, 97% of my /var/log/messages logs contain rows like:
Jan 17 03:21:03 systemd: Started kubelet: The Kubernetes Node Agent.
Jan 17 03:21:03 kubelet: F0117 03:21:03.101812 29626 server.go:198] failed to load Kubelet
config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file
"/var/lib/kubelet/config.yaml", error: open /var/lib/kubelet/config.yaml: no such file or
directory
***da da da, 40 more rows***
Jan 17 03:21:03 systemd: Unit kubelet.service entered failed state.
Jan 17 03:21:03 systemd: kubelet.service failed.
Jan 17 03:21:13 systemd: kubelet.service holdoff time over, scheduling restart.
Jan 17 03:21:13 systemd: Stopped kubelet: The Kubernetes Node Agent.
Jan 17 03:21:13 systemd: Started kubelet: The Kubernetes Node Agent.
***sleep for 10s and start all over*
As I have already mentioned in my comment, kubelet is a part of kubernetes cluster, it's the primary node agent that runs on each node. I sincerely doubt that this CentOS image came with it preinstalled. If it really did, and as you said, it's a "fresh CentOS server", that nobody had previously tinkered with, I would recommend you to choose a different image if your servers have nothing to do with kubernetes cluster. However if it is used as kind of your production environment and runs some other important things, you should investigate how it was installed and simply remove it.
I did not do the setup myself, but the template used is
258751437250/ami-centos-7-1.13.0-00-1543960911. We have not asked for
Kubernetes on it and is not using clusters
The simplest answer to your question is:
You can safely stop and disable it so it doesn't pollute your /var/log/messages any more:
sudo systemctl stop kubelet.service && sudo systemctl disable kubelet.service
You can also remove it. Depending on how it was installed, you may need to do it in a specific way.
First check:
yum list installed | grep kubelet
If it's there you can:
yum remove kubelet
If it doesn't return any result you may try:
rpm -qa | grep kubelet
and if anything found, remove it:
rpm -e kubelet
It may be also a remnant of an old kubernetes installation which was set up with a tool like minikube or kubeadm. To check that, run:
sudo systemctl cat kubelet.service
and take a look at the ExecStart section. Depending on what you find there, it's very likely you'll need to uninstall some other unnecessary components e.g. if you find something like /var/lib/minikube/binaries/v1.16.0/kubelet, it means it's part of minikube installation.
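The ExecStart check can be reduced to a small filter that prints the binary the unit would launch. This is only a sketch: the function name is made up, the sample unit text is illustrative, and it takes the last non-empty ExecStart line because drop-in files may reset the setting with an empty ExecStart= first.

```shell
# Print the binary path from the last non-empty ExecStart= line on stdin.
unit_exec_binary() {
  sed -n 's/^ExecStart=-\{0,1\}\([^[:space:]]*\).*/\1/p' | grep -v '^$' | tail -n 1
}

# Real usage (not run here):
#   sudo systemctl cat kubelet.service | unit_exec_binary

# Demo with illustrative unit text.
sample='[Service]
ExecStart=
ExecStart=/var/lib/minikube/binaries/v1.16.0/kubelet --config=/var/lib/kubelet/config.yaml'
printf '%s\n' "$sample" | unit_exec_binary
```

On an RPM-based system, passing the printed path to rpm -qf shows whether any package owns the binary, which helps decide between yum remove and manual cleanup.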
Chances are that it was even partially uninstalled, but there are still some leftovers. As you can see, even its config file cannot be found:
error: open /var/lib/kubelet/config.yaml: no such file or
directory
In case of any doubts or additional questions, don't hesitate to ask.

Joining cluster takes forever

I have set up my master node and I am trying to join a worker node as follows:
kubeadm join 192.168.30.1:6443 --token 3czfua.os565d6l3ggpagw7 --discovery-token-ca-cert-hash sha256:3a94ce61080c71d319dbfe3ce69b555027bfe20f4dbe21a9779fd902421b1a63
However the command hangs forever in the following state:
[preflight] Running pre-flight checks
[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
Since this is just a warning, why does it actually fail?
edit: I noticed the following in my /var/log/syslog
Mar 29 15:03:15 ubuntu-xenial kubelet[9626]: F0329 15:03:15.353432 9626 server.go:193] failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file "/var/lib/kubelet/config.yaml", error: open /var/lib/kubelet/config.yaml: no such file or directory
Mar 29 15:03:15 ubuntu-xenial systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Mar 29 15:03:15 ubuntu-xenial systemd[1]: kubelet.service: Unit entered failed state.
First if you want to see more detail when your worker joins to the master use:
kubeadm join 192.168.1.100:6443 --token m3jfbb.wq5m3pt0qo5g3bt9 --discovery-token-ca-cert-hash sha256:d075e5cc111ffd1b97510df9c517c122f1c7edf86b62909446042cc348ef1e0b --v=2
Using the above command I could see that my worker could not establish a connection with the master, so I just stopped the firewall:
systemctl stop firewalld
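Before turning the firewall off entirely, it can help to confirm that the API server port is what is being blocked. The sketch below pulls the host and port out of a join command and defines a TCP probe; both function names are made up, the probe relies on bash's /dev/tcp and the coreutils timeout command, and IPv6 endpoints are not handled.

```shell
# Print "host port" for the endpoint that follows the word "join".
join_endpoint() {
  printf '%s\n' "$1" | awk '{
    for (i = 1; i < NF; i++)
      if ($i == "join") {
        if (split($(i + 1), parts, ":") == 2) print parts[1], parts[2]
        exit
      }
  }'
}

# Succeeds if a TCP connection to host $1, port $2 opens within 3 seconds.
port_open() {
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

endpoint="$(join_endpoint 'kubeadm join 192.168.30.1:6443 --token <token>')"
host="${endpoint% *}"
port="${endpoint#* }"
echo "API server endpoint: $host port $port"
# On the worker (not run here):
#   port_open "$host" "$port" || echo "cannot reach $host:$port"
```

If the probe fails only while firewalld is running on the master, opening 6443/tcp there is a narrower fix than stopping the service.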
This can be solved by creating a new token
using this command:
kubeadm token create --print-join-command
and use the token generated for joining other nodes to the cluster
The problem had to do with kubeadm not installing a CNI-compatible networking solution out of the box.
Without this step, the Kubernetes nodes and master are unable to establish any form of communication.
The following task addressed the issue:
- name: kubernetes.yml --> Install Flannel
  shell: kubectl -n kube-system apply -f https://raw.githubusercontent.com/coreos/flannel/bc79dd1505b0c8681ece4de4c0d86c5cd2643275/Documentation/kube-flannel.yml
  become: yes
  environment:
    KUBECONFIG: "/etc/kubernetes/admin.conf"
  when: inventory_hostname in (groups['masters'] | last)
I did get the same error on CentOS 7 but in my case join command worked without problems, so it was indeed just a warning.
> [WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker
> cgroup driver. The recommended driver is "systemd". Please follow the
> guide at https://kubernetes.io/docs/setup/cri/ [preflight] Reading
> configuration from the cluster... [preflight] FYI: You can look at
> this config file with 'kubectl -n kube-system get cm kubeadm-config
> -oyaml' [kubelet-start] Downloading configuration for the kubelet from the "kubelet-config-1.14" ConfigMap in the kube-system namespace
As the official documentation mentions, there are two common issues that make the init hang (I guess it also applies to join command):
the default cgroup driver configuration for the kubelet differs from
that used by Docker. Check the system log file (e.g. /var/log/message)
or examine the output from journalctl -u kubelet. If you see something
like the following:
First try the steps from official documentation and if that does not work please provide more information so we can troubleshoot further if needed.
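To compare the two cgroup drivers mentioned above, one option is to read the kubelet's setting from its config file and Docker's from docker info. This is a sketch under assumptions: the helper name is made up, the kubelet value is assumed to sit on an unquoted top-level cgroupDriver: line, and it only applies once /var/lib/kubelet/config.yaml exists. The demo uses a temporary file with placeholder content.

```shell
# Print the cgroupDriver value from a kubelet config file.
kubelet_cgroup_driver() {
  sed -n 's/^cgroupDriver:[[:space:]]*//p' "$1"
}

# Demo on a temporary file; on a real node pass /var/lib/kubelet/config.yaml.
cfg="$(mktemp)"
printf 'kind: KubeletConfiguration\ncgroupDriver: systemd\n' > "$cfg"
echo "kubelet: $(kubelet_cgroup_driver "$cfg")"

# Docker's side, only when the docker CLI is available.
if command -v docker >/dev/null 2>&1; then
  echo "docker: $(docker info --format '{{.CgroupDriver}}' 2>/dev/null)"
fi
```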
I had a bunch of k8s deployment scripts that broke recently with this same error message... it looks like Docker changed its install. Try this --
previous install:
apt-get install docker-ce
updated install:
apt-get install docker-ce docker-ce-cli containerd.io
How is /var/lib/kubelet/config.yaml created?
Regarding the /var/lib/kubelet/config.yaml: no such file or directory error.
Below are steps that should occur on the worker node in order for the mentioned file to be created.
1 ) The creation of the /var/lib/kubelet/ folder. It is created when the kubelet service is installed as mentioned here:
sudo apt-get update && sudo apt-get install -y apt-transport-https curl
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
cat <<EOF | sudo tee /etc/apt/sources.list.d/kubernetes.list
deb https://apt.kubernetes.io/ kubernetes-xenial main
EOF
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl
2 ) The creation of config.yaml. The kubeadm join flow should take place so when you run kubeadm join, kubeadm uses the Bootstrap Token credential to perform a TLS bootstrap, which fetches the credential needed to download the kubelet-config-1.X ConfigMap and writes it to /var/lib/kubelet/config.yaml.
After a successful execution you should see the logs below:
.
.
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
.
.
So, after these 2 steps you should have /var/lib/kubelet/config.yaml in place.
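The two steps can be turned into a quick check that reports which one has not happened yet on a worker. A minimal sketch; the function name and messages are made up, and the directory defaults to /var/lib/kubelet.

```shell
# Report which of the two steps is missing for a given kubelet directory.
kubelet_config_stage() {
  local dir="${1:-/var/lib/kubelet}"
  if [ ! -d "$dir" ]; then
    echo "step 1 missing: $dir does not exist"
  elif [ ! -f "$dir/config.yaml" ]; then
    echo "step 2 missing: $dir/config.yaml not written yet"
  else
    echo "ok: $dir/config.yaml exists"
  fi
}

kubelet_config_stage /var/lib/kubelet
```

A "step 2 missing" result on a node where kubeadm join was already run points at the join flow itself rather than at the kubelet package.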
Failure of the kubeadm join flow
In your case, it seems that the kubeadm join flow failed, which can happen for multiple reasons, such as bad iptables configuration, ports that are already in use, or a container runtime that is not installed properly, as described here and here.
As far as I know, the fact that no networking CNI-compatible solution was in place should not affect the creation of /var/lib/kubelet/config.yaml:
A) We can see under the kubeadm preflight checks which issues will cause the join phase to fail.
B ) I also tested this issue by removing the current solution I used (Calico) and ran kubeadm reset and kubeadm join again and no errors appeared in the kubeadm logs (I've got the successful execution logs I mentioned above) and /var/lib/kubelet/config.yaml was created properly.
(*) Of course, the cluster can't function in this state - I just wanted to emphasize that I think the problem was one of the options mentioned in A.

MongoDB service cannot be found in service list

I followed the instructions from this link
https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
And when I run
sudo service mongod start
it's shown as an unrecognized service
I know that I may need to use mongodb instead of mongod, but the result is the same.
Then I checked the list of all services with
service --status-all
There's no mongodb-related service in the list
I reinstalled it and the result is the same. I also searched the internet and could not find a solution. I'm installing on Windows's Ubuntu bash. I reinstalled bash once before this issue (14.04 to 16.04). The last time I installed MongoDB, before I reinstalled bash, it worked just fine.
Thank you in advance for your answers.
I faced the same issue. I don't know the reason and solution but you can check the status of mongo itself by this command:
sudo service mongod status
You may see something like this output:
root@lab# service mongod status
● mongod.service - High-performance, schema-free document-oriented database
Loaded: loaded (/lib/systemd/system/mongod.service; disabled; vendor preset: enabled)
Active: active (running) since Fri 2018-04-06 09:47:57 BST; 1min 31s ago
Docs: https://docs.mongodb.org/manual
Main PID: 5973 (mongod)
CGroup: /system.slice/mongod.service
└─5973 /usr/bin/mongod --quiet --config /etc/mongod.conf