Ceph Monitor out of quorum - ceph

We're experiencing a problem with one of our Ceph monitors. The cluster uses 3 monitors and they are all up and running. They can communicate with each other and each returns a sensible ceph -s output. However, the quorum shows the second monitor as down. The ceph -s output from the supposedly down monitor is below:
cluster:
id: bb1ab46a-d282-4530-bf5c-021e9c940958
health: HEALTH_WARN
insufficient standby MDS daemons available
noout flag(s) set
9 large omap objects
47 pgs not deep-scrubbed in time
application not enabled on 2 pool(s)
1/3 mons down, quorum mon1,mon3
services:
mon: 3 daemons, quorum mon1,mon3 (age 3d), out of quorum: mon2
mgr: mon1(active, since 3d)
mds: filesystem:1 {0=mon1=up:active}
osd: 77 osds: 77 up (since 3d), 77 in (since 2w)
flags noout
rbd-mirror: 1 daemon active (12512649)
rgw: 1 daemon active (mon1)
data:
pools: 13 pools, 1500 pgs
objects: 65.36M objects, 23 TiB
usage: 85 TiB used, 701 TiB / 785 TiB avail
pgs: 1500 active+clean
io:
client: 806 KiB/s wr, 0 op/s rd, 52 op/s wr
systemctl status ceph-mon@2.service shows:
ceph-mon@2.service - Ceph cluster monitor daemon
Loaded: loaded (/usr/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: disabled)
Active: failed (Result: start-limit) since Tue 2020-12-08 12:12:58 +03; 28s ago
Process: 2681 ExecStart=/usr/bin/ceph-mon -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
Main PID: 2681 (code=exited, status=1/FAILURE)
Dec 08 12:12:48 mon2 systemd[1]: Unit ceph-mon@2.service entered failed state.
Dec 08 12:12:48 mon2 systemd[1]: ceph-mon@2.service failed.
Dec 08 12:12:58 mon2 systemd[1]: ceph-mon@2.service holdoff time over, scheduling restart.
Dec 08 12:12:58 mon2 systemd[1]: Stopped Ceph cluster monitor daemon.
Dec 08 12:12:58 mon2 systemd[1]: start request repeated too quickly for ceph-mon@2.service
Dec 08 12:12:58 mon2 systemd[1]: Failed to start Ceph cluster monitor daemon.
Dec 08 12:12:58 mon2 systemd[1]: Unit ceph-mon@2.service entered failed state.
Dec 08 12:12:58 mon2 systemd[1]: ceph-mon@2.service failed.
Restarting, stopping/starting, and enabling/disabling the monitor daemon did not work. The docs mention the monitor's asok file in /var/run/ceph, and I don't have it in that directory, yet the other monitors have their asok files right in place. And now I'm in a state where I can't even stop the monitor daemon on the second monitor; it just stays in the failed state. There are no logs in the /var/log/ceph monitor logs. What am I supposed to do? I don't have much experience with Ceph, so I don't want to change things without being absolutely sure, in order to avoid messing up the cluster.

Try to start the service manually on mon2 (note the cluster name defaults to lowercase "ceph") with:
/usr/bin/ceph-mon -f --cluster ceph --id 2 --setuser ceph --setgroup ceph
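If that also exits immediately, a couple of generic checks are worth doing before touching the rest of the cluster. The status output shows "start-limit", which means systemd gave up restarting after repeated fast failures, so that state has to be cleared first. This is a sketch only: the paths are the default layout, and the mon ID "2" is taken from the ceph-mon@2.service unit above; adjust both to your deployment.

```shell
# clear systemd's start-limit state so the unit can be started again
systemctl reset-failed ceph-mon@2.service

# check that the mon's data directory exists and is owned by ceph:ceph
# (an ownership problem is a common cause of an instant exit)
ls -ld /var/lib/ceph/mon/ceph-2

# run the daemon in the foreground with logging to stderr to see the
# real error, since nothing is reaching /var/log/ceph
/usr/bin/ceph-mon -d --cluster ceph --id 2 --setuser ceph --setgroup ceph
```

Whatever the foreground run prints is the actual reason the service dies; fix that first, then start the unit normally.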

Related

Why am I getting the following errors?

Could someone help me with this error, please:
Output for command: systemctl start postgresql-13.service
Job for postgresql-13.service failed because the control process exited with error code.
See "systemctl status postgresql-13.service" and "journalctl -xeu postgresql-13.service" for details.
Output for command: systemctl status postgresql-13.service
× postgresql-13.service - PostgreSQL 13 database server
Loaded: loaded (/usr/lib/systemd/system/postgresql-13.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Tue 2022-08-23 12:17:50 CDT; 1min 47s ago
Docs: https://www.postgresql.org/docs/13/static/
Process: 1079 ExecStartPre=/usr/pgsql-13/bin/postgresql-13-check-db-dir ${PGDATA} (code=exited, status=0/SUCCESS)
Process: 1110 ExecStart=/usr/pgsql-13/bin/postmaster -D ${PGDATA} (code=exited, status=1/FAILURE)
Main PID: 1110 (code=exited, status=1/FAILURE)
CPU: 40ms
Aug 23 12:17:47 fedora systemd[1]: Starting PostgreSQL 13 database server...
Aug 23 12:17:50 fedora postmaster[1110]: 2022-08-23 12:17:50.144 CDT [1110] LOG: redirecting log output to logging collector process
Aug 23 12:17:50 fedora postmaster[1110]: 2022-08-23 12:17:50.144 CDT [1110] HINT: Future log output will appear in directory "log".
Aug 23 12:17:50 fedora systemd[1]: postgresql-13.service: Main process exited, code=exited, status=1/FAILURE
Aug 23 12:17:50 fedora systemd[1]: postgresql-13.service: Killing process 1132 (postmaster) with signal SIGKILL.
Aug 23 12:17:50 fedora systemd[1]: postgresql-13.service: Killing process 1132 (postmaster) with signal SIGKILL.
Aug 23 12:17:50 fedora systemd[1]: postgresql-13.service: Failed with result 'exit-code'.
Aug 23 12:17:50 fedora systemd[1]: postgresql-13.service: Unit process 1132 (postmaster) remains running after unit stopped.
Aug 23 12:17:50 fedora systemd[1]: Failed to start PostgreSQL 13 database server.
I already uninstalled and reinstalled PostgreSQL, but nothing works. I also tried installing postgresql-14, but I get the same error.
I need PostgreSQL in order to work with Ruby on Rails.
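One thing the journal output above does tell us: log output is redirected to the logging collector, so the actual error message never reaches journalctl. It should be in the "log" directory under the data directory. The path below is the PGDG package default on Fedora and may differ on your system:

```shell
# list and read the server's own log files (data directory path is
# the PGDG default; adjust if your PGDATA is elsewhere)
sudo ls /var/lib/pgsql/13/data/log/
sudo tail -n 50 /var/lib/pgsql/13/data/log/postgresql-*.log
```

The last few lines of that file will contain the FATAL message explaining why the postmaster exited.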

Unable to start PostgreSQL?

I have had PostgreSQL installed for a long time, but it suddenly stopped working when I added an IP address to the pg_hba file.
I'm trying to run it using sudo service postgresql restart, but this doesn't work.
Also
root@ubuntu-dev-server:/# systemctl start postgresql@9.6-main
Job for postgresql@9.6-main.service failed because the service did not take the steps required by its unit configuration.
See "systemctl status postgresql@9.6-main.service" and "journalctl -xe" for details.
And then,
root@ubuntu-dev-server:/# systemctl status postgresql@9.6-main.service
● postgresql@9.6-main.service - PostgreSQL Cluster 9.6-main
Loaded: loaded (/lib/systemd/system/postgresql@.service; indirect; vendor preset: enabled)
Active: failed (Result: protocol) since Fri 2020-07-24 16:56:06 UTC; 1min 54s ago
Process: 31877 ExecStop=/usr/bin/pg_ctlcluster --skip-systemctl-redirect -m fast 9.6-main stop (code=exited, status=0/SUCCESS)
Process: 951 ExecStart=/usr/bin/pg_ctlcluster --skip-systemctl-redirect 9.6-main start (code=exited, status=1/FAILURE)
Main PID: 24944 (code=exited, status=0/SUCCESS)
Jul 24 16:56:06 ubuntu-dev-server systemd[1]: Starting PostgreSQL Cluster 9.6-main...
Jul 24 16:56:06 ubuntu-dev-server postgresql@9.6-main[951]: The PostgreSQL server failed to start. Please check the log output:
Jul 24 16:56:06 ubuntu-dev-server postgresql@9.6-main[951]: 2020-07-24 16:56:06.236 UTC [956] FATAL: could not map anonymous shared memory: Cannot allocate memory
Jul 24 16:56:06 ubuntu-dev-server postgresql@9.6-main[951]: 2020-07-24 16:56:06.236 UTC [956] HINT: This error usually means that PostgreSQL's request for a shared memory segment exceeded available memory, swap space, or huge pages. To reduce the request size (currently 148471808 bytes), reduce PostgreSQL's shared memory usage, perhaps by reducing shared_buffers or max_connections.
Jul 24 16:56:06 ubuntu-dev-server postgresql@9.6-main[951]: 2020-07-24 16:56:06.236 UTC [956] LOG: database system is shut down
Jul 24 16:56:06 ubuntu-dev-server systemd[1]: postgresql@9.6-main.service: Can't open PID file /run/postgresql/9.6-main.pid (yet?) after start: No such file or directory
Jul 24 16:56:06 ubuntu-dev-server systemd[1]: postgresql@9.6-main.service: Failed with result 'protocol'.
Jul 24 16:56:06 ubuntu-dev-server systemd[1]: Failed to start PostgreSQL Cluster 9.6-main.
I have tried several things, but it still does not start. Any recommendation is welcome.
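Note that the FATAL line above is the actual cause: it's an out-of-memory error, not a pg_hba problem. The server could not allocate the 148471808-byte shared memory segment it asked for. A quick bit of arithmetic to see what that request amounts to:

```shell
# the request size from the HINT above, converted to MiB
req_bytes=148471808
echo "$((req_bytes / 1024 / 1024)) MiB requested"
# prints: 141 MiB requested
```

So the server wants roughly 141 MiB of shared memory up front. Either free memory or add swap on the machine, or shrink the request by lowering shared_buffers (and/or max_connections) in postgresql.conf, e.g. shared_buffers = 64MB, then restart.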

Preemptible node is sometimes failing to join the GKE cluster

I have a preemptible node pool of size 1 on GKE. I've been running this node pool with size 1 for almost a month now. Every day the node restarts after 24 hours and rejoins the cluster. Today it restarted but did not rejoin the cluster.
Instead, I noticed that according to gcloud compute instances list the underlying instance was running but not included in the output of kubectl get node. I increased the node pool size to 2, whereupon a second instance was launched. That node immediately joined my GKE cluster and pods were scheduled onto it. The first node is still running according to gcloud, but it won't join the cluster.
What's going on? How can I debug this problem?
Update:
I SSHed into the instance and was immediately greeted with this excellent error message:
Broken (or in progress) Kubernetes node setup! Check the cluster initialization status
using the following commands:
Master instance:
- sudo systemctl status kube-master-installation
- sudo systemctl status kube-master-configuration
Node instance:
- sudo systemctl status kube-node-installation
- sudo systemctl status kube-node-configuration
The results of sudo systemctl status kube-node-installation:
● kube-node-installation.service - Download and install k8s binaries and configurations
Loaded: loaded (/etc/systemd/system/kube-node-installation.service; enabled; vendor preset: disabled)
Active: active (exited) since Thu 2017-12-28 21:08:53 UTC; 6h ago
Process: 945 ExecStart=/home/kubernetes/bin/configure.sh (code=exited, status=0/SUCCESS)
Process: 941 ExecStartPre=/bin/chmod 544 /home/kubernetes/bin/configure.sh (code=exited, status=0/SUCCESS)
Process: 937 ExecStartPre=/usr/bin/curl --fail --retry 5 --retry-delay 3 --silent --show-error -H X-Google-Metadata-Request: True -o /home/kubernetes/bin/configure.sh http://metadata.google.internal/computeMetadata/v1/instance/attributes/configure-sh (code=exited, status=0/SUCCESS)
Process: 933 ExecStartPre=/bin/mount -o remount,exec /home/kubernetes/bin (code=exited, status=0/SUCCESS)
Process: 930 ExecStartPre=/bin/mount --bind /home/kubernetes/bin /home/kubernetes/bin (code=exited, status=0/SUCCESS)
Process: 925 ExecStartPre=/bin/mkdir -p /home/kubernetes/bin (code=exited, status=0/SUCCESS)
Main PID: 945 (code=exited, status=0/SUCCESS)
Tasks: 0 (limit: 4915)
Memory: 0B
CPU: 0
CGroup: /system.slice/kube-node-installation.service
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: Downloading node problem detector.
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: % Total % Received % Xferd Average Speed Time Time Time Current
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: Dload Upload Total Spent Left Speed
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: [158B blob data]
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: == Downloaded https://storage.googleapis.com/kubernetes-release/node-problem-detector/node-problem-detector-v0.4.1.tar.gz (SHA1 = a57a3fe64cab8a18ec654f5cef0aec59dae62568) ==
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: cni-0799f5732f2a11b329d9e3d51b9c8f2e3759f2ff.tar.gz is preloaded.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: kubernetes-manifests.tar.gz is preloaded.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: mounter is preloaded.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: Done for installing kubernetes files
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: Started Download and install k8s binaries and configurations.
And the result of sudo systemctl status kube-node-configuration:
● kube-node-configuration.service - Configure kubernetes node
Loaded: loaded (/etc/systemd/system/kube-node-configuration.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2017-12-28 21:08:53 UTC; 6h ago
Process: 994 ExecStart=/home/kubernetes/bin/configure-helper.sh (code=exited, status=4)
Process: 990 ExecStartPre=/bin/chmod 544 /home/kubernetes/bin/configure-helper.sh (code=exited, status=0/SUCCESS)
Main PID: 994 (code=exited, status=4)
CPU: 33ms
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: Starting Configure kubernetes node...
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Start to configure instance for kubernetes
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Configuring IP firewall rules
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: kube-node-configuration.service: Main process exited, code=exited, status=4/NOPERMISSION
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: Failed to start Configure kubernetes node.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: kube-node-configuration.service: Unit entered failed state.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: kube-node-configuration.service: Failed with result 'exit-code'.
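The failure here lines up with the three "xtables lock" messages: iptables bailed out because another process (likely kube-proxy or Docker, both of which touch the firewall at boot) was holding /run/xtables.lock at that moment, which would also explain why a later manual restart succeeded. Modern iptables can wait for the lock instead of failing immediately; purely as an illustration of the -w flag the log is hinting at (the rule itself is hypothetical):

```shell
# -w N makes iptables wait up to N seconds for the xtables lock
# instead of exiting with an error when another process holds it
iptables -w 5 -A INPUT -p tcp --dport 10250 -j ACCEPT
```

Whether GKE's configure-helper.sh uses -w depends on the image version, so a transient lock contention at boot can race exactly like this.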
So it looks like kube-node-configuration failed. I ran sudo systemctl restart kube-node-configuration and now the status output is:
● kube-node-configuration.service - Configure kubernetes node
Loaded: loaded (/etc/systemd/system/kube-node-configuration.service; enabled; vendor preset: disabled)
Active: active (exited) since Fri 2017-12-29 03:41:36 UTC; 3s ago
Main PID: 20802 (code=exited, status=0/SUCCESS)
CPU: 1.851s
Dec 29 03:41:28 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Extend the docker.service configuration to set a higher pids limit
Dec 29 03:41:28 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Docker command line is updated. Restart docker to pick it up
Dec 29 03:41:30 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Start kubelet
Dec 29 03:41:35 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Using kubelet binary at /home/kubernetes/bin/kubelet
Dec 29 03:41:35 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Start kube-proxy static pod
Dec 29 03:41:35 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Start node problem detector
Dec 29 03:41:35 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Using node problem detector binary at /home/kubernetes/bin/node-problem-detector
Dec 29 03:41:36 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Prepare containerized mounter
Dec 29 03:41:36 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Done for the configuration for kubernetes
Dec 29 03:41:36 gke-cluster0-pool-d59e9506-g9sc systemd[1]: Started Configure kubernetes node.
...and the node joined the cluster :). But, my original question stands: what happened?
We were experiencing a similar problem on GKE with preemptible nodes, seeing error messages like these from the nodes:
Extend the docker.service configuration to set a higher pids limit
Docker command line is updated. Restart docker to pick it up
level=info msg="Processing signal 'terminated'"
level=info msg="stopping event stream following graceful shutdown" error="<nil>" module=libcontainerd namespace=moby
level=info msg="Daemon shutdown complete"
docker daemon exited
Start kubelet
After about a month of back-and-forth with Google Support, we learned that the nodes were getting preempted and replaced, and the replacement node comes up with the same name; it all happens without the normal pod disruption of a node being evicted.
Backstory: we ran into this problem because Jenkins was running its workers on the nodes, and during this ~2-minute "restart" of the node going away and returning, the Jenkins master would lose its connection and fail the job.
tl;dr: don't use preemptible nodes for this kind of work.

Filebeat Service will not start on RHEL 7

I have a problem with my Filebeat installation.
When I try to start it with "service filebeat start", it says "Starting Filebeat". After "service filebeat status" I get 4 PIDs (up to here everything looks "normal"):
[root@(Server) run]# service filebeat status
Filebeat is running with pid: 30650 30657 30658 30659
But after checking the PID, we see that it is not running:
[root@(Server) run]# ps -ef | grep 30650
root 30665 31360 0 16:27 pts/0 00:00:00 grep --color=auto 30650
Trying to start it with systemctl doesn't help:
[root@(Server) run]# systemctl start filebeat
Job for filebeat.service failed because a configured resource limit was exceeded. See "systemctl status filebeat.service" and "journalctl -xe" for details.
Status says:
[root@Server run]# systemctl status filebeat
● filebeat.service - LSB: start and stop filebeat
Loaded: loaded (/etc/rc.d/init.d/filebeat; bad; vendor preset: disabled)
Active: failed (Result: resources) since Tue 2017-09-26 16:30:33 CEST; 1min 41s ago
Docs: man:systemd-sysv-generator(8)
Process: 32118 ExecStart=/etc/rc.d/init.d/filebeat start (code=exited, status=0/SUCCESS)
Sep 26 16:30:33 Server... systemd[1]: Starting LSB: start and stop filebeat...
Sep 26 16:30:33 Server... filebeat[32118]: Starting Filebeat
Sep 26 16:30:33 Server... su[32119]: (to user) root on none
Sep 26 16:30:33 Server... systemd[1]: PID file /var/run/filebeat.pid not readable (yet?) after start.
Sep 26 16:30:33 Server... systemd[1]: Failed to start LSB: start and stop filebeat.
Sep 26 16:30:33 Server... systemd[1]: Unit filebeat.service entered failed state.
Sep 26 16:30:33 Server... systemd[1]: filebeat.service failed.
Does somebody have any idea?
Regards
The problem was ownership ("chown permissions"). I had not installed Filebeat as root, and the "data" directory had root user and group ownership. After changing that, it runs and starts automatically after boot.
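As a sketch of that fix (the path below is the package default for the RPM install, and you should chown to whichever user the service actually runs as; both are assumptions, so adjust to your setup):

```shell
# give the data directory back to the service user,
# then clear the failed state and start again
chown -R root:root /var/lib/filebeat
systemctl reset-failed filebeat
systemctl start filebeat
systemctl enable filebeat   # start automatically after boot
```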
Regards

Job for kube-apiserver.service failed because the control process exited with error code

At the beginning I want to point out that I am fairly new to Linux systems, and totally, totally new to Kubernetes, so my question may be trivial.
As stated in the title, I have a problem with setting up the Kubernetes cluster. I am working on Atomic Host Version: 7.1707 (2017-07-31 16:12:06).
I am following this guide:
http://www.projectatomic.io/docs/gettingstarted/
In addition to that, I followed this:
http://www.projectatomic.io/docs/kubernetes/
(to be precise, I ran this command:
rpm-ostree install kubernetes-master --reboot).
Everything was going fine until this point:
systemctl start etcd kube-apiserver kube-controller-manager kube-scheduler
The problem is with:
systemctl start etcd kube-apiserver
as it gives me back this response:
Job for kube-apiserver.service failed because the control process
exited with error code. See "systemctl status kube-apiserver.service"
and "journalctl -xe" for details.
systemctl status kube-apiserver.service
gives me back:
● kube-apiserver.service - Kubernetes API Server
Loaded: loaded (/usr/lib/systemd/system/kube-apiserver.service; enabled; vendor preset: disabled)
Active: failed (Result: start-limit) since Fri 2017-08-25 14:29:56 CEST; 2s ago
Docs: https://github.com/GoogleCloudPlatform/kubernetes
Process: 17876 ExecStart=/usr/bin/kube-apiserver $KUBE_LOGTOSTDERR $KUBE_LOG_LEVEL $KUBE_ETCD_SERVERS $KUBE_API_ADDRESS $KUBE_API_PORT $KUBELET_PORT $KUBE_ALLOW_PRIV $KUBE_SERVICE_ADDRESSES $KUBE_ADMISSION_CONTROL $KUBE_API_ARGS (code=exited, status=255)
Main PID: 17876 (code=exited, status=255)
Aug 25 14:29:56 master systemd[1]: kube-apiserver.service: main process exited, code=exited, status=255/n/a
Aug 25 14:29:56 master systemd[1]: Failed to start Kubernetes API Server.
Aug 25 14:29:56 master systemd[1]: Unit kube-apiserver.service entered failed state.
Aug 25 14:29:56 master systemd[1]: kube-apiserver.service failed.
Aug 25 14:29:56 master systemd[1]: kube-apiserver.service holdoff time over, scheduling restart.
Aug 25 14:29:56 master systemd[1]: start request repeated too quickly for kube-apiserver.service
Aug 25 14:29:56 master systemd[1]: Failed to start Kubernetes API Server.
Aug 25 14:29:56 master systemd[1]: Unit kube-apiserver.service entered failed state.
Aug 25 14:29:56 master systemd[1]: kube-apiserver.service failed.
I have no clue where to start, and I will be more than thankful for any advice.
It turned out to be a typo in /etc/kubernetes/config. I had misunderstood the "# Comma separated list of nodes in the etcd cluster" comment.
I don't know how to close the thread or anything.
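For anyone landing here with the same symptom: the relevant line lives in /etc/kubernetes/config, and the etcd list must be comma-separated full URLs with no spaces. A minimal sketch with a hypothetical single local etcd (the flag spelling varies slightly between Kubernetes versions):

```shell
# /etc/kubernetes/config
# Comma separated list of nodes in the etcd cluster
KUBE_ETCD_SERVERS="--etcd-servers=http://127.0.0.1:2379"
```

A stray space or a missing scheme/port in that value makes kube-apiserver exit immediately with status 255, exactly as in the log above.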