Ceph Mgr not responding to certain commands - ceph

We have a ceph cluster built with rook (2 mgrs, 3 mons, 2 mds per cephfs, 24 osds, rook: 1.9.3, ceph: 16.2.7, kubelet: 1.24.1). Our operation requires constantly creating and deleting cephfilesystems. Over time we experienced issues with rook-ceph-mgr. A week or two after the cluster was built, rook-ceph-mgr stopped responding to certain ceph commands, such as ceph osd pool autoscale-status and ceph fs subvolumegroup ls, while other commands, such as ceph -s, worked fine. We had to restart rook-ceph-mgr to get it going again. Now we have around 30 cephfilesystems and the issue happens more frequently.
We tried disabling the mgr modules dashboard, prometheus and iostat, turning ceph progress off, and increasing mgr_stats_period and mon_mgr_digest_period. That didn't help much; the issue happened again after one or two create/delete cycles.
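For reference, the workarounds above correspond roughly to the following commands, run from the rook-ceph toolbox pod (the namespace, the mgr deployment name rook-ceph-mgr-a and the increased values shown here are assumptions based on a default Rook install, not our exact config):
ceph mgr module disable dashboard
ceph mgr module disable prometheus
ceph mgr module disable iostat
ceph progress off
ceph config set mgr mgr_stats_period 15        # default is 5; 15 is only an illustrative value
ceph config set mon mon_mgr_digest_period 15   # default is 5; 15 is only an illustrative value
# restarting the mgr from outside the toolbox when it stops answering:
kubectl -n rook-ceph rollout restart deployment rook-ceph-mgr-a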

Related

MongoDB terminated with exit code 14 (error) in Kubernetes many times

I have created a 3-member mongoDB replicaSet with the Bitnami helm chart in k8s and mounted the data on my NAS.
However, after several restarts (due to modifying some config), my mongoDB becomes unhealthy and ends up being terminated many times.
The last state shows: 'Terminated at Feb 25, 2022 11:20:04 AM with exit code 14 (error)'
I think maybe I should clean the data on the NAS and restart mongoDB again.
But I really want to know what causes this and how to solve the error.
Has anyone ever had the same issue?
Thanks.
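A minimal sketch of how to inspect why the container exited, assuming the default Bitnami naming (pod mongodb-0 in namespace mongo; adjust to your release name):
kubectl -n mongo describe pod mongodb-0                 # Last State / Reason / Exit Code of the previous run
kubectl -n mongo logs mongodb-0 --previous              # mongod output from the container that exited with 14
kubectl -n mongo get events --sort-by=.lastTimestamp    # OOM kills, failed mounts, probe failures, etc.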

How can I fix ceph commands hanging after a reboot?

I'm pretty new to Ceph, so I've included all my steps I used to set up my cluster since I'm not sure what is or is not useful information to fix my problem.
I have 4 CentOS 8 VMs in VirtualBox set up to teach myself how to bring up Ceph: 1 is a client and 3 are Ceph monitors. Each Ceph node has six 8 GB drives. Once I learned how the networking worked, it was pretty easy.
I set each VM to have a NAT (for downloading packages) and an internal network that I called "ceph-public". This network would be accessed by each VM on the 10.19.10.0/24 subnet. I then copied the ssh keys from each VM to every other VM.
I followed this documentation to install cephadm, bootstrap my first monitor, and added the other two nodes as hosts. Then I added all available devices as OSDs, created my pools, then created my images, then copied my /etc/ceph folder from the bootstrapped node to my client node. On the client, I ran rbd map mypool/myimage to mount the image as a block device, then used mkfs to create a filesystem on it, and I was able to write data and see the IO from the bootstrapped node. All was well.
Then, as a test, I shutdown and restarted the bootstrapped node. When it came back up, I ran ceph status but it just hung with no output. Every single ceph and rbd command now hangs and I have no idea how to recover or properly reset or fix my cluster.
Has anyone ever had the ceph command hang on their cluster, and what did you do to solve it?
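For context, the setup steps above map roughly onto the following cephadm workflow (the pool and image names come from the question; the host names, IPs within 10.19.10.0/24, image size and filesystem type are assumptions):
cephadm bootstrap --mon-ip 10.19.10.11          # bootstrap the first monitor; IP is assumed
ceph orch host add ceph-node2 10.19.10.12
ceph orch host add ceph-node3 10.19.10.13
ceph orch apply osd --all-available-devices
ceph osd pool create mypool
rbd pool init mypool
rbd create mypool/myimage --size 4G
# on the client, after copying /etc/ceph from the bootstrap node:
rbd map mypool/myimage
mkfs.ext4 /dev/rbd0                             # ext4 is assumed; the question only says mkfs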
Let me share a similar experience. Some time ago I also tried to run some tests on Ceph (mimic, I think) and my VMs in VirtualBox acted very strangely, nothing comparable with actual bare-metal servers, so please bear this in mind... the tests are not quite relevant.
As regards your problem, try to check the following:
have at least 3 monitors (an odd number is recommended). It's possible that the hang is because of a monitor election; see the sketch after this list for a way to check quorum when the CLI hangs.
make sure the networking part is OK (separate VLANs for ceph servers and clients)
DNS is resolving OK (you have added the server names to /etc/hosts)
...just my 2 cents...
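If monitor election is the suspect, here is a minimal sketch of how to check it when the normal CLI hangs (this assumes cephadm-deployed daemons and that the monitor id is the node's short hostname, which is cephadm's default):
# on the bootstrapped node: did the ceph daemons actually come back after the reboot?
systemctl list-units 'ceph*'
cephadm ls
# ask the local monitor directly over its admin socket; this works even when cluster-wide commands hang
cephadm shell -- ceph daemon mon.$(hostname -s) mon_status
cephadm shell -- ceph daemon mon.$(hostname -s) quorum_status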

ceph 15.2.4 -- authentication changes -- unable to reattach with kubernetes

Yesterday my teammates found a way to disable cephx authentication cluster-wide (2-server cluster) in order to bypass issues that were preventing us from joining a 3rd server. Unfortunately, they were uncertain which of the steps taken led to the successful addition. I would appreciate assistance getting my ceph operational again. Yesterday we left off after editing /etc/ceph/ceph.conf, turning authentication back on there, then copying the file to /var/lib/ceph///config and ensuring permissions were set to 644.
This got one command working that previously had not been -- ceph osd df correctly shows all 24 OSDs again, but I cannot run ceph osd status or ceph orch status.
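One thing that may matter here: ceph osd df is answered by the monitors, while ceph osd status and ceph orch status are served by the mgr, so the mgr and its cephx key seem worth checking first. A minimal sketch of what to verify (the exact mgr key name depends on your daemon names, so the grep below just lists whatever mgr entities exist):
ceph mgr stat                               # is any mgr active at all?
ceph auth ls | grep -A3 '^mgr\.'            # do the mgr keys still exist after the cephx changes?
ceph config get mon auth_cluster_required   # confirm which auth settings are actually in effect
ceph config get mon auth_service_required
ceph config get mon auth_client_required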

Compute Engine unhealthy instance down 50% of the time

I started to use google cloud 3 days ago or so, so I am completely new to it.
I have 4 pods deployed to Google Kubernetes Engine:
Frontend: react app,
Redis,
Backend: made up of 2 containers, a nodejs server and a cloudsql-proxy,
Nginx-ingress-controller
I also have an SQL instance running for my PostgreSQL database, hence the cloudsql-proxy container.
This setup works well 50% of the time, but every now and then all the pods crash and/or the containers are recreated.
I tried to check all the relevant logs, but I really don't know which ones are actually relevant. There is one thing I found that correlates with my issue: I have 2 VM instances running, and one of them might be the faulty one:
When I hover over the loading spinner, it says "Instance is being verified", and it seems to be in this state 80% of the time; when it is not, there is a yellow warning beside the name of the instance saying "The resource is not ready".
Here is the CPU usage of the instance (the trend is the same for all the hardware). I checked the logs of my frontend and backend containers; here are the last logs that correspond to a CPU drop:
2019-03-13 01:45:23.533 CET - 🚀 Server ready
2019-03-13 01:45:33.477 CET - 2019/03/13 00:45:33 Client closed local connection on 127.0.0.1:5432
2019-03-13 01:54:07.270 CET - yarn run v1.10.1
As you can see here, all the pods are being recreated...
I think that it might come from the fact that the faulty instance is unhealthy:
Instance gke-*****-production-default-pool-0de6d459-qlxk is unhealthy for ...
...the health check keeps recreating/restarting the instance again and again. Tell me if I am wrong.
So, how can I discover what is making this instance unhealthy?
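A minimal sketch of commands that surface the node-level reasons (the instance name is copied from the question as shown; the zone is a placeholder):
kubectl describe node gke-*****-production-default-pool-0de6d459-qlxk       # node conditions: Ready, MemoryPressure, DiskPressure
kubectl get events --all-namespaces --sort-by=.lastTimestamp                # OOM kills, evictions, failed probes
kubectl top nodes                                                           # requires metrics-server; shows actual cpu/memory usage
gcloud compute instances get-serial-port-output gke-*****-production-default-pool-0de6d459-qlxk --zone <zone>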

Linux kernel tune in Google Container Engine

I deployed a redis container to Google Container Engine and got the following warning.
10:M 01 Mar 05:01:46.140 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
I know that to correct the warning I need to run
echo never > /sys/kernel/mm/transparent_hugepage/enabled
I tried that in the container, but it does not help.
How do I resolve this warning in Google Container Engine?
As I understand it, my pods are running on the node, and the node is a VM private to me only? So should I ssh to the node and modify the kernel setting directly?
Yes, you own the nodes and can ssh into them and modify them as you need.
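A minimal sketch of doing that from your workstation, with a made-up node name and zone (substitute your own from kubectl get nodes):
gcloud compute ssh gke-mycluster-default-pool-abcd1234-wxyz --zone us-central1-a \
  --command "sudo sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'"
Note that the setting does not survive the node being recreated (upgrades, autoscaling), so it usually has to be reapplied, for example from a privileged DaemonSet or a node startup script.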