What steps are followed in detail when an OSD or a host goes into a failed state? - ceph

I would like to understand in detail what steps are followed:
- when an OSD goes down?
- when a host goes down?
Why do both cases affect the performance of a Ceph cluster?
Thanks in advance.

Related

How can I fix ceph commands hanging after a reboot?

I'm pretty new to Ceph, so I've included all my steps I used to set up my cluster since I'm not sure what is or is not useful information to fix my problem.
I have 4 CentOS 8 VMs in VirtualBox set up to teach myself how to bring up Ceph. 1 is a client and 3 are Ceph monitors. Each ceph node has 6 8Gb drives. Once I learned how the networking worked, it was pretty easy.
I set each VM to have a NAT (for downloading packages) and an internal network that I called "ceph-public". This network would be accessed by each VM on the 10.19.10.0/24 subnet. I then copied the ssh keys from each VM to every other VM.
I followed this documentation to install cephadm, bootstrap my first monitor, and added the other two nodes as hosts. Then I added all available devices as OSDs, created my pools, then created my images, then copied my /etc/ceph folder from the bootstrapped node to my client node. On the client, I ran rbd map mypool/myimage to mount the image as a block device, then used mkfs to create a filesystem on it, and I was able to write data and see the IO from the bootstrapped node. All was well.
Then, as a test, I shut down and restarted the bootstrapped node. When it came back up, I ran ceph status but it just hung with no output. Every single ceph and rbd command now hangs and I have no idea how to recover or properly reset or fix my cluster.
Has anyone ever had the ceph command hang on their cluster, and what did you do to solve it?
Let me share a similar experience. Some time ago I also tried to run some tests on Ceph (Mimic, I think), and my VMs on VirtualBox acted very strangely, nothing comparable with actual bare metal servers, so please bear this in mind... the tests are not quite relevant.
As regarding your problem, try to see the following:
have at least 3 monitors (and an odd number, since the monitors need a majority to form a quorum). It's possible that the hang is because of a monitor election.
make sure the networking part is OK (separated VLANs for ceph servers and clients)
DNS is resolving OK (you have added the server names in the hosts file).
...just my 2 cents...
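To make the monitor-count point concrete: monitors need a strict majority to form a quorum, so the arithmetic looks roughly like the sketch below. The function names are made up for this illustration and are not part of any Ceph tool.

```python
def quorum_size(monitors: int) -> int:
    """Smallest strict majority of `monitors` (hypothetical helper)."""
    if monitors < 1:
        raise ValueError("need at least one monitor")
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """How many monitors can be down while a majority is still up."""
    return monitors - quorum_size(monitors)


# Print monitor count, quorum size, and tolerated failures side by side.
for n in (1, 2, 3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

Note that 4 monitors tolerate no more failures than 3, which is why an odd count is the usual advice. With 3 monitors, one rebooted node should still leave a quorum, so if commands hang it is worth checking whether the other two monitors can actually reach each other.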

AWS Container Insights on ECS has a big delay (~2-3 mins)

I've set up Container Insights on an ECS cluster running Fargate.
I'm experiencing quite a big delay in getting metrics into AWS Container Insights.
When looking at the metric log /aws/ecs/containerinsights/{cluster_name}/performance in Logs Insights:
I can see a delay of 130s to 170s between the @ingestionTime and the @timestamp.
I also see a delay of about 60s between the advertised @ingestionTime and the time the corresponding log appeared in the Logs Insights query.
This apparently also impacts auto-scaling, making it very slow to react.
The metrics are 60s apart, taken at the start of every minute.
Has anyone experienced this, or does anyone know how to tune it?
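For anyone who wants to measure the same gap outside the console, here is a minimal sketch of the calculation. It assumes each log event is a dict with `timestamp` and `ingestionTime` in epoch milliseconds (the shape CloudWatch Logs events normally have); the sample values are made up to match the range reported above.

```python
from statistics import mean


def ingestion_delays(events):
    """Per-event delay in seconds between timestamp and ingestionTime.

    Assumes both fields are epoch milliseconds.
    """
    return [(e["ingestionTime"] - e["timestamp"]) / 1000.0 for e in events]


# Made-up sample events, not real measurements.
sample = [
    {"timestamp": 1_600_000_000_000, "ingestionTime": 1_600_000_130_000},
    {"timestamp": 1_600_000_060_000, "ingestionTime": 1_600_000_230_000},
]

delays = ingestion_delays(sample)
print(min(delays), max(delays), mean(delays))
```

This only quantifies the delay; it does not change how often the metrics are published.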

Master Kubernetes nodes offline GKE (multiple clusters and projects)

This morning we noticed that all Kubernetes clusters in all projects ( 2 projects, 2 clusters per project ) showed unavailable / ERROR in the Google Cloud Console.
The dashboard shows no current issues: https://status.cloud.google.com/
It basically looks like the master nodes are down, the API does not respond and the clusters cannot be edited in the UI. Before the weekend everything was up and since at least yesterday evening they all show in red.
The deployed services fortunately respond, but we cannot manage the cluster in any way.
I reported it here too:
https://issuetracker.google.com/issues/172841082
Did anyone else encounter this and is there any way to restart or trigger the master node to restart? I cannot edit the cluster so an upgrade is not possible either.
I read elsewhere that only SRE folks from Google can (re)start them.
It's beyond me how this can happen.
By the way, auto-repair is set to on and I followed the troubleshooting page, basically with all paths leading to : master node down, nothing to be done.
Any help would be greatly appreciated, or simply an SRE doing a start node action ;).
Thank you @dany L, it was indeed a billing issue.
I'm surprised there is nothing like a message in the Cloud Console and one has to go to billing specifically to find out about this.
After billing was fixed, it took a few minutes before the clusters were available, then everything looked back to normal.

Slow replication recovery due to communication problems

We have recently had the same problem several times in a Google Compute Engine environment with PostgreSQL streaming replication, and I would like to understand the reasons and whether I can repair it in a smoother way.
From time to time we see communication problems in Google's internal network in the GCE datacenter, and they always trigger replication lag between our PG master and its replicas. All machines run Debian 8 and PostgreSQL 9.5.
When this happens, everything seems to be OK: there are no errors in the PG logs on the master or the replicas. Communication between the master and the replicas is just incredibly slow or repeatedly failing, so new WAL files are transferred to the replicas with big delays and the replication lag keeps growing.
Restarting replication from within PostgreSQL, or restarting PostgreSQL on the replica, does not really help: after several WAL files are copied using scp in the recovery command, communication falls back to the previous incredibly slow state. Only a restart of the whole instance helps. When the whole VM is restarted, communication is back to normal and recovery, even from a lag many hours long, is done in a few minutes. So the main reason for this behavior seems to be at the OS level. I tried to check network traffic but did not find anything significant. I also do not see anything relevant in any OS log.
Could restarting some OS service help, so that I do not need to restart the whole VM? Thank you very much for any ideas.
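To put a number on the lag while this is happening, one option is to compare WAL positions from both sides. On 9.5 these come from pg_current_xlog_location() on the master and pg_last_xlog_replay_location() on the replica. Below is a minimal sketch of the byte arithmetic; the helper names and the sample positions are made up, and the 16 MB segment size is only the PostgreSQL default.

```python
WAL_SEGMENT_BYTES = 16 * 1024 * 1024  # default segment size; an assumption here


def lsn_to_int(lsn: str) -> int:
    """Convert a WAL position such as '2/A0000000' to an absolute byte offset."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)


def lag_bytes(master_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica still has to replay."""
    return lsn_to_int(master_lsn) - lsn_to_int(replica_lsn)


# Made-up positions, not taken from a real cluster.
lag = lag_bytes("2/A0000000", "2/9F000000")
print(lag, lag / WAL_SEGMENT_BYTES)
```

Sampling this every minute or so would show whether the lag grows steadily or in bursts, which may help separate a network problem from a slow restore command.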

How to create Ceph Filesystem after Ceph Object Storage Cluster Setup?

I successfully set up a Ceph Object Storage Cluster based on this tutorial: https://www.twoptr.com/2018/05/installing-ceph-luminous.html.
Now I am stuck because I would like to add an MDS node in order to set up a Ceph Filesystem from that cluster. I have already set up the MDS node and tried to set up the FS, following several different guides and tutorials (e.g. the Ceph docs), but nothing has really worked so far.
I would be very grateful if someone could point me in the right direction on how to do this the right way.
My setup includes 5 VMs with Ubuntu 16.04 server installed:
ceph-1 (mon, mgr, osd.0)
ceph-2 (osd.1)
ceph-3 (osd.2)
ceph-4 (radosgw, client)
ceph-5 (mds)
I also tried to create a pool which seemed to work, because it's showing in the Ceph Dashboard, which I installed on ceph-1. But I am not sure how to continue....
Thank you for your help!
Hi, your installation is not standard.
Please read the link below; it is very helpful for installing Ceph:
http://docs.ceph.com/docs/master/start/quick-ceph-deploy/
then
http://docs.ceph.com/docs/mimic/cephfs/createfs/
For erasure coding, see the link below:
http://karan-mj.blogspot.com/2014/04/erasure-coding-in-ceph.html
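As a rough outline of what the createfs page walks through: a filesystem needs a data pool and a metadata pool, and then ceph fs new ties them together. The sketch below only builds the command lines and does not run anything. The pool names, filesystem name, and placement group count are placeholders chosen for illustration, so check them against the docs for your release.

```python
def cephfs_create_commands(fs_name, metadata_pool, data_pool, pg_num):
    """Build (but do not run) the commands for creating a CephFS.

    All argument values are caller-supplied placeholders.
    """
    return [
        ["ceph", "osd", "pool", "create", data_pool, str(pg_num)],
        ["ceph", "osd", "pool", "create", metadata_pool, str(pg_num)],
        ["ceph", "fs", "new", fs_name, metadata_pool, data_pool],
    ]


# Placeholder names and PG count, not a sizing recommendation.
commands = cephfs_create_commands("cephfs", "cephfs_metadata", "cephfs_data", 32)
for cmd in commands:
    print(" ".join(cmd))
```

The filesystem only becomes active once an MDS daemon is running, so the MDS on ceph-5 has to be up as well.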