Why does my new Ceph cluster status never show 'HEALTH_OK'? - ceph

I'm working on setting up a Ceph cluster with Docker and the image 'ceph/daemon:v3.1.0-stable-3.1-luminous-centos-7', but after the cluster has been set up, the ceph status command never reaches HEALTH_OK. Here is my cluster's information. It has enough disk space and the network is fine.
My questions are:
Why does Ceph not replicate the 'undersized' PGs?
How can I fix it?
Thank you very much!
➜ ~ ceph -s
cluster:
id: 483a61c4-d3c7-424d-b96b-311d2c6eb69b
health: HEALTH_WARN
Degraded data redundancy: 3 pgs undersized
services:
mon: 3 daemons, quorum pc-10-10-0-13,pc-10-10-0-89,pc-10-10-0-160
mgr: pc-10-10-0-89(active), standbys: pc-10-10-0-13, pc-10-10-0-160
mds: cephfs-1/1/1 up {0=pc-10-10-0-160=up:active}, 2 up:standby
osd: 5 osds: 5 up, 5 in
rbd-mirror: 3 daemons active
rgw: 3 daemons active
data:
pools: 6 pools, 68 pgs
objects: 212 objects, 5.27KiB
usage: 5.02GiB used, 12.7TiB / 12.7TiB avail
pgs: 65 active+clean
3 active+undersized
➜ ~ ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 12.73497 root default
-5 0.90959 host pc-10-10-0-13
3 hdd 0.90959 osd.3 up 1.00000 1.00000
-7 0.90959 host pc-10-10-0-160
4 hdd 0.90959 osd.4 up 1.00000 1.00000
-3 10.91579 host pc-10-10-0-89
0 hdd 3.63860 osd.0 up 1.00000 1.00000
1 hdd 3.63860 osd.1 up 1.00000 1.00000
2 hdd 3.63860 osd.2 up 1.00000 1.00000
➜ ~ ceph osd pool ls detail
pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 24 flags hashpspool stripe_width 0 application cephfs
pool 2 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 24 flags hashpspool stripe_width 0 application cephfs
pool 3 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 13 pgp_num 13 last_change 27 flags hashpspool stripe_width 0 application rgw
pool 4 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 13 pgp_num 13 last_change 30 flags hashpspool stripe_width 0 application rgw
pool 5 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 13 pgp_num 13 last_change 32 owner 18446744073709551615 flags hashpspool stripe_width 0 application rgw
pool 6 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 13 pgp_num 13 last_change 34 flags hashpspool stripe_width 0 application rgw

@itsafire This is not the solution. He is asking for a solution, not for a hardware recommendation.
I'm running multiple Ceph clusters of 8 and 5 nodes. I always use 2 replicas with multiple CRUSH maps (for SSD, SAS and 7.2k drives).
Why would you need 3 replicas if you are using a small cluster with limited resources?
Could you please explain why my solution is a recipe for disaster? You have a good reputation and I'm not sure how you got it. Maybe just by replying with recommendations, not solutions.

Create a new pool with size 2 and min_size 1.
For pg_num, use the Ceph PG Calculator: https://ceph.com/pgcalc/
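For illustration, a minimal sketch of that approach (the pool name and PG count here are placeholders, not taken from the question):
# create a replicated pool; pool name and pg_num/pgp_num are only examples
ceph osd pool create mypool 64 64 replicated
# apply the suggested replication settings
ceph osd pool set mypool size 2
ceph osd pool set mypool min_size 1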

It seems you created a three-node cluster with different OSD configurations and sizes. The standard CRUSH rule tells Ceph to keep 3 copies of a PG on different hosts. If there is not enough space to spread the PGs across the three hosts, your cluster will never be healthy.
It is always a good idea to start with a set of equally sized hosts (RAM, CPU, OSDs).
Update for the discussion about a cluster size of 2 vs 3:
Don't use 2 replicas. Go for 3. Ceph started out with a default size of 2, but this was changed to 3 in Ceph 0.82 (Firefly release).
Why? Because if one drive fails you are left with only one drive containing your data. Should this drive fail too while recovery is running, then your data is gone for good.
See this thread on the ceph-users mailing list:
2 replicas isn't safe, no matter how big or small the cluster is. With
disks becoming larger recovery times will grow. In that window you don't
want to run on a single replica.
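To see which PGs are affected and where CRUSH maps them, a small diagnostic sketch (the PG id is a placeholder):
# list the PGs that are stuck undersized
ceph pg dump_stuck undersized
# show the OSD set a given PG maps to (replace with a real PG id from the previous output)
ceph pg map 1.5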

Related

pyspark conf and yarn top memory discrepancies

An EMR cluster shows the following (from the main node, after running yarn top):
YARN top - 13:27:57, up 0d, 1:34, 1 active users, queue(s): root
NodeManager(s): 6 total, 6 active, 0 unhealthy, 2 decommissioned, 0
lost, 0 rebooted Queue(s) Applications: 3 running, 8 submitted, 0
pending, 5 completed, 0 killed, 0 failed Queue(s) Mem(GB): 18
available, 189 allocated, 1555 pending, 0 reserved Queue(s) VCores: 44
available, 20 allocated, 132 pending, 0 reserved Queue(s) Containers:
20 allocated, 132 pending, 0 reserved
APPLICATIONID USER TYPE QUEUE PRIOR #CONT #RCONT VCORES RVCORES MEM RMEM VCORESECS MEMSECS %PROGR TIME NAME
application_1663674823778_0002 hadoop spark default 0 10 0 10 0 99G 0G 18754 187254 10.00 00:00:33 PyS
application_1663674823778_0003 hadoop spark default 0 9 0 9 0 88G 0G 9446 84580 10.00 00:00:32 PyS
application_1663674823778_0008 hadoop spark default 0 1 0 1 0 0G 0G 382 334 10.00 00:00:06 PyS
Note that the PySpark apps application_1663674823778_0002 and application_1663674823778_0003 were launched from the main node command line by just executing pyspark (with no explicit config editing).
However, application_1663674823778_0008 was launched via the following command: pyspark --conf spark.executor.memory=11g --conf spark.driver.memory=12g. Despite this (test) PySpark config customization, that app in yarn top shows nothing but 0 for the memory (regular and reserved) values.
Why is this?
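One way to cross-check what YARN actually granted is to query the application directly (a sketch, reusing the application id from the output above):
# ask YARN for the application's status, including its aggregate resource allocation
yarn application -status application_1663674823778_0008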

How to rejoin a mon and mgr to a Ceph cluster

I have this situation and can't access the Ceph dashboard. I had 5 mons, but 2 of them went down, and one of them is the bootstrap mon node that also hosts the mgr. I got this from that node:
2020-10-14T18:59:46.904+0330 7f9d2e8e9700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
cluster:
id: e97c1944-e132-11ea-9bdd-e83935b1c392
health: HEALTH_WARN
no active mgr
services:
mon: 3 daemons, quorum srv4,srv5,srv6 (age 2d)
mgr: no daemons active (since 2d)
mds: heyatfs:1 {0=heyfs.srv10.lxizhc=up:active} 1 up:standby
osd: 54 osds: 54 up (since 47h), 54 in (since 3w)
task status:
scrub status:
mds.heyfs.srv10.lxizhc: idle
data:
pools: 3 pools, 65 pgs
objects: 223.95k objects, 386 GiB
usage: 1.2 TiB used, 97 TiB / 98 TiB avail
pgs: 65 active+clean
io:
client: 105 KiB/s rd, 328 KiB/s wr, 0 op/s rd, 0 op/s wr
I should tell the whole story. I used cephadm to create my cluster, and I'm quite new to Ceph. I have 15 servers: 14 of them run an OSD container, 5 of them had a mon, and my bootstrap mon, srv2, also has the mgr.
2 of these servers have public IPs and I used one of them as a client (I know this setup raises a lot of questions, but my company forces me to do it, and since I'm new to Ceph this is how it is for now). 2 weeks ago I lost 2 OSDs, and I asked the datacenter that provides these servers to replace those 2 HDDs. They restarted those servers, and unfortunately those were my mon servers. After the restart one of them, srv5, came back, but I could see that srv3 was out of quorum.
So I began to solve this problem and used these commands in ceph shell --fsid ...:
ceph orch apply mon srv3
ceph mon remove srv3
After a while I saw in my dashboard that srv2, my bootstrap mon and mgr, was not working; when I ran ceph -s, srv2 wasn't there, and I could see the srv2 mon in the removed directory:
root@srv2:/var/lib/ceph/e97c1944-e132-11ea-9bdd-e83935b1c392# ls
crash crash.srv2 home mgr.srv2.xpntaf osd.0 osd.1 osd.2 osd.3 removed
But mgr.srv2.xpntaf is running and, unfortunately, I have now lost access to the Ceph dashboard.
I tried to add srv2 and srv3 to the monmap with:
576 ceph orch daemon add mon srv2:172.32.X.3
577 history | grep dump
578 ceph mon dump
579 ceph -s
580 ceph mon dump
581 ceph mon add srv3 172.32.X.4:6789
And now:
root@srv2:/# ceph -s
cluster:
id: e97c1944-e132-11ea-9bdd-e83935b1c392
health: HEALTH_WARN
no active mgr
2/5 mons down, quorum srv4,srv5,srv6
services:
mon: 5 daemons, quorum srv4,srv5,srv6 (age 16h), out of quorum: srv2, srv3
mgr: no daemons active (since 2d)
mds: heyatfs:1 {0=heyatfs.srv10.lxizhc=up:active} 1 up:standby
osd: 54 osds: 54 up (since 2d), 54 in (since 3w)
task status:
scrub status:
mds.heyatfs.srv10.lxizhc: idle
data:
pools: 3 pools, 65 pgs
objects: 223.95k objects, 386 GiB
usage: 1.2 TiB used, 97 TiB / 98 TiB avail
pgs: 65 active+clean
io:
client: 105 KiB/s rd, 328 KiB/s wr, 0 op/s rd, 0 op/s wr
I must also say that ceph orch host ls doesn't work; it hangs when I run it, and I think that's because of the 'no active mgr' error. In the removed directory, mon.srv2 is there and you can see its unit.run file, so I used that to run the container again, but it says mon.srv2 isn't in the monmap and doesn't have a specific IP. By the way, after ceph orch apply mon srv3 I could see a new container with a new fsid on the srv3 server.
I now know my whole problem is because I ran this command: ceph orch apply mon srv3
Because when you look at the installation document:
To deploy monitors on a specific set of hosts:
# ceph orch apply mon *<host1,host2,host3,...>*
Be sure to include the first (bootstrap) host in this list.
And I didn't see that line!
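For reference, a sketch of the documented form with the bootstrap host included (the host list just follows the names used in this question and is only illustrative):
# declare the complete set of mon hosts, including the bootstrap host srv2
ceph orch apply mon "srv2,srv3,srv4,srv5,srv6"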
Now I have managed to get another mgr running, but I got this:
root@srv2:/var/lib/ceph/mgr# ceph -s
2020-10-15T13:11:59.080+0000 7f957e9cd700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
cluster:
id: e97c1944-e132-11ea-9bdd-e83935b1c392
health: HEALTH_ERR
1 stray daemons(s) not managed by cephadm
2 mgr modules have failed
2/5 mons down, quorum srv4,srv5,srv6
services:
mon: 5 daemons, quorum srv4,srv5,srv6 (age 20h), out of quorum: srv2, srv3
mgr: srv4(active, since 8m)
mds: heyatfs:1 {0=heyatfs.srv10.lxizhc=up:active} 1 up:standby
osd: 54 osds: 54 up (since 2d), 54 in (since 3w)
task status:
scrub status:
mds.heyatfs.srv10.lxizhc: idle
data:
pools: 3 pools, 65 pgs
objects: 301.77k objects, 537 GiB
usage: 1.6 TiB used, 97 TiB / 98 TiB avail
pgs: 65 active+clean
io:
client: 180 KiB/s rd, 597 B/s wr, 0 op/s rd, 0 op/s wr
And when I run ceph orch host ls I see this:
root@srv2:/var/lib/ceph/mgr# ceph orch host ls
HOST ADDR LABELS STATUS
srv10 172.32.x.11
srv11 172.32.x.12
srv12 172.32.x.13
srv13 172.32.x.14
srv14 172.32.x.15
srv15 172.32.x.16
srv2 srv2
srv3 172.32.x.4
srv4 172.32.x.5
srv5 172.32.x.6
srv6 172.32.x.7
srv7 172.32.x.8
srv8 172.32.x.9
srv9 172.32.x.10

ceph raw used is more than sum of used in all pools (ceph df detail)

First of all, sorry for my poor English.
In my Ceph cluster, when I run the ceph df detail command it shows a result like the following:
RAW STORAGE:
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 62 TiB 52 TiB 10 TiB 10 TiB 16.47
ssd 8.7 TiB 8.4 TiB 370 GiB 377 GiB 4.22
TOTAL 71 TiB 60 TiB 11 TiB 11 TiB 14.96
POOLS:
POOL ID STORED OBJECTS USED %USED MAX AVAIL QUOTA OBJECTS QUOTA BYTES DIRTY USED COMPR UNDER COMPR
rbd-kubernetes 36 288 GiB 71.56k 865 GiB 1.73 16 TiB N/A N/A 71.56k 0 B 0 B
rbd-cache 41 2.4 GiB 208.09k 7.2 GiB 0.09 2.6 TiB N/A N/A 205.39k 0 B 0 B
cephfs-metadata 51 529 MiB 221 1.6 GiB 0 16 TiB N/A N/A 221 0 B 0 B
cephfs-data 52 1.0 GiB 424 3.1 GiB 0 16 TiB N/A N/A 424 0 B 0 B
So I have a question about this result.
As you can see, the sum of my pools' used storage is less than 1 TiB, but in the RAW STORAGE section the usage on the HDD class is 10 TiB, and it is growing every day. I think this is unusual and something is wrong with this Ceph cluster.
Also, FYI, the output of ceph osd dump | grep replicated is:
pool 36 'rbd-kubernetes' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 244 pg_num_target 64 pgp_num_target 64 last_change 1376476 lfor 2193/2193/2193 flags hashpspool,selfmanaged_snaps,creating tiers 41 read_tier 41 write_tier 41 stripe_width 0 application rbd
pool 41 'rbd-cache' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode on last_change 1376476 lfor 2193/2193/2193 flags hashpspool,incomplete_clones,selfmanaged_snaps,creating tier_of 36 cache_mode writeback target_bytes 1000000000000 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 3600s x1 decay_rate 0 search_last_n 0 min_read_recency_for_promote 1 min_write_recency_for_promote 1 stripe_width 0
pool 51 'cephfs-metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 31675 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 52 'cephfs-data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 742334 flags hashpspool,selfmanaged_snaps stripe_width 0 application cephfs
Ceph version (ceph -v):
ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)
Ceph OSD versions (ceph tell osd.* version) returns, for all OSDs, something like:
osd.0: {
"version": "ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)"
}
Ceph status (ceph -s):
cluster:
id: 6a86aee0-3171-4824-98f3-2b5761b09feb
health: HEALTH_OK
services:
mon: 3 daemons, quorum ceph-sn-03,ceph-sn-02,ceph-sn-01 (age 37h)
mgr: ceph-sn-01(active, since 4d), standbys: ceph-sn-03, ceph-sn-02
mds: cephfs-shared:1 {0=ceph-sn-02=up:active} 2 up:standby
osd: 63 osds: 63 up (since 41h), 63 in (since 41h)
task status:
scrub status:
mds.ceph-sn-02: idle
data:
pools: 4 pools, 384 pgs
objects: 280.29k objects, 293 GiB
usage: 11 TiB used, 60 TiB / 71 TiB avail
pgs: 384 active+clean
According to the provided data, you should evaluate the following considerations and scenarios:
The replication size is inclusive: once min_size replicas have been written, the write operation is acknowledged as complete. That means you should expect storage consumption of between min_size and the full replication size copies of your data.
Ceph stores metadata and logs for housekeeping purposes, which obviously consumes storage.
If you run benchmarks via "rados bench" or a similar interface with the --no-cleanup parameter, the benchmark objects are stored permanently in the cluster and consume storage.
The scenarios mentioned here are just a few of the possibilities.
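For the last point, a minimal sketch of how leftover benchmark objects can be found and removed (the rbd-kubernetes pool name is taken from the question; run with care on a production cluster):
# a benchmark run like this leaves its objects behind
rados bench -p rbd-kubernetes 10 write --no-cleanup
# leftover benchmark objects show up with a benchmark_data_ prefix
rados -p rbd-kubernetes ls | grep benchmark_data | head
# remove the leftover benchmark objects
rados -p rbd-kubernetes cleanup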

Rook.io Ceph cluster: 'mon c is low on available space' message

I have set up a Rook.io 1.4 cluster on Kubernetes 1.18, with 3 nodes and 1 TB of storage allocated on each of them.
After creating the cluster, when I run ceph status, the cluster status shows HEALTH_WARN with 'mon c is low on available space'.
There is no data stored yet. Why does the status show low available space? How can I clear this error?
[root@rook-ceph-tools-6bdcd78654-sfjvl /]# ceph status
cluster:
id: ad42764d-aa28-4da5-a828-2d87205aff08
health: HEALTH_WARN
mon c is low on available space
services:
mon: 3 daemons, quorum a,b,c (age 37m)
mgr: a(active, since 36m)
osd: 3 osds: 3 up (since 37m), 3 in (since 37m)
data:
pools: 1 pools, 1 pgs
objects: 0 objects, 0 B
usage: 3.0 GiB used, 3.6 TiB / 3.6 TiB avail
pgs: 1 active+clean
All three nodes have the same size storage:
sdb 8:16 0 1.2T 0 disk
└─ceph--a6cd601d--7584--4b1f--bf82--48c95437f351-osd--data--ae1bc856--8ded--4b1e--8c87--30ca0f0959a3 253:3 0 1.2T 0 lvm
sdb 8:16 0 1.2T 0 disk
└─ceph--ccaf7144--d6a0--441c--bcd5--6a09d056bd7a-osd--data--36a9b28c--7207--400a--936b--edfb3255ce0b 253:3 0 1.2T 0 lvm
sdb 8:16 0 1.2T 0 disk
└─ceph--53e9b8a9--8925--4b21--a6ea--f8e17a322d5c-osd--data--6b1e779c--a18a--4e4d--960e--73ca9473d02f 253:3 0 1.2T 0 lvm
Thanks
SR
This alert is about your monitor's disk space, which normally lives under /var/lib/ceph/mon. This path is on the root filesystem, which is unrelated to your OSDs' block devices. The warning is raised when this path has less than 30% available space (see mon_data_avail_warn, which defaults to 30).
You can lower that threshold to silence the alert, or resize that path so the mon has more space for its RocksDB data.
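A minimal sketch of both checks (the threshold value is just an example):
# check how full the filesystem backing the mon data path is
df -h /var/lib/ceph/mon
# lower the warning threshold from the default of 30 percent (example value)
ceph config set mon mon_data_avail_warn 15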
As Seena explained, it is because the available space is less than 30%. In this case, you can compact the mon data with the following command:
ceph tell mon.`hostname -s` compact
There is another way to trigger data compaction for the mon: add the option below to ceph.conf, and then restart the mon.
[mon]
mon compact on start = true
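In a Rook deployment, a sketch of how that override and restart might be done (the ConfigMap name, namespace and labels are assumptions based on Rook defaults, not taken from the question):
# put the [mon] override into Rook's config override ConfigMap (default rook-ceph namespace assumed)
kubectl -n rook-ceph edit configmap rook-config-override
# restart the affected mon pod so it picks up the change (label selector is an assumption)
kubectl -n rook-ceph delete pod -l app=rook-ceph-mon,mon=c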

How to abandon Ceph PGs that are stuck in "incomplete"?

We have been working on restoring our Ceph cluster after losing a large number of OSDs. We have all PGs active now except for 80 PGs that are stuck in the "incomplete" state. These PGs are referencing OSD.8 which we removed 2 weeks ago due to corruption.
We would like to abandon the "incomplete" PGs as they are not restorable. We have tried the following:
Per the docs, we made sure min_size on the corresponding pools was set to 1. This did not clear the condition.
Ceph would not let us issue "ceph osd lost N" because OSD.8 had already been removed from the cluster.
We also tried "ceph pg force_create_pg X" on all the PGs. The 80 PGs moved to "creating" for a few minutes, but then all went back to "incomplete".
How do we abandon these PGs to allow recovery to continue? Is there some way to force individual PGs to be marked as "lost"?
To remove the OSD we used the procedure from the web site here:
http://docs.ceph.com/docs/jewel/rados/operations/add-or-rm-osds/#removing-osds-manual
Basically:
ceph osd crush remove 8
ceph auth del osd.8
ceph osd rm 8
Some miscellaneous data below:
djakubiec@dev:~$ ceph osd lost 8 --yes-i-really-mean-it
osd.8 is not down or doesn't exist
djakubiec@dev:~$ ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 58.19960 root default
-2 7.27489 host node24
1 7.27489 osd.1 up 1.00000 1.00000
-3 7.27489 host node25
2 7.27489 osd.2 up 1.00000 1.00000
-4 7.27489 host node26
3 7.27489 osd.3 up 1.00000 1.00000
-5 7.27489 host node27
4 7.27489 osd.4 up 1.00000 1.00000
-6 7.27489 host node28
5 7.27489 osd.5 up 1.00000 1.00000
-7 7.27489 host node29
6 7.27489 osd.6 up 1.00000 1.00000
-8 7.27539 host node30
9 7.27539 osd.9 up 1.00000 1.00000
-9 7.27489 host node31
7 7.27489 osd.7 up 1.00000 1.00000
But even though OSD 8 no longer exists, I still see lots of references to OSD 8 in various ceph dumps and queries.
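For example, this is roughly where those references turn up (a sketch; the PG id is a placeholder):
# health detail lists the incomplete PGs
ceph health detail | grep incomplete
# querying one of them shows the peering state; osd.8 may appear in fields such as down_osds_we_would_probe
ceph pg 1.28 query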
Interestingly, we do still see weird entries in the CRUSH map (should I do something about these?):
# devices
device 0 device0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 device8
device 9 osd.9
And for what it is worth, here is the ceph -s:
cluster 10d47013-8c2a-40c1-9b4a-214770414234
health HEALTH_ERR
212 pgs are stuck inactive for more than 300 seconds
93 pgs backfill_wait
1 pgs backfilling
101 pgs degraded
63 pgs down
80 pgs incomplete
89 pgs inconsistent
4 pgs recovery_wait
1 pgs repair
132 pgs stale
80 pgs stuck inactive
132 pgs stuck stale
103 pgs stuck unclean
97 pgs undersized
2 requests are blocked > 32 sec
recovery 4394354/46343776 objects degraded (9.482%)
recovery 4025310/46343776 objects misplaced (8.686%)
2157 scrub errors
mds cluster is degraded
monmap e1: 3 mons at {core=10.0.1.249:6789/0,db=10.0.1.251:6789/0,dev=10.0.1.250:6789/0}
election epoch 266, quorum 0,1,2 core,dev,db
fsmap e3627: 1/1/1 up {0=core=up:replay}
osdmap e4293: 8 osds: 8 up, 8 in; 144 remapped pgs
flags sortbitwise
pgmap v1866639: 744 pgs, 10 pools, 7668 GB data, 20673 kobjects
8339 GB used, 51257 GB / 59596 GB avail
4394354/46343776 objects degraded (9.482%)
4025310/46343776 objects misplaced (8.686%)
362 active+clean
112 stale+active+clean
89 active+undersized+degraded+remapped+wait_backfill
66 active+clean+inconsistent
63 down+incomplete
19 stale+active+clean+inconsistent
17 incomplete
5 active+undersized+degraded+remapped
4 active+recovery_wait+degraded
2 active+undersized+degraded+remapped+inconsistent+wait_backfill
1 stale+active+clean+scrubbing+deep+inconsistent+repair
1 active+remapped+inconsistent+wait_backfill
1 active+clean+scrubbing+deep
1 active+remapped+wait_backfill
1 active+undersized+degraded+remapped+backfilling