ceph df max available miscalculation

My Ceph cluster shows the following weird behavior in the ceph df output:
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 817 TiB 399 TiB 418 TiB 418 TiB 51.21
ssd 1.4 TiB 1.2 TiB 22 GiB 174 GiB 12.17
TOTAL 818 TiB 400 TiB 418 TiB 419 TiB 51.15
--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
pool1 45 300 21 TiB 6.95M 65 TiB 20.23 85 TiB
pool2 50 50 72 GiB 289.15k 357 GiB 0.14 85 TiB
pool3 53 64 2.9 TiB 754.06k 8.6 TiB 3.24 85 TiB
erasurepool_data 57 1024 138 TiB 50.81M 241 TiB 48.49 154 TiB
erasurepool_metadata 58 8 9.1 GiB 1.68M 27 GiB 2.46 362 GiB
device_health_metrics 59 1 22 MiB 163 66 MiB 0 85 TiB
.rgw.root 60 8 5.6 KiB 17 3.5 MiB 0 85 TiB
.rgw.log 61 8 70 MiB 2.56k 254 MiB 0 85 TiB
.rgw.control 62 8 0 B 8 0 B 0 85 TiB
.rgw.meta 63 8 7.6 MiB 52 32 MiB 0 85 TiB
.rgw.buckets.index 64 8 11 GiB 1.69k 34 GiB 3.01 362 GiB
.rgw.buckets.data 65 512 23 TiB 33.87M 72 TiB 21.94 85 TiB
As seen above, the available raw storage is 399 TiB, while MAX AVAIL in the pool list shows 85 TiB. I use 3 replicas for each replicated pool and a 3+2 erasure code for erasurepool_data.
As far as I know, the MAX AVAIL column shows the maximum available capacity according to the replica size, so it should come to 85 * 3 = 255 TiB of raw space. Meanwhile, the cluster shows almost 400 TiB available.
Which one should I trust?
Is this just a bug?

It turns out that the maximum available space is calculated from the fullest OSDs in the cluster and has nothing to do with the total free space in the cluster. From what I've found, this kind of fluctuation mainly happens on small clusters.

The MAX AVAIL column represents the amount of data that can be stored before the first OSD becomes full. It takes into account the projected distribution of data across disks from the CRUSH map and uses the first OSD to fill up as the target, so it does not appear to be a bug. If MAX AVAIL is not what you expect it to be, look at the data distribution using ceph osd tree and make sure it is uniform.
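As a back-of-the-envelope check on the numbers above (my own arithmetic, not Ceph's internal formula), both MAX AVAIL figures are consistent with roughly 255 TiB of "effective" raw free space, i.e. free space as limited by the fullest OSDs and the full ratios, rather than the 399 TiB of total AVAIL:
awk 'BEGIN {
  eff_raw   = 85 * 3           # replicated pools: 85 TiB usable * size 3 = 255 TiB effective raw free
  ec_usable = eff_raw * 3 / 5  # the 3+2 EC pool stores k/(k+m) = 3/5 of raw
  printf "effective raw free ~ %d TiB; expected EC MAX AVAIL ~ %.0f TiB (reported: 154 TiB)\n", eff_raw, ec_usable
}'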
You can also check some helpful posts that explain some of these apparent miscalculations:
Using available space in a Ceph pool
ceph-displayed-size-calculation
max-avail-in-ceph-df-command-is-incorrec
Since you have erasure coding involved, please also check this SO post:
ceph-df-octopus-shows-used-is-7-times-higher-than-stored-in-erasure-coded-pool

When you add the erasure-coded pool, i.e. erasurepool_data at 154 TiB, you get 255 + 154 = 409 TiB, which roughly matches the 399 TiB shown as AVAIL.

Related

ceph pgs marked as inactive and undersized+peered

I installed a rook.io Ceph storage cluster. Before installation, I cleaned up the previous installation as described here: https://rook.io/docs/rook/v1.7/ceph-teardown.html
The new cluster was provisioned correctly; however, Ceph is not healthy immediately after provisioning and is stuck.
data:
pools: 1 pools, 128 pgs
objects: 0 objects, 0 B
usage: 20 MiB used, 15 TiB / 15 TiB avail
pgs: 100.000% pgs not active
128 undersized+peered
[root@rook-ceph-tools-74df559676-scmzg /]# ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
0 hdd 3.63869 1.00000 3.6 TiB 5.0 MiB 144 KiB 0 B 4.8 MiB 3.6 TiB 0 0.98 0 up
1 hdd 3.63869 1.00000 3.6 TiB 5.4 MiB 144 KiB 0 B 5.2 MiB 3.6 TiB 0 1.07 128 up
2 hdd 3.63869 1.00000 3.6 TiB 5.0 MiB 144 KiB 0 B 4.8 MiB 3.6 TiB 0 0.98 0 up
3 hdd 3.63869 1.00000 3.6 TiB 4.9 MiB 144 KiB 0 B 4.8 MiB 3.6 TiB 0 0.97 0 up
TOTAL 15 TiB 20 MiB 576 KiB 0 B 20 MiB 15 TiB 0
MIN/MAX VAR: 0.97/1.07 STDDEV: 0
[root@rook-ceph-tools-74df559676-scmzg /]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 14.55475 root default
-3 14.55475 host storage1-kube-domain-tld
0 hdd 3.63869 osd.0 up 1.00000 1.00000
1 hdd 3.63869 osd.1 up 1.00000 1.00000
2 hdd 3.63869 osd.2 up 1.00000 1.00000
3 hdd 3.63869 osd.3 up 1.00000 1.00000
Is there anyone who can explain what went wrong and how to fix the issue?
The problem is that all the OSDs are running on the same host while the failure domain is set to host, so the PGs cannot find enough distinct hosts to place their replicas on. Switching the failure domain to osd fixes the issue. The default failure domain can be changed as per https://stackoverflow.com/a/63472905/3146709.
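For reference, a minimal sketch of doing this by hand on an existing pool (the rule name and <pool-name> below are placeholders; with Rook you would normally set the failure domain in the pool's CRD instead):
ceph osd crush rule create-replicated replicated_osd default osd   # new replicated rule with failure domain "osd"
ceph osd pool set <pool-name> crush_rule replicated_osd            # point the pool at the new rule
ceph osd pool get <pool-name> crush_rule                           # verify the pool now uses it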

ceph df (octopus) shows USED is 7 times higher than STORED in erasure coded pool

The pool default.rgw.buckets.data has 501 GiB stored, but USED shows 3.5 TiB.
root@ceph-01:~# ceph df
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 196 TiB 193 TiB 3.5 TiB 3.6 TiB 1.85
TOTAL 196 TiB 193 TiB 3.5 TiB 3.6 TiB 1.85
--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
device_health_metrics 1 1 19 KiB 12 56 KiB 0 61 TiB
.rgw.root 2 32 2.6 KiB 6 1.1 MiB 0 61 TiB
default.rgw.log 3 32 168 KiB 210 13 MiB 0 61 TiB
default.rgw.control 4 32 0 B 8 0 B 0 61 TiB
default.rgw.meta 5 8 4.8 KiB 11 1.9 MiB 0 61 TiB
default.rgw.buckets.index 6 8 1.6 GiB 211 4.7 GiB 0 61 TiB
default.rgw.buckets.data 10 128 501 GiB 5.36M 3.5 TiB 1.90 110 TiB
The default.rgw.buckets.data pool is using erasure coding:
root@ceph-01:~# ceph osd erasure-code-profile get EC_RGW_HOST
crush-device-class=hdd
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=6
m=4
plugin=jerasure
technique=reed_sol_van
w=8
If anyone could help explain why it's using up 7 times more space, it would help a lot.
Versioning is disabled. ceph version 15.2.13 (octopus stable).
This is related to bluestore_min_alloc_size_hdd=64K (default on Octopus).
If you use erasure coding, data is broken up into smaller chunks, each of which takes at least 64K on disk.
One option would be to lower bluestore_min_alloc_size_hdd to 4K, which makes sense if your workload stores millions of tiny (~16K) objects. In my case, I'm storing hundreds of millions of 3-4 MB photos, so I decided to skip erasure coding, stay on bluestore_min_alloc_size_hdd=64K, and switch to replicated 3 (min 2), which is much safer and faster in the long run.
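If you do want to try the 4K route, a sketch of the change (note that the option only takes effect for OSDs created after it is set, so existing OSDs have to be recreated, as the reply below also says):
ceph config set osd bluestore_min_alloc_size_hdd 4096   # applies to newly created OSDs only
# ...then redeploy the OSDs so BlueStore is re-created with the 4K allocation unit.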
Here is the reply from Josh Baergen on the mailing list:
Hey Arkadiy,
If the OSDs are on HDDs and were created with the default
bluestore_min_alloc_size_hdd, which is still 64KiB in Octopus, then in
effect data will be allocated from the pool in 640KiB chunks (64KiB *
(k+m)). 5.36M objects taking up 501GiB is an average object size of 98KiB
which results in a ratio of 6.53:1 allocated:stored, which is pretty close
to the 7:1 observed.
If my assumption about your configuration is correct, then the only way to
fix this is to adjust bluestore_min_alloc_size_hdd and recreate all your
OSDs, which will take a while...
Josh
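Reproducing that arithmetic as a quick sanity check (rough numbers; the real overhead depends on your object size distribution):
awk 'BEGIN {
  k = 6; m = 4; min_alloc_kib = 64                    # bluestore_min_alloc_size_hdd default on Octopus
  stored_gib = 501; objects = 5.36e6
  avg_obj_kib   = stored_gib * 1024 * 1024 / objects  # ~98 KiB per object
  alloc_per_obj = min_alloc_kib * (k + m)             # each object allocates at least 640 KiB across its chunks
  printf "avg object ~ %.0f KiB; allocated:stored ~ %.2f:1\n", avg_obj_kib, alloc_per_obj / avg_obj_kib
}'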

ceph raw used is more than sum of used in all pools (ceph df detail)

First of all, sorry for my poor English.
In my Ceph cluster, when I run the ceph df detail command it shows the following result:
RAW STORAGE:
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 62 TiB 52 TiB 10 TiB 10 TiB 16.47
ssd 8.7 TiB 8.4 TiB 370 GiB 377 GiB 4.22
TOTAL 71 TiB 60 TiB 11 TiB 11 TiB 14.96
POOLS:
POOL ID STORED OBJECTS USED %USED MAX AVAIL QUOTA OBJECTS QUOTA BYTES DIRTY USED COMPR UNDER COMPR
rbd-kubernetes 36 288 GiB 71.56k 865 GiB 1.73 16 TiB N/A N/A 71.56k 0 B 0 B
rbd-cache 41 2.4 GiB 208.09k 7.2 GiB 0.09 2.6 TiB N/A N/A 205.39k 0 B 0 B
cephfs-metadata 51 529 MiB 221 1.6 GiB 0 16 TiB N/A N/A 221 0 B 0 B
cephfs-data 52 1.0 GiB 424 3.1 GiB 0 16 TiB N/A N/A 424 0 B 0 B
So I have a question about this result.
As you can see, the sum of my pools' used storage is less than 1 TiB, but in the RAW STORAGE section the USED for the HDD class is 10 TiB and it is growing every day. I think this is unusual and something is wrong with this Ceph cluster.
Also, FYI, the output of ceph osd dump | grep replicated is:
pool 36 'rbd-kubernetes' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 244 pg_num_target 64 pgp_num_target 64 last_change 1376476 lfor 2193/2193/2193 flags hashpspool,selfmanaged_snaps,creating tiers 41 read_tier 41 write_tier 41 stripe_width 0 application rbd
pool 41 'rbd-cache' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode on last_change 1376476 lfor 2193/2193/2193 flags hashpspool,incomplete_clones,selfmanaged_snaps,creating tier_of 36 cache_mode writeback target_bytes 1000000000000 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 3600s x1 decay_rate 0 search_last_n 0 min_read_recency_for_promote 1 min_write_recency_for_promote 1 stripe_width 0
pool 51 'cephfs-metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 31675 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 52 'cephfs-data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 742334 flags hashpspool,selfmanaged_snaps stripe_width 0 application cephfs
Ceph version (ceph -v):
ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)
Ceph OSD versions: ceph tell osd.* version returns the following for all OSDs:
osd.0: {
"version": "ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)"
}
Ceph status (ceph -s):
cluster:
id: 6a86aee0-3171-4824-98f3-2b5761b09feb
health: HEALTH_OK
services:
mon: 3 daemons, quorum ceph-sn-03,ceph-sn-02,ceph-sn-01 (age 37h)
mgr: ceph-sn-01(active, since 4d), standbys: ceph-sn-03, ceph-sn-02
mds: cephfs-shared:1 {0=ceph-sn-02=up:active} 2 up:standby
osd: 63 osds: 63 up (since 41h), 63 in (since 41h)
task status:
scrub status:
mds.ceph-sn-02: idle
data:
pools: 4 pools, 384 pgs
objects: 280.29k objects, 293 GiB
usage: 11 TiB used, 60 TiB / 71 TiB avail
pgs: 384 active+clean
According to the provided data, you should evaluate the following considerations and scenarios:
The replication size is inclusive: once min_size copies are written, the client receives a completion message while the remaining copies are still being created, so you should expect raw consumption of at least min_size and at most the full replication size times the stored data.
Ceph stores metadata and logs for housekeeping purposes, which also consumes storage.
If you run benchmarks via rados bench or a similar tool with the --no-cleanup parameter, the benchmark objects remain stored in the cluster and keep consuming storage.
These are just a few of the possible explanations.
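A few standard commands that can help narrow this down (interpret the output against the scenarios above; the pool name is taken from your ceph df output):
rados df                            # per-pool usage as RADOS sees it, including replication overhead
ceph osd df tree                    # per-OSD usage, to spot space consumed outside the listed pools
rados -p rbd-kubernetes ls | head   # leftover rados bench objects usually have names starting with benchmark_data_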

pg_top output analysis of puppetdb with postgres

I recently started using a tool called pg_top that shows statistics for Postgres; however, since I am not very well versed in the internals of Postgres, I need a bit of clarification on the output.
last pid: 6152; load avg: 19.1, 18.6, 20.4; up 119+20:31:38 13:09:41
41 processes: 5 running, 36 sleeping
CPU states: 52.1% user, 0.0% nice, 0.8% system, 47.1% idle, 0.0% iowait
Memory: 47G used, 16G free, 2524M buffers, 20G cached
DB activity: 151 tps, 0 rollbs/s, 253403 buffer r/s, 86 hit%, 1550639 row r/s,
21 row w/s
DB I/O: 0 reads/s, 0 KB/s, 35 writes/s, 2538 KB/s
DB disk: 233.6 GB total, 195.1 GB free (16% used)
Swap:
My question is about the DB activity line: is 1.5 million rows read per second a lot? If so, what can be done to improve it? I am running PuppetDB 2.3.8 with 6.8 million resources, 2500 nodes, and Postgres 9.1. All of this runs on a single 24-core box with 64 GB of memory.

Mongod resident memory usage low

I'm trying to debug some performance issues with a MongoDB configuration, and I noticed that the resident memory usage is sitting very low (around 25% of the system memory) despite the fact that there are occasionally large numbers of faults occurring. I'm surprised to see the usage so low given that MongoDB is so memory dependent.
Here's a snapshot of top sorted by memory usage. It can be seen that no other process is using any significant memory:
top - 21:00:47 up 136 days, 2:45, 1 user, load average: 1.35, 1.51, 0.83
Tasks: 62 total, 1 running, 61 sleeping, 0 stopped, 0 zombie
Cpu(s): 13.7%us, 5.2%sy, 0.0%ni, 77.3%id, 0.3%wa, 0.0%hi, 1.0%si, 2.4%st
Mem: 1692600k total, 1676900k used, 15700k free, 12092k buffers
Swap: 917500k total, 54088k used, 863412k free, 1473148k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2461 mongodb 20 0 29.5g 564m 492m S 22.6 34.2 40947:09 mongod
20306 ubuntu 20 0 24864 7412 1712 S 0.0 0.4 0:00.76 bash
20157 root 20 0 73352 3576 2772 S 0.0 0.2 0:00.01 sshd
609 syslog 20 0 248m 3240 520 S 0.0 0.2 38:31.35 rsyslogd
20304 ubuntu 20 0 73352 1668 872 S 0.0 0.1 0:00.00 sshd
1 root 20 0 24312 1448 708 S 0.0 0.1 0:08.71 init
20442 ubuntu 20 0 17308 1232 944 R 0.0 0.1 0:00.54 top
I'd like to at least understand why the memory isn't being better utilized by the server, and ideally to learn how to optimize either the server config or queries to improve performance.
UPDATE:
It's fair that the memory usage looks high, which might lead to the conclusion that another process is responsible. There are no other processes using any significant memory on the server; the memory appears to be consumed by the cache, but I'm not clear on why that would be the case:
$ free -m
total used free shared buffers cached
Mem: 1652 1602 50 0 14 1415
-/+ buffers/cache: 172 1480
Swap: 895 53 842
UPDATE:
You can see that the database is still page faulting:
insert query update delete getmore command flushes mapped vsize res faults locked db idx miss % qr|qw ar|aw netIn netOut conn set repl time
0 402 377 0 1167 446 0 24.2g 51.4g 3g 0 <redacted>:9.7% 0 0|0 1|0 217k 420k 457 mover PRI 03:58:43
10 295 323 0 961 592 0 24.2g 51.4g 3.01g 0 <redacted>:10.9% 0 14|0 1|1 228k 500k 485 mover PRI 03:58:44
10 240 220 0 698 342 0 24.2g 51.4g 3.02g 5 <redacted>:10.4% 0 0|0 0|0 164k 429k 478 mover PRI 03:58:45
25 449 359 0 981 479 0 24.2g 51.4g 3.02g 32 <redacted>:20.2% 0 0|0 0|0 237k 503k 479 mover PRI 03:58:46
18 469 337 0 958 466 0 24.2g 51.4g 3g 29 <redacted>:20.1% 0 0|0 0|0 223k 500k 490 mover PRI 03:58:47
9 306 238 1 759 325 0 24.2g 51.4g 2.99g 18 <redacted>:10.8% 0 6|0 1|0 154k 321k 495 mover PRI 03:58:48
6 301 236 1 765 325 0 24.2g 51.4g 2.99g 20 <redacted>:11.0% 0 0|0 0|0 156k 344k 501 mover PRI 03:58:49
11 397 318 0 995 395 0 24.2g 51.4g 2.98g 21 <redacted>:13.4% 0 0|0 0|0 198k 424k 507 mover PRI 03:58:50
10 544 428 0 1237 532 0 24.2g 51.4g 2.99g 13 <redacted>:15.4% 0 0|0 0|0 262k 571k 513 mover PRI 03:58:51
5 291 264 0 878 335 0 24.2g 51.4g 2.98g 11 <redacted>:9.8% 0 0|0 0|0 163k 330k 513 mover PRI 03:58:52
It appears this was being caused by a large amount of inactive memory on the server that wasn't being cleared for Mongo's use.
By looking at the output of:
cat /proc/meminfo
I could see a large amount of inactive memory. Running this command as root (the redirection into /proc needs a root shell):
free && sync && echo 3 > /proc/sys/vm/drop_caches && echo "" && free
freed up the inactive memory, and over the next 24 hours I was able to see the resident memory of my Mongo instance grow to consume the rest of the memory available on the server.
Credit to the following blog post for its instructions:
http://tinylan.com/index.php/article/how-to-clear-inactive-memory-in-linux
MongoDB only uses as much memory as it needs, so if all of the data and indexes that are in MongoDB can fit inside what it's currently using, you won't be able to push that number any higher.
If the data set is larger than memory, there are a couple of considerations:
Check MongoDB itself to see how much data it thinks it's using by running mongostat and looking at resident-memory
Was MongoDB re/started recently? If it's cold then the data won't be in memory until it gets paged in (leading to more page faults initially that gradually settle). Check out the touch command for more information on "warming MongoDB up"
Check your readahead settings. If your system readahead is too high then MongoDB can't efficiently use the memory on the system. For MongoDB a good number to start with is a setting of 32 (that's 16 KB of readahead assuming you have 512-byte blocks); see the example below.
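For example, checking and lowering readahead on the data volume might look like this (the device name is a placeholder; persist the setting via udev or your init scripts as appropriate):
sudo blockdev --getra /dev/sdX     # current readahead, in 512-byte sectors
sudo blockdev --setra 32 /dev/sdX  # 32 sectors = 16 KB, the starting point suggested above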
I had the same issue: Windows Server 2008 R2, 16 GB RAM, Mongo 2.4.3. Mongo used only 2 GB of RAM and generated a lot of page faults. Queries were very slow; the disk was idle and memory was free. I found no solution other than upgrading to 2.6.5, which helped.