ceph df (octopus) shows USED is 7 times higher than STORED in erasure coded pool

The pool default.rgw.buckets.data has 501 GiB stored, but USED shows 3.5 TiB.
root@ceph-01:~# ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    196 TiB  193 TiB  3.5 TiB  3.6 TiB   1.85
TOTAL  196 TiB  193 TiB  3.5 TiB  3.6 TiB   1.85

--- POOLS ---
POOL                       ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics       1    1   19 KiB       12   56 KiB      0     61 TiB
.rgw.root                   2   32  2.6 KiB        6  1.1 MiB      0     61 TiB
default.rgw.log             3   32  168 KiB      210   13 MiB      0     61 TiB
default.rgw.control         4   32      0 B        8      0 B      0     61 TiB
default.rgw.meta            5    8  4.8 KiB       11  1.9 MiB      0     61 TiB
default.rgw.buckets.index   6    8  1.6 GiB      211  4.7 GiB      0     61 TiB
default.rgw.buckets.data   10  128  501 GiB    5.36M  3.5 TiB   1.90    110 TiB
The default.rgw.buckets.data pool is using erasure coding:
root@ceph-01:~# ceph osd erasure-code-profile get EC_RGW_HOST
crush-device-class=hdd
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=6
m=4
plugin=jerasure
technique=reed_sol_van
w=8
If anyone could help explain why the pool is using up 7 times more space than it stores, it would help a lot.
Versioning is disabled. ceph version 15.2.13 (octopus stable).

This is related to bluestore_min_alloc_size_hdd=64K (the default on Octopus).
With erasure coding, each object is broken up into smaller chunks, and each chunk is allocated at least 64K on disk.
One option would be to lower bluestore_min_alloc_size_hdd to 4K, which makes sense if your workload requires storing millions of tiny (16K) objects. In my case, I'm storing hundreds of millions of 3-4M photos, so I decided to skip erasure coding, stay on bluestore_min_alloc_size_hdd=64K, and switch to replicated 3 (min_size 2), which is much safer and faster in the long run.
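A minimal command-line sketch of both options, assuming Octopus with the centralized config database. The pool name default.rgw.buckets.data.repl and the PG count are illustrative, changing the alloc size only affects OSDs created after the change, and migrating existing RGW data into a new pool is a separate exercise:

# Check the allocation size that will be used for newly created HDD OSDs
ceph config get osd bluestore_min_alloc_size_hdd

# Option 1: lower it to 4K, then recreate/redeploy the OSDs one by one
# so the new value actually takes effect
ceph config set osd bluestore_min_alloc_size_hdd 4096

# Option 2: keep 64K and use a replicated data pool instead of EC
ceph osd pool create default.rgw.buckets.data.repl 128 128 replicated
ceph osd pool set default.rgw.buckets.data.repl size 3
ceph osd pool set default.rgw.buckets.data.repl min_size 2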
Here is the reply from Josh Baergen on the mailing list:
Hey Arkadiy,
If the OSDs are on HDDs and were created with the default bluestore_min_alloc_size_hdd, which is still 64KiB in Octopus, then in effect data will be allocated from the pool in 640KiB chunks (64KiB * (k+m)). 5.36M objects taking up 501GiB is an average object size of 98KiB which results in a ratio of 6.53:1 allocated:stored, which is pretty close to the 7:1 observed.
If my assumption about your configuration is correct, then the only way to fix this is to adjust bluestore_min_alloc_size_hdd and recreate all your OSDs, which will take a while...
Josh
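For reference, that back-of-the-envelope arithmetic can be reproduced with a quick awk sketch (numbers taken from the ceph df output above; RGW index and metadata overhead are ignored):

awk 'BEGIN {
    stored_kib = 501 * 1024 * 1024;      # 501 GiB stored, in KiB
    objects    = 5.36e6;                 # 5.36M objects
    avg_kib    = stored_kib / objects;   # ~98 KiB average object size
    alloc_kib  = 64 * (6 + 4);           # 64 KiB min_alloc * (k+m) = 640 KiB per object
    printf "avg object: %.0f KiB, allocated:stored = %.2f:1\n", avg_kib, alloc_kib / avg_kib;
}'

In other words, most of the 3.5 TiB USED is allocation padding rather than EC parity: 501 GiB of data plus 4/6 parity would only account for roughly 0.8 TiB.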

Related

ceph df max available miscalculation

Ceph cluster shows the following weird behavior in the ceph df output:
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 817 TiB 399 TiB 418 TiB 418 TiB 51.21
ssd 1.4 TiB 1.2 TiB 22 GiB 174 GiB 12.17
TOTAL 818 TiB 400 TiB 418 TiB 419 TiB 51.15
--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
pool1 45 300 21 TiB 6.95M 65 TiB 20.23 85 TiB
pool2 50 50 72 GiB 289.15k 357 GiB 0.14 85 TiB
pool3 53 64 2.9 TiB 754.06k 8.6 TiB 3.24 85 TiB
erasurepool_data 57 1024 138 TiB 50.81M 241 TiB 48.49 154 TiB
erasurepool_metadata 58 8 9.1 GiB 1.68M 27 GiB 2.46 362 GiB
device_health_metrics 59 1 22 MiB 163 66 MiB 0 85 TiB
.rgw.root 60 8 5.6 KiB 17 3.5 MiB 0 85 TiB
.rgw.log 61 8 70 MiB 2.56k 254 MiB 0 85 TiB
.rgw.control 62 8 0 B 8 0 B 0 85 TiB
.rgw.meta 63 8 7.6 MiB 52 32 MiB 0 85 TiB
.rgw.buckets.index 64 8 11 GiB 1.69k 34 GiB 3.01 362 GiB
.rgw.buckets.data 65 512 23 TiB 33.87M 72 TiB 21.94 85 TiB
As seen above, available storage is 399 TiB, yet MAX AVAIL in the pool list shows 85 TiB. I use 3 replicas for each replicated pool and 3+2 erasure code for erasurepool_data.
As far as I know, the MAX AVAIL column shows the maximum available capacity according to the replica size, so it should come to 85*3 = 255 TiB of raw capacity. Meanwhile the cluster shows almost 400 TiB available.
Which should I trust? Is this just a bug?
It turns out the max available space is calculated according to the fullest OSDs in the cluster and has nothing to do with the total free space in the cluster. From what I've found, this kind of fluctuation mainly happens on small clusters.
The MAX AVAIL column represents the amount of data that can be used before the first OSD becomes full. It takes into account the projected distribution of data across disks from the CRUSH map and uses the 'first OSD to fill up' as the target, so it does not seem to be a bug. If MAX AVAIL is not what you expect it to be, look at the data distribution using ceph osd tree and make sure it is uniform.
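A quick way to check that distribution (the balancer commands go a step beyond the advice above and are only a suggestion; upmap mode requires all clients to be Luminous or newer):

# Per-OSD utilisation: compare the %USE and VAR columns to spot the fullest OSDs
ceph osd df tree

# If the distribution is skewed, the built-in balancer usually evens it out
ceph balancer status
ceph balancer mode upmap
ceph balancer on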
You can also check some helpful posts here that explain some of the miscalculations:
Using available space in a Ceph pool
ceph-displayed-size-calculation
max-avail-in-ceph-df-command-is-incorrec
As you have erasure coding involved, please check this SO post:
ceph-df-octopus-shows-used-is-7-times-higher-than-stored-in-erasure-coded-pool
When you also account for the erasure coded pool (erasurepool_data at 154 TiB), the figures add up to roughly the ~400 TiB of raw AVAIL.

ceph pgs marked as inactive and undersized+peered

I installed a rook.io Ceph storage cluster. Before installation, I cleaned up the previous installation as described here: https://rook.io/docs/rook/v1.7/ceph-teardown.html
The new cluster was provisioned correctly; however, Ceph is not healthy immediately after provisioning and is stuck:
  data:
    pools:   1 pools, 128 pgs
    objects: 0 objects, 0 B
    usage:   20 MiB used, 15 TiB / 15 TiB avail
    pgs:     100.000% pgs not active
             128 undersized+peered
[root@rook-ceph-tools-74df559676-scmzg /]# ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
0 hdd 3.63869 1.00000 3.6 TiB 5.0 MiB 144 KiB 0 B 4.8 MiB 3.6 TiB 0 0.98 0 up
1 hdd 3.63869 1.00000 3.6 TiB 5.4 MiB 144 KiB 0 B 5.2 MiB 3.6 TiB 0 1.07 128 up
2 hdd 3.63869 1.00000 3.6 TiB 5.0 MiB 144 KiB 0 B 4.8 MiB 3.6 TiB 0 0.98 0 up
3 hdd 3.63869 1.00000 3.6 TiB 4.9 MiB 144 KiB 0 B 4.8 MiB 3.6 TiB 0 0.97 0 up
TOTAL 15 TiB 20 MiB 576 KiB 0 B 20 MiB 15 TiB 0
MIN/MAX VAR: 0.97/1.07 STDDEV: 0
[root@rook-ceph-tools-74df559676-scmzg /]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 14.55475 root default
-3 14.55475 host storage1-kube-domain-tld
0 hdd 3.63869 osd.0 up 1.00000 1.00000
1 hdd 3.63869 osd.1 up 1.00000 1.00000
2 hdd 3.63869 osd.2 up 1.00000 1.00000
3 hdd 3.63869 osd.3 up 1.00000 1.00000
Is there anyone who can explain what went wrong and how to fix the issue?
The problem is that all the OSDs are running on the same host while the failure domain is set to host. Switching the failure domain to osd fixes the issue. The default failure domain can be changed as per https://stackoverflow.com/a/63472905/3146709, and repointing an already-created pool is sketched below.
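For an already-provisioned cluster, one possible way to do this from the rook toolbox pod is to create a replicated CRUSH rule whose failure domain is osd and point the pool at it; the pool name below is a placeholder for whatever ceph osd pool ls shows:

# New replicated rule that distributes copies across individual OSDs
ceph osd crush rule create-replicated replicated_osd default osd

# Point the affected pool at the new rule
ceph osd pool set <pool> crush_rule replicated_osd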

ceph raw used is more than sum of used in all pools (ceph df detail)

First of all, sorry for my poor English.
In my Ceph cluster, when I run the ceph df detail command it shows me the following result:
RAW STORAGE:
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 62 TiB 52 TiB 10 TiB 10 TiB 16.47
ssd 8.7 TiB 8.4 TiB 370 GiB 377 GiB 4.22
TOTAL 71 TiB 60 TiB 11 TiB 11 TiB 14.96
POOLS:
POOL ID STORED OBJECTS USED %USED MAX AVAIL QUOTA OBJECTS QUOTA BYTES DIRTY USED COMPR UNDER COMPR
rbd-kubernetes 36 288 GiB 71.56k 865 GiB 1.73 16 TiB N/A N/A 71.56k 0 B 0 B
rbd-cache 41 2.4 GiB 208.09k 7.2 GiB 0.09 2.6 TiB N/A N/A 205.39k 0 B 0 B
cephfs-metadata 51 529 MiB 221 1.6 GiB 0 16 TiB N/A N/A 221 0 B 0 B
cephfs-data 52 1.0 GiB 424 3.1 GiB 0 16 TiB N/A N/A 424 0 B 0 B
So I have a question about the result. As you can see, the sum of my pools' used storage is less than 1 TiB, but in the RAW STORAGE section the usage on the HDDs is 10 TiB and it is growing every day. I think this is unusual and something is wrong with this Ceph cluster.
Also, FYI, the output of ceph osd dump | grep replicated is:
pool 36 'rbd-kubernetes' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 244 pg_num_target 64 pgp_num_target 64 last_change 1376476 lfor 2193/2193/2193 flags hashpspool,selfmanaged_snaps,creating tiers 41 read_tier 41 write_tier 41 stripe_width 0 application rbd
pool 41 'rbd-cache' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode on last_change 1376476 lfor 2193/2193/2193 flags hashpspool,incomplete_clones,selfmanaged_snaps,creating tier_of 36 cache_mode writeback target_bytes 1000000000000 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 3600s x1 decay_rate 0 search_last_n 0 min_read_recency_for_promote 1 min_write_recency_for_promote 1 stripe_width 0
pool 51 'cephfs-metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 31675 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 52 'cephfs-data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 742334 flags hashpspool,selfmanaged_snaps stripe_width 0 application cephfs
Ceph version (ceph -v):
ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)
Ceph OSD versions: ceph tell osd.* version returns the following for all OSDs:
osd.0: {
"version": "ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)"
}
Ceph status (ceph -s):
  cluster:
    id:     6a86aee0-3171-4824-98f3-2b5761b09feb
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph-sn-03,ceph-sn-02,ceph-sn-01 (age 37h)
    mgr: ceph-sn-01(active, since 4d), standbys: ceph-sn-03, ceph-sn-02
    mds: cephfs-shared:1 {0=ceph-sn-02=up:active} 2 up:standby
    osd: 63 osds: 63 up (since 41h), 63 in (since 41h)

  task status:
    scrub status:
        mds.ceph-sn-02: idle

  data:
    pools:   4 pools, 384 pgs
    objects: 280.29k objects, 293 GiB
    usage:   11 TiB used, 60 TiB / 71 TiB avail
    pgs:     384 active+clean
According to the provided data, you should evaluate the following considerations and scenarios:
The replication size is inclusive (size 3 means three copies in total), and once min_size copies are written, the client receives a completion message while the remaining copies are still being created. That means you should expect raw storage consumption of at least min_size times and at most the replication size times the stored data.
Ceph stores metadata and logs for housekeeping purposes, obviously consuming storage.
If you run benchmark operations via "rados bench" or a similar interface with the --no-cleanup parameter, the benchmark objects remain permanently stored in the cluster and consume storage (see the example below).
All the mentioned scenarios are just a few of the possibilities.
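If leftover rados bench objects are a suspect, a quick way to check for them and clean them up; the pool name is a placeholder:

# Benchmark objects are named benchmark_data_<host>_<pid>_object<N>
rados -p <pool> ls | grep benchmark_data | head

# Remove objects left behind by a previous 'rados bench ... --no-cleanup' run
rados -p <pool> cleanup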

pg_top output analysis of puppetdb with postgres

I recently started using a tool called pg_top that shows statistics for Postgres; however, since I am not very well versed in the internals of Postgres, I need a bit of clarification on the output.
last pid: 6152; load avg: 19.1, 18.6, 20.4; up 119+20:31:38 13:09:41
41 processes: 5 running, 36 sleeping
CPU states: 52.1% user, 0.0% nice, 0.8% system, 47.1% idle, 0.0% iowait
Memory: 47G used, 16G free, 2524M buffers, 20G cached
DB activity: 151 tps, 0 rollbs/s, 253403 buffer r/s, 86 hit%, 1550639 row r/s,
21 row w/s
DB I/O: 0 reads/s, 0 KB/s, 35 writes/s, 2538 KB/s
DB disk: 233.6 GB total, 195.1 GB free (16% used)
Swap:
My question is about the DB activity line: the 1.5 million row r/s, is that a lot? If so, what can be done to improve it? I am running PuppetDB 2.3.8 with 6.8 million resources and 2500 nodes, on Postgres 9.1. All of this runs on a single 24-core box with 64 GB of memory.

Interpret differences in prstat vs. 'prstat -m' on Solaris

I've been using prstat and prstat -m a lot to investigate performance issues lately, and I think I've basically understood the difference between sampling and microstate accounting in Solaris 10, so I don't expect both to always show exactly the same numbers.
Today I came across an occasion where the two showed such vastly different outputs that I have problems interpreting them and making sense of them. The machine is a heavily loaded 8-CPU Solaris 10 system with several large WebSphere processes and an Oracle database. The system practically came to a halt today for about 15 minutes (load averages of >700). I had difficulties getting any prstat information, but was able to get some output from "prstat 1 1" and "prstat -m 1 1", issued shortly one after another.
The top lines of the outputs:
prstat 1 1:
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
8379 was 3208M 2773M cpu5 60 0 5:29:13 19% java/145
7123 was 3159M 2756M run 59 0 5:26:45 7.7% java/109
5855 app1 1132M 26M cpu2 60 0 0:01:01 7.7% java/18
16503 monitor 494M 286M run 59 19 1:01:08 7.1% java/106
7112 oracle 15G 15G run 59 0 0:00:10 4.5% oracle/1
7124 oracle 15G 15G cpu3 60 0 0:00:10 4.5% oracle/1
7087 app1 15G 15G run 58 0 0:00:09 4.0% oracle/1
7155 oracle 96M 6336K cpu1 60 0 0:00:07 3.6% oracle/1
...
Total: 495 processes, 4581 lwps, load averages: 74.79, 35.35, 23.8
prstat -m 1 1 (some seconds later)
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
7087 app1 0.1 56 0.0 0.2 0.4 0.0 13 30 96 2 33 0 oracle/1
7153 oracle 0.1 53 0.0 3.2 1.1 0.0 1.0 42 82 0 14 0 oracle/1
7124 oracle 0.1 47 0.0 0.2 0.2 0.0 0.0 52 77 2 16 0 oracle/1
7112 oracle 0.1 47 0.0 0.4 0.1 0.0 0.0 52 79 1 16 0 oracle/1
7259 oracle 0.1 45 9.4 0.0 0.3 0.0 0.1 45 71 2 32 0 oracle/1
7155 oracle 0.0 42 11 0.0 0.5 0.0 0.1 46 90 1 9 0 oracle/1
7261 oracle 0.0 37 9.5 0.0 0.3 0.0 0.0 53 61 1 17 0 oracle/1
7284 oracle 0.0 32 5.9 0.0 0.2 0.0 0.1 62 53 1 21 0 oracle/1
...
Total: 497 processes, 4576 lwps, load averages: 88.86, 39.93, 25.51
I have a very hard time interpreting the output. prstat seems to tell me that a fair amount of Java processing is going on, together with some Oracle work, just as I would expect in a normal situation. prstat -m shows a machine totally dominated by Oracle, consuming huge amounts of system time, with the overall CPU heavily overloaded (large numbers in LAT).
I'm inclined to believe the output of prstat -m, because that matches much more closely what the system felt like during this time: totally sluggish, almost no user request processing going on from WebSphere, etc. But why would prstat show such wildly different numbers?
Any explanation of this would be welcome!
CU, Joe
There's a known problem with prstat -m on Solaris in the way CPU usage figures are calculated: the value you see has been averaged over all threads (LWPs) in a process, and hence is far too low for heavily multithreaded processes such as your Java app servers, which can have hundreds of threads each (see your NLWP). Probably fewer than a dozen of them are CPU hogs, hence the CPU usage by Java looks "low". You'd need to call it with prstat -Lm to get the per-LWP (thread) breakdown to see that effect. Reference:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6780169
Without further performance monitoring data it's hard to give a non-speculative explanation of what you've seen there. I assume lock contention within Java. One particular workload that can cause this is heavily multithreaded memory-mapped I/O, where all the threads pile up on the process address space lock. But it could be a purely Java user-side lock, of course. Running plockstat on one of the Java processes, and/or simple DTrace profiling, would be helpful.
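A rough sketch of those follow-up commands, assuming the PID 8379 from the prstat listing above stands in for one of the suspect processes (for symbolic Java frames, DTrace's jstack() can be used instead of ustack() where the Java DTrace helper is available):

# Per-thread (LWP) microstate accounting, so busy threads are not averaged away
prstat -Lm 1 1

# Lock contention and hold statistics for one process, sampled for 30 seconds
plockstat -A -e 30 -p 8379

# Simple user-stack profiling of the same process for 30 seconds
dtrace -n 'profile-997 /pid == $target/ { @[ustack()] = count(); }
           tick-30s { exit(0); }' -p 8379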