Ceph libRBD cache control - kubernetes

Ceph has a user-space page cache implementation in librbd. Does it allow users to specify how much page cache to allocate to each pod? If so, can we dynamically change the allocations?

There is no reference to page cache allocation at the pod level in the documentation or in the issues on the project's GitHub.
Ceph supports write-back caching for RBD. To enable it, add rbd cache = true to the [client] section of your ceph.conf file. By default librbd does not perform any caching. Writes and reads go directly to the storage cluster, and writes return only when the data is on disk on all replicas. With caching enabled, writes return immediately, unless there are more than rbd cache max dirty unflushed bytes. In this case, the write triggers writeback and blocks until enough bytes are flushed.
These are the currently supported RBD cache parameters; they must be inserted in the [client] section of your ceph.conf file (a sample section follows the list):
rbd cache = Enable caching for RADOS Block Device (RBD). | Type: Boolean, Required: No, Default: false
rbd cache size = The RBD cache size in bytes. | Type: 64-bit Integer, Required: No, Default: 32 MiB
rbd cache max dirty = The dirty limit in bytes at which the cache triggers write-back. If 0, uses write-through caching. | Type: 64-bit Integer, Required: No, Constraint: Must be less than rbd cache size, Default: 24 MiB
rbd cache target dirty = The dirty target before the cache begins writing data to the data storage. Does not block writes to the cache. | Type: 64-bit Integer, Required: No, Constraint: Must be less than rbd cache max dirty, Default: 16 MiB
rbd cache max dirty age = The number of seconds dirty data is in the cache before writeback starts. | Type: Float, Required: No, Default: 1.0
rbd cache writethrough until flush = Start out in write-through mode, and switch to write-back after the first flush request is received. Enabling this is a conservative but safe setting in case VMs running on rbd are too old to send flushes, like the virtio driver in Linux before 2.6.32. | Type: Boolean, Required: No, Default: false
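For example, a minimal [client] section enabling write-back caching could look like the following sketch (the byte values are illustrative, not recommendations, and must respect the constraints above):

[client]
# enable the librbd write-back cache
rbd cache = true
# total cache size: 64 MiB (illustrative)
rbd cache size = 67108864
# start blocking writers once 48 MiB of dirty data accumulates (must be < rbd cache size)
rbd cache max dirty = 50331648
# begin flushing in the background at 32 MiB of dirty data (must be < rbd cache max dirty)
rbd cache target dirty = 33554432
# stay in write-through mode until the first flush request is seen
rbd cache writethrough until flush = true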

Related

Mapping nested device mapper mounts back to their physical drive

Looking for a reliable (and hopefully simple) way to trace a directory on an LVM or other device-mapper mounted filesystem back to the physical disk it resides on. The goal is to get the model and serial number of the drive no matter where the script runs.
This is not a problem when the filesystem is mounted on a physical partition, but it gets messy when layers of LVM and/or loopbacks are in between. The lsblk tree shows the device-mapper relationships back to /dev/sda in the following example, but it would not be easy or desirable to parse:
# lsblk -po NAME,MODEL,SERIAL,MOUNTPOINT,MAJ:MIN
NAME MODEL SERIAL MOUNTPOINT MAJ:MIN
/dev/loop0 /mnt/test 7:0
/dev/sda AT1000MX500SSD1 21035FEA05B8 8:0
├─/dev/sda1 /boot 8:1
├─/dev/sda2 8:2
└─/dev/sda5 8:5
└─/dev/mapper/sda5_crypt 254:0
├─/dev/mapper/test5--vg-root / 254:1
└─/dev/mapper/test5--vg-swap_1 [SWAP] 254:2
Tried udevadm info, stat and a few other variations, but they all dead-end at the device mapper without a way (that I can see) of connecting the dots to the backing disk and its model/serial number.
I got a good-enough solution by enumerating the base /dev/sd? devices, looping through each one and its partitions with lsblk -ln devpart, and looking for the mountpoint in column 7 (a scripted sketch of this loop follows the output). In the following example, the desired / shows up in the mappings under the /dev/sda5 partition. The serial number (and a lot of other data) for the base device can then be returned with udevadm info /dev/sda:
sda5 8:5 0 931G 0 part
sda5_crypt 254:0 0 931G 0 crypt
test5--vg-root 254:1 0 651G 0 lvm /
test5--vg-swap_1 254:2 0 976M 0 lvm [SWAP]
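A minimal bash sketch of that loop (the target mountpoint and the reliance on field 7 being MOUNTPOINT in the default lsblk output are assumptions):

#!/bin/bash
# Trace a mountpoint back to its base /dev/sdX device and print model/serial.
target="/"                       # mountpoint to trace (adjust as needed)
for disk in /dev/sd?; do
    # -l: flat list, -n: no header; field 7 is MOUNTPOINT when one is present
    if lsblk -ln "$disk" | awk -v m="$target" '$7 == m { found=1 } END { exit !found }'; then
        echo "$target is backed by $disk"
        udevadm info "$disk" | grep -E 'ID_MODEL=|ID_SERIAL_SHORT='
    fi
done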

UseContainerSupport and direct memory

In a container-based environment such as Kubernetes, the UseContainerSupport JVM feature is handy as it allows configuring heap size as a percentage of container memory via options such as -XX:MaxRAMPercentage instead of a static value via -Xmx. This way you don't have to adjust your JVM options every time the container memory limit changes, potentially allowing the use of vertical autoscaling. The primary goal is hitting a Java OutOfMemoryError rather than running out of memory at the container level (e.g. a Kubernetes OOMKilled).
That covers heap memory. In applications that use a significant amount of direct memory via NIO (e.g. gRPC/Netty), what are the options for this? The main option I could find is -XX:MaxDirectMemorySize, but this takes a static value similar to -Xmx.
There's no similar switch for MaxDirectMemorySize as far as I know.
But by default (if you don't specify -XX:MaxDirectMemorySize) the limit is the same as for MaxHeapSize.
That means that if you set -XX:MaxRAMPercentage, the same limit applies to the maximum direct memory.
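As an illustration (the jar name and the percentage are placeholders), a container entrypoint relying on this behaviour might be:

# heap capped at 50% of the container memory limit (UseContainerSupport is on by
# default in recent JDKs); with no -XX:MaxDirectMemorySize given, the direct-memory
# limit defaults to the same value as MaxHeapSize
java -XX:MaxRAMPercentage=50.0 -jar app.jar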
Note that you cannot verify this simply via -XX:+PrintFlagsFinal, because that prints 0:
java -XX:MaxRAMPercentage=1 -XX:+PrintFlagsFinal -version | grep 'Max.*Size'
...
uint64_t MaxDirectMemorySize = 0 {product} {default}
size_t MaxHeapSize = 343932928 {product} {ergonomic}
...
openjdk version "17.0.2" 2022-01-18
...
See also https://dzone.com/articles/default-hotspot-maximum-direct-memory-size and Replace access to sun.misc.VM for JDK 11
My own experiments here: https://github.com/jumarko/clojure-experiments/pull/32

Tomcat 9 throwing - org.apache.catalina.webresources.Cache.getResource Unable to add the resource

02-Feb-2022 16:35:23.846 WARNING [http-nio-9020-exec-442] org.apache.catalina.webresources.Cache.getResource Unable to add the resource at [/] to the cache for web application [] because there was insufficient free space available after evicting expired cache entries - consider increasing the maximum size of the cache
02-Feb-2022 16:35:23.870 WARNING [http-nio-9020-exec-442] org.apache.catalina.webresources.Cache.getResource Unable to add the resource at [/] to the cache for web application [] because there was insufficient free space available after evicting expired cache entries - consider increasing the maximum size of the cache
02-Feb-2022 16:35:23.975 WARNING [http-nio-9020-exec-442] org.apache.catalina.webresources.Cache.getResource Unable to add the resource at [/] to the cache for web application [] because there was insufficient free space available after evicting expired cache entries - consider increasing the maximum size of the cache
02-Feb-2022 16:35:27.468 WARNING [mysql-cj-abandoned-connection-cleanup] org.apache.catalina.webresources.Cache.getResource Unable to add the resource at [/WEB-INF/classes/] to the cache for web application [] because there was insufficient free space available after evicting expired cache entries - consider increasing the maximum size of the cache
02-Feb-2022 16:35:27.518 WARNING [http-nio-9020-exec-452] org.apache.catalina.webresources.Cache.getResource Unable to add the resource at [/audio-cassettes-to-cd/index.html] to the cache for web application [] because there was insufficient free space available after evicting expired cache entries - consider increasing the maximum size of the cache
02-Feb-2022 16:35:27.518 WARNING [http-nio-9020-exec-452] org.apache.catalina.webresources.Cache.getResource Unable to add the resource at [/audio-cassettes-to-cd/index.htm] to the cache for web application [] because there was insufficient free space available after evicting expired cache entries - consider increasing the maximum size of the cache
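The message itself points at the web resources cache; the setting it suggests increasing lives in the application's context.xml. A sketch with an illustrative 100 MB limit (cacheMaxSize is in kilobytes):

<!-- conf/context.xml or the webapp's META-INF/context.xml -->
<Context>
    <Resources cachingAllowed="true" cacheMaxSize="102400" />
</Context>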

What's the best practice of improving IOPS of a CEPH cluster?

I am currently building a Ceph cluster for a KVM platform, and the performance is catastrophic right now; the figures are dreadful. I am not really familiar with physically distributed systems. Is there any general advice for improving the overall performance (i.e. latency, bandwidth and IOPS)?
The hardware configuration is not optimal right now, but I would still like to get the full potential out of what I currently have:
1x 10Gbe Huawei switch
3x Rack server, with hardware configuration:
Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50GHz x2, 48 logical cores in total,
128GB DDR3 RAM
Intel 1.84 TB NVMe SSD x6 as data drives, with 1 OSD per disk (6 OSDs per server in total)
My current /etc/ceph/ceph.conf:
[global]
fsid = f2d6d3a7-0e61-4768-b3f5-b19dd2d8b657
mon initial members = ceph-node1, ceph-node2, ceph-node3
mon allow pool delete = true
mon host = 192.168.16.1, 192.168.16.2, 192.168.16.3
public network = 192.168.16.0/24
cluster network = 192.168.16.0/24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd pool default size = 3
osd pool default min size = 1
osd pool default pg num = 600
osd pool default pgp num = 600
osd memory target = 4294967296
max open files = 131072
[mon]
mon clock drift allowed = 1
mon osd min down reporters = 13
mon osd down out interval = 600
[OSD]
osd journal size = 20000
osd max write size = 512
osd client message size cap = 2147483648
osd deep scrub stride = 131072
osd op threads = 16
osd disk threads = 4
osd map cache size = 1024
osd map cache bl size = 128
osd recovery op priority = 2
osd recovery max active = 10
osd max backfills = 4
osd min pg log entries = 30000
osd max pg log entries = 100000
osd mon heartbeat interval = 40
ms dispatch throttle bytes = 1048576000
objecter inflight ops = 819200
osd op log threshold = 50
osd crush chooseleaf type = 0
journal max write bytes = 1073714824
journal max write entries = 10000
journal queue max ops = 50000
journal queue max bytes = 10485760000
[Client]
rbd cache = True
rbd cache size = 335544320
rbd cache max dirty = 134217728
rbd cache max dirty age = 30
rbd cache writethrough until flush = False
rbd cache max dirty object = 2
rbd cache target dirty = 235544320
The IO benchmark was done with fio, using the following configuration:
fio -ioengine=libaio -bs=4k -direct=1 -thread -rw=randread -size=100G -filename=/data/testfile -name="CEPH Test" -iodepth=8 -runtime=30
Benchmark result screenshot:
The benchmark was done on a separate machine, configured to connect to the cluster via the 10 GbE switch, with only the MDS installed. The benchmark machine is identical to the other 3 that form the cluster, apart from the absence of the Intel NVMe SSD drives.
Any help is appreciated.
First, I must note that Ceph is not an acronym; it is short for cephalopod, because tentacles.
That said, you have a number of settings in ceph.conf that surprise me, like the extreme number of osdmaps you're caching. The thread settings can be tricky and vary in applicability between releases. If you're building pools with 600 PGs, that isn't great: you generally want a power of 2, and you want to target a per-OSD ratio that factors in drive type and other pools. Setting the mon clock skew allowance to a full second (vs. the default 50 ms) is downright alarming; with Chrony or even the legacy ntpd it's not hard to get sub-millisecond syncing.
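For instance, on a Nautilus or later release the PG counts can be inspected and adjusted like this (the pool name is a placeholder, and 512 is purely illustrative):

# what the autoscaler thinks of each pool
ceph osd pool autoscale-status
# current value for one pool
ceph osd pool get <pool> pg_num
# move to a power of two
ceph osd pool set <pool> pg_num 512
ceph osd pool set <pool> pgp_num 512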
Three nodes may be limiting in the degree of parallelism / overlap clients can support, especially since you only have 6 drives per server. That's only 18 OSDs.
You have Filestore settings in there too; you aren't really using Filestore, are you? Or a Ceph release older than Nautilus?
Finally, as more of an actual answer to the question posed, one simple thing you can do is to split each NVMe drive into two OSDs -- with appropriate pgp_num and pg_num settings for the pool.
ceph-volume lvm batch --osds-per-device 2
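A fuller invocation might look like this sketch (the device names are hypothetical, and the drives must be ones you are prepared to re-provision):

# create two OSDs on each listed NVMe device
ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1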
I assume you meant 3 server blades and not 3 racks.
What was your rough estimate of expected performance?
What is the performance profile of your disk hardware (outside Ceph) at 4K and 2 MB? (A baseline fio run against a raw device is sketched after these questions.)
How many disks do you have in this pool, and what are the replication factor/strategy and object size?
On the client side you are performing small reads: 4K.
On the server side, depending on your read-ahead settings and object size, each of these 4K reads may pull in much more data in the background.
Did you check whether one of your disks is really at its limits and that there is no network/CPU throttling?
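As a sketch of such a raw-device baseline (the device path is hypothetical; a read-only workload is used so the data on the drive is not modified):

# random 4K reads directly against one NVMe device, bypassing Ceph entirely
fio --name=raw-4k-randread --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 --rw=randread --bs=4k --iodepth=8 --runtime=30 --time_based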
You can partition your drives with LVM and use multiple OSDs per drive. Since you have so many cores per server, one OSD per drive is not making use of them.

Restore data from Postgres data files

I have a system with a broken Postgres (a RAID failure is the reason), without any backups.
I am trying to move the data to another computer with Postgres (and, in any case, to make a backup).
But whenever I set up the data directory and run Postgres I get the message:
GET FATAL: database files are incompatible with server
2012-08-15 19:58:38 GET DETAIL: The database cluster was initialized with BLCKSZ 16777216, but the server was compiled with BLCKSZ 8192.
2012-08-15 19:58:38 GET HINT: It looks like you need to recompile or initdb.
16777216 is a very strange number (2 to the power of 24 - too big).
However, I can't override the default value of 8192 when compiling (playing with --with-blocksize= has no effect, and I can't find BLCKSZ in the header files).
Is there any way to extract the data?
This is the environment and the circumstances:
harddrive: RAID 1 with 3 SAS disks in array
OS: ubuntu 10.04.04 amd64
Postgres: 9.1 (installed via apt-get; we changed the repository links to a higher Ubuntu version)
the system became broken - after some time we got
AAC: Host Adapter BLINK LED 0x56
AACO: Adapter kernel panic'd 56
(filesystem or hardware error)
Somehow we recovered the data directory. pg_controldata showed:
pg_control version number: 903
Catalog version number: 201105231
Database system identifier: 5714530593695276911
Database cluster state: shut down
pg_control last modified: Tue 15 Aug 2012 11:50:50
Latest checkpoint location: 1B595668/2000020
Prior checkpoint location: 0/0
Latest checkpoint's REDO location: 1B595668/2000020
Latest checkpoint's TimeLineID: 1
Latest checkpoint's NextXID: 0/4057946
Latest checkpoint's NextOID: 40960
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 670
Latest checkpoint's oldestXID's DB: 1344846103
Latest checkpoint's oldestActiveXID: 0
Time of latest checkpoint: Tue 15 Aug 2012 11:50:50
Minimum recovery ending location: 0/0
Backup start location: 0/0
Current wal_level setting: minimal
Current max_connections setting: 100
Current max_prepared_xacts setting: 0
Current max_locks_per_xact setting: 64
Maximum data alignment: 8
Database block size: 16777216
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 2387576020
Maximum size of a TOAST chunk: 0
Date/time type storage: floating-point numbers
Float4 argument passing: by reference
Float8 argument passing: by reference
First I tried to bring the DB up on Ubuntu servers (hard disk: simple serial x2, Ubuntu 10.04 i386, Postgres 9.1) and got the same error as above (about BLCKSZ).
That's why I deployed Ubuntu 10.04 amd64 with an English-locale Postgres 9.1 in a virtual machine (because I got '?' instead of Russian characters in the error logs in the previous step).
I got the same error (about BLCKSZ).
After that I removed the apt-get Postgres version and compiled it as described in the docs: http://www.postgresql.org/docs/9.1/static/installation.html.
Playing with configure --with-blocksize=BLOCKSIZE had no effect - I got the same error.
Sorry for the post.
The pg_control file was broken by some manipulations with it.
So, the cluster was successfully restored by pg_resetxlog with the initial data.
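For reference, a minimal sketch of that kind of invocation (the data directory path is hypothetical; -f forces the reset and can lose recent transactions, so run it only on a copy of the data directory):

# PostgreSQL 9.1: force a reset of the write-ahead log and pg_control
pg_resetxlog -f /var/lib/postgresql/9.1/main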
A block size of 16 MB would be really weird, and since these two values also look completely bogus:
Maximum columns in an index: 2387576020
Maximum size of a TOAST chunk: 0
...you might want to question the integrity of this data before spending time on compiling postgres with a non-standard block size.
If you look at the sizes of the files corresponding to relations, are they multiples of 16 MB or of 8 kB?
If the database has some multi-gigabyte tables, what appears to be the cut-off size on disk (the size above which Postgres splits the data into several files)? This should be equal to 'Database block size' * 'Blocks per segment of large relation'. On a default install it's 1 GB (8192 * 131072). A quick way to check the on-disk sizes is sketched below.
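One way to eyeball this (the data directory path is hypothetical) is to list the largest relation files and check whether their sizes are multiples of 8192 or 16777216:

# print size, size modulo both candidate block sizes, and path for the ten largest relation files
find /var/lib/postgresql/9.1/main/base -type f -name '[0-9]*' -printf '%s %p\n' | sort -n | tail -10 | awk '{ printf "%d  mod8192=%d  mod16777216=%d  %s\n", $1, $1%8192, $1%16777216, $2 }'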
See here for details on configuring kernel resources. Perhaps the default/current settings for this new OS won't allow the postmaster to start.
Here are details on the meaning and context of the BLCKSZ parameter. Was the system that failed running a 64-bit build of PostgreSQL while the new system is a 32-bit build? If possible, obtaining version information from the failed system's PostgreSQL could shed light on the problem. Let us know what version, build, and OS were used. Was it a custom build?