What's the best practice of improving IOPS of a CEPH cluster? - ceph

I am currently building a Ceph cluster for a KVM platform, and its performance is catastrophic right now; the numbers are dreadful. I am not really familiar with physically distributed systems. Is there any general advice for improving the overall performance (i.e. latency, bandwidth and IOPS)?
The hardware configuration is not optimal right now, but I would still like to get the full potential out of what I currently have:
1x 10GbE Huawei switch
3x rack servers, each with the following hardware configuration:
Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50GHz x2, 48 logical cores in total
128GB DDR3 RAM
6x Intel 1.84TB NVMe SSD as data drives, with 1 OSD per disk (6 OSDs per server in total)
My current /etc/ceph/ceph.conf:
[global]
fsid = f2d6d3a7-0e61-4768-b3f5-b19dd2d8b657
mon initial members = ceph-node1, ceph-node2, ceph-node3
mon allow pool delete = true
mon host = 192.168.16.1, 192.168.16.2, 192.168.16.3
public network = 192.168.16.0/24
cluster network = 192.168.16.0/24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd pool default size = 3
osd pool default min size = 1
osd pool default pg num = 600
osd pool default pgp num = 600
osd memory target = 4294967296
max open files = 131072
[mon]
mon clock drift allowed = 1
mon osd min down reporters = 13
mon osd down out interval = 600
[osd]
osd journal size = 20000
osd max write size = 512
osd client message size cap = 2147483648
osd deep scrub stride = 131072
osd op threads = 16
osd disk threads = 4
osd map cache size = 1024
osd map cache bl size = 128
osd recovery op priority = 2
osd recovery max active = 10
osd max backfills = 4
osd min pg log entries = 30000
osd max pg log entries = 100000
osd mon heartbeat interval = 40
ms dispatch throttle bytes = 1048576000
objecter inflight ops = 819200
osd op log threshold = 50
osd crush chooseleaf type = 0
journal max write bytes = 1073714824
journal max write entries = 10000
journal queue max ops = 50000
journal queue max bytes = 10485760000
[client]
rbd cache = True
rbd cache size = 335544320
rbd cache max dirty = 134217728
rbd cache max dirty age = 30
rbd cache writethrough until flush = False
rbd cache max dirty object = 2
rbd cache target dirty = 235544320
The IO benchmark was done with fio, using the following command:
fio -ioengine=libaio -bs=4k -direct=1 -thread -rw=randread -size=100G -filename=/data/testfile -name="CEPH Test" -iodepth=8 -runtime=30
Benchmark result screenshot:
[image: the benchmark result]
The benchmark was done on a separate machine, configured to connect to the cluster via the 10GbE switch with only the MDS installed. The benchmark machine is identical to the other 3 that form the cluster, apart from the absence of the Intel NVMe SSD drives.
Any help is appreciated,

First, I must note that Ceph is not an acronym; it is short for Cephalopod, because tentacles.
That said, you have a number of settings in ceph.conf that surprise me, like the extreme number of osdmaps you're caching. The thread settings can be tricky, and they vary in applicability between releases. If you're building pools with 600 PGs, that isn't great: you generally want a power of two, and a per-OSD PG ratio that factors in drive type and other pools. Setting the allowed mon clock drift to a full second (vs. the default 50 ms) is downright alarming; with Chrony or even the legacy ntpd it's not hard to get sub-millisecond syncing.
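As a rough sketch of both checks (assuming Chrony is the time source; the ~100 PGs per OSD target is just a common rule of thumb, not a value from your cluster):
# Verify clock sync quality on each mon node; sub-millisecond offsets are realistic
chronyc tracking | grep -E 'System time|RMS offset'
# PG sizing rule of thumb: (OSD count * ~100 PGs per OSD) / replica size, rounded to a power of two
# 18 OSDs * 100 / 3 = 600, so 512 would be the nearest power of two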
Three nodes may limit the degree of parallelism / overlap clients can exploit, especially since you only have 6 drives per server. That's only 18 OSDs.
You have Filestore settings in there too; you aren't really using Filestore, are you? Or a Ceph release older than Nautilus?
Finally, as more of an actual answer to the question posed, one simple thing you can do is to split each NVMe drive into two OSDs -- with appropriate pgp_num and pg_num settings for the pool.
ceph-volume lvm batch --osds-per-device 2
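As a fuller sketch of that split (the device names and the pool name "rbd" are hypothetical; substitute your own, and note that doubling the OSD count to 36 moves the rule-of-thumb PG count to 36 * 100 / 3 = 1200, i.e. 1024 as a power of two):
# On each OSD node: create two OSDs per listed NVMe device
ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1
# Then raise the pool's PG count to match the larger OSD count
ceph osd pool set rbd pg_num 1024
ceph osd pool set rbd pgp_num 1024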

I assume you meant 3 server blades and not 3 racks.
What was your rough estimate of expected performance?
What is the performance profile of your disk hardware (outside Ceph) at 4K and 2MB?
How many disks do you have in this pool, and what are the replication factor/strategy and object size?
On the client side you are performing small reads: 4K.
On the server side, depending on your read-ahead settings and object size, each of these 4K reads may pull in much more data in the background.
Did you check whether one of your disks is really at its limits, and that there is no network/CPU throttling?
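As a rough sketch of how to check the raw-drive and throttling points (the device path is hypothetical; the fio job is read-only, but run it against a drive you are comfortable loading):
# Baseline a single NVMe device outside Ceph with the same 4K random-read pattern
fio --name=raw-4k --ioengine=libaio --direct=1 --bs=4k --iodepth=32 --numjobs=4 --rw=randread --runtime=30 --time_based --filename=/dev/nvme0n1
# While the client benchmark runs, watch per-OSD latency, device utilisation and network throughput
ceph osd perf
iostat -x 1
sar -n DEV 1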

You can partition your drives with LVM and use multiple OSDs per drive. Since you have so many cores per server, one OSD per drive is not making use of them.

Related

Is there any way to reduce the PostgreSQL performance deviation between the multiple iterations?

NOPM values were captured with HammerDB v4.3 scripts (schema_tpcc.tcl and test_tpcc.tcl) over multiple trials.
The expected performance deviation between trials should be less than 2%, but I observed more.
Hardware configuration
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 8
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 16384K
OS: RHEL 8.4
RAM size: 512 GB
SSD: 1 TB
postgresql.conf
autovacuum_max_workers = 16
autovacuum_vacuum_cost_limit = 3000
checkpoint_completion_target = 0.9
checkpoint_timeout = '15min'
cpu_tuple_cost = 0.03
effective_cache_size = '350GB'
listen_addresses = '*'
maintenance_work_mem = '2GB'
max_connections = 1000
max_wal_size = '128GB'
random_page_cost = 1.1
shared_buffers = '128GB'
wal_buffers = '1GB'
work_mem = '128MB'
random_page_cost = 1.1
effective_io_concurrency = 200
HammerDB Scripts
>>cat schema.tcl
#!/bin/tclsh
dbset db pg
diset connection pg_host localhost
diset connection pg_port 5432
diset tpcc pg_count_ware 400
diset tpcc pg_num_vu 50
print dict
buildschema
waittocomplete
Run the test, i.e. start with 1 VU, then 2, 4, etc.
| Virtual Users | Trial-1 (NOPM) | Trial-2 (NOPM) | %diff    |
|---------------|----------------|----------------|----------|
| 12            | 99390          | 92913          | 6.516752 |
| 140           | 561429         | 525408         | 6.415949 |
| 192           | 636016         | 499574         | 21.4526  |
| 230           | 621644         | 701882         | 12.9074  |
There is already a comprehensive answer to this question on HammerDB discussions.
You make an assumption that PostgreSQL will scale linearly for an intensive OLTP workload on 256 logical CPUs of a particular type. However, if a workload experiences high contention then performance will not be as expected on a particular hardware/software combination due to locking and latching - this is to be expected. Your experience may be different on different hardware (with the same database) and/or a different database (on the same hardware). For example, you may find a higher core count results in lower performance as the additional cores increase contention, lowering throughput.
You need to follow the advice in the discussions post and analyze the wait events, either with the HammerDB v4.3 graphical metrics viewer for pg_active_session_history or with SQL directly. This will point you to the exact cause of contention for a particular hardware/software combination (LWLock waits are highlighted in pink in the viewer; look for them in the query output). If this does not enable you to diagnose the issue directly, then engaging a PostgreSQL consultant may be necessary to explain the issue for you.
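For the SQL route, assuming the pgsentinel extension that provides pg_active_session_history is installed (database name and connection details below are placeholders), a query along these lines summarises the dominant wait events:
psql -d tpcc -c "SELECT wait_event_type, wait_event, count(*) AS samples FROM pg_active_session_history WHERE wait_event IS NOT NULL GROUP BY wait_event_type, wait_event ORDER BY samples DESC LIMIT 10;"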

Ceph libRBD cache control

So Ceph has a user-space page cache implementation in librbd. Does it allow users to specify how much page cache to allocate to each pod? If yes, can we dynamically change the allocations?
There is no reference to page-cache allocation at the pod level in the documentation or in the issues on the project's GitHub.
Ceph supports write-back caching for RBD. To enable it, add rbd cache = true to the [client] section of your ceph.conf file. By default librbd does not perform any caching. Writes and reads go directly to the storage cluster, and writes return only when the data is on disk on all replicas. With caching enabled, writes return immediately, unless there are more than rbd cache max dirty unflushed bytes. In this case, the write triggers writeback and blocks until enough bytes are flushed.
These are the currently supported RBD cache parameters; they must be inserted in the [client] section of your ceph.conf file:
rbd cache = Enable caching for RADOS Block Device (RBD). | Type: Boolean, Required: No, Default: false
rbd cache size = The RBD cache size in bytes. | Type: 64-bit Integer, Required: No, Default: 32 MiB
rbd cache max dirty = The dirty limit in bytes at which the cache triggers write-back. If 0, uses write-through caching. | Type: 64-bit Integer, Required: No, Constraint: Must be less than rbd cache size, Default: 24 MiB
rbd cache target dirty = The dirty target before the cache begins writing data to the data storage. Does not block writes to the cache. | Type: 64-bit Integer, Required: No, Constraint: Must be less than rbd cache max dirty, Default: 16 MiB
rbd cache max dirty age = The number of seconds dirty data is in the cache before writeback starts. | Type: Float, Required: No, Default: 1.0
rbd cache writethrough until flush = Start out in write-through mode, and switch to write-back after the first flush request is received. Enabling this is a conservative but safe setting in case VMs running on rbd are too old to send flushes, like the virtio driver in Linux before 2.6.32. | Type: Boolean, Required: No, Default: false
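Putting that together, a minimal [client] section could look like the sketch below; the byte values are simply the documented defaults spelled out, and writethrough-until-flush is enabled here as the conservative choice described above, not because it is a default:
[client]
rbd cache = true
# 32 MiB cache, 24 MiB write-back trigger, 16 MiB dirty target, flush dirty data after 1 s
rbd cache size = 33554432
rbd cache max dirty = 25165824
rbd cache target dirty = 16777216
rbd cache max dirty age = 1.0
# conservative but safe, per the description above
rbd cache writethrough until flush = true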

PostgreSQL 11 Shared Memory Error: could not open shared memory segment "/PostgreSQL.XXXXXXXX": No such file or directory

Shared memory files are getting deleted after some time (~15 hours) in Postgres 11.
2019-07-09 08:46:41 CDT [] [6723]: [1-1] user=,db=,e=58P01 ERROR: could not open shared memory segment "/PostgreSQL.291691635": No such file or directory
2019-07-09 08:46:41 CDT [] [6722]: [1-1] user=,db=,e=58P01 ERROR: could not open shared memory segment "/PostgreSQL.291691635": No such file or directory
2019-07-09 08:46:41 CDT [10.40.0.204(60550)] [13880]: [1-1] user=user_name,db=db_name,e=58P01 ERROR: could not open shared memory segment "/PostgreSQL.291691635": No such file or directory
2019-07-09 08:46:41 CDT [10.40.0.204(60550)] [13880]: [2-1] user=user_name,db=db_name,e=58P01 CONTEXT: parallel worker
2019-07-09 08:46:41 CDT [10.40.0.204(60550)] [13880]: [3-1] user=user_name,db=db_name,e=58P01 STATEMENT: WITH overall_reviewed AS (SQL Query)
GCP VM Config
CPU: 4
RAM: 16 GB
OS: Ubuntu 18.04.1 LTS
Kernel shared memory settings:
kernel.shmmax=8589934592
kernel.shmall=2097152
postgresql.conf
max_connections = 500
shared_buffers = 4GB
effective_cache_size = 12GB
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.7
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 4194kB
min_wal_size = 1GB
max_wal_size = 2GB
max_worker_processes = 4
max_parallel_workers_per_gather = 2
max_parallel_workers = 4
During startup: no errors/warnings
After ~15 hours, some of the shared memory files get deleted; I suspect some other process is deleting files in "/dev/shm".
I'm not sure what the root cause is.
Setting dynamic_shared_memory_type = none in postgresql.conf did solve the issue.
I got the same problem on Ubuntu 18.04 and PostgreSQL 11, and after some more research I found a solution for us. The error occurred when the backup user, which is the same user as the PG service user, logs into the system. The following link describes how the storage under /dev/shm gets deleted when that same user logs in to the system. So our solution was to change the following:
/etc/systemd/logind.conf
and added the line
RemoveIPC=no
then restarted the service:
systemctl restart systemd-logind.service
Sources:
https://www.postgresql-archive.org/systemd-deletes-shared-memory-segment-in-dev-shm-Postgresql-NNNNNN-td5883507.html
https://superuser.com/questions/1117764/why-are-the-contents-of-dev-shm-is-being-removed-automatically
We had the same issue, and it turned out that someone had set the postgres user's UID to be greater than 1000 (which means that the postgres user was no longer a system account). And, as said here:
After hours of searching and reading, I found the culprit.
It's a setting for systemd. The /etc/systemd/logind.conf contains default configuration options, with each of them commented out.
The RemoveIPC option is set to yes by default. That option tells systemd to clean up interprocess communication (IPC) for "user accounts" who aren't logged in.
Note: this does not affect "system accounts".
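A quick way to check whether this applies on a given host (a sketch; the postgres account name is an assumption):
id -u postgres                          # a UID >= 1000 usually indicates a regular, non-system account
grep -i 'SYS_UID_MAX' /etc/login.defs   # upper bound for system-account UIDs on this system
grep -i 'RemoveIPC' /etc/systemd/logind.conf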
I met the same issue when I had two SQL Developer windows open with the same user account; one of them was a remote session which I had completely forgotten to close.
I was running some aggregation operators like count(*) or max(...), and got the error in both. The error is similar:
ERROR: could not open shared memory segment "/PostgreSQL.798235387":
No such file or directory Where: parallel worker
Solution? I killed the remote session.... XD
And life is peaceful and happy again :D

pgbouncer free_servers - how to increase them

The current settings of a pgbouncer server are the following, and what I don't understand is the 'free_servers' value given by the show lists command when connecting to pgbouncer. Is it a (soft or hard) limit on the number of connections to the PostgreSQL databases used with this instance of pgbouncer?
configuration :
max_client_conn = 2048
default_pool_size = 1024
min_pool_size = 10
reserve_pool_size = 500
reserve_pool_timeout = 1
server_idle_timeout = 600
listen_backlog = 1024
show lists gives :
pgbouncer=# show lists ;
list | items
---------------+--------
databases | 6
pools | 3
free_clients | 185
used_clients | 15
free_servers | 70
used_servers | 30
It seems that there is a limit at 30 + 70 = 100 servers, but I couldn't find it even after reviewing the configuration values with show config, and the documentation doesn't make explicit which setting to change to increase free_servers.
pgbouncer version : 1.7.2
EDIT :
I've just discovered that, out of a pool of 6 web servers configured to hit the same PG database, 3 of them can have 200 backend connections (server connections), while the other 3 can only make and maintain 100 connections (as described in the first part). Yet the configuration is exactly the same in the pgbouncer configuration file, the servers are cloned VMs, and the version of pgbouncer is also the same.
So far, I still haven't found any documentation on the internet explaining where this limitation comes from.
This data is just some internal information for PgBouncer.
Server information is stored inside an array list data structure which is pre-allocated up to a certain size, in this case 100 slots. used_servers = 30, free_servers = 70 means there are 30 slots currently in use and 70 slots free. PgBouncer will automatically increase the size of the list when it's full, hence there's no configuration option for this.
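To watch live per-pool usage rather than the internal slot accounting, the admin console's SHOW POOLS and SHOW SERVERS commands can be used (the admin port 6432 here is an assumption):
psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c 'SHOW POOLS;'
psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c 'SHOW SERVERS;'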

C10k Tsung Gatling and PlayWS

I'm new to load testing, but I googled a lot and configured a test system on Amazon.
The system consists of a WebSocket server on the Play framework and some load-testing machines. I tried the following load-testing tools: Tsung and Gatling.
My testing scenario: I create >10k users, each of which connects to the server and starts sending one message per second. I tuned Linux to handle more than 100k connections. I played around with JAVA_OPTS for Gatling (added more memory and ParallelGC usage). I played around with Akka on the server side to handle 100-300 dispatcher threads. I ordered a 36 vCPU machine with 60 GB RAM and a 10 GBit channel for the server machine.
But the result was the same: Tsung and Gatling each send near 10k messages per second from one machine (I sent just text messages < 160 bytes).
Can someone explain why I can't reach more than 10k concurrent users (1 message per second each), and what am I doing wrong?
actor {
default-dispatcher = {
fork-join-executor {
throughput = 1000
parallelism-factor = 36.0
parallelism-max = 154
}
}
}
JAVA_OPTS="-Xmx3800m -Xms3800m -Xmn2g -XX:+UseParallelGC -XX:ParallelGCThreads=20"
Linux configs
ulimit -n 999999   # raise the open-file-descriptor limit for the shell running the injector
sudo vim /etc/sysctl.conf
# General gigabit tuning
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 4096 16777216
net.ipv4.tcp_wmem = 4096 4096 16777216
#
# # This gives the kernel more memory for TCP
# # which you need with many (100k+) open socket connections
net.ipv4.tcp_mem = 4096 65536 16777216
#
# # Backlog
net.core.netdev_max_backlog = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_syncookies = 1
# Controls IP packet forwarding
net.ipv4.ip_forward = 0
# Controls source route verification
net.ipv4.conf.default.rp_filter = 1
# Do not accept source routing
net.ipv4.conf.default.accept_source_route = 0
# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 0
# Controls whether core dumps will append the PID to the core filename.
# Useful for debugging multi-threaded applications.
kernel.core_uses_pid = 1
# Controls the use of TCP syncookies
net.ipv4.tcp_syncookies = 0
# Disable netfilter on bridges.
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0
# Controls the default maximum size of a message queue
kernel.msgmnb = 65535
# Controls the maximum size of a message, in bytes
kernel.msgmax = 65535
# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736
# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 4294967296
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_synack_retries = 5
net.ipv4.tcp_orphan_retries = 5
fs.file-max = 999999
net.ipv4.tcp_max_orphans = 819200
net.core.somaxconn = 65535
net.ipv4.tcp_congestion_control = cubic
I tested with 5 client machines and found that the server side on the Play framework can handle 50k messages per second from 50k users with such a configuration. The problem is that one client machine can't send more than 10k messages per second from 10k users. Maybe someone knows other load-testing tools that can send more than 10k mps from 10k users on a single machine.
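As a sketch of how to confirm which injector-side ceiling is being hit while the test runs (the Gatling process name is an assumption; adjust for Tsung):
ulimit -n                                    # open-file limit the injector JVM inherits from this shell
ss -s                                        # socket totals: established vs TIME-WAIT
cat /proc/sys/net/ipv4/ip_local_port_range   # ephemeral ports available for outbound connections
pidstat -t -p "$(pgrep -f gatling | head -n 1)" 1   # per-thread CPU usage of the injector process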