SHMMAX / SHMALL Kernel parameters for postgres database with huge_pages on - postgresql

I've upgraded the RAM on the Postgres server from 8 to 16 GB and set shared_buffers to 4GB. I have also set
vm.nr_hugepages = 2300
(calculated as per the Postgres docs) and
kernel.shmmax = 8589934592 (8 GB)
Postgres is configured to use huge pages (huge_pages = 'on'):
grep -i hugepages /proc/meminfo
AnonHugePages: 0 kB
HugePages_Total: 2300
HugePages_Free: 167
HugePages_Rsvd: 4
HugePages_Surp: 0
Hugepagesize: 2048 kB
What is the correct configuration of kernel.shmall?
According to the docs, kernel.shmall is expressed in memory pages (getconf PAGE_SIZE --> 4096), but Hugepagesize is 2048 kB. The current configuration is kernel.shmall = 4096, assuming a page size of 2048 kB, but I'm not sure that is correct. Can you help me understand this configuration and tell me if something is wrong? When setting kernel.shmall, should I use a page size of 2048 kB or of 4096 bytes? And what is the best configuration for these parameters?
Thanks

PostgreSQL v10 uses POSIX shared memory unless you set shared_memory_type = sysv, so the System V shared memory kernel parameters are irrelevant, and you don't have to tune them.
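If you do run with shared_memory_type = sysv, a minimal sketch of the arithmetic, under the assumption (per the kernel documentation) that kernel.shmall is counted in ordinary base pages (getconf PAGE_SIZE, 4096 bytes here), not in huge pages:
# Size the System V limits for an 8 GB ceiling, as in the question.
PAGE_SIZE=$(getconf PAGE_SIZE)          # base page size, typically 4096 bytes
SHMMAX=$((8 * 1024 * 1024 * 1024))      # largest single segment, in bytes
SHMALL=$((SHMMAX / PAGE_SIZE))          # total shared memory, in base pages (2097152 here)
sudo sysctl -w kernel.shmmax=$SHMMAX kernel.shmall=$SHMALL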

Related

Mapping nested device mapper mounts back to their physical drive

I'm looking for a reliable (and hopefully simple) way to trace a directory on an LVM or other device-mapper-backed filesystem back to the physical disk it resides on. The goal is to get the model and serial number of the drive no matter where the script runs.
This is not a problem when the filesystem is mounted on a physical partition, but it gets messy when layers of LVM and/or loopbacks are in between. The lsblk tree shows the device-mapper relationships back to /dev/sda in the following example, but it wouldn't be easy or desirable to parse:
# lsblk -po NAME,MODEL,SERIAL,MOUNTPOINT,MAJ:MIN
NAME MODEL SERIAL MOUNTPOINT MAJ:MIN
/dev/loop0 /mnt/test 7:0
/dev/sda AT1000MX500SSD1 21035FEA05B8 8:0
├─/dev/sda1 /boot 8:1
├─/dev/sda2 8:2
└─/dev/sda5 8:5
└─/dev/mapper/sda5_crypt 254:0
├─/dev/mapper/test5--vg-root / 254:1
└─/dev/mapper/test5--vg-swap_1 [SWAP] 254:2
I tried udevadm info, stat and a few other variations, but they all dead-end at the device mapper without a way (that I can see) of connecting the dots to the backing disk and its model/serial number.
I got a good-enough solution by enumerating the base /dev/sd? devices, looping through each one and its partitions with lsblk -ln devpart, and looking for the mountpoint in column 7. In the following example, the desired / shows up in the mappings of the /dev/sda5 partition. The serial number (and a lot of other data) for the base device can then be returned with udevadm info /dev/sda:
sda5 8:5 0 931G 0 part
sda5_crypt 254:0 0 931G 0 crypt
test5--vg-root 254:1 0 651G 0 lvm /
test5--vg-swap_1 254:2 0 976M 0 lvm [SWAP]
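A minimal sketch of that enumeration loop, assuming the mountpoint of interest is / and that the candidate disks all match /dev/sd? (adjust the glob for NVMe or other naming):
#!/bin/bash
# Find the base disk whose tree (partitions, crypt, LVM layers) carries the target
# mountpoint, then print its model and serial number via udevadm properties.
target_mount="/"
for disk in /dev/sd?; do
    # lsblk -ln prints the whole tree below the disk; column 7 is MOUNTPOINT
    if lsblk -ln "$disk" | awk -v m="$target_mount" '$7 == m {found=1} END {exit !found}'; then
        echo "$target_mount is backed by $disk"
        udevadm info --query=property "$disk" | grep -E '^ID_(MODEL|SERIAL_SHORT)='
    fi
done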

Is there any way to reduce the PostgreSQL performance deviation between the multiple iterations?

NOPM values were captured with HammerDB v4.3 scripts (schema_tpcc.tcl and test_tpcc.tcl) over multiple trials.
The expected performance deviation between trials should be less than 2%, but I observed more.
Hardware configuration
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 8
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 16384K
OS: RHEL8.4
RAM size: 512 GB
SSD: 1 TB
postgresql.conf
autovacuum_max_workers = 16
autovacuum_vacuum_cost_limit = 3000
checkpoint_completion_target = 0.9
checkpoint_timeout = '15min'
cpu_tuple_cost = 0.03
effective_cache_size = '350GB'
listen_addresses = '*'
maintenance_work_mem = '2GB'
max_connections = 1000
max_wal_size = '128GB'
random_page_cost = 1.1
shared_buffers = '128GB'
wal_buffers = '1GB'
work_mem = '128MB'
effective_io_concurrency = 200
HammerDB Scripts
>>cat schema.tcl
#!/bin/tclsh
dbset db pg
diset connection pg_host localhost
diset connection pg_port 5432
diset tpcc pg_count_ware 400
diset tpcc pg_num_vu 50
print dict
buildschema
waittocomplete
Run the test, i.e. start with 1 VU, then 2, 4, etc.
| Virtual Users | Trial-1 (NOPM) | Trial-2 (NOPM) | % diff   |
|---------------|----------------|----------------|----------|
| 12            | 99390          | 92913          | 6.516752 |
| 140           | 561429         | 525408         | 6.415949 |
| 192           | 636016         | 499574         | 21.4526  |
| 230           | 621644         | 701882         | 12.9074  |
There is already a comprehensive answer to this question on HammerDB discussions.
You make an assumption that PostgreSQL will scale linearly for an intensive OLTP workload on 256 logical CPUs of a particular type. However, if a workload experiences high contention then performance will not be as expected on a particular hardware/software combination due to locking and latching - this is to be expected. Your experience may be different on different hardware (with the same database) and/or a different database (on the same hardware). For example, you may find a higher core count results in lower performance as the additional cores increase contention, lowering throughput.
You need to follow the advice in that discussion post and analyze the wait events, either with the HammerDB v4.3 graphical metrics viewer for pg_active_session_history or with SQL directly. This will point you to the exact cause of contention for your particular hardware/software combination (LWLock waits are highlighted in pink in the viewer; look for them in the query output otherwise). If this does not enable you to diagnose the issue directly, then engaging a PostgreSQL consultant to explain it for you may be necessary.
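A minimal sketch of the SQL route, assuming the pgsentinel extension that provides pg_active_session_history is installed (which is what the HammerDB viewer reads); the database name here is a placeholder:
# Summarize the most frequently sampled wait events during the run.
psql -d tpcc -c "
  SELECT wait_event_type, wait_event, count(*) AS samples
  FROM   pg_active_session_history
  WHERE  wait_event IS NOT NULL
  GROUP  BY wait_event_type, wait_event
  ORDER  BY samples DESC
  LIMIT  10;"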

What's the best practice of improving IOPS of a CEPH cluster?

I am currently building a Ceph cluster for a KVM platform, and the performance right now is catastrophic; the numbers are dreadful. I am not really familiar with distributed storage systems, so is there any general advice for improving the overall performance (i.e. latency, bandwidth and IOPS)?
The hardware configuration is not optimal right now, but I would still like to unlock the full potential of what I currently have:
1x 10Gbe Huawei switch
3x Rack server, with hardware configuration:
Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50GHz x2, 48 logical cores in total,
128GB DDR3 RAM
Intel 1.84 TB NVMe SSD x6 as data drives, with 1 OSD per disk (6 OSDs per server in total)
My current /etc/ceph/ceph.conf:
[global]
fsid = f2d6d3a7-0e61-4768-b3f5-b19dd2d8b657
mon initial members = ceph-node1, ceph-node2, ceph-node3
mon allow pool delete = true
mon host = 192.168.16.1, 192.168.16.2, 192.168.16.3
public network = 192.168.16.0/24
cluster network = 192.168.16.0/24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd pool default size = 3
osd pool default min size = 1
osd pool default pg num = 600
osd pool default pgp num = 600
osd memory target = 4294967296
max open files = 131072
[mon]
mon clock drift allowed = 1
mon osd min down reporters = 13
mon osd down out interval = 600
[OSD]
osd journal size = 20000
osd max write size = 512
osd client message size cap = 2147483648
osd deep scrub stride = 131072
osd op threads = 16
osd disk threads = 4
osd map cache size = 1024
osd map cache bl size = 128
osd recovery op priority = 2
osd recovery max active = 10
osd max backfills = 4
osd min pg log entries = 30000
osd max pg log entries = 100000
osd mon heartbeat interval = 40
ms dispatch throttle bytes = 1048576000
objecter inflight ops = 819200
osd op log threshold = 50
osd crush chooseleaf type = 0
journal max write bytes = 1073714824
journal max write entries = 10000
journal queue max ops = 50000
journal queue max bytes = 10485760000
[Client]
rbd cache = True
rbd cache size = 335544320
rbd cache max dirty = 134217728
rbd cache max dirty age = 30
rbd cache writethrough until flush = False
rbd cache max dirty object = 2
rbd cache target dirty = 235544320
The IO benchmark was done with fio, using the following configuration:
fio -ioengine=libaio -bs=4k -direct=1 -thread -rw=randread -size=100G -filename=/data/testfile -name="CEPH Test" -iodepth=8 -runtime=30
Benchmark result: [screenshot]
The benchmark was done on a separate machine, configured to connect to the cluster via the 10GbE switch with only the MDS installed. The benchmark machine is identical to the other 3 that form the cluster, apart from the absence of the Intel NVMe SSD drives.
Any help is appreciated.
First, I must note that Ceph is not an acronym; it is short for Cephalopod, because tentacles.
That said, you have a number of settings in ceph.conf that surprise me, like the extreme number of osdmaps you're caching. The thread settings can be tricky, and vary in applicability between releases. Building pools with 600 PGs isn't great: you generally want a power of 2, and a per-OSD ratio that factors in drive type and the other pools. Setting the mon clock skew to a full second (vs. the default 50ms) is downright alarming; with Chrony or even the legacy ntpd it's not hard to get sub-millisecond syncing.
Three nodes may be limiting in the degree of parallelism / overlap clients can support, especially since you only have 6 drives per server. That's only 18 OSDs.
You have Filestore settings in there too; you aren't really using Filestore, are you? Or a Ceph release older than Nautilus?
Finally, as more of an actual answer to the question posed, one simple thing you can do is to split each NVMe drive into two OSDs -- with appropriate pgp_num and pg_num settings for the pool.
ceph-volume lvm batch --osds-per-device 2
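A hedged sketch of what that redeployment might look like on one node, once its existing OSDs have been removed and the drives zapped; the device names and pool name are placeholders for whatever your nodes actually report:
# Redeploy the six data drives with two OSDs per NVMe device
# (/dev/nvme0n1 ... /dev/nvme5n1 are placeholders for your data drives).
ceph-volume lvm batch --osds-per-device 2 \
    /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1
# Then raise pg_num/pgp_num on the pool to suit the larger OSD count,
# keeping it a power of two ("rbd" is a placeholder pool name):
ceph osd pool set rbd pg_num 1024
ceph osd pool set rbd pgp_num 1024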
I assume you meant 3 server blades and not 3 racks.
What was your rough estimate of expected performance?
What is the performance profile of your disk hardware (outside Ceph) at 4K and 2MB?
How many disks do you have in this pool, and what are the replication factor/strategy and object size?
On the client side you are performing small reads: 4K.
On the server side, depending on your read-ahead settings and object size, each of these 4K reads may pull in much more data in the background.
Did you check whether any of your disks is really at its limits, and that there is no network/CPU throttling?
You can partition your drives with LVM and use multiple OSDs per drive. Since you have so many cores per server, one OSD per drive is not making full use of them.
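To answer the "outside Ceph" baseline question, a minimal sketch of a raw-device fio run at both block sizes (read-only, so non-destructive; the device name is a placeholder for one of the NVMe data drives):
# Baseline random/sequential read performance of a single NVMe drive, outside Ceph.
fio --name=nvme-4k --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 \
    --rw=randread --bs=4k --iodepth=32 --numjobs=4 --runtime=30 --time_based --group_reporting
fio --name=nvme-2m --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 \
    --rw=read --bs=2m --iodepth=8 --runtime=30 --time_based --group_reporting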

C10k Tsung Gatling and PlayWS

I'm new to load testing, but I googled a lot and configured a test system on Amazon.
The system consists of a WebSocket server on the Play framework and some load-testing machines. I tried the load-testing tools Tsung and Gatling.
My testing scenario: I create >10k users, each of which connects to the server and starts sending one message per second. I tuned Linux to handle more than 100k connections. I played around with JAVA_OPTS for Gatling (added more memory and ParallelGC usage). I played around with Akka on the server side to handle 100-300 dispatcher threads. I ordered a 36 vCPU machine with 60 GB RAM and a 10 Gbit channel for the server.
But the result was the same: Tsung and Gatling each send about 10k messages per second from one machine (I sent just text messages < 160 bytes).
Can someone explain why I can't reach more than 10k concurrent users (1 message per second each)? What am I doing wrong?
actor {
default-dispatcher = {
fork-join-executor {
throughput = 1000
parallelism-factor = 36.0
parallelism-max = 154
}
}
}
JAVA_OPTS="-Xmx3800m -Xms3800m -Xmn2g -XX:+UseParallelGC -XX:ParallelGCThreads=20"
Linux configs
ulimit -n 999999    # open-file limit raised for the load-generator user
sudo vim /etc/sysctl.conf
# General gigabit tuning
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 4096 16777216
net.ipv4.tcp_wmem = 4096 4096 16777216
#
# # This gives the kernel more memory for TCP
# # which you need with many (100k+) open socket connections
net.ipv4.tcp_mem = 4096 65536 16777216
#
# # Backlog
net.core.netdev_max_backlog = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_syncookies = 1
# Controls IP packet forwarding
net.ipv4.ip_forward = 0
# Controls source route verification
net.ipv4.conf.default.rp_filter = 1
# Do not accept source routing
net.ipv4.conf.default.accept_source_route = 0
# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 0
# Controls whether core dumps will append the PID to the core filename.
# Useful for debugging multi-threaded applications.
kernel.core_uses_pid = 1
# Controls the use of TCP syncookies
net.ipv4.tcp_syncookies = 0
# Disable netfilter on bridges.
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0
# Controls the default maximum size of a message queue
kernel.msgmnb = 65535
# Controls the maximum size of a message, in bytes
kernel.msgmax = 65535
# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736
# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 4294967296
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_synack_retries = 5
net.ipv4.tcp_orphan_retries = 5
fs.file-max = 999999
net.ipv4.tcp_max_orphans = 819200
net.core.somaxconn = 65535
net.ipv4.tcp_congestion_control = cubic
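A small sketch of applying and spot-checking these settings after editing /etc/sysctl.conf (the keys shown are the ones from the file above):
# Load the edited file and verify a few limits that matter for many sockets.
sudo sysctl -p /etc/sysctl.conf
sysctl net.core.somaxconn net.ipv4.ip_local_port_range fs.file-max
ulimit -n    # per-process open-file limit for the current shell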
I tested with 5 client machines and found that the server side on the Play framework can handle 50k messages per second from 50k users with this configuration. The problem is that one client machine can't send more than 10k messages per second from 10k users. Maybe someone knows other load-testing tools that can send more than 10k messages per second from 10k users.

Understanding results of mongostat

I am trying to understand the results of mongostat:
example
insert query update delete getmore command flushes mapped vsize res faults locked % idx
0 2 4 0 0 10 0 976m 2.21g 643m 0 0.1 0
0 1 0 0 0 4 0 976m 2.21g 643m 0 0 0
0 0 0 0 0 1 0 976m 2.21g 643m 0 0 0
I see
mapped - 976m
vsize - 2.21g
res - 643m
res - RAM, so ~650MB of my database is in RAM
mapped - total size of database (via memory mapped files)
vsize - ???
I'm not sure why vsize is important or what exactly it means in this context - I'm running an m1.large so I have about 400GB of HD space + 8GB of RAM.
Can someone help me out here and explain:
I am on the right page
what stats I should monitor in production
This should give you enough information:
mapped - amount of data mmapped (total data size), in megabytes
vsize - virtual size of the process, in megabytes
res - resident size of the process, in megabytes
1) I am on the right page
So mongostat is not really a "live monitor". It's mostly useful for connecting to a specific server and watching for something specific (what's happening when this job runs?). But it's not really useful for tracking performance over time.
Typically, for monitoring the server, you will want to use a tool like Zabbix, Cacti or Munin, or some third-party server monitor. The MongoDB website has a list.
2) what stats I should monitor in production
You should monitor the same basic stats you would monitor on any server:
CPU
Memory
Disk IO
Network traffic
For MongoDB specifically, you will want to run db.serverStatus() and track the
opcounters
connections
indexcounters
Note that these are increasing counters, so you'll have to create the correct "counter type" in your monitoring system (Zabbix, Cacti, etc.) A few of these monitoring programs already have MongoDB plug-ins available.
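A minimal sketch of pulling those counters from the shell for a monitoring script (assuming the legacy mongo shell is available; connection details are omitted and default to localhost):
# Print the opcounters and connection counts as JSON,
# suitable for feeding into a Zabbix/Cacti/Munin collector.
mongo --quiet --eval '
  var s = db.serverStatus();
  printjson({opcounters: s.opcounters, connections: s.connections});
'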
Also note that MongoDB has a "free" monitoring service called MMS. I say "free" because you will be receiving calls from salespeople in exchange for setting up MMS.
Also, you can use these mini tools for watching MongoDB:
http://openmymind.net/2011/9/23/Compressed-Blobs-In-MongoDB/
By the way, I remembered this great online tool from 10gen:
https://mms.10gen.com/user/login