Is there any way to reduce the PostgreSQL performance deviation between multiple iterations?

NOPM values were captured with the HammerDB v4.3 scripts (schema_tpcc.tcl and
test_tpcc.tcl) for multiple trials.
The expected performance deviation between trials should be less than 2%, but I
observed more.
Hardware configuration
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 8
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 16384K
OS: RHEL 8.4
RAM: 512 GB
SSD: 1 TB
postgresql.conf
autovacuum_max_workers = 16
autovacuum_vacuum_cost_limit = 3000
checkpoint_completion_target = 0.9
checkpoint_timeout = '15min'
cpu_tuple_cost = 0.03
effective_cache_size = '350GB'
listen_addresses = '*'
maintenance_work_mem = '2GB'
max_connections = 1000
max_wal_size = '128GB'
random_page_cost = 1.1
shared_buffers = '128GB'
wal_buffers = '1GB'
work_mem = '128MB'
effective_io_concurrency = 200
HammerDB Scripts
>>cat schema.tcl
#!/bin/tclsh
dbset db pg
diset connection pg_host localhost
diset connection pg_port 5432
diset tpcc pg_count_ware 400
diset tpcc pg_num_vu 50
print dict
buildschema
waittocomplete
Run the test at increasing virtual user counts, i.e. start with 1 VU, then 2, 4, etc. (a sketch of the timed driver script follows the table below):
| Virtual Users | Trial-1 (NOPM) | Trial-2 (NOPM) | % diff   |
|---------------|----------------|----------------|----------|
| 12            | 99390          | 92913          | 6.516752 |
| 140           | 561429         | 525408         | 6.415949 |
| 192           | 636016         | 499574         | 21.4526  |
| 230           | 621644         | 701882         | 12.9074  |
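For reference, a minimal sketch of the timed test driver (the test_tpcc.tcl mentioned above), using HammerDB's documented dict keys; the ramp-up, duration, and VU count here are example values, not necessarily those used for the table:
>>cat test.tcl
#!/bin/tclsh
dbset db pg
diset connection pg_host localhost
diset connection pg_port 5432
diset tpcc pg_driver timed
diset tpcc pg_rampup 2
diset tpcc pg_duration 5
vuset vu 12
loadscript
vucreate
vurun
waittocomplete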

There is already a comprehensive answer to this question in the HammerDB discussions.
You are assuming that PostgreSQL will scale linearly for an intensive OLTP workload on 256 logical CPUs of a particular type. However, if a workload experiences high contention, then performance on a particular hardware/software combination will be limited by locking and latching, and run-to-run variation is to be expected. Your experience may be different on different hardware (with the same database) and/or with a different database (on the same hardware). For example, you may find that a higher core count results in lower performance, as the additional cores increase contention and lower throughput.
You need to follow the advice in the discussions post and analyze the wait events, either with the HammerDB v4.3 graphical metrics viewer for pg_active_session_history or with SQL directly. This will point you to the exact cause of contention on your particular hardware/software combination (LWLock waits are highlighted in pink in the viewer; look for them in the query output). If this does not enable you to diagnose the issue directly, then engaging a PostgreSQL consultant may be necessary.
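For example, a query along these lines summarizes the top wait events from the last run (a sketch, assuming the pgsentinel extension that provides pg_active_session_history is installed, which is what HammerDB samples):
# Top wait events sampled during the last 15 minutes
# (pg_active_session_history comes from the pgsentinel extension)
psql -h localhost -p 5432 -c "
SELECT wait_event_type, wait_event, count(*) AS samples
FROM pg_active_session_history
WHERE ash_time > now() - interval '15 minutes'
GROUP BY wait_event_type, wait_event
ORDER BY samples DESC
LIMIT 10;"
A run dominated by LWLock waits confirms the contention described above.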

Related

SHMMAX / SHMALL Kernel parameters for postgres database with huge_pages on

I've upgraded the RAM on a Postgres server from 8 GB to 16 GB and set shared_buffers to 4 GB. I've also set
vm.nr_hugepages = 2300
as calculated per the Postgres docs, and
kernel.shmmax = 8589934592 (8 GB)
Postgres is configured to use huge pages (huge_pages = 'on') with:
grep -i hugepages /proc/meminfo
AnonHugePages: 0 kB
HugePages_Total: 2300
HugePages_Free: 167
HugePages_Rsvd: 4
HugePages_Surp: 0
Hugepagesize: 2048 kB
What's the correct configuration of kernel.shmall?
According to the docs, kernel.shmall is expressed in memory pages (getconf PAGE_SIZE → 4096), but Hugepagesize is 2048 kB. The current configuration is kernel.shmall = 4096, which assumes a page size of 2048 kB, but I'm not sure that is correct. Can you help me understand this configuration and tell me if something is wrong? To set kernel.shmall, should I use a page size of 2048 kB or 4096 bytes? And what is the best configuration of these parameters?
Thanks
Since version 9.3, PostgreSQL uses POSIX shared memory rather than System V for its main shared memory segment (and only from v12 could you set shared_memory_type = sysv to change that), so the System V shared memory kernel parameters are irrelevant, and you don't have to tune them.
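As an aside, if you want to sanity-check the vm.nr_hugepages figure itself, the method from the PostgreSQL docs is to read the postmaster's peak virtual memory and divide by the huge page size; a sketch, assuming a default data-directory layout:
# First line of postmaster.pid is the postmaster PID
pid=$(head -1 $PGDATA/postmaster.pid)
grep ^VmPeak /proc/$pid/status        # e.g. "VmPeak: 4689212 kB"
echo $(( 4689212 / 2048 + 1 ))        # kB / 2048-kB huge pages -> ~2290; round up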

What's the best practice for improving the IOPS of a Ceph cluster?

I am currently building a Ceph cluster for a KVM platform, and its current performance is catastrophic; the figures are dreadful. I am not very familiar with physically distributed systems. Is there any general advice for improving the overall performance (i.e. latency, bandwidth, and IOPS)?
The hardware configuration is not optimal right now, but I would still like to realize the full potential of what I currently have:
1x 10Gbe Huawei switch
3x Rack server, with hardware configuration:
Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50GHz x2, 48 logical cores in total,
128GB DDR3 RAM
Intel 1.84T NVMe SSD x6 as data drives, with 1 OSD per disk (6 OSDs per server in total)
My current /etc/ceph/ceph.conf:
[global]
fsid = f2d6d3a7-0e61-4768-b3f5-b19dd2d8b657
mon initial members = ceph-node1, ceph-node2, ceph-node3
mon allow pool delete = true
mon host = 192.168.16.1, 192.168.16.2, 192.168.16.3
public network = 192.168.16.0/24
cluster network = 192.168.16.0/24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd pool default size = 3
osd pool default min size = 1
osd pool default pg num = 600
osd pool default pgp num = 600
osd memory target = 4294967296
max open files = 131072
[mon]
mon clock drift allowed = 1
mon osd min down reporters = 13
mon osd down out interval = 600
[osd]
osd journal size = 20000
osd max write size = 512
osd client message size cap = 2147483648
osd deep scrub stride = 131072
osd op threads = 16
osd disk threads = 4
osd map cache size = 1024
osd map cache bl size = 128
osd recovery op priority = 2
osd recovery max active = 10
osd max backfills = 4
osd min pg log entries = 30000
osd max pg log entries = 100000
osd mon heartbeat interval = 40
ms dispatch throttle bytes = 1048576000
objecter inflight ops = 819200
osd op log threshold = 50
osd crush chooseleaf type = 0
journal max write bytes = 1073714824
journal max write entries = 10000
journal queue max ops = 50000
journal queue max bytes = 10485760000
[client]
rbd cache = True
rbd cache size = 335544320
rbd cache max dirty = 134217728
rbd cache max dirty age = 30
rbd cache writethrough until flush = False
rbd cache max dirty object = 2
rbd cache target dirty = 235544320
IO benchmark is done by fio, with the configuration:
fio -ioengine=libaio -bs=4k -direct=1 -thread -rw=randread -size=100G -filename=/data/testfile -name="CEPH Test" -iodepth=8 -runtime=30
Benchmark result: (screenshot not reproduced here)
The benchmark was done on a separate machine, configured to connect to the cluster via the 10GbE switch, with only the MDS installed. The benchmark machine is identical to the other three that form the cluster, apart from the absence of the Intel NVMe SSD drives.
Any help is appreciated,
First, I must note that Ceph is not an acronym, it is short for Cephalopod, because tentacles.
That said, you have a number of settings in ceph.conf that surprise me, like the extreme number of osdmaps you're caching. The thread settings can be tricky, and vary in applicability between releases. Building pools with 600 PGs isn't great either: you generally want a power of 2 (see the example just below), and a per-OSD PG ratio that factors in drive type and other pools. Setting the mon clock skew allowance to a full second (vs. the default 50 ms) is downright alarming; with Chrony, or even the legacy ntpd, it's not hard to get sub-millisecond syncing.
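For the PG-count point, resetting an existing pool to a power of two looks like this (the pool name is a placeholder):
# Set placement group counts to a power of two (pool name hypothetical)
ceph osd pool set libvirt-pool pg_num 512
ceph osd pool set libvirt-pool pgp_num 512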
Three nodes may be limiting in the degree of parallelism / overlap clients can support, especially since you only have 6 drives per server. That's only 18 OSDs.
You have Filestore settings in there too; you aren't really using Filestore, are you? Or a Ceph release older than Nautilus?
Finally, as more of an actual answer to the question posed, one simple thing you can do is to split each NVMe drive into two OSDs, with appropriate pg_num and pgp_num settings for the pool:
ceph-volume lvm batch --osds-per-device 2
I assume you meant 3 server blades and not 3 racks.
What was your rough estimate of expected performance?
What is the performance profile of your disk hardware (outside Ceph) at 4K and 2MB? (A baseline fio sketch follows below.)
How many disks do you have in this pool, and what are the replication factor/strategy and object size?
On the client side you are performing small reads: 4K.
On the server side, depending on your read-ahead settings and object size, each of these 4K reads may pull in much more data in the background.
Did you check whether one of your disks is really at its limits and there is no network/CPU throttling?
You can partition your drives with LVM and use multiple OSDs per drive. Since you have so many cores per server, one OSD per drive does not make use of them.
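To get the raw-disk baseline asked about above, a read-only fio run against one drive outside Ceph at the two block sizes is enough (a sketch; /dev/nvme0n1 is an assumed device name):
# 4K random read and 2M sequential read against the raw device (reads only)
fio --name=rand4k --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 \
    --rw=randread --bs=4k --iodepth=32 --runtime=30 --time_based
fio --name=seq2m --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 \
    --rw=read --bs=2M --iodepth=8 --runtime=30 --time_based
Compare these per-drive numbers with what the cluster delivers to see how much overhead Ceph itself adds.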

PCIe Configuration Space vs ECAM

Is the PCIe ECAM exactly the same as the "PCI-Compatible Configuration Registers" only mapped to memory instead of I/O?
It seems to me that PCIe uses the same Configuration Mechanism as conventional PCI: [1]
| 31 | 30 - 24 | 23 - 16 | 15 - 11 | 10 - 8 | 7 - 2 | 1 - 0 |
| Enable | Reserved | Bus Nr | Device Nr | Function Nr | Register Nr | 00 |
But in PCIe you can use the reserved bits to address more registers of a function.
Is this correct?
In section 7.2.1 [2] the ECAM is defined as:
| 27 - 20 | 19 - 15 | 14 - 12 | 11 - 8 | 7 - 2 | 1 - 0 |
| Bus Nr | Dev Nr | Function Nr | Ext. Register Nr | Register Nr | Byte Enable |
It looks very similar to the conventional configuration; the reserved bits are just shifted into the register number, which they extend in PCIe.
But can I use it like the old one, only addressing it in memory space instead of I/O space?
[1] https://wiki.osdev.org/PCI#Configuration_Space_Access_Mechanism_.231
[2] in PCI Express Base Specification, Rev. 4.0 Version 1.0
You're mixing apples and oranges in your comparison. The first address decoding is provided by a host bridge component on PC-AT architecture systems (*). It's a way of using the Intel processor's I/O address space to interface to the PCI bus configuration space mechanism. It can also be used on a PCIe system, because the PCIe host bridge component provides the same interface to PCIe devices. However, everything below the host bridge is implemented quite differently between PCI and PCIe.
Meanwhile the second decoding scheme you showed can only be used within the memory-mapped block through which PCIe provides access to its extended configuration space. And only after that block has been mapped into the physical address space in a system-dependent way.
So while they have a similar function, no, you cannot use them in the same way. You can:
Access the first 256 bytes of any PCI or PCIe device's configuration space using the first mechanism, but you must use the first addressing scheme, OR
Access the entire extended configuration space of any PCIe device using the second mechanism (including the first 256 bytes), but then you must use the second addressing scheme.
(*) The "I/O space interface to PCI bus configuration via 0xCF8 / 0xCFC" really is part of the Intel / PC-AT architecture. Other system architectures (MIPS for example) don't have separate I/O address spaces, and host bridges designed for them have different mechanisms to generate PCIe configuration space accesses (or they simply use the memory-mapped mechanism directly).
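To make the two encodings concrete, here is the same function's register address computed both ways, following the two field layouts quoted in the question (bus/device/function numbers are arbitrary examples):
# Legacy mechanism: config-address dword written to 0xCF8 for
# bus 0, device 3, function 0, register offset 0x10 (enable bit 31 set)
printf '0x%08x\n' $(( (1 << 31) | (0 << 16) | (3 << 11) | (0 << 8) | 0x10 ))   # 0x80001810
# ECAM: byte offset into the memory-mapped region for the same function,
# register offset 0x110, which is reachable only via ECAM (above 0xFF)
printf '0x%08x\n' $(( (0 << 20) | (3 << 15) | (0 << 12) | 0x110 ))             # 0x00018110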

pgbouncer free_servers - how to increase them

The current settings of a pgbouncer server are shown below. What I don't understand is the 'free_servers' value reported by the show lists command when connecting to pgbouncer. Is it a (soft or hard) limit on the number of connections to the PostgreSQL databases used with this instance of pgbouncer?
configuration :
max_client_conn = 2048
default_pool_size = 1024
min_pool_size = 10
reserve_pool_size = 500
reserve_pool_timeout = 1
server_idle_timeout = 600
listen_backlog = 1024
show lists gives:
pgbouncer=# show lists ;
list | items
---------------+--------
databases | 6
pools | 3
free_clients | 185
used_clients | 15
free_servers | 70
used_servers | 30
It seems there is a limit of 30 + 70 = 100 servers, but I couldn't find it even after reviewing the configuration values with show config, and the documentation doesn't say which setting to change to increase free_servers.
pgbouncer version : 1.7.2
EDIT:
I've just discovered that, for a pool of 6 web servers configured to hit the same PG database, 3 of them can open 200 backend (server) connections, while the other 3 can only make and maintain 100 connections (as described in the first part). Yet the pgbouncer configuration file is exactly the same, the servers are cloned VMs, and the pgbouncer version is also the same.
So far I still haven't found any documentation on where this limitation comes from...
This data is just internal bookkeeping for PgBouncer.
Server information is stored in a list data structure that is pre-allocated up to a certain size, in this case 100 slots. used_servers = 30, free_servers = 70 means there are 30 slots currently in use and 70 slots free. PgBouncer automatically increases the size of the list when it is full, hence there is no configuration setting for this.
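To see where the actual connections stand (as opposed to slot bookkeeping), the admin console offers more granular views; the connection parameters below are assumptions, adjust them to your setup:
# Per-pool and per-server-connection detail on the pgbouncer admin console
psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c "SHOW POOLS;"
psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c "SHOW SERVERS;"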

C10k Tsung Gatling and PlayWS

I'm new to load testing, but I googled a lot and configured a test system on Amazon.
The system consists of a WebSocket server on the Play framework and some load-testing machines. I tried the load-testing tools Tsung and Gatling.
My testing scenario: I create >10k users, each of which connects to the server and starts sending one message per second. I tuned Linux to handle more than 100k connections. I played around with JAVA_OPTS for Gatling (added more memory and enabled ParallelGC). I played around with Akka on the server side to handle 100-300 dispatcher threads. I ordered a 36 vCPU machine with 60 GB RAM and a 10 Gbit channel for the server.
But the result was the same: Tsung and Gatling each send about 10k messages per second from one machine (I sent plain text messages < 160 bytes).
Can someone explain why I can't reach more than 10k concurrent users (at 1 message per second each), and what I am doing wrong?
actor {
default-dispatcher = {
fork-join-executor {
throughput = 1000
parallelism-factor = 36.0
parallelism-max = 154
}
}
}
JAVA_OPTS="-Xmx3800m -Xms3800m -Xmn2g -XX:+UseParallelGC -XX:ParallelGCThreads=20"
Linux configs
ulimit -n 999999
sudo vim /etc/sysctl.conf
# General gigabit tuning
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 4096 16777216
net.ipv4.tcp_wmem = 4096 4096 16777216
#
# # This gives the kernel more memory for TCP
# # which you need with many (100k+) open socket connections
net.ipv4.tcp_mem = 4096 65536 16777216
#
# # Backlog
net.core.netdev_max_backlog = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_syncookies = 1
# Controls IP packet forwarding
net.ipv4.ip_forward = 0
# Controls source route verification
net.ipv4.conf.default.rp_filter = 1
# Do not accept source routing
net.ipv4.conf.default.accept_source_route = 0
# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 0
# Controls whether core dumps will append the PID to the core filename.
# Useful for debugging multi-threaded applications.
kernel.core_uses_pid = 1
# Controls the use of TCP syncookies
net.ipv4.tcp_syncookies = 0
# Disable netfilter on bridges.
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0
# Controls the default maximum size of a message queue
kernel.msgmnb = 65535
# Controls the maximum size of a message, in bytes
kernel.msgmax = 65535
# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736
# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 4294967296
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_synack_retries = 5
net.ipv4.tcp_orphan_retries = 5
fs.file-max = 999999
net.ipv4.tcp_max_orphans = 819200
net.core.somaxconn = 65535
net.ipv4.tcp_congestion_control = cubic
I tested with 5 client machines and found that, with this configuration, the server side on the Play framework can handle 50k messages per second from 50k users. The problem is that one client machine can't send more than 10k messages per second from 10k users. Maybe someone knows of other load-testing tools that can send more than 10k mps from 10k users.