I am currently building a CEPH cluster for a KVM platform, and its performance so far has been catastrophic; the numbers are dreadful. I am not very familiar with physically distributed systems. Is there any general advice for improving the overall performance (i.e. latency, bandwidth and IOPS)?
The hardware configuration is not optimal right now, but I would still like to get the full potential out of what I currently have:
1x 10Gbe Huawei switch
3x Rack server, with hardware configuration:
Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50GHz x2, 48 logical cores in total
128GB DDR3 RAM
Intel 1.84TB NVMe SSD x6 as data drives, with 1 OSD per disk (6 OSDs per server in total)
My current /etc/ceph/ceph.conf:
[global]
fsid = f2d6d3a7-0e61-4768-b3f5-b19dd2d8b657
mon initial members = ceph-node1, ceph-node2, ceph-node3
mon allow pool delete = true
mon host = 192.168.16.1, 192.168.16.2, 192.168.16.3
public network = 192.168.16.0/24
cluster network = 192.168.16.0/24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd pool default size = 3
osd pool default min size = 1
osd pool default pg num = 600
osd pool default pgp num = 600
osd memory target = 4294967296
max open files = 131072
[mon]
mon clock drift allowed = 1
mon osd min down reporters = 13
mon osd down out interval = 600
[OSD]
osd journal size = 20000
osd max write size = 512
osd client message size cap = 2147483648
osd deep scrub stride = 131072
osd op threads = 16
osd disk threads = 4
osd map cache size = 1024
osd map cache bl size = 128
osd recovery op priority = 2
osd recovery max active = 10
osd max backfills = 4
osd min pg log entries = 30000
osd max pg log entries = 100000
osd mon heartbeat interval = 40
ms dispatch throttle bytes = 1048576000
objecter inflight ops = 819200
osd op log threshold = 50
osd crush chooseleaf type = 0
journal max write bytes = 1073714824
journal max write entries = 10000
journal queue max ops = 50000
journal queue max bytes = 10485760000
[Client]
rbd cache = True
rbd cache size = 335544320
rbd cache max dirty = 134217728
rbd cache max dirty age = 30
rbd cache writethrough until flush = False
rbd cache max dirty object = 2
rbd cache target dirty = 235544320
IO benchmark is done by fio, with the configuration:
fio -ioengine=libaio -bs=4k -direct=1 -thread -rw=randread -size=100G -filename=/data/testfile -name="CEPH Test" -iodepth=8 -runtime=30
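For comparison, an equivalent run directly against an RBD image with fio's rbd engine (if your fio build includes it) would look roughly like this; the pool, image and client names here are only placeholders, not my actual setup:
# 4K random reads straight against an RBD image, bypassing any guest filesystem
fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testimg --bs=4k --rw=randread --iodepth=8 --runtime=30 --time_based --name="CEPH RBD Test"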
Benchmark result screenshot:
[screenshot: the benchmark result]
The benchmark was done on a separate machine, configured to connect to the cluster via the 10GbE switch, with only the MDS installed. The benchmark machine is identical to the other three that form the cluster, apart from the absence of the Intel NVMe SSD drives.
Any help is appreciated,
First, I must note that Ceph is not an acronym; it is short for Cephalopod, because tentacles.
That said, you have a number of settings in ceph.conf that surprise me, like the extreme number of osdmaps you're caching. The thread settings can be tricky, and vary in applicability between releases. Building pools with 600 PGs isn't great: you generally want a power of 2, and a per-OSD target ratio that factors in drive type and other pools. Setting the mon clock skew allowance to a full second (vs. the default 50ms) is downright alarming; with Chrony or even the legacy ntpd it's not hard to get sub-millisecond syncing.
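A couple of quick checks for those two points (this assumes Chrony is what keeps time on the mon hosts):
# show pg_num / pgp_num actually in use for every pool
ceph osd pool ls detail
# on each mon host, confirm how far the clock is off and how well it is syncing
chronyc tracking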
Three nodes may limit the degree of parallelism / overlap clients can achieve, especially since you only have 6 drives per server. That's only 18 OSDs.
You have Filestore settings in there too; you aren't really using Filestore, are you? Or a Ceph release older than Nautilus?
Finally, as more of an actual answer to the question posed, one simple thing you can do is to split each NVMe drive into two OSDs -- with appropriate pgp_num and pg_num settings for the pool.
ceph-volume lvm batch --osds-per-device 2 <device1> <device2> ...
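If you do split the drives, the pool's PG counts should be raised to match the doubled OSD count; a sketch, with the pool name and target count as placeholders (pick a power of 2 appropriate for 36 OSDs and your replica size):
ceph osd pool set <pool> pg_num 512
ceph osd pool set <pool> pgp_num 512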
I assume you meant 3 server blades and not 3 racks.
What was your rough estimate of expected performance?
What is the performance profile of your disk hardware (outside Ceph) at 4K and 2MB? (See the fio sketch below.)
How many disks do you have in this pool, what is the replication factor/strategy, and what is the object size?
On the client side you are performing small reads: 4K.
On the server side, depending on your read-ahead settings and object size, each of these 4K reads may pull in much more data in the background.
Did you check whether one of your disks is really at its limits and that there is no network/CPU throttling?
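For the raw-disk profile and the saturation check, something along these lines should do (the device name is just an example, and the read-only workloads are non-destructive; iostat/sar come from the sysstat package):
# raw 4K random read and 2MB sequential read on one NVMe device, outside Ceph
fio --name=raw4k --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 --rw=randread --bs=4k --iodepth=32 --runtime=30 --time_based
fio --name=raw2m --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 --rw=read --bs=2M --iodepth=8 --runtime=30 --time_based
# while the Ceph benchmark runs, watch per-disk utilisation and NIC throughput
iostat -x 1
sar -n DEV 1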
You can partition your drives with LVM and use multiple OSDs per drive. Since you have so many cores per server, one OSD per drive does not make full use of them.
We have set up a three member replica set consisting of the following, all on MongoDB version 3.4:
Primary. Physical local server, Windows Server 2012, 64 GB RAM, 6 cores. Hosted in Scandinavia.
Secondary. Amazon EC2, Windows Server 2016, m4.2xlarge, 32 GB RAM, 8 vCPUs. Hosted in Germany.
Arbiter. Tiny cloud based Linux instance.
The problem we are seeing is that the secondary is unable to keep up with the primary. As we seed it with data (copied over from the primary) and add it to the replica set, it typically manages to get in sync, but an hour later it might lag behind by 10 minutes; a few hours later, it's an hour behind, and so on, until a day or two later, it goes stale.
We are trying to figure out why this is. The primary is consistently using 0-1% CPU, while the secondary is under constant heavy load at 20-80% CPU. This seems to be the only potential resource constraint. Disk and network load do not seem to be an issue. There seems to be some locking going on on the secondary, as operations in the mongo shell (such as db.getReplicationInfo()) frequently take 5 minutes or more to complete, and mongostat rarely works (it just says i/o timeout). Here is output from mongostat during a rare instance when it reported stats for the secondary:
host insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
localhost:27017 *0 33 743 *0 0 166|0 1.0% 78.7% 0 27.9G 27.0G 0|0 0|1 2.33m 337k 739 rs PRI Mar 27 14:41:54.578
primary.XXX.com:27017 *0 36 825 *0 0 131|0 1.0% 78.7% 0 27.9G 27.0G 0|0 0|0 1.73m 322k 739 rs PRI Mar 27 14:41:53.614
secondary.XXX.com:27017 *0 *0 *0 *0 0 109|0 4.3% 80.0% 0 8.69G 7.54G 0|0 0|10 6.69k 134k 592 rs SEC Mar 27 14:41:53.673
I ran db.serverStatus() on the secondary, and compared to the primary, and one number that stood out was the following:
"locks" : {"Global" : {"timeAcquiringMicros" : {"r" : NumberLong("21188001783")
The secondary had an uptime of 14,000 seconds at the time, so that is roughly 21,000 seconds spent waiting to acquire the global read lock, summed across threads, in under four hours.
Would appreciate any ideas on what this could be, or how to debug this issue! We could upgrade the Amazon instance to something beefier, but we've done that three times already, and at this point we figure that something else must be wrong.
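In case it helps anyone reproduce the lag measurement: the standard shell helper reports how far the secondary is behind, run from any machine with the mongo shell (the host name below matches the mongostat output above):
# print, per secondary, how many seconds it is behind the primary's oplog
mongo primary.XXX.com:27017 --eval "rs.printSlaveReplicationInfo()"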
I'll include output from db.currentOp() on the secondary below, in case it helps. (That command took 5 minutes to run, after which the following was logged: Restarting oplog query due to error: CursorNotFound: Cursor not found, cursor id: 15728290121. Last fetched optime (with hash): { ts: Timestamp 1490613628000|756, t: 48 }[-5363878314895774690]. Restarts remaining: 3)
"desc":"conn605",
"connectionId":605,"client":"127.0.0.1:61098",
"appName":"MongoDB Shell",
"secs_running":0,
"microsecs_running":NumberLong(16),
"op":"command",
"ns":"admin.$cmd",
"query":{"currentOp":1},
"locks":{},
"waitingForLock":false,
"lockStats":{}
"desc":"repl writer worker 10",
"secs_running":0,
"microsecs_running":NumberLong(14046),
"op":"none",
"ns":"CustomerDB.ed2112ec779f",
"locks":{"Global":"W","Database":"W"},
"waitingForLock":false,
"lockStats":{"Global":{"acquireCount":{"w":NumberLong(1),"W":NumberLong(1)}},"Database":{"acquireCount":{"W":NumberLong(1)}}}
"desc":"ApplyBatchFinalizerForJournal",
"op":"none",
"ns":"",
"locks":{},
"waitingForLock":false,
"lockStats":{}
"desc":"ReplBatcher",
"secs_running":11545,
"microsecs_running":NumberLong("11545663961"),
"op":"none",
"ns":"local.oplog.rs",
"locks":{},
"waitingForLock":false,
"lockStats":{"Global":{"acquireCount":{"r":NumberLong(2)}},"Database":{"acquireCount":{"r":NumberLong(1)}},"oplog":{"acquireCount":{"r":NumberLong(1)}}}
"desc":"rsBackgroundSync",
"secs_running":11545,
"microsecs_running":NumberLong("11545281690"),
"op":"none",
"ns":"local.replset.minvalid",
"locks":{},
"waitingForLock":false,
"lockStats":{"Global":{"acquireCount":{"r":NumberLong(5),"w":NumberLong(1)}},"Database":{"acquireCount":{"r":NumberLong(2),"W":NumberLong(1)}},"Collection":{"acquireCount":{"r":NumberLong(2)}}}
"desc":"TTLMonitor",
"op":"none",
"ns":"",
"locks":{"Global":"r"},
"waitingForLock":true,
"lockStats":{"Global":{"acquireCount":{"r":NumberLong(35)},"acquireWaitCount":{"r":NumberLong(2)},"timeAcquiringMicros":{"r":NumberLong(341534123)}},"Database":{"acquireCount":{"r":NumberLong(17)}},"Collection":{"acquireCount":{"r":NumberLong(17)}}}
"desc":"SyncSourceFeedback",
"op":"none",
"ns":"",
"locks":{},
"waitingForLock":false,
"lockStats":{}
"desc":"WT RecordStoreThread: local.oplog.rs",
"secs_running":1163,
"microsecs_running":NumberLong(1163137036),
"op":"none",
"ns":"local.oplog.rs",
"locks":{},
"waitingForLock":false,
"lockStats":{"Global":{"acquireCount":{"r":NumberLong(1),"w":NumberLong(1)}},"Database":{"acquireCount":{"w":NumberLong(1)}},"oplog":{"acquireCount":{"w":NumberLong(1)}}}
"desc":"rsSync",
"secs_running":11545,
"microsecs_running":NumberLong("11545663926"),
"op":"none",
"ns":"local.replset.minvalid",
"locks":{"Global":"W"},
"waitingForLock":false,
"lockStats":{"Global":{"acquireCount":{"r":NumberLong(272095),"w":NumberLong(298255),"R":NumberLong(1),"W":NumberLong(74564)},"acquireWaitCount":{"W":NumberLong(3293)},"timeAcquiringMicros":{"W":NumberLong(17685)}},"Database":{"acquireCount":{"r":NumberLong(197529),"W":NumberLong(298255)},"acquireWaitCount":{"W":NumberLong(146)},"timeAcquiringMicros":{"W":NumberLong(651947)}},"Collection":{"acquireCount":{"r":NumberLong(2)}}}
"desc":"clientcursormon",
"secs_running":0,
"microsecs_running":NumberLong(15649),
"op":"none",
"ns":"CustomerDB.b72ac80177ef",
"locks":{"Global":"r"},
"waitingForLock":true,
"lockStats":{"Global":{"acquireCount":{"r":NumberLong(387)},"acquireWaitCount":{"r":NumberLong(2)},"timeAcquiringMicros":{"r":NumberLong(397538606)}},"Database":{"acquireCount":{"r":NumberLong(193)}},"Collection":{"acquireCount":{"r":NumberLong(193)}}}}],"ok":1}
JJussi was exactly right (thanks!). The problem was that the active data set is bigger than the available memory, and we were using Amazon EBS "Throughput Optimized HDD" volumes. We changed these to "General Purpose SSD" and the problem immediately went away. We were even able to downgrade the server from m4.2xlarge to m4.large.
We were confused by the fact that this manifested as high CPU load. We thought disk load was not an issue based on the fairly low amount of data written to disk per second. But when we tried making the AWS server the primary instance, we noticed a very strong correlation between high CPU load and disk queue length. Further testing on the disk showed that it performed very poorly for the kind of I/O pattern MongoDB generates.
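For anyone in the same situation, the volume type swap can be done in place via Elastic Volumes; a rough sketch, with a made-up volume ID:
# switch the EBS volume from Throughput Optimized HDD (st1) to General Purpose SSD (gp2)
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --volume-type gp2
# track the modification until it reaches the "completed" state
aws ec2 describe-volumes-modifications --volume-ids vol-0123456789abcdef0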
I have 2 Raspberry Pis that I wanted to benchmark for load balancing purposes.
Raspberry pi Model B v1.1 - running Raspbian Jessie
Raspberry pi Model B+ v1.2 - running Raspbian Jessie
I installed sysbench on both systems and ran sysbench --num-threads=1 --test=cpu --cpu-max-prime=10000 --validate run on the first, changed it to --num-threads=4 on the second (as it's a quad core), and ran both.
The results are not at all what I expected (I obviously expected the multithreaded benchmark to severely outperform the single-threaded one). When I ran the command with a single thread, performance was about the same on both systems. But when I changed the number of threads to 4 on the second Pi, it still took the same amount of time, except that the per-request statistics showed that the average request took about 4 times as long. I can't seem to grasp why this is.
Here are the results:
Raspberry pi v1.1
Single thread
Maximum prime number checked in CPU test: 20000
Test execution summary:
total time: 1325.0229s
total number of events: 10000
total time taken by event execution: 1324.9665
per-request statistics:
min: 131.00ms
avg: 132.50ms
max: 171.58ms
approx. 95 percentile: 137.39ms
Threads fairness:
events (avg/stddev): 10000.0000/0.00
execution time (avg/stddev): 1324.9665/0.00
Raspberry pi v1.2
Four threads
Maximum prime number checked in CPU test: 20000
Test execution summary:
total time: 1321.0618s
total number of events: 10000
total time taken by event execution: 5283.8876
per-request statistics:
min: 486.45ms
avg: 528.39ms
max: 591.60ms
approx. 95 percentile: 553.98ms
Threads fairness:
events (avg/stddev): 2500.0000/0.00
execution time (avg/stddev): 1320.9719/0.03
"Raspberry pi Model B+ v1.2" has the same CPU as "Raspberry pi Model B v1.1". Both boards are from the first generation of Raspberry Pi and they have 1 core CPU.
For 4 CPU you need Raspberry Pi 2 Model B instead of Raspberry pi Model B+.
Yeah, the naming is a bit confusing :(
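A quick way to confirm how many cores a board actually exposes before choosing --num-threads (both commands are standard on Raspbian):
# count the CPUs the kernel sees
nproc
grep -c ^processor /proc/cpuinfo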
Just a quick question - apologies if it's been asked before; I couldn't find it.
We are using asynchronous streaming replication with postgres and have noticed that the disk usage for the database can vary between the master and the replica, even though the databases appear to be synchronising correctly.
At the moment the discrepancy is quite small, but it has on occasion been in the region of several GB.
At present, this is the sync status:
Master:
master=# SELECT pg_current_xlog_location();
pg_current_xlog_location
--------------------------
35C/F142C98
(1 row)
Slave:
slave=# select pg_last_xlog_receive_location();
pg_last_xlog_receive_location
-------------------------------
35C/F142C98
(1 row)
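(To express the sync status in bytes rather than eyeballing the xlog positions, the slave's receive location can be diffed against the master's current location; for example, on the master, using the value the slave reported above:)
psql -c "SELECT pg_xlog_location_diff(pg_current_xlog_location(), '35C/F142C98');"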
The disk usage is as follows. Again, I realise the discrepancy is currently quite small (~1.5GiB), but yesterday it was several GB.
Master
-bash-4.1$ df -m /var/lib/pgsql/
Filesystem 1M-blocks Used Available Use% Mounted on
/dev/mapper/lv_pgsql
401158 302898 98261 76% /var/lib/pgsql
Slave:
Filesystem 1M-blocks Used Available Use% Mounted on
/dev/mapper/lv_pgsql
401158 301263 99895 76% /var/lib/pgsql
I should clarify that the archive command is set to archive to a different partition on the master. I guess what I'm asking is:
- Is the current disk usage discrepancy normal?
- How can it be explained?
- How much of a discrepancy should there be before I get worried?
Thanks in advance for any help.
edit - I'm specifically interested in the discrepancy within the "data/base/*" directories which house the actual db content, as follows:
Master:
7 /var/lib/pgsql/master/9.3/data/base/1
7 /var/lib/pgsql/master/9.3/data/base/12891
7 /var/lib/pgsql/master/9.3/data/base/12896
57904 /var/lib/pgsql/master/9.3/data/base/16385
180 /var/lib/pgsql/master/9.3/data/base/16387
11 /var/lib/pgsql/master/9.3/data/base/16389
203588 /var/lib/pgsql/master/9.3/data/base/48448446
7 /var/lib/pgsql/master/9.3/data/base/534138292
1 /var/lib/pgsql/master/9.3/data/base/pgsql_tmp
Slave:
7 /var/lib/pgsql/slave/9.3/data/base/1
7 /var/lib/pgsql/slave/9.3/data/base/12891
7 /var/lib/pgsql/slave/9.3/data/base/12896
57634 /var/lib/pgsql/slave/9.3/data/base/16385
180 /var/lib/pgsql/slave/9.3/data/base/16387
10 /var/lib/pgsql/slave/9.3/data/base/16389
202945 /var/lib/pgsql/slave/9.3/data/base/48448446
7 /var/lib/pgsql/slave/9.3/data/base/534138292
1 /var/lib/pgsql/slave/9.3/data/base/pgsql_tmp
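For what it's worth, the directory names under data/base are database OIDs, so the same comparison can be made from inside Postgres on both sides (the hot standby accepts read-only queries):
psql -c "SELECT oid, datname, pg_size_pretty(pg_database_size(oid)) FROM pg_database ORDER BY pg_database_size(oid) DESC;"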