MongoDB secondary completely unable to keep up

We have set up a three member replica set consisting of the following, all on MongoDB version 3.4:
Primary. Physical local server, Windows Server 2012, 64 GB RAM, 6 cores. Hosted in Scandinavia.
Secondary. Amazon EC2, Windows Server 2016, m4.2xlarge, 32 GB RAM, 8 vCPUs. Hosted in Germany.
Arbiter. Tiny cloud based Linux instance.
The problem we are seeing is that the secondary is unable to keep up with the primary. As we seed it with data (copy over from the primary) and add it to the replica set, it typically manages to get in sync, but an hour later it might lag behind by 10 minutes; a few hours later, it's an hour behind, and so on, until a day or two later, it goes stale.
We are trying to figure out why this is. The primary consistently sits at 0-1% CPU, while the secondary is under constant heavy load at 20-80% CPU. This seems to be the only potential resource constraint; disk and network load do not seem to be an issue. There also seems to be some locking going on on the secondary, as operations in the mongo shell (such as db.getReplicationInfo()) frequently take 5 minutes or more to complete, and mongostat rarely works (it just says i/o timeout). Here is output from mongostat during a rare instance when it reported stats for the secondary:
host insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
localhost:27017 *0 33 743 *0 0 166|0 1.0% 78.7% 0 27.9G 27.0G 0|0 0|1 2.33m 337k 739 rs PRI Mar 27 14:41:54.578
primary.XXX.com:27017 *0 36 825 *0 0 131|0 1.0% 78.7% 0 27.9G 27.0G 0|0 0|0 1.73m 322k 739 rs PRI Mar 27 14:41:53.614
secondary.XXX.com:27017 *0 *0 *0 *0 0 109|0 4.3% 80.0% 0 8.69G 7.54G 0|0 0|10 6.69k 134k 592 rs SEC Mar 27 14:41:53.673
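For completeness, the lag and the oplog window can also be checked from the mongo shell with the standard 3.4 helpers; a minimal sketch (nothing here is specific to our setup):

// How far each secondary is behind the primary (run on the primary or a secondary).
rs.printSlaveReplicationInfo()

// Oplog size and the time window it covers (run on the primary).
db.getReplicationInfo()

// Raw per-member optimes, useful when the helpers above hang.
rs.status().members.forEach(function (m) {
    print(m.name + "  " + m.stateStr + "  " + m.optimeDate);
})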
I ran db.serverStatus() on the secondary and compared the output with the primary's; one number that stood out was the following:
"locks" : {"Global" : {"timeAcquiringMicros" : {"r" : NumberLong("21188001783")
The secondary had an uptime of 14000 seconds at the time.
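For context on that number: timeAcquiringMicros is cumulative across all threads since startup, so it can exceed wall-clock uptime. A minimal sketch of the comparison, using the same db.serverStatus() field paths as above:

// Cumulative time spent waiting to acquire the global lock in "r" mode,
// next to the server's uptime. The counter sums over all waiting threads,
// so it can be (much) larger than the uptime itself.
var s = db.serverStatus();
print("uptime (s): " + s.uptime);
print("Global r-lock timeAcquiringMicros: " + s.locks.Global.timeAcquiringMicros.r);
// On our secondary: 21188001783 us ~= 21188 s of cumulative waiting
// during ~14000 s of uptime.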
Would appreciate any ideas on what this could be, or how to debug this issue! We could upgrade the Amazon instance to something beefier, but we've done that three times already, and at this point we figure that something else must be wrong.
I'll include output from db.currentOp() on the secondary below, in case it helps. (That command took 5 minutes to run, after which the following was logged: Restarting oplog query due to error: CursorNotFound: Cursor not found, cursor id: 15728290121. Last fetched optime (with hash): { ts: Timestamp 1490613628000|756, t: 48 }[-5363878314895774690]. Restarts remaining: 3)
"desc":"conn605",
"connectionId":605,"client":"127.0.0.1:61098",
"appName":"MongoDB Shell",
"secs_running":0,
"microsecs_running":NumberLong(16),
"op":"command",
"ns":"admin.$cmd",
"query":{"currentOp":1},
"locks":{},
"waitingForLock":false,
"lockStats":{}
"desc":"repl writer worker 10",
"secs_running":0,
"microsecs_running":NumberLong(14046),
"op":"none",
"ns":"CustomerDB.ed2112ec779f",
"locks":{"Global":"W","Database":"W"},
"waitingForLock":false,
"lockStats":{"Global":{"acquireCount":{"w":NumberLong(1),"W":NumberLong(1)}},"Database":{"acquireCount":{"W":NumberLong(1)}}}
"desc":"ApplyBatchFinalizerForJournal",
"op":"none",
"ns":"",
"locks":{},
"waitingForLock":false,
"lockStats":{}
"desc":"ReplBatcher",
"secs_running":11545,
"microsecs_running":NumberLong("11545663961"),
"op":"none",
"ns":"local.oplog.rs",
"locks":{},
"waitingForLock":false,
"lockStats":{"Global":{"acquireCount":{"r":NumberLong(2)}},"Database":{"acquireCount":{"r":NumberLong(1)}},"oplog":{"acquireCount":{"r":NumberLong(1)}}}
"desc":"rsBackgroundSync",
"secs_running":11545,
"microsecs_running":NumberLong("11545281690"),
"op":"none",
"ns":"local.replset.minvalid",
"locks":{},
"waitingForLock":false,
"lockStats":{"Global":{"acquireCount":{"r":NumberLong(5),"w":NumberLong(1)}},"Database":{"acquireCount":{"r":NumberLong(2),"W":NumberLong(1)}},"Collection":{"acquireCount":{"r":NumberLong(2)}}}
"desc":"TTLMonitor",
"op":"none",
"ns":"",
"locks":{"Global":"r"},
"waitingForLock":true,
"lockStats":{"Global":{"acquireCount":{"r":NumberLong(35)},"acquireWaitCount":{"r":NumberLong(2)},"timeAcquiringMicros":{"r":NumberLong(341534123)}},"Database":{"acquireCount":{"r":NumberLong(17)}},"Collection":{"acquireCount":{"r":NumberLong(17)}}}
"desc":"SyncSourceFeedback",
"op":"none",
"ns":"",
"locks":{},
"waitingForLock":false,
"lockStats":{}
"desc":"WT RecordStoreThread: local.oplog.rs",
"secs_running":1163,
"microsecs_running":NumberLong(1163137036),
"op":"none",
"ns":"local.oplog.rs",
"locks":{},
"waitingForLock":false,
"lockStats":{"Global":{"acquireCount":{"r":NumberLong(1),"w":NumberLong(1)}},"Database":{"acquireCount":{"w":NumberLong(1)}},"oplog":{"acquireCount":{"w":NumberLong(1)}}}
"desc":"rsSync",
"secs_running":11545,
"microsecs_running":NumberLong("11545663926"),
"op":"none",
"ns":"local.replset.minvalid",
"locks":{"Global":"W"},
"waitingForLock":false,
"lockStats":{"Global":{"acquireCount":{"r":NumberLong(272095),"w":NumberLong(298255),"R":NumberLong(1),"W":NumberLong(74564)},"acquireWaitCount":{"W":NumberLong(3293)},"timeAcquiringMicros":{"W":NumberLong(17685)}},"Database":{"acquireCount":{"r":NumberLong(197529),"W":NumberLong(298255)},"acquireWaitCount":{"W":NumberLong(146)},"timeAcquiringMicros":{"W":NumberLong(651947)}},"Collection":{"acquireCount":{"r":NumberLong(2)}}}
"desc":"clientcursormon",
"secs_running":0,
"microsecs_running":NumberLong(15649),
"op":"none",
"ns":"CustomerDB.b72ac80177ef",
"locks":{"Global":"r"},
"waitingForLock":true,
"lockStats":{"Global":{"acquireCount":{"r":NumberLong(387)},"acquireWaitCount":{"r":NumberLong(2)},"timeAcquiringMicros":{"r":NumberLong(397538606)}},"Database":{"acquireCount":{"r":NumberLong(193)}},"Collection":{"acquireCount":{"r":NumberLong(193)}}}}],"ok":1}

JJussi was exactly right (thanks!). The problem was that the active data set is bigger than the available memory, and we were using an Amazon EBS "Throughput Optimized HDD" (st1) volume. We changed this to "General Purpose SSD" (gp2) and the problem went away immediately. We were even able to downgrade the server from m4.2xlarge to m4.large.
We were confused by the fact that this manifested as high CPU load. We thought disk load was not an issue based on the fairly low amount of data written to disk per second. But when we tried making the AWS server the primary instance, we noticed a very strong correlation between high CPU load and disk queue length. Further testing on the disk showed that it performed very badly for the kind of I/O pattern MongoDB generates.
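For anyone hitting the same symptoms: one way to sanity-check the "active data bigger than available memory" theory before switching volume types is to look at the WiredTiger cache counters in db.serverStatus(). A minimal sketch (statistic names as WiredTiger reports them; they can vary slightly between versions):

// A cache pinned near its configured maximum while pages are constantly
// being read in and evicted suggests the working set does not fit in RAM.
var wt = db.serverStatus().wiredTiger.cache;
print("cache max bytes      : " + wt["maximum bytes configured"]);
print("cache current bytes  : " + wt["bytes currently in the cache"]);
print("pages read into cache: " + wt["pages read into cache"]);
print("pages evicted (clean): " + wt["unmodified pages evicted"]);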

Related

Slow server caused by mongodb instance

I am seeing extremely high MongoDB usage. top shows mongod using 756% of the CPU and the load average is at 4:
22527 root 20 0 0.232t 0.024t 0.023t S 756.2 19.5 240695:16 mongod
I checked the MongoDB logs and found that every query is taking more than 200 ms to execute, which is causing the high resource usage and the speed issue.
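One way to pin down which operations are responsible is the database profiler; a minimal sketch (the 200 ms threshold just mirrors what the logs already show, adjust as needed):

// Record every operation slower than 200 ms for the current database.
db.setProfilingLevel(1, 200)

// After letting it run for a while, look at the worst offenders.
db.system.profile.find().sort({ millis: -1 }).limit(5).pretty()

// Turn profiling off again when done (it adds overhead of its own).
db.setProfilingLevel(0)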

Openstack Swift performance issue

We are having a problem with our Swift cluster, running Swift version 1.8.0.
The cluster is built up from 3 storage nodes + a proxy node, and we use 2x replication. Each node sports a single 2 TB SATA HDD; the OS is running on an SSD.
The traffic is ~300 files of 1.3 MB per minute (the files are all the same size). Each file is uploaded with an X-expire-after value equivalent to 7 days.
When we started the cluster around 3 months ago we uploaded significantly fewer files (~150/min) and everything was working fine. As we have put more pressure on the system, at one point the object expirer couldn't expire the files as fast as they were being uploaded, slowly filling up the servers.
After our analysis we found the following:
It's not a network issue: the interfaces are not overloaded and we don't have an extreme number of open connections
It's not a CPU issue, loads are fine
It doesn't seem to be a RAM issue, we have ~20G free of 64G
The bottleneck seems to be the disk, iostat is quite revealing:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdc 0.00 57.00 0.00 520.00 0.00 3113.00 11.97 149.18 286.21 0.00 286.21 1.92 100.00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdc 2.00 44.00 7.00 488.00 924.00 2973.00 15.75 146.27 296.61 778.29 289.70 2.02 100.00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdc 0.00 3.00 60.00 226.00 5136.00 2659.50 54.51 35.04 168.46 49.13 200.14 3.50 100.00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdc 0.00 0.00 110.00 91.00 9164.00 2247.50 113.55 2.98 14.51 24.07 2.95 4.98 100.00
The read and write wait times are not always even that good :); they can go up into the thousands of milliseconds, which is pretty dreadful.
We're also seeing many ConnectionTimeout messages from the node side and in the proxy.
Some examples from the storage nodes:
Jul 17 13:28:51 compute005 object-server ERROR container update failed with 10.100.100.149:6001/sdf (saving for async update later): Timeout (3s) (txn: tx70549d8ee9a04f74a60d69842634deb)
Jul 17 13:34:03 compute005 swift ERROR with Object server 10.100.100.153:6000/sdc re: Trying to DELETE /AUTH_698845ea71b0e860bbfc771ad3dade1/container/whatever.file: Timeout (10s) (txn: tx11c34840f5cd42fdad123887e26asdae)
Jul 17 12:45:55 compute005 container-replicator ERROR reading HTTP response from {'zone': 7, 'weight': 2000.0, 'ip': '10.100.100.153', 'region': 1, 'port': 6001, 'meta': '', 'device': 'sdc', 'id': 1}: Timeout (10s)
And also from the proxy:
Jul 17 14:37:53 controller proxy-server ERROR with Object server 10.100.100.149:6000/sdf re: Trying to get final status of PUT to /v1/AUTH_6988e698bc17460bbfc74ea20fdcde1/container/whatever.file: Timeout (10s) (txn: txb114c84404194f5a84cb34a0ff74e273)
Jul 17 12:32:43 controller proxy-server ERROR with Object server 10.100.100.153:6000/sdc re: Expect: 100-continue on /AUTH_6988e698bc17460bbf71ff210e8acde1/container/whatever.file: ConnectionTimeout (0.5s) (txn: txd8d6ac5abfa34573a6dc3c3be71e454f)
If all the services pushing to swift and the object-expirer are stopped, the disk utilization stays at 100% for most of the time. There are no async_pending transactions, but there is a lot of rsyncing going on, probably coming from the object-replicator.
If all are turned on, there are 30-50 or even more async_pending transactions at almost any given moment in time.
We have thought about different solutions to mitigate the problem; basically, this is the outcome:
SSDs for storage are too expensive, so won't happen
Putting in another HDD paired with each existing one, in RAID 0 (we have replication in Swift anyway)
Using some caching, like bcache or flashcache
Do any of you have experience with this kind of problem?
Any hints/other places to look for the root cause?
Is there a possibility to optimize the expirer/replicator performance?
If any additional info is required, just let me know.
Thanks
I've seen issues where containers with >1 million objects cause timeouts (due to the SQLite DB not being able to get a lock)... can you verify your containers' object counts?

Restore data from Postgres data files

We've got a system with a broken Postgres (a failed RAID is the reason), without any backups.
We are trying to move the data to another computer with Postgres (and finally make a backup).
But whenever I set up the data directory and run Postgres, I get this message:
GET FATAL: database files are incompatible with server
2012-08-15 19:58:38 GET DETAIL: The database cluster was initialized with BLCKSZ 16777216, but the server was compiled with BLCKSZ 8192.
2012-08-15 19:58:38 GET HINT: It looks like you need to recompile or initdb.
16777216 is a very strange number (2 to the power of 24; far too big).
However, I can't change the default value of 8192 when compiling (playing with --with-blocksize= has no effect, and I can't find BLCKSZ in the header files).
Is there any way to extract the data?
This is the environment and the circumstances:
harddrive: RAID 1 with 3 SAS disks in array
OS: ubuntu 10.04.04 amd64
Postgres: 9.1 (installed via apt-get; we changed the repository links to a later Ubuntu release)
The system broke down; after some time we got:
AAC: Host Adapter BLINK LED 0x56
AACO: Adapter kernel panic'd 56
(filesystem or hardware error)
Somehow we got hold of the data directory. pg_controldata showed:
pg_control version number: 903
Catalog version number: 201105231
Database system identifier: 5714530593695276911
Database cluster state: shut down
pg_control last modified: Tue 15 Aug 2012 11:50:50
Latest checkpoint location: 1B595668/2000020
Prior checkpoint location: 0/0
Latest checkpoint's REDO location: 1B595668/2000020
Latest checkpoint's TimeLineID: 1
Latest checkpoint's NextXID: 0/4057946
Latest checkpoint's NextOID: 40960
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 670
Latest checkpoint's oldestXID's DB: 1344846103
Latest checkpoint's oldestActiveXID: 0
Time of latest checkpoint: Tue 15 Aug 2012 11:50:50
Minimum recovery ending location: 0/0
Backup start location: 0/0
Current wal_level setting: minimal
Current max_connections setting: 100
Current max_prepared_xacts setting: 0
Current max_locks_per_xact setting: 64
Maximum data alignment: 8
Database block size: 16777216
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 2387576020
Maximum size of a TOAST chunk: 0
Date/time type storage: floating-point numbers
Float4 argument passing: by reference
Float8 argument passing: by reference
First I tried to bring the DB up on an Ubuntu server (hard disk: a plain SATA drive, Ubuntu 10.04 i386, Postgres 9.1) and got the same error as above (about BLCKSZ).
That's why I deployed Ubuntu 10.04 amd64 with an English-locale Postgres 9.1 in a virtual machine (because in the previous step the error logs showed '?' instead of Russian characters).
I got the same error (about BLCKSZ).
After that I removed the apt-get Postgres version and compiled it myself as described in the docs: http://www.postgresql.org/docs/9.1/static/installation.html.
Playing with configure --with-blocksize=BLOCKSIZE had no effect; I got the same error.
Sorry for the post.
The pg_control file had been broken by some of our manipulations with it.
So, the cluster was successfully restored by pg_resetxlog with the initial data.
A block size of 16 MB would be really weird, and since these two values also look completely bogus:
Maximum columns in an index: 2387576020
Maximum size of a TOAST chunk: 0
...you might want to question the integrity of this data before spending time on compiling postgres with a non-standard block size.
If you look at the sizes of the files corresponding to relations, are they multiples of 16 MB or of 8 kB?
If the database has some multi-gigabyte tables, what appears to be the cut-off size on disk (the size above which Postgres splits the data into several files)? This should be equal to the data block size multiplied by the blocks per segment of a large relation. On a default install that's 8192 bytes × 131072 = 1 GB.
See here for details on configuring kernel resources. Perhaps the default/current settings for this new OS won't allow the postmaster to start.
Here are details on the meaning and context of the BLCKSZ parameter. Was the system that failed running a 64-bit build of PostgreSQL while the new system is a 32-bit build? If possible, obtaining version information on the failed system's PostgreSQL could shed light on the problem. Let us know what version, build, and OS were used. Was it a custom build?

Understanding results of mongostat

I am trying to understand the results of mongostat:
example
insert query update delete getmore command flushes mapped vsize res faults locked % idx
0 2 4 0 0 10 0 976m 2.21g 643m 0 0.1 0
0 1 0 0 0 4 0 976m 2.21g 643m 0 0 0
0 0 0 0 0 1 0 976m 2.21g 643m 0 0 0
I see
mapped - 976m
vsize - 2.21g
res - 643m
res - RAM, so ~650MB of my database is in RAM
mapped - total size of database (via memory mapped files)
vsize - ???
Not sure why vsize is important or what exactly it means in this context. I'm running an m1.large, so I have about 400 GB of disk space + 8 GB of RAM.
Can someone help me out here and explain:
whether I am on the right page
what stats I should monitor in production
This should give you enough information
mapped - amount of data mmaped (total data size) megabytes
vsize - virtual size of process in megabytes
res - resident size of process in megabytes
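These three columns map directly onto the mem section of db.serverStatus(); a minimal sketch of reading the same numbers in the shell (mapped is an MMAPv1-era field and, like the others, is reported in megabytes):

// res -> mem.resident, vsize -> mem.virtual, mapped -> mem.mapped (all in MB)
var mem = db.serverStatus().mem;
print("resident MB: " + mem.resident);
print("virtual  MB: " + mem.virtual);
print("mapped   MB: " + mem.mapped);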
1) I am on the right page
So mongostat is not really a monitoring solution. It's mostly useful for connecting to a specific server and watching for something specific (what's happening while this job runs?), but it's not really useful for tracking performance over time.
Typically, for monitoring the server, you will want to use a tool like Zabbix, Cacti or Munin, or some third-party server monitor. The MongoDB website has a list.
2) what stats I should monitor in production
You should monitor the same basic stats you would monitor on any server:
CPU
Memory
Disk IO
Network traffic
For MongoDB specifically, you will want to run db.serverStatus() and track the
opcounters
connections
indexcounters
Note that these are increasing counters, so you'll have to create the correct "counter type" in your monitoring system (Zabbix, Cacti, etc.) A few of these monitoring programs already have MongoDB plug-ins available.
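As a sketch of what such a poller would sample in a single call (your monitoring tool then stores the values and turns the increasing counters into rates):

// opcounters.* and connections.* from one serverStatus() round trip.
var s = db.serverStatus();
print("inserts since startup : " + s.opcounters.insert);
print("queries since startup : " + s.opcounters.query);
print("updates since startup : " + s.opcounters.update);
print("deletes since startup : " + s.opcounters["delete"]);
print("connections current   : " + s.connections.current);
print("connections available : " + s.connections.available);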
Also note that MongoDB has a "free" monitoring service called MMS. I say "free" because you will be receiving calls from salespeople in exchange for setting up MMS.
Also, you can use these mini tools for watching MongoDB:
http://openmymind.net/2011/9/23/Compressed-Blobs-In-MongoDB/
By the way, I remembered this great online tool from 10gen:
https://mms.10gen.com/user/login

mongodb higher faults on Windows than on Linux

I am executing the C# code below:
// Requires: using MongoDB.Bson; using MongoDB.Driver;
// 'coll' is assumed to be an already initialized MongoCollection<BsonDocument>
// (legacy 1.x C# driver, which provides coll.Insert()).
int ctr = 0;
for (; ; )
{
    Console.WriteLine("Doc# {0}", ctr++);
    BsonDocument log = new BsonDocument();
    log["type"] = "auth";
    BsonDateTime time = new BsonDateTime(DateTime.Now);
    log["when"] = time;
    log["user"] = "staticString";
    BsonBoolean bol = BsonBoolean.False;
    log["res"] = bol;
    coll.Insert(log);   // insert one small document per iteration
}
When I run it against a MongoDB instance (version 2.0.2) running on a virtual 64-bit Linux machine with just 512 MB of RAM, I get about 5k inserts per second with 1-2 faults, as reported by mongostat after a few minutes.
When the same code is run against a MongoDB instance (version 2.0.2) running on a physical Windows machine with 8 GB of RAM, I get 2.5k inserts per second with about 80 faults, as reported by mongostat after a few minutes.
Why are more faults occurring on Windows? I can see the following message in the logs:
[DataFileSync] FlushViewOfFile failed 33 file
Journaling is disabled on both instances.
Also, are 5k inserts on a virtual machine with 1-2 faults a good enough speed, or should I be expecting better insert performance?
Looks like this is a known issue: https://jira.mongodb.org/browse/SERVER-1163
The page fault counter on Windows is in fact the total number of page faults, which includes both hard and soft page faults.
Process : Page Faults/sec. This is an indication of the number of page faults that
occurred due to requests from this particular process. Excessive page faults from a
particular process are an indication usually of bad coding practices. Either the
functions and DLLs are not organized correctly, or the data set that the application
is using is being called in a less than efficient manner.
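If you want to see the raw counter that mongostat's faults column is based on, you can read it straight from db.serverStatus(); a minimal sketch (on Windows this counter includes soft faults, as explained above):

// Sample the cumulative page fault counter twice; the difference over the
// interval is roughly what mongostat reports per second (divided by the interval).
var before = db.serverStatus().extra_info.page_faults;
sleep(10000);  // wait 10 seconds
var after = db.serverStatus().extra_info.page_faults;
print("page_faults before: " + before + "  after: " + after);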