I'm using Mongo 2.6.9 with a cluster of 2 shards; each shard has 3 replicas, one of which is hidden.
It's a 5-machine deployment running on RedHat where 4 machines each hold a single replica of one shard and the 5th machine holds the hidden replicas of both shards.
There is a load of around 250 inserts per second and 50 updates per second running. These are simple inserts and updates of fairly small documents.
In addition there is a small load of small files being inserted into GridFS (around 1 file per second). The average file size is less than 1 MB.
There are 14 indexes defined on the involved collections. They will be required once I add the application that will read from the DB.
In the logs of the primary replicas I see, throughout the whole run, a huge number of simple inserts, updates and even getLastError requests that take hundreds of milliseconds or sometimes even seconds (the default logging level only shows operations that took more than 100 ms). For example, this simple update uses an index for the query and doesn't update any index:
2015-10-12T06:12:17.258+0000 [conn166086] update chatlogging.SESSIONS query: { _id: "743_12101506113018605820fe43610c0a81eb9_IM" } update: { $set: { EndTime: new Date(1444630335126) } } nscanned:1 nscannedObjects:1 nMatched:1 nModified:1 keyUpdates:0 numYields:0 locks(micros) w:430 2131ms
2015-10-12T06:12:17.259+0000 [conn166086] command chatlogging.$cmd command: update { update: "SESSIONS", updates: [ { q: { _id: "743_12101506113018605820fe43610c0a81eb9_IM" }, u: { $set: { EndTime: new Date(1444630335126) } }, multi: false, upsert: false } ], writeConcern: { w: 1 }, ordered: true, metadata: { shardName: "S1R", shardVersion: [ Timestamp 17000|3, ObjectId('56017697ca848545f5f47bf5') ], session: 0 } } ntoreturn:1 keyUpdates:0 numYields:0 reslen:155 2132ms
All inserts and updates are made with w:1, j:1.
The machines have plenty of available CPU and memory. Disk I/O is significant, but nowhere near 100% when these slow operations occur.
I really need to figure out what's causing this unexpectedly slow responsiveness of the DB. It's very possible that I need to change something in the way the DB is set up. Mongo runs with the default configuration, including the logging level.
An update -
I've continued looking into this issue, and here are additional details that I hope will help pinpoint the root cause of the problem, or at least point me in the right direction:
The total DB size for a single shard is currently more than 200 GB, with the indexes alone accounting for almost 50 GB. Here is the relevant part of db.stats() and the mem section of db.serverStatus() from the primary replica of one of the shards:
"collections" : 7,
"objects" : 73497326,
"avgObjSize" : 1859.9700916465995,
"dataSize" : 136702828176,
"storageSize" : 151309253648,
"numExtents" : 150,
"indexes" : 14,
"indexSize" : 46951096976,
"fileSize" : 223163187200,
"nsSizeMB" : 16,
"mem" : {
"bits" : 64,
"resident" : 5155,
"virtual" : 526027,
"supported" : true,
"mapped" : 262129,
"mappedWithJournal" : 524258
},
The servers have 8 GB of RAM, of which the mongod process uses around 5 GB. So the majority of the data, and probably more importantly the indexes, is not kept in memory. Can this be our root cause? When I previously wrote that the system has plenty of free memory, I was referring to the fact that the mongod process isn't using as much as it could, and also that most of the RAM is used for cached memory that can be released if required:
free -m output
Here is the output of mongostat from the same mongod:
mongostat output
I do see a few page faults in these, but the numbers look too low to me to indicate a real problem. Am I wrong?
Also, I don't know whether the numbers seen in "locked db" are considered reasonable, or whether they indicate lock contention.
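As a cross-check on those mongostat numbers (a minimal sketch added for illustration, using only fields that exist in the 2.6 serverStatus output), the global lock queues can be inspected directly:

// currentQueue: readers/writers queued waiting for a lock; activeClients: readers/writers currently active.
// A persistently non-empty write queue is a clearer sign of lock contention than the "locked db" column alone.
var s = db.serverStatus();
printjson(s.globalLock.currentQueue);
printjson(s.globalLock.activeClients);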
During the same timeframe when these stats were taken, many simple update operations that find a document via an index and don't update any index, like the following one, took hundreds of ms:
2015-10-19T09:44:09.220+0000 [conn210844] update chatlogging.SESSIONS query: { _id: "838_19101509420840010420fe43620c0a81eb9_IM" } update: { $set: { EndTime: new Date(1445247849092) } } nscanned:1 nscannedObjects:1 nMatched:1 nModified:1 keyUpdates:0 numYields:0 locks(micros) w:214 126ms
Many other types of insert and update operations take hundreds of ms too, so the issue looks system-wide rather than related to a specific type of query. Using mtools I'm not able to find operations that scan large numbers of documents.
I'm hoping that here I'll be able to get help with regards to finding the root cause of the problem. I can provide whatever additional info or statistics from the system.
Thank you in advance,
Leonid
1) First, increase the logging level
2) Use mtools to figure out which queries are slow
3) Tune those queries to find your bottleneck
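For step 1, a minimal sketch in the mongo shell (the 50 ms threshold is just an example; adjust to taste):

// lower the slow-operation threshold so more operations show up in the log,
// and enable the profiler on the busy database
use chatlogging
db.setProfilingLevel(1, 50)   // level 1 = profile ops slower than 50 ms (level 2 profiles everything)

// later, look at the slowest captured operations
db.system.profile.find().sort({ millis: -1 }).limit(10).pretty()

For step 2, running mloginfo <logfile> --queries from mtools will then group the slow operations in the mongod log by query pattern.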
Related
I would like to know why the query below takes so long (21 seconds) to execute even though the collection has just one document. I have a replica set PSA instance with 130 databases, giving a total of 500K files between collections and indexes (350 GB). The Linux server has 32 GB RAM and 8 CPUs, and we are not I/O- or CPU-bound. I'm using MongoDB 3.2 with the WiredTiger engine.
What is the relation between timeAcquiringMicros and the query time?
2019-10-03T11:30:34.249-0300 I COMMAND [conn370659] command bd01.000000000000000000000000 command: find {
find:"000000000000000000000000",
filter:{
_id:ObjectId('000000000000000000000006')
},
batchSize:300
} planSummary: IDHACK
keysExamined:1
docsExamined:1
idhack:1
cursorExhausted:1
keyUpdates:0
writeConflicts:0
numYields:0
nreturned:1
reslen:102226
locks:{
Global:{
acquireCount:{
r:2
}
},
Database:{
acquireCount:{
r:1
},
acquireWaitCount:{
r:1
},
timeAcquiringMicros:{
r:21893874
}
},
Collection:{
acquireCount:{
r:1
}
}
} protocol:op_query 21894ms
MongoDB uses multiple granularity locking to help improve parallelism. There are some operations that need to lock on the Global, Database, or Collection level.
In your query you see several acquireCount: { r: X } entries; the r means the operation obtains an "intent shared" lock, which is a way of saying "I don't need to lock anyone out, but I want a loose read lock at each of these levels before I get to the level I need." This prevents your query from executing while there are exclusive writes happening at any level you need to go through.
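If you want to see what is holding or waiting on those locks at the moment a query stalls, a small sketch using db.currentOp() (the filter fields are ones that appear in the 3.2 currentOp output):

// operations currently blocked waiting for a lock
db.currentOp({ waitingForLock: true })

// long-running active operations that might be the blockers (e.g. a foreground index build)
db.currentOp({ active: true, secs_running: { $gt: 5 } })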
Importantly for you, you saw this:
Database:{
acquireCount:{
r:1
},
acquireWaitCount:{
r:1
},
timeAcquiringMicros:{
r:21893874
}
Meaning it took 21,893,874 microseconds (about 21.9 seconds) to acquire the lock you wanted at the Database level. Your query made it through the Global level, but was blocked by something that held an exclusive lock at the Database level. I recommend becoming acquainted with this table in the MongoDB documentation: What locks are taken by some common client operations?.
One hypothesis in your situation is that someone decided to create an index in the foreground. This is interesting because you usually create indexes on collections, but a foreground build takes an exclusive write lock on the database, which blocks all the collections in that database.
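If that hypothesis turns out to be right, one mitigation (a sketch with a hypothetical collection and field name, not taken from the question) is to build the index in the background so it does not hold the exclusive database lock for the whole build:

// background builds are slower overall, but readers and writers are not blocked for the duration
db.someCollection.createIndex({ someField: 1 }, { background: true })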
Another hypothesis is that your database is simply under heavy load. You'll need something like Percona's MongoDB Prometheus Exporter to scrape data from your database, but if you can get the data another way, these two blog posts can help you understand your performance bottleneck:
Percona Monitoring and Management (PMM) Graphs Explained: WiredTiger and Percona Memory Engine
Percona Monitoring and Management (PMM) Graphs Explained: MongoDB MMAPv1
I have MongoDB 3.2.6 installed on 5 machines which together form a sharded cluster consisting of 2 shards (each a replica set with a primary-secondary-arbiter configuration).
I also have a database with a very large collection (~50M records, 200 GB); it was imported through mongos, which put it on the primary shard along with the other collections.
I generated a hashed id on that collection, which will be my shard key.
After that I sharded the collection with:
> use admin
> db.runCommand( { enablesharding : "my-database" } )
> use my-database
> sh.shardCollection("my-database.my-collection", { "_id": "hashed" } )
The command returned:
{ "collectionsharded" : "my-database.my-collection", "ok" : 1 }
And it actually started to shard. The sharding distribution looks like this:
> db.my-collection.getShardingDistribution()
Totals
data : 88.33GiB docs : 45898841 chunks : 2825
Shard my-replica-1 contains 99.89% data, 99.88% docs in cluster, avg obj size on shard : 2KiB
Shard my-replica-2 contains 0.1% data, 0.11% docs in cluster, avg obj size on shard : 2KiB
This all looks OK, but the problem is that when I count my-collection through mongos, I see the number increasing.
When I log in to the primary of the first replica set (my-replica-1) I see that the number of records in my-collection is not decreasing, although the number on my-replica-2 is increasing (which is expected), so I guess MongoDB is not removing chunks from the source shard while migrating them to the second shard.
Does anyone know whether this is normal, and if not, why it is happening?
EDIT: Actually it has now started to decrease on my-replica-1, although the count through mongos still grows (sometimes it goes down a little and then back up). Maybe this is normal behaviour when migrating a large collection, I don't know.
Ivan
According to the documentation here, you are observing a valid situation.
When a document is moved from shard a to shard b, it is counted twice until a receives confirmation that the relocation was successful.
On a sharded cluster, db.collection.count() can result in an
inaccurate count if orphaned documents exist or if a chunk migration
is in progress.
To avoid these situations, on a sharded cluster, use the $group stage
of the db.collection.aggregate() method to $sum the documents. For
example, the following operation counts the documents in a collection:
db.collection.aggregate(
   [
      { $group: { _id: null, count: { $sum: 1 } } }
   ]
)
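Applied to the collection from the question (a sketch; getSiblingDB()/getCollection() are used here only because of the hyphens in the names), that would look something like:

// run through mongos; per the documentation quoted above this avoids counting
// orphaned or in-flight documents
db.getSiblingDB("my-database").getCollection("my-collection").aggregate([
   { $group: { _id: null, count: { $sum: 1 } } }
])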
I have a not-so-big collection of about 500k records, but it's mission critical.
I want to add one field and remove another. I was wondering whether this would lock the collection for inserts/updates (I really don't want any downtime).
I've run an experiment, and it looks like it doesn't block them:
// mongo-console 1
use "my_db"
// add new field
db.my_col.update(
{},
{ $set:
{ foobar : "bizfoo"}
},
{ multi: true}
);
// mongo-console 2
use "my_db"
db.my_col.insert({_id: 1, foobar: 'Im in'});
db.my_col.findOne({_id: 1});
=>{ "_id" : 1, "foo" : "bar" }
Although I don't really understand why, because db.currentOp() shows that there are Write locks on it.
Also, on the production system I have a replica set, and I was curious how that affects the migration.
Can someone answer these questions, or point me to an article where this is nicely explained?
Thanks!
(MongoDB version I use is 2.4)
MongoDB 2.4 locks at the database level per shard. You mentioned you have a replica set; replica sets have no impact on locking, but shards do. If your data is sharded, an update only locks the database on the shard where the data lives. If your data is not sharded, the whole database is locked during the write operation.
In order to see the impact, you'll need a test that does a significant amount of work.
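A rough sketch of such a test (the touched field is made up; my_col is the collection from the question): run a long multi-document update in one shell and time small inserts from a second shell while it runs.

// shell 1: a long-running multi-update that repeatedly takes the database-level write lock
db.my_col.update({}, { $set: { touched: new Date() } }, { multi: true });

// shell 2: time individual inserts while shell 1 is still running
var start = new Date();
db.my_col.insert({ probe: 1, ts: new Date() });
print("insert took " + (new Date() - start) + " ms");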
You can read more at:
http://www.mongodb.org/display/DOCS/How+does+concurrency+work
With an 11 GB working set (db.records.totalSize()), I ran the touch command to get Mongo to use as much memory as possible on my 16 GB RAM box. Before running touch, the serverStatus command showed Mongo's mem.resident at 5800 (roughly 6 GB of RAM).
db.runCommand({ touch: "records", data: true, index: true })
{ "ok" : 1 }
But, after running touch, Mongo's using roughly the same amount of RAM.
"mem" : {
"bits" : 64,
"resident" : 5821, /* only a 21 MB increase */
"virtual" : 29010,
"supported" : true,
"mapped" : 14362,
"mappedWithJournal" : 28724
},
Why did the touch command hardly increase how much RAM Mongo uses (mem.resident)?
The db.serverStatus() command reports resident memory by counting how many pages in physical RAM have actually been accessed by the mongod process.
This means that even though your collection and indexes were read into RAM, they won't show up in the "res" value until you actually start querying against them.
You can verify that the data was read into RAM (if it was definitely cold before) simply by checking how much RAM the mongod process is using at the OS level (not virtual memory).
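Following on from the point about "res" only counting accessed pages, a rough sketch (the records collection name comes from the question):

// force the documents to actually be accessed, then re-read the reported resident memory (MB)
db.records.find().itcount();
db.serverStatus().mem.resident;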
I have 2 EC2 instances, one as a MongoDB server and the other as a Python web app (same availability zone). The Python server connects to the Mongo server using PyMongo, and everything works fine.
The problem is that when I profile execution time in Python, some calls (less than 5%) take up to a couple of seconds to return. I was able to narrow down the problem: the delay is actually in the DB calls to the Mongo server.
The two causes I thought of were:
1. The Mongo server is slow/over-loaded
2. Network latency
So I tried upgrading the Mongo server to a 4x faster instance, but the issue still happens (some calls take up to 3 seconds to return). I assumed that, since both servers are on EC2, network latency should not be a problem... but maybe I was wrong.
How can I confirm whether the issue is actually the network itself? If so, what's the best way to solve it? Is there any other possible cause?
Any help is appreciated...
Thanks,
UPDATE: The entities that I am fetching are very small (and indexed), and the calls usually take only 0.01-0.02 seconds to finish.
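One way to separate network round-trip time from server-side execution time (a sketch added for illustration; run it from a mongo shell on the application host) is to time a command that does essentially no work on the server:

// ping does almost nothing server-side, so its latency is mostly network plus connection overhead
var t0 = new Date();
db.adminCommand({ ping: 1 });
print("round trip: " + (new Date() - t0) + " ms");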
UPDATE:
As suggested by James Wahlin, I enabled profiling on my Mongo server and got some interesting logs:
Fri Mar 15 18:05:22 [conn88635] query db.UserInfoShared query: { $or:
[ { _locked: { $exists: false } }, { _locked: { $lte:
1363370603.297361 } } ], _id: "750837091142" } ntoreturn:1 nscanned:1 nreturned:1 reslen:47 2614ms
Fri Mar 15 18:05:22 [conn88635] command db.$cmd command: {
findAndModify: "UserInfoShared", fields: { _id: 1 }, upsert: true,
query: { $or: [ { _locked: { $exists: false } }, { _locked: { $lte:
1363370603.297361 } } ], _id: "750837091142" }, update: { $set: { _locked: 1363370623.297361 } }, new: true } ntoreturn:1 reslen:153 2614ms
You can see these two calls took more than 2 seconds to finish. The _id field has a unique index, and finding it should not take this much time. Maybe I need to post a new question for it, but could the MongoDB GLOBAL LOCK be the cause?
#James Wahlin, thanks a lot for helping me out.
As it turned out, the main cause of the latency was the MongoDB GLOBAL LOCK itself. Our lock percentage was averaging around 5%, sometimes peaking at 30-50%, and that resulted in the slow queries.
If you are facing this issue, the first thing to do is enable the MongoDB MMS service (mms.10gen.com), which will give you a lot of insight into what exactly is happening in your DB server.
In our case the LOCK PERCENTAGE was really high, and there were multiple reasons for it. The first thing to do to figure it out is to read the MongoDB documentation on concurrency:
http://docs.mongodb.org/manual/faq/concurrency/
The cause of the lock can be at the application level, in MongoDB, or in the hardware.
1) Our app was doing a lot of updates (more than 100 ops/sec), and each update holds the global lock in MongoDB. The issue was that when an update targets a document that is not in memory, Mongo has to load the data into memory first and then update it (in memory), and the whole process happens while holding the global lock. If, say, the whole thing takes 1 sec (0.75 sec to load the data from disk and 0.25 sec to update it in memory), all the other update calls wait for the entire 1 sec, and such updates start queuing up... and you will notice more and more slow requests in your app server.
The solution (while it might sound silly) is to query for the same data before you make the update. What this effectively does is move the 'load data into memory' part (0.75 sec) out of the global lock, which greatly reduces your lock percentage.
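In shell terms that pattern looks roughly like this (a sketch reusing the field names from the log lines above; the real code used findAndModify and different values):

// 1) read the document first: this pages it into memory without holding the global write lock
db.UserInfoShared.findOne({ _id: "750837091142" });

// 2) the update now only touches in-memory data, so the time spent inside the lock stays short
db.UserInfoShared.update(
   { _id: "750837091142" },
   { $set: { _locked: 1363370623.297361 } }
);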
2) The other main cause of the global lock is MongoDB's data flush to disk. Basically, every 60 sec (or less) MongoDB (or the OS) writes the data files to disk, and the global lock is held during this process. (This kind of explains the random slow queries.) In your MMS stats, look at the graph for background flush avg; if it's high, it means you need faster disks.
In our case, we moved to a new EBS-optimized instance in EC2 and also bumped our provisioned IOPS from 100 to 500, which almost halved the background flush average, and the servers are much happier now.
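If you are not running MMS, the same number can be read directly from serverStatus on an MMAPv1 node (a minimal sketch):

// average time (ms) mongod spends flushing dirty data files to disk;
// consistently high values point at the disk subsystem
db.serverStatus().backgroundFlushing.average_ms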