EC2 - MongoDB - PyMongo - Debugging Query Latency

I have 2 EC2 instances, one as a mongodb server and the other as a python web app (same availability zone). The python server connects to the mongo server using PyMongo and everything works fine.
The problem is, when I profile execution time in python, some calls (less than 5%) take up to a couple of seconds to return. I was able to narrow down the problem, and the time delay was actually in the db calls to the mongo server.
The two causes I suspected were:
1. The Mongo server is slow/over-loaded
2. Network latency
So I tried upgrading the mongo server to a 4X faster instance, but the issue still happens (some calls take even 3 secs to return). I assumed that since both servers are on EC2, network latency should not be a problem... but maybe I was wrong.
How can I confirm if the issue is actually the network itself? If so, what is the best way to solve it? Is there any other possible cause?
Any help is appreciated...
Thanks,
UPDATE: The entities that I am fetching are very small (indexed) and the calls usually take only 0.01-0.02 secs to finish.
UPDATE:
As suggested by "James Wahlin", I enabled profiling on my mongo server and got some interesting logs,
Fri Mar 15 18:05:22 [conn88635] query db.UserInfoShared query: { $or:
[ { _locked: { $exists: false } }, { _locked: { $lte:
1363370603.297361 } } ], _id: "750837091142" } ntoreturn:1 nscanned:1 nreturned:1 reslen:47 2614ms
Fri Mar 15 18:05:22 [conn88635] command db.$cmd command: {
findAndModify: "UserInfoShared", fields: { _id: 1 }, upsert: true,
query: { $or: [ { _locked: { $exists: false } }, { _locked: { $lte:
1363370603.297361 } } ], _id: "750837091142" }, update: { $set: { _locked: 1363370623.297361 } }, new: true } ntoreturn:1 reslen:153 2614ms
You can see these two calls took more than 2 secs to finish. The field _id has a unique index and finding it should not have taken this much time. Maybe I have to post a new question for it, but can the mongodb GLOBAL LOCK be the cause?
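Profiler lines like the ones above can be scanned mechanically for slow operations; here is a small sketch (the regex and threshold are my own, not part of mongod):

```python
import re

# The profiler appends the total operation time as a trailing "<N>ms" field.
DURATION_RE = re.compile(r"(\d+)ms\s*$")

def slow_ops(log_lines, threshold_ms=1000):
    """Yield (duration_ms, line) for profiled operations at or above the threshold."""
    for line in log_lines:
        m = DURATION_RE.search(line)
        if m and int(m.group(1)) >= threshold_ms:
            yield int(m.group(1)), line

sample = ('Fri Mar 15 18:05:22 [conn88635] query db.UserInfoShared '
          'query: { _id: "750837091142" } ntoreturn:1 nscanned:1 '
          'nreturned:1 reslen:47 2614ms')
for ms, line in slow_ops([sample]):
    print(ms)  # 2614
```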

@James Wahlin, thanks a lot for helping me out.
As it turned out, the main cause of the latency was the mongodb GLOBAL LOCK itself. Our lock percentage was averaging 5% and sometimes peaked at 30-50%, which resulted in the slow queries.
If you are facing this issue, the first thing you have to do is enable the mongodb MMS service (mms.10gen.com), which will give you a lot of insight into what exactly is happening in your db server.
In our case the LOCK PERCENTAGE was really high and there were multiple reasons for it. The first thing to do to figure it out is to read the mongodb documentation on concurrency:
http://docs.mongodb.org/manual/faq/concurrency/
The cause of the lock can be at the application level, in mongodb itself, or in the hardware.
1) Our app was doing a lot of updates, and each update (more than 100 ops/sec) holds the global lock in mongodb. The issue was that when an update happens on an entry which is not in memory, mongo has to load the data into memory first and then update it (in memory), and the whole process happens inside the global lock. If, say, the whole thing takes 1 sec to complete (0.75 sec to load the data from disk and 0.25 sec to update it in memory), all the other update calls wait (for the entire 1 sec), and such updates start queuing up... and you will notice more and more slow requests in your app server.
The solution (while it might sound silly) is to query for the same data before you make the update. What this effectively does is move the 'load data into memory' (0.75 sec) part outside the global lock, which greatly reduces your lock percentage.
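The arithmetic above can be made explicit with a toy model (the numbers are taken straight from the example; the function itself is illustrative only):

```python
# Toy model of the queueing effect described above. The assumption is that a
# plain pre-read query pages the document in while yielding, so only the
# in-memory update (0.25 s) is then serialized under the global lock.
def time_in_lock(disk_load_s, update_s, preread):
    return update_s if preread else disk_load_s + update_s

naive = time_in_lock(0.75, 0.25, preread=False)   # 1.0 s holding the lock
warmed = time_in_lock(0.75, 0.25, preread=True)   # 0.25 s holding the lock
print(naive / warmed)  # 4.0 -- each update holds the lock 4x shorter
```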
2) The other main cause of the global lock is mongodb's data flush to disk. Basically, every 60 sec (or less) mongodb (or the OS) writes the data to disk, and a global lock is held during this process. (This kind of explains the random slow queries.) In your MMS stats, look at the graph for background flush avg... if it's high, that means you need faster disks.
In our case, we moved to a new EBS-optimized instance in EC2 and also bumped our provisioned IOPS from 100 to 500, which almost halved the background flush avg, and the servers are much happier now.

Related

MongoDB - Find() by _id is taking so long - planSummary:IDHACK (timeAcquiringMicros is equal to query time)

I would like to know why the query below is taking so long (21 seconds) to execute even though the collection has just one document. I have a replica set PSA instance with 130 databases, giving a total of 500K files between collections and indexes (350GB). The Linux server has 32GB RAM and 8 CPUs, but we are not IO- or CPU-bound. I'm using MongoDB 3.2 with the WiredTiger engine.
What is the relation between timeAcquiringMicros and the query time?
2019-10-03T11:30:34.249-0300 I COMMAND [conn370659] command bd01.000000000000000000000000 command: find {
find: "000000000000000000000000",
filter: {
_id: ObjectId('000000000000000000000006')
},
batchSize: 300
} planSummary: IDHACK
keysExamined:1
docsExamined:1
idhack:1
cursorExhausted:1
keyUpdates:0
writeConflicts:0
numYields:0
nreturned:1
reslen:102226
locks:{
Global:{
acquireCount:{
r:2
}
},
Database:{
acquireCount:{
r:1
},
acquireWaitCount:{
r:1
},
timeAcquiringMicros:{
r:21893874
}
},
Collection:{
acquireCount:{
r:1
}
}
} protocol:op_query 21894ms
MongoDB uses multiple granularity locking to help improve parallelism. There are some operations that need to lock on the Global, Database, or Collection level.
In your query you see several acquireCount: { r: X } entries; the r means the operation is attempting to obtain an "intent shared lock", which is just a way of saying "I don't need to lock anyone out, but I want to obtain a loose read lock at each of these levels before I get to the level I need." This prevents your query from executing while there are exclusive writes happening at any level you need to go through.
Importantly for you, you saw this:
Database:{
acquireCount:{
r:1
},
acquireWaitCount:{
r:1
},
timeAcquiringMicros:{
r:21893874
}
Meaning it took 21893874 microseconds (about 21.9 seconds) to acquire the lock you wanted at the Database level. Your query made it through the Global level, but was blocked by something that had an exclusive lock at the Database level. I recommend becoming acquainted with this table in the MongoDB documentation: What locks are taken by some common client operations?.
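A quick unit check on the numbers copied from the log confirms the lock wait accounts for essentially the entire reported query time:

```python
# Values copied from the log excerpt above.
time_acquiring_micros = 21_893_874  # timeAcquiringMicros at the Database level
total_ms = 21_894                   # "protocol:op_query 21894ms"

wait_ms = time_acquiring_micros / 1000
print(wait_ms)                       # 21893.874
print(round(wait_ms / total_ms, 4))  # 1.0 -- the IDHACK lookup itself was instant
```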
One hypothesis in your situation is that someone decided to create an index in the foreground on the database. This is interesting because you usually create indexes on collections, but a foreground build takes an exclusive write lock on the database, which blocks all the collections in that database.
Another hypothesis is that your database is simply under heavy load. You'll need something like Percona's MongoDB Prometheus Exporter to scrape data from your database, but if you are able to get the data another way, these two blog posts can help you understand your performance bottleneck.
Percona Monitoring and Management (PMM) Graphs Explained: WiredTiger and Percona Memory Engine
Percona Monitoring and Management (PMM) Graphs Explained: MongoDB MMAPv1

Mongodb server side vs client side processing

I have a shell script that creates a cursor on a collection and then updates each document with data from another collection.
When I run it on a local db it finishes in about 15 sec, but on a hosted database, it runs for over 45 min.
db.col1.find().forEach(function(doc) {
  db.col2.findAndModify({
    query: { attr1: doc.attr1 },
    update: { $push: { attr2: doc.attr2 } },
    upsert: true
  });
});
So there is obviously network overhead between the client and server in order to process this script. Is there a way to keep the processing all server side? I have looked at server side javascript, but from what I read here, it is not a recommended practice.
Locally, you have almost no network overhead. No interference, no routers, no switches, no limited bandwidth. Plus, in most situations, your mass storage, be it SSD or HDD, more or less idles around (unless you tend to play games while developing). So when an operation requiring a lot of IO capacity kicks in, that capacity is available.
When you run your script from your local shell against a server, here is what happens.
db.col1.find().forEach The whole collection is going to be read from an unknown medium (most likely HDDs whose available IO could be shared among many instances). The documents will then be transferred to your local shell. Compared to a connection to localhost, each document retrieval is routed over dozens of hops, each adding a tiny amount of latency. Over presumably quite a few documents, this adds up. Don't forget that the complete document is sent over the network, since you did not use projection to limit the fields returned to attr1 and attr2. External bandwidth, of course, is slower than a connection to localhost.
db.col2.findAndModify For each document, a query is done. Again, the shared IO might well kill performance.
{ query: {attr1: doc.attr1}, update: { $push: {attr2: doc.attr2} }, upsert: true } Are you sure attr1 is indexed, by the way? And even if it is, it is not certain whether the index is currently in RAM. We are talking about a shared instance, right? It might also be that your write operations have to wait before they are even processed by mongod: per the default write concern, the data has to be successfully applied to the in-memory data set before it is acknowledged, and if gazillions of operations are sent to the shared instance, yours may well be number one bazillion and one in the queue. And the network latency is added a second time, since the values returned to your local shell need to be sent back.
What you can do
First of all, make sure that you
limit the returned values to those you need using projection:
db.col1.find({},{ "_id":0, "attr1":1, "attr2":1 })
Make sure you have attr1 indexed
db.col2.ensureIndex( { "attr1":1 } )
Use bulk operations. They are executed much faster, at the expense of reduced feedback in case of problems.
// We can use unordered here, because the operations
// each apply to only a single document
var bulk = db.col2.initializeUnorderedBulkOp()
// A counter we will use for intermediate commits
// We do the intermediate commits in order to keep RAM usage low
var counter = 0
// We limit the result to the values we need
db.col1.find({}, {"_id":0, "attr1":1, "attr2":1 }).forEach(
  function(doc){
    // Find the matching document,
    // update exactly that,
    // and if it does not exist, create it
    bulk
      .find({"attr1": doc.attr1})
      .upsert()
      .updateOne({ $push: {"attr2": doc.attr2} })
    counter++
    // We have queued 1k operations and can commit them
    // (MongoDB would split the bulk ops into batches of 1k operations anyway)
    if( counter % 1000 == 0 ){
      bulk.execute()
      print("Operations committed: " + counter)
      // Initialize a new batch of operations
      bulk = db.col2.initializeUnorderedBulkOp()
    }
  }
)
// Execute the remaining operations not committed yet.
bulk.execute()
print("Operations committed: " + counter)

MongoDB C# query performance much worse than mongotop reports

I have quite a bit of experience with Mongo, but am on the verge of tears of frustration about this problem (that of course popped up from nowhere a day before release).
Basically I am querying a database to retrieve a document but it will often be an order of magnitude (or even two) worse than it should be, particularly since the query is returning nothing.
Query:
//searchQuery ex: { "atomic.Basic^SessionId" : "a8297898-7fc9-435c-96be-9c5e60901e40" }
var doc = FindOne(searchQuery);
Explain:
{
"cursor":"BtreeCursor atomic.Basic^SessionId",
"isMultiKey" : false,
" n":0,
"nscannedObjects":0,
"nscanned":0,
"nscannedObjectsAllPlans":0,
"nscannedAllPlans":0,
"scanAndOrder":false,
"indexOnly":false,
"nYields":0,
"nChunkSkips":0,
"millis":0,
"indexBounds":{
"atomic.Basic^SessionId":[
[
"a8297898-7fc9-435c-96be-9c5e60901e40",
"a8297898-7fc9-435c-96be-9c5e60901e40"
]
]
}
}
It is often taking 50-150 ms, even though mongotop reports at most 15 ms of read time (and that should be over several queries). There are only 6k documents in the database (only 2k or so are in the index, and the explain says it's using the index) and since the document being searched for isn't there, it can't be a deserialization issue.
It's not this bad on every query (sub ms most of the time) and surely the B-tree isn't large enough to have that much variance.
Any ideas will have my eternal gratitude.
MongoTop is not reporting the total query time. It reports the amount of time MongoDB is spending holding particular locks.
That query returned in 0ms according to the explain (which is extremely quick). What you are describing sounds like network latency. What is the latency when you ping the node? Is it possible that the network is flaky?
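One way to confirm that is to time the call on the client and compare it with the server-reported millis from explain (or mongotop); a minimal sketch, using a stand-in function since there is no live connection here:

```python
import time

# Time a call on the client side. A large, consistent gap between this number
# and the server-reported "millis" points at the network or the driver,
# not the database.
def timed(fn, *args, **kwargs):
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms

# Stand-in for FindOne(searchQuery); with a real driver you would pass the
# query call here and subtract the explain() millis from elapsed_ms.
result, elapsed_ms = timed(sorted, [3, 1, 2])
print(result)  # [1, 2, 3]
```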
What version of MongoDB are you using? Consider upgrading both MongoDB and the C# driver to the latest stable versions.

How to automatically kill slow MongoDB queries?

Is there a way that I can protect my app against slow queries in MongoDB?
My application has tons of possible filter combinations, and I'm monitoring all these queries, but at the same time I don't want to compromise performance because of a missing index definition.
The 'notablescan' option, as @ghik mentioned, will prevent you from running queries that are slow due to not using an index. However, that option is global to the server and is not appropriate for use in a production environment. It also won't protect you from any other source of slow queries besides table scans.
Unfortunately, I don't think there is a way to directly do what you want right now. There is a JIRA ticket proposing the addition of a $maxTime or $maxScan query parameter, which sounds like it would help you, so please vote for it: https://jira.mongodb.org/browse/SERVER-2212.
There are options available on the client side (maxTimeMS starting in 2.6 release).
On the server side, there is no appealing global option, because it would impact all databases and all operations, even ones that the system needs to be long running for internal operation (for example tailing the oplog for replication). In addition, it may be okay for some of your queries to be long running by design.
The correct way to solve this would be to monitor currently running queries via a script and kill the ones that are long running and user/client initiated - you can then build in exceptions for queries that are long running by design, or have different thresholds for different queries/collections/etc.
You can then use db.currentOp() method (in the shell) to see all currently running operations. The field "secs_running" indicates how long the operation has been running. Be careful not to kill any long running operations that are not initiated by your application/client - it may be a necessary system operation, like chunk migration in a sharded cluster (as just one example).
Right now with version 2.6 this is possible. In their press release you can see the following:
with MaxTimeMS, operators and developers can specify auto-cancellation
of queries, providing better control of resource utilization;
Therefore with MaxTimeMS you can specify how long your query is allowed to run. For example, I do not want a specific query to run for more than 200 ms.
db.collection.find({
// my query
}).maxTimeMS(200)
What is cool about this, is that you can specify different timeouts for different operations.
To answer the OP's question in the comment: there is no global setting for this. One reason is that different queries have different maximum tolerable times. For example, you may have a query that finds userInfo by its ID. This is a very common operation and should run super fast (otherwise we are doing something wrong), so we cannot tolerate it running longer than 200 ms.
But we also have an aggregation query which we run once a day. For this operation it is okay to run for 4 seconds, but we cannot tolerate it running longer than 10 seconds, so we can put 10000 as maxTimeMS.
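Such per-query budgets can be kept in a small lookup table on the client; a sketch (all names and numbers here are illustrative, not a MongoDB API):

```python
# Illustrative per-operation time budgets, to be applied via maxTimeMS.
MAX_TIME_MS = {
    "user_info_by_id": 200,      # hot-path lookup: must be fast
    "daily_aggregation": 10_000, # batch job: allowed to run long
}
DEFAULT_BUDGET_MS = 1_000

def budget_for(op_name):
    """Return the maxTimeMS budget for a named operation."""
    return MAX_TIME_MS.get(op_name, DEFAULT_BUDGET_MS)

# With PyMongo the budget would be applied roughly as:
#   coll.find(query).max_time_ms(budget_for("user_info_by_id"))
print(budget_for("user_info_by_id"))  # 200
print(budget_for("weekly_report"))    # 1000
```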
I guess there is currently no support for killing a query by passing a time argument. In development, though, you can set the profiler level to 2. It will log every query that is issued, and from there you can see which queries take how much time. I know it's not exactly what you wanted, but it helps in getting insight into which queries are heavy, and then in your app logic you can have some way to gracefully handle such cases where those queries might originate. I usually go with this approach and it helps.
Just putting this here since I was struggling with the same for a while:
Here is how you can do it in python3
Tested on mongo version 4.0 and pymongo version 3.11.4
import logging
import pymongo
client = pymongo.MongoClient("mongodb://mongodb0.example.com:27017")
admin_db = client.get_database("admin")
milliseconds_running = 10000
query = [
{"$currentOp": {"allUsers": True, "idleSessions": True}},
{
"$match": {
"active": True,
"microsecs_running": {
"$gte": milliseconds_running * 1000
},
"ns": {"$in": ["mydb.collection1", "mydb.collection2"]},
"op": {"$in": ["query"]},
}
},
]
ops = admin_db.aggregate(query)
count = 0
for op in ops:
admin_db.command({"killOp": 1, "op": op["opid"]})
count += 1
logging.info("ops found: %d" % count)
I wrote a more robust and configurable script for it here.
It also has a Dockerfile in case anyone wants to use this as a container. I am currently using this as a periodically running cleanup task.

Update $unset command eats up all ram on large collection

Using MongoDB, I'm trying to remove a column from a collection that contains ~3 million records.
db.Listing.update( {}, { $unset: { Longitude: 1 } }, false, true);
When I execute this command, the RAM on the server continues to go up until it runs out of RAM and then the server is hosed and needs to be physically rebooted. Is there a better way to remove a column from a large collection that won't hose the server?
I expect your problem is the system OOM killer. You should make sure you aren't limiting the resources for mongod. See this: http://www.mongodb.org/display/DOCS/Checking+Server+Memory+Usage
http://prefetch.net/blog/index.php/2009/09/30/how-the-linux-oom-killer-works/
If you are using a virtualization system like OpenVZ, you might want to disable or adjust its memory over-commit feature.
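Aside from the memory settings, one way to keep the working set bounded while running the $unset from the question is to apply it in _id-ranged batches; a rough sketch (the helper name and batch size are my own):

```python
# Split a sorted list of _ids into inclusive (low, high) ranges so a
# collection-wide $unset can be applied one bounded batch at a time.
def batches(sorted_ids, size):
    for i in range(0, len(sorted_ids), size):
        chunk = sorted_ids[i:i + size]
        yield chunk[0], chunk[-1]

# With PyMongo, each range would then be applied roughly as:
#   db.Listing.update_many(
#       {"_id": {"$gte": lo, "$lte": hi}, "Longitude": {"$exists": True}},
#       {"$unset": {"Longitude": ""}})
print(list(batches([1, 2, 3, 4, 5], 2)))  # [(1, 2), (3, 4), (5, 5)]
```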