MongoDB search performance

I want to know why the following search in MongoDB (C#) would take 50 seconds to execute.
I followed the basic idea of http://calv.info/indexing-schemaless-documents-in-mongo/
I have 100,000 records in a collection (captures). Each document has a collection of SearchTerm objects:
public class SearchTerm
{
    public string Key { get; set; }
    public object Value { get; set; }
}
public class Capture
{
    // Some other fields
    public IList<SearchTerm> SearchTerms { get; set; }
}
I have also defined an index like so:
var capturesCollection = database.GetCollection<Capture>("captures");
capturesCollection.CreateIndex("SearchTerms.Key", "SearchTerms.Value");
But the following query takes 50 seconds to execute
var query = Query.Or(Query.And(Query.EQ("SearchTerms.Key", "ClientId"), Query.EQ("SearchTerms.Value", selectedClient.Id)), Query.And(Query.EQ("SearchTerms.Key", "CustomerName"), Query.EQ("SearchTerms.Value", "Jan")));
var selectedCapture = capturesCollection.Find(query).ToList();
Edit: As asked, here is my explain output:
clauses: [{ "cursor" : "BtreeCursor SearchTerms.Key_1_SearchTerms.Value_1", "isMultiKey" : true, "n" : 10003, "nscannedObjects" : 100000, "nscanned" : 100000, "scanAndOrder" : false, "indexOnly" : false, "nChunkSkips" : 0, "indexBounds" : { "SearchTerms.Key" : [["ClientId", "ClientId"]], "SearchTerms.Value" : [[{ "$minElement" : 1 }, { "$maxElement" : 1 }]] } }, { "cursor" : "BtreeCursor SearchTerms.Key_1_SearchTerms.Value_1", "isMultiKey" : true, "n" : 70328, "nscannedObjects" : 90046, "nscanned" : 211653, "scanAndOrder" : false, "indexOnly" : false, "nChunkSkips" : 0, "indexBounds" : { "SearchTerms.Key" : [["CustomerName", "CustomerName"]], "SearchTerms.Value" : [[{ "$minElement" : 1 }, { "$maxElement" : 1 }]] } }]
cursor: QueryOptimizerCursor
n: 73219
nscannedObjects: 190046
nscanned: 311653
nscannedObjectsAllPlans: 190046
nscannedAllPlans: 311653
scanAndOrder: false
nYields: 2436
nChunkSkips: 0
millis: 5196
server: piro-pc:27017
filterSet: false
stats: { "type" : "KEEP_MUTATIONS", "works" : 311655, "yields" : 2436, "unyields" : 2436, "invalidates" : 0, "advanced" : 73219, "needTime" : 238435, "needFetch" : 0, "isEOF" : 1, "children" : [{ "type" : "OR", "works" : 311655, "yields" : 2436, "unyields" : 2436, "invalidates" : 0, "advanced" : 73219, "needTime" : 238435, "needFetch" : 0, "isEOF" : 1, "dupsTested" : 80331, "dupsDropped" : 7112, "locsForgotten" : 0, "matchTested_0" : 0, "matchTested_1" : 0, "children" : [{ "type" : "FETCH", "works" : 100001, "yields" : 2436, "unyields" : 2436, "invalidates" : 0, "advanced" : 10003, "needTime" : 89997, "needFetch" : 0, "isEOF" : 1, "alreadyHasObj" : 0, "forcedFetches" : 0, "matchTested" : 10003, "children" : [{ "type" : "IXSCAN", "works" : 100000, "yields" : 2436, "unyields" : 2436, "invalidates" : 0, "advanced" : 100000, "needTime" : 0, "needFetch" : 0, "isEOF" : 1, "keyPattern" : "{ SearchTerms.Key: 1, SearchTerms.Value: 1 }", "boundsVerbose" : "field #0['SearchTerms.Key']: [\"ClientId\", \"ClientId\"], field #1['SearchTerms.Value']: [MinKey, MaxKey]", "isMultiKey" : 1, "yieldMovedCursor" : 0, "dupsTested" : 100000, "dupsDropped" : 0, "seenInvalidated" : 0, "matchTested" : 0, "keysExamined" : 100000, "children" : [] }] }, { "type" : "FETCH", "works" : 211654, "yields" : 2436, "unyields" : 2436, "invalidates" : 0, "advanced" : 70328, "needTime" : 141325, "needFetch" : 0, "isEOF" : 1, "alreadyHasObj" : 0, "forcedFetches" : 0, "matchTested" : 70328, "children" : [{ "type" : "IXSCAN", "works" : 211653, "yields" : 2436, "unyields" : 2436, "invalidates" : 0, "advanced" : 90046, "needTime" : 121607, "needFetch" : 0, "isEOF" : 1, "keyPattern" : "{}", "boundsVerbose" : "field #0['SearchTerms.Key']: [\"CustomerName\", \"CustomerName\"], field #1['SearchTerms.Value']: [MinKey, MaxKey]", "isMultiKey" : 1, "yieldMovedCursor" : 0, "dupsTested" : 211653, "dupsDropped" : 121607, "seenInvalidated" : 0, "matchTested" : 0, "keysExamined" : 211653, "children" : [] }] }] }] }

Thanks for posting the explain. Let's address the problems one at a time.
First, I don't think this query does what you think it does / want it to do. Let me show you by example using the mongo shell. Your query, translated into the shell, is
{ "$or" : [
{ "$and" : [
{ "SearchTerms.Key" : "ClientId" },
{ "SearchTerms.Value" : "xxx" }
]},
{ "$and" : [
{ "SearchTerms.Key" : "CustomerName" },
{ "SearchTerms.Value" : "Jan" }
]}
]}
This query finds documents where either some Key has the value "ClientId" and some Value has the value "xxx", or some Key has the value "CustomerName" and some Value has the value "Jan". The key and the value don't need to be part of the same array element. For example, the following document matches your query:
{ "SearchTerms" : [
{ "Key" : "ClientId", "Value" : 691 },
{ "Key" : "banana", "Value" : "xxx" }
]
}
I'm guessing your desired behavior is to match exactly the documents that contain the Key and Value in the same array element. The $elemMatch operator is the tool for the job:
{ "$or" : [
{ "SearchTerms" : { "$elemMatch" : { "Key" : "ClientId", "Value" : "xxx" } } },
{ "SearchTerms" : { "$elemMatch" : { "Key" : "CustomerName", "Value" : "Jan" } } }
]}
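You can convince yourself of the difference with a quick check in the shell. This is just a sketch against a scratch collection (the collection name scratch and the value "xxx" are placeholders, as above):
db.scratch.insert({ "SearchTerms" : [
{ "Key" : "ClientId", "Value" : 691 },
{ "Key" : "banana", "Value" : "xxx" }
] })
// original form: matches, even though no single element pairs "ClientId" with "xxx"
db.scratch.count({ "$and" : [ { "SearchTerms.Key" : "ClientId" }, { "SearchTerms.Value" : "xxx" } ] })   // 1
// $elemMatch form: requires Key and Value in the same array element, so no match
db.scratch.count({ "SearchTerms" : { "$elemMatch" : { "Key" : "ClientId", "Value" : "xxx" } } })   // 0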
Second, I don't think this schema is what you are looking for. You don't describe your use case so I can't be confident, but the situation described in that blog post is a very rare situation where you need to store and search on arbitrary key-value pairs that can change from one document to the next. This is like letting users put in custom metadata. Almost no applications want or need to do this. It looks like your application is storing information about customers, probably for an internal system. You should be able to define a data model for your customers that looks like
{
"CustomerId" : 1234,
"CustomerName" : "Jan",
"ClientId" : "xpj1234",
...
}
This will simplify and improve things dramatically. I think the wires got crossed here because sometimes people call MongoDB "schemaless" and the blog post talks about "schemaless" documents. The blog post really is talking about schemaless documents where you don't know what is going to go in there. Most applications should know pretty much exactly what the general structure of the documents in a collection will be.
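For comparison, here is roughly what the search and its indexes would look like against such a flattened model, as a sketch in the shell (field names are taken from the example document above, the values are placeholders, and I'm assuming the documents stay in your existing captures collection):
db.captures.createIndex({ "ClientId" : 1 })
db.captures.createIndex({ "CustomerName" : 1 })
// each $or branch can use its own index
db.captures.find({ "$or" : [ { "ClientId" : "xpj1234" }, { "CustomerName" : "Jan" } ] })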
Finally, I think on the basis of this we can disregard the issue with the slow query for now. Feel free to ask another question or edit this one with extra explanation if you need more help or if the problem doesn't go away once you've taken into account what I've said here.

1) Please take a look at the MongoDB log file and see what query actually gets generated against the database.
2) Enter that query into the mongo shell and add ".explain()" at the end, and check whether your index is actually being used (does it say BasicCursor or BtreeCursor?). A minimal shell sketch follows below.
3) If your index is used, what is the value of the "nscanned" attribute? Perhaps your index does not have enough value diversity (selectivity) in it?
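For step 2, something like this (the filter here is deliberately simplified; paste the exact query from your log instead):
db.captures.find({ "SearchTerms.Key" : "ClientId" }).explain()
// "cursor" : "BasicCursor" means no index was used;
// "cursor" : "BtreeCursor SearchTerms.Key_1_SearchTerms.Value_1" means your compound index was chosen;
// "nscanned" is the number of index entries (or documents, for a collection scan) examined.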

Related

MongoDB disk read performance is very low

I have a MongoDB instance with a database and a collection with 80GB of data. The number of documents inside is about 4M, with a comparatively large document size of about 20kB on average. Among other more elementary fields, each document contains one list of 1024 elements and also 3-4 lists of 200 numbers.
I perform a simple batch find query over a properly indexed string field ('isbn'), intending to get 5000 documents (projected on relevant part) in one batch. For this, I use the $in operator:
rows = COLLECTION.find({"isbn": {"$in": candidate_isbns}},
                       {"_id": 0, "isbn": 1, "other_stuff": 1})
The IXSCAN stage works as intended. Since the corresponding documents, however, are not yet in the WiredTiger cache (and probably never will be, given my limited 32GB of RAM), the data has to be read from disk during the FETCH stage in most cases. (Unfortunately, "other_stuff" is too heavy to put into an index that could cover this query.)
The SSD attached to my virtual cloud machine has a read performance of about 90MB/s, which is not great, but should be sufficient for now. However, when I monitor the disk read speed (via iostat, for example), the speed during the query drops to roughly 3MB/s, which seems very poor. I can verify this poor behaviour by checking the profiler output (MongoDB seems to split the 5000 into further batches, so I show only the output for a sub-batch of 2094):
{
"op" : "getmore",
"ns" : "data.metadata",
"command" : {
"getMore" : NumberLong(7543502234201790529),
"collection" : "metadata",
"lsid" : {
"id" : UUID("2f410f2d-2f74-4d3a-9041-27c4ddc51bd2")
},
"$db" : "data"
},
"originatingCommand" : {
"$truncated" : "{ find: \"metadata\", filter: { isbn: { $in: [ \"9783927781313\", ..."
},
"cursorid" : NumberLong(7543502234201790529),
"keysExamined" : 4095,
"docsExamined" : 2095,
"numYield" : 803,
"nreturned" : 2094,
"locks" : {
"ReplicationStateTransition" : {
"acquireCount" : {
"w" : NumberLong(805)
}
},
"Global" : {
"acquireCount" : {
"r" : NumberLong(805)
}
},
"Database" : {
"acquireCount" : {
"r" : NumberLong(804)
}
},
"Collection" : {
"acquireCount" : {
"r" : NumberLong(804)
}
},
"Mutex" : {
"acquireCount" : {
"r" : NumberLong(1)
}
}
},
"flowControl" : {},
"storage" : {
"data" : {
"bytesRead" : NumberLong(65454770),
"timeReadingMicros" : NumberLong(21386543)
}
},
"responseLength" : 16769511,
"protocol" : "op_msg",
"millis" : 21745,
"planSummary" : "IXSCAN { isbn: 1 }",
"execStats" : {
"stage" : "PROJECTION_SIMPLE",
"nReturned" : 2196,
"executionTimeMillisEstimate" : 21126,
"works" : 4288,
"advanced" : 2196,
"needTime" : 2092,
"needYield" : 0,
"saveState" : 817,
"restoreState" : 817,
"isEOF" : 0,
"transformBy" : {},
"inputStage" : {
"stage" : "FETCH",
"nReturned" : 2196,
"executionTimeMillisEstimate" : 21116,
"works" : 4288,
"advanced" : 2196,
"needTime" : 2092,
"needYield" : 0,
"saveState" : 817,
"restoreState" : 817,
"isEOF" : 0,
"docsExamined" : 2196,
"alreadyHasObj" : 0,
"inputStage" : {
"stage" : "IXSCAN",
"nReturned" : 2196,
"executionTimeMillisEstimate" : 531,
"works" : 4288,
"advanced" : 2196,
"needTime" : 2092,
"needYield" : 0,
"saveState" : 817,
"restoreState" : 817,
"isEOF" : 0,
"keyPattern" : {
"isbn" : 1.0
},
"indexName" : "isbn_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"isbn" : []
},
"isUnique" : true,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"isbn" : [
"[\"9780230391451\", \"9780230391451\"]",
"[\"9780230593206\", \"9780230593206\"]",
... ]
},
"keysExamined" : 4288,
"seeks" : 2093,
"dupsTested" : 0,
"dupsDropped" : 0
}
}
},
"ts" : ISODate("2022-01-24T07:57:12.132Z"),
"client" : "my_ip",
"allUsers" : [
{
"user" : "myUser",
"db" : "data"
}
],
"user" : "myUser#data"
}
Looking at the ratio of bytesRead to timeReadingMicros (65,454,770 bytes / 21,386,543 µs ≈ 3MB/s) confirms this poor read speed, indeed.
My question: Why does this degradation of speed take place? Is it pathological, so that I need to do further investigations, or is it the expected behaviour, given the data setup above?
Any help is highly appreciated!

Does MongoDB fetch entire document even if single field is projected?

I have a MongoDB Collection for weather data with each document consisting about 50 different weather parameters fields. Simple Example below:
{
"wind":7,
"swell":6,
"temp":32,
...
"50th_field":32
}
If I only need one field from all documents, say temp, my query would be this:
db.weather.find({},{ temp: 1})
So internally, does MongoDB have to fetch the entire document for just the one field that was requested (projected)? Wouldn't that be an expensive operation?
I tried MongoDB Compass to benchmark timings, but the time required was <1ms, so I couldn't tell.
MongoDB will read the whole document, however only the field temp (and _id) will be transmitted over your network to the client. In case your documents are rather big, the overall performance should be better when you project only the fields you need.
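One way to see the per-document difference from the shell is to compare the BSON size of a full document with that of a projected one (a sketch):
Object.bsonsize(db.weather.findOne())                  // full document
Object.bsonsize(db.weather.findOne({}, { temp: 1 }))   // only _id and temp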
Yes. This is how to avoid it (by making the query covered):
1) create an index on temp
2) filter on temp in the query
3) turn off _id in the projection (necessary, because _id is not in the index).
Run:
db.coll.find({ temp: { $ne: null } }, { temp: 1, _id: 0 })
An empty filter {} triggers a collection scan, because the planner tries to match the query fields with the projection.
With a filter on temp and the projection { temp: 1, _id: 0 } it says: "Oh, I only need temp" and can answer from the index alone.
It would be nice if it were also smart enough to tell that {} with { temp: 1, _id: 0 } only needs the index, but it's not.
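Spelled out as shell commands, the recipe above is roughly this (a sketch, using the weather collection from the question):
db.weather.createIndex({ temp: 1 })
db.weather.find({ temp: { $ne: null } }, { temp: 1, _id: 0 })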
Basically, using a projection to limit the fields is always faster than fetching the full document. You can even use a covered index to avoid examining the documents at all (no disk IO) and achieve better performance.
Check the executionStats of the demo below: totalDocsExamined was 0! But you must remove the _id field from the projection, because it's not included in the index.
See also:
https://docs.mongodb.com/manual/core/query-optimization/#covered-query
> db.test.insertOne({name: 'TJT'})
{
"acknowledged" : true,
"insertedId" : ObjectId("5faa0c8469dffee69357dde3")
}
> db.test.createIndex({name: 1})
{
"createdCollectionAutomatically" : false,
"numIndexesBefore" : 1,
"numIndexesAfter" : 2,
"ok" : 1
}
db.test.explain('executionStats').find({name: 'TJT'}, {_id: 0, name: 1})
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "memo.test",
"indexFilterSet" : false,
"parsedQuery" : {
"name" : {
"$eq" : "TJT"
}
},
"winningPlan" : {
"stage" : "PROJECTION",
"transformBy" : {
"_id" : 0,
"name" : 1
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"name" : 1
},
"indexName" : "name_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"name" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"name" : [
"[\"TJT\", \"TJT\"]"
]
}
}
},
"rejectedPlans" : [ ]
},
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 1,
"executionTimeMillis" : 0,
"totalKeysExamined" : 1,
"totalDocsExamined" : 0,
"executionStages" : {
"stage" : "PROJECTION",
"nReturned" : 1,
"executionTimeMillisEstimate" : 0,
"works" : 2,
"advanced" : 1,
"needTime" : 0,
"needYield" : 0,
"saveState" : 0,
"restoreState" : 0,
"isEOF" : 1,
"invalidates" : 0,
"transformBy" : {
"_id" : 0,
"name" : 1
},
"inputStage" : {
"stage" : "IXSCAN",
"nReturned" : 1,
"executionTimeMillisEstimate" : 0,
"works" : 2,
"advanced" : 1,
"needTime" : 0,
"needYield" : 0,
"saveState" : 0,
"restoreState" : 0,
"isEOF" : 1,
"invalidates" : 0,
"keyPattern" : {
"name" : 1
},
"indexName" : "name_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"name" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"name" : [
"[\"TJT\", \"TJT\"]"
]
},
"keysExamined" : 1,
"seeks" : 1,
"dupsTested" : 0,
"dupsDropped" : 0,
"seenInvalidated" : 0
}
}
}
}

Mongo 3.2 Sub-Document Index issue

Recently we upgraded our MongoDB 2.6 (MMAPv1) to 3.2 (MMAPv1). After the upgrade, the index on a sub-document field is no longer being used. I did a small proof of concept on both databases with query explain.
In MongoDB 3.2 the sub-document index is not considered; can anybody suggest a fix for this?
This is a sample mongo document:
{
"_id" : ObjectId("58bff13e4e6904293cc206b4"),
"Value" : NumberLong(158),
"OVGuid" : NumberLong(0),
"Name" : "User 08/03/2017 03.55.42.782",
"CreateDate" : ISODate("2017-03-08T11:55:42.783Z"),
"RoleLst" : [
{
"_id" : NumberLong(146),
"Name" : "Role1"
},
{
"_id" : NumberLong(108),
"Name" : "Role2"
},
{
"_id" : NumberLong(29),
"Name" : "Role3"
}
]
}
I inserted nearly 100,000 documents with an index on "RoleLst._id" (db.User.createIndex({"RoleLst._id": 1})) in both MongoDB 2.6 and 3.2.
Then I ran the query with explain:
db.User.find({ "RoleLst" : { "$elemMatch" : { "_id" :NumberLong(200)}}}).explain()
This is the result I got from 3.2:
Explain for sub-document query
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "SubDocmentIndexChecking.User",
"indexFilterSet" : false,
"parsedQuery" : {
"RoleLst" : {
"$elemMatch" : {
"_id" : {
"$eq" : NumberLong(200)
}
}
}
},
"winningPlan" : {
"stage" : "COLLSCAN",
"filter" : {
"RoleLst" : {
"$elemMatch" : {
"_id" : {
"$eq" : NumberLong(200)
}
}
}
},
"direction" : "forward"
},
"rejectedPlans" : []
},
"serverInfo" : {
"host" : "******",
"port" : ******,
"version" : "3.2.10",
"gitVersion" : "79d9b3ab5ce20f51c272b4411202710a082d0317"
},
"ok" : 1.0
}
This is the result I got from 2.6:
Explain for sub-document query
{
"cursor" : "BtreeCursor RoleLst._id_1",
"isMultiKey" : true,
"n" : 0,
"nscannedObjects" : 0,
"nscanned" : 0,
"nscannedObjectsAllPlans" : 0,
"nscannedAllPlans" : 0,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"RoleLst._id" : [
[
NumberLong(200),
NumberLong(200)
]
]
},
"server" : "***********",
"filterSet" : false,
"stats" : {
"type" : "KEEP_MUTATIONS",
"works" : 2,
"yields" : 0,
"unyields" : 0,
"invalidates" : 0,
"advanced" : 0,
"needTime" : 1,
"needFetch" : 0,
"isEOF" : 1,
"children" : [
{
"type" : "FETCH",
"works" : 2,
"yields" : 0,
"unyields" : 0,
"invalidates" : 0,
"advanced" : 0,
"needTime" : 1,
"needFetch" : 0,
"isEOF" : 1,
"alreadyHasObj" : 0,
"forcedFetches" : 0,
"matchTested" : 0,
"children" : [
{
"type" : "IXSCAN",
"works" : 1,
"yields" : 0,
"unyields" : 0,
"invalidates" : 0,
"advanced" : 0,
"needTime" : 1,
"needFetch" : 0,
"isEOF" : 1,
"keyPattern" : "{ RoleLst._id: 1.0 }",
"isMultiKey" : 1,
"boundsVerbose" : "field #0['RoleLst._id']: [200, 200]",
"yieldMovedCursor" : 0,
"dupsTested" : 0,
"dupsDropped" : 0,
"seenInvalidated" : 0,
"matchTested" : 0,
"keysExamined" : 0,
"children" : []
}
]
}
]
}
}

Mongodb - poor performance when no results return

I have a MongoDB collection with about 7 million documents that represent places.
I run a query that searches for places whose name starts with a prefix, near a specific location.
We have a compound index, described below, to speed up the search.
When the search query finds a match (even just one), it executes very fast (~20 milliseconds). But when there is no match, it can take 30 seconds to execute.
Please assist.
In detailed:
Each place (geoData) has the following fields:
"loc" - a GeoJSON point that represent the location
"categoriesIds" - array of int ids
"name" - the name of the placee
The following index is defined on this collection:
{
"loc" : "2dsphere",
"categoriesIds" : 1,
"name" : 1
}
The query is:
db.geoData.find({
"loc":{
"$near":{
"$geometry":{
"type": "Point" ,
"coordinates": [ -0.10675191879272461 , 51.531600743186644]
},
"$maxDistance": 5000.0
}
},
"categoriesIds":{
"$in": [ 1 , 2 , 71 , 70 , 74 , 72 , 73 , 69 , 44 , 26 , 27 , 33 , 43 , 45 , 53 , 79]
},
"name":{ "$regex": "^Cafe Ne"}
})
Execution stats
(Link to the whole explain result)
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 1,
"executionTimeMillis" : 169,
"totalKeysExamined" : 14333,
"totalDocsExamined" : 1,
"executionStages" : {
"stage" : "GEO_NEAR_2DSPHERE",
"nReturned" : 1,
"executionTimeMillisEstimate" : 60,
"works" : 14354,
"advanced" : 1,
"needTime" : 14351,
"needFetch" : 0,
"saveState" : 361,
"restoreState" : 361,
"isEOF" : 1,
"invalidates" : 0,
"keyPattern" : {
"loc" : "2dsphere",
"categoriesIds" : 1,
"name" : 1
},
"indexName" : "loc_2dsphere_categoriesIds_1_name_1",
"searchIntervals" : [
{
"minDistance" : 0,
"maxDistance" : 3408.329295346151,
"maxInclusive" : false
},
{
"minDistance" : 3408.329295346151,
"maxDistance" : 5000,
"maxInclusive" : true
}
],
"inputStages" : [
{
"stage" : "FETCH",
"nReturned" : 1,
"executionTimeMillisEstimate" : 20,
"works" : 6413,
"advanced" : 1,
"needTime" : 6411,
"needFetch" : 0,
"saveState" : 361,
"restoreState" : 361,
"isEOF" : 1,
"invalidates" : 0,
"docsExamined" : 1,
"alreadyHasObj" : 0,
"inputStage" : {
"stage" : "IXSCAN",
"filter" : {
"TwoDSphereKeyInRegionExpression" : true
},
"nReturned" : 1,
"executionTimeMillisEstimate" : 20,
"works" : 6413,
"advanced" : 1,
"needTime" : 6411,
"needFetch" : 0,
"saveState" : 361,
"restoreState" : 361,
"isEOF" : 1,
"invalidates" : 0,
"keyPattern" : {
"loc" : "2dsphere",
"categoriesIds" : 1,
"name" : 1
},
"indexName" : "loc_2dsphere_categoriesIds_1_name_1",
"isMultiKey" : true,
"direction" : "forward",
"indexBounds" : {
"loc" : [
"[\"2f1003230\", \"2f1003230\"]",
"[\"2f10032300\", \"2f10032300\"]",
"[\"2f100323000\", \"2f100323000\"]",
"[\"2f1003230001\", \"2f1003230001\"]",
"[\"2f10032300012\", \"2f10032300013\")",
"[\"2f1003230002\", \"2f1003230002\"]",
"[\"2f10032300021\", \"2f10032300022\")",
"[\"2f10032300022\", \"2f10032300023\")",
"[\"2f100323003\", \"2f100323003\"]",
"[\"2f1003230031\", \"2f1003230031\"]",
"[\"2f10032300311\", \"2f10032300312\")",
"[\"2f10032300312\", \"2f10032300313\")",
"[\"2f10032300313\", \"2f10032300314\")",
"[\"2f1003230032\", \"2f1003230032\"]",
"[\"2f10032300320\", \"2f10032300321\")",
"[\"2f10032300321\", \"2f10032300322\")"
],
"categoriesIds" : [
"[1.0, 1.0]",
"[2.0, 2.0]",
"[26.0, 26.0]",
"[27.0, 27.0]",
"[33.0, 33.0]",
"[43.0, 43.0]",
"[44.0, 44.0]",
"[45.0, 45.0]",
"[53.0, 53.0]",
"[69.0, 69.0]",
"[70.0, 70.0]",
"[71.0, 71.0]",
"[72.0, 72.0]",
"[73.0, 73.0]",
"[74.0, 74.0]",
"[79.0, 79.0]"
],
"name" : [
"[\"Cafe Ne\", \"Cafe Nf\")",
"[/^Cafe Ne/, /^Cafe Ne/]"
]
},
"keysExamined" : 6412,
"dupsTested" : 0,
"dupsDropped" : 0,
"seenInvalidated" : 0,
"matchTested" : 1
}
},
{
"stage" : "FETCH",
"nReturned" : 0,
"executionTimeMillisEstimate" : 40,
"works" : 7922,
"advanced" : 0,
"needTime" : 7921,
"needFetch" : 0,
"saveState" : 261,
"restoreState" : 261,
"isEOF" : 1,
"invalidates" : 0,
"docsExamined" : 0,
"alreadyHasObj" : 0,
"inputStage" : {
"stage" : "IXSCAN",
"filter" : {
"TwoDSphereKeyInRegionExpression" : true
},
"nReturned" : 0,
"executionTimeMillisEstimate" : 40,
"works" : 7922,
"advanced" : 0,
"needTime" : 7921,
"needFetch" : 0,
"saveState" : 261,
"restoreState" : 261,
"isEOF" : 1,
"invalidates" : 0,
"keyPattern" : {
"loc" : "2dsphere",
"categoriesIds" : 1,
"name" : 1
},
"indexName" : "loc_2dsphere_categoriesIds_1_name_1",
"isMultiKey" : true,
"direction" : "forward",
"indexBounds" : {
"loc" : [
"[\"2f1003230\", \"2f1003230\"]",
"[\"2f10032300\", \"2f10032300\"]",
"[\"2f100323000\", \"2f100323000\"]",
"[\"2f1003230001\", \"2f1003230001\"]",
"[\"2f10032300011\", \"2f10032300012\")",
"[\"2f10032300012\", \"2f10032300013\")",
"[\"2f1003230002\", \"2f1003230002\"]",
"[\"2f10032300021\", \"2f10032300022\")",
"[\"2f10032300022\", \"2f10032300023\")",
"[\"2f100323003\", \"2f100323003\"]",
"[\"2f1003230031\", \"2f1003230032\")",
"[\"2f1003230032\", \"2f1003230032\"]",
"[\"2f10032300320\", \"2f10032300321\")",
"[\"2f10032300321\", \"2f10032300322\")",
"[\"2f10032300322\", \"2f10032300323\")"
],
"categoriesIds" : [
"[1.0, 1.0]",
"[2.0, 2.0]",
"[26.0, 26.0]",
"[27.0, 27.0]",
"[33.0, 33.0]",
"[43.0, 43.0]",
"[44.0, 44.0]",
"[45.0, 45.0]",
"[53.0, 53.0]",
"[69.0, 69.0]",
"[70.0, 70.0]",
"[71.0, 71.0]",
"[72.0, 72.0]",
"[73.0, 73.0]",
"[74.0, 74.0]",
"[79.0, 79.0]"
],
"name" : [
"[\"Cafe Ne\", \"Cafe Nf\")",
"[/^Cafe Ne/, /^Cafe Ne/]"
]
},
"keysExamined" : 7921,
"dupsTested" : 0,
"dupsDropped" : 0,
"seenInvalidated" : 0,
"matchTested" : 0
}
}
]
},
Execution stats when searching for "CafeNeeNNN" instead of "Cafe Ne"
(Link to the whole explain result )
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 0,
"executionTimeMillis" : 2537,
"totalKeysExamined" : 232259,
"totalDocsExamined" : 162658,
"executionStages" : {
"stage" : "FETCH",
"filter" : {
"$and" : [
{
"name" : /^CafeNeeNNN/
},
{
"categoriesIds" : {
"$in" : [
1,
2,
26,
27,
33,
43,
44,
45,
53,
69,
70,
71,
72,
73,
74,
79
]
}
}
]
},
"nReturned" : 0,
"executionTimeMillisEstimate" : 1330,
"works" : 302752,
"advanced" : 0,
"needTime" : 302750,
"needFetch" : 0,
"saveState" : 4731,
"restoreState" : 4731,
"isEOF" : 1,
"invalidates" : 0,
"docsExamined" : 70486,
"alreadyHasObj" : 70486,
"inputStage" : {
"stage" : "GEO_NEAR_2DSPHERE",
"nReturned" : 70486,
"executionTimeMillisEstimate" : 1290,
"works" : 302751,
"advanced" : 70486,
"needTime" : 232264,
"needFetch" : 0,
"saveState" : 4731,
"restoreState" : 4731,
"isEOF" : 1,
"invalidates" : 0,
"keyPattern" : {
"loc" : "2dsphere"
},
"indexName" : "loc_2dsphere",
"searchIntervals" : [
{
"minDistance" : 0,
"maxDistance" : 3408.329295346151,
"maxInclusive" : false
},
{
"minDistance" : 3408.329295346151,
"maxDistance" : 5000,
"maxInclusive" : true
}
],
"inputStages" : [
{
"stage" : "FETCH",
"nReturned" : 44540,
"executionTimeMillisEstimate" : 110,
"works" : 102690,
"advanced" : 44540,
"needTime" : 58149,
"needFetch" : 0,
"saveState" : 4731,
"restoreState" : 4731,
"isEOF" : 1,
"invalidates" : 0,
"docsExamined" : 44540,
"alreadyHasObj" : 0,
"inputStage" : {
"stage" : "IXSCAN",
"filter" : {
"TwoDSphereKeyInRegionExpression" : true
},
"nReturned" : 44540,
"executionTimeMillisEstimate" : 90,
"works" : 102690,
"advanced" : 44540,
"needTime" : 58149,
"needFetch" : 0,
"saveState" : 4731,
"restoreState" : 4731,
"isEOF" : 1,
"invalidates" : 0,
"keyPattern" : {
"loc" : "2dsphere"
},
"indexName" : "loc_2dsphere",
"isMultiKey" : false,
"direction" : "forward",
"indexBounds" : {
"loc" : [
"[\"2f1003230\", \"2f1003230\"]",
"[\"2f10032300\", \"2f10032300\"]",
"[\"2f100323000\", \"2f100323000\"]",
"[\"2f1003230001\", \"2f1003230001\"]",
"[\"2f10032300012\", \"2f10032300013\")",
"[\"2f1003230002\", \"2f1003230002\"]",
"[\"2f10032300021\", \"2f10032300022\")",
"[\"2f10032300022\", \"2f10032300023\")",
"[\"2f100323003\", \"2f100323003\"]",
"[\"2f1003230031\", \"2f1003230031\"]",
"[\"2f10032300311\", \"2f10032300312\")",
"[\"2f10032300312\", \"2f10032300313\")",
"[\"2f10032300313\", \"2f10032300314\")",
"[\"2f1003230032\", \"2f1003230032\"]",
"[\"2f10032300320\", \"2f10032300321\")",
"[\"2f10032300321\", \"2f10032300322\")"
]
},
"keysExamined" : 102689,
"dupsTested" : 0,
"dupsDropped" : 0,
"seenInvalidated" : 0,
"matchTested" : 44540
}
},
{
"stage" : "FETCH",
"nReturned" : 47632,
"executionTimeMillisEstimate" : 250,
"works" : 129571,
"advanced" : 47632,
"needTime" : 81938,
"needFetch" : 0,
"saveState" : 2556,
"restoreState" : 2556,
"isEOF" : 1,
"invalidates" : 0,
"docsExamined" : 47632,
"alreadyHasObj" : 0,
"inputStage" : {
"stage" : "IXSCAN",
"filter" : {
"TwoDSphereKeyInRegionExpression" : true
},
"nReturned" : 47632,
"executionTimeMillisEstimate" : 230,
"works" : 129571,
"advanced" : 47632,
"needTime" : 81938,
"needFetch" : 0,
"saveState" : 2556,
"restoreState" : 2556,
"isEOF" : 1,
"invalidates" : 0,
"keyPattern" : {
"loc" : "2dsphere"
},
"indexName" : "loc_2dsphere",
"isMultiKey" : false,
"direction" : "forward",
"indexBounds" : {
"loc" : [
"[\"2f1003230\", \"2f1003230\"]",
"[\"2f10032300\", \"2f10032300\"]",
"[\"2f100323000\", \"2f100323000\"]",
"[\"2f1003230001\", \"2f1003230001\"]",
"[\"2f10032300011\", \"2f10032300012\")",
"[\"2f10032300012\", \"2f10032300013\")",
"[\"2f1003230002\", \"2f1003230002\"]",
"[\"2f10032300021\", \"2f10032300022\")",
"[\"2f10032300022\", \"2f10032300023\")",
"[\"2f100323003\", \"2f100323003\"]",
"[\"2f1003230031\", \"2f1003230032\")",
"[\"2f1003230032\", \"2f1003230032\"]",
"[\"2f10032300320\", \"2f10032300321\")",
"[\"2f10032300321\", \"2f10032300322\")",
"[\"2f10032300322\", \"2f10032300323\")"
]
},
"keysExamined" : 129570,
"dupsTested" : 0,
"dupsDropped" : 0,
"seenInvalidated" : 0,
"matchTested" : 47632
}
}
]
}
},
Indexes on the collection
{
"0" : {
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "wego.geoData"
},
"1" : {
"v" : 1,
"key" : {
"srcId" : 1
},
"name" : "srcId_1",
"ns" : "wego.geoData"
},
"2" : {
"v" : 1,
"key" : {
"loc" : "2dsphere"
},
"name" : "loc_2dsphere",
"ns" : "wego.geoData",
"2dsphereIndexVersion" : 2
},
"3" : {
"v" : 1,
"key" : {
"name" : 1
},
"name" : "name_1",
"ns" : "wego.geoData"
},
"4" : {
"v" : 1,
"key" : {
"loc" : "2dsphere",
"categoriesIds" : 1,
"name" : 1
},
"name" : "loc_2dsphere_categoriesIds_1_name_1",
"ns" : "wego.geoData",
"2dsphereIndexVersion" : 2
},
"5" : {
"v" : 1,
"key" : {
"loc" : "2dsphere",
"categoriesIds" : 1,
"keywords" : 1
},
"name" : "loc_2dsphere_categoriesIds_1_keywords_1",
"ns" : "wego.geoData",
"2dsphereIndexVersion" : 2
}
}
Collection stats link
I am going to speculate here a bit, and then comment on your design.
First, when you create an index on a key whose value is an array, MongoDB creates an index entry for each element of the array:
To index a field that holds an array value, MongoDB creates an index
key for each element in the array.
This is from MongoDB's own documentation about indexes.
So, if your typical record has more than a handful of categories and you have 7 million records, your index is huge, and it also takes time to scan the index itself just to find out that it does not contain what you are looking for. That is still faster than a collection scan, but it is darn slow compared to how fast it is to find an existing record.
Now, let me comment about your schema design. This is a matter of style so feel free to ignore this part.
You have a record which might be in 17 categories. That is a bit overwhelming, and it overuses the term category. A category is a specific division, a way to quickly associate a thing with a group of things. What kind of thing belongs to so many groups?
Let's take for example your record Cafe Ne. I assume that in the real world (and please remember, programming and applications are at their best when they solve real-world problems) Cafe Ne is either a restaurant, a cafe, a jazz bar, or a diner. It's surely not a garage (unless cafe means cars in a language I don't know). I can hardly imagine it's a bank or a dental clinic. I'd have to really make an effort to find more than 10 meaningful categories that users would search a cafe by.
My point is: even though MongoDB allows you to design things like that, it does not mean you have to. Try to narrow down the number of categories you have and the ones you search for, and you will get much better performance.
As JohnnyHK suggested in comments, and Oz123 pointed to in his answer, the issue here appears to be an index that has grown so large that it fails to perform well as an index. I believe that in addition to the category expansion issue that has already been pointed out, the ordering of fields in your index creates trouble. Compound indexes are built according to the order of fields, and putting name after categoriesIds makes it more costly to query on name.
It's clear that you need to tune your indexes. Exactly how you tune them depends on the types of queries that you are expecting to support. In particular, I'm not sure whether you'll see better performance from a compound index of loc and name, or from individual indexes, one for loc and one for name. MongoDB themselves are a little vague on when it's best to use a compound index and when it's best to use individual indexes and rely on index intersection.
My intuition says that individual indexes will perform better, but I'd test both scenarios.
If you anticipate needing to query by category as well, without name or loc fields that could narrow the query down, it's probably best to create a separate categoriesIds index.
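A sketch of that last suggestion in the shell (the single-field loc and name indexes already exist according to the index list above, so only the category index would be new):
db.geoData.createIndex({ "categoriesIds" : 1 })
// then re-run the query from the question with .explain("executionStats") and
// compare keysExamined/docsExamined against the compound-index plan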
The order of the fields in a compound index is very important. It's hard to diagnose without having access to the real data and usage patterns, but this key might increase the odds of matching (or not) the document using only the index:
{
"loc" : "2dsphere",
"name" : 1,
"categoriesIds" : 1
}
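As a shell command, that would be something like the following (a sketch; the index name is generated automatically):
db.geoData.createIndex({ "loc" : "2dsphere", "name" : 1, "categoriesIds" : 1 })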
Not sure if it is exactly the same issue but we had a similar problem with a multikey index with poor performance when no results were found.
It is actually a Mongo bug that was fixed in v3.3.8.
https://jira.mongodb.org/browse/SERVER-15086
We fixed our problems after upgrading Mongo and rebuilding the index.
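In case it helps, the rebuild itself is straightforward; a sketch, using the compound index name from the index list above:
db.geoData.dropIndex("loc_2dsphere_categoriesIds_1_name_1")
db.geoData.createIndex({ "loc" : "2dsphere", "categoriesIds" : 1, "name" : 1 })
// or rebuild every index on the collection in one go:
db.geoData.reIndex()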

Mongodb query excessively slow

I have a collection of tweets, with indexes on userid and tweeted_at (date). I want to find the dates of the oldest and newest tweets in the collection for a user, but the query runs very slowly.
I used explain, and here's what I got. I tried reading the documentation for explain, but I don't understand what is going on here. Is the explain just on the sort? If so, why does it take so long when it's using the index?
> db.tweets.find({userid:50263}).sort({tweeted_at:-1}).limit(1).explain(1)
{
"cursor" : "BtreeCursor tweeted_at_1 reverse",
"isMultiKey" : false,
"n" : 0,
"nscannedObjects" : 12705,
"nscanned" : 12705,
"nscannedObjectsAllPlans" : 12705,
"nscannedAllPlans" : 12705,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 188,
"nChunkSkips" : 0,
"millis" : 7720,
"indexBounds" : {
"tweeted_at" : [
[
{
"$maxElement" : 1
},
{
"$minElement" : 1
}
]
]
},
"allPlans" : [
{
"cursor" : "BtreeCursor tweeted_at_1 reverse",
"isMultiKey" : false,
"n" : 0,
"nscannedObjects" : 12705,
"nscanned" : 12705,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"tweeted_at" : [
[
{
"$maxElement" : 1
},
{
"$minElement" : 1
}
]
]
}
}
],
"server" : "adams-server:27017",
"filterSet" : false,
"stats" : {
"type" : "LIMIT",
"works" : 12807,
"yields" : 188,
"unyields" : 188,
"invalidates" : 0,
"advanced" : 0,
"needTime" : 12705,
"needFetch" : 101,
"isEOF" : 1,
"children" : [
{
"type" : "FETCH",
"works" : 12807,
"yields" : 188,
"unyields" : 188,
"invalidates" : 0,
"advanced" : 0,
"needTime" : 12705,
"needFetch" : 101,
"isEOF" : 1,
"alreadyHasObj" : 0,
"forcedFetches" : 0,
"matchTested" : 0,
"children" : [
{
"type" : "IXSCAN",
"works" : 12705,
"yields" : 188,
"unyields" : 188,
"invalidates" : 0,
"advanced" : 12705,
"needTime" : 0,
"needFetch" : 0,
"isEOF" : 1,
"keyPattern" : "{ tweeted_at: 1.
0 }",
"boundsVerbose" : "field #0['twe
eted_at']: [MaxKey, MinKey]",
"isMultiKey" : 0,
"yieldMovedCursor" : 0,
"dupsTested" : 0,
"dupsDropped" : 0,
"seenInvalidated" : 0,
"matchTested" : 0,
"keysExamined" : 12705,
"children" : [ ]
}
]
}
]
}
}
>
> db.tweets.getIndexes()
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "honeypot.tweets"
},
{
"v" : 1,
"unique" : true,
"key" : {
"tweet_id" : 1
},
"name" : "tweet_id_1",
"ns" : "honeypot.tweets",
"dropDups" : true
},
{
"v" : 1,
"key" : {
"tweeted_at" : 1
},
"name" : "tweeted_at_1",
"ns" : "honeypot.tweets"
},
{
"v" : 1,
"key" : {
"keywords" : 1
},
"name" : "keywords_1",
"ns" : "honeypot.tweets"
},
{
"v" : 1,
"key" : {
"user_id" : 1
},
"name" : "user_id_1",
"ns" : "honeypot.tweets"
}
]
>
By looking at the cursor field you can see which index was used:
"cursor" : "BtreeCursor tweeted_at_1 reverse",
BtreeCursor indicates that the query used an index; tweeted_at_1 is the name of the index that was used, traversed in reverse.
You should check the explain documentation for a detailed description of each field.
Your query took 7720 ms (millis) and 12705 index entries were scanned (nscanned).
The query is slow because MongoDB had to inspect 12705 documents one by one, checking the userid criterion against each of them: your index was used only for sorting the data, not for filtering.
To create an index that will be used for both querying and sorting, you should create a compound index. A compound index is a single index structure that references multiple fields; you can create one with up to 31 fields. You can create a compound index like this (the order of the fields is important):
db.tweets.ensureIndex({userid: 1, tweeted_at: -1});
This index will be used for searching on userid field and to sort by tweeted_at field.
You can read and see more examples about adding indexes for sorting here.
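Once the compound index is in place, re-running the explain should confirm it is being picked up; a sketch (the exact cursor name depends on the generated index name):
db.tweets.find({ userid: 50263 }).sort({ tweeted_at: -1 }).limit(1).explain()
// expect a cursor like "BtreeCursor userid_1_tweeted_at_-1" and an nscanned close to
// the number of tweets for that user, instead of 12705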
Edit
If you have other indexes, MongoDB may be using one of them. When you're testing query performance, you can use hint to force a specific index.
When testing the performance of your queries you should always run multiple tests and average the results.
Also, if your queries are slow even when using indexes, I would check whether you have enough memory on the server. Loading data from disk is an order of magnitude slower than loading it from memory. You should always ensure that you have enough RAM, so that all of your data and indexes fit in memory.
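For example, to force a particular index while comparing plans (a sketch, assuming the compound index suggested above has been created):
db.tweets.find({ userid: 50263 }).sort({ tweeted_at: -1 }).limit(1).hint({ userid: 1, tweeted_at: -1 }).explain()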
Looks like you need to create an index on tweeted_at and userid.
db.tweets.ensureIndex({'tweeted_at':1, 'userid':1})
That should make the query very quick indeed (but at a cost of storage and insert time)