One of my aggregation pipelines is running rather slowly.
About the collection
The collection is named Document. Each document can belong to multiple campaigns and is in one of five statuses, 'a' to 'e'. A small portion of documents may belong to no campaign, in which case the campaigns field is set to null.
Sample document:
{_id:id, campaigns:['c1', 'c2'], status:'a', ...other fields...}
Some collection stats
Number of documents: 2 million only :(
Size: 2GB
Average document size: 980 bytes.
Storage Size: 780MB
Total index size: 134MB
Number of indexes: 12
Number of fields in document: 30-40, may have array or objects.
About the Query
The query counts the number of documents per campaign per status, for documents whose status is in ['a', 'b', 'c'].
[
{$match:{campaigns:{$ne:null}, status:{$in:['a','b','c']}}},
{$unwind:'$campaigns'},
{$group:{_id:{campaign:'$campaigns', status:'$status'}, total:{$sum:1}}}
]
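To make the intent concrete, here is a minimal plain-JavaScript sketch of what the pipeline computes. The sample documents and the function name are made up for illustration; this only mirrors the $match, $unwind and $group steps in memory and says nothing about how MongoDB executes them.

```javascript
// Hypothetical sample documents; only the fields the pipeline reads are shown.
const sampleDocs = [
  { _id: 1, campaigns: ["c1", "c2"], status: "a" },
  { _id: 2, campaigns: ["c1"], status: "a" },
  { _id: 3, campaigns: null, status: "b" },   // dropped: no campaigns
  { _id: 4, campaigns: ["c2"], status: "e" }, // dropped: status not in a, b, c
  { _id: 5, campaigns: ["c2"], status: "c" },
];

function countPerCampaignAndStatus(documents) {
  const totals = new Map();
  for (const doc of documents) {
    // $match: campaigns is not null and status is one of a, b, c
    if (doc.campaigns == null || !["a", "b", "c"].includes(doc.status)) continue;
    // $unwind: one row per campaign in the array
    for (const campaign of doc.campaigns) {
      // $group: count rows per (campaign, status) pair
      const key = `${campaign}|${doc.status}`;
      totals.set(key, (totals.get(key) || 0) + 1);
    }
  }
  return totals;
}

const totals = countPerCampaignAndStatus(sampleDocs);
console.log(Object.fromEntries(totals));
```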
It's expected that the aggregation is going to hit almost the whole collection.
Without an index, the aggregation takes around 8 seconds to complete.
I tried to create an index on
{campaigns:1, status:1}
The explain plan shows that the index was scanned, but the aggregation took nearly 11 seconds to complete.
Question
The index contains all the fields the aggregation needs to do the counting. Shouldn't the aggregation hit only the index? The index is only 10MB in size. How could it be slower? If an index is not the answer, are there any other recommendations for tuning the query?
Winning plan shows:
{
"stage" : "FETCH",
"filter" : {"$not" : {"campaigns" : {"$eq" : null}}},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {"campaigns" : 1.0,"status" : 1.0},
"indexName" : "campaigns_1_status_1",
"isMultiKey" : true,
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 1,
"direction" : "forward",
"indexBounds" : {
"campaigns" : ["[MinKey, null)", "(null, MaxKey]"],
"status" : [ "[\"a\", \"a\"]", "[\"b\", \"b\"]", "[\"c\", \"c\"]"]
}
}
}
If no index, winning plan:
{
"stage" : "COLLSCAN",
"filter" : {
"$and":[
{"status": {"$in": ["a", "b", "c"]}},
{"$not" : {"campaigns": {"$eq" : null}}}
]
},
"direction" : "forward"
}
Update
As requested by @Kevin, here are some details about all the other indexes, sizes in MB.
"indexSizes" : {
"_id_" : 32,
"team_1" : 8, //Single value field of ObjectId
"created_time_1" : 16, //Document publish time in source system.
"parent_1" : 2, //_id of parent document.
"by.id_1" : 13, //_id of author from a different collection.
"feedids_1" : 8, //Array, _id of ETL jobs contributing to sync of this doc.
"init_-1" : 2, //Initial load time of the doc.
"campaigns_1" : 10, //Array, _id of campaigns
"last_fetch_-1" : 13, //Last sync time of the doc.
"categories_1" : 8, //Array, _id of document categories.
"status_1" : 8, //Status
"campaigns_1_status_1" : 10 //Combined index of campaign _id and status.
},
After reading the docs from MongoDB, I found this:
The inequality operator $ne is not very selective since it often matches a large portion of the index. As a result, in many cases, a $ne query with an index may perform no better than a $ne query that must scan all documents in a collection. See also Query Selectivity.
Looking at a few different articles, using the $type operator might solve the problem.
Could you use this query:
db.data.aggregate([
{$match:{campaigns:{$type:2},status:{$in:["a","b","c"]}}},
{$unwind:'$campaigns'},
{$group:{_id:{campaign:'$campaigns', status:'$status'}, total:{$sum:1}}}])
Related
When explaining a query on a collection having these indexes:
{"user_id": 1, "req.time_us": 1}
{"user_id": 1, "req.uri":1, "req.time_us": 1}
with command like:
db.some_collection.find({"user_id":12345,"req.time_us":{"$gte":1657509059545812,"$lt":1667522903018337}}).limit(20).explain("executionStats")
The winning plan was:
"inputStage" : {
"stage" : "IXSCAN",
"nReturned" : 20,
"executionTimeMillisEstimate" : 0,
"works" : 20,
"advanced" : 20,
...
"keyPattern" : {
"user_id" : 1,
"req.uri" : 1,
"req.time_us" : 1
},
"indexName" : "user_id_1_req.uri_1_req.time_us_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"user_id" : [ ],
"req.uri" : [ ],
"req.time_us" : [ ]
},
...
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"user_id" : [
"[23456.0, 23456.0]"
],
"req.uri" : [
"[MinKey, MaxKey]"
],
"req.time_us" : [
"[1657509059545812.0, 1667522903018337.0)"
]
},
"keysExamined" : 20,
"seeks" : 1,
...
}
Why was the index user_id_1_req.uri_1_req.time_us_1 used rather than user_id_1_req.time_us_1? The official manual says a compound index can support queries that match the prefix fields of the index.
This behavior is explained in the documentation. To paraphrase:
MongoDB runs the query optimizer to choose the winning plan and executes the winning plan to completion.
During plan selection, if more than one index can satisfy a query, MongoDB runs a trial using all the valid plans to determine which one performs best. You can read about this process more here.
As of MongoDB 3.4.6, plan selection involves running the candidate plans in parallel in a "race" to see which candidate plan returns 101 results first.
So basically these 2 indexes had a mini competition and the "wrong" index won. This can happen because these competitions can be heavily skewed by data distribution when the indexes are similar.
(For example, imagine the first 101 documents in the collection match the query: then the "better" index will actually be slower, as it continues to scan deeper into the index tree while the "worse" index starts fetching documents immediately.)
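As a rough illustration of how such a race can end in a near tie, here is a toy model in plain JavaScript. The data, the helper name worksToReach and the "one unit of work per key examined" rule are all assumptions of this sketch; the real planner scores plans on more than this.

```javascript
// Toy model of the plan "race": each candidate examines index keys in its own
// order, and its trial ends once it has produced `target` results.
function worksToReach(keysInIndexOrder, matches, target) {
  let works = 0;
  let returned = 0;
  for (const key of keysInIndexOrder) {
    works += 1;                      // one unit of work per key examined
    if (matches(key)) returned += 1; // key satisfies the time range
    if (returned === target) break;
  }
  return { works, returned };
}

// Hypothetical data: 1,000 requests for one user, all inside the queried range.
const from = 1000;
const to = 2000;
const requests = Array.from({ length: 1000 }, (_, i) => ({
  uri: `/page/${i % 10}`,
  time: from + i,
}));
const inRange = (r) => r.time >= from && r.time < to;

// {user_id, req.time_us}: keys ordered by time.
const byTime = [...requests].sort((a, b) => a.time - b.time);
// {user_id, req.uri, req.time_us}: keys ordered by uri first, then time.
const byUriThenTime = [...requests].sort(
  (a, b) => a.uri.localeCompare(b.uri) || a.time - b.time
);

// When the first keys of both indexes all match, neither candidate is ahead.
const narrow = worksToReach(byTime, inRange, 101);
const wide = worksToReach(byUriThenTime, inRange, 101);

// When only the later half of the time range matches, the wider index has to
// step over non-matching keys and needs more work for the same 101 results.
const inSecondHalf = (r) => r.time >= from + 500;
const wideSelective = worksToReach(byUriThenTime, inSecondHalf, 101);

console.log({ narrow, wide, wideSelective });
```

In this model the two indexes are indistinguishable whenever the early keys all match, which is one way a trial can pick the index you did not expect.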
For cases like this, I recommend using $hint, which essentially forces MongoDB to use the index you deem most fit.
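A minimal sketch of what that could look like in the shell, using the cursor's hint() method. The collection, field and index key names come from the question; the helper name findWithNarrowIndex is made up, and you would pass it the shell's db object.

```javascript
// Sketch for mongosh: call as findWithNarrowIndex(db, 12345, fromUs, toUs).
// The helper name is hypothetical; collection and fields follow the question.
function findWithNarrowIndex(db, userId, fromUs, toUs) {
  return db.some_collection
    .find({ user_id: userId, "req.time_us": { $gte: fromUs, $lt: toUs } })
    .hint({ user_id: 1, "req.time_us": 1 }) // force the two-field index
    .limit(20);
}
```

Appending .explain("executionStats") to the returned cursor should then show which index the IXSCAN stage used.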
In my use case, I want to search a document by a given unique string in MongoDB. However, I want my queries to be fast and searching by _id will add some overhead. I want to know if there are any benefits in MongoDB to search a document by _id over any other unique value?
To my knowledge, ObjectIds are similar to any other unique value in a document (a point made for the case of searching only).
As for the overhead, you can assume I am caching the string to objectID and the cache is very small and in memory [Almost negligible], though the DB is large.
Analyzing your query performance
I advise you to use .explain(), provided by MongoDB, to analyze your query performance.
Let's say we are trying to execute this query
db.inventory.find( { quantity: { $gte: 100, $lte: 200 } } )
This would be the result of the query execution
{ "_id" : 2, "item" : "f2", "type" : "food", "quantity" : 100 }
{ "_id" : 3, "item" : "p1", "type" : "paper", "quantity" : 200 }
{ "_id" : 4, "item" : "p2", "type" : "paper", "quantity" : 150 }
If we call .explain() this way
db.inventory.find(
{ quantity: { $gte: 100, $lte: 200 } }
).explain("executionStats")
It will return the following result:
{
"queryPlanner" : {
"plannerVersion" : 1,
...
"winningPlan" : {
"stage" : "COLLSCAN",
...
}
},
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 3,
"executionTimeMillis" : 0,
"totalKeysExamined" : 0,
"totalDocsExamined" : 10,
"executionStages" : {
"stage" : "COLLSCAN",
...
},
...
},
...
}
More details about this can be found here
How efficient is search by _id and indexes
To answer your question: looking a document up through an index is far more efficient than searching without one. Indexes are special data structures that store a small portion of the collection's data set in an easy-to-traverse form. Since _id is indexed by default in MongoDB, lookups by _id are efficient.
Without indexes, MongoDB must perform a collection scan, i.e. scan every document in a collection, to select those documents that match the query statement.
So, YES, using indexes like _id is better!
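To give a feel for the difference, here is a toy comparison in plain JavaScript. A sorted array with binary search stands in for the index's B-tree; this is only a model with made-up data, not how MongoDB is implemented, and the same reasoning applies to _id or to any other uniquely indexed field.

```javascript
// Without an index: examine documents one by one until the key is found.
function collectionScan(docs, key) {
  let examined = 0;
  for (const doc of docs) {
    examined += 1;
    if (doc.key === key) return { doc, examined };
  }
  return { doc: null, examined };
}

// With an index: binary search over sorted keys, then follow the pointer.
function indexLookup(sortedKeys, docsByKey, key) {
  let lo = 0;
  let hi = sortedKeys.length - 1;
  let examined = 0;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    examined += 1;
    if (sortedKeys[mid] === key) return { doc: docsByKey.get(key), examined };
    if (sortedKeys[mid] < key) lo = mid + 1;
    else hi = mid - 1;
  }
  return { doc: null, examined };
}

// Hypothetical collection: 100,000 documents with zero-padded unique string keys.
const users = Array.from({ length: 100000 }, (_, i) => ({
  key: `user-${String(i).padStart(6, "0")}`,
}));
const sortedKeys = users.map((u) => u.key); // already in sorted order
const docsByKey = new Map(users.map((u) => [u.key, u]));

const scanResult = collectionScan(users, "user-099999");
const indexResult = indexLookup(sortedKeys, docsByKey, "user-099999");
console.log(scanResult.examined, indexResult.examined);
```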
You can also create your own indexes by using createIndex()
db.collection.createIndex( <key and index type specification>, <options> )
Optimize your MongoDB query
In case you want to optimize your query, there are multiple ways to do that.
Creating custom indexes to support your queries
Limit the Number of Query Results to Reduce Network Demand
db.posts.find().sort( { timestamp : -1 } ).limit(10)
Use Projections to Return Only Necessary Data
db.posts.find( {}, { timestamp : 1 , title : 1 , author : 1 , abstract : 1} ).sort( { timestamp : -1 } )
Use $hint to Select a Particular Index
db.users.find().hint( { age: 1 } )
Short answer, yes _id is the primary key and it's indexed. Of course it's fast.
But you can use an index on the other fields too and get more efficient queries.
I have a Mongo collection containing millions of documents with the following format:
{
"_id" : ObjectId("5ac37fa989e00723fc4c7746"),
"group-number" : NumberLong(128125089),
"date" : ISODate("2018-04-03T13:20:41.193Z")
}
And I want to retrieve the documents between 2 dates ('date') sorted by 'group-number'. So, I am executing this kind of queries:
db.getCollection('group').find({date:{$gt:new Date(1491372960000),$lt:new Date(1553152560000)}}).sort({"group-number":1})
According to https://blog.mlab.com/2012/06/cardinal-ins/, when a query filters on a range rather than on equality (as in my case), it seems better to have the index in the inverse order (first the sort field, then the filter field).
Indeed, I've had the best results with the index db.group.createIndex({"group-number":1,"date":1});. But it still takes too long; in some cases more than 40 seconds.
According to the explain() results, indeed the above index is being used.
"winningPlan" : {
"stage" : "FETCH",
"filter" : {
"$and" : [
{
"date" : {
"$lt" : ISODate("2019-03-21T07:16:00.000Z")
}
},
{
"date" : {
"$gt" : ISODate("2017-04-05T06:16:00.000Z")
}
}
]
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"group-number" : 1.0,
"date" : 1.0
},
"indexName" : "group-number_1_date_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"group-number" : [],
"date" : []
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"group-number" : [
"[MinKey, MaxKey]"
],
"date" : [
"[MinKey, MaxKey]"
]
}
}
}
How can I improve the performance? I must be missing something...
I'd build the index the other way around: db.group.createIndex({date: 1, 'group-number': 1}). You are actually querying on the date field, so it should come first in the compound index; you are only using group-number for sorting. That way it is easier for WiredTiger to find the necessary documents in the B-tree.
According to the explain() results, indeed the above index is being used.
There is an important distinction between an index being used and an index being used efficiently. Taking a look at the index usage portion of the explain output, we have the following:
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"group-number" : 1.0,
"date" : 1.0
},
"indexName" : "group-number_1_date_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"group-number" : [],
"date" : []
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"group-number" : [
"[MinKey, MaxKey]"
],
"date" : [
"[MinKey, MaxKey]"
]
}
}
There are two (related) important observations here:
The index scan has not been bounded at all. The bounds for all (both) keys are [MinKey, MaxKey]. This means that the operation is scanning the entire index.
The restrictions on the date field expressed by the query predicate are not present in either the index bounds (noted above) or even as a separate filter during the index scanning phase.
What we see instead is that the date bounds are only being applied after the full document has been retrieved:
"stage" : "FETCH",
"filter" : {
"$and" : [
{
"date" : {
"$lt" : ISODate("2019-03-21T07:16:00.000Z")
}
},
{
"date" : {
"$gt" : ISODate("2017-04-05T06:16:00.000Z")
}
}
]
},
Taken together, this means that the operation that originally generated the explain output:
Scanned the entire index
Individually retrieved the full document associated with each key
Filtered out the documents that did not match the date predicate
Returned the remaining documents to the client
The only benefit that the index provided was the fact that it provided the results in sorted order. This may or may not be faster than just doing a full collection scan instead. That would depend on things like the number of matching results as well as the total number of documents in the collection.
Bounding date
An important question to ask would be why the database was not using the date field from the index more effectively?
As far as I can tell, this is (still) a quirk of how MongoDB creates index bounds. For whatever reason, it does not seem to recognize that the second index key can have bounds applied to it despite the fact that the first one does not.
We can, however, trick it into doing so. In particular we can apply a predicate against the sort field (group-number) that doesn't change the results. An example (using the newer mongosh shell) would be "group-number" :{$gte: MinKey}. This would make the full query:
db.getCollection('group').find({"group-number" :{$gte: MinKey}, date:{$gt:new Date(1491372960000),$lt:new Date(1553152560000)}}).sort({"group-number":1})
The explain for this adjusted query generates:
winningPlan: {
stage: 'FETCH',
inputStage: {
stage: 'IXSCAN',
keyPattern: { 'group-number': 1, date: 1 },
indexName: 'group-number_1_date_1',
isMultiKey: false,
multiKeyPaths: { 'group-number': [], date: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: {
'group-number': [ '[MinKey, MaxKey]' ],
date: [ '(new Date(1491372960000), new Date(1553152560000))' ]
}
}
}
We can see above that the date field is now bounded as expected preventing the database from having to unnecessarily retrieve documents that do not match the query predicate. This would likely provide some improvement to the query, but it is impossible to say how much without knowing more about the data distribution.
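The difference between the two plans can be sketched with a toy model in plain JavaScript. Everything here is hypothetical (the data, the function names, numbers standing in for dates); it only counts how many full documents each approach has to fetch.

```javascript
// Toy model of an index on { "group-number": 1, date: 1 }: each entry holds
// the key values plus a pointer (here the _id) to the full document.
function buildIndex(docs) {
  return docs
    .map((d) => ({ group: d.group, date: d.date, id: d._id }))
    .sort((a, b) => a.group - b.group || a.date - b.date);
}

// Unbounded date: every entry leads to a fetch, and the filter runs on the document.
function scanThenFilter(index, docsById, from, to) {
  let fetched = 0;
  const out = [];
  for (const entry of index) {
    const doc = docsById.get(entry.id);
    fetched += 1;
    if (doc.date > from && doc.date < to) out.push(doc);
  }
  return { out, fetched };
}

// Bounded date: the range is checked against the index key before fetching.
function filterThenFetch(index, docsById, from, to) {
  let fetched = 0;
  const out = [];
  for (const entry of index) {
    if (!(entry.date > from && entry.date < to)) continue;
    out.push(docsById.get(entry.id));
    fetched += 1;
  }
  return { out, fetched };
}

// Hypothetical data: 1,000 documents, 10% of them inside the date range.
const groupDocs = Array.from({ length: 1000 }, (_, i) => ({
  _id: i,
  group: (i * 7) % 50,
  date: i,
}));
const docsById = new Map(groupDocs.map((d) => [d._id, d]));
const groupIndex = buildIndex(groupDocs);

const unbounded = scanThenFilter(groupIndex, docsById, 99, 200);
const bounded = filterThenFetch(groupIndex, docsById, 99, 200);
console.log(unbounded.fetched, bounded.fetched);
```

Both functions return the same documents in the same (sorted) order; only the number of fetches differs, which is why the size of the gain depends on how many documents fall outside the range.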
Other Observations
The index noted in the other answer swaps the order of the index keys. This may reduce the total number of index keys that need to be examined in order to execute the full query. However as noted in the comments, it prevents the database from using the index to provide sorted results. While there is always a tradeoff when it comes to queries that both use range operators and request a sort, my suspicion is that the index described in the question will be superior for this particular situation.
Was the sample document described in the question the full document? If so, then the database is only forced to retrieve the full document in order to gather the _id field to return to the client. You could transform this operation into a covered query (one that can return results directly from the index alone without having to retrieve the full document) by either:
Projecting out the _id field in the query if the client does not need it, or
Appending _id to the index in the last position if the client does want it.
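The first option could look roughly like this in the shell. The collection and field names follow the question; the helper name coveredFind is made up, and whether the resulting plan is really covered should be confirmed with explain() (no FETCH stage) rather than assumed.

```javascript
// Sketch for mongosh: call as coveredFind(db, new Date(...), new Date(...)).
// Only fields stored in the { "group-number": 1, date: 1 } index are projected.
function coveredFind(db, from, to) {
  return db
    .getCollection("group")
    .find(
      { date: { $gt: from, $lt: to } },
      { _id: 0, "group-number": 1, date: 1 } // exclude _id, keep indexed fields
    )
    .sort({ "group-number": 1 });
}
```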
I'm evaluating the performance of the following query.
db.products_old.find({ regularPrice: { $lte: 200 } })
The collection has a bit over a million documents, in total around 0.15GB.
No indexes
This is expected. A full collection scan has to be done
"executionTimeMillis" : 1019,
"winningPlan" : {
"stage" : "COLLSCAN",
"filter" : {
"regularPrice" : {
"$lte" : 200
}
},
"direction" : "forward"
},
Index { regularPrice: 1 }
"executionTimeMillis" : 2842,
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"regularPrice" : 1
},
"indexName" : "regularPrice_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"regularPrice" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"regularPrice" : [
"[-inf.0, 200.0]"
]
}
}
},
Now it uses the index, but the execution time is noticeably worse. Why?
Also, if it's worse performance, why doesn't Mongo use the COLLSCAN instead of using the index which slows down the execution? rejectedPlans is empty, which suggests that no other plan was even considered. Why?
Here's the full allPlansExecution output.
While doing COLLSCAN, MongoDB is reading from the storage drive and storing matching documents in the RAM for later use directly. On the other hand, IXSCAN reads the index which stores indexed data and pointers to their location on the storage drive. (Here's a nice visualisation from slide 6 to around slide 20)
You have a lot of documents in your collection, but you also have a lot of matching documents in your index. The data stored on the storage drive is not stored in the best way to read it from it (like it is in the index), so when the IXSCAN returns pointers to 220k+ documents it found for your query, FETCH needs to read 220k+ times from the storage drive in a random access way. Which is slow. On the other hand I assume that COLLSCAN is doing sequential read which is probably done page by page and is a lot faster than FETCH reads.
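A toy I/O model in plain JavaScript shows the effect. It assumes documents live in fixed-size pages and that only the most recently read page is cached; real storage engines cache far more, so the numbers are purely illustrative and every name and constant below is made up.

```javascript
const DOCS = 100000;
const DOCS_PER_PAGE = 100;

// Deterministic pseudo-random prices so the run is repeatable.
let seed = 42;
function nextRandom() {
  seed = (seed * 1664525 + 1013904223) % 4294967296;
  return seed / 4294967296;
}
const prices = Array.from({ length: DOCS }, () => Math.floor(nextRandom() * 1000));

// COLLSCAN: every page is read once, in storage order.
function collscanPageReads() {
  return Math.ceil(DOCS / DOCS_PER_PAGE);
}

// IXSCAN + FETCH: matching documents are visited in price order, so
// consecutive fetches usually land on different pages.
function ixscanFetchPageReads(maxPrice) {
  const matches = [];
  for (let pos = 0; pos < DOCS; pos++) {
    if (prices[pos] <= maxPrice) matches.push(pos);
  }
  matches.sort((x, y) => prices[x] - prices[y] || x - y); // index order
  let reads = 0;
  let cachedPage = -1;
  for (const pos of matches) {
    const page = Math.floor(pos / DOCS_PER_PAGE);
    if (page !== cachedPage) {
      reads += 1; // page not cached: count another read
      cachedPage = page;
    }
  }
  return { matched: matches.length, reads };
}

const viaIndex = ixscanFetchPageReads(200);
console.log("collscan page reads:", collscanPageReads());
console.log("ixscan+fetch page reads:", viaIndex.reads, "for", viaIndex.matched, "matches");
```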
So to sum up: it's not the index that's slowing you down, it's the FETCH stage. If you want to keep using this index and get a faster query, project only the indexed field and exclude _id (in Mongoose, .select('-_id regularPrice')), which adds a quick PROJECTION stage and reads all necessary fields from the index. Or, if you need _id, add an index {regularPrice: 1, _id: 1}.
As for why MongoDB uses the index even though a collection scan would be faster: I think that if it sees a usable index, it will use it. You can force a collection scan by passing {$natural: 1} to the hint method.
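In shell terms, the two alternatives could be sketched like this. The collection and field names are from the question; the helper name cheapProducts is made up, and each cursor should be checked with .explain("executionStats") to see which stages actually run.

```javascript
// Sketch for mongosh: call as cheapProducts(db, 200).
function cheapProducts(db, maxPrice) {
  const coll = db.getCollection("products_old");
  const query = { regularPrice: { $lte: maxPrice } };
  return {
    // Projects only the indexed field, so the index alone can answer the query.
    covered: coll.find(query, { _id: 0, regularPrice: 1 }),
    // Forces a collection scan in natural order, bypassing regularPrice_1.
    collscan: coll.find(query).hint({ $natural: 1 }),
  };
}
```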
I've created a compound index:
db.lightningStrikes.createIndex({ datetime: -1, location: "2dsphere" })
But when I run the query below, MongoDB doesn't consider the index and does a COLLSCAN instead.
db.lightningStrikes.find({ datetime: { $gte: new Date('2017-10-15T00:00:00Z') } }).explain(true).executionStats
The full result is below:
{
"executionSuccess" : true,
"nReturned" : 2,
"executionTimeMillis" : 0,
"totalKeysExamined" : 0,
"totalDocsExamined" : 4,
"executionStages" : {
"stage" : "COLLSCAN",
"filter" : {
"datetime" : {
"$gte" : ISODate("2017-10-15T00:00:00Z")
}
},
"nReturned" : 2,
"executionTimeMillisEstimate" : 0,
"works" : 6,
"advanced" : 2,
"needTime" : 3,
"needYield" : 0,
"saveState" : 0,
"restoreState" : 0,
"isEOF" : 1,
"invalidates" : 0,
"direction" : "forward",
"docsExamined" : 4
},
"allPlansExecution" : [ ]
}
P.S. I have only 4 documents inserted.
Why does this happen?
db.lightningStrikes.find({ datetime: { $gte: new Date('2017-10-11T23:59:56Z'), $lte: new Date('2017-10-11T23:59:57Z') } }).explain(true)
Result from query above:
https://gist.github.com/anonymous/8dc084132016a1dfe0efb150201f04c7
db.lightningStrikes.find({ datetime: { $gte: new Date('2017-10-11T23:59:56Z'), $lte: new Date('2017-10-11T23:59:57Z') } }).hint("datetime_-1_location_2dsphere").explain(true)
Result from query above:
https://gist.github.com/anonymous/2b76c5a7b4b348ea7206d8b544c7d455
To help understand what MongoDB is doing here you could:
Run explain with allPlansExecution mode and have a look at the rejected plans to see why MongoDB rejected your index
Run the find with .hint(_your_index_name_) and compare the explain output with the output you got for your original (non hinted) find.
Both of these are intended to get at the same thing, namely; comparative explain plans for (1) a find with COLLSCAN and (2) a find which uses your index. By comparing these explain plans you'll likely see some difference which explains MongoDB's decision not to use your index.
More details on analysing explain plans in the docs.
You could even update your OP with the comparative plans if you need help identifying why MongoDB chose the COLLSCAN.
Update 1: looking at the explain plans you provided ...
This plan uses your index but the explain plan output ...
"inputStage" : {
"stage" : "IXSCAN",
"nReturned" : 4,
"executionTimeMillisEstimate" : 0,
"works" : 5,
"advanced" : 4,
...,
"keyPattern" : {
"datetime" : -1,
"location" : "2dsphere"
},
"indexName" : "datetime_-1_location_2dsphere",
...,
"indexVersion" : 2,
...,
"keysExamined" : 4,
...
}
... shows that it used the index to examine 4 index keys and then returned 4 documents to the FETCH stage. This tells us that the index did not provide any selectivity; the selectivity was provided by the FETCH stage that followed the IXSCAN. This is effectively what the COLLSCAN does, but without the redundant IXSCAN. That might explain why MongoDB preferred a COLLSCAN, but why did the IXSCAN do nothing? I suspect this is because the 2dsphere index cannot be used to answer queries which are missing a geo predicate over the 2dsphere field. Your query has a predicate over datetime but does not have a geo predicate over location, so I think MongoDB cannot use the 2dsphere index to answer the predicate over datetime. More information on the background to this in the docs. Briefly: 2dsphere indexes (version 2 and later) are sparse by default, which means that there isn't necessarily an entry in the index for every document in your collection, so if you search without the location attribute then MongoDB cannot rely on the index to satisfy the query.
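The sparse-index point can be illustrated with a toy model in plain JavaScript. The documents are made up (numbers stand in for dates), and the "index" is just a filtered, sorted array; it only shows why an index that skips documents without a location cannot be trusted to answer a datetime-only query.

```javascript
// Hypothetical documents: two of them have no location field.
const strikes = [
  { _id: 1, datetime: 30, location: { type: "Point", coordinates: [-46.6, -23.5] } },
  { _id: 2, datetime: 20 },
  { _id: 3, datetime: 10, location: { type: "Point", coordinates: [-43.2, -22.9] } },
  { _id: 4, datetime: 40 },
];

// Toy sparse index on { datetime: -1, location: "2dsphere" }: documents
// without a location get no index entry at all.
const sparseIndex = strikes
  .filter((d) => d.location !== undefined)
  .map((d) => ({ datetime: d.datetime, id: d._id }))
  .sort((a, b) => b.datetime - a.datetime);

// The same datetime-only predicate, answered two ways.
const viaCollectionScan = strikes.filter((d) => d.datetime >= 15).map((d) => d._id);
const viaSparseIndex = sparseIndex.filter((e) => e.datetime >= 15).map((e) => e.id);

console.log(viaCollectionScan, viaSparseIndex);
```

In this model the index-only answer misses the matching documents that have no location, which is the situation the planner has to guard against.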
You could test whether this assertion is correct by ...
updating your query to include a predicate on each of the datetime and location attributes
updating your query to include a predicate on the location attribute only
... and for each of these run the query and then examine the explain plan output to see whether the IXSCAN stage actually selected anything. If the IXSCAN stage is selective then you should see keys examined > nReturned in the explain plan output (assuming that the criteria you pass in does actually select < 4 documents!).