Given a collection of a few million documents that look like:
{
organization: ObjectId("6a55b2f1aae2fe0ddd525828"),
updated_on: 2019-04-18 14:08:48.781Z
}
and two indexes, one on each key: {organization: 1} and {updated_on: 1}.
The following query takes ages to return:
db.getCollection('sessions').aggregate([
{
"$match" : {
"organization" : ObjectId("5a55b2f1aae2fe0ddd525827"),
}
},
{
"$sort" : {
"updated_on" : 1
}
}
])
One thing to note is that the result is 0 matches. Upon further investigation, the planner in explain() actually returns the following:
{
"stage" : "FETCH",
"filter" : {
"organization" : {
"$eq" : ObjectId("5a55b2f1aae2fe0ddd525827")
}
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"updated_on" : 1.0
},
"indexName" : "updated_on_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"updated_on" : []
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"updated_on" : [
"[MinKey, MaxKey]"
]
}
}
}
Why would Mongo combine these into one stage and decide to sort ALL documents BEFORE filtering?
How can I prevent that?
Why would Mongo combine these into one stage and decide to sort ALL documents BEFORE filtering? How can I prevent that?
The sort does happen after the match stage. The query plan doesn't show the SORT stage - that is because there is an index on the sort key updated_on. If you remove the index on the updated_on field you will see a SORT stage in the query plan (and it will be an in-memory sort).
See Explain Results - sort stage.
Some Ideas:
(i) You can use a compound index instead of two single-field indexes:
{ organization: 1, updated_on: 1 }
It will work fine. See this topic on Sort and Non-prefix Subset of an Index.
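A minimal sketch of creating that index in the shell (collection name taken from the question):

// Equality key first, sort key second
db.getCollection('sessions').createIndex({ organization: 1, updated_on: 1 })

With this index the planner can satisfy both the $match and the $sort from a single bounded index scan.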
(ii) Also, instead of an aggregation, a find() query can do the same job:
db.test.find( { organization : ObjectId("5a55b2f1aae2fe0ddd525827") } ).sort( { updated_on: 1 } )
NOTE: Do verify with explain() and see how they perform. Also, try using the executionStats mode.
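For instance, a sketch of checking both forms with executionStats (same filter and sort as in the question):

// Explain the aggregation with execution statistics
db.getCollection('sessions').explain('executionStats').aggregate([
  { $match: { organization: ObjectId("5a55b2f1aae2fe0ddd525827") } },
  { $sort: { updated_on: 1 } }
])

// Explain the equivalent find()
db.getCollection('sessions')
  .find({ organization: ObjectId("5a55b2f1aae2fe0ddd525827") })
  .sort({ updated_on: 1 })
  .explain('executionStats')

Compare totalKeysExamined and totalDocsExamined against nReturned in both outputs.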
MongoDB will use the index on the $sort key because sorting is a heavy operation, even though matching first would limit the number of documents to be sorted.
You can either force using the index for $match:
db.collection.aggregate(pipeline, {hint: "index_name"})
Or create a better index that solves both problems; see more information here:
db.collection.createIndex({organization: 1, updated_on:1}, {background: true})
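The hint option accepts either the index name or its key pattern, so with the compound index above either form below should work (the string form assumes the default generated index name):

// String form: the default generated index name
db.collection.aggregate(pipeline, { hint: "organization_1_updated_on_1" })

// Document form: the key pattern itself
db.collection.aggregate(pipeline, { hint: { organization: 1, updated_on: 1 } })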
Related
DocumentDB ignores indexes on any field other than the sorted one
db.requests.aggregate([
{ $match: {'DeviceId': '5f68c9c1-73c1-e5cb-7a0b-90be2f80a332'}},
{ $sort: { 'Timestamp': 1 } }
])
Useful information:
> explain('executionStats')
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "admin_portal.requests",
"winningPlan" : {
"stage" : "IXSCAN",
"indexName" : "Timestamp_1",
"direction" : "forward"
}
},
"executionStats" : {
"executionSuccess" : true,
"executionTimeMillis" : "398883.755",
"planningTimeMillis" : "0.274",
"executionStages" : {
"stage" : "IXSCAN",
"nReturned" : "20438",
"executionTimeMillisEstimate" : "398879.028",
"indexName" : "Timestamp_1",
"direction" : "forward"
}
},
"serverInfo" : {
...
},
"ok" : 1.0,
"operationTime" : Timestamp(1622585939, 1)
}
> db.requests.getIndexKeys()
[
{
"_id" : 1
},
{
"Timestamp" : 1
},
{
"DeviceId" : 1
}
]
It works fine when I query documents without sorting, or when I use find() with sort() instead of aggregation.
Important note: it also works perfectly on a stock MongoDB instance, but not on DocumentDB.
This is more of a "how does DocumentDB choose a query plan" kind of question.
There are many answers on Stack Overflow about how Mongo does it.
Clearly, choosing the "wrong" index can happen because of failed trials based on data distribution; the issue here is that DocumentDB adds an unknown layer.
Amazon DocumentDB emulates the MongoDB 4.0 API on a purpose-built database engine that utilizes a distributed, fault-tolerant, self-healing storage system. As a result, query plans and the output of explain() may differ between Amazon DocumentDB and MongoDB. Customers who want control over their query plan can use the $hint operator to enforce selection of a preferred index.
They state that, due to this layer, differences might happen.
So now that we (kind of) understand why a wrong index is selected, what can we do? Well, unless you want to drop your indexes or rebuild them differently somehow, you need to use the hint option for your pipeline.
db.collection.aggregate(pipeline, {hint: "index_name"})
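For the pipeline in this question that would look something like the following, assuming the index on DeviceId kept its default generated name, DeviceId_1:

// Force the DeviceId index for the $match instead of Timestamp_1
db.requests.aggregate(
  [
    { $match: { DeviceId: '5f68c9c1-73c1-e5cb-7a0b-90be2f80a332' } },
    { $sort: { Timestamp: 1 } }
  ],
  { hint: 'DeviceId_1' }
)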
I have a Mongo collection containing millions of documents with the following format:
{
"_id" : ObjectId("5ac37fa989e00723fc4c7746"),
"group-number" : NumberLong(128125089),
"date" : ISODate("2018-04-03T13:20:41.193Z")
}
And I want to retrieve the documents between two dates ('date'), sorted by 'group-number'. So I am executing this kind of query:
db.getCollection('group').find({date:{$gt:new Date(1491372960000),$lt:new Date(1553152560000)}}).sort({"group-number":1})
According to https://blog.mlab.com/2012/06/cardinal-ins/, when querying by range values rather than by equality (as in my case), it seems better to have the index in the inverse order (first the sort field, then the filtered field).
Indeed, I've had the best results with the index db.group.createIndex({"group-number":1,"date":1});. But it still takes too long; in some cases more than 40 seconds.
According to the explain() results, indeed the above index is being used.
"winningPlan" : {
"stage" : "FETCH",
"filter" : {
"$and" : [
{
"date" : {
"$lt" : ISODate("2019-03-21T07:16:00.000Z")
}
},
{
"date" : {
"$gt" : ISODate("2017-04-05T06:16:00.000Z")
}
}
]
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"group-number" : 1.0,
"date" : 1.0
},
"indexName" : "group-number_1_date_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"group-number" : [],
"date" : []
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"group-number" : [
"[MinKey, MaxKey]"
],
"date" : [
"[MinKey, MaxKey]"
]
}
}
}
How can I improve the performance? I must be missing something...
I'd build the index the other way around: db.group.createIndex({date: 1, 'group-number': 1}). Simply because you are actually querying by the date field, so it should come first in the compound index; you are only using group-number for sorting. That way it is easier for WiredTiger to find the necessary documents in the B-tree.
According to the explain() results, indeed the above index is being used.
There is an important distinction between an index being used and an index being used efficiently. Taking a look at the index usage portion of the explain output, we have the following:
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"group-number" : 1.0,
"date" : 1.0
},
"indexName" : "group-number_1_date_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"group-number" : [],
"date" : []
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"group-number" : [
"[MinKey, MaxKey]"
],
"date" : [
"[MinKey, MaxKey]"
]
}
}
There are two (related) important observations here:
The index scan has not been bounded at all. The bounds for all (both) keys are [MinKey, MaxKey]. This means that the operation is scanning the entire index.
The restrictions on the date field expressed by the query predicate are not present in either the index bounds (noted above) or even as a separate filter during the index scanning phase.
What we see instead is that the date bounds are only being applied after the full document has been retrieved:
"stage" : "FETCH",
"filter" : {
"$and" : [
{
"date" : {
"$lt" : ISODate("2019-03-21T07:16:00.000Z")
}
},
{
"date" : {
"$gt" : ISODate("2017-04-05T06:16:00.000Z")
}
}
]
},
Taken together, this means that the operation that originally generated the explain output:
Scanned the entire index
Individually retrieved the full document associated with each key
Filtered out the documents that did not match the date predicate
Returned the remaining documents to the client
The only benefit that the index provided was the fact that it provided the results in sorted order. This may or may not be faster than just doing a full collection scan instead. That would depend on things like the number of matching results as well as the total number of documents in the collection.
Bounding date
An important question to ask would be why the database was not using the date field from the index more effectively?
As far as I can tell, this is (still) a quirk of how MongoDB creates index bounds. For whatever reason, it does not seem to recognize that the second index key can have bounds applied to it despite the fact that the first one does not.
We can, however, trick it into doing so. In particular we can apply a predicate against the sort field (group-number) that doesn't change the results. An example (using the newer mongosh shell) would be "group-number" :{$gte: MinKey}. This would make the full query:
db.getCollection('group').find({"group-number" :{$gte: MinKey}, date:{$gt:new Date(1491372960000),$lt:new Date(1553152560000)}}).sort({"group-number":1})
The explain for this adjusted query generates:
winningPlan: {
stage: 'FETCH',
inputStage: {
stage: 'IXSCAN',
keyPattern: { 'group-number': 1, date: 1 },
indexName: 'group-number_1_date_1',
isMultiKey: false,
multiKeyPaths: { 'group-number': [], date: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: {
'group-number': [ '[MinKey, MaxKey]' ],
date: [ '(new Date(1491372960000), new Date(1553152560000))' ]
}
}
}
We can see above that the date field is now bounded as expected preventing the database from having to unnecessarily retrieve documents that do not match the query predicate. This would likely provide some improvement to the query, but it is impossible to say how much without knowing more about the data distribution.
Other Observations
The index noted in the other answer swaps the order of the index keys. This may reduce the total number of index keys that need to be examined in order to execute the full query. However as noted in the comments, it prevents the database from using the index to provide sorted results. While there is always a tradeoff when it comes to queries that both use range operators and request a sort, my suspicion is that the index described in the question will be superior for this particular situation.
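To make that tradeoff concrete, here is a sketch of the swapped-key alternative (collection name assumed from the question); explain() on it should show tight bounds on date but a blocking in-memory SORT stage:

// Tighter index bounds on date, at the cost of an in-memory sort
db.group.createIndex({ date: 1, "group-number": 1 })
db.group.find({ date: { $gt: new Date(1491372960000), $lt: new Date(1553152560000) } })
  .sort({ "group-number": 1 })
  .explain()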
Was the sample document described in the question the full document? If so, then the database is only forced to retrieve the full document in order to gather the _id field to return to the client. You could transform this operation into a covered query (one that can return results directly from the index alone without having to retrieve the full document) by either:
Projecting out the _id field in the query if the client does not need it, or
Appending _id to the index in the last position if the client does want it.
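A sketch of both options (collection name assumed from the question):

// Option 1: project out _id so the existing index alone can cover the query
db.group.find(
  { "group-number": { $gte: MinKey },
    date: { $gt: new Date(1491372960000), $lt: new Date(1553152560000) } },
  { _id: 0, "group-number": 1, date: 1 }
).sort({ "group-number": 1 })

// Option 2: append _id so the query stays covered when _id is returned
db.group.createIndex({ "group-number": 1, date: 1, _id: 1 })

A covered plan shows up in the explain output as an IXSCAN with no FETCH stage above it.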
I've created the following index on my collection:
db.myCollection.createIndex({
user_id: 1,
name: 'text'
})
If I try to see the execution plan of a query containing both fields, like this:
db.getCollection('campaigns').find({
user_id: ObjectId('xxx')
,$text: { $search: 'bla' }
}).explain('executionStats')
I get the following results:
...
"winningPlan" : {
"stage" : "TEXT",
"indexPrefix" : {
"user_id" : ObjectId("xxx")
},
"indexName" : "user_id_1_name_text",
"parsedTextQuery" : {
"terms" : [
"e"
],
"negatedTerms" : [],
"phrases" : [],
"negatedPhrases" : []
},
"inputStage" : {
"stage" : "TEXT_MATCH",
"inputStage" : {
"stage" : "TEXT_OR",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"user_id" : 1.0,
"_fts" : "text",
"_ftsx" : 1
},
"indexName" : "user_id_1_name_text",
"isMultiKey" : true,
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 1,
"direction" : "backward",
"indexBounds" : {}
}
}
}
}
...
As stated in the documentation, MongoDB can use index prefixes to perform indexed queries.
Since user_id is a prefix for the index above, I'd expect that a query only by user_id would use the index, but if I try the following:
db.myCollection.find({
user_id: ObjectId('xxx')
}).explain('executionStats')
I get:
...
"winningPlan" : {
"stage" : "COLLSCAN",
"filter" : {
"user_id" : {
"$eq" : ObjectId("xxx")
}
},
"direction" : "forward"
},
...
So, it is not using the index at all and performing a full collection scan.
In general MongoDB can use index prefixes to support queries; however, compound indexes that include geospatial or text fields are a special case of sparse compound indexes. If a document does not include a value for any of the text index field(s) in a compound index, it will not be included in the index.
In order to ensure correct results for a prefix search, an alternative query plan will be chosen over the sparse compound index:
If a sparse index would result in an incomplete result set for queries and sort operations, MongoDB will not use that index unless a hint() explicitly specifies the index.
Setting up some test data in MongoDB 3.4.5 to demonstrate the potential problem:
db.myCollection.createIndex({ user_id:1, name: 'text' }, { name: 'myIndex'})
// `name` is a string; this document will be included in a text index
db.myCollection.insert({ user_id:123, name:'Banana' })
// `name` is a number; this document will NOT be included in a text index
db.myCollection.insert({ user_id:123, name: 456 })
// `name` is missing; this document will NOT be included in a text index
db.myCollection.insert({ user_id:123 })
Then, forcing the compound text index to be used:
db.myCollection.find({user_id:123}).hint('myIndex')
The result only includes the single document with the indexed text field name, rather than the three documents that would be expected:
{
"_id": ObjectId("595ab19e799060aee88cb035"),
"user_id": 123,
"name": "Banana"
}
This exception should be more clearly highlighted in the MongoDB documentation; watch/upvote DOCS-10322 in the MongoDB issue tracker for updates.
This behavior is due to text indexes being sparse by default:
For a compound index that includes a text index key along with keys of other types, only the text index field determines whether the index references a document. The other keys do not determine whether the index references the documents or not.
The query filter is not referencing the text index field, so the query planner won't consider this index as it can't be certain that the full result set of documents will be returned by the index.
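If prefix-only queries like this are common, one workaround is a separate, non-sparse single-field index; a sketch (keeping the placeholder ObjectId from the question):

// A plain index on user_id references every document, so it is safe for prefix-only queries
db.myCollection.createIndex({ user_id: 1 })

// This can now use user_id_1 instead of falling back to a COLLSCAN
db.myCollection.find({ user_id: ObjectId('xxx') }).explain('executionStats')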
My Mongo find() query is using an index, but when I implement the same functionality using aggregate(), it is not using the index.
db.collection1.find({Attribute8: "s1000",Attribute9: "s1000"}).sort({Attribute10: 1})
"cursor used in find" : "BtreeCursor Attribute8_1_Attribute9_1_Attribute10_1"
db.collection1.aggregate([
{
$match: {
Attribute8: "s1000",
Attribute9: "s1000"
}
},
{
$sort: {
Attribute10: 1
}
}
])
"cursor used in aggregate" : "BtreeCursor ".
Can someone tell me where this goes wrong? My goal is to get the aggregate method to use the index.
Thanks in advance.
After some digging, the issue appears to be a limitation around the usage of the following types:
Symbol, MinKey, MaxKey, DBRef, Code, and CodeWScope
In this case Symbol is used to contain a string value, so the index won't work.
Please try with a Number, and set explain to true in the aggregate options.
[EDIT]
My previous answer is incorrect.
The aggregation pipeline does use a 'BtreeCursor' (but only when the matched field has an index) to run the $match query, and it does use the ensured index; check "indexBounds" for verification.
Ensure the collection has an index on "Attribute08":
db.temps.ensureIndex({Attribute08:1})
$match on a field with an index:
db.temps.aggregate([{$match:{Attribute08:"s1000"}}],{explain:true})
"allPlans" : [
{
"cursor" : "BtreeCursor ",
"isMultiKey" : false,
"scanAndOrder" : false,
"indexBounds" : {
"Attribute08" : [
[
"s1000",
"s1000"
]
]
}
}
]
Below is the $match on a field without an index:
db.temps.aggregate([{$match:{Attribute09:"s1000"}}],{explain:true})
"allPlans" : [
{
"cursor" : "BasicCursor",
"isMultiKey" : false,
"scanAndOrder" : false
}
]
The report this question was originally about has grown to include an additional three subreports.
Each of them counts a number of documents.
Main report:
{
runCommand : {
aggregate : 'Collection',
pipeline : [
{ $match : { time : { '$gte' : '$P!{sttime}', '$lt' : '$P!{edtime}' }}},
{ $match : { owner_id : $P{id} } },
{ $match : { status : 0 } },
{ $group : { _id : { StatusID : '$status', SID : '$sid', UserID : '$owner_id', GroupID : '$group_id' }, count : { $sum : 1} } }
]
}
}
The user must set a timeframe (Timestamp-typed dates) and provide the user for whom the report will be filled.
The selected timeframe, owner_id, and sid are then passed to each subreport as parameters.
Subreport:
{
runCommand : {
aggregate : 'Collection',
pipeline : [
{ $match : { update_time : { '$gte' : '$P!{sttime}', '$lt' : '$P!{edtime}' }}},
{ $match : { sid : $P{sid} } },
{ $match : { owner_id : $P{id} } },
{ $match : { status : 1 } },
{ $group : { _id : { StatusID : '$status' }, count : { $sum : 1} } }
]
}
}
The other subreports are the same as above, except that { $match : { status : 1 } } becomes:
{ $match : { status : 2 } }
and
{ $match : { status : 3 } }
respectively.
I'm working with a large collection, where a 2-hour timeframe contains about 400,000 documents.
The maximum timeframe for which the report showed results is 8 hours.
Anything longer than that period hits the timeout.
Filling the 2-hour timeframe takes about 10 minutes.
I tried {explain : true} on each request individually. The queries were fastest in the form in which they are written.
"cursor" : {
"cursor" : "BtreeCursor update_time_1_owner_id_1_status_1_group_id_1",
"isMultiKey" : false,
"n" : 379843,
"nscannedObjects" : 379843,
"nscanned" : 379843,
"nscannedObjectsAllPlans" : 379843,
"nscannedAllPlans" : 379843,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 13,
"nChunkSkips" : 0,
"millis" : 694,
This example is for the 2-hour timeframe.
Is there any way to speed up filling the report? Some way to unite the queries? Or something else?
The goal is to extend the possible reporting period to a month (if that is possible).
MongoDB 2.4 and earlier do not merge consecutive calls to $match, so your first $match will use an index and subsequent matches will be done in memory on the ~400k documents fetched for your aggregation pipeline. Given you have two or three subsequent $match operations, this processing isn't going to be overly efficient.
FYI, the issue of coalescing consecutive $match calls has been addressed in MongoDB 2.6 (see SERVER-11184), but for clarity I think you would still be best off combining the $match statements before doing an explain, as in the sketch below.
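A sketch of the subreport with its $match stages combined into one (parameter placeholders kept from the question):

{
runCommand : {
aggregate : 'Collection',
pipeline : [
{ $match : {
update_time : { '$gte' : '$P!{sttime}', '$lt' : '$P!{edtime}' },
sid : $P{sid},
owner_id : $P{id},
status : 1
} },
{ $group : { _id : { StatusID : '$status' }, count : { $sum : 1 } } }
]
}
}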
Also, since you are doing range queries on update_time, you probably want to move that later in your compound index so the equality checks on owner, status, and group_id can be used to filter results before range comparisons. The blog post Optimizing MongoDB Compound Indexes has some relevant explanation for this approach.
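Following that guideline, an index along these lines (equality fields first, the range field last) is worth testing with explain; this is a sketch, with the field names taken from the existing index:

// Equality keys (owner_id, status, group_id) before the range key (update_time)
db.Collection.ensureIndex({ owner_id: 1, status: 1, group_id: 1, update_time: 1 })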