MongoDB query does not use prefix on compound index with text field

I've created the following index on my collection:
db.myCollection.createIndex({
    user_id: 1,
    name: 'text'
})
If I try to see the execution plan of a query containing both fields, like this:
db.getCollection('myCollection').find({
    user_id: ObjectId('xxx'),
    $text: { $search: 'bla' }
}).explain('executionStats')
I get the following results:
...
"winningPlan" : {
    "stage" : "TEXT",
    "indexPrefix" : {
        "user_id" : ObjectId("xxx")
    },
    "indexName" : "user_id_1_name_text",
    "parsedTextQuery" : {
        "terms" : [
            "e"
        ],
        "negatedTerms" : [],
        "phrases" : [],
        "negatedPhrases" : []
    },
    "inputStage" : {
        "stage" : "TEXT_MATCH",
        "inputStage" : {
            "stage" : "TEXT_OR",
            "inputStage" : {
                "stage" : "IXSCAN",
                "keyPattern" : {
                    "user_id" : 1.0,
                    "_fts" : "text",
                    "_ftsx" : 1
                },
                "indexName" : "user_id_1_name_text",
                "isMultiKey" : true,
                "isUnique" : false,
                "isSparse" : false,
                "isPartial" : false,
                "indexVersion" : 1,
                "direction" : "backward",
                "indexBounds" : {}
            }
        }
    }
}
...
As stated in the documentation, MongoDB can use index prefixes to perform indexed queries.
Since user_id is a prefix for the index above, I'd expect that a query only by user_id would use the index, but if I try the following:
db.myCollection.find({
    user_id: ObjectId('xxx')
}).explain('executionStats')
I get:
...
"winningPlan" : {
    "stage" : "COLLSCAN",
    "filter" : {
        "user_id" : {
            "$eq" : ObjectId("xxx")
        }
    },
    "direction" : "forward"
},
...
So, it is not using the index at all and performing a full collection scan.

In general, MongoDB can use index prefixes to support queries; however, compound indexes that include geospatial or text fields are a special case of sparse compound indexes. If a document does not include a value for any of the text index field(s) in a compound index, it will not be included in the index.
In order to ensure correct results for a prefix search, an alternative query plan will be chosen over the sparse compound index:
If a sparse index would result in an incomplete result set for queries and sort operations, MongoDB will not use that index unless a hint() explicitly specifies the index.
Setting up some test data in MongoDB 3.4.5 to demonstrate the potential problem:
db.myCollection.createIndex({ user_id:1, name: 'text' }, { name: 'myIndex'})
// `name` is a string; this document will be included in a text index
db.myCollection.insert({ user_id:123, name:'Banana' })
// `name` is a number; this document will NOT be included in a text index
db.myCollection.insert({ user_id:123, name: 456 })
// `name` is missing; this document will NOT be included in a text index
db.myCollection.insert({ user_id:123 })
Then, forcing the compound text index to be used:
db.myCollection.find({user_id:123}).hint('myIndex')
The result only includes the single document with the indexed text field name, rather than the three documents that would be expected:
{
    "_id": ObjectId("595ab19e799060aee88cb035"),
    "user_id": 123,
    "name": "Banana"
}
This exception should be more clearly highlighted in the MongoDB documentation; watch/upvote DOCS-10322 in the MongoDB issue tracker for updates.

This behavior is due to text indexes being sparse by default:
For a compound index that includes a text index key along with keys of other types, only the text index field determines whether the index references a document. The other keys do not determine whether the index references the documents or not.
The query filter is not referencing the text index field, so the query planner won't consider this index as it can't be certain that the full result set of documents will be returned by the index.
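If you also need efficient queries on user_id alone, one workaround (a minimal sketch, not part of the original answers) is to maintain a separate single-field index alongside the compound text index:
// A plain index on user_id; unlike the compound text index it is not
// sparse, so the planner can use it for queries filtering on user_id alone.
db.myCollection.createIndex({ user_id: 1 })
The compound text index then continues to serve the combined user_id + $text queries.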

Related

Is searching by _id in MongoDB more efficient?

In my use case, I want to search a document by a given unique string in MongoDB. However, I want my queries to be fast and searching by _id will add some overhead. I want to know if there are any benefits in MongoDB to search a document by _id over any other unique value?
To my knowledge, ObjectIds are similar to any other unique value in a document [point made for the case of searching only].
As for the overhead, you can assume I am caching the string-to-ObjectId mapping and the cache is very small and in memory [almost negligible], though the DB is large.
Analyzing your query performance
I advise you to use .explain(), provided by MongoDB, to analyze your query performance.
Let's say we are trying to execute this query
db.inventory.find( { quantity: { $gte: 100, $lte: 200 } } )
This would be the result of the query execution
{ "_id" : 2, "item" : "f2", "type" : "food", "quantity" : 100 }
{ "_id" : 3, "item" : "p1", "type" : "paper", "quantity" : 200 }
{ "_id" : 4, "item" : "p2", "type" : "paper", "quantity" : 150 }
If we call .explain() this way
db.inventory.find(
{ quantity: { $gte: 100, $lte: 200 } }
).explain("executionStats")
It will return the following result:
{
    "queryPlanner" : {
        "plannerVersion" : 1,
        ...
        "winningPlan" : {
            "stage" : "COLLSCAN",
            ...
        }
    },
    "executionStats" : {
        "executionSuccess" : true,
        "nReturned" : 3,
        "executionTimeMillis" : 0,
        "totalKeysExamined" : 0,
        "totalDocsExamined" : 10,
        "executionStages" : {
            "stage" : "COLLSCAN",
            ...
        },
        ...
    },
    ...
}
More details about this can be found in the MongoDB documentation on explain results.
How efficient is search by _id and indexes
To answer your question: using indexes is always more efficient. Indexes are special data structures that store a small portion of the collection's data set in an easy-to-traverse form. Since _id has a default index provided by MongoDB, searching by _id is efficient.
Without indexes, MongoDB must perform a collection scan, i.e. scan every document in a collection, to select those documents that match the query statement.
So, YES, using indexes like _id is better!
You can also create your own indexes by using createIndex()
db.collection.createIndex( <key and index type specification>, <options> )
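For example, a minimal sketch of indexing a unique string field (the field name slug here is hypothetical), so that lookups by that string avoid a collection scan, much like lookups by _id:
// Hypothetical field `slug`; the unique index lets equality lookups on it
// be answered by an index scan rather than a collection scan.
db.myCollection.createIndex({ slug: 1 }, { unique: true })
db.myCollection.find({ slug: "some-unique-string" })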
Optimize your MongoDB query
If you want to optimize your query, there are multiple ways to do that:
Create Custom Indexes to Support Your Queries
Limit the Number of Query Results to Reduce Network Demand
db.posts.find().sort( { timestamp : -1 } ).limit(10)
Use Projections to Return Only Necessary Data
db.posts.find( {}, { timestamp : 1 , title : 1 , author : 1 , abstract : 1} ).sort( { timestamp : -1 } )
Use hint() to Select a Particular Index
db.users.find().hint( { age: 1 } )
Short answer: yes, _id is the primary key and it's indexed, so of course it's fast.
But you can use an index on the other fields too and get more efficient queries.

mongo sort before match in aggregation

Given a collection of a few million documents that look like:
{
    organization: ObjectId("6a55b2f1aae2fe0ddd525828"),
    updated_on: ISODate("2019-04-18T14:08:48.781Z")
}
and two indexes, one on each key: { organization: 1 } and { updated_on: 1 }.
The following query takes ages to return:
db.getCollection('sessions').aggregate([
    {
        "$match" : {
            "organization" : ObjectId("5a55b2f1aae2fe0ddd525827")
        }
    },
    {
        "$sort" : {
            "updated_on" : 1
        }
    }
])
One thing to note is that the result is 0 matches. Upon further investigation, the planner in explain() actually returns the following:
{
    "stage" : "FETCH",
    "filter" : {
        "organization" : {
            "$eq" : ObjectId("5a55b2f1aae2fe0ddd525827")
        }
    },
    "inputStage" : {
        "stage" : "IXSCAN",
        "keyPattern" : {
            "updated_on" : 1.0
        },
        "indexName" : "updated_on_1",
        "isMultiKey" : false,
        "multiKeyPaths" : {
            "updated_on" : []
        },
        "isUnique" : false,
        "isSparse" : false,
        "isPartial" : false,
        "indexVersion" : 2,
        "direction" : "forward",
        "indexBounds" : {
            "updated_on" : [
                "[MinKey, MaxKey]"
            ]
        }
    }
}
Why would Mongo combine these into one stage and decide to sort ALL documents BEFORE filtering?
How can I prevent that?
Why would Mongo combine these into one stage and decide to sort ALL documents BEFORE filtering? How can I prevent that?
The sort does happen after the match stage. The query plan doesn't show the SORT stage - that is because there is an index on the sort key updated_on. If you remove the index on the updated_on field you will see a SORT stage in the query plan (and it will be an in-memory sort).
See Explain Results - sort stage.
Some Ideas:
(i) You can use a compound index instead of two single-field indexes:
{ organization: 1, updated_on: 1 }
It will work fine. See this topic on Sort and Non-prefix Subset of an Index.
(ii) Also, instead of an aggregation, a find() query can do the same job:
db.test.find( { organization : ObjectId("5a55b2f1aae2fe0ddd525827") } ).sort( { updated_on: 1 } )
NOTE: Do verify with explain() and see how they perform. Also, try using the executionStats mode.
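As a sketch, explain in executionStats mode can be chained onto the find() version directly:
// Check the winning plan and compare totalKeysExamined / totalDocsExamined
// against nReturned to judge how selective the index is.
db.test.find(
    { organization: ObjectId("5a55b2f1aae2fe0ddd525827") }
).sort({ updated_on: 1 }).explain("executionStats")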
MongoDB will use the index for the $sort because sorting is a heavy operation, even though matching first would limit the number of documents to be sorted.
You can either force the use of the index for the $match:
db.collection.aggregate(pipeline, {hint: "index_name"})
Or create a better index that solves both problems (see the MongoDB documentation on compound indexes for more information):
db.collection.createIndex({organization: 1, updated_on:1}, {background: true})
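To confirm that the compound index removes the in-memory sort, you can (as a sketch) run explain on the aggregation itself; with the { organization: 1, updated_on: 1 } index in place, the plan should show an IXSCAN on that index and no SORT stage:
// db.collection.explain() also works with aggregate()
db.collection.explain("executionStats").aggregate([
    { $match: { organization: ObjectId("5a55b2f1aae2fe0ddd525827") } },
    { $sort: { updated_on: 1 } }
])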

MongoDB query (range filter + sort) performance tuning

I have a Mongo collection containing millions of documents with the following format:
{
    "_id" : ObjectId("5ac37fa989e00723fc4c7746"),
    "group-number" : NumberLong(128125089),
    "date" : ISODate("2018-04-03T13:20:41.193Z")
}
I want to retrieve the documents between two dates ('date'), sorted by 'group-number'. So I am executing this kind of query:
db.getCollection('group').find({date:{$gt:new Date(1491372960000),$lt:new Date(1553152560000)}}).sort({"group-number":1})
According to https://blog.mlab.com/2012/06/cardinal-ins/, when filtering by a range rather than by equality (as in my case), it seems better to build the index in the inverse order (first the sort field, then the filter field).
Indeed, I've had the best results with the index db.group.createIndex({"group-number":1,"date":1});. But it still takes too long; in some cases more than 40 seconds.
According to the explain() results, indeed the above index is being used.
"winningPlan" : {
"stage" : "FETCH",
"filter" : {
"$and" : [
{
"date" : {
"$lt" : ISODate("2019-03-21T07:16:00.000Z")
}
},
{
"date" : {
"$gt" : ISODate("2017-04-05T06:16:00.000Z")
}
}
]
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"group-number" : 1.0,
"date" : 1.0
},
"indexName" : "group-number_1_date_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"group-number" : [],
"date" : []
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"group-number" : [
"[MinKey, MaxKey]"
],
"date" : [
"[MinKey, MaxKey]"
]
}
}
}
How can I improve the performance? I must be missing something...
I'd build the index the other way around: db.group.createIndex({ date: 1, 'group-number': 1 }), simply because you are actually filtering by the date field, so it should come first in the compound index; you are only using group-number for sorting. That way it is easier for WiredTiger to find the necessary documents in the B-tree.
According to the explain() results, indeed the above index is being used.
There is an important distinction between an index being used and an index being used efficiently. Taking a look at the index usage portion of the explain output, we have the following:
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"group-number" : 1.0,
"date" : 1.0
},
"indexName" : "group-number_1_date_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"group-number" : [],
"date" : []
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"group-number" : [
"[MinKey, MaxKey]"
],
"date" : [
"[MinKey, MaxKey]"
]
}
}
There are two (related) important observations here:
The index scan has not been bounded at all. The bounds for all (both) keys are [MinKey, MaxKey]. This means that the operation is scanning the entire index.
The restrictions on the date field expressed by the query predicate are not present in either the index bounds (noted above) or even as a separate filter during the index scanning phase.
What we see instead is that the date bounds are only being applied after the full document has been retrieved:
"stage" : "FETCH",
"filter" : {
"$and" : [
{
"date" : {
"$lt" : ISODate("2019-03-21T07:16:00.000Z")
}
},
{
"date" : {
"$gt" : ISODate("2017-04-05T06:16:00.000Z")
}
}
]
},
Taken together, this means that the operation that originally generated the explain output:
Scanned the entire index
Individually retrieved the full document associated with each key
Filtered out the documents that did not match the date predicate
Returned the remaining documents to the client
The only benefit that the index provided was the fact that it provided the results in sorted order. This may or may not be faster than just doing a full collection scan instead. That would depend on things like the number of matching results as well as the total number of documents in the collection.
Bounding date
An important question to ask is why the database was not using the date field from the index more effectively.
As far as I can tell, this is (still) a quirk of how MongoDB creates index bounds. For whatever reason, it does not seem to recognize that the second index key can have bounds applied to it despite the fact that the first one does not.
We can, however, trick it into doing so. In particular, we can apply a predicate against the sort field (group-number) that doesn't change the results. An example (using the newer mongosh shell) would be "group-number": { $gte: MinKey }. This would make the full query:
db.getCollection('group').find({"group-number" :{$gte: MinKey}, date:{$gt:new Date(1491372960000),$lt:new Date(1553152560000)}}).sort({"group-number":1})
The explain for this adjusted query generates:
winningPlan: {
stage: 'FETCH',
inputStage: {
stage: 'IXSCAN',
keyPattern: { 'group-number': 1, date: 1 },
indexName: 'group-number_1_date_1',
isMultiKey: false,
multiKeyPaths: { 'group-number': [], date: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: {
'group-number': [ '[MinKey, MaxKey]' ],
date: [ '(new Date(1491372960000), new Date(1553152560000))' ]
}
}
}
We can see above that the date field is now bounded as expected preventing the database from having to unnecessarily retrieve documents that do not match the query predicate. This would likely provide some improvement to the query, but it is impossible to say how much without knowing more about the data distribution.
Other Observations
The index noted in the other answer swaps the order of the index keys. This may reduce the total number of index keys that need to be examined in order to execute the full query. However as noted in the comments, it prevents the database from using the index to provide sorted results. While there is always a tradeoff when it comes to queries that both use range operators and request a sort, my suspicion is that the index described in the question will be superior for this particular situation.
Was the sample document described in the question the full document? If so, then the database is only forced to retrieve the full document in order to gather the _id field to return to the client. You could transform this operation into a covered query (one that can return results directly from the index alone, without having to retrieve the full document) by either:
Projecting out the _id field in the query if the client does not need it (see the sketch below), or
Appending _id to the index in the last position if the client does want it.
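A minimal sketch of the first option, assuming the client only needs group-number and date:
// Excluding _id and projecting only indexed fields makes this a covered
// query: it can be answered entirely from the index without a FETCH stage.
db.getCollection('group').find(
    { "group-number": { $gte: MinKey }, date: { $gt: new Date(1491372960000), $lt: new Date(1553152560000) } },
    { _id: 0, "group-number": 1, date: 1 }
).sort({ "group-number": 1 })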

Indexing strategy for MongoDB to prevent trashed records being indexed

In one of my collections, deleted records are marked with trash: true. Can I index only the records not marked as trash? What indexing strategy should I use here?
{
    "_id" : ObjectId("5ad5eb0b9590e413debd81d6"),
    "voucher" : ObjectId("5a731de8ffd6f75b1471742f"),
    "brand" : ObjectId("58e3b1e1b473f74462d6c9b2"),
    "code" : "6DZQ34KD",
    "purchaseDate" : ISODate("2018-04-17T12:39:39.287Z"),
    "owner" : ObjectId("5ac3460b8dd26664ab57cb66"),
    "isInWishlist" : true,
    "expiryDate" : ISODate("2018-04-30T20:59:59.000Z"),
    "isUsed" : false,
    "trash" : true,
    "updatedAt" : ISODate("2018-04-17T12:39:39.800Z"),
    "createdAt" : ISODate("2018-04-17T12:39:39.288Z"),
    "__v" : 0
}
You can create a partial index.
E.g. to index documents with "trash": false by brand:
db.collection.createIndex(
    { brand: 1 },
    { partialFilterExpression: { trash: false } }
)
Please note that documents without the trash field will also be excluded from the index.
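Also note that a query must include the partial filter condition for the planner to select the partial index; a sketch:
// Can use the partial index: the predicate matches the partialFilterExpression.
db.collection.find({ brand: ObjectId("58e3b1e1b473f74462d6c9b2"), trash: false })
// Cannot use the partial index: the filter says nothing about trash, so the
// (incomplete) index cannot guarantee a complete result set.
db.collection.find({ brand: ObjectId("58e3b1e1b473f74462d6c9b2") })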

Optimize query performance in MongoDB

I have a collection named App and need to query those active (active: true) apps that belong to a particular user (user_id) or are available to all users (by their _id). I use a query like this:
{
    "active" : true,
    "$or" : [
        {
            "user_id" : "111111111111111111111111"
        },
        {
            "_id" : {
                "$in" : [
                    ObjectId("222222222222222222222222"),
                    ObjectId("333333333333333333333333"),
                    ObjectId("444444444444444444444444")
                ]
            }
        }
    ]
}
However in db.currentOp(true) I see that this query is running very slowly: lockStats.timeLockedMicros.r is about 3000.
How can I optimize performance of this query? I already have the following indexes on App:
> db.App.getIndexes()
[
    {
        "v" : 1,
        "key" : {
            "_id" : 1
        },
        "name" : "_id_",
        "ns" : "mydb.App"
    },
    {
        "v" : 1,
        "key" : {
            "active" : 1,
            "created_at" : -1
        },
        "name" : "active_1_created_at_-1",
        "ns" : "mydb.App",
        "background" : true
    },
    {
        "v" : 1,
        "key" : {
            "active" : 1,
            "user_id" : 1
        },
        "name" : "active_1_user_id_1",
        "ns" : "mydb.App",
        "background" : true
    }
]
Two issues I see here:
1) You do not need an index on the boolean field active, as it has low selectivity and does not benefit query performance.
"If overall selectivity is low, and if MongoDB must read a number of documents to return results, then some queries may perform faster without indexes." (source: MongoDB documentation)
2) You need a separate index on user_id: user_id is not a prefix of the compound index active_1_user_id_1, so the user_id clause of the $or cannot use that index.
Edit: You can always check index efficiency by running explain(true) and looking at which indexes are used for that query.
I would try to do the following:
Remove all your indexes (except _id): the active field has low cardinality (boolean) and does not help you at all, and you are not using created_at, so there is no reason for that index either.
Add an index on the user_id key only.
Change values that are stored as strings but represent numbers into actual numbers.
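A sketch of the first two suggestions (the index names come from the getIndexes() output above; the background option mirrors the existing indexes):
// Drop the low-selectivity indexes...
db.App.dropIndex("active_1_created_at_-1")
db.App.dropIndex("active_1_user_id_1")
// ...and add a single-field index on user_id; the _id clause of the $or is
// already served by the default _id index.
db.App.createIndex({ user_id: 1 }, { background: true })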