How can one detect "useless" indexes? - mongodb

I have a MongoDB collection with a lot of indexes.
Would it bring any benefits to delete indexes that are barely used?
Is there any way or tool that can tell me (in numbers) how often an index is used?
EDIT: I'm using version 2.6.4
EDIT2: I'm now using version 3.0.3

Right, so this is how I would do it.
First you need a list of all your indexes for a certain collection (this will be done collection by collection). Let's say we are monitoring the user collection to see which indexes are useless.
So I run db.user.getIndexes(), and this results in parsable JSON output (you can run this via command() from the client side as well to integrate it with a script).
So you now have a list of your indexes. It is merely a case of understanding which queries use which indexes; if an index is not hit at all, you know it is useless.
Now you need to run every query with explain(). From that output you can judge which index is used, and match it to an index returned by getIndexes().
So here is a sample output:
> db.user.find({religion:1}).explain()
{
    "queryPlanner" : {
        "plannerVersion" : 1,
        "namespace" : "meetapp.user",
        "indexFilterSet" : false,
        "parsedQuery" : {
            "religion" : {
                "$eq" : 1
            }
        },
        "winningPlan" : {
            "stage" : "FETCH",
            "inputStage" : {
                "stage" : "IXSCAN",
                "keyPattern" : {
                    "religion" : NumberLong(1)
                },
                "indexName" : "religion_1",
                "isMultiKey" : false,
                "direction" : "forward",
                "indexBounds" : {
                    "religion" : [
                        "[1.0, 1.0]"
                    ]
                }
            }
        },
        "rejectedPlans" : [ ]
    },
    "serverInfo" : {
        "host" : "ip-172-30-0-35",
        "port" : 27017,
        "version" : "3.0.0",
        "gitVersion" : "a841fd6394365954886924a35076691b4d149168"
    },
    "ok" : 1
}
There is a set of rules that the queryPlanner field follows, and you will need to discover and code for them, but this first one is simple enough.
As you can see, the winning plan (in winningPlan) is a single IXSCAN (index scan) (remember it could be multiple; that is something you will need to code around), and the key pattern for the index used is:
"keyPattern" : {
"religion" : NumberLong(1)
},
Great, now we can match that against the key output of getIndexes():
{
    "v" : 1,
    "key" : {
        "religion" : NumberLong(1)
    },
    "name" : "religion_1",
    "ns" : "meetapp.user"
},
which tells us that the religion index is not useless and is in fact used.
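To automate that matching across many queries, a small shell sketch along these lines could work (the list of representative queries is hypothetical, you would have to supply your own, and OR plans with multiple input stages would need extra handling):
// Sketch: collect the names of all indexes on the collection, then remove
// the ones that appear in the winning plan of at least one representative query.
var representativeQueries = [
    { religion: 1 },
    { name: "foo" }   // hypothetical examples; use your app's real queries
];
var unused = {};
db.user.getIndexes().forEach(function (idx) { unused[idx.name] = true; });
representativeQueries.forEach(function (q) {
    var plan = db.user.find(q).explain().queryPlanner.winningPlan;
    // Walk down the plan tree looking for IXSCAN stages (simple plans only).
    while (plan) {
        if (plan.stage === "IXSCAN") {
            delete unused[plan.indexName];
        }
        plan = plan.inputStage;
    }
});
printjson(Object.keys(unused)); // indexes that no representative query touched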
Unfortunately this is the best way I can see. It used to be that MongoDB had an index stat for the number of times an index was hit, but it seems that data has been removed.
So you would just rinse and repeat this process for every collection you have until you have removed the indexes that are useless.
One other way of doing this, of course, is to remove all indexes and then re-add indexes as you test your queries. Though that might be bad if you do need to do this in production.
On a side note: the best way to fix this problem is to not have it at all.
I make this easier for myself by using an indexing function within my active record. Every so often I run (from PHP) something of the sort: ./yii index/rebuild, which essentially goes through my active record models, detects which indexes I no longer use and have removed from my app, and drops them in turn. It will, of course, also create any new indexes.
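I don't have that Yii code to share here, but the same idea can be sketched directly in the shell: keep a declared list of the index names your app still uses and drop everything else (the names below are made up for illustration):
// Hypothetical declared spec: the indexes the application still uses.
var declared = ["_id_", "religion_1", "age_1_city_1"];
db.user.getIndexes().forEach(function (idx) {
    if (declared.indexOf(idx.name) === -1) {
        print("dropping unused index: " + idx.name);
        db.user.dropIndex(idx.name);
    }
});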

Related

Under what circumstances would mongo use a compound index for a query that does not match the prefix fields of the index?

When explaining a query on a collection having these indexes:
{"user_id": 1, "req.time_us": 1}
{"user_id": 1, "req.uri":1, "req.time_us": 1}
with a command like:
db.some_collection.find({"user_id":12345,"req.time_us":{"$gte":1657509059545812,"$lt":1667522903018337}}).limit(20).explain("executionStats")
The winning plan was:
"inputStage" : {
"stage" : "IXSCAN",
"nReturned" : 20,
"executionTimeMillisEstimate" : 0,
"works" : 20,
"advanced" : 20,
...
"keyPattern" : {
"user_id" : 1,
"req.uri" : 1,
"req.time_us" : 1
},
"indexName" : "user_id_1_req.uri_1_req.time_us_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"user_id" : [ ],
"req.uri" : [ ],
"req.time_us" : [ ]
},
...
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"user_id" : [
"[23456.0, 23456.0]"
],
"req.uri" : [
"[MinKey, MaxKey]"
],
"req.time_us" : [
"[1657509059545812.0, 1667522903018337.0)"
]
},
"keysExamined" : 20,
"seeks" : 1,
...
}
Why was the index user_id_1_req.uri_1_req.time_us_1 used, and not user_id_1_req.time_us_1? The official manual says a compound index can support queries that match the prefix fields of the index.
This behavior is explained in the documentation. To paraphrase:
MongoDB runs the query optimizer to choose the winning plan and executes the winning plan to completion.
During plan selection, if more than one index can satisfy a query, MongoDB runs a trial of all the valid plans to determine which one performs best. You can read more about this process here.
As of MongoDB 3.4.6, plan selection involves running the candidate plans in parallel in a "race" and seeing which candidate plan returns 101 results first.
So basically these two indexes had a mini competition and the "wrong" index won. This can happen because these competitions can be heavily skewed by the data distribution for similar indexes.
(For example, imagine the first 101 documents in the collection match the query: the "better" index will actually be slower, as it keeps scanning deeper into the index tree, while the "worse" index starts fetching them immediately.)
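If you want to see that trial for yourself, you can run the query with the allPlansExecution verbosity; the exact output shape varies by version, but this is how you would request it:
db.some_collection.find({
    "user_id": 12345,
    "req.time_us": { "$gte": 1657509059545812, "$lt": 1667522903018337 }
}).limit(20).explain("allPlansExecution")
// The allPlansExecution section lists every candidate plan together with the
// nReturned / totalKeysExamined counters it accumulated during the trial.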
For cases like this I recommend using $hint, which essentially forces Mongo to use the index you deem most fit.
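For example, appending a hint to the query above (by index name or by key pattern) pins it to the shorter compound index:
db.some_collection.find({
    "user_id": 12345,
    "req.time_us": { "$gte": 1657509059545812, "$lt": 1667522903018337 }
}).limit(20).hint("user_id_1_req.time_us_1")
// .hint({ "user_id": 1, "req.time_us": 1 }) works as well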

Poor performance on bulk deleting a large collection mongodb

I have a single standalone mongo installation on a Linux machine.
The database contains a collection with 181 million documents. This collection is by far the largest in the database (approximately 90% of it).
The size of the collection is currently 3.5 TB.
I'm running Mongo version 4.0.10 (Wired Tiger)
The collection has 2 indexes:
One on id
One on 2 fields; it is used when deleting documents (see the snippet below).
When benchmarking bulk deletion on this collection, we used the following snippet:
db.getCollection('Image').deleteMany(
    { $and: [
        { "CameraId" : 1 },
        { "SequenceNumber" : { $lt: 153000000 } }
    ]}
)
To see the state of the deletion operation I ran a simple test of deleting 1000 documents while looking at the operation using currentOp(). It shows the following.
"command" : {
"q" : {
"$and" : [
{
"CameraId" : 1.0
},
{
"SequenceNumber" : {
"$lt" : 153040000.0
}
}
]
},
"limit" : 0
},
"planSummary" : "IXSCAN { CameraId: 1, SequenceNumber: 1 }",
"numYields" : 876,
"locks" : {
"Global" : "w",
"Database" : "w",
"Collection" : "w"
},
"waitingForLock" : false,
"lockStats" : {
"Global" : {
"acquireCount" : {
"r" : NumberLong(877),
"w" : NumberLong(877)
}
},
"Database" : {
"acquireCount" : {
"w" : NumberLong(877)
}
},
"Collection" : {
"acquireCount" : {
"w" : NumberLong(877)
}
}
}
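For reference, the output above came from a filtered currentOp() call along the following lines (the database name and filter fields are illustrative, not the verbatim command I ran):
db.currentOp({ "ns": "mydb.Image", "op": "remove" })  // "mydb" is a placeholder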
It seems to be using the correct index, but the number and type of locks worry me. As I interpret this, it acquires one global lock for each deleted document from a single collection.
Using this approach, it has taken over a week to delete 40 million documents. This cannot be the expected performance.
I realise other designs exist, such as bundling documents into larger chunks and storing them using GridFS, but the current design is what it is, and I want to make sure that what I see is expected before changing my design, restructuring the data, or even considering clustering, etc.
Any suggestions on how to increase the performance of bulk deletions, or is this expected?

AWS DocumentDB does not use indexes when $sort and $match at the same time

DocumentDB ignores the index on any field other than the one being sorted on:
db.requests.aggregate([
    { $match: { 'DeviceId': '5f68c9c1-73c1-e5cb-7a0b-90be2f80a332' } },
    { $sort: { 'Timestamp': 1 } }
])
Useful information:
> explain('executionStats')
{
    "queryPlanner" : {
        "plannerVersion" : 1,
        "namespace" : "admin_portal.requests",
        "winningPlan" : {
            "stage" : "IXSCAN",
            "indexName" : "Timestamp_1",
            "direction" : "forward"
        }
    },
    "executionStats" : {
        "executionSuccess" : true,
        "executionTimeMillis" : "398883.755",
        "planningTimeMillis" : "0.274",
        "executionStages" : {
            "stage" : "IXSCAN",
            "nReturned" : "20438",
            "executionTimeMillisEstimate" : "398879.028",
            "indexName" : "Timestamp_1",
            "direction" : "forward"
        }
    },
    "serverInfo" : {
        ...
    },
    "ok" : 1.0,
    "operationTime" : Timestamp(1622585939, 1)
}
> db.requests.getIndexKeys()
[
    {
        "_id" : 1
    },
    {
        "Timestamp" : 1
    },
    {
        "DeviceId" : 1
    }
]
It works fine when I query documents without sorting, or when I use find with sort instead of aggregation.
Important note: it also works perfectly on an original MongoDB instance, but not on DocumentDB.
This is more of a "how does DocumentDB choose a query plan" kind of question.
There are many answers on how Mongo does it on Stack Overflow.
Clearly, choosing the "wrong" index can happen from failed trials based on data distribution; the issue here is that DocumentDB adds an unknown layer.
Amazon DocumentDB emulates the MongoDB 4.0 API on a purpose-built database engine that utilizes a distributed, fault-tolerant, self-healing storage system. As a result, query plans and the output of explain() may differ between Amazon DocumentDB and MongoDB. Customers who want control over their query plan can use the $hint operator to enforce selection of a preferred index.
They state that, due to this layer, differences might happen.
So now that we understand (kind of) why a wrong index is selected, what can we do? Well, unless you want to drop or rebuild your indexes differently somehow, you need to use the hint option for your pipeline:
db.collection.aggregate(pipeline, {hint: "index_name"})
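Applied to the query in the question, and assuming the index on DeviceId is the one you want (its auto-generated name would normally be DeviceId_1), that looks something like:
db.requests.aggregate(
    [
        { $match: { 'DeviceId': '5f68c9c1-73c1-e5cb-7a0b-90be2f80a332' } },
        { $sort: { 'Timestamp': 1 } }
    ],
    { hint: 'DeviceId_1' }   // or { hint: { DeviceId: 1 } }
)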

MongoDB query (range filter + sort) performance tuning

I have a Mongo collection containing millions of documents with the following format:
{
    "_id" : ObjectId("5ac37fa989e00723fc4c7746"),
    "group-number" : NumberLong(128125089),
    "date" : ISODate("2018-04-03T13:20:41.193Z")
}
And I want to retrieve the documents between 2 dates ('date'), sorted by 'group-number'. So I am executing this kind of query:
db.getCollection('group').find({date:{$gt:new Date(1491372960000),$lt:new Date(1553152560000)}}).sort({"group-number":1})
According to https://blog.mlab.com/2012/06/cardinal-ins/, it seems that when you are not querying by equality values but by range values (as in my case), it is better for MongoDB to have the index in the inverse order (the sorted field first, then the filtered field).
Indeed, I've had the best results with the index db.group.createIndex({"group-number":1,"date":1});. But it still takes too long; in some cases more than 40 seconds.
According to the explain() results, indeed the above index is being used.
"winningPlan" : {
"stage" : "FETCH",
"filter" : {
"$and" : [
{
"date" : {
"$lt" : ISODate("2019-03-21T07:16:00.000Z")
}
},
{
"date" : {
"$gt" : ISODate("2017-04-05T06:16:00.000Z")
}
}
]
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"group-number" : 1.0,
"date" : 1.0
},
"indexName" : "group-number_1_date_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"group-number" : [],
"date" : []
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"group-number" : [
"[MinKey, MaxKey]"
],
"date" : [
"[MinKey, MaxKey]"
]
}
}
}
How can I improve the performance? I must be missing something...
I'd build the index the other way round: db.group.createIndex({date: 1, 'group-number': 1}), simply because you are actually querying by the date field, so it should come first in the compound index; you are only using group-number for sorting. That way it is easier for WiredTiger to find the necessary documents in the B-tree.
According to the explain() results, indeed the above index is being used.
There is an important distinction between an index being used and an index being used efficiently. Taking a look at the index usage portion of the explain output, we have the following:
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"group-number" : 1.0,
"date" : 1.0
},
"indexName" : "group-number_1_date_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"group-number" : [],
"date" : []
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"group-number" : [
"[MinKey, MaxKey]"
],
"date" : [
"[MinKey, MaxKey]"
]
}
}
There are two (related) important observations here:
The index scan has not been bounded at all. The bounds for all (both) keys are [MinKey, MaxKey]. This means that the operation is scanning the entire index.
The restrictions on the date field expressed by the query predicate are not present in either the index bounds (noted above) or even as a separate filter during the index scanning phase.
What we see instead is that the date bounds are only being applied after the full document has been retrieved:
"stage" : "FETCH",
"filter" : {
"$and" : [
{
"date" : {
"$lt" : ISODate("2019-03-21T07:16:00.000Z")
}
},
{
"date" : {
"$gt" : ISODate("2017-04-05T06:16:00.000Z")
}
}
]
},
Taken together, this means that the operation that originally generated the explain output:
Scanned the entire index
Individually retrieved the full document associated with each key
Filtered out the documents that did not match the date predicate
Returned the remaining documents to the client
The only benefit that the index provided was the fact that it provided the results in sorted order. This may or may not be faster than just doing a full collection scan instead. That would depend on things like the number of matching results as well as the total number of documents in the collection.
Bounding date
An important question to ask would be why the database was not using the date field from the index more effectively?
As far as I can tell, this is (still) a quirk of how MongoDB creates index bounds. For whatever reason, it does not seem to recognize that the second index key can have bounds applied to it despite the fact that the first one does not.
We can, however, trick it into doing so. In particular we can apply a predicate against the sort field (group-number) that doesn't change the results. An example (using the newer mongosh shell) would be "group-number" :{$gte: MinKey}. This would make the full query:
db.getCollection('group').find({"group-number" :{$gte: MinKey}, date:{$gt:new Date(1491372960000),$lt:new Date(1553152560000)}}).sort({"group-number":1})
The explain for this adjusted query generates:
winningPlan: {
  stage: 'FETCH',
  inputStage: {
    stage: 'IXSCAN',
    keyPattern: { 'group-number': 1, date: 1 },
    indexName: 'group-number_1_date_1',
    isMultiKey: false,
    multiKeyPaths: { 'group-number': [], date: [] },
    isUnique: false,
    isSparse: false,
    isPartial: false,
    indexVersion: 2,
    direction: 'forward',
    indexBounds: {
      'group-number': [ '[MinKey, MaxKey]' ],
      date: [ '(new Date(1491372960000), new Date(1553152560000))' ]
    }
  }
}
We can see above that the date field is now bounded as expected preventing the database from having to unnecessarily retrieve documents that do not match the query predicate. This would likely provide some improvement to the query, but it is impossible to say how much without knowing more about the data distribution.
Other Observations
The index noted in the other answer swaps the order of the index keys. This may reduce the total number of index keys that need to be examined in order to execute the full query. However as noted in the comments, it prevents the database from using the index to provide sorted results. While there is always a tradeoff when it comes to queries that both use range operators and request a sort, my suspicion is that the index described in the question will be superior for this particular situation.
Was the sample document described in the question the full document? If so, then the database is only forced to retrieve the full document in order to gather the _id field to return to the client. You could transform this operation into a covered query (one that can return results directly from the index alone, without having to retrieve the full document) by either of the following (a sketch of the first option appears after this list):
Projecting out the _id field in the query if the client does not need it, or
Appending _id to the index in the last position if the client does want it.
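For illustration, a covered version of the original query along the lines of the first option might look like this (it reuses the MinKey trick from above and assumes the client can live without _id):
// Exclude _id and project only indexed fields so the
// { "group-number": 1, "date": 1 } index can cover the query.
db.getCollection('group').find(
    { "group-number": { $gte: MinKey }, date: { $gt: new Date(1491372960000), $lt: new Date(1553152560000) } },
    { _id: 0, "group-number": 1, date: 1 }
).sort({ "group-number": 1 })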

Why does adding index worsen performance?

I'm evaluating the performance of the following query.
db.products_old.find({ regularPrice: { $lte: 200 } })
The collection has a bit over a million documents, in total around 0.15GB.
No indexes
This is expected; a full collection scan has to be done.
"executionTimeMillis" : 1019,
"winningPlan" : {
"stage" : "COLLSCAN",
"filter" : {
"regularPrice" : {
"$lte" : 200
}
},
"direction" : "forward"
},
Index { regularPrice: 1 }
"executionTimeMillis" : 2842,
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"regularPrice" : 1
},
"indexName" : "regularPrice_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"regularPrice" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"regularPrice" : [
"[-inf.0, 200.0]"
]
}
}
},
Now it uses the index, but the execution time is noticeably worse. Why?
Also, if the performance is worse, why doesn't Mongo use a COLLSCAN instead of the index that slows down the execution? rejectedPlans is empty, which suggests that no other plan was even considered. Why?
Here's the full allPlansExecution output.
During a COLLSCAN, MongoDB reads from the storage drive and keeps the matching documents in RAM for later use directly. An IXSCAN, on the other hand, reads the index, which stores the indexed values along with pointers to the documents' locations on the storage drive. (Here's a nice visualisation, from slide 6 to around slide 20.)
You have a lot of documents in your collection, but you also have a lot of matching documents in your index. The data on the storage drive is not laid out in a way that is efficient to read from it (the way it is in the index), so when the IXSCAN returns pointers to the 220k+ documents it found for your query, FETCH needs to read from the storage drive 220k+ times in a random-access way, which is slow. The COLLSCAN, on the other hand, I assume does a sequential read, probably page by page, which is a lot faster than the FETCH reads.
So to sum up: it's not the index that's slowing you down, it's the FETCH stage. If you still want to use this index and get a faster query execution time, then use .select('-_id regularPrice') (Mongoose syntax), which will just add a quick PROJECTION stage and read all the necessary fields from the index. Or, if you need _id, add an index {regularPrice: 1, _id: 1}.
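In the plain mongo shell, the equivalent of that .select() is a projection, roughly like this (a sketch using the collection from the question):
// Project away _id and keep only the indexed field so the { regularPrice: 1 }
// index can cover the query and the FETCH stage disappears.
db.products_old.find(
    { regularPrice: { $lte: 200 } },
    { _id: 0, regularPrice: 1 }
)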
And regarding why Mongo uses the index even though it could know that a collection scan is faster: well, I think that if it sees a usable index, it will use it. But you can force it to use a collection scan by passing {$natural: 1} to the hint() method.
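For example:
// Force a collection scan instead of the index scan:
db.products_old.find({ regularPrice: { $lte: 200 } }).hint({ $natural: 1 })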