MongoDB why is a compound index including 2dsphere not used - mongodb

I've created a compound index:
db.lightningStrikes.createIndex({ datetime: -1, location: "2dsphere" })
But when I run the query below, MongoDB doesn't consider the index and does a COLLSCAN instead.
db.lightningStrikes.find({ datetime: { $gte: new Date('2017-10-15T00:00:00Z') } }).explain(true).executionStats
The full result is below:
{
"executionSuccess" : true,
"nReturned" : 2,
"executionTimeMillis" : 0,
"totalKeysExamined" : 0,
"totalDocsExamined" : 4,
"executionStages" : {
"stage" : "COLLSCAN",
"filter" : {
"datetime" : {
"$gte" : ISODate("2017-10-15T00:00:00Z")
}
},
"nReturned" : 2,
"executionTimeMillisEstimate" : 0,
"works" : 6,
"advanced" : 2,
"needTime" : 3,
"needYield" : 0,
"saveState" : 0,
"restoreState" : 0,
"isEOF" : 1,
"invalidates" : 0,
"direction" : "forward",
"docsExamined" : 4
},
"allPlansExecution" : [ ]
}
P.S. I have only 4 documents inserted.
Why does this happen?
db.lightningStrikes.find({ datetime: { $gte: new Date('2017-10-11T23:59:56Z'), $lte: new Date('2017-10-11T23:59:57Z') } }).explain(true)
Result from query above:
https://gist.github.com/anonymous/8dc084132016a1dfe0efb150201f04c7
db.lightningStrikes.find({ datetime: { $gte: new Date('2017-10-11T23:59:56Z'), $lte: new Date('2017-10-11T23:59:57Z') } }).hint("datetime_-1_location_2dsphere").explain(true)
Result from query above:
https://gist.github.com/anonymous/2b76c5a7b4b348ea7206d8b544c7d455

To help understand what MongoDB is doing here you could:
Run explain with allPlansExecution mode and have a look at the rejected plans to see why MongoDB rejected your index
Run the find with .hint(_your_index_name_) and compare the explain output with the output you got for your original (non hinted) find.
Both of these are intended to get at the same thing, namely comparative explain plans for (1) a find with COLLSCAN and (2) a find which uses your index. By comparing these explain plans you'll likely see some difference which explains MongoDB's decision not to use your index.
More details on analysing explain plans in the docs.
You could even update your OP with the comparative plans if you need help identifying why MongoDB chose the COLLSCAN.
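As a rough aid for that comparison, here is a minimal sketch of a helper that flattens an executionStages tree into one row per stage, so two plans can be read side by side. The helper name summarizeStages is made up for this example, and the FETCH numbers in the hinted plan are hypothetical; only the COLLSCAN numbers are taken from the question.

```javascript
// Flatten an explain() executionStages tree into one row per stage.
// summarizeStages is a hypothetical helper, not part of MongoDB.
function summarizeStages(stage, rows = []) {
  rows.push({
    stage: stage.stage,
    nReturned: stage.nReturned,
    keysExamined: stage.keysExamined ?? 0,
    docsExamined: stage.docsExamined ?? 0,
  });
  // Single-child stages expose inputStage; multi-child stages expose inputStages.
  const children = stage.inputStages ?? (stage.inputStage ? [stage.inputStage] : []);
  for (const child of children) summarizeStages(child, rows);
  return rows;
}

// Trimmed from the COLLSCAN output in the question.
const collscan = { stage: "COLLSCAN", nReturned: 2, docsExamined: 4 };

// Hypothetical hinted plan: the FETCH numbers are invented for illustration.
const hinted = {
  stage: "FETCH", nReturned: 2, docsExamined: 4,
  inputStage: { stage: "IXSCAN", nReturned: 4, keysExamined: 4 },
};

console.log(summarizeStages(collscan));
console.log(summarizeStages(hinted));
```

In the shell you would pass in the executionStages object returned by explain("executionStats") for each variant of the query.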
Update 1: looking at the explain plans you provided ...
This plan uses your index but the explain plan output ...
"inputStage" : {
"stage" : "IXSCAN",
"nReturned" : 4,
"executionTimeMillisEstimate" : 0,
"works" : 5,
"advanced" : 4,
...,
"keyPattern" : {
"datetime" : -1,
"location" : "2dsphere"
},
"indexName" : "datetime_-1_location_2dsphere",
...,
"indexVersion" : 2,
...,
"keysExamined" : 4,
...
}
... shows that it used the index to examine 4 index keys and then returned 4 documents to the FETCH stage. This tells us that the index did not provide any selectivity; the selectivity was provided by the FETCH stage which followed the IXSCAN. This is effectively what the COLLSCAN does, but without the redundant IXSCAN. This might explain why MongoDB preferred a COLLSCAN, but why did the IXSCAN do nothing? I suspect this is because the 2dsphere index cannot be used to answer queries which are missing a geo predicate over the 2dsphere field. Your query has a predicate over datetime but does not have a geo predicate over location. I think this means that MongoDB cannot use the 2dsphere index to answer the predicate over datetime. More information on the background to this in the docs. Briefly: 2dsphere indexes (version 2 and later) are sparse by default, which means that there isn't necessarily an entry in the index for every document in your collection, so if you search without the location attribute then MongoDB cannot rely on the index to satisfy the query.
You could test whether this assertion is correct by ...
updating your query to include a predicate on each of the datetime and location attributes
updating your query to include a predicate on the location attribute only
... and for each of these run the query and then examine the explain plan output to see whether the IXSCAN stage actually selected anything. If the IXSCAN stage is selective then you should see keys examined > nReturned in the explain plan output (assuming that the criteria you pass in do actually select < 4 documents!).
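The two test queries could be built along these lines. This is only a sketch: the center point and the 100 km radius are hypothetical values, and $geoWithin with $centerSphere is just one of several geo predicates you could try.

```javascript
// Hypothetical search area: [longitude, latitude] and a 100 km radius.
const center = [-46.63, -23.55];
const radiusKm = 100;
const EARTH_RADIUS_KM = 6378.1; // equatorial radius, used to convert km to radians

// $centerSphere expects its radius in radians.
const geoPredicate = {
  $geoWithin: { $centerSphere: [center, radiusKm / EARTH_RADIUS_KM] },
};
const datePredicate = { $gte: new Date("2017-10-15T00:00:00Z") };

// Test 1: predicates on both datetime and location.
const bothFields = { datetime: datePredicate, location: geoPredicate };
// Test 2: predicate on location only.
const locationOnly = { location: geoPredicate };

// In the mongo shell (not executed here):
//   db.lightningStrikes.find(bothFields).explain("executionStats")
//   db.lightningStrikes.find(locationOnly).explain("executionStats")
console.log(JSON.stringify(bothFields));
console.log(JSON.stringify(locationOnly));
```

Comparing the IXSCAN stage (if any) in the two explain outputs should show whether the geo predicate is what makes the index usable.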

Related

Under what circumstances would mongo use a compound index for a query that does not match the prefix fields of the index?

When explaining a query on a collection having these indexes:
{"user_id": 1, "req.time_us": 1}
{"user_id": 1, "req.uri":1, "req.time_us": 1}
with command like:
db.some_collection.find({"user_id":12345,"req.time_us":{"$gte":1657509059545812,"$lt":1667522903018337}}).limit(20).explain("executionStats")
The winning plan was:
"inputStage" : {
"stage" : "IXSCAN",
"nReturned" : 20,
"executionTimeMillisEstimate" : 0,
"works" : 20,
"advanced" : 20,
...
"keyPattern" : {
"user_id" : 1,
"req.uri" : 1,
"req.time_us" : 1
},
"indexName" : "user_id_1_req.uri_1_req.time_us_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"user_id" : [ ],
"req.uri" : [ ],
"req.time_us" : [ ]
},
...
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"user_id" : [
"[23456.0, 23456.0]"
],
"req.uri" : [
"[MinKey, MaxKey]"
],
"req.time_us" : [
"[1657509059545812.0, 1667522903018337.0)"
]
},
"keysExamined" : 20,
"seeks" : 1,
...
}
Why was the index user_id_1_req.uri_1_req.time_us_1 used rather than user_id_1_req.time_us_1? The official manual says a compound index supports queries that match the prefix fields of the index.
This behavior is explained in the documentation. To paraphrase:
MongoDB runs the query optimizer to choose the winning plan and executes the winning plan to completion.
During plan selection, if more than one index can satisfy a query, MongoDB runs a trial using all the valid plans to determine which one performs best. You can read more about this process here.
As of MongoDB 3.4.6, plan selection involves running the candidate plans in parallel in a "race" to see which candidate plan returns 101 results first.
So basically these 2 indexes had a mini competition and the "wrong" index won. This can happen because these competitions can be heavily skewed by data distribution for similar indexes.
(For example, imagine the first 101 documents in the collection match the query; then the "better" index will actually be slower, as it will continue to scan the index tree deeper while the "worse" index starts fetching them immediately.)
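The race can be pictured with a toy model. This is not how MongoDB implements plan selection; it only assumes that each candidate plan spends "work units" and that the first plan to reach 101 results wins. The plan definitions (planA, planB) and the helper names are hypothetical.

```javascript
// Toy model: a plan is a function that, given the work-unit number
// (1, 2, 3, ...), says whether that unit produced a result.
function worksToReach(plan, target = 101, maxWorks = 100000) {
  let results = 0;
  for (let works = 1; works <= maxWorks; works++) {
    if (plan(works)) results++;
    if (results === target) return works;
  }
  return Infinity; // target not reached within maxWorks
}

// Return the plan that needs the fewest work units to reach the target.
function raceWinner(plans, target = 101) {
  let best = null;
  for (const [name, plan] of Object.entries(plans)) {
    const works = worksToReach(plan, target);
    if (best === null || works < best.works) best = { name, works };
  }
  return best;
}

// Hypothetical candidates: planA spends 5 work units before it starts
// returning results; planB returns a result on every work unit.
const plans = {
  planA: (works) => works > 5,
  planB: () => true,
};

console.log(raceWinner(plans));
```

In this model planB wins simply because it starts returning results sooner, even though nothing says it would be the cheaper plan over the full result set.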
For cases like this I recommend using hint(), which essentially forces Mongo to use the index you deem most fit.

Understanding why MongoDB skips certain documents during a find() operation on entire collection

So, we use MongoDB at our workplace to store certain information about our customers in a collection named customers. For an ad-hoc task, I am required to iterate through the entire collection and do some processing on each document, which means that it is critical to scan through every document in the collection without missing any.
This is the query I am running -
db.customers.find({}, {"cid":1, "name":1})
The customers collection has an index on the cid field, and this is the result of execution-stats on the query -
"executionStages" : {
"stage" : "PROJECTION",
"nReturned" : 19841,
"executionTimeMillisEstimate" : 10,
"works" : 19843,
"advanced" : 19841,
"needTime" : 1,
"needYield" : 0,
"saveState" : 155,
"restoreState" : 155,
"isEOF" : 1,
"invalidates" : 0,
"transformBy" : {
"cid" : 1,
"name":1
},
"inputStage" : {
"stage" : "COLLSCAN",
"nReturned" : 19841,
"executionTimeMillisEstimate" : 0,
"works" : 19843,
"advanced" : 19841,
"needTime" : 1,
"needYield" : 0,
"saveState" : 155,
"restoreState" : 155,
"isEOF" : 1,
"invalidates" : 0,
"direction" : "forward",
"docsExamined" : 19841
}
}
The issue I am facing is that when I run this query, MongoDB doesn't include a few cids in the cursor which should ideally be present. Those cids were part of the collection before the query started running. When I run the same query again at a later date, these documents are returned, but some other documents go missing.
From what I got from reading up before asking this question, it looks like Reads may miss matching documents that are updated during the course of the read operation in MongoDB. The article seems to hint that this, however, happens only when the query uses an index and not during an entire collection scan, which is what I am doing. My query doesn't seem to use any index so I expect to not run into this issue. However, this does happen in my case as well.
So, two questions:
Is my understanding of the issue correct?
How to resolve this problem and retrieve all the existing documents in the customers collection without missing any of them?
Thanks
The article you reference mentions that if scanning over the whole collection, writes may change a document and cause a re-order of the documents of the collection if a document grows and needs to be moved. The author's solution is to use an index that will ensure no documents are missed in the cursor iteration. Thus, "natural order" can be volatile during iteration.
I suggest using a stable index for the scan. In your case,
db.customers.find({}, {"cid":1, "name":1}).hint({cid: 1})
will result in an index scan being the query planner's winning plan (Confirm with db.customers.find({}, {"cid":1, "name":1}).hint({cid: 1}).explain()).

MongoDB - aggregation performance tuning

One of my aggregation pipelines is running rather slow.
About the collection
The collection is named Document, and each document can belong to multiple campaigns and be in one of five statuses, 'a' to 'e'. A small portion of documents may belong to no campaigns, in which case the campaigns field is set to null.
Sample document:
{_id:id, campaigns:['c1', 'c2'], status:'a', ...other fields...}
Some collection stats
Number of documents: 2 million only :(
Size: 2GB
Average document size: 980 bytes.
Storage Size: 780MB
Total index size: 134MB
Number of indexes: 12
Number of fields in document: 30-40, may have array or objects.
About the Query
The query is targeting to count the number of documents per campaign per status if its status is in ['a', 'b', 'c']
[
{$match:{campaigns:{$ne:null}, status:{$in:['a','b','c']}}},
{$unwind:'$campaigns'},
{$group:{_id:{campaign:'$campaigns', status:'$status'}, total:{$sum:1}}}
]
It's expected that the aggregation is going to hit almost the whole collection.
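To make the intent of the pipeline concrete, here is a small in-memory sketch of the same three steps in plain JavaScript. The sample documents are hypothetical, and the emulation ignores edge cases of $ne:null (such as arrays that contain null).

```javascript
// Hypothetical sample documents, shaped like the one in the question.
const docs = [
  { _id: 1, campaigns: ["c1", "c2"], status: "a" },
  { _id: 2, campaigns: ["c1"], status: "a" },
  { _id: 3, campaigns: ["c2"], status: "d" }, // dropped: status not in a/b/c
  { _id: 4, campaigns: null, status: "b" },   // dropped: campaigns is null
];

function countPerCampaignAndStatus(documents) {
  const totals = new Map();
  for (const doc of documents) {
    // $match: campaigns not null (or missing) and status in ['a','b','c']
    if (doc.campaigns == null || !["a", "b", "c"].includes(doc.status)) continue;
    // $unwind + $group: one count per (campaign, status) pair
    for (const campaign of doc.campaigns) {
      const key = `${campaign}|${doc.status}`;
      totals.set(key, (totals.get(key) ?? 0) + 1);
    }
  }
  return totals;
}

const totals = countPerCampaignAndStatus(docs);
console.log(Object.fromEntries(totals));
```

Every document that passes the $match contributes one count per campaign it belongs to, which is why the pipeline has to touch almost the whole collection.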
Without an index, the aggregation takes around 8 seconds to complete.
I tried to create an index on
{campaigns:1, status:1}
The explain plan shows that the index was scanned, but the aggregation took nearly 11 seconds to complete.
Question
The index contains all the fields the aggregation needs to do the counting. Shouldn't the aggregation hit only the index? The index is only 10MB in size. How could it be slower? If an index is not the answer, are there any other recommendations for tuning the query?
Winning plan shows:
{
"stage" : "FETCH",
"filter" : {"$not" : {"campaigns" : {"$eq" : null}}},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {"campaigns" : 1.0,"status" : 1.0},
"indexName" : "campaigns_1_status_1",
"isMultiKey" : true,
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 1,
"direction" : "forward",
"indexBounds" : {
"campaigns" : ["[MinKey, null)", "(null, MaxKey]"],
"status" : [ "[\"a\", \"a\"]", "[\"b\", \"b\"]", "[\"c\", \"c\"]"]
}
}
}
If no index, winning plan:
{
"stage" : "COLLSCAN",
"filter" : {
"$and":[
{"status": {"$in": ["a", "b", "c"]}},
{"$not" : {"campaigns": {"$eq" : null}}}
]
},
"direction" : "forward"
}
Update
As requested by @Kevin, here are some details about all the other indexes, sizes in MB.
"indexSizes" : {
"_id_" : 32,
"team_1" : 8, //Single value field of ObjectId
"created_time_1" : 16, //Document publish time in source system.
"parent_1" : 2, //_id of parent document.
"by.id_1" : 13, //_id of author from a different collection.
"feedids_1" : 8, //Array, _id of ETL jobs contributing to sync of this doc.
"init_-1" : 2, //Initial load time of the doc.
"campaigns_1" : 10, //Array, _id of campaigns
"last_fetch_-1" : 13, //Last sync time of the doc.
"categories_1" : 8, //Array, _id of document categories.
"status_1" : 8, //Status
"campaigns_1_status_1" : 10 //Combined index of campaign _id and status.
},
After reading the docs from MongoDB, I found this:
The inequality operator $ne is not very selective since it often matches a large portion of the index. As a result, in many cases, a $ne query with an index may perform no better than a $ne query that must scan all documents in a collection. See also Query Selectivity.
Looking at a few different articles, using the $type operator might solve the problem.
Could you use this query:
db.data.aggregate([
{$match:{campaigns:{$type:2},status:{$in:["a","b","c"]}}},
{$unwind:'$campaigns'},
{$group:{_id:{campaign:'$campaigns', status:'$status'}, total:{$sum:1}}}])

Getting rid of _id in mongodb collection

I know it is not possible to remove the _id field in a mongodb collection. However, my collection is so large that the index on the _id field prevents me from loading the other indices into RAM. My machine has 125GB of RAM and my collection stats are as follows:
db.call_records.stats()
{
"ns" : "stc_cdrs.call_records",
"count" : 1825338618,
"size" : 438081268320,
"avgObjSize" : 240,
"storageSize" : 468641284752,
"numExtents" : 239,
"nindexes" : 3,
"lastExtentSize" : 2146426864,
"paddingFactor" : 1,
"systemFlags" : 0,
"userFlags" : 1,
"totalIndexSize" : 165290709024,
"indexSizes" : {
"_id_" : 73450862016,
"caller_id_1" : 45919923504,
"receiver_id_1" : 45919923504
},
"ok" : 1
}
When I do a query like the following:
db.call_records.find({ "$or" : [ { "caller_id": 125091840205 }, { "receiver_id" : 125091840205 } ] }).explain()
{
"clauses" : [
{
"cursor" : "BtreeCursor caller_id_1",
"isMultiKey" : false,
"n" : 401,
"nscannedObjects" : 401,
"nscanned" : 401,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"caller_id" : [
[
125091840205,
125091840205
]
]
}
},
{
"cursor" : "BtreeCursor receiver_id_1",
"isMultiKey" : false,
"n" : 383,
"nscannedObjects" : 383,
"nscanned" : 383,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"receiver_id" : [
[
125091840205,
125091840205
]
]
it takes more than 15 seconds on average to return the results. The indices for both caller_id and receiver_id should be around 90GB, which is OK. However, the 73GB index on the _id makes this query very slow.
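The sizes involved can be checked with a little arithmetic on the stats output above. This sketch assumes "125GB" means decimal gigabytes; the conclusion is the same if it is read as GiB.

```javascript
// Index sizes in bytes, copied from the db.call_records.stats() output.
const indexSizes = {
  _id_: 73450862016,
  caller_id_1: 45919923504,
  receiver_id_1: 45919923504,
};
const ramBytes = 125e9; // assumption: 125GB read as decimal gigabytes

const toGB = (bytes) => bytes / 1e9;
const secondary = indexSizes.caller_id_1 + indexSizes.receiver_id_1;
const total = secondary + indexSizes._id_;

console.log(`caller_id + receiver_id: ${toGB(secondary).toFixed(1)} GB`);
console.log(`all indexes: ${toGB(total).toFixed(1)} GB`);
console.log(`secondary indexes fit in RAM: ${secondary < ramBytes}`);
console.log(`all indexes fit in RAM: ${total < ramBytes}`);
```

The total matches totalIndexSize in the stats output: the two secondary indexes alone fit in RAM, while all three together do not.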
You are correct that you cannot remove the _id field from your documents. You also cannot remove the index on this field, so this is something you have to live with.
For some reason you start with the assumption that the _id index makes your query slow, which is unjustified and most probably wrong. This index is not used by your query; it just sits there untouched.
Few things I would try to do in your situation:
You have about 1.8 billion documents (roughly 438GB of data) in your collection; have you considered that this is the right time to start sharding your database? In my opinion you should.
Use explain with your query to actually figure out what slows it down.
Looking at your query, I would also try to do the following:
change your document from
{
... something else ...
receiver_id: 234,
caller_id: 342
}
to
{
... something else ...
participants: [342, 234]
}
where your participants are [caller_id, receiver_id] in this order; then you can put a single index on this field. I know that it will not make your indices smaller, but I hope that because you will no longer need the $or clause, you will get results faster. P.S. if you do this, do not do it in production first; test whether it gives you a significant improvement and only then change it in prod.
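A minimal sketch of that reshaping follows, using a made-up record. The helper names and the "duration" field (standing in for "... something else ...") are hypothetical, and the matches function is only a plain-JavaScript stand-in for how an equality filter on an array field behaves.

```javascript
// Convert one call record to the proposed shape (hypothetical helper).
function toParticipantsShape(doc) {
  const { caller_id, receiver_id, ...rest } = doc;
  return { ...rest, participants: [caller_id, receiver_id] };
}

// Hypothetical record; "duration" stands in for the other fields.
const before = { duration: 60, receiver_id: 234, caller_id: 342 };
const after = toParticipantsShape(before);

// An equality filter on an array field matches when any element equals
// the value, so one filter replaces the $or over the two fields.
const filter = { participants: 342 };

// Plain-JS stand-in for that matching rule.
const matches = (doc, f) => doc.participants.includes(f.participants);

// In the mongo shell (not executed here), the single index would be:
//   db.call_records.createIndex({ participants: 1 })
console.log(after, matches(after, filter));
```

The same filter matches whether the number appears as caller or receiver, at the cost of relying on array position to tell the two roles apart.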
There are a lot of potential issues here.
The first is that your indexes do not include all of the data returned. This means Mongo finds the matching entries in the index and then has to fetch each full document in order to return it (indexOnly is false in your explain output). The _id index is not involved in that lookup, so removing it, even if you could, would not help.
Second, the query includes an OR. This forces Mongo to load both indexes so that it can read them and then retrieve the documents in question.
To improve performance, I think you have just a few choices:
Add the additional elements to the indexes and restrict the data returned to what is available in the index (this would change indexOnly = true in the explain results)
Explore sharding as Skooppa.com mentioned.
Rework the query and/or the document to eliminate the OR condition.

MongoDB fulltext search not using index

We use mongoDB fulltext search to find products in our database.
Unfortunately it is incredibly slow.
The collection contains 89.114.052 documents and I have the suspicion, that the full text index is not used.
Performing a search with explain(), nscannedObjects returns 133212.
Shouldn't this be 0 if an index is used?
My index:
{
"v" : 1,
"key" : {
"_fts" : "text",
"_ftsx" : 1
},
"name" : "textIndex",
"ns" : "search.products",
"weights" : {
"brand" : 1,
"desc" : 1,
"ean" : 1,
"name" : 3,
"shop_product_number" : 1
},
"default_language" : "german",
"background" : false,
"language_override" : "language",
"textIndexVersion" : 2
}
The complete test search:
> db.products.find({ $text: { $search: "playstation" } }).limit(100).explain()
{
"cursor" : "TextCursor",
"n" : 100,
"nscannedObjects" : 133212,
"nscanned" : 133212,
"nscannedObjectsAllPlans" : 133212,
"nscannedAllPlans" : 133212,
"scanAndOrder" : false,
"nYields" : 1041,
"nChunkSkips" : 0,
"millis" : 105,
"server" : "search2:27017",
"filterSet" : false
}
Please have a look at the question you asked:
".... The collection contains 89.114.052 documents and I have the suspicion, that the full text index is not used ...."
Your "nscanned" is only 133212 documents. Of course the index is used. If it were not, then all 89,114,052 documents (written that way because this is an English locale and not German) would have been reported in "nscanned", which is what would tell you that an index is not used.
Your query is slow. Well, it seems your hardware is not up to the task of keeping 133212 documents in memory, or otherwise of having a disk fast enough to "page" effectively. But this is not a MongoDB problem but yours.
You have over 100,000 documents that match your query and even if you just want 100 then you need to accept this is how this works and MongoDB does not "give up" once you have matched 100 documents and yield control. The query pattern here finds all of the matches and then applies the "limit" to the cursor in order just to return the most recent.
Maybe some time in the future the "text" functionality might allow you to do things like you can do in the aggregate version of $geoNear and specify "minimum" and "maximum" values for a "score" in order to improve results. But right now it does not.
So either upgrade your hardware or use an external text search solution if your problem is the slow results on matching over 100,000 documents out of over 89,000,000 documents.