Strange behaviour for MongoDB covered query

If I have the following collection
{ "_id" : "a" }
{ "_id" : "b" }
{ "_id" : "c" }
Normal query
If I now run the following query
db.test.find({_id: "a"}, {_id: 1}).explain("executionStats")
It returns
"executionStats" : {
"totalKeysExamined" : 1,
"totalDocsExamined" : 1
}
Query with hint
Now for the strange part. If I run the following query
db.test.find({_id: "a"}, {_id: 1}).hint({_id: 1}).explain("executionStats")
It returns
"executionStats" : {
"totalKeysExamined" : 1,
"totalDocsExamined" : 0
}
Question
Why is the normal query examining a document, even if I only want the _id?
Server version: v3.2.1

With the hint, you are forcing the query to use the _id index instead of going through the collection. You can see a collection's indexes using
db.test.getIndexes()
In that case the mongo server (query optimizer) used the index specified (_id) and did not have to go through the collection.
Hope that helps
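The difference between the two explain outputs in the question can be checked programmatically. This is a minimal sketch: isCoveredQuery is a hypothetical helper name, and the input objects are plain copies of the executionStats fields quoted above, not live server output.

```javascript
// Hypothetical helper: a plan is treated as covered when it read at
// least one index key and fetched no documents from the collection.
function isCoveredQuery(executionStats) {
  return executionStats.totalKeysExamined > 0 &&
         executionStats.totalDocsExamined === 0;
}

// Values copied from the two explain outputs in the question.
const normalQuery = { totalKeysExamined: 1, totalDocsExamined: 1 };
const hintedQuery = { totalKeysExamined: 1, totalDocsExamined: 0 };

console.log(isCoveredQuery(normalQuery)); // false
console.log(isCoveredQuery(hintedQuery)); // true
```

In the shell, the same check could be applied to the executionStats field of the object returned by explain("executionStats").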

Related

Is searching by _id in mongoDB more efficient?

In my use case, I want to search a document by a given unique string in MongoDB. However, I want my queries to be fast and searching by _id will add some overhead. I want to know if there are any benefits in MongoDB to search a document by _id over any other unique value?
To my knowledge, an ObjectId is similar to any other unique value in a document [point made for the case of searching only].
As for the overhead, you can assume I am caching the string to objectID and the cache is very small and in memory [Almost negligible], though the DB is large.
Analyzing your query performance
I advise you to use .explain(), provided by MongoDB, to analyze your query performance.
Let's say we are trying to execute this query
db.inventory.find( { quantity: { $gte: 100, $lte: 200 } } )
This would be the result of the query execution
{ "_id" : 2, "item" : "f2", "type" : "food", "quantity" : 100 }
{ "_id" : 3, "item" : "p1", "type" : "paper", "quantity" : 200 }
{ "_id" : 4, "item" : "p2", "type" : "paper", "quantity" : 150 }
If we call .explain() this way
db.inventory.find(
{ quantity: { $gte: 100, $lte: 200 } }
).explain("executionStats")
It will return the following result:
{
"queryPlanner" : {
"plannerVersion" : 1,
...
"winningPlan" : {
"stage" : "COLLSCAN",
...
}
},
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 3,
"executionTimeMillis" : 0,
"totalKeysExamined" : 0,
"totalDocsExamined" : 10,
"executionStages" : {
"stage" : "COLLSCAN",
...
},
...
},
...
}
More details about this can be found here
How efficient is search by _id and indexes
To answer your question: for selective lookups, using an index is generally more efficient than scanning the collection. Indexes are special data structures that store a small portion of the collection's data set in an easy-to-traverse form. Because MongoDB creates an index on _id by default, searching by _id is efficient.
Without indexes, MongoDB must perform a collection scan, i.e. scan every document in a collection, to select those documents that match the query statement.
So, YES, using indexes like _id is better!
You can also create your own indexes by using createIndex()
db.collection.createIndex( <key and index type specification>, <options> )
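The difference between a collection scan and an index lookup can be illustrated with a toy in-memory model. This is only a sketch: collectionScan, indexLookup and the generated documents are hypothetical, and a Map stands in for the index (MongoDB's real indexes are B-tree structures).

```javascript
// Toy in-memory "collection"; the document shape loosely follows the
// inventory example, but the values are made up for illustration.
const inventory = [];
for (let i = 1; i <= 10; i++) {
  inventory.push({ _id: i, item: "item" + i, quantity: i * 25 });
}

// Collection scan: every document is examined to find the matches.
function collectionScan(docs, predicate) {
  let docsExamined = 0;
  const results = [];
  for (const doc of docs) {
    docsExamined++;
    if (predicate(doc)) results.push(doc);
  }
  return { results, docsExamined };
}

// Stand-in for the _id index: a Map from key to document.
const idIndex = new Map(inventory.map((doc) => [doc._id, doc]));

// Index lookup: only the matching document (if any) is touched.
function indexLookup(index, key) {
  const doc = index.get(key);
  return { results: doc ? [doc] : [], docsExamined: doc ? 1 : 0 };
}

const scan = collectionScan(inventory, (doc) => doc._id === 7);
const lookup = indexLookup(idIndex, 7);

console.log(scan.docsExamined);   // 10
console.log(lookup.docsExamined); // 1
```

Both paths return the same document; only the amount of work differs, which is what totalDocsExamined reports in the explain output.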
Optimize your MongoDB query
In case you want to optimize your query, there are multiple ways to do that.
Creating custom indexes to support your queries
Limit the Number of Query Results to Reduce Network Demand
db.posts.find().sort( { timestamp : -1 } ).limit(10)
Use Projections to Return Only Necessary Data
db.posts.find( {}, { timestamp : 1 , title : 1 , author : 1 , abstract : 1} ).sort( { timestamp : -1 } )
Use $hint to Select a Particular Index
db.users.find().hint( { age: 1 } )
Short answer, yes _id is the primary key and it's indexed. Of course it's fast.
But you can use an index on the other fields too and get more efficient queries.

MongoDB why is a compound index including 2dsphere not used

I've created a compound index:
db.lightningStrikes.createIndex({ datetime: -1, location: "2dsphere" })
But when I run the query below, MongoDB doesn't use the index and performs a COLLSCAN.
db.lightningStrikes.find({ datetime: { $gte: new Date('2017-10-15T00:00:00Z') } }).explain(true).executionStats
The full result is below:
{
"executionSuccess" : true,
"nReturned" : 2,
"executionTimeMillis" : 0,
"totalKeysExamined" : 0,
"totalDocsExamined" : 4,
"executionStages" : {
"stage" : "COLLSCAN",
"filter" : {
"datetime" : {
"$gte" : ISODate("2017-10-15T00:00:00Z")
}
},
"nReturned" : 2,
"executionTimeMillisEstimate" : 0,
"works" : 6,
"advanced" : 2,
"needTime" : 3,
"needYield" : 0,
"saveState" : 0,
"restoreState" : 0,
"isEOF" : 1,
"invalidates" : 0,
"direction" : "forward",
"docsExamined" : 4
},
"allPlansExecution" : [ ]
}
P.S. I have only 4 documents inserted.
Why does this happen?
db.lightningStrikes.find({ datetime: { $gte: new Date('2017-10-11T23:59:56Z'), $lte: new Date('2017-10-11T23:59:57Z') } }).explain(true)
Result from query above:
https://gist.github.com/anonymous/8dc084132016a1dfe0efb150201f04c7
db.lightningStrikes.find({ datetime: { $gte: new Date('2017-10-11T23:59:56Z'), $lte: new Date('2017-10-11T23:59:57Z') } }).hint("datetime_-1_location_2dsphere").explain(true)
Result from query above:
https://gist.github.com/anonymous/2b76c5a7b4b348ea7206d8b544c7d455
To help understand what MongoDB is doing here you could:
Run explain with allPlansExecution mode and have a look at the rejected plans to see why MongoDB rejected your index
Run the find with .hint(_your_index_name_) and compare the explain output with the output you got for your original (non hinted) find.
Both of these are intended to get at the same thing, namely; comparative explain plans for (1) a find with COLLSCAN and (2) a find which uses your index. By comparing these explain plans you'll likely see some difference which explains MongoDB's decision not to use your index.
More details on analysing explain plans in the docs.
You could even update your OP with the comparative plans if you need help identifying why MongoDB chose the COLLSCAN.
Update 1: looking at the explain plans you provided ...
This plan uses your index but the explain plan output ...
"inputStage" : {
"stage" : "IXSCAN",
"nReturned" : 4,
"executionTimeMillisEstimate" : 0,
"works" : 5,
"advanced" : 4,
...,
"keyPattern" : {
"datetime" : -1,
"location" : "2dsphere"
},
"indexName" : "datetime_-1_location_2dsphere",
...,
"indexVersion" : 2,
...,
"keysExamined" : 4,
...
}
... shows that it used the index to examine 4 index keys and then returned 4 documents to the FETCH stage. This tells us that the index did not provide any selectivity; the selectivity was provided by the FETCH stage which followed the IXSCAN. This is effectively what the COLLSCAN does, but without the redundant IXSCAN. This might explain why MongoDB preferred a COLLSCAN, but why did the IXSCAN do nothing?
I suspect this is because the 2dsphere index cannot be used to answer queries which are missing a geo predicate over the 2dsphere field. Your query has a predicate over datetime but does not have a geo predicate over location. I think this means that MongoDB cannot use the 2dsphere index to answer the predicate over datetime. More information on the background to this in the docs. Briefly: because the 2dsphere index is sparse, there isn't necessarily an entry in the index for every document in your collection, so if you search without the location attribute then MongoDB cannot rely on the index to satisfy the query.
You could test whether this assertion is correct by ...
updating your query to include a predicate on each of the datetime and location attributes
updating your query to include a predicate on the location attribute only
... and for each of these run the query and then examine the explain plan output to see whether the IXSCAN stage actually selected anything. If the IXSCAN stage is selective then you should see keys examined > nReturned in the explain plan output (assuming that the criteria you pass in does actually select < 4 documents!).
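The comparison of keys examined against nReturned can be automated by walking the stage tree of an explain result. This is a sketch under assumptions: findIndexScans and summarizeIndexScans are hypothetical helper names, and the sample object only copies the IXSCAN fields quoted above, wrapped in a FETCH stage whose other fields are omitted.

```javascript
// Plain object shaped like the hinted plan quoted above. Only the
// fields shown in the answer are reproduced; the rest is omitted.
const executionStages = {
  stage: "FETCH",
  inputStage: {
    stage: "IXSCAN",
    nReturned: 4,
    keysExamined: 4,
    indexName: "datetime_-1_location_2dsphere"
  }
};

// Recursively collect every IXSCAN stage in a plan tree. Stages nest
// through inputStage (single child) or inputStages (several children).
function findIndexScans(stage, found = []) {
  if (!stage) return found;
  if (stage.stage === "IXSCAN") found.push(stage);
  findIndexScans(stage.inputStage, found);
  for (const child of stage.inputStages || []) findIndexScans(child, found);
  return found;
}

// Report the numbers the answer suggests comparing for each index scan.
function summarizeIndexScans(root) {
  return findIndexScans(root).map((s) => ({
    indexName: s.indexName,
    keysExamined: s.keysExamined,
    nReturned: s.nReturned,
    examinedMoreThanReturned: s.keysExamined > s.nReturned
  }));
}

const summary = summarizeIndexScans(executionStages);
console.log(summary);
```

For the plan quoted above, the single IXSCAN examined 4 keys and returned 4, so examinedMoreThanReturned is false, matching the observation that the index scan filtered nothing.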

Mongo DB, document count mismatch for a collection

I have a collection User in mongo. When I do a count on this collection I got 13204951 documents
> db.User.count()
13204951
But when I tried to find the count of non-stale documents like this I got a count of 13208778
> db.User.find({"_id": {$exists: true, $ne: null}}).count()
13208778
> db.User.find({"UserId": {$exists: true, $ne: null}}).count()
13208778
I even tried to get the count of this collection using MongoEngine
user_list = set(User.objects().values_list('UserId'))
len(user_list)
13208778
Here are the indexes of this User collection
>db.User.getIndexes()
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "user_db.User"
},
{
"v" : 1,
"unique" : true,
"key" : {
"UserId" : 1
},
"name" : "UserId_1",
"ns" : "user_db.User",
"sparse" : false,
"background" : true
}
]
Any pointers on how to debug the mismatch in counts from different queries?
Refer to this document:
On a sharded cluster, db.collection.count() can result in an inaccurate count if orphaned documents exist or if a chunk migration is in progress.
Also, refer to this question.
If you are not using a sharded cluster, you can refer to this question.
The basic idea is that db.{collection}.count() without a query may take shortcuts to return a count quickly, so it might not be accurate; a count() with a query should be accurate.

MongoDB Why this error : can't append to array using string field name: comments

I have a DB structure like below:
{
"_id" : 1,
"comments" : [
{
"_id" : 2,
"content" : "xxx"
}
]
}
I add a new subdocument to the comments field. It works.
db.test.update(
{"_id" : 1, "comments._id" : 2},
{$push : {"comments.$.comments" : {_id : 3, content:"xxx"}}}
)
after that the DB structure:
{
"_id" : 1,
"comments" : [
{
"_id" : 2,
"comments" : [
{
"id" : 3,
"content" : "xxx"
}
],
"content" : "xxx"
}
]
}
But when I try to add a new subdocument to the comment with id 3, there is an error:
db.test.update(
{"_id" : 1, "comments.comments.id" : 3},
{$push : {"comments.comments.$.comments" : {id : 4, content:"xxx"}}}
)
error message:
can't append to array using string field name: comments
Well, it makes total sense if you think about it. MongoDB has both the advantage and the disadvantage of magically solving certain things for you.
When you query the database for a specific regular field like this:
{ field : "value" }
The query {field:"value"} makes total sense. It wouldn't if value were part of an array, but Mongo solves it for you, so if the structure is:
{ field : ["value", "anothervalue"] }
Mongo iterates through all of them and matches "value" against the field, and you don't have to think about it. It works perfectly, but only at one level, because it's impossible to guess what you want to do if you have multiple levels.
In your case the first query works because it's the case in this example:
db.test.update(
{"_id" : 1, "comments._id" : 2},
{$push : {"comments.$.comments" : {_id : 3, content:"xxx"}}}
)
It matches _id at the first level and comments._id at the second level; it gets an array as a result, but Mongo is able to resolve it.
But in the second case, think about what you need. Let's isolate the where clause:
{"_id" : 1, "comments.comments.id" : 3},
"Give me from the main collection records with _id:1" (one doc)
"And comments whose inner comments have an id=3" (array * array)
The first level, comments.id, is resolved easily. The second is not possible because comments returns an array, and one more level gives an array of arrays; Mongo gets an array of arrays as a result, and it's not possible to push a document into all the records of the array.
The solution is to narrow your where clause to obtain a unique document in comments (it could be the first one), but that's not a good solution because you never know the position of the document you're looking for. Using the shell, I think the only option to be accurate is to do it in two steps. Check this query, which works (it is not the solution anyway) but "solves" the multiple array part by fixing it to the first record:
db.test.update(
{"_id" : 1, "comments.0.comments._id" : 3},
{$push : {"comments.0.comments.$.comments" : {id : 4, content:"xxx"}}}
)
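The two-step approach mentioned above could look roughly like this: first read the document and locate the positions in application code, then build an update whose path uses explicit array indexes instead of the positional $ operator. This is a sketch; findNestedCommentPath is a hypothetical helper, and the document is a plain copy of the structure shown in the question, so no server call is made here.

```javascript
// Document shape copied from the question (after the first $push).
const doc = {
  _id: 1,
  comments: [
    { _id: 2, comments: [{ id: 3, content: "xxx" }], content: "xxx" }
  ]
};

// Step 1 (application side): find the outer and inner positions of the
// nested comment with the given id and turn them into a dotted path.
function findNestedCommentPath(document, nestedId) {
  for (let i = 0; i < document.comments.length; i++) {
    const inner = document.comments[i].comments || [];
    for (let j = 0; j < inner.length; j++) {
      if (inner[j].id === nestedId) {
        return "comments." + i + ".comments." + j + ".comments";
      }
    }
  }
  return null; // no nested comment with that id
}

// Step 2: build the update document with explicit positions.
const path = findNestedCommentPath(doc, 3);
const update = { $push: { [path]: { id: 4, content: "xxx" } } };

console.log(path);                   // comments.0.comments.0.comments
console.log(JSON.stringify(update));
```

The resulting update object would then be passed to db.test.update({"_id" : 1}, update). Note that the positions can become stale if another writer reorders the arrays between the read and the update.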

Chaining time-based sort and limit issue

Lately I've encountered some strange behaviours (i.e. meaning that they are, IMHO, counter-intuitive) while playing with mongo and sort/limit.
Let's suppose I do have the following collection:
> db.fred.find()
{ "_id" : ObjectId("..."), "record" : 1, "time" : ISODate("2011-12-01T00:00:00Z") }
{ "_id" : ObjectId("..."), "record" : 2, "time" : ISODate("2011-12-02T00:00:00Z") }
{ "_id" : ObjectId("..."), "record" : 3, "time" : ISODate("2011-12-03T00:00:00Z") }
{ "_id" : ObjectId("..."), "record" : 4, "time" : ISODate("2011-12-04T00:00:00Z") }
{ "_id" : ObjectId("..."), "record" : 5, "time" : ISODate("2011-12-05T00:00:00Z") }
What I would like is to retrieve, in time order, the 2 records previous to "record": 4 plus record 4 (i.e. record 2, record 3 and record 4)
Naively, I tried running something like:
db.fred.find({time: {$lte: ISODate("2011-12-04T00:00:00Z")}}).sort({time: -1}).limit(2).sort({time: 1})
but it does not work the way I expected:
{ "_id" : ObjectId("..."), "record" : 1, "time" : ISODate("2011-12-01T00:00:00Z") }
{ "_id" : ObjectId("..."), "record" : 2, "time" : ISODate("2011-12-02T00:00:00Z") }
I was expecting the result to be record 2, record 3 and record 4.
From what I can tell, it seems that both sorts are applied before the limit:
sort({time: -1}) => record 4, record 3, record 2, record 1
sort({time: -1}).limit(2) => record 4, record 3
sort({time: -1}).limit(2).sort({time: 1}) => record 1, record 2
i.e. it's as if the second sort was applied to the cursor returned by find (i.e. the whole set) and only then the limit was applied.
What is my mistake here and how can I achieve the expected behavior?
BTW: running mongo 2.0.1 on Ubuntu 11.01
The MongoDB shell lazily evaluates cursors, which is to say, the series of chained operations you've done results in one query being sent to the server, using the final state based on the chained operations. So when you say "sort({time: -1}).limit(2).sort({time: 1})" the second call to sort overrides the sort set by the first call.
To achieve your desired result, you're probably better off reversing the cursor output in your application code, especially if you're limiting to a small result set (here you're using 2). The exact code to do so depends on the language you're using, which you haven't specified.
Applying sort() to the same query multiple times makes no sense here. The effective sorting will be taken from the last sort() call. So
sort({time: -1}).limit(2).sort({time: 1})
is the same as
sort({time: 1}).limit(2)
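The behaviour described in both answers can be modelled with a small sketch. ToyCursor is a hypothetical class, not the shell's cursor: it only records the sort and limit settings and applies them once, when toArray() is called. Times are kept as ISO strings so that plain string comparison orders them, and the workaround uses limit(3) because the question asks for three records.

```javascript
// Records from the question, with time as ISO strings.
const records = [1, 2, 3, 4, 5].map((n) => ({
  record: n,
  time: "2011-12-0" + n + "T00:00:00Z"
}));

// Toy model of a lazily evaluated cursor: each call only stores state,
// so a later sort() replaces an earlier one. Nothing runs until toArray().
class ToyCursor {
  constructor(docs, filter) {
    this.docs = docs;
    this.filter = filter;
    this.sortSpec = null;
    this.limitN = 0;
  }
  sort(spec) { this.sortSpec = spec; return this; }
  limit(n) { this.limitN = n; return this; }
  toArray() {
    let out = this.docs.filter(this.filter);
    if (this.sortSpec) {
      const [field, dir] = Object.entries(this.sortSpec)[0];
      out.sort((a, b) =>
        (a[field] < b[field] ? -1 : a[field] > b[field] ? 1 : 0) * dir);
    }
    return this.limitN > 0 ? out.slice(0, this.limitN) : out;
  }
}

const upTo4 = (d) => d.time <= "2011-12-04T00:00:00Z";

// Chained sorts: only the last one takes effect.
const chained = new ToyCursor(records, upTo4)
  .sort({ time: -1 }).limit(2).sort({ time: 1 }).toArray();

// Workaround from the answer: sort descending, limit, then reverse the
// small result set in application code.
const wanted = new ToyCursor(records, upTo4)
  .sort({ time: -1 }).limit(3).toArray().reverse();

console.log(chained.map((d) => d.record)); // [ 1, 2 ]
console.log(wanted.map((d) => d.record));  // [ 2, 3, 4 ]
```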