I have over 600k records in MongoDB.
My user schema looks like this:
{
"_id" : ObjectId,
"password" : String,
"email" : String,
"location" : Object,
"followers" : Array,
"following" : Array,
"dateCreated" : Number,
"loginCount" : Number,
"settings" : Object,
"roles" : Array,
"enabled" : Boolean,
"name" : Object
}
The following query:
db.users.find(
{},
{
name:1,
settings:1,
email:1,
location:1
}
).skip(656784).limit(10).explain()
results in this:
{
"cursor" : "BasicCursor",
"isMultiKey" : false,
"n" : 10,
"nscannedObjects" : 656794,
"nscanned" : 656794,
"nscannedObjectsAllPlans" : 656794,
"nscannedAllPlans" : 656794,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 5131,
"nChunkSkips" : 0,
"millis" : 1106,
"server" : "shreyance:27017",
"filterSet" : false
}
After removing the projection, the same query db.users.find().skip(656784).limit(10).explain()
results in this:
{
"cursor" : "BasicCursor",
"isMultiKey" : false,
"n" : 10,
"nscannedObjects" : 656794,
"nscanned" : 656794,
"nscannedObjectsAllPlans" : 656794,
"nscannedAllPlans" : 656794,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 5131,
"nChunkSkips" : 0,
"millis" : 209,
"server" : "shreyance:27017",
"filterSet" : false
}
As far as I know, projection should always improve the performance of a query, so I am unable to understand why MongoDB is behaving like this. Can someone explain this? When should projection be used, and when not? And how is projection actually implemented in MongoDB?
You are correct that projection makes this skip query slower in MongoDB 2.6.3. This is related to an optimisation issue with the 2.6 query planner tracked as SERVER-13946.
The 2.6 query planner (as of 2.6.3) is adding SKIP (and LIMIT) stages after projection analysis, so the projection is being unnecessarily applied to results that get thrown out during the skip for this query. I tested a similar query in MongoDB 2.4.10 and the nscannedObjects was equal to the number of results returned by my limit rather than skip + limit.
There are several factors contributing to your query performance:
1) You haven't specified any query criteria ({}), so this query is doing a collection scan in natural order rather than using an index.
2) Neither query can be covered (answered from an index alone): the first because its projected fields are not all contained in an index, and the second because it has no projection at all, so full documents must be fetched.
3) You have an extremely large skip value of 656,784.
There is definitely room for improvement on the query plan, but I wouldn't expect skip values of this magnitude to be reasonable in normal usage. For example, if this were an application query for pagination with 50 results per page, your skip() value would be the equivalent of page number 13,135.
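As an aside (a hedged sketch, not part of the original answer): pagination at this depth is normally better handled as a range query on an indexed sort key than with a large skip(). The dateCreated field comes from the schema above; the index, cut-off value and page size are hypothetical.

// Assumes an index such as: db.users.ensureIndex({ dateCreated: 1 })
var lastSeen = 1400000000000;   // hypothetical dateCreated value from the last document of the previous page
db.users.find(
    { dateCreated: { $gt: lastSeen } },              // resume after the previous page instead of skipping
    { name: 1, settings: 1, email: 1, location: 1 }
).sort({ dateCreated: 1 }).limit(10)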
Unless your projection produces an "index only" (covered) query, meaning all of the projected fields are present in the index being used, you are always producing more work for the query engine.
You have to consider the process:
How do I match? Against documents or an index? Find the appropriate primary or secondary index.
Given the index, scan and find things.
Now what do I have to return? Is all of the data in the index? If not, go back to the collection and pull the documents.
That is the basic process. So unless one of those stages can be optimized in some way, things naturally "take longer".
You need to look at this the way a "server engine" is designed and understand the steps that need to be undertaken. Since none of your conditions meets anything that would make those steps optimal, you have to accept that cost.
Your "best" case, is wher only the projected fields are the fields present in the chosen index. But really, even that has the overhead of loading the index.
So choose wisely, and understand the constraints and memory requirements for what you are writing our query for. That is what "optimization" is all about.
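As a minimal sketch of that "best" case (hedged: the email index and the lookup value are assumptions, not taken from the question), a covered query against this collection could look like:

// Index containing every field that is both queried and projected.
db.users.ensureIndex({ email: 1 })
// Project only indexed fields and exclude _id so the documents never need to be fetched.
db.users.find({ email: "someone@example.com" }, { _id: 0, email: 1 }).explain()
// With 2.6-era explain output this should report indexOnly: true.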
Related
First of all: I have already read a lot of posts about MongoDB query performance, but I didn't find any good solution.
Inside the collection, the document structure looks like:
{
"_id" : ObjectId("535c4f1984af556ae798d629"),
"point" : [
-4.372925494081455,
41.367710205649544
],
"location" : [
{
"x" : -7.87297955453618,
"y" : 73.3680160842939
},
{
"x" : -5.87287143362673,
"y" : 73.3674043270052
}
],
"timestamp" : NumberLong("1781389600000")
}
My collection already has an index:
db.collection.ensureIndex({timestamp:-1})
Query looks like:
db.collection.find({ "timestamp" : { "$gte" : 1380520800000 , "$lte" : 1380546000000}})
Despite this, the response time is too high, about 20-30 seconds (the time depends on the specified query params).
Any help is useful!
Thanks in advance.
EDIT: I changed the find params, replacing them with real data.
The above query takes 46 seconds, and this is the information given by the explain() function:
{
"cursor" : "BtreeCursor timestamp_1",
"isMultiKey" : false,
"n" : 124494,
"nscannedObjects" : 124494,
"nscanned" : 124494,
"nscannedObjectsAllPlans" : 124494,
"nscannedAllPlans" : 124494,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 45,
"nChunkSkips" : 0,
"millis" : 46338,
"indexBounds" : {
"timestamp" : [
[
1380520800000,
1380558200000
]
]
},
"server" : "ip-XXXXXXXX:27017"
}
The explain output couldn't be more ideal. You found 124,494 documents via the index (nscanned) and they all were valid results, so they all were returned (n). It still wasn't an index-only (covered) query, because the whole documents are returned rather than only fields contained in the index.
The reason why this query is a bit slow could be the huge amount of data it returned. All the documents you found must be read from the hard drive (when the collection is cold), scanned, serialized, sent to the client over the network, and deserialized by the client.
Do you really need that much data for your use-case? When the answer is yes, does responsiveness really matter? I do not know what kind of application you actually want to create, but I am wildly guessing that yours is one of three use-cases:
You want to show all that data in the form of some kind of report. That would mean the output would be a huge list the user has to scroll through. In that case I would recommend using pagination: only load as much data as fits on one screen and provide next and previous buttons. MongoDB pagination can be done with the cursor methods .limit(n) and .skip(n).
The above, but it is some kind of offline-report the user can download and then examine with all kinds of data-mining tools. In that case the initial load-time would be acceptable, because the user will spend some time with the data they received.
You don't want to show all of that raw data to the user, but process it and present it in some kind of aggregated way, like a statistic or a diagram. In that case you could likely do all that work on the database side with the aggregation framework.
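One possible sketch of that aggregation approach (hedged: it assumes timestamp is milliseconds since the epoch, as the sample document suggests, and it simply counts documents per day within the queried range):

db.collection.aggregate([
    { $match: { timestamp: { $gte: 1380520800000, $lte: 1380546000000 } } },
    // Turn the numeric timestamp into a date by adding it to the epoch, then bucket by day.
    { $project: { day: { $dayOfMonth: { $add: [ new Date(0), "$timestamp" ] } } } },
    { $group: { _id: "$day", count: { $sum: 1 } } }
])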
I have an $or query which I'm currently using for a semi-large update. Essentially my collection is split into two data sets:
1 main repository and 1 subset of the main repository. This is just to allow quicker searching on a small subset of data.
I'm finding, however, that the query I use to pull things into the subset is timing out, and when looking at the explain it looks like two queries are actually happening.
PRIMARY> var date = new Date(2012,05,01);
PRIMARY> db.col.find(
{"$or":[
{"date":{"$gt":date}},
{"keywords":{"$in":["Help","Support"]}}
]}).explain();
This produces:
{
"clauses" : [
{
"cursor" : "BtreeCursor ldate_-1",
"nscanned" : 1493872,
"nscannedObjects" : 1493872,
"n" : 1493872,
"millis" : 1035194,
"nYields" : 3396,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {
"ldate" : [
[
ISODate("292278995-01--2147483647T07:12:56.808Z"),
ISODate("2012-06-01T07:00:00Z")
]
]
}
},
{
"cursor" : "BtreeCursor keywords_1 multi",
"nscanned" : 88526,
"nscannedObjects" : 88526,
"n" : 2515,
"millis" : 1071902,
"nYields" : 56,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {
"keywords" : [
[
"Help",
"Help"
],
[
"Support",
"Support"
]
]
}
}
],
"nscanned" : 1582398,
"nscannedObjects" : 1582398,
"n" : 1496387,
"millis" : 1071902
}
Is there something I can be indexing better to make this faster? Seems just way too slow...
Thanks ahead of time!
An $or query will evaluate each clause separately and combine the results to remove duplicates, so if you want to optimize the query you should first try to explain() each clause individually.
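For example, reusing the query from the question:

// Explain each $or clause on its own to see how each index performs.
var date = new Date(2012,05,01);
db.col.find({ "date": { "$gt": date } }).explain();
db.col.find({ "keywords": { "$in": ["Help", "Support"] } }).explain();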
It looks like part of the problem is that you are retrieving a large number of documents while actively writing to that collection, as evidenced by the high nYields (3396). It would be worth reviewing mongostat output while the query is running to consider other factors such as page faulting, lock %, and read/write queues.
If you want to make this query faster for a large number of documents and very active collection updates, two best practice approaches to consider are:
1) Pre-aggregation
Essentially this is updating aggregate stats as documents are inserted/updated so you can make fast real-time queries. The MongoDB manual describes this use case in more detail: Pre-Aggregated Reports.
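A hedged sketch of the idea (the daily_stats collection, its _id format and the counter fields are all hypothetical):

// Bump per-day counters with an upsert whenever a matching document is written.
db.daily_stats.update(
    { _id: "2012-06-01" },                        // one pre-aggregated document per day
    { $inc: { total: 1, "keywords.Help": 1 } },   // increment whichever counters apply
    { upsert: true }
)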
2) Incremental Map/Reduce
An incremental Map/Reduce approach can be used to calculate aggregate stats in successive batches (for example, from an hourly or daily cron job). With this approach you perform a Map/Reduce using the reduce output option to save results to a new collection, and include a query filter that only selects documents that have been created/updated since the last time this Map/Reduce job was run.
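A hedged sketch of that pattern (the map/reduce functions, the output collection name and the lastRun bookkeeping are placeholders):

// Only process documents created/updated since the last batch.
var lastRun = ISODate("2012-06-01T00:00:00Z");
var mapFn = function () { emit(this.date.getDate(), 1); };            // placeholder map function
var reduceFn = function (key, values) { return Array.sum(values); };  // placeholder reduce function
db.col.mapReduce(mapFn, reduceFn, {
    query: { date: { $gt: lastRun } },
    out: { reduce: "col_stats" }   // merge the new results into the existing stats collection
});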
I think you should create a compound index on both date and keywords. Refer to the post below for more specifics based on your use case:
how to structure a compound index in mongodb
I have MongoDB running on an 8GB Linux machine. Currently it's in test mode, so there are very few other requests coming in, if any at all.
I have a collection items with 1 million documents in it. I am creating an index on the fields PeerGroup and CategoryIds (an array of 3-6 elements, which will yield a multikey index): db.items.ensureIndex({PeerGroup:1, CategoryIds:1}).
When I am querying
db.items.find({"CategoryIds" : new BinData(3,"xqScEqwPiEOjQg7tzs6PHA=="), "PeerGroup" : "anonymous"}).explain()
I have the following results:
{
"cursor" : "BtreeCursor PeerGroup_1_CategoryIds_1",
"isMultiKey" : true,
"n" : 203944,
"nscannedObjects" : 203944,
"nscanned" : 203944,
"nscannedObjectsAllPlans" : 203944,
"nscannedAllPlans" : 203944,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 1,
"nChunkSkips" : 0,
"millis" : 680,
"indexBounds" : {
"PeerGroup" : [
[
"anonymous",
"anonymous"
]
],
"CategoryIds" : [
[
BinData(3,"BXzpwVQozECLaPkJy26t6Q=="),
BinData(3,"BXzpwVQozECLaPkJy26t6Q==")
]
]
},
"server" : "db02:27017"
}
I think 680ms is not very fast. Or is this acceptable?
Also, why does it say "indexOnly:false" ?
I think 680ms is not very fast. Or is this acceptable?
That kind of depends on how big these objects are and whether this was a first run. Assuming the whole data set (including the index) you are returning fits into memory, then the next time you run this it will be an in-memory query and will return basically as fast as possible. The nscanned is high, meaning that this query is not very selective. Are most records going to have an "anonymous" value in PeerGroup? If so, and CategoryIds is more selective, then you might try an index on {CategoryIds:1, PeerGroup:1} instead (use hint() to try one versus the other).
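For instance (a sketch reusing the query from the question):

// Create the alternative index and compare both plans with hint().
db.items.ensureIndex({ CategoryIds: 1, PeerGroup: 1 })
db.items.find({ "CategoryIds" : new BinData(3,"xqScEqwPiEOjQg7tzs6PHA=="), "PeerGroup" : "anonymous" })
    .hint({ PeerGroup: 1, CategoryIds: 1 }).explain()
db.items.find({ "CategoryIds" : new BinData(3,"xqScEqwPiEOjQg7tzs6PHA=="), "PeerGroup" : "anonymous" })
    .hint({ CategoryIds: 1, PeerGroup: 1 }).explain()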
Also, why does it say "indexOnly:false"
This simply indicates that not all the fields you wish to return are in the index; the BtreeCursor indicates that the index was used for the query (a BasicCursor would mean it was not). For this to be an indexOnly query, you would need to return only the two fields in the index (that is, project {_id : 0, PeerGroup : 1, CategoryIds : 1}). That would mean it would never have to touch the documents themselves and could return everything you need from the index alone.
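A sketch of that projection (with one hedge: a covered query also requires that none of the projected fields are arrays in the matched documents, so a multikey CategoryIds would still keep indexOnly at false):

db.items.find(
    { "CategoryIds" : new BinData(3,"xqScEqwPiEOjQg7tzs6PHA=="), "PeerGroup" : "anonymous" },
    { _id: 0, PeerGroup: 1, CategoryIds: 1 }   // only fields that are part of the index
).explain()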
If I run a mongo query from the shell with explain(), get the name of the index used, and then run the same query again but with hint() specifying the same index to be used, the "millis" field from the explain plan decreases significantly.
for example
no hint provided:
>>db.event.find({ "type" : "X", "active" : true, "timestamp" : { "$gte" : NumberLong("1317498259000") }, "count" : { "$gte" : 0 } }).limit(3).sort({"timestamp" : -1 }).explain();
{
"cursor" : "BtreeCursor my_super_index",
"nscanned" : 599,
"nscannedObjects" : 587,
"n" : 3,
"millis" : 24,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : true,
"indexOnly" : false,
"indexBounds" : { ... }
}
hint provided:
>>db.event.find({ "type" : "X", "active" : true, "timestamp" : { "$gte" : NumberLong("1317498259000") }, "count" : { "$gte" : 0 } }).limit(3).sort({"timestamp" : -1 }).hint("my_super_index").explain();
{
"cursor" : "BtreeCursor my_super_index",
"nscanned" : 599,
"nscannedObjects" : 587,
"n" : 3,
"millis" : 2,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : true,
"indexOnly" : false,
"indexBounds" : { ... }
}
The only difference is the "millis" field.
Does anyone know why that is?
UPDATE: "Selecting which index to use" doesn't explain it, because Mongo, as far as I know, re-selects the index only once every X (100?) runs, so the next (X-1) runs should be as fast as with a hint.
Mongo uses an algorithm to determine which index to use when no hint is provided, and then caches the chosen index for similar queries for the next 1,000 calls.
But whenever you explain a Mongo query it always runs the index selection algorithm, so explain() with a hint will always take less time than explain() without a hint.
A similar question was answered here:
Understanding mongo db explain
Mongo did the same search both times, as you can see from the number of scanned objects. You can also see that the index used was the same (take a look at the "cursor" entry); both already used your my_super_index index.
hint() only tells Mongo to use that specific index, which it already chose automatically in the first query.
The second search was simply faster because all the data was probably already in the cache.
I struggled to find the reason for the same thing. I found that when we have lots of indexes, Mongo does indeed take more time than with a hint. Mongo basically spends a lot of time deciding which index to use. Think of a scenario where you have 40 indexes and you run a query. The first thing Mongo needs to do is decide which index is best suited for that particular query. This implies Mongo has to scan all the candidate index keys and do some computation in every scan to estimate the performance if that index is used. hint() will definitely speed things up, since the index-selection scan is saved.
I will tell you how to find out why it's faster.
1) Without an index
It will pull every document into memory to get the result.
2) With an index
If you have a lot of indexes for that collection, it will take the index from cache memory.
3) With .hint(index)
It will use the specific index you have mentioned.
In both cases, with hint() and without hint(), run .explain("executionStats").
With hint() you can check the totalKeysExamined value; it will match totalDocsExamined.
Without hint() you can see that totalKeysExamined is greater than totalDocsExamined.
totalDocsExamined will usually match the result count exactly.
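For example (a sketch based on the query from the question; the executionStats verbosity is available in newer shells):

db.event.find({ "type" : "X", "active" : true, "timestamp" : { "$gte" : NumberLong("1317498259000") }, "count" : { "$gte" : 0 } })
    .limit(3).sort({ "timestamp" : -1 }).hint("my_super_index").explain("executionStats")
// Compare executionStats.totalKeysExamined with executionStats.totalDocsExamined,
// then run the same explain("executionStats") without hint() and compare again.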
I have a compound _id containing 3 numeric properties:
_id": {
"KeyA": 0,
"KeyB": 0,
"KeyC": 0
}
The database in question has 2 million identical values for KeyA and clusters of 500k identical values for KeyB.
My understanding is that I can efficiently query for KeyA and KeyB using the command:
find( { "_id.KeyA" : 1, "_id.KeyB": 3 } ).limit(100)
When I explain this query the result is:
"cursor" : "BasicCursor",
"nscanned" : 1000100,
"nscannedObjects" : 1000100,
"n" : 100,
"millis" : 1592,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {}
Without the limit() the result is:
"cursor" : "BasicCursor",
"nscanned" : 2000000,
"nscannedObjects" : 2000000,
"n" : 500000,
"millis" : 3181,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {}
As I understand it, BasicCursor means that the index has been ignored, and both queries have a high execution time: even when I've only requested 100 records it takes ~1.5 seconds. It was my intention to use the limit to implement pagination, but this is obviously too slow.
The command:
find( { "_id.KeyA" : 1, "_id.KeyB": 3, , "_id.KeyC": 1000 } )
Correctly uses the BtreeCursor and executes quickly suggesting the compound _id is correct.
I'm using release 1.8.3 of MongoDB. Could someone clarify whether I'm seeing the expected behaviour, or have I misunderstood how to use/query the compound index?
Thanks,
Paul.
The index is not a compound index, but an index on the whole value of the _id field. MongoDB does not look into an indexed field, and instead uses the raw BSON representation of a field to make comparisons (if I read the docs correctly).
To do what you want you need an actual compound index over {_id.KeyA: 1, _id.KeyB: 1, _id.KeyC: 1} (which should also be a unique index). Since you can't avoid having an index on _id, you will probably be better off leaving it as an ObjectId (that will create a smaller index and waste less space) and keeping your KeyA, KeyB and KeyC fields as top-level properties of your document, e.g. {_id: ObjectId("xyz..."), KeyA: 1, KeyB: 2, KeyC: 3}.
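A rough sketch of both options (the collection name things is hypothetical):

// Option 1: keep the compound _id but add an explicit unique compound index over its sub-fields.
db.things.ensureIndex({ "_id.KeyA": 1, "_id.KeyB": 1, "_id.KeyC": 1 }, { unique: true })

// Option 2: leave _id as an ObjectId and promote the keys to top-level fields.
db.things.insert({ KeyA: 1, KeyB: 2, KeyC: 3 })
db.things.ensureIndex({ KeyA: 1, KeyB: 1, KeyC: 1 }, { unique: true })
db.things.find({ KeyA: 1, KeyB: 3 }).limit(100)   // now uses a BtreeCursor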
You would need a separate compound index for the behavior you desire. In general I recommend against using objects as _id because key order is significant in comparisons, so {a:1, b:1} does not equal {b:1, a:1}. Since not all drivers preserve key order in objects it is very easy to shoot yourself in the foot by doing something like this:
db.foo.save(db.foo.findOne())