MongoDB OR condition indexing

I have an $or query which I'm currently using for a semi-large update. Essentially my collection is split into two data sets:
one main repository and one subset of the main repository, which exists just to allow quicker searching over a small subset of the data.
However, I'm finding that the query I use to pull things into the subset is timing out, and when looking at the explain() output it appears two queries are actually happening.
PRIMARY> var date = new Date(2012,05,01);
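// NB: JavaScript Date months are zero-based, so this is 1 June 2012 (matching the 2012-06-01 bound in the explain output below)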
PRIMARY> db.col.find(
    {"$or":[
        {"date":{"$gt":date}},
        {"keywords":{"$in":["Help","Support"]}}
    ]}).explain();
This produces:
{
    "clauses" : [
        {
            "cursor" : "BtreeCursor ldate_-1",
            "nscanned" : 1493872,
            "nscannedObjects" : 1493872,
            "n" : 1493872,
            "millis" : 1035194,
            "nYields" : 3396,
            "nChunkSkips" : 0,
            "isMultiKey" : false,
            "indexOnly" : false,
            "indexBounds" : {
                "ldate" : [
                    [
                        ISODate("292278995-01--2147483647T07:12:56.808Z"),
                        ISODate("2012-06-01T07:00:00Z")
                    ]
                ]
            }
        },
        {
            "cursor" : "BtreeCursor keywords_1 multi",
            "nscanned" : 88526,
            "nscannedObjects" : 88526,
            "n" : 2515,
            "millis" : 1071902,
            "nYields" : 56,
            "nChunkSkips" : 0,
            "isMultiKey" : false,
            "indexOnly" : false,
            "indexBounds" : {
                "keywords" : [
                    [
                        "Help",
                        "Help"
                    ],
                    [
                        "Support",
                        "Support"
                    ]
                ]
            }
        }
    ],
    "nscanned" : 1582398,
    "nscannedObjects" : 1582398,
    "n" : 1496387,
    "millis" : 1071902
}
Is there something I can index better to make this faster? It seems just way too slow...
Thanks ahead of time!

An $or query will evaluate each clause separately and combine the results to remove duplicates, so if you want to optimize the queries you should first try to explain() each clause individually.
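For example, a quick sketch using the clauses from your own query, so you can compare nscanned and millis per clause:

PRIMARY> db.col.find({"date":{"$gt":date}}).explain()
PRIMARY> db.col.find({"keywords":{"$in":["Help","Support"]}}).explain()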
It looks like part of the problem is that you are retrieving a large number of documents while actively writing to that collection, as evidenced by the high nYields (3396). It would be worth reviewing mongostat output while the query is running to consider other factors such as page faulting, lock %, and read/write queues.
If you want to make this query faster for a large number of documents and very active collection updates, two best-practice approaches to consider are:
1) Pre-aggregation
Essentially this is updating aggregate stats as documents are inserted/updated so you can make fast real-time queries. The MongoDB manual describes this use case in more detail: Pre-Aggregated Reports.
2) Incremental Map/Reduce
An incremental Map/Reduce approach can be used to calculate aggregate stats in successive batches (for example, from an hourly or daily cron job). With this approach you perform a Map/Reduce using the reduce output option to save results to a new collection, and include a query filter that only selects documents that have been created/updated since the last time this Map/Reduce job was run.
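As a rough sketch of that incremental pattern, assuming a hypothetical keyword_counts output collection and an mr_metadata document for tracking the last run (both names are illustrative, not from the question):

// run periodically (e.g. from an hourly cron job); only touch documents newer than the last run
var meta = db.mr_metadata.findOne({ _id: "keyword_counts" });
var lastRun = meta ? meta.lastRun : new Date(0);
db.col.mapReduce(
    function () { (this.keywords || []).forEach(function (k) { emit(k, 1); }); }, // map
    function (key, values) { return Array.sum(values); },                         // reduce
    {
        query: { date: { $gt: lastRun } },  // incremental filter: only new/updated documents
        out: { reduce: "keyword_counts" }   // the "reduce" output option folds new results into existing ones
    }
);
db.mr_metadata.update({ _id: "keyword_counts" }, { $set: { lastRun: new Date() } }, { upsert: true });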

I think you should create a compound index on both date and keywords. Refer to the post below for more specifics based on your use case:
how to structure a compound index in mongodb
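For what that suggestion would look like on the query above (a sketch; the -1 direction on date mirrors the existing ldate_-1 index and is an assumption):

db.col.ensureIndex({ date: -1, keywords: 1 })

That said, keep in mind the point from the first answer: each $or clause is evaluated separately, so each clause can also be served by its own single-field index.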

Related

Projection makes query slower

I have over 600k records in MongoDB.
My user schema looks like this:
{
    "_id" : ObjectId,
    "password" : String,
    "email" : String,
    "location" : Object,
    "followers" : Array,
    "following" : Array,
    "dateCreated" : Number,
    "loginCount" : Number,
    "settings" : Object,
    "roles" : Array,
    "enabled" : Boolean,
    "name" : Object
}
The following query:
db.users.find(
    {},
    {
        name: 1,
        settings: 1,
        email: 1,
        location: 1
    }
).skip(656784).limit(10).explain()
results in this:
{
    "cursor" : "BasicCursor",
    "isMultiKey" : false,
    "n" : 10,
    "nscannedObjects" : 656794,
    "nscanned" : 656794,
    "nscannedObjectsAllPlans" : 656794,
    "nscannedAllPlans" : 656794,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 5131,
    "nChunkSkips" : 0,
    "millis" : 1106,
    "server" : "shreyance:27017",
    "filterSet" : false
}
and after removing the projection, the same query db.users.find().skip(656784).limit(10).explain()
results in this:
{
    "cursor" : "BasicCursor",
    "isMultiKey" : false,
    "n" : 10,
    "nscannedObjects" : 656794,
    "nscanned" : 656794,
    "nscannedObjectsAllPlans" : 656794,
    "nscannedAllPlans" : 656794,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 5131,
    "nChunkSkips" : 0,
    "millis" : 209,
    "server" : "shreyance:27017",
    "filterSet" : false
}
As far as I know, projection always increases the performance of a query, so I am unable to understand why MongoDB is behaving like this. Can someone explain this? When should projection be used and when not? And how is projection actually implemented in MongoDB?
You are correct that projection makes this skip query slower in MongoDB 2.6.3. This is related to an optimisation issue with the 2.6 query planner tracked as SERVER-13946.
The 2.6 query planner (as at 2.6.3) is adding SKIP (and LIMIT) stages after projection analysis, so the projection is being unnecessarily applied to results that get thrown out during the skip for this query. I tested a similar query in MongoDB 2.4.10 and the nScannedObjects was equal to the number of results returned by my limit rather than skip + limit.
There are several factors contributing to your query performance:
1) You haven't specified any query criteria ({}), so this query is doing a collection scan in natural order rather than using an index.
2) The query cannot be covered because there is no projection.
3) You have an extremely large skip value of 656,784.
There is definitely room for improvement in the query plan, but I wouldn't expect skip values of this magnitude to be reasonable in normal usage. For example, if this were an application query for pagination with 50 results per page, your skip() value would be the equivalent of page number 13,135.
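A standard alternative to large skip() values (not part of the original answer) is range-based pagination, which remembers the last _id seen and seeks past it, so only the page itself is scanned:

// first page
var page = db.users.find({}, { name: 1, settings: 1, email: 1, location: 1 })
                   .sort({ _id: 1 }).limit(10).toArray();
var lastId = page[page.length - 1]._id;
// next page: seek past the last _id instead of counting 656,784 documents
db.users.find({ _id: { $gt: lastId } }, { name: 1, settings: 1, email: 1, location: 1 })
        .sort({ _id: 1 }).limit(10)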
Unless your projection produces an "index only" (covered) query, meaning every field "projected" into the result is present in the index, you are always producing more work for the query engine.
You have to consider the process:
How do I match? On the document or the index? Find an appropriate primary or other index.
Given the index, scan and find things.
Now what do I have to return? Is all of the data in the index? If not, go back to the collection and pull the documents.
That is the basic process. So unless one of those stages is "optimized" in some way, of course things "take longer".
You need to look at this as designing a "server engine" and understand the steps that need to be undertaken. Since none of your conditions meets anything that would produce an "optimal" path through those steps, you have to accept that cost.
Your "best" case, is wher only the projected fields are the fields present in the chosen index. But really, even that has the overhead of loading the index.
So choose wisely, and understand the constraints and memory requirements for what you are writing our query for. That is what "optimization" is all about.
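A minimal sketch of that "best" case, using an assumed single-field index on email so the projection lists exactly the indexed field:

db.users.ensureIndex({ email: 1 })
db.users.find({ email: "someone@example.com" }, { _id: 0, email: 1 }).explain()
// indexOnly: true -- the query is covered and never touches the documents themselves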

MongoDB - Querying performance for over 10 million records

First of all: I have already read a lot of posts about MongoDB query performance, but I didn't find any good solution.
Inside the collection, the document structure looks like:
{
    "_id" : ObjectId("535c4f1984af556ae798d629"),
    "point" : [
        -4.372925494081455,
        41.367710205649544
    ],
    "location" : [
        {
            "x" : -7.87297955453618,
            "y" : 73.3680160842939
        },
        {
            "x" : -5.87287143362673,
            "y" : 73.3674043270052
        }
    ],
    "timestamp" : NumberLong("1781389600000")
}
My collection already has an index:
db.collection.ensureIndex({timestamp:-1})
The query looks like:
db.collection.find({ "timestamp" : { "$gte" : 1380520800000 , "$lte" : 1380546000000}})
Despite this, the response time is too high, about 20-30 seconds (the time depends on the specified query params).
Any help is useful!
Thanks in advance.
EDIT: I changed the find params, replacing them with real data.
The above query takes 46 seconds, and this is the information given by explain():
{
    "cursor" : "BtreeCursor timestamp_1",
    "isMultiKey" : false,
    "n" : 124494,
    "nscannedObjects" : 124494,
    "nscanned" : 124494,
    "nscannedObjectsAllPlans" : 124494,
    "nscannedAllPlans" : 124494,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 45,
    "nChunkSkips" : 0,
    "millis" : 46338,
    "indexBounds" : {
        "timestamp" : [
            [
                1380520800000,
                1380558200000
            ]
        ]
    },
    "server" : "ip-XXXXXXXX:27017"
}
The explain output couldn't be more ideal. You found 124,494 documents via the index (nscanned) and they all were valid results, so they all were returned (n). It still wasn't an index-only query, because whole documents were returned rather than only fields contained in the index.
The reason why this query is a bit slow could be the huge amount of data it returned. All the documents you found must be read from the hard drive (when the collection is cold), scanned, serialized, sent to the client over the network, and deserialized by the client.
Do you really need that much data for your use case? If the answer is yes, does responsiveness really matter? I do not know what kind of application you actually want to create, but I am wildly guessing that yours is one of three use-cases:
You want to show all that data in the form of some kind of report. That would mean the output is a huge list the user has to scroll through. In that case I would recommend using pagination: only load as much data as fits on one screen and provide next and previous buttons. MongoDB pagination can be done with the cursor methods .limit(n) and .skip(n).
The above, but it is some kind of offline report the user can download and then examine with all kinds of data-mining tools. In that case the initial load time would be acceptable, because the user will spend some time with the data they received.
You don't want to show all of that raw data to the user, but process it and present it in some kind of aggregated way, like a statistic or a diagram. In that case you could likely do all that work on the database already with the aggregation framework, as sketched below.
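For that third case, a rough sketch with the aggregation framework (the hourly bucketing is an assumption about what kind of statistic is wanted; the field names come from the question):

db.collection.aggregate([
    { $match: { timestamp: { $gte: 1380520800000, $lte: 1380546000000 } } },
    // bucket the millisecond timestamps into hours and count documents per hour
    { $group: {
        _id: { $subtract: [ "$timestamp", { $mod: [ "$timestamp", 3600000 ] } ] },
        count: { $sum: 1 }
    } },
    { $sort: { _id: 1 } }
])

This way only the small aggregated result crosses the network instead of all 124,494 documents.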

Indexing MongoDB for quicker find() with sort(), on different fields

I'm running lots of queries of this type:
db.mycollection.find({a:{$gt:10,$lt:100}, b:4}).sort({c:-1, a:-1})
What sort of index should I use to speed it up? I think I'll need both {a:1, b:1} and {c:-1, a:-1}, am I right? Or will these indexes somehow interfere with each other at no performance gain?
EDIT: The actual problem for me is that I run many queries in a loop, some of them over a small range, others over a large range. If I put an index on {a:1, b:1}, it selects small chunks very quickly, but when it comes to a large range I see the error "too much data for sort() with no index". If instead I put an index on {c:-1, a:-1}, there is no error, but the smaller chunks (and there are more of those) are processed much more slowly. So, how is it possible to keep the quick selection for smaller ranges, but not get the error on large amounts of data?
If it matters, I run the queries through Python's pymongo.
If you had read the documentation you would have seen that using two indexes here would be useless, since MongoDB only uses one index per query (unless it is an $or) until https://jira.mongodb.org/browse/SERVER-3071 is implemented.
Not only that, but when using a compound sort, the order in the index must match the sort order for an index to be used correctly. As for:
Or these indexes will somehow interfere with each other at no performance gain?
If index intersectioning were implemented, no, they would not; but {a:1,b:1} does not match the sort, and {c:-1,a:-1} is sub-optimal for answering the find(), plus a is not a prefix of that compound.
So immediately a first iteration of an optimal index would be:
{a:-1,b:1,c:-1}
But this isn't the full story. Since $gt and $lt are actually ranges, they suffer the same problem with indexes as $in does; this article should provide the answer: http://blog.mongolab.com/2012/06/cardinal-ins/ (I don't really see any reason to repeat its content).
Disclaimer: For MongoDB v2.4
Using hint is a nice solution, since it forces the query to use the indexes you chose, so you can optimize the query with different indexes until you are satisfied. The downside is that you are setting your own index per request.
I prefer to set the indexes on the entire collection and let Mongo choose the correct (fastest) index for me, especially for queries that are used repeatedly.
You have two problems in your query:
Never sort on params that are not indexed. You will get the error "too much data for sort() with no index" if the number of documents returned by your .find() is very big (the threshold depends on the version of mongo you use). This means that you must have indexes on A and C in order for your query to work.
Now for the bigger problem. You are performing a range query ($lt and $gt on param A), which can't work well here: MongoDB only uses one index at a time, and you are effectively asking for two bounds on the same parameter. There are several solutions to deal with this in your code:
Replace the range with an explicit $in list (shown in pymongo, since you run your queries through Python):
r = list(range(11, 100))
db.mycollection.find({"a": {"$in": r}, "b": 4}).sort([("c", -1), ("a", -1)])
Use only $lt or $gt in your query:
db.mycollection.find({ a: { $lt:100 }, b:4}).sort({c:-1, a:-1})
Get the results and filter them in your Python code.
This solution will return more data, so if you have millions of results that are less than A=11, don't use it!
If you choose this option, make sure you use a compound key with A and B.
Pay attention when using $or in your queries, since $or is less efficiently optimized than $in in its usage of indexes.
If you define an index {c:-1,a:-1,b:1} it will help, with some considerations.
With this option the index will be fully scanned, but based on the index values only the appropriate documents will be visited, and they will be visited in the right order, so no ordering phase will be needed after getting the results. If the index is huge I do not know how it will behave, but I assume that when the result set is small it will be slower, and when the result set is big it will be faster.
About prefix matching: if you hint the index and lower levels are usable to serve the query, those levels will be used. To demonstrate this behaviour I made a short test.
I prepared test data with:
> db.createCollection('testIndex')
{ "ok" : 1 }
> db.testIndex.ensureIndex({a:1,b:1})
> db.testIndex.ensureIndex({c:-1,a:-1})
> db.testIndex.ensureIndex({c:-1,a:-1,b:1})
> for(var i=1;i++<500;){db.testIndex.insert({a:i,b:4,c:i+5});}
> for(var i=1;i++<500;){db.testIndex.insert({a:i,b:6,c:i+5});}
The result of the query with hint:
> db.testIndex.find({a:{$gt:10,$lt:100}, b:4}).hint('c_-1_a_-1_b_1').sort({c:-1, a:-1}).explain()
{
    "cursor" : "BtreeCursor c_-1_a_-1_b_1",
    "isMultiKey" : false,
    "n" : 89,
    "nscannedObjects" : 89,
    "nscanned" : 588,
    "nscannedObjectsAllPlans" : 89,
    "nscannedAllPlans" : 588,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 1,
    "indexBounds" : {
        "c" : [
            [
                {
                    "$maxElement" : 1
                },
                {
                    "$minElement" : 1
                }
            ]
        ],
        "a" : [
            [
                100,
                10
            ]
        ],
        "b" : [
            [
                4,
                4
            ]
        ]
    },
    "server" : ""
}
Explanation of the output: the whole index is scanned, which is why nscanned is 588 (the number of scanned index entries), while nscannedObjects is the number of scanned documents (89). So based on the index, mongo only reads those documents which match the criteria (the index partially covers the query, so to speak). As you can see, scanAndOrder is false, so there is no sorting phase (which implies that if the index is in memory this will be fast).
In line with the article others linked (http://blog.mongolab.com/wp-content/uploads/2012/06/IndexVisitation-4.png): you have to put the sort keys first in the index and the query keys after them; if the query keys overlap the sort keys, you have to include the overlapping keys in the very same order as in the sorting criteria (while order does not matter for the pure query keys).
I think it would be better to change the order of the fields in find:
db.mycollection.find({b:4, a:{$gt:10,$lt:100}}).sort({c:-1, a:-1})
and then add the index
{b:1,a:-1,c:-1}
I tried two different indexes:
one with the index in the order db.mycollection.ensureIndex({a:1,b:1,c:-1}),
for which the explain plan was as below:
{
    "cursor" : "BtreeCursor a_1_b_1_c_-1",
    "nscanned" : 9542,
    "nscannedObjects" : 1,
    "n" : 1,
    "scanAndOrder" : true,
    "millis" : 36,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : {
        "a" : [
            [
                3,
                10000
            ]
        ],
        "b" : [
            [
                4,
                4
            ]
        ],
        "c" : [
            [
                {
                    "$maxElement" : 1
                },
                {
                    "$minElement" : 1
                }
            ]
        ]
    }
}
and the other with db.mycollection.ensureIndex({b:1,c:-1,a:-1}):
> db.mycollection.find({a:{$gt:3,$lt:10000},b:4}).sort({c:-1, a:-1}).explain()
{
    "cursor" : "BtreeCursor b_1_c_-1_a_-1",
    "nscanned" : 1,
    "nscannedObjects" : 1,
    "n" : 1,
    "millis" : 8,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : {
        "b" : [
            [
                4,
                4
            ]
        ],
        "c" : [
            [
                {
                    "$maxElement" : 1
                },
                {
                    "$minElement" : 1
                }
            ]
        ],
        "a" : [
            [
                10000,
                3
            ]
        ]
    }
}
>
I believe, since you are querying 'a' over a range of values and 'b' for a specific value, the second option is more appropriate: nscanned changed from 9542 to 1.

Indexing with mongodb: bad performance / indexOnly=false

I have MongoDB running on an 8GB Linux machine. Currently it's in test mode, so there are very few other requests coming in, if any at all.
I have a collection items with 1 million documents in it. I am creating an index on the fields PeerGroup and CategoryIds (which is an array of 3-6 elements, and will therefore yield a multikey index): db.items.ensureIndex({PeerGroup:1, CategoryIds:1}).
When I am querying
db.items.find({"CategoryIds" : new BinData(3,"xqScEqwPiEOjQg7tzs6PHA=="), "PeerGroup" : "anonymous"}).explain()
I have the following results:
{
    "cursor" : "BtreeCursor PeerGroup_1_CategoryIds_1",
    "isMultiKey" : true,
    "n" : 203944,
    "nscannedObjects" : 203944,
    "nscanned" : 203944,
    "nscannedObjectsAllPlans" : 203944,
    "nscannedAllPlans" : 203944,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 1,
    "nChunkSkips" : 0,
    "millis" : 680,
    "indexBounds" : {
        "PeerGroup" : [
            [
                "anonymous",
                "anonymous"
            ]
        ],
        "CategoryIds" : [
            [
                BinData(3,"BXzpwVQozECLaPkJy26t6Q=="),
                BinData(3,"BXzpwVQozECLaPkJy26t6Q==")
            ]
        ]
    },
    "server" : "db02:27017"
}
I think 680ms is not very fast. Or is this acceptable?
Also, why does it say "indexOnly:false"?
I think 680ms is not very fast. Or is this acceptable?
That kind of depends on how big these objects are and whether this was a first run. Assuming the whole data set (including the index) you are returning fits into memory, then the next time you run this it will be an in-memory query and will then return basically as fast as possible. The nscanned is high, meaning that this query is not very selective; are most records going to have an "anonymous" value in PeerGroup? If so, and CategoryIds is more selective, then you might try an index on {CategoryIds:1, PeerGroup:1} instead (use hint() to try one versus the other).
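A sketch of that comparison (the reversed index has to be created first; whether it wins depends on how selective CategoryIds is in your data):

db.items.ensureIndex({ CategoryIds: 1, PeerGroup: 1 })
db.items.find({ "CategoryIds" : new BinData(3,"xqScEqwPiEOjQg7tzs6PHA=="), "PeerGroup" : "anonymous" })
        .hint({ CategoryIds: 1, PeerGroup: 1 })
        .explain()
// compare nscanned and millis against the {PeerGroup:1, CategoryIds:1} plan above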
Also, why does it say "indexOnly:false"?
This simply indicates that not all the fields you wish to return are in the index; the BtreeCursor indicates that an index was used for the query (a BasicCursor would mean one had not been). For this to be an indexOnly query, you would need to be returning only the two fields in the index (that is: {_id : 0, PeerGroup:1, CategoryIds:1}) in your projection. That would mean that it would never have to touch the data itself and could return everything you need from the index alone.
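A sketch of that projection (with one caveat not in the original answer: CategoryIds is an array, and a multikey index cannot fully cover a query, so indexOnly may still report false here):

db.items.find(
    { "CategoryIds" : new BinData(3,"xqScEqwPiEOjQg7tzs6PHA=="), "PeerGroup" : "anonymous" },
    { _id : 0, PeerGroup : 1, CategoryIds : 1 }
).explain()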

Why does Mongo hint make a query run up to 10 times faster?

If I run a mongo query from the shell with explain(), get the name of the index used, and then run the same query again but with hint() specifying that same index, the "millis" field in the explain plan decreases significantly.
For example, with no hint provided:
>>db.event.find({ "type" : "X", "active" : true, "timestamp" : { "$gte" : NumberLong("1317498259000") }, "count" : { "$gte" : 0 } }).limit(3).sort({"timestamp" : -1 }).explain();
{
    "cursor" : "BtreeCursor my_super_index",
    "nscanned" : 599,
    "nscannedObjects" : 587,
    "n" : 3,
    "millis" : 24,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : true,
    "indexOnly" : false,
    "indexBounds" : { ... }
}
and with the hint provided:
>>db.event.find({ "type" : "X", "active" : true, "timestamp" : { "$gte" : NumberLong("1317498259000") }, "count" : { "$gte" : 0 } }).limit(3).sort({"timestamp" : -1 }).hint("my_super_index").explain();
{
    "cursor" : "BtreeCursor my_super_index",
    "nscanned" : 599,
    "nscannedObjects" : 587,
    "n" : 3,
    "millis" : 2,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : true,
    "indexOnly" : false,
    "indexBounds" : { ... }
}
The only difference is the "millis" field.
Does anyone know why that is?
UPDATE: "Selecting which index to use" doesn't explain it because, as far as I know, mongo re-selects the index only every X (100?) runs, so the next (X-1) runs should be as fast as with hint.
Mongo uses an algorithm to determine which index to use when no hint is provided, and then caches the chosen index for similar queries for the next 1000 calls.
But whenever you explain() a mongo query it will always run the index selection algorithm, so explain() with hint will always take less time than explain() without hint.
A similar question was answered here:
Understanding mongo db explain
Mongo did the same search both times, as you can see from the number of scanned objects. Also, you can see that the index used was the same (take a look at the "cursor" entry); both runs already used your my_super_index index.
"hint" only tells Mongo to use that specific index, which it already did automatically in the first query.
The second search was simply faster because all the data was probably already in the cache.
I struggled to find the reason for the same thing. I found that when we have lots of indexes, mongo can indeed take more time than with a hint. Mongo basically spends a lot of time deciding which index to use. Think of a scenario where you have 40 indexes and you run a query: the first task Mongo needs to do is decide which index is best suited to the particular query. This implies mongo needs to evaluate all the candidate keys, doing some computation for each to estimate the performance if that key were used. hint will definitely speed things up, since the index-key evaluation is saved.
I will tell you how to see why it's faster:
1) Without an index
It will pull every document into memory to get the result.
2) With an index
If you have a lot of indexes for that collection, it will take the index from cache memory.
3) With .hint(_index)
It will take the specific index which you have mentioned.
Both with hint() and without hint(), run .explain("executionStats").
With hint(), check the totalKeysExamined value; it will match totalDocsExamined.
Without hint(), you will see that totalKeysExamined is greater than totalDocsExamined.
totalDocsExamined will perfectly match the result count most of the time.
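A sketch of that check, reusing the event query from the question (explain("executionStats") is the newer explain form this answer refers to):

db.event.find({ "type" : "X", "active" : true, "timestamp" : { "$gte" : NumberLong("1317498259000") }, "count" : { "$gte" : 0 } })
        .limit(3).sort({ "timestamp" : -1 })
        .hint("my_super_index")
        .explain("executionStats")
// compare executionStats.totalKeysExamined with executionStats.totalDocsExamined,
// then run the same explain without .hint() and compare again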