From the comments on the question "how does Mongodb index works?", I know how to create an index for a sort operation.
But I want to know: when we create a joint index on a & b, how does it work differently from a simple index?
And why do we benefit when we search on a alone, but get no benefit when we search on b alone? Is a joint index just like concatenating a & b, so that only a prefix can benefit from it?
1. But I want to know, when we create a joint index on 'a' & 'b', how does it work differently from a simple index?
MongoDB only uses a single index per query .. so if your find() criteria includes both a and b values, you should add a compound index to efficiently search both fields.
2. And why do we benefit when we just find 'a', but get no benefit when we find 'b'? Is a joint index just like concatenating 'a' & 'b', so that only a prefix benefits from it?
MongoDB uses B-tree indexes, so you can only efficiently match partial keys using a prefix. To find all possible values matching a suffix or substring, all index entries would have to be checked.
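To make the prefix point concrete, here is a minimal sketch in plain JavaScript (runnable in Node, no MongoDB needed). It models a compound index on (a, b) as an array sorted by a, then b. The names buildIndex, lowerBoundA, findByA and findByB are hypothetical helpers for illustration, not MongoDB APIs, and the full scan in findByB is a simplified worst case rather than what the server literally does.

```javascript
// Hypothetical in-memory model of a compound index on (a, b):
// entries are kept sorted by a, then by b, like keys in a B-tree.
function buildIndex(docs) {
  return docs
    .map((d) => [d.a, d.b])
    .sort((x, y) => x[0] - y[0] || x[1] - y[1]);
}

// Binary search: first position whose a-value is >= target.
function lowerBoundA(index, target) {
  let lo = 0;
  let hi = index.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (index[mid][0] < target) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Prefix query on a: jump to the first match, then read adjacent entries.
function findByA(index, a) {
  let scanned = 0;
  const hits = [];
  for (let i = lowerBoundA(index, a); i < index.length && index[i][0] === a; i++) {
    scanned++;
    hits.push(index[i]);
  }
  return { hits, scanned };
}

// Query on b alone: matches are scattered across the index,
// so this simplified model has to check every entry.
function findByB(index, b) {
  let scanned = 0;
  const hits = [];
  for (const entry of index) {
    scanned++;
    if (entry[1] === b) hits.push(entry);
  }
  return { hits, scanned };
}

// Assumed sample data: 100 values of a, each with 10 values of b.
const prefixDocs = [];
for (let a = 0; a < 100; a++) {
  for (let b = 0; b < 10; b++) prefixDocs.push({ a, b });
}
const prefixIndex = buildIndex(prefixDocs);

console.log(findByA(prefixIndex, 42).scanned); // 10 entries touched
console.log(findByB(prefixIndex, 7).scanned);  // 1000 entries touched
```

The search on a touches only the entries it returns, because they sit next to each other in sort order. The search on b returns 100 entries but has to look at all 1000.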
Set up test data to compare
Examples below are using the mongo shell:
/* Generate some test data */
for (var i = 0; i < 1000; i++) {
    db.mycoll.insert({a: i});
    db.mycoll.insert({b: i});
    db.mycoll.insert({a: i, b: i});
}
Now add some sample indexes:
/* Add simple and compound index */
db.mycoll.ensureIndex({a:1})
db.mycoll.ensureIndex({b:1})
db.mycoll.ensureIndex({a:1, b:1})
Finally, for the test scenarios below force your query to use a specific index with $hint and compare the explain() results.
Search for b using simple index
A search for b using the simple index on b can find matching entries directly in the index .. it scans 4 index entries (nscanned) to return 4 results (n):
db.mycoll.find({b:10}).hint({b:1}).explain()
{
"cursor" : "BtreeCursor b_1",
"n" : 4,
"nscannedObjects" : 4,
"nscanned" : 4,
...
}
Search for b using compound index (a,b)
A search for b using the compound index on (a,b) has to check every a value in the index, because the first part of the index is the key value of a.
So instead of finding matching entries directly in the index, it scans 1904 index entries (nscanned) to return 4 results (n):
db.mycoll.find({b:10}).hint({a:1,b:1}).explain()
{
"cursor" : "BtreeCursor a_1_b_1",
"n" : 4,
"nscannedObjects" : 4,
"nscanned" : 1904,
...
}
Technically, scanning 1,904 index entries is less than the 3,000 documents in my test collection .. but this is far from optimal.
Search for a using compound index (a,b)
For comparison, a search of a using the compound index shows that only 4 values need to be scanned to return 4 documents:
db.mycoll.find({a:10}).hint({a:1,b:1}).explain()
{
"cursor" : "BtreeCursor a_1_b_1",
"n" : 4,
"nscannedObjects" : 4,
"nscanned" : 4,
"nscannedObjectsAllPlans" : 4,
...
}
For some further examples and explanation, I would recommend reading the article Optimizing MongoDB Compound Indexes.
I am using MongoDB 2.4.10, and I have a collection of four million records, and a query that creates a subset of no more than 50000 even for our power users. I need to select a random 30 items from this subset, and, given the potential performance issues with skip and limit (especially when doing it 30 times with random skip amounts from 1-50000), I stumbled across the following solution:
Create a field for each record which is a completely random number
Create an index over this field
Sort by the field, and use skip(X).limit(30) to get a page of 30 items that, while consecutive in terms of the random field, actually bear no relation to each other. To the user, they seem random.
My index looks like this:
{a: 1, b: 1, c: 1, d: 1}
I also have a separate index:
{d : 1}
'd' is the randomised field.
My query looks like this:
db.content.find({a : {$in : ["xyz", "abc"]}, b : "ok", c : "Image"})
.sort({d : 1}).skip(X).limit(30)
When the collection is small, this works perfectly. However, on our performance and live systems, this query fails, because instead of using the a, b, c, d index, it uses this index only:
{d : 1}
As a result, the query ends up scanning more records than it needs to (by a factor of 25). So, I introduced hint:
db.content.find({a : {$in : ["xyz", "abc"]}, b : "ok", c : "Image"})
.hint({a : 1, b : 1, c : 1, d : 1}).sort({d : 1}).skip(X).limit(30)
This now works great with all values of X up to 11000, and explain() shows the correct index in use. But, when the skip amount exceeds 11000, I get:
{
"$err" : "too much data for sort() with no index. add an index or specify a smaller limit",
"code" : 10128
}
Presumably, the risk of hitting this error is why the query (without the hint) wasn't using this index earlier. So:
Why does Mongo think that the sort has no index to use, when I've forced it to use an index that explicitly includes the sorting field at the end?
Is there a better way of doing this?
index_dict = {
"column_7" : 1,
"column_6" : 1,
"column_5" : 1,
"column_4" : 1,
"column_3" : 1,
"column_2" : 1,
"column_1" : 1,
"column_9" : 1,
"column_8" : 1,
"column_11" : 1,
"column_10" : 1
}
db.dataCustom.ensureIndex(index_dict, {unique: true, dropDups: true})
How many columns is the limit for index_dict? I need to know it for the implementation but I am not able to find the answer online
I must be honest, I know of no limitation on the number of columns. There is a logical limitation though.
An index is a very heavy thing. Putting an immense index on a collection, or even one that covers close to the full width of the table, will create performance problems, mostly because the index becomes so large in both space and number of values.
You must be aware of this. However, to answer your question: I know of no limit. There is a limit on the size of a field within an index (1024 bytes including the name), but I know of none on the number of columns in an index.
Edit
It appears I wrote this answer a little fast, there is a limit of 31 fields: http://docs.mongodb.org/manual/reference/limits/#Number%20of%20Indexed%20Fields%20in%20a%20Compound%20Index I guess I just have never reached that number :/
A collection may have no more than 64 indexes. See the MongoDB limits documentation for more details.
But I was wondering why you want so many indexes?
EDIT
There can be no more than 31 fields in a compound index.
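As a quick sanity check before calling ensureIndex, the field count of an index spec can be compared against that limit. This is only a sketch: fitsCompoundIndexLimit and MAX_COMPOUND_INDEX_FIELDS are hypothetical names, and the value 31 is simply the limit quoted in the answers above, so check the limits page for your server version.

```javascript
// Limit quoted in the answers above: at most 31 fields per compound index.
// (Assumption: verify against the limits documentation for your version.)
const MAX_COMPOUND_INDEX_FIELDS = 31;

// Hypothetical helper: true if the index spec stays within the limit.
function fitsCompoundIndexLimit(indexSpec) {
  return Object.keys(indexSpec).length <= MAX_COMPOUND_INDEX_FIELDS;
}

// Build a spec shaped like the question's index_dict (column_1 .. column_11).
const indexDict = {};
for (let i = 1; i <= 11; i++) {
  indexDict["column_" + i] = 1;
}

console.log(Object.keys(indexDict).length);     // 11 indexed fields
console.log(fitsCompoundIndexLimit(indexDict)); // true
```

The 11-field index from the question is well inside the limit.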
Currently I have an index on a geospatial field in one of my collections set up like this:
collection.ensureIndex({ loc : "2d" }, { min : -10000 , max : 10000,
bits : 32, unique:true});
However, I would like to include one more field in the index, so I can take advantage of covered index queries in one of my use cases. An ensureIndex with multiple fields (compound index) looks something like this:
collection.ensureIndex( { username : 1, password : 1, roles : 1} );
Question is - how do I write my first index spec with an additional field, so that I keep my min/max parameters? In other words, how to specify that the min/max/bits only apply to one of the index fields? My best guess so far is:
collection.ensureIndex({ loc : "2d", field2 : 1 },
{ min : -10000 , max : 10000 , bits : 32, unique:true});
But I have no confidence that this is working as it should!
UPDATE:
There is more info in the documentation here, but it still does not explicitly show how to specify the min/max in this case.
db.collection.getIndexes() should give you the parameters used to construct the index - the min/max/bits field will only ever apply to the "2d" field.
There's no way to see these parameters aside from getIndexes(), but you can easily verify that you aren't allowed to insert the same location/field pair, and that you aren't allowed to insert locs outside your bounds (but field2 can be anything). The bits setting is harder to verify directly, though you can indirectly verify by setting it very low and seeing that nearby points then trigger the unique key violations.
If you have a double compound index { a : 1, b : 1}, it makes sense to me that the index won't be used if you query on b alone (i.e. you cannot "skip" a in your query). The index will however be used if you query on a alone.
However, given a triple compound index { a : 1, b: 1, c: 1} my explain command is showing that the index is used when you query on a and c (i.e. you can "skip" b in your query).
How can Mongo use an abc index on a query for ac, and how effective is the index in this case?
Background:
My use case is that sometimes I want to query on a,b,c and sometimes I want to query on a,c. Now should I create only 1 index on a,b,c or should I create one on a,c and one on a,b,c?
(It doesn't make sense to create an index on a,c,b because c is a multi-key index with good selectivity.)
bottom line / tl;dr: Index b can be 'skipped' if a and c are queried for equality or inequality, but not, for instance, for sorts on c.
This is a very good question. Unfortunately, I couldn't find anything that authoritatively answers this in greater detail. I believe the performance of such queries has improved over the last years, so I wouldn't trust old material on the topic.
The whole thing is quite complicated because it depends on the selectivity on your indexes and whether you query for equality, inequality and/or sort, so explain() is your only friend, but here are some things I found:
Caveat: What comes now is a mixture of experimental results, reasoning and guessing. I might be stretching Kyle's analogy too far, and I might even be completely wrong (and unlucky, because my test results loosely match my reasoning).
It is clear that the index of A can be used, which, depending on the selectivity of A, is certainly very helpful. 'Skipping' B can be tricky, or not. Let's keep this similar to Kyle's cookbook example:
French
Beef
...
Chicken
Coq au Vin
Roasted Chicken
Lamb
...
...
If you now ask me to find some French dish called "Chateaubriand", I can use index A and, because I don't know the ingredient, will have to scan all dishes in A. On the other hand, I do know that the list of dishes in each category is sorted through the index C, so I will only have to look for the strings starting with, say, "Cha" in each ingredient-list. If there are 50 ingredients, I will need 50 lookups instead of just one, but that is a lot better than having to scan every French dish!
In my experiments, the number was a lot smaller than the number of distinct values in b: it never seemed to exceed 2. However, I tested this only with a single collection, and it probably has to do with the selectivity of the b-index.
If you asked me to give you an alphabetically sorted list of all French dishes, though, I'd be in trouble. Now the index on C is worthless, I'd have to merge-sort all those index lists. I will have to scan every element to do so.
This is reflected in my tests. Here are some simplified results. The original collection has datetimes, ints and strings, but I wanted to keep things simple, so it's now all ints.
Essentially, there are only two classes of queries: those where nscanned <= 2 * limit, and those that have to scan the entire collection (120k documents). The index is {a, b, c}:
// fast (range query on c while skipping b)
> db.Test.find({"a" : 43, "c" : { $lte : 45454 }});
// slow (sorting)
> db.Test.find({"a" : 43, "c" : { $lte : 45454 }}).sort({ "c" : -1});
> db.Test.find({"a" : 43, "c" : { $lte : 45454 }}).sort({ "b" : -1});
// fast (can sort on c if b included in the query)
> db.Test.find({"a" : 43, "b" : 7887, "c" : { $lte : 45454 }}).sort({ "c" : -1});
// fast (older tutorials claim this is slow)
> db.Test.find({"a" : {$gte : 43}, "c" : { $lte : 45454 }});
Your mileage will vary.
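The cookbook analogy above can be sketched in plain JavaScript. This is a model of the reasoning, not a description of MongoDB's actual query executor. The index on (a, b, c) is a sorted array, and findAC answers "a equals X and c in a range" by seeking once per distinct b value instead of scanning everything under a. All names (buildAbcIndex, compareKeys, seek, findAC) are hypothetical, and the b + 1 step assumes integer values of b.

```javascript
// Hypothetical model of an index on (a, b, c): sorted by a, then b, then c.
function buildAbcIndex(rows) {
  return rows
    .map((r) => [r.a, r.b, r.c])
    .sort((x, y) => x[0] - y[0] || x[1] - y[1] || x[2] - y[2]);
}

// Compare keys element by element over the length of the shorter key,
// so a partial key such as [a, b] can be used as a seek target.
function compareKeys(x, y) {
  const n = Math.min(x.length, y.length);
  for (let i = 0; i < n; i++) {
    if (x[i] !== y[i]) return x[i] < y[i] ? -1 : 1;
  }
  return 0;
}

// Binary search: first position whose key is >= the (possibly partial) key.
function seek(index, key) {
  let lo = 0;
  let hi = index.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (compareKeys(index[mid], key) < 0) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// a = value, cLo <= c <= cHi, b unknown ("skipped").
function findAC(index, a, cLo, cHi) {
  let seeks = 1;
  let scanned = 0;
  const hits = [];
  let i = seek(index, [a]);
  while (i < index.length && index[i][0] === a) {
    const b = index[i][1];
    // Within one (a, b) group the entries are sorted by c: jump to cLo.
    i = seek(index, [a, b, cLo]);
    seeks++;
    while (i < index.length && index[i][0] === a &&
           index[i][1] === b && index[i][2] <= cHi) {
      scanned++;
      hits.push(index[i]);
      i++;
    }
    // Skip the rest of this group (assumes integer b values).
    i = seek(index, [a, b + 1]);
    seeks++;
  }
  return { hits, seeks, scanned };
}

// Assumed sample data: 5 values of a, 50 of b, 100 of c.
const abcRows = [];
for (let a = 0; a < 5; a++)
  for (let b = 0; b < 50; b++)
    for (let c = 0; c < 100; c++) abcRows.push({ a, b, c });
const abcIndex = buildAbcIndex(abcRows);

const acResult = findAC(abcIndex, 2, 10, 12);
console.log(acResult.hits.length, acResult.scanned, acResult.seeks);
```

In this model, 5,000 entries share a = 2, but the query reads only the 150 matching entries at the cost of roughly two seeks per distinct b. A sort on c across all b groups gets no such shortcut, because the c values are only ordered within each group.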
You can view querying on A and C as a special case of querying on A (in which case the index would be used). Using the index is more efficient than having to load the whole document.
Suppose you wanted to get all documents with A between 7 and 13, and C between 5 and 8.
If you had an index on A only: the database could use the index to select documents with A between 7 and 13 but, to make sure that C was between 5 and 8, it would have to retrieve the corresponding documents too.
If you had an index on A, B, and C: the database could use the index to select documents with A between 7 and 13. Since the values of C are already stored in the records of the index, it could determine whether the corresponding documents also match the C criterion, without having to retrieve those documents. Therefore, you would avoid disk reads, with better performance.
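The difference can be counted in a small plain JavaScript model. It is a sketch under simplifying assumptions: the collection is an array, an array position stands in for a document's location on disk, and every read of that array counts as one document fetch. buildFilterIndex and rangeQuery are hypothetical names, not MongoDB functions.

```javascript
// Assumed sample collection: a in 0..19, c in 0..9, b arbitrary.
const sampleDocs = [];
for (let a = 0; a < 20; a++) {
  for (let c = 0; c < 10; c++) {
    sampleDocs.push({ a, b: (a + c) % 3, c });
  }
}

// An index entry stores the indexed values plus a pointer to the document.
function buildFilterIndex(coll, fields) {
  return coll
    .map((doc, pos) => ({ key: fields.map((f) => doc[f]), pos }))
    .sort((x, y) => {
      for (let i = 0; i < fields.length; i++) {
        if (x.key[i] !== y.key[i]) return x.key[i] - y.key[i];
      }
      return 0;
    });
}

// Range query on a and c. Assumes "a" is the first indexed field.
function rangeQuery(coll, index, fields, aRange, cRange) {
  const cSlot = fields.indexOf("c"); // -1 when c is not in the index
  let fetched = 0;
  const results = [];
  for (const entry of index) {
    const a = entry.key[0];
    // A real B-tree would seek to the range; a linear pass keeps this short.
    if (a < aRange[0] || a > aRange[1]) continue;
    // If c is stored in the index, filter before touching the document.
    if (cSlot !== -1 &&
        (entry.key[cSlot] < cRange[0] || entry.key[cSlot] > cRange[1])) continue;
    const doc = coll[entry.pos]; // counts as one document fetch
    fetched++;
    if (doc.c >= cRange[0] && doc.c <= cRange[1]) results.push(doc);
  }
  return { results, fetched };
}

const fieldsA = ["a"];
const fieldsABC = ["a", "b", "c"];
const onA = rangeQuery(sampleDocs, buildFilterIndex(sampleDocs, fieldsA), fieldsA, [7, 13], [5, 8]);
const onABC = rangeQuery(sampleDocs, buildFilterIndex(sampleDocs, fieldsABC), fieldsABC, [7, 13], [5, 8]);

console.log(onA.fetched, onA.results.length);     // 70 fetched, 28 results
console.log(onABC.fetched, onABC.results.length); // 28 fetched, 28 results
```

Both indexes return the same 28 documents, but with the index on A alone 70 documents are fetched and 42 of them are then discarded.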
I have a MongoDB collection with over 5 million items. Each item has "start" and "end" fields containing integer values.
Items don't have overlapping starts and ends.
e.g. this would be invalid:
{start:100, end:200}
{start:150, end:250}
I am trying to locate an item where a given value is between start and end
start <= VALUE <= end
The following query works, but it takes 5 to 15 seconds to return
db.blocks.find({ "start" : { $lt : 3232235521 }, "end" :{ $gt : 3232235521 }}).limit(1);
I've added the following indexes for testing with very little improvement
db.blocks.ensureIndex({start:1});
db.blocks.ensureIndex({end:1});
//also a compounded one
db.blocks.ensureIndex({start:1,end:1});
** Edit **
The result of explain() on the query results in:
> db.blocks.find({ "start" : { $lt : 3232235521 }, "end" :{ $gt : 3232235521 }}).limit(1).explain();
{
"cursor" : "BtreeCursor end_1",
"nscanned" : 1160982,
"nscannedObjects" : 1160982,
"n" : 0,
"millis" : 5779,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {
"end" : [
[
3232235521,
1.7976931348623157e+308
]
]
}
}
What would be the best approach to speeding this specific query up?
Actually, I'm working on a similar problem and my friend found a nice way to solve this.
If you don't have overlapping data, you can do this:
query using start field and sort function
validate with end field
for example you can do
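function findRange(x) {
    // Closest range that starts at or before x (uses the index on start)
    var cursor = db.collection.find({start: {$lte: x}}).sort({start: -1}).limit(1);
    if (!cursor.hasNext()) {
        return null; // no range starts at or before x
    }
    var result = cursor.next();
    // Validate with the end field
    if (result.end >= x) {
        return result;
    }
    return null; // no range contains x
}
var match = findRange(100);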
If you are sure that there will always be a range containing x, then you do not have to validate the result.
With this piece of code, you only need an index on either the start or the end field, and your query becomes a lot faster.
--- edit
I did some benchmarks: using the composite index takes 100-100,000ms per query, whereas using the single index takes 1-5ms per query.
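The same two steps (find the closest start at or below the value, then validate against end) can be tried without a database. Below is a minimal plain JavaScript sketch in which a sorted array stands in for the index on start; sampleBlocks and findBlock are hypothetical names, and the data is made up for illustration.

```javascript
// Hypothetical stand-in for the collection: non-overlapping ranges sorted by start.
const sampleBlocks = [
  { start: 0, end: 99 },
  { start: 100, end: 200 },
  { start: 250, end: 400 },
];

// Equivalent of find({start: {$lte: x}}).sort({start: -1}).limit(1),
// followed by the check against end.
function findBlock(sortedBlocks, x) {
  // Binary search: lo ends up as the number of ranges with start <= x.
  let lo = 0;
  let hi = sortedBlocks.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (sortedBlocks[mid].start <= x) lo = mid + 1;
    else hi = mid;
  }
  if (lo === 0) return null; // every range starts after x
  const candidate = sortedBlocks[lo - 1];
  // Ranges do not overlap, so only this candidate can contain x.
  return candidate.end >= x ? candidate : null;
}

console.log(findBlock(sampleBlocks, 150)); // { start: 100, end: 200 }
console.log(findBlock(sampleBlocks, 225)); // null (falls in a gap)
```

Because the ranges do not overlap, the range with the largest start at or below x is the only possible match, which is why a single index lookup is enough.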
I guess a compound index should work faster for you:
db.blocks.ensureIndex({start:1, end:1});
You can also use explain() to see the number of scanned objects, etc., and choose the best index.
Also, if you are using MongoDB < 2.0, you should upgrade to 2.0+, because indexes work faster there.
You can also limit the results to optimize the query.
This might help: how about you introduce some redundancy. If there is not a big variance in the lengths of the intervals, then you can introduce a tag field for each record - this tag field is a single value or string that represents a large interval - say for example tag 50,000 is used to tag all records with intervals that are at least partially in the range 0-50,000 and tag 100,000 is for all intervals in the range 50,000-100,000, and so on. Now you can index on the tag as primary and one of the end points of record range as secondary.
Records on the edge of big interval would have more than one tag - so we are talking multikeys. On your query you would of course calculate the big interval tag and use it in the query.
You would roughly want SQRT of total records per tag - just a starting point for tests, then you can fine tune the big interval size.
Of course, this would make writes a bit slower.
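A rough sketch of the tagging scheme in plain JavaScript follows. The bucket size of 50,000 comes from the example above. The names tagFor, tagsFor and findByValue are hypothetical, and treating each big interval as half-open (so the value 50,000 itself belongs to tag 100,000) is an assumption made here, since the description does not say which side the boundary falls on.

```javascript
const BUCKET_SIZE = 50000; // size of one "big interval", taken from the example

// Tag of the big interval a single value falls into: 0..49999 -> 50000,
// 50000..99999 -> 100000, and so on (half-open buckets are an assumption).
function tagFor(value) {
  return (Math.floor(value / BUCKET_SIZE) + 1) * BUCKET_SIZE;
}

// All tags a record needs: one per big interval that it touches.
function tagsFor(start, end) {
  const tags = [];
  for (let t = tagFor(start); t <= tagFor(end); t += BUCKET_SIZE) {
    tags.push(t);
  }
  return tags;
}

// Query side: compute the tag for the value, narrow by tag,
// then check the real bounds.
function findByValue(records, value) {
  const tag = tagFor(value);
  return records.filter(
    (r) => r.tags.includes(tag) && r.start <= value && value <= r.end
  );
}

// Made-up sample records; the second one straddles a bucket boundary.
const taggedRecords = [
  { start: 100, end: 200 },
  { start: 49000, end: 51000 },
  { start: 120000, end: 130000 },
].map((r) => Object.assign({ tags: tagsFor(r.start, r.end) }, r));

console.log(taggedRecords[1].tags);             // [ 50000, 100000 ]
console.log(findByValue(taggedRecords, 50500)); // the 49000-51000 record
```

In MongoDB, the tags array would be the multikey field mentioned above, indexed as the first key with one of the endpoints as the second.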