MongoDB: geospatial indexes and covered index queries

Currently I have an index on a geospatial field in one of my collections set up like this:
collection.ensureIndex({ loc : "2d" }, { min : -10000, max : 10000,
bits : 32, unique : true })
However, I would like to include one more field in the index, so I can take advantage of covered index queries in one of my use cases. An ensureIndex with multiple fields (compound index) looks something like this:
collection.ensureIndex( { username : 1, password : 1, roles : 1} );
Question is - how do I write my first index spec with an additional field, so that I keep my min/max parameters? In other words, how to specify that the min/max/bits only apply to one of the index fields? My best guess so far is:
collection.ensureIndex({ loc : "2d", field2 : 1 },
{ min : -10000, max : 10000, bits : 32, unique : true })
But I have no confidence that this is working as it should!
UPDATE:
There is more info in the documentation here, but it still does not explicitly show how to specify the min/max in this case.

db.collection.getIndexes() should give you the parameters used to construct the index - the min/max/bits field will only ever apply to the "2d" field.
There's no way to see these parameters aside from getIndexes(), but you can easily verify that you aren't allowed to insert the same location/field pair, and that you aren't allowed to insert locs outside your bounds (but field2 can be anything). The bits setting is harder to verify directly, though you can indirectly verify by setting it very low and seeing that nearby points then trigger the unique key violations.

Related

Using an indexed field for selecting a set of random items from a collection (MongoDB)

I am using MongoDB 2.4.10, and I have a collection of four million records, and a query that creates a subset of no more than 50000 even for our power users. I need to select a random 30 items from this subset, and, given the potential performance issues with skip and limit (especially when doing it 30 times with random skip amounts from 1-50000), I stumbled across the following solution:
Create a field for each record which is a completely random number
Create an index over this field
Sort by the field, and use skip(X).limit(30) to get a page of 30 items that, while consecutive in terms of the random field, actually bear no relation to each other. To the user, they seem random.
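As a rough in-memory illustration of the three steps above (plain JavaScript, not MongoDB code; `addRandomKey`, `randomPage` and the `items` array are names made up for this sketch), assigning a random key once and then paging through the sorted keys might look like this:

```javascript
// In-memory sketch only: `items` stands in for the filtered subset.
// Step 1: give every record a random key `d` (done once, at write time).
function addRandomKey(items, random) {
  return items.map(function (item) {
    return { id: item.id, d: random() };
  });
}

// Steps 2 and 3: order by the random key (what the index on d provides)
// and take one page, the equivalent of sort({d: 1}).skip(skip).limit(limit).
function randomPage(items, skip, limit) {
  return items
    .slice() // copy, so the caller's array is left untouched
    .sort(function (x, y) { return x.d - y.d; })
    .slice(skip, skip + limit);
}

var items = [];
for (var i = 0; i < 1000; i++) { items.push({ id: i }); }

var keyed = addRandomKey(items, Math.random);
var page = randomPage(keyed, 200, 30);
console.log(page.length); // 30 records, consecutive in d but unrelated by id
```

Because `d` is assigned once, the same skip value always returns the same page; only the choice of skip varies between requests.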
My index looks like this:
{a: 1, b: 1, c: 1, d: 1}
I also have a separate index:
{d : 1}
'd' is the randomised field.
My query looks like this:
db.content.find({a : {$in : ["xyz", "abc"]}, b : "ok", c : "Image"})
.sort({d : 1}).skip(X).limit(30)
When the collection is small, this works perfectly. However, on our performance and live systems, this query fails, because instead of using the a, b, c, d index, it uses this index only:
{d : 1}
As a result, the query ends up scanning more records than it needs to (by a factor of 25). So, I introduced hint:
db.content.find({a : {$in : ["xyz", "abc"]}, b : "ok", c : "Image"})
.hint({a : 1, b : 1, c : 1, d : 1}).sort({d : 1}).skip(X).limit(30)
This now works great with all values of X up to 11000, and explain() shows the correct index in use. But, when the skip amount exceeds 11000, I get:
{
"$err" : "too much data for sort() with no index. add an index or specify a smaller limit",
"code" : 10128
}
Presumably, the risk of hitting this error is why the query (without the hint) wasn't using this index earlier. So:
Why does Mongo think that the sort has no index to use, when I've forced it to use an index that explicitly includes the sorting field at the end?
Is there a better way of doing this?

Is there a limit on number of columns you can specify in mongodb composite index?

index_dict = {
"column_7" : 1,
"column_6" : 1,
"column_5" : 1,
"column_4" : 1,
"column_3" : 1,
"column_2" : 1,
"column_1" : 1,
"column_9" : 1,
"column_8" : 1,
"column_11" : 1,
"column_10" : 1
}
db.dataCustom.ensureIndex(index_dict, {unique: true, dropDups: true})
How many columns is the limit for index_dict? I need to know it for the implementation but I am not able to find the answer online
To be honest, I know of no limit on the number of columns. There is a practical limit, though.
An index is a heavy structure: an index that is very large, or that covers nearly every field in your documents, takes a lot of space and will cause performance problems.
You must be aware of this. To answer your question, though: I know of no limit. There is a limit on the size of a field within an index (1024 bytes including the name), but I know of none on the number of columns in an index.
Edit
It appears I wrote this answer a little too fast: there is a limit of 31 fields: http://docs.mongodb.org/manual/reference/limits/#Number%20of%20Indexed%20Fields%20in%20a%20Compound%20Index I guess I have just never reached that number :/
A collection may have no more than 64 indexes.
But I was wondering: why do you want so many indexes?
EDIT
There can be no more than 31 fields in a compound index.

how does Joint index work in mongodb?

From the comments on this question, I know how to create an index for a sort operation: how does Mongodb index works?
But I want to know: when we create a joint index on a & b, how does it work differently from a simple index?
And why do we benefit when searching on a alone, but get no benefit when searching on b? Is a joint index just like concatenating a & b, so that we benefit from it only for a prefix?
1. But I want to know: when we create a joint index on 'a' & 'b', how does it work differently from a simple index?
MongoDB only uses a single index per query .. so if your find() criteria includes both a and b values, you should add a compound index to efficiently search both fields.
2. And why do we benefit when we search on 'a' alone, but get no benefit when we search on 'b'? Is a joint index just like concatenating 'a' & 'b', so that we benefit from it only for a prefix?
MongoDB uses B-tree indexes, so you can only efficiently match partial keys using a prefix. To find all possible values matching a suffix or substring, all index entries would have to be checked.
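To make the prefix point concrete, here is a simplified sketch in plain JavaScript. It models a compound index on (a, b) as a sorted array of keys; this is only an illustration of why ordering matters, not MongoDB's actual B-tree, and all function names are made up for the example. A real server can sometimes skip parts of the index, so the counts below are not what explain() would report.

```javascript
// Simplified model: a compound index on (a, b) as an array of [a, b] keys
// kept in sorted order (by a first, then by b).
function buildIndex(docs) {
  return docs
    .map(function (d) { return [d.a, d.b]; })
    .sort(function (x, y) { return x[0] - y[0] || x[1] - y[1]; });
}

// Binary search for the first entry whose leading field is >= a.
function lowerBound(index, a) {
  var lo = 0, hi = index.length;
  while (lo < hi) {
    var mid = (lo + hi) >> 1;
    if (index[mid][0] < a) { lo = mid + 1; } else { hi = mid; }
  }
  return lo;
}

// Query on the prefix field: jump to the first match, then read only
// the contiguous run of entries with that value of a.
function findByA(index, a) {
  var scanned = 0, matches = [];
  for (var i = lowerBound(index, a); i < index.length && index[i][0] === a; i++) {
    scanned++;
    matches.push(index[i]);
  }
  return { scanned: scanned, matches: matches };
}

// Query on the second field only: matches are scattered through the
// index, so this model has to examine every entry.
function findByB(index, b) {
  var scanned = 0, matches = [];
  index.forEach(function (key) {
    scanned++;
    if (key[1] === b) { matches.push(key); }
  });
  return { scanned: scanned, matches: matches };
}

var docs = [];
for (var a = 0; a < 100; a++) {
  for (var b = 0; b < 10; b++) { docs.push({ a: a, b: b }); }
}
var index = buildIndex(docs);
var byA = findByA(index, 42);
var byB = findByB(index, 7);
console.log(byA.scanned, byA.matches.length); // entries scanned vs. returned
console.log(byB.scanned, byB.matches.length);
```

In this model the query on a touches only the entries it returns, while the query on b walks the whole array to find its matches.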
Set up test data to compare
Examples below are using the mongo shell:
/* Generate some test data */
for (i = 0; i< 1000; i++) {
db.mycoll.insert({a:i})
db.mycoll.insert({b:i})
db.mycoll.insert({a:i,b:i})
}
Now add some sample indexes:
/* Add simple and compound index */
db.mycoll.ensureIndex({a:1})
db.mycoll.ensureIndex({b:1})
db.mycoll.ensureIndex({a:1, b:1})
Finally, for the test scenarios below force your query to use a specific index with $hint and compare the explain() results.
Search for b using simple index
A search for b using the simple index on b can find matching entries directly in the index .. it scans 4 index entries (nscanned) to return 4 results (n):
db.mycoll.find({b:10}).hint({b:1}).explain()
{
"cursor" : "BtreeCursor b_1",
"n" : 4,
"nscannedObjects" : 4,
"nscanned" : 4,
...
}
Search for b using compound index (a,b)
A search for b using the compound index on (a,b) has to check every a value in the index, because the first part of the index key is the value of a.
It cannot jump directly to the matching entries, so it scans 1904 index entries (nscanned) to return 4 results (n):
db.mycoll.find({b:10}).hint({a:1,b:1}).explain()
{
"cursor" : "BtreeCursor a_1_b_1",
"n" : 4,
"nscannedObjects" : 4,
"nscanned" : 1904,
...
}
Technically, scanning 1,904 index entries is less work than scanning all 3,000 documents in my test collection .. but this is far from optimal.
Search for a using compound index (a,b)
For comparison, a search of a using the compound index shows that only 4 values need to be scanned to return 4 documents:
db.mycoll.find({a:10}).hint({a:1,b:1}).explain()
{
"cursor" : "BtreeCursor a_1_b_1",
"n" : 4,
"nscannedObjects" : 4,
"nscanned" : 4,
"nscannedObjectsAllPlans" : 4,
...
}
For some further examples and explanation, I would recommend reading the article Optimizing MongoDB Compound Indexes.

Search for a record where a value is between two item fields in MongoDB

I have a MongoDB collection with over 5 million items. Each item has "start" and "end" fields containing integer values.
Items don't have overlapping starts and ends.
e.g. this would be invalid:
{start:100, end:200}
{start:150, end:250}
I am trying to locate an item where a given value is between start and end
start <= VALUE <= end
The following query works, but it takes 5 to 15 seconds to return
db.blocks.find({ "start" : { $lt : 3232235521 }, "end" :{ $gt : 3232235521 }}).limit(1);
I've added the following indexes for testing with very little improvement
db.blocks.ensureIndex({start:1});
db.blocks.ensureIndex({end:1});
//also a compounded one
db.blocks.ensureIndex({start:1,end:1});
** Edit **
The result of explain() on the query results in:
> db.blocks.find({ "start" : { $lt : 3232235521 }, "end" :{ $gt : 3232235521 }}).limit(1).explain();
{
"cursor" : "BtreeCursor end_1",
"nscanned" : 1160982,
"nscannedObjects" : 1160982,
"n" : 0,
"millis" : 5779,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {
"end" : [
[
3232235521,
1.7976931348623157e+308
]
]
}
}
What would be the best approach to speeding this specific query up?
I'm actually working on a similar problem, and a friend of mine found a nice way to solve it.
If you don't have overlapping data, you can do this:
query using the start field and sort()
validate with the end field
For example:
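function findRange(x) {
// the closest range that starts at or before x
var cursor = db.collection.find({start:{$lte:x}}).sort({start:-1}).limit(1);
if (!cursor.hasNext()) {
return null; // no range starts at or before x
}
var result = cursor.next();
// ranges do not overlap, so only this candidate can contain x
return result.end >= x ? result : null; // null: no range contains x
}
var match = findRange(100);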
If you are sure that there will always be a range containing x, then you do not have to validate the result.
With this piece of code, you only have to index the start field, and your query becomes a lot faster.
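The reason this is cheap can be sketched with an in-memory model in plain JavaScript, where a sorted array stands in for the index on start. The `findContaining` function and the sample `ranges` are hypothetical, and the sketch assumes the ranges never overlap:

```javascript
// In-memory model of the lookup: `ranges` stands in for the collection,
// sorted by start the way an index on { start: 1 } would keep it.
// Assumes the ranges do not overlap.
function findContaining(ranges, x) {
  // Binary search for the last range with start <= x, the equivalent of
  // find({start: {$lte: x}}).sort({start: -1}).limit(1).
  var lo = 0, hi = ranges.length;
  while (lo < hi) {
    var mid = (lo + hi) >> 1;
    if (ranges[mid].start <= x) { lo = mid + 1; } else { hi = mid; }
  }
  if (lo === 0) { return null; } // every range starts after x
  var candidate = ranges[lo - 1];
  // Validate with the end field.
  return candidate.end >= x ? candidate : null;
}

var ranges = [
  { start: 100, end: 200 },
  { start: 300, end: 400 },
  { start: 500, end: 600 }
];
console.log(findContaining(ranges, 350)); // the { start: 300, end: 400 } range
console.log(findContaining(ranges, 250)); // null: 250 falls in a gap
```

Only one candidate is ever examined, which is why a single index is enough: the non-overlap guarantee means no earlier range can reach past the start of the candidate.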
--- edit
I ran a benchmark: with the composite index each query takes 100-100,000 ms, whereas with a single-field index it takes 1-5 ms.
I guess a compound index should work faster for you:
db.blocks.ensureIndex({start:1, end:1});
You can also use explain() to see the number of scanned objects, etc., and choose the best index.
Also, if you are using MongoDB < 2.0, you should upgrade to 2.0+, because indexes are faster there.
You can also limit the results to optimize the query.
This might help: how about you introduce some redundancy. If there is not a big variance in the lengths of the intervals, then you can introduce a tag field for each record - this tag field is a single value or string that represents a large interval - say for example tag 50,000 is used to tag all records with intervals that are at least partially in the range 0-50,000 and tag 100,000 is for all intervals in the range 50,000-100,000, and so on. Now you can index on the tag as primary and one of the end points of record range as secondary.
Records on the edge of big interval would have more than one tag - so we are talking multikeys. On your query you would of course calculate the big interval tag and use it in the query.
You would roughly want SQRT of total records per tag - just a starting point for tests, then you can fine tune the big interval size.
Of course, this would make writes a bit slower.
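A minimal sketch of how such tags could be computed, in plain JavaScript. The helper names `tagsFor` and `tagForValue` are hypothetical, and the convention that a tag is the upper bound of its bucket (so tag 50000 covers 0-49999) is an assumption; the exact boundaries are a design choice.

```javascript
// Hypothetical helper: list the bucket tags an interval [start, end] touches.
// Assumed convention: a tag is the upper bound of its bucket, so with
// bucketSize 50000 the tag 50000 covers 0-49999 and 100000 covers 50000-99999.
function tagsFor(start, end, bucketSize) {
  var tags = [];
  var first = Math.floor(start / bucketSize);
  var last = Math.floor(end / bucketSize);
  for (var b = first; b <= last; b++) {
    tags.push((b + 1) * bucketSize);
  }
  return tags;
}

// Tag to put in the query when looking up a single value.
function tagForValue(value, bucketSize) {
  return (Math.floor(value / bucketSize) + 1) * bucketSize;
}

console.log(tagsFor(10, 20, 50000));       // [ 50000 ]
console.log(tagsFor(49990, 50010, 50000)); // [ 50000, 100000 ]
console.log(tagForValue(50005, 50000));    // 100000
```

A record that straddles a bucket boundary gets both tags, which is what makes the tag field an array (a multikey index) in the scheme described above.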

Geospatial Indexing with a simple key first

After reading about MongoDB and Geospatial Indexing
I was amazed that it did not support compound keys that do not start with the 2d index.
I don't know if I would gain anything from it, but right now the MSSQL solution is just as slow/fast.
SELECT TOP 30 * FROM Villages WHERE SID = 10 ORDER BY (math to calc radius from the center point)
This works, but it is slow because it is not smart enough to use an index, so it has to calculate the radius for all villages with that SID.
So in Mongo I wanted to create an index like {sid: 1, loc: "2d"} so I could filter out a lot from the start.
I'm not sure there are any solutions for this. I thought about creating a collection for each sid since they don't share any information. But what are the disadvantages of this? Or is this how people do it ?
Update
The maps are flat: 800, 800 to -800,-800, and villages are placed from the center of the map outwards. There are about 300 different maps which are not related, so they could be in different collections, but I'm not sure about the overhead.
If more information is needed, please let me know.
What I have tried
> var res = db.Villages.find({sid: 464})
> db.Villages.find({loc: {$near: [50, 50]}, sid: {$in: res}})
error: { "$err" : "invalid query", "code" : 12580 }
>
Also tried this
db.Villages.find({loc: {$near: [50, 50]}, sid: {$in: db.Villages.find({sid: 464}, {sid: 1})}})
error: { "$err" : "invalid query", "code" : 12580 }
I'm not really sure what I'm doing wrong, but it's probably something about the syntax. I'm confused here.
As you stated already, MongoDB cannot accept a location as a secondary key in a geo index; the 2d field has to come first in the index. So you are out of luck in changing the indexing pattern here.
But there is a workaround: instead of the compound geo index you wanted, you can create a separate index on sid plus a compound index on loc and sid
db.your_collection.ensureIndex({sid : 1})
db.your_collection.ensureIndex({loc : '2d',sid:1})
or two separate indexes on sid and loc
db.your_collection.ensureIndex({sid : 1})
db.your_collection.ensureIndex({loc : '2d'})
(I am not sure which of the two is more efficient; you can try it yourself)
and you can make two different queries to get the results, filtered by sid first and by location next, something like this
var res = db.your_collection.find({sid : 10}, {_id : 1})
// collect the _id of every matching document
var res_ids = res.map(function (doc) { return doc._id; })
// then query by location, restricted to those ids
db.your_collection.find({loc : { $near : [50,50] }, _id : {$in : res_ids}})