mongod.log shows:
{deliver_area: { $geoIntersects:
{ $geometry: {
type: "Point",
coordinates: [ 116.3426399230957, 39.95959281921387 ]
} }
} }
ntoreturn:0 ntoskip:0 nscanned:2965 keyUpdates:0 numYields: 2 locks(micros) r:136723 nreturned:52 reslen:23453 103ms
The collection has about 10k records. deliver_area is one of the fields; it holds a GeoJSON Polygon and has a 2dsphere index.
This is my query:
db.area_coll.find( {
id: 59,
deliver_area: {
$geoIntersects: {
$geometry: {
type: "Point",
coordinates: [ 116.3175773620605, 39.97607231140137 ]
}
}
}
})
Explain result:
{
"cursor" : "S2Cursor",
"isMultiKey" : true,
"n" : 0,
"nscannedObjects" : 0,
"nscanned" : 3887,
"nscannedObjectsAllPlans" : 0,
"nscannedAllPlans" : 3887,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 5,
"indexBounds" : {
},
"nscanned" : 3887,
"matchTested" : NumberLong(666),
"geoTested" : NumberLong(0),
"cellsInCover" : NumberLong(1),
"server" : "testing:27017"
}
The query in the log does not match the query you ran; the location is different:
[ 116.3426399230957, 39.95959281921387 ] vs.
[ 116.3175773620605, 39.97607231140137 ]
I also don't think you have reproduced your whole log line, as it just mentions area and not deliver_area.
However, neither query is really slow. The first took 103ms, which can happen when your server is busy with other I/O. The second took 5ms, as the explain() output tells you.
But what is most striking is that your main criterion is id: 59. I don't know what your _id field is, but if you put an index on id then this query should not have to use the 2dsphere index at all, unless of course many documents have id = 59. In that case, you could be better off with a compound index on { id: 1, deliver_area: '2dsphere' }.
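A minimal sketch of that compound index and the matching filter, written as plain JavaScript objects so it can run outside the shell. The names areaIndexSpec and buildAreaQuery are illustrative only and not part of MongoDB; in the mongo shell the objects would be passed to db.area_coll.createIndex(...) (or ensureIndex on older releases) and db.area_coll.find(...).

```javascript
// Compound index: the equality field first, then the 2dsphere field.
// In the mongo shell: db.area_coll.createIndex(areaIndexSpec)
const areaIndexSpec = { id: 1, deliver_area: "2dsphere" };

// Builds the filter from the question for a given id and point.
// GeoJSON coordinate order is [longitude, latitude].
function buildAreaQuery(id, lng, lat) {
  return {
    id: id,
    deliver_area: {
      $geoIntersects: {
        $geometry: { type: "Point", coordinates: [lng, lat] }
      }
    }
  };
}

// In the mongo shell: db.area_coll.find(query).explain()
const query = buildAreaQuery(59, 116.3175773620605, 39.97607231140137);
console.log(JSON.stringify(query, null, 2));
```

Whether the compound index actually beats a plain index on id depends on how many documents share each id, so it is worth comparing both with explain().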
I had exactly the same issue. My index was a compound one:
a 2dsphere index on the location field plus an ascending index on the zoom field.
I always query by both fields, filtering by location and zoom, and it was really slow.
I tried making two regular (non-compound) indexes instead, and it works fast. So it looks like a compound index that includes a 2dsphere field doesn't work well, or has to be used in some more complicated way.
I know it is not possible to remove the _id field in a MongoDB collection. However, my collection is so large that the index on the _id field prevents me from loading the other indices into RAM. My machine has 125GB of RAM and my collection stats are as follows:
db.call_records.stats()
{
"ns" : "stc_cdrs.call_records",
"count" : 1825338618,
"size" : 438081268320,
"avgObjSize" : 240,
"storageSize" : 468641284752,
"numExtents" : 239,
"nindexes" : 3,
"lastExtentSize" : 2146426864,
"paddingFactor" : 1,
"systemFlags" : 0,
"userFlags" : 1,
"totalIndexSize" : 165290709024,
"indexSizes" : {
"_id_" : 73450862016,
"caller_id_1" : 45919923504,
"receiver_id_1" : 45919923504
},
"ok" : 1
}
When I do a query like the following:
db.call_records.find({ "$or" : [ { "caller_id": 125091840205 }, { "receiver_id" : 125091840205 } ] }).explain()
{
"clauses" : [
{
"cursor" : "BtreeCursor caller_id_1",
"isMultiKey" : false,
"n" : 401,
"nscannedObjects" : 401,
"nscanned" : 401,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"caller_id" : [
[
125091840205,
125091840205
]
]
}
},
{
"cursor" : "BtreeCursor receiver_id_1",
"isMultiKey" : false,
"n" : 383,
"nscannedObjects" : 383,
"nscanned" : 383,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"receiver_id" : [
[
125091840205,
125091840205
]
]
it takes more than 15 seconds on average to return the results. The indices for both caller_id and receiver_id should be around 90GB, which is OK. However, the 73GB index on the _id makes this query very slow.
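As a quick check of the sizes quoted above, the arithmetic can be done directly on the indexSizes values from the stats() output. This is only a sketch: the GB constant assumes decimal gigabytes, and the variable names are made up for illustration.

```javascript
// Index sizes in bytes, copied from the stats() output above.
const indexSizes = {
  _id_: 73450862016,
  caller_id_1: 45919923504,
  receiver_id_1: 45919923504
};

const GB = 1e9; // assumption: decimal gigabytes, matching the rounded figures in the text

// Rounds a byte count to two decimal places of GB.
const toGB = (bytes) => Math.round((bytes / GB) * 100) / 100;

// The $or query only touches the caller_id and receiver_id indexes.
const usedByQuery = indexSizes.caller_id_1 + indexSizes.receiver_id_1;
const allIndexes = usedByQuery + indexSizes._id_;

console.log("indexes used by the $or query (GB):", toGB(usedByQuery));
console.log("all indexes (GB):", toGB(allIndexes));
```

The sum of all three matches the reported totalIndexSize, and only that combined figure exceeds the 125GB of RAM.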
You are correct that you cannot remove the _id field from your documents. You also cannot remove the index on this field, so this is something you have to live with.
For some reason you start with the assumption that the _id index makes your query slow, which is unjustified and most probably wrong. This index is not used by your query and just sits there untouched.
A few things I would try in your situation:
You have about 1.8 billion documents (roughly 438GB of data) in your collection. Have you considered that this is the right time to start sharding your database? In my opinion you should.
Use explain with your query to actually figure out what slows it down.
Looking at your query, I would also try to do the following:
change your document from
{
... something else ...
receiver_id: 234,
caller_id: 342
}
to
{
... something else ...
participants: [342, 234]
}
where participants is [caller_id, receiver_id], in this order. Then you can put a single index on this field. I know that it will not make your indices smaller, but I hope that, because you will no longer use the $or clause, you will get results faster. P.S. If you do this, do not do it in production first: test whether it gives you a significant improvement, and only then change prod.
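A rough sketch of that restructuring in plain JavaScript. The helper names toParticipants and buildParticipantQuery are hypothetical; in the shell the resulting filter would go to db.call_records.find(...) after creating an index on { participants: 1 }.

```javascript
// Converts the original document shape into the proposed one.
// Order is [caller_id, receiver_id], as described above.
function toParticipants(doc) {
  const { caller_id, receiver_id, ...rest } = doc;
  return { ...rest, participants: [caller_id, receiver_id] };
}

// An equality filter on an array field matches if ANY element equals the
// value, so this single condition replaces the two-branch $or.
function buildParticipantQuery(id) {
  return { participants: id };
}

const converted = toParticipants({ receiver_id: 234, caller_id: 342 });
console.log(converted);
console.log(buildParticipantQuery(125091840205));
```

Note that the index on participants becomes multikey, with one entry per array element, which is why the total index size is not expected to shrink.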
There are a lot of potential issues here.
The first is that your indexes do not include all of the data returned. This means Mongo uses the index to locate each matching entry and then has to fetch the full document (indexOnly is false in your explain output). The _id index plays no part in that lookup, so removing it, even if you could, would not help.
Second, the query includes an OR. This forces Mongo to load both indexes so that it can read them and then retrieve the documents in question.
To improve performance, I think you have just a few choices:
Add the additional elements to the indexes and restrict the data returned to what is available in the index (this would change indexOnly = true in the explain results)
Explore sharding as Skooppa.com mentioned.
Rework the query and/or the document to eliminate the OR condition.
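To make the first option more concrete, here is a simplified check for whether a query could be answered from an index alone. It is a sketch, not MongoDB's actual planner logic: isCovered is a made-up helper, the two-field index is a hypothetical example, and the check ignores other restrictions such as multikey indexes.

```javascript
// Simplified rule: every filtered and returned field must be an index key,
// and _id must be excluded from the projection unless it is in the index.
function isCovered(indexKeys, filterFields, projection) {
  const keys = new Set(Object.keys(indexKeys));
  const returned = Object.keys(projection).filter((f) => projection[f] === 1);
  const idExcluded = projection._id === 0 || keys.has("_id");
  return (
    idExcluded &&
    returned.length > 0 &&
    [...filterFields, ...returned].every((f) => keys.has(f))
  );
}

// Hypothetical index that also carries receiver_id.
const callerIndex = { caller_id: 1, receiver_id: 1 };
const projection = { _id: 0, caller_id: 1, receiver_id: 1 };

console.log(isCovered(callerIndex, ["caller_id"], projection));      // true
console.log(isCovered({ caller_id: 1 }, ["caller_id"], projection)); // false
```

With an $or, each branch uses its own index, so each of those indexes would need to contain every returned field for the whole query to stay index-only.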
During my hands-on work with MongoDB I ran into a problem with MongoDB indexes: they sometimes don't enforce both ends of a range in the index bounds. Here is one of the outputs I encountered while querying the database:
Query:
db.user.find({transaction:{$elemMatch:{product:"mobile", firstTransaction:{$gte:ISODate("2015-01-01"), $lt:ISODate("2015-01-02")}}}}).hint("transaction.product_1_transaction.firstTransaction_1").explain()
Output:
"cursor" : "BtreeCursor transaction.firstTransaction_1_transaction.product_1",
"isMultiKey" : true,
"n" : 622,
"nscannedObjects" : 350931,
"nscanned" : 6188185,
"nscannedObjectsAllPlans" : 350931,
"nscannedAllPlans" : 6188185,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 235851,
"nChunkSkips" : 0,
"millis" : 407579,
"indexBounds" : {
"transaction.firstTransaction" : [
[
true,
ISODate("2015-01-02T00:00:00Z")
]
],
"transaction.product" : [
[
"mobile",
"mobile"
]
]
},
As you can see in the example above, for the firstTransaction field one end of the bound is true instead of the date I specified. I found that the workaround for this is the min() and max() functions. I tried those, but they do not seem to work with embedded documents (transaction is an array of subdocuments containing fields like firstTransaction, product, etc.). I get the following error:
Query:
db.user.find({transaction:{$elemMatch:{product:'mobile'}}}).min({transaction:{$elemMatch:{firstTransaction:ISODate("2015-01-01")}}}).max({transaction:{$elemMatch:{firstTransaction:ISODate("2015-01-02")}}})
Output:
planner returned error: unable to find relevant index for max/min query
The firstTransaction field is indexed, though, as is product, and there is a compound index on both too. I don't know what is going wrong here.
Sample document:
{
_id: UUID (indexed by default),
name: string,
dob: ISODate,
addr: string,
createdAt: ISODate (indexed),
.
.
.,
transaction:[
{
firstTransaction: ISODate(indexed),
lastTransaction: ISODate(indexed),
amount: float,
product: string (indexed),
.
.
.
},...
],
other sub documents...
}
This is the correct behavior. You cannot always intersect the index bounds for $lte and $gte - sometimes it would give incorrect results. For example, consider the document
{ "x" : [{ "a" : [4, 6] }] }
This document matches the query
db.test.find({ "x" : { "$elemMatch" : { "a" : { "$gte" : 5, "$lte" : 5 } } } });
If we define an index on { "x.a" : 1 }, the two index bounds would be [5, infinity], and [-infinity, 5]. Intersecting them would give [5, 5] and using this index bound would not match the document - incorrectly!
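The example above can be replayed in plain JavaScript. This is only a model of the idea, assuming a multikey index stores one key per array value; the function names are invented for the illustration.

```javascript
const doc = { x: [{ a: [4, 6] }] };

// Query semantics: inside one element of x, "a" must contain SOME value >= lo
// and SOME value <= hi. For an array these may be two different values.
function matches(d, lo, hi) {
  return d.x.some(
    (el) => el.a.some((v) => v >= lo) && el.a.some((v) => v <= hi)
  );
}

// A multikey index on "x.a" holds one key per array value: here 4 and 6.
function indexKeys(d) {
  return d.x.flatMap((el) => el.a);
}

// Index scan: is any key of the document inside the bounds [min, max]?
function foundByScan(d, min, max) {
  return indexKeys(d).some((k) => k >= min && k <= max);
}

console.log(matches(doc, 5, 5));             // true: 6 >= 5 and 4 <= 5
console.log(foundByScan(doc, 5, 5));         // false: intersected bounds miss the doc
console.log(foundByScan(doc, -Infinity, 5)); // true: one-sided bounds find key 4
```

So the planner keeps only one side of the range as the index bound and checks the other side when it inspects each candidate document.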
Can you provide a sample document and tell us more about what you're trying to do with the query? With context, there may be another way to write the query that uses tighter index bounds.
I have a query that uses $near to filter records down to a proximity. It is then supposed to be sorting the results by a separate field. However I'm running into a situation where records are missing even though they match the criteria.
I suspect this is due to the fact that using $near with 2d indexes has a 100 record limit. What I believe is happening is that the geospatial sort is occurring first and mine is only then being applied to the top 100 records of that result.
Is there any way to overcome this behavior? Can I disregard the sort of $near and use my own as the primary sort or, alternatively, circumvent the 100 record limit so that my sort applies to the entire set?
Here is the explain() from the query I'm using:
db.properties.find({
loc: {
$near: [-80.173366, 34.07868],
$maxDistance: 5
}}).sort({mls: -1}).explain()
{
"cursor" : "GeoSearchCursor",
"isMultiKey" : false,
"n" : 100,
"nscannedObjects" : 211,
"nscanned" : 700,
"nscannedObjectsAllPlans" : 211,
"nscannedAllPlans" : 700,
"scanAndOrder" : true,
"indexOnly" : false,
"nYields" : 1,
"nChunkSkips" : 0,
"millis" : 2,
"indexBounds" : {
},
"server" : "slate:27017",
"filterSet" : false
}
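The suspected behaviour can be illustrated with synthetic data in plain JavaScript. This is a toy model under stated assumptions (300 made-up records, all within range, with mls growing with distance), not a reproduction of what the server does internally.

```javascript
// Synthetic data: 300 records, all within $maxDistance, where records
// farther from the centre happen to have higher mls values.
const records = [];
for (let i = 0; i < 300; i++) {
  records.push({ distance: i * 0.01, mls: i });
}

const byMlsDesc = (a, b) => b.mls - a.mls;
const byDistance = (a, b) => a.distance - b.distance;

// Suspected behaviour: keep the 100 nearest, then sort those by mls.
const nearThenSort = records
  .slice()
  .sort(byDistance)
  .slice(0, 100)
  .sort(byMlsDesc);

// Desired behaviour: sort everything in range by mls.
const sortAll = records.slice().sort(byMlsDesc);

console.log(nearThenSort.length, nearThenSort[0].mls); // 100 99
console.log(sortAll.length, sortAll[0].mls);           // 300 299
```

In this model the record with the highest mls never appears in the first result set, which matches the symptom of missing records.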
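I ran into the same problem a while ago; you can use aggregate with $match. I used the following snippet at a hackathon.
db.kickstarter.aggregate([
// Match documents within `radius` km of the point. $centerSphere expects
// radians, and 6371 is the Earth's radius in km.
{'$match' :
{geo2 :
{$geoWithin :
{ $centerSphere :[[parseFloat(lng), parseFloat(lat) ], radius/6371 ]
}
}
}
},
// Sort the whole matched set by your own field, then cap the result size.
{$sort : {'pledged' : -1}},
{$limit : 1000} // you can set your limit here
],
function(err, data){
if(err)console.log(err);
}
);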
I have a very large collection of documents like:
{ loc: [10.32, 24.34], relevance: 0.434 }
and want to be able to efficiently run a query like:
{ "loc": {"$geoWithin":{"$box":[[-103,10.1],[-80.43,30.232]]}} }
with arbitrary boxes.
Adding a 2d index on loc makes this very fast and efficient. However, I now also want to get just the most relevant documents:
.sort({ relevance: -1 })
Which causes everything to grind to a crawl (there can be a huge number of results in any particular box, and I just need the top 10 or so).
Any advice or help greatly appreciated!!
Have you tried using the aggregation framework?
A two stage pipeline might work:
a $match stage that uses your existing $geoWithin query.
a $sort stage that sorts by relevance: -1
Here's an example of what it might look like:
db.foo.aggregate(
{$match: { "loc": {"$geoWithin":{"$box":[[-103,10.1],[-80.43,30.232]]}} }},
{$sort: {relevance: -1}}
);
I'm not sure how it will perform. However, even if it's poor with MongoDB 2.4, it might be dramatically different in 2.6/2.5, as 2.6 will include improved aggregation sort performance.
When a huge number of results match a particular box, the sort operation is really expensive, so you definitely want to avoid it.
Try creating a separate index on the relevance field and using it (without the 2d index at all). The query will execute much more efficiently that way: documents, already sorted by relevance, are scanned one by one and checked against the given geo box condition. Once the top 10 are found, you're done.
It might not be that fast if the geo box matches only a small subset of the collection, though. In the worst case it will need to scan through the whole collection.
I suggest you create two indexes (one on loc, one on relevance) and run tests on the queries that are common in your app, using mongo's hint to force the index you want.
Depending on your test results, you may even want to add some app logic so that if you know the box is huge you run the query with the relevance index, and otherwise use the loc 2d index. Just a thought.
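The app logic mentioned above might look roughly like this. It is a sketch under assumptions: AREA_THRESHOLD is an invented starting value to be tuned from your own tests, and the index names are the defaults MongoDB generates for { loc: "2d" } and { relevance: 1 }.

```javascript
// Default generated names for the two indexes.
const LOC_INDEX = "loc_2d";
const RELEVANCE_INDEX = "relevance_1";

// box is [[minX, minY], [maxX, maxY]], as in the $box query above.
function boxArea(box) {
  const [[minX, minY], [maxX, maxY]] = box;
  return (maxX - minX) * (maxY - minY);
}

// Hypothetical cut-off in square degrees; derive the real one from testing.
const AREA_THRESHOLD = 100;

// Large boxes: walk the relevance index. Small boxes: use the 2d index.
function chooseHint(box) {
  return boxArea(box) >= AREA_THRESHOLD ? RELEVANCE_INDEX : LOC_INDEX;
}

console.log(chooseHint([[-103, 10.1], [-80.43, 30.232]]));   // relevance_1
console.log(chooseHint([[-80.5, 30.2], [-80.43, 30.232]]));  // loc_2d
```

The returned name would then be passed to .hint(...) on the find() call, for example db.foo.find(query).sort({ relevance: -1 }).limit(10).hint(name).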
You cannot get scanAndOrder to be false when you try to sort on part of a compound key. Unfortunately there is currently no solution to your problem, and this is not related to the fact that you are using a 2d index.
When you run explain on your query, the value of "scanAndOrder" shows whether a sorting phase was needed after collecting the results. If it is true, a sort after the query was needed; if it is false, it was not.
To test the situation i created a collection called t2 in a sample db this way:
db.createCollection('t2')
db.t2.ensureIndex({a:1})
db.t2.ensureIndex({b:1})
db.t2.ensureIndex({a:1,b:1})
db.t2.ensureIndex({b:1,a:1})
for(var i=0;i++<200;){db.t2.insert({a:i,b:i+2})}
Since only one index can be used to support a query, I ran the following tests; the results are included:
mongos> db.t2.find({a:{$gt:50}}).sort({b:1}).hint("b_1").explain()
{
"cursor" : "BtreeCursor b_1",
"isMultiKey" : false,
"n" : 150,
"nscannedObjects" : 200,
"nscanned" : 200,
"nscannedObjectsAllPlans" : 200,
"nscannedAllPlans" : 200,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"b" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
},
"server" : "localhost:27418",
"millis" : 0
}
mongos> db.t2.find({a:{$gt:50}}).sort({b:1}).hint("a_1_b_1").explain()
{
"cursor" : "BtreeCursor a_1_b_1",
"isMultiKey" : false,
"n" : 150,
"nscannedObjects" : 150,
"nscanned" : 150,
"nscannedObjectsAllPlans" : 150,
"nscannedAllPlans" : 150,
"scanAndOrder" : true,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 1,
"indexBounds" : {
"a" : [
[
50,
1.7976931348623157e+308
]
],
"b" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
},
"server" : "localhost:27418",
"millis" : 1
}
mongos> db.t2.find({a:{$gt:50}}).sort({b:1}).hint("a_1").explain()
{
"cursor" : "BtreeCursor a_1",
"isMultiKey" : false,
"n" : 150,
"nscannedObjects" : 150,
"nscanned" : 150,
"nscannedObjectsAllPlans" : 150,
"nscannedAllPlans" : 150,
"scanAndOrder" : true,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 1,
"indexBounds" : {
"a" : [
[
50,
1.7976931348623157e+308
]
]
},
"server" : "localhost:27418",
"millis" : 1
}
mongos> db.t2.find({a:{$gt:50}}).sort({b:1}).hint("b_1_a_1").explain()
{
"cursor" : "BtreeCursor b_1_a_1",
"isMultiKey" : false,
"n" : 150,
"nscannedObjects" : 150,
"nscanned" : 198,
"nscannedObjectsAllPlans" : 150,
"nscannedAllPlans" : 198,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"b" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
],
"a" : [
[
50,
1.7976931348623157e+308
]
]
},
"server" : "localhost:27418",
"millis" : 0
}
The indexes on individual fields do not help much, so a_1 (does not support the sort) and b_1 (does not support the query) are out. The index a_1_b_1 is not a good fit either: it will perform worse than the single a_1, because the MongoDB engine does not take advantage of the fact that the entries for each single 'a' value are stored in order of b. What is worth trying is the compound index b_1_a_1, which in your case would be relevance_1_loc_1. It returns the results in order, so scanAndOrder will be false. I have not tested this with a 2d index, but I assume it will let the engine skip some documents based on the index value alone (that is why nscanned is higher than nscannedObjects in that test). The index will unfortunately be huge, but still smaller than the documents.
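A toy model of why the b_1_a_1 ordering avoids the sort, in plain JavaScript. It is a simplification, not how the query engine actually traverses the B-tree, and the scan helper and its limit parameter are made up for the illustration.

```javascript
// Same data as the t2 collection above: a = 1..200, b = a + 2.
const docs = [];
for (let i = 1; i <= 200; i++) docs.push({ a: i, b: i + 2 });

// Entries of a { b: 1, a: 1 } index: ordered by b, then by a.
const index = docs.slice().sort((p, q) => p.b - q.b || p.a - q.a);

// Walk the index in order, filter on a, stop after `limit` matches.
// Results come out already ordered by b, so no sort phase is needed.
function scan(limit) {
  const out = [];
  let scanned = 0;
  for (const entry of index) {
    scanned++;
    if (entry.a > 50) out.push(entry);
    if (out.length === limit) break;
  }
  return { out, scanned };
}

const { out, scanned } = scan(10);
console.log(out.map((d) => d.b)); // 53 through 62, in order
console.log(scanned);             // 60
```

The cost of this approach is the entries scanned before the first match, which is why it suits filters that match a large share of the collection.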
This solution is valid if you need to search inside a box (rectangle).
The problem with a 2d geospatial index is that you can only place it at the front of a compound index (at least that is so for mongo 3.2).
So I thought, why not create my own "geospatial" index? All I need is a compound index on Lng, Lat (X, Y) with the sort field in first place. Then I need to implement the logic for searching inside the box boundaries and explicitly instruct mongo to use the index (hint).
Translating to your problem:
db.collection.createIndex({ "relevance": 1, "loc_x": 1, "loc_y": 1 }, { "background": true } )
Logic:
db.collection.find({
"loc_x": { "$gt": -103, "$lt": -80.43 },
"loc_y": { "$gt": 10.1, "$lt": 30.232 }
}).hint("relevance_1_loc_x_1_loc_y_1") // or whatever name you gave it
Use $gte and $lte if you need inclusive results.
And you don't need to use .sort() since the results are already sorted, or you can do a reverse sort on relevance if you need to.
The only issue I encountered is with small box areas: it took more time to search small areas than large ones. That is why I kept the geospatial index for small-area searches.
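The box logic above can be wrapped in a small helper. This is a sketch: buildBoxQuery and its inclusive flag are hypothetical names, while loc_x and loc_y are the field names used in the index above.

```javascript
// box is [[minX, minY], [maxX, maxY]].
// inclusive = true switches from $gt/$lt to $gte/$lte.
function buildBoxQuery(box, inclusive = false) {
  const [[minX, minY], [maxX, maxY]] = box;
  const lower = inclusive ? "$gte" : "$gt";
  const upper = inclusive ? "$lte" : "$lt";
  return {
    loc_x: { [lower]: minX, [upper]: maxX },
    loc_y: { [lower]: minY, [upper]: maxY }
  };
}

const box = [[-103, 10.1], [-80.43, 30.232]];
console.log(JSON.stringify(buildBoxQuery(box)));
console.log(JSON.stringify(buildBoxQuery(box, true)));
```

In the shell the result would be used as db.collection.find(buildBoxQuery(box)).hint("relevance_1_loc_x_1_loc_y_1").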
I have converted my old collection, which used a MongoDB "2d" index, to a collection that follows the GeoJSON specification and has a "2dsphere" index. The problem is that the query takes about 11 seconds to execute on a collection of about 2 lakh (200,000) objects. Previously it took about 100 ms. My document is as follows.
{
"_id": ObjectId("4f9c2aa2d142b9882f02a3b3"),
"geonameId": NumberInt(1106542),
"name": "Chitungwiza",
"feature code": "PPL",
"country code": "ZW",
"state": "Harare Province",
"population": NumberInt(340360),
"elevation": "",
"timezone": "Africa\/Harare",
"geolocation": {
"type": "Point",
"coordinates": {
"0": 31.07555,
"1": -18.01274
}
}
}
My explain query output is given below.
db.city_info.find({"geolocation":{'$near':{ '$geometry': { 'type':"Point",coordinates:[73,23] } }}}).explain()
{
"cursor" : "S2NearCursor",
"isMultiKey" : true,
"n" : 172980,
"nscannedObjects" : 172980,
"nscanned" : 1121804,
"nscannedObjectsAllPlans" : 172980,
"nscannedAllPlans" : 1121804,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 13,
"nChunkSkips" : 0,
"millis" : 13841,
"indexBounds" : {
},
"nscanned" : 1121804,
"matchTested" : NumberLong(191431),
"geoMatchTested" : NumberLong(191431),
"numShells" : NumberLong(373),
"keyGeoSkip" : NumberLong(930373),
"returnSkip" : NumberLong(933610),
"btreeDups" : NumberLong(0),
"inAnnulusTested" : NumberLong(191431),
"server" : "..."
}
Please let me know how I can correct the problem and reduce the query time.
Contrary to what you suggest, the $near command does not require a $maxDistance argument for "2dsphere" indexes. Adding $maxDistance just specified a range that reduced the number of query results to a manageable number. The reason for the difference you saw when changing from "2d" to "2dsphere" indexes is that "2d" indexes impose a default limit of 100 if none is specified. As you can see, the default query plan for 2dsphere indexes does not impose such a limit, so the query scans the entire index ("nscannedObjects" : 172980). If you ran the same query on a "2d" index you would see that "n" and "nscannedObjects" are only 100, which explains the cost discrepancy.
If all of your items were within the $maxDistance range (try it with a $maxDistance of 20 million meters, for instance), you would see query performance degrade back to where it was without it. In either case, it is very important to use limit() so that the query plan only scans the necessary results within the index, to prevent runaways, especially with larger data sets.
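A sketch of a bounded $near filter, built as a plain JavaScript object. buildNearQuery is a made-up helper and the 50000 meter radius is an arbitrary example value. Placing $maxDistance inside $near follows the current MongoDB documentation; some older releases documented it as a sibling of $near, so check the manual for your version.

```javascript
// Builds a $near filter on the geolocation field with a distance cap.
// GeoJSON coordinate order is [longitude, latitude]; distance is in meters.
function buildNearQuery(lng, lat, maxMeters) {
  return {
    geolocation: {
      $near: {
        $geometry: { type: "Point", coordinates: [lng, lat] },
        $maxDistance: maxMeters
      }
    }
  };
}

const nearQuery = buildNearQuery(73, 23, 50000);
console.log(JSON.stringify(nearQuery, null, 2));
```

In the shell this would be combined with an explicit limit, for example db.city_info.find(nearQuery).limit(100).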
I have solved the problem. The $near command requires the $maxDistance argument, as specified here: http://docs.mongodb.org/manual/applications/2dsphere/ . As soon as I supplied $maxDistance, the query time dropped to less than 100 ms.