How to optimize a mongodb geospatial query? - mongodb

A while ago I built the website crowdfundstats.com during a hackathon. The website gives some interesting insights based on the data of roughly 130,000 Kickstarter projects that we scraped. The most interesting feature is the map at http://crowdfundstats.com/map.html, on which you can drag a radius on the world map to get information on projects within that radius.
I use the aggregate function to find all projects within the radius based on their geospatial information. Each project has a geo location in the following format:
{ g1 :
{ type : "Point",
coordinates : [ -83.102840423584, 42.354639053345 ] }
}
The aggregate function then returns the total number of backers, the average duration, the success percentage and the total number of projects within the radius:
{'$match' :
{g1 :
{$geoWithin :
{ $centerSphere :[[parseFloat(long), parseFloat(lat) ], radius/6371 ]
}
}
}
},
{'$group':
{ "_id":"",
"backers": {"$sum": "$backers"},
"dateDiff2": {"$avg": "$dateDiff2"},
"completed": {"$avg": "$completed"},
"total": {"$sum": 1}
}
}
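For reference, a minimal sketch of how the two stages above could be assembled in one place. The helper name buildRadiusPipeline and the assumption that the radius arrives in kilometers are mine, for illustration only; 6371 is the mean Earth radius in km already used in the $centerSphere stage.

```javascript
// Hypothetical helper: builds the $match/$group pipeline shown above.
// Assumes the radius is given in kilometers; $centerSphere expects radians.
const MEAN_EARTH_RADIUS_KM = 6371;

function buildRadiusPipeline(long, lat, radiusKm) {
  const center = [parseFloat(long), parseFloat(lat)];
  // Reject unparsable coordinates and non-positive radii before querying.
  if (center.some(Number.isNaN) || !(radiusKm > 0)) {
    throw new Error("invalid center or radius");
  }
  return [
    { $match: { g1: { $geoWithin: {
        $centerSphere: [center, radiusKm / MEAN_EARTH_RADIUS_KM]
    } } } },
    { $group: {
        _id: "",
        backers: { $sum: "$backers" },
        dateDiff2: { $avg: "$dateDiff2" },
        completed: { $avg: "$completed" },
        total: { $sum: 1 }
    } }
  ];
}

// Example: a 100 km radius around central London.
const pipeline = buildRadiusPipeline("-0.1276", "51.5072", 100);
console.log(JSON.stringify(pipeline[0]));
```

The resulting array can be passed to the collection's aggregate call as is.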
The issue is that the query takes a long time to return (for example, more than 10 seconds when dragging the radius over the UK). I have already added a 2dsphere index to increase speed, but this has almost no effect:
{
"g1" : "2dsphere"
}
Is there anything I can do to optimise the query, or is this the expected performance on geospatial queries?
Thanks in advance

For anyone stumbling on this thread: I improved the heaviest query from 15 seconds to 0.5 seconds by upgrading from MongoDB 3.0 to 3.2, which improved geospatial querying immensely. You can read more about it on the MongoDB blog.

Related

Filter Documents by Distance Stored in Document with $near

I am using the following example to better explain my need.
I have a set of points (users) on a map, and the collection schema is as below:
{
location:{
latlong:[long,lat]
},
maxDistance:Number
}
I have another collection with events happening in the area. Its schema is given below:
{
eventLocation:{
latlong:[long,lat]
}
}
Users can add their location and the maximum distance they are willing to travel to attend an event, and save it.
Whenever a new event is posted, all users whose preferences it satisfies should get a notification. How do I query for that? I tried the following query on the user schema:
{
$where: {
'location.latlong': {
$near: {
$geometry: {
type: "Point",
coordinates: [long,lat]
},
$maxDistance: this.distance
}
}
}
}
I got an error:
error: {
"$err" : "Can't canonicalize query: BadValue $where got bad type",
"code" : 17287
}
How do I query the above case, given that maxDistance is defined per user and is not fixed? I am using a 2dsphere index.
Presuming you have already worked out how to act on the event data as you receive it and have it in hand (if you have not, that is another question, but look at tailable cursors), you should have an object with that data with which to query the users.
This is therefore not a case for JavaScript evaluation with $where, as it cannot access the query data returned from a $near operation anyway. What you want instead is $geoNear from the aggregation framework. This can project the "distance" found from the query, and allow a later stage to "filter" the results against the user stored value for the maximum distance they want to travel to published events:
// Represent retrieved event data
var eventData = {
eventLocation: {
latlong: [long,lat]
}
};
// Find users near that event within their stored distance
User.aggregate(
[
{ "$geoNear": {
"near": {
"type": "Point",
"coordinates": eventData.eventLocation.latlong
},
"distanceField": "eventDistance",
"limit": 100000,
"spherical": true
}},
{ "$redact": {
"$cond": {
"if": { "$lt": [ "$eventDistance", "$maxDistance" ] },
"then": "$$KEEP",
"else": "$$PRUNE"
}
}}
],
function(err,results) {
// Work with results in here
}
)
Now you do need to be careful with the returned number: since you appear to be storing "legacy coordinate pairs" instead of GeoJSON, the distance returned from this operation will be in radians and not a standard distance unit. So presuming you are storing "miles" or "kilometers" on the user objects, you need to convert using the formula given in the manual under "Calculate Distances Using Spherical Geometry".
The basics are that you need to divide the stored distance by the equatorial radius of the earth, either 3,963.2 miles or 6,378.1 kilometers, to convert it to radians for comparison with the returned value.
The alternative is to store GeoJSON instead, where distances are consistently measured in meters.
Assuming "kilometers", that "if" line becomes:
"if": { "$lt": [
"$eventDistance",
{ "$divide": [ "$maxDistance", 6378.1 ] }
]},
This reliably compares your stored kilometer value to the radian result returned.
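To make the unit handling concrete, here is a small sketch in plain JavaScript of the same comparison the $redact stage performs. The function names and the sample users are made up for illustration; the only figure taken from above is the 6,378.1 km equatorial radius.

```javascript
// Equatorial radius quoted above; dividing kilometers by it yields radians.
const EQUATORIAL_RADIUS_KM = 6378.1;

function kmToRadians(km) {
  return km / EQUATORIAL_RADIUS_KM;
}

// Mirrors the $redact condition: keep a user when the distance to the event
// (radians, as returned for legacy pairs) is below their own limit (km).
function isWithinTravelRange(user) {
  return user.eventDistance < kmToRadians(user.maxDistance);
}

// Hypothetical results as they might come out of the $geoNear stage.
const users = [
  { name: "a", eventDistance: 0.001, maxDistance: 10 }, // roughly 6.4 km away
  { name: "b", eventDistance: 0.003, maxDistance: 10 }  // roughly 19.1 km away
];

console.log(users.filter(isWithinTravelRange).map(u => u.name));
```

Only user "a" passes here, since 10 km corresponds to about 0.00157 radians.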
The other thing to be aware of is that $geoNear has a default "limit" of 100 results, so you need to "pump up" the "limit" argument to the number of users you expect could possibly match. You might even want to do this in "range lists" of user ids for a really large system, but you can go as big as memory allows within a single aggregation operation and possibly add allowDiskUse where needed.
If you don't tune that parameter, then only the nearest 100 results (the default) will be returned, which may well not even suit your next operation of filtering those "near" the event to start with. Use common sense though, as you surely have a maximum distance beyond which no user would qualify, and that can be added to the query as well.
As stated, the point here is returning the distance for comparison, so the next stage is the $redact operation, which can filter the user's own "travel distance" value against the returned distance from the event. The end result gives only those users that fall within their own distance constraint from the event and so qualify for notification.
That's the logic. You project the distance from the user to the event and then compare to the user stored value for what distance they are prepared to travel. No JavaScript, and all native operators that make it quite fast.
Also as noted in the options and the general commentary, I really do suggest you use a "2dsphere" index for accurate spherical distance calculation as well as converting to GeoJSON storage for your coordinate storage in your database Objects, as they are both general standards that produce consistent results.
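For the GeoJSON route, a rough sketch of how the same pipeline might look. It rests on assumptions of mine, not on anything in the question: users store a GeoJSON Point in a field called location with a 2dsphere index, and maxDistance is kept in kilometers. buildNotifyPipeline and capMeters are hypothetical names. With GeoJSON points $geoNear reports distances in meters, so the stored value is multiplied by 1000 rather than converted to radians.

```javascript
// Hypothetical helper: assumes GeoJSON user locations with a 2dsphere index
// and a per-user maxDistance stored in kilometers.
function buildNotifyPipeline(eventLngLat, capMeters) {
  return [
    { $geoNear: {
        near: { type: "Point", coordinates: eventLngLat },
        distanceField: "eventDistance", // meters when "near" is GeoJSON
        maxDistance: capMeters,         // overall cap to shrink the candidate set
        spherical: true
    } },
    { $redact: {
        $cond: {
          // Keep users whose own limit (km converted to meters) covers the distance.
          if: { $lt: ["$eventDistance", { $multiply: ["$maxDistance", 1000] }] },
          then: "$$KEEP",
          else: "$$PRUNE"
        }
    } }
  ];
}

// Example: event at [long, lat], nobody considered beyond 200 km.
const notifyPipeline = buildNotifyPipeline([-83.1028, 42.3546], 200000);
console.log(JSON.stringify(notifyPipeline, null, 1));
```

The cap plays the role of the overall maximum distance mentioned above; whether a "limit" option is also needed depends on the server version in use.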
Try it without embedding your query in $where: {. The $where operator is for passing a javascript function to the database, which you don't seem to want to do here (and is in fact something you should generally avoid for performance and security reasons). It has nothing to do with location.
{
'location.latlong': {
$near: {
$geometry: {
type: "Point",
coordinates: [long,lat]
},
$maxDistance: this.distance
}
}
}

How to find nearby events or tweets

I'm new to NoSQL databases and I'm stuck with a fairly basic query.
I have a collection of tweets in a MongoDB database, which I'm querying through both the Mongo shell and pyMongo. The documents are similar to:
{ loc : { lng : 40, lat : 3 },
timestamp : 124125512,
userid : 55 }
I need to find all pairs of users with events close to each other with less than 4 hours of difference. The most naive way would be:
db.tweets.find().forEach(function(tweet)
{
found = db.tweets.find({ "timestamp": { "$gt" : tweet['timestamp'] - 60*60*4,
"$lt" : tweet['timestamp'] + 60*60*4},
"loc" : {"$near" : [ tweet['loc']['lng'],
tweet['loc']['lat'] ],
"$maxDistance" : 500 }
});
//... extract the users from those tweets...
});
This is of course extremely slow (the collection can contain as many as a few million tweets).
I haven't been able to express this query using either aggregation or MapReduce. How would you do it? What is the most NoSQL-y, efficient and clear way of making this kind of query?
EDIT: I've kind of given up. A friend has convinced me that it is not going to be worth using Mongo for this. I can leverage the time restriction to avoid iterating over the whole collection and do it in a simple, more traditional iterative script. Since the dataset is not too large to fit in RAM, that is going to be faster.
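A rough sketch of what such an iterative script could look like, assuming the tweets are already loaded into an array: sort by timestamp, then compare each tweet only against the ones inside its time window. The haversine distance in kilometers and the maxKm threshold are my own stand-ins for the $near/$maxDistance test, and findClosePairs is a made-up name.

```javascript
// Great-circle distance in kilometers (haversine, mean Earth radius 6371 km).
function haversineKm(a, b) {
  const toRad = d => d * Math.PI / 180;
  const dLat = toRad(b.lat - a.lat);
  const dLng = toRad(b.lng - a.lng);
  const h = Math.sin(dLat / 2) ** 2 +
            Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLng / 2) ** 2;
  return 2 * 6371 * Math.asin(Math.sqrt(h));
}

// Returns unique "userA|userB" pairs whose tweets are less than windowSec
// apart in time and at most maxKm apart in space.
function findClosePairs(tweets, windowSec, maxKm) {
  const sorted = [...tweets].sort((x, y) => x.timestamp - y.timestamp);
  const pairs = new Set();
  let start = 0;
  for (let i = 0; i < sorted.length; i++) {
    // Move the window start past tweets that are too old for tweet i.
    while (sorted[i].timestamp - sorted[start].timestamp >= windowSec) start++;
    for (let j = start; j < i; j++) {
      const a = sorted[j], b = sorted[i];
      if (a.userid !== b.userid && haversineKm(a.loc, b.loc) <= maxKm) {
        pairs.add([a.userid, b.userid].sort((p, q) => p - q).join("|"));
      }
    }
  }
  return [...pairs];
}

const sample = [
  { loc: { lng: 40, lat: 3 }, timestamp: 0, userid: 55 },
  { loc: { lng: 40.001, lat: 3 }, timestamp: 3600, userid: 56 },
  { loc: { lng: 40, lat: 3 }, timestamp: 100000, userid: 57 },
  { loc: { lng: 10, lat: 50 }, timestamp: 1800, userid: 58 }
];
console.log(findClosePairs(sample, 60 * 60 * 4, 1));
```

Because the array is sorted, each tweet is only compared against its neighbours in time, so the cost depends on how many tweets fall into a 4-hour window rather than on the whole collection.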
Using $near in conjunction with $maxDistance is the recommended way:
db.collectionName.find({loc: {$near: [50, 50], $maxDistance: 5}});
For performance issues you can try creating index as mentioned below:
To create a geospatial index for GeoJSON-formatted data, use the ensureIndex() method and set the value of the location field for your collection to 2dsphere.
db.points.ensureIndex( { loc : "2dsphere" } );
For more information:
Index creation
Build a 2dsphere index
Geospatial indexes and queries

mongodb and geospatial schema

I'm breaking my head over MongoDB and geospatial queries,
so maybe someone has an idea or a solution for this.
My object schema is like this GeoJSON sample, taken from http://geojson.org/geojson-spec.html:
{
"name":"name",
"geoJSON":{
"type":"FeatureCollection",
"features":[
{"type":"Feature","geometry":{"type":"Polygon","coordinates":[[[100,0],[101,0],[101,1],[100,1],[100,0]]]},"properties":{}},
{"type":"Feature","geometry":{"type":"Polygon","coordinates":[[[100,0],[101,0],[101,1],[100,1],[100,0]]]},"properties":{}},
{"type":"Feature","geometry":{"type":"Polygon","coordinates":[[[100,0],[101,0],[101,1],[100,1],[100,0]]]},"properties":{}}
]}}
Additional info: I'm using Spring Data, but that shouldn't influence the answer.
The main problem is how and where to put indexes in this schema. I need a query that finds all documents in which some polygon intersects a given Point.
Thanks in advance.
By creating a 2dsphere index on geoJSON.features.geometry you should be able to index all of the GeoJSON objects (a 2d index only handles legacy coordinate pairs, not GeoJSON polygons).
To get all documents where at least one of the sub-objects in the features array covers a certain point, you can use the $geoIntersects operator with a GeoJSON Point:
db.yourcollection.find(
{ "geoJSON.features.geometry" :
{ $geoIntersects :
{ $geometry :
{ type : "Point" ,
coordinates: [ 100.5 , 0.5 ]
}
}
}
}
)
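If the point comes from user input, it may help to build the filter in a small function and check the coordinate ranges first. This is only a sketch; buildPointIntersectsQuery and indexSpec are hypothetical names, and the field path is the one from the schema above.

```javascript
// Index specification for the nested geometries (to be passed to the
// collection's index-creation call).
const indexSpec = { "geoJSON.features.geometry": "2dsphere" };

// Hypothetical helper: builds the $geoIntersects filter for one point.
function buildPointIntersectsQuery(lng, lat) {
  // GeoJSON positions are [longitude, latitude], in that order.
  if (!(lng >= -180 && lng <= 180) || !(lat >= -90 && lat <= 90)) {
    throw new RangeError("longitude/latitude out of range");
  }
  return {
    "geoJSON.features.geometry": {
      $geoIntersects: {
        $geometry: { type: "Point", coordinates: [lng, lat] }
      }
    }
  };
}

console.log(JSON.stringify(buildPointIntersectsQuery(100.5, 0.5)));
```

The dotted field path has to be a quoted string key, which is why it appears in double quotes.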

Optimizing Compound Mongo GeoSpatial Index

I have a MongoDB $within query that looks like this:
db.action.find( { $and : [
{ actionType : "PLAY" },
{
location : {
$within : {
$polygon : [ [ 0.0, 0.1 ], [ 0.0, 0.2 ] .. [ a.b, c.d ] ]
}
}
}
] } ).sort( { time : -1 } ).limit(50)
With regard to the action collection documents
There are 5 actionTypes
The action documents MAY or MAY NOT have a location with a ratio of approximately 70:30 for PLAY actions
Otherwise there is no location
The action documents will ALWAYS have time
The collection contains the following indexes
// I am interested in recent actions
db.action.ensureIndex({"time": -1})
// I am interested in recent actions by a specific user
db.action.ensureIndex({"userId": 1, "time": -1})
// I am interested in recent actions that relate to a unique song id
db.action.ensureIndex({"songId": 1, "time": -1})
I am experimenting with the following two indexes
LocationOnly: db.action.ensureIndex({"location":"2d"})
LocationPlusTime: db.action.ensureIndex({"location": "2d", "time": -1})
Identical queries with each index are explained below:
LocationOnly
{
"cursor":"BasicCursor",
"isMultiKey":false,
"n":50,
"nscannedObjects":91076,
"nscanned":91076,
"nscannedObjectsAllPlans":273229,
"nscannedAllPlans":273229,
"scanAndOrder":true,
"indexOnly":false,
"nYields":1,
"nChunkSkips":0,
"millis":1090,
"indexBounds":{},
"server":"xxxx"
}
LocationPlusTime
{
"cursor":"BasicCursor",
"isMultiKey":false,
"n":50,
"nscannedObjects":91224,
"nscanned":91224,
"nscannedObjectsAllPlans":273673,
"nscannedAllPlans":273673,
"scanAndOrder":true,
"indexOnly":false,
"nYields":44,
"nChunkSkips":0,
"millis":1156,
"indexBounds":{},
"server":"xxxxx"
}
Given
The geosearch will cover documents of ALL types
The geosearch will cover documents with NO Location and WITH Location in a ratio of roughly 60:40
My questions are
Can anybody explain why isMultiKey="false" on the second explain plan?
Can anybody explain why there are more yields on the 2nd explain plan?
My speculative thoughts are
The potential for NULL location is reducing the effectiveness of the
GeoSpatial index.
Compound Indexes of the GeoSpatial variety are not as powerful as standard compound indexes.
UPDATE
A sample document looks like this.
{ "_id" : "adba1154f1f3d4ddfafbff9bb3ae98f2a50e76ffc74a38bae1c44d251db315d25c99e7a1b4a8acb13d11bcd582b9843e335006a5be1d3ac8a502a0a205c0c527",
"_class" : "ie.soundwave.backstage.model.action.Action",
"time" : ISODate("2013-04-18T10:11:57Z"),
"actionType" : "PLAY",
"location" : { "lon" : -6.412839696767714, "lat" : 53.27401934563561 },
"song" : { "_id" : "82e08446c87d21b032ccaee93109d6be",
"title" : "Motion Sickness", "album" : "In Our Heads", "artist" : "Hot Chip"
},
"userId" : "51309ed6e4b0e1fb33d882eb", "createTime" : ISODate("2013-04-18T10:12:59.127Z")
}
UPDATE
The geo-query looks like this
https://www.google.com/maps/ms?msid=214949566612971430368.0004e267780661744eb95&msa=0&ll=-0.01133,-0.019226&spn=0.14471,0.264187
For various reasons approximately 250,000 documents exist in our DB at the point 0.0
I played with this for a number of days and got the result I was looking for.
Firstly, given that action types other than "PLAY" CAN NOT have a location, the additional query parameter "actionType==PLAY" was unnecessary and was removed. Straight away the plan flipped from a "time-reverse-b-tree" cursor to "Geobrowse-polygon", and for my test search latency improved by a factor of 10.
Next, I revisited the 2dsphere index as suggested by Derick. This gave another latency improvement by a factor of roughly 5. Overall, a much better user experience for map searches was achieved.
I have one refinement remaining. Queries in areas where there are no plays for a number of days have generally increased in latency. This is due to the query looking back in time until it can find "some play". If necessary, I will add in a time range guard to limit the search space of these queries to a set number of days.
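A sketch of what that time range guard might look like once the query is on the 2dsphere index. The names buildGuardedQuery, ring and maxAgeDays are placeholders of mine, and expressing the polygon as a GeoJSON $geometry is an assumption about how the final query is written.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical helper: polygon filter plus a lower bound on "time".
function buildGuardedQuery(ring, maxAgeDays, now = new Date()) {
  // GeoJSON rings must be closed: repeat the first vertex at the end if needed.
  const first = ring[0], last = ring[ring.length - 1];
  const closed = (first[0] === last[0] && first[1] === last[1])
    ? ring
    : [...ring, first];
  return {
    location: {
      $geoWithin: { $geometry: { type: "Polygon", coordinates: [closed] } }
    },
    // Only look back a fixed number of days instead of the whole history.
    time: { $gte: new Date(now.getTime() - maxAgeDays * DAY_MS) }
  };
}

// Intended use: db.action.find(query).sort({ time: -1 }).limit(50)
const query = buildGuardedQuery([[0.0, 0.1], [0.0, 0.2], [0.1, 0.2]], 7);
console.log(JSON.stringify(query));
```

The number of days is a trade-off: too small and quiet areas return nothing, too large and the guard stops helping.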
Thanks for the hints Derick.

Use property for calculation in mongodb

Specialists,
I have a document with several properties, one being a lon/lat array with a 2d index on it.
Another property is a radius property.
What I want is to use that stored radius in the query itself:
{
geo:{
$within: { $center: [[ 9.078597000000036,50.580947], 1+radius]}
}
}
Is this possible with MongoDB? No matter what I search for on Google, I am always directed to the MongoDB documentation about geospatial indexes, but my question is not answered there.
Not sure if I understand you, but the docs page about geospatial indexes gives an example of a query like yours:
db.places.find( { geo: { $geoWithin: { $centerSphere: [ [ long, lat ],
radius ] } } } )
This searches the places collection for everything within radius distance (in radians) of the point [long, lat].