How to find nearby events or tweets - mongodb

I'm new to NoSQL databases and I'm stuck with a fairly basic query.
I have a collection of tweets in a MongoDB database, which I'm querying through both the Mongo shell and pyMongo. The documents are similar to:
{ loc : { lng : 40, lat : 3 },
  timestamp : 124125512,
  userid : 55 }
I need to find all pairs of users with events close to each other and less than 4 hours apart. The most naive way would be:
db.tweets.find().forEach(function(tweet) {
    // For each tweet, find all tweets within 4 hours and 500 units of it
    var found = db.tweets.find({ "timestamp" : { "$gt" : tweet['timestamp'] - 60*60*4,
                                                 "$lt" : tweet['timestamp'] + 60*60*4 },
                                 "loc" : { "$near" : [ tweet['loc']['lng'],
                                                       tweet['loc']['lat'] ],
                                           "$maxDistance" : 500 } });
    // ... extract the users from those tweets ...
});
This is of course extremely slow (the collection can contain as many as a few million tweets).
I haven't been able to express this query using either aggregation or MapReduce. How would you do it? What is the most NoSQL-y, efficient and clear way of running this kind of query?
EDIT: I've more or less given up. A friend convinced me it isn't worth doing this in Mongo. I can leverage the time restriction to avoid iterating over the whole collection, and do it in a simpler, more traditional iterative script. Since the dataset fits in RAM, that is going to be faster.
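For reference, a minimal sketch of that iterative approach in the mongo shell, keeping the question's 4-hour window and 500-unit radius; the planar Euclidean distance check is an illustrative stand-in for whatever distance function is actually appropriate:

// One pass over tweets sorted by timestamp; only pairs inside the
// 4-hour window are ever compared, so the inner loop stays short.
var WINDOW = 60*60*4, MAXDIST = 500;
var tweets = db.tweets.find().sort({ timestamp : 1 }).toArray();
var pairs = [];
for (var i = 0; i < tweets.length; i++) {
    for (var j = i + 1; j < tweets.length &&
                        tweets[j].timestamp - tweets[i].timestamp < WINDOW; j++) {
        var dx = tweets[i].loc.lng - tweets[j].loc.lng;
        var dy = tweets[i].loc.lat - tweets[j].loc.lat;
        if (dx*dx + dy*dy <= MAXDIST*MAXDIST && tweets[i].userid !== tweets[j].userid) {
            pairs.push([ tweets[i].userid, tweets[j].userid ]);
        }
    }
}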

Using $near in conjunction with $maxDistance is the recommended way:
db.collectionName.find({loc: {$near: [50, 50], $maxDistance: 5}});
For performance issues, try creating an index as described below:
To create a geospatial index for GeoJSON-formatted data, use the ensureIndex() method and set the value of the location field for your collection to 2dsphere.
db.points.ensureIndex( { loc : "2dsphere" } );
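With a 2dsphere index in place, the GeoJSON form of a $near query looks like this (a sketch; the loc field name, the coordinates, and the 500-metre radius are illustrative):

db.points.find({
    loc : { $near :
        { $geometry : { type : "Point", coordinates : [ 50, 50 ] },
          $maxDistance : 500   // metres when backed by a 2dsphere index
        }
    }
});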
For more information:
Index creation
Build a 2dsphere index
Geospatial indexes and queries

Related

How to optimize a mongodb geospatial query?

A while ago I built the website crowdfundstats.com during a hackathon. The website gives some interesting insights based on the 130,000 or so Kickstarter projects that we have scraped. The most interesting feature is the map page, http://crowdfundstats.com/map.html, on which you can drag a radius on the world map to get information on projects within that radius.
I use the aggregate function to find all projects within the radius based on their geospatial information. Each project has a geo location in the following format:
{ g1 :
    { type : "Point",
      coordinates : [ -83.102840423584, 42.354639053345 ]
    }
}
The aggregate function then returns the total number of backers, the average duration, the success percentage and the total number of projects within the radius:
{ '$match' :
    { g1 :
        { $geoWithin :
            { $centerSphere : [ [ parseFloat(long), parseFloat(lat) ], radius/6371 ]
            }
        }
    }
},
{ '$group' :
    { "_id" : "",
      "backers" : { "$sum" : "$backers" },
      "dateDiff2" : { "$avg" : "$dateDiff2" },
      "completed" : { "$avg" : "$completed" },
      "total" : { "$sum" : 1 }
    }
}
The issue is that the query takes a long time to return (for example, more than 10 seconds when dragging the radius over the UK). I have already added a 2dsphere index to increase speed, but this has almost no effect:
{
"g1" : "2dsphere"
}
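For reference, an index like that would have been created with something along these lines (projects is a stand-in for the actual collection name):

db.projects.ensureIndex( { g1 : "2dsphere" } );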
Is there anything I can do to optimise the query, or is this the expected performance on geospatial queries?
Thanks in advance
For anyone stumbling on this thread: I have improved the heaviest query from 15 seconds to 0.5 seconds by upgrading from MongoDB 3.0 to 3.2. They have improved geospatial querying immensely. You can read more about it on the MongoDB blog.

Complex-ish mongo query runs fairly slow, combination of $and $or $in and regex

I'm running some queries against a MongoDB 2.4.9 server that populate a datatable on a webpage. The user needs to be able to do a substring search across multiple fields, sort the data on various columns, and flip through the results in pages. I have to check multiple fields for matches since the user could be searching for anything related to the documents. There are about 300,000 documents in the collection, so the database is relatively small.
I have indexes created for the created_by, requester, desc.name, metaprogram.id, program.id, and arr.programid fields. I've also created indexes [("created", 1), ("created_by", 1), ("requester", 1)] and [("created_by", 1), ("requester", 1)] at the suggestion of Dex.
It's also worth mentioning that documents might not have all of the fields that are being searched for here. Some documents might have a metaprogram.id but not the other ID fields for example.
An example of a query I might run is
{
"$query" : {
"$and" : [
{
"created_by" : {"$ne" : "automation"},
"requester" : {"$in" : ["Broadway", "Spec", "Falcon"] }
},
{
"$or" : [
{"requester" : /month/i },
{"created_by" : /month/i },
{"desc.name" : /month/i },
{"metaprogram.id" : {"$in" : [708, 2314, 709 ] } },
{"program.id" : {"$in" : [708, 2314, 709 ] } },
{"arr.programid" : {"$in" : [708, 2314, 709 ] } }
]
}
]
},
"$orderby" : {
"created" : 1
}
}
with differing orderby, limit, and skip values as well.
Queries on average take 500-1500ms to complete.
I've looked into how to make it faster, but haven't been able to come up with anything. Some of the text-search features look handy, but as far as I know each collection supports at most one text index, and text search doesn't support pagination (skips). I'm sure prefix searching instead of regex substring matching would be faster as well, but I need substring matching.
Is there anything you can think of to improve the speed of a query like this?
It's quite hard to optimize a query when it's unpredictable.
Analyze how the system is being used and place indexes on the most popular fields.
Use .explain() to make sure the indexes are being used.
Also, limit the number of results returned to 50 or 100. The user doesn't need to see everything at once.
Try upgrading MongoDB to see if there's a performance improvement.
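To illustrate the .explain() and limit suggestions, something along these lines in the shell (db.collection and the query document q are placeholders):

var q = { "created_by" : { "$ne" : "automation" } };   // placeholder for the user's search
// Verify the query actually hits an index (on 2.4, look for a BtreeCursor rather than a BasicCursor)
db.collection.find(q).sort({ "created" : 1 }).explain();
// Cap the page size so the server never has to materialise the full result set
db.collection.find(q).sort({ "created" : 1 }).skip(0).limit(50);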
Side note:
You might want to consider using ElasticSearch as a search engine instead of MongoDB. ElasticSearch would store the searchable fields and return the MongoDB ids of matching results. As a search engine, ElasticSearch is an order of magnitude faster than MongoDB.
More info:
How to find queries not using indexes or slow in mongodb
Range query for MongoDB pagination
http://www.elasticsearch.org/overview/

mongodb and geospatial schema

I'm breaking my head over Mongo and geospatial data, so maybe someone has an idea or a solution for this.
My object schema follows this GeoJSON sample, taken from http://geojson.org/geojson-spec.html:
{
"name":"name",
"geoJSON":{
"type":"FeatureCollection",
"features":[
{"type":"Feature","geometry":{"type":"Polygon","coordinates":[[[100,0],[101,0],[101,1],[100,1],[100,0]]]},"properties":{}},
{"type":"Feature","geometry":{"type":"Polygon","coordinates":[[[100,0],[101,0],[101,1],[100,1],[100,0]]]},"properties":{}},
{"type":"Feature","geometry":{"type":"Polygon","coordinates":[[[100,0],[101,0],[101,1],[100,1],[100,0]]]},"properties":{}}
]}}
Additional info: I'm using Spring Data, but that shouldn't influence the answer.
The main problem is how/where to put indexes in this schema. I need a query that finds all documents in which some polygon intersects a given point.
Thanks in advance.
By creating a 2d or 2dsphere index on geoJSON.features.geometry you should be able to create an index covering all of the geoJSON-objects.
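For example (a sketch, using the same yourcollection name as the query below):

db.yourcollection.ensureIndex( { "geoJSON.features.geometry" : "2dsphere" } );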
To get all documents where at least one of the sub-objects in the features array covers a certain point, you can use the $geoIntersects operator with a GeoJSON Point:
db.yourcollection.find(
    { "geoJSON.features.geometry" :
        { $geoIntersects :
            { $geometry :
                { type : "Point" ,
                  coordinates : [ 100.5 , 0.5 ]
                }
            }
        }
    }
)

Optimizing Compound Mongo GeoSpatial Index

I have a MongoDB $within query that looks like this:
db.action.find( { $and : [
{ actionType : "PLAY" },
{
location : {
$within : {
$polygon : [ [ 0.0, 0.1 ], [ 0.0, 0.2 ] .. [ a.b, c.d ] ]
}
}
}
] } ).sort( { time : -1 } ).limit(50)
With regard to the action collection documents:
There are 5 actionTypes
The action documents MAY or MAY NOT have a location with a ratio of approximately 70:30 for PLAY actions
Otherwise there is no location
The action documents will ALWAYS have time
The collection contains the following indexes:
# I am interested in recent actions
db.action.ensureIndex({ "time" : -1 })
# I am interested in recent actions by a specific user
db.action.ensureIndex({ "userId" : 1, "time" : -1 })
# I am interested in recent actions that relate to a unique song id
db.action.ensureIndex({ "songId" : 1, "time" : -1 })
I am experimenting with the following two indexes:
LocationOnly: db.action.ensureIndex({ "location" : "2d" })
LocationPlusTime: db.action.ensureIndex({ "location" : "2d", "time" : -1 })
Identical queries with each index are explained below:
LocationOnly
{
"cursor":"BasicCursor",
"isMultiKey":false,
"n":50,
"nscannedObjects":91076,
"nscanned":91076,
"nscannedObjectsAllPlans":273229,
"nscannedAllPlans":273229,
"scanAndOrder":true,
"indexOnly":false,
"nYields":1,
"nChunkSkips":0,
"millis":1090,
"indexBounds":{},
"server":"xxxx"
}
LocationPlusTime
{
"cursor":"BasicCursor",
"isMultiKey":false,
"n":50,
"nscannedObjects":91224,
"nscanned":91224,
"nscannedObjectsAllPlans":273673,
"nscannedAllPlans":273673,
"scanAndOrder":true,
"indexOnly":false,
"nYields":44,
"nChunkSkips":0,
"millis":1156,
"indexBounds":{},
"server":"xxxxx"
}
Given
The geosearch will cover documents of ALL types
The geosearch will cover documents with NO Location and WITH Location in a ratio of roughly 60:40
My questions are
Can anybody explain why isMultiKey="false" on the second explain plan?
Can anybody explain why there are more yields on the 2nd explain plan?
My speculative thoughts are:
The potential for a NULL location is reducing the effectiveness of the geospatial index.
Compound indexes of the geospatial variety are not as powerful as standard compound indexes.
UPDATE
A sample document looks like this.
{ "_id" : "adba1154f1f3d4ddfafbff9bb3ae98f2a50e76ffc74a38bae1c44d251db315d25c99e7a1b4a8acb13d11bcd582b9843e335006a5be1d3ac8a502a0a205c0c527",
"_class" : "ie.soundwave.backstage.model.action.Action",
"time" : ISODate("2013-04-18T10:11:57Z"),
"actionType" : "PLAY",
"location" : { "lon" : -6.412839696767714, "lat" : 53.27401934563561 },
"song" : { "_id" : "82e08446c87d21b032ccaee93109d6be",
"title" : "Motion Sickness", "album" : "In Our Heads", "artist" : "Hot Chip"
},
"userId" : "51309ed6e4b0e1fb33d882eb", "createTime" : ISODate("2013-04-18T10:12:59.127Z")
}
UPDATE
The geo-query looks like this
https://www.google.com/maps/ms?msid=214949566612971430368.0004e267780661744eb95&msa=0&ll=-0.01133,-0.019226&spn=0.14471,0.264187
For various reasons approximately 250,000 documents exist in our DB at the point 0.0
I played with this for a number of days and got the result I was looking for.
Firstly, given that action types other than "PLAY" CANNOT have a location, the additional query parameter actionType == "PLAY" was unnecessary, and I removed it. Straight away I flipped from a "time-reverse-b-tree" cursor to "GeoBrowse-polygon", and for my test search latency improved by a factor of 10.
Next, I revisited the 2dsphere index as suggested by Derick. That brought another latency improvement of roughly a factor of 5. Overall, a much better user experience for map searches was achieved.
I have one refinement remaining. Queries in areas where there have been no plays for a number of days have generally increased in latency, because the query keeps looking back in time until it can find "some play". If necessary, I will add a time-range guard to limit the search space of these queries to a set number of days, as in the sketch below.
Thanks for the hints, Derick.
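For the record, the final shape of the index and query would be roughly as follows. This is a sketch: it assumes the location field has been converted to GeoJSON points (which a 2dsphere index expects), and the polygon ring and the 7-day guard are illustrative values, not the actual ones:

db.action.ensureIndex({ "location" : "2dsphere", "time" : -1 })
db.action.find({
    "location" : { $geoWithin :
        { $geometry : { type : "Polygon",
                        coordinates : [ [ [ -6.5, 53.2 ], [ -6.3, 53.2 ],
                                          [ -6.3, 53.3 ], [ -6.5, 53.3 ],
                                          [ -6.5, 53.2 ] ] ] } }
    },
    "time" : { $gt : new Date(Date.now() - 7*24*60*60*1000) }   // optional time-range guard
}).sort({ "time" : -1 }).limit(50)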

how to deal with complicated query in mongodb?

I use MongoDB to store temporal and spatial data, and the documents are structured as follows:
doc = { time : t,
        geo : [x, y]
      }
If the difference between two docs is defined as
dist(doc1, doc2) = |t1 - t2| + |x1 - x2| + |y1 - y2|
how can I query the documents in MongoDB and sort the results by their distance to a given document doc0 = { time : t0, geo : [x0, y0] }?
Thanks
Instead of calculating the distance manually, you could trust MongoDB with that task. MongoDB has built-in geospatial query support.
That would look like this:
db.docs.find( {
    "time" : t0,
    "geo" : { $near : [ x0, y0 ] }
} ).limit(20)
The result would be all documents near the given location [x0,y0], automatically ordered by distance to that point.
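If you need results ordered by the exact metric from the question, |t1 - t2| + |x1 - x2| + |y1 - y2|, one option is to compute it server-side in an aggregation pipeline. A sketch, assuming numeric timestamps and MongoDB 3.2+ (for $abs and $arrayElemAt); t0, x0 and y0 are placeholders for the reference document's values:

db.docs.aggregate([
    { $project : {
        time : 1, geo : 1,
        // dist = |time - t0| + |x - x0| + |y - y0|
        dist : { $add : [
            { $abs : { $subtract : [ "$time", t0 ] } },
            { $abs : { $subtract : [ { $arrayElemAt : [ "$geo", 0 ] }, x0 ] } },
            { $abs : { $subtract : [ { $arrayElemAt : [ "$geo", 1 ] }, y0 ] } }
        ] }
    } },
    { $sort : { dist : 1 } },
    { $limit : 20 }
])

Note that no index can serve this mixed metric, so the pipeline scans the whole collection; for large collections you would pre-filter with a $match on a time range first.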