I need to do fast queries to find all documents within a certain GPS radius of a point. The radius will be small and accuracy is not that critical, so I don't need to account for the spherical geometry. There will be lots of writes. Will I get better performance with a 2d index than a 2dsphere?
If you definitely don't need spherical geometry or more than one field in a compound geo index (see the notes on Geospatial Indexes in the MongoDB manual), a 2d index would be more appropriate. There will also be a slight storage advantage in saving coordinates as legacy pairs (longitude, latitude) rather than GeoJSON points. This probably isn't enough to significantly impact your write performance, but it depends what you mean by "lots of writes" and whether these will be pushing your I/O limits.
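For example, a minimal sketch of that setup, assuming a places collection with a loc field holding legacy [longitude, latitude] pairs (the collection and field names here are placeholders, not from the question):
db.places.createIndex({ loc: "2d" })
// Documents store legacy pairs, e.g. { _id: 1, loc: [ -73.97, 40.77 ] }
// Radius query on the flat 2d index; the radius is in the same units as the coordinates:
db.places.find({ loc: { $geoWithin: { $center: [ [ -73.97, 40.77 ], 0.01 ] } } })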
I'm not sure on the relative performance of queries for different geo index types, but you can easily set up a representative test case in your own dev/staging environment to compare. Make sure you average the measurements over a number of iterations so documents are loaded into memory and there is a fair comparison.
You may also want to consider a haystack index, which is designed to return results for 2d queries within a small area in combination with an additional field criteria (for example, "find restaurants near longitude, latitude"). If you are not fussed on accuracy or sorting by distance (and have an additional search field), this index type may work well for your use case.
As of MongoDB 3.2, 2dsphere indexes are version 3, and 2dsphere performs better than 2d.
The data below is from https://jira.mongodb.org/browse/SERVER-18056; more details: https://www.mongodb.com/blog/post/geospatial-performance-improvements-in-mongodb-3-2
3.1.6 - 2dsphere V2
"executionTimeMillis" : 1875,
"totalKeysExamined" : 24335,
"totalDocsExamined" : 41848,
After reindex
3.1.6 - 2dsphere V3
"executionTimeMillis" : 94,
"totalKeysExamined" : 21676,
"totalDocsExamined" : 38176,
Compared to 2d
3.1.6 - 2d
"executionTimeMillis" : 359,
"totalKeysExamined" : 95671,
"totalDocsExamined" : 112968,
I have millions of documents stored in MongoDB, each one having a 64-bit hash.
As an example:
0011101001110001001101110000101011010101101111101110110101011001 doc1
0111100111000011011011100001101010001110111100001101101100011111 doc2
and so on.
Now I would like to find, in an efficient way, all the documents that have a hamming distance <= 5 from a dynamic input, without querying all the results one by one.
There are a few solutions I found:
A) pre-filter the existing result set, as in Hamming Distance / Similarity searches in a database. I have not given this a go yet; it seems interesting to say the least, but I can't find any information on the internet about how efficient it would be.
B) use some kind of metric-space solution (this involves having another separate structure to keep things in sync, etc.)
For the purpose of this question, I'd like to narrow it down a bit further and ask whether it is possible to "exploit/hack" the geospatial indexes MongoDB provides.
(https://docs.mongodb.com/manual/core/2dsphere/)
The geospatial indexes:
A) allow you to store GeoJSON objects (point, line, polygon)
B) efficiently query all the GeoJSON objects
C) support operations such as finding GeoJSON objects within a given radius of a point, as well as intersections between GeoJSON objects
If I could find a way to map these 64-bit hashes to latitude/longitude (or maybe to polygons) such that hashes that are similar (small hamming distance) are grouped closer to each other, then the geospatial index might work well: I could say "from this latitude/longitude point, give me all the binary strings within a radius of 5 (hamming distance)". Could that work?
The problem is I have no idea if any of this is even feasible.
A really old question I found on this: https://groups.google.com/g/mongodb-user/c/lmlcugk2dFs?pli=1
Hamming distance, when applied to binary data, can be considered a directed graph problem.
For 2 bit values, the first bit is the x coordinate, the second is y, and the hamming distance between any two points is the number of sides that must be traversed to move from one to the other.
For 3 bit values, the third bit is the z coordinate, and the points are the vertices of a cube.
For 4 bits, that is a tesseract, and much harder to visualize.
For 64 bits, each value would be one of the vertices on a "unit cube" in 64 dimensions.
Each point would have 64 neighbors with a hamming distance of exactly 1.
One possibility is to trade a few extra gigabytes of storage for some performance in finding other points within the hamming distance.
Pre-calculate the hash values of the 64 immediate neighbors, regardless of whether they exist in the data set or not, and store those in an array in the document with the original hash. This might be quite a daunting task for already existing documents, but is a bit more manageable if done during the initial insert process.
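A sketch of what that precomputation might look like at insert time, assuming the hashes are stored as 64-character binary strings and using the hypothetical collection and field names (collectionName, hashField, neighbors) referenced below:
// Return the 64 strings that differ from hash in exactly one bit (hamming distance 1).
function neighborHashes(hash) {
  var neighbors = [];
  for (var i = 0; i < hash.length; i++) {
    var flipped = hash[i] === "0" ? "1" : "0";
    neighbors.push(hash.substring(0, i) + flipped + hash.substring(i + 1));
  }
  return neighbors;
}
// Store the precomputed neighbors alongside the original hash.
var hash = "0011101001110001001101110000101011010101101111101110110101011001";
db.collectionName.insertOne({ hashField: hash, neighbors: neighborHashes(hash) });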
You could then find all documents whose hashes are within a hamming distance of 5 using the $graphLookup aggregation stage.
If the hash is stored in a field named hashField and the hashes that are a distance of 1 are in a field named neighbors, that might look something like:
db.collectionName.aggregate([
  // Select the document whose hash is the starting point.
  {$match: {<match criteria to select starting hash>}},
  {$graphLookup: {
      from: "collectionName",
      // Begin the traversal from the precomputed distance-1 neighbors.
      startWith: "$neighbors",
      connectFromField: "neighbors",
      connectToField: "hashField",
      as: "closehashes",
      // Depth 0 is the immediate neighbors (distance 1), so maxDepth: 4
      // reaches hashes up to a hamming distance of 5.
      maxDepth: 4,
      depthField: "distance"
  }}
])
This would benefit greatly from an index on {hashField: 1}.
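For reference, creating that index (same placeholder names as above):
db.collectionName.createIndex({ hashField: 1 })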
I am working on an IoT project that involves fuel volume information.
I am iterating over each MongoDB document and trying to plot the fuel volumes in a graph while omitting noise values. Each document contains the fuel value for one second.
When I plot the data as it is, there is a lot of noise, such as:
The vehicle is idle and the same fuel data point is plotted in the graph
When the vehicle is moving on a rough road it gives spikes
The data collection simply looks like the below:
{
  _id : 1,
  fuel : 100,
  timestamp: 2020-09-18T06:06:01.628+00:00
},
{
  _id : 2,
  fuel : 100.1,
  timestamp: 2020-09-18T06:06:02.628+00:00
},
{
  _id : 3,
  fuel : 98,
  timestamp: 2020-09-18T06:06:03.628+00:00
}
I am trying to compare each document's fuel value with the previous document's fuel value and find the percentage change. Then I can set a threshold, such as plus or minus 5%, and filter out certain data points. This can solve the first problem I am having.
I don't have any idea how to avoid the spiky data points.
I tried the aggregation pipeline, but I didn't find a suitable operator or a way to save and compare the previous document's data. Is there any way to do this? I cannot smooth the data in memory because the data array can be very big from time to time.
Highly appreciate your support.
Map/reduce fans out the input documents to processors; the documents are not necessarily processed sequentially, so it doesn't make sense to talk about a "previous" document.
With the aggregation pipeline you can $group or $bucket to create buckets of data and then operate on the buckets (e.g. to average the values within a bucket).
If you want to do complex math on the values you may need to do that in the application in any event.
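As a sketch of the bucketing idea, assuming the fuel and timestamp fields from the question (the collection name fuelReadings is a placeholder), you could group the readings per minute and average them:
db.fuelReadings.aggregate([
  // Truncate each timestamp to the minute and average the fuel readings within that minute.
  { $group: {
      _id: { $dateToString: { format: "%Y-%m-%dT%H:%M", date: "$timestamp" } },
      avgFuel: { $avg: "$fuel" }
  }},
  { $sort: { _id: 1 } }
])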
I have a collection with 100 million documents of geometry.
I have a second collection with time data associated to each of the other geometries. This will be 365 * 96 * 100 million or 3.5 trillion documents.
Rather than store the 100 million entries (365*96) times more than needed, I want to keep them in separate collections and do a type of JOIN/DBRef/whatever I can in MongoDB.
First and foremost, I want to get a list of GUIDs from the geometry collection by using a geoIntersection. This will filter it down from 100 million to 5,000. Then, using those 5,000 geometry GUIDs, I want to filter the 3.5 trillion documents based on the 5,000 geometries and the additional date criteria I specify, aggregate the data, and find the average. You are left with 5,000 geometries and 5,000 averages for the date criteria you specified.
This is basically a JOIN as I know it in SQL. Is this possible in MongoDB, and can it be done optimally, in say less than 10 seconds?
To clarify: as I understand it, this is what DBRefs are used for, but I have read that they are not efficient at all, and that when dealing with this much data they wouldn't be a good fit.
If you're going to be dealing with a geometry and its time series data together, it makes sense to store them in the same doc. A year's worth of data in 15 minute increments isn't killer - and you definitely don't want a document for every time-series entry! Since you can retrieve everything you want to operate on as a single geometry document, it's a big win. Note that this also lets you sparse things up for missing data. You can encode the data differently if it's sparse rather than indexing into a 35040 slot array.
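A sketch of what that combined document might look like (field names here are placeholders; values holds one slot per 15-minute interval, or a sparser representation where data is missing):
{
  _id: "geometry-guid",
  geometry: { type: "Polygon", coordinates: [ /* GeoJSON ring(s) */ ] },
  values: [ 12.1, 11.8, 12.4 /* ... up to 35040 slots for the year ... */ ]
}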
A $geoIntersects on a big pile of geometry data will be a performance issue, though. Make sure you have an appropriate index (like 2dsphere) to speed things up.
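A sketch of the index and the intersection query, assuming a geometries collection with a GeoJSON geometry field (placeholder names):
db.geometries.createIndex({ geometry: "2dsphere" })
db.geometries.find(
  { geometry: { $geoIntersects: { $geometry: {
      type: "Polygon",
      coordinates: [ [ [ -98, 30 ], [ -97, 30 ], [ -97, 31 ], [ -98, 31 ], [ -98, 30 ] ] ]
  } } } },
  { _id: 1 }  // project just the GUIDs for the second-stage filter
)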
If there is any way you can build additional qualifiers into the query that could cheaply eliminate members from the more expensive search, you may make things zippier. Like, say the search will hit states in the US. You could first intersect the search with state boundaries to find the states containing the geodata and use something like a postal code to qualify the documents. That would be a really quick pre-search against 50 documents. If a search boundary was first determined to hit 2 states, and the geo-data records included a state field, you just winnowed away 96 million records (all things being equal) before the more expensive geo part of the query. If you intersect against smallish grid coordinates, you may be able to winnow it further before the geo data is considered.
Of course, going too far adds overhead. If you can correctly tune the system to the density of the 100 million geometries, you may be able to get the times down pretty low. But without actually working with the specifics of the problem, it's hard to know. That much data probably requires some specific experimentation rather than relying on a general solution.
It seems that MongoDB has two types of geospatial indexes.
http://www.mongodb.org/display/DOCS/Geospatial+Indexing
The standard one. With a note:
You may only have 1 geospatial index per collection, for now. While
MongoDB may allow to create multiple indexes, this behavior is
unsupported. Because MongoDB can only use one index to support a
single query, in most cases, having multiple geo indexes will produce
undesirable behavior.
And then there is this so called geohaystack thingy.
http://www.mongodb.org/display/DOCS/Geospatial+Haystack+Indexing
They both claim to use the same algorithm. They both divide the earth into grids and then search based on that.
So what's the difference?
MongoDB doesn't seem to use R-trees and the like, right?
NB: An answer to the question "How does MongoDB implement its spatial indexes?" says that the 2d index uses a geohash too.
The implementation is similar, but the use case difference is described on the Geospatial Haystack Indexing page.
The haystack indices are "bucket-based" (aka "quadrant") searches tuned for small-region longitude/latitude searches:
In addition to ordinary 2d geospatial indices, mongodb supports the use
of bucket-based geospatial indexes. Called "Haystack indexing", these
indices can accelerate small-region type longitude / latitude queries
when additional criteria is also required.
For example, "find all restaurants within 25 miles with name 'foo'".
Haystack indices allow you to tune your bucket size to the distribution
of your data, so that in general you search only very small regions of
2d space for a particular kind of document. They are not suited for
finding the closest documents to a particular location, when the
closest documents are far away compared to bucket size.
The bucketSize parameter is required, and determines the granularity of the haystack index.
So, for example:
db.places.ensureIndex({ pos : "geoHaystack", type : 1 }, { bucketSize : 1 })
This example bucketSize of 1 creates an index where keys within 1 unit of longitude or latitude are stored in the same bucket. An additional category can also be included in the index, which means that information will be looked up at the same time as finding the location details.
The B-tree representation would be similar to:
{ loc: "x,y", category: z }
If your use case typically searches for "nearby" locations (i.e. "restaurants within 25 miles") a haystack index can be more efficient. The matches for the additional indexed field (eg. category) can be found and counted within each bucket.
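Haystack indexes are queried via the geoSearch command rather than a regular find(); a sketch against the places collection indexed above (coordinates and values are illustrative):
db.runCommand({
  geoSearch: "places",
  near: [ -73.9, 40.7 ],
  maxDistance: 6,        // in the same units as the coordinates
  search: { type: "restaurant" },
  limit: 30
})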
If, instead, you are searching for "nearest restaurant" and would like to return results regardless of distance, a normal 2d index will be more efficient.
There are currently (as of MongoDB 2.2.0) a few limitations on haystack indexes:
only one additional field can be included in the haystack index
the additional index field has to be a single value, not an array
null long/lat values are not supported
Note: distance between degrees of latitude will vary greatly (longitude, less so). See: What is the distance between a degree of latitude and longitude?.
I will build a GIS system based on polygons, not just points. I wanted to use MongoDB or PostGIS.
How do I do the following in MongoDB?
Query A - get the center of a polygon
Query B - distance between two polygons
Query C - list of polygons that are part of a third that I specify
Query D - near-distance of the polygon
Support SRID?
MongoDB's geospatial indexing currently only indexes points. Although it does support proximity and bounds queries, documents are matched by a single point. You may be able to take advantage of multi-location documents and index multiple points along a polygon, which might support some of your queries with reduced precision; however, that would certainly not be ideal.
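A sketch of that multi-location workaround, with placeholder names (each vertex of the polygon is stored as a separate legacy point and indexed with a multi-key 2d index):
db.polygons.ensureIndex({ points: "2d" })
db.polygons.insert({
  name: "parcel-42",
  points: [ [ -73.99, 40.75 ], [ -73.98, 40.75 ], [ -73.98, 40.76 ] ]
})
// Proximity query matches a document if any of its points is near the query point.
db.polygons.find({ points: { $near: [ -73.985, 40.755 ] } })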
PostGIS seems more appropriate for your requirements.