It seems that MongoDB has 2 types of geospatial indexes.
http://www.mongodb.org/display/DOCS/Geospatial+Indexing
The standard one. With a note:
You may only have 1 geospatial index per collection, for now. While
MongoDB may allow you to create multiple indexes, this behavior is
unsupported. Because MongoDB can only use one index to support a
single query, in most cases, having multiple geo indexes will produce
undesirable behavior.
And then there is this so-called geoHaystack thing.
http://www.mongodb.org/display/DOCS/Geospatial+Haystack+Indexing
They both claim to use the same algorithm. They both divide the earth into a grid of cells and then search based on that.
So what's the difference?
MongoDB doesn't seem to use R-trees and the like, right?
NB: The answer to the question How does MongoDB implement its spatial indexes? says that the 2d index uses geohashes too.
The implementation is similar, but the use case difference is described on the Geospatial Haystack Indexing page.
The haystack indices are "bucket-based" (aka "quadrant") searches tuned for small-region longitude/latitude searches:
In addition to ordinary 2d geospatial indices, mongodb supports the use
of bucket-based geospatial indexes. Called "Haystack indexing", these
indices can accelerate small-region type longitude / latitude queries
when additional criteria is also required.
For example, "find all restaurants within 25 miles with name 'foo'".
Haystack indices allow you to tune your bucket size to the distribution
of your data, so that in general you search only very small regions of
2d space for a particular kind of document. They are not suited for
finding the closest documents to a particular location, when the
closest documents are far away compared to bucket size.
The bucketSize parameter is required, and determines the granularity of the haystack index.
So, for example:
db.places.ensureIndex({ pos : "geoHaystack", type : 1 }, { bucketSize : 1 })
This example bucketSize of 1 creates an index where keys within 1 unit of longitude or latitude are stored in the same bucket. An additional category can also be included in the index, which means that information will be looked up at the same time as finding the location details.
The B-tree representation would be similar to:
{ loc: "x,y", category: z }
If your use case typically searches for "nearby" locations (i.e. "restaurants within 25 miles") a haystack index can be more efficient. The matches for the additional indexed field (eg. category) can be found and counted within each bucket.
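For reference, a haystack index is queried through the geoSearch database command rather than an ordinary find(); a minimal sketch against the places index above (the coordinates and search values here are made up):
db.runCommand({
    geoSearch: "places",              // collection with the geoHaystack index
    near: [-73.99, 40.73],            // starting point: [longitude, latitude]
    maxDistance: 6,                   // in the same units as the coordinates
    search: { type: "restaurant" },   // filter on the additional indexed field
    limit: 30
})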
If, instead, you are searching for "nearest restaurant" and would like to return results regardless of distance, a normal 2d index will be more efficient.
There are currently (as of MongoDB 2.2.0) a few limitations on haystack indexes:
only one additional field can be included in the haystack index
the additional index field has to be a single value, not an array
null long/lat values are not supported
Note: the distance covered by a degree of longitude varies greatly with latitude, while a degree of latitude stays close to 111 km everywhere (a degree of longitude shrinks from about 111 km at the equator to 0 at the poles). See: What is the distance between a degree of latitude and longitude?.
Related
I have millions of documents stored in MongoDB, each one having a 64-bit hash.
As an example:
0011101001110001001101110000101011010101101111101110110101011001 doc1
0111100111000011011011100001101010001110111100001101101100011111 doc2
and so on.
Now, given a dynamic input hash, I would like to find all the documents within a Hamming distance of <= 5 in an efficient way, without checking every document one by one.
There are a few solutions I found:
A) pre-filter the existing result set, as in Hamming Distance / Similarity searches in a database. I have not given this a go yet; it seems interesting to say the least, but I can't find any information on the internet about how efficient it would be.
B) use some kind of metric-space solution (this involves having another, separate structure to keep in sync, etc.)
For the purpose of this question, I'd like to narrow it down a bit further and ask whether it is possible to "exploit/hack" the geospatial indexes provided by MongoDB.
(https://docs.mongodb.com/manual/core/2dsphere/)
The geospatial indexes:
A) allow you to store GeoJSON objects (point, line, polygon)
B) query efficiently all the GeoJSON objects
C) support operations such as finding GeoJSON objects within a radius of a point, as well as intersections between GeoJSON objects
If I could find a way to map these 64-bit hashes to latitude/longitude pairs (or maybe to polygons) such that hashes with a small Hamming distance are grouped close to each other, then maybe the geospatial index could work: from this latitude/longitude point, give me all the binary strings within a radius of 5 (Hamming distance). Could that work?
The problem is that I have no idea if any of this is even feasible.
A really old question I found: https://groups.google.com/g/mongodb-user/c/lmlcugk2dFs?pli=1
Hamming distance, when applied to binary data, can be considered a directed graph problem.
For 2 bit values, the first bit is the x coordinate, the second is y, and the hamming distance between any two points is the number of sides that must be traversed to move from one to the other.
For 3 bit values, the third bit is the z coordinate, and the points are the vertices of a cube.
For 4 bits, that is a tesseract, and much harder to visualize.
For 64 bits, each value would be one of the vertices on a "unit cube" in 64 dimensions.
Each point would have 64 neighbors with a hamming distance of exactly 1.
One possibility is to trade a few extra gigabytes of storage for some performance in finding other points within the hamming distance.
Pre-calculate the hash values of the 64 immediate neighbors, regardless of whether they exist in the data set or not, and store those in an array in the document with the original hash. This might be quite a daunting task for already existing documents, but is a bit more manageable if done during the initial insert process.
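A rough sketch of that pre-calculation in the mongo shell, assuming the hashes are stored as 64-character binary strings (as in the question) and using the hashField / neighbors field names introduced below:
// Given a 64-character binary string, return the 64 hashes that differ
// from it in exactly one bit position.
function neighborHashes(hash) {
    var result = [];
    for (var i = 0; i < hash.length; i++) {
        var flipped = (hash[i] === "0") ? "1" : "0";
        result.push(hash.substring(0, i) + flipped + hash.substring(i + 1));
    }
    return result;
}

// Back-fill the neighbors array for existing documents (a one-off pass,
// which will be slow for millions of documents).
db.collectionName.find().forEach(function(doc) {
    db.collectionName.updateOne(
        { _id: doc._id },
        { $set: { neighbors: neighborHashes(doc.hashField) } }
    );
});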
You could then find all documents whose hashes are within a hamming distance of 5 using the $graphLookup aggregation stage.
If the hash is stored in a field named hashField and the hashes that are a distance of 1 are in a field named neighbors, that might look something like:
db.collectionName.aggregate([
    // select the document(s) whose neighborhood we want to explore
    {$match: {<match criteria to select starting hash>}},
    {$graphLookup: {
        from: "collectionName",
        startWith: "$neighbors",       // begin with the hashes at distance 1
        connectFromField: "neighbors",
        connectToField: "hashField",
        as: "closehashes",
        maxDepth: 4,                   // depth 0 is the immediate neighbors (distance 1), so depth 4 reaches distance 5
        depthField: "distance"         // 0-based recursion depth
    }}
])
This would benefit greatly from an index on {hashField: 1}.
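For example (a sketch, keeping the field name used above):
// $graphLookup repeatedly performs equality matches on connectToField (hashField)
db.collectionName.createIndex({ hashField: 1 })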
I have a table "ways" containing coordinate (lat/lon) values. Suppose I have a coordinate (x,y) and I want to find the closest match from the ways table. I have seen some similar questions, like this one: Is there a postgres CLOSEST operator?
However, in my case I have 2 columns instead of 1. Is there a query to do something like this?
You could store the data as PostGIS 'point' type instead of coordinate values:
https://postgis.net/docs/ST_MakePoint.html
https://postgis.net/docs/ST_GeomFromText.html
This would empower you with all the PostGIS functionality such as:
https://postgis.net/docs/ST_DWithin.html
Then you could create a GiST index and use the PostGIS <-> operator to take advantage of index assisted nearest neighbor result sets. The 'nearest neighbor' functionality is pretty common. Here is a great explanation:
https://postgis.net/workshops/postgis-intro/knn.html
“KNN” stands for “K nearest neighbours”, where “K” is the number of neighbours you are looking for.
KNN is a pure index based nearest neighbour search. By walking up and down the index, the search can find the nearest candidate geometries without using any magical search radius numbers, so the technique is suitable and high performance even for very large tables with highly variable data densities.
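For example, assuming the lat/lon values from the ways table have been converted into a geometry column named geom (the column names lon, lat and geom are placeholders, not from the question):
-- One-time setup: build a point geometry column and a GiST index
ALTER TABLE ways ADD COLUMN geom geometry(Point, 4326);
UPDATE ways SET geom = ST_SetSRID(ST_MakePoint(lon, lat), 4326);
CREATE INDEX ways_geom_gist ON ways USING GIST (geom);

-- Closest match to an input coordinate, using the KNN <-> operator;
-- replace 11.0 / 48.0 with the input longitude and latitude
SELECT *
FROM ways
ORDER BY geom <-> ST_SetSRID(ST_MakePoint(11.0, 48.0), 4326)
LIMIT 1;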
I have a PostgreSQL table that has a geometry type column, in which different simple polygons (possibly intersecting) are stored. The polygons are all areas within a city. I receive an input of a point (latitude-longitude pair) and need to find the list of polygons that contain the given point. What I have currently:
Unclustered GiST index defined on the polygon column.
Use ST_Contains(#param_Point, table.Polygon) on the whole table.
It is quite slow, so I am looking for a more performant in-memory alternative. I have the following ideas:
Maintain a dictionary of polygons in Redis, keyed by their geohash. Polygons with the same geohash would be saved as a list. When I receive the point, calculate its geohash and trim to a desired level. Then search in the Redis map, and keep trimming the point's geohash until I find the first result (or enough results).
Have a trie of geohashes loaded from the database. Update the trie periodically or by receiving update events. Calculate the point's geohash, search in the trie until I find enough results. I prefer this because the map may have long lists for a geohash, given the nature of the polygons.
Any other approaches?
I have read about libraries like GeoTrie and Polygon Geohasher but can't seem to integrate them with the database and the above ideas.
Any cues or starting points, please?
Have you tried using ST_Within? Not sure if it meets your criteria, but I believe it is meant to be faster than ST_Contains.
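A minimal sketch of that idea, assuming the polygons live in a table areas with a geometry column polygon (both names are placeholders) and the input point is given as longitude/latitude:
-- The existing GiST index lets the bounding-box filter run before the exact test
SELECT id
FROM areas
WHERE ST_Within(
    ST_SetSRID(ST_MakePoint(11.0, 48.0), 4326),  -- input point (lon, lat)
    polygon
);
-- Note: ST_Within(point, polygon) is equivalent to ST_Contains(polygon, point)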
I can understand how a B*Tree index works: you search through a tree.
But I can't understand how a sparse index or a dense index works.
For example, if a dense index needs to have each value mapped by a key, how is that going to benefit the search?
Adding more clarification:
This sparse/dense index refers to the indexes described here on Wikipedia:
https://en.wikipedia.org/wiki/Database_index#Sparse_index
From my understanding, the point of an index is that you can search through the B*Tree in O(log N) instead of scanning each block in O(N).
But from the description of either a sparse index or a dense index, I can't see how it benefits searching. You search through keys? But there are about as many keys as values, right? (For a dense index it's strictly equal.)
What I am guessing is that dense and sparse indexes are just the indexes used within a B*Tree, but I am not sure if I understand it correctly, since I can't find anything online to confirm my thought.
Block-level sparse index
A block-level sparse index will only be helpful for queries where the index is also clustered (i.e. sort order of the index represents the locality of data on disk). A block-level sparse index will have fewer values but still be useful to find the approximate location before starting a sequential scan. The sparseness in this case is effectively "index every nth value in a clustered index".
From a search point of view, a block-level sparse index query would:
find the largest key less than or equal to your indexed search criteria (normal B-tree Big O time complexity which is O(log N) for search)
use that key as the starting point for a sequential scan of the clustered index (O(N) for search)
The advantage of a sparse block-level index is mainly around size rather than speed: a smaller sparse index may fit into memory when a dense index (including all values) would not. Range-based queries on a clustered index are already going to return sequential results, so a sparse index may have some advantages as long as the index isn't too sparse to efficiently support common queries.
A clustered index including records with duplicate keys is effectively one example of a sparse index: there is no need to index the offset of each individual record with the same value because the logical order of the clustered index matches the physical order of the data.
For a worked example, see: Dense and Sparse Indices (sfu.ca).
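To make the two-step search above concrete, here is a toy sketch (the keys and block numbers are made up and not tied to any particular database's on-disk format):
// Toy sparse index: one entry per data block, holding the smallest key
// stored in that block.
var sparseIndex = [
    { key: 100, block: 0 },
    { key: 200, block: 1 },
    { key: 300, block: 2 },
    { key: 400, block: 3 }
];

// Step 1: binary search for the largest indexed key <= searchKey -- O(log N)
function findStartBlock(searchKey) {
    var lo = 0, hi = sparseIndex.length - 1, start = sparseIndex[0];
    while (lo <= hi) {
        var mid = (lo + hi) >> 1;
        if (sparseIndex[mid].key <= searchKey) {
            start = sparseIndex[mid];
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }
    return start.block;
}

// Step 2: sequentially scan the clustered data starting at findStartBlock(searchKey)
// until the search key is found or passed.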
MongoDB index with sparse option
Still I can't figure out how the sparse index works in MongoDB. For example, if you have N documents with field x not empty, then you will have N keys. Then how does the key help you in search?
A MongoDB index with the sparse option only contains entries for documents that have the indexed field. MongoDB has flexible schema so fields are not required to be present (or the same type) for all documents in a collection. Note: optional document validation is a feature of MongoDB 3.2+.
By default all documents in a collection will be included in an index, but those that do not have the indexed field present will store a null value. If all of your documents in a MongoDB collection have a value for the indexed field, there is no difference between a default index and one with the sparse option.
This is really a special case of a partial index: the sparseness refers to limiting the scope of the indexed values to only include non-null entries. The indexing approach is otherwise identical to a non-sparse index.
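For example, using the asker's field x, a sparse index is created like this:
// Only documents that actually contain field x get an entry in this index
db.collection.createIndex({ x: 1 }, { sparse: true })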
The MongoDB documentation calls this out with a note:
Do not confuse sparse indexes in MongoDB with block-level indexes in other databases. Think of them as dense indexes with a specific filter.
I am going to build a GIS system based on polygons, not just points. I want to use MongoDB or PostGIS.
How do I do this in MongoDB?
Query A - get the center of a polygon
Query B - distance between two polygons
Query C - list of polygons that are part of a third that I specify
Query D - polygons near (within a given distance of) a polygon
Does it support SRIDs?
MongoDB's geospatial indexing currently only indexes points. Although it does support proximity and bounds queries, documents are matched by a single point. You may be able to take advantage of multi-location documents and index multiple points along a polygon, which might support some of your queries with reduced precision; however, that would certainly not be ideal.
PostGIS seems more appropriate for your requirements.
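For comparison, rough PostGIS sketches of queries A to D might look like the following (the areas table, geom column, and ids are placeholders; distances on geometry columns are in the units of the SRID):
-- A: center (centroid) of a polygon
SELECT ST_Centroid(geom) FROM areas WHERE id = 1;

-- B: distance between two polygons
SELECT ST_Distance(a.geom, b.geom)
FROM areas a, areas b
WHERE a.id = 1 AND b.id = 2;

-- C: polygons entirely contained within a given polygon
SELECT p.id
FROM areas p, areas parent
WHERE parent.id = 3 AND ST_Within(p.geom, parent.geom);

-- D: polygons within a given distance of a polygon
SELECT p.id
FROM areas p, areas target
WHERE target.id = 1 AND p.id <> target.id
  AND ST_DWithin(p.geom, target.geom, 1000);

-- SRIDs are supported; e.g. ST_Transform(geom, 3857) reprojects a geometry.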