Hamming distance using geospatial indexes on MongoDb? - mongodb

I have millions of documents stored in MongoDb, each one having 64 bit hash.
As an example:
0011101001110001001101110000101011010101101111101110110101011001 doc1
0111100111000011011011100001101010001110111100001101101100011111 doc2
and so on.
Now I would like to find all the documents that have hamming distance <= 5 in an efficient way, given the input that is dynamic, without querying all the results one by one.
There are few solutions I found:
A) pre filter the existing result set Hamming Distance / Similarity searches in a database have not given this go yet, seems interesting to say the least, but can't find any information in the internet how efficient this will be.
B) use some kind of metric-space solution (this involves having another separate structure to keep things in sync etc)
For the purpose of this question, I'd like to narrow it down a bit further, and know if it is possible to "exploit/hack" mongodb provided geospatial indexes.
(https://docs.mongodb.com/manual/core/2dsphere/)
The geospatial indexes:
A) allow you to store GeoJSON objects (point, line, polygon)
B) query efficiently all the GeoJSON objects
C) support operations such as finding geojson objects with radius+point, as well geojson intersection between objects
If I could find a way how to map these 64bit hashes to latitude/longitude (OR maybe into polygons) in such way that similar hashes (hamming distance) are grouped more closer to each other, the geospatial index could work well maybe if I say: from this latitude and longitude point, give me all the binary strings in the radius of 5 (hamming distance), it could work?
the problem is I have no idea if any of this is even feasible.
really old question I found: https://groups.google.com/g/mongodb-user/c/lmlcugk2dFs?pli=1

Hamming distance, when applied to binary data, can be considered a directed graph problem.
For 2 bit values, the first bit is the x coordinate, the second is y, and the hamming distance between any two points is the number of sides that must be traversed to move from one to the other.
For 3 bit values, the third bit is the z coordinate, and the points are the vertices of a cube.
For 4 bits, that is a tesseract, and much harder to visualize.
For 64 bits, each value would be one of the vertices on a "unit cube" in 64 dimensions.
Each point would have 64 neighbors with a hamming distance of exactly 1.
One possibibility is to trade a few extra gigabytes of storage for some performance in finding other points within the hamming distance.
Pre-calculate the hash values of the 64 immediate neighbors, regardless of whether they exist in the data set or not, and store those in an array in the document with the original hash. This might be quite a daunting task for already existing documents, but is a bit more manageable if done during the initial insert process.
You could then find all documents whose hashes are within a hamming distance of 5 using the $graphLookup aggregation stage.
If the hash is stored in a field named hashField and the hashes that are a distance of 1 are in a field named neighbors, that might look something like:
db.collectionName.aggregate([
{$match: {<match criteria to select starting hash>}},
{$graphLookup: {
from: "collectionName",
startsWith: "$neighbors",
connectFromField: "neighbors",
connectToField: "hashField",
as: "closehashes",
maxDepth: 5,
depthField: "distance"
}}
])
This would benefit greatly from an index on {hashField: 1}.

Related

Find closest match from 2 columns in postgresql

I have a table "ways" containing coordinates (lat/lon) values. Suppose I have a coordinate (x,y) and I want to check the closest match from the table ways. I have seen some similar questions like this: Is there a postgres CLOSEST operator?
However in my case I have 2 columns instead of 1. Is there a query to do something like this this ?
You could store the data as PostGIS 'point' type instead of coordinate values:
https://postgis.net/docs/ST_MakePoint.html
https://postgis.net/docs/ST_GeomFromText.html
This would empower you with all the PostGIS functionality such as:
https://postgis.net/docs/ST_DWithin.html
Then you could create a GiST index and use the PostGIS <-> operator to take advantage of index assisted nearest neighbor result sets. The 'nearest neighbor' functionality is pretty common. Here is a great explanation:
https://postgis.net/workshops/postgis-intro/knn.html
“KNN” stands for “K nearest neighbours”, where “K” is the number of neighbours you are looking for.
KNN is a pure index based nearest neighbour search. By walking up and down the index, the search can find the nearest candidate geometries without using any magical search radius numbers, so the technique is suitable and high performance even for very large tables with highly variable data densities.

How does Top-K sort algorithm work in MongoDB

Based on the answer and from MongoDB Documentation, I understood that MongoDB is able to sort a large data set and provide sorted results when limit() is used.
However, when the same data set is queried using sort() results into a memory exception.
From the second answer in the above post, poster mentions that whole collection is scanned, sorted and top N results are returned. I would like to know how the collection is sorted when I use limit().
From document I found that when limit() is used it does Top-K sort, however there is not much explanation available about it anywhere. I would like to see any references about Top-K Sort algorithm.
In general, you can do an efficient top-K sort with a min-heap of size K. The min-heap represents the largest K elements seen so far in the data set. It also gives you constant-time access to the smallest element of those top K elements.
As you scan over the data set, if a given element is larger than the smallest element in the min-heap (i.e. the smallest of the largest top K so far), you replace the smallest from the min-heap with that element and re-heapify (O(lg K)).
At the end, you're left with the top K elements of the entire data set, without having had to sort them all (worst-case running time is O(N lg K)), using only Θ(K) memory.
I actually learnt this in school for a change :-)

Number of buckets in LSH

In LSH, you hash slices of the documents into buckets. The idea is that these documents that fell into the same buckets will be potentially similar, thus a nearest neighbor, possibly.
For 40.000 documents, what is a good value (pretty much) for the number of buckets?
I have it as: number_of_buckets = 40.000/4 now, but I feel it can be reduced more.
Any ideas, please?
Relative: How to hash vectors into buckets in Locality Sensitive Hashing (using jaccard distance)?
A common starting point is to use sqrt(n) buckets for n documents. You can try doubling and halving that and run some analysis to see what kind of document distributions you got. Naturally any other exponent can be tried as well, and even K * log(n) if you expect that the number of distinct clusters grows "slowly".
I don't think this is an exact science yet, belongs on the similar topic as choosing the optimal k for k-means clustering.
I think it should be at least n. If it is less than that, let's say n/2, you ensure that for all bands, each document will have at least 1 possible similar document on average, due to collisions. So, your complexity when calculating the similarities will be at least O(n).
On the other hand, you will have to pass through the buckets at least K times, so that is O(K*B), being B your buckets. I believe the latter is faster, because it is just iterating over your data structure (namely a Dictionary of some kind) and counting the number of documents that hashed to each bucket.

Is Mongodb geohaystack the same with standard mongodb spatial index?

It seems that mongodb has 2 types of geospatial index.
http://www.mongodb.org/display/DOCS/Geospatial+Indexing
The standard one. With a note:
You may only have 1 geospatial index per collection, for now. While
MongoDB may allow to create multiple indexes, this behavior is
unsupported. Because MongoDB can only use one index to support a
single query, in most cases, having multiple geo indexes will produce
undesirable behavior.
And then there is this so called geohaystack thingy.
http://www.mongodb.org/display/DOCS/Geospatial+Haystack+Indexing
They both claim to use the same algorithm. They both turn earth into several grids. And then search based on that.
So what's the different?
Mongodb doesn't seem to use Rtree and stuff right?
NB: Answer to this question that How does MongoDB implement it's spatial indexes? says that 2d index use geohash too.
The implementation is similar, but the use case difference is described on the Geospatial Haystack Indexing page.
The haystack indices are "bucket-based" (aka "quadrant") searches tuned for small-region longitude/latitude searches:
In addition to ordinary 2d geospatial indices, mongodb supports the use
of bucket-based geospatial indexes. Called "Haystack indexing", these
indices can accelerate small-region type longitude / latitude queries
when additional criteria is also required.
For example, "find all restaurants within 25 miles with name 'foo'".
Haystack indices allow you to tune your bucket size to the distribution
of your data, so that in general you search only very small regions of
2d space for a particular kind of document. They are not suited for
finding the closest documents to a particular location, when the
closest documents are far away compared to bucket size.
The bucketSize parameter is required, and determines the granularity of the haystack index.
So, for example:
db.places.ensureIndex({ pos : "geoHaystack", type : 1 }, { bucketSize : 1 })
This example bucketSize of 1 creates an index where keys within 1 unit of longitude or latitude are stored in the same bucket. An additional category can also be included in the index, which means that information will be looked up at the same time as finding the location details.
The B-tree representation would be similar to:
{ loc: "x,y", category: z }
If your use case typically searches for "nearby" locations (i.e. "restaurants within 25 miles") a haystack index can be more efficient. The matches for the additional indexed field (eg. category) can be found and counted within each bucket.
If, instead, you are searching for "nearest restaurant" and would like to return results regardless of distance, a normal 2d index will be more efficient.
There are currently (as of MongoDB 2.2.0) a few limitations on haystack indexes:
only one additional field can be included in the haystack index
the additional index field has to be a single value, not an array
null long/lat values are not supported
Note: distance between degrees of latitude will vary greatly (longitude, less so). See: What is the distance between a degree of latitude and longitude?.

What element of the array would be the median if the the size of the array was even and not odd?

I read that it's possible to make quicksort run at O(nlogn)
the algorithm says on each step choose the median as a pivot
but, suppose we have this array:
10 8 39 2 9 20
which value will be the median?
In math if I remember correct the median is (39+2)/2 = 41/2 = 20.5
I don't have a 20.5 in my array though
thanks in advance
You can choose either of them; if you consider the input as a limit, it does not matter as it scales up.
We're talking about the exact wording of the description of an algorithm here, and I don't have the text you're referring to. But I think in context by "median" they probably meant, not the mathematical median of the values in the list, but rather the middle point in the list, i.e. the median INDEX, which in this cade would be 3 or 4. As coffNjava says, you can take either one.
The median is actually found by sorting the array first, so in your example, the median is found by arranging the numbers as 2 8 9 10 20 39 and the median would be the mean of the two middle elements, (9+10)/2 = 9.5, which doesn't help you at all. Using the median is sort of an ideal situation, but would work if the array were at least already partially sorted, I think.
With an even numbered array, you can't find an exact pivot point, so I believe you can use either of the middle numbers. It'll throw off the efficiency a bit, but not substantially unless you always ended up sorting even arrays.
Finding the median of an unsorted set of numbers can be done in O(N) time, but it's not really necessary to find the true median for the purposes of quicksort's pivot. You just need to find a pivot that's reasonable.
As the Wikipedia entry for quicksort says:
In very early versions of quicksort, the leftmost element of the partition would often be chosen as the pivot element. Unfortunately, this causes worst-case behavior on already sorted arrays, which is a rather common use-case. The problem was easily solved by choosing either a random index for the pivot, choosing the middle index of the partition or (especially for longer partitions) choosing the median of the first, middle and last element of the partition for the pivot (as recommended by R. Sedgewick).
Finding the median of three values is much easier than finding it for the whole collection of values, and for collections that have an even number of elements, it doesn't really matter which of the two 'middle' elements you choose as the potential pivot.