How does the Top-K sort algorithm work in MongoDB?

Based on the answer and on the MongoDB documentation, I understand that MongoDB can sort a large data set and return sorted results when limit() is used.
However, querying the same data set with sort() alone results in a memory exception.
In the second answer to the above post, the poster mentions that the whole collection is scanned and sorted, and the top N results are returned. I would like to know how the collection is sorted when I use limit().
From the documentation I found that when limit() is used MongoDB does a Top-K sort, but there is not much explanation of it available anywhere. I would like to see any references on the Top-K sort algorithm.

In general, you can do an efficient top-K sort with a min-heap of size K. The min-heap represents the largest K elements seen so far in the data set. It also gives you constant-time access to the smallest element of those top K elements.
As you scan over the data set, if a given element is larger than the smallest element in the min-heap (i.e. the smallest of the largest top K so far), you replace the smallest from the min-heap with that element and re-heapify (O(lg K)).
At the end, you're left with the top K elements of the entire data set, without having had to sort them all (worst-case running time is O(N lg K)), using only Θ(K) memory.
I actually learnt this in school for a change :-)
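The scan described above can be sketched in a few lines. This is a generic illustration of the min-heap technique, not MongoDB's actual implementation; the helper names (siftUp, siftDown, topK) are made up for this example and the values are assumed to be plain numbers.

```javascript
// Restore the min-heap property by moving heap[i] up toward the root.
function siftUp(heap, i) {
  while (i > 0) {
    const parent = (i - 1) >> 1;
    if (heap[parent] <= heap[i]) return;
    [heap[i], heap[parent]] = [heap[parent], heap[i]];
    i = parent;
  }
}

// Restore the min-heap property by moving heap[i] down toward the leaves.
function siftDown(heap, i) {
  const n = heap.length;
  while (true) {
    const left = 2 * i + 1;
    const right = left + 1;
    let smallest = i;
    if (left < n && heap[left] < heap[smallest]) smallest = left;
    if (right < n && heap[right] < heap[smallest]) smallest = right;
    if (smallest === i) return;
    [heap[i], heap[smallest]] = [heap[smallest], heap[i]];
    i = smallest;
  }
}

// Return the k largest values in descending order, holding at most k values in memory.
function topK(values, k) {
  const heap = [];
  if (k <= 0) return heap;
  for (const v of values) {
    if (heap.length < k) {
      // Still filling the heap with the first k elements.
      heap.push(v);
      siftUp(heap, heap.length - 1);
    } else if (v > heap[0]) {
      // v beats the smallest of the current top k: replace the root and re-heapify.
      heap[0] = v;
      siftDown(heap, 0);
    }
  }
  // Only the k survivors are sorted at the end, which costs O(K lg K).
  return heap.sort((a, b) => b - a);
}
```

For example, topK([5, 1, 9, 3, 7, 8, 2], 3) returns [9, 8, 7]; the heap never grows beyond three entries during the scan.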

Related

Unusual hash map implementation

Does anyone know the original hash table implementation?
Every implementation I've found is based on separate chaining or open addressing methods.
Chaining, by Hans Peter Luhn, in 1953.
https://en.wikipedia.org/wiki/Hash_table#History
The first implementation, though not necessarily the most common one, is probably the one that uses an array (resized as needed) in which each entry points to a list of elements.
The hash code, computed mod the size of the array, gives the integer index of the entry whose list holds the element being searched for. When hash codes collide, the elements accumulate in the list of that entry.
So, once the hash code is computed, accessing the array entry is O(1), and searching the list for the element by checking actual equality is O(N), where N is the length of that list. N must be kept low for obvious performance reasons.
When collisions become too frequent, we resize the array by increasing the number of entries, which reduces the collisions accordingly. This works because the hash code is now taken mod a larger number than before.
Some more complicated implementations convert the lists to trees if they become too long, so that the equality search improves from O(N) to O(log(N)).
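A minimal sketch of the array-of-lists scheme described above, assuming string-like keys. The class name ChainedHashMap, the toy hash function and the 0.75 load factor are all choices made for this illustration, not part of any historical implementation.

```javascript
class ChainedHashMap {
  constructor(capacity = 8) {
    // One list (bucket) per array entry.
    this.buckets = Array.from({ length: capacity }, () => []);
    this.size = 0;
  }

  // Toy polynomial string hash, kept in unsigned 32-bit range (illustrative only).
  hash(key) {
    let h = 0;
    for (const ch of String(key)) {
      h = (h * 31 + ch.codePointAt(0)) >>> 0;
    }
    return h;
  }

  // Hash code mod the array size gives the bucket index.
  indexFor(key) {
    return this.hash(key) % this.buckets.length;
  }

  get(key) {
    // O(1) to reach the bucket, then a linear equality search inside it.
    for (const entry of this.buckets[this.indexFor(key)]) {
      if (entry[0] === key) return entry[1];
    }
    return undefined;
  }

  set(key, value) {
    const bucket = this.buckets[this.indexFor(key)];
    for (const entry of bucket) {
      if (entry[0] === key) {
        entry[1] = value;
        return;
      }
    }
    bucket.push([key, value]);
    this.size += 1;
    // Grow when the table gets crowded so the lists stay short.
    if (this.size > this.buckets.length * 0.75) {
      this.resize(this.buckets.length * 2);
    }
  }

  resize(newCapacity) {
    const old = this.buckets;
    this.buckets = Array.from({ length: newCapacity }, () => []);
    // Re-insert every entry, because the index depends on the new array size.
    for (const bucket of old) {
      for (const entry of bucket) {
        this.buckets[this.indexFor(entry[0])].push(entry);
      }
    }
  }
}
```

Inserting a few dozen keys with set() and reading them back with get() shows the bucket array growing while lookups keep working.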

Why are bloom filters not implemented like count-min sketch?

So I only recently learned about these, but from what I understood, counting bloom filters are very similar to count-min sketches. The difference is that the former use a single array for all hash functions and the latter use an array per hash function.
If using separate arrays for each hash function results in fewer collisions and reduces false positives, why are counting bloom filters not implemented as such?
Though both are space-efficient probabilistic data structures, Bloom filters and count-min sketches solve different use cases.
A Bloom filter is used to test whether an element is a member of a set. It gives a boolean answer that may be a false positive: it might report that a given element is present when it actually is not. See here for working details: https://www.geeksforgeeks.org/bloom-filters-introduction-and-python-implementation/
A count-min sketch keeps track of counts, i.e. how many times an element is present in a set. See here for working details: https://www.geeksforgeeks.org/count-min-sketch-in-java-with-examples/
I would like to add to @roottraveller's answer and try to answer the OP's question. First, I find the following resource really helpful for understanding the basic difference between a Bloom Filter, a Counting Bloom Filter and a Count-min Sketch: https://octo.vmware.com/bloom-filter/
As can be found in the document:
Bloom Filter is used to test whether an element is a member of a set or not
Count-min-sketch is a probabilistic data structure that serves as a frequency table of events in a stream of data
Counting Bloom Filter is an extension of the Bloom filter that allows deletion of elements by storing the frequency of occurrence
So, in short, a Counting Bloom Filter only supports deletion of elements and cannot return the frequency of elements; only a CM sketch can return frequencies. And, to answer the OP's question: sketches are a family of probabilistic data structures that deal with data streams with efficient space and time complexity, and they have always been constructed using an array per hash function. (https://www.sciencedirect.com/science/article/abs/pii/S0196677403001913)
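To make the "array per hash function" layout concrete, here is a small count-min sketch. It is a sketch under stated assumptions: the class name, the default depth and width, and the seeded FNV-1a-style hash are illustrative choices, and the hash rows are not guaranteed to be pairwise independent as the theory requires.

```javascript
class CountMinSketch {
  constructor(depth = 4, width = 64) {
    this.depth = depth;
    this.width = width;
    // One counter array per hash function (row).
    this.rows = Array.from({ length: depth }, () => new Uint32Array(width));
  }

  // Seeded FNV-1a-style string hash; the seed selects the row's hash function.
  hash(item, seed) {
    let h = (2166136261 ^ seed) >>> 0;
    for (const ch of String(item)) {
      h ^= ch.codePointAt(0);
      h = Math.imul(h, 16777619) >>> 0;
    }
    return h % this.width;
  }

  add(item, count = 1) {
    // Increment one counter in every row.
    for (let r = 0; r < this.depth; r++) {
      this.rows[r][this.hash(item, r + 1)] += count;
    }
  }

  estimate(item) {
    // Collisions can only inflate counters, so the minimum is the best estimate.
    let min = Infinity;
    for (let r = 0; r < this.depth; r++) {
      min = Math.min(min, this.rows[r][this.hash(item, r + 1)]);
    }
    return min;
  }
}
```

The estimate never undercounts: it equals the true count when at least one row is free of collisions for that item, and overestimates otherwise. A counting Bloom filter would instead index a single shared counter array with all of its hash functions.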

Hamming distance using geospatial indexes on MongoDB?

I have millions of documents stored in MongoDB, each one having a 64-bit hash.
As an example:
0011101001110001001101110000101011010101101111101110110101011001 doc1
0111100111000011011011100001101010001110111100001101101100011111 doc2
and so on.
Now I would like to efficiently find all the documents whose hash is within a Hamming distance of 5 (<= 5) of a given input hash, where the input is dynamic, without checking all the documents one by one.
There are a few solutions I found:
A) Pre-filter the existing result set (see "Hamming Distance / Similarity searches in a database"). I have not given this a go yet. It seems interesting, to say the least, but I can't find any information on the internet about how efficient it would be.
B) Use some kind of metric-space solution (this involves keeping another, separate structure in sync, etc.)
For the purpose of this question, I'd like to narrow it down a bit further and know whether it is possible to "exploit/hack" the geospatial indexes that MongoDB provides.
(https://docs.mongodb.com/manual/core/2dsphere/)
The geospatial indexes:
A) allow you to store GeoJSON objects (point, line, polygon)
B) let you query GeoJSON objects efficiently
C) support operations such as finding GeoJSON objects within a radius of a point, as well as GeoJSON intersection between objects
Suppose I could find a way to map these 64-bit hashes to latitude/longitude (or maybe to polygons) such that similar hashes (by Hamming distance) end up closer to each other. Then the geospatial index might work well: I could say "from this latitude and longitude point, give me all the binary strings within a radius of 5 (Hamming distance)". Could that work?
The problem is that I have no idea whether any of this is even feasible.
A really old question I found: https://groups.google.com/g/mongodb-user/c/lmlcugk2dFs?pli=1
Hamming distance, when applied to binary data, can be considered a directed graph problem.
For 2 bit values, the first bit is the x coordinate, the second is y, and the hamming distance between any two points is the number of sides that must be traversed to move from one to the other.
For 3 bit values, the third bit is the z coordinate, and the points are the vertices of a cube.
For 4 bits, that is a tesseract, and much harder to visualize.
For 64 bits, each value would be one of the vertices on a "unit cube" in 64 dimensions.
Each point would have 64 neighbors with a hamming distance of exactly 1.
One possibility is to trade a few extra gigabytes of storage for some performance in finding other points within the hamming distance.
Pre-calculate the hash values of the 64 immediate neighbors, regardless of whether they exist in the data set or not, and store those in an array in the document with the original hash. This might be quite a daunting task for already existing documents, but is a bit more manageable if done during the initial insert process.
You could then find all documents whose hashes are within a hamming distance of 5 using the $graphLookup aggregation stage.
If the hash is stored in a field named hashField and the hashes that are a distance of 1 are in a field named neighbors, that might look something like:
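db.collectionName.aggregate([
  // Select the document that holds the starting hash.
  {$match: {<match criteria to select starting hash>}},
  {$graphLookup: {
    from: "collectionName",
    startWith: "$neighbors",
    connectFromField: "neighbors",
    connectToField: "hashField",
    as: "closehashes",
    // Depth 0 already matches hashes at distance 1, so depth 4 reaches distance 5.
    maxDepth: 4,
    // Recursion depth at which each hash was found (number of hops minus 1).
    depthField: "distance"
  }}
])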
This would benefit greatly from an index on {hashField: 1}.
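The pre-calculation of the 64 immediate neighbors mentioned above could look like the following sketch. It assumes the hash is held as an unsigned 64-bit BigInt in application code; the function names are hypothetical, and how the values are then stored in the neighbors array (for example as strings or 64-bit integers) is left open.

```javascript
// Flip each of the 64 bits once to get every hash at Hamming distance exactly 1.
function neighbors(hash) {
  const result = [];
  for (let bit = 0n; bit < 64n; bit++) {
    result.push(hash ^ (1n << bit));
  }
  return result;
}

// Count the differing bits between two non-negative BigInt hashes.
function hammingDistance(a, b) {
  let x = a ^ b;
  let count = 0;
  while (x > 0n) {
    count += Number(x & 1n);
    x >>= 1n;
  }
  return count;
}
```

A binary string such as the ones in the question can be parsed with BigInt("0b" + bits) before calling neighbors(), and hammingDistance() can be used to double-check any candidate returned by the query.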

quick sort is slower than merge sort

I think quick sort is less efficient when sorting an array with duplicate data, right? When the data type is char, the bigger the array (over 100000 elements), the closer it gets to n^2 order.
Also, assuming there is no duplicate data, to get the best case of a quick sort that uses the first element as the pivot, I think we can take an already sorted array and recursively swap the first and middle elements, dividing it like a merge sort. Is that right? Is there a general best case?
Lomuto partition scheme, which scans from one end to the other during partition, is slower with duplicates. If all the values are the same, then each partition step splits it into sizes 1 and n-1, a worst case scenario.
Hoare partition scheme, which scans from both ends towards each other until the indexes (or iterators or pointers) cross, is usually faster with duplicates. Even though duplicates result in more swaps, each swap happens just after the two values have been read and compared to the pivot, so they are still in the cache for the swap (assuming object size is not huge). As the number of duplicates increases, the splitting improves towards the ideal case where each partition step splits the data into two equal halves. I ran a benchmark sorting 16 million 64 bit integers: with random data, it took about 1.37 seconds, improving with duplicates, and with all values the same, it took about 0.288 seconds.
Another alternative is a 3 way partition, which splits a partition into elements < pivot, elements == pivot, elements > pivot. If all the elements are the same, it's done in O(n) time. For n elements with only k possible values, the time complexity is O(n ⌈log3(k)⌉), and if k is constant, the time complexity is still O(n).
Wiki links:
https://en.wikipedia.org/wiki/Quicksort#Repeated_elements
https://en.wikipedia.org/wiki/Dutch_national_flag_problem
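A minimal sketch of the 3 way partition idea (the Dutch national flag scheme from the links above) applied to quicksort. The function name quicksort3 and the choice of the middle element as pivot are assumptions made for this example, not the code used in the benchmark mentioned earlier.

```javascript
// In-place quicksort with a 3-way partition: < pivot | == pivot | > pivot.
function quicksort3(arr, lo = 0, hi = arr.length - 1) {
  if (lo >= hi) return arr;
  const pivot = arr[lo + ((hi - lo) >> 1)];
  let lt = lo; // arr[lo..lt-1] holds elements < pivot
  let i = lo;  // arr[lt..i-1] holds elements == pivot
  let gt = hi; // arr[gt+1..hi] holds elements > pivot
  while (i <= gt) {
    if (arr[i] < pivot) {
      [arr[lt], arr[i]] = [arr[i], arr[lt]];
      lt++;
      i++;
    } else if (arr[i] > pivot) {
      [arr[i], arr[gt]] = [arr[gt], arr[i]];
      gt--; // do not advance i: the swapped-in element is still unexamined
    } else {
      i++;
    }
  }
  // Elements equal to the pivot are already in place and are never revisited.
  quicksort3(arr, lo, lt - 1);
  quicksort3(arr, gt + 1, hi);
  return arr;
}
```

With an input where every value is the same, the first partition pass leaves both recursive calls empty, which is the O(n) case described above.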

What element of the array would be the median if the size of the array was even and not odd?

I read that it's possible to make quicksort run in O(n log n).
The algorithm says to choose the median as the pivot on each step,
but, suppose we have this array:
10 8 39 2 9 20
which value will be the median?
In math, if I remember correctly, the median is (39+2)/2 = 41/2 = 20.5
I don't have a 20.5 in my array though
thanks in advance
You can choose either of them; as the input size grows, the difference does not matter.
We're talking about the exact wording of the description of an algorithm here, and I don't have the text you're referring to. But I think in context by "median" they probably meant, not the mathematical median of the values in the list, but rather the middle point in the list, i.e. the median INDEX, which in this case would be 3 or 4 (counting from 1). As coffNjava says, you can take either one.
The median is actually found by sorting the array first, so in your example, the median is found by arranging the numbers as 2 8 9 10 20 39 and the median would be the mean of the two middle elements, (9+10)/2 = 9.5, which doesn't help you at all. Using the median is sort of an ideal situation, but would work if the array were at least already partially sorted, I think.
With an even-length array, you can't find an exact pivot point, so I believe you can use either of the middle numbers. It'll throw off the efficiency a bit, but not substantially unless you always ended up sorting even-length arrays.
Finding the median of an unsorted set of numbers can be done in O(N) time, but it's not really necessary to find the true median for the purposes of quicksort's pivot. You just need to find a pivot that's reasonable.
As the Wikipedia entry for quicksort says:
In very early versions of quicksort, the leftmost element of the partition would often be chosen as the pivot element. Unfortunately, this causes worst-case behavior on already sorted arrays, which is a rather common use-case. The problem was easily solved by choosing either a random index for the pivot, choosing the middle index of the partition or (especially for longer partitions) choosing the median of the first, middle and last element of the partition for the pivot (as recommended by R. Sedgewick).
Finding the median of three values is much easier than finding it for the whole collection of values, and for collections that have an even number of elements, it doesn't really matter which of the two 'middle' elements you choose as the potential pivot.
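As a rough illustration of the median-of-three idea, the helper below picks the pivot index from the first, middle and last elements of a partition. The function name is hypothetical, and for an even-length range it simply uses the lower of the two middle positions, which is one of the two equally acceptable choices.

```javascript
// Return the index (lo, mid or hi) holding the median of arr[lo], arr[mid], arr[hi].
function medianOfThreeIndex(arr, lo, hi) {
  // For an even-length range this is the lower of the two middle positions.
  const mid = lo + ((hi - lo) >> 1);
  const a = arr[lo];
  const b = arr[mid];
  const c = arr[hi];
  if ((a <= b && b <= c) || (c <= b && b <= a)) return mid;
  if ((b <= a && a <= c) || (c <= a && a <= b)) return lo;
  return hi;
}
```

For the array from the question, 10 8 39 2 9 20, the three candidates are 10, 39 and 20, so medianOfThreeIndex(arr, 0, 5) returns 5 and the pivot is 20.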