Sparse/dense indexes and how do they work? - mongodb

I can understand how a B*Tree index works by searching through a tree.
But I can't understand how a sparse index or a dense index works.
For example, if a dense index needs to have each value mapped by a key, how does that benefit the search?
Adding more clarification:
These sparse/dense indexes refer to the indexes described here on Wikipedia:
https://en.wikipedia.org/wiki/Database_index#Sparse_index
From my understanding, the point of an index is that you can search through the B*Tree in O(log N) instead of scanning each block in O(N).
But from the description of either the sparse index or the dense index, I can't see how it benefits searching. You search through keys? But there are as many keys as there are values, right? (For a dense index it's strictly equal.)
My guess is that the dense index and the sparse index are just the index used in a B*Tree, but I am not sure I understand it correctly, since I can't find anything online to confirm my thought.

Block-level sparse index
A block-level sparse index will only be helpful for queries where the index is also clustered (i.e. sort order of the index represents the locality of data on disk). A block-level sparse index will have fewer values but still be useful to find the approximate location before starting a sequential scan. The sparseness in this case is effectively "index every nth value in a clustered index".
From a search point of view, a block-level sparse index query would:
find the largest key less than or equal to your indexed search criteria (normal B-tree search complexity, O(log N))
use that key as the starting point for a sequential scan of the clustered index (O(N) for search)
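A rough sketch of those two steps in plain JavaScript (the in-memory structures here are purely illustrative, not how any particular storage engine lays out its data):

// Assumed structures (illustration only):
//   sparseIndex: array of [firstKeyInBlock, blockNo] pairs, sorted by key (one entry per block)
//   blocks:      array of blocks; each block is an array of records sorted by key (clustered)
function sparseLookup(sparseIndex, blocks, searchKey) {
  // Step 1: binary search for the largest indexed key <= searchKey -- O(log number-of-blocks)
  let lo = 0, hi = sparseIndex.length - 1, startBlock = 0;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (sparseIndex[mid][0] <= searchKey) {
      startBlock = sparseIndex[mid][1];
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  // Step 2: sequential scan of the clustered data from that block onward
  for (let b = startBlock; b < blocks.length; b++) {
    for (const record of blocks[b]) {
      if (record.key === searchKey) return record;
      if (record.key > searchKey) return null; // clustered order: we've gone past it
    }
  }
  return null;
}

With one index entry per block instead of per record, the index stays much smaller while still letting you skip straight to the right neighbourhood of the data before scanning.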
The advantage of a sparse block-level index is mainly around size rather than speed: a smaller sparse index may fit into memory when a dense index (including all values) would not. Range-based queries on a clustered index are already going to return sequential results, so a sparse index may have some advantages as long as the index isn't too sparse to efficiently support common queries.
A clustered index including records with duplicate keys is effectively one example of a sparse index: there is no need to index the offset of each individual record with the same value because the logical order of the clustered index matches the physical order of the data.
For a worked example, see: Dense and Sparse Indices (sfu.ca).
MongoDB index with sparse option
Still, I can't figure out how the sparse index works in MongoDB. For example, if you have N values with field x not empty, then you will have N keys. How do those keys help the search?
A MongoDB index with the sparse option only contains entries for documents that have the indexed field. MongoDB has a flexible schema, so fields are not required to be present (or of the same type) in all documents in a collection. Note: optional document validation is a feature of MongoDB 3.2+.
By default all documents in a collection will be included in an index, but those that do not have the indexed field present will store a null value. If all of your documents in a MongoDB collection have a value for the indexed field, there is no difference between a default index and one with the sparse option.
This is really a special case of a partial index: the sparseness refers to limiting the scope of the indexed values to only include non-null entries. The indexing approach is otherwise identical to a non-sparse index.
The MongoDB documentation calls this out with a note:
Do not confuse sparse indexes in MongoDB with block-level indexes in other databases. Think of them as dense indexes with a specific filter.
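For example, in the mongo shell (collection and field names here are just placeholders):

// Sparse index: only documents that actually contain "score" get an index entry
db.scores.createIndex({ score: 1 }, { sparse: true })

// MongoDB 3.2+ partial index expressing a similar filter explicitly
db.scores.createIndex(
  { score: 1 },
  { partialFilterExpression: { score: { $exists: true } } }
)

With either definition, documents without a score field simply have no index entries.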

Related

Does the size of the value affect the size of the index in MongoDB?

I have a set of IDs which are numbers anywhere between 8 and 11 digits long, and there are only 300K of them (so not exactly sequential etc.). These are stored in collection A.
I have a collection B with millions of entries in which every entry has an array of these IDs, and every array could have thousands of these IDs. I need to index this field too (i.e. hundreds of millions, potentially up to a billion+ entries). When I indexed it, the index turned out to be very large, way over the RAM size of the cluster.
Would it be worth trying to compress the value of each ID down from 8-11 numbers into some small alphanumeric encoded string? Or simply re-number them from e.g. 1 - 300,000 sequentially (and maintain a mapping of this)? Would that have a significant impact on the index size, or is it not worth the effort?
The size of your indexed field affects the size of the index. You can run the collStats command to check the size of the index and compare the size of your indexed field with the total size that MongoDB needs for the index.
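For example, in the mongo shell (the collection name is a placeholder):

// Per-index sizes in bytes, as reported by collStats
db.collectionB.stats().indexSizes

// Or run the command directly
db.runCommand({ collStats: "collectionB" }).indexSizes

Comparing indexSizes before and after changing how the IDs are stored is the most direct way to see whether an encoding change is worth it.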
MongoDB already performs some compression on indexes, so encoding your field as a shorter alphanumeric string will probably bring no benefit, or only a marginal one.
Using a smaller numeric type will save a small amount of space in your index, but if you need to maintain a mapping it's probably not worth the effort and will overcomplicate things.
The index for a collection with 300K elements, indexing only the 8-11 digit ID, should be small: on the order of a few MB. So it's very unlikely that you will have storage or memory issues with that index.
Regarding your second collection, shaving a few bytes off each ID does reduce the size of the index.
For example, reducing each ID from 8 bytes to 4 bytes across roughly 1 billion entries saves about 4 GB of raw key data (before compression).
Saving a few GB in both the index and collection B could be an interesting gain, so depending on your needs it may be worth the effort to store the IDs in the smallest type possible. However, you may still run into memory issues (now, or in the near future if the collections keep growing) because the index does not fit in memory, so sharding the collection could be a good option.
You can create a hashed index, which will give more or less the same performance if you are mainly interested in saving index size.
You can check with some sample data what percentage of index size you save and what the performance impact is, and decide based on that.

How are hash tables linear on same or colliding values?

I was looking at this StackOverflow answer to understand hashing better and saw the following (regarding the fact that we would need to get bucket size in constant time):
if you use something like linear probing or double hashing, finding all the items that hashed to the same value means you need to hash the value, then walk through the "chain" of non-empty items in your table to find how many of those hashed to the same value. That's not linear on the number of items that hashed to the same value though--it's linear on the number of items that hashed to the same or a colliding value.
What does this mean that it's "linear on the number of items that hashed to the same or a colliding value"? Wouldn't it be linear on total number of items in the hashtable, since it's possible that it will need to walk through every value during linear probing? I don't see why it would just have to go through the ones that collided.
Like for example, if I am using linear probing (step size 1) on a hashtable and I have different keys (not colliding, all hash to unique values) mapping to the odd index slots 1,3,5,7,9..... Then, I want to insert many keys that all hash to 2, so I fill up all my even index spots with those keys. If I wanted to know how many keys hash to 2, wouldn't I need to go through the entire hash table? But I'm not just iterating through items that hashed to the same or colliding value, since the odd index slots are not colliding.
A hash table is conceptually similar to an array (the table) of linked lists (the buckets). The difference is in how you manage and access that array: a hash function generates a number that is used to compute the array index.
Once two elements are placed in the same bucket (the same computed index, i.e. a collision), the problem turns into a search in a list. The number of elements in that list is hopefully lower than the total number of elements in the hash table (meaning that the other elements live in other buckets).
However, you are skipping the important introduction in that paragraph:
If you use something like linear probing or double hashing, finding all the items that hashed to the same value means you need to hash the value, then walk through the "chain" of non-empty items in your table to find how many of those hashed to the same value. That's not linear on the number of items that hashed to the same value though -- it's linear on the number of items that hashed to the same or a colliding value.
Linear probing is a different implementation of a hash table in which you don't use any list (chain) for your collisions. Instead, you just find the nearest available slot in the array, starting from the expected position and going forward. The more populated the array is, the higher the chance that the next position is already in use too, so you keep searching. Those positions are used by items that hashed to the same or to a colliding value, although you never know (and don't really care) which of the two cases it is, unless you explicitly check the hash of the element stored there.
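A small sketch of that counting problem with linear probing, in plain JavaScript (the table layout and the toy hash function are just for illustration):

// table: array where each slot holds a key or null (open addressing, linear probing)
function countHashedTo(table, bucket) {
  const size = table.length;
  const hash = (key) => key % size;          // toy hash function for illustration
  let count = 0, i = bucket, steps = 0;
  // Walk the contiguous run of occupied slots starting at `bucket`.
  // Every slot in the run has to be visited, including slots holding keys that
  // hashed somewhere else and were merely pushed here by earlier collisions.
  while (table[i] !== null && steps < size) {
    if (hash(table[i]) === bucket) count++;  // only these actually hashed to `bucket`
    i = (i + 1) % size;
    steps++;
  }
  return count;
}

In the worst case (no empty slot until the run wraps around, as in the odd/even example above) that walk can cover most of the table, because the run includes every slot that earlier collisions pushed items into.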
This CppCon presentation video gives a good introduction to, and an in-depth analysis of, hash tables.

How does Top-K sort algorithm work in MongoDB

Based on the answer and from MongoDB Documentation, I understood that MongoDB is able to sort a large data set and provide sorted results when limit() is used.
However, querying the same data set using sort() alone results in a memory exception.
From the second answer in the above post, the poster mentions that the whole collection is scanned, sorted, and the top N results are returned. I would like to know how the collection is sorted when I use limit().
From the documentation I found that when limit() is used it does a top-K sort; however, there is not much explanation of it available anywhere. I would like to see any references about the top-K sort algorithm.
In general, you can do an efficient top-K sort with a min-heap of size K. The min-heap represents the largest K elements seen so far in the data set. It also gives you constant-time access to the smallest element of those top K elements.
As you scan over the data set, if a given element is larger than the smallest element in the min-heap (i.e. the smallest of the largest top K so far), you replace the smallest from the min-heap with that element and re-heapify (O(lg K)).
At the end, you're left with the top K elements of the entire data set, without having had to sort them all (worst-case running time is O(N lg K)), using only Θ(K) memory.
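A minimal sketch of that idea in plain JavaScript (this shows the general technique, not MongoDB's actual implementation):

function topK(values, k) {
  const heap = [];  // binary min-heap of the K largest values seen so far

  const siftUp = (i) => {
    while (i > 0) {
      const parent = (i - 1) >> 1;
      if (heap[parent] <= heap[i]) break;
      [heap[parent], heap[i]] = [heap[i], heap[parent]];
      i = parent;
    }
  };

  const siftDown = (i) => {
    for (;;) {
      let smallest = i;
      const left = 2 * i + 1, right = 2 * i + 2;
      if (left < heap.length && heap[left] < heap[smallest]) smallest = left;
      if (right < heap.length && heap[right] < heap[smallest]) smallest = right;
      if (smallest === i) break;
      [heap[smallest], heap[i]] = [heap[i], heap[smallest]];
      i = smallest;
    }
  };

  for (const v of values) {
    if (heap.length < k) {          // heap not full yet: just insert, O(log K)
      heap.push(v);
      siftUp(heap.length - 1);
    } else if (v > heap[0]) {       // larger than the smallest of the current top K
      heap[0] = v;                  // replace the root and restore the heap, O(log K)
      siftDown(0);
    }
  }
  return heap.sort((a, b) => b - a);  // final sort of just the K survivors
}

// topK([5, 1, 9, 3, 7, 8, 2], 3) returns [9, 8, 7]

The key point is that memory stays bounded by K (the limit) rather than N (the collection size), which is why sort() combined with limit() can succeed where sort() alone exceeds the in-memory sort limit.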
I actually learnt this in school for a change :-)

mongodb integer index precision

TLDR: Given documents such as { x: 15, y: 15 }, can I create an index on 20 / x, 20 / y without adding those values as fields?
I'd like to index the x, y coordinates of a collection of documents. However my use case is such that:
Each document will have a unique x,y pair, so you can think of this as an ID.
Documents are added in blocks of 20x20 and are always fully populated (every x,y combo in that block will exist)
Lookups will only ever be against blocks of 20x20 and will always align on those boundaries. I'll never query for part of a block or for values that aren't multiples of 20.
For any possible block, there will either be no data or 4,000 results. A block is never sparsely populated.
I will be writing much more frequently than reading so write efficiency is very important.
An index of { x: 1, y: 1} would work but seems wasteful. With my use case the index would have an entry for every document in the collection! Since my queries will be aligned on multiples of 20, I really only need to index to that resolution. I'm expecting this would produce a smaller footprint on disk, in memory, and slightly faster lookups.
Is there a way to create an index like this or am I thinking in the wrong direction?
I'm aware that I could add block_x and block_y to my document, but the block concept doesn't exist in my application so it would be junk data.
Since my queries will be aligned on multiples of 20, I really only need to index to that resolution. I'm expecting this would produce a smaller footprint on disk, in memory, and slightly faster lookups.
Index entries are created per-document, since each index entry points to a single document. Lowering the "resolution" of the index would therefore impart no space savings at all, since the size of the index depends on the index type (single field, compound, etc. see https://docs.mongodb.com/manual/indexes/#index-types) and the number of documents in that collection.
I'm aware that I could add block_x and block_y to my document, but the block concept doesn't exist in my application so it would be junk data.
If the fields block_x and block_y would help you to more effectively find a document, then I wouldn't say it's junk data. It's true that you don't need to display those fields in your application, but they could be useful to speed up your queries nonetheless.
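If you do decide to try it, a sketch in the mongo shell might look like this (block_x, block_y, and the collection name are hypothetical, and the block size of 20 comes from your description):

// Index only the derived block coordinates
db.points.createIndex({ block_x: 1, block_y: 1 })

// On write, store which 20x20 block the point belongs to
var doc = { x: 57, y: 23 };
doc.block_x = Math.floor(doc.x / 20) * 20;  // 40
doc.block_y = Math.floor(doc.y / 20) * 20;  // 20
db.points.insertOne(doc);

// A block-aligned lookup is then a single equality match on the compound index
db.points.find({ block_x: 40, block_y: 20 })

Note that this still creates one index entry per document (as explained above), so the index won't be dramatically smaller; the potential benefit is in making block-aligned queries simple equality matches.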

Is MongoDB geoHaystack the same as the standard MongoDB spatial index?

It seems that MongoDB has 2 types of geospatial indexes.
http://www.mongodb.org/display/DOCS/Geospatial+Indexing
The standard one. With a note:
You may only have 1 geospatial index per collection, for now. While MongoDB may allow to create multiple indexes, this behavior is unsupported. Because MongoDB can only use one index to support a single query, in most cases, having multiple geo indexes will produce undesirable behavior.
And then there is this so called geohaystack thingy.
http://www.mongodb.org/display/DOCS/Geospatial+Haystack+Indexing
They both claim to use the same algorithm. They both divide the earth into grids and then search based on that.
So what's the difference?
MongoDB doesn't seem to use R-trees and the like, right?
NB: The answer to the question How does MongoDB implement its spatial indexes? says that the 2d index uses geohashing too.
The implementation is similar, but the use case difference is described on the Geospatial Haystack Indexing page.
The haystack indices are "bucket-based" (aka "quadrant") searches tuned for small-region longitude/latitude searches:
In addition to ordinary 2d geospatial indices, mongodb supports the use of bucket-based geospatial indexes. Called "Haystack indexing", these indices can accelerate small-region type longitude / latitude queries when additional criteria is also required.
For example, "find all restaurants within 25 miles with name 'foo'".
Haystack indices allow you to tune your bucket size to the distribution of your data, so that in general you search only very small regions of 2d space for a particular kind of document. They are not suited for finding the closest documents to a particular location, when the closest documents are far away compared to bucket size.
The bucketSize parameter is required, and determines the granularity of the haystack index.
So, for example:
db.places.ensureIndex({ pos : "geoHaystack", type : 1 }, { bucketSize : 1 })
This example bucketSize of 1 creates an index where keys within 1 unit of longitude or latitude are stored in the same bucket. An additional category can also be included in the index, which means that information will be looked up at the same time as finding the location details.
The B-tree representation would be similar to:
{ loc: "x,y", category: z }
If your use case typically searches for "nearby" locations (i.e. "restaurants within 25 miles") a haystack index can be more efficient. The matches for the additional indexed field (eg. category) can be found and counted within each bucket.
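For example, the haystack index above would be queried through the geoSearch command (the coordinates and distances here are placeholders, expressed in the same flat units as the indexed values):

db.runCommand({
  geoSearch: "places",              // collection with the geoHaystack index
  near: [ -73.9, 40.7 ],            // centre of the search
  maxDistance: 6,                   // same (flat) units as the coordinates
  search: { type: "restaurant" },   // match on the additional indexed field
  limit: 30
})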
If, instead, you are searching for "nearest restaurant" and would like to return results regardless of distance, a normal 2d index will be more efficient.
There are currently (as of MongoDB 2.2.0) a few limitations on haystack indexes:
only one additional field can be included in the haystack index
the additional index field has to be a single value, not an array
null long/lat values are not supported
Note: distance between degrees of latitude will vary greatly (longitude, less so). See: What is the distance between a degree of latitude and longitude?.