MongoDB Index Complexity - mongodb

I really like MongoDB, I use it at work and home, and not once yet have I hit a performance, complexity, or limitation issue with it. But I've been thinking about indexes a lot and I had a question I've not found an adequate answer to.
One of the big issues with SQL databases at scale is the relative complexity of queries. Specifically, MySQL uses b-trees for most of it's indexes, which querying takes O(log(n)), better than linear, but still means things take longer the more data you have.
A big attraction of noSQL databases is the removal/mitigation of this scaling issue, often relying instead on hash style indexes, which have O(1) lookup time, so having more data doesn't slow down your app. This is where my question comes in:
According to the offical MongoDB documentation, all indexes in Mongo use b-trees. Despite the fact that Mongo does in fact have a hashed index, as far as I can tell these are still stored in b-trees, same with the index on the _id field. I couldn't even find anything indicating anything about constant time anywhere in Mongo's documentation!
So my question is this: are, in fact, all indexes (including _id and hashed) in Mongo stored in b-trees? Does this mean querying for keys (even by _id) in fact takes O(log(n)) time?
Addendum: As a point of note, I'd be great if Mongo documentation provided some complexity formulas with examples queries. My favorite example of this is the Redis documentation.
Also: This is related. But I have the added specific questions regarding the hashed indexes and (more importantly) the _id index.

If you look at the code for indexing in mongodb (here), you can easily see that it's using btree for indexing. So the order of the algorithm is O(log n), but the base of this logarithm function is not 2, but 8192 instead, which is here in the code.
So for a million records we only have two levels (assuming the tree is balanced) and that is why it can find the record so fast. Overall, it's true the order is logarithmic, but since the base of the logarithm function is so large, it grows slowly.

Related

are hashed indexes in mongodb field-size limited?

in our DB we have a large text field which we want to filter on exists/does not exist basis. So we don't need to perform any text search in it.
we assume that index would help, although it's not guaranteed the fiels wont exceed 1024 bytes. So that's not an option.
does hashed index on such field support $exists-filtering queries?
do hashed indexes impose any field-size limitations (in our experiments, hashed index is well capable of indexing fields where ordinary index fails)? We haven't found any explicit statement on this in docs though.
is chosen approach as a whole the correct one?
Yes, your approach is the correct one given the constraints. However, there are some caveats.
The performance advantage of an index compared to a collection scan is limited by the RAM available, since mongod tries to keep indices in RAM. If it can't (die to queries, for example), even an index will be read from disk, more or less eliminating the performance advantage in using it. So you should test wether the additional index does not push the RAM needed beyond the limits of your planned deployment.
The other, more severe problem is that you can not use said index to reliably distinguish unique documents with it, since there is no guarantee for uniqueness on hashes. Albeit a bit theoretical, you have to keep that in mind.

Skipping the first term of a compound index by using hint()

Suppose I have a Mongo collection with fields a and b. I've populated this collection with {a:'a', b : index } where index increases iteratively from 0 to 1000.
I know this is very, very wrong, but can't explain (no pun intended) why:
collection.find({i:{$gt:500}}).explain() confirms that the index was not used (I can see that it scanned all 1,000 documents in the collection).
Somehow forcing Mongo to use the index seems to work though:
collection.find({i:{$gt:500}}).hint({a:1,i:1}).explain()
Edit
The Mongo documentation is very clear that it will only use compound indexes if one of your query terms is the matches the first term of the compound index. In this case, using hint, it appears that Mongo used the compound index {a:1,i:1} even though the query terms do NOT include a. Is this true?
The interesting part about the way MongoDB performs queries is that it actually may run multiple queries in parallel to determine what is the best plan. It may have chosen to not use the index due to other experimenting you've done from the shell, or even when you added the data and whether it was in memory, etc/ (or a few other factors). Looking at the performance numbers, it's not reporting that using the index was actually any faster than not (although you shouldn't take much stock in those numbers generally). In this case, the data set is really small.
But, more importantly, according to the MongoDB docs, the output from the hinted run also suggests that the query wasn't covered entirely by the index (indexOnly=false).
That's because your index is a:1, i:1, yet the query is for i. Compound indexes only support searches based on any prefix of the indexed fields (meaning they must be in the order they were specified).
http://docs.mongodb.org/manual/core/read-operations/#query-optimization
FYI: Use the verbose option to see a report of all plans that were considered for the find().

Multiple indexes with different definitions in mongodb

The question is a very simple one, can you have more than one index in a collection. I suppose you can, but every time I search for multiple indexes I get explanations on compound indexes and that is not what I'm looking for.
All I want to do is make sure that I can have two simple separate indexes.
(I'm using PHP, I'll use php code formatting, but I understand
db.posts.ensureIndex({ my_id1: 1 }, {unique: true, background: true});
db.posts.ensureIndex({ my_id2: 1 }, {background: true});
I'll only search for one index at a time.
Compound indexes are not what I'm looking for because:
one index is unique and the other is not.
I think it's not going to be the fastest option. (open the link to understand the reason I think its going to be slower. link)
I just want to make sure that the indexes will work.
You sure can have indexes defined the way you have it. From MongoDB documentation:
How many indexes? Indexes make retrieval by a key, including ordered sequential retrieval, very fast. Updates by key are faster too as MongoDB can find the document to update very quickly. However, keep in mind that each index created adds a certain amount of overhead for inserts and deletes. In addition to writing data to the base collection, keys must then be added to the B-Tree indexes. Thus, indexes are best for collections where the number of reads is much greater than the number of writes. For collections which are write-intensive, indexes, in some cases, may be counterproductive. Most collections are read-intensive, so indexes are a good thing in most situations.
I also recommend you look at how Mongo will decide what index to use when it comes to running a query that goes by both fields.
Also take a look at their Indexing Advice and FAQ page. It will explain things like only one index per query, selectivity, etc.
p.s. This slideshare deck from 10gen suggests there's a limit of 40 indexes per collection.

Is mongoDB efficient in doing multi-key lookups?

I'm evaluating MongoDB, coming from Membased/memcached because I want more flexibility.
Of course Membase is excellent in doing fast (multi)-key lookups.
I like the additional options that MongoDB gives me, but is it also fast in doing multi-key lookups? I've seen the $or and $in operator and I'm sure I can model it with that. I just want to know if it's performant (in the same league) as Membase.
use-case, e.g., Lucene/Solr returns 20 product-ids. Lookup these product-ids in Couchdb to return docs/ appropriate fields.
Thanks,
Geert-Jan
For your use case, I'd say it is, from my experience: I hacked some analytics into a database of mine that made a lot of $in queries with thousands of ids and it worked fine (it was a hack). To my surprise, it worked rather well, in the lower millisecond area.
Of course, it's hard to compare this, and -as usual- theory is a bad companion when it comes to performance. I guess the best way to figure it out is to migrate some test data and send some queries to the system.
Use MongoDB's excellent built-in profiler, use $explain, keep the one index per query rule in mind, take a look at the logs, keep an eye on mongostat, and do some benchmarks. This shouldn't take too long and give you a definite and affirmative answer. If your queries turn out slow, people here and on the news group probably have some ideas how to improve the exact query, or the indexation.
One index per query. It's sometimes thought that queries on multiple
keys can use multiple indexes; this is not the case with MongoDB. If
you have a query that selects on multiple keys, and you want that
query to use an index efficiently, then a compound-key index is
necessary.
http://www.mongodb.org/display/DOCS/Indexing+Advice+and+FAQ#IndexingAdviceandFAQ-Oneindexperquery.
There's more information on that page as well with regard to Indexes.
The bottom line is Mongo will be great if your indexes are in memory and you are indexing on the columns you want to query using composite keys. If you have poor indexing then your performance will suffer as a result. This is pretty much in line with most systems.

What's the difference between B-Tree and GiST index methods (in PostgreSQL)?

I have been working on optimizing my Postgres databases recently, and traditionally, I've only ever use B-Tree indexes. However, I saw that GiST indexes suport non-unique, multicolumn indexes, in the Postgres 8.3 documentation.
I couldn't, however, see what the actual difference between them is. I was hoping that my fellow coders might beable to explain, what the pros and cons between them are, and more importantly, the reasons why I would use one over the other?
In a nutshell: B-Tree indexes perform better, but GiST indexes are more flexible. Usually, you want B-Tree indexes if they'll work for your data type. There was a recent post on the PG lists about a huge performance hit for using GiST indexes; they're expected to be slower than B-Trees (such is the price of flexibility), but not that much slower... work is, as you might expect, ongoing.
From a post by Tom Lane, a core PostgreSQL developer:
The main point of GIST is to be able to index queries that simply are
not indexable in btree. ... One would fully
expect btree to beat out GIST for btree-indexable cases. I think the
significant point here is that it's winning by a factor of a couple
hundred; that's pretty awful, and might point to some implementation
problem.
Basically everybody's right - btree is default index as it performs very well. GiST are somewhat different beasts - it's more of a "framework to write index types" than a index type on its own. You have to add custom code (in server) to use it, but on the other hand - they are very flexible.
Generally - you don't use GiST unless the datatype you're using tell you to do so. Example of datatypes that use GiST: ltree (from contrib), tsvector (contrib/tsearch till 8.2, in core since 8.3), and others.
There is well known, and pretty fast geographic extenstion to PostgreSQL - PostGIS (http://postgis.refractions.net/) which uses GiST for its purposes.
GiST are more general indexes. You can use them for broader purposes that the ones you would use with B-Tree. Including the ability to build a B-Tree using GiST.
I.E.: you can use GiST to index on geographical points, or geographical areas, something you won't be able to do with B-Tree indexes, since the only thing that matter on a B-Tree is the key (or keys) you are indexing on.
GiST indexes are lossy to an extent, meaning that the DBMS has to deal with false positives/negatives, i.e.:
GiST indexes are lossy because each document is represented in the index by a fixed-
length signature. The signature is
generated by hashing each word into a
random bit in an n-bit string, with
all these bits OR-ed together to
produce an n-bit document signature.
When two words hash to the same bit
position there will be a false match.
If all words in the query have matches
(real or false) then the table row
must be retrieved to see if the match
is correct.
b-trees do not have this behavior, so depending on the data being indexed, there may be some performance difference between the two.
See for text search behavior http://www.postgresql.org/docs/8.3/static/textsearch-indexes.html and http://www.postgresql.org/docs/8.3/static/indexes-types.html for a general purpose comparison.