From the MongoDB docs: MongoDB indexes use a B-tree data structure.
But does this also apply to compound indexes? If so, how are they actually implemented?
P.S.: The only way I can imagine it is as a B-tree in which each node holds not a single value but as many values as there are indexed fields, for example in an array (i.e. as if two or more binary trees, one per indexed field, had been merged).
I cannot say anything with 100% certainty about the actual implementation, but I do think of compound indices as B-trees as well, with, as you describe, nodes that hold sequences of values (respecting the index key order) and, very importantly, compared using a lexicographic order on the key values.
Having said that, in order to save some space, I would also consider a B-tree that only uses the values associated with the first key at the top of the tree, then the values of the second key a bit further down, ... and the values of the last key close to the leaves. This is nothing more than an optimization that makes the boundaries of the nodes coincide with the individual keys.
The advantage of implementing a compound index this way (with or without the optimization above) is that if you have an index on several keys, say A then B then C, you get an index on A alone for free, and an index on A then B for free: all values for any prefix of the key sequence are always grouped together in such a B-tree, because of the lexicographic order.
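As a concrete illustration, here is a minimal sketch in the (legacy) mongo shell; the collection and field names are made up:

// Compound index on three keys; key order matters.
db.events.ensureIndex({ a: 1, b: 1, c: 1 })
// These queries can all use the index, because each filters on a prefix of the keys:
db.events.find({ a: 5 })                  // prefix: a
db.events.find({ a: 5, b: "x" })          // prefix: a, b
db.events.find({ a: 5, b: "x", c: 7 })    // full key
// This one cannot use the index efficiently: b alone is not a prefix.
db.events.find({ b: "x" })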
Since MongoDB documents that such is the case, I think of the implementation of compound indices this way when I use MongoDB.
Furthermore, the documentation specifies that hashed index fields are forbidden on compound indices. This is one more clue, as B-trees naturally support range queries.
Also, I would expect MongoDB's hash indices to be implemented with a hash table rather than a B-tree, as it would be less efficient to use a B-tree for point queries only (logarithmic lookup vs. O(1)).
I'm trying to determine the best way to deal with a composite primary key in a mongo db. The main key for interacting with the data in this system is made up of 2 uuids. The combination of uuids is guaranteed to be unique, but neither of the individual uuids is.
I see a few ways of managing this (sketched in the shell snippet after the list):
1. Use an object for the primary key that is made up of the 2 values (as suggested here)
2. Use a standard auto-generated Mongo ObjectId as the primary key, store my key in two separate fields, and then create a composite index on those two fields
3. Make the primary key a hash of the 2 uuids
4. Some other awesome solution that I am currently unaware of
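To make the options concrete, here is a rough sketch in the (legacy) mongo shell; the uuid values, collection name, and field names are made up, and hex_md5 is a helper available in the legacy shell:

var u1 = "3b241101-e2bb-4255-8caf-4136c566a962"   // hypothetical uuid values
var u2 = "9f8b2c44-1d3e-4f6a-b7c8-0a1b2c3d4e5f"
// Option 1: compound _id as an embedded document (field order must stay fixed).
db.things.insert({ _id: { u1: u1, u2: u2 }, data: "..." })
// Option 2: auto-generated _id plus a unique compound index on the two fields.
db.things.insert({ u1: u1, u2: u2, data: "..." })
db.things.ensureIndex({ u1: 1, u2: 1 }, { unique: true })
// Option 3: _id is a hash of the concatenated uuids.
db.things.insert({ _id: hex_md5(u1 + u2), data: "..." })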
What are the performance implications of these approaches?
For option 1, I'm worried about the insert performance due to having non-sequential keys. I know this can kill traditional RDBMS systems, and I've seen indications that this could be true in MongoDB as well.
For option 2, it seems a little odd to have a primary key that would never be used by the system. Also, it seems that query performance might not be as good as in option 1. In a traditional RDBMS a clustered index gives the best query results. How relevant is this in MongoDB?
For option 3, this would create one single id field, but again it wouldn't be sequential when inserting. Are there any other pros/cons to this approach?
For option 4, well... what is option 4?
Also, there's some discussion of possibly using CouchDB instead of MongoDB at some point in the future. Would using CouchDB suggest a different solution?
MORE INFO: some background about the problem can be found here
You should go with option 1.
The main reason is that you say you are worried about performance - using the _id index, which is always there and already unique, saves you from having to maintain a second unique index.
"For option 1, I'm worried about the insert performance due to having non-sequential keys. I know this can kill traditional RDBMS systems and I've seen indications that this could be true in MongoDB as well."
Your other options do not avoid this problem; they just shift it from the _id index to the secondary unique index - but now you have two indexes, one that's right-balanced and another that's random access.
There is only one reason to question option 1, and that is if you plan to access the documents by just one or the other UUID value. As long as you always provide both values and (this part is very important) you always order them the same way in all your queries, then the _id index will efficiently serve its full purpose.
To elaborate on why you have to order the two UUID values the same way: when comparing subdocuments, { a:1, b:2 } is not equal to { b:2, a:1 }, so you could have a collection where two documents have those two values for _id. Therefore, if you store _id with field a first, you must keep that order in all of your documents and queries.
The other caution is that the index on _id will be usable for this query:
db.collection.find({_id:{a:1,b:2}})
but it will not be usable for this query:
db.collection.find({"_id.a":1, "_id.b":2})
I have an option 4 for you:
Use the automatic _id field and add two single-field indexes, one for each uuid, instead of a single compound index.
The _id index would be sequential (although that's less important in MongoDB), easily shardable, and you can let MongoDB manage it.
The two uuid indexes let you run any query you need (on the first field, on the second, or on both in any order), and they take up less space than one compound index.
If you use both indexes (and others as well) in the same query, MongoDB will intersect them (index intersection, new in v2.6) as if you were using a compound index.
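A minimal sketch of this option in the (legacy) mongo shell; the field names are made up and the "..." values stand for actual uuids:

// Two single-field indexes instead of one compound index.
db.things.ensureIndex({ u1: 1 })
db.things.ensureIndex({ u2: 1 })
db.things.find({ u1: "..." })                // served by the first index alone
db.things.find({ u1: "...", u2: "..." })     // can be served by intersecting both (2.6+)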
I'd go for option 2, and here is why:
Having two separate fields, instead of one concatenated from both uuids as suggested in option 1, leaves you the flexibility to create other index combinations to support future query requirements, or to reorder the keys if it turns out that the cardinality of one key is higher than the other's.
Having non-sequential keys can help you avoid hotspots while inserting in a sharded environment, so it's not such a bad option. Sharding is, in my opinion, the best way to scale inserts and updates on collections, since write locking is at the database level (prior to 2.6) or collection level (in 2.6).
I would've gone with option 2. You can still make an index that handles both the UUID fields, and performance should be the same as a compound primary key, except it'll be much easier to work with.
Also, in my experience, I've never regretted giving something a unique ID, even if it wasn't strictly required. Perhaps that's an unpopular opinion though.
In these slides, the author says that a capped collection is perfect for logging because it is speedy due to natural ordering. Could you please explain to me why it is speedy?
Natural order means "return the data in the same order it is stored on disk, no sorting necessary". This is fast. Unfortunately, it is usually not a "meaningful" order at all. To get a meaningful order, you have to sort by data in a field, which implies either in-memory sorting or random access through an index (which is slower than sequential access).
In a capped collection, natural order happens to be the same order as document creation.
So if you want log entries in chronological order, a capped collection can provide that cheaply.
Unless explicitly created, there is no index on the collection, which means insertion is very quick. Think of it as appending to a list, as opposed to inserting an element into a sorted data structure.
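For example, in the mongo shell (the size here is arbitrary):

// Create a capped collection: fixed size, insertion order preserved on disk.
db.createCollection("log", { capped: true, size: 1048576 })
db.log.insert({ msg: "first" })
db.log.insert({ msg: "second" })
// Chronological reads need no index and no real sorting:
db.log.find().sort({ $natural: 1 })    // oldest first
db.log.find().sort({ $natural: -1 })   // newest first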
Suppose I have a Mongo collection with fields a and i. I've populated this collection with {a:'a', i: index} where index increases iteratively from 0 to 1000.
I know this is very, very wrong, but can't explain (no pun intended) why:
collection.find({i:{$gt:500}}).explain() confirms that the index was not used (I can see that it scanned all 1,000 documents in the collection).
Somehow forcing Mongo to use the index seems to work though:
collection.find({i:{$gt:500}}).hint({a:1,i:1}).explain()
Edit
The Mongo documentation is very clear that it will only use a compound index if one of your query terms matches the first field of the compound index. In this case, using hint, it appears that Mongo used the compound index {a:1,i:1} even though the query terms do NOT include a. Is this true?
The interesting thing about the way MongoDB performs queries is that it may actually run multiple candidate plans in parallel to determine which is best. It may have chosen not to use the index due to other experimenting you've done from the shell, or due to conditions when you added the data, such as whether it was in memory, etc. (or a few other factors). Looking at the performance numbers, it's not reporting that using the index was actually any faster than not using it (although you shouldn't put much stock in those numbers generally). In this case, the data set is really small.
But, more importantly, according to the MongoDB docs, the output from the hinted run also suggests that the query wasn't covered entirely by the index (indexOnly=false).
That's because your index is a:1, i:1, yet the query is for i. Compound indexes only support searches based on any prefix of the indexed fields (meaning they must be in the order they were specified).
http://docs.mongodb.org/manual/core/read-operations/#query-optimization
FYI: Use the verbose option to see a report of all plans that were considered for the find().
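In the shell that looks roughly like this (the exact explain options vary by server version):

// Legacy shell: verbose explain includes the "allPlans" entries that were considered.
db.collection.find({ i: { $gt: 500 } }).explain(true)
// Newer servers: request execution stats for every candidate plan.
db.collection.find({ i: { $gt: 500 } }).explain("allPlansExecution")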
I've tried to search through the Mongo documentation, but can't really find any details on whether queries on unique indexes are faster than queries on non-unique indexes (given the same data).
So I understand that a unique index will have high selectivity and good performance. But, given two fields whose concatenation is unique, would a non-unique compound index perform slower than a unique compound index?
I am assuming that unique indexes can slow down inserts as the uniqueness must be verified. But is the read performance improvement of a unique index, if any, really worth it?
A quick grep of the source tree seems to indicate that uniqueness is only checked on insert, so there shouldn't be any performance benefit or detriment for a query that returns one document, whether the index is unique or not.
MongoDB indexes are implemented as btrees, so it wouldn't make any logical sense for them to perform any differently whether the index is unique or not.
I did my own small research on this topic. I generated 500,000 records (randomly generated strings) in a collection and ran a couple of queries with explain().
Then I created a unique index and tried a few more queries:
As you can see, after adding the index the reported time dropped from ~276 ms to 0 ms! So it seems that even when the index is unique, it still benefits find queries.
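For reference, this is roughly the shape of such a test in the mongo shell (collection and field names are guesses; the explain output itself is omitted here):

db.strings.find({ value: "some random string" }).explain()    // before: full collection scan
db.strings.ensureIndex({ value: 1 }, { unique: true })
db.strings.find({ value: "some random string" }).explain()    // after: index lookup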
I am creating a service for which I will use MongoDB as a storage backend.
The service will produce a hash of the user input and then see if that same hash (+ input) already exists in our dataset.
The hash will be unique yet random ( = non-incremental/sequential), so my question is:
Is it legitimate to use a random value for an ObjectId? Example:
$object_id = new MongoId(HEX-OF-96BIT-HASH);
Or will MongoDB treat the ObjectId differently from other server-produced ones, since a "real" ObjectId also contains a timestamp, machine id, etc.?
What are the pros and cons of using a 'random' value? I guess it would be slower for the engine to update the index on inserts when the new _id values are not in any way incremental - am I correct about that?
Yes, it is perfectly fine to use a random value for an object id: whatever value is present in the _id field of a document being stored is used as its primary key.
Since the _id field is always indexed and is the primary key, you need to make sure that a different value is generated for each object.
There are some guidelines for optimizing user-defined object ids:
https://docs.mongodb.com/manual/core/document/#the-id-field.
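For illustration, both forms are possible in the mongo shell (the hex digest below is a made-up placeholder; ObjectId() accepts a 24-character hex string, which matches a 96-bit hash):

var digest = "aabbccddeeff001122334455"              // hypothetical 96-bit hash as 24 hex chars
db.items.insert({ _id: ObjectId(digest), input: "user input" })
// _id does not have to be an ObjectId; a raw string works as a primary key too
// (a string _id and ObjectId(digest) are different BSON values, so both inserts succeed):
db.items.insert({ _id: digest, input: "other input" })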
While any values, including hashes, can be used for the _id field, I would recommend against using random values for two reasons:
1. You may need to develop a collision-management strategy in case you produce identical random values for two different objects. In the question, you imply that you'll generate IDs using some type of hash algorithm. I would not consider these values "random", as they are based on the content you are digesting with the hash. The probability of a collision is then a function of the diversity of the content and the hash algorithm. If you are using something like MD5 or SHA-1, I wouldn't worry about the algorithm, just the content you are hashing. If you need to develop a collision-management strategy, then you definitely should not use random or hash-based IDs, as collision management in a clustered environment is complicated and requires additional queries.
2. Random values, as well as hash values, are purposely dispersed along the number line. That (a) requires more of the B-tree index to be kept in memory at all times and (b) may cause variable insert performance due to B-tree rebalancing. MongoDB is optimized to handle ObjectIds, which come in ascending order (with one-second time granularity). You're likely better off sticking with them.
I just found an answer to one of my questions regarding indexing performance:
If the _id's are in a somewhat well defined order, on inserts the entire b-tree for the _id index need not be loaded. BSON ObjectIds have this property.
Source: http://www.mongodb.org/display/DOCS/Optimizing+Object+IDs
Whether it is good or bad depends upon its uniqueness. Of course, the ObjectId provided by MongoDB is quite unique, so this is a good thing. As long as you can replicate that uniqueness, you should be fine.
There are no inherent risks or performance losses in using your own ID. Using it in string form might consume more index, storage, and query resources, but here you are using it in MongoId (ObjectId) form, which preserves the advantages over storing it as a plain string.