Mongo _id as string indexed key. Good or bad?

I'm developing an API in which the only way to get a resource is by providing a string key such as my_resource.
Is it good practice to override _id (this makes some MongoDB drivers easier to use), or is it bad? What about in the long term?
Thank you

If there is a more natural primary key to use than an ObjectID (for example, a string value) feel free to use it.
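The purpose of ObjectIDs is to allow distributed clients to quickly and independently generate unique identifiers according to a standard formula. The 12-byte ObjectID formula includes a 4-byte timestamp, a 3-byte machine identifier, a 2-byte process ID, and a 3-byte counter starting with a random value. (Newer MongoDB versions replace the machine identifier and process ID with a single 5-byte random value; the timestamp and counter are unchanged.)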
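To make the byte layout concrete, here is a minimal Python sketch that splits a 24-character hex ObjectID into the four fields this answer lists. The function name and the sample value are hypothetical and chosen only for illustration; the sketch does not use any MongoDB driver.

```python
import struct
from datetime import datetime, timezone


def split_object_id(hex_id: str) -> dict:
    """Split an ObjectID hex string into timestamp, machine, process and counter."""
    raw = bytes.fromhex(hex_id)
    if len(raw) != 12:
        raise ValueError("an ObjectID is exactly 12 bytes (24 hex characters)")
    # The first 4 bytes are big-endian seconds since the Unix epoch.
    (seconds,) = struct.unpack(">I", raw[0:4])
    return {
        "created": datetime.fromtimestamp(seconds, tz=timezone.utc),
        "machine": raw[4:7].hex(),                   # 3-byte machine identifier
        "process": int.from_bytes(raw[7:9], "big"),  # 2-byte process ID
        "counter": int.from_bytes(raw[9:12], "big"), # 3-byte counter
    }


# Hypothetical ObjectID value, used only for illustration.
fields = split_object_id("507f1f77bcf86cd799439011")
print(fields["created"].isoformat(), fields["counter"])
```

Because the timestamp comes first and is big-endian, comparing two ObjectIDs byte by byte compares their creation seconds first.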
If you generate/assign your own unique identifiers (i.e. using strings), the potential performance consideration is that you won't know whether an identifier is unique until you try to insert your document using that _id. In this case you would need to handle the duplicate key exception by retrying the insert with a new _id.
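The retry-on-duplicate pattern can be sketched as follows. To keep the example runnable without a server, `FakeCollection` and `DuplicateKeyError` are hypothetical in-memory stand-ins; with a real driver you would catch its own duplicate key exception instead (PyMongo, for example, has `pymongo.errors.DuplicateKeyError`). The "-2", "-3" suffix scheme is also just an assumption for illustration.

```python
import itertools


class DuplicateKeyError(Exception):
    """Stand-in for the duplicate key error raised by a real driver."""


class FakeCollection:
    """In-memory stand-in for a collection with a unique _id index."""

    def __init__(self):
        self.docs = {}

    def insert_one(self, doc):
        if doc["_id"] in self.docs:
            raise DuplicateKeyError(doc["_id"])
        self.docs[doc["_id"]] = doc


def candidates(base):
    """Yield base, base-2, base-3, ... (a hypothetical naming scheme)."""
    yield base
    for n in itertools.count(2):
        yield f"{base}-{n}"


def insert_with_retry(collection, doc, base_id, max_attempts=5):
    """Insert doc, trying a new _id each time the previous one is taken."""
    for candidate in itertools.islice(candidates(base_id), max_attempts):
        try:
            collection.insert_one({**doc, "_id": candidate})
            return candidate
        except DuplicateKeyError:
            continue  # this _id is taken; try the next candidate
    raise RuntimeError(f"no free _id found after {max_attempts} attempts")


coll = FakeCollection()
first = insert_with_retry(coll, {"title": "a"}, "my_resource")
second = insert_with_retry(coll, {"title": "b"}, "my_resource")
print(first, second)
```

The point of the design is that uniqueness is enforced by the insert itself, so no separate "does this key exist?" query is needed before writing.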

In my experience, overriding _id is not the best idea. Only if your data has a value field that is naturally unique and can easily be used to replace _id should _id be overridden. But it wouldn't make a whole lot of sense to override _id only to replace it with a contrived value.
I would recommend against it for a few reasons:
First of all, doing so requires an additional implementation to handle the inevitable instances when your "unique" values will conflict. And this will almost certainly arise in a database of any significant size. This can be a problem, since MongoDB can be unforgiving when it comes to overwriting values and generally handling conflicts. In other words, you're almost certain to overwrite values or meet unhandled exceptions unless you design your database structure very carefully from the beginning.
Second, and equally important: ObjectIDs are generated in a way that works very well with indexes. Because each ObjectID starts with a timestamp, a newly inserted document gets an ObjectID that sorts right next to the previous one, which keeps inserts at one end of the index and is good for memory use and index maintenance. It might be more trouble than it's worth to recreate this very handy property yourself.
And lastly, although this isn't as significant, overriding _id makes it that much harder to use the standard ObjectID methods.
Now, there is at least one positive that I can think of for overriding the ObjectID:
If the generated _id would certainly never be used for lookups in your database, then overriding it with the key you actually query on can save you a good amount of memory, as indexes are pretty costly in MongoDB.

Related

email as _id in a MongoDB user collection

I have a user collection in a MongoDB. The _id is currently the standard MongoDB generated ObjectId. I also have a unique key constraint against a required 'email' field. This seems like a waste.
Is there any reason why I should not ditch the 'email' field and make that data the _id field?
I have read Neil's answer and I partially agree with it (I am also really skeptical about 'significant performance gains'). One thing I have not found in your question is what you are going to do with this email. Are you going to search by it, or is it just saved there? And one of the most important things, which was not addressed in the previous answer: is it going to change?
It is not uncommon for users of your system to change their email address (it was lost, or is no longer used). If you use their email as the _id, you will not be able to change it easily (you cannot modify _id in MongoDB). In that case you will need to copy the document, insert the copy under the new _id, and remove the old document, which will not be atomic.
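The copy, insert, remove sequence looks roughly like this. A plain dictionary stands in for the collection here, and `change_id` is a hypothetical helper, so this only illustrates the order of steps and where the non-atomic window sits.

```python
def change_id(docs: dict, old_id: str, new_id: str) -> None:
    """Re-key a document in a dict that stands in for a collection keyed by _id."""
    if new_id in docs:
        raise KeyError(f"{new_id!r} is already in use")
    copy = dict(docs[old_id], _id=new_id)  # step 1: copy the document with the new _id
    docs[new_id] = copy                    # step 2: insert the copy
    # Between step 2 and step 3 both versions exist; a crash here leaves a duplicate user.
    del docs[old_id]                       # step 3: remove the original


users = {"old@example.com": {"_id": "old@example.com", "name": "Ann"}}
change_id(users, "old@example.com", "new@example.com")
print(users)
```

Any other document that stores the old email as a reference would have to be updated as well, which is another reason a changeable value makes an awkward _id.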
So I would put this as one big reason not to do so. But you need to decide whether you will allow people to change email addresses.
Generally speaking, no, there is no real reason not to, and in fact there are significant performance gains to be realized if you actually do use your "email" as a primary key.
This applies where most of your lookups are actually on that primary key. Even if you create a unique index on a different field, MongoDB is optimized so that "finding" the _id field index is a no-brainer. It's always there.
No additional space is used for an index. So again, where you are looking up your primary key there is no need to pull in anything other than the default index, which naturally saves disk space as well as the I/O cost that would be incurred otherwise.
Perhaps the only real relevant consideration would be with sharding. And that would only be if your use case was better suited to some different form of "bucketed" distribution of "high/low" volume users for example. In that case some other form of Primary key would be required in order to facilitate that.
The default ObjectId type that generally occupies the _id field is great, as it maintains a natural insertion order and even makes it possible to do things such as general range-based queries or time-based queries (within reason). So where there is a need for a natural insertion order it is generally the best choice, and it is highly collision safe.
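A time-based range query on _id works by building boundary ObjectIds whose leading 4 bytes encode the start and end times. The sketch below builds the boundary values as hex strings in plain Python; `object_id_floor` is a hypothetical helper, and with a real driver you would wrap the hex strings in its ObjectId type before querying (PyMongo's `bson.ObjectId` also offers a `from_datetime` helper for this purpose).

```python
import struct
from datetime import datetime, timezone


def object_id_floor(moment: datetime) -> str:
    """Hex of the smallest ObjectId whose timestamp is `moment`, truncated to the second."""
    seconds = int(moment.timestamp())
    # 4-byte big-endian timestamp followed by 8 zero bytes.
    return (struct.pack(">I", seconds) + b"\x00" * 8).hex()


start = object_id_floor(datetime(2014, 1, 1, tzinfo=timezone.utc))
end = object_id_floor(datetime(2014, 2, 1, tzinfo=timezone.utc))

# Hypothetical filter for documents created in January 2014.
# The hex strings must be converted to the driver's ObjectId type first.
query = {"_id": {"$gte": start, "$lt": end}}
print(query)
```

This relies on the clocks of the machines that generated the ids, and on one-second granularity, which is why the answer qualifies it with "within reason".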
But if you are generally looking for efficient lookup of Primary key values, then anything that serves as a natural primary key is ideally put in the _id field of the collection, as long as it is reasonably guaranteed to be unique.

Performance disadvantage using slug as primary key/_id in mongo?

Let's take for example a blog post where a unique slug is generated from the post's title: sample_blog_post. Instead of storing a mongo ObjectId as the _id, say you store the slug in the _id. Besides the obvious case where the slug may change if the title changes, are there major disadvantages in terms of performance to using a string instead of a numerical _id? This could become problematic if, say, the number of posts became very large, say, over a million. But if the number of posts was relatively low, say, 2000, would it make much of a difference? So far the only thing about the ObjectId that I think I'd take advantage of is the created_on date that comes for free.
So in summation, is it worth it to store the slug as the _id and not use an ObjectId? There seems to be discussion on how to store alternate values as an _id, but not the performance advantages/disadvantages to it.
So in summation, is it worth it to store the slug as the _id and not use an ObjectId?
In my opinion, no. The performance difference will be negligible for most scenarios (except paging), but:
The old discussion of surrogate primary keys comes up. A "slug" is not a very natural key. Yes, it must be unique, but as you already pointed out, changing the slug shouldn't be impossible. This alone would keep me from bothering...
Having a monotonic _id key can save you from a number of headaches, most importantly to avoid expensive paging via skip and take (use $lt/$gt on the _id instead).
There's a limit on the maximum index length in mongodb of less than 1024 bytes. While not pretty, URLs are allowed to be a lot longer. If someone entered a longer slug, it wouldn't be found because it's silently dropped from the index.
It's a good idea to have a consistent interface, i.e. to use the same type of _id on all, or at least, most of your objects. In my code, I have a single exception where I'm using a special hash as id because the value can't change, the collection has extremely high write rates and it's large.
Let's say you want to link to the article in your management interface (not the public site), which link would you use? Normally the id, but now the id and the slug are equivalent. Now a simple bug (such as allowing an empty slug) would be hard to recover from, because the user couldn't even go to the management interface anymore.
You'll be dealing with charset issues. I'd suggest not even using the slug to look up the article, but the slug's hash.
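Looking up by the slug's hash could be sketched like this. The answer does not say which hash function it uses, so SHA-1 truncated to 12 hex characters (the length of the hash value in the answer's example schema) is purely an assumption, as is the `slug_hash` helper name.

```python
import hashlib


def slug_hash(slug: str, length: int = 12) -> str:
    """Hash a slug into a short, fixed-length, ASCII-only hex string."""
    digest = hashlib.sha1(slug.encode("utf-8")).hexdigest()
    return digest[:length]  # truncation is an assumption; it raises the collision risk


# Document to store: the ObjectId stays the primary key, the hash gets a unique index.
post = {"slug": "mongodb-is-fun", "hash": slug_hash("mongodb-is-fun")}

# Filter for an incoming request: hash the requested slug and match on the indexed field.
lookup = {"hash": slug_hash("mongodb-is-fun")}

# A slug with non-ASCII characters, or a very long one, still yields a 12-character key.
print(post["hash"], slug_hash("über-long-slug-" * 100))
```

The indexed value then has a fixed size and character set regardless of what the slug contains, which sidesteps both the index key length limit and the charset concerns mentioned in this answer.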
Essentially, you'd end up with a schema like
{ "_id" : ObjectId("a237b45..."), // PK
"slug" : "mongodb-is-fun", // not indexed
"hash" : "5af87c62da34" } // indexed, unique

Using timestamp as mongodb _id?

I want to use MongoDB to store time series data, and I think it would make more sense to keep one unique indexed field that represents the date-time. So the question is, can I really replace the automatic _id creation with my own timestamp, and would there be any drawbacks?
can I really replace the automatic _id creation with my own timestamp?
Yes, you can.
would there be any drawbacks?
One is that you have to work for it, whereas the built in _id is, well, built in.
Another one is that you're responsible for making sure your _id is indeed unique. Depending on your data frequency and the kind of timestamp you use, this may or may not be simple.
I'm not saying it's necessarily a bad idea. The advantages are clear, but, yes, there are drawbacks.
You can definitely populate _id field with your own timestamp. The things to look out for are:
_id has a unique index, so you would have to be sure that no two documents share a timestamp. If you can't guarantee this then it will not work.
If you were to shard this collection, you may want to avoid using a timestamp as the shard key. If you were always writing data points with the current timestamp then you would find all of your writes would go to a single shard, rather than distributed evenly across shards.
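The single-shard effect can be illustrated with a toy simulation. This is not MongoDB's actual chunk balancing: `range_shard` and `hashed_shard` are hypothetical helpers, the split points are made up, and MD5 is used only as a convenient way to scatter keys. It shows why ever-increasing keys under range partitioning all land on the shard that owns the top range, while hashing the same keys spreads them out.

```python
import hashlib
from collections import Counter

SHARDS = 4


def range_shard(ts, boundaries):
    """Range partitioning: shard i owns keys below boundaries[i]; the last shard owns the rest."""
    for shard, upper in enumerate(boundaries):
        if ts < upper:
            return shard
    return len(boundaries)


def hashed_shard(ts):
    """Hash partitioning: the shard is chosen from a hash of the key, not its order."""
    digest = hashlib.md5(str(ts).encode()).digest()
    return int.from_bytes(digest[:8], "big") % SHARDS


boundaries = [1_000, 2_000, 3_000]  # made-up split points derived from older data
new_writes = range(3_000, 4_000)    # "current" timestamps are always newer than the splits

by_range = Counter(range_shard(ts, boundaries) for ts in new_writes)
by_hash = Counter(hashed_shard(ts) for ts in new_writes)
print("range:", dict(by_range))
print("hash: ", dict(by_hash))
```

The trade-off is that hashing the key gives up efficient range scans by time, which is usually what a time series collection is queried for.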

strategy for creating MongoDB short ids that scale

I want to have friendlier-looking ids (i.e. YouTube style: /posts/cxB6Ey6) than MongoDB's ObjectID.
I read that for scalability it's best to leave _id as an ObjectID, so I thought about two solutions:
1) add an indexed postid field to each document
2) create a mapping collection between _id and the postid
In both cases, use something like https://github.com/dylang/shortid to generate the short id, and while generating make sure that the id is unique by querying the database.
(Can this query-generate-insert be an atomic operation?)
Will those solutions have a noticeable impact on performance?
What's the best strategy for doing this?
The normal method of doing this is to base64 encode a unique id but:
add an indexed postid field to each document
You definitely want to go for this method. Of the two, I would say this method is easily the more scalable and performant: for one, it needs only one round trip to get a short URL's details, whereas the second option takes two. Another consideration is that you avoid the index overhead of maintaining an extra collection, so this is a bit of a no-brainer.
I would not replace the _id field within the document either since the default ObjectId could still be useful in the foreseeable future.
So this limits it down to a separate field and index (unique key) for the short code of a URL.
The next thing is that you don't want an ID which forces you to query the database for uniqueness prior to every insert. This is where the ObjectId shines: it can be generated within the client application and still be unique in the database, without having to query the database to check.
Unique ids that do not require querying the database first are normally time based. In PHP ( http://php.net/manual/en/function.uniqid.php ) and in the MongoDB Drivers ( http://docs.mongodb.org/manual/core/object-id/ ) and even the plug-in you linked on github ( https://github.com/dylang/shortid/blob/master/lib/shortid.js#L50 ) they all use time as a basis for being unique.
Considering that the plug-in you linked does not query the database to check its own IDs' uniqueness, I would say that it is probably quite performant, and if you use it with the first solution you stated you should get good benchmark results out of it.
If you want to replace the built-in ObjectID with custom user-friendly short ids, then do it. You can either use the built-in _id field or add a new unique-indexed field for your custom ids. The benefit of using built-in ObjectIDs is that they won't be duplicated even if your database is extremely large. So, by replacing them with short ids you take on the risk of id duplication.
Now about the performance. I think that the best solution is not to query the DB for ids, because with a properly chosen id length the probability of duplication is extremely small. So, the best way to handle id duplication in this model is to check Mongo's responses. If it responds with a "duplicate key error", generate a new id and retry.
And now about scaling. To scale your custom ids you can just add a few more symbols to them. A "duplicate key error" should be the trigger for making that change. Normally there should be no such errors, so if they start to appear then it's time to scale.
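How small "extremely small" is can be estimated with the birthday approximation: for n ids drawn uniformly at random from a space of N possible values, the chance of at least one duplicate is about 1 - exp(-n(n-1)/(2N)). The sketch below assumes uniformly random ids over a 64-character alphabet; both are assumptions for illustration, and a time-based generator such as the linked plug-in will behave differently.

```python
import math


def collision_probability(n_ids: int, alphabet_size: int, length: int) -> float:
    """Birthday approximation of the chance that at least two of n random ids coincide."""
    space = alphabet_size ** length
    # 1 - exp(-x), computed with expm1 so that tiny probabilities are not rounded to zero.
    return -math.expm1(-n_ids * (n_ids - 1) / (2 * space))


for length in (7, 8, 10):
    p = collision_probability(1_000_000, 64, length)
    print(f"length {length}: {p:.3g}")
```

Each extra character multiplies the id space by the alphabet size, which is why adding "a few more symbols" is an effective way to scale once duplicate key errors start to appear.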
I don't think that generating an ObjectId for the _id field directly affects scalability or performance. How would that happen?
The main difference is that ObjectIds are created for you by MongoDB, so you don't burden yourself with that responsibility. Otherwise you must determine the optimal size of the id yourself and ensure a unique value for the _id field of every document stored in the collection. This is required because _id is used as the primary key. It can be justified if your collection is not very big and you need a custom identifier value.
But an _id field that stores ObjectId values gives you additional benefits, such as the ability to create ObjectIds from a time and use this to your advantage in queries. You can also get the timestamp of an ObjectId’s creation with the getTimestamp() method, and sorting on _id in this case is roughly equivalent to sorting by creation time.
But if you're going to use ObjectIds in URLs or HTML, you may want to encrypt them for security reasons, to prevent leaking information such as the object's creation time. That may be a security risk.
About your solutions:
1) I think this is a very convenient and flexible solution. In this case you can put any value in postId, and it doesn't depend directly on _id.
A small disadvantage of this solution is that you need an extra field and an extra index, while _id is indexed automatically.
2) I don't think this is a good solution from the point of view of performance or the philosophy of the NoSQL approach.

Creating custom Object ID in MongoDB

I am creating a service for which I will use MongoDB as a storage backend.
The service will produce a hash of the user input and then see if that same hash (+ input) already exists in our dataset.
The hash will be unique yet random ( = non-incremental/sequential), so my question is:
Is it -legitimate- to use a random value for an Object ID? Example:
$object_id = new MongoId(HEX-OF-96BIT-HASH);
Or will MongoDB treat the ObjectID differently from other server-produced ones, since a "real" ObjectID also contains timestamps, machine_id, etc?
What are the pros and cons of using a 'random' value? I guess it would be statistically slower for the engine to update the index on inserts when the new _id's are not in any way incremental - am I correct on that?
Yes, it is perfectly fine to use a random value for an object id. If a value is already present in the _id field of a document being stored, it is used as that document's id as given.
Since the _id field is always indexed and is the primary key, you need to make sure that a different id is generated for each object.
There are some guidelines to optimize user defined object ids :
https://docs.mongodb.com/manual/core/document/#the-id-field.
While any values, including hashes, can be used for the _id field, I would recommend against using random values for two reasons:
You may need to develop a collision-management strategy in case you produce identical random values for two different objects. In the question, you imply that you'll generate IDs using some type of hash algorithm. I would not consider these values "random", as they are based on the content you are digesting with the hash. The probability of a collision is then a function of the diversity of the content and of the hash algorithm. If you are using something like MD5 or SHA-1, I wouldn't worry about the algorithm, just the content you are hashing. If you need to develop a collision-management strategy, then you definitely should not use random or hash-based IDs, as collision management in a clustered environment is complicated and requires additional queries.
Random values as well as hash values are purposefully meant to be dispersed on the number line. That (a) will require more of the B-tree index to be kept in memory at all times and (b) may cause variable insert performance due to B-tree rebalancing. MongoDB is optimized to handle ObjectIDs, which come in ascending order (with one second time granularity). You're likely better off sticking with them.
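The locality argument can be illustrated with a toy model. A sorted Python list stands in for the index (a real B-tree is more involved), and `rightmost_hits` is a hypothetical helper that counts how many inserts land within the last `page_size` slots, i.e. the part of the index that is likely to be in memory already.

```python
import bisect
import random


def rightmost_hits(keys, page_size=100):
    """Count inserts that land within the last `page_size` slots of a sorted index."""
    index = []
    hits = 0
    for key in keys:
        pos = bisect.bisect_left(index, key)
        if len(index) - pos < page_size:
            hits += 1  # this insert touched only the "hot" right-hand edge
        index.insert(pos, key)
    return hits


ascending = list(range(5_000))  # ObjectID-like: every key is larger than the last
shuffled = ascending[:]
random.Random(0).shuffle(shuffled)  # hash-like: keys arrive in no particular order

ascending_hits = rightmost_hits(ascending)
random_hits = rightmost_hits(shuffled)
print(ascending_hits, random_hits)
```

With ascending keys every insert touches the right-hand edge, while with scattered keys most inserts land somewhere in the middle, so more of the index has to stay in memory to keep inserts fast.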
I just found out an answer to one of my questions, regarding indexing performance:
If the _id's are in a somewhat well defined order, on inserts the entire b-tree for the _id index need not be loaded. BSON ObjectIds have this property.
Source: http://www.mongodb.org/display/DOCS/Optimizing+Object+IDs
Whether it is good or bad depends upon its uniqueness. Of course the ObjectId provided by MongoDB is reliably unique, so this is a good thing. So long as you can replicate that uniqueness you should be fine.
There are no inherent risks or performance losses in using your own ID. I guess using it in string form might use up more index/storage/querying power, but here you are using it in MongoId (ObjectId) form, which should preserve the strengths of not storing it as a simple string.
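For reference, the question's idea of deriving a 96-bit value from the input can be sketched as below. The question does not say which hash it uses, so SHA-256 truncated to 12 bytes is an assumption, and `hash_object_id_hex` is a hypothetical helper. The resulting 24-character hex string has the right shape to pass to a driver's ObjectId constructor, such as the MongoId class used in the question.

```python
import hashlib


def hash_object_id_hex(user_input: str) -> str:
    """Derive a 24-character hex string (12 bytes, 96 bits) from the input's hash."""
    digest = hashlib.sha256(user_input.encode("utf-8")).digest()
    return digest[:12].hex()  # keep the first 96 bits; the truncation is an assumption


first = hash_object_id_hex("some user input")
again = hash_object_id_hex("some user input")
other = hash_object_id_hex("different input")
print(first, first == again, first == other)
```

Note that an id built this way no longer carries a meaningful creation time: the first 4 bytes are hash output, so timestamp helpers on the ObjectId will return arbitrary dates.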