email as _id in a MongoDB user collection - mongodb

I have a user collection in a MongoDB. The _id is currently the standard MongoDB generated ObjectId. I also have a unique key constraint against a required 'email' field. This seems like a waste.
Is there any reason why I should not ditch the 'email' field and make that data the _id field?

I have read Neil's answer and I partially agree with it (also I am really skeptical about 'significant performance gains'). One thing I have not found in your question is 'what are you going to do with this email'. Are you going to search by it or it is just saved there? And one of the most important things which was not addressed in previous answer: is it going to be changed?
It is not uncommon that people who would use your system will be going to change their email (lost / is not used anymore). If you will put your _id as their email you will not be able to change it easily (you can not modify _id in mongo). You will need to copy, remove add new element in this case (which will not be atomic).
So I would put this as one big reason not to do so. But you need to decide whether you will allow people to change email addresses.

Generally speaking, no there is no real reason and in fact there are significant performance gains to be realized if you actually do use your "email" as a primary key.
Where most of your lookup's are actually on that primary key. Even creating a unique key for a different field, MongoDB is optimized so that "finding" the _id field index is a no-brainer. It's always there.
No additional space used for an index. So again where you are looking up your primary key there is not need to pull in anything other than the default index, as well as this naturally saving on disk space in addition to the I/O cost that would be incurred otherwise.
Perhaps the only real relevant consideration would be with sharding. And that would only be if your use case was better suited to some different form of "bucketed" distribution of "high/low" volume users for example. In that case some other form of Primary key would be required in order to facilitate that.
The default ObjectId type that generally occupies the _id field is great as it maintains a natural insertion order and also even makes it possible to do such things as general range based queries or even time based queries (within reason). So where there is a need for a natural insertion order it is generally be best choice and is highly collision safe.
But if you are generally looking for efficient lookup of Primary key values, then anything that serves as a natural primary key is ideally put in the _id field of the collection, as long as it is reasonably guaranteed to be unique.

Related

Mongo _id as string indexed key. Good or bad?

I'm developing an API that the only method to get a resource is providing a string key like my_resource.
It's a good practice to override _id (this make some mongodb drivers more easy to use) or its bad? What about in the long term?
Thank you
If there is a more natural primary key to use than an ObjectID (for example, a string value) feel free to use it.
The purpose of ObjectIDs is to allow distributed clients to quickly and independently generate unique identifiers according to a standard formula. The 12-byte ObjectID formula includes a 4-byte timestamp, 3-byte machine identifier, 2 byte process ID, and a 3-byte counter starting with a random value.
If you generate/assign your own unique identifiers (i.e. using strings), the potential performance consideration is that you won't if know this name is unique until you try to insert your document using that _id. In this case you would need to handle the duplicate key exception by retrying the insert with a new _id.
In my experience, overriding _id is not the best idea. Only if your data has a value field that is naturally unique and can easily be used to replace _id should _id be overridden. But it wouldn't make a whole lot of sense to override _id only to replace it with a contrived value.
I would recommend against it for a few reasons:
First of all, doing so requires an additional implementation to handle the inevitable instances when your "unique" values will conflict. And this will almost certainly arise in a database of any significant size. This can be a problem, since MongoDB can be unforgiving when it comes to overwriting values and generally handling conflicts. In other words, you're almost certain to overwrite values or meet unhandled exceptions unless you design your database structure very carefully from the beginning.
Second, and equally important: ObjectIDs naturally have an optimized insertion formula which allows for a very good creation of indexes. When a new document is inserted, that ObjectID is created to be mathematically as close as possible to the previous ObjectID, optimizing memory and indexing capabilities. It might be more trouble than it's worth to recreate this very handy item yourself.
And lastly, although this isn't as significant, overriding _id makes it that much harder to use the standard ObjectID methods.
Now, there is at least one positive that I can think of for overriding the ObjectID:
If there is an instance when _id will certainly never be used in your database, then it can save you a good amount of memory, as indexes are pretty costly in MongoDB.

Performance disadvantage using slug as primary key/_id in mongo?

Let's take for example a blog post where a unique slug is generated from the post's title: sample_blog_post. Instead of storing a mongo ObjectId as the _id, say you store the slug in the _id. Besides the obvious case where the slug may change if the title changes, are there major disadvantages in terms of performance by using a string instead of a numerical _id? This could become problematic if, say, the number of posts became very large, say, over a million. But if the number of posts was relatively low, say, 2000, would it make much of a difference? So far the only thing about the ObjectId that I think I'd take advantage of is the created_on date the comes for free.
So in summation, is it worth it to store the slug as the _id and not use an ObjectId? There seems to be discussion on how to store alternate values as an _id, but not the performance advantages/disadvantages to it.
So in summation, is it worth it to store the slug as the _id and not use an ObjectId?
In my opinion, no. The performance difference will be negligible for most scenarios (except paging), but
The old discussion of surrogate primary keys comes up. A "slug" is not a very natural key. Yes, it must be unique, but as you already pointed out, changing the slug shouldn't be impossible. This alone would keep me from bothering...
Having a monotonic _id key can save you from a number of headaches, most importantly to avoid expensive paging via skip and take (use $lt/$gt on the _id instead).
There's a limit on the maximum index length in mongodb of less than 1024 bytes. While not pretty, URLs are allowed to be a lot longer. If someone entered a longer slug, it wouldn't be found because it's silently dropped from the index.
It's a good idea to have a consistent interface, i.e. to use the same type of _id on all, or at least, most of your objects. In my code, I have a single exception where I'm using a special hash as id because the value can't change, the collection has extremely high write rates and it's large.
Let's say you want to link to the article in your management interface (not the public site), which link would you use? Normally the id, but now the id and the slug are equivalent. Now a simple bug (such as allowing an empty slug) would be hard to recover from, because the user couldn't even go to the management interface anymore.
You'll be dealing with charset issues. I'd suggest to not even use the slug for looking up the article, but the slug's hash.
Essentially, you'd end up with a schema like
{ "_id" : ObjectId("a237b45..."), // PK
"slug" : "mongodb-is-fun", // not indexed
"hash" : "5af87c62da34" } // indexed, unique

Looking up submissions by ID with MongoDB/Mongoose

So I'm used to looking up submissions by an auto increment primary ID with MySQL, but after using the MongoDB ORM wrapper, Mongoose, I'm finding that since Mongo stores data within collections differently, there is not really any concept of a traditional auto-increment ID.
I'm stuck trying to figure out how to grab a submission now because normally I'd structure my URL like so:
submission/34/category/slug-goes-here.
Since the 34 now becomes an ugly string based UUID with Mongo, I don't necessary want to display that in my URLs, but I want a unique URL in order to look up my submissions.
I'm thinking of maybe having a set method that when I insert the submission into my database, it generates some kind of 6 character hash e.g. zhXk40 and looks it up like that.
I'm wondering if I do it like this what the performance trade-offs would be. If I made constraints on the slug and then looked them up with the slug, and verified that the category matched, would that be more efficient? Either way I'm going to have to check if the category and slug match, but I'm not sure if an ID is even really necessary in this case.
What's the best practice for creating a route + looking up some piece of data from the db based on that route?
The first thing you should know is:
The _id property doesn't necessarily is that "ugly" ObjectId string.
Actually, the _id just need to be unique within its collection, so if you want to use auto incrementing IDs, there are no problems, however...
If you plan to use sharding within your database, then using auto incrementing field as the _id is overkill.
Why? Read the accepted answer here: Should I implement auto-incrementing in MongoDB?
In my application, as we're not going to shard it, then we use a indexed, numeric ID just for easier usability to the final user, and internally all references are ObjectIds.
Also, here's an good tuto for creating a auto incrementing field in MongoDB: http://docs.mongodb.org/manual/tutorial/create-an-auto-incrementing-field/
Sheesh! If we were able to truly know best practice here we'd all be better off. The talking heads are still talking and will be for some time.
How I would approach this is to go as pretty as I could. If I had a usable text string to make it semantic that'd be best case.
If I couldn't do that I'd go with the hash thing you suggested.
With both solutions the challenge will be to ensure that it remains unique. That means lookups before you save.
On performance, it's the same as SQL. Index what you use to look things up with. Mongo does good with composite indexes so category name and hash will look up pretty quick.

strategy for creating MongoDB short ids that scale

I want to have a friendlier facing ids (ie Youtube style: /posts/cxB6Ey6) than MongoDB's ObjectID.
I read that for scalability its best to leave _id as an ObjectID so I thought about two solutions:
1) add an indexed postid field to each document
2) create a mapping collection between _id and the postid
in both cases use something like https://github.com/dylang/shortid to generate the short id, and while generating make sure that the id is unique by querying the database.
(can this query-generate-insert be an atomic operation?)
will those solutions have a noticeable impact on performance ?
what's the best strategy for doing this ?
The normal method of doing this is to base64 encode a unique id but:
add an indexed postid field to each document
You definitely want to go for this method. Out of the two I would say this method is easily the most scalable and performant, for one it would only need one round trip to get a short URLs details where as the second option would take 2. Another consideration is the shortage of index overhead of maintaining an extra collection, this is a bit of a no-brainer.
I would not replace the _id field within the document either since the default ObjectId could still be useful in the foreseeable future.
So this limits it down to a separate field and index (unique key) for the short code of a URL.
The next thing is that you don't want an ID which forces you to query the database for uniqueness prior to every insert. This is where the ObjectId shines. The ObjectId is good at being made within the client application while being unique in the database without having to specifically query those assumptions.
Unique ids that do not require querying the database first are normally time based. In PHP ( http://php.net/manual/en/function.uniqid.php ) and in the MongoDB Drivers ( http://docs.mongodb.org/manual/core/object-id/ ) and even the plug-in you linked on github ( https://github.com/dylang/shortid/blob/master/lib/shortid.js#L50 ) they all use time as a basis for being unique.
Considering the plug-in you linked does not query the database to check its own IDs uniqueness I would say that this plug-in probably is quite performant and if you use it with the first solution you stated you should get a good benchmark out of it.
If you want to replace build-in ObjectID with custom user-friendly short id's then do it. You can either use build-in _id field or add a new unique-indexed field id for your custom ids. The benefit of using build-in ObjectID's is than they won't duplicate even if your database is extremely large. So, by replacing them with short id's you take the risk of id duplication.
Now about the performance. I think that the best solution is not to query DB for id's, because with properly adjusted ids length the probability of duplication is extremely small. So, the best way to handle ids duplication in this model is to check Mongo responses. If it responded with "duplicate key error" then you shall generate a new one.
And now about scaling. To scale your custom ids you can just add a few more symbols to it. "Duplicate key error" shall be a trigger for making that change. Normally there shall be no such errors. So, if they started to appear then its time to scale.
I don't think that generating ObjectId for _id field affect directly scalability or performance. Whereby this can be happen?
Main difference is that ObjectIds are created by MongoDB and you don't burden yourself with responsibility for this. Otherwise you must by yourself to determine optimal size of id and to ensure unique value for each _id field of documents stored in collection. It's required because _id used as primary key. This can be justified if you have not very big collection and custom value of identifier is need for you.
But you have such additional benefits with _id field that stores ObjectId values as opportunity to create object id's from time and use this fact to your advantage in queries. Also you can get timestamp of ObjectId’s creation with getTimestamp() method. And sorting on _id in this case is equivalent to sorting by creation time.
But if you're going to use ObjectId in URLs or HTML then for security concerns you can encrypt it. To prevent leakage of information and access to object's creation time. It may be security risk.
About your solutions:
1) I suppose this's very convenient and flexible solution. In this case you can specify any value in postId which doesn't depend directly on _id.
But little disadvantage of this solution is that you have to have extra field and to create extra index. While _id is automatically indexed.
2) I don't think this's good solution from the point of view of performance and philosophy of noSQL approach.

Mongodb: object id as short primary key within a collection

How to make better use of objectId generate by MongoDB. I am not an expert user, but so far i ended up creating seperate id for my object (userid, postid) etc because the object id is too long and makes the url ugly if use as the main ID. I keep the _id intact as it help indexing etc. I was wondering about any better strategy so that one can use mongo objectId as more url friendly and easy to remember key. I read the key was a combination of date etc, so any of the part can be used unique within a collection for this purpose.
thanks,
bsr/
If you have an existing ID (say from an existing data set), then it's perfectly OK to override _id with the one you have.
...keeo the _id intact as it help indexing etc
MongoDB indexes the _id field by default. If you start putting integers in the _id field, they will be indexed like everything else.
So most RDBMs provide an "auto-increment" ID. This is nice for small datasets, but really poor in terms of scalability. If you're trying to insert data to 20 servers at once, how do you keep the "auto-increment" intact?
The normal answer is that you don't. Instead, you end up using things like GUIDs for those IDs. In the case of MongoDB, the ObjectId is already provided.
I was wondering about any better strategy so that one can use mongo objectId as more url friendly and easy to remember key
So the problem here is that "easy to remember" ID doesn't really mesh with "highly scalable database". When you have a billion documents, the IDs are not really "easy to remember".
So you have to make the trade-off here. If you have a table that can get really big, I suggest using the ObjectId. If you have a table that's relatively small and doesn't get updated often, (like a "lookup" table) then you can build your own auto-increment.
The choice is really up to you.
You can overwrite the _id yourself. There is no obligation for using the auto-generated object id. What is the problem with overriding _id inside your app according to your own needs?