The autogenerated BSON ID that is stored in the _id field of every document, is it a GUID?
The documentation says its 'most likely unique', so I am a little confused. Why would they use an id that is not guaranteed to be unique?
Its uniqness is based upon probability. Unlike #mattexx answer:
It's not "guaranteed" to be unique because MongoDB does not enforce uniqueness to save time.
MongoDB DOES enforce uniqness on the ObjectId, it in fact has a unique index on the _id field. When talking about saving time, the ObjectId is historical in that manner since it was designed in the days when MongoDB did not ack any writes and needed a 99% chance of being able to insert a new unique record without the client waiting for an ack (ObjectIds are generated client side).
They are not GUIDs however they, as #Asya says, are guaranteed to have a high level of uniqness.
So long as time never moves backwards there is still a 99% chance it will be unique forever. Okay, as #Devesh says, there is a, 1 in 1 trillion (? haven't done the math), chance of even a GUID being duplicated but, again, I do not think you will reach that probability anytime soon.
It is unique in most of the requirement and it is consists of timestamp , unique identifier of the machine (hash of the machine host) , process Identifier and in last the increment number. http://docs.mongodb.org/manual/reference/object-id/
The chance of an ID collision is theoretically close enough to zero that it can be presumed for typical web apps. Many real-world systems (Mongo or not) do rely on this property of GUIDs, though it would not be a good assumption for safety/mission-critical systems.
In practical terms, there are indeed scenarios where it could go wrong if there's a misconfiguration or third-party library bug. Those shouldn't rule out the concept, but important to be aware of those risks and avoid them where possible.
Some good analysis here of practical issues that could cause a collision. In particular:
Some Mongo drivers use random numbers instead of incrementing numbers for the counter bytes. In these cases, there is a 1/16,777,216 chance of generating a non-unique ID, but only if those two IDs are generated in the same second (i.e. before the time section of the ID updates to the next second), on the same machine, in the same process.
ObjectId is explained in the doc here. It's not "guaranteed" to be unique because MongoDB does not enforce uniqueness to save time. It simply trusts that the complicated generation algorithm will probably never produce two identical ObjectIds in the same datastore. So technically it is not a GUID, but pretty much as good.
Related
Let's say I have a collection called Articles. If I were to insert a new document into that collection without providing a value for the _id field, MongoDB will generate one for me that is specific to the machine and the time of the operation (e.g. sdf4sd89fds78hj).
However, I do have the ability to pass a value for MongoDB to use as the value of the _id key (e.g. 1).
My question is, are there any advantages to using my own custom _ids, or is it best to just let Mongo do its thing? In what scenarios would I need to assign a custom _id?
Update
For anyone else that may find this. The general idea (as I understand it) is that there's nothing wrong with assigning your own _ids, but it forces you to maintain unique values within your application layer, which is a PITA, and requires an extra query before every insert to make sure you don't accidentally duplicate a value.
Sammaye provides an excellent answer here:
Is it bad to change _id type in MongoDB to integer?
Advantages with generating your own _ids:
You can make them more human-friendly, by assigning incrementing numbers: 1, 2, 3, ...
Or you can make them more human-friendly, using random strings: t3oSKd9q
(That doesn't take up too much space on screen, could be picked out from a list, and could potentially be copied manually if needed. However you do need to make it long enough to prevent collisions.)
If you use randomly generated strings they will have an approximately even sharding distribution, unlike the standard mongo ObjectIds, which tends to group records created around the same time onto the same shard. (Whether that is helpful or not really depends on your sharding strategy.)
Or you may like to generate your own custom _ids that will group related objects onto one shard, e.g. by owner, or geographical region, or a combination. (Again, whether that is desirable or not depends on how you intend to query the data, and/or how rapidly you are producing and storing it. You can also do this by specifying a shard key, rather than the _id itself. See the discussion below.)
Advantages to using ObjectIds:
ObjectIds are very good at avoiding collisions. If you generate your own _ids randomly or concurrently, then you need to manage the collision risk yourself.
ObjectIds contain their creation time within them. That can be a cheap and easy way to retain the creation date of a document, and to sort documents chronologically. (On the other hand, if you don't want to expose/leak the creation date of a document, then you must not expose its ObjectId!)
The nanoid module can help you to generate short random ids. They also provide a calculator which can help you choose a good id length, depending on how many documents/ids you are generating each hour.
Alternatively, I wrote mongoose-generate-unique-key for generating very short random ids (provided you are using the mongoose library).
Sharding strategies
Note: Sharding is only needed if you have a huge number of documents (or very heavy documents) that cannot be managed by one server. It takes quite a bit of effort to set up, so I would not recommend worrying about it until you are sure you actually need it.
I won't claim to be an expert on how best to shard data, but here are some situations we might consider:
An astronomical observatory or particle accelerator handles gigabytes of data per second. When an interesting event is detected, they may want to store a huge amount of data in only a few seconds. In this case, they probably want an even distribution of documents across the shards, so that each shard will be working equally hard to store the data, and no one shard will be overwhelmed.
You have a huge amount of data and you sometimes need to process all of it at once. In this case (but depending on the algorithm) an even distribution might again be desirable, so that all shards can work equally hard on processing their chunk of the data, before combining the results at the end. (Although in this scenario, we may be able to rely on MongoDB's balancer, rather than our shard key, for the even distribution. The balancer runs in the background after data has been stored. After collecting a lot of data, you may need to leave it to redistribute the chunks overnight.)
You have a social media app with a large amount of data, but this time many different users are making many light queries related mainly to their own data, or their specific friends or topics. In this case, it doesn't make sense to involve every shard whenever a user makes a little query. It might make sense to shard by userId (or by topic or by geographical region) so that all documents belonging to one user will be stored on one shard, and when that user makes a query, only one shard needs to do work. This should leave the other shards free to process queries for other users, so many users can be served at once.
Sharding documents by creation time (which the default ObjectIds will give you) might be desirable if you have lots of light queries looking at data for similar time periods. For example many different users querying different historical charts.
But it might not be so desirable if most of your users are querying only the most recent documents (a common situation on social media platforms) because that would mean one or two shards would be getting most of the work. Distributing by topic or perhaps by region might provide a flatter overall distribution, whilst also allowing related documents to clump together on a single shard.
You may like to read the official docs on this subject:
https://docs.mongodb.com/manual/sharding/#shard-key-strategy
https://docs.mongodb.com/manual/core/sharding-choose-a-shard-key/
I can think of one good reason to generate your own ID up front. That is for idempotency. For example so that it is possible to tell if something worked or not after a crash. This method works well when using re-try logic.
Let me explain. The reason people might consider re-try logic:
Inter-app communication can sometimes fail for different reasons, (especially in a microservice architecture). The app would be more resilient and self-healing by codifying the app to re-try and not give up right away. This rides over odd blips that might occur without the consumer ever being affected.
For example when dealing with mongo, a request is sent to the DB to store some object, the DB saves it, but just as it is trying to respond to the client to say everything worked fine, there is a network blip for whatever reason and the “OK” is never received. The app assumes it didn't work and so the app may end up re-trying the same data and storing it twice, or worse it just blows up.
Creating the ID up front is an easy, low overhead way to help deal with re-try logic. Of course one could think of other schemes too.
Although this sort of resiliency may be overkill in some types of projects, it really just depends.
I have used custom ids a couple of times and it was quite useful.
In particular I had a collection where I would store stats by date, so the _id was actually a date in a specific format. I did that mostly because I would always query by date. Keep in mind that using this approach can simplify your indexes as no extra index is needed, the basic cursor is sufficient.
Sometimes the ID is something more meaningful than a randomly generated one. For example, a user collection may use the email address as the _id instead. In my project I generate IDs that are much shorter than the ones Mongodb uses so that the ID shown in the URL is much shorter.
I'll use an example , i created a property management tool and it had multiple collections. For simplicity some fields would be duplicated for example the payment. And when i needed to update these record it had to happen simultaneously across all collections it appeared in so i would assign them a custom payment id so when the delete/query action is performed it changes all instances of it database wide
In my app I'm letting mongo generate order id's via its ObjectId method.
But in user testing we've had some concerns that the order id's are humanly 'intimidating', i.e. if you need to discuss your order with someone over the telephone, reading out 24 alphanumeric characters is a bit tedious.
At the same time, I don't really want to have to store two different id's, one 'human-accessible' and one used by mongo internally.
So my question is this - is there a way to choose a substring of length 6 or even 8 of the mongo objectId string that I could be fairly sure would be unique ?
For example if I have a mongo objectid like this
id = '4b28dcb61083ed3c809e0416'
maybe I could take out
human_id = id.substr(0,7);
and be sure that i'd always get unique id's for my orders...
The advantage of course is that these are orders, and so are human-created, and so there aren't millions of them per millisecond. On the other hand, it would really be a problem if two orders had the same shortened id...
--- clearer explanation ---
I guess a better way to ask my question would be this :
If I decide for example to just use the last 6 characters of a mongo id, is there some kind of measure of 'probability' that just these 6 characters would repeat in a given week ?
Given a certain number of mongo's running in parallel, a certain number of users during the week, etc.
If you have multiple web servers, with multiple processes, then there really isn't something you can remove with losing uniqueness.
If you look at the nature of the ObjectId:
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id, and
a 3-byte counter, starting with a random value.
You'll see there's not much there that you could safely remove. As the first 4 bytes are time, it would be challenging to implement an algorithm that removed portions of the time stamp in a clean and safe way.
The machine identifier and process identifier are used in cases where there are multiple servers and/or processes acting as clients to the database server. If you dropped either of those, you could end up with duplicates again. The random value as the last 3 bytes is used to make sure that two identifiers, on the same machine, within the same process are unique, even when requested frequently.
If you were using it as an order id, and you want assured uniqueness, I wouldn't trim anything away from the 12 byte number as it was carefully designed to provide a robust and efficient distributed mechanism for generating unique numbers when there are many connected database clients.
If you took the last 5 characters of the ObjectId ..., and in a given period, what's the probability of conflict?
process id
counter
The probability of conflict is high. The process id may remain the same through the entire period, and the other number is just an incrementing number that would repeat after 4095 orders. But, if the process recycles, then you also have the chance that there will be a conflict with older orders, etc. And if you're talking multiple database clients, the chances increase as well. I just wouldn't try to trim the number. It's not worth the unhappy customers trying to place orders.
Even the timestamp and the random seed value aren't sufficient when there are multiple database clients generating ObjectIds. As you start to look at the various pieces, especially in the context of a farm of database clients, you should see why the pieces are there, and why removing them could lead to a meltdown in ObjectId generation.
I'd suggest you implement an algorithm to create a unique number and store it in the database. It's simple enough to do. It does impact performance a bit, but it's safe.
I wrote this answer a while ago about the challenges of using an ObjectId in a Url. It includes a link to how to create a unique auto incrementing number using MongoDB.
Actually what you choose for and Id (actually _id in MongoDB storage) is totally up to you. If there is some useful data you can keep in _id as long as you keep it unique, then do so. If it has to be something valid to url encoding, then do so.
By default, if you do not specify an _id then that field will be populated with the value you have come to love and hate. But if you explicitly use it, then you will get what you want.
The extra thing to keep in mind is that even if you specify an addtional unique index field, let's say order_id then MongoDB would actually have to check through that and other indexes on a query plan to see which one was best to use. But if _id was your key, the plan would give up and go strait for the 'Primary Key', and this is going to be a lot faster.
So make your own Id just as long as you can ensure it will be unique.
I'm developing an API that the only method to get a resource is providing a string key like my_resource.
It's a good practice to override _id (this make some mongodb drivers more easy to use) or its bad? What about in the long term?
Thank you
If there is a more natural primary key to use than an ObjectID (for example, a string value) feel free to use it.
The purpose of ObjectIDs is to allow distributed clients to quickly and independently generate unique identifiers according to a standard formula. The 12-byte ObjectID formula includes a 4-byte timestamp, 3-byte machine identifier, 2 byte process ID, and a 3-byte counter starting with a random value.
If you generate/assign your own unique identifiers (i.e. using strings), the potential performance consideration is that you won't if know this name is unique until you try to insert your document using that _id. In this case you would need to handle the duplicate key exception by retrying the insert with a new _id.
In my experience, overriding _id is not the best idea. Only if your data has a value field that is naturally unique and can easily be used to replace _id should _id be overridden. But it wouldn't make a whole lot of sense to override _id only to replace it with a contrived value.
I would recommend against it for a few reasons:
First of all, doing so requires an additional implementation to handle the inevitable instances when your "unique" values will conflict. And this will almost certainly arise in a database of any significant size. This can be a problem, since MongoDB can be unforgiving when it comes to overwriting values and generally handling conflicts. In other words, you're almost certain to overwrite values or meet unhandled exceptions unless you design your database structure very carefully from the beginning.
Second, and equally important: ObjectIDs naturally have an optimized insertion formula which allows for a very good creation of indexes. When a new document is inserted, that ObjectID is created to be mathematically as close as possible to the previous ObjectID, optimizing memory and indexing capabilities. It might be more trouble than it's worth to recreate this very handy item yourself.
And lastly, although this isn't as significant, overriding _id makes it that much harder to use the standard ObjectID methods.
Now, there is at least one positive that I can think of for overriding the ObjectID:
If there is an instance when _id will certainly never be used in your database, then it can save you a good amount of memory, as indexes are pretty costly in MongoDB.
I'm using mongoDB for the first time in a RESTful service. Previously the id column in my SQL databases was an incrementing integer so my RESTful endpoints would look something like /rest/objectType/1. Is there any reason why I shouldn't just use mongoDB's ObjectId's in the same role, or is it wiser to maintain a separate incrementing integer id column and use this for urls?
Having used ObjectIds in RESTful APIs several times, the biggest downside is really that they are very noisy in terms of having a clean URL. You'll either leave it as a HEX number, or convert it to a very large integer number, both making for a somewhat unfriendly URL:
/rest/resource/52435dbecb970072ec3a780f
/rest/resource/25459211534898951476729247759
I've added a "title" to the URL (like StackOverflow does) to make them slightly more friendly:
/rest/resource/52435dbecb970072ec3a780f/FriendlyResourceName
Of course, the "title" is ignored in software, but the user sees it and can mentally ignore the crazy ID segment.
There's very little useful that could be learned from the infrastructure by exposing them:
Timestamp
Machine ID
Process ID
Random incrementing value
Other than potentially gathering Machine IDs (which generally would indicate the number of clients creating ObjectIds), there's not much there.
ObjectIds aren't random, so you couldn't use them for security. You'll always need to secure the data. While they may not increment in an obvious way, it would be easy to find other resources through brute force. However, if you were using auto-incrementing IDs before, this isn't a new problem for you.
If you know you aren't creating many new documents at any given time, it might be worth using one of the patterns here to create a simpler ID. In one app I wrote, I used an auto-inc technique for some of the document IDs that were shown in URLs, and for those that were Ajax-only, I used ObjectIds. I really wanted some URLs to be easily "typed". No form of an ObjectId is easily typed by an end user. That's one of the strengths of MongoDB -- that you can use any _id format you want. :)
It's wiser to use the ObjectIds, because keeping an incrementing counter can be a bottleneck. Also, since ObjectId contains a timestamp and is monotonic, they can be helpful in optimizing queries.
The ObjectIds can be guessed, but since that is definitely true for incrementing IDs, I suspect you didn't rely on security through obscurity before, so that's no trouble for you.
A downside, albeit a small one, is that the creation time on your server leaks to the user, i.e. if the user is able to identify this as an ObjectId, she can reverse-engineer the creation time of the object. That's the only potential issue I see.
I keep reading that using an ObjectId as the unique key makes sharding easier, but I haven't seen a relatively detailed explanation as to why that is. Could someone shed some light on this?
The reason I ask is that I want to use an english string (which will be unique obviously) as the unique key, but want to make sure that it won't tie my hands later on.
I've just recently been getting familiar with mongoDB myself so take this with a grain of salt but I suspect that sharding is probably more efficient when using ObjectId rather that your own key values because of the fact that part of the ObjectId will point out which machine or shard that the document was created on. The bottom of this page in the mongo docs explains what each portion of the ObjectId means.
I asked this question on Mongo user list and basically the reply was that it's OK to generate your own value of _id and it will not make sharding more difficult. For me sometimes it's necessary to have numeric values on _id like when I'm going to use them in url, so I'm generating my own _id in some collections.
ObjectId is designed to be globally unique. So, when used as a primary key and a new record is appended to the dataset without primary key value, then each shard can generate a new objectid and not worry about collisions with other shards. This somewhat simplifies life for everyone :)
Shard key does not have to be unique. We can't conclude that sharding a collection based on object id is always efficient .
Actually, ObjectID is probably a poor choice for a shard key.
From the docs (http://docs.mongodb.org/manual/core/sharded-cluster-internals/ the section on "Write Scaling"):
"[T]he most significant bits of [an ObjectID] represent a time stamp, which means that they increment in a regular and predictable pattern. [Therefore] all insert operations will be storing data into a single chunk, and therefore, a single shard. As a result, the write capacity of this shard will define the effective write capacity of the cluster."
In other words, because every OID sorts "bigger" than the one created immediately before it, an inserts that are keyed by OID will land on the same machine, and the write I/O capacity of that one machine will be the total I/O of your entire cluster. (This is true not just of OIDs, but any predictable key -- timestamps, autoincrementing numbers, etc.)
Contrariwise, if you chose a random string as your shard key, writes would tend to distribute evenly over the cluster, and your throughput would be the total I/O of the whole cluster.
(EDIT to be complete: with an OID shard key, as new records landed on the "rightmost" shard, the balancer would handle moving them elsewhere, so they would eventually end up on other machines. But that doesn't solve the I/O problem; it actually makes it worse.)